Socal Workshop 2009 @ UCLA
From linear sequences to abstract structures: Distributional information in infant-directed speech
Hao Wang & Toby Mintz
Department of Psychology
University of Southern California
This research was supported in part by a grant from the National Science Foundation (BCS-0721328).

Outline
• Introduction
  – Learning word categories (e.g., noun and verb) is a crucial part of language acquisition
  – The role of distributional information
  – Frequent frames (FFs)
• Analyses 1 & 2, structures of FFs in child-directed speech
• Conclusion and implication

Speakers’ Implicit Knowledge of Categories
Upon hearing: I saw him slich. The truff was in the bag.
Hypothesizing: They slich. He has two truffs. He sliches. She wants a truff. Johny was sliching. Some of the truffs are here.

Distributional Information
• The contexts in which a word occurs
  – Words before and after the target word
    • Example: the cat is on the mat
  – Affixes in morphologically rich languages
• Cartwright & Brent, 1997; Chemla et al., 2009; Maratsos & Chalkley, 1980; Mintz, 2002, 2003; Redington et al., 1998

Frequent frames (Mintz, 2003)
• Two words co-occurring frequently with one word intervening

Peter Corpus (Bloom, 1970)
FRAME        FREQ.
you__it      433
you__to      265
you__the     257
what__you    234
to__it       220
want__to     219
...
the__is      79
...

• Frame you__it
• 433 tokens, 93 types, 100% verbs
  put, see, do, did, want, fix, turned, get, got, turn, throw, closed, think, leave, ..., take, open

Accuracy Results Averaged Over All Six Corpora (Mintz, 2003)
[Figure: mean token accuracy (y-axis) by categorization type (x-axis): frame-based categorization vs. chance categorization]

Structure of Natural Languages
• In contemporary linguistics, sentences are analyzed as hierarchical structures
• Word categories are defined by their structural positions in the hierarchical structure
• But FFs are defined over linear sequences
• How can they accurately capture abstract structural regularities?

Why are FFs so good at categorizing words?
• Is there anything special about the structures associated with FFs?
• FFs are manifestations of hierarchically coherent and consistent patterns that largely constrain the possible word categories in the target position.

Analysis 1
• Corpora
  – Same six child-directed speech corpora from CHILDES (MacWhinney, 2000) as in Mintz (2003)
  – Labeled with dependency structures (Sagae et al., 2007)
  – Speech to children before the age of 2;6
Eve (Brown, 1973), Peter (Bloom, Hood, & Lightbown, 1974; Bloom, Lightbown, & Hood, 1975), Naomi (Sachs, 1983), Nina (Suppes, 1974), Anne (Theakston, Lieven, Pine, & Rowland, 2001), and Aran (Theakston et al., 2001).

Grammatical relations
• A dependency structure consists of grammatical relations (GRs) between words in a sentence
• Like a phrase structure, it is a representation of structural information.
Sagae et al., 2005

Method
• Consistency of structures of FFs
• Combination of GRs to represent structure
  – W1-W3, W1-W2, W2-W3, W1-W2-W3
• Measures
  – For each FF, percentage of tokens accounted for by the most frequent 4 GR patterns
• Control
  – Most frequent 45 unigrams (FUs)
  – E.g., the__
[Diagram labels: W1, W2, W3]

Results
Mean percentage of tokens accounted for by the most frequent 4 GR patterns
GR combination    FFs      FUs
W1-W3             0.92
W1-W2             0.91*    0.64
W2-W3             0.88
W1-W2-W3          0.85
* t(5)=26.97, p<.001

Top 4 W1-W3 GR patterns
Frequent frame    GR of W1*    GR of W3*    Token count
what__you         2 OBJ        2 SUBJ       287
                  4 OBJ        2 SUBJ       46
                  5 OBJ        2 SUBJ       20
                  3 POBJ       2 SUBJ       5
you__to           0 SUBJ       2 INF        260
                  0 SUBJ       0 JCT        26
                  -2 SUBJ      2 INF        1
                  0 SUBJ       0 INF        1
what__that        0 PRED       0 SUBJ       216
                  0 PRED       2 DET        14
                  3 OBJ        2 DET        4
                  2 OBJ        2 SUBJ       4
you__it           0 SUBJ       0 OBJ        195
                  0 SUBJ       2 SUBJ       6
                  -2 OBJ       0 OBJ        2
                  -2 OBJ       2 SUBJ       1
*The word position and head position for GRs in this table are positions relative to the target word of a frame. W1’s word position is always -1; W3’s is always 1.
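
As a rough illustration of the frame definition and the coverage measure used in Analysis 1, the Python sketch below counts W1__W3 frames in tokenized utterances and computes the share of a frame's tokens covered by its k most frequent GR patterns. The function names, the toy utterances, and the pattern counts are assumptions made for illustration only; they are not the authors' code or data.

```python
from collections import Counter


def extract_frames(utterances):
    """Count every W1__W3 frame: two words with exactly one word between them."""
    counts = Counter()
    for words in utterances:
        for i in range(len(words) - 2):
            counts[(words[i], words[i + 2])] += 1
    return counts


def top_k_coverage(pattern_counts, k=4):
    """Share of a frame's tokens covered by its k most frequent GR patterns."""
    total = sum(pattern_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in pattern_counts.most_common(k))
    return covered / total


# Hypothetical toy utterances (not taken from CHILDES).
utterances = [
    "you put it there".split(),
    "you see it".split(),
    "what do you want".split(),
    "you want to go".split(),
]
frame_counts = extract_frames(utterances)
print(frame_counts.most_common(3))

# Hypothetical GR-pattern counts for a single frame, keyed by
# (GR of W1, GR of W3). The numbers are invented for the example.
patterns = Counter({
    ("0 SUBJ", "0 OBJ"): 5,
    ("0 SUBJ", "2 SUBJ"): 2,
    ("-2 OBJ", "0 OBJ"): 1,
    ("-2 OBJ", "2 SUBJ"): 1,
    ("0 SUBJ", "2 INF"): 1,
})
coverage = top_k_coverage(patterns, k=4)
print(round(coverage, 2))
```

In a real analysis, only the most frequent frames would be kept, and the pattern counts would come from the dependency annotation rather than being typed in by hand.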
Analysis 1 Summary
• Frequent frames in child-directed speech select very consistent structures, which help categorize words accurately
• Analysis 2: internal organization of frequent frames

Analysis 2
• Same corpora as Analysis 1
• GRs between words in a frame and words outside that frame (external links) and GRs between two words within a frame (internal links)
• For each FF type, the number of links per token was computed for each word position
[Diagram legend: external links, internal links, not counted]

Links from/to W1
                          FFs     FUs
Internal links from W1    0.73    0.31
Links from W1 to W2       0.49    0.31
External links from W1    0.17    0.51
External links to W1      0.23    0.58

Conclusion & implications
• Frequent frames, which are simple linear relations between words, achieve accurate categorization by selecting structurally consistent and coherent environments.
• The third word (W3) helps FFs to focus on informative structures
• This relation between a linear order pattern and internal structures of languages may be a cue for children to bootstrap into syntax

Thank you!
• References
  – MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates.
  – Mintz, T. H. (2003). Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 90(1), 91-117.
  – Sagae, K., Lavie, A., & MacWhinney, B. (2005). Automatic measurement of syntactic development in child language. ACL Proceedings.
  – Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2007). High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the ACL-2007 Workshop on Cognitive Aspects of Computational Language Acquisition.

Pure frequent frames?

Ana. 2 mean token coverage
          Frequent frames                         Bigrams
Corpus    W1-W3    W1-W2    W2-W3    W1-W2-W3     W1-W2
Eve       0.96     0.94     0.92     0.89         0.69
Peter     0.87     0.87     0.85     0.80         0.57
Nina      0.94     0.93     0.91     0.89         0.68
Naomi     0.93     0.92     0.88     0.86         0.63
Anne      0.92     0.92     0.90     0.86         0.68
Aran      0.89     0.88     0.82     0.79         0.61

Ana. 2 FF external links
Table 3 Average number of links per token for frequent frames
                         External links
Corpus     Token count   to W1    to W2    to W3    from W1    from W2    from W3
Eve        3601          0.19     0.54     0.50     0.15       0.33       0.39
Peter      4541          0.28     0.71     0.44     0.25       0.30       0.52
Nina       6709          0.19     0.46     0.71     0.15       0.32       0.40
Naomi      1447          0.20     0.77     0.46     0.13       0.36       0.52
Anne       4435          0.24     0.50     0.54     0.18       0.32       0.43
Aran       5245          0.27     0.61     0.51     0.17       0.39       0.51
Average                  0.23     0.60     0.52     0.17       0.34       0.46

FF internal links
                         Internal links
Corpus     Token count   W1->W2   W1->W3   W2->W1   W2->W3   W3->W1   W3->W2
Eve        3601          0.52     0.25     0.10     0.28     0.10     0.29
Peter      4541          0.44     0.21     0.16     0.27     0.13     0.20
Nina       6709          0.48     0.29     0.09     0.37     0.07     0.23
Naomi      1447          0.60     0.17     0.13     0.21     0.07     0.24
Anne       4435          0.41     0.29     0.17     0.34     0.12     0.17
Aran       5245          0.50     0.20     0.16     0.24     0.10     0.21
Average                  0.49     0.24     0.14     0.29     0.10     0.22

Ana. 2 FU links
                         External links                        Internal links
Corpus     Token count   to W1    to W2    from W1    from W2  W1->W2   W2->W1
Eve        28076         0.52     0.51     0.51       0.62     0.32     0.18
Peter      35723         0.65     0.48     0.53       0.66     0.28     0.20
Nina       37055         0.66     0.58     0.49       0.64     0.32     0.15
Naomi      12409         0.59     0.50     0.51       0.63     0.30     0.19
Anne       38681         0.52     0.48     0.44       0.62     0.36     0.16
Aran       49302         0.52     0.55     0.54       0.60     0.30     0.22
Average                  0.58     0.52     0.51       0.63     0.31     0.19
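
The links-per-token measure behind the Analysis 2 tables can be sketched in Python as follows. This is only an illustration under stated assumptions: a link is assumed to run from a dependent to its head, links between two words outside the frame are ignored, and the function names and toy parses are hypothetical rather than taken from the authors' materials.

```python
from collections import Counter


def count_links(heads, start):
    """Count links for the frame at positions start, start+1, start+2.

    heads[i] is the index of word i's head, or None for the root.
    Assumption: a link runs from a dependent to its head.
    """
    names = {start: "W1", start + 1: "W2", start + 2: "W3"}
    internal = Counter()  # keys such as "W1->W2"
    external = Counter()  # keys such as "from W1" or "to W2"
    for dep, head in enumerate(heads):
        if head is None:
            continue
        dep_in, head_in = dep in names, head in names
        if dep_in and head_in:
            internal[f"{names[dep]}->{names[head]}"] += 1
        elif dep_in:
            external[f"from {names[dep]}"] += 1
        elif head_in:
            external[f"to {names[head]}"] += 1
        # Links between two words outside the frame are not counted.
    return internal, external


def links_per_token(frame_tokens):
    """Average each link type over all tokens of one frame type."""
    if not frame_tokens:
        return {}
    totals = Counter()
    for heads, start in frame_tokens:
        internal, external = count_links(heads, start)
        totals.update(internal)
        totals.update(external)
    return {key: value / len(frame_tokens) for key, value in totals.items()}


# Hypothetical toy parses for two tokens of the frame you__it.
# "you put it there": you->put, it->put, there->put; "put" is the root.
# "you see it":       you->see, it->see; "see" is the root.
tokens = [
    ([1, None, 1, 1], 0),
    ([1, None, 1], 0),
]
averages = links_per_token(tokens)
print(averages)
```

Under this dependent-to-head reading, each word has at most one outgoing link, so the "from" values for a position cannot sum to more than 1, while the "to" values can.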