An idiosyncratic (and very uncomprehensive) history of CL/NLP
Eduard Hovy
Information Sciences Institute, University of Southern California

What can't NLP do today?
• Do general-purpose text generation
• Deliver semantics, either in theory or in practice
• Deliver long/complex answers by extracting, merging, and summarizing web info
• Handle extended dialogues
• Use pragmatics (style, emotion, user profile…)
• Provide significant contributions to a theory of Language (in Linguistics or Neurolinguistics) or of Information
• etc.…

What can NLP do (robustly) today?
• Reliable surface-level preprocessing (POS tagging, word segmentation, NE extraction, etc.): 94%+
• Shallow syntactic parsing: 92%+ for English (Charniak, Collins, Lin), and deeper analysis (Hermjakob)
• IE: ~40% for well-behaved topics (MUC)
• Speech: ~80% large vocab; 20%+ open vocab, noisy input
• IR: 40% (TREC)
• MT: ~70%, depending on what you measure
• Summarization: ? (~60% for extracts; DUC)
• QA: ? (~60% for factoids; TREC)

Phase 1: getting started (1950–65)
The Grand Challenge: MT
• Warren Weaver memorandum, 1946
• MT demo: IBM/Georgetown U, 1954 (USAF)
• Journal Machine Translation, 1954 … later Computational Linguistics
• International MT conferences: 1954, 1956, 1958
  – at the 1958 conference, MT/NLP meets IR: Luhn auto-summarizes the papers of one session
• Very limited computer space/power: 7 minutes to parse a long sentence
• Tried both statistical and symbolic methods
• ALPAC report, 1964
IR: for the librarians
• Intense manual effort at index construction
• Cleverdon: the Cranfield aeronautics text evaluation experiments

Phase 2: trying for theory (1965–75)
• NLP
  – syntax: Transformational Grammar, then other approaches
  – lexicon efforts: polysemy, etc.
  – processing: rather ad hoc, then finite state automata (Woods et al.)
• IR
  – lots of work on indexing books and articles
  – start of vector spaces: Salton at Cornell
  – system construction: intense manual effort
• Speech
  – units: single words
  – system construction: intense manual effort at modeling the articulatory channel
• Early pre-computational attempts at semantics: Masterman, Ceccato

Phase 3: higher ambitions! (1975–85)
• NLP
  – formal and informal semantics: Situation Semantics (Barwise and Perry ~77), DRT (Kamp 80); Frames (Minsky 75), Semantic Nets (Bobrow and Collins 75), Conceptual Dependency, etc. (Schank 77–85; Jackendoff 80; Sowa 80s)…
  – processing: ATNs (e.g., LUNAR, Woods 78)
• AI
  – SHRDLU (Winograd 73) and TALE-SPIN (Meehan 75)
• IR
  – vector spaces firmly established
  – system construction: automatic, with some tuning
• Speech
  – triumphant introduction of learning methods: HMMs at CMU (Baker)
  – system construction: some learning, and tuning
  – units: phrases

Phase 4: two methodologies (1985–95): theory-driven and experiment-driven
• NLP: theoretical side
  – logical form and well-formed formulas
  – formal grammars: HPSG, GPSG, and all the other PSGs
  – processing: unification as the Great Answer (Shieber 86)
• MT
  – statistical MT (Brown et al. 90s); the Statistics Wars
• NLP: practical side
  – IE (MUC competitions)
  – preprocessing, alignment, etc. tools (Church, Brill, etc.)
  – Penn Treebank and WordNet
• IR
  – TREC competitions (1990–); various tracks
  – moving to the web
• Speech
  – system construction: learning HMMs (bi-, trigrams)
  – simple dialogues (ATIS)
  – DARPA evaluations and systems

Phase 5: statistics 'wins' (1995–05)
• NLP
  – machine learning of (almost) everything; statistics-based parsing (Collins, Charniak, Hermjakob)
  – large networks, centers, and corpora of all kinds (ELSNET, Penn Framebank, etc.); LREC, EMNLP, and Very Large Corpora conferences
  – shallow semantics: WordNet 1.6 (Miller, Fellbaum) and the other Nets
  – practical applications: summarization
• IR
  – mathematical formulation of theories in vector spaces and language models
  – ever larger scope: web, cross-language IR, rapid classification…
  – QA
• MT
  – statistical MT tools (Knight et al.) and automated MT evaluation (Papineni et al.)
• Speech
  – mathematical formulation of theories and machine learning of more than just HMMs
  – dialogue: adding some context; linking with NLG and synthesis (Verbmobil, DARPA Communicator projects)
  – toward unlimited vocabulary and noisy backgrounds

A credo for our times…

Let me finish this part of the survey by warning in general against overestimating the impact of statistical information on the problem of MT and related questions. I believe that this overestimation is a remnant of the time, seven or eight years ago, when many people thought that the statistical theory of communication would solve many, if not all, of the problems of communication. Though it is often possible by a proper organization of the research effort to get a certain amount of statistical information at no great extra cost, it is my impression that much valuable time of MT workers has been spent on trying to obtain statistical information whose impact on MT is by no means evident. It is not true that every statistic on linguistic matter is automatically of importance for MT so that the gathering of such statistics could be regarded as an integral part of MT research without any need for additional justification.

Gathering of statistics is regarded by many MT groups as being part of a more general methodological approach -- the so-called "empirical approach" [15]. This term has already caused a lot of confusion. I am using it here in the sense in which it is employed by the RAND group [16]. This sense should become obvious from the following discussion. Adherents of this approach are distrustful of existing grammar books and dictionaries, and regard it as necessary to establish from scratch the grammatical rules by which the source-language text will be machine analyzed, through a human analysis of a large enough corpus of source-language material, constantly improving upon the formulation of these rules by constantly enlarging this corpus. With regard to dictionaries, a similar approach is often implemented and a dictionary compiled from translations performed by bilingual members of the group or by other human translators considered to be qualified by this group. This approach seems to me somewhat wasteful in practice and not sufficiently justified in theory.
The underlying distrust seems to have been caused by the well-known fact that most existing grammars are of the normative type, hence often of no great help in the analysis of actual writing (and to an even higher degree, of actual speech), and that existing dictionaries are of such a nature that quite often none of the presented target-language counterparts of a source-language word are satisfactory within certain contexts, especially with regard to terms used in recently developed scientific fields. However, even in view of these facts, I believe that the baby has far too often been thrown away with the bathwater. No justification has been given for the implicit belief of the "empiricists" that a grammar satisfactory for MT purposes will be compiled any quicker or more reliably by starting from scratch and "deriving" the rules of grammar from an analysis of a large corpus than by starting from some authoritative grammar and changing it, if necessary, in accordance with analysis of several texts. The same holds mutatis mutandis with regard to the compilation of dictionaries. But grammars have in general not wholly been dreamt up, nor have dictionaries been compiled by some random process. Existing grammars and dictionaries are already based, though admittedly not wholly, upon actual texts of incomparably larger extensions than those that serve as a basis for the new compilers. Russian is not Kwakiutl, and with all due regard to the methods and techniques of structural linguistics and to the insights which this science has given us in respect of some deficiencies of traditional grammars, I do not think that it follows from its teachings that all existing codifications of languages with a highly developed literature should be totally disregarded.

Let me add, without going here into details for lack of space, that the empiricalness of the derivations of grammar rules from actual texts is rather doubtful as such. For certain general methodological considerations one might well be led to the conclusion that these rules incorporate a lot of subjective and highly biased and untested assumptions such that their degree of validity might very well, on the average, be lower than that of the well-established, often-tested and critically examined grammars, in spite of their normativity.

A Critique of the Overestimation of Statistics and the "Empirical" Approach
Y. Bar-Hillel, 1960

So what have we learned?
Language processing is engineering:
• language is too complex for humans to make rules, so systems must learn them
• thus we need a rapid (automated?) build-evaluate-build cycle
Language processing is notation transformation:
  – English sentence to Spanish sentence
  – English string to parse tree
  – case frame to English string
  – sound waves to text string
  – long text to short text
…with (often) some information added:
  – POS, syntactic, semantic, and other labels
  – associated documents

Where next?
By creating smaller transformation (sub)steps, we can learn better, and branch out to more apps:
  – define a new cutlevel (with notation!), adding info that isn't in the text
  – list many X→Y pairs as training data
  – learn the transformation rules, using EM, ME, etc. to learn the best 'rules' (a small sketch follows below)
Major bottlenecks:
  – Diffuse phenomena require very large training sets: algorithms and speed issues
  – Shallow semantics
  – Discourse and dialogue cutlevels
  – Pragmatics and hearer/user models
  – Information Theory
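As a concrete illustration of the "list many X→Y pairs, then learn the transformation rules" recipe above, here is a minimal sketch of one build-evaluate-build cycle for a single tiny transformation step (word → POS tag). It is not from the talk: the toy corpus, the most-frequent-tag rule, and the function names are illustrative assumptions standing in for the EM- or maximum-entropy-style learners a real system would run over far larger training sets.

```python
from collections import Counter, defaultdict

# Hypothetical X -> Y training pairs for one small transformation step
# (word -> POS tag); a real cutlevel would use a large annotated corpus.
TRAIN = [("the", "DET"), ("dog", "N"), ("barks", "V"), ("the", "DET"),
         ("cat", "N"), ("sleeps", "V"), ("a", "DET"), ("dog", "N")]
HELDOUT = [("the", "DET"), ("cat", "N"), ("barks", "V"), ("a", "DET")]

def learn_rules(pairs):
    """Learn a trivial transformation rule: map each X to its most frequent Y."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {x: ys.most_common(1)[0][0] for x, ys in counts.items()}

def evaluate(rules, pairs, default="N"):
    """Score the learned rules on held-out X -> Y pairs."""
    correct = sum(1 for x, y in pairs if rules.get(x, default) == y)
    return correct / len(pairs)

# One turn of the build-evaluate-build cycle: learn, measure, add data, relearn.
rules = learn_rules(TRAIN)
print("held-out accuracy:", evaluate(rules, HELDOUT))
rules = learn_rules(TRAIN + HELDOUT)
```

The same pattern (fix the notation, collect X→Y pairs, learn, evaluate, enlarge the data, repeat) scales from this toy step up to the parsing, MT, and summarization transformations discussed above.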
Performance ceilings of NLP
• How to improve IR accuracy? TREC 98–00: around 40%. Query analysis; query term expansion by meaning.
• How to achieve conceptual summarization? Never been done yet at a non-toy level. Topic interpretation and fusion via meaning.
• How to improve QA? TREC 01: up to 60%, for factoids only. Question and answer analysis; matching of meanings.
• How to improve MT quality? MTEval 94: ~70%, depending on what you measure. Word-sense disambiguation, to find the correct meaning.

Semantics for summarization
• FRUMP (DeJong 1978): an event that shakes the world: Pope John Paul I dies, September 28, 1978

Input story:
"Pope John Paul I was bishop of Rome for only 34 days in August-September 1978. In that short span he issued no major statements contributing to the development of the tradition of Catholic social teaching. He did manage, however, to catch the world's imagination with his warm smile, and had he lived long enough, he undoubtedly would have placed his own mark on the church's social teaching. In tribute to this "smiling Pope," we present some quotes from his talks, passages which affirm his interest in a world of justice and peace. Albino Luciani was born on October 17, 1912, in what is now Canale d'Agordo, in the diocese of Belluno in the Dolomite Alps of Northern Italy. His father, a confirmed socialist who did not oppose his son's vocation to the priesthood, was a migrant worker and then a glassblower in Venice. The family, like those of Pius X and John XXIII, was always poor, and Albino knew from personal experience the hardships suffered by the modern urban proletariat. He was the first pope chosen from their ranks."

FRUMP's summary: "There was an earthquake in Italy. One person died."

Semantics for QA: Webclopedia (Hovy et al. 01, 02)
• Where do lobsters like to live? → "on the table"
• How many people live in Chile? → "nine"
• Where are zebras most likely found? → "in the dictionary"
• What is an invertebrate? → "Dukakis"
…need semantic and numerical world knowledge

The web: a giant source for semantics
• Jan 02: over 195 billion words (2 billion pages) on the web (66% English, 12% German, 5% Japanese, the rest in other languages)
• Need IR, MT, Summarization, QA
• Need an/some ontologies

Web text per language, in millions of words (sampled counts, then exponential-growth projections):

Language        Oct-96     Aug-99      Feb-00      Dec-01       Dec-03       Dec-05
English         6,082.09   28,222.10   48,064.10   128,043.57   419,269.14   1,375,098.05
German          228.94     1,994.23    3,333.13    13,435.07    65,161.79    316,727.36
Japanese        228.94     1,994.23    3,333.13    9,375.41     40,070.32    171,600.89
French          223.32     1,529.80    2,732.22    9,375.41     40,070.32    171,600.89
Spanish         104.32     1,125.65    1,894.97    8,786.78     48,968.42    273,542.30
Chinese         123.56     817.27      1,338.35    8,786.78     48,968.42    273,542.30
Korean          123.56     817.27      1,338.35    4,507.93     18,206.81    73,675.11
Italian         123.56     817.27      1,338.35    4,507.93     18,206.81    73,675.11
Portuguese      106.17     589.39      1,161.90    3,455.98     13,438.26    52,350.71
Norwegian       106.50     669.33      947.49      3,109.04     11,474.59    42,425.27
Finnish         20.65      107.26      166.60      480.19       1,628.87     5,534.62
Non-English     1,389.49   10,461.70   17,584.48   65,820.52    306,194.61   1,454,674.58
Non-English %   18.60%     27.04%      26.79%      33.95%       42.21%       51.41%

(Oct-96 through Feb-00 are sampled; Dec-01 through Dec-05 assume exponential growth. From Grefenstette 99, with additions by Oard and Hovy.)

A start: Ankle-deep semantics
• 'Thin' ontology taxonomies (or CYC, if it works):
  – WordNet (Miller et al.), SENSUS, etc.
  – profligate termsets, little inference
• Rich experiential knowledge (Hobbs; Fleischman & Hovy):
  – all commonly known people, places, events, objects…
• Some interconnections: in addition to is-a, we need:
  – topical relations: topic signatures (Lin & Hovy)
  – scriptal relations: scripts (Schank & Abelson)
  – meronymical (part-of), functional, and material hierarchies (Guarino)
• Richer XML-based annotation standards (Semantic Web?)

Toward a deeper theory of information
• Shannon's work is 'relativistic' to the rater: it is based on guessing the next symbol, and therefore implicitly conditioned on the reader's knowledge; with more knowledge of the language, the domain, etc., the information measure changes
• We must aim for a theory of information that is more 'absolute':
  – defined in terms of knowledge symbols (ontology) and relations (and can be related to the reader's knowledge?)
  – focused on transformations and decomposition of transformations

Info Theory for evaluating summaries
Evaluating summaries:
• Shannon Game: quantify information content… guess the next letter/word.
• Question Game: test the reader's understanding… ask questions.
• Classification Game: compare classifiability… sort texts/summaries into buckets.
Compression Ratio: CR = (length S) / (length T)
Retention Ratio: RR = (info in S) / (info in T)
• Case 1: just adding info; no special leverage from the summary.
• Case 2: 'fuser' concept(s) at the knee add a lot of information.
• Case 3: 'fuser' concepts become weaker.
[Figure: three plots of Retention Ratio (RR) against Compression Ratio (CR), one per case.]
A small illustrative sketch of CR and RR follows below.
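To make the CR and RR definitions concrete, and to illustrate the point above that Shannon-style information is relative to the rater's model, here is a minimal sketch (not from the talk). It measures the "info in" a text as total surprisal under a simple smoothed unigram word model; the example texts, the two toy "reader" models, and all names are illustrative assumptions, and a better-informed model yields a different information estimate for the same text.

```python
import math
from collections import Counter

def surprisal_bits(text, model_counts, alpha=1.0):
    """Total Shannon information of `text`, in bits, under an add-alpha-smoothed
    unigram word model estimated from `model_counts`. The value depends on the
    model: a reader who knows more assigns higher probabilities and so measures
    less information in the same text."""
    vocab = len(model_counts) + 1                    # +1 slot for unseen words
    total = sum(model_counts.values())
    bits = 0.0
    for w in text.split():
        p = (model_counts.get(w, 0) + alpha) / (total + alpha * vocab)
        bits += -math.log2(p)
    return bits

# Illustrative placeholder texts (not the talk's examples).
full_text = "the pope died in rome the pope was bishop of rome for a month"
summary   = "the pope died"

# Two hypothetical 'readers': one who has seen almost no text, and one who
# already knows the domain text.
naive_reader    = Counter("a b c".split())
informed_reader = Counter(full_text.split())

cr = len(summary.split()) / len(full_text.split())  # CR = (length S) / (length T), in words
for name, model in [("naive", naive_reader), ("informed", informed_reader)]:
    info_T = surprisal_bits(full_text, model)
    info_S = surprisal_bits(summary, model)
    print(f"{name:8s} reader: info(T) = {info_T:5.1f} bits, "
          f"CR = {cr:.2f}, RR = {info_S / info_T:.2f}")
```

Running this prints a different info(T), and a slightly different RR, for the two readers; that dependence on the reader's model is the 'relativity' noted above, whereas a more 'absolute' measure would instead count the ontology symbols and relations conveyed.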
NLP in the world
• There are between 10,000 and 15,000 NLP practitioners in the world:
  – ISCA: 3000 members?
  – ACL: 2000 members
  – SIGIR: 1000 members
  – IAMT: 400 members
• There are over 20 conference series: ICSLP, ACL (+ NAACL-HLT, EACL), COLING, LREC, SIGIR, EMNLP, MT Summit (+ AMTA, EAMT, AAMT), RANLP, PACLING, INLG, ROCLING, TMI, CICLing… plus numerous workshop series

My predictions
• Continued drift toward a bimodal world: a few large centers (that do everything) and many tiny research efforts, but little in between
  – Engineering projects need continuity and wide resources
• Use of statistical methods to learn semantics-oriented transformations
  – Strong need for shallow and deeper semantic resources: Framebanks, pragmatic annotations, etc.
  – Repeat the history of (for example) MT development
• Growth of information delivery platforms
  – Integration of NLP applications (IR, MT, QA, Summ, ASR, etc.) on top of IE and WSD, linked to multimedia interfaces
  – NLP disappears into the background