Transcript Document

An idiosyncratic
(and very uncomprehensive)
history of CL/NLP
Eduard Hovy
Information Sciences Institute
University of Southern California
What can’t NLP do today?
• Do general-purpose text generation
• Deliver semantics—either in theory or in practice
• Deliver long/complex answers by extracting,
merging, and summarizing web info
• Handle extended dialogues
• Use pragmatics (style, emotion, user profile…)
• Provide significant contributions to a theory of
Language (in Linguistics or Neurolinguistics) or
of Information
• etc.…
What can NLP do (robustly) today?
• Reliable surface-level preprocessing (POS tagging, word segmentation, NE extraction, etc.): 94%+
• Shallow syntactic parsing: 92%+ for English (Charniak, Collins, Lin) and deeper analysis (Hermjakob)
• IE: ~40% for well-behaved topics (MUC)
• Speech: ~80% large vocab; 20%+ open vocab, noisy input
• IR: 40% (TREC)
• MT: ~70% depending on what you measure
• Summarization: ? (~60% for extracts; DUC)
• QA: ? (~60% for factoids; TREC)
Phase 1: getting started (1950–65)
The Grand Challenge: MT
• Warren Weaver memorandum, 1949
• MT demo: IBM/Georgetown U, 1954 (USAF)
• Journal Mechanical Translation, 1954 … later Computational Linguistics
• International MT conferences: 1954, 1956, 1958
– at the 1958 conference: MT/NLP → IR
– Luhn auto-summarizes the papers of one session
• Very limited computer space/power: 7 minutes to parse a long sentence
• Tried both statistical and symbolic methods
• ALPAC report, 1966
IR: for the librarians
• Intense manual effort at index construction
• Cleverdon: the Cranfield aeronautics text evaluation experiments
Phase 2: trying for theory (1965–75)
• NLP
– syntax: Transformational Grammar, then other approaches
– lexicon efforts: polysemy, etc.
– processing: rather ad hoc, then finite state automata (Woods et al.)
• IR
– lots of work on indexing books and articles
– start of vector spaces: Salton at Cornell
– system construction: intense manual effort
• Speech
– units: single words
– system construction: intense manual effort at modeling articulatory
channel
• Early pre-computational attempts at semantics: Masterman,
Ceccato
Phase 3: higher ambitions! (1975–85)
• NLP
– formal and informal semantics: Situation Semantics (Barwise and
Perry ~77), DRT (Kamp 80); Frames (Minsky 75), Semantic Nets
(Bobrow and Collins 75), Conceptual Dependency etc. (Schank 77–
85; Jackendoff 80; Sowa 80s)…
– processing: ATNs (e.g., LUNAR, Woods 78)
• AI
– SHRDLU (Winograd 73) and TALE-SPIN (Meehan 75)
• IR
– vector spaces firmly established
– system construction: automatic, with some tuning
• Speech
– triumphant introduction of learning methods: HMMs at CMU (Baker)
– system construction: some learning, and tuning
– units: phrases
Phase 4: two methodologies (1985–95)
Theory-driven:
• NLP: theoretical side
– logical form and well-formed formulas
– formal grammars: HPSG, GPSG, and all the other PSGs
– processing: unification as the Great Answer (Shieber 86)
• MT
– statistical MT (Brown et al. 90s); the Statistics Wars
Experiment-driven:
• NLP: practical side
– IE (MUC competitions)
– preprocessing, alignment, etc. tools (Church, Brill, etc.)
– Penn Treebank and WordNet
• IR
– TREC competitions (1990–); various tracks
– moving to the web
• Speech
– system construction: learning HMMs (bi-, trigrams)
– simple dialogues (ATIS)
– DARPA evaluations and systems
Phase 5: statistics ‘wins’ (1995–05)
• NLP
– machine learning of (almost) everything; statistics-based parsing
(Collins, Charniak, Hermjakob)
– large networks, centers, and corpora of all kinds (ELSNET, Penn Framebank, etc.);
LREC, EMNLP, and Very Large Corpora conferences
– shallow semantics: WordNet 1.6 (Miller, Fellbaum) and the other Nets
– practical applications: summarization
• IR
– mathematical formulation of theories in vector spaces and language models
– ever larger scope: web, cross-language IR, rapid classification…
– QA
• MT
– statistical MT tools (Knight et al.) and automated MT evaluation (Papineni et al.)
• Speech
– mathematical formulation of theories and machine learning of more than just HMMs
– dialogue: adding some context; linking with NLG and synthesis (Verbmobil,
DARPA Communicator projects)
– toward unlimited vocabulary and noisy backgrounds
A credo for our times…
Let me finish this part of the survey by warning in general against overestimating the impact of statistical information on the problem of MT and related
questions.
I believe that this overestimation is a remnant of the time, seven or eight years ago, when many people thought that the statistical theory of
communication would solve many, if not all, of the problems of communication. Though it is often possible by a proper organization of the research effort
to get a certain amount of statistical information at no great extra cost, it is my impression that much valuable time of MT workers has been spent on
trying to obtain statistical information whose impact on MT is by no means evident. It is not true that every statistic on linguistic matters is automatically of
importance for MT so that the gathering of such statistics could be regarded as an integral part of MT research without any need for additional
justification.
Gathering of statistics is regarded by many MT groups as being part of a more general methodological approach -- the so-called "empirical approach"
[15]. This term has already caused a lot of confusion. I am using it here in the sense in which it is employed by the RAND group [16]. This sense should
become obvious from the following discussion. Adherents of this approach are distrustful of existing grammar books and dictionaries, and regard it as
necessary to establish from scratch the grammatical rules by which the source-language text will be machine analyzed, through a human analysis of a
large enough corpus of source-language material, constantly improving upon the formulation of these rules by constantly enlarging this corpus. With
regard to dictionaries, a similar approach is often implemented and a dictionary compiled from translations performed by bilingual members of the group
or by other human translators considered to be qualified by this group. This approach seems to me somewhat wasteful in practice and not sufficiently
justified in theory. The underlying distrust seems to have been caused by the well-known fact that most existing grammars are of the normative type,
hence often of no great help in the analysis of actual writing (and to an even higher degree, of actual speech), and that existing dictionaries are of such a
nature that quite often none of the presented target-language counterparts of a source-language word are satisfactory within certain contexts, especially
with regard to terms used in recently developed scientific fields. However, even in view of these facts, I believe that the baby has far too often been
thrown away with the bathwater. No justification has been given for the implicit belief of the "empiricists" that a grammar satisfactory for MT purposes
will be compiled any quicker or more reliably by starting from scratch and "deriving" the rules of grammar from an analysis of a large corpus than by
starting from some authoritative grammar and changing it, if necessary, in accordance with analysis of several texts. The same holds mutatis mutandis
with regard to the compilation of dictionaries. But grammars have in general not wholly been dreamt up, nor have dictionaries been compiled by some
random process. Existing grammars and dictionaries already based, though admittedly not wholly, upon actual texts of incomparably larger extensions
than those that serve as a basis for the new compilers. Russian is not Kwakiutl, and with all due regard to the methods and techniques of structural
linguistics and to the insights which this science has given us in respect some deficiencies of traditional grammars, I do not think that it follows from its
teachings that all existing codifications of languages with a highly developed literature should be totally disregarded. Let me add, without going here into
details for lack of space, that the empiricalness of the derivations of grammar rules from actual texts is rather doubtful as such. For certain general
methodological considerations one might well be led to the conclusion that these rules incorporate a lot of subjective and highly biased and untested
assumptions such that their degree of validity might very well, on the average, be lower than that of the well-established, often-tested and critically
examined grammars, in spite of their normativity.
A Critique of the Overestimation of
Statistics and the “Empirical” Approach
Y. Bar-Hillel, 1960
So what have we learned?
Language processing is engineering:
• language is too complex for humans to make rules, so
systems must learn them
• thus need a rapid (automated?) build-evaluate-build cycle
Language processing is notation transformation:
– English sentence to Spanish sentence
– English string to parse tree
– case frame to English string
– sound waves to text string
– long text to short text
…with (often) some information added:
– POS, syntactic, semantic, and other labels
– associated documents
(one such step is sketched below)
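To make the notation-transformation view concrete, here is a minimal sketch of one such step: a raw English string is transformed into a token sequence with POS labels added. It assumes NLTK with its 'punkt' and 'averaged_perceptron_tagger' data packages installed, and is only an illustration, not any particular system mentioned in these slides.

```python
# A minimal sketch of one notation-transformation step: string -> labelled tokens.
# Assumes NLTK plus its 'punkt' and 'averaged_perceptron_tagger' data packages.
import nltk

def add_pos_labels(sentence):
    """Same content, new notation: a raw string becomes (token, POS-tag) pairs."""
    tokens = nltk.word_tokenize(sentence)   # string -> token list
    return nltk.pos_tag(tokens)             # token list -> POS-labelled token list

print(add_pos_labels("Language processing is notation transformation."))
# e.g. [('Language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ...]
```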
Where next?
• By creating smaller transformation (sub)steps, we can learn better, and branch out to more apps:
– define a new cutlevel (with notation!), adding info that isn’t in the text,
– list many X→Y pairs as training data,
– learn the transformation rules, using EM, ME, etc. to learn the best ‘rules’ (a toy sketch follows below).
• Major bottlenecks:
– Diffuse phenomena require very large training sets: algorithms and speed issues
– New cutlevels needed: shallow semantics; discourse and dialogue; pragmatics and hearer/user models
– Information Theory
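As one illustration of the "list X→Y pairs, then learn the rules" recipe, here is a toy EM sketch in the spirit of IBM Model 1 word alignment. The function name and the tiny data set are invented for this example; this is not the statistical MT tooling cited elsewhere in these slides.

```python
# Toy EM learner for word-level "transformation rules" from X -> Y pairs,
# in the spirit of IBM Model 1 (illustrative only; names and data invented).
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens); returns t[(f, e)] = P(f | e)."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))   # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)                  # expected counts c(f, e)
        total = defaultdict(float)                  # expected counts c(e)
        for src, tgt in pairs:
            for f in tgt:                           # E-step: soft-align each target word
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / z
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():             # M-step: re-estimate P(f | e)
            t[(f, e)] = c / total[e]
    return t

pairs = [("the house".split(), "la casa".split()),
         ("the book".split(), "el libro".split()),
         ("a book".split(), "un libro".split())]
t = train_model1(pairs)
print(round(t[("casa", "house")], 3))   # rises toward 1.0 as EM sharpens the 'rule'
```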
Performance ceilings of NLP
• How to improve IR accuracy?
TREC 98–00: around 40%.
→ Query analysis; query term expansion by meaning (a toy sketch follows below).
• How to achieve conceptual summarization?
Never been done yet, at non-toy level.
→ Topic interpretation and fusion via meaning.
• How to improve QA?
TREC 01: up to 60%, for factoids only.
→ Question and answer analysis; matching of meanings.
• How to improve MT quality?
MTEval 94: ~70%, depending on what you measure.
→ Word-sense disambiguation, to find the correct meaning.
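For the first bullet, query term expansion by meaning, here is a hedged sketch using WordNet synonyms via NLTK. The function name is invented, and this stands in for, but is not, the TREC-era systems referred to above.

```python
# Toy "query term expansion by meaning" using WordNet synonyms (illustrative only).
# Assumes NLTK with the 'wordnet' corpus downloaded.
from nltk.corpus import wordnet as wn

def expand_query(terms, max_senses=2):
    """Return the original terms plus synonyms from each term's top WordNet senses."""
    expanded = list(terms)
    for term in terms:
        for synset in wn.synsets(term)[:max_senses]:    # most frequent senses first
            for lemma in synset.lemma_names():
                word = lemma.replace('_', ' ').lower()
                if word not in expanded:
                    expanded.append(word)
    return expanded

print(expand_query(["car", "engine"]))
# e.g. ['car', 'engine', 'auto', 'automobile', ...]
```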
Semantics for summarization
• FRUMP (DeJong 1978), given this story:

An event that shakes the world: Pope John Paul I dies
September 28, 1978

Pope John Paul I was bishop of Rome for only 34 days in August-September 1978. In that short span he issued no major statements contributing to the development of the tradition of Catholic social teaching. He did manage, however, to catch the world's imagination with his warm smile, and had he lived long enough, he undoubtedly would have placed his own mark on the church's social teaching. In tribute to this "smiling Pope," we present some quotes from his talks, passages which affirm his interest in a world of justice and peace.

Albino Luciani was born on October 17, 1912, in what is now Canale d'Agordo, in the diocese of Belluno in the Dolomite Alps of Northern Italy. His father, a confirmed socialist who did not oppose his son's vocation to the priesthood, was a migrant worker and then a glassblower in Venice. The family, like those of Pius X and John XXIII, was always poor, and Albino knew from personal experience the hardships suffered by the modern urban proletariat. He was the first pope chosen from their ranks.

…produced this summary: “There was an earthquake in Italy. One person died.”
Semantics for QA: Webclopedia
(Hovy et al. 01, 02)
Where do lobsters like to live?
— on the table
How many people live in Chile?
— nine
Where are zebras most likely found?
— in the dictionary
What is an invertebrate?
— Dukakis
…need semantic and numerical world knowledge
The web: a giant source for semantics
• Jan 02: over 195 billion words (2 billion pages) on the web (66% English, 12% German, 5% Japanese)
• Need IR, MT, Summarization, QA
• Need an/some ontologies
Language       Sample (thousands of words)         Exponential Growth Assumption
               Oct-96     Aug-99     Feb-00        Dec-01       Dec-03         Dec-05
English      6,082.09  28,222.10  48,064.10    128,043.57   419,269.14   1,375,098.05
German         228.94   1,994.23   3,333.13     13,435.07    65,161.79     316,727.36
Japanese       228.94   1,994.23   3,333.13      9,375.41    40,070.32     171,600.89
French         223.32   1,529.80   2,732.22      9,375.41    40,070.32     171,600.89
Spanish        104.32   1,125.65   1,894.97      8,786.78    48,968.42     273,542.30
Chinese        123.56     817.27   1,338.35      8,786.78    48,968.42     273,542.30
Korean         123.56     817.27   1,338.35      4,507.93    18,206.81      73,675.11
Italian        123.56     817.27   1,338.35      4,507.93    18,206.81      73,675.11
Portuguese     106.17     589.39   1,161.90      3,455.98    13,438.26      52,350.71
Norwegian      106.50     669.33     947.49      3,109.04    11,474.59      42,425.27
Finnish         20.65     107.26     166.60        480.19     1,628.87       5,534.62
Non-English  1,389.49  10,461.70  17,584.48     65,820.52   306,194.61   1,454,674.58
Non-English%   18.60%     27.04%     26.79%        33.95%       42.21%        51.41%
(from Grefenstette 99, with additions by Oard and Hovy)
A start: Ankle-deep semantics
• ‘Thin’ ontology taxonomies (or CYC, if it works):
– WordNet (Miller et al.), SENSUS, etc.
– profligate termsets, little inference
• Rich experiential knowledge (Hobbs; Fleischman & Hovy):
– all commonly known people, places, events, objects…
• Some interconnections: in addition to is-a, need:
– topical relations: topic signatures (Lin & Hovy); a toy sketch follows below
– scriptal relations: scripts (Schank & Abelson)
– meronymical (part-of), functional, and material hierarchies (Guarino)
• Richer XML-based annotation standards (Semantic Web?)
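As a purely illustrative reading of "topical relations: topic signatures", here is a toy sketch that ranks terms by Dunning's log-likelihood ratio of being over-represented in on-topic text relative to background text, roughly in the Lin & Hovy spirit. The function names and the corpora are placeholders, not their actual code or data.

```python
# Toy topic-signature extraction: rank terms by log-likelihood ratio (Dunning)
# of being over-represented in topic text vs. background text. Illustrative only.
import math
from collections import Counter

def _log_l(k, n, p):
    """Log binomial likelihood (constant term dropped); limits at p = 0 or 1 are 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def topic_signature(topic_tokens, background_tokens, top_n=20):
    c1, c2 = Counter(topic_tokens), Counter(background_tokens)
    n1, n2 = len(topic_tokens), len(background_tokens)
    scores = {}
    for term, k1 in c1.items():
        k2 = c2.get(term, 0)
        if k1 / n1 <= k2 / n2:
            continue                      # keep only terms over-represented in the topic
        p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
        scores[term] = 2.0 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                              - _log_l(k1, n1, p) - _log_l(k2, n2, p))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy usage with two tiny word lists standing in for topic and background corpora.
print(topic_signature("quake tremor damage quake aftershock".split(),
                      "market rose damage report market".split()))
```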
Toward a deeper theory of information
• Shannon’s work is ‘relativistic’ to rater: based on
guessing the next symbol, therefore implicitly
conditioned on the reader’s knowledge — with
more knowledge of the language, the domain,
etc., the information measure changes (see the sketch below)
• We must aim for a theory of information that is
more ‘absolute’:
– defined in terms of knowledge symbols (ontology) and
relations (and can be related to reader’s knowledge?)
– focused on transformations and decomposition of
transformations
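To make the first bullet concrete in Shannon's own guessing-game terms, a small sketch (the knowledge-state symbol K is my notation, not Shannon's): for a particular history of symbols and a reader whose knowledge state is K, the per-symbol information is the conditional entropy

```latex
% Per-symbol information in the guessing game, for history x_1..x_n and a
% reader whose knowledge state is K; richer K changes (usually lowers) H.
H(X_{n+1} \mid x_1,\dots,x_n, K)
  \;=\; -\sum_{x} p(x \mid x_1,\dots,x_n, K)\,\log_2 p(x \mid x_1,\dots,x_n, K)
```

so the same text carries a different measured amount of information for readers with different K, which is the sense in which the measure is 'relativistic'.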
Info Theory for evaluating summaries
Evaluating summaries:
• Shannon Game: quantify information content…guess next letter/word.
• Question Game: test reader’s understanding…ask questions.
• Classification Game: compare classifiability…sort texts/summaries into buckets.
Compression Ratio: CR = (length S) / (length T)
Retention Ratio: RR = (info in S) / (info in T)
(a toy computation of both ratios is sketched below)
• Case 1: just adding info; no special leverage from summary.
• Case 2: ‘fuser’ concept(s) at knee add a lot of information.
• Case 3: ‘fuser’ concepts become weaker.
[Three RR-vs-CR curves illustrate cases 1–3.]
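Here is a toy sketch of the two ratios, using word counts for length and a unigram self-information estimate over the full text as a crude stand-in for "info" (the games above need human subjects). The function name and smoothing choice are invented for this illustration.

```python
# Toy CR and RR: CR = length(S)/length(T); RR = info(S)/info(T), with "info"
# approximated by unigram self-information under a model estimated from T.
import math
from collections import Counter

def compression_and_retention(summary, text):
    s, t = summary.split(), text.split()
    cr = len(s) / len(t)                               # CR = (length S) / (length T)
    counts = Counter(t)
    total = sum(counts.values())
    def info(tokens):                                  # crude information proxy
        return sum(-math.log2(counts.get(w, 1) / total) for w in tokens)
    rr = info(s) / info(t)                             # RR = (info in S) / (info in T)
    return cr, rr

print(compression_and_retention("the pope died",
                                "the pope died in rome after a short reign"))
```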
NLP in the world
• There are between 10,000 and 15,000 NLP practitioners in the world:
– ISCA—3000 members?
– ACL—2000 members
– SIGIR—1000 members
– IAMT—400 members
• There are over 20 conference series: ICSLP, ACL (+
NAACL-HLT, EACL), COLING, LREC, SIGIR, EMNLP, MT
Summit (+ AMTA, EAMT, AAMT), RANLP, PACLING, INLG,
ROCLING, TMI, CICLing… plus numerous workshop series
My predictions
• Continued drift toward bimodal world: a few large
centers (that do everything) and many tiny research
efforts, but little in between
– Engineering projects need continuity and wide resources
• Use of statistical methods to learn semantics-oriented transformations
– Strong need for shallow and deeper semantic
resources—Framebanks, pragmatic annotations, etc.
– Repeat history of (for example) MT development
• Growth of information delivery platforms
– Integration of NLP applications (IR, MT, QA, Summ,
ASR, etc.) on top of IE, WSD, and linked to multimedia
interfaces – NLP disappears into the background