Large-scale Knowledge Resources in Speech and Language Research
Mark Liberman, University of Pennsylvania
LKR2004 3/8/2004
Outline
• Glimpse of LKR in the U.S. landscape
• What is the relationship between large-scale knowledge resources and research and development on speech and language?
• What are some needs and opportunities?
• What are the trends?
• Illustrative examples
Glimpses of the U.S. LKR landscape
• DARPA research areas
– Human Language Technology
– Cognitive Information Processing
• NSF initiatives
– Digital Libraries
– ITR, Human Social Dynamics
– “terascale linguistics”
• Biomedical research:
– text, ontologies, databases, experiments
– collaborations with Japan and Europe
• Language documentation
• Web archives in many disciplines
• ...too many other things to list...
What is the relationship between large-scale knowledge resources and research and development on speech and language?
Speech and language R&D needs LKR
Modeling text: 10^4-10^6 words in 1975, 10^9-10^12 words today
Modeling speech: 1-10 hours in 1975, 10^3-10^4 hours today
+ a thousand languages and dialects
+ lexicons, parallel text, DBs for entity tracking, etc.
+ history, social variation, register and genre, ...

Speech and language R&D creates LKR
see above.
but also something entirely new...
Some needs and opportunities
• Standards and tools for LKR
– for creation, improvement, maintenance
– for publication, distribution, archiving
– for search, access and use
• An academic culture that rewards production and distribution of LKR
– most LKR are a side effect of individual and small-group research
– virtual “meta-resources” from many sources
• Part of the answer: integrate LKR into the system of (scientific and scholarly) publication
Themes and trends
• A New Empiricism
focus on large-scale resources, because quantity (of data) → quality (of knowledge)
• Language + Life = Meaning
something new emerges from large collections of symbols, signals, contexts, connections
• People and machines: better together
– cognitive prosthetics
– interactive working, playing and learning
• Failure is the basis for success
if we can measure error, we can learn to improve
Some illustrative examples...
A famous argument
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
“. . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not.”
Noam Chomsky, “Syntactic Structures” (1957)
But is it true?
43 years later
• someone finally checked...
– Pereira, “Formal grammar and information theory” (2000)
– simple “aggregate bigram model” using hidden class variables c
– with C=16, trained on ~100MW of newswire data
• the result:
“Furiously sleep ideas green colorless” is more than 200,000 times less probable than “Colorless green ideas sleep furiously”
What changed?
• Partly:
– new models and estimation methods
– better computing resources
– more accessible data
• Mostly:
– willingness to look for solutions
– opportunities to apply them
To be fair, this kind of modeling became a real option only about 1980
Now it can be done as an undergraduate term project ...
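To make the "term project" point concrete, here is a minimal sketch of an aggregate bigram model with hidden classes, P(w2 | w1) = sum over c of P(c | w1) P(w2 | c), fitted by EM. It is not Pereira's implementation: the function names, the toy corpus, the random initialization and the absence of smoothing are all assumptions made for illustration, and a toy corpus cannot reproduce the 200,000-fold ratio reported for ~100MW of newswire.

```python
import math
import random
from collections import Counter


def train_aggregate_bigram(tokens, num_classes=2, iterations=30, seed=0):
    """EM for the model P(w2 | w1) = sum_c P(c | w1) * P(w2 | c)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    bigrams = Counter(zip(tokens, tokens[1:]))
    classes = range(num_classes)

    def random_distribution(size):
        weights = [rng.random() + 0.1 for _ in range(size)]
        total = sum(weights)
        return [w / total for w in weights]

    # Random starting point: P(c | w1) for each word, P(w2 | c) for each class.
    p_class = {w: random_distribution(num_classes) for w in vocab}
    p_word = [dict(zip(vocab, random_distribution(len(vocab)))) for _ in classes]

    log_likelihoods = []
    for _ in range(iterations):
        class_counts = {w: [0.0] * num_classes for w in vocab}
        word_counts = [dict.fromkeys(vocab, 0.0) for _ in classes]
        log_likelihood = 0.0
        # E-step: split each observed bigram count across the hidden classes.
        for (w1, w2), n in bigrams.items():
            joint = [p_class[w1][c] * p_word[c][w2] for c in classes]
            total = sum(joint)
            log_likelihood += n * math.log(total)
            for c in classes:
                expected = n * joint[c] / total
                class_counts[w1][c] += expected
                word_counts[c][w2] += expected
        log_likelihoods.append(log_likelihood)
        # M-step: renormalize the expected counts into probabilities.
        for w in vocab:
            total = sum(class_counts[w])
            if total > 0:
                p_class[w] = [x / total for x in class_counts[w]]
        for c in classes:
            total = sum(word_counts[c].values())
            if total > 0:
                p_word[c] = {w: x / total for w, x in word_counts[c].items()}
    return p_class, p_word, log_likelihoods


def sequence_log_prob(words, p_class, p_word):
    """Sum of log P(w2 | w1) over adjacent pairs (no smoothing, so -inf is possible)."""
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        p = sum(pc * p_word[c][w2] for c, pc in enumerate(p_class[w1]))
        total += math.log(p) if p > 0 else float("-inf")
    return total


# Hypothetical toy corpus; a real experiment would use far more text.
corpus = ("the green ideas sleep . the new ideas sleep furiously . "
          "colorless ideas sleep . the ideas sleep furiously .").split()
p_class, p_word, log_likelihoods = train_aggregate_bigram(corpus, num_classes=2)
print(sequence_log_prob("colorless green ideas sleep furiously".split(), p_class, p_word))
print(sequence_log_prob("furiously sleep ideas green colorless".split(), p_class, p_word))
```

The same routine accepts any token sequence, so the tokens can be words or any other symbols. Because the two orderings share the same words, any difference in their scores comes only from the class-mediated transition probabilities.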
Social structure from conversation
• Human social dynamics: model of conversational turn-taking
• U.S. Supreme Court oral arguments
• Modeling is simple and local
– one session modeled at a time (~250 turns)
– data is just sequence of (~250) speaker IDs
• Undergraduate term project in intro course
(credit to: Chris Osborn)
CHIEF JUSTICE WILLIAM H. REHNQUIST: We'll hear argument next in No. 01-298, Paul Lapides v. the Board of Regents of the University System of Georgia. Spectators are admonished, do not talk until you get outside the courtroom. The court remains in session. Mr. Bederman.
MR. DAVID J. BEDERMAN: Mr. Chief Justice, and may it please the Court: When a State affirmatively invokes the jurisdiction of the Federal court by removing a case, that acts as a waiver of the State's forum immunity to Federal jurisdiction under the Eleventh Amendment. This principle ...
JUSTICE ANTONIN SCALIA: When you say as an actor in any role, does it ever intervene as a defendant?
MR. BEDERMAN: Yes, Justice Scalia. This Court's precedents seem to indicate that wherever the State is cast in the role of plaintiff, defendant, intervenor, or claimant, that the entry into the Federal proceeding submits the State to the jurisdiction of the Federal court.
CHIEF JUSTICE REHNQUIST: How about the Ford Motor Company case?
MR. BEDERMAN: Well, of course, the authorization requirement in Ford Motor -- and that's the particular holding in Ford Motor that I think is of concern to this Court -- need not be reached here because, of course, ...
CHIEF JUSTICE REHNQUIST: So, you think a line can be drawn between the State defendant being drawn in as a respondent or involuntarily as opposed to removing and thereby invoking Federal jurisdiction.
+ ... 254 turns ...
Two-class “aggregate bigram model”, trained on a single one-hour argument (01-298), highest-probability class for each speaker:
class 1 = (
  chief justice william h. rehnquist
  justice anthony kennedy
  justice antonin scalia
  justice john paul stevens
  justice ruth bader ginsburg
  justice sandra day o'connor
  justice stephen g. breyer
)
class 2 = (
  mr. david j. bederman
  mr. irving l. gornstein
  ms. devon orland
  ms. julie c. parsley
)
So human social roles can emerge from a trivial statistical model of speaker sequencing in a formal setting.
And sometimes you don’t need a lot of data.
...though in this case, it was crucial that Jerry Goldman’s Oyez Project is publishing all Supreme Court oral arguments (audio and transcripts)
In most cases the quantity of data is crucial:
Data quantity → knowledge quality
... and available resources are just starting to pass a threshold
A case where size matters...
• English complex nominals:
sequence of nouns and adjectives, e.g.
Volume Feeding Management Success Formula Award
• Part-of-speech string offers little help in parsing:
[ stone [ traffic barrier ]]
[[ job growth ] statistics ]
N N N
• Apparently, parsing requires “understanding”
The MEDLINE corpus
• U.S. National Library of Medicine
• ~12 million references and abstracts
– biomedical journal articles
– 1966 to present
• ~10^9 words
Parsing by counting (in MEDLINE)
Pattern   Phrase                             Counts
[NN]N     sickle cell anemia                 10561     2422
N[NN]     rat bile duct                        203    22366
[NA]N     information theoretic criterion      112        5
N[AN]     monkey temporal lobe                  16    10154
[AN]N     giant cell tumour                   7272     1345
A[NN]     cellular drug transport              262      746
[AA]N     small intestinal activity           8723      120
A[AN]     inadequate topical cooling             4      195
Parsing by counting (google hits)
Pattern      Phrase                   Counts
[N [N N]]    stone traffic barrier        338      7,010
[[N N] N]    job growth statistics    349,000     11,600

First attempt at this idea: for AT&T TTS in 1987
First real success: ~15 years later
The difference:
It doesn’t really work with 10^7-10^8 tokens
It works pretty well with 10^9-10^12 tokens
“You can observe a lot just by watching.” -Yogi Berra
here... “You can analyze a lot just by counting.”
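The counting idea can be sketched in a few lines. This assumes that the two numbers given for each phrase are the corpus counts of its first adjacent pair (w1 w2) and its second adjacent pair (w2 w3), which is an interpretation of the tables rather than something the slides state. The function name `bracket` and the dictionary layout are hypothetical.

```python
def bracket(words, pair_count):
    """Bracket a three-word nominal by comparing the counts of its adjacent pairs."""
    w1, w2, w3 = words
    left = pair_count.get((w1, w2), 0)   # evidence for [[w1 w2] w3]
    right = pair_count.get((w2, w3), 0)  # evidence for [w1 [w2 w3]]
    if left > right:
        return f"[[{w1} {w2}] {w3}]"
    if right > left:
        return f"[{w1} [{w2} {w3}]]"
    return None  # a tie: counting alone gives no answer


# Counts copied from the slides, read as (first pair, second pair).
pair_count = {
    ("sickle", "cell"): 10561, ("cell", "anemia"): 2422,
    ("rat", "bile"): 203, ("bile", "duct"): 22366,
    ("stone", "traffic"): 338, ("traffic", "barrier"): 7010,
    ("job", "growth"): 349000, ("growth", "statistics"): 11600,
}

for phrase in ["sickle cell anemia", "rat bile duct",
               "stone traffic barrier", "job growth statistics"]:
    print(phrase, "->", bracket(phrase.split(), pair_count))
```

With small corpora most pair counts are zero or tied, which is one way to read the remark that the method needs 10^9 tokens or more.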
As the SCOTUS example suggests, “large-scale” is not just the number of words or hours.
Structure, context and external relationships can also be crucial – here it was the sequence of speaker identities.
Here’s a simple but compelling example of how symbol-like structure emerges as zebra finches practice a song...
This is research by Ofer Tchernichovski (CCNY), Partha Mitra and others
Zebra finch song learning Ofer Tchernichovski (CCNY)
[Figure: time axis, 0 to 700 ms]
Song motifs vary across individuals
Song imitation – young birds imitate adults
Tutor’s song Pupil’s song
Song imitation
* Can be very accurate
* Critical period – developmental learning
* Song template – memory traces of a model
* Learning requires auditory feedback
[Timeline: sensory phase and sensory-motor phase; x-axis: Age (days), 0 to 100]
Initially: Social & acoustic isolation
Days 35 / 43 / 60: Start training
The training system
Laboratory of Animal Behavior, CCNY
Real-time calculation of acoustic features
4 simple acoustic features with articulatory correlates:
Pitch: Low (-) to High (+)
FM: Low (-) to High (+)
Wiener entropy: Pure tone (-) to Noise (+)
Spectral continuity: Low (-) to High (+)
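As an illustration of one of these features, the sketch below computes Wiener entropy under a common definition: the log of the ratio between the geometric mean and the arithmetic mean of the power spectrum. The slides do not say how the CCNY system computes it (windowing, tapers and frame size are not given), so the naive DFT, the frame length of 256 samples and the small `floor` constant are assumptions for illustration only.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Power at the positive, non-DC frequency bins, via a naive O(n^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def wiener_entropy(frame, floor=1e-12):
    """log(geometric mean / arithmetic mean) of the power spectrum.

    Values near 0 indicate a flat, noise-like spectrum; strongly negative
    values indicate a peaked spectrum such as a pure tone.
    """
    power = [p + floor for p in power_spectrum(frame)]  # floor avoids log(0)
    log_geometric_mean = sum(math.log(p) for p in power) / len(power)
    arithmetic_mean = sum(power) / len(power)
    return log_geometric_mean - math.log(arithmetic_mean)


n = 256
tone = [math.sin(2 * math.pi * 16 * t / n) for t in range(n)]  # 16 cycles per frame
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # white noise

print("tone :", wiener_entropy(tone))
print("noise:", wiener_entropy(noise))
```

On this log scale the value can never exceed zero, because a geometric mean is never larger than the corresponding arithmetic mean.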
The training system
[Diagram: Song recognition, Song analysis, Database table]
[Table image: rows of numeric song-feature values from the database table]
Dynamic Vocal Development maps
Duration   Mean Pitch     Mean Entropy    Mean FM
66         802.5073242    -2.626851082    33.58778763
66         704.6381836    -2.524046659    27.59897423
53         812.2409058    -1.880394816    45.26642609
62         744.0402222    -2.562429667    34.36729431
76         1212.450928    -2.24555397     48.8947258
121        663.1687012    -2.535212278    20.65950394
61         719.1973877    -2.427448273    29.89187622
65         1119.903198    -2.556747913    45.04622269
92         980.5782471    -2.776203156    29.98022079
50         1089.148315    -2.479059219    29.93981934
70         811.1593628    -2.734509706    27.13637352

[Plot: x-axis: Duration, 0 to 500; y-axis: 0 to 90]
Dynamic Vocal Development (DVD) Map of a single bird
[Figure: panels labeled Day 35, Day 45, Day 55, Day 65, Day 75 and Day 85, with the onset of training marked; x-axis: Duration, 0 to 500; y-axis: 0 to 90]
Language + Life = Meaning
• Text (and speech) structured by:
– conversational context
  • time, place, sequence, participants, ...
– content
  • types and identities of referenced entities
  • explicit links (anaphora, references, hyperlinks)
  • implicit links (quotation, imitation, opposition)
– other contextual data
  • e.g. neurological, gene expression data in birdsong learning
  • gaze, gesture, posture, physiological data in conversation
A small application: real conversational transcription
• Perfect automatic speech-to-text (STT) yields:
ew very nice yes that’s that’s the ah first car uh well my first ownership of something major that’s cool i had to buy my car my other car burned down so it was my first brand new car uh-huh but i love it so i am very happy
• STT + “metadata” yields “Rich Transcription”:
Speaker 1: Very nice.
Speaker 2: Yes. That’s my first ownership of something major.
Speaker 1: That’s cool. I had to buy my car. My other car burned down. It was my first brand new car.
Speaker 2: Uh-huh.
Speaker 1: But I love it. I am very happy.
One aspect of conversational metadata: Diarization
Goal: Label acoustic “sources” and their attributes
– speakers, music, noise, DTMF, background events
Interactive annotation
• Supervised learning: human annotates, machine learns
• Unsupervised learning: machine looks for structure in raw data
• Semi-supervised learning: human annotates a few examples, machine tries to generalize
• “Active learning”: machine selects cases that are interesting or uncertain, asks for human judgments
• Sampling experiments: human checks machine annotation of selected cases, then applies the sample confusion matrix to estimate overall statistics
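The last item, applying a sample confusion matrix to estimate overall statistics, might look like the sketch below. The procedure is one plausible reading of the bullet, not a documented method from the talk: for each machine label, the human-checked sample gives the proportion of each true label, and those proportions are scaled up by the machine's label totals over the whole corpus. The function name, the tag names and all counts are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_true_counts(machine_totals, checked_sample):
    """Estimate corpus-wide true label counts from a human-checked sample.

    machine_totals: {machine_label: count over the whole corpus}
    checked_sample: list of (machine_label, human_label) pairs
    """
    # Sample confusion matrix: one row of human labels per machine label.
    confusion = defaultdict(Counter)
    for machine_label, human_label in checked_sample:
        confusion[machine_label][human_label] += 1

    estimates = Counter()
    for machine_label, total in machine_totals.items():
        row = confusion[machine_label]
        checked = sum(row.values())
        if checked == 0:
            raise ValueError(f"no checked examples for label {machine_label!r}")
        # Scale the sample proportions up to the machine's corpus-wide total.
        for human_label, n in row.items():
            estimates[human_label] += total * n / checked
    return dict(estimates)


# Hypothetical numbers: the tagger assigned NN 8000 times and VB 2000 times;
# a human checked ten cases of each.
machine_totals = {"NN": 8000, "VB": 2000}
checked_sample = ([("NN", "NN")] * 9 + [("NN", "VB")] * 1 +
                  [("VB", "NN")] * 2 + [("VB", "VB")] * 8)
estimates = estimate_true_counts(machine_totals, checked_sample)
print(estimates)
```

The estimate is only as good as the sample: it assumes the checked cases were drawn at random within each machine label.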
The cycle of interactive annotation
[Diagram: a cycle with nodes Hand Annotation, Machine Learning, Automatic Annotation, (Selective) Sampling/Labeling, Hand Correction]
POS tagger trained on WSJ applied to MEDLINE:
Same tagger, after retraining (~200 MEDLINE abstracts):
The key to success: learn to measure failure...
Even a badly flawed measure can produce important gains.
[Chart: Arabic to English, Best Research System and Best COTS System, 2002 and 2003; y-axis: 50% to 100%; values shown: 89%, 58%, 57%, 51%]
One year of quantitative evaluation...
Scoring Method
Percent of Human = (Machine Translation Score / Human Translation Score) x 100
Translation Score = weighted sum of n-gram matches between the translation being scored (human or machine) and three good reference translations
Reference translation:
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
[Legend: uni-gram match, bi-gram match, tri-gram match]
Machine translation:
The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
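A rough sketch of the scoring idea, assuming equal weights for uni-grams, bi-grams and tri-grams and simple whitespace tokenization. The slide does not give the actual weights, so this is not the official evaluation metric; the function names and the toy sentences are hypothetical. A match is counted at most as many times as the n-gram occurs in any single reference.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def translation_score(candidate, references, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of clipped n-gram matches against the references.

    weights[0] applies to uni-grams, weights[1] to bi-grams, and so on.
    The equal default weights are an assumption, not the evaluation's values.
    """
    cand = candidate.lower().split()
    refs = [reference.lower().split() for reference in references]
    score = 0.0
    for n, weight in enumerate(weights, start=1):
        # Largest count of each n-gram in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matches = sum(min(count, max_ref[gram])
                      for gram, count in ngrams(cand, n).items())
        score += weight * matches
    return score


def percent_of_human(machine, human, references):
    """Machine score as a percentage of the human translator's score."""
    return 100.0 * translation_score(machine, references) / translation_score(human, references)


# Hypothetical toy data: three references, one human and one machine translation.
references = ["the airport received an e-mail",
              "the airport got an e-mail",
              "an e-mail reached the airport"]
human = "the airport received an e-mail"
machine = "the airport receives one mail"
print(percent_of_human(machine, human, references))
```

Raw match counts favor longer outputs, so a usable metric would also need some form of length normalization, which this sketch omits.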
Best System Outputs
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .
And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " . Certain are " the lines is air Libyan I will start also in of three trips running weekly to Cairo in the coordination with Egypt for flying " .
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".
The Libyan Arab Airways will also in the conduct of the three times a week in Cairo in coordination with egyptair ".
Human v. Machine
Human:
Egypt Air May Resume its Flights to Libya Tomorrow Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.
The official said that, "the company sent a letter to the Ministry of Foreign Affairs to inquire about the lifting of the air embargo on Libya, and in the event that it receives a response, then the first flight to Libya, will take off, Wednesday morning." He stressed that "the Libyan Airlines will begin scheduling three weekly flights to Cairo, in coordination with Egypt air."
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".
The Libyan Arab Airways will also in the conduct of the three times a week in Cairo in coordination with egyptair ".
Summary
• Speech and Language Research
– needs LKR
– creates LKR
– can help other disciplines deal with LKR
– is helped by other disciplines, who provide
  • raw data as well as relevant LKR pieces
  • problems, algorithms, inspiration
• The whole is greater than the sum of the parts
– Types, sources and amounts of data
– Collaboration within and across disciplines
– Cooperation of humans and machines