Introduction to Natural Language Processing (600.465)


Textbooks you need
• Manning, C. D., Schütze, H.:
– Foundations of Statistical Natural Language Processing. The MIT Press. 1999. ISBN 0-262-13360-1. [required]
• Allen, J.:
– Natural Language Understanding. The Benjamin/Cummings Publishing Co. 1995. 2nd edition.
• Jurafsky, D. and J. H. Martin:
– Speech and Language Processing. Prentice-Hall. 2009. 2nd edition.
1
Other reading
• Charniak, E:
– Statistical Language Learning. The MIT Press. 1996. ISBN 0-262-53141-0.
• Cover, T. M., Thomas, J. A.:
– Elements of Information Theory. Wiley. 1991. ISBN 0-471-06259-6.
• Jelinek, F.:
– Statistical Methods for Speech Recognition. The MIT Press. 1998. ISBN 0-262-10066-5.
• Proceedings of major conferences:
– ACL (Assoc. of Computational Linguistics)
– NAACL HLT (North American Chapter of ACL)
– COLING (Intl. Committee of Computational Linguistics)
– ACM SIGIR
– Interspeech/ASRU/SLT
2
Course segments
• Intro & Probability & Information Theory
– The very basics: definitions, formulas, examples.
• Language Modeling
– n-gram models, parameter estimation
– smoothing (EM algorithm)
• A Bit of Linguistics
– phonology, morphology, syntax, semantics, discourse
• Words and the Lexicon
– word classes, mutual information, a bit of lexicography.
3
Course segments (cont.)
• Hidden Markov Models
– background, algorithms, parameter estimation
• Tagging: Methods, Algorithms, Evaluation
– tagsets, morphology, lemmatization
– HMM tagging, Transformation-based, Feature-based
• NL Grammars and Parsing: Data, Algorithms
– Grammars and Automata, Deterministic Parsing
– Statistical parsing. Algorithms, parameterization, evaluation
• Applications (MT, ASR, IR, Q&A, ...)
4
Goals of HLT
Computers would be a lot more useful if they could handle our email, do our library research, talk to us …
But they are fazed by natural human language.
How can we give computers the ability to handle human language? (Or help them learn it as kids do?)
5
A few applications of HLT
• Spelling correction, grammar checking … (language learning and evaluation, e.g. TOEFL essay scoring)
• Better search engines
• Information extraction, gisting
• Psychotherapy; Harlequin romances; etc.
• New interfaces:
– Speech recognition (and text-to-speech)
– Dialogue systems (USS Enterprise onboard computer)
– Machine translation; speech translation (the Tower of Babel??)
• Trans-lingual summarization, detection, extraction …
6
Dan Jurafsky
Question Answering: IBM’s Watson
• Won Jeopardy on February 16, 2011!
WILLIAM WILKINSON’S “AN ACCOUNT OF THE PRINCIPALITIES OF WALLACHIA AND MOLDAVIA” INSPIRED THIS AUTHOR’S MOST FAMOUS NOVEL

Answer: Bram Stoker
7
Dan Jurafsky
Information Extraction
Email received:
Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky
Hi Dan, we’ve now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris

→ Create new Calendar entry:
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
8
Dan Jurafsky
Information Extraction & Sentiment Analysis
Attributes: zoom, affordability, size and weight, flash, ease of use

Size and weight:
✓ “nice and compact to carry!”
✓ “since the camera is small and light, I won’t need to carry around those heavy, bulky professional cameras either!”
✗ “the camera feels flimsy, is plastic and very light in weight; you have to be very delicate in the handling of this camera”
9
Dan Jurafsky
Machine Translation
• Helping human translators
• Fully automatic
Enter Source Text:
这 不过 是 一 个 时间 的 问题 .
Translation from Stanford’s Phrasal:
This is only a matter of time.
10
Dan Jurafsky
Language Technology

• mostly solved:
– Spam detection: “Buy V1AGRA …” ✗ vs. “Let’s go to Agra!” ✓
– Part-of-speech (POS) tagging: “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV
– Named entity recognition (NER): “Einstein met with UN officials in Princeton” → PERSON, ORG, LOC

• making good progress:
– Sentiment analysis: “Best roast chicken in San Francisco!” vs. “The waiter ignored us for 20 minutes.”
– Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
– Word sense disambiguation (WSD): “I need new batteries for my mouse.”
– Parsing: “I can see Alcatraz from the window!”
– Machine translation (MT): “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”
– Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → add calendar entry: Party, May 27

• still really hard:
– Question answering (QA): “How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
– Paraphrase: “XYZ acquired ABC yesterday” ≈ “ABC has been taken over by XYZ”
– Summarization: “The Dow Jones is up” + “Housing prices rose” + “The S&P500 jumped” → “Economy is good”
– Dialog: “Where is Citizen Kane playing in SF?” → “Castro Theatre at 7:30. Do you want a ticket?”
Dan Jurafsky
Ambiguity makes NLP hard:
“Crash blossoms”
Violinist Linked to JAL Crash Blossoms
Teacher Strikes Idle Kids
Red Tape Holds Up New Bridges
Hospitals Are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
Dan Jurafsky
Ambiguity is pervasive
New York Times headline (17 May 2000):
Fed raises interest rates (“raises” read as the verb)
Fed raises interest rates (“interest” read as the verb)
Fed raises interest rates 0.5% (“rates” read as the verb)
Dan Jurafsky
Why else is natural language understanding difficult?

• non-standard English: “Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥”
• segmentation issues: the New York-New Haven Railroad ([New York]-[New Haven]? or New [York-New] Haven?)
• idioms: dark horse, get cold feet, lose face, throw in the towel
• neologisms: unfriend, Retweet, bromance
• world knowledge: Mary and Sue are sisters. vs. Mary and Sue are mothers.
• tricky entity names: Where is A Bug’s Life playing …; Let It Be was recorded …; … a mutation on the for gene …

But that’s what makes it fun!
Dan Jurafsky
Making progress on this problem…
• The task is difficult! What tools do we need?
• Knowledge about language
• Knowledge about the world
• A way to combine knowledge sources
• How we generally do this:
• probabilistic models built from language data (see the sketch below)
• P(“maison” → “house”) is high
• P(“L’avocat général” → “the general avocado”) is low
• Luckily, rough text features can often do half the job.
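A minimal sketch of this idea in Python, using a hypothetical list of word-aligned French-English pairs (the pairs and counts below are invented for illustration): relative-frequency estimates already make “house” a likely translation of “maison” and “avocado” an unlikely one for “avocat”.

from collections import Counter

# Hypothetical word-aligned French-English pairs, as might be harvested
# from a parallel corpus (the data is invented for illustration).
aligned_pairs = [
    ("maison", "house"), ("maison", "house"), ("maison", "home"),
    ("avocat", "lawyer"), ("avocat", "lawyer"), ("avocat", "avocado"),
]

pair_counts = Counter(aligned_pairs)
source_counts = Counter(src for src, _ in aligned_pairs)

def p_translate(src, tgt):
    """Relative-frequency estimate of P(tgt | src)."""
    return pair_counts[(src, tgt)] / source_counts[src]

print(p_translate("maison", "house"))    # ~0.67 (high)
print(p_translate("avocat", "avocado"))  # ~0.33 (low; in legal contexts
                                         # like "l'avocat general", lower still)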
Dan Jurafsky
This class
• Teaches key theory and methods for statistical NLP:
– Viterbi
– Naïve Bayes, Maxent classifiers
– N-gram language modeling
– Statistical parsing
– Inverted index, tf-idf, vector models of meaning
• For practical, robust real-world applications
– Information extraction
– Spelling correction
– Information retrieval
– Sentiment analysis
Levels of Language
• Phonetics/phonology/morphology: what words
(or subwords) are we dealing with?
• Syntax: What phrases are we dealing with?
Which words modify one another?
• Semantics: What’s the literal meaning?
• Pragmatics: What should you conclude from the
fact that I said something? How should you react?
17
What’s hard – ambiguities, ambiguities,
all different levels of ambiguities
John stopped at the donut store on his way home from work.
He thought a coffee was good every few hours. But it
turned out to be too expensive there. [from J. Eisner]
- donut: to get a donut (doughnut; spare tire) for his car?
- donut store: a store where donuts shop? or that is run by donuts? or that looks like a big donut? or that is made of donuts?
- from work: well, actually, he stopped there from hunger and exhaustion, not just from work.
- every few hours: is that how often he thought it? or how often the coffee was good?
- it: the particular coffee that was good every few hours? the donut store? the situation?
- too expensive: too expensive for what? what are we supposed to conclude about what John did?
18
NLP: The Main Issues
• Why is NLP difficult?
– many “words”, many “phenomena” → many “rules”
• OED: 400k words; Finnish lexicon (of forms): ~2 × 10^7
• sentences, clauses, phrases, constituents, coordination,
negation, imperatives/questions, inflections, parts of speech,
pronunciation, topic/focus, and much more!
• irregularity (exceptions, exceptions to the exceptions, ...)
• potato → potatoes (tomato, hero, ...); photo → photos; and even both: mango → mangos or mangoes (see the sketch after this slide)
• Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower, ...; but: Governor General
19
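To see how quickly “rules plus exceptions” accumulate, here is a toy pluralizer covering just the -o nouns mentioned above; the word lists are illustrative and would keep growing with every new word examined.

# Toy English pluralizer: a few regular rules plus exception lists.
O_TAKES_ES = {"potato", "tomato", "hero"}   # -o nouns taking -es
O_TAKES_S  = {"photo", "piano", "solo"}     # -o nouns taking -s
EITHER     = {"mango", "volcano"}           # both forms attested

def pluralize(noun):
    if noun in EITHER:
        return [noun + "s", noun + "es"]
    if noun in O_TAKES_ES:
        return [noun + "es"]
    if noun in O_TAKES_S:
        return [noun + "s"]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return [noun + "es"]
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return [noun[:-1] + "ies"]
    return [noun + "s"]

print(pluralize("potato"))  # ['potatoes']
print(pluralize("photo"))   # ['photos']
print(pluralize("mango"))   # ['mangos', 'mangoes']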
Difficulties in NLP (cont.)
– ambiguity
• books: NOUN or VERB?
– you need many books vs. she books her flights online
• No left turn weekdays 4-6 pm / except transit vehicles
(Charles Street at Cold Spring)
– when may transit vehicles turn: Always? Never?
• Thank you for not smoking, drinking, eating or playing
radios without earphones. (MTA bus)
– Thank you for not eating without earphones??
– or even: Thank you for not drinking without earphones!?
• My neighbor’s hat was taken by wind. He tried to catch it.
– ...catch the wind or ...catch the hat ?
20
(Categorical) Rules or Statistics?
• Preferences:
– clear cases: context clues: she books → books is a verb
– rule: if an ambiguous word (verb/non-verb) is preceded by a matching personal pronoun → the word is a verb (see the sketch after this slide)
– less clear cases: pronoun reference
• she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
– selectional:
• catching hat >> catching wind (but why not?)
– semantic:
• never thank for drinking in a bus! (but what about the earphones?)
21
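The “clear case” rule above is easy to state in code; the point is everything it leaves uncovered. A minimal sketch (pronoun list and tag set simplified):

PERSONAL_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

def tag_ambiguous_word(prev_word, word):
    """Categorical rule: a verb/non-verb ambiguous word preceded by a
    matching personal pronoun is tagged as a verb."""
    if prev_word.lower() in PERSONAL_PRONOUNS:
        return "VERB"
    return "NOUN"  # default guess: exactly where preferences are needed

print(tag_ambiguous_word("she", "books"))   # VERB ("she books her flights")
print(tag_ambiguous_word("many", "books"))  # NOUN ("you need many books")
# The less clear cases (pronoun reference, "catching hat" vs. "catching
# wind") have no such crisp trigger; that is where statistics come in.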
Solutions
• Don’t guess if you know:
– morphology (inflections)
– lexicons (lists of words)
– unambiguous names
– perhaps some (really) fixed phrases
– syntactic rules?
• Use statistics (based on real-world data) for preferences (only?) (see the sketch after this slide)
• No doubt about the statistics: but whether for preferences only is the big question!
22
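One way to read the slide’s division of labor, sketched with an invented lexicon and invented corpus counts: categorical knowledge answers the unambiguous cases outright, and corpus-derived preferences decide the rest.

# "Don't guess if you know": lexicon first, statistics for the rest.
UNAMBIGUOUS = {"the": "DET", "of": "PREP", "Paris": "NAME"}

# Hypothetical corpus counts of tags observed for ambiguous words.
TAG_COUNTS = {"books": {"NOUN": 320, "VERB": 41},
              "flies": {"VERB": 150, "NOUN": 88}}

def tag(word):
    if word in UNAMBIGUOUS:    # categorical knowledge: no guessing
        return UNAMBIGUOUS[word]
    if word in TAG_COUNTS:     # statistical preference
        return max(TAG_COUNTS[word], key=TAG_COUNTS[word].get)
    return "NOUN"              # fallback for unknown words

print(tag("the"), tag("books"), tag("flies"))  # DET NOUN VERB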
Statistical NLP
• Imagine:
– Each sentence W = { w1, w2, ..., wn } gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
– For every possible context X, sort all the imaginable sentences W according to P(W|X)
– Ideal situation:
(figure: sentences sorted by P(W|X), with the best sentence, i.e. the most probable in context X, at the top and “ungrammatical” sentences at the bottom; NB: the same holds for interpretation)
23
Real World Situation
• We are unable today to specify the set of grammatical sentences using fixed “categorical” rules (and maybe we never will be)
• Use a statistical “model” based on REAL WORLD DATA and care about the best sentence only, disregarding the “grammaticality” issue (see the sketch after this slide)
(figure: sentences sorted by P(W), from Wbest down to Wworst; pick the best sentence)
24
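A minimal sketch of picking the best sentence by P(W), assuming a toy corpus and add-one smoothing (a simple stand-in for the EM-trained smoothing covered later in the course); note that grammaticality is never consulted.

import math
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size, for add-one smoothing

def log_p(sentence):
    """Approximate log P(W) under an add-one-smoothed bigram model
    (initial-word term omitted for brevity)."""
    words = sentence.split()
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
               for a, b in zip(words, words[1:]))

candidates = ["the cat sat on the mat", "mat the on sat cat the"]
print(max(candidates, key=log_p))  # -> "the cat sat on the mat"

Wbest is simply the candidate with the highest score; a real system would enumerate candidates with a decoder rather than score a fixed list.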