The Current State of Language Processing Technology, the Statistical Approach, and Why We Use It
(3 weeks of lectures)
The Dream
• It’d be great if machines could
– Process our email (usefully)
– Translate languages accurately
– Help us manage, summarize, and
aggregate information
– Use speech as a UI (when needed)
– Talk to us / listen to us
• But they can’t:
– Language is complex, ambiguous,
flexible, and subtle
– Good solutions need linguistics
and machine learning knowledge
• So:
What is NLP?
• Fundamental goal: deep understanding of broad language
– Not just string processing or keyword matching!
• End systems that we want to build:
– Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering, trend finding …
– Modest: spelling correction, text categorization…
Speech Systems
• Automatic Speech Recognition (ASR)
– Audio in, text out
– SOTA (word error rate): ~0.3% for digit strings, ~5% for dictation, 50%+ for TV speech
• Text to Speech (TTS)
– Text in, audio out
– SOTA: totally intelligible (if sometimes unnatural)
• Speech systems currently:
– Model the speech signal
– Model language
Machine Translation
• Translation systems encode:
– Something about fluent language
– Something about how two languages correspond (middle of term)
• SOTA: for easy language pairs, better than nothing, but more an
understanding aid than a replacement for human translators
Information Extraction
• Information Extraction (IE)
– Unstructured text to database entries
– SOTA: perhaps 70% accuracy for multi-sentence
templates, 90%+ for single easy fields
Question Answering
• Question Answering:
– More than search
– Ask general comprehension questions of a document collection
– Can be really easy: “What’s the capital of Wyoming?”
– Can be harder: “How many US states’ capitals are also their largest cities?”
– Can be open ended: “What are the main issues in the global warming debate?”
• SOTA: can do factoids, even when text isn’t a perfect match
What is nearby NLP?
• Computational Linguistics
– Using computational methods to learn more
about how language works
– We end up doing this and using it
• Cognitive Science
– Figuring out how the human brain works
– Includes the bits that do language
– Humans: the only working NLP prototype!
• Speech?
– Mapping audio signals to text
– Traditionally separate from NLP, converging?
– Two components: acoustic models and
language models
– Language models in the domain of stat NLP
What is this Class?
• Three aspects to the course:
– Linguistic Issues
• What is the range of language phenomena?
• What are the knowledge sources that let us disambiguate?
• What representations are appropriate?
– Technical Methods
• Learning and parameter estimation
• Increasingly complex model structures
• Efficient algorithms: dynamic programming, search
– Engineering Methods
• Issues of scale
• Sometimes, very ugly hacks
• We’ll focus on what makes the problems hard, and
what works in practice…
Class Requirements and Goals
• Class requirements
– Uses a variety of skills / knowledge:
• Basic probability and statistics
• Basic linguistics background
• Decent coding skills (Java)
– Most people are probably missing one of the above
– We’ll address some review concepts with sections, TBD
• Class goals
– Learn the issues and techniques of statistical NLP
– Build the real tools used in NLP (language models, taggers, parsers,
translation systems)
– Be able to read current research papers in the field
– See where the gaping holes in the field are!
Rationalist versus Empiricist Approaches to Language (I)
• Question: What prior knowledge should be built into
our models of NLP?
• Rationalist Answer: A significant part of the
knowledge in the human mind is not derived by the
senses but is fixed in advance, presumably by genetic
inheritance (Chomsky: poverty of the stimulus).
• Empiricist Answer: The brain is able to perform
association, pattern recognition, and generalization
and, thus, the structures of Natural Language can be
learned.
Rationalist versus Empiricist Approaches to Language (II)
• Chomskyan/generative linguists seek to describe the
language module of the human mind (the I-language),
for which data such as text (the E-language) provide
only indirect evidence, which can be supplemented
by native speakers' intuitions.
• Empiricist approaches are interested in describing
the E-language as it actually occurs.
• Chomskyans make a distinction between linguistic
competence and linguistic performance. They
believe that linguistic competence can be described in
isolation while Empiricists reject this notion.
Empiricist
• Seeks methods that can work on raw
text as it exists
– Knowledge induction (automatic learning),
rather than hand-coded disambiguation
• American structuralism
– The work of Shannon
– Assigns probabilities to linguistic events,
rather than concentrating on categorical
judgments about rare types of sentences
Questions
• What linguistics must answer:
– What do people say?
– What do they say, ask, and demand about the world?
• How do humans learn language, and produce and understand it in real time?
• Sentences that are grammatical but awkward:
– In additions to this, she insisted that women were
regarded as a different existence from men
unfairly.
• (Look up the linguistic debates on your own)
– take a while, sort of/kind of
• I kind of love you. (an older usage)
Some Early NLP History
• 1950’s:
– Foundational work: automata, information theory, etc.
– First speech systems
– Machine translation (MT) hugely funded by military (imagine that)
• Toy models: MT using basically word-substitution
– Optimism!
• 1960’s and 1970’s: NLP Winter
– Bar-Hillel (FAHQT) and ALPAC reports kill MT
– Work shifts to deeper models, syntax
– … but toy domains / grammars (SHRDLU, LUNAR)
• 1980’s: The Empirical Revolution
– Expectations get reset
– Corpus-based methods become central
– Deep analysis often traded for robust and simple approximations
– Evaluate everything
Today’s Approach to NLP
• From ~1970-1989, people were concerned with the
science of the mind and built small (toy) systems that
attempted to behave intelligently.
• Recently, there has been more interest in engineering
practical solutions using automatic learning
(knowledge induction).
• While Chomskyans tend to concentrate on categorical
judgements about very rare types of sentences,
statistical NLP practitioners concentrate on common
types of sentences.
Why is NLP Difficult?
• NLP is difficult because Natural Language is highly
ambiguous.
• Example: “The company is training workers” has 2 or
more parse trees (i.e., syntactic analyses).
• “List the sales of the products produced in 1973 with
the products produced in 1972” has 455 parses.
• Therefore, a practical NLP system must be good at
making disambiguation decisions of word sense,
word category, syntactic structure, and semantic
scope.
Methods that don’t work well
• In symbolic NLP, maximizing coverage and minimizing ambiguity
are conflicting goals: broadening a hand-coded grammar's coverage
tends to increase ambiguity.
• Furthermore, hand-coded syntactic constraints and
preference rules are time consuming to build, do not
scale up well and are brittle in the face of the
extensive use of metaphor in language.
• Example: if we code the selectional restriction
animate being --> swallow --> physical object
it is violated by sentences such as:
– I swallowed his story, hook, line, and sinker.
– The supernova swallowed the planet.
Classical NLP: Parsing
• Write symbolic or logical rules:
• Use deduction systems to prove parses from words
– Minimal grammar on “Fed raises” sentence: 36 parses
– Simple 10-rule grammar: 592 parses
– Real-size grammar: many millions of parses
• This scaled very badly and didn’t yield broad-coverage
tools
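As a minimal sketch (not the grammar the parse counts above refer to), the following NLTK snippet shows how even a tiny hand-written CFG already yields several parses for a “Fed raises”-style sentence; the toy grammar and the sentence are assumptions for illustration.

# Minimal sketch: classical rule-based parsing with a toy CFG.
# The grammar below is an illustrative assumption, not the slide's grammar.
import nltk

toy_grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> N | N N | N NP
VP -> V NP | V NP NP
N  -> 'Fed' | 'raises' | 'interest' | 'rates'
V  -> 'raises' | 'interest' | 'rates'
""")

parser = nltk.ChartParser(toy_grammar)
parses = list(parser.parse(['Fed', 'raises', 'interest', 'rates']))
print(len(parses), "parses")   # several analyses even from this tiny grammar
for tree in parses:
    print(tree)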
NLP: Annotation
• Much of NLP is annotating text with structure
which specifies how it’s assembled.
– Syntax: grammatical structure
– Semantics: “meaning,” either lexical or
compositional
What Made NLP Hard?
• The core problems:
– Ambiguity
– Sparsity
– Scale
– Unmodeled Variables
Problem: Ambiguities
• Headlines:
– Iraqi Head Seeks Arms
– Ban on Nude Dancing on Governor’s Desk
– Juvenile Court to Try Shooting Defendant
– Teacher Strikes Idle Kids
– Stolen Painting Found by Tree
– Kids Make Nutritious Snacks
– Local HS Dropouts Cut in Half
– Hospitals Are Sued by 7 Foot Doctors
• Why are these funny?
Syntactic Ambiguities
• Maybe we’re sunk on funny headlines, but normal,
boring sentences are unambiguous?
• Our company is training workers.
• Fed raises interest rates 0.5 % in a measure against inflation
Dark Ambiguities
• Dark ambiguities: most analyses are shockingly bad
(meaning, they don’t have an interpretation you can
get your mind around)
• Unknown words and new usages
• Solution: we need mechanisms to focus attention on
the best analyses; probabilistic techniques do this
Semantic Ambiguities
• Even correct tree-structured syntactic analyses don’t
always nail down the meaning
Every morning someone’s alarm clock wakes me up
John’s boss said he was doing better
Other Levels of Language
• Tokenization/morphology:
– What are the words, what is the sub-word structure?
– Often simple rules work (period after “Mr.” isn’t sentence break)
– Relatively easy in English, other languages are harder:
• Segmentation
• Morphology
• Discourse: how do sentences relate to each other?
• Pragmatics: what intent is expressed by the literal meaning, how
to react to an utterance?
• Phonetics: acoustics and physical production of sounds
• Phonology: how sounds pattern in a language
Disambiguation for
Applications
• Sometimes life is easy
– Can do text classification pretty well just knowing the set of words
used in the document (see the sketch after this slide); same for authorship attribution
– Word-sense disambiguation not usually needed for web search
because of majority effects or intersection effects (“jaguar habitat”
isn’t the car)
• Sometimes only certain ambiguities are relevant
he hoped to record a world record
• Other times, all levels can be relevant (e.g., translation)
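As a hedged sketch of the first bullet above (classifying a document from the set of words it contains), here is a small scikit-learn example; the toy documents, labels, and the library choice are assumptions for illustration, not part of the course materials.

# Minimal sketch: text classification from the set of words in a document.
# The tiny training set and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "interest rates rise as the Fed tightens policy",
    "the central bank left interest rates unchanged",
    "the striker scored twice in the second half",
    "the home team won the match after extra time",
]
train_labels = ["finance", "finance", "sports", "sports"]

# binary=True: only the set of words matters, not how often they occur
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["the Fed raises interest rates again"]))  # expect: finance
print(model.predict(["a late goal won the match"]))            # expect: sports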
Problem: Scale
• People did know that language was ambiguous!
– …but they hoped that all interpretations would be “good”
ones (or ruled out pragmatically)
– …they didn’t realize how bad it would be
Corpora
• A corpus is a collection of text
– Often annotated in some way
– Sometimes just lots of text
– Balanced vs. uniform corpora
• Examples
– Newswire collections: 500M+ words
– Brown corpus: 1M words of tagged
“balanced” text
– Penn Treebank: 1M words of
parsed WSJ
– Canadian Hansards: 10M+ words
of aligned French / English
sentences
– The Web: billions of words of who
knows what
Corpus-Based Methods
• A corpus like a treebank gives us three important tools:
– It gives us broad coverage
– It gives us statistical information (for example, a very different kind of
subject/object asymmetry than what many linguists are interested in)
– It lets us check our answers!
Problem: Sparsity
• However: sparsity is always a problem
– New unigram (word), bigram (word pair),
and rule rates in newswire
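The point can be checked with a short script like the sketch below; the regex tokenization and the placeholder file name are assumptions, and any large plain-text corpus shows the same pattern of persistently new bigrams.

# Minimal sketch: cumulative fraction of previously unseen unigrams and
# bigrams while reading through a corpus. 'corpus.txt' is a placeholder path.
import re

def new_type_rates(path, step=100_000):
    seen_uni, seen_bi = set(), set()
    new_uni = new_bi = total_uni = total_bi = 0
    prev = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            for tok in re.findall(r"\w+", line.lower()):
                total_uni += 1
                if tok not in seen_uni:
                    new_uni += 1
                    seen_uni.add(tok)
                if prev is not None:
                    total_bi += 1
                    if (prev, tok) not in seen_bi:
                        new_bi += 1
                        seen_bi.add((prev, tok))
                prev = tok
                if total_uni % step == 0:
                    print(f"{total_uni:>10} tokens: "
                          f"{new_uni / total_uni:.3%} unigrams new so far, "
                          f"{new_bi / total_bi:.3%} bigrams new so far")

new_type_rates("corpus.txt")  # placeholder: any large plain-text file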
The (Effective) NLP Cycle
• Pick a problem (usually some disambiguation)
• Get a lot of data (usually a labeled corpus)
• Build the simplest thing that could possibly
work
• Repeat:
– See what the most common errors are
– Figure out what information a human would use
– Modify the system to exploit that information
• Feature engineering
• Representation design
• Machine learning methods
• We’re going to do this over and over again
Language isn’t Adversarial
• One nice thing: we know NLP can be done!
• Language isn’t adversarial:
– It’s produced with the intent of being understood
– With some understanding of language, you can often tell
what knowledge sources are relevant
• But most variables go unmodeled
– Some knowledge sources aren’t easily available (real-world
knowledge, complex models of other people’s plans)
– Some kinds of features are beyond our technical ability to
model (especially cross-sentence correlations)
Looking right does not mean it is right!!
• Epistemological accuracy!!
• 쓰레기통행이다.
• 한국해이다. 한국해이를, 한국해를
• 미국해이다. 미국해이를
• 미국행이다.
(Korean examples of analyses that look right but are not)
What Statistical NLP can do for us
• Disambiguation strategies that rely on hand-coding
produce a knowledge acquisition bottleneck and
perform poorly on naturally occurring text.
• A Statistical NLP approach seeks to solve these
problems by automatically learning lexical and
structural preferences from corpora. In particular,
Statistical NLP recognizes that there is a lot of
information in the relationships between words.
• The use of statistics offers a good solution to the
ambiguity problem: statistical models are robust,
generalize well, and behave gracefully in the presence
of errors and new data
Corpora
• Brown Corpus – 1 million words
• British National Corpus – 100 mil. words
• American National Corpus – 10 mil. words → 100 mil. (planned)
• Penn Treebank – parsed WSJ text
• Canadian Hansard – parallel corpus (bilingual)
Dictionaries:
• Longman Dictionary of Contemporary English
• WordNet (hierarchy of synsets)
Things that can be done with Text Corpora (I)
Word Counts
• Word Counts to find out:
– What are the most common words in the text.
– How many words are in the text (word tokens and word
types).
– What the average frequency of each word in the text is.
• Limitation of word counts: Most words appear very
infrequently and it is hard to predict much about the
behavior of words that do not occur often in a corpus.
==> Zipf’s Law.
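A minimal sketch of these counts; the regex tokenization and the placeholder file name are assumptions for illustration.

# Minimal sketch: word counts over a plain-text corpus.
# 'tom_sawyer.txt' is a placeholder file name.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(tokens)

print("word tokens:", len(tokens))             # total running words
print("word types:", len(counts))              # distinct words
print("most common:", counts.most_common(10))  # typically the, and, a, to, ...
print("average frequency per type:", len(tokens) / len(counts))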
Things that can be done with Text Corpora (II)
Zipf’s Law
• If we count up how often each word type of a
language occurs in a large corpus and then list the
words in order of their frequency of occurrence, we
can explore the relationship between the frequency of
a word, f, and its position in the list, known as its
rank, r.
• Zipf’s Law says that: f ∝ 1/r
• Significance of Zipf’s Law: For most words, our data
about their use will be exceedingly sparse. Only for a
few words will we have a lot of examples.
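Continuing the sketch above, a quick check of Zipf's law is to verify that frequency times rank stays roughly constant; the file name is again a placeholder.

# Minimal sketch: frequency * rank is roughly constant under Zipf's law.
import re
from collections import Counter

with open("tom_sawyer.txt", encoding="utf-8") as f:
    counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    if rank in (1, 10, 100, 1000, 10000):
        print(f"rank {rank:>6}  {word!r:>14}  f*r = {freq * rank}")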
[Tables and figures: Common words in Tom Sawyer; Frequencies of frequencies in Tom Sawyer; Zipf's law in Tom Sawyer; Zipf's law for the Brown corpus; Mandelbrot's formula for the Brown corpus]
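For reference, the formula behind the last figure title, as given in the Manning & Schütze textbook listed at the end of these slides (P, ρ, and B are corpus-dependent parameters), is Mandelbrot's generalization of Zipf's law:

f = P (r + \rho)^{-B}

which reduces to Zipf's law (f ∝ 1/r) when ρ = 0 and B = 1.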
Things that can be done with Text Corpora (III)
Collocations
• A collocation is any turn of phrase or accepted usage
where somehow the whole is perceived as having an
existence beyond the sum of its parts (e.g., disk drive,
make up, bacon and eggs).
• Collocations are important for machine translation.
• Collocations can be extracted from a text (for example,
the most common bigrams can be extracted).
However, since these bigrams are often insignificant
(e.g., “at the”, “of a”), they can be filtered.
Usage Examples
• drive → disk drive, make up
• 사고 → 교통 사고 (traffic accident), 사고 방법 (method of thinking)
• 전화 → 전화번호 (phone number), 전화를 걸다 (to make a phone call)
• 전화를 너무 일찍 걸었다. (I made the call too early.)
• 자주 거짓말을 하는 (who often tells lies)
• 순희는 아주 예쁘지만, 영희는 그렇지 않다. 그리 예쁜 여자가 거짓말을 하다니. (Sunhui is very pretty, but Yeonghui is not. To think such a pretty woman would lie.)
• 개발일정완전재조정 (spacing for reading/writing: 개발 일정 완전 재조정, "complete readjustment of the development schedule")
Order
• bigram: of the, in the, to the, on the ….
New York, he said, as a
• Filtering: adjective+noun, noun+noun
– last year, next year: exceptions
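A minimal sketch of the bigram extraction and part-of-speech filtering just described; the file name is a placeholder, NLTK's default tokenizer and tagger (which require the usual NLTK data packages) are one possible choice, and the tag patterns follow the adjective+noun / noun+noun filter on this slide.

# Minimal sketch: most common bigrams, then keep only candidate
# collocations matching adjective+noun or noun+noun tag patterns.
# 'nyt_sample.txt' is a placeholder; assumes NLTK data (punkt, tagger) is installed.
from collections import Counter
import nltk

with open("nyt_sample.txt", encoding="utf-8") as f:
    tokens = nltk.word_tokenize(f.read())
tagged = nltk.pos_tag(tokens)                    # [(word, tag), ...]

raw = Counter(zip(tokens, tokens[1:]))
print("raw top bigrams:", raw.most_common(10))   # mostly 'of the', 'in the', ...

def keep(t1, t2):
    # adjective+noun or noun+noun (NN, NNS, NNP, ... all start with 'NN')
    return (t1.startswith("JJ") or t1.startswith("NN")) and t2.startswith("NN")

filtered = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if keep(t1, t2)
)
print("filtered top bigrams:", filtered.most_common(10))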
[Tables: Commonest bigrams in the NYT; Filtered common bigrams in the NYT]
Things that can be done with Text Corpora (IV)
Concordances
• Finding concordances corresponds to finding
the different contexts in which a given word
occurs.
• One can use a Key Word In Context (KWIC)
concordancing program.
• Concordances are useful both for building
dictionaries for learners of foreign languages
and for guiding statistical parsers.
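A minimal sketch of a KWIC concordancer; the keyword, the context width, and the placeholder file name are assumptions for illustration.

# Minimal sketch: Key Word In Context (KWIC) display for one keyword.
# 'tom_sawyer.txt' is a placeholder file; width is an arbitrary choice.
import re

def kwic(path, keyword, width=30):
    text = open(path, encoding="utf-8").read()
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        print(f"{left:>{width}} [{m.group(0)}] {right}")

kwic("tom_sawyer.txt", "showed")  # cf. the 'showed' frames on the next slide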
[Figures: KWIC display; Syntactic frames for "showed" in Tom Sawyer]
Why study NLP Statistically?
• Up until the late 1980’s, NLP was mainly investigated
using a rule-based approach.
• However, rules appear too strict to characterize
people’s use of language.
• This is because people tend to stretch and bend
rules in order to meet their communicative needs.
• Methods for making the modeling of language
more accurate are needed and statistical methods
appear to provide the necessary flexibility.
Subdivisions of NLP
• Parts of Speech and Morphology (words, their
syntactic function in sentences, and the various
forms they can take).
• Phrase Structure and Syntax (regularities and
constraints of word order and phrase structure).
• Semantics (the study of the meaning of words
(lexical semantics) and of how word meanings are
combined into the meaning of sentences, etc.)
• Pragmatics (the study of how knowledge about the
world and language conventions interact with literal
meaning).
Topics Covered in this course
Tools and Resources Used
• Probability/Statistical Theory: Statistical
Distributions, Bayesian Decision Theory.
• Linguistics Knowledge: Morphology, Syntax,
Semantics and Pragmatics.
• Corpora: Bodies of marked or unmarked text to
which statistical methods and current linguistic
knowledge can be applied in order to discover
novel linguistic theories or interesting and
useful knowledge organization.
Textbook and other useful information
• Foundations of Statistical Natural Language
Processing, by Chris Manning and Hinrich
Schütze, MIT Press, 1999.
• Course Website: borame.cs.pusan.ac.kr