
Introduction to Natural Language Processing
(aka, Computational Linguistics)
Slides by me, Martha Palmer, Eleni Miltsakaki, Dan Jurafsky, Tarkan Kacmaz, and others
CIS 8590 – Fall 2008
NLP
Overview
• General Methods in NLP (3 weeks)
– Low-level NLP problems and techniques
– Graphical Models for NLP
– Text Mining basics
• Information Retrieval Overview (2 weeks)
• Information Extraction Overview (2 weeks)
• Selected topics in Information Extraction (4 weeks)
Practical Matters
• Prereqs: General understanding of
probability and statistics
• Grading:
• 20% In-class participation, paper presentations
• 30% Course projects
• 50% Midterm (and possibly quizzes)
• Course projects
– Please start finding project partners soon!
– I will supply some ideas for projects later
WHAT IS LANGUAGE?
When we study human language, we
are approaching what some might call
the “human essence”, the distinctive
qualities of mind that are, so far as we
know, unique to man.
Noam Chomsky
WHAT IS LANGUAGE?
• Definition with respect to form:
Language is a system of speech symbols. It is
realized acoustically (sound waves), visually-spatially
(sign language) and in written form.
• Definition with respect to function:
Language is the most important means of human
communication. It is used to convey and exchange
information (informative function).
• Multiplicity of languages:
We know of about 7000 languages, which is
about 1% of all the languages that ever existed.
LANGUAGE AND THE
BRAIN
THEORIES OF LANGUAGE
• Noam Chomsky claims that language is innate.
• B. F. Skinner claims that language is learned; it is
basically a stimulus-response mechanism.
WHAT IS GRAMMAR?
• When we learn a language we also learn the rules that
govern how language elements, such as words, are
combined to produce meaningful language.
• These elements and rules constitute the Grammar of
a language.
• The Grammar is “what we know”
• Grammar represents our linguistic competence.
DESCRIPTIVE vs PRESCRIPTIVE
GRAMMAR
Prescriptive
(should be)
Descriptive
(is)
Areas of Linguistics
• phonetics - the study of speech sounds
• phonology - the study of sound systems
• morphology - the rules of word formation
• syntax - the rules of sentence formation
• semantics - the study of word meanings
• pragmatics - the study of discourse meanings
• sociolinguistics - the study of language in society
• applied linguistics - the application of the methods and
results of linguistics to such areas as language teaching,
national language policies, lexicography, translation,
language in politics etc.
What is phonetics?
• Phonetics is the science of speech.
• We all speak.
• But how many of us know how we speak?
• Or what speech is like?
• Phonetics seeks to answer those questions.
Orthography and Sounds
• English spelling is not phonetic.
• Words are not always spelled as they are
pronounced.
• There is no one-to-one correspondence
between the letters and the sounds or
phonemes.
Orthography and Sounds
• Did he believe that Caesar could see
the people seize the seas?
• The silly amoeba stole the key to the
machine.
Articulatory Phonetics
• The production of any speech sound
involves the movement of an air stream.
• Most speech sounds are produced by
pushing the air out of the lungs through
the mouth (oral) and sometimes through
the nose (nasal).
SPEECH ORGANS
Phonology
• Phonology deals with the system and
pattern of speech sounds in a language.
• The phonology of a language is its
system and pattern of speech sounds.
Phonology
Phonological knowledge permits us to:
• to produce sounds which form meaningful utterances,
• to recognize a “foreign” accent,
• to make up new words,
• to know what is or is not a sound in one’s language,
• to know what different sound strings may represent.
Phonetics vs Phonology
Phonetics: the study of speech sounds.
Phonology: the study of the way speech sounds form patterns.
Sequences of Phonemes
Phonemes: k, b, l, ı
Possible: blık, klıb, bılk, kılb
Impossible: lbkı, ılbk, bkıl, ıblk
• “I just bought a beautiful new blick.” What is a blick?
• “I just bought a beautiful new bkli.” WHAT!!
Sequences of Phonemes
• Your knowledge of English “tells” you that
certain strings of phonemes are permissible
and others are not.
• That’s why /bkli/ does not sound like an
English word.
• It violates the restrictions on the sequencing of
phonemes; i.e. it violates the phonological
rules of English.
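A minimal sketch of how such sequencing restrictions could be checked mechanically. The cluster lists below are hypothetical and heavily abridged (they cover only the four phonemes from the blick example), and the function name is an assumption made for illustration.

```python
VOWELS = set("aeiouı")

# Hypothetical, heavily abridged lists of consonant clusters that English
# allows before (onset) and after (coda) the vowel of a syllable.
ALLOWED_ONSETS = {"", "b", "k", "l", "bl", "kl"}
ALLOWED_CODAS = {"", "b", "k", "l", "lb", "lk"}


def is_possible_syllable(phonemes: str) -> bool:
    """Check a one-syllable phoneme string against the toy cluster lists."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if len(vowel_positions) != 1:
        return False  # this sketch only handles a single vowel
    v = vowel_positions[0]
    onset, coda = phonemes[:v], phonemes[v + 1:]
    return onset in ALLOWED_ONSETS and coda in ALLOWED_CODAS


for candidate in ["blık", "klıb", "bılk", "kılb", "lbkı", "ılbk", "bkıl", "ıblk"]:
    print(candidate, is_possible_syllable(candidate))
```

With these lists, the four "possible" strings pass and the four "impossible" ones fail, because each of the latter contains a cluster such as bk or lbk that is not listed.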
Rules of Phonology
• Delete a word-final /b/ when it occurs
after an /m/
as in: bomb, crumb, lamb, tomb
but not: bombard, crumble, limber, tumble
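The rule above can be sketched as a small function. This is only an illustration: it works on spelling as a rough stand-in for phonemes, and the function name is an assumption.

```python
def delete_final_b(word: str) -> str:
    """Drop a word-final 'b' that directly follows 'm'.

    Spelling is used as a rough stand-in for the phoneme sequence.
    """
    if word.endswith("mb"):
        return word[:-1]
    return word


for w in ["bomb", "crumb", "lamb", "tomb", "bombard", "crumble", "limber", "tumble"]:
    print(w, "->", delete_final_b(w))
```

The /b/ in bombard, crumble, limber and tumble survives because it is not word-final, so the condition of the rule is not met.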
Morphology & Syntax
• Morphology deals with the combination
of morphemes into words.
• Syntax deals with the combination of
words into sentences.
What is the meaning of
‘meaning’?
• Learning a language includes learning
the “agreed upon” meanings of certain
strings of sounds and,
• Learning how to combine these
meaningful units into larger units which
also convey meaning.
Morphemes
• A morpheme is the smallest linguistic unit
that has meaning.
• A morpheme is a grammatical unit in
which there is an arbitrary union of
a sound and a meaning, and
• which cannot be further analysed.
Morphemes
• A morpheme may be represented by a
single sound:
• e.g. the plural morpheme [s] in cat+s
• A morpheme may be represented by a
syllable (monosyllabic):
• e.g. child+ish
Morphemes
A morpheme may be represented by
more than one syllable (polysyllabic):
• e.g. lady, water
or three syllables:
• e.g. crocodile
or four syllables:
• e.g. salamander
Words
• Two basic ways to form words
– Inflectional (e.g. English verbs)
• Open + ed = opened
• Open + ing = opening
– Derivational (e.g. adverbs from adjectives, nouns
from adjectives)
• Happy → happily
• Happy → happiness (nouns from adjectives)
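The two ways of forming words can be sketched as string operations. This is a toy illustration under stated assumptions: the function names are made up, and only one spelling rule (final y becomes i) is handled. Real English morphology needs many more rules (for example, stop + ed = stopped).

```python
def inflect(verb: str, suffix: str) -> str:
    """Inflection by plain concatenation, as in open + ed."""
    return verb + suffix


def derive(adjective: str, suffix: str) -> str:
    """Derivation with one spelling rule: a final 'y' becomes 'i' before the suffix."""
    stem = adjective[:-1] + "i" if adjective.endswith("y") else adjective
    return stem + suffix


print(inflect("open", "ed"), inflect("open", "ing"))
print(derive("happy", "ly"), derive("happy", "ness"))
```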
Syntax
The study of classes of words
and the rules that govern how the words can
combine to make phrases and sentences.
Basic classes of words
• Classes of words aka parts of speech (POS)
– Nouns
– Verbs
– Adjectives
– Adverbs
• The classes above are open class words
• We also have closed class words
– Articles, pronouns, prepositions, particles, quantifiers,
conjunctions
Basic phrases
• A word from an open class can be used to
form the basis of a phrase
• The basis of a phrase is called the head
Examples of phrases
• Noun phrases
– The manager of the institute
– Her worry to pass the exams
– Several students from the English Department
• Adjective phrases
– easy to understand
– mad as a dog
– glad that he passed the exam
Examples of phrases
• Adverb phrases
– fast like the wind
– outside the building
• Verb phrases
– ate her sandwich
– went to the doctor
– believed what I told him
“Complements”
• Notice that, to be meaningful, the verb “go”, for
example, requires a phrase for “location”
– *John went
– John went home
• Such phrases “complete” the meaning of the
verb (or other type of head) and are called
complements
Inside the noun phrase
• NPs are used to refer to things: objects, places,
concepts, events, qualities, etc
• NPs may consist of:
– A single pronoun (he, she, etc)
– A name or proper noun (John, Athens, etc)
– A specifier and a noun
– A qualifier and a noun
– A specifier and a qualifier and a noun (e.g., the first
three winners)
Specifiers
• Specifiers indicate how many objects are
described and also how these objects
relate to the speaker
• Basic types of specifiers
– Ordinals (e.g., first, second)
– Cardinals (e.g., one, two)
– Determiners (see next slide)
Determiners
• Basic types of determiners
– Articles (the, a, an)
– Demonstratives (this, that, these, those)
– Possessives (‘s, her, my, whose, etc)
– Wh-determiners (which, what –in questions)
– Quantifying determiners (some, every, most,
no, any, etc.)
Qualifiers
• Basic types of qualifiers
– Adjectives
• Happy cat
• Angry feelings
– Noun modifiers
• Cook book
• University hospitals
Inside the verb phrase
• A simple VP
– Adverbial modifier + head verb +
complements
• Types of verbs
– Auxiliary (be, do, have)
– Modal (will, can, could)
– Main (eat, work, think)
Types of verb complements
• Intransitive verbs do not require complements
• Transitive verbs require an object as a complement (e.g.
find a key)
• Transitive verbs allow passive forms (e.g. a key was
found)
• Ditransitive verbs require one direct and one indirect
object (e.g. give Mary a book)
Other verb complements
• Clausal complements
– Some verbs require clausal complements
• Mary knows that John left
• Prepositional phrase complements
– Some verbs require specific PP complements
• Mary gave the book to John
– Others require any PP complement
• John put the book on the shelf/in the room/under the table
Adjective phrases
• Simple
– Angry, easy, etc
• Complex
– Pleased with the prize
– Angry at the committee
– Willing to read the book
• Complex AdjPs normally do not precede nouns; they are
used as complements of verbs such as be or seem
Adverbial phrases
• Indicators of
– Degree
– Location
– Manner
– The time of something (now, yesterday, etc)
– Frequency
– Duration
• Location in the sentence
– Initial
– Medial
– Final
Grammars and parsing
• What is syntactic parsing
– Determining the syntactic structure of a
sentence
• Basic steps
– Identify sentence boundaries
– Identify the part of speech of each word
– Identify syntactic relations
Context Free Grammar
• S -> NP VP
• NP -> det (adj) N
• NP -> Proper N
• NP -> N
• VP -> V, VP -> V PP
• VP -> V NP
• VP -> V NP PP, PP -> Prep NP
• VP -> V NP NP
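One possible way to hold the grammar above in a program is a dictionary from each nonterminal to its right-hand sides. The sketch below assumes that representation (the names GRAMMAR, expansions and TAG_SEQUENCES are made up for illustration), writes the optional adjective as two separate NP rules, and enumerates every part-of-speech sequence the grammar can derive. Exhaustive enumeration only terminates because this particular grammar has no recursion.

```python
from itertools import product

# Nonterminal -> list of right-hand sides. The optional (adj) is expanded
# into two NP rules; tag names such as "Det" and "ProperN" are assumptions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"], ["ProperN"], ["N"]],
    "VP": [["V"], ["V", "PP"], ["V", "NP"], ["V", "NP", "PP"], ["V", "NP", "NP"]],
    "PP": [["Prep", "NP"]],
}


def expansions(symbol):
    """Yield every part-of-speech sequence derivable from `symbol`."""
    if symbol not in GRAMMAR:  # a part-of-speech tag: nothing left to expand
        yield (symbol,)
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(list(expansions(s)) for s in rhs)):
            yield tuple(tag for part in parts for tag in part)


TAG_SEQUENCES = set(expansions("S"))

# "The cat sat on the mat" has the tag sequence Det N V Prep Det N.
print(("Det", "N", "V", "Prep", "Det", "N") in TAG_SEQUENCES)
# A sequence that starts with a verb has no subject NP, so it is not derivable.
print(("V", "Det", "N") in TAG_SEQUENCES)
```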
LING 2000 - 2006
Parses
The cat sat on the mat
(S
  (NP (Det the) (N cat))
  (VP (V sat)
      (PP (Prep on)
          (NP (Det the) (N mat)))))
Parses
Time flies like an arrow.
(S
  (NP (N time))
  (VP (V flies)
      (PP (Prep like)
          (NP (Det an) (N arrow)))))
Parses
Time flies like an arrow.
(S
  (NP (N time) (N flies))
  (VP (V like)
      (NP (Det an) (N arrow))))
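The two parses of "Time flies like an arrow" can be written down as nested tuples, which makes the ambiguity easy to see: the trees differ, but their leaves are the same words. The tuple encoding and the names below are assumptions made for this sketch.

```python
# A tree is (label, child, child, ...); a leaf is a plain word string.
PARSE_VERB_FLIES = (
    "S",
    ("NP", ("N", "time")),
    ("VP", ("V", "flies"),
           ("PP", ("Prep", "like"), ("NP", ("Det", "an"), ("N", "arrow")))),
)
PARSE_VERB_LIKE = (
    "S",
    ("NP", ("N", "time"), ("N", "flies")),
    ("VP", ("V", "like"), ("NP", ("Det", "an"), ("N", "arrow"))),
)


def leaves(tree):
    """Return the words at the leaves of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the label
        words.extend(leaves(child))
    return words


print(leaves(PARSE_VERB_FLIES) == leaves(PARSE_VERB_LIKE))  # same sentence
print(PARSE_VERB_FLIES == PARSE_VERB_LIKE)                  # different structure
```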
Features
• C for Case, Subjective/Objective
– She visited her.
• P for Person agreement (1st, 2nd, 3rd)
– I like him, You like him, He likes him.
• N for Number agreement, Subject/Verb
– He likes him, They like him.
• G for Gender agreement, Subject/Verb
– English, reflexive pronouns: He washed himself.
– Romance languages, det/noun
• T for Tense
– auxiliaries, sentential complements, etc.
– *will finished is bad
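A rough sketch of how person and number features could be used to check subject/verb agreement. The feature names, the tiny lexicon and the function name are all assumptions for illustration; a real grammar would attach such features to its rules.

```python
# Hypothetical feature bundles for a few subject pronouns.
PRONOUNS = {
    "I": {"person": 1, "number": "sg"},
    "you": {"person": 2, "number": "sg"},
    "he": {"person": 3, "number": "sg"},
    "they": {"person": 3, "number": "pl"},
}


def is_third_singular(features) -> bool:
    return features["person"] == 3 and features["number"] == "sg"


# Each verb form states which subject features it accepts.
VERB_FORMS = {
    "likes": is_third_singular,
    "like": lambda features: not is_third_singular(features),
}


def agrees(subject: str, verb: str) -> bool:
    """True if the verb form is compatible with the subject's features."""
    return VERB_FORMS[verb](PRONOUNS[subject])


for subject, verb in [("I", "like"), ("he", "likes"), ("they", "like"), ("they", "likes")]:
    print(subject, verb, agrees(subject, verb))
```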
Probabilistic Context Free
Grammars
• Adding probabilities
• Often, lexicalizing the probabilities
A PCFG
• S -> NP VP (0.5)
• S -> ADVP NP VP (0.5)
• NP -> det (adj) N (0.7)
• NP -> Proper N (0.15)
• NP -> N (0.15)
• VP -> V (0.1); VP -> V PP (0.1)
• VP -> V NP (0.4); VP -> V NP PP (0.4)
• PP -> Prep NP (1)
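A small sketch of what the probabilities are for: the rules for each left-hand side sum to 1, and a parse tree is scored by multiplying the probabilities of the rules it uses. The rule probabilities are copied from the list above. Everything else is an assumption of the sketch: the tuple encoding of trees, the function names, leaving out the optional adjective, and treating word-level rules as probability 1 (the slide gives no lexical probabilities).

```python
import math
from collections import defaultdict

PCFG = {
    ("S", ("NP", "VP")): 0.5,
    ("S", ("ADVP", "NP", "VP")): 0.5,
    ("NP", ("Det", "N")): 0.7,  # optional (adj) left out in this sketch
    ("NP", ("ProperN",)): 0.15,
    ("NP", ("N",)): 0.15,
    ("VP", ("V",)): 0.1,
    ("VP", ("V", "PP")): 0.1,
    ("VP", ("V", "NP")): 0.4,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("PP", ("Prep", "NP")): 1.0,
}


def is_normalized(pcfg) -> bool:
    """Check that the rule probabilities for each left-hand side sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in pcfg.items():
        totals[lhs] += p
    return all(math.isclose(total, 1.0) for total in totals.values())


def tree_probability(tree, pcfg) -> float:
    """Multiply the probabilities of the phrase-level rules used in a tree.

    Trees are (label, child, ...) tuples; (tag, word) pairs count as 1.0.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # word-level rule, no probability given on the slide
    rhs = tuple(child[0] for child in children)
    p = pcfg[(label, rhs)]
    for child in children:
        p *= tree_probability(child, pcfg)
    return p


CAT_TREE = (
    "S",
    ("NP", ("Det", "the"), ("N", "cat")),
    ("VP", ("V", "sat"),
           ("PP", ("Prep", "on"), ("NP", ("Det", "the"), ("N", "mat")))),
)

print(is_normalized(PCFG))
print(tree_probability(CAT_TREE, PCFG))  # 0.5 * 0.7 * 0.1 * 1 * 0.7
```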
A Lexicalized PCFG
Sample rules:
• S_give -> NP VP_give (1.0)
• NP_friend -> det (adj) N_friend (1.0)
• NP_Sally -> ProperN_Sally (1.0)
• VP_give -> V_give NP NP (0.3)
• VP_give -> V_give NP PP_to (0.7)
Parsing
Computational task:
Given a set of grammar rules and a sentence, find
a valid parse of the sentence (efficiently)
Naively, you could try all possible combinations of
rules until you get to a parse tree that has “S” at
the root, and the right words at the leaves.
But that takes exponential time in the number of words.
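To get a feel for the blow-up, one can count the distinct binary tree shapes over n words, which is the Catalan number C(n-1). This count ignores the nonterminal labels, so it is only a rough illustration of why trying every combination is hopeless; the function name is an assumption.

```python
from math import comb


def binary_bracketings(n_words: int) -> int:
    """Number of distinct binary tree shapes over n_words leaves (Catalan number C_{n-1})."""
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)


for n_words in (2, 3, 5, 10, 20):
    print(n_words, binary_bracketings(n_words))
```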
CKY Parsing (aka, CYK)
• CKY parsing is a dynamic programming solution
• I bring it up now because dynamic programming
shows up all the time in NLP
Dynamic programming: simplifying a complicated
problem by breaking it down into simpler
subproblems in a recursive manner
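As a warm-up outside of parsing, here is a sketch of a classic dynamic program, edit distance between two strings. The choice of example is mine, not from the slides; it is included only because it shows the pattern in a few lines: fill a table of subproblem answers from small to large, reusing earlier entries.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(a)][len(b)]


print(edit_distance("kitten", "sitting"))
```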
CKY – Basic Idea
Let the input be a string S consisting of n words: a1 ... an.
Let the grammar contain r nonterminal symbols R1 ... Rr, with a subset Rs
of start symbols. The grammar is assumed to be in Chomsky normal form
(every rule is either RA -> RB RC or RA -> word).
Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
At each step, the algorithm sets P[i,j,k] to true if the span of
words starting at position i with length j can be generated from Rk.
We start with spans of length 1 (individual words), and then
proceed to increasingly larger spans, determining which ones
are valid given the smaller spans that have already been processed.
CKY Algorithm
For each i = 1 to n
    For each production Rj -> ai
        set P[i,1,j] = true
For each i = 2 to n                      -- Length of span
    For each j = 1 to n-i+1              -- Start of span
        For each k = 1 to i-1            -- Partition of span
            For each production RA -> RB RC
                If P[j,k,B] and P[j+k,i-k,C]
                    then set P[j,i,A] = true
If any of P[1,n,x] is true (x ranges over the indices of the start symbols Rs)
    Then S is a member of the language
    Else S is not a member of the language
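A minimal Python sketch of the same recognizer. It departs from the pseudocode in a few ways that are my own choices: positions are 0-based, each table cell holds a set of nonterminals instead of a row of booleans, and the grammar is passed in as two sets of tuples. The toy grammar and all names are assumptions for illustration.

```python
def cky_recognize(words, lexical_rules, binary_rules, start_symbols):
    """Return True if `words` can be derived from one of the start symbols.

    lexical_rules: set of (A, word) pairs for rules A -> word
    binary_rules:  set of (A, B, C) triples for rules A -> B C
    """
    n = len(words)
    # table[(start, length)] = set of nonterminals that generate that span
    table = {}
    for start, word in enumerate(words):
        table[(start, 1)] = {a for (a, w) in lexical_rules if w == word}
    for length in range(2, n + 1):              # length of span
        for start in range(n - length + 1):     # start of span
            cell = set()
            for split in range(1, length):      # partition of span
                left = table[(start, split)]
                right = table[(start + split, length - split)]
                for (a, b, c) in binary_rules:
                    if b in left and c in right:
                        cell.add(a)
            table[(start, length)] = cell
    return n > 0 and bool(table[(0, n)] & set(start_symbols))


# Toy grammar in Chomsky normal form, just large enough for one sentence.
LEXICAL = {("Det", "the"), ("N", "cat"), ("N", "mat"), ("V", "sat"), ("Prep", "on")}
BINARY = {("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "PP"), ("PP", "Prep", "NP")}

print(cky_recognize("the cat sat on the mat".split(), LEXICAL, BINARY, {"S"}))
print(cky_recognize("cat the sat".split(), LEXICAL, BINARY, {"S"}))
```

The three nested loops over span length, start and split point give a running time cubic in the number of words (times the number of rules), in contrast to the exponential cost of trying every combination.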
CKY In Action
http://homepages.unituebingen.de/student/martin.lazarov/demos/cky.html
Finding the Best Parse
• With a PCFG (or lexicalized PCFG), it’s
possible to score the trees to find the best
(highest probability) parse
• Instead of a boolean array P, you would
need to store weights (or probabilities) in
the array; for the rest, the algorithm is
almost identical.
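A sketch of that change, assuming a grammar in Chomsky normal form: each table cell maps a nonterminal to the highest probability found so far for its span. All probabilities in the toy grammar are made up for illustration (they are not from the slides), the rule NP -> N -> time is collapsed into NP -> time to stay in normal form, and the function name is an assumption.

```python
def best_parse_probability(words, lexical_probs, binary_probs, start="S"):
    """Return the probability of the most likely parse, or 0.0 if none exists.

    lexical_probs: {(A, word): p} for rules A -> word
    binary_probs:  {(A, B, C): p} for rules A -> B C
    """
    n = len(words)
    if n == 0:
        return 0.0
    best = {}  # best[(start, length)] = {nonterminal: highest probability}
    for i, word in enumerate(words):
        best[(i, 1)] = {a: p for (a, w), p in lexical_probs.items() if w == word}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            cell = {}
            for split in range(1, length):
                left = best[(i, split)]
                right = best[(i + split, length - split)]
                for (a, b, c), p in binary_probs.items():
                    if b in left and c in right:
                        score = p * left[b] * right[c]
                        if score > cell.get(a, 0.0):
                            cell[a] = score  # keep only the best score per symbol
            best[(i, length)] = cell
    return best[(0, n)].get(start, 0.0)


# Hypothetical probabilities, chosen only so that each left-hand side sums to 1.
LEXICAL_PROBS = {
    ("NP", "time"): 0.3,
    ("N", "time"): 0.4, ("N", "flies"): 0.2, ("N", "arrow"): 0.4,
    ("V", "flies"): 0.7, ("V", "like"): 0.3,
    ("Prep", "like"): 1.0,
    ("Det", "an"): 1.0,
}
BINARY_PROBS = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "N", "N"): 0.2,
    ("NP", "Det", "N"): 0.5,
    ("VP", "V", "PP"): 0.4,
    ("VP", "V", "NP"): 0.6,
    ("PP", "Prep", "NP"): 1.0,
}

print(best_parse_probability("time flies like an arrow".split(), LEXICAL_PROBS, BINARY_PROBS))
```

Under these made-up numbers the reading with "flies" as the verb scores higher than the reading with "like" as the verb, so its probability is the one returned. Recovering the tree itself would additionally require storing back-pointers in each cell.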