Lecture 13: Probabilistic CFGs
(Chapter 11 of Manning and Schutze)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
National Cheng Kung University
2004/12/15
(Slides from Dr. Mary P. Harper,
http://min.ecn.purdue.edu/~ee669/)
Motivation
• N-gram models and HMM tagging only allow us to process
sentences linearly; however, the structural analysis of simple
sentences requires a nonlinear model that reflects the
hierarchical structure of sentences rather than the linear order
of words.
• Context-free grammars can be generalized to include
probabilistic information by adding rule use statistics.
• Probabilistic Context Free Grammars (PCFGs) are the
simplest and most natural probabilistic model for tree
structures and the algorithms for them are closely related to
those for HMMs.
• Note, however, that there are other ways of building
probabilistic models of syntactic structure (see Chapter 12).
CFG Parse Example
• #1 S → NP VP
• #2 VP → V NP PP
• #3 VP → V NP
• #4 NP → N
• #5 NP → N PP
• #6 PP → PREP N
• #7 N → a_dog
• #8 N → a_cat
• #9 N → a_telescope
• #10 V → saw
• #11 PREP → with
[Figure: two parse trees for "a_dog saw a_cat with a_telescope": one attaches the PP "with a_telescope" to the VP (rules #2 and #4), the other attaches it to the object NP (rules #3 and #5).]
Dependency Parse Example
• Same example, dependency representation
[Figure: two dependency trees rooted at "saw". In both, a_dog is the subject (Sb) and a_cat the object (Obj); in one, "with a_telescope" depends on "saw" as Adv_Tool, in the other it depends on "a_cat" as Attr.]
PCFG Notation
• G is a PCFG
• L is a language generated by G
• {N1, …, Nn} is a set of nonterminals for G; there
are n nonterminals
• {w1, …, wV} is the set of terminals; there are V
terminals
• w1…wm is a sequence of words constituting a
sentence to be parsed, also denoted as w1m
• N^j_pq: nonterminal N^j spans w_p through w_q, or in
other words N^j ⇒* w_p w_(p+1) … w_q
Formal Definition of a PCFG
• A PCFG consists of:
– A set of terminals, {wk}, k= 1,…,V
– A set of nonterminals, Ni, i= 1,…, n
– A designated start symbol N1
– A set of rules, {N^i → ζ^j} (where ζ^j is a
sequence of terminals and nonterminals)
– A corresponding set of probabilities on rules
such that: ∀i Σ_j P(N^i → ζ^j) = 1
Probability of a Derivation Tree and
a String
• The probability of a derivation (i.e. parse) tree:
P(T) = Π_{i=1..k} p(r(i))
where r(1), …, r(k) are the rules of the CFG used
to generate the sentence w1m of which T is a parse.
• The probability of a sentence (according to
grammar G) is given by:
P(w_1m) = Σ_t P(w_1m, t) = Σ_{t: yield(t)=w_1m} P(t)
where t is a parse tree of the sentence. Need
dynamic programming to make this efficient!
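As a quick illustration (a minimal Python sketch, not part of the original slides), P(T) is simply the product of the probabilities of the rules used in the tree. The numbers below are the rule probabilities given later on the "Example PCFG" slide, for the VP-attachment parse of "a_dog saw a_cat with a_telescope":

# Minimal sketch: P(T) as a product of rule probabilities.
from math import prod

rules_used = [
    ("S -> NP VP", 1.0), ("NP -> N", 0.7), ("N -> a_dog", 0.5),
    ("VP -> V NP PP", 0.4), ("V -> saw", 1.0), ("NP -> N", 0.7),
    ("N -> a_cat", 0.3), ("PP -> PREP N", 1.0), ("PREP -> with", 1.0),
    ("N -> a_telescope", 0.2),
]
print(prod(p for _, p in rules_used))   # ~0.00588, one of the two parses of the sentence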
Example: Probability of a Derivation
Tree
The PCFG
• Below is a probabilistic CFG (PCFG) with
probabilities derived from analyzing a parsed
version of Allen's corpus.
Rule               Count for LHS   Count for Rule   PROB
1. S → NP VP            300             300           1
2. VP → V               300             116          .386
3. VP → V NP            300             118          .393
4. VP → V NP PP         300              66          .22
5. NP → NP PP          1032             241          .23
6. NP → N N            1032              92          .09
7. NP → N              1032             141          .14
8. NP → ART N          1032             558          .54
9. PP → P NP            307             307           1
Assumptions of the PCFG Model
• Place Invariance: The probability of a subtree does
not depend on where in the string the words it
dominates are (like time invariance in an HMM)
∀k: P(N^j_{k(k+c)} → ζ) is the same
• Context Free: The probability of a subtree does
not depend on words not dominated by that
subtree.
P(N^j_{kl} → ζ | anything outside k through l) = P(N^j_{kl} → ζ)
• Ancestor Free: The probability of a subtree does
not depend on nodes in the derivation outside the
subtree.
P(N^j_{kl} → ζ | any ancestor nodes not in N^j_{kl}) = P(N^j_{kl} → ζ)
Probability of a Rule
• Rule r: N^j → ζ, where ζ ∈ (N ∪ W)+
• Let R_Nj be the set of all rules that have nonterminal
N^j on the left-hand side;
• Then define a probability distribution on R_Nj:
Σ_{r∈R_Nj} p(r) = 1,  0 ≤ p(r) ≤ 1
• Another point of view:
p(ζ | N^j) = p(r), where r is N^j → ζ
Estimating Probability of a Rule
• MLE from a treebank produced using a CFG grammar
• Let r: N^j → ζ, then
– p(r) = c(r) / c(N^j)
– c(r): how many times r appears in the treebank
– c(N^j) = Σ_ζ c(N^j → ζ)
– p(r) = c(N^j → ζ) / Σ_ζ c(N^j → ζ)
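A minimal sketch of this MLE computation, using a tiny hypothetical treebank in which each tree is recorded simply as the list of rules used in it:

# Count each rule, then divide by the total count of rules with the same LHS.
from collections import Counter

treebank = [
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP")), ("NP", ("ART", "N"))],
    [("S", ("NP", "VP")), ("NP", ("ART", "N")), ("VP", ("V",))],
]

rule_count = Counter(rule for tree in treebank for rule in tree)
lhs_count = Counter(lhs for tree in treebank for lhs, _ in tree)
prob = {rule: count / lhs_count[rule[0]] for rule, count in rule_count.items()}

print(prob[("NP", ("ART", "N"))])   # 2 of the 3 NP expansions, i.e. ~0.67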
Example PCFG
Rule               Count for LHS   Count for Rule   PROB
1. S → NP VP            300             300           1
2. VP → V               300             116          .386
3. VP → V NP            300             118          .393
4. VP → V NP PP         300              66          .22
5. NP → NP PP          1032             241          .23
6. NP → N N            1032              92          .09
7. NP → N              1032             141          .14
8. NP → ART N          1032             558          .54
9. PP → P NP            307             307           1
Some Features of PCFGs
• A PCFG gives some idea of the plausibility of different parses;
however, the probabilities are based on structural factors and
not lexical ones.
• PCFGs are good for grammar induction.
• PCFGs are robust.
• PCFGs give a probabilistic language model for English.
• In principle, the predictive power of a PCFG should be greater than that of an
HMM; in practice, however, it is worse.
• PCFGs are not good models alone but they can be combined
with a trigram model.
• PCFGs have certain biases which may not be appropriate;
smaller trees are preferred. (why?)
• Estimates of PCFG parameters from a corpus yield a proper
probability distribution (Chi and Geman, 1998).
Recall the Questions for HMMs
• Given a model μ = (A, B, π), how do we efficiently
compute how likely a certain observation is, that is,
P(O|μ)?
• Given the observation sequence O and a model μ,
how do we choose a state sequence (X1, …, XT+1)
that best explains the observations?
• Given an observation sequence O, and a space of
possible models found by varying the model
parameters μ = (A, B, π), how do we find the model
that best explains the observed data?
Questions for PCFGs
• There are also three basic questions we wish to
answer for PCFGs:
– What is the probability of a sentence w1m according to a
grammar G: P(w1m|G)?
– What is the most likely parse for a sentence:
argmaxt P(t|w1m,G)?
– How can we choose rule probabilities for the grammar
G that maximize the probability of a sentence:
argmaxG P(w1m|G) ?
Restriction
• For this presentation, we only consider the case of
Chomsky Normal Form (CNF) Grammars, which only
have unary and binary rules of the form:
• N^i → N^j N^k   // two nonterminals on the RHS
• N^i → w^j       // one terminal on the RHS
• The parameters of a PCFG in Chomsky Normal Form are:
• P(N^j → N^r N^s | G)   // an n^3 matrix of parameters
• P(N^j → w^k | G)       // n×V parameters
(n is the number of nonterminals and V is the number
of terminals)
• For j = 1,…,n:  Σ_{r,s} P(N^j → N^r N^s) + Σ_k P(N^j → w^k) = 1
• Any CFG can be represented by a weakly equivalent CNF
form.
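A small sketch of how these CNF parameters can be stored and checked; the toy grammar here is hypothetical (the running example grammar of these slides is not in CNF):

# Store binary and lexical rule probabilities per nonterminal and verify that
# they sum to 1 for each left-hand side, as required above.
binary = {                      # P(N^j -> N^r N^s)
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DET", "N"): 0.8, ("NP", "PP"): 0.2},
    "PP": {("P", "NP"): 1.0},
}
lexical = {                     # P(N^j -> w^k)
    "DET": {"the": 0.6, "a": 0.4},
    "N":   {"dog": 0.5, "telescope": 0.5},
    "P":   {"with": 1.0},
    "VP":  {"barked": 1.0},     # VP kept purely lexical to stay tiny
}

for nt in set(binary) | set(lexical):
    total = sum(binary.get(nt, {}).values()) + sum(lexical.get(nt, {}).values())
    assert abs(total - 1.0) < 1e-9, nt
print("all nonterminals sum to 1")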
HMMs and PCFGs
• An HMM is able to efficiently do calculations using
forward and backward probabilities.
– α_i(t) = P(w_1(t-1), X_t = i)
– β_i(t) = P(w_tT | X_t = i)
• The forward probability corresponds in a parse tree to
everything above and including a certain node (outside),
while the backward probability corresponds to the
probability of everything below a certain node (inside).
• For PCFGs, the Outside (α_j) and Inside (β_j) Probabilities
are defined as:
α_j(p,q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G)
β_j(p,q) = P(w_pq | N^j_pq, G)
Inside and Outside Probabilities of
PCFGs
[Figure: a parse tree rooted at N^1 over w_1 … w_m, with a nonterminal N^j spanning w_p … w_q; the inside probability β covers the subtree below N^j, and the outside probability α covers everything outside that subtree.]
Inside and Outside Probabilities
• The inside probability β_j(p,q) is the total
probability of generating w_pq starting off with
nonterminal N^j.
• The outside probability α_j(p,q) is the total
probability of beginning with nonterminal N^1 and
generating N^j_pq and all words outside of w_pq.
From HMMs to Probabilistic
Regular Grammars (PRG)
• A PRG has start state N1 and rules of the form:
– N^i → w^j N^k
– N^i → w^j
• This is like an HMM except that for an HMM:
∀n: Σ_{w_1n} P(w_1n) = 1
whereas, for a PCFG:
Σ_{w∈L} P(w) = 1
where L is the language generated by the grammar.
• PRGs are related to HMMs in that a PRG is an HMM to
which a start state and a finish (or sink) state is added.
Inside Probability
• β_j(p,q) = P(N^j ⇒* w_pq)
[Figure: nonterminal N^j dominating the words w_p … w_q.]
The Probability of a String: Using
Inside Probabilities
• We can calculate the probability of a string using
the inside algorithm, a dynamic programming
algorithm based on the inside probabilities:
P(w_1m | G) = P(N^1 ⇒* w_1m | G)
            = P(w_1m | N^1_1m, G) = β_1(1,m)
• Base Case: ∀j, ∀k:
β_j(k,k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G)
Induction: Divide the String wpq
[Figure: N^j spanning w_p … w_q is divided into N^r spanning w_p … w_d and N^s spanning w_(d+1) … w_q.]
Induction Step
∀j, 1 ≤ p < q ≤ m:
[Figure: N^j over w_p … w_q split into N^r over w_p … w_d and N^s over w_(d+1) … w_q.]

β_j(p,q) = P(w_pq | N^j_pq, G)
  = Σ_{r,s} Σ_{d=p}^{q-1} P(w_pd, N^r_pd, w_(d+1)q, N^s_(d+1)q | N^j_pq, G)
  = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) · P(w_pd | N^j_pq, N^r_pd, N^s_(d+1)q, G) · P(w_(d+1)q | N^j_pq, N^r_pd, N^s_(d+1)q, w_pd, G)
  = Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_pd, N^s_(d+1)q | N^j_pq, G) · P(w_pd | N^r_pd, G) · P(w_(d+1)q | N^s_(d+1)q, G)
  = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q)
Example PCFG
• #1 S → NP VP          1.0
• #2 VP → V NP PP       0.4
• #3 VP → V NP          0.6
• #4 NP → N             0.7
• #5 NP → N PP          0.3
• #6 PP → PREP N        1.0
• #7 N → a_dog          0.5
• #8 N → a_cat          0.3
• #9 N → a_telescope    0.2
• #10 V → saw           1.0
• #11 PREP → with       1.0
[Figure: the two parse trees of "a_dog saw a_cat with a_telescope", annotated with these rule probabilities.]
P(a_dog saw a_cat with a_telescope) =
  1 × .7 × .5 × .4 × 1 × .7 × .3 × 1 × 1 × .2   (PP attached to the VP)
+ 1 × .7 × .5 × .6 × 1 × .3 × .3 × 1 × 1 × .2   (PP attached to the NP)
= .00588 + .00378 = .00966
Computing Inside Probability
• a_dog saw a_cat with a_telescope
(positions: a_dog = 1, saw = 2, a_cat = 3, with = 4, a_telescope = 5)

from\to        1              2        3               4          5
   1      NP .35, N .5                 S .0441                    S .00966
   2                          V 1      VP .126                    VP .0276
   3                                   NP .21, N .3               NP .018
   4                                                   PREP 1     PP .2
   5                                                              N .2
• Create a table m × m (m is the length of the string).
• Initialize the diagonal, using the lexical rules N → w.
• Recursively compute along the diagonal towards the upper right
corner.
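The following sketch (not from the original slides) computes the chart above. Because the example grammar is not in CNF (rule #2 has three right-hand-side symbols), it sums over every way of splitting a span among a rule's children instead of using the binary CNF recurrence; the result for the whole sentence matches the chart entry .00966:

from itertools import combinations

rules = [                                   # (lhs, rhs, probability)
    ("S", ("NP", "VP"), 1.0),
    ("VP", ("V", "NP", "PP"), 0.4), ("VP", ("V", "NP"), 0.6),
    ("NP", ("N",), 0.7), ("NP", ("N", "PP"), 0.3),
    ("PP", ("PREP", "N"), 1.0),
    ("N", ("a_dog",), 0.5), ("N", ("a_cat",), 0.3), ("N", ("a_telescope",), 0.2),
    ("V", ("saw",), 1.0), ("PREP", ("with",), 1.0),
]
nonterminals = {lhs for lhs, _, _ in rules}

def inside(words):
    """beta[(A, p, q)] = total probability that A derives words[p..q] (inclusive)."""
    m = len(words)
    beta = {}
    for p, w in enumerate(words):                      # lexical entries (diagonal)
        for lhs, rhs, prob in rules:
            if rhs == (w,):
                beta[(lhs, p, p)] = beta.get((lhs, p, p), 0.0) + prob
    for length in range(1, m + 1):                     # shorter spans first
        for p in range(m - length + 1):
            q = p + length - 1
            for lhs, rhs, prob in rules:
                if len(rhs) == 1 and rhs[0] not in nonterminals:
                    continue                           # lexical rules already done
                # choose the boundaries that split words[p..q] among the children
                for cuts in combinations(range(p + 1, q + 1), len(rhs) - 1):
                    bounds = (p,) + cuts + (q + 1,)
                    total = prob
                    for sym, a, b in zip(rhs, bounds, bounds[1:]):
                        total *= beta.get((sym, a, b - 1), 0.0)
                    if total > 0.0:
                        beta[(lhs, p, q)] = beta.get((lhs, p, q), 0.0) + total
    return beta

words = "a_dog saw a_cat with a_telescope".split()
beta = inside(words)
print(beta[("S", 0, 4)])            # ~0.00966, as in the chart (spans are 0-based here)
print(beta[("VP", 1, 4)])           # ~0.0276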
The Probability of a String: Using
Outside Probabilities
• We can also calculate the probability of a string using the
outside algorithm, based on the outside probabilities, for any
k, 1 ≤ k ≤ m:
P(w_1m | G) = Σ_j P(w_1(k-1), w_k, w_(k+1)m, N^j_kk | G)
            = Σ_j P(w_1(k-1), N^j_kk, w_(k+1)m | G) · P(w_k | w_1(k-1), N^j_kk, w_(k+1)m, G)
            = Σ_j α_j(k,k) P(N^j → w_k)
• Outside probabilities are calculated top down and require
reference to inside probabilities, which must be calculated
first.
Outside Algorithm
• Base Case: 1(1,m)= 1; j(1,m)=0 for j1 (the
probability of the root being nonterminal Ni with
nothing outside it)
• Inductive Case: 2 cases depicted on the next two
slides. Sum over both possibilities but restrict first
sum so that gj to avoid double counting for rules
of the form Ni  Nj Nj
Case 1: left previous derivation step
[Figure: N^f_pe dominates N^j_pq as its left child and N^g_(q+1)e as its right child, within the tree rooted at N^1 over w_1 … w_m.]
Case 2: right previous derivation step
[Figure: N^f_eq dominates N^g_e(p-1) as its left child and N^j_pq as its right child, within the tree rooted at N^1 over w_1 … w_m.]
Outside Algorithm Induction
α_j(p,q) = Σ_{f,g≠j} Σ_{e=q+1}^{m} P(w_1(p-1), w_(q+1)m, N^f_pe, N^j_pq, N^g_(q+1)e)
         + Σ_{f,g}   Σ_{e=1}^{p-1} P(w_1(p-1), w_(q+1)m, N^f_eq, N^g_e(p-1), N^j_pq)

         = Σ_{f,g≠j} Σ_{e=q+1}^{m} P(w_1(p-1), w_(e+1)m, N^f_pe) · P(N^j_pq, N^g_(q+1)e | N^f_pe) · P(w_(q+1)e | N^g_(q+1)e)
         + Σ_{f,g}   Σ_{e=1}^{p-1} P(w_1(e-1), w_(q+1)m, N^f_eq) · P(N^g_e(p-1), N^j_pq | N^f_eq) · P(w_e(p-1) | N^g_e(p-1))

         = Σ_{f,g≠j} Σ_{e=q+1}^{m} α_f(p,e) P(N^f → N^j N^g) β_g(q+1,e)
         + Σ_{f,g}   Σ_{e=1}^{p-1} α_f(e,q) P(N^f → N^g N^j) β_g(e,p-1)
Product of Inside and Outside
Probabilities
• Similarly to an HMM, we can form a product of inside
and outside probabilities:
α_j(p,q) β_j(p,q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G) · P(w_pq | N^j_pq, G)
                  = P(w_1m, N^j_pq | G)
• The PCFG requires us to postulate a nonterminal node.
Hence, the probability of the sentence AND that there is
a constituent spanning the words p through q is:
P(w_1m, N_pq | G) = Σ_j α_j(p,q) β_j(p,q)
CNF Assumption
• Since we know that there will always be some
nonterminal spanning the whole tree (the start
symbol), as well as each individual terminal (due
to CNF assumption), we obtain the following as
the total probability of a string:
P(w_1m | G) = α_1(1,m) β_1(1,m) = β_1(1,m)
P(w_1m | G) = Σ_j α_j(k,k) P(N^j → w_k) = Σ_j α_j(k,k) β_j(k,k),  for any k
Most Likely Parse for a Sentence
• The O(m^3 n^3) algorithm works by finding the highest
probability partial parse tree spanning a certain substring
that is rooted with a certain nonterminal.
• δ_i(p,q) is the highest inside probability parse of a subtree
N^i_pq
• Initialization: δ_i(p,p) = P(N^i → w_p)
• Induction: δ_i(p,q) = max_{1≤j,k≤n, p≤r<q} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)
• Store backtrace: ψ_i(p,q) = argmax_{(j,k,r)} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)
• At termination: The probability of the most likely tree is
P(t̂) = δ_1(1,m). To reconstruct the tree, see the next slide.
Extracting the Viterbi Parse
• To reconstruct the maximum probability tree t̂,
we regard the tree as a set of nodes {X_x}.
– Because the grammar has a start symbol, the root
node must be N^1_1m
– Thus we only need to construct the left and right
daughters of a nonterminal node recursively. If
X_x = N^i_pq is in the Viterbi parse, and ψ_i(p,q) = (j,k,r),
then left(X_x) = N^j_pr and right(X_x) = N^k_(r+1)q
• There can be more than one maximum probability
parse.
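A Viterbi-style sketch of the same idea, again written for the non-CNF example grammar rather than the CNF formulation above: δ stores the best probability for each (nonterminal, span) and ψ the backpointer used to rebuild the tree. On the example sentence it recovers the VP-attachment parse with probability .00588:

from itertools import combinations

rules = [
    ("S", ("NP", "VP"), 1.0), ("VP", ("V", "NP", "PP"), 0.4), ("VP", ("V", "NP"), 0.6),
    ("NP", ("N",), 0.7), ("NP", ("N", "PP"), 0.3), ("PP", ("PREP", "N"), 1.0),
    ("N", ("a_dog",), 0.5), ("N", ("a_cat",), 0.3), ("N", ("a_telescope",), 0.2),
    ("V", ("saw",), 1.0), ("PREP", ("with",), 1.0),
]
nonterminals = {lhs for lhs, _, _ in rules}

def viterbi_parse(words):
    m = len(words)
    delta, psi = {}, {}
    for p, w in enumerate(words):                      # lexical entries
        for lhs, rhs, prob in rules:
            if rhs == (w,) and prob > delta.get((lhs, p, p), 0.0):
                delta[(lhs, p, p)], psi[(lhs, p, p)] = prob, w
    for length in range(1, m + 1):                     # best subtree per span
        for p in range(m - length + 1):
            q = p + length - 1
            for lhs, rhs, prob in rules:
                if len(rhs) == 1 and rhs[0] not in nonterminals:
                    continue
                for cuts in combinations(range(p + 1, q + 1), len(rhs) - 1):
                    bounds = (p,) + cuts + (q + 1,)
                    children = [(sym, a, b - 1) for sym, a, b in
                                zip(rhs, bounds, bounds[1:])]
                    score = prob
                    for child in children:
                        score *= delta.get(child, 0.0)
                    if score > delta.get((lhs, p, q), 0.0):
                        delta[(lhs, p, q)], psi[(lhs, p, q)] = score, children
    return delta, psi

def build_tree(node, psi):
    back = psi[node]
    if isinstance(back, str):                          # a word
        return (node[0], back)
    return (node[0],) + tuple(build_tree(child, psi) for child in back)

words = "a_dog saw a_cat with a_telescope".split()
delta, psi = viterbi_parse(words)
root = ("S", 0, len(words) - 1)
print(delta[root])            # ~0.00588: the VP-attachment reading wins
print(build_tree(root, psi))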
Training a PCFG
• Purpose of training: limited grammar learning (induction)
• Restrictions: We must provide the number of terminals and
nonterminals, as well as the start symbol, in advance. Then it is
possible to assume all possible CNF rules for that grammar exist, or
alternatively provide more structure on the rules, including specifying
the set of rules.
• Purpose of training: to attempt to find the optimal probabilities to
assign to different grammar rules once we have specified the grammar
in some way.
• Mechanism: This training process uses an EM Training Algorithm
called the Inside-Outside Algorithm which allows us to train the
parameters of a PCFG on unannotated sentences of the language.
• Basic Assumption: a good grammar is one that makes the sentences in
the training corpus likely to occur, i.e., we seek the grammar that
maximizes the likelihood of the training data.
Inside-Outside Algorithm
• If a parsed corpus is available, it is best to train using that
information. If such is unavailable, then we have a hidden
data problem: we need to determine probability functions
on rules by only seeing the sentences.
• Hence, we must use an iterative algorithm (like Baum
Welch for HMMs) to improve estimates until some
threshold is achieved. We begin with a grammar topology
and some initial estimates of the rules (e.g., all rules with a
non-terminal on the LHS are equi-probable).
• We use the probability of the parse of each sentence in the
training set to indicate our confidence in it, and then sum
the probabilities of each rule to obtain an expectation of
how often it is used.
Inside-Outside Algorithm
• The expectations are then used to refine the
probability estimates of the rules, with the goal of
increasing the likelihood of the training set given
the grammar.
• Unlike the HMM case, when dealing with multiple
training instances, we cannot simply concatenate
the data together. We must assume that the
sentences in the training set are independent and
then the likelihood of the corpus is simply the
product of their component sentence probabilities.
• This complicates the reestimation formulas, which
appear on pages 399-401 of Chapter 11.
Problems with the Inside-Outside
Algorithm
• Extremely slow: for each sentence, each iteration of
training is O(m^3 n^3), where m is the sentence length and n is
the number of nonterminals.
• Local Maxima are much more of a problem than in HMMs
• Satisfactory learning requires many more nonterminals
than are theoretically needed to describe the language.
• There is no guarantee that the learned nonterminals will be
linguistically motivated.
• Hence, although grammar induction from unannotated
corpora is possible, it is extremely difficult to do well.
Another View of Probabilistic
Parsing
• By assuming that the probability of a constituent being
derived by a rule is independent of how the constituent is
used as a subconstituent, the inside probability can be
used to develop a best-first PCFG parsing algorithm.
• This assumption suggests that the probabilities of NP rules
are the same whether the NP is used as a subject, object, or
object of a preposition. (However, subjects are more likely
to be pronouns!)
• The inside probabilities for lexical categories can be their
lexical generation probabilities. For example,
P(flower|N)=.063 is the inside probability that the
constituent N is realized as the word flower.
Lexical Probability Estimates
The table below gives the lexical probabilities which
are needed for our example:
P(the|ART)     .54        P(a|ART)       .360
P(flies|N)     .025       P(a|N)         .001
P(flies|V)     .076       P(flower|N)    .063
P(like|V)      .1         P(flower|V)    .05
P(like|P)      .068       P(birds|N)     .076
P(like|N)      .012
Word/Tag Counts
            N      V     ART     P     TOTAL
flies       21     23      0      0      44
fruit       49      5      1      0      55
like        10     30      0     21      61
a            1      0    201      0     202
the          1      0    300      2     303
flower      53     15      0      0      68
flowers     42     16      0      0      58
birds       64      1      0      0      65
others     592    210     56    284    1142
TOTAL      833    300    558    307    1998
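A small sketch deriving the lexical generation probabilities of the previous slide from these counts, P(w|tag) = count(w, tag) / count(tag):

counts = {
    "N":   {"flies": 21, "fruit": 49, "like": 10, "a": 1, "the": 1,
            "flower": 53, "flowers": 42, "birds": 64, "others": 592},
    "V":   {"flies": 23, "fruit": 5, "like": 30, "flower": 15,
            "flowers": 16, "birds": 1, "others": 210},
    "ART": {"fruit": 1, "a": 201, "the": 300, "others": 56},
    "P":   {"like": 21, "the": 2, "others": 284},
}

def lexical_prob(word, tag):
    return counts[tag].get(word, 0) / sum(counts[tag].values())

print(round(lexical_prob("flower", "N"), 4))   # 0.0636, listed as .063 above
print(round(lexical_prob("the", "ART"), 2))    # 0.54
print(round(lexical_prob("like", "V"), 2))     # 0.1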
The PCFG
• Below is a probabilistic CFG (PCFG) with
probabilities derived from analyzing a parsed
version of Allen's corpus.
Rule               Count for LHS   Count for Rule   PROB
1. S → NP VP            300             300           1
2. VP → V               300             116          .386
3. VP → V NP            300             118          .393
4. VP → V NP PP         300              66          .22
5. NP → NP PP          1032             241          .23
6. NP → N N            1032              92          .09
7. NP → N              1032             141          .14
8. NP → ART N          1032             558          .54
9. PP → P NP            307             307           1
Parsing with a PCFG
• Using the lexical probabilities, we can derive
probabilities that the constituent NP generates a
sequence like a flower. Two rules could generate
the string of words:
[Figure: two NP trees for "a flower": rule 8 (NP → ART N) with ART = a and N = flower, and rule 6 (NP → N N) with N = a and N = flower.]
Parsing with a PCFG
• The likelihood of each rule has been estimated from the
corpus, so the probability that the NP generates a flower is
the sum of the two ways it can be derived:
P(a flower|NP) = P(R8|NP) × P(a|ART) × P(flower|N) + P(R6|NP) × P(a|N) × P(flower|N)
               = .55 × .36 × .063 + .09 × .001 × .063 = .012
• This probability can then be used to compute probabilities
for larger constituents, like the probability of generating
the words, A flower wilted (枯萎), from an S constituent.
Three Possible Trees for an S
[Figure: three parse trees for "a flower wilted": (1) NP → ART N (rule 8) with VP → V (rule 2); (2) NP → N (rule 7, N = a) with VP → V NP (rule 3, V = flower, NP → N = wilted); (3) NP → N N (rule 6) with VP → V (rule 2).]
Parsing with a PCFG
• The probability of a sentence generating A flower
wilted:
P(a flower wilted|S) = P(R1|S) × P(a flower|NP) ×
P(wilted|VP) + P(R1|S) × P(a|NP) × P(flower wilted|VP)
• Using this approach, the probability that a given
sentence will be generated by the grammar can be
efficiently computed. It only requires some way
of recording the value of each constituent between
each two possible positions. The requirement can
be filled by a packed chart structure.
Parsing with a PCFG
• The probabilities of specific parse trees can be found using
a standard chart parsing algorithm, where the probability of
each constituent is computed from the probability of its
subconstituents and the rule used. When entering an entry
E of nonterminal Nj using rule i with n subconstituents
corresponding to E1, E2, ..., En, then: P(E) = P(Rule i| Nj) ×
P(E1) ×... × P(En)
• A standard chart parser can be used with a step to compute
the probability of each entry when it is added to the chart.
Using a bottom-up algorithm plus probabilities from a
corpus, the "PCFG Chart for A flower" figure later in these
slides shows the complete chart for the input a flower.
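A small sketch of this scoring rule for the two NP entries over "a flower", using the rule and lexical probabilities from the tables above; summing the two entries for the same NP span reproduces the ~.012 computed a few slides back:

lexical = {("a", "ART"): .36, ("a", "N"): .001, ("flower", "N"): .063}
np_rules = [(("ART", "N"), .54),    # rule 8: NP -> ART N
            (("N", "N"), .09)]      # rule 6: NP -> N N

words = ["a", "flower"]
p_np = 0.0
for rhs, rule_prob in np_rules:
    entry = rule_prob                       # P(E) = P(rule | NP) x P(E1) x ... x P(En)
    for word, category in zip(words, rhs):
        entry *= lexical.get((word, category), 0.0)
    p_np += entry

print(round(p_np, 3))                       # 0.012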
Lexical Generation Probabilities
• A better estimate of the lexical category for the word
would be calculated by determining how likely it is that
category Ni occurred at a position t over all sequences
given the input w1t. In other words, rather than ignoring
context or simply searching for the one sequence that
yields the maximum probability for the word, we want to
compute the sum of the probabilities from all sequences
using the forward algorithm.
• The forward probability can then be used in the chart as the
probability for a lexical item. The forward probabilities
strongly prefer the noun reading flower in contrast to the
context independent probabilities (as can be seen on the
next slide).
PCFG Chart For A flower
[Figure: the complete chart for "a flower", with lexical entries ART416, N417, V419, N422 and constituents NP418, VP420, S421, NP423, NP425, each annotated with its probability.]
Accuracy Levels
• A parser built using this technique will identify the correct
parse about 50% of the time. The drastic independence
assumptions may prevent it from doing better. For
example, the context-free model assumes that the
probability of a particular verb being used in a VP rule is
independent of the rule in question.
• For example, any structure that attaches a PP to a verb will
have a probability of .22 from the fragment, compared with
one that attaches to an NP in the VP (.39*.24 = .093).
[Figure: two VP structures: VP → V NP PP with probability .22, and VP → V NP (.39) with the PP attached inside the object NP via NP → NP PP (.24).]
Best-First Parsing
• Algorithms can be developed that explore high-probability
constituents first using a best-first strategy. The hope is
that the best parse can be found quickly without using
much search space and that low probability constituents
will never be created.
• The chart parser can be modified by making the agenda a
priority queue (where the most probable elements are first
in the queue). The parser then operates by removing the
highest ranked constituent first and adding it to the chart.
• The previous chart parser relied on working from left to
right, finishing earlier constituents before later ones. With
the best-first strategy, this assumption does not hold, and
so the algorithm must be modified.
Best-First Chart Issues
• For example, if the last word in the sentence has
the highest score, it will be entered into the chart
first in a best-first parser.
• This possibility suggests that a constituent needed
to extend an arc may already be in the chart,
requiring that whenever an active arc is added to
the chart, we need to check if it can be extended
by anything already in the chart.
• The arc extension algorithm must therefore be
modified.
Arc Extension Algorithm
• To add a constituent C from position p1 to p2:
1. Insert C into the chart from position p1 to p2.
2. For any active arc of the form X → X1 ... • C ... Xn from position
p0 to p1, add a new active arc X → X1 ... C • ... Xn from position
p0 to p2.
• To add an active arc X → X1 ... C • C' ... Xn to the chart
from p0 to p2:
1. If C is the last constituent (i.e., the arc is completed), add a new
constituent of type X to the agenda.
2. Otherwise, if there is a constituent Y of category C' in the chart
from p2 to p3, then recursively add an active arc X → X1 ... C C'
• ... Xn from p0 to p3 (which may of course add further arcs or
create further constituents).
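A minimal sketch of these two operations over simple data structures; a real best-first parser would also rank the agenda by probability before calling add_constituent, and the grammar rule and positions in the demo are only illustrative:

chart = set()          # completed constituents: (category, start, end)
arcs = []              # active arcs: (lhs, rhs, dot, start, end)
agenda = []            # completed constituents waiting to enter the chart

def add_constituent(cat, p1, p2):
    """Add constituent C of category `cat` spanning p1..p2 to the chart."""
    chart.add((cat, p1, p2))
    # extend every active arc that ends at p1 and is waiting for this category
    for lhs, rhs, dot, p0, end in list(arcs):
        if end == p1 and dot < len(rhs) and rhs[dot] == cat:
            add_arc(lhs, rhs, dot + 1, p0, p2)

def add_arc(lhs, rhs, dot, p0, p2):
    """Add active arc lhs -> rhs[:dot] . rhs[dot:] spanning p0..p2."""
    if dot == len(rhs):                         # arc completed
        agenda.append((lhs, p0, p2))
        return
    arcs.append((lhs, rhs, dot, p0, p2))
    # the needed constituent may already be in the chart (best-first order)
    for cat, q1, q2 in list(chart):
        if cat == rhs[dot] and q1 == p2:
            add_arc(lhs, rhs, dot + 1, p0, q2)

# tiny demo: NP -> ART N over "a flower" (word boundaries 0, 1, 2)
add_constituent("N", 1, 2)                      # suppose "flower" was scored first
add_arc("NP", ("ART", "N"), 0, 0, 0)            # empty NP arc starting at 0
add_constituent("ART", 0, 1)                    # "a" arrives and extends the arc
print(agenda)                                   # [('NP', 0, 2)]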
Best-First Strategy
• The best-first strategy improves the efficiency of the parser
by creating fewer constituents than a standard chart parser
which terminates as soon as a sentence parse is found.
• Even though it does not consider every possible
constituent, it is guaranteed to find the highest probability
parse first.
• Although conceptually simple, there are some problems
with the technique in practice. For example, if we
combine scores by multiplying probabilities, the scores of
constituents fall as they cover more words. Hence, with
large grammars, the probabilities may drop off so quickly
that the search resembles breadth-first search!
Context-Dependent Best-First Parser
• The best-first approach may improve efficiency,
but cannot improve the accuracy of the
probabilistic parser.
• Improvements are possible if we use a simple
alternative for computing rule probabilities that
uses more context-dependent lexical information.
• In particular, we exploit the observation that the
first word in a constituent is often its head and, as
such, exerts an influence on the probabilities of
rules that account for its complements.
Simple Addition of Lexical Context
• Hence, we can use the following measure that
takes into account the first word in the constituent.
P(R | N^i, w) = C(rule R generates an N^i that starts with w) /
C(N^i starts with w).
• This makes the probabilities sensitive to more
lexical information. This is valuable:
– It is unusual for singular nouns to be used by
themselves.
– Plurals are rarely used to modify other nouns.
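A small sketch of this estimate; the counts below are hypothetical, chosen only to reproduce a couple of the values in the table on the next slide:

from collections import Counter

# hypothetical observations: (rule used, first word of the NP it generated)
observations = ([("NP -> ART N", "the")] * 76 + [("NP -> NP PP", "the")] * 24 +
                [("NP -> N", "peaches")] * 65 + [("NP -> NP PP", "peaches")] * 35)

rule_word = Counter(observations)
word_total = Counter(word for _, word in observations)

def p_rule_given_first_word(rule, word):
    return rule_word[(rule, word)] / word_total[word]

print(p_rule_given_first_word("NP -> ART N", "the"))   # 0.76
print(p_rule_given_first_word("NP -> N", "peaches"))   # 0.65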
Example
• Notice the probabilities for the two rules, NP → N
and NP → N N, given the words house and
peaches. Also, context-sensitive rules can encode
verb subcategorization preferences.
Rule            the     house   peaches   flowers
NP → N           0        0       .65       .76
NP → N N         0       .82       0         0
NP → NP PP      .23      .18      .35       .24
NP → ART N      .76       0        0         0

Rule              ate    bloom   like    put
VP → V            .28     .84      0     .03
VP → V NP         .57     .1      .9     .03
VP → V NP PP      .14     .05     .1     .93
Accuracy and Size
Strategy                          Accuracy on 84 PP             Size of chart generated for
                                  attachment problems           "The man put the bird in the house"
Full parse                        33% (taking first S found)              158
Context-free probabilities        49%                                      65
Context-dependent probabilities   66%                                      36
Source of Improvement
• To see why the context-dependent parser does
better, consider the attachment decision that must
be made for sentences like:
– The man put the bird in the house.
– The man likes the bird in the house.
• The basic PCFG will assign the same structure to
both, i.e., it will attach the PP to the VP.
However, the context-sensitive approach will
utilize subcategorization preferences for the verbs.
Chart for put the bird in the house
• The probability of attaching the PP to the VP is
.54 (.93×.99×.76×.76) compared to .0038 for the
alternative.
Chart for likes the bird in the house
• The probability of attaching the PP to the VP is
.054 compared to .1 for the alternative.
Room for Improvement
• The question is, just how accurate can this approach
become? 33% error is too high. Additional information
could be added to improve performance.
• One can use probabilities relative to a larger fragment of
input (e.g., bigrams or trigrams for the beginnings of
rules). This would require more training data.
• Attachment decisions depend not only on the word that a
PP is attaching to but also the PP itself. Hence, we could
use a more complex estimate based on the previous
category, the head verb, and the preposition. This creates a
complication in that one cannot compare across rules
easily: VP → V NP PP is evaluated using a verb and a
preposition, but VP → V NP has no preposition.
Room for Improvement
• In general, the more selective the lexical categories, the
more predictive the estimates can be, assuming that there is
enough data.
• Clearly function words like prepositions, articles,
quantifiers, and conjunctions can receive individual
treatment; however, open class words like nouns and verbs
are too numerous for individual handling.
• Words may be grouped based on similarity, or useful
classes can be learned by analyzing corpora.