Probabilistic Parsing - University of Manchester


Parsing III
Probabilistic Parsing and Conclusions
1/13
Probabilistic CFGs
• Also known as stochastic grammars
• Date back to Booth (1969)
• Have grown in popularity with the growth of corpus linguistics
2/13
Probabilistic CFGs
Essentially the same as ordinary CFGs, except that each rule has a probability associated with it:

S  → NP VP         .80
S  → aux NP VP     .15
S  → VP            .05
NP → det n         .20
NP → det adj n     .35
NP → n             .20
NP → adj n         .15
NP → pro           .10

• Notice that the probabilities for each set of rules with the same left-hand side sum to 1
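As a concrete illustration (not from the slides), here is a minimal Python sketch of this grammar; the dictionary layout, mapping each left-hand side to its right-hand sides and their probabilities, is a hypothetical choice:

    # A minimal sketch of the example PCFG above; the layout
    # (LHS -> {RHS tuple: probability}) is a hypothetical choice.
    pcfg = {
        "S":  {("NP", "VP"): 0.80, ("aux", "NP", "VP"): 0.15, ("VP",): 0.05},
        "NP": {("det", "n"): 0.20, ("det", "adj", "n"): 0.35,
               ("n",): 0.20, ("adj", "n"): 0.15, ("pro",): 0.10},
    }

    # The probabilities for each left-hand side must sum to 1.
    for lhs, rules in pcfg.items():
        assert abs(sum(rules.values()) - 1.0) < 1e-9, lhs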
3/13
Probabilistic CFGs
• Probabilities are used to calculate the probability of a given derivation
– Defined as the product of the Ps of the rules used in the derivation
• Can be used to choose between competing derivations
– As the parse progresses (so they can determine which rules to try first), as an efficiency measure
– Or at the end, as a way of disambiguating, or expressing confidence in the results
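A small sketch of this product calculation, reusing the hypothetical dictionary layout from the earlier sketch; only the S and NP rules given on the slides are used, so the VP expansion is left out:

    from math import prod

    # Rule probabilities from the grammar above (hypothetical layout).
    pcfg = {
        "S":  {("NP", "VP"): 0.80},
        "NP": {("pro",): 0.10},
    }

    def derivation_probability(rules_used, pcfg):
        # P(derivation) = product of the Ps of the rules used.
        return prod(pcfg[lhs][rhs] for lhs, rhs in rules_used)

    # Partial derivation S -> NP VP, NP -> pro (VP left unexpanded):
    print(derivation_probability([("S", ("NP", "VP")), ("NP", ("pro",))], pcfg))
    # ≈ 0.08 (= 0.80 * 0.10)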
4/13
Where do the probabilities come from?
1) Use a corpus of already-parsed sentences: a “treebank”
– Best-known example is the Penn Treebank
• Marcus et al. 1993
• Available from the Linguistic Data Consortium
• Based on the Brown corpus + 1m words of Wall Street Journal + the Switchboard corpus
– Count all occurrences of each variant of a rule (e.g. of NP) and divide by the total number of NP rules
– Very laborious, so of course it is done automatically
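A sketch of that counting step in Python; the flat list of observed (LHS, RHS) rule occurrences is invented for illustration, since in practice it would be read off the treebank’s trees:

    from collections import Counter

    # Invented rule occurrences; in practice, read off the treebank trees.
    observed = [
        ("NP", ("det", "n")), ("NP", ("det", "n")), ("NP", ("pro",)),
        ("S", ("NP", "VP")),
    ]

    rule_counts = Counter(observed)
    lhs_totals = Counter(lhs for lhs, _ in observed)

    # P(LHS -> RHS) = count(LHS -> RHS) / count(all rules with this LHS)
    probs = {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}
    print(probs[("NP", ("det", "n"))])   # 2/3 of the observed NP rules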
5/13
Where do the probabilities come from?
2) Create your own treebank
– Easy if all sentences are unambiguous: just count the (successful) rule applications
– When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted
6/13
Where do the probabilities come from?
3) Learn them as you go along
– Again, assumes some way of identifying the correct parse in case of ambiguity
– Each time a rule is successfully used, its probability is adjusted
– You have to start with some estimated probabilities, e.g. all equal
– Does need human intervention, otherwise rules become self-fulfilling prophecies
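A minimal sketch of this scheme, assuming a simple count-based update: every rule starts with an equal pseudo-count (so all estimates begin equal), and a rule’s count is incremented each time it is used in a parse accepted as correct. The grammar and names are illustrative only:

    from collections import defaultdict

    # Illustrative grammar: each LHS maps to its possible right-hand sides.
    grammar = {"NP": [("det", "n"), ("pro",), ("adj", "n")]}

    # Equal pseudo-counts, so all rules start out equally probable.
    counts = defaultdict(lambda: 1)

    def record_success(lhs, rhs):
        # Adjust the rule's weight each time it is used in a correct parse.
        counts[(lhs, rhs)] += 1

    def probability(lhs, rhs):
        total = sum(counts[(lhs, r)] for r in grammar[lhs])
        return counts[(lhs, rhs)] / total

    record_success("NP", ("det", "n"))
    print(probability("NP", ("det", "n")))   # 2/4 = 0.5 after one success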
7/13
Problems with PCFGs
• PCFGs assume that all rules are essentially independent
– But, e.g., in English “NP → pro” is more likely in subject position
• Difficult to incorporate lexical information
– Pre-terminal rules can inherit important information from words, which helps to make choices higher up the parse, e.g. lexical choice can help determine PP attachment
8/13
Probabilistic Lexicalised CFGs
• One solution is to identify, in each rule, one of the elements on the RHS (one daughter) as more important: the “head”
– This is quite intuitive, e.g. the n in an NP rule, though often controversial (from a linguistic point of view)
• The head must be a lexical item
• The head value is percolated up the parse tree
• An added advantage is that the PS tree has the feel of a dependency tree
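A minimal sketch of head percolation, under the assumption that each rule records which daughter is its head; the head_index table and the tree encoding are invented for illustration:

    # Invented head table: for each (LHS, daughters) rule, the index of
    # the head daughter.
    head_index = {
        ("S",  ("NP", "VP")): 1,    # VP is the head daughter of S
        ("VP", ("v", "NP")):  0,    # v is the head of VP
        ("NP", ("det", "n")): 1,    # n is the head of NP
    }

    def percolate(node):
        # Return the head word of a node: (cat, word) for a pre-terminal,
        # (cat, [children]) for an internal node.
        cat, body = node
        if isinstance(body, str):
            return body
        daughters = tuple(child[0] for child in body)
        return percolate(body[head_index[(cat, daughters)]])

    tree = ("S", [("NP", [("det", "the"), ("n", "man")]),
                  ("VP", [("v", "shot"),
                          ("NP", [("det", "an"), ("n", "elephant")])])])
    print(percolate(tree))   # shot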
9/13
[Figure: lexicalised parse tree for “the man shot an elephant”, with head words percolated up:
(S(shot)
  (NP(man) (det the) (n man))
  (VP(shot) (v shot)
    (NP(elephant) (det an) (n elephant))))]
10/13
Dependency Parsing
• Not much different from PSG parsing
• Grammar rules still need to be stated as A → B c
– except that one daughter is identified as the head, e.g. A → x h y
– As structure is built, the trees are headed by “h” rather than “A”
• Can be probabilistic or not
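Continuing the sketch from the lexicalised-CFG slide (and reusing its tree and head_index), head-dependent arcs can be read off a headed tree: each non-head daughter’s head word depends on the head daughter’s head word. Illustrative only:

    def dependencies(node, deps=None):
        # Collect (head word, dependent word) arcs from a headed tree.
        if deps is None:
            deps = []
        cat, body = node
        if isinstance(body, str):            # pre-terminal: the word itself
            return body, deps
        daughters = tuple(child[0] for child in body)
        h = head_index[(cat, daughters)]     # which daughter is the head
        head_word, _ = dependencies(body[h], deps)
        for i, child in enumerate(body):
            if i != h:
                dep_word, _ = dependencies(child, deps)
                deps.append((head_word, dep_word))
        return head_word, deps

    print(dependencies(tree)[1])
    # [('elephant', 'an'), ('shot', 'elephant'), ('man', 'the'), ('shot', 'man')]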
11/13
Conclusion 1
• Basic parsing approaches (without constraints) are not practical in real applications
• Whatever approach is taken, bear in mind that the lexicon is the real bottleneck
• There’s a real trade-off between coverage and efficiency, so it’s a good idea either to sacrifice broad coverage (e.g. domain-specific parsers, controlled language) or to use a scheme that minimizes the disadvantages (e.g. probabilistic parsing)
12/13
Conclusion 2
• From a computational perspective, a parser provides
– a formalism for writing linguistic rules
– an implementation which can apply the rules to an input text
• Also, as necessary
– an interface to allow grammar development and testing (e.g. tracing rules, showing trees)
– an interface to the application of which it is a part (may be hidden from the end-user)
• All of the above tailored to meet the needs of the application
13/13