Morphological and Syntactic Analysis

Morphological Analysis
Context-Free Grammars
Daniel Zeman
http://ufal.mff.cuni.cz/course/npfl094/
Warning
• We are going to observe a number of reasons why
pure CFGs are not very suitable for MA.
• Nevertheless we are going to study them because:
– Extensions such as unification grammars (see later) are
much more suitable
– CFGs are also used to describe sentence structure
– The chart parsing algorithm is interesting enough and
we would be looking at it anyway, sooner or later
19.11.2010
http://ufal.mff.cuni.cz/course/npfl094
Context-Free Grammars
• Quadruple (T, N, S, P)
– T … alphabet of terminal symbols, usually lowercase
letters
– N … alphabet of non-terminal symbols, usually
uppercase letters
– S ∈ N … start non-terminal symbol
– P … set of rewrite rules of the form X → α,
where X ∈ N and α ∈ (T ∪ N)*
• A string can be derived in a CFG if it can be
created by repeated application of the rules on the
start symbol.
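The definition can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the toy rule set is only illustrative, terminals are whole morphs rather than single letters, and the names `RULES` and `derivable` are hypothetical.

```python
from collections import deque

# Toy grammar (T, N, S, P); the rule set is an illustrative assumption.
TERMINALS = {"abatyš", "abbé", "a", "e"}
NONTERMINALS = {"Word", "Stem", "Suffix"}
START = "Word"
RULES = {
    "Word": [["Stem", "Suffix"]],
    "Stem": [["abatyš"], ["abbé"]],
    "Suffix": [[], ["a"], ["e"]],  # [] stands for the empty string λ
}

# Every right-hand side must be a string over T ∪ N.
assert all(sym in TERMINALS | NONTERMINALS
           for alternatives in RULES.values()
           for rhs in alternatives
           for sym in rhs)


def derivable(target, max_steps=10_000):
    """Breadth-first search over strings derived from the start symbol.

    The leftmost non-terminal is rewritten in every step; max_steps is a
    safety bound for grammars that generate infinitely many strings.
    """
    queue = deque([(START,)])
    seen = set()
    steps = 0
    while queue and steps < max_steps:
        form = queue.popleft()
        steps += 1
        idx = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if idx is None:  # only terminals left
            if "".join(form) == target:
                return True
            continue
        for rhs in RULES[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return False
```

For example, `derivable("abatyše")` follows Word → Stem Suffix → abatyš Suffix → abatyš e.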
Morphological Example
• The first step towards a context-free description of the structure of
words in Czech could roughly look like this.
• Non-terminals start with uppercase, terminals with lowercase.
– Word → Comparison Negation Stem Suffix | Stem Suffix
– Comparison → nej
– Negation → ne
– Stem → abatyš | abbé | abdikac | abdikov | …
– Suffix → λ | a | ovi | e | em | y | u | o | ou | …
• Distinguish stems that permit concrete groups of affixes.
• Solve irregularities, alternations of stem-final consonants, …
• Problem: The grammar would be too large!
Example of a Derivation Tree
Word ( Stem ( abatyš ) Suffix ( e ) )
• Word, Stem and Suffix are non-terminals.
• abatyš and e are terminals.
Example Czech Paradigms:
žena (woman), matka (mother)
Case    žena (sg. – pl.)    matka (sg. – pl.)
nom     žena – ženy         matka – matky
gen     ženy – žen          matky – matek
dat     ženě – ženám        matce – matkám
acc     ženu – ženy         matku – matky
voc     ženo – ženy         matko – matky
loc     ženě – ženách       matce – matkách
ins     ženou – ženami      matkou – matkami
Changes of Stem Consonants
• A somewhat possible solution (as before, we are
experimenting with the Czech feminine nouns matka
“mother”, žena “woman” etc.):
– StemNF1 → m a t K | ž e N | …
– K → k | c
– N → n | ň
– SuffixNF1 → a | y | e | …
• Accepts matka, matky, matce but also *matca, *matcy,
*matke. Either we need a sort of supplementary out-of-grammar
rule (e.g. soft consonant before “e”, hard consonant
everywhere else), or a more complex grammar.
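The overgeneration can be made concrete by enumerating the forms of this small grammar. The following is a sketch only: the "…" alternatives are left out, and the names `generate` and `allowed` are hypothetical. The function `allowed` plays the role of the out-of-grammar rule (soft consonant before e, hard consonant elsewhere).

```python
from itertools import product

# The grammar from the slide, restricted to the alternatives that are spelled out.
STEMS = [["m", "a", "t", "K"], ["ž", "e", "N"]]
ALTERNATIONS = {"K": ["k", "c"], "N": ["n", "ň"]}
SUFFIXES = ["a", "y", "e"]


def generate():
    """Enumerate every word form the grammar accepts."""
    forms = set()
    for stem in STEMS:
        # Each stem symbol is either a plain letter or an alternating consonant.
        choices = [ALTERNATIONS.get(sym, [sym]) for sym in stem]
        for letters in product(*choices):
            for suffix in SUFFIXES:
                forms.add("".join(letters) + suffix)
    return forms


def allowed(form):
    """Supplementary rule: soft stem-final consonant exactly when the suffix is e."""
    stem_final, suffix = form[-2], form[-1]
    soft = stem_final in {"c", "ň"}
    return soft == (suffix == "e")


all_forms = generate()
good_forms = sorted(f for f in all_forms if allowed(f))
```

The grammar alone yields 12 forms, including *matca and *matke; the filter keeps 6 of them. Note that the grammar spells the soft form of žena as žeňe, which Czech orthography writes as ženě.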
Changes of Stem Consonants
• More complex grammar:
– Word → StemNF1Normal SuffixNF1Normal |
StemNF1Soft SuffixNF1Soft
– StemNF1Normal → m a t k | ž e n
– StemNF1Soft → m a t c | ž e ň
– SuffixNF1Normal → a | y | u | o | ou | λ | ám | ách | ami
– SuffixNF1Soft → e
• The size of the grammar could end up close to the
size of the enumeration of all possible word forms.
Repeating the parts “m a t” and “ž e” is
superfluous.
Inserting / Deleting e
• Things are further complicated by epenthesis, i.e.
inserting/deleting e: matek (genitive plural of matka)
– Word → StemNF1Normal SuffixNF1Normal | StemNF1Soft
SuffixNF1Soft | StemNF1InsE
– StemNF1Normal → m a t k | ž e n
– StemNF1Soft → m a t c | ž e ň
– StemNF1InsE → m a t e k | ž e n
– SuffixNF1Normal → a | y | u | o | ou | ám | ách | ami
– SuffixNF1Soft → e
Context-Free Morphological
Analysis and Generation
• Generation
– Start with the start symbol.
– Choose a non-terminal symbol in the current string and rewrite it
according to a rewrite rule. Sometimes (often!) we have to select
one rule of many that can be applied to the same non-terminal.
– The string is complete when it contains only terminal symbols.
• Analysis
– We have a string of terminal symbols. In the case of morphological
analysis the string is a word form.
– We look for parts that can be replaced by non-terminals. Non-deterministic!
– Goal: the start symbol S.
Context-Free Generation
• Input:
<l>matka<t>NNFS6-----A----
• Expected output:
<f>matce
• Grammar:
FormMatka → StemMatka SuffMatka
StemMatka → mat | bab | vlaj | …
SuffMatka → MatS1 | MatS2 | …
MatS1 → ka ; MatS2 → ky ; MatS3 → ce
MatP1 → ky ; MatP2 → ek ; MatP3 → kám
Context-Free Generation
• Supplementary rule:
– Names of some non-terminals contain information from
morphological tags.
– In this particular case: the last two characters of the
non-terminals immediately under the non-terminal whose
name begins with “Suff” encode the number and case.
• In theory we could proceed like this:
– First analyze the lemma matka. It will turn out that it
consists of mat + MatS1.
– Replace by required MatS6.
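The procedure can be sketched in Python. This is only an illustration under stated assumptions: the suffix table holds just the non-terminals listed on the previous slide, plus MatS6 → ce, which is inferred from the expected output matce; the function name `generate` is hypothetical, and number and case are assumed to be the fourth and fifth characters of the tag.

```python
# Stems and suffix non-terminals of the matka paradigm (partial, as on the slide).
STEMS = ["mat", "bab", "vlaj"]
SUFFIXES = {
    "MatS1": "ka", "MatS2": "ky", "MatS3": "ce",
    "MatS6": "ce",  # assumption: inferred from the expected output <f>matce
    "MatP1": "ky", "MatP2": "ek", "MatP3": "kám",
}


def generate(lemma, tag):
    """Analyze the lemma as stem + MatS1, then swap in the required suffix."""
    lemma_suffix = SUFFIXES["MatS1"]
    if not lemma.endswith(lemma_suffix):
        return None
    stem = lemma[: -len(lemma_suffix)]
    if stem not in STEMS:
        return None
    # The supplementary rule: number and case select the suffix non-terminal.
    number, case = tag[3], tag[4]
    suffix = SUFFIXES.get(f"Mat{number}{case}")
    return None if suffix is None else stem + suffix


form = generate("matka", "NNFS6-----A----")
```

With the input from the previous slide, `form` is matce.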
Morphology and CFG:
Too Many Paradigms
• žena
– žen | vlád | mát | láv | …
+ a | y | ě | u | o | ě | ou |
y | λ | ám | y | y | ách | ami
• matka
– mat | bab | vlaj | …
+ ka | ky | ce | ku | ko | ce | kou |
ky | ek | kám | ky | ky | kách | kami
• banka
– ban
+ ka | ky | ce | ku | ko | ce | kou |
ky | k | kám | ky | ky | kách | kami
• Traditional school grammars of Czech assign all these words
to the paradigm žena (“woman”).
Morphology and CFG:
Too Many Paradigms
• barva “color”
– bar | lar | kur | … | bit | pit | …
+ va | vy | vě | vu | vo | vě | vou |
vy | ev | vám | vy | vy | vách | vami
• tráva “grass”
– tr | kr | šť | …
(but not e.g. k!)
+ áva | ávy | ávě | ávu | ávo | ávě | ávou |
ávy | av | ávám | ávy | ávy | ávách | ávami
• louka “meadow”
– l | m
(but not e.g. prv, mrav!)
+ ouka | ouky | ouce | ouku | ouko | ouce | oukou |
ouky | uk | oukám | ouky | ouky | oukách | oukami
Too Much of a Good Thing
• Paradigm explosion
• Two independent problems:
– Palatalization of the stem-final consonant
• ha/ze, ga/ze, cha/še, ka/ce, ra/ře, da/dě, ta/tě, na/ně, ba/bě, fa/fě,
ma/mě, pa/pě, va/vě … 13
– Shortening of the stem-internal vowel
• láva/láv, tráva/trav, louka/luk, síla/sil, díra/děr … 5
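– Originally 1 paradigm žena expands to
• 13 + 5 = 18 paradigms if the two problems can be handled independently
• 13 × 5 = 65 paradigms if every combination needs its own paradigm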
• Can we separate solutions to the two problems?
Softening and Shortening
SuffV → NFVS1 | NFVS2 | NFVS3 | NFVS4 | NFVS5 |
NFVS6 | NFVS7 | NFVP1 | NFVP3 | NFVP4 | NFVP5 |
NFVP6 | NFVP7
SuffShV → NFVP2
StemV → láv | sův | tráv | smlouv | …
StemShV → láv | sův | trav | smluv | …
FormV → StemV SuffV
| StemShV SuffShV
Drawback: Analysis won’t tell us that the stems smlouv and
smluv belong to the same lemma!
Recall the Supplementary Rule
• Present supplementary rule:
– Non-terminals immediately under a non-terminal
beginning in Suff encode morphological tags.
• E.g. a non-terminal beginning in NFV corresponds to tag
NNF??-----A----.
• The rule contains a table of correspondences between tags and
non-terminals. A tag corresponds to several non-terminals of
different inflection classes (paradigms)!
– The last two characters of non-terminals under Suff
encode the number and case.
• E.g. NFVP2 corresponds to number P and case 2, i.e. the whole
tag is NNFP2-----A----.
Extended Supplementary Rule
• Add the following (rather wild) rule:
– Non-terminals starting with X contain characters that
shall appear in the stem of the lemma at that position.
• E.g. a non-terminal Xá says that the lemma contains a long á at
this position although Xá rewrites as a short a for the form
currently being analyzed.
Softening and Shortening
SuffV → NFVS1 | NFVS2 | NFVS3 | NFVS4 | NFVS5 |
NFVS6 | NFVS7 | NFVP1 | NFVP3 | NFVP4 | NFVP5 |
NFVP6 | NFVP7
SuffShV → NFVP2
Xá → a
Xou → u
StemV → láv | sův | … (These stems never shorten.)
StemS0V → tráv | smlouv | … (These stems shorten in P2, genitive plural.)
StemS1V → tr Xá v | sml Xou v | …
FormV → StemV SuffV | StemV SuffShV
| StemS0V SuffV | StemS1V SuffShV
The Result of the Analysis
• When analysis is done use the supplementary rule to read
off the stem of the lemma and the morphological tag.
FormV ( StemS1V ( sml Xou ( u ) v ) SuffShV ( NFVP2 ( λ ) ) )
The Result of the Analysis
• After analysis, generate the base form (lemma; by
definition it is S1). FormV refers to the correct paradigm.
Analyzed form smluv:
FormV ( StemS1V ( sml Xou ( u ) v ) SuffShV ( NFVP2 ( λ ) ) )
Generated lemma smlouva:
FormV ( StemS0V ( smlouv ) SuffV ( NFVS1 ( a ) ) )
Non-determinism
FormV13 → StemV13 SuffV13
SuffV13 → V13PS1 | V13PS2 | V13PS3 |
V13PP1 | V13PP2 | V13PP3
StemV13 → nes | ber | maž | jd | …
V13PS1 → u
V13PS2 → eš
V13PS3 → e
V13PP1 → eme | em
V13PP2 → ete
V13PP3 → ou
[Figure: the word form nesoucími, annotated with V13PS1 and V13PP3.]
Homonymy
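The form smlouvy has four analyses:
FormV ( StemS0V ( smlouv ) SuffV ( NFVS2 ( y ) ) )
FormV ( StemS0V ( smlouv ) SuffV ( NFVP1 ( y ) ) )
FormV ( StemS0V ( smlouv ) SuffV ( NFVP4 ( y ) ) )
FormV ( StemS0V ( smlouv ) SuffV ( NFVP5 ( y ) ) )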
Algorithm for Context-Free
Analysis
• Top-down
– Start with one non-terminal – the start symbol.
– Expand one non-terminal: find a rule where it forms the left-hand
side and replace it by the rule’s right-hand side.
– Repeat until there are only terminals.
– If the resulting string is not the analyzed word, backtrack and
choose a different rule for some non-terminal.
– If the resulting string is the analyzed word, record the analysis and
backtrack anyway as there may exist other analyses as well.
– If all combinations have been considered the set of analyses is
complete. Is it empty? Then the word is not in the lexicon.
Algorithm for Context-Free
Analysis
• Bottom-up
– Start with a sequence of terminals – the analyzed word.
– Collapse one non-terminal: search the analyzed string for the right-hand
side of a rule, replace by the left-hand-side non-terminal.
– Repeat until no rule found.
– If the result is not the start symbol, backtrack and choose a
different rule for some non-terminal.
– If the result is the start symbol, backtrack anyway as there may
exist other analyses.
– If all combinations have been considered the set of analyses is
complete. Is it empty? Then the word is not in the lexicon.
Top-Down Analysis
• Somehow we must enforce heading to the terminal
string being analyzed.
• Solution: continuously check that the terminals in
the current state correspond to a prefix of the
string.
• State of the analysis: string of terminals and
non-terminals, a period delimits the checked (read)
prefix.
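A possible sketch of top-down analysis with the prefix check, in Python. The grammar fragment is the one used in the matce example of this lecture with the "…" alternatives left out; the state layout and the name `top_down` are assumptions made for illustration. The sketch assumes a grammar without left recursion.

```python
GRAMMAR = {
    "Form": [["FormNFeka"]],
    "FormNFeka": [["StemNFeka", "SuffNFeka"]],
    "StemNFeka": [["bár"], ["bud"], ["mat"]],
    "SuffNFeka": [["NFekaS1"], ["NFekaS2"], ["NFekaS3"]],
    "NFekaS1": [["ka"]],
    "NFekaS2": [["ky"]],
    "NFekaS3": [["ce"]],
}


def top_down(word, start="Form"):
    """Return all analyses of word, each as the list of rules that were applied."""
    analyses = []
    # State: (length of the checked prefix, symbols right of the period, rules used).
    stack = [(0, (start,), ())]
    while stack:
        pos, symbols, used = stack.pop()
        if not symbols:
            if pos == len(word):  # the whole word has been checked
                analyses.append(list(used))
            continue
        head, rest = symbols[0], symbols[1:]
        if head in GRAMMAR:
            # Expand the leftmost non-terminal; push alternatives in reverse
            # so that the first rule is tried first.
            for rhs in reversed(GRAMMAR[head]):
                rule = (head, tuple(rhs))
                stack.append((pos, tuple(rhs) + rest, used + (rule,)))
        elif word.startswith(head, pos):
            # The terminal matches the input: move the period behind it.
            stack.append((pos + len(head), rest, used))
        # Otherwise the terminals do not match the input prefix: dead end (BACK).
    return analyses


result = top_down("matce")
```

Because every state stays on the stack until it is popped, the search continues after the first success and returns all analyses.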
Example of Top-Down Analysis
Analyzing the word matce
. Form
. FormNFeka
. StemNFeka SuffNFeka
. bár SuffNFeka !!! BACK
. bud SuffNFeka !!! BACK
…
. mat SuffNFeka
mat . SuffNFeka
mat . NFekaS1
mat . ka !!! BACK
mat . NFekaS2
mat . ky !!! BACK
mat . NFekaS3
mat . ce
matce .
An Observation about the Lexicon
• In practice the lexicon should be separated and
implemented more effectively.
• The last non-terminal above the lexicon is the
so-called pre-terminal.
• It knows the list of strings belonging to it and it
can search the strings quickly.
• Implementation: hash table, search tree, trie…
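As an illustration of the last option, a trie lets a pre-terminal find every lexicon entry that matches at a given position in one left-to-right pass. This is a minimal sketch; the class and method names are hypothetical and the stems are taken from the paradigm slides.

```python
class Trie:
    """A character trie storing the strings that belong to one pre-terminal."""

    def __init__(self):
        self.children = {}
        self.is_entry = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_entry = True

    def prefixes_of(self, text, start=0):
        """Yield every stored entry that is a prefix of text[start:]."""
        node = self
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.is_entry:
                yield text[start:i + 1]


stems = Trie()
for stem in ["mat", "bab", "vlaj", "ban"]:
    stems.insert(stem)

matches = list(stems.prefixes_of("matce"))
```

Here `matches` contains only the stem mat; the analyzer does not have to try bab, vlaj and ban one by one.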
An Observation about (Left)
Recursion
dělat – dělávat – dělávávat –
dělávávávat – dělávávávávat …
Form → FormV5
FormV5 → StemV5 SuffV5
| StemV5 Iter SuffV5
SuffV5 → V5INF | V5PS1
| V5PS2 | …
StemV5 → děl | lét | …
Iter → Iter áv | áv
V5INF → at | ati
V5PS1 → ám
…
Top-down analysis keeps expanding the left-recursive Iter:
. Form
. FormV5
…
děl . Iter SuffV5
děl . Iter áv SuffV5
děl . Iter áv áv SuffV5
děl . Iter áv áv áv SuffV5
děl . Iter áv áv áv áv SuffV5
…
Recursion and Infinite-Loop
Prevention
• Convert the grammar to a non-left-recursive one.
• Make sure that the recursive rules are not used until
necessary.
• If there is more than one recursive rule, try all
combinations.
• There is a finite number of combinations—the recursion of
any rule can be stopped once the number of symbols is
greater than the number of the input terminals.
• Ban left recursion, permit right recursion
(Iter → áv | áv Iter).
• Perform bottom-up analysis.
Example of Bottom-Up Analysis
Analyzing the word matce
. matce
. NNíS7 atce
. SuffNNí atce !!! BACK
. NFaD7 tce
. SuffNFa tce !!! BACK
. StemNFeka ce
StemNFeka . ce
StemNFeka . NFekaS3
StemNFeka . SuffNFeka
StemNFeka SuffNFeka .
FormNFeka .
Form .
How to Remember Alternate
Paths
• On a dead end we are supposed to return to the last fork
where more than one rule was available.
• So we have to remember the forks.
• Possibility: stack of alternate states.
• Don’t just generate one new state at a fork. Generate all
possible continuations. Store them on a stack.
• Pick the top state from the stack, make it the current state
and go on with it.
• On a dead end discard the state and pick the next one
from the stack.
Context-Free Analysis as a Path
Searching Problem
• The analysis can be viewed as a general problem of finding
a path in a tree of possibilities from the root to the leaves.
• Depth-first search: the list of possibilities is a stack
(LIFO).
• Breadth-first search: the list of possibilities is a queue
(FIFO).
• Breadth-first search requires more memory for alternate
states but it faces fewer recursion-related problems.
However, with a recursive grammar both approaches may run forever
if the input is ungrammatical.
The Bottom-Up Algorithm
• Pick the next state from the stack (queue) and make it the current state.
• Consider all substrings of the current state that start somewhere to the
left of the period and end at the period. Compare them to the right-hand
sides of all rules. For every substring that corresponds to a right-hand
side generate a new state where the detected right-hand side is
replaced by the left-hand side of the given rule; the rest of the state is
identical to the current state. Put the newly generated state on the
stack.
• Finally generate a state whose only difference to the current state is
that the period is shifted 1 symbol to the right. Put it on the stack.
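The steps above can be sketched in Python. The grammar is the small example grammar used in the stack example of this lecture; the representation of a state as a pair of strings and the name `bottom_up` are assumptions made for this sketch. It relies on every symbol being a single character and on the grammar having no empty right-hand sides.

```python
# Rules as (left-hand side, right-hand side); every symbol is one character.
GRAMMAR = [
    ("S", "CD"),
    ("C", "c"), ("C", "BC"),
    ("D", "d"), ("D", "dC"),
    ("B", "b"), ("B", "ab"),
]


def bottom_up(word, start="S"):
    """Depth-first bottom-up recognition with an explicit stack of states.

    Returns (recognized, number of states visited).
    """
    # A state is (left, right): the symbols before and after the period.
    stack = [("", word)]
    visited = 0
    while stack:
        left, right = stack.pop()
        visited += 1
        if left == start and not right:
            return True, visited
        # Reduce: replace a right-hand side that ends at the period.
        for lhs, rhs in GRAMMAR:
            if left.endswith(rhs):
                stack.append((left[: -len(rhs)] + lhs, right))
        # Shift: move the period one symbol to the right.
        # It is pushed last, so it is the first state to be tried.
        if right:
            stack.append((left + right[0], right[1:]))
    return False, visited


recognized, visited = bottom_up("abcdbc")
```

The function stops at the first analysis; to collect all analyses one would record the success and keep popping states. The second return value shows how many states were visited, which grows quickly with the length of the input.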
Example of Bottom-Up Analysis
Including the Stack
Grammar:
S → C D
C → c | B C
D → d | d C
B → b | ab
Input: abcdbc
Each row shows the current state and the stack (top first) after the
successors of the current state have been generated.
Current state   Stack
.abcdbc         a.bcdbc
a.bcdbc         ab.cdbc
ab.cdbc         abc.dbc, B.cdbc, aB.cdbc
abc.dbc         abcd.bc, abC.dbc, B.cdbc, aB.cdbc
abcd.bc         abcdb.c, abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcdb.c         abcdbc., abcdB.c, abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcdbc.         abcdbC., abcdB.c, abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcdbC.         abcdB.c, abcD.bc, abC.dbc, B.cdbc, aB.cdbc (dead end, nothing added)
abcdB.c         abcdBc., abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcdBc.         abcdBC., abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcdBC.         abcdC., abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcdC.          abcD., abcD.bc, abC.dbc, B.cdbc, aB.cdbc
abcD.           abcD.bc, abC.dbc, B.cdbc, aB.cdbc (dead end, nothing added)
abcD.bc         abcDb.c, abC.dbc, B.cdbc, aB.cdbc
abcDb.c         abcDbc., abcDB.c, abC.dbc, B.cdbc, aB.cdbc
…
The Same RHS Is Recognized at the
Same Position Over and Over!
• The right-hand side b (rule B → b) is recognized at the same
position in each of the states
abcdb.c, abcDb.c, …, abCdb.c, Bcdb.c, aBcdb.c, …
Computational Complexity
• Described algorithm is exponential (all
paths in a tree must be considered).
– Problem: the same right-hand side is repeatedly
compared and recognized at the same position.
• There is a polynomial algorithm:
CYK, chart parser.
Chart Parser
• Chart [ča:t] = Czech “přehled” (overview), “diagram”
– The principal data structure in the chart parser.
– It remembers which right-hand sides have been
recognized and where.
• A note on Czech terminology: chart parser could
in theory be translated as analýza s přehledem
(“analysis with an overview”) but
in practice the original English term is used.
• Chart parsing is a special case of dynamic
programming.
Don’t Look for All Combinations.
Store Each Constituent Separately!
Grammar:
S → C D
C → c | B C
D → d | d C
B → b | ab
String (with positions): 0 a 1 b 2 c 3 d 4 b 5 c 6
Chart: (B,0,2), (B,1,2), (C,2,3), (C,1,3), …, (S,0,6)
State of Analysis
• The input is read one terminal symbol at a time.
• A right-hand side of a rule can be recognized after
any terminal.
• In addition, the chart contains a list of rules whose
right-hand sides are partially read:
– The period delimits the part of the right-hand side that
has been recognized.
– Again we know the positions in the input string where
the right-hand side began and where it currently ends
(where the period is).
– Example: (B → a • b) (0,1)
The Chart
• Agenda. The list of constituents that have been recognized
in the input and are waiting to be processed. The span of
each of them is saved (start and end positions in the input).
• List of “active arcs”, i.e. right-hand sides that have been
partially recognized in the input. The span of each of them
is saved (start position of the RHS in the input and the
position of the period—position to which the RHS has
been recognized)
• List of processed constituents. The span of each of them
is saved. Constituents are moved here from the Agenda
after they have been processed.
Chart Parsing Algorithm
1. Start with empty agenda, list of active arcs and list of
processed constituents.
2. If agenda is empty then read next terminal from the input
and add it to agenda.
3. If agenda is empty and input finished then go to 10.
4. Pick new current constituent (C,i,j) from agenda. The
constituent spans the input substring between positions i
and j.
Chart Parsing Algorithm
5. Consider all grammar rules. For every rule of the form X
→ C X1 … Xn add to the chart a new active arc from i to i
of the form X → • C X1 … Xn. (New rules that start here.)
6. For every active arc from k to i of the form X → X1 … • C
… Xn add to the chart a new active arc from k to j of the
form X → X1 … C • … Xn. (Rules that continue here.)
7. For every active arc from k to j of the form X → X1 … Xn
C • add to the agenda a new constituent X spanning k to j
unless it already is in the agenda or the list of processed
constituents. (Rules that end here.)
Chart Parsing Algorithm
8. Move (C,i,j) from agenda to the list of processed
constituents. If C=S and i,j spans the entire input then an
analysis of the input has been found. Nevertheless there
may be other analyses.
9. Go back to 2.
10. If the list of processed constituents contains (S,0,n) where
n is the number of input terminals then the input has been
recognized and analyzed successfully.
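Steps 1 to 10 can be sketched in Python as follows. This is one possible reading of the algorithm, not a reference implementation: the tuple layouts for arcs and constituents and the name `chart_parse` are assumptions, and the grammar is assumed to have no empty right-hand sides. The grammar is the small example grammar used elsewhere in this lecture.

```python
from collections import deque

# Rules as (left-hand side, right-hand side tuple).
GRAMMAR = [
    ("S", ("C", "D")),
    ("C", ("c",)), ("C", ("B", "C")),
    ("D", ("d",)), ("D", ("d", "C")),
    ("B", ("b",)), ("B", ("a", "b")),
]


def chart_parse(tokens, grammar):
    """Return the set of processed constituents (symbol, start, end)."""
    agenda = deque()   # step 1: everything starts empty
    arcs = set()       # active arcs: (lhs, rhs, period index, start, end)
    processed = set()
    pos = 0
    while True:
        if not agenda:
            if pos == len(tokens):      # step 3: input finished
                break
            agenda.append((tokens[pos], pos, pos + 1))  # step 2
            pos += 1
        c, i, j = agenda.popleft()      # step 4
        # Step 5: rules that start with the current constituent.
        arcs.update((lhs, rhs, 0, i, i) for lhs, rhs in grammar if rhs[0] == c)
        # Step 6: arcs ending at i that expect c are extended to j.
        # The old arcs stay in the chart.
        extended = [
            (lhs, rhs, dot + 1, k, j)
            for (lhs, rhs, dot, k, end) in arcs
            if end == i and dot < len(rhs) and rhs[dot] == c
        ]
        arcs.update(extended)
        # Step 7: arcs with the period at the end yield new constituents.
        for lhs, rhs, dot, k, end in extended:
            item = (lhs, k, end)
            if dot == len(rhs) and item not in processed and item not in agenda:
                agenda.append(item)
        processed.add((c, i, j))        # step 8
    return processed


chart = chart_parse(list("abcdbc"), GRAMMAR)
recognized = ("S", 0, 6) in chart       # step 10
```

For the input abcdbc the chart contains (B,0,2), (B,1,2), (C,2,3), (C,1,3) and finally (S,0,6), so the input is recognized. The sketch only recognizes; it does not yet store how each constituent was composed.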
Complexity of Chart Parsing
• Polynomial: O(g n³), where n is the number of input
terminals and g is the number of grammar rules.
• There are (n+1)²/2 spans (from i to j).
• Maximum number of constituents within one span
equals the number of grammar rules.
• Maximum number of states in which one partially
recognized rule can find itself is n+1 (number of
possible period positions).
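Multiplying the three bounds above gives a rough idea of where the cubic term comes from. This is an informal count, not a tight analysis:

```latex
\underbrace{\frac{(n+1)^2}{2}}_{\text{spans}}
\cdot
\underbrace{g}_{\text{rules per span}}
\cdot
\underbrace{(n+1)}_{\text{period positions}}
= \frac{g\,(n+1)^3}{2}
= O(g\,n^3)
```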
Keep All States of Each Rule!
6. For every active arc from k to i of the form X → X1
… • C … Xn add to the chart a new active arc
from k to j of the form X → X1 … C • … Xn. (Rules
that continue here.)
• After shifting the period keep both the new and the
old state of the rule!
• What if the same constituent is recognized later
that starts at the same position but is longer?
Example
Apply these rules
A → a | a A
B → b A
…
to this input: baac
Active arcs:
B → b • A (0,1)
A → a • (1,2)
B → b A • (0,2)
A → a • (2,3)
A → a A • (1,3)
B → b A • (0,3)
…
• Recognize (A,1,2) but don’t discard the active arc B → b • A (0,1)!
• If we later recognize (A,1,3) we’ll need that active arc to create
B → b A • (0,3)!
How to Remember the Analysis?
• So far we only can figure out whether there is an
analysis, i.e. whether a string is accepted by the
grammar.
• We need to know the constituent hierarchy
(“derivation tree”) as well. We will read the
output of the analysis off it:
– Form ( FormNFeka ( StemNFeka ( m a t ) SuffNFeka
( NFekaS1 ( k a ) ) ) )
How to Remember the Analysis?
7. For every active arc from k to j of the form X → X1 … Xn C
• add to the agenda a new constituent X spanning k to j
unless it already is in the agenda or the list of processed
constituents. (Rules that end here.)
• Keep with every constituent the information what it is
composed of. Same for every partially recognized rule.
• Caution: the same constituent spanning the same input
substring may have arisen in several alternate ways!
How to Remember the Analysis?
• S → A | Ab
• A → a | ab
• string “ab”
– S(A(ab))
– S(A(a)b)
$agenda[$i][$j]{$N}{composition}[$k][$l]
• constituent starts at position $i
• constituent ends at position $j
• constituent is labeled by non-terminal $N
• there is the following information about a constituent
– composition
– description
• of all possible ways of composing the constituent we want the $k-th one
• composition is list of links to subconstituents of which we want the $l-th
Perl:
my @composition;
$agenda[0][2]{"S"}{description} = "S:0:2";
# First way of composing S:0:2: A spanning 0-1 followed by b spanning 1-2.
push(@composition, [$agenda[0][1]{"A"}, $agenda[1][2]{"b"}]);
# Second way: a single A spanning 0-2.
push(@composition, [$agenda[0][2]{"A"}]);
# Append the compositions themselves, not a reference to the whole list,
# so that {composition}[$i][$j] addresses a subconstituent.
push(@{$agenda[0][2]{"S"}{composition}}, @composition);
# Print j-th constituent of i-th composition of constituent S:0:2.
print $agenda[0][2]{"S"}{composition}[$i][$j]{description};
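The same idea can be sketched in Python: every constituent keeps a list of its possible compositions, and the trees are read off recursively. The dictionary layout and the function name `trees` are assumptions made for this illustration; the chart below is filled in by hand for the string “ab”.

```python
# Constituent (label, start, end) -> list of compositions.
# Each composition is a list of subconstituents; terminals have none.
chart = {
    ("a", 0, 1): [],
    ("b", 1, 2): [],
    ("A", 0, 1): [[("a", 0, 1)]],
    ("A", 0, 2): [[("a", 0, 1), ("b", 1, 2)]],
    ("S", 0, 2): [[("A", 0, 2)], [("A", 0, 1), ("b", 1, 2)]],
}


def trees(item):
    """Return every bracketed tree of the given constituent."""
    label = item[0]
    compositions = chart[item]
    if not compositions:          # a terminal
        return [label]
    results = []
    for parts in compositions:
        # Combine every tree of every part (there may be several).
        partial = [""]
        for part in parts:
            partial = [p + t for p in partial for t in trees(part)]
        results.extend(f"{label}({p})" for p in partial)
    return results


analyses = trees(("S", 0, 2))
```

For the string “ab” this yields the two analyses from the slide, S(A(ab)) and S(A(a)b).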
Chart Parsing Example
Grammar:
S → C D
C → c | B C
D → d | d C
B → b | ab
Input: abcdbc
[Figure: the chart for the input abcdbc (positions 0 to 6), filled in step by step as the terminals a, b, c, d, b, c are read and the constituents B, C, D and S are recognized.]
Context-Free Grammars and
Morphological Analysis: A Summary
• They nicely describe regular phenomena.
• They can describe long-distance dependencies!
• “Regular irregularities” may require operations
that are not directly supported by CFGs,
simulation required.
• The grammar grows unbearably.
• High number of inflection classes → we need
good maintenance tools. When the user is to add a
new word we cannot reasonably require the word
to be assigned one of 30 almost identical
paradigms.
