Abstract data types


Parsing
• Calculate the grammatical structure of a program, like diagramming sentences, where:
  – Tokens = "words"
  – Programs = "sentences"
• For further information, read:
  Aho, Sethi, Ullman, "Compilers: Principles, Techniques, and Tools" (a.k.a. the "Dragon Book")
Outline of coverage
• Context-free grammars
• Parsing
  – Tabular parsing methods
  – One pass
    • Top-down
    • Bottom-up
• Yacc
Parser: extracts the grammatical structure of a program
[Figure: parse tree for the statement cout << "hello, world\n" inside main: a function-def node with children name (main), arguments, and stmt-list; the stmt is an expression whose operator is << and whose operands are a variable expression (cout) and a string expression ("hello, world\n").]
Context-free languages
• Grammatical structure is defined by a context-free grammar:
  statement → labeled-statement
            | expression-statement
            | compound-statement
  labeled-statement → ident : statement
                    | case constant-expression : statement
  compound-statement → { declaration-list statement-list }
• "Context-free" = only one non-terminal in the left part
• (In the productions above, ident, case, :, {, and } are terminals; the hyphenated names are non-terminals.)
Parse trees
Parse tree = tree labeled with grammar symbols, such that:
• If a node is labeled A, and its children are labeled x1 … xn, then there is a production A → x1 … xn
• "Parse tree from A" = root labeled with A
• "Complete parse tree" = all leaves labeled with tokens
Parse trees and sentences
• Frontier of tree = labels on leaves (in left-to-right order)
• Frontier of a tree from S is a sentential form
• Frontier of a complete tree from S is a sentence
[Figure: a tree from L whose frontier, read off the leaves, is a ; E.]
Example
G: L → L ; E | E
   E → a | b
Syntax trees from the start symbol (L):
[Figure: three syntax trees from L, with frontiers a, a;E, and a;b;b.]
Sentential forms:
  a
  a;E
  a;b;b
Derivations
Alternate definition of sentence:
• Given α, β in V*, we say α ⇒ β is a derivation step if α = α′Aα″ and β = α′γα″, where A → γ is a production
• β is a sentential form iff there exists a derivation (a sequence of derivation steps) S ⇒ … ⇒ β (alternatively, we say that S ⇒* β)
The two definitions are equivalent, but note that there are many derivations corresponding to each parse tree.
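As an illustrative aside (not part of the original slides), a derivation step is easy to mechanize; the sketch below applies leftmost derivation steps for the earlier grammar G: L → L ; E | E, E → a | b:

```python
# Illustrative sketch: leftmost derivation steps for
# G: L -> L ; E | E ,  E -> a | b.
# Nonterminals are 'L' and 'E'; every other symbol is a terminal.
GRAMMAR = {
    'L': [['L', ';', 'E'], ['E']],
    'E': [['a'], ['b']],
}

def derive_step(form, nonterminal, production):
    """Rewrite the leftmost occurrence of `nonterminal` using `production`."""
    i = form.index(nonterminal)
    return form[:i] + production + form[i + 1:]

# One derivation of the sentence a;b from the start symbol L:
form = ['L']
form = derive_step(form, 'L', GRAMMAR['L'][0])  # L => L ; E
form = derive_step(form, 'L', GRAMMAR['L'][1])  # => E ; E
form = derive_step(form, 'E', GRAMMAR['E'][0])  # => a ; E
form = derive_step(form, 'E', GRAMMAR['E'][1])  # => a ; b
print(''.join(form))  # a;b
```

Every intermediate list here is a sentential form; rewriting in a different order gives a different derivation with the same parse tree.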
Another example
H: L → E ; L | E
   E → a | b
[Figure: example parse trees from L for grammar H, including one with frontier a and one with frontier a ; b ; b.]
Ambiguity
• For some purposes, it is important to know whether a sentence can have more than one parse tree
• A grammar is ambiguous if there is a sentence with more than one parse tree
• Example: E → E+E | E*E | id
[Figure: two parse trees for id + id * id, one grouping it as id + (id * id), the other as (id + id) * id.]
• if e then if b then d else f
• { int x; y = 0; }
• A.b.c = d;
  Id → s | s.id
• E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id
Ambiguity
• Ambiguity is a function of the grammar rather than the language
• Certain ambiguous grammars may have equivalent unambiguous ones
Grammar Transformations
• Grammars can be transformed without affecting the language generated
• Three transformations are discussed next:
  – Eliminating ambiguity
  – Eliminating left recursion (i.e., productions of the form A → Aα)
  – Left factoring
Eliminating Ambiguity
• Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
• For example, expressions involving additions and products can be written as follows:
  E → E+T | T
  T → T*id | id
• The language generated by this grammar is the same as that generated by the grammar on transparency 11. Both generate id(+id|*id)*
• However, this grammar is not ambiguous
Eliminating Ambiguity (Cont.)
• One advantage of this grammar is that it represents the precedence between operators: in the parse tree, products appear nested within additions
[Figure: parse tree for id + id * id, in which E expands to E + T and the T subtree derives id * id.]
Eliminating Ambiguity (Cont.)
• An example of ambiguity in a programming language is the dangling else
• Consider
  S → if E then S else S
    | if E then S
    | other
Eliminating Ambiguity (Cont.)
• When there are two nested ifs and only one else...
[Figure: two parse trees for "if E then if E then S else S", one attaching the else to the inner if, the other attaching it to the outer if.]
Eliminating Ambiguity (Cont.)
• In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar:
  S → Matched
    | Unmatched
  Matched → if E then Matched else Matched
          | other
  Unmatched → if E then S
            | if E then Matched else Unmatched
Eliminating Ambiguity (Cont.)
• Ambiguity is a property of the grammar
• It is undecidable whether a context-free grammar is ambiguous
• The proof is done by reduction from Post's correspondence problem
• Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars
Eliminating Ambiguity (Cont.)
• For example, a grammar containing the production A → AA | α would be ambiguous, because the string ααα has two parses:
[Figure: two parse trees for ααα, one grouping it as (αα)α, the other as α(αα).]
• This ambiguity disappears if we use the productions
  A → AB | B and B → α
  or the productions
  A → BA | B and B → α
Eliminating Ambiguity (Cont.)
• Examples of ambiguous productions:
  A → AA
  A → Aα | αA
  A → α | AA
• A language generated by an ambiguous CFG is inherently ambiguous if it has no unambiguous CFG
  – An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar:
    S → AB | DC
    A → aA | ε
    C → cC | ε
    B → bBc | ε
    D → aDb | ε
Elimination of Left Recursion
• A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α. Top-down parsing methods (to be discussed shortly) cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed.
• Immediate left recursion (productions of the form A → Aα) can be easily eliminated. We group the A-productions as
  A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
  where no βi begins with A. Then we replace the A-productions by
  A → β1A′ | β2A′ | … | βnA′
  A′ → α1A′ | α2A′ | … | αmA′ | ε
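The grouping-and-replacement step above can be sketched directly in Python (an illustrative sketch; the representation of productions as tuples and the A′ naming convention are my own):

```python
# Eliminate immediate left recursion:
#   A -> A a1 | ... | A am | b1 | ... | bn   becomes
#   A -> b1 A' | ... | bn A'
#   A' -> a1 A' | ... | am A' | epsilon
EPS = ()  # the empty right-hand side stands for epsilon

def eliminate_immediate_left_recursion(nt, productions):
    """productions: the right-hand sides of nt, each a tuple of symbols."""
    alphas = [p[1:] for p in productions if p and p[0] == nt]   # the A-alpha tails
    betas = [p for p in productions if not (p and p[0] == nt)]  # the betas
    if not alphas:
        return {nt: productions}  # nothing to do
    new = nt + "'"
    return {
        nt: [beta + (new,) for beta in betas],
        new: [alpha + (new,) for alpha in alphas] + [EPS],
    }

# E -> E + T | T   becomes   E -> T E' ,  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion('E', [('E', '+', 'T'), ('T',)]))
# {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()]}
```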
Elimination of Left Recursion (Cont.)
• The previous transformation, however, does not eliminate left recursion involving two or more steps. For example, consider the grammar
  S → Aa | b
  A → Ac | Sd | ε
• S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive
Elimination of Left Recursion (Cont.)
Algorithm. Eliminate left recursion
  Arrange the nonterminals in some order A1, A2, …, An
  for i = 1 to n {
    for j = 1 to i-1 {
      replace each production of the form Ai → Aj γ by the productions
        Ai → δ1 γ | δ2 γ | … | δn γ
      where Aj → δ1 | δ2 | … | δn are all the current Aj-productions
    }
    eliminate the immediate left recursion among the Ai-productions
  }
Elimination of Left Recursion (Cont.)
• To show that the previous algorithm actually works, all we need to notice is that iteration i only changes productions with Ai on the left-hand side, and that m > i in all productions of the form Ai → Am α
• Induction proof:
  – Clearly true for i = 1
  – If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Am α with m < i
  – Finally, with the elimination of self recursion, m in the Ai → Am α productions is forced to be > i
• So, at the end of the algorithm, all derivations of the form Ai ⇒+ Am α will have m > i, and therefore left recursion is not possible
Left Factoring
• Left factoring helps transform a grammar for predictive parsing
• For example, if we have the two productions
  S → if E then S else S
    | if E then S
  then on seeing the input token if, we cannot immediately tell which production to choose to expand S
• In general, if we have A → αβ1 | αβ2 and the input begins with α, we do not know (without looking further) which production to use to expand A
Left Factoring (Cont.)
• However, we may defer the decision by expanding A to αA′
• Then, after seeing the input derived from α, we may expand A′ to β1 or to β2
• Left-factored, the original productions become
  A → αA′
  A′ → β1 | β2
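One left-factoring step can be sketched as follows (illustrative; `left_factor` is an invented helper, and it assumes the productions being factored all share the common prefix):

```python
import os

def left_factor(nt, productions):
    """Factor the longest common prefix out of `productions` (tuples of symbols)."""
    # os.path.commonprefix compares element-wise, so it works on symbol lists too
    prefix = os.path.commonprefix([list(p) for p in productions])
    if not prefix:
        return {nt: productions}  # no common prefix: nothing to factor
    new = nt + "'"
    k = len(prefix)
    return {
        nt: [tuple(prefix) + (new,)],
        new: [p[k:] for p in productions],  # () plays the role of epsilon
    }

# S -> if E then S else S | if E then S
# becomes  S -> if E then S S' ,  S' -> else S | epsilon
print(left_factor('S', [('if', 'E', 'then', 'S', 'else', 'S'),
                        ('if', 'E', 'then', 'S')]))
# {'S': [('if', 'E', 'then', 'S', "S'")], "S'": [('else', 'S'), ()]}
```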

Non-Context-Free Language Constructs
• Examples of non-context-free languages are:
  – L1 = {wcw | w is of the form (a|b)*}
  – L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  – L3 = {a^n b^n c^n | n ≥ 0}
• Languages similar to these that are context-free:
  – L′1 = {w c w^R | w is of the form (a|b)*} (w^R stands for w reversed)
    This language is generated by the grammar
    S → aSa | bSb | c
  – L′2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
    This language is generated by the grammar
    S → aSd | aAd
    A → bAc | bc
Non-Context-Free Language Constructs (Cont.)
• L″2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1} is generated by the grammar
  S → AB
  A → aAb | ab
  B → cBd | cd
• L′3 = {a^n b^n | n ≥ 1} is generated by the grammar
  S → aSb | ab
  This language is not definable by any regular expression
Non-Context-Free Language Constructs (Cont.)
• Suppose we could construct a DFSM D accepting L′3.
• D must have a finite number of states, say k.
• Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
• Since D only has k states, two of the states in the sequence have to be equal. Say, si ≡ sj (i ≠ j).
• From si, a sequence of i b's leads to an accepting (final) state. Therefore, the same sequence of i b's will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L′3. A contradiction.
Parsing
• The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w (equivalently, find a derivation of w)
• A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
• Two classes of algorithms for parsing:
  – Top-down
  – Bottom-up
Parser generators
• A parser generator is a program that reads a grammar and produces a parser
• The best known parser generator is yacc; it produces bottom-up parsers
• Most parser generators (including yacc) do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
Top-down parsing
• Starting from a parse tree containing just S, build the tree down toward the input
• Expand the left-most non-terminal
• Algorithm: (next slide)
Top-down parsing (cont.)
Let input = a1 a2 … an
current sentential form (csf) = S
loop {
  suppose csf = t1 … tk A γ
  if t1 … tk ≠ a1 … ak, it's an error
  based on ak+1 …, choose a production A → β
  csf becomes t1 … tk β γ
}
Top-down parsing example
Grammar: H: L → E ; L | E
            E → a | b
Input: a;b

Sentential form    Input
L                  a;b
E ; L              a;b
a ; L              a;b
[Figure: the parse tree grows from L, to L with children E ; L, to the same tree with the E expanded to a.]

Top-down parsing example (cont.)
Sentential form    Input
a ; E              a;b
a ; b              a;b
[Figure: the complete parse tree, with L expanding to E ; L, the first E to a, the inner L to E, and that E to b; its frontier is a;b.]
LL(1) parsing
• An efficient form of top-down parsing
• Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: Σ × N → P in the "choose production" step of the algorithm
• When this works, the grammar is called LL(1)
LL(1) examples
• Example 1:
  H: L → E ; L | E
  E → a | b
  Given input a;b, the next symbol is a.
  Which production to use? Can't tell.
  ⇒ H is not LL(1)
LL(1) examples
• Example 2:
  Exp → Term Exp′
  Exp′ → $ | + Exp
  Term → id
  (Use $ for the "end-of-input" symbol.)
  The grammar is LL(1): Exp and Term have only one production; Exp′ has two productions, but only one is applicable at any time.
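Because only one production is ever applicable, this grammar can be parsed by straightforward recursive descent; the sketch below (illustrative, with tokens as plain strings) has one function per nonterminal:

```python
def parse(tokens):
    """Recognize the grammar Exp -> Term Exp', Exp' -> $ | + Exp, Term -> id."""
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f'expected {t!r}, got {peek()!r}')
        pos += 1
    def exp():
        term(); exp_prime()
    def exp_prime():
        if peek() == '+':   # Exp' -> + Exp
            eat('+'); exp()
        else:               # Exp' -> $
            eat('$')
    def term():
        eat('id')           # Term -> id
    exp()
    return pos == len(tokens)

print(parse(['id', '+', 'id', '$']))  # True
```

Each function looks at only the next token before committing to a production, which is exactly the LL(1) condition.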
Nonrecursive predictive parsing
• It is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than implicitly via recursive calls
• The key problem during predictive parsing is that of determining the production to be applied for a nonterminal
Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Set ip to point to the first symbol of w$.
repeat
  Let X be the top-of-stack symbol and a the symbol pointed to by ip
  if X is a terminal or $ then
    if X == a then
      pop X from the stack and advance ip
    else error()
  else // X is a nonterminal
    if M[X,a] == X → Y1 Y2 … Yk then
      pop X from the stack
      push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top
      (push nothing if Y1 Y2 … Yk is ε)
      output the production X → Y1 Y2 … Yk
    else error()
until X == $
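The stack-based algorithm above can be rendered in Python; the table M below is written out by hand for the earlier grammar Exp → Term Exp′, Exp′ → $ | + Exp, Term → id (an illustrative sketch that treats $ as an ordinary input token, so the input must end with $):

```python
TABLE = {  # M[X, a] -> right-hand side to push
    ('Exp', 'id'): ['Term', "Exp'"],
    ("Exp'", '+'): ['+', 'Exp'],
    ("Exp'", '$'): ['$'],
    ('Term', 'id'): ['id'],
}
NONTERMINALS = {'Exp', "Exp'", 'Term'}

def predictive_parse(tokens):
    stack = ['Exp']  # start symbol on top; input is expected to end with '$'
    i = 0
    while stack:
        X = stack.pop()
        a = tokens[i]
        if X not in NONTERMINALS:  # terminal (or $): must match the input
            if X != a:
                return False
            i += 1
        elif (X, a) in TABLE:      # expand X using M[X, a]
            stack.extend(reversed(TABLE[X, a]))  # leftmost symbol ends up on top
        else:                      # empty table entry = error
            return False
    return i == len(tokens)

print(predictive_parse(['id', '+', 'id', '$']))  # True
```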
LL(1) grammars
• No left recursion
  A → Aα: if this production is chosen, the parse makes no progress
• No common prefixes
  A → αβ | αγ
  Can fix by "left factoring":
  A → αA′
  A′ → β | γ
LL(1) grammars (cont.)
• No ambiguity
  A precise definition requires that the production to choose be unique (the "choose" function M is very hard to calculate otherwise)
Top-down Parsing
[Figure: starting from the start symbol L, the root of the parse tree, the tree is "grown" downwards from left to right; with input tokens <t0, t1, …, ti, …>, L expands to children E0 … En, and after the leading tokens are matched the remaining input is <ti, …>.]
Checking LL(1)-ness
• For any sequence of grammar symbols α, define the set FIRST(α) ⊆ Σ to be
  FIRST(α) = { a | α ⇒* aβ for some β }
Checking LL(1)-ness
• Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two leftmost derivations (in which the leftmost nonterminal is always expanded first)
  S ⇒* wAγ ⇒ wαγ ⇒* wx
  S ⇒* wAγ ⇒ wβγ ⇒* wy
  such that FIRST(x) = FIRST(y), it follows that α = β
• In other words, given
  1. a string wAγ in V* and
  2. the first terminal symbol to be derived from A, say t,
  there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
• FIRST sets can often be calculated by inspection
FIRST Sets
Exp → Term Exp′
Exp′ → $ | + Exp
Term → id
(Use $ for the "end-of-input" symbol)

FIRST($) = {$}
FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = {}
⇒ the grammar is LL(1)
FIRST Sets
L → E ; L | E
E → a | b

FIRST(E ; L) = {a, b} = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {}
⇒ the grammar is not LL(1)
Computing FIRST Sets
Algorithm. Compute FIRST(X) for all grammar symbols X
forall X ∈ V do FIRST(X) = {}
forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
repeat
  c: forall productions X → Y1 Y2 … Yk do
    forall i ∈ [1,k] do
      FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
      if ε ∉ FIRST(Yi) then continue c
    FIRST(X) = FIRST(X) ∪ {ε}
until no more terminals or ε are added to any FIRST set
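The fixed-point algorithm above transcribes almost line for line into Python (an illustrative sketch; here '' stands for ε, and a grammar maps each nonterminal to a list of right-hand sides):

```python
EPS = ''  # stands for epsilon

def first_sets(grammar, terminals):
    """grammar: {nonterminal: [tuple of symbols, ...]}. Returns all FIRST sets."""
    first = {t: {t} for t in terminals}       # FIRST(a) = {a} for terminals
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for X, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[X])
                for Y in rhs:
                    first[X] |= first[Y] - {EPS}
                    if EPS not in first[Y]:
                        break
                else:  # reached only if every Yi derives epsilon (or rhs is empty)
                    first[X].add(EPS)
                if len(first[X]) != before:
                    changed = True
    return first

G = {'L': [('E', ';', 'L'), ('E',)], 'E': [('a',), ('b',)]}
print(sorted(first_sets(G, {'a', 'b', ';'})['L']))  # ['a', 'b']
```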
FIRST Sets of Strings of Symbols
• FIRST(X1 X2 … Xn) is the union of FIRST(X1) and all FIRST(Xi) such that ε ∈ FIRST(Xk) for k = 1, 2, …, i-1
• FIRST(X1 X2 … Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, …, n
FIRST Sets do not Suffice
• Given the productions
  A → T x
  A → T y
  T → w
  T → ε
• T → w should be applied when the next input token is w
• T → ε should be applied whenever the next terminal (the one pointed to by ip) is either x or y
FOLLOW Sets
• For any nonterminal X, define the set FOLLOW(X) ⊆ Σ as
  FOLLOW(X) = { a | S ⇒* αXaβ }
Computing the FOLLOW Set
Algorithm. Compute FOLLOW(X) for all nonterminals X
FOLLOW(S) = {$}
forall productions A → αBβ do FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) - {ε})
repeat
  forall productions A → αB, or A → αBβ with ε ∈ FIRST(β), do
    FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
until all FOLLOW sets remain the same
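The FOLLOW fixed point can be sketched the same way (illustrative; the FIRST sets here are written down by inspection for the grammar E → T E′, E′ → + T E′ | ε, T → id, with '' standing for ε):

```python
EPS = ''
GRAMMAR = {                       # E -> T E' ,  E' -> + T E' | eps ,  T -> id
    'E':  [('T', "E'")],
    "E'": [('+', 'T', "E'"), ()],
    'T':  [('id',)],
}
FIRST = {'E': {'id'}, "E'": {'+', EPS}, 'T': {'id'}, '+': {'+'}, 'id': {'id'}}

def first_of(seq):
    """FIRST of a string of symbols, per the earlier slide."""
    out = set()
    for sym in seq:
        out |= FIRST[sym] - {EPS}
        if EPS not in FIRST[sym]:
            return out
    out.add(EPS)  # every symbol (or no symbol) can derive epsilon
    return out

def follow_sets(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add('$')
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in grammar:      # only nonterminals get FOLLOW sets
                        continue
                    rest = first_of(rhs[i + 1:])
                    new = (rest - {EPS}) | (follow[A] if EPS in rest else set())
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

print(sorted(follow_sets(GRAMMAR, 'E')['T']))  # ['$', '+']
```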
Construction of a predictive parsing table
Algorithm. Construction of a predictive parsing table
M[:,:] = {}
forall productions A → α do
  forall terminals a ∈ FIRST(α) do
    M[A,a] = M[A,a] ∪ {A → α}
  if ε ∈ FIRST(α) then
    forall b ∈ FOLLOW(A) do
      M[A,b] = M[A,b] ∪ {A → α}
Make all empty entries of M be error
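The construction above, sketched for the grammar E → T E′, E′ → + T E′ | ε, T → id (an illustrative sketch: FIRST of each right-hand side and the FOLLOW sets are filled in by inspection, and an entry absent from M plays the role of error):

```python
EPS = ''  # stands for epsilon
GRAMMAR = {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()], 'T': [('id',)]}
FIRST_RHS = {  # FIRST of each right-hand side, by inspection
    ('T', "E'"): {'id'},
    ('+', 'T', "E'"): {'+'},
    (): {EPS},
    ('id',): {'id'},
}
FOLLOW = {'E': {'$'}, "E'": {'$'}, 'T': {'+', '$'}}

M = {}
for A, rhss in GRAMMAR.items():
    for rhs in rhss:
        for a in FIRST_RHS[rhs] - {EPS}:
            M.setdefault((A, a), set()).add(rhs)      # M[A,a] += {A -> rhs}
        if EPS in FIRST_RHS[rhs]:
            for b in FOLLOW[A]:
                M.setdefault((A, b), set()).add(rhs)  # M[A,b] += {A -> rhs}

print(M[("E'", '$')])  # {()} : on end-of-input, apply E' -> epsilon
```

Every entry of M ends up with exactly one production, so this grammar is LL(1).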
Another Definition of LL(1)
Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | … | αn
  FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = {} for all i ≠ j
Regular Languages
• Definition. A regular grammar is one whose productions are all of the type:
  – A → aB
  – A → a
• A regular expression is either:
  – a
  – R1 | R2
  – R1 R2
  – R*
Nondeterministic Finite State Automaton
[Figure: an NFA with start state 0, further states 1, 2, and 3, and transitions labeled a and b.]
Regular Languages
• Theorem. The following classes of languages coincide:
  – generated by a regular grammar
  – expressed by a regular expression
  – recognized by a NDFS automaton
  – recognized by a DFS automaton
Deterministic Finite Automaton
[Figure: a scanner DFA. From START, space, tab, and new line return to START; a digit leads to NUM, which loops on digits; a letter leads toward KEYWORD; and =, +, -, /, (, ) lead to OPERATOR.]
Legend: circle = state; double circle = accept state; arrow = transition; bold, capitalized labels = state names; lower-case labels = transition characters.
Scanner code

state := start
loop
  if no input character buffered then read one, and add it to the accumulated token
  case state of
    start:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := num
        else ...
      end
    id:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := id
        else ...
      end
    num:
      case input_char of
        0..9 : ...
        ...
        else ...
      end
    ...
  end;
end;
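A runnable Python rendering of this sketch (the state machine is folded into ordinary control flow; the token classes and the operator set are taken from the DFA figure, everything else is invented for illustration):

```python
def scan(text):
    """Split `text` into (kind, lexeme) tokens: id, num, or op."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():                 # START: skip white space
            i += 1
        elif c.isalpha():               # id state: letters then letters/digits
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(('id', text[i:j])); i = j
        elif c.isdigit():               # num state: a run of digits
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(('num', text[i:j])); i = j
        elif c in '=+-/()':             # operator state: single-character tokens
            tokens.append(('op', c)); i += 1
        else:
            raise SyntaxError(f'unexpected character {c!r}')
    return tokens

print(scan('x1 = 42 + y'))
# [('id', 'x1'), ('op', '='), ('num', '42'), ('op', '+'), ('id', 'y')]
```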
Table-driven DFA

              0-start   1-num   2-id    3-operator   4-keyword
white space   0         exit    exit    exit         exit
letter        2         error   2       exit         error
digit         1         1       2       exit         error
operator      3         exit    exit    exit         exit
$             4         error   error   exit         4
Language Classes
[Figure: containment of language classes, from largest to smallest: L0 ⊃ CSL ⊃ CFL [NPA] ⊃ LR(1) ⊃ LL(1) ⊃ RL [DFA=NFA].]
Question
• Are regular expressions, as provided by Perl or other languages, sufficient for parsing nested structures, e.g. XML files?