Abstract data types


Parsing
Giuseppe Attardi
Università di Pisa
Parsing
Calculate grammatical structure of
program, like diagramming
sentences, where:
Tokens = “words”
Programs = “sentences”
For further information:
Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)
Outline of coverage
Context-free grammars
Parsing
– Tabular Parsing Methods
– One pass
• Top-down
• Bottom-up
Yacc
Parser: extracts grammatical structure of program
[Figure: parse tree of a function-def with name main, arguments, and a stmt-list; the stmt is an expression of the form expression operator expression, with variable cout, operator << and string “hello, world\n”]
Context-free languages
Grammatical structure defined by context-free grammar
statement → labeled-statement
| expression-statement
| compound-statement
labeled-statement → ident : statement
| case constant-expression : statement
compound-statement → { declaration-list statement-list }
“Context-free” = only one non-terminal in left-part
[Legend on the original slide: terminal and non-terminal symbols are set in different styles]
Parse trees
Parse tree = tree labeled with grammar symbols, such that:
– If node is labeled A, and its children are labeled x1...xn, then there is a production A → x1...xn
– “Parse tree from A” = root labeled with A
– “Complete parse tree” = all leaves labeled with tokens
Parse trees and sentences
Frontier of tree = labels on leaves (in left-to-right order)
– Frontier of tree from S is a sentential form
– Frontier of a complete tree from S is a sentence
[Figure: parse tree L(L(E(a)) ; E); its “frontier” is a ; E]
Example
G: L → L ; E | E
E → a | b
Syntax trees from start symbol (L):
[Figure: three trees, written here in bracket notation]
L(E(a))
L(L(E(a)) ; E)
L(L(L(E(a)) ; E(b)) ; E(b))
Sentential forms:
a
a;E
a;b;b
Derivations
Alternate definition of sentence:
– Given α, β in V*, say α ⇒ β is a derivation step if α = α’Aα’’ and β = α’γα’’, where A → γ is a production
– α is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ α (alternatively, we say that S ⇒* α)
Two definitions are equivalent, but note that there are many derivations corresponding to each parse tree
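Another example
H: L → E ; L | E
E → a | b
[Figure: parse trees from L for grammar H; the recursion is now on the right, so each tree grows down its right-hand side, e.g. L(E(a) ; L(E))]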
Ambiguity
For some purposes, it is important to know whether a sentence can have more than one parse tree
– A grammar is ambiguous if there is a sentence with more than one parse tree
– Example: E → E+E | E*E | id
[Figure: two parse trees for the same sentence, e.g. id + id * id: one with + at the root, E(E(id) + E(E(id) * E(id))), and one with * at the root, E(E(E(id) + E(id)) * E(id))]
Notes
– if e then if b then d else f
– { int x; y = 0; }
– A.b.c = d;
– Id -> s | s.id
E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id
Ambiguity
Ambiguity is a function of the grammar rather than the language
– Certain ambiguous grammars may have equivalent unambiguous ones
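To make the definition concrete, here is a minimal Java sketch that counts the parse trees of a token list under the ambiguous grammar E → E+E | E*E | id. The class name ParseTreeCounter and the token spelling are assumptions made for illustration; the counting simply tries every operator as the root of the tree.

```java
import java.util.List;

/** Counts parse trees of a token list under E -> E+E | E*E | id (illustrative sketch). */
final class ParseTreeCounter {
    /** Number of distinct parse trees deriving tokens[from..to) from E. */
    static long count(List<String> tokens, int from, int to) {
        if (to - from == 1) {
            return tokens.get(from).equals("id") ? 1 : 0;   // E -> id
        }
        long total = 0;
        // E -> E op E: try every operator position as the root of the tree.
        for (int op = from + 1; op < to - 1; op++) {
            String t = tokens.get(op);
            if (t.equals("+") || t.equals("*")) {
                total += count(tokens, from, op) * count(tokens, op + 1, to);
            }
        }
        return total;
    }

    static long count(List<String> tokens) {
        return tokens.isEmpty() ? 0 : count(tokens, 0, tokens.size());
    }
}
```

For the tokens id + id * id the count is 2, one tree per choice of root operator, which is exactly what makes the grammar ambiguous. An unambiguous grammar for the same language would give every sentence exactly one tree.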

Grammar Transformations
Grammars can be transformed without affecting the language generated
Three transformations are discussed next:
– Eliminating Ambiguity
– Eliminating Left Recursion (i.e. productions of the form A → Aα)
– Left Factoring
Eliminating Ambiguity
Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
– For example, expressions involving additions and products can be written as follows:
E → E + T | T
T → T * id | id
– The language generated by this grammar is the same as that generated by the grammar in slide “Ambiguity”. Both generate id(+id|*id)*
– However, this grammar is not ambiguous
Eliminating Ambiguity (Cont.)
One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions
[Figure: parse tree of id + id * id: E(E(T(id)) + T(T(id) * id))]
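Eliminating Ambiguity (Cont.)
An example of ambiguity in a programming language is the dangling else
– Consider
S → if cond then S else S | if cond then S | other
(cond stands for the condition and other for any other statement; the original symbols were lost in extraction)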

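Eliminating Ambiguity (Cont.)
When there are two nested ifs and only one else, the sentence has two parse trees:
[Figure: in the first tree the else is attached to the outer if, S(if cond then S(if cond then S) else S); in the second it is attached to the inner if, S(if cond then S(if cond then S else S))]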

Eliminating Ambiguity (Cont.)
In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar:
S → Matched
| Unmatched
Matched → if cond then Matched else Matched
| other
Unmatched → if cond then S
| if cond then Matched else Unmatched
Eliminating Ambiguity (Cont.)
Ambiguity is a property of the grammar
– It is undecidable whether a context-free grammar is ambiguous
– The proof is by reduction from Post’s correspondence problem
– Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars

Eliminating Ambiguity (Cont.)
For example, a grammar containing the productions A → AA | α would be ambiguous, because the substring ααα has two parses:
[Figure: the two parse trees A(A(A(α) A(α)) A(α)) and A(A(α) A(A(α) A(α)))]
This ambiguity disappears if we use the productions
A → AB | B and B → α
or the productions
A → BA | B and B → α.
Eliminating Ambiguity (Cont.)
Examples of ambiguous productions:
A → AαA
A → αA | Aβ
A → αA | αAβA
A CF language is inherently ambiguous if it has no unambiguous CFG
– An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar:
S → AB | DC
A → aA | ε
C → cC | ε
B → bBc | ε
D → aDb | ε
Elimination of Left Recursion
A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α
– Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
Immediate left recursion (productions of the form A → Aα) can be easily eliminated:
1. Group the A-productions as
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
where no βi begins with A
2. Replace the A-productions by
A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε
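The two steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name ImmediateLeftRecursion, the representation of a right-hand side as a list of symbol names (an empty list standing for ε), and the naming of the new nonterminal as A followed by an apostrophe are all choices made here, and every α is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ImmediateLeftRecursion {
    /**
     * Rewrites the alternatives of nonterminal a (an empty list stands for epsilon).
     * Returns the new productions for a and, if any were left recursive, for a + "'".
     */
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> alts) {
        List<List<String>> alphas = new ArrayList<>(); // tails of A -> A alpha
        List<List<String>> betas = new ArrayList<>();  // alternatives not starting with A
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                alphas.add(alt.subList(1, alt.size()));
            } else {
                betas.add(alt);
            }
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (alphas.isEmpty()) {                    // no immediate left recursion
            result.put(a, alts);
            return result;
        }
        String aPrime = a + "'";
        List<List<String>> newA = new ArrayList<>();
        for (List<String> beta : betas) {          // A -> beta A'
            List<String> rhs = new ArrayList<>(beta);
            rhs.add(aPrime);
            newA.add(rhs);
        }
        List<List<String>> newAPrime = new ArrayList<>();
        for (List<String> alpha : alphas) {        // A' -> alpha A'
            List<String> rhs = new ArrayList<>(alpha);
            rhs.add(aPrime);
            newAPrime.add(rhs);
        }
        newAPrime.add(new ArrayList<>());          // A' -> epsilon
        result.put(a, newA);
        result.put(aPrime, newAPrime);
        return result;
    }
}
```

Applied to E → E + T | T, the sketch yields E → T E' and E' → + T E' | ε.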
Elimination of Left Recursion (Cont.)
The previous transformation, however, does not eliminate left recursion involving two or more steps
– For example, consider the grammar
S → Aa | b
A → Ac | Sd | ε
S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive
Elimination of Left Recursion (Cont.)
Algorithm. Eliminate left recursion
Arrange nonterminals in some order A1, A2, …, An
for i = 1 to n {
  for j = 1 to i - 1 {
    replace each production of the form Ai → Ajγ
    by the productions Ai → δ1γ | δ2γ | … | δkγ
    where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
  }
  eliminate the immediate left recursion among the Ai-productions
}
Elimination of Left Recursion (Cont.)


To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side. After iteration i, m > i in all productions of the form Ai → Amα
Induction proof:
– Clearly true for i = 1
– If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Amα with m < i
– Finally, with the elimination of self recursion, m in the Ai → Amα productions is forced to be > i
At the end of the algorithm, all derivations of the form Ai ⇒+ Amα will have m > i and therefore left recursion would not be possible
Left Factoring



Left factoring helps transform a grammar for predictive parsing
For example, if we have the two productions
S → if cond then S else S
| if cond then S
on seeing the input token if, we cannot immediately tell which production to choose to expand S
In general, if we have A → αβ1 | αβ2 and the input begins with a string derived from α, we do not know (without looking further) which production to use to expand A
Left Factoring (Cont.)
However, we may defer the decision by expanding A to αA’
– Then after seeing the input derived from α, we may expand A’ to β1 or to β2
– Left-factored, the original productions become
A → αA’
A’ → β1 | β2
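As a rough illustration of the transformation, the Java sketch below factors out the longest prefix shared by all alternatives of one nonterminal. The class name LeftFactoring, the list-of-symbols representation (empty list for ε) and the token cond for the condition are assumptions; a full left-factoring pass would also group alternatives that share a prefix with only some of the others, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LeftFactoring {
    /** Longest common prefix (alpha) of all alternatives. */
    static List<String> commonPrefix(List<List<String>> alts) {
        List<String> prefix = new ArrayList<>(alts.get(0));
        for (List<String> alt : alts) {
            int n = 0;
            while (n < prefix.size() && n < alt.size() && prefix.get(n).equals(alt.get(n))) {
                n++;
            }
            prefix = new ArrayList<>(prefix.subList(0, n));
        }
        return prefix;
    }

    /** A -> alpha beta1 | alpha beta2 | ... becomes A -> alpha A', A' -> beta1 | beta2 | ... */
    static Map<String, List<List<String>>> factor(String a, List<List<String>> alts) {
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        List<String> alpha = commonPrefix(alts);
        if (alpha.isEmpty() || alts.size() < 2) {   // nothing to factor
            result.put(a, alts);
            return result;
        }
        String aPrime = a + "'";
        List<String> head = new ArrayList<>(alpha);
        head.add(aPrime);                           // A -> alpha A'
        List<List<String>> tails = new ArrayList<>();
        for (List<String> alt : alts) {             // A' -> beta_i (empty list = epsilon)
            tails.add(new ArrayList<>(alt.subList(alpha.size(), alt.size())));
        }
        result.put(a, List.of(head));
        result.put(aPrime, tails);
        return result;
    }
}
```

For the two if-productions this gives S → if cond then S S' and S' → else S | ε, so the choice between the alternatives is postponed until the parser can see whether an else follows.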

Non-Context-Free Language Constructs

Examples of non-context-free languages are:
– L1 = {wcw | w is of the form (a|b)*}
– L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
– L3 = {a^n b^n c^n | n ≥ 0}
Languages similar to these that are context free
– L’1 = {wcw^R | w is of the form (a|b)*} (w^R stands for w reversed)
This language is generated by the grammar
S → aSa | bSb | c
– L’2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
This language is generated by the grammar
S → aSd | aAd
A → bAc | bc
Non-Context-Free Language Constructs (Cont.)
L”2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1}
is generated by the grammar
S → AB
A → aAb | ab
B → cBd | cd
L’3 = {a^n b^n | n ≥ 1}
is generated by the grammar
S → aSb | ab
This language is not definable by any regular expression
Non-Context-Free Language Constructs (Cont.)
Suppose we could construct a DFSM D accepting L’3.
D must have a finite number of states, say k.
Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
Since D only has k states, two of the k+1 states in the sequence have to be equal. Say, si = sj (i ≠ j); at least one of the two indices is positive, so we may assume i ≥ 1.
From si, a sequence of i bs leads to an accepting (final) state. Therefore, the same sequence of i bs will also lead to an accepting state from sj.
Therefore D would accept a^j b^i with j ≠ i, which means that the language accepted by D is not identical to L’3. A contradiction.
Parsing
The parsing problem is: Given string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w from S)
A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
Two classes of algorithms for parsing:
– Top-down
– Bottom-up
Parser generators

A parser generator is a program that reads a grammar and produces a parser
– The best known parser generator is yacc. It produces bottom-up parsers
– Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
Top-down parsing
Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal.
Algorithm: (next slide)

Top-down parsing (cont.)
Let input = a1a2...an
current sentential form (csf) = S
loop {
  suppose csf = a1…ak A α
  based on ak+1…, choose production A → β
  csf becomes a1…ak β α
}
Top-down parsing example
Grammar: H: L → E ; L | E
E → a | b
Input: a;b
Parse tree            Sentential form   Input
L                     L                 a;b
L(E ; L)              E;L               a;b
L(E(a) ; L)           a;L               a;b
L(E(a) ; L(E))        a;E               a;b
L(E(a) ; L(E(b)))     a;b               a;b
LL(1) parsing
Efficient form of top-down parsing
– Use only first symbol of remaining input (ak+1) to choose next production. That is, employ a function M: Σ × N → P in “choose production” step of algorithm.
– When this is possible, grammar is called LL(1)

LL(1) examples

Example 1:
H: L → E ; L | E
E → a | b
Given input a;b, so next symbol is a.
Which production to use? Can’t tell.
⇒ H not LL(1)
LL(1) examples

Example 2:
Exp → Term Exp’
Exp’ → $ | + Exp
Term → id
(Use $ for “end-of-input” symbol.)
Grammar is LL(1): Exp and Term have only
one production; Exp’ has two productions
but only one is applicable at any time.
Nonrecursive predictive parsing
– Maintain a stack explicitly, rather than implicitly via recursive calls
– Key problem during predictive parsing: determining the production to be applied for a non-terminal

Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Initialize the stack to contain $ with the start symbol S on top.
Set ip to point to the first symbol of w$.
repeat
  Let X be the top of the stack symbol and a the symbol pointed to by ip
  if X is a terminal or $ then
    if X == a then
      pop X from the stack and advance ip
    else error()
  else // X is a nonterminal
    if M[X,a] == X → Y1 Y2 … Yk then
      pop X from the stack
      push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top
      (push nothing if Y1 Y2 … Yk is ε)
      output the production X → Y1 Y2 … Yk
    else error()
until X == $
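The loop above might look as follows in Java. This is a sketch under assumptions, not the course's implementation: the class name PredictiveParser is invented, the table M is written out by hand, and the grammar is a variant of Exp → Term Exp’, Term → id in which Exp’ → + Exp | ε, so that $ serves only as the end marker at the bottom of the stack and at the end of the input.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PredictiveParser {
    static final Set<String> NONTERMINALS = Set.of("Exp", "Exp'", "Term");

    // Hand-written table M[nonterminal][lookahead] = right-hand side (empty list = epsilon).
    static final Map<String, Map<String, List<String>>> M = Map.of(
        "Exp",  Map.of("id", List.of("Term", "Exp'")),
        "Exp'", Map.of("+", List.of("+", "Exp"), "$", List.<String>of()),
        "Term", Map.of("id", List.of("id")));

    /** Returns the productions applied, in order, or throws on a syntax error. */
    static List<String> parse(List<String> tokens) {
        List<String> input = new ArrayList<>(tokens);
        input.add("$");                            // w$
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("Exp");                         // start symbol on top of $
        List<String> output = new ArrayList<>();
        int ip = 0;
        while (true) {
            String x = stack.peek();
            String a = input.get(ip);
            if (!NONTERMINALS.contains(x)) {       // X is a terminal or $
                if (!x.equals(a)) {
                    throw new IllegalArgumentException("expected " + x + " but found " + a);
                }
                stack.pop();
                ip++;
                if (x.equals("$")) {
                    return output;                 // stack and input are both exhausted
                }
            } else {                               // X is a nonterminal
                List<String> rhs = M.get(x).get(a);
                if (rhs == null) {
                    throw new IllegalArgumentException("no rule for " + x + " on " + a);
                }
                stack.pop();
                for (int i = rhs.size() - 1; i >= 0; i--) {
                    stack.push(rhs.get(i));        // push Yk ... Y1 so that Y1 ends on top
                }
                output.add(x + " -> " + (rhs.isEmpty() ? "epsilon" : String.join(" ", rhs)));
            }
        }
    }
}
```

Calling PredictiveParser.parse(List.of("id", "+", "id")) returns the six productions of the leftmost derivation, starting with Exp -> Term Exp' and ending with Exp' -> epsilon; an input such as id id is rejected because M[Exp', id] is empty.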
LL(1) grammars

No left recursion
A → Aα : If this production is chosen, parse makes no progress.
No common prefixes
A → αβ | αγ
Can fix by “left factoring”:
A → αA’
A’ → β | γ
LL(1) grammars (cont.)

No ambiguity
Precise definition requires that
production to choose be unique
(“choose” function M very hard to
calculate otherwise)
Top-down Parsing
[Figure: the parse tree is rooted at L, the start symbol, and is “grown” downwards from left to right over the input tokens <t0,t1,…,ti,...>; once the subtrees E0 … En built so far cover a prefix of the input, the remaining input is <ti,...>]
Checking LL(1)-ness

For any sequence of grammar symbols α, define set FIRST(α) ⊆ Σ to be
FIRST(α) = { a | α ⇒* aβ for some β }
LL(1) definition

Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations (in which the leftmost nonterminal is always expanded first)
S ⇒* wAα ⇒ wβα ⇒* wtx
S ⇒* wAα ⇒ wγα ⇒* wty
it follows that β = γ
In other words, given
1. a string wAα in V* and
2. t, the first terminal symbol to be derived from Aα
there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
FIRST sets can often be calculated by inspection
FIRST Sets
Exp → Term Exp’
Exp’ → $ | + Exp
Term → id
(Use $ for “end-of-input” symbol)
FIRST($) = {$}
FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = {}
⇒ grammar is LL(1)
FIRST Sets
L → E ; L | E
E → a | b
FIRST(E ; L) = {a, b} = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {}
⇒ grammar not LL(1).
Computing FIRST Sets
Algorithm. Compute FIRST(X) for all grammar symbols X
forall X ∈ V do FIRST(X) = {}
forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
repeat
  c: forall productions X → Y1Y2 … Yk do
    forall i ∈ [1,k] do
      FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
      if ε ∉ FIRST(Yi) then continue c
    FIRST(X) = FIRST(X) ∪ {ε}
until no more terminals or ε are added to any FIRST set
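A possible Java rendering of this fixed-point computation is sketched below. The class name FirstSets, the map-of-lists grammar representation (empty list for an ε-production) and the string "epsilon" used as the ε marker are assumptions made for the example; any symbol that is not a key of the map is treated as a terminal.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FirstSets {
    static final String EPS = "epsilon";

    /** productions: nonterminal -> list of right-hand sides (empty list = epsilon). */
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> productions) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : productions.keySet()) {
            first.put(nt, new LinkedHashSet<>());
        }
        boolean changed = true;
        while (changed) {                           // repeat until nothing is added
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : productions.entrySet()) {
                Set<String> firstX = first.get(e.getKey());
                for (List<String> rhs : e.getValue()) {
                    boolean allNullable = true;     // stays true for an empty right-hand side
                    for (String y : rhs) {
                        Set<String> firstY = productions.containsKey(y)
                                ? first.get(y) : Set.of(y);   // FIRST(a) = {a} for a terminal
                        for (String s : firstY) {
                            if (!s.equals(EPS)) {
                                changed |= firstX.add(s);
                            }
                        }
                        if (!firstY.contains(EPS)) {
                            allNullable = false;
                            break;                  // plays the role of "continue c"
                        }
                    }
                    if (allNullable) {
                        changed |= firstX.add(EPS);
                    }
                }
            }
        }
        return first;
    }
}
```

For the grammar L → E ; L | E, E → a | b the sketch computes FIRST(L) = FIRST(E) = {a, b}, matching the sets read off by inspection earlier.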
FIRST Sets of Strings of Symbols
– FIRST(X1X2…Xn) is the union of FIRST(X1) - {ε} and all FIRST(Xi) - {ε} such that ε ∈ FIRST(Xk) for k = 1, 2, …, i-1
– FIRST(X1X2…Xn) contains ε iff ε ∈ FIRST(Xk) for k = 1, 2, …, n

FIRST Sets do not Suffice
Given the productions
A → Tx
A → Ty
T → w
T → ε
– T → w should be applied when the next input token is w.
– T → ε should be applied whenever the next terminal is either x or y

FOLLOW Sets

For any nonterminal X, define the set FOLLOW(X) ⊆ Σ as
FOLLOW(X) = {a | S ⇒* αXaβ}
Computing the FOLLOW Set
Algorithm. Compute FOLLOW(X) for all nonterminals X
FOLLOW(S) = {$}
forall productions A → αBβ do
  FOLLOW(B) = FOLLOW(B) ∪ (FIRST(β) - {ε})
repeat
  forall productions A → αB or A → αBβ with ε ∈ FIRST(β) do
    FOLLOW(B) = FOLLOW(B) ∪ FOLLOW(A)
until all FOLLOW sets remain the same
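The following Java sketch follows the same idea, with two simplifications that are choices of this example rather than part of the algorithm as stated: the FIRST(β) step is folded into the repeat loop instead of being run once beforehand, and FIRST is recomputed locally so that the block is self-contained. The names FollowSets, firstOf and the "epsilon" marker are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FollowSets {
    static final String EPS = "epsilon";

    /** FIRST of a symbol string, given FIRST of every nonterminal. */
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new HashSet<>();
        for (String y : symbols) {
            Set<String> fy = first.containsKey(y) ? first.get(y) : Set.of(y);
            for (String s : fy) {
                if (!s.equals(EPS)) result.add(s);
            }
            if (!fy.contains(EPS)) return result;
        }
        result.add(EPS);            // every symbol was nullable, or the string is empty
        return result;
    }

    static Map<String, Set<String>> first(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : g.keySet()) first.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String x : g.keySet()) {
                for (List<String> rhs : g.get(x)) {
                    changed |= first.get(x).addAll(firstOf(rhs, first));
                }
            }
        }
        return first;
    }

    static Map<String, Set<String>> follow(Map<String, List<List<String>>> g, String start) {
        Map<String, Set<String>> first = first(g);
        Map<String, Set<String>> follow = new HashMap<>();
        for (String nt : g.keySet()) follow.put(nt, new HashSet<>());
        follow.get(start).add("$");
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String a : g.keySet()) {
                for (List<String> rhs : g.get(a)) {
                    for (int i = 0; i < rhs.size(); i++) {
                        String b = rhs.get(i);
                        if (!g.containsKey(b)) continue;     // terminals have no FOLLOW set
                        Set<String> rest = firstOf(rhs.subList(i + 1, rhs.size()), first);
                        boolean restNullable = rest.remove(EPS);
                        changed |= follow.get(b).addAll(rest);            // FIRST(beta) - {eps}
                        if (restNullable) {                               // A -> alpha B (beta nullable)
                            changed |= follow.get(b).addAll(new HashSet<>(follow.get(a)));
                        }
                    }
                }
            }
        }
        return follow;
    }
}
```

For A → Tx | Ty, T → w | ε with start symbol A, the sketch gives FOLLOW(T) = {x, y} and FOLLOW(A) = {$}, which is the information needed to decide when T → ε applies.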
Construction of a predictive parsing table
Algorithm. Construction of a predictive parsing table
M[:,:] = {}
forall productions A → α do
  forall terminals a ∈ FIRST(α) do
    M[A,a] = M[A,a] ∪ {A → α}
  if ε ∈ FIRST(α) then
    forall b ∈ FOLLOW(A) do
      M[A,b] = M[A,b] ∪ {A → α}
Make all empty entries of M be error
Another Definition of LL(1)
Define: Grammar G is LL(1) if for every A ∈ N with productions A → α1 | … | αn
FIRST(αi FOLLOW(A)) ∩ FIRST(αj FOLLOW(A)) = {} for all i ≠ j
Regular Languages

Definition. A regular grammar is one whose productions are all of the type:
– A → aB
– A → a
A Regular Expression is either:
– a
– R1 | R2
– R1 R2
– R*
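Nondeterministic Finite State Automaton
[Figure: automaton with states 0 (start), 1, 2 and 3; state 0 loops on a and b, and transitions on a, b, b lead from 0 to 1, from 1 to 2 and from 2 to 3]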
Regular Languages

Theorem. The classes of languages
– Generated by a regular grammar
– Expressed by a regular expression
– Recognized by a NDFS automaton
– Recognized by a DFS automaton
coincide.
Deterministic Finite Automaton
[Figure: automaton with states START, NUM, KEYWORD and OPERATOR; transition labels: space, tab, new line; digit; $; letter; =, +, -, /, (, )]
Legend:
circle = state
double circle = accept state
arrow = transition
bold, cap labels = state names
lower case labels = transition characters
Scanner code
state := start
loop
  if no input character buffered then read one, and add it to the accumulated token
  case state of
    start:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := num
        else ...
      end
    id:
      case input_char of
        A..Z, a..z : state := id
        0..9       : state := id
        else ...
      end
    num:
      case input_char of
        0..9: ...
        ...
        else ...
      end
    ...
  end;
end;
Table-driven DFA
              0-start  1-num  2-id   3-operator  4-keyword
white space   0        exit   exit   exit        exit
letter        2        error  2      exit        error
digit         1        1      2      exit        error
operator      3        exit   exit   exit        exit
$             4        error  error  exit        4
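A transition table like this one can drive a scanner directly. The Java sketch below encodes the table as a two-dimensional array; the class name TableDrivenScanner, the "kind:text" token format, the reading of exit as "emit the current token without consuming the character", and the treatment of end of input as white space are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

final class TableDrivenScanner {
    static final int EXIT = -1, ERROR = -2;
    static final String[] STATE_NAMES = {"start", "num", "id", "operator", "keyword"};

    // Rows: character class; columns: current state 0..4, as in the table.
    static final int[][] NEXT = {
        /* white space */ {0, EXIT,  EXIT,  EXIT, EXIT},
        /* letter      */ {2, ERROR, 2,     EXIT, ERROR},
        /* digit       */ {1, 1,     2,     EXIT, ERROR},
        /* operator    */ {3, EXIT,  EXIT,  EXIT, EXIT},
        /* $           */ {4, ERROR, ERROR, EXIT, 4},
    };

    static int charClass(char c) {
        if (c == ' ' || c == '\t' || c == '\n') return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        if ("=+-/()".indexOf(c) >= 0) return 3;
        if (c == '$') return 4;
        throw new IllegalArgumentException("unexpected character '" + c + "'");
    }

    /** Splits the input into "kind:text" tokens by running the DFA repeatedly. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int state = 0;
        int pos = 0;
        while (pos <= input.length()) {
            // End of input is treated like white space so that the last token is flushed.
            char c = pos < input.length() ? input.charAt(pos) : ' ';
            int next = NEXT[charClass(c)][state];
            if (next == ERROR) {
                throw new IllegalArgumentException(
                    "error in state " + STATE_NAMES[state] + " at position " + pos);
            } else if (next == EXIT) {
                tokens.add(STATE_NAMES[state] + ":" + text);
                text.setLength(0);
                state = 0;                        // restart without consuming c
            } else {
                if (next != 0) text.append(c);    // white space in the start state is skipped
                state = next;
                pos++;
            }
        }
        return tokens;
    }
}
```

Scanning "x1 = 42+y" yields id:x1, operator:=, num:42, operator:+, id:y, while "4x" is rejected because the table has error for a letter in state num. Changing the language recognized then only requires editing the array, not the loop.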
Language Classes
[Figure: nested language classes, from the outermost to the innermost: L0, CSL, CFL [NPA], LR(1), LL(1), RL [DFA=NFA]]
Question

Are regular expressions, as provided
by Perl or other languages, sufficient
for parsing nested structures, e.g.
XML files?
Recursive Descent Parser
stat → var = expr ;
expr → term [+ expr]
term → factor [* factor]
factor → ( expr ) | var | constant
var → identifier
Scanner
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

public class Scanner {
    private StreamTokenizer input;
    private Type lastToken;

    public enum Type { INVALID_CHAR, NO_TOKEN, PLUS,
        // etc. for remaining tokens, then:
        EOF
    };

    public Scanner(Reader r) {
        input = new StreamTokenizer(r);
        input.resetSyntax();                  // start from an empty syntax table
        input.eolIsSignificant(false);
        input.wordChars('a', 'z');            // letters form words
        input.wordChars('A', 'Z');
        input.ordinaryChar('+');              // single-character tokens
        input.ordinaryChar('*');
        input.ordinaryChar('=');
        input.ordinaryChar('(');
        input.ordinaryChar(')');
        input.whitespaceChars('\u0000', ' ');
    }
Scanner
    public Type nextToken() {
        Type token = Type.INVALID_CHAR;   // default for characters not handled below
        try {
            switch (input.nextToken()) {
            case StreamTokenizer.TT_EOF:
                token = Type.EOF;
                break;
            case StreamTokenizer.TT_WORD:
                if (input.sval.equalsIgnoreCase("false"))
                    token = Type.FALSE;
                else if (input.sval.equalsIgnoreCase("true"))
                    token = Type.TRUE;
                else
                    token = Type.VARIABLE;
                break;
            case '+':
                token = Type.PLUS;
                break;
            // etc.
            }
        } catch (IOException ex) { token = Type.EOF; }
        return token;
    }
}
Parser
public class Parser {
    private LexicalAnalyzer lexer;
    private Type token;

    public Statement parse(Reader r) throws SyntaxException {
        lexer = new LexicalAnalyzer(r);
        nextToken(); // assigns token
        Statement stat = stat();
        expect(LexicalAnalyzer.EOF);
        return stat;
    }
Statement
    // stat ::= variable '=' expr ';'
    private Statement stat() throws SyntaxException {
        Expr var = variable();
        expect(LexicalAnalyzer.ASSIGN);
        Expr exp = expr();
        Statement stat = new Statement(var, exp);
        expect(LexicalAnalyzer.SEMICOLON);
        return stat;
    }
Expr
    // expr ::= term ['+' expr]
    private Expr expr() throws SyntaxException {
        Expr exp = term();
        if (token == LexicalAnalyzer.PLUS) {
            nextToken();
            exp = new Exp(exp, expr());   // the optional right operand is itself an expr
        }
        return exp;
    }
Term
    // term ::= factor ['*' term ]
    private Expr term() throws SyntaxException {
        Expr exp = factor();
        // Rest of body: left as an exercise.
    }
Factor
    // factor ::= ( expr ) | var
    private Expr factor() throws SyntaxException {
        Expr exp = null;
        if (token == LexicalAnalyzer.LEFT_PAREN) {
            nextToken();
            exp = expr();
            expect(LexicalAnalyzer.RIGHT_PAREN);
        } else {
            exp = variable();
        }
        return exp;
    }
Variable
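    // variable ::= identifier
    private Expr variable() throws SyntaxException {
        if (token == LexicalAnalyzer.ID) {
            Expr exp = new Variable(lexer.getString());
            nextToken();
            return exp;
        }
        // Assumes SyntaxException has a constructor taking a message.
        throw new SyntaxException("identifier expected");
    }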
Constant
    private Expr constantExpression() throws SyntaxException {
        Expr exp = null;
        // Handle the various cases for constant
        // expressions: left as an exercise.
        return exp;
    }
Utilities
    private void expect(Type t) throws SyntaxException {
        if (token != t) { // throw SyntaxException...
        }
        nextToken();
    }

    private void nextToken() {
        token = lexer.nextToken();
    }
}
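Because the classes above depend on types that are not shown (Expr, Statement, SyntaxException, LexicalAnalyzer), the following self-contained Java sketch may help in experimenting with the technique. It is not the parser from these slides: the class name MiniParser is invented, variables are single letters, there is no separate scanner, and instead of building Expr objects it returns a fully parenthesized string. It uses the right-recursive rules expr ::= term ['+' expr] and term ::= factor ['*' term].

```java
final class MiniParser {
    private final String src;
    private int pos = 0;

    private MiniParser(String src) { this.src = src.replace(" ", ""); }

    /** Parses src and returns a fully parenthesized rendering of its syntax tree. */
    static String parse(String src) {
        MiniParser p = new MiniParser(src);
        String tree = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return tree;
    }

    /** Current character, or '\0' at end of input. */
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    // expr ::= term ['+' expr]
    private String expr() {
        String left = term();
        if (peek() == '+') {
            pos++;
            return "(" + left + " + " + expr() + ")";
        }
        return left;
    }

    // term ::= factor ['*' term]
    private String term() {
        String left = factor();
        if (peek() == '*') {
            pos++;
            return "(" + left + " * " + term() + ")";
        }
        return left;
    }

    // factor ::= '(' expr ')' | var, where var is a single letter in this sketch
    private String factor() {
        if (peek() == '(') {
            pos++;
            String inner = expr();
            expect(')');
            return inner;
        }
        char c = peek();
        if (!Character.isLetter(c)) {
            throw new IllegalArgumentException("expected variable at position " + pos);
        }
        pos++;
        return String.valueOf(c);
    }
}
```

MiniParser.parse("a + b * c") returns "(a + (b * c))", showing that one procedure per nonterminal is enough to encode precedence. MiniParser.parse("a+b+c") returns "(a + (b + c))": with these right-recursive rules the operators associate to the right.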