Topic #4: Syntactic Analysis
(Parsing)
CSC 338 – Compiler Design and
Implementation
Dr. Mohamed Ben Othman
(1435-1436)
1
Lexical Analyzer and Parser
2
Parser
• Accepts string of tokens from lexical
analyzer (usually one token at a time)
• Verifies whether or not string can be
generated by grammar
• Reports syntax errors (recovers if
possible)
3
Errors
• Lexical errors (e.g. misspelled word)
• Syntax errors (e.g. unbalanced
parentheses, missing semicolon)
• Semantic errors (e.g. type errors)
• Logical errors (e.g. infinite recursion)
4
Error Handling
• Report errors clearly and accurately
• Recover quickly if possible
• Poor error recovery may lead to an avalanche
of errors
5
Error Recovery
• Panic mode: discard tokens one at a time
until a synchronizing token is found
• Phrase-level recovery: Perform local
correction that allows parsing to continue
• Error Productions: Augment grammar to
handle predicted, common errors
• Global correction: Use a complex
algorithm to compute a least-cost sequence
of changes leading to parseable code
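Panic mode is the simplest of these strategies to sketch in code. The following is a minimal illustration, not a definitive implementation: the token names, the `T_EOF`-terminated token array, and the function `panic_skip` are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical token kinds, for illustration only. */
enum token { T_ID, T_PLUS, T_SEMI, T_RBRACE, T_EOF };

/*
 * Panic-mode recovery: starting at position pos, discard tokens one at a
 * time until a synchronizing token is found. Returns the index of that
 * token. The array toks must end with T_EOF, which always stops the scan.
 */
size_t panic_skip(const enum token *toks, size_t pos,
                  const enum token *sync, size_t nsync)
{
    for (;; pos++) {
        if (toks[pos] == T_EOF)
            return pos;                 /* never skip past end of input */
        for (size_t i = 0; i < nsync; i++)
            if (toks[pos] == sync[i])
                return pos;             /* synchronizing token found */
    }
}
```

A parser would typically call such a routine from its error handler with a set such as { semicolon, closing brace } and resume parsing at the returned position.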
6
Context Free Grammars
• CFGs can represent recursive constructs that
regular expressions cannot
• A CFG consists of:
– Tokens (terminals, symbols)
– Nonterminals (syntactic variables denoting sets of
strings)
– Productions (rules specifying how terminals and
nonterminals can combine to form strings)
– A start symbol (the set of strings it denotes is the
language of the grammar)
7
Derivations (Part 1)
• One definition of language: the set of
strings that have valid parse trees
• Another definition: the set of strings that
can be derived from the start symbol
E → E + E | E * E | (E) | -E | id
E => -E (read: E derives -E)
E => -E => -(E) => -(id)
8
Derivations (Part 2)
• αAβ => αγβ if A → γ is a production
and α and β are arbitrary strings of
grammar symbols
• If a1 => a2 => … => an, we say a1
derives an
• => means derives in one step
• *=> means derives in zero or more steps
• +=> means derives in one or more steps
9
Sentences and Languages
• Let L(G) be the language generated by
the grammar G with start symbol S:
– Strings in L(G) may contain only tokens of G
– A string w is in L(G) if and only if S +=> w
– Such a string w is a sentence of G
• Any language that can be generated by a
CFG is said to be a context-free language
• If two grammars generate the same
language, they are said to be equivalent
10
Sentential Forms
• If S *=> α, where α may contain
nonterminals, we say that α is a sentential
form of G
• A sentence is a sentential form with no
nonterminals
11
Leftmost Derivations
• Only the leftmost nonterminal in any sentential
form is replaced at each step
• A leftmost step can be written as wAγ lm=> wδγ
– w consists of only terminals
– γ is a string of grammar symbols
• If α derives β by a leftmost derivation, then we
write α lm*=> β
• If S lm*=> α, then we say that α is a left-sentential form of the grammar
• Analogous terms exist for rightmost derivations
12
Parse Trees
• A parse tree can be viewed as a graphical
representation of a derivation
• Every parse tree has a unique leftmost
derivation (not true of every sentence)
• An ambiguous grammar has:
– more than one parse tree for at least one
sentence
– more than one leftmost derivation for at least
one sentence
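To make ambiguity concrete, the sketch below counts the parse trees of a token string under the fragment E → E + E | E * E | id of the expression grammar from the Derivations slide. It is an illustrative brute-force count, not an efficient algorithm; the one-character encoding (`i` stands for the token id) and the function names are assumptions made for this example.

```c
#include <string.h>

/*
 * Number of parse trees for s[lo..hi] under E -> E+E | E*E | id.
 * Every operator in the range is tried as the root of the tree; the
 * counts for the left and right operands multiply.
 */
static long count_range(const char *s, int lo, int hi)
{
    if (lo == hi)
        return s[lo] == 'i';            /* a single id has one tree */
    long total = 0;
    for (int k = lo + 1; k < hi; k++)
        if (s[k] == '+' || s[k] == '*')
            total += count_range(s, lo, k - 1) * count_range(s, k + 1, hi);
    return total;
}

/* Returns 0 for strings that are not sentences of the grammar fragment. */
long count_parse_trees(const char *s)
{
    int n = (int)strlen(s);
    return n == 0 ? 0 : count_range(s, 0, n - 1);
}
```

For example, `count_parse_trees("i+i*i")` returns 2: one tree has + at the root and the other has *. Any result greater than 1 shows that the grammar is ambiguous.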
13
Regular Expressions vs. CFGs
• Every construct that can be described by
an RE can also be described by a CFG
• Why use REs at all?
– Lexical rules are simpler to describe this way
– REs are often easier to read
– More efficient lexical analyzers can be
constructed
14
Eliminating Ambiguity (1)
stmt → if expr then stmt
| if expr then stmt else stmt
| other
Ambiguous sentence (it has two parse trees):
if E1 then if E2 then S1 else S2
15
Eliminating Ambiguity (2)
16
Eliminating Ambiguity (3)
stmt → matched
| unmatched
matched → if expr then matched else matched
| other
unmatched → if expr then stmt
| if expr then matched else unmatched
17
Left Recursion
• A grammar is left recursive if it has a
nonterminal A such that there is a
derivation A +=> Aα for some string α
• Most top-down parsing methods cannot
handle left-recursive grammars
18
Eliminating Left Recursion (1)
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
is replaced by:
A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε
Harder case:
S → Aa | b
A → Ac | Sd | ε
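The rewrite for immediate left recursion (the general A, A’ scheme on this slide, not the harder case) can be sketched as follows. This is only an illustration under simplifying assumptions: every grammar symbol is a single character, the empty string stands for ε, the output writes ε as `eps`, and `eliminate_immediate` and `append` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Appends s to out without exceeding its capacity cap (cap >= 1). */
static void append(char *out, size_t cap, const char *s)
{
    size_t used = strlen(out);
    if (used + 1 < cap)
        strncat(out, s, cap - used - 1);
}

/*
 * Given the n alternatives of nonterminal A, writes the equivalent
 * productions without immediate left recursion into out.
 * Returns 1 if a rewrite was needed, 0 if no alternative starts with A.
 */
int eliminate_immediate(char A, const char *alts[], int n,
                        char *out, size_t cap)
{
    char name[2] = { A, '\0' };
    int recursive = 0, first = 1;

    out[0] = '\0';
    for (int i = 0; i < n; i++)
        if (alts[i][0] == A)
            recursive++;
    if (recursive == 0)
        return 0;

    /* A -> beta1 A' | beta2 A' | ... */
    append(out, cap, name);
    append(out, cap, " -> ");
    for (int i = 0; i < n; i++) {
        if (alts[i][0] == A)
            continue;
        if (!first)
            append(out, cap, " | ");
        append(out, cap, alts[i]);
        append(out, cap, name);
        append(out, cap, "'");
        first = 0;
    }
    append(out, cap, "\n");

    /* A' -> alpha1 A' | alpha2 A' | ... | eps */
    append(out, cap, name);
    append(out, cap, "' -> ");
    for (int i = 0; i < n; i++) {
        if (alts[i][0] != A)
            continue;
        append(out, cap, alts[i] + 1);   /* alpha: drop the leading A */
        append(out, cap, name);
        append(out, cap, "' | ");
    }
    append(out, cap, "eps\n");
    return 1;
}
```

Called with A = 'E' and the alternatives "E+T" and "T", the function produces the two lines `E -> TE'` and `E' -> +TE' | eps`.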
19
Eliminating Left Recursion (2)
• First arrange the nonterminals in some
order A1, A2, … An
• Apply the following algorithm:
for i = 1 to n {
    for j = 1 to i-1 {
        replace each production of the form Ai → Ajγ
        by the productions Ai → δ1γ | δ2γ | … | δkγ,
        where Aj → δ1 | δ2 | … | δk are the Aj productions
    }
    eliminate the left recursion among Ai productions
}
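As a worked illustration, here is the algorithm applied to the harder case S → Aa | b, A → Ac | Sd | ε, with the nonterminals ordered S, A. Strictly speaking the algorithm is only guaranteed to work for grammars without ε-productions or cycles, but the ε-production in this grammar turns out to be harmless.

```text
i = 1 (S): no S-production starts with S, so nothing changes.

i = 2 (A), j = 1 (S): substitute the S-productions into A → Sd:
    A → Ac | Aad | bd | ε

Eliminate the immediate left recursion among the A-productions:
    S  → Aa | b
    A  → bdA’ | A’
    A’ → cA’ | adA’ | ε
```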
20
Left Factoring
• Rewriting productions to delay decisions
• Helpful for predictive parsing
• Not guaranteed to remove ambiguity
A → αβ1 | αβ2
is replaced by:
A → αA’
A’ → β1 | β2
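A minimal sketch of this rewrite for two alternatives is shown below. It assumes single-character grammar symbols and writes ε as `eps`; the function name `left_factor` and the textual output format are assumptions made for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Left-factors A -> alt1 | alt2 on their longest common prefix.
 * Writes the factored productions to out (capacity cap >= 1) and returns
 * the prefix length. If the prefix is empty, out is set to "" and 0 is
 * returned, because there is nothing to factor.
 */
size_t left_factor(char A, const char *alt1, const char *alt2,
                   char *out, size_t cap)
{
    size_t k = 0;
    while (alt1[k] != '\0' && alt1[k] == alt2[k])
        k++;                             /* k = length of common prefix */

    out[0] = '\0';
    if (k == 0)
        return 0;

    /* A -> prefix A'   and   A' -> rest1 | rest2 (eps if a rest is empty) */
    snprintf(out, cap, "%c -> %.*s%c'\n%c' -> %s | %s\n",
             A, (int)k, alt1, A, A,
             alt1[k] ? alt1 + k : "eps",
             alt2[k] ? alt2 + k : "eps");
    return k;
}
```

With the alternatives "ab" and "ac" for A, the output is `A -> aA'` followed by `A' -> b | c`: the choice between b and c is delayed until after a has been read.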
21
Top Down Parsing
• Can be viewed two ways:
– Attempt to find leftmost derivation for input
string
– Attempt to create parse tree, starting from the
root, creating nodes in preorder
• General form is recursive descent parsing
– May require backtracking
– Backtracking parsers are not used frequently
because backtracking is rarely needed
22
Predictive Parsing
• A special case of recursive-descent
parsing that does not require backtracking
• Must always know which production to use
based on current input symbol
• Can often create an appropriate grammar by:
– removing left-recursion
– left factoring the resulting grammar
23
FIRST
• FIRST(α) is the set of all terminals that begin
any string derived from α (plus ε if α derives ε)
• Computing FIRST:
– If X is a terminal, FIRST(X) = {X}
– If X → ε is a production, add ε to FIRST(X)
– If X is a nonterminal and X → Y1Y2…Yn is a
production:
• For all terminals a, add a to FIRST(X) if a is a member of
some FIRST(Yi) and ε is a member of all of FIRST(Y1),
FIRST(Y2), …, FIRST(Yi-1)
• If ε is a member of all of FIRST(Y1), FIRST(Y2), …,
FIRST(Yn), add ε to FIRST(X)
24
FOLLOW
• FOLLOW(A), for any nonterminal A, is the
set of terminals a that can appear
immediately to the right of A in some
sentential form
• More formally, a is in FOLLOW(A) if and
only if there exists a derivation of the form
S *=> αAaβ
• $ is in FOLLOW(A) if and only if there
exists a derivation of the form S *=> αA
25
Computing FOLLOW
• Place $ in FOLLOW(S)
• If there is a production A → αBβ, then
everything in FIRST(β) (except for ε) is
in FOLLOW(B)
• If there is a production A → αB, or a
production A → αBβ where FIRST(β)
contains ε, then everything in FOLLOW(A)
is also in FOLLOW(B)
26
Left recursion Example
E → E + T | T
T → T * F | F
F → (E) | id
We can remove the left recursion:
E → TX
X → +TX | ε
T → FY
Y → *FY | ε
F → (E) | id
27
FIRST and FOLLOW Example
E → TX
X → +TX | ε
T → FY
Y → *FY | ε
F → (E) | id
First(E) = First(T) = First(F) = { (, id }
First(X) = { +, ε }
First(Y) = { *, ε }
Follow(E) = Follow(X) = { ), $ }
Follow(T) = Follow(Y) = { +, ), $ }
Follow(F) = { +, *, ), $ }
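The sets above can be reproduced mechanically by applying the FIRST and FOLLOW rules until nothing changes. The sketch below does this for the example grammar only. The encoding is an assumption made for illustration: each symbol is one character (`i` stands for id), a production is written as "A=rhs", and a set is a bit mask over the terminals `+ * ( ) i $` with one extra bit for ε.

```c
#include <string.h>

static const char *PRODS[] = { "E=TX", "X=+TX", "X=", "T=FY",
                               "Y=*FY", "Y=", "F=(E)", "F=i" };
enum { NPRODS = 8, NNT = 5 };
static const char NT[] = "EXTYF";        /* nonterminals; E is the start */
static const char TERMS[] = "+*()i$";    /* terminals; 'i' stands for id */
#define EPS (1u << 6)                    /* extra bit that represents ε */

unsigned first_set[NNT], follow_set[NNT];

int nt_index(char c)
{
    const char *p = c ? strchr(NT, c) : NULL;
    return p ? (int)(p - NT) : -1;
}

unsigned term_bit(char c) { return 1u << (strchr(TERMS, c) - TERMS); }

/* FIRST of a string of grammar symbols, using the current first_set. */
static unsigned first_of(const char *s)
{
    unsigned result = 0;
    for (; *s; s++) {
        int k = nt_index(*s);
        unsigned f = (k >= 0) ? first_set[k] : term_bit(*s);
        result |= f & ~EPS;
        if (!(f & EPS))
            return result;               /* this symbol cannot vanish */
    }
    return result | EPS;                 /* all symbols can derive ε */
}

void compute_sets(void)
{
    int changed = 1;
    memset(first_set, 0, sizeof first_set);
    memset(follow_set, 0, sizeof follow_set);

    while (changed) {                    /* FIRST: iterate to a fixed point */
        changed = 0;
        for (int p = 0; p < NPRODS; p++) {
            int a = nt_index(PRODS[p][0]);
            unsigned f = first_set[a] | first_of(PRODS[p] + 2);
            if (f != first_set[a]) { first_set[a] = f; changed = 1; }
        }
    }

    follow_set[0] = term_bit('$');       /* place $ in FOLLOW(start symbol) */
    changed = 1;
    while (changed) {                    /* FOLLOW: iterate to a fixed point */
        changed = 0;
        for (int p = 0; p < NPRODS; p++) {
            int a = nt_index(PRODS[p][0]);
            for (const char *s = PRODS[p] + 2; *s; s++) {
                int b = nt_index(*s);
                if (b < 0)
                    continue;            /* FOLLOW is for nonterminals only */
                unsigned rest = first_of(s + 1);
                unsigned f = follow_set[b] | (rest & ~EPS);
                if (rest & EPS)
                    f |= follow_set[a];  /* the rest can vanish */
                if (f != follow_set[b]) { follow_set[b] = f; changed = 1; }
            }
        }
    }
}
```

After `compute_sets()` runs, `follow_set[nt_index('F')]` holds the bits for +, *, ) and $, which matches Follow(F) above.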
28
Parser
After removing ambiguity, eliminating left recursion, and left
factoring, and after computing the FIRST and FOLLOW sets, we
can write the parser as follows:
1- The start symbol gives the name of the main parser function.
2- For each nonterminal there is a function that takes the
name of this nonterminal. So if we have S → a:
void S()
{
    T(a); // T is the transformation defined below
}
29
T is defined as follows:
1- If "a" is a terminal with token a:
T("a") = match(a)
2- For any nonterminal A:
T(A) = A()
3- T(a1a2…an) = T(a1); T(a2); …; T(an);
4- T(a | b | … | d) =
switch(lookahead) {
    case First(a): T(a); break;
    case First(b): T(b); break;
    …
    case First(d): T(d); break;
    default: error("syntax error");
}
30
S → a | b | … | d | ε
switch(lookahead) {
    case First(a): T(a); break;
    case First(b): T(b); break;
    …
    case First(d): T(d); break;
    case Follow(S): break; // ε-production: do nothing
    default: error("syntax error");
}
31
Exercise
• Write the parser of the last grammar with:
– plus is the token of '+'
– mult is the token of '*'
– closep is the token of ')'
– openp is the token of '('
– id is the token of identifiers
32
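// Assumes match(t) checks that lookahead == t and then advances with
// lookahead = lexan(), and that lookahead holds the first token before
// E() is called. endinput is an assumed name for the end-of-input token ($).
E()
{
    T();
    X();
}

X()
{
    switch(lookahead){
    case plus:
        match(plus); T();
        X(); break;
    case closep: case endinput: break;  // Follow(X): apply X → ε
    default: error("syntax error");
    }
}

T()
{
    F();
    Y();
}

Y()
{
    switch(lookahead){
    case mult:
        match(mult); F();
        Y(); break;
    case plus: case closep: case endinput: break;  // Follow(Y): apply Y → ε
    default: error("syntax error");
    }
}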
33
F()
{
    // lookahead already holds the current token; match() advances it
    switch(lookahead){
    case openp:
        match(openp); E(); match(closep); break;
    case id: match(id); break;
    default: error("syntax error");
    }
}
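For reference, here is one way the exercise solution could be assembled into a complete C program fragment. It is a sketch under stated assumptions rather than the course's official solution: the string-based `lexan()`, the tokens `done` and `bad`, the error counter, and the entry point `parse()` are all additions made for illustration. `match()` is assumed to advance the lookahead, so the lookahead is read once before `E()` is called.

```c
#include <ctype.h>

enum token { plus, mult, openp, closep, id, done, bad };

static const char *input;        /* remaining input text */
static enum token lookahead;     /* current token */
static int errors;               /* number of syntax errors seen */

/* Minimal lexer: identifiers, + * ( ), blanks; done marks end of input. */
static enum token lexan(void)
{
    while (*input == ' ')
        input++;
    if (*input == '\0')
        return done;
    if (isalpha((unsigned char)*input)) {
        while (isalnum((unsigned char)*input))
            input++;
        return id;
    }
    switch (*input++) {
    case '+': return plus;
    case '*': return mult;
    case '(': return openp;
    case ')': return closep;
    default:  return bad;
    }
}

static void error(void) { errors++; }

/* Check the current token, then advance to the next one. */
static void match(enum token t)
{
    if (lookahead == t)
        lookahead = lexan();
    else
        error();
}

static void E(void);
static void T(void);
static void F(void);

static void X(void)              /* X -> +TX | ε */
{
    switch (lookahead) {
    case plus: match(plus); T(); X(); break;
    case closep: case done: break;           /* Follow(X) */
    default: error();
    }
}

static void Y(void)              /* Y -> *FY | ε */
{
    switch (lookahead) {
    case mult: match(mult); F(); Y(); break;
    case plus: case closep: case done: break; /* Follow(Y) */
    default: error();
    }
}

static void F(void)              /* F -> (E) | id */
{
    switch (lookahead) {
    case openp: match(openp); E(); match(closep); break;
    case id: match(id); break;
    default: error();
    }
}

static void T(void) { F(); Y(); }   /* T -> FY */
static void E(void) { T(); X(); }   /* E -> TX */

/* Returns 1 if text is a sentence of the grammar, 0 otherwise. */
int parse(const char *text)
{
    input = text;
    errors = 0;
    lookahead = lexan();         /* read the first token exactly once */
    E();
    if (lookahead != done)
        error();                 /* trailing input after a complete E */
    return errors == 0;
}
```

For instance, `parse("a+b*c")` and `parse("(a+b)*c")` return 1, while `parse("a+*b")` and `parse("(a+b")` return 0. Counting errors instead of exiting at the first one is a design choice made here to keep the sketch easy to test.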
34