Chapter 3
Context-Free Grammars
Dr. Frank Lee
3.1 CFG Definition
• The next phase of compilation after
lexical analysis is syntax analysis.
• This phase is called parsing. It is to
determine if the program is syntactically
correct or not.
• The tool we use to describe the syntax
of a programming language is a context-free grammar (CFG).
3.1 CFG Definition
• A CFG includes 4 components:
– A set of terminals T, which are the tokens of
the language
– A set of non-terminals N
– A set of rewriting rules R.
• The left-hand side of each rewriting rule is a single
non-terminal.
• The right-hand side of each rewriting rule is a
string of terminals and/or non-terminals
– A special non-terminal S ∈ N, which is the
start symbol
3.1 CFG Definition
• Just as regular expressions generate strings of
characters, CFGs generate strings of tokens
• A string of tokens is generated by a CFG in the
following way:
– The initial string is the start symbol S
– While there are non-terminals left in the string:
1. Pick any non-terminal A in the string
2. Replace a single occurrence of A in the string with the right-hand side of any rule that has A as the left-hand side
3. Repeat 1 and 2 until all elements in the string are terminals
– See Fig. 3.1 (next slide or p38)
Fig. 3.1 A CFG for Some Simple Statements
Terminals = { id, num, while, do, print, >, {, }, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = (1) S → print(E);
(2) S → while (B) do S
(3) S → { L }
(4) E → id
(5) E → num
(6) B → E > E
(7) L → S
(8) L → S L
Start Symbol = S
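The rewriting procedure from the previous slide can be sketched in Python using the rules of Fig. 3.1. This is only an illustration: the dictionary layout, the function name generate, and the random_steps cutoff (after which the first rule listed for each non-terminal is always chosen, so that the loop is guaranteed to stop) are assumptions of this sketch, not part of the textbook.

```python
import random

# Rules of Fig. 3.1; each right-hand side is a list of symbols.
RULES = {
    "S": [["print", "(", "E", ")", ";"],
          ["while", "(", "B", ")", "do", "S"],
          ["{", "L", "}"]],
    "E": [["id"], ["num"]],
    "B": [["E", ">", "E"]],
    "L": [["S"], ["S", "L"]],
}

def generate(start="S", seed=None, random_steps=20):
    """Rewrite non-terminals until only terminals remain."""
    rng = random.Random(seed)
    string = [start]
    steps = 0
    while True:
        positions = [i for i, sym in enumerate(string) if sym in RULES]
        if not positions:
            return string                      # only terminals left
        i = rng.choice(positions)              # step 1: pick any non-terminal
        options = RULES[string[i]]
        # step 2: pick a rule for it; after the cutoff always take the
        # first rule, which cannot grow the string forever
        rhs = rng.choice(options) if steps < random_steps else options[0]
        string[i:i + 1] = rhs                  # replace that one occurrence
        steps += 1

print(" ".join(generate(seed=1)))
```

Every call returns a list of terminals that is derivable from S; different seeds give different statements.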
3.1 CFG Example 1
1. E → E + E
2. E → E – E
3. E → E * E
4. E → E / E
5. E → num
• Example 1:
1. E ⇒ E + E
2. ⇒ E * E + E
3. ⇒ num * E + E
4. ⇒ num * num + E
5. ⇒ num * num + num
3.1 CFG Example 2
1. S → NP V NP
2. NP → the N
3. N → boy
4. N → ball
5. N → window
6. V → threw
7. V → broke
• Example 2:
1. S ⇒ NP V NP
2. ⇒ the N V NP
3. ⇒ the boy V NP
4. ⇒ the boy broke NP
5. ⇒ the boy broke the N
6. ⇒ the boy broke the window
3.2 Derivations
• A derivation is a description of how a string is generated
from a grammar
• A leftmost derivation always picks the leftmost non-terminal to replace (see slides 9 and 10)
• A rightmost derivation always picks the rightmost non-terminal to replace (see slides 12 and 13)
• Some derivations are neither leftmost nor rightmost (see
slide 15)
• For example: Use the CFG in Fig. 3.1 (p38) to generate
print (id);
S ⇒ print (E);
⇒ print (id);
3.2 Derivations
3.2.1 Leftmost Derivations
• A string of terminals and non-terminals α that can
be derived from the initial symbol of the grammar
is called a sentential form
• Thus the strings “{ S L }” and “while(id>E) do S”
are both sentential forms, but “print(E>id)” isn’t.
• A derivation is “leftmost” if, at each step in the
derivation, the leftmost non-terminal is selected for
replacement
• All of the above examples are leftmost derivations
• A sentential form that occurs in a leftmost
derivation is called a left-sentential form
3.2.1 Leftmost Derivations
• Example 1: We can use leftmost derivations
to generate while(id > num) do print(id);
from this CFG as follows:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(id>E) do S
⇒ while(id>num) do S
⇒ while(id>num) do print(E);
⇒ while(id>num) do print(id);
3.2.1 Leftmost Derivations
• Example 2: We also can generate
{ print(id); print(num); } from the CFG as
follows:
S ⇒ { L }
⇒ { S L }
⇒ { print(E); L }
⇒ { print(id); L }
⇒ { print(id); S }
⇒ { print(id); print(E); }
⇒ { print(id); print(num); }
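A leftmost derivation is fully determined by the sequence of rule numbers applied, because the position to rewrite is always the leftmost non-terminal. The sketch below replays Example 2 from the rule numbers of Fig. 3.1; the function name leftmost_derivation and the token-list representation are assumptions made for illustration.

```python
# Numbered rules of Fig. 3.1 as (left-hand side, right-hand side) pairs.
RULES = {
    1: ("S", ["print", "(", "E", ")", ";"]),
    2: ("S", ["while", "(", "B", ")", "do", "S"]),
    3: ("S", ["{", "L", "}"]),
    4: ("E", ["id"]),
    5: ("E", ["num"]),
    6: ("B", ["E", ">", "E"]),
    7: ("L", ["S"]),
    8: ("L", ["S", "L"]),
}
NON_TERMINALS = {"S", "E", "B", "L"}

def leftmost_derivation(rule_numbers, start="S"):
    """Apply each rule to the leftmost non-terminal; return all sentential forms."""
    form = [start]
    forms = [form[:]]
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        # index of the leftmost non-terminal in the current form
        i = next(k for k, sym in enumerate(form) if sym in NON_TERMINALS)
        if form[i] != lhs:
            raise ValueError(
                f"rule {number} rewrites {lhs}, but the leftmost "
                f"non-terminal is {form[i]}")
        form[i:i + 1] = rhs
        forms.append(form[:])
    return forms

# Example 2: rules 3, 8, 1, 4, 7, 1, 5 applied leftmost-first.
for form in leftmost_derivation([3, 8, 1, 4, 7, 1, 5]):
    print(" ".join(form))
```

The last line printed is { print ( id ) ; print ( num ) ; }, the string derived in Example 2.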
3.2.2 Rightmost Derivations
• In addition to leftmost derivations, there are also
rightmost derivations, where we always choose
the rightmost non-terminal to replace
• Example 1: To generate while(num > num) do print(id);
S ⇒ while(B) do S
⇒ while(B) do print(E);
⇒ while(B) do print(id);
⇒ while(E>E) do print(id);
⇒ while(E>num) do print(id);
⇒ while(num>num) do print(id);
3.2.2 Rightmost Derivations
• Example 2: Derive
{ print(num); print(id); } from S
S ⇒ { L }
⇒ { S L }
⇒ { S S }
⇒ { S print(E); }
⇒ { S print(id); }
⇒ { print(E); print(id); }
⇒ { print(num); print(id); }
3.2.3 Non-Leftmost, Non-Rightmost
Derivations
• Some derivations are neither leftmost nor
rightmost, such as:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(E>E) do print(E);
⇒ while(E>id) do print(E);
⇒ while(num>id) do print(E);
⇒ while(num>id) do print(num);
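Whether a derivation is leftmost, rightmost, or neither can be checked mechanically by finding which non-terminal was rewritten at each step. The following is a rough sketch for the grammar of Fig. 3.1; the names rewritten_positions and classify, and the convention that each sentential form is a string of space-separated tokens, are assumptions of this example.

```python
RULES = {
    "S": [["print", "(", "E", ")", ";"],
          ["while", "(", "B", ")", "do", "S"],
          ["{", "L", "}"]],
    "E": [["id"], ["num"]],
    "B": [["E", ">", "E"]],
    "L": [["S"], ["S", "L"]],
}

def rewritten_positions(before, after):
    """Indices of non-terminals in `before` whose rewriting yields `after`."""
    return {i for i, sym in enumerate(before) if sym in RULES
            for rhs in RULES[sym]
            if before[:i] + rhs + before[i + 1:] == after}

def classify(derivation):
    """Return 'leftmost', 'rightmost', 'both' or 'neither'."""
    forms = [step.split() for step in derivation]
    leftmost = rightmost = True
    for before, after in zip(forms, forms[1:]):
        used = rewritten_positions(before, after)
        if not used:
            raise ValueError(f"not a single rewriting step: {before} -> {after}")
        non_terminals = [i for i, sym in enumerate(before) if sym in RULES]
        leftmost = leftmost and non_terminals[0] in used
        rightmost = rightmost and non_terminals[-1] in used
    if leftmost and rightmost:
        return "both"
    if leftmost:
        return "leftmost"
    if rightmost:
        return "rightmost"
    return "neither"

mixed = [
    "S",
    "while ( B ) do S",
    "while ( E > E ) do S",
    "while ( E > E ) do print ( E ) ;",
    "while ( E > id ) do print ( E ) ;",
    "while ( num > id ) do print ( E ) ;",
    "while ( num > id ) do print ( num ) ;",
]
print(classify(mixed))
```

For this derivation the second step rewrites B while S is further right, and the third step rewrites S while an E is further left, so the result is "neither".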
3.2.3 Non-Leftmost, Non-Rightmost
Derivations
• Some strings are not derivable from
this CFG, such as:
1. print(id)
2. { print(id); print(id) }
3. while (id) do print(id);
4. print(id > id);
– 1 & 2: no ; to terminate statements.
– 3: the id in while (id) is not derivable from B.
– 4: id > id is not derivable from E.
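Derivability can also be tested by program. Below is a minimal recursive-descent recognizer for the grammar of Fig. 3.1, given only as a sketch: the tokenizer pattern and the function names tokenize and derivable are assumptions of this example, and the rule L → S | S L is handled with a loop rather than by recursion.

```python
import re

TOKEN = re.compile(r"\s*([a-z]+|[>{}();])")

def tokenize(text):
    """Split the input into words and single-character symbols."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def derivable(text):
    """True if `text` can be derived from S in the grammar of Fig. 3.1."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r}")
        pos += 1

    def parse_E():                       # E -> id | num
        if peek() in ("id", "num"):
            expect(peek())
        else:
            raise SyntaxError("expected id or num")

    def parse_B():                       # B -> E > E
        parse_E()
        expect(">")
        parse_E()

    def parse_L():                       # L -> S | S L
        parse_S()
        while peek() in ("print", "while", "{"):
            parse_S()

    def parse_S():
        if peek() == "print":            # S -> print(E);
            for token in ("print", "("):
                expect(token)
            parse_E()
            expect(")")
            expect(";")
        elif peek() == "while":          # S -> while (B) do S
            expect("while")
            expect("(")
            parse_B()
            expect(")")
            expect("do")
            parse_S()
        elif peek() == "{":              # S -> { L }
            expect("{")
            parse_L()
            expect("}")
        else:
            raise SyntaxError("expected a statement")

    try:
        parse_S()
        return pos == len(tokens)        # all input must be consumed
    except SyntaxError:
        return False

for s in ["print(id)", "{ print(id); print(id) }",
          "while (id) do print(id);", "print(id > id);"]:
    print(s, derivable(s))
```

All four strings from the slide are rejected, while a string such as while(id > num) do print(id); is accepted.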
3.3 CFG Shorthand
• We can combine two rules of the form
S → α and S → β
to get the single rule
S → α | β
• See Fig. 3.2 for an example (next slide or
p40)
Fig. 3.2 Shorthand for the CFG in Fig. 3.1
Terminals = { id, num, while, do, print, >, {, }, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = S → print(E); | while (B) do S | { L }
E → id | num
B → E > E
L → S | S L
Start Symbol = S
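The shorthand is purely notational, and a shorthand rule can be expanded back into separate rules mechanically. A small sketch, assuming rules are written as text with "->" for the arrow and that "|" is never itself a terminal of the language (true for Fig. 3.2); the function name expand is made up for this example.

```python
def expand(rule):
    """Split 'A -> x | y' into the separate rules ['A -> x', 'A -> y']."""
    lhs, rhs = rule.split("->")
    return [f"{lhs.strip()} -> {alternative.strip()}"
            for alternative in rhs.split("|")]

for line in ["S -> print(E); | while (B) do S | { L }",
             "E -> id | num",
             "B -> E > E",
             "L -> S | S L"]:
    for rule in expand(line):
        print(rule)
```

Expanding the four shorthand rules of Fig. 3.2 this way gives back the eight rules of Fig. 3.1.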
3.4 Parse Trees ($)
• A parse tree is a graphical representation of a derivation
• We start with the initial symbol S of the grammar as the
root of the tree
• The children of the root are the symbols that were used
to rewrite the initial symbol in the derivation
• The internal nodes of the parse tree are non-terminals
• The children of each internal node N are the symbols on
the right-hand side of a rule that has N as the left-hand
side (e.g. B → E > E where E > E is the right-hand side
and B is the left-hand side of the rule)
• Terminals are leaves of the tree
• See three parse tree examples on p40-41.
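One simple way to hold a parse tree in a program is as nested tuples. The sketch below encodes a tree for while(id > num) do print(id); under the grammar of Fig. 3.1 and checks the properties listed above. The tuple layout (non-terminal first, then its children) and the function names leaves and follows_rules are assumptions of this example, not the textbook's representation.

```python
RULES = {
    "S": [["print", "(", "E", ")", ";"],
          ["while", "(", "B", ")", "do", "S"],
          ["{", "L", "}"]],
    "E": [["id"], ["num"]],
    "B": [["E", ">", "E"]],
    "L": [["S"], ["S", "L"]],
}

# Internal node: (non-terminal, child, child, ...); leaf: a terminal string.
tree = ("S", "while", "(",
        ("B", ("E", "id"), ">", ("E", "num")),
        ")", "do",
        ("S", "print", "(", ("E", "id"), ")", ";"))

def leaves(node):
    """Terminals at the leaves, read from left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:
        result.extend(leaves(child))
    return result

def follows_rules(node):
    """True if every internal node and its children match some rule."""
    if isinstance(node, str):
        return node not in RULES          # leaves must be terminals
    children = [c if isinstance(c, str) else c[0] for c in node[1:]]
    return (children in RULES[node[0]]
            and all(follows_rules(c) for c in node[1:]))

print(" ".join(leaves(tree)))   # while ( id > num ) do print ( id ) ;
print(follows_rules(tree))      # True
```

Reading the leaves from left to right gives back the derived string, whichever derivation order was used to build the tree.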
3.5 Ambiguous Grammars
• A grammar is ambiguous if there is at least one string
derivable from the grammar that has more than one
parse tree
• Fig. 3.4 and Fig. 3.6 are ambiguous grammars because
there are several strings derivable from these grammars
that have multiple parse trees, such as the two trees in
Fig. 3.5 (p42) and Fig. 3.7, respectively
• Ambiguous grammars are bad, because the parse trees
don’t tell us the exact meaning of the string. For example,
in Fig. 3.5.a, the string means id*(id+id), but in Fig. 3.5.b,
the string means (id*id)+id. This is why we call it
“ambiguous”.
• We need to change the grammar to fix this problem
• We may rewrite the grammar as in Fig. 3.8 and Fig. 3.9.
They are unambiguous CFGs for expressions
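Ambiguity can be made concrete by counting parse trees. The sketch below assumes a small ambiguous expression grammar, E → E + E | E * E | id, chosen for illustration; it may differ in detail from Fig. 3.4. The function name count_trees is also an assumption.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(tokens):
    """Number of parse trees for `tokens` (a tuple) under
    E -> E + E | E * E | id."""
    if tokens == ("id",):
        return 1                          # E -> id
    total = 0
    for i, token in enumerate(tokens):
        if token in ("+", "*"):
            # root rule E -> E op E with this operator at the root
            total += count_trees(tokens[:i]) * count_trees(tokens[i + 1:])
    return total

print(count_trees(tuple("id * id + id".split())))   # 2
print(count_trees(tuple("id".split())))             # 1
```

The string id * id + id has two trees, one with * at the root and one with + at the root, which correspond to the two readings id*(id+id) and (id*id)+id.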
3.5 Ambiguous Grammars
• We need to make sure that all additions appear higher in the
tree than multiplications (Why?) How can we do this?
• Once we replace an E with E*E using rule 4, we don’t
want to rewrite any of the Es we’ve just created using rule 2,
since that would place an addition (+) lower in the tree than a
multiplication (*)
• Let’s create a new non-terminal T for multiplication and division
• T will generate strings of id’s multiplied or divided together, with
no additions or subtractions
• Then we can modify E to generate strings of T’s added together
• This modified grammar is in Fig. 3.6 (p42)
• With this grammar it is impossible to generate a parse tree
that has * higher than + in the tree. However, the grammar is
still ambiguous
3.5 Ambiguous Grammars
• Consider the string id+id+id, which has two parse trees, as listed
in Fig. 3.7 (p43)
• id+id+id = (id+id)+id = id+(id+id), so both groupings are ok
• id-id-id = (id-id)-id != id-(id-id), so the second grouping is wrong
• We would like addition and subtraction to be left-associative,
as above
• In other words, we need to make sure that the right subtree of an
addition or subtraction is not another addition or subtraction
• We modify the CFG in Fig. 3.6 as in Fig. 3.8 (p43, unambiguous
CFG)
• In Fig. 3.9 (p44), we add parentheses to the grammar in Fig. 3.8
to express expressions like id*(id+id).
• Fig. 3.9 is unambiguous too. See the three example parse
trees for this grammar in Fig. 3.10-3.12
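To see what precedence and left association give us, here is a sketch of a parser for a commonly used layered grammar: E → E + T | E - T | T, T → T * F | T / F | F, F → id | ( E ). This grammar is an assumption for illustration and may differ in detail from Fig. 3.8 and Fig. 3.9. The left-recursive rules are implemented with loops, which build the same left-leaning trees; the input is assumed to be space-separated tokens, and trees are returned as nested tuples (operator, left, right).

```python
def parse(text):
    """Return the tree of an expression as nested (op, left, right) tuples."""
    tokens = text.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():                      # F -> id | ( E )
        token = advance() if peek() is not None else None
        if token == "id":
            return "id"
        if token == "(":
            inner = expr()
            if peek() != ")":
                raise SyntaxError("expected )")
            advance()
            return inner
        raise SyntaxError(f"unexpected token {token!r}")

    def term():                        # T -> T * F | T / F | F
        left = factor()
        while peek() in ("*", "/"):
            op = advance()
            left = (op, left, factor())   # earlier operands end up lower-left
        return left

    def expr():                        # E -> E + T | E - T | T
        left = term()
        while peek() in ("+", "-"):
            op = advance()
            left = (op, left, term())
        return left

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("id - id - id"))      # ('-', ('-', 'id', 'id'), 'id')
print(parse("id + id * id"))      # ('+', 'id', ('*', 'id', 'id'))
print(parse("id * ( id + id )"))  # ('*', 'id', ('+', 'id', 'id'))
```

Each string now has exactly one tree: subtraction groups to the left, multiplication sits below addition, and parentheses are the only way to put an addition below a multiplication.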
3.6 Extended Backus Naur Form
• Another name for CFG notation is Backus Naur Form (BNF).
• There is an extension to BNF notation, called
Extended Backus Naur Form, or EBNF
• EBNF rules allow us to mix and match CFG notation
and regular expression notation in the right-hand side
of CFG rules
• For example, consider the following CFG, which
describes simpleJava statement blocks and stylized
simpleJava print statements:
1. S → { B }
2. S → print(id)
3. B → S ; C
4. C → S ; C
5. C → ε
3.6 Extended Backus Naur Form
• Rules 3, 4, and 5 in the above grammar
describe a series of one or more statements S,
each terminated by a semicolon
• We could express the same language using
EBNF as follows:
1. S → { B }
2. S → print "(" id ")"
3. B → (S ;)+
• Note: in Rule 2, when we want a parenthesis to appear in the
language, we surround it with quotation marks. In Rule 3, the
pair of parentheses groups the operand of the + operator and
does not belong to the language.
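An EBNF repetition such as (S ;)+ maps naturally onto a loop in a hand-written recognizer. The following sketch checks strings against the EBNF grammar above; the function name accepts and the use of space-separated tokens are assumptions of this example.

```python
def accepts(text):
    """True if `text` matches S in:  S -> { B } | print "(" id ")",  B -> (S ;)+"""
    tokens = text.split()
    pos = 0

    def expect(token):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != token:
            raise SyntaxError(f"expected {token!r} at token {pos}")
        pos += 1

    def statement():                   # S -> { B } | print "(" id ")"
        if pos < len(tokens) and tokens[pos] == "{":
            expect("{")
            block()
            expect("}")
        else:
            for token in ("print", "(", "id", ")"):
                expect(token)

    def block():                       # B -> (S ;)+
        statement()                    # "+" requires at least one S ;
        expect(";")
        while pos < len(tokens) and tokens[pos] in ("{", "print"):
            statement()                # then repeat while another S starts
            expect(";")

    try:
        statement()
        return pos == len(tokens)
    except SyntaxError:
        return False

print(accepts("{ print ( id ) ; print ( id ) ; }"))  # True
print(accepts("{ }"))                                # False
```

The empty block { } is rejected because + means one or more; with * in place of + it would be accepted.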
3.6 Extended Backus Naur Form
• Another Example: Consider the following CFG
fragment, which describes Pascal for statements:
1. S → for id := E to E do S
2. S → for id := E downto E do S
• The CFG rules for both types of Pascal for
statement (to and downto) can be represented
by a single EBNF rule, as below:
• S → for id := E (to | downto) E do S
• The “, +, and | notations in the above examples
are from regular expressions.
Chapter 3 Homework
• Due: 2/18/2013
• Pages: 46-47
• Do the following exercises:
1 (a,c,e), 2(a), 3(a,c)