Chapter 4 Lexical Analysis

Lexical Analysis
• Why split it from parsing?
– Simplifies design
• Parsers that must also handle whitespace and comments are more awkward
– Efficiency
• Only use the most powerful technique that works
• And nothing more
– No parsing sledgehammers for lexical nuts
– Portability
• More modular code
• More code re-use
Source Code Characteristics
• Code
– Identifiers
• Count, max, get_num
– Language keywords: reserved or predefined
• switch, if .. then.. else, printf, return, void
• Mathematical operators
– +, *, >> ….
– <=, =, != …
– Literals
• “Hello World”
• Comments
• Whitespace
Reserved words versus
predefined identifiers
• Reserved words cannot be used as the name of
anything in a definition (i.e., as an identifier).
• Predefined identifiers have special meanings, but
can be redefined (although they probably
shouldn’t).
• Examples of predefined identifiers in Java:
anything in java.lang package, such as String,
Object, System, Integer.
Language of Lexical Analysis
• Tokens: the category
• Patterns: the regular expression
• Lexemes: the actual string matched
Tokens are not enough…
• Clearly, if we replaced every occurrence of a
variable with a token then ….
We would lose other valuable information
(value, name)
• Other data items are attributes of the tokens
• Stored in the symbol table
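As a minimal sketch in Python (the class and field names here are illustrative, not from the text), a token is a category plus a lexeme, while other attributes live in the symbol table:

from dataclasses import dataclass

@dataclass
class Token:
    category: str   # e.g. "ID", "NUM", "PLUS"
    lexeme: str     # the actual string matched, e.g. "count"

# Attributes such as a variable's type or a literal's value are kept
# in the symbol table rather than in the token itself.
symbol_table = {}

tok = Token("ID", "count")
symbol_table[tok.lexeme] = {"type": "int"}   # hypothetical attribute entry
print(tok, symbol_table)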
Token delimiters
• When does a token/lexeme end?
e.g. xtemp=ytemp
Ambiguity in identifying tokens
• A programming language definition will
state how to resolve uncertain token
assignment
• <> Is it 1 or 2 tokens?
• Reserved keywords (e.g. if) take
precedence over identifiers (even though
the matching rules are the same for both)
• Disambiguating rules state what to do
• ‘Principle of longest substring’: greedy
Regular Expressions
• To represent patterns of strings of characters
• REs
– Alphabet – set of legal symbols
– Meta-characters – characters with special meanings
• ε is the empty string
• 3 basic operations
– Choice – choice1|choice2,
• a|b matches either a or b
– Concatenation – firstthing secondthing
• (a|b)c matches the strings { ac, bc }
– Repetition (Kleene closure)– repeatme*
• a* matches { ε, a, aa, aaa, aaaa, … }
• Precedence: * is highest, | is lowest
– Thus a|bc* is a|(b(c*))
Regular Expressions…
• We can add in regular definitions
– digit = 0|1|2 …|9
• And then use them:
– digit digit*
• A sequence of 1 or more digits
• One or more repetitions:
– (a|b)(a|b)* ≡ (a|b)+
• Any character in the alphabet: .
– .*b.* - strings containing at least one b
• Ranges [a-z], [a-zA-Z], [0-9], (assume character
set ordering)
• Not: ~a or [^a]
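These operations map directly onto a practical regex library; a small sketch using Python's re module (the sample strings are illustrative):

import re

# digit digit* : one or more digits (equivalently [0-9]+)
print(re.fullmatch(r"[0-9][0-9]*", "2024") is not None)   # True

# (a|b)(a|b)* is the same as (a|b)+
print(re.fullmatch(r"(a|b)+", "abba") is not None)        # True

# .*b.* : strings containing at least one b
print(re.fullmatch(r".*b.*", "aaa") is not None)          # False

# Ranges: [a-zA-Z] followed by letters or digits
print(re.fullmatch(r"[a-zA-Z][a-zA-Z0-9]*", "max_1"))     # None: _ not allowed here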
Some exercises
• Describe the languages denoted by the following regular expressions
1. 0 ( 0 | 1 ) * 0
2. ( ( 11 | 0 ) * ) *
3. 0* 1 0* 1 0* 1 0 *
• Write regular definitions for the following languages
1. All strings that contain the five vowels in order
(but not necessarily adjacent) aabcaadggge is
okay
2. All strings of letters in which the letters are in
ascending lexicographic order
3. All strings of 0’s and 1’s that do not contain the
substring 011
Limitations of REs
• REs can describe many language constructs but not
all
• For example
Alphabet = {a,b}, describe the set of strings
consisting of a single a surrounded by an equal
number of b’s
S= {a, bab, bbabb, bbbabbb, …}
• For example, nested Tags in HTML
Lookahead
• <=, <>, <
• When we read a token delimiter to establish a token,
we need to make sure that it is still available as part
of the next token
– It is the start of the next token!
• This is lookahead
– Decide what to do based on the character we
‘haven’t read’
• Sometimes implemented by reading from a buffer
and then pushing the input back into the buffer
• And then starting with recognizing the next token
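A minimal sketch of the buffer-and-pushback idea in Python (class and method names are illustrative):

class CharReader:
    """Reads characters one at a time and supports pushing one back."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_char(self):
        if self.pos >= len(self.text):
            return ""          # end of input
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def push_back(self):
        """Un-read the last character so the next token starts with it."""
        self.pos -= 1

r = CharReader("a<b")
r.next_char()              # 'a'
first = r.next_char()      # '<'
nxt = r.next_char()        # lookahead: is the token '<', '<=' or '<>'?
if nxt not in ("=", ">"):
    r.push_back()          # 'b' starts the NEXT token: push it back
    print("token:", first) # token: <
else:
    print("token:", first + nxt)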
Classic Fortran example
• DO 99 I=1,10 becomes DO99I=1,10
versus
DO99I=1.10
The first is a do loop, the second an
assignment. We need lots of lookahead to
distinguish.
• When can the lexical analyzer assign a token?
Push back into input buffer
– or ‘backtracking’
Finite Automata
• A recognizer determines if an input string
is a sentence in a language
• Uses a regular expression
• Turn the regular expression into a finite
automaton
• Could be deterministic or nondeterministic
Transition diagram for identifiers
• RE
– Identifier -> letter (letter | digit)*
[Transition diagram: start state 0 moves to state 1 on letter; state 1 loops on letter or digit; on any other character state 1 moves to state 2, the accepting state.]
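A direct encoding of this transition diagram as a sketch in Python (state numbering follows the diagram above):

def scan_identifier(text, pos):
    """Run the identifier DFA starting at text[pos].
    Returns the lexeme and the position of the delimiting character."""
    state = 0
    start = pos
    while True:
        ch = text[pos] if pos < len(text) else ""   # "" acts as 'other'
        if state == 0:
            if ch.isalpha():
                state, pos = 1, pos + 1             # letter: go to state 1
            else:
                return None, start                  # not an identifier
        elif state == 1:
            if ch.isalpha() or ch.isdigit():
                pos += 1                            # letter|digit: stay in state 1
            else:
                return text[start:pos], pos         # 'other': accept (state 2)

print(scan_identifier("xtemp=ytemp", 0))  # ('xtemp', 5)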
• An NFA is similar to a DFA but it also permits
multiple transitions over the same character and
transitions over ε. In the case of multiple
transitions from a state over the same character,
when we are at this state and we read this
character, we have more than one choice; the
NFA succeeds if at least one of these choices
succeeds. The ε transition doesn't consume any
input characters, so you may jump to another
state for free.
• Clearly DFAs are a subset of NFAs. But it turns
out that DFAs and NFAs have the same
expressive power.
From a Regular Expression to an NFA
Thompson’s Construction
(a | b)* abb

[Thompson NFA for (a|b)*abb, states 0 to 10: start state 0; ε-transitions 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7; transitions 2 –a→ 3, 4 –b→ 5, 7 –a→ 8, 8 –b→ 9, 9 –b→ 10; state 10 accepts.]

[A smaller NFA for the same language: state 0 loops to itself on a and on b; 0 –a→ 1, 1 –b→ 2, 2 –b→ 3; state 3 accepts.]
Non-deterministic finite state automaton (NFA)

[The equivalent DFA produced by subset construction: 0 –a→ 01, 0 –b→ 0; 01 –a→ 01, 01 –b→ 02; 02 –a→ 01, 02 –b→ 03; 03 –a→ 01, 03 –b→ 0; start state 0, accepting state 03.]
Equivalent deterministic finite state automaton (DFA)
Transition Table (NFA)
State | Input Symbol
      | a     | b
------+-------+------
0     | {0,1} | {0}
1     |       | {2}
2     |       | {3}
NFA -> DFA (subset construction)
• Suppose that you assign a number to each NFA state.
• The DFA states generated by subset construction have
sets of numbers, instead of just one number. For example,
a DFA state may have been assigned the set {5, 6, 8}. This
indicates that arriving at the state labeled {5, 6, 8} in the
DFA is the same as arriving at state 5, state 6, or
state 8 in the NFA when parsing the same input.
• Recall that a particular input sequence, when parsed by a
DFA, leads to a unique state, while when parsed by an NFA it
may lead to multiple states.
• First we need to handle transitions that lead to other states
for free (without consuming any input). These are the
ε-transitions. We define the ε-closure of an NFA node as the set
of all the nodes reachable from this node using zero, one, or
more ε-transitions.
NFA -> DFA (cont)
• The start state of the constructed DFA is labeled
by the ε-closure of the NFA start state.
• For every DFA state labeled by some set {s1,...,
sn} and for every character c in the language
alphabet, you find all the states reachable from s1,
s2, ..., or sn using c arrows and you union
together the ε-closures of these nodes.
• If this set is not the label of any other node in the
DFA constructed so far, you create a new DFA
node with this label.
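A compact sketch of subset construction in Python, applied to the small NFA for (a|b)*abb shown earlier (the representation is illustrative; this NFA has no ε-moves, so each ε-closure is just the state itself, and a full implementation would compute closures first):

from collections import deque

# NFA for (a|b)*abb: state 0 loops on a and b; 0 -a-> 1, 1 -b-> 2, 2 -b-> 3 (accepting).
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
start = frozenset({0})

def move(states, ch):
    """Union of the NFA transitions from each state in `states` on `ch`."""
    return set().union(*(nfa.get((s, ch), set()) for s in states))

def subset_construction(alphabet="ab"):
    dfa, seen, work = {}, {start}, deque([start])
    while work:
        S = work.popleft()
        for ch in alphabet:
            T = frozenset(move(S, ch))
            if not T:
                continue
            dfa[(S, ch)] = T          # DFA transition S -ch-> T
            if T not in seen:         # a label not used before: new DFA state
                seen.add(T)
                work.append(T)
    return dfa

for (S, ch), T in sorted(subset_construction().items(), key=str):
    print(sorted(S), ch, "->", sorted(T))   # reproduces the DFA table below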
Transition Table (DFA)
State | Input Symbol
      | a  | b
------+----+----
0     | 01 | 0
01    | 01 | 02
02    | 01 | 03
03    | 01 | 0
Writing a lexical analyzer
• The DFA helps us to write the scanner.
• Figure 4.1 in your text gives a good
example of what a scanner might look
like.
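Figure 4.1 itself is not reproduced here, but the following is a sketch of the same idea in Python: a loop that dispatches on the current character and runs the matching recognizer by hand (token names are illustrative):

def scan(text):
    """Yield (category, lexeme) pairs; a hand-written scanner sketch."""
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():                      # skip whitespace
            pos += 1
        elif ch.isalpha():                    # identifier DFA
            start = pos
            while pos < len(text) and text[pos].isalnum():
                pos += 1
            yield ("ID", text[start:pos])
        elif ch.isdigit():                    # number DFA
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            yield ("NUM", text[start:pos])
        elif ch in "+-*/=":                   # single-character operators
            yield ("OP", ch)
            pos += 1
        else:
            raise SyntaxError(f"illegal character {ch!r}")  # lexical error

print(list(scan("xtemp = ytemp + 42")))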
LEX (FLEX)
• Tool for generating programs which
recognize lexical patterns in text
• Takes regular expressions and turns them
into a program
Lexical Errors
• Only a small percentage of errors can be
recognized during Lexical Analysis
Consider if (good == “bad)
Examples from the Perl language
– Line ends inside literal string
– Illegal character in input file
– missing semi-colon
– missing operator
– missing paren
– unquoted string
– unopened file handle
In general
• What does a lexical error mean?
• Strategies for dealing with:
– “Panic-mode”
• Delete chars from input until something matches
– Inserting characters
– Re-ordering characters
– Replacing characters
• For an error like “illegal character” we
should report it sensibly
Syntax Analysis
• also known as Parsing
• Grouping together tokens into larger
structures
• Analogous to lexical analysis
• Input:
– Tokens (output of Lexical Analyzer)
• Output:
– Structured representation of original program
Parsing Fundamentals
• Source program:
– 3+4
• After Lexical Analysis:
– the token sequence: number (value 3), plus, number (value 4)
A Context Free Grammar
• A grammar is a four-tuple (Σ, N, P, S) where
– Σ is the terminal alphabet
– N is the non-terminal alphabet
– P is the set of productions
– S is a designated start symbol in N
Parsing
• Expression → number plus number
– Similar to regular definitions:
• Concatenation
• Choice
Expression → number Operator number
Operator → + | - | * | /
• Repetition is done differently
BNF Grammar
Expression → number Operator number
Operator → + | - | * | /
Meta-symbols: → and |
Structure on the left is defined to consist of the choices
on the right hand side
Different conventions for writing BNF Grammars:
<expression> ::= number <operator> number
Expression → number Operator number
Derivations
• Derivation:
– Sequence of replacements of structure
names by choices on the RHS of grammar
rules
– Begin: start symbol
– End: string of token symbols
– Each step one replacement is made
Exp → Exp Op Exp | number
Op → + | - | * | /
Example Derivation
Note the different arrows:
⇒ applies grammar rules in a derivation
→ used to define grammar rules
Non-terminals: Exp, Op; Terminals: number, *
Terminals: because they terminate the derivation
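For example, with the grammar above (terminals number and *), one derivation of number * number is:

Exp ⇒ Exp Op Exp ⇒ number Op Exp ⇒ number * Exp ⇒ number * number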
Derivations (2)
• E(E)|a
• What sentences does this grammar generate?
An example derivation:
• E ⇒ ( E ) ⇒ ((E)) ⇒ ((a))
• Note that this is what we couldn’t achieve with
regular definitions
Recursive Grammars
• E(E)|a
– is recursive
E  ( E ) is the general case
E  a is the terminating case
• We have no * operator in context free grammars
– Repetition = recursion
• E → Eα | β
– derives β, βα, βαα, βααα, …
– All strings beginning with β followed by zero or more
repetitions of α
• i.e. βα*
Recursive Grammars (2)
• a+
(regular expression)
– E → E a | a (1)
– Or
– E → a E | a (2)
• 2 different grammars can derive the same
language
(1) is left recursive
(2) is right recursive
• a*
– Implies we need the empty production
– E → E a | ε
Recursive Grammars (3)
• Require recursive data structures
– ⇒ trees
Exp → Exp Op Exp | number
Op → + | - | * | /
• Parse Trees

[Parse tree for number * number: root exp (node 1) with children exp (node 2), op (node 3) and exp (node 4); node 2 derives number, node 3 derives *, node 4 derives number.]
Parse Trees & Derivations
• Leaves = terminals
• Interior nodes = non-terminals
• If we replace the non-terminals right to
left
– The parse tree sequence is right to left
– A rightmost derivation -> reverse post-order
traversal
• If we derive left to right:
– A leftmost derivation
– pre-order traversal
– parse trees encode information about the
derivation process
Formal Methods of Describing Syntax
1950: Noam Chomsky (noted linguist) described generative
devices which describe four classes of languages (in order of
decreasing power)
recursively enumerable: x → y where x and y can be any string of
nonterminals and terminals.
context-sensitive: x → y where x and y can be strings of terminals
and non-terminals but y must be the same length or longer than x.
– Can recognize a^n b^n c^n
context-free (yacc): nonterminals appear singly on the left side of
productions. Any nonterminal can be replaced by its right-hand
side regardless of the context it appears in.
– Ex: If you were in the boxing ring and said ``Hit me'' it would imply a
different action than if you were playing cards.
– Ex: If an IDENTSY between brackets is treated differently, in
terms of what it matches, from an IDENTSY between parens, this is
context sensitive.
– Can recognize a^n b^n, palindromes
regular (lex)
– Can recognize a^n b^m
Chomsky was interested in the theoretic nature of natural
languages.
Abstract Syntax Trees
Parse trees contain surplus information
Parse Tree (for the token sequence 3 + 4):
[root exp with three children: exp deriving number (3), op deriving +, and exp deriving number (4)]

Abstract Syntax Tree:
[a + node with two children, 3 and 4: this is all the information we actually need]
An exercise
• Consider the grammar
S -> (L) | a
L -> L,S | S
(a) What are the terminals, nonterminals and start symbol
(b) Find leftmost and rightmost derivations and parse trees
for the following sentences
i. (a,a)
ii. (a, (a,a))
iii. (a, ((a,a), (a,a)))
Parsing token sequence: id + id * id
E → E + E | E * E | ( E ) | - E | id
Ambiguity
• If a sentence has two distinct parse trees, the
grammar is ambiguous
• Or alternatively: a grammar is ambiguous if there
are two different rightmost derivations for the same
string.
• In English, the phrase ``small dogs and cats'' is
ambiguous as we aren't sure if the cats are small
or not.
• `I see flying planes' is also ambiguous
• A language is said to be ambiguous if no
unambiguous grammar exists for it.
• “Dance is at the old main gym.” How is it parsed?
Ambiguous Grammars
• Problem – no clear structure is expressed
• A grammar that generates a string with 2 distinct
parse trees is called an ambiguous grammar
– 2+3*4 = 2 + (3*4) = 14
– 2+3*4 = (2+3) * 4 = 20
• Our experience of math says interpretation 1 is
correct but the grammar does not express this:
– E → E + E | E * E | ( E ) | - E | id
Example of Ambiguity
• Grammar:
expr → expr + expr | expr * expr
| ( expr ) | NUMBER
• Expression: 2 + 3 * 4
• Parse trees:

[Tree 1: expr → expr + expr; the left expr derives NUMBER (2) and the right expr → expr * expr derives NUMBER (3) and NUMBER (4), grouping as 2 + (3 * 4).]

[Tree 2: expr → expr * expr; the left expr → expr + expr derives NUMBER (2) and NUMBER (3), and the right expr derives NUMBER (4), grouping as (2 + 3) * 4.]
Removing Ambiguity
Two methods
1. Disambiguating Rules
positives: leaves grammar unchanged
negatives: grammar is not sole source of
syntactic knowledge
2. Rewrite the Grammar
Using knowledge of the meaning that we
want to use later in the translation into
object code to guide grammar alteration
Precedence
E → E addop Term | Term
addop → + | -
Term → Term * Factor | Term / Factor | Factor
Factor → ( exp ) | number | id
• Operators of equal precedence are grouped
together at the same ‘level’ of the grammar ⇒
’precedence cascade’
• The lowest level operators have highest precedence
• (The first shall be last and the last shall be first.)
Associativity
• 45-10-5 ?
30 or 40
Subtraction is left associative, left to right (=30)
• E → E addop E | Term
Does not tell us how to split up 45-10-5
• E → E addop Term | Term
Forces left associativity via left recursion
• Precedence & associativity remove the ambiguity of
arithmetic expressions
– Which is what our math teachers spent years telling us!
Ambiguous grammars
Statement -> If-statement | other
If-statement -> if (Exp) Statement
| if (Exp) Statement else Statement
Exp -> 0 | 1
Parse
if (0) if (1) other1 else other2
Which if does the else belong to?
Removing ambiguity
Statement -> Matched-stmt | Unmatched-stmt
Matched-stmt -> if (Exp) Matched-stmt else Matched-stmt
| other
Unmatched-stmt -> if (Exp) Statement
| if (Exp) Matched-stmt else Unmatched-stmt
Extended BNF Notation
• Notation for repetition and optional features.
• {…} expresses repetition:
expr → expr + term | term becomes
expr → term { + term }
• […] expresses optional features:
if-stmt → if( expr ) stmt
| if( expr ) stmt else stmt
becomes
if-stmt → if( expr ) stmt [ else stmt ]
Notes on use of EBNF
• Use {…} only for left recursive rules; a right recursive rule
expr → term + expr | term
should become expr → term [ + expr ]
• Do not start a rule with {…}: write
expr → term { + term }, not
expr → { term + } term
• Exception to the previous rule: simple token repetition, e.g.
expr → { - } term …
• Square brackets can be used anywhere, however:
expr → expr + term | term | unaryop term
should be written as
expr → [ unaryop ] term { + term }
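The point of {…} is that it turns into a loop in a parser; a sketch in Python for expr → term { + term } (token handling is illustrative):

def parse_expr(tokens):
    """expr -> term { + term }: EBNF repetition becomes a while loop."""
    value = parse_term(tokens)
    while tokens and tokens[0] == "+":   # { + term } repeats zero or more times
        tokens.pop(0)                    # consume '+'
        value = value + parse_term(tokens)
    return value

def parse_term(tokens):
    return int(tokens.pop(0))            # a term is just a number in this sketch

print(parse_expr(["1", "+", "2", "+", "3"]))  # left associative: (1+2)+3 = 6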
Syntax Diagrams
• An alternative to EBNF.
• Rarely seen any more: EBNF is much
more compact.
• Example (if-statement, p. 101):
[Syntax diagram for if-statement: if, then (, then expression, then ), then statement, with an optional path through else to a second statement.]
How is Parsing done?
1. Recursive descent (top down).
2. Bottom up – tries to match input with the
right hand side of a rule. Sometimes
called shift-reduce parsers.
Predictive Parsing
• Top down parsing
• LL(1) parsing
• Table driven predictive parsing (no
recursion) versus recursive descent
parsing where each nonterminal is
associated with a procedure call
• No backtracking
E -> E + T | T
T -> T * F | F
F -> (E) | id
Two grammar problems
• Eliminating left recursion
(without changing associativity)
A -> Aα | β
becomes
A -> βA’
A’ -> αA’ | ε
Example
E -> E + T | T
T -> T * F | F
F -> (E) | id
The general case
A -> Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
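A sketch of immediate left-recursion elimination in Python, following the A -> Aα | β scheme above (the grammar representation is illustrative):

def eliminate_left_recursion(head, productions):
    """Split the productions of `head` into left-recursive alphas and betas,
    then rewrite as  head -> beta head'  and  head' -> alpha head' | ε."""
    alphas = [p[1:] for p in productions if p[0] == head]   # A -> A alpha
    betas  = [p     for p in productions if p[0] != head]   # A -> beta
    if not alphas:
        return {head: productions}          # nothing to do
    new = head + "'"
    return {
        head: [b + [new] for b in betas],
        new:  [a + [new] for a in alphas] + [["ε"]],
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | ε
print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))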
Two grammar problems
• Eliminating left recursion involving
derivations of two or more steps
S -> Aa | b
A -> Ac | Sd | ε
Substituting S -> Aa | b into A -> Sd gives:
A -> Ac | Aad | bd | ε
• Removing Left Recursion
Before
• A --> A x
• A --> y
After
• A --> y B
• B --> x B
• B --> ε
Two grammar problems…
• Left factoring
Stmt -> if Exp then Stmt else Stmt
| if Exp then Stmt
A -> αβ1 | αβ2
becomes
A -> αA’
A’ -> β1 | β2
exercises
Eliminate left recursion from the following
grammars.
a) S->(L) | a
L->L,S | S
b) Bexpr ->Bexpr or Bterm | Bterm
Bterm -> Bterm and Bfactor | Bfactor
Bfactor -> not Bfactor | (Bexpr) | true | false
COMP313A Programming
Languages
Syntax Analysis (3)
• Table driven predictive parsing
• Getting the grammar right
• Constructing the table
Table Driven Predictive Parsing
[Schematic: the predictive parsing program reads the input (e.g. a + b $, or the token stream id + id * id), consults a stack holding X, Y, Z, $ and the parsing table, and produces the parse as output.]
Table Driven Predictive Parsing
Non-Terminal | Input Symbol
             | id     | +        | *        | (      | )     | $
-------------+--------+----------+----------+--------+-------+------
E            | E->TE’ |          |          | E->TE’ |       |
E’           |        | E’->+TE’ |          |        | E’->ε | E’->ε
T            | T->FT’ |          |          | T->FT’ |       |
T’           |        | T’->ε    | T’->*FT’ |        | T’->ε | T’->ε
F            | F->id  |          |          | F->(E) |       |
Table Driven Predictive Parsing
Parse id + id * id
Leftmost derivation and parse tree using the grammar
E -> TE’
E’ -> +TE’ | ε
T -> FT’
T’ -> *FT’ | ε
F -> (E) | id
First and Follow Sets
• First and Follow sets tell when it is
appropriate to put the right hand side of
some production on the stack.
(i.e. for which input symbols)
E -> TE’
E’ -> +TE’ | ε
T -> FT’
T’ -> *FT’ | ε
F -> (E) | id

id + id * id
First Sets
1. If X is a terminal, then FIRST(X) is {X}
2. If X -> ε is a production, then add ε to
FIRST(X)
3. If X is a non-terminal and X -> Y1Y2…Yk is a
production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi), and ε is in all of
FIRST(Y1), …, FIRST(Yi-1). If ε is in FIRST(Yj) for all
j = 1, 2, …, k, then add ε to FIRST(X).
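A sketch of the fixed-point computation of FIRST sets in Python for the expression grammar below (EPS stands for ε; the representation is illustrative):

EPS = "ε"

# Productions are lists of symbols; a symbol is a nonterminal
# iff it is a key of `grammar`.
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                                # iterate to a fixed point
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for sym in prod:
                    if sym == EPS:
                        add = {EPS}               # rule 2: X -> ε
                    elif sym not in grammar:
                        add = {sym}               # rule 1: terminal
                    else:
                        add = first[sym] - {EPS}  # rule 3: FIRST(Yi) minus ε
                    if not add <= first[nt]:
                        first[nt] |= add
                        changed = True
                    nullable = sym == EPS or (sym in grammar and EPS in first[sym])
                    if not nullable:
                        break                     # later Yi cannot contribute
                else:
                    if EPS not in first[nt]:      # every symbol was nullable
                        first[nt].add(EPS)
                        changed = True
    return first

print(compute_first(grammar))  # e.g. FIRST(E') = {'+', 'ε'}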
FIRST sets
E -> TE’
E’ -> +TE’ | ε
T -> FT’
T’ -> *FT’ | ε
F -> (E) | id
Follow Sets
1. Place $ in FOLLOW(S), where S is the start
symbol and $ is the input right endmarker
2. If there is a production A -> αBβ, then
everything in FIRST(β) except for ε is placed in
FOLLOW(B)
3. If there is a production A -> αB, or a
production A -> αBβ where FIRST(β) contains ε
(i.e., β ⇒* ε), then everything in FOLLOW(A) is in
FOLLOW(B)
Follow Sets
E -> TE’
E’ -> +TE’ | ε
T -> FT’
T’ -> *FT’ | ε
F -> (E) | id
FIRST and FOLLOW sets
Construct first and follow sets for the
following grammar after left recursion has
been eliminated
a) S->(L) | a
L->L,S | S
Construction of the Predictive
Parsing Table
• Algorithm from Aho et al.
1. For each production A -> α of the grammar, do
steps 2 and 3
2. For each terminal a in FIRST(α), add A -> α to
M[A, a]
3. If ε is in FIRST(α), add A -> α to M[A, b] for each
terminal b in FOLLOW(A). If ε is in FIRST(α) and
$ is in FOLLOW(A), add
A -> α to M[A, $].
4. Make each undefined entry of M be error
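A sketch of steps 2 and 3 in Python, with the FIRST and FOLLOW sets of the expression grammar hard-coded (they match the sets derivable from the rules above):

EPS = "ε"
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST  = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
          "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"*", "+", ")", "$"}}

def first_of(prod):
    """FIRST of a production body (sufficient here: no production of this
    grammar starts with a nullable nonterminal)."""
    sym = prod[0]
    if sym == EPS:
        return {EPS}
    return FIRST.get(sym, {sym})   # nonterminal, or a terminal first symbol

M = {}
for A, prods in grammar.items():
    for prod in prods:                       # step 1: each production A -> α
        for a in first_of(prod) - {EPS}:     # step 2: terminals in FIRST(α)
            M[(A, a)] = prod
        if EPS in first_of(prod):            # step 3: ε in FIRST(α)
            for b in FOLLOW[A]:              # includes $ when $ in FOLLOW(A)
                M[(A, b)] = prod

for (A, a), prod in sorted(M.items()):
    print(f"M[{A}, {a}] = {A} -> {' '.join(prod)}")   # reproduces the table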
Predictive Parsing Table
Construct the parsing table for the grammar
a) S->(L) | a
L->L,S | S
COMP313A Programming
Languages
Syntax Analysis (4)
Lecture Outline
• A problem for predictive parsers
• Predictive parsing LL(1) grammars
• Error recovery in predictive parsing
• Recursive Descent parsing
Producing code from parse on the fly
E -> TE’
E’ -> +TE’ | ε
T -> FT’
T’ -> *FT’ | ε
F -> (E) | id
Table Driven Predictive Parsing
Non-Terminal | Input Symbol
             | id     | +        | *        | (      | )     | $
-------------+--------+----------+----------+--------+-------+------
E            | E->TE’ |          |          | E->TE’ |       |
E’           |        | E’->+TE’ |          |        | E’->ε | E’->ε
T            | T->FT’ |          |          | T->FT’ |       |
T’           |        | T’->ε    | T’->*FT’ |        | T’->ε | T’->ε
F            | F->id  |          |          | F->(E) |       |
LL(1) grammars
S -> if E then S S’ | a
S’ -> else S | 
E -> b
FIRST(S) = {if, a}
FIRST(S’) = {else, ε}
FIRST(E) = {b}
FOLLOW(S) = {$, else}
FOLLOW(S’) ={$, else}
FOLLOW(E) = {then}
Construct the parsing table
LL(1) grammars
• An LL(1) grammar has no multiply defined
entries in its parsing table
• Left-recursive and ambiguous grammars are
not LL(1)
• A grammar G is LL(1) iff whenever A -> α | β
are two distinct productions of G:
1. For no terminal a do both α and β derive strings
beginning with a
2. At most one of α and β can derive the empty string
3. If β ⇒* ε then α does not derive any string beginning
with a terminal in FOLLOW(A)
Error Recovery in Predictive
Parsing
• Panic mode recovery
based on a set of synchronizing tokens
• Heuristics for synchronizing sets
1. For nonterminal A all symbols in Follow(A) and
FIRST(A)
2. Symbols that begin higher constructs
3. If A derives ε then A -> ε can be used as the default
4. Pop a nonmatching terminal from the top of the stack
Recursive Descent Parsers
• A function for each nonterminal
example expression grammar
Expr -> Term Expr’
Expr’ -> +Term Expr’ | ε
Term -> Factor Term’
Term’ -> *Factor Term’ | ε
Factor -> (Expr) | id
Function Expr
If the next input symbol is a ( or id then
call function Term followed by function Expr’
Else Error
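A sketch of the whole recursive-descent parser in Python, one function per nonterminal, following the outline above for Expr (error handling kept minimal):

class Parser:
    """Recursive descent for: Expr -> Term Expr'; Expr' -> + Term Expr' | ε;
    Term -> Factor Term'; Term' -> * Factor Term' | ε; Factor -> (Expr) | id."""
    def __init__(self, tokens):
        self.toks = tokens + ["$"]
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, got {self.peek()}")
        self.pos += 1

    def expr(self):
        if self.peek() in ("(", "id"):   # FIRST(Expr) = { (, id }
            self.term()
            self.expr_prime()
        else:
            raise SyntaxError(f"unexpected {self.peek()}")

    def expr_prime(self):
        if self.peek() == "+":           # Expr' -> + Term Expr'
            self.match("+")
            self.term()
            self.expr_prime()
        # else: Expr' -> ε (do nothing)

    def term(self):
        self.factor()
        self.term_prime()

    def term_prime(self):
        if self.peek() == "*":           # Term' -> * Factor Term'
            self.match("*")
            self.factor()
            self.term_prime()
        # else: Term' -> ε

    def factor(self):
        if self.peek() == "(":
            self.match("(")
            self.expr()
            self.match(")")
        else:
            self.match("id")             # Factor -> id

p = Parser(["id", "+", "id", "*", "id"])
p.expr()
p.match("$")                             # whole input consumed: parse succeeded
print("accepted")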