Transcript Sebesta 4
CSci210.BA4
Chapter 4 Topics
Introduction
Lexical and Syntax Analysis
The Parsing Problem
Recursive-Descent Parsing
Bottom-Up Parsing
Introduction
Syntax analyzers are almost always based
on a formal description of the syntax of
the source language (grammars)
Almost all compilers split syntax
analysis into two parts:
Lexical Analysis – low-level (names, literals, operators)
Syntax Analysis – high-level (expressions, statements, program units)
Reasons to Separate Syntax and
Lexical Analysis
Simplicity – lexical analysis uses simpler techniques
than parsing, so separating it keeps both parts simpler
Efficiency – the lexical analyzer can be optimized
separately from the syntax analyzer
Portability – lexical analyzer is somewhat
platform dependent whereas the syntax analyzer
is more platform independent
Lexical Analysis
A pattern matcher for character strings
Performs syntax analysis at the lowest
level of the program structure
Extracts lexemes from a given input
string and produces the corresponding
tokens
Lexical Analysis (continued)
result = oldsum - value / 100;
Token        Lexeme
IDENT        result
ASSIGN_OP    =
IDENT        oldsum
SUB_OP       -
IDENT        value
DIV_OP       /
INT_LIT      100
SEMICOLON    ;
Building a Lexical Analyzer
Write a formal description of the tokens
and use a software tool that constructs
lexical analyzers when given such a
description
Design a state transition diagram that
describes the tokens and write a program
that implements the diagram
Design a state transition diagram that
describes the tokens and hand-construct a
table-driven implementation of the state
diagram
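A minimal sketch of the third approach, a hand-built table-driven implementation. It covers only two token patterns, identifiers and integer literals. The state names, the character classes, and the function recognize() are all assumptions made for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* All names below are hypothetical; they are not taken from the slides. */
enum CharClass { LETTER, DIGIT, OTHER, NUM_CLASSES };
enum State { START, IN_IDENT, IN_INT, REJECT, NUM_STATES };
enum Kind { KIND_IDENT, KIND_INT_LIT, KIND_NONE };

/* Transition table: rows are states, columns are character classes. */
static const enum State next_state[NUM_STATES][NUM_CLASSES] = {
    /*               LETTER    DIGIT     OTHER  */
    /* START    */ { IN_IDENT, IN_INT,   REJECT },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, REJECT },
    /* IN_INT   */ { REJECT,   IN_INT,   REJECT },
    /* REJECT   */ { REJECT,   REJECT,   REJECT }
};

/* Map a character to the class used as the table's column index. */
static enum CharClass classify(char c) {
    if (isalpha((unsigned char)c)) return LETTER;
    if (isdigit((unsigned char)c)) return DIGIT;
    return OTHER;
}

/* Run the whole string through the table and report the state it ends in. */
enum Kind recognize(const char *s) {
    enum State state = START;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        state = next_state[state][classify(*s)];
    }
    if (state == IN_IDENT) return KIND_IDENT;
    if (state == IN_INT) return KIND_INT_LIT;
    return KIND_NONE;
}
```

The driver loop never changes; adding a token pattern means adding rows and columns to the table, which is the main appeal of the table-driven style.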
State (Transition) Diagram Design
A directed graph with nodes labeled with
state names and arcs labeled with input
characters
A diagram with a separate transition for
every possible character would be too
large and complex
Transitions can be combined using character
classes (e.g. LETTER, DIGIT) to simplify
the state diagram
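A sketch of a lexical analyzer that implements such a state diagram directly in code, using the C library's isalpha/isdigit tests as the combined LETTER and DIGIT transitions. The struct Lexer, lexer_init(), lookup(), and the exact token set are assumptions for illustration. Lexemes longer than MAX_LEXEME are simply split, which a real lexer would report as an error.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum Token { IDENT, INT_LIT, ASSIGN_OP, ADD_OP, SUB_OP, MULT_OP, DIV_OP,
             LEFT_PAREN, RIGHT_PAREN, SEMICOLON, END_OF_INPUT, UNKNOWN };

#define MAX_LEXEME 63

/* Hypothetical lexer state: not from the slides. */
struct Lexer {
    const char *pos;              /* next unread character */
    char lexeme[MAX_LEXEME + 1];  /* text of the most recent token */
};

void lexer_init(struct Lexer *lx, const char *input) {
    lx->pos = input;
    lx->lexeme[0] = '\0';
}

/* Map a single-character lexeme to its token. */
static enum Token lookup(char c) {
    switch (c) {
    case '=': return ASSIGN_OP;
    case '+': return ADD_OP;
    case '-': return SUB_OP;
    case '*': return MULT_OP;
    case '/': return DIV_OP;
    case '(': return LEFT_PAREN;
    case ')': return RIGHT_PAREN;
    case ';': return SEMICOLON;
    default:  return UNKNOWN;
    }
}

/* Return the next token and leave its text in lx->lexeme. */
enum Token lex(struct Lexer *lx) {
    size_t len = 0;
    enum Token tok;
    unsigned char c;

    assert(lx != NULL && lx->pos != NULL);
    while (isspace((unsigned char)*lx->pos)) lx->pos++;  /* skip blanks */
    c = (unsigned char)*lx->pos;

    if (c == '\0') {
        tok = END_OF_INPUT;
    } else if (isalpha(c)) {
        /* LETTER, then any number of LETTER or DIGIT */
        while (isalnum((unsigned char)*lx->pos) && len < MAX_LEXEME)
            lx->lexeme[len++] = *lx->pos++;
        tok = IDENT;
    } else if (isdigit(c)) {
        /* DIGIT, then any number of DIGIT */
        while (isdigit((unsigned char)*lx->pos) && len < MAX_LEXEME)
            lx->lexeme[len++] = *lx->pos++;
        tok = INT_LIT;
    } else {
        /* everything else is treated as a one-character token */
        lx->lexeme[len++] = *lx->pos++;
        tok = lookup((char)c);
    }
    lx->lexeme[len] = '\0';
    return tok;
}
```

Calling lex() repeatedly on the earlier statement result = oldsum - value / 100; yields IDENT, ASSIGN_OP, IDENT, SUB_OP, IDENT, DIV_OP, INT_LIT, SEMICOLON, and then END_OF_INPUT.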
The Parsing Problem
Two goals of syntax analysis:
Check the input program for any syntax
errors, produce a diagnostic message if an
error is found, and recover
Produce the parse tree, or at least a trace of
the parse tree, for the program
Two Classes of parsers:
Top-down
Bottom-up
Top-Down Parsers
Trace or build the parse tree in preorder
(this corresponds to a leftmost derivation)
The most common top-down parsing
algorithms:
Recursive descent
LL parsers
Bottom-Up Parsers
Produce the parse tree by beginning at
the leaves and progressing towards the
root
Most common bottom-up parsers are in
the LR family
Complexity of Parsing
Parsing algorithms that work for any
unambiguous grammar are complex and
inefficient: O(n^3), where n is the length of the input
Compilers use parsers that only work for
a subset of all unambiguous grammars,
but do it in linear time: O(n)
Recursive-Descent Parsing
Top-Down Parser
EBNF is ideally suited as the basis for a
recursive-descent parser
Each nonterminal maps to a function
For a nonterminal with more than one RHS,
look at the next token to determine which
RHS to choose
If no RHS matches the next token, it is a syntax error
Recursive-Descent Parsing
Grammar for an expression:
<expr> → <term> {+ <term>}
<term> → <factor> {* <factor>}
<factor> → id | int_constant | ( <expr> )
How do we parse?
Expression: 1 + 2
<expr>
=> <term> + <term>
=> <factor> + <term>
=> 1 + <term>
=> 1 + <factor>
=> 1 + 2
Recursive-Descent Parsing
Grammar for an expression:
<expr> → <term> {+ <term>}
<term> → <factor> {* <factor>}
<factor> → id | int_constant | ( <expr> )
What does code look like?
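/* <expr> → <term> {+ <term>} */
void expr() {
    term();                          /* parse the first <term> */
    while (nextToken == ADD_OP) {    /* zero or more "+ <term>" */
        lex();                       /* consume the + */
        term();
    }
}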
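The slide shows only expr(). A sketch of the complete recognizer for the three-rule grammar could look like the following. The built-in lex(), the errors counter, and the wrapper parse_expr() are assumptions added so the example stands on its own; the slides do not show how lex() or error reporting are implemented.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum Token { ID, INT_CONSTANT, ADD_OP, MULT_OP, LEFT_PAREN, RIGHT_PAREN,
             END_OF_INPUT, UNKNOWN };

static const char *input;     /* remaining source text (hypothetical) */
static enum Token nextToken;  /* one-token lookahead, as on the slide */
static int errors;            /* number of syntax errors seen */

/* Advance nextToken to the next token of the input. */
static void lex(void) {
    while (isspace((unsigned char)*input)) input++;
    if (*input == '\0') {
        nextToken = END_OF_INPUT;
    } else if (isalpha((unsigned char)*input)) {
        while (isalnum((unsigned char)*input)) input++;
        nextToken = ID;
    } else if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input)) input++;
        nextToken = INT_CONSTANT;
    } else {
        switch (*input++) {
        case '+': nextToken = ADD_OP; break;
        case '*': nextToken = MULT_OP; break;
        case '(': nextToken = LEFT_PAREN; break;
        case ')': nextToken = RIGHT_PAREN; break;
        default:  nextToken = UNKNOWN; break;
        }
    }
}

static void expr(void);

/* <factor> -> id | int_constant | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID || nextToken == INT_CONSTANT) {
        lex();
    } else if (nextToken == LEFT_PAREN) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN) lex();
        else errors++;               /* missing ")" */
    } else {
        errors++;                    /* no RHS starts with this token */
    }
}

/* <term> -> <factor> {* <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_OP) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {+ <term>} */
static void expr(void) {
    term();
    while (nextToken == ADD_OP) {
        lex();
        term();
    }
}

/* Returns 1 if src is exactly one <expr>, 0 otherwise. */
int parse_expr(const char *src) {
    assert(src != NULL);
    input = src;
    errors = 0;
    lex();                           /* load the first token */
    expr();
    return errors == 0 && nextToken == END_OF_INPUT;
}
```

Each function mirrors one grammar rule: braces in the EBNF become while loops, and the alternatives in <factor> become an if/else chain keyed on nextToken.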
Recursive-Descent Parsing
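The Left Recursion Problem (LL grammars)
A rule whose RHS begins with its own LHS is left recursive:
<expr> → <expr> + <term>
expr() would call itself before consuming any input and never terminate:
<expr> => <expr> + <term>
=> <expr> + <term> + <term>
=> <expr> + <term> + <term> + <term>
How do we fix it?
Modify grammar to remove left recursion
Before:
<expr> → <expr> + <term> | <term>
After:
<expr> → <term> <expr_rest>
<expr_rest> → + <term> <expr_rest> | ε
<term> → id | int_constant | ( <expr> )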
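One standard way to remove direct left recursion rewrites A → A α | β as A → β A' and A' → α A' | ε. Applied to <expr> → <expr> + <term> | <term>, it gives <expr> → <term> <expr_rest> and <expr_rest> → + <term> <expr_rest> | ε. The sketch below shows how the rewritten rules map to code. To stay short it works on characters directly and treats <term> as a single digit, which is a simplification; accepts() and the other names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *p;   /* next unread character (hypothetical) */
static int ok;          /* cleared when a syntax error is found */

/* <term> -> digit   (simplified stand-in for the real <term>) */
static void term(void) {
    if (isdigit((unsigned char)*p)) p++;
    else ok = 0;
}

/* <expr_rest> -> + <term> <expr_rest> | (empty) */
static void expr_rest(void) {
    if (*p == '+') {
        p++;            /* input is consumed before the recursive call */
        term();
        expr_rest();
    }
    /* any other character: choose the empty alternative */
}

/* <expr> -> <term> <expr_rest> */
static void expr(void) {
    term();
    expr_rest();
}

/* Returns 1 if src is digit { + digit }, 0 otherwise. */
int accepts(const char *src) {
    assert(src != NULL);
    p = src;
    ok = 1;
    expr();
    return ok && *p == '\0';
}
```

Because expr_rest() recurses only after consuming a +, every recursive call makes progress, which is exactly what the left-recursive version could not guarantee.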
Recursive-Descent Parsing
The Pairwise Disjointness Problem
If the grammar is not pairwise disjoint, how do you know
which RHS to pick based on the next token?
<variable> → identifier | identifier [<expr>]
(both RHSs begin with identifier)
How do we fix it?
Left Factoring
<variable> → identifier <new>
<new> → ε | [<expr>]
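After left factoring, one token of lookahead is enough: once the identifier has been consumed, a [ selects the subscript alternative and anything else selects the empty one. A sketch under simplifying assumptions: it works on characters rather than tokens, and <expr> is reduced to an integer literal or a nested <variable>. The names new_part() and is_variable() are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *p;   /* next unread character (hypothetical) */
static int ok;          /* cleared when a syntax error is found */

static void variable(void);

/* Stand-in for <expr>: an integer literal or a <variable>. */
static void expr(void) {
    if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
    } else {
        variable();
    }
}

/* <new> -> (empty) | [ <expr> ] */
static void new_part(void) {
    if (*p == '[') {    /* only this alternative can start with '[' */
        p++;
        expr();
        if (*p == ']') p++;
        else ok = 0;    /* missing "]" */
    }
    /* any other character: choose the empty alternative */
}

/* <variable> -> identifier <new> */
static void variable(void) {
    if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) p++;   /* the identifier */
        new_part();
    } else {
        ok = 0;
    }
}

/* Returns 1 if src is exactly one <variable>, 0 otherwise. */
int is_variable(const char *src) {
    assert(src != NULL);
    p = src;
    ok = 1;
    variable();
    return ok && *p == '\0';
}
```

With the original, unfactored rule, variable() would have had to choose between two RHSs that both start with an identifier before seeing whether a [ follows.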
Bottom-Up Parsing
Parsing is based on reduction
Reverse of a rightmost derivation
At each step, find the correct RHS (the
handle) and reduce it to its LHS, giving
the previous step in the derivation
Example Grammar
<S> → <A>b
<A> → a
<A> → b
Input: ab
Step 1: <A>b (reduce a to <A>)
Step 2: <S> (reduce <A>b to <S>)
Bottom-Up Parsing
Most bottom-up parsers are shift-reduce
algorithms
Shift – move token onto the stack
Reduce – replace RHS with LHS
Bottom-Up Parsing
Handles
Def: β is the handle of the right sentential
form γ = αβw
if and only if S =>*rm αAw =>rm αβw
The handle of a right sentential form is its
leftmost simple phrase
Bottom-Up Parsing is essentially looking for
handles and replacing them with their LHS
Bottom-Up Parsing
Advantages of LR (Shift-Reduce) Parsers
They can be built for all programming
languages
They can detect syntax errors as soon as it
is possible in a left-to-right scan
The LR class of grammars is a proper
superset of the class parsable by LL parsers
(for example, many left recursive grammars
are LR, but none are LL)
Bottom-Up Parsing
Structure of a Shift-Reduce (LR) Parser
Input Sequence – input to be parsed
Parse Stack – holds grammar symbols and state
symbols; input tokens are shifted onto it
ACTION Table – what the parser does for a
given state and next input token
GOTO Table – holds state symbols to be
pushed onto the stack when a reduction is
completed
Bottom-Up Parsing
ACTION Table (or Parse Table)
Rows = State Symbols
Columns = Terminal symbols
Values
Shift – push token on stack
Reduce – replace handle with LHS
Accept – stack only has start symbol and
input is empty
Error – original input is invalid
Bottom-Up Parsing
GOTO Table (or Parse Table)
Rows = State Symbols
Columns = Nonterminal Symbols
Values indicate which state symbol
should be pushed onto the parse stack
after a reduction has been completed
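A sketch of a table-driven shift-reduce parser for the earlier example grammar (<S> → <A>b, <A> → a, <A> → b). The slides do not give tables for this grammar, so the six states and the ACTION and GOTO entries below were worked out by hand for this example and should be read as an assumption. To keep it short, the stack holds state numbers only, not grammar symbols, and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rules, numbered from 1:
 *   1. <S> -> <A> b      2. <A> -> a      3. <A> -> b          */
enum { T_A, T_B, T_END, NUM_TERMINALS };      /* a, b, end of input */
enum { NT_S, NT_A, NUM_NONTERMINALS };
enum { NUM_STATES = 6, MAX_STACK = 16 };

enum ActionKind { ACT_ERROR, ACT_SHIFT, ACT_REDUCE, ACT_ACCEPT };
struct Action { enum ActionKind kind; int arg; };  /* arg: state or rule */

/* ACTION table: rows are states, columns are terminals. */
static const struct Action action[NUM_STATES][NUM_TERMINALS] = {
    /*            a                b                 end            */
    /* 0 */ { {ACT_SHIFT, 3}, {ACT_SHIFT, 4},  {ACT_ERROR, 0}  },
    /* 1 */ { {ACT_ERROR, 0}, {ACT_ERROR, 0},  {ACT_ACCEPT, 0} },
    /* 2 */ { {ACT_ERROR, 0}, {ACT_SHIFT, 5},  {ACT_ERROR, 0}  },
    /* 3 */ { {ACT_ERROR, 0}, {ACT_REDUCE, 2}, {ACT_ERROR, 0}  },
    /* 4 */ { {ACT_ERROR, 0}, {ACT_REDUCE, 3}, {ACT_ERROR, 0}  },
    /* 5 */ { {ACT_ERROR, 0}, {ACT_ERROR, 0},  {ACT_REDUCE, 1} }
};

/* GOTO table: rows are states, columns are nonterminals; -1 = empty. */
static const int goto_table[NUM_STATES][NUM_NONTERMINALS] = {
    {  1,  2 }, { -1, -1 }, { -1, -1 },
    { -1, -1 }, { -1, -1 }, { -1, -1 }
};

static const int rule_lhs[] = { -1, NT_S, NT_A, NT_A };  /* LHS of each rule */
static const int rule_len[] = {  0, 2,    1,    1    };  /* RHS length */

static int terminal_of(char c) {
    if (c == 'a') return T_A;
    if (c == 'b') return T_B;
    if (c == '\0') return T_END;
    return -1;                       /* not in the alphabet */
}

/* Returns 1 if input is a sentence of the grammar, 0 otherwise. */
int lr_parse(const char *input) {
    int stack[MAX_STACK];
    int top = 0;
    assert(input != NULL);
    stack[0] = 0;                    /* start state */
    for (;;) {
        int t = terminal_of(*input);
        struct Action act;
        if (t < 0) return 0;
        act = action[stack[top]][t];
        switch (act.kind) {
        case ACT_SHIFT:              /* push new state, consume token */
            if (top + 1 >= MAX_STACK) return 0;
            stack[++top] = act.arg;
            input++;
            break;
        case ACT_REDUCE: {           /* pop the handle, push GOTO state */
            int next;
            top -= rule_len[act.arg];
            assert(top >= 0);
            next = goto_table[stack[top]][rule_lhs[act.arg]];
            assert(next >= 0);
            stack[++top] = next;
            break;
        }
        case ACT_ACCEPT:
            return 1;
        default:
            return 0;                /* ACT_ERROR */
        }
    }
}
```

For the input ab the parser shifts a, reduces it to <A>, shifts b, reduces <A>b to <S>, and accepts, matching the two reduction steps of the example.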