Lecture Data Structures and Practise

Lecture 4
Concepts of Programming Languages
Arne Kutzner
Hanyang University / Seoul Korea
Topics
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-Up Parsing
Concepts of Programming Languages
L4.2
Introduction
• Language implementation systems
must analyze source code, regardless
of the specific implementation approach
• Nearly all syntax analysis is based on a
formal description of the syntax of the
source language (BNF)
Syntax Analysis
• The syntax analysis portion of a language
processor nearly always consists of two parts:
– A low-level part called a lexical analyzer
(mathematically, a finite automaton based on a
regular grammar)
– A high-level part called a syntax analyzer, or
parser
(mathematically, a push-down automaton based
on a context-free grammar, or BNF)
Advantages of Using CFG/BNF
to Describe Syntax
• Provides a clear and concise syntax
description
• The parser can be constructed on the
basis of the CFG/BNF description
Lexical Analysis
• A lexical analyzer is a “front-end” for the
parser
– pattern matcher for character strings
• Identifies substrings of the source program
that belong together - lexemes
– Lexemes match a character pattern, which is
associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
Reasons to Separate Lexical and
Syntax Analysis
• Simplicity - less complex approaches
can be used for lexical analysis
(no need for the use of grammars for
token extraction)
• Efficiency - separation allows significantly
less complex parsers
First, we need some theory…
Regular Expressions
• Given a finite alphabet Σ, the following
constants are defined as regular
expressions:
– (empty set) Ø denoting the set Ø.
– (empty string) ε denoting the set
containing only the "empty" string, which
has no characters at all.
– (literal character) a in Σ denoting the set
containing only the character a.
Regular Expressions (cont.)
• Given regular expressions R and S, the following
operations over them are defined to produce
regular expressions:
1. (concatenation) RS denotes the set of strings that can
be obtained by concatenating a string in R and a string
in S.
For example {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd",
"cef"}.
2. (alternation) R | S denotes the set union of sets
described by R and S.
For example, if R describes {"ab", "c"} and S describes
{"ab", "d", "ef"}, expression R | S describes {"ab", "c",
"d", "ef"}.
– Alternation is sometimes denoted by +
Regular Expressions (cont.)
3. (Kleene star) R* denotes the smallest
superset of the set described by R that
contains ε and is closed under string
concatenation. This is the set of all strings
that can be made by concatenating any
finite number (including zero) of strings from
the set described by R.
For example, {"0","1"}* is the set of all finite
binary strings (including the empty string),
and {"ab", "c"}* = {ε, "ab", "c", "abab", "abc",
"cab", "cc", "ababab", "abcab", ... }.
Regular Languages
• The collection of regular languages over an
alphabet Σ is defined recursively as follows:
1. The empty language Ø is a regular language.
2. For each a ∈ Σ (a belongs to Σ), the singleton
language {a} is a regular language.
3. If A and B are regular languages, then A ∪ B
(union), A • B (concatenation), and A* (Kleene
star) are regular languages.
4. No other languages over Σ are regular.
Regular Expressions and
Regular Languages
• The family of languages defined by regular
expressions is exactly the family of regular languages.
• Regular expressions can be used for
lexeme/token description/specification. E.g.:
description of a token Identifier as
Letter (Digit | Letter)*
• Regular expressions are generators like
grammars
– In fact, you can describe every regular
expression by means of a grammar
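The identifier pattern Letter (Digit | Letter)* can be checked directly with a small loop. The following is a minimal sketch, assuming that "Letter" and "Digit" mean the classes tested by the standard isalpha and isdigit functions; the name is_identifier is hypothetical and not part of the lecture.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if s matches Letter (Digit | Letter)*, otherwise 0.
   Assumption: Letter/Digit are the classes tested by isalpha/isdigit. */
int is_identifier(const char *s) {
    if (s == NULL || !isalpha((unsigned char)s[0]))
        return 0;                      /* must start with a Letter */
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (!isalnum((unsigned char)s[i]))
            return 0;                  /* only Letter or Digit may follow */
    }
    return 1;
}
```

With this sketch, "sum" and "x1y2" are accepted, while "1x" and the empty string are rejected.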
Examples of regular expressions
• What are the words of the following
expressions?
– (0 | 1)(00 | 01 | 10 | 11)*
– (0 | 1)(0 | 1)(0 | 1)(0 | 1)(0 | 1)
• Are the languages of the following three
expressions equal?
– 0* 1 (0 | 1)*
– (0 | 1)* 1 (0 | 1)*
– (0 | 1)* 1 0*
Recognizer for Regular
Expressions
• A deterministic finite automaton
(DFA) M is a 5-tuple, (Q, Σ, δ, q0, F),
consisting of
1. a finite set of states (Q)
2. a finite set of input symbols called the
alphabet (Σ)
3. a transition function (δ : Q × Σ → Q)
4. a start state (q0 ∈ Q)
5. a set of accept states (F ⊆ Q)
Language accepted by a DFA
• Let w = a_1 a_2 ... a_n be a string over the
alphabet Σ. The automaton M accepts
the string w if a sequence of states,
r_0, r_1, ..., r_n, exists in Q with the
following conditions:
1. r_0 = q0
2. r_{i+1} = δ(r_i, a_{i+1}), for i = 0, ..., n−1
3. r_n ∈ F.
DFA Example
• M = (Q, Σ, δ, q0, F) where
– Q = {S1, S2},
– Σ = {0, 1},
– q0 = S1,
– F = {S1},
– δ is the following
state transition table (reconstructed from the
language stated for M):
        0    1
   S1   S2   S1
   S2   S1   S2
[Figure: corresponding state diagram for M]
DFA Example (cont.)
• The language recognized by M is the
regular language given by the regular
expression 1*( 0 1* 0 1* )*
– The accepted language consists of all
words that contain an even number of 0s.
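A DFA like M can be simulated with a transition table and a single loop over the input. The sketch below assumes the transition table implied by the stated language (reading 0 switches between S1 and S2, reading 1 keeps the state); the function name dfa_accepts is hypothetical.

```c
#include <stddef.h>

enum { S1 = 0, S2 = 1 };

/* delta[state][symbol]; assumed table for "even number of 0s" */
static const int delta[2][2] = {
    /* S1 */ { S2, S1 },   /* on 0 go to S2, on 1 stay in S1 */
    /* S2 */ { S1, S2 },   /* on 0 go to S1, on 1 stay in S2 */
};

/* Returns 1 if the DFA accepts w, otherwise 0. */
int dfa_accepts(const char *w) {
    int state = S1;                          /* start state q0 = S1 */
    for (size_t i = 0; w[i] != '\0'; i++) {
        if (w[i] != '0' && w[i] != '1')
            return 0;                        /* symbol outside the alphabet */
        state = delta[state][w[i] - '0'];    /* r_{i+1} = delta(r_i, a_{i+1}) */
    }
    return state == S1;                      /* F = {S1} */
}
```

The loop mirrors the acceptance condition: start in q0, apply δ once per input symbol, and accept if the final state is in F.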
Kleene’s Theorem
• Part 1: If R is a regular expression over the
alphabet Σ, and L is the language in Σ*
corresponding to R, then there is a
(deterministic) finite automaton M recognizing
L.
• Part 2: If M = (Q, Σ, δ, q0, F) is a
(deterministic) finite automaton recognizing
the language L, then there is a regular
expression over Σ corresponding to L.
• So, DFAs recognize exactly the set of
regular languages/expressions.
Limitations of regular languages
• There is no regular expression for the
language 1^n 0^n, n ≥ 0 (n ones followed by n
zeros)
– But you can easily give a CFG for the above
language:
<A> -> 1 <A> 0 | ε
• Other example: Dyck language; balanced
strings of parentheses (e.g. [ [ ] [ [ ] ] ] [ ] )
– Grammar ? (-> Exercise)
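The grammar <A> -> 1 <A> 0 | ε translates almost literally into a recursive function, which shows why the nesting is no problem for a CFG. This is a minimal sketch; the names parse_A and in_language are hypothetical.

```c
#include <stddef.h>

/* Parses one <A> starting at s[*pos] and advances *pos.
   Returns 1 on success, 0 on failure. */
static int parse_A(const char *s, size_t *pos) {
    if (s[*pos] == '1') {              /* choose <A> -> 1 <A> 0 */
        (*pos)++;
        if (!parse_A(s, pos))
            return 0;
        if (s[*pos] != '0')
            return 0;                  /* the matching 0 is missing */
        (*pos)++;
        return 1;
    }
    return 1;                          /* choose <A> -> ε */
}

/* Returns 1 if s has the form 1^n 0^n with n >= 0, otherwise 0. */
int in_language(const char *s) {
    size_t pos = 0;
    return parse_A(s, &pos) && s[pos] == '\0';
}
```

The recursion depth plays the role of the counter that a finite automaton cannot provide.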
Practical implementation of
lexical analyzers
• DFAs and regular expressions are the
foundations of lexical analyzer construction
• Possible approaches for implementing a lexical
analyzer:
– Write a formal description of the tokens and use a
software tool that constructs table-driven lexical
analyzers given such a description
– Design a state diagram that describes the tokens
and write a program that implements the state
diagram
– Design a state diagram that describes the tokens
and hand-construct a table-driven implementation of
the state diagram
Lexical Analysis (cont.)
• In many cases, symbols of transitions
are “combined/grouped” in order to
simplify the state diagram
– When recognizing an identifier, all
uppercase and lowercase letters are
equivalent
• Use a character class that includes all letters
– When recognizing an integer literal, all
digits are equivalent - use a digit class
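Grouping symbols into character classes can be sketched as a single classification function. LETTER and DIGIT are the class names used in the lecture's example program; UNKNOWN, END_OF_INPUT and the function name classify are assumptions made for this sketch.

```c
#include <ctype.h>
#include <stdio.h>   /* EOF */

enum CharClass { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };

/* Maps a character (as returned by getc) to its character class, so
   that the state diagram needs one transition per class instead of
   one transition per character. */
enum CharClass classify(int c) {
    if (c == EOF)
        return END_OF_INPUT;
    if (isalpha(c))
        return LETTER;       /* all upper- and lowercase letters */
    if (isdigit(c))
        return DIGIT;        /* all digits 0..9 */
    return UNKNOWN;          /* everything else, e.g. operators */
}
```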
Lexical Analysis (cont.)
• Reserved words can be recognized in
the context of identifier recognition
– Use a table lookup to determine whether a
possible identifier is in fact a reserved word
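A table lookup for reserved words might look as follows. The reserved words and the token codes other than IDENT are made up for illustration; the lecture does not specify them.

```c
#include <stddef.h>
#include <string.h>

/* Token codes; all but IDENT are hypothetical. */
enum { IDENT = 1, FOR_CODE, IF_CODE, ELSE_CODE, WHILE_CODE };

/* Hypothetical reserved-word table. */
static const struct { const char *word; int code; } reserved[] = {
    { "for", FOR_CODE },   { "if", IF_CODE },
    { "else", ELSE_CODE }, { "while", WHILE_CODE },
};

/* Returns the code of the reserved word stored in lexeme, or IDENT
   if lexeme is an ordinary identifier. */
int lookup(const char *lexeme) {
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++) {
        if (strcmp(lexeme, reserved[i].word) == 0)
            return reserved[i].code;
    }
    return IDENT;
}
```

A linear search is enough for a handful of words; a larger table would more likely use a hash table or a sorted array.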
Lexical Analysis (cont.)
Example Program …
• The proposed lexical analyzer is a function
that is called by the parser whenever it
requests a fresh token/lexeme
• Utility subprograms:
– getChar - gets the next character of input, puts it
in nextChar, determines its class and puts the
class in charClass
– addChar - puts the character from nextChar into
the place the lexeme is being accumulated,
lexeme
– lookup - determines whether the string in lexeme
is a reserved word (returns a code)
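The utility subprograms could be sketched as below. The sketch assumes that the source text is held in a string in memory rather than read from a file; setInput, UNKNOWN, END_OF_INPUT and MAX_LEXEME are hypothetical names, and getNonBlank is the whitespace-skipping helper called by the example program.

```c
#include <ctype.h>
#include <stddef.h>

#define LETTER        0
#define DIGIT         1
#define UNKNOWN       99     /* hypothetical class for other characters */
#define END_OF_INPUT  (-1)   /* hypothetical class for "no more input" */
#define MAX_LEXEME    100

static const char *input = "";   /* assumption: source text in memory */
int  charClass;
char nextChar;
char lexeme[MAX_LEXEME];
int  lexLen;

void setInput(const char *text) { input = text; }

/* Gets the next character, puts it in nextChar and its class in charClass */
void getChar(void) {
    if (*input == '\0') {
        nextChar = '\0';
        charClass = END_OF_INPUT;
        return;
    }
    nextChar = *input++;
    if (isalpha((unsigned char)nextChar))      charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else                                       charClass = UNKNOWN;
}

/* Appends nextChar to lexeme; characters beyond the buffer are dropped */
void addChar(void) {
    if (lexLen < MAX_LEXEME - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* Skips whitespace until nextChar is a non-blank character */
void getNonBlank(void) {
    while (charClass != END_OF_INPUT && isspace((unsigned char)nextChar))
        getChar();
}
```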
State diagram for recognizing
identifiers and integer numbers
Lexical Analysis / Example Prg.
int lex() {
    static int first = 1;
    lexLen = 0;
    /* If it is the first call to lex, initialize by calling
       getChar */
    if (first) {
        getChar();
        first = 0;
    }
    getNonBlank();
    switch (charClass) {
    /* Parse identifiers and reserved words */
    case LETTER:
        addChar();
        getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar();
            getChar();
        }
        return lookup(lexeme);
    /* Parse integer literals */
    case DIGIT:
        addChar();
        getChar();
        while (charClass == DIGIT) {
            addChar();
            getChar();
        }
        return INT_LIT;
    /* Further character classes are elided on the slides */
    …
    } /* End of switch */
} /* End of function lex */
Parsing Problem…
The Parsing Problem
• Goals of the parser, given an input
program:
– Produce a parse tree
– Find all syntax errors; for each, produce an
appropriate diagnostic message and
recover quickly
The Parsing Problem (cont.)
• Two categories of parsers
– Top down - produce the parse tree, beginning at
the root
• Order is that of a leftmost derivation
• Traces or builds the parse tree in preorder
– Bottom up - produce the parse tree, beginning at
the leaves
• Order is that of the reverse of a rightmost derivation
• Useful parsers look only one token ahead in
the input
The Parsing Problem (cont.)
• Top-down Parsers
– Given a sentential form, xAα, the parser must
choose the correct A-rule to get the next sentential
form in the leftmost derivation, using only the
first token produced by A
• The most common top-down parsing
algorithms:
– Recursive descent - a coded implementation
– LL parsers - table driven implementation
The Parsing Problem (cont.)
• Bottom-up parsers
– Special form of push down automata
• Given a right sentential form, α, determine what
substring of α is the right-hand side of the rule
in the grammar that must be reduced to
produce the previous sentential form in the
rightmost derivation
– The most common bottom-up parsing
algorithms are in the LR family
Recursive-Descent Parsing
• Approach - Coded parser:
– Subprogram for each nonterminal in the
grammar, which can parse sentences that
can be generated by that nonterminal
– EBNF well suited for being the basis of a
recursive-descent parser, because EBNF
minimizes the number of nonterminals
Recursive-Descent Parsing
(cont.)
• A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )
Recursive-Descent Parsing
(cont.)
• Assume we have a lexical analyzer named
lex, which puts the next token code in
nextToken
• The coding process when there is only one
RHS:
– For each terminal symbol in the RHS, compare it
with the next input token; if they match, continue,
else there is an error
– For each nonterminal symbol in the RHS, call its
associated parsing subprogram
Recursive-Descent Parsing
(cont.)
/* Function expr
   Parses strings in the language
   generated by the rule:
   <expr> → <term> {(+ | -) <term>}
*/
void expr() {
    /* Parse the first term */
    term();
    /* As long as the next token is + or -, call
       lex to get the next token, and parse the
       next term */
    while (nextToken == PLUS_CODE ||
           nextToken == MINUS_CODE) {
        lex();
        term();
    }
}
• This particular routine does not detect errors
• Convention: Every parsing routine leaves the
next token in nextToken
Recursive-Descent Parsing
(cont.)
• A nonterminal that has more than one
RHS requires an initial process to
determine which RHS it is to parse
– The correct RHS is chosen on the basis of
the next token of input (the lookahead)
– The next token is compared with the first
token that can be generated by each RHS
until a match is found
– If no match is found, it is a syntax error
Recursive-Descent Parsing
(cont.)
/* Function factor
   Parses strings in the language
   generated by the rule:
   <factor> -> id | (<expr>) */
void factor() {
    /* Determine which RHS */
    if (nextToken == ID_CODE)
        /* For the RHS id, just call lex */
        lex();
    /* If the RHS is (<expr>), call lex to pass
       over the left parenthesis, call expr, and
       check for the right parenthesis */
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE)
            lex();
        else
            error();
    } /* End of else if (nextToken == ... */
    else
        error(); /* Neither RHS matches */
}
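Putting the routines together, a complete recognizer for the expression grammar can be sketched as follows. The routine term is not shown on the slides and is written here by analogy with expr; the tiny lex, the error counter, the function parse and the codes MULT_CODE, DIV_CODE, END_CODE and BAD_CODE are assumptions made so that the example is self-contained.

```c
#include <ctype.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *src;   /* remaining input (assumption: a string) */
static int nextToken;     /* current lookahead token */
static int errors;        /* number of syntax errors found */

static void expr(void);

/* Minimal stand-in for the lexical analyzer */
static void lex(void) {
    while (isspace((unsigned char)*src)) src++;
    if (*src == '\0') { nextToken = END_CODE; return; }
    if (isalpha((unsigned char)*src)) {          /* Letter (Digit | Letter)* */
        while (isalnum((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    switch (*src++) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

static void error(void) { errors++; }

/* <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE) {
        lex();
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    } else {
        error();                                 /* neither RHS matches */
    }
}

/* <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Returns 1 if text is a sentence of <expr>, otherwise 0 */
int parse(const char *text) {
    src = text;
    errors = 0;
    lex();                 /* load the first lookahead token */
    expr();
    return errors == 0 && nextToken == END_CODE;
}
```

For example, parse("a + b * (c - d)") succeeds, while parse("a + ") and parse("(a") fail. The sketch only recognizes sentences; it does not build a parse tree.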
Recursive-Descent Parsing
(cont.)
• The Left Recursion Problem:
If a grammar contains left recursion,
either direct or indirect, it cannot be the
basis of a top-down (recursive-descent)
parser
– A grammar can be modified so that it becomes
free of left recursion
• LL grammar class:
LL grammars contain no left recursion
(a necessary, but not sufficient, condition)
Elimination of left recursion
• Direct recursion:
For each nonterminal A,
1. Group the A-rules as A → Aα_1 | … | Aα_m | β_1 | β_2 | … | β_n
where none of the β's begins with A
2. Replace the original A-rules with
A → β_1 A' | β_2 A' | … | β_n A'
A' → α_1 A' | α_2 A' | … | α_m A' | ε
• Indirect recursion:
– See separate PDF document
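As a worked example of the direct case, consider a left-recursive expression rule. The rule below is chosen for illustration and is not taken from the slides; it has two recursive alternatives and one non-recursive alternative.

```latex
\begin{align*}
\text{Original:}\quad & E \to E + T \mid E - T \mid T
  && (\alpha_1 = {+}\,T,\ \alpha_2 = {-}\,T,\ \beta_1 = T) \\
\text{Result:}\quad & E \to T\,E' \\
 & E' \to {+}\,T\,E' \mid {-}\,T\,E' \mid \varepsilon
\end{align*}
```

Both grammars generate the same strings (a T followed by any number of "+ T" or "- T"), but the result no longer has a rule whose RHS starts with its own LHS.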
Recursive-Descent Parsing
(cont.)
• The other characteristic of grammars
that disallows top-down parsing is the
lack of pairwise disjointness
– The inability to determine the correct RHS
on the basis of one token of lookahead
– Def: FIRST(α) = {a | α =>* aβ}
(If α =>* ε, ε is in FIRST(α))
Recursive-Descent Parsing
(cont.)
• Pairwise Disjointness Test:
– For each nonterminal, A, in the grammar that has
more than one RHS, for each pair of rules, A → α_i
and A → α_j, it must be true that
FIRST(α_i) ∩ FIRST(α_j) = ∅
• Examples:
A → a | bB | cAb (passes the test)
A → a | aB (fails the test)
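For rules like the two examples, where every alternative begins with a terminal, the test reduces to comparing first characters. The sketch below relies on exactly that simplifying assumption (FIRST of an alternative is the set containing its first symbol); a general implementation would have to compute FIRST sets for nonterminals as well. The function name pairwise_disjoint is hypothetical.

```c
#include <stddef.h>

/* alts[0..n-1] are the right-hand sides of one nonterminal, written as
   strings. Assumption: each alternative starts with a terminal, so
   FIRST(alts[i]) = { alts[i][0] }.
   Returns 1 if all FIRST sets are pairwise disjoint, otherwise 0. */
int pairwise_disjoint(const char *const alts[], size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (alts[i][0] == alts[j][0])
                return 0;      /* two alternatives share a first token */
        }
    }
    return 1;
}
```

Applied to the alternatives a, bB, cAb the function returns 1; applied to a, aB it returns 0, because both alternatives start with a.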
Recursive-Descent Parsing
(cont.)
• Left factoring can be used to resolve a lack of
pairwise disjointness.
• Example:
<variable> → ident | ident '(' <expression> ')'
left factor to:
<variable> → ident <new>
<new> → ε | '(' <expression> ')'
or in EBNF:
<variable> → ident [ '(' <expression> ')' ]
• Problem with the first transformation:
Introduction of an ε-rule. (Troublemaker in the context of the
elimination of left recursion)
Bottom-up Parsing
• The parsing problem is finding the
correct RHS in a right-sentential form to
reduce to get the previous right-sentential
form in the derivation
• Bottom-up parsers represent an
extended form of pushdown
automata.
Definition Pushdown Automaton
A PDA is formally defined as a 7-tuple
(Q, Σ, Γ, δ, q0, Z, F), where
1. Q is a finite set of states
2. Σ is a finite set which is called the input alphabet
3. Γ is a finite set which is called the stack alphabet
4. δ : Q × (Σ ∪ {ε}) × Γ → Q × Γ* is the transition function
5. q0 ∈ Q is the start state
6. Z ∈ Γ is the initial stack symbol
7. F ⊆ Q is the set of accepting states
PDA computation
• Assume δ of M maps (p,a,A) to (q,α) and that
M is
– in state p∈Q,
– with a ∈ (Σ ∪ {ε}) on input
– and A∈ Γ as topmost stack symbol,
Then M performs the following actions:
– may read a (move one position right on input)
– change the state to q
– pop A, replacing it by α
• IMPORTANT:
The (Σ ∪ {ε}) component of the transition relation is used
to formalize that the PDA can either read a letter from the
input, or proceed leaving the input untouched.
PDA computation graphically
Example PDA
M=(Q, Σ, Γ, δ, p, Z, F), where
1. states: Q = { p,q,r }
2. input alphabet: Σ = {0, 1}
3. stack alphabet: Γ = {A, Z}
4. start state: q0 = p
5. start stack symbol: Z
6. accepting states: F = {r}
Move number   State   Input   Stack symbol   Moves
1             p       0       Z              p, AZ
2             p       0       A              p, AA
3             p       ε       Z              q, Z
4             p       ε       A              q, A
5             q       1       A              q, ε
6             q       ε       Z              r, Z
Language of example PDA
• PDA for language {0^n 1^n | n ≥ 0}
• Corresponding grammar:
<A> -> 0 <A> 1 | ε
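The six moves of the example PDA can be simulated with an explicit stack. The PDA is nondeterministic (in state p it may either read a 0 or take an ε-move), so the sketch below fixes one strategy: it reads input whenever a reading move applies and takes an ε-move otherwise. For this particular PDA that strategy finds an accepting run whenever one exists. The function name pda_accepts and the stack limit MAX_STACK are assumptions.

```c
#include <stddef.h>

#define MAX_STACK 256   /* sketch limit: longer runs of 0s are rejected */

/* Returns 1 if the example PDA accepts w (state r reached and all
   input consumed), otherwise 0. */
int pda_accepts(const char *w) {
    char stack[MAX_STACK];
    size_t top = 0;            /* number of symbols on the stack */
    char state = 'p';          /* start state */
    size_t i = 0;              /* position in the input */

    stack[top++] = 'Z';        /* initial stack symbol */
    for (;;) {
        char a = w[i];             /* next input symbol, '\0' at the end */
        char A = stack[top - 1];   /* topmost stack symbol */
        if (state == 'p' && a == '0') {
            /* moves 1 and 2: read 0, replace the top X by AX (push A) */
            if (top == MAX_STACK) return 0;
            stack[top++] = 'A';
            i++;
        } else if (state == 'p') {
            /* moves 3 and 4: ε-move to q, stack unchanged */
            state = 'q';
        } else if (state == 'q' && a == '1' && A == 'A') {
            /* move 5: read 1, pop A */
            top--;
            i++;
        } else if (state == 'q' && A == 'Z') {
            /* move 6: ε-move to the accepting state r */
            state = 'r';
        } else {
            break;                 /* no move applicable */
        }
    }
    return state == 'r' && w[i] == '\0';
}
```

The stack height counts the 0s read so far; each 1 removes one A, and the bottom symbol Z becomes visible again exactly when the counts match.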
Important Lemmas
• For every context-free grammar G there is a pushdown
automaton M, so that the language generated
by G is recognized by the automaton M.
• For every PDA M there is a context-free grammar G, so that
the language recognized by M is generated by
the grammar G.
• PDAs and context-free grammars are equivalent
concepts with respect to the languages they
recognize/generate.
Bottom-up Parsing / Handles
• Definitions of Handle / Phrase / Simple
Phrase:
– β is the handle of the right sentential form γ =
αβw if and only if S =>*rm αAw =>rm αβw
– β is a phrase of the right sentential form γ if
and only if S =>* γ = α_1 A α_2 =>+ α_1 β α_2
– β is a simple phrase of the right sentential
form γ if and only if S =>* γ = α_1 A α_2 =>
α_1 β α_2
Bottom-up Parsing (cont.)
• Shift-Reduce Algorithms
– Reduce is the action of replacing the
handle on the top of the parse stack with
its corresponding LHS
– Shift is the action of moving the next token
to the top of the parse stack
Bottom-up Parsing (cont.)
• Advantages of LR parsers:
– They will work for nearly all grammars that
describe programming languages.
– They can detect syntax errors as soon as it
is possible.
– The LR class of grammars is a superset of
the class parsable by LL parsers.
Bottom-up Parsing (cont.)
• LR parsers must be constructed with a tool
• Knuth’s insight: A bottom-up parser could use
the entire history of the parse, up to the
current point, to make parsing decisions
– There were only a finite and relatively small
number of different parse situations that could
have occurred, so the history could be stored in a
parser state, on the parse stack
Bottom-up Parsing (cont.)
• An LR configuration stores the state of
an LR parser
(S_0 X_1 S_1 X_2 S_2 … X_m S_m, a_i a_{i+1} … a_n $)
Bottom-up Parsing (cont.)
• LR parsers are table driven, where the table
has two components, an ACTION table and a
GOTO table
– The ACTION table specifies the action of the
parser, given the parser state and the next token
• Rows are state names; columns are terminals
– The GOTO table specifies which state to put on
top of the parse stack after a reduction action is
done
• Rows are state names; columns are nonterminals
Structure of An LR Parser
Bottom-up Parsing (cont.)
• Initial configuration: (S_0, a_1 … a_n $)
• Parser actions:
– If ACTION[S_m, a_i] = Shift S, the next configuration
is:
(S_0 X_1 S_1 X_2 S_2 … X_m S_m a_i S, a_{i+1} … a_n $)
– If ACTION[S_m, a_i] = Reduce A → β and
S = GOTO[S_{m-r}, A], where r = the length of β, the
next configuration is
(S_0 X_1 S_1 X_2 S_2 … X_{m-r} S_{m-r} A S, a_i a_{i+1} … a_n $)
Bottom-up Parsing (cont.)
• Parser actions (continued):
– If ACTION[S_m, a_i] = Accept, the parse is
complete and no errors were found.
– If ACTION[S_m, a_i] = Error, the parser calls
an error-handling routine.
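The four parser actions can be sketched as a table-driven loop. The grammar (1: S → ( S ), 2: S → x) and its hand-built SLR table are made up for this illustration and are not the table from the lecture; all identifiers are hypothetical. As a simplification, the stack stores only states, since the grammar symbols X_1 … X_m are not needed for parsing decisions.

```c
#include <stddef.h>

enum { T_LP, T_RP, T_X, T_END, NUM_TERMS, T_BAD = -1 };
enum { ACT_SHIFT, ACT_REDUCE, ACT_ACCEPT, ACT_ERROR };

typedef struct { int kind; int arg; } Action;  /* arg: state or rule number */

#define SH(n) { ACT_SHIFT, n }
#define RE(n) { ACT_REDUCE, n }
#define ACC   { ACT_ACCEPT, 0 }
#define ERR   { ACT_ERROR, 0 }

/* ACTION[state][terminal] for the rules 1: S -> ( S ) and 2: S -> x */
static const Action ACTION[6][NUM_TERMS] = {
    /*         (      )      x      $    */
    /* 0 */ { SH(2), ERR,   SH(3), ERR   },
    /* 1 */ { ERR,   ERR,   ERR,   ACC   },
    /* 2 */ { SH(2), ERR,   SH(3), ERR   },
    /* 3 */ { ERR,   RE(2), ERR,   RE(2) },
    /* 4 */ { ERR,   SH(5), ERR,   ERR   },
    /* 5 */ { ERR,   RE(1), ERR,   RE(1) },
};

/* GOTO[state, S] for the only nonterminal S; -1 means "no entry" */
static const int GOTO_S[6] = { 1, -1, 4, -1, -1, -1 };

/* Length r of the RHS of each rule (index = rule number) */
static const int RHS_LEN[3] = { 0, 3, 1 };

static int terminal(char c) {
    switch (c) {
    case '(':  return T_LP;
    case ')':  return T_RP;
    case 'x':  return T_X;
    case '\0': return T_END;      /* end marker $ */
    default:   return T_BAD;
    }
}

/* Returns 1 if input is a sentence of the grammar, otherwise 0 */
int lr_parse(const char *input) {
    int stack[128];
    size_t top = 0;               /* index of the topmost state */
    size_t i = 0;                 /* position of the next token a_i */
    stack[0] = 0;                 /* initial configuration: (S_0, a_1 ... a_n $) */
    for (;;) {
        int t = terminal(input[i]);
        if (t == T_BAD) return 0;
        Action act = ACTION[stack[top]][t];
        switch (act.kind) {
        case ACT_SHIFT:           /* push the new state, consume a_i */
            if (top + 1 >= 128) return 0;
            stack[++top] = act.arg;
            i++;
            break;
        case ACT_REDUCE: {        /* pop r states, then push GOTO[S_{m-r}, A] */
            top -= (size_t)RHS_LEN[act.arg];
            int next = GOTO_S[stack[top]];
            if (next < 0) return 0;
            stack[++top] = next;
            break;
        }
        case ACT_ACCEPT:
            return 1;
        default:                  /* ACT_ERROR */
            return 0;
        }
    }
}
```

For example, lr_parse("((x))") returns 1, while lr_parse("(x") and lr_parse("()") return 0. A real parser would call an error-handling routine instead of simply returning 0.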
LR Parsing Table
[Figure: LR parsing table, not preserved in this transcript]
Bottom-up Parsing (cont.)
• A parser table can be generated from a
given grammar with a tool, e.g., yacc