Scanner - Tunghai University (東海大學) personal website / network drive
Scanner
2015.03.16
Front End
[Figure: Source code → Front End → IR → Back End → Machine code, with Errors as a side output]
The purpose of the front end is to deal with the input language
Perform a membership test: code ∈ source language?
Is the program well-formed (semantically)?
Build an IR version of the code for the rest of the compiler
The front end deals with form (syntax) & meaning (semantics)
The Front End
[Figure: Source code → Scanner → tokens → Parser → IR, with Errors as a side output]
Implementation Strategy
                      Scanning                         Parsing
Specify Syntax        regular expressions              context-free grammars
Implement Recognizer  deterministic finite automaton   push-down automaton
Perform Work          Actions on transitions in automaton
The Front End
[Figure: stream of characters → Scanner (microsyntax) → stream of tokens → Parser (syntax) → IR + annotations, with Errors as a side output]
Why separate the scanner and the parser?
Scanner classifies words
Parser constructs grammatical derivations
Parsing is harder and slower
Separation simplifies the implementation
Scanners are simple
Scanner leads to a faster, smaller parser
The scanner is the only pass that touches every character of the input.
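The scanner's job of classifying words can be sketched as follows. This is a minimal illustration, not the course's scanner: the parts of speech (`NUMBER`, `IDENT`, `OP`, `SKIP`), their regular expressions, and the function name `scan` are all assumptions made for the example. Each word is reported as a (part of speech, lexeme) pair.

```python
import re

# Hypothetical parts of speech; the regular expressions are illustrative only.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Return a list of (part of speech, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":  # whitespace is read but not reported
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For example, `scan("x1 = 42 + y")` yields five pairs, starting with `("IDENT", "x1")`. The loop visits every character exactly once, which is why scanner speed matters.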
A token is a pair <part of speech, lexeme>.
Scanner Generator
Why study automatic scanner construction?
Avoid writing scanners by hand
Harness the theory from classes like COMP 481
[Figure: at design time, specifications written as “regular expressions” go into the Scanner Generator, which produces tables or code; at compile time, the Scanner uses them to turn source code into parts of speech & words. Words are represented as indices into a global table.]
Goals:
To simplify specification & implementation of scanners
To understand the underlying techniques and technologies
Comp 412, Fall 2010
Strings and Languages
Alphabet
An alphabet Σ is a finite set of symbols (characters)
String
A string is a finite sequence of symbols from Σ
|s| denotes the length of string s
ε denotes the empty string, thus |ε| = 0
Language
A language is a countable set of strings over some fixed alphabet
Abstract languages: ∅ (the empty set) and {ε} (the set containing only the empty string)
String Operations
Concatenation
The concatenation of two strings x and y is denoted by xy
Identity
The empty string ε is the identity under concatenation:
εs = sε = s
Exponentiation
Define
s^0 = ε
s^i = s^(i-1) s for i > 0
By definition
s^1 = s
s^2 = ss
Language Operations
Union
L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation
LM = { xy | x ∈ L and y ∈ M }
Exponentiation
L^0 = {ε}
L^i = L^(i-1) L
Kleene closure
L* = ∪ (i = 0, …, ∞) L^i
Positive closure
L+ = ∪ (i = 1, …, ∞) L^i
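The language operations can be tried out on small finite languages. The sketch below is only an illustration: the function names are made up for this example, Python's empty string `""` stands in for ε, and the closures are cut off at a chosen bound because L* and L+ are infinite in general.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M


def concat(L, M):
    """LM: every x from L followed by every y from M."""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^i, built from L^0 = {ε} and L^i = L^(i-1) L."""
    result = {""}  # "" plays the role of ε
    for _ in range(i):
        result = concat(result, L)
    return result


def closure_up_to(L, n, start=0):
    """Finite approximation of a closure: the union of L^i for i = start..n.

    start=0 approximates the Kleene closure L*, start=1 the positive closure L+.
    """
    result = set()
    for i in range(start, n + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
M = {"0", "1"}
```

With these definitions, `concat(L, M)` is `{"a0", "a1", "b0", "b1"}`, and `closure_up_to(L, 2)` contains ε while `closure_up_to(L, 2, start=1)` does not.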
Regular Expressions
Regular Expressions
A convenient means of specifying certain simple sets of strings.
We use regular expressions to define structures of tokens.
Tokens are built from symbols of a finite vocabulary.
Regular Sets
The sets of strings defined by regular expressions.
Regular Expressions
Basis symbols:
ε is a regular expression denoting the language L(ε) = {ε}
a ∈ Σ is a regular expression denoting L(a) = {a}
If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
r|s is a regular expression denoting L(r) ∪ M(s)
rs is a regular expression denoting L(r)M(s)
r* is a regular expression denoting L(r)*
(r) is a regular expression denoting L(r)
A language defined by a regular expression is called a regular set.
Operator Precedence
Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
Law                       Description
r|s = s|r                 | is commutative
r|(s|t) = (r|s)|t         | is associative
r(st) = (rs)t             concatenation is associative
r(s|t) = rs|rt            concatenation distributes over |
(s|t)r = sr|tr            concatenation distributes over |
εr = rε = r               ε is the identity for concatenation
r* = (r|ε)*               ε is guaranteed in a closure
r** = r*                  * is idempotent
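Some of these laws can be spot-checked by comparing the strings that both sides match. The sketch below uses Python's `re` module on sample operands r = a*, s = ab, t = b, and only looks at strings over {a, b} up to length 4, so it is a bounded sanity check, not a proof. The helper names `alt`, `cat`, and `language` are assumptions for this example, and only the commutativity, associativity, and distributivity laws are covered.

```python
import re
from itertools import product


def alt(x, y):
    """Regular expression for x | y."""
    return f"(?:{x}|{y})"


def cat(x, y):
    """Regular expression for the concatenation xy."""
    return f"(?:{x})(?:{y})"


def language(pattern, alphabet="ab", max_len=4):
    """Strings over alphabet, up to max_len long, that pattern matches in full."""
    words = (
        "".join(letters)
        for n in range(max_len + 1)
        for letters in product(alphabet, repeat=n)
    )
    return {w for w in words if re.fullmatch(pattern, w)}


# Sample operands; any regular expressions over {a, b} could be substituted.
r, s, t = "a*", "ab", "b"

LAWS = {
    "| is commutative": (alt(r, s), alt(s, r)),
    "| is associative": (alt(r, alt(s, t)), alt(alt(r, s), t)),
    "concatenation is associative": (cat(r, cat(s, t)), cat(cat(r, s), t)),
    "concatenation distributes over | (left)": (cat(r, alt(s, t)), alt(cat(r, s), cat(r, t))),
    "concatenation distributes over | (right)": (cat(alt(s, t), r), alt(cat(s, r), cat(t, r))),
}

# For each law, compare the bounded languages of the two sides.
results = {name: language(lhs) == language(rhs) for name, (lhs, rhs) in LAWS.items()}
```

Every entry in `results` should be `True`. Concatenation is not commutative, though: `language(cat(r, s))` and `language(cat(s, r))` differ, since aab matches a*ab but not aba*.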
Examples of Regular Expressions
Identifiers:
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*
(shorthand for (a|b|c| … |z|A|B|C| … |Z) ((a|b|c| … |z|A|B|C| … |Z) | (0|1|2| … |9))*)
Numbers:
Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*))
Decimal → Integer . Digit*
Real → ( Integer | Decimal ) E (+|-|ε) Digit*
Complex → ( Real , Real )
Numbers can get much more complicated!
Using symbolic names does not imply recursion.
Underlining indicates a letter in the input stream.
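The Identifier and Integer definitions translate almost directly into Python's `re` syntax. This is a sketch under a few assumptions: letters are limited to ASCII, the optional sign (+|-|ε) is written as `[+-]?`, and the names `IDENTIFIER`, `INTEGER`, `is_identifier`, and `is_integer` are made up for the example.

```python
import re

LETTER = "[A-Za-z]"  # (a|b|c| … |z|A|B|C| … |Z), ASCII only
DIGIT = "[0-9]"      # (0|1|2| … |9)

# Identifier → Letter ( Letter | Digit )*
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")

# Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*))
# The optional sign covers the ε alternative; the second group forbids leading zeros.
INTEGER = re.compile(f"[+-]?(?:0|[1-9]{DIGIT}*)")


def is_identifier(word):
    """True if the whole word is an Identifier."""
    return IDENTIFIER.fullmatch(word) is not None


def is_integer(word):
    """True if the whole word is an Integer."""
    return INTEGER.fullmatch(word) is not None
```

Here `is_identifier("x1")` and `is_integer("-42")` hold, while `1x` is not an identifier and `007` is not an integer because of the leading zeros.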
Finite Automata
Finite automata (FA) are recognizers.
An FA simply says “Yes” or “No” about each possible input string.
An FA can be used to recognize the tokens specified by a regular expression.
FAs are used to design a lexical analyzer generator.
Two kinds of finite automata
Nondeterministic finite automata (NFA)
Deterministic finite automata (DFA)
Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions
NFA = { S, Σ, δ, s0, F }
A finite set of states S
A set of input symbols Σ
input alphabet, ε is not in Σ
A transition function δ
δ : S × (Σ ∪ {ε}) → 2^S (each state and input symbol, or ε, maps to a set of next states)
A special start state s0
A set of final states F, F ⊆ S (accepting states)
Transition Graph for FA
A circle is a state
A labeled arrow is a transition
An arrow marked "start" points to the start state
A double circle is a final state
Example
[Figure: transition graph with states 0, 1, 2, 3 and edges labeled a, a, b, c, c]
This machine accepts abccabc, but it rejects abcab.
The language this machine accepts is (abc+)+.
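A machine like this one can be simulated with a dictionary of transitions. The edges below are an assumption: the figure did not survive, so they are reconstructed to be consistent with the edge labels (a, a, b, c, c) and with the claim that the machine accepts (abc+)+. State 3 is assumed to be the only final state, and `accepts` is a name chosen for this example.

```python
# Assumed edges, reconstructed from the slide's claims rather than read from the figure.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,
    (3, "c"): 3,  # c+ : stay in state 3 on further c's
    (3, "a"): 1,  # start another abc+ group
}
START = 0
FINAL = {3}  # assumed final state


def accepts(word):
    """Follow one transition per input symbol; say Yes only if we end in a final state."""
    state = START
    for symbol in word:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no edge for this symbol: reject
            return False
    return state in FINAL
```

Under these assumptions `accepts("abccabc")` is `True` and `accepts("abcab")` is `False`, matching the two examples on the slide.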
Transition Table
The mapping δ of an NFA can be represented in a transition table.
[Figure: NFA with start state 0, which loops on a and b, followed by 0 → 1 on a, 1 → 2 on b, 2 → 3 on b]
δ(0, a) = {0, 1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}

STATE   a        b      ε
0       {0, 1}   {0}    -
1       -        {2}    -
2       -        {3}    -
3       -        -      -
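The transition table maps directly onto a dictionary, and an NFA can be simulated by tracking the set of states it could be in. The sketch below makes two assumptions: state 3 is taken as the accepting state, and ε-moves are ignored because the ε column of this table is empty. The names `DELTA` and `nfa_accepts` are chosen for the example.

```python
# Transition table as a dictionary: (state, symbol) -> set of next states.
# Missing entries correspond to "-" in the table.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
FINAL = {3}  # assumed accepting state


def nfa_accepts(word):
    """Track every state the NFA could be in after each input symbol."""
    current = {START}
    for symbol in word:
        current = {
            nxt
            for state in current
            for nxt in DELTA.get((state, symbol), set())
        }
    # Accept if at least one possible path ends in a final state.
    return bool(current & FINAL)
```

With state 3 accepting, `nfa_accepts("abb")` and `nfa_accepts("aababb")` are `True`, while `nfa_accepts("ab")` is `False`.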
DFA
A DFA is a special case of an NFA:
There are no moves on input ε
For each state s and input symbol a, there is exactly one edge out of s labeled a.
Both DFA and NFA are capable of recognizing the same languages.
S = {0, 1, 2, 3}
Σ = {a, b}
s0 = 0
F = {3}
NFA vs DFA
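Both machines recognize (a | b)*abb.
[Figure: NFA for (a | b)*abb with states 0, 1, 2, 3; start state 0 loops on a and b, followed by 0 → 1 on a, 1 → 2 on b, 2 → 3 on b]
[Figure: DFA for (a | b)*abb with states 0, 1, 2, 3 and edges labeled a and b]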
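The claim that the NFA and the DFA recognize the same language can be spot-checked by running both on every short string. In the sketch below the NFA transitions come from the transition table; the DFA edges are an assumption, since the figure did not survive: they are the usual four-state machine for (a | b)*abb, not a transcription. The check covers strings up to a chosen length only, so it is evidence, not a proof.

```python
from itertools import product

# NFA transitions: (state, symbol) -> set of next states.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}

# Assumed DFA edges for (a | b)*abb: exactly one edge per state and symbol.
DFA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, FINAL = 0, {3}


def nfa_accepts(word):
    """Simulate the NFA by tracking the set of reachable states."""
    current = {START}
    for symbol in word:
        current = {n for s in current for n in NFA.get((s, symbol), set())}
    return bool(current & FINAL)


def dfa_accepts(word):
    """Simulate the DFA: a single current state, one table lookup per symbol."""
    state = START
    for symbol in word:
        state = DFA[(state, symbol)]
    return state in FINAL


def machines_agree(max_len=8):
    """Compare both machines on every string over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if nfa_accepts(word) != dfa_accepts(word):
                return False
    return True
```

The DFA needs only one lookup per input character, whereas the NFA simulation carries a set of states. That difference is the practical reason scanners are implemented as DFAs.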
Concept