Scanner - Tunghai University Personal Website / Network Drive


Scanner
2015.03.16
Front End
[Figure: Source code → Front End → IR → Back End → Machine code, with Errors as a side output]
The purpose of the front end is to deal with the input language
• Perform a membership test: code ∈ source language?
• Is the program well-formed (semantically)?
• Build an IR version of the code for the rest of the compiler
The front end deals with form (syntax) & meaning (semantics)
The Front End
[Figure: Source code → Scanner → tokens → Parser → IR, with Errors as a side output]
Implementation Strategy
                       Scanning                         Parsing
Specify Syntax         regular expressions              context-free grammars
Implement Recognizer   deterministic finite automaton   push-down automaton
Perform Work           Actions on transitions in automaton
The Front End
[Figure: stream of characters → Scanner (microsyntax) → stream of tokens → Parser (syntax) → IR + annotations, with Errors as a side output]
Why separate the scanner and the parser?
• Scanner classifies words
• Parser constructs grammatical derivations
• Parsing is harder and slower
Separation simplifies the implementation
• Scanners are simple
• Scanner leads to a faster, smaller parser
The scanner is the only pass that touches every character of the input.
A token is a pair <part of speech, lexeme>
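To make the <part of speech, lexeme> pair concrete, here is a minimal sketch of a scanner built on Python's re module. The token classes and their patterns (number, identifier, operator, skip) are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
import re

# Hypothetical token classes; names and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("operator",   r"[+\-*/=]"),
    ("skip",       r"\s+"),
]
# One master pattern with a named group per part of speech.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(text):
    """Return a list of (part of speech, lexeme) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "skip":  # whitespace is consumed but not reported
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(scan("x1 = y + 42"))
```

Note that the loop visits every character of the input exactly once, which is the property the slide attributes to the scanner.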
Scanner Generator
Why study automatic scanner construction?
• Avoid writing scanners by hand
• Harness the theory from classes like COMP 481
[Figure: at design time, specifications written as “regular expressions” feed the Scanner Generator, which produces tables or code; at compile time, the Scanner uses them to turn source code into parts of speech & words. Words are represented as indices into a global table.]
Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies
Comp 412, Fall 2010
Strings and Languages
• Alphabet
  • An alphabet Σ is a finite set of symbols (characters)
• String
  • A string is a finite sequence of symbols from Σ
  • |s| denotes the length of string s
  • ε denotes the empty string, thus |ε| = 0
• Language
  • A language is a countable set of strings over some fixed alphabet Σ
  • Abstract languages: Φ (the empty set) and {ε}
String Operations
• Concatenation
  • The concatenation of two strings x and y is denoted by xy
• Identity
  • The empty string is the identity under concatenation.
    εs = sε = s
• Exponentiation
  • Define
    s^0 = ε
    s^i = s^(i-1) s for i > 0
  • By definition
    s^1 = s
    s^2 = ss
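The string operations can be tried directly on Python strings. This is a minimal sketch; the names EPSILON and power are assumptions introduced here for illustration.

```python
EPSILON = ""  # the empty string ε

def power(s, i):
    """String exponentiation: s^0 = ε and s^i = s^(i-1) s for i > 0."""
    if i == 0:
        return EPSILON
    return power(s, i - 1) + s

s = "ab"
print(len(EPSILON))                      # length of the empty string
print(EPSILON + s == s + EPSILON == s)   # ε is the identity under concatenation
print(power(s, 1), power(s, 2))          # s^1 and s^2
```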
Language Operations
• Union
  L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation
  LM = { xy | x ∈ L and y ∈ M }
• Exponentiation
  L^0 = {ε}
  L^i = L^(i-1) L for i > 0
• Kleene closure
  L* = ∪_{i=0,…,∞} L^i
• Positive closure
  L+ = ∪_{i=1,…,∞} L^i
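For finite languages, these operations map onto Python sets. The sketch below is illustrative only: the function names are assumptions, and because L* and L+ are infinite for any L containing a non-empty string, closure takes an explicit bound on the exponent i rather than computing the full closure.

```python
def union(L, M):
    """L ∪ M"""
    return L | M

def concat(L, M):
    """LM = { xy | x in L and y in M }"""
    return {x + y for x in L for y in M}

def power(L, i):
    """L^0 = {ε}, L^i = L^(i-1) L"""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure(L, bound, start=0):
    """Union of L^i for i = start..bound (start=0: Kleene, start=1: positive)."""
    result = set()
    for i in range(start, bound + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(closure(L, 2)))           # bounded approximation of L*
print(sorted(closure(L, 2, start=1)))  # bounded approximation of L+
```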
Regular Expressions
Regular Expressions
• A convenient means of specifying certain simple sets of strings.
• We use regular expressions to define structures of tokens.
• Tokens are built from symbols of a finite vocabulary.
Regular Sets
• The sets of strings defined by regular expressions.
Regular Expressions
• Basis symbols:
  • ε is a regular expression denoting the language L(ε) = {ε}
  • a ∈ Σ is a regular expression denoting L(a) = {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  • r|s is a regular expression denoting L(r) ∪ M(s)
  • rs is a regular expression denoting L(r)M(s)
  • r* is a regular expression denoting L(r)*
  • (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set.
Operator Precedence
Operator        Precedence   Associativity
*               highest      left
concatenation   second       left
|               lowest       left
Algebraic Laws for Regular Expressions
Law                               Description
r|s = s|r                         | is commutative
r|(s|t) = (r|s)|t                 | is associative
r(st) = (rs)t                     concatenation is associative
r(s|t) = rs|rt, (s|t)r = sr|tr    concatenation distributes over |
εr = rε = r                       ε is the identity for concatenation
r* = (r|ε)*                       ε is guaranteed in a closure
r** = r*                          * is idempotent
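Some of these laws can be spot-checked with Python's re module by comparing the sets of short strings that two patterns match. This is a sketch under stated assumptions: it takes r = a, s = b, t = ab over the alphabet {a, b}, checks only strings of up to four symbols, and so gives evidence rather than a proof. The helper names language and same are hypothetical.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """Strings over alphabet, up to max_len symbols, that pattern matches entirely."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

def same(p, q):
    """True if p and q match exactly the same strings within the bound."""
    return language(p) == language(q)

# (?:...) is Python's non-capturing group; an empty alternative stands for ε.
print(same("a|b", "b|a"))                  # | is commutative
print(same("a|(?:b|ab)", "(?:a|b)|ab"))    # | is associative
print(same("a(?:b|ab)", "ab|aab"))         # concatenation distributes over |
print(same("(?:a)*", "(?:a|)*"))           # r* = (r|ε)*
print(same("(?:a*)*", "a*"))               # r** = r*
```

Each line should print True if the corresponding law holds on the bounded sample.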
Examples of Regular Expressions
Identifiers:
Letter → (a|b|c| … |z|A|B|C| … |Z)
Digit → (0|1|2| … |9)
Identifier → Letter ( Letter | Digit )*
(shorthand for (a|b|c| … |z|A|B|C| … |Z) ((a|b|c| … |z|A|B|C| … |Z) | (0|1|2| … |9))*)
Numbers:
Integer → (+|-|ε) (0 | (1|2|3| … |9) Digit*)
Decimal → Integer "." Digit*
Real → ( Integer | Decimal ) "E" (+|-|ε) Digit*
Complex → "(" Real "," Real ")"
Numbers can get much more complicated!
Using symbolic names does not imply recursion.
Quoted characters are literal characters in the input stream.
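The definitions on this slide translate almost line for line into Python's re syntax. The sketch below is one possible transcription, not the notation a scanner generator would require: character classes replace the long alternations, [+-]? stands for (+|-|ε), and the literal characters ( . E , ) are written directly or escaped. The symbolic names are expanded by plain string substitution, which reflects the point that naming does not imply recursion.

```python
import re

LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
IDENTIFIER = rf"{LETTER}(?:{LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(?:0|[1-9]{DIGIT}*)"
DECIMAL = rf"{INTEGER}\.{DIGIT}*"
REAL = rf"(?:{INTEGER}|{DECIMAL})E[+-]?{DIGIT}*"
COMPLEX = rf"\({REAL},{REAL}\)"

def is_a(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_a(IDENTIFIER, "x1"), is_a(IDENTIFIER, "1x"))
print(is_a(INTEGER, "-42"), is_a(INTEGER, "007"))
print(is_a(DECIMAL, "3.14"))
print(is_a(REAL, "3.14E-2"), is_a(REAL, "3.14"))
print(is_a(COMPLEX, "(1E0,2.5E1)"))
```

As written on the slide, Real requires the E, and Digit* after it may be empty, so a string such as 1E is accepted; a production definition would likely tighten that.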
Finite Automata
• Finite automata are recognizers.
  • An FA simply says “Yes” or “No” about each possible input string.
• An FA can be used to recognize the tokens specified by a regular expression
  • FAs are used in the design of a lexical analyzer generator
• Two kinds of finite automata
  • Nondeterministic finite automata (NFA)
  • Deterministic finite automata (DFA)
• Both DFA and NFA are capable of recognizing the same languages.
NFA Definitions
NFA = (S, Σ, δ, s0, F)
• A finite set of states S
• A set of input symbols Σ (the input alphabet; ε is not in Σ)
• A transition function δ
  δ : S × (Σ ∪ {ε}) → 2^S
• A special start state s0
• A set of final (accepting) states F, F ⊆ S
Transition Graph for FA
• A circle is a state
• A labeled arrow is a transition
• An arrow marked "start" points to the start state
• A double circle is a final state
Example
[Figure: transition graph over states 0, 1, 2, 3 with edges labeled a, b, and c]
This machine accepts abccabc, but it rejects abcab.
This machine accepts (abc+)+.
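The two claims about this machine can be checked with a small simulation. The edges of the original figure are not recoverable from the transcript, so the transition map below is an assumed reconstruction: one deterministic machine for (abc+)+ with start state 0 and final state 3. The names DELTA and accepts are hypothetical.

```python
# Assumed reconstruction of the figure: a machine for (abc+)+.
DELTA = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,
    (3, "c"): 3,  # c+ : stay in the final state on further c's
    (3, "a"): 1,  # outer + : start another abc+ group
}
START = 0
FINAL = {3}

def accepts(word):
    """Follow one transition per input symbol; reject if no edge exists."""
    state = START
    for symbol in word:
        state = DELTA.get((state, symbol))
        if state is None:
            return False
    return state in FINAL

print(accepts("abccabc"))
print(accepts("abcab"))
```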
Transition Table
The mapping δ of an NFA can be represented in a transition table
[Figure: NFA transition graph with start state 0 and states 1, 2, 3]
δ(0, a) = {0, 1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}

STATE   a        b      ε
0       {0, 1}   {0}    -
1       -        {2}    -
2       -        {3}    -
3       -        -      -
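The transition table can be used directly to simulate the NFA by tracking the set of states it could be in after each symbol. This is a minimal sketch, assuming start state 0 and final state 3; since the ε column of this table is empty, no ε-closure step is included. The names NFA_TABLE and nfa_accepts are hypothetical.

```python
# Rows of the transition table; a missing entry means "-" (no move).
NFA_TABLE = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START = 0
FINAL = {3}

def nfa_accepts(word):
    """Track every state the NFA could be in after each input symbol."""
    current = {START}
    for symbol in word:
        current = set().union(*(NFA_TABLE[s].get(symbol, set()) for s in current))
    return bool(current & FINAL)

print(nfa_accepts("abb"))
print(nfa_accepts("babb"))
print(nfa_accepts("ab"))
```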
DFA
A DFA is a special case of an NFA:
• There are no moves on input ε
• For each state s and input symbol a, there is exactly one edge out of s labeled a.
Both DFA and NFA are capable of recognizing the same languages.
S = {0, 1, 2, 3}
Σ = {a, b}
s0 = 0
F = {3}
NFA vs DFA
[Figure: an NFA and a DFA, each over states 0 to 3 with start state 0, for the regular expression (a | b)*abb]
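One way to see that an NFA and a DFA can recognize the same language is the subset construction, which builds a DFA whose states are sets of NFA states. The sketch below applies it to the NFA for (a | b)*abb, with S, Σ, s0 and F as given earlier. It is a simplified version under a stated assumption: this NFA has no ε-moves, so the ε-closure step of the general construction is omitted. The function names are hypothetical.

```python
import re
from itertools import product

# NFA for (a | b)*abb, taken from the transition table (no ε-moves).
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ALPHABET = "ab"
START, FINALS = 0, {3}

def subset_construction(nfa, start, finals, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset({start})
    dfa = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            dfa[current][symbol] = target
            worklist.append(target)
    accepting = {state for state in dfa if state & finals}
    return dfa, start_set, accepting

def dfa_accepts(dfa, start_set, accepting, word):
    """Exactly one edge per state and symbol, so no sets are needed at run time."""
    state = start_set
    for symbol in word:
        state = dfa[state][symbol]
    return state in accepting

DFA, DFA_START, DFA_ACCEPTING = subset_construction(NFA, START, FINALS, ALPHABET)
print(len(DFA))  # number of DFA states reachable from the start state

# Cross-check against Python's own regex engine on all short strings.
for n in range(6):
    for letters in product(ALPHABET, repeat=n):
        word = "".join(letters)
        expected = re.fullmatch(r"(a|b)*abb", word) is not None
        assert dfa_accepts(DFA, DFA_START, DFA_ACCEPTING, word) == expected
```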
Concept