Slide 1 - Byeongdo Kang







5.1 Introduction
5.2 General Structure of a Compiler
5.3 Compiler Automation Tools
5.4 Lexical Analysis
5.5 Syntax Analysis
    Parsing methods
    Output of the parser
    Top-down methods
        Recursive-descent parser
        LL parser
    Bottom-up methods
        Shift-reduce parsing
        LR parser

Compiler
“A compiler is a computer program which translates programs written in
a particular high-level programming language into executable code for
a specific target computer.”
Source
Programs
Compiler
Object
Programs
(Assembly Language,
Machine Language)
ex) C compiler on SPARC

Takes a C program as input and outputs code that can run on SPARC.
[2/29]

Compiler Structure
Source
Programs
Front-End
IC
Back-End


Object
Programs
Front-End : language dependent part
Back-End : machine dependent part
[3/29]
Source
Programs
Lexical Analyzer
Token
Syntax Analyzer
Tree
Intermediate Code
Generator
Intermediate
Code
Code Optimizer
Optimized
Code
Target Code
Generator
Object
Programs
[4/29]
1. Lexical Analyzer(Scanner)

Converts each token into an integer that is efficient and easy to handle inside the compiler.
Source Programs → Lexical Analyzer → A sequence of tokens
ex) if ( a > 10 ) ...
Token        :  if   (   a   >    10   )   ...
Token Number :  32   7   4   25   5    8
[5/29]
2. Syntax Analyzer(Parser)

Function: syntax checking, tree generation.
A sequence of tokens → Syntax Analyzer → Error message or syntactic structure (tree)
Output: incorrect - prints an error message
        correct - prints the program structure (in tree form)
ex) if (a > 10) a = 1;
        if
       /  \
      >    =
     / \  / \
    a  10 a  1
Introduction to Compiler Design Theory
[6/29]
3. Intermediate Code Generator

Semantic checking

Intermediate Code Generation
ex) if (a > 10) a = 1.0; ☞ semantic error when a is an integer!
ex) a = b + 1;
Tree :
      =
     / \
    a   +
       / \
      b   1
Ucode:
    lod 1 2
    ldc 1
    add
    str 1 1
- variable reference: (base, offset)
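The translation from the tree to the stack code above can be sketched as a postorder walk. This is only an illustration: the tuple tree encoding and the `ADDRESSES` table are assumptions made for the example, with the (base, offset) pairs copied from the slide, and only the four instructions shown on the slide are emitted.

```python
# Hypothetical address table: variable name -> (base, offset), as on the slide.
ADDRESSES = {"a": (1, 1), "b": (1, 2)}

# Only the operator that appears on the slide is mapped.
OPCODES = {"+": "add"}


def gen_expr(node, code):
    """Emit stack code for an expression tree in postorder."""
    if isinstance(node, int):            # constant leaf
        code.append(f"ldc {node}")
    elif isinstance(node, str):          # variable leaf
        base, offset = ADDRESSES[node]
        code.append(f"lod {base} {offset}")
    else:                                # interior node: (operator, left, right)
        op, left, right = node
        gen_expr(left, code)
        gen_expr(right, code)
        code.append(OPCODES[op])


def gen_assign(tree):
    """Translate ("=", target, expression) into a list of instructions."""
    _, target, expr = tree
    code = []
    gen_expr(expr, code)                 # value of the right-hand side
    base, offset = ADDRESSES[target]
    code.append(f"str {base} {offset}")  # store it into the target variable
    return code


print("\n".join(gen_assign(("=", "a", ("+", "b", 1)))))
```

Operands are pushed before their operator, which is why the assignment target is handled last.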
[7/29]
4. Code Optimizer



Optional phase
Identifies inefficient code and replaces it with more efficient code.
Meaning of optimization
    major part : improve running time
    minor part : reduce code size
ex)
    LDC R1, 1
    LDC R1, 1   (x)  redundant: the second load repeats the first
Criteria for optimization
    preserve the program meanings
    speed up on average
    be worth the effort
[8/29]

Local optimization
A method that identifies inefficient code through local inspection and replaces it with more efficient code.

1. Constant folding
2. Eliminating redundant load, store instructions
3. Algebraic simplification
4. Strength reduction

Global optimization

Uses flow analysis techniques.
1. Common subexpression
2. Moving loop invariants
3. Removing unreachable codes
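As an illustration of the first local optimization in the list, constant folding, here is a minimal sketch over a tuple-encoded expression tree. The tree encoding and the `fold` name are assumptions for this example, not part of the slides.

```python
import operator

# Operators that can be evaluated at compile time in this sketch.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Return an equivalent tree with constant subexpressions evaluated."""
    if not isinstance(node, tuple):
        return node                      # leaf: constant or variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)      # both operands are known constants
    return (op, left, right)


# x + 2 * (3 + 4) becomes x + 14
print(fold(("+", "x", ("*", 2, ("+", 3, 4)))))
```

The rewrite preserves the program's meaning, which is the first criterion for optimization listed earlier.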
[9/29]
5. Target Code Generator

Generates machine instructions from the intermediate code.
Intermediate
Code

Target
Code Generator
Target
Code
Code generator tasks
1. instruction selection & generation
2. register management
3. storage allocation
4. code optimization (Machine-dependent optimization)
[10/29]
6. Error Recovery
Error recovery - adjusting the compiler's state so that an error does not affect other statements
Error repair - correcting the error when it occurs

Error Handling
    Error detection
    Error recovery
    Error reporting
    Error repair
Error
    Syntax Error
    Semantic Error
    Run-time Error
[11/29]

Compiler Generating Tools
(= Compiler-Compiler, Translator Writing System)

As languages and machines develop, more compilers are needed.
    Reason for developing new languages: the application areas of computers keep widening.
    Implementing N languages on M computers requires N*M compilers.
ex) 2 languages : C, Java
    3 machines : IBM, SPARC, Pentium
    C-to-IBM, C-to-SPARC, C-to-Pentium
    Java-to-IBM, Java-to-SPARC, Java-to-Pentium
[12/29]

Compiler-compiler Model
Program
written in L
Language
Description : L
Compiler Compiler
Compiler
Machine
Description : M
Executable
form on M

Language descriptions rely on grammar theory, whereas machine descriptions have not yet been formalized.
    HDL : Hardware Description Language
    Used to design computer architecture.
Automatic compiler generation has been studied as machine architectures and programming languages have advanced.
[13/29]
1. LEX : devised by M. E. Lesk in 1975.

A tool useful for writing programs that find tokens, described by regular expressions, in an input stream.
Regular Expression + Action Code → LEX → Lexical Analyzer (lex.yy.c)
Source Program → Lexical Analyzer (lex.yy.c) → Token Stream
[14/29]
2. Parser Generator(PGS: Parser Generating System)
Grammar
Description
PGS
Parsing Table
Input
Program
Output
Parser
(program
structures)
(1) Stanford PGS

John Hennessy

Written in Pascal : 5000 lines
Feature : obtains the syntactic structure in AST form.

Output : a parsing table that includes Abstract Syntax Tree(AST) information.
[15/29]
(2) Wisconsin PGS

C.N. Fisher

Written in Pascal : 10000 lines
Feature : error recovery

(3) YACC(Yet Another Compiler Compiler)

Runs on UNIX.
Written in C.
Regular Expression + Action Code → LEX → lex.yy.c <Lexical Analysis>
Grammar Rule + Action Code → YACC → y.tab.c <Syntax Analysis>
Source Program → lex.yy.c → y.tab.c → Result by Action Code
[16/29]
3. Automatic Code Generation
Machine
Description
Code-Generator
Generator
Table
Intermediate
Code


Code Generator
Target
Code
Three aspects
1. Machine Description : ISP, ISPS, HDL
2. Intermediate language
3. Code generating algorithm
CGA
Pattern matching code generation
Table driven code generation
[17/29]
4. Compiler Compiler System
(1) PQCC(Production Quality Compiler Compiler System)

W.A. Wulf(Carnegie-Mellon University)

Takes a language description and a target machine description as input and outputs a PQC(Production Quality Compiler) and tables.
Uses TCOL, a tree-structured representation, as the intermediate language.
Generates code by Pattern Matching Code Generation.

(2) ACK(Amsterdam Compiler Kit)

A compiler Back-End automation tool developed by a group led by Andrew S. Tanenbaum at Vrije Universiteit.
Originated from the UNCOL concept (N*M => N+M).
Uses an Abstract Machine Code called EM as the intermediate language.
Convenient for building portable compilers.
[18/29]

PQCC Model
Language Description
+
Machine Description
PQCC
Table
Source
Program
Front-End
PQC
TCOL
Object
Code
[19/29]

ACK Model
Source Program → Front-End → EM → Back-End → Object Code
    Front-Ends (source languages) : FORTRAN, ALGOL, PASCAL, C, ADA
    Back-Ends (target machines) : Intel 8080/8086/80386, Motorola 6800/6809/68000/68020, Zilog Z80/Z8000, VAX, SPARC
    EM → Interpreter → Result
[20/29]

Lexical Analysis

the process by which the compiler groups certain strings of
characters into individual tokens.
Source
Program

Lexical Analyzer
Token
Stream
Lexical Analyzer ≡ Scanner ≡ Lexer
[21/39]

Token

The smallest grammatically meaningful unit
Token - a single syntactic entity(terminal symbol).
Token Number - an integer number used for efficient string handling.
Token Value - numeric value or string value.
ex)             if   (   a     >    10   )   ...
Token Number :  32   7   4     25   5    8
Token Value  :  0    0   'a'   0    10   0
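A small scanner that produces (token number, token value) pairs for the example above might look like the following sketch. The token numbers are the ones shown on the slide; the tables `KEYWORDS` and `SYMBOLS`, the function name `scan`, and the use of Python's `re` module are assumptions made for the illustration, and only the symbols that appear in the example are covered.

```python
import re

# Token numbers copied from the slide example; the tables are hypothetical.
KEYWORDS = {"if": 32}
SYMBOLS = {"(": 7, ")": 8, ">": 25}
TN_IDENT, TN_NUMBER = 4, 5

TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<sym>[()>]))"
)


def scan(text):
    """Return a list of (token number, token value) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character at position {pos}")
        pos = m.end()
        if m.group("id"):
            word = m.group("id")
            if word in KEYWORDS:                  # keywords carry no value
                tokens.append((KEYWORDS[word], 0))
            else:                                 # identifiers keep their name
                tokens.append((TN_IDENT, word))
        elif m.group("num"):
            tokens.append((TN_NUMBER, int(m.group("num"))))
        else:
            tokens.append((SYMBOLS[m.group("sym")], 0))
    return tokens


print(scan("if ( a > 10 )"))
```

Keywords and delimiters get token value 0, while identifiers and constants carry their name or numeric value, matching the two rows of the example.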
[22/39]

Token classes

Special form - language designer
1. Keyword --- const, else, if, int, ...
2. Operator symbols --- +, -, *, /, ++, -- etc.
3. Delimiters --- ;, ,, (, ), [, ] etc.


General form - programmer
4. identifier --- stk, ptr, sum, ...
5. constant --- 526, 3.0, 0.1234e-10, ‘c’, “string” etc.
Token Structure - represented by regular expression.
ex) id = (l + _)( l + d + _)*
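The identifier pattern can be checked either with a regular expression or with an explicit two-state machine. The sketch below shows both; the function names are hypothetical, and letters and digits are restricted to ASCII as an assumption.

```python
import re

# id = (l + _)(l + d + _)* written as a Python regular expression.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Check text against the identifier regular expression."""
    return IDENT_RE.fullmatch(text) is not None


def is_identifier_dfa(text):
    """Same check with a two-state machine: S is the start, A is accepting."""
    state = "S"
    for ch in text:
        is_l = ch.isascii() and ch.isalpha()
        is_d = ch.isascii() and ch.isdigit()
        if state == "S" and (is_l or ch == "_"):
            state = "A"
        elif state == "A" and (is_l or is_d or ch == "_"):
            state = "A"
        else:
            return False                 # no transition for this character
    return state == "A"


for word in ["stk", "_x1", "1abc"]:
    print(word, is_identifier(word), is_identifier_dfa(word))
```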
[23/39]

Uses of the symbol table

Collects and stores information about identifiers during lexical analysis (L.A) and syntax analysis (S.A).
Used during semantic analysis and code generation.
Entry : name + attributes
ex) Hashed symbol table
    bucket → symbol table entry (name, attributes)
See chapter 12
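A hashed symbol table with buckets and name + attributes entries could be sketched as follows. The class name, the bucket count, and the character-sum hash are all assumptions chosen for readability, not the design described in chapter 12.

```python
class SymbolTable:
    """Hashed symbol table: an array of buckets, each a chain of entries."""

    def __init__(self, num_buckets=211):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, name):
        # Simple deterministic hash of the name (a hypothetical choice).
        return self.buckets[sum(map(ord, name)) % len(self.buckets)]

    def insert(self, name, **attributes):
        """Add a name, or update its attributes if it is already present."""
        for entry in self._bucket(name):
            if entry["name"] == name:
                entry["attributes"].update(attributes)
                return entry
        entry = {"name": name, "attributes": dict(attributes)}
        self._bucket(name).append(entry)
        return entry

    def lookup(self, name):
        """Return the attributes of name, or None if it was never inserted."""
        for entry in self._bucket(name):
            if entry["name"] == name:
                return entry["attributes"]
        return None


table = SymbolTable()
table.insert("sum", type="int")
table.insert("sum", offset=1)
print(table.lookup("sum"))
```

Names that hash to the same bucket are kept in the same chain and told apart by comparing the stored name.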
[24/39]

Specification of token structure - RE
Specification of PL - CFG

Scanner design steps (Text p.134)
1. describe the structure of tokens in re.
2. or, directly design a transition diagram for the tokens.
3. and program a scanner according to the diagram.
4. moreover, we verify the scanner action through regular language theory.

Character classification
    letter : a | b | c ... | z | A | B | C | … | Z   → l
    digit : 0 | 1 | 2 ... | 9   → d
    special character : + | - | * | / | . | , | ...
[25/39]

Transition diagram
    start → S
    S --(l, _)--> A
    A --(l, d, _)--> A    (A is the accepting state)
Regular grammar
    S → lA | _A
    A → lA | dA | _A | ε
Regular expression
    S = lA + _A = (l + _)A
    A = lA + dA + _A + ε = (l + d + _)A + ε = (l + d + _)*
    ∴ S = (l + _)(l + d + _)*
[26/39]

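Form : integer constants are classified as decimal, octal, or hexadecimal.
    decimal : starts with a non-zero digit
    octal : starts with 0; hexadecimal : starts with 0x or 0X

Transition diagram
    start → S
    S --n--> A,  A --d--> A
    S --0--> B
    B --o--> C,  C --o--> C
    B --(x, X)--> D
    D --h--> E,  E --h--> E
    n : non-zero digit
    o : octal digit
    h : hexa digit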
[27/39]

Regular grammar
    S → nA | 0B
    A → dA | ε
    B → oC | xD | XD | ε
    C → oC | ε
    D → hE
    E → hE | ε
Regular expression
    E = hE + ε = h*ε = h*
    D = hE = hh* = h+
    C = oC + ε = o*
    B = oC + xD + XD + ε = o+ + (x + X)D + ε = o+ + (x + X)h+ + ε
    A = dA + ε = d*
    S = nA + 0B = nd* + 0(o+ + (x + X)h+ + ε)
      = nd* + 0 + 0o+ + 0(x + X)h+
    ∴ S = nd* + 0 + 0o+ + 0(x + X)h+
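The integer-constant grammar can be turned into a small state machine that also reports which form was recognized. In this sketch the state names S, A, B, C, D, E follow the grammar, while the labels in `ACCEPTING` (including treating a lone 0 as its own "zero" case) and the function names are assumptions for the example.

```python
import string

# Accepting states (those with an ε-production) and what they recognize.
ACCEPTING = {"A": "decimal", "B": "zero", "C": "octal", "E": "hexadecimal"}


def step(state, ch):
    """Return the next state for ch, or None if there is no transition."""
    if state == "S":
        if ch == "0":
            return "B"
        if ch in "123456789":            # n : non-zero digit
            return "A"
    elif state == "A":
        if ch in string.digits:          # d : digit
            return "A"
    elif state in ("B", "C"):
        if ch in string.octdigits:       # o : octal digit
            return "C"
        if state == "B" and ch in "xX":
            return "D"
    elif state in ("D", "E"):
        if ch in string.hexdigits:       # h : hexa digit
            return "E"
    return None


def classify(text):
    """Return the kind of integer constant, or None if text is not one."""
    state = "S"
    for ch in text:
        state = step(state, ch)
        if state is None:
            return None
    return ACCEPTING.get(state)


for literal in ["123", "017", "0x1F", "08", "0x"]:
    print(literal, classify(literal))
```

State D is not accepting, so `0x` with no hexadecimal digit after it is rejected, in line with D → hE having no ε alternative.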
[28/39]
5.5.1 Parsing methods
5.5.2 Output of the parser
5.5.3 Top-down methods
5.5.4 Bottom-up methods
[29/28]

How to check whether an input string is a sentence of a grammar, and how to construct a parse tree for the string.
Parsing : ω ∈ L(G) ?

A parser for grammar G is a program that takes as input a string ω and produces as output either a parse tree (or derivation tree) for ω, if ω is a sentence of G, or an error message indicating that ω is not a sentence of G.
A sequence of tokens → Parser → Correct sentence : Parse tree
                               Incorrect sentence : Error message
[30/28]

Two basic types of parsers for context-free grammars
① Top down - starting with the root and working down to the leaves. recursive descent parser, predictive parser.
② Bottom up - beginning at the leaves and working up to the root. precedence parser, shift-reduce parser.
ex) A → XYZ
        A
      / | \
     X  Y  Z
    top-down : expand A into XYZ ("toward the sentence")
    bottom-up : reduce XYZ to A ("toward the start symbol")
[31/28]

The output of a parser:
① Parse - left parse, right parse
② Parse tree
③ Abstract syntax tree
ex)
G : 1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F →(E)
6. F → a
string : a + a * a
[32/28]

left parse : a sequence of production rule numbers applied in leftmost derivation.
    E ⇒(1) E+T ⇒(2) T+T ⇒(4) F+T ⇒(6) a+T
      ⇒(3) a+T*F ⇒(4) a+F*F ⇒(6) a+a*F ⇒(6) a+a*a
    ∴ 12463466

right parse : reverse order of production rule numbers applied in rightmost derivation.
    E ⇒(1) E+T ⇒(3) E+T*F ⇒(6) E+T*a ⇒(4) E+F*a
      ⇒(6) E+a*a ⇒(2) T+a*a ⇒(4) F+a*a ⇒(6) a+a*a
    ∴ 64264631
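Both parses can be checked mechanically by replaying the rule numbers against the grammar. The sketch below does this for the expression grammar G; the `RULES` table and the `derive` function are names made up for the illustration.

```python
# Productions of G, numbered as on the slide: number -> (left side, right side).
RULES = {
    1: ("E", ["E", "+", "T"]),
    2: ("E", ["T"]),
    3: ("T", ["T", "*", "F"]),
    4: ("T", ["F"]),
    5: ("F", ["(", "E", ")"]),
    6: ("F", ["a"]),
}
NONTERMINALS = {"E", "T", "F"}


def derive(rule_numbers, leftmost=True):
    """Apply the rules in order to the leftmost (or rightmost) nonterminal."""
    form = ["E"]                                   # start symbol
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        positions = [i for i, s in enumerate(form) if s in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {number} does not apply to {form[i]}")
        form[i:i + 1] = rhs                        # expand the nonterminal
    return "".join(form)


left_parse = [1, 2, 4, 6, 3, 4, 6, 6]
right_parse = [6, 4, 2, 6, 4, 6, 3, 1]
print(derive(left_parse))
# The right parse is the rightmost derivation in reverse, so reverse it first.
print(derive(reversed(right_parse), leftmost=False))
```

Both calls rebuild the string a+a*a, which confirms that the two number sequences describe the same sentence.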
[33/28]

parse tree : derivation tree
string : a + a * a
E
├─ E
│  └─ T
│     └─ F
│        └─ a
├─ +
└─ T
   ├─ T
   │  └─ F
   │     └─ a
   ├─ *
   └─ F
      └─ a
[34/28]

Abstract Syntax Tree(AST)
::= a transformed parse tree that is a more efficient representation of the source program.

leaf node - operand(identifier or constant)
internal node - operator(meaningful production rule name)
ex)
G: 1. E → E + T  ⇒ add
   2. E → T
   3. T → T * F  ⇒ mul
   4. T → F
   5. F → (E)
   6. F → a
string : a + a * a
      add
     /   \
    a    mul
        /   \
       a     a
[35/28]
※
Meaningful terminal ⇒ terminal node
Meaningful production rule ⇒ nonterminal node
→ naming : specified by the compiler designer.
ex) if (a > b) a = b + 1; else a = b - 2;
IF_ST
    GT
        a
        b
    ASSIGN_OP
        a
        ADD
            b
            1
    ASSIGN_OP
        a
        SUB
            b
            2
[36/28]
::= Beginning with the start symbol of the grammar, it attempts to produce a
string of terminal symbol that is identical to a given source string. This
matching process proceeds by successively applying the productions of the
grammar to produce substrings from nonterminals.
::= In the terminology of trees, this is moving from the root of the tree to a set
of leaves in the parse tree for a program.

Top-Down parsing methods
(1) Parsing with backup or backtracking.
(2) Parsing with limited or partial backup.
(3) Parsing with no backtracking.

backtracking : making repeated scans of the input.
[37/28]

General Top-Down Parsing method


called a brute-force method
with backtracking ( ≡ Top-Down parsing with full backup )
1. Given a particular nonterminal that is to be expanded, the first production for this nonterminal is applied.
2. Compare the newly expanded string with the input string. In the matching process, each terminal symbol is compared with an input symbol; when a nonterminal is reached, it is selected for expansion and its first production is applied.
3. If the generated string does not match the input string, an incorrect expansion
occurs. In the case of such an incorrect expansion this process is backed up by
undoing the most recently applied production. And the next production of this
nonterminal is used as next expansion.
4. This process continues either until the generated string becomes an input string
or until there are no further productions to be tried. In the latter case, the given
string cannot be generated from the grammar.
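The four steps above can be sketched with a recursive generator that tries productions in order and backs up when a match fails. The grammar used here (S → cAd, A → ab | a) is a hypothetical example chosen because the first alternative for A fails on the input cad, forcing one backup; it is not taken from the slides.

```python
# Hypothetical grammar: S -> c A d,  A -> a b | a
GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["a", "b"], ["a"]],
}


def expand(symbols, text, pos):
    """Yield every input position reachable after matching symbols from pos."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Nonterminal: try its productions in order; a later production is
        # reached only when the earlier ones fail to lead to a full match.
        for production in GRAMMAR[first]:
            for mid in expand(production, text, pos):
                yield from expand(rest, text, mid)
    elif pos < len(text) and text[pos] == first:
        # Terminal: it must equal the current input symbol.
        yield from expand(rest, text, pos + 1)


def parse(text, start="S"):
    """Return True if text can be generated from the start symbol."""
    return any(end == len(text) for end in expand([start], text, 0))


print(parse("cad"), parse("cabd"), parse("cd"))
```

This sketch would recurse forever on a left-recursive grammar, which is the first of the problems discussed next.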
[38/28]

Several problems with top-down parsing method

left recursion
    A nonterminal A is left recursive if A ⇒+ Aα for some α.
    A grammar G is left recursive if it has a left-recursive nonterminal.
    ⇒ A left-recursive grammar can cause a top-down parser to go into an infinite loop.
    ∴ eliminate the left recursion.

Backtracking
    the repeated scanning of the input string.
    the speed of parsing is much slower. (very time consuming)
    ⇒ the conditions for no backtracking : defined formally using FIRST and FOLLOW.
Syntax Analysis
[39/28]

Elimination of left recursion


direct left-recursion : A → Aα ∈ P
indirect left-recursion : A ⇒+ Aα

general form : A → Aα | β
    A = Aα + β
      = βα*

introducing new nonterminal A' which generates α*.
==> A → βA'
    A' → αA' | ε
[40/28]
ex) E → E + T | T
    T → T * F | F
    F → (E) | a

    E ⇒* E(+T)* ⇒ T(+T)*
    E' generates (+T)* : E' → +TE' | ε
※ E → TE'
   E' → +TE' | ε

general method :
    A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
==> A → β1 A' | β2 A' | ... | βn A'
    A' → α1 A' | α2 A' | ... | αm A' | ε
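The general method for direct left recursion translates almost line for line into code. In this sketch each alternative is a list of symbols and the empty list stands for ε; the function name and this encoding are assumptions for the example, and indirect left recursion is not handled.

```python
def eliminate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn without left recursion.

    Each alternative is a list of symbols; [] stands for the empty string.
    """
    # The alpha parts: what follows A in each left-recursive alternative.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # The beta parts: alternatives that do not start with A.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}
    new = nonterminal + "'"
    return {
        nonterminal: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [[]],
    }


print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))
print(eliminate_left_recursion("T", [["T", "*", "F"], ["F"]]))
```

For E → E + T | T this returns E → T E' and E' → + T E' | ε, the same result as the worked example.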
[41/28]

Left-factoring

if A → αβ | αγ are two A-productions and the input begins with a non-empty string derived from α, we do not know whether to expand A to αβ or to αγ.
==> left-factoring : the process of factoring out the common prefixes of alternates.
method :
    A → αβ | αγ ==> A → α(β | γ)
                ==> A → αA', A' → β | γ

ex) S → iCtS | iCtSeS | a
    C → b
[42/28]
S → iCtS | iCtSeS | a
  → iCtS(ε | eS) | a
∴ S → iCtSS' | a
   S' → ε | eS
   C → b
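One pass of left-factoring can be sketched as follows: alternatives are grouped by their first symbol, and each group with more than one member has its longest common prefix factored out. The function names and the list encoding (with [] for ε) are assumptions for the illustration, and the sketch does not repeat the process on the newly created nonterminals.

```python
def common_prefix(alternatives):
    """Longest sequence of symbols shared by the start of every alternative."""
    prefix = []
    for symbols in zip(*alternatives):
        if any(s != symbols[0] for s in symbols):
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(nonterminal, alternatives):
    """Factor out one level of common prefixes among the alternatives."""
    groups = {}                                   # first symbol -> alternatives
    for alt in alternatives:
        groups.setdefault(alt[0] if alt else None, []).append(alt)
    result = {nonterminal: []}
    primes = 0
    for first, group in groups.items():
        if first is None or len(group) == 1:
            result[nonterminal].extend(group)     # nothing to factor
            continue
        prefix = common_prefix(group)
        primes += 1
        new = nonterminal + "'" * primes          # S', S'', ...
        result[nonterminal].append(prefix + [new])
        result[new] = [alt[len(prefix):] for alt in group]
    return result


print(left_factor("S", [list("iCtS"), list("iCtSeS"), ["a"]]))
```

On the example grammar this yields S → iCtSS' | a and S' → ε | eS.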

No-backtracking
::= deterministic selection of the production rule to be applied.
[43/28]
::= Reducing a given string to the start symbol of the grammar.
::= It attempts to construct a parse tree for an input string
beginning at the leaves (the bottom) and working up towards
the root(the top).
ex) G: S → aAcBe
       A → Ab | b
       B → d
string : abbcde
S
├─ a
├─ A
│  ├─ A
│  │  └─ b
│  └─ b
├─ c
├─ B
│  └─ d
└─ e
[44/28]
[Def 3.1] reduce : the replacement of the right side of a production with the left side.
    If S ⇒*rm αAω ⇒rm αβω and A → β ∈ P, then β in αβω is reduced to A, giving αAω.
[Def 3.2] handle : If S ⇒*rm αAω ⇒rm αβω, then β is a handle of αβω.
[Def 3.3] handle pruning : S = r0 ⇒rm r1 ⇒rm ... ⇒rm rn-1 ⇒rm rn = ω
    "reduce sequence" : rn → rn-1 → rn-2 → ... → r0 = S
ex) G : S → bAe
A → a;A | a
ω: ba;ae
[45/28]
::= a bottom-up style of parsing.

Two problems for automatic parsing
1. How to find a handle in a right sentential form.
2. What production to choose in case there is more than
one production with the same right hand side.
====> The method depends on the kind of grammar, but a stack is used to hold the handle.
[46/28]

Four actions of a shift-reduce parser
$
Sn
Shift-Reduce Parser
.
.
.
: input
output
Parsing Table
$
stack
“Stack top과 current input symbol에 따라 파싱 테이블을 참조해서 action을 결정.”
1. shift : the next input symbol is shifted to the top of the stack.
2. reduce : the handle is reduced to the left side of production.
3. accept : the parser announces successful completion of parsing.
4. error : the parser discovers that a syntax error has occurred
and calls an error recovery routine.
[47/28]
ex) G: E → E + T | T
       T → T * F | F
       F → (E) | a
    string : a + a * a

        STACK        INPUT      ACTION
        ---------    -------    ----------------
(1)     $            a+a*a$     shift a
(2)     $a           +a*a$      reduce F → a
(3)     $F           +a*a$      reduce T → F
(4)     $T           +a*a$      reduce E → T
(5)     $E           +a*a$      shift +
(6)     $E+          a*a$       shift a
(7)     $E+a         *a$        reduce F → a
(8)     $E+F         *a$        reduce T → F
(9)     $E+T         *a$        shift *
(10)    $E+T*        a$         shift a
(11)    $E+T*a       $          reduce F → a
(12)    $E+T*F       $          reduce T → T*F
(13)    $E+T         $          reduce E → E+T
(14)    $E           $          accept
[48/28]
<< Thinking points >>
1. the handle will always eventually appear on top of the stack, never inside.
    ∵ rightmost derivation in reverse.
    The contents of the stack together with the string remaining in the input form a right sentential form. Therefore it is always the top portion of the stack that is reduced.
2. How to make a parsing table for a given grammar.
    → The way the parsing table is built depends on the kind of grammar.
    SLR(Simple LR)
    LALR(LookAhead LR)
    CLR(Canonical LR)
[49/28]

Constructing a Parse tree
1. shift : create a terminal node labeled the shifted symbol.
2. reduce : A → X1X2...Xn.
(1) A new node labeled A is created.
(2) The X1X2...Xn are made direct descendants of the new node.
(3) If A → ε, then the parser merely creates a node labeled A
with no descendants.
ex) G : 1. LIST → LIST , ELEMENT
2. LIST → ELEMENT
3. ELEMENT → a
string : a , a
[50/28]
Step    STACK               INPUT    ACTION      PARSETREE
(1)     $                   a,a$     shift a     Build Node
(2)     $a                  ,a$      reduce 3    Build Tree
(3)     $ELEMENT            ,a$      reduce 2    Build Tree
(4)     $LIST               ,a$      shift ,     Build Node
(5)     $LIST ,             a$       shift a     Build Node
(6)     $LIST , a           $        reduce 3    Build Tree
(7)     $LIST , ELEMENT     $        reduce 1    Build Tree
(8)     $LIST               $        accept      return that tree

Resulting parse tree :
LIST
├─ LIST
│  └─ ELEMENT
│     └─ a
├─ ,
└─ ELEMENT
   └─ a
[51/28]



an efficient Bottom-up parser for a large and useful class of
context-free grammars.
the “L” stands for left-to-right scan of the input;
the “R” for constructing a Rightmost derivation in reverse.
The attractive reasons of LR parsers
(1) LR parsers can be constructed for most programming languages.
(2) LR parsing method is more general than LL parsing method.
(3) LR parsers can detect syntactic errors as soon as possible.
But, it is too much work to implement an LR parser by hand for a typical programming-language grammar.
=====> Parser Generator
[52/60]
Grammar <BNF Notations> → PGS → Parsing Table
Input → Driver Routine (with Parsing Table) → Output
The driver routine is the same for all LR parsers; only the parsing table changes from one parser to another.

The techniques for producing LR parsing tables
    Simple LR(SLR) - LR(0) items, FOLLOW
    Canonical LR(CLR) - LR(1) items
    Lookahead LR(LALR) - ① LR(1) items
                         ② LR(0), Lookahead
    Grammar classes : SLR ⊂ LALR ⊂ CLR

LR parser
    input : a1 … ai … an $
    stack : S0 … Sm (Sm on top)
    The Driver Routine reads the input, consults the Parsing Table, and updates the stack.

Stack : S0X1S1X2 ••• XmSm, where Si : state and Xi ∈ V.

Configuration of an LR parser :
    (S0X1S1 ••• XmSm, aiai+1 ••• an$)
     stack contents   unscanned input
LR Parsing Table (ACTION table + GOTO table)
    rows : states
    columns : symbols, <Terminals> for the ACTION table and <Nonterminals> for the GOTO table

The LR parsing algorithm
::= same as the shift-reduce parsing algorithm.

Four Actions : shift, reduce, accept, error
1. ACTION[Sm,ai] = shift S
    ::= (S0X1S1 ••• XmSm, aiai+1 ••• an$)
      ⇒ (S0X1S1 ••• XmSmaiS, ai+1 ••• an$)
2. ACTION[Sm,ai] = reduce A → α and |α| = r
    ::= (S0X1S1 ••• XmSm, aiai+1 ••• an$)
      ⇒ (S0X1S1 ••• Xm-rSm-r, aiai+1 ••• an$), GOTO(Sm-r , A) = S
      ⇒ (S0X1S1 ••• Xm-rSm-rAS, aiai+1 ••• an$)
3. ACTION[Sm,ai] = accept, parsing is completed.
4. ACTION[Sm,ai] = error, the parser has discovered an error and calls an error recovery routine.
G:  1. LIST → LIST , ELEMENT
    2. LIST → ELEMENT
    3. ELEMENT → a

Parsing Table : (use this parsing table to show the parsing process for a,a)

    states |  a     ,     $    |  LIST   ELEMENT
    -------+-------------------+-----------------
      0    |  s3               |   1       2
      1    |        s4    acc  |
      2    |        r2    r2   |
      3    |        r3    r3   |
      4    |  s3               |           5
      5    |        r1    r1   |
where,
    sj means shift and stack state j,
    ri means reduce by production numbered i,
    acc means accept, and blank means error.
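A driver routine for this table can be sketched in a few lines. The ACTION and GOTO entries below are transcribed from the parsing table for G; the dictionary encoding, the function name `lr_parse`, and the choice to keep only states on the stack (grammar symbols are implied by the states) are assumptions of this sketch.

```python
# Production number -> (left side, length of right side).
PRODUCTIONS = {1: ("LIST", 3), 2: ("LIST", 1), 3: ("ELEMENT", 1)}

# ACTION table; a missing entry means error.
ACTION = {
    (0, "a"): ("shift", 3),
    (1, ","): ("shift", 4), (1, "$"): ("accept", None),
    (2, ","): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, ","): ("reduce", 3), (3, "$"): ("reduce", 3),
    (4, "a"): ("shift", 3),
    (5, ","): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "LIST"): 1, (0, "ELEMENT"): 2, (4, "ELEMENT"): 5}


def lr_parse(tokens):
    """Return the right parse (production numbers) or raise SyntaxError."""
    stack = [0]                          # state stack, initial state 0
    tokens = list(tokens) + ["$"]
    pos, output = 0, []
    while True:
        kind, arg = ACTION.get((stack[-1], tokens[pos]), ("error", None))
        if kind == "shift":
            stack.append(arg)            # push the new state, consume input
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[len(stack) - length:]        # pop |right side| states
            stack.append(GOTO[(stack[-1], lhs)])   # then follow GOTO
            output.append(arg)
        elif kind == "accept":
            return output
        else:
            raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")


print(lr_parse(["a", ",", "a"]))
```

For the input a,a the reductions come out in the order 3, 2, 3, 1, the same sequence as in the earlier shift-reduce trace for this grammar. Only the tables would change for a different grammar; the loop itself stays the same.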

Parser Generating System
Grammar
PGS
Parsing table
Token
stream
Driver
Routine
Result of
parsing