Scanning and Parsing in Squeak

Defining Lexical and Syntactic Analysis
  Scanning/Tokenizing
  Parsing
Easy ways to do it
  State Transition Tables
  Recursive Descent Parsing
More sophisticated way
  T-Gen: Lex and YACC for Squeak
Examples from Squeak
  Smalltalk parser
  HTML parser
7/17/2015
Copyright 2000, Georgia Tech
Challenge of Compiling
How do you go from source code to object
code?
Lexical analysis: Figure out the pieces (tokens)
of the program: Constants, variables, keywords.
Syntactic analysis: Figure out (and check) the
structure (declaration, statements, etc.)—also
called parsing
Interpret meaning (semantics) of recognized
structures, based on declarations
Backend: Generate object code
Lexical Analysis
Given a bunch of characters, how do you
recognize the key things and their types?
Simplest way: Parse by “white space”
'This is
a test
with returns in it.' findTokens: (Character cr asString), (Character space asString).
OrderedCollection ('This' 'is' 'a' 'test' 'with' 'returns' 'in' 'it.')
Scanning: Doing It Right
Read in characters one-at-a-time
Recognize when an important token has
arrived
Return the type and the value of the token
A Theoretical Tool for Scanning: FSAs
Finite State Automata (FSA)
One model of computation that can scan well
We can make them fast and efficient
An FSA consists of
A collection of states
Arcs between states, labeled with input symbols
Example FSA
State 1 is the start state (marked by an incoming arrow)
It is an "incomplete state": can't end there
State 2 is a terminal or end state: can stop there, recognizing a token
Consume A's in 1, end with a B in 2
Valid: AB, AAB, AAAB
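The two-state machine described above can be tried out in a Squeak workspace. The following is only a sketch: it assumes the lost diagram has an arc labeled A looping on state 1 and an arc labeled B from state 1 to state 2 (which would also accept a lone B), and the names arcs and accepts are made up for illustration.

```smalltalk
"Sketch of the example FSA. Assumption: state 1 loops on $A and moves to state 2 on $B."
| arcs accepts |
arcs := Dictionary new.
arcs at: 1 put: (Dictionary new at: $A put: 1; at: $B put: 2; yourself).
arcs at: 2 put: Dictionary new.    "no arcs leave the end state"

"Follow one arc per character; state becomes nil when no arc matches."
accepts := [:aString | | state |
    state := 1.
    aString do: [:ch |
        state isNil ifFalse: [state := (arcs at: state) at: ch ifAbsent: [nil]]].
    state = 2].

accepts value: 'AAB'.    "true"
accepts value: 'AA'.     "false: input ended in state 1"
accepts value: 'ABA'.    "false: no arc leaves state 2"
```

Keeping the arcs in a Dictionary per state means the recognizer itself never changes; only the table does.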
General FSA Processing
Enter the Start state
Read input
Go to the state for that input
If an End state, can stop
But may not want to, since we must find the
longest possible token (consider scanning an
identifier)
Implementing FSAs
Easiest way: State Transition Tables
Read a character
Append character to VALUE
Using a table indexed by states and
characters, find a new state to move to given
current STATE and input CHARACTER
If end state and no more transitions possible,
return VALUE and STATE
(Sometimes need to do a lookahead. Could I
grab the next character and be in another end
state?)
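The steps above can be sketched as a workspace snippet. This is an illustration under stated assumptions, not code from the course: the state names #start, #name and #number, the sample input, and the restriction to lowercase letters and digits are all invented here. The lookahead is done with peek, so a token keeps growing for as long as the table has a transition for the next character.

```smalltalk
"Sketch of a table-driven scanner. All state names and the input are hypothetical."
| table stream state value tokens |
table := Dictionary new.
#(#start #name #number) do: [:each | table at: each put: Dictionary new].
'abcdefghijklmnopqrstuvwxyz' do: [:ch |
    (table at: #start) at: ch put: #name.
    (table at: #name) at: ch put: #name].
'0123456789' do: [:ch |
    (table at: #start) at: ch put: #number.
    (table at: #number) at: ch put: #number.
    (table at: #name) at: ch put: #name].

stream := ReadStream on: 'x1 42 abc'.
tokens := OrderedCollection new.
[stream atEnd] whileFalse: [
    state := #start.
    value := WriteStream on: String new.
    "Look ahead: consume the next character only if the table has a transition for it."
    [stream atEnd not
        and: [((table at: state) at: stream peek ifAbsent: [nil]) notNil]]
        whileTrue: [
            state := (table at: state) at: stream peek.
            value nextPut: stream next].
    state = #start
        ifTrue: [stream next "skip a character with no transition, such as a space"]
        ifFalse: [tokens add: state -> value contents]].
tokens
```

Printing the last line should show three associations pairing each end STATE with its VALUE: a name 'x1', a number '42', and a name 'abc'.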
Syntactic Analysis
Given the tokens, can we recognize the
language?
Parsing
Structure for describing relationship between
tokens is called a grammar
A grammar describes how tokens can be assembled
into an acceptable sentence in a language
We're going to study a kind called context-free
grammars
Context-free grammars
Made up of a set of rules
Each rule consists of a left-hand side nonterminal which maps to a right-hand side
expression
Expressions are made up of other nonterminals and terminals
Rules can be used as replacements
Either side can be replaced with the other
Example grammar
Expression := Factor + Expression
Expression := Factor
Factor := Term * Factor
Factor := Term
Term := Number
Term := Identifier (variable)
Derivation tree using
grammar for 3*4+5
Implementing Parsing
Simplest way: Recursive descent parsing
Each non-terminal maps to a
method/function/procedure in language
The m/f/p is responsible for recognizing the
related non-terminal
Including calling another m/f/p as needed
Use your scanner to supply tokens
A Simple Equation Recursive Descent Parser
Expression := Factor + Expression
Expression := Factor
expression
    Transcript show: 'Expression'; cr.
    self factor.
    (scanner peek = '+')
        ifTrue: [Transcript show: '+'; cr.
            scanner advance.
            self expression]
Factor and Term: Simple RD Parsing
Factor := Term * Factor
Factor := Term
Term := Number
factor
    Transcript show: 'Factor'; cr.
    self term.
    (scanner peek = '*')
        ifTrue: [Transcript show: '*'; cr.
            scanner advance.
            self factor]
term
    Transcript show: 'Term'; cr.
    scanner nextIsNumber
        ifTrue: [Transcript show: 'Number: ', scanner nextToken; cr]
        ifFalse: [Transcript show: 'Error -- Number expected'; cr]
Simulating a Scanner
tokens: aCollection
    "Set the tokens to the passed-in collection."
    tokens := aCollection
peek
    "Answer the first token, or nil if the collection is empty."
    ^tokens isEmpty
        ifTrue: [nil]
        ifFalse: [tokens first]
advance
    "Reset the tokens to all but the first."
    tokens := tokens allButFirst
Simulating a Scanner
nextIsNumber
    "Answer true if the next token consists only of digits; false if there is no next token."
    | token |
    token := self peek.
    ^token notNil and: [(token select: [:character |
        character asciiValue < $0 asciiValue or:
            [character asciiValue > $9 asciiValue]]) isEmpty]
nextToken
    "Answer the next token and advance the scanner if not nil."
    | token |
    token := self peek.
    token isNil ifFalse: [self advance].
    ^token
Trying out the toy parser
eqn := EquationParser new. "create a parser object"
eqnscan := EquationScanner new. "create a scanner object"
eqn scanner: eqnscan. "set the parser's scanner"
eqnscan tokens: ('3 + 4 * 5' findTokens: Character space asString). "set the tokens"
eqn expression "parse an expression"
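The workspace code above presumes that EquationParser and EquationScanner already exist and that the parser understands scanner:. The slides do not show those pieces, so the following is a guess at the minimum needed, written in the same class-definition form used elsewhere in this deck. The category name 'Equation-Parser' and the choice of Object as superclass are assumptions.

```smalltalk
"Assumed class definitions; only the class names and the selector scanner: come from the slides."
Object subclass: #EquationScanner
    instanceVariableNames: 'tokens'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Equation-Parser'

Object subclass: #EquationParser
    instanceVariableNames: 'scanner'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Equation-Parser'

"Method to add to EquationParser"
scanner: aScanner
    "Set the scanner that supplies tokens to this parser."
    scanner := aScanner
```

With these in place, the expression, factor and term methods go in EquationParser, and tokens:, peek, advance, nextIsNumber and nextToken go in EquationScanner.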
Comparing to the earlier derivation tree
Transcript (for 3 * 4 + 5):
Expression
Factor
Term
Number: 3
*
Factor
Term
Number: 4
+
Expression
Factor
Term
Number: 5
Derivation tree for 3 + 4 * 5
Transcript:
Expression
Factor
Term
Number: 3
+
Expression
Factor
Term
Number: 4
*
Factor
Term
Number: 5
Derivation tree:
Expression
  Factor
    Term
      Number 3
  +
  Expression
    Factor
      Term
        Number 4
      *
      Factor
        Term
          Number 5
T-Gen: A Translator Generator
for Squeak
Using T-Gen
File in the changeSet
In Morphic, TGenUI open
  Enter your tokens as regular expressions in upper-left
  Enter your grammar in lower-left
  Put in sample code in lower-right
  Transcript for parsing is upper-right
  Processing of each occurs as soon as you accept (Alt/Cmd-S)
  From the transcript pane, you can inspect result
  Buttons let you specify kind of parser and kind of result
You can install the resultant scanner and parser into your system
Walking the Graph Language Example
(From the user's manual in the Zip: kind of obtuse, so we'll walk it slowly here.)
Entering the scanner:
<name> : [A-Za-z][A-Za-z0-9]* ;
<whitespace> : [\s\t\r]+ {ignoreDelimeter} ;
ignoreDelimeter tells the system to drop
these tokens
Tab before "{" is absolutely critical!
Creating our First T-Gen Grammar
Comments use Smalltalk syntax (double quotes)
CommandList : CreateNode CommandList
    | CreateEdge CommandList
    | "nothing" ;
CreateNode : 'node' <name> ;
CreateEdge : 'edge' <name> <name> ;
Toy input
node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta
Generating Derivation Tree
Only option with a limited grammar
CommandList
. CreateNode
. . 'node'
. . '<name>'
. CommandList
. . CreateNode
. . . 'node'
. . . '<name>'
. . CommandList
. . . CreateNode
. . . . 'node'
. . . . '<name>'
. . . CommandList
. . . . CreateEdge
. . . . . 'edge'
. . . . . '<name>'
. . . . . '<name>'
. . . . CommandList
. . . . . CreateEdge
. . . . . . 'edge'
. . . . . . '<name>'
. . . . . . '<name>'
. . . . . CommandList
. . . . . . '<epsilon>'
But that's not too helpful
That tells us the components, but not the
values
To get further, we want an Abstract Syntax
Tree, or even to call our own methods on
each grammar rule
But we'll need to restructure the grammar
so that each rule produces something
useful.
Re-Writing the Grammar to generate an AST
CommandList : CreateNode CommandList {liftRightChild}
    | CreateEdge CommandList {liftRightChild}
    | "nothing" {OrderedChildren} ;
CreateNode : 'node' Name {VertexNode} ;
CreateEdge : 'edge' Name Name {EdgeNode} ;
Name : <name> {NameNode} ;
What's Going On Here?
liftRightChild means "don't do anything
here yet—just recognize the form"
OrderedChildren is a pre-defined class
that simply gathers everything (here,
"nothing") into an OrderedCollection
The rest are all classes that WE have to
build
Providing the Appropriate Classes for AST
Our classes must inherit from ParseTreeNode
Our classes are going to represent the nodes of the parse tree
We must override methods to record the terminals passed in
setAttribute: aString to capture single terminals in a grammar rule
addChildrenInitial: anOrderedCollection to capture multiple terminals
NameNode
ParseTreeNode subclass: #NameNode
instanceVariableNames: 'name '
classVariableNames: ''
poolDictionaries: ''
category: 'edge-parser'
setAttribute: aString
name := aString.
Transcript show: 'NameNode: ',aString; cr.
VertexNode
ParseTreeNode subclass: #VertexNode
instanceVariableNames: 'node '
classVariableNames: ''
poolDictionaries: ''
category: 'edge-parser'
addChildrenInitial: anOrderedCollection
anOrderedCollection size = 1
ifTrue: [node := anOrderedCollection first.
Transcript show:
'VertexNode: ',node printString; cr]
ifFalse: [self error:
'VertexNode: Wrong number of children']
EdgeNode
ParseTreeNode subclass: #EdgeNode
instanceVariableNames: 'fromNode toNode '
classVariableNames: ''
poolDictionaries: ''
category: 'edge-parser'
addChildrenInitial: anOrderedCollection
anOrderedCollection size = 2
ifTrue:
[fromNode := anOrderedCollection removeFirst.
toNode := anOrderedCollection first.
Transcript show:
'EdgeNode: ',fromNode printString,'-->',toNode printString; cr.]
ifFalse: [self error: 'EdgeNode: Wrong number of children']
Now we can generate AST
AST: Abstract Syntax Tree
Is useful to have a single data structure
with all the components usefully identified
But ASTs Are All-At-Once
Alternatively, can capture elements as they are parsed
Transcript from AST generation:
NameNode: Beta
NameNode: Alpha
EdgeNode: a NameNode-->a NameNode
NameNode: C
NameNode: Beta
EdgeNode: a NameNode-->a NameNode
NameNode: C
VertexNode: a NameNode
NameNode: Beta
VertexNode: a NameNode
NameNode: Alpha
VertexNode: a NameNode
Input:
node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta
T-Gen will call methods for you
CommandList : CommandList CreateNode {toGraph:addVertex:}
    | CommandList CreateEdge {toGraph:addEdge:}
    | "nothing" {createGraph} ;
CreateNode : 'node' Name {createVertexLabeled:} ;
CreateEdge : 'edge' Name Name {edgeFrom:to:} ;
Name : <name> {answerArgument:} ;
Note Careful Construction of
Grammar and Methods
Notice in the AST Construction
Things are assembled in reverse!
Grammar is recognized, and then built
bottom-up
Thus, you build things at the terminals,
and assemble them into structures further
up the tree
Thus, createGraph at the bottom
toGraph:addXXXX: in the mid-level rules
Once parser is built, install it
You provide a name, e.g., Edge, and
EdgeParser and EdgeScanner are generated
Generating an AST:
(EdgeParser new parseForAST: '
node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta
' ifFail: [Smalltalk beep.])
Smalltalk Parser
Smalltalk's Parser is Recursive Descent!
Scanner methods are in Parser
Scanning method category (advance endOfLastToken match: matchToken: startOfNextToken)
All the kinds of messages are defined in the method category Expression Types
argumentName assignment: blockExpression braceExpression cascade expression messagePart:repeat: method:context: pattern:inContext: primaryExpression statements:innerBlock: temporaries temporaryBlockVariables variable
Example: Parsing an Assignment
assignment: varNode
    " var ':=' expression => AssignmentNode."
    | loc |
    (loc := varNode assignmentCheck: encoder at: prevMark + requestorOffset) >= 0
        ifTrue: [^self notify: 'Cannot store into' at: loc].
    varNode nowHasDef.
    self advance.
    self expression ifFalse: [^self expected: 'Expression'].
    parseNode := AssignmentNode new
        variable: varNode
        value: parseNode
        from: encoder.
    ^true
AssignmentNode then
generates the code
emitForValue:on: generates the bytecodes
for the assignment
emitForValue: stack on: aStream
value emitForValue: stack on: aStream.
variable emitStore: stack on: aStream
HtmlParser
Used for Scamper
HtmlParser parse: '
<html>
<head>
<title>Fred the Page</title>
</head>
<body>
<h1>Fred the Body</h1>
This is a body for Fred.
</body>
</html>'
HtmlParser returns an
HtmlDocument
HtmlDocument has contents, which is an
OrderedCollection
HtmlHead
HtmlBody
HtmlEntity
Hierarchy exists
Walk the Object Structure
doc := HtmlParser parse: '
<html>
<head>
<title>Fred the Page</title>
</head>
<body>
<h1>Fred the Body</h1>
This is a body for Fred.
</body>
</html>'.
body := doc contents last. "This should be an HtmlBody"
body contents detect: [:entity | entity isKindOf: HtmlHeader]. "This should be the first heading."
PrintIt:
<'h1'>
[Fred the Body]
Summary
Two key activities: Lexical analysis (tokenizing) and syntactic analysis (parsing)
  Tokenizing techniques: FSAs and State Transition Tables
  Parsing techniques
    Recursive Descent Parsing
    Table-driven Parsing (e.g., YACC, T-Gen)
Example parsers
  Simple equation parser
  T-Gen Graph language parser
  Smalltalk's parser
  HtmlParser in Squeak for Scamper