Scanning and Parsing in Squeak
Defining Lexical and Syntactic Analysis
Scanning/Tokenizing
Parsing
Easy ways to do it
State Transition Tables
Recursive Descent Parsing
More sophisticated way
T-Gen: Lex and YACC for Squeak
Examples from Squeak
Smalltalk parser
HTML parser
7/17/2015
Copyright 2000, Georgia Tech
Challenge of Compiling
How do you go from source code to object
code?
Lexical analysis: Figure out the pieces (tokens)
of the program: Constants, variables, keywords.
Syntactic analysis: Figure out (and check) the
structure (declaration, statements, etc.)—also
called parsing
Interpret meaning (semantics) of recognized
structures, based on declarations
Backend: Generate object code
Lexical Analysis
Given a bunch of characters, how do you
recognize the key things and their types?
Simplest way: Split on "white space"
'This is
a test
with returns in it.' findTokens: (Character cr asString),(Character space asString).
Result:
OrderedCollection ('This' 'is' 'a' 'test' 'with' 'returns' 'in' 'it.' )
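Why splitting on white space is only the simplest way: a small workspace sketch, using the same findTokens: message as above. The input strings are made up for illustration.

```smalltalk
"With spaces between every piece, white space is enough."
'3 + 4 * 5' findTokens: ' '.    "five tokens: '3' '+' '4' '*' '5'"

"Without spaces, the operators are never separated from the numbers."
'3+4*5' findTokens: ' '.        "one token: '3+4*5'"
```

The second case is what a real scanner has to handle by looking at the characters themselves.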
Scanning: Doing It Right
Read in characters one-at-a-time
Recognize when an important token has
arrived
Return the type and the value of the token
A Theoretical Tool for Scanning: FSAs
Finite State Automata (FSA)
One model of computation that can scan well
We can make them fast and efficient
FSA's are
A collection of states
Arcs between states
Labeled with input symbols
Example FSA
State 1 is start state
Incoming arrow
"Incomplete state": can't end there
State 2 is terminal or end state: can stop there, recognizing a token
Consume A's in 1, end with a B in 2
Valid: AB, AAB, AAAB
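The example FSA can be sketched in a workspace as a table of arcs. This is only one reading of the (missing) diagram: it assumes an A arc from state 1 back to itself and a B arc from state 1 to state 2, which would also accept a lone B. The names transitions and accepts are made up for the sketch, and the block-local temporary assumes a reasonably recent Squeak image.

```smalltalk
| transitions accepts |
"Arcs: state -> (input character -> next state)."
transitions := Dictionary new.
transitions at: 1 put: (Dictionary new at: $A put: 1; at: $B put: 2; yourself).
transitions at: 2 put: Dictionary new.   "no arcs leave the end state"

"Answer true if the whole string drives the FSA from state 1 to end state 2."
accepts := [:aString | | state |
	state := 1.
	aString do: [:ch |
		state isNil
			ifFalse: [state := (transitions at: state) at: ch ifAbsent: [nil]]].
	state = 2].

accepts value: 'AAB'.   "true"
accepts value: 'AA'.    "false: state 1 is not an end state"
accepts value: 'ABA'.   "false: no arc for A leaves state 2"
```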
General FSA Processing
Enter the Start state
Read input
Go to the state for that input
If an End state, can stop
But may not want to, since we must find the
longest possible token (consider scanning an
identifier)
Implementing FSAs
Easiest way: State Transition Tables
Read a character
Append character to VALUE
Using a table indexed by states and
characters, find a new state to move to given
current STATE and input CHARACTER
If end state and no more transitions possible,
return VALUE and STATE
(Sometimes need to do a lookahead. Could I
grab the next character and be in another end
state?)
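A minimal sketch of the steps above, assuming a toy token set of identifiers and numbers. The state names (#start, #ident, #number), the character classes, and the blocks classify and nextToken are invented for illustration; every state other than #start is treated as an end state. The lookahead is done with peek, so a character is consumed only when the table has a transition for it, which gives the longest possible token.

```smalltalk
| classify table nextToken stream token1 token2 |
"Map a character to the class used to index the table."
classify := [:ch |
	ch isLetter
		ifTrue: [#letter]
		ifFalse: [ch isDigit ifTrue: [#digit] ifFalse: [#other]]].

"State transition table: state -> (character class -> next state)."
table := Dictionary new.
table at: #start put: (Dictionary new at: #letter put: #ident; at: #digit put: #number; yourself).
table at: #ident put: (Dictionary new at: #letter put: #ident; at: #digit put: #ident; yourself).
table at: #number put: (Dictionary new at: #digit put: #number; yourself).

"Answer an Array with STATE and VALUE, or nil if nothing could be scanned."
nextToken := [:aStream | | state text next |
	state := #start.
	text := WriteStream on: String new.
	[aStream atEnd not and: [
		next := (table at: state) at: (classify value: aStream peek) ifAbsent: [nil].
		next notNil]]
		whileTrue: [
			text nextPut: aStream next.
			state := next].
	state = #start
		ifTrue: [nil]
		ifFalse: [Array with: state with: text contents]].

stream := ReadStream on: 'count42+7'.
token1 := nextToken value: stream.   "#ident with 'count42'"
stream next.                         "skip the $+ by hand in this sketch"
token2 := nextToken value: stream.   "#number with '7'"
```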
Syntactic Analysis
Given the tokens, can we recognize the
language?
Parsing
Structure for describing relationship between
tokens is called a grammar
A grammar describes how tokens can be assembled
into an acceptable sentence in a language
We're going to study a kind called context-free
grammars
Context-free grammars
Made up of a set of rules
Each rule consists of a left-hand side nonterminal which maps to a right-hand side
expression
Expressions are made up of other nonterminals and terminals
Rules can be used as replacements
Either side can be replaced with the other
Example grammar
Expression := Factor + Expression
Expression := Factor
Factor := Term * Factor
Factor := Term
Term := Number
Term := Identifier (variable)
Derivation tree using grammar for 3*4+5
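The tree can also be written out as a leftmost derivation: a sketch that applies the rules of the example grammar one at a time, always replacing the leftmost nonterminal.

```text
Expression
=> Factor + Expression        (Expression := Factor + Expression)
=> Term * Factor + Expression (Factor := Term * Factor)
=> 3 * Factor + Expression    (Term := Number)
=> 3 * Term + Expression      (Factor := Term)
=> 3 * 4 + Expression         (Term := Number)
=> 3 * 4 + Factor             (Expression := Factor)
=> 3 * 4 + Term               (Factor := Term)
=> 3 * 4 + 5                  (Term := Number)
```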
Implementing Parsing
Simplest way: Recursive descent parsing
Each non-terminal maps to a
method/function/procedure in language
The m/f/p is responsible for recognizing the
related non-terminal
Including calling another m/f/p as needed
Use your scanner to supply tokens
A Simple Equation Recursive Descent Parser
Expression := Factor + Expression
Expression := Factor
expression
    Transcript show: 'Expression'; cr.
    self factor.
    (scanner peek = '+')
        ifTrue: [Transcript show: '+'; cr.
            scanner advance.
            self expression].
Factor and Term: Simple RD Parsing
Factor := Term * Factor
Factor := Term
Term := Number
factor
    Transcript show: 'Factor'; cr.
    self term.
    (scanner peek = '*')
        ifTrue: [Transcript show: '*'; cr.
            scanner advance.
            self factor].
term
    Transcript show: 'Term'; cr.
    (scanner nextIsNumber)
        ifTrue: [Transcript show: 'Number: ', (scanner nextToken); cr]
        ifFalse: [Transcript show: 'Error -- Number expected'; cr]
Simulating a Scanner
tokens: aCollection
    "set the tokens to the passed in collection"
    tokens := aCollection
peek
    "if the collection is empty return nil, else return the first token"
    ^tokens isEmpty
        ifTrue: [nil]
        ifFalse: [tokens first]
advance
    "reset the tokens to all but the first"
    tokens := tokens allButFirst.
Simulating a Scanner
nextIsNumber
    "return true if the next item in the tokens is a number;
     return false when no tokens are left"
    | token |
    token := self peek.
    token isNil ifTrue: [^false].
    ^(token select: [:character |
        character asciiValue < $0 asciiValue or:
            [character asciiValue > $9 asciiValue]]) isEmpty
nextToken
    "return the next token and advance the scanner if not nil"
    | token |
    token := self peek.
    token isNil ifFalse: [self advance].
    ^token.
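The example grammar also has the rule Term := Identifier, which the toy parser does not handle. One possible extension, as a sketch only: nextIsIdentifier is a hypothetical scanner method (it is not part of the slides), and the revised term assumes the peek and nextToken methods shown above.

```smalltalk
"EquationScanner (hypothetical addition)"
nextIsIdentifier
	"Answer true if the next token starts with a letter and
	 contains only letters and digits."
	| token |
	token := self peek.
	token isNil ifTrue: [^false].
	^token first isLetter
		and: [(token select: [:character | character isAlphaNumeric not]) isEmpty]

"EquationParser (revised term, covering both Term rules)"
term
	Transcript show: 'Term'; cr.
	scanner nextIsNumber
		ifTrue: [^Transcript show: 'Number: ', scanner nextToken; cr].
	scanner nextIsIdentifier
		ifTrue: [^Transcript show: 'Identifier: ', scanner nextToken; cr].
	Transcript show: 'Error -- Number or Identifier expected'; cr
```

The two methods are written as they would appear in a browser, one per method pane, following the style of the slides.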
Trying out the toy parser
eqn := EquationParser new. "create a parser object"
eqnscan := EquationScanner new. "create a scanner obj"
eqn scanner: eqnscan. "set parser's scanner"
eqnscan tokens: ('3 + 4 * 5' findTokens: (Character space asString)). "set the tokens"
eqn expression "parse an expression"
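The slides name EquationParser and EquationScanner but never show the class definitions or the scanner: setter. A minimal sketch of what they might look like, assuming plain Object subclasses with one instance variable each; the category name 'Equation-Parser' is made up. The methods from the earlier slides would then be added to these classes in a browser.

```smalltalk
"Hypothetical class definitions (not shown in the slides)."
Object subclass: #EquationScanner
	instanceVariableNames: 'tokens'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Equation-Parser'.

Object subclass: #EquationParser
	instanceVariableNames: 'scanner'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Equation-Parser'.

"Setter used by 'eqn scanner: eqnscan'. Usually typed into a browser;
 compile: installs it from a workspace instead."
(Smalltalk at: #EquationParser) compile: 'scanner: aScanner
	scanner := aScanner'.
```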
Comparing to the earlier derivation tree
Transcript (for the input 3 * 4 + 5):
Expression
Factor
Term
Number: 3
*
Factor
Term
Number: 4
+
Expression
Factor
Term
Number: 5
Derivation tree for 3 + 4 * 5
Transcript:
Expression
Factor
Term
Number: 3
+
Expression
Factor
Term
Number: 4
*
Factor
Term
Number: 5
Derivation tree:
Expression
  Factor
    Term
      Number 3
  +
  Expression
    Factor
      Term
        Number 4
      *
      Factor
        Term
          Number 5
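The same recursive descent structure can compute a value instead of printing a trace. A workspace sketch, assuming an image with full block closures (the blocks call themselves recursively); the block names mirror the grammar rules but are otherwise invented, and there is no error handling.

```smalltalk
| tokens peek advance term factor expression |
tokens := ('3 + 4 * 5' findTokens: ' ') asOrderedCollection.
peek := [tokens isEmpty ifTrue: [nil] ifFalse: [tokens first]].
advance := [tokens removeFirst].

"Term := Number"
term := [advance value asNumber].

"Factor := Term * Factor | Term"
factor := [| left |
	left := term value.
	peek value = '*'
		ifTrue: [advance value. left * factor value]
		ifFalse: [left]].

"Expression := Factor + Expression | Factor"
expression := [| left |
	left := factor value.
	peek value = '+'
		ifTrue: [advance value. left + expression value]
		ifFalse: [left]].

expression value.   "23"
```

Because Factor sits below Expression in the grammar, 4 * 5 is grouped before the addition, just as in the derivation tree.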
T-Gen: A Translator Generator for Squeak
Using T-Gen
File in the changeSet
In Morphic, TGenUI open
Enter your tokens as regular expressions in upper-left
Enter your grammar in lower-left
Put in sample code in lower-right
Transcript for parsing is upper-right
Processing of each occurs as soon as you accept (Alt/Cmd-S)
From the transcript pane, you can inspect result
Buttons let you specify kind of parser and kind of result
You can install the resultant scanner and parser into your system
Walking the Graph Language Example
(From the user's manual in the Zip: it is hard to follow, so we'll walk through it slowly here.)
Entering the scanner:
<name> : [A-Za-z][A-Za-z0-9]* ;
<whitespace> : [\s\t\r]+ {ignoreDelimeter} ;
ignoreDelimeter tells the system to drop
these tokens
Tab before "{" is absolutely critical!
Creating our First T-Gen Grammar
Smalltalk syntax for comments
CommandList : CreateNode CommandList
    | CreateEdge CommandList
    | "nothing" ;
CreateNode : 'node' <name> ;
CreateEdge : 'edge' <name> <name> ;
Toy input
node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta
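For comparison, the same toy input can be recognized by hand, without T-Gen. A workspace sketch only: it splits on white space with findTokens: and follows the three alternatives of CommandList in a loop. The variable names are invented, and it does not check that names match the <name> pattern.

```smalltalk
| input tokens stream nodes edges word |
input := 'node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta'.
tokens := input findTokens:
	(Character cr asString), (Character lf asString), (Character space asString).
stream := ReadStream on: tokens asArray.
nodes := OrderedCollection new.
edges := OrderedCollection new.

"CommandList : CreateNode CommandList | CreateEdge CommandList | nothing"
[stream atEnd] whileFalse: [
	word := stream next.
	word = 'node'
		ifTrue: [nodes add: stream next]                      "CreateNode"
		ifFalse: [
			word = 'edge'
				ifTrue: [edges add: stream next -> stream next]   "CreateEdge"
				ifFalse: [self error: 'Expected node or edge, found ', word]]].

nodes.   "'Alpha' 'Beta' 'C'"
edges.   "'Beta'->'C' and 'Alpha'->'Beta'"
```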
Generating Derivation Tree
Only option with a
limited grammar
CommandList
. CreateNode
. . 'node'
. . '<name>'
. CommandList
. . CreateNode
. . . 'node'
. . . '<name>'
. . CommandList
. . . CreateNode
. . . . 'node'
. . . . '<name>'
. . . CommandList
. . . . CreateEdge
. . . . . 'edge'
. . . . . '<name>'
. . . . . '<name>'
. . . . CommandList
. . . . . CreateEdge
. . . . . . 'edge'
. . . . . . '<name>'
. . . . . . '<name>'
. . . . . CommandList
. . . . . . '<epsilon>'
But that's not too helpful
That tells us the components, but not the
values
To get further, we want an Abstract Syntax
Tree, or even to call our own methods on
each grammar rule
But we'll need to restructure the grammar
so that each rule produces something
useful.
Re-Writing the Grammar to generate an AST
CommandList : CreateNode CommandList {liftRightChild}
    | CreateEdge CommandList {liftRightChild}
    | "nothing" {OrderedChildren} ;
CreateNode : 'node' Name {VertexNode} ;
CreateEdge : 'edge' Name Name {EdgeNode} ;
Name : <name> {NameNode} ;
What's Going On Here?
liftRightChild means "don't do anything
here yet—just recognize the form"
OrderedChildren is a pre-defined class
that simply gathers everything (here,
"nothing") into an OrderedCollection
The rest are all classes that WE have to
build
Providing the Appropriate
Classes for AST
Our classes must inherit from
ParseTreeNode
Our classes are going to represent the nodes
of the parse tree
We must override methods to record the
terminals passed in
setAttribute: aString to capture single terminals in a grammar rule
addChildrenInitial: anOrderedCollection to capture multiple terminals
NameNode
ParseTreeNode subclass: #NameNode
instanceVariableNames: 'name '
classVariableNames: ''
poolDictionaries: ''
category: 'edge-parser'
setAttribute: aString
name := aString.
Transcript show: 'NameNode: ',aString; cr.
VertexNode
ParseTreeNode subclass: #VertexNode
instanceVariableNames: 'node '
classVariableNames: ''
poolDictionaries: ''
category: 'edge-parser'
addChildrenInitial: anOrderedCollection
anOrderedCollection size = 1
ifTrue: [node := anOrderedCollection first.
Transcript show:
'VertexNode: ',node printString; cr]
ifFalse: [self error:
'VertexNode: Wrong number of children']
EdgeNode
ParseTreeNode subclass: #EdgeNode
instanceVariableNames: 'fromNode toNode '
classVariableNames: ''
poolDictionaries: ''
category: 'edge-parser'
addChildrenInitial: anOrderedCollection
anOrderedCollection size = 2
ifTrue:
[fromNode := anOrderedCollection removeFirst.
toNode := anOrderedCollection first.
Transcript show:
'EdgeNode: ',fromNode printString,'-->',toNode printString; cr.]
ifFalse: [self error: 'EdgeNode: Wrong number of children']
Now we can generate AST
AST: Abstract Syntax Tree
It is useful to have a single data structure with all the components usefully identified
But ASTs Are All-At-Once
Alternatively, can capture elements as they are parsed
Input:
node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta
Transcript from AST generation:
NameNode: Beta
NameNode: Alpha
EdgeNode: a NameNode-->a NameNode
NameNode: C
NameNode: Beta
EdgeNode: a NameNode-->a NameNode
NameNode: C
VertexNode: a NameNode
NameNode: Beta
VertexNode: a NameNode
NameNode: Alpha
VertexNode: a NameNode
T-Gen will call methods for you
CommandList : CommandList CreateNode {toGraph:addVertex:}
    | CommandList CreateEdge {toGraph:addEdge:}
    | "nothing" {createGraph} ;
CreateNode : 'node' Name {createVertexLabeled:} ;
CreateEdge : 'edge' Name Name {edgeFrom:to:} ;
Name : <name> {answerArgument:} ;
Note Careful Construction of
Grammar and Methods
Notice in the AST Construction
Things are assembled in reverse!
Grammar is recognized, and then built
bottom-up
Thus, you build things at the terminals,
and assemble them into structures further
up the tree
Thus, createGraph at the bottom
toGraph:addXXXX: in the mid-level rules
Once parser is built, install it
You provide a name, e.g., Edge, and
EdgeParser and EdgeScanner are generated
Generating an AST:
(EdgeParser new parseForAST: '
node Alpha
node Beta
node C
edge Beta C
edge Alpha Beta
' ifFail: [Smalltalk beep.])
Smalltalk Parser
Smalltalk's Parser is
Recursive Descent!
Scanner methods are in Parser
Scanning method category (advance endOfLastToken match: matchToken: startOfNextToken)
All the kinds of messages are defined in the method category Expression Types
argumentName assignment: blockExpression braceExpression cascade expression
messagePart:repeat: method:context: pattern:inContext: primaryExpression
statements:innerBlock: temporaries temporaryBlockVariables variable
Example: Parsing an Assignment
assignment: varNode
    " var ':=' expression => AssignmentNode."
    | loc |
    (loc := varNode assignmentCheck: encoder at: prevMark + requestorOffset) >= 0
        ifTrue: [^self notify: 'Cannot store into' at: loc].
    varNode nowHasDef.
    self advance.
    self expression ifFalse: [^self expected: 'Expression'].
    parseNode := AssignmentNode new
        variable: varNode
        value: parseNode
        from: encoder.
    ^true
AssignmentNode then
generates the code
emitForValue:on: generates the bytecodes
for the assignment
emitForValue: stack on: aStream
value emitForValue: stack on: aStream.
variable emitStore: stack on: aStream
HtmlParser
Used for Scamper
HtmlParser parse: '
<html>
<head>
<title>Fred the Page</title>
</head>
<body>
<h1>Fred the Body</h1>
This is a body for Fred.
</body>
</html>'
HtmlParser returns an
HtmlDocument
HtmlDocument has contents, which is an
OrderedCollection
HtmlHead
HtmlBody
HtmlEntity
Hierarchy exists
Walk the Object Structure
doc := HtmlParser parse: '
<html>
<head>
<title>Fred the Page</title>
</head>
<body>
<h1>Fred the Body</h1>
This is a body for Fred.
</body>
</html>'.
body := doc contents last. "This should be an HtmlBody"
body contents detect: [:entity | entity isKindOf: HtmlHeader]. "This should be the first heading."
PrintIt:
<'h1'>
[Fred the Body]
Summary
Two key activities: Lexical analysis (tokenizing) and
syntactic analysis (parsing)
Tokenizing techniques: FSAs and State Transition Tables
Parsing techniques
Recursive Descent Parsing
Table-driven Parsing (e.g., YACC, T-Gen)
Example parsers
Simple equation parser
T-Gen Graph language parser
Smalltalk's parser
HtmlParser in Squeak for Scamper