Programming Languages Chapter 2: Syntax


Chapter 2
Syntax
A language that is simple to parse for the compiler is also
simple to parse for the human programmer.
N. Wirth
2.1 Grammars
2.1.1 Backus-Naur Form
2.1.2 Derivations
2.1.3 Parse Trees
2.1.4 Associativity and Precedence
2.1.5 Ambiguous Grammars
2.2 Extended BNF
2.3 Syntax of a Small Language: Clite
2.3.1 Lexical Syntax
2.3.2 Concrete Syntax
2.4 Compilers and Interpreters
2.5 Linking Syntax and Semantics
2.5.1 Abstract Syntax
2.5.2 Abstract Syntax Trees
2.5.3 Abstract Syntax of Clite
Expr → Expr + Term | Expr – Term | Term
Term → Term * Factor | Term / Factor | Term % Factor | Factor
Factor → Primary ** Factor | Primary
Primary → 0 | ... | 9 | ( Expr )
Red indicates a terminal of G1, blue
indicates meta-symbols (symbols that
aren’t part of the language but are used
to describe the language)
Example: 10 * ( 5 – 3)

Motivation for using a subset of C:

Language   Grammar (pages)   Reference
Pascal     5                 Jensen & Wirth
C          6                 Kernighan & Ritchie
C++        22                Stroustrup
Java       14                Gosling et al.

The Clite grammar fits on one page (next 3
slides), so it’s a far better tool for studying
language design.
Program → int main ( ) { Declarations Statements }
Declarations → { Declaration }
Declaration → Type Identifier [ [ Integer ] ] { , Identifier [ [ Integer ] ] } ;
Type → int | bool | float | char
Statements → { Statement }
Statement → ; | Block | Assignment | IfStatement | WhileStatement
Block → { Statements }
Assignment → Identifier [ [ Expression ] ] = Expression ;
IfStatement → if ( Expression ) Statement [ else Statement ]
WhileStatement → while ( Expression ) Statement
Expression → Conjunction { || Conjunction }
Conjunction → Equality { && Equality }
Equality → Relation [ EquOp Relation ]
EquOp → == | !=
Relation → Addition [ RelOp Addition ]
RelOp → < | <= | > | >=
Addition → Term { AddOp Term }
AddOp → + | -
Term → Factor { MulOp Factor }
MulOp → * | / | %
Factor → [ UnaryOp ] Primary
UnaryOp → - | !
Primary → Identifier [ [ Expression ] ] | Literal | ( Expression ) | Type ( Expression )
Identifier → Letter { Letter | Digit }
Letter → a | b | ... | z | A | B | ... | Z
Digit → 0 | 1 | ... | 9
Literal → Integer | Boolean | Float | Char
Integer → Digit { Digit }
Boolean → true | false
Float → Integer . Integer
Char → ' ASCII Char '
(ASCII Char is the set of ASCII characters)
• 13 grammar rules – compare to 4 pages, for C++
• Metabraces { } (0 or more) are interpreted to mean left associativity; e.g.,
  ◦ Addition → Term { AddOp Term }
  ◦ AddOp → + | -
• Metabrackets [ ] (optional) mean that an Addition can be followed by at most one relational operator plus another Addition:
  ◦ Relation → Addition [ RelOp Addition ]
  ◦ RelOp → < | <= | > | >=
  ◦ (no a > b > c, for example)
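
The way metabraces translate into left associativity can be sketched in a few lines of Java. This is only an illustration, not the Clite parser: the class name AdditionSketch is hypothetical, tokens are assumed to be separated by spaces, and a Term is simplified to a single identifier or number. The loop plays the role of { AddOp Term }: on each pass the tree built so far becomes the left operand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of  Addition -> Term { AddOp Term }.
// Assumption: a Term is a single space-separated token (identifier or number).
final class AdditionSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    AdditionSketch(String source) {
        for (String t : source.trim().split("\\s+")) {
            tokens.add(t);
        }
    }

    // The while loop mirrors the metabraces. Each iteration wraps the tree
    // built so far as the LEFT operand, which yields left associativity.
    String addition() {
        String left = term();
        while (pos < tokens.size() && isAddOp(tokens.get(pos))) {
            String op = tokens.get(pos++);
            String right = term();
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    // Simplified Term: consume exactly one token.
    private String term() {
        return tokens.get(pos++);
    }

    private static boolean isAddOp(String t) {
        return t.equals("+") || t.equals("-");
    }

    static String parse(String source) {
        return new AdditionSketch(source).addition();
    }

    public static void main(String[] args) {
        System.out.println(parse("a - b - c")); // prints ((a - b) - c)
    }
}
```

The parenthesized output shows the grouping: a - b is reduced first, so a - b - c means (a - b) - c rather than a - (b - c).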
• Comments
• The significance of whitespace
• Distinguishing one token <= from two tokens < =
• Distinguishing identifiers from keywords like if

The Clite grammar has two levels
◦ lexical level
(described by lexical syntax)
◦ syntactic level (described by concrete syntax)


They correspond to two separate parts of a
compiler.
The issues on the previous slide are lexical
issues.

Examples of lexical entities (tokens):
◦ Identifiers    e.g., numbr1, X
◦ Literals       e.g., 123, 'x', 3.25, true
◦ Keywords       bool char else false float if int main true while
◦ Operators      = || && == != < <= > >= + - * / ! %
◦ Punctuation    ; , { } ( )
◦ Char           e.g., '?'




Whitespace is any space, tab, end-of-line
character (or characters), or character sequence
inside a comment
No token may contain embedded whitespace
(unless it is a character or string literal)
Example:
>= one token
> = two tokens

while ( a <= b)     legal - spacing between tokens
while(a<=b)         also legal - spacing not needed
while (a < = b)     no lexical errors but illegal syntactically – lexer would identify tokens while, (, a, <, =, b, )
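
A small tokenizer makes the three cases above concrete. This is a minimal sketch, not Clite's actual lexer: the class name LexerSketch and the "Kind:text" token format are assumptions made for illustration, Float and Char literals and comments are left out, and any unrecognized character is simply labeled an operator instead of being reported as an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical lexer sketch; tokens are returned as "Kind:text" strings.
final class LexerSketch {
    // Keyword list taken from the token examples in the slides.
    private static final Set<String> KEYWORDS = Set.of(
        "bool", "char", "else", "false", "float",
        "if", "int", "main", "true", "while");
    private static final Set<String> TWO_CHAR_OPS =
        Set.of("<=", ">=", "==", "!=", "&&", "||");

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace only separates tokens
            } else if (Character.isLetter(c)) {        // Letter { Letter | Digit }
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                String word = src.substring(start, i);
                // A keyword is lexically an identifier that is in the keyword table.
                out.add((KEYWORDS.contains(word) ? "Keyword:" : "Identifier:") + word);
            } else if (Character.isDigit(c)) {         // Digit { Digit }
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                out.add("Integer:" + src.substring(start, i));
            } else if (i + 1 < src.length()
                    && TWO_CHAR_OPS.contains(src.substring(i, i + 2))) {
                out.add("Operator:" + src.substring(i, i + 2)); // adjacent chars: one token
                i += 2;
            } else {
                boolean punct = ";,{}()".indexOf(c) >= 0;
                out.add((punct ? "Punctuation:" : "Operator:") + c);
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("while(a<=b)"));
        System.out.println(tokenize("while (a < = b)"));
    }
}
```

With this sketch, while(a<=b) yields six tokens with a single <= operator, while the third form yields seven tokens, with < and = as separate operators. Rejecting that sequence is left to the parser.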




Clite uses // comment style of C++
Not defined in Clite grammar (but could be)
Instead, it’s defined outside the grammar
The use of whitespace to differentiate
between one and two character operators is
also defined outside the grammar.
• An identifier is a sequence of letters and digits, starting with a letter
• “if” is an identifier which is also a keyword
• Keywords versus reserved words:
  ◦ Keyword: predefined by the language
  ◦ Reserved word: can only be used as defined.
  ◦ In most languages all keywords are also reserved, but in a few (e.g., Pascal) a subset of the keywords are predefined identifiers that are not reserved (and can be redefined by the programmer).
• Implications? Flexibility, confusion …



Concrete syntax of a language is the set of
rules for writing correct programs
The structure of a specific program can be
represented by a parse tree, based on the
concrete syntax of the language, using the
stream of Tokens identified during lexical
analysis
The root of the parse tree is the Start
Symbol of the language (Program, in Clite).
• Clite’s expression rules are non-ambiguous with respect to precedence and associativity
  ◦ Rule ordering defines precedence; rule format defines associativity.
• C/C++ expression grammar definition is ambiguous – precedence and associativity are specified separately.
Clite Operator    Associativity
Unary - !         none
* / %             left
+ -               left
< <= > >=         none (i.e., no a < b <= c)
== !=             none
&&                left
||                left

… are non-associative.
(an idea borrowed from Ada)

Why is this important?
In C & C++, the expression:
if (10 < x < 20)
is not equivalent to
if (10 < x && x < 20)
But it is error-free!
So, what does it mean?
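
In C, < is left associative and a comparison yields the int 1 or 0, so 10 < x < 20 is evaluated as (10 < x) < 20. Java rejects the chained form at compile time, so the sketch below mimics the C behavior with a helper method. The class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical sketch that mimics how C evaluates 10 < x < 20.
final class ChainedComparisonSketch {
    // C's relational operators yield the int 1 (true) or 0 (false).
    static int lessThan(int a, int b) {
        return a < b ? 1 : 0;
    }

    // What C computes for 10 < x < 20, namely (10 < x) < 20.
    static int chained(int x) {
        return lessThan(lessThan(10, x), 20);
    }

    // What the programmer presumably meant: 10 < x && x < 20.
    static int intended(int x) {
        return (10 < x && x < 20) ? 1 : 0;
    }

    public static void main(String[] args) {
        for (int x : new int[] {5, 15, 25}) {
            System.out.println(x + ": chained=" + chained(x) + " intended=" + intended(x));
        }
        // prints:
        // 5: chained=1 intended=0
        // 15: chained=1 intended=1
        // 25: chained=1 intended=0
    }
}
```

Because the inner comparison can only produce 0 or 1, and both are less than 20, the chained form is true for every x. Making relational operators non-associative turns this silent logic error into a syntax error.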


Grammar rules don’t specify the operand
types to be used with various operators;
e.g., is
true + 13
a legal expression? What is the type of
the expression 123.78 + 37 ?
These are type and semantic issues, not
lexical or syntactic ones.
Lexical Analyzer (lexer) → Syntactic Analyzer (parser) → Semantic Analyzer → Code Optimizer → Code Generator




Input: characters (the program)
Output: tokens & token type
Lexical grammars are simpler than syntax
grammars
Often generated automatically by lexical
analyzer generating programs
• Often based on BNF/EBNF grammar
• Input: tokens
• Output: abstract syntax tree or some other representation of the program
• Abstract syntax: similar to a concrete parse tree but with punctuation and many nonterminals discarded
• Typical tasks:
  ◦ Check that all identifiers are declared
  ◦ Perform type checking for expressions, assignments, …
  ◦ Insert implied conversion operators (i.e., make them explicit)
• Context free grammars can’t express the semantic rules that are needed for this phase of translation.
• Output: Intermediate code (IC) tree, modified abstract syntax tree representation.

Purpose: Improve the run-time performance of the object code
◦ Usually, to make it run faster
◦ Other possibilities: reduce amount of storage required
Drawback: optimization is time-consuming; slows down debugging
Output: Intermediate code, similar to abstract syntax notation; closer to machine code
• Evaluate constant expressions at compile time
• In-line expansion (of function calls)
• Loop unrolling
• Reorder code to improve cache performance
• Eliminate common sub-expressions
• Eliminate unnecessary code
• Store local variables/intermediate results in registers rather than on the stack or elsewhere in memory
• Output: machine code
• Instruction selection, register management
• “Peephole” optimization: look at a small segment of machine code, make it more efficient; e.g.,
  ◦ x = y;       →  load y, R0
                    store R0, x
    z = x * 2;   →  load x, R0     // redundant code
                    mul R0, #2
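
One peephole rule that covers the example above can be sketched as follows. The instruction strings follow the slide's notation; the class name PeepholeSketch and its method are hypothetical, and the sketch assumes straight-line code, so it never checks whether the load is the target of a jump.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical peephole pass over instructions written as plain strings.
// Assumption: straight-line code, no labels or jumps into the window.
final class PeepholeSketch {
    static List<String> removeRedundantLoads(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty()) {
                String prev = out.get(out.size() - 1);
                // "store R, V" followed by "load V, R" reloads a value
                // that is already in register R.
                if (prev.startsWith("store ") && instr.startsWith("load ")) {
                    String[] s = prev.substring(6).split(",\\s*");  // [R, V]
                    String[] l = instr.substring(5).split(",\\s*"); // [V, R]
                    if (s.length == 2 && l.length == 2
                            && s[0].equals(l[1]) && s[1].equals(l[0])) {
                        continue; // drop the redundant load
                    }
                }
            }
            out.add(instr);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of("load y, R0", "store R0, x", "load x, R0", "mul R0, #2");
        System.out.println(removeRedundantLoads(code));
        // prints [load y, R0, store R0, x, mul R0, #2]
    }
}
```

The window here is only two instructions wide, which is what makes the technique cheap: each rule inspects a short run of adjacent instructions and rewrites it locally.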


Replaces last 2 phases of a compiler with
direct execution
Input:
◦ Mixed: generates & uses intermediate
code/abstract syntax
◦ Pure: start from stream of ASCII characters each
time a statement is executed

Mixed interpreters
◦ Java, Perl, Python, Haskell, Scheme

Pure interpreters:
◦ most Basics, shell commands




Source code: a = x + y;
Compiler-generated object code:
load r0, x;
add r0, y;
store r0, a;
Will be executed later, with remainder of
program
Interpreter: Call an interpretive routine to
actually perform the operation.
e.g., add(x, y, a);
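
An interpretive routine of this kind might look like the sketch below. The class InterpreterSketch and its map-based variable store are assumptions for illustration; the slides name only the call add(x, y, a).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an interpreter's run-time support for  a = x + y;
final class InterpreterSketch {
    // Assumed run-time store: variable name -> current integer value.
    private final Map<String, Integer> memory = new HashMap<>();

    void set(String name, int value) {
        memory.put(name, value);
    }

    int get(String name) {
        return memory.get(name);
    }

    // Interpretive routine: performs the addition immediately instead of
    // emitting load/add/store instructions for later execution.
    void add(String left, String right, String dest) {
        memory.put(dest, memory.get(left) + memory.get(right));
    }

    public static void main(String[] args) {
        InterpreterSketch interp = new InterpreterSketch();
        interp.set("x", 3);
        interp.set("y", 4);
        interp.add("x", "y", "a");          // a = x + y;
        System.out.println(interp.get("a")); // prints 7
    }
}
```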



It’s not the case that the lexical analyzer
identifies all the tokens, and then the parser
analyzes all the tokens, and then the
type/semantic analysis is performed.
Instead, parser repeatedly contacts lexer to
get another token
As tokens are received they either do or don’t
match the expected syntax
◦ If there’s a match, perform any type or semantic
testing, possibly generate intermediate code, call for
another token.

Output of parser: the concrete parse
tree is large – probably more than
needed for next phase
◦ The compiler usually produces some more
compact representation of the program

Example: Fig. 2.9 (page 46)


The shape of the parse tree reveals the
meaning of the program.
So as output of syntax analysis we want a
tree that removes its inefficiency and keeps
its meaning.
◦ Remove separator/punctuation terminal symbols
◦ Remove all trivial nonterminals
◦ Replace remaining nonterminals with leaf
terminals

Example: Fig. 2.10
Removes unnecessary details but keeps the
essential language elements; e.g., consider the
following two equivalent loops:
Pascal:
    while j < n do begin
        j := j + 1;
    end;
C++:
    while (j < n) {
        j = j + 1;
    }
Essential information: 1) it is a loop, 2) its
terminating condition is j >= n, and 3) its body
increments the current value of j.
• Purpose: an intermediate form of the source code
• Generated by the parser during syntax analysis
• Used during type checking/semantic analysis
• Abstract syntax rules are defined by the compiler (or interpreter).
• One concrete syntax can have several abstract syntaxes associated with it, depending on the design of the translator.







LHS = RHS
LHS names an abstract syntax class
RHS either
(1) gives a list of one or more
alternatives
or
(2) lists the essential elements of the
syntax class
Compare to production rules in concrete
grammar
Assignment = Variable target; Expression source
Expression = Variable| Value | Binary | Unary
Variable = String id
Value = Integer value
Binary = Operator op; Expression term1, term2
Unary = UnaryOp op; Expression term
Operator = +| - | * | / | !
Priority? Associativity? ….
• Concrete: Assignment → Identifier [ [ Expression ] ] = Expression ;
• Abstract: Assignment = Variable target; Expression source
[Figure: abstract syntax tree for an assignment]
Assignment
    target: z
    source: Binary
        Operator +
        Variable x
        Binary
            Operator *
            Value 2
            Variable y

Binary node: op, term1, term2
Unary node: op, term
Binaries and unaries represent information that can be used for later processing
Assignment = Variable target; Expression source
Expression = VariableRef | Value | Binary | Unary
VariableRef = Variable | ArrayRef
Variable = String id
ArrayRef = String id; Expression index
Value = IntValue | BoolValue | FloatValue |
CharValue
Binary = Operator op; Expression term1, term2
Unary = UnaryOp op; Expression term
Operator = ArithmeticOp | RelationalOp |
BooleanOp
IntValue = Integer intValue
…
abstract class Expression { }
abstract class VariableRef extends Expression { }
class Variable extends VariableRef { String id; }
class ArrayRef extends VariableRef { String id; Expression index; }
class Value extends Expression { … }
class Binary extends Expression {
    Operator op;
    Expression term1, term2;
}
class Unary extends Expression {
    UnaryOp op;
    Expression term;
}
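
To see how such classes are used, the sketch below builds the abstract syntax tree for z = x + 2 * y, which appears to be the tree drawn in the earlier figure. It is a simplified, self-contained variant and not the textbook's code: the classes are nested in a hypothetical AstSketch class, constructors and the show method are additions, Operator is reduced to a String, Value holds only an int, and VariableRef, ArrayRef and Unary are omitted.

```java
// Hypothetical, simplified variant of the abstract syntax classes.
final class AstSketch {
    abstract static class Expression { }

    static final class Variable extends Expression {
        final String id;
        Variable(String id) { this.id = id; }
    }

    static final class Value extends Expression {
        final int intValue;              // assumption: integer values only
        Value(int intValue) { this.intValue = intValue; }
    }

    static final class Binary extends Expression {
        final String op;                 // assumption: operator kept as a String
        final Expression term1, term2;
        Binary(String op, Expression term1, Expression term2) {
            this.op = op;
            this.term1 = term1;
            this.term2 = term2;
        }
    }

    static final class Assignment {
        final Variable target;
        final Expression source;
        Assignment(Variable target, Expression source) {
            this.target = target;
            this.source = source;
        }
    }

    // Renders a tree with explicit parentheses so its shape is visible.
    static String show(Expression e) {
        if (e instanceof Variable) return ((Variable) e).id;
        if (e instanceof Value) return Integer.toString(((Value) e).intValue);
        Binary b = (Binary) e;
        return "(" + show(b.term1) + " " + b.op + " " + show(b.term2) + ")";
    }

    // z = x + 2 * y: precedence is already encoded in the tree shape.
    static Assignment example() {
        return new Assignment(
            new Variable("z"),
            new Binary("+", new Variable("x"),
                new Binary("*", new Value(2), new Variable("y"))));
    }

    public static void main(String[] args) {
        Assignment a = example();
        System.out.println(a.target.id + " = " + show(a.source)); // prints z = (x + (2 * y))
    }
}
```

No punctuation or intermediate nonterminals (Term, Factor, Primary) survive in the tree; only the operators and operands that later phases need.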




Lexical syntax: small, simple, defines language
tokens
Concrete syntax: detailed, specific, defines correct
programs, used to direct parsing algorithms
(language specific, not implementation specific)
Abstract syntax: simpler than concrete, used to
describe the structure of the intermediate code
(implementation specific, not language specific)
NOT INTENDED TO BE USED FOR PARSING
Semantics: program “meaning”, or runtime
behavior




A syntax is ambiguous if a portion of a
program has two or more possible
interpretations (parse trees)
Non-ambiguous grammars can be written
but some ambiguity may be tolerated to
reduce grammar size.
Operator associativity and precedence can be
defined by the concrete syntax
Compilers generate code for later execution
while interpreters execute program
statements as they are analyzed.