
Announcements

HW1 will be out this evening




Due Monday, 2/8
Submit in HW Server AND at start of class on 2/8
A review of regular expressions, context-free
grammars, derivation, parsing and ambiguity
Lecture notes available online before class:

www.cs.rpi.edu/~milanova/csci4430/
go to Schedule
Spring 16 CSCI 4430, A Milanova
Programming Language Syntax
Read: Scott, Chapter 2.1 and 2.2
Lecture Outline



Formal languages
Regular expressions
Context-free grammars





Derivation
Parse
Parse trees
Ambiguity
Scanning
Last Class: Compiler
character stream
Scanner
token stream
Parser
parse tree
Semantic analysis and
intermediate code generation
abstract syntax tree
or intermediate form
Machine-independent
code improvement
modified
intermediate form
Code generation
target language
(assembler)
Machine-dependent
code improvement
modified
target language
Syntax and Semantics

Syntax is the form or structure of expressions,
statements, and program units of a given language

Syntax of a Java while statement:
while ( boolean_expr ) statement
Partial syntax of an if statement:
if ( boolean_expr ) statement
Semantics is the meaning of expressions,
statements and program units of a given language

Semantics of while ( boolean_expr ) statement

Execute statement repeatedly (0 or more times) as long as
boolean_expr evaluates to true
Formal Languages





Theoretical foundations – Automata Theory
A language is a set of strings (also called
sentences) over a finite alphabet
A generator is a set of rules that generate the
strings in the language
A recognizer reads input strings and determines
whether they belong to the language
Languages are characterized by the complexity of
generation/recognition rules


E.g., regular languages
E.g., context-free languages
Question

What are the classes of formal languages?

The Chomsky hierarchy:




Regular languages
Context-free languages
Context-sensitive languages
Recursively enumerable languages
Formal Languages

Recognizers become more complex as languages
become more complex

Regular languages
Describe PL tokens (e.g., keywords, identifiers, numeric literals)
Generated by Regular Expressions
Recognized by a Finite Automaton (scanner)

Context-free languages
Describe more complex PL constructs (e.g., expressions and statements)
Generated by a Context-free Grammar
Recognized by a Push-down Automaton (parser)

Even more complex constructs
Formal Languages


Main application of formal languages: enable
proof of relative difficulty of certain
computational problems
Our focus: formal languages provide the
formalism for describing PL constructs




Most compelling application of formal languages!
Building a scanner
Building a parser
Central issue: build efficient, linear-time parsers
Regular Expressions


Simplest structure
Formalism to describe the simplest
programming language constructs, the
tokens





each symbol (e.g., “+”, “-”) is a token
an identifier (e.g., rate, initial) is a token
a numeric constant (e.g., 59) is a token
etc.
Recognized by a finite automaton
Regular Expressions

A Regular Expression is one of the following:




A character, e.g., a
The empty string, denoted by ε
Two regular expressions next to each other,
R1 R2, meaning any string generated by R1
followed by (concatenated with) any string
generated by R2
Two regular expressions separated by |, R1 | R2
meaning any string generated by R1 or any string
generated by R2
Question

What is the language defined by reg. exp.
(a | b) (a a | b b) ?


{aaa, abb, baa, bbb}
We saw concatenation and alternation. What
operation is still missing?
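The answer above can be checked by treating each regular expression as the set of strings it generates. This is only a sketch: the helper names concat and alternate are made up for illustration, and it works only for finite languages (no Kleene star yet).

```python
from itertools import product

def concat(lang1, lang2):
    """Concatenation R1 R2: any string of lang1 followed by any string of lang2."""
    return {x + y for x, y in product(lang1, lang2)}

def alternate(lang1, lang2):
    """Alternation R1 | R2: any string from either language."""
    return lang1 | lang2

a, b = {"a"}, {"b"}

# (a | b) (a a | b b)
language = concat(alternate(a, b), alternate(concat(a, a), concat(b, b)))
print(sorted(language))  # ['aaa', 'abb', 'baa', 'bbb']
```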
Regular Expressions

A Regular Expression is one of the following:





A character, e.g., a
The empty string, denoted by ε
R1 R2
R1 | R2
A regular expression followed by a Kleene star,
R*, meaning the concatenation of zero or more
strings generated by R

E.g., a* generates {ε, a, aa, aaa, … }
Regular Expressions

Operator precedence





Kleene * has highest precedence
Followed by concatenation
Followed by alternation |
E.g., a b | c generates …
E.g., a b* generates …
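One way to experiment with these precedence rules is Python's re module, which uses the same ordering (star binds tightest, then concatenation, then |). The helper name generates is an assumption for illustration, and the slide's spaces between symbols are dropped because a space is an ordinary character in Python patterns.

```python
import re

def generates(pattern, text):
    """True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

# a b | c groups as (a b) | c, not a (b | c)
print(generates(r"ab|c", "ab"))    # True
print(generates(r"ab|c", "c"))     # True
print(generates(r"ab|c", "ac"))    # False

# a b* groups as a (b*), not (a b)*
print(generates(r"ab*", "a"))      # True
print(generates(r"ab*", "abbb"))   # True
print(generates(r"ab*", "abab"))   # False
```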
Question

What is the language defined by regular
expression (0 | 1)* 1 ?


Answer: all strings of 0s and 1s that end with 1
What about 0* (1 0* 1 0*)* ?
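Both questions can be explored by brute force: enumerate every short string of 0s and 1s and compare the regular expression against a plain-English property. The sketch below checks the stated answer for (0 | 1)* 1 and one reading of the second expression, namely strings with an even number of 1s. The helper all_bit_strings and the length bound 8 are arbitrary choices for illustration.

```python
import re
from itertools import product

ENDS_WITH_ONE = re.compile(r"(0|1)*1")
EVEN_ONES = re.compile(r"0*(10*10*)*")

def all_bit_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

mismatches = []
for s in all_bit_strings(8):
    if (ENDS_WITH_ONE.fullmatch(s) is not None) != s.endswith("1"):
        mismatches.append(("ends with 1", s))
    if (EVEN_ONES.fullmatch(s) is not None) != (s.count("1") % 2 == 0):
        mismatches.append(("even number of 1s", s))

print(mismatches)  # [] : no counterexample up to length 8
```

An exhaustive check up to a bounded length is evidence, not a proof, but it is a quick way to catch a wrong guess.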
Regular Expressions in
Programming Languages



Describe tokens
Let
letter → a|b|c| … |z
digit → 1|2|3|4|5|6|7|8|9|0
Which token is this?
1. letter ( letter | digit )* ?
2. digit digit * ?
3. digit * . digit digit * ?
Regular Expressions in
Programming Languages

Which token is this:
number → integer | real
real → integer exponent | decimal ( exponent | ε )
decimal → digit* ( . digit | digit . ) digit*
exponent → ( e | E ) ( + | - | ε ) integer
integer → digit digit*
digit → 1|2|3|4|5|6|7|8|9|0
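The definitions above translate almost symbol for symbol into a Python pattern. This is a sketch for experimenting with which strings count as a number; the constant names and the helper is_number are made up for illustration, and ε is written as an empty alternative.

```python
import re

DIGIT = r"[0-9]"
INTEGER = rf"{DIGIT}{DIGIT}*"
EXPONENT = rf"(?:e|E)(?:\+|-|){INTEGER}"                 # ( e | E ) ( + | - | ε ) integer
DECIMAL = rf"{DIGIT}*(?:\.{DIGIT}|{DIGIT}\.){DIGIT}*"    # digit* ( . digit | digit . ) digit*
REAL = rf"(?:{INTEGER}{EXPONENT}|{DECIMAL}(?:{EXPONENT}|))"
NUMBER = re.compile(rf"(?:{INTEGER}|{REAL})")

def is_number(text):
    """True if the whole of text is a number token under the definitions above."""
    return NUMBER.fullmatch(text) is not None

for candidate in ["59", "3.14", "5.", ".5", "1e10", "6.02E+23", ".", "e5", "1e"]:
    print(candidate, is_number(candidate))
```

Note that decimal requires at least one digit next to the point, so a lone "." is rejected while "5." and ".5" are accepted.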
Context-Free Grammars


Unfortunately, regular languages cannot
specify all constructs in programming languages
E.g., can we write a regular expression that
specifies valid arithmetic expressions?



id * ( id + id * ( number – id ) )
Among other things, we need to ensure that
parentheses are matched!
Answer is no. We need context-free languages
and context-free grammars!
Grammar


A grammar is a formalism to describe the strings of
a (formal) language
A grammar consists of a set of terminals, a set of
nonterminals, a set of productions, and a start symbol




Terminals are the characters in the alphabet
Nonterminals represent language constructs
Productions are rules for forming syntactically correct
constructs
Start symbol tells where to start applying the rules
Notation
Specification of identifier:
Regular expression: letter ( letter | digit )*
BNF: <digit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
<letter> ::= a | b | c | … | x | y | z
<id> ::= <letter> | <id> <letter> | <id> <digit>
Textbook and slides (also BNF):
digit → 1|2|3|4|5|6|7|8|9|0
letter → a|b|c|d|…|z
id → letter | id letter | id digit
Nonterminals shown in italic
Terminals shown in typewriter
Regular Grammars


Regular grammars generate regular languages
The rules in regular grammars are of the form:


Each left-hand-side (lhs) has exactly one nonterminal
Each right-hand-side (rhs) is one of the following



A single terminal symbol or
A single nonterminal symbol or
A nonterminal followed by a terminal
e.g., 1 2* | 0+
S → A | B
A → 1 | A 2
B → 0 | B 0
Question

Is this a regular grammar:
S → 0 A
A → S 1
S → ε
Lecture Outline



Formal languages
Regular expressions
Context-free grammars





Derivation
Parse
Parse trees
Ambiguity
Scanning
Context-free Grammars (CFGs)

Context-free grammars generate context-free
languages


Most of what we need in programming languages can be
specified with CFGs
Context-free grammars have rules of the form:
Each left-hand-side has exactly one nonterminal
Each right-hand-side contains an arbitrary sequence of
terminals and/or nonterminals
A context-free grammar, e.g., for 0^n 1^n, n≥1:
S → 0 S 1
S → 0 1
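The two productions for 0^n 1^n can be read directly as a recursive recognizer: a string is derived from S if it is 01, or if it is 0, something derived from S, then 1. A minimal sketch (the function names derives_S and generate are illustrative, not from the lecture):

```python
def derives_S(s):
    """True if S derives s under S -> 0 S 1 | 0 1."""
    if s == "01":
        return True                      # production S -> 0 1
    if len(s) >= 4 and s[0] == "0" and s[-1] == "1":
        return derives_S(s[1:-1])        # production S -> 0 S 1
    return False

def generate(n):
    """Derive 0^n 1^n (n >= 1) by applying S -> 0 S 1 n-1 times, then S -> 0 1."""
    form = "S"
    for _ in range(n - 1):
        form = form.replace("S", "0S1")
    return form.replace("S", "01")

print(generate(3))           # 000111
print(derives_S("000111"))   # True
print(derives_S("0101"))     # False
```

The recursion has to remember how many 0s it has peeled off, which is exactly what a finite automaton cannot do for unbounded n.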
Question

Can you give examples of a non-context-free
language?



E.g., a^n b^m c^n d^m, n≥1, m≥1
E.g., wcw, where w is in (0|1)*
E.g., a^n b^n c^n, n≥1 (canonical example)
Context-free Grammars



Can be used to generate strings in the
context-free language (derivation)
Can be used to recognize well-formed strings
in the context-free language (parse)
We are concerned with two special classes of CFGs,
called LL and LR grammars
Derivation
Simple context-free grammar for expressions:
expr → id | ( expr ) | expr op expr
op → + | *
We can generate (derive) expressions:
expr ⇒ expr op expr
     ⇒ expr op id
     ⇒ expr + id
     ⇒ expr op expr + id
     ⇒ expr op id + id
     ⇒ expr * id + id
     ⇒ id * id + id
(intermediate lines: sentential forms; last line: sentence, string or yield)
Derivation

A derivation is the process that starts from
the start symbol, and at each step, replaces a
nonterminal with the right-hand-side of a
production


E.g., expr op expr derives expr op id
We replaced the right expr with id
due to production expr → id
An intermediate sentence is called a
sentential form

E.g., expr op id is a sentential form
Derivation

The resulting sentence is called the yield

E.g., id*id+id is the yield of our derivation

What is a left-most derivation?

Replaces the left-most nonterminal in the
sentential form at each step

What is a right-most derivation?

Replaces the right-most nonterminal in the
sentential form at each step

There are derivations that are neither left-most nor
right-most
Question

What kind of derivation is this:
expr ⇒ expr op expr
     ⇒ expr op id
     ⇒ expr + id
     ⇒ expr op expr + id
     ⇒ expr op id + id
     ⇒ expr * id + id
     ⇒ id * id + id
Question
What kind of derivation is this:
expr ⇒ expr op expr
     ⇒ expr op id
     ⇒ expr + id
     ⇒ expr op expr + id
     ⇒ id op expr + id
     ⇒ id op id + id
     ⇒ id * id + id

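The left-most and right-most strategies can be made concrete with a small derivation engine for the expression grammar. This is a sketch: the dictionary GRAMMAR and the function derive are illustrative names, and the caller supplies the right-hand side to apply at each step while the function decides which nonterminal gets replaced.

```python
GRAMMAR = {
    "expr": [["id"], ["(", "expr", ")"], ["expr", "op", "expr"]],
    "op": [["+"], ["*"]],
}

def derive(choices, leftmost=True):
    """Return the list of sentential forms obtained by applying the chosen
    right-hand sides, always to the left-most (or right-most) nonterminal."""
    form = ["expr"]                      # start symbol
    history = [" ".join(form)]
    for rhs in choices:
        positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
        i = positions[0] if leftmost else positions[-1]
        if rhs not in GRAMMAR[form[i]]:
            raise ValueError(f"{form[i]} has no production with rhs {rhs}")
        form[i:i + 1] = rhs              # replace the nonterminal with the rhs
        history.append(" ".join(form))
    return history

E, ID = ["expr", "op", "expr"], ["id"]
rightmost = derive([E, ID, ["+"], E, ID, ["*"], ID], leftmost=False)
leftmost = derive([E, E, ID, ["*"], ID, ["+"], ID], leftmost=True)
print("\n".join(rightmost))
print("\n".join(leftmost))
```

With these choices the right-most run reproduces, step for step, the first of the two question derivations above; both runs end in the same yield, id * id + id.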
Parse
Recall our context-free grammar for expressions:
expr → id | ( expr ) | expr op expr
op → + | *

A parse is the reverse of a derivation
id * id + id ⇐ expr * id + id
             ⇐ expr op id + id
             ⇐ expr op expr + id
             ⇐ expr + id
             ⇐ expr op id
             ⇐ expr op expr
             ⇐ expr
Parse

A parse starts with the string of terminals,
and at each step, replaces the right-hand side (rhs) of a production
with the left-hand side (lhs) of that production. E.g.,
… ⇐ expr op expr + id
  ⇐ expr + id
Here we replaced expr op expr (the rhs of
production expr → expr op expr) with expr
(the lhs of the production)
Parse Tree
expr → id | ( expr ) | expr op expr
op → + | *
expr ⇒ expr op expr
     ⇒ expr op id
     ⇒ expr + id
     ⇒ expr op expr + id
     ⇒ expr op id + id
     ⇒ expr * id + id
     ⇒ id * id + id
Parse tree (bracket notation, [node children]):
[expr [expr [expr id] [op *] [expr id]] [op +] [expr id]]
Internal nodes are nonterminals. Children are
the rhs of a rule for that nonterminal.
Leaf nodes are terminals.
Ambiguity

A grammar is ambiguous if some string can be
generated by two or more distinct parse trees
There is no algorithm which can tell if an arbitrary
context-free grammar is ambiguous

Ambiguity arises in programming language
grammars
Arithmetic expressions
If-then-else: the dangling else problem

Ambiguity is bad
Ambiguity
expr → id | ( expr ) | expr op expr
op → + | *

How many parse trees for id * id + id ?
(bracket notation, [node children])
Tree 1: [expr [expr id] [op *] [expr [expr id] [op +] [expr id]]]
Tree 2: [expr [expr [expr id] [op *] [expr id]] [op +] [expr id]]
Which one is “correct”?
Ambiguity
expr → id | ( expr ) | expr op expr
op → + | *

How many parse trees for id + id + id ?
(bracket notation, [node children])
Tree 1: [expr [expr id] [op +] [expr [expr id] [op +] [expr id]]]
Tree 2: [expr [expr [expr id] [op +] [expr id]] [op +] [expr id]]
Which one is “correct”?
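The number of parse trees can also be counted mechanically. The sketch below handles only the productions expr → id | expr op expr and op → + | * (the parenthesized production is left out to keep it short), and the name count_parse_trees is made up for illustration. Each operator position is tried as the root of the tree, and the counts for the two sides are multiplied.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees for tokens under expr -> id | expr op expr, op -> + | *."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of distinct trees deriving tokens[lo:hi] from expr.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):          # tokens[k] is the root operator
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees("id * id + id".split()))   # 2
print(count_parse_trees("id + id + id".split()))   # 2
```

Any count above 1 for some string is enough to show that the grammar is ambiguous.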
Handling Ambiguity
Our ambiguous grammar, slightly simplified:
expr → id | ( expr ) | expr + expr | expr * expr

Rewrite the grammar into an unambiguous one:
expr → expr + term | term
term → term * factor | factor
factor → id | ( expr )


Forces left associativity of + and *
Forces higher precedence of * over +
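To see both effects in running code, here is a small hand-written parser for the unambiguous grammar. It is a sketch under one stated assumption: the left-recursive rules (expr → expr + term, term → term * factor) are written as loops that fold results to the left, because a naive recursive function for a left-recursive rule would never terminate. Trees are nested tuples (operator, left, right); the name parse is illustrative.

```python
def parse(tokens):
    """Parse a list of tokens into a nested tuple: 'id' or (op, left, right)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        # expr -> expr + term | term, written as: term (+ term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())       # fold left: left associativity
        return node

    def term():
        # term -> term * factor | factor, written as: factor (* factor)*
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():
        # factor -> id | ( expr )
        if peek() == "id":
            advance()
            return "id"
        if peek() == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected )")
            advance()
            return node
        raise SyntaxError(f"unexpected token {peek()!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("id * id + id".split()))   # ('+', ('*', 'id', 'id'), 'id')
print(parse("id + id + id".split()))   # ('+', ('+', 'id', 'id'), 'id')
```

In both outputs * ends up deeper in the tree than +, and repeated + nests to the left, matching the two properties stated above.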
Rewriting Expression Grammars:
Intuition
expr → id | ( expr ) | expr + expr | expr * expr
A new nonterminal, term

expr * expr becomes term. This pushes *, the
operator with higher precedence, down the parse
tree. Forces operand to associate with *, not +
expr + expr becomes expr + term. Pushes
leftmost + down the tree. Forces operand to
associate with + on the left.
expr → expr + expr becomes expr → expr + term
| term
Rewriting Expression Grammars:
Intuition
E.g., look at the terms in the sum id + id*id*id + id + id*id
(bracket notation, [node children]; each term abbreviated to its string)
[expr [expr [expr [expr [term id]] + [term id*id*id]] + [term id]] + [term id*id]]
Rewriting Expression Grammars:
Intuition

Another new nonterminal, factor and
productions:


term → term * factor | factor
factor → id | ( expr )
Lecture Outline



Formal languages
Regular expressions
Context-free grammars





Derivation
Parse
Parse trees
Ambiguity
Scanning
Scanning


Scanner groups
characters into tokens
Scanner simplifies the
job of the Parser
position = initial + rate * 60;
Scanner
id = id + id * 60
Parser

Scanner is essentially a Finite Automaton


Regular expressions specify the syntax of tokens
Scanner recognizes the tokens in the program
Question

Why do most programming languages disallow
nested multi-line comments?
Calculator Language
Tokens
times → *
plus → +
id → letter ( letter | digit )*
except for read and write which are
keywords (keywords are tokens as well)
Ad-hoc Scanner for Calculator
language
skip any initial white space (space, tab, newline)
if current_char in { +, * }
    return corresponding single-character token (plus or times)
if current_char is a letter
    read any additional letters and digits
    check to see if the resulting string is read or write
    if so then return the corresponding token
    else return id
else announce an ERROR
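The pseudocode above can be turned into a short runnable scanner. This is a sketch, not the course's reference implementation: the function names next_token and scan, the (token, lexeme) pair format, and the extra eof token at end of input are all assumptions made for illustration. Letters are taken to be a to z and digits 0 to 9, as in the token definitions.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)
KEYWORDS = {"read", "write"}

def next_token(text, pos=0):
    """Return (token, lexeme, next_pos) for the first token at or after pos."""
    # skip any initial white space (space, tab, newline)
    while pos < len(text) and text[pos] in " \t\n":
        pos += 1
    if pos == len(text):
        return ("eof", "", pos)              # assumed end-of-input marker
    ch = text[pos]
    if ch == "+":
        return ("plus", ch, pos + 1)
    if ch == "*":
        return ("times", ch, pos + 1)
    if ch in LETTERS:
        end = pos + 1
        while end < len(text) and (text[end] in LETTERS or text[end] in DIGITS):
            end += 1                         # read any additional letters and digits
        lexeme = text[pos:end]
        if lexeme in KEYWORDS:
            return (lexeme, lexeme, end)     # keywords are tokens as well
        return ("id", lexeme, end)
    raise ValueError(f"unexpected character {ch!r} at position {pos}")

def scan(text):
    """Return the list of (token, lexeme) pairs in text."""
    tokens, pos = [], 0
    while True:
        token, lexeme, pos = next_token(text, pos)
        if token == "eof":
            return tokens
        tokens.append((token, lexeme))

print(scan("read rate + x1 * write"))
```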
The Scanner as a DFA
[Figure: DFA for the scanner. Start state 1 loops on space, tab, newline; * leads to state 2; + leads to state 3; letter leads to state 4; state 4 loops on letter, digit.]
Building a Scanner

Scanners are (usually) automatically
generated from regular expressions:
Step 1: From a Regular Expression to an NFA
Step 2: From an NFA to a DFA
Step 3: Minimizing the DFA

lex/flex utilities generate scanner code

Scanner code explicitly captures the states
and transitions of the DFA
Control-flow-based Scanning
state := 1
loop
    read current_char
    case state of
        1: case current_char of
            ‘+’, ‘*’,
            ‘a’…’z’
            …
        2: case current_char of
            …
Table-driven Scanning
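[Table: transition table for the scanner. Rows are states 1 to 5; columns are the input classes space/tab/newline, *, +, digit, letter, other; entries give the next state (- for none). A final column names the token recognized in a state: times, plus, id, space.]
Sketch of transition table. See Scott, pages 64-65 for details.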
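A table-driven scanner keeps the DFA in a data structure and uses one generic loop to walk it. The sketch below is an assumption-laden illustration, not Scott's table: it encodes the four-state DFA from the earlier slide (whitespace loops on the start state, * goes to 2, + goes to 3, letter goes to 4), and the names TABLE, ACCEPTING, char_class and scan are made up. It does not separate the keywords read and write from id; that check would be added after an id is recognized.

```python
import string

def char_class(ch):
    """Map a character to its column in the transition table."""
    if ch in " \t\n":
        return "space"
    if ch in "*+":
        return ch
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_lowercase:
        return "letter"
    return "other"

# TABLE[state][char_class] -> next state; a missing entry means "no move".
TABLE = {
    1: {"space": 1, "*": 2, "+": 3, "letter": 4},
    2: {},
    3: {},
    4: {"letter": 4, "digit": 4},
}
ACCEPTING = {2: "times", 3: "plus", 4: "id"}

def scan(text):
    """Return (token, lexeme) pairs, taking the longest match each time."""
    tokens, pos = [], 0
    while pos < len(text):
        state, start = 1, pos
        while pos < len(text):
            nxt = TABLE[state].get(char_class(text[pos]))
            if nxt is None:
                break                        # no transition: the token ends here
            if state == 1 and nxt == 1:
                start = pos + 1              # whitespace is skipped, not kept
            state = nxt
            pos += 1
        if state == 1:
            if pos < len(text):
                raise ValueError(f"unexpected character {text[pos]!r}")
            break                            # only trailing whitespace remained
        tokens.append((ACCEPTING[state], text[start:pos]))
    return tokens

print(scan("x1 + y * z"))
```

Changing the token set then means editing TABLE and ACCEPTING only; the loop stays the same, which is why scanner generators emit code in this style.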
Summary



Formal Languages in Programming
Regular Expressions
Context-free Grammars


Derivation, Parse, Parse tree, Ambiguity
Scanning
Group Exercise
expr → expr × expr | expr ^ expr | id

How many parse trees for id×id^id×id ?


No need to draw them all
Rewrite this grammar into an equivalent
unambiguous grammar where
^ has higher precedence than ×
^ is right-associative
× is left-associative
Next Class

We will cover Chapter 2.3.1 and 2.3.2 from
Scott’s book