Parsing Techniques: A Practical Guide


Parsing Techniques
A Practical Guide
by
Dick Grune and
Ceriel J. H. Jacobs
1
Book in slide format
• Fantastic book: Parsing Techniques
• I went through chapters 1 and 2 of the book and
created slides of them.
• That is, the following slides are chapters 1 and 2,
in slide form.
• Additionally, there are several slides from:
– Personal correspondence with one of the authors,
Dick Grune.
– Material from other sources.
– Slides that I created, applying the concepts to XML.
Roger L. Costello
June 1, 2014
2
Why grammars, not automata?
• There is a close relationship between formal grammars
and other abstract notions used in computer science,
such as automata and algorithms.
• Indeed, since the results in one theory can often be
translated into another, it seems to be an arbitrary
decision as to which interpretation is primary.
• In these slides formal grammars are given preferential
treatment because they are probably the most
commonly known of the various theories among
computer scientists. This is due to the success of
context-free grammars in describing the syntax of
programming languages.
3
Chapter 1
Defining Parsing and Grammars
4
Parsing
• Parsing is the process of structuring a linear representation in accordance
with a given grammar.
• This definition has been kept abstract on purpose to allow as wide an
interpretation as possible.
• The “linear representation” may be:
– a sentence
– a computer program
– a knitting pattern
– a sequence of geological strata
– a piece of music
– actions of ritual behavior
In short, any linear sequence in which the preceding elements in some
way restrict the next element.
• For some of the examples the grammar is well known, for some it is an
object of research, and for some our notion of a grammar is only just
beginning to take shape.
5
Parsing
[Diagram: a grammar and a linear representation go into a parser, which
produces a structure.]
Parsing is the process of structuring a linear
representation in accordance with a given
grammar. A “linear representation” is any
linear sequence in which the preceding
elements in some way restrict the next
element.
6
Grammar: a succinct summary
• For each grammar, there are generally an infinite
number of linear representations (“sentences”)
that can be structured with it.
• That is, a finite-sized grammar can supply
structure to an infinite number of sentences.
• This is the main strength of the grammar
paradigm and indeed the main source of the
importance of grammars: they summarize
succinctly the structure of an infinite number of
objects of a certain class.
7
Reasons for parsing
There are several reasons to perform this structuring process called
parsing.
1. One reason derives from the fact that the obtained structure helps
us to process the object further. When we know that a certain
segment of a sentence is the subject, that information helps in
understanding or translating the sentence. Once the structure of a
document has been brought to the surface, it can be processed
more easily.
2. A second reason is related to the fact that the grammar in a sense
represents our understanding of the observed sentences: the better
a grammar we can give for the movement of bees, the deeper our
understanding of them.
3. A third lies in the completion of missing information that parsers,
and especially error-repairing parsers, can provide. Given a
reasonable grammar of the language, an error-repairing parser can
suggest possible word classes for missing or unknown words on clay
tablets.
8
Grammatical inference
• Grammatical inference: Given a (large) set of
sentences, find the/a grammar which
produces them.
• Grammatical inference is also known as
grammar induction or syntactic pattern
recognition.
9
XML Schema from an XML instance
The XML tool oXygen XML does grammatical
inference when it creates an XML Schema from
an XML instance document.
10
The science of parsing
• Parsing is no longer an arcane art.
• In the 1970s Aho, Ullman, Knuth, and many
others put parsing techniques solidly on their
theoretical feet.
11
Mathematician vs. Computer Scientist
• To a mathematician all structures are static. They have
always existed and will always exist. The only time-dependence is that we have not discovered all the
structures yet.
– Example: the Peano axioms create the integers without
reference to time.
• The computer scientist is concerned with (and fascinated
by) the continuous creation, combination, separation, and
destruction of structures. Time is of the essence.
– Example: if the computer scientist uses the Peano axioms to
implement integer addition, he finds they describe a very slow
process, which is why he will look for a more efficient approach.
12
Many uses for parsing
Parsing is for anyone who has parsing to do:
– The compiler writer
– The linguist
– The database interface writer
– The geologist who wants to test grammatical
descriptions of a sequence of geological strata
– The musicologist who wants to test grammatical
descriptions of a music piece
13
Requirements for a parser developer
It requires a good ability to visualize, some
programming experience, and the willingness
and patience to follow non-trivial examples.
14
Chapter 2
Grammars as a Generating Device
15
Need to define some terms
• In computer science as in everyday parlance, a
grammar serves to describe a language.
• To establish our terminology and to demarcate
the universe of discourse, we shall examine
these terms:
– Language
– Grammar
– Language Descriptions
16
Language
We examine three views of the word
“language”:
– How the larger part of mankind views language
– How the computer scientist views language
– How the formal-linguist views language
17
Layman’s view of languages
• To the larger part of mankind, language is first
and foremost a means of communication.
• Communication is brought about by sending
messages, through air vibrations or through
written symbols.
• Languages have three levels of composition:
– Messages fall apart into sentences,
– which are composed of words,
– which in turn consist of symbol sequences when
written.
18
Computer scientist view of languages
• A language has sentences, and these
sentences possess structure.
• Information may possibly be derived from the
sentence’s structure; that information is called
the meaning of the sentence.
• Sentences consist of words called tokens, each
possibly carrying a piece of information, which
is its contribution to the meaning of the whole
sentence.
19
Computer scientist view of languages
• A language is a probably infinitely large set of
sentences, each composed of tokens in such a
way that it has structure.
• The tokens and structure cooperate to describe
the semantics (meaning) of the sentence.
• To a computer scientist 3 + 4 × 5 is a sentence
in the language of “arithmetics on single digits”.
Its structure can be shown by inserting
parentheses (3 + (4 × 5)) and its semantics is
23.
20
Formal-linguist view of languages
• A language is a “set” of sentences, and each sentence
is a “sequence” of “symbols”.
• There is no meaning, no structure. Either a sentence
belongs to the language or it does not.
The only property of a symbol is that it has an identity.
• In any language there are a certain number of different
symbols – the alphabet – and that number must be
finite. Just for convenience we write these symbols as
a, b, c, …, but ◊,▪,ⱴ, … would do equally well, as long as
there are enough symbols.
21
Formal-linguist view of languages
• The word “sequence” means that the symbols in each
sentence are in a fixed order and we should not shuffle
them.
• The word “set” means an unordered collection with all
the duplicates removed. A set can be written down by
writing the objects in it, surrounded by curly braces.
• All this means is that to a formal-linguist the following
is a language: {a, b, ab, ba}
• The formal-linguist also calls a sentence a “word” and
he says that “the word ab is in the language
{a, b, ab, ba}”
22
Formal-linguist vs. computer scientist
• The formal-linguist holds his views of language because
he wants to study the fundamental properties of
languages in their naked beauty. It gives him a grip on a
seemingly chaotic and perhaps infinitely complex
object: natural language.
• The computer scientist holds his view of language
because he wants a clear, well-understood, and
unambiguous means of describing objects in the
computer and of communication with the computer (a
most exacting communication partner).
23
Grammars
We examine three views of the word
“grammar”:
– How the larger part of mankind views grammar
– How the formal-linguist views grammar
– How the computer scientist views grammar
24
Layman’s view of grammars
A grammar is a book of rules and examples
which describes and teaches the language.
25
Formal-linguist’s view of grammars
• A generative grammar is an exact, finite-size,
recipe for constructing the sentences in the
language.
• This means that, following the recipe, it must be
possible to construct each sentence of the
language (in a finite number of actions) and no
others.
• This does not mean that, given a sentence, the
recipe tells us how to construct that particular
sentence, only that it is possible to do so.
26
Computer scientist’s view of grammars
The computer scientist has the same view as the
formal-linguist, with the additional requirement
that the recipe should imply how a sentence can
be constructed.
27
Infinite sets from finite descriptions
A language is a possibly infinite set of sequences
of symbols and a grammar is a finite recipe to
generate those sentences.
28
Example of an infinite set
from a finite description
The phrase “the set of all positive integers” is a very finite-size
description of a definitely infinite-size set.
29
Not all languages are describable
• Can all languages be described by finite
descriptions?
• Answer: No.
30
Outline of the proof
• The proof that not all languages can be
described by finite descriptions is not trivial.
But it is very interesting and famous. We will
present an outline of it.
• The proof is based on two observations and a
trick.
31
Enumerate language descriptions
The language descriptions can be listed. This is
done as follows:
1. Take all descriptions of size one, that is, those
only one letter long, and sort them alphabetically.
• Depending on what, exactly, we accept as a description,
there may be zero descriptions of size one, or 27 (all letters +
space), or 95 (all printable ASCII characters), or something
similar.
2. Take all descriptions of size two, sort them
alphabetically. Do the same for lengths 3, 4, and
further.
This is observation number one.
32
Each description has a
well-defined position
• Now we have a list of descriptions. Each describes a
language.
• So each description has a position on the list.
• Example: our description the set of all positive integers
is 32 characters long. To find its position on the list, we
have to calculate how many descriptions there are with
less than 32 characters, say L. We then have to
generate all descriptions of size 32, sort them and
determine the position of our description in it, say P,
and add the two numbers L and P. This will, of course,
give a huge number but it does ensure that the
description is on the list in a well-defined position.
This is observation number two.
33
Our example description
is at position L + P
[Diagram: the list of descriptions, grouped by size. The first L entries are
the descriptions of sizes 1 through 31; among the descriptions of size 32,
“the set of all positive integers” sits at position P, so its overall
position is L + P.]
34
Two things to note
• Note #1: Just listing all descriptions
alphabetically, without reference to their lengths,
would not do. There are already infinitely many
descriptions starting with an “a”, so no
description starting with a higher letter could get
a number on the list.
• Note #2: there is no need to actually do all this. It
is just a thought experiment that allows us to
examine and draw conclusions about the
behavior of a system in a situation which we
cannot possibly examine physically.
35
Both nonsensical and
meaningful descriptions
There will be many nonsensical descriptions on
the list. This is immaterial to the argument. The
important thing is that all meaningful
descriptions are on the list, and the strategy
ensures that.
36
Alphabet
• The words (sentences) in a language are
composed of a finite set of symbols.
• This set of symbols is called the alphabet.
• We will assume the symbols in the alphabet
are ordered.
• Then the words in the language can be
ordered too.
• We shall indicate the alphabet by Σ.
37
Language that consists of
all possible words
• The language that consists of all possible
words that can be built from an alphabet is
called Σ*
• For the alphabet Σ = {a, b} we get the
language { , a, b, aa, ab, ba, bb, aaa, …}
The empty word (the word consisting of zero as and zero bs). It
may be easily overlooked, so we shall write it as ε (epsilon),
regardless of the alphabet. So,
Σ* = {ε, a, b, aa, ab, ba, bb, aaa, …}
38
Words in Σ* can be enumerated
• Since the symbols in the alphabet Σ are
ordered, we can list the words in the language
Σ*, using the same technique as in the
previous slides:
– First, list all words of size zero, sorted; then list all
words of size one, sorted; and so on
• This is actually the order already used in our
set notation for Σ*
39
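To make the enumeration concrete, here is a minimal Python sketch (mine, not from the book; the function name is illustrative) that yields the words of Σ* shortest-first and alphabetically within each length:

from itertools import count, product

def enumerate_sigma_star(alphabet):
    # Yield all words over `alphabet`: first the one word of length 0
    # (epsilon), then the words of length 1 in alphabetical order, etc.
    for length in count(0):
        for word in product(sorted(alphabet), repeat=length):
            yield "".join(word)

words = enumerate_sigma_star({"a", "b"})
print([next(words) for _ in range(8)])
# -> ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa']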
Compare language L against Σ*
• Since Σ* contains all possible words,
all languages using alphabet Σ are
subsets of it.
• Let L be a language over Σ (the word
“over” means “built out of”).
• We can go through the list of words in
Σ* and put checkmarks on all words
that are in L.
• Suppose our language L is “the set of
all words that contain more as than
bs”. L is {a, aa, aab, aba, baa, …}
ε
a    ✓
b
aa   ✓
ab
ba
bb
aaa  ✓
aab  ✓
aba  ✓
abb
...
40
Encode languages using 0 and 1
• The list of blanks and checkmarks is
sufficient to identify and describe a
language.
• For convenience we write the blank as 0 and
the checkmark as 1, as if they were bits in a
computer.
• We can now write L = 01010001110…
– So, we have attached the infinite bit-string
01010001110… to the language description
“the set of all words that contain more as than
bs”.
• The set of all words over an alphabet is
Σ* = 1111111…
41
Languages are infinite bit-strings
• Any language can be encoded as an infinite
bit-string, be it a formal language like L, a
programming language like Java, or a natural
language like English.
• For the English language the 1s in the bit-string will be very scarce, since hardly any
arbitrary sequence of letters is a good English
sentence.
42
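As an illustration (my sketch, not the book's), the bit-string of a language can be approximated by walking the enumeration of Σ* and testing membership; for “more as than bs” the first eleven bits come out exactly as on the slide:

from itertools import count, product

def sigma_star(alphabet):
    # Words over the alphabet, by length, then alphabetically.
    for length in count(0):
        for word in product(sorted(alphabet), repeat=length):
            yield "".join(word)

def bitstring_prefix(in_language, alphabet, n_bits):
    # The first n_bits of the infinite bit-string that encodes the language.
    words = sigma_star(alphabet)
    return "".join("1" if in_language(next(words)) else "0"
                   for _ in range(n_bits))

more_as_than_bs = lambda w: w.count("a") > w.count("b")
print(bitstring_prefix(more_as_than_bs, "ab", 11))   # -> 01010001110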
List of languages
• We attached the infinite bit-string
01010001110… to the language
description “the set of all words that
contain more as than bs”.
• In the same way, we can attach bit-strings to all descriptions.
• Some descriptions may not yield a
language, in which case we can attach
an arbitrary infinite bit-string to them.
• Since all descriptions can be put on a
single numbered list, we get, for
example, this table:
Description       Language
Description #1    000000100…
Description #2    110010001…
Description #3    011011010…
Description #4    110011010…
Description #5    100000011…
Description #6    111011011…
...               ...
43
The list is incomplete
Description       Language
Description #1    000000100…
Description #2    110010001…
Description #3    011011010…
Description #4    110011010…
Description #5    100000011…
Description #6    111011011…
...               ...
• Many languages exist that are not on the list of
languages above.
• The above list is far from complete, although the
list of descriptions is complete.
• We shall prove this by using the diagonalization
process (“Diagonalverfahren”) of Cantor.
44
Flip the bits along the diagonal
Description       Language
Description #1    000000100…
Description #2    110010001…
Description #3    011011010…
Description #4    110011010…
Description #5    100000011…
Description #6    111011011…
...               ...
• Consider the language C = 100110…, which has
the property that its n-th bit is unequal to the
n-th bit of the language described by Description
#n.
• The first bit of C is 1 because the first bit of
Description #1 is 0. The second bit of C is 0
because the second bit of Description #2 is 1. And
so on.
45
Create a language
Description       Language
Description #1    000000100…
Description #2    110010001…
Description #3    011011010…
Description #4    110011010…
Description #5    100000011…
Description #6    111011011…
...               ...
So C is created by walking the top-left to
bottom-right diagonal of the language table and
copying the opposites of the bits we meet.
Language
000000100…
110010001…
011011010…
110011010…
100000011…
111011011…
...

C = 100110…
46
It’s a new language!
Language
000000100…
110010001…
011011010…
110011010…
100000011…
111011011…
...

C = 100110…
• The language C cannot be on the list!
– C cannot equal line 1 since its first bit differs from that
line.
– C cannot equal line 2 since its second bit differs from
that line.
– And so forth.
• So, C cannot be on the list.
47
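Cantor's trick is short enough to execute (an illustrative sketch of mine): given any table of languages as bit-strings, flipping the diagonal yields a row that differs from every row of the table.

def diagonal_language(rows):
    # rows[n] is the bit-string of the language of Description #n+1;
    # bit n of the result is the flipped bit n of row n, so the
    # result differs from every row at its diagonal position.
    return "".join("1" if row[n] == "0" else "0"
                   for n, row in enumerate(rows))

table = ["000000100", "110010001", "011011010",
         "110011010", "100000011", "111011011"]
print(diagonal_language(table))   # -> 100110, which is in no row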
Infinite number of new languages
• So in spite of the fact that we exhaustively listed all
possible finite descriptions, we have created a
language that has no description on the list.
• There are many more languages not on the list:
– Construct, for example, the language whose n+5-th bit
differs from the n+5-th bit in Description #n. Again, it
cannot be on the list since for each Description #n the new
language differs in the n+5-th bit. That means that bits
1…5 play no role, and can be chosen arbitrarily; this yields
another 2⁵ = 32 languages that are not on the list.

Language
000000100…
110010001…
011011010…
110011010…
100000011…
111011011…
...

C+5 = xxxxx1101…
48
Even more new languages
And there are many more languages not on the list:
– Construct, for example, the language whose 2n-th bit
differs from the 2n-th bit in Description #n. Again, it
cannot be on the list since for each Description #n the
new language differs in the 2n-th bit. That means that
the odd bits play no role and can be chosen freely.
Language
000000100…
110010001…
011011010…
110011010…
100000011…
111011011…
...

2C = x1x1x0x0…
49
Infinitely many languages
cannot be described
• We can create an infinite number of
languages, none of which allows a finite
description.
• For every language that can be described
there are infinitely many that cannot.
50
Many languages beyond our reach
We can only describe a tiny subset (not even a
fraction) of all possible languages. There are
infinitely many languages out there, forever
beyond our reach.
51
Unequal infinities
Although there are infinitely many descriptions
and infinitely many languages, these infinities
are not equal to each other.
# of languages = ℵ1
# of descriptions of languages = ℵ0
52
Generating a set of objects
• A good way to generate a set of objects is to start
with a small object and to give rules for how to
add to it and generate new objects from it.
• Example: start with these objects: integers and
addition. Then define rules for generating new
objects from the primitive objects:
even → 2
even → even + even
“2 is an even number and the sum of two even
numbers is again an even number.”
This generates the set of all even numbers.
53
Generate the set of even numbers
Primitives: the integers, the addition operator
Rules:
  1. 2 is an even number
  2. The sum of two even numbers is an even number
Result: the set of all even numbers
Examples:
4 is an even number since it is the sum of 2 and 2, and 2 is an even number.
6 is an even number since it is the sum of 4 and 2.
8 is an even number since it is the sum of 6 and 2.
54
Generative rules
even → 2
even → even + even
We can use those rules to generate the set of even numbers.
even → even + even
     → 2 + 2
     → 4
even → even + even
     → 4 + 2
     → 6
→ means “may be replaced by”
55
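The generative character of the two rules can be mimicked mechanically (a sketch of mine, not from the book): keep replacing an occurrence of the non-terminal even by one of its two right-hand sides until none is left, then evaluate the sentence.

import random

def generate_even():
    # Rewrite 'even' by the rules even -> 2 and even -> even + even,
    # choosing a rule at random at each step.
    form = ["even"]
    while "even" in form:
        i = form.index("even")
        if random.random() < 0.6:
            form[i:i+1] = ["2"]                    # even -> 2
        else:
            form[i:i+1] = ["even", "+", "even"]    # even -> even + even
    sentence = " ".join(form)                      # e.g. "2 + 2 + 2"
    return sentence, eval(sentence)                # eval is fine for a sketch

print(generate_even())   # e.g. ('2 + 2', 4) -- always an even number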
Generate infinite set from a
finite set of rules
even → 2
even → even + even
• With those two rules we are able to generate
an infinite set.
• Note the generative character of the recipe
(rules).
56
Terminal, non-terminal symbols
“2” is called a terminal.
even → 2
even → even + even
“even” is a symbol that stands for a number. It is called
a non-terminal (a singularly uninspired term).
57
Naming convention
Even → Even + Even
(the part before the arrow is the left-hand side, LHS;
the part after it is the right-hand side, RHS)
Since we have identified terminal symbols and non-terminal
symbols as technical objects, we shall write them in Cambria
Math font.
We write terminals in lower case letters and start non-terminals with upper case letters.
Non-terminals are called variables or syntactic categories in
linguistic contexts.
58
Generate regular expressions
Primitives: the symbols |, *, (, and ); characters
Rules:
  1. Any character is a regular expression
  2. r1 | r2 is a regular expression, where r1 and r2 are regular expressions
  3. r1r2 is a regular expression
  4. r* is a regular expression
Result: regular expressions
Examples:
Using rule 1 we generate this regular expression: a, which denotes this set: {a}
Using rules 3 and 1 we generate this regular expression: ab, which denotes this set: {ab}
Using rules 2, 3, and 1 we generate this regular expression: ab | b, which denotes this set: {ab, b}
Using rules 4 and 3 we generate this regular expression: a*b, which denotes this infinite set: {b, ab, aab, aaab, …}
59
Rules for regular expressions
Regex → a
Regex → b
…
Regex → z
Regex → Regex | Regex
Regex → Regex Regex
Regex → Regex*
60
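Read as a recursive procedure (my sketch, not the book's), the four rules generate well-formed regular expressions; the depth bound and the parentheses around the starred sub-expression are additions of mine to keep the output finite and unambiguous.

import random

def random_regex(depth=3):
    # rule 1: any character is a regular expression
    if depth == 0 or random.random() < 0.4:
        return random.choice("abc")
    rule = random.choice([2, 3, 4])
    if rule == 2:                                                   # r1 | r2
        return random_regex(depth-1) + "|" + random_regex(depth-1)
    if rule == 3:                                                   # r1 r2
        return random_regex(depth-1) + random_regex(depth-1)
    return "(" + random_regex(depth-1) + ")*"                       # r*

print(random_regex())   # e.g. 'ab|(c)*'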
Tom, Dick and Harry language
• Generate the set of all lists of names, each of the
form: tom, dick and harry.
• All names but the last two are separated by
commas.
• Duplicate names are okay, such as: tom, tom and
dick.
• Although these are not complete sentences in
normal English, we shall call them “sentences”
since that is what they are in our midget
language.
61
Simple recipe for
generating the language
1. tom is a Name, dick is a Name, harry is a
Name
2. A Name is a Sentence
3. A Sentence followed by a “,” and a Name is
again a Sentence
4. Before finishing, if the Sentence ends in “,
Name” replace it by “and Name”
62
Problem with the recipe
• Clause 4 has trouble:
4. Before finishing, if the Sentence ends in “, Name”
replace it by “and Name”.
• A Sentence does not really end in “, Name”, it ends
in “, harry” or such, and “Name” is just a symbol
that stands for a real name; such symbols cannot
occur in a real sentence and must in the end be
replaced by a real name as given in clause 1:
1. tom is a Name, dick is a Name, harry is a Name.
• Likewise, the word “Sentence” in the recipe is a
symbol that stands for an actual sentence.
63
Two kinds of symbols
• There are two kinds of symbols:
– Real symbols which occur in finished sentences like
“tom”, a comma, and “and”
– Intermediate symbols like “Sentence” and “Name”
that cannot occur in finished sentences
• The first kind are called terminal symbols
(terminals for short).
• The second kind are called non-terminals, a
singularly uninspired term. In linguistic contexts
they are called variables or syntactic categories.
64
The recipe generates sentences
To stress the generative character of the recipe,
we shall replace X is Y by Y may be replaced by
X.
Instead of saying:
tom is a name
we say:
Name may be replaced by tom
65
Revised recipe
1. Name may be replaced by tom
Name may be replaced by dick
Name may be replaced by harry
2. Sentence may be replaced by Name
3. Sentence may be replaced by Sentence, Name
4. “, Name” at the end of a Sentence must be replaced
by “and Name” before Name is replaced by any of its
replacements
5. A sentence is finished only when it no longer
contains non-terminals
6. We start our replacement procedure with Sentence
66
Different types of clauses
1. Name may be replaced by tom
   Name may be replaced by dick
   Name may be replaced by harry
2. Sentence may be replaced by Name
3. Sentence may be replaced by Sentence, Name
4. “, Name” at the end of a Sentence must be replaced by “and Name” before
   Name is replaced by any of its replacements
5. A sentence is finished only when it no longer contains non-terminals
6. We start our replacement procedure with Sentence
Clauses 1 through 4 describe replacements, but 5 and 6
are different:
– Clause 5 is not specific to this grammar. It is valid generally
and it is one of the rules of the game.
– Clause 6 tells us where to start generating. The symbol named
there is called the start symbol, and it is required for every
grammar.
67
Conventions
• For brevity we write → instead of “may be
replaced by”
Instead of writing:
Name may be replaced by tom
We write:
Name → tom
• The part before the → is called the left-hand
side (LHS), the part after it is called the right-hand side (RHS).
68
Finite recipe for generating strings in
the t, d & h language
1. Name → tom
Name → dick
Name → harry
2. Sentence → Name
Sentence → List End
3. List → Name
List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence
69
Transformation of the form of the
recipe rules
Our initial expression of each rule took this form:
tom is a Name
To emphasize the generative nature of the rules we revised the
rules to this form:
Name may be replaced by tom
For brevity we replaced “may be replaced by” with an arrow:
Name → tom
That form is strong enough to serve as the basis for formal
languages.
70
Generating a sentence from the recipe
1. Name → tom
   Name → dick
   Name → harry
2. Sentence → Name
   Sentence → List End
3. List → Name
   List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

Sentence → List End               (2. Sentence → List End)
         → List, Name End         (3. List → List, Name)
         → List, Name, Name End   (3. List → List, Name)
         → Name, Name, Name End   (3. List → Name)
         → tom, Name, Name End    (1. Name → tom)
         → tom, dick, Name End    (1. Name → dick)
         → tom, dick and Name     (4. , Name End → and Name)
         → tom, dick and harry    (1. Name → harry)
71
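The derivation on this slide can be replayed with plain string rewriting (an illustrative sketch of mine); note that rule 4 has a multi-symbol left-hand side, which ordinary string replacement handles just as easily.

def step(form, lhs, rhs):
    # One production step: replace the first occurrence of lhs by rhs.
    assert lhs in form, f"{lhs!r} does not occur in {form!r}"
    return form.replace(lhs, rhs, 1)

form = "Sentence"
form = step(form, "Sentence", "List End")       # 2. -> 'List End'
form = step(form, "List", "List, Name")         # 3. -> 'List, Name End'
form = step(form, "List", "Name")               # 3. -> 'Name, Name End'
form = step(form, "Name", "tom")                # 1. -> 'tom, Name End'
form = step(form, ", Name End", " and Name")    # 4. (leading space keeps words apart)
form = step(form, "Name", "dick")               # 1. -> 'tom and dick'
print(form)                                     # -> 'tom and dick'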
Form is the foundation for
formal grammars
• This form:
Name → tom
“Name may be replaced by tom” is strong
enough to serve as a basis for formal
grammars.
• Similar forms, often called “rewriting
systems”, have a long history among
mathematicians, and were already in use
several centuries B.C. in India.
72
Chomsky
The specific form shown below was first studied
extensively by Chomsky. His analysis has been
the foundation for almost all research and
progress in formal languages, parsers, and a
considerable part of compiler construction and
linguistics.
1. Name → tom
   Name → dick
   Name → harry
2. Sentence → Name
   Sentence → List End
3. List → Name
   List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence
73
Formal languages
• Formal languages are a branch of
mathematics.
• The mathematics of formal languages uses a
special notation that has to be learned. It
allows a very concise expression of what and
how but gives very little information on why.
This tutorial gives the why.
74
Formal definition of a grammar
• A generative grammar G is an ordered four-tuple (VN, VT, S, F) where VN
and VT are finite alphabets with VN ∩ VT = ∅, S is a distinguished symbol of
VN, and F is a finite set of ordered pairs (P, Q) such that P and Q are in
(VN ∪ VT)* and P contains at least one symbol from VN.
• The symbols of VN are called nonterminal symbols or variables and will
usually be denoted by capital letters.
• The symbols of VT are called terminal symbols and will usually be denoted
by small letters.
• The sets VN and VT are disjoint in every grammar.
• The nonterminal symbol S is called the initial symbol and is used to start
the derivations of the sentences of the language.
• The ordered pairs in F are called rewriting rules or productions and will be
written in the form P → Q where the symbol → is, of course, not in
VN ∪ VT.
• Productions are used to derive new sentences from given ones by
replacing a part equal to the left-hand side of a rule by the right-hand side
of the same rule.
75
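The four-tuple translates almost directly into a data structure (a sketch of mine; the class and field names are illustrative, not the book's):

from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    # G = (VN, VT, S, F)
    VN: frozenset   # non-terminal symbols
    VT: frozenset   # terminal symbols
    S: str          # initial (start) symbol
    F: tuple        # productions: pairs (P, Q) of symbol tuples

    def __post_init__(self):
        assert not (self.VN & self.VT), "VN and VT must be disjoint"
        assert self.S in self.VN, "S must be a non-terminal"
        for P, Q in self.F:
            assert any(s in self.VN for s in P), \
                "each left-hand side must contain a non-terminal"

tdh = Grammar(
    VN=frozenset({"Sentence", "List", "Name", "End"}),
    VT=frozenset({"tom", "dick", "harry", ",", "and"}),
    S="Sentence",
    F=((("Name",), ("tom",)),
       (("Sentence",), ("List", "End")),
       ((",", "Name", "End"), ("and", "Name"))),
)
print("grammar with", len(tdh.F), "productions is well-formed")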
LHS must contain a non-terminal
A generative grammar G is an ordered four-tuple (VN, VT, S, F) where VN and VT are finite
alphabets with VN ∩ VT = ∅, S is a distinguished
symbol of VN, and F is a finite set of ordered
pairs (P, Q) such that P and Q are in
(VN ∪ VT)* and P contains at least one symbol
from VN.
P → Q
(P must contain a non-terminal)
76
Phrase structure grammars
1. Name → tom
   Name → dick
   Name → harry
2. Sentence → Name
   Sentence → List End
3. List → Name
   List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence
• The grammar above is in the form of what is known as
a phrase structure grammar for the t,d&h language
(often abbreviated to PS grammar).
• PS grammars have no restriction on the right-hand side
of production rules and on the left-hand side only the
restriction that it contain at least one non-terminal.
• PS grammars are called Type 0 grammars.
77
The alternative ( | ) symbol
1. Name → tom
   Name → dick
   Name → harry
2. Sentence → Name
   Sentence → List End
3. List → Name
   List → List, Name
4. , Name End → and Name
5. the start symbol is Sentence

1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
Several right-hand sides with the same left-hand
side are grouped together and separated by vertical
bars, |. This bar symbol belongs to the formalism,
just as the arrow →, and can be read “or else”. The
right-hand sides separated by vertical bars are also
called alternatives.
78
Sentential forms
• In the process of generating a sentence from a
grammar, a series of intermediate forms are
produced, ultimately leading to the sentence.
• Each intermediate form is called a sentential
form.
• The sentential forms are all the forms that
occur from start symbol to final sentence.
79
Sentential Forms
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
Sentence → List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick and Name
→ tom, dick and harry
Sentential forms
80
Terminology
• If a sentential form contains no non-terminals
it is called a sentence and belongs to the
generated language.
• The transitions (separated by arrows →) are
called production steps.
• The grammar rules are called production rules.
81
Example of terminology usage
We have seen that the sentential forms
occurring in the production process for a finite-state grammar all contain only one non-terminal, except the last.
82
Terminal productions
The strings that are generated from the
start symbol are called the terminal productions.
83
Sentence → List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick and Name
→ tom, dick and harry
Production graph:

[Graph omitted: Sentence fans out to a List and an End node; each List node
expands via List, Name; the Names become tom, dick, and harry, and the
“, Name End” group is rewritten to “and Name”, so the leaves read
tom, dick and harry.]
84
Graphs
• The production process can be made more visual
by drawing connective lines between
corresponding symbols, using a graph.
• A graph is a set of nodes connected by a set of
edges.
• If the edges are arrows, the graph is a directed
graph; if they are lines, the graph is undirected.
– Almost all graphs used in parsing techniques are
directed.
85
Production graph
A graph corresponding to a production process
is called a production graph or syntactic graph
and depicts the syntactic structure (with regard
to the given grammar) of the final sentence.
86
Production graph
The production graph normally
fans out and downwards
[Graph omitted: the production graph for tom, dick and harry, fanning out
downwards from Sentence.]
Starlike
construction
results from
rewriting a
group of
symbols.
87
Graph, not tree
[Graph omitted: the production graph for tom, dick and harry again.]
Trees don’t fan
out and then
come back
together. So this
is a production
graph, not a
production tree.
88
Production graphs are acyclic
• A cycle in a graph is a path from node N following
the arrows, leading back to N.
• A production graph cannot contain cycles. Here’s
why: To get a cycle we would need a non-terminal node N in the production graph that has
produced children that are directly or indirectly N
again. But since the production process always
makes new copies for the nodes it produces, it
cannot produce an already existing node.
• So production graphs are always acyclic. Directed
acyclic graphs are called dags.
89
Exercise
• Draw the production graph for this grammar:
A→B
B→C
C→A
• Assume A is the start symbol.
90
Ha! It’s a trick question
• You don’t draw a production graph for a
grammar. You draw it for the process taken to
generate a sentence.
• The grammar on the previous slide has
no terminals. It loops. So it cannot generate a
sentence.
91
Only legal sentences
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence

• It is impossible to generate: tom, dick, harry
  (note the comma before harry)
• If a sentence has more than one name, this
rule must be used: Sentence → List End
• The only way to remove End is with this rule:
, Name End → and Name
92
Implementing “must replace”
• Recall our recipe: “, Name” at the end of a Sentence
must be replaced by “and Name” before Name is
replaced by any of its replacements.
• Our formalism uses arrow → which means may
replace.
• Amazingly, we have succeeded in implementing the
notion must replace in a system that only uses may
replace.
• We accomplished this by splitting must replace into
may replace (List End may replace Sentence) and
must not be a non-terminal (the items in a finished
sentence must not be non-terminals).
93
Grammar produces many sentences
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
The grammar produces many sentences:
tom, dick and harry
harry and tom
harry
tom, tom, tom and tom
an infinity of others
94
Blind alley
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
Sentence → List End
→ List, Name End
→ List, Name, Name End
→ Name, Name, Name End
→ tom, Name, Name End
→ tom, dick, Name End
→ tom, dick, harry End
There is no rule for just the End non-terminal, so we can proceed no
further with this sentential form.
With the path we have taken, we
have arrived at a blind alley.
95
Frugal framework
• The main properties of a formal grammar are:
– it has production rules, which may be used for
rewriting part of the sentential form
– it has a start symbol, which is the mother of all
sentential forms
• In the production rules we find non-terminals
and terminals; finished sentences contain
terminals only.
• That’s it! It’s a frugal framework.
96
The expressive power of this
frugal framework
• Formal grammars are a framework of
impressive frugality for generating sets.
• Question: Is it sufficient for generating sets?
• Answer: We do not have anything more
expressive. All other methods known to
mankind for generating sets have been proved
to be equivalent-to, or less powerful than a
phrase structure grammar.
97
Computer programs generate sets
• A program reads some data and outputs a
result. That result is called a sentence in
formal language theory.
• Given another input, the program generates
another sentence.
• And so on.
• So a program generates a set of sentences (a
language).
98
Are programs more expressive than
phrase structure grammars?
It has been proved that any set (language) that
can be generated by a program can be
generated by a phrase structure grammar.
99
A stronger method might exist
• There is no proof that a stronger method
cannot exist.
• But in view of the fact that many quite
different methods all turn out to halt (in
expressivity) at the same barrier, it is highly
unlikely that a stronger method will ever be
found.
100
Illustration of the
expressive power of grammars
Below is a grammar for the movements of a Manhattan
turtle. A Manhattan turtle moves in a plane and can only
move north, east, south or west in distances of one block.
The grammar below produces all paths that return to
their own starting point.
Move        → north Move south | east Move west | ε
north east  → east north
north south → south north
north west  → west north
east north  → north east
east south  → south east
east west   → west east
south north → north south
south east  → east south
south west  → west south
west north  → north west
west east   → east west
west south  → south west
101
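The property the grammar is meant to capture, returning to the starting point, is easy to state procedurally (my sketch), which gives a way to check strings the grammar derives: a path returns to the start exactly when it has as many norths as souths and as many easts as wests.

STEP = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def returns_to_start(path):
    # Walk the Manhattan-turtle path and test whether it ends at the origin.
    x = y = 0
    for move in path.split():
        dx, dy = STEP[move]
        x, y = x + dx, y + dy
    return (x, y) == (0, 0)

print(returns_to_start("north east south west"))   # True
print(returns_to_start("north north east south"))  # False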
Production graph for the round trip: north east south west
[Production graph omitted: Move produces north Move south; the inner Move
produces east Move west; the innermost Move produces ε; the swap rule
west south → south west then reorders the moves into north east south west.]
The empty alternative in rule 1 (the ε) results in the innermost Move dying out.
102
The grammar for a set
• There can be infinitely many grammars for a
set.
• By the grammar for a set we mean any
grammar that does the job (generates the
desired set) and is not obviously overly
complicated.
103
Easy grammars and hard grammars
• Some grammars are easy to understand.
• Some simple grammars generate very
complicated sets.
• The grammar for any given set is, however,
usually far from simple.
• Theory says that if a set can be generated at all
(for example, by a program), it can be generated
by a phrase structure grammar.
• But theory does not say that it will be easy to do
so, or that the grammar will be understandable.
104
Unmanageability of
phrase structure grammars
Apart from the intellectual problems phrase
structure grammars pose, they also exhibit
fundamental and practical problems:
– No general parsing algorithm for them can exist.
– All known special parsing algorithms are either
very inefficient or very complex.
105
Chomsky hierarchy
• The desire to restrict the unmanageability of
phrase structure grammars, while keeping as
much of their generative powers as possible,
has led to the Chomsky hierarchy of
grammars.
• The hierarchy has four types of grammars,
numbered 0 to 3.
• It is useful to include a fifth type, called Type
4.
106
Chomsky hierarchy
Type 0 (phrase-structure grammars)
Type 1 (context-sensitive grammars)
Type 2 (context-free grammars)
Type 3 (regular grammars)
Type 4 (finite-choice grammars)
107
Increasingly restricted grammars
• Type 0 grammars are the (unrestricted) phrase
structure grammars.
• The other types originate from applying more and
more restrictions to the allowed form of the rules of
the grammar.
• Each of these restrictions has far-reaching
consequences; the resulting grammars are gradually
easier to understand and manipulate, but are also
gradually less powerful.
• Fortunately, these less powerful types are still very
useful, actually more useful even than Type 0.
108
Type 0: arbitrary number of symbols
on LHS and RHS
The characteristic property of a Type 0 grammar
is that it may contain rules that transform an
arbitrary (non-zero) number of symbols into an
arbitrary (possibly zero) number of symbols.
Example:
, N E → and N
in which 3 symbols are replaced by 2.
109
Type 1 grammars
• Type 1 grammars restrict the freedom of the
Type 0 grammars.
• There are two completely different definitions
of Type 1 grammars, which can be easily
proved to be equivalent:
– Type 1 monotonic
– Type 1 context-sensitive
110
Type 1 monotonic grammars
• A grammar is Type 1 monotonic if every rule
has the same or more symbols on the right-hand side than on the left (the tree expands, doesn’t
contract).
• This forbids, for example, the rule:
, N E → and N
[Diagram: the LHS “, N E” has 3 symbols, the RHS “and N” has 2 symbols, so
the tree is contracting.]
111
Type 1 context-sensitive grammars
• A grammar is Type 1 context-sensitive if all of
its rules are context-sensitive.
• A rule is context-sensitive if only one non-terminal symbol in its left-hand side gets
replaced by other symbols, while we find the
others back, undamaged and in the same
order, in the right-hand side. Example:
Name Comma Name End → Name and Name End
“The rule Comma → and may be applied if the left context is Name and the right context is
Name End.” The replacement must be at least one symbol long; thus context-sensitive
grammars are always monotonic.
112
Example of a CS grammar rule
• Question: Is this a context-sensitive grammar
rule:
S1 a → S1 S a
• Answer: Yes, because S1 is replaced by S1 S.
• “The replacement must be at least one
symbol long”. Thus, the replacement may be
more than one symbol. In this case the
replacement is two symbols.
Note: It is not the case that a is replaced by S a. Why? Because only a
non-terminal may be replaced.
113
Key to writing monotonic grammars
• In writing monotonic grammars one has to be
careful never to produce more symbols than will
eventually appear in the final sentence.
• This rule produces a symbol that must eventually
be deleted:
Sentence → List End
The End symbol does not produce anything and
has to be deleted:
, Name End → and Name
But that is not monotonic.
114
End symbol is deleted
[Graph fragment omitted: Sentence produces List and End; when “, Name End”
is rewritten to “and Name”, the End node produces nothing, i.e. it is deleted.]
115
Monotonic grammar for the
t,d&h language
We avoid the need to delete the End marker by
incorporating End into the rightmost name:
Name       → tom | dick | harry
Sentence   → Name | List
List       → EndName | Name , List
, EndName  → and Name
116
Context-sensitive grammar for the
t,d&h language
Name           → tom | dick | harry
Sentence       → Name | List
List           → EndName | Name Comma List
Comma EndName  → and EndName     (context is … EndName)
and EndName    → and Name        (context is and …)

We had to introduce a new non-terminal, Comma. Here’s why:
Notice that this isn’t correct:

Name          → tom | dick | harry
Sentence      → Name | List
List          → EndName | Name , List
, EndName     → and EndName
and EndName   → and Name

, is a terminal symbol. But in CS grammars only non-terminal symbols are replaced:
“one non-terminal symbol in its left-hand side gets replaced by other symbols”
117
MT = CS and less powerful than PS
• Monotonic and context-sensitive grammars are
equally powerful: for each language that can be
generated by a monotonic grammar a context-sensitive grammar exists that generates the same
language, and vice versa.
• They are less powerful than the Type 0
grammars. There are languages that can be
generated by Type 0 grammars that cannot be
generated by any Type 1 (or Type 2, 3, 4)
grammar.
118
No simple Type 0 grammars
Strangely enough, no simple examples of Type 0
languages are known; only their existence can
be proved.
119
Type 0 (phrase-structure grammars)
  Type 1 (context-sensitive grammars)
    Type 2 (context-free grammars)
      Type 3 (regular grammars)
        Type 4 (finite-choice grammars)

[Annotation on the Type 0 region outside Type 1:] It can be proven that there
are languages in here, but we cannot create grammars that generate them.
[Margin note:] Roger, not sure this is true
120
Contradiction?
• Two slides back it says:
Strangely enough, no simple examples of Type
0 languages are known, only their existence
can be proved.
• But hold on! Didn’t we already see a couple of
Type 0 grammars? Here’s one:
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
This is a PS grammar!
121
Key concept
1. Name → tom | dick | harry
2. Sentence → Name | List End
3. List → Name | List, Name
4. , Name End → and Name
5. the start symbol is Sentence
The above grammar is in the form of a phrase
structure grammar but the language (set) it
generates can be generated by less powerful
grammars.
122
Type of a grammar is
its smallest class
• Any Type 1 grammar is also a Type 0 grammar
since the class of Type 1 grammars is obtained
from the class of Type 0 grammars by applying
restrictions.
• But it would be confusing to call a Type 1
grammar a Type 0 grammar; it would be like
calling a cat a mammal: correct but imprecise.
• A grammar is named after the smallest class (that
is, the highest class number) in which it will still
fit.
123
Using a Type 0 grammar for the
t,d&h language was overkill
• We saw a Type 0 grammar that generates the t,d&h
language:
Name        → tom | dick | harry
Sentence    → Name | List End
List        → Name | List, Name
, Name End  → and Name

• We saw two different Type 1 grammars that generate
the t,d&h language; here’s one of them:

Name           → tom | dick | harry
Sentence       → Name | List
List           → EndName | Name Comma List
Comma EndName  → and EndName
and EndName    → and Name
124
Type n language for a Type n grammar
• A Type n language can be generated by a Type
n grammar or anything stronger, but not by a
weaker Type n+1 grammar.
• If a language is generated by a Type n
grammar, that does not necessarily mean that
there is no (weaker) Type n+1 grammar for it.
– The t,d&h language can be generated by a Type 0
grammar, but it can also be generated by Type 1,
2, and 3 grammars.
125
Constructing a Type 1 grammar
• The standard example of a Type 1 language is
the set of strings that consist of an equal
number of as, bs, and cs, in that order.
aa…a bb…b cc…c
(n of each)
• We shall derive a grammar for this toy
language.
126
Constructing a grammar for aⁿbⁿcⁿ
• Starting with the simplest case, we have the rule:
0. S → abc
• Having obtained one instance of S, we may want
to prepend more as to the beginning. If we want
to remember how many there were, we shall
have to append something to the end as well,
and it cannot be a b or c. We shall use a yet
unknown symbol Q. The following rule both
prepends and appends:
1. S → aSQ
127
Continued
• Now, to get aabbcc from this, each Q must be worth
one b and one c, but we cannot just write:
Q → bc
because that would allow bs after the first c.
• The above rule would, however, be all right if we were
allowed to do replacement only between a b on the
left and a c on the right. There the newly inserted bc
will do no harm:
2. bQc → bbcc
• Still, we cannot apply this rule since normally the Qs
are to the right of the c. This can be remedied by
allowing Q to hop left over c:
3. cQ → Qc
128
Grammar for aⁿbⁿcⁿ
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc

Derivation of a³b³c³:

S            (start)
aSQ          (rule 1)
aaSQQ        (rule 1)
aaabcQQ      (rule 1)
aaabQcQ      (rule 3)
aaabbccQ     (rule 2)
aaabbcQc     (rule 3)
aaabbQcc     (rule 3)
aaabbbccc    (rule 2)
129
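The same string-rewriting helper as before replays this derivation (an illustrative sketch of mine):

def step(form, lhs, rhs):
    # One production step: replace the first occurrence of lhs by rhs.
    assert lhs in form
    return form.replace(lhs, rhs, 1)

form = "S"
for lhs, rhs in [("S", "aSQ"),     # rule 1
                 ("S", "aSQ"),     # rule 1
                 ("S", "abc"),     # rule 1
                 ("cQ", "Qc"),     # rule 3
                 ("bQc", "bbcc"),  # rule 2
                 ("cQ", "Qc"),     # rule 3
                 ("cQ", "Qc"),     # rule 3
                 ("bQc", "bbcc")]: # rule 2
    form = step(form, lhs, rhs)
    print(form)
# S -> aSQ -> aaSQQ -> aaabcQQ -> aaabQcQ -> aaabbccQ
#   -> aaabbcQc -> aaabbQcc -> aaabbbccc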
Derivation graph for a²b²c²

[Graph omitted: S produces aSQ; the inner S produces abc; rule 3 then moves
the Q left over the c, and rule 2 rewrites bQc to bbcc, yielding aabbcc.]

1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
130
Starlike forms

[Graph omitted: the same derivation graph for a²b²c²; the nodes rewritten by
the multi-symbol rules (bQc → bbcc and cQ → Qc) form starlike shapes.]
131
Monotonic or CS?
• Is the following grammar monotonic or context-sensitive?
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
• Answer: it is monotonic. The last rule is not context-sensitive since it does not conform to: only one non-terminal symbol in its left-hand side gets replaced by
other symbols, while we find the others back,
undamaged and in the same order, in the right-hand
side.
132
The aⁿbⁿcⁿ language is Type 1
• It can be proved (using the pumping lemma
for context-free languages) that there is no
Type 2 grammar for the aⁿbⁿcⁿ language.
• We have created a Type 1 grammar for it.
• Therefore it is of Type 1.
133
CS grammars, MT grammars
• Although only context-sensitive Type 1
grammars can by rights be called context-sensitive grammars (CS grammars), that name
is used even if the grammar is actually
monotonic Type 1.
• There are no standard initials for monotonic,
but MT will do.
134
CF grammars
• Type 2 grammars are called context-free
grammars (CF grammars).
• A CF grammar may contain only rules that
have a single non-terminal on their left-hand
side.
135
CS versus CF grammars
Whereas context-sensitive grammars have rules
in which a non-terminal symbol changes within
a context, the rules in context-free grammars
are independent of context (the left and right
contexts are absent/empty).
– The rules in context-free grammars have a single
non-terminal on their left-hand side.
136
Independent production property
• Since there is always only one symbol on the left-hand side, each node in a production graph has
the property that whatever it produces is
independent of what its neighbors produce: the
productive life of a non-terminal is independent
of its context.
• Starlike forms cannot occur. Each node fans out,
no nodes come together.
• Consequently the production graph has a pure
tree-form and is called a production tree.
137
Feature of XML that breaks
production independence
In XML, if an element has an IDREF attribute,
then it is dependent on what its neighbors
produce.
<Book footnote_ref="RB">
<Title>Illusions</Title>
</Book>
<Footnote id="RB">
<Author>Richard Bach</Author>
</Footnote>
The Book element must reside in
a context where there is an
element with an ID attribute
whose value matches the value of
@footnote_ref.
Consider an XML Schema that declares the Book element to have an IDREF
footnote_ref attribute. That XML Schema is a context-free grammar. But the
sentences that are generated (i.e., the XML instances) have additional semantics
that imposes a context-sensitivity on the Book element.
138
Sentences in formal languages
have no semantics
• The sentences that you generate from a formal
grammar have no semantics.
• The symbols in a sentence have no semantics.
• But in an XML instance document that conforms to an
XML Schema the symbols do have semantics.
– Example: an element with an attribute of type IDREF must
reference an ID value. That element/attribute must reside
in a context in which there is a matching ID value. So while
the XML Schema is simply a context-free grammar, the
semantics of the data types imposes an additional layer of
constraint on the XML instance.
This is a key concept
139
CF grammar for the t,d&h language
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name
140
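Because every left-hand side is a single non-terminal, production needs nothing more than recursion (a sketch of mine; alternatives are picked at random):

import random

TDH = {
    "Name":     [["tom"], ["dick"], ["harry"]],
    "Sentence": [["Name"], ["List", "and", "Name"]],
    "List":     [["Name", ",", "List"], ["Name"]],
}

def produce(symbol):
    # A terminal produces itself; a non-terminal is rewritten by one
    # of its alternatives, each member being produced independently.
    if symbol not in TDH:
        return symbol
    return " ".join(produce(s) for s in random.choice(TDH[symbol]))

print(produce("Sentence"))   # e.g. 'tom , dick and harry'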
A production tree for the
CF t,d&h grammar
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

[Production tree omitted: Sentence produces List, and, Name(harry); the List
produces Name(tom), a comma, and an inner List, which produces Name(dick);
the leaves read tom , dick and harry.]
Notice that it is a tree: all nodes fan out, there are no starlike forms.
141
A rule “defines” a non-terminal
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name
• All right-hand sides for a non-terminal are
collected in one grammar rule.
• Each grammar rule reads like a definition of the
left-hand side:
– A Sentence is either a Name or a List followed by and
followed by Name.
– A List is either a Name followed by a , followed by a
List or it is a Name.
142
Sentences are produced using
two processes
Context-free grammars produce sentences by
two processes:
– concatenation (“… followed by …”)
– choice (“either … or …”)
143
Identification mechanism
In addition to the concatenation and choice
processes there is an identification mechanism
which links the name of a non-terminal used in a
right-hand side to its defining rule (“… is a …”).
– Example: “Name is a List” links Name to the rule
that defines List.
144
Each non-terminal
generates a language
• Earlier we identified a “language” as a set of strings: the set
of terminal productions of the start symbol.
• The independent production property allows us to extend
this definition to any non-terminal in the grammar: each
non-terminal produces a set—a language—independent of
the other non-terminals.
• If we write the set of strings produced by 𝐴 as 𝐿(𝐴) and 𝐴
has a production rule with, say, two alternatives,
𝐴 → 𝛼 | 𝛽, then 𝐿(𝐴) = 𝐿(𝛼) ∪ 𝐿(𝛽).
“∪” is the union operator on sets.
• If 𝛼 then consists of, say, three members 𝑃𝑞𝑅, we have
𝐿(𝛼) = 𝐿(𝑃) ○ 𝐿(𝑞) ○ 𝐿(𝑅)
“○” is the concatenation operator on the strings in the sets.
145
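The union and concatenation equations can be executed directly if the sets are cut off at a maximum number of tokens (my sketch; the budget keeps every set finite):

TDH = {
    "Name":     [["tom"], ["dick"], ["harry"]],
    "Sentence": [["Name"], ["List", "and", "Name"]],
    "List":     [["Name", ",", "List"], ["Name"]],
}

def L(symbol, budget=5):
    # Strings (token tuples) of at most `budget` tokens produced by symbol:
    # L(A) is the union over A's alternatives, and the language of an
    # alternative is the concatenation of its members' languages.
    if symbol not in TDH:                       # terminal: L(q) = {q}
        return {(symbol,)} if budget >= 1 else set()
    result = set()
    for alternative in TDH[symbol]:
        partial = {()}
        for member in alternative:
            partial = {p + q for p in partial
                             for q in L(member, budget - len(p))}
        result |= partial
    return result

for s in sorted(L("Sentence")):
    print(" ".join(s))   # 'tom', 'tom and dick', 'tom , dick and harry', ...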
Nullable/Empty
• A non-terminal whose language contains ε is
called nullable.
• One also says that it produces empty.
146
In Type 0 or Type 1 grammars only the
start symbol defines a language
• Recall the Type 1 grammar for anbncn:
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
• We cannot define a language L(Q) since Q
does not produce anything meaningful by
itself.
147
Only for Type 2 and lower
• Defining a language for a non-start symbol is
possible only for Type 2 grammars and lower.
• Defining a non-start non-terminal as nullable
is only possible for Type 2 grammars and
lower.
148
Recursive non-terminals
A non-terminal A is recursive if an A in a
sentential form can produce something that
again contains an A. Example:
A → aA
the A is directly recursive.
Here is an example of indirect recursion:
A → aB
B → bA
A produces aB and B produces bA, which takes
us back to the production for A.
149
Right recursion
A non-terminal A is right-recursive if it can
produce something that has an A at the right
end of the rewrite rule. Example:
A → abcA
150
Left recursive
A non-terminal A is left-recursive if it can
produce something that has an A at the left end
of the rewrite rule. Example:
A → Aabc
151
Self-embedding
A non-terminal A is self-embedding if there is a
derivation in which A produces A with
something, say α, before it and something, say
β, after it. Example:
A → αAβ
152
Nesting
• Self-embedding describes nesting: α is the
part produced when entering another level of
nesting, β is the part produced when leaving
that level.
• The best-known example of nesting is the use
of parentheses in arithmetic expressions:
Arith_expression → … | Simple_expression
Simple_expression → Number | ‘(‘ Arith_expression ‘)’
153
Both left- and right-recursive
A non-terminal can be left-recursive and right-recursive at the same time; it is then self-embedding. Example:
A → Ab | cA | d
A → Ab
→ cAb
→ ccAb
→ ccdb
154
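Whether a non-terminal is recursive is a reachability question (an illustrative sketch of mine): collect every non-terminal that can appear in a sentential form derived from A, and see whether A itself turns up.

GRAMMAR = {
    "A": [["a", "B"]],    # indirect recursion: A -> aB
    "B": [["b", "A"]],    #                     B -> bA
}

def reachable_nonterminals(start, grammar):
    # Non-terminals that can occur in sentential forms derived from start.
    seen, todo = set(), [start]
    while todo:
        for alternative in grammar.get(todo.pop(), []):
            for member in alternative:
                if member in grammar and member not in seen:
                    seen.add(member)
                    todo.append(member)
    return seen

def is_recursive(symbol, grammar):
    return symbol in reachable_nonterminals(symbol, grammar)

print(is_recursive("A", GRAMMAR))   # True, via B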
Recursion is essential
for infinite languages
• If no non-terminal in a grammar is recursive,
each production step uses up one non-terminal since that non-terminal will never
occur again.
• So the production process cannot continue
unlimitedly, and a finite language results.
• Recursion is essential for life in grammars.
155
Can create infinite languages
using the repetition operator
• Thus far we have not used repetition
operators in grammars.
• Later we will extend the grammar syntax to
provide repetition operators.
• With the extended syntax we can create
infinite languages without using recursion.
156
Advantage/disadvantage
of CF grammars
• In the actual world, many things are defined in
terms of other things.
• The advantage of CF grammars is that they are
a very concise way to formulate such
interrelationships.
• The disadvantage of CF grammars is that they
can generate a lot of good-looking nonsense.
157
XML
• XML is a text-markup system.
• Markup is used to express and control the
basic structure.
• An XML instance document is a parse tree!
158
CF grammar for English
If we ignore enough detail we can recognize an
underlying context-free structure in the sentences
of a natural language, for example, English:
Sentence       → Subject Verb Object
Subject        → NounPhrase
Object         → NounPhrase
NounPhrase     → the QualifiedNoun
QualifiedNoun  → Noun | Adjective QualifiedNoun
Noun           → castle | caterpillar | cats
Adjective      → well-read | white | wistful | …
Verb           → admires | bark | criticize | …
159
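Random production from this grammar demonstrates both of the following slides at once (a sketch of mine, reusing the recursive producer idea): grammatical sentences and good-looking nonsense are generated with equal ease.

import random

ENGLISH = {
    "Sentence":      [["Subject", "Verb", "Object"]],
    "Subject":       [["NounPhrase"]],
    "Object":        [["NounPhrase"]],
    "NounPhrase":    [["the", "QualifiedNoun"]],
    "QualifiedNoun": [["Noun"], ["Adjective", "QualifiedNoun"]],
    "Noun":          [["castle"], ["caterpillar"], ["cats"]],
    "Adjective":     [["well-read"], ["white"], ["wistful"]],
    "Verb":          [["admires"], ["bark"], ["criticize"]],
}

def produce(symbol):
    if symbol not in ENGLISH:
        return symbol
    return " ".join(produce(s) for s in random.choice(ENGLISH[symbol]))

for _ in range(3):
    print(produce("Sentence"))
# e.g. 'the cats admires the white well-read castle'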
… which produces sentences like:
the well-read cats criticize the wistful caterpillar
Sentence       → Subject Verb Object
Subject        → NounPhrase
Object         → NounPhrase
NounPhrase     → the QualifiedNoun
QualifiedNoun  → Noun | Adjective QualifiedNoun
Noun           → castle | caterpillar | cats
Adjective      → well-read | white | wistful | …
Verb           → admires | bark | criticize | …
160
Since no context is incorporated, it will
equally well produce this
good-looking nonsense:
the cats admires the white well-read castle
Sentence       → Subject Verb Object
Subject        → NounPhrase
Object         → NounPhrase
NounPhrase     → the QualifiedNoun
QualifiedNoun  → Noun | Adjective QualifiedNoun
Noun           → castle | caterpillar | cats
Adjective      → well-read | white | wistful | …
Verb           → admires | bark | criticize | …
161
For keeping context we could use a
phrase structure grammar:
Sentence       → Noun Number Verb
Number         → Singular | Plural
Noun Singular  → castle Singular | caterpillar Singular | …
Singular Verb  → Singular admires | …
Singular       → ε
Noun Plural    → cats Plural
Plural Verb    → Plural bark | Plural criticize | …
Plural         → ε
The markers Singular and Plural control the production of
English words. Still, this grammar allows the cats to bark …
For a better way to handle context, see various sections in
Chapter 15, especially Van Wijngaarden grammars (Section
15.2) and attribute and affix grammars (Section 15.3).
162
Programming languages
are defined using CF grammars
• The bulk of examples of CF grammars originate
from programming languages.
• Sentences in these languages (that is, programs)
have to be processed automatically (by a
compiler) and it was recognized early (around
1958) that this is much easier if the language has
a well-defined formal grammar.
• The syntaxes of all programming languages in use
today are defined through formal grammars.
163
XML
• XML Schemas specify XML languages using CF
grammars.
• Sentences in XML languages (that is, XML
instances) have to be validated automatically (by
a validator) and it was recognized early (around
1999) that this is much easier if the XML language
has a well-defined formal grammar.
• The syntaxes of nearly all XML languages in use
today are defined through XML Schemas.
164
ε-rules, ε-free
• A grammar rule that has an empty right-hand
side:
A→ε
is called an ε-rule. Read that rule as: A may be
replaced by the empty string (which we denote
by ε).
– An empty string is a string of length zero; it contains
no characters from the alphabet, Σ.
• A grammar that contains no such rules is called ε-free.
165
Non-monotonic CF grammar
The only way a CF rule can be non-monotonic is
by having an ε-rule. A grammar containing this
rule would not be monotonic:
A→ε
166
Require monotonicity
• Some authors (for example, Chomsky) and
some parsing algorithms require a CF
grammar to be monotonic.
• This means that it must be ε-free.
167
Making a grammar ε-free
• Almost any CF grammar can be made ε-free by
systematic substitution of the ε-rules.
• The exception is a grammar in which the start
symbol produces ε.
• The transformation of a CF grammar that
contains ε-rules into an ε-free grammar is
explained in Section 4.2.3.1
168
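To make the substitution idea concrete, here is a minimal Python sketch (an illustration only, not the book’s Section 4.2.3.1 algorithm; the grammar encoding and the function name are my own) of the first step of the transformation — finding the non-terminals that can produce ε:

# Grammar: dict from non-terminal to list of alternatives;
# each alternative is a list of symbols, and [] is an ε-rule.
def nullable_nonterminals(grammar):
    """Return the set of non-terminals that can produce the empty string."""
    nullable = set()
    changed = True
    while changed:                       # fixed-point iteration
        changed = False
        for nt, alternatives in grammar.items():
            if nt in nullable:
                continue
            for alt in alternatives:
                # an alternative derives ε if all its symbols are nullable
                if all(sym in nullable for sym in alt):
                    nullable.add(nt)
                    changed = True
                    break
    return nullable

# The 'input' grammar of the next slides:
grammar = {
    "input": [["zero-or-more-bits-of-info", "question"]],
    "zero-or-more-bits-of-info":
        [["bit-of-info", "zero-or-more-bits-of-info"], []],
}
print(nullable_nonterminals(grammar))    # {'zero-or-more-bits-of-info'}

Once the nullable set is known, each rule can be copied with every combination of nullable symbols left out, after which the ε-rules themselves can be dropped (except, as the slide notes, when the start symbol produces ε).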
Advantage of ε-free CF grammars
The proofs and parsers are less complicated,
sometimes much less complicated.
169
Disadvantage of ε-free CF grammars
• The disadvantage with transforming a CF
grammar to remove ε-rules is that the
resulting grammar will almost always be more
complicated.
• Example: Suppose we have a system that can
be fed bits of information like: “Amsterdam is
the capital of the Netherlands”, “Truffles are
expensive”, and can then be asked a question.
continued
170
Disadvantage of ε-free CF grammars
• On a superficial level we can define its input as:
input → zero-or-more-bits-of-info question
zero-or-more-bits-of-info → bit-of-info zero-or-more-bits-of-info | ε
• This definition of input neatly fits the user’s view of the problem.
• Here is an ε-free grammar for the input:
input → question-preceded-by-info
question-preceded-by-info → question | bit-of-info question-preceded-by-info
• This second definition does not fit the user’s view of the problem.
• As a grammar becomes more and more complicated, the
requirement that it be ε-free becomes more and more of a
nuisance: the grammar is working against us, not for us.
171
No problem theoretically
• Requiring grammars be ε-free presents no
problem from a theoretical point of view: any CF
language can be described by an ε-free CF
grammar and ε-rules are never needed.
• Better still, any grammar with ε-rules can be
mechanically transformed into an ε-free grammar
for the same language.
• But the price we pay is that of any grammar
transformation: it is no longer our grammar and it
does not reflect the original structure as well.
172
The ε-rule is a useful tool
The bottom line is that the practitioner finds the
ε-rule to be a useful tool.
173
ε-rules make parsing problematic
• Many parsing methods will in principle work for
ε-free grammars only: if something does not
produce anything, you can’t very well see if it’s
there.
• Often the parsing method can be doctored to
handle ε-rules, but that invariably increases the
complexity of the method.
• If ε-rules did not exist, then the topic of parsing
would be 30% smaller – but then grammars
would lose much more than 30% of their
usefulness.
174
Advantage of using ε-rules
The advantage is that ε-rules are very
convenient for the grammar writer and user.
175
Notational style: BNF
• There are several different styles of notation
for CF grammars of programming languages.
• They are all functionally equivalent.
• The first is the Backus-Naur Form (BNF) which
was first used to define ALGOL 60. Here is a
sample:
<name>     ::= tom | dick | harry
<sentence> ::= <name> | <list> and <name>
<list>     ::= <name> , <list> | <name>
• Angle brackets are used to enclose non-terminals and ::= is used for “may produce”.
176
Notational style: van Wijngaarden
• The second notational style is that of van
Wijngaarden. Here is a sample:
name:     tom symbol; dick symbol; harry symbol.
sentence: name, list, and symbol, name.
list:     name, comma symbol, list; name.
• The names of symbols end in …symbol; their
representations are hardware-dependent and
are not defined in the grammar.
• Rules are terminated with a period.
177
van Wijngaarden grammars
name:     tom symbol; dick symbol; harry symbol.
sentence: name, list, and symbol, name.
list:     name, comma symbol, list; name.
• Punctuation is used in the traditional way. For example, the comma binds
tighter than the semicolon.
• The punctuation can be read as follows:
:   is defined as a(n)
;   , or as a(n)
,   followed by a(n)
.   , and as nothing else.
• So this rule
sentence: name, list, and symbol, name.
would be read as: A sentence is defined as a name followed by a list
followed by an and-symbol followed by a name, and as nothing else.
178
van Wijngaarden grammars
• The van Wijngaarden notation achieves its full
power only when applied to the two-level
van Wijngaarden grammars.
• But it also has merits on its own: it is formal
and still quite readable.
179
Extended CF grammars
• CF grammars are made more compact and more readable
by introducing special shorthands for frequently used
constructions.
• Rules like:
List → Item | Item List
are written in an extended CF grammar as:
List → Item+
Item+ means “one or more Items”
• We do not need to give a rule for Item+, the rule:
Item+ → Item | Item Item+
is implicit.
• This notation for grammars is called Extended BNF (EBNF).
180
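A minimal Python sketch of the implicit rule (an illustration; the encoding is my own) — rewriting an EBNF item+ into the normal CF rule it abbreviates:

def expand_plus(item):
    """Item+ → Item | Item Item+ : return (lhs, list of alternatives)."""
    lhs = item + "+"
    return (lhs, [[item], [item, lhs]])

print(expand_plus("Item"))   # ('Item+', [['Item'], ['Item', 'Item+']])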
Extended CF grammars (cont.)
• Likewise, rules like:
List → ε | Item List
are written in an extended CF grammar as:
List → Item*
Item* means “zero or more Items”
• The rule:
Item* → ε | Item Item*
is implicit.
181
Extended CF grammars (cont.)
• Finally, rules like:
Item → ε | Item
are written in an extended CF grammar as:
Item → Item?
Item? means “zero or one Item” (optional
Item)
• The rule:
Item? → ε | Item
is implicit.
182
Repetition operators
+, *, and ? are called repetition operators.
183
Extending an operator’s range
• In the preceding examples the operators +, *,
and ? work on the preceding symbol.
• Their range can be extended by using
parentheses:
(Item ;)? means “optionally an Item
followed by a ; ”
184
Advantage of extended grammar
The advantage of the repetition operators and
parentheses is that grammars can be written more
efficiently, more compactly, and more readably.
185
Illustrate the gain in efficiency,
compactness, and readability
Book              → Preface ChapterSequence Conclusion
Preface           → “PREFACE” ParagraphSequence
ChapterSequence   → Chapter | Chapter ChapterSequence
Chapter           → “CHAPTER” Number ParagraphSequence
ParagraphSequence → Paragraph | Paragraph ParagraphSequence
Paragraph         → SentenceSequence
SentenceSequence  → …
Conclusion        → “CONCLUSION” ParagraphSequence
Use extended operators
Book       → Preface Chapter+ Conclusion
Preface    → “PREFACE” Paragraph+
Chapter    → “CHAPTER” Number Paragraph+
Paragraph  → Sentence+
Sentence   → …
Conclusion → “CONCLUSION” Paragraph+
186
Overdoing a good thing
Some styles even allow constructions like:
– Item+4 meaning “one or more Items, with a maximum of 4”
– Item+, meaning “one or more Items separated by commas”
187
No increase in expressive power
• The extensions of an EBNF grammar do not
increase its expressive power.
• All implicit rules can be made explicit and then
a normal CF grammar results.
• Their strength lies in their user-friendliness.
188
Kleene star
• The star in the notation X* is called the Kleene
star
• As we’ve seen, in a grammar X* should be
read as “zero or more Xs ”
• If X is a set, X* should be read as “the set of
zero or more elements of X concatenated ”
• We denote an alphabet by Σ and the set of all
strings over the alphabet by Σ*
189
Regular expressions
• Forms involving the repetition operators *, +,
or ? and possibly the separators ( and ) are
called regular expressions.
• EBNFs, which have regular expressions for
their right-hand sides, are sometimes called
regular right part grammars (RRP grammars),
which is more descriptive than “extended
context free” but is more of a tongue twister.
190
Structural meaning of a
regular right-hand side
• There are two schools of thought about the
structural meaning of a regular right-hand
side.
• One school maintains that a rule like:
Book → Preface Chapter+ Conclusion
is an abbreviation of:
Book → Preface α Conclusion
α → Chapter | Chapter α
• That is, a right-recursive interpretation.
191
Right recursive interpretation
• The advantage of a right recursive
interpretation is that it is easy to explain and
the transformation to “normal” CF is simple.
• The disadvantages are:
– The transformation entails anonymous rules
(identified by α).
– The lopsided production tree does not correspond
to our idea of the structure of the Book (see figure
on next slide).
192
Production tree for a
right recursive interpretation
Book → Preface α Conclusion
α → Chapter | Chapter α

[Figure: the production tree — Book has children Preface, α, and
Conclusion; each α has a Chapter child and, except for the last, another
α child, so the Chapters hang off a lopsided right-descending chain.]
193
Iterative interpretation
• The second school of thought claims that:
Book → Preface Chapter+ Conclusion
is an abbreviation of:
Book → Preface Chapter Conclusion
| Preface Chapter Chapter Conclusion
| Preface Chapter Chapter Chapter Conclusion
|…
• It has the advantage that it yields a beautiful
production tree (see figure on next slide), but the
disadvantages are that it involves an infinite
number of production rules and that the nodes in
the production tree have varying fan-out.
194
Production tree for the
iterative interpretation
Book → Preface Chapter Chapter Chapter Chapter Conclusion

[Figure: the production tree — Preface, the four Chapters, and
Conclusion are all direct children of Book.]
195
Which interpretation is
commonly used?
Since the iterative implementation is complex,
most practical parser generators use the
recursive interpretation in some form or
another, whereas most research has been done
on the iterative interpretation.
196
XML/XSD uses the
iterative interpretation
Book → Preface Chapter+ Conclusion
is expressed in XSD like so:
<xs:element name="Book">
<xs:complexType>
<xs:sequence>
<xs:element name="Preface">...</xs:element>
<xs:element name="Chapter" maxOccurs="unbounded">...</xs:element>
<xs:element name="Conclusion">...</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
Which is interpreted using the iterative interpretation:
<Book>
<Preface>...</Preface>
<Chapter>...</Chapter>
<Chapter>...</Chapter>
<Chapter>...</Chapter>
<Conclusion>...</Conclusion>
</Book>
Wrong! The XML instance is a sentence of
the grammar. The issue is: what is the parse
tree (DOM tree) for this instance?
197
Basic property of CF grammars
• The basic property of CF grammars is that they
describe things that nest: an object may
contain other objects in various places, which
in turn may contain … etc.
• When during the production process we have
finished producing one of the objects, the
right-hand side still “remembers” what has to
come after it.
198
While working on expanding a non-terminal, the
following symbols remain queued-up
Sentence      → Subject Verb Object
Subject       → NounPhrase
Object        → NounPhrase
NounPhrase    → the QualifiedNoun
QualifiedNoun → Noun | Adjective QualifiedNoun
Noun          → castle | caterpillar | cats
Adjective     → well-read | white | wistful | …
Verb          → admires | bark | criticize | …
After having descended into the depth of the non-terminal
Subject to produce something like the wistful cat, the right-hand
side Subject Verb Object still remembers that a Verb must
follow. While we are working on the Subject, the Verb and
Object symbols remain queued at the right in the sentential
form. For example:
the wistful QualifiedNoun Verb Object
199
Here’s what makes CF languages
so useful
• It is the parsability that make CF languages so
useful, not the fact that they stem from the
Chomsky hierarchy.
• Parsing is the task of converting a string to the
production tree.
200
Type 3 grammars
• The restriction to Type 3 disallows the recollection of
things that came before.
• A right-hand side may only contain one non-terminal
and it must come at the end.
• This means that there are only two kinds of rules:
– A non-terminal produces zero or more terminals.
– A non-terminal produces zero or more terminals followed
by one non-terminal.
• Example: the language a*b+ is generated by this Type 3
grammar:
S → a*B
B → b+
201
Type 2 versus Type 3
• Type 2 allows queuing-up whereas Type 3
does not.
• Type 2 allows recollection of things that came
before whereas Type 3 does not.
This is a key concept
202
Chomsky definition of Type 3
Our definition of Type 3:
A non-terminal produces zero or more terminals.
A non-terminal produces zero or more terminals followed by one non-terminal.
Chomsky definition of Type 3:
A non-terminal produces one terminal.
A non-terminal produces one terminal followed by one non-terminal.
• Our definition is equivalent and more convenient.
• It is not completely trivial to convert a Type 3
grammar under our definition to a grammar
under the Chomsky definition.
203
Type 2.5 grammar (linear grammar)
• A Type 2.5 grammar allows a single non-terminal on
the right-hand side, and it doesn’t have to be at the end.
• This kind of grammar is called a linear
grammar.
204
Note the equivalence between a Type 3
grammar and a finite-state automaton
Type 3 grammar:
S → a*B
B → b+
Finite-state automaton:

[Figure: two states, S and B; an a-loop on S, a b-transition from S to B,
and a b-loop on B; B is the accepting state.]
205
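The correspondence can be made concrete in a few lines of Python (a sketch; the state encoding is my own):

def accepts(s):
    """Finite-state automaton for a*b+ (grammar S → a*B, B → b+)."""
    state = "S"
    for ch in s:
        if state == "S" and ch == "a":
            state = "S"              # a-loop on S
        elif state == "S" and ch == "b":
            state = "B"              # first b: move to B
        elif state == "B" and ch == "b":
            state = "B"              # b-loop on B
        else:
            return False             # no transition: reject
    return state == "B"              # B is the accepting state

for s in ["b", "aab", "aabbb", "", "aa", "aba"]:
    print(s, accepts(s))   # the first three are accepted, the rest rejected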
Why grammars, not automata?
• There is a close relationship between formal grammars
and other abstract notions used in computer science,
such as automata and algorithms.
• Indeed, since the results in one theory can often be
translated into another, it seems to be an arbitrary
decision as to which interpretation is primary.
• In these slides formal grammars are given preferential
treatment because they are probably the most
commonly known of the various theories among
computer scientists.
• This is due to the success of the context-free grammars
in describing the syntax of programming languages.
206
Type 3 = Regular grammars
• Type 3 grammars are also called regular
grammars (RE grammars) or finite-state
grammars (FS grammars)
• More precisely, the version defined below is
called right-regular, since the only non-terminal in a
rule is found at the right end of the right-hand side.
Our definition of Type 3:
A non-terminal produces zero or more terminals.
A non-terminal produces zero or more terminals followed by one non-terminal.
207
Type 3 grammars can recurse
• Recall the rules on Type 3 grammars:
– A non-terminal produces zero or more terminals.
– A non-terminal produces zero or more terminals
followed by one non-terminal
• This grammar conforms to the rules; therefore
it is Type 3:
A → a | aA (it is recursive)
208
Left-regular grammars
• The left-regular grammars are subject to the
restriction that the only non-terminal in a rule must be
at the left end of the right-hand side:
– A non-terminal produces zero or more terminals.
– A non-terminal produces one non-terminal followed by
zero or more terminals.
• Example: this Type 3 left-regular grammar generates
the language a*b*
S → Ab*
A → a*
• Left-regular grammars are less intuitive than
right-regular grammars, occur less frequently, and are
more difficult to process, but they do occur occasionally.
209
Regular grammar =
right-regular grammar
Given the prevalence of right-regular over left-regular,
the term “regular grammar” is usually
intended to mean “right-regular grammar.”
210
Right-recursive vs. right-regular
• A non-terminal A is right-recursive if it can
produce a sentential form that has an A at the
right end.
– Right-recursive means that rule A can be used
again in the production process
• A rule is right-regular simply means that its
non-terminal is at the right end, following any
terminal symbols. Right-regular has nothing to
do with recursion.
211
Non-nesting
Regular grammars don’t nest
212
Common usage
• Regular grammars are used very often to
describe the structure of text at the character
level, in lexers.
• It is customary for the terminal symbols of a
regular grammar to be single characters.
213
Type 3 grammars for the
t,d,&h language
Right-regular grammar:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Left-regular grammar:
Sentence → t | d | h | List
List → ListHead &t | ListHead &d | ListHead &h
ListHead → ListHead , t | ListHead , d | ListHead , h | t | d | h
214
Grammar and equivalent automaton
Right-regular grammar:
S → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Automaton:

[Figure: states S, List, and ListTail — S accepts a single t, d, or h, or
moves on ε to List; List reads t, d, or h and moves to ListTail; ListTail
either reads , back to List or reads & followed by a final t, d, or h.]
215
Challenge of left-regular grammars
• Here is a Type 3 grammar (left-regular):
S → Ab*
A → a*
• Doing recognition with a left-regular grammar requires
a non-deterministic automaton. See Section 5.3.
216
Production chain
• The production tree for a sentence from a
Type 3 (right-regular) grammar degenerates
into a production chain of non-terminals that
drop a sequence of terminals on their left.
• See the next slide for an example.
217
Right-regular grammar:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Production chain:

[Figure: the production chain for t , d & h — Sentence hangs down to
List, then ListTail, then List, then ListTail, each non-terminal dropping
its terminals (t , d & h) to the left.]
218
The […] notational device
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
There is a lot of repeating in the above grammar.
A notational device has been invented to abate
this nuisance. Square brackets are used to
indicate “one out of a set of characters”:
[tdh] is an abbreviation for t | d | h
Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → , List | & [tdh]
219
The macro notational device
• A macro is a name for pieces of the grammar.
• The macro is referenced by preceding the
name with a $ symbol.
• A referenced macro is substituted textually into
the grammar.
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h

Using a macro:
Name     → t | d | h
Sentence → $Name | List
List     → $Name ListTail
ListTail → , List | & $Name
220
The lex parser
• lex is a popular generator of parsers (lexical analyzers) for regular grammars.
• It supports both notational devices: the […]
device and the macro device.
221
Type 3 Chomsky definition of t,d&h
Chomsky definition of Type 3:
A non-terminal produces one terminal.
A non-terminal produces one terminal followed by one non-terminal.
The following grammar does not adhere to the
Chomsky definition of Type 3:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
(rules such as ListTail → & t have 2 terminals – not allowed
in the Chomsky definition)
222
Convert to the Chomsky definition
Chomsky definition of Type 3:
A non-terminal produces one terminal.
A non-terminal produces one terminal followed by one non-terminal.
If we adhere to the Chomsky definition of Type
3, our grammar will not be smaller than:
Our Type 3 definition of the t,d&h language:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h

Chomsky Type 3 definition of the t,d&h language:
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & Name
Name → t | d | h
223
Chomsky version of t,d&h
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & Name
Name → t | d | h
The Chomsky version is:
– Bigger (an additional rule is needed)
– Not as easy to read as the version that uses […] or
macros
– But it is easier to process
224
Key point
There is variation in how people define Type 0, Type 1,
Type 2, and Type 3. Depending on which definition you
use, the grammars you create may vary in user friendliness
and in ease of processing.
225
Formal linguist vs. Computer scientist
• Chomsky’s definition of Type 3 has minimal
mechanisms. The formal linguist is interested
in and helped by minimally sufficient
grammars.
• The computer scientist values a form in which
the concepts underlying the grammar
($Name, etc.) are easily expressed, at the
expense of additional processing.
226
Two observations about
regular grammars
• The sentential forms will only contain one
non-terminal and this will always be at the
end.
• The size of regular grammars can be reduced
considerably by using the repetition operators
*, +, and ? along with the grouping operators (
and ).
227
One non-terminal, at the end
Sentence → t | d | h | List
List → t ListTail | d ListTail | h ListTail
ListTail → , List | & t | & d | & h
Sentence → List → t ListTail → t , List → t , d ListTail → t , d & h
Notice in these sentential forms that there is one non-terminal and it is at the end.
228
Smaller and simpler
All regular grammars can be reduced
considerably in size by using the repetition
operators *, +, and ? for “zero or more”, “one or
more” and “optionally one”, respectively.
Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → , List | & [tdh]
Using the repetition operators along with ( and ) for grouping, we can
simplify the grammar to:
Sentence → (( [tdh] , )* [tdh] & )? [tdh]
229
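The simplified rule can be tried directly as a regular expression with Python’s re module (a sketch; the anchoring and the test strings are my own):

import re

# Sentence → (( [tdh] , )* [tdh] & )? [tdh]
sentence = re.compile(r"^(?:(?:[tdh],)*[tdh]&)?[tdh]$")

for s in ["t", "d&h", "t,d&h", "t,d,h&t", "t,d,h", "&h"]:
    print(s, bool(sentence.match(s)))
# t, d&h, t,d&h and t,d,h&t match; t,d,h and &h do not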
Regular expressions
Regular expressions exist for all Type 3
grammars.
Sentence → (( [tdh] , )* [tdh] & )? [tdh]

The right-hand side is a regular expression; the rule as a whole is a
regular grammar (Type 3 grammar) that uses repetition operators along
with ( and ).
230
Type 4 grammars
• The last restriction we shall apply to what is
allowed in a production rule is a pretty final one:
no non-terminal is allowed in the right-hand side.
• This removes all generative power from the
mechanism, except for the choosing of
alternatives.
• The start symbol has a (finite) list of alternatives
from which we are allowed to choose.
• Type 4 grammars are named finite-choice
grammars (FC grammars).
231
Example of a FC grammar
• There is no FC grammar for the t,d&h language.
• If, however, we are willing to restrict ourselves to
lists of names of finite length (say, no more than
three), then there is a FC grammar, since one
could enumerate all combinations. For three
names we get:
S → [tdh] | [tdh] & [tdh] | [tdh] , [tdh] & [tdh]
for a total of 3 + 3 x 3 + 3 x 3 x 3 = 39 production
rules.
232
Chomsky: No FC grammar
• FC grammars are not part of the official
Chomsky hierarchy in that they are not
identified by Chomsky.
• They are nevertheless very useful and are
often required as a tail-piece in some process
or reasoning.
• For example, the set of reserved words
(keywords) in a programming language can be
described by a FC grammar.
233
Parts of grammars are FC
• Although not many grammars are FC in their
entirety, some of the rules in many grammars
are finite-choice.
• For example, the first t,d&h grammar we
looked at has a FC rule:
Name → tom            ← finite choice
Name → dick           ← finite choice
Name → harry          ← finite choice
Sentence → Name
Sentence → List End
List → Name
List → List , Name
, Name End → and Name
(the start symbol is Sentence)
234
Summary
The below table summarizes the most
complicated data structures that can occur in
the production of a sentence, with respect to
the grammar type used.
Chomsky type   Grammar type   Most complicated data structure
0/1            PS/CS          production dag
2              CF             production tree
3              FS             production chain
4              FC             production element

Legend:
dag: directed acyclic graph
PS: phrase-structure
CS: context-sensitive
FS: finite-state
FC: finite-choice
235
XML is a graph
• XML is a graph due to ID-IDREF.
• So it must be Type 0/1.
236
Symbology
Let:
VN denote the set of non-terminal symbols
VT the set of terminal symbols
S the start symbol
F the production rules
237
Formal definition of
Type 0, 1, 2, 3 grammars
A generative grammar G = (VN, VT, S, F) is said to
be of Type i if it satisfies the restrictions
described in this list:
i = 0:
No restrictions except the LHS must contain at least
one non-terminal
i = 1:
Every rewriting rule in F has the form Q1AQ2 → Q1PQ2,
with Q1, Q2, and P in (VN ∪ VT)*, A ∈ VN, and P ≠ ε,
except possibly for the rule S → ε, which may occur in
F, in which case S does not occur on the right-hand
sides of the rules.
i = 2:
Every rule in F has the form A → P, where A ∈ VN, and
P ∈ (VN ∪ VT)*.
i = 3:
Every rule in F has the form A → PB or A → P, where
A, B ∈ VN, and P ∈ VT*.
238
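These forms can be checked mechanically. A Python sketch for the Type 2 and Type 3 cases (the rule encoding and helper names are my own; Types 0 and 1 need the full left-hand-side forms and are omitted here):

def is_type2(rules, nonterminals):
    """Every rule has the form A → P, A a single non-terminal."""
    return all(len(lhs) == 1 and lhs[0] in nonterminals
               for lhs, rhs in rules)

def is_type3(rules, nonterminals):
    """Every rule has the form A → PB or A → P, with P terminals only."""
    def rhs_ok(rhs):
        body = rhs[:-1] if rhs and rhs[-1] in nonterminals else rhs
        return all(sym not in nonterminals for sym in body)
    return is_type2(rules, nonterminals) and all(rhs_ok(rhs)
                                                 for lhs, rhs in rules)

# Rules are (lhs, rhs) pairs of symbol tuples: S → aS, S → b.
rules = [(("S",), ("a", "S")), (("S",), ("b",))]
print(is_type2(rules, {"S"}), is_type3(rules, {"S"}))   # True True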
Monotonic not necessarily a
context-sensitive grammar
• A grammar is Type 1 monotonic if every rule has the same or more
symbols on the right-hand side (the tree expands, doesn’t contract).
• A grammar is Type 1 context-sensitive if all of its rules are
context-sensitive. A rule is context-sensitive if only one (non-terminal)
symbol in its left-hand side gets replaced by other symbols, while
we find the others back, undamaged and in the same order, in the
right-hand side.
• The grammar below for aⁿbⁿcⁿ is Type 1 monotonic but not Type 1
context-sensitive:
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
Every rule has the same or more symbols on the right-hand side, so
it is monotonic.
239
Context-sensitive grammar for aⁿbⁿcⁿ
The below grammar for aⁿbⁿcⁿ is Type 1 context-sensitive:

CS grammar for aⁿbⁿcⁿ:
1. S  → aSBC | aBC
2. CB → HB
3. HB → HC
4. HC → BC
5. aB → ab
6. bB → bb
7. bC → bc
8. cC → cc

Derivation of a²b²c²:
S          (start)
aSBC       (rule 1)
aaBCBC     (rule 1)
aabCBC     (rule 5)
aabHBC     (rule 2)
aabHCC     (rule 3)
aabBCC     (rule 4)
aabbCC     (rule 6)
aabbcC     (rule 7)
aabbcc     (rule 8)

http://www.answers.com/topic/context-sensitive-grammar
240
Generating sentences from a grammar
• Until now we have only produced single
sentences from our grammars, in an ad hoc
fashion.
• But the purpose of a grammar is to generate
all of its sentences.
• Fortunately there is a systematic way to do so.
241
Production queue
[Figure: the queue holds the sentential form aSQ; substituting S via
S → abc | aSQ appends aabcQ and aaSQQ to the queue.]

S → abc | aSQ
bQc → bbcc
cQ → Qc

We can systematically generate all sentences using a queue.
242
Systematic way to enumerate all
sentences in a PS grammar
• Begin with S as the only sentential form in the queue. Now
continue doing the following:
– Consider the first sentential form in the queue.
– Scan it from left to right, looking for a substring that matches the
left-hand side of a production rule.
– For each such production rule found, make a copy of the sentential
form, replace the substring with the production rule’s right-hand side,
and add the revised sentential form to the end of the queue.
– If the original sentential form does not contain any non-terminals,
write it down as a sentence in the language.
– Throw away the original sentential form; it has been fully processed.
• If no rule matched and the sentential form was not a finished
sentence, it was a blind alley; blind alleys are removed automatically
by the above process and leave no trace.
243
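A Python sketch of this queue procedure (the encoding is my own, and a step limit is added so the loop always stops; non-terminals are written as upper-case letters):

from collections import deque

def generate(rules, start, max_steps=1000):
    """Breadth-first sentence generation for a PS grammar.
    rules: list of (lhs, rhs) string pairs."""
    sentences = []
    queue = deque([start])
    for _ in range(max_steps):
        if not queue:
            break
        form = queue.popleft()
        matched = False
        for lhs, rhs in rules:
            i = form.find(lhs)               # leftmost match of this LHS
            if i >= 0:
                matched = True
                queue.append(form[:i] + rhs + form[i + len(lhs):])
        if not matched and form.islower():   # no non-terminals left
            sentences.append(form)           # it is a sentence
        # forms with non-terminals but no match are blind alleys: dropped
    return sentences

# The aⁿbⁿcⁿ grammar of the following slides:
rules = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]
print(generate(rules, "S")[:3])   # ['abc', 'aabbcc', 'aaabbbccc']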
Recursively enumerable
• The procedure on the previous slide
enumerates all strings in a PS language.
• Thus, PS languages are also called recursively
enumerable sets, where “recursively” is taken
to mean “by a possibly recursive algorithm.”
244
Non-recursively enumerable sets
There are sets that are not recursively enumerable; the set of all Type 0
grammars that do not produce the empty string is an example. There is
no grammar for it, because this set cannot be generated (you cannot tell
whether a Type 0 grammar produces the empty string). If phrase
structure is not sufficient, only a natural-language description will do,
as shown here.
245
Language-generating procedure
The queue procedure is a systematic way of
producing all the strings that a grammar is
capable of generating. That is, it is a
language-generating procedure.
246
Let’s see the procedure in action for the aⁿbⁿcⁿ grammar:
S → abc | aSQ
bQc → bbcc
cQ → Qc

Step  Queue (front of queue on left)               Result
1     S
2     abc aSQ                                      abc
3     aSQ
4     aabcQ aaSQQ
5     aaSQQ aabQc
6     aabQc aaabcQQ aaaSQQQ
7     aaabcQQ aaaSQQQ aabbcc
8     aaaSQQQ aabbcc aaabQcQ
9     aabbcc aaabQcQ aaaabcQQQ aaaaSQQQQ           aabbcc
10    aaabQcQ aaaabcQQQ aaaaSQQQQ
11    aaaabcQQQ aaaaSQQQQ aaabbccQ aaabQQc
…     …
247
Every sentence will be produced
• The table on the previous slide shows that we do
not get a sentence each time we turn the crank
(each time we process the item at the front of the
queue).
• In fact, real sentences will get scarcer and scarcer.
The reason is that during the process more and
more side-lines develop which all require equal
attention.
• Still, we can be certain that every sentence that
can be produced, will be produced.
248
Breadth-first generator
This way of doing things is called breadth-first
production. Computers are better at it than
humans.
[Figure: the tree of sentential forms — S branches to abc (output) and
aSQ; aSQ branches to aabcQ and aaSQQ; aabcQ leads to aabQc; aaSQQ
branches to aaabcQQ and aaaSQQQ; and so on.]
249
Replace all left-hand sides
[Figure: the queue holds the sentential form AC; substituting A and AC
appends bC and ac to the queue.]

S → AC
A → b
AC → ac

If we were to substitute only A, the remainder bC contains C, which is a
blind alley. Doing both substitutions (replace A by b, and AC by ac) also
leads to that blind alley, but there will be an output, ac.
250
Sentential form provides a context
[Figure: a queue entry abcXYdef; the symbols around XY are its context.]

The sentential form provides a context. If you ignore a context you run
the risk of creating false productions.
Remember, this discussion is just for phrase-structure grammars.
251
Grammar that generates the empty set
• What language will this grammar generate?
S → AB
A→B
B→A
• Let’s show a few sentential forms:
S → AB → BB → AB → BB → AB → …
• Every new sentential form contains non-terminals.
• It generates no sentences.
• Therefore, it produces the empty set.
• The language generated is the empty set: L(G) = {}
252
PS grammar that
generates the empty set
The language generated by this grammar is also
empty:
S → aAB
bB → a
Ab → SBb
Aa → SaB
B → SA
B → ab
253
Undecidable
• We have seen how to systematically generate, for PS
grammars, all sentential forms using a queue.
• It is not at all certain that the process will obtain a
sentence.
• It is quite possible that every new sentential form
never terminates (see example on previous slide).
• It is undecidable whether a PS grammar produces the
empty set.
• “Undecidable” means that there cannot be an
algorithm that will, for every PS grammar, correctly tell
if the grammar produces at least one sentence.
254
[Figure: a hypothetical procedure takes a PS grammar and answers
“Yes (no), the language generated is the empty set” — no such
procedure exists.]
255
This queue will run forever
[Figure: the queue holds AB; substituting A and B appends BB and AA,
which in turn reproduce each other, so the queue runs forever.]

S → AB
A → B
B → A
256
No algorithm to determine if PS
grammars will produce something
• There is no algorithm that can decide, given an
arbitrary PS grammar, whether it will produce a
sentence.
• This does not mean that we cannot prove for some
given grammar that it generates nothing. It means that
the proof method used will not work for all PS
grammars.
• We could have a program that correctly says Yes in
finite time if the answer is Yes but that takes infinite
time if the answer is No. In fact, the queue procedure
answers Yes in finite time but takes an infinite time if
the answer is No.
257
Many special cases can be identified
• For many PS grammars we can prove if they
produce the empty set or not.
• For example, the grammar may have a rule
S → a, or we may find it has no rule without a
non-terminal in its RHS.
258
Can still get useful info
[Figure: a procedure takes a PS grammar and answers “the language
generated is (not) the empty set” — no such procedure exists.]
Even though we can’t get an exact answer, this does not prevent us from
obtaining all sorts of useful information that gets close. The computer
scientist is aware of but not daunted by the impossibilities from formal
languages.
259
The Halting Problem for PS grammars
• The previous slides say there is no algorithm to
determine if an arbitrary PS grammar will produce a
sentence.
• Question: What would cause a grammar to not
produce a sentence?
• Answer: Each sentential form must have a non-terminal
(otherwise we have a sentence). The
production rules must produce another sentential form
with a non-terminal. So the sentential forms never
halt.
• There is no algorithm that can determine if an
arbitrary PS grammar’s production graph will halt.
260
The halting problem is undecidable
• Problem: write a tool that, given an arbitrary PS grammar, it
determines whether it will produce a string.
• What algorithm would you devise?
• You might use the algorithm presented earlier: use a queue, take
the first item off the queue, substitute, add the new sentential forms to
the end of the queue, repeat. Upon the first item generated that
has no non-terminals (is a sentence), return Yes (the grammar does
produce at least one sentence) and stop. Clearly this algorithm will
run forever on those grammars that don’t produce a string.
• Perhaps there is another algorithm that will solve the problem? It
turns out, there is no other algorithm.
The halting problem is not decidable for PS grammars
261
Example of a PS grammar?
• First, all Type 1-Type 4 grammars are Type 0
grammars too. And you can trivially rewrite
(preserving the sets they generate) any of
these so they are no longer Type 1-4.
• If we want interesting examples we will have
to concentrate not on the form of the
grammar, which is what Type N is concerned
with, but on the sets they generate.
262
Fundamental difference between
Type 0 and Type 1
The fundamental difference between Type 0 and
Type 1 lies in the sets (languages) they can
generate:
– For a set that is Type 1 (context-sensitive), we can
determine in finite time whether any item is or isn’t in
the set, whereas
– For a set that is Type 0 (phrase-structure), an item not
in the set cannot be ruled out in finite time. That is, it
may take an infinite amount of time to determine that
an item is not in the set. More formally, determining if
a given item belongs to a set generated by a Type 0
grammar is undecidable.
263
Time required to determine if an item
is in the set defined by a grammar
                        CS Grammar    PS Grammar
Item is in the set      Finite time   Finite time
Item is not in the set  Finite time   Infinite time
The difference between PS grammars and CS grammars is that
PS grammars take an infinite amount of time to determine
that an item is not in the set.
264
Finding a PS grammar
• Finding a phrase-structure grammar—that’s not a
context-sensitive grammar—amounts to finding a
set in which we can determine in finite time that
an item belongs in the set, but an infinite amount
of time is required to determine that an item
does not belong in the set.
• Here is a set that is well-known to be
undecidable: The set of all programs that
terminate.
265
Creating a PS grammar
• Let L be a grammar for a simple but complete (with full Turing
power) programming language.
• Write a breadth-first generator for all programs in L (generate the
programs using the queue algorithm).
• Write an interpreter for L.
• Start interpreting the programs breadth-first as they come.
• When a program terminates, we produce it as part of the generated
set.
• If the interpreter doesn’t terminate, the program is not a member
of the set.
• So this Type 0 grammar generates just the set of all terminating
programs in L, a set the membership of which is undecidable, so
there cannot be a Type 1 grammar for this set.
266
Expanding/shrinking PS grammars
Q → XYZ    (Q is replaced by XYZ)
XYZ → Q    (XYZ is replaced by Q)

Thus a production can grow or shrink.
267
Expanding/shrinking
PS sentential forms
[Figure: the length of the sentential form rises and falls over the course
of the production process.]
268
Unsolvable
• When we do get sentences from the queue procedure,
they may be produced in an unexplainable order.
• The sentential forms may grow for a while and then
suddenly shrink, perhaps even to the empty string.
• It can be proven that there cannot be an algorithm that
for all PS grammars produces their sentences in
increasing length (actually, in non-decreasing length).
• In other words, the parsing problem for PS grammars is
unsolvable.
269
Terminology:
Undecidable vs. Unsolvable
• Undecidable is the term used for Yes/No
questions.
– Example of an undecidable question: For any arbitrary
PS grammar, does the grammar produce a sentence?
• Unsolvable is the term used for problems.
– Example of an unsolvable problem: For any arbitrary
PS grammar, generate its sentences in increasing
length.
• Note: in the literature these terms are used
interchangeably.
270
Turn to CS grammars
• We have been discussing phrase-structure
grammars:
– How to systematically generate their languages (use
the queue procedure)
– Can we write a procedure to decide whether or not an
arbitrary PS grammar will generate a sentence (no).
We noted that PS sentential forms can expand and
shrink during a production process.
• Now let’s address the same issues for CS
grammars.
271
Language generation
The language-generating queue procedure is
also applicable to CS grammars.
272
CS grammars don’t shrink
Q1AQ2 → Q1PQ2    (Q1 and Q2 are the context; A is replaced by P)

P may be multiple symbols. Thus a production can grow.
P cannot be empty (ε). Thus a production cannot shrink.
A production is either of the same length or longer.
273
Expanding CS sentential forms
[Figure: the length of the sentential form only grows over the course of
the production process.]
274
Decidable
• The sentential forms in CS grammars never
shrink: the strings are produced in monotonic
order of increasing length.
• This means: if we want to know if a given string w
is in the language, we can just wait until we see it
come up, in which case the answer is Yes, or until
we see a longer string come up, in which case the
answer is No.
• It is decidable whether a CS grammar produces
the string w.
275
A procedure exists for deciding if a string w is an element of a CS
language:

[Figure: flowchart — generate strings q from the CS grammar G using
the queue procedure; if q = w, stop: w ∈ L(G); if length(q) > length(w),
stop: w ∉ L(G); otherwise continue generating.]
276
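A Python sketch of this decision procedure (the encoding is mine; a seen-set is added so that length-preserving loops such as cQ → Qc cannot run forever — there are only finitely many forms of length ≤ |w|):

from collections import deque

def member(rules, start, w):
    """Decide w ∈ L(G) for a monotonic (CS-style) grammar: sentential
    forms never shrink, so anything longer than w can be discarded."""
    queue, seen = deque([start]), {start}
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i >= 0:                    # every occurrence of this LHS
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(w) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False                             # every form of length ≤ |w| tried

rules = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]
print(member(rules, "S", "aabbcc"))   # True
print(member(rules, "S", "aabbc"))    # False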
Recursive sets
• Since the strings in a CS language can be
recognized by a possibly recursive algorithm,
CS languages are also called recursive sets.
• So, the term recursive set means there exists a
procedure for determining if a string w is an
element of the set generated by a CS
grammar.
277
Halting Problem Decidable?
• We can systematically generate, for CS
grammars, all sentential forms using a queue.
• However, it is not at all certain that the
process will obtain a sentence.
• It is quite possible that every new sentential
form never terminates.
• Is there a procedure for deciding whether a
CS grammar produces the empty set?
278
Halting Problem Decidable?
• Clearly the queue procedure will not work – it
will loop endlessly if a grammar’s language is
empty.
• Perhaps there is some other procedure that
could be applied to a CS grammar to decide if
it will produce something?
• It turns out that there is no procedure. The
halting problem is undecidable for CS
grammars.
See http://www.cs.cmu.edu/~./FLAC/pdf/ContSens-6up.pdf, bottom of page 12.
279
Turn to CF grammars
• We have been discussing context-sensitive grammars:
– How to systematically generate their languages (use
the queue procedure)
– A procedure to decide whether a string w is an
element of the language generated by an arbitrary CS
grammar (run the queue until a match is found or a
longer string is encountered).
We noted that CS sentential forms always expand during
a production process.
• Now let’s address the same issues for CF
grammars.
280
There is an algorithm to determine if a
CF grammar will produce something
With CF grammars it may still happen that a grammar will
never produce a sentence but we can determine that
beforehand, as follows:
1) First, scan the grammar to find all non-terminals which have a
right-hand side that contains terminals only or is empty. These
non-terminals are guaranteed to produce something.
2) Now scan again to find non-terminals which have a right-hand
side that consists of only terminals and non-terminals that are
guaranteed to produce something. This will give us new
non-terminals that are guaranteed to produce something.
3) Repeat 2) until we find no more new non-terminals.
4) If we have not met the start symbol this way, the grammar will
not produce anything.
The halting problem is decidable for CF grammars
281
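A Python sketch of steps 1)–4) (the grammar encoding is my own):

def produces_something(grammar, start):
    """grammar: non-terminal → list of alternatives (lists of symbols).
    True if the start symbol is guaranteed to produce a terminal string."""
    producing = set()
    changed = True
    while changed:                            # steps 2) and 3)
        changed = False
        for nt, alternatives in grammar.items():
            if nt in producing:
                continue
            for alt in alternatives:
                # only terminals (symbols not defined in the grammar)
                # and non-terminals already known to produce something
                if all(sym in producing or sym not in grammar
                       for sym in alt):
                    producing.add(nt)
                    changed = True
                    break
    return start in producing                 # step 4)

# The t,d&h grammar of the next slide:
grammar = {
    "Sentence": [["t"], ["d"], ["h"], ["List"]],
    "List": [["t", "ListTail"], ["d", "ListTail"], ["h", "ListTail"]],
    "ListTail": [[",", "List"], ["&", "t"], ["&", "d"], ["&", "h"]],
}
print(produces_something(grammar, "Sentence"))   # True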
Example
Determine that the CF grammar for the t,d,&h language
produces a sentence:
Sentence → t
Sentence → d
Sentence → h
List → t ListTail
List → d ListTail
List → h ListTail
ListTail → , List
ListTail → & t
ListTail → & d
ListTail → & h
1) These rules have right-hand sides containing terminals only, so their
non-terminals are guaranteed to produce something:
Sentence → t        ListTail → & t
Sentence → d        ListTail → & d
Sentence → h        ListTail → & h
2) Now the remaining rules for List have right-hand sides consisting
only of terminals and non-terminals guaranteed to produce something:
List → t ListTail
List → d ListTail
List → h ListTail
and then also: ListTail → , List
The start symbol Sentence is guaranteed to produce something, so the
grammar produces a sentence.
282
Leftmost rewriting
Leftmost rewriting: in the production process,
rewrite the leftmost non-terminal every time.
283
Rightmost rewriting
Rightmost rewriting: in the production process,
rewrite the rightmost non-terminal every time.
284
Notation for: rule 2, second alternative
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

Consider this derivation:
Sentence → List and Name
“Sentence” was rewritten using rule 2’s second alternative, i.e., 2b.
We will write the derivation like so:
Sentence →2b List and Name
(the marker 2b under the arrow names the rule and alternative used)
285
Compare leftmost and rightmost
rewriting
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

Leftmost rewriting:
Sentence →2b List and Name →3a Name , List and Name
→1a tom , List and Name →3b tom , Name and Name
→1b tom , dick and Name →1c tom , dick and harry

Rightmost rewriting:
Sentence →2b List and Name →1c List and harry
→3a Name , List and harry →3b Name , Name and harry
→1b Name , dick and harry →1a tom , dick and harry

Notes:
a. The sequences of production rules are not as similar as we
would expect. The sequences are neither equal nor each
other’s mirror image, nor is there any obvious relationship.
b. In grand total the same rules and alternatives are used.
286
Show the order that non-terminals are
rewritten in the production tree
Leftmost rewriting:
Sentence →2b List and Name →3a Name , List and Name
→1a tom , List and Name →3b tom , Name and Name
→1b tom , dick and Name →1c tom , dick and harry

[Figure: the production tree for tom , dick and harry, with the
non-terminals numbered 1–6 in the order leftmost rewriting rewrites
them: Sentence (1), List (2), Name→tom (3), List (4), Name→dick (5),
Name→harry (6).]
287
Show the order that non-terminals are
rewritten in the production tree
Rightmost rewriting:
Sentence →2b List and Name →1c List and harry
→3a Name , List and harry →3b Name , Name and harry
→1b Name , dick and harry →1a tom , dick and harry

[Figure: the same production tree, with the non-terminals numbered
1–6 in the order rightmost rewriting rewrites them: Sentence (1),
Name→harry (2), List (3), List (4), Name→dick (5), Name→tom (6).]
288
Different order of rewriting
[Figure: the production tree of tom , dick and harry shown twice, with
the rewrite-order numbers of leftmost rewriting on one copy and of
rightmost rewriting on the other.]
Both rewrite-sequences define the same production tree.
But the order of rewriting differs.
289
Leftmost derivation
Leftmost rewriting:
Sentence →2b List and Name →3a Name , List and Name
→1a tom , List and Name →3b tom , Name and Name
→1b tom , dick and Name →1c tom , dick and harry

Here is the sequence of sentential forms produced by leftmost rewriting:
Sentence → List and Name → Name , List and Name → tom , List and Name → tom , Name and Name → tom , dick and Name → tom , dick and harry
This sequence is called the leftmost derivation of a sentence.
290
Indicating a leftmost production
A leftmost production step can be indicated by
using an arrow marked with a small l, for
example:
Name , List and Name →l tom , List and Name

The leftmost production sequence:
Sentence →l List and Name →l Name , List and Name →l tom , List and Name →l tom , Name and Name →l tom , dick and Name →l tom , dick and harry

can be abbreviated:
Sentence →l* tom , dick and harry
291
Rightmost derivation
Rightmost rewriting:
Sentence →2b List and Name →1c List and harry
→3a Name , List and harry →3b Name , Name and harry
→1b Name , dick and harry →1a tom , dick and harry

Here is the sequence of sentential forms produced by rightmost rewriting:
Sentence → List and Name → List and harry → Name , List and harry → Name , Name and harry → Name , dick and harry → tom , dick and harry
This sequence is called the rightmost derivation of a sentence.
292
Indicating a rightmost production
A rightmost production step can be indicated by
using an arrow marked with a small r, for
example:
List and Name →r List and harry

The rightmost production sequence:
Sentence →r List and Name →r List and harry →r Name , List and harry →r Name , Name and harry →r Name , dick and harry →r tom , dick and harry

can be abbreviated:
Sentence →r* tom , dick and harry
293
Indicating a production
The fact that Sentence produces
tom, dick and harry in any way is written:
Sentence →* tom, dick and harry
294
Parsing (defined)
Parsing is the task of reconstructing the
derivation tree (or graph) for a given input
string.
[Figure: a parser takes a grammar and an input string and produces the
derivation tree (or graph).]
295
1. Name → tom | dick | harry
2. Sentence → Name | List and Name
3. List → Name , List | Name

[Figure: the parser turns the input string tom , dick and harry into its
production tree under this grammar.]
296
Most parsers use
leftmost or rightmost derivation
Some of the most efficient parsing techniques
can be understood more easily if viewed as
attempts to reconstruct a leftmost or rightmost
derivation process of the input string.
297
The concept of zero is still
not well accepted
Roughly 1500 years after the introduction of zero as a number by
mathematicians in India, the concept is still not well accepted in
computer science:
• Many programming languages do not support records with zero fields
• Many programming languages do not support arrays with zero
elements
• Many programming languages do not support variable definitions with
zero variables
• In some programming languages the syntax for calling a routine with
zero parameters differs from that for a routine with one or more
parameters
• XML provides a special syntax for empty elements
• Many compilers refuse to compile a module that defines zero names
• No parser generator can produce a parser for the empty language (the
language with zero strings)
298
Empty language vs. a language that
consists of the empty string
• Empty language: {}
• Language with only the empty string: {ε}
That language is easily generated by this grammar:
S→ε
• What would the grammar for the empty
language look like?
299
Grammars that produce
the empty language
• For a grammar to produce nothing, the
production process cannot be allowed to
terminate.
• Here’s one such grammar:
S→S
That grammar is ugly for two reasons:
– The generation process just loops and no information
about the emptiness of the language is obtained
– The use of the symbol S is arbitrary
300
Force the production process
to get stuck
• Another approach to force the production
process to get stuck is by not having any
production rules in the grammar.
• Recall that grammars are formally defined as
G = (VN, VT, S, F), where F = the set of production
rules.
• This approach produces G = ({S}, {}, S, {}).
• That is not very satisfactory either since:
– Now we have a non-terminal without a defining rule
– The symbol S is still arbitrary
301
Don’t allow the production process
to get started
• A better way is never allow the production process to
get started: have no start symbol.
• This can be accommodated by allowing a set of start
symbols in the definition of a grammar rather than a
single start symbol.
• There are good reasons for having a set of start
symbols: each global element declaration in an XML
Schema is a potential start symbol.
• If we extend the definition of a CF grammar to use a
set of start symbols, the grammar for the empty
language obtains the elegant and satisfactory form:
({}, {}, {}, {})
302
Rules with empty left-hand side
• It might be useful to have grammar rules in
which the left-hand side is empty:
ε → djakdlsaiewp
• Terminal productions of the right-hand sides
of such rules may appear anywhere in the
input, thus modeling noise and other everyday but extraneous events.
303
Our preoccupation with empty
is not frivolous
Our preoccupation with empty strings, sets,
languages, etc. is not frivolous, since it is
well-known that the ease with which a system
handles empty cases is a measure of its
cleanliness and robustness.
304
CF grammars are limited
• Many things can be expressed using CF
grammars.
• However, CF grammars have serious
limitations.
305
Lineage of a symbol
Here is a CF grammar for the t, d & h language (t = Tom, d = Dick, h = Harry, S = Start, L = List, N =
Name):
1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h
Here is the production tree for a derivation of d , h & h:

[Figure: a sentence from the grammar — rule 1 at the root; rule 3
produces the list, rules 6 and 7 produce the names d and h, and rules 4
and 7 produce the final h.]
When we have obtained a sentence from a CF
grammar, we may look at each terminal symbol in it
and ask: How did it get here? For example, looking at
the production tree, we see that “d” was produced as
the 1st member of the right-hand side of rule number
6. The left-hand side of this rule, the parent of our
symbol, was produced as the 1st member of rule 3. And
so on, until we reach the start symbol. We can, in a
sense, trace the lineage of the symbol in this way.
306
Express lineage as rule/member pairs
[Figure: the production tree for d , h & h annotated with (rule, member)
pairs — d is member 1 of rule 6, giving (6,1); its parent N is member 1
of rule 3, giving (3,1); that L is member 1 of rule 1, giving (1,1).]

1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h

Rule/member pairs for d: { (6,1), (3,1), (1,1) }
307
Original symbol, original sentence
• If all rule/member pairs in the lineage of a
symbol are different, we call the symbol
original.
– Example: the lineage of the first h is { (7, 1), (4, 1),
(3,3), (1,1) }. Since all rule/member pairs are
different, h is original.
• If all the symbols in a sentence are original, we
call the sentence original.
308
Same symbol, different lineage
[Figure: the production tree for d , h & h with the lineages of the two
h’s marked.]

1. S → L & N
2. S → N
3. L → N , L
4. L → N
5. N → t
6. N → d
7. N → h

If a symbol occurs twice in an original sentence, both
its lineages must be different: if they were the same,
they would describe the same symbol in the same
place.

Rule/member pairs of the first h: { (7,1), (4,1), (3,3), (1,1) }
Rule/member pairs of the second h: { (7,1), (1,3) }
Any CF grammar produces a
finite set of original sentences
There are a finite number of rules, and each rule has a finite number of
members, so the number of unique (rule, member) pairs is finite.
Therefore the number of original symbols is finite and the
number of original sentences is finite.
We arrive at the surprising conclusion that any CF grammar
produces a finite-size kernel of original sentences and
(probably) an infinite number of unoriginal sentences.
310
Unoriginal sentences
• What do “unoriginal” sentences look like?
• By definition, an unoriginal sentence has one
or more symbols that are unoriginal.
• A symbol is unoriginal if it has two or more
(rule, member) pairs that are the same.
• Two (rule, member) pairs being the same means
that the same grammar rule is repeated in the lineage.
311
Original sentence
1. S → b A b
2. A → a A a
3. A → a

[Figure: the production tree for b a a a b, annotated with the lineage of
each symbol:
first b:   { (1,1) }
first a:   { (2,1), (1,2) }
middle a:  { (3,1), (2,2), (1,2) }
fourth a:  { (2,3), (1,2) }
last b:    { (1,3) }]

Each symbol is original, so baaab is an original sentence.
312
Unoriginal sentence
1. S → b A b
2. A → a A a
3. A → a

[Figure: the production tree for b a a a a a b, in which rule 2 is applied
twice.]

Rule/member pairs of the middle a: { (3,1), (2,2), (2,2), (1,2) }
There are duplicate rule/member pairs, so this symbol is
not original and so baaaaab is not an original
sentence.
313
Repeated rule
Rule 2 is repeated.

[Figure: the production tree for b a a a a a b with the two applications
of rule 2 marked.]
314
Partition the sentence
• Let’s partition the sentence into parts.
• Let A = the repeated rule (in this case it is rule 2)
– w : the part produced by the A that is furthest down the tree
– vwx: the part produced by the A that is furthest up the tree
– uvwxy: the entire unoriginal sentence
[Figure: the tree for b a a a a a b partitioned as u = b, v = a, w = aaa,
x = a, y = b.]
315
Pump up the sentence
We can get another unoriginal sentence by
replacing the smaller A by the larger A.
[Figure: in the tree, the subtree of the lower A (producing w) is
replaced by a copy of the subtree of the upper A (producing vwx).]
316
Another unoriginal sentence
b a a a a a b          u v w x y
        replace w with vwx:
b a a a a a a a b      u v v w x x y  =  uv²wx²y
317
and another unoriginal sentence
b a a a a a a a b          u v v w x x y
        replace w with vwx:
b a a a a a a a a a b      u v v v w x x x y  =  uv³wx³y
318
Family of nested sentences
• We can, in this way, construct a complete
family of sentences uvⁿwxⁿy for all n ≥ 0.
• This form shows w nested in a number of v
and x brackets, in an indifferent context of u
and y.

[Figure: the production tree of b a a a a a b with the u, v, w, x, y parts
marked — a family of nested sentences.]
319
Original sentences become exhausted
When we examine longer and longer sentences
in a CF language, the original sentences become
exhausted and we meet only families of closely
related sentences telescoping off into infinity.
320
uvwxy theorem
• uvwxy theorem: any sentence generated by a
CF grammar that is longer than the longest
original sentence from that grammar can be
cut into five pieces u, v, w, x, and y, in such a
way that uvⁿwxⁿy are sentences from that
grammar for all n ≥ 0.
• The uvwxy theorem is also called the pumping
lemma for context-free languages.
321
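The theorem can be watched at work on the example grammar S → bAb, A → aAa | a, whose sentences are exactly b followed by an odd number of a’s followed by b. A small Python sketch (the membership checker and the cut are mine, taken from the slides’ baaaaab example):

def in_language(s):
    """Sentences of S → bAb, A → aAa | a : b a^(2k+1) b."""
    core = s[1:-1]
    return (len(s) >= 3 and s[0] == "b" and s[-1] == "b"
            and set(core) == {"a"} and len(core) % 2 == 1)

u, v, w, x, y = "b", "a", "aaa", "a", "b"    # the cut of baaaaab
for n in range(4):
    pumped = u + v * n + w + x * n + y
    print(pumped, in_language(pumped))       # True for every n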
Language isn’t CF if long sentences
don’t decay into nested sentences
• If a language keeps on providing longer and longer
sentences without reducing to families of nested
sentences, there cannot be a CF grammar for it.
• We have already encountered the context-sensitive
language aⁿbⁿcⁿ and it is easy to see that it does not
decay into such nested sentences as sentences get
longer and longer. Consequently, there is no CF
grammar for it.
– A general technique for showing that a language is not
context-free may be found in this article: Commun. ACM,
April 1993, Using the context-free pumping lemma, by
David Billington.
322
Increase the length
of original sentences
• The longest original sentence is a property of the
grammar, not the language.
• By making a more complicated grammar for a language
we can increase the set of original sentences and push
away the border beyond which we are forced to start
nesting (repeating).
• If we make the grammar infinitely complicated, we can
push the border to infinity and obtain a phrase
structure language from it. How we can make a CF
grammar infinitely complicated is described in the
section on two-level grammars, 15.2.1.
323
Regular grammars are limited
A simpler form of the uvwxy theorem applies to
regular (Type 3) languages.
324
Repeated non-terminals
for long sentences
• We have seen that the sentential forms
occurring in the production process for a
regular grammar all contain only one
non-terminal, which occurs at the end.
one or more non-terminals must occur two or
more times, since there are only a finite
number of non-terminals.
325
Example Regular grammar
S → sP | sA
P → pQ
Q → qA
A → aR | aT
R → rS
T → tU
U→u
S → sP → spQ → spqA → spqaR → spqarS → spqarsA → spqarsaT → spqarsatU → spqarsatu

Clearly the sequence from the first A (in spqA) to the
second A (in spqarsA) can be repeated over and over.
326
uvⁿw

S → sP → spQ → spqA                      u = spq
  → spqaR → spqarS → spqarsA             v = ars
  → spqarsaT → spqarsatU → spqarsatu     w = atu

u is the part leading up to the first A, v is the part between the first A and
the second A (it can be repeated), and w is the part after the last A to
terminate the production process.
327
uvw theorem
• uvw theorem: any sufficiently long string from
a regular language can be cut into three pieces
u, v, and w, so that uvⁿw are strings in the
language for all n ≥ 0.
• The uvw theorem is also called the pumping
lemma for regular languages.
328
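A Python sketch pumping the spqarsatu example (the non-deterministic-automaton encoding of the grammar is my own):

def accepts(s):
    """NFA walk for S → sP|sA, P → pQ, Q → qA, A → aR|aT,
    R → rS, T → tU, U → u; DONE marks acceptance."""
    moves = {("S", "s"): {"P", "A"}, ("P", "p"): {"Q"},
             ("Q", "q"): {"A"}, ("A", "a"): {"R", "T"},
             ("R", "r"): {"S"}, ("T", "t"): {"U"},
             ("U", "u"): {"DONE"}}
    states = {"S"}
    for ch in s:
        states = set().union(*(moves.get((st, ch), set())
                               for st in states))
    return "DONE" in states

u, v, w = "spq", "ars", "atu"                # the cut from the slide
for n in range(4):
    print(u + v * n + w, accepts(u + v * n + w))   # True for every n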
CF grammars as transition graphs
• A transition graph is a directed graph in which the arrows are labeled with
zero or more symbols from the grammar.
• As you follow the arrows in the graph you produce one of the associated
symbols, if there is one, and nothing otherwise.
• The nodes, often unlabeled, are resting points between producing the
symbols.
• If there is more than one outgoing arrow from a node you can choose any
to follow.
• Here is the transition graph for the tom, dick, and harry language:
[Figure: the transition graph — from the start node, arrows labeled tom,
dick, and harry lead to a node from which a comma-arrow loops back
for more names and an and-arrow leads to a final tom, dick, or harry.]
329
Produce the same strings
CF Grammar:
1. Sentence → Name | List and Name
2. Name → tom | dick | harry
3. List → Name , List | Name

[Figure: the transition graph of the previous slide, which produces the
same strings as this CF grammar.]
330
Turn a grammar
into a set of transition graphs
It is easy to turn a CF grammar into a set of transition
graphs, one for each non-terminal, as shown below.
Sentence → Name | List & Name
Name → tom | dick | harry
List → Name | Name , List
[Figure: one transition graph per non-terminal — Sentence: an arrow
labeled Name, or arrows labeled List, &, Name; Name: arrows labeled
tom, dick, harry; List: an arrow labeled Name, or arrows labeled
Name, comma, List.]
331
Recursive transition network
• The transition graphs on the previous slide have non-terminal labels above the arrows.
• Upon encountering an arrow that points to node n2
and labeled with non-terminal N: push n2 onto a stack,
continue the walk at the entrance to the transition
graph for N. When leaving the transition graph for N,
pop n2 from the stack and continue at node n2.
• This is the recursive transition network interpretation
of context-free grammars: the set of graphs is the
transition network, and the stacking mechanism
provides the recursion.
332
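The stacking mechanism is exactly what a recursive-descent recognizer gets for free from the programming language’s call stack. A Python sketch for the tom, dick and harry grammar (the tokenization and structure are my own):

NAMES = {"tom", "dick", "harry"}

def name(tokens, pos):
    """Name → tom | dick | harry; returns the new position or None."""
    if pos < len(tokens) and tokens[pos] in NAMES:
        return pos + 1
    return None

def lst(tokens, pos):
    """List → Name , List | Name; the recursive call is the 'stack'."""
    p = name(tokens, pos)
    if p is None:
        return None
    if p < len(tokens) and tokens[p] == ",":
        q = lst(tokens, p + 1)     # entering the List graph: push return node
        if q is not None:
            return q               # leaving the graph: pop and continue
    return p

def sentence(tokens):
    """Sentence → Name | List and Name; must consume all tokens."""
    p = lst(tokens, 0)
    if p is not None and p + 1 < len(tokens) and tokens[p] == "and":
        if name(tokens, p + 1) == len(tokens):
            return True
    return name(tokens, 0) == len(tokens)

print(sentence("tom , dick and harry".split()))   # True
print(sentence("tom and".split()))                # False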
Regular grammars as transition graphs
The rules of a regular grammar can also be turned into
transition graphs:
Sentence → [tdh] | List
List → [tdh] ListTail
ListTail → & [tdh] |, List
There is a non-terminal only when leaving a graph. No need
for stacking: interpret an arrow marked with a non-terminal N
as a jump to the transition graph for N. So a regular
grammar corresponds to a non-recursive transition network.

[Figure: transition graphs for Sentence ([tdh], or List), List ([tdh] then
ListTail), and ListTail (& then [tdh], or comma then List); every
non-terminal label sits on an arrow leaving its graph.]
333
Useless rules, successful
production process
• Grammars can contain useless rules: rules that cannot play a role in
any successful production process.
• A production process is successful when it results in a terminal
string.
• Production attempts can be unsuccessful by getting stuck (no
further substitution possible) or by entering a situation in which no
substitution sequence will ever remove all non-terminals.
• Example of a Type 0 grammar that can get stuck:
1. S → A B
2. S → B A
3. S → C
4. A B → x
5. C → C C
If we start with the first rule for S, all goes well and we
produce the terminal string x.
If we start with the second rule for S, we get stuck. It is a
useless rule.
If we start with the third rule for S, we get ourselves into an
infinite loop, producing more and more Cs.
Rules 2, 3, and 5 can never occur in a successful production
process; they are useless rules and can be removed from the
grammar without affecting the language produced.
334
Remove useless rules
• Useless rules are not a fundamental problem:
they do not obstruct the normal production
process.
• Still, they are dead wood in the grammar and
one would like to remove them.
• Also, when they occur in a grammar specified
by a programmer they probably point at some
error and one would like to detect them and
give warning or error messages.
335
Detecting useless rules is undecidable
for Type 0 and Type 1
It can be shown that in general it is undecidable
whether a rule in a Type 0 or Type 1 grammar is
useless: there cannot be an algorithm that does
it correctly in all cases.
(figure: a decision box taking a grammar (Type 0 or 1) and a rule i, answering “Is the rule useless?” with yes/no — impossible to build this)
336
Detecting useless rules is decidable
for Type 2
The problem of deciding whether a rule in a CF
grammar is useless is easily solved.
(figure: a decision box taking a grammar (Type 2) and a rule i, answering “Is the rule useless?” with yes/no — easy to build this)
337
Key Concept
• It is important to know what type of grammar
you are dealing with (Type 0, 1, 2, or 3). Why?
• Because if you know that it is Type 0 or 1, then
you can take advantage of a result from the
field of Formal Languages and not attempt to
build a program to decide if a rule is useless. If
you know that it is Type 2, then you can easily
build the program.
338
3 causes for useless rules
in CF grammars
• A rule in a CF grammar can be useless through
three causes:
1. It may contain an undefined non-terminal.
2. It may not be reachable from the start symbol.
3. It may fail to produce anything.
• A rule with any of these defects is useless.
339
Useless Rules
– undefined non-terminals
– unreachable non-terminals
– non-productive non-terminals
340
Undefined non-terminals
• The right-hand side of a rule may contain a non-terminal for which no production rule is given.
• Such a rule can be removed.
• However, that may result in another non-terminal
becoming undefined.
• Example: If the A in this rule B → . . . A . . .
is undefined, remove the rule. But now B may be
undefined, so remove rules with B on the right-hand side, etc.
341
Unreachable non-terminals
If a non-terminal cannot be reached from the
start symbol, its defining rules will never be
used, and it cannot contribute to the production
of any sentence.
342
Non-productive rules
• A rule that gets stuck in an infinite loop cannot contribute
anything to the sentences of the language of the grammar,
since once the rule is entered, there is no way to get rid of it:
the rule has a non-productive non-terminal.
• Example: the rule X → aX is non-productive and any rule
which has X in its right-hand side is non-productive.
• In an extreme case all non-terminals in a grammar are non-productive. This happens when all right-hand sides in the
grammar contain at least one non-terminal. Then there is no
way to get rid of the non-terminals, and the grammar itself is
non-productive.
343
Loops
• Rules of the form A → A are called loops.
• Loops can be indirect:
A → B
B → C
C → A
• Loops can be hidden:
A → PAQ, where P →* ε and Q →* ε, so that A →* A
344
Loops can legitimately occur
• A loop can legitimately occur in the
production of a sentence, provided there is
also a production that enables breaking out of
the loop.
• Example: below, the first rule can be applied over and over, but the
second rule enables breaking out of the loop:
A → aA
A→a
345
Proper grammar
A grammar without useless non-terminals and
loops is called a proper grammar.
346
Cleaning up a CF grammar
• Normally, grammars supplied by people do not
contain undefined, unreachable, or non-productive non-terminals.
• If they do, it is almost certainly a mistake and we
would like to detect and report them.
• Such anomalies can, however, occur in generated
grammars or be introduced by some grammar
transformations, in which case we wish to detect
them to “clean up” the grammar.
347
Algorithm to detect and remove
useless non-terminals and rules
• The algorithm to detect and remove useless
non-terminals and rules from a context-free
grammar consists of two steps:
1. Remove non-productive rules
2. Remove unreachable non-terminals
• It is not necessary to remove rules with
undefined non-terminals since the first step
does this automatically.
348
Let’s clean up this CF grammar
S → A B | D E
A → a
B → b C
C → c
D → d F
E → e
F → f D
The above grammar looks innocent: all its non-terminals are defined and it does not exhibit any suspicious constructions.
349
Step 1:
Remove non-productive rules
• The following slides describe how to remove
non-productive rules.
• Find the non-productive rules by finding the
productive rules. After finding all productive
rules, the remaining rules are the non-productive rules.
350
Algorithm to find productive rules
• A rule is productive if its right-hand side
consists of symbols all of which are
productive.
• Productive symbols:
– Terminal symbols are productive since they
produce terminals.
– Empty is productive since it produces the empty
string.
– A non-terminal is productive if there is a
productive rule for it.
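A minimal sketch of this closure in Python (the dict representation and the lowercase-means-terminal convention are mine, not the book's), run on the example grammar of the previous slide:

GRAMMAR = {
    "S": [("A", "B"), ("D", "E")],
    "A": [("a",)],
    "B": [("b", "C")],
    "C": [("c",)],
    "D": [("d", "F")],
    "E": [("e",)],
    "F": [("f", "D")],
}

def productive_nonterminals(grammar):
    productive = set()            # initially: nothing known
    changed = True
    while changed:                # repeat the inference rule until stable
        changed = False
        for nt, rhss in grammar.items():
            if nt in productive:
                continue
            for rhs in rhss:
                # terminals are productive; an empty rhs (all() over an
                # empty tuple) counts as productive too
                if all(sym.islower() or sym in productive for sym in rhs):
                    productive.add(nt)
                    changed = True
                    break
    return productive

print(productive_nonterminals(GRAMMAR))   # {'S', 'A', 'B', 'C', 'E'}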
351
Initial knowledge
Go through the grammar and for each rule for
which we know that all its right-hand side symbols
are productive, mark the rule and the non-terminal
it defines as Productive.
Rule             Productive
S → A B | D E
A → a            Productive
B → b C
C → c            Productive
D → d F
E → e            Productive
F → f D
352
Build on top of our knowledge
Now we know more. Apply this knowledge in a
second round through the grammar.
Rule             Productive
S → A B | D E
A → a            Productive
B → b C          Productive (since b is productive and C is productive)
C → c            Productive
D → d F
E → e            Productive
F → f D
353
Round three
Rule             Productive
S → A B          Productive (since A is productive and B is productive)
S → D E
A → a            Productive
B → b C          Productive (since b is productive and C is productive)
C → c            Productive
D → d F
E → e            Productive
F → f D
354
Round four
A fourth round yields nothing new.
Rule             Productive
S → A B          Productive (since A is productive and B is productive)
S → D E
A → a            Productive
B → b C          Productive (since b is productive and C is productive)
C → c            Productive
D → d F
E → e            Productive
F → f D
355
Recap
We now know that A, B, C, E and the rule
S → A B are productive. D, F, and the rule
S → D E are non-productive.
Rule             Productive
S → A B          Productive (since A is productive and B is productive)
S → D E
A → a            Productive
B → b C          Productive (since b is productive and C is productive)
C → c            Productive
D → d F
E → e            Productive
F → f D
356
Remove non-productive rules
We have pursued all possible avenues for
productivity and have not found any possibilities for
D, F, and the second rule for S. That means they are
non-productive and can be removed from the
grammar.
Rule             Productive
S → A B          Productive (since A is productive and B is productive)
A → a            Productive
B → b C          Productive (since b is productive and C is productive)
C → c            Productive
E → e            Productive
The grammar after removing non-productive rules
357
Removing non-productive rules also
removes undefined non-terminals
• Earlier we said: It is not necessary to remove rules with
undefined non-terminals since the first step [remove
non-productive rules] does this automatically.
• Consider a rule R that contains an undefined non-terminal, U. The algorithm shown on the previous
slides will not mark R as “Productive” and hence R will
be removed. Also, any rule whose right-hand side uses the non-terminal that R defines will not be marked “Productive” and will be removed. And so forth.
• So an undefined non-terminal is just a special case of a
non-productive non-terminal: it is non-productive
because there is no rule for it.
358
Bottom-up process
Removing the non-productive rules is a bottom-up process: only at the bottom level, where the terminal symbols live, can we know what is productive.
359
Knowledge-improving algorithm
• In the previous slides we increased our
knowledge with each round.
• The previous slides are our first example of a
closure algorithm.
360
Closure algorithms
Closure algorithms are characterized by two components:
1. Initialization: an assessment of what we know initially. For our problem we knew:
– the grammar rules
– terminals and empty are productive
2. Inference rule: a rule telling how knowledge from several places is to be combined. The inference rule for our problem was:
– if all the right-hand side symbols of a rule are productive, then the rule’s left-hand side non-terminal is productive.
The inference rule is repeated until nothing changes any
more.
361
Step 2:
Remove unreachable non-terminals
The second step in removing useless non-terminals and rules from a context-free grammar is to remove unreachable non-terminals.
362
Reachable non-terminals
• A non-terminal is called reachable or
accessible if there exists at least one sentential
form, derivable from the start symbol, in
which it occurs.
• Example: a non-terminal A is reachable if
S →* αAβ for some α and β.
• Find the unreachable non-terminals by finding
the reachable ones.
363
Closure algorithm for finding
reachable non-terminals
• Initialization: the start symbol is marked
“reachable”.
• Inference rule: for each rule in the grammar of
the form A → α with A marked “reachable”,
all non-terminals in α are marked “reachable”.
• Continue applying the inference rule until
nothing changes any more.
• The remaining unmarked non-terminals are
not reachable and their rules can be removed.
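A sketch of this closure in the same assumed dict representation as the productivity sketch earlier (uppercase symbols are non-terminals):

def reachable_nonterminals(grammar, start="S"):
    # the start symbol is reachable; every non-terminal in a right-hand
    # side of a reachable non-terminal is reachable too
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for nt in list(reachable):
            for rhs in grammar.get(nt, []):
                for sym in rhs:
                    if sym.isupper() and sym not in reachable:
                        reachable.add(sym)
                        changed = True
    return reachable

# on the grammar left after step 1 (S → A B, A → a, B → b C, C → c, E → e):
CLEANED = {"S": [("A", "B")], "A": [("a",)], "B": [("b", "C")],
           "C": [("c",)], "E": [("e",)]}
print(reachable_nonterminals(CLEANED))   # {'S', 'A', 'B', 'C'} — E is unreachable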
364
Initialization
Rule         Reachable
S → A B      S is reachable
A → a
B → b C
C → c
E → e
365
Round one
Rule         Reachable
S → A B      S is reachable
A → a        A is reachable because it is reachable from S
B → b C      B is reachable because it is reachable from S
C → c
E → e
366
Round two
Rule         Reachable
S → A B      S is reachable
A → a        A is reachable because it is reachable from S
B → b C      B is reachable because it is reachable from S
C → c        C is reachable because it is reachable from B
E → e
367
Round three
Rule         Reachable
S → A B      S is reachable
A → a        A is reachable because it is reachable from S
B → b C      B is reachable because it is reachable from S
C → c        C is reachable because it is reachable from B
E → e
The third round produces no change.
So the rule E → e is unreachable and can be removed.
368
Cleaned grammar
Initial grammar:
S → A B | D E
A → a
B → b C
C → c
D → d F
E → e
F → f D
Grammar after removing non-productive rules:
S → A B
A → a
B → b C
C → c
E → e
Grammar after removing unreachable non-terminals:
S → A B
A → a
B → b C
C → c
369
Top-down process
Removing unreachable non-terminals is a top-down process: only at the top level, where the start symbol lives, can we know what is reachable.
370
Order of cleaning matters
• The cleaning process must occur in this order:
– First, remove non-productive rules
– Second, remove unreachable non-terminals
• If the order is switched it may produce a
grammar which again contains unreachable
non-terminals. So you will have to redo the
algorithm for removing unreachable non-terminals.
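Putting the two steps together, in the required order, reusing the two closure sketches from earlier (productive_nonterminals and reachable_nonterminals — again a sketch, not the book's code):

def clean(grammar, start="S"):
    # step 1: drop non-productive rules (which also removes rules with
    # undefined non-terminals); step 2: drop unreachable non-terminals
    prod = productive_nonterminals(grammar)
    g1 = {nt: [rhs for rhs in rhss
               if all(s.islower() or s in prod for s in rhs)]
          for nt, rhss in grammar.items() if nt in prod}
    reach = reachable_nonterminals(g1, start)
    return {nt: rhss for nt, rhss in g1.items() if nt in reach}

print(clean(GRAMMAR))
# {'S': [('A', 'B')], 'A': [('a',)], 'B': [('b', 'C')], 'C': [('c',)]}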
371
Need to run the algorithms only once
• Suppose we remove non-productive rules and
then unreachable rules.
• Consider a non-terminal N in a reachable rule:
X → αNβ
• By removing unreachable rules could N become
undefined? Will we have to run the algorithm for
removing non-productive rules again?
• No. In the process of removing non-productive
rules we determined that all symbols on the
right-hand side of X are productive. That means
that N is productive (N is defined).
372
Cleaning may remove all rules
Cleaning a grammar may remove all rules,
including those for the start symbol, in which
case the grammar describes the empty
language.
373
Set properties of context-free
and regular languages
Since languages are sets, it is natural to ask if the
standard operations on sets can be performed on
them, and if so, how. Set operations:
– union
– intersection
– negation (complement)
374
Set operations
• The union of two sets 𝑆1 and 𝑆2 contains the
elements that are in either set; it is written:
𝑆1 ∪ 𝑆2
• The intersection contains the elements that
are in both sets; it is written:
𝑆1 ∩ 𝑆2
• The negation of a set 𝑆 contains the elements of Σ* that are not in 𝑆; it is written:
¬𝑆
375
Set operations on grammars
In the context of formal languages the sets are
defined through grammars, so actually we want
to do the operations on the grammars rather
than on the languages.
376
Union of two grammars
• Constructing the grammar for the union of
two languages is trivial for context-free and
regular languages (and in fact for all Chomsky
types): just construct a new start symbol
𝑆’ → 𝑆1 | 𝑆2, where 𝑆1 and 𝑆2 are the start
symbols of the two grammars that describe
the two languages.
• Make sure the names 𝑆1 and 𝑆2 are different.
A grammar describes a language.
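A sketch of the construction, assuming the dict representation used in the earlier sketches and assuming (as the slide requires) that the two grammars already use disjoint non-terminal names:

def union(g1, start1, g2, start2):
    # assumes g1 and g2 share no non-terminal names
    merged = dict(g1)
    merged.update(g2)
    merged["S'"] = [(start1,), (start2,)]   # the new rule S' → S1 | S2
    return merged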
377
context-free ∪ context-free = context-free
The union of two context-free languages is a
context-free language.
378
Intersection of two CF grammars might
produce a non-CF grammar
• The intersection of two context-free languages might not be
context-free.
• Consider the two CF languages:
L1 = aⁿbⁿcᵐ (same number of as and bs, arbitrary number of cs)
L2 = aᵐbⁿcⁿ (arbitrary number of as, same number of bs and cs)
described by the CF grammars:
L1 → A P
A → a A b | ε
P → c P | ε
and
L2 → Q C
Q → a Q | ε
C → b C c | ε
• A string that occurs in both languages must have the same number
of as and bs per L1 and the same number of bs and cs per L2. So the
intersection language consists of strings of the form aⁿbⁿcⁿ and we
know that language is not context-free.
379
context-free ∩ context-free = …
• The intersection of two context-free languages
might be a context-sensitive language.
– Example: the intersection of aⁿbⁿcᵐ and aᵐbⁿcⁿ is
aⁿbⁿcⁿ, and the latter is context-sensitive.
• The intersection of two context-free languages
might be a context-free language.
– Example: the intersection of a context-free
language with itself is context-free.
380
Set theory vs. language theory
• When languages are treated as sets it is easy to generate aⁿbⁿcⁿ: aⁿbⁿcᵐ ∩ aᵐbⁿcⁿ = aⁿbⁿcⁿ.
• Conversely, when languages are treated as grammars it is quite difficult to generate aⁿbⁿcⁿ:
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
381
Easy to intersect two CF languages
Just enumerate them both (use the queue
algorithm) and output words that appear on
both lists. The queue algorithm outputs the
strings in order of increasing length. Suppose
grammar 1 generates string abc. We can
determine if grammar 2 generates abc by
running the queue algorithm on grammar 2
until (a) it outputs abc, or (b) it outputs a
string with length greater than 3 (the length
of abc).
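A bounded-enumeration sketch of this idea (a simplification of the queue algorithm; the representation and pruning bound are mine, adequate for the small example grammars but not a general implementation):

def strings_up_to(grammar, start, max_len):
    # expand the leftmost non-terminal, pruning any sentential form whose
    # terminal content is already longer than max_len
    results, todo, seen = set(), [(start,)], set()
    while todo:
        form = todo.pop()
        if form in seen:
            continue
        seen.add(form)
        if sum(1 for s in form if not s.isupper()) > max_len:
            continue                           # can only get longer: prune
        nts = [i for i, s in enumerate(form) if s.isupper()]
        if not nts:
            results.add("".join(form))
            continue
        i = nts[0]
        for rhs in grammar[form[i]]:
            todo.append(form[:i] + rhs + form[i + 1:])
    return results

# L1 = a^n b^n c^m and L2 = a^m b^n c^n, as defined a few slides back
L1 = {"S": [("A", "P")], "A": [("a", "A", "b"), ()], "P": [("c", "P"), ()]}
L2 = {"S": [("Q", "C")], "Q": [("a", "Q"), ()], "C": [("b", "C", "c"), ()]}
print(sorted(strings_up_to(L1, "S", 6) & strings_up_to(L2, "S", 6), key=len))
# ['', 'abc', 'aabbcc'] — exactly the strings a^n b^n c^n up to length 6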
382
Easy to determine membership of the
intersection of two XML languages
• Problem: is an XML instance document a member
of the intersection of two XML Schemas?
• Validate the XML instance document twice, once
for each XML Schema. The XML instance is a
member of the intersection if and only if it
conforms to both XML Schemas.
383
The intersection of CF languages
has weird properties
• The intersection of two CF languages might be a
Type 1 (context-sensitive) language.
• The intersection of three CF languages is more
powerful than the intersection of two of them.
• Remarkable phenomenon: any Type 1 language,
and even any Type 0 language, can be
constructed by intersecting just two CF
languages, provided we are allowed to erase all
symbols in the resulting strings that belong to a
set of erasable symbols. (continued on the next slide)
384
erasure(L3 ∩ L4) = a CS language
• The CS language we will use to demonstrate
this remarkable phenomenon is the set of all
strings that consist of two identical parts: ww,
where w is any string over the given alphabet.
• The two languages to be intersected are
defined by:
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
x and y are the erasable symbols
385
Intersecting two CF grammars
with erasable symbols
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
This grammar produces strings
consisting of three parts: a
sequence A1 of as and bs,
followed by its “dark mirror”
image M1, in which a
corresponds to x and b to y,
followed by an arbitrary
sequence G1 of as and bs.
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
This grammar produces strings
consisting of an arbitrary
sequence G2 of as and bs, a
“dark mirror” image M2
preceding a sequence A2 of as
and bs.
Example string for both grammars: abaxyxaba
abaxyxaba
erase the erasable symbols x, y
abaaba
386
Dark mirror
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
This grammar produces strings
consisting of three parts: a
sequence A1 of as and bs,
followed by its “dark mirror”
image M1, in which a
corresponds to x and b to y,
followed by an arbitrary
sequence G1 of as and bs.
This grammar produces strings
consisting of an arbitrary
sequence G2 of as and bs, a
“dark mirror” image M2
preceding a sequence A2 of as
and bs.
Example string: abaxyxaba — in L3, xyx is the dark mirror of the leading aba; in L4, xyx is the dark mirror of the trailing aba.
387
Each grammar has 3 parts
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
This grammar produces strings
consisting of three parts: a
sequence A1 of as and bs,
followed by its “dark mirror”
image M1, in which a
corresponds to x and b to y,
followed by an arbitrary
sequence G1 of as and bs.
This grammar produces strings
consisting of an arbitrary
sequence G2 of as and bs, a
“dark mirror” image M2
preceding a sequence A2 of as
and bs.
Example string: abaxyxaba — segmented as A1 M1 G1 = aba xyx aba by L3, and as G2 M2 A2 = aba xyx aba by L4.
388
Corresponding parts must match
for an intersection
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
This grammar produces strings
consisting of three parts: a
sequence A1 of as and bs,
followed by its “dark mirror”
image M1, in which a
corresponds to x and b to y,
followed by an arbitrary
sequence G1 of as and bs.
This grammar produces strings
consisting of an arbitrary
sequence G2 of as and bs, a
“dark mirror” image M2
preceding a sequence A2 of as
and bs.
Example string: abaxyxaba — A1 M1 G1 (per L3) and G2 M2 A2 (per L4).
The intersection forces A1 = G2, M1 = M2, and G1 = A2.
389
A1 equals A2
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
This grammar produces strings
consisting of three parts: a
sequence A1 of as and bs,
followed by its “dark mirror”
image M1, in which a
corresponds to x and b to y,
followed by an arbitrary
sequence G1 of as and bs.
This grammar produces strings
consisting of an arbitrary
sequence G2 of as and bs, a
“dark mirror” image M2
preceding a sequence A2 of as
and bs.
Example string: abaxyxaba — A1 M1 G1 (per L3) and G2 M2 A2 (per L4).
The intersection forces A1 = G2, M1 = M2, and G1 = A2.
M1 is the dark mirror of A1, so M2 is the dark mirror of A1, so A2 equals A1.
390
Erase the dark mirrors
L3 → A P
A → a A x | b A y | ε
P → a P | b P | ε
and
L4 → Q C
Q → a Q | b Q | ε
C → x C a | y C b | ε
This grammar produces strings
consisting of three parts: a
sequence A1 of as and bs,
followed by its “dark mirror”
image M1, in which a
corresponds to x and b to y,
followed by an arbitrary
sequence G1 of as and bs.
This grammar produces strings
consisting of an arbitrary
sequence G2 of as and bs, a
“dark mirror” image M2
preceding a sequence A2 of as
and bs.
Example string: abaxyxaba — A1 M1 G1 (per L3) and G2 M2 A2 (per L4).
The intersection forces A1 = G2, M1 = M2, and G1 = A2.
M1 is the dark mirror of A1, so M2 is the dark mirror of A1, so A2 equals A1.
After erasing the mirrors we have abaaba, which is (aba)².
391
Create Type 0 from the intersection
of context-free languages
Using a massive application of the mirror-mirror
trick, one can relatively easily prove that any
Type 0 language can be constructed as the
intersection of two CF languages, plus a set of
erasable symbols.
The construction is described extremely formally in Ginsburg, Greibach and Harrison, “One-Way Stack Automata”, J. ACM 14(2), April 1967, pages 402–405. Roughly the same explanation can be found in Harrison, “Introduction to Formal Language Theory”, Addison-Wesley, 1978, pages 307–311. But like all text in that book the explanation is very dense and is severely complicated by the fact that the author inside the explanation wants to prove that the two CF languages you need are deterministic.
392
**** Provisional Explanation of Creating a Type 0 Language by Intersecting Two CF Languages (Dick Grune)****
The basic trick is for the string in the intersection to represent the complete production process of a terminal production of a
Type 0 grammar G. The steps are encoded as follows:
… ( X1l alpha_1 X1r → -X1r -beta_1 -X1l ) ( X2l alpha_2 X2r → -X2r -beta_2 -X2l ) …   (1)
where the →, ( and ) are unique markers; Xnl(eft) and Xnr(ight) are arbitrary strings; -S means the reverse of S; and alpha_1 →
beta_1, alpha_2 → beta_2, etc. are rules in G. In fact this is what L1 generates, and that is easy to do: it's just a repetition of (
Xnl alpha_n Xnr → -Xnr -beta_n -Xnl). It is easy to produce S x -S by CF grammar, so apply twice, once for the inner Xnr, once
for the outer Xnl, and you're done.
But the above steps are a good derivation only when the output of one step is the input of the next, so -(-X1r -beta_1 -X1l)
must be equal to X2l alpha_2 X2r. This is where L2 comes in. It generates
… → Y1 ) ( -Y1 → Y2 ) ( -Y2 → …   (2)
which again can be produced easily with a CF grammar, since its structure is again S x -S. Intersecting the two enforces the
reverse of -X1r -beta_1 -X1l (which itself was the reverse of the result of the step alpha_1 → beta_1) to be equal to X2l
alpha_2 X2r, the input of the next step. This makes the intersection of string (1) and string (2) a representation of a valid Type 0
production process.
There are two more details to care for. One is the start-up, which is next to trivial. The second is the close-down and the
harvesting of the result. This is where the homomorphism (the erasing of the erasable symbols) comes in.
Before we start the whole construction we replace all terminals in G by non-terminals with similar names, and declare all
symbols in G erasable. This ensures that when we in the end apply the homomorphism (the erasure act) the whole production
process disappears. But of course we want to keep the final product which consists exclusively of those non-terminals that
represent terminals. We harvest them by letting the productions of L1 and L2 end in the language T ) #-T, where T is any string
of the non-terminals created for the original terminals of G, and # replaces each of the non-terminals by its corresponding
terminal. Again this is easy to do since its structure is again essentially S x -S.
Now when we erase the erasable symbols, everything disappears except the final string of terminals, a production of G.
Hurray!
393
context-free ∩ regular = context-free
• The intersection of a context-free and a
regular language is always a context-free
language.
• There is a simple algorithm to construct a
grammar for that intersection language.
394
De Morgan’s Law
The intersection of two sets equals the negation of the union of their negations:
𝐿1 ∩ 𝐿2 = ¬((¬𝐿1) ∪ (¬𝐿2))
395
The negation of a context-free
language might not be context-free
• De Morgan’s law: L1 ∩ L2 = ¬((¬L1) ∪ (¬L2))
• Suppose, for the sake of argument, that the negation of a CF language always produces a CF language.
• L1 and L2 are CF languages. Then ¬L1 is a CF language,
as is ¬L2. We know that the union of two CF languages
produces a CF language so (¬L1) ∪ (¬L2) produces a CF
language. The negation of it then produces a CF
language. So ¬((¬L1) ∪ (¬L2)) is a CF language. But that
equals L1 ∩ L2 and we already know that the
intersection of two CF languages might not be CF.
• Therefore, negation of a CF language is not guaranteed
to produce a CF language.
396
Set properties of regular
(Type 3) languages
regular-language ∪ regular-language = regular-language
regular-language ∩ regular-language = regular-language
¬regular-language = regular-language
397
Would there be programming
languages? XML?
• It is interesting to speculate what would have
happened if formal languages had been based on set
theory with all the set operations right from the start,
rather than on the Chomsky hierarchy.
• Would context-free languages still have been invented?
• CF languages guarantee only set union, not set
intersection or set difference. If you insist on having set
intersection (which is very tempting and convenient,
see for example the ease with which you can construct
anbncn by intersection), you'll never invent CF
languages.
398
Parsing for grammar conformance
• Scenario: you observe a pattern in the strings
that you are dealing with. So you create a
grammar to describe the pattern. Now you
want to check that your grammar correctly
describes the pattern. What is required of a
parser?
• Parsing only needs to check that the string
conforms to the grammar.
399
Parsing to determine the
string’s semantics
• Often we want to go further than simply check
that a string conforms to the grammar we
have designed for it.
• We want to know the string’s meaning, its
semantics.
• The semantics of a string is directly related to
the structure of its production tree. If it is
not, we have the wrong grammar.
400
Attaching semantics to a grammar
Attaching semantics to a (context-free) grammar
is done in a very simple and effective way:
To each rule in the grammar, a semantic clause is
attached, which relates the semantics of the
members of the right-hand side of the rule to the
semantics of the left-hand side.
Sum → Digit   {A0 := A1}   ← semantic clause
401
Flow of the semantic info
Semantic info can flow up, down, or both ways:
• Up: semantic information flows from the leaves of the
tree upward to the start symbol. The semantics of the
members of the right-hand side of each rule is used to
define the semantics of the left-hand side.
• Down: semantic information flows downward from
the start symbol to the leaves. The semantics of the
left-hand side of each rule is used to define the
semantics of the members of the right-hand side.
• Both: semantic information flows up and down for a
while until a stable situation is reached.
402
Inherited vs. derived
semantic information
• Semantic information flowing down is called
inherited: each rule inherits semantics from its
parent in the tree.
• Semantic information flowing up is called
derived: each rule derives semantics from its
children.
– Derived information is also called synthesized
information.
403
Expressing semantics
• There are many ways to express semantics.
• We will briefly describe two often-used and
well-studied techniques:
1. Attribute grammars
2. Transduction grammars
404
Add semantic info to this grammar
• We will explain attribute grammars and
transduction grammars using the language:
sums of one-digit numbers.
• The semantics of a sentence in the language is
the value of the sum.
• The language is generated by this grammar:
1. Sum → Digit
2. Sum → Sum + Digit
3. Digit → 0 | 1 | … | 9
Here is one of the strings in the language: 3 + 5 + 1
The semantics of that string is: 9
405
Attribute grammars
• The semantic clauses in an attribute grammar assume that each node in
the production tree has room for one or more attributes, which are just
values (numbers, strings, or anything else) sitting in nodes in production
trees.
• For simplicity we restrict ourselves to attribute grammars with only one
attribute per node.
• The semantic clause of a rule in such a grammar contains some formulas
which compute the value of some of the non-terminals in that rule from
those of other non-terminals in that same rule.
• These semantic actions connect only values that are local to the rule. The
overall semantics is composed as the result of all the local computations.
– Local actions produce global results – cool!
• If the semantic action of a rule R computes the value of the left-hand side
of R, that value is derived (synthesized). If it computes a value of one of
the non-terminals in the right-hand side of R, say A, then that value is
inherited by A.
406
Naming the attributes
Sum → Sum + Digit
(attributes: A0 for the left-hand-side Sum, A1 for the right-hand-side Sum, A3 for Digit)
Each non-terminal has an associated attribute. The attribute for the symbol on the left-hand side, Sum, is named A0. Each symbol, including terminals, is indexed, so the attribute for the right-hand-side Sum is A1 and the attribute for Digit is A3.
407
Attribute grammar for the
Sum grammar
1. Sum → Digit           {A0 := A1}
2. Sum → Sum + Digit     {A0 := A1 + A3}
3. Digit → 0             {A0 := 0}
4. Digit → 1             {A0 := 1}
5. Digit → 2             {A0 := 2}
6. Digit → 3             {A0 := 3}
7. Digit → 4             {A0 := 4}
8. Digit → 5             {A0 := 5}
9. Digit → 6             {A0 := 6}
10. Digit → 7            {A0 := 7}
11. Digit → 8            {A0 := 8}
12. Digit → 9            {A0 := 9}
408
Initial production tree
(figure: the production tree for 3 + 5 + 1, with A0 = 3, A0 = 5, and A0 = 1 filled in at the three Digit leaves)
The initial production tree for 3 + 5 + 1 is given above.
Initially only the attributes of the leaves are known,
but as soon as all attributes in a right-hand side of a
production rule are known, we can use its semantic
clause to compute the attribute of its left-hand side.
This way the attribute values (semantics) percolate up
the tree, finally reaching the start symbol and
providing us with the semantics of the whole string.
409
Attribute values percolate up the tree
(figure: the same tree in stages — first the lower Sum computes A0 = A1 + A3 = 3 + 5 = 8, then the top Sum computes A0 = A1 + A3 = 8 + 1 = 9)
Attribute grammars are a very powerful method for handling the semantics of a language.
This is another example of a closure algorithm!
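A sketch of this bottom-up evaluation in Python (the tree representation and rule names are mine, not the book's):

def evaluate(node):
    # a node's attribute A0 is computed from its children's attributes
    if isinstance(node, str):                 # a digit leaf
        return int(node)                      # Digit → d   {A0 := d}
    rule, children = node
    if rule == "Sum->Digit":                  # {A0 := A1}
        return evaluate(children[0])
    if rule == "Sum->Sum+Digit":              # {A0 := A1 + A3}
        return evaluate(children[0]) + evaluate(children[1])

tree = ("Sum->Sum+Digit",                     # production tree for 3 + 5 + 1
        [("Sum->Sum+Digit", [("Sum->Digit", ["3"]), "5"]), "1"])
print(evaluate(tree))                         # 9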
410
XML attributes
• An XML Schema creates a grammar.
• XML has “attributes”.
• Are XML attributes in any way related to
attribute grammars?
• Did the creators of XML create XML attributes
simply for tucking away a name-value pair in a
tidy fashion? Or, did they have in mind a
deeper usage for XML attributes: use them to
define the semantics of the XML?
411
Transduction grammars
• Transduction grammars define the semantics
of a string (the “input string”) as another
string, the “output string” or “translation”.
• The semantic clause in a production rule is the
string that should be output for the node.
• The string is output for a node after the
strings for all its children.
412
Transduction grammar for Sum
Here is a transduction grammar which translates
a sum of digits into an instruction to compute
the value of the sum:
1. Sum → Digit           “make it the result”
2. Sum → Sum + Digit     “add it to the previous result”
3. Digit → 0             “take a 0”
4. Digit → 1             “take a 1”
5. Digit → 2             “take a 2”
6. Digit → 3             “take a 3”
7. Digit → 4             “take a 4”
8. Digit → 5             “take a 5”
9. Digit → 6             “take a 6”
10. Digit → 7            “take a 7”
11. Digit → 8            “take a 8”
12. Digit → 9            “take a 9”
413
Sequence of outputs
(figure: four snapshots of the production tree for 3 + 5 + 1; the outputs emitted so far are, in order, “take a 3”, “make it the result”, “take a 5”, “add it to the previous result”)
414
Meaning of 3 + 5 + 1
The transduction grammar translates 3 + 5 + 1
into:
take a 3
make it the result
take a 5
add it to the previous result
take a 1
add it to the previous result
which is indeed what 3 + 5 + 1 “means”
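A sketch of the transduction in Python, using the same assumed tree representation as the attribute-grammar sketch earlier: each node emits its semantic string after all its children have emitted theirs.

def transduce(node, out):
    if isinstance(node, str):                 # a digit leaf
        out.append(f"take a {node}")
        return
    rule, children = node
    for child in children:                    # children first...
        transduce(child, out)
    out.append("make it the result" if rule == "Sum->Digit"
               else "add it to the previous result")   # ...then the node

tree = ("Sum->Sum+Digit",                     # production tree for 3 + 5 + 1
        [("Sum->Sum+Digit", [("Sum->Digit", ["3"]), "5"]), "1"])
out = []
transduce(tree, out)
print("\n".join(out))   # take a 3 / make it the result / take a 5 / ...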
415
Augmented Transition Network (ATN)
• Semantics can be introduced into a recursive transition
network by attaching actions to the transitions in the
graphs.
• These actions can set variables, construct data
structures, etc.
• A thus augmented recursive transition network is
known as an Augmented Transition Network (ATN).
Recursive Transition Network: (figure: the Sentence graph, with arrows labeled Name, List, and &)
416
Generative Power
• Grammars generate languages.
• By applying restrictions to grammar rules we
reduce the generative power of the grammars.
417
Grammar power = language size?
• One often hears statements like these:
Type 𝑛 grammars are more powerful than Type 𝑛 + 1
grammars, for 𝑛 = 0,1,2
A regular (Type 3) grammar is not powerful enough to match
parentheses.
• What kind of power is meant?
• One might think that it is the power to generate
larger and larger sets, but this is clearly incorrect:
the largest possible set of strings, Σ* (the set of
all strings over the alphabet) is easily generated
by the (unpowerful) Type 3 grammar:
S → [Σ] S | ε
where [Σ] is an abbreviation for the symbols of the alphabet
418
Power means restriction!
• Restricting Σ* requires more powerful grammars.
• More powerful grammars can define more
complicated boundaries between correct and
incorrect sentences.
Example: a Type 1 (context-sensitive) grammar can define sentences with the same number of as, bs, and cs (i.e., aⁿbⁿcⁿ) whereas the best that a Type 2 (context-free) grammar can do is define sentences with the same number of as and bs, with no restrictions on the number of cs (i.e., aⁿbⁿcᵐ).
• Some boundaries are so fine, they cannot be
described by any grammar (that is, by any
generative process).
419
Power of a grammar
• A more powerful grammar does not mean
that the grammar can generate larger
languages (larger sets).
• More powerful means that the grammar can
define more precise rules regarding what
strings are allowed in the language.
420
Metaphor for grammar power:
outlining a rose
• Imagine drawing a rose. It is approximated by increasingly finer outlines.
• In this metaphor, the rose corresponds to the language (imagine the strings of the language as molecules in the rose); the grammar serves to delineate its silhouette.
• A regular grammar only allows us straight horizontal and vertical line segments to describe the flower. Ruler and T-square suffice, but the result is a coarse and mechanical-looking picture.
• A CF grammar would approximate the outline by straight lines at any angle and by circle segments. The drawing could still be made using the classical tools of compass and ruler. The result is stilted but recognizable.
• A CS grammar would present us with a smooth curve tightly enveloping the flower, but the curve is too smooth: it cannot follow all the sharp turns, and it deviates slightly at complicated points. Still, a very realistic picture results.
• An unrestricted phrase structure grammar can represent the outline perfectly.
• The rose itself cannot be caught in a finite description. Its essence remains forever out of our reach.
421
Set of Java programs that can be
generated by the grammar types
• A regular grammar can generate the set of all “lexically correct” Java programs. A Java program is lexically correct if there are no newlines inside strings, comments are terminated before end-of-file, all numerical constants have the right form, etc.
• A context-free grammar can generate the set of all “syntactically correct” Java programs. These programs conform to the CF grammar in the language manual.
• A context-sensitive grammar can generate the set of all semantically correct Java programs. These are the programs that pass through a Java compiler without drawing error messages.
• An unrestricted phrase structure grammar can generate the set of all Java programs that would terminate in finite time when run with a given input. Such a grammar would, however, be very complicated, since it would incorporate detailed descriptions of the Java library routines and the Java run-time system.
• The set of all Java programs that solve a given problem (for example, play chess) cannot be generated by a grammar (although the description of the set is finite).
• Note that each of the above sets is a subset of the previous set.
422
Sets generated by the grammar types
(figure: the nested sets generated by the grammar types)
423
Set versus Language
(figure: the sets that can be generated by a grammar type versus the languages that can be generated by a grammar type)
424
Set of XSLT programs that can be
generated by the grammar types
The set of all “lexically correct” XSLT programs can
be generated by a regular grammar. An XSLT
<xsl:value-of> element is lexically correct if it starts
with <xsl:value-of, optionally followed by one or
more attribute/value pairs, followed by </xsl:value-of>
1. Value-of → <xsl:value-of Rest
2. Rest → string = “ Expression
3. Rest → End-tag
4. Expression → string ” End-tag
5. End-tag → > </xsl:value-of >
425
Set of XSLT programs that can be
generated by the grammar types
The set of all “syntactically correct” XSLT
programs can be generated by a context-free
grammar. These programs conform to the CF
grammar in the XSLT specification.
1. Value-of → Start-tag End-tag
2. Start-tag → <xsl:value-of Select >
3. Start-tag → <xsl:value-of Separator >
4. Start-tag → <xsl:value-of Disable-output-escaping >
5. End-tag → </xsl:value-of >
6. Select → select = “ Expression ”
7. Separator → separator = “ AVT ”
8. Disable-output-escaping → disable-output-escaping = “ YESNO ”
9. YESNO → yes | no
426
Set of XSLT programs that can be
generated by the grammar types
• The set of all semantically correct XSLT programs can be
generated by a CS grammar. These are the programs that
pass through an XSLT processor without drawing error
messages.
• The set of all XSLT programs that would terminate in finite
time when run with a given input can be generated by an
unrestricted phrase structure grammar. Such a grammar
would, however, be very complicated, since it would
incorporate detailed descriptions of the XPath function
routines and the XSLT run-time system (e.g. optimizations).
• The set of all XSLT programs that solve a given problem (for
example, play chess) cannot be generated by a grammar
(although the description of the set is finite).
427
The basis for the importance
of context-free grammars
• A Chomsky grammar is a finite mechanism that
produces a (usually) infinite set of strings, a “language”.
• Unlike many other set generation mechanisms, this
production process assigns a structure to the produced
string, which can be utilized to attach semantics to it.
– For context-free (Type 2) grammars, this structure is a tree,
which allows the semantics to be composed from the
semantics of the branches. This is the basis of the
importance of context-free grammars.
428
Tom, Dick and Harry example
The following slides illustrate each of the
grammar types, using the tdh language.
429
Example of a phrase structure (PS)
grammar, Type 0
1. Sentence → Name | List End
2. Name → tom | dick | harry
3. List → Name | Name , List
4. , Name End → and Name
Notice that rule 4 has more stuff on the left-hand side than on the right-hand side. That’s what characterizes a PS grammar.
Key Point: this grammar is in the form of a phrase-structure grammar
but the language (set) it generates can be generated by a mere
regular grammar. Here is what characterizes a set that is Type 0: an
item not in the set cannot be ruled out in finite time. That is, it may
take an infinite amount of time to determine that an item is not in
the set. More formally, determining if a given item belongs to a set
generated by a Type 0 grammar is undecidable.
430
PS grammar vs. PS language
• A PS grammar is a grammar that has the
proper form: no restrictions on the LHS of
each rule, other than that it must contain a non-terminal.
• A PS language is a language (set) that can only
be generated by a PS grammar, not a CS
grammar or CF grammar or FS grammar.
431
Example of a context-sensitive (CS)
grammar, Type 1
1. Sentence → Name | List
2. Name → tom | dick | harry
3. List → EndName | Name Comma List
4. Comma EndName → and EndName
5. and EndName → and Name
6. Comma → ,
Notice that each rule has at least as much stuff on the right-hand side as on the left-hand side. Further, the RHS is exactly like the LHS, except that one non-terminal has been changed. That’s what characterizes a CS grammar.
Key Point: this grammar is in the form of a context-sensitive grammar but the language (set) it generates can be generated by a mere context-free grammar. What distinguishes a Type 1 grammar from a Type 0 grammar is that an item not in the set can be recognized as such in finite time.
432
Example of a context-free (CF)
grammar, Type 2
1. Sentence → Name | List and Name
2. Name → tom | dick | harry
3. List → Name , List | Name
Notice that each rule has exactly one non-terminal on the LHS: the
non-terminal is defined independent of context.
433
Example of a regular/finite-state (FS)
grammar, Type 3
1. Sentence → tom | dick | harry | List
2. List → tom ListTail | dick ListTail | harry ListTail
3. ListTail → , List | and tom | and dick | and harry
This is a right-regular grammar: each rule’s non-terminal is at the right end of the rule.
434
Example of a finite-choice (FC)
grammar, Type 4
Sentence → [tdh] | [tdh] and [tdh] | [tdh] , [tdh] & [tdh]
Note: Type 4 is not part of the Chomsky hierarchy.
435
Standard example of a
Type 1 language: aⁿbⁿcⁿ
CS grammar for aⁿbⁿcⁿ:
1. S → aSBC | aBC
2. CB → HB
3. HB → HC
4. HC → BC
5. aB → ab
6. bB → bb
7. bC → bc
8. cC → cc
Derivation of a²b²c²:
S          (start)
aSBC       (rule 1)
aaBCBC     (rule 1)
aabCBC     (rule 5)
aabHBC     (rule 2)
aabHCC     (rule 3)
aabBCC     (rule 4)
aabbCC     (rule 6)
aabbcC     (rule 7)
aabbcc     (rule 8)
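The derivation can be replayed mechanically with plain string rewriting; a small sketch (mine, not the book's), where each step names the substring to rewrite:

def derive(steps, form="S"):
    for old, new in steps:               # rewrite the leftmost occurrence
        assert old in form, (old, form)
        form = form.replace(old, new, 1)
        print(form)
    return form

assert derive([("S", "aSBC"), ("S", "aBC"),            # rule 1, rule 1
               ("aB", "ab"), ("CB", "HB"),             # rule 5, rule 2
               ("HB", "HC"), ("HC", "BC"),             # rule 3, rule 4
               ("bB", "bb"), ("bC", "bc"),             # rule 6, rule 7
               ("cC", "cc")]) == "aabbcc"              # rule 8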
436
Formal summary of
Type 0, 1, 2, 3 grammars
A generative grammar G = (VN, VT, S, F) is said to
be of type i if it satisfies the restrictions
described in this list:
i = 0:
No restrictions, except that the LHS must contain a non-terminal.
i = 1:
Every rewriting rule in F has the form Q1AQ2 → Q1PQ2, with Q1, Q2, and P in (VN ∪ VT)*, A ∈ VN, and P ≠ ε, except possibly for the rule S → ε, which may occur in F, in which case S does not occur on the right-hand sides of the rules.
i = 2:
Every rule in F has the form A → P, where A ∈ VN, and P ∈ (VN ∪ VT)*.
i = 3:
Every rule in F has the form A → PB or A → P, where A, B ∈ VN, and P ∈ VT*.
437
Type 0-3 languages are infinite
• Every language that is properly of Type 0–3 (that is, not also Type 4) is infinite.
• If a language is finite, we can enumerate its sentences using a Type 4 grammar.
438
Why does a parse “reconstruct”?
Parsing is the task of reconstructing the
derivation tree (or graph) for a given input
string.
• Why is parsing about “reconstructing”?
• That implies that at one time the input string
was in the form of a tree (or graph) but
somehow it lost that form and now we are
reconstructing it.
439
The brain creates parse trees
According to Chomsky, and I think he is right in this, sentences in a
language, natural or artificial, are constructed according to a grammar.
While being generated they obtain a structure, the generation tree (or
graph in the case of PS grammars). This structure encodes the meaning. When
the sentence is spoken or written the terminal symbols (words) alone are
transferred to the listener or reader, losing the structure (linearized). But since
the meaning is attached to that structure the listener or reader will
have to reconstruct the generation tree, now called the parse tree, to
retrieve the meaning. That's why we need parsing.
Actually I do not think Chomsky is 100% right. CF grammars are not
strong enough, and people don't use PS grammars. I think they use affix
or attribute grammars (they are equivalent). But the above paragraph
still holds.
Dick Grune
440
The brain creates parse trees
Sentences, linear sequences of symbols, are
really just serializations of parse trees we
humans grok natively in hardware. To get an
idea across to someone, we have to conjure up
the same parse tree in their head using a word
stream.
The Definitive ANTLR 4 Reference, p. 11
441
The computer scientist is undaunted
by undecidable problems
The issue addressed here is the formal linguist saying "you can't do
this" (and he is correct) and the computer scientist saying "true, but I can
handle an increasing number of instances with increasingly complicated
algorithms".
A good example is the package AmbiDexter
(http://homepages.cwi.nl/~storm/publications/ambidexter.pdf), which
"solves" an undecidable problem: is a given CF grammar ambiguous?
(impossible to decide according to formal language theory). The program
does so by trying all kinds of tricks to have the grammar produce two
identical sentences (it was written by Bas Basten, one of those
undaunted computer scientists).
Dick Grune
442
OTHER SLIDES
443
Grammar-oriented programming
• Grammar-oriented programming (GOP) and Grammar-oriented Object Design (GOOD) are good for designing and
creating a domain-specific programming language (DSL) for
a specific business domain.
• GOOD can be used to drive the execution of the application
or it can be used to embed the declarative processing logic
of a context-aware component (CAC) or context-aware
service (CAS). GOOD is a method for creating and
maintaining dynamically reconfigurable software
architectures driven by business-process architectures. The
business compiler was used to capture business processes
within real-time workshops for various lines of business and
create an executable simulation of the processes used.
http://en.wikipedia.org/wiki/Grammar-oriented_programming
444
Rodney Brooks
Once Turing came up with a formalism for
computation we were able make great progress
fairly quickly. Now if you took any late 19th-century
mathematicians, you could explain the fundamental
ideas of computation to them in two or three days,
lead them through the theorems, they could
understand it and they wouldn't find it mind
boggling in any way. It follows on from 19th-century
mathematics. Once you have that notion of
computation, you are able to do a lot with it.
http://www.edge.org/conversation/the-deep-question
445
CS grammar for generating the same
number of as, bs, and cs
Shuffle the as, bs, and cs:
1. S → S’ | ε
2. S’ → S’ABC | ABC
3. AB → BA
4. AC → CA
5. BA → AB
6. BC → CB
7. CA → AC
8. CB → BC
9. A → a
10. B → b
11. C → c
446
Grammar for generating
1, 2, 4, 8, … as (2ⁱ as)
S → ACaB
Ca → aaC
CB → DB
CB → E
aD → Da
AD → AC
aE → Ea
AE → ε
S → ACaB → AaaCB → AaaE → AaEa → AEaa → aa
S → ACaB → AaaCB → AaaDB → AaDaB → ADaaB → ACaaB → AaaCaB → AaaaaCB → AaaaaE → AaaaEa → AaaEaa → AaEaaa → AEaaaa → aaaa
447
CS languages
One of the simplest context-sensitive languages is: the
language of all strings consisting of n occurrences of
the symbol "a", then n "b"'s, then n "c"'s (abc, aabbcc,
aaabbbccc, etc.). A superset of this language, called
the Bach language,[1] is defined as the set of all strings
where "a", "b" and "c" (or any other set of three
symbols) occurs equally often (aabccb, baabcaccb,
etc.) and is also context-sensitive.[2][3]
Another example of a context-sensitive language that is
not context-free is L = { aᵖ : p is a prime number }. L
can be shown to be a context-sensitive language by
constructing a linear bounded automaton which
accepts L. The language can easily be shown to be
neither regular nor context free by applying the
respective pumping lemmas for each of the language
classes to L.
http://en.wikipedia.org/wiki/Context-sensitive_language
448
Symbols
• ∈ = an element of
• ∉ = not an element of
• ℕ = the set of natural numbers
• ε = the empty string
• Σ = the alphabet
• Σ* = the set of all strings over the alphabet Σ
• δ = transition function for a single token
• δ* = transition function for a sequence of tokens
• → = “may be replaced by” (in grammar rules)
• ⊆ = subset of
• 𝐿 = a language
• 𝐿 (with overbar) = the complement of language 𝐿
• ∩ = intersection
• ∪ = union
• ↦ = maps to
• P
• ∅ = the empty set
• ≠ = not equal to
• ¬ = negation (complement)
• b
• {q0, q1}
449