Scanning & Regular Expressions
CPSC 388
Ellen Walker
Hiram College
Scanning
• Input: characters from the source code
• Output: Tokens
– Keywords: IF, THEN, ELSE, FOR …
– Symbols: PLUS, LBRACE, SEMI …
– Variable tokens: ID, NUM
• Augment with string or numeric value
TokenType
• Enumerated type (a C++ construct)
typedef enum {IF, THEN, ELSE …} TokenType;
• IF, THEN, ELSE (etc.) are now literals of
type TokenType
Using TokenType
void someFun(TokenType tt){
  …
  switch (tt){
    case IF: … break;
    case THEN: … break;
    …
  }
}
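A minimal compilable sketch of the enum-plus-switch pattern above. It assumes only the three token types named on the slide, and tokenName is a hypothetical helper introduced here for illustration.

```cpp
#include <string>

// Only the three token types named on the slide; a real scanner lists more.
enum TokenType { IF, THEN, ELSE };

// Hypothetical helper: maps a token type to a printable name.
std::string tokenName(TokenType tt) {
    switch (tt) {
        case IF:   return "IF";
        case THEN: return "THEN";
        case ELSE: return "ELSE";
    }
    return "UNKNOWN";  // reached only if tt holds a value outside the enum
}
```

Because the switch covers every enumerator, many compilers can warn when a new token type is added to the enum but not to the switch.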
Token Class (partial)
class Token {
public:
  TokenType tokenval;
  string tokenchars;
  double numval;
};
Interlude: References and Pointers
• Java has primitives and references
– Primitives are int, char, double, etc.
– References “point to” objects
• C++ has only primitives
– But, one of the primitives is “address”,
which serves the purpose of a reference.
Interlude: References and Pointers
• To declare a pointer, put * after the type
char x;  // a character
char *y; // a pointer to a character
• Using pointers:
x = 'a';
y = &x;   // y gets the address of x
*y = 'b'; // thing pointed at by y becomes 'b'
          // note that x is now also 'b'!
Interlude: References and Pointers
• Continuing the example…
cout << x << endl;          // prints b
cout << *y << endl;         // prints b
cout << (void *)y << endl;  // prints a hex address
cout << (void *)&x << endl; // same as above
cout << &y << endl;         // a different address, where the pointer is stored
• The (void *) cast is needed because cout prints a bare char * as a
character string, not as an address
GetToken(): A scanning function
• Token *GetToken(istream &sin)
– Read characters from sin until a complete
token is extracted, return (a pointer to) the
token
– Usually called by the parser
– Note: version in the book uses global
variables and returns only the token type
Using GetToken
Token *myToken = GetToken(cin);
while (myToken != NULL){
//process the token
switch (myToken->tokenval){
//cases for each token type
}
myToken = GetToken(cin);
}
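One way GetToken might be written, as a rough sketch rather than the course's or the book's implementation. The token set (FOR, ID, NUM, a few symbols, and a catch-all OTHER) is an assumption chosen to fit the for-loop example, and scanAll is a hypothetical wrapper that repeats the calling loop shown above.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Token types for this sketch only (an assumption, not the full course list).
enum TokenType { FOR, ID, NUM, LPAREN, RPAREN, LBRACE, SEMI, PLUS, OTHER };

class Token {
public:
    TokenType tokenval;
    std::string tokenchars;
    double numval;
};

// Reads one token from sin; returns nullptr when the input is exhausted.
// The caller owns the returned Token and must delete it.
Token *GetToken(std::istream &sin) {
    char c;
    if (!(sin >> c)) return nullptr;  // operator>> skips leading whitespace

    Token *tok = new Token;
    tok->tokenchars = std::string(1, c);
    tok->numval = 0.0;

    if (std::isalpha(static_cast<unsigned char>(c))) {
        // Identifier or keyword: keep reading letters and digits.
        while (std::isalnum(sin.peek()))
            tok->tokenchars += static_cast<char>(sin.get());
        tok->tokenval = (tok->tokenchars == "for") ? FOR : ID;
    } else if (std::isdigit(static_cast<unsigned char>(c))) {
        // Unsigned integer literal: keep reading digits.
        while (std::isdigit(sin.peek()))
            tok->tokenchars += static_cast<char>(sin.get());
        tok->tokenval = NUM;
        tok->numval = std::stod(tok->tokenchars);
    } else {
        switch (c) {
            case '(': tok->tokenval = LPAREN; break;
            case ')': tok->tokenval = RPAREN; break;
            case '{': tok->tokenval = LBRACE; break;
            case ';': tok->tokenval = SEMI;   break;
            case '+': tok->tokenval = PLUS;   break;
            default:  tok->tokenval = OTHER;  break;  // symbol not handled here
        }
    }
    return tok;
}

// Hypothetical wrapper: scans a whole string and collects the token spellings.
std::vector<std::string> scanAll(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> out;
    while (Token *t = GetToken(in)) {
        out.push_back(t->tokenchars);
        delete t;
    }
    return out;
}
```

The sketch treats every single-character symbol as its own token, so ++ comes back as two PLUS tokens; a fuller scanner would look ahead one character to recognize multi-character operators.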
Result of GetToken
Input: for (int i = 0 ; i < 100 ; i++){
First call returns TokenType: FOR
Second call returns TokenType: LPAREN
Tokens and Languages
• The set of valid tokens of a particular
type is a Language (in the formal sense)
• More specifically, it is a Regular
Language
Language Formalities
• Language: set of strings
• String: sequence of symbols
• Alphabet: set of legal symbols for
strings
– Generally Σ is used to denote an alphabet
Example Languages
• L1 = {aa, ab, bb}, Σ = {a, b}
• L2 = {ε, ab, abab, … }, Σ = {a, b}
• L3 = {strings of N a’s where N is an odd
integer}, Σ = {a}
• L4 = { ε } (one string with no symbols)
• L5 = { } (no strings at all)
• L5 = Ø
Denoting Languages
• Expressions (regular languages only)
• Grammars
– Set of rewrite rules that express all and
only the strings in the language
• Automata
– Machines that “accept” all and only the
strings in the language
Primitive Regular Expressions
• Ø
– L(Ø) = {} (no strings)
• ε
– L(ε) = {ε} (one string, no symbols)
• a, where a is a member of Σ
– L(a) = {a} (one string, one symbol)
Combining Regular Expressions
• Choice: r | s (sometimes r+s)
– L(r | s) = L(r) ∪ L(s)
• Concatenation: rs
– L(rs) = L(r)L(s)
– All combinations of 1 from r and 1 from s
• Repetition: r*
– L(r*) = {ε} ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ …
– 0 or more strings from r concatenated
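The three operations can be tried out on small finite languages stored as sets of strings. This is an illustrative sketch: Language, choice, concat, and starUpTo are hypothetical names, and because L(r*) is infinite in general, starUpTo builds only the strings made from a bounded number of pieces.

```cpp
#include <set>
#include <string>

// A finite language: a set of strings. The empty string "" plays the role of ε.
using Language = std::set<std::string>;

// Choice: L(r | s) = L(r) ∪ L(s)
Language choice(const Language &r, const Language &s) {
    Language out = r;
    out.insert(s.begin(), s.end());
    return out;
}

// Concatenation: every string of r followed by every string of s.
Language concat(const Language &r, const Language &s) {
    Language out;
    for (const std::string &x : r)
        for (const std::string &y : s)
            out.insert(x + y);
    return out;
}

// Truncated repetition: the strings of r* built from at most maxReps pieces.
Language starUpTo(const Language &r, int maxReps) {
    Language out;
    out.insert(std::string());    // zero repetitions give ε
    Language power;
    power.insert(std::string());  // strings built from exactly i pieces
    for (int i = 0; i < maxReps; ++i) {
        power = concat(power, r);
        out.insert(power.begin(), power.end());
    }
    return out;
}
```

For example, starUpTo applied to {ab} with maxReps = 2 yields {ε, ab, abab}, and concatenating any language with the empty language yields the empty language.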
Precedence
• Repetition before concatenation
• Concatenation before choice
• Use parentheses to override
• aa* vs. (aa)*
• ab|c vs. a(b|c)
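The precedence rules can be checked with std::regex, whose default (ECMAScript) grammar orders repetition, concatenation, and choice in the same way. A small sketch; inLanguage is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// True when the whole string s is in the language of the regular expression re.
bool inLanguage(const std::string &re, const std::string &s) {
    return std::regex_match(s, std::regex(re));
}
```

With this helper, inLanguage("aa*", "aaa") is true but inLanguage("(aa)*", "aaa") is false, since (aa)* accepts only even numbers of a's. Likewise "ab|c" accepts "c" but not "ac", while "a(b|c)" accepts "ac" but not "c".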
Example Languages
• L1 = {aa, ab, bb}, Σ = {a, b}
• L2 = {ε, ab, abab, … }, Σ = {a, b}
• L3 = {strings of N a’s where N is an odd
integer}, Σ = {a}
• L4 = { ε } (one string with no symbols)
• L5 = { } (no strings at all)
• L5 = Ø
R.E.’s for Examples
• L1 = aa | ab | bb
• L1 = a(a|b) | bb
• L1 = aa | (a|b)b
• L2 = (ab)*, not ab*!
• L3 = a(aa)*
What are these languages?
• a* | b* | c*
• a*b*c*
• (a*b*)*
• a(a|b)*c
• (a|b|c)*bab(a|b|c)*
What are the RE’s?
• In the alphabet {a,b,c}:
– All strings that are in alphabetical order
– All strings that have the first a before the
first b, before the first c, e.g. ababbabca
– All strings that contain “abc”
– All strings that do not contain “abc”
Extended Reg. Exp’s
• Additional operations for convenience
r+ = rr* (one or more reps)
. (any character in the alphabet)
.* = any possible string from the alphabet
[a-z] = a|b|c|…|z
[^aeiou] = b|c|d|f|g|h|j...
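The equivalences above can be spot-checked by comparing two patterns on sample strings. This is a sketch using std::regex; agreeOn is a hypothetical helper, and agreement on a handful of samples is evidence, not a proof, that two expressions denote the same language. Note that in std::regex the dot matches any character except a line terminator, and a negated class such as [^ab] ranges over all characters, not just a chosen alphabet.

```cpp
#include <regex>
#include <string>
#include <vector>

// True if both patterns accept, or both reject, every sample string (whole-string match).
bool agreeOn(const std::string &p1, const std::string &p2,
             const std::vector<std::string> &samples) {
    const std::regex r1(p1), r2(p2);
    for (const std::string &s : samples) {
        if (std::regex_match(s, r1) != std::regex_match(s, r2))
            return false;  // s is in one language but not the other
    }
    return true;
}
```

For instance, "a+" and "aa*" agree on "", "a", "aaa", "b", and "ab", while "ab*" and "(ab)*" disagree on "abab". Over the alphabet {a, b, c}, "[^ab]" agrees with "c", but the two differ on "d", which lies outside that alphabet.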