Scanning & FLEX
CPSC 388
Ellen Walker
Hiram College
Scanning (review)
• Input: characters from the source code
• Output: Tokens
– Keywords: IF, THEN, ELSE, FOR …
– Symbols: PLUS, LBRACE, SEMI …
– Variable tokens: ID, NUM
• Augment with string or numeric value
Token Class (partial)
class Token {
public:
  TokenType tokenval;
  string tokenchars;
  double numval;
};
GetToken(): A scanning function
• Token *getToken(istream &sin)
– Read characters from sin until a complete
token is extracted, return the token
– Usually called by the parser
– Note: version in the book uses global
variables and returns only the token type
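A minimal sketch of such a function, under stated assumptions: the TokenType enum below is hypothetical and covers only ID, NUM, PLUS and SEMI, identifiers are letters only, and the function returns a std::unique_ptr<Token> instead of the raw pointer in the signature above so that nothing has to be freed by hand. The scanAll helper is likewise an illustration, not part of the course code.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token set; the slides do not list the full enum.
enum class TokenType { ID, NUM, PLUS, SEMI, ERROR };

struct Token {
    TokenType tokenval;
    std::string tokenchars;  // the characters that make up the token
    double numval;           // set only for NUM tokens
};

// Reads one token from sin; returns nullptr once the input is exhausted.
std::unique_ptr<Token> getToken(std::istream &sin) {
    while (std::isspace(sin.peek())) sin.get();  // skip white space
    const int c = sin.get();
    if (c == std::istream::traits_type::eof()) return nullptr;

    std::unique_ptr<Token> tok(new Token());
    tok->tokenchars = static_cast<char>(c);
    if (std::isalpha(c)) {
        // Keep reading while the next character still belongs to the token.
        while (std::isalpha(sin.peek())) tok->tokenchars += static_cast<char>(sin.get());
        tok->tokenval = TokenType::ID;
    } else if (std::isdigit(c)) {
        while (std::isdigit(sin.peek())) tok->tokenchars += static_cast<char>(sin.get());
        tok->tokenval = TokenType::NUM;
        tok->numval = std::stod(tok->tokenchars);  // augment with the numeric value
    } else if (c == '+') {
        tok->tokenval = TokenType::PLUS;
    } else if (c == ';') {
        tok->tokenval = TokenType::SEMI;
    } else {
        tok->tokenval = TokenType::ERROR;
    }
    assert(!tok->tokenchars.empty());  // every token carries at least one character
    return tok;
}

// Helper for experimenting: scans a whole string and lists the token types.
std::vector<TokenType> scanAll(const std::string &text) {
    std::istringstream in(text);
    std::vector<TokenType> types;
    while (auto tok = getToken(in)) types.push_back(tok->tokenval);
    return types;
}
```

Calling scanAll("count + 42;") yields the types ID, PLUS, NUM, SEMI in that order; a parser would normally call getToken directly, one token at a time.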
Using GetToken (Review)
Token *myToken = getToken(cin);
while (myToken != NULL) {
  // process the token
  switch (myToken->tokenval) {
    // cases for each token type
  }
  myToken = getToken(cin);
}
Result of GetToken (Review)
• Input: for (int i = 0 ; i < 100 ; i++){
• First call returns TokenType: FOR
• Second call returns TokenType: LPAREN
Regular Expressions for Common Tokens
• Special characters: (the characters)
• Identifier: [a-zA-Z][a-zA-Z_]*
• Numbers:
– Int: [1-9][0-9]*
– Float: [1-9][0-9]*(ε|(.[0-9]*))
– Scientific: [1-9][0-9]*(ε|(.[0-9]*))(E|e)(+|-|ε)[1-9][0-9]*
– (ε denotes the empty string; the period is a literal character)
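The number patterns can be tried out with std::regex. This is a sketch that assumes the slide's "or empty" alternatives are meant as optional parts and that the period is a literal character, so it is escaped here; the function names are made up for illustration.

```cpp
#include <regex>
#include <string>

// Int: a nonzero digit followed by any digits.
bool isInt(const std::string &s) {
    static const std::regex re(R"([1-9][0-9]*)");
    return std::regex_match(s, re);
}

// Float: an Int optionally followed by a period and more digits.
bool isFloat(const std::string &s) {
    static const std::regex re(R"([1-9][0-9]*(\.[0-9]*)?)");
    return std::regex_match(s, re);
}

// Scientific: a Float, then E or e, an optional sign, and an Int exponent.
bool isScientific(const std::string &s) {
    static const std::regex re(R"([1-9][0-9]*(\.[0-9]*)?[Ee][+-]?[1-9][0-9]*)");
    return std::regex_match(s, re);
}
```

As written on the slide, the patterns reject a lone "0" and numbers with leading zeros such as "007", and they accept "3." as a Float; a real language definition would probably adjust those cases.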
Reg. Exp. For Comments
• Comment to end of line
– //[^\n]* (last part: (all chars except \n)* )
• /*…*/ comment
– ab (~b|b~a)*b?ba <--- ab … ba
– /\* (~\* | \*~/)*(\*)? \*/ <--- needs escapes!
– Does not require matching of “inner” /**/
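The escaped pattern can be checked with std::regex. This sketch transcribes the slide's expression directly, writing ~\* as [^*] and ~/ as [^/]; the function name is hypothetical.

```cpp
#include <regex>
#include <string>

// True if s, taken as a whole, matches the slide's /*...*/ pattern.
bool isBlockComment(const std::string &s) {
    // "/*", then any mix of non-star characters or a star followed by
    // a non-slash, then an optional extra star, then "*/".
    static const std::regex re(R"(/\*([^*]|\*[^/])*\*?\*/)");
    return std::regex_match(s, re);
}
```

For example, isBlockComment("/* outer /* inner */") is true, since the inner "/*" is just text. One caveat worth knowing: the pattern as transcribed also accepts "/* a **/ b */" as a single comment, because the pair "**" is consumed as a star followed by a non-slash. That is one reason scanners often fall back on a hand-written loop.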
Comments in Practice
• Often handled by “ad-hoc” methods
• Scanner simply loops to ignore
characters from /* to */
– If character is not ‘*’, ignore it
– Else if next character is not “/”, ignore it
– Else ignore the closing “*/” and return to scanning
normally
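The loop above might be written as follows. It is a sketch that assumes the opening "/*" has already been consumed; it peeks at the character after a '*' instead of reading it, so that an ending such as "**/" is still recognized. Both function names are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Assumes the opening "/*" was already consumed.
// Consumes input through the closing "*/"; returns false if input ends first.
bool skipBlockComment(std::istream &in) {
    int c;
    while ((c = in.get()) != std::istream::traits_type::eof()) {
        if (c == '*' && in.peek() == '/') {
            in.get();  // consume the '/' that closes the comment
            return true;
        }
        // Any other character is simply ignored.
    }
    return false;
}

// Helper for experimenting: returns the text that follows the comment,
// or an empty string if the comment never closes.
std::string restAfterComment(const std::string &text) {
    std::istringstream in(text);
    if (!skipBlockComment(in)) return "";
    std::string rest;
    std::getline(in, rest, '\0');  // read everything that is left
    return rest;
}
```

With the input " note */x = 1;" the helper returns "x = 1;", which is where normal scanning would resume.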
Delimiters and Ambiguity
• Comments are not totally ignored!
– “fo/**/r” is not the keyword “for” !
• Principle of longest substring (“maximal
munch”)
– “fork” is not “for” followed by “k”
• Disallow keywords as identifiers
– Scan identifier, then look it up instead of
including keywords explicitly in language
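The "scan, then look up" idea can be sketched with a table of reserved words. The enum and the keyword list here are assumptions chosen to match the token names used earlier in the slides.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token types: a few keywords plus the generic identifier.
enum class TokenType { IF, THEN, ELSE, FOR, ID };

// Called after the scanner has read a complete word (longest substring).
TokenType classifyWord(const std::string &lexeme) {
    static const std::unordered_map<std::string, TokenType> reserved = {
        {"if", TokenType::IF},
        {"then", TokenType::THEN},
        {"else", TokenType::ELSE},
        {"for", TokenType::FOR},
    };
    const auto it = reserved.find(lexeme);
    return it == reserved.end() ? TokenType::ID : it->second;
}
```

Because the whole word is scanned before the lookup, "fork" comes back as ID and only the exact word "for" comes back as FOR.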
FORTRAN’s mistakes
• Ignored white space (no delimiters)
– DO99I=1.2 (DO99I = 1.2) vs.
– DO99I=1,2 (DO 99 I = 1 , 2)
• No reserved words
– IF(IF.EQ.0)THENTHEN=17
• Result: arbitrary backtracking (or
lookahead) needed!
TINY Lexemes
• Reserved words: if, then, else, end,
repeat, until, read, write
• Symbols: +, -, *, /, =, <, (, ), ;, :=
• Other: number (integer only), identifier
(letters only)
• Comment: {…}
• Principle of longest substring holds
TINY DFA
(Figure: DFA for the TINY scanner. States: start, inum, inid, com, := and done. Edge labels: space, digit, letter, {, }, ~}, :, =, punctuation, [ !digit ], [ !letter ].)
Using the TINY DFA
• Implement DFA directly or with a table
• Each call to gettoken() starts at the
current point of the string, scans until no
transition is possible.
• If final state is reached, return the token
determined by the link to the final state.
Otherwise, report an error.
• Characters in [ ] are not consumed
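A directly coded version of such a DFA might look like the sketch below. The state and token names are assumptions, reserved-word lookup is left out, and the scanner works on a std::string with a position index instead of a stream. The consume flag models the [ ] edges: the character ends the token but stays in the input.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class State { START, INNUM, INID, INASSIGN, INCOMMENT, DONE };
enum class Kind { NUM, ID, ASSIGN, SYMBOL, ERROR, ENDFILE };

struct Token {
    Kind kind;
    std::string text;
};

// Scans one token from src starting at pos, leaving pos just past the token.
Token getToken(const std::string &src, std::size_t &pos) {
    assert(pos <= src.size());
    Token tok{Kind::ERROR, ""};
    State state = State::START;
    while (state != State::DONE) {
        const bool atEnd = pos >= src.size();
        const char c = atEnd ? '\0' : src[pos];
        const unsigned char uc = static_cast<unsigned char>(c);
        bool consume = !atEnd;  // cleared for lookahead characters (the [ ] edges)
        bool save = !atEnd;     // cleared for white space and comments
        switch (state) {
        case State::START:
            if (atEnd) { tok.kind = Kind::ENDFILE; state = State::DONE; }
            else if (std::isdigit(uc)) state = State::INNUM;
            else if (std::isalpha(uc)) state = State::INID;
            else if (c == ':') state = State::INASSIGN;
            else if (c == '{') { save = false; state = State::INCOMMENT; }
            else if (std::isspace(uc)) save = false;
            else {
                const bool known = std::string("+-*/=<();").find(c) != std::string::npos;
                tok.kind = known ? Kind::SYMBOL : Kind::ERROR;
                state = State::DONE;
            }
            break;
        case State::INNUM:
            if (atEnd || !std::isdigit(uc)) {
                consume = save = false;
                tok.kind = Kind::NUM;
                state = State::DONE;
            }
            break;
        case State::INID:
            if (atEnd || !std::isalpha(uc)) {
                consume = save = false;
                tok.kind = Kind::ID;
                state = State::DONE;
            }
            break;
        case State::INASSIGN:
            if (!atEnd && c == '=') tok.kind = Kind::ASSIGN;
            else { consume = save = false; tok.kind = Kind::ERROR; }
            state = State::DONE;
            break;
        case State::INCOMMENT:
            save = false;
            if (atEnd) { tok.kind = Kind::ENDFILE; state = State::DONE; }
            else if (c == '}') state = State::START;
            break;
        case State::DONE:
            break;
        }
        if (save) tok.text += c;
        if (consume) ++pos;
    }
    return tok;
}
```

Scanning "x := 42 {note} + y" with repeated calls gives ID, ASSIGN, NUM, SYMBOL, ID and then ENDFILE; the comment and the white space never reach the caller.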
DFA pseudocode
• state = start_state
• while (chars available) {
• last_state = state;
• state = next_state(next_char, state);
• if state == null return final(last_state);
• } return final(state);
LEX (FLEX)
• FLEX generates a scanner
automatically!
– Input: description of regular expression for
each token, optional additional code
– Output: lex.yy.c - includes function yylex()
for scanning (like getToken)
DFA Pseudocode
• state = initial-state
• while (chars in string) {
• c = next char from string
• state = next_state[state][c]
• }
• if final[state] return ACCEPT
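The table-driven loop can be sketched in C++ as follows. The automaton is an assumption picked for illustration: it accepts a string that is entirely letters (a TINY identifier) or entirely digits (a TINY number). The table is indexed by character class instead of by raw character to keep it small, and the array is called is_final because final has a special meaning in C++.

```cpp
#include <cctype>
#include <string>

enum State { START, IN_ID, IN_NUM, DEAD, NUM_STATES };
enum CharClass { LETTER, DIGIT, OTHER, NUM_CLASSES };

// Maps a character to its column in the transition table.
CharClass classOf(char c) {
    const unsigned char uc = static_cast<unsigned char>(c);
    if (std::isalpha(uc)) return LETTER;
    if (std::isdigit(uc)) return DIGIT;
    return OTHER;
}

// next_state[state][class]; DEAD is a trap state with no way out.
const State next_state[NUM_STATES][NUM_CLASSES] = {
    /* START  */ {IN_ID, IN_NUM, DEAD},
    /* IN_ID  */ {IN_ID, DEAD, DEAD},
    /* IN_NUM */ {DEAD, IN_NUM, DEAD},
    /* DEAD   */ {DEAD, DEAD, DEAD},
};

// is_final[state]: true for accepting states.
const bool is_final[NUM_STATES] = {false, true, true, false};

// Runs the DFA over the whole string and reports whether it accepts.
bool accepts(const std::string &s) {
    State state = START;
    for (char c : s) state = next_state[state][classOf(c)];
    return is_final[state];
}
```

Changing the language only means changing the two tables; the loop in accepts stays the same, which is the property a scanner generator relies on.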
Parts of a LEX file
• Definitions
– code for the top of the file, and define expressions
such as “digit”
– All code in %{ and %} directly copied
• Rules
– { expression } {code when recognized}
• Auxiliary Routines
– Define additional functions here (including main)
Predefined items
• yylex() - lex scanning routine (like
getToken) - generated by FLEX
• yytext - current string (a character array,
not a C++ string class)
• input() - get a char from flex input (named yyinput() when the scanner is compiled as C++)
• ECHO - print yytext to yyout
Example: Definitions
%{
/* add line numbers to text and print */
#include <iostream>
int lineno=1;
%}
line .*\n
%%
Example: Rules & Aux. Code
{line} {std::cout << lineno++ << " " << yytext;}
%%
int main(){
yylex();
return 0;
}
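For comparison, here is roughly what this scanner does, written by hand in plain C++. It is a sketch for intuition only, not the code flex generates, and both function names are hypothetical.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies in to out, prefixing each line with its line number.
void numberLines(std::istream &in, std::ostream &out) {
    int lineno = 1;
    std::string line;
    while (std::getline(in, line)) {
        out << lineno++ << " " << line << '\n';
    }
}

// Helper for experimenting with strings instead of streams.
std::string numbered(const std::string &text) {
    std::istringstream in(text);
    std::ostringstream out;
    numberLines(in, out);
    return out.str();
}
```

One difference: the flex rule only matches lines that end in a newline, so a final line without one falls through to flex's default rule and is echoed without a number, whereas this version numbers it and adds the newline.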
Using the Scanner
• First, create the code
– flex test.lex
• Next, compile the program
– g++ lex.yy.c -o test -lfl
• Finally, scan the input file
– ./test < input_file