Tools for building compilers
Clara Benac Earle
Tools that help build a compiler
C
– Lexical analyzer generators: Lex, flex
– Syntax analyzer generator: yacc
Java
– Lexical analyzer generators: JLex, JFlex
– Syntax analyzer generator: CUP
These tools and their documentation can be found
on the internet
Lex: Lexical Analyzer Generator
example.l -> Lex compiler -> lex.yy.c -> C compiler -> a.exe
Description
A tool for generating scanners
The scanner is described as pairs of regular
expressions and C code
Flex generates as output a C source file,
lex.yy.c, which defines a routine yylex(). This file
is compiled and linked to produce an executable
When the executable is run, it analyzes its input
for occurrences of the regular expressions.
Whenever it finds one, it executes the
corresponding C code
Format of the input file
The flex input file consists of three
sections separated by %%
Definitions
%%
Rules
%%
User Code
Skeleton of a lex specification (.l file)
%{
< C global variables, prototypes, comments >
%}
[DEFINITION SECTION]
%%
[RULES SECTION]
%%
< C auxiliary subroutines >
– %{ ... %}: this part will be embedded into *.c
– Definition section: substitutions, code and start states; will be copied into *.c
– Rules section: defines how to scan and what action to take for each token
– Auxiliary subroutines: any user code. For example, a main function to call the scanning function yylex().
The definition section
Contains name definitions and
declarations of start conditions
Name definitions have the form:
name    definition
Examples:
DIGIT   [0-9]
ID      [a-z][a-z0-9]*
The rules section
Form:
%%
<pattern>   { <action to take when matched> }
<pattern>   { <action to take when matched> }
…
%%
Patterns are specified by regular expressions
Examples:
%%
[A-Za-z]+   { printf("this is a word"); }
%%
Extended regular expressions
x        match the character "x"
.        any character except newline
[]       a character class
[xy]     match either an "x" or a "y"
[a-z]    match any letter from "a" to "z"
[^a-z]   any character but those in the class
r*       zero or more r's
r+       one or more r's
r?       zero or one r
{name}   the expansion of the name definition
Extended regular expressions
x|y      x or y
x/y      x, only if followed by y (y not removed from input)
x{m,n}   m to n occurrences of x
^x       x, but only at beginning of line
x$       x, but only at end of line
"s"      exactly what is in the quotes (except for "\" and
         following character)
A regular expression finishes with a space, tab or newline
Meta-characters
– meta-characters (do not match themselves, because they are used
as operators in the preceding regular expressions):
()[]{}<>+/,^*|.\"$?-%
– to match a meta-character, prefix with "\"
– to match a backslash, tab or newline, use
\\, \t, or \n
Regular Expression Examples
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*"."[0-9]+
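The example patterns above can be tried outside lex with a small C helper. This is only a sketch, assuming a POSIX system that provides <regex.h>; the helper name full_match is hypothetical and not part of lex. The patterns are rewritten in POSIX extended syntax: the lex quoting "." becomes \. and the anchors ^ and $ are added so that the whole string has to match.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of lex.
   Returns 1 if the POSIX extended regex `pattern` matches `s`,
   0 if it does not, and -1 if the pattern fails to compile.
   The caller anchors the pattern with ^ and $ to test the whole string. */
int full_match(const char *pattern, const char *s) {
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, full_match("^[-+]?[1-9][0-9]*$", "-12345") returns 1, while full_match("^[1-9][0-9]*$", "0") returns 0, because the integer pattern requires the first digit to be between 1 and 9. The floating point pattern is written "^[0-9]*\\.[0-9]+$" as a C string literal.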
Two Rules
1. lex will always match the longest (number of
characters) token possible.
2. If two or more possible tokens are of the same
length, then the token with the regular expression
that is defined first in the lex specification is
favored.
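The two rules can be illustrated with a hand-written C sketch. This is not the code that lex generates (lex builds a table-driven automaton); it only mimics the selection policy. The names TOK_IF, TOK_ID, match_if, match_id and pick_rule are hypothetical, and the two rules stand for a keyword rule "if" listed before an identifier rule [a-z][a-z0-9]*.

```c
#include <stddef.h>
#include <string.h>

/* Token codes for this sketch (hypothetical names). */
enum { TOK_NONE = 0, TOK_IF, TOK_ID };

/* First rule: the literal keyword "if". Returns the match length or 0. */
static size_t match_if(const char *s) {
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Second rule: an identifier, [a-z][a-z0-9]*. Returns the match length or 0. */
static size_t match_id(const char *s) {
    size_t n;
    if (s[0] < 'a' || s[0] > 'z')
        return 0;
    n = 1;
    while ((s[n] >= 'a' && s[n] <= 'z') || (s[n] >= '0' && s[n] <= '9'))
        n++;
    return n;
}

typedef struct {
    int token;
    size_t (*match)(const char *);
} Rule;

/* Rules in specification order: earlier entries win ties. */
static const Rule rules[] = {
    { TOK_IF, match_if },
    { TOK_ID, match_id },
};

/* Returns the token of the winning rule at the start of s and stores the
   match length in *len. A later rule replaces the current best only when
   its match is strictly longer, so the first rule wins on equal length. */
int pick_rule(const char *s, size_t *len) {
    int best = TOK_NONE;
    size_t i;

    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {
            *len = n;
            best = rules[i].token;
        }
    }
    return best;
}
```

On the input "if (x)" both rules match two characters, so the keyword rule wins because it is listed first. On "iffy" the identifier rule matches four characters and wins because its match is longer.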
How the input is matched
Once the match is determined, the text
corresponding to the match is made available in
the global character pointer yytext, and its length
in the global integer yyleng. The action
corresponding to the matched pattern is then
executed, and then the remaining input is
scanned for another match
Actions
Can be any arbitrary C statement
Normally they are written between {}
If the action is empty, then when the
pattern is matched the input token is
simply discarded
The action “|” means “same as the action
for the next rule”
Actions: examples
%%
[ \t\n]+   ;
":="       return ASIG;
"<"        return MINOR;
"if"       return IF;
Start conditions
A mechanism for conditionally activating rules
%s comment
%%
"/*" { BEGIN comment; }
<comment>"*/" { BEGIN 0; /* leave the comment state */ }
<comment>.|\n { }
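What the start condition achieves can be sketched by hand in C with an explicit state variable. This is an illustration only, not what lex emits; the names strip_comments, INITIAL_STATE and COMMENT_STATE are hypothetical. Text outside comments is copied to the output, which corresponds to the default lex action of echoing unmatched input.

```c
#include <stddef.h>

enum State { INITIAL_STATE, COMMENT_STATE };

/* Copies `in` to `out`, dropping C-style comments.
   `out` must have room for strlen(in) + 1 bytes. */
void strip_comments(const char *in, char *out) {
    enum State state = INITIAL_STATE;
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (state == INITIAL_STATE && in[i] == '/' && in[i + 1] == '*') {
            state = COMMENT_STATE;    /* plays the role of BEGIN comment */
            i += 2;
        } else if (state == COMMENT_STATE && in[i] == '*' && in[i + 1] == '/') {
            state = INITIAL_STATE;    /* plays the role of BEGIN 0 */
            i += 2;
        } else if (state == COMMENT_STATE) {
            i++;                      /* empty action: discard the character */
        } else {
            out[o++] = in[i++];       /* outside comments: copy through */
        }
    }
    out[o] = '\0';
}
```

Called on "a/* b */c", the function writes "ac" to the output buffer. An unterminated comment simply discards the rest of the input.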
Special Functions
yytext
– where text matched most recently is stored
yyleng
– number of characters in text most recently matched
yylval
– associated value of current token
yymore()
– append next string matched to current contents of yytext
yyless(n)
– remove from yytext all but the first n characters
unput(c)
– return character c to input stream
yywrap()
– may be replaced by user
– The yywrap method is called by the lexical analyzer
whenever it inputs an EOF as the first character when trying
to match a regular expression
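The idea behind unput(c) can be sketched with a small pushback buffer in C. This is an assumption-laden illustration, not the lex implementation: the type Input and the functions input_next and input_unput are hypothetical names, and the pushback capacity of 16 characters is arbitrary.

```c
#include <stddef.h>

/* Hypothetical input source with pushback, for illustration only. */
typedef struct {
    const char *src;     /* remaining unread input */
    char pushed[16];     /* characters given back by the caller */
    size_t npushed;      /* number of characters currently pushed back */
} Input;

/* Returns the next character, or -1 at end of input.
   Pushed-back characters are read before the rest of the source,
   most recently pushed first. */
int input_next(Input *in) {
    if (in->npushed > 0)
        return (unsigned char)in->pushed[--in->npushed];
    if (*in->src == '\0')
        return -1;
    return (unsigned char)*in->src++;
}

/* Returns c to the input so that it is the next character read.
   Returns 1 on success, 0 if the pushback buffer is full. */
int input_unput(Input *in, char c) {
    if (in->npushed == sizeof in->pushed)
        return 0;
    in->pushed[in->npushed++] = c;
    return 1;
}
```

After reading 'b' from the source "bc" and then calling input_unput with 'x', the next two reads return 'x' and then 'c'.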
Let us run a lex program