Transcript Perl

Regular Expressions
A regular expression (or pattern) in Perl is a template
that either matches or does not match a given string.
Regular Expressions in Perl:
if( $str =~ /hello/ ){ #matched against $str
…
}
while( <STDIN> ){
if( /hello/ ){ #matched against $_, the current input line
…
}
}
@words = split /\s+/, $str;
Regular Expressions (3)
“.” matches any character except a newline (\n)
/hello.you/ matches any string that has ‘hello’, followed
by any one (exactly one) character, followed by ‘you’.
/to*ols/ the character before ‘*’ may be repeated zero
or more times. Matches ‘tools’, ‘tooooools’, ‘tols’ (but not
‘toxols’).
/to+ols/ the character before ‘+’ may be repeated one or more
times. Matches ‘tools’, ‘tooooools’ (but not ‘tols’).
/to.*ols/ matches ‘to’, followed by any string that contains no
newline (possibly empty), followed by ‘ols’.
/to?ols/ the character before ‘?’ is optional. Thus, there
are only two matching strings – ‘tools’ and ‘tols’.
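The quantifier rules above can be checked with a short script. This is only a sketch: the helper name matching and the word list are made up for illustration, and each pattern is wrapped in ^ and $ so that the whole word has to match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the words that the pattern matches in full.
sub matching {
    my ($re, @words) = @_;
    return grep { /^(?:$re)$/ } @words;
}

my @words = qw(tols tools tooooools toxols);

for my $re ('to*ols', 'to+ols', 'to?ols', 'to.*ols') {
    print "/$re/ matches: ", join(' ', matching($re, @words)), "\n";
}
# /to*ols/ matches: tols tools tooooools
# /to+ols/ matches: tools tooooools
# /to?ols/ matches: tols tools
# /to.*ols/ matches: tools tooooools toxols
```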
Regular Expressions (4)
Grouping – parentheses ‘( )’ are used for grouping one
or more characters.
/(tools)+/ matches “toolstoolstoolstools”.
Alternatives:
/hello (world|Perl)/ - matches “hello world”, “hello
Perl”.
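A small sketch of grouping and alternatives in use. The helper names is_repeated_tools and greeted are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string is "tools" repeated one or more times.
sub is_repeated_tools { return $_[0] =~ /^(tools)+$/ ? 1 : 0 }

# Hypothetical helper: returns the alternative found after "hello ", or undef.
sub greeted {
    my ($str) = @_;
    return $str =~ /hello (world|Perl)/ ? $1 : undef;
}

print is_repeated_tools("toolstoolstools"), "\n";   # prints 1
print is_repeated_tools("toolstool"), "\n";         # prints 0
print greeted("hello Perl"), "\n";                  # prints Perl
```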
Regular Expressions (5)
Character Class
/Hello [abcde]/ matches “Hello a” or “Hello b” …
/Hello [a-e]/ the same as above
Negating:
[^abc] any character except a, b, c
Regular Expressions (6)
Shortcuts
• \d digit
• \w word character [A-Za-z0-9_]
• \s white space
Negation:
[^\d] matches a non-digit
\D anything not \d (same as [^\d])
\S anything not \s
\W anything not \w
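To see the classes and shortcuts at work, the sketch below counts how many characters of a string fall into a given class. The helper count_chars and the sample string are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the characters of $str that match the class $class.
sub count_chars {
    my ($class, $str) = @_;
    my $n = () = $str =~ /$class/g;   # list assignment in scalar context gives the match count
    return $n;
}

my $str = "Room 42, Floor 7";
print count_chars('\d', $str), "\n";        # prints 3 (4, 2, 7)
print count_chars('\s', $str), "\n";        # prints 3
print count_chars('[^\d\s]', $str), "\n";   # prints 10
```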
Regular Expressions (8)
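Anchors
^ - marks the beginning of the string
$ - marks the end of the string
The character “^” has three different meanings:
/^abc/ - “^” at the start of a pattern anchors it to the beginning of the string
/a\^bc/ - the escaped “\^” matches a literal “^” (as in “a^bc”)
/[^abc]/ - “^” as the first character of a class negates the class
/^Hello Perl/ - matches “Hello Perl, goodbye Perl”, but not “Perl Hello Perl”
/^\s*$/ - matches a blank line (empty or containing only whitespace)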
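The two anchored patterns can be tried on a few sample lines. This is a sketch only: the helper drop_blank_lines and the sample lines are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps only the lines that are not blank.
sub drop_blank_lines { return grep { !/^\s*$/ } @_ }

my @lines = ("Hello Perl, goodbye Perl", "", "   \t", "Perl Hello Perl");

my @kept   = drop_blank_lines(@lines);
my @starts = grep { /^Hello Perl/ } @lines;   # lines that begin with "Hello Perl"

print scalar(@kept), "\n";     # prints 2
print "$_\n" for @starts;      # prints Hello Perl, goodbye Perl
```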
Regular Expressions (9)
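\b - matches at either end of a word (the start or the
end of a group of \w characters)
/\bPerl\b/ - matches “Hello Perl”, “Perl”
but not “Perls” or “Perl5”. It does match “Perl++”, because
“+” is not a \w character, so a word boundary follows the “l”.
\B - the negation of \b: matches at any position that is not a word boundary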
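Because \b is defined in terms of \w characters, it is worth testing a few borderline strings. A minimal sketch, with a hypothetical helper has_word_perl:

```perl
use strict;
use warnings;

# Hypothetical helper: true if "Perl" occurs in $str as a whole word.
sub has_word_perl { return $_[0] =~ /\bPerl\b/ ? 1 : 0 }

for my $s ("Hello Perl", "Perl", "Perl++", "Perls", "Perl5") {
    printf "%-10s %s\n", $s, has_word_perl($s) ? "match" : "no match";
}
# "Hello Perl", "Perl" and "Perl++" match; "Perls" and "Perl5" do not,
# because "s" and "5" are \w characters and no boundary follows the "l".
```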
Regular Expressions (10)
Backreferences:
/(World|Perl) \1/ - matches “World World”, “Perl
Perl”.
/((hello|hi) (world|Perl))/
• \1 refers to (hello|hi) (world|Perl)
• \2 refers to (hello|hi)
• \3 refers to (world|Perl)
Groups are numbered by the position of their opening parenthesis.
$1, $2, $3 store the values of \1, \2, \3 after
a regular expression has been applied.
Examples:
$date="12 10\n10";
if($date=~ /(\d+)/){
print $1.":".$2.":".$3.":\n";
#output ($2 and $3 are empty):
#12:::
}
if($date=~ /(\d+)(\s+\1)+/){
print $1.":".$2.":".$3.":\n";
#output ($2 is "\n10", so the output spans two lines; $3 is empty):
#10:
#10::
}
$str="Hello World";
if($str=~ /((Hello|Hi) (World|Perl))/){
print $1.":".$2.":".$3.":\n";
#output:
#Hello World:Hello:World:
}
$str="Hello Perl Hi";
if($str=~ /((Hello|Hi) (World|Perl)) \1/){
print $1.":".$2.":".$3.":\n";
#no output: the pattern does not match
}
$str="Hi Perl Hi Perl";
if($str=~ /((Hi|Hello) (World|Perl)) \1/){
print $1.":".$2.":".$3.":\n";
#output:
#Hi Perl:Hi:Perl:
}
Examples
1. What is it?
/^0x[0-9a-fA-F]+$/
2. Date format: Month-Day-Year -> Year:Day:Month
$date = "12-31-1901";
$date =~ s/(\d+)-(\d+)-(\d+)/$3:$2:$1/;
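The substitution in example 2 can be wrapped in a function and tried on the sample date. The name reorder_date is hypothetical; the pattern is the one given above.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites Month-Day-Year as Year:Day:Month.
sub reorder_date {
    my ($date) = @_;
    $date =~ s/(\d+)-(\d+)-(\d+)/$3:$2:$1/;   # $1 = month, $2 = day, $3 = year
    return $date;
}

print reorder_date("12-31-1901"), "\n";   # prints 1901:31:12
```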
Examples
3. Make a pattern that matches any line of input that
has the same word repeated two (or more) times in a
row. Whitespace between words may differ.
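Example
1. /\w+/ #matches a word
2. /(\w+)/ #captures the word so it can be referred to later
3. /(\w+)\1/ #the same word two times in a row
4. /(\w+)\s+\1/ #whitespace between the words
5. /\b(\w+)\s+\1/ #\b stops “This is a test” from matching (“is” inside “This”, then “is”)
6. /\b(\w+)\s+\1\b/ #\b stops “This is the theory” from matching (“the”, then “the” inside “theory”)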
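The final pattern from step 6 can be tried on the two problem sentences and on one line that really contains a repeated word. A sketch, assuming a hypothetical helper repeated_word and an invented third sentence:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first word that is immediately repeated, or undef.
sub repeated_word {
    my ($line) = @_;
    return $line =~ /\b(\w+)\s+\1\b/ ? $1 : undef;
}

for my $line ("This is a test", "This is the theory", "Paris in the   the spring") {
    my $word = repeated_word($line);
    print "$line: ", defined $word ? "repeats '$word'" : "no repeated word", "\n";
}
# Only the third line reports a repeated word ('the').
```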
HomeWork
1) Write a regular expression that matches a time on a 24-hour
clock. For example: 0:01, 00:20, 15:00, 23:59
2) Write a regular expression that matches a floating-point
number. For example: 10, 10.0001, -0.1, +001.3456789
For both assignments write a single program that
finds these patterns in the input lines and prints
out only the matched text.
HomeWork
3) Write a CGI Perl script that extracts all http links from
a given WWW page.
Input: an http address, received from an HTML text box.
Output: a list of all http links found in <a href="link"> fields.
Input Examples:
http://www.tau.ac.il
http://www.cs.tau.ac.il
http://www.cnn.com
HomeWork (3)
Remarks:
1) You need to create two pages - (1) html page
with a text box (2) cgi script that receives the input
and formats output html file.
2) Unix command ‘wget’ downloads html files.
3) Use regular expressions. The code for
parsing should be small, 3-10 lines.
Regular Expressions
Quantifiers:
/a{3,6}/ - matches “a” repeated 3, 4, 5 or 6 times
/(abc){3,}/ - matches three or more repetitions of “abc”.
/a{3}/ - matches exactly three repetitions of “a”.
* = {0,}
+ = {1,}
? = {0,1}
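The counted quantifiers can be exercised in the same way. In this sketch the helper full_match is hypothetical; it anchors the pattern so that the whole string must match, which makes the upper and lower bounds visible.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str consists entirely of a match for $re.
sub full_match {
    my ($re, $str) = @_;
    return $str =~ /^(?:$re)$/ ? 1 : 0;
}

print full_match('a{3,6}', 'aaaa'), "\n";           # prints 1
print full_match('a{3,6}', 'aa'), "\n";             # prints 0
print full_match('(abc){3,}', 'abcabcabc'), "\n";   # prints 1
print full_match('a{3}', 'aaaa'), "\n";             # prints 0

# "*" and "{0,}" accept the same strings.
my $same = full_match('to*ols', 'tols') == full_match('to{0,}ols', 'tols');
print $same ? "same\n" : "different\n";             # prints same
```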
Negated Match
Negation: !~ is true when the pattern does not match.
if( $str =~ /hello/ ){
…
}
if( $str !~ /hello/ ){
…
}
Regular Expressions (11)
$& - what really was matched
$` - what was before the match
$' - the rest of the string after the matched pattern
$` . $& . $' - original string
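A short sketch of the three match variables. The helper around_match and the sample string are assumptions for this example; it returns the text before the match, the match itself, and the text after it.

```perl
use strict;
use warnings;

# Hypothetical helper: splits $str around the first match of $re.
# Returns an empty list if the pattern does not match.
sub around_match {
    my ($re, $str) = @_;
    return unless $str =~ /$re/;
    return ($`, $&, $');
}

my $str = "say hello to Perl";
my ($before, $matched, $after) = around_match('hello', $str);

print "[$before][$matched][$after]\n";                       # prints [say ][hello][ to Perl]
print "rebuilt\n" if $before . $matched . $after eq $str;    # prints rebuilt
```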
Regular Expressions (12)
Substitutions:
s/T/U/; #substitutes T with U (only once)
s/T/U/g; #global substitution
s/\s+/ /g; #collapses each run of whitespace into one space
s/(\w+) (\w+)/$2 $1/g; #swaps the two words of each pair
s/T/U/; #with no =~, applied to the $_ variable
$str =~ s/T/U/; #applied to $str
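The substitutions above, wrapped in small functions so their effect is easy to see. The helper names and sample strings are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per substitution shown above.
sub replace_first { my ($s) = @_; $s =~ s/T/U/;  return $s }
sub replace_all   { my ($s) = @_; $s =~ s/T/U/g; return $s }
sub collapse_ws   { my ($s) = @_; $s =~ s/\s+/ /g; return $s }
sub swap_pairs    { my ($s) = @_; $s =~ s/(\w+) (\w+)/$2 $1/g; return $s }

print replace_first("ATTGT"), "\n";                    # prints AUTGT
print replace_all("ATTGT"), "\n";                      # prints AUUGU
print collapse_ws("a   b \t c"), "\n";                 # prints a b c
print swap_pairs("hello world good morning"), "\n";    # prints world hello morning good
```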
Regular Expressions (13)
File Extension Renaming:
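my ($from, $to) = @ARGV;
@files = glob("*.$from");
foreach $file (@files){
$newfile = $file;
#the trailing $ anchors the match at the end of the name, so only the
#extension is replaced; the unanchored s/\.$from/\.$to/g would also
#rewrite ".$from" in the middle of a name
$newfile =~ s/\.$from$/.$to/;
rename($file, $newfile);
}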
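The difference between the anchored and the unanchored substitution can be shown without touching any files. In this sketch the two helpers and the file name are hypothetical, and \Q...\E is added as a precaution so that regex metacharacters in the extension are taken literally.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $file with a trailing ".$from" replaced by ".$to".
sub new_name {
    my ($file, $from, $to) = @_;
    (my $new = $file) =~ s/\.\Q$from\E$/.$to/;
    return $new;
}

# For contrast: the unanchored, global form also rewrites earlier occurrences.
sub new_name_unanchored {
    my ($file, $from, $to) = @_;
    (my $new = $file) =~ s/\.\Q$from\E/.$to/g;
    return $new;
}

print new_name("notes.txt.txt", "txt", "bak"), "\n";              # prints notes.txt.bak
print new_name_unanchored("notes.txt.txt", "txt", "bak"), "\n";   # prints notes.bak.bak
```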
Split and Join
$str="aaa bbb
ccc
dddd";
@words = split /\s+/, $str;
$str = join ':', @words;
#result is "aaa:bbb:ccc:dddd"
@words = split /\s+/, $_; #" aaa b" -> "", "aaa", "b"
@words = split; #" aaa b" -> "aaa", "b"
@words = split ' ', $_; #" aaa b" -> "aaa", "b"
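The leading empty field is easy to miss, so here is a sketch that compares the two forms of split on the same input. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: the two ways of splitting on whitespace.
sub split_regex { return split /\s+/, $_[0] }   # leading whitespace yields an empty first field
sub split_space { return split ' ', $_[0] }     # the special ' ' pattern skips leading whitespace

my @with_regex = split_regex(" aaa b");
my @with_space = split_space(" aaa b");

print scalar(@with_regex), "\n";    # prints 3
print scalar(@with_space), "\n";    # prints 2
print join(':', split_space("aaa bbb\tccc\ndddd")), "\n";   # prints aaa:bbb:ccc:dddd
```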
Grep
grep EXPR, LIST;
@results = grep /^>/, @array;
@results = grep /^>/, <FILE>;
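A minimal sketch of grep with a pattern. The sample data (lines in which headers start with ">") and the helper header_lines are assumptions for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that start with ">".
sub header_lines { return grep { /^>/ } @_ }   # block form, same result as grep /^>/, LIST

my @array = (">seq1", "ACGT", ">seq2", "TTGA");

my @results = header_lines(@array);
my $others  = grep { !/^>/ } @array;    # in scalar context grep returns a count

print join(' ', @results), "\n";   # prints >seq1 >seq2
print "$others\n";                 # prints 2
```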
Regular Expressions (2)
Regular Expressions in Unix:
grep "include .*h" *.h
"include .*h" is a regular expression
*.h is a glob
Defined/Undef
my $i;
defined $i; #false
$i=0;
defined $i; #true
my %hash; #or %hash=();
scalar(%hash); #false, hash is empty
#(defined %hash is a fatal error since Perl 5.22; test the hash itself)
$hash{"1"}="one";
exists $hash{"1"}; #true
defined $hash{"1"}; #true
undef $hash{"1"};
exists $hash{"1"}; #true: the key is still there
defined $hash{"1"}; #false: its value is now undef
delete $hash{"1"};
exists $hash{"1"}; #false: the key is gone
defined $hash{"1"}; #false
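The three states of a hash key can be followed step by step with a small script. The helper key_state is hypothetical and exists only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: describes the state of key $k in the hash referenced by $h.
sub key_state {
    my ($h, $k) = @_;
    return "missing"   unless exists $h->{$k};
    return "undefined" unless defined $h->{$k};
    return "defined";
}

my %hash = (1 => "one");
print key_state(\%hash, 1), "\n";   # prints defined
undef $hash{1};
print key_state(\%hash, 1), "\n";   # prints undefined
delete $hash{1};
print key_state(\%hash, 1), "\n";   # prints missing

my $state = %hash ? "not empty" : "empty";
print "$state\n";                   # prints empty
```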