Transcript Slide13,

Pattern Matching: Simple Patterns
Introduction
• Programmers often need to scan a file,
directory, etc. for a specific substring.
– Find all files that begin with “A”.
– Find all files that end in “txt”
• This capability is provided by a variety of
tools.
– e.g. egrep, grep, awk
• Useful to include this functionality in a
programming language.
Perl’s Pattern Matcher
• Perl has a built-in pattern matcher.
– Motivation: system administrators frequently
use regular expressions. They also use Perl.
• Syntax is borrowed from the grep utility in
Unix.
• Based on regular expressions from
computer science.
Perl’s Pattern Matcher (cont.)
• Operates over a single string.
• Contexts:
– Scalar: Returns true or false.
– List: Matching substrings returned in a list.
• The syntax is:
m dl pattern dl [modifiers]
• The slash (/) is the most common delimiter.
– With slashes, the m operator may be omitted.
• Other delimiters can be used:
m~pattern~
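The two contexts and the alternate delimiter can be sketched as follows. This is a minimal illustration, not part of the original slides: the sample strings and variable names are made up, and the g modifier (which asks for every match rather than just the first) is an assumption beyond what the slides introduce.

```perl
use strict;
use warnings;

my $line = "alpha beta gamma";

# Scalar context: the match yields true or false.
my $found = ($line =~ /beta/) ? 1 : 0;

# List context with the g modifier (an assumption, see above):
# every matching substring is returned in a list.
my @words = ($line =~ /[a-z]+/g);

# An explicit m allows another delimiter, which is handy
# when the pattern itself contains slashes.
my $path     = "/usr/local/bin";
my $is_local = ($path =~ m~/local/~) ? 1 : 0;

print "found=$found words=@words is_local=$is_local\n";
```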
Simple Patterns
• Simple patterns – match individual
characters or character classes.
• An abstract representation of a set of
strings.
• A pattern “matches” when the string it’s
compared with is in the set.
• Matching is done from left to right.
Three Categories of Characters
• Normal characters:
– Match themselves.
– Includes escape characters – e.g. \t, \cC
• Metacharacters:
– Have special meanings in patterns
– \ | ( ) [ ] { } ^ $ * + ?
• Period:
– Matches any character except newline.
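A short sketch of the three categories, using made-up sample strings. It shows a normal character and an escape matching themselves, a metacharacter escaped with a backslash so that it matches literally, and the period refusing to match a newline.

```perl
use strict;
use warnings;

# Normal characters match themselves; \t is the tab escape.
my $tab_match = ("name\tvalue" =~ /e\tv/) ? 1 : 0;

# A metacharacter must be preceded by a backslash to match literally.
my $plus_match = ("2+2" =~ /2\+2/) ? 1 : 0;

# The period matches any character except newline.
my $dot_match   = ("cat"   =~ /c.t/) ? 1 : 0;
my $dot_newline = ("c\nt"  =~ /c.t/) ? 1 : 0;

print "$tab_match $plus_match $dot_match $dot_newline\n";   # 1 1 1 0
```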
An Example
$_ = "It's snowing today.";
# With no explicit target, the pattern is matched against $_.
if (/snow/) {
    print "There was snow somewhere in $_\n";
}
else {
    print "$_ was snowless\n";
}
Character Classes
• Character classes specify collections of
characters in patterns.
• Defined by placing the set in [ ]
– e.g. /[<>=]/
• Dashes are used to specify ranges of
characters:
– /[A-Za-z]/
– /[0-7]/
– /[0-3-]/
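The classes above can be tried against a few made-up strings. Note that a dash placed last in a class, as in /[0-3-]/, stands for a literal dash rather than a range.

```perl
use strict;
use warnings;

# A class matches exactly one character drawn from the set.
my $has_relop  = ("x >= y" =~ /[<>=]/)    ? 1 : 0;   # 1: > is in the set
my $has_letter = ("2024"   =~ /[A-Za-z]/) ? 1 : 0;   # 0: digits only
my $has_octal  = ("98"     =~ /[0-7]/)    ? 1 : 0;   # 0: 9 and 8 are outside 0-7
my $has_dash   = ("-"      =~ /[0-3-]/)   ? 1 : 0;   # 1: trailing dash is literal

print "$has_relop $has_letter $has_octal $has_dash\n";
```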
Exclusion From a Class
• Characters can be excluded from a class
by placing (^) first inside the brackets.
• The class then matches any single character
except the specified ones.
• For example:
– /[^A-Za-z]/
– /[^01]/
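A brief sketch with made-up strings. One point worth seeing in code: a negated class still has to match one character, so it never matches an empty string.

```perl
use strict;
use warnings;

# /[^A-Za-z]/ succeeds only if some character is not a letter.
my $abc      = ("abc"  =~ /[^A-Za-z]/) ? 1 : 0;   # 0
my $abc_bang = ("abc!" =~ /[^A-Za-z]/) ? 1 : 0;   # 1: the !

# /[^01]/ succeeds only if some character is neither 0 nor 1.
my $binary     = ("0110" =~ /[^01]/) ? 1 : 0;     # 0
my $not_binary = ("0120" =~ /[^01]/) ? 1 : 0;     # 1: the 2

# There is no character to match in an empty string.
my $empty = ("" =~ /[^01]/) ? 1 : 0;              # 0

print "$abc $abc_bang $binary $not_binary $empty\n";
```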
Useful Abbreviations
Abbreviation   Pattern          Matches
\d             [0-9]            A digit
\D             [^0-9]           A nondigit
\w             [A-Za-z0-9_]     A word character
\W             [^A-Za-z0-9_]    A nonword character
\s             [ \r\t\n\f]      A white-space character
\S             [^ \r\t\n\f]     A non-white-space character
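The abbreviations can be checked against a made-up sample line. The g modifier, used here to collect every match into a list, is an assumption beyond the slides. Note that in Perl \w matches digits as well as letters and the underscore.

```perl
use strict;
use warnings;

my $text = "id_7 = 42\n";

my @digits     = ($text =~ /\d/g);   # 7, 4, 2
my @word_chars = ($text =~ /\w/g);   # i, d, _, 7, 4, 2
my @spaces     = ($text =~ /\s/g);   # two blanks and the newline
my @non_word   = ($text =~ /\W/g);   # two blanks, =, and the newline

print "digits: @digits\n";
print "word characters: ", scalar(@word_chars), "\n";
print "white-space characters: ", scalar(@spaces), "\n";
print "nonword characters: ", scalar(@non_word), "\n";
```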
Some Examples
• /[A-Z]"\s/
• /[\dA-Fa-f]/
• /\w\w:\d\d/
• /0x\d/
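What each of these patterns accepts can be sketched with made-up test strings. The first pattern is read here as an uppercase letter, then a plain double quote, then a white-space character; that reading of the quote character is an assumption.

```perl
use strict;
use warnings;

# Uppercase letter, double quote, white space: the P" before "now".
my $quote_end = ('He said "STOP" now' =~ /[A-Z]"\s/) ? 1 : 0;   # 1

# A single hexadecimal digit anywhere in the string.
my $hex_digit = ("zzz" =~ /[\dA-Fa-f]/) ? 1 : 0;                # 0

# Two word characters, a colon, two digits: 12:30.
my $clock = ("at 12:30" =~ /\w\w:\d\d/) ? 1 : 0;                # 1

# The characters 0x followed by a decimal digit.
my $hex_prefix = ("0x1F" =~ /0x\d/) ? 1 : 0;                    # 1

print "$quote_end $hex_digit $clock $hex_prefix\n";
```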
Variables in Patterns
• A variable in a pattern is interpolated.
• For example,
# In a double-quoted string every backslash must be doubled,
# including the one in \d.
$hexpat = "\\s[\\dA-Fa-f]\\s";
if (/$hexpat/) {
    print "$_ has a hex digit.\n";
}
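Interpolation can be sketched with three ways of holding the same pattern. The single-quoted string and the qr// operator are not in the slides; they are shown as common alternatives that avoid doubling the backslashes. The variable names and sample strings are made up.

```perl
use strict;
use warnings;

my $double   = "\\s[\\dA-Fa-f]\\s";   # double quotes: each backslash doubled
my $single   = '\s[\dA-Fa-f]\s';      # single quotes: backslashes kept as typed
my $compiled = qr/\s[\dA-Fa-f]\s/;    # qr// builds a pattern object

# Both strings hold exactly the same pattern text.
my $same_text = ($double eq $single) ? 1 : 0;

my $line = "value 7 here";
my $by_double   = ($line =~ /$double/)   ? 1 : 0;
my $by_single   = ($line =~ /$single/)   ? 1 : 0;
my $by_compiled = ($line =~ /$compiled/) ? 1 : 0;

# No hex digit stands alone between white space here.
my $no_hex = ("value x here" =~ /$double/) ? 1 : 0;

print "$same_text $by_double $by_single $by_compiled $no_hex\n";   # 1 1 1 1 0
```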
Quantifiers
• Quantifiers make patterns more powerful.
• A quantifier allows a pattern to be repeated
a specified number of times.
• Perl has four kinds of quantifier:
– *, +, ?, {m,n}
• A quantifier immediately follows the pattern
it quantifies.
{m,n}
• {n} – exactly n repetitions.
• {m,} – at least m repetitions.
• {m,n} – at least m, but not more than n
repetitions.
{m,n} Examples
• /a{1,3}b/ - ab, aab, aaab
• /ab{3}c/ - abbbc
• /ab{2,}c/ - abbc, abbbc, abbbbc, …
• /c{3} z{5}/ - ccc zzzzz
• /[abc]{1,2}/ - a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc
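The repetition counts can be checked by filtering made-up candidate strings. The ^ and $ anchors, which tie the pattern to the start and end of the string, are not covered in these slides; they are used here only so that each candidate must match in full.

```perl
use strict;
use warnings;

# One to three a's followed by b.
my @a_hits = grep { /^a{1,3}b$/ } qw(b ab aab aaab aaaab);

# a, then at least two b's, then c.
my @b_hits = grep { /^ab{2,}c$/ } qw(abc abbc abbbbc);

# One or two characters, each drawn from a, b, c.
my @class_hits = grep { /^[abc]{1,2}$/ } qw(a cc abc ba d);

print "@a_hits\n";       # ab aab aaab
print "@b_hits\n";       # abbc abbbbc
print "@class_hits\n";   # a cc ba
```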
Asterisk (*)
• (*) means zero or more repetitions.
• Equivalent to {0,}
• For example,
– /0\d\d*/
– /\w\w*/
– /bob.*cat/
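A small sketch of the asterisk on made-up strings. The parentheses in the last match are used to return the matched text in list context, which goes slightly beyond the slides; the result shows that .* takes as much text as it can.

```perl
use strict;
use warnings;

# /0\d\d*/ needs a 0 and at least one more digit.
my $lone_zero  = ("0"  =~ /0\d\d*/) ? 1 : 0;   # 0
my $two_digits = ("07" =~ /0\d\d*/) ? 1 : 0;   # 1

# Zero repetitions are allowed: nothing between bob and cat.
my $bobcat = ("bobcat" =~ /bob.*cat/) ? 1 : 0; # 1

# .* is greedy: the match runs to the last cat, not the first.
my $sentence = "bob has a cat and another cat";
my ($matched) = $sentence =~ /(bob.*cat)/;

print "$lone_zero $two_digits $bobcat\n";
print "$matched\n";
```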
Plus (+)
• (+) means one or more repetitions.
• Equivalent to {1,}
• For example,
– /\w+/
– /[A-Za-z][A-Za-z\d_]+/
– /\d+\.\d+/
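The plus can be tried on made-up candidates. The anchors ^ and $ in the identifier check are an assumption beyond the slides, used so that the whole string must match.

```perl
use strict;
use warnings;

# Digits, a literal period, digits: both sides need at least one digit.
my @decimals = grep { /\d+\.\d+/ } ("3.14", "3.", ".5", "v10.25b");

# A letter followed by one or more letters, digits, or underscores,
# so a one-character name does not qualify.
my @names = grep { /^[A-Za-z][A-Za-z\d_]+$/ } qw(x x1 sum_2 9lives);

print "@decimals\n";   # 3.14 v10.25b
print "@names\n";      # x1 sum_2
```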
Question Mark (?)
• (?) means either zero or one.
• Equivalent to {0,1}.
• For example,
– /\d+\.?/
– /\$?\d+\.\d\d/
– /"?\w+"?/
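A sketch of the optional item, with made-up strings. The enclosing parentheses return the matched text in list context, which goes slightly beyond the slides, and the quote character is assumed to be a plain double quote.

```perl
use strict;
use warnings;

# The trailing period is matched when present and skipped when absent.
my ($with_dot)    = "42." =~ /(\d+\.?)/;
my ($without_dot) = "42"  =~ /(\d+\.?)/;

# The dollar sign is optional; it is escaped because $ is a metacharacter.
my ($with_sign)    = 'pay $4.50'  =~ /(\$?\d+\.\d\d)/;
my ($without_sign) = 'cost 4.50'  =~ /(\$?\d+\.\d\d)/;

# Optional double quotes around a word.
my ($quoted) = '"hello" there' =~ /("?\w+"?)/;

print "$with_dot $without_dot $with_sign $without_sign $quoted\n";
```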
Subpatterns
• A quantifier modifies only the item
immediately before it.
– e.g. in /ball*/ only the final l is repeated.
• () can be used to group parts of patterns.
• The quantifier modifies the group.
• For example,
– /(ball)*/
– /(boo! ){3}/
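The difference between quantifying a character and quantifying a group can be sketched on made-up candidates. The ^ and $ anchors are an assumption beyond the slides, used so that each candidate must match in full.

```perl
use strict;
use warnings;

# Without parentheses, * repeats only the final l.
my @last_char = grep { /^ball*$/ } qw(ba bal ball balll ballball);

# With parentheses, * repeats the whole group, including zero times.
my @group = grep { /^(ball)*$/ } ("", "ball", "ballball", "balll");

# The group, trailing blank included, must occur three times in a row.
my $three_boos = ("boo! boo! boo! " =~ /(boo! ){3}/) ? 1 : 0;
my $two_boos   = ("boo! boo! "      =~ /(boo! ){3}/) ? 1 : 0;

print "@last_char\n";                        # bal ball balll
print scalar(@group), " group matches\n";    # 3 group matches
print "$three_boos $two_boos\n";             # 1 0
```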
Alternation
• (|) is the logical OR operator in a pattern.
• /a|e|i|o|u/ is equivalent to /[aeiou]/
• For example,
– /(Bob|Tom|Pussy|Scaredy)cat/
– /t(oo?|wo)/
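• Be careful! Alternatives are tried from left to right:
– /Tom|Tommie/ matches only the “Tom” in “Tommie”;
put the longer alternative first.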
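The ordering pitfall can be seen by looking at the text that actually matched. This is a sketch on made-up strings; the enclosing parentheses return the matched text in list context, and the anchors in the last line are an assumption beyond the slides.

```perl
use strict;
use warnings;

# Alternatives are tried left to right, and the first that succeeds wins.
my ($short_first) = "Tommie" =~ /(Tom|Tommie)/;
my ($long_first)  = "Tommie" =~ /(Tommie|Tom)/;

# Alternation inside a group, followed by a common suffix.
my @cats = grep { /(Bob|Tom)cat/ } qw(Bobcat Tomcat Wildcat);

# to, too, or two.
my @t_words = grep { /^t(oo?|wo)$/ } qw(to too two tow t);

print "$short_first $long_first\n";   # Tom Tommie
print "@cats\n";                      # Bobcat Tomcat
print "@t_words\n";                   # to too two
```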
Precedence
• The precedence of the operators, from
highest to lowest, is:
– Parentheses
– Quantifiers
– Character sequence
– Alternation
• For example,
– /#|-+/
– /(#|-)+/
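The effect of precedence on these two patterns can be sketched by capturing what each one matches in made-up strings. The outer parentheses are added only to return the matched text; they do not change how the pattern inside them is grouped.

```perl
use strict;
use warnings;

# /#|-+/ : the + binds to the dash only, so this is
# "one #" or "one or more dashes".
my ($loose_hash) = "##--" =~ /(#|-+)/;
my ($loose_dash) = "--##" =~ /(#|-+)/;

# /(#|-)+/ : the + applies to the group, so this is
# "one or more characters, each a # or a dash".
my ($grouped) = "##--" =~ /((#|-)+)/;

print "$loose_hash $loose_dash $grouped\n";   # # -- ##--
```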