Perl and regexp's

Plus more examples
Documentation
Typically, the full Perl documentation is installed along with
Perl itself. However, if all else fails:
http://perldoc.perl.org/
man perl
Question – pause on <STDIN>
open(WRITE_TO_FILEhehe, ">output");
$i = 0;
while ($i < 5) {    # stay in this loop as long as $i is less than 5
    $line = <STDIN>;
    chomp($line);
    print STDOUT "You entered: $line\n";
    print WRITE_TO_FILEhehe "$line";    # Hmmmmm? Not printing newlines to file. Any reason why?
    $i = $i + 1;    # increment $i for while loop
}
.
.
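One likely answer to the question above: chomp removes the trailing newline from $line, and only the print to STDOUT puts a "\n" back. A minimal sketch of a fix follows. It assumes the input arrives as a list of lines and writes to an in-memory filehandle instead of the slide's ">output" file; the helper name write_lines is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: writes each line to the given filehandle,
# adding back the newline that chomp removed.
sub write_lines {
    my ($fh, @lines) = @_;
    for my $line (@lines) {
        chomp($line);             # strips the trailing "\n", if any
        print {$fh} "$line\n";    # so the newline must be added back explicitly
    }
    return;
}

# In-memory filehandle, used here instead of a real file (assumption).
my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
write_lines($out, "first\n", "second\n");
close($out);
print $buffer;
```

The same one-character change, "$line\n" instead of "$line", applies to the print WRITE_TO_FILEhehe line in the slide.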
Perl Usage Example –
FIV (feline immunodeficiency virus) integration
Perl for Analysis in FIV
• Cells are transfected – virus acts
– FIV potential genetic engineering vector/agent
• Extract DNA
• Primers and high-throughput (96 well ABI)
sequencing to read sequence
• BLAST to find genome locality
• Examine sequence manually
– initial pilot study of ~ 250 sequences (after about 60 it
was clear there had to be a better way)
– Several thousand subsequently done
Pattern Matching
m/PATTERN/MODIFIERS
or
/PATTERN/MODIFIERS
Note -- the "m" is optional if you are using "/"
Modifiers: g -- match globally (all occurrences)
i -- case insensitive
m -- treat string as multiple lines
o -- compile pattern once
s -- treat string as single line
x -- use extended regular expressions
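A short sketch of the two most commonly used modifiers, g and i, may help here. The sequence is made up for illustration and is not from the slides.

```perl
use strict;
use warnings;

my $seq = "atgCCatgGGATG";    # made-up sequence for illustration

# //i : case-insensitive, so "atg" and "ATG" both count as a match
my $found = ($seq =~ /ATG/i) ? 1 : 0;

# //g in list context returns every match, not just the first
my @all   = $seq =~ /ATG/gi;
my $count = scalar @all;

# without //i only the upper-case occurrence matches
my @upper_only = $seq =~ /ATG/g;

print "found=$found count=$count upper_only=", scalar(@upper_only), "\n";
```

Modifiers can be combined in any order, so /ATG/gi and /ATG/ig mean the same thing.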
s modifier (//s): Treat string as a single long line. '.'
matches any character, even "\n". "^" matches
only at the beginning of the string and "$"
matches only at the end or before a newline at
the end.
m modifier (//m): Treat string as a set of multiple
lines. '.' matches any character except "\n". "^"
and "$" are able to match at the start or end of
any line within the string.
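The difference between //s and //m is easier to see on a string that contains several lines. The following is a small sketch with made-up data; each result is 1 for a match and 0 for no match.

```perl
use strict;
use warnings;

my $text = "ATG\nCCC\nTAA\n";    # three lines in one string (made-up data)

# No modifier: '.' does not match "\n", so a match cannot span lines
my $plain = ($text =~ /ATG.CCC/) ? 1 : 0;

# //s: '.' also matches "\n"
my $single = ($text =~ /ATG.CCC/s) ? 1 : 0;

# No modifier: '^' matches only at the very start of the string
my $start_only = ($text =~ /^CCC/) ? 1 : 0;

# //m: '^' and '$' match at the start and end of every line
my $any_line = ($text =~ /^CCC$/m) ? 1 : 0;

print "$plain $single $start_only $any_line\n";
```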
//o modifier
There are a few more things you might want to know about matching operators. First, we pointed out
earlier that variables in regexps are substituted before the regexp is evaluated:
$pattern = 'Seuss';
while (<STDIN>) {
print if /$pattern/;
}
This will print any lines containing the word "Seuss". It is not as
efficient as it could be, however, because perl has to re-evaluate
$pattern each time through the loop. If $pattern won't be changing
over the lifetime of the script, we can add the "//o" modifier, which
directs perl to only perform variable substitutions once:
#!/usr/bin/perl
# Improved simple_grep
$regexp="Seuss";
while (<STDIN>) {
print if /$regexp/o; # a good deal faster
}
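One consequence of //o is worth seeing in code: because the variable is substituted only once, later changes to it are silently ignored. The sketch below uses made-up input lines and a hypothetical helper name. As far as I recall, the current perlop documentation advises against //o in most cases, since Perl already avoids recompiling a pattern whose interpolated variables have not changed, so treat the speed claim above as specific to older Perl versions.

```perl
use strict;
use warnings;

my @lines = ("Dr. Seuss\n", "The Grinch\n");    # made-up input lines

# Hypothetical helper: tries two patterns in turn through ONE match operator.
sub grep_with_o {
    my @hits;
    for my $pattern ('Seuss', 'Grinch') {
        for my $line (@lines) {
            # //o: $pattern is interpolated the first time only,
            # so the regexp stays /Seuss/ even after $pattern changes
            push @hits, $line if $line =~ /$pattern/o;
        }
    }
    return @hits;
}

print grep_with_o();
```

Both lines printed are "Dr. Seuss": the second pattern, 'Grinch', is never used.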
//x
A regexp that matches numbers:
/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da!
Long regexps like this may impress your friends, but can be hard to decipher. In complex situations like this, the "//x"
modifier for a match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp
without affecting its meaning. Using it, we can rewrite our 'extended' regexp in the more pleasing form
/^
   [+-]?               # first, match an optional sign
   (                   # then match integers or f.p. mantissas:
       \d+\.\d+        # mantissa of the form a.b
      |\d+\.           # mantissa of the form a.
      |\.\d+           # mantissa of the form .b
      |\d+             # integer of the form a
   )
   ([eE][+-]?\d+)?     # finally, optionally match an exponent
$/x;
If whitespace is mostly irrelevant, how does one include space characters in an extended regexp? The answer is to backslash the space ('\ ') or put
it in a character class ('[ ]').
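To check that the extended form still behaves like the one-line version, it can be tried against a few strings. This is a sketch only: the helper name is_number and the test strings are made up, and the pattern is stored with qr//x so it can be reused.

```perl
use strict;
use warnings;

# The number regexp from above, stored with qr//x so it can be reused.
my $number = qr/^
    [+-]?                # optional sign
    (
        \d+\.\d+         # a.b
       |\d+\.            # a.
       |\.\d+            # .b
       |\d+              # a
    )
    ([eE][+-]?\d+)?      # optional exponent
$/x;

# Hypothetical helper: true if the string looks like a number.
sub is_number {
    my ($str) = @_;
    return $str =~ $number ? 1 : 0;
}

for my $candidate ('42', '-3.14', '.5', '6.02e23', '1e', 'abc') {
    printf "%-8s %s\n", $candidate,
        is_number($candidate) ? 'number' : 'not a number';
}
```

The first four strings are reported as numbers; '1e' and 'abc' are not.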
Regular Expressions – matching with m//
Grouping
m/ATG+/ matches ATG, ATGG, ATGGGGGGGGGGG, etc.
(the + applies only to the G)
Parentheses are used to group
/(ATG)+/ matches ATG, ATGATGATG, etc.
/(ATG)*/ matches rtCCCf95, etc. (it will match any string,
because zero repetitions of ATG are allowed)
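The three cases above can be tried directly. The strings below are made up for illustration; an extra outer pair of parentheses is added so the whole matched text can be printed.

```perl
use strict;
use warnings;

# Without a group, + applies only to the G
my ($run_of_g) = "ccATGGGGtt" =~ /(ATG+)/;

# With a group, + applies to the whole ATG; the outer () captures the full run
my ($repeats) = "ccATGATGATGtt" =~ /((ATG)+)/;

# (ATG)* can match zero times, so it succeeds on any string at all
my $matches_anything = ("rtCCCf95" =~ /(ATG)*/) ? 1 : 0;

print "$run_of_g $repeats $matches_anything\n";
```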
Alternation and Grouping
(5.8 Perl Docs)
Grouping things and hierarchical matching
Alternation allows a regexp to choose among alternatives, but by itself
it is unsatisfying. The reason is that each alternative is a whole regexp, but sometimes we want alternatives for just part of a regexp. For
instance, suppose we want to search for housecats or housekeepers. The
regexp "housecat|housekeeper" fits the bill, but is inefficient because
we had to type "house" twice. It would be nice to have parts of the
regexp be constant, like "house", and some parts have alternatives,
like "cat|keeper".
The grouping metacharacters "()" solve this problem. Grouping allows
parts of a regexp to be treated as a single unit. Parts of a regexp
are grouped by enclosing them in parentheses. Thus we could solve the
"housecat|housekeeper" by forming the regexp as "house(cat|keeper)".
The regexp "house(cat|keeper)" means match "house" followed by either
"cat" or "keeper". Some more examples are
Examples
/(a|b)b/; # matches 'ab' or 'bb'
/(ac|b)b/; # matches 'acb' or 'bb'
/(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
/(a|[bc])d/; # matches 'ad', 'bd', or 'cd'
/house(cat|)/; # matches either 'housecat' or 'house'
/house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
# 'house'. Note groups can be nested.
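The nested example can be exercised with a few words to see which alternative wins. This is a sketch under assumptions: the helper name house_match and the word list are made up, and an outer pair of parentheses is added to capture the whole match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by the nested-group regexp,
# or undef when there is no match.
sub house_match {
    my ($text) = @_;
    return $text =~ /(house(cat(s|)|))/ ? $1 : undef;
}

for my $word ('housecats', 'housecat', 'housekeeper', 'mouse') {
    my $hit = house_match($word);
    print "$word: ", defined $hit ? "matched '$hit'" : "no match", "\n";
}
```

Alternatives are tried left to right, so 'housecats' matches in full, while 'housekeeper' falls through to the empty alternative and matches just 'house'.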
Alternation
(|) vertical bar, aka "or" – means alternation
/at(C|G)+cc/
matches atCcc, atCGCCcc, atGGGcc, etc
/at(C+|G+)cc/
matches atCCCCcc, atGcc (but NOT
atCGcc)
Regular Expressions
character class – list of possible characters in [ ]
Examples
[abcwxyz] matches any single one of these 7 characters
[a-cw-z] matches the same thing
[a-zA-Z] matches any letter
$_ = "ATG 12133";
if(/[ATGC]+ [0-9]+/) { #matches any length sequence,
print "found number\n"; # any number, separated by 1 space
}
Match any character except those in the class – (^) caret
[^ATGC\-] will match any single character EXCEPT A, T, G, C and -
NOTE -- inside [ ] this is not the beginning-of-line ^ anchor
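A negated class is handy for spotting characters that should not be in a sequence. The read below is made up for illustration.

```perl
use strict;
use warnings;

my $read = "ATG-CCNTA";    # made-up read containing one ambiguous base, N

# [^ATGC\-] matches one character that is NOT A, T, G, C or '-'
my @unexpected = $read =~ /([^ATGC\-])/g;

# A range and an explicit list can describe the same class
my $same = ("x" =~ /[abcwxyz]/ && "x" =~ /[a-cw-z]/) ? 1 : 0;

print "unexpected: @unexpected, same: $same\n";
```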
Examples
o "a?" = match 'a' 1 or 0 times
o "a*" = match 'a' 0 or more times, i.e., any number of times
o "a+" = match 'a' 1 or more times, i.e., at least once
o "a{n,m}" = match at least "n" times, but not more than "m" times
o "a{n,}" = match at least "n" times
o "a{n}" = match exactly "n" times
Here are some examples:
/[a-z]+\s+\d*/;   # match a lowercase word, at least some space, and
                  # any number of digits
/(\w+)\s+\1/;     # match doubled words of arbitrary length -- "BACKREFERENCE"
/y(es)?/i;        # matches 'y', 'Y', or a case-insensitive 'yes'
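The quantifiers and the backreference above can be tried on a few made-up strings. In this sketch the ^ and $ anchors in the last pattern are an addition, so that the whole answer must be 'y' or 'yes'.

```perl
use strict;
use warnings;

# {n,m} is greedy: it takes as many as allowed, here 5 of the 8 T's
my ($tail) = "GCTTTTTTTTGA" =~ /(T{3,5})/;

# \1 refers back to whatever the first group matched
my ($doubled) = "Paris in the the spring" =~ /(\w+)\s+\1/;

# ? makes the group optional; ^ and $ are added here so that
# the whole answer must be 'y' or 'yes'
my @yes = grep { /^y(es)?$/i } ('y', 'YES', 'no', 'Yes');

print "$tail | $doubled | @yes\n";
```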
Regular Expressions
Character class shortcuts – abbreviations for character classes
[0-9]           \d   digit
[A-Za-z0-9_]    \w   word
[\f\t\n\r ]     \s   whitespace
Negation
[^\d] == \D     non-digit
[^\w] == \W     non-word
[^\s] == \S     non-whitespace
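A brief sketch of the shortcuts in use, applied to the "ATG 12133" sample string that appears earlier in the slides:

```perl
use strict;
use warnings;

my $line = "ATG 12133";    # sample string from the character class slide

# \w+ for the word, \s+ for the gap, \d+ for the number
my ($word, $number) = $line =~ /^(\w+)\s+(\d+)$/;

# \D is the negation of \d: every character that is not a digit
my @non_digits = $line =~ /(\D)/g;

print "word=$word number=$number non_digits=", scalar(@non_digits), "\n";
```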
Memory Parentheses
Grouping by parentheses stores a match to numbered automatic
variables.
$_ = 'abcdefghijklmnop';
if (/(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)/) {
    print "$1 $2 $3 $4 $5 $6";    # prints: a b c d e f
    $keep = $7;                   # $keep now holds 'g'
}
$_ = "GATGCCCT";
if (/GAT(G)(C)CCT/) {    # G stored in $1, C stored in $2
    print "$1 $2\n";
}
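Because $1, $2, and so on are overwritten by the next successful match, it is common to copy them into named variables straight away. A minimal sketch, assuming a made-up "SEQUENCE NUMBER" record format and a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: splits "SEQUENCE NUMBER" into its two parts.
# Returns an empty list when the line does not have that shape.
sub split_record {
    my ($line) = @_;
    local $_ = $line;
    if (/^([ATGC]+) ([0-9]+)$/) {
        return ($1, $2);    # copy $1 and $2 before another match overwrites them
    }
    return;
}

my ($seq, $count) = split_record("GATGCCCT 42");
print "$seq has count $count\n";
```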
Fun with pattern matching: the "*" metacharacter
#!/usr/bin/perl
# t2.pl
#
# Here I am testing the * metacharacter.
#
# In the first case, I match 0 C's
#
# Notice how easy it is to test something in perl
#
$_ = "ATGATG";
if(/ATGC*ATG/) {
print "Found sequence\n";
}
# Second case, I match 1 C,
#
$_ = "ATGCATG";
if(/ATGC*ATG/) {
print "Found sequence\n";
}
# Third case I match LOTS of C's
#
$_ = "ATGCCCCCCCCCCCCCCCCCCATG";
if(/ATGC*ATG/) {
print "Found sequence\n";
}
Examples
#!/usr/bin/perl
# t3.pl
#
# Here I am testing alternation
#
# In the first example, it matches 1 C
#
$_ = "atCcc";
if (/at(C|G)+cc/) {
    print "1 Found sequence\n";
}
# Second example, same thing with a character class.
# Note: inside [ ] the | is a literal character, so write [CG], not [C|G]
$_ = "atCcc";
if (/at[CG]+cc/) {
    print "2 Found sequence\n";
}
# Third example -- NOT the same thing: without parentheses the | splits
# the whole pattern, so this means "at[C]+" OR "[G]+cc".
# It matches here only because the string contains "atCC".
$_ = "atCCGGGGCCCcc";
if (/at[C]+|[G]+cc/) {
    print "3 Found sequence\n";
}
# Fourth is just like 1, only with a more complex mix of C's and G's
# -- proof that it works as advertised
$_ = "atCGCCcc";
if (/at(C|G)+cc/) {
    print "4 Found sequence\n";
}
# Fifth example -- I add a little more sophistication: numbers and letters
#
$_ = "ATG 12133";
if (/[ATGC]+ [0-9]+/) {
    print "5 Found sequence and number\n";
}
Pattern matching, grouping with (), and quantifying match ranges
#!/usr/bin/perl
# t4.pl
#
# Match a poly-T tail, and some amount of additional sequence.
# In this example, exactly 5 characters.
#
# Note the 2 pairs of parentheses -- whatever each pair matches is
# automatically stored in the variables $1, $2, $3, etc.
#
$_ = "TTTTTTTTTGATGCCCT";
if (/(T+)(.{5})/) {
    print "$1 $2\n";
}
# This is an odd one -- T+ will match 1 or more T's, but because I'm
# also trying to match 5 or more of any character (.), T+ gives back
# one T and the right-hand expression matches TGATG
#
$_ = "TTTTTTTTTGATG";
if (/(T+)(.{5,})/) {
    print "$1 $2\n";
}
# This illustrates the pecking order in /x*y*/
# First, x* tries to match. On success, y* tries to match.
# If y* fails, x* picks something else, then y* tries again, etc.
#
# In this example, T+ first matches all the T's up to "GATG". .{5} then
# fails, so T+ backs off to TTTTTTTT, and .{5} matches TGATG
#
$_ = "TTTTTTTTTGATG";
if (/(T+)(.{5})/) {
    print "$1 $2 \n";
}
# Here I match at least 5, but as many as 25 (25 in this example)
#
$_ = "TTTTTTTTTGATGTAGAGAGAGATTTTTTTTCCCCCCTTTTT";
if (/(T+)(.{5,25})/) {
    print "$1 $2\n";
}
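The backtracking described in the comments above can be made visible by printing how many T's the first group ends up with. This is a sketch only; the helper name tail_and_next5 is made up, and the two strings are the ones used in t4.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the poly-T run and the 5 characters after it.
sub tail_and_next5 {
    my ($seq) = @_;
    return $seq =~ /(T+)(.{5})/ ? ($1, $2) : ();
}

for my $seq ("TTTTTTTTTGATGCCCT", "TTTTTTTTTGATG") {
    my ($tail, $next5) = tail_and_next5($seq);
    print length($tail), " T's then $next5\n";
}
```

In the first string T+ keeps all 9 T's because 5 more characters are available. In the second, only 4 characters follow the T's, so T+ gives one back and keeps 8.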
#!/usr/bin/perl -w
# bind.pl
#
# Binding operator example, where we are searching in
# a variable besides the default variable $_
#
# Also, doing it case-insensitive with //i flag on pattern match
#
# Note that m// and // are the same thing.
#
my $seq = "GGcaTGccAT";
if($seq =~ /ATG/i){
print "found the start\n";
}
################## same as
$seq = "GGcaTGccAT";    # no second "my": $seq is already declared above
if ($seq =~ m/ATG/i) {
    print "found the start\n";
}
Binding Operator =~ (NOT assignment,
does NOT work like assignment)
#!/usr/bin/perl -w
# bind2.pl
#
# Example: bind operation that uses interpolation in the
# pattern match / /, to match a pattern held in a variable
#
my $seq = "GGcaTGccAT";
my $q = "ATG";
if($seq =~ /($q)/i){
print "found the start\n";
}
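Since the pattern in bind2.pl is wrapped in parentheses, the matched text is available in $1, and Perl's built-in @- array gives the offset where the match started. A minimal sketch using the same $seq and $q values; the helper name find_motif is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive search for $query inside $seq.
# Returns the matched text and its 0-based offset, or an empty list.
sub find_motif {
    my ($seq, $query) = @_;
    if ($seq =~ /($query)/i) {
        return ($1, $-[0]);    # $-[0] is the start offset of the last match
    }
    return;
}

my ($hit, $offset) = find_motif("GGcaTGccAT", "ATG");
print "found $hit at offset $offset\n";
```

Note that the matched text keeps the case found in $seq ("aTG"), not the case of the query.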