Intermediate Perl Programming
Todd Scheetz
July 18, 2001
Review of Perl Concepts
Data Types
scalar
array
hash
Input/Output
open(FILEHANDLE, "filename") or die "Cannot open filename: $!";
$line = <FILEHANDLE>;
print "$line";
Arithmetic Operations
+, -, *, /, %
&&, ||, !
Review of Perl Concepts
Control Structures
if
if/else
if/elsif/else
foreach
for
while
Regular Expressions
General approach to the problem of pattern matching
Regular expressions (REs) are a compact way to represent a set of
possible strings without listing each alternative explicitly.
For this portion of the discussion, I will be using {} to
represent the scope of a set.
{A}
{A,AA}
{Ø} = the set containing only the empty string (Ø denotes the empty string)
Regular Expressions
In addition, [] will be used to denote possible alternatives.
[AB] = {A,B}
With just these semantics available, we can begin building
simple Regular Expressions.
[AB][AB] = {AA, AB, BA, BB}
AA[AB]BB = {AAABB,AABBB}
Regular Expressions
Additional Regular Expression components
* = 0 or more of the specified symbol
+ = 1 or more of the specified symbol
A+ = {A, AA, AAA, … }
A* = {Ø, A, AA, AAA, … }
AB* = {A, AB, ABB, ABBB, … }
[AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }
Regular Expressions
What if we want a specific number of iterations?
A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}
What if we want any character except one?
[^A] = {B}   (with the alphabet limited to A and B)
What if we want to allow any symbol?
. = {A, B}
.* = {Ø, A, B, AA, AB, BA, BB, … }
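The set notation above maps directly onto anchored Perl patterns. As a rough sketch (the helper name language_of and the length limit are illustrative assumptions, not part of the slides), the following enumerates every string over the alphabet {A, B} up to a given length and keeps those that a pattern matches in full:

```perl
use strict;
use warnings;

# Hypothetical helper: list every string over {A, B}, up to $max_len
# characters, that the whole pattern matches (hence the ^ and $ anchors).
sub language_of {
    my ($pattern, $max_len) = @_;
    my @candidates = ('');                 # start with the empty string
    my @frontier   = ('');
    for my $len (1 .. $max_len) {
        # extend every string of the previous length by one symbol
        @frontier = map { ($_ . 'A', $_ . 'B') } @frontier;
        push @candidates, @frontier;
    }
    return grep { /^(?:$pattern)$/ } @candidates;
}

print join(', ', language_of('[AB]{1,2}', 2)), "\n";   # A, B, AA, AB, BA, BB
print join(', ', language_of('AB*', 3)), "\n";         # A, AB, ABB
```

Infinite sets such as A* can only be listed up to the chosen length, which is why the helper takes a limit.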
Regular Expressions
All of these operations are available in Perl
Several “shortcuts”
Name             Definition                 Code
Whitespace       [space, tab, new-line]     \s
Word character   [a-zA-Z_0-9]               \w
Digit            [0-9]                      \d
\d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
\w+\s\w+ = {…, Hello World, … }
Pattern Matching
Perl supports built-in operations for pattern matching,
substitution, and character replacement
Pattern Matching
if($line =~ m/Rn\.\d+/) {    # \. matches a literal dot; a bare . matches any character
    ...
}
In Perl, an RE can match any part of the string rather than the whole
string. Anchors tie the match to a position:
^ - beginning of string
$ - end of string
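To make the effect of the anchors concrete, here is a small sketch that applies the same pattern with and without them. The sample line is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this example.
my $line = "ID          Rn.1234";

# Unanchored: succeeds because the pattern occurs somewhere in the string.
my $anywhere = ($line =~ m/Rn\.\d+/) ? 1 : 0;

# Anchored at the start: fails, because the string begins with "ID".
my $at_start = ($line =~ m/^Rn\.\d+/) ? 1 : 0;

# Anchored at the end: succeeds, because the label ends the string.
my $at_end = ($line =~ m/Rn\.\d+$/) ? 1 : 0;

print "anywhere=$anywhere at_start=$at_start at_end=$at_end\n";
# prints: anywhere=1 at_start=0 at_end=1
```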
Pattern Matching
Back references…
if($line =~ m/(Rn\.\d+)/) {
    $UniGene_label = $1;    # $1 holds the text matched inside the first ( )
}
Regular Expressions
$file = "my_fasta_file";
open(IN, '<', $file) or die "Cannot open $file: $!";
$line_count = 0;
while($line = <IN>) {
    # each FASTA record starts with a ">" header line
    if($line =~ m/^>/) {
        $line_count++;
    }
}
close(IN);
print "There are $line_count FASTA sequences in $file.\n";
Pattern Matching
UniGene data file
ID          Bt.1
TITLE       Cow casein kinase II alpha …
EXPRESS     ;placenta
PROTSIM     ORG=Caenorhabditis elegans; …
PROTSIM     ORG=Mus musculus; PROTGI=…
SCOUNT      2
SEQUENCE    ACC=M93665; NID=g162776; …
SEQUENCE    ACC=BF043619; NID=…
//
ID          Bt.2
TITLE       Bos taurus cyclin-dependent …
...
Pattern Matching
Let’s write a small Perl program to determine how many
clusters there are in the Bos taurus UniGene file.
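One possible way to do this, sketched under the assumption that every cluster record begins with exactly one ID line as in the data file layout shown earlier, is to count those lines. The subroutine name count_clusters and the in-memory sample are illustrative choices, not the program from the talk:

```perl
use strict;
use warnings;

# Count UniGene clusters by counting the ID line that starts each record.
# Takes an open filehandle, so it works on a real file or on a test string.
sub count_clusters {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m/^ID\s+\S+/;
    }
    return $count;
}

# Small in-memory sample in the same layout (values abbreviated).
my $sample = join "\n",
    'ID          Bt.1',
    'TITLE       Cow casein kinase II alpha ...',
    'SCOUNT      2',
    '//',
    'ID          Bt.2',
    'TITLE       Bos taurus cyclin-dependent ...',
    '//',
    '';

open(my $in, '<', \$sample) or die "Cannot read sample data: $!";
print "There are ", count_clusters($in), " clusters.\n";   # There are 2 clusters.
close($in);
```

To run it on the real data, open the downloaded Bos taurus file instead of the sample string and pass that filehandle to count_clusters.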
Pattern Matching
Now we’ll build a Perl program that can write an HTML file
containing some basic links based on the Bos taurus UniGene
clustering.
Important:
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
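A minimal sketch of such a program follows. It assumes that the digits after "NID=g" on a SEQUENCE line are the identifier to put in place of GID_HERE; that mapping, the subroutine name unigene_to_html, and the page layout are all assumptions made for illustration:

```perl
use strict;
use warnings;

# URL template from the slide; GID_HERE is replaced for each sequence.
my $url_template = 'http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?'
                 . 'cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank';

# Build an HTML page (as a string) with one link per SEQUENCE line.
# Assumption: the digits after "NID=g" are the value for GID_HERE.
sub unigene_to_html {
    my ($fh) = @_;
    my $html = "<html><body>\n";
    while (my $line = <$fh>) {
        if ($line =~ m/^ID\s+(\S+)/) {
            $html .= "<h2>$1</h2>\n";             # one heading per cluster
        }
        elsif ($line =~ m/^SEQUENCE\s+ACC=([^;\s]+);\s*NID=g(\d+)/) {
            my ($acc, $gi) = ($1, $2);
            (my $url = $url_template) =~ s/GID_HERE/$gi/;
            $url =~ s/&/&amp;/g;                  # escape & inside HTML
            $html .= "<a href=\"$url\">$acc</a><br>\n";
        }
    }
    $html .= "</body></html>\n";
    return $html;
}

# Abbreviated in-memory sample record.
my $sample = "ID          Bt.1\n"
           . "SEQUENCE    ACC=M93665; NID=g162776\n"
           . "//\n";
open(my $in, '<', \$sample) or die "Cannot read sample data: $!";
print unigene_to_html($in);
close($in);
```

Returning the page as a string keeps the subroutine easy to test; the caller decides whether to print it or write it to an .html file.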
Substitution
Pattern matching is useful for counting or indexing items,
but to modify the data, substitution is required.
Substitution searches a string for a PATTERN and, if found,
replaces it with REPLACEMENT.
$line =~ s/PATTERN/REPLACEMENT/;
The substitution returns the number of replacements made, or a false
value (an empty string) if the pattern was not found.
$result = $line =~ s/PATTERN/REPLACEMENT/;
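The return value can be checked directly. In this small sketch (the sample string is made up), a plain s/// replaces only the first occurrence, so a successful substitution returns 1:

```perl
use strict;
use warnings;

my $line = "ACGTACGT";

# One replacement is made (the first occurrence), so the result is 1.
my $first_only = ($line =~ s/ACGT/acgt/);

# No match: the result is false (an empty string) and $line is unchanged.
my $no_match = ($line =~ s/NNNN/nnnn/);

print "line=$line first_only=$first_only no_match='$no_match'\n";
# prints: line=acgtACGT first_only=1 no_match=''
```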
Substitution
Substitution can take several different options, specified after
the final slash.
The most useful are:
g - global (can substitute at more than one location)
i - case insensitive matching
$string = "One fish, Two fish, Red fish, Blue fish.";
$string =~ s/fish/dog/g;
print "$string\n";
One dog, Two dog, Red dog, Blue dog.
Substitution
Example: Removing leading and trailing white-space
$line =~ s/^\s*(.*?)\s*$/$1/;
The *? quantifier performs a minimal match: it stops at the first
point at which the remainder of the expression can be matched.
$line =~ s/^\s*(.*)\s*$/$1/;
By contrast, this statement will not remove trailing white-space;
the greedy .* consumes it, so it ends up in $1 and is retained.
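Running both statements on the same string shows the difference. This is a small sketch with a made-up sample string; the brackets in the output make the surviving white-space visible:

```perl
use strict;
use warnings;

my $original = "   casein kinase   ";

# Copy the string, then trim the copy with each version of the pattern.
(my $minimal = $original) =~ s/^\s*(.*?)\s*$/$1/;   # minimal match
(my $greedy  = $original) =~ s/^\s*(.*)\s*$/$1/;    # greedy match

print "minimal: [$minimal]\n";   # minimal: [casein kinase]
print "greedy:  [$greedy]\n";    # greedy:  [casein kinase   ]
```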
Character Replacement
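A similar operation to substitution is character replacement (tr).
$line =~ tr/a-z/A-Z/;              # convert to upper case
$count_CG = $line =~ tr/CG/CG/;    # count the C and G characters
$line =~ tr/ACGT/TGCA/;            # complement a DNA sequence
A chain of substitutions is not equivalent to the tr above, because
each step also changes characters produced by an earlier step:
$line =~ s/A/T/g;
$line =~ s/C/G/g;
$line =~ s/G/C/g;
$line =~ s/T/A/g;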
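A quick way to compare tr with a chain of substitutions is to run both on the same string. The sketch below uses a made-up five-base sequence:

```perl
use strict;
use warnings;

my $dna = "AACGT";

# tr/// translates every character in a single pass.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# Sequential substitutions re-translate characters changed by earlier steps.
my $chained = $dna;
$chained =~ s/A/T/g;    # TTCGT
$chained =~ s/C/G/g;    # TTGGT
$chained =~ s/G/C/g;    # TTCCT
$chained =~ s/T/A/g;    # AACCA

print "tr:      $complement\n";   # tr:      TTGCA
print "chained: $chained\n";      # chained: AACCA
```

Only the tr result is the complement; the chained version collapses A/T to A and C/G to C.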
Character Replacement
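$count_CG = 0;
$count_AT = 0;
while($line = <IN>) {
    # accumulate with += so counts from earlier lines are not overwritten
    $count_CG += $line =~ tr/CG/CG/;
    $count_AT += $line =~ tr/AT/AT/;
}
$total = $count_CG + $count_AT;
$percent_CG = $total ? 100 * ($count_CG/$total) : 0;    # avoid dividing by zero
print "The sequence was $percent_CG% CG-rich.\n";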
Subroutines
One of the most important aspects of programming is dealing
with complexity. A program that is written in one large
section is generally more difficult to debug. Thus a major
strategy in program development is modularization.
Break the program up into smaller portions that can each be
developed and tested independently.
Makes the program more readable, and easier to maintain and
modify.
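As a small illustration of modularization, the CG-content calculation can be split into subroutines that are each testable on their own. The names count_cg and percent_cg are hypothetical choices for this sketch:

```perl
use strict;
use warnings;

# Count the C and G characters in one sequence string.
sub count_cg {
    my ($seq) = @_;
    return ($seq =~ tr/CG/CG/);
}

# Percentage of C and G among the A, C, G and T characters of a sequence.
sub percent_cg {
    my ($seq) = @_;
    my $total = ($seq =~ tr/ACGT/ACGT/);
    return 0 if $total == 0;          # empty input: avoid dividing by zero
    return 100 * count_cg($seq) / $total;
}

print percent_cg("ACGTCCGG"), "\n";   # 75
```

Each subroutine works on its own copy of the argument, so the caller's string is never modified.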
Subroutines
EXAMPLE:
Reading in sequences from UniGene.all.seq file
Multiple FASTA sequences in a single file, each annotated
with the UniGene cluster they belong to.
GOAL:
Make an output file consisting only of the longest
sequence from each cluster.
Subroutines
ISSUES:
1. Want to design and implement a usable program
2. Use subroutines where useful to reduce complexity.
3. Minimize the memory requirements.
(human UniGene seqs > 2 GB)
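One way to address these issues, offered as a sketch and not as the program from the talk, is a two-pass design. The first pass records only the length and file offset of the longest sequence in each cluster, and the second pass seeks to those offsets and copies the records out, so no sequence data is held in memory. The subroutine names and the assumption that each FASTA header names its cluster as "/ug=LABEL" are illustrative:

```perl
use strict;
use warnings;

# Pass 1: per cluster, remember the length and file offset of the longest
# sequence. Assumption: each header names its cluster as "/ug=LABEL".
sub index_longest {
    my ($fh) = @_;
    my (%best, $cluster, $start, $length);

    # Record the sequence just finished if it is the longest so far.
    my $finish = sub {
        return unless defined $cluster;
        if (!exists $best{$cluster} || $length > $best{$cluster}{length}) {
            $best{$cluster} = { offset => $start, length => $length };
        }
    };

    my $offset = tell($fh);
    while (my $line = <$fh>) {
        if ($line =~ m/^>/) {
            $finish->();
            ($cluster) = $line =~ m{/ug=(\S+)};
            ($start, $length) = ($offset, 0);
        }
        else {
            $line =~ s/\s+//g;            # ignore newlines and spaces
            $length += length $line;
        }
        $offset = tell($fh);              # where the next line begins
    }
    $finish->();
    return \%best;
}

# Pass 2: seek to each recorded offset and copy that one record to $out.
sub write_longest {
    my ($fh, $best, $out) = @_;
    my @clusters = sort { $best->{$a}{offset} <=> $best->{$b}{offset} }
                   keys %$best;
    for my $cluster (@clusters) {
        seek($fh, $best->{$cluster}{offset}, 0) or die "seek failed: $!";
        my $header = <$fh>;
        print {$out} $header;
        while (my $line = <$fh>) {
            last if $line =~ m/^>/;       # next record begins
            print {$out} $line;
        }
    }
}

# Tiny in-memory example with made-up sequences.
my $fasta = ">seq1 /ug=Bt.1\nACGT\n"
          . ">seq2 /ug=Bt.1\nACGTACGT\nAC\n"
          . ">seq3 /ug=Bt.2\nGG\n";
open(my $in, '<', \$fasta) or die "Cannot read sample data: $!";
my $best = index_longest($in);
write_longest($in, $best, \*STDOUT);      # prints the seq2 and seq3 records
close($in);
```

The index holds two numbers per cluster, so its size depends on the number of clusters and not on the size of the sequence file.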