Perl for Bioinformatics - University of Illinois Crop Sciences

Download Report

Transcript Perl for Bioinformatics - University of Illinois Crop Sciences

Perl for Bioinformatics
Lecture 4
Variables - review
•
•
•
•
•
A variable name starts with a $
It contains a number or a text string
Use my to define a variable
Use = to assign a value
Use \ to stop the variable being
interpolated
• Take care with variable names and with
changing the contents of variables
Conditional Blocks, review
• An if test can be used to control a
command in a conditional block,
according to the outcome of a decision
made by comparing variables.
• It’s important to keep track of whether
variables are strings or numbers. Numbers
are compared with ==, strings with eq.
• It’s usual to indent the block to make it
easier to read the code
Arrays
• An array can store multiple pieces of data.
• They are essential for the most useful
functions of Perl. They can store data such
as:
– the lines of a text file (e.g. primer sequences)
– a list of numbers (e.g. BLAST e values)
• Arrays are designated with the symbol @
my @bases = (“A”, “C”, “G”, “T”);
Converting a variable to an array
split splits a variable into parts and puts them
in an array.
my $dnastring = "ACGTGCTA";
my @dnaarray = split //, $dnastring;
@dnaarray is now (A, C, G, T, G, C, T, A)
@dnaarray = split /T/, $dnastring;
@dnaarray is now (ACG, GC, A)
Converting an array to a variable
• join combines the elements of an array into a
single scalar variable (a string)
$dnastring = join('', @dnaarray);
spacer
(empty here)
which array
Loops
• A loop repeats a bunch of functions until it is
done. The functions are placed in a BLOCK –
some code delimited with curly brackets {}
• Loops are really useful with arrays.
• The “foreach” loop is probably the most useful
of all:
foreach my $base (@dnaarray) {
print "$base “;
}
Comparing strings
• String comparison (is the text the same?)
• eq (equal )
• ne (not equal )
There are others but beware of them!
Getting part of a string
• substr takes characters out of a string
$letter = substr($dnastring, $position, 1)
which string
where in
the string
how many
letters to take
Combining strings
• Strings can be concatenated (joined).
• Use the dot . operator
$seq1= “ACTG”;
$seq2= “GGCTA”;
$seq3= $seq1 . $seq2;
print $seq3;
ACTGGGCTA
Making Decisions - review
• The if operator is generally used together
with numerical or string comparison
operators, inside an (expression).
numerical:
strings:
==, !=, >, <, ≥, ≤
eq, ne
• You can make decisions on each member
of an array using a loop which puts each
part of the array through the test, one at a
time
More healthy exercise
• Write a program that asks the user for a DNA
restriction site, and then tells them whether that
particular sequence matches the site for the
restriction enzyme EcoRI, or Bam HI, or Hind III.
• Site for EcoR1: GAATTC
• Bam H1: GGATCC
• Hind III: AAGCTT
pseudocode
•
•
•
•
•
Read in restriction site to variable
Remove newline character
Check if variable contains “GAATTC”
Check if variable contains “GGATCC”
..etc.
What about longer sequences?
• Read in sequence to variable
• Remove newline character
• Split sequence in variable to array using
“GAATTC”.
• Count and report number of fragments.
• Measure length of fragments and report
site positions, adding six for missing sites
Arrays and loops - review
• An array starts with @. It contains multiple
bits of data in a list-like format.
• @bases = (“A”, “C”, “G”, “T”);
• You can make decisions on each member
of an array using a foreach loop which
puts each part of the array through the
test, one at a time
Test time, again
• Remember –
keep track of what’s in a variable
don’t over-write a variable with another
value, unless you intend to
syntax and case are critical
lines end with a semicolon
brackets and quotes must match.
Opening and closing files
• So we can input large amounts of data, Perl
has to read data out of files, and write results
into output files
• This is done in two steps
• First, you must give the file a name within the
script - this is known as a filehandle
• Use the open command:
open MYFILE, ‘exampleprotein.txt’;
Reading a file
• Once the file is open, you can read from it, line
by line, using the readline <> operator again
– (put the filehandle between the angle brackets)
• Perl reads files one line at a time, each time you
input data from the file, the next line is read:
open FILE1,’exampleprotein.txt’;
$line1 = <FILE1>;
chomp $line1;
$line2 = <FILE1>;
Using loops to read in a file
• The while loop just keeps doing an expression while it’s true. So it
will keep reading lines from the file until it runs out.
my $longsequence;
open FILE, ‘exampleprotein.txt’;
while (my $line = <FILE>){
chomp $line;
$longsequence = $longsequence . $line;
}
close FILE;
• This reads the whole file, and puts each line into the variable
$longsequence one at a time.
•
Think about what happens to the ID line….
Now More Fun Excercises
• Read a DNA sequence from a fasta
format file
• Calculate the GC content.
• What about the non-DNA characters in
the file?
>header lines with the name of the sequence
carriage returns !! You know this one.
blank spaces
N’s or X’s or unexpected letters
Pseudocode
• Open file
• Load ID line and sequence into different
variables
• Split the sequence into an array of
individual letters
• For each letter in the array:
– if A, increment “A” counting number variable
– If T, increment “T” counting variable
– …etc
Writing to a File
• Writing to a file is similar to reading from it
• Use the > operator to open a file for writing:
open OUTPUT, ‘>/home/class30/output.txt’;
• This creates a new file with that name, or overwrites
an existing file
• Use >> to append text to an existing file
• print to the file using the filehandle:
print OUTPUT $myoutputdata;
Some more stuff you need to
know
else
• Instead of just letting the script go on if it fails
an if test, you can get it to execute a second
block of code if the statement in brackets isn’t
true.
elsif
• You can string a lot of “if”s together using elsif
if ($site eq “GAATTC” {
print “EcoR1 site\n”;
}
elsif ($site eq “CCATGG” {
print “BamHI site\n”;
}
elsif ($site eq “AAGCTT”) {
print “HindIII site\n”;
}
else { #only happens if none of
the preceeding are true
die “I can’t find any of the sites I know\n”;
}
Subscripts
• Bioinformatics data often can be made into array
format:
– multi-line sequence files
– Microarray or statistics data in “tab delimited”
format
• You can address part of the array as if it was a
variable using a subscript
@numbers = (8, 8, 8, 23984092, 8);
print “$numbers[3]\n”;
Please note – the first element is number zero! Second is 1!
Regular Expressions
• Sounds odd, doesn’t it? It means a pattern that
the computer can match, in a standard format.
• Very useful in bioinformatics work
• DNA patterns
• restriction sites
• promoters/transcription factor binding sites
• intron splice site
• Protein patterns
• conserved domains (motifs)
• active sites
• structural motifs (membrane spanning, signal peptide, etc.)
The Binding and Match Operators:
=~ / /
• The =~ operator binds functions together
• The // operator matches things to patterns
It can be translated as “contains”
• The forward slashes contain the pattern to be
matched, like this:
if ($dnaseq =~ /GAATTC/) {print “EcoRI site found\n”}
A regular expression..
is a joy forever. And a pattern to match:
can be just a text string, such as:
/GATC/
it can have alternative characters:
/G[AT]TC/
or contain a wildcard that matches any character: /G.TC/
Or be something bizzare:
/\/[^\/]*\/\.\./
Perl Regular Expressions
• It never ceases to amaze me what people
can do with regular expressions, but you
can match pretty much anything you can
think of and a lot you can’t:
#man perlrequick
Alternative Characters
• Square brackets within the match expression allow
for alternative characters:
if
($dna =~ /CAG[AT]CAG/)
• This will match an DNA string that starts with CAG; has A or T
in the 4th position, followed by another CAG.
• A vertical line within the /expression/ means “or”; it
allows you to look for either of two completely
different patterns:
if
($dna =~ /GAAT|ATTC/)
Special characters
• Perl has a large set of special characters to
use in regular expressions:
– the dot (.) matches any character
– \d matches any digit (a number from 0-9)
– \w matches any “word” character
(a letter or a number, not punctuation or
space)
– \s matches white space (any amount)
– \t matches a tab (useful for tab delimited files)
– ^ matches the beginning of a line
– $ matches the end of a line
– Knowing this makes you lots of fun at parties.
“Special” characters
• What if you need to match text that contains a
special character?
• Aren’t there dots at the end of sentences?
• Now you have to use a backslash (\) to
“escape” the special meaning of that character:
if $onewordsentence =~ /\w+ \ ./
-This would match any text that has one or more text
characters, followed by a dot.
Finished.
Bringing it together
• So now, when you think about it, you can:
Open a file
Check whether each line of the file contains a particular
pattern
Recover part of that line
Write it out to another file
So.. isn’t that what you wanted to know?
But really, it’s very useful combined with the UNIX
command line.
A last exercise?...
• Now we’re getting up to speed with Perl,
lets try something more fun:
• Open up a BLAST output file
• Spit out the name of the query sequence,
the top hit, and how many hits there were.
Only the beginning
• Sadly, there is much, much more than this to the
Power of Perl.
• You can make, create and download other
people’s websites
• Make Linux and Windows graphical programs
• Do almost anything on the internet
• Interact with databases
• And much much more
Why won’t I teach you more
stuff?
• Whoa!
• Programming takes time to learn properly
• You’ve got the tools now to get started on a
programming project
• We will go through some more Perl functions
in the later classes, especially modules such
as Bioperl.
Practice makes perfect
• You can now practice your Perl skills and
understand a lot of the books and help files,
which are probably more useful.
#man perlintro
#man perlrequick
#perldoc bioperl
• Also, check out Radhika’s resource page
http://compbio.sph.harvard.edu/chb/training/training-resources