Perl for Bioinformatics

Introduction to Perl
Matt Hudson
Review
• blastall: Do a blast search
• HMMER
– hmmpfam: search against HMM database
– hmmsearch: search proteins with HMM
– hmmbuild: make an HMM from a protein alignment, as made by clustalw
• clustalw: align protein or DNA sequences
• fasta34: search a sequence using an older,
slower, but sometimes more flexible algorithm
grep – my favorite
• Allows you to pick out lines of a text file that match a
query, count them, and retrieve lines around the
match.
grep 'Query=' myblast.txt
What sequences did I BLAST?
grep -c '>' testprotein.txt
How many sequences are in this file?
grep -A 10 '>' testprotein.txt
Give me each header line plus the ten lines after it
ftp commands
• ftp ftp.ncbi.nih.gov: go to the NCBI site
• open: open a connection
• ls: same as UNIX
• cd: same as UNIX
• get: get me this file
• mget: get more than one file
• put: put a file on the server
• lcd: local cd
• !: local shell
• close: close connection
• bye: exit the ftp program
• OK. You are now up and running with
UNIX, and can use it to do some fairly
sophisticated bioinformatics.
• We’re going to concentrate on Perl
scripting from now on.
UNIX books
• You might find that your UNIX skills need some refreshing from time to
time. I recommend having one of these books around in case you need
some help using the command line:
• For students who haven’t done much UNIX:
Sams Teach Yourself Unix in 24 Hours (4th Edition) (Paperback)
by Dave Taylor
For more advanced UNIX users:
UNIX System V: A Practical Guide (3rd Edition) (Paperback)
by Mark G. Sobell
• Also, for those of you not so familiar with bioinformatics:
Bioinformatics for Dummies (Paperback)
by Jean-Michel Claverie, Cedric Notredame
Perl books
• For some reason, although there are hundreds of Perl books out there, none of
them are really that good. Here are some that might be useful, but none are
completely recommended.
• This one I recommend EXCEPT that it uses tools that come with the book that
are non-standard:
Beginning Perl for Bioinformatics (Paperback)
by James Tisdall
This I have heard good things about but not used much myself:
Beginning Perl, Second Edition (Paperback)
by James Lee
This is a classic but slow going if you know no programming:
Learning Perl, Fourth Edition (Paperback)
by Randal L. Schwartz, Tom Phoenix, brian d foy
This is better if you have little programming experience, but not a textbook:
Perl for Dummies (Fourth Edition) (Paperback)
by Paul Hoffman
• Once you get started:
Programming Perl, 3rd edition,
by Larry Wall, O’Reilly, 2001
Why use Perl?
• Interpreted language – quick to program
• Easy to learn compared to most
languages
• Designed for working with text files
• Free for all operating systems
• Most popular language in bioinformatics
– many scripts available you can
“borrow”, also ready made modules.
Programming
• In Perl, the program, or script, is just a text
file.
• You write it with ANY text editor (we are using
WordPad and/or nano).
• Run the program
• Look at the output
• Correct the errors (debugging)
• Edit the script and try again.
Remember your program?
All programming courses traditionally start with a
program that prints “Hello, world!”. So in keeping with
that tradition:
#!/usr/bin/perl
print "Hello, world\n";
Note:
No line numbers.
Each command line ends with a semicolon
Print
• Many programming languages use “print” to mean “write this to
the console”, i.e. the command line.
• Once upon a time, the console was a typewriter. But now “print”
does not mean print on a printer.
• print statements are necessary to keep tabs on what your
program is doing.
• You need to tell Perl to put a newline at the end of a
printed line
– Use \n in a text string to signify a newline.
– The \ character is called “backslash”.
– It is an “escape” – it changes the meaning of the character
after it. In this case it changes “n” to “newline”. Other
examples are \t (tab) or \$ (= print an actual dollar sign,
normally a dollar sign has a special meaning).
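The three escapes mentioned above can be tried out in a short script. This is only a sketch: the text being printed is made up for illustration, and the two "use" lines (not covered on these slides) simply ask Perl to report common mistakes.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

# \t inserts a tab between the two words, \n ends the line
my $line = "gene\tlength\n";
print $line;

# \$ prints an actual dollar sign instead of starting a variable name
my $price = "Cost: \$5\n";
print $price;
```

The strings are stored in variables (introduced a few slides later) only so that they are easy to inspect; printing the quoted text directly works the same way.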
Program details
• Perl programs on UNIX start with a line like:
#!/usr/bin/perl
• Perl treats anything after a # as a comment. The #! line is an
instruction not to Perl, but to UNIX: it says which program should run the script.
• Elsewhere in the program # is used for
comments to explain the code.
• Lines that are Perl commands end with a
semicolon (;).
Run your Perl program
#cd scratch
#nano helloworld.pl
(paste or type text into editor, save, and exit)
#perl helloworld.pl
Or:
#chmod 755 helloworld.pl
#./helloworld.pl
Pseudocode
• Programmers often find it easier to write out
the things the program is doing in “normal”
language. We call this pseudocode.
print "Hello, world\n";
=
Output the text “Hello, world” to the terminal,
followed by a newline character.
Strings
• In Perl, strings are very important. They are
just a series of any text characters – letters,
numbers, ><?>:$%^&*, etc.
• In the statement
print "Hello, world\n";
the text between the double quotes, Hello, world\n, is a string.
Numbers, etc
• The other common type of data is a number.
• Perl can handle numbers in most common formats,
without any complications:
456
5.6743
6.3E-26
• Arithmetic functions:
+ (add)
- (minus)
/ (divide)
* (multiply)
** (exponentiation)
A program using numbers
#!/usr/bin/perl
print "2+2\n";
print 3*4 , "\n";
print "8/2=" , 8/2 , "\n";
Do you get it?
Numbers in quotes are part of a string.
Numbers outside quotes are numbers, and
the computer does the math before printing.
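The same pattern works for the other arithmetic operators. The sketch below uses made-up numbers, and the last two lines illustrate a point the slides do not cover: Perl multiplies before it adds unless parentheses say otherwise.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

# In each line the quoted part is a string; the unquoted part is calculated.
print "7-3=" , 7-3 , "\n";          # subtraction
print "7/2=" , 7/2 , "\n";          # division is not rounded to a whole number
print "2**10=" , 2**10 , "\n";      # exponentiation: 2 to the power 10
print "2+3*4=" , 2+3*4 , "\n";      # multiplication is done before addition
print "(2+3)*4=" , (2+3)*4 , "\n";  # parentheses change the order
```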
Pseudocode
print "2+2\n";
=
Output “2+2”, followed by a newline, to the
terminal
print 3*4 , "\n";
=
Evaluate 3 x 4, and print the answer, followed
by a newline, to the terminal
Variables
• Up till now, we’ve been telling the
computer exactly what to print. But in
order for the program to generate what
is printed, we need to use variables.
• A variable name starts with “$”
• It can be either a string or a number.
Assigning values
In pretty much all programming languages, = means
“assign this value to this variable”.
The “my” keyword in Perl declares the variable.
This is optional but highly recommended.
So, you assign values to a variable as follows:
my $number = 123;
my $dna_sequence_string = "acgt";
A program with variables
#!/usr/bin/perl -w
#this program uses variables containing numbers
my $two = 2;
my $three = $two + 1;
print "\$two * \$three = $two * $three = ", ($two * $three);
print "\n";
Pseudocode
my $two = 2;
Assign the value 2 to the variable $two
Interpolation
• When you print the variable, Perl gives the contents
rather than the name of the variable.
print $number;
9
• If you put a variable inside double quotes, Perl
interpolates the variable
print "The number is $number\n";
The number is 9
• If you use single quotes, no interpolation happens
print 'The number is $number\n';
The number is $number\n
• A more flexible way to do this is to “escape” the $
print "The value of \$number is $number\n";
The value of $number is 9
Variables - summary
• A variable name starts with a $
• It contains a number or a text string
• Use my to define a variable
• Use = to assign a value
• Use \ to stop the variable being
interpolated
• Take care with variable names and with
changing the contents of variables
Standard Input
• To make the program do something, we
need to input data.
– The angle bracket operator (<>) tells Perl to
expect input, by default from the keyboard.
– Usually this is assigned to a variable
print "Please type a number: ";
my $num = <STDIN>;
print "Your number is $num\n";
Pseudocode
my $num = <STDIN>;
Stop the program, and wait until the user
types input. Once the user hits the
“enter” key, take the input (including the
newline character) and put it into the
variable $num.
chomp
• When data is entered from the keyboard, the program
waits for you to press the Enter key.
• But the string which is captured includes a newline
character at its end
• You can use the chomp function to remove the
newline character:
print "Enter your name: ";
$name = <STDIN>;
print "Hello $name, happy to meet you!\n";
chomp $name;
print "Hello $name, happy to meet you!\n";
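To see exactly what chomp removes, it can help to count the characters before and after. This is a sketch only: the string "Matt\n" stands in for a line typed at the keyboard, and length (a built-in function not covered on these slides) reports how many characters a string holds.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

# Stand-in for a line read with <STDIN>: four letters plus the newline
my $name = "Matt\n";

my $before = length($name);   # counts the newline too
chomp $name;                  # removes the trailing newline, if there is one
my $after = length($name);

print "Length before chomp: $before, after chomp: $after\n";
```

Running chomp a second time on the same variable changes nothing, because there is no longer a newline at the end.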
if and True/False
• All programming works on ones and zeros – true and
false.
if (1 == 1) {
    print "one equals one";
}
Perl evaluates the expression (1 == 1)
Note TWO NOT ONE EQUALS SIGNS!
The if statement causes the command in curly
brackets to be executed ONLY IF the expression is true
if
• if evaluates some statement in
parentheses (must be true or false)
• Note: conditional block is indented,
using tabs.
– Perl doesn’t care about indents, but it
makes your code more “human readable”
Comparing variables
if ($one == $two) {print "one equals two";}
Note there are TWO equals signs in this expression. If you
remember, = means “assign this variable this value”. So ==
actually means “equals”. You can also use
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
!= Not equal to
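As an illustration of these operators, here is a sketch that checks a number against a cutoff. The e-value and the cutoff are hypothetical numbers chosen for the example, not values from any real BLAST run.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my $evalue = 1e-5;     # hypothetical BLAST e-value
my $cutoff = 0.001;    # hypothetical threshold
my $verdict = "undecided";

# <= is true when the left number is smaller than or equal to the right one
if ($evalue <= $cutoff) {
    $verdict = "keep";
}
# > is true only when the left number is strictly larger
if ($evalue > $cutoff) {
    $verdict = "discard";
}

print "Verdict: $verdict\n";
```

Exactly one of the two tests can be true for any pair of numbers, so $verdict is always changed once.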
Pseudocode
if ($one == $two) {print "one equals two";}
If the number in the variable $one is equal
to the number in the variable $two, print “one
equals two”
What’s a block?
• In the case of an “if” statement:
• If the test is true, execute all the
command lines inside the {} brackets. If
not, then go on past the closing } to the
statements below.
• You can also do stuff in a block over and
over again using a loop – more later.
die, scum
• die kills your script safely and prints a
message
• It is often used to prevent you doing
something regrettable – e.g. running your
script on a file that doesn’t exist, or
overwriting an existing file.
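A sketch of the "file that doesn't exist" case might look like this. Several things here are assumptions that go beyond the slides: the file name is hypothetical, -e is Perl's "does this file exist" test, ! means "not", and eval is used only so that the demonstration can show the message and carry on instead of stopping.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my $file = "no_such_file.txt";   # hypothetical file name

# In a real script the if/die would normally sit directly in the program,
# so that die stops it. The eval block here catches the die for the demo.
eval {
    if (! -e $file) {
        die "Cannot find $file\n";
    }
    print "Found $file\n";
};

# After a die inside eval, the message is available in the variable $@
my $message = $@;
print "die said: $message";
```

Ending the die message with \n stops Perl from adding the script name and line number to it.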
Exercising the Perl muscles
• Now let’s write a script to ask the user
their age, and then deliver an insult
specific to the age bracket:
• Over 25 - old fogey
• Under 15 – callow youth
• 15-25 – (insert your own insult here)
Pseudocode
output “Enter your age: ” to the terminal
Stop the program, and wait until the user types input. Once the
user hits the “enter” key, take the input (including the newline
character) and put it into the variable $age.
Remove newline from $age if present
If the value in $age is less than 15, output “You are too young for
this kind of work!” followed by a newline, then terminate the program
with the text “too young”
If the value in $age is more than 25, output “You’re old enough to
know better!” and then terminate the program with the text “too old”.
If the program is still running (i.e. $age is between 15 and 25),
then output “You have much to learn!” followed by a newline.
Conditional Blocks, summary
• An if test can be used to control multiple
lines of commands, as in this example
print "Enter your age: ";
my $age = <STDIN>;
chomp $age;
if ($age < 15) {
    print "You are too young for this kind of work!\n";
    die "too young";
}
if ($age > 25) {
    print "You're old enough to know better!";
    die "too old";
}
print "You have much to learn!\n";
Arrays
• An array can store multiple pieces of data.
• They are essential for the most useful
functions of Perl. They can store data such
as:
– the lines of a text file (e.g. primer sequences)
– a list of numbers (e.g. BLAST e values)
• Arrays are designated with the symbol @
my @bases = ("A", "C", "G", "T");
Converting a variable to an array
split splits a variable into parts and puts them
in an array.
my $dnastring = "ACGTGCTA";
my @dnaarray = split //, $dnastring;
@dnaarray is now (A, C, G, T, G, C, T, A)
@dnaarray = split /T/, $dnastring;
@dnaarray is now (ACG, GC, A)
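split is not limited to single letters of DNA. As a sketch, the hypothetical comma-separated list of primers below is split at each comma. Two things used here are not covered on the slides: scalar(@array) gives the number of elements, and $array[0] is the first element (counting starts at 0).

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my $primers = "ACGT,GGCC,TTAA";          # hypothetical primer list
my @primerlist = split /,/, $primers;    # cut the string at every comma

my $count = scalar(@primerlist);         # how many pieces came out
my $first = $primerlist[0];              # first piece

print "Found $count primers, the first is $first\n";
```

The commas themselves are thrown away by split; only the text between them ends up in the array.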
Converting an array to a variable
• join combines the elements of an array into a
single scalar variable (a string)
$dnastring = join('', @dnaarray);
The first argument ('') is the spacer put between the elements (empty here);
the second argument says which array to join.
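The spacer does not have to be empty. In this sketch (variable names are illustrative) the same array is joined twice, once with nothing between the elements and once with a dash.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my @bases = ("A", "C", "G", "T");

my $plain  = join('', @bases);    # empty spacer: elements run together
my $dashed = join('-', @bases);   # dash spacer: a dash between each pair

print "$plain\n";
print "$dashed\n";
```

The spacer only goes between elements, so there is no dash at the start or the end of the second string.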
Loops
• A loop repeats a bunch of functions until it is
done. The functions are placed in a BLOCK –
some code delimited with curly brackets {}
• Loops are really useful with arrays.
• The “foreach” loop is probably the most useful
of all:
foreach my $base (@dnaarray) {
    print "$base ";
}
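A foreach loop can also build up a result as it goes. The sketch below adds up a list of sequence lengths; the numbers are invented for the example.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my @lengths = (120, 85, 300);   # hypothetical sequence lengths
my $total = 0;

# Each time round the loop, $len holds the next element of @lengths
foreach my $len (@lengths) {
    $total = $total + $len;
}

print "Total length: $total\n";
```

The variable $total is declared before the loop so that it still exists, with the final sum in it, after the closing bracket.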
Comparing strings
• String comparison (is the text the same?)
• eq (equal )
• ne (not equal )
There are others but beware of them!
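A sketch of eq and ne in an if test, using a made-up codon. Note that == should not be used here: it compares numbers, and text such as "ATG" is not a number.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my $codon = "ATG";        # hypothetical codon to test
my $is_start = "no";
my $is_stop  = "maybe";

# eq is true when the two strings contain exactly the same text
if ($codon eq "ATG") {
    $is_start = "yes";
}
# ne is true when the two strings differ
if ($codon ne "TAA") {
    $is_stop = "not TAA";
}

print "Start codon? $is_start. Stop codon? $is_stop\n";
```

String comparison is case-sensitive, so "atg" and "ATG" count as different.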
Getting part of a string
• substr takes characters out of a
string
$letter = substr($dnastring, $position, 1);
The three arguments are: which string, where in the string (counting
from 0), and how many letters to take.
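As a sketch of how the three arguments work together, the example below takes one letter and then three letters from the DNA string used earlier. The variable names $first and $codon are illustrative.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my $dnastring = "ACGTGCTA";

# Positions are counted from 0, so position 0 is the first letter
my $first = substr($dnastring, 0, 1);

# Start at position 2 (the third letter) and take three letters
my $codon = substr($dnastring, 2, 3);

print "First letter: $first\n";
print "Three letters from position 2: $codon\n";
```

Used this way, substr only copies letters out; $dnastring itself is left unchanged.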
Combining strings
• Strings can be concatenated (joined).
• Use the dot . operator
$seq1 = "ACTG";
$seq2 = "GGCTA";
$seq3 = $seq1 . $seq2;
print $seq3;
ACTGGGCTA
Making Decisions - review
• The if operator is generally used together
with numerical or string comparison
operators, inside an (expression).
numerical: ==, !=, >, <, >=, <=
strings: eq, ne
• You can make decisions on each member
of an array using a loop which puts each
part of the array through the test, one at a
time
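Putting a loop and a test together might look like the sketch below, which counts how many numbers in an array fall under a cutoff. The e-values and the cutoff are hypothetical.

```perl
#!/usr/bin/perl
use strict;      # ask Perl to insist on declared variables
use warnings;    # ask Perl to warn about likely mistakes

my @evalues = (1e-30, 0.5, 2e-8, 3);   # hypothetical BLAST e-values
my $cutoff = 1e-5;                     # hypothetical threshold
my $kept = 0;

# Each element goes through the same if test, one at a time
foreach my $e (@evalues) {
    if ($e < $cutoff) {
        $kept = $kept + 1;
    }
}

print "$kept of the e-values are below the cutoff\n";
```

The same shape works for strings: swap the numerical test for one using eq or ne.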
More healthy exercise
• Write a program that asks the user for a DNA
restriction site, and then tells them whether
that particular sequence matches the site for
the restriction enzyme EcoRI, or BamHI, or
HindIII.
• Site for EcoRI: GAATTC
• BamHI: GGATCC
• HindIII: AAGCTT