Advanced Perl for Bioinformatics
Lecture 5
Regular expressions - review
• You can put the pattern you want to match between //, bind the pattern to the variable with =~, then use it within a conditional:
if ($dna =~ /GAATTC/) { print "EcoRI\n"; }
• Square brackets within the match expression allow for alternative characters:
if ($dna =~ /CAG[AT]CAG/)
• A vertical line means "or"; it allows you to look for either of two completely different patterns:
if ($dna =~ /GAAT|ATTC/)
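The three kinds of pattern reviewed above can be tried side by side. This is a minimal sketch: the sequence in $dna and the @found list are made up for illustration, and GGATCC (the BamHI site) stands in as the literal pattern.

```perl
use strict;
use warnings;

# Hypothetical test sequence, made up for illustration.
my $dna = "TTGGATCCAACAGTCAGGGAAT";

my @found;

# Literal pattern: the exact string GGATCC.
push @found, "literal" if $dna =~ /GGATCC/;

# Character class: CAG, then A or T, then CAG.
push @found, "class" if $dna =~ /CAG[AT]CAG/;

# Alternation: either GAAT or ATTC anywhere in the sequence.
push @found, "alternation" if $dna =~ /GAAT|ATTC/;

print "Matched: @found\n";
```

Note that a space inside // is matched literally, so /CAG [AT] CAG/ would only match a sequence that really contains spaces.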
Reading and writing files, review
• Open a file for reading:
open INPUT, "/home/class30/input.txt";
• Or writing:
open OUTPUT, ">/home/class30/output.txt";
• Make sure you can open it!
open INPUT, "input.txt" or die "Can't open file\n";
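A round trip of writing and then reading might look like the sketch below. It uses the three-argument form of open with lexical filehandles ($out, $in), a common alternative to the bareword INPUT/OUTPUT style on the slide, and a temporary file from File::Temp so that it does not depend on any real path. The file contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file so the example does not depend on any real path.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Open for writing: the ">" mode creates or overwrites the file.
open(my $out, ">", $path) or die "Can't open $path for writing: $!\n";
print $out "ATGGCC\n";
print $out "TTGACA\n";
close $out;

# Open for reading and collect each line without its trailing newline.
open(my $in, "<", $path) or die "Can't open $path for reading: $!\n";
my @lines;
while (my $line = <$in>) {
    chomp $line;
    push @lines, $line;
}
close $in;

print "Read ", scalar(@lines), " lines: @lines\n";
```

Including $! in the die message reports why the open failed (for example, a missing file or a permissions problem).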
Test time
Last one…
Hashes
Perl has another super useful data structure called a hash, for want of a better name.
A hash is an associative array, i.e. it is a collection of values, each of which is associated with a key that you use to look it up.
Making a hash of it
• You can think of a hash just as if it were a set of questions and answers:
my %matthash = (
    "first_name" => "Matt",
    "surname" => "Hudson",
    "age" => "secret",
    "height" => 187, #cm
    "hairstyle" => "D minus"
);
Making a hash of it
• Pseudocode: Create an associative array where these keys are associated with these values:
Key          Value
first_name   Matt
surname      Hudson
age          secret
height       187 (note in text: cm)
hairstyle    D minus
Getting the hash back
my %matthash = (
    "first_name" => "Matt",
    "surname" => "Hudson",
    "age" => "secret",
    "height" => 187, #cm
    "hairstyle" => "D minus"
);
print "my name is ", $matthash{first_name};
print " ", $matthash{surname}, "\n";
You can store a lot of information and recover it easily and quickly without knowing in what order you added it, unlike an array.
Getting the hash back
• Pseudocode:
Output text "my name is "
Then value for key "first_name" in matthash
Then a space
Then value for key "surname" in matthash
Then newline character
Hashes as an array
• You can get the "keys" of the hash and use them like an array:
foreach my $info (keys %matthash) {
    print "$info = $matthash{$info}\n";
}
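A few more everyday hash operations can be sketched in the same way. The example assumes a cut-down copy of %matthash; the email key and the @report array are hypothetical and only there for illustration.

```perl
use strict;
use warnings;

my %matthash = (
    first_name => "Matt",
    surname    => "Hudson",
    height     => 187,    # cm
);

# Add or overwrite a value by assigning to its key.
$matthash{age} = "secret";

# exists() tests for a key without creating it.
my $has_email = exists $matthash{email} ? "yes" : "no";

# keys returns the keys in no particular order, so sort for stable output.
my @report;
foreach my $info (sort keys %matthash) {
    push @report, "$info = $matthash{$info}";
}
print "$_\n" for @report;
print "Has an email key? $has_email\n";
```

Sorting the keys matters only for presentation; the hash itself does not remember insertion order.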
Why are hashes useful? Exercise.
• Many of you might have noticed in the exercise on restriction sites that there was no way to keep track of which sites were which using arrays.
• Modify your script using a hash like this one:
my %enzymehash = (
    "EcoRI" => "GAATTC",
    "BamHI" => "GGATCC",
    "HindIII" => "AAGCTT"
);
(an) answer
foreach my $name (keys %enzymehash) {
    if ($sequence =~ /$enzymehash{$name}/) {
        print "I found a site for $name, $enzymehash{$name}\n";
    }
}
pseudocode
For every key in the hash %enzymehash:
    If the sequence in $sequence contains the value for that key:
        print "I found a site for (key), (value in %enzymehash for key)"
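The same idea can be wrapped in a subroutine so it can be reused on any sequence. This is one possible sketch, not the only answer: find_sites is a hypothetical helper name, the test sequence is made up, and the hash holds the standard recognition sequences (EcoRI is GAATTC, BamHI is GGATCC, HindIII is AAGCTT).

```perl
use strict;
use warnings;

# Standard recognition sequences for three common enzymes.
my %enzymehash = (
    EcoRI   => "GAATTC",
    BamHI   => "GGATCC",
    HindIII => "AAGCTT",
);

# Return the names of all enzymes whose site occurs in the sequence.
sub find_sites {
    my ($sequence) = @_;
    my @hits;
    foreach my $name (sort keys %enzymehash) {
        push @hits, $name if $sequence =~ /$enzymehash{$name}/;
    }
    return @hits;
}

# Hypothetical test sequence containing an EcoRI and a HindIII site.
my $sequence = "CCGAATTCTTAAGCTTGG";
foreach my $name (find_sites($sequence)) {
    print "I found a site for $name, $enzymehash{$name}\n";
}
```

Because the enzyme name is the key, the script can report which enzyme cut without keeping a second, parallel array of names.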
Putting data in a hash
my %hash;
while ($line =
pseudocode
Create an empty hash %hash
For every line in the file FILE:
    if the line matches the regex: stuff (important stuff) more stuff (best stuff)
        then store (important stuff) as a hash key and (best stuff) as a value for that key
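The pseudocode above could be realized roughly as follows. Everything specific here is an assumption made for illustration: the record format ("name description length=N"), the gene names, the %length_of hash, and the in-memory filehandle that stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical input: one "name description length=N" record per line.
my $text = "geneA kinase length=300\n"
         . "this line does not match\n"
         . "geneB transporter length=245\n";

# Read the string through a filehandle, as if it were a file.
open(my $fh, "<", \$text) or die "Can't open in-memory file: $!\n";

my %length_of;
while (my $line = <$fh>) {
    # First capture: the name at the start; second capture: the number after "length=".
    if ($line =~ /^(\S+)\s.*length=(\d+)/) {
        $length_of{$1} = $2;
    }
}
close $fh;

foreach my $gene (sort keys %length_of) {
    print "$gene is $length_of{$gene} residues long\n";
}
```

Lines that do not match the pattern are simply skipped, so stray header or blank lines do not end up in the hash.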
Advanced regex
• The fun isn't over yet.
• You can match precise numbers of characters
• Any number of characters
• Positions in a line
• Precise formatting (spaces, tabs etc)
• You can get bits of the string you matched out and store them in variables
• You can use regexes to substitute or to translate
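Several of the features listed above can be sketched on a single made-up sequence: a counted quantifier ({6}), anchors (^ and $), substitution (s///) and translation (tr///). The variable names are hypothetical.

```perl
use strict;
use warnings;

my $dna = "ATGAAAAAACCCGGGTTTTAG";   # hypothetical sequence

# Precise number of characters: exactly six A's in a row.
my $has_a6 = $dna =~ /A{6}/ ? 1 : 0;

# Positions in a line: ^ anchors to the start, $ to the end.
my $is_orf_like = $dna =~ /^ATG.*(TAA|TAG|TGA)$/ ? 1 : 0;

# Substitution: make an RNA copy by replacing every T with U.
(my $rna = $dna) =~ s/T/U/g;

# Translation: reverse the string, then complement each base.
(my $revcomp = scalar reverse $dna) =~ tr/ACGT/TGCA/;

print "Six A's: $has_a6\nStart/stop: $is_orf_like\nRNA: $rna\nRevcomp: $revcomp\n";
```

The idiom (my $copy = $original) =~ s/.../.../ edits a copy and leaves $original untouched.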
Grabbing bits of the regex
• The fun isn't over yet.
my $blastline = "Query= AT1g34399 gene CDS";
$blastline =~ /Query= (.+) gene/;
my $atgnumber = $1;
print "The accession number is $atgnumber\n";
You can store the contents of the bit within parentheses (round brackets), within the regex, as the special variable $1. Then use it for other stuff. If you put another pair of parentheses in, its contents will be stored in $2.
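A slightly more defensive variant checks that the match succeeded before using $1 and $2, and uses a second pair of parentheses. The $feature variable and the \s+/\S+ pattern are illustrative choices, not part of the original slide.

```perl
use strict;
use warnings;

my $blastline = "Query= AT1g34399 gene CDS";

my ($atgnumber, $feature);
# Only trust $1 and $2 if the match succeeded; otherwise they keep
# whatever an earlier successful match left in them.
if ($blastline =~ /Query=\s+(\S+)\s+gene\s+(\S+)/) {
    ($atgnumber, $feature) = ($1, $2);
    print "The accession number is $atgnumber ($feature)\n";
}
else {
    print "No query name found\n";
}
```

Using \S+ (one or more non-space characters) instead of .+ keeps the capture from running past the first word if the line contains the word "gene" more than once.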
Using modules
• You can use other people's modules, including those that come with Perl. These provide extra commands, or change the way your Perl script behaves. E.g.
use strict;
use warnings;
use Bio::Perl;
You will see these stacked up at the beginning of more complicated Perl scripts. Some modules come with Perl (strict, warnings); others you need to download and add in yourself. See: man perlmod
Using strict
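• We have talked about using "my" the first time you use a variable
• I recommend you always have
use strict;
at the top of your script. That way, if you declare your variables with my and then mistype one of them later, Perl will refuse to run the script and tell you which name it did not recognize.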
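To see what strict reports, the sketch below compiles a tiny script held in a string, so the error can be caught and printed instead of stopping the program. The misspelled name $seqeunce is deliberate, and wrapping the script in eval is only a device for this demonstration.

```perl
use strict;
use warnings;

# Compile a tiny script held in a string so the error can be caught and shown.
# $sequence is declared with my, but the second statement misspells it.
my $script = 'use strict; my $sequence = "ATG"; $seqeunce = "TAG"; 1;';

my $ok = eval $script;
my $error = $@;

if (!$ok) {
    print "strict caught the typo:\n$error";
}
```

Without use strict, the misspelled name would silently become a new, separate variable and the bug would go unnoticed.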
A last exercise?...
• Open up a BLAST output file
• Spit out the name of the query sequence, the top hit, and how many hits there were.
• So: how might hashes help you solve this?
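One way to approach this, as a rough sketch only: the report below is a made-up, heavily trimmed stand-in for BLAST text output (real reports differ between BLAST versions), and the hit names, the %score_of hash and the row pattern are all assumptions. A hash keyed by hit name stores each score, and counting its keys gives the number of hits.

```perl
use strict;
use warnings;

# Hypothetical, heavily trimmed stand-in for a BLAST text report.
my $report = <<'END';
Query= AT1g34399 gene CDS

Sequences producing significant alignments:          (bits)  Value

hitA  made-up description                              250   1e-65
hitB  another made-up description                      100   2e-20

>hitA made-up description
END

open(my $fh, "<", \$report) or die "Can't read report: $!\n";

my ($query, $top_hit, %score_of);
my $in_table = 0;
while (my $line = <$fh>) {
    if ($line =~ /^Query=\s*(\S+)/) {
        $query = $1;
    }
    elsif ($line =~ /^Sequences producing significant alignments/) {
        $in_table = 1;                      # the hit table starts here
    }
    elsif ($in_table && $line =~ /^>/) {
        last;                               # alignments begin, table is over
    }
    elsif ($in_table && $line =~ /^(\S+)\s.*\s(\d+)\s+\S+\s*$/) {
        $top_hit = $1 unless defined $top_hit;   # first row is the best hit
        $score_of{$1} = $2;
    }
}
close $fh;

my $n_hits = scalar keys %score_of;
print "Query: $query\nTop hit: $top_hit\nHits: $n_hits\n";
```

For real work, a dedicated parser such as the ones provided by BioPerl is usually more robust than hand-written patterns.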
Programming projects
• Now it's time to think of your programming projects.
• Hopefully you have an idea; we'll discuss how feasible they are in the time available
• If not, here are some suggestions
Suggested program functions
• Translate a cDNA into protein, and then check it against the Pfam database for HMM hits.
• Make a real restriction map of a DNA sequence, with predicted fragment sizes.
• Align proteins of a favorite family, open the alignment and find residues that are totally conserved.
• Perform BLAST against the latest version of the database files for a particular organism, checking whether the user has the latest files and downloading them if not.
• Design PCR primers, to make a fragment size chosen by the user, for a sequence input from a fasta file.
• Check whether primer sites are unique in a sequenced, or partially sequenced, genome, and give an "electronic PCR" result.
• Output an XML formatted version of a BLAST or HMMER text file.
• Analyze codon usage in a protein coding DNA sequence and calculate the Ka/Ks ratio.