Advanced Perl for Bioinformatics


Lecture 5

Regular expressions - review

• You can put the pattern you want to match between //, bind the pattern to the variable with =~, then use it within a conditional:

    if ($dna =~ /GAATTC/) { print "EcoRI\n"; }

• Square brackets within the match expression allow for alternative characters:

    if ($dna =~ /CAG[AT]CAG/)

• A vertical line means "or"; it allows you to look for either of two completely different patterns:

    if ($dna =~ /GAAT|ATTC/)
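The three review patterns can be tried side by side. This is a minimal sketch: the sub name review_matches, the labels it returns, and the test strings are made up for illustration, and it uses GAATTC, the EcoRI recognition site, as the literal pattern.

```perl
use strict;
use warnings;

# Return the labels of the review patterns that match a DNA string.
# The labels are invented for this example.
sub review_matches {
    my ($dna) = @_;
    my @found;
    push @found, 'literal'     if $dna =~ /GAATTC/;      # exact string
    push @found, 'class'       if $dna =~ /CAG[AT]CAG/;  # A or T in the middle
    push @found, 'alternation' if $dna =~ /GAAT|ATTC/;   # either pattern
    return @found;
}

print join(', ', review_matches('TTCAGACAGTT')), "\n";   # class
print join(', ', review_matches('AAGAATTCAA')), "\n";    # literal, alternation
```

Note that the alternation matches whenever either half is present, so any sequence containing GAATTC also matches /GAAT|ATTC/.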

Reading and writing files, review

• Open a file for reading:

    open INPUT, "/home/class30/input.txt";

• Or for writing (note the > before the file name):

    open OUTPUT, ">/home/class30/output.txt";

• Make sure you can open it!

    open INPUT, "input.txt" or die "Can't open file\n";
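Reading and writing can be combined in one small routine. The sketch below is an illustration rather than the form used on the slides: it swaps in the three-argument open with lexical filehandles ($in, $out), and the sub name copy_uppercase and the file names in the comment are placeholders.

```perl
use strict;
use warnings;

# Copy every line of one file to another, upper-casing it on the way.
# Returns the number of lines copied.
sub copy_uppercase {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Can't open $in_path: $!\n";
    open(my $out, '>', $out_path) or die "Can't write $out_path: $!\n";
    my $count = 0;
    while (my $line = <$in>) {
        print {$out} uc($line);
        $count++;
    }
    close $in;
    close $out or die "Can't close $out_path: $!\n";
    return $count;
}

# Example call (file names are placeholders):
# my $n = copy_uppercase('input.txt', 'output.txt');
```

Including $! in the die message reports why the open failed, for example a missing file or a permissions problem.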

Test time

Last one…

Hashes

Perl has another super useful data structure called a hash, for want of a better name.

A hash is an associative array, i.e. a collection of values, each of which is associated with a key that you use to look it up.

Making a hash of it

• You can think of a hash just as if it were a set of questions and answers

    my %matthash = (
        "first_name" => "Matt",
        "surname"    => "Hudson",
        "age"        => "secret",
        "height"     => 187,    #cm
        "hairstyle"  => "D minus"
    );

Making a hash of it

• Pseudocode: Create an associative array where these keys are associated with these values:

    Key          Value
    first_name   Matt
    surname      Hudson
    age          secret
    height       187 (note in text: cm)
    hairstyle    D minus

Getting the hash back

    my %matthash = (
        "first_name" => "Matt",
        "surname"    => "Hudson",
        "age"        => "secret",
        "height"     => 187,    #cm
        "hairstyle"  => "D minus"
    );
    print "My name is ", $matthash{first_name};
    print " ", $matthash{surname}, "\n";

You can store a lot of information and recover it easily and quickly without knowing in what order you added it, unlike an array.

Getting the hash back

• Pseudocode

    Output text "My name is "
    Then value for key "first_name" in matthash
    Then value for key "surname" in matthash
    Then newline character

Hashes as an array

• You can get the "keys" of the hash and use them like an array:

    foreach my $info (keys %matthash) {
        print "$info = $matthash{$info}\n";
    }
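keys returns the keys in no particular order, so sorting them gives repeatable output. The sketch below shows that, plus a check for a key that is not there; the sub name hash_lines and the key shoe_size are made up for illustration.

```perl
use strict;
use warnings;

my %matthash = (
    first_name => 'Matt',
    surname    => 'Hudson',
    age        => 'secret',
    height     => 187,    # cm
    hairstyle  => 'D minus',
);

# Build one "key = value" line per entry, in alphabetical key order.
sub hash_lines {
    my (%h) = @_;
    my @lines = map { "$_ = $h{$_}" } sort keys %h;
    return @lines;
}

print "$_\n" for hash_lines(%matthash);

# exists tests for a key without creating it.
print "no shoe size stored\n" unless exists $matthash{shoe_size};
```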

Why are hashes useful? Exercise.

• Many of you might have noticed, in the exercise on restriction sites, that there was no way to keep track of which sites were which using arrays
• Modify your script using a hash like this one:

    my %enzymehash = (
        "EcoRI"   => "GAATTC",
        "BamHI"   => "GGATCC",
        "HindIII" => "AAGCTT"
    );

(an) answer

    foreach my $name (keys %enzymehash) {
        if ($sequence =~ /$enzymehash{$name}/) {
            print "I found a site for $name, $enzymehash{$name}\n";
        }
    }

pseudocode

    For every key in the hash %enzymehash
        If the sequence in $sequence contains the value for that key:
            print "I found a site for (key), (value in %enzymehash for key)"
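The pseudocode can be turned into a complete script. This is one possible sketch, not the only answer: the sub name enzymes_cutting and the test sequence are invented, and the hash uses the standard recognition sites for the three enzymes.

```perl
use strict;
use warnings;

my %enzymehash = (
    EcoRI   => 'GAATTC',
    BamHI   => 'GGATCC',
    HindIII => 'AAGCTT',
);

# Return the names of the enzymes whose site occurs in the sequence,
# in alphabetical order.
sub enzymes_cutting {
    my ($sequence, %sites) = @_;
    my @hits = grep { $sequence =~ /$sites{$_}/ } keys %sites;
    @hits = sort @hits;
    return @hits;
}

my $sequence = 'TTGGATCCAAGAATTCTT';   # made-up test sequence
for my $name (enzymes_cutting($sequence, %enzymehash)) {
    print "I found a site for $name, $enzymehash{$name}\n";
}
```

Because the hash keeps each site next to its enzyme name, the report can say which enzyme was found, which parallel arrays made awkward.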

Putting data in a hash

    my %hash;
    while (<FILE>) {
        if (/stuff(important stuff) more stuff (best stuff)/) {
            $hash{$1} = $2;
        }
    }

Or….

    while (my $line = <FILE>) {
        chomp $line;
        my @tmp = split /\t/, $line;
        $hash{$tmp[0]} = $tmp[1];
    }

pseudocode

    Create an empty hash %hash
    For every line in the file FILE:
        if the line matches the regex: stuff(important stuff) more stuff (best stuff)
            then store (important stuff) as a hash key and (best stuff) as a value for that key
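The split version can be tested without a file on disk. In this sketch the sub name load_pairs and the sample data are made up, and the data is read through an in-memory filehandle (opening a reference to a string) so the example is self-contained.

```perl
use strict;
use warnings;

# Read "key<TAB>value" lines from an open filehandle into a hash.
# Lines without a tab are skipped.
sub load_pairs {
    my ($fh) = @_;
    my %hash;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        next unless defined $value;
        $hash{$key} = $value;
    }
    return %hash;
}

# Demo data held in a string; opening a reference to a scalar gives a
# filehandle that reads from memory instead of from disk.
my $data = "EcoRI\tGAATTC\nBamHI\tGGATCC\nnot a pair\n";
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!\n";
my %sites = load_pairs($fh);
close $fh;

print "$_ => $sites{$_}\n" for sort keys %sites;
```

If the same key appears on two lines, the later value silently replaces the earlier one, which is worth remembering when loading real data.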

Advanced regex

• The fun isn't over yet.
• You can match precise numbers of characters
• Any number of characters
• Positions in a line
• Precise formatting (spaces, tabs etc)
• You can get bits of the string you matched out and store them in variables
• You can use regexes to substitute or to translate
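Most of the features in this list fit into one short script. The following is a sketch under stated assumptions: the record in $line and the sub names transcribe and reverse_complement are invented for illustration.

```perl
use strict;
use warnings;

my $line = "gene1\tATGCCCGGGTTTAAA";   # made-up tab-separated record

# ^ and $ anchor the match to the start and end of the line, \t is a tab,
# {3} means exactly three of the preceding item, * means any number of them.
# The two pairs of brackets capture the name and the sequence.
if ($line =~ /^(\w+)\t([ACGT]{3}[ACGT]*)$/) {
    print "name: $1, sequence: $2\n";
}

# Substitution: s/// replaces matched text. Here every T becomes U.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;
    return $rna;
}

# Translation: tr/// maps characters one-for-one, handy for complementing.
sub reverse_complement {
    my ($dna) = @_;
    my $comp = scalar reverse $dna;
    $comp =~ tr/ACGT/TGCA/;
    return $comp;
}

print transcribe('ATGC'), "\n";           # AUGC
print reverse_complement('ATGC'), "\n";   # GCAT
```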

Grabbing bits of the regex

• The fun isn't over yet.

    my $blastline = "Query= AT1g34399 gene CDS";
    $blastline =~ /Query= (.+) gene/;
    my $atgnumber = $1;
    print "The accession number is $atgnumber\n";

Whatever matches the bit of the regex within brackets is stored in the special variable $1, which you can then use for other stuff. If you put another pair of brackets in, its match will be stored in $2.
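A second pair of brackets works the same way. In this sketch the sub name parse_query is made up, and treating the word after "gene" as a type label is an assumption made only for the example. Testing the match in an if before reading $1 and $2 avoids picking up values left over from an earlier successful match.

```perl
use strict;
use warnings;

# Pull the accession and the word after "gene" out of a Query= line.
# Returns an empty list if the line does not match.
sub parse_query {
    my ($blastline) = @_;
    if ($blastline =~ /Query= (\S+) gene (\S+)/) {
        return ($1, $2);   # first and second pair of brackets
    }
    return;
}

my ($atgnumber, $type) = parse_query('Query= AT1g34399 gene CDS');
print "The accession number is $atgnumber and the type is $type\n";
```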

Using modules

• You can use other people's modules, including those that come with Perl. These provide extra commands, or change the way your Perl script behaves. E.g.

    use strict;
    use warnings;
    use Bio::Perl;

You will see these stacked up at the beginning of more complicated Perl scripts. Some modules come with Perl (strict, warnings; see man perlmod); others you need to download and add in yourself.

Using strict

• We have talked about using "my" the first time you use a variable
• I recommend you always have

    use strict;

at the top of your script. That way, if you mistype the name of a variable you declared with my, Perl will refuse to run the script and tell you which name it did not recognize.
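The effect of strict can be seen without crashing a script by compiling a snippet with string eval. This is only a demonstration device: the sub name strict_error and the variable names in the snippets are made up.

```perl
use strict;
use warnings;

# Compile and run a snippet of Perl with string eval.
# Returns the error message, or an empty string if there was none.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

# $count is declared with my; the first snippet then mistypes it as $cuont.
my $typo = 'my $count = 1; $cuont = 2';
my $fine = 'my $count = 1; $count = 2';

print "typo: ", (strict_error($typo) ? "caught by strict" : "not caught"), "\n";
print "fine: ", (strict_error($fine) ? "caught by strict" : "no complaint"), "\n";
```

Without use strict, the mistyped $cuont would quietly become a new, separate variable and $count would keep its old value.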

A last exercise?...

• So: how might hashes help you solve this?

• Open up a BLAST output file
• Spit out the name of the query sequence, the top hit, and how many hits there were.

Programming projects

• Now it's time to think of your programming projects.
• Hopefully you have an idea; we'll discuss how feasible each one is in the time available
• If not, here are some suggestions

Suggested program functions

• Translate a cDNA into protein, and then check it against the Pfam database for HMM hits.
• Make a real restriction map of a DNA sequence, with predicted fragment sizes.
• Align proteins of a favorite family, open the alignment and find residues that are totally conserved.
• Perform BLAST against the latest version of the database files for a particular organism, checking whether the user has the latest files and downloading them if not.
• Design PCR primers, to make a fragment size chosen by the user, for a sequence input from a fasta file.
• Check whether primer sites are unique in a sequenced, or partially sequenced, genome, and give an "electronic PCR" result.
• Output an XML formatted version of a BLAST or HMMER text file.
• Analyze codon usage in a protein coding DNA sequence and calculate the Ka/Ks ratio.
