BioPerl - University of Iowa

Download Report

Transcript BioPerl - University of Iowa

BioPerl
1
Object Oriented Programming
Continued – BioPerl Install
2
System Config
file: .ncbirc
[NCBI]
Data=~/blast/blast-2.2.6/data
# this tells blast applications where certain data files are located – such as
BLOSUM matrices
file: .cshrc (for tcsh/csh)
# this line tells blast applications where to look for pre-formatted DB's
setenv BLASTDB ~/blast/blast-2.2.6/blastdb
# this line is a plain environment variable – adding the blast applications to the
"path" variable
set path = ( $path ~/blast/blast-2.2.6 )
Example
blastall –p blastn –i seq –d yeast.nt
3
Example – Bio::SearchIO
#!/usr/bin/perl
# blastPars.pl
# taken almost verbatim from perldoc Bio::SearchIO
use strict;
use Bio::SearchIO;
my $in = new Bio::SearchIO(-file => 'bmp4.out');
while( my $result = $in->next_result ) {
while( my $hit = $result->next_hit ) {
while( my $hsp = $hit->next_hsp ) {
print "Hit= ",
$hit->name,
",Length=", $hsp->length('total'),
",Percent_id=", $hsp->percent_identity,
",hit_string=", $hsp->hit_string,
"\n";
}
}
}
4
Hit= mm3_dna,Length=14,Percent_id=100,hit_string=ttaattgtaatttt
Hit= mm3_dna,Length=13,Percent_id=100,hit_string=cttccctcctccc
Hit= mm3_dna,Length=13,Percent_id=100,hit_string=ggcaataacacca
Hit= mm3_dna,Length=12,Percent_id=100,hit_string=ccttttaggcca
Hit= mm3_dna,Length=12,Percent_id=100,hit_string=tgttttaatcat
Hit= mm3_dna,Length=12,Percent_id=100,hit_string=gttattttgttt
Hit= mm3_dna,Length=12,Percent_id=100,hit_string=tccttctctttt
Hit= mm3_dna,Length=12,Percent_id=100,hit_string=taaactgttaaa
Hit= mm3_dna,Length=11,Percent_id=100,hit_string=caaaaggagga
Hit= mm3_dna,Length=11,Percent_id=100,hit_string=tcaaagtaaat
5
HW – Part 2
6
Installing bioperl
Condensed instructions for installing bioperl on CSS (aka ICAEN): (see also INSTALL
that comes with bioperl)
First – look at www.cpan.org
0) perl –version
Note, this will only work with perl version 5.6.0 or higher. (so on CSS, use NX-client
download bioperl-1.4.tar.gz
(~3 Meg, from www.bioperl.org/DIST, OR
www.cpan.org/modules/01modules.index.html, or ftp.cpan.org in
/pub/CPAN/modules/by-module/Bio)
1) ftp ftp.cpan.org
1.1) bin
1.2) cd pub/CPAN/modules/by-module/Bio
1.3) get bioperl-1.*.*.tar.gz #bioperl-1.4.tar.gz
1.4) quit
2) gunzip bioperl-1.4.tar.gz
3) tar -xvf bioperl-1.4.tar ( this is approximately 13 meg)
7
Installing bioperl
Condensed instructions for installing bioperl on CSS (aka ICAEN): (see also INSTALL
that comes with bioperl)
First – look at www.cpan.org
0) perl –version
Note, this will only work with perl version 5.6.0 or higher. (so on CSS, use NX-client
download bioperl-1.4.tar.gz
(~3 Meg, from www.bioperl.org/DIST, OR
www.cpan.org/modules/01modules.index.html, or ftp.cpan.org in
/pub/CPAN/modules/by-module/Bio)
1) ftp ftp.cpan.org
1.1) bin
1.2) cd pub/CPAN/modules/by-module/Bio
1.3) get bioperl-1.*.*.tar.gz #bioperl-1.4.tar.gz
1.4) quit
2) gunzip bioperl-1.4.tar.gz
3) tar -xvf bioperl-1.4.tar ( this is approximately 13 meg)
8
Installing bioperl
3.5) mkdir ~/perl
3.6) mkdir ~/perl/bioperl
3.8) cd bioperl-1.4
4) perl Makefile.PL PREFIX=~/perl/bioperl (if you do it this way -- the "lib"
won't work)
make
make test
make install (see installing in private space on next slides)
To uninstall, just delete ~/perl/bioperl and ~/perl/bioperl-1.4
9
5) To use:
#!/usr/local/bin/perl
use lib "~/perl/bioperl/"; # this is supposed to work ,but did NOT on CSS
use Bio::SearchIO;
Instead, set environment variable:
Bash 5.1) PERL5LIB=~/perl/bioperl/lib/perl5/site_perl/5.10.0; export PERL5LIB
Csh 5.1) setenv PERL5LIB ~/perl/bioperl/lib/perl5/site_perl/5.10.0
Mac (bash) 5.1) PERL5LIB=~/perl/bioperl/lib/perl5/site_perl; export PERL5LIB
6) To make docs work (I would just put this in your .cshrc file:
Bash: PATH=$PATH:~/perl/bioperl/lib/site_perl/5.10.10; export PATH
Csh: set path = ($path ~/perl/bioperl/lib/site_perl/5.10.0)
Test with: perldoc Bio::SearchIO
7) Test with sample program
FINALLY, please note that the version numbers change over time, and the actual paths may very a
little between CPAN and/or bioperl.org
It make take some trial and error (it usually does for me).
10
Try it – Bio::SearchIO
#!/usr/bin/perl
# blastPars.pl
# taken almost verbatim from perldoc Bio::SearchIO
use strict;
use Bio::SearchIO;
my $in = new Bio::SearchIO(-file => 'bmp4.out');
while( my $result = $in->next_result ) {
while( my $hit = $result->next_hit ) {
while( my $hsp = $hit->next_hsp ) {
print "Hit= ",
$hit->name,
",Length=", $hsp->length('total'),
",Percent_id=", $hsp->percent_identity,
",hit_string=", $hsp->hit_string,
"\n";
}
}
}
11
INSTALLING BIOPERL IN A PERSONAL OR PRIVATE MODULE AREA
If you lack permission to install perl modules into the
standard site_perl/ system area you can configure bioperl to
install itself anywhere you choose. Ideally this would
be a personal perl directory or standard place where you
plan to put all your 'local' or personal perl modules.
Note: you _must_ have write permission to this area.
Simply pass a parameter to perl as it builds your system
specific makefile.
Example:
perl Makefile.PL LIB=/home/users/dag/My_Local_Perl_Modules
make
make test
make install
This tells perl to install bioperl in the desired place, e.g.:
/home/users/dag/My_Perl_Modules/Bio/Seq.pm
Then in your Bioperl script you would write (NOTE ~/dag/My_Local_Perl_Modules will NOT work):
use lib "/home/users/dag/My_Local_Perl_Modules";
use Bio::Seq;
To see "perldoc Bio::SearchIO -- you would need to be in directory ~/dag/My_Local_Peral_Modules
12
SearchIO.pm
http://www.bioperl.org/wiki/HOWTO:SeqIO
References:
http://bioperl.org/
http://bioperl.org/Core/Latest/faq.html
http://cpan.org
13
More notes on bioperl: Windows
http://bioperl.org/SRC/bioperl-live/INSTALL.WIN
1) Quick instructions for the impatient, lucky, or experienced user.
==========================================
Download the ActivePerl MSI from
http://www.activestate.com/Products/ActivePerl/
Run the ActivePerl Installer (accepting all defaults is fine).
Open a command prompt (Menus Start->Run and type cmd) and run the PPM
shell (C:\>ppm).
Add two new PPM repositories with the following commands:
ppm> rep add Bioperl http://bioperl.org/DIST
ppm> rep add Kobes http://theoryx5.uwinnipeg.ca/ppms
ppm> rep add Bribes http://www.Bribes.org/perl/ppm
Install Bioperl with the following commands:
ppm> search Bioperl
This returns a numbered list of packages with corresponding version
numbers etc. with "Bioperl" in their name.
ppm> install <number>
Where <number> corresponds to the relevant package and version from the
numbered list obtained above.
Go to http://www.bioperl.org and start reading documentation.
14
Another way
"cpan"
ppm
15
Windows blast binaries?
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/L
ATEST/blast-2.2.14-ia32-win32.exe
16
Bioperl
• large collection of Perl modules (extensions to
the Perl language) that aid in the task of writing
Perl code
• assists with sequence data and associated
annotation
• access to various types of databases remote
(GenBank, EMBL etc) and
• local (MySQL, flat files, GFF etc.) for storage
and retrieval of sequences.
• associated documentation and mailing list
(community of bioinformaticists)
17
Bioperl
• "most" bioinformatics and computational biology
applications are developed in Unix/Linux
environments
• more and more programs are being ported to
other operating systems like Windows, and
many users (often biologists with little
background in programming) are looking for
ways to automate bioinformatics analyses in the
Windows environment.
• Perl and Bioperl can be installed natively on
Windows NT/2000/XP.
• Most of the functionality of Bioperl is available
18
with this type of install
Bioperl
•
•
Some programs (BLAST for example) have been ported to Windows.
These can be installed and work quite happily with Bioperl in the native Windows environment.
•
fairly simple project OR only have access to a computer running Windows, and/or don't mind
bumping up against some limitations then Bioperl on Windows may be a good place for you to
start.
•
example, downloading a bunch of sequences from GenBank and sorting out the ones that have a
particular annotation or feature works great.
•
•
•
•
•
Running a bunch of your sequences against remote or local BLAST, parsing the output and
storing it in a MySQL database would be fine also.
Be aware that most Bioperl developers are working in some type of a UNIX environment (Linux,
OSX, Cygwin).
If you have problems with Bioperl that are specific to the Windows environment, you may be
blazing new ground and your pleas for help on the Bioperl mailing list may get few responses simply because no one knows the answer to your Windows specific problem.
One solution to this problem that will keep you working on a Windows machine it to install Cygwin,
a UNIX emulation environment for Windows.
19
Bioperl
• Perl is a programming language that has been extended a lot by the
addition of external modules.
• These modules work with the core language to extend the
functionality of Perl.
• Bioperl is one such extension to Perl.
• These modular extensions to Perl sometimes depend on the
functionality of other Perl modules and this creates a dependency.
• Some Perl modules are so fundamentally useful that the Perl
developers have included them in the core distribution of Perl - if
you've installed Perl then these modules are already installed
20
Bioperl
• Bioperl is actually a large collection of Perl
modules (over 1000 currently) and these
modules are split into six packages.
21
Bioperl
Bioperl Group
Functions
----------------------------------------------------------------bioperl (the core)
Most of the main functionality of Bioperl.
bioperl-run
Wrappers to a lot of external programs.
bioperl-ext
Interaction with some alignment functions
and the Staden package.
bioperl-db
Using bioperl with BioSQL and local
relational databases.
bioperl-microarray
Microarray specific functions.
bioperl-gui
Some preliminary work on a graphical user
interface to some Bioperl functions.
22
Miscellaneous
Various commands and techniques that did
not make it into other sections.
Useful as a review
Valuable (I've used them)
23
split
split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split
-- returns an array of strings
-- scans the string EXPR
-- splits the EXPR string into a list of substrings by delimiters
-- delimiters are defined by repeated pattern matching of the regular
expression PATTERN
-- if it doesn't match, the whole string (EXPR) is returned
-- if it matches once, you get 2 strings, etc.
-- if PATTERN is omitted, it splits on whitespaces after omitting leading
whitespaces (/\s+/
-- if EXPR is omitted, it splits $_
-- If LIMIT is specified, it splits the string into NO MORE than that many
fields
24
Examples
@words = split ' ', $text;
@tokens = split /[ |,]+/, $text;
($login, $passwd, $remainder) = split /:/, $_, 3;
# this splits on ":" – and assigns the first 2 to
variables, then the rest is stored in $remainder
because of the limit (3)
25
Split Examples
#!/usr/bin/perl
$text = "this is a test";
@words = split ' ',$text;
foreach (@words) {
print "$_\n";
}
# this is a test (on separate lines)
$text = "this is another, simple test";
@words = split /[ |,]+/,$text;
foreach (@words) {
print "$_\n";
}
# this is another simple test (on separate lines)
26
splice
splice ARRAY, OFFSET, LENGTH, LIST
splice ARRAY, OFFSET, LENGTH
splice ARRAY, OFFSET
-- removes the elements designated by
OFFSET and LENGTH, from ARRAY, and
replaces them with LIST
-- if LENGTH is omitted, then everything
after OFFSET is removed
-- returns the elements removed
27
splice examples
#!/usr/bin/perl
@array = qw/one two 3 4 5 six seven/;
@middle = ("three", "four", "five");
print "@array\n";
# one two 3 4 5 six seven
splice @array, 2, 3, @middle;
print "@array\n";
# one two three four five six seven
28
glob
glob EXPR
-- returns the file name expansions of EXPR
@files = glob "*";
@files = glob ".* *"; #multiple patterns separated
by spaces
@files = glob "*.pl";
29
system
system LIST
-- executes any program on the system
#!/usr/bin/perl
system("/mnt/r0-blastdb/blast-bin/blastall –p blastn –d
/mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out");
.
30
back ticks
#!/usr/bin/perl
$output = `/mnt/r0-blastdb/blast-bin/blastall –p
blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i
test.txt –o test.out`;
print "$output\n";
# command is passed on, and interpreted by the
shell
# output of command returned
31
Slides Deprecated
32
New Version of BPlite
• BPlite has actually been "deprecated"
– this means that its functionality has been
replaced by something else
– the code is still available and included, but will
not be supported by future versions
• Replaced by SearchIO
perldoc Bio::Tools::BPlite
perldoc Bio::SearchIO
33
OOP used extensively in BioPerl
A subject is a BLAST hit, which should not be confused with an HSP
(below).
A BLAST hit may have several alignments associated with it.
A useful way of thinking about it is that a subject is "analogous" to a
gene and HSPs are "analogous" to exons.
Subjects have one attribute (name) and one method (nextHSP).
An HSP is a high scoring pair, or simply an alignment.
Look at:
perldoc Bio::Tools::BPlite
34
BPlite Example –
what it looks like to use OOP
$report->query;
$report->database;
while(my $sbjct = $report->nextSbjct)
{
$sbjct->name;
while (my $hsp = $sbjct->nextHSP)
{
print "querySeq ".$hsp->querySeq."\n";
print "sbjctSeq ".$hsp->sbjctSeq."\n";
print "homologySeq ".$hsp->homologySeq."\n";
}
}
35
use Bio::Tools::BPlite;
$blast_file = "Chr16.0.out";
my $report = new Bio::Tools::BPlite('-file' => $blast_file);
$rp = $report->query;
$db = $report->database;
while(my $sbjct = $report->nextSbjct)
{
$sbjct->name;
while (my $hsp = $sbjct->nextHSP)
{
print "substart ".$hsp->subject->start."\n";
print "subjectend ".$hsp->subject->end."\n";
}
}
36
#!/usr/bin/perl -w
#print "percent ".$hsp->percent."\n";
#
#print "P ".$hsp->P."\n";
#Input: file_name (blast results file from RPS-BLAST)
#print "match ".$hsp->match."\n";
#Output: list of domain locations relative to database sequence,
#print "positive ".$hsp->positive."\n";
#
and perhaps the genomic sequence with domains emphasized #print "length ".$hsp->length."\n";
#
#print "querySeq ".$hsp->querySeq."\n";
# Note I broke 5.8.0 in:
#print "sbjctSeq ".$hsp->sbjctSeq."\n";
# /usr/lib/perl5/site_perl
#print "homologySeq ".$hsp->homologySeq."\n";
# to mimic the fact that students probably do not have bio_perl
installed
if($hsp->P <= $cutoff)
use Bio::Tools::BPlite;
{
use Getopt::Long;
print "subjectseqname ".$hsp->subject->seqname."\n";
print "qstart ".$hsp->query->start."\n";
print "qend ".$hsp->query->end."\n";
if($#ARGV != 3) {
print "e = ".$hsp->P."\n";
die "usage: domainID.pl -bf file -c cutoff_value(0.001)\n";
#print "percent = ".$hsp->percent."\n"; #What is this???
}
print "match = ".$hsp->match."\n";
print "positive = ".$hsp->positive."\n";
&GetOptions("bf=s" => \$blast_file, "c=s" => \$cutoff);
print "length = ".$hsp->length."\n";
#print "NT query start ".(3*$hsp->query->start)." (assuming protein
my $report = new Bio::Tools::BPlite('-file' => $blast_file);
input)\n";
#open(FH,$blast_file);
#print "NT query qend ".(3*$hsp->query->end)."\n";
#my $report = new Bio::Tools::BPlite('-fh' => \*FH);
#print "querySeq ".$hsp->querySeq."\n";
print "\n";
$rp = $report->query;
}
print "rp = $rp\n";
#print "substart ".$hsp->subject->start."\n";
$db = $report->database;
#print "subjectend ".$hsp->subject->end."\n";
print "db = $db\n";
#print "subjectseqname ".$hsp->subject->seqname."\n";
while(my $sbjct = $report->nextSbjct)
{
$sbjct->name;
while (my $hsp = $sbjct->nextHSP)
{
#print "score ".$hsp->score."\n";
#print "bits ".$hsp->bits."\n";
#
$hsp->subject->overlaps($exon);
}
}
# the following line takes you to the next report in the stream/file
# it will return 0 if that report is empty,
# but that is valid for an empty blast report.
# Returns -1 for EOF.
37
Sample "look" file
rp = Random Sequence 500 10 (500 letters)
db = blastdb/yeast.nt
subjectseqname gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome
XVI, complete chromosome sequence
qstart 1
qend 16
e = 1.2
match = 16
positive = 16
length = 16
subjectseqname gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome
XVI, complete chromosome sequence
qstart 321
qend 335
e = 4.7
match = 15
positive = 15
length = 15
38
./domainID.pl -bf bmp4.out -c 10
-------------------- WARNING --------------------MSG: SeqFeatureI::seqname() is deprecated. Please use seq_id() instead.
--------------------------------------------------subjectseqname mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE
strand=? repeatMasking=none
qstart 1
qend 278
e = 1e-110
match = 258
positive = 258
length = 278
-------------------- WARNING --------------------MSG: SeqFeatureI::seqname() is deprecated. Please use seq_id() instead.
--------------------------------------------------subjectseqname mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE
strand=? repeatMasking=none
qstart 707
qend 818
e = 3e-32
match = 102
positive = 102
length = 112
39
End
40
Notes
• Note to self – exploring the percent
identity, gapped, and non-gapped would
be a great assignment
• requires random sequence, alignment
(clustalw)
41