Transcript Bioperl

INTRODUCTION TO BIOPERL
Gautier Sarah & Gaëtan Droc
BioPerl is …
- A set of Perl modules for manipulating genomic and other biological data
- An open source toolkit with many contributors
- A flexible and extensible system for bioinformatics data manipulation
Some things we can do
- Read in sequence data from a file in standard formats (FASTA, GenBank, EMBL, SwissProt, ...)
- Convert sequence file formats (sequence and alignment)
- Manipulate sequences: reverse complement, translate coding DNA sequence to protein
- Parse a BLAST-like report and get access to every bit of data in the report
Sequence file formats

Simple formats - without features
- FASTA

Rich formats - with features and annotations
- EMBL, GenBank, GFF3
- SwissProt, GenPept
- TIGRXML, BSML, InterPro (XML)
Simple formats
>ID Description(Free text)
AGTGATGATAGTGAGTAGGA
>gi|number|emb|ACCESSION
AGATAGTAGGGGATAGAG
>gi|number|sp|BOSS_7LES
MTMFWQQNVDHQSDEQDKQAKGAAPTKRLN
Building a sequence
#!/usr/bin/perl -w
use strict;
use Bio::Seq;

# Build a sequence object from a string and an identifier
my $seq = Bio::Seq->new(
    -seq        => 'ATGGGACCAAGTA',
    -display_id => 'example1'
);
print "Sequence name is ", $seq->display_id, "\n";
print "Sequence length is ", $seq->length, "\n";
print "Sub-sequence is ", $seq->subseq(1,3), "\n";
% perl ex2.pl
Sequence name is example1
Sequence length is 13
Sub-sequence is ATG
Bio::PrimarySeq : Primary Information
Method                     Description
$seq->seq                  Get/Set the sequence string
$seq->display_id           Get/Set the sequence identifier string
$seq->desc                 Get/Set the description string
$seq->length               Return the length of the sequence
$seq->subseq(start,end)    Get a sub-sequence as a string
$seq->trunc(start,end)     Get a sub-sequence as an object
$seq->revcom               Get the reverse complement (DNA only)
$seq->translate            Get the protein translation (DNA only)
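A short sketch of the object-returning methods in the table, reusing the sequence from the "Building a sequence" slide. The explicit -alphabet argument and the choice to truncate to 12 bases (so the length is a multiple of three) are assumptions made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;

my $seq = Bio::Seq->new(
    -seq        => 'ATGGGACCAAGTA',
    -display_id => 'example1',
    -alphabet   => 'dna',          # assumption: stated explicitly rather than guessed
);

# revcom, trunc and translate each return a new sequence object,
# so call ->seq on the result to get the string
my $revcom  = $seq->revcom;
my $cds     = $seq->trunc(1, 12);  # first 12 bases = 4 complete codons
my $protein = $cds->translate;     # standard genetic code by default

print "Reverse complement: ", $revcom->seq,  "\n";   # TACTTGGTCCCAT
print "First 12 bases:     ", $cds->seq,     "\n";   # ATGGGACCAAGT
print "Translation:        ", $protein->seq, "\n";   # MGPS
```

Unlike subseq, which returns a plain string, trunc keeps the result as an object, so further calls such as translate can be chained on it.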
Rich formats
Primary information
Taxonomic information
Bibliographic references
Features (with location)
+ Annotations
Sequence data
Features & Annotations

GFF format derived
GFF format
- "Generic Feature Format"
- Tab-delimited format
- 9 columns: sequence_id, source, primary_tag, start, stop, score, strand, frame, description
- Different versions of GFF (GFF1, GFF2 & GFF3)
  - Variation is in how the description column is formatted
  - For GFF3, 'primary_tag' column values must be terms from the Sequence Ontology
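The nine-column layout can be sketched in plain Perl, without any BioPerl module. The GFF3 line below is invented for illustration, and the hash keys simply follow the column names listed above (with "end" for stop and "attributes" for the description column).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical GFF3 line (tab-separated), made up for illustration
my $line = join("\t",
    'chr1', 'example', 'gene', 1000, 2000, '.', '+', '.', 'ID=gene1;Name=demo');

# Map the 9 tab-delimited columns onto named fields
my @names = qw(seq_id source primary_tag start end score strand frame attributes);
my %f;
@f{@names} = split /\t/, $line, 9;

# In GFF3 the last column holds key=value pairs separated by ';'
my %attr = map {
    my ($key, $value) = split /=/, $_, 2;
    ($key => $value);
} split /;/, $f{attributes};

print "$f{primary_tag} $attr{ID} on $f{seq_id}: $f{start}-$f{end} ($f{strand})\n";
# prints: gene gene1 on chr1: 1000-2000 (+)
```

Only the parsing of the last column is specific to GFF3; the first eight columns are split the same way in every version.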
Features & Annotations


- Derived from the GFF format
- Have a location on a sequence
  - start(), end() & strand() for location information
  - score(), frame(), primary_tag(), source_tag() for feature information
  - tags: tag/value pairs, passed as a hash reference (-tag) and read with get_tag_values()
- Bio::SeqFeature::Generic
- More details
  - http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
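As a rough sketch of the accessors listed above, a feature can be built by hand with Bio::SeqFeature::Generic and queried afterwards. The coordinates, sequence name and tag values here are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqFeature::Generic;

# Hypothetical gene feature; all values are made up for illustration
my $feat = Bio::SeqFeature::Generic->new(
    -seq_id      => 'chr1',
    -start       => 1000,
    -end         => 2000,
    -strand      => 1,
    -primary_tag => 'gene',
    -source_tag  => 'example',
    -tag         => { ID => 'gene1', Name => 'demo' },
);

# A tag can hold several values, so get_tag_values returns a list
my ($id) = $feat->has_tag('ID') ? $feat->get_tag_values('ID') : ('unknown');

print join("\t",
    $id, $feat->seq_id, $feat->primary_tag,
    $feat->start, $feat->end, $feat->strand), "\n";
```

The accessors used here are the same ones called on features returned by a parser, so code written against a hand-built feature should carry over unchanged.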
Convert format : Bio::SeqIO


- Read / write sequences
- Initialize
  - file: filename for input; prepend '>' for writing
  - format: format used for reading or writing
- Some supported formats

Format     Description
fasta      FASTA
genbank    GenBank DB
embl       EMBL DB
swiss      SwissProt DB

http://www.bioperl.org/wiki/HOWTO:SeqIO
Read in sequence and write out in
different format
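use strict;
use Bio::SeqIO;

# Input stream: GenBank records read from in.gb
my $in = Bio::SeqIO->new(
    -format => 'genbank',
    -file   => 'in.gb'
);
# Output stream: the leading '>' opens out.fa for writing
my $out = Bio::SeqIO->new(
    -format => 'fasta',
    -file   => '>out.fa'
);
while ( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
}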
Read GFF
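#!/usr/bin/perl
use strict;
use warnings;
use Bio::Tools::GFF;

# Usage: perl script.pl file.gff3 primary_tag
my $file = shift;
my $tag  = shift;
die "Usage: $0 file.gff3 primary_tag\n" unless defined $file && defined $tag;

my $in = Bio::Tools::GFF->new(
    -gff_version => 3,
    -file        => $file
);
# Print ID, sequence, start, end and strand for each feature of the requested type
while ( my $feature = $in->next_feature ) {
    if ( $feature->primary_tag() eq $tag ) {
        my ($id) = $feature->get_tag_values("ID");
        print join("\t", $id, $feature->seq_id, $feature->start,
                   $feature->end, $feature->strand), "\n";
    }
}
$in->close;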
Bio::SearchIO


- Parses analysis reports
- Can be split into 3 components
  - Result: one per query
  - Hit: sequence which matches the query (component of a Result)
  - HSP: High-scoring Segment Pair (component of a Hit)
- Implemented for BLAST, BLAT, FASTA, HMMER, Exonerate…
Bio::SearchIO
Can be split into 3 components:
- Result: one per query
- Hit: sequence which matches the query (component of a Result)
- HSP: High-scoring Segment Pair (component of a Hit)

Result
  Hit 1
    HSP 1
    HSP 2
  Hit 2
    HSP 1
Bio::SearchIO
use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(
    -format => 'blast',
    -file   => 'report.bls'
);
# Walk the Result > Hit > HSP hierarchy and report long, high-identity HSPs
while ( my $result = $in->next_result ) {
    while ( my $hit = $result->next_hit ) {
        while ( my $hsp = $hit->next_hsp ) {
            if ( $hsp->length('total') > 50 ) {
                if ( $hsp->percent_identity >= 75 ) {
                    print "Query=", $result->query_name,
                          " Hit=", $hit->name,
                          " Length=", $hsp->length('total'),
                          " Percent_id=", $hsp->percent_identity, "\n";
                }
            }
        }
    }
}
HOWTO Parsing with Bio::SearchIO

Table of methods
http://www.bioperl.org/wiki/HOWTO:SearchIO
Things I'm skipping (here)
- Bio::Tools::SeqStats - base-pair freq, dicodon freq, etc.
- Bio::Tools::SeqWords - count n-mer words in a sequence
- Bio::SeqUtils - mixed helper functions
- Bio::Restriction - find restriction enzyme sites and cut sequence
- Bio::Graphics - represent information graphically
Link


HOWTO :
http://www.bioperl.org/wiki/HOWTOs
CPAN BioPerl : Modules Documentation
http://search.cpan.org/~cjfields/BioPerl-1.6.1/