Ensembl Modules and MySQL

Transcript Ensembl Modules and MySQL

Ensembl Modules and MySQL
1
SQL and Database Tables
• Quick Examples
2
exon
transcript_id
exon_num = 3
sequence_start
sequence_stop
project
id
name = pro1
description
date
set_table
id
project_id
name =testset
date
description
intron
transcript_id
intron_num = 3
sequence_start
sequence_stop
primer_pair
id
transcript_id
left_primer_id
right_primer_id
transcript
id
sequence_id
source = Ensembl
source_id
target_set_info
set_id
target_id
rank = 5
cas_rank
cas_options
target
id
date
gene_name = BBS4
description
accession
status
select id from target where gene_name = "BBS4";
sequence
id
target_id
type = nucleotide
sequence = ATG…
chr_name = 15
strand = 1
genomic_start = 15,123,120
genomic_stop = 16,378,131
source
source_id
refresh
3
MySQL Demo with Ensembl
mysql -u anonymous -h ensembldb.ensembl.org
show databases;
show databases like "%omo%core%";
use homo_sapiens_core_47_36i;
show tables;
select count(*) from exon;
show columns from gene;
select * from xref limit 10
select * from xref where dbprimary_acc = "NM_000777";
select stable_id, gene.gene_id from gene_stable_id, gene,
transcript, object_xref, xref
where gene_stable_id.gene_id = gene.gene_id
and gene.gene_id = transcript.gene_id
and transcript.transcript_id = object_xref.ensembl_id
and xref.xref_id = object_xref.xref_id
and xref.dbprimary_acc = 'NM_000777';
select * from transcript where gene_id = 17393;
select * from exon_transcript where transcript_id = 33341;
select * from exon where exon_id = 193252;
4
Ensembl Schema
Core Schema
http://www.ensembl.org/info/docs/api/core/s
chema/index.html#exon_stable_id
API Tutorial:
http://www.ensembl.org/info/docs/api/core/c
ore_tutorial.html
5
Code Development
1) Generate random sequence
ATGCCCGCTGAGT
2) Generate formatted random sequence
1 ATGCCCGCTT TGACCCTTTA 20
3) Generate random sequence, translated into protein, and
formatted…
•
Code revision
–
–
–
adding functionality and features
may introduce bugs that are not discovered until much later
useful to examine the changes to code, that may have caused
bugs
6
Code Development Solutions
• May retain a copy of every version of
every file
– have complete record
– redundant and waste of space
– responsibility on developer to maintain
revision history
– Example (V1, V2, V3 experiment, V4
unfinished feature, return in 6 months?)
7
Multi-coder Environment
•
•
•
•
•
•
•
Developers D1, D2, and D3
Source code S1, S2, S3, S4, S5.
D1 copies S1 and makes changes
D2 copies S1 and makes changes
D2 returns S1
D1 returns S1
Clearly, this is ineffective for managing
and integrating changes
8
Brief Overview of CVS
CVS – Concurrent Versions System
CVS
– only stores differences between files/versions
– uses repository structure
•
•
•
•
•
check out
check in
lock
branching, merging
etc
• Reference
– http://www.gnu.org/software/cvs/
– https://www.cvshome.org/
9
Installing Ensembl Modules
Sample program – ens4.pl (simple demo program that obtains exons for a
particular gene from Ensembl database, for given accession number, and
Ensembl Gene ID)
When connected to Ensembl's MySQL database
% mysql -u anonymous -h ensembldb.ensembl.org
To get a list of their current databases. Find the most recent (highest numbers)
version of the homo_sapiens core database.
type
% show databases;
Example: homo_sapiens_core_47_36i
Example: homo_sapiens_core_25_36
The final two numbers represent the Ensembl code version and the NCBI
human build,
respectively (i.e. Ensembl modules 25 and NCBI Human Build 36).
In this case, you should be using Ensembl code 47 to do the following:
10
NO LONGER VALID
(for CSS)
%touch ~/.cvspass
%chmod 755 ~/.cvspass
create the directory
%mkdir Ensembl_modules-41
enter the directory
%cd Ensembl_modules-41
type the following:
%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster login (when prompted, the password is
CVSUSER) -- yes, in all CAPS
%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster checkout -r branch-ensembl-41 ensembl
%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster checkout -r branch-ensembl-41 ensembl-external
%cvs -d :pserver:[email protected]:/cvsroot/CVSmaster checkout -r branch-ensembl-41 ensembl-lite
Note this is all about 9 Meg
Make symbolic link called "Ensembl_modules-current" to point to your newly created directory of modules:
%cd ..
%ln -s Ensembl_modules-41 Ensembl_modules-current
11
http://www.ensembl.org/info/software/api_installation.html
# -- Clearly this assumes a Unix flavor -- Create an installation directory
$ cd
$ mkdir src
$ cd src
$ cvs -d :pserver:[email protected]:/home/repository/bioperl login
Logging in to :pserver:[email protected]:2401/home/repository/bioperl
CVS password: cvs
Install BioPerl (version 1.2.3)
$ cvs -d :pserver:[email protected]:/home/repository/bioperl checkout -r bioperl-release-1-2-3 bioperl-live
Log into the Ensembl CVS server at Sanger (using a password of CVSUSER):
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl login
CVS password: CVSUSER
Install the Ensembl Core Perl API for version 47
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl checkout -r branch-ensembl-47 ensembl
If required, install the Ensembl Variation Perl API for version 47
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl checkout -r branch-ensembl-47 ensembl-variation
If required, install the Ensembl Compara Perl API for verion 47
$ cvs -d :pserver:[email protected]:/cvsroot/ensembl checkout -r branch-ensembl-47 ensembl-compara
NB: You can install as many Ensembl APIs as you need in this way.
12
To install Ensembl modules -assumes you do need to have
BioPerl modules installed
(used to be separate step)
13
Run Program
• Now put the following program into your
"src" directory – and it should run.
14
#!/usr/local/bin/perl
use lib "bioperl-live";
# you MAY have to use: use lib "bioperl-live/bioperl-live";
use lib "ensembl/modules"; #
use lib "ensembl/ensembl/modules";
use Bio::EnsEMBL::DBSQL::DBAdaptor;
#my $host = "kaka.sanger.ac.uk";
my $host = "ensembldb.ensembl.org";
my $user = "anonymous";
#my $dbname = "homo_sapiens_core_41_36c";
my $dbname = "homo_sapiens_core_47_36i";
my $accession_num = "NM_000777";
my $Ensembl_gene_id = "ENSG00000106258";
my $flank_length = 5000;
my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(
-host => $host,
-user => $user,
-dbname => $dbname);
my $gene_adaptor = $db->get_GeneAdaptor();
my @genes = @{$gene_adaptor->fetch_all_by_external_name('NM_000777')};
foreach my $gene (@genes) {
my $string = feature2string($gene);
print "$string\n";
}
sub feature2string {
my $f = shift;
my $stable_id = $f->stable_id();
my $name = $f->external_name();
my $seq_region = $f->slice->seq_region_name();
my $start = $f->start();
my $end = $f->end();
my $strand = $f->strand();
return "$stable_id: $name $seq_region:$start-$end ($strand)";
}
15
Output
ENSG00000106258: CYP3A5 7:99083759-99115557 (-1)
Doesn't seem like much, but remember:
1) Using the language "perl"
2) Using other peoples software (modules)
3) Accessing genomic data in a database in England
4) Accessing data programatically
16
Look at API
Ensembl API (full):
http://www.ensembl.org/info/docs/api/Pdoc/index.html
Ensembl->gene_adaptor->fetch_all_by_external_name
@genes = @{$gene_adaptor->fetch_all_by_external_name('BRCA2')};
17
From the API…
# Fetch all clones from a slice adaptor (returns a list reference)
my $clones_ref = $slice_adaptor->fetch_all('clone');
# If you want a copy of the contents of the list referenced by
# the $clones_ref reference...
my @clones = @{$clones_ref};
# Get the first clone from the list via the reference:
my $first_clone = $clones_ref->[0];
18
Object adaptors have internal knowledge of the underlying database schema
and use this knowledge to fetch, store and remove objects (and data) from
the database. This way you can write code and use the Ensembl Core API
without having to know anything about the underlying databases you are
using.
Object adaptors are obtained from the Registry via a method named
get_adaptor(). To obtain a Slice adaptor or a Gene adaptor (which retrieve
Slice and Gene objects respectively) for Human, do the following after
having loaded the Registry, here called $registry, as above:
my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
Don't worry if you don't immediately see how useful this could be. Just
remember that you don't need to know anything about how the database is
structured, but you can retrieve the necessary data (neatly packaged in
objects) by asking for it from the correct adaptor. Throughout the rest of this
document we are going to work through the ways the Ensembl objects can
be used to derive the information you want.
19
UCSC
• http://genome.ucsc.edu/FAQ/FAQdownloa
ds#download29
20
#### genome-mysql.cse.ucsc.edu
use DBI;
my ($dsn) = "DBI:mysql:hg18:genome-mysql.cse.ucsc.edu";
my ($username) = "genome";
my ($passwd) = "";
my ($query);
my $dbh = DBI->connect ($dsn, $username, $passwd,{RaiseError=>1});
if (! defined $dbh)
{ print "\nConnect to database(Human_annot_mar06): FAILED\n"; }
else { print "\nConnect to database (gh18): SUCCESS\n"; }
$gene = "BBS4";
my $string = "SELECT geneName, name, exonStarts, exonEnds, chrom, strand FROM refFlat WHERE geneName =
'$gene'";
my $sth = $dbh->prepare($string);
$sth->execute();
while(my @row = $sth->fetchrow_array) {
$GENENAME = $row[0];
$NAME = $row[1];
$EXONSTARTS = $row[2];
$STRAND = $row[5];
}
$sth->finish();
print "geneName = $GENENAME\n";
print "Name = $NAME\n";
print "strand = $STRAND\n";
21
End
22
Output
Intron: 11 -9247 -6928
Exon: 12 -6927 -6768
sequence_start = -6927
sequence_stop = -6768
exon length= 160
exon start, exon_stop 6768 6927
exon sequence:
CTGTGTTTCTTTACAAGGTTTGAAGGAGAAGTTCTGAAGGACTCTGATTAGAGCAAGTTTCAT
GTTCATGAGAGCAAACCTCATGCCAATGCAGTTTCTGGGTCCAGTTCCAAAGGGTGTGTATA
TGTAAGGATCTATGCTGTCCTTCTTCTTACTGAAC
Intron: 12 -6767 -5096
Exon: 13 -5095 -5000
sequence_start = -5095
sequence_stop = -5000
exon length= 96
exon start, exon_stop 5000 5095
exon sequence:
TCATTCTCCACTTAGGGTTCCATCTCTTGAATCCACCTTTAGAACAATGGGTTTTTCTGGTTGA
AGAAGTCCTTGCGTGTCTAATTTCAAGGGGAT
chr = chr7
seq length= 41692
23
Installing bioperl (Linux)
3.5) mkdir ~/perl
3.6) mkdir ~/perl/bioperl
3.8) cd bioperl-1.2.3
4)
perl Makefile.PL LIB=~/perl/bioperl
(Do it this way -- with "LIB" -- recently changed slide)
make test
make install (see installing in private space on next slides)
To uninstall, just delete ~/perl/bioperl and ~/perl/bioperl-1.2.3
Note: version -1.2.3 was the current version when I made this slide -- it may
have updated since.
24
5) To use:
#!/usr/local/bin/perl
use lib "~/local/bioperl/"; # this is supposed to work ,but did NOT on CSS
use Bio::Tools::BPlite; # Need -- LIB prefix for this to work.
csh 5.1) setenv PERL5LIB ~/perl/bioperl
bash 5.1) PERL5LIB=~/perl/bioperl; export PERL5LIB
mac (bash) 5.1) PERL5LIB=~/perl/bioperl; export PERL5LIB
6) To make docs work (I would just put this in your .cshrc file:
set path = ($path ~/perl/bioperl/lib/site_perl/5.8.1)
PATH=$PATH:~/perl/bioperl/lib/site_perl/5.8.1; export PATH
Test with:
cd
perldoc Bio::SearchIO
FINALLY, please note that the version numbers change over time, and the actual paths may very a
little between CPAN and/or bioperl.org
It make take some trial and error (it usually does for me).
NOTE TO SELF -- check out the CPAN installer (its much easier)
25
Using Modules
Finally, need DBI.pm
% mkdir modules
% cd modules
% ftp ftp.cpan.org (login: ftp passwd: [email protected])
% bin
% cd /pub/CPAN/modules/by-module/DBI
% get DBI-1.53.tar.gz
% cd ../DBD
% get DBD-mysql-3.0008.tar.gz
% gunzip DBI-1.53.tar.gz
% tar –xvf DBI-1.53.tar
% cd DBI-1.53
% perl Makefile.PL LIB=~/modules (**** changed this slide)
% make
% make install
(set up Environment for DBI -- next slide), then install DBD
26
Connecting /w Perl
% mkdir modules
(put modules in this dir)
Need DBI, DBD-mysql
gunzip, and tar
(do this for both modules)
perl Makefile.PL LIB=~/modules
make
make install
csh5.0) setenv PERL5LIB "$HOME/modules:$HOME/perl/bioperl"
bash 5.1) PERL5LIB=$HOME/modules:$HOME/perl/bioperl; export PERL5LIB
(note CSS has upgraded perl from 5.6.0 – used the last time)
27
Using Modules
CSH
setenv PERL5LIB
"$HOME/local/bioperl/lib/site_perl/5.8.1:$HOME/modules/lib/site_perl/5.8.1:$HOME/Ensembl_mo
dules-41/ensembl/modules:$HOME/Ensembl_modules-41/ensemblexternal/modules:$HOME/Ensembl_modules-41/ensembl-lite/modules
:$HOME/modules:$HOME/perl/bioperl"
BASH
PERL5LIB="$HOME/local/bioperl/lib/site_perl/5.8.1:$HOME/modules/lib/site_perl/5.8.1:$HOME/Ense
mbl_modules-41/ensembl/modules:$HOME/Ensembl_modules-41/ensemblexternal/modules:$HOME/Ensembl_modules-41/ensembllite/modules:$HOME/modules:$HOME/perl/bioperl"
export PERL5LIB
This (below) would work if we used the LIB prefix -- but that makes it a pain to install DBD. So just rely
on environment settings. NOTE -- if you log out -- and don’t save the environment setting
somewhere (such as .chsrc, or .bashrc, you will have to re-type the command).
DBI used with:
use lib "~/modules/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi/";
Ensembl modules can then be used in a perl program with:
use lib "~/Ensembl_modules-current/ensembl/modules";
use lib "~/Ensembl_modules-current/ensembl-external/modules";
use lib "~/Ensembl_modules-current/ensembl-lite/modules";
28
Another Module
Finally, need DBD.pm
% cd modules
% ftp ftp.cpan.org (login: ftp passwd: [email protected])
% bin
% cd /pub/CPAN/modules/by-module/DBD
% get DBD-mysql-2.9003.tar.gz
% quit
% gunzip DBD-mysql-2.9003.tar.gz
% tar –xvf DBD-mysql-2.9003.tar.gz
% cd DBD-mysql-2.9003.tar.gz
% perl Makefile.PL LIB=~/modules
% make
% make install
29
Does not work on CSS
•
Concluded either
–
–
version of Perl incompatible
port blocking
./ens3.pl
current core DB: homo_sapiens_core_18_34
-------------------- EXCEPTION -------------------MSG: Could not connect to database homo_sapiens_core_18_34 user anonymous using
[DBI:mysql:database=homo_sapiens_core_18_34;host=ensembldb.ensembl.org;port
=3306] as a locator
STACK Bio::EnsEMBL::DBSQL::DBConnection::new
/user/eng/tbraun/Ensembl_modules18/ensembl/modules/Bio/EnsEMBL/DBSQL/DBConnection.pm:125
STACK Bio::EnsEMBL::DBSQL::DBAdaptor::new /user/eng/tbraun/Ensembl_modules18/ensembl/modules/Bio/EnsEMBL/DBSQL/DBAdaptor.pm:79
STACK main::dbconnect_Ensembl ./ens3.pl:150
STACK toplevel ./ens3.pl:26
------------------------------------------30
However…
• Installed local version of MySQL
• Needed modules
– DBI (perl database interface)
– DBD (database specific interface – mysql)
• Realized that I had failed to install DBD
with Ensembl modules
31
End
32
• No longer need to install BioPerl
separately – Ensembl install instructions
installs BioPerl now.
33
Install BioPerl
• I'll assume Windows XP, Eclipse (if you are using Linux/Unix, then
the default documentation with Bioperl is better than these slides
www.bioperl.org).
• Dowload Bioperl:
• http://pdb.eng.uiowa.edu/~tabraun/biotech/2007/modules/bioperl1.4.zip
• The "official version can be found from here:
(http://code.open-bio.org/cgi/viewcvs.cgi/bioperl-live/bioperllive.tar.gz?tarball=1)
• Move this zip file into your Eclipse "workspace" directory and unzip it
(mine is H:\windowsdata\workspace)
• You will need an "unzip" program. Most default versions of XP
comes with one. If you don't have one, you can download a free
one:
– http://www.download.com/jZip/3000-2250_410761563.html?tag=lst-6
• Now in your perl program -- you will need to add line:
34
• use lib "H:\windowsdata\workspace\bioperl-live";
BioPerl continued
Depending on if your "zip" program creates a directory for you, you may have
to put in:
use lib "H:\windowsdata\workspace\bioperl-live\bioperl-live";
You will also need 2 other modules (DBD and DBI). These are used by the
Ensembl modules to allow a perl program to connect to a mySql database.
• DBI - Database independent interface for Perl
• DBD::mysql - MySQL driver for the Perl5
Database Interface (DBI)
I tried to compile a library for Windows to make availabe -- but was unable to
get it to work. Therefore I asked CSS to install these two modules for me -since I do not have administrative permission on CSS nodes.
35

Ensembl Modules and MySQL

Transcript Ensembl Modules and MySQL

Directory