Transcript Lab7

Lab7
QRNA, HMMER, PFAM
Sean Eddy’s Lab
• http://selab.janelia.org/software.html
HMMER
Introduction
• HMMER2 is an implementation in UNIX (Linux, MacOS)
platform of profile hidden Markov model, whose
source code, executables, and user guide can be
downloaded from http://hmmer.janelia.org/
• The experiment of HMMER is to look for known
domains in a query sequence by searching a single
sequence again a library of HMMs.
• One such library is PFAM, and you can also create your
own library using HMMER
HMMER executables
1.
hmmalign ‐ Align sequences to an existing model.
2.
hmmbuild ‐ Build a model from a multiple sequence alignment.
3.
hmmcalibrate ‐ Takes an HMM and empirically determines parameters that are used to make
searches more sensitive, by calculating more accurate expectation value scores (E‐values).
4.
hmmconvert ‐ Convert a model file into different formats, including a compact HMMER 2 binary
format, and “best effort” emulation of GCG profiles.
5.
hmmemit ‐ Emit sequences probabilistically from a profile HMM.
6.
hmmfetch ‐ Get a single model from an HMM database.
7.
hmmindex ‐ Index an HMM database.
8.
hmmpfam ‐ Search an HMM database for matches to a query sequence.
9.
hmmsearch ‐ Search a sequence database for matches to an HMM.
Installation
•
Simple installation
–
–
–
•
Download the current version of HMMER “hmmer‐2.3.2.bin.intel‐linux.tar.gz” from
http://hmmer.janelia.org/#download;
Unpack the software by typing “tar –xvf hmmer‐2.3.2.bin.intel‐linux.tar.gz” in the command line. You will see
a new directory “hmmer‐2.3.2.bin.intel‐linux”.
Enter the directory of hmmer‐2.3.2.bin.intel‐linux. You will see NINE executables ready in the subdirectory
“/binaries”, and also nine files in the subdirectory “/tutorial”;
Installation from source code
–
–
–
–
–
–
–
–
–
Download the current HMMER source code version “hmmer‐2.3.2.tar.gz” from
http://hmmer.janelia.org/#download;
Create a new directory in your Linux account and upload or move the software package to the directory;
Unpack the software by typing “tar –xvf hmmer‐2.3.2.tar.gz” in the command line;
Type “cd hmmer‐2.3.2” to enter the software directory;
Type “./configure” to configure for your system and build the programs;
Type “make” to generate the executables;
Type “make check” to run the automated test suite; (This is optional but recommended, and all these tests
should pass);
Please note that by default programs are in “/usr/local/bin/” and man pages are in “/usr/local/man/man1”;
Type “make install” to install all executables;
Hmmbuild : build a profile HMM from
an aignment
•
hmmbuild [options] hmmfile alignfile
•
•
hmmbuild test.hmm test.aln
hmmbuild -h
•
hmmbuild reads a multiple sequence alignment file alignfile , builds a new profile HMM, and saves
the HMM in hmmfile.
•
alignfile may be in ClustalW, GCG MSF, or SELEX alignment format.
•
By default, the model is configured to find one or more non-overlapping alignments to the
complete model.
•
To configure the model for a single global alignment, use the -g option;
•
To configure the model for multiple local alignments, use the -f option;
•
To configure the model for a single local alignment (standard Smith/Waterman), use the -s option.
Hmmcalibrate: calibrate HMM
search statistics
•
hmmcalibrate [options] hmmfile
•
•
hmmcalibrate test.hmm
Hmmcalibrate -h
•
hmmcalibrate reads an HMM file from hmmfile, scores a large number of
synthesized random sequences with it, fits an extreme value distribution (EVD) to
the histogram of those scores, and re-saves hmmfile now including the EVD
parameters.
•
This step is optional, but it will increase the sensitivity of your database search
•
hmmcalibrate may take several minutes (or longer) to run. While it is running, a
temporary file called hmmfile.xxx is generated in your working directory.
•
If you abort hmmcalibrate prematurely (ctrl-C, for instance), your original hmmfile
will be untouched, and you should delete the hmmfile.xxx temporary file.
Hmmsearch: - search a sequence
database with a profile HMM
•
hmmsearch [options] hmmfile seqfile
•
•
hmmsearch test.hmm query.faa > query.faa.domain
hmmsearch -h
•
hmmsearch reads an HMM from hmmfile and searches seqfile for significantly similar sequence
matches.
•
hmmsearch may take minutes or even hours to run, depending on the size of the sequence
database. It is a good idea to redirect the output to a file.
•
The output consists of four sections:
•
•
•
•
•
a ranked list of the best scoring sequences,
a ranked list of the best scoring domains,
alignments for all the best scoring domains, and
a histogram of the scores.
A sequence score may be higher than a domain score for the same sequence if there is more than
one domain in the sequence; the sequence score takes into account all the domains. All sequences
scoring above the -E and -T cutoffs are shown in the first list, then every domain found in this list is
shown in the second list of domain hits. If desired, E-value and bit score thresholds may also be
applied to the domain list using the -domE and -domT options.
PFAM
Pfam 23.0 (July 2008, 10340 families)
•
The Pfam database is a large collection of protein families, each represented by multiple sequence
alignments and hidden Markov models (HMMs).
•
Proteins are generally composed of one or more functional regions, commonly termed domains.
Different combinations of domains give rise to the diverse range of proteins found in nature.
•
The identification of domains that occur within proteins can therefore provide insights into their
function.
•
There are two components to Pfam: Pfam-A and Pfam-B.
–
–
Pfam-A entries are high quality, manually curated families.
Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence
database, in order to give a more comprehensive coverage of known proteins we also generate a
supplement using the ADDA database. These automatically generated entries are called Pfam-B.
–
•
Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A
entries are found.
Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection
of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. (see PfamC)
Sequence analysis with HMM
• ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam23.0/ to
download files “Pfam_fs.gz” and “Pfam_ls.gz”
– Pfam_ls - All global (ls mode) Pfam-A HMMs in an HMM library searchable
with the hmmpfam program.
– Pfam_fs - All local (fs mode) Pfam-A HMMs in an HMM library searchable with
the hmmpfam program.
• Data location
–
/home/kwchoi/public_html/I529-09-lab/Lab7/Data/PFAM_data/
– Copy to your working directory or make symbolic link
• To search for domains in “test.faa” in the global sequence database, type
–
hmmpfam Pfam_fs test.faa > test.faa.pfam”
• The results is logged into an output file “test.faa.pfam”;
QRNA
QRNA
• QRNA is a prototype structural noncoding RNA genefinder tools for
detecting novel structural RNA genes.
• It uses three probabilistic "pair-grammars":
• a pair stochastic context free grammar modeling alignments
constrained by structural RNA evolution,
• a pair hidden Markov model modeling alignments constrained by
coding sequence evolution, and
• a pair hidden Markov model modeling a null hypothesis of positionindependent evolution.
• Given an input pairwise sequence alignment (e.g. from a BLASTN
comparison of two related genomes), it classify the alignment into
the coding, RNA, or null class according to the posterior probability
of each class.
Local Installation
• The latest version of QRNA (qrna-2.0.3c.tar.gz ) can be download from
ftp://selab.janelia.org/pub/software/qrna/
• Configure QRNA and install
•
•
•
•
•
•
•
•
tar -xvf qrna-2.0.3c.tar
cd qrna-2.0.3c
cd squid
make
cd ../squid02
make
cd ../src
make
• QRNA is installed in
• /home/kwchoi/Installed/qrna-2.0.3c/
• Data location:
– /home/kwchoi/public_html/I529-09lab/Lab7/Data/QRNA_data/
• blastn2qrnadepth.pl
– /home/kwchoi/Installed/qrna2.0.3c/src/scripts/blastn2qrnadepth.pl -g human
HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast
• HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q
• HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.gff
• HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.rep
• Simple test
– Set the running enveriment and run a simple example:
• export QRNADB=/home/kwchoi/Installed/qrna-2.0.3c/lib
• /home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -a 5s_rRNA.q >
5s_rRNA.q.eqrna
QRNA demo
•
Option -C shuffles the columns of the pairwise alignment while maintaining the gap and conserved
structure of the original alignment. Compare the two results of using -C and without using -C:
–
•
Example using the scanning version with a window. Consider file “Scerevisiae orf v other yeasts.q”
which contains an alignment of a S. cerevisiae ORF The alignment has 514 nucleotides, and we
would like to score it with eqrna using a window of 150 nucleotides, and moving the window 50
nucleotides each time:
–
•
/home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -w 150 -x 50
Scerevisiae_orf_v_other_yeasts.q >
Scerevisiae_orf_v_other_yeasts.q.w150.x50.eqrna
example start with a blastn output:
–
•
/home/kwchoi/Installed/qrna-2.0.3c/src/eqrna -a -C 5s_rRNA.q >
5s_rRNA.q.con_shuffle.eqrna
/home/kwchoi/Installed/qrna-2.0.3c/src/eqrn -w 50 -x 50
HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q
HG_13_RNAs_gene.fa.MGSCv3.fragchrom.blast.E0.01.D1.q.W150.X50.eqrna
The results files are shown as following:
–
–
–
5s_rRNA.q.eqrna
5s_rRNA.q.con_shuffle.eqrna
Scerevisiae_orf_v_other_yeasts.q.w150.x50.eqrna