CZ5225 Methods in Computational Biology
Lecture 4: Practical use of sequence alignment
methods and introduction of projects
Sequence Alignment Methods
• Pairwise alignment: best-matching
–Global alignment
–Local alignment
• Multiple alignment
• Software
–FASTA
–Clustal
–BLAST (Basic Local Alignment Search Tool)
–PSI-BLAST (detecting remote homologues)
–HMM-based methods (detecting remote homologues)
Pairwise Alignment Algorithms
Needleman-Wunsch
Global alignment only.
Smith-Waterman
Local alignment.
Both rely on a substitution matrix (BLOSUM, PAM, etc.) and a gap-scoring scheme (affine gaps with gap-open and gap-extension penalties, etc.).
Both are fairly demanding of time and memory resources.
Faster heuristic alternatives: FASTA, BLAST…
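To make the cost of the dynamic-programming methods concrete, here is a minimal Smith-Waterman sketch in Python. It is an illustration under simplifying assumptions, not the lecture's implementation: a flat match/mismatch score stands in for a BLOSUM or PAM matrix, and a linear gap penalty stands in for an affine one. The function name and default scores are hypothetical. The table has len(a) x len(b) cells, which is where the time and memory demand comes from.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Assumptions: match/mismatch replaces a substitution matrix and the
    gap penalty is linear (not affine).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Dropping the `0` from the `max` and initialising the first row and column with cumulative gap penalties would turn the same recurrence into Needleman-Wunsch global alignment.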
Multiple Sequence Alignment
FASTA: superseded by BLAST
BLAST: emphasizes the balance between speed and sensitivity
PSI-BLAST: profile alignments, remote homology identification
HMM: profile alignments, remote homology identification
Clustal: profile alignments
BLAST Programs
There are five different BLAST programs, which can be distinguished by the type of the query sequence (DNA or protein) and the type of the subject database:
BLASTP compares an amino acid query sequence against a protein sequence database;
BLASTN compares a nucleotide query sequence against a nucleotide sequence database;
BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;
TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands);
TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
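The choice among the five programs can be written down as a small lookup table. The sketch below only encodes the list above; the function name and the "protein"/"nucleotide" labels are hypothetical conventions chosen for illustration.

```python
# (query type, database type) -> candidate programs, as described above.
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],  # tblastx compares translations
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def candidate_programs(query_type, database_type):
    """Return the BLAST programs that accept this query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown combination: query={query_type!r}, database={database_type!r}"
        ) from None
```

For example, `candidate_programs("protein", "nucleotide")` returns `["tblastn"]`.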
Practical Use of BLAST
1. Information database curation (data collection).
2. The sequence data transformation.
3. formatdb, indexing the sequence database for BLAST.
4. Do BLAST against the designed database.
5. Identify the homologues from the BLAST results.
6. Score the BLAST hits according to their E-value and their drug susceptibility.
Preparation: Get the BLAST package
• Why do we need a local version?
• Where to get the software package?
  http://www.ncbi.nlm.nih.gov/blast/
• Tree structure after unpacking:
2. The sequence data transformation.
Convert any sequence format to FASTA format.
Each record begins with a greater-than symbol (">") followed by the description line; the sequence follows on the next lines.
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
>Example2 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFN
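A minimal sketch of the transformation step, assuming the source records are already available as (identifier, description, sequence) values. The function names and the 60-character line width are illustrative choices (60 matches the examples above), not part of any BLAST tool.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one record as FASTA: a '>' description line, then wrapped sequence."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    header = f">{identifier} {description}".rstrip()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


def parse_fasta(text):
    """Return a list of (description_line, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Writing several records is then a matter of concatenating the `to_fasta` outputs into one file, which becomes the input to the indexing step.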
3. formatdb, indexing the sequence database for BLAST
formatdb -i ecoli.nt -p F -o T
-i  Input file(s) for formatting [File In] Optional
-p  Type of file [T/F] Optional
    T - protein
    F - nucleotide
    default = T
-s  Create indexes limited only to accessions - sparse [T/F] Optional
    default = F
-V  Verbose: check for non-unique string ids in the database [T/F] Optional
    default = F
-o  Parse options [T/F] Optional
    T - True: Parse SeqId and create indexes.
    F - False: Do not parse SeqId. Do not create indexes.
    default = F
-F  Gifile (file containing list of gi's) [File In] Optional
……
formatdb.exe -i ourOwnDatabase -p T -o T
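When the indexing step is driven from a script, the command line can be assembled programmatically. This is a sketch that uses only the `-i`, `-p` and `-o` flags shown above; the helper names are hypothetical, and it assumes a `formatdb` executable is on the PATH.

```python
import shutil
import subprocess


def formatdb_command(input_file, protein=True, parse_seqids=True):
    """Build the formatdb argument list from the flags described above."""
    def flag(value):
        return "T" if value else "F"
    return ["formatdb", "-i", input_file, "-p", flag(protein), "-o", flag(parse_seqids)]


def run_formatdb(input_file, protein=True, parse_seqids=True):
    """Run formatdb; raises if the executable is missing or exits non-zero."""
    cmd = formatdb_command(input_file, protein, parse_seqids)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("formatdb not found on PATH (assumed local install)")
    subprocess.run(cmd, check=True)
```

`formatdb_command("ecoli.nt", protein=False)` reproduces the nucleotide example above.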
4. Do BLAST against the designed database.
blastall arguments:
-p  Program name [String]
-d  Database [String]
    default = nr
-i  Query file [File In]
    default = stdin
-e  Expectation value (E) [Real]
    default = 10.0
-v  Number of database sequences to show one-line descriptions for [Integer]
    default = 500
-b  Number of database sequences to show alignments for [Integer]
    default = 250
4. Do BLAST against the designed database.
Example:
blastall -p blastp -d db/swissprot -i Q9Y5N1.txt -o Q9Y5N1.out
blastall -p blastp -d db/swissprot -e 1 -i Q9Y5N1.txt -o Q9Y5N1.out
Query file Q9Y5N1.txt:
>newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
MERAPPDGPLNASGALAGEAAAAGGARGFSAAWTAVLAALMALLIVATVLGNALVMLAFV
ADSSLRTQNNFFLLNLAISDFLVGAFCIPLYVPYVLTGRWTFGRGLCKLWLVVDYLLCTS
SAFNIVLISYDRFLSVTRAVSYRAQQGDTRRAVRKMLLVWVLAFLLYGPAILSWEYLSGG
SSIPEGHCYAEFFYNWYFLITASTLEFFTPFLSVTFFNLSIYLNIQRRTRLRLDGAREAA
GPEPPPEAQPSPPPPPGCWGCWQKGHGEAMPLHRYGVGEAAVGAEAGEATLGGGGGGGSV
ASPTSSSGSSSRGTERPRSLKRGSKPSASSASLEKRMKMVSQSFTQRFRLSRDRKVAKSL
AVIVSIFGLCWAPYTLLMIIRAACHGHCVPDYWYETSFWLLWANSAVNPVLYPLCHHSFR
RAFTKLLCPQKLKIQPHSSLEHCWK
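The same search can be launched from a script. The sketch below assembles a blastall command from the arguments listed earlier (`-p`, `-d`, `-i`, `-o`, `-e`, `-v`, `-b`); the function name and keyword names are hypothetical. Optional arguments are omitted when not set, so blastall falls back to its own defaults.

```python
def blastall_command(program, database, query_file, output_file,
                     evalue=None, descriptions=None, alignments=None):
    """Build a blastall argument list; None means 'use the blastall default'."""
    cmd = ["blastall", "-p", program, "-d", database,
           "-i", query_file, "-o", output_file]
    if evalue is not None:
        cmd += ["-e", str(evalue)]        # expectation value threshold
    if descriptions is not None:
        cmd += ["-v", str(descriptions)]  # number of one-line descriptions
    if alignments is not None:
        cmd += ["-b", str(alignments)]    # number of alignments shown
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it, assuming a local blastall installation and a formatted database.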
4. Do BLAST against the designed database.
Output file Q9Y5N1.out:
Query= newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G
protein-coupled receptor 97)
(445 letters)
Database: swissprot
172,892 sequences; 63,586,428 total letters
                                                                      Score    E
Sequences producing significant alignments:                           (bits)  Value
sp|Q9Y5N1|HRH3_HUMAN Histamine H3 receptor (HH3R) (G-protein cou...    668    0.0
…………………..
sp|P18871|ADA2A_PIG Alpha-2A adrenergic receptor (Alpha-2A adren...    105    2e-022
sp|Q9N2B2|HRH1_PANTR Histamine H1 receptor                             105    2e-022
…………………………………..
Database: swissprot
Posted date: Jul 8, 2005 9:35 PM
Number of letters in database: 63,586,428
Number of sequences in database: 172,892
…………………………………
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 40,317,827
Number of Sequences: 172892
…………………………………………….
Parsing and interpreting the results
BioJava
BioPerl
BioRuby
Biopython
Or your own code. Why?
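As an illustration of the "own code" route, the sketch below parses the one-line hit summaries of a blastall text report (sequence id, description, bit score, E-value) and keeps the hits under an E-value cutoff. It assumes each summary sits on a single line ending in the score and the E-value; the function names and the 1e-5 cutoff are hypothetical. For anything beyond this, the toolkits listed above provide full report parsers.

```python
def parse_hit_line(line):
    """Parse 'id description score evalue'; return None if the line does not fit."""
    parts = line.rsplit(None, 2)  # split off the last two columns
    if len(parts) != 3:
        return None
    head, score_text, evalue_text = parts
    # Assumption: very small E-values may be abbreviated as e.g. "e-100".
    if evalue_text.startswith("e"):
        evalue_text = "1" + evalue_text
    try:
        score, evalue = float(score_text), float(evalue_text)
    except ValueError:
        return None
    seq_id, _, description = head.partition(" ")
    return {"id": seq_id, "description": description.strip(),
            "score": score, "evalue": evalue}


def significant_hits(lines, max_evalue=1e-5):
    """Return parsed hits whose E-value is at or below max_evalue."""
    hits = (parse_hit_line(line) for line in lines)
    return [h for h in hits if h is not None and h["evalue"] <= max_evalue]
```

Writing the parser yourself keeps full control over which fields are kept and how hits are scored afterwards, at the price of handling the report format's quirks by hand.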
Workflow for Handling Batched BLAST Queries: Shell Programming
1. Prepare the jobs and put them into the queue.
2. Handle each individual request.
3. Analyze/output the result after each job request.
4. Remaining: collect and finalize the report.
Possible languages: Basic/Bash/C/C++/C#/Java/Python/Perl/R/Ruby/Tcl
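The four workflow steps can be sketched as a small driver loop. In this hedged example the job runner and the analysis function are passed in as parameters (`run_job` and `analyze` are hypothetical names), so the loop itself makes no assumption about how BLAST is invoked.

```python
from collections import deque


def run_batch(query_files, run_job, analyze):
    """Queue the queries, handle them one by one, and collect a final report.

    Returns (results, failures): results maps query -> analyzed output,
    failures maps query -> error message.
    """
    queue = deque(query_files)            # step 1: put the jobs into the queue
    results, failures = {}, {}
    while queue:
        query = queue.popleft()           # step 2: handle one request
        try:
            results[query] = analyze(run_job(query))  # step 3: analyze its result
        except Exception as exc:          # keep going if a single job fails
            failures[query] = str(exc)
    return results, failures              # step 4: the collected report
```

In practice `run_job` could wrap a blastall call and `analyze` a report parser; recording failures separately lets the batch finish even when individual queries break.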
Another way to BLAST like a robot
BLAST URL API (from NCBI)
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
A Sample URL
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put&PROGRAM=blastn&DATABASE=nr&FILTER=L&QUERY=AF123456
CMD=Put: submit a query
PROGRAM=blastn: run BLASTN
DATABASE=nr: search against nr
FILTER=L: turn low-complexity filtering on
QUERY=AF123456: accession, GI, or FASTA
An interim update to BLAST URLAPI, still being reviewed, is at:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/node_0.html
Quoted from: NCBI, Programming with BLAST
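The sample URL can be assembled with Python's standard library, which also takes care of escaping when the query is a FASTA record rather than an accession. This sketch only builds the URL from the parameters shown above; the function name and keyword names are hypothetical, and nothing is sent to NCBI.

```python
from urllib.parse import urlencode

BLAST_CGI = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def put_url(query, program="blastn", database="nr", low_complexity_filter=True):
    """Build a CMD=Put URL for the BLAST URL API from the parameters above."""
    params = [("CMD", "Put"), ("PROGRAM", program), ("DATABASE", database)]
    if low_complexity_filter:
        params.append(("FILTER", "L"))
    params.append(("QUERY", query))
    # urlencode percent-escapes characters such as '>' and newlines.
    return BLAST_CGI + "?" + urlencode(params)
```

`put_url("AF123456")` reproduces the sample URL exactly.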
NCBI BLAST Server
[Figure: NCBI BLAST server architecture. Labels shown: End Users, Search Request, RID, Result, Formatter, Blast.cgi, mssql (twice), Replicate, Backup, splitd, Database server, Merger daemon, and three "Intel Pentium Linux 2-way farm" compute farms.]
Annotations in the figure:
• Split query into chunks for distributed computing on multiple available CPUs
• Database loading if needed
• Finished chunks are merged to generate final blastalign object
Quoted from: NCBI, Programming with BLAST
Posting a URL
$ua = LWP::UserAgent->new
$req = new HTTP::Request POST
$response = $ua->request($req)
[Figure: the User Agent sends an HTTP Request to NCBI and receives an HTTP Response.]
Quoted from: NCBI, Programming with BLAST
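For readers who do not use Perl, here is a rough Python analogue of the three LWP steps (create an agent, build a POST request, send it) using the standard library. The helper name is hypothetical and the field values come from the sample URL shown earlier; the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, fields):
    """Build an HTTP POST request carrying form-encoded fields."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_post(
    "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi",
    [("CMD", "Put"), ("PROGRAM", "blastn"), ("DATABASE", "nr"), ("QUERY", "AF123456")],
)
# Sending would be: urllib.request.urlopen(req).read()  (network access required)
```

Here `urllib.request.urlopen` plays the role of `$ua->request($req)`, and the bytes it returns correspond to `$response`.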
Introduction of projects
Drug-resistant mutation data collection and database development
Development of a scoring matrix from sequence variations and their drug susceptibility data
Prediction of drug-resistant mutations