CZ5225 Methods in Computational Biology
Lecture 4: Practical use of sequence alignment methods and introduction of projects

Sequence Alignment Methods
• Pairwise alignment (best-matching)
  – Global alignment
  – Local alignment
• Multiple alignment
• Software
  – FASTA
  – Clustal
  – BLAST (Basic Local Alignment Search Tool)
  – PSI-BLAST (detecting remote homologues)
  – HMM-based methods (detecting remote homologues)

Pairwise Alignment Algorithms
• Needleman-Wunsch: global alignment only.
• Smith-Waterman: local alignment; the best local alignment may still span both sequences end to end.
• Substitution matrix and gap-scoring scheme: BLOSUM, PAM, etc.; affine gap penalties (gap opening and gap extension), etc.
• These exact algorithms are fairly demanding of time and memory resources; faster heuristic alternatives: FASTA, BLAST, …

Multiple Sequence Alignment
• FASTA: superseded by BLAST
• BLAST: emphasizes the balance between speed and sensitivity
• PSI-BLAST: profile alignments, remote homology identification
• HMM: profile alignments, remote homology identification
• Clustal: profile alignments

BLAST Programs
There are five different BLAST programs, distinguished by the type of the query sequence (DNA or protein) and the type of the subject database:
• BLASTP compares an amino acid query sequence against a protein sequence database.
• BLASTN compares a nucleotide query sequence against a nucleotide sequence database.
• BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
• TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
• TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Practical Use of BLAST
1. The information database curation (data collection)
2. The sequence data transformation
3. formatdb: indexing the sequence database for BLAST
4. Do BLAST against the designed database
5. Identify the homologues from the BLAST results
6. Score the BLAST hits according to their E-value and their drug susceptibility

Preparation: Get the BLAST package
• Why do we need a local version?
• Where to get the software package?
  http://www.ncbi.nlm.nih.gov/blast/
• Tree structure after unpacking:

2. The sequence data transformation
Any sequence format to FASTA format. Each record starts with a greater-than symbol (">") followed by the description line; the sequence itself follows on the next lines.

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
>Example2 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFN

3. formatdb: indexing the sequence database for BLAST
  formatdb -i ecoli.nt -p F -o T

  -i  Input file(s) for formatting [File In] Optional
  -p  Type of file [T/F] Optional, default = T
        T - protein
        F - nucleotide
  -s  Create indexes limited only to accessions - sparse [T/F] Optional, default = F
  -V  Verbose: check for non-unique string ids in the database [T/F] Optional, default = F
  -o  Parse options [T/F] Optional, default = F
        T - True: parse SeqId and create indexes.
        F - False: do not parse SeqId. Do not create indexes.
  -F  Gifile (file containing list of gi's) [File In] Optional
  ……

  formatdb.exe -i ourOwnDatabase -p T -o T

4. Do BLAST against the designed database
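A minimal sketch of how step 4 might be driven from a script rather than typed by hand. It only assembles the command line from the `-p`, `-d`, `-i`, `-o` and `-e` flags used in this lecture; the helper names `blastall_command`, `batch_commands` and `run_all` are hypothetical, and actually running the commands assumes a local `blastall` installation on the PATH.

```python
import os
import subprocess


def blastall_command(program, database, query_file, output_file, evalue=None):
    """Build the argument list for one blastall run (nothing is executed here)."""
    cmd = ["blastall", "-p", program, "-d", database]
    if evalue is not None:
        # -e sets the expectation-value cutoff; omitted means blastall's default
        cmd += ["-e", str(evalue)]
    cmd += ["-i", query_file, "-o", output_file]
    return cmd


def batch_commands(query_files, program, database, evalue=None):
    """One command per query file; the output file is named after the query (assumption)."""
    commands = []
    for query_file in query_files:
        stem, _ext = os.path.splitext(query_file)
        commands.append(
            blastall_command(program, database, query_file, stem + ".out", evalue)
        )
    return commands


def run_all(commands):
    """Run each command in turn; requires blastall to be installed locally."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in batch_commands(["Q9Y5N1.txt"], "blastp", "db/swissprot", evalue=1):
        print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.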
blastall arguments:
  -p  Program Name [String]
  -d  Database [String], default = nr
  -i  Query File [File In], default = stdin
  -e  Expectation value (E) [Real], default = 10.0
  -v  Number of database sequences to show one-line descriptions for, default = 500
  -b  Number of database sequences to show alignments for, default = 250

EXAMPLE:
  blastall -p blastp -d db/swissprot -i Q9Y5N1.txt -o Q9Y5N1.out
  blastall -p blastp -d db/swissprot -e 1 -i Q9Y5N1.txt -o Q9Y5N1.out

Q9Y5N1.txt
>newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
MERAPPDGPLNASGALAGEAAAAGGARGFSAAWTAVLAALMALLIVATVLGNALVMLAFV
ADSSLRTQNNFFLLNLAISDFLVGAFCIPLYVPYVLTGRWTFGRGLCKLWLVVDYLLCTS
SAFNIVLISYDRFLSVTRAVSYRAQQGDTRRAVRKMLLVWVLAFLLYGPAILSWEYLSGG
SSIPEGHCYAEFFYNWYFLITASTLEFFTPFLSVTFFNLSIYLNIQRRTRLRLDGAREAA
GPEPPPEAQPSPPPPPGCWGCWQKGHGEAMPLHRYGVGEAAVGAEAGEATLGGGGGGGSV
ASPTSSSGSSSRGTERPRSLKRGSKPSASSASLEKRMKMVSQSFTQRFRLSRDRKVAKSL
AVIVSIFGLCWAPYTLLMIIRAACHGHCVPDYWYETSFWLLWANSAVNPVLYPLCHHSFR
RAFTKLLCPQKLKIQPHSSLEHCWK

Q9Y5N1.out
Query= newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
         (445 letters)

Database: swissprot
           172,892 sequences; 63,586,428 total letters

                                                                   Score    E
Sequences producing significant alignments:                        (bits) Value
sp|Q9Y5N1|HRH3_HUMAN Histamine H3 receptor (HH3R) (G-protein cou...   668   0.0
…………………..
sp|P18871|ADA2A_PIG Alpha-2A adrenergic receptor (Alpha-2A adren...   105   2e-022
sp|Q9N2B2|HRH1_PANTR Histamine H1 receptor                            105   2e-022
…………………………………..
  Database: swissprot
    Posted date: Jul 8, 2005 9:35 PM
  Number of letters in database: 63,586,428
  Number of sequences in database: 172,892
…………………………………
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 40,317,827
Number of Sequences: 172892
…………………………………………….
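The one-line hit summaries in a report like Q9Y5N1.out can be pulled out with a few lines of your own code. The sketch below is one possible approach, not a full report parser: it assumes each summary line ends with the bit score and the E-value, as in the output above, and the function name `parse_hit_summaries` is hypothetical. For real projects the Bio* toolkits are usually the safer choice.

```python
def parse_hit_summaries(report_text):
    """Return (sequence_id, bit_score, e_value) tuples from the hit summary block."""
    hits = []
    in_summary = False
    for line in report_text.splitlines():
        if line.startswith("Sequences producing significant alignments"):
            in_summary = True
            continue
        if not in_summary:
            continue
        # The summary block ends where the alignments or the database trailer begin.
        if line.startswith(">") or line.lstrip().startswith("Database:"):
            break
        parts = line.rsplit(None, 2)  # description, score, E-value
        if len(parts) != 3:
            continue
        rest, score_text, evalue_text = parts
        if evalue_text.startswith("e"):
            # Some reports abbreviate tiny values as "e-100"; read that as 1e-100.
            evalue_text = "1" + evalue_text
        try:
            score = float(score_text)
            evalue = float(evalue_text)
        except ValueError:
            continue  # elision rows or other non-hit lines
        hits.append((rest.split()[0], score, evalue))
    return hits


SAMPLE_REPORT = """\
Sequences producing significant alignments:                        (bits) Value
sp|Q9Y5N1|HRH3_HUMAN Histamine H3 receptor (HH3R) (G-protein cou...   668   0.0
sp|P18871|ADA2A_PIG Alpha-2A adrenergic receptor (Alpha-2A adren...   105   2e-022
sp|Q9N2B2|HRH1_PANTR Histamine H1 receptor                            105   2e-022
"""

if __name__ == "__main__":
    for seq_id, score, evalue in parse_hit_summaries(SAMPLE_REPORT):
        if evalue <= 1e-10:  # example cutoff, chosen only for illustration
            print(seq_id, score, evalue)
```

Once the hits are tuples, filtering by an E-value threshold or ranking by score becomes a one-line operation, which is the main argument for parsing the report instead of reading it by eye.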
Parsing and interpreting the results
• BioJava
• BioPerl
• BioRuby
• Biopython
• Or your own code. Why?

Workflow for Handling Batched BLAST Queries – Shell Programming
1. Prepare the jobs and put them into the queue
2. Handle each individual request
3. Analyze/output the result after each job request
4. Remaining: collect and finalize the report
Possible languages: Basic/Bash/C/C++/C#/Java/Python/Perl/R/Ruby/TCL

Another way to BLAST like a robot
BLAST URL API (from NCBI)
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi

A Sample URL
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put&PROGRAM=blastn&DATABASE=nr&FILTER=L&QUERY=AF123456

  Parameter  Value     Meaning
  CMD        Put       submit a query
  PROGRAM    blastn    run BLASTN
  DATABASE   nr        search against nr
  FILTER     L         turn low-complexity filtering on
  QUERY      AF123456  accession, GI, or FASTA

An interim update to the BLAST URL API, still being reviewed, is at:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/node_0.html
Quoted from: NCBI, Programming with BLAST

NCBI BLAST Server
[Diagram: NCBI BLAST server architecture. Labels: End Users; Search Request; RID; Result; Blast.cgi; Formatter; mssql; Backup mssql (Replicate); splitd; Database server; Merger daemon; three Intel Pentium Linux 2-way farms.]
• Split the query into chunks for distributed computing on multiple available CPUs
• Database loading if needed
• Finished chunks are merged to generate the final blastalign object
Quoted from: NCBI, Programming with BLAST

Posting a URL
[Diagram: the User Agent sends an HTTP Request to NCBI and receives an HTTP Response.]
  $ua = LWP::UserAgent->new
  $req = new HTTP::Request POST
  $response = $ua->request($req)
Quoted from: NCBI, Programming with BLAST

Introduction of projects
• Drug-resistant mutation data collection and database development
• Scoring matrix development from sequence variations and their drug susceptibility data
• Prediction of drug-resistant mutations
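As a closing sketch for the BLAST URL API slides: the sample `Put` URL can be assembled programmatically instead of by string concatenation. The example below uses only the parameters quoted in this lecture (`CMD`, `PROGRAM`, `DATABASE`, `FILTER`, `QUERY`) and Python's standard `urllib.parse`. The function name `build_put_url` is hypothetical, no request is sent, and whether the endpoint still accepts this exact form is an assumption not checked here.

```python
from urllib.parse import urlencode

# Endpoint as quoted on the slide; treat it as an assumption about the live service.
BLAST_CGI = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def build_put_url(query, program="blastn", database="nr", low_complexity_filter=True):
    """Build a CMD=Put URL for an accession, GI, or FASTA-formatted query."""
    params = [("CMD", "Put"), ("PROGRAM", program), ("DATABASE", database)]
    if low_complexity_filter:
        params.append(("FILTER", "L"))  # L = low-complexity filtering on
    params.append(("QUERY", query))
    # urlencode percent-escapes characters such as ">" and newlines in FASTA input
    return BLAST_CGI + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_put_url("AF123456"))
```

Letting `urlencode` handle escaping matters mostly for FASTA queries, whose ">" and line breaks are not valid in a raw URL.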