Transcript Document
BLAST Programming Thomas Madden NCBI [email protected] January 28, 2002 The BLAST algorithm O'Reilly Bioinformatics Technology BLAST Programming 2 What is BLAST? • Basic Local Alignment Search Tool • Calculates similarity for biological sequences. • Produces local alignments: only a portion of each sequence must be aligned. • Uses statistical theory to determine if a match might have occurred by chance. O'Reilly Bioinformatics Technology BLAST Programming 3 BLAST is a heuristic. • A lookup table is made of all the “words” (short subsequences) and “neighboring” words in the query sequence. • The database is scanned for matching words (“hot spots”). • Gapped and un-gapped extensions are initiated from these matches. O'Reilly Bioinformatics Technology BLAST Programming 4 O'Reilly Bioinformatics Technology BLAST Programming 5 BLAST OUTPUT O'Reilly Bioinformatics Technology BLAST Programming 6 There are many different BLAST output formats. • • • • • • Pair-wise report Query-anchored report Hit-table Tax BLAST Abstract Syntax Notation 1 XML O'Reilly Bioinformatics Technology BLAST Programming 7 BLAST reports at the NCBI Web page. O'Reilly Bioinformatics Technology BLAST Programming 8 Formatting Page O'Reilly Bioinformatics Technology BLAST Programming 9 Graphical Overview O'Reilly Bioinformatics Technology BLAST Programming 10 One-line descriptions O'Reilly Bioinformatics Technology BLAST Programming 11 Pair-wise alignments O'Reilly Bioinformatics Technology BLAST Programming 12 Query-anchored alignments O'Reilly Bioinformatics Technology BLAST Programming 13 Future improvements: LinkOut, taxonomic and structure links. Link to Locus-link Link to UniGene Link to taxonomy O'Reilly Bioinformatics Technology BLAST Programming 14 BLAST report designed for human readability. • One-line descriptions provide overview designed for human “browsing”. • Redundant information is presented in the report (e.g., one-line descriptions and alignments both contain expect values, scores, descriptions) so a user does not need to move back and forth between sections. • HTML version has lots of links for a user to explore. • It can change as new features/information becomes available. O'Reilly Bioinformatics Technology BLAST Programming 15 Hit-table • Contains no sequence or definition lines, but does contain sequence identifiers, starts/stops (one-offset), percent identity of match as well as expect value etc. • Simple format is ideal for automated tasks such as screening of sequence for contamination or sequence assembly. O'Reilly Bioinformatics Technology BLAST Programming 16 There are drawbacks to parsing the BLAST report and Hit-table. • No way to automatically check for truncated output. • No way to rigorously check for syntax changes in the output. O'Reilly Bioinformatics Technology BLAST Programming 17 Structured output allows automatic and rigorous checks for syntax errors and changes. O'Reilly Bioinformatics Technology BLAST Programming 18 Abstract Syntax Notation 1 (ASN.1) • Is an International Standards Organization (ISO) standard for describing structured data and reliably encoding it. • Used extensively in the telecommunications industry. • Both a binary and a text format. • NCBI data model is written in ASN.1. • Asntool can produce C object loaders from an ASN.1 specification. O'Reilly Bioinformatics Technology BLAST Programming 19 ASN.1 is used for the NCBI BLAST Web page. Request results server Return formatted results Fetch sequence Fetch ASN.1 ASN.1 O'Reilly Bioinformatics Technology BLAST Programming BLAST DB 20 Different reports can be produced from the ASN.1 of one search. O'Reilly Bioinformatics Technology BLAST Programming 21 Hit-table HTML Pair-wise BLAST report HTML Query-anchored BLAST report ASN.1 text text TaxBlast report XML O'Reilly Bioinformatics Technology BLAST Programming 22 The BLAST ASN.1 (“SeqAlign”) contains: • Start, stop, and gap information (zerooffset). • Score, bit-score, expect-value. • Sequence identifiers. • Strand information. O'Reilly Bioinformatics Technology BLAST Programming 23 Three flavors of Seq-Align, Score-block(s) plus one of: • Dense-diag: series of unconnected diagonals. No coordinate “stretching” (e.g., cannot be used for protein-nucl. alignments). Used for ungapped BLASTN/BLASTP. • Dense-seg: describes an alignment containing many segments. No coordinate “stretching”. Used for gapped BLASTN/BLASTP. • Std-seg: a collection of locations. No restriction on stretching of coordinates. Used for gapped/ungapped translating searches. Generic. O'Reilly Bioinformatics Technology BLAST Programming 24 Score Block SEQUENCE is an ordered list of elements, each of which is an ASN.1 type. Required unless DEFAULT or OPTIONAL Score ::= SEQUENCE { id Object-id OPTIONAL , value CHOICE { real REAL , int INTEGER } } -- identifies Score type -- actual value -- floating point value -- integer O'Reilly Bioinformatics Technology BLAST Programming 25 Score Block example 2.45905555x10-9 38.1576692 O'Reilly Bioinformatics Technology BLAST Programming 26 Dense-seg definition Dense-seg ::= SEQUENCE { dim INTEGER DEFAULT 2 , -- for (multiway) global or partial alignments -- dimensionality numseg INTEGER , -- number of segments here ids SEQUENCE OF Seq-id , -- sequences in order starts SEQUENCE OF INTEGER , -- start OFFSETS in ids order within segs lens SEQUENCE OF INTEGER , -- lengths in ids order within segs strands SEQUENCE OF Na-strand OPTIONAL , scores SEQUENCE OF Score OPTIONAL } -- score for each seg SEQUENCE OF is an ordered list of the same type of element. O'Reilly Bioinformatics Technology BLAST Programming 27 Dense-seg example O'Reilly Bioinformatics Technology BLAST Programming 28 Std-seg definition Std-seg ::= SEQUENCE { dim INTEGER DEFAULT 2 , -- dimensionality ids SEQUENCE OF Seq-id OPTIONAL , -- sequences identifiers loc SEQUENCE OF Seq-loc , -- locations in ids order scores SET OF Score OPTIONAL } -- score for each segment SET is an unordered list of elements, each of which is an ASN.1 type. Required unless DEFAULT or OPTIONAL. SET OF is an unordered list of the same type of element. O'Reilly Bioinformatics Technology BLAST Programming 29 Std-seg example O'Reilly Bioinformatics Technology BLAST Programming 30 Demo program (“blreplay”) to reproduce BLAST results from ASN.1 • Start/stops and identifiers read in from ASN.1 (SeqAlign). • Sequences and definition lines fetched from BLAST databases. O'Reilly Bioinformatics Technology BLAST Programming 31 Asntool can produce XML from ASN.1 • Really a transliteration, not a new specification • A Document Type Definition (DTD) can also be produced. O'Reilly Bioinformatics Technology BLAST Programming 32 ASN.1 and XML validation differences. • XML can be “well-formed” (does not break any XML syntax rules) or “validated” (checked against a DTD). • ASN.1 must always be valid (checked against a specification). O'Reilly Bioinformatics Technology BLAST Programming 33 Special purpose XML • NCBI specification does not fit the needs of some users (the sequence is not provided in the SeqAlign, when fetched the sequence is packed 2/4 bp’s per byte). • Possible to produce XML with more/less information or in a different format. • First done as an ASN.1 specification, which is then dumped as XML. O'Reilly Bioinformatics Technology BLAST Programming 34 BLAST XML designed to be self-contained. • Query sequence, database sequence, etc. • Sequence definition lines. • Start, stop, etc. (oneoffset). • Scores, expect values, % identity etc. • Produced by BLAST binaries and on NCBI Web page. O'Reilly Bioinformatics Technology BLAST Programming 35 Overview of the BLAST XML <!ELEMENT BlastOutput ( BlastOutput_program , BlastOutput_version , BlastOutput_reference , BlastOutput_db , BlastOutput_query-ID , BlastOutput_query-def , BlastOutput_query-len , BlastOutput_query-seq? , BlastOutput_param , BlastOutput_iterations )> BLAST program, e.g., blastp, etc version of BLAST engine (e.g., 2.1.2) Reference about algorithm Database(s) searched query identifier query definition query length query sequence BLAST search parameters BLAST results for each iteration/run O'Reilly Bioinformatics Technology BLAST Programming 36 <!ELEMENT BlastOutput ( BlastOutput_program , BlastOutput_version , BlastOutput_reference , BlastOutput_db , BlastOutput_query-ID , BlastOutput_query-def , BlastOutput_query-len , BlastOutput_query-seq? , BlastOutput_param , BlastOutput_iterations )> <!ELEMENT BlastOutput_iterations ( Iteration+ )> <!ELEMENT Iteration ( Iteration_iter-num , Iteration_hits? , Iteration_stat? , Iteration_message? )> Iteration number (one for non PSI-BLAST) Hits (one for each database sequence) Search statistics Error messages O'Reilly Bioinformatics Technology BLAST Programming 37 <!ELEMENT Iteration ( Iteration_iter-num , Iteration_hits? , Iteration_stat? , Iteration_message? )> <!ELEMENT Iteration_hits ( Hit* )> <!ELEMENT Hit ( Hit_num , Hit_id , Hit_def , Hit_accession , Hit_len , Hit_hsps? )> ordinal number of the hit, one-offset (e.g., "1, 2..."). ID of db sequence (e.g., "gi|7297267|gb|AAF52530.1|") definition of the db sequence accession of the db sequence (e.g., "AAF57408") length of the database sequence describes individual alignments O'Reilly Bioinformatics Technology BLAST Programming 38 <!ELEMENT Hit ( Hit_num , Hit_id , Hit_def , Hit_accession , Hit_len , Hit_hsps? )> <!ELEMENT Hit_hsps ( Hsp* )> <!ELEMENT Hsp ( Hsp_num , Hsp_bit-score , Hsp_score , Hsp_evalue , Hsp_query-from , Hsp_query-to , Hsp_hit-from , Hsp_hit-to , Hsp_pattern-from? , Hsp_pattern-to? , Hsp_query-frame? , Hsp_hit-frame? , Hsp_identity? , Hsp_positive? , Hsp_gaps? , Hsp_density? , Hsp_qseq , Hsp_hseq , Hsp_midline? )> ordinal number of the HSP, one-offset score (in bits) of the HSP raw score of the HSP expect value of the HSP query offset at alignment start (one-offset) query offset at alignment end (one-offset) db offset at alignment start (one-offset) db offset at alignment end (one-offset) start of phi-blast pattern on query (one-offset) end of phi-blast pattern on query (one-offset) query frame (if applicable) db frame (if applicable) number of identities in the alignment number of positives in the alignment number of gaps in the alignment score density alignment string for the query alignment string for the database middle line as normally seen in BLAST report O'Reilly Bioinformatics Technology BLAST Programming 39 Parsing BLAST XML with Expat. • Expat is a popular free-ware used for parsing XML. • Non-validating. • Simple C (demo) program to parse BLAST output. O'Reilly Bioinformatics Technology BLAST Programming 40 Output sizes for a BLASTP search of gi|178628 vs. nr. • • • • • • • Hit-table: 16 kb Binary ASN.1 (SeqAlign): 35 kb Text ASN.1 (SeqAlign): 144 kb XML (SeqAlign): 392 kb XML: 288 kb BLAST report (text): 232 kb BLAST report (html): 272 kb O'Reilly Bioinformatics Technology BLAST Programming 41 Specification (i.e., “data model”) issues should not be confused with the question about whether to use ASN.1 or XML. O'Reilly Bioinformatics Technology BLAST Programming 42 Structured output is not a panacea. • Design issues must still be addressed. • Semantic issues still exist, e.g. is a start/stop value zero-offset or one-offset. • Data issues still exist, e.g., is the correct sequence shown, are the offsets correct, was the DNA translated with the correct genetic code? O'Reilly Bioinformatics Technology BLAST Programming 43 Overview of BLAST code. O'Reilly Bioinformatics Technology BLAST Programming 44 NCBI toolkit • Has many low-level functions to make it platform independent; supported under LINUX, many flavors of UNIX, NT, and MacOS. • Contains portable types such as Int2, Int4, FloatHi. • Developer should write a “Main” function that is called by a toolkit “main”. • Contains the BLAST code in the “tools” library. • A C++ toolkit is now being developed. O'Reilly Bioinformatics Technology BLAST Programming 45 BLAST code has a modular design. • API for retrieval from databases independent of the compute engine. • Compute engine independent of formatter. O'Reilly Bioinformatics Technology BLAST Programming 46 Readdb API can be used to easily extract information from the BLAST databases. • Date produced. • Title of database. • Number of letters, number of sequences, longest sequence. • Sequence and description of an entry. • Function prototypes in readdb.h. O'Reilly Bioinformatics Technology BLAST Programming 47 Dump a BLAST record in FASTA format (db2fasta.c): “Main” is called by “main” in the toolkit. Get or display command-line arguments Allocate an object for reading the database Get the ordinal number (zero-offset) of the record given a ‘FASTA’ identifier (e.g., “gb|AAH06766.1|AAH0676”). Fetch the Bioseq (contains sequence, description, and identifiers) for this record Dump the sequence as FASTA. O'Reilly Bioinformatics Technology BLAST Programming 48 Only a few function calls are needed to perform a BLAST search (doblast.c): Set the expect value cutoff to a non-default value. Allocate a BLASTOptionsBlk with default values for the specified program (e.g., “blastp”), the boolean argument specifies a gapped search Perform a BLAST search of the BioseqPtr query_bsp. The BioseqPtr could have been obtained from the BLAST databases, Entrez or from FASTA using the function call FastaToSeqEntry O'Reilly Bioinformatics Technology BLAST Programming 49 BlastOptionNew BLAST_OptionsBlkPtr BLASTOptionNew (CharPtr progname, Boolean gapped) CharPtr progname: name of program. Legal values are blastp, blastn, blastx, tblastn, and tblastx. Boolean gapped: if TRUE gapped parameters are set, if FALSE ungapped. Non-default values may be specified by changing elements of the allocated structure (typedef in blastdef.h). The most often changed elements (options) are: Nlm_FloatHi expect_value Int2 wordsize Int2 penalty Int2 reward CharPtr matrix Int4 gap_open Int4 gap_extend CharPtr filter_string Int4 hitlist_size Int2 number_of_cpus Expect value cutoff Number of letters used in making words for lookup table. Penalty for a mismatch (only BLASTN and MegaBLAST) Reward for a match (only BLASTN and MegaBLAST Matrix used for comparison (not BLASTN or MegaBLAST) Cost for gap existence Cost to extend a gap one more letter (including first). Filtering options (e.g., “L”, “mL”) Number of database sequences to save hits for. Number of CPU’s to use. O'Reilly Bioinformatics Technology BLAST Programming 50 BioseqBlastEngine SeqAlignPtr BioseqBlastEngine (BioseqPtr bsp, CharPtr progname, CharPtr database, BLAST_OptionsBlkPtr options, ValNodePtr *other_returns, ValNodePtr *error_returns int (LIBCALLBACK *callback) (Int4 done, Int4 positives)) BioseqPtr bsp: contains the query sequence, identifier, and definition line. CharPtr progname: name of program (one of blastp, blastn, blastx, tblastn, or tblastx). CharPtr database: name (and path) to BLAST database(s). Multiple databases to be searched should be separated by a space (e.g., “nt est”). BLAST_OptionsBlkPtr options: BLAST option structure obtained from BLASTOptionNew. If NULL default values will be used. ValNodePtr *other_returns: a linked list of ValNodePtr’s, each one containing information about things like the database(s) searched, the Karlin-Altschul parameters, the region of query masked. See blastall.c to see how to use this information. May be set to NULL. ValNodePtr *error_returns: a linked list of error messages, these may be printed with a call to BlastErrorPrint(error_returns). May be set to NULL. int (LIBCALLBACK *callback) (Int4 done, Int4 positives): callback function to mark progress through the database. May be set to NULL. O'Reilly Bioinformatics Technology BLAST Programming 51 What can I do with the SeqAlignPtr? SeqAlignId gets the identifier SeqIdWrite formats the(C-structure) information in “query_id” for the first (zeroth) sequence. into a FASTA identifier (e.g., “gi|129295”) and places it into query_id_buf. SeqAlignStart returns the start value (zero-offset) SeqAlignStop returns thesequences. end values (zero-offset) for the first and second for the first and second sequences. O'Reilly Bioinformatics Technology BLAST Programming 52 MySeqAlignPrint output for a search of gi|129295 vs. ecoli O'Reilly Bioinformatics Technology BLAST Programming 53 Notes on Traditional BLAST printing. A call to the fetch function ReadDBBioseqFetchEnable ("blastall", blast_database, db_is_na, TRUE); tells the formatter where to obtain sequences. Entrez or a network connection to the BLAST server could have been used. The one-line descriptions are printed by PrintDefLinesFromSeqAlignEx2(seqalign, 80, outfp, print_options, FIRST_PASS, NULL, number_of_descriptions, NULL, NULL); The pair-wise alignments are printed by ShowTextAlignFromAnnot(seqannot, 60, outfp, NULL, NULL, align_options, txmatrix, mask_loc, FormatScoreFunc); Look at blreplay.c and blastall.c to see details of how these are called. O'Reilly Bioinformatics Technology BLAST Programming 54 Resources • BLAST Home page: http://www.ncbi.nlm.nih.gov/BLAST/ • NCBI Information Engineering Branch home page: http://www.ncbi.nlm.nih.gov/IEB/ • Demonstration programs (parsing XML with EXPAT, blreplay.c, doblast.c, db2fasta.c): ftp://ftp.ncbi.nih.gov/blast/demo O'Reilly Bioinformatics Technology BLAST Programming 55 ASN.1 RESOURCES • The Open Book : A Practical Perspective on OSI by Marshall T. Rose (Prentice Hall). • OSS Nokalva Web site: http://www.oss.com/asn1/overview.html • NCBI toolkit documentation on ASN.1: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/ ASNLIB.HTML O'Reilly Bioinformatics Technology BLAST Programming 56 Email addresses • General questions about running BLAST: [email protected] • Questions about compiling the toolkit and requests for hard-copy of documentation: [email protected] O'Reilly Bioinformatics Technology BLAST Programming 57