Transcript Document

BLAST Programming
Thomas Madden
NCBI
[email protected]
January 28, 2002
The BLAST algorithm
O'Reilly Bioinformatics Technology BLAST Programming
2
What is BLAST?
• Basic Local Alignment Search Tool
• Calculates similarity for biological
sequences.
• Produces local alignments: only a portion of
each sequence must be aligned.
• Uses statistical theory to determine if a
match might have occurred by chance.
O'Reilly Bioinformatics Technology BLAST Programming
3
BLAST is a heuristic.
• A lookup table is made of all the “words”
(short subsequences) and “neighboring”
words in the query sequence.
• The database is scanned for matching words
(“hot spots”).
• Gapped and un-gapped extensions are
initiated from these matches.
O'Reilly Bioinformatics Technology BLAST Programming
4
O'Reilly Bioinformatics Technology BLAST Programming
5
BLAST OUTPUT
O'Reilly Bioinformatics Technology BLAST Programming
6
There are many different BLAST
output formats.
•
•
•
•
•
•
Pair-wise report
Query-anchored report
Hit-table
Tax BLAST
Abstract Syntax Notation 1
XML
O'Reilly Bioinformatics Technology BLAST Programming
7
BLAST reports at the NCBI Web
page.
O'Reilly Bioinformatics Technology BLAST Programming
8
Formatting Page
O'Reilly Bioinformatics Technology BLAST Programming
9
Graphical Overview
O'Reilly Bioinformatics Technology BLAST Programming
10
One-line descriptions
O'Reilly Bioinformatics Technology BLAST Programming
11
Pair-wise alignments
O'Reilly Bioinformatics Technology BLAST Programming
12
Query-anchored alignments
O'Reilly Bioinformatics Technology BLAST Programming
13
Future improvements: LinkOut,
taxonomic and structure links.
Link to
Locus-link
Link to
UniGene
Link to taxonomy
O'Reilly Bioinformatics Technology BLAST Programming
14
BLAST report designed for human
readability.
• One-line descriptions provide overview designed
for human “browsing”.
• Redundant information is presented in the report
(e.g., one-line descriptions and alignments both
contain expect values, scores, descriptions) so a
user does not need to move back and forth
between sections.
• HTML version has lots of links for a user to
explore.
• It can change as new features/information
becomes available.
O'Reilly Bioinformatics Technology BLAST Programming
15
Hit-table
• Contains no sequence or definition lines, but does contain sequence
identifiers, starts/stops (one-offset), percent identity of match as well
as expect value etc.
• Simple format is ideal for automated tasks such as screening of
sequence for contamination or sequence assembly.
O'Reilly Bioinformatics Technology BLAST Programming
16
There are drawbacks to parsing
the BLAST report and Hit-table.
• No way to automatically check for truncated
output.
• No way to rigorously check for syntax
changes in the output.
O'Reilly Bioinformatics Technology BLAST Programming
17
Structured output allows
automatic and rigorous checks
for syntax errors and changes.
O'Reilly Bioinformatics Technology BLAST Programming
18
Abstract Syntax Notation 1
(ASN.1)
• Is an International Standards Organization (ISO)
standard for describing structured data and reliably
encoding it.
• Used extensively in the telecommunications
industry.
• Both a binary and a text format.
• NCBI data model is written in ASN.1.
• Asntool can produce C object loaders from an
ASN.1 specification.
O'Reilly Bioinformatics Technology BLAST Programming
19
ASN.1 is used for the NCBI
BLAST Web page.
Request results
server
Return formatted
results
Fetch sequence
Fetch ASN.1
ASN.1
O'Reilly Bioinformatics Technology BLAST Programming
BLAST DB
20
Different reports can be produced
from the ASN.1 of one search.
O'Reilly Bioinformatics Technology BLAST Programming
21
Hit-table
HTML
Pair-wise
BLAST report
HTML
Query-anchored
BLAST report
ASN.1
text
text
TaxBlast
report
XML
O'Reilly Bioinformatics Technology BLAST Programming
22
The BLAST ASN.1 (“SeqAlign”)
contains:
• Start, stop, and gap
information (zerooffset).
• Score, bit-score,
expect-value.
• Sequence identifiers.
• Strand information.
O'Reilly Bioinformatics Technology BLAST Programming
23
Three flavors of Seq-Align,
Score-block(s) plus one of:
• Dense-diag: series of unconnected diagonals. No
coordinate “stretching” (e.g., cannot be used for
protein-nucl. alignments). Used for ungapped
BLASTN/BLASTP.
• Dense-seg: describes an alignment containing
many segments. No coordinate “stretching”.
Used for gapped BLASTN/BLASTP.
• Std-seg: a collection of locations. No restriction
on stretching of coordinates. Used for
gapped/ungapped translating searches. Generic.
O'Reilly Bioinformatics Technology BLAST Programming
24
Score Block
SEQUENCE is an ordered list of elements,
each of which is an ASN.1 type.
Required unless DEFAULT or OPTIONAL
Score ::= SEQUENCE {
id Object-id OPTIONAL ,
value CHOICE {
real REAL ,
int INTEGER } }
-- identifies Score type
-- actual value
-- floating point value
-- integer
O'Reilly Bioinformatics Technology BLAST Programming
25
Score Block example
2.45905555x10-9
38.1576692
O'Reilly Bioinformatics Technology BLAST Programming
26
Dense-seg definition
Dense-seg ::= SEQUENCE {
dim INTEGER DEFAULT 2 ,
-- for (multiway) global or partial alignments
-- dimensionality
numseg INTEGER ,
-- number of segments here
ids SEQUENCE OF Seq-id , -- sequences in order
starts SEQUENCE OF INTEGER , -- start OFFSETS in ids order within segs
lens SEQUENCE OF INTEGER , -- lengths in ids order within segs
strands SEQUENCE OF Na-strand OPTIONAL ,
scores SEQUENCE OF Score OPTIONAL } -- score for each seg
SEQUENCE OF is an ordered list of
the same type of element.
O'Reilly Bioinformatics Technology BLAST Programming
27
Dense-seg
example
O'Reilly Bioinformatics Technology BLAST Programming
28
Std-seg definition
Std-seg ::= SEQUENCE {
dim INTEGER DEFAULT 2 ,
-- dimensionality
ids SEQUENCE OF Seq-id OPTIONAL , -- sequences identifiers
loc SEQUENCE OF Seq-loc ,
-- locations in ids order
scores SET OF Score OPTIONAL } -- score for each segment
SET is an unordered list of elements, each of which is an
ASN.1 type. Required unless DEFAULT or OPTIONAL.
SET OF is an unordered list of the same type of element.
O'Reilly Bioinformatics Technology BLAST Programming
29
Std-seg example
O'Reilly Bioinformatics Technology BLAST Programming
30
Demo program (“blreplay”) to
reproduce BLAST results from
ASN.1
• Start/stops and identifiers read in from
ASN.1 (SeqAlign).
• Sequences and definition lines fetched from
BLAST databases.
O'Reilly Bioinformatics Technology BLAST Programming
31
Asntool can produce XML from
ASN.1
• Really a
transliteration, not a
new specification
• A Document Type
Definition (DTD) can
also be produced.
O'Reilly Bioinformatics Technology BLAST Programming
32
ASN.1 and XML validation
differences.
• XML can be “well-formed” (does not break
any XML syntax rules) or “validated”
(checked against a DTD).
• ASN.1 must always be valid (checked
against a specification).
O'Reilly Bioinformatics Technology BLAST Programming
33
Special purpose XML
• NCBI specification does not fit the needs of
some users (the sequence is not provided in
the SeqAlign, when fetched the sequence is
packed 2/4 bp’s per byte).
• Possible to produce XML with more/less
information or in a different format.
• First done as an ASN.1 specification, which
is then dumped as XML.
O'Reilly Bioinformatics Technology BLAST Programming
34
BLAST XML designed to be
self-contained.
• Query sequence, database
sequence, etc.
• Sequence definition lines.
• Start, stop, etc. (oneoffset).
• Scores, expect values, %
identity etc.
• Produced by BLAST
binaries and on NCBI
Web page.
O'Reilly Bioinformatics Technology BLAST Programming
35
Overview of the
BLAST XML
<!ELEMENT BlastOutput (
BlastOutput_program ,
BlastOutput_version ,
BlastOutput_reference ,
BlastOutput_db ,
BlastOutput_query-ID ,
BlastOutput_query-def ,
BlastOutput_query-len ,
BlastOutput_query-seq? ,
BlastOutput_param ,
BlastOutput_iterations
)>
BLAST program, e.g., blastp, etc
version of BLAST engine (e.g., 2.1.2)
Reference about algorithm
Database(s) searched
query identifier
query definition
query length
query sequence
BLAST search parameters
BLAST results for each iteration/run
O'Reilly Bioinformatics Technology BLAST Programming
36
<!ELEMENT BlastOutput (
BlastOutput_program ,
BlastOutput_version ,
BlastOutput_reference ,
BlastOutput_db ,
BlastOutput_query-ID ,
BlastOutput_query-def ,
BlastOutput_query-len ,
BlastOutput_query-seq? ,
BlastOutput_param ,
BlastOutput_iterations )>
<!ELEMENT BlastOutput_iterations ( Iteration+ )>
<!ELEMENT Iteration (
Iteration_iter-num ,
Iteration_hits? ,
Iteration_stat? ,
Iteration_message?
)>
Iteration number (one for non PSI-BLAST)
Hits (one for each database sequence)
Search statistics
Error messages
O'Reilly Bioinformatics Technology BLAST Programming
37
<!ELEMENT Iteration (
Iteration_iter-num ,
Iteration_hits? ,
Iteration_stat? ,
Iteration_message?
)>
<!ELEMENT Iteration_hits ( Hit* )>
<!ELEMENT Hit (
Hit_num ,
Hit_id ,
Hit_def ,
Hit_accession ,
Hit_len ,
Hit_hsps?
)>
ordinal number of the hit, one-offset (e.g., "1, 2...").
ID of db sequence (e.g., "gi|7297267|gb|AAF52530.1|")
definition of the db sequence
accession of the db sequence (e.g., "AAF57408")
length of the database sequence
describes individual alignments
O'Reilly Bioinformatics Technology BLAST Programming
38
<!ELEMENT Hit (
Hit_num ,
Hit_id ,
Hit_def ,
Hit_accession ,
Hit_len ,
Hit_hsps? )>
<!ELEMENT Hit_hsps ( Hsp* )>
<!ELEMENT Hsp (
Hsp_num ,
Hsp_bit-score ,
Hsp_score ,
Hsp_evalue ,
Hsp_query-from ,
Hsp_query-to ,
Hsp_hit-from ,
Hsp_hit-to ,
Hsp_pattern-from? ,
Hsp_pattern-to? ,
Hsp_query-frame? ,
Hsp_hit-frame? ,
Hsp_identity? ,
Hsp_positive? ,
Hsp_gaps? ,
Hsp_density? ,
Hsp_qseq ,
Hsp_hseq ,
Hsp_midline? )>
ordinal number of the HSP, one-offset
score (in bits) of the HSP
raw score of the HSP
expect value of the HSP
query offset at alignment start (one-offset)
query offset at alignment end (one-offset)
db offset at alignment start (one-offset)
db offset at alignment end (one-offset)
start of phi-blast pattern on query (one-offset)
end of phi-blast pattern on query (one-offset)
query frame (if applicable)
db frame (if applicable)
number of identities in the alignment
number of positives in the alignment
number of gaps in the alignment
score density
alignment string for the query
alignment string for the database
middle line as normally seen in BLAST report
O'Reilly Bioinformatics Technology BLAST Programming
39
Parsing BLAST XML with
Expat.
• Expat is a popular
free-ware used for
parsing XML.
• Non-validating.
• Simple C (demo)
program to parse
BLAST output.
O'Reilly Bioinformatics Technology BLAST Programming
40
Output sizes for a BLASTP
search of gi|178628 vs. nr.
•
•
•
•
•
•
•
Hit-table: 16 kb
Binary ASN.1 (SeqAlign): 35 kb
Text ASN.1 (SeqAlign): 144 kb
XML (SeqAlign): 392 kb
XML: 288 kb
BLAST report (text): 232 kb
BLAST report (html): 272 kb
O'Reilly Bioinformatics Technology BLAST Programming
41
Specification (i.e., “data model”)
issues should not be confused
with the question about whether
to use ASN.1 or XML.
O'Reilly Bioinformatics Technology BLAST Programming
42
Structured output is not a
panacea.
• Design issues must still be addressed.
• Semantic issues still exist, e.g. is a start/stop
value zero-offset or one-offset.
• Data issues still exist, e.g., is the correct
sequence shown, are the offsets correct, was
the DNA translated with the correct genetic
code?
O'Reilly Bioinformatics Technology BLAST Programming
43
Overview of BLAST code.
O'Reilly Bioinformatics Technology BLAST Programming
44
NCBI toolkit
• Has many low-level functions to make it platform
independent; supported under LINUX, many
flavors of UNIX, NT, and MacOS.
• Contains portable types such as Int2, Int4, FloatHi.
• Developer should write a “Main” function that is
called by a toolkit “main”.
• Contains the BLAST code in the “tools” library.
• A C++ toolkit is now being developed.
O'Reilly Bioinformatics Technology BLAST Programming
45
BLAST code has a modular
design.
• API for retrieval from databases independent
of the compute engine.
• Compute engine independent of formatter.
O'Reilly Bioinformatics Technology BLAST Programming
46
Readdb API can be used to easily
extract information from the
BLAST databases.
• Date produced.
• Title of database.
• Number of letters, number of sequences,
longest sequence.
• Sequence and description of an entry.
• Function prototypes in readdb.h.
O'Reilly Bioinformatics Technology BLAST Programming
47
Dump a BLAST record in FASTA
format (db2fasta.c):
“Main” is called by “main” in the toolkit.
Get or display command-line arguments
Allocate an object for reading the database
Get the ordinal number (zero-offset) of the
record given a ‘FASTA’ identifier
(e.g., “gb|AAH06766.1|AAH0676”).
Fetch the Bioseq (contains sequence,
description, and identifiers) for this record
Dump the sequence as FASTA.
O'Reilly Bioinformatics Technology BLAST Programming
48
Only a few function calls are
needed to perform a BLAST
search (doblast.c):
Set the expect value cutoff to a non-default value.
Allocate a BLASTOptionsBlk with default values
for the specified program (e.g., “blastp”),
the boolean argument specifies a gapped search
Perform a BLAST search of the BioseqPtr query_bsp.
The BioseqPtr could have been obtained from the
BLAST databases, Entrez or from FASTA using
the function call FastaToSeqEntry
O'Reilly Bioinformatics Technology BLAST Programming
49
BlastOptionNew
BLAST_OptionsBlkPtr BLASTOptionNew (CharPtr progname, Boolean gapped)
CharPtr progname: name of program. Legal values are blastp, blastn, blastx, tblastn, and tblastx.
Boolean gapped: if TRUE gapped parameters are set, if FALSE ungapped.
Non-default values may be specified by changing elements of the allocated structure (typedef in blastdef.h).
The most often changed elements (options) are:
Nlm_FloatHi expect_value
Int2 wordsize
Int2 penalty
Int2 reward
CharPtr matrix
Int4 gap_open
Int4 gap_extend
CharPtr filter_string
Int4 hitlist_size
Int2 number_of_cpus
Expect value cutoff
Number of letters used in making words for lookup table.
Penalty for a mismatch (only BLASTN and MegaBLAST)
Reward for a match (only BLASTN and MegaBLAST
Matrix used for comparison (not BLASTN or MegaBLAST)
Cost for gap existence
Cost to extend a gap one more letter (including first).
Filtering options (e.g., “L”, “mL”)
Number of database sequences to save hits for.
Number of CPU’s to use.
O'Reilly Bioinformatics Technology BLAST Programming
50
BioseqBlastEngine
SeqAlignPtr BioseqBlastEngine (BioseqPtr bsp, CharPtr progname, CharPtr database,
BLAST_OptionsBlkPtr options, ValNodePtr *other_returns, ValNodePtr *error_returns
int (LIBCALLBACK *callback) (Int4 done, Int4 positives))
BioseqPtr bsp: contains the query sequence, identifier, and definition line.
CharPtr progname: name of program (one of blastp, blastn, blastx, tblastn, or tblastx).
CharPtr database: name (and path) to BLAST database(s). Multiple databases to be searched should
be separated by a space (e.g., “nt est”).
BLAST_OptionsBlkPtr options: BLAST option structure obtained from BLASTOptionNew.
If NULL default values will be used.
ValNodePtr *other_returns: a linked list of ValNodePtr’s, each one containing information about things
like the database(s) searched, the Karlin-Altschul parameters, the region of query masked.
See blastall.c to see how to use this information.
May be set to NULL.
ValNodePtr *error_returns: a linked list of error messages, these may be printed with a call to
BlastErrorPrint(error_returns).
May be set to NULL.
int (LIBCALLBACK *callback) (Int4 done, Int4 positives): callback function to mark progress through
the database. May be set to NULL.
O'Reilly Bioinformatics Technology BLAST Programming
51
What can I do with the
SeqAlignPtr?
SeqAlignId
gets the
identifier
SeqIdWrite
formats
the(C-structure)
information in
“query_id”
for
the
first
(zeroth)
sequence.
into a FASTA identifier (e.g., “gi|129295”) and
places it into query_id_buf.
SeqAlignStart returns the start value (zero-offset)
SeqAlignStop
returns
thesequences.
end values (zero-offset)
for the first and
second
for the first and second sequences.
O'Reilly Bioinformatics Technology BLAST Programming
52
MySeqAlignPrint output for a
search of gi|129295 vs. ecoli
O'Reilly Bioinformatics Technology BLAST Programming
53
Notes on Traditional BLAST
printing.
A call to the fetch function
ReadDBBioseqFetchEnable ("blastall", blast_database, db_is_na, TRUE);
tells the formatter where to obtain sequences. Entrez or a network connection to the
BLAST server could have been used.
The one-line descriptions are printed by
PrintDefLinesFromSeqAlignEx2(seqalign, 80, outfp, print_options,
FIRST_PASS, NULL, number_of_descriptions, NULL, NULL);
The pair-wise alignments are printed by
ShowTextAlignFromAnnot(seqannot, 60, outfp, NULL, NULL, align_options,
txmatrix, mask_loc, FormatScoreFunc);
Look at blreplay.c and blastall.c to see details of how these are called.
O'Reilly Bioinformatics Technology BLAST Programming
54
Resources
• BLAST Home page:
http://www.ncbi.nlm.nih.gov/BLAST/
• NCBI Information Engineering Branch home
page: http://www.ncbi.nlm.nih.gov/IEB/
• Demonstration programs (parsing XML with
EXPAT, blreplay.c, doblast.c, db2fasta.c):
ftp://ftp.ncbi.nih.gov/blast/demo
O'Reilly Bioinformatics Technology BLAST Programming
55
ASN.1 RESOURCES
• The Open Book : A Practical Perspective on OSI
by Marshall T. Rose (Prentice Hall).
• OSS Nokalva Web site:
http://www.oss.com/asn1/overview.html
• NCBI toolkit documentation on ASN.1:
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/
ASNLIB.HTML
O'Reilly Bioinformatics Technology BLAST Programming
56
Email addresses
• General questions about running BLAST:
[email protected]
• Questions about compiling the toolkit and
requests for hard-copy of documentation:
[email protected]
O'Reilly Bioinformatics Technology BLAST Programming
57