SRA transcript BLAST

Tom Madden
May 15, 2009
BLAST
• Basic Local Alignment Search Tool
• Calculates similarity for biological sequences.
• Produces local alignments: only a portion of
each sequence must be aligned.
• Uses statistical theory to determine if a match
might have occurred by chance.
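The two ideas on this slide, local alignment and chance statistics, can be illustrated with a small sketch. This is not how BLAST itself searches (BLAST uses a faster heuristic); the sketch swaps in the exact Smith-Waterman recurrence to score a local alignment, and uses the Karlin-Altschul form E = K·m·n·e^(−λS) for the expected number of chance matches. The scoring values and the λ and K parameters are illustrative assumptions, not values from the slides.

```python
import math


def local_alignment_score(a, b, match=1, mismatch=-3, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps).

    Only the best-scoring portion of each sequence contributes, which is
    what "local" means. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A cell never drops below 0: the alignment may restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def expected_chance_hits(score, query_len, db_len, lam, k):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    lam and k depend on the scoring system; callers must supply them.
    """
    return k * query_len * db_len * math.exp(-lam * score)


query = "TTTACGTACGTTT"
subject = "GGGACGTACGGGG"
s = local_alignment_score(query, subject)
# lam and k below are placeholder values chosen for illustration only.
e = expected_chance_hits(s, len(query), len(subject), lam=1.0, k=0.5)
```

A smaller E means the match is less likely to have occurred by chance; E grows with the size of the search space (m·n) and shrinks exponentially with the score.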
BLAST databases used for searches
• Collect sequences (FASTA, Bioseq).
• Produce BLAST database with FormatDB API.
• Distribute to network attached storage.
Requirements for searching SRA sequences as a BLAST DB
• Extract new or updated sequences.
• Format into a BLAST database.
• Provide disks for eight copies of the BLAST databases,
each with 5 terabases (as of January).
• Distribute databases to storage in Bethesda
and Virginia.
• Know how to quickly re-dump for policy
changes or data corruption (e.g., unclipped or
differently clipped reads should be searched).
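A rough sense of the disk requirement can be had with a back-of-the-envelope calculation. The sketch below assumes nucleotide sequence is packed at 2 bits per base (four bases per byte) and ignores index files, sequence identifiers, and ambiguity data, so it gives a lower bound on the sequence payload rather than a real sizing figure. The copy count and base count come from the slide; everything else is an assumption.

```python
# Assumption: 2 bits per base, i.e. four bases packed into each byte.
BASES_PER_BYTE = 4


def packed_size_bytes(n_bases, bases_per_byte=BASES_PER_BYTE):
    """Bytes needed to hold n_bases of packed sequence (rounded up)."""
    return -(-n_bases // bases_per_byte)


copies = 8                      # from the slide
bases_per_copy = 5 * 10**12     # 5 terabases, from the slide

per_copy_bytes = packed_size_bytes(bases_per_copy)
total_bytes = copies * per_copy_bytes

print(f"per copy: {per_copy_bytes / 1e12:.2f} TB, "
      f"all copies: {total_bytes / 1e12:.2f} TB (sequence payload only)")
```

Under these assumptions the sequence data alone is on the order of 10 TB across all copies, and every re-dump has to regenerate and redistribute that volume.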
Direct BLAST searches against the SRA archive.
• Uses SRA toolkit and C++ BLAST API.
• Smallest search unit is a “run”.
• Multiple runs may be searched together.
• Offers searches of 454 SRA transcripts (grouped by organism) at NCBI web page.
• Clipped application reads are searched.
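The points above can be sketched as a small data model: runs are the smallest unit, several runs are combined into one search set (here grouped by organism), and only the clipped portion of each read is handed to the search. All names below (Read, clipped, group_runs_by_organism, search_set_reads, the RUN_* accessions) are hypothetical and chosen for illustration; they are not the SRA toolkit or C++ BLAST API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """A read with hypothetical clip coordinates (0-based, end-exclusive)."""
    bases: str
    clip_start: int
    clip_end: int


def clipped(read):
    """Return only the portion of the read inside the clip coordinates."""
    return read.bases[read.clip_start:read.clip_end]


def group_runs_by_organism(runs):
    """Group (accession, organism) pairs into per-organism search sets."""
    groups = defaultdict(list)
    for accession, organism in runs:
        groups[organism].append(accession)
    return dict(groups)


def search_set_reads(run_reads, accessions):
    """Yield the clipped reads of every run in the search set, in order."""
    for accession in accessions:
        for read in run_reads[accession]:
            yield clipped(read)


# Placeholder accessions and reads, made up for this example.
runs = [("RUN_A", "Sus scrofa"), ("RUN_B", "Homo sapiens"), ("RUN_C", "Sus scrofa")]
run_reads = {
    "RUN_A": [Read("NNACGTACGTNN", 2, 10)],
    "RUN_B": [Read("TTTTGGGG", 0, 8)],
    "RUN_C": [Read("AACCGGTT", 2, 6)],
}

search_sets = group_runs_by_organism(runs)
pig_reads = list(search_set_reads(run_reads, search_sets["Sus scrofa"]))
```

Because clipping is applied at read time in this sketch, changing the clip coordinates changes what is searched without rebuilding anything, which is the property the next slides describe.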
Clipped Application Read is Searched.
Advantages
• The search set offered no longer depends
upon how fast BLAST databases can be
produced and distributed.
• Changes to SRA archive are seen immediately
(e.g., change in clipping algorithm).
Three most popular organisms.
• Human
• Sus scrofa
• Tachyglossus aculeatus
Counts searches after April 29, 2009 and only includes those with an
average of two or more searches per session.
Future development
• Allow users to build custom search sets.
• Take mate-pair information into account.
• Combine SRA searches with traditional BLAST
database searches.
Acknowledgements
• Kurt Rodarmer
• Eugene Yaschenko
• Ty Roach
• Martin Shumway
• Christopher O’Sullivan
• Vahram Avagyan
• Christiam Camacho
• Yan Raytselis
• Irena Zaretskaya