A Study of GeneWise with the Drosophila Adh Region
Asta Gindulyte
CMSC 838 Presentation
Authors: Yi Mo, Moira Regelson, and Mike Sievers Paracel Inc., Pasadena, CA
Motivation
Genome annotation
Extraction of biologically relevant knowledge from raw genomic sequence data
Need faster genome annotation methods
DNA sequences are very long (millions of nucleotides)
Current methods are computationally too expensive
Approach/Solution
GeneMatcher2 hardware acceleration of GeneWise
CMSC 838T – Presentation
Outline
Motivation
Genome annotation
GeneMatcher2
Design
ASIC hardware
Comparison
GeneWise algorithm
HalfWise algorithm
Performance (time, precision)
Observations
Performance improvement
Cost effectiveness
Approach
Problem: make GeneWise run faster
“Embarrassingly parallel” algorithm
Computationally too expensive when run in parallel on PCs
Paracel’s solution: hardware acceleration
Don’t change the algorithm
Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible
6LITE algorithm, now also in Wise2
GeneMatcher Architecture
ASIC Hardware
ASIC – application-specific integrated circuit
Designed to speed up dynamic programming algorithms
(could be used for Smith-Waterman)
Each ASIC board has 3072 processors
System has up to 9 boards
Cost per board around $40K
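The slide names Smith-Waterman as an example of the dynamic programming the ASIC targets. As a minimal software sketch of that recurrence (not the GeneMatcher2 implementation), the function below computes the best local alignment score of two strings. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for illustration, and a linear gap penalty is used for brevity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    Scoring parameters are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on one anti-diagonal can be computed independently. That is a general property of this recurrence; the slides do not say how the ASIC exploits it.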
GeneWise Algorithm
Perform a search of genomic DNA sequence data using a protein HMM
Build HMMs from protein families
Scan genome using HMM
Look for start codon
“GT” sequence signals possible 5’ splice site
“AG” sequence signals possible 3’ splice site
Dynamic programming used in the scanning process
Obtain probability of the most likely path in HMM generating the sequence
Obtain alignment by backtracking
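The last two steps (most likely path by dynamic programming, then backtracking) are the Viterbi algorithm. Below is a minimal sketch on a toy two-state HMM. The states ("E" for exon-like, "I" for intron-like), all probabilities, and the function name are assumptions for illustration only; the real GeneWise model has many more states and compares a protein profile against DNA.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most likely path.

    Works in log space to avoid underflow; assumes obs is non-empty
    and all probabilities used are non-zero.
    """
    # V[t][s] = best log probability of any path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s] = predecessor of s on that best path
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Backtrack from the best final state to recover the path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path


# Toy parameters (assumed): "E" prefers G/C, "I" prefers A/T
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

if __name__ == "__main__":
    logp, path = viterbi("GGCCAATT", STATES, START, TRANS, EMIT)
    print("".join(path), logp)
```

The work grows with (sequence length) x (number of states squared) per model, which gives a sense of why scanning megabases of DNA against thousands of profile HMMs is expensive in software.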
GeneWise model on GeneMatcher2
HalfWise Algorithm
Reduce cost by running BLAST to select HMMs with possible hits
Use these HMMs with GeneWise database search and sequence alignment algorithm
May miss some genes due to BLAST misses
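The filter-then-search idea above can be sketched as a two-stage pipeline. This is a toy illustration only: the cheap stage here is a shared exact k-mer test standing in for BLAST, the profiles are plain consensus strings rather than HMMs, and `full_search` is a hypothetical callback standing in for the GeneWise step. None of these names come from the slides.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(genome, profiles, k=4):
    """Cheap stage: keep profiles sharing at least one exact k-mer with the genome.

    A stand-in for the BLAST step; real BLAST scoring is far more elaborate.
    """
    genome_kmers = kmers(genome, k)
    return [name for name, consensus in profiles.items()
            if kmers(consensus, k) & genome_kmers]


def two_stage_search(genome, profiles, full_search, k=4):
    """Run the expensive full_search only on profiles that pass the prefilter."""
    candidates = prefilter(genome, profiles, k)
    return {name: full_search(genome, profiles[name]) for name in candidates}


if __name__ == "__main__":
    genome = "ACGTACGTTTGA"
    profiles = {"famA": "CGTAC", "famB": "GGGGG", "famC": "ACGAAC"}
    # famC shares only the 3-mer "ACG" with the genome, so a 4-mer filter drops it
    print(prefilter(genome, profiles))
```

In this toy run only famA reaches the expensive stage. famC is discarded by the filter without ever being examined by the full search, which mirrors the slide's caveat that misses in the first stage become misses overall.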
Evaluation
Test data set
A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region
Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence
Evaluation: Speed
Evaluation: Score
Evaluation: Sensitivity and Specificity
Observations
Performance improvement
The speedup is several orders of magnitude.
Makes real target applications possible
Accuracy might be improved over HalfWise algorithm
Cost effectiveness
System used costs around $500K
$500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower
Weaknesses
Cannot modify the algorithm
Not enough data to assess scalability
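The cost comparison above can be worked through as simple arithmetic. The figures are the approximate ones from the slide; the assumption that a PC cluster scales linearly with processor count is mine (plausible for an embarrassingly parallel workload, but not verified in the slides).

```python
# Approximate figures from the slide
system_cost = 500_000      # dollars, accelerated system
pc_cost = 1_000            # dollars per Linux PC processor
slowdown = 10              # equal-cost PC cluster is about 10x slower

# Equal-budget PC cluster
pc_count = system_cost // pc_cost

# ASSUMPTION: linear scaling with processor count
pcs_to_match = pc_count * slowdown
cluster_cost_to_match = pcs_to_match * pc_cost

if __name__ == "__main__":
    print(f"Equal-budget cluster: {pc_count} processors")
    print(f"Processors needed to match speed: {pcs_to_match}")
    print(f"Cluster cost to match speed: ${cluster_cost_to_match:,}")
```

Under these assumptions, matching the accelerated system's speed would take about 5,000 processors, or roughly $5M of PCs, against roughly $500K for the accelerated system.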