A Study of GeneWise with the Drosophila Adh Region

Asta Gindulyte

CMSC 838 Presentation

Authors: Yi Mo, Moira Regelson, and Mike Sievers Paracel Inc., Pasadena, CA

Motivation

Genome annotation

Extraction of biologically relevant knowledge from raw genomic sequence data

Need faster genome annotation methods

DNA sequences are very long (millions of nucleotides)

Current methods are computationally too expensive

Approach/Solution

GeneMatcher2 hardware acceleration of GeneWise

CMSC 838T – Presentation

Outline

Motivation

Genome annotation

GeneMatcher2

Design

ASIC hardware

Comparison

GeneWise algorithm

HalfWise algorithm

Performance (time, precision)

Observations

Performance improvement

Cost effectiveness

Approach

Problem: make GeneWise run faster

“Embarrassingly parallel” algorithm

Computationally too expensive even when run in parallel on PCs

Paracel’s solution: hardware acceleration

Don’t change the algorithm

Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible

6LITE algorithm, now also in Wise2

GeneMatcher Architecture

ASIC Hardware

ASIC – application-specific integrated circuit

Designed to speed up dynamic programming algorithms

(could be used for Smith-Waterman)

Each ASIC board has 3072 processors

System has up to 9 boards

Cost per board around $40K
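
As a rough illustration of the kind of dynamic programming recurrence this hardware targets, here is a minimal Smith-Waterman local alignment score in Python. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration; they are not parameters from the presentation or from GeneMatcher2.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty; the scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: never let a cell drop below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on the same anti-diagonal can be computed at the same time. That property is one reason recurrences of this shape map well onto arrays of many simple processors.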

GeneWise Algorithm

Perform a search of genomic DNA sequence data using a protein HMM

Build HMMs from protein families

Scan genome using HMM

  

Look for start codon

“GT” sequence signals possible 5’ splice site

“AG” sequence signals possible 3’ splice site

Dynamic programming used in the scanning process

 

Obtain probability of the most likely path in HMM generating the sequence

Obtain alignment by backtracking
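
The last two steps (most likely path, then backtracking) are what the Viterbi algorithm does. The sketch below is a generic Viterbi implementation in Python run on a toy two-state model. It is not the GeneWise model, which combines a protein profile-HMM with gene-structure states; the state names "C" and "N" and all probabilities are made-up assumptions for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most likely path.

    Assumes obs is non-empty and every probability used is non-zero.
    """
    # score[t][s]: best log probability of any path ending in state s at position t
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev, val = max(
                ((p, score[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            score[t][s] = val + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Backtrack from the best final state to recover the path
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path

# Toy model (hypothetical): "C" favours G/C, "N" favours A/T
states = ("C", "N")
start_p = {"C": 0.5, "N": 0.5}
trans_p = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit_p = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

log_p, path = viterbi("GGGGAAAA", states, start_p, trans_p, emit_p)
```

Working in log space avoids numerical underflow on long sequences, which matters when the input is millions of nucleotides long.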

GeneWise model on GeneMatcher2

HalfWise Algorithm

Reduce cost by running BLAST to select HMMs with possible hits

Use these HMMs with GeneWise database search and sequence alignment algorithm

May miss some genes due to BLAST misses
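
The two-stage idea above (a cheap filter selects candidates, the expensive search runs only on those) can be sketched in Python as follows. This is a toy under stated assumptions: exact k-mer sharing stands in for BLAST, a plain consensus string stands in for each HMM, and the names `prefilter`, `two_stage_search`, and `toy_expensive_score` are hypothetical. None of this is the actual HalfWise implementation.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def prefilter(target, models, k=3):
    """Cheap stage: keep models sharing at least one k-mer with the target."""
    target_kmers = kmers(target, k)
    return [name for name, consensus in models.items()
            if kmers(consensus, k) & target_kmers]

def two_stage_search(target, models, expensive_score, k=3):
    """Run the expensive scorer only on models that pass the prefilter."""
    candidates = prefilter(target, models, k)
    return {name: expensive_score(target, models[name]) for name in candidates}

calls = []  # records each time the costly stage runs

def toy_expensive_score(target, consensus):
    """Placeholder for the costly search; counts shared 2-mers."""
    calls.append(consensus)
    return len(kmers(target, 2) & kmers(consensus, 2))

# Hypothetical "models": family name -> consensus string
models = {"famA": "ACGTAC", "famB": "TTTTTT", "famC": "GGGCCC"}
hits = two_stage_search("CCACGTGG", models, toy_expensive_score)
```

Only models that pass the filter ever reach the costly stage, which is where the saving comes from. Any model the filter rejects is never scored at all, which mirrors the caveat that genes can be missed when the prefilter misses them.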

Evaluation

Test data set

A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region

Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence

Evaluation: Speed

Evaluation: Score

Evaluation: Sensitivity and Specificity

Observations

Performance improvement

The speedup is several orders of magnitude.

Makes real target applications possible

Accuracy might be improved over the HalfWise algorithm

Cost effectiveness

System used costs around $500K

$500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower

Weaknesses

Cannot modify the algorithm

Not enough data to assess scalability
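
The cost-effectiveness figures above can be worked through explicitly. The short Python calculation below takes the slide's round numbers at face value and adds one assumption of its own: that PC throughput scales linearly with the number of processors. The variable names are illustrative.

```python
system_cost = 500_000   # approximate cost of the system used (USD), from the slide
pc_cost = 1_000         # cost per Linux PC processor (USD), from the slide
slowdown = 10           # an equal-cost PC cluster runs about 10x slower, from the slide

# Number of PC processors the same budget buys
pcs_for_same_money = system_cost // pc_cost

# Assumption: throughput scales linearly with processor count
pcs_for_same_speed = pcs_for_same_money * slowdown
cluster_cost_for_same_speed = pcs_for_same_speed * pc_cost

print(pcs_for_same_money, pcs_for_same_speed, cluster_cost_for_same_speed)
```

Under that linear-scaling assumption, matching the accelerated system with PCs alone would take about 5,000 processors, or roughly $5M of hardware.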
