
Bioinformatics Tools for Viral Quasispecies Reconstruction from Next-Generation Sequencing Data and Vaccine Optimization

University of Connecticut: Ion Mandoiu (PI), Mazhar Khan, Rachel O’Neill (Co-PIs), Craig Obergfell, Hongjun Wang, Andrew Bligh

Georgia State University: Alexander Zelikovsky (Co-PI), Bassam Tork, Nicholas Mancuso

Infectious Bronchitis Virus

Group 3 coronavirus

Biggest single cause of economic loss in US poultry farms

– Young chickens: coughing, tracheal rales, dyspnea
– Broiler chickens: reduced growth rate
– Layers: egg production drops 5-50%; eggs are thin-shelled with watery albumen

IBV-infected egg defect

Worldwide distribution, with dozens of serotypes in circulation

• Co-infection with multiple serotypes creates conditions for recombination

IBV-infected embryo vs. normal embryo

454 Pyrosequencing of S1 Gene

From Rev. Bras. Cienc. Avic. vol.12 no.2 Campinas Apr./June 2010

Redesigned S1 primers

ViSpA: Viral Spectrum Assembler

Shotgun 454 reads → Read Error Correction → Read Alignment → Postprocessing of Aligned Reads → Read Graph Construction → Contig Assembly → Frequency Estimation → Quasispecies sequences w/ frequencies

• Freely available at http://alla.cs.gsu.edu/software/VISPA/vispa.html

• Currently being integrated in Galaxy framework

KEC: k-mer Error Correction

1. Calculate k-mers and their frequencies kc(s) (k-counts). Assume that k-mers with high k-counts (“solid” k-mers) are correct, while k-mers with low k-counts (“weak” k-mers) contain errors
2. Determine the threshold k-count (error threshold), which distinguishes solid k-mers from weak k-mers
3. Find error regions
4. Correct the errors in error regions

Zhao X et al 2010
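Steps 1 to 3 can be illustrated with a minimal Python sketch. It is not the KEC implementation: the function names are hypothetical, the error threshold is passed in as a fixed number rather than determined from the k-count distribution, and the correction step (step 4) is left out.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Step 1: count every k-mer in the reads (the k-counts kc(s))."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, threshold):
    """Start positions of "weak" k-mers, i.e. k-count below the threshold.

    The threshold is an assumed input here (step 2 is not implemented).
    """
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < threshold]


def error_regions(read, counts, k, threshold):
    """Step 3: merge overlapping weak k-mers into half-open [start, end) regions."""
    regions = []
    for i in weak_positions(read, counts, k, threshold):
        if regions and i <= regions[-1][1]:
            regions[-1][1] = i + k      # extend the current region
        else:
            regions.append([i, i + k])  # open a new region
    return [tuple(r) for r in regions]
```

With five copies of `ACGTACGT` and one copy of `ACGTTCGT`, every k-mer of the second read that touches the substituted base has a k-count of 1, so with k = 4 and a threshold of 2 the whole stretch covered by those k-mers is reported as one error region.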

Alignment & Postprocessing

Flowchart: Read Alignment vs Reference → Build Consensus → Read Re-Alignment vs. Consensus → More Reads Aligned? (Yes: repeat; No: stop)

Plot: M41 Vaccine and M42, by Position in S1 Gene (x-axis 0 to 2000, y-axis 0 to 35000)

• 145K 454 reads of avg. length 400bp (~60Mb) sequenced from 2 samples (M41 vaccine and M42 isolate)
• Alignment postprocessing: indels supported by single reads removed

Read Graph Construction

• Vertices = superreads (reads not completely contained in other reads with ≤ n mismatches)
• Edge b/w two superreads if they overlap and agree on their overlap with ≤ m mismatches
• Transitive reduction
• Edge costs measure the uncertainty that two superreads belong to the same quasispecies

[Equation: cost(u, v), written in terms of e, k, o, j, and Δ; the formula did not survive in this transcript]
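The vertex and edge definitions of the read graph can be sketched in Python as follows. This is an illustration under stated assumptions, not ViSpA's code: reads are assumed to be ungapped `(start, sequence)` pairs in reference coordinates, "with n mismatches" is read as "at most n", all function names are hypothetical, and both transitive reduction and edge costs are omitted.

```python
def mismatches(a, b):
    """Mismatches over the overlap of two aligned reads; None if they do not overlap."""
    (sa, xa), (sb, xb) = a, b
    lo, hi = max(sa, sb), min(sa + len(xa), sb + len(xb))
    if lo >= hi:
        return None
    return sum(xa[p - sa] != xb[p - sb] for p in range(lo, hi))


def contained(a, b, n):
    """True if read a lies inside read b and differs from it in at most n positions."""
    (sa, xa), (sb, xb) = a, b
    inside = sb <= sa and sa + len(xa) <= sb + len(xb)
    return inside and mismatches(a, b) <= n


def superreads(reads, n):
    """Reads not contained in another read (ties between equal intervals keep the first)."""
    # Any read that contains r sorts before r, so only earlier reads are checked.
    ordered = sorted(set(reads), key=lambda r: (r[0], -len(r[1]), r[1]))
    return [r for i, r in enumerate(ordered)
            if not any(contained(r, s, n) for s in ordered[:i])]


def read_graph(reads, n, m):
    """Vertices are superreads; edge u -> v if v overhangs u and they agree on the overlap."""
    nodes = superreads(reads, n)
    edges = []
    for u in nodes:
        for v in nodes:
            u_end, v_end = u[0] + len(u[1]), v[0] + len(v[1])
            if u[0] < v[0] < u_end < v_end and mismatches(u, v) <= m:
                edges.append((u, v))
    return nodes, edges
```

The pairwise comparison is quadratic in the number of reads, which is acceptable for a sketch but would need an index over start positions for read sets of the size described in this talk.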

Contig Assembly and Frequency Estimation

• Contig assembly based on max bandwidth s-t-paths through each vertex
  – For each path, first build coarse sequence out of path’s superreads (for each position: >70%-majority if it exists, otherwise N)
  – Replace N’s with weighted consensus obtained on all reads
• Frequency estimation based on all reads using EM
  – $f_q$ = (unknown) frequency of candidate $q$
  – $o_r$ = observed frequency of read $r$
  – $h_{q,r}$ = probability that read $r$ is produced by quasispecies $q$ with $j$ mismatches

$h_{q,r} = \binom{l}{j} (1-\epsilon)^{l-j} \epsilon^{j}$ (with $l$ the read length and $\epsilon$ the sequencing error rate)

E step: $p_{q,r} = \dfrac{f_q \, h_{q,r}}{\sum_{q'} f_{q'} \, h_{q',r}}$

M step: $f_q = \dfrac{\sum_{r} p_{q,r} \, o_r}{\sum_{r} o_r}$
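The EM iteration for frequency estimation can be sketched in Python as below. This is a minimal illustration, not the ViSpA code: the function names, the uniform initialization, and the fixed iteration count are assumptions, and the matrix `H` of values h(q,r) is taken as given (the helper `h` shows the binomial form, with `eps` standing for an assumed per-base error rate).

```python
from math import comb


def h(l, j, eps):
    """Probability of a read of length l with exactly j mismatches at error rate eps."""
    return comb(l, j) * (1 - eps) ** (l - j) * eps ** j


def em_frequencies(H, o, iterations=100):
    """Estimate candidate frequencies f from H[q][r] = h_{q,r} and read counts o[r]."""
    Q, R = len(H), len(o)
    f = [1.0 / Q] * Q            # assumed starting point: uniform frequencies
    total = sum(o)
    for _ in range(iterations):
        # E step: p[q][r] = f_q h_{q,r} / sum over q' of f_{q'} h_{q',r}
        p = [[0.0] * R for _ in range(Q)]
        for r in range(R):
            z = sum(f[q] * H[q][r] for q in range(Q))
            if z > 0:            # a read no candidate can emit is ignored
                for q in range(Q):
                    p[q][r] = f[q] * H[q][r] / z
        # M step: f_q = sum over r of p[q][r] o_r, divided by sum over r of o_r
        f = [sum(p[q][r] * o[r] for r in range(R)) / total for q in range(Q)]
    return f
```

When every read is compatible with exactly one candidate, for example `H = [[1, 0], [0, 1]]` with read counts `[30, 10]`, the estimate settles after one iteration at the read proportions 0.75 and 0.25; reads shared between candidates are what make the iteration necessary.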

Shotgun vs Amplicon Reads

• Shotgun reads
  – Starting positions distributed uniformly
• Amplicon
  – Each read has predefined start/end covering fixed overlapping windows
• Existing qsps reconstruction algorithm
  – [Prosperi et al 2010]
• Read graph is constructed from amplicons

Variability of Reconstructed Sequences

Sequencing primer ATGGTTTGTGGTTTAATTCACTTTC

122 clones of avg. length 500bp sequenced using Sanger

Reconstruction from Amplicon Reads

• Minimum entropy objective
  – Fractional relaxation of parsimony
• Our algorithm is Maximum Bandwidth Path
  – Find a path whose minimum count is maximized (bandwidth)
  – Reduce counts by the bandwidth
  – Repeat until no valid paths exist
• Average sensitivity/ppv for ViSpA and Max-Bandwidth algorithm given 10 & 20 qsps, 5k, 20k, 100k reads, and various qsps distributions
  – Max-Bandwidth does better when read count is lower
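The Maximum Bandwidth Path loop can be sketched in Python as follows. The graph representation is an assumption made for illustration: counts are attached to vertices (distinct amplicon reads), `order` is a topological order of the vertices, `succ` maps a vertex to its successors in the next window, and `sources` and `sinks` are the vertices of the first and last window. None of these names come from the authors' software.

```python
def max_bandwidth_path(order, succ, counts, sources, sinks):
    """Return (bandwidth, path) for a source-to-sink path maximizing the minimum count."""
    best = {v: 0 for v in order}     # best bottleneck over paths ending at v
    prev = {v: None for v in order}
    for v in order:                  # order is assumed to be topological
        if v in sources:
            best[v] = counts[v]
        for w in succ.get(v, []):
            cand = min(best[v], counts[w])
            if cand > best[w]:
                best[w], prev[w] = cand, v
    end = max(sinks, key=lambda v: best[v])
    bandwidth = best[end]
    if bandwidth <= 0:
        return 0, []                 # no valid path remains
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return bandwidth, path[::-1]


def peel_paths(order, succ, counts, sources, sinks):
    """Repeatedly extract a max bandwidth path and reduce counts by its bandwidth."""
    counts = dict(counts)            # work on a copy
    result = []
    while True:
        bandwidth, path = max_bandwidth_path(order, succ, counts, sources, sinks)
        if bandwidth == 0:
            return result
        for v in path:
            counts[v] -= bandwidth
        result.append((path, bandwidth))
```

Each round drives at least one vertex count to zero, and a zero-count vertex can never lie on a later path, so the loop ends after at most as many rounds as there are vertices. The extracted bandwidths, divided by their sum, give one possible frequency estimate for the reconstructed sequences.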

NJ Tree of Sanger & ViSpA Qsps

Ongoing Work

• Correction for coverage bias
• Comparison of shotgun and amplicon based reconstruction methods
• Quasispecies reconstruction from Ion Torrent reads
• Combining long and short read technologies
• Study of quasispecies persistence and evolution in layer flocks following administration of modified live IBV vaccine
• Optimization of vaccination strategies

Selected Publications

• I. Astrovskaya, B. Tork, S. Mangul, K. Westbrooks, I.I. Mandoiu, P. Balfe, and A. Zelikovsky, Inferring Viral Quasispecies Spectra from 454 Pyrosequencing Reads, BMC Bioinformatics 12(Suppl 6):S1, 2011
• M. Nicolae, S. Mangul, I.I. Mandoiu, and A. Zelikovsky, Estimation of alternative splicing isoform frequencies from RNA-Seq data, Algorithms for Molecular Biology 6:9, 2011
• N. Mancuso, B. Tork, I.I. Mandoiu, A. Zelikovsky, and P. Skums, Viral Quasispecies Reconstruction from Amplicon 454 Pyrosequencing Reads, Proc. 1st Workshop on Computational Advances in Molecular Epidemiology, pp. 94-101, 2011
• S. Mangul, I. Astrovskaya, M. Nicolae, B. Tork, I.I. Mandoiu, and A. Zelikovsky, Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data, Proc. 11th Workshop on Algorithms in Bioinformatics, pp. 213-224, 2011