
Optimal Tag SNP Selection for Haplotype Reconstruction

Jin Jun and Ion Mandoiu

Computer Science & Engineering Department University of Connecticut

Approaches to Phasing Motivation and Contributions

• To reduce prohibitively expensive haplotyping costs, a two-stage methodology has recently been proposed [3]:
  – Pilot Study
    • All SNPs of interest are genotyped in a small sample of the population
    • Common haplotypes are inferred using statistical methods
    • A set of tag SNPs is selected for the population study
  – Population Study
    • Tag SNPs are genotyped in the remaining population
    • Statistical methods are used to infer haplotypes over the tag SNPs
    • Haplotypes over the tag SNPs are extrapolated to full haplotypes
• We propose novel tag SNP selection methods based on integer linear programming. Our methods:
  – Allow computing the complete tradeoff curve between genotyping cost and reconstruction accuracy
  – Yield improved reconstruction accuracy by taking haplotype frequencies into account

Background

• A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the possible four nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals.

• In diploid organisms such as humans, there are two non-identical copies of each chromosome. A description of the SNPs in each chromosome is called a haplotype.

• At present, it is prohibitively expensive to directly determine the haplotypes of an individual, but it is relatively easy to obtain the conflated SNP information of both chromosome copies, the so-called genotype.

• The genotyping cost depends on the number of SNPs typed. To reduce this cost, one needs a small number of SNPs (tag SNPs) that predict the remaining SNPs.
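To make the notion of a "conflated" genotype concrete, here is a minimal Python sketch. It assumes an encoding that is not specified in the poster: each haplotype is a string over '0'/'1' (the two alleles of each SNP), and the genotype records '2' at heterozygous SNPs. The function name genotype is hypothetical.

```python
def genotype(hap_a, hap_b):
    """Conflate two haplotypes into a genotype.

    Assumed encoding: haplotypes are strings over {'0', '1'}.
    Per SNP, the genotype keeps the allele at homozygous sites and
    writes '2' at heterozygous sites, so the phase is lost.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


# Two different haplotype pairs that yield the same genotype.
pair_1 = ("0101", "0011")
pair_2 = ("0111", "0001")
print(genotype(*pair_1))  # 0221
print(genotype(*pair_2))  # 0221
```

Because both pairs map to the same genotype, the haplotypes cannot be read off directly and must be inferred (phased) statistically.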

Previous Work on Tag SNPs

• Bafna et al. [1]: Informative SNP Set Problem
  – Find set of k SNPs with maximum “informativeness”
• Sebastiani et al. [5]: Best Enumeration SNP Tags (BEST)
  – Generates all optimum fully informative tag SNP sets
  – Limitation: worst-case runtime grows exponentially

    n            10      12    14     16      18      20
    BEST time*   <.01s   2s    29s    14m8s   6h4m    4d18h
    * running BEST on the n x n identity matrix

• Barzuza et al. [2]: Phasing Tagging SNP problem
  – Find the minimum number of SNPs for which every two distinct haplotype pairs yield distinct (XOR) genotypes
  – Limitation: in practice, many pairs of haplotypes will give the same genotype even if all SNPs are used as tags
• Halperin et al. [4]: Genotype Tagging SNPs
  – Find set of k SNPs allowing most accurate genotype reconstruction

Optimum Fully Informative Tag SNP Sets by Integer Programming

• Given: haplotypes h_1, h_2, …, h_m over n SNPs
• Find: minimum number of tag SNPs
• Such that: every two distinct haplotypes differ in at least one tag SNP
• Integer Program Formulation
  - 0/1 variable x_j for every SNP j: x_j = 1 if SNP j is selected as a tag SNP, x_j = 0 otherwise

\[
\begin{aligned}
\text{Min} \quad & \sum_{j=1}^{n} x_j \\
\text{s.t.} \quad & \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j \;\ge\; 1, \qquad 1 \le i < i' \le m \\
& x_j \in \{0, 1\}, \qquad 1 \le j \le n
\end{aligned}
\]

• Can be solved efficiently using general purpose solvers such as CPLEX
  - In practice significantly faster than BEST
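The constraints of this integer program can be illustrated with a short Python sketch. It does not call CPLEX or any other ILP solver. Instead it builds the same covering constraints (one set of differing SNPs per haplotype pair) and finds a minimum tag set by exhaustive search, which is only practical for small n. Haplotypes are assumed to be equal-length strings, and the function names are hypothetical.

```python
from itertools import combinations


def difference_sets(haplotypes):
    """For every pair i < i', collect the SNPs j with h_i(j) != h_i'(j).

    Each set is the index set of one constraint of the integer
    program: the x_j over the set must sum to at least 1.
    """
    return [
        {j for j, (a, b) in enumerate(zip(h1, h2)) if a != b}
        for h1, h2 in combinations(haplotypes, 2)
    ]


def min_fully_informative_tags(haplotypes):
    """Smallest SNP set distinguishing all distinct haplotypes.

    Brute force by increasing size (a stand-in for an ILP solver);
    the running time is exponential in the number of SNPs.
    """
    distinct = sorted(set(haplotypes))
    n = len(distinct[0])
    constraints = difference_sets(distinct)
    for size in range(n + 1):
        for tags in combinations(range(n), size):
            chosen = set(tags)
            # Feasible iff every constraint contains a chosen SNP.
            if all(chosen & diff for diff in constraints):
                return list(tags)
    raise AssertionError("unreachable: the full SNP set is always feasible")


haps = ["0000", "0011", "1100", "1111"]
print(min_fully_informative_tags(haps))  # [0, 2]
```

In this example SNPs 0 and 1 carry the same information, as do SNPs 2 and 3, so one SNP from each group suffices.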

Tag SNP Selection and Haplotype Reconstruction Flow

[Flow diagram]

Pilot Study: Population Sample -> Phasing -> Sample haplotypes (with frequencies) -> Tag Selection -> Tag SNP Set

Population Study: Remaining Population -> Genotype (tag SNPs) -> Phasing -> Haplotype pairs (tag SNPs) -> Extrapolation -> Haplotype pairs (all SNPs)

Tag SNP Selection for Haplotype Reconstruction

Reconstruction Errors

• Haplotypes not represented in sample population
  - Cannot be reconstructed!
  - Minimized by choosing sample large enough
• Incorrect inferred haplotypes over tag SNPs
  - Minimized by using accurate haplotype inference (phasing) methods
  - We use PHASE [6] for phasing sample genotypes as well as population genotypes over tag SNPs
• Incorrect haplotype extrapolation
  - Our extrapolation procedure:
    - Find sample haplotype with minimum Hamming distance
    - Break ties according to the frequency of sample haplotypes (most frequent haplotypes are given preference)
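The extrapolation rule can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: haplotypes are strings, the sample is a dictionary from full haplotype to estimated frequency, one haplotype is extrapolated at a time, and the names extrapolate, tags, and sample are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def extrapolate(tag_haplotype, tags, sample):
    """Extend a haplotype over the tag SNPs to a full haplotype.

    tag_haplotype: alleles at the tag SNPs, in the order of `tags`
    tags: indices of the tag SNPs within a full haplotype
    sample: dict mapping full sample haplotype -> estimated frequency
    """
    def key(full):
        restricted = "".join(full[j] for j in tags)
        # Smaller Hamming distance first; on ties, higher frequency first.
        return (hamming(restricted, tag_haplotype), -sample[full])

    return min(sample, key=key)


sample = {"00110": 0.5, "01100": 0.3, "11001": 0.2}
print(extrapolate("11", tags=[1, 2], sample=sample))  # 01100 (exact match)
print(extrapolate("00", tags=[1, 2], sample=sample))  # 00110 (tie, most frequent wins)
```

In the second call, "00110" and "11001" are both at Hamming distance 1 over the tag SNPs, so the more frequent sample haplotype is returned.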

Informal Problem Definition

• Given: sample haplotypes and frequencies
• Find: K tag SNPs maximizing reconstruction accuracy

ILP Formulation (1)

• Integer program formulation similar to that for the fully informative tag SNP problem

ILP1

\[
\begin{aligned}
\text{Max} \quad & \sum_{1 \le i < i' \le m} y_{i,i'} \\
\text{s.t.} \quad & \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j \;\ge\; y_{i,i'}, \qquad 1 \le i < i' \le m \\
& \sum_{j=1}^{n} x_j \;\le\; K \\
& x_j,\; y_{i,i'} \in \{0, 1\}
\end{aligned}
\]

• 0/1 variable x_j set to 1 iff SNP j is selected as a tag SNP
• Only K SNPs can be selected
• 0/1 variable y_{i,i'} set to 1 iff haplotypes h_i, h_{i'} are distinguished by at least one selected SNP
• Objective is to maximize informativeness, i.e., the number of pairs of haplotypes distinguished by selected SNPs
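For small instances, the ILP1 objective can be evaluated directly. The Python sketch below is an illustration, not the solver-based method from the poster. It counts the haplotype pairs distinguished by a tag set and selects the best K tags by exhaustive search. Since adding a tag never reduces informativeness, it searches sets of exactly K SNPs. Function names are hypothetical.

```python
from itertools import combinations


def informativeness(haplotypes, tags):
    """Number of haplotype pairs distinguished by at least one tag SNP."""
    return sum(
        any(h1[j] != h2[j] for j in tags)
        for h1, h2 in combinations(haplotypes, 2)
    )


def best_k_tags(haplotypes, k):
    """Exhaustively pick k SNPs maximizing informativeness (small n only)."""
    n = len(haplotypes[0])
    return max(
        combinations(range(n), k),
        key=lambda tags: informativeness(haplotypes, tags),
    )


haps = ["000", "001", "010", "111"]
for k in range(1, 4):
    tags = best_k_tags(haps, k)
    print(k, tags, informativeness(haps, tags))
# 1 (1,) 4
# 2 (1, 2) 6
# 3 (0, 1, 2) 6
```

Running the selection for every K traces a tradeoff between the number of SNPs typed and the number of distinguished pairs, here 4, 6, and 6 out of 6 pairs.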

ILP Formulation (2)

• Reconstruction accuracy can be improved by considering haplotype frequencies

ILPf: ILP with frequency

\[
\begin{aligned}
\text{Max} \quad & \sum_{1 \le i < i' \le m} y_{i,i'} \, p_i \, p_{i'} \\
\text{s.t.} \quad & \sum_{j \,:\, h_i(j) \neq h_{i'}(j)} x_j \;\ge\; y_{i,i'}, \qquad 1 \le i < i' \le m \\
& \sum_{j=1}^{n} x_j \;\le\; K \\
& x_j,\; y_{i,i'} \in \{0, 1\}
\end{aligned}
\]

• Select K tag SNPs maximizing the total probability of distinguished pairs of haplotypes

• The probability p_i of haplotype h_i in the population is estimated from the initial sample using the frequencies computed by PHASE
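The effect of the frequency weights can be seen on a toy instance. The Python sketch below uses exhaustive search rather than an ILP solver, with invented haplotypes and frequencies and hypothetical function names. It weights each distinguished pair by p_i * p_i'. Passing a weight of 1 for every haplotype recovers the unweighted pair count.

```python
from itertools import combinations


def weighted_informativeness(haplotypes, freqs, tags):
    """Sum of p_a * p_b over haplotype pairs distinguished by the tags."""
    return sum(
        freqs[a] * freqs[b]
        for a, b in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[a][j] != haplotypes[b][j] for j in tags)
    )


def best_k_tags_weighted(haplotypes, freqs, k):
    """Exhaustively pick k SNPs maximizing the weighted objective."""
    n = len(haplotypes[0])
    return max(
        combinations(range(n), k),
        key=lambda tags: weighted_informativeness(haplotypes, freqs, tags),
    )


# Toy data (made up): the last haplotype is by far the most common.
haps = ["000", "001", "100", "111"]
freqs = [0.1, 0.1, 0.1, 0.7]

print(best_k_tags_weighted(haps, [1, 1, 1, 1], 1))  # (0,): most pairs split
print(best_k_tags_weighted(haps, freqs, 1))         # (1,): isolates the common haplotype
```

With uniform weights, SNP 0 is chosen because it distinguishes 4 of the 6 pairs. With the frequencies above, SNP 1 is chosen: it distinguishes only 3 pairs, but all of them involve the common haplotype, giving a total weight of 0.21 against 0.16 for SNP 0.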

Experimental Setup

Datasets and Parameters: We used synthetic datasets generated following the methodology in [3] for 2 populations (European and West African) on 2 regions (IL8 and 5q31). For each of the 4 population-region combinations, we used the haplotypes and frequencies inferred in [3] from the real data to generate 5 datasets containing between 200 and 1000 individuals. For each dataset, we picked 5 random samples of size 5 times the number of SNPs (we ran our algorithms using predetermined block sizes of 10 and 20 SNPs). Random selections of tag SNPs (Rand) were performed for comparison.

[Figures: phasing accuracy (%) versus number of tag SNPs K for ILP1, ILPf, and Rand. Panels: IL8-euro with SNP=10 and with SNP=20; and, with one curve per method and dataset size (200, 400, 600, 800, and 1000 individuals), 5q31-euro, 5q31-wafr, IL8-euro, and IL8-wafr with SNP=10.]

Error Analysis

• Correct haplotype pairs
  - Single-Correct: inferred haplotype pair over tag SNPs compatible with a single pair of sample haplotypes
  - Multi-Correct: inferred haplotype pair over tag SNPs compatible with multiple pairs of sample haplotypes, and most frequent is correct
• Incorrect haplotype pairs
  - Missing: one or both real haplotypes not present in sample population
  - Wrong Short: incorrect inferred haplotypes over tag SNPs
  - Multi-Wrong: inferred haplotype pair over tag SNPs compatible with multiple pairs of sample haplotypes, and most frequent is incorrect

[Figures: percentage (0 to 100%) of haplotype pairs in each category (Missing, Wrong Short, Multi-Wrong, Multi-Correct, Single-Correct) versus K = 1, ..., 9, with one panel each for ILP1, ILPf, and Rand.]

Conclusions

• Preliminary experiments show that using haplotype frequencies improves reconstruction accuracy compared to random selection and ILP1

• In ongoing work we are extending our methods to the reconstruction of long haplotypes by using integer program formulations based on overlapping blocks, and are comparing them to other reconstruction flows, including tag SNP based genotype reconstruction as in [4] followed by phasing

References:

1. V. Bafna, B.V. Halldórsson, R.S. Schwartz, A.G. Clark, and S. Istrail, Haplotypes and informative SNP selection algorithms: Don’t block out information, RECOMB’03, pp. 19-27, 2003.

2. T. Barzuza, J.S. Beckmann, R. Shamir, and I. Pe’er, Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs, CPM 2004, LNCS 3109, pp. 14-31, 2004.

3. J. Forton, D. Kwiatkowski, K. Rockett, G. Luoni, M. Kimber, and J. Hull, Accuracy of haplotype reconstruction from haplotype-tagging single-nucleotide polymorphisms, American Journal of Human Genetics, 76(3), pp. 438-48, 2005.

4. E. Halperin, G. Kimmel, and R. Shamir, Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy, Proc. ISMB 2005.

5. P. Sebastiani, R. Lazarus, S.T. Weiss, L.M. Kunkel, I.S. Kohane, and M.F. Ramoni, Minimal haplotype tagging, Proc. National Academy of Sciences, 100(17), pp. 9900-9905, 2003.

6. M. Stephens, N. Smith, and P. Donnelly, A new statistical method for haplotype reconstruction from population data, American Journal of Human Genetics, 68, pp. 978-989, 2001.