Inferring Local Tree Topologies for SNP Sequences Under Recombination in a Population


Yufeng Wu
Dept. of Computer Science and Engineering
University of Connecticut, USA
MIEP 2008
Genetic Variations
DNA sequences    Haplotypes (one bit per SNP site)
AATGTAGCCGA      00100
AATATAACCTA      01010
AATGTAGCCGT      00101
AATGTAACCTA      00010
CATATAGCCGT      11101
Each SNP induces a split.
• Single-nucleotide polymorphism (SNP): a site
(genomic location) where two types of
nucleotides occur frequently in the population.
– Haplotype: a binary vector of SNP alleles (encoded as 0/1).
• Haplotypes offer hints about the gene genealogy.
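As a minimal sketch of the two ideas on this slide, the Python below finds the variable columns of the five DNA sequences and computes the split induced by one SNP of the haplotypes. The function names `snp_sites` and `induced_split` are illustrative, not from the talk, and the sketch treats any variable column as a SNP without applying a frequency threshold.

```python
def snp_sites(sequences):
    """Return 0-based positions where more than one nucleotide occurs."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def induced_split(haplotypes, site):
    """Partition sequence indices by their allele (0 or 1) at one SNP site."""
    zeros = frozenset(i for i, h in enumerate(haplotypes) if h[site] == "0")
    ones = frozenset(i for i, h in enumerate(haplotypes) if h[site] == "1")
    return zeros, ones


# Data copied from the slide.
dna = ["AATGTAGCCGA", "AATATAACCTA", "AATGTAGCCGT", "AATGTAACCTA", "CATATAGCCGT"]
haplotypes = ["00100", "01010", "00101", "00010", "11101"]

sites = snp_sites(dna)                      # variable columns of the alignment
first_split = induced_split(haplotypes, 0)  # split induced by the first SNP
```

On this data the first SNP separates the last sequence from the other four.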
Gene Genealogy: Evolutionary History
of Genomic Sequences
• Tells how sequences in a population are related
• Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations
• Problem: how to determine the genealogy for “unrelated” sequences?
• Complicated by recombination
Figure: a genealogy with a disease mutation on one branch; the sequences in the current population are labeled diseased (case) or healthy (control).
Recombination
• One of the principal genetic forces shaping sequence variation within species
• Two equal-length sequences generate a third sequence of the same length in the genealogy
• Spatial order is important: different parts of the genome are inherited from different ancestors.
110001111111001    Prefix: 11000
000110000001111    Suffix: 0000001111
Breakpoint: between sites 5 and 6
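The prefix/suffix picture can be written as a one-line string operation. This is only a sketch of a single-breakpoint recombination; the function name `recombine` and its argument order are assumptions made for illustration.

```python
def recombine(prefix_parent, suffix_parent, breakpoint):
    """Join the first `breakpoint` sites of one parent to the rest of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint must lie within the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# The two parental sequences from the slide, with the breakpoint after site 5.
child = recombine("110001111111001", "000110000001111", 5)
```

The child has the same length as its parents: the prefix 11000 followed by the suffix 0000001111.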
Ancestral Recombination Graph (ARG)
Figure: an ARG rooted at sequence 00, with mutation and recombination events, whose leaves are the sequences S1 = 00, S2 = 01, S3 = 10, S4 = 11.
Assumption: at most one mutation per site
Local Trees
• An ARG represents a set of local trees.
• Each tree describes a contiguous genomic region.
• No recombination between two sites implies the same local tree for the two sites
• Local tree topology: informative and useful
Figure: an ARG and its local trees, with panels labeled "Local tree near sites 1 and 2", "Local tree near site 2", and "Local tree to the right of site 3".
Inference of Local Tree Topologies
• Question: given SNP
haplotypes, infer local tree
topologies (one tree for each
SNP site, ignore branch length)
– Hein (1990, 1993)
• Enumerate all possible tree
topologies at each site
– Song and Hein (2003, 2005)
– Parsimony-based
• Local tree reconstruction can be
formulated as inference on a
hidden Markov model.
Local Tree Topologies
• Key technical difficulty
– Brute-force enumeration of local tree topologies is not feasible when the number of sequences exceeds 9
• Cannot enumerate all tree topologies
• Trivial solution: for each SNP, create a tree containing the single split induced by that SNP.
– Always correct (assuming one mutation per site)
– But not very informative: more refined trees are needed!
Alleles at one SNP: A: 0, B: 0, C: 1, D: 0, E: 1, F: 0, G: 1, H: 0
Induced split: {C, E, G} | {A, B, D, F, H}
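To see why brute-force enumeration breaks down, it helps to count the candidate topologies. The sketch below assumes rooted, leaf-labeled binary trees, for which the count is the double factorial (2n - 3)!!; the slide does not say which tree model is counted, so treat the exact figures as illustrative.

```python
from math import prod


def num_rooted_binary_topologies(n_leaves):
    """Count rooted, leaf-labeled binary tree topologies: (2n - 3)!!."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    # Product of the odd numbers 1, 3, ..., 2n - 3 (empty product is 1).
    return prod(range(1, 2 * n_leaves - 2, 2))


counts = {n: num_rooted_binary_topologies(n) for n in (3, 9, 10)}
```

Under this assumption there are already about two million topologies for 9 sequences and more than 34 million for 10, and that set would be needed at every SNP site.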
How to do better? Neighboring
Local Trees are Similar!
• Nearby SNP sites provide hints!
– Nearby local trees are often topologically similar
– Recombination often alters only small parts of the trees
• Key idea: reconstruct local trees by combining information from multiple nearby SNPs
RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly non-binary) tree topology
– Initialize to a tree containing the split induced by the SNP
• Gradually refine the trees by adding new splits
– Splits are found by a set of rules (described later)
– Splits added early may be more reliable
• Stop when the trees are binary or enough information is recovered
A Little Background: Compatibility
M   1 2 3 4 5
a   0 0 0 1 0
b   1 0 0 1 0
c   0 0 1 0 0
d   1 0 1 0 0
e   0 1 1 0 0
f   0 1 1 0 1
g   0 0 1 0 1
Sites 1 and 2 are compatible, but sites 1 and 3 are incompatible.
• Two sites (columns) p, q are incompatible if columns p and q together contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
• Easily extended to splits.
• A split s is incompatible with a tree T if s is incompatible with at least one split in T. Two trees are compatible if their splits are pairwise compatible.
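The compatibility definition above is the four-gamete test, and it can be checked directly on the matrix M from this slide. The sketch below uses illustrative names (`column`, `compatible`, `split_compatible_with_tree`) and represents a tree simply as a collection of 0/1 columns, which is an assumption made for brevity.

```python
# Matrix M from the slide: one row per sequence, one character per site.
M = {
    "a": "00010", "b": "10010", "c": "00100", "d": "10100",
    "e": "01100", "f": "01101", "g": "00101",
}


def column(matrix, site):
    """Return the column for a 1-based site number as a tuple of '0'/'1'."""
    return tuple(row[site - 1] for row in matrix.values())


def compatible(col_p, col_q):
    """Four-gamete test: incompatible only if 00, 01, 10 and 11 all occur."""
    return len(set(zip(col_p, col_q))) < 4


def split_compatible_with_tree(split, tree_splits):
    """A split fits a tree if it is compatible with every split in the tree."""
    return all(compatible(split, s) for s in tree_splits)


sites_1_2 = compatible(column(M, 1), column(M, 2))
sites_1_3 = compatible(column(M, 1), column(M, 3))
```

Here `sites_1_2` is true and `sites_1_3` is false, matching the note next to the matrix.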
Fully-Compatible Region: Simple Case
• A region of consecutive SNP sites where these
SNPs are pairwise compatible.
– May indicate no topology-altering recombination
occurred within the region
• Rule: for a site s in such a region, add the splits of the other sites in the region to the tree at s.
– Compatibility is a very strong property and is unlikely to arise by chance.
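One way to locate fully-compatible regions is to scan for maximal runs of consecutive, pairwise compatible sites. The sketch below does this for the five columns of the matrix M shown earlier; the function names are illustrative, and how to choose among overlapping regions that contain the same site is left open because the slides do not specify it.

```python
def compatible(p, q):
    """Four-gamete test on two 0/1 columns."""
    return len(set(zip(p, q))) < 4


def canonical(split):
    """Encode a split so the first sequence is always on side '0'."""
    flip = {"0": "1", "1": "0"}
    return split if split[0] == "0" else "".join(flip[c] for c in split)


def fully_compatible_regions(columns):
    """Return maximal intervals (start, end), 0-based and inclusive."""
    regions, last_end = [], -1
    for start in range(len(columns)):
        end = start
        # Extend while the next column is compatible with the whole run.
        while end + 1 < len(columns) and all(
            compatible(columns[k], columns[end + 1]) for k in range(start, end + 1)
        ):
            end += 1
        if end > last_end:  # not contained in an earlier region
            regions.append((start, end))
            last_end = end
    return regions


def splits_in_region(columns, region):
    """Distinct splits contributed by the sites of one region."""
    start, end = region
    return {canonical(c) for c in columns[start:end + 1]}


# Columns of M for sites 1..5, read top to bottom (rows a..g).
columns = ["0101000", "0000110", "0011111", "1100000", "0000011"]
regions = fully_compatible_regions(columns)
```

Sites 3 and 4 of M induce the same split (their columns are complements), so the region covering sites 3 to 5 contributes only two distinct splits.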
Split Propagation: More General Rule
• Three consecutive sites 1, 2, and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
– The trees at sites 1 and 2 are different.
– Suppose site 3 is compatible with both sites 1 and 2. Then?
– Site 3 may indicate a subtree shared by the trees at sites 1 and 2.
• Rule: a split propagates in both directions until it reaches an incompatible tree.
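The propagation rule can be sketched as a walk outward from the split's own site, adding the split to each neighboring tree until one rejects it. In this sketch each tree is just a set of 0/1-encoded splits and the names `tree_accepts` and `propagate_split` are hypothetical; the data are the columns of the matrix M from the compatibility slide.

```python
def compatible(p, q):
    """Four-gamete test on two 0/1 columns."""
    return len(set(zip(p, q))) < 4


def canonical(split):
    """Encode a split so the first sequence is always on side '0'."""
    flip = {"0": "1", "1": "0"}
    return split if split[0] == "0" else "".join(flip[c] for c in split)


def tree_accepts(tree, split):
    """True if the split is compatible with every split already in the tree."""
    return all(compatible(split, s) for s in tree)


def propagate_split(trees, origin, split):
    """Add `split` to neighboring trees in both directions; return sites reached."""
    reached = [origin]
    for step in (-1, 1):
        site = origin + step
        while 0 <= site < len(trees) and tree_accepts(trees[site], split):
            trees[site].add(split)
            reached.append(site)
            site += step
    return sorted(reached)


# One initial tree per site of M, each holding only its own SNP split.
columns = ["0101000", "0000110", "0011111", "1100000", "0000011"]
trees = [{canonical(c)} for c in columns]
reached = propagate_split(trees, 2, canonical(columns[2]))
```

Starting from site 3 (index 2), the split reaches the trees at sites 2, 4 and 5 but stops before site 1, whose split is incompatible with it.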
Unique Refinement
• Consider the subtree with leaves 1, 2, and 3.
– Which refinement is more likely?
– Add the split of 1 and 2: the only split that is compatible with the neighboring tree T2.
• Rule: refine a non-binary node using the only candidate split that is compatible with the neighboring trees
Figure: an unresolved (non-binary) node with leaves 1, 2, and 3; the way they join is marked "?".
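The unique-refinement rule can be sketched by listing the candidate splits that group two children of the unresolved node and keeping the candidate only if exactly one is compatible with the neighboring tree. The leaf set and the splits of T2 below are hypothetical, and considering only pairs of children is a simplifying assumption.

```python
from itertools import combinations


def splits_compatible(a, b, leaves):
    """Splits a|rest and b|rest are compatible if some intersection is empty."""
    a_rest, b_rest = leaves - a, leaves - b
    return not (a & b) or not (a & b_rest) or not (a_rest & b) or not (a_rest & b_rest)


def unique_refinement(children, neighbor_splits, leaves):
    """Return the single compatible pairwise grouping, or None if not unique."""
    candidates = [x | y for x, y in combinations(children, 2)]
    fitting = [
        c for c in candidates
        if all(splits_compatible(c, s, leaves) for s in neighbor_splits)
    ]
    return fitting[0] if len(fitting) == 1 else None


leaves = frozenset({1, 2, 3, 4, 5})
children = [frozenset({1}), frozenset({2}), frozenset({3})]  # unresolved node
t2_splits = [frozenset({1, 2}), frozenset({1, 2, 3})]        # hypothetical T2
chosen = unique_refinement(children, t2_splits, leaves)
```

With these inputs only the grouping of leaves 1 and 2 survives; with no neighboring splits all three groupings fit and the function returns None.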
One Subtree-Prune-Regraft (SPR) Event
• Recombination: simulated by an SPR.
– The rest of the two trees (without the pruned subtrees) remains the same
• Rule: find an identical subtree Ts in neighboring trees T1 and T2, such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding Ts back.
• More complex rules are possible.
Figure label: "Subtree to prune".
Simulation
• Hudson’s program MS (with known coalescent local tree topologies): 100 datasets for each setting.
– Handles much larger data than Song and Hein’s method, and performs better or similarly on small data.
• Test: local tree topology recovery, scored by Song and Hein’s shared-split measure
 = 15
 = 50
Acknowledgement
• Software available upon request.
• More information available at:
http://www.engr.uconn.edu/~ywu
• I want to thank
– Yun S. Song
– Dan Gusfield
References:
Y. Wu: New Methods for Inference of Local Tree Topologies with Recombinant SNP Sequences in Populations, submitted for publication, 2008.
Y. S. Song and J. Hein: Constructing Minimal Ancestral Recombination Graphs. J. of Comp. Bio., 2005, 12, p. 159-178.