The rise (and fall) of QTL mapping:
The fusion of quantitative
& molecular genetics
Bruce Walsh ([email protected])
Depts of Ecology & Evolutionary Biology
Plant Sciences
Animal Sciences
Molecular & Cellular Biology
Epidemiology & Biostatistics
University of Arizona
Rough outline
• Classical Quantitative Genetics
• The Golden Age: The search for QTLs
– History and review of methods
• History revised: how successful has the search for
QTLs been?
• The next wave:
– eQTLs
– Association mapping
– Molecular signatures of selection
– Are these improvements?
• Summary: Where is quantitative genetics today?
Quantitative Genetics
Quantitative Genetics is the analysis of traits whose
variation is influenced by both genetic and environmental
factors
The assumption is that the genotype of an individual
cannot be easily predicted from its phenotype.
Indeed, the genotypes (and hence loci) contributing
to trait variation have historically been assumed to be
unknown and largely unknowable.
“Classical” Quantitative Genetics works with genetic
variance components, which are often easy to estimate.
Genetic variance components
Fisher (1918) reconciled quantitative traits with Mendelian
Genetics, building on statistical machinery developed by
the biometricians. The term variance was first introduced
in Fisher’s paper (as well as ANOVA)
Z=G+E
Fisher’s key insight was that, in sexual species, parents
do not pass along their genotypic value G to their
offspring, but rather only pass along part, the
breeding value A,
G=A+D+I
Fisher also noted that the variance of A can be
estimated by phenotypic covariances among relatives
Variance components and Selection Response
Cov(Parent, offspring) = Var(A)/2
Cov(half sibs) = Var(A)/4
Cov(full sibs) = Var(A)/2 + Var(D)/4 + Var(Ec)
Thus, without any genetic information, we can still
estimate important genetic features associated with
the trait variation in a particular population.
Key use: The Breeders’ Equation for selection response
R = h^2 S, with the heritability h^2 = Var(A)/Var(P)
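As a minimal sketch of how the covariance relationships above turn into numbers, the Python snippet below recovers Var(A) from a half-sib covariance and then predicts the selection response with the Breeders' Equation. The function names and all numeric inputs are illustrative assumptions, not values from the talk.

```python
def additive_variance_from_half_sibs(cov_half_sibs):
    """Cov(half sibs) = Var(A)/4, so Var(A) = 4 * Cov(half sibs)."""
    return 4.0 * cov_half_sibs


def heritability(var_a, var_p):
    """Narrow-sense heritability h^2 = Var(A)/Var(P)."""
    return var_a / var_p


def selection_response(h2, s):
    """Breeders' Equation: R = h^2 * S."""
    return h2 * s


# Hypothetical inputs (assumptions for illustration only).
var_a = additive_variance_from_half_sibs(2.5)   # phenotypic covariance among half sibs
h2 = heritability(var_a, var_p=25.0)            # phenotypic variance
response = selection_response(h2, s=5.0)        # selection differential

print(f"Var(A) = {var_a}, h^2 = {h2}, R = {response}")
```

No genotype information enters the calculation; only phenotypic covariances and variances are used.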
Quantitative Genetics: The infinitesimal model
At the heart of much of classical quantitative
genetics is the infinitesimal model -- the genetic
variation in a trait is due to a large number of
loci each of small effect.
Classical quantitative genetics represents the
fusion of Mendelian and population genetics, under
the umbrella of classical statistical methods
What about a fusion of quantitative genetics with
molecular biology and genomics?
Statistics and Molecular biology
The success of “classical” quantitative-genetics
(variance components and related statistical measures)
has been spectacular, esp. in plant and animal breeding.
However, the solely statistical nature of this
approach has been unsettling to some, and the demise
of the field was predicted once we had a better
molecular handle on trait variation.
Thus, starting with the ability to score a vast number
of molecular markers, the fusion of molecular biology
and quantitative genetics seemed a possibility.
Quantitative Trait Loci, QTLs
The first “harvest” from the ability to score a modest
number of molecular markers was the ability to search for
Quantitative Trait Loci, QTLs, loci showing allelic
variation that influences trait variation (mid 1980’s).
Conceptually, nothing new, as this is just linkage analysis
Consider the gametes from an AB/ab parent, where
A & B are linked loci. We observe an excess of AB and ab
gametes, and a deficiency of Ab, aB.
Suppose B influences a trait, making it larger. Offspring
getting the A allele from this parent disproportionately
get the B allele as well, and hence have larger trait values.
Early localization of factors influencing quantitative traits
was done by Payne 1918, Sax 1923, and Thoday 1960’s
Sax (1923) crossed two inbred bean lines differing in seed
pigment and weight, with the pigmented parents having
heavier seeds than the nonpigmented parents.
These crosses demonstrated that seed pigment is
determined by a single locus with two alleles, P and p.
Among F2 segregants from this cross, PP and Pp seeds
were 4.3 +/- 0.8 and 1.9 +/- 0.6 centigrams heavier than
pp seeds.
Hence, the P allele is linked to a factor (or factors) that act
in an additive fashion on seed weight.
Markers and more markers
While the basic outline for QTL mapping has been
known for over 70 years, the lack of sufficient
genetic markers prevented its widespread use
until the mid 1980’s.
The early studies (in maize) used 50-80 markers, mostly
allozymes, which were very loosely linked (marker spacing
much greater than 20 cM)
With the advent of DNA (esp. STR = microsat) markers,
numbers and density of markers have grown, resulting
in a parallel development of more statistically-sophisticated
approaches to mapping to use this additional information.
The statistical machinery for QTL mapping
Single marker linear model approaches
Interval mapping: pairs of markers, move
to Maximum likelihood approaches
Composite Interval mapping: analysis of a marker
interval, flanked by adjacent markers. ML-based
Shrinkage and Bayesian approaches for detecting
epistasis
From line-cross analysis to the analysis of
outbred populations: mixed models
Conditional Probabilities of
QTL Genotypes
The basic building block for all QTL methods is
Pr(Qk | Mj) --- the probability of QTL genotype
Qk given the marker genotype is Mj.
$$\Pr(Q_k \mid M_j) = \frac{\Pr(Q_k M_j)}{\Pr(M_j)}$$
Consider a QTL linked to a marker (recombination
fraction = c). Cross MMQQ x mmqq. All F1 individuals
are MQ/mq.
Among the F1 gametes that form the F2, freq(MQ) = freq(mq) = (1-c)/2,
freq(mQ) = freq(Mq) = c/2
Hence, Pr(MMQQ) = Pr(MQ)Pr(MQ) = (1-c)^2/4
Pr(MMQq) = 2Pr(MQ)Pr(Mq) = 2c(1-c)/4
Pr(MMqq) = Pr(Mq)Pr(Mq) = c^2/4
Since Pr(MM) = 1/4, the conditional probabilities become
Pr(QQ | MM) = Pr(MMQQ)/Pr(MM) = (1-c)^2
Pr(Qq | MM) = Pr(MMQq)/Pr(MM) = 2c(1-c)
Pr(qq | MM) = Pr(MMqq)/Pr(MM) = c^2
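The derivation above can be checked numerically. The following Python sketch builds the joint F2 probabilities from the gamete frequencies and divides by Pr(MM) = 1/4; the function name and the example value c = 0.1 are assumptions chosen for illustration.

```python
def f2_qtl_probs_given_MM(c):
    """Return Pr(QQ|MM), Pr(Qq|MM), Pr(qq|MM) in an F2 for recombination fraction c."""
    gamete_MQ = (1.0 - c) / 2.0   # non-recombinant gamete carrying M
    gamete_Mq = c / 2.0           # recombinant gamete carrying M
    pr_MM = 0.25
    joint = {
        "QQ": gamete_MQ * gamete_MQ,
        "Qq": 2.0 * gamete_MQ * gamete_Mq,
        "qq": gamete_Mq * gamete_Mq,
    }
    # Condition on the marker genotype MM.
    return {genotype: p / pr_MM for genotype, p in joint.items()}


probs = f2_qtl_probs_given_MM(0.1)   # c = 0.1 is an arbitrary example
for genotype, p in probs.items():
    print(f"Pr({genotype} | MM) = {p:.4f}")
```

The three conditional probabilities sum to one for any c, which is a quick sanity check on the algebra.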
Expected Marker Means
The expected trait mean for marker genotype Mj
is just
$$\mu_{M_j} = \sum_{k=1}^{N} \mu_{Q_k} \Pr(Q_k \mid M_j)$$
For example, if QQ = 2a, Qq = a(1+k), qq = 0, then in
the F2 of an MMQQ/mmqq cross,
$$\frac{\mu_{MM} - \mu_{mm}}{2} = a(1 - 2c)$$
• If the trait mean is significantly different for the
genotypes at a marker locus, it is linked to a QTL
• A small MM-mm difference could be (i) a tightly-linked
QTL of small effect or (ii) loose linkage to a large QTL
Hence, the use of single markers provides for
detection of a QTL. However, single-marker means do
not allow separate estimation of a and c.
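The confounding of a and c can be seen in a short calculation. The sketch below computes the expected marker means from the conditional QTL probabilities and shows that two different (a, c) pairs give the same MM-mm contrast. The function names and parameter values are assumptions for illustration; the probabilities for mm follow from those for MM by symmetry.

```python
def qtl_probs(marker, c):
    """Pr(Q_k | marker genotype) in an F2; the mm case mirrors the MM case."""
    if marker == "MM":
        return {"QQ": (1 - c) ** 2, "Qq": 2 * c * (1 - c), "qq": c ** 2}
    if marker == "mm":
        return {"QQ": c ** 2, "Qq": 2 * c * (1 - c), "qq": (1 - c) ** 2}
    raise ValueError(f"unsupported marker genotype: {marker}")


def expected_marker_mean(marker, a, k, c):
    """Expected trait mean: sum over k of mu_Qk * Pr(Q_k | M_j)."""
    qtl_means = {"QQ": 2 * a, "Qq": a * (1 + k), "qq": 0.0}
    probs = qtl_probs(marker, c)
    return sum(qtl_means[g] * probs[g] for g in probs)


def half_contrast(a, k, c):
    """(mu_MM - mu_mm) / 2, which should equal a * (1 - 2c)."""
    return (expected_marker_mean("MM", a, k, c) - expected_marker_mean("mm", a, k, c)) / 2


# Hypothetical scenarios: large QTL loosely linked vs. small QTL at the marker.
loose_large = half_contrast(a=1.0, k=0.3, c=0.25)
tight_small = half_contrast(a=0.5, k=0.3, c=0.0)
print(loose_large, tight_small)
```

Both scenarios produce the same contrast, so the marker means alone cannot distinguish them.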
Now consider using interval mapping (flanking markers)
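$$\frac{\mu_{M_1M_1M_2M_2} - \mu_{m_1m_1m_2m_2}}{2} = a\left(\frac{1 - c_1 - c_2}{1 - c_1 - c_2 + 2c_1c_2}\right) \simeq a(1 - 2c_1c_2)$$
This is essentially a for even modest linkage.
$$c_1 = \frac{1}{2}\left(1 - \frac{\mu_{M_1M_1} - \mu_{m_1m_1}}{2a}\right) \simeq \frac{1}{2}\left(1 - \frac{\mu_{M_1M_1} - \mu_{m_1m_1}}{\mu_{M_1M_1M_2M_2} - \mu_{m_1m_1m_2m_2}}\right)$$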
Hence, a and c can be estimated from the mean values of
flanking marker genotypes
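A minimal sketch of the flanking-marker estimators, assuming no interference and an additive QTL lying between markers 1 and 2. The expected contrasts are generated from known values of a, c1 and c2 (hypothetical numbers), and the approximate estimators are then applied to them. Function names are illustrative.

```python
def flanking_half_contrast(a, c1, c2):
    """(mu_M1M1M2M2 - mu_m1m1m2m2)/2 = a(1 - c1 - c2)/(1 - c1 - c2 + 2 c1 c2)."""
    return a * (1 - c1 - c2) / (1 - c1 - c2 + 2 * c1 * c2)


def single_half_contrast(a, c1):
    """(mu_M1M1 - mu_m1m1)/2 = a(1 - 2 c1)."""
    return a * (1 - 2 * c1)


def estimate_a_and_c1(diff_flanking, diff_single):
    """Approximate estimators from the two mean differences.

    diff_flanking = mu_M1M1M2M2 - mu_m1m1m2m2 (roughly 2a)
    diff_single   = mu_M1M1 - mu_m1m1         (equal to 2a(1 - 2 c1))
    """
    a_hat = diff_flanking / 2
    c1_hat = 0.5 * (1 - diff_single / diff_flanking)
    return a_hat, c1_hat


# Hypothetical true values used to generate the expected contrasts.
true_a, true_c1, true_c2 = 1.0, 0.05, 0.10
diff_flanking = 2 * flanking_half_contrast(true_a, true_c1, true_c2)
diff_single = 2 * single_half_contrast(true_a, true_c1)

a_hat, c1_hat = estimate_a_and_c1(diff_flanking, diff_single)
print(f"a_hat = {a_hat:.4f}, c1_hat = {c1_hat:.4f}")
```

With these inputs the estimates land close to, but not exactly on, the true values, reflecting the approximation of 2a by the flanking-marker difference.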
Linear Models for QTL Detection
The use of differences in the mean trait value
for different marker genotypes to detect a QTL
and estimate its effects is a use of linear models.
One-way ANOVA.
$$z_{ik} = \mu + b_i + e_{ik}$$
Here z_ik is the value of the trait in the kth individual of marker
genotype type i, and b_i is the effect of marker genotype i on trait value.
Detection: a QTL is linked to the marker if at least
one of the b_i is significantly different from zero
Estimation (QTL effect and position): This requires
relating the bi to the QTL effects and map position
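The detection step can be sketched as a one-way ANOVA computed by hand. The snippet below uses only the Python standard library; the trait values grouped by marker genotype are made-up illustrative data, and the function name is an assumption. Converting the F statistic to a p-value would need an F distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """F statistic for H0: all marker-genotype effects b_i are zero."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-genotype and within-genotype sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical trait values for each marker genotype.
trait_by_marker = {
    "MM": [12.1, 11.8, 12.6, 12.3],
    "Mm": [11.2, 11.5, 10.9, 11.4],
    "mm": [10.1, 10.6, 10.3, 9.9],
}
f_stat = one_way_anova_f(list(trait_by_marker.values()))
print(f"F = {f_stat:.2f} on 2 and 9 df")
```

A large F suggests that at least one b_i differs from zero, i.e. that the marker is linked to a QTL.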
Maximum Likelihood Methods
ML methods use the entire distribution of the data, not
just the marker genotype means.
More powerful than linear models, but not as flexible
in extending solutions (new analysis required for each model)
Basic likelihood function:
$$\ell(z \mid M_j) = \sum_{k=1}^{N} \varphi(z; \mu_{Q_k}, \sigma^2)\, \Pr(Q_k \mid M_j)$$
• ℓ(z | M_j): trait value given marker genotype is type j
• φ(z; μ_Qk, σ^2): distribution of trait value given QTL genotype is k,
here normal with mean μ_Qk (QTL effects enter here)
• Pr(Q_k | M_j): probability of QTL genotype k given marker genotype j
--- genetic map and linkage phase enter here
• Sum over the N possible linked QTL genotypes
This is a mixture model
ML methods combine both detection and estimation
of QTL effects/position.
Test for a linked QTL is given by the LR test
$$LR = -2 \ln\left[\frac{\max \ell_r(z)}{\max \ell(z)}\right]$$
Here max ℓ_r(z) is the maximum of the likelihood under a no-linked-QTL
model and max ℓ(z) is the maximum of the full likelihood.
The LR score is often plotted by trying different locations
for the QTL (i.e., values of c) and computing a LOD score
for each
$$LOD(c) = -\log_{10}\left[\frac{\max \ell_r(z)}{\max \ell(z; c)}\right] = \frac{LR(c)}{2 \ln 10} \simeq \frac{LR(c)}{4.61}$$
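The mixture likelihood and the LR-to-LOD conversion can be sketched as follows. This is a simplified illustration, not a full ML analysis: parameter values are plugged in rather than maximized, the "no linked QTL" model is represented by giving every QTL genotype the same mean, and the data, means and variance are hypothetical. The probabilities for marker genotype Mm are derived from the same F2 gamete frequencies as those for MM.

```python
import math


def normal_pdf(z, mean, var):
    """Normal density phi(z; mean, var)."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def qtl_probs_given_marker(marker, c):
    """Pr(Q_k | M_j) in an F2 for a single marker at recombination fraction c."""
    if marker == "MM":
        return {"QQ": (1 - c) ** 2, "Qq": 2 * c * (1 - c), "qq": c ** 2}
    if marker == "Mm":
        return {"QQ": c * (1 - c), "Qq": (1 - c) ** 2 + c ** 2, "qq": c * (1 - c)}
    if marker == "mm":
        return {"QQ": c ** 2, "Qq": 2 * c * (1 - c), "qq": (1 - c) ** 2}
    raise ValueError(f"unknown marker genotype: {marker}")


def log_likelihood(data, c, qtl_means, var):
    """Sum over individuals of the log of the mixture likelihood l(z | M_j)."""
    total = 0.0
    for z, marker in data:
        probs = qtl_probs_given_marker(marker, c)
        mixture = sum(probs[g] * normal_pdf(z, qtl_means[g], var) for g in probs)
        total += math.log(mixture)
    return total


def lod_from_lr(lr):
    """LOD = LR / (2 ln 10), roughly LR / 4.61."""
    return lr / (2.0 * math.log(10.0))


# Hypothetical (trait value, marker genotype) pairs and plugged-in parameters.
data = [(12.2, "MM"), (11.7, "MM"), (11.1, "Mm"), (10.8, "Mm"), (10.3, "mm"), (9.9, "mm")]
var = 0.25
full_means = {"QQ": 12.0, "Qq": 11.0, "qq": 10.0}
null_means = {"QQ": 11.0, "Qq": 11.0, "qq": 11.0}   # no linked QTL: one common mean

lr = -2.0 * (log_likelihood(data, 0.1, null_means, var)
             - log_likelihood(data, 0.1, full_means, var))
lod = lod_from_lr(lr)
print(f"LR = {lr:.2f}, LOD = {lod:.2f}")
```

Repeating the calculation over a grid of c values (with both likelihoods maximized at each step) would give the LOD profile along the chromosome.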
A typical QTL map from a likelihood analysis
[Figure: likelihood profile along the chromosome, marking the estimated QTL location, the support interval, and the significance threshold.]
Interval Mapping with
Marker Cofactors
Consider interval mapping using the markers i and i+1.
QTLs linked to these markers, but outside this
interval, can contribute (falsely) to estimation of
QTL position and effect.
Now suppose we also add the two markers flanking the
interval (i-1 and i+2).
[Figure: chromosome with markers i-1, i, i+1, i+2; the interval being mapped lies between i and i+1.]
Inclusion of markers i-1 and i+2 fully accounts
for any linked QTLs to the left of i-1 and the
right of i+2.
However, they still do not account for QTLs in the blue
areas of the original figure.
Interval mapping + marker cofactors is called
Composite Interval Mapping (CIM).
CIM also (potentially) includes unlinked markers to
account for QTL on other chromosomes.
CIM works by adding an additional term to the
linear model,
$$\sum_{k \neq i,\, i+1} b_k x_{kj}$$
From Line Crosses to Outbred Populations
Much of the above discussion was for the analysis
of line-cross data.
In such cases, all of the F1 offspring have the
same genotype, namely MQ/mq, being a heterozygote
at all loci that show fixed differences between the
lines being crossed. We can thus lump all offspring together.
In contrast, with outbred populations, each individual
parent has a unique genotype, and hence each must
be examined separately.
For example, if a father is M1/M2, we contrast
phenotypic values in offspring getting M1 vs. M2
from this parent.
The reason is that (say) a father could be M1Q/M2q,
while his mate might be M1q/M2Q.
Likewise, many individuals have no linkage information,
e.g., M1Q/M2Q or M1/M1
General Pedigree Methods
Random effects (hence, variance component) method
for detecting QTLs in general pedigrees
$$z_i = \mu + A_i + A'_i + e_i$$
Here z_i is the trait value for individual i, A_i is the genetic effect of
the chromosomal region of interest, and A'_i is the genetic value of
other (background) QTLs.
The covariance between individuals i and j is thus
$$\sigma(z_i, z_j) = R_{ij}\,\sigma_A^2 + 2\Theta_{ij}\,\sigma_{A'}^2$$
R_ij is the fraction of the chromosomal region shared IBD
between individuals i and j; 2Θ_ij is the resemblance
between relatives correction.
Mixed-model approaches are used, with variances
estimated for each chromosomal region.
Assume z is MVN, giving the covariance matrix as
$$V = R\,\sigma_A^2 + A\,\sigma_{A'}^2 + I\,\sigma_e^2$$
Here
$$R_{ij} = \begin{cases} 1 & \text{for } i = j \\ \hat{R}_{ij} & \text{for } i \neq j \end{cases} \qquad A_{ij} = \begin{cases} 1 & \text{for } i = j \\ 2\Theta_{ij} & \text{for } i \neq j \end{cases}$$
R is estimated from marker data; A is estimated from the pedigree.
The resulting likelihood function is
$$\ell(z \mid \mu, \sigma_A^2, \sigma_{A'}^2, \sigma_e^2) = \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[-\frac{1}{2}(z - \mu)^T V^{-1} (z - \mu)\right]$$
A significant σ_A^2 indicates a linked QTL.
What are some of the take-home
messages from QTL mapping studies?
• Most traits show several (4-30) QTLs that are
localized to modest-sized chromosomal segments
• Detected QTLs typically account for between 5 and 50%
of the observed phenotypic variation (in the F2)
• Transgressive segregation is often observed, with
high trait alleles being found in low trait value lines,
and vice versa (hidden variation for selection).
• Epistasis appears to be lacking in many studies, but seems
to be fairly common in eQTLs
What are some concerns from
QTL mapping studies?
• Replication of results is often poor.
• Common for a “single” QTL region to show multiple
QTLs upon more careful fine analysis, often with
effects in opposite directions
• QTL mapping does not get at the underlying genes,
only isolates chromosomal regions of interest, usually
with rather poor resolution (20 cM = 20 Megabases =
200 - 2000 genes)
• When isolated in inbred lines, QTLs often show strong
interaction effects (G x G, G x E), that are not apparent
in a normal analysis. Hence, likely very context-specific.
Genotype X environment interaction
Additive and dominance effects of QTL
are often environment-specific
QTL for Drosophila longevity, different larval rearing densities
[Figure: lifespan (days) at low vs. high density for genotypes OO, OB, and BB; panels for regions 50D, 68B, and 76B.]
Slide courtesy of Trudy Mackay
More complicated effects
Epistatic effects can be sex- and environment specific
QTL for Drosophila longevity
[Figure: lifespan (days) in low-density and high-density panels, for genotype classes BB and OB at regions 50D and 76B.]
Slide courtesy of Trudy Mackay
Cracks in the façade?
QTL mapping appears to dispute the infinitesimal
model, suggesting a few discrete loci account for much
of the variation.
Problem 1: Upon closer analysis, many of these high-value
regions themselves decompose into several QTLs,
not just one. How far such a decomposition can be
continued before no more QTLs appear is unresolved.
Problem 2: From a molecular-biology standpoint, QTLs
have not really led us significantly closer to the
underlying genes, and hence the molecular mechanisms for
quantitative trait variation.
Power for detection
Most QTL studies are vastly underpowered.
How many individuals must be scored for high power of
detection in a line-cross setting?
For α = 0.01, the sample size required for 90%
power of detection (in an F2 design) is roughly
22/d^2, where d = a/σ is the allelic effect in units of SD.
Thus, the sample sizes for d = 0.5, 0.2, 0.1, and 0.05 are
88, 550, 2200, and 8800.
A typical QTL study has n in the range of 350, which gives
this power only for d of about 0.25 or larger.
Effect of linkage: for c = 0.05, 0.1, 0.2, the increase
in sample size (over c = 0) is a factor of 1.2, 1.6, and 2.8.
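The sample-size rule of thumb is easy to tabulate. The sketch below implements n = 22/d^2 and inverts it to find the smallest effect detectable at a given n. The linkage inflation factor 1/(1-2c)^2 is an assumption on my part (it follows from the marker contrast shrinking to a(1-2c)); it is not stated on the slide, although it reproduces the quoted factors of roughly 1.2, 1.6, and 2.8. Function names are illustrative.

```python
import math


def f2_sample_size(d, c=0.0):
    """Approximate F2 sample size for 90% power at alpha = 0.01.

    d is the allelic effect in SD units; c is the marker-QTL recombination
    fraction. The 1/(1-2c)^2 inflation is an assumed form, see lead-in.
    """
    return 22.0 / d ** 2 / (1.0 - 2.0 * c) ** 2


def detectable_effect(n):
    """Smallest d detectable with 90% power for n individuals (c = 0)."""
    return math.sqrt(22.0 / n)


for d in (0.5, 0.2, 0.1, 0.05):
    print(f"d = {d}: n = {round(f2_sample_size(d))}")

print(f"n = 350 detects d = {detectable_effect(350):.2f}")

for c in (0.05, 0.1, 0.2):
    factor = f2_sample_size(0.25, c) / f2_sample_size(0.25)
    print(f"c = {c}: sample size inflated by {factor:.2f}")
```

Halving the effect size quadruples the required sample, which is why QTLs of small effect are so expensive to detect.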
Power and Repeatability:
The Beavis Effect
QTLs with low power of detection tend to have their
effects overestimated, often very dramatically
As power of detection increases, the overestimation
of detected QTLs becomes far less serious
This is often called the Beavis Effect, after Bill
Beavis, who first noticed this in simulation studies.
For example, a QTL accounting for 0.75% of the total
F2 variation has only a 3% chance of being detected
with 100 F2 progeny (markers spaced at 20 cM). For
cases in which such a QTL is detected, the average
estimated total variance it accounts for is 15%!
The Beavis effect raises the real concern that many
QTL of apparent large effect may be artifacts. Under
an infinitesimal model this is especially a concern.
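The selection bias behind the Beavis effect can be illustrated with a small simulation. This is a deliberately simplified sketch, not Beavis's simulation design: each replicate draws an effect estimate from a normal distribution around the true effect, declares a detection when a one-sided z statistic exceeds a threshold, and then averages the estimates among detected replicates only. All parameter values and names are hypothetical.

```python
import random
import statistics


def simulate_beavis(true_effect, std_error, z_crit=2.58, n_reps=20000, seed=1):
    """Return (power, mean estimate among detected replicates)."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    detected = []
    for _ in range(n_reps):
        estimate = rng.gauss(true_effect, std_error)
        if estimate / std_error > z_crit:   # one-sided significance test
            detected.append(estimate)
    power = len(detected) / n_reps
    mean_detected = statistics.mean(detected) if detected else float("nan")
    return power, mean_detected


# Low-power setting: the true effect equals one standard error.
power, mean_detected = simulate_beavis(true_effect=0.1, std_error=0.1)
print(f"power = {power:.3f}")
print(f"true effect = 0.1, mean estimate when detected = {mean_detected:.3f}")
```

Because only estimates above the significance threshold are kept, the average among detected replicates must exceed the threshold itself, which here is well above the true effect. Raising the true effect (and hence the power) shrinks this gap.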
Detection vs. localization
Darvasi & Soller (1997) give an appropriate
expression for the sample size required for a
95% confidence interval in position,
CI = 1500/(nd)2
For a QTL with d = 0.25, 0.1, and 0.05, the sample
sizes needed for a 1cM CI are 1500, 3800, and
7600.
Fine mapping (localizing to under 1 cM) requires the
generation of special lines, such as advanced intercross
(AIC), or recombinant inbred lines (RILs). In flies,
a series of overlapping deficiency strains can be used.
Tradeoffs in sample designs
Most QTL mapping studies are highly underpowered.
While QTLs of modest effects can be detected with
sample sizes of 500 or less, an order of magnitude
more is needed for high-resolution mapping.
Adding more markers does not really improve
power or resolution very much. Increasing the number
of individuals does.
Ironically, we are now at the stage where it is far easier
to score markers than to score phenotypes. This limits
the sample sizes that can be used.
Mapping eQTLs
A currently very fashionable trend is the mapping
of expression QTLs, locations that influence the
amount of protein or RNA made by a particular gene
A common design is to use RILs and examine a number
of microarrays across a modest set of lines (10-100).
Some improvement in power (over an F2 design) occurs
because one can replicate within each RIL and because of
the expanded map distances (4-fold) found in RILs vs. F2.
Still, such designs are underpowered, making localization
(cis vs. trans) difficult and inflating the estimated
contribution of detected eQTLs through the Beavis effect.
How can we improve the ability
to detect QTLs?
Two complementary approaches, which require very
dense marker maps, have been suggested.
• Association mapping -- much finer resolution with
a smaller sample size, using historical recombinants
• Methods for detecting genes under (or very recently
under) selection.
Association mapping
Basic idea is very straightforward: If there
exists very tight linkage between a marker and
a QTL, with marker and QTL alleles in linkage
disequilibrium, then a random collection of
individuals will show a marker-trait association.
Since the region of LD is expected to be very small,
this method potentially allows for fine mapping using
not a collection of relatives (hard to get), but rather
a random (and hence likely much larger) collection
of individuals from a population.
Linkage disequilibrium mapping
Idea is to use a random sample of individuals from
the population rather than a large pedigree.
Ironically, in the right settings this approach has
more power for fine mapping than pedigree analysis.
Why?
• Key is the expected number of recombinants.
In a pedigree, Prob(no recombinants) in n
individuals is (1-c)^n
• LD mapping uses the historical recombinants in
a sample. Prob(no recomb) = (1-c)^(2t), where t =
time back to most recent common ancestor
Expected number of recombinants in a sample of
n sibs is cn
Expected number of recombinants in a sample of
n random individuals with a time t back to the
MRCA (most recent common ancestor) is 2cnt
Hence, if t is large, many more expected recombinants
in random sample and hence more power for very
fine mapping (i.e. c < 0.01)
Because so many recombinants are expected, this only works
when c is very small
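The comparison between pedigree and LD mapping reduces to a few lines of arithmetic. The sketch below evaluates the expressions quoted above; the function names and the example values of c, n and t are assumptions for illustration.

```python
def expected_recombinants_sibs(c, n):
    """Expected recombinants among n sibs: c * n."""
    return c * n


def expected_recombinants_random(c, n, t):
    """Expected recombinants among n random individuals, MRCA t generations back: 2 * c * n * t."""
    return 2 * c * n * t


def prob_no_recombinants_pedigree(c, n):
    """Probability of no recombinants among n individuals in a pedigree: (1 - c)^n."""
    return (1 - c) ** n


def prob_no_recombinants_ld(c, t):
    """Probability of no historical recombination: (1 - c)^(2t)."""
    return (1 - c) ** (2 * t)


# Hypothetical tight linkage and sample.
c, n, t = 0.001, 200, 100
sibs = expected_recombinants_sibs(c, n)
random_sample = expected_recombinants_random(c, n, t)
print(f"sibs: {sibs:.1f} expected recombinants")
print(f"random sample: {random_sample:.1f} expected recombinants")
print(f"Pr(no recombinants), pedigree: {prob_no_recombinants_pedigree(c, n):.3f}")
print(f"Pr(no recombination), LD: {prob_no_recombinants_ld(c, t):.3f}")
```

With c this small, a sibship of this size is expected to contain almost no informative recombinants, while the random sample carries many historical ones.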
Dense SNP Association Mapping
Mapping genes using known sets of relatives can be
problematic because of the cost and difficulty in
obtaining enough relatives to have sufficient power.
By contrast, it is straightforward to gather large
sets of unrelated individuals, for example a large
number of cases (individuals with a particular
trait/disease) and controls (those without it).
With the very dense set of SNP markers (dense =
very tightly linked), it is possible to scan for markers
in LD in a random mating population with QTLs, simply
because c is so small that LD has not yet decayed
These ideas lead to consideration of a strategy of
Dense SNP association mapping.
For example, using 30,000 equally spaced SNPs in
the 3000 cM human genome places any QTL within
0.05 cM of a SNP. Hence, for an association created
t generations ago (for example, by a new mutant
allele appearing at that QTL), the fraction of
original LD still present is at least (1-0.0005)^t ~
exp(-0.0005 t). Thus for mutations 100, 500,
and 1000 generations old (2.5K, 12.5K, and 25K
years for humans), this fraction is 95.1%, 77.8%, and 60.6%.
We thus have large samples and high disequilibrium,
the recipe needed to detect linked QTLs of small effect
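The decay calculation can be reproduced directly. This sketch evaluates the fraction of original LD remaining, (1-c)^t, alongside the exponential approximation exp(-ct), for c = 0.0005 (0.05 cM). The function names are illustrative, and the 25-year generation time is inferred from the slide's 2.5K years per 100 generations.

```python
import math


def ld_fraction_remaining(c, t):
    """Fraction of the original LD left after t generations: (1 - c)^t."""
    return (1.0 - c) ** t


def ld_fraction_approx(c, t):
    """Exponential approximation exp(-c t), accurate for small c."""
    return math.exp(-c * t)


c = 0.0005                      # 0.05 cM between the QTL and the nearest SNP
years_per_generation = 25       # implied by 100 generations = 2.5K years
for t in (100, 500, 1000):
    exact = ld_fraction_remaining(c, t)
    approx = ld_fraction_approx(c, t)
    print(f"t = {t} ({t * years_per_generation} years): "
          f"exact = {exact:.4f}, approx = {approx:.4f}")
```

Even after 1000 generations, well over half of the original disequilibrium is expected to remain at this marker spacing.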
Problems with association mapping
Good news: Do not need a set of relatives. Hence, easier
to gather a large sample.
Bad news: One can have marker-trait associations in
the absence of linkage. For example, if a marker
predicts group membership, and being in that group
gives you a different trait value, then a marker-trait
covariance will occur.
This is the problem of population stratification.
When the population being sampled actually consists of
several distinct subpopulations we have lumped together,
marker alleles may provide information as to which group
an individual belongs. If there are other risk factors in
a group, this can create a false association between
marker and trait.
Example. The Gm marker was thought (for biological
reasons) to be an excellent candidate gene for
diabetes in the high-risk population of Pima Indians
in the American Southwest. Initially a very strong
association was observed:
Gm+        Total    % with diabetes
Present      293     8%
Absent     4,627    29%
Problem: freq(Gm+) in Caucasians (lower-risk diabetes
population) is 67%, Gm+ rare in full-blooded Pima
The association was re-examined in a population of Pima
that were 7/8th (or more) full heritage:
Gm+        Total    % with diabetes
Present       17    59%
Absent     1,764    60%
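How stratification alone can manufacture an association is shown in the sketch below. Two hypothetical subpopulations differ in both marker frequency and disease risk, while marker and disease are independent within each group. The group sizes, frequencies and risks are invented for illustration and are not the Pima data; the function name is an assumption.

```python
def pooled_risk_by_marker(groups):
    """Disease risk among marker carriers and non-carriers after pooling groups.

    groups: list of (size, marker_frequency, disease_risk) tuples, with marker
    and disease assumed independent WITHIN each group.
    """
    carriers = sum(n * p for n, p, r in groups)
    carriers_affected = sum(n * p * r for n, p, r in groups)
    noncarriers = sum(n * (1 - p) for n, p, r in groups)
    noncarriers_affected = sum(n * (1 - p) * r for n, p, r in groups)
    return carriers_affected / carriers, noncarriers_affected / noncarriers


# Hypothetical admixed sample: a low-risk group where the marker is common,
# and a high-risk group where it is rare.
groups = [
    (1000, 0.67, 0.10),
    (4000, 0.01, 0.40),
]
risk_present, risk_absent = pooled_risk_by_marker(groups)
print(f"pooled risk, marker present: {risk_present:.1%}")
print(f"pooled risk, marker absent:  {risk_absent:.1%}")

# Within a single group the marker carries no information about risk.
single_present, single_absent = pooled_risk_by_marker([groups[1]])
print(f"single group: {single_present:.1%} vs {single_absent:.1%}")
```

The pooled sample shows a strong "protective" association for the marker that vanishes once the analysis is restricted to one subpopulation, mirroring the pattern in the tables above.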
Adjusting for population stratification
• Use molecular markers to classify individuals into
groups, do association mapping within each group
(structured association mapping). This approach
typically uses the program STRUCTURE
• Use a simple regression approach, adding additional
markers as cofactors for group membership, removing
their effect,
$$y = \mu + \sum_{k=1}^{n} \beta_k M_k + \sum_{j=1}^{m} \gamma_j b_j + e$$
Scans for genes under selection
• Reduction in levels of polymorphism around a
selected site (selective sweep), or increase in the
levels of polymorphism around a locus under
balancing selection.
• Formal tests based on molecular variation
(Tajima’s D, MK, etc.) -- either as a test for candidate
genes or scanning the genome for regions showing
strong signals
• Dense SNP approaches based on linkage disequilibrium
and age of allele.
A scan of levels of polymorphism can thus
suggest sites under selection
[Figure: two plots of variation against map location. A local dip in variation is labeled "Directional selection (selective sweep)" and "Local region with reduced mutation rate"; a local peak is labeled "Balancing selection" and "Local region with elevated mutation rate".]
Example: maize domestication gene tb1
Major changes in plant architecture occurred in the
transition from teosinte to maize.
The Doebley lab identified a gene, teosinte
branched 1 (tb1), involved in many of these
architectural changes
Wang et al. (1999) observed a significant decrease
in genetic variation in the 5’ NTR region of tb1,
suggesting a selective sweep influenced this
region. The sweep did not influence the coding
region.
Wang et al (1999) Nature 398: 236.
Clark et al (2004) examined the 5’ tb1 region
in more detail, finding evidence for a
sweep influencing a region of 60 - 90 kb
Clark et al (2004) PNAS 101: 700.
Formal tests
Strict neutral theory: single parameter describes
(i) heterozygosity,
(ii) average number of differences between alleles
(iii) Number of singletons (alleles present once in sample)
A number of tests comparing these various measures
of within-population variation have been proposed:
Tajima’s D, HKA, Fu and Li’s D* and F*, Fu’s W and Fs,
Fay and Wu’s H, etc.
One could either test a candidate gene or do a genomic
scan using dense markers to test a sliding window along
a chromosome.
Rejection of neutrality = locus
under selection!
A central problem with all of these frequency
spectrum tests is that a rejection of the strict
neutral model can be caused by changes in population
size in addition to a locus under selection.
Such demographic signals would be present at all loci,
so that one approach is to use such signals over all
loci to correct the test at any particular locus.
Another approach is to use marker information to
estimate the demographic parameters and then again
use these to generate an appropriate null (neutral) model.
LD tests based on dense markers
A newer class of tests, not influenced by
demographic factors, is based on the length of
linkage disequilibrium around a target site.
Under drift, alleles at moderate to high frequencies are
old, and hence have shorter tracts of disequilibrium, because
recombination has had time to break down longer tracts.
LD-based tests of selection look for long tracts of
disequilibrium around an allele at high frequency. This
requires dense SNP markers
Summary
The jury is still out on whether current QTL studies
show that the infinitesimal model (lots of loci each of
small average effects) is incorrect.
In its classic form, QTL mapping has not successfully
yielded a number of actual genes contributing to small
amounts of variation. Hence, it has not helped us
to fuse molecular biology and quantitative genetics.
The problem with QTL mapping is not marker density
(i.e., number of markers scored), but rather poor power
from too few individuals being scored.
Summary (cont)
QTL mapping in microarrays (eQTLs) faces many of these same
lack-of-power issues, and results should be interpreted
with some care in the absence of replication.
Association mapping, requiring very dense SNP markers,
offers the potential for (i) using a much larger sample
(as unrelated individuals can be used) and (ii) fine mapping.
However, correction for population stratification remains
a concern.
LD-based tests for selection signatures seem to be a
promising approach, but also require dense SNP mapping.
While not a method to directly get at QTLs for a trait of
interest, they do suggest loci under recent selection, which
may eventually point to ecologically interesting traits.
Farewell from the “desert”
U of A Campus
Detecting epistasis
One major advantage of linear models is their
flexibility. To test for epistasis between two QTLs,
use an ANOVA with an interaction term
$$z = \mu + a_i + b_k + d_{ik} + e$$
Here a_i is the effect from marker genotype i at the first marker
set (which can be > 1 loci), b_k is the effect from marker genotype k
at the second marker set, and d_ik is the interaction between
genotypes i in the 1st marker set and k in the 2nd marker set.
• At least one of the a_i significantly different from 0
---- QTL linked to first marker set
• At least one of the b_k significantly different from 0
---- QTL linked to second marker set
• At least one of the d_ik significantly different from 0
---- interactions between QTL in sets 1 and 2