The rise (and fall) of QTL mapping: The fusion of quantitative & molecular genetics

Bruce Walsh
Depts of Ecology & Evolutionary Biology, Plant Sciences, Animal Sciences, Molecular & Cellular Biology, Epidemiology & Biostatistics
University of Arizona

Rough outline
• Classical Quantitative Genetics
• The Golden Age: The search for QTLs
  – History and review of methods
• History revised: how successful has the search for QTLs been?
• The next wave:
  – eQTLs
  – Association mapping
  – Molecular signatures of selection
  – Are these improvements?
• Summary: Where is quantitative genetics today?

Quantitative Genetics
Quantitative Genetics is the analysis of traits whose variation is influenced by both genetic and environmental factors.
The assumption is that the genotype of an individual cannot be easily predicted from its phenotype. Indeed, the genotypes (and hence loci) contributing to trait variation have historically been assumed to be unknown and largely unknowable.
“Classical” Quantitative Genetics works with genetic variance components, which are often easy to estimate.

Genetic variance components
Fisher (1918) reconciled quantitative traits with Mendelian Genetics, building on statistical machinery developed by the biometricians. The term variance was first introduced in Fisher’s paper (as was ANOVA).
Z = G + E (phenotypic value = genotypic value + environmental value)
Fisher’s key insight was that, in sexual species, parents do not pass along their genotypic value G to their offspring, but rather only pass along part, the breeding value A:
G = A + D + I (breeding value + dominance deviation + epistatic interaction)
Fisher also noted that the variance of A can be estimated from phenotypic covariances among relatives.

Variance components and Selection Response
Cov(Parent, offspring) = Var(A)/2
Cov(half sibs) = Var(A)/4
Cov(full sibs) = Var(A)/2 + Var(D)/4 + Var(Ec)
Thus, without any genetic information, we can still estimate important genetic features associated with the trait variation in a particular population.
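The parent-offspring relationship above can be checked with a small simulation. This is only a sketch under stated assumptions, not part of the original slides: traits are purely additive (no dominance, epistasis, or shared environment), breeding values and environmental deviations are normal, and the function names, variances, and sample size are all hypothetical.

```python
import random

def simulate_parent_offspring(n_families, var_a, var_e, seed=1):
    """Simulate (parent phenotype, offspring phenotype) pairs with z = A + E.

    Assumes a purely additive trait and random mating (illustrative only).
    """
    rng = random.Random(seed)
    sd_a, sd_e = var_a ** 0.5, var_e ** 0.5
    pairs = []
    for _ in range(n_families):
        a_sire = rng.gauss(0.0, sd_a)
        a_dam = rng.gauss(0.0, sd_a)
        # Offspring breeding value: midparent value plus a Mendelian
        # sampling deviation with variance Var(A)/2.
        a_off = 0.5 * (a_sire + a_dam) + rng.gauss(0.0, (var_a / 2.0) ** 0.5)
        z_sire = a_sire + rng.gauss(0.0, sd_e)
        z_off = a_off + rng.gauss(0.0, sd_e)
        pairs.append((z_sire, z_off))
    return pairs

def covariance(pairs):
    """Sample covariance of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)

pairs = simulate_parent_offspring(n_families=100_000, var_a=4.0, var_e=6.0)
# Cov(parent, offspring) = Var(A)/2, so doubling the covariance estimates Var(A).
var_a_hat = 2.0 * covariance(pairs)
print(f"estimated Var(A) = {var_a_hat:.2f} (simulated with Var(A) = 4.0)")
```

The estimate uses only phenotypes, which is the point of the slide: no marker or genotype information enters the calculation.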
Key use: The Breeders’ Equation for selection response, $R = h^2 S$, with the heritability $h^2 = \mathrm{Var}(A)/\mathrm{Var}(P)$.

Quantitative Genetics: The infinitesimal model
At the heart of much of classical quantitative genetics is the infinitesimal model -- the genetic variation in a trait is due to a large number of loci, each of small effect.
Classical quantitative genetics represents the fusion of Mendelian and population genetics, under the umbrella of classical statistical methods.
What about a fusion of quantitative genetics with molecular biology and genomics?

Statistics and Molecular biology
The success of “classical” quantitative genetics (variance components and related statistical measures) has been spectacular, esp. in plant and animal breeding.
However, the solely statistical nature of this approach has been unsettling to some, and the demise of the field was predicted once we had a better molecular handle on trait variation.
Thus, starting with the ability to score a vast number of molecular markers, the fusion of molecular biology and quantitative genetics seemed a possibility.

Quantitative Trait Loci, QTLs
The first “harvest” from the ability to score a modest number of molecular markers was the ability to search for Quantitative Trait Loci, QTLs: loci showing allelic variation that influences trait variation (mid 1980s).
Conceptually, this is nothing new, as it is just linkage analysis.
Consider the gametes from an AB/ab parent, where A & B are linked loci. We observe an excess of AB and ab gametes, and a deficiency of Ab and aB.
Suppose B influences a trait, making it larger. Offspring getting the A allele from this parent disproportionately get the B allele as well, and hence have larger trait values.

Early localization of factors influencing quantitative traits was done by Payne (1918), Sax (1923), and Thoday (1960s).
Sax (1923) crossed two inbred bean lines differing in seed pigment and weight, with the pigmented parents having heavier seeds than the nonpigmented parents.
These crosses demonstrated that seed pigment is determined by a single locus with two alleles, P and p. Among F2 segregants from this cross, PP and Pp seeds were 4.3 +/- 0.8 and 1.9 +/- 0.6 centigrams heavier than pp seeds. Hence, the P allele is linked to a factor (or factors) that acts in an additive fashion on seed weight.

Markers and more markers
While the basic outlines of QTL mapping have been known for over 70 years, the lack of sufficient genetic markers prevented its widespread use until the mid 1980s.
The early studies (in maize) used 50-80 markers, mostly allozymes, which were very loosely linked (marker spacing much greater than 20 cM).
With the advent of DNA (esp. STR = microsat) markers, the number and density of markers have grown, resulting in a parallel development of more statistically sophisticated approaches to mapping that use this additional information.

The statistical machinery for QTL mapping
• Single marker linear model approaches
• Interval mapping: pairs of markers, move to Maximum likelihood approaches
• Composite Interval mapping: analysis of a marker interval, flanked by adjacent markers. ML-based
• Shrinkage and Bayesian approaches for detecting epistasis
• From line-cross analysis to the analysis of outbred populations: mixed models

Conditional Probabilities of QTL Genotypes
The basic building block for all QTL methods is $\Pr(Q_k \mid M_j)$ -- the probability of QTL genotype $Q_k$ given the marker genotype is $M_j$:
$\Pr(Q_k \mid M_j) = \frac{\Pr(Q_k M_j)}{\Pr(M_j)}$
Consider a QTL linked to a marker (recombination fraction = c). Cross MMQQ x mmqq.
The gametes forming the F1 are all MQ and mq. Among the gametes forming the F2, freq(MQ) = freq(mq) = (1-c)/2 and freq(mQ) = freq(Mq) = c/2. Hence,
$\Pr(MMQQ) = \Pr(MQ)\Pr(MQ) = (1-c)^2/4$
$\Pr(MMQq) = 2\Pr(MQ)\Pr(Mq) = 2c(1-c)/4$
$\Pr(MMqq) = \Pr(Mq)\Pr(Mq) = c^2/4$
Since Pr(MM) = 1/4, the conditional probabilities become
$\Pr(QQ \mid MM) = \Pr(MMQQ)/\Pr(MM) = (1-c)^2$
$\Pr(Qq \mid MM) = \Pr(MMQq)/\Pr(MM) = 2c(1-c)$
$\Pr(qq \mid MM) = \Pr(MMqq)/\Pr(MM) = c^2$

Expected Marker Means
The expected trait mean for marker genotype $M_j$ is just
$\mu_{M_j} = \sum_{k=1}^{N} \mu_{Q_k} \Pr(Q_k \mid M_j)$
For example, if QQ = 2a, Qq = a(1+k), qq = 0, then in the F2 of an MMQQ/mmqq cross,
$(\mu_{MM} - \mu_{mm})/2 = a(1 - 2c)$
• If the trait mean is significantly different for the genotypes at a marker locus, it is linked to a QTL
• A small MM-mm difference could be (i) a tightly-linked QTL of small effect or (ii) loose linkage to a large QTL
Hence, the use of single markers provides for detection of a QTL. However, single marker means do not allow separate estimation of a and c.

Now consider using interval mapping (flanking markers):
$\frac{\mu_{M_1M_1M_2M_2} - \mu_{m_1m_1m_2m_2}}{2} = a \left( \frac{1 - c_1 - c_2}{1 - c_1 - c_2 + 2c_1c_2} \right) \simeq a(1 - 2c_1c_2)$
This is essentially a for even modest linkage.
$c_1 = \frac{1}{2}\left(1 - \frac{\mu_{M_1M_1} - \mu_{m_1m_1}}{2a}\right) \simeq \frac{1}{2}\left(1 - \frac{\mu_{M_1M_1} - \mu_{m_1m_1}}{\mu_{M_1M_1M_2M_2} - \mu_{m_1m_1m_2m_2}}\right)$
Hence, a and c can be estimated from the mean values of flanking marker genotypes.

Linear Models for QTL Detection
The use of differences in the mean trait value for different marker genotypes to detect a QTL and estimate its effects is a use of linear models. One-way ANOVA:
$z_{ik} = \mu + b_i + e_{ik}$
where $z_{ik}$ is the value of the trait in the kth individual of marker genotype type i, and $b_i$ is the effect of marker genotype i on trait value.
Detection: a QTL is linked to the marker if at least one of the $b_i$ is significantly different from zero.
Estimation (QTL effect and position): This requires relating the $b_i$ to the QTL effects and map position.

Maximum Likelihood Methods
ML methods use the entire distribution of the data, not just the marker genotype means.
More powerful than linear models, but not as flexible in extending solutions (a new analysis is required for each model).

Basic likelihood function:
$\ell(z \mid M_j) = \sum_{k=1}^{N} \varphi(z; \mu_{Q_k}, \sigma^2) \Pr(Q_k \mid M_j)$
• $\ell(z \mid M_j)$: distribution of trait value given marker genotype is type j
• $\varphi(z; \mu_{Q_k}, \sigma^2)$: probability of trait value given QTL genotype is k. Here $\varphi$ is normal with mean $\mu_{Q_k}$ (QTL effects enter here)
• $\Pr(Q_k \mid M_j)$: probability of QTL genotype k given marker genotype j (the genetic map and linkage phase enter here)
• The sum is over the N possible linked QTL genotypes
This is a mixture model.

ML methods combine both detection and estimation of QTL effects/position.
The test for a linked QTL is given by the LR test:
$LR = -2 \ln \left[ \frac{\max \ell_r(z)}{\max \ell(z)} \right]$
where $\max \ell_r(z)$ is the maximum of the likelihood under a no-linked-QTL model and $\max \ell(z)$ is the maximum of the full likelihood.
The LR score is often plotted by trying different locations for the QTL (i.e., values of c) and computing a LOD score for each:
$LOD(c) = -\log_{10} \left[ \frac{\max \ell_r(z)}{\max \ell(z; c)} \right] = \frac{LR(c)}{2 \ln 10} \simeq \frac{LR(c)}{4.61}$

[Figure: a typical QTL map from a likelihood analysis, marking the estimated QTL location, the support interval, and the significance threshold.]

Interval Mapping with Marker Cofactors
Consider interval mapping using markers i and i+1. QTLs linked to these markers, but outside this interval, can contribute (falsely) to estimation of QTL position and effect.
Now suppose we also add the two markers flanking the interval (i-1 and i+2).
CIM also (potentially) includes unlinked markers to account for QTL on other chromosomes.
[Diagram: markers i-1, i, i+1, i+2 along a chromosome, with the interval being mapped lying between i and i+1.]
Inclusion of markers i-1 and i+2 fully accounts for any linked QTLs to the left of i-1 and the right of i+2.
However, this still does not account for QTLs in the blue areas of the diagram.
Interval mapping + marker cofactors is called Composite Interval Mapping (CIM).
CIM works by adding an additional term to the linear model,
$\sum_{k \neq i, i+1} b_k x_{kj}$

From Line Crosses to Outbred Populations
Much of the above discussion was for the analysis of line-cross data. In such cases, all of the F1 offspring have the same genotype, namely MQ/mq, being heterozygous at all loci that show fixed differences between the lines being crossed. We can thus lump all offspring.
In contrast, with outbred populations, each individual parent has a unique genotype, and hence each must be examined separately.
For example, if a father is M1/M2, we contrast phenotypic values in offspring getting M1 vs. M2 from this parent.
The reason is that (say) a father could be M1Q/M2q, while his mate might be M1q/M2Q. Likewise, many individuals have no linkage information, e.g., M1Q/M2Q or M1/M1.

General Pedigree Methods
Random effects (hence, variance component) method for detecting QTLs in general pedigrees:
$z_i = \mu + A_i + A'_i + e_i$
where $z_i$ is the trait value for individual i, $A_i$ is the genetic effect of the chromosomal region of interest, and $A'_i$ is the genetic value of other (background) QTLs.
The covariance between individuals i and j is thus
$\sigma(z_i, z_j) = R_{ij} \sigma_A^2 + 2\Theta_{ij} \sigma_{A'}^2$
where $R_{ij}$ is the fraction of the chromosomal region shared IBD between individuals i and j, and $2\Theta_{ij}$ is the resemblance-between-relatives correction.
Mixed-model approaches are used, with variances estimated for each chromosomal region.

Assume z is MVN, giving the covariance matrix as
$V = R \sigma_A^2 + A \sigma_{A'}^2 + I \sigma_e^2$
Here
$R_{ij} = 1$ for $i = j$, and $\widehat{R}_{ij}$ for $i \neq j$ (estimated from marker data)
$A_{ij} = 1$ for $i = j$, and $2\Theta_{ij}$ for $i \neq j$ (estimated from the pedigree)
The resulting likelihood function is
$\ell(z \mid \mu, \sigma_A^2, \sigma_{A'}^2, \sigma_e^2) = \frac{1}{\sqrt{(2\pi)^n |V|}} \exp \left[ -\frac{1}{2} (z - \mu)^T V^{-1} (z - \mu) \right]$
A significant $\sigma_A^2$ indicates a linked QTL.
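As a rough illustration of the variance-component likelihood above, the following sketch builds V from an IBD-sharing matrix R and a relationship matrix A, then evaluates the multivariate normal log-likelihood through a Cholesky factorization. It is a minimal example, not the author's software: the three-individual pedigree, the R and A entries, the phenotypes, and all function names are hypothetical, and a real analysis would maximize this likelihood over the variances rather than evaluate it at fixed values.

```python
import math

def covariance_matrix(R, A, var_qtl, var_bg, var_e):
    """V = R*var_qtl + A*var_bg + I*var_e (matrices as lists of lists)."""
    n = len(R)
    return [[R[i][j] * var_qtl + A[i][j] * var_bg + (var_e if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(V):
    """Lower-triangular L with L L^T = V (V symmetric positive definite)."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def mvn_log_likelihood(z, mu, V):
    """Log of the MVN density with common mean mu and covariance V."""
    n = len(z)
    L = cholesky(V)
    # Forward-solve L y = (z - mu); then (z - mu)^T V^{-1} (z - mu) = y . y
    y = []
    for i in range(n):
        y.append((z[i] - mu - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in y))

# Hypothetical data: individuals 0 and 1 are full sibs, individual 2 is unrelated.
R = [[1.0, 0.75, 0.0], [0.75, 1.0, 0.0], [0.0, 0.0, 1.0]]  # IBD sharing at the region
A = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]    # 2*Theta from the pedigree
z = [1.2, 0.9, -0.4]

V_with_qtl = covariance_matrix(R, A, var_qtl=0.5, var_bg=0.3, var_e=0.2)
V_no_qtl = covariance_matrix(R, A, var_qtl=0.0, var_bg=0.3, var_e=0.7)
print("log-likelihood with a QTL variance:", mvn_log_likelihood(z, 0.5, V_with_qtl))
print("log-likelihood without one:        ", mvn_log_likelihood(z, 0.5, V_no_qtl))
```

Comparing the two maximized log-likelihoods (rather than these fixed-parameter values) is what a likelihood-ratio test for the QTL variance would do.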
What are some of the take-home messages from QTL mapping studies?
• Most traits show several (4-30) QTLs that are localized to modest-sized chromosomal segments
• Detected QTLs typically account for between 5 and 50% of the observed phenotypic variation (in the F2)
• Transgressive segregation is often observed, with high trait alleles being found in low trait value lines, and vice versa (hidden variation for selection).
• Epistasis appears to be lacking in many studies, but seems to be fairly common in eQTLs

What are some concerns from QTL mapping studies?
• Replication of results is often poor.
• It is common for a “single” QTL region to show multiple QTLs upon more careful fine analysis, often with effects in opposite directions
• QTL mapping does not get at the underlying genes; it only isolates chromosomal regions of interest, usually with rather poor resolution (20 cM = 20 Megabases = 200 - 2000 genes)
• When isolated in inbred lines, QTLs often show strong interaction effects (G x G, G x E) that are not apparent in a normal analysis. Hence, they are likely very context-specific.

Genotype x environment interaction
Additive and dominance effects of QTL are often environment-specific.
[Figure: QTL for Drosophila longevity at different larval rearing densities. Panels labeled 50D, 68B, and 76B plot lifespan (days) against density (low, high) for the OO, OB, and BB genotypes. Slide courtesy of Trudy Mackay.]

More complicated effects
Epistatic effects can be sex- and environment-specific.
[Figure: QTL for Drosophila longevity. A low-density panel and a high-density panel plot lifespan (days) for BB and OB genotype combinations at 50D and 76B. Slide courtesy of Trudy Mackay.]

Cracks in the façade?
QTL mapping appears to dispute the infinitesimal model, suggesting a few discrete loci account for much of the variation.
Problem 1: Upon closer analysis, many of these high-value regions themselves decompose into several QTLs, not just one.
How far such a decomposition can be continued until no more QTL appear is unresolved.
Problem 2: From a molecular-biology standpoint, QTLs have not really led us significantly closer to the underlying genes, and hence the molecular mechanisms for quantitative trait variation.

Power for detection
Most QTL studies are vastly underpowered.
How many individuals must be scored in an F2 design for 90% power of detection (a high power setting) in a line cross?
For an alpha of α = 0.01, the sample size required (F2 design) is roughly $22/d^2$, where $d = a/\sigma$ is the allele effect in units of SD.
Thus, the sample sizes for d = 0.5, 0.2, 0.1, 0.05 are 88, 550, 2200, and 8800.
A typical QTL study is in the range of n = 350, giving d = 0.25.
Effect of linkage: for c = 0.05, 0.1, 0.2, the increase in sample size (over c = 0) is a factor of 1.2, 1.6, and 2.8.

Power and Repeatability: The Beavis Effect
QTLs with low power of detection tend to have their effects overestimated, often very dramatically.
As power of detection increases, the overestimation of detected QTLs becomes far less serious.
This is often called the Beavis Effect, after Bill Beavis, who first noticed this in simulation studies.
For example, a QTL accounting for 0.75% of the total F2 variation has only a 3% chance of being detected with 100 F2 progeny (markers spaced at 20 cM). For cases in which such a QTL is detected, the average estimated total variance it accounts for is 15%!
The Beavis effect raises the real concern that many QTL of apparent large effect may be artifacts. Under an infinitesimal model this is especially a concern.

Detection vs. localization
Darvasi & Soller (1997) give an appropriate expression for the sample size required for a 95% confidence interval in position, CI = 1500/(nd)2
For a QTL with d = 0.25, 0.1, and 0.05, the sample sizes needed for a 1 cM CI are 1500, 3800, and 7600.
Fine mapping (localizing to under 1 cM) requires the generation of special lines, such as advanced intercross (AIC) or recombinant inbred lines (RILs).
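The detection rule of thumb from the power slide above can be tabulated directly. This is a small sketch with hypothetical function names. The linkage inflation factor 1/(1-2c)^2 is an assumption on my part: it is not stated on the slide, but it follows from the marker contrast scaling as a(1-2c) and reproduces the quoted factors of 1.2, 1.6, and 2.8.

```python
import math

def f2_detection_sample_size(d):
    """Approximate F2 sample size for 90% power at alpha = 0.01: n ~ 22 / d^2."""
    return 22.0 / d ** 2

def detectable_effect(n):
    """Standardized effect d detectable with n individuals under the same rule."""
    return math.sqrt(22.0 / n)

def linkage_inflation(c):
    """Assumed increase in sample size for a marker at recombination fraction c.

    The marker contrast shrinks by (1 - 2c), so n scales by 1 / (1 - 2c)^2.
    """
    return 1.0 / (1.0 - 2.0 * c) ** 2

for d in (0.5, 0.2, 0.1, 0.05):
    print(f"d = {d:<4}  n ~ {f2_detection_sample_size(d):,.0f}")
print(f"n = 350 -> d ~ {detectable_effect(350):.2f}")
for c in (0.05, 0.1, 0.2):
    print(f"c = {c:<4}  sample size x {linkage_inflation(c):.1f}")
```

Halving the effect size quadruples the required sample, which is why scoring more individuals matters far more than scoring more markers.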
In flies, a series of overlapping deficiency strains can be used.

Tradeoffs in sample designs
Most QTL mapping studies are highly underpowered.
While QTLs of modest effects can be detected with sample sizes of 500 or less, an order of magnitude more is needed for high-resolution mapping.
Adding more markers does not really improve power or resolution very much. Increasing the number of individuals does.
Ironically, we are now at the stage where it is far easier to score markers than to score phenotypes. This limits the sample sizes that can be used.

Mapping eQTLs
A currently very fashionable trend is the mapping of expression QTLs: locations that influence the amount of protein or RNA made by a particular gene.
A common design is to use RILs and examine a number of microarrays across a modest set of lines (10-100).
Some improvement in power (over an F2 design) occurs because of being able to replicate within each RIL and because of the expanded map distances (4 fold) found in RILs vs. F2.
Still, such designs are underpowered, making localization (cis vs. trans) difficult and leaving the contribution from detected eQTLs inflated by the Beavis effect.

How can we improve the ability to detect QTLs?
Two complementary approaches, which require very dense marker maps, have been suggested.
• Association mapping -- much finer resolution with a smaller sample size, using historical recombinants
• Methods for detecting genes under (or very recently under) selection.

Association mapping
The basic idea is very straightforward: If there exists very tight linkage between a marker and a QTL, with marker and QTL alleles in linkage disequilibrium, then a random collection of individuals shows a marker-trait association.
Since the region of LD is expected to be very small, this method potentially allows for fine mapping using not a collection of relatives (hard to get), but rather a random (and hence likely much larger) collection of individuals from a population.
Linkage disequilibrium mapping
The idea is to use a random sample of individuals from the population rather than a large pedigree.
Ironically, in the right settings this approach has more power for fine mapping than pedigree analysis. Why?
• Key is the expected number of recombinants. In a pedigree, Prob(no recombinants) in n individuals is $(1-c)^n$
• LD mapping uses the historical recombinants in a sample. Prob(no recomb) = $(1-c)^{2t}$, where t = time back to the most recent common ancestor
The expected number of recombinants in a sample of n sibs is cn.
The expected number of recombinants in a sample of n random individuals with a time t back to the MRCA (most recent common ancestor) is 2cnt.
Hence, if t is large, there are many more expected recombinants in a random sample and hence more power for very fine mapping (i.e. c < 0.01).
Because there are so many expected recombinants, this only works with c very small.

Dense SNP Association Mapping
Mapping genes using known sets of relatives can be problematic because of the cost and difficulty in obtaining enough relatives to have sufficient power.
By contrast, it is straightforward to gather large sets of unrelated individuals, for example a large number of cases (individuals with a particular trait/disease) and controls (those without it).
With a very dense set of SNP markers (dense = very tightly linked), it is possible to scan a random mating population for markers in LD with QTLs, simply because c is so small that LD has not yet decayed.
These ideas lead to consideration of a strategy of dense SNP association mapping.
For example, using 30,000 equally spaced SNPs in the 3000 cM human genome places any QTL within 0.05 cM of a SNP.
Hence, for an association created t generations ago (for example, by a new mutant allele appearing at that QTL), the fraction of original LD still present is at least $(1 - 0.0005)^t \simeq \exp(-0.0005\,t)$.
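The decay of disequilibrium with time can be sketched in a few lines. This is an illustrative check only, with a hypothetical function name; it assumes the standard per-generation decay factor (1 - c) and treats 0.05 cM as c = 0.0005, which is reasonable for such short distances.

```python
import math

def ld_fraction_remaining(c, t):
    """Fraction of the initial disequilibrium left after t generations: (1 - c)^t."""
    return (1.0 - c) ** t

c = 0.0005  # QTL within 0.05 cM of the nearest SNP
for t in (100, 500, 1000):
    exact = ld_fraction_remaining(c, t)
    approx = math.exp(-c * t)  # (1 - c)^t is close to exp(-c t) when c is small
    print(f"t = {t:>4} generations: (1-c)^t = {exact:.3f}, exp(-ct) = {approx:.3f}")
```

The two columns agree to about three decimal places here, which is why the exponential form is a convenient shorthand for tightly linked markers.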
Thus for mutations 100, 500, and 1000 generations old (2.5K, 12.5K, and 25K years for humans), this fraction is 95.1%, 77.9%, and 60.6%.
We thus have large samples and high disequilibrium, the recipe needed to detect linked QTLs of small effect.

Problems with association mapping
Good news: We do not need a set of relatives. Hence, it is easier to gather a large sample.
Bad news: One can have marker-trait associations in the absence of linkage.
For example, if a marker predicts group membership, and being in that group gives you a different trait value, then a marker-trait covariance will occur.
This is the problem of population stratification.
When the population being sampled actually consists of several distinct subpopulations we have lumped together, marker alleles may provide information as to which group an individual belongs. If there are other risk factors in a group, this can create a false association btw marker and trait.

Example. The Gm marker was thought (for biological reasons) to be an excellent candidate gene for diabetes in the high-risk population of Pima Indians in the American Southwest. Initially a very strong association was observed:

Gm+       Total    % with diabetes
Present     293     8%
Absent    4,627    29%

Problem: freq(Gm+) in Caucasians (a lower-risk diabetes population) is 67%, while Gm+ is rare in full-blooded Pima.
The association was re-examined in a population of Pima that were 7/8th (or more) full heritage:

Gm+       Total    % with diabetes
Present      17    59%
Absent    1,764    60%

Adjusting for population stratification
• Use molecular markers to classify individuals into groups, then do association mapping within each group (structured association mapping).
This approach typically uses the program STRUCTURE.
• Use a simple regression approach, adding additional markers as cofactors for group membership, removing their effect:
$y = \mu + \sum_{k=1}^{n} \beta_k M_k + \sum_{j=1}^{m} \gamma_j b_j + e$

Scans for genes under selection
• Reduction in levels of polymorphism around a selected site (selective sweep), or increase in the levels of polymorphism around a locus under balancing selection.
• Formal tests based on molecular variation (Tajima’s D, MK, etc.) -- either as a test for candidate genes or scanning the genome for regions showing strong signals
• Dense SNP approaches based on linkage disequilibrium and age of allele.

A scan of levels of polymorphism can thus suggest sites under selection.
[Figure: variation plotted against map location in two panels. One panel is labeled “Directional selection (selective sweep)” and “Local region with reduced mutation rate”; the other is labeled “Balancing selection” and “Local region with elevated mutation rate”.]

Example: maize domestication gene tb1
Major changes in plant architecture occurred in the transition from teosinte to maize.
The Doebley lab identified a gene, teosinte branched 1 (tb1), involved in many of these architectural changes.
Wang et al. (1999) observed a significant decrease in genetic variation in the 5’ NTR region of tb1, suggesting a selective sweep influenced this region. The sweep did not influence the coding region.
Wang et al (1999) Nature 398: 236.
Clark et al (2004) examined the 5’ tb1 region in more detail, finding evidence for a sweep influencing a region of 60 - 90 kb.
Clark et al (2004) PNAS 101: 700.

Formal tests
Strict neutral theory: a single parameter describes (i) heterozygosity, (ii) the average number of differences between alleles, and (iii) the number of singletons (alleles present once in the sample).
A number of tests comparing these various measures of within-population variation have been proposed: Tajima’s D, HKA, Fu and Li’s D* and F*, Fu’s W and Fs, Fay and Wu’s H, etc.
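To make one of these frequency-spectrum tests concrete, here is a sketch of Tajima's D computed from summary statistics. The formula is the standard one from Tajima (1989), not something given on the slides, and the function name and the example values (10 sequences, 16 segregating sites, an average of 3.888889 pairwise differences) are hypothetical inputs chosen for illustration.

```python
import math

def tajimas_d(n, num_segregating, mean_pairwise_diff):
    """Tajima's D from sample size n, segregating sites S, and mean pairwise differences.

    D contrasts two estimates of the same neutral parameter: the mean number
    of pairwise differences, and S scaled by a1 = sum_{i=1}^{n-1} 1/i.
    """
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_diff - s / a1) / math.sqrt(variance)

d_stat = tajimas_d(n=10, num_segregating=16, mean_pairwise_diff=3.888889)
print(f"Tajima's D = {d_stat:.3f}")  # about -1.45 for these inputs
```

A negative value means fewer pairwise differences than the number of segregating sites predicts, that is, an excess of rare alleles.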
One could either test a candidate gene or do a genomic scan using dense markers to test a sliding window along a chromosome.
Rejection of neutrality = locus under selection!

A central problem with all of these frequency spectrum tests is that a rejection of the strict neutral model can be caused by changes in population size in addition to a locus under selection.
Such demographic signals would be present at all loci, so one approach is to use such signals over all loci to correct the test at any particular locus.
Another approach is to use marker information to estimate the demographic parameters and then again use these to generate an appropriate null (neutral) model.

LD tests based on dense markers
A newer class of tests, not influenced by demographic factors, is based on the length of linkage disequilibrium around a target site.
Under drift, alleles at moderate to high frequencies are old, and hence have shorter tracks of disequilibrium, because recombination has had time to break down longer tracks.
LD-based tests of selection look for long tracks of disequilibrium around alleles at high frequency. This requires dense SNP markers.

Summary
The jury is still out on whether current QTL studies show that the infinitesimal model (lots of loci each of small average effects) is incorrect.
In its classic form, QTL mapping has not successfully yielded a number of actual genes contributing to small amounts of variation. Hence, it has not helped us to fuse molecular biology and quantitative genetics.
The problem with QTL mapping is not marker density (i.e., number of markers scored), but rather poor power from too few individuals being scored.

Summary (cont)
QTL mapping in microarrays (eQTLs) faces many of these lack-of-power issues, and results should be interpreted with some care in the absence of replication.
Association mapping, requiring very dense SNP markers, offers the potential for (i) using a much larger sample (as unrelated individuals can be used) and (ii) fine mapping. However, correction for population stratification remains a concern.
LD-based tests for selection signatures seem to be a promising approach, but also require dense SNP mapping. While not a method to directly get at QTLs for a trait of interest, they do suggest loci under recent selection, which may eventually point to ecologically interesting traits.

Farewell from the “desert”
U of A Campus

Detecting epistasis
One major advantage of linear models is their flexibility. To test for epistasis between two QTLs, use an ANOVA with an interaction term:
$z = \mu + a_i + b_k + d_{ik} + e$
where $a_i$ is the effect from marker genotype i at the first marker set (a set can be > 1 loci), $b_k$ is the effect from marker genotype k at the second marker set, and $d_{ik}$ is the interaction between genotypes i in the 1st marker set and k in the 2nd marker set.
• At least one of the $a_i$ significantly different from 0 -- QTL linked to first marker set
• At least one of the $b_k$ significantly different from 0 -- QTL linked to second marker set
• At least one of the $d_{ik}$ significantly different from 0 -- interactions between QTL in sets 1 and 2
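The interaction terms in the model above can be estimated from the table of marker-genotype cell means. The sketch below assumes a balanced design (equal numbers of individuals in every cell), in which the estimate of each interaction is the cell mean minus the row and column means plus the grand mean. It only estimates the terms and does not carry out the significance test; the function name and the two 2 x 2 tables are hypothetical.

```python
def interaction_effects(cell_means):
    """Return d_ik = cell mean - row mean - column mean + grand mean.

    cell_means[i][k] is the mean trait value for genotype i at the first
    marker set and genotype k at the second (balanced design assumed).
    """
    n_rows, n_cols = len(cell_means), len(cell_means[0])
    grand = sum(map(sum, cell_means)) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in cell_means]
    col_means = [sum(cell_means[i][k] for i in range(n_rows)) / n_rows
                 for k in range(n_cols)]
    return [[cell_means[i][k] - row_means[i] - col_means[k] + grand
             for k in range(n_cols)] for i in range(n_rows)]

# Hypothetical cell means: the first table is purely additive across the two
# marker sets, while the second has one cell shifted to mimic epistasis.
additive = [[10.0, 12.0], [13.0, 15.0]]
epistatic = [[10.0, 12.0], [13.0, 19.0]]

print("additive table: ", interaction_effects(additive))
print("epistatic table:", interaction_effects(epistatic))
```

In the additive table every interaction estimate is zero, so any departure from zero in real data is what the ANOVA interaction test evaluates against sampling error.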