
Gene mapping by association

3/4/04 Biomath/HG 207B/Biostat 237

Linkage has its limits

To determine that a trait is closer to marker 1 than to marker 2, we need to see recombination between marker 2 and the trait locus. As the distance between the markers decreases, the number of informative meioses needed to see recombination increases. At some point linkage analysis becomes impractical because too many families are needed.

Association Studies

• Association is a statistical term that describes the co-occurrence of alleles or phenotypes.
• An allele A is associated with disease D if people with D have a different frequency of A than people without D.

Possible causes for allelic association

• best: allele increases disease susceptibility (candidate gene studies)
• good: some subjects share a common ancestor (linkage disequilibrium studies)

[Figure: two loci separated by recombination fraction θ, carrying alleles D and A]

Under linkage equilibrium P(A,D) = P(A)P(D). Violation of this equality is termed linkage disequilibrium.

Linkage Disequilibrium

Suppose one of the population founders carries an allelic variant that increases risk of a disease. The disease gene is very close to a marker, so the recombination fraction θ is very small.

[Figure: founder chromosomes carrying haplotypes dA, da, and Da]

Ancestral haplotypes are dA, da, and Da.

Note that D is associated with a: P(a|D) is close to one.

Over many generations (n), there is occasionally recombination between the two genes, so that the population looks like:

[Figure: present-day chromosomes, mostly dA and da, with D now found on both Da and DA haplotypes]

The degree of association between D and a has decreased, but still P(a|D) > P(a), that is, P(aD) > P(a)P(D).

The Degree of Association Between Two Genes Depends on the Distance Between them and the Age of the Population

1. Let $\delta_{aD} = P(aD) - P(a)P(D)$, and similarly for other alleles. After n generations,

$$\delta_{aD}(n) = \delta_{aD}(0)\,(1-\theta)^n$$

2. At linkage equilibrium

P(a/a|D/d) = P(a/a|d/d) = P(a/a|D/D) = P(a/a)
P(A/a|D/d) = P(A/a|d/d) = P(A/a|D/D) = P(A/a)
P(A/A|D/d) = P(A/A|d/d) = P(A/A|D/D) = P(A/A)

Violation of these equalities is evidence of linkage disequilibrium.
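The decay formula in item 1 can be explored numerically. The following is a minimal sketch, not part of the original slides; the function name and the starting value of 0.05 are assumptions chosen only for illustration.

```python
def ld_after_generations(delta0: float, theta: float, n: int) -> float:
    """Disequilibrium after n generations: delta(n) = delta(0) * (1 - theta)**n."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return delta0 * (1.0 - theta) ** n


# Hypothetical initial disequilibrium and recombination fractions.
delta0 = 0.05
for theta in (0.001, 0.01, 0.1):
    decay = [ld_after_generations(delta0, theta, n) for n in (0, 10, 100)]
    print(theta, [round(d, 5) for d in decay])
```

The closer the two loci (smaller θ), the longer the association persists, which is why disequilibrium mapping can resolve distances that linkage analysis cannot.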

Allelic association studies test whether alleles are associated with the trait

• 2 types of association tests

population-based association test

• cases and controls are unrelated
• cross-classify by genotype
• use $\chi^2$ test or logistic regression

family-based association tests

• cases and controls are related: parents, sibs, etc.
• often based on allele transmission rates
• prime example: TDT

Mapping Genes using a Case Control Design

1. Example: Non-insulin-dependent diabetes in Pima Indians is associated with the human immunoglobulin gene, Gm (Knowler et al., 1988).

Genotype     1/1 or 1/2     2/2            Total
Cases        23 (.0169)     1343 (.983)    1366
Controls     270 (.0760)    3284 (.924)    3554
Total        293            4627           4920

$\chi^2$ = 61.6, p < 0.00005.

2. What can go wrong? The association could be due to ethnic differences between cases and controls (population stratification).
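The chi-square value in the table can be checked directly from the four cell counts. This is a minimal sketch using only the standard library; the function names are assumptions, and the statistic is the Pearson chi-square without continuity correction.

```python
import math


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi2_pvalue_df1(x: float) -> float:
    """Upper-tail probability of a chi-square(1) variable, via erfc."""
    return math.erfc(math.sqrt(x / 2.0))


# Cases: 23 with genotype 1/1 or 1/2, 1343 with 2/2.
# Controls: 270 with genotype 1/1 or 1/2, 3284 with 2/2.
chi2 = pearson_chi2_2x2(23, 1343, 270, 3284)
print(round(chi2, 1), chi2_pvalue_df1(chi2))
```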

A Dramatic Example of when Association is due to Population Stratification

1. The Gm genotype differs by degree of Caucasian heritage.

Caucasian heritage    1/1 or 1/2     2/2     Total
>50%                  184 (.441)     233     417
<50%                  109 (.0242)    4394    4503
Total                 293            4627    4920

$\chi^2$ = 1185.5, p < 0.00005

2. Diabetes prevalence differs by Caucasian heritage.

Diabetes    >50%          <50%    Total
Yes         20 (.0146)    1346    1366
No          397 (.112)    3157    3554
Total       417           4503    4920

3. Controlled for age and degree of Caucasian background, diabetes and Gm are not significantly associated (Knowler et al., 1988).

How concerned should we be about population stratification invalidating case/control results?

1. The allele frequencies and disease prevalence rarely differ as dramatically by race as in the example.
2. Good epidemiological methods can reduce the problem: collect information on racial/ethnic background.
3. Sometimes there is no alternative to a case/control design. Family controls may not be available.

On the other hand,

1. Better safe than sorry: family-based control designs.
2. Family-based designs require more genotyping but not more phenotyping than case/control designs.

The Transmission Disequilibrium Test eliminates concern over false positives due to population stratification

(Spielman et al., 1993; Terwilliger and Ott, 1992)

A simple illustration of the TDT

Collect parent-child trios. If the child is chosen without regard to disease status

[Figure: parents with marker genotypes A/a and A/A]

then the child's genotype is equally likely to be A/a or A/A.

However, if the child is chosen because they are affected and the marker allele a is associated with the disease allele D

[Figure: parents with haplotypes dA/Da and dA/dA; affected child with haplotypes Da/dA]

then the child is more likely to have the A/a genotype at the marker than the A/A genotype.

Testing for Transmission Distortion (Disequilibrium): a biallelic locus

1. Select individuals with the disease; genotype these individuals as well as their parents.
2. Determine how many heterozygous parents transmit A and how many transmit a.
3. Under the null hypothesis, the probability that a parent with the A/a genotype transmitted an A is ½.
4. Also under the null hypothesis, the maternal and paternal transmissions are independent.
5. In the case where there are only two alleles at the marker, the test is equivalent to a McNemar test.

                 Not transmitted
Transmitted      A          a
A                C_11       C_12
a                C_21       C_22

Test statistic: $T = (C_{21} - C_{12})^2 / (C_{21} + C_{12})$

For large samples and under the null hypothesis, T has a chi-square distribution (df = 1).
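The biallelic test can be sketched in a few lines. This is an illustrative sketch rather than MENDEL's implementation: the function names and the counts (30 and 50) are hypothetical. The exact p-value uses the Binomial(n, ½) distribution of transmissions from heterozygous parents.

```python
import math


def tdt_statistic(c12: int, c21: int) -> float:
    """McNemar-type TDT statistic T = (C21 - C12)^2 / (C21 + C12)."""
    if c12 + c21 == 0:
        raise ValueError("no transmissions from heterozygous parents")
    return (c21 - c12) ** 2 / (c21 + c12)


def tdt_asymptotic_pvalue(c12: int, c21: int) -> float:
    """Chi-square(1) upper-tail probability of T."""
    return math.erfc(math.sqrt(tdt_statistic(c12, c21) / 2.0))


def tdt_exact_pvalue(c12: int, c21: int) -> float:
    """Two-sided exact p-value from Binomial(n = C12 + C21, p = 1/2)."""
    n = c12 + c21
    k = min(c12, c21)
    lower_tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


# Hypothetical counts: 30 heterozygous parents transmit A (C12),
# 50 transmit a (C21).
print(tdt_statistic(30, 50))
print(tdt_asymptotic_pvalue(30, 50), tdt_exact_pvalue(30, 50))
```

Only the off-diagonal cells enter the statistic, because homozygous parents carry no information about which allele is preferentially transmitted.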

What are we testing with the TDT?

A strongly positive result suggests that the marker tested is a trait susceptibility locus or that the marker is closely linked to a trait susceptibility locus.

$$H_0: P(A/a \to a) = P(A/a \to A) = \tfrac{1}{2}$$

$$H_{Alt}: P(A/a \to a) \ne \tfrac{1}{2}$$

For a single affected child per family, the null and alternative hypotheses are equivalent to:

$$H_0: \theta = \tfrac{1}{2} \ \text{or} \ \delta = 0$$

$$H_{Alt}: \theta < \tfrac{1}{2} \ \text{and} \ \delta \ne 0$$

When more than one affected child per family is used, the TDT confounds linkage and association. Thus little is gained by running the TDT on a data set consisting of several very large pedigrees if linkage of the trait and marker has already been established. With many small unrelated pedigrees, information on association can still be gained.

The TDT has been extended to multiple alleles per locus

H_0: transmission to an affected child does not depend on allele type
H_a: transmission to an affected child depends on allele type

                               transmitted
not transmitted    1            2            ...    k-1          k            
1                  --           C_{1,2}      ...    C_{1,k-1}    C_{1,k}      n_1
2                  C_{2,1}      --           ...    C_{2,k-1}    C_{2,k}      n_2
...
k-1                C_{k-1,1}    C_{k-1,2}    ...    --           C_{k-1,k}    n_{k-1}
k                  C_{k,1}      C_{k,2}      ...    C_{k,k-1}    --           n_k
                   t_1          t_2          ...    t_{k-1}      t_k

t_i represents the column sum omitting the diagonal term, n_i the row sum also omitting the diagonal. Test statistics include

$$T_{mh} = \frac{k-1}{k} \sum_{i=1}^{k} \frac{(t_i - n_i)^2}{t_i + n_i}$$

Mendel's TDT_1 is proportional to this statistic.

$$TDT_2 = \max_{1 \le i \le k} \frac{(t_i - n_i)^2}{t_i + n_i}$$

Under some conditions, T_mh is asymptotically distributed as chi-square with k-1 degrees of freedom.

Numerical example: data from a locus with 5 alleles. 120 transmissions from heterozygous parents to affected children.

                    transmitted
not-trans.    1     2     3     4     5     n
1             --    6     4     4     5     19
2             6     --    5     7     4     22
3             8     7     --    7     5     27
4             8     5     5     --    6     24
5             7     8     7     6     --    28
t             29    26    21    24    20    120

T_mh = ?

TDT_2 = ?

Is there evidence of transmission distortion?
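One way to work the exercise is to compute the margins and both statistics from the table. The sketch below assumes the layout described earlier (rows are non-transmitted alleles, columns are transmitted alleles); the variable and function names are illustrative only.

```python
# Rows: non-transmitted allele; columns: transmitted allele (alleles 1..5).
# Diagonal cells are unused and stored as 0.
TABLE = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]


def tdt_multiallele(table):
    """Return (t, n, T_mh, TDT_2) for a k x k transmission table."""
    k = len(table)
    # t_i: column sums (times allele i was transmitted), diagonal omitted.
    t = [sum(table[r][c] for r in range(k) if r != c) for c in range(k)]
    # n_i: row sums (times allele i was not transmitted), diagonal omitted.
    n = [sum(table[r][c] for c in range(k) if c != r) for r in range(k)]
    terms = [(t[i] - n[i]) ** 2 / (t[i] + n[i])
             for i in range(k) if t[i] + n[i] > 0]
    t_mh = (k - 1) / k * sum(terms)
    tdt2 = max(terms)
    return t, n, t_mh, tdt2


t, n, t_mh, tdt2 = tdt_multiallele(TABLE)
print(t, n)
print(round(t_mh, 3), round(tdt2, 3))
```

With k = 5 alleles, T_mh is referred to a chi-square distribution with 4 degrees of freedom.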

MENDEL determines significance using permutation procedures

Why? If the sample size is small or alleles are rare, the distribution of the TDT statistic is poorly approximated by a chi-square distribution.

How?

(1) For each iteration (usually 10,000 or more):
  (a) Calculate a new TDT table. Hold the parental genotypes fixed. For each child, designate with equal probability which of each parent's two alleles the child receives.
  (b) Calculate the TDT statistic and determine whether it is larger than the observed TDT statistic.
(2) The p-value is equal to the number of iterations in which the TDT statistic is larger than the observed statistic, divided by the total number of iterations.

What is the reason for the standard error? Permutation p-values are estimated using Monte Carlo simulation with a finite number of iterations.
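The procedure above can be sketched as follows. This is not MENDEL's code: it works from the 5-allele transmission table of the numerical example rather than from pedigree files, treats every transmission from a heterozygous parent as an independent fair coin flip, and counts simulated statistics greater than or equal to the observed one. All names are assumptions made for illustration.

```python
import random

# Rows: non-transmitted allele; columns: transmitted allele.
TABLE = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]


def margins(table):
    k = len(table)
    t = [sum(table[r][c] for r in range(k) if r != c) for c in range(k)]
    n = [sum(table[r][c] for c in range(k) if c != r) for r in range(k)]
    return t, n


def tmh(t, n):
    k = len(t)
    terms = [(t[i] - n[i]) ** 2 / (t[i] + n[i])
             for i in range(k) if t[i] + n[i] > 0]
    return (k - 1) / k * sum(terms)


def permutation_pvalue(table, iterations=10000, seed=0):
    """Monte Carlo p-value and its standard error for T_mh."""
    rng = random.Random(seed)
    k = len(table)
    observed = tmh(*margins(table))
    # Number of i/j heterozygous parents = transmissions in either direction.
    pairs = [(i, j, table[i][j] + table[j][i])
             for i in range(k) for j in range(i + 1, k)]
    hits = 0
    for _ in range(iterations):
        t = [0] * k
        n = [0] * k
        for i, j, m in pairs:
            # Parental genotypes stay fixed; each parent sends i or j with prob 1/2.
            i_sent = sum(rng.random() < 0.5 for _ in range(m))
            t[i] += i_sent
            n[j] += i_sent
            t[j] += m - i_sent
            n[i] += m - i_sent
        if tmh(t, n) >= observed:
            hits += 1
    p = hits / iterations
    se = (p * (1.0 - p) / iterations) ** 0.5
    return p, se


p, se = permutation_pvalue(TABLE, iterations=2000, seed=1)
print(p, se)
```

The standard error shrinks with the square root of the number of iterations, which is the reason for running 10,000 or more when small p-values matter.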

TDT Summary

• ignores transmissions from homozygous parents
• with two alleles it has an approximate chi-square(1) distribution (McNemar test)
  – but exact p-values can be computed from the Binomial(p = .5) distribution in the biallelic case
• if there is one affected per nuclear family, this tests the null: no linkage or no association
  – if the test is significant, there is linkage and association
• if there are multiple affecteds, the TDT will confound linkage and association owing to the dependencies of the trios
  – users should not expect new insight when the data consist of one or two large disease pedigrees already showing linkage
  – with many small unrelated pedigrees, the chance of confusing linkage with association becomes less of an issue, and the TDT can help in identifying associated marker alleles

Limitations of the original TDT

(1) Nuclear families
(2) Qualitative traits
(3) Codominant markers

Many methods for extending the TDT have been developed. We will discuss one in detail, the gamete competition model.

One way to extend the TDT

Lange (1988), Jin et al. (1994), and Sham and Curtis (1995) considered a model (Bradley and Terry, 1952) that was originally used to rank teams and predict the outcome of team sports. How does the model work? Look at a specific example: suppose we are interested in predicting the outcome of a playoff game where the D'backs play the Dodgers. Or suppose we want to know the probability that the Dodgers will be the National League West winners this year, given the regular season results for last year.

Suppose results are:

                         Winner
Loser       D'backs    Dodgers    Giants    Rockies    Padres
D'backs     --         6          4         4          5
Dodgers     6          --         5         7          4
Giants      8          7          --        7          5
Rockies     8          5          5         --         6
Padres      7          8          7         6          --

Let $\text{D'backs/Dodgers} \to \text{Dodgers}$ denote the event that the D'backs and Dodgers play and the Dodgers win. In general, for each team i we assign a win parameter t_i so that the probability that i beats j is:

$$P(i/j \to i) = \frac{t_i}{t_i + t_j}$$

Bradley - Terry Model of Competing Sports Teams

$$P(i/j \to i) = \frac{t_i}{t_i + t_j}$$

Note that multiplying each t_i by any a > 0 does not change these probabilities, so one t_i can be fixed at 1. We fix t_{D'backs} = 1. Note that if t_i > t_j for all j ≠ i, then i is the best team.

Let y_ij denote the number of times that i plays j and i wins. For example, the D'backs beat the Giants 8 times and the Giants beat the D'backs 4 times (y_ij = 8 and y_ji = 4).

The win parameters can be determined using the following recurrence relationship

$$t_i^{m+1} = \frac{\sum_{j \ne i} y_{ij}}{\sum_{j \ne i} \dfrac{y_{ij} + y_{ji}}{t_i^{m} + t_j^{m}}}$$

where the loglikelihood is

$$\ln(L) = \sum_{i} \sum_{j \ne i} y_{ij} \left[ \ln t_i - \ln(t_i + t_j) \right]$$
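The recurrence can be iterated directly. The sketch below is an illustration under stated assumptions, not the original analysis: it uses the win/loss counts from the table as laid out with losers in rows and winners in columns, rescales after every sweep so that the first parameter stays at 1, and runs a fixed number of sweeps instead of testing convergence. Its output is not claimed to reproduce the estimates reported on the slides.

```python
import math

TEAMS = ["D'backs", "Dodgers", "Giants", "Rockies", "Padres"]

# LOSSES[r][c]: number of times team c (winner) beat team r (loser).
LOSSES = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]
K = len(TEAMS)
# y[i][j]: number of times team i beat team j.
y = [[LOSSES[j][i] for j in range(K)] for i in range(K)]


def log_likelihood(y, t):
    """ln L = sum over i != j of y_ij * [ln t_i - ln(t_i + t_j)]."""
    k = len(y)
    return sum(y[i][j] * (math.log(t[i]) - math.log(t[i] + t[j]))
               for i in range(k) for j in range(k) if i != j)


def fit_bradley_terry(y, sweeps=500):
    """Iterate the recurrence, keeping the first win parameter fixed at 1."""
    k = len(y)
    t = [1.0] * k
    for _ in range(sweeps):
        new_t = []
        for i in range(k):
            wins = sum(y[i][j] for j in range(k) if j != i)
            denom = sum((y[i][j] + y[j][i]) / (t[i] + t[j])
                        for j in range(k) if j != i)
            new_t.append(wins / denom)
        t = [value / new_t[0] for value in new_t]
    return t


t_hat = fit_bradley_terry(y)
lrt = 2.0 * (log_likelihood(y, t_hat) - log_likelihood(y, [1.0] * K))
for team, value in zip(TEAMS, t_hat):
    print(team, round(value, 3))
print("LRT:", round(lrt, 2))
```

Each sweep cannot decrease the loglikelihood, so the printed LRT against the equal-strength model is non-negative.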

RESULTS

                         Winner
Loser       D'backs    Dodgers    Giants    Rockies    Padres
D'backs     --         6          4         4          5
Dodgers     6          --         5         7          4
Giants      8          7          --        7          5
Rockies     8          5          5         --         6
Padres      7          8          7         6          --

H_0: all teams are equally likely to win ($t_i = 1$ for all i, so that $P(i/j \to i) = \tfrac{1}{2}$).

LRT = 3.63; the p-value of 0.46 gives no evidence against the null hypothesis.

We get more information from this analysis

We get the relative rankings: t_Dodgers = 1.23, t_D'backs = 1.00, t_Giants = 0.87, t_Rockies = 0.71, t_Padres = 0.67.

With these rankings we can make predictions about the outcomes of games:

$$P(\text{Dodgers/D'backs} \to \text{Dodgers}) = \frac{1.23}{2.23} = .55$$

$$P(\text{Giants/Dodgers} \to \text{Dodgers}) = \frac{1.23}{2.10} = .59$$

Note that these probabilities are different from the predictions we would make if we just used the individual match-up records.

The estimate is not 8/12 = .67 for Dodgers beating Giants.

How is this sports analysis analogous to TDT?

Think of:

(1) Each possible allele at the locus = a team
(2) A heterozygous parent = a match-up
(3) Allele received by the child from a heterozygous parent = the winner of the game
(4) The transmission parameters = the win parameters
(5) The win/loss record is determined by the transmissions from heterozygous parents.

                    transmitted
not trans.    1     2     3     4     5
1             --    6     4     4     5
2             6     --    5     7     4
3             8     7     --    7     5
4             8     5     5     --    6
5             7     8     7     6     --

When we ignore disease status, the Bradley-Terry model provides a form of segregation analysis. When we consider the transmission to affected members only (as in this example), we have a form of TDT analysis.

The gamete competition likelihood for a pedigree

The general form of the gamete competition likelihood for a pedigree with n individuals is

$$L = \sum_{G_1} \cdots \sum_{G_n} \prod_{i} \mathrm{Pen}(X_i \mid G_i) \prod_{j} \mathrm{Prior}(G_j) \prod_{\{k,l,m\}} \mathrm{Tran}(G_m \mid G_k, G_l)$$

Here person i has marker phenotype X_i and underlying marker genotype G_i. The product over j runs over founders, each contributing the prior probability Prior(G_j); the product over {k, l, m} runs over parent-offspring trios with parents k and l and child m.

For offspring, the transmission probability factors:

Tran(G_m | G_k, G_l) = Tran(G_mk | G_k) · Tran(G_ml | G_l)

Tran(G_mk | G_k) = t_mk / (t_mk + t_nk) and Tran(G_ml | G_l) = t_ml / (t_ml + t_nl)

The penetrance Pen(X_i | G_i) is always 1 or 0, depending on whether X_i and G_i are consistent or inconsistent.

Assessing significance

We use a likelihood ratio test statistic

LRT = 2 (ln(L_Ha) - ln(L_Ho))

where L_Ha and L_Ho are the maximum likelihoods under the alternative and null hypotheses.

Significance?

Approximate p-values can be calculated by assuming that the distribution of the LRT is chi-square, or by gene dropping.

Gamete Competition contrasted with the TDT

(1) Gamete competition works on extended pedigrees. No need to break up large families into nuclear families.

(2) If we have only trios, the gamete competition and the TDT are equivalent. Their null hypothesis is no linkage or no association. The alternative hypothesis is linkage and association.

(3) When considering more than one affected per family, the TDT and gamete competition confound association with linkage.

(4) Exact p-values can be determined with the TDT. Gamete competition p-values are asymptotic.

(5) The gamete competition model can be used when there is missing marker information. Allele frequencies can be fixed at population estimates or estimated along with the t's.

(6) When there is missing data, the gamete competition is not immune to the effects of population stratification or rare alleles.

Example: Families affected with non-insulin-dependent diabetes and linkage to a marker within the sulfonylurea receptor-1 (SUR) gene

27 Mexican-American extended pedigrees with 74 affected offspring (all genotyped at SUR).

The likelihood ratio test statistic is 9.133 with 9 degrees of freedom. P-value = 0.043

allele    freq    t_i     se of t_i
1         .054    .288    .215
2         .210    1.00    fixed
3         .190    .810    .447
4         .048    1.40    .985
5         .047    .697    .681
6         .108    .383    .204
7         .140    .556    .288
8         .091    .567    .322
9         .071    .499    .509
10        .042    .082    .104

Can this model be extended to quantitative traits?

Yes, by recognizing that the Bradley-Terry model is equivalent to a matched case-control design. The transmitted allele is the case, the untransmitted allele is the control.

$$t_i = e^{\omega_i x_p}$$

where x_p denotes child p's standardized trait value, i denotes allele i, and the probability of an i/j heterozygous parent transmitting i is

$$P(i/j \to i) = \frac{e^{(\omega_i - \omega_j) x_p}}{e^{(\omega_i - \omega_j) x_p} + 1}$$

Note that one $\omega$ is set to zero.

This is equivalent to conditional logistic regression.

Quantitative Trait Example: ACE

High ACE concentration is associated with a deletion within an intron of the ACE gene.

404 people in 69 families (Sinsheimer et al., 2000).

$$P(\text{insertion/deletion} \to \text{deletion}) = \frac{e^{\omega_{deletion} x_k}}{e^{\omega_{deletion} x_k} + 1}$$

$$P(\text{insertion/deletion} \to \text{insertion}) = \frac{1}{e^{\omega_{deletion} x_k} + 1}$$

When $\omega_{deletion} x_k = 0$,

$$P(\text{insertion/deletion} \to \text{deletion}) = P(\text{insertion/deletion} \to \text{insertion}) = \frac{1}{1.0 + 1}$$

                    mle     s.e. of mle
ω_insertion         0.00    fixed
ω_deletion          1.31    0.17

H_0: $\omega_{deletion} = 0$    H_a: $\omega_{deletion} \ne 0$

LRT = 82.76, asymptotic p-value < 1 x 10^-19
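To see what the fitted parameter implies, the transmission probability can be evaluated at a few standardized trait values. This is a minimal sketch: the function name is an assumption, the deletion estimate of 1.31 is taken from the table above, and the trait values are hypothetical.

```python
import math


def transmission_prob(omega_i: float, omega_j: float, x: float) -> float:
    """P(i/j -> i) = exp(omega_i * x) / (exp(omega_i * x) + exp(omega_j * x))."""
    return 1.0 / (1.0 + math.exp((omega_j - omega_i) * x))


OMEGA_DELETION = 1.31   # estimate reported above
OMEGA_INSERTION = 0.0   # fixed

# Hypothetical standardized ACE levels of the child.
for x in (-1.0, 0.0, 1.0, 2.0):
    p_del = transmission_prob(OMEGA_DELETION, OMEGA_INSERTION, x)
    print(x, round(p_del, 3), round(1.0 - p_del, 3))
```

A child with an average trait value (x = 0) is equally likely to have received either allele, while children with high ACE levels are more likely to have received the deletion.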

Another Example: Analyzing tightly linked SNPs:

SNPs (single nucleotide polymorphisms) tend to be more stable and more abundant than microsatellite markers.

They are predominantly biallelic, so we would like to use several tightly linked markers simultaneously to increase the overall information content. Recall that we use the allele transmissions from heterozygous parents.

Assuming HWE, the maximum possible proportion of heterozygous parents for a biallelic system is 0.50. For an n-allele system, it is

H = (n-1)/n.

More alleles, more information.
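The bound H = (n-1)/n follows from the Hardy-Weinberg heterozygosity 1 - Σ p_i², which is largest when all n alleles are equally frequent. A short sketch (the function name and the example frequencies are assumptions):

```python
def expected_heterozygosity(freqs):
    """Expected proportion of heterozygotes under HWE: 1 - sum(p_i^2)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# Equal frequencies attain the maximum (n - 1) / n.
for n_alleles in (2, 4, 8):
    equal = [1.0 / n_alleles] * n_alleles
    print(n_alleles, expected_heterozygosity(equal), (n_alleles - 1) / n_alleles)

# A biallelic marker with unequal frequencies falls below 0.50.
print(expected_heterozygosity([0.9, 0.1]))
```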

The phase of these multilocus SNPs may not be known:

Example: suppose there are three SNPs. An individual with multilocus genotype 1/2, 1/2, 1/2 could have one of the following haplotypes: (1) 111 and 222, (2) 122 and 211 (3) 121 and 212 or (4) 112 and 221.

The gamete competition allows the use of non-codominant markers, so we don't need to determine which of these haplotype combinations is present in a particular individual.
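The phase ambiguity in the three-SNP example can be enumerated mechanically. This sketch is illustrative only; the function name and the representation of genotypes as pairs of allele labels are assumptions.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """List the unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one (allele, allele) tuple per SNP, e.g. [("1", "2"), ("1", "2")].
    """
    pairs = set()
    # For each SNP, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotypes)):
        first = "".join(g[c] for g, c in zip(genotypes, choice))
        second = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


triple_het = [("1", "2"), ("1", "2"), ("1", "2")]
for pair in compatible_haplotype_pairs(triple_het):
    print(pair)
```

An individual heterozygous at m SNPs has 2^(m-1) possible haplotype pairs, so the ambiguity grows quickly as markers are added.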

For two linked loci associated with a quantitative trait, the transmission probability is expressed as:

$$P(ij/kl \to ij) = \frac{(1-\theta)\, e^{\omega_{ij} x_p}}{(1-\theta)\left(e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}\right) + \theta \left(e^{\omega_{il} x_p} + e^{\omega_{kj} x_p}\right)}$$

If we are using tightly linked SNPs, then θ is effectively zero and the transmission probability reduces to:

$$P(ij/kl \to ij) = \frac{e^{\omega_{ij} x_p}}{e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}}$$

An Example

Again we use sex adjusted ACE levels as a quantitative trait. The three SNPs are labeled by their position and the nucleotides present at the position. A-240T, T1237C, and G2350A. Because the ACE gene spans only 26kb, the recombination fractions between these SNPs are effectively zero. The pedigree data consist of 83 white British families ranging in size from 4 to 18 members. ACE levels were determined on 405 family members. Genotypes were collected on 555 family members.

In MENDEL, the most important difference from the previous example will be observed in the locus file. We need to allow for phase ambiguities (lack of certainty in haplotypes).

L469 AUTOSOME 627     <- # haplotypes, # phenotypes
ATA 0.40190
ATG 0.00780
ACA 0.06740
ACG 0.18310
T*A 0.01340           ! T*A corresponds to haplotypes TTA and TCA
T*G 0.32640           ! T*G corresponds to haplotypes TTG and TCG

We are no longer assuming codominant markers, so we must specify the phenotype (of the marker) / genotype relationship.

These phenotypes correspond to the marker phenotypes used in the pedigree file.

RESULTS

Haplotype    H_0 p_ijk    s.e. of p_ijk    H_a p_ijk    s.e. of p_ijk    ω_ijk    s.e. of ω_ijk
ATA          .4052        .0257            .4019        .0256            .0000    fixed
ATG          .0079        .0045            .0078        .0024            .2440    .9893
ACA          .0676        .0136            .0674        .0136            .2137    .4076
ACG          .1839        .0199            .1831        .0198            1.169    .2352
TTA+TCA      .0133        .0059            .0134        .0059            .2765    .5848
TTG+TCG      .3321        .0240            .3264        .0242            1.528    .2189

Log-likelihood under H_0 = -704.34

Log-likelihood under H_a = -663.73

LRT = 81.22, df = 5, p-value = 4.67 x 10^-16
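The likelihood ratio statistic and its asymptotic p-value can be checked from the two log-likelihoods. The sketch below uses only the standard library and the closed-form series for the chi-square upper tail with integer degrees of freedom; the function name is an assumption, and this is not how MENDEL itself computes p-values.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc term plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(chi / math.sqrt(2.0))
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total


loglik_h0 = -704.34
loglik_ha = -663.73
lrt = 2.0 * (loglik_ha - loglik_h0)
print(round(lrt, 2), chi2_sf(lrt, 5))

# The same function applied to statistics quoted earlier in the lecture.
print(chi2_sf(3.63, 4))    # Bradley-Terry example, 4 df
print(chi2_sf(82.76, 1))   # ACE insertion/deletion example, 1 df
```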

Many other extensions / alternatives to the TDT have been developed.

These include:

TDT using sibling controls
  Sib-TDT (Spielman and Ewens, 1998)
  DAT (Boehnke and Langefeld, 1998)
  SDT (Horvath and Laird)

TDT for quantitative traits
  Allison (1997), Rabinowitz (1997), Abecasis (2000)

Joint modeling of linkage and association that allows estimation of recombination
  Hastabacka (1992), Kaplan, Hill and Weir (1995), Terwilliger (1995)