Chatterjee_Banff2014Public


Statistical Issues in Development and Evaluation of Genetic Risk Prediction Models

Nilanjan Chatterjee, PhD

Chief and Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology and Genetics

Thanks to team science!

Biostatistics Branch
JuHyun Park, Fellow
Paige Maas, Fellow
Jianxin Shi, TT Investigator
Joshua Sampson, TT Investigator
Bin Zhu, TT Investigator
Mitchell Gail, Investigator
Minsun Song, Fellow

DCEG
Stephen Chanock, Director
Nat Rothman, Investigator
Debra Silverman, Investigator

Other Institutions/Collaborations

Peter Kraft, HSPH

Montserrat Garcia-Closas, ICR, UK

Cambridge University, UK

German Cancer Research Center

BPC3 Consortium

BCAC Consortium

Utility of Risk Models

• Individual counseling – weighing risks and benefits for various preventive interventions
  – Screening, medication, risk-factor modification
• Understanding the distribution of risk at the population level and informing public health strategies for prevention
• Comparative effectiveness studies
• Design of intervention trials

Methodological Issues

• Sample size and study design
• Model building
  – Polygenic risk score (PRS)
  – Incorporating environmental risk-factors
  – Using external information
  – Model calibration
• Model validation and evaluation

Limited Discriminatory Ability of Early GWAS Discoveries

“A tiny step to personalized risk prediction of breast cancer” - Devilee and Rookus, NEJM, Editorial

Many more to be found

Utility of Foreseeable Cancer SNPs

              Family    Known   Foreseeable  Family History  Family History   Epidemiologic Risk
              History   SNPs    SNPs         and Known       and Foreseeable  Factors and
Cancer Site   Only                           SNPs            SNPs             Foreseeable SNPs
BREAST        0.536     0.599   0.635        0.613           0.646            0.670
PROSTATE      0.549     0.647   0.676        0.668           0.694            –
COLORECTUM    0.528     0.582   0.616        0.598           0.629            0.658
OVARY         0.509     0.557   0.568        0.564           0.575            –
BLADDER       0.514     0.596   0.615        0.602           0.620            0.726
GLIOMA        0.503     0.597   0.621        0.598           0.622            –
PANCREAS      0.517     0.576   0.600        0.588           0.610            –

Park et al., JCO, 2012

Hidden Heritability for Complex Traits

Trait   Effective sample-size   No. of detected   Heritability explained
        for the largest GWAS    SNPs              by detected SNPs
HT      133K                    108               0.066
BMI     162K                    31                0.014
TC      100K                    45                0.063
HDL     100K                    35                0.046
LDL     95K                     36                0.059
CD      25K                     64                0.066
T1D     22K                     30                0.053
T2D     36K                     22                0.034
PrCA    28K                     20                0.061
CAD     73K                     21                0.024

Narrow-sense h_g (values in the order given): 0.45, 0.14, 0.12, 0.22, 0.30, 0.51, 0.22

• Heritability: fraction of total variance attributable to susceptibility (Quantitative traits) and sibling-recurrence-risks (Qualitative traits)

Challenges

• Many loci with very small effects are undetectable at genome-wide significance level
• Can we still exploit them to improve risk prediction?
  – Using a more liberal threshold or a fancier penalized regression method?

• Needs an understanding of “power” in the context of prediction
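The idea of a "more liberal threshold" can be sketched in a few lines. This is a minimal illustration, assuming per-SNP log-odds-ratio estimates and standard errors are available from a GWAS; the function names and the toy numbers are hypothetical and do not come from the talk.

```python
from math import erfc, sqrt

def two_sided_p(beta_hat, se):
    """Two-sided normal p-value for the Wald statistic beta_hat / se."""
    return erfc(abs(beta_hat / se) / sqrt(2.0))

def select_snps(beta_hat, se, alpha):
    """Indices of SNPs whose p-value falls below the threshold alpha."""
    return [j for j, (b, s) in enumerate(zip(beta_hat, se))
            if two_sided_p(b, s) < alpha]

def polygenic_risk_score(genotype, beta_hat, selected):
    """Weighted count of risk alleles over the selected SNPs only."""
    return sum(beta_hat[j] * genotype[j] for j in selected)

# Toy example (made-up numbers): three SNPs, one subject.
beta_hat = [0.10, 0.02, 0.30]
se = [0.01, 0.02, 0.05]
genotype = [0, 1, 2]  # risk-allele counts

strict = select_snps(beta_hat, se, alpha=5e-8)   # genome-wide significance
liberal = select_snps(beta_hat, se, alpha=0.5)   # far more permissive
prs_strict = polygenic_risk_score(genotype, beta_hat, strict)
prs_liberal = polygenic_risk_score(genotype, beta_hat, liberal)
```

A liberal threshold admits more true signals but also more noise, which is why the choice of threshold has to be judged by predictive performance rather than by significance alone.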

Predictive Correlation Coefficient (PCC)

– Covariances and variances are taken with respect to randomness of a “new” observation for which prediction is desired
– Remaining randomness is due to that of the “training” dataset
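The PCC formula itself is not reproduced in this transcript. A plausible form, assuming PCC is the correlation between the outcome of a new individual and its prediction from a model fitted to the training data, is given here; the symbols Y, \hat{Y} and T are notation introduced for illustration, not taken from the slide.

```latex
\mathrm{PCC}(T) =
  \frac{\operatorname{cov}\!\left(Y, \hat{Y} \mid T\right)}
       {\sqrt{\operatorname{var}(Y)\,\operatorname{var}\!\left(\hat{Y} \mid T\right)}}
```

Here Y is the outcome of the new observation, \hat{Y} is its predicted value, and T is the training dataset. Because the moments are taken over the new observation only, PCC(T) is still random through T, and its expectation over training datasets is the natural summary.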

The Expected PCC value for GWAS Polygenic Models

• Parameters of genetic architecture
• Properties of the statistical method
• For fixed N, the optimal threshold (α_opt(N)) can be chosen by maximizing μ(N, α)

Chatterjee et al., Nature Genetics, 2013

Further Results

• Many measures of discriminatory performance of a risk model have a one-to-one relationship with PCC
• Can project performance of models that include polygenic risk score (PRS) and family history
  – Family history effect is attenuated by a quantity related to PCC

Chatterjee et al., Nature Genetics, 2013
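As an illustration of how a discrimination measure can be a monotone function of a single summary of the risk score, here is the textbook binormal result for AUC. It assumes the PRS on the log relative-risk scale is normal with standard deviation sigma in controls, the disease is rare, and risk is multiplicative in the PRS, so that cases follow the same normal shifted by sigma squared. This is a standard result, not necessarily the formula used in the cited paper, and the function names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def auc_from_prs_sd(sigma):
    """AUC when the PRS is N(mu, sigma^2) in controls and
    N(mu + sigma^2, sigma^2) in cases (rare disease, log-RR scale)."""
    return normal_cdf(sigma / sqrt(2.0))

# AUC rises monotonically with the spread of the risk score.
aucs = [auc_from_prs_sd(s) for s in (0.0, 0.25, 0.5, 1.0)]
```

Under these assumptions an uninformative score (sigma of zero) gives an AUC of exactly 0.5, and any increase in the variance explained by the score maps to a unique, higher AUC.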

AUC (Cont’d)

Trait (AUC with FH alone): T2D (0.595), PrCA (0.552), CAD (0.601)

Current sample size (N), α=10^-7 and α_OPT, models SNPs and SNPs+FH (values in the order given):
T2D:  0.570, 0.598, 0.632, 0.654
PrCA: 0.621, 0.648, 0.625, 0.651
CAD:  0.582 0.584, 0.587 0.589, 0.647 0.648, 0.651 0.652

                3xN                       5xN
Trait  Model    α=10^-7      α_OPT        α=10^-7      α_OPT
T2D    SNPs     0.617        0.704        0.660        0.750
T2D    SNPs+FH  0.667        0.736        0.700        0.776
PrCA   SNPs     0.637        0.648        0.646        0.673
PrCA   SNPs+FH  0.661        0.670        0.669        0.692
CAD    SNPs     0.595 0.604  0.612 0.650  0.603 0.629  0.635 0.676
CAD    SNPs+FH  0.656 0.663  0.669 0.697  0.663 0.681  0.686 0.717

Architecture of Joint Effects: Implications for Disease Prevention

Breast Cancer Risk Modeling: BPC3 Study

• 17,176 cases and 19,860 controls from 8 prospective studies
• Risk factors – Family history, height, reproductive risk-factors, smoking, BMI, alcohol and HRT use
• SNPs – 24 genotyped SNPs, imputed PRS for 86 SNPs

Steps for Building Absolute Risk Model and Projecting Risk Distribution

• Develop models for relative-risk
  – Construction of efficient PRS, Model selection for gene-gene/gene-environment interaction
• Utilize rates from SEER cancer registry to calibrate absolute risk to the US population
• Use national survey data to project risk distribution
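The calibration step above can be sketched as follows. This is a minimal illustration, assuming yearly age-specific incidence rates from a registry, a relative-risk model already fitted, and the common approximation that the population rate equals the baseline rate times the population mean relative risk. The function names and all rates are hypothetical placeholders, not SEER values.

```python
from math import exp

def baseline_rates(population_rates, mean_relative_risk):
    """Scale registry incidence rates down to the rates for RR = 1.
    Assumes population rate = baseline rate x mean relative risk."""
    return [r / mean_relative_risk for r in population_rates]

def absolute_risk(baseline, relative_risk, competing_mortality=None):
    """Probability of developing disease over consecutive one-year intervals.

    baseline: yearly baseline incidence rates (RR = 1)
    relative_risk: the individual's relative risk from the fitted model
    competing_mortality: yearly death rates from other causes (optional)
    """
    if competing_mortality is None:
        competing_mortality = [0.0] * len(baseline)
    risk, surviving = 0.0, 1.0
    for h0, m in zip(baseline, competing_mortality):
        h = h0 * relative_risk
        total = h + m
        if total > 0.0:
            # Share of this year's exits from the risk set that are disease onsets.
            risk += surviving * (h / total) * (1.0 - exp(-total))
        surviving *= exp(-total)
    return risk

# Made-up example: flat population rate of 0.003 per year, mean RR of 1.5.
base = baseline_rates([0.003] * 10, mean_relative_risk=1.5)
risk_average = absolute_risk(base, relative_risk=1.0)
risk_doubled = absolute_risk(base, relative_risk=2.0)
```

Projecting the risk distribution then amounts to applying `absolute_risk` to the relative risks of a representative sample, for example one drawn from national survey data on the risk factors.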

Gene-gene/Gene-Environment Interactions in Disease-risk

• Interaction in what scale?
  – Logistic, probit (liability threshold), additive…
• Little evidence of SNP-SNP/SNP-E interactions under the logistic scale
  – Lack of power or are risks truly multiplicative?
  – Does the scale matter?
• Important to have good model-fit at extremes of disease risks
  – Clinically important

Linear Logistic vs Linear Additive Null Models

• Linear logistic
• Linear additive
• Can be fitted in the logistic scale under rare disease assumption
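The model formulas are not reproduced in this transcript. Plausible forms, assuming G_j denotes the risk-allele count at locus j and D the disease indicator, are sketched here; the coefficient symbols are notation introduced for illustration, not taken from the slide.

```latex
\begin{align*}
\text{Linear logistic:}\quad
  & \operatorname{logit} P(D = 1 \mid G) = \beta_0 + \sum_j \beta_j G_j \\
\text{Linear additive:}\quad
  & P(D = 1 \mid G) = \gamma_0 + \sum_j \gamma_j G_j \\
\text{Rare disease:}\quad
  & \mathrm{OR}(G) \approx \frac{P(D = 1 \mid G)}{P(D = 1 \mid G = 0)}
    = 1 + \sum_j \gamma_j^{*} G_j,
    \qquad \gamma_j^{*} = \gamma_j / \gamma_0
\end{align*}
```

The last line is why the additive null model can be fitted to case-control data: when the disease is rare the odds ratio approximates the risk ratio, so the additive model implies an odds ratio that is linear, rather than log-linear, in the allele counts.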

[Figure: two panels; x-axis: number of risk alleles at the 19 loci]

A Tail-based Goodness-of-fit Test

(also a global test for interaction)

Song et al. (Biostatistics, In Press)

                                                  Hosmer and      Tail-based Test
                                                  Lemeshow test   C=25     C=100
Multiplicative Model
  Complete case analysis                  Hom OR  0.11            0.11     0.20
                                          Het OR  0.87            0.85     0.77
  Analysis including subjects with        Hom OR  .               0.16     0.23
  missing genotypes                       Het OR  .               0.11     0.17
Additive Model
  Complete case analysis                  Hom OR  0.0003          0        0
                                          Het OR  0.01            0        0

Statistically Speaking…

• Multiplicative model could not be rejected even with a large dataset and a powerful method
  – Fit seems adequate even at extremes
• Modest departure cannot be ruled out
• Additive model is soundly rejected
  – Plethora of gene-gene interactions in the additive scale

Does the Scale Matter Clinically?

• Stronger risk variation (or risk stratification) under the multiplicative than the additive model
• Proportion of the population identified at 2-fold or higher than average risk:
  – 1.16% under multiplicative model
  – 0.02% under additive model
• Correlation in PRS under the two models = 0.93 (AUC is hardly different)
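To see why the tail proportion is sensitive to the model even when AUC is not, it helps to write the proportion down for the multiplicative case. This is a minimal sketch, assuming the log relative risk is normally distributed in the population with standard deviation sigma; the function names and the sigma values are hypothetical, and the sketch is not the calculation behind the percentages quoted above.

```python
from math import erf, log, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def fraction_above_fold(sigma, fold=2.0):
    """Fraction of the population whose risk is at least `fold` times the
    population-average risk, when log relative risk is N(mu, sigma^2).

    The average risk is proportional to exp(mu + sigma^2 / 2), so the
    cut-off on the standardized log-RR scale is (log(fold) + sigma^2/2) / sigma.
    """
    z = (log(fold) + sigma ** 2 / 2.0) / sigma
    return 1.0 - normal_cdf(z)

# Illustrative values only: the tail fraction grows quickly with sigma.
tail_small = fraction_above_fold(0.3)
tail_large = fraction_above_fold(0.5)
```

Small changes in the spread of the risk distribution move the upper tail by a large factor, which is the sense in which risk stratification can differ sharply between two models whose AUCs are nearly the same.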

Concluding Remarks

• Translating heritability to predictability is hard
  – Due to highly polygenic (non-sparse) architecture
• Multiplicative model for gene-gene and gene-environment interaction works amazingly well
• Time to seriously think about public health implications for joint effects
  – Evaluate risk stratification
  – Stop using AUC