
Building and Using
Disease Prediction Models
in the Real World
Discussion leader: Heejung Bang, Ph.D.
Weill Medical College of Cornell University
At Joint Statistical Meetings
Salt Lake City, UT, 2007
1
What are risk score/prediction models?
• A "prediction" is a statement or claim that a particular event will occur in the future (a current or past event is also sensible).
• The response is often binary (event/non-event) or censored.
• A mathematical equation can be used to model the rate (or probability, or likelihood) of the event.
• A scoring system (e.g., integer points) can be derived to grade the risk, often by simplifying the mathematical model (e.g., rounding the regression coefficients); see the sketch below.
• The mathematical equation and/or scoring system can be used to stratify subjects (e.g., high vs. low risk).
2
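To make the coefficient-to-points idea concrete, here is a minimal sketch in Python (hypothetical data and variable names; not the derivation of any published score):

```python
# A minimal sketch (hypothetical data and variable names) of converting
# regression coefficients into integer points; NOT any published score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # e.g., standardized age, BMI, BP
y = (X @ np.array([0.9, 0.6, 0.3]) + rng.normal(size=500) > 0).astype(int)

beta = LogisticRegression().fit(X, y).coef_.ravel()

# Rescale so the smallest |coefficient| is worth about 1 point, then round.
points = np.round(beta / np.abs(beta).min()).astype(int)
print(dict(zip(["age", "bmi", "bp"], points)))

# A subject's risk score is the sum of their points; cutoffs on the total
# score then stratify subjects into, e.g., low vs. high risk.
```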
Why important?
• Evidence-based medicine = Science (theory) + Data + Statistics
• Risk score = Statistics + Art + Reality
• One of the real, practical solutions to reduce the burden/incidence of some diseases.
• People use it in the real world (esp. lay and underserved people)
-- used in clinical settings, community settings, or for self-use in pre-screening, screening, or risk assessment/prediction.
3
But
"Prediction is very hard, especially about the future." - Yogi Berra
4
Prediction via multiple regression (Stat 101)
There are two general applications for multiple regression, explanation and prediction, which are two differing goals in research:
-- attempting to understand a phenomenon by examining a variable's correlates on a group level (explanation)
-- being able to make valid projections concerning an outcome for a particular individual (prediction)
Refs: 1. Osborne (2000). Prediction in multiple regression. Practical Assessment, Research & Evaluation.
2. Neter, Kutner, Nachtsheim & Wasserman (2003). Applied Linear Statistical Models.
5
Regression
vs.
Prediction/risk score
What makes these two tasks different?
6
1. Simple and easy: Don't let the perfect be the enemy of the good!
• User-friendliness and ease of use are important!
-- a perfect model can be a jewel in your closet or in a journal.
-- if biostatisticians cannot use it, how can lay persons?
• Interactions or nonlinear functions may make a prediction model/risk score more complex.
-- worth it?
-- proper detection and modeling require a larger N (more later).
7
2. Variable categorization
• Most statisticians agree with Royston et al. (2005), "Dichotomizing continuous predictors in multiple regression: a bad idea".
• However, filling in continuous information (e.g., blood pressure (BP), BMI, CRP) can be hard for many people.
-- Q: do you know your BP? Which BP? In what unit?
• A wrong unit can be worse than nothing (pound vs. kg, mg vs. g)! Units are more complex than you might imagine.
• A prediction model that includes continuous variables may not convert to a simple questionnaire.
• A prediction model based solely on categorical variables is still usable with a few missing inputs.
• It may be safe, informative, and instructive to develop "continuous models" and "categorical models" together, present both, and let users decide; see the sketch below.
• Intuitive cutpoints are preferable (an "optimal" cutpoint may or may not help).
8
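A minimal sketch of the "present both" suggestion, assuming hypothetical BMI data and illustrative (non-clinical) cutpoints:

```python
# A minimal sketch (hypothetical data; illustrative, non-clinical cutpoints)
# of fitting a "continuous model" and a "categorical model" side by side.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
bmi = rng.normal(27.0, 4.0, size=1000)
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(bmi - 27.0) / 2.0))).astype(int)

# Continuous model: uses the exact BMI value (harder to self-report).
X_cont = bmi.reshape(-1, 1)
auc_cont = roc_auc_score(
    y, LogisticRegression().fit(X_cont, y).predict_proba(X_cont)[:, 1])

# Categorical model: intuitive bins a respondent can answer by checkbox.
X_cat = pd.get_dummies(pd.cut(bmi, bins=[0, 25, 30, np.inf])).to_numpy()
auc_cat = roc_auc_score(
    y, LogisticRegression().fit(X_cat, y).predict_proba(X_cat)[:, 1])

print(f"continuous AUC = {auc_cont:.3f}, categorical AUC = {auc_cat:.3f}")
```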
3. Variable selection
• More than 5-10 variables can be too many.
-- hard to believe, but it is true. Everyone is busy and lazy!
-- there are already too many risk scores (and medical calculators) in the world.
• Not all significant predictors should be included in the final model: this is not a sin or cheating!
• There are difficult variables and easy variables.
• Statistical techniques (e.g., backward elimination, data mining) can guide the variable search, but subjectivity can come into play (e.g., some variables such as SES may be intentionally excluded; race may be excluded so as not to limit generalizability); see the sketch below.
• More than one model can be developed to accommodate different data availabilities, say, with or without CRP.
• Even clinicians don't agree on variables (e.g., internist vs. nephrologist, serum vs. urine).
9
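A minimal sketch of a statistically guided variable search, using scikit-learn's SequentialFeatureSelector with direction="backward" as a stand-in for classical backward elimination (data and variable names are hypothetical):

```python
# A minimal sketch (hypothetical data and variable names) of a statistically
# guided variable search via backward selection.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
names = ["age", "bmi", "bp", "smoking", "ses", "crp"]
X = rng.normal(size=(800, len(names)))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * X[:, 3] + rng.normal(size=800) > 0).astype(int)

sel = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3, direction="backward"
).fit(X, y)
kept = [n for n, keep in zip(names, sel.get_support()) if keep]
print("kept by the search:", kept)
# Subjectivity still applies afterward: an investigator may drop a kept
# "difficult" variable (e.g., SES) or keep an easy one the search rejected.
```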
4. Sample size (N) and data quality
• Power/N calculation is less relevant here (no specific test is involved) or can be complicated, because this is a multivariable setting.
• There is no absolute consensus on the N requirement. As the goal is a stable regression equation, more is better.
• Creating a prediction equation involves gathering relevant data from a "large, representative" sample of the population (if not, a less reproducible risk score!).
• We may need to save some N for internal validation (e.g., the split-sample method; see the sketch below).
• Therefore, a large database is often used, e.g., NHANES, NHS, SEER, Framingham, ARIC, etc.
• "No fancy statistical analysis is better than the quality of the data. Garbage in, garbage out, as they say." (Robins)
10
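A minimal sketch of the split-sample method on hypothetical data:

```python
# A minimal sketch (hypothetical data) of the split-sample method: develop
# the equation on part of the data and hold out the rest for validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 4))
y = (X @ np.array([1.0, 0.5, 0.5, 0.0]) + rng.normal(size=2000) > 0).astype(int)

X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=1 / 3, random_state=0)
model = LogisticRegression().fit(X_dev, y_dev)

auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"development AUC = {auc_dev:.3f}, validation AUC = {auc_val:.3f}")
# A validation AUC far below the development AUC warns that the equation
# is unstable (i.e., N may be too small for a reproducible risk score).
```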
5. Population characteristics
• A universal model may not exist.
• Separate models may be needed:
-- by sex
-- by race or country (e.g., many countries have their own diabetes risk score)
-- by age
-- high risk (e.g., clinical setting) vs. general population
-- first vs. recurrent event; before vs. after surgery
• Often, this is not under the investigator's control (e.g., only subjects aged ≥65 in the Medicare database; no minorities in the Framingham study).
• It may need to be mentioned in the title of your paper.
Remark: More chances for publication! Each effort is not just a repeat but can be meaningful. (Good news!)
11
6. Databases
Administrative data:
• Generally HUGE
• No lab data, not many variables
• Represents the target population well
• Generally well maintained by a reliable organization
• Data checks well done

Clinical or Epi data:
• Small or mid-size
• Lab and clinical data, and many more variables
• Reduced generalizability or representativeness
• Data quality not always guaranteed
12
Statistical tools for model development
• Standard regression: logistic and Cox
-- most popular
-- an explicit mathematical formula and a numeric scoring system can be derived (e.g., by converting regression coefficients)
• Advanced regression: fractional polynomial regression (Royston & Altman 1999; Sauerbrei et al. 2006)
-- combines variable selection with determination of the functional relationships for predictors
• Tree-based methods: CART (by Breiman), Recursive Partitioning (by Hawkins & Kass), Optimal Discriminant Analysis/Classification Tree Analysis (by Yarnold), Logical Analysis of Data (by Hammer), Bayesian CART; see the sketch below
-- complex interactions can be revealed
-- cutpoints are identified
• Neural networks
• Data mining techniques
13
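A minimal sketch of a tree-based method; scikit-learn's CART-style DecisionTreeClassifier stands in here for the specific tools named above, and the data with its planted interaction are hypothetical:

```python
# A minimal sketch (hypothetical data with a planted interaction) of a
# tree-based method in the CART family.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
age = rng.uniform(30, 80, size=2000)
bmi = rng.normal(27, 4, size=2000)
# Risk concentrated in older subjects with high BMI (an interaction).
y = ((age > 60) & (bmi > 30) & (rng.random(2000) < 0.8)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50, random_state=0)
tree.fit(np.column_stack([age, bmi]), y)

# The printed rules expose both the interaction and the cutpoints found.
print(export_text(tree, feature_names=["age", "bmi"]))
```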
Statistical measures for model evaluation/diagnostics
• Sensitivity & specificity: most popular
• Discrimination (ROC/AUC): most popular
• Predictive values, positive or negative (PPV, NPV)
• Likelihood ratio (LR)
• Accuracy (e.g., Youden index, Brier score)
• Yield, number needed to treat (NNT), number needed to screen (NNS)
• Model fit (e.g., AIC, BIC)
• Lack of fit (e.g., Hosmer-Lemeshow test)
• R² (coefficient of determination)
• P-value (significance of association)
• Predictiveness curve (based on R², by Pepe)
• Calibration/re-calibration
• Decision curve analysis (Vickers 2006)
Remark: LR, the Youden index, and ROC/AUC are functions of sensitivity and specificity; the sketch below computes several of these measures for one model.
14
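A minimal sketch computing several of the measures above for one model's (hypothetical, simulated) predicted risks:

```python
# A minimal sketch: several evaluation measures for simulated predictions.
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix, roc_auc_score

rng = np.random.default_rng(5)
p_hat = rng.uniform(0, 1, size=1000)        # predicted risks from some model
y = (rng.random(1000) < p_hat).astype(int)  # outcomes (calibrated by design)
pred = (p_hat >= 0.5).astype(int)           # classify at a 0.5 cutoff

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
sens, spec = tp / (tp + fn), tn / (tn + fp)
ppv, npv = tp / (tp + fp), tn / (tn + fn)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
print(f"Youden index={sens + spec - 1:.2f}  LR+={sens / (1 - spec):.2f}")
print(f"AUC={roc_auc_score(y, p_hat):.2f}  Brier score={brier_score_loss(y, p_hat):.3f}")
```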
Noted limitations of some methods
• P-value: a significant association is often not enough for good prediction.
-- one can have a small p-value but poor values for everything else (e.g., low p with low R² is a well-known phenomenon).
-- some adopt strict p-value thresholds (e.g., p<0.001) or multiple-testing adjustment.
• AUC: triply robust. Once it is high, it is extremely difficult to increase. An odds ratio alone can be problematic (Pepe et al. 2004; Cook 2007).
• R²: oftentimes hard to increase.
• Sensitivity/specificity: do not address the prevalence of disease in different populations; e.g., if prevalence = 0.01, sensitivity = 0.95, and specificity = 0.95, then PPV = 0.16 (see the worked calculation below).
• Hosmer-Lemeshow test: different software can produce different test statistics/p-values.
Remark: For novel markers, relying on one statistical measure may not be wise.
15
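The PPV figure in the sensitivity/specificity bullet follows from Bayes' rule; writing p for prevalence:

\[
\mathrm{PPV} = \frac{\mathrm{Sens}\cdot p}{\mathrm{Sens}\cdot p + (1-\mathrm{Spec})(1-p)}
= \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99}
= \frac{0.0095}{0.0590} \approx 0.16
\]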
16
Good references
• Ridgeway G. (2003). "Strategies and Methods for Prediction." In The Handbook of Data Mining (N. Ye, ed.).
• Harrell FE Jr, Lee KL, Mark DB. (1996). Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4): 361-87.
17
Prevalent vs. Incident disease
• Prevalent/concurrent disease:
-- a cross-sectional study is needed.
-- useful for detecting undiagnosed cases of asymptomatic disease (e.g., diabetes mellitus (DM), kidney disease), not for all diseases.
-- simplicity in the prediction model/risk score is important.
• Incident disease:
-- a prospective study of an event-free cohort is needed.
-- simplicity is less important because prediction of new cases is not as urgent as diagnosis of (hidden) concurrent cases.
18
Models evolve
• Gail et al.'s original and improved prediction models for breast cancer (1989, 1999, 2001), called the Gail et al. model 1, model 2, etc.; Barlow et al. and Chen et al. (2006) improved Gail et al. with novel predictors.
• Similarly: the Stroke Prognosis Instrument (SPI) I & II; the Acute Physiology and Chronic Health Evaluation (APACHE) I, II, III.
• Multiple efforts exist to simplify and improve the Framingham risk score.
• Many risk scores exist for incident and prevalent DM.
Remark: act on the best available evidence, as opposed to waiting for the best possible evidence (Institute of Medicine).
19
Other highly cited scores/equations
Not necessarily for specific disease prediction:
• Charlson comorbidity index: gives a 10-year survival estimate for a patient (Charlson et al. 1987)
• MDRD-GFR: a kidney function measure (Levey et al. 2000)
• APACHE: a severity-of-disease classification system for the intensive care unit (Knaus et al. 1981)
-- most were developed by clinicians (not statisticians) and are used mostly by clinicians
20
Individual vs. population-level risk
Still unsolved issues:
• Started from Rose's legendary paper, "Sick individuals and sick populations" (1985, republished in 2001).
• A population-based model is a poor model at the individual level:
-- think about Winston Churchill!
-- a large number of people at a small risk may give rise to more cases of disease than the small number who are at a high risk (see the arithmetic below).
-- a preventive measure that brings large benefits to the community offers little to each individual.
-- advantages/disadvantages of the "high-risk strategy" vs. the "population strategy"; they are not competing, and both are necessary.
• Is individual prediction ever possible?
-- genetic studies may be helpful.
-- LAD answers better, but a more complex algorithm is needed: nothing is free!
21
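A toy illustration of Rose's point, with hypothetical numbers (2% of 90,000 low-risk people vs. 10% of 10,000 high-risk people):

\[
\underbrace{90{,}000 \times 0.02}_{\text{low risk}} = 1{,}800 \text{ cases}
\;>\;
\underbrace{10{,}000 \times 0.10}_{\text{high risk}} = 1{,}000 \text{ cases}
\]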
[Figure] Percentage distribution of serum cholesterol levels (mg/dl) in men aged 50-62 who did or did not subsequently develop coronary heart disease (Framingham Study).
Rose G. Int J Epidemiol 2001; 30:427-432. doi:10.1093/ije/30.3.427
22
[Figure] Ability of the Gail et al. breast cancer risk prediction model to discriminate between women who were diagnosed with breast cancer and women who were not diagnosed, in the Nurses' Health Study.
Elmore JG et al. J Natl Cancer Inst 2006; 98:1673-1675. doi:10.1093/jnci/djj501
23
What diseases can be predicted?
• Breast cancer
-- Gail et al. (1989), Rosner and Colditz (1996, 2000), Tyrer et al. (2004), Barlow et al. (2006)
• Other cancers: cervical (2006), ovarian (2006), prostate (2005), lung (2007), colorectal (2007); all are recent!
• Coronary heart disease
-- Framingham, ARIC, SCORE (Europe), Reynolds score
• Stroke
-- SPI, Stroke-Thrombolytic Predictive Instrument, ARIC
• Diabetes
-- Herman et al., San Antonio (Stern et al.), ARIC (Schmidt et al.)
• Kidney disease
-- SCORED (Bang et al. 2007)
• Numerous other specific diseases/events
24
Not all diseases/events are well predicted
• Some mental disorders (screening is more common than prediction)
• HIV
• How about car accidents, divorce, bankruptcy, suicide, or lay-offs?
• Cancer again (AUC can be as low as 0.56)
• Too many "don'ts" (i.e., risk factors) are not more helpful than "do nothing".
• Poor PPV
A good reading: Begg (2001). The search for cancer risk factors: when can we stop looking? AJPH.
25
Can we predict low risk or health?
• It is hard for a risk model based on clinical factors to identify a group at very low risk that does not need to worry. In other words, a group with very low BMI, high exercise levels, and good genes usually is not well captured by screening questions.
• This is true in virtually all disease prediction problems. Discrimination may not be good at the low end of the risk spectrum but can be good at the high end.
• "We know exactly why certain people commit suicide. We don't know, within the ordinary concepts of causality, why certain others don't commit suicide. ... We know a great deal more about the causes of physical disease than we do about the causes of physical health." (The Road Less Travelled, Peck, 1978)
26
Sample risk scores
1. Framingham score
27
28
2. Reynolds score
29
3. SCreening Occult REnal Disease
(SCORED)
Bang, H. et al. Arch Intern Med 2007;167:374-381.
30
31
4. Prostate cancer nomogram
32
Issues to consider: before/during development
• Is the disease predictable and meaningful?
• 1st model: always thrilling! No need to compare with other models.
• Best model: also great, but one should show that the new model improves on the existing models/guidelines in important aspects.
• External validation using an independent dataset within the same publication is a great strength (editors seem to give brownie points).
-- advanced validation techniques (e.g., cross-validation, bootstrap) are not popularly used in clinical publications; the split-sample method is widely used. However, it still utilizes the sample data ("internal validation"). See the sketch below.
• Ask yourself: "Will this model be reproducible?"
33
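For contrast with the split-sample method, here is a minimal sketch of bootstrap internal validation in the spirit of Harrell's optimism correction (hypothetical data; a simplified illustration, not a full implementation):

```python
# A minimal sketch (hypothetical data) of bootstrap internal validation,
# estimating and subtracting the "optimism" of the apparent AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(int)

apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])

optimism = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    m = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)         # how much the fit flatters itself

corrected = apparent - float(np.mean(optimism))
print(f"apparent AUC = {apparent:.3f}, optimism-corrected AUC = {corrected:.3f}")
```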
After development
• VALIDATE or perish!
• Validations will be done by you and by others (be careful, and be prepared to write a Response).
• People will compare your model with others' once they are developed (not always in a fair manner; usually, you don't get a chance to review and comment on others' publications that criticize your method).
• Do not publish a "model not to be replicated" or a "type-I error" and run!
• Everybody loves external validation, especially editors.
34
After publication, how to disseminate?
• The power of marketing: a good method deserves good marketing.
• Computer-based systems (e.g., web-based or handheld) vs. the paper-and-pencil method.
• Work with the Public Affairs department at your institution, or contact the media directly.
-- authors may need to write a press release (PR).
-- no one reads/understands your paper as well as you do; highlight the main findings clearly.
-- no p-values in a PR!
-- be ready for interviews (esp. for a 1st study).
• Work with authorities and practitioners to implement/distribute your method (preferably after validation).
35
Statistician's role in risk prediction/score
• The statistician's involvement is absolutely necessary.
• Statisticians (or epidemiologists) can be the leader/first author of clinical research/publications.
• Statisticians who develop a risk score should be highly familiar with the current literature on the relevant disease.
• A title beyond "statistician" or "faculty of (bio)statistics" may be helpful for PR and interview purposes (sadly true! some reporters search for MD authors).
• At times, clinical communities call for the development of a (new or improved) risk score. For example:
-- multiple editorials in 2006 called for a renal score.
-- Aitkins (1994) wrote, "It would be a shame if Spiegelman et al. were to stop short of presenting a new model based on their substantially more powerful tool, the cohort study."
-- beyond Framingham.
36
To screen or not to screen?
Not all prediction models/pre-screening/screening are beneficial.
-- e.g., the ADA recommended "do screen" for DM (in one year) and "do not screen" (in the following year), in the same journal.
-- e.g., there are many controversies in breast cancer screening:
Freedman DA, Petitti DB, Robins JM. (2004). On the efficacy of screening for breast cancer. International Journal of Epidemiology, 33:43-55. With comments by Gotzsche PC (On the benefits and harms of screening for breast cancer, pp. 56-64), Miller AB (Commentary: A defence of the Health Insurance Plan (HIP) study and the Canadian National Breast Screening Study (CNBSS), pp. 64-65), Baum M (Commentary: False premises, false promises and false positives: the case against mammographic screening for breast cancer, pp. 66-67), Berry D (Commentary: Screening mammography: a decision analysis, p. 68), and a rejoinder by Freedman DA, Petitti DB, Robins JM (pp. 69-73).
• The WHO announced 10 principles for national screening programs (1968).
• False negatives can cause serious problems.
• False positives can create too much anxiety and scare people.
• The problem of high vs. low risk: what does "low" mean?
• The effectiveness of screening should ultimately be tested in RCTs.
• Cost-effectiveness should also be evaluated.
37
US Preventive Services Task Force
• The entity with the most rigorous evidence-based approach.
• An independent panel of experts in primary care and prevention that systematically reviews the evidence of effectiveness and develops recommendations for clinical preventive services.
• http://www.ahrq.gov/clinic/uspstf/uspstopics.htm
• Many specific disease authorities (e.g., CDC, ADA, NKF) have their own screening recommendations, often considerably different from one another.
• Some agencies (e.g., NCI, NHLBI) hold workshops to review various risk models.
• A recent reference: Campos-Outcalt (2007). Screening: New guidance on what and what not to do. Journal of Family Practice.
38
Risk score primer
• Simplicity, user-friendliness, and accuracy are the key issues for success.
• Can have real impacts on people's lives (esp. for the underserved).
• Useful for educating people about risk factors and for raising low awareness of some diseases.
• A great collaboration area for clinicians and statisticians.
• The name can be important (e.g., ABCD, APACHE, SCORED, Framingham, Reynolds, Gail et al., "Take the test and know your score", Indian diabetes score). Is it Googlable?
• Nothing is causal; it is all about association or correlation!
-- If causes can be removed, susceptibility ceases to matter (Rose 1985).
39
Sample risk scores on the internet
• Cancer: http://riskfactor.cancer.gov/cancer_risk_prediction/
  http://www.mskcc.org/mskcc/html/5794.cfm
  http://www4.utsouthwestern.edu/breasthealth/cagene/
• APACHE: http://www.sfar.org/scores2/apache22.html
  http://www.apache-web.com/public/pub_main.html
• Charlson comorbidity index: http://www.medalreg.com/qhc/medal/ch1/1_13/01-13-01ver9.php3
• Framingham score: http://hp2010.nhlbihin.net/atpiii/calculator.asp?usertype=prof
  http://www.nhlbi.nih.gov/about/framingham/riskabs.htm
• UK CVD score: http://www.riskscore.org.uk/
• PROCAM score: http://www.chd-taskforce.de/
• Reynolds score: http://www.reynoldsriskscore.org/
• Herman et al.'s diabetes risk score: http://www.diabetes.org/risk-test.jsp
• German diabetes risk score: http://www.dife.de/
• Angina score: http://www.anginarisk.org/
• Pneumonia score: http://www.ahrq.gov/clinic/pneuclin.htm#head1
• SCORED: http://kidneydiseases.about.com/od/diagnostictests/a/scored.htm
• Depression: http://www.psycom.net/depression.central.screening.html
• Medical calculators: http://medcalc3000.com/ (some are commercial)
In general, Google can find these.
40