Guidance for Statisticians when Reporting Biomarker Studies



Biomarkism: taming the revolution?

May 12th 2014, PSI Conference
David Lovell, St George's Medical School, University of London

[Slide images: Henry Gray and Edward Jenner]

Plan of contribution

• Different types of biomarkers
• Biomarkers (the journal) and statistical guidelines
• Some personal editorial/refereeing experiences
• Challenges
• Epigenetics
• Discussion points

Editorial Board Member

• Biomarkers
• Mutagenesis
• Toxicology in Vitro (until 2008)
• Refereeing for numerous journals (20+) over last 10-15 years

Biomarkers was started in 1996 by John Timbrell (UoL School of Pharmacy).

The journal Biomarkers brings together all aspects of the rapidly growing field of biomarker research, encompassing their various uses and applications in one essential source. Manuscripts can describe biomarkers measured in humans or other animals in vivo or in vitro.

FDA definition of Biological Marker (biomarker)

“A characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.”
"Biomarkers and Surrogate Endpoints: Preferred Definitions and Conceptual Framework", Biomarkers Definitions Working Group (2001)

• Biomarkers of exposure: covering detection and measurement of internal exposure to drugs and other chemicals;
• Biomarkers of response: including measures of endogenous substances or parameters indicative of pathological or biochemical changes, both toxicodynamic and pharmacodynamic, resulting from exposure to drugs and other chemicals;
• Biomarkers of susceptibility: including genetic factors which alter susceptibility to drugs and other chemicals;
• Biomarkers of disease: covering measurement of endogenous substances or parameters indicative of a disease process and the use of pharmacodynamic and genetic markers in evidence-based laboratory medicine and treatment (markers of efficacy)

• 13 (at least) other journals with Biomarkers in the title
• 8 issues/year
• >400 papers received in 2012 and 2013
• Approximately 225 referees in 2013
• Rejection rates have increased from 51% in 2009 to 73% in 2013

• Year-average impact factor stable at 2.230

• Geographical spread

Country    2009  2010  2011  2012  2013
U.S.         54    54    32    36    16
China        33    25    54   108   122
India        44    27    23    35    26
Italy        20    14    17    22    17
U.K.         13     9    14    11    16
Germany       5    12    20     8     5
Brazil       22    20     9     8    15
Turkey        9     7     7    27    24

The paper

"Biomarkers expects high standards but recognizes that it needs to be vigilant as scientific research continues to be affected by errors in the conduct and reporting of research and by fraudulent research. There have been reports of the high incidence of statistical errors, poor statistical practice, and limitations in the designs used in papers published by peer-review journals."

Quotes from the paper

Biomarkers, therefore, starts from the position that it is of paramount importance that studies published in a peer-reviewed journal should have been correctly designed, carried out, and reported, and that the results are provided in such a way that the experimental and statistical methods could be repeated. This is also important for both economic and ethical considerations.

Transparency in terms of the availability of and access to the original raw data is a key component for the critical assessment of evidence-based research.

Biomarkers, at present, does not have statistical guidelines. It does, though, have instructions to authors, which provide sensible general requirements.

Table 1 provides a non-exhaustive list of some of the guidelines available. In many specific technical areas, such as the -omics, technical reviews of appropriate statistical methods have been produced; the referee can reasonably expect that an author is aware of them, applies the methods, and cites the publications where appropriate.

Biomarkers, therefore, expects to see evidence of the planning that went into a study and to see statistical analyses which make full use of the design. Examples would be details of the statistical analysis plan (SAP), consideration of the primary endpoint, and whether the primary aim of the study was hypothesis testing or hypothesis generation. A failure to declare in the “methods” section that blinding and randomization were carried out would be interpreted as implying that this had not been done. Details, such as the type of randomization, e.g., block or stratified, and the methods used for blinding, should be given when relevant.
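As an aside on the block randomization named above: the mechanics are simple enough to sketch. A minimal, illustrative Python example (not from the talk); the arm labels, block size, and seed are arbitrary assumptions.

```python
import random

def block_randomize(n_subjects, arms=("A", "B"), block_size=4, seed=2014):
    """Permuted-block randomization: each block contains equal numbers of
    each arm in random order, keeping allocation balanced over time."""
    assert block_size % len(arms) == 0
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)      # randomize order within each block
        schedule.extend(block)
    return schedule[:n_subjects]

print(block_randomize(10))  # balanced allocation in blocks of 4
```

Stratified randomization would simply run one such schedule per stratum.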

Biomarkers expects to see a justification of the sample sizes used and, where relevant, the power calculations which were carried out as part of the development of the SAP, for both experimental and observational studies.
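A hedged sketch of the kind of power calculation a SAP might record, using statsmodels' TTestIndPower for a two-group comparison; the effect size, alpha, and target power are illustrative assumptions, not values from the talk.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical planning values: standardized effect size (Cohen's d),
# two-sided alpha, and target power for a two-group comparison.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   ratio=1.0, alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.1f}")  # about 64
```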

The uncritical use of hypothesis testing, and the reporting of results merely as statistically significant (p < 0.05) rather than, preferably, with the exact p values, is not acceptable. Statistical significance alone is not a justification for publication.
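To make the point concrete: reporting an effect estimate with a confidence interval and the exact p value, rather than a bare "p < 0.05". A small simulated Python example; all numbers are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(1.2, 1.0, 30)   # simulated biomarker levels
control = rng.normal(0.8, 1.0, 30)

t, p = stats.ttest_ind(treated, control)          # pooled-variance t-test
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / 30 + control.var(ddof=1) / 30)
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df=58) * se

# Report the estimate, its uncertainty, and the exact p value together
print(f"difference = {diff:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], p = {p:.4f}")
```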

It is, therefore, important to note that Biomarkers' policy is that well-designed studies which produce negative results are viewed favourably for publication. This policy also meets the ICMJE (2008) obligation to publish negative studies.

The Refereeing Challenge

An email arrives to the ‘volunteer’:

Invitation to Review Manuscript ID xxx

• Recently, you agreed to review Manuscript ID xxx. A previous e-mail was sent to you four days ago as a reminder that your review was due. We have yet to receive your review of this manuscript.

Indication of the problem

• 42,328 journals listed in PubMed (Biomarkers is #29,757)
• Estimated 1 million+ papers in them a year?

• “A reasonably mature journal like Neuroimage would hope to see between 70% and 90% of submissions rejected.”
• The Intergovernmental Panel on Climate Change (IPCC) latest report is based upon 73,000 publications (25% of them in Chinese), a 100-fold increase in 30 years (Economist 5/4/14)

How long does it take to referee a paper?

• Initial read-through (30-40 minutes): locate the paper in the scientific universe (main concepts/general theme); 10-20 minutes digesting, following up figure and table legends to see if they link in with the text. Is this ground-breaking work or junk? Can I tell? (1 hour)
• Second read-through (two days later, 1 hour): more detailed; identification of main methods, any limitations, uncertainties, unanswered questions; link the text (especially the conclusions) more closely to the data, analysis and results; identify figures that don't match the text or legends; do results seem odd? Gaining confidence in the view that the paper is uninteresting and should be rejected. First draft of referee report (1 hour).
• Third reading (next day, 40 minutes): concentrate on areas not completely understood and identify key points in the referee's report to justify recommending rejection rather than resubmission. Write up the report and send it back to the editor (20 minutes); completion of final report to editor (1 hour).
• Total: 3 hours, plus time taken to access websites, remember passwords, etc.

Follow on

Statistics:

Guidelines are given in Lovell, D.P. (2012) Biomarkers 17(3), 193-200.

In brief, Biomarkers expects authors to be aware of the appropriate statistical analyses that should be used in their specific field of research and prepare their submissions accordingly. A statistical analysis plan should be available for the studies reported and, if required, all relevant data and analyses must be accessible to reviewers. Authors should indicate how datasets used in the analyses will be maintained and be willing to make their data available to other researchers.

The author(s) responsible for statistical design and analysis must be indicated as a point of contact on the title page by the # symbol. If a statistician was employed for the analysis, but is not an author, s(he) must be identified and have agreed that their name and email address will appear in the acknowledgements section.

Following on

• Citations?
• Effect on journal? From Volume 19 (2014) onwards: "# ***** **** and ****** ********* are responsible for statistical design and analysis."
• Speed of response?
• How to monitor and follow up?

Personal examples

Example 1

What goes round comes round

• Paper refereed for one journal (rank #1)
• Identical re-submission to first journal
• Asked to referee by 2nd journal. Refused.
• Asked to referee by 3rd journal. Refused.

Example 2

“You can be one of the authors”

Reviewer Comments:

“A statistician with experience in systematic review and meta-analysis should be consulted to assist with the analysis as the description given of the statistics and the summary statistics provided are not correct.”

Authors’ Comments:

Reviewer clearly very upset with basic statistical analysis performed.

We do not have the knowledge to perform such analysis given that the individual papers are so heterogeneous. Unusual analysis required.

Likely lower impact journal would have accepted our statement!

Authors' Suggested Action:

Suggest contact Dr. Lovell and offer authorship in return for review of statically (sic) element of paper.

“We have got your name from Prof ****** ******* because we need some support to improve a systematic review paper on the long term issues associated with ********. Paper has been accepted with major revision in the American Journal of ********.” (Impact Factor 2.516)

Example 3

“Although there is a distinct grouping of CASE samples on the left side of the map, there are many other samples that are located throughout the Control samples, indicating these samples have measures for these analytes that are more similar to Controls than CASE.

Thus these analytes are not sufficient to differentiate between all members of the 2 groups.” Report from company bioinformatician

[Figures: scatterplots of PC II vs PC I coloured by Group (CASE/Control) and by Batch (1/2); C103 vs C3 by Group; and PC I vs Study ID by Batch and by Group]

“As you can imagine I am slightly distraught but have been in contact with both the company and the statistician to look again at the data to quantify the effect” “I find it interesting there are two distinct groupings unlinked to my clinical categorisation.” “I did send two lots of samples separated by about a year but this is the only difference - I wonder if they separate on ID number (1001-1050 roughly went first) 1151 onwards went second. I was expecting there to be minimal experimental variation as this is what the company promise so this would be an important quality control issue to flag up.” “…there were 18 months between batches so entirely possible this explains the split. ” All other methods of collection and storage remained the same as far as I am aware.

• I am essentially going to bin these results
• Thanks for pointing it out - I at least can resend them in one go and hopefully ******* will be able to give me a discount on principle - if not the grant can take it luckily
• I am going to withdraw all the papers obviously
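The batch problem in Example 3 is exactly what a principal component plot coloured by batch reveals. A minimal simulated sketch in Python using scikit-learn; the analyte matrix, batch shift, and sample sizes are invented purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical analyte matrix: 100 samples x 50 analytes, with a shift
# added to the second batch to mimic the run-to-run effect described above.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 50))
batch = np.array([1] * 50 + [2] * 50)
X[batch == 2] += 1.5  # simulated batch shift

scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
for b in (1, 2):
    print(f"batch {b}: mean PC I = {scores[batch == b, 0].mean():+.2f}")
# A clear separation of batches on PC I flags a processing artefact,
# not a biological (CASE vs Control) difference.
```

Plotting the scores coloured by batch, and again by clinical group, separates technical from biological structure, which is precisely the check that saved these papers from publication.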

Challenges and solutions?

Fraud and forensic bioinformatics

Potti et al and Duke University

http://arxiv.org/pdf/1010.1092.pdf

After thousands of hours of investigation, three clinical trials at Duke University in Durham, North Carolina, were suspended in late 2009 because of the irreproducibility of the genomic 'signatures' used to select cancer therapies for patients. Journals have a duty to help the community by maintaining reproducibility as a cornerstone of the scientific process.

“They also noted that the internal committees responsible for protecting patients and overseeing clinical trials lacked the expertise to review the complex, statistics-heavy methods and data produced by experiments involving gene expression.” “That is a theme the investigating committee has heard repeatedly. The process of peer review relies (as it always has done) on the goodwill of workers in the field, who have jobs of their own and frequently cannot spend the time needed to check other people's papers in a suitably thorough manner.

(Dr McShane estimates she spent 300-400 hours reviewing the Duke work, while Drs Baggerly and Coombes estimate they have spent nearly 2,000 hours.)

Moreover, the methods sections of papers are supposed to provide enough information for others to replicate an experiment, but often do not. Dodgy work will out eventually, as it is found not to fit in with other, more reliable discoveries. But that all takes time and money.” Economist Sep 10th 2011 http://www.economist.com/node/21528593

Challenges to Guideline approaches

Academic science is better than GLP science?

• “But scientific reform is needed as well. For decades, regulatory bodies have relied on guideline studies conducted under national and internationally agreed standards known as Good Laboratory Practice (GLP). This governs how the studies are planned, performed, monitored, recorded, reported and archived. These standards are invaluable, providing a guarantee of reliability and cross-comparability for studies on chemical safety. But the glacial pace of consensus building and validation required to update guidelines can leave gaping holes that allow the approval of chemicals of questionable safety.”

http://www.nature.com/nature/journal/v464/n7292/full/4641103b.html

“Moreover, detecting BPA's effects generally requires cutting-edge biological techniques whose results, in the eyes of regulatory bodies, carry just a fraction of the weight of those produced by a GLP study.”

Séralini et al (2012) paper

“The Editor-in-Chief again commends the corresponding author for his willingness and openness in participating in this dialog.

The retraction is only on the inconclusiveness of this one paper.

The journal’s editorial policy will continue to review all manuscripts no matter how controversial they may be. The editorial board will continue to use this case as a reminder to be as diligent as possible in the peer review process.”

Ultimately, the results presented (while not incorrect) are inconclusive, and therefore do not reach the threshold of publication for Food and Chemical Toxicology. The peer review process is not perfect, but it does work. The journal is committed to getting the peer-review process right, and at times, expediency might be sacrificed for being as thorough as possible. The time-consuming nature is, at times, required in fairness to both the authors and readers.

Likewise, the Letters to the Editor, both pro and con, serve as a post-publication peer review. The back and forth between the readers and the author has a useful and valuable place in our scientific dialog.” FCT (2014)

“Efforts to suppress scientific findings, or the appearance of such, erode the scientific integrity upon which the public trust relies. The retraction by the FCT marks a significant and destructive shift in management of the publication of controversial scientific research. Equally troublesome is that this retraction does not really impact how the science will be viewed by scientists, but only how it is viewed by others outside of the scientific community.

We feel the decision to retract a published scientific work by an editor, against the desires of the authors, because it is “inconclusive” based on a post hoc analysis represents a dangerous erosion of the underpinnings of the peer-review process, and Elsevier should carefully reconsider this decision.”

Portier et al (2014) Inconclusive Findings: Now You See Them, Now You Don’t! Environmental Health Perspectives volume 122 February 2014

Genetics

Economist 4/1/14: Illumina this week (15/1/14) claimed to be the first company to achieve the coveted $1,000 genome.

• Genome-wide Association Studies (GWAS)
• Next Generation Sequencing (NGS)
• Analysis of exome chip, exome sequencing data and whole genome sequencing data
• Haplotype mapping, analysis of structural variation, meta-analysis and gene-environment interaction
• Qualitative differences, stable within the individual and over time (cancer/mutation etc)

Epigenetics

Marksists

• Something revolutionary.
• Studying all the marks left in the genome that form the basis of epigenetics suggests a new type of scientist is about to appear: Marksists.

Epigenetics

• Epigenetic marks: methylation of DNA bases, histone variation; switch genes on or off and/or regulate them
• Epigenome: "The epigenome comprises all of the chemical compounds that have been added to the entirety of one's DNA (genome), but are not part of the DNA sequence, as a way to regulate the activity (expression) of all the genes within the genome."
• Tens of millions of methylation sites; pattern variable within the individual and over time; chip ($200) covers 450,000 sites
• Inter-generational and trans-generational inheritance; three-generation test (link to reproductive toxicity); male-mediated teratogenesis; "sins of the grandmother"

Epigenome-wide association studies (EWAS)

• The investigation of the distribution of methyl groups at thousands of specific DNA nucleotides across the genome to identify arrangements that are common in a disease, or associated with variation in a trait.

• “The problem with EWAS is that there's so much more that can confound an outcome compared with a GWAS” (John Greally)
• Epigenetic signatures which were thought to result from ageing instead reflected the changing proportions of blood cell types with age.
• Methods for analysing chemical patterns on DNA show promise for explaining disease, but few results have yet been replicated.
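For concreteness, the basic EWAS computation is a site-by-site association test followed by multiple-testing correction. A toy Python sketch on simulated beta-values with no real signal; the sizes, the t-test, and the choice of Benjamini-Hochberg FDR are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical methylation beta-values: 200 subjects x 1,000 CpG sites,
# tested site-by-site against case/control status (pure noise here).
rng = np.random.default_rng(0)
meth = rng.uniform(0, 1, size=(200, 1000))
case = np.array([0] * 100 + [1] * 100)

pvals = np.array([stats.ttest_ind(meth[case == 1, j], meth[case == 0, j]).pvalue
                  for j in range(1000)])
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"smallest raw p = {pvals.min():.4f}; "
      f"{reject.sum()} sites pass BH-FDR at 5%")  # expect 0 under the null
```

In a real EWAS the test would also adjust for confounders such as blood cell composition, which is exactly Greally's point above.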

Examples

• Stressful home life associated with shorter telomeres in a group of 9-year-old boys
• Psychotherapy can alter methylation
• Post-traumatic stress disorder (PTSD): "unusual profiles"
• People abused as children differ from those abused as adults
• Patterns related to suicide, successful dieting, US social status
• Drugs can alter the epigenome
• Holocaust survivors v. those with no traumatic experience
• Hungerwinter study of the Dutch Famine Birth Cohort
• Maternal nutrition around the time of conception can affect the regulatory tagging of the child's DNA
• Marks left by smokers, ex-smokers, food, diesel fumes, pesticides, arsenic 'produce distinct patterns'
• Male mice with folate deficiencies 'reprogram sperm'
• Markers associated with rodent stress in early life correlated with certain aversive behaviours; the same marks could be found in their offspring and, in some cases, in their offspring's offspring
• A methylation profile involving around 400 sites gives five years' warning of the onset of breast cancer

http://www.economist.com/news/science-and-technology/21591547-lack-folate-diet-male-mice-reprograms-their-sperm-ways

Skinner, M. K. (2008). What is an epigenetic transgenerational phenotype? F3 or F2. Reprod. Toxicol. 25, 2–6.

Male mice whose great grandmothers were exposed to PCB have lower sperm counts than others whose great grandmothers weren't (Poscar et al, 2013). How do you design studies that control all the confounders over three generations?

Discussion Points

• Is peer review feasible if every paper can be published somewhere?

• Can guidelines by themselves be used to ‘police’ the literature?

• Is it realistically possible to review multi author/multi-disciplinary work?

• Is it possible to ensure quality prospectively or retrospectively?

• How do we maintain quality (or should we even be trying?) as scientific output becomes increasingly global/multi-polar?

• Should reproducibility be a pre-condition for publication?

• ‘Black boxes’?

• Bioinformatics and/or statistics?

• Use of Check Lists?

• Perception of statistics as a tool (technical rather than scientific)?

• “I’m a molecular biologist, we don’t need statistics”

Nature Editorial, 13th February 2014

“Too many researchers have an incomplete or outdated sense of what is necessary in statistics; this is a broader problem than misuse of the P value. Among the most common fundamental mistakes in research papers submitted to Nature, for instance, is the failure to understand the statistical difference between technical replications and independent experiments.”

"Department heads, lab chiefs and senior scientists need to upgrade a good working knowledge of statistics from the ‘desirable’ column in job specifications to ‘essential’. But that, in turn, requires universities and funders to recognize the importance of statistics and provide for it."

"Good statistics can no longer be seen as something that makes science better — it is a fundamental requirement, and one that can only grow in importance as funding cuts bite and competition for resources intensifies."

"Correctable weaknesses in the design, conduct, and analysis of biomedical and public health research studies can produce misleading results and waste valuable resources. Small effects can be difficult to distinguish from bias introduced by study design and analyses. An absence of detailed written protocols and poor documentation of research is common. Information obtained might not be useful or important, and statistical precision or power is often too low or used in a misleading way. Insufficient consideration might be given to both previous and continuing studies. Arbitrary choice of analyses and an overemphasis on random extremes might affect the reported findings. Several problems relate to the research workforce, including failure to involve experienced statisticians and methodologists, failure to train clinical researchers and laboratory scientists in research methods and design, and the involvement of stakeholders with conflicts of interest. Inadequate emphasis is placed on recording of research decisions and on reproducibility of research. Finally, reward systems incentivise quantity more than quality, and novelty more than reliability. We propose potential solutions for these problems, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems.“ Ioannidis et al (2014) Research: increasing value, reducing waste 2 Published Online January 8, 2014 http://dx.doi.org/10.1016/ S0140-6736(13)62227-8

New Scientist Survey

• N = 122 (out of 1000 stem cell researchers)
• 55% thought stem cell research is put under more pressure than other areas of biomedical science

http://www.newscientist.com/articleimages/mg22129623.400/1-stem-cell-scientists-reveal-unethical-work-pressures.html

New Scientist 29/3/14

http://www.newscientist.com/data/doc/article/dn25281/stemcellsurveypdf1.pdf

The importance of transparent reporting of biomarker studies
Doug Altman, Centre for Statistics in Medicine, University of Oxford

The importance of transparent reporting

Research only has value if:
– Study methods have validity
– Research findings are published in a usable form

The goal should be transparency:
– Should not mislead
– Should allow replication (in principle)
– Can be included in systematic review and meta-analysis

Biomarker studies: Focus on studies of prognosis

Prognosis refers to the risk of future health outcomes in individuals or groups with a given disease or health condition. The study of prognosis has never been more important: more people are living with conditions impairing health, due to improvements in life expectancy. Understanding and improving prognosis is pivotal to the practice of clinical medicine. Prognostic information is increasingly used by clinicians to help manage patients.

Prognostic research themes:
1) Fundamental prognosis research: the course of health-related conditions in the context of the nature and quality of current care
2) Prognostic factor research: specific factors (such as biomarkers) that are associated with prognosis
3) Prognostic model research: the development, validation, and impact of statistical models that predict individual risk of a future outcome
4) Stratified medicine research: the use of prognostic information to help tailor treatment decisions to an individual or group of individuals with similar characteristics
[Hemingway et al. BMJ 2013]

Prognostic factor research aims to identify factors associated with subsequent clinical outcome in people with a particular disease or health condition. Examples:

Biological (biomarkers)
– genomic
– proteomic
– imaging
– physiological variables

Others
– psychosocial (e.g. depression)
– ecological (e.g. area-level social deprivation)

[Slides: Hamilton et al, J Transl Med 2010]

Prognostic importance of a single specific prognostic factor/marker

A clear view of the benefit of a marker is only likely to emerge from looking across multiple studies
– Systematic review

We should by now know the prognostic importance of numerous markers that have been extensively investigated for many cancers and other diseases
– Why don't we?

Example: p53 as a prognostic marker in bladder cancer

Systematic review of the literature: 168 published studies, >10,000 patients.
“After 10 years of research, evidence is not sufficient to conclude whether changes in P53 act as markers of outcome in patients with bladder cancer.”
[Malats et al, Lancet Oncology 2005]

Example: Ki-67 in breast cancer

Systematic review: 43 studies, >15,000 patients. Some evidence of publication bias.
“Whether these proliferation markers provide additional prognostic information to commonly used prognostic indices remains unclear.”
[Stuart-Harris et al, Breast 2008]

Evidence from systematic reviews that the quality of prognostic factor research needs to improve

Coronary disease: “Multiple types of reporting bias, and publication bias, make the magnitude of any independent association between CRP and prognosis among patients with stable coronary disease sufficiently uncertain that no clinical practice recommendations can be made.” [Hemingway et al, PLoS Med 2010]

Osteosarcoma: “93 papers were studied in depth … Only 7 papers were of sufficient quality to analyze ... Because of heterogeneity of the studies, pooling results is hardly possible.” [Bramer et al, Eur J Surg Oncol 2009]

Peptic ulcer perforation: “Fifty prognostic studies with 37 prognostic factors comprising a total of 29,782 patients were included in the review. The overall methodological quality was acceptable, yet only two-thirds of the studies provided confounder adjusted estimates.” [Moller et al, Scand J Gastroenterol]

Multiple studies

Clinical and methodological heterogeneity
– Different patient groups
– Different assays/measurement techniques
– Variation in cutpoints
– Adjustment for different other variables (or none)

… leading to
– confusion
– amplification of biases

Results are probably not reliable even if there is apparently a clear picture
– More studies may make things worse!
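When pooling across heterogeneous studies is attempted at all, a random-effects model is the usual choice because it admits between-study variation. A self-contained DerSimonian-Laird sketch in Python; the study estimates and variances are invented for illustration.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects pooling of per-study effect estimates (e.g. log
    hazard ratios) with within-study variances (DerSimonian-Laird)."""
    effects, variances = np.asarray(effects), np.asarray(variances)
    w = 1.0 / variances                          # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)       # Cochran's Q heterogeneity
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (variances + tau2)              # random-effects weights
    pooled = np.sum(w_re * effects) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return pooled, se, tau2

# Hypothetical log hazard ratios and variances from five marker studies
pooled, se, tau2 = dersimonian_laird([0.4, 0.1, 0.6, -0.1, 0.3],
                                     [0.04, 0.09, 0.02, 0.12, 0.05])
print(f"pooled = {pooled:.2f} (SE {se:.2f}), tau^2 = {tau2:.3f}")
```

A large tau^2 relative to the within-study variances is the quantitative face of the "confusion" above: the studies disagree more than chance allows.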

Publication bias

“… the literature is probably cluttered with false positive studies that would not have been submitted or published if the results had come out differently.” [Simon, 2001]

“Together with the long recognized problem of publication bias favoring studies that report positive findings, the result may be a body of literature that is heavily influenced by false-positive findings.”

Bcl-2: Martin et al, BJC 2003
83 studies of C-reactive protein in stable coronary artery disease: Hemingway et al, PLoS Med 2010

Prognostic factor research: Limitations

– Small samples
– Poor statistical analysis (adjustment for known predictors, handling of continuous variables)
– Heterogeneous laboratory methods
– Lack of replication
– Poor publication practices (inadequate reporting, selective publication)

Reliable answers require better studies
– especially planned collaborative studies leading to IPD meta-analysis

Reporting guidelines

JNCI, BJC, JCO, EJC 2005

REMARK: REporting guidelines for tumor MARKer prognostic studies

Recommended reporting elements to facilitate:
– Evaluation of the appropriateness and quality of the design, methods, and analysis of the study
– Understanding of the context in which conclusions apply
– Reproducibility of analyses
– Comparisons across studies, including formal meta-analyses

REMARK checklist elements (20 items in total)

Introduction
– Markers examined
– Study objectives

Methods
– Patients
– Specimen characteristics
– Assay methods
– Study design
– Statistical analysis methods

Results
– Data
– Analysis & presentation

Discussion
– Interpretation
– Implications

REMARK Item 17: “Among reported results, provide estimated effects with confidence intervals from an analysis in which the marker and standard prognostic variables are included, regardless of their statistical significance”

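A hedged sketch of the kind of analysis Item 17 asks to be reported: the marker fitted alongside standard prognostic variables, with adjusted effects and confidence intervals. Logistic regression on simulated data is an illustrative choice (a Cox model would be usual for survival outcomes); the variable names are invented.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: marker plus two standard prognostic variables,
# binary outcome; REMARK item 17 asks for adjusted effects with CIs.
rng = np.random.default_rng(7)
n = 300
X = np.column_stack([rng.normal(size=n),          # marker
                     rng.normal(size=n),          # age (standardized)
                     rng.integers(0, 2, n)])      # stage indicator
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.8 * X[:, 2] - 0.5))))

model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
or_est = np.exp(model.params[1:])                 # adjusted odds ratios
or_ci = np.exp(model.conf_int()[1:])
for name, o, (lo, hi) in zip(["marker", "age", "stage"], or_est, or_ci):
    print(f"{name}: OR {o:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

The point of the item is that every reported effect comes from the adjusted model, whether or not it reaches statistical significance.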

129 articles: 36% included the marker in a multivariable model with standard clinical variables [Vickers et al, Cancer 2008]

Reporting of prognostic studies – Pre-REMARK
First 10 articles in 5 high-profile cancer journals, 2006-7 [Mallett et al, BJC 2010]

REMARK item                          Reported
Number of patients overall
  Assessed for eligibility               56%
  Excluded                               54%
Number available for analysis
  Patients                               98%
  Events                                 50%
Number in univariable analysis
  Patients                               54%
  Events                                 21%
Numbers in multivariable analysis
  Patients                               54%
  Events                                 30%

Prognostic model research

A prognostic model is a formal combination of multiple prognostic factors from which risks of a specific endpoint can be calculated for individual patients.

Also called: prognostic (or prediction) index; prognostic (or prediction) rule; risk (or clinical) prediction model

Prognostic model research: Uses

– Clinical practice: communication with patients/relatives; risk stratification
– Design and analysis of clinical trials
– Case mix adjustment


Prognostic model research: Major steps

– Development: identification and combination of variables associated with outcome
– External validation: evaluate the model's predictive ability in a different population
– Impact: evaluate the impact of the use of the prognostic model on health outcomes

Published Prediction Models

– 111 models for prostate cancer (Shariat 2008)
– 102 models for traumatic brain injury (Perel 2006)
– 83 models for stroke (Counsell 2001)
– 54 models for breast cancer (Altman 2009)
– 43 models for type 2 diabetes (Collins 2011; van Dieren 2012); 20+ more models have since been published!
– 31 models for osteoporotic fracture (Steurer 2011); omitted FRAX due to insufficient information
– 29 models in reproductive medicine (Leushuis 2009)
– 26 models for hospital readmission (Kansagara 2011)
– >25 models for length of stay after cardiac surgery (Ettema 2010)
– 13 models for tooth decay (Ritter 2010)

Very few of these models have been ‘validated’ in new data and compared

Prediction Models in UK Clinical Guidelines

– Framingham Risk Score & QRISK2 (NICE CG67): 10-year CVD risk
– Nottingham Prognostic Index (NICE CG80): recurrence & survival in breast cancer patients
– FRAX & QFracture (NICE CG146): 10-year osteoporotic and hip fracture risk
– GRACE/PURSUIT/PREDICT/TIMI (NICE CG94): adverse CV outcomes in patients with UA/NSTEMI
– APGAR (NICE CG132/2): newborn prognosis
– SAPS & APACHE (NICE CG50): ICU scoring systems
– Leicester Diabetes Risk Score, QDSCORE, Cambridge Risk Score (NICE PH38): type 2 diabetes

Model development

– Select important candidate predictors, trying to avoid selection based on statistically significant univariable associations with outcome
– Appropriately handle (acknowledge) missing data
– Fit a multivariable model
– Estimate the predictive performance: calibration and discrimination; quantify any optimism from overfitting, using bootstrapping (avoid randomly splitting a dataset)
– Prediction models should be presented in adequate detail to allow predictions in individuals, either for subsequent validation studies or in clinical practice
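A sketch of the bootstrap approach to quantifying optimism recommended above (Harrell-style optimism correction of the c-index); logistic regression via scikit-learn is an illustrative model choice, not the talk's prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Refit the model on each bootstrap resample, compare its AUC on the
    resample with its AUC on the original data, and subtract the mean
    optimism from the apparent AUC."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(y, LogisticRegression(max_iter=1000)
                             .fit(X, y).predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))    # sample with replacement
        if len(np.unique(y[idx])) < 2:
            continue  # resample must contain both outcome classes
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        boot_auc = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        orig_auc = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(boot_auc - orig_auc)
    return apparent - np.mean(optimism)
```

Unlike a random split, this uses all the data for both fitting and assessment, which is why the slide prefers it for small datasets.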

Why do we need to validate a model?

– Deficiencies in design of prognostic studies
– Deficiencies of standard modelling methods
– Models may not be transportable: over-optimism because of data-dependent analysis choices; variation in ‘case-mix’
– Performance cannot be predicted: need empirical demonstration of model performance
– Usefulness is determined by how well a model works in practice, not by P values
– An important feature of validation is to provide an unbiased estimate of prediction error

Poor reporting … and poor conduct: Reviews of published studies

– Diabetes (Collins et al, BMC Med 2011)
– Cancer (Mallett et al, BMC Med 2010)
– Kidney disease (Collins et al, J Clin Epidemiol 2012)
– General medical journals (Bouwmeester et al, PLoS Med 2012)
– Breast cancer (Altman, Cancer Invest 2009)
– Missing data in prognosis studies (Burton, Br J Cancer 2004)
– and many more…

Conclusions from the systematic reviews: Poor reporting

– Number of events often difficult to identify
– Candidate predictors (and number) inadequately defined: insufficient information to determine events per variable (EPV) in 40% of studies (Mallett 2010; 33% Collins 2011)
– How candidate predictors were selected: unclear in 25% of studies (Bouwmeester 2012)
– How the multivariable model was derived: unclear in 77% of studies (Mallett 2010)
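Events per variable is simple arithmetic, which makes it striking that so many papers omit the numbers needed to compute it. A trivial check, using the EPV >= 10 rule of thumb the later slides cite; the example figures are invented.

```python
def events_per_variable(n_events, n_candidate_predictors):
    """Events per variable (EPV): a common rule of thumb flags
    multivariable models with EPV below about 10 as unreliable."""
    epv = n_events / n_candidate_predictors
    return epv, epv >= 10

print(events_per_variable(n_events=45, n_candidate_predictors=12))
# (3.75, False): too few events to support 12 candidate predictors
```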

Conclusions from the systematic reviews: Poor reporting (continued)

– Missing data rarely mentioned (41% Collins 2010; 45% Collins 2012)
– Missing data is often an exclusion criterion (but often not specified)
– Complete-case analysis usually carried out

Model often not reported in full
– intercept missing for logistic regression
– baseline survival missing for Cox regression models

Ranges of continuous predictors rarely reported

Conclusions from the systematic reviews: Methodological shortcomings

– Small sample size (number of events) [EPV < 10]
– Large number of candidate predictors
– Calibration rarely assessed: 74% not done (Collins); 46% not done (Bouwmeester)
– Dichotomization of all/some continuous predictors: 63% of studies (Collins); 70% of studies (Mallett)
– Previously published models often ignored
– Inadequate validation: reliance on a random split (often of an already small dataset) to validate
– Lack of comparisons of competing models on the same dataset (Siontis et al, BMJ 2012; Collins & Moons, BMJ 2012)

External validation

– Separate dataset (not a random split): different centres (geographical validation); different time period (temporal validation); different case-mix; possibly with different definitions of predictors and outcome
– Ideally conducted by independent researchers

Evaluating model performance

Performance of prediction models is characterised by:
– Calibration: agreement between observed outcomes and predictions. Often ignored; preferably assessed graphically.
– Discrimination: ability to distinguish between patients who do and do not experience the event of interest. Usually reported (c-index).

[Figure: calibration plot for a scoring system for predicting postoperative nausea and vomiting (PONV); van den Bosch et al 2005]
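A minimal sketch of computing both performance measures, calibration (grouped observed vs predicted proportions) and discrimination (c-index), for a set of predicted risks; the data are simulated to be well calibrated by construction.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

# Hypothetical predicted risks and observed outcomes for a validation set
rng = np.random.default_rng(3)
pred = rng.uniform(0.05, 0.6, 500)
obs = rng.binomial(1, pred)          # well calibrated by construction

frac_obs, mean_pred = calibration_curve(obs, pred, n_bins=5)
for p_hat, p_obs in zip(mean_pred, frac_obs):
    print(f"predicted {p_hat:.2f} -> observed {p_obs:.2f}")  # near 45-degree line
print(f"c-index (AUC): {roc_auc_score(obs, pred):.2f}")      # discrimination
```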


Review of published validation studies [Collins et al, 2014]

Reviewed 78 articles that evaluated 120 prediction models in participant data that were not used to develop the model:
– 16% did not report the number of outcome events in the validation dataset
– 54% made no explicit mention of missing data
– 67% did not report evaluating model calibration

Transparent Reporting of multivariable models for Individual Prognosis Or Diagnosis (TRIPOD)

– Consensus-based guidelines for improving the quality of reporting of multivariable prediction modelling studies
– Focus on reporting (but much attention to methodological conduct in the long E&E paper)

Steering group: Gary Collins (Oxford), Karel Moons (UMC Utrecht), Doug Altman (Oxford), Hans Reitsma (UMC Utrecht)

TRIPOD checklist elements (22 items in total)

Title and abstract

Introduction
– Background & objectives

Methods
– Source of data
– Participants
– Outcome
– Predictors
– Sample size
– Missing data
– Statistical analysis methods
– Risk groups
– Development versus validation

Results
– Participants
– Model development
– Model presentation
– Model performance
– Model updating

Discussion
– Limitations
– Interpretation
– Implications

Other information

TRIPOD

– Key minimal information deemed important to report
– Help authors, peer reviewers, editors, readers and potential users
– Educational: providing guidance, cautioning against particular approaches
– Improve evaluating risk of bias (PROBAST) if more information is reported
– Submitted for publication in March 2014

Published prognostic studies

– Poor methods are widely used: exploratory studies presented as if confirmatory
– We need high quality reporting so we can identify and discard bad studies: REMARK, TRIPOD, …
– Other initiatives…

“Across many types of research, accumulating evidence of bias has led to increasing support for greater transparency, especially relating to registration, publication of full protocols, and adherence to reporting guidelines. None of these will solve all the problems, but certainly all will help.”

Assessing risk of bias (QUIPS tool)

QUIPS: 6 bias domains
1. Participation
2. Attrition
3. Prognostic factor measurement
4. Confounding measurement and account
5. Outcome measurement
6. Analysis and reporting

Phases in prediction model research

Development
– Predictor selection, model building
– Internal validation (evaluating optimism): split sample (random) is inefficient and not very useful; cross-validation; bootstrapping

Evaluate performance (external validation)
– Temporal & geographical validation
– Independent validation (i.e. independent investigators)

Impact study
– Does the prognostic model improve patient outcomes?
– Does the prognostic model change clinician behaviour?
– Is the prognostic model cost-effective?

Assessing performance: comparing observations with predictions

Comparison of observed and predicted event rates for groups of patients (calibration)
– Can plot observed proportions of events against predicted probabilities
– Ideally observed and predicted proportions agree over the whole range of probabilities, and the plot shows a 45° line
– Can fit the model: observed mortality = a + b × risk score

Measures that distinguish between patients who do or do not experience the event of interest (discrimination)
– Discrimination is often assessed in a graph or in a table
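The "observed = a + b × risk score" check above is often implemented as a calibration-intercept/slope regression. A sketch on the logit scale (the logistic formulation is a common convention, assumed here rather than taken from the slide); simulated data give a ~ 0 and b ~ 1, the well-calibrated case.

```python
import numpy as np
import statsmodels.api as sm

# Simulated well-calibrated risks: regress outcomes on the linear
# predictor; intercept a near 0 and slope b near 1 indicate good
# calibration, while b < 1 would suggest an overfitted model.
rng = np.random.default_rng(5)
risk = rng.uniform(0.05, 0.6, 400)
y = rng.binomial(1, risk)
logit_risk = np.log(risk / (1 - risk))

fit = sm.Logit(y, sm.add_constant(logit_risk)).fit(disp=False)
a, b = fit.params
print(f"calibration intercept a = {a:.2f}, slope b = {b:.2f}")
```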