CRI-09: Cross-Institutional Systems to
Support Phenotyping in Biomedical Research:
Experiences from the eMERGE Network
Luke Rasmussen
Marshfield Clinic
David Carrell, PhD
Group Health Research Institute
William Thompson, PhD
Northwestern University
Hua Xu, PhD
Vanderbilt University
Jyoti Pathak, PhD
Mayo Clinic
AMIA CRI Summit 2011
eMERGE Consortium
• Principal sponsor: NHGRI, with additional funding from NIGMS
• NIH-funded consortium (CTSA awardee institutions)
• DNA Biobanks linked to EHR data
• Consortium members
– Group Health of Puget Sound
– Marshfield Clinic
– Mayo Clinic
– Northwestern University
– Vanderbilt University
[Figure: consortium sites labeled with the coordinating center and primary phenotypes: Dementia, Cataracts, Type II diabetes, Peripheral vascular disease, QRS duration]
Marshfield Clinic
Biobank Population
• Geographically defined cohort
• Stable population
• Minimal selection bias
• Over 95% of medical events captured in EMR
Data
• All levels of inpatient and outpatient care
• 5 decades of retrospective clinical data
• Prospective and continuous data collection via EHR
• Events, testing, treatment, and outcomes represented
• High utilization of primary care to classify controls
• Clinical, financial, and environmental data
Health Events
eMERGE Contributors
• NHGRI
– Rongling Li
– Heather Junkins
– Teri Manolio
– Jim Ostell
• Group Health
– Eric Larson
– Gail Jarvik
– Chris Carlson
– Wylie Burke
– Gene Jart
– David Carrell
– Malia Fullerton
– Walter Kukull
– Paul Crane
– Noah Weston
• Northwestern
– Rex Chisholm
– Bill Lowe
– Phil Greenland
– Wendy Wolf
– Maureen Smith
– Geoff Hayes
– Pedro Avila
– Joel Humowiecki
– Jen Allen-Pacheco
– Amy Lemke
– Will Thompson
• Marshfield
– Cathy McCarty
– Peggy Peissig
– Luke Rasmussen
– Marilyn Ritchie
– Justin Starren
– Russ Wilke
– Dick Berg
– Jim Linneman
• Mayo Clinic
– Christopher G. Chute
– Iftikhar J. Kullo
– Barbara Koenig
– Suzette Bielinski
– Mariza de Andrade
• Vanderbilt
– Dan Roden
– Dan Masys
– Josh Denny
– Brad Malin
– Ellen Wright Clayton
– Dana Crawford
– Jonathan Haines
– Jonathan Schildcrout
– Jill Pulley
– Melissa Basford
– Marilyn Ritchie
RFA HG-07-005:
Genome-Wide Studies in Biorepositories with
Electronic Medical Record Data
• 2007 NIH Request for Applications from the
National Human Genome Research Institute
“The purpose of this funding opportunity is to provide
support for investigative groups affiliated with existing
biorepositories to develop necessary methods and
procedures for, and then to perform, if feasible,
genome-wide studies in participants with phenotypes
and environmental exposures derived from electronic
medical records, with the aim of widespread sharing of
the resulting individual genotype-phenotype data to
accelerate the discovery of genes related to complex
diseases.” (Emphasis added)
Development and Growth
Idea
Develop
• Pre-existing and new systems/methods
• Applied to common (yet different) tasks
• Different locations/environments
Disseminate
Issues
More Ideas
Tools and Methods
Presenter (site): Topic
• Luke Rasmussen (Marshfield Clinic): Reusable phenotype algorithms
– Techniques to facilitate future reuse of phenotype algorithms.
• David Carrell (Group Health): Clinical Text Explorer Search Interface
– Facilitates exploration of EHR for rapid phenotyping and algorithm refinement.
• William Thompson (Northwestern University): clinical Text Analysis and Knowledge Extraction System (cTAKES)
– Natural language processing (NLP) system utilized for multiple phenotypes, including PAD.
• Hua Xu (Vanderbilt University): MedEx
– NLP system utilized within eMERGE with additional applications to pharmacogenomic research.
• Jyoti Pathak (Mayo Clinic): eleMAP
– Facilitates harmonization and standardization of phenotype variables across sites.
Reusable Phenotype Algorithms
Luke Rasmussen
Senior Programmer/Analyst
Marshfield Clinic Research Foundation
Biomedical Informatics Research Center
Phenotype Development
• Multi-disciplinary teams
• Multiple sites
• Iterative
• Intangible → Tangible
EMR-based Phenotype Algorithms
• Typical components
– Billing and diagnosis codes
– Procedure codes
– Labs
– Medications
– Phenotype-specific covariates (e.g., Demographics, Vitals, Smoking Status, CASI scores)
– Pathology
– Imaging?
• Organized into inclusion and exclusion criteria
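A minimal sketch of how components like these might be organized into inclusion and exclusion criteria. Everything here is an assumption for illustration: the `PatientRecord` fields, the placeholder codes (`DX_TARGET`, `MED_TARGET`, `LAB_TARGET`, `DX_OVERLAPPING`), and the arbitrary lab threshold are hypothetical and do not come from any eMERGE algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EMR data."""
    diagnosis_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    labs: dict = field(default_factory=dict)  # lab name -> list of numeric results


# Arbitrary placeholder threshold, not a clinical value.
LAB_THRESHOLD = 7.0

# Every inclusion rule must hold for a patient to qualify as a case.
INCLUSION = [
    # Component: billing/diagnosis codes.
    lambda p: "DX_TARGET" in p.diagnosis_codes,
    # Components: medications OR labs as supporting evidence.
    lambda p: "MED_TARGET" in p.medications
    or any(v >= LAB_THRESHOLD for v in p.labs.get("LAB_TARGET", [])),
]

# Any exclusion rule that holds disqualifies the patient.
EXCLUSION = [
    lambda p: "DX_OVERLAPPING" in p.diagnosis_codes,
]


def is_case(patient: PatientRecord) -> bool:
    """Case = all inclusion criteria met and no exclusion criterion met."""
    included = all(rule(patient) for rule in INCLUSION)
    excluded = any(rule(patient) for rule in EXCLUSION)
    return included and not excluded
```

Keeping the rules in plain lists means a criterion can be added, removed, or tightened between review rounds without touching `is_case`.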
EMR-based Phenotype Algorithms
• Iteratively refine case definitions through partial manual review to achieve ~PPV ≥ 95%
• For controls, exclude all potentially overlapping syndromes and possible matches; iteratively refine such that ~NPV ≥ 98%
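The PPV and NPV targets can be checked after each round of manual review. The sketch below uses the standard definitions, PPV = TP / (TP + FP) and NPV = TN / (TN + FN), with the 95% and 98% targets from the slide as defaults. The function names and the shape of the review data are assumptions, not part of any eMERGE tooling.

```python
def ppv_npv(reviews):
    """Compute (PPV, NPV) from manually reviewed charts.

    reviews: iterable of (algorithm_says_case, reviewer_says_case) booleans.
    Patients the algorithm selected as controls are passed with
    algorithm_says_case=False. Returns None for a metric whose
    denominator is zero.
    """
    tp = fp = tn = fn = 0
    for predicted, actual in reviews:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv


def meets_targets(ppv, npv, ppv_target=0.95, npv_target=0.98):
    """True when both metrics are available and reach their targets."""
    if ppv is None or npv is None:
        return False
    return ppv >= ppv_target and npv >= npv_target


# Illustrative review sample (made-up counts, not eMERGE results):
# 20 algorithm-selected cases, 19 confirmed; 50 controls, 49 confirmed.
sample = (
    [(True, True)] * 19 + [(True, False)] * 1
    + [(False, False)] * 49 + [(False, True)] * 1
)
ppv, npv = ppv_npv(sample)
print(meets_targets(ppv, npv))
```

If a round falls short of either target, the criteria are revised and a fresh sample is reviewed, which is the iterative loop the bullets describe.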
Primary Phenotypes
Site                      Phenotype       Validation (PPV/NPV)
Group Health              Dementia        73% / 92%
Marshfield Clinic         Cataracts       98% / 98%
Marshfield Clinic         Low HDL         82% / 96%
Mayo Clinic               PAD             94% / 99%
Northwestern University   Type 2 DM       98% / 100%
Vanderbilt University     QRS Duration    97% / 100%
Supplemental Phenotypes
Site                      Phenotype
Group Health              WBC
Marshfield Clinic         Diabetic Retinopathy
Mayo Clinic               RBC
Northwestern University   Lipids / Height
Vanderbilt University     PheWAS
Validation (PPV/NPV), values in the order listed: *, 80% / 98%, 92% / 100%, 95% / 100%, *, 98% / 94%
* Not available at this time
Phenotype Reuse
• T2DM → Diabetic Retinopathy
– Identification of DM
– T2DM algorithm included T1DM identification (for exclusion)
• Low HDL → Lipids
Phenotype Reuse
[Figure: Diabetic Retinopathy and T2DM algorithms]
Iterative Refinement for Reuse
[Figure: two algorithms (Condition - Subtype A, Condition - Subtype B) refined into a shared Condition component plus separate Subtype A and Subtype B components]
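One way to read the Condition / Subtype A / Subtype B breakdown is as a refactoring: the step that identifies the parent condition is pulled out once, and each subtype algorithm builds on it. The sketch below is only an illustration of that idea. The patient dictionary layout and the marker codes are hypothetical, and having Subtype B exclude Subtype A is an assumption that mirrors the T2DM algorithm identifying T1DM for exclusion.

```python
def has_condition(patient):
    """Shared, reusable step: does the record show the parent condition?"""
    return "CONDITION_DX" in patient["diagnosis_codes"]


def is_subtype_a(patient):
    """Subtype A = parent condition plus a subtype-specific marker."""
    return has_condition(patient) and "SUBTYPE_A_MARKER" in patient["diagnosis_codes"]


def is_subtype_b(patient):
    """Subtype B reuses the same parent step and excludes Subtype A patients."""
    return (
        has_condition(patient)
        and "SUBTYPE_B_MARKER" in patient["diagnosis_codes"]
        and not is_subtype_a(patient)
    )


# A later phenotype can start from the shared step without
# re-deriving the parent condition logic.
def condition_cohort(patients):
    """Return only the patients who have the parent condition."""
    return [p for p in patients if has_condition(p)]
```

Because `has_condition` lives in one place, a fix made during iterative refinement carries over to every algorithm that depends on it.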
Formalizing Reuse
• Identified potential for reuse
• Leverage significant work
• Phenotypes available: www.gwas.org
• Limitations
– Site-specific implementations
Impressions
• Easy to do
• Fits with eMERGE goals
• Can fit retrospectively
• Prospective mindset
Thank You
Luke Rasmussen
[email protected]