Transcript Jimeng Sun

Phenotyping from Electronic Health Records
Jimeng Sun
College of Computing
Georgia Tech
[email protected]
More info at sunlab.org
1
My research focuses on health analytics
[Diagram labels: Health Analytic Apps (example: "Heart disease predictor for $5.99"); Visualization; Analytic cloud; Privacy engine; data sources: Clinical data, Social data, Behavior data, Genomic data; audiences: Clinical Researchers, User; Research Challenges]
My focus
 Big data analytics on the cloud
 Data mining and machine learning techniques
 Privacy preserving data sharing
 Visual analytic techniques
2
Outline
 Phenotyping from EHR
 Other work
– PARAMO: Large scale predictive modeling pipeline
– Patient Similarity
3
Phenotyping from Electronic Health Records
[Diagram: EHR (Demographic, Procedure, Diagnosis, Medication, Medical Images, Lab Tests) → Phenotyping → Medical Concepts (phenotypes)]
4
Motivation: Increasing Importance of Electronic Health Records
[Figure label: Explosion in interest]
 EHRs have become an accepted data source for clinical research
 EHR data can enable many more research studies
5
Challenges in Phenotyping from EHR
 Representation
This talk
– How to represent heterogeneous EHR data and phenotypes?
 Speed
– How to construct diverse phenotypes in an unsupervised fashion?
 Intuition
– How to validate and refine the phenotypes?
 Adaptation
– How to adapt phenotypes from one site to another?
6
Constructing Feature Tensor
 A tensor is a generalization of a matrix
– A matrix is a 2nd-order tensor
 Tensors can better capture interactions among concepts
Data element types:
• Binary
• Count (integer)
• Continuous (numeric)
[Figure label: Mode]
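A minimal sketch of how a count-valued feature tensor might be assembled, assuming the input is a list of (patient, diagnosis, medication) co-occurrence records. The toy records and the helper name `build_count_tensor` are hypothetical and only illustrate the idea of one mode per concept type.

```python
import numpy as np

# Hypothetical toy records: one (patient, diagnosis, medication) co-occurrence each.
records = [
    ("p1", "Hypertension", "ACE Inhibitors"),
    ("p1", "Hypertension", "ACE Inhibitors"),
    ("p1", "Hypertension", "Loop Diuretics"),
    ("p2", "Diabetes", "Biguanides"),
]


def build_count_tensor(records):
    """Return a 3rd-order count tensor and the sorted labels of each mode."""
    # One sorted label list per mode: patients, diagnoses, medications.
    modes = [sorted({r[m] for r in records}) for m in range(3)]
    index = [{name: i for i, name in enumerate(names)} for names in modes]
    tensor = np.zeros([len(names) for names in modes], dtype=np.int64)
    for patient, diagnosis, medication in records:
        tensor[index[0][patient], index[1][diagnosis], index[2][medication]] += 1
    return tensor, modes


tensor, (patients, diagnoses, medications) = build_count_tensor(records)
print(tensor.shape)  # one axis (mode) per concept type
```

Each element counts how often a patient had a given diagnosis together with a given medication, which is the kind of interaction a flat patient-by-feature matrix cannot hold directly.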
7
Multiple Tensors
Lab Results
Medication Reconciliation
Diagnosis-Medication
Diagnostic Sources
Vital
Symptoms
8
Phenotyping through Tensor Factorization
[Diagram: tensor ≈ λ1 · (Phenotype 1) + … + λR · (Phenotype R)]
 Each phenotype combines a patients factor, a diagnosis factor, and a medication factor
 λ1 … λR: phenotype importance
 Factor elements sum to 1
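A sketch of what the factorized form computes, assuming a 3-mode tensor and R phenotypes. The function name `cp_reconstruct` and the random factors are illustrative only; they are not taken from the slides.

```python
import numpy as np


def cp_reconstruct(weights, patient_f, diagnosis_f, medication_f):
    """Sum of R rank-one tensors, each scaled by its phenotype weight."""
    return np.einsum("r,ir,jr,kr->ijk", weights, patient_f, diagnosis_f, medication_f)


rng = np.random.default_rng(0)


def random_stochastic(rows, cols):
    """Nonnegative matrix whose columns each sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0, keepdims=True)


R = 2
A = random_stochastic(4, R)  # patients factor
B = random_stochastic(3, R)  # diagnosis factor
C = random_stochastic(5, R)  # medication factor
lam = np.array([6.0, 4.0])   # phenotype importance

approx = cp_reconstruct(lam, A, B, C)
```

Because every factor column sums to 1, the total mass of the reconstruction equals the sum of the weights, which is why each λ can be read as the importance of its phenotype.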
9
Example Phenotype
Candidate Phenotype k (40% of patients)
[Diagram: λk weights a patients factor, a diagnosis factor, and a medication factor]
Diagnosis factor: Hypertension
Medication factor:
• Beta Blockers Cardio-Selective
• Thiazides and Thiazide-Like Diuretics
• HMG CoA Reductase Inhibitors
Phenotyping Process using Tensor Factorization
[Diagram: Count data → Tensor factorization → Phenotype definitions (λ1 + … + λR); New patients' count data → Projection onto the phenotype definitions → Phenotypes matrix]
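The slides do not say how the projection step is computed. One possible sketch, assuming the diagnosis and medication factors are held fixed and each new patient's phenotype memberships are fitted with multiplicative updates for the KL divergence, is shown below. The function name `project_patients` and the toy factors are hypothetical.

```python
import numpy as np


def project_patients(new_tensor, diagnosis_f, medication_f, n_iter=200, eps=1e-12):
    """Fit nonnegative phenotype memberships for new patients with fixed factors."""
    n_patients = new_tensor.shape[0]
    n_phenotypes = diagnosis_f.shape[1]
    # Each row of `basis` is one phenotype's diagnosis-by-medication pattern, flattened.
    basis = np.einsum("jr,kr->rjk", diagnosis_f, medication_f).reshape(n_phenotypes, -1)
    x = new_tensor.reshape(n_patients, -1).astype(float)
    w = np.ones((n_patients, n_phenotypes))
    for _ in range(n_iter):
        ratio = x / (w @ basis + eps)
        # Multiplicative KL update for the memberships only; the basis never changes.
        w *= (ratio @ basis.T) / (basis.sum(axis=1) + eps)
    return w


# Toy phenotype definitions: 2 diagnoses, 3 medications, 2 phenotypes.
B = np.array([[1.0, 0.0], [0.0, 1.0]])
C = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]])
true_w = np.array([[3.0, 0.0], [1.0, 2.0]])
new_patients = np.einsum("ir,jr,kr->ijk", true_w, B, C)

memberships = project_patients(new_patients, B, C)
```

The result has one row per new patient and one column per phenotype, which matches the "Phenotypes matrix" label in the diagram.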
11
CP-APR Model
[Model equation not captured in the transcript. Its annotations: KL divergence for count data; element index; nonnegative combinations; stochastic constraint (elements in each factor sum to 1)]
Chi, E.C. and Kolda, T.G. 2012. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on
Matrix Analysis and Applications. 33, 4 (2012), 1272–1299.
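A sketch of the loss suggested by the slide annotations, assuming the objective is the Poisson form of the KL divergence, i.e. the sum over element indices i of m_i − x_i log m_i, where x is the observed count tensor and m is the model tensor. The function name `poisson_kl_loss` and the small `eps` guard are illustrative choices.

```python
import numpy as np


def poisson_kl_loss(x, m, eps=1e-12):
    """Sum over elements of m_i - x_i * log(m_i); constants in x are dropped."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(m, dtype=float)
    return float(np.sum(m - x * np.log(m + eps)))


counts = np.array([[2.0, 1.0], [1.0, 3.0]])
loss_exact = poisson_kl_loss(counts, counts)        # model equals the data
loss_over = poisson_kl_loss(counts, 2.0 * counts)   # model overestimates
loss_under = poisson_kl_loss(counts, 0.5 * counts)  # model underestimates
```

For each element the term m − x log m is smallest at m = x, so a model that matches the counts scores lower than one that over- or underestimates them.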
12
Constructing the Tensor
 Medication orders from Geisinger dataset
 Diagnosis codes aggregated into HCC codes
 Medications are defined by pharmacy subclass
 Tensor size: 31,816 patients × 169 diagnoses × 471 medications
13
Evaluation of Phenotypes: Classification
 Task: predict patients with heart failure
 Model: logistic regression with ℓ1 regularization
 10 random even splits of the dataset (50% training)
 Features:
1. Baseline using source independence matrix
2. Principal Component Analysis (PCA)
3. Nonnegative Matrix Factorization (NMF)
4. Phenotype Tensor Factorization (PTF)
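The evaluation protocol above can be sketched with scikit-learn, assuming a feature matrix and binary labels are already available. The synthetic data and the helper name `mean_auc` are placeholders and are not the Geisinger data or the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit


def mean_auc(features, labels, n_splits=10, seed=0):
    """Average test AUC of l1-regularized logistic regression over random 50/50 splits."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.5, random_state=seed)
    aucs = []
    for train_idx, test_idx in splitter.split(features):
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        model.fit(features[train_idx], labels[train_idx])
        scores = model.predict_proba(features[test_idx])[:, 1]
        aucs.append(roc_auc_score(labels[test_idx], scores))
    return float(np.mean(aucs))


# Synthetic stand-in data: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

auc = mean_auc(X, y)
```

To compare feature sets as on the slide, the same function would be called once per representation (baseline, PCA, NMF, PTF) with the labels held fixed.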
14
Predictive Performance Effect
A small number of phenotypes outperforms 640 features
[Plot: AUC (axis ticks 0.65 to 0.73) versus number of factors / number of phenotypes (25 to 100). Series: PCA, NMF, PTF, and Baseline]
15
NMF factors are not concise and are harder to interpret
Phenotype 1
Hypertension – Opioid Combinations
Disorders of the Vertebrae and Spinal Discs – Glucocorticosteroids
Disorders of the Vertebrae and Spinal Discs – Stimulant Laxatives
Phenotype 2
Disorders of the Vertebrae and Spinal Discs – Beta Blockers Cardio-Selective
Major Symptoms, Abnormalities – Stimulant Laxatives
Disorders of the Vertebrae and Spinal Discs – Sympathomimetics
Major Symptoms, Abnormalities – Beta Blockers Cardio-Selective
Disorders of the Vertebrae and Spinal Discs – Anticonvulsants - Misc
Major Symptoms, Abnormalities – Sympathomimetics
Disorders of the Vertebrae and Spinal Discs – Central Muscle Relaxants
Disorders of the Vertebrae and Spinal Discs – HMG CoA Reductase Inhibitors
Disorders of the Vertebrae and Spinal Discs – Selective Serotonin Reuptake Inhibitors
Major Symptoms, Abnormalities – Coumarin Anticoagulants
Major Symptoms, Abnormalities – Salicylates
Major Symptoms, Abnormalities – Surfactant Laxatives
Major Symptoms, Abnormalities – Insulin
Disorders of the Vertebrae and Spinal Discs – Surfactant Laxatives
Major Symptoms, Abnormalities – Proton Pump Inhibitors
Disorders of the Vertebrae and Spinal Discs – Proton Pump Inhibitors
Major Symptoms, Abnormalities – Anti-infective Agents - Misc
Disorders of the Vertebrae and Spinal Discs – Cephalosporins – 1st Generation
Major Symptoms, Abnormalities – Vasodilators
Disorders of the Vertebrae and Spinal Discs – Analgesics Other
Disorders of the Vertebrae and Spinal Discs – Non-Barbiturate Hypnotics
Disorders of the Vertebrae and Spinal Discs – Electrolyte Mixtures
Hypertension – Opioid Combinations
Other Gastrointestinal Disorders – Surfactant Laxatives
Other Gastrointestinal Disorders – Insulin
Minor Symptoms, Signs, Findings – Opioid Combinations
Diabetes with No or Unspecified Complications – Insulin
Post-Surgical States/Aftercare/Elective – Opioid Combinations
Specified Heart Arrhythmias – Beta Blockers Cardio-Selective
Post-Surgical States/Aftercare/Elective – Stimulant Laxatives
Iron Deficiency and Other/Unspecified Anemias and Blood Disease – Hematopoietic Growth Factors
Post-Surgical States/Aftercare/Elective – Beta Blockers Cardio-Selective
Urinary Tract Infection – Insulin
Post-Surgical States/Aftercare/Elective – HMG CoA Reductase Inhibitors
Other Endocrine/Metabolic/Nutritional Disorders – Insulin
Post-Surgical States/Aftercare/Elective – Proton Pump Inhibitors
Vascular Disease – Coumarin Anticoagulants
Post-Surgical States/Aftercare/Elective – Opioid Agonists
Post-Surgical States/Aftercare/Elective – Cephalosporins – 1st Generation
Post-Surgical States/Aftercare/Elective – Analgesics Other
Vascular Disease – Insulin
History of Disease – Insulin
Unspecified Renal Failure – Coumarin Anticoagulants
Diabetes with Renal Manifestation – Insulin
Post-Surgical States/Aftercare/Elective – Non-Barbiturate Hypnotics
Other Eye Disorders – Opioid Combinations
Other Eye Disorders – Stimulant Laxatives
Other Eye Disorders – Opioid Agonists
Other Eye Disorders – Cephalosporins – 1st Generation
Other Eye Disorders – Non-Barbiturate Hypnotics
PTF interpretation: Major disease phenotypes can be identified
Phenotype 3: Uncomplicated Diabetes (17.6% of patients)
Diagnoses:
• Diabetes with No or Unspecified Complications
Medications:
• Sulfonylureas
• Biguanides
• Diagnostic Tests
• Insulin Sensitizing Agents
• Diabetic Supplies
• Meglitinide Analogues
• Antidiabetic Combinations
Phenotype 4: Mild Hypertension (31.1% of patients)
Diagnoses:
• Hypertension
Medications:
• ACE Inhibitors
• Thiazides and Thiazide-Like Diuretics
Phenotype 5: Chronic Respiratory Inflammation/Infection (36.7% of patients)
Diagnoses:
• Other Ear, Nose, Throat, and Mouth Disorders
• Viral and Unspecified Pneumonia, Pleurisy
• Significant Ear, Nose, and Throat Disorders
Medications:
• Cough/Cold/Allergy Combinations
• Azithromycin
• Fluoroquinolones
• Sympathomimetics
• Penicillin Combinations
• Antitussives
• Glucocorticosteroids
• Tetracyclines
• Anti-infective Misc. - Combinations
• Clarithromycin
• Cephalosporins - 2nd Generation
• Cephalosporins - 1st Generation
• Expectorants
PTF interpretation: Disease subtypes can be automatically identified
Phenotype 4: Mild Hypertension (31.1% of patients)
Diagnoses:
• Hypertension
Medications:
• ACE Inhibitors
• Thiazides and Thiazide-Like Diuretics
Phenotype 2: Moderate Hypertension (31.5% of patients)
Diagnoses:
• Hypertension
Medications:
• Beta Blockers Cardio-Selective
• Angiotensin II Receptor Antagonists
• Loop Diuretics
• Potassium
• Nitrates
• Alpha-Beta Blockers
• Vasodilators
Phenotype 6: Severe Hypertension (24.3% of patients)
Diagnoses:
• Hypertension
Medications:
• Calcium Channel Blockers
• Antihypertensive Combinations
• Antiadrenergic Antihypertensives
• Potassium Sparing Diuretics
Over 80% of phenotype factors are clinically meaningful
Summary: Phenotyping using Tensor Factorization
[Diagram: tensor ≈ λ1 · (Phenotype 1) + … + λR · (Phenotype R); diagram label: Few diagnosis]
 Nonnegative tensor factorization can be used to learn phenotypes without supervision
 A small number of phenotypes outperforms a large number of features in a prediction task
19
System
PARAMO: PARALLEL PREDICTIVE MODELING PLATFORM
20
Predictive Modeling Pipeline
 There are many different models that need to be built and evaluated
– Different patient cohorts
– Different targets
– Different features
– Different algorithms
– Multiple training and testing splits in cross-validation
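To give a sense of how quickly these choices multiply, the sketch below enumerates one modeling task per combination. All option names and the `ModelingTask` record are hypothetical placeholders, not PARAMO's actual configuration format.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelingTask:
    """One model to build and evaluate (illustrative record, not PARAMO's schema)."""
    cohort: str
    target: str
    feature_set: str
    algorithm: str
    fold: int


# Hypothetical option lists for each dimension named on the slide.
cohorts = ["cohort_a", "cohort_b"]
targets = ["heart_failure_onset"]
feature_sets = ["PCA", "NMF", "PTF"]
algorithms = ["l1_logistic_regression", "random_forest"]
folds = range(10)

tasks = [
    ModelingTask(*combo)
    for combo in product(cohorts, targets, feature_sets, algorithms, folds)
]
print(len(tasks))  # 2 * 1 * 3 * 2 * 10 combinations
```

Even this small grid yields 120 independent models, which is the motivation for running them in parallel.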
21
Running Time vs. Parallelism level
[Plot: runtime in seconds (log scale, 1,000 to 1,000,000) versus number of concurrent tasks (Serial, 10, 20, 40, 80, 120, 160) for the Small, Medium, and Large patient sets. Annotations: 9 days; 3 hours; 72X speed up]
 Patient sets
– Small: 5,000 patients for hypertension control prediction
– Medium: 33K for predicting heart failure onset
– Large: 319K for hypertension diagnosis prediction
 Dependency graph: 1808 nodes and 3610 edges
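A sketch of how tasks in a dependency graph can be run concurrently, using only the Python standard library. It assumes each node is an independent task that may start once its prerequisites finish; the node names and the `run_graph` helper are hypothetical, and this is not PARAMO's scheduler.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_graph(dependencies, task_fn, max_workers=4):
    """Run task_fn(node) for every node, never before its prerequisites finish."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    finished_order = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}
        while sorter.is_active():
            # Submit every task whose prerequisites are all done.
            for node in sorter.get_ready():
                running[pool.submit(task_fn, node)] = node
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any error from the task
                node = running.pop(future)
                finished_order.append(node)
                sorter.done(node)
    return finished_order


# Hypothetical graph: each key lists the tasks it depends on.
dependencies = {
    "cohort": set(),
    "features_a": {"cohort"},
    "features_b": {"cohort"},
    "train": {"features_a", "features_b"},
}

order = run_graph(dependencies, task_fn=lambda node: node)
```

Here the two feature tasks can run at the same time, while the training task waits for both, which is the kind of parallelism a larger graph exposes.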
22
Algorithm
PATIENT SIMILARITY
23
Patient Similarity Problem
[Diagram labels: Doctor; Similarity search; Patient]
24
Summary on Patient Similarity
 To learn a customized distance metric for a target [1]
 Extension 1: Composite distance integration (Comdi) [2]
– How to combine multiple patient similarity measures?
 Extension 2: Interactive metric update (iMet) [3]
– How to update an existing distance measure?
1. Jimeng Sun, Fei Wang, Jianying Hu, Shahram Ebadollahi: Supervised patient similarity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter 14, 16 (2012)
2. Fei Wang, Jimeng Sun, Shahram Ebadollahi: Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011: 59-70
3. Fei Wang, Jimeng Sun, Jianying Hu, Shahram Ebadollahi: iMet: Interactive Metric Learning in Healthcare Applications. SDM 2011: 944-955
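The slides do not give the form of the learned metric. A common choice in metric learning is a generalized Mahalanobis distance, and the sketch below assumes that form to show how a learned metric changes which patients count as similar. The metric matrix, the patient vectors, and both function names are made up for illustration.

```python
import numpy as np


def mahalanobis_distance(x, y, metric):
    """sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ metric @ diff))


def most_similar(query, patients, metric, k=2):
    """Indices of the k patients closest to the query under the given metric."""
    distances = [mahalanobis_distance(query, p, metric) for p in patients]
    return [int(i) for i in np.argsort(distances)[:k]]


# Hypothetical learned metric: feature 0 matters much more than feature 1.
metric = np.diag([4.0, 0.25])
patients = np.array([[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]])
query = [0.0, 0.0]

nearest_learned = most_similar(query, patients, metric, k=2)
nearest_euclid = most_similar(query, patients, np.eye(2), k=2)
```

Under the plain Euclidean metric patient 0 is closest to the query, while under the weighted metric patient 1 is, which is the sense in which the metric is customized for a target.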
26