Daniella Meeker


Herding Ponies: How big data methods facilitate collaborative analytics

Changes in Outcomes Research

New monikers…

- Patient Centered Outcomes Research
- Health Services Research
- Comparative Effectiveness Research
- Safety and Surveillance
Changes in funding agencies

- PCORI – AHRQ
- FDA – CMS
- NIH

Changes in research models

- More multi-site studies
- Larger “center-based” studies
- Greater interest in patient-generated data
- Greater interest in EHR-based data
- Less interest in claims
Collaboration Frameworks from Other Disciplines

Open Science Grid (OSG)
- Physics, nanotechnology, structural biology
- 1.4M CPU-hours/day, >90 sites, >3,000 users, >260 publications in 2010

LIGO
- Physics/astrophysics
- Established practices and metadata standards
- 1 PB of data in the last science run, distributed worldwide

ESGF
- 1.2 PB of climate data delivered to 23,000 users; 600+ publications

Collage – executable papers
- Computer science
“Why hasn’t Outcomes Research adopted collaborative methods used in physics, climate science, and genomics?”
– Everyone in data-driven research
Adapting to Collaborative Science

1. Healthcare data are not collected for research
   - Not standardized
   - Not complete
2. Privacy protection has legal and ethical implications
3. Data is an asset
4. Data sharing is not incentivized or supported by journals, funding agencies, or the business of healthcare
   - Obtaining consent is expensive
   - Data hoarding is rewarded and is the conservative choice
Are Federated Research Networks the Solution?

In federated models, data are not centralized. AHRQ and PCORI have invested heavily in this approach.

5. Each data holder independently assumes responsibility for “data wrangling” and standardization
6. Requires distributed analysis as opposed to traditional central data pooling and analysis
   - If data are simply used to independently estimate one model per site, the value added for causal inference is similar to a meta-analysis
7. Requires greater coordination of governance, standards, software, and policies
8. High barriers to entry – what is the ROI?
Federated Meta-Analysis vs. Distributed Analysis

Meta-analysis
- One independently estimated model for each node in the network
- Not iterative

Distributed analysis
- One jointly estimated model using data from all sites
- Typically iterative
- Leverages the computational power of the entire network

[Figure: two schematics. “Parallel Meta-analysis (Independently Estimated Results)”: a query portal dispatches the analysis program to Data Site 1 (100 patients) and Data Site 2 (50 patients), and each site returns its own independently fitted model (one fit to 100 patients, one fit to 50). “Parallel Distributed Analytics (Jointly Estimated Results)”: a query portal and aggregator iteratively exchanges the analysis program and intermediate statistics with both sites, converging on a single model fit to all 150 patients.]
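To make the contrast concrete, the sketch below fits the same logistic regression both ways on simulated two-site data (100 and 50 patients, as in the figure). It is a minimal plain-Python/NumPy illustration, not code from any production network; the GLORE-style distributed branch ships only each site’s score vector and information matrix per iteration, never patient-level rows.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(n, beta):
    """Simulate one site's patient-level data (these rows never leave the site)."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, rng.binomial(1, p)

def local_score_and_info(X, y, beta):
    """The intermediate statistics a site shares: score vector and information
    matrix of the logistic log-likelihood, evaluated at the current beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)
    info = (X * (p * (1 - p))[:, None]).T @ X
    return score, info

true_beta = np.array([-1.0, 0.8])
sites = [simulate_site(100, true_beta), simulate_site(50, true_beta)]

# Meta-analysis: each site fits its own model; the coordinating center pools
# the independent estimates by inverse-variance weighting.
fits = []
for X, y in sites:
    b = np.zeros(2)
    for _ in range(25):                       # per-site Newton-Raphson
        score, info = local_score_and_info(X, y, b)
        b = b + np.linalg.solve(info, score)
    fits.append((b, local_score_and_info(X, y, b)[1]))   # (estimate, information)
pooled = np.linalg.solve(sum(I for _, I in fits),
                         sum(I @ b for b, I in fits))

# Distributed analysis: one joint Newton-Raphson in which each iteration
# aggregates only the per-site scores and information matrices.
beta = np.zeros(2)
for _ in range(25):
    stats = [local_score_and_info(X, y, beta) for X, y in sites]
    beta = beta + np.linalg.solve(sum(I for _, I in stats),
                                  sum(s for s, _ in stats))

print("meta-analysis pooled estimate:", pooled)
print("jointly estimated model:      ", beta)
```

Because the logistic score and information matrix are sums over patients, summing the per-site pieces reproduces the centrally pooled computation exactly, so the joint estimate matches what a central analysis of all 150 patients would return. The inverse-variance-pooled meta-analytic estimate only approximates it, and the approximation degrades as sites get small or covariate distributions diverge across sites.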
What does this have to do with “big data”?

Two (of 8) barriers to collaborative data science solved with “big data” methods:

- Privacy protection has legal and ethical implications
- If data are simply used to independently estimate one model per site, the value added for causal inference is similar to a meta-analysis

Bonus: specialized software or hardware, like SAS and CMS repositories, can be replaced with parallelized systems.
Parallel Evolution of Distributed Computing and Federated Research Networks

[Timeline, 1993–2013, with two parallel tracks.]

Distributed computing track: cluster computing; “The Grid” published; the statistical query model introduced; peer-to-peer networks (Napster); first MapReduce paper from Google; Amazon EC2; RHadoop (Revolution R for Hadoop); MADlib in-database analytics; Apache Spark project launched.

Federated research network track: HMO Research Network adopts a standard model; caGrid; AHRQ Distributed Research Network projects launched; FDA Mini-Sentinel launched; GLORE published; PCORnet launched (2013).
“Big Data” Analytics vs. Outcomes Research Analytics

| Feature | “Big Data” in Distributed Environments | Outcomes Research in Federated Research Networks |
| --- | --- | --- |
| Analysis questions | Patterns, predictions, classification | Causal inference, predictions, hypothesis testing |
| Data distribution | Data can be randomly distributed across processors by a master | Data are non-randomly anchored to sites |
| Number of nodes on the network | 100s or more | 10s |
| Data governance constraints between network nodes | Typically none or low | Typically very high |
| Data set size | Very large | Relatively small |
| Query distribution platforms | Apache Spark, Hadoop MapReduce, Apache Pig | SHRINE, PopMedNet, TRIAD |
| Common analytic platforms | Revolution R/RHadoop, Apache Mahout, Spark Machine Learning Library, Spark GraphX | R, SAS, Stata |
| Size of developer community | 1000s | Dozens |
“Big Data” Methods Are Incidentally “Privacy Preserving”

| Feature | Clinical Research Rationale | “Big Data” Rationale |
| --- | --- | --- |
| Federation in the form of multiple networked nodes or processing cores | Multiple independently operating data partners | Inefficient to rely on a single very powerful processor or specialized hardware |
| Distributed computation across networked nodes (instead of central pooling of data) | Transferring patient-level data incurs re-identification risks | Inefficient to transfer large data sets across the network |
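The second row of this table is easy to see in miniature. The sketch below (plain Python; the site names, record fields, and helper functions are invented for illustration) follows the map-reduce pattern: each node maps its patient-level rows down to small counts, and only those aggregates ever cross the network.

```python
from functools import reduce

# Patient-level records held at each node (never transmitted).
node_records = {
    "site_1": [{"exposed": True,  "event": True},
               {"exposed": True,  "event": False},
               {"exposed": False, "event": False}],
    "site_2": [{"exposed": False, "event": True},
               {"exposed": True,  "event": True}],
}

def local_map(records):
    """Runs inside a node: reduces patient rows to counts."""
    return {"n": len(records), "events": sum(r["event"] for r in records)}

def merge(a, b):
    """Runs at the coordinating center, on aggregates only."""
    return {"n": a["n"] + b["n"], "events": a["events"] + b["events"]}

# Only these small dictionaries cross the network.
aggregates = [local_map(rows) for rows in node_records.values()]
total = reduce(merge, aggregates)
print("pooled event rate:", total["events"] / total["n"])
```

A real statistical query layer adds safeguards this sketch omits, such as minimum cell sizes, but the privacy-relevant property is the same: the reduce step sees aggregates, not patients.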
Distributed Computing Frameworks

Grid computing architectures
- Statistical query oracle
- Mostly an academic effort

Hadoop
- Based on Google’s MapReduce model
- Hundreds of developers
- 591 active projects and organizations

Apache Spark
- Berkeley computer science’s answer to Hadoop
- Most rapidly growing user base
- 99 active projects and organizations
Collaboration Frameworks in Outcomes Research

- SHRINE, for i2b2
- PopMedNet, for Mini-Sentinel and PCORnet
- TRIAD, for caGrid and the SAFTINet DRN
What distributed methods in the standard biostats toolbox are already supported in “big data” vs. clinical frameworks?

| Algorithm/Method | Apache Spark Libraries | MapReduce Multicore or RHadoop | Federated Clinical Research Networks |
| --- | --- | --- | --- |
| Linear regression (weighted) | X | X | X |
| Logistic regression | X | X | X |
| Cox proportional hazards | | | X |
| Naïve Bayes | X | X | |
| Gaussian discriminative analysis | X | X | |
| Neural network backpropagation | | X | |
| Matrix factorization | X | X | |
| PCA | * | X | |
| ICA | * | X | |
| Support vector machine | X | X | |
| Generalized linear models | | | |
| K-means | X | X | |
| Expectation maximization | | X | |
| Random forest classifier | X | X | |
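As a usage sketch for the left-hand column, fitting one of the supported methods, logistic regression, with Spark’s machine-learning library looks roughly like this (DataFrame-based pyspark.ml API; the toy rows are invented, and details vary across Spark versions):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

# Toy (label, features) rows; in practice this would be a large DataFrame
# partitioned across the cluster.
df = spark.createDataFrame(
    [(1.0, Vectors.dense([2.0, 1.1])),
     (0.0, Vectors.dense([0.5, 0.3])),
     (1.0, Vectors.dense([1.7, 0.9])),
     (0.0, Vectors.dense([0.2, 1.4]))],
    ["label", "features"])

model = LogisticRegression(maxIter=10, regParam=0.01).fit(df)
print(model.coefficients, model.intercept)
spark.stop()
```

Spark parallelizes the fit across however many partitions hold the data, which is the “leverages the computational power of the entire network” property from the distributed-analysis slide.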
No Longer a Technical Challenge

We have the tools we need to overcome privacy and liability concerns. Now we “only” need to change culture.
Moving Collaborative Outcomes Science Forward

Policies (aka incentives)
- Payer-driven incentives for better data hygiene and standardization
- Payer incentives for sharing
- Funding agency incentives for collaborative data management vs. data hoarding
- Journal incentives
- HIPAA clarification

Infrastructure
- As a community, adopt existing easy-to-use, flexible platforms for sharing code and data
- Link clinical data and patient device infrastructure to research infrastructure

Culture
- Clinician demand
- Patient demand
- Tenure and promotion transformation
- Replace “not invented here” syndrome with collective credit and shared efficiencies