Distributed, Modular Grid Software for Management and
Exploration of Data in Patient-Centric Healthcare IT
Andrew Hart
NASA Jet Propulsion Laboratory
David Kale
Whittier VPICU, Children’s Hospital LA
Heather Kincaid
NASA Jet Propulsion Laboratory
Agenda
 Health Care Data Challenges for Large-scale Research
 Intro to Object Oriented Data Technology (OODT)
 Applications of OODT in distributed scientific data systems
- NASA’s Planetary Data System
- NCI’s Early Detection Research Network
- Whittier Virtual Pediatric Intensive Care Unit (VPICU)
 OODT as Open Source
 Learning More & Keeping in Touch
Health care research
 Increasingly collaborative
 Increasingly geographically distributed
 Scale, Complexity, Cost drive cooperation
 Opportunities for discovery emerge through larger data sets
 Growing need for technology to support “virtual
organizations” carrying out distributed scientific research
OODT – What Is It?
“A data grid software infrastructure for constructing large-scale,
distributed data-intensive systems”
 Reference Architecture
 Software Product Line
 Reusable Components
 Common Patterns
[Figure: Object Oriented Data Technology Framework. Recoverable
labels: OODT/Science Web Tools, Archive Client, Navigation Service,
Archive Service, Profile Service, Product Service, Query Service,
Bridge to External Services, Other Service 1, Other Service 2,
Profile XML Data, Data System 1, Data System 2]
A Brief History of OODT
 Funded out of NASA’s Office of Space Science in 1998
 Funded to address critical software engineering challenges
affecting the design of mission science data systems
 Designed, implemented, and refined over the past 7 years
across multiple scientific domains:
- Planetary Science,
- Earth Science,
- Cancer Research,
- Space Physics,
- Modeling and Simulation,
- Pediatric Intensive Care
 Runner-up for NASA Software of the Year in 2003
Principles behind OODT
 Division of Labor
Avoid making any one component the workhorse; keep components configurable
 Technology Independence
Guard against unexpected changes in the technology landscape
 Metadata as a first-class citizen
Descriptions of resources come in handy
 Separation of software and data models
Allow each to evolve independently
 Modular, domain-agnostic
Pick and choose from adaptable components with defined
interfaces
OODT Core Framework Services
[Figure: Object Oriented Data Technology Framework. Recoverable
labels: OODT/Science Web Tools, Archive Client, Navigation Service,
Archive Service, Profile Service, Product Service, Query Service,
Bridge to External Services, Other Service 1, Other Service 2,
Profile XML Data, Data System 1, Data System 2]
 Archive Service
Ingest data + metadata, processing algorithms, workflow support
 Profile Service
Deliver metadata from an underlying data store
 Product Service
Deliver data from an underlying data store
 Query Service
Manage sets of profile servers
 Data Grid Service
Interfaces and tools for connecting distributed resources over the web
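The division of labor between these services can be pictured with a minimal Python sketch. This is not the OODT API: the class names `ProfileServer`, `ProductServer`, and `QueryService`, their methods, and the in-memory stores are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileServer:
    """Hypothetical stand-in: delivers metadata from one underlying store."""
    name: str
    profiles: dict = field(default_factory=dict)  # resource id -> metadata dict

    def query(self, key, value):
        # Return the ids of resources whose metadata has key == value.
        return [rid for rid, meta in self.profiles.items()
                if meta.get(key) == value]


@dataclass
class ProductServer:
    """Hypothetical stand-in: delivers the data itself from one store."""
    name: str
    products: dict = field(default_factory=dict)  # resource id -> bytes

    def retrieve(self, rid):
        return self.products[rid]


class QueryService:
    """Hypothetical stand-in: manages a set of profile servers."""

    def __init__(self, profile_servers):
        self.profile_servers = list(profile_servers)

    def query(self, key, value):
        # Fan the query out and record which server answered for each id.
        return {(server.name, rid)
                for server in self.profile_servers
                for rid in server.query(key, value)}


# Two independent "data systems", each with its own metadata.
ps1 = ProfileServer("system-1", {"a": {"target": "Mars"},
                                 "b": {"target": "Saturn"}})
ps2 = ProfileServer("system-2", {"c": {"target": "Mars"}})
products = ProductServer("system-1", {"a": b"image bytes"})

hits = QueryService([ps1, ps2]).query("target", "Mars")
data = products.retrieve("a")
```

Keeping metadata lookup and data delivery in separate components means a client can discover what exists across systems before it moves any data.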
Applications of OODT: PDS
 Planetary Data System
 National Aeronautics and Space Administration
 http://pds.nasa.gov
NASA Planetary Data System
 Official NASA archive for all planetary data
 9 Nodes with data located at discipline sites
 All missions must add their data (required as part of mission
Announcement of Opportunity)
 Prior to October 2002, no ability to find and share data between
PDS nodes
Planetary Data System: Distributed Planetary Science Archive
- Rings Node: Ames Research Center, Moffett Field, CA
- Geosciences Node: Washington University, St. Louis, MO
- Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ
- THEMIS Data Node: Arizona State University, Tempe, AZ
- Central Node: Jet Propulsion Laboratory, Pasadena, CA
- Planetary Plasma Interactions Node: University of California
Los Angeles, Los Angeles, CA
- Navigation Ancillary Information Node: Jet Propulsion Laboratory,
Pasadena, CA
- Atmospheres Node: New Mexico State University, Las Cruces, NM
- Small Bodies Node: University of Maryland, College Park, MD
PDS Data Key Challenges
Challenges to building a science data system for the PDS:
 NASA often flies unique, one of a kind missions
 A static infrastructure won’t work: Nodes and models change
 Data stored at PDS nodes differs dramatically in structure
 Missions are required to share science data results with the
research community
PDS Data Architecture
 Distributed data system environment with federated governance
Each site maintains its own database and infrastructure
 Common domain information model (regularly updated) used to
drive system implementations
Ontology and Common Data Elements (based on ISO/IEC 11179)
 Common query interface to distributed services
implemented with OODT Query Handlers
 Software services that wrap existing data systems to share data
Implemented with OODT Product & Profile servers
 Publishing of data products to a common portal
Implemented using Resource Description Framework (RDF)
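As a rough illustration of publishing product metadata as RDF, the sketch below serializes a flat metadata record as N-Triples. The slides do not say which RDF serialization or vocabulary PDS uses, so the N-Triples choice, the `http://example.org/` URIs, and the function names are assumptions made for this example only.

```python
def escape_literal(text):
    # N-Triples string literals escape backslashes, double quotes,
    # and line breaks.
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(subject_uri, metadata, vocab="http://example.org/vocab#"):
    """Serialize a flat metadata dict as N-Triples, one triple per line.

    The vocabulary URI is a placeholder, not a real PDS namespace.
    """
    lines = []
    for key in sorted(metadata):
        literal = escape_literal(str(metadata[key]))
        lines.append(f'<{subject_uri}> <{vocab}{key}> "{literal}" .')
    return "\n".join(lines)


# Hypothetical product description using made-up element names.
product = {"target": "Mars", "node": "Imaging"}
triples = to_ntriples("http://example.org/product/1", product)
print(triples)
```

Each line is a self-contained statement about the product, so a portal can merge output from many nodes by simple concatenation.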
PDS Architecture Decomposition
Applications of OODT: EDRN
 Early Detection Research Network
- Division of Cancer Prevention, National Cancer Institute
- http://cancer.gov/edrn
EDRN Overview
 Focus: investigator-initiated, collaborative
research on molecular, genetic and other biomarkers for
cancer detection and risk assessment.
 Funded since 2000 by the Division of Cancer Prevention in
the National Cancer Institute (NCI)
 40+ geographically
distributed centers
performing parallel,
complementary studies
 Strong emphasis on the
role of informatics
EDRN Participants
 Biomarker Development Laboratories
Responsible for the development and characterization of new
biomarkers or the refinement of existing biomarkers.
 Biomarker Reference Laboratories
Serve as a Network resource for clinical and laboratory
validation of biomarkers, which includes technological
development, quality control, refinement, and high throughput.
 Clinical Epidemiology and Validation Centers
Conduct clinical and epidemiological research regarding the
clinical application of biomarkers.
 Data Management and Coordinating Center
Coordinate EDRN research activities, provide logistic support,
conduct statistical and computational research for data analysis,
and analyze data for validation.
OODT and EDRN
 OODT’s success led to interagency agreements with both
NIH and NCI, resulting in:
 EDRN Informatics Center
Support EDRN's efforts through the development of
software systems for information management. Located at
NASA Jet Propulsion Laboratory, Pasadena, CA.
- Principal Investigator: Dan Crichton, JPL.
EDRN Data
 EDRN collects, generates, analyzes, and stores a wide variety of
different data, including:
- Specimen Inventories
Map specimens collected (blood, sputum, etc.) to patient
characteristics
- Studies and Publications
Information about studies conducted in the EDRN as well as
published results (publications, outputs)
- Biomarkers
Information about indicators of early disease
- Science Data
Outputs of experiments on specimens, regarding biomarkers,
driven by particular studies and protocols
EDRN Data Flow
 Moving beyond the local laboratory
 Scalability, interoperability
Case Study: ERNE
 ERNE: EDRN Resource Network Exchange
 Challenge: Overcome differences in local schema to develop a
national distributed specimen information infrastructure
 All sites running different software and following their own procedures
 Rely on a common information
model for distributed querying,
and provide site-specific
mappings at each participant
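The idea of one common information model with site-specific mappings can be sketched as follows. The element names, local field names, coded values, and records are invented for illustration and are not the actual ERNE common data elements or any site's schema.

```python
# Two hypothetical sites that store the same facts under different
# local field names and value codes.
SITE_A = {
    "fields": {"specimen_type": "spec_typ", "storage": "stor"},
    "values": {"specimen_type": {"blood": "BLD", "sputum": "SPT"}},
    "records": [{"spec_typ": "BLD", "stor": "frozen"},
                {"spec_typ": "SPT", "stor": "frozen"}],
}
SITE_B = {
    "fields": {"specimen_type": "SampleKind", "storage": "Storage"},
    "values": {"specimen_type": {"blood": 1, "sputum": 2}},
    "records": [{"SampleKind": 1, "Storage": "frozen"},
                {"SampleKind": 1, "Storage": "fixed"}],
}


def translate(site, common_query):
    """Rewrite a query in common terms into the site's local terms."""
    local = {}
    for element, value in common_query.items():
        local_field = site["fields"][element]
        # Fall back to the common value when the site has no special code.
        local_value = site["values"].get(element, {}).get(value, value)
        local[local_field] = local_value
    return local


def count_matches(site, common_query):
    local_query = translate(site, common_query)
    return sum(all(rec.get(f) == v for f, v in local_query.items())
               for rec in site["records"])


def federated_count(sites, common_query):
    # Each site answers in its own terms; only counts leave the site.
    return {name: count_matches(site, common_query)
            for name, site in sites.items()}


sites = {"site_a": SITE_A, "site_b": SITE_B}
blood = federated_count(sites, {"specimen_type": "blood"})
frozen_blood = federated_count(
    sites, {"specimen_type": "blood", "storage": "frozen"})
```

The query author only needs the common model; each mapping lives with the site that understands its own schema.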
ERNE Architecture
Connecting Research
 Designing the EDRN informatics architecture as a collection of
well-defined components via OODT has simplified the process of
building interfaces to non-EDRN systems
 Wrappers can be built to link non-EDRN systems
 Translators can be developed to deal with different semantic
architectures
 caBIG
- ERNE/caTissue Wrapper
 EDRN-Canary Collaboration
- A cloud computing effort that shares raw science data via
Amazon S3 between EDRN and the Canary group which uses
software from GenoLogics Life Sciences
EDRN Knowledge Environment
 Building a Semantic Bioinformatics Grid for the EDRN
Lessons From EDRN
 Architecture and a vision have been critical
- Technology hasn’t been as critical
- Keep it simple
 Science support has been critical
- Getting buy-in and participation from domain experts is key
 Incremental development and deployment
- Starting with a few sites was very helpful in understanding the issues
- We had both development sites and observer sites initially
 The IRB process has been a big schedule driver
 Distributed architecture can be a challenge
- Not all sites up to maintaining the implementation
- Loosely coupled architecture with simple interfaces helped
Applications of OODT: VPICU
 Whittier Virtual Pediatric Intensive Care Unit
- Children’s Hospital Los Angeles
- http://picu.net
Collaboration among 85 multidisciplinary pediatric intensive
care units across the U.S.
Collaboration with VPICU
 Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care
Unit (VPICU), founded in 1998 by clinicians at CHLA
 Leverage advances in technology to:
- Improve patient care
- Educate practitioners
- Conduct research
- Reduce cost of providing care
VPICU Research Data
Secondary use of observational clinical (EHR, monitor, annotations) data
Ideal Research Data Set
 Manageable size, Static
 Homogeneous
 Complete, standardized
descriptions and annotations
 Available as single unit
 Complete, consistent
 Minimal usage restrictions
Real Health Care Data Set
 Massive, grows continuously
 Heterogeneous formats, types,
etc.
 Incomplete, proprietary
descriptions
 Fragmented across stores,
organizational boundaries
 Incomplete, inconsistent
 Highly restricted (legal,
privacy, ethical considerations)
VPICU Project Areas
 Data extraction and management
Take data from proprietary stores, make it accessible
 Transformation of data into knowledge
Process (and re-process) the data to extract insight
 Data-driven decision support
Develop tools that learn continuously from the data
 Distributed data-sharing over a national network
Enable research on scales previously impossible while maintaining
security, privacy, compliance
Principles behind VPICU
 Decouple from (proprietary) vendor databases
 Integrate disparate data sources into a single model
 Dynamically (re)generate research database(s)
- we don’t know for sure what queries will be most useful at the
outset
 Provide web services for multi-faceted access to the data to
enable discovery & analysis
 Support federation among multiple PICU sites
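The principle of dynamically (re)generating research databases can be illustrated with a small sketch in which a research database is a disposable artifact rebuilt from normalized records whenever the question changes. The table layout, record fields, and the use of in-memory SQLite are assumptions for this example, not the VPICU implementation.

```python
import sqlite3


def build_research_db(normalized_rows, kinds):
    """Build a fresh in-memory database holding only the requested kinds."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observation "
        "(patient TEXT, time TEXT, kind TEXT, value REAL)")
    rows = [(r["patient"], r["time"], r["kind"], r["value"])
            for r in normalized_rows if r["kind"] in kinds]
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical normalized records, already decoupled from vendor schemas.
normalized = [
    {"patient": "s1", "time": "t0", "kind": "heart_rate", "value": 131},
    {"patient": "s1", "time": "t1", "kind": "weight_kg", "value": 12.5},
    {"patient": "s2", "time": "t0", "kind": "heart_rate", "value": 98},
]

# One database per research question; rebuilding is cheap.
hr_db = build_research_db(normalized, {"heart_rate"})
all_db = build_research_db(normalized, {"heart_rate", "weight_kg"})

mean_hr = hr_db.execute("SELECT AVG(value) FROM observation").fetchone()[0]
total = all_db.execute("SELECT COUNT(*) FROM observation").fetchone()[0]
```

Because the normalized records remain the source of truth, a research database that turns out to be poorly shaped for a query can simply be dropped and regenerated.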
“Algorithm” for VPICU Data System
1. Develop a common Domain Ontology to describe the information
space
2. Develop compute services that support extraction of data from
existing databases
3. Identify mechanisms to integrate information objects from disparate
repositories and map them to the common domain ontology
4. Construct a set of online research databases to enable data mining
and analysis
5. Deploy a “data grid” infrastructure of hardware & software to
facilitate utilization of the data environment at CHLA and beyond
(external entities and applications)
6. Deploy a set of compute services to support data mining and
analysis
7. Develop an architectural plan and roadmap for scaling and
integrating other PICUs
VPICU Architecture
[Figure: VPICU architecture diagram, including file-based storage]
VPICU Architecture
[Figure: proprietary data sources. Recoverable labels: EHR,
Homegrown, Clinical apps, Monitor data, File-based storage]
 Original data sources/stores at backend
 Proprietary schema
 Hardware that we don’t “own” or control
 Production systems (very load-sensitive)
 Legacy technologies (sometimes)
 Unreliable (can’t guarantee always available)
 Includes:
- Hospital-wide commercial EHR system(s)
- Homegrown critical care database
- Specialized clinical applications
- Raw bedside monitor data
VPICU Architecture
 Regular extraction of new data
 VPICU-controlled resources
(Our hardware and software)
 Transform to VPICU schema
 Link data belonging to same patient
 May contain PHI
Must be highly secure
 Data at this stage is normalized,
stored in a format suitable for
ingestion into any number of
research databases
VPICU-owned resources
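A minimal sketch of the transform-and-link step at this stage might look like the following. The source row layouts, the common record fields, and the use of a shared medical record number as the linking key are assumptions for illustration; real EHR and monitor exports will differ, and real linkage is usually harder than a key match.

```python
from collections import defaultdict

# Hypothetical rows as extracted from two different source systems.
ehr_rows = [
    {"mrn": "A1", "charted_at": "2010-01-01T08:00",
     "item": "weight_kg", "val": 12.5},
    {"mrn": "B2", "charted_at": "2010-01-01T09:00",
     "item": "weight_kg", "val": 20.0},
]
monitor_rows = [
    {"patient_mrn": "A1", "ts": "2010-01-01T07:59",
     "channel": "heart_rate", "reading": 131},
]


def from_ehr(row):
    # Map the EHR layout onto the common (hypothetical) schema.
    return {"patient": row["mrn"], "time": row["charted_at"],
            "kind": row["item"], "value": row["val"], "source": "ehr"}


def from_monitor(row):
    # Map the monitor layout onto the same common schema.
    return {"patient": row["patient_mrn"], "time": row["ts"],
            "kind": row["channel"], "value": row["reading"],
            "source": "monitor"}


def link_by_patient(records):
    """Group normalized records per patient, ordered by timestamp."""
    linked = defaultdict(list)
    for rec in records:
        linked[rec["patient"]].append(rec)
    for recs in linked.values():
        # ISO 8601 strings in one format sort chronologically as text.
        recs.sort(key=lambda r: r["time"])
    return dict(linked)


normalized = ([from_ehr(r) for r in ehr_rows]
              + [from_monitor(r) for r in monitor_rows])
timeline = link_by_patient(normalized)
```

The output still carries the real patient identifier, which is why this stage may contain PHI and has to stay on secured, VPICU-controlled resources.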
VPICU Architecture
 Research databases
 Application-specific
 Optimized
 Contain de-identified or
anonymized data
 VPICU ontology, schema
 Access via configurable
web services
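To make the de-identification step concrete, here is a deliberately simplified sketch that replaces the patient identifier with a keyed pseudonym and replaces absolute timestamps with hours since admission. The slides do not describe VPICU's actual method; the field names, the secret key, and the HMAC-based approach are assumptions, and this alone would not satisfy any particular privacy regulation.

```python
import hashlib
import hmac
from datetime import datetime


def pseudonym(patient_id, secret_key):
    # A keyed hash gives a stable pseudonym that cannot be recomputed
    # without the key, which stays outside the research database.
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record, admitted_at, secret_key):
    """Return a copy of the record without direct identifiers or dates."""
    observed = datetime.fromisoformat(record["time"])
    admitted = datetime.fromisoformat(admitted_at)
    hours = (observed - admitted).total_seconds() / 3600.0
    return {
        "subject": pseudonym(record["patient"], secret_key),
        "hours_since_admission": hours,
        "kind": record["kind"],
        "value": record["value"],
    }


# Hypothetical key and record, for illustration only.
key = b"hypothetical-secret-kept-outside-the-research-db"
row = {"patient": "A1", "time": "2010-01-01T08:30",
       "kind": "heart_rate", "value": 131}
clean = deidentify(row, "2010-01-01T06:30", key)
```

Using the same key for every record keeps one patient's observations linkable inside the research database without exposing who the patient is.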
What are “research databases?”
 Designed for specific research questions, analytical techniques
 Need not always be relational or databases at all
 Available via web interfaces and software services
Researcher using R can connect directly through R bindings
 Examples:
 Relational database for traditional retrospective studies
 Search engine over free text clinical notes, etc.
 Patient/patient comparison, retrieval (find patient like this one)
 Data-backed patient simulator for “testing” interventions
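The "find patient like this one" example can be sketched as a nearest-neighbor lookup over per-patient feature vectors. The features, the cohort, and the choice of Euclidean distance are assumptions for illustration; a real system would need carefully chosen and scaled features.

```python
import math


def most_similar(query, cohort, k=1):
    """Return the ids of the k patients closest to the query vector."""
    # Euclidean distance assumes the features are on comparable scales.
    ranked = sorted(cohort, key=lambda pid: math.dist(query, cohort[pid]))
    return ranked[:k]


# Hypothetical, already de-identified and scaled feature vectors.
cohort = {
    "p1": (1.0, 2.0),
    "p2": (4.0, 6.0),
    "p3": (1.5, 2.5),
}
nearest = most_similar((1.2, 2.1), cohort, k=2)
```

A store built for this kind of retrieval need not be relational at all, which is the point of treating research databases as application-specific.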
OODT and the VPICU Data System
1. Develop an Information Model (Ontology) to describe the domain
2. Develop compute services that support extraction of data from
existing CHLA databases (OODT Query Handlers)
3. Identify mechanisms to integrate information objects from disparate
repositories and map them to the common domain ontology (OODT
CAS crawler, catalog services)
4. Construct a set of online research databases to enable data mining
and analysis (OODT Catalog and Archive Services)
5. Deploy a “data grid” infrastructure of hardware & software to
facilitate utilization of the data environment at CHLA and beyond
(external entities and applications) (OODT Data Grid Services)
6. Deploy a set of compute services to support data mining and
analysis
7. Develop an architectural plan and roadmap for scaling and
integrating other PICUs
OODT as Open Source
 Jan 2010: OODT Accepted as a podling in the Apache Software
Foundation (ASF) Incubator
 First NASA software licensed and incubating within the ASF
 Learn more and track our progress at:
- http://incubator.apache.org/projects/oodt.html
 Join the mailing list:
- [email protected]
 Chat on IRC:
- #oodt on irc.freenode.net
Acknowledgements
 Jet Propulsion Laboratory: Dan Crichton, Chris Mattmann,
Sean Kelly, Steve Hughes, Amy Braverman, Thuy Tran
 National Cancer Institute: Sudhir Srivastava, Christos Patriotis,
Don Johnsey
 Fred Hutchinson Cancer Research Center: Mark Thornquist,
Ziding Feng, Jackie Dalhgren, Suzanna Reid
 Children’s Hospital Los Angeles:
Randall Wetzel, Robinder Khemani,
Paul Vee, Jeff Terry, Robert Kaptan,
Doug Hallam