Vivien Bonazzi


Biomedical Big Data Initiative (BD2K)
Vivien Bonazzi Ph.D.
Program Director: Computational Biology (NHGRI)
Co-Chair: Software Methods & Systems (BD2K)
Myriad Data Types
Genomic
Other ‘Omic
Imaging
Phenotypic
Exposure
Clinical
Data and Informatics Working Group
acd.od.nih.gov/diwg.htm
What Are the Big Problems to Solve?
1. Locating the data
2. Getting access to the data
3. Extending policies and practices for data sharing
4. Organizing, managing, and processing biomedical Big Data
5. Developing new methods for analyzing biomedical Big Data
6. Training researchers who can use biomedical Big Data effectively
Overarching Strategy and Goals
Two initiatives are being proposed to overcome roadblocks:
• Big Data to Knowledge (BD2K): enable the biomedical research enterprise to maximize the value of biomedical data
• InfrastructurePlus: create an adaptive environment at NIH to sustain world-class biomedical research
Big Data to Knowledge (BD2K): Overview
• Major trans-NIH initiative addressing an NIH imperative and key roadblock
• Aims to be catalytic and synergistic
• Overarching goal: By the end of this decade, enable a quantum leap in the ability of the biomedical research enterprise to maximize the value of the growing volume and complexity of biomedical data
BD2K: Four Programmatic Areas
I. Facilitating Broad Use of Biomedical Big Data
II. Developing and Disseminating Analysis Methods and Software for Biomedical Big Data
III. Enhancing Training for Biomedical Big Data
IV. Establishing Centers of Excellence for Biomedical Big Data
Area 1: Data Sharing & Access
Facilitating usage and sharing of biomedical big data
• New Policies to Encourage Data & Software Sharing
• Index of Research Datasets to Facilitate Data Location & Citation
• Community-based Development of Data & Metadata Standards
A. Policies to Facilitate Data Sharing.
B. Data Catalog: Data Discovery, Citation, Links to Literature.
C. Frameworks for Community-Based Solutions to Developing Data Standards.
D. Enabling Research Use of Clinical Data.
Area 2: Software and Systems Development
Development of analysis methods and software
• Software to Meet Needs of the Biomedical Research Community
• Facilitating Data Analysis: Access to Large-scale Computing
• Dynamic Community Engagement of Users and Developers
A. Grants for software development
B. Software Registry: Making biomedical software findable and citable
C. Cloud computing: Facilitating Data Analysis
D. Dynamic Social Engagement via social media
Area 2: Software and Systems Development
Software Grants
Current and emerging needs for using, managing, and analyzing the larger and more complex data sets inherent to biomedical Big Data
• Compression/Reduction
• Visualization
• Provenance
• Data Wrangling
Area 2: Software and Systems Development
Big Data needs Big Computing
Cloud Computing
• Leveraging the cloud
• Storing and analyzing huge data sets
• Collaborative environment
• Developing appropriate policies for use of controlled access data in the cloud (dbGaP)
• Developing working relationships with major cloud providers
  • AWS, Google, Microsoft (Azure)
HPC
• More exploration with Supercomputing facilities
Area 3: Training
Enhancing computational training
Increase Number of Computationally Skilled Trainees
Strengthen the Quantitative Skills of All Researchers
Enhance NIH Review and Program Oversight
Area 4: Centers
Establishing centers of excellence
Collaborative environments & technologies
Data integration
Analysis & modeling methods
Computer science & statistical approaches
A. Investigator-initiated Centers
B. NIH-specified Centers
Big Data to Knowledge (BD2K)
bd2k.nih.gov
Biomedical Research as Part of the Digital Enterprise
Philip E. Bourne Ph.D.
Associate Director for Data Science
National Institutes of Health
Myriad Data Types
Genomic
Other ‘Omic
Imaging
Phenotypic
Exposure
Clinical
Components of The Academic Digital Enterprise
• Consists of digital assets
  • E.g. datasets, papers, software, lab notes
• Each asset is uniquely identified and has provenance, including access control
  • E.g. publishing simply involves changing the access control
• Digital assets are interoperable across the enterprise
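The idea on this slide can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `DigitalAsset`, `Access`, `record`, and `publish` are hypothetical and do not come from the talk or from any NIH system. The sketch shows an asset that carries a unique identifier, a provenance log, and an access setting, so that "publishing" is nothing more than a change of access control.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class Access(Enum):
    """Hypothetical access levels; a real system would likely have more."""
    PRIVATE = "private"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    """Illustrative digital asset: a dataset, paper, software, or lab notes."""
    kind: str
    title: str
    access: Access = Access.PRIVATE
    # Unique identifier, generated once when the asset is created.
    asset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Provenance kept as a simple ordered log of events.
    provenance: list = field(default_factory=list)

    def record(self, event: str) -> None:
        """Append an event to the provenance log."""
        self.provenance.append(event)

    def publish(self) -> None:
        """Publishing only changes the access control and logs the change."""
        self.access = Access.PUBLIC
        self.record("access changed to public")


notes = DigitalAsset(kind="lab notes", title="Assay run 12")
notes.record("created")
notes.publish()
```

In this sketch the asset itself never moves or changes form when it is published; only its `access` field and provenance log are updated.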
Let’s Break Down the Silos
• New policies and regulations, e.g. data sharing
• Economic drivers
• The promise of shared data
The NIH is Starting to Think About the Digital Enterprise
Big Data to Knowledge (BD2K)
bd2k.nih.gov
This is great, but BD2K is just a start. What will the end product look like?
To get to that end point we have to consider the complete research lifecycle
The Research Life Cycle will Persist
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Tools and Resources Will Continue To Be Developed
Authoring Tools
Lab Notebooks
Data Capture
Analysis Tools
Software
Scholarly Communication
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Those Elements of the Research Life Cycle will Become More Interconnected Around a Common Framework
Authoring Tools
Lab Notebooks
Data Capture
Software
Analysis Tools
Scholarly Communication
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
New/Extended Support Structures Will Emerge
Authoring Tools
Data Capture
Lab Notebooks
Analysis Tools
Scholarly Communication
Software
Visualization
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Commercial & Public Tools
Discipline-Based Metadata Standards
Community Portals
Git-like Resources By Discipline
Training
Institutional Repositories
Commercial Repositories
Data Journals
New Reward Systems
Thank You
Questions?