Transcript Document

Building Infrastructure for Data Management
25 April 2014
Larry Lannom
Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
Three Part Talk
• Organizing for Infrastructure: RDA
• Building Infrastructure: Data Type Registries
• Using Infrastructure: Deep Carbon Observatory,
Handles/DOIs
The Information Age – Extraordinary Potential for
Driving Science and Bettering Society
• More Efficient Physical Infrastructure
• Contribution to a safer and more secure world
• Transformative strategies for disease treatment and well-being
• More goods and services
• More Research Insights
Key Driver 1: Data Sharing Accelerating
Discovery and Innovation
Data Sharing is a Global Issue
• Science, Humanities, Arts Communities
• Libraries, Archives, Repositories, Museums
• Cyberinfrastructure professionals, data analysts, data center staff, …
• Data Scientists
Key Driver 2: Community Effort Accelerating
Impact
“Just do it” -- Focused efforts help
communities drive tangible progress
• Creation / adoption of data sharing policies have accelerated research innovation
• Development of public access shared data collection enabling new results for Alzheimer’s
• Development and adoption of shared parallel communication protocols through the MPI Forum drove a generation of advances
• Now 25 years old, the Internet Engineering Task Force’s mission “to make the Internet work better” has resulted in key specifications of Internet common community standards that support innovation
MPI Forum photo by Erez Heba,
PDB molecule of the month at
http://www.rcsb.org/pdb/home/home.do
Enabling Technologies
[Diagram: scientists, data curators, end users, and applications working with identified (ID) datasets]
Enabling Technologies
[Diagram: scientists, data curators, end users, and applications; identified (ID) datasets accessed via repositories]
Enabling Technologies
[Diagram: scientists, data curators, end users, and applications; identified (ID) datasets accessed via repositories; enabling technology shown: Discovery]
Discovery & Evaluation
• Search
– Metadata registries
  • Subject
  • Parties
  • Dates
  • Etc.
– Crawlers – more ad hoc
• Citation
– Formats
• Permissions
– Can I see it?
– Can I use it?
• Trust
Enabling Technologies
[Diagram: scientists, data curators, end users, and applications; identified (ID) datasets accessed via repositories; enabling technologies shown: Discovery, Access]
Access
• ID / reference resolution
• Access Protocols
– How to get it
– Protocol registries
– Bootstrapping into new protocols
• Authentication & Authorization
– Proof of identity (tradeoff: usability vs security)
– Permissions: with the object or in some external system?
Enabling Technologies
[Diagram: scientists, data curators, end users, and applications; identified (ID) datasets accessed via repositories; enabling technologies shown: Discovery, Access, Interpretation]
Interpretation
• Registries
– Schemas
– Vocabularies
– Formats
– Available services
– Useful client-side tools
• Trust
– Who did this?
– Who owns this?
• Provenance
– Data Source
– Processing steps
– Computing environment
• what is needed to trust the numbers?
• Domain specific?
Enabling Technologies
[Diagram: scientists, data curators, end users, and applications; identified (ID) datasets accessed via repositories; enabling technologies shown: Discovery, Access, Interpretation, Reuse]
Reuse
• Everything from Interpretation slide + Permissions
– Example: I need to understand a data set for peer review but that doesn’t give me permission to use the data
• Validation
• Education & Training
– Integrate ‘live’ data into education and training
• Repurpose data
The Research Data Alliance (RDA)
• Global community-driven organization launched in March 2013 to accelerate data-driven innovation
• RDA focus is on building the social, organizational and technical infrastructure to
  – reduce barriers to data sharing and exchange
  – accelerate the development of coordinated global data infrastructure
RDA Vision and Mission
• Research Data Alliance Vision: Researchers and innovators openly share data across technologies, disciplines, and countries to address the grand challenges of society.
• Research Data Alliance Mission: RDA builds the social and technical bridges that enable data sharing.
Goal of RDA Infrastructure: Support Data Sharing and
Interoperability Across Cultures, Scales, Technologies
• Common data types for data interoperability
• Persistent identifiers
• Domain-focused portals
• Harmonized standards
• Data access and preservation policy and practice
• Tools for data discoverability, …
[Diagram labels: Policy and Practice; CREATE, ADOPT, USE]
RDA Members come together as
• Working Groups – 12-18 month efforts to build, adopt, and use specific pieces of infrastructure
• Interest Groups – longer-lived discussion forums that spawn Working Groups as specific pieces of needed infrastructure are identified.
• Working Group efforts focus on the development and use of data sharing infrastructure
  – Code, policy, infrastructure, standards, or best practices that are adopted and used by communities to enable data sharing
  – “Harvestable” efforts for which 12-18 months of work can eliminate a roadblock
  – Efforts that have substantive applicability to groups within the data community, but may not apply to everyone
  – Efforts for which working scientists and researchers can start today
RDA Plenaries: Venue for community building
and WG / IG progress
• RDA Plenary 1 / Launch
  – March 2013 in Gothenburg, Sweden
  – 240 participants
  – 3 WG, 9 IG
• RDA Plenary 2
  – September 2013 in Washington, DC
  – 380 participants
  – 6 WG, 17 IG, 5 BOF
• RDA Plenary 3
  – March 2014 in Dublin, Ireland
  – 497 participants
  – 12 WG, 22 IG, 14 BOF
  – 6 co-located events
• RDA Plenary 4
  – Sept 2014 in Amsterdam
[Photo captions: Plenary 1, Plenary 2, Plenary 3]
Fran Berman
RDA Plenaries Emerging as a Data Community
“Town Square”
Emerging Plenary Format:
• All-hands sessions: Place for community networking and exchange of information (funding agencies, data organizations, key stakeholders)
• Working sessions: Face-to-face opportunities for global Interest Groups, Working Groups, and BOFs to meet and advance their agendas
• Neutral meeting place: Place for multiple groups to meet and form a common agenda and action plan (e.g. Plenary 2 Data Citation Harmonization Summit)
Precipitous Growth
[Timeline figure]
• March 2013: RDA Launch / First Plenary. 240 participants; first Working Groups and Interest Groups
• September 2013: RDA Second Plenary. 380 participants from 22 countries; first BOFs; first “neutral space” community meeting (Data Citation Summit)
• March 2014: RDA Third Plenary. 497 participants; 14 BOF, 12 Working Groups, 22 Interest Groups; 6 co-located events
• September 2014: RDA Fourth Plenary, Amsterdam
• Also marked on the timeline: First Org. Assembly; First Org. Partner Meet-up
RDA Community Evolving Rapidly: Over 1500
members from 70+ countries (as of 3/15/14)
[Map of membership by region; surviving labels: Australpacific 4%, Africa 2%, Asia 4%, South America 1%]
Map courtesy traveltip.org
RDA Interest (IG) and Working Groups (WG) effectively
doubling each Plenary (Groups as of 1/14)
Domain Science - focused
• Toxicogenomics Interoperability IG
• Structural Biology IG
• Biodiversity Data Integration IG
• Agricultural Data Interoperability IG
• Digital History and Ethnography IG
• Defining Urban Data Exchange for Science IG
• Marine Data Harmonization IG
• Materials Data Management IG
Reference and Sharing focused
• Data Citation IG
• Data Categories and Codes WG
• Legal Interoperability IG
Community Needs focused
• Community Capability Model IG
• Engagement IG
• Clouds in Developing Countries IG
Data Stewardship - focused
• Research Data Provenance IG
• Certification of Digital Repositories IG
• Preservation e-infrastructure
• Long-tail of Research Data IG
• Publishing Data IG
• Domain Repositories IG
• Global Registry of Trusted Data Repositories and Services IG
Base Infrastructure - focused
• Data Foundations and Terminology WG
• Metadata Standards WG
• Practical Policy WG
• PID Information Types WG
• Data Type Registries WG
• Metadata IG
• Big Data Analytics IG
• Data Brokering IG
RDA Organizational Framework nearly at Steady State
• RDA Council: Responsible for overarching mission, vision, impact of RDA
• RDA Membership
• Technical Advisory Board: Responsible for Technical roadmap and interactions
• Secretary-General and Secretariat: Responsible for administration and operations
• Organizational Advisory Board and Organizational Assembly: Responsible for organizational and strategic advice
• Working Groups: Responsible for impactful, outcome-oriented efforts
• Interest Groups: Responsible for defining and refining common issues
• RDA Colloquium (Research Funders): Operational and community sponsorship
Coming in Fall: First RDA Infrastructure Deliverables
Scheduled to Complete Summer 2014
Data Type Registries WG
• Deliverables: System of data type registries, formal model for describing types, working model of a registry.
• Initial Adopters and Users: CNRI, International DOI Foundation, Deep Carbon Observatory
Practical Code Policies
• Deliverables: Survey of policies in production use, testbed of machine actionable policies, deployment of 5 policy sets, policy starter kits
• Initial Adopters and Users: RENCI, DataNet Federation Consortium, CESNET, Odum Institute
Persistent Identifier Information Types
• Deliverables: Minimal set of PID types, API
• Initial Adopters and Users: Data Conservancy, DKRZ
Scheduled to Complete Fall 2014
Language Codes
• Deliverables: Operationalization of ISO language categories for repositories.
• Initial Adopters and Users: Language Archive, Paradisec
Data Foundations and Terminology
• Deliverables: Common vocabulary for data terms, formal definitions and open registry for data terms
• Initial Adopters and Users: EUDAT, DKRZ, Deep Carbon Observatory, CLARIN, EPOS
Metadata Standards
• Deliverables: Use cases and prototype directory of current metadata standards starting from DCC directory
• Initial Adopters and Users: JISC, DataOne
RDA Medium Term (3-5 year) Goals
• Create a pipeline of data sharing infrastructure efforts
  – that are adopted and used by communities during their development
  – that increase their impact through greater adoption over time
• Build and expand the research data community for effective impact
  – globally, regionally, and within constituent groups
• Evolve as a useful, relevant, and agile organization
  – that helps the community capitalize on opportunity and respond to challenges within the data community
RDA as an Accelerant of Existing Projects
• This is already the case
• RDA is helping expand the impact of at least two Sloan-funded projects.
  – CNRI Interoperability Platform
    o Type Registry
    o LEI Prototype
  – Deep Carbon Observatory (DCO)
    o Data science infrastructure (RPI)
• DCO now working with CNRI in the context of the RDA Data Type Registries Working Group
What are Data Types?
• Characterize data structures at multiple levels of granularity
– Serve as macro or shortcut for understanding and processing data
• File formats & mime types are examples of solved problems at
the container level but don’t solve finer grained interpretation
– It’s a number in cell A3 but what does it mean
• Other structures with more limited use, e.g., many sci. data
sets, may need multiple levels of typing
• Data types enable humans and machines to discover, process,
and reason about data
Data Type Registries
• Each type registered with unique identifier
• Common data model and expression
• Associate with services, tools, format registries, etc.
• Common API for machine consumption
RDA Data Type Registries WG
• Goal: Interoperable set of Type Registries
• Approved as RDA WG at Plenary 1
• Co-chairs
  – Larry Lannom – CNRI
  – Daan Broeder – Max Planck Institute for Psycholinguistics
• Membership
  – 44 participants
  – U.S., UK, Netherlands, Germany, Italy, Australia, Finland, Canada, Kenya, Japan
  – Various scientific fields, Practitioners, Librarians, Publishers
• Schedule
  – 3/2013 – 9/2013: gather use cases, begin design, including data model
  – 10/2013 – 12/2013: refine model, begin prototyping
  – 1/2014 – 5/2014: finalize data model & functional specs, deploy functional registry for Handle types, release turnkey registry
DTR Use Cases
• Broad Functional Classification
  – Repos hold widely varying levels of data & metadata
  – High-level functional classification of the identified object needed to make sense of what is available, e.g., data object, metadata, repo description, contact info, etc.
• Simple License Information via PID Resolution
  – Data set access conditions cannot be predicted based on ID
  – For DataCite DOIs, a handle/type/value triple could be used to provide access information, probably through a level of indirection, resulting in a pop-up or intervening page or open linked data
• Object Types as a Short-cut for Dependent Services to Match Processing Requirements to Data Objects
  – Using data acquisition as an example
    o Determine object type you are trying to build
    o Consult registry to index into an ontology to dynamically define required and optional properties
    o Does the input data have what is needed?
• Registration of PID Types (in ID/Type/Value triples) for Data Processing and Interpretation
  – Distinguish pointers to objects from pointers to metadata from pointers to services
  – Enable complex client interactions as opposed to simple one-to-one re-direction
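The data acquisition use case can be illustrated with a small Python sketch. It is not the working group's design: the in-memory `REGISTRY`, the `TypeDefinition` class, the type identifier, and the property names are all hypothetical stand-ins for whatever a real registry and ontology would supply.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TypeDefinition:
    """Hypothetical registry entry: the properties a typed object needs."""
    type_id: str
    required: frozenset
    optional: frozenset = field(default_factory=frozenset)


# Hypothetical in-memory registry keyed by type identifier.
REGISTRY = {
    "example/weather-observation": TypeDefinition(
        type_id="example/weather-observation",
        required=frozenset({"time", "location", "temperature"}),
        optional=frozenset({"humidity"}),
    ),
}


def check_input(type_id, record):
    """Return (missing required properties, properties unknown to the type)."""
    definition = REGISTRY[type_id]
    present = set(record)
    missing = definition.required - present
    unknown = present - definition.required - definition.optional
    return sorted(missing), sorted(unknown)


# Does the input data have what is needed to build the object?
missing, unknown = check_input(
    "example/weather-observation",
    {"time": "2014-04-25T09:00:00Z", "temperature": 12.5, "operator": "jd"},
)
```

Here the input lacks a required property and carries one the type does not know about, so an acquisition service could reject or flag it before building the object.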
Discovery Use Case
[Diagram: Users, a Federated Set of Type Registries, and Repositories and Metadata Registries holding ID / Type / Payload records; the numbered arrows 1-4 correspond to the steps below]
1 Clients (process or people) look for types that match their criteria for data, e.g., types that combine location, temperature, and date-time stamp.
2 Type Registry returns matching types.
3 Clients look up in repositories and metadata registries for data sets matching those types.
4 Appropriate typed data is returned.
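The four discovery steps can be sketched in Python as follows. Both dictionaries are hypothetical stand-ins for a type registry and a repository index, and all identifiers are invented for illustration; a real deployment would query federated services instead.

```python
# Hypothetical type registry: type identifier -> property names it combines.
TYPE_REGISTRY = {
    "example/type-weather": {"location", "temperature", "date-time"},
    "example/type-seismic": {"location", "magnitude", "date-time"},
}

# Hypothetical repository index: dataset identifier -> type identifier.
REPOSITORY = {
    "example/dataset-1": "example/type-weather",
    "example/dataset-2": "example/type-seismic",
    "example/dataset-3": "example/type-weather",
}


def find_types(wanted):
    """Steps 1-2: return identifiers of types carrying every wanted property."""
    return sorted(t for t, props in TYPE_REGISTRY.items() if wanted <= props)


def find_datasets(type_ids):
    """Steps 3-4: return identifiers of datasets registered with those types."""
    wanted = set(type_ids)
    return sorted(d for d, t in REPOSITORY.items() if t in wanted)


matching_types = find_types({"location", "temperature", "date-time"})
matching_datasets = find_datasets(matching_types)
```

The point of the sketch is the two-stage lookup: the client never searches the data directly, only the types, and then uses the matching type identifiers as the query against repositories.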
Process Use Case
[Diagram: Users, a Federated Set of Type Registries, Typed Data (ID / Type / Payload records), and Services (Visualization, Rights, Data Set, Data Processing, Dissemination), with a "Terms:… I Agree" prompt; the numbered arrows 1-4 correspond to the steps below]
1 Client (process or people) encounters unknown type.
2 Resolved to Type Registry.
3 Response includes type definitions, relationships, properties, and possibly service pointers. Response can be
used locally for processing, or, optionally
4 Typed data or reference to typed data can be sent to service provider.
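A minimal Python sketch of the process steps, under stated assumptions: the registry record layout (`description`, `properties`, `services`), the type identifier, and the service URL are hypothetical, and the registry is a plain dictionary rather than a network service.

```python
# Hypothetical registry records; field names are illustrative only.
TYPE_REGISTRY = {
    "example/type-grid": {
        "description": "Gridded temperature field",
        "properties": ["location", "temperature"],
        "services": ["https://service.example.org/visualize"],
    },
}

# Types this client already knows how to process locally.
LOCAL_HANDLERS = {}


def resolve_type(type_id):
    """Steps 2-3: ask the registry for the record of an unknown type."""
    return TYPE_REGISTRY.get(type_id)


def plan_processing(type_id):
    """Decide how a client could proceed after encountering a type."""
    if type_id in LOCAL_HANDLERS:
        return ("local", type_id)
    record = resolve_type(type_id)
    if record is None:
        return ("unresolved", type_id)
    if record["services"]:
        # Step 4: hand the typed data, or a reference to it, to a service.
        return ("remote", record["services"][0])
    # No service pointer: the definition can still guide local processing.
    return ("describe", record["description"])


plan = plan_processing("example/type-grid")
```

The order of preference shown here (local handler, then registered service, then description only) is one possible policy, not something the slides prescribe.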
Deep Carbon Observatory
Data Science and
Data Management Infrastructure
Overview
• Global research program to transform our understanding of
carbon in Earth
• Community of scientists --- biologists, physicists,
geoscientists, chemists, and many others --- whose work
crosses these disciplinary lines, forging a new, integrative
field of deep carbon science
• 10-year initiative to intensify global attention and scientific
effort in the burgeoning field of deep carbon science
• DCO infrastructure includes: public engagement and
education, online and offline community support, innovative
data management, and novel instrumentation
deepcarbon.net
• Alfred P. Sloan Foundation pledged $50 million over the
duration to fund: infrastructure development, scientific
workshops, novel technology development, and preliminary
research and fieldwork.
• “Seed funding” awarded to catalyze collaborative scientific
efforts around the world, increase public and private sector
spending in deep carbon science, and leave a thriving
community of international scientists as its legacy.
• DCO will synthesize 10 years of scientific research to
generate unique and unprecedented views of Earth, looking
at both scientific and human societal issues through a new,
sharper lens.
DCO-Data Science World View:
Everything is a first-class (science) object
Entry point for DCO object
registration and deposit
RDA Brings Together DCO & DTR
• Benefits to DTR
  – DCO brought the data acquisition use case – no one else thought of it
  – DCO as early adopter will benefit testing and use of RDA result
• Benefits to DCO
  – Needed facility specified and prototyped with DCO use case in mind
  – Turn-key DTR will be available to DCO
  – DCO data science approaches and accomplishments presented to wide multidisciplinary audience
• Benefits to Sloan
  – Two funded projects each augmented through interaction in RDA
Types and the Handle System
• Typing makes sense of data, which is just bits
• Handles resolve to type/value pairs – all other functions
reside in the applications
• Handles identify digital entities which are implicitly or
explicitly typed
• So – to develop Handle-based applications
– Must understand the types of returned values
– Will at some point need to understand the downstream
data identified by handles
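To make the type/value idea concrete, here is a small Python sketch. It does not call the Handle System; `HANDLE_RECORDS` is a hypothetical in-memory stand-in for a resolution result, and the handle, the `example/license` type, and the URLs are invented for illustration.

```python
from typing import NamedTuple


class HandleValue(NamedTuple):
    """One type/value pair returned when a handle is resolved."""
    type: str
    value: str


# Hypothetical resolution results, keyed by handle.
HANDLE_RECORDS = {
    "example/1234": [
        HandleValue("URL", "https://repository.example.org/objects/1234"),
        HandleValue("example/license", "https://repository.example.org/licenses/open"),
    ],
}


def values_of_type(handle, wanted_type):
    """Return the values of one type; the application decides what they mean."""
    return [v.value for v in HANDLE_RECORDS.get(handle, []) if v.type == wanted_type]


locations = values_of_type("example/1234", "URL")
licenses = values_of_type("example/1234", "example/license")
```

Resolution only hands back the pairs. Whether the application follows the location, shows the license, or does something else is decided by how it understands each type, which is where a type registry would help.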
Example DTR Use Cases
• Broad Functional Classification
  – Repos hold widely varying levels of data & metadata
  – High-level functional classification of the identified object needed to make sense of what is available, e.g., data object, metadata, repo description, contact info, etc.
• Simple License Information via PID Resolution
  – Data set access conditions cannot be predicted based on ID
  – For DataCite DOIs, a handle/type/value triple could be used to provide access information, probably through a level of indirection, resulting in a pop-up or intervening page or open linked data
• Object Types as a Short-cut for Dependent Services to Match Processing Requirements to Data Objects
  – Using data acquisition as an example
    o Determine object type you are trying to build
    o Consult registry to index into an ontology to dynamically define required and optional properties
    o Does the input data have what is needed?
• Registration of PID Types (in ID/Type/Value triples) for Data Processing and Interpretation
  – Distinguish pointers to objects from pointers to metadata from pointers to services
  – Enable complex client interactions as opposed to simple one-to-one re-direction
What do Data Type Records contain?
• Data type records contain
– textual description for human understanding
– provenance information (who created when and what)
• Records could contain
– structured metadata about types for machines to process
– encoding information (think file formats)
– service information (think APIs to systems or applications that can process typed data)
– semantic information (think description or predicate logic, useful for reasoning)
• Records do not enforce or define new ways to describe or represent data
structures, but rely on existing frameworks and technologies
– File formats (mime types), etc., may be used for describing encoding information
– WSDL, REST APIs, etc., may be used for describing service information
– OWL, KIF, etc., may be used for representing semantics and knowledge
Proposed Data Type Data Model
Element | Cardinality (min, max) | Notes
ID | (1,1) | A unique, persistent identifier. Assigned by a type registry
Human Description | (1,*) | Description in English mandatory. Descriptions in other languages as needed
Provenance | (1,1) | Who created it, when, etc.
Properties | (0,*) | Properties that describe data. Aka predicates. For example, a weather dataset contains time, location, and temperature properties
Encoding Information | (0,*) | File-formats (mime-types), etc.
Semantic Information | (0,*) | OWL, KIF, etc.
Service Information | (0,*) | WSDL, WADL, APIs, etc.
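One way to read the proposed data model is as a record with three mandatory elements and four optional, repeatable ones. The Python sketch below encodes that reading. The class name, field names, language-code keys, and example values are assumptions made for illustration, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataTypeRecord:
    """One registry record, following the element list in the data model."""
    id: str                                    # (1,1) assigned by a type registry
    human_description: Dict[str, str]          # (1,*) language code -> text
    provenance: str                            # (1,1) who created it, when, etc.
    properties: List[str] = field(default_factory=list)            # (0,*)
    encoding_information: List[str] = field(default_factory=list)  # (0,*)
    semantic_information: List[str] = field(default_factory=list)  # (0,*)
    service_information: List[str] = field(default_factory=list)   # (0,*)

    def problems(self) -> List[str]:
        """List violations of the minimum cardinalities."""
        found = []
        if not self.id:
            found.append("ID is required")
        if not self.human_description.get("en"):
            found.append("an English description is required")
        if not self.provenance:
            found.append("provenance is required")
        return found


weather = DataTypeRecord(
    id="example/weather-observation",
    human_description={"en": "A timed, located temperature reading"},
    provenance="created by an example user on 2014-04-25",
    properties=["time", "location", "temperature"],
    encoding_information=["text/csv"],
)
incomplete = DataTypeRecord(id="example/untitled", human_description={}, provenance="")
```

The optional elements default to empty lists, which matches a minimum cardinality of zero; only the three (1,…) elements are checked.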
Proposed Use of Data Types
• Multiple type registries will be deployed; perhaps one per community
• Type registries federate across each other; local policies may restrict (the
scope of) such federation
• Users register data structures within a type registry and acquire a unique,
persistent identifier (data type)
• Data type identifiers are then associated with corresponding data
• Registered type records are additionally disseminated by type registries as
Linked Data compatible outputs
• General Guidelines
– Users decide what data structures to register or not. If a data structure is
expected to play a global role, then users are encouraged to register that data
structure
– Users are encouraged to first search if the data structure is registered prior to
registering to avoid duplicates
– Users decide the encoding, service, and semantic technology or framework
that best suits them
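The registration guidelines (search first, register only if nothing is found, federate across registries) can be sketched in Python. Everything here is a simplifying assumption: the `TypeRegistry` class is hypothetical, duplicates are detected by exact description match, federation is a list of peer objects, and identifiers are random UUIDs prefixed with the registry name.

```python
import uuid


class TypeRegistry:
    """Hypothetical in-memory registry that can consult federated peers."""

    def __init__(self, name, peers=None):
        self.name = name
        self.records = {}               # type identifier -> description
        self.peers = list(peers or [])  # registries this one federates with

    def search(self, description):
        """Look locally, then in peers, for an already registered structure."""
        for registry in [self, *self.peers]:
            for type_id, text in registry.records.items():
                if text == description:
                    return type_id
        return None

    def register(self, description):
        """Return the existing identifier if found, else mint a new one."""
        existing = self.search(description)
        if existing is not None:
            return existing
        type_id = f"{self.name}/{uuid.uuid4()}"
        self.records[type_id] = description
        return type_id


geo = TypeRegistry("geo")
bio = TypeRegistry("bio", peers=[geo])

first = geo.register("time, location, temperature")
reused = bio.register("time, location, temperature")  # found via federation
fresh = bio.register("species, count, location")
```

A local policy that restricts the scope of federation would correspond to leaving a registry out of `peers`.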