SEAD

Sustainable Environment – Actionable Data

Margaret Hedstrom
SEAD PI/Project Director
Professor & Associate Dean, UM School of Information

Robert H. McDonald
SEAD Sr. Personnel
Assoc. Dean/Associate Director, Indiana University

CNI Fall Members Meeting
Arlington, VA
12/12/2011

NSF DataNet Program

• new types of organizations that integrate library & archival sciences, cyberinfrastructure, computer & information sciences, & domain science expertise
• provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data over a decades-long timeline
• continuously anticipate and adapt to changes in technologies and in user needs and expectations
• engage in research to drive the leading edge forward
• serve as component elements of an interoperable data preservation and access network

http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141

Partners

SEAD’s Unique Contributions

– Address domain-driven needs & requirements
– Serve scientists and researchers in the “long tail”
– Integrate existing technologies, tools & services (rather than build new from scratch)

Sustainability Science

Science Technology Cooperation Policy Economics Poverty & Justice

Data challenges

• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets

The long tail of scientific research

• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term value
• Value of data grows when combined & integrated

SEAD’s Goals

• Provide data services that address the needs of researchers working toward sustainability
• Integrate these services into a generalizable “Active and Social Curation” infrastructure suited to the social structure and economics of long-tail research communities
• Develop capabilities to package and migrate the most valuable datasets to a federated repository infrastructure for long-term preservation
• Education, outreach, & training to disseminate SEAD’s contributions to other projects & communities

SEAD’s Strategy

• Leverage social media for discovery of data, interest, and expertise
• Move data curation upstream in the data life cycle
• Involve domain scientists in setting priorities for evolution of data and services
• Take advantage of existing infrastructures (Institutional Repositories, ICPSR) for long-term preservation

Active and Social Curation

• Engage researchers during projects, not at the end
• Automatically capture metadata as defined by the data producers
• Provide facilities for commentary, recommendations, and mark-up of data
• Further reduce costs by re-engineering curation processes to leverage this rich metadata and volunteered effort

Active Curation Model

[Diagram labels: Active Curation; Workflows; Social Media; Data; Review; Rating; Commenting; Metadata]

SEAD Status

Phase 1, Months 1-18: Develop Prototype
Phase 2, Years 3-5: Grow SEAD users, data, and functionality
SEAD start date: 10/1/2011
In other words, SEAD is not ready to accept your data!

SEAD Personnel

• Margaret Hedstrom, PI (Michigan)
• Praveen Kumar, co-PI (Illinois)
• Jim Myers, co-PI (RPI)
• Beth Plale, co-PI (Indiana)
• Ann Zimmerman, co-PI/Project Manager (Michigan)
• George Alter (ICPSR)
• Bryan Beecher (ICPSR)
• Katy Börner (Indiana)
• Robert McDonald (Indiana)
• Jude Yew, Post-doc (Michigan)
• + many more to come

http://sead-data.net

SEAD TEAM

University of Michigan: Margaret Hedstrom (UM PI), Ann Zimmerman (Co-PI and Project Manager), George Alter, Bryan Beecher, Charles Severance, Karen Woollams, Jude Yew.
Indiana University: Beth Plale (IU PI), Katy Börner, Robert H. McDonald, Kavitha Chandrasekar, Robert Ping, Stacy Kowalczyk, Robert Light.
University of Illinois: Praveen Kumar (UIUC PI), Rob Kooper, Luigi Marini, Terry McLaren.
Rensselaer Polytechnic Institute: Jim Myers (RPI PI), Ram Prasanna Govind Krishnan, Lindsay Todd, Adam Wilson.

SEAD Cyberinfrastructure

• An international resource for sustainability science
• Novel technical and business approaches to supporting the long tail of research data
• Lifecycle support: actionable data services integrated with curation and preservation infrastructure

Key Challenges for SEAD Cyberinfrastructure

• Managed data storage and services are expensive!
• Begging for metadata doesn’t work!
• Curation and preservation are time consuming!
• The long tail is not standardized!
• Data collections are always missing something valuable!
• Data models evolve!
• Cyberinfrastructure is obsolete by the time you build it!
• Building community as you leverage cyberinfrastructure

SEAD: Social Networking

• Co-authorship
• Co-funding
• Micro-citation
• Shared project repositories
• Shared tags
• Threaded discussions
• Quoting, forwarding, …

Linked Data and Repositories

• Tag and annotate data
• Overlay it with reference data
• Organize it in domain terminology
• Link it to people, papers, projects, conversations…
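One way to picture tagging a dataset and linking it to people, papers, and projects is as a set of subject-predicate-object statements. The following is only a toy sketch: the `TripleStore` class, the predicate names, and every identifier in it are hypothetical illustrations, not SEAD's data model or any particular linked-data library.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        # Maps each subject to the set of (predicate, object) pairs about it.
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Return all objects linked to `subject` by `predicate`, sorted."""
        return sorted(o for p, o in self._by_subject[subject] if p == predicate)


store = TripleStore()
ds = "dataset:streamflow-2010"  # hypothetical dataset identifier

# Tag and annotate the dataset.
store.add(ds, "taggedWith", "hydrology")
store.add(ds, "taggedWith", "sensor-data")

# Link it to a person, a paper, and a project (all hypothetical).
store.add(ds, "createdBy", "person:jdoe")
store.add(ds, "citedIn", "paper:example-2011")
store.add(ds, "partOf", "project:example-watershed")

tags = store.objects(ds, "taggedWith")
creators = store.objects(ds, "createdBy")
print(tags, creators)
```

A production system would more plausibly express the same statements in RDF with shared vocabularies, so that links resolve across repositories.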

Using Science of Science to Link Repositories

KEY SEAD Questions

• What could SEAD capture when?
• How can SEAD provide direct value to data producers, users, and curators?
• How can robust web-services and social computing lower barriers and reduce/realign costs?

SEAD: Active Content Repository

• With the ‘Big Picture’ graph in-hand, curators can:
 ▫ Focus on what to curate and when
 ▫ Automate parts of the process
 ▫ Use existing/emerging technologies for packaging and preserving datasets
 ▫ Better manage federated repositories

SEAD: Leveraging Existing Resources

• Cyberinfrastructure
 ▫ IU Data Capacitor/HPC Capabilities
 ▫ UIUC/NCSA HPC Capabilities
 ▫ Rensselaer CCNI Capabilities
• Repositories
 ▫ UM Deep Blue
 ▫ IU ScholarWorks
 ▫ ICPSR Repository
 ▫ UIUC IDEALS

SEAD LayerCake View

• Services over an active content layer that is backed by/harvested into a federated archive infrastructure based on institutional resources

[Diagram labels: Network of Data Producers; Data Conservancy; Web User Interface; Active Content Repository; Content Mining; Services Provided (Curation Decisions, Archival data generation, Other services); Virtual Archives; Institutional Repositories (IU, RPI, UIUC, UM, ICPSR); User Network]

CI Technical Approach

[Diagram labels]
Active and Social Curation
Curation Boundary
Data Acquisition, Analysis and Simulation
Metadata Management: DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …
OAIS Repository Federation
Automated Curation Workflow/Rule Engine: operates on Metadata, Content Objects and Trigger Events
Scholarly Communication
VIVO/Linked Data
Active Content Repository
Appraisal and Selection
Ingest scripts: fixity, integrity, authentication, transformation
Ingest, AIPs
Compound Objects - OAI-ORE
Digital Repository Federation (OAIS compliant)
Preservation Actions
Dissemination Packages
Wide-Area File System
Search, Browse, Annotation, Visualization Tools
Use, Reuse, Repurposing Tools
Contributor
User
Migration and Emulation Tools
Access Mechanisms and E-Scholarship Services
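The fixity step listed among the ingest scripts is commonly implemented as a checksum comparison: record a digest at deposit, recompute it at ingest, and reject the file if the two differ. The following is a minimal sketch. The slides do not say which algorithm or tooling SEAD uses, so SHA-256 and the helper names `sha256_of` and `verify_fixity` are assumptions.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: str, expected_hex: str) -> bool:
    """Hypothetical helper: True if the file still matches its recorded checksum."""
    return sha256_of(path) == expected_hex.lower()


# Usage: record a checksum at deposit, then re-check it at ingest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"sample dataset contents")
    sample_path = tmp.name

recorded = sha256_of(sample_path)
ok_before = verify_fixity(sample_path, recorded)

# Simulate corruption in transit by appending one byte.
with open(sample_path, "ab") as fh:
    fh.write(b"!")
ok_after = verify_fixity(sample_path, recorded)

os.remove(sample_path)
print(ok_before, ok_after)
```

Storing the recorded digest alongside the archival package lets the same check be repeated periodically after ingest.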

Toward PetaScale Data

• Internet2 upgrade:
 ▫ Total bandwidth from 100 Gbps to 8.8 Tbps
 ▫ Moving a petabyte of data will go from 10 days to 25 hrs
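A back-of-the-envelope check of the petabyte figures, as a minimal sketch. It assumes a decimal petabyte (10^15 bytes) and sustained single-path rates of 10 Gbps before the upgrade and 100 Gbps after. Those two rates are an assumption, chosen because they roughly reproduce the quoted 10 days and 25 hours; the slide states only the total backbone bandwidth. Protocol overhead is ignored, which is one reason the ideal figures come out slightly lower than the quoted ones.

```python
def transfer_seconds(num_bytes: float, bits_per_second: float) -> float:
    """Ideal transfer time: payload bits divided by link rate, no overhead."""
    return num_bytes * 8 / bits_per_second


PETABYTE = 1e15  # decimal petabyte in bytes (assumption)
GBPS = 1e9       # bits per second in one gigabit per second

# Assumed per-path rates; not stated on the slide.
days_at_10g = transfer_seconds(PETABYTE, 10 * GBPS) / 86_400
hours_at_100g = transfer_seconds(PETABYTE, 100 * GBPS) / 3_600

print(f"10 Gbps:  {days_at_10g:.1f} days")     # 9.3 days
print(f"100 Gbps: {hours_at_100g:.1f} hours")  # 22.2 hours
```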

SEAD 18 Month Prototype Targets for Cyberinfrastructure

• Active and Social Content Curation
 ▫ Pilot Active Content Repository, VIVO deployments
 ▫ Exemplar services for Data Ingest, Discovery, Reuse, Curation
• CI for Long-term Access
 ▫ Data model, protocol design/development
 ▫ Pilot Federated Repository infrastructure

SEAD CI QuickView

• SEAD will quickly build a repository and data services infrastructure for sustainability research that can be responsively adapted based on community feedback – Community Agile Development
• SEAD will leverage existing tools and emerging practices to dramatically enhance the interactions of researchers and data librarians – Active Curation
• SEAD’s focus on the long tail will force an emphasis on ease-of-use and low costs that is critical for long-term sustainability – Leverage Existing Institution Resources for Long-term Access
• SEAD will leverage experiences in the sustainability research community to provide guidance for other long-tail communities making the transition to an interdisciplinary, systems-oriented approach to research – Sustainability and Resource Growth

Partnership and Collaboration

Acknowledgments

SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824

• For more on SEAD go to: http://sead-data.net
• Follow us on Twitter @SEADdatanet