Cyberinfrastructure Deconstructed

Research and Data
ARL/CNI Workshop, October 2008
Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair,
UC San Diego
SAN DIEGO SUPERCOMPUTER CENTER
Fran Berman
UCSD
Research Today
• What is the impact of a large-scale earthquake on the Southern San Andreas Fault?
– Digital data from Southern California Earthquake Center simulations used for disaster planning and building requirements
• Which has the greatest impact: nature or nurture?
– PSID: longitudinal data on 8000 families over 40 years
• How does disease spread?
– PDB: Worldwide reference collection of protein structure information
• Are current stresses on this bridge dangerous?
– Terabridge data set: Structure sensor data for real-time data mining, event detection, decision support and alert dissemination
• Where are the brown dwarfs?
– NVO: Data from 50+ astronomical sky surveys and large-scale telescopes.
Data Needs Vary Over the Spectrum of Research Applications
[Figure: application classes plotted on three axes: DATA (more BYTES), COMPUTE (more FLOPS), NETWORK (more BW)]
• Home, Lab, Campus, Desktop Applications
• Data-intensive applications
• Data-intensive and Compute-intensive HPC applications
• Compute-intensive HPC Applications
• Grid Applications
• Data Grid Applications
Examples shown: Analysis with Protein Data Bank structures; Storage of Data from the CERN Large Hadron Collider; Cosmology; Development of biofuels
Cyberinfrastructure Support: Researchers Want a Coordinated and Easy-to-Use Portfolio of Services and Resources to Make the Most of Their Data
Coordinated systems make discovery the issue rather than IT
Services Are Critical for Use
• Data visualization
• Portal creation and collection publication
• Data analysis
• Data mining
• Preservation services
• Domain-specific tools
• Data hosting
• Biology Workbench
• Montage (astronomy mosaicking)
• Kepler (Workflow management)
• Data anonymization, etc.
[Figure: layered diagram: Data Access; Data Use (analysis, modeling, simulation, visualization); Data Management (File systems, Database systems, Collection Management, Data Integration, etc.); Data Storage; Data Preservation; Many Data Sources (instruments, sensornets, computers)]
Today’s Presentation
• Research and Data
Data-Driven Cosmology: Simulating the first billion years of the
Universe after the Big Bang
• Long-Lived Research Data
Key area of engagement for Libraries and Researchers
• Economically sustainable digital preservation:
The Blue Ribbon Task Force on Sustainable Digital Preservation
and Access
Research and Data: Evolving the Universe from the “Big Bang”
Composing simulation outputs from different timeframes builds up lightcone volume
[Figure: timeline from “then” to “now” with simulation outputs dump1, dump2, dump3, dump4, dump5]
Slide modified from Mike Norman
After the “Big Bang” – the Universe’s First Billion Years
• ENZO simulates the first billion years of cosmic evolution after the “Big Bang”
• Key period which represents
– A tumultuous period of intense star formation throughout the universe
– Synthesis of the first heavy elements in massive stars
– Supernovae, gamma-ray bursts, seed black holes, and the corresponding growth of supermassive black holes and the birth of quasars
– Assembly of first galaxies
[Figure: timeline from “then” to “now” with simulation output dump1 marked]
Slide modified from Mike Norman
ENZO Simulations
What ENZO does:
• Calculates the growth of cosmic structure from seed perturbations to form stars, galaxies, and galaxy clusters, including simulation of
– Dark matter
– Ordinary matter (atoms)
– Self-gravity
– Cosmic expansion
• Uses adaptive mesh refinement (AMR) to provide high spatial resolution in 3D
– The Santa Fe light cone simulation generated over 350,000 grids at 7 levels of refinement
– Effective resolution = 65,536^3
[Figures: Formation of a galaxy cluster; AMR grid hierarchy showing Level 0, Level 1, Level 2]
Slide modified from Mike Norman
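The effective-resolution figure can be related to the refinement depth with a little arithmetic. The sketch below assumes a refinement factor of 2 per level and a root grid of 512 cells per side; both values are assumptions made for illustration, since the slide states only the 7 levels and the 65,536^3 result.

```python
def effective_resolution(root_cells_per_side: int, levels: int, refine_factor: int = 2) -> int:
    """Cells per side of a uniform grid matching the finest AMR level."""
    return root_cells_per_side * refine_factor ** levels


ROOT_CELLS = 512  # assumption: root grid size is not given on the slide
LEVELS = 7        # from the slide

side = effective_resolution(ROOT_CELLS, LEVELS)
print(side)  # 65536

# A uniform grid at that resolution would need side**3 cells, which is
# why AMR refines only where structure forms.
uniform_cells = side ** 3
print(f"{uniform_cells:.2e} cells on an equivalent uniform grid")
```

Under these assumptions each added level doubles the resolution per side, so 7 levels give a factor of 128 over the root grid.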
Greater Simulation Accuracy Requires More
Computing and Generates More Data
ENZO at Petascale
● Self-consistent radiation-hydro simulations of the structural, chemical, and radiative evolution of the universe, from first stars to first galaxies
Computer Science challenges
● Parallelizing the grid hierarchy metadata for millions of subgrids distributed across
10s of thousands of cores
● Efficient dynamic load balancing of the numerical computations, taking memory
hierarchy and latencies into account
● Efficient parallel “packed AMR” I/O for 100 TB data dumps
● Inline data analysis/viz. to reduce I/O
Slide modified from Mike Norman, image by Robert Harkness
Verifying Theory with Observation
• James Webb Space Telescope, coming in 2013, will probe the first billion years of the universe, providing observations of unprecedented depth and breadth
• Data will enable tight integration of observation and theory, and will enable simulations to approach realistic complexity
• Analysis of petascale data sets will be essential for validating models
Slide modified from Mike Norman, image by Robert Harkness
Digital Research Data Spans Spectrum of
Preservation Profiles
• Short-term (few months, years) to long-term (decades,
centuries, …)
• Small-scale (GBs) to large-scale (PBs)
• Well-tended (metadata, community standards) to poorly
tended (flat files, insufficient metadata)
• Subject to more restrictive policy and regulation (HIPAA) vs.
subject to less restrictive policy and regulation (OMB)
• Has a data management and sustainability plan (PDB, PSID,
NVO) vs. ad hoc approach
Synergy between Researcher Needs and
Library Strengths
• Research community focused on
– Innovation (new starts)
– Targeted domain-focused solutions to problems (customization)
– Collaboration (open source, inter-disciplinary teaming, etc.)
• Researchers need help with things Librarians are good at
– Developing reliable management, preservation and use environments
– Proper curation and annotation
– Navigating policy, regulation, intellectual property
– Collaboration (partnership to share resources, create economies of scale, etc.)
– Sustainability
Long-Lived Research Data
Historical Data
• The 2008 Cyber-election
– Fundraising via website
– YouTube videos of the
candidates and conventions
– Blogs as vehicles for
discussing issues
– On-line organizing
• Digital data from historic 2008
cyber-election will be valuable
for decades+ to come
Life Sciences Data
• The Protein Data Bank
– worldwide repository for
the processing and
distribution of 3-D
structure data of large
molecules of proteins and
nucleic acids.
• PDB represents $80 billion +
investment in research
resulting in PDB structures
• PDB supported by funds from
NSF, NIGMS, DOE, NLM, NCI,
NCRR, NIBIB, NINDS, NIDDK.
October Molecule of the Month: Poly(A) Polymerase
• “The enzyme poly(A) polymerase, in PDB entries 1f5a (cow) and 1fa0 (yeast, shown here), is responsible for the creation of the poly(A) tail. …
• Some researchers actually think that poly(A)-binding protein links the RNA strand into a big circle. This could have a very useful consequence: since the beginning of the messenger RNA is so close to the end, ribosomes that have just finished making a protein could jump immediately to the beginning and start again.”
Cultural Data
• Historical photographs
Some images courtesy of David Minor and the Library of Congress
Preserving Long-lived Research Data
1) What should we save?
– value, policy, regulation
2) Who should pay for it?
– economics
[Figure: Value and Cost plotted against Time, highlighting key candidates for digital preservation]
Just Saving Everything Isn’t An Option
U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is “born digital”
YouTube: 6M videos in 2006 = 600 TB
1 novel = 1 MB
SDSC Tape Archives = 25+ PB
• 2007 was the “crossover year” where the amount of digital information exceeded the amount of available storage (~264 exabytes)
• By 2023, the amount of digital data will exceed Avogadro’s number (6.02 x 10^23).
Kilo = 10^3
Mega = 10^6
Giga = 10^9
Tera = 10^12
Peta = 10^15
Exa = 10^18
Zetta = 10^21
Source: “The Diverse and Exploding Digital Universe”, IDC Whitepaper, March 2008
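The figures on this slide span many orders of magnitude, so it can help to put them on one scale. The following is a small illustrative sketch that uses only the numbers quoted above; the helper name `to_bytes` and the use of decimal (power-of-ten) prefixes are choices made here, not part of the presentation.

```python
# Decimal prefixes as listed on the slide.
PREFIX = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21,
}


def to_bytes(value: int, unit: str) -> int:
    """Convert a quantity such as (600, 'TB') to bytes."""
    return value * PREFIX[unit]


novel = to_bytes(1, "MB")
youtube_2006 = to_bytes(600, "TB")
loc_total = to_bytes(295, "TB")
loc_born_digital = to_bytes(230, "TB")
sdsc_archive = to_bytes(25, "PB")

# Average size implied by "6M videos = 600 TB".
avg_video_mb = youtube_2006 / 6_000_000 / PREFIX["MB"]
print(avg_video_mb)  # 100.0

# How many 1 MB novels fit in 25 PB.
novels_in_sdsc = sdsc_archive // novel
print(novels_in_sdsc)  # 25000000000

# Share of the Library of Congress holdings that is born digital.
print(f"{loc_born_digital / loc_total:.0%}")
```

On these numbers the SDSC tape archive is roughly 85 times the Library of Congress digital holdings quoted on the slide.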
What do We Want to Save?
Data we* want to keep over the long-term:
– We = “Society”
• Official and historically valuable data
(Census information, presidential emails,
Shoah Collection, etc.)
– We = Research Community
• Protein Data Bank, National Virtual
Observatory, etc.
– We = Me
• My medical record, my Quicken
data, digital photos of my
Mom’s 80th birthday, etc.
What do we Have to Save?
• HIPAA applies to health information created or maintained by health care providers
• Sarbanes-Oxley regulations apply to all U.S. public company boards, management, and public accounting firms.
• OMB regulations apply to federally funded research data (NIH, NSF, DOE, etc.)
Crime and Punishment
Regulation | Retention Requirement | Penalty
HIPAA | Retain patient data for 6 years | $250K fine and up to 10 years in prison
Sarbanes-Oxley | Auditors must retain relevant data for at least 7 years | Fines to $5M and 20 years in prison
Gramm-Leach-Bliley | Ensure confidentiality of customer financial information | Up to $500K and 10 years in prison
SEC 17a | Broker data retention for 3-6 years. Some require longer retention | Variable based on violation
OMB Circular A-110 / CFR Part 215 (applies to federally funded research data) | “a three year period is the minimum amount of time that research data should be kept by the grantee” | Penalty structure unclear, likely fines?
Table information partly based on “Data Retention – More Value, Less Filling”, John Murphy, http://www.tdan.com/view-articles/5222
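As a rough illustration of how the retention column might be applied, the sketch below computes the earliest date a data set could be discarded when several regulations apply at once. It encodes only the fixed minimums from the table; the function name `earliest_disposal`, the choice to count from a single start date, and the omission of SEC 17a (whose period varies) are simplifying assumptions, since each regulation defines its own trigger date.

```python
from datetime import date

# Minimum retention periods in years, taken from the table above.
MIN_RETENTION_YEARS = {
    "HIPAA": 6,
    "Sarbanes-Oxley": 7,
    "OMB Circular A-110": 3,
}


def earliest_disposal(start: date, regulations: list[str]) -> date:
    """Earliest disposal date when the longest applicable minimum governs.

    Assumption: every period is counted from the same start date.
    """
    years = max(MIN_RETENTION_YEARS[name] for name in regulations)
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start was 29 February and the target year is not a leap year
        return start.replace(year=start.year + years, day=28)


# Federally funded research data that also contains patient records.
print(earliest_disposal(date(2008, 10, 1), ["HIPAA", "OMB Circular A-110"]))
```

Taking the maximum reflects that data covered by more than one rule must satisfy the strictest of them.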
Economics: Preserving Research Data Incurs Real Costs
Resources and Resource Refresh
• Most valuable data must be replicated
• SDSC research collections have been doubling every 15 months.
• SDSC storage is 25 PB and counting.
Other Costs
• Maintenance and upkeep
• Software tools and packages
• Utilities (power, cooling)
• Space
• Networking
• Security and failover systems
• People (expertise, help, infrastructure management, development)
• Training, documentation
• Monitoring, auditing
• Reporting costs, costs of compliance with regulation
[Figure: SDSC Data Storage Growth. Archival Storage (TB, log scale from 10 to 100,000) by Date, June 1997 to June 2009; series: TB Stored, Planned Capacity, Model A (8-yr, 15.2-mo 2X)]
Information courtesy of Richard Moore
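The doubling rate quoted above implies exponential growth, which can be sketched in a few lines. This is an illustrative projection only: it assumes the 15-month doubling period continues unchanged from the 25 PB starting point, which the slide does not claim, and the function names are invented for this example.

```python
import math


def projected_pb(current_pb: float, months_ahead: float, doubling_months: float = 15.0) -> float:
    """Storage after months_ahead if holdings double every doubling_months."""
    return current_pb * 2 ** (months_ahead / doubling_months)


def months_until(target_pb: float, current_pb: float, doubling_months: float = 15.0) -> float:
    """Months needed to grow from current_pb to target_pb at the same rate."""
    return doubling_months * math.log2(target_pb / current_pb)


# Five years (60 months) is four doublings: 25 -> 50 -> 100 -> 200 -> 400 PB.
print(projected_pb(25, 60))  # 400.0

# Time to reach 100 PB from 25 PB: two doublings.
print(months_until(100, 25))  # 30.0
```

A projection like this is what makes the refresh and "other costs" lists significant: capacity, power, space, and staffing all have to track the same curve.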
Data Management and Preservation Key Opportunity for Libraries and Research Enterprise Partnership: The UCSD Experience
• Development of a reliable production preservation grid to support community collections (supported by LC)
• UCSDL and SDSC creating an institutional partnership
[Figure: organization chart. Brian Schottlaender and Fran Berman; Luc Declerck; DLCS Director: Robin Chandler; Curation Services: David Minor; Subject Specialists; Format Specialists; Libraries Digital Library Program; Metadata Analysis and Specification; Database and Web Services: Henry Jaime; Data Use Services: Natasha Balac; Storage Services: TBD; Information Technology; Other SDSC Divisions; Chronopolis sites at NCAR, U Md, and UCSD; UCSD Research Community; UCSD, UC, and Worldwide Research Communities]
Economically sustainable digital preservation:
The Blue Ribbon Task Force on Sustainable Digital
Preservation and Access (BRTF-SDPA)
BRTF-SDPA focus:
• General cost framework: key cost categories of digital preservation
• Set of economic models which provide alternative ways of addressing sustainable digital preservation
– Pros, cons, costs, trade-offs of each
– List real world conditions for which each model is best suited.
• Actionable recommendations: “If your digital preservation context is X, you should consider using model Y for sustainable digital access and preservation.”
Website: brtf.sdsc.edu
BRTF-SDPA Participants
Blue Ribbon Task Force:
• Paul Ayris, University College London
• Fran Berman, SDSC/UCSD
• Bob Chadduck, NARA Liaison
• Sayeed Choudhury, Johns Hopkins University
• Elizabeth Cohen, AMPAS/Stanford
• Paul Courant, University of Michigan
• Lee Dirks, Microsoft
• Amy Friedlander, CLIR
• Chris Greer, NITRD Liaison
• Vijay Gurbaxani, UC Irvine
• Anita Jones, University of Virginia
• Ann Kerr, Consultant
• Brian Lavoie, OCLC
• Cliff Lynch, CNI
• Dan Rubinfeld, UC Berkeley
• Chris Rusbridge, DCC
• Roger Schonfeld, Ithaka
• Abby Smith, Consultant
• Anne Van Camp, Smithsonian
Sponsoring Agencies/Institutions:
• National Science Foundation
• Mellon Foundation
• Library of Congress
• National Archives and Records Administration
• CLIR
• NITRD
• JISC
• Member institutions
Specific Responsibilities
• Fran Berman / co-Chair
• Amy Friedlander / First Report Editor
• Ann Kerr / January Panel Rapporteur
• Brian Lavoie / co-Chair
• Susan Rathbun / Task Force Support
• Abby Smith / Second Report Editor
• Jan Zverina / Communications Lead
• Lucy Nowell / NSF Program Officer
• Don Waters / Mellon Program Officer
• Laura Campbell, Martha Anderson / LC representative
BRTF-SDPA Working Definition of
Economic Sustainability
• Economic sustainability in a digital preservation context: The set of business, social, technological, and policy mechanisms that
1. Encourage the gathering of important information assets into digital preservation systems, and
2. Support the indefinite persistence of the digital preservation systems, thus securing access to and use of the information assets into the long-term future
• Economically sustainable digital preservation requires
– Recognition of the benefits of preservation on the part of key decision-makers, as part of a process of selecting digital materials for long-term retention;
– Appropriate incentives to induce decision-makers to act in the public interest;
– Mechanisms to secure an ongoing allocation of resources, both within and across organizations, to digital preservation activities;
– Efficient use of limited preservation resources;
– Appropriate organization and governance of digital preservation activities.
2008 BRTF Panels
Panelists
January Panel
• John Gantz, IDC
• Andy Maltz, AMPAS
• Edmund Mesrobian, NetApp
• Stijn Hoorens, Rand
July Panel
• Stuart McKee, Microsoft
• Kris Carpenter, Internet Archive
• Rick Zuray, Boeing
• Helen Berman, PDB
• Myron Gutmann, ICPSR
• Nan Rubin, PBS
• Rick Luce, Emory Library
• Paul Ratnaraj, WRDS
October Panel
• Eileen Fenton, Portico
• Melissa Levine, U. Mich.
• Peter Mojica, SNIA
• Vicky Reich, LOCKSS
• TBD, Amazon
General Questions for Panelists:
• What is the nature of the digital materials being preserved?
• Who are the stakeholders for these materials?
• What is the “value proposition” for this preservation effort?
• What are the key features of long-term preservation for these materials?
• What are the “economic aspects” of digital preservation?
BRTF Deliverables
Goal is to go beyond the “3 R’s”:
1. Data preservation is Really important
2. More Research is needed.
3. More Resources are needed.
• First Year (2008) Report (positive, “what is”):
– Describe past and current models of economic sustainability for digital preservation (interviews, testimony, case studies, etc.);
– Identify points of convergence/divergence; “lessons learned”;
– What we know so far, and what our key knowledge gaps are.
• Second Year (2009) Report (normative, “what should be”):
– General cost framework: key cost categories of digital preservation
– Set of economic models/“scenarios”: alternate ways of organizing digital preservation activities, within the context of the cost framework
– Describe each model: features, pros, cons, trade-offs, etc.
– List real world conditions for which each model is best suited.
• “If your digital preservation context is X, we recommend you consider using model Y to organize your activities in a sustainable way.”
Challenges in Developing Economic Models
• Definition of scope
– Focus of BRTF: economic
sustainability of digital data for
which there is a clear public
interest
– Materials may come from a variety
of sectors and in a variety of forms
(text, audio, video, website,
scientific data set, e-journals, etc.)
– Materials may change hands over
their lifetime
– Some of this data can/will be
identified at its inception as data of
long-term value, but for some,
retention value only becomes
apparent over time.
• Taxonomizing and integrating related
work
– Difficult to compare and contrast different
studies given diverse definitions of costs,
units of measurement, scope or accounting
methods
– Key related work: studies focusing on various
“parts of the elephant” (e.g. LIFE, BCL cost
models which focus on entire life-cycle of
costs as well as inter-temporal considerations
(e.g., inflation/deflation, interest rates))
• Getting the whole picture
– Need to include both costs associated with human dimensions (e.g., management) and infrastructure elements of storage (e.g., power and cooling)
Other Deliverables: Community Outreach
Get the message out beyond the “usual suspects”
• Does your dry cleaner
know what digital
preservation is?
• Goal is to help focus
attention and
encourage new
resources from
decision makers and
the general public by
expanding the
discussion
Thank You
• Blue Ribbon Task Force for
Sustainable Digital Preservation
and Access
brtf.sdsc.edu
• Chronopolis
http://chronopolis.sdsc.edu/
• SDSC
www.sdsc.edu
• Fran
[email protected]