
Data Preservation at the Exa-Scale and Beyond
Challenges of the Next Decade(s)
[email protected]
APARSEN Webinar, November 2014
The Story So Far…
• Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for LTDP (long-term data preservation) is achievable – and will hopefully be funded
• Built on standards, certified via agreed procedures, using the “Cream of DP services”
• In parallel, Business Cases and Cost Models are increasingly well understood, working closely with Projects, Communities and Funding Agencies
Open Questions
• Long-term sustainability is still a technical issue
– Let’s assume that we understand the Business Cases & Cost Models well enough…
– And (we) even have agreed funding for key aspects
• But can the service providers guarantee a multi-decade service?
– Is this realistic?
– Is this even desirable?
4C Roadmap Messages
A Collaboration to Clarify the Costs of Curation
1. Identify the value of digital assets and make choices
2. Demand and choose more efficient systems
3. Develop scalable services and infrastructure
4. Design digital curation as a sustainable service
5. Make funding dependent on costing digital assets across the whole lifecycle
6. Be collaborative and transparent to drive down costs
• “Observations” (unrepeatable) versus “measurements”
• “Records” versus “data”
• Choices & decisions:
– Some (re-)uses of data are unforeseen!
• No “one-size fits all”
Suppose these guys can build / share the most cost effective, scalable and reliable federated storage services, e.g. for peta- / exa- / zetta-scale bit preservation?
Can we ignore them?
H2020 EINFRA-1-2014
Managing, preserving and computing with big research data
7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zettabytes and trillions of objects. Clean-slate approaches to data management targeting 2020+ 'data factory' requirements of research communities and large-scale facilities (e.g. ESFRI projects) are encouraged.
Next Generation Data Factories
• HL-LHC (https://indico.cern.ch/category/4863/)
– Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors, with a view to collecting ten times more data than in the initial design, by around 2030 (European Strategy for Particle Physics)
• SKA
– The Square Kilometre Array (SKA) project is an international effort to build the world’s largest radio telescope, with a square kilometre (one million square metres) of collecting area
⇒ Typified by SCALE in several dimensions:
– Cost; longevity; data rates & volumes
– Lasting decades; cost O(EUR 10^9); EB / ZB data volumes
http://science.energy.gov/fundingopportunities/digital-data-management/
• “The focus of this statement is sharing and preservation of digital research data”
• All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:
1. DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4). At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
Data: Outlook for HL-LHC
[Chart: projected RAW data volume per year, in PB (0–450), for ALICE, ATLAS, CMS and LHCb across Run 1 to Run 4, with the annotation “We are here!”]
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volumes scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
⇒ At least 0.5 EB / year (× 10 years of data taking)
Source: Predrag Buncic, ECFA Workshop, Aix-Les-Bains, October 3, 2013
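A minimal back-of-the-envelope sketch of the extrapolation described above. The present-day combined RAW volume and the factor-of-ten increase in output rates are illustrative assumptions, not official experiment figures:

```python
# Hypothetical extrapolation of HL-LHC RAW data volume, in the spirit of the
# "simple extrapolation" on the slide. All inputs are illustrative assumptions.

current_raw_pb_per_year = 50       # assumed: combined RAW volume of the experiments today (PB/year)
output_rate_scale_factor = 10      # assumed: HL-LHC output rates ~10x the initial design
years_of_data_taking = 10          # from the slide: ~10 years of HL-LHC running

raw_per_year_pb = current_raw_pb_per_year * output_rate_scale_factor
raw_total_eb = raw_per_year_pb * years_of_data_taking / 1000   # 1 EB = 1000 PB

print(f"RAW per year: {raw_per_year_pb} PB (~{raw_per_year_pb / 1000:.1f} EB)")
print(f"RAW over {years_of_data_taking} years: ~{raw_total_eb:.0f} EB, before derived data, simulation and user data")
```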
Bit-preservation WG one-slider
• Mandate summary (see w3.hepix.org/bit-preservation)
– Collecting and sharing knowledge on bit preservation across HEP (and beyond)
– Providing technical advice
– Recommendations for sustainable archival storage in HEP
• Survey of large HEP archive sites carried out and presented at the last HEPiX
– 19 sites; areas such as archive lifetime, reliability, access, verification, migration
– HEP archiving has become a reality by fact rather than by design
– Overall positive, but a lack of SLAs, metrics, best practices and long-term costing impact
Verification & reliability
• Systematic verification of archive data ongoing
– “Cold” archive: users only accessed ~20% of the data (2013)
– All “historic” data verified between 2010-2013
– All new and repacked data being verified as well
– ~35 PB verified in 2014; no losses
• Data reliability significantly improved over the last 5 years
– From annual bit loss rates of O(10^-12) (2009) to O(10^-16) (2012)
• Still, room for improvement
– Vendor-quoted bit error rates: O(10^-19..10^-20)
– But these only refer to media failures
– Errors (e.g. bit flips) appearing in the complete chain
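To put those loss rates in perspective, a small sketch of the expected annual loss for a large archive. The ~100 PB archive size and the reading of “annual bit loss rate” as bits lost per bit stored per year are assumptions made purely for illustration:

```python
# Expected annual data loss for an archive of ~100 PB at the quoted bit loss rates.
# Archive size and the interpretation of the rates are assumptions for this sketch.

archive_bytes = 100e15                 # assumed: ~100 PB archive
archive_bits = archive_bytes * 8

rates = {"2009 (O(1e-12))": 1e-12,
         "2012 (O(1e-16))": 1e-16,
         "vendor media spec (O(1e-19))": 1e-19}

for label, rate in rates.items():
    lost_bytes = archive_bits * rate / 8    # expected bytes lost per year
    print(f"{label}: ~{lost_bytes:.0e} bytes expected lost per year")

# Roughly: 1e5 bytes (~100 kB) per year in 2009, ~10 bytes in 2012,
# and ~0.01 bytes per year if only the vendor-quoted media error rate applied.
```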
“LHC Cost Model” (simplified)
Start with 10 PB, then +50 PB/year, then +50% every 3 years (or +15% / year)
[Chart: projected archive volume under this model, climbing through 1 EB towards 10 EB]
Case B) increasing archive growth: total cost ~$59.9M (~$2M / year)
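A sketch of the growth rule quoted above, with an illustrative media cost attached. The per-PB cost and its annual decline are my own assumptions, added only to show how a total in the tens of millions of dollars (a few $M per year) can emerge from such a model; they are not the inputs behind the $59.9M figure, and repacking, drives and operations are ignored:

```python
# Simplified archive-growth model from the slide: start at 10 PB, add 50 PB in
# year one, and grow the annual increment by ~15%/year (~ +50% every 3 years).
# Cost parameters below are illustrative assumptions, not the talk's inputs.

years = 25
archive_pb = 10.0           # starting archive size (PB)
increment_pb = 50.0         # data added in the first year (PB)
cost_per_pb = 40_000.0      # assumed all-in cost of archiving new data ($/PB)
cost_decline = 0.85         # assumed: $/PB drops ~15% per year

total_cost = 0.0
for year in range(1, years + 1):
    archive_pb += increment_pb
    total_cost += increment_pb * cost_per_pb   # cost of storing this year's new data
    increment_pb *= 1.15                       # +15% per year
    cost_per_pb *= cost_decline

print(f"Archive after {years} years: ~{archive_pb / 1000:.1f} EB")
print(f"Illustrative cost: ~${total_cost / 1e6:.0f}M total (~${total_cost / 1e6 / years:.1f}M / year)")
# With these assumptions: ~10.6 EB, ~$39M total, ~$1.5M / year -- the same order
# of magnitude as the ~$2M / year quoted for Case B.
```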
Certification – Why Bother?
✚ Help align policies and practices across sites
✚ Improve reliability, eliminate duplication of effort, reduce “costs of curation”
– Some of this is being done via the HEPiX WG
✚ Help address the “Data Management Plan” issue required by Funding Agencies
✚ Increase “trust” with “customers” wrt stewardship of the data
✚ Increase attractiveness for future H2020 bids and / or to additional communities
2020 Vision for LT DP in HEP
• Long-term – e.g. FCC timescales: disruptive change
– By 2020, all archived data – e.g. that described in the DPHEP Blueprint, including LHC data – easily findable, fully usable by designated communities, with clear (Open) access policies and possibilities to annotate further
– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
– DPHEP portal, through which data / tools are accessed
⇒ “HEP FAIRport”: Findable, Accessible, Interoperable, Re-usable
⇒ Agree with Funding Agencies on clear targets & metrics
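As a hedged illustration of what “FAIR” could mean in practice for a portal entry, a minimal metadata record for an archived dataset; the field names, identifiers and URL are invented for this sketch and do not represent an actual DPHEP schema:

```python
# Illustrative FAIR-style catalogue entry for an archived HEP dataset.
# All identifiers, URLs and field names are hypothetical placeholders.

record = {
    # Findable: persistent identifier plus rich, searchable description
    "doi": "10.1234/example.run1.raw",                 # hypothetical DOI
    "title": "Example Run 1 RAW dataset",
    "keywords": ["HEP", "LHC", "RAW", "Run 1"],
    # Accessible: where and under which policy the data can be obtained
    "access_url": "https://dphep-portal.example.org/datasets/12345",
    "access_policy": "open after embargo period",
    # Interoperable: community formats and vocabularies
    "format": "ROOT",
    "metadata_standard": "community metadata profile (placeholder)",
    # Re-usable: licence, provenance and validation information
    "licence": "CC0",
    "provenance": "reconstructed with experiment software release X.Y (placeholder)",
    "validation": "checksums verified during periodic archive scans",
}
```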
Summary
• Next generation data factories will bring with them many challenges for computing, networking and storage
• Data Preservation – and management in general – will be key to their success and must be an integral part of the projects: not an afterthought
• Raw “bit preservation” costs may drop to ~$100K / year / EB over the next 25 years
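One way such a figure could come about, sketched with assumed numbers (the present-day cost per EB per year and the rate of decline are mine, not the talk’s):

```python
# Projection of raw bit-preservation cost per EB per year, assuming an
# illustrative starting cost of ~$20M / EB / year that falls ~20% per year.

cost_per_eb_year = 20e6      # assumed cost today ($ per EB per year)
annual_decline = 0.80        # assumed: each year costs 80% of the previous year

for year in range(26):       # project 25 years ahead
    if year in (0, 10, 25):
        print(f"year {year:2d}: ~${cost_per_eb_year / 1e3:,.0f}K per EB per year")
    cost_per_eb_year *= annual_decline

# year 0: ~$20,000K; year 10: ~$2,147K; year 25: ~$76K per EB per year,
# i.e. of order $100K / year / EB after ~25 years under these assumptions.
```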
3 Points to Take Away:
1. Efficient;
2. Scalable;
3. Sustainable.
A (small-ish) network of certified, trusted digital repositories can address all of these