The DataGRID Project
Les Robertson
CERN - IT Division
May 2001
[email protected]
Enabling Worldwide Scientific Collaboration
• an example of the problem
• the DataGRID solution
• concluding remarks
The Beginning of DataGRID
The DataGRID project evolved from the conjunction of
• the search for a practical solution to building the computing system for CERN's next accelerator, the Large Hadron Collider (LHC)
• and the appearance of Ian Foster and Carl Kesselman's book, The Grid: Blueprint for a New Computing Infrastructure
The Problems
• Vast quantities of data
• Enormous computing requirements
• Researchers spread all over the world
The Large Hadron Collider Project
4 detectors: ATLAS, CMS, ALICE, LHCb
Storage –
• Raw recording rate 0.1 – 1 GBytes/sec
• Accumulating at 5-8 PetaBytes/year
• 10 PetaBytes of disk
Processing –
• 200,000 of today's fastest PCs
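A rough cross-check of these volumes (a back-of-the-envelope estimate, assuming the commonly used figure of about 10^7 seconds of data-taking per accelerator year, which is not stated on the slide): a sustained 100 MB/sec of raw data gives 100 MB/sec x 10^7 sec ≈ 10^15 bytes, i.e. about 1 PetaByte per experiment per year, consistent with the 5-8 PetaBytes/year accumulated across the experiments.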
Computing Fabric at CERN (2005)
[Diagram: thousands of CPU boxes and thousands of disks, connected by a farm network and a storage network; hundreds of tape drives; LAN-WAN routers; and the real-time detector data feed. Link data rates are marked in Gbps, ranging from under 1 Gbps per box up to several hundred Gbps of aggregate bandwidth.]
Simulated Collision in the ATLAS Detector
Complex Data = More CPU Per Byte
[Chart: estimated CPU capacity required at CERN, in thousands of SI95 units (K SI95), for 1998-2010. The LHC requirement rises steeply above that of the other experiments, towards several thousand K SI95, whereas installed capacity in January 2000 was about 3.5K SI95. A Moore's-law curve shows some measure of the capacity that technology advances provide for a constant number of processors or a constant investment.]
Data Analysis: unpredictable requests require intensive computation on huge data flows
[Data-flow diagram: the detector delivers 0.1 to 1 GB/sec to the Event Filter (selection & reconstruction, 35K SI95); raw data is recorded at ~100 MB/sec, accumulating 1 PB/year; Event Reconstruction (250K SI95) and Event Simulation produce Event Summary Data and processed data / analysis objects (stores of 500 TB and 200 TB/year) at ~200 MB/sec; Batch Physics Analysis (350K SI95) and Interactive Data Analysis by thousands of scientists read these at rates of 1-100 GB/sec (64 GB/sec is also marked in the diagram).]
CERN's Users in the World
Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users
Solution (i)
Large Scale Computing Fabrics
• Long experience in HEP with large clusters – processors, disk farms, mass storage; reliable, manageable, flexible growth
• Applications adapted to a well-established computing model
• Currently using thousands of simple PCs, IDE disk servers, Ethernet
  – everything built from commodity components, tape storage excepted
• New developments needed to scale these up by an order of magnitude, to tens of thousands of components
  – maintaining reliability and availability targets
  – containing management costs
  – Terabit switches
• New levels of management automation – installation, monitoring, auto-diagnosing, self-healing
Large Scale Computing Fabrics (cont.)
• But the requirements are greater than can be satisfied at a single site
  – political/financial arguments against very large facilities
  – national constraints from funding organisations
  – exploiting existing computing centre infrastructure
• Compare with the geographical distribution of super-computing centres
Solution (ii)
Regional Centres - a Multi-tier Model
[Diagram: CERN at the centre, surrounded by Tier 1 regional centres (FermiLab in the USA, Rutherford in the UK, CNAF/Bologna in Italy, NIKHEF in NL, IN2P3/Lyon in France), each feeding Tier 2 centres, laboratories, universities and physics departments, down to the individual desktop.]
Is this usable? manageable?
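To make the hierarchy concrete, here is a minimal sketch (illustrative only; the nesting is an assumed shape, not a DataGRID definition, and the site names are those shown on the slide) of the multi-tier model as a plain Python data structure:

    # Illustrative only: the multi-tier Regional Centre model as nested data.
    multi_tier_model = {
        "CERN": {                          # the central facility
            "Tier 1": {
                "USA": "FermiLab",
                "UK": "Rutherford",
                "Italy": "CNAF/Bologna",
                "NL": "NIKHEF",
                "France": "IN2P3/Lyon",
            },
            # Below each Tier 1 sit Tier 2 centres, laboratories and university
            # physics departments ("Lab a", "Uni b", ...), and finally the desktop.
        }
    }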
The Promise of Grid Technology
What does the Grid do for you?
• you submit your work
• and the Grid
  – finds convenient places for it to be run
  – optimises use of the widely dispersed resources
  – organises efficient access to your data (caching, migration, replication)
  – deals with authentication to the different sites that you will be using
  – interfaces to local site resource allocation mechanisms and policies
  – runs your jobs
  – monitors progress
  – recovers from problems
  – .. and .. tells you when your work is complete
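As a toy illustration of the first two of those steps (a sketch in Python, not a DataGRID interface; the Site and Job types and the selection rule below are assumptions), a broker might pick a site by combining resource availability with data locality:

    from dataclasses import dataclass, field

    @dataclass
    class Site:
        name: str
        free_cpus: int
        local_files: set = field(default_factory=set)

    @dataclass
    class Job:
        required_cpus: int
        input_files: set

    def choose_site(job: Job, sites: list) -> Site:
        """Pick a 'convenient place' to run: the site must have enough free CPUs,
        and sites already holding the job's input data are preferred, so that
        less data has to move across the wide-area network."""
        candidates = [s for s in sites if s.free_cpus >= job.required_cpus]
        if not candidates:
            raise RuntimeError("no site can run this job at the moment")
        return max(candidates,
                   key=lambda s: (len(job.input_files & s.local_files), s.free_cpus))

    # Example: the smaller site wins because it already holds the input file.
    sites = [Site("cern", 40, {"raw_run042.dat"}), Site("lyon", 200)]
    job = Job(required_cpus=8, input_files={"raw_run042.dat"})
    print(choose_site(job, sites).name)    # -> cern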
The DataGRID Project
www.eu-datagrid.org

DataGRID Partners
Managing partners:
• UK: PPARC
• Italy: INFN
• France: CNRS
• Holland: NIKHEF
• Italy: ESA/ESRIN
• CERN (project management: Fabrizio Gagliardi)
Industry:
• IBM (UK), Communications & Systems (F), Datamat (I)
Associate partners:
• Finland: Helsinki Institute of Physics & CSC
• Sweden: Swedish Natural Science Research Council (Parallelldatorcentrum–KTH, Karolinska Institute)
• Istituto Trentino di Cultura
• Zuse Institut Berlin
• University of Heidelberg
• CEA/DAPNIA (F)
• IFAE Barcelona
• CNR (I)
• CESNET (CZ)
• KNMI (NL)
• SARA (NL)
• SZTAKI (HU)
The Data Grid Project - Summary
• European dimension
  – EC funding for 3 years, ~10M Euro
  – Closely coupled to several national initiatives
• Multi-science
• Technology leverage –
  – Globus, Condor, HEP farming & MSS, Monarc, INFN-Grid, Géant
• Emphasis –
  – Data, scaling, reliability
  – Rapid deployment of working prototypes, production quality
  – Collaboration with other European and US projects
• Status –
  – Started 1 January 2001
  – Testbed 1 scheduled for operation at the end of the year
• Open –
  – Open source and communication
  – Global GRID Forum
  – Industry and Research Forum
DataGRID Challenges
• Data
• Scaling
• Reliability
• Manageability
• Usability
Programme of work
Middleware – starting with a firm base in the Globus toolkit
• Grid Workload Management, Data Management, Monitoring services
Fabric
• Fully automated local computing fabric management
• Mass storage
Production quality testbed
• Testbed integration & network services
  – > 40 sites
  – Géant infrastructure
Scientific applications
• Earth Observation
• Biology
• High Energy Physics
Grid Middleware
Building on an existing framework (Globus)
• workload management
  – The workload is chaotic – unpredictable job arrival rates and data access patterns
  – The goal is maximising the global system throughput (events processed per second)
  – Start with Condor Class-Ads
• Current issues
  – Declaration of data requirements at job submission time: the application discovers the objects it requires during execution, and those objects must be mapped to the files managed by the Grid
  – Decomposition of jobs (e.g. moving jobs to where the data is)
  – Interactive workloads
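To give a feel for the Class-Ad approach mentioned above, here is a simplified, self-contained sketch (an illustration in Python, not Condor's or DataGRID's actual syntax; all attribute names are made up): jobs and machines both advertise attributes together with a requirements predicate, and the matchmaker pairs mutually acceptable advertisements, ranked by the job's preference.

    # Toy ClassAd-style matchmaking (assumed, illustrative only).
    job_ad = {
        "Owner": "physicist",
        "ImageSize_MB": 512,
        "InputData": ["lfn:run042/esd_001"],       # data needs declared up front
        "Requirements": lambda machine: machine["Arch"] == "INTEL"
                                        and machine["Memory_MB"] >= 512,
        "Rank": lambda machine: machine["MIPS"],    # prefer the fastest match
    }

    machine_ads = [
        {"Name": "node01.cern.ch", "Arch": "INTEL", "Memory_MB": 1024, "MIPS": 400,
         "Requirements": lambda job: job["ImageSize_MB"] <= 1024},
        {"Name": "node07.in2p3.fr", "Arch": "INTEL", "Memory_MB": 256, "MIPS": 600,
         "Requirements": lambda job: True},
    ]

    def match(job, machines):
        """Return the mutually acceptable machine with the highest Rank."""
        acceptable = [m for m in machines
                      if job["Requirements"](m) and m["Requirements"](job)]
        return max(acceptable, key=job["Rank"], default=None)

    best = match(job_ad, machine_ads)
    print(best["Name"] if best else "no match")   # -> node01.cern.ch (enough memory)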
Data Management & Application Monitoring
• data management
  – Management of petabyte-scale data volumes, in an environment with limited network bandwidth and heavy use of mass storage (tape)
  – Caching, replication, synchronisation
  – Support for the object database model
• application monitoring
  – Tens of thousands of components, thousands of jobs and individual users
  – End user: tracking the progress of jobs and aggregates of jobs
  – Understanding application-level and grid-level performance
  – Administrator: understanding which global-level applications were affected by failures, and whether and how to recover
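As a concrete sketch of the replication trade-off (an illustration in Python, not a DataGRID component; the catalogue layout, bandwidth figures and tape penalty are assumptions): given several physical copies of one logical file, a replica manager has to weigh network bandwidth against the cost of recalling data from tape.

    # Replica selection under limited bandwidth and heavy tape use (assumed numbers).
    replica_catalogue = {
        "lfn:run042/raw_017": [
            {"site": "cern", "medium": "tape", "bandwidth_MBps": 20},
            {"site": "lyon", "medium": "disk", "bandwidth_MBps": 5},
            {"site": "cnaf", "medium": "disk", "bandwidth_MBps": 12},
        ],
    }

    TAPE_STAGE_PENALTY_S = 300        # assumed average tape mount/stage latency

    def best_replica(lfn: str, size_GB: float, local_site: str):
        """Estimate the transfer time for each replica and return the cheapest.
        A disk copy already at the local site costs (almost) nothing."""
        def est_seconds(rep):
            if rep["site"] == local_site and rep["medium"] == "disk":
                return 0.0
            seconds = size_GB * 1024 / rep["bandwidth_MBps"]
            if rep["medium"] == "tape":
                seconds += TAPE_STAGE_PENALTY_S
            return seconds
        return min(replica_catalogue[lfn], key=est_seconds)

    print(best_replica("lfn:run042/raw_017", size_GB=2.0, local_site="padova"))
    # -> the CNAF disk copy: 2 GB over 12 MB/s beats 5 MB/s, and beats a tape recall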
Fabric Management
Local fabric –
• Effective local site management of giant computing fabrics
  – Automated installation, configuration management, system maintenance
  – Automated monitoring and error recovery – resilience, self-healing
  – Performance monitoring
  – Characterisation, mapping, management of local Grid resources
• Mass storage management
  – multi-PetaByte data storage; tapes expected to be used only for archiving data
  – "real-time" data recording requirement
  – active tape layer – 1,000s of users
  – uniform mass storage interface
  – exchange of data and meta-data between mass storage systems
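The automation theme above lends itself to a small sketch (an illustration in Python; the metrics, thresholds and repair actions are assumed, not the DataGRID fabric tools) of a single monitor-diagnose-repair sweep over a farm, the kind of loop that has to replace manual intervention at this scale.

    # One self-healing pass over a farm of nodes (assumed metrics and actions).

    def diagnose(metrics: dict):
        """Map raw node metrics onto a named fault, or None if the node is healthy."""
        if not metrics["daemon_ok"]:
            return "daemon_dead"
        if metrics["disk_used_frac"] > 0.95:
            return "disk_full"
        if metrics["load"] > 50.0:
            return "overloaded"
        return None

    # Each fault maps to an automatic repair action (here just a label).
    REPAIRS = {
        "daemon_dead": "restart batch daemon",
        "disk_full":   "clean scratch space",
        "overloaded":  "drain node (stop scheduling new jobs)",
    }

    def self_healing_pass(farm: dict) -> list:
        """One monitoring sweep over {node_name: metrics}; return the repair log.
        Anything without a known repair would be escalated to an operator."""
        log = []
        for name, metrics in farm.items():
            fault = diagnose(metrics)
            if fault is not None:
                log.append((name, fault, REPAIRS.get(fault, "escalate to operator")))
        return log

    farm = {
        "node001": {"daemon_ok": True,  "disk_used_frac": 0.40, "load": 3.0},
        "node002": {"daemon_ok": False, "disk_used_frac": 0.40, "load": 1.0},
        "node003": {"daemon_ok": True,  "disk_used_frac": 0.99, "load": 2.0},
    }
    print(self_healing_pass(farm))
    # -> [('node002', 'daemon_dead', 'restart batch daemon'),
    #     ('node003', 'disk_full', 'clean scratch space')]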
Infrastructure
Operate a production quality trans-European "testbed"
• Interconnecting clusters in about 40 sites
• Integrate, build and operate successive releases of the project middleware
• Negotiate and manage the network infrastructure
  – Initially TEN-155, migrating to Géant
• Demonstrations, data challenges → performance, reliability
• Production environment for applications
• Inter-working with US projects (GriPhyN, PPDG)
Testbed Sites (>40)
[Map of the testbed sites (HEP sites and ESA sites), including: Dubna, Moscow, Lund, RAL, Estec, KNMI, Berlin, Prague, Brno, IPSL, Paris, Lyon, Grenoble, Marseille, CERN, Santander, Madrid, Barcelona, Valencia, Lisboa, Milano, Torino, PD-LNL, BO-CNAF, Pisa, Roma, ESRIN, Catania]
Applications
• HEP
  – The four LHC experiments
  – Live proof-of-concept prototype of the Regional Centre model
• Earth Observation
  – ESA-ESRIN
  – KNMI (Dutch meteo): climatology
  – Processing of atmospheric ozone data derived from the ERS GOME and ENVISAT SCIAMACHY sensors
• Biology
  – CNRS (France), Karolinska (Sweden)
  – Application being defined
Challenges
• Large, diverse, dispersed project
  – but coordinating this European activity is one of the project's raisons d'être
• Collaboration and convergence with US and other Grid activities – this area is very dynamic
• Organising adequate network bandwidth – a vital ingredient for the success of a Grid
• Keeping the feet on the ground – the Grid is a good idea, but not the panacea suggested by some
Concluding remarks
• The vision is easy and reliable access to very large, shared, worldwide distributed computing facilities, without the user having to know the details
• The DataGRID project will provide a large (in capacity and geography) working testbed, and practical experience and tools that can be adapted to the needs of a wide range of scientific and engineering applications