Distributed Storage
Wahid Bhimji
Outline:
• Update on some issues mentioned last time:
• SRM; DPM collaboration
• Federation: WebDAV/xrootd deployment status
• Other topics: Clouds and “Big Data”
Update on issues
• SRM: still required, but there is progress towards removing the need for it for disk-only storage: GDB talk
DPM (Collaboration)
• "Agreed" at the DPM workshop at LAL, 04/12/2012:
• "more than adequate level of effort was pledged in principle to maintain, develop and support the DPM as an efficient storage solution."
• Strong commitment from TW; core support from CERN; decent commitments from France and from us: we didn't really evaluate transitioning – now there seems to be no need to.
• In practice CERN are currently providing more than promised – but a further reorganisation is coming and the previous lead developer (Ricardo) has left.
• Xrootd now working well in DPM (see federation slides)
• DMLite is in production (but only actually used for WebDAV)
  • Still minimal experience/testing: HDFS/S3 apparently works; Lustre not finished.
Federation: what is it
ATLAS project called "FAX"; the CMS equivalent is called "AAA" (Any data, Anytime, Anywhere)
Description (from the FAX Twiki):
The Federated ATLAS Xrootd (FAX) system is a storage federation that aims to bring Tier-1, Tier-2 and Tier-3 storage together as if it were a single giant storage system, so that users do not have to think about where the data is or how to access it. Client software such as ROOT or xrdcp interacts with FAX behind the scenes and reaches the data wherever it is in the federation.
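As a hedged illustration of that transparent access, the sketch below opens a file through the federation from Python/PyROOT. The regional redirector name is the UK one that appears later in these slides; the dataset path and the assumption of an existing grid proxy are purely illustrative.

# Minimal sketch: open a file through the FAX federation exactly as one
# would open a local file, letting the redirector locate a replica.
# Assumptions: a valid ATLAS grid proxy already exists in the environment,
# and the global-namespace path below is a made-up example.
import ROOT

redirector = "atlas-xrd-uk.cern.ch"                  # UK regional redirector
lfn = "/atlas/rucio/data12_8TeV/EXAMPLE.root"        # hypothetical path

f = ROOT.TFile.Open("root://%s/%s" % (redirector, lfn))
if f and not f.IsZombie():
    f.ls()       # contents are read over the WAN, wherever the replica lives
    f.Close()
else:
    print("Could not open the file via the federation")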
Goals (from Rob Gardner’s talk at Lyon Federation mtg. 2012). Similar to CMS:
• Common ATLAS namespace across all storage sites, accessible from anywhere;
• Easy-to-use, homogeneous access to data;
• Use as failover for existing systems;
• Gain access to more CPUs using WAN direct read access;
• Use as a caching mechanism at sites to reduce local data management tasks.
ATLAS FAX Deployment Status
• US fully deployed: Scalla, dCache and Lustre (though not StoRM)
• DPM has a nice xrootd server now: details
  • UK have been a testbed for this – but it is now an entirely YAIM-based setup (since 1.8.4): https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Xroot/Setup
  • Thanks to David Smith all issues (in xrootd, not DPM) are solved.
• CASTOR required a kind of custom setup by the Tier-1 – works now
• Regional redirector set up for the UK: physically at CERN
• UK sites working now:
  • DPM: Cambridge; ECDF; Glasgow; Lancaster; Liverpool; Manchester; Oxford
    • EMI push means all sites could install now (but it's still optional)
  • Lustre: QMUL – working
  • dCache: RalPP – in progress
  • CASTOR: RAL – working
CMS AAA Deployment Status
Andrew Lahiff
• Site status:
• xrootd fallback:
• Enabled at all US Tier-2s
• UK/European sites being actively pursued & tested
• xrootd access:
• Enabled at some Tier-2s, FNAL, RAL
• Sites being encouraged to provide access
• xrootd fallback & access SAM tests not yet critical
• UK Sites: RalPP; IC and Brunel all have fallback and access enabled
• Usage:
• User analysis jobs (ongoing)
• Central MC & data reprocessing (started a few weeks ago)
• Running workflows which read input data from remote sites using xrootd fallback
• Makes better use of available resources without moving data around using FTS or
wasting tape at Tier-1s.
CMS xrootd usage
• Data reprocessing running at CNAF, reading data using xrootd "fallback"
• > 1000 slots
• > 500 MB/s aggregate
• ~25 TB of data copied
• 350 kB/s/core
• CPU efficiency > 95%
CMS remote usage – last week
• ~350 TB: big time – mostly in the US
ATLAS remote usage
• ~15 TB: small fry – currently tests and users in the know. But it works, and UK sites are up there.
What about http://
• Storage federation based on HTTP (WebDAV) has been demonstrated – see for example Fabrizio's GDB talk
• DPM has a WebDAV interface, as does dCache. StoRM has just released something – testing at QMUL. (CASTOR (and EOS) not yet.)
• Sits with wider goals of interoperability with other communities.
• However, it doesn't have the same level of HEP/LHC uptake so far.
• ATLAS, however, want to use it within Rucio
  • Initially for renaming files – but wider uses are envisaged, e.g. user download
• The UK ironed out a few problems with the DPM server for ATLAS use
  • Those fixes will be in 1.8.7 – not yet released.
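Purely as a sketch of what client-side WebDAV access looks like, the snippet below fetches a file from a DPM WebDAV endpoint with plain HTTPS tooling. The host, path and certificate locations are hypothetical placeholders rather than anything from these slides, and an unencrypted key plus relaxed CA verification are assumed only to keep the example short.

# Hedged sketch: download a file from a (hypothetical) DPM WebDAV endpoint
# using an X.509 client certificate with the Python requests library.
import requests

url = ("https://dpm.example.ac.uk/dpm/example.ac.uk/home/atlas/"
       "user/someuser/test.root")                    # hypothetical DPM path
cert = ("/home/user/.globus/usercert.pem",           # client certificate
        "/home/user/.globus/userkey.pem")            # assumes an unencrypted key

r = requests.get(url, cert=cert, verify=False, stream=True)   # verify=False only for brevity
r.raise_for_status()
with open("test.root", "wb") as out:
    for chunk in r.iter_content(chunk_size=1 << 20):          # 1 MB chunks
        out.write(chunk)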
Other things
• Cloud Storage:
• Plan for testing within GridPP (Some existing studies at CERN)
• Definition: “resources which have an interface like those provided by
current commercial cloud resource providers”
• Goal: to allow use of future resources that are provided in this way.
• Testing: transfers and direct reading (ROOT S3 plugin) – see the transfer sketch after this list.
• Status: set up on the IC Swift instance:
  • Copying works.
  • Added as a pool to the test DPM at Edinburgh – almost works.
• “Big Data”:
• "Big Data" is not just a buzzword – it is now business-as-usual in the private sector.
• Why does HEP share little of the same toolkit?
• Explore via joint-workshops + development activities: real physics usecases.
• Both these are on the topic list for WLCG Storage/Data WG.
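To make the cloud-storage "transfers" test concrete, here is a minimal sketch using the boto library against an S3-compatible endpoint (e.g. Swift with S3 middleware). The host, bucket, file names and credentials are placeholders, not details of the actual IC Swift setup.

# Hedged sketch of the transfer test against an S3-compatible endpoint.
# All names and credentials below are placeholders.
from boto.s3.connection import S3Connection, OrdinaryCallingFormat

conn = S3Connection(
    aws_access_key_id="ACCESS_KEY_PLACEHOLDER",
    aws_secret_access_key="SECRET_KEY_PLACEHOLDER",
    host="swift.example.ac.uk",                # hypothetical Swift S3 endpoint
    calling_format=OrdinaryCallingFormat(),
)

bucket = conn.create_bucket("gridpp-test")     # or conn.get_bucket(...) if it already exists

# Upload (the "copying works" step) ...
key = bucket.new_key("test.root")
key.set_contents_from_filename("local_test.root")

# ... and pull it back to check the round trip.
key.get_contents_to_filename("roundtrip_test.root")

Direct reading in the tests uses the ROOT S3 plugin instead (opening an s3:// URL from ROOT); the credential handling there depends on the local ROOT build, so it is not sketched here.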
Conclusion/ Discussion
• Future of DPM is not so much of a worry
• Reduced reliance on SRM will offer flexibility in storage solutions
• Federated xrootd deployment is well advanced (inc in UK)
• Seems to work well in anger for CMS (aim 90% T2s by June);
• So far so good for ATLAS – though not pushing all the way;
• LHCb also interested ("fallback onto other storage elements only as exception")
• WebDAV option kept alive by ATLAS Rucio: https://www.gridpp.ac.uk/wiki/WebDAV#Federated_storage_support
• Cloud Storage: Worth exploring – have just started
• Big Data: surely an "Impact" in the room
Backup
Federation traffic
• Modest levels now – will grow when in production
• In fact, including local traffic, UK sites dominate
• Oxford and ECDF switched to xrootd for local traffic
• Systematic FDR load tests in progress
EU cloud results
[Plot and tables, from a slide by I. Vukotic: remote-read rates (events/s and MB/s) for a test reading 10% of events with a 30 MB TTC, between BNL-ATLAS, CERN-PROD, ECDF, ROMA1 and QMUL as source and destination sites.]
• Absolute values are not important (affected by CPU/HT etc. and the QMUL setup)
• The point is that remote read can be good, but it varies
Cost matrix measurements
Cost-of-access: pairwise network links, storage load, etc.
FAX by destination cloud
Testing and Monitoring
https://twiki.cern.ch/twiki/bin/view/Atlas/MonitoringFax
FAX specific:
http://uct3-xrdp.uchicago.edu:8080/rsv/ (basic status)
http://atl-prod07.slac.stanford.edu:8080/display (previous page)
Now in normal ATLAS monitoring too:
• http://dashb-atlasssb.cern.ch/dashboard/request.py/siteview?view=FAXMON#currentView=FAXMON&fullscreen=true&highlight=false
• http://dashb-atlas-xrootd-transfers.cern.ch/ui/
ATLAS FAX structure
• Cloud-to-global topology of "regional" redirectors:
  • atlas-xrd-uk.cern.ch
  • atlas-xrd-eu.cern.ch
  • atlas-xrd-de.cern.ch
• Start locally – expand the search as needed (see the sketch below)
• Needs a "Name2Name" LFC lookup (unlike CMS) – probably not needed soon (with Rucio)
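The loop below is only a hedged client-side illustration of that "start locally, expand as needed" order; in production the redirector hierarchy itself forwards the search upwards. The local site door and the file path are hypothetical, while the regional redirector names are the ones listed above.

# Hedged illustration of widening the search from a local xrootd door to the
# regional and then European redirectors.  In reality the redirectors chain
# these lookups themselves; this just mimics the same order from the client.
import ROOT

lfn = "/atlas/rucio/data12_8TeV/EXAMPLE.root"    # hypothetical global-namespace path
endpoints = [
    "xrootd.mysite.ac.uk",       # local site door (hypothetical)
    "atlas-xrd-uk.cern.ch",      # UK regional redirector
    "atlas-xrd-eu.cern.ch",      # European redirector
]

f = None
for host in endpoints:
    f = ROOT.TFile.Open("root://%s/%s" % (host, lfn))
    if f and not f.IsZombie():
        print("Opened via %s" % host)
        break
    f = None

if f is None:
    print("File not found anywhere in the federation")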