The USCMS Integration Grid Testbed (IGT)


Grids Now:
The USCMS Integration Grid Testbed
and the European Data Grid
Michael Ernst
6-Mar-2003

Why Grids?

Efficient sharing of distributed heterogeneous compute and storage resources:
– Virtual Organizations and institutional resource sharing
– Dynamic reallocation of resources to target specific problems
– Collaboration-wide data access and analysis environments

Grid solutions need to be scalable and robust:
– Must handle petabytes per year
– Tens of thousands of CPUs
– Tens of thousands of jobs

The grid solutions presented here are supported in part by the EDG, DataTAG, PPDG, GriPhyN, and iVDGL.

We are learning a lot from these current efforts for LCG-1!

Introduction

US CMS is positioning itself to be able to learn, prototype, and develop while providing a production environment that caters to CMS, US CMS, and LCG demands:
– R&D
– Integration
– Production

The three-phase approach is not a new idea, but it is mission critical!
– Development Grid Testbed (DGT), Integration Grid Testbed (IGT), and Production Grid

Commissioning with CMS Monte Carlo Production

The USCMS IGT is producing:
– 1M Egamma “BigJets” events, from the generation stage all the way through analysis Ntuples
– Time is dominated by the GEANT simulation, but no intermediate data is kept
– Must run on RedHat Linux 6.1 resources because of the Objectivity license

The EDG is producing:
– 500K Egamma “BigJets” events at the CMSIM stage only
– 1M Egamma “BigJets” events at the CMSIM stage only

Comparison of the requests:
– CMSIM-only involves large intermediate data transfers (~1.5 TB)
– Full simulation through to Ntuples involves a more complex workflow

Special IGT Software: MOP

MOP is a system for packaging production processing jobs into DAGMan job descriptions. DAGMan jobs are Directed Acyclic Graphs (DAGs).

[Diagram: the Master Site runs IMPALA, mop_submitter, DAGMan, Condor-G, and GridFTP; jobs and data flow over GridFTP to the batch queues at Remote Sites 1..N.]

MOP uses the following DAG nodes for each job:
– Stage-in: stages needed application files, scripts, and data in from the submit host
– Run: the application(s) run on the remote host
– Stage-out: the produced data is staged out from the execution site back to the submit host
– Clean-up: temporary areas on the remote site are cleaned
– Publish: data is published to a GDMP replica catalogue after it is returned
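
As a rough illustration, each such per-job DAG can be written as a linear chain of the five nodes in standard DAGMan syntax. The node and submit-file names below are hypothetical, not MOP's actual output:

    # One MOP-style DAG per production job: each JOB line points at a
    # Condor submit description; PARENT/CHILD lines order the chain.
    JOB stagein   stagein.sub
    JOB run       run.sub
    JOB stageout  stageout.sub
    JOB cleanup   cleanup.sub
    JOB publish   publish.sub
    PARENT stagein  CHILD run
    PARENT run      CHILD stageout
    PARENT stageout CHILD cleanup
    PARENT cleanup  CHILD publish
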
Grid Resources

The USCMS Integration Grid Testbed (IGT) comprises:
– About 230 CPUs (750 MHz equivalent, RedHat Linux 6.1)
– An additional 80 CPUs at 2.4 GHz running RedHat Linux 7.X
– About 5 TB of local disk space, plus Enstore mass storage at FNAL using dCache
– These are the combined USCMS Tier-1/Tier-2 resources: Caltech, Fermilab, U Florida, UC San Diego, (UW Madison)

The European Data Grid (EDG) comprises:
– About 350 CPUs (750-1000 MHz equivalent, mostly RedHat Linux 6.2)
– Up to 400 additional CPUs at Lyon that are shared
– Sites: CERN, CNAF, RAL, (NIKHEF), Legnaro, Lyon
– About 3.7 TB of local disk space, plus CASTOR mass storage at CERN and HPSS at Lyon

Current Grid Software

Both grids use Globus and Condor core middleware.

USCMS Integration Grid Testbed (IGT):
– Using Virtual Data Toolkit (VDT) 1.1.3
– “Bottom-up” approach
– Advantage: bugs have been shaken out of the core middleware products

European Data Grid:
– Using EDG release 1.3
– Enhanced functionality, e.g. using multiple Resource Brokers that rely on real-time monitoring information to schedule jobs
– “Top-down” approach
– Advantage: more functionality has been added
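
For context, jobs reach the EDG Resource Broker as JDL (Job Description Language) files. The sketch below shows the general shape of such a file for a CMSIM-style job; the file names, requirement, and rank expressions are invented for illustration:

    Executable    = "cmsim_wrapper.sh";
    StdOutput     = "cmsim.out";
    StdError      = "cmsim.err";
    InputSandbox  = {"cmsim_wrapper.sh", "cmsim.cards"};
    OutputSandbox = {"cmsim.out", "cmsim.err"};
    Requirements  = other.OpSys == "RH 6.2";
    Rank          = other.FreeCPUs;

The Requirements and Rank expressions are what let the broker match jobs against the real-time monitoring information mentioned above.
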
The CMS IGT “Stack”

[Diagram: the nine-layer stack]
– Application: CMS Applications
– Job Creation: MCRunJob
– DAG Creation: MOP
– Job Submission: DAGMan/Condor-G
– Globus (this and the Job Submission layer are packaged by the VDT)
– Network
– Job Manager: Globus/GRAM
– Farm/Batch System: FBS/PBS/Condor
– Mass Storage System: dCache+Enstore

The CMS IGT “stack” comprises nine layers. The Application layer contains only CMS executables. The Job Creation layer comprises the CMS-provided tools MCRunJob and Impala; neither is specifically “grid aware.” Then there is a DAG Creation layer and a Job Submission layer; both functionalities are provided by MOP. Jobs are submitted to DAGMan which, through Condor-G, manages jobs run on remote Globus Job Managers. Finally, there is a local farm or batch system used by Globus GRAM to manage jobs. In the case of the IGT, the local batch manager was always FBSNG or Condor. Scheduling and integrated monitoring are not present.
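
To make the Job Submission layer concrete: a Condor-G submit description of roughly this shape hands each node to a remote Globus gatekeeper (the hostname, jobmanager name, and file names below are placeholders):

    universe        = globus
    globusscheduler = gatekeeper.example.edu/jobmanager-condor
    executable      = run_node.sh
    output          = node.out
    error           = node.err
    log             = node.log
    queue

DAGMan simply releases each node's submit file once its parent nodes have completed.
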
Monitoring

MonALISA is used as the primary IGT Grid-wide monitor:
– Physical parameters: CPU load, network usage, disk space, etc.
– Dynamic discovery of monitoring targets and schema
– Implemented in Java/Jini with SNMP local monitors
– Interfaces to/from other monitoring packages

The EDG uses two sources of monitoring:
– EDG Monitoring System (based on the Globus Metadata Directory Service)
  – Physical parameters: CPU load, network usage, disk space, etc.
– BOSS (Batch Object Submission Service)
  – Application level: running time, size of output, job progress, etc.
  – Stores information in a MySQL DB in real time
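
To make the BOSS idea concrete, here is a minimal Python sketch of application-level metrics being journaled to MySQL in real time. The table, columns, and connection details are hypothetical, not BOSS's actual schema:

    import time
    import MySQLdb  # assumes the MySQL-python client module is available

    db = MySQLdb.connect(host="monhost", user="mon", passwd="secret", db="jobmon")
    cur = db.cursor()

    def report(job_id, events_done, output_bytes):
        # Record one progress sample for a running job as it happens.
        cur.execute(
            "INSERT INTO job_progress (job_id, ts, events_done, output_bytes)"
            " VALUES (%s, %s, %s, %s)",
            (job_id, int(time.time()), events_done, output_bytes))
        db.commit()
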
ML Design Considerations

Act as a true dynamic service and provide the necessary functionality to be used by any other services that require such information (Jini, UDDI - WSDL / SOAP):
– a mechanism to dynamically discover all the "Farm Units" used by a community
– remote event notification for changes in any system
– a lease mechanism for each registered unit (see the sketch at the end of this slide)

Allow dynamic configuration of the list of monitored parameters.

Integrate existing monitoring tools (SNMP, Ganglia, Hawkeye, …).

Provide:
– single-farm values and details for each node
– network aspects
– real-time information
– historical data and extracted trend information
– listener subscription / notification
– (mobile) agent filters (algorithms for prediction and decision support)
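
A toy Python sketch of the lease mechanism only, assuming nothing about MonALISA's actual Java/Jini implementation: each registered farm unit must keep renewing its lease, so units that stop reporting drop out of the directory automatically.

    import time

    LEASE_SECONDS = 60
    registry = {}  # farm unit name -> lease expiry timestamp

    def register(unit):
        # Granting and renewing a lease are the same operation.
        registry[unit] = time.time() + LEASE_SECONDS

    def active_units():
        # Drop units whose lease expired, i.e. that stopped renewing.
        now = time.time()
        for unit, expiry in list(registry.items()):
            if expiry < now:
                del registry[unit]
        return sorted(registry)
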
IGT Results (so far)

– Time to process 1 event: 500 sec @ 750 MHz
– Speedup: average factor of 100 speedup during the current run
– Resources: approximately 230 CPUs @ 750 MHz equiv.
– Sustained efficiency: about 43.5%

[Plot: cumulative IGT production vs. time, beginning in October]
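
As a consistency check (my arithmetic, not from the slide), the quoted numbers hang together:

    effective CPUs ≈ 230 × 0.435 ≈ 100          (the “factor of 100” speedup)
    event rate     ≈ 100 / 500 s ≈ 0.2 evt/s    (≈ 17,000 events/day)
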
EDG CMSIM events vs. time

[Plot: EDG CMSIM events produced vs. time]
Average CPU utilization was “about the same”.

Analysis of Problems on IGT

The usual servers die and need to be restarted:
– But nothing that seems to be related to the load...

Failure semantics currently lean heavily towards automatic resubmission of failed jobs:
– Sometimes failures are not recognized right away

Need a better system for spotting chronic problems:
– BOSS does this already; we aren't using it because it was still under development when we were planning the IGT

Problems must be better routed to the “right” people:
– At one point, an application problem was misdiagnosed early on as a middleware issue
– Partly because the IGT is currently run by middleware developers!
– Once the application expert looked into it, the problem was solved in 90 minutes

Analysis of Problems on EDG

The biggest problems related to the Information System:
– Symptom: no resources are found. Cause: instability of the MDS when it is overloaded. Solution: submitting jobs at a lower rate improves the chances of success.
– Symptom: the RB gets stuck (no job ever starts). Cause: investigating...
– Symptom: grid elements disappear from the II. Cause: services on some machines stopped working. Solution: restart the services.

Problems related to the replica manager:
– Symptom: timeouts when copying the input sandbox
– Symptom: log file lost (“Stdout does not contain useful data”). Cause: several (no free files/inodes, broken connection between CE & RB, …)
– Symptom: file registration in the RC fails from time to time

None of these problems is a show-stopper, and they happen in just a fraction of the jobs!
Fixes are already there for some of them (but not yet deployed).

The Integration Grid Testbed (IGT)

Resource allocations (1 GHz equiv. CPUs) in 2002/2003 for the IGT and Production Grid. (R&D Grid not included.)

Site       2002(IGT)  2002(PG)  2003(New)  2003(IGT)  2003(PG)
FNAL           60         0       260         10        310
Florida        80         0       175          5        250
Caltech       120         0        88          5        203
UCSD          128         0        88          5        211
Total         388         0       611         25        974

New resources for Tier-2 are from iVDGL.

Comparison to CMS Spring02

This is really apples and oranges, but...

Average CPU utilization was about 20-25% during Spring02 over all participating CMS Regional Centers:
– Though it is impossible to extract a concrete number...
– It really isn't known how many Spring02 CPUs should be counted in the denominator; an estimated 700-1000 CPUs
– File transfers were much more complicated during Spring02
– Objectivity data was kept and archived
– Different events were processed at different steps

But the current efforts show that the Grids are in the same ballpark!

We still need a factor of ~2.5 for DC04 production:
– Want slightly better than 1 evt/sec

Manpower Results (so far)

Estimates of manpower for the IGT:
– 2.65 FTE equivalent during the initial phase and debugging
  – Reported voluntarily in response to a general query
– 1.1 FTE equivalent during smooth running periods
  – The manager plus periodic small file transfers
– Expected to be less than 1 FTE when regular shift procedures are adopted

Caveats:
– This is the “second” attempt for the IGT. The first attempt last spring needed more manpower.
– We STILL have a rapidly changing middleware environment
– This is not counting general sysadmin support
– This is really saying that production operations are becoming less specialized

EDG began in early December (first attempts):
– No manpower estimates yet.
– A task force has been set up, including reps from EDG, CMS, and LCG.

Next steps

For the EDG:
– Deploy the fixes for the problems encountered so far
– Put the online monitoring in place

For the IGT:
– Deploy fixes for problems uncovered so far
– Deploy more functionality

IGT and EDG are preparation for LCG-1:
– Recently, LCG started participating in the IGT: 50 nodes running CMSIM-only production
– Getting “top” and “bottom” to meet in the middle

Conclusion

Our approach to developing the software systems for the distributed data processing environment adopts “rolling prototyping”:
– Analyze current practices
– Prototype the distributed processing environment
– Software support and transitioning
– Servicing external milestones

The next prototype system to be delivered is the US CMS contribution to the LCG Production Grid (June 2003):
– CMS will run a large Data Challenge on that system to prove the computing systems (including a new object storage solution?)

This scheme allows us to flexibly react to technology developments AND to changing and developing external requirements.