Snapshot of the D0 Computing and Operations Planning Process


Accelerator Based Physics:
ATLAS, CDF, CMS, D0, STAR
Amber Boehnlein
OSG Consortium Meeting
January 24, 2006
Particle Physics


These five experiments are physics facilities intended to test the Standard Model.
What are the questions?

What causes electroweak symmetry breaking?
Does Quantum Chromodynamics precisely describe the behavior of quarks and gluons?
What is the mechanism of CP violation?
What is the wave function of the proton; of a heavy nucleus?

...




What we measure:

The production and decay of particles and associated properties






Cross sections, spectra measurements (E, pT, eta, ...), angular distributions, particle correlations
The top quark mass and its properties
Properties of the electroweak bosons
Flavor physics; mixing
…
What we seek:


Higgs Boson
SUSY and other new phenomena beyond the Standard Model
The Road to Physics
passes through software and computing …
Monte Carlo:
• event generation
• Geant detector simulation
• fast simulations
• …
Raw Data:
• detector
• trigger system
• data acquisition
• …
Calibration:
• pedestals
• gains, linearity
• …
Reconstruction (RECO):
• detector algorithms
• particle identification
• production farm
• user-ready data format
• …
Physics Analysis:
• event selections
• efficiencies & backgrounds
• …
Databases, network, releases, operation, data handling & access, trigger simulations, luminosity
OSG is a road to Physics
Monte Carlo:
• ATLAS, CMS, CDF, D0, STAR
• …
Raw Data:
• detector
• trigger system
• data acquisition
• …
Calibration:
• pedestals
• gains, linearity
• …
Reconstruction (RECO):
• ATLAS, CDF, CMS
• D0: reconstruction from raw and from derived data
• STAR
• …
Physics Analysis:
• ATLAS, CMS, CDF, STAR
Implications



• Calibration database connectivity, via some mechanism, is essential for reconstruction.
• "User" application code/macros are distributed as self-contained tarballs or as an advertised local installation of the code distribution.
• Computations can be compute intensive:
  - ALPGEN simulates multi-parton processes well, but is much slower than other standard packages.
  - Flagship analysis: CDF estimates 84 GHz-years for its top mass and cross section analyses (manipulating about 10 TB of data).
• Computations can be data intensive:
  - Reconstruction typically processes GBs of input data per job and produces GBs of output.
  - Jobs run over terabytes of input data, clustered into datasets of hundreds of GB for bookkeeping purposes.
  - Job management is shaped around this clustering, resulting in bursts of hundreds of local jobs submitted at the same time (see the sketch after this list).
  - Jobs typically run for several hours and require external network connectivity.
  - For efficient storage, output files might require merging.
• OSG provides a maturing infrastructure to run within this paradigm:
  - Resources are made available via standard interfaces for job and data management.
  - Operational issues include time synchronization for security and local scratch management.
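As a rough illustration of the dataset clustering and burst-style submission described in the list above, the sketch below packs a hypothetical file catalog into datasets of a few hundred GB and maps each dataset onto one burst of jobs. The file names, sizes, 200 GB target, and the print-based "submit" step are all invented for illustration; this is not SAM, SAMGrid, or any experiment's actual bookkeeping or submission tool.

```python
# Illustrative sketch only: group a large file catalog into datasets of a few
# hundred GB each, then treat each dataset as one burst of jobs. The file
# names, sizes, and 200 GB target are hypothetical, and the "submit" step is
# just a print; this is not any experiment's actual tooling.

TARGET_DATASET_BYTES = 200 * 10**9  # "hundreds of GB" per dataset, per the slide


def cluster_files(files, sizes):
    """Greedily pack (file, size) pairs into datasets of roughly the target size."""
    datasets, current, current_size = [], [], 0
    for name, size in zip(files, sizes):
        if current and current_size + size > TARGET_DATASET_BYTES:
            datasets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        datasets.append(current)
    return datasets


if __name__ == "__main__":
    # Hypothetical catalog: 2,000 files of ~1 GB each, i.e. ~2 TB of input.
    files = [f"raw_{i:05d}.dat" for i in range(2000)]
    sizes = [10**9] * len(files)
    for i, dataset in enumerate(cluster_files(files, sizes)):
        # One local job per input file, submitted dataset by dataset ->
        # bursts of hundreds of jobs at a time, as described above.
        print(f"dataset {i}: {len(dataset)} files -> submit {len(dataset)} jobs in one burst")
```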
Operations
• CDF, D0, STAR
  - Mature experiments accumulating ~1 PB/year, billions of events, millions of files…
  - Well-established and stable applications
  - Anticipating upgrades in detectors and luminosity
  - All depend on distributed computing
• ATLAS, CMS
  - Use MC data challenges and test beam data to test infrastructure and prepare for physics
  - Cosmic ray commissioning
  - Computing scales up dramatically compared to the current experiments in all dimensions, including the number of collaborators
• My thanks to all those who contributed to this talk!
CDF Operational Modes
• OSG for MC production
  - Targeting other production-chain tasks such as generating user-level ntuples
  - Condor-G submission
  - Self-contained tarball for production applications
  - DB access via Squid server or direct connection to FNAL (see the sketch below)
• Pursuing user analysis using "glide CAF"
  - Provides a familiar user environment
  - Investigating user-level mounting of a remote filesystem over HTTP, with local Squid servers for caching, to provide the flexibility of the full CDF software distribution
  - Will rely on SAM for data handling
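To make the Squid-based database access concrete, here is a minimal sketch of routing calibration reads through a local caching HTTP proxy. The proxy address and database endpoint are hypothetical placeholders, and the snippet shows only the general pattern, not CDF's actual database client.

```python
# Minimal sketch of reading calibration data through a local caching HTTP
# proxy (a Squid server), as described above. The proxy address and database
# endpoint are hypothetical placeholders, not CDF's actual services or client.

import urllib.request

SQUID_PROXY = "http://squid.site.example:3128"                  # assumed local Squid
CALIB_URL = "http://db-frontend.fnal.example/calib?run=123456"  # assumed endpoint


def fetch_via_proxy(url: str, proxy: str) -> bytes:
    """Fetch a URL with HTTP traffic routed through the given caching proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy}))
    with opener.open(url, timeout=30) as response:
        return response.read()


if __name__ == "__main__":
    payload = fetch_via_proxy(CALIB_URL, SQUID_PROXY)
    print(f"received {len(payload)} bytes of calibration data")
```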
CMS Operations

CMS relies on OSG for two significant activities:
• Centralized production of simulated events in the US
• Remote submission of user simulation to the US-CMS Tier-2 sites
  - User submission of jobs to access data published as being available at the site

CMS Simulated Event Production
• Over the last 4 years CMS has been successfully submitting simulated production jobs to distributed computing sites using ever-improving grid middleware
  - CMS performs both opportunistic submission to non-CMS sites and centralized submission by a dedicated team to US-CMS sites
  - CMS dedicated infrastructure initially, followed by Grid3, followed by OSG
• In 5 months in 2006 we expect to generate 50M events for the next challenge; the OSG share is 15M-20M

CMS Analysis Activities
• During the Worldwide LCG (WLCG) service challenge, CMS submitted analysis jobs to access local data
  - Thousands of jobs, tens of TB of data accessed
  - During the challenge, only dedicated expert users participated
• The next step will include normal users

CMS Simulation

• Submitted centrally from UFL by a dedicated team
• In a relatively quiet period for CMS over the final quarter of 2005, CMS ran 5M events with three processing steps on OSG resources (a rough per-event check follows below)
  - Adds up to about 1 FTE of effort spread over three people
  - Represents about 40 CPU-years of computing
• During the ramp-up for DC04, CMS utilized several hundred CPU-years
  - More than 100 CPU-years of that came from opportunistic resources
• CMS expects to generate a simulated sample roughly the size of the raw data at the start of running: about 800 TB per year of simulation by the start of high-luminosity running
  - The US-CMS contribution is roughly 30% of this
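A rough back-of-the-envelope check of the figures above, under the assumption that a "CPU-year" means one CPU running for a full calendar year; the per-event time is derived here and does not appear on the slide.

```python
# Rough consistency check of the numbers quoted above (derived here, not taken
# from the slide; assumes a "CPU-year" means one CPU running for a full year).

EVENTS = 5_000_000               # 5M events through three processing steps
CPU_YEARS = 40                   # ~40 CPU-years of computing
SECONDS_PER_YEAR = 365 * 24 * 3600

per_event = CPU_YEARS * SECONDS_PER_YEAR / EVENTS
print(f"~{per_event:.0f} CPU-seconds per event for the full chain")  # ~250 s/event
```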
CMS Analysis Experience

• During Service Challenge 3, CMS ran over 18k jobs on OSG-connected Tier-2 resources
  - 14k jobs completed, corresponding to ~20 TB; the total data read was ~20 TB, with data preloaded at the sites using PhEDEx
• Submission and completion efficiency still need to be improved
  - Many of the failures were uniquely attributable to CMS
• This was the first large-scale analysis attempt for CMS on OSG
• Opening OSG analysis to the whole collaboration and improving the user experience are part of the 2006 program of work
STAR Operations
SUMS-based (STAR Unified Meta-Scheduler)
• A high-level user JDL describes the task, the code needed, and the dataset; SUMS submits to appropriate sites depending on user resource requirements or hints
• Software assumed to be installed at the site
  - Input transferred as GRAM job input (archive/tarball)
  - Output transferred as GRAM job output
  - Integrated cataloging possible via RRS (Replica Registration Service), making this fully automated
MC
  - All MC production is SUMS-based
  - MC jobs only; nightly tests (QA) moved to the Grid
  - PACMAN packages available for STAR software for one OS (Linux)
  - Archive sandbox used for the specific codes [mostly used] (see the sketch below)
  - Assumes DB connectivity and outbound connections
  - More recently: SRM transfer of output
  - Job submission is Condor-G based
Plan to migrate all MC to OSG
• Offload from the Tier-0 and Tier-1 centers to ANY resource
• Allow Tier-2 sites to submit R&D simulations (RHIC-II detector simulation)
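As a small illustration of the archive/tarball input model mentioned above, the sketch below packs a hypothetical code directory into a self-contained sandbox suitable for shipping as GRAM job input. Paths and names are invented; this is not STAR's actual SUMS packaging.

```python
# Minimal sketch of building a self-contained "archive sandbox" (tarball) of
# user code to ship as GRAM job input, in the spirit of the transfer model
# described above. Paths and names are hypothetical; this is not STAR's
# actual SUMS tooling.

import tarfile
from pathlib import Path


def build_sandbox(code_dir: str, sandbox_path: str) -> str:
    """Pack everything under code_dir into a gzipped tarball for job input."""
    with tarfile.open(sandbox_path, "w:gz") as tar:
        tar.add(code_dir, arcname=Path(code_dir).name)
    return sandbox_path


if __name__ == "__main__":
    # Hypothetical layout: ./my_analysis holds macros, libraries, and a run script.
    sandbox = build_sandbox("my_analysis", "my_analysis_sandbox.tar.gz")
    print(f"sandbox ready to transfer as GRAM job input: {sandbox}")
```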
STAR Analysis Experience

• STAR has had a very positive user analysis experience, with ~10k jobs per user
  - User analysis is "expert" only
• STAR has a strong incentive to encourage generic users
  - Users are already severely constrained
  - Opportunistic computing for user analysis makes more sense at this stage (jobs are smaller, and both run time and input are adaptable to even the smallest site)
  - RHIC-II running will require more resources
• Data moved/relocated/managed on demand (in the background)
  - Generic user analysis would require a mechanism to locate "hot" datasets
  - Would need (require) SE-enabled sites and an asynchronous CPU/data transfer mechanism (like SRM now)
  - An RRS-like service is essential for automating data mining and registration on arrival (immediate access and exploitation)
* Concerned about user needs mismatching the available QoS and "help desk" support; OSG is our best hope.
D0 Operational Modes
• D0 depends on distributed computing for MC and production-chain activities
• Uses SAMGrid to submit jobs
  - SAMGrid can broker jobs or forward them
• Data handling via SAM
  - Datasets delivered to local cache
• Self-contained tarball distributed via SAM
• DB access via proxy servers
• Next steps will be towards targeted ID activities, such as jet energy scale determination to improve the systematic error (a quick check follows the result below):
M(top) = 169.5 +/- 3.0 (stat) +/- 3.2 (JES) +/- 1.7 (other) GeV
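A quick illustration of why the JES term is the target for improvement: combining the quoted uncertainties in quadrature (a standard assumption, not stated on the slide) shows that JES is the single largest contribution.

```python
# Quick illustration of why the jet energy scale (JES) term dominates the
# quoted M(top) uncertainty. Combining in quadrature is a standard assumption,
# not something stated on the slide.

stat, jes, other = 3.0, 3.2, 1.7   # GeV, from the result above
total = (stat**2 + jes**2 + other**2) ** 0.5
print(f"total uncertainty ~ {total:.1f} GeV, with {jes} GeV from JES alone")
```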
D0 Operations
• Monte Carlo production
• Reprocessing
  - Improved tracking and EM calorimeter calibration
  - ~1B-event effort using 4000 GHz of CPU equivalents for 9 months at 12 sites (3 OSG sites); a rough check follows at the end of this slide
  - Would have taken ~5 years on FNAL D0 dedicated resources
  - Calibration DB access via proxy servers
• Refixing
  - D0 applied a new hadronic calorimeter calibration in post-processing on FNAL dedicated analysis resources; a problem was found and the pass is being redone
  - Fixed some skims for immediate use
  - Six-week target using remote facilities
  - QCD sample processed on the CMS farm (an OSG site)
  - Full effort ramping up; CPU needs are on the same scale as reprocessing
• Moving aggressively to use 1000 GHz equivalents on OSG!
• Every D0 publication depends on Grid computing
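A rough consistency check of the reprocessing scale quoted above; the per-event cost and the implied size of the dedicated FNAL capacity are derived here under the stated assumptions and do not appear on the slide.

```python
# Rough consistency check of the reprocessing scale quoted above. The
# per-event figure and the implied dedicated capacity are derived here, not
# stated on the slide, and a month is approximated as 30 days of running.

EVENTS = 1e9            # ~1B events
CAPACITY_GHZ = 4000     # 4000 GHz of CPU equivalents
MONTHS = 9
SECONDS = MONTHS * 30 * 86400

print(f"~{CAPACITY_GHZ * SECONDS / EVENTS:.0f} GHz-seconds of reconstruction per event")
# "~5 years on dedicated resources" implies roughly
# 4000 GHz * 9 months / 60 months ~ 600 GHz of dedicated capacity at FNAL.
print(f"implied dedicated capacity: ~{CAPACITY_GHZ * MONTHS / 60:.0f} GHz equivalents")
```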
ATLAS Production Runs (2004-2005)
Grid Production            Worldwide                 U.S.                      U.S. Tier 2
                           Jobs (k)   Events (M)     Jobs (k)   Events (M)     % of U.S. jobs done by three U.S. T2 sites
Data Challenge 2 (DC2)     334        81             117        28             55%
Rome Physics Workshop      573        28             138        7              60%
• The U.S. Tier 2 role was critical to the success of ATLAS production
  - Over 400 physicists attended the Rome workshop; 100 papers were presented based on the data produced during DC2 and Rome production
  - The U.S. provided resources on an appropriate scale for U.S. physicists (60k CPU-days, >50 TB of data) and provided leadership roles in the organization of the challenges, in key software development, and in production operations
  - Production during DC2 and Rome established a hardened Grid3 infrastructure benefiting all participants in Grid3
Next ATLAS Production
• Formerly DC3, now Computing System Commissioning
  - Simulate 10^7 events (same order as DC2)
  - Full software commissioning, including calibration and alignment
• Will need ~2000 CPUs in the U.S. continuously in 2006
  - OSG opportunistic resources will provide an important part of this capacity
  - Started last week
ATLAS MC Analysis
Running ALPGEN is possible with OSG resources.
Conclusions
• OSG is providing a progressively more mature infrastructure
• Increased use is leading to positive feedback from the perspective of users and of providers of middleware and facilities
• The accelerator-based experiments are relying on it to deliver their physics programs