Backfill Running on HPC

ATLAS@home
Wenjing Wu
Andrej Filipčič
David Cameron
Eric Lancon
Claire Adam Bourdarios
& others
2015/7/21
ATLAS: Elementary Particle Physics
• One of the biggest experiments at CERN
• Trying to understand the origin of mass, which completes the Standard Model
• In 2012, ATLAS and CMS discovered the Higgs boson
Data processing flow in ATLAS
Why ATLAS@home
• It's free! Well, almost.
• Public outreach – volunteers want to know more about the project they participate in
• Good for ATLAS visibility
• Can add significant computing power to WLCG
• A brief history
– Started at the end of 2013 as a test instance at IHEP, Beijing
– Migrated to CERN and officially launched in June 2014
– Has been running continuously since then
ATLAS@home
• Goal: run ATLAS simulation jobs on volunteer computers
• Challenges:
– Big ATLAS software base (~10 GB), very platform dependent; runs on Scientific Linux
– Volunteer computing resources should be integrated into the current grid computing infrastructure. In other words, all the volunteer computers should appear as a WLCG site, and jobs are submitted from PanDA (the ATLAS grid computing portal)
– Grid computing relies heavily on personal credentials, but these credentials must not be placed on volunteer computers
Solutions
Solutions
• Use VirtualBox + vboxwrapper to virtualize volunteer hosts
• Use the network file system CVMFS to distribute the ATLAS software; since CVMFS supports on-demand file caching, it helps reduce the image size (see the probe sketch after this list)
• To avoid placing credentials on the volunteer hosts, ARC CE is introduced in the architecture together with BOINC
– ARC CE is grid middleware; it interacts with the ATLAS central grid services and manages different LRMSs (Local Resource Management Systems), such as Condor and PBS, through specific LRMS plugins
– A BOINC plugin was developed to forward grid jobs to the BOINC server and convert the job results back into grid format
Architecture
(Diagram: ATLAS Workload Management System)
BOINC ARC CE plugin (1)
• Converts an ARC CE job into a BOINC job
• The plugin includes:
– Submit/scan/cancel job
– Information provider (total CPUs, CPU usage, job status)
• Submit (a sketch of this flow follows the list)
– ARC CE job: all input files are packed into one tar.gz file
– Copy the input file from the ARC CE session directory into the BOINC internal directory
– Set up the BOINC environment and call the BOINC command to generate a job based on the job templates/input files
– Write the job ID back to the ARC CE job control directory
– When the job finishes, the BOINC services put the desired output files back into the ARC CE session directory
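A hypothetical sketch of the submit step described above, assuming the stock BOINC server tools (bin/stage_file, bin/create_work) and illustrative paths, template names, and application name; the real plugin lives in the ARC CE code base and differs in detail.

```python
# Hypothetical sketch of the plugin's submit step. Paths, template names and the
# application name are illustrative assumptions, not the real configuration.
import subprocess
import tarfile
from pathlib import Path

BOINC_PROJECT = Path("/home/boinc/projects/atlasathome")  # assumed project root
CONTROL_DIR = Path("/var/spool/arc/jobstatus")            # assumed ARC control dir


def submit(job_id: str, session_dir: Path) -> None:
    # 1. Pack all input files from the ARC CE session directory into one tar.gz.
    archive = session_dir / f"{job_id}-input.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for f in session_dir.iterdir():
            if f != archive:
                tar.add(f, arcname=f.name)

    # 2. Stage the archive into the BOINC download hierarchy.
    subprocess.run(["bin/stage_file", str(archive)],
                   cwd=BOINC_PROJECT, check=True)

    # 3. Create a BOINC work unit from the job templates and the staged input.
    subprocess.run(["bin/create_work",
                    "--appname", "ATLAS",          # assumed application name
                    "--wu_name", job_id,
                    "--wu_template", "templates/ATLAS_in",
                    "--result_template", "templates/ATLAS_out",
                    archive.name],
                   cwd=BOINC_PROJECT, check=True)

    # 4. Record the BOINC work-unit name in the ARC CE job control directory
    #    so that the scan step can find it later.
    (CONTROL_DIR / f"job.{job_id}.localid").write_text(job_id + "\n")
```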
BOINC ARC CE plugin (2)
• Scan
– Scan the job diag file (in the session directory), get the exit code, upload the output files to the designated SE, and update the ARC CE job status
• Cancel
– Cancel the corresponding BOINC job
• Information provider (a sketch follows the list)
– Query the BOINC DB for the total CPU count, CPU usage, and the status of each job
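A hypothetical sketch of the information-provider query against the stock BOINC MySQL schema (host.p_ncpus, result.server_state, workunit.name); the connection parameters and the active-host window are illustrative assumptions.

```python
# Hypothetical sketch of the information provider: read the total CPU count of
# recently active volunteer hosts and the state of each job from the stock
# BOINC MySQL schema. Credentials and the 24-hour window are assumptions.
import pymysql

SERVER_STATES = {2: "UNSENT", 4: "IN_PROGRESS", 5: "OVER"}  # result.server_state


def collect_info():
    conn = pymysql.connect(host="localhost", user="boinc",
                           password="***", database="atlasathome")
    with conn.cursor() as cur:
        # Total CPUs reported by hosts that contacted the server in the last day.
        cur.execute("SELECT COALESCE(SUM(p_ncpus), 0) FROM host "
                    "WHERE rpc_time > UNIX_TIMESTAMP() - 86400")
        total_cpus = cur.fetchone()[0]

        # Per-job status, keyed by the work-unit name assigned at submit time.
        cur.execute("SELECT workunit.name, result.server_state FROM result "
                    "JOIN workunit ON result.workunitid = workunit.id")
        jobs = {name: SERVER_STATES.get(state, "UNKNOWN")
                for name, state in cur.fetchall()}
    conn.close()
    return total_cpus, jobs
```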
Current Status
• Gained CPU hours: 103,355
• Daily resources: ~3% of grid computing
Current Status: the Whole ATLAS Computing
ATLAS jobs
• Full ATLAS simulation jobs
– 10 events/job initially
– Now 100 events/job
• A typical ATLAS simulation job
– 40–80 MB input data
– 10–30 MB output data
– On average, 92 minutes of CPU time, 114 minutes elapsed time
• CPU efficiency lower than on the grid
– Slow home network → significant initialization time
– CPUs not available all the time
• Jobs run in an SLC5 64-bit VM → upgraded to SLC6 (µCernVM)
• Virtualization on Windows, Linux, Mac
• ANY kind of job could run on ATLAS@home
How Grid People see ATLAS@home
• Volunteers want to earn credits for their contribution, so they want their PCs to work optimally
– This is true for the grid sites as well, at least it should be
– But volunteers are better shifters than we are
• Different from what we are used to:
– On the grid: jobs are failing, please fix the sites!
– On BOINC: jobs suck, please fix your code!
• ATLAS@home is the first BOINC project with massive I/O demands, even for less intensive jobs
– Server infrastructure needs to be carefully planned to cope with a high load
• Credentials must not be passed to PCs
• Jobs can stay in execution for a long time, depending on the volunteer's computer preferences, so it is not suitable for high-priority tasks
ATLAS outreach
• Outreach website: https://atlasphysathome.web.cern.ch/
• Feedback mailing list: [email protected]
Future Effort (1)
• Customize the VM image to reduce the network
traffic and speed up the initialization
• Optimize the file transfers, server load and job
efficiency on the PCs
• Test and migrate to LHC@home infrastructure
• Test if BOINC can replace the small Grid Sites
• Investigation of the use of BOINC on local batch
clusters to run ATLAS jobs.
• Investigation of running various workflows (longer jobs, multi-core jobs) on virtual machines
Future Effort (2)
• Provide an event display and possibly a screen saver that would let people see what they are running
Acknowledgements
• David and Rom for all the support and suggestions
• CERN IT for providing server and storage resources for ATLAS@home and for working on integrating ATLAS@home with LHC@home