Transcript: backfill running in the gaps of HPC
ATLAS@home
Wenjing Wu, Andrej Filipčič, David Cameron, Eric Lancon, Claire Adam Bourdarios & others
2015/7/21

ATLAS: Elementary Particle Physics
• One of the biggest experiments at CERN, trying to understand the origin of mass, which completes the Standard Model
• In 2012, ATLAS and CMS discovered the Higgs boson

Data processing flow in ATLAS

Why ATLAS@home
• It's free! Well, almost.
• Public outreach – volunteers want to know more about the project they participate in
• Good for ATLAS visibility
• Can add significant computing power to WLCG
• A brief history
  – Started at the end of 2013 with a test instance at IHEP, Beijing
  – Migrated to CERN and officially launched in June 2014
  – Has been running continuously since

ATLAS@home
• Goal: to run ATLAS simulation jobs on volunteer computers.
• Challenges:
  – The ATLAS software base is big (~10 GB) and very platform dependent; it runs on Scientific Linux
  – Volunteer computing resources should be integrated into the current Grid computing infrastructure. In other words, all the volunteer computers should appear as one WLCG site, with jobs submitted from PanDA (the ATLAS Grid computing portal).
  – Grid computing relies heavily on personal credentials, but these credentials must not be placed on volunteer computers

Solutions
• Use VirtualBox + vmwrapper to virtualize volunteer hosts
• Use the network file system CVMFS to distribute the ATLAS software; since CVMFS supports on-demand file caching, it helps reduce the image size
• To avoid placing credentials on the volunteer hosts, ARC CE is introduced into the architecture together with BOINC
  – ARC CE is Grid middleware: it interacts with the ATLAS central Grid services and manages different LRMS (Local Resource Management Systems), such as Condor and PBS, via specific LRMS plugins
  – A BOINC plugin was developed to forward "Grid jobs" to the BOINC server and convert the job results back into Grid format
Architecture
ATLAS Workload Management System

BOINC ARC plugin (1)
• Converts an ARC CE job into a BOINC job
• The plugin includes:
  – Submit/scan/cancel job
  – Information provider (total CPUs, CPU usage, job status)
• Submit
  – ARC CE job: all input files go into one tar.gz file
  – Copies the input file from the ARC CE session directory into the BOINC internal directory
  – Sets up the BOINC environment and calls the BOINC command to generate a job based on job templates/input files
  – Writes the job ID back to the ARC CE job control directory
  – When the job finishes, the BOINC services put the desired output files back into the ARC CE session directory

BOINC ARC CE plugin (2)
• Scan
  – Scans the job diag file (in the session directory), gets the exit code, uploads output files to the designated SE, and updates the ARC CE job status
• Cancel
  – Cancels a BOINC job
• Information provider
  – Queries the BOINC DB for the total CPU number, CPU usage, and the status of each job

Current Status
• Gained CPU hours: 103,355
• Daily resource: 3% of Grid computing

Current Status: the Whole ATLAS Computing

ATLAS jobs
• Full ATLAS simulation jobs
  – 10 events/job initially
  – Now 100 events/job
• A typical ATLAS simulation job
  – 40–80 MB input data
  – 10–30 MB output data
  – On average, 92 minutes CPU time, 114 minutes elapsed time
• CPU efficiency is lower than on the Grid
  – Slow home networks → significant initialization time
  – CPUs are not available all the time
• Jobs ran in SLC5 64-bit, upgraded to SLC6 (µCernVM)
• Virtualization on Windows, Linux, Mac
• ANY kind of job could run on ATLAS@home

How Grid People See ATLAS@home
• Volunteers want to earn credits for their contribution, so they want their PCs to work optimally
  – This is true for the Grid sites as well, at least it should be
  – But volunteers are better shifters than we are
• Different from what we are used to:
  – On the Grid: "jobs are failing, please fix the sites!"
  – On BOINC: "jobs suck, please fix your code!"
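The scan step above recovers the exit code from the job's diag file. A minimal sketch of that parsing, assuming a simple one-key=value-per-line layout; the field names used here are assumptions for illustration, not the exact diag schema:

```python
def parse_diag(text):
    """Parse a job diag file: one key=value pair per line.
    Layout and field names are assumed for illustration."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not key=value
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def job_succeeded(fields):
    # Scan treats exit code 0 as success; a missing exit code
    # means the job has not finished yet.
    return fields.get("exitcode") == "0"


# Example diag content, modelled on the average job from the slides
sample = """exitcode=0
WallTime=114min
UserTime=92min
"""
```

With this, `job_succeeded(parse_diag(sample))` evaluates to `True`, after which the plugin would upload the outputs to the SE and update the ARC CE job status.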
• ATLAS@home is the first BOINC project with massive I/O demands, even for less intensive jobs
  – The server infrastructure needs to be carefully planned to cope with a high load
• Credentials must not be passed to PCs
• Jobs can stay in execution mode for a long time, depending on the volunteer computer's preferences, so they are not suitable for high-priority tasks

ATLAS Outreach
• Outreach website: https://atlasphysathome.web.cern.ch/
• Feedback mailing list: [email protected]

Future Effort (1)
• Customize the VM image to reduce network traffic and speed up initialization
• Optimize file transfers, server load, and job efficiency on the PCs
• Test and migrate to the LHC@home infrastructure
• Test whether BOINC can replace the small Grid sites
• Investigate the use of BOINC on local batch clusters to run ATLAS jobs
• Investigate running various workflows (longer jobs, multi-core jobs) on virtual machines

Future Effort (2)
• Provide an event display, and possibly a screen saver, that would let people see what they are running

Acknowledgements
• David and Rom for all the support and suggestions
• CERN IT for providing server and storage resources for ATLAS@home, and for working on integrating ATLAS@home with LHC@home