Open Science Grid: More Compute Power

Alan De Smet
CHTC Cores In Use
[Chart: CPU days each day, averaged over one month; vertical axis 0 to 1,800]
chtc.cs.wisc.edu
OSG Cores In Use
[Chart: CPU days each day, averaged over one month; vertical axis 0 to 80,000]
Open Science Grid
CHTC and OSG usage
[Chart: CPU days each day; vertical axis 0 to 4,500]
Challenges Solved
We worry about all of this.
You don’t have to.
› Authentication
  X.509 certificates, certificate authorities, VOMS
› Interface
  Globus, GridFTP, Grid universe
› Validation
  Linux distribution, glibc version, basic libraries
Using OSG
› Before
  universe   = vanilla
  executable = myjob
  log        = myjob.log
  queue
Using OSG
› After
  universe     = vanilla
  executable   = myjob
  log          = myjob.log
  +WantGlidein = true
  queue
Challenge: Opportunistic
› OSG computers go away without notice
› Solutions
  Condor restarts jobs automatically
  Sub-hour jobs
  Self-checkpointing
  Automated checkpointing
  • Condor’s standard universe
  • DMTCP (http://dmtcp.sourceforge.net/)
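Self-checkpointing, as listed above, means the job saves its own progress from time to time and resumes from the last save when Condor restarts it. The following is a minimal sketch in Python, not CHTC's own tooling. The file name checkpoint.json, the functions load_state, save_state and run, and the summing workload are all hypothetical stand-ins. It also assumes the checkpoint file is still available when the job restarts, which depends on how the job's files are transferred.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name, chosen for this sketch


def load_state(path):
    """Return saved progress, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint to a temp file, then rename it into place.

    The rename is atomic, so an eviction in the middle of a save leaves
    the previous checkpoint intact rather than a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(items, path=CHECKPOINT, every=100):
    """Sum `items`, saving progress every `every` items and resuming on restart."""
    state = load_state(path)
    for i in range(state["next_item"], len(items)):
        state["total"] += items[i]  # stand-in for the real per-item work
        state["next_item"] = i + 1
        if state["next_item"] % every == 0:
            save_state(path, state)
    save_state(path, state)  # final save so a rerun does no extra work
    return state["total"]
```

If the job is killed partway through, the next call to run() skips the items already counted and continues from next_item. The `every` interval trades lost work on eviction against time spent writing checkpoints.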
Challenge: Local Software
› Bare-bones Linux systems
› Solution
  Bring everything with you
  CHTC-provided MATLAB and R packages
  • RunDagEnv/mkdag
Challenge: Erratic Failures
› Complex systems fail sometimes
› Solution
  Expect failures and automatically retry
  DAGMan for retries
  DAGMan POST scripts to detect problems
  • RunDagEnv/mkdag
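A POST script of the kind mentioned above can be sketched in Python. In a DAG file, a SCRIPT POST line runs a script after a node's job finishes, and a RETRY line reruns the node when it is marked failed; a nonzero exit code from the POST script marks the node as failed. The script below is an illustration only, not the script that RunDagEnv/mkdag generates. The names check_job and main, the output file argument, and the size check are assumptions made for this sketch.

```python
import os
import sys


def check_job(job_exit_code, output_path, min_bytes=1):
    """Return 0 if the job looks successful, a nonzero code otherwise."""
    if job_exit_code != 0:
        return 1  # the job itself reported failure
    if not os.path.isfile(output_path):
        return 2  # job exited cleanly but produced no output file
    if os.path.getsize(output_path) < min_bytes:
        return 3  # output file is empty or smaller than expected
    return 0


def main(argv):
    """Entry point; expects: <script> <job exit code> <output file>."""
    if len(argv) != 3:
        print("usage: check_job.py JOB_EXIT_CODE OUTPUT_FILE", file=sys.stderr)
        return 64
    return check_job(int(argv[1]), argv[2])


# When saved as a standalone script, end the file with:
#     sys.exit(main(sys.argv))
```

A hypothetical DAG file would hook it up with lines such as `SCRIPT POST mynode check_job.py $RETURN myjob.out` and `RETRY mynode 3`, where mynode and myjob.out are placeholder names and $RETURN is DAGMan's macro for the job's exit code.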
Challenge: Bandwidth
› Solutions
  Only send what you need
  Store large, shared files in our web cache
  Read small amounts of data on the fly
  • Condor’s standard universe
  • Parrot (http://www.cse.nd.edu/~ccl/software/parrot/)
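Reading small amounts of data on the fly amounts to seeking to the part of a file the job needs instead of loading the whole file. Both tools listed above let a job use ordinary file operations on data that lives elsewhere, so from the job's side this can look like the sketch below. The fixed record size and the name read_record are hypothetical, and how many bytes actually cross the network for each read depends on the tool and its configuration.

```python
RECORD_SIZE = 16  # hypothetical fixed record size in bytes


def read_record(path, index, record_size=RECORD_SIZE):
    """Return record number `index` from a file of fixed-size records.

    Only `record_size` bytes are requested, however large the file is.
    """
    with open(path, "rb") as f:
        f.seek(index * record_size)
        data = f.read(record_size)
    if len(data) != record_size:
        raise IndexError("record %d is past the end of %s" % (index, path))
    return data
```

A job that needs a handful of records from a multi-gigabyte input can call read_record() for just those indexes rather than listing the whole file as job input.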