Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by
Stephen L. Scott
Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
Research and development goals
Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
Eliminate many of the numerous single points of failure and control in today’s HPC systems
Develop techniques to enable HPC systems to run computational jobs 24/7
Develop proof-of-concept prototypes and production-type RAS solutions
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
MOLAR is a collaborative research effort (www.fastos.org/molar)
Symmetric active/active redundancy
Many active head nodes
Workload distribution
Symmetric replication between head nodes
Continuous service
Always up to date
No fail-over necessary
No restore-over necessary
Virtual synchrony model
Complex algorithms
[Diagram: active/active head nodes serving the compute nodes]
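The key property of the virtual synchrony model above is that every head node applies the same requests in the same order, so replicas never diverge and no fail-over or restore-over is needed. Below is a minimal, self-contained Python sketch of that idea; the single in-process sequencer stands in for a group communication layer, and the class and method names are illustrative, not part of the actual prototypes.

```python
# Minimal sketch of symmetric active/active replication: every replica
# applies the same totally ordered request stream, so any head node can
# serve clients and none needs fail-over.  (Illustrative only; real
# systems use a group communication layer providing virtual synchrony,
# not a single in-process sequencer.)

class Sequencer:
    """Stands in for the total-order broadcast of a group communication layer."""
    def __init__(self):
        self.next_seq = 0

    def order(self, request):
        seq = self.next_seq
        self.next_seq += 1
        return seq, request


class HeadNodeReplica:
    """One active head node holding a replicated key-value state."""
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.applied = 0

    def apply(self, seq, request):
        # Deterministic state machine: same input order -> same state.
        assert seq == self.applied, "requests must be applied in total order"
        op, key, value = request
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        self.applied += 1


sequencer = Sequencer()
replicas = [HeadNodeReplica(f"head{i}") for i in range(3)]

# Clients may submit to any head node; every request is ordered once and
# applied by all replicas, keeping them continuously up to date.
for request in [("put", "job42", "queued"), ("put", "job42", "running"),
                ("delete", "job17", None)]:
    seq, req = sequencer.order(request)
    for replica in replicas:
        replica.apply(seq, req)

assert all(r.state == replicas[0].state for r in replicas)
print(replicas[0].state)   # {'job42': 'running'}
```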
Symmetric active/active Parallel Virtual File System metadata server
Prototypes for Torque and Parallel Virtual File System metadata server
[Charts: writing and reading throughput (requests/sec) versus number of clients (1-32) for plain PVFS and symmetric active/active configurations with 1, 2, and 4 metadata servers]

Nodes   Availability   Est. annual downtime
1       98.58%         5d, 4h, 21m
2       99.97%         1h, 45m
3       99.9997%       1m, 30s
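The downtime figures above follow from treating the metadata servers as parallel redundant components: the service is unavailable only when all n nodes are down at once. A quick sketch of that arithmetic, assuming independent failures and taking the 98.58% single-node availability from the table:

```python
# Availability of n symmetric active/active nodes, assuming independent
# failures: the service is down only if all n nodes are down together.
HOURS_PER_YEAR = 8760

def redundant_availability(single_node_availability, n):
    return 1.0 - (1.0 - single_node_availability) ** n

for n in (1, 2, 3):
    a = redundant_availability(0.9858, n)   # 98.58% from the table above
    downtime_h = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} node(s): {a:.6%} available, "
          f"~{downtime_h:.2f} h downtime per year")

# Prints roughly 124 h (about 5d 4h), 1.8 h, and 1.5 min per year,
# matching the 5d/4h/21m, 1h/45m, and 1m/30s rows of the table.
```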
Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
[Diagram: a live node with a paused MPI process, a failed node with a failed MPI process, and process migration to a spare node; existing connections to paused processes are reused, a new connection is made to the migrated process, and checkpoints reside on shared storage]
Operational nodes: Pause
BLCR reuses existing processes
LAM/MPI reuses existing connections
Restore partial process state from checkpoint
Failed nodes: Migrate
Restart process on new node from checkpoint
Reconnect with paused processes
Scalable MPI membership management for low overhead
Efficient, transparent, and automatic failure recovery
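A conceptual sketch of the recovery sequence above, written as a driver. This is not LAM/MPI's internal implementation; it only illustrates the ordering of steps, assuming BLCR's cr_restart tool, passwordless ssh to a spare node, and hypothetical pause/reconnect/resume hooks standing in for functionality LAM/MPI provides internally.

```python
# Conceptual recovery driver for the job-pause mechanism.
import subprocess

def pause_rank(rank):                 # hypothetical: BLCR keeps the process alive
    print(f"pausing rank {rank} in place")

def reconnect(rank, migrated_rank, new_host):   # hypothetical hook
    print(f"rank {rank}: new connection to rank {migrated_rank} on {new_host}")

def resume_rank(rank):                # hypothetical hook
    print(f"resuming rank {rank}")

def recover(failed_rank, surviving_ranks, spare_node, checkpoint_file):
    # 1. Operational nodes: pause surviving MPI processes; existing
    #    processes and connections are reused, not torn down.
    for rank in surviving_ranks:
        pause_rank(rank)
    # 2. Failed node: restart the failed process from its checkpoint on a
    #    spare node (checkpoint read from shared storage; no LAM reboot,
    #    no full job restart).
    subprocess.run(["ssh", spare_node, "cr_restart", checkpoint_file],
                   check=True)
    # 3. Reconnect the migrated process with the paused ones and resume.
    for rank in surviving_ranks:
        reconnect(rank, failed_rank, spare_node)
        resume_rank(rank)
```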
LAM/MPI+BLCR job pause performance
[Chart: seconds for job pause and migrate, LAM reboot, and job restart across the NAS Parallel Benchmarks BT, CG, EP, FT, LU, MG, and SP]
3.4% overhead over job restart, but
No LAM reboot overhead
No requeue penalty
Less staging overhead
Transparent continuation of execution
Proactive fault tolerance for HPC using Xen virtualization
[Diagram: each host runs the Xen VMM over its hardware and BMC; a privileged VM hosts the PFT daemon and Ganglia, while a guest VM runs the MPI task; a standby Xen host (spare node without guest VM) is kept ready]
Deteriorating health: migrate guest VM to spare node
New host generates unsolicited ARP reply
Indicates that guest VM has moved
ARP tells peers to resend to new host
Novel fault-tolerance scheme that acts before a failure impacts a system
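A minimal sketch of what a PFT daemon's control loop could look like, assuming a health reading is available through Ganglia (gmond's XML feed on TCP 8649) and that Xen's xm toolstack performs the live migration. The metric name, threshold, domain name, and hostnames are illustrative assumptions, not values from the poster.

```python
# Illustrative proactive fault tolerance loop: watch a health metric and
# live-migrate the guest VM away before the node fails.
# Assumptions (not from the poster): a "cpu_temp" metric in Ganglia,
# a 75 degC threshold, and the domain/host names below.
import socket
import subprocess
import time
import xml.etree.ElementTree as ET

GMOND_PORT = 8649
TEMP_THRESHOLD_C = 75.0          # illustrative threshold
GUEST_VM = "compute-guest"       # illustrative guest VM (domain) name
SPARE_HOST = "spare-node-01"     # illustrative standby Xen host

def read_metric(host, metric_name):
    """Fetch one numeric metric from Ganglia's gmond XML stream."""
    with socket.create_connection((host, GMOND_PORT), timeout=5) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    for metric in root.iter("METRIC"):
        if metric.get("NAME") == metric_name:
            return float(metric.get("VAL"))
    return None

def migrate_guest(domain, destination):
    """Live-migrate the guest VM; the new host's unsolicited ARP reply
    tells peers that the VM has moved."""
    subprocess.run(["xm", "migrate", "--live", domain, destination], check=True)

while True:
    temp = read_metric("localhost", "cpu_temp")
    if temp is not None and temp > TEMP_THRESHOLD_C:
        migrate_guest(GUEST_VM, SPARE_HOST)   # act before the failure
        break
    time.sleep(10)
```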
VM migration performance impact
[Charts: wall clock seconds for BT, CG, EP, LU, and SP; single node failure compared without migration and with one migration, double node failure compared without migration, with one migration, and with two migrations]
Single node failure: 0.5–5% additional cost over total wall clock time
Double node failure: 2–8% additional cost over total wall clock time
HPC reliability analysis and modeling
Programming paradigm and system scale impact reliability
Reliability analysis
Estimate mean time to failure (MTTF)
Obtain failure distribution: exponential, Weibull, gamma, etc.
Feedback into fault-tolerance schemes for adaptation
[Chart: system reliability (MTTF) for the k-of-n AND survivability (k = n) parallel execution model; total system MTTF (hrs) versus number of participating nodes (10 to 5000) for node MTTFs of 1000, 3000, 5000, and 7000 hrs]
[Chart: cumulative probability versus time between failure (TBF); negative likelihood values of the fitted distributions: exponential 2653.3, Weibull 3532.8, lognormal 2604.3, gamma 2627.4]
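A sketch of the kind of analysis behind the second chart, assuming a list of observed times between failures is available: estimate the MTTF from the sample mean, then fit several candidate distributions and report a negative log-likelihood for each (analogous to the poster's negative likelihood values). The sample below is synthetic, not the poster's data.

```python
# Sketch: estimate MTTF and compare candidate failure distributions
# for a set of observed times between failures (TBF).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tbf_hours = rng.weibull(1.5, size=200) * 150.0   # synthetic TBF sample

# Mean time to failure: the sample mean of the times between failures.
mttf = tbf_hours.mean()
print(f"Estimated MTTF: {mttf:.1f} h")

# Fit each candidate distribution and report its negative log-likelihood
# (scipy's nnlf), mirroring the exponential/Weibull/lognormal/gamma
# comparison on the poster.
candidates = {
    "exponential": stats.expon,
    "weibull": stats.weibull_min,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}
for name, dist in candidates.items():
    params = dist.fit(tbf_hours)
    nll = dist.nnlf(params, tbf_hours)
    print(f"{name:12s} negative log-likelihood: {nll:.1f}")
```

For the first chart's k = n model, an exponential assumption gives a simple closed form: the system MTTF is the node MTTF divided by the number of participating nodes, so even 5000-hour nodes yield only a few hours of system MTTF at thousands of nodes.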
Contacts
Stephen L. Scott
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3144
[email protected]
Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3132
[email protected]