
Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by
Stephen L. Scott
Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
Research and development goals
 Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
 Eliminate many of the numerous single points of failure and control in today’s HPC systems
 Develop techniques to enable HPC systems to run computational jobs 24/7
 Develop proof-of-concept prototypes and production-type RAS solutions
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
 Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
 Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
 MOLAR is a collaborative research effort (www.fastos.org/molar)
Symmetric active/active redundancy
 Many active head nodes
 Workload distribution
 Symmetric replication between head nodes
 Continuous service
 Always up to date
 No fail-over necessary
 No restore-over necessary
 Virtual synchrony model (see the sketch below)
 Complex algorithms
[Diagram: multiple active/active head nodes serving the compute nodes]
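The bullets above amount to state-machine replication over a virtual synchrony group: every head node receives the same totally ordered stream of requests and applies it to its own copy of the service state, so all replicas stay up to date and no fail-over is needed. The following minimal sketch illustrates the idea; the GroupChannel class is a stand-in assumption for a real group communication layer, not the project's implementation.

    # Minimal sketch of symmetric active/active replication over a totally
    # ordered broadcast. GroupChannel is a stand-in for a virtual synchrony
    # layer (an assumption for illustration, not the actual system).

    class GroupChannel:
        """Delivers each broadcast request to all members in one global order."""
        def __init__(self):
            self.members = []

        def join(self, node):
            self.members.append(node)

        def total_order_broadcast(self, request):
            # A real virtual synchrony layer agrees on this order across
            # machines; a single local loop trivially yields the same order.
            for node in self.members:
                node.deliver(request)

    class HeadNode:
        """One active head node holding a full replica of the service state."""
        def __init__(self, name, channel):
            self.name = name
            self.state = {}          # e.g., job queue or file system metadata
            channel.join(self)

        def deliver(self, request):
            op, key, value = request
            if op == "put":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)

        def query(self, key):
            # Any head node can answer reads: every replica is always current.
            return self.state.get(key)

    channel = GroupChannel()
    heads = [HeadNode("head%d" % i, channel) for i in range(3)]
    channel.total_order_broadcast(("put", "job42", "queued"))
    channel.total_order_broadcast(("put", "job42", "running"))
    assert all(h.query("job42") == "running" for h in heads)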
Symmetric active/active Parallel Virtual File System metadata server
 Prototypes for Torque and Parallel Virtual File System (PVFS) metadata server
[Figure: write and read throughput (requests/sec) vs. number of clients (1–32) for standard PVFS and symmetric active/active configurations with 1, 2, and 4 metadata servers]

Nodes  Availability  Est. annual downtime
1      98.58%        5d, 4h, 21m
2      99.97%        1h, 45m
3      99.9997%      1m, 30s
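As a back-of-the-envelope check, assuming the head nodes fail independently and each has the single-node availability A_1 = 98.58% from the first row, the availability of n symmetric active/active head nodes follows the standard parallel-redundancy formula:

    A_n = 1 - (1 - A_1)^n
    \text{annual downtime} \approx (1 - A_n) \times 8760 \text{ h}

For n = 3 this gives (1 - 0.9858)^3 × 8760 h ≈ 0.025 h, roughly 1 minute 30 seconds, consistent with the last row of the table.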
Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
 Operational nodes: Pause
   BLCR reuses existing processes
   LAM/MPI reuses existing connections
   Restore partial process state from checkpoint
 Failed nodes: Migrate
   Restart process on new node from checkpoint
   Reconnect with paused processes
 Scalable MPI membership management for low overhead
 Efficient, transparent, and automatic failure recovery (sketched below)
[Diagram: the MPI process from the failed node is restarted on a spare node from shared storage, while paused MPI processes on live nodes keep their existing connections and add a new connection to the migrated process]
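A conceptual sketch of that recovery flow follows. The helper names (pause_in_place, restart_from_checkpoint, reconnect) and the checkpoint path are illustrative assumptions, not the BLCR or LAM/MPI interfaces.

    # Conceptual sketch of the LAM/MPI+BLCR job-pause recovery flow.
    # Helper names and paths are illustrative, not the real BLCR/LAM/MPI API.

    CHECKPOINT_DIR = "/shared/checkpoints"   # checkpoints kept on shared storage

    def pause_in_place(rank, node):
        """Keep the existing process and its connections; roll its state
        back to the last checkpoint (the 'pause' on operational nodes)."""
        print("rank %d on %s: paused, partial state restored from %s/rank%d.ckpt"
              % (rank, node, CHECKPOINT_DIR, rank))

    def restart_from_checkpoint(rank, spare):
        """Recreate the failed rank on a spare node from its checkpoint."""
        print("rank %d: restarted on spare node %s from %s/rank%d.ckpt"
              % (rank, spare, CHECKPOINT_DIR, rank))

    def reconnect(live_ranks, migrated_rank):
        """Paused ranks open new connections only to the migrated rank."""
        for rank in live_ranks:
            print("rank %d: reconnected to migrated rank %d" % (rank, migrated_rank))

    def recover(rank_to_node, failed_node, spare_node):
        failed = [r for r, n in rank_to_node.items() if n == failed_node]
        live = [r for r, n in rank_to_node.items() if n != failed_node]
        for rank in live:                    # operational nodes: pause
            pause_in_place(rank, rank_to_node[rank])
        for rank in failed:                  # failed nodes: migrate
            restart_from_checkpoint(rank, spare_node)
            reconnect(live, rank)

    recover({0: "node1", 1: "node2", 2: "node3"},
            failed_node="node2", spare_node="spare1")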
LAM/MPI+BLCR job pause performance
[Figure: time in seconds for job pause and migrate vs. LAM reboot plus job restart, for the benchmarks BT, CG, EP, FT, LU, MG, and SP]
 3.4% overhead over job restart, but
 No LAM reboot overhead
 No requeue penalty
 Less staging overhead
 Transparent continuation of execution
Proactive fault tolerance for HPC using Xen virtualization
 Deteriorating health
 Migrate guest VM to spare node
 Standby Xen host (spare node without guest VM)
 New host generates unsolicited ARP reply
   Indicates that guest VM has moved
   ARP tells peers to resend to new host
 Novel fault-tolerance scheme that acts before a failure impacts a system (see the sketch below)
[Diagram: on each node, a PFT daemon and Ganglia run in the privileged VM on the Xen VMM and monitor hardware health via the BMC; the guest VM running the MPI task is live-migrated to the standby Xen host]
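A minimal sketch of such a PFT daemon loop is shown below. The health source, threshold, and VM/host names are assumptions for illustration; the migration call uses Xen's xm migrate --live command.

    # Sketch of a proactive fault tolerance (PFT) daemon loop for Xen.
    # read_health(), the threshold, and the names are illustrative assumptions;
    # the live migration itself uses Xen's "xm migrate --live".

    import subprocess
    import time

    TEMP_LIMIT_C = 75.0           # illustrative "deteriorating health" threshold
    GUEST_VM = "compute-guest"    # hypothetical guest VM running the MPI task
    SPARE_HOST = "spare-node-01"  # standby Xen host without a guest VM

    def read_health():
        """Return the node temperature in degrees Celsius.
        In the real system this would come from Ganglia metrics or the
        baseboard management controller (BMC); here it is a placeholder."""
        return 70.0

    def migrate_guest(vm, target):
        # Live migration keeps the MPI task running while its VM moves; the
        # new host then issues an unsolicited ARP reply so peers resend to it.
        subprocess.run(["xm", "migrate", "--live", vm, target], check=True)

    def pft_loop(poll_seconds=10):
        while True:
            if read_health() > TEMP_LIMIT_C:
                migrate_guest(GUEST_VM, SPARE_HOST)
                break                # act before the failure, then hand off
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        pft_loop()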
VM migration performance impact
[Figure: runtime in seconds for BT, CG, EP, LU, and SP without migration, with one migration (single node failure), and with two migrations (double node failure)]
 Single node failure: 0.5–5% additional cost over total wall clock time
 Double node failure: 2–8% additional cost over total wall clock time
HPC reliability analysis and modeling
 Programming paradigm and system scale impact reliability
 Reliability analysis
 Estimate mean time to failure (MTTF)
 Obtain failure distribution: exponential, Weibull, gamma, etc.
 Feedback into fault-tolerance schemes for adaptation
[Figure: total system MTTF (hrs) vs. number of participating nodes (10–5000) under the k-of-n AND survivability (k = n) parallel execution model, for node MTTFs of 1000, 3000, 5000, and 7000 hrs]
[Figure: cumulative probability vs. time between failure (TBF), with fitted distributions and their negative likelihood values]

Distribution  Negative likelihood value
Exponential   2653.3
Weibull       3532.8
Lognormal     2604.3
Gamma         2627.4
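The sketch below outlines how such an analysis can be reproduced with standard tools: it estimates MTTF from observed times between failures, applies the k = n (AND) rule that, for exponential node failures, system MTTF scales as node MTTF divided by n, and compares candidate distributions by negative log-likelihood as in the table above. The input file name is a hypothetical placeholder.

    # Sketch of the reliability analysis: estimate MTTF and compare candidate
    # failure distributions by negative log-likelihood.
    # "tbf_hours.txt" is a hypothetical file of observed times between
    # failures, one value (in hours) per line.

    import numpy as np
    from scipy import stats

    tbf = np.loadtxt("tbf_hours.txt")

    # Mean time to failure estimated from the observed samples.
    mttf = tbf.mean()
    print("estimated node MTTF: %.1f hrs" % mttf)

    # k = n (AND) survivability with exponential node failures: the system
    # fails when any node fails, so system MTTF = node MTTF / n.
    for n in (10, 100, 1000):
        print("%5d nodes -> system MTTF ~ %.2f hrs" % (n, mttf / n))

    # Fit candidate distributions and compare negative log-likelihoods
    # (lower means a better fit to the observed TBF data).
    candidates = {
        "exponential": stats.expon,
        "Weibull": stats.weibull_min,
        "lognormal": stats.lognorm,
        "gamma": stats.gamma,
    }
    for name, dist in candidates.items():
        params = dist.fit(tbf)
        nll = -np.sum(dist.logpdf(tbf, *params))
        print("%-12s negative log-likelihood = %.1f" % (name, nll))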
Contacts
Stephen L. Scott
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3144
[email protected]
Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
(865) 574-3132
[email protected]