
The TickerTAIP
Parallel RAID Architecture
P. Cao, S. B. Lim
S. Venkataraman, J. Wilkes
HP Labs
RAID Architectures
• Traditional RAID architectures have
– A central RAID controller interfacing to the host and processing all I/O requests
– Disk drives organized in strings
– One disk controller per disk string (mostly SCSI)
Limitations
• The capabilities of the RAID controller are crucial to the performance of the RAID
– Can become memory-bound
– Presents a single point of failure
– Can become a bottleneck
• Having a spare controller is an expensive proposition
Our Solution
• Have a cooperating set of array controller nodes
• Major benefits are:
– Fault-tolerance
– Scalability
– Smooth incremental growth
– Flexibility: can mix and match components
TickerTAIP
[Architecture diagram: hosts connected through interconnects to the controller nodes]
TickerTAIP (I)
A TickerTAIP array consists of:
• Worker nodes, each connected to one or more local disks through a bus
• Originator nodes interfacing with the host computer clients
• A high-performance small-area network:
– Mesh-based switching network (Datamesh)
– PCI backplanes for small networks
TickerTAIP (II)
• Can combine or separate worker and originator nodes
• Parity calculations are done in a decentralized fashion:
– The bottleneck is memory bandwidth, not CPU speed
– Cheaper than having faster paths to a dedicated parity engine
Design Issues (I)
• Normal-mode reads are trivial to implement
• Normal-mode writes:
– Three ways to calculate the new parity (see the sketch below):
• Full stripe: calculate the parity from the new data alone
• Small stripe: read the old data and old parity, requiring at least four I/Os
• Large stripe: if we rewrite more than half a stripe, compute the parity by reading the unmodified data blocks
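
The three policies can be illustrated with plain XOR parity as in RAID 5. The following is a minimal Python sketch, assuming equal-sized byte blocks; the function names are illustrative and not taken from the paper.

    # Minimal sketch of the three parity-update policies, assuming RAID 5-style
    # XOR parity over equal-sized blocks. Function names are illustrative.

    def xor_blocks(*blocks):
        # XOR any number of equal-sized blocks together, byte by byte.
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    def full_stripe_parity(new_data_blocks):
        # Full stripe: every data block is rewritten, so the new parity is
        # simply the XOR of the new data.
        return xor_blocks(*new_data_blocks)

    def small_stripe_parity(old_data, new_data, old_parity):
        # Small stripe: read the old data and the old parity (2 reads), then
        # write the new data and the new parity (2 writes) -- at least 4 I/Os.
        return xor_blocks(old_parity, old_data, new_data)

    def large_stripe_parity(new_data_blocks, unmodified_blocks):
        # Large stripe: more than half the stripe changes, so it is cheaper to
        # read the few unmodified blocks and recompute the parity from scratch.
        return xor_blocks(*new_data_blocks, *unmodified_blocks)
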
Design Issues (II)
• Parity can be calculated (see the sketch below):
– At originator: at the originator node
– Solely parity: at the parity node for the stripe
• Must ship all involved blocks to the parity node
– At parity: same as solely parity, but partial results for small-stripe writes are computed at the worker nodes and shipped to the parity node
• Generates less traffic than solely parity
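
For a small-stripe write, the difference between solely-parity and at-parity is where the XOR work happens and how many blocks cross the network. The sketch below is illustrative only; the function names and data layout are assumptions, not the paper's code.

    # Sketch contrasting "solely parity" and "at parity" for a small-stripe
    # write; names and layout are illustrative.
    from functools import reduce

    def xor_blocks(*blocks):
        # XOR equal-sized blocks together, byte by byte.
        return bytes(reduce(lambda acc, blk: bytes(a ^ b for a, b in zip(acc, blk)),
                            blocks))

    def solely_parity_update(old_parity, writes):
        # writes: list of (old_data, new_data) pairs, one per worker.
        # Every old and new block is shipped to the parity node, which then
        # does all of the XOR work itself.
        shipped = [blk for old, new in writes for blk in (old, new)]
        return xor_blocks(old_parity, *shipped)

    def at_parity_update(old_parity, writes):
        # Each worker XORs its own old and new data locally and ships only the
        # partial result, roughly halving the traffic to the parity node.
        partial_results = [xor_blocks(old, new) for old, new in writes]
        return xor_blocks(old_parity, *partial_results)

Both functions yield the same new parity; the at-parity variant ships one partial block per worker instead of the two blocks (old and new) that solely parity requires.
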
Handling single failures (I)
• TickerTAIP must provide request atomicity
• Disk failures are treated as in standard RAID
• Worker failures:
– Treated like disk failures
– Detected by time-outs (assuming fail-silent nodes)
– A distributed consensus algorithm reaches agreement among the remaining nodes
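
The slides do not describe the detection mechanism beyond time-outs, so the following is only a rough sketch of heartbeat-based detection of fail-silent workers; the class name, method names, and timeout value are assumptions, and the consensus step among the surviving nodes is not shown.

    # Rough sketch of timeout-based failure detection for fail-silent workers.
    # The consensus step among the surviving nodes is not shown.
    import time

    HEARTBEAT_TIMEOUT = 2.0   # seconds; illustrative value only

    class FailureDetector:
        def __init__(self, worker_ids):
            now = time.monotonic()
            self.last_heard = {w: now for w in worker_ids}

        def record_heartbeat(self, worker_id):
            # Called whenever any message (or explicit heartbeat) arrives.
            self.last_heard[worker_id] = time.monotonic()

        def suspected_failures(self):
            # A fail-silent worker that has been quiet for too long is reported,
            # so the remaining nodes can agree on treating it as failed.
            now = time.monotonic()
            return [w for w, t in self.last_heard.items()
                    if now - t > HEARTBEAT_TIMEOUT]
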
Handling single failures (II)
• Originator failures:
– Worst case is the failure of an originator/worker node during a write
– TickerTAIP uses a two-phase commit protocol
– Two options:
• Late commit
• Early commit
Late commit/Early commit
• Late commit only commits after the parity has been computed
– After the commit, only the writes remain to be performed
• Early commit commits as soon as the new data and the old data have been replicated
– Somewhat faster
– Harder to implement
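
A rough sketch of how late commit might sit inside a two-phase commit, with the originator acting as coordinator. The class and method names are illustrative and the parity work is stubbed out; this is a sketch of the idea, not the paper's protocol.

    # Illustrative late-commit two-phase commit, with the originator as
    # coordinator. Names are made up; the parity work is a stub.

    class Worker:
        def __init__(self, name):
            self.name = name

        def prepare(self):
            # Late commit: vote "ready" only once the new parity has been
            # computed, so that after commit only the disk writes remain.
            return self.compute_parity()

        def compute_parity(self):
            return True   # stand-in for the real XOR work

        def commit(self):
            print(f"{self.name}: writing data and parity to disk")

        def abort(self):
            print(f"{self.name}: discarding buffered write")

    def late_commit_write(workers):
        # Phase 1: collect votes from every participating worker.
        votes = [w.prepare() for w in workers]
        if all(votes):
            # Phase 2: commit everywhere; only the writes are left to do.
            for w in workers:
                w.commit()
            return True
        # Any missing or negative vote aborts the whole request.
        for w in workers:
            w.abort()
        return False
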
Handling multiple failures
• Power failures during writes can corrupt the stripe being written:
– Use a UPS to eliminate them
• Must guarantee that some specific requests will always be executed in a given order:
– Cannot write data blocks before updating the i-nodes containing their block addresses
– Uses request sequencing to achieve partial write ordering
Request sequencing (I)
• Each request
– Is given a unique identifier
– Can specify one or more requests on whose previous completion it depends (explicit dependencies)
• TickerTAIP adds enough implicit dependencies to prevent concurrent execution of overlapping requests
Request sequencing (II)
• Sequencing is performed by a centralized sequencer (see the sketch below)
– Several distributed solutions were considered but not selected because of the complexity of the recovery protocols they would require
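
A toy version of such a centralized sequencer, assuming each request covers a half-open block range; the data structures and the overlap test are illustrative, not taken from the paper.

    # Toy centralized sequencer: every request gets a unique id, may carry
    # explicit dependencies, and implicit dependencies are added between
    # overlapping requests. Details are illustrative.

    class Sequencer:
        def __init__(self):
            self.next_id = 0
            self.pending = {}       # id -> (block_range, dependency ids)
            self.completed = set()

        def submit(self, block_range, depends_on=()):
            req_id = self.next_id
            self.next_id += 1
            deps = set(depends_on)
            # Implicit dependencies: serialize overlapping pending requests.
            for other_id, (other_range, _) in self.pending.items():
                if self._overlaps(block_range, other_range):
                    deps.add(other_id)
            self.pending[req_id] = (block_range, deps)
            return req_id

        def runnable(self):
            # A request may start once all of its dependencies have completed.
            return [rid for rid, (_, deps) in self.pending.items()
                    if deps <= self.completed]

        def complete(self, req_id):
            self.pending.pop(req_id, None)
            self.completed.add(req_id)

        @staticmethod
        def _overlaps(a, b):
            # Half-open ranges (start, end) overlap if each starts before
            # the other ends.
            return a[0] < b[1] and b[0] < a[1]

For the ordering requirement mentioned earlier, the i-node update would be submitted first and the data-block write submitted with an explicit dependency on that request's id.
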
Disk Scheduling
(Not discussed in class in Fall 2005)
• Policies considered:
– First come, first served (FCFS): implemented in the working prototype
– Shortest seek time first (SSTF): picks the pending request closest to the current head position
– Shortest access time first (SATF): considers both seek time and rotation time
– Batched nearest neighbor (BNN): runs SATF on all requests in the queue
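
A toy comparison of FCFS and a BNN-like batch policy on a one-dimensional seek model; rotation time is ignored, so the SATF step inside BNN degenerates to nearest-cylinder-first, and the cylinder numbers are made up.

    # Toy comparison of FCFS and a BNN-like policy on a 1-D seek model.
    # Rotation is ignored, so "shortest access time" reduces to nearest cylinder.

    def fcfs(head, requests):
        # First come, first served: service in queue order (as in the prototype).
        return list(requests)

    def bnn(head, requests):
        # Batched nearest neighbour: treat the current queue as one batch and
        # repeatedly service the request closest to the head position.
        order, remaining = [], list(requests)
        while remaining:
            nearest = min(remaining, key=lambda cyl: abs(cyl - head))
            remaining.remove(nearest)
            order.append(nearest)
            head = nearest
        return order

    if __name__ == "__main__":
        queue = [95, 10, 60, 12, 88]
        print("FCFS order:", fcfs(50, queue))   # [95, 10, 60, 12, 88]
        print("BNN order: ", bnn(50, queue))    # [60, 88, 95, 12, 10]
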
Evaluation (I)
• Based upon:
– A working prototype
• Used seven relatively slow Parsytec cards, each with its own disk drive
– An event-driven simulator used to test other configurations:
• Results were always within 6% of prototype measurements
Evaluation (II)
• Read performance:
– 1 MB/s links are enough unless the request sizes exceed 1 MB
Evaluation (III)
• Write performance:
– The large-stripe policy always results in a slight improvement
– At-parity is significantly better than at-originator, especially for link speeds below 10 MB/s
– The late commit protocol reduces throughput by at most 2% but can increase response time by up to 20%
– The early commit protocol is not much better
Evaluation (IV)
• TickerTAIP always outperforms a comparable centralized RAID architecture
• The best disk scheduling policy is Batched Nearest Neighbor (BNN)
Conclusion
• Can use physical redundancy to eliminate single points of failure
• Can use eleven 5 MIPS processors instead of a single 50 MIPS processor
• Can use off-the-shelf processors for parity computations
• Disk drives remain the bottleneck for small request sizes