Transcript: Carbonite

Efficient Replica Maintenance for
Distributed Storage Systems
B-G Chun, F. Dabek, A. Haeberlen, E.
Sit, H. Weatherspoon, M. Kaashoek, J.
Kubiatowicz, and R. Morris, In Proc. of
NSDI, May 2006.
Presenter: Fabián E. Bustamante
Replication in Wide-Area Storage
Applications put & get objects in/from the
wide-area storage system
Objects are replicated for
– Availability
• A get on the object will return promptly
– Durability
• Objects put by the app are not lost due to disk failures
– An object may be durably stored but not
immediately available
Goal: durability at low bandwidth cost
Durability is a more practical & useful goal than availability
Threat to durability
– Losing the last copy of an object
– So, create copies faster than they are destroyed
Challenges
– Replication can eat your bandwidth (see the rough feasibility check below)
– Hard to distinguish between transient & permanent failures
– After recovery, some replicas may be on nodes the lookup algorithm does not check
Paper presents Carbonite – an efficient wide-area replication technique for durability
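As a back-of-the-envelope illustration of the bandwidth concern (my own sketch, with assumed link speed and disk lifetime, not numbers from the paper): a node can only maintain as much data as it can re-copy within roughly one mean time to disk failure.

def max_maintainable_bytes(link_bps, mean_time_to_disk_failure_s):
    # Upper bound on the data one node can durably maintain: it must be able
    # to re-copy everything it stores within ~one mean time to disk failure.
    return (link_bps / 8) * mean_time_to_disk_failure_s

link_bps = 1.5e6                      # assumed 1.5 Mbps access link
mttf_s = 2 * 365 * 24 * 3600          # assumed disks fail every ~2 years on average
print(f"~{max_maintainable_bytes(link_bps, mttf_s) / 1e12:.1f} TB per node")

Storing more than this per node makes the system infeasible in the sense above, and it must eventually shed objects.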
System Environment
Uses PlanetLab (PL) as a representative environment
– >600 nodes distributed
world-wide
– History traces collected by the CoMon project (every 5 minutes)
– Disk failures from event
logs of PlanetLab Central
PL trace summary
– Dates: 3/1/05-2/28/06
– Hosts: 632
– Transient failures: 21355
– Disk failures: 219
– Transient host downtime (s), median / avg / 90th: 1208 / 104647 / 14242
– Any-failure interarrival (s), median / avg / 90th: 305 / 1467 / 3306
– Disk failure interarrival (s), median / avg / 90th: 544411 / 143476 / 490047
Synthetic traces
– 632 nodes, as in PL
– Failure inter-arrival times drawn from an exponential distribution (mean session time and downtime as in PL; see the sketch below)
– Two years instead of one, and an average node lifetime of 1 year
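A minimal sketch of generating such a synthetic trace (my own illustration; the function name and parameters are made up, and 305 s is the trace's any-failure mean inter-arrival time):

import random

def synthetic_failure_times(mean_interarrival_s, duration_s, seed=0):
    # Failure inter-arrival times drawn from an exponential distribution.
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_interarrival_s)
        if t > duration_s:
            return events
        events.append(t)

# e.g. two simulated years of failures with a 305 s mean inter-arrival time
print(len(synthetic_failure_times(305, 2 * 365 * 24 * 3600)))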
Simulation
– Trace-driven, event-based simulator (a loop skeleton is sketched below)
– Assumptions
• Network paths are independent
• All nodes reachable from all other nodes
• Each node with same link capacity
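As a rough illustration of the trace-driven, event-based style (structure only, my own sketch; the real simulator also models repair traffic, link capacity, and the maintenance algorithms being compared):

def run_trace(events, placement):
    # events: (time, node, kind) with kind in {"down", "up", "disk"}.
    # placement: object id -> set of nodes holding a replica. With no repair
    # at all, report which objects lose every replica to disk failures.
    holders = {obj: set(nodes) for obj, nodes in placement.items()}
    for time, node, kind in sorted(events):
        if kind == "disk":                 # permanent: local copies destroyed
            for nodes in holders.values():
                nodes.discard(node)
        # "down"/"up" are transient: copies survive, the node is just offline;
        # a maintenance algorithm would react to these events here.
    return [obj for obj, nodes in holders.items() if not nodes]

lost = run_trace(
    [(10.0, "n2", "down"), (50.0, "n2", "up"), (99.0, "n1", "disk")],
    {"objA": {"n1", "n2"}, "objB": {"n1"}},
)
print(lost)   # ['objB']: its only copy was on n1's failed disk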
Understanding durability
To handle some avg. rate of failure – create new
replicas faster than they are destroyed
– Function of per-node access link, number of nodes, amount of
data stored per node
An infeasible system – one unable to keep pace with the avg. failure rate – will eventually adapt by discarding objects (which ones?)
If the creation rate is only just above the failure rate, a failure burst may be a problem
Target number of replicas to maintain: rL
Durability does not increase continuously with rL (a toy model below illustrates why extra replicas eventually add little)
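To see why, here is a toy birth-death model (my own sketch, not the paper's analysis): replicas fail independently at rate lam, a single repair runs at rate mu whenever fewer than rL copies are live, and we compute the expected time until the last copy is lost.

import numpy as np

def mean_time_to_loss(r_L, lam, mu):
    # State k = number of live replicas (0..r_L); failures at rate k*lam,
    # repair at rate mu while 0 < k < r_L. Solve the first-step equations for
    # the expected time to reach state 0 starting from r_L copies.
    A = np.zeros((r_L, r_L))          # unknowns T_1 .. T_{r_L}; T_0 = 0
    b = np.ones(r_L)
    for k in range(1, r_L + 1):
        death = k * lam
        birth = mu if k < r_L else 0.0
        A[k - 1, k - 1] = death + birth
        if k >= 2:
            A[k - 1, k - 2] = -death  # transition to k-1 copies
        if k < r_L:
            A[k - 1, k] = -birth      # transition to k+1 copies
    return np.linalg.solve(A, b)[-1]

# Assumed rates: 1 failure per replica-year, repairs take ~1 month (mu = 12/yr)
for r in (2, 3, 4, 5):
    print(r, round(mean_time_to_loss(r, lam=1.0, mu=12.0), 1), "years")

With mu well above lam the expected lifetime grows rapidly with rL, so once it far exceeds the deployment's lifetime, additional replicas yield little observable gain in durability over a fixed trace.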
Improving repair time
Scope – set of other nodes that can hold copies of the
objects a node is responsible for
Small scope
– Easier to keep track of copies
– Effort of creating copies falls on a small set of nodes
– Addition of nodes may result in needless copying of objects (when combined w/ consistent hashing)
Large scope
– Spreads repair work among more nodes (a rough repair-time model follows below)
– Network traffic sources/destinations are spread
– Temporary failures will be noticed by more nodes
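A rough way to quantify the large-scope benefit (an assumed model, not a result from the paper): with scope s, the data lost on a failed disk can be re-created in parallel by roughly s other nodes, so repair time shrinks about linearly with scope, and faster repair leaves a smaller window for further failures to destroy the remaining copies.

def repair_time_hours(bytes_lost, scope, per_node_bps):
    # Assumed illustrative model; data size and link speed below are made up.
    senders = max(scope - 1, 1)       # nodes other than the failed one
    return bytes_lost / (senders * per_node_bps / 8) / 3600

for s in (2, 8, 32):
    print(s, round(repair_time_hours(100e9, s, 1.5e6), 1), "hours")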
Reducing transient costs
Impossible to reliably distinguish transient from permanent failures
To minimize network traffic due to transient failures: reintegrate replicas
Carbonite
– Select a suitable value for rL
– Respond to a detected failure by creating a new replica
– Reintegrate replicas that return after transient failures (sketched in the code below)
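A minimal sketch of this maintenance loop (illustrative names and data structures, not the authors' code): repair only when fewer than rL copies are reachable, and never forget or delete copies on nodes that are merely unreachable, so they count again once the node returns.

def maintain(obj, holders, r_L, is_reachable, create_replica):
    # holders: every node believed to store a copy, including currently
    # unreachable ones; their copies are reintegrated when they return.
    reachable = [n for n in holders if is_reachable(n)]
    while len(reachable) < r_L:
        new_node = create_replica(obj, exclude=holders)   # fresh node in scope
        holders.add(new_node)
        reachable.append(new_node)
    # Nothing is discarded if the reachable count later exceeds r_L; the
    # extras act as a buffer against future transient failures.

holders = {"n1", "n2", "n3"}
up = {"n1"}                              # n2 and n3 are temporarily offline
spare = iter(["n4", "n5"])
maintain("objA", holders, r_L=3,
         is_reachable=lambda n: n in up,
         create_replica=lambda obj, exclude: next(spare))
print(sorted(holders))                   # n2, n3 stay tracked for reintegration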
[Figure: bytes sent by different maintenance algorithms]
Reducing transient costs
[Figure: bytes sent with and without reintegration]
[Figure: impact of timeouts on bandwidth and durability]
Assumptions
The PlanetLab testbed can be seen as
representative of something
Immutable data
Relatively stable system membership & data
loss driven by disk failures
Disk failures are uncorrelated
Simulation
– Network paths are independent
– All nodes reachable from all other nodes
– Each node with same link capacity