Transcript: Carbonite
Efficient Replica Maintenance for Distributed Storage Systems
B.-G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M. F. Kaashoek, J. Kubiatowicz, and R. Morris. In Proc. of NSDI, May 2006.
Presenter: Fabián E. Bustamante, Fall 2005
EECS 443 Advanced Operating Systems, Northwestern University

Replication in Wide-Area Storage
Applications put and get objects into/from the wide-area storage system. Objects are replicated for
– Availability
  • A get on an object will return promptly
– Durability
  • Objects put by the application are not lost due to disk failures
– An object may be durably stored but not immediately available

Goal: durability at low bandwidth cost
Durability is a more practical and useful goal
Threat to durability
– Losing the last copy of an object
– So, create copies faster than they are destroyed
Challenges
– Replication can eat your bandwidth
– Hard to distinguish between transient and permanent failures
– After recovery, some replicas may sit on nodes the lookup algorithm does not check
The paper presents Carbonite, an efficient wide-area replication technique for durability

System Environment
Use PlanetLab (PL) as a representative environment
– >600 nodes distributed world-wide
– Historical traces collected by the CoMon project (every 5 minutes)
– Disk failures from the event logs of PlanetLab Central

  Dates: 3/1/05 - 2/28/06
  Hosts: 632
  Transient failures: 21355
  Disk failures: 219
  Transient host downtime (s), median / avg / 90th: 1208 / 104647 / 14242
  Any failure interarrival (s), median / avg / 90th: 305 / 1467 / 3306
  Disk failure interarrival (s), median / avg / 90th: 54411 / 143476 / 490047

Synthetic traces
– 632 nodes, as in PL
– Failure inter-arrival times drawn from an exponential distribution (mean session time and downtime as in PL)
– Two years instead of one, and an average node lifetime of one year
Simulation
– Trace-driven, event-based simulator
– Assumptions
  • Network paths are independent
  • All nodes are reachable from all other nodes
  • Each node has the same link capacity
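As a quick illustration of the synthetic-trace setup above, the sketch below draws exponentially distributed session times and downtimes for each of 632 nodes over a two-year horizon. It is only a minimal sketch of the idea, assuming the means are supplied as parameters: MEAN_SESSION_S and MEAN_DOWNTIME_S are placeholder values, not numbers from the paper, and only transient failures are modeled (disk failures, with the one-year average node lifetime, could be drawn the same way).

```python
import random

# The slide's synthetic setup: 632 nodes, a two-year horizon, exponentially
# distributed failure inter-arrival times. The two mean values below are
# placeholders for illustration, not numbers taken from the paper.
NUM_NODES = 632
TRACE_LENGTH_S = 2 * 365 * 24 * 3600    # two years of simulated time, in seconds
MEAN_SESSION_S = 100_000                # assumed mean time a node stays up
MEAN_DOWNTIME_S = 10_000                # assumed mean time a node stays down

def synthetic_node_trace(rng):
    """Return a list of (down_at, up_at) transient-failure intervals for one node."""
    events, t = [], 0.0
    while t < TRACE_LENGTH_S:
        t += rng.expovariate(1.0 / MEAN_SESSION_S)         # node stays up this long
        down_for = rng.expovariate(1.0 / MEAN_DOWNTIME_S)   # then is offline this long
        if t < TRACE_LENGTH_S:
            events.append((t, t + down_for))
        t += down_for
    return events

rng = random.Random(42)
trace = {node: synthetic_node_trace(rng) for node in range(NUM_NODES)}
print(sum(len(v) for v in trace.values()), "transient failures generated")
```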
Understanding durability
To handle some average rate of failure
– Create new replicas faster than they are destroyed
– Feasibility is a function of the per-node access link, the number of nodes, and the amount of data stored per node
An infeasible system
– is unable to keep pace with the average failure rate
– will eventually adapt by discarding objects (which ones?)
If the creation rate is only just above the failure rate, a failure burst may be a problem
Target number of replicas to maintain: rL
Durability does not increase continuously with rL

Improving repair time
Scope: the set of other nodes that can hold copies of the objects a node is responsible for
Small scope
– Easier to keep track of copies
– The effort of creating copies falls on a small set of nodes
– Adding nodes may result in needless copying of objects (when combined with consistent hashing)
Large scope
– Spreads the work among more nodes
– Network traffic sources and destinations are spread out
– Transient failures will be noticed by more nodes

Reducing transient costs
It is impossible to distinguish transient from permanent failures
To minimize network traffic due to transient failures: reintegrate replicas
Carbonite (see the sketch at the end of this transcript)
– Select a suitable value for rL
– Respond to a detected failure by creating a new replica
– Reintegrate replicas that return
[Figure: bytes sent by different maintenance algorithms]

Reducing transient costs (cont.)
[Figure: bytes sent with and without reintegration]
[Figure: impact of timeouts on bandwidth and durability]

Assumptions
The PlanetLab testbed can be seen as representative of something
Immutable data
Relatively stable system membership, with data loss driven by disk failures
Disk failures are uncorrelated
Simulation
– Network paths are independent
– All nodes are reachable from all other nodes
– Each node has the same link capacity
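To make the maintenance rule from the "Reducing transient costs" slides concrete, here is a minimal sketch of a Carbonite-style loop for a single object: create a new replica only while fewer than rL copies are reachable, and never forget old copies, so replicas that return from a transient failure are reintegrated rather than recreated. The function maintain, its parameters known_replicas and up_nodes, and the placement via random choice are illustrative assumptions, not the paper's actual interfaces; failure detection and the data transfer itself are abstracted away.

```python
import random

R_L = 3  # target number of reachable replicas (rL); a suitable value is chosen per deployment

def maintain(known_replicas, up_nodes, rng=random):
    """One maintenance round for a single object.

    known_replicas: nodes that ever received a copy; the set is never pruned,
    so a copy on a node that returns from a transient failure counts again
    (reintegration) instead of triggering a new transfer.
    up_nodes: nodes currently reachable from the node responsible for the object.
    """
    reachable = known_replicas & up_nodes
    # Repair only while fewer than rL copies are reachable.
    while len(reachable) < R_L:
        candidates = up_nodes - known_replicas
        if not candidates:
            break  # no reachable node lacks a copy; try again next round
        new_node = rng.choice(sorted(candidates))  # stand-in for real placement (scope)
        # ... the object would be copied to new_node over the network here ...
        known_replicas.add(new_node)
        reachable.add(new_node)
    return known_replicas

# Example: only one of three known replicas is reachable, so two new copies
# are made to bring the number of reachable replicas back up to rL.
print(maintain({"n1", "n2", "n3"}, {"n1", "n4", "n5"}))
```

Because known_replicas is never pruned, copies made during a burst of transient failures are counted again when their nodes return, which is the bandwidth saving illustrated by the "bytes sent with and without reintegration" figure.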