NCCloud: A Network-Coding-Based Storage System in a Cloud

NCCloud: A Network-Coding-Based
Storage System in a Cloud-of-Clouds
Henry C. H. Chen
Yuchong Hu
Patrick P. C. Lee
Yang Tang
IEEE Transactions on Computers, 15 August 2013
Outline
- Introduction
- Repair in Multiple Cloud Storage
- FMSR Codes
- NCCloud
- Conclusion
Introduction
- Cloud storage provides an on-demand remote backup solution.
- Relying on a single cloud storage provider creates problems such as a single point of failure.
Introduction
- The general solution is to distribute data across different cloud providers.
  - stripe the data across the clouds
- The diversity of multiple clouds improves fault tolerance.
Introduction-Data Failure
- This paper focuses on unexpected permanent cloud failures.
  - When a cloud fails permanently, a repair is activated.
  - The repair maintains data redundancy and fault tolerance.
- A repair operation
  - retrieves data from the surviving clouds.
  - reconstructs the lost data in a new cloud.
Introduction-Data Failure
- During repair, each surviving node
  - encodes its stored data chunks.
  - sends the encoded chunks to a new node.
- The new node regenerates the lost data.
Introduction-Cost Problem
- Today’s cloud storage providers charge users for outbound data.
- When repairing a failure, moving a large amount of data (the repair traffic) can incur significant monetary costs.
Introduction-Repair Traffic Problem
- To minimize repair traffic, regenerating codes [16] have been proposed. They
  - store data redundantly in a distributed storage system.
  - require less repair traffic at the same fault-tolerance level.
[16] Network Coding for Distributed Storage Systems
Introduction-Regenerating Codes
- However, most existing regenerating codes require storage nodes to
  - be equipped with computation capabilities.
  - perform encoding operations during repair.
Introduction-Regenerating Codes
- To make regenerating codes portable to any cloud storage service, this paper assumes only a thin-cloud interface, where storage nodes support only read/write operations.
Introduction-NCCloud
- This paper presents the design and implementation of NCCloud:
  - a proxy-based storage system.
  - fault-tolerant storage.
  - built over multiple cloud storage providers.
Introduction-FMSR
- For NCCloud, this paper proposes functional minimum-storage regenerating (FMSR) codes.
- The FMSR code implementation
  - maintains double-fault tolerance.
  - has the same storage cost as RAID-6.
  - uses less repair traffic when recovering from a single-cloud failure.
Introduction-FMSR
- FMSR codes are non-systematic:
  - the encoded chunks are formed by linear combinations of the original data chunks.
  - the original data chunks are not kept, unlike in systematic coding schemes.
Repair in Multiple Cloud Storage
- Transient failure
  - is short-term: the failed cloud returns to normal after some time, and no outsourced data is lost.
Repair in Multiple Cloud Storage
- Permanent failure
  - is long-term: the outsourced data on a failed cloud becomes permanently unavailable.
  - examples:
    - data center outages in disasters.
    - data loss and corruption.
    - malicious attacks.
Outline
- Introduction
- Repair in Multiple Cloud Storage
- FMSR Codes
  - Motivation
  - Implementation
- NCCloud
- Conclusion
Motivation
- This paper considers
  - a distributed setting
  - multiple-cloud storage
  - striped data
  - a proxy-based design
Fault-tolerant
- Maximum Distance Separable (MDS) property
  - An (n, k)-MDS code
    - divides a file into equal-size native chunks.
    - linearly combines the native chunks to form code chunks.
  - The code chunks are distributed over n nodes (n > k).
  - The original file can be reconstructed from any k of the n nodes.
  - The failure of any n − k nodes is tolerated.
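To make the MDS property concrete, here is a minimal sketch of the simplest case, a (3, 2)-MDS code built from XOR parity. It is not the FMSR code from the paper, and the function names `encode` and `decode` are illustrative only. The file is split into two native chunks, a third chunk holds their XOR, and any two of the three nodes suffice to rebuild the file.

```python
from itertools import combinations


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data):
    """Split even-length data into k = 2 native chunks plus one XOR parity chunk (n = 3)."""
    if len(data) % 2:
        raise ValueError("this sketch expects even-length data")
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, xor_bytes(a, b)]


def decode(available):
    """Rebuild the file from any two chunks; `available` maps node index (0, 1, 2) to chunk."""
    if 0 in available and 1 in available:
        a, b = available[0], available[1]
    elif 0 in available:                      # nodes 0 and 2 survive
        a = available[0]
        b = xor_bytes(a, available[2])
    else:                                     # nodes 1 and 2 survive
        b = available[1]
        a = xor_bytes(b, available[2])
    return a + b


if __name__ == "__main__":
    chunks = encode(b"cloud-of-clouds!")
    for pair in combinations(range(3), 2):
        print(pair, decode({i: chunks[i] for i in pair}))
```

Here n − k = 1, so the code tolerates the loss of any single node.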
Fault-tolerant
- FMSR codes can reconstruct the data of a failed node from the surviving nodes
  - by downloading less data.
  - without reconstructing the whole file.
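Different Coding Schemes
[Figure: three coding schemes compared by storage size and repair traffic for a file of size M]
- Panel 1: storage size 2M, repair traffic M
- Panel 2: storage size 2M, repair traffic 0.75M
- Panel 3: storage size 2M, repair traffic 0.75M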
Double-fault Tolerant FMSR Codes
- Divide a file of size $M$ into $2(n-2)$ native chunks.
- Generate $2n$ code chunks.
- Each node stores two code chunks, each of size $\frac{M}{2(n-2)}$.
- To repair a failed node, the repair traffic is $\frac{M(n-1)}{2(n-2)}$ (up to 50% saved compared with RAID-6, approached as $n$ grows).
- For RAID-6 codes, the total storage size is $\frac{Mn}{n-2}$ and the repair traffic is $M$.
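The storage and repair-traffic formulas above can be checked numerically. The sketch below uses exact fractions; the function names (`fmsr_costs`, `raid6_costs`, `repair_saving`) are illustrative and the file size is normalised to M = 1.

```python
from fractions import Fraction


def fmsr_costs(n, file_size=1):
    """Double-fault-tolerant FMSR (k = n - 2): chunk size, total storage, repair traffic."""
    chunk = Fraction(file_size, 2 * (n - 2))        # M / (2(n-2))
    return {
        "chunk": chunk,
        "storage": 2 * n * chunk,                   # 2n code chunks = Mn / (n-2)
        "repair": (n - 1) * chunk,                  # one chunk from each of n-1 survivors
    }


def raid6_costs(n, file_size=1):
    """RAID-6 with the same n: total storage Mn / (n-2), repair traffic M."""
    return {"storage": Fraction(file_size * n, n - 2), "repair": Fraction(file_size)}


def repair_saving(n):
    """Fraction of repair traffic saved by FMSR relative to RAID-6."""
    return 1 - fmsr_costs(n)["repair"] / raid6_costs(n)["repair"]


if __name__ == "__main__":
    for n in (4, 6, 10, 100):
        print(n, fmsr_costs(n)["repair"], float(repair_saving(n)))
```

For n = 4 the repair traffic is 0.75M (a 25% saving), and the saving approaches 50% only as n grows large.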
FMSR Codes Implementation
- FMSR codes do not require lost chunks to be exactly reconstructed.
  - The regenerated chunks need not be identical to those on the failed node.
- This is acceptable as long as the MDS property holds.
FMSR Codes Implementation
- This paper proposes a two-phase checking scheme to ensure that the code chunks on all nodes always satisfy the MDS property.
FMSR Codes Implementation
- The implementation assumes a thin-cloud interface and covers three operations:
  1. File upload
  2. File download
  3. Repair
File Upload
- Native chunks: $k(n-k)$ equal-size chunks of the file
- Code chunks: $n(n-k)$ chunks, $n-k$ per node
- Encoding matrix of coefficients:
  - size $n(n-k) \times k(n-k)$
  - entries in the Galois field $GF(p^n)$
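File Upload
- Arithmetic is performed in the Galois field $GF(p^n)$.
- Each row of the encoding matrix is an encoding coefficient vector (ECV), which specifies how one code chunk is formed from the native chunks.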
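A minimal sketch of the upload-side encoding, assuming the field is GF(2^8) with the reduction polynomial 0x11B (the slides only say GF(p^n), so both choices are assumptions). Each code chunk is the linear combination of the native chunks given by one ECV; addition in this field is XOR. The function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B           # reduce back below degree 8
        b >>= 1
    return result


def encode(native, em):
    """Form one code chunk per ECV (row of em) from equal-length native chunks."""
    size = len(native[0])
    code_chunks = []
    for ecv in em:
        chunk = bytearray(size)
        for coef, nat in zip(ecv, native):
            for i, byte in enumerate(nat):
                chunk[i] ^= gf_mul(coef, byte)
        code_chunks.append(bytes(chunk))
    return code_chunks


if __name__ == "__main__":
    native = [b"\x01\x02", b"\x03\x04"]
    em = [[1, 0], [0, 1], [1, 1], [1, 2]]   # toy coefficients, not an FMSR matrix
    print([c.hex() for c in encode(native, em)])
```

A real implementation would typically use lookup tables for the field multiplication; the bitwise loop is kept here for clarity.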
File Download
1. Download the k(n − k) code chunks from any k of the n storage nodes.
2. The ECVs of these k(n − k) code chunks form a k(n − k) × k(n − k) square matrix.
3. Obtain the original k(n − k) native chunks
  - by multiplying the inverse of the square matrix with the code chunks.
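The three download steps can be sketched as follows. To keep the matrix inversion short, this example assumes a prime field GF(257) instead of GF(p^n) in general, and it uses a Vandermonde matrix as a stand-in encoding matrix because any square selection of its rows is invertible; neither choice comes from the paper. Chunks are lists of integers below 257.

```python
from itertools import combinations

P = 257  # assumption: small prime field so inverses are pow(x, -1, P)


def mat_times_chunks(matrix, chunks):
    """Multiply a coefficient matrix with a list of equal-length chunks, modulo P."""
    size = len(chunks[0])
    return [[sum(c * chunk[i] for c, chunk in zip(row, chunks)) % P
             for i in range(size)] for row in matrix]


def invert(matrix):
    """Gauss-Jordan inversion modulo P; raises ValueError if the matrix is singular."""
    size = len(matrix)
    aug = [[x % P for x in row] + [int(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next((r for r in range(col, size) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("ECV matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [(x * inv) % P for x in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(a - factor * b) % P for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def download(ecvs, code_chunks):
    """Steps 2 and 3: invert the square ECV matrix and recover the native chunks."""
    return mat_times_chunks(invert(ecvs), code_chunks)


# n = 4, k = 2: 8 code chunks (2 per node), 4 native chunks.
EM = [[pow(x, j, P) for j in range(4)] for x in range(1, 9)]

if __name__ == "__main__":
    native = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [250, 0, 255]]
    code = mat_times_chunks(EM, native)
    for a, b in combinations(range(4), 2):       # any k = 2 of the n = 4 nodes
        rows = [2 * a, 2 * a + 1, 2 * b, 2 * b + 1]
        print((a, b), download([EM[r] for r in rows], [code[r] for r in rows]) == native)
```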
Iterative Repair
- The MDS property must hold even after iterative repairs.
- This paper proposes two-phase checking of
  - the MDS property
  - the rMDS (repair MDS) property
Satisfy MDS, but not rMDS
Iterative Repair
Step 1. Download the encoding matrix from a surviving node.
Step 2. Select one ECV from each of the n − 1 surviving nodes.
Step 3. Generate a repair matrix.
Step 4. Compute the ECVs for the new code chunks and produce a new encoding matrix EM’.
Iterative Repair
Step 5. Given EM’, verify that both properties are satisfied.
  - verify MDS by enumerating all $\binom{n}{k}$ subsets of k nodes.
  - verify rMDS by enumerating $n(n-k)^{n-1}\binom{n}{k}$ cases.
  - In every case, the corresponding encoding matrix must have full rank.
Step 6. Download the actual chunk data and regenerate the new chunk data from
  - the new ECVs (Step 4).
  - the code chunks from the surviving nodes.
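Steps 2 to 5 can be sketched as a retry loop. This is a simplified illustration under stated assumptions: it works over the prime field GF(257) rather than GF(p^n), it draws the repair matrix at random, and it performs only the MDS check of Step 5 (the rMDS check is omitted). All function names are hypothetical.

```python
import random
from itertools import combinations

P = 257  # assumption: a small prime field stands in for GF(p^n)


def rank_mod_p(rows):
    """Rank of a matrix over GF(P), by Gaussian elimination."""
    m = [[x % P for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, P)
        m[rank] = [(x * inv) % P for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                factor = m[r][col]
                m[r] = [(a - factor * b) % P for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def is_mds(em, n, k):
    """MDS check: the ECVs of every k-node subset must form a full-rank matrix."""
    per_node = n - k
    for nodes in combinations(range(n), k):
        rows = [em[node * per_node + c] for node in nodes for c in range(per_node)]
        if rank_mod_p(rows) != k * per_node:
            return False
    return True


def random_mds_em(n, k, rng):
    """Draw random encoding matrices until one satisfies the MDS property."""
    while True:
        em = [[rng.randrange(P) for _ in range(k * (n - k))] for _ in range(n * (n - k))]
        if is_mds(em, n, k):
            return em


def repair_em(em, n, k, failed, rng, max_tries=1000):
    """Return a new encoding matrix EM' in which only the failed node's ECVs change."""
    per_node = n - k
    survivors = [node for node in range(n) if node != failed]
    for _ in range(max_tries):
        # Step 2: one ECV from each of the n - 1 surviving nodes.
        picked = [em[node * per_node + rng.randrange(per_node)] for node in survivors]
        # Step 3: random (n - k) x (n - 1) repair matrix.
        rm = [[rng.randrange(P) for _ in survivors] for _ in range(per_node)]
        # Step 4: new ECVs are the repair matrix times the picked ECVs.
        new_rows = [[sum(c * ecv[j] for c, ecv in zip(row, picked)) % P
                     for j in range(len(picked[0]))] for row in rm]
        candidate = [list(row) for row in em]
        candidate[failed * per_node:(failed + 1) * per_node] = new_rows
        # Step 5 (MDS part only): accept the candidate if every k-subset is full rank.
        if is_mds(candidate, n, k):
            return candidate
    raise RuntimeError("no MDS-preserving repair found")


if __name__ == "__main__":
    rng = random.Random(2013)
    em = random_mds_em(4, 2, rng)
    print(is_mds(repair_em(em, 4, 2, failed=3, rng=rng), 4, 2))
```

Checking only the ECVs before touching chunk data is what keeps the check cheap: the actual chunks are downloaded (Step 6) only once a suitable EM’ has been found.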
rMDS Sustaining
Time of Two-phase Checking
Double-fault Tolerant Codes
- Markov model
MTTDL (Mean Time To Data Loss), Compared to RAID-6
NCCloud
- A proxy that bridges user applications and multiple clouds.
- Its design is built on three layers:
  - File system layer
  - Coding layer
  - Storage layer
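The layering can be illustrated with a toy proxy. This is only a sketch of the structure, not NCCloud's code: `MemoryCloud`, `StripingCoder` and `Proxy` are hypothetical names, the clouds are in-memory dictionaries exposing only put/get (the thin-cloud assumption), and the coding layer is a placeholder that stripes data without redundancy where a real system would plug in RAID-6 or FMSR.

```python
class MemoryCloud:
    """Storage layer stand-in: a thin-cloud interface with only put and get."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]


class StripingCoder:
    """Coding layer placeholder: one stripe per cloud, no redundancy."""

    def __init__(self, n):
        self.n = n

    def encode(self, data):
        size = -(-len(data) // self.n)  # ceiling division
        return [data[i * size:(i + 1) * size] for i in range(self.n)]

    def decode(self, chunks):
        return b"".join(chunks)


class Proxy:
    """File-system-facing layer: applications call upload/download on whole files."""

    def __init__(self, clouds, coder):
        self.clouds = clouds
        self.coder = coder

    def upload(self, name, data):
        for i, (cloud, chunk) in enumerate(zip(self.clouds, self.coder.encode(data))):
            cloud.put(f"{name}.chunk{i}", chunk)

    def download(self, name):
        chunks = [cloud.get(f"{name}.chunk{i}") for i, cloud in enumerate(self.clouds)]
        return self.coder.decode(chunks)


if __name__ == "__main__":
    clouds = [MemoryCloud() for _ in range(4)]
    proxy = Proxy(clouds, StripingCoder(len(clouds)))
    proxy.upload("backup.tar", b"hello cloud-of-clouds")
    print(proxy.download("backup.tar"))
```

Because the proxy only calls put and get, any provider that supports reading and writing objects could sit behind the storage layer.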
NCCloud
- It is mainly implemented in Python, while the coding schemes are implemented in C for better efficiency.
Goal of NCCloud
- Compare the costs and response times of using RAID-6 and FMSR codes.
- Show the cost advantage of FMSR over RAID-6 while maintaining acceptable response time.
Goal of NCCloud
- Normal operations
  - RAID-6 and FMSR incur similar storage costs.
- Repair operation
  - FMSR saves a significant amount of transfer cost over RAID-6.
Cost Saving-Price
Cost Saving
- Normal operations
  - 1.25PB of data stored
    - FMSR: $86,851 monthly storage cost
    - RAID-6: $86,851 monthly storage cost
- Repair operation
  - RAID-6: 1PB of data, $56,832
  - FMSR: 0.5625PB of data, $33,894
  - Saving of $22,938 with FMSR
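The repair figures above can be cross-checked with a few lines of arithmetic. The dollar amounts and data volumes are taken from the slide; the observation that 0.5625 equals (n − 1) / (2(n − 2)) for n = 10 is an inference, since the slide does not state n.

```python
raid6_repair_cost = 56_832      # USD, from the slide
fmsr_repair_cost = 33_894       # USD, from the slide
raid6_repair_pb = 1.0           # PB transferred
fmsr_repair_pb = 0.5625         # PB transferred

saving = raid6_repair_cost - fmsr_repair_cost
saving_pct = 100 * saving / raid6_repair_cost
traffic_ratio = fmsr_repair_pb / raid6_repair_pb

# Inferred, not stated on the slide: the traffic ratio matches n = 10.
n = 10
formula_ratio = (n - 1) / (2 * (n - 2))

if __name__ == "__main__":
    print(f"saving: ${saving:,} ({saving_pct:.1f}% of the RAID-6 repair cost)")
    print(f"traffic ratio: {traffic_ratio}, formula for n = {n}: {formula_ratio}")
```

The cost saving (about 40%) is smaller than the traffic saving (about 44%), which suggests the price per byte is not flat across the two volumes.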
Response Time-Local Cloud
Response Time-Commercial Cloud
Conclusion
- This paper presents NCCloud, which improves the reliability of today’s cloud backup storage. NCCloud is a
  - proxy-based,
  - multiple-cloud storage system.
- NCCloud not only provides fault tolerance in storage, but also allows cost-effective repair.
- The FMSR code implementation eliminates the need for storage nodes to perform encoding during repair.