
DeDu: Building a Deduplication
Storage System over Cloud Computing
Speaker: Yen-Yi Chen (MA190104)
Date: 2013/05/28
This paper appears in:
Computer Supported Cooperative Work in Design (CSCWD), 2011 15th International Conference on
Date of Conference: 8-10 June 2011
Author(s):
Zhe Sun, Jun Shen, Faculty of Informatics, University of Wollongong, Wollongong, NSW, Australia
Jianming Yong, Faculty of Business, University of Southern Queensland, Toowoomba, QLD, Australia
Outline
• Introduction
• Two issues to be addressed
• Deduplication
• Theories and approaches
• System design
• Simulations and Experiments
• Conclusions
Introduction
• Rise of cloud computing and distributed system architectures
• Information explosion and massive volumes of data
• Rising cost of storage devices
• Growing data transmission and the need to ease network bandwidth consumption
Introduction
• System name: DeDu
• Front-end: deduplication application
• Back-end: Hadoop Distributed File System (HDFS) and HBase
  (a minimal connection sketch follows below)
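To make the front-end/back-end split concrete, here is a minimal sketch (not the authors' code) of a front-end client opening its two back-end connections. The HBase table name "dedu" and the use of the classic HTable client API are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class DeduClient {
    public static void main(String[] args) throws Exception {
        // HDFS stores the actual (deduplicated) file contents.
        Configuration hdfsConf = new Configuration();
        FileSystem hdfs = FileSystem.get(hdfsConf);
        System.out.println("HDFS root exists: " + hdfs.exists(new Path("/")));

        // HBase keeps the index from content hashes to stored files.
        Configuration hbaseConf = HBaseConfiguration.create();
        HTable index = new HTable(hbaseConf, "dedu");   // "dedu" is a hypothetical table name
        System.out.println("Connected to HBase table: dedu");

        index.close();
        hdfs.close();
    }
}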
Two issues to be addressed
• How does the system identify duplication?
  * Hash functions: MD5 and SHA-1 (see the fingerprint sketch below)
• How does the system manage the data?
  * HDFS and HBase
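A minimal sketch of how a duplicate could be identified with the hash functions named above: two files with identical contents yield identical digests. Whether DeDu hashes whole files or smaller units is not shown here; this simply demonstrates the fingerprinting step.

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class Fingerprint {
    // Compute the digest ("MD5" or "SHA-1") of a file's contents as a hex string.
    public static String hashFile(String path, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Identical contents produce identical fingerprints, which is how a
        // duplicate is recognised before it is stored a second time.
        System.out.println(hashFile(args[0], "SHA-1"));
    }
}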
Deduplication
[Figure: "Data Store" diagrams illustrating file-level vs. block-level deduplication with example files A, B, C and sub-file chunks a, b, c]

How deduplication works (see the chunk-level sketch after this slide):
1. Data chunks are evaluated to determine a unique signature for each.
2. Signature values are compared to identify all duplicates.
3. Duplicate data chunks are replaced with pointers to a single stored chunk, saving storage space.

Comparison of the two deduplication types (translated from the original table):

File-level
• Comparison level: file
• Comparison scope: entire designated volume
• Advantage: greatest capacity reduction when identical copies of a single file are stored
• Disadvantage: ineffective for encoded files; two files that are not exactly identical are still stored in full
• Deduplication ratio: 1:2 ~ 1:5

Block-level
• Comparison level: block
• Comparison scope: entire designated volume
• Advantage: can compare across files and detect duplicate parts at the sub-file level
• Disadvantage: consumes more processing resources
• Deduplication ratio: 1:200 or even higher
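To make the three block-level steps above concrete, here is a minimal sketch (not DeDu's code) that splits data into fixed-size chunks, fingerprints each with SHA-1, stores only unique chunks, and replaces duplicates with references to the single stored copy. The chunk size and data structures are illustrative assumptions.

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockLevelDedup {
    static final int CHUNK_SIZE = 4096;            // illustrative fixed chunk size
    Map<String, byte[]> store = new HashMap<>();   // signature -> single stored chunk

    // Steps 1-3: chunk the data, fingerprint each chunk, keep one copy per
    // signature, and return the signature list that acts as the pointers.
    List<String> write(byte[] data) throws Exception {
        List<String> pointers = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK_SIZE) {
            byte[] chunk = Arrays.copyOfRange(data, off, Math.min(off + CHUNK_SIZE, data.length));
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(chunk);
            String sig = new java.math.BigInteger(1, digest).toString(16);
            store.putIfAbsent(sig, chunk);          // a duplicate chunk is not stored again
            pointers.add(sig);
        }
        return pointers;
    }

    // Rebuild the original data by following the pointers to the stored chunks.
    byte[] read(List<String> pointers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String sig : pointers) {
            byte[] chunk = store.get(sig);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}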
Theories and approaches
A. The architecture of source data and link files
B. The architecture of the deduplication cloud storage system
Source data and link files
Deduplication Cloud storage system
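The idea behind point A is that each file a user sees is only a small link file pointing at one shared copy of the source data. A minimal sketch of what such a link file might contain follows; the field names are assumptions, not the paper's exact format.

// A hypothetical representation of a "link file": the user-visible file holds
// only a pointer (the content hash and the HDFS path of the shared source
// data), not the data itself. Many link files can share one source file.
public class LinkFile {
    String logicalName;   // the file name the user sees
    String contentHash;   // MD5/SHA-1 of the source data, used as the index key
    String sourcePath;    // HDFS path of the single stored copy of the data

    LinkFile(String logicalName, String contentHash, String sourcePath) {
        this.logicalName = logicalName;
        this.contentHash = contentHash;
        this.sourcePath = sourcePath;
    }

    public String toString() {
        return logicalName + " -> " + sourcePath + " (" + contentHash + ")";
    }
}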
System design
A. Data organisation
B. Storage of the files
C. Access to the files
D. Deletion of files
(a sketch of the storage and deletion flow follows below)
Data organisation
Storage of the files
Access to the files
Deletion of files
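A minimal sketch of the storage and deletion flow implied by B and D above: the front end fingerprints a file, consults an index (kept in HBase in DeDu; a HashMap stands in for it here), uploads the data only when the fingerprint is new, and uses a reference count so the shared copy is removed only when its last link is deleted. All names and paths are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class DeduFlow {
    // Stand-in for the HBase index: fingerprint -> (source path, link count).
    static class Entry { String sourcePath; int links; }
    Map<String, Entry> index = new HashMap<>();

    // B. Storage of the files: upload the data only if its fingerprint is new,
    //    otherwise just add another link to the existing copy.
    String store(String fingerprint, byte[] data) {
        Entry e = index.get(fingerprint);
        if (e == null) {
            e = new Entry();
            e.sourcePath = "/dedu/data/" + fingerprint;   // hypothetical HDFS path
            uploadToHdfs(e.sourcePath, data);             // done only once per unique file
            index.put(fingerprint, e);
        }
        e.links++;                                        // one more link file points here
        return e.sourcePath;
    }

    // D. Deletion of files: remove a link; delete the shared copy only when
    //    no link file references it any more.
    void delete(String fingerprint) {
        Entry e = index.get(fingerprint);
        if (e == null) return;
        if (--e.links == 0) {
            removeFromHdfs(e.sourcePath);
            index.remove(fingerprint);
        }
    }

    void uploadToHdfs(String path, byte[] data) { /* write to HDFS in the real system */ }
    void removeFromHdfs(String path)            { /* delete from HDFS in the real system */ }
}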
Simulations and Experiments
Performance evaluations
Conclusions
• 1. The fewer the data nodes, the higher the writing efficiency, but the lower the reading efficiency.
• 2. The more the data nodes, the lower the writing efficiency, but the higher the reading efficiency.
• 3. When a single file is large, the time to calculate hash values is higher, but the transmission cost is lower.
• 4. When a single file is small, the time to calculate hash values is lower, but the transmission cost is higher.
Thank you for listening.