Toward Achieving Tapeless Backup at PB Scales
Hakim Weatherspoon
University of California, Berkeley
Frontiers in Distributed Information Systems
San Francisco, Thursday, July 31, 2003
OceanStore Context: Ubiquitous Computing
• Computing everywhere:
– Desktop, Laptop, Palmtop.
– Cars, Cellphones.
– Shoes? Clothing? Walls?
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net.
– Broadband to the home and office.
– Wireless technologies such as CDMA, satellite, and laser.
Archival Storage
• Where is persistent information stored?
– Want: geographic independence for availability, durability, and freedom to adapt to circumstances.
• How is it protected?
– Want: encryption for privacy, secure naming and signatures for authenticity, and Byzantine commitment for integrity.
• Is it available/durable?
– Want: redundancy with continuous repair and redistribution for long-term durability.
Path of an Update
Questions about Data?
• How to use redundancy to protect against data being lost?
• How to verify data?
• How many resources are needed to keep data durable? Storage? Bandwidth?
Archival Dissemination Built into Update
• Erasure codes
– Redundancy without the overhead of strict replication.
– Produce n fragments, any m of which suffice to reconstruct the data (m < n). The rate is r = m/n, and the storage overhead is 1/r. (A toy encoding sketch follows below.)
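A minimal sketch of the (m, n) property, assuming a toy Reed-Solomon-style code over the prime field GF(257) built from Lagrange interpolation; the deployed archival layer uses its own production erasure code, so this is illustrative only, and the names encode, decode, and _lagrange_eval are hypothetical.

# Toy (m, n) erasure code over GF(257): any m of the n fragments reconstruct
# the data. Fragment values may reach 256, so this is not byte-exact -- it only
# illustrates rate r = m/n and storage overhead 1/r.
P = 257  # small prime field; data symbols are ints in 0..255

def _lagrange_eval(points, x):
    """Evaluate the unique degree < len(points) polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, m, n):
    """Encode m data symbols into n fragments (systematic: fragment k = data[k] for k < m)."""
    assert len(data) == m and m < n <= P
    points = list(enumerate(data))                      # symbol k sits at x = k
    return [(x, _lagrange_eval(points, x)) for x in range(n)]

def decode(fragments, m):
    """Reconstruct the m data symbols from any m surviving (x, y) fragments."""
    assert len(fragments) >= m
    pts = fragments[:m]
    return [_lagrange_eval(pts, x) for x in range(m)]

if __name__ == "__main__":
    m, n = 4, 16                           # rate r = m/n = 1/4, storage overhead 1/r = 4x
    data = [10, 20, 30, 40]
    frags = encode(data, m, n)
    assert decode(frags[5:9], m) == data   # any m = 4 fragments suffice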
Durability
• Fraction of Blocks Lost Per Year (FBLPY)*
– r = 1/4, erasure-encoded block (e.g. m = 16, n = 64).
– Increasing the number of fragments increases the durability of a block
• at the same storage cost and repair time.
– The n = 4 fragment case is equivalent to replication on four servers.
(A back-of-the-envelope durability sketch follows below.)
* Erasure Coding vs. Replication, H. Weatherspoon and J. Kubiatowicz, In Proc. of IPTPS 2002.
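A back-of-the-envelope sketch of the comparison above, assuming fragments fail independently and each survives a repair epoch with probability p_frag (an assumed parameter, not a figure from the talk): a block is lost only when fewer than m of its n fragments survive.

from math import comb

def p_block_lost(n, m, p_frag):
    """Probability that fewer than m of n independent fragments survive."""
    return sum(comb(n, k) * p_frag**k * (1 - p_frag)**(n - k) for k in range(m))

# Same r = 1/4 storage overhead, very different durability:
print(p_block_lost(4, 1, 0.9))    # 4 replicas (any 1 suffices): 1e-4
print(p_block_lost(64, 16, 0.9))  # m = 16, n = 64 code: many orders of magnitude smaller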
Naming and Verification Algorithm
• Use a cryptographically secure hash algorithm to detect corrupted fragments.
• Verification tree:
– n is the number of fragments.
– Store log(n) + 1 hashes with each fragment.
– Total of n*(log(n) + 1) hashes.
• Top hash is the block GUID (B-GUID).
• Fragments and blocks are self-verifying.
(Figure: verification tree over four encoded fragments F1–F4. H1–H4 hash the fragments, H12 = hash(H1, H2), H34 = hash(H3, H4), H14 = hash(H12, H34), and Hd = hash(data); the B-GUID is the hash of H14 and Hd. Each fragment is stored with its sibling hashes, e.g. Fragment 1 carries H2, H34, and Hd along with the F1 fragment data; the data block is stored with H14. A code sketch of this tree follows below.)
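A minimal sketch of this verification tree, assuming SHA-256 as the secure hash (the concrete hash function is not named here) and n a power of two; the function names are illustrative, not from the real system.

import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build_tree(fragments, data):
    """Hash fragments pairwise up to the root; B-GUID = hash(root, Hd)."""
    hd = H(data)
    level = [H(f) for f in fragments]            # H1..Hn
    tree = [level]
    while len(level) > 1:
        level = [H(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree, hd, H(level[0], hd)             # (levels, Hd, B-GUID)

def proof_for(tree, index):
    """The log2(n) sibling hashes stored alongside fragment `index`."""
    path, i = [], index
    for level in tree[:-1]:
        path.append(level[i ^ 1])                # sibling at this level
        i //= 2
    return path

def verify(fragment, index, path, hd, b_guid):
    """Recompute the root from one fragment and its stored hashes; check the B-GUID."""
    h = H(fragment)
    for sibling in path:
        h = H(h, sibling) if index % 2 == 0 else H(sibling, h)
        index //= 2
    return H(h, hd) == b_guid

# Four-fragment example matching the figure: fragment 1 is stored with H2, H34, Hd.
tree, hd, b_guid = build_tree([b"F1", b"F2", b"F3", b"F4"], b"original data")
assert verify(b"F1", 0, proof_for(tree, 0), hd, b_guid)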
Enabling Technology
(Figure: a GUID and its encoded fragments.)
Complex Objects I
(Figure: a data block d is the unit of coding; its encoded fragments are the unit of archival storage; the verification tree over the fragments yields the GUID of d.)
Complex Objects II
(Figure: a version is a B-tree rooted at a VGUID, with indirect blocks pointing to data blocks d1–d9. Each block is the unit of coding; its encoded fragments are the unit of archival storage, and each block's verification tree yields its GUID, e.g. the GUID of d1.)
Complex Objects III
(Figure: AGUID = hash{name + keys} names the object. Each version VGUIDi roots a B-tree of indirect blocks over data blocks d1–d9; the next version VGUIDi+1 keeps a backpointer to VGUIDi and shares unchanged blocks copy-on-write, adding only the modified blocks d'8 and d'9. A small copy-on-write sketch follows below.)
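A hedged sketch of the copy-on-write sharing in the figure, assuming blocks are named by a secure hash of their contents; the AGUID and VGUID derivations here are simplified stand-ins, not the system's exact formats.

import hashlib

def guid(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

# AGUID = hash{name + keys}: stable across versions of the object.
aguid = guid(b"/home/user/report.txt" + b"owner-public-key")

def make_version(blocks, prev_vguid=""):
    """Name each block by its hash; derive a version root (VGUID) with a backpointer."""
    block_guids = [guid(b) for b in blocks]
    vguid = guid("".join(block_guids).encode() + prev_vguid.encode())
    return vguid, block_guids

v1, guids_v1 = make_version([b"d1", b"d2", b"d3"])
v2, guids_v2 = make_version([b"d1", b"d2", b"d3-modified"], prev_vguid=v1)
assert guids_v1[:2] == guids_v2[:2]   # unchanged blocks keep their GUIDs (shared, not copied)
assert v1 != v2                       # each write produces a new read-only version root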
Mutable Data
• Need mutable data for a real system.
– Entity in the network.
– A-GUID to V-GUID mapping.
– Byzantine commitment for integrity.
– Verifies client privileges.
– Creates a serial order.
– Atomically applies the update.
• Versioning system
– Each version is inherently read-only.
(A single-node sketch of this update path follows below.)
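A single-node stand-in for the entity described above; the real system runs a Byzantine-agreement protocol across an inner ring of servers, so this Serializer class (a hypothetical name) only illustrates the check/serialize/apply order and the A-GUID to V-GUID mapping.

class Serializer:
    def __init__(self):
        self.latest = {}         # A-GUID -> current V-GUID
        self.writers = {}        # A-GUID -> set of clients with write privileges

    def apply_update(self, aguid, client, build_version):
        """Verify privileges, take the next serial slot, apply, advance the mapping."""
        if client not in self.writers.get(aguid, set()):
            raise PermissionError("client lacks write privileges")
        prev = self.latest.get(aguid)
        new_vguid = build_version(prev)          # creates a new read-only version
        self.latest[aguid] = new_vguid           # applied atomically, one update at a time
        return new_vguid

s = Serializer()
s.writers["aguid-1"] = {"alice"}
s.apply_update("aguid-1", "alice", lambda prev: f"vguid-after-{prev}")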
Deployment
• PlanetLab global network
– 98 machines at 42 institutions in North America, Europe, Asia, and Australia.
– 1.26 GHz Pentium III (1 GB RAM) and 1.8 GHz Pentium 4 (2 GB RAM) machines.
– North American machines (2/3) are on Internet2.
Deployment
• Deployed the storage system in November 2002.
– ~50 physical machines.
– 100 virtual nodes.
• 3 clients, 93 storage servers, 1 archiver, 1 monitor.
– Supports the OceanStore API
• NFS, IMAP, etc.
– Fault injection.
– Fault detection and repair.
Performance
• Performance of the archival layer
– Performance of an OceanStore server archiving objects.
– Analyze the operations of archiving data (this includes signing updates in a BFT protocol).
• No archiving.
• Archiving (synchronous) (m = 16, n = 32).
• Experiment environment
– OceanStore servers were analyzed on a 42-node cluster.
– Each machine in the cluster is an IBM xSeries 330 1U rackmount PC with:
• two 1.0 GHz Pentium III CPUs,
• 1.5 GB ECC PC133 SDRAM,
• two 36 GB IBM UltraStar 36LZX hard drives,
• a single Intel PRO/1000 XF gigabit Ethernet adaptor connected to a Packet Engines switch,
• Linux 2.4.17 SMP kernel.
Performance: Throughput
• Data throughput
– No archive: 8 MB/s.
– Archive: 2.8 MB/s.
Performance: Latency
• Latency
– Fragmentation: y-intercept 3 ms, slope 0.3 s/MB.
– Archive = No archive + Fragmentation.
(Figure: Update Latency vs. Update Size; latency in ms, update size 2–32 kB. Linear fits: Archive y = 1.2x + 36.4; No Archive y = 0.6x + 29.6; Fragmentation y = 0.3x + 3.0.)
Closer Look: Update Latency

Update Latency (ms):
Key Size   Update Size   5% Time   Median Time   95% Time
512b       4kB           39        40            41
512b       2MB           1037      1086          1348
1024b      4kB           98        99            100
1024b      2MB           1098      1150          1448

Latency Breakdown:
Phase       Time (ms)
Check       0.3
Serialize   6.1
Apply       1.5
Archive     4.5
Sign        77.8

• Threshold signature dominates small-update latency
– Common RSA tricks are not applicable
• Batch updates to amortize signature cost (see the sketch below)
• Tentative updates hide latency
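A back-of-the-envelope sketch of the batching argument, using the phase times from the breakdown above; the amortization model (one threshold signature per batch of small updates) is an assumption for illustration, not a measurement from the talk.

SIGN_MS = 77.8                         # threshold signature (dominant cost)
PER_UPDATE_MS = 0.3 + 6.1 + 1.5 + 4.5  # check + serialize + apply + archive

def amortized_latency_ms(batch_size):
    """Per-update cost when one signature covers a whole batch of small updates."""
    return PER_UPDATE_MS + SIGN_MS / batch_size

for batch in (1, 8, 64):
    print(batch, round(amortized_latency_ms(batch), 1))   # 1 -> 90.2, 8 -> 22.1, 64 -> 13.6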
Current Situation
• Stabilized the routing layer under churn and extraordinary circumstances.
• NSF infrastructure grant
– Deploy the code as a service for Berkeley.
– Target: 1/3 PB.
• Future collaborations
– CMU for a PB store.
– Internet Archive?
Conclusion
• Storage-efficient, self-verifying mechanism.
– Erasure codes are good.
• Self-verifying data assists in
– secure read-only data,
– secure caching infrastructures,
– continuous adaptation and repair.
For more information:
http://oceanstore.cs.berkeley.edu/
Papers:
- Pond: the OceanStore Prototype
- Naming and Integrity: Self-Verifying Data in P2P Systems