Oceanstore - Computer Science

Principal Resource
OceanStore: An Architecture for Global-Scale Persistent Storage: John
Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton,
Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon,
Westley Weimer, Chris Wells, Ben Zhao. Proceedings of ACM ASPLOS 2000.
Additional Materials
http://oceanstore.cs.berkeley.edu/
www.cs.fsu.edu/~awang/courses/cop5611_s2006/oceanstore.ppt
www.cs.fsu.edu/~awang/courses/cop5611_s2004/lecture_21_distributed_fs3.ppt
Presenter
Stanley Ziewacz
April 10, 2008
Global-Scale Persistent Storage
• Global Scale
 1010 users
 10,000 files per user
• Population Clocks
 U.S. 303,807,847
 World 6,660,028,725
 16:12 GMT (EST+5) Apr 08, 2008
• Mole
 6.023 X 1023 atoms in exactly 12 grams of carbon-12
Persistent Data Storage
• Connectivity to all types of computing devices
– Desktop, laptop, palmtop, cellphone, etc.
• Information Safety
– Avoid prying eyes and survive malicious hands
• Durable: 1000 years?
– Redundancy with continuous repair and redistribution for long-term storage
• Uniform and highly-available access to data
– Data divorced from location
– Servers close to clients
Cooperative Utility Model
• OceanStore Service Providers
• Client pays one monthly bill to one company
• Clients can use resources on other OSPs
• OSPs buy and sell capacity among themselves
• Millions of servers
Design Highlights
• Infrastructure is only trusted in the aggregate
– Servers may crash without warning
– Servers may leak data
• Data can be cached anywhere, anytime
– Promiscuous caching
– Nomadic data
Applications
• Groupware
• Personal information managers
• Digital libraries
• Scientific data repositories
System Architecture
• Naming
• Data location and routing
• Update model and conflict resolution
• Deep archival storage
• Introspection
Naming
Globally Unique Identifiers (GUIDs)
• Object GUID
– Secure hash of owner’s key and human-readable name
• Server GUID
– Secure hash of server’s public key
• Archival fragment GUID
– Secure hash over the data it stores
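The three GUID kinds above can be sketched in a few lines. This is only an illustration: the choice of SHA-1, the byte encoding of keys and names, and the function names are assumptions made for the example, not details given on the slide.

```python
import hashlib


def object_guid(owner_public_key: bytes, name: str) -> str:
    """Hash the owner's key together with the human-readable name.

    Assumption: the key bytes are simply followed by the UTF-8 name.
    """
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()


def server_guid(server_public_key: bytes) -> str:
    """Hash of the server's public key."""
    return hashlib.sha1(server_public_key).hexdigest()


def fragment_guid(fragment_data: bytes) -> str:
    """Hash over the data the fragment stores."""
    return hashlib.sha1(fragment_data).hexdigest()


def fragment_is_intact(guid: str, fragment_data: bytes) -> bool:
    """A fragment is self-verifying: rehash the data and compare with its GUID."""
    return fragment_guid(fragment_data) == guid
```

Because a fragment's GUID is derived from its contents, any holder of the GUID can check a retrieved fragment without trusting the server that supplied it.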
Access Control
• Restricting Readers
– If data is not completely public
• Encrypt it
• Distribute the key to users with read permission
– To revoke read permission
• Request that replicas be deleted or re-encrypted with a new key
– Enforced at clients
• Restricting Writers
– All writes are signed
– Owner can provide access control lists (ACLs) for objects
– Enforced at servers
Data Location and Routing
• Support location-independent routing
• Message routes to discover a destination
• Then message routes directly to destination
• Two routing strategies
– Bloom filters, fast probabilistic algorithm
– Plaxton-style routing
Bloom Filter
• A Bloom filter
– Represents a set S = {S1, …, Sn}
– Is stored as an m-bit array, filter[m], initially all zeros
– Uses r independent hash functions h1…hr
• To build the filter
– for i = 1…n
– for j = 1…r
• filter[hj(Si)] = 1
www.cs.fsu.edu/~awang/courses/cop5611_s2004/lecture_21_distributed_fs3.ppt
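The construction loop above translates almost directly into code. The sketch below is a minimal illustration: the class name and the use of index-salted SHA-256 as a stand-in for r independent hash functions are assumptions, not part of the original slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, r: int):
        self.m = m              # number of bits in the filter
        self.r = r              # number of hash functions
        self.bits = [0] * m     # filter[m], initially all zeros

    def _positions(self, item: str):
        # Assumption: salting SHA-256 with the index j approximates
        # r independent hash functions h1..hr.
        for j in range(self.r):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # filter[hj(Si)] = 1 for j = 1..r
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set";
        # True means "possibly in the set" (false positives can occur).
        return all(self.bits[pos] for pos in self._positions(item))
```

A filter never reports a stored item as absent, which is why a negative answer can safely prune a search while a positive answer still has to be confirmed.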
Bloom Filter Example
• filter[] = {1, 1, 0, 1, 0, 1}
• Does x belong to the set?
– filter[h1(x)] = filter[0] = 1
– filter[h2(x)] = filter[3] = 1
– filter[h3(x)] = filter[5] = 1
– All three bits are set → possibly yes (false positives can occur)
• Does z belong to the set?
– filter[h1(z)] = filter[2] = 0 → no
– filter[h2(z)] = filter[3] = 1
– filter[h3(z)] = filter[5] = 1
www.cs.fsu.edu/~awang/courses/cop5611_s2004/lecture_21_distributed_fs3.ppt
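The worked example can be checked mechanically. In this small sketch the bit array and the hash positions for x and z are copied from the example; the hash functions themselves are not given, so the positions are hard-coded, and the names FILTER, POSITIONS, and might_contain are illustrative.

```python
# Bit array from the example.
FILTER = [1, 1, 0, 1, 0, 1]

# Positions h1..h3 map to in the example (the real hash functions are unknown).
POSITIONS = {
    "x": [0, 3, 5],
    "z": [2, 3, 5],
}


def might_contain(bit_array, positions):
    """An item can be in the set only if every probed bit is 1."""
    return all(bit_array[p] == 1 for p in positions)


x_result = might_contain(FILTER, POSITIONS["x"])  # all probed bits are 1
z_result = might_contain(FILTER, POSITIONS["z"])  # bit 2 is 0, so z is ruled out
```

A single zero bit is enough to rule z out, whereas x can only be reported as a possible member.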
Attenuated Bloom Filters
www.cs.fsu.edu/~awang/courses/cop5611_s2006/oceanstore.ppt
Variation on Plaxton Routing
• Each object GUID has a root node
• Root is the node whose ID matches the GUID’s hash in the most bits
• But replicas can be placed anywhere
• Publishing process for replicas
– Do Plaxton hops from the replica’s location to the root
– Place a pointer to the replica’s location at each hop
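The publishing step can be illustrated with a toy model. This sketch makes several simplifying assumptions that are not on the slide: IDs are small integers, matching is counted from the least significant bit, ties are broken by the smallest node ID, and each hop is chosen with global knowledge of all nodes rather than from per-node routing tables as a real Plaxton mesh would do. All function names are hypothetical.

```python
def shared_low_bits(a: int, b: int, width: int = 8) -> int:
    """Count how many consecutive low-order bits a and b share."""
    count = 0
    for i in range(width):
        if (a >> i) & 1 != (b >> i) & 1:
            break
        count += 1
    return count


def choose_root(node_ids, guid, width=8):
    """Root = node whose ID matches the GUID in the most bits."""
    return max(sorted(node_ids), key=lambda n: shared_low_bits(n, guid, width))


def route_to_root(start, node_ids, guid, width=8):
    """Greedy stand-in for Plaxton hops: every hop matches more bits."""
    path = [start]
    current = start
    while True:
        here = shared_low_bits(current, guid, width)
        better = [n for n in sorted(node_ids)
                  if shared_low_bits(n, guid, width) > here]
        if not better:
            return path
        # Take the smallest available improvement, so the path has many hops.
        current = min(better, key=lambda n: shared_low_bits(n, guid, width))
        path.append(current)


def publish(replica_node, node_ids, guid, pointers, width=8):
    """Leave a pointer to the replica at every node on the way to the root."""
    path = route_to_root(replica_node, node_ids, guid, width)
    for node in path:
        pointers.setdefault(node, {})[guid] = replica_node
    return path
```

A lookup that starts elsewhere and climbs toward the same root can stop at the first node holding a pointer, so it often finds a nearby replica before reaching the root.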
Update Model
• Client generates updates
• Primary tier of replicas commits them
• Each update’s predicates are evaluated in order
• The action paired with the earliest true predicate is performed; if no predicate is true, the update aborts
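One way to picture the update model is as a list of (predicate, action) pairs checked against the current state of an object. The sketch below is a simplified illustration: the dictionary-based object, the version counter bump on commit, and the name apply_update are assumptions, and real atomicity and replication are not modelled.

```python
from typing import Any, Callable, Dict, List, Tuple

DataObject = Dict[str, Any]
Predicate = Callable[[DataObject], bool]
Action = Callable[[DataObject], None]


def apply_update(obj: DataObject, update: List[Tuple[Predicate, Action]]) -> bool:
    """Apply the action of the earliest true predicate.

    Returns True if the update commits and False if it aborts.
    """
    for predicate, action in update:
        if predicate(obj):
            action(obj)
            obj["version"] += 1  # assumption: each commit yields a new version
            return True
    return False  # no predicate held, so nothing changes


# Usage: append a block only if the object is still at version 3.
document = {"version": 3, "blocks": ["a", "b"]}
update = [(lambda o: o["version"] == 3, lambda o: o["blocks"].append("c"))]
committed = apply_update(document, update)
```

Submitting the same update a second time aborts, because the version predicate no longer holds once the first copy has committed.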
Updating Ciphertext
• Only 4 predicates available
– Compare-version
– Compare-size
– Compare-block
– Search
• Actions available
– Replace-block
– Insert-block
– Delete-block
– Append
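The point of this restricted vocabulary is that a server can evaluate it without ever seeing plaintext. The sketch below treats blocks as opaque ciphertext bytes. It is an illustration under assumptions: compare-block is modelled as a hash comparison over the encrypted block, insert and delete are plain list operations, the search predicate is omitted because it needs a specialised encryption scheme, and all names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class EncryptedObject:
    version: int = 0
    blocks: List[bytes] = field(default_factory=list)  # opaque ciphertext blocks


# Predicates: each needs only metadata or ciphertext, never the key.
def compare_version(obj: EncryptedObject, expected: int) -> bool:
    return obj.version == expected


def compare_size(obj: EncryptedObject, expected_bytes: int) -> bool:
    return sum(len(b) for b in obj.blocks) == expected_bytes


def compare_block(obj: EncryptedObject, index: int, expected_hash: str) -> bool:
    # Assumption: the client supplies a SHA-256 hash of the encrypted block.
    return (0 <= index < len(obj.blocks)
            and hashlib.sha256(obj.blocks[index]).hexdigest() == expected_hash)


# Actions: the server rearranges ciphertext blocks it cannot read.
def replace_block(obj: EncryptedObject, index: int, ciphertext: bytes) -> None:
    obj.blocks[index] = ciphertext


def insert_block(obj: EncryptedObject, index: int, ciphertext: bytes) -> None:
    obj.blocks.insert(index, ciphertext)


def delete_block(obj: EncryptedObject, index: int) -> None:
    del obj.blocks[index]


def append(obj: EncryptedObject, ciphertext: bytes) -> None:
    obj.blocks.append(ciphertext)
```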
Serializing Updates
• Primary tier of replicas
– Byzantine agreement protocol
– Final commit order
– Multicast committed updates
• Tentative Updates
– Sent to several random replicas
– Tentative commits spread by an epidemic (gossip) protocol
Deep Archival Storage
• Objects exist in both active and archival form
• Archival form is permanent and read-only
• Archival form is treated as a series of data fragments
• Fragments are spread over the network
• Any n fragments can be used to reconstruct the data
• In principle, every version is archived
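The idea that a subset of fragments is enough to rebuild the data can be shown with the simplest possible erasure code: k data fragments plus one XOR parity fragment, where any k of the k + 1 fragments suffice. This toy code is an assumption chosen for brevity and tolerates only one lost fragment; a production archive would use a stronger code. The function names are hypothetical.

```python
from functools import reduce
from typing import Dict, List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(size * k, b"\0")      # pad so all fragments match in size
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def decode(available: Dict[int, bytes], k: int, length: int) -> bytes:
    """Rebuild the data from any k of the k + 1 fragments (index k is parity)."""
    missing = [i for i in range(k) if i not in available]
    if len(missing) > 1 or (missing and k not in available):
        raise ValueError("need at least k fragments")
    fragments = dict(available)
    if missing:
        # XOR of all surviving fragments recovers the single missing one.
        others = [fragments[i] for i in range(k + 1) if i != missing[0]]
        fragments[missing[0]] = reduce(xor_bytes, others)
    return b"".join(fragments[i] for i in range(k))[:length]
```

The caller passes the original length to decode so that the padding added by encode can be stripped.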
Introspection
• Cluster recognition
– Each client machine has an event handler triggered by each data access
• Replica Management
– Event handlers monitor client requests and system load
• Detect periodic migration of clusters from site to site
– OceanStore can monitor my work/travel routine
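Cluster recognition can be sketched as an event handler that records which objects are accessed close together in time. The sliding-window heuristic, the class name AccessObserver, and the thresholds below are assumptions for illustration, not the mechanism described in the slides.

```python
from collections import Counter, deque


class AccessObserver:
    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)   # most recently accessed GUIDs
        self.co_access = Counter()           # pair of GUIDs -> times seen together

    def on_access(self, guid: str) -> None:
        """Event handler, called once for each data access."""
        for other in set(self.recent):
            if other != guid:
                self.co_access[frozenset((guid, other))] += 1
        self.recent.append(guid)

    def related(self, guid: str, min_count: int = 2):
        """Objects accessed near `guid` at least `min_count` times."""
        return sorted(
            next(iter(pair - {guid}))
            for pair, count in self.co_access.items()
            if guid in pair and count >= min_count
        )
```

Objects that keep appearing together form a candidate cluster, which a replica manager could then prefetch or migrate as a unit.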