Transcript: OceanStore

OceanStore: An Infrastructure for
Global-Scale Persistent Storage
John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski,
Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea,
Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao
A few slides have been borrowed from the authors’ presentations
Vision
• What is OceanStore?
• “a utility infrastructure to span the globe and provide
continuous access to persistent information”
Source: Berkeley OceanStore Website
Vision
• What is OceanStore?
• “a utility infrastructure to span the globe and provide
continuous access to persistent information”
• data
• all kinds of information
• desktop, laptop, palmtop
• cars, cellular phones, other devices
• futuristic: embedded in environment
Vision
• What is OceanStore?
• “a utility infrastructure to span the globe and provide
continuous access to persistent information”
• persistence
• devices can be rebooted, lost, replaced
• reliable, durable data (“deep archival” will last forever)
• Automatic maintenance
Vision
• What is OceanStore?
• “a utility infrastructure to span the globe and provide
continuous access to persistent information”
• connectivity
• even to tiniest devices, possibly intermittent
• variable bandwidth, latency
• availability
• uniform access, comparable to LAN-based networked
storage
• fault-tolerant, DoS-tolerant
Vision
• What is OceanStore?
• “a utility infrastructure to span the globe and provide
continuous access to persistent information”
• scale
• geographically distributed
• 10^10 users
• 10^14 files / objects
Questions about information:
• Where is persistent information stored?
• 20th-century tie between location and content outdated
• In world-scale system, locality is key
• How is it protected?
• Can a disgruntled employee of an ISP sell your secrets?
• Can’t trust anyone (how paranoid are you?)
• Can we make it indestructible?
• Want our data to survive “the big one”!
• Highly resistant to hackers (denial of service)
• Wide-scale disaster recovery
• Is it hard to manage?
• Worst failures are human-related
• Want automatic (introspective) diagnosis and repair
First Observation:
Want Utility Infrastructure
• Mark Weiser from Xerox: Transparent computing is the
ultimate goal. Computers should disappear into the background
• In the context of storage:
• Don’t want to worry about backup
• Don’t want to worry about obsolescence
• Need lots of resources to make data secure and highly
available, BUT don’t want to own them
• Outsourcing of storage already becoming popular
• Pay monthly fee and your “data is out there”
Utility-based Infrastructure
[Diagram: a confederation of providers (Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM) buying and selling capacity from one another]
• Service provided by confederation of companies
• Monthly fee paid to one service provider
• Companies buy and sell capacity from each other
Target applications
Email
Group calendar, contacts
Distributed design tools
Computer Supported Cooperative Work
Digital libraries
Distributed/shared repositories
Assumptions
• Untrusted infrastructure
• a small number of servers may crash or leak information
• most servers function correctly
• a financially “responsible party” ensures the integrity of the servers it runs
• but only clients are trusted with cleartext
• Nomadic data
• data divorced from location
• flows freely within the storage infrastructure
• promiscuous caching: “anywhere, anytime”
• location important for performance
• dynamic system tuning through introspection
System overview
• persistent object
• GUID: 160-bit SHA-1 hash
• secure identification – globally unique and unforgeable
• ~2^80 unique objects before collisions (birthday paradox)
• floating object replicas: independent of location
• encrypted data
• read
• try a fast, probabilistic replica search (Bloom filter) first
• fall back to a slower, deterministic search (Tapestry); see the sketch after this slide
• write
• update with predicates [as in Bayou – what is Bayou?]
• creates new version
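A minimal sketch of that read path in Python, assuming the caller supplies stand-ins for the probabilistic locator, the Tapestry lookup, and decryption (none of these names are OceanStore APIs):

    def read_object(guid, probabilistic_locate, tapestry_locate, decrypt):
        """Fast, probabilistic lookup first; deterministic Tapestry search as fallback."""
        replica = probabilistic_locate(guid)   # Bloom-filter guided; may return None on a miss
        if replica is None:
            replica = tapestry_locate(guid)    # slower, but deterministic
        return decrypt(replica)                # data is encrypted; only clients hold the keys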
What is Bayou?
The Bayou System (Xerox PARC) is a platform of replicated,
highly available, variable-consistency databases on which
collaborative applications can be built. It caters to portable
devices with intermittent connections.
System overview
• application interface
• sessions: sequence of read/writes
• session guarantees [Bayou] (see the sketch after this slide)
• consistency levels ranging from loose to ACID
• active and archival forms
• active: latest version, with update handle
• archive: erasure coded read-only version
• dynamic optimization
• object location
• degree of replication
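One of the Bayou-style session guarantees, read-your-writes, can be sketched as below; the replica interface (apply / seen_updates / read) is a hypothetical stand-in, not the actual OceanStore API:

    class Session:
        """Remembers the updates issued in this session so reads only go to
        replicas that have already applied them (read-your-writes)."""
        def __init__(self):
            self.write_set = set()

        def write(self, replica, update):
            update_id = replica.apply(update)           # hypothetical replica call
            self.write_set.add(update_id)
            return update_id

        def read(self, replicas, guid):
            for r in replicas:
                if self.write_set <= r.seen_updates():  # all of our writes are visible here
                    return r.read(guid)
            raise RuntimeError("no replica has seen all of this session's writes")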
Tentative Updates:
Epidemic Dissemination
Committed Updates:
Multicast Dissemination
naming
• self-certifying path names (Mazières)
• object GUID = hash of owner key and readable name (see the sketch below)
• create hierarchies using “directory” objects
• read restriction
• through client encryption of data
• write restriction, access control
• associate ACLs with objects, respected by servers
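A minimal sketch of such a self-certifying GUID; the exact serialization (plain concatenation of key bytes and name) is an assumption, but the point is that anyone holding the owner's public key can recompute and verify the binding:

    import hashlib

    def object_guid(owner_public_key: bytes, readable_name: str) -> bytes:
        """GUID = 160-bit SHA-1 hash over the owner's key and the readable name."""
        return hashlib.sha1(owner_public_key + readable_name.encode("utf-8")).digest()

    # Example (hypothetical key material and path):
    guid = object_guid(b"...owner public key bytes...", "/home/alice/notes.txt")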
addressing
• address an object by its GUID
• message: GUID, random number, small predicate (see the sketch after this slide)
• route to closest GUID replica matching predicate
• combines data location and routing:
• no central name service to attack
• save one round-trip for location discovery
• routing
• fast, probabilistic search algorithm
• slow, deterministic search algorithm
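A sketch of what such a location-independent request might carry; the field names are illustrative, not OceanStore's wire format:

    from dataclasses import dataclass

    @dataclass
    class LocateMessage:
        guid: bytes       # 160-bit object GUID; no server address is embedded
        nonce: int        # random number used to match replies to this request
        predicate: bytes  # small predicate the answering replica must satisfy

    # The infrastructure routes the message toward the closest replica of `guid`
    # that satisfies `predicate`, so no separate name service is needed.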
routing
• fast, probabilistic search algorithm
• Bloom filter
• probabilistic set membership test using bit vector
• bit vector in which each set element sets the bits chosen by several hash functions (see the sketch after this slide)
• filter is union (OR) of all bit vectors
• attenuated Bloom filter
• array of d Bloom filters
• the i-th Bloom filter is the union of the filters of all nodes within i hops
• slow, deterministic algorithm
• Tapestry
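A minimal sketch of both structures; the vector size, hash count, and neighbor-merge rule are illustrative choices rather than OceanStore's actual parameters:

    import hashlib

    class BloomFilter:
        def __init__(self, m=256, k=3):
            self.m, self.k, self.bits = m, k, 0

        def _positions(self, item: bytes):
            # k bit positions derived from k salted hashes of the element
            for i in range(self.k):
                digest = hashlib.sha1(bytes([i]) + item).digest()
                yield int.from_bytes(digest, "big") % self.m

        def add(self, item: bytes):
            for p in self._positions(item):
                self.bits |= 1 << p

        def __contains__(self, item: bytes):
            # May report false positives, never false negatives
            return all(self.bits & (1 << p) for p in self._positions(item))

        def union(self, other):
            merged = BloomFilter(self.m, self.k)
            merged.bits = self.bits | other.bits
            return merged

    class AttenuatedBloomFilter:
        """Array of d Bloom filters: level 0 describes local objects, and level i
        is the union of the neighbors' level i-1 filters (objects within i hops)."""
        def __init__(self, depth=3, m=256, k=3):
            self.levels = [BloomFilter(m, k) for _ in range(depth)]

        def merge_neighbor(self, neighbor: "AttenuatedBloomFilter"):
            for i in range(1, len(self.levels)):
                self.levels[i] = self.levels[i].union(neighbor.levels[i - 1])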
addressing and routing
[Diagram: probabilistic location using attenuated Bloom filters alongside deterministic location and routing]
updates
• Updates based on versioning and conflict resolution
• i.e. no locking
• update: a set of actions guarded by predicates (see the sketch after this slide)
• commit – apply action of first true predicate
• abort – no true predicates
• conflict resolution on encrypted data
• possible predicates:
• compare-version, compare-size, compare-block, search
• possible actions:
• replace-block, insert-block, delete-block, append
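A minimal sketch of that commit/abort rule, with an update represented as an ordered list of (predicate, actions) pairs; plain Python callables stand in for the compare-* and *-block operations above:

    import copy

    def apply_update(current_version, update):
        """Apply the actions guarded by the first true predicate; abort otherwise.
        A commit produces a new version instead of overwriting the old one."""
        for predicate, actions in update:
            if predicate(current_version):
                new_version = copy.deepcopy(current_version)
                for action in actions:
                    action(new_version)
                return "commit", new_version
        return "abort", current_version

    # Example: replace a block only if the object is still at version 7.
    def at_version_7(v):
        return v["version"] == 7
    def replace_block_0(v):
        v["blocks"][0] = b"new data"
        v["version"] += 1
    status, obj = apply_update({"version": 7, "blocks": [b"old data"]},
                               [(at_version_7, [replace_block_0])])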
archival
• produced when objects idle
• use erasure codes (redundant fragmentation)
• simplest example: parity bit
• need any (n-1) out of n fragments
• interleaved Reed-Solomon codes, Tornado codes
• fragmentation improves reliability
• “deep archival storage”
• sweeper processes ensure replication is sustained over time
• fragmentation improves performance
Erasure Codes
Simple parity bits or generalized Reed-Solomon codes can be
used to implement erasure coding (see the sketch below).
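A minimal sketch of the simplest case mentioned above, a single XOR parity fragment over the data fragments; the codes actually used for deep archival storage (Reed-Solomon, Tornado) tolerate many more losses:

    def parity_encode(data_fragments):
        """Return the data fragments plus one XOR parity fragment;
        any single fragment (data or parity) can then be lost and rebuilt."""
        parity = bytes(len(data_fragments[0]))
        for frag in data_fragments:
            parity = bytes(a ^ b for a, b in zip(parity, frag))
        return data_fragments + [parity]

    def rebuild_missing(fragments, missing_index):
        """XOR the surviving fragments to reconstruct the missing one."""
        survivors = [f for i, f in enumerate(fragments) if i != missing_index]
        rebuilt = bytes(len(survivors[0]))
        for frag in survivors:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, frag))
        return rebuilt

    # Example: 3 data fragments plus 1 parity; lose fragment 1 and rebuild it.
    frags = parity_encode([b"abcd", b"efgh", b"ijkl"])
    assert rebuild_missing(frags, 1) == b"efgh"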
Floating Replica and Deep Archival Coding
[Diagram: floating replicas, each holding a full copy with a version history (Ver1: 0x34243, Ver2: 0x49873, …) and conflict-resolution logs, plus erasure-coded fragments for deep archival storage]
dynamic optimization (introspection)
• observation modules
• collect and summarize information
• incrementally update system database
• optimization modules
• periodically process the observation database
• cluster recognition: group related objects
• replica management: maintain replica number and location (see the sketch after this list)
• periodic migration: work-home-work-home…
• maintenance: routing, dissemination, availability, durability
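A minimal sketch of the observation/optimization split; the event fields, read threshold, and create_replica callback are illustrative assumptions, not OceanStore's actual interfaces:

    class ObservationModule:
        """Collects per-object access events and keeps a running summary
        in the shared system database."""
        def __init__(self, database):
            self.database = database

        def record(self, object_guid, client_site):
            summary = self.database.setdefault(object_guid,
                                               {"reads": 0, "sites": set()})
            summary["reads"] += 1
            summary["sites"].add(client_site)

    class ReplicaManagementModule:
        """Optimization module: periodically scans the database and asks the
        infrastructure to place replicas near sites that read an object heavily."""
        def __init__(self, database, create_replica, read_threshold=100):
            self.database = database
            self.create_replica = create_replica    # callback into the infrastructure
            self.read_threshold = read_threshold

        def run_once(self):
            for guid, summary in self.database.items():
                if summary["reads"] >= self.read_threshold:
                    for site in summary["sites"]:
                        self.create_replica(guid, site)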