
Managing Data in the Cloud

Scaling in the Cloud

[Diagram: client sites → load balancer (proxy) → replicated app servers → MySQL master DB, with MySQL replication to slave DBs. The database tier is a scalability bottleneck and cannot leverage elasticity.]

Scaling in the Cloud

[Diagram: client sites → load balancer (proxy) → replicated Apache + app servers → key-value stores replacing the MySQL tier.]

CAP Theorem (Eric Brewer)

• “Towards Robust Distributed Systems.” PODC 2000.

• “CAP Twelve Years Later: How the ‘Rules’ Have Changed.” IEEE Computer, 2012.

Key Value Stores

• Key-value data model
  – Key is the unique identifier
  – Key is the granularity for consistent access
  – Value can be structured or unstructured
• Gained widespread popularity
  – In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon)
  – Open source: HBase, Hypertable, Cassandra, Voldemort
• Popular choice for the modern breed of web applications

Big Table (Google)

• Data model
  – Sparse, persistent, multi-dimensional sorted map.
• Data is partitioned across multiple servers.
• The map is indexed by a row key, a column key, and a timestamp.
• Each value is an uninterpreted array of bytes (see the sketch below):
  – (row: byte[ ], column: byte[ ], time: int64) → byte[ ]
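As a rough illustration of this data model (not Bigtable's actual API), the map can be pictured as a sorted structure keyed by (row, column, timestamp); the row and column names below are made up for the example.

```python
# Toy sketch of Bigtable's data model: a sorted, multi-dimensional map from
# (row key, column key, timestamp) to an uninterpreted byte string.
# Names like "com.cnn.www" and "contents:" are illustrative only.
import bisect


class SparseSortedMap:
    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp) keys
        self._vals = {}   # key -> bytes

    def put(self, row: bytes, column: bytes, ts: int, value: bytes):
        key = (row, column, -ts)          # newest timestamp sorts first
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, row: bytes, column: bytes, ts: int = None) -> bytes:
        # Return the value visible at `ts`, or the most recent version if ts is None.
        for r, c, neg_ts in self._keys:
            if r == row and c == column and (ts is None or -neg_ts <= ts):
                return self._vals[(r, c, neg_ts)]
        return None


table = SparseSortedMap()
table.put(b"com.cnn.www", b"contents:", ts=3, value=b"<html>...v3")
table.put(b"com.cnn.www", b"contents:", ts=5, value=b"<html>...v5")
print(table.get(b"com.cnn.www", b"contents:"))        # newest version (t=5)
print(table.get(b"com.cnn.www", b"contents:", ts=4))  # version visible at t=4
```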

Architecture Overview

• Shared-nothing architecture consisting of thousands of nodes (commodity PCs).

[Diagram: Google's Bigtable data model layered on top of the Google File System.]

Atomicity Guarantees in Big Table

• Every read or write of data under a single row is atomic.

• Objective: make read operations single-sited!

Big Table’s Building Blocks

• Google File System (GFS)
  – Highly available distributed file system that stores data and log files.
• Chubby
  – Highly available persistent distributed lock manager.
• Tablet servers
  – Handle reads, writes, and splits of their tablets.
  – Each tablet is typically 100–200 MB in size.
• Master server
  – Assigns tablets to tablet servers,
  – Detects the addition and deletion of tablet servers,
  – Balances tablet-server load.

Overview of Bigtable Architecture

[Diagram: master (control operations, lease management) coordinated through Chubby; tablet server with master and Chubby proxies, tablets T1 … Tn, cache manager, and log manager, all on top of the Google File System.]

GFS Architectural Design

• A GFS cluster
  – A single master and multiple chunkservers per master
  – Accessed by multiple clients
  – Running on commodity Linux machines
• A file (see the sketch below)
  – Represented as fixed-size chunks
    • Labeled with 64-bit unique global IDs
    • Stored at chunkservers
    • 3-way replication across chunkservers
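A minimal sketch (not GFS code) of how a file can be represented as fixed-size chunks with 64-bit IDs and 3-way replication; the chunkserver names and the random ID scheme are made up for illustration.

```python
# Sketch: representing a file as fixed-size 64 MB chunks, each labeled with a
# 64-bit globally unique ID and assigned to three chunkservers (3-way replication).
import random

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB fixed-size chunks
REPLICATION = 3
CHUNKSERVERS = [f"cs{i}" for i in range(10)]   # illustrative server names


def split_into_chunks(file_size: int):
    """Return one descriptor per chunk: (chunk handle, replica locations)."""
    num_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    chunks = []
    for _ in range(num_chunks):
        handle = random.getrandbits(64)                       # 64-bit unique global ID
        replicas = random.sample(CHUNKSERVERS, REPLICATION)   # 3 distinct chunkservers
        chunks.append((handle, replicas))
    return chunks


# A 200 MB file maps to 4 chunks (3 full chunks + 1 partial).
for handle, replicas in split_into_chunks(200 * 1024 * 1024):
    print(f"chunk {handle:016x} -> {replicas}")
```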

Architectural Design

[Diagram: an application issues requests through the GFS client; the client asks the GFS master for chunk locations ("chunk location?") and reads or writes chunk data ("chunk data?") directly from GFS chunkservers, each of which stores chunks in its local Linux file system.]

Single-Master Design

• Simple design.
• Master answers only chunk-location requests.
• A client typically asks for multiple chunk locations in a single request.
• The master also predictively provides chunk locations immediately following those requested (see the sketch below).
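A hedged sketch of this client/master interaction: the client turns a byte offset into a chunk index, batches several indexes into one request, and caches whatever extra locations come back. The LocationCache class and the master's lookup method are invented for illustration, not GFS's actual interface.

```python
# Illustrative only: a client-side cache of chunk locations. The client converts
# a byte offset into a chunk index, asks the master for several indexes at once,
# and the master may piggyback locations for chunks just past those requested.
CHUNK_SIZE = 64 * 1024 * 1024


def chunk_index(offset: int) -> int:
    return offset // CHUNK_SIZE


class LocationCache:
    def __init__(self, master):
        self.master = master          # assumed: object with a lookup(path, indexes) method
        self.cache = {}               # (path, chunk index) -> [replica locations]

    def locate(self, path: str, offset: int):
        idx = chunk_index(offset)
        if (path, idx) not in self.cache:
            # One request covers this chunk and the next few (prefetch),
            # reducing future client-master round trips.
            wanted = list(range(idx, idx + 4))
            for i, replicas in self.master.lookup(path, wanted).items():
                self.cache[(path, i)] = replicas
        return self.cache[(path, idx)]
```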

Metadata

• Master stores three major types:
  – File and chunk namespaces, persistent in the operation log
  – File-to-chunk mappings, persistent in the operation log
  – Locations of a chunk's replicas, not persistent.
• All kept in memory: fast!
  – Quick global scans
    • For garbage collection and reorganizations
  – Only about 64 bytes of metadata per 64 MB of data (see the arithmetic below)
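To see why keeping all metadata in memory is feasible, here is a back-of-the-envelope calculation at roughly 64 bytes of metadata per 64 MB chunk (the 1 PB figure is just an example).

```python
# Rough arithmetic: at ~64 bytes of metadata per 64 MB chunk, even a petabyte
# of file data needs only about 1 GB of master memory.
CHUNK_SIZE = 64 * 1024 ** 2        # 64 MB
META_PER_CHUNK = 64                # ~64 bytes of metadata per chunk

data_bytes = 1024 ** 5             # 1 PB of stored data (example)
chunks = data_bytes // CHUNK_SIZE
metadata_bytes = chunks * META_PER_CHUNK

print(chunks)                      # 16777216 chunks
print(metadata_bytes / 1024 ** 3)  # 1.0 GB of metadata
```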

Mutation Operation in GFS

• Mutation: any write or append operation.
• The data needs to be written to all replicas.
• The same mutation order is guaranteed at all replicas when multiple users issue concurrent mutations (see the sketch below).
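A hedged sketch of the ordering idea (in the GFS paper, a lease-holding primary replica picks the serial order that all replicas apply): concurrent mutations get serial numbers from one replica and are applied in that order everywhere. The classes and method names below are illustrative, not GFS internals.

```python
# Sketch: one replica (the primary) assigns a serial number to each mutation,
# and every replica applies mutations in that serial order, so all replicas of
# a chunk see the same sequence even when many clients write concurrently.
class PrimaryReplica:
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation


class Replica:
    def __init__(self):
        self.pending = {}       # serial -> mutation buffered until its turn
        self.applied_upto = 0
        self.data = []

    def apply(self, serial, mutation):
        self.pending[serial] = mutation
        # Apply strictly in serial order, even if mutations arrive out of order.
        while self.applied_upto in self.pending:
            self.data.append(self.pending.pop(self.applied_upto))
            self.applied_upto += 1


primary, replicas = PrimaryReplica(), [Replica() for _ in range(3)]
for m in ["append A", "append B", "append C"]:
    serial, mut = primary.order(m)
    for r in replicas:
        r.apply(serial, mut)
assert all(r.data == ["append A", "append B", "append C"] for r in replicas)
```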

GFS Revisited

• “GFS: Evolution on Fast-Forward,” an interview with the GFS designers, CACM 3/11.
• Single master was critical for early deployment.
• “The choice to establish 64MB … was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large.”
• As the application mix changed over time, … deal efficiently with large numbers of files requiring far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.

GFS Revisited (Cont'd)

• “The initial emphasis in designing GFS was on batch efficiency as opposed to low latency.”
• “The original single-master design: a single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving.”
• Future directions: distributed master, etc.
• Interesting and entertaining read.

PNUTS Overview

• Data model:
  – Simple relational model (really a key-value store).
  – Single-table scans with predicates
• Fault tolerance:
  – Redundancy at multiple levels: data, metadata, etc.
  – Leverages relaxed consistency for high availability: reads & writes despite failures
• Pub/Sub message system:
  – Yahoo! Message Broker for asynchronous updates

Asynchronous replication


Consistency Model

• Hide the complexity of data replication
• Between the two extremes:
  – One-copy serializability, and
  – Eventual consistency
• Key assumption:
  – Applications manipulate one record at a time
• Per-record timeline consistency:
  – All replicas of a record preserve the update order

Implementation

• A read returns a consistent version
• One replica designated as master (per record)
• All updates forwarded to that master
• Master designation is adaptive: the replica receiving most of the writes becomes the master (see the sketch below)
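A minimal sketch of per-record timeline consistency: each record has a designated master replica, every update goes through that master, which assigns the next version number, and all replicas apply versions in order (possibly lagging). The class and method names are illustrative, not PNUTS internals.

```python
# Sketch of per-record timeline consistency: all updates to a record go through
# the record's master replica, which assigns monotonically increasing version
# numbers; other replicas apply the versions in order and may be stale.
class RecordReplica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        if version == self.version + 1:       # preserve the update order
            self.version, self.value = version, value


class RecordMaster(RecordReplica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def update(self, value):
        new_version = self.version + 1
        self.apply(new_version, value)
        for f in self.followers:              # propagated asynchronously in PNUTS
            f.apply(new_version, value)
        return new_version


replicas = [RecordReplica(), RecordReplica()]
master = RecordMaster(replicas)
master.update("v1 of record 'Brian'")
master.update("v2 of record 'Brian'")
print(master.version, replicas[0].version)    # 2 2
```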

Consistency model

• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Brian”?

[Diagram: the record's timeline: inserted, then updated repeatedly, producing versions v.1 through v.8 within Generation 1, and finally deleted.]

Consistency model

[Diagram series: operations against the record's version timeline (v.1 … v.8, Generation 1), where some replicas hold stale versions and one holds the current version:
– Read: may return a stale version or the current version.
– Read up-to-date: always returns the current version.
– Read ≥ v.6: returns some version at least as new as v.6.
– Write: applied to the current version.
– Write if = v.7: a test-and-set write; returns ERROR here because the record has already advanced past v.7 (see the sketch below).]
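A sketch of the read/write variants illustrated above, in the spirit of PNUTS's per-record operations; the function names (read_any, read_at_least, write_if) are invented here, not PNUTS's actual API. A plain read may be served by a stale replica, a version-constrained read insists on at least the requested version, and a conditional write fails if the record has moved past the expected version.

```python
# Sketch of the operations above (illustrative names):
# - read_any: may return a stale version from any replica
# - read_at_least(v): returns a version no older than v
# - write_if(expected, value): test-and-set; errors if the current version differs
class Record:
    def __init__(self):
        self.version, self.value = 0, None


class StaleVersionError(Exception):
    pass


def read_any(replica: Record):
    return replica.version, replica.value            # possibly stale


def read_at_least(current: Record, replica: Record, min_version: int):
    # Serve from the local replica if it is new enough, else fall back to the current copy.
    r = replica if replica.version >= min_version else current
    return r.version, r.value


def write_if(current: Record, expected_version: int, value):
    if current.version != expected_version:
        raise StaleVersionError(
            f"record is at v.{current.version}, expected v.{expected_version}")
    current.version += 1
    current.value = value
```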

PNUTS Architecture

[Diagram (data-path components): clients, REST API, routers, tablet controller, storage units, and the message broker.]

PNUTS architecture

[Diagram: clients access the local region through the REST API and routers; the tablet controller manages storage units, and the Yahoo! Message Broker (YMB) propagates updates to remote regions.]

System Architecture: Key Features

• Pub/Sub mechanism: Yahoo! Message Broker
• Physical storage: storage units
• Mapping of records: tablet controller
• Record locating: routers

Highlights of PNUTS Approach

• Shared-nothing architecture
• Multiple datacenters for geographic distribution
• Timeline consistency and access to stale data
• Use of a publish/subscribe system for reliable, fault-tolerant communication
• Replication with record-based masters

AMAZON’S KEY-VALUE STORE: DYNAMO

Adapted from Amazon's Dynamo presentation

Highlights of Dynamo

• High write availability
• Optimistic replication: vector clocks for conflict resolution
• Consistent hashing (as in Chord) in a controlled environment
• Quorums for relaxed consistency (see the sketch below)
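A rough sketch of two of the building blocks named above: consistent hashing places each key on the N nodes that follow its ring position, and a quorum rule (R + W > N) makes read and write sets overlap. The parameter values, node names, and hash choice are illustrative.

```python
# Sketch of Dynamo-style placement and quorums (illustrative parameters):
# keys are hashed onto a ring; each key is stored on the N nodes that follow
# its position clockwise, and reads/writes succeed once R (resp. W) replicas
# answer, with R + W > N guaranteeing that read and write quorums overlap.
import bisect
import hashlib

N, R, W = 3, 2, 2                      # replication factor and quorum sizes
assert R + W > N


def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def preference_list(self, key: str):
        """The N nodes clockwise from the key's position on the ring."""
        idx = bisect.bisect(self.ring, (_hash(key), ""))
        return [self.ring[(idx + i) % len(self.ring)][1] for i in range(N)]


ring = Ring([f"node{i}" for i in range(6)])
print(ring.preference_list("user:42"))   # the 3 replicas responsible for this key
```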

TOO MANY CHOICES – WHICH SYSTEM SHOULD I USE?

Cooper et al., SOCC 2010

Benchmarking Serving Systems

• A standard benchmarking tool for evaluating key-value stores: Yahoo! Cloud Serving Benchmark (YCSB)
• Evaluate different systems on common workloads
• Focus on performance and scale-out

Benchmark tiers

• Tier 1 – Performance
  – Latency versus throughput as throughput increases (see the sketch below)
• Tier 2 – Scalability
  – Latency as database and system size increase (“scale-out”)
  – Latency as we elastically add servers (“elastic speedup”)
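A hedged sketch of what a Tier 1 measurement loop can look like: drive the store at increasing target throughputs and record the observed latency at each level. The `store.read` call and the workload mix are placeholders, not YCSB's API.

```python
# Sketch of a Tier-1 style measurement: for each target throughput, issue
# operations at roughly that rate and record the average latency.
# `store.read` stands in for whatever client library the system under test provides.
import time


def measure(store, target_throughputs, ops_per_level=1000):
    results = []
    for target in target_throughputs:              # target rate in ops/sec
        interval = 1.0 / target
        latencies = []
        for i in range(ops_per_level):
            start = time.perf_counter()
            store.read(f"user{i}")                 # placeholder operation
            latencies.append(time.perf_counter() - start)
            time.sleep(max(0.0, interval - latencies[-1]))  # pace to the target rate
        results.append((target, sum(latencies) / len(latencies)))
    return results                                  # latency-vs-throughput curve
```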

Workload A – Update heavy: 50/50 read/update

[Chart: read latency vs. throughput (ops/sec) for Cassandra, HBase, PNUTS, and MySQL.]

Cassandra (based on Dynamo) is optimized for heavy updates; Cassandra uses hash partitioning.

Workload B – Read heavy: 95/5 read/update

[Chart: read latency vs. throughput (operations/sec) for Cassandra, HBase, PNUTS, and MySQL.]

PNUTS uses MySQL, and MySQL is optimized for read operations.

Workload E – Short scans: scans of 1–100 records of size 1 KB

[Chart: scan latency vs. throughput (operations/sec) for HBase, PNUTS, and Cassandra.]

HBase uses an append-only log, so it is optimized for scans; the same holds for MySQL and PNUTS. Cassandra uses hash partitioning, so scan performance is poor.

Summary

• Different databases suitable for different workloads
• Evolving systems – landscape changing dramatically
• Active development community around open-source systems

Two approaches to scalability

• Scale-up
  – Classical enterprise (RDBMS) setting
  – Flexible ACID transactions
  – Transactions in a single node
• Scale-out
  – Cloud friendly (key-value stores)
  – Execution at a single server
    • Limited functionality & guarantees
  – No multi-row or multi-step transactions

Key-Value Store Lessons

What are the design principles learned?

Design Principles [DNIS 2010]

• Separate system and application state
  – System metadata is critical but small
  – Application data has varying needs
  – Separation allows use of different classes of protocols

Design Principles

• Decouple ownership from data storage
  – Ownership is exclusive read/write access to data
  – Decoupling allows lightweight ownership migration

[Diagram: in classical DBMSs the transaction manager, recovery, and cache manager are bundled with storage; decoupling separates ownership (multi-step transactions or read/write access) from storage.]

Design Principles

• Limit most interactions to a single node
  – Allows horizontal scaling
  – Graceful degradation during failures
  – No distributed synchronization

Thanks: Curino et al., VLDB 2010

Design Principles

• Limited distributed synchronization is practical
  – Maintenance of metadata
  – Provide strong guarantees only for the data that needs it

Fault-tolerance in the Cloud

• Need to tolerate catastrophic failures
  – Geographic replication
• How to support ACID transactions over data replicated at multiple datacenters?
  – One-copy serializability: clients can access data in any datacenter, and it appears as a single copy with atomic access

Megastore: Entity Groups (Google, CIDR 2011)

• Entity groups are sub-databases
  – Static partitioning
  – Cheap transactions within entity groups (common)
  – Expensive cross-entity-group transactions (rare)

Megastore Entity Groups

• Semantically predefined
• Email
  – Each email account forms a natural entity group
  – Operations within an account are transactional: a user's send-message operation is guaranteed to observe the change despite failover to another replica
• Blogs
  – A user's profile is an entity group
  – Operations such as creating a new blog rely on asynchronous messaging with two-phase commit
• Maps
  – Dividing the globe into non-overlapping patches
  – Each patch can be an entity group

(See the sketch below for the single-group vs. cross-group distinction.)
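A minimal sketch of the partitioning idea: pick an entity-group key (here, the email account) so that common operations stay inside one group and can use cheap local transactions, while cross-group operations take a more expensive path. The function names and the `store` transaction stubs are illustrative, not Megastore's API.

```python
# Sketch: route each operation to its entity group (keyed by account), so common
# operations stay inside one group (cheap local ACID transaction), while rare
# operations spanning groups need the expensive cross-group path.
def entity_group_key(email_address: str) -> str:
    return email_address            # one entity group per email account


def send_message(store, sender: str, recipient: str, body: str):
    sender_group = entity_group_key(sender)
    recipient_group = entity_group_key(recipient)
    if sender_group == recipient_group:
        # Single entity group: a cheap, local transaction suffices.
        store.transaction(sender_group,
                          ops=[("append_sent", body), ("append_inbox", body)])
    else:
        # Two entity groups: the rare, expensive path (e.g., queued messaging
        # or two-phase commit across groups).
        store.cross_group_transaction([sender_group, recipient_group],
                                      ops=[("append_sent", body),
                                           ("append_inbox", body)])
```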

Megastore

Slides adapted from the authors' presentation

Google’s Spanner: Database Tech That Can Scan the Planet

(OSDI 2012)

The Big Picture (OSDI 2012)

[Diagram: GPS + atomic clocks power TrueTime; 2PC provides atomicity, 2PL + wound-wait provides isolation, and Paxos provides consistency; movedir handles load balancing; tablets, logs, and SSTables sit on the Colossus File System.]

TrueTime

• TrueTime: APIs that provide real time with bounds on error (see the sketch below).
  – Powered by GPS and atomic clocks.
• Enforces external consistency:
  – If the start of T2 occurs after the commit of T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
• Concurrency control:
  – Update transactions: 2PL
  – Read-only transactions: use real time to return a consistent snapshot.
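A hedged sketch of the TrueTime idea from the Spanner paper: the clock API returns an interval [earliest, latest] guaranteed to contain true time, and a commit is made externally consistent by waiting out the uncertainty before becoming visible. The clock source and the uncertainty bound below are simulated, not Google's implementation.

```python
# Sketch of TrueTime-style commit wait (simulated clock uncertainty):
# now() returns an interval [earliest, latest] containing true time; a commit
# timestamp is taken at `latest`, and the commit only becomes visible after
# `earliest` has passed that timestamp, so a transaction that starts after this
# commit will be assigned a strictly larger timestamp.
import time

EPSILON = 0.004          # assumed clock uncertainty bound (seconds), illustrative


def tt_now():
    t = time.time()
    return t - EPSILON, t + EPSILON        # (earliest, latest)


def commit(apply_writes):
    _, latest = tt_now()
    commit_ts = latest                      # chosen timestamp >= true time
    while tt_now()[0] <= commit_ts:         # commit wait: sleep out the uncertainty
        time.sleep(EPSILON / 4)
    apply_writes(commit_ts)                 # now safe to make the commit visible
    return commit_ts


ts = commit(lambda t: print(f"commit visible at timestamp {t:.6f}"))
```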

Primary References

• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
• Ghemawat, Gobioff, Leung: The Google File System. SOSP 2003.
• McKusick, Quinlan: GFS: Evolution on Fast-Forward. Communications of the ACM, 2010.
• Cooper, Ramakrishnan, Srivastava, Silberstein, Bohannon, Jacobsen, Puz, Weaver, Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
• DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007.
• Cooper, Silberstein, Tam, Ramakrishnan, Sears: Benchmarking Cloud Serving Systems with YCSB. SoCC 2010.