Yahoo PNUTS Sherpa - Renmin University of China

Download Report

Transcript Yahoo PNUTS Sherpa - Renmin University of China

Introduction to cloud
computing
Jiaheng Lu
Department of Computer Science
Renmin University of China
www.jiahenglu.net
Yahoo! Cloud computing
Yahoo! Cloud Stack
EDGE
Brooklyn
Horizontal
Cloud Services
YCPI
…
WEB
VM/OS
Horizontal
Cloud ServicesPHP
yApache
APP
VM/OS
Horizontal
Cloud
Serving
Grid Services …
STORAGE
Sherpa
Horizontal
Cloud Services…
MOBStor
BATCH
Hadoop
Horizontal…Cloud Services
App Engine
Data Highway
Monitoring/Metering/Security
Provisioning (Self-serve)
YCS
Web Data Management
• Scan oriented
workloads
• Focus on
sequential disk
I/O
• $ per cpu
cycle
Large data analysis
(Hadoop)
Structured record
storage
(PNUTS/Sherpa)
Blob storage
(SAN/NAS)
• Object
retrieval and
streaming
• Scalable file
storage
• $ per GB
• CRUD
• Point lookups
and short
scans
• Index
organized
table and
random I/Os
• $ per latency
The World Has Changed

Web serving applications need:






Scalability!
 Preferably elastic
Flexible schemas
Geographic distribution
High availability
Reliable storage
Web serving applications can do without:


Complicated queries
Strong transactions
PNUTS /
SHERPA
To Help You Scale Your Mountains of Data
Yahoo! Serving Storage Problem

Small records – 100KB or less

Structured records – lots of fields, evolving

Extreme data scale - Tens of TB

Extreme request scale - Tens of thousands of requests/sec

Low latency globally - 20+ datacenters worldwide

High Availability - outages cost $millions

Variable usage patterns - as applications and users change
9
What is PNUTS/Sherpa?
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
Parallel database
CREATE TABLE Parts (
ID VARCHAR,
StockNumber INT,
Status VARCHAR
…
)
Structured, flexible schema
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
Geographic replication
Hosted, managed infrastructure
11
What Will It Become?
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
A
B
C
D
E
F
Indexes and views
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
Design Goals
Scalability



Thousands of machines
Easy to add capacity
Restrict query language to avoid costly queries
Geographic replication


Asynchronous replication around the globe
Low-latency local access
High availability and fault tolerance


Automatically recover from failures
Serve reads and writes despite failures
Consistency



Per-record guarantees
Timeline model
Option to relax if needed
Multiple access paths


Hash table, ordered table
Primary, secondary access
Hosted service


Applications plug and play
Share operational cost
14
Technology Elements
Applications
Tabular API
PNUTS API
YCA: Authorization
PNUTS
• Query planning and execution
• Index maintenance
Distributed infrastructure for tabular data
• Data partitioning
• Update consistency
• Replication
YDOT FS
• Ordered tables
YDHT FS
• Hash tables
Tribble
• Pub/sub messaging
Zookeeper
• Consistency service
15
Data Manipulation

Per-record operations




Get
Set
Delete
Multi-record operations



Multiget
Scan
Getrange
16
Tablets—Hash Table
0x0000
Name
Description
Grape
Grapes are good to eat
$12
Lime
Limes are green
$9
Apple
Apple is wisdom
$1
Strawberry
0x2AF3
0x911F
0xFFFF
Strawberry shortcake
Price
$900
Orange
Arrgh! Don’t get scurvy!
$2
Avocado
But at what price?
$3
Lemon
How much did you pay for this lemon?
$1
Tomato
Is this a vegetable?
$14
Banana
The perfect fruit
$2
New Zealand
$8
Kiwi
17
Tablets—Ordered Table
A
Name
Description
Price
Apple
Apple is wisdom
$1
Avocado
But at what price?
$3
Banana
The perfect fruit
$2
Grape
Grapes are good to eat
$12
New Zealand
$8
How much did you pay for this lemon?
$1
Limes are green
$9
H
Kiwi
Lemon
Lime
Q
Orange
Strawberry
Tomato
Arrgh! Don’t get scurvy!
$2
Strawberry shortcake
$900
Is this a vegetable?
$14
Z
18
Flexible Schema
Posted date
Listing id
Item
Price
6/1/07
424252
Couch
$570
6/1/07
763245
Bike
$86
6/3/07
211242
Car
$1123
6/5/07
421133
Lamp
$15
Color
Condition
Good
Red
Fair
Detailed Architecture
Remote regions
Local region
Clients
REST API
Routers
Tribble
Tablet Controller
Storage
units
20
Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal partitions of the table)
Storage unit may become a hotspot
Storage unit
Tablet
Overfull tablets split
Tablets may grow over time
Shed load by moving tablets to other servers
21
QUERY
PROCESSING
22
Accessing Data
4 Record for key k
1
Get key k
3 Record for key k
SU
SU
2
Get key k
SU
23
Bulk Read
1
{k1, k2, … kn}
2
Get k1
Get k2
SU
SU
Get k3
Scatter/
gather
server
SU
24
Range Queries in YDOT

Clustered, ordered retrieval of records
Apple
Avocado
Grapefruit…Pear?
Banana
Blueberry
Canteloupe
Grape
Kiwi
Lemon
Grapefruit…Lime?
Lime…Pear?
Router
Lime
Mango
Orange
Strawberry
Apple
Tomato
Avocado
Watermelon
Banana
Blueberry
Storage unit 1
Canteloupe
Storage unit 3
Lime
Storage unit 2
Strawberry
Storage unit 1
Strawberry
Tomato
Watermelon
Storage unit 1
Lime
Mango
Orange
Canteloupe
Grape
Kiwi
Lemon
Storage unit 2
Storage unit 3
Updates
1
8
Write key k
Sequence # for key k
Routers
Message brokers
3
Write key k
2
7
Sequence # for key k
4
Write key k
5
SU
SU
SU
6
SUCCESS
Write key k
26
ASYNCHRONOUS REPLICATION
AND CONSISTENCY
27
Asynchronous Replication
28
Consistency Model

Goal: Make it easier for applications to reason about updates and
cope with asynchrony

What happens to a record with primary key “Alice”?
Record
inserted
Update
v. 1
Update Update
Update
v. 2
v. 3
v. 4
Update
Update
v. 5
v. 6
Generation 1
v. 7
Delete
Update
v. 8
Time
Time
As the record is updated, copies may get out of sync.
29
Example: Social Alice
East
West
User
Status
Alice
___
User
Status
Alice
Busy
User
Status
User
Status
Alice
Busy
Alice
Free
User
Status
User
Status
Alice
???
Alice
???
Record Timeline
___
Busy
Free
Free
Consistency Model
Read
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
In general, reads are served using a local copy
31
Consistency Model
Read up-to-date
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
But application can request and get current version
32
Consistency Model
Read ≥ v.6
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
Or variations such as “read forward”—while copies may lag the
master record, every copy goes through the same sequence of changes
33
Consistency Model
Write
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
Achieved via per-record primary copy protocol
(To maximize availability, record masterships automaticlly
transferred if site fails)
Can be selectively weakened to eventual consistency
(local writes that are reconciled using version vectors)
34
Consistency Model
Write if = v.7
ERROR
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
Test-and-set writes facilitate per-record transactions
35
Consistency Techniques

Per-record mastering
 Each record is assigned a “master region”



May differ between records
Updates to the record forwarded to the master region
Ensures consistent ordering of updates

Tablet-level mastering
 Each tablet is assigned a “master region”
 Inserts and deletes of records forwarded to the master region
 Master region decides tablet splits

These details are hidden from the application
 Except for the latency impact!
Mastering
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
A
B
C
D
E
F
Tablet master
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
37
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
Bulk Insert/Update/Replace
Client
Source Data
Bulk manager
1. Client feeds records to bulk
manager
2. Bulk loader transfers records
to SU’s in batches
• Bypass routers and
message brokers
• Efficient import into
storage unit
Bulk Load in YDOT

YDOT bulk inserts can cause performance hotspots

Solution: preallocate tablets
Index Maintenance

How to have lots of interesting indexes and
views, without killing performance?

Solution: Asynchrony!

Indexes/views updated asynchronously when
base table updated
SHERPA
IN CONTEXT
41
Types of Record Stores

Query expressiveness
S3
PNUTS
Oracle
Simple
Feature rich
Object
retrieval
Retrieval from
single table of
objects/records
SQL
Types of Record Stores

Consistency model
S3
PNUTS
Oracle
Best effort
Eventual
consistency
Timeline
consistency
Object-centric
consistency
ACID
Program
centric
consistency
Strong
guarantees
Types of Record Stores

Data model
PNUTS
CouchDB
Oracle
Flexibility,
Schema evolution
Object-centric
consistency
Optimized for
Fixed schemas
Consistency
spans objects
Types of Record Stores

Elasticity (ability to add resources on
demand)
Oracle
PNUTS
S3
Inelastic
Elastic
Limited
(via data
distribution)
VLSD
(Very Large
Scale
Distribution
/Replication)
Data Stores Comparison
Versus PNUTS

User-partitioned SQL stores



Microsoft Azure SDS
Amazon SimpleDB
Multi-tenant application databases


Salesforce.com
Oracle on Demand







More expressive queries
Users must control partitioning
Limited elasticity
Highly optimized for complex
workloads
Limited flexibility to evolving
applications
Inherit limitations of underlying data
management system
Mutable object stores

Amazon S3

Object storage versus record
management
Application Design Space
Get a few
things
Sherpa
MySQL Oracle
BigTable
Scan
everything
Everest
Records
MObStor
YMDB
Filer
Hadoop
Files
47
SQL/ACID
Consistency
model
Updates
Structured
access
Global low
latency
Availability
Operability
Elastic
Alternatives Matrix
Sherpa
Y! UDB
MySQL
Oracle
HDFS
BigTable
Dynamo
Cassandra
48
Further Reading
Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008)
Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee,
Ramana Yerneni, Raghu Ramakrishnan
PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008)
Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava,
Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen,
Nick Puz, Daniel Weaver, Ramana Yerneni
Asynchronous View Maintenance for VLSD Databases,
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava and
Raghu Ramakrishnan
SIGMOD 2009 (to appear)
Cloud Storage Design in a PNUTShell
Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava
Beautiful Data, O’Reilly Media, 2009 (to appear)
QUESTIONS?
50