PNUTS: Yahoo!’s Hosted Data Serving Platform Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and.

Download Report

Transcript PNUTS: Yahoo!’s Hosted Data Serving Platform Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and.

PNUTS: Yahoo!’s Hosted Data
Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava,
Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick
Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
With some additions by S. Sudarshan
How do I build a cool new web
app?

Option 1: Code it up! Make it live!




Scale it later
It gets posted to slashdot
Scale it now!
Flickr, Twitter, MySpace, Facebook, …
2
How do I build a cool new web
app?

Option 2: Make it industrial strength!

Evaluate scalable database backends
Evaluate scalable indexing systems
Evaluate scalable caching systems
Architect data partitioning schemes
Architect data replication schemes
Architect monitoring and reporting infrastructure

Write application









Go live
Realize it doesn’t scale as well as you hoped
Rearchitect around bottlenecks
1 year later – ready to go!
3
Example: social network
updates
Brian
Sonja
Jimi
Brandon
Kurt
What are my friends up to?
Sonja:
Brandon:
4
Example: social network
updates
<photo>
<title>Flower</title>
<url>www.flickr.com</url>
</photo>
6
8
12
15
16
17
Jimi
Mary
Sonja
Brandon
Mike
Bob
<ph..
<re..
<ph..
<po..
<ph..
<re..
5
What do we need from our DBMS?

Web applications need:

Scalability




And the ability to scale linearly
Geographic scope
High availability
Web applications typically have:

Simplified query needs


No joins, aggregations
Relaxed consistency needs

Applications can tolerate stale or reordered data
6
What is PNUTS?
7
What is PNUTS?
A
B
42342
42521
E
W
C
66354
W
D
E
12352
75656
E
C
F
15677
E
A
B
C
D
E
F
42342
42521
66354
12352
75656
15677
E
W
W
E
C
E
Indexes and views
Parallel database
CREATE TABLE Parts (
ID VARCHAR,
StockNumber INT,
Status VARCHAR
A 42342
E
…
B 42521
W
)
C 66354
W
D
E
F
12352
75656
15677
Geographic replication
Structured, flexible schema
E
C
E
Hosted, managed infrastructure
8
Query model

Per-record operations




Multi-record operations




Get
Set
Delete
Multiget
Scan
Getrange
Web service (RESTful) API
9
Detailed architecture
Clients
Data-path components
REST API
Routers
Tablet
controller
Message
Broker
Storage units
10
Detailed architecture
Local region
Remote regions
Clients
REST API
Routers
YMB
Tablet controller
Storage
units
11
Tablet splitting and balancing
Each storage unit has many tablets (horizontal partitions of the table)
Storage unit may become a hotspot
Storage unit
Tablet
Overfull tablets split
Tablets may grow over time
Shed load by moving tablets to other servers
12
Query processing
13
Range queries
Apple
Avocado
Grapefruit…Pear?
Banana
Blueberry
Canteloupe
Grape
Kiwi
Lemon
MIN-Canteloupe
SU1
Canteloupe-Lime
SU3
Lime-Strawberry
SU2
Strawberry-MAX
SU1
Router
Grapefruit…Lime?
Lime…Pear?
Lime
Mango
Orange
Strawberry
Tomato
Watermelon
Storage unit 1
Storage unit 2
Storage unit 3
16
Updates
1
8 Sequence # for key k
Write key k
Routers
Message brokers
3
Write key k
2
7 Sequence # for key k
4
Write key k
5
SUCCESS
SU
SU
SU
6
Write key k
17
Yahoo Message Bus


Distributed publish-subscribe service
Guarantees delivery once a message is
published



Logging at site where message is published, and
at other sites when received
Guarantees messages published to a
particular cluster will be delivered in same
order at all other clusters
Record updates are published to YMB by
master copy

All replicas subscribe to the updates, and get them
in same order for a particular record
18
Asynchronous
replication and
consistency
19
Asynchronous replication
20
Consistency model

Goal: make it easier for applications to reason about updates
and cope with asynchrony

What happens to a record with primary key “Brian”?
Record Update
inserted
v. 1
v. 2
Update
v. 3
Update Update
v. 4
Update
Delete
Update Update
v. 5
v. 6
Generation 1
v. 7
v. 8
Time
21
Consistency model
Read
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
22
Consistency model
Read up-to-date
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
23
Consistency model
Read-critical(required version):
Stale version
v. 1
v. 2
v. 3
v. 4
Read ≥ v.6
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
24
Consistency model
Write
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
25
Consistency model
Test-and-set-write(required version)
Write if = v.7
ERROR
Stale version
v. 1
v. 2
v. 3
v. 4
Stale version
v. 5
v. 6
Generation 1
v. 7
Current
version
v. 8
Time
26
Consistency model
Write if = v.7
ERROR
Stale version
Stale version
Current
version
Mechanism: per record mastership
v. 1
v. 2
v. 3
v. 4
v. 5
v. 6
Generation 1
v. 7
v. 8
Time
27
Record and Tablet Mastership


Data in PNUTS is replicated across sites
Hidden field in each record stores which copy is the
master copy



Record also contains origin of last few updates



updates can be submitted to any copy
forwarded to master, applied in order received by master
Mastership can be changed by current master, based on
this information
Mastership change is simply a record update
Tablets mastership


Required to ensure primary key consistency
Can be different from record mastership
28
Other Features


Per record transactions
Copying a tablet (on failure, for e.g.)





Request copy
Publish checkpoint message
Get copy of tablet as of when checkpoint is
received
Apply later updates
Tablet split

Has to be coordinated across all copies
29
Query Processing

Range scan can span tablets


Only one tablet scanned at a time
Client may not need all results at once


Continuation object returned to client to indicate where
range scan should continue
Notification


One pub-sub topic per tablet
Client knows about tables, does not know about
tablets


Automatically subscribed to all tablets, even as tablets
are added/removed.
Usual problem with pub-sub: undelivered
notifications, handled in usual way
30
Experiments
31
Experimental setup

Production PNUTS code


Three PNUTS regions





Enhanced with ordered table type
2 west coast, 1 east coast
5 storage units, 2 message brokers, 1 router
West: Dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array
East: Quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk
Workload



1200-3600 requests/second
0-50% writes
80% locality
32
Inserts

Inserts



required 75.6 ms per insert in West 1
(tablet master)
131.5 ms per insert into the non-master
West 2, and
315.5 ms per insert into the non-master
East.
33
10% writes by default
34
Scalability
160
Average latency (ms)
140
120
100
80
60
40
20
0
1
2
3
4
5
6
Storage units
Hash table
Ordered table
35
Request skew
100
90
Average latency (ms)
80
70
60
50
40
30
20
10
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Zipf parameter
Hash table
Ordered table
36
Size of range scans
8000
Average latency (ms)
7000
6000
5000
4000
3000
2000
1000
0
0
0.02
0.04
0.06
0.08
0.1
0.12
Fraction of table scanned
30 clients
300 clients
37
Related work

Distributed and parallel databases



Distributed filesystems


Ceph, Boxwood, Sinfonia
Distributed (P2P) hash tables


Especially query processing and transactions
BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services,
Cassandra
Chord, Pastry, …
Database replication

Master-slave, epidemic/gossip, synchronous…
38
Conclusions and ongoing work

PNUTS is an interesting research product



Research: consistency, performance, fault
tolerance, rich functionality
Product: make it work, keep it (relatively) simple,
learn from experience and real applications
Ongoing work



Indexes and materialized views
Bundled updates
Batch query processing
39
Thanks!


[email protected]
research.yahoo.com
40