PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
With some additions by S. Sudarshan
How do I build a cool new web app?
Option 1: Code it up! Make it live! Scale it later.
- It gets posted to Slashdot... scale it now!
- Flickr, Twitter, MySpace, Facebook, …
How do I build a cool new web app?
Option 2: Make it industrial strength!
- Evaluate scalable database backends
- Evaluate scalable indexing systems
- Evaluate scalable caching systems
- Architect data partitioning schemes
- Architect data replication schemes
- Architect monitoring and reporting infrastructure
- Write the application
- Go live
- Realize it doesn't scale as well as you hoped
- Rearchitect around the bottlenecks
- One year later: ready to go!
Example: social network updates
[Figure: a social graph of users (Brian, Sonja, Jimi, Brandon, Kurt). Brian asks "What are my friends up to?" and sees the latest updates from Sonja and Brandon.]
Example: social network updates
[Figure: an updates table keyed by sequence number (6, 8, 12, 15, 16, 17), with one row per update from Jimi, Mary, Sonja, Brandon, Mike, and Bob; each row holds an update payload such as:]
<photo>
  <title>Flower</title>
  <url>www.flickr.com</url>
</photo>
What do we need from our DBMS?
Web applications need:
- Scalability, and the ability to scale linearly
- Geographic scope
- High availability
Web applications typically have:
- Simplified query needs: no joins, no aggregations
- Relaxed consistency needs: applications can tolerate stale or reordered data
What is PNUTS?
[Figure: a table of records (keys A-F with values 42342, 42521, 66354, 12352, 75656, 15677) partitioned across servers and replicated to multiple regions; the per-record letters E, W, C appear to tag the region holding each record's master copy.]
- Parallel database
- Geographic replication
- Indexes and views
- Structured, flexible schema:
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
)
- Hosted, managed infrastructure
Query model
- Per-record operations: Get, Set, Delete
- Multi-record operations: Multiget, Scan, Getrange
- Web service (RESTful) API
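The query model above can be sketched with an in-memory table (a minimal sketch; the class and method names are illustrative, not the real PNUTS API, which is exposed over REST):

```python
# Hypothetical in-memory sketch of PNUTS's per-record and multi-record
# operations. Real PNUTS partitions the table into tablets and serves it
# over a RESTful web service; this just illustrates the call semantics.
from bisect import bisect_left, bisect_right

class Table:
    def __init__(self):
        self.rows = {}  # primary key -> record

    # --- per-record operations ---
    def get(self, key):
        return self.rows.get(key)

    def set(self, key, record):
        self.rows[key] = record

    def delete(self, key):
        self.rows.pop(key, None)

    # --- multi-record operations ---
    def multiget(self, keys):
        return {k: self.rows[k] for k in keys if k in self.rows}

    def getrange(self, lo, hi):
        # Only meaningful for ordered tables; hash tables support scan only.
        keys = sorted(self.rows)
        return [(k, self.rows[k])
                for k in keys[bisect_left(keys, lo):bisect_right(keys, hi)]]

    def scan(self):
        yield from self.rows.items()
```

Note that getrange requires an ordered (range-partitioned) table, which is why the experiments later compare the hash-table and ordered-table variants separately.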
Detailed architecture
[Figure: data-path components. Clients issue requests through the REST API to routers; a tablet controller directs the routers; storage units hold the data; a message broker carries updates between components.]
Detailed architecture
[Figure: a local region (clients, REST API, routers, tablet controller, storage units, YMB) connected to remote regions through the Yahoo! Message Broker.]
Tablet splitting and balancing
- Each storage unit holds many tablets (horizontal partitions of the table)
- Tablets may grow over time; overfull tablets split
- A storage unit may become a hotspot; shed load by moving tablets to other servers
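The two mechanisms above can be sketched as follows (an illustrative sketch, not PNUTS code; the data structures and the max_tablets threshold are assumptions):

```python
# Illustrative sketch: an overfull tablet splits at its median key, and
# a controller moves tablets off overloaded storage units to shed load.

def split_tablet(tablet):
    """Split a sorted list of (key, record) pairs into two halves."""
    mid = len(tablet) // 2
    return tablet[:mid], tablet[mid:]

def rebalance(storage_units, max_tablets):
    """Move tablets from overloaded storage units to the least-loaded one."""
    for su in storage_units:
        while len(su["tablets"]) > max_tablets:
            target = min(storage_units, key=lambda s: len(s["tablets"]))
            if target is su:
                break  # nowhere less loaded to move to
            target["tablets"].append(su["tablets"].pop())
```

Real PNUTS splits based on tablet size and moves tablets based on observed load rather than a simple count, and a split must be coordinated across all replicas of the tablet (see "Other Features" below).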
Query processing
Range queries
[Figure: an ordered table of fruit keys (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) spread over three storage units. The router's interval map:
  MIN-Canteloupe   → SU1
  Canteloupe-Lime  → SU3
  Lime-Strawberry  → SU2
  Strawberry-MAX   → SU1
A range query "Grapefruit…Pear?" is split at tablet boundaries into "Grapefruit…Lime?" (sent to storage unit 3) and "Lime…Pear?" (sent to storage unit 2).]
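The router's interval map can be sketched with a sorted list of tablet boundaries (a minimal sketch using the boundary values from the figure; the function names are illustrative):

```python
# Hedged sketch of the router's interval map: tablet boundaries map key
# ranges to storage units, and a range scan is split at tablet boundaries.
from bisect import bisect_right

BOUNDARIES = ["Canteloupe", "Lime", "Strawberry"]   # tablet split points
UNITS      = ["SU1", "SU3", "SU2", "SU1"]           # one unit per interval

def lookup(key):
    """Which storage unit owns this key?"""
    return UNITS[bisect_right(BOUNDARIES, key)]

def split_range(lo, hi):
    """Split [lo, hi) into per-tablet sub-ranges, each served by one unit."""
    points = [lo] + [b for b in BOUNDARIES if lo < b < hi] + [hi]
    return [(points[i], points[i + 1], lookup(points[i]))
            for i in range(len(points) - 1)]
```

With these boundaries, split_range("Grapefruit", "Pear") yields the two sub-ranges shown in the figure: Grapefruit…Lime on SU3 and Lime…Pear on SU2.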
Updates
[Figure: the write path for key k, in eight numbered steps. As best the figure can be reconstructed: the client sends "write key k" to a router, which forwards it to the record's master storage unit; the master publishes the update to the message brokers, the write is committed once logged there, and SUCCESS flows back to the client; the brokers then deliver the update, with a sequence number for key k, to the other storage units asynchronously.]
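That write path can be sketched as follows (a minimal sketch under the assumptions in the figure description; class and function names are illustrative, and the single global log stands in for YMB's per-cluster logs):

```python
# Illustrative sketch of the write path: the router maps a key to its
# master storage unit, the master publishes the update to the message
# broker (the commit point), and replicas receive it asynchronously.

class MessageBroker:
    def __init__(self):
        self.log = []           # durable once appended (the commit point)
        self.subscribers = []   # replica storage units

    def publish(self, update):
        self.log.append(update)
        # Sequence number; per-record in real PNUTS, global here for brevity.
        return len(self.log)

    def deliver(self):
        # Asynchronous delivery, in publish order, to all subscribers.
        for update in self.log:
            for su in self.subscribers:
                su.apply(update)

class StorageUnit:
    def __init__(self):
        self.data = {}

    def apply(self, update):
        key, value = update
        self.data[key] = value

def write(router_map, broker, key, value):
    master = router_map[key]            # router: key -> master storage unit
    seq = broker.publish((key, value))  # master publishes; write is committed
    master.apply((key, value))
    return seq                          # SUCCESS returns to the client
```

The key design point the figure illustrates is that the client sees SUCCESS as soon as the broker has logged the update, before remote replicas have applied it.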
Yahoo! Message Broker (YMB)
- Distributed publish-subscribe service
- Guarantees delivery once a message is published
  - Logging at the site where the message is published, and at other sites when received
- Guarantees that messages published to a particular cluster are delivered in the same order at all other clusters
- Record updates are published to YMB by the master copy
  - All replicas subscribe to the updates, and receive them in the same order for a particular record
Asynchronous replication and consistency
Asynchronous replication
Consistency model
Goal: make it easier for applications to reason about updates and cope with asynchrony.
What happens to a record with primary key "Brian"?
[Timeline: the record is inserted and then updated repeatedly, producing versions v. 1 through v. 8 within Generation 1; a delete ends a generation, and a re-insert starts the next one.]
Read: may return any (possibly stale) version of the record, not necessarily the current one (v. 8).
Read up-to-date: returns the current version (v. 8).
Read-critical(required version): returns a version at least as new as the one required; a read of ≥ v. 6 may return v. 6, v. 7, or v. 8, but never an older version.
Write: blindly installs the next version of the record.
Test-and-set-write(required version): applies the write only if the record's current version equals the required version; here "write if = v. 7" returns ERROR because the record has already advanced to v. 8.
Mechanism: per-record mastership.
Record and Tablet Mastership
- Data in PNUTS is replicated across sites
- A hidden field in each record stores which copy is the master copy
  - The record also contains the origin of the last few updates
- Updates can be submitted to any copy; they are forwarded to the master and applied in the order received by the master
- Mastership can be changed by the current master, based on this information
  - A mastership change is simply a record update
- Tablet mastership
  - Required to ensure primary-key consistency
  - Can be different from record mastership
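The record-mastership mechanism above can be sketched like this (an illustrative sketch; the migration rule, history length, and names are assumptions, not the production policy):

```python
# Illustrative sketch: each record carries a hidden master field plus the
# origin regions of its last few updates. Writes submitted anywhere are
# forwarded to the master, and the master may hand off mastership toward
# the region issuing most updates; the handoff is itself a record update.
from collections import Counter, deque

class ReplicatedRecord:
    def __init__(self, master, history=3):
        self.master = master                  # hidden field: master region
        self.origins = deque(maxlen=history)  # origins of last few updates
        self.value = None

    def submit(self, origin, value):
        # Any replica accepts the write; it is forwarded to and applied
        # at the master, which records where the write came from.
        self.origins.append(origin)
        self.value = value
        self.maybe_move_master()

    def maybe_move_master(self):
        origin, count = Counter(self.origins).most_common(1)[0]
        # Assumed policy: migrate once the whole recent history is one region.
        if origin != self.master and count == self.origins.maxlen:
            self.master = origin
```

The point of tracking update origins is that a record whose writer moves (say, a user travelling from the west coast to the east coast) stops paying cross-region forwarding latency once mastership follows them.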
Other Features
- Per-record transactions
- Copying a tablet (e.g., on failure):
  - Request a copy
  - Publish a checkpoint message
  - Get a copy of the tablet as of when the checkpoint is received
  - Apply later updates
- Tablet splits have to be coordinated across all copies
Query Processing
- A range scan can span tablets; only one tablet is scanned at a time
  - The client may not need all results at once
  - A continuation object is returned to the client to indicate where the range scan should continue
- Notification
  - One pub-sub topic per tablet
  - The client knows about tables, not about tablets, so it is automatically subscribed to all tablets, even as tablets are added or removed
  - The usual pub-sub problem of undelivered notifications is handled in the usual way
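The continuation-based scan can be sketched as follows (a minimal sketch; the function shape and continuation representation are assumptions, and tablets are assumed non-empty):

```python
# Sketch of continuation-based range scans: one tablet is scanned per
# call, and a continuation key tells the client where to resume.

def scan_step(tablets, start_key=None):
    """tablets: list of sorted (key, record) lists, in key order.
    Returns (results, continuation); continuation is None when done."""
    for i, tablet in enumerate(tablets):
        # Skip tablets that end before the resume point.
        if start_key is not None and tablet[-1][0] < start_key:
            continue
        results = [(k, v) for k, v in tablet
                   if start_key is None or k >= start_key]
        # Continuation = first key of the next tablet, if any remains.
        nxt = tablets[i + 1][0][0] if i + 1 < len(tablets) else None
        return results, nxt
    return [], None
```

A client loops, calling scan_step with the returned continuation, and can stop early once it has as many results as it needs; this is why the server never has to scan more than one tablet per request.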
Experiments
Experimental setup
- Production PNUTS code, enhanced with an ordered table type
- Three PNUTS regions: 2 west coast, 1 east coast
- Per region: 5 storage units, 2 message brokers, 1 router
  - West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
  - East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
- Workload: 1200-3600 requests/second, 0-50% writes, 80% locality
Inserts
Inserts required 75.6 ms per insert at West 1 (the tablet master), 131.5 ms per insert at the non-master West 2, and 315.5 ms per insert at the non-master East.
The remaining experiments use 10% writes by default.
Scalability
[Figure: average latency (ms, 0-160) versus number of storage units (1-6), for the hash table and the ordered table.]
Request skew
[Figure: average latency (ms, 0-100) versus Zipf parameter (0-1), for the hash table and the ordered table.]
Size of range scans
[Figure: average latency (ms, 0-8000) versus fraction of table scanned (0-0.12), for 30 clients and 300 clients.]
Related work
- Distributed and parallel databases, especially query processing and transactions
- BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
- Distributed filesystems: Ceph, Boxwood, Sinfonia
- Distributed (P2P) hash tables: Chord, Pastry, …
- Database replication: master-slave, epidemic/gossip, synchronous, …
Conclusions and ongoing work
- PNUTS is an interesting research product
  - Research: consistency, performance, fault tolerance, rich functionality
  - Product: make it work, keep it (relatively) simple, learn from experience and real applications
- Ongoing work
  - Indexes and materialized views
  - Bundled updates
  - Batch query processing
Thanks!
[email protected]
research.yahoo.com