PNUTS: Yahoo!’s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
With some additions by S. Sudarshan

How do I build a cool new web app? Option 1:
- Code it up! Make it live! Scale it later
- It gets posted to Slashdot
- Scale it now!
- Flickr, Twitter, MySpace, Facebook, …

How do I build a cool new web app? Option 2: Make it industrial strength!
- Evaluate scalable database backends
- Evaluate scalable indexing systems
- Evaluate scalable caching systems
- Architect data partitioning schemes
- Architect data replication schemes
- Architect monitoring and reporting infrastructure
- Write application
- Go live
- Realize it doesn’t scale as well as you hoped
- Rearchitect around bottlenecks
- 1 year later – ready to go!

Example: social network updates
(Diagram: Brian’s friends Sonja, Jimi, Brandon, Kurt; "What are my friends up to?" shows the latest updates from Sonja and Brandon.)

Example: social network updates
(Diagram: a table of update records keyed by sequence number, with rows such as (6, Jimi, <ph..), (8, Mary, <re..), (12, Sonja, <ph..), (15, Brandon, <po..), (16, Mike, <ph..), (17, Bob, <re..); one record is shown in full: <photo> <title>Flower</title> <url>www.flickr.com</url> </photo>.)

What do we need from our DBMS?
Web applications need:
- Scalability, and the ability to scale linearly
- Geographic scope
- High availability
Web applications typically have:
- Simplified query needs: no joins, aggregations
- Relaxed consistency needs: applications can tolerate stale or reordered data

What is PNUTS?
(Diagram: a table with rows A–F and values such as 42342, 42521, 66354, replicated and mastered across regions E, W, C.)
- Parallel database
- Geographic replication
- Indexes and views
- Structured, flexible schema, e.g.:
    CREATE TABLE Parts (
      ID VARCHAR,
      StockNumber INT,
      Status VARCHAR,
      …
    )
- Hosted, managed infrastructure

Query model
- Per-record operations: Get, Set, Delete
- Multi-record operations: Multiget, Scan, Getrange
- Web service (RESTful) API

Detailed architecture
(Diagram: clients call the REST API; the data-path components are routers, the tablet controller, the message broker, and the storage units.)

Detailed architecture
(Diagram: a local region and remote regions, each with clients, REST API, routers, tablet controller, and storage units, connected through YMB.)

Tablet splitting and balancing
- Each storage unit has many tablets (horizontal partitions of the table)
- Tablets may grow over time; overfull tablets split
- A storage unit may become a hotspot; shed load by moving tablets to other servers

Query processing

Range queries
(Diagram: keys Apple through Watermelon are spread across storage units 1–3. The router holds an interval map: MIN-Canteloupe is served by SU1, Canteloupe-Lime by SU3, Lime-Strawberry by SU2, and Strawberry-MAX by SU1. The range query Grapefruit…Pear? is split into Grapefruit…Lime? for storage unit 3 and Lime…Pear? for storage unit 2.)
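The interval map above is the core of routing for ordered tables, so it is worth seeing concretely. Below is a minimal Python sketch, not PNUTS code, of how a router could answer a point lookup and split a range scan across tablets using the boundaries and storage-unit assignments from the slide; the names and data layout are assumptions made for illustration.

```python
from bisect import bisect_right

# Tablet boundaries and owning storage units, copied from the slide's interval map,
# treated here as half-open intervals:
# [MIN, Canteloupe) -> SU1, [Canteloupe, Lime) -> SU3,
# [Lime, Strawberry) -> SU2, [Strawberry, MAX) -> SU1.
BOUNDARIES = ["Canteloupe", "Lime", "Strawberry"]
UNITS = ["SU1", "SU3", "SU2", "SU1"]

def storage_unit_for(key):
    """Point lookup: find the tablet (and its storage unit) covering one key."""
    return UNITS[bisect_right(BOUNDARIES, key)]

def split_range(lo, hi):
    """Split the range scan [lo, hi) into one sub-range per tablet it touches."""
    i = bisect_right(BOUNDARIES, lo)
    start = lo
    while i < len(BOUNDARIES) and BOUNDARIES[i] < hi:
        yield (start, BOUNDARIES[i], UNITS[i])   # scan up to the tablet boundary
        start = BOUNDARIES[i]
        i += 1
    yield (start, hi, UNITS[i])                  # final sub-range ends at hi

if __name__ == "__main__":
    print(storage_unit_for("Grapefruit"))        # SU3
    # The slide's query Grapefruit...Pear becomes two sub-scans:
    # ('Grapefruit', 'Lime', 'SU3') then ('Lime', 'Pear', 'SU2')
    print(list(split_range("Grapefruit", "Pear")))
```

The same lookup structure also explains tablet splitting and moving: splitting a tablet adds a boundary, and moving one only changes its entry in the units list, so the router map stays small and cheap to update.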
Updates
(Diagram: the eight-step write path for key k. The client’s write is routed to the storage unit holding the record’s master copy, committed by publishing it to the message broker, SUCCESS is returned to the client, and the update, carrying a per-record sequence number, is delivered asynchronously to the other replicas.)

Yahoo! Message Broker (YMB)
- Distributed publish-subscribe service
- Guarantees delivery once a message is published: logging at the site where the message is published, and at other sites when received
- Guarantees that messages published to a particular cluster will be delivered in the same order at all other clusters
- Record updates are published to YMB by the master copy
- All replicas subscribe to the updates, and get them in the same order for a particular record

Asynchronous replication and consistency

Asynchronous replication
(Diagram: updates flow asynchronously between regions through the message broker.)

Consistency model
- Goal: make it easier for applications to reason about updates and cope with asynchrony
- What happens to a record with primary key "Brian"?
(Diagram: a per-record timeline. The insert creates v.1 and each subsequent update produces v.2 … v.8 within Generation 1; a delete ends a generation, and a later insert would begin the next.)
- Read: may return a stale version from a local replica
- Read up-to-date: returns the current version
- Read-critical(required version): returns a version at least as new as the required one (e.g., read ≥ v.6)
- Write: installs a new current version
- Test-and-set-write(required version): the write is applied only if the current version equals the required one (e.g., write if = v.7); otherwise it returns ERROR
- Mechanism: per-record mastership
(A client-side sketch of these calls appears below, after the Query Processing slide.)

Record and Tablet Mastership
- Data in PNUTS is replicated across sites
- A hidden field in each record stores which copy is the master copy; the record also records the origin of the last few updates
- Updates can be submitted to any copy; they are forwarded to the master and applied in the order received by the master
- Mastership can be changed by the current master, based on this information; a mastership change is simply a record update
- Tablet mastership: required to ensure primary-key consistency; can differ from record mastership

Other Features
- Per-record transactions
- Copying a tablet (e.g., on failure): request a copy, publish a checkpoint message, get a copy of the tablet as of when the checkpoint is received, then apply later updates
- Tablet split: has to be coordinated across all copies

Query Processing
- A range scan can span tablets; only one tablet is scanned at a time
- The client may not need all results at once: a continuation object is returned to the client to indicate where the range scan should continue
Notification
- One pub-sub topic per tablet
- The client knows about tables, not about tablets; it is automatically subscribed to all tablets, even as tablets are added or removed
- The usual problem with pub-sub, undelivered notifications, is handled in the usual way
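To make the consistency-model calls above concrete, here is a small, hypothetical client-side sketch in Python. The pnuts client object and its method names (read_any, read_latest, read_critical, test_and_set_write) mirror the operations named on the slides but are assumptions for illustration, not the real PNUTS API.

```python
class StaleVersionError(Exception):
    """Raised when a test-and-set write finds the record is no longer at the required version."""

def increment_counter(pnuts, table, key, retries=3):
    """Read-modify-write without lost updates, using test-and-set-write."""
    for _ in range(retries):
        rec = pnuts.read_latest(table, key)            # "read up-to-date": current version
        try:
            pnuts.test_and_set_write(
                table, key,
                {"count": rec["count"] + 1},
                required_version=rec["version"],        # applied only if still the current version
            )
            return
        except StaleVersionError:
            continue                                    # another writer got in first; re-read and retry
    raise RuntimeError("too much write contention on key %r" % (key,))

def read_profile(pnuts, table, key, min_version=None):
    """Plain reads tolerate stale replicas; read-critical enforces a recency floor."""
    if min_version is None:
        return pnuts.read_any(table, key)               # any replica, possibly stale but fast
    return pnuts.read_critical(table, key, required_version=min_version)
```

The retry loop is the usual pattern here: because writes to a record are serialized at its master, a failed test-and-set simply means another write committed between the read and the write, so the caller re-reads and tries again.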
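The continuation object in the Query Processing slide implies a simple client-side loop. The sketch below uses the same assumptions as above; the pnuts.getrange call, its limit and continuation parameters, and its return shape are illustrative, not a documented interface.

```python
def scan(pnuts, table, start_key, end_key, batch_size=100):
    """Drive a range scan that may span several tablets, one batch at a time.

    The server scans only one tablet at a time and hands back a continuation
    object; passing it to the next call tells the router where to resume.
    """
    continuation = None
    while True:
        records, continuation = pnuts.getrange(
            table, start_key, end_key,
            limit=batch_size,
            continuation=continuation,   # None on the first call
        )
        for rec in records:
            yield rec                    # stream results to the caller lazily
        if continuation is None:         # nothing left to scan
            return
```

Because results are pulled batch by batch, a client that stops iterating early never forces the remaining tablets to be scanned.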
Experiments

Experimental setup
- Production PNUTS code, enhanced with an ordered table type
- Three PNUTS regions: 2 west coast, 1 east coast
- 5 storage units, 2 message brokers, 1 router
- West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
- East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
- Workload: 1200-3600 requests/second, 0-50% writes, 80% locality

Inserts
- Inserts required 75.6 ms per insert in West 1 (the tablet master), 131.5 ms per insert into the non-master West 2, and 315.5 ms per insert into the non-master East
- The remaining experiments use 10% writes by default

Scalability
(Chart: average latency (ms) versus number of storage units (1-6), for the hash table and the ordered table.)

Request skew
(Chart: average latency (ms) versus Zipf parameter (0-1), for the hash table and the ordered table.)

Size of range scans
(Chart: average latency (ms) versus fraction of the table scanned (0-0.12), for 30 clients and for 300 clients.)

Related work
- Distributed and parallel databases, especially query processing and transactions
- BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
- Distributed filesystems: Ceph, Boxwood, Sinfonia
- Distributed (P2P) hash tables: Chord, Pastry, …
- Database replication: master-slave, epidemic/gossip, synchronous, …

Conclusions and ongoing work
- PNUTS is an interesting research product
  - Research: consistency, performance, fault tolerance, rich functionality
  - Product: make it work, keep it (relatively) simple, learn from experience and real applications
- Ongoing work: indexes and materialized views, bundled updates, batch query processing

Thanks!
[email protected]
research.yahoo.com