Lecture 18, BigTable.


Bigtable: A Distributed Storage System for Structured Data

Jing Zhang

Reference: Handling Large Datasets at Google: Current System and Future Directions, Jeff Dean

Outline

* Motivation
* Data Model
* APIs
* Building Blocks
* Implementation
* Refinement
* Evaluation


Google’s Motivation – Scale!

* Scale problem
  – Lots of data
  – Millions of machines
  – Different projects/applications
  – Hundreds of millions of users
* Storage for (semi-)structured data
* No commercial system is big enough
  – Couldn't afford one even if there were
* Low-level storage optimization helps performance significantly
* Much harder to do when running on top of a database layer

Bigtable

* Distributed multi-level map
* Fault-tolerant, persistent
* Scalable
  – Thousands of servers
  – Terabytes of in-memory data
  – Petabytes of disk-based data
  – Millions of reads/writes per second, efficient scans
* Self-managing
  – Servers can be added/removed dynamically
  – Servers adjust to load imbalance

Real Applications



Data Model

* A sparse, distributed, persistent, multi-dimensional sorted map (a toy sketch follows below):

  (row, column, timestamp) -> cell contents
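To make the shape concrete, here is a small illustrative Python snippet based on the WebTable example from the Bigtable paper; the keys and values are illustrative data, not an API.

```python
# A toy picture of the Bigtable map shape: a sorted map from
# (row, column family:qualifier, timestamp) to an uninterpreted string.
webtable = {
    ("com.cnn.www", "contents:", 3): "<html>... (older version)",
    ("com.cnn.www", "contents:", 5): "<html>... (newer version)",
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
}

# Keys are kept in lexicographic order, so a range scan is just an ordered walk.
for (row, column, ts), value in sorted(webtable.items()):
    print(row, column, ts, value)
```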

Data Model

* Rows
  – Arbitrary strings
  – Access to data in a row is atomic
  – Ordered lexicographically, so related row keys sort near each other (see the sketch below)
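The WebTable example in the Bigtable paper exploits this ordering by reversing hostnames in row keys so that pages from the same domain sort adjacently. A minimal sketch; the helper name is hypothetical, not part of any Bigtable API.

```python
def url_to_row_key(url):
    """Reverse the hostname so pages from the same domain sort adjacently.

    'maps.google.com/index.html' -> 'com.google.maps/index.html'
    """
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")


assert url_to_row_key("maps.google.com/index.html") == "com.google.maps/index.html"
```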

Data Model

* Columns
  – Two-level name structure:
    • family:qualifier (e.g., anchor:cnnsi.com)
  – The column family is the unit of access control

Data Model

* Timestamps
  – Store different versions of the data in a cell
  – Lookup options (sketched below):
    • Return the most recent K values
    • Return all values
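A minimal sketch of the per-cell version lookup described above, assuming integer timestamps kept newest-first; the names and data are illustrative.

```python
# Toy versioned-cell lookup: each cell keeps (timestamp, value) pairs, and a
# read can ask for the most recent k versions or for all of them.
cell_versions = {
    ("com.cnn.www", "contents:"): [(9, "<html> v3"), (5, "<html> v2"), (3, "<html> v1")],
}


def lookup(row, column, k=None):
    versions = sorted(cell_versions.get((row, column), []), reverse=True)
    return versions if k is None else versions[:k]


print(lookup("com.cnn.www", "contents:", k=2))   # the two newest versions
print(lookup("com.cnn.www", "contents:"))        # all versions
```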

Data Model

* The row range of a table is dynamically partitioned
* Each row range is called a tablet
* The tablet is the unit of distribution and load balancing


APIs

* Metadata operations
  – Create/delete tables and column families; change metadata
* Writes
  – Set(): write cells in a row
  – DeleteCells(): delete cells in a row
  – DeleteRow(): delete all cells in a row
* Reads
  – Scanner: read arbitrary cells in a Bigtable
    • Each row read is atomic
    • Can restrict returned rows to a particular range
    • Can ask for data from just one row, all rows, etc.
    • Can ask for all columns, just certain column families, or specific columns

(A toy end-to-end sketch of these operations appears below.)
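The sketch is self-contained Python whose method names mirror the slide (Set, DeleteCells, DeleteRow, Scanner); the real Bigtable client is a C++ library, so treat the class and signatures as hypothetical.

```python
import time


class ToyBigtableClient:
    """Toy in-memory stand-in; not the actual Bigtable client API."""

    def __init__(self):
        self._rows = {}  # row key -> {column: [(timestamp, value), ...]}

    # Writes: each call touches a single row; per-row access is atomic.
    def set(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1e6)
        self._rows.setdefault(row, {}).setdefault(column, []).append((ts, value))

    def delete_cells(self, row, column):
        self._rows.get(row, {}).pop(column, None)

    def delete_row(self, row):
        self._rows.pop(row, None)

    # Reads: a scanner over a row range, optionally limited to certain columns.
    def scan(self, start_row="", end_row=None, columns=None):
        for row in sorted(self._rows):
            if row < start_row or (end_row is not None and row >= end_row):
                continue
            cells = self._rows[row]
            if columns is not None:
                cells = {c: v for c, v in cells.items() if c in columns}
            yield row, cells


client = ToyBigtableClient()
client.set("com.cnn.www", "anchor:cnnsi.com", "CNN")
for key, cells in client.scan(start_row="com.cnn"):
    print(key, cells)
```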


Typical Cluster

Shared pool of machines that also run other distributed applications.

Building Blocks

* Google File System (GFS)
  – Stores persistent data (SSTable file format; see the sketch below)
* Scheduler
  – Schedules jobs onto machines
* Chubby
  – Lock service: a distributed lock manager
  – Master election, location bootstrapping
* MapReduce (optional)
  – Data processing
  – Reads/writes Bigtable data
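Conceptually, an SSTable is an immutable file of sorted key/value pairs plus a small index that lets a reader locate a key efficiently. A minimal in-memory sketch of that idea; the class is illustrative and omits the real on-disk block layout.

```python
import bisect


class ToySSTable:
    """Toy stand-in for an SSTable: written once, sorted, then read via
    point lookups and range scans."""

    def __init__(self, items):
        self._entries = sorted(items)                 # (key, value) pairs, sorted by key
        self._keys = [k for k, _ in self._entries]    # stands in for the block index

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._entries[i][1]
        return None

    def scan(self, start_key, end_key):
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return self._entries[lo:hi]


sst = ToySSTable([("com.cnn.www", "<html>..."), ("com.aaa.www", "<html>...")])
print(sst.get("com.cnn.www"))
print(sst.scan("com.a", "com.z"))
```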

Chubby

* A {lock/file/name} service
* Coarse-grained locks
* Each client has a session with Chubby
  – The session expires if the client is unable to renew its session lease within the lease expiration time
* 5 replicas; need a majority vote to be active
* Also an OSDI '06 paper


Implementation

* Single-master distributed system
* Three major components
  – A library linked into every client
  – One master server
    • Assigns tablets to tablet servers
    • Detects the addition and expiration of tablet servers
    • Balances tablet-server load
    • Garbage collection
    • Metadata operations
  – Many tablet servers
    • Each handles read and write requests to its tablets
    • Splits tablets that have grown too large


Tablets

* Each tablet is assigned to one tablet server
  – A tablet holds a contiguous range of rows
    • Clients can often choose row keys to achieve locality
  – Aim for ~100 MB to 200 MB of data per tablet
* A tablet server is responsible for ~100 tablets
  – Fast recovery:
    • 100 machines each pick up 1 tablet from a failed machine
  – Fine-grained load balancing:
    • Migrate tablets away from an overloaded machine
    • The master makes load-balancing decisions

How to locate a Tablet?

Given a row, how do clients find the location of the tablet whose row range covers the target row?

* METADATA table: key = table id + end row; data = tablet location
* Aggressive caching and prefetching on the client side (see the sketch below)
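A toy sketch of the lookup implied by that key scheme: find the first METADATA entry whose end row is at or past the target row. The flat in-memory list and the server addresses are illustrative assumptions (the real system reaches METADATA through a lookup hierarchy rooted in Chubby, and clients cache the results).

```python
import bisect

# Toy METADATA table: rows keyed by (table id, end row of tablet), storing the
# tablet's location. Addresses and layout are illustrative only.
METADATA = [
    (("webtable", "com.cnn"), "tabletserver-17:8000"),
    (("webtable", "com.nyt"), "tabletserver-03:8000"),
    (("webtable", "\xff"), "tabletserver-42:8000"),   # last tablet: open-ended range
]
METADATA_KEYS = [key for key, _ in METADATA]


def locate_tablet(table_id, row):
    """Find the tablet covering `row`: the first entry whose end row >= row."""
    i = bisect.bisect_left(METADATA_KEYS, (table_id, row))
    key, location = METADATA[i]
    assert key[0] == table_id, "row falls outside this table"
    return location


# Clients cache (and prefetch) these locations, so most reads skip the lookup.
print(locate_tablet("webtable", "com.cnn.www"))   # -> tabletserver-03:8000
```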

Tablet Assignment

* Each tablet is assigned to one tablet server at a time
* The master keeps track of the set of live tablet servers and the current assignment of tablets to servers
* When a tablet is unassigned, the master assigns it to a tablet server with sufficient room
* The master uses Chubby to monitor the health of tablet servers and to restart/replace failed servers

Tablet Assignment

* Chubby
  – A tablet server registers itself by acquiring a lock on a file in a specific Chubby directory
    • Chubby gives a "lease" on the lock, which must be renewed periodically
    • The server loses its lock if it gets disconnected
  – The master monitors this directory to find which servers exist/are alive (a toy sketch of this scan-and-reassign loop follows below)
    • If a server is not contactable or has lost its lock, the master grabs the lock and reassigns its tablets
    • GFS replicates the data; prefer to start a tablet server on a machine that already holds the data
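A toy sketch of that protocol, with assumed names rather than Chubby's real API: each tablet server holds a lock on a file in a well-known directory, and the master reclaims tablets from servers whose locks are gone.

```python
class ToyChubbyDirectory:
    """Illustrative stand-in for the Chubby servers directory."""

    def __init__(self):
        self.locks = {}                      # file path -> current lock holder

    def acquire(self, path, holder):
        if path not in self.locks:
            self.locks[path] = holder
            return True
        return False

    def release(self, path):
        self.locks.pop(path, None)

    def list_dir(self, prefix):
        return [p for p in self.locks if p.startswith(prefix)]


def master_scan(chubby, assignments):
    """Mark as unassigned any tablet whose server no longer holds its lock."""
    live = {p.split("/")[-1] for p in chubby.list_dir("servers/")}
    for tablet, server in list(assignments.items()):
        if server not in live:
            assignments[tablet] = None       # later reassigned to a server with room


chubby = ToyChubbyDirectory()
chubby.acquire("servers/ts-07", "ts-07")     # ts-07 registered; ts-99 lost its lock
assignments = {"webtable [a, com.nyt)": "ts-07", "webtable [com.nyt, +inf)": "ts-99"}
master_scan(chubby, assignments)
print(assignments)                           # ts-99's tablet is now unassigned
```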


Refinement – Locality Groups & Compression

* Locality groups
  – Multiple column families can be grouped into a locality group
    • A separate SSTable is created for each locality group in each tablet
  – Segregating column families that are not typically accessed together enables more efficient reads
    • In WebTable, page metadata can be in one group and page contents in another (see the sketch below)
* Compression
  – Many opportunities for compression:
    • Similar values in the same cell at different timestamps
    • Similar values in different columns
    • Similar values across adjacent rows
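An illustrative sketch of splitting a row's columns into locality groups so each group can be stored in its own SSTable. The group names and families mirror the WebTable example; the function and configuration are assumptions for illustration, not a Bigtable API.

```python
# Group column families into locality groups (illustrative configuration).
LOCALITY_GROUPS = {
    "metadata": {"language", "anchor"},   # small families often scanned together
    "contents": {"contents"},             # large page bodies read separately
}


def split_by_locality_group(row_cells):
    """row_cells: {"family:qualifier": value} -> {group_name: {column: value}}"""
    per_group = {group: {} for group in LOCALITY_GROUPS}
    for column, value in row_cells.items():
        family = column.split(":", 1)[0]
        for group, families in LOCALITY_GROUPS.items():
            if family in families:
                per_group[group][column] = value
    return per_group


cells = {"language:": "en", "anchor:cnnsi.com": "CNN", "contents:": "<html>..."}
print(split_by_locality_group(cells))
# A scan that only needs metadata touches just the "metadata" group's SSTable.
```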


Performance - Scaling

Not linear! Why?

* As the number of tablet servers is increased by a factor of 500:
  – Aggregate performance of random reads from memory increases by a factor of 300 (per-server throughput falls to roughly 300/500 = 60% of the single-server rate)
  – Aggregate performance of scans increases by a factor of 260 (roughly 52% per-server)

Why not linear?

* Load imbalance
  – Competition with other processes for shared resources
    • Network
    • CPU
  – The rebalancing algorithm does not work perfectly
    • It deliberately limits the number of tablet movements
    • Load shifts around as the benchmark progresses