Bigtable: A Distributed Storage System for Structured Data
Authors
Fay Chang
Jeffrey Dean
Sanjay Ghemawat
Wilson Hsieh
Deborah Wallach
Mike Burrows
Tushar Chandra
Andrew Fikes
Robert Gruber
Presented by:
Arif Bin Hossain
Dept. of Computer Science
UTSA
Motivation
Large scale structured data
URLs: Contents, links, anchors, page rank
User data: Preference settings, recent queries, search results
Geographic locations: Physical entities, roads, satellite images
Large set of structured MATLAB data
EEG, EMG, Eye motion
Fields are not uniform among datasets
Data types are not uniform among datasets
Why not Relational Database?
Scale is too large for most commercial databases
Even if it weren’t, cost would be very high
Low-level storage optimizations help performance significantly
Hard to map semi-structured data to a relational database
Non-uniform fields make it difficult to insert/query data
Bigtable
BigTable is a distributed storage system for managing structured data.
Designed to scale to a very large size
Used for many Google projects
Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance
Efficient scans over all or interesting subsets of data
Efficient joins of large one-to-one and one-to-many datasets
Bigtable
Used for a variety of demanding workloads
Throughput-oriented batch processing
Latency-sensitive data serving
Data is indexed using row and column names
Treats data as uninterpreted strings
Clients can control data locality
Dynamic control to serve data out of memory or from disk
Building Blocks
Google File System (GFS)
Large scale distributed file system
Maintains multiple replicas
Consists of a Master and Chunk Servers
Chunk Server
Stores the data files
Each data file is broken into fixed-size chunks
Each chunk is replicated at least three times
Master
Stores the metadata associated with the chunks
Building Blocks
Chubby lock service
Has five active replicas
Provides a namespace that consists of directories and files
Each file can be used as a lock
Each Chubby client maintains a session with the Chubby service
When the session expires, it loses any locks and open handles
Building Blocks
SSTable
Immutable file format used internally to store data files
Sorted key-value pairs of arbitrary byte strings
Contains a sequence of blocks
Block index is used to locate blocks
Index is loaded into memory when the SSTable is opened
Lookup can be performed with a single disk access
[Diagram: an SSTable is a sequence of 64K blocks plus a block index]
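To make the lookup path concrete, here is a minimal Java sketch of a block index, assuming a simplified in-memory TreeMap from each block's first key to its file offset; the class and method names are illustrative, not Bigtable's actual code.

import java.util.Map;
import java.util.TreeMap;

class SSTableIndexSketch {
    // Maps the first key of each 64K block to that block's file offset.
    private final TreeMap<String, Long> blockIndex = new TreeMap<>();

    void addBlock(String firstKeyInBlock, long fileOffset) {
        blockIndex.put(firstKeyInBlock, fileOffset);
    }

    // A binary search over the in-memory index finds the one block that
    // could hold the key; reading that block is the single disk access.
    Long locateBlock(String key) {
        Map.Entry<String, Long> e = blockIndex.floorEntry(key);
        return (e == null) ? null : e.getValue();
    }
}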
Basic Data Model
A table is a sparse, distributed, persistent multidimensional sorted map
Data is organized into three dimensions
(row: string, column: string, time: int64) → string
Each cell is referenced by a row key, column key, and timestamp
Basic Data Model
(row, column, timestamp) → cell contents
Example: webtable
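As a toy illustration of this map, the Java sketch below nests sorted maps for the three dimensions. It is purely in-memory and the names are illustrative; the real system persists cells in SSTables.

import java.util.Comparator;
import java.util.TreeMap;

class ToyBigtable {
    // row -> column -> timestamp (newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> cells =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        cells.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    String get(String row, String column, long timestamp) {
        TreeMap<String, TreeMap<Long, String>> columns = cells.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        return (versions == null) ? null : versions.get(timestamp);
    }
}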
Data Model: Row
Name is an arbitrary string.
Access to data in a row is atomic.
Row creation is implicit upon storing data.
Transactions within a row
Rows ordered lexicographically by row key
Rows close together lexicographically are usually on one or a small number of machines.
Rows are grouped together to form the unit of load balancing
Data Model: Column
Columns have a two-level name structure:
family:qualifier
Example:
“anchor:cnnsi.com”
Column keys are grouped into sets called column families
Unit of access control
All data stored in a column family is usually of the same type
Additional level of indexing, if desired
Main idea: Limited families, Unbounded columns
Data Model: Timestamp
Used to store different versions of data in a cell
New writes default to current time
Can also be set explicitly by clients
Lookup examples
“Return most recent K values”
“Return all values in a timestamp range”
Garbage-collection settings can be applied per column family
“Only retain most recent K values in a cell”
“Keep values until they are older than K seconds”
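Assuming versions of a cell are held newest-first in a sorted map (as in the toy sketch earlier), these lookup and garbage-collection rules might look like the following; the names are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

class VersionRules {
    // "Return most recent K values": the reverse-ordered map iterates newest first.
    static List<String> mostRecentK(NavigableMap<Long, String> versions, int k) {
        List<String> out = new ArrayList<>();
        for (String v : versions.values()) {
            if (out.size() == k) break;
            out.add(v);
        }
        return out;
    }

    // "Drop values older than cutoff": with the reverse-ordered map,
    // tailMap(cutoff, false) is the view of all strictly older entries.
    static void dropOlderThan(NavigableMap<Long, String> versions, long cutoff) {
        versions.tailMap(cutoff, false).clear();
    }
}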
Tablets
Rows with consecutive keys are grouped into tablets
Unit of load balancing
Reads of short row ranges are efficient and require communication with a small number of machines
Clients can use this property to get good locality by selecting row keys carefully
Tablets (cont.)
Contains some range of rows; essentially a set of SSTables
[Diagram: a tablet is backed by a set of SSTables, each a sequence of 64K blocks plus an index]
Implementation
Three major components
Library linked into every client
Single master server
Assigning tablets to tablet servers
Detecting addition and expiration of tablet servers
Balancing tablet-server load
Garbage collection of files in GFS
Many tablet servers
Each manages a set of tablets
Handles read and write requests to its tablets
Splits tablets that have grown too large
Implementation (cont.)
Clients communicate directly with tablet servers for reads/writes
Each table consists of a set of tablets
Initially, each table has just one tablet
Tablets are automatically split as the table grows
Row size can be arbitrary (hundreds of GB)
Locating Tablets
How do clients find the right machine?
Need to find the tablet whose row range covers the target row
Three-level hierarchy
Level 1: Chubby file containing location of the root tablet
Level 2: Root tablet contains the locations of METADATA tablets
Level 3: Each METADATA tablet contains the locations of user tablets
The location of a tablet is stored under a row key that encodes the table identifier and its end row
Locating Tablets (cont.)
[Diagram: the three-level lookup path, from the Chubby file to the root tablet to METADATA tablets to user tablets]
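A hedged sketch of the client-side lookup follows. LocationSource and the key-encoding scheme are hypothetical stand-ins, and the real client library caches tablet locations so most reads skip these round trips.

interface LocationSource {
    String readRootLocation();                        // level 1: the Chubby file
    String readLocationRow(String server, String metadataRowKey);
}

class TabletLocator {
    private final LocationSource src;
    TabletLocator(LocationSource src) { this.src = src; }

    // A METADATA row key encodes the table identifier and the end row.
    private static String metadataKey(String tableId, String row) {
        return tableId + ";" + row;
    }

    String locate(String tableId, String row) {
        String rootServer = src.readRootLocation();                         // level 1
        String metaServer = src.readLocationRow(rootServer,
                metadataKey("METADATA", row));                              // level 2
        return src.readLocationRow(metaServer, metadataKey(tableId, row));  // level 3
    }
}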
Assigning Tablets
Each tablet is assigned to one tablet server at a time.
Master server keeps track of
Set of live tablet servers
Current assignments of tablets to servers.
Unassigned tablets.
When a tablet is unassigned, the master assigns the tablet to a tablet server with sufficient space.
Assigning Tablets
Tablet server startup
It creates and acquires an exclusive lock on a uniquely named file in Chubby
Master monitors this directory to discover tablet servers.
Tablet server stops serving tablets
If it loses its exclusive lock.
Tries to reacquire the lock on its file as long as the file still exists.
If the file no longer exists, the tablet server will never be able to serve again
Assigning Tablets
Master server startup
Grabs a unique master lock in Chubby.
Scans the tablet server directory in Chubby.
Communicates with every live tablet server
Scans METADATA table to learn set of tablets.
Master is responsible for detecting when a tablet server is no longer serving its tablets and reassigning those tablets as soon as possible.
Periodically asks each tablet server for the status of its lock
If no reply, master tries to acquire the lock itself
If it acquires the lock, the tablet server is either dead or having network trouble
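That liveness check might be sketched as follows; the ChubbyLock interface is a hypothetical stand-in for the real service, and the file deletion matches the paper's behavior of making a dead server's lock file unavailable.

interface ChubbyLock {
    boolean serverReportsLockHeld(String serverFile); // periodic status request
    boolean tryAcquire(String serverFile);
    void deleteFile(String serverFile);
}

class MasterLivenessCheck {
    private final ChubbyLock chubby;
    MasterLivenessCheck(ChubbyLock chubby) { this.chubby = chubby; }

    // Returns true when the server's tablets should be reassigned.
    boolean tabletServerIsGone(String serverFile) {
        if (chubby.serverReportsLockHeld(serverFile)) return false; // healthy
        if (chubby.tryAcquire(serverFile)) {
            // Chubby is reachable but the server is not: it is dead or
            // partitioned. Delete its file so it can never serve again.
            chubby.deleteFile(serverFile);
            return true;
        }
        return false; // could not reach Chubby either; retry later
    }
}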
Tablet Serving
Updates are committed to a commit log that stores the redo records
Recently committed updates are stored in memory in a sorted buffer called the memtable
Memtable maintains the updates on a row-by-row basis
Older updates are stored in a sequence of immutable SSTables.
To recover a tablet
Tablet server reads data from METADATA table.
Metadata contains the list of SSTables and a set of redo points
Server reads the indices of the SSTables into memory
Reconstructs the memtable by applying all of the updates since the redo points.
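A minimal sketch of the memtable-reconstruction step, assuming a simple LogRecord type; the real log format, group commit, and SSTable handling are richer.

import java.util.Iterator;
import java.util.TreeMap;

class LogRecord {
    final String rowKey, value;
    LogRecord(String rowKey, String value) { this.rowKey = rowKey; this.value = value; }
}

class Recovery {
    // Replay only records written after the redo points; everything earlier
    // is already persisted in the SSTables listed in METADATA.
    static TreeMap<String, String> rebuildMemtable(Iterator<LogRecord> logSinceRedo) {
        TreeMap<String, String> memtable = new TreeMap<>();
        while (logSinceRedo.hasNext()) {
            LogRecord r = logSinceRedo.next();
            memtable.put(r.rowKey, r.value);
        }
        return memtable;
    }
}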
Tablet Serving
Write operation
Server checks that the request is well-formed
Checks if the sender is authorized
Write to commit log
After the commit, its contents are inserted into the memtable
Read operation
Similar check for well-formedness and authorization
Executed on a merged view of the sequence of SSTables and the memtable
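Putting the two paths together, here is a simplified sketch: the commit log is stood in by a list, each SSTable by a lookup function, and the well-formedness and authorization checks are omitted.

import java.util.List;
import java.util.TreeMap;
import java.util.function.Function;

class TabletPaths {
    private final List<String> commitLog;                  // stand-in redo log
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final List<Function<String, String>> sstables; // newest first

    TabletPaths(List<String> commitLog, List<Function<String, String>> sstables) {
        this.commitLog = commitLog;
        this.sstables = sstables;
    }

    // Write path: append the redo record first, then apply to the memtable.
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
    }

    // Read path: merged view, memtable first, then newer-to-older SSTables.
    String read(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (Function<String, String> sstable : sstables) {
            v = sstable.apply(key);
            if (v != null) return v;
        }
        return null;
    }
}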
Compaction: Minor
As write operations execute, the size of the memtable increases
When the memtable reaches a threshold
The frozen memtable is converted to an SSTable
The SSTable is written to the file system (GFS)
Goals
Reduce memory usage of the tablet server
Reduce the amount of data to read from the commit log during recovery
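A sketch of that trigger, with an illustrative 64 MB threshold and a hypothetical SSTableWriter interface; all names here are assumptions for illustration.

import java.util.TreeMap;

interface SSTableWriter { void writeSorted(TreeMap<String, String> entries); }

class MemtableFlusher {
    static final long THRESHOLD_BYTES = 64L << 20; // illustrative threshold

    private TreeMap<String, String> memtable = new TreeMap<>();
    private long memtableBytes = 0;

    void apply(String key, String value, SSTableWriter writer) {
        memtable.put(key, value);
        memtableBytes += key.length() + value.length();
        if (memtableBytes >= THRESHOLD_BYTES) {
            TreeMap<String, String> frozen = memtable; // freeze the full buffer
            memtable = new TreeMap<>();                // new writes go to a fresh one
            memtableBytes = 0;
            writer.writeSorted(frozen);                // persist as a new SSTable
            // The commit log can now be trimmed up to the new redo point.
        }
    }
}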
Compaction: Merging and Major
Problem: too many SSTables
Read operations might need to merge updates from a number of SSTables
Merging compaction
Reads the contents of a few SSTables and the memtable
Writes out a new SSTable
A merging compaction that rewrites all SSTables into exactly one SSTable is a major compaction
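Conceptually, a merging compaction is a k-way merge of sorted inputs. The sketch below cheats by using putAll over in-memory maps (newer inputs come last, so they win on duplicate keys); a real implementation streams the inputs through iterators.

import java.util.List;
import java.util.TreeMap;

class MergingCompaction {
    // Inputs are ordered oldest to newest; a major compaction is the same
    // merge applied to *all* SSTables plus the memtable.
    static TreeMap<String, String> merge(List<TreeMap<String, String>> inputs) {
        TreeMap<String, String> out = new TreeMap<>();
        for (TreeMap<String, String> in : inputs) out.putAll(in);
        return out;
    }
}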
Locality Groups
Each column family is assigned to a client-defined locality group
A separate SSTable is created for each locality group during compaction
Increases read efficiency, as columns that are grouped together are usually accessed together
Used to organize underlying storage representation for performance
Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
Data in locality group can be explicitly memory mapped
Refinements
Compression
Clients can control SSTable compression for a locality group
Caching
Scan Cache: a higher-level cache that caches key-value pairs returned by the SSTable interface
Block Cache: a lower-level cache that caches SSTable blocks read from the file system
Bloom Filters
Allows asking whether an SSTable might contain any data for a given row/column pair
Reduces disk accesses while reading SSTables
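A minimal Bloom filter sketch for this row/column membership test; the two hash probes and the sizing are deliberately simplistic compared to a production filter.

import java.util.BitSet;

class BloomFilter {
    private final BitSet bits;
    private final int size;

    BloomFilter(int size) { this.size = size; this.bits = new BitSet(size); }

    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, size); }

    void add(String rowAndColumn) {
        bits.set(h1(rowAndColumn));
        bits.set(h2(rowAndColumn));
    }

    // false => the SSTable definitely has no data for this row/column;
    // true  => it might, so the block must actually be read from disk.
    boolean mightContain(String rowAndColumn) {
        return bits.get(h1(rowAndColumn)) && bits.get(h2(rowAndColumn));
    }
}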
Example: Cassandra
Initially developed by Facebook for inbox search
Built on BigTable data model
Provides a structured key-value store
Keys map to multiple values, which are grouped into column families
[Slide shows logos of companies using Cassandra]
Cassandra
A table in Cassandra is a distributed multidimensional map indexed by a key
The row key in a table is a string with no size restrictions
Usually a four-dimensional map:
Keyspace -> Column Family
Column Family -> Column Family Row
Column Family Row -> Columns
Column -> Data value
Cassandra: Column
Column {
  name: "emailAddress",
  value: "[email protected]",
  timestamp: 123456789
}
Cassandra: SuperColumn
SuperColumn {
  name: "homeAddress",
  value: {
    street: {name: "street", value: "1234 x street", timestamp: 123456789},
    city: {name: "city", value: "san francisco", timestamp: 123456789},
    zip: {name: "zip", value: "94107", timestamp: 123456789}
  }
}
Cassandra: ColumnFamily
Column Family
UserProfile = {
  ahossain: {
    username: "ahossain",
    email: "[email protected]",
    phone: "(210) 123-4567"
  },
  jdoe: {
    username: "jdoe",
    email: "[email protected]",
    phone: "(210) 765-4321",
    age: "66",
    gender: "male"
  }
}
Example: Pelops (Write)
// Connection settings (names are from the example, not required values)
String pool = "pool";
String keyspace = "mykeyspace";
String colFamily = "users";
String rowKey = "abc123";

// Register a connection pool for the cluster and keyspace
Cluster cluster = new Cluster("localhost", 9160);
Pelops.addPool(pool, cluster, keyspace);

// Batch two column writes into one row, then execute them
Mutator mutator = Pelops.createMutator(pool);
mutator.writeColumns(
    colFamily, rowKey,
    mutator.newColumnList(
        mutator.newColumn("name", "Dan"),
        mutator.newColumn("age", Bytes.fromInt(33))));
mutator.execute(ConsistencyLevel.ONE);
Example: Pelops (Read)
// Read the row back and extract typed values from the returned columns
Selector selector = Pelops.createSelector(pool);
List<Column> columns = selector.getColumnsFromRow(
    colFamily, rowKey, false, ConsistencyLevel.ONE);
System.out.println("Name: " +
    Selector.getColumnStringValue(columns, "name"));
System.out.println("Age: " +
    Selector.getColumnValue(columns, "age").toInt());
Thank you
Questions?