Mass Data Processing Technology on Large Scale Clusters
For the class of Advanced Computer Architecture
All course material (slides, labs, etc.) is licensed under the Creative
Commons Attribution 2.5 License.
Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their
original version
Some slides are from the Internet.
1
Introduction to Distributed Systems
Google File System
MapReduce System
BigTable
2
Four Papers
Luiz Barroso, Jeffrey Dean, and Urs Hoelzle, Web Search for a Planet:
The Google Cluster Architecture, IEEE Micro, 2003
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google
File System, 19th ACM Symposium on Operating Systems Principles,
Lake George, NY, October, 2003.
Jeffrey Dean and Sanjay Ghemawat,
MapReduce: Simplified Data Processing on Large Clusters, OSDI'04:
Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah
A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert
E. Gruber, Bigtable: A Distributed Storage System for Structured Data,
OSDI'06: Seventh Symposium on Operating System Design and
Implementation, Seattle, WA, November, 2006.
3
4
Computer Speedup
Why does the speedup slow down here? Then, how do we improve performance?
Moore’s Law: “The density of transistors on a chip doubles every 18 months, for
the same cost” (1965)
Image: Tom’s Hardware
5
Scope of Problems
What can you do with 1 computer?
What can you do with 100 computers?
What can you do with an entire data center?
6
Distributed Problems
Rendering multiple
frames of high-quality
animation
Image: DreamWorks Animation
7
Distributed Problems
Simulating several hundred or thousand
characters
Happy Feet © Kingdom Feature Productions; Lord of the Rings © New Line Cinema
8
Distributed Problems
Indexing the web (Google)
Simulating an Internet-sized network for
networking experiments (PlanetLab)
Speeding up content delivery (Akamai)
What is the key attribute that all these examples
have in common?
9
PlanetLab
PlanetLab is a global research network
that supports the development of new
network services.
PlanetLab currently consists of 809 nodes at 401 sites.
10
CDN - Akamai
11
Parallel vs. Distributed
Parallel computing can mean:
Vector processing of data (SIMD)
Multiple CPUs in a single computer
(MIMD)
Distributed computing is multiple
CPUs across many computers (MIMD)
12
A Brief History… 1975-85
Parallel computing was
favored in the early
years
Primarily vector-based
at first
Gradually more thread-based parallelism was introduced
Cray 2 supercomputer (Wikipedia)
13
A Brief History… 1985-95
“Massively parallel architectures”
start rising in prominence
Message Passing Interface (MPI)
and other libraries developed
Bandwidth was a big problem
14
A Brief History… 1995-Today
Cluster/grid architecture increasingly
dominant
Special node machines eschewed in
favor of COTS technologies
Web-wide cluster software
Companies like Google take this to the
extreme (10,000 node clusters)
15
Top 500, Architecture
16
Top 500 Trends
17
Top 500 Trends
18
Distributed System Concepts
Multi-Thread Program
Synchronization
Semaphores, Conditional Variables, Barriers
Network Concepts
TCP/IP, Sockets, Ports
RPC, Remote Invocation, RMI
Synchronous, Asynchronous, Non-Blocking
Transaction Processing System
P2P, Grid
19
Semaphores
A semaphore is a flag that can be raised (set) or lowered (reset) in one step.
Semaphores were flags that railroad engineers would use when entering a shared track.
Only one side of the semaphore can ever be red! (Can both be green?)
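As a concrete illustration, a counting semaphore in Python can play the role of the railroad flag guarding the shared track (a minimal sketch, not from the original slides):

    import threading

    track = threading.Semaphore(1)   # binary semaphore: one train on the shared track at a time

    def train(name):
        track.acquire()              # wait until the track is free, then claim it
        try:
            print(name, "is on the shared track")
        finally:
            track.release()          # let the next train enter

    threads = [threading.Thread(target=train, args=("train-%d" % i,)) for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()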
20
Barriers
A barrier knows in advance how many threads it should wait for.
Threads “register” with the barrier when they reach it, and fall asleep.
The barrier wakes up all registered threads when the total count is correct.
Pitfall: What happens if a thread takes a long time?
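A minimal sketch of the same idea with Python's threading.Barrier (illustrative, not from the original slides): the barrier is created for a known number of threads, and wait() puts each arriving thread to sleep until all of them have arrived.

    import threading

    N = 4
    barrier = threading.Barrier(N)   # knows in advance it waits for N threads

    def worker(i):
        print("worker", i, "reached the barrier")
        barrier.wait()               # "register" and sleep until all N have arrived
        print("worker", i, "released")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Note the pitfall from the slide: one slow worker delays every thread waiting at the barrier.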
21
Synchronous RPC
(Diagram: the client calls s = RPC(server_name, “foo.dll”, get_hello, arg, arg, arg…) and blocks. On the server, the RPC dispatcher invokes foo.dll’s String get_hello(a, b, c) { … return “some hello str!”; }. The result flows back, and only then does the client continue with print(s); …)
22
Asynchronous RPC
(Diagram: the client calls h = Spawn(server_name, “foo.dll”, long_runner, x, y…) and immediately keeps running more code. On the server, the RPC dispatcher runs foo.dll’s String long_runner(x, y) { … return new GiantObject(); }. The client blocks only when it finally needs the result: GiantObject myObj = Sync(h);)
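The Spawn/Sync pattern above can be approximated with futures; a minimal Python sketch (long_runner mirrors the slide's routine, but here the work runs in a local thread pool rather than over a real RPC transport):

    from concurrent.futures import ThreadPoolExecutor

    def long_runner(x, y):
        # stands in for the remote routine in foo.dll; in a real RPC system
        # this body would execute on the server, not in a local thread
        return {"sum": x + y}

    with ThreadPoolExecutor() as executor:
        h = executor.submit(long_runner, 2, 3)   # like Spawn: returns a handle immediately
        # ... the client keeps running other code here ...
        my_obj = h.result()                      # like Sync(h): block only when the value is needed
        print(my_obj)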
23
Asynchronous RPC 2: Callbacks
(Diagram: the client calls h = Spawn(server_name, “foo.dll”, callback, long_runner, x, y…) and keeps running. The server’s RPC dispatcher runs foo.dll’s String long_runner(x, y) { … return new Result(); }. When the result arrives, the client spawns a thread that runs void callback(o) { uses Result }, so the main client code never blocks.)
24
25
Early Google System
26
Spring 2000 Design
27
Late 2000 Design
28
Spring 2001 Design
29
Empty Google Cluster
30
Three Days Later…
31
A Picture is Worth…
32
The Google Infrastructure
>200,000 commodity Linux servers;
Storage capacity >5 petabytes;
Indexed >8 billion web pages;
Capital and operating costs at a fraction of those of large-scale
commercial servers;
Traffic growth of 20-30% per month.
33
Dimensions of a Google Cluster
• 359 racks
• 31,654 machines
• 63,184 CPUs
• 126,368 GHz of processing power
• 63,184 GB of RAM
• 2,527 TB of hard drive space
• Approx. 40 million searches/day
34
Architecture for Reliability
Replication (3x +) for redundancy;
Replication for proximity and response;
Fault tolerant software for cheap hardware.
Policy: Reliability through software architecture,
not hardware.
35
Query Serving Infrastructure
Processing a query may engage 1000+ servers;
Index Servers manage distributed files;
Document Servers access distributed data;
Response time < 0.25 seconds anywhere.
36
Systems Engineering Principles
Overwhelm problems with computational power;
Impose standard file management;
Manage through standard job scheduling;
Apply simplified data processing discipline.
37
Scalable Engineering Infrastructure
Goal: Create very large scale, high performance
computing infrastructure
Hardware + software systems to make it easy to build
products
Focus on price/performance, and ease of use
Enables better products
Allows rapid experimentation on large data sets with very simple programs,
so algorithms can be innovated and evolved with real-world data
Scalable Serving capacity
Design to run on lots of cheap failure prone hardware
If a service gets a lot of traffic, you simply add servers
and bandwidth.
Every engineer creates software that, from the ground up, scales, monitors
itself, and recovers from failures
The net result is that every service and every reusable component embodies
these properties, and when something succeeds, it has room to fly.
At Google, GFS, MapReduce and Bigtable are the fundamental
building blocks, enabling:
indices containing more documents
updated more often
faster queries
faster product development cycles
…
38
Rethinking Development Practices
Build on your own API
Develop the APIs first
Build your own application using the APIs – you know it works!
Decide which of these you would expose to external developers
Sampling and Testing
Release early and iterate
Continuous User Feedback
Public Beta
Open to all – not to a limited set of users
Potentially years of beta – not a fixed timeline
39
40
File systems overview
NFS & AFS (Andrew File System)
GFS
Discussion
41
File Systems Overview
System that permanently stores data
Usually layered on top of a lower-level physical
storage medium
Divided into logical units called “files”
Addressable by a filename (“foo.txt”)
Usually supports hierarchical nesting (directories)
A file path joins file & directory names into a relative
or absolute address to identify a file
(“/home/aaron/foo.txt”)
42
What Gets Stored
User data itself is the bulk of the file system's
contents
Also includes meta-data on a drive-wide and per-file
basis:
Drive-wide: available space, formatting info, character set, ...
Per-file: name, owner, modification date, physical layout, ...
43
High-Level Organization
Files are organized in a “tree” structure made of
nested directories
One directory acts as the “root”
“links” (symlinks, shortcuts, etc) provide simple
means of providing multiple access paths to one
file
Other file systems can be “mounted” and dropped
in as sub-hierarchies (other drives, network
shares)
44
Low-Level Organization (1/2)
File data and meta-data stored separately
File descriptors + meta-data stored in
inodes
Large tree or table at designated location
on disk
Tells how to look up file contents
Meta-data may be replicated to increase
system reliability
45
Low-Level Organization (2/2)
“Standard” read-write medium is a hard drive (other
media: CDROM, tape, ...)
Viewed as a sequential array of blocks
Must address ~1 KB chunk at a time
Tree structure is “flattened” into blocks
Overlapping reads/writes/deletes can cause
fragmentation: files are often not stored with a linear
layout
inodes store all block numbers related to file
46
Fragmentation
A
B
C
(free space)
A
B
C
A
(free space)
A
(free space)
C
A
(free space)
A
D
C
A
D
(free)
47
Design Considerations
Smaller block size reduces the amount of wasted space
Larger block size increases the speed of sequential reads (may not help random access)
Should the file system be faster or more reliable?
But faster at what: Large files? Small files? Lots of
reading? Frequent writers, occasional readers?
48
Distributed Filesystems
Support access to files on remote servers
Must support concurrency
Make varying guarantees about locking, who
“wins” with concurrent writes, etc...
Must gracefully handle dropped connections
Can offer support for replication and local
caching
Different implementations sit in different places
on complexity/feature scale
49
NFS
First developed in 1980s by Sun
Presented with standard UNIX FS interface
Network drives are mounted into local
directory hierarchy
50
NFS Protocol
Initially completely stateless
Operated over UDP; did not use TCP
streams
File locking, etc., implemented in higher-level protocols
Modern implementations use TCP/IP &
stateful protocols
51
Server-side Implementation
NFS defines a virtual file system
Does not actually manage local disk layout on server
Server instantiates NFS volume on top of local file
system
Local hard drives managed by concrete file systems
(EXT, ReiserFS, ...)
Other networked FS's mounted in by...?
(Diagrams: on the client, the user-visible filesystem is assembled from local EXT3 filesystems on Hard Drives 1 and 2 plus a mounted NFS client volume; on the server, the NFS server exports a server filesystem that is itself backed by concrete local filesystems such as EXT2 and ReiserFS on its own Hard Drives 1 and 2.)
52
NFS Locking
NFS v4 supports stateful locking of files
Clients inform server of intent to lock
Server can notify clients of outstanding
lock requests
Locking is lease-based: clients must
continually renew locks before a timeout
Loss of contact with server abandons
locks
53
NFS Client Caching
NFS Clients are allowed to cache copies of remote
files for subsequent accesses
Supports close-to-open cache consistency
When client A closes a file, its contents are
synchronized with the master, and timestamp is
changed
When client B opens the file, it checks that local
timestamp agrees with server timestamp. If not, it
discards local copy.
Concurrent reader/writers must use flags to disable
caching
54
NFS: Tradeoffs
NFS Volume managed by single server
Higher load on central server
Simplifies coherency protocols
Full POSIX system means it “drops in”
very easily, but isn’t “great” for any
specific need
55
AFS (The Andrew File System)
Developed at Carnegie Mellon
Strong security, high scalability
Supports 50,000+ clients at enterprise
level
56
AFS Overview
(Diagram: the global /afs namespace contains per-site cells such as pku.edu.cn, tsinghua.edu.cn, and washington.edu; each cell has its own subtrees, e.g. students, teachers, admin, dcst, ai, hpc, projects, games.)
57
Security in AFS
Uses Kerberos authentication
Supports richer set of access control bits
than UNIX
Separate “administer”, “delete” bits
Allows application-specific bits
58
Local Caching
File reads/writes operate on locally cached
copy
Local copy sent back to master when file is
closed
Open local copies are notified of external
updates through callbacks
59
Local Caching - Tradeoffs
Shared database files do not work well on
this system
Does not support write-through to shared
medium
60
Replication
AFS allows read-only copies of filesystem volumes
Copies are guaranteed to be atomic checkpoints
of entire FS at time of read-only copy generation
Modifying data requires access to the sole r/w
volume
Changes do not propagate to read-only copies
61
AFS Conclusions
Not quite POSIX
Stronger security/permissions
No file write-through
High availability through replicas, local
caching
Not appropriate for all file types
62
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung,
SOSP 2003
(These slides by Alex Moshchuk)
63
Motivation
Google needed a good distributed file system
Redundant storage of massive amounts of data on
cheap and unreliable computers
Why not use an existing file system?
Google’s problems are different from anyone else’s
Different workload and design priorities
GFS is designed for Google apps and workloads
Google apps are designed for GFS
64
Assumptions
High component failure rates
Inexpensive commodity components fail all the
time
“Modest” number of HUGE files
Just a few million
Each is 100MB or larger; multi-GB files typical
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads
High sustained throughput favored over low latency
65
GFS Design Decisions
Files stored as chunks
Fixed size (64MB)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
Simple centralized management
No data caching
Little benefit due to large data sets, streaming reads
Familiar interface, but customize the API
Simplify the problem; focus on Google apps
Add snapshot and record append operations
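As a rough sketch of how the fixed chunk size keeps the master on the metadata path only: a client computes which chunk an offset falls in and then asks the master just for that chunk's locations (illustrative Python, not the actual GFS client library):

    CHUNK_SIZE = 64 * 1024 * 1024     # fixed 64 MB chunks, as in the GFS design

    def chunk_index(byte_offset):
        # which chunk of the file holds this byte offset?
        return byte_offset // CHUNK_SIZE

    def chunk_offset(byte_offset):
        # offset of the byte within that chunk
        return byte_offset % CHUNK_SIZE

    # A client reading byte 200,000,000 of a file would ask the master only for
    # the locations of chunk 2, then read the data directly from a chunkserver.
    print(chunk_index(200_000_000), chunk_offset(200_000_000))   # 2 65782272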
66
GFS Architecture
Single master
Multiple chunkservers
…Can anyone see a potential weakness in this design?
67
Single master
From distributed systems we know this is a:
Single point of failure
Scalability bottleneck
GFS solutions:
Shadow masters
Minimize master involvement
never move data through it, use only for metadata
and cache metadata at clients
large chunk size
master delegates authority to primary replicas in data mutations
(chunk leases)
Simple, and good enough!
68
Metadata (1/2)
Global metadata is stored on the master
File and chunk namespaces
Mapping from files to chunks
Locations of each chunk’s replicas
All in memory (64 bytes / chunk)
Fast
Easily accessible
69
Metadata (2/2)
Master has an operation log for persistent
logging of critical metadata updates
persistent on local disk
replicated
checkpoints for faster recovery
70
Mutations
Mutation = write or append
must be done for all replicas
Goal: minimize master involvement
Lease mechanism:
master picks one replica as
primary; gives it a “lease”
for mutations
primary defines a serial
order of mutations
all replicas follow this order
Data flow decoupled from
control flow
71
Atomic record append
Client specifies data
GFS appends it to the file atomically at least once
GFS picks the offset
works for concurrent writers
Used heavily by Google apps
e.g., for files that serve as multiple-producer/single-
consumer queues
72
Relaxed consistency model (1/2)
“Consistent” = all replicas have the same value
“Defined” = replica reflects the mutation,
consistent
Some properties:
concurrent writes leave region consistent, but possibly
undefined
failed writes leave the region inconsistent
Some work has moved into the applications:
e.g., self-validating, self-identifying records
73
Relaxed consistency model (2/2)
Simple, efficient
Google apps can live with it
what about other apps?
Namespace updates atomic and serializable
74
Master’s responsibilities (1/2)
Metadata storage
Namespace management/locking
Periodic communication with chunkservers
give instructions, collect state, track cluster health
Chunk creation, re-replication, rebalancing
balance space utilization and access speed
spread replicas across racks to reduce correlated failures
re-replicate data if redundancy falls below threshold
rebalance data to smooth out storage and request load
75
Master’s responsibilities (2/2)
Garbage Collection
simpler, more reliable than traditional file delete
master logs the deletion, renames the file to a hidden
name
lazily garbage collects hidden files
Stale replica deletion
detect “stale” replicas using chunk version numbers
76
Fault Tolerance
High availability
fast recovery
master and chunkservers restartable in a few seconds
chunk replication
default: 3 replicas.
shadow masters
Data integrity
checksum every 64KB block in each chunk
77
Performance
78
Deployment in Google
50+ GFS clusters
Each with thousands of storage nodes
Managing petabytes of data
GFS is under BigTable, etc.
79
Conclusion
GFS demonstrates how to support large-scale
processing workloads on commodity hardware
design to tolerate frequent component failures
optimize for huge files that are mostly appended and
read
feel free to relax and extend FS interface as required
go for simple solutions (e.g., single master)
GFS has met Google’s storage needs… it must be
good!
80
Discussion
How many sys-admins does it take to run a system
like this?
much of management is built in
Google currently has ~450,000 machines
GFS: only 1/3 of these are “effective”
that’s a lot of extra equipment, extra cost, extra power,
extra space!
GFS has achieved availability/performance at a very low cost,
but can you do it for even less?
Is GFS useful as a general-purpose commercial
product?
small write performance not good enough?
relaxed consistency model
81
82
Motivation
200+ processors
200+ terabyte database
10^10 total clock cycles
0.1 second response time
5¢ average advertising revenue
From: www.cs.cmu.edu/~bryant/presentations/DISC-FCRC07.ppt
83
Motivation: Large Scale Data
Processing
Want to process lots of data ( > 1 TB)
Want to parallelize across
hundreds/thousands of CPUs
… Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the
raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/howmuch-data-does-google-store.html
84
MapReduce
Automatic parallelization &
distribution
Fault-tolerant
Provides status and monitoring tools
Clean abstraction for programmers
85
Programming Model
Borrows from functional programming
Users implement interface of two functions:
map (in_key, in_value) -> (out_key, intermediate_value) list
reduce (out_key, intermediate_value list) -> out_value list
86
map
Records from the data source (lines out of
files, rows of a database, etc) are fed into the
map function as key*value pairs: e.g.,
(filename, line).
map() produces one or more intermediate
values along with an output key from the
input.
87
reduce
After the map phase is over, all the
intermediate values for a given output key
are combined together into a list
reduce() combines those intermediate
values into one or more final values for that
same output key
(in practice, usually only one final value per
key)
88
Architecture
(Diagram of the data flow: Data store 1 … Data store n feed input key*value pairs into parallel map tasks, each of which emits intermediate (key 1, values…), (key 2, values…), (key 3, values…) pairs.
== Barrier == : aggregates intermediate values by output key.
Then (key 1, intermediate values) -> reduce -> final key 1 values; (key 2, intermediate values) -> reduce -> final key 2 values; (key 3, intermediate values) -> reduce -> final key 3 values.)
89
Parallelism
map() functions run in parallel, creating
different intermediate values from different
input data sets
reduce() functions also run in parallel, each
working on a different output key
All values are processed independently
Bottleneck: reduce phase can’t start until
map phase is completely finished.
90
Example: Count word
occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
91
Example vs. Actual Source Code
Example is written in pseudo-code
Actual implementation is in C++, using a
MapReduce library
Bindings for Python and Java exist via
interfaces
True code is somewhat more involved
(defines how the input key/values are
divided up and accessed, etc.)
92
Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
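For comparison with the pseudo-code above, a self-contained Python simulation of the same word count over these three pages (illustrative only; the real system runs distributed C++ map and reduce tasks, while this runs both phases in one process):

    from collections import defaultdict

    def map_fn(document_name, contents):
        # emit (word, 1) for every word in the document
        return [(w, 1) for w in contents.split()]

    def reduce_fn(word, counts):
        return sum(counts)

    pages = {
        "Page 1": "the weather is good",
        "Page 2": "today is good",
        "Page 3": "good weather is good",
    }

    # map phase
    intermediate = defaultdict(list)
    for name, text in pages.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)

    # grouping by key plays the role of the shuffle; then the reduce phase
    results = {word: reduce_fn(word, counts) for word, counts in intermediate.items()}
    print(results)   # {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}

The printed counts match the reduce output shown on the following slides.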
93
Map output
Worker 1:
(the 1), (weather 1), (is 1), (good 1).
Worker 2:
(today 1), (is 1), (good 1).
Worker 3:
(good 1), (weather 1), (is 1), (good 1).
94
Reduce Input
Worker 1:
(the 1)
Worker 2:
(is 1), (is 1), (is 1)
Worker 3:
(weather 1), (weather 1)
Worker 4:
(today 1)
Worker 5:
(good 1), (good 1), (good 1), (good 1)
95
Reduce Output
Worker 1:
(the 1)
Worker 2:
(is 3)
Worker 3:
(weather 2)
Worker 4:
(today 1)
Worker 5:
(good 4)
96
Some Other Real Examples
Term frequencies through the whole
Web repository
Count of URL access frequency
Reverse web-link graph
97
Implementation Overview
Typical cluster:
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP'03)
Job scheduling system: jobs made up of tasks, scheduler
assigns tasks to machines
Implementation is a C++ library linked into user
programs
98
Architecture
99
Execution
100
Parallel Execution
101
Task Granularity And Pipelining
Fine granularity tasks: many more map tasks than machines
Minimizes time for fault recovery
Can pipeline shuffling with map execution
Better dynamic load balancing
Often use 200,000 map/5000 reduce tasks w/ 2000 machines
102
Locality
Effect: Thousands of machines read
input at local disk speed
Master program divvies up tasks based on
location of data: (Asks GFS for locations of
replicas of input file blocks) tries to have
map() tasks on same machine as physical
file data, or at least same rack
map() task inputs are divided into 64 MB
blocks: same size as Google File System
chunks
Without this, rack switches limit read rate
103
Fault Tolerance
Master detects worker failures
Re-executes completed & in-progress
map() tasks
Re-executes in-progress reduce() tasks
Master notices particular input key/values
cause crashes in map(), and skips those
values on re-execution.
Effect: Can work around bugs in third-party libraries!
104
Fault Tolerance
On worker failure:
Detect failure via periodic heartbeats
Re-execute completed and in-progress map
tasks
Re-execute in progress reduce tasks
Task completion committed through master
Master failure:
Could handle, but don't yet (master failure
unlikely)
Robust: lost 1600 of 1800 machines
once, but finished fine
105
Optimizations
No reduce can start until map is complete:
A single slow disk controller can rate-limit the whole
process
Master redundantly executes “slow-moving” map tasks and uses the results of
whichever copy finishes first (the first one to finish “wins”)
Slow workers significantly lengthen completion time
Other jobs consuming resources on machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled (!!)
Why is it safe to redundantly execute map tasks? Wouldn’t this mess up the
total computation?
106
Optimizations
“Combiner” functions can run on same machine as a
mapper
Causes a mini-reduce phase to occur before the real
reduce phase, to save bandwidth
Under what conditions is it sound to use a combiner?
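For word count, the combiner can be the same summation as the reducer, applied to each mapper's local output before anything crosses the network. A small illustrative Python sketch (not the library's API):

    from collections import Counter

    def map_with_combiner(document):
        # local mini-reduce: ship ("good", 3) once instead of ("good", 1) three times
        return Counter(document.split())

    print(map_with_combiner("good weather is good good"))
    # Counter({'good': 3, 'weather': 1, 'is': 1})

This is sound here because addition is associative and commutative; in general a combiner is only safe when locally pre-aggregating values and then reducing the partial results gives the same answer as reducing all the values at once.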
107
Refinement
Sorting guarantees within each reduce
partition
Compression of intermediate data
Combiner: useful for saving network
bandwidth
Local execution for debugging/testing
User-defined counters
108
Performance
Tests run on cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyperthreading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records
matching a rare pattern (92K matching records)
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort
benchmark)
109
MR_Grep
Locality optimization helps:
1800 machines read 1 TB of data at peak of ~31 GB/s
Without this, rack switches would limit to 10 GB/s
Startup overhead is significant for short jobs
110
MR_Sort
Backup tasks reduce job completion time significantly
System deals well with failures
(Graphs compare three runs: Normal, No Backup Tasks, and 200 processes killed)
111
More and more MapReduce
MapReduce Programs In Google Source Tree
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
112
Real MapReduce : Rewrite of
Production Indexing System
Rewrote Google's production indexing system
using MapReduce
Set of 10, 14, 17, 21, 24 MapReduce operations
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more
machines
113
MapReduce Conclusions
MapReduce has proven to be a useful
abstraction
Greatly simplifies large-scale computations
at Google
Functional programming paradigm can be
applied to large-scale applications
Fun to use: focus on problem, let library
deal w/ messy details
114
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, Robert
E. Gruber
Google, Inc.
UWCS OS Seminar Discussion
Erik Paulson
2 October 2006
See also the (other)UW presentation by Jeff Dean in September of 2005
(See the link on the seminar page, or just google for “google bigtable”)
Motivation
Lots of (semi-)structured data at Google
URLs:
Contents, crawl metadata, links, anchors, pagerank, …
Per-user Data:
User preference settings, recent queries/search results, …
Geographic locations:
Physical entities (shops, restaurants, etc.), roads, satellite image
data, user annotations, …
Scale is large
Billions of URLs, many versions/page(~20K/version)
Hundreds of millions of users, thousands of q/sec
100TB+ of satellite image data
Why not just use commercial DB?
Scale is too large for most commercial databases
Even if it weren’t, cost would be very high
Building internally means system can be applied across
many projects for low incremental cost
Low-level storage optimizations help performance
significantly
Much harder to do when running on top of a database
layer
Also fun and challenging to build large-scale systems
Goals
Want asynchronous processes to be continuously
updating different pieces of data
Want access to most current data at any time
Need to support
Very high read/write rates (millions of ops per second)
Efficient scans over all or interesting subsets of data
Efficient joins of large one-to-one and one-to-many
datasets
Often want to examine data changes over time
E.g. Contents of a web page over multiple crawls
BigTable
Distributed multi-level map
With an interesting data model
Fault-tolerant, persistent
Scalable
Thousands of servers
Terabytes of in-memory data
Petabytes of disk-based data
Millions of reads/writes per second, efficient scans
Self-managing
Servers can be added/removed dynamically
Servers adjust to load imbalance
Status
Design/initial implementation started beginning of 2004
Currently ~100 BigTable cells (Winter,2005)
Production use or active development for many projects:
Google Print
My Search History
Orkut
Crawling/indexing pipeline
Google Maps/Google Earth
Blogger
…
Largest bigtable cell manages ~200TB of data spread over
several thousand machines (larger cells planned)
Background: Building Blocks
Building blocks:
Google File System (GFS): Raw storage
Scheduler: schedules jobs onto machines
Lock service: distributed lock manager
Can also reliably hold tiny files (100s of bytes) with high availability
MapReduce: simplified large-scale data processing
BigTable uses of building blocks
GFS: stores persistent state
Scheduler: schedules jobs involved in BigTable serving
Lock service: master election, location bootstrapping
MapReduce: often used to read/write BigTable data
Google File System (GFS)
(Diagram: a GFS master, with master replicas and misc. servers, serves metadata to clients; Chunkserver1, Chunkserver2, and Chunkserver3 hold the actual chunks C1-C4, each stored on several chunkservers.)
•Master manages metadata
•Data transfers happen directly between clients and chunkservers
•Files broken into chunks (typically 64MB)
•Chunks replicated across three machines for safety
MapReduce: Easy-to-use Cycles
Many Google problems: “Process lots of data to produce other data”
Many kinds of inputs
Document records, log files, sorted on-disk data structures, etc.
Want to easily use hundreds or thousands of CPUs
MapReduce: framework that provides (for certain classes of problems):
Automatic & efficient parallelization/distribution
Fault-tolerance, I/O scheduling, status/monitoring
Users write Map and Reduce functions
Heavily used: ~3000 jobs, 1000s of machine-days each day
BigTable can be input and/or output for MapReduce computations
Chubby
{lock/file/name} service
Coarse-grained locks, can store small amount of data
in a lock
5 replicas, need a majority vote to be active
Also an OSDI ’06 Paper
Typical Cluster
(Diagram: a typical cluster runs a cluster scheduling master, a lock service, and a GFS master alongside many worker machines. Machine 1 … Machine N each run Linux with a scheduler slave and a GFS chunkserver, plus a mix of tasks such as BigTable tablet servers, the BigTable master, and MapReduce job workers.)
BigTable Overview
Data Model
Implementation Structure
Tablets, compactions, locality groups, ……
API
Details
Shared logs, compression, replications
Current/Future Work
Basic Data Model
(Figure 1: Web Table — the row “com.cnn.www” holds page versions <html> in the “contents:” column at timestamps t3, t5, and t6, “CNN” in the “anchor:cnnsi.com” column at t9, and “CNN.com” in the “anchor:my.look.ca” column at t8.)
Bigtable is a distributed, multidimensional, sorted, sparse map.
The map is indexed by row key, column key and timestamp:
(row: string, column: string, time: int64) -> string (cell contents).
Rows are ordered lexicographically by row key.
The row range for a table is dynamically partitioned; each row range is called a “tablet”.
Columns: syntax is family:qualifier.
Cells can store multiple versions of data with timestamps.
Good match for most of Google’s applications
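A toy sketch of this data model in Python (illustrative only; the real system is distributed and persistent, while this just mimics the (row, column, timestamp) -> value indexing and newest-first reads):

    webtable = {}   # (row, column, timestamp) -> cell contents

    def put(row, column, value, timestamp):
        webtable[(row, column, timestamp)] = value

    def read(row, column, k=1):
        # return the k most recent versions of a cell, newest first
        versions = sorted(
            ((ts, v) for (r, c, ts), v in webtable.items() if r == row and c == column),
            reverse=True)
        return versions[:k]

    put("com.cnn.www", "contents:", "<html>...v1", 3)
    put("com.cnn.www", "contents:", "<html>...v2", 5)
    put("com.cnn.www", "anchor:cnnsi.com", "CNN", 9)
    print(read("com.cnn.www", "contents:"))        # [(5, '<html>...v2')]
    print(read("com.cnn.www", "contents:", k=2))   # [(5, '<html>...v2'), (3, '<html>...v1')]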
Rows
Name is an arbitrary string
Access to data in a row is atomic
Row creation is implicit upon storing data
Rows ordered lexicographically
Rows close together lexicographically usually on one or a
small number of machines
Tablets
Large tables broken into tablets at row boundaries
Tablet holds contiguous range of rows
Clients can often choose row keys to achieve locality
Aim for ~100MB to 200MB of data per tablet
Serving machine responsible for ~100 tablets
Fast recovery:
100 machines each pick up 1 tablet from failed machine
Fine-grained load balancing:
Migrate tablets away from overloaded machine
Master makes load-balancing decisions
Tablets & Splitting
(Diagram: a table with “language:” and “contents:” columns; rows keyed by URLs such as “aaa.com”, “cnn.com”, “cnn.com/sports.html”, …, “website.com”, “yahoo.com/kids.html”, “yahoo.com/kids/html/0”, “zuppa.com/menu.html” hold values like EN and <html>, and the row range is split at row boundaries into tablets.)
Tablet and Table
Tablets contain some range of rows.
(Figure: a tablet with start row “aardvark” and end row “apple” is backed by SSTables, each made of 64K blocks plus an index.)
Bigtable contains tables, tables contain sets of tablets, and each tablet contains a set of rows (100-200MB).
(Figure: two adjacent tablets, covering row ranges such as aardvark-apple and apple_two-boat, are each backed by their own SSTables of 64K blocks plus an index.)
System Structure
(Diagram of a Bigtable cell: the Bigtable client uses the Bigtable client library; metadata operations go to the Bigtable master, which performs metadata ops and load balancing, while reads/writes go directly to Bigtable tablet servers, which serve data. The cluster scheduling system handles failover and monitoring, GFS holds tablet data and logs, and the lock service holds metadata and handles master election; Open() is bootstrapped through the lock service.)
Locating Tablets
Since tablets move around from server to server,
given a row, how do clients find the right machine?
Need to find tablet whose row range covers the target
row
One approach: could use the BigTable master
Central server almost certainly would be bottleneck in
large system
Instead: store special tables containing tablet
location info in BigTable cell itself
Locating Tablets (cont.)
Google’s approach: 3-level hierarchical lookup scheme for tablets
Location is ip:port of relevant server
1st level: bootstrapped from lock service, points to owner of META0
2nd level: uses META0 data to find owner of the appropriate META1 tablet
3rd level: META1 table holds locations of tablets of all other tables
META1 table itself can be split into multiple tablets
Aggressive prefetching + caching
Most ops go right to proper machine
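A rough, self-contained Python sketch of the idea behind the hierarchical lookup (the tablets here are ordinary in-memory lists and the addresses are made up; the real client also bootstraps level 1 from Chubby and caches results aggressively):

    import bisect

    def tablet_lookup(tablet, row):
        # each "tablet" is a sorted list of (end_row, payload) entries;
        # return the payload of the first entry whose end_row >= row
        keys = [end for end, _ in tablet]
        return tablet[bisect.bisect_left(keys, row)][1]

    # level 3: user tablets, served at (made-up) tablet server addresses
    user_tablets = {"t1": "10.0.0.1:9000", "t2": "10.0.0.2:9000"}
    # level 2: a META1 tablet mapping user-table row ranges to tablet locations
    meta1 = [("cnn.com", user_tablets["t1"]), ("zzz", user_tablets["t2"])]
    # level 1: the root (META0) tablet, whose location would come from the lock service
    meta0 = [("zzz", meta1)]

    def locate(row):
        m1 = tablet_lookup(meta0, row)   # root tablet -> the right META1 tablet
        return tablet_lookup(m1, row)    # META1 row -> ip:port of the serving tablet server

    print(locate("aaa.com"))     # 10.0.0.1:9000
    print(locate("yahoo.com"))   # 10.0.0.2:9000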
Tablet Representation
(Diagram: writes go to an append-only log on GFS and to a random-access write buffer in memory; reads merge the in-memory buffer with the SSTables on GFS, which may be mmapped.)
SSTable: immutable on-disk ordered map from string to string
String keys: <row, column, timestamp> triples
SSTables
This is the underlying file format used to store Bigtable
data.
SSTables are immutable.
If new data is added, a new SSTable is created.
Old SSTable is set out for garbage collection.
(Figure: an SSTable is a sequence of 64K blocks plus an index.)
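A rough in-memory sketch of the SSTable idea — sorted, immutable, and searched through a block index (illustrative Python, not the real file format):

    import bisect

    class SSTable:
        """Immutable, sorted key->value map with a sparse per-block index."""
        def __init__(self, items, block_size=4):
            self.blocks = []          # each block is a sorted list of (key, value) pairs
            self.index = []           # first key of each block
            items = sorted(items)
            for i in range(0, len(items), block_size):
                block = items[i:i + block_size]
                self.blocks.append(block)
                self.index.append(block[0][0])

        def get(self, key):
            # find the block that could contain the key, then search inside it
            b = max(bisect.bisect_right(self.index, key) - 1, 0)
            for k, v in self.blocks[b]:
                if k == key:
                    return v
            return None

    sst = SSTable([("apple", 1), ("aardvark", 2), ("boat", 3), ("banana", 4), ("cat", 5)])
    print(sst.get("banana"))   # 4
    print(sst.get("zebra"))    # None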
Tablet Assignment
Each tablet server is given a set of tablets to serve client requests.
The master keeps track of tablet servers (via RPC) in order to assign tablets.
A Chubby directory is used: each tablet server acquires a lock on a file there.
If a tablet server terminates, it releases the lock on its file.
Status is sent to the master by the tablet servers.
How does the master come to know about tablets and tablet servers?
The master acquires a unique master lock in Chubby.
The master scans the server directory in Chubby to find live servers.
The master communicates with each tablet server to get the details.
It scans the METADATA table to find the unassigned tablets.
Tablet Assignment
(Diagram: the master acquires a unique lock in the Chubby directory; by scanning the directory it learns which tablet servers are alive, communicates with each tablet server to learn which tablets it is serving, and scans the METADATA table to find unassigned tablets.)
Tablet Serving
Reads and writes arrive at the tablet server, which checks that the request is well-formed and that the client is authorized (Chubby holds the permission file).
When a mutation occurs it is written to the commit log (the tablet log on GFS), using group commit; the update is then stored in the memtable in memory.
Read operations are served from a merged view of the memtable and the SSTable files on GFS.
The commit log also drives the recovery process.
(Figure: tablet representation, taken from the Bigtable paper — memtable in memory; tablet log and SSTable files in GFS.)
Compactions
Tablet state represented as set of immutable
compacted SSTable files, plus tail of log (buffered
in memory)
Minor compaction
When in-memory state fills up, pick tablet with most
data and write contents to SSTables stored in GFS
Separate file for each locality group for each tablet
Major compaction
Periodically compact all SSTables for tablet into new
base SSTable on GFS
Storage reclaimed from deletion at this point
Compaction
(Figure: Minor compaction — the memtable receiving write operations is frozen and converted to a new SSTable.)
(Figure: Merging compaction — the small new SSTables produced by minor compactions are merged into new large SSTables; a merging compaction that rewrites all SSTables is a major compaction.)
Columns
Columns have two-level name structure:
family:optional_qualifier
Column family
Unit of access control
Has associated type information
Qualifier gives unbounded columns
Additional level of indexing, if desired
Timestamps
Used to store different versions of data in a cell
New writes default to current time, but timestamps for
writes can also be set explicitly by clients
Lookup options:
“Return most recent K values”
“Return all values in timestamp range (or all values)”
Column families can be marked with attributes:
“Only retain most recent K values in a cell”
“Keep values until they are older than K seconds”
Locality Groups
Column families can be assigned to a locality group
Used to organize underlying storage representation for
performance
scans over one locality group are
O(bytes_in_locality_group), not O(bytes_in_table)
Data in a locality group can be explicitly memory-
mapped
(Figure: for the row “com.cnn.www”, the “contents:” column family is in one locality group holding the <html> page versions, while the “anchor:cnnsi.com” and “anchor:my.look.ca” families are in another locality group holding the “CNN” and “CNN.com” anchor values.)
Multiple column families are grouped into a locality group.
Efficient reads are done by separating column families.
Additional parameters can be specified.
API
Metadata operations
Create/delete tables, column families, change metadata
Writes (atomic)
Set(): write cells in a row
DeleteCells(): delete cells in a row
DeleteRow(): delete all cells in a row
Reads
Scanner: read arbitrary cells in a bigtable
Each row read is atomic
Can restrict returned rows to a particular range
Can ask for just data from 1 row, all rows, etc.
Can ask for all columns, just certain column families, or specific columns
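As an illustration of the API just listed, a toy in-memory Python sketch of atomic per-row mutations and a scanner (not the real Bigtable client library, which is C++; the names here only mirror the operations above):

    class Row:
        """Toy row mutation: buffers cell writes/deletes, then applies them atomically."""
        def __init__(self, table, key):
            self.table, self.key, self.ops = table, key, []
        def set(self, column, value):                 # Set(): write a cell in this row
            self.ops.append(("set", column, value))
        def delete_cells(self, column):               # DeleteCells(): delete a cell in this row
            self.ops.append(("del", column, None))
        def apply(self):                              # all mutations to one row commit together
            cells = self.table.rows.setdefault(self.key, {})
            for op, column, value in self.ops:
                if op == "set":
                    cells[column] = value
                else:
                    cells.pop(column, None)

    class Table:
        def __init__(self):
            self.rows = {}                            # row key -> {column: value}
        def row(self, key):
            return Row(self, key)
        def scan(self, column_prefix=""):             # Scanner: rows in lexicographic key order
            for key in sorted(self.rows):
                cells = {c: v for c, v in self.rows[key].items() if c.startswith(column_prefix)}
                if cells:
                    yield key, cells

    webtable = Table()
    r = webtable.row("com.cnn.www")
    r.set("anchor:cnnsi.com", "CNN")
    r.set("contents:", "<html>...")
    r.apply()
    for key, cells in webtable.scan(column_prefix="anchor:"):
        print(key, cells)                             # com.cnn.www {'anchor:cnnsi.com': 'CNN'}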
Shared Logs
Designed for 1M tablets, 1000s of tablet servers
1M logs being simultaneously written performs badly
Solution:shared logs
Write log file per tablet server instead of per tablet
Updates for many tablets co-mingled in same file
Start new log chunks every so often (64MB)
Problem: during recovery, a server needs to read log
data to apply mutations for a tablet
Lots of wasted I/O if lots of machines need to read data
for many tablets from same log chunk
Shared Log Recovery
Recovery:
Servers inform master of log chunks they need to read
Master aggregates and orchestrates sorting of needed
chunks
Assigns log chunks to be sorted to different tablet servers
Servers sort chunks by tablet, write sorted data to local disk
Other tablet servers ask master which servers have sorted
chunks they need
Tablet servers issue direct RPCs to peer tablet servers to
read sorted data for their tablets
Compression
Many opportunities for compression
Similar values in the same row/column at different
timestamps
Similar values in different columns
Similar values across adjacent rows
Within each SSTable for a locality group, encode
compressed blocks
Keep blocks small for random access (~64KB
compressed data)
Exploit fact that many values very similar
Needs to be low CPU cost for encoding/decoding
Two building blocks: BMDiff, Zippy
In Development/Future Plans
More expressive data manipulation/access
Allow sending small scripts to perform
read/modify/write transactions so that they execute on
server?
Multi-row (i.e. distributed) transaction support
General performance work for very large cells
BigTable as a service?
Interesting issues of resource fairness, performance
isolation, prioritization, etc. across different clients
Real Applications
154
Before we begin…
Intersection of databases and distributed systems
Will try to explain (or at least warn) when we hit a
patch of database
Remember this is a discussion!
Google Scale
Lots of data
Copies of the web, satellite data, user data, email and
USENET, Subversion backing store
Many incoming requests
No commercial system big enough
Couldn’t afford it if there was one
Might not have made appropriate design choices
Firm believers in the End-to-End argument
450,000 machines (NYTimes estimate, June 14th,
2006)
Building Blocks
Scheduler (Google WorkQueue)
Google Filesystem
Chubby Lock service
Two other pieces helpful but not required
Sawzall
MapReduce (despite what the Internet says)
BigTable: build a more application-friendly storage
service using these parts
Google File System
Large-scale distributed “filesystem”
Master: responsible for metadata
Chunk servers: responsible for reading and writing
large chunks of data
Chunks replicated on 3 machines, master responsible
for ensuring replicas exist
SOSP ’03 Paper
Chubby
{lock/file/name} service
Coarse-grained locks, can store small amount of data
in a lock
5 replicas, need a majority vote to be active
Also an OSDI ’06 Paper
Data model: a big map
•<Row, Column, Timestamp> triple for key - lookup, insert, and delete API
•Arbitrary “columns” on a row-by-row basis
•Column family:qualifier. Family is heavyweight, qualifier lightweight
•Column-oriented physical store- rows are sparse!
•Does not support a relational model
•No table-wide integrity constraints
•No multirow transactions
SSTable
Immutable, sorted file of key-value pairs
Chunks of data plus an index
Index is of block ranges, not values
(Figure: an SSTable made of 64K blocks plus an index.)
Tablet
Contains some range of rows of the table
Built out of multiple SSTables
(Figure: a tablet with start row “aardvark” and end row “apple”, built from two SSTables, each consisting of 64K blocks plus an index.)
Table
Multiple tablets make up the table
SSTables can be shared
Tablets do not overlap, SSTables can overlap
(Figure: two adjacent tablets — aardvark…apple and apple_two_E…boat — each built from SSTables; SSTables may be shared between tablets.)
Finding a tablet
Servers
Tablet servers manage tablets, multiple tablets per
server. Each tablet is 100-200 megs
Each tablet lives at only one server
Tablet server splits tablets that get too big
Master responsible for load balancing and fault
tolerance
Use Chubby to monitor health of tablet servers, restart
failed servers
GFS replicates data. Prefer to start tablet server on same
machine that the data is already at
Editing a table
Mutations are logged, then applied to an
in-memory version
Logfile stored in GFS
(Figure: a tablet’s memtable accumulates insert and delete mutations in memory, alongside the existing SSTables, e.g. for the apple_two_E…boat tablet.)
Compactions
Minor compaction – convert the memtable into an
SSTable
Reduce memory usage
Reduce log traffic on restart
Merging compaction
Reduce number of SSTables
Good place to apply policy “keep only N versions”
Major compaction
Merging compaction that results in only one SSTable
No deletion records, only live data
Locality Groups
Group column families together into an SSTable
Avoid mingling data, e.g. page contents and page
metadata
Can keep some groups all in memory
Can compress locality groups
Bloom Filters on locality groups – avoid searching
SSTable
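A tiny Bloom filter sketch of why this helps (illustrative Python; the real implementation and hash choices differ): a per-locality-group filter answers “might this SSTable contain this key?”, so most point reads can skip SSTables that cannot match.

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0
        def _positions(self, key):
            for i in range(self.k):
                h = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
                yield int(h, 16) % self.m
        def add(self, key):
            for p in self._positions(key):
                self.bits |= 1 << p
        def might_contain(self, key):
            # False -> definitely absent (skip reading this SSTable);
            # True  -> possibly present (must actually read it)
            return all(self.bits & (1 << p) for p in self._positions(key))

    f = BloomFilter()
    f.add("com.cnn.www/anchor:cnnsi.com")
    print(f.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
    print(f.might_contain("com.example.www/contents:"))      # almost certainly False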
Microbenchmarks
Application at Google
Lessons learned
Interesting point- only implement some of the
requirements, since the last is probably not needed
Many types of failure possible
Big systems need proper systems-level monitoring
Value simple designs
Questions and Answers !!!
173