Mass Data Processing Technology on Large Scale Clusters


For the class of Advanced Computer Architecture
All course material (slides, labs, etc.) is licensed under the Creative
Commons Attribution 2.5 License.
Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their
original version
Some slides are from the Internet.
1
Introduction to Distributed Systems
Google File System
MapReduce System
BigTable
2
Four Papers
 Luiz Barroso, Jeffrey Dean, and Urs Hoelzle, Web Search for a Planet:
The Google Cluster Architecture, IEEE Micro, 2003.
 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google
File System, 19th ACM Symposium on Operating Systems Principles,
Lake George, NY, October, 2003.
 Jeffrey Dean and Sanjay Ghemawat,
MapReduce: Simplified Data Processing on Large Clusters, OSDI'04:
Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
 Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah
A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert
E. Gruber, Bigtable: A Distributed Storage System for Structured Data,
OSDI'06: Seventh Symposium on Operating System Design and
Implementation, Seattle, WA, November, 2006.
3
4
Computer Speedup
Why does the speedup slow
down here?
Then, how do we improve
performance further?
Moore’s Law: “The density of transistors on a chip doubles every 18 months, for
the same cost” (1965)
Image: Tom’s Hardware
5
Scope of Problems
What can you do with 1 computer?
What can you do with 100 computers?
What can you do with an entire data center?
6
Distributed Problems
 Rendering multiple
frames of high-quality
animation
Image: DreamWorks Animation
7
Distributed Problems
 Simulating several hundred or thousand
characters
Happy Feet © Kingdom Feature Productions; Lord of the Rings © New Line Cinema
8
Distributed Problems
 Indexing the web (Google)
 Simulating an Internet-sized network for
networking experiments (PlanetLab)
 Speeding up content delivery (Akamai)
What is the key attribute that all these examples
have in common?
9
PlanetLab
PlanetLab is a global research network
that supports the development of new
network services.
PlanetLab currently consists of 809 nodes at 401 sites.
10
CDN - Akamai
11
Parallel vs. Distributed
 Parallel computing can mean:
 Vector processing of data (SIMD)
 Multiple CPUs in a single computer
(MIMD)
 Distributed computing is multiple
CPUs across many computers (MIMD)
12
A Brief History… 1975-85
 Parallel computing was
favored in the early
years
 Primarily vector-based
at first
 Gradually more thread-based parallelism was
introduced
Cray 2 supercomputer (Wikipedia)
13
A Brief History… 1985-95
“Massively parallel architectures”
start rising in prominence
Message Passing Interface (MPI)
and other libraries developed
Bandwidth was a big problem
14
A Brief History… 1995-Today
 Cluster/grid architecture increasingly
dominant
 Special node machines eschewed in
favor of COTS technologies
 Web-wide cluster software
 Companies like Google take this to the
extreme (10,000 node clusters)
15
Top 500, Architecture
16
Top 500 Trends
17
Top 500 Trends
18
Distributed System Concepts
 Multi-Thread Program
 Synchronization
 Semaphores, Conditional Variables, Barriers
 Network Concepts
 TCP/IP, Sockets, Ports
 RPC, Remote Invocation, RMI
 Synchronous, Asynchronous, Non-Blocking
 Transaction Processing System
 P2P, Grid
19
Semaphores
(Figure: a railroad semaphore flag in its "set" and "reset" positions)
 A semaphore is a flag
that can be raised or
lowered in one step
 Semaphores were flags
that railroad engineers
would use when
entering a shared track
Only one side of the semaphore can ever be red! (Can
both be green?)
20
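To make the idea concrete, here is a minimal sketch (not from the original slides) using Python's threading.Semaphore so that only one thread is on the shared "track" at a time:

import threading
import time

# A semaphore initialized to 1 acts like the railroad flag:
# only one thread may hold the shared "track" at a time.
track = threading.Semaphore(1)

def train(name):
    with track:                          # lower the flag (acquire)
        print(f"{name} is on the shared track")
        time.sleep(0.1)                  # do work on the track
    print(f"{name} has left the track")  # flag raised again (release)

threads = [threading.Thread(target=train, args=(f"train-{i}",)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()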
Barriers
 A barrier knows in advance how
many threads it should wait for.
Threads “register” with the
barrier when they reach it, and
fall asleep.
 Barrier wakes up all registered
threads when total count is
correct
Barrier
 Pitfall: What happens if a thread
takes a long time?
21
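A minimal sketch of the same idea with Python's threading.Barrier (illustrative only): each thread "registers" by calling wait() and sleeps until the expected count is reached, then all are released together.

import threading, random, time

N_THREADS = 4
# The barrier knows in advance how many threads to wait for.
barrier = threading.Barrier(N_THREADS)

def worker(i):
    time.sleep(random.random())          # simulate uneven amounts of work
    print(f"worker {i} reached the barrier")
    barrier.wait()                       # register and sleep until all N arrive
    print(f"worker {i} released")        # all workers wake together

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads: t.start()
for t in threads: t.join()

Note the pitfall from the slide: one slow worker delays the release of every registered thread.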
Synchronous RPC
(Figure: client/server timeline.) The client calls
  s = RPC(server_name, "foo.dll", get_hello, arg, arg, arg…)
and blocks. The server's RPC dispatcher invokes, in foo.dll:
  String get_hello(a, b, c) { … return "some hello str!"; }
When the result arrives, the client resumes and runs print(s).
22
Asynchronous RPC
(Figure: client/server timeline.) The client calls
  h = Spawn(server_name, "foo.dll", long_runner, x, y…)
and keeps running more code while the server's RPC dispatcher executes, in foo.dll:
  String long_runner(x, y) { … return new GiantObject(); }
Later the client blocks on the handle only when it needs the result:
  GiantObject myObj = Sync(h);
23
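A minimal sketch of this Spawn/Sync pattern, using Python's concurrent.futures as a stand-in for an RPC runtime (the long_runner function and handle names here are illustrative, not a real RPC library):

from concurrent.futures import ThreadPoolExecutor

# Hypothetical server-side procedure standing in for foo.dll's long_runner.
def long_runner(x, y):
    return {"giant": "object", "args": (x, y)}

executor = ThreadPoolExecutor(max_workers=4)

# "Spawn": issue the call and get back a handle immediately.
handle = executor.submit(long_runner, 1, 2)

# ... client keeps running other code here ...

# "Sync": block on the handle only when the result is finally needed.
my_obj = handle.result()
print(my_obj)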
Asynchronous RPC 2: Callbacks
(Figure: client/server timeline.) The client calls
  h = Spawn(server_name, "foo.dll", callback, long_runner, x, y…)
and keeps running. The server's RPC dispatcher executes, in foo.dll:
  String long_runner(x, y) { … return new Result(); }
When the result comes back, the client spawns a thread that runs
  void callback(o) { /* uses Result */ }
24
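The callback variant can be sketched the same way: instead of blocking on the handle, the client registers a function to run when the result arrives (again using concurrent.futures only as a stand-in for an RPC runtime):

from concurrent.futures import ThreadPoolExecutor

def long_runner(x, y):
    return f"result for ({x}, {y})"

def callback(future):
    # Runs in a worker thread once long_runner's result is ready.
    print("callback got:", future.result())

executor = ThreadPoolExecutor(max_workers=2)
handle = executor.submit(long_runner, 1, 2)
handle.add_done_callback(callback)   # client never blocks on the handle

# ... client keeps running; the callback fires when the call completes ...
executor.shutdown(wait=True)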
25
Early Google System
26
Spring 2000 Design
27
Late 2000 Design
28
Spring 2001 Design
29
Empty Google Cluster
30
Three Days Later…
31
A Picture is Worth…
32
The Google Infrastructure
 >200,000 commodity Linux servers;
 Storage capacity >5 petabytes;
 Indexed >8 billion web pages;
 Capital and operating costs at fraction of large scale
commercial servers;
 Traffic growth 20-30%/month.
33
Dimensions of a Google Cluster
• 359 racks
• 31,654 machines
• 63,184 CPUs
• 126,368 GHz of processing power
• 63,184 GB of RAM
• 2,527 TB of hard drive space
 Approx. 40 million searches/day
34
Architecture for Reliability
 Replication (3x +) for redundancy;
 Replication for proximity and response;
 Fault tolerant software for cheap hardware.
 Policy: Reliability through software architecture,
not hardware.
35
Query Serving Infrastructure
 Processing a query may engage 1000+ servers;
 Index Servers manage distributed files;
 Document Servers access distributed data;
 Response time < 0.25 seconds anywhere.
36
Systems Engineering Principles
 Overwhelm problems with computational power;
 Impose standard file management;
 Manage through standard job scheduling;
 Apply simplified data processing discipline.
37
Scalable Engineering Infrastructure
 Goal: Create very large scale, high performance computing infrastructure
  Hardware + software systems that make it easy to build products
  Focus on price/performance, and ease of use
 Enables better products
  Allows rapid experimentation with large data sets using very simple programs, so algorithms can be innovated and evolved with real-world data
 Scalable serving capacity
  Designed to run on lots of cheap, failure-prone hardware
  If a service gets a lot of traffic, you simply add servers and bandwidth
 Every engineer creates software that scales, monitors itself and recovers, from the ground up
  The net result is that every service and every reusable component embodies these properties, and when something succeeds, it has room to fly
 Google
  GFS, MapReduce and Bigtable are the fundamental building blocks
  Indices containing more documents
  Updated more often
  Faster queries
  Faster product development cycles
  …
38
Rethinking Development Practices
 Build on your own API
 Develop the APIs first
 Build your own application using the APIs – you know it works!
 Decide which of these APIs to expose for external
developers
 Sampling and Testing
 Release early and iterate
 Continuous User Feedback
 Public Beta
 Open to all – not to a limited set of users
 Potentially years of beta – not a fixed timeline
39
40
File systems overview
NFS & AFS (Andrew File System)
GFS
Discussion
41
File Systems Overview
 System that permanently stores data
 Usually layered on top of a lower-level physical
storage medium
 Divided into logical units called “files”
 Addressable by a filename (“foo.txt”)
 Usually supports hierarchical nesting (directories)
 A file path joins file & directory names into a relative
or absolute address to identify a file
(“/home/aaron/foo.txt”)
42
What Gets Stored
 User data itself is the bulk of the file system's
contents
 Also includes meta-data on a drive-wide and per-file
basis:
Drive-wide:        Per-file:
Available space    name
Formatting info    owner
character set      modification date
...                physical layout...
43
High-Level Organization
 Files are organized in a “tree” structure made of
nested directories
 One directory acts as the “root”
 “links” (symlinks, shortcuts, etc) provide simple
means of providing multiple access paths to one
file
 Other file systems can be “mounted” and dropped
in as sub-hierarchies (other drives, network
shares)
44
Low-Level Organization (1/2)
 File data and meta-data stored separately
 File descriptors + meta-data stored in
inodes
 Large tree or table at designated location
on disk
 Tells how to look up file contents
 Meta-data may be replicated to increase
system reliability
45
Low-Level Organization (2/2)
 "Standard" read-write medium is a hard drive (other media: CDROM, tape, ...)
 Viewed as a sequential array of blocks
 Must address ~1 KB chunk at a time
 Tree structure is "flattened" into blocks
 Overlapping reads/writes/deletes can cause fragmentation: files are often not stored with a linear layout
 inodes store all block numbers related to a file
46
Fragmentation
(Figure: successive snapshots of disk blocks as files A, B, C, and D are written and deleted; over time a file's blocks become scattered among free space rather than stored contiguously.)
47
Design Considerations
 Smaller block size reduces amount of wasted
space
 Larger block size increases speed of sequential
reads (may not help random access)
 Should the file system be faster or more reliable?
 But faster at what: Large files? Small files? Lots of
reading? Frequent writers, occasional readers?
48
Distributed Filesystems
 Support access to files on remote servers
 Must support concurrency
 Make varying guarantees about locking, who
“wins” with concurrent writes, etc...
 Must gracefully handle dropped connections
 Can offer support for replication and local
caching
 Different implementations sit in different places
on complexity/feature scale
49
NFS
 First developed in 1980s by Sun
 Presented with standard UNIX FS interface
 Network drives are mounted into local
directory hierarchy
50
NFS Protocol
 Initially completely stateless
 Operated over UDP; did not use TCP
streams
 File locking, etc., implemented in higher-level protocols
 Modern implementations use TCP/IP &
stateful protocols
51
Server-side Implementation
 NFS defines a virtual file system
 Does not actually manage local disk layout on server
 Server instantiates NFS volume on top of local file
system
 Local hard drives managed by concrete file systems
(EXT, ReiserFS, ...)
 Other networked FS's mounted in by...?
(Figure: on the server, the NFS server exports a volume layered on concrete local file systems (e.g., EXT3) that manage the hard drives; on the client, the NFS client mounts that volume into the user-visible filesystem alongside local EXT2/ReiserFS volumes.)
52
NFS Locking
 NFS v4 supports stateful locking of files
 Clients inform server of intent to lock
 Server can notify clients of outstanding
lock requests
 Locking is lease-based: clients must
continually renew locks before a timeout
 Loss of contact with server abandons
locks
53
NFS Client Caching
 NFS Clients are allowed to cache copies of remote
files for subsequent accesses
 Supports close-to-open cache consistency
 When client A closes a file, its contents are
synchronized with the master, and timestamp is
changed
 When client B opens the file, it checks that local
timestamp agrees with server timestamp. If not, it
discards local copy.
 Concurrent reader/writers must use flags to disable
caching
54
NFS: Tradeoffs
 NFS Volume managed by single server
 Higher load on central server
 Simplifies coherency protocols
 Full POSIX system means it “drops in”
very easily, but isn’t “great” for any
specific need
55
AFS (The Andrew File System)
 Developed at Carnegie Mellon
 Strong security, high scalability
 Supports 50,000+ clients at enterprise
level
56
AFS Overview
(Figure: the global /afs namespace, with cells such as pku.edu.cn, tsinghua.edu.cn, and washington.edu; the pku.edu.cn cell contains sub-trees such as students, teachers, admin, dcst, ai, hpc, projects, and games.)
57
Security in AFS
 Uses Kerberos authentication
 Supports richer set of access control bits
than UNIX
 Separate “administer”, “delete” bits
 Allows application-specific bits
58
Local Caching
 File reads/writes operate on locally cached
copy
 Local copy sent back to master when file is
closed
 Open local copies are notified of external
updates through callbacks
59
Local Caching - Tradeoffs
 Shared database files do not work well on
this system
 Does not support write-through to shared
medium
60
Replication
 AFS allows read-only copies of filesystem volumes
 Copies are guaranteed to be atomic checkpoints
of entire FS at time of read-only copy generation
 Modifying data requires access to the sole r/w
volume
 Changes do not propagate to read-only copies
61
AFS Conclusions
 Not quite POSIX
 Stronger security/permissions
 No file write-through
 High availability through replicas, local
caching
 Not appropriate for all file types
62
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung,
SOSP 2003
(These slides by Alex Moshchuk)
63
Motivation
 Google needed a good distributed file system
 Redundant storage of massive amounts of data on
cheap and unreliable computers
 Why not use an existing file system?
 Google’s problems are different from anyone else’s
 Different workload and design priorities
 GFS is designed for Google apps and workloads
 Google apps are designed for GFS
64
Assumptions
 High component failure rates
  Inexpensive commodity components fail all the time
 "Modest" number of HUGE files
  Just a few million
  Each is 100MB or larger; multi-GB files typical
 Files are write-once, mostly appended to
  Perhaps concurrently
 Large streaming reads
 High sustained throughput favored over low latency
65
GFS Design Decisions
 Files stored as chunks
 Fixed size (64MB)
 Reliability through replication
 Each chunk replicated across 3+ chunkservers
 Single master to coordinate access, keep metadata
 Simple centralized management
 No data caching
 Little benefit due to large data sets, streaming reads
 Familiar interface, but customize the API
 Simplify the problem; focus on Google apps
 Add snapshot and record append operations
66
GFS Architecture
 Single master
 Multiple chunkservers
…Can anyone see a potential weakness in this design?
67
Single master
 From distributed systems we know this is a:
  Single point of failure
  Scalability bottleneck
 GFS solutions:
  Shadow masters
  Minimize master involvement
   never move data through it, use only for metadata, and cache metadata at clients
   large chunk size
   master delegates authority to primary replicas in data mutations (chunk leases)
 Simple, and good enough!
68
Metadata (1/2)
 Global metadata is stored on the master
 File and chunk namespaces
 Mapping from files to chunks
 Locations of each chunk’s replicas
 All in memory (64 bytes / chunk)
 Fast
 Easily accessible
69
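As a rough back-of-the-envelope check of why in-memory metadata is feasible, using the 64 bytes/chunk figure from this slide and an assumed 1 PB of file data (my illustrative number, not from the paper):

# Rough estimate of master memory needed for chunk metadata,
# assuming ~64 bytes of in-memory metadata per 64 MB chunk (per the slide).
data_bytes = 1 * 1024**5            # 1 PB of file data (illustrative assumption)
chunk_bytes = 64 * 1024**2          # 64 MB chunks
meta_per_chunk = 64                 # bytes of metadata per chunk

chunks = data_bytes // chunk_bytes
print(chunks, "chunks,", chunks * meta_per_chunk / 1024**3, "GiB of metadata")
# ~16.8 million chunks -> about 1 GiB of metadata: small enough to keep in RAM.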
Metadata (2/2)
 Master has an operation log for persistent
logging of critical metadata updates
 persistent on local disk
 replicated
 checkpoints for faster recovery
70
Mutations
 Mutation = write or append
 must be done for all replicas
 Goal: minimize master involvement
 Lease mechanism:
 master picks one replica as
primary; gives it a “lease”
for mutations
 primary defines a serial
order of mutations
 all replicas follow this order
 Data flow decoupled from
control flow
71
Atomic record append
 Client specifies data
 GFS appends it to the file atomically at least once
 GFS picks the offset
 works for concurrent writers
 Used heavily by Google apps
 e.g., for files that serve as multiple-producer/single-
consumer queues
72
Relaxed consistency model (1/2)
 “Consistent” = all replicas have the same value
 “Defined” = replica reflects the mutation,
consistent
 Some properties:
 concurrent writes leave region consistent, but possibly
undefined
 failed writes leave the region inconsistent
 Some work has moved into the applications:
 e.g., self-validating, self-identifying records
73
Relaxed consistency model (2/2)
 Simple, efficient
 Google apps can live with it
 what about other apps?
 Namespace updates atomic and serializable
74
Master’s responsibilities (1/2)
 Metadata storage
 Namespace management/locking
 Periodic communication with chunkservers
 give instructions, collect state, track cluster health
 Chunk creation, re-replication, rebalancing
 balance space utilization and access speed
 spread replicas across racks to reduce correlated failures
 re-replicate data if redundancy falls below threshold
 rebalance data to smooth out storage and request load
75
Master’s responsibilities (2/2)
 Garbage Collection
 simpler, more reliable than traditional file delete
 master logs the deletion, renames the file to a hidden
name
 lazily garbage collects hidden files
 Stale replica deletion
 detect “stale” replicas using chunk version numbers
76
Fault Tolerance
 High availability
 fast recovery
 master and chunkservers restartable in a few seconds
 chunk replication
 default: 3 replicas.
 shadow masters
 Data integrity
 checksum every 64KB block in each chunk
77
Performance
78
Deployment in Google
 50+ GFS clusters
 Each with thousands of storage nodes
 Managing petabytes of data
 GFS is under BigTable, etc.
79
Conclusion
 GFS demonstrates how to support large-scale
processing workloads on commodity hardware
 design to tolerate frequent component failures
 optimize for huge files that are mostly appended and
read
 feel free to relax and extend FS interface as required
 go for simple solutions (e.g., single master)
 GFS has met Google’s storage needs… it must be
good!
80
Discussion
 How many sys-admins does it take to run a system
like this?
 much of management is built in
 Google currently has ~450,000 machines
 GFS: only 1/3 of these are “effective”
 that’s a lot of extra equipment, extra cost, extra power,
extra space!
 GFS has achieved availability/performance at a very low cost,
but can you do it for even less?
 Is GFS useful as a general-purpose commercial
product?
 small write performance not good enough?
 relaxed consistency model
81
82
Motivation
 200+ processors
 200+ terabyte database
 10^10 total clock cycles
 0.1 second response time
 5¢ average advertising revenue
From: www.cs.cmu.edu/~bryant/presentations/DISC-FCRC07.ppt
83
Motivation: Large Scale Data Processing
Want to process lots of data ( > 1 TB)
Want to parallelize across
hundreds/thousands of CPUs
… Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the
raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/howmuch-data-does-google-store.html
84
MapReduce
 Automatic parallelization &
distribution
 Fault-tolerant
 Provides status and monitoring tools
 Clean abstraction for programmers
85
Programming Model
 Borrows from functional programming
 Users implement interface of two functions:
 map
(in_key, in_value) ->
(out_key, intermediate_value) list
 reduce (out_key, intermediate_value list)
->
out_value list
86
map
 Records from the data source (lines out of
files, rows of a database, etc) are fed into the
map function as key*value pairs: e.g.,
(filename, line).
 map() produces one or more intermediate
values along with an output key from the
input.
87
reduce
 After the map phase is over, all the
intermediate values for a given output key
are combined together into a list
 reduce() combines those intermediate
values into one or more final values for that
same output key
 (in practice, usually only one final value per
key)
88
Architecture
(Figure: MapReduce dataflow. Input key*value pairs from each data store feed parallel map() tasks, which emit (key, values...) pairs. A barrier then aggregates intermediate values by output key, and each reduce() task combines the intermediate values for one key into that key's final values.)
== Barrier == : Aggregates intermediate values by output key
89
Parallelism
 map() functions run in parallel, creating
different intermediate values from different
input data sets
 reduce() functions also run in parallel, each
working on a different output key
 All values are processed independently
 Bottleneck: reduce phase can’t start until
map phase is completely finished.
90
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
91
Example vs. Actual Source Code
 Example is written in pseudo-code
 Actual implementation is in C++, using a
MapReduce library
 Bindings for Python and Java exist via
interfaces
 True code is somewhat more involved
(defines how the input key/values are
divided up and accessed, etc.)
92
Example
 Page 1: the weather is good
 Page 2: today is good
 Page 3: good weather is good.
93
Map output
 Worker 1:
 (the 1), (weather 1), (is 1), (good 1).
 Worker 2:
 (today 1), (is 1), (good 1).
 Worker 3:
 (good 1), (weather 1), (is 1), (good 1).
94
Reduce Input
 Worker 1:
 (the 1)
 Worker 2:
 (is 1), (is 1), (is 1)
 Worker 3:
 (weather 1), (weather 1)
 Worker 4:
 (today 1)
 Worker 5:
 (good 1), (good 1), (good 1), (good 1)
95
Reduce Output
 Worker 1:
 (the 1)
 Worker 2:
 (is 3)
 Worker 3:
 (weather 2)
 Worker 4:
 (today 1)
 Worker 5:
 (good 4)
96
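The job just walked through can be reproduced with a toy, single-process sketch (it assumes nothing about Google's actual MapReduce library; the mapreduce() driver below simply mimics the map, barrier/shuffle, and reduce phases on the three example pages):

from collections import defaultdict

def map_fn(doc_name, contents):
    # (doc_name, contents) -> list of (word, 1) intermediate pairs
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    return sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    # Map phase
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Barrier/shuffle: group intermediate values by output key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    return {key: reduce_fn(key, values) for key, values in groups.items()}

pages = [("page1", "the weather is good"),
         ("page2", "today is good"),
         ("page3", "good weather is good")]
print(mapreduce(pages, map_fn, reduce_fn))
# {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}

The printed counts match the Reduce Output slide above.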
Some Other Real Examples
 Term frequencies through the whole
Web repository
 Count of URL access frequency
 Reverse web-link graph
97
Implementation Overview
 Typical cluster:
 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
 Limited bisection bandwidth
 Storage is on local IDE disks
 GFS: distributed file system manages data (SOSP'03)
 Job scheduling system: jobs made up of tasks, scheduler
assigns tasks to machines
 Implementation is a C++ library linked into user
programs
98
Architecture
99
Execution
100
Parallel Execution
101
Task Granularity And Pipelining
 Fine granularity tasks: many more map tasks than machines
 Minimizes time for fault recovery
 Can pipeline shuffling with map execution
 Better dynamic load balancing
 Often use 200,000 map/5000 reduce tasks w/ 2000 machines
102
Locality
 Master program divvies up tasks based on location of data: (asks GFS for locations of replicas of input file blocks) tries to have map() tasks on same machine as physical file data, or at least on the same rack
 map() task inputs are divided into 64 MB blocks: same size as Google File System chunks
 Effect: thousands of machines read input at local disk speed
  Without this, rack switches limit read rate
103
Fault Tolerance
 Master detects worker failures
 Re-executes completed & in-progress
map() tasks
 Re-executes in-progress reduce() tasks
 Master notices particular input key/values
cause crashes in map(), and skips those
values on re-execution.
 Effect: Can work around bugs in third-party libraries!
104
Fault Tolerance
 On worker failure:
 Detect failure via periodic heartbeats
 Re-execute completed and in-progress map
tasks
 Re-execute in progress reduce tasks
 Task completion committed through master
 Master failure:
 Could handle, but don't yet (master failure
unlikely)
Robust: lost 1600 of 1800 machines
once, but finished fine
105
Optimizations
 No reduce can start until map is complete:
  A single slow disk controller can rate-limit the whole process
 Slow workers significantly lengthen completion time
  Other jobs consuming resources on the machine
  Bad disks with soft errors transfer data very slowly
  Weird things: processor caches disabled (!!)
 Master redundantly executes "slow-moving" map tasks; whichever copy finishes first "wins", and its results are used
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
106
Optimizations
 “Combiner” functions can run on same machine as a
mapper
 Causes a mini-reduce phase to occur before the real
reduce phase, to save bandwidth
Under what conditions is it sound to use a combiner?
107
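A small sketch of what a combiner does for the word-count job: a local mini-reduce over one mapper's output before it crosses the network. It is sound here because the reduce function (summing counts) is associative and commutative; the names and structure below are illustrative, not the actual library API.

from collections import Counter

def map_fn(doc_name, contents):
    return [(w, 1) for w in contents.split()]

def combiner(pairs):
    # Mini-reduce run on the mapper's own output before it is shipped
    # to the reducers; safe because summing is associative and commutative.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

raw = map_fn("page3", "good weather is good")
print(raw)            # [('good', 1), ('weather', 1), ('is', 1), ('good', 1)]
print(combiner(raw))  # [('good', 2), ('weather', 1), ('is', 1)] -- less data shipped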
Refinement
 Sorting guarantees within each reduce
partition
 Compression of intermediate data
 Combiner: useful for saving network
bandwidth
 Local execution for debugging/testing
 User-defined counters
108
Performance
 Tests run on cluster of 1800 machines:
 4 GB of memory
 Dual-processor 2 GHz Xeons with Hyperthreading
 Dual 160 GB IDE disks
 Gigabit Ethernet per machine
 Bisection bandwidth approximately 100 Gbps
Two benchmarks:
 MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
 MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
109
MR_Grep
 Locality optimization helps:
 1800 machines read 1 TB of data at peak of ~31 GB/s
 Without this, rack switches would limit to 10 GB/s
 Startup overhead is significant for short jobs
110
MR_Sort
 Backup tasks reduce job completion time significantly
 System deals well with failures
(Figure: sort progress over time for three runs — normal execution, no backup tasks, and 200 processes killed.)
111
More and more MapReduce
(Figure: MapReduce programs in the Google source tree.)
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
112
Real MapReduce: Rewrite of
Production Indexing System
 Rewrote Google's production indexing system
using MapReduce
 Set of 10, 14, 17, 21, 24 MapReduce operations
 New code is simpler, easier to understand
 MapReduce takes care of failures, slow machines
 Easy to make indexing faster by adding more
machines
113
MapReduce Conclusions
 MapReduce has proven to be a useful
abstraction
 Greatly simplifies large-scale computations
at Google
 Functional programming paradigm can be
applied to large-scale applications
 Fun to use: focus on problem, let library
deal w/ messy details
114
Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, Robert
E. Gruber
Google, Inc.
UWCS OS Seminar Discussion
Erik Paulson
2 October 2006
See also the (other) UW presentation by Jeff Dean in September of 2005
(see the link on the seminar page, or just google for "google bigtable")
Motivation
 Lots of (semi-)structured data at Google
 URLs:
 Contents, crawl metadata, links, anchors, pagerank, …
 Per-user Data:
 User preference settings, recent queries/search results, …
 Geographic locations:
 Physical entities (shops, restaurants, etc.), roads, satellite image
data, user annotations, …
 Scale is large
 Billions of URLs, many versions/page (~20K/version)
 Hundreds of millions of users, thousands of q/sec
 100TB+ of satellite image data
Why not just use commercial DB?
 Scale is too large for most commercial databases
 Even if it weren’t, cost would be very high
 Building internally means system can be applied across
many projects for low incremental cost
 Low-level storage optimizations help performance
significantly
 Much harder to do when running on top of a database
layer
 Also fun and challenging to build large-scale systems :)
Goals
 Want asynchronous processes to be continuously
updating different pieces of data
 Want access to most current data at any time
 Need to support
 Very high read/write rates (millions of ops per second)
 Efficient scans over all or interesting subsets of data
 Efficient joins of large one-to-one and one-to-many
datasets
 Often want to examine data changes over time
 E.g. Contents of a web page over multiple crawls
BigTable
 Distributed multi-level map
  With an interesting data model
 Fault-tolerant, persistent
 Scalable
  Thousands of servers
  Terabytes of in-memory data
  Petabytes of disk-based data
  Millions of reads/writes per second, efficient scans
 Self-managing
  Servers can be added/removed dynamically
  Servers adjust to load imbalance
Status
 Design/initial implementation started beginning of 2004
 Currently ~100 BigTable cells (Winter 2005)
 Production use or active development for many projects:
 Google Print
 My Search History
 Orkut
 Crawling/indexing pipeline
 Google Maps/Google Earth
 Blogger
 …
 Largest bigtable cell manages ~200TB of data spread over
several thousand machines (larger cells planned)
Background: Building Blocks
Building blocks:
 Google File System (GFS): Raw storage
 Scheduler: schedules jobs onto machines
 Lock service: distributed lock manager
 Also can reliably hold tiny files (100s of bytes) with high availability
 MapReduce: simplified large-scale data processing
BigTable uses of building blocks
 GFS: stores persistent state
 Scheduler: schedules jobs involved in BigTable serving
 Lock service: master election, location bootstrapping
 MapReduce: often used to read/write BigTable data
Google File System (GFS)
(Figure: a GFS master (with replicas and misc. servers) and chunkservers 1-3, each holding chunks C1-C4; clients contact the master for metadata and transfer data directly with chunkservers.)
• Master manages metadata
• Data transfers happen directly between clients/chunkservers
• Files broken into chunks (typically 64MB)
• Chunks triplicated across three machines for safety
MapReduce: Easy-to-use Cycles
 Many Google problems: "Process lots of data to produce other data"
 Many kinds of inputs
  Document records, log files, sorted on-disk data structures, etc.
  Want to easily use hundreds or thousands of CPUs
 MapReduce: framework that provides (for certain classes of problems):
  Automatic & efficient parallelization/distribution
  Fault-tolerance, I/O scheduling, status/monitoring
  Users write Map and Reduce functions
 Heavily used: ~3000 jobs, 1000s of machine-days each day
 BigTable can be input and/or output for MapReduce computations
Chubby
 {lock/file/name} service
 Coarse-grained locks, can store small amount of data
in a lock
 5 replicas, need a majority vote to be active
 Also an OSDI ’06 Paper
Typical Cluster
(Figure: a cluster scheduling master, lock service, and GFS master oversee machines 1..N; each machine runs Linux with a scheduler slave and a GFS chunkserver, and hosts tasks such as BigTable tablet servers, the BigTable master, and MapReduce jobs.)
BigTable Overview
 Data Model
 Implementation Structure
 Tablets, compactions, locality groups, ……
 API
 Details
 Shared logs, compression, replications
 Current/Future Work
Basic Data Model
(Figure 1: the Web table. The row "com.cnn.www" stores several timestamped versions of the page (<html>…) in the "contents:" column, and anchor text ("CNN", "CNN.com") in the "anchor:cnnsi.com" and "anchor:my.look.ca" columns.)
Bigtable is a distributed, sparse, multidimensional sorted map.
The map is indexed by row key, column key and timestamp,
i.e. (row: string, column: string, time: int64) → string (cell contents).
Rows are ordered lexicographically by row key.
The row range of a table is dynamically partitioned; each row range is called a "tablet".
Columns: syntax is family:qualifier.
Cells can store multiple versions of data, with timestamps.
Good match for most of Google's applications.
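A toy in-memory sketch of this data model (purely illustrative, not Bigtable code): cells are keyed by (row, column) and hold multiple timestamped versions, with lookups returning the most recent K values.

from bisect import insort
from collections import defaultdict

class TinyTable:
    """Toy in-memory sketch of the (row, column, timestamp) -> value map."""
    def __init__(self):
        # cells[(row, column)] -> list of (timestamp, value), kept sorted
        self.cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        insort(self.cells[(row, column)], (timestamp, value))

    def get(self, row, column, k=1):
        # "Return most recent K values" for a cell
        return [v for _, v in sorted(self.cells[(row, column)], reverse=True)[:k]]

t = TinyTable()
t.put("com.cnn.www", "contents:", 3, "<html>v1")
t.put("com.cnn.www", "contents:", 6, "<html>v2")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.get("com.cnn.www", "contents:"))      # ['<html>v2']
print(t.get("com.cnn.www", "contents:", k=2)) # ['<html>v2', '<html>v1']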
Rows
 Name is an arbitrary string
 Access to data in a row is atomic
 Row creation is implicit upon storing data
 Rows ordered lexicographically
 Rows close together lexicographically usually on one or a
small number of machines
Tablets
 Large tables broken into tablets at row boundaries
 Tablet holds contiguous range of rows
 Clients can often choose row keys to achieve locality
 Aim for ~100MB to 200MB of data per tablet
 Serving machine responsible for ~100 tablets
 Fast recovery:
 100 machines each pick up 1 tablet from failed machine
 Fine-grained load balancing:
 Migrate tablets away from overloaded machine
 Master makes load-balancing decisions
Tablets & Splitting
(Figure: rows such as "aaa.com", "cnn.com", "cnn.com/sports.html", …, "website.com", …, "yahoo.com/kids.html", …, "zuppa.com/menu.html", with "contents:" and "language:" columns; the sorted row space is partitioned into tablets at row boundaries.)
Tablet and Table
A Bigtable contains tables, each table contains a set of tablets, and each tablet contains a range of rows (100-200MB).
(Figure: a tablet covering the row range aardvark..apple, backed by SSTables made of 64K blocks plus an index; adjacent tablets such as aardvark..apple and apple_two..boat hold disjoint row ranges but may share SSTables.)
System Structure
(Figure: a Bigtable cell. The Bigtable client uses the client library to Open() the cell via the lock service, issues metadata ops to the Bigtable master (which performs metadata operations and load balancing), and reads/writes directly against Bigtable tablet servers, which serve the data. Underneath, the cluster scheduling system handles failover and monitoring, GFS holds tablet data and logs, and the lock service holds metadata and handles master election.)
Locating Tablets
 Since tablets move around from server to server,
given a row, how do clients find the right machine?
 Need to find tablet whose row range covers the target
row
 One approach: could use the BigTable master
 Central server almost certainly would be bottleneck in
large system
 Instead: store special tables containing tablet
location info in BigTable cell itself
Locating Tablets (cont.)
 Google's approach: 3-level hierarchical lookup scheme for tablets
  Location is ip:port of relevant server
  1st level: bootstrapped from lock service, points to owner of META0
  2nd level: uses META0 data to find owner of appropriate META1 tablet
  3rd level: META1 table holds locations of tablets of all other tables
   META1 table itself can be split into multiple tablets
 Aggressive prefetching + caching
  Most ops go right to proper machine
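A rough illustration of why three levels are enough, assuming ~128 MB metadata tablets and ~1 KB per tablet-location row (figures along the lines of the Bigtable paper's discussion):

# Rough capacity check for a 3-level lookup scheme, assuming
# ~128 MB metadata tablets and ~1 KB per tablet-location row.
meta_tablet_bytes = 128 * 1024**2
row_bytes = 1024

rows_per_meta_tablet = meta_tablet_bytes // row_bytes    # ~131,072 = 2**17
addressable_tablets = rows_per_meta_tablet ** 2           # two metadata levels
print(addressable_tablets)  # ~1.7e10 (2**34) user tablets addressable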
Tablet Representation
(Figure: reads are served from an in-memory, random-access write buffer merged with a set of SSTables on GFS (mmapped); writes go to an append-only log on GFS and then into the write buffer.)
SSTable: immutable, on-disk, ordered map from string → string
String keys: <row, column, timestamp> triples
SSTables
This is the underlying file format used to store Bigtable data.
SSTables are immutable.
If new data is added, a new SSTable is created.
The old SSTable is set out for garbage collection.
(Figure: an SSTable is a sequence of 64K blocks plus an index.)
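A toy sketch of the SSTable idea (illustrative only): keys are sorted once, grouped into fixed-size blocks, and a small index of each block's first key directs a lookup to a single block.

import bisect

class TinySSTable:
    """Toy sketch: an immutable, sorted key->value map with a sparse block index."""
    def __init__(self, items, block_size=4):
        data = sorted(items)                        # SSTables are sorted and immutable
        self.blocks = [data[i:i + block_size]       # fixed-size blocks (64K in Bigtable)
                       for i in range(0, len(data), block_size)]
        self.index = [blk[0][0] for blk in self.blocks]   # first key of each block

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1      # pick candidate block via index
        if i < 0:
            return None
        for k, v in self.blocks[i]:                       # scan only that one block
            if k == key:
                return v
        return None

sst = TinySSTable([("apple", 1), ("boat", 2), ("cnn.com", 3), ("zuppa", 4)])
print(sst.get("cnn.com"))  # 3
print(sst.get("missing"))  # None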
Tablet Assignment
 Each tablet server is given a set of tablets to serve client requests for.
 The master keeps track of tablet servers (via RPC) in order to assign tablets.
 A Chubby directory is used by each tablet server to acquire a lock.
 If a tablet server terminates, it releases the lock on its file.
 Status is sent to the master by each tablet server.
 How does the master come to know about tablets and tablet servers?
  The master acquires a unique master lock in Chubby.
  The master scans the server directory in Chubby to find live servers.
  The master communicates with each tablet server to learn which tablets it serves.
  It scans the METADATA table to find the unassigned tablets.
Tablet Assignment
(Figure: the master acquires a unique lock in the Chubby directory, scans that directory to learn which tablet servers are live, communicates with each tablet server to learn which tablets it serves, and scans the METADATA table for unassigned tablets.)
Tablet Serving
 Reads and writes arrive at the tablet server.
 The server checks that the request is well-formed.
 Authorization: Chubby holds the permission file.
 Write op: each mutation is written to the commit log (the tablet log on GFS), using group commit. The update is then stored in the memtable (in memory).
 Read op: served from a merged view of the memtable and the SSTable files on GFS.
 The commit log stores the updates made to the data and drives the recovery process.
(Figure: tablet representation — memtable in memory; tablet log and SSTable files (SSTs) on GFS. Figure taken from the Bigtable paper.)
Compactions
 Tablet state represented as set of immutable compacted SSTable files, plus tail of log (buffered in memory)
 Minor compaction
  When in-memory state fills up, pick tablet with most data and write contents to SSTables stored in GFS
   Separate file for each locality group for each tablet
 Major compaction
  Periodically compact all SSTables for tablet into new base SSTable on GFS
   Storage reclaimed from deletions at this point
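A toy sketch of the write path these compactions operate on (illustrative only, not Bigtable's implementation): writes go to a log and a memtable, a full memtable is frozen into a new immutable SSTable (minor compaction), and reads consult the memtable and then the SSTables from newest to oldest.

class TinyTablet:
    """Toy sketch of the memtable + SSTable write path."""
    def __init__(self, memtable_limit=3):
        self.log = []            # stands in for the commit log on GFS
        self.memtable = {}       # recent writes, in memory
        self.sstables = []       # frozen, immutable snapshots (newest last)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))       # log first, for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the memtable and persist it as a new immutable SSTable.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if key in sst:
                return sst[key]
        return None

t = TinyTablet()
for i in range(5):
    t.write(f"row{i}", i)
print(t.read("row1"), len(t.sstables))  # 1, with one SSTable written so far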
Compaction
(Figure: minor compaction — a write operation fills the memtable, which is frozen and converted into a new small SSTable. Merging compaction — the small SSTables produced by minor compactions are merged into new large SSTables; a merging compaction over all SSTables is a major compaction.)
Columns
 Columns have two-level name structure:
 family:optional_qualifier
 Column family
 Unit of access control
 Has associated type information
 Qualifier gives unbounded columns
 Additional level of indexing, if desired
Timestamps
 Used to store different versions of data in a cell
 New writes default to current time, but timestamps for
writes can also be set explicitly by clients
 Lookup options:
 “Return most recent K values”
 “Return all values in timestamp range(or all values)”
 Column families can be marked with attributes:
 “Only retain most recent K values in a cell”
 “Keep values until they are older than K seconds”
Locality Groups
 Column families can be assigned to a locality group
  Used to organize underlying storage representation for performance
   Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
  Data in a locality group can be explicitly memory-mapped
Locality Group
(Figure: the Web table split into two locality groups — one holding the "contents:" column family (<html> versions for row "com.cnn.www"), the other holding the "anchor:cnnsi.com" and "anchor:my.look.ca" families ("CNN", "CNN.com").)
Multiple column families are grouped into a locality group.
Efficient reads are achieved by separating column families.
Additional parameters can be specified.
API
 Metadata operations
  Create/delete tables, column families, change metadata
 Writes (atomic)
  Set(): write cells in a row
  DeleteCells(): delete cells in a row
  DeleteRow(): delete all cells in a row
 Reads
  Scanner: read arbitrary cells in a bigtable
   Each row read is atomic
   Can restrict returned rows to a particular range
   Can ask for just data from 1 row, all rows, etc.
   Can ask for all columns, just certain column families, or specific columns
API
(Example client code slides, not reproduced in this transcript.)
Shared Logs
 Designed for 1M tablets, 1000s of tablet servers
 1M logs being simultaneously written performs badly
 Solution:shared logs
 Write log file per tablet server instead of per tablet
 Updates for many tablets co-mingled in same file
 Start new log chunks every so often (64MB)
 Problem: during recovery, a server needs to read log
data to apply mutations for a tablet
 Lots of wasted I/O if lots of machines need to read data
for many tablets from same log chunk
Shared Log Recovery
Recovery:
 Servers inform master of log chunks they need to read
 Master aggregates and orchestrates sorting of needed
chunks
 Assigns log chunks to be sorted to different tablet servers
 Servers sort chunks by tablet, write sorted data to local disk
 Other tablet servers ask master which servers have sorted
chunks they need
 Tablet servers issue direct RPCs to peer tablet servers to
read sorted data for their tablets
Compression
 Many opportunities for compression
 Similar values in the same row/column at different
timestamps
 Similar values in different columns
 Similar values across adjacent rows
 Within each SSTable for a locality group, encode
compressed blocks
 Keep blocks small for random access (~64KB
compressed data)
 Exploit fact that many values very similar
 Needs to be low CPU cost for encoding/decoding
 Two building blocks: BMDiff, Zippy
In Development/Future Plans
 More expressive data manipulation/access
 Allow sending small scripts to perform
read/modify/write transactions so that they execute on
server?
 Multi-row (i.e. distributed) transaction support
 General performance work for very large cells
 BigTable as a service?
 Interesting issues of resource fairness, performance
isolation, prioritization, etc. across different clients
Real Applications
154
Before we begin…
 Intersection of databases and distributed systems
 Will try to explain (or at least warn) when we hit a
patch of database
 Remember this is a discussion!
Google Scale
 Lots of data
 Copies of the web, satellite data, user data, email and
USENET, Subversion backing store
 Many incoming requests
 No commercial system big enough
 Couldn’t afford it if there was one
 Might not have made appropriate design choices
 Firm believers in the End-to-End argument
 450,000 machines (NYTimes estimate, June 14th,
2006)
Building Blocks
 Scheduler (Google WorkQueue)
 Google Filesystem
 Chubby Lock service
 Two other pieces helpful but not required
  Sawzall
  MapReduce (despite what the Internet says)
 BigTable: build a more application-friendly storage service using these parts
Google File System
 Large-scale distributed “filesystem”
 Master: responsible for metadata
 Chunk servers: responsible for reading and writing
large chunks of data
 Chunks replicated on 3 machines, master responsible
for ensuring replicas exist
 OSDI ’04 Paper
Chubby
 {lock/file/name} service
 Coarse-grained locks, can store small amount of data
in a lock
 5 replicas, need a majority vote to be active
 Also an OSDI ’06 Paper
Data model: a big map
•<Row, Column, Timestamp> triple for key - lookup, insert, and delete API
•Arbitrary “columns” on a row-by-row basis
•Column family:qualifier. Family is heavyweight, qualifier lightweight
•Column-oriented physical store- rows are sparse!
•Does not support a relational model
•No table-wide integrity constraints
•No multirow transactions
SSTable
 Immutable, sorted file of key-value pairs
 Chunks of data plus an index
 Index is of block ranges, not values
(Figure: an SSTable composed of 64K blocks plus an index.)
Tablet
 Contains some range of rows of the table
 Built out of multiple SSTables
(Figure: a tablet covering the row range aardvark..apple, built from two SSTables, each made of 64K blocks plus an index.)
Table
 Multiple tablets make up the table
 SSTables can be shared
 Tablets do not overlap, SSTables can overlap
(Figure: two adjacent tablets, aardvark..apple and apple_two_E..boat, sharing one of their SSTables.)
Finding a tablet
(Figure: the three-level tablet location hierarchy, from Chubby to the METADATA tablets to user tablets.)
Servers
 Tablet servers manage tablets, multiple tablets per
server. Each tablet is 100-200 megs
 Each tablet lives at only one server
 Tablet server splits tablets that get too big
 Master responsible for load balancing and fault
tolerance
 Use Chubby to monitor health of tablet servers, restart
failed servers
 GFS replicates data. Prefer to start tablet server on same
machine that the data is already at
Editing a table
 Mutations are logged, then applied to an in-memory version (the memtable)
 Logfile stored in GFS
(Figure: inserts and deletes flow into the memtable of the apple_two_E..boat tablet, which sits above its SSTables.)
Compactions
 Minor compaction – convert the memtable into an
SSTable
 Reduce memory usage
 Reduce log traffic on restart
 Merging compaction
 Reduce number of SSTables
 Good place to apply policy “keep only N versions”
 Major compaction
 Merging compaction that results in only one SSTable
 No deletion records, only live data
Locality Groups
 Group column families together into an SSTable
 Avoid mingling data, i.e. page contents and page
metadata
 Can keep some groups all in memory
 Can compress locality groups
 Bloom Filters on locality groups – avoid searching
SSTable
Microbenchmarks
Application at Google
Lessons learned
 Interesting point: only implement some of the
requirements, since the last ones are probably not needed
 Many types of failure possible
 Big systems need proper systems-level monitoring
 Value simple designs
Questions and Answers !!!
173