Transcript Document

The Google File System
by S. Ghemawat, H. Gobioff, and S-T. Leung
CSCI 485 lecture by Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Primary Functionality of Google

Search content on the web in browsing mode.
Open world assumption: if your search with Google does not return results, it does not mean that the referenced content is nonexistent. It only means that Google did not know about it when the search was issued. Google may retrieve results if the search is issued again.
Do not index/find me: Google provides tags that enable an information provider to prevent Google from indexing its pages.
No one gets angry if Google does not retrieve information known to exist on the Internet.
How is this different than financial applications?

Functionality

IR: browse-mode search under the open world assumption, as described on the previous slide; no one gets angry if Google does not retrieve information known to exist on the Internet.
DB:
Query based: looking for a needle in a haystack.
Closed world assumption: a data item that is not known does not exist.
A query must retrieve correct results 100% of the time!
If a customer insists the bank cannot find his or her account because the customer has used Google's "do not find me" tags, the customer is kicked out!
Customers become angry if the system retrieves incorrect data.

Key Observation

IR:
Okay to return either no or incorrect results.
Acceptable for a user search to observe stale data.
DB:
Not okay to return incorrect results.
A transaction must observe consistent data.
SQL front end.

Big Picture

A shared-nothing architecture consisting of thousands of nodes! A node is an off-the-shelf, commodity PC.
Divide & Conquer across a layered software stack:
Yahoo's Pig Latin
Google's Map/Reduce Framework
Google's Bigtable Data Model
Google File System
…….
Source code for Pig and Hadoop, the open-source counterparts, is available for free download.

Data Shipping

The client retrieves the data from the node.
The client performs the computation f(x) locally.
Limitation: the servers stay dumb, and the transfer consumes the limited network bandwidth.
(Figure: the node transmits its data to the client, which then processes f(x).)

Function Shipping

The client ships the function f(x) to the node for processing.
Only the relevant data, the output of f(x), is sent to the client.
The function f(x) should produce less data than the original data stored in the database.
Minimizes demand for the network bandwidth.
(Figure: the node processes f(x) and transmits only the output of f(x) to the client.)

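To make the contrast concrete, here is a minimal sketch, not taken from the lecture: a hypothetical Node either ships all of its records to the client or evaluates a shipped function locally and returns only the result. All names (Node, read_all, run, f) are illustrative assumptions.

```python
# Minimal sketch contrasting data shipping and function shipping against a
# hypothetical storage node; Node, read_all, run, and f are assumed names.

class Node:
    """A storage node holding raw records."""
    def __init__(self, records):
        self.records = records

    def read_all(self):
        # Data shipping: every record crosses the network to the client.
        return list(self.records)

    def run(self, func):
        # Function shipping: the node evaluates func locally and returns
        # only its (hopefully much smaller) output.
        return func(self.records)

def f(records):
    # Example f(x): count the records that mention "gfs".
    return sum(1 for r in records if "gfs" in r)

node = Node(["gfs paper", "bigtable paper", "gfs lecture"])

# Data shipping: the client pulls all records, then computes locally.
data_shipping_result = f(node.read_all())

# Function shipping: only the scalar count travels back to the client.
function_shipping_result = node.run(f)

assert data_shipping_result == function_shipping_result == 2
```

In both cases the answer is the same; the difference is how many bytes cross the network, which is why function shipping pays off when f(x) produces far less data than it reads.
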
Google

An application (configured with the GFS client) may run on the same PC as the one hosting a chunkserver. Requirements:
Machine resources are not overwhelmed.
The lower reliability is acceptable.

References

Pig Latin: Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
MapReduce: Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.
Bigtable: Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
GFS: Ghemawat et al. The Google File System. SOSP 2003.

Overview: GFS

A highly available, distributed file system for inexpensive commodity PCs.
Treats node failures as the norm rather than the exception.
Stores and retrieves multi-GB files.
Assumes files are append-only (instead of updates that modify a certain piece of existing data).
Provides an atomic append operation to enable multiple clients to append to a file with minimal synchronization.
Uses a relaxed consistency model to simplify the file system and enhance performance.

GFS: Interfaces

Create, delete, open, close, read, and write files.
Snapshot a file: create a copy of the file.
Record append: allows multiple clients to append data to the same file concurrently, while guaranteeing the atomicity of each individual client's append.
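
As a rough illustration of the interface listed above, the stub below sketches what a client-side API could look like. GFS is not a public library, so every name and signature here is an assumption made for illustration, not the actual GFS client API.

```python
# Hypothetical sketch of the client-facing operations named above;
# all names and signatures are assumptions, not the real GFS API.

class GFSClient:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> "FileHandle": ...
    def close(self, handle: "FileHandle") -> None: ...
    def read(self, handle: "FileHandle", offset: int, length: int) -> bytes: ...
    def write(self, handle: "FileHandle", offset: int, data: bytes) -> None: ...

    def snapshot(self, source: str, destination: str) -> None:
        """Create a copy of a file (or directory tree) at low cost."""

    def record_append(self, handle: "FileHandle", data: bytes) -> int:
        """Append data atomically at least once; return the offset GFS chose."""
```
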
GFS: Architecture

1 Master and multiple chunkservers.
A file is partitioned into fixed-size (64 MB) chunks. (Unix allocates space lazily! Many small logical files are stored in one file.)
Each chunk has a 64-bit chunk handle that is globally unique.
Each chunk is replicated on several chunkservers; the degree of replication is application specific, and the default is 3.
Software:
The Master maintains all file system metadata: namespace, access control info, the mapping from files to chunks, and the current locations of chunks.
The GFS client caches metadata about the file system.
The client chooses one of the replicas and receives data from the chunkserver directly.
Neither the client nor the chunkserver caches file data.
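
The read path implied by this architecture can be sketched as follows; the helper names (lookup_chunk, read_chunk) are assumptions, not the real GFS RPCs. The point is that the master serves only metadata, while data flows directly from a replica of the client's choosing.

```python
# Minimal sketch (all names assumed) of the read path described above.
import random

CHUNK_SIZE = 64 * 2**20  # fixed-size 64 MB chunks

def read(master, chunkservers, path, offset, length):
    chunk_index = offset // CHUNK_SIZE
    # 1. Metadata from the master (the client may cache this answer).
    handle, replica_locations = master.lookup_chunk(path, chunk_index)
    # 2. The client chooses one of the replicas (randomly, for simplicity).
    server = chunkservers[random.choice(replica_locations)]
    # 3. Data flows from the chunkserver directly, never through the master.
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)
```
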
GFS Master

1 master simplifies the software design.
The master monitors the availability of chunkservers using heart-beat messages.
1 master is a single point of failure; it holds the file and chunk namespaces, the mapping from files to chunks, and the location of each chunk's replicas.
The master does not store chunk location information persistently: when the master is started, it asks each chunkserver about its chunks (and does so again whenever a chunkserver joins).
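
A minimal sketch, with assumed names, of the master state described above: the namespace and file-to-chunk mapping are the persistent part, while chunk locations live only in memory and are rebuilt from chunkserver reports and heart-beats.

```python
# Sketch (assumed names) of the master's in-memory state.
from collections import defaultdict
from typing import Dict, List, Set

class Master:
    def __init__(self):
        # Persistent (via the operation log): namespace and file-to-chunk mapping.
        self.namespace: Dict[str, List[int]] = {}   # pathname -> chunk handles
        # Not persistent: chunk handle -> chunkservers currently holding it.
        self.chunk_locations: Dict[int, Set[str]] = defaultdict(set)

    def handle_chunk_report(self, server_id: str, handles: List[int]) -> None:
        # Called when a chunkserver starts or joins, and on heart-beat replies.
        for h in handles:
            self.chunk_locations[h].add(server_id)

    def handle_missed_heartbeat(self, server_id: str) -> None:
        # The chunkserver is presumed down; forget its replicas.
        for servers in self.chunk_locations.values():
            servers.discard(server_id)
```
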
Mutation = Update

A mutation is an operation that changes the contents of either metadata (delete or create a file) or a chunk (append a record).
Content mutation:
Performed on all of a chunk's replicas.
The master grants a chunk lease to one of the replicas, the primary.
The primary picks a serial order for all mutations to the chunk.
Lease:
Granted by the master, typically for 60 seconds.
The primary may request extensions.
If the master loses communication with a primary, it can safely grant a new lease to another replica after the current lease expires.
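
The lease rule can be sketched as follows; LeaseManager and its methods are assumed names, not GFS code. The key point is that a new lease for a chunk is granted only after the previous one has expired, even if the old primary is merely unreachable.

```python
# Sketch (assumed names) of the master's lease handling described above.
import time

LEASE_SECONDS = 60

class LeaseManager:
    def __init__(self):
        self.leases = {}  # chunk handle -> (primary server, expiry time)

    def grant(self, handle, replicas, now=None):
        now = time.time() if now is None else now
        _, expiry = self.leases.get(handle, (None, 0.0))
        if now < expiry:
            # The old primary may merely be unreachable; granting a second
            # lease before expiry could yield two primaries for one chunk.
            raise RuntimeError("lease still outstanding; wait for expiry")
        primary = replicas[0]  # pick any replica as the new primary
        self.leases[handle] = (primary, now + LEASE_SECONDS)
        return primary

    def extend(self, handle, primary, now=None):
        # The current primary may request an extension before expiry.
        now = time.time() if now is None else now
        holder, expiry = self.leases.get(handle, (None, 0.0))
        if holder == primary and now < expiry:
            self.leases[handle] = (primary, now + LEASE_SECONDS)
            return True
        return False
```
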
Master & Logging

The master stores 3 types of metadata:
1. The file and chunk namespaces,
2. The mapping from files to chunks,
3. The locations of each chunk's replicas.
The first two types are kept persistent by logging mutations (updates) to an operation log stored on the master's local disk, and by replicating the operation log on multiple machines.
What is required to support logging? Uniquely identifying transactions and data items, and checkpointing:
Files and chunks, as well as their versions, are uniquely identified by the logical times at which they were created.
GFS responds to a client operation only after flushing the log record to disk both locally and remotely.
After a failure, during the recovery phase, the master recovers its file system state by replaying the operation log.
Checkpoints are fuzzy.
The master maintains a few older checkpoints and log files, deleting the prior ones.
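
A minimal sketch, with assumed names, of the logging discipline above: every metadata mutation is appended to the operation log, flushed locally, and shipped to remote replicas before the master acknowledges the client; recovery replays the log.

```python
# Sketch (assumed names) of a write-ahead operation log with remote replication.
import json
import os

class OperationLog:
    def __init__(self, path, remote_replicas):
        self.file = open(path, "a+", encoding="utf-8")
        self.remote_replicas = remote_replicas  # hypothetical replica stubs

    def append(self, record: dict) -> None:
        line = json.dumps(record) + "\n"
        self.file.write(line)
        self.file.flush()
        os.fsync(self.file.fileno())           # flush locally ...
        for replica in self.remote_replicas:   # ... and remotely,
            replica.append(line)               # before the master replies.

    def replay(self):
        # Recovery: replay every record to rebuild the namespace and mappings.
        self.file.seek(0)
        for line in self.file:
            yield json.loads(line)
```

For example, log.append({"op": "create", "path": "/home/user/foo"}) would have to complete before the corresponding create operation is acknowledged to the client.
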
Master & Locking

Namespace management: GFS represents its namespace as a lookup table mapping full pathnames to metadata.
/d1/d2/…/dn/fileA consists of the following pathnames: /d1, /d1/d2, …, /d1/d2/…/dn, /d1/d2/…/dn/fileA.
Each node in the namespace tree has an associated read-write lock.
Each master operation acquires a set of locks before it can perform its read/mutation operation: typically, an operation involving /d1/d2/…/dn/fileA will acquire read locks on /d1, /d1/d2, …, /d1/d2/…/dn and either a read or a write lock on /d1/d2/…/dn/fileA.
A read lock is the same as a Shared lock; a write lock is the same as an eXclusive lock.
Example

Operation 1: Copy directory /home/user to /save/user.
Operation 2: Create /home/user/foo.
Could they have used IS and IX locks?
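
A small sketch of the lock sets involved in this example, using a hypothetical helper locks_for rather than GFS code. Following the rule on the previous slide, an operation takes read locks on every proper prefix of a pathname and a read or write lock on the full pathname.

```python
# Sketch (assumed helper) of the lock sets for the two operations above.
def locks_for(path: str, leaf_mode: str):
    """Return [(pathname, 'R' or 'W'), ...] needed by an operation on path."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "R") for p in prefixes] + [(path, leaf_mode)]

# In the paper's version of this example, the copy write-locks /home/user and
# /save/user, while the create needs a read lock on /home/user; the W vs. R
# conflict on /home/user serializes the two operations.
print(locks_for("/home/user", "W"))
# [('/home', 'R'), ('/home/user', 'W')]
print(locks_for("/home/user/foo", "W"))
# [('/home', 'R'), ('/home/user', 'R'), ('/home/user/foo', 'W')]
```
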
Atomic Record Appends

Background:
With traditional writes, a client specifies the offset at which data is to be written; GFS cannot serialize concurrent writes to the same region.
With record append, the client specifies only the data. GFS appends the record to the file at least once atomically at an offset of GFS's choosing and returns that offset to the client.
What does "atomically" mean? The record is written as one sequence of bytes.
Does GFS write the record partially? Yes, a record might be written partially to a file replica.
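
From the client's point of view, the at-least-once guarantee typically turns into a retry loop like the sketch below; the client object and its record_append method are assumed names. A retried append can leave duplicates of the record in the file, which is consistent with the guarantee above.

```python
# Sketch (client API names assumed) of the at-least-once contract above.
def record_append_with_retry(client, handle, record: bytes, max_retries: int = 5) -> int:
    for _ in range(max_retries):
        try:
            # GFS, not the caller, picks the offset and returns it.
            return client.record_append(handle, record)
        except IOError:
            continue  # a replica failed; retry the whole append
    raise IOError("record append failed after retries")
```
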
Atomic Record Appends

How? First, discuss how regular chunk mutations (updates) are supported.
Atomic Record Appends: How?

The client:
Pushes the data to all replicas of the last chunk of the file.
Sends its write request to the primary.
The primary appends the data to its replica and tells the secondaries to write the data at the exact offset where it has written. If all secondaries succeed, the primary replies success to the client.
If a record append fails at any replica, the primary reports an error and the client retries the operation.
One or more of the replicas may have succeeded fully (or written partially) → replicas of the same chunk may contain different data, including duplicates of the same record.
GFS does not guarantee that all replicas are bytewise identical. GFS guarantees that the record is written at the same offset at least once in its entirety (as an atomic unit).
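
A sketch, with assumed names, of the primary's side of the protocol just described: the primary chooses the offset, writes locally, and instructs each secondary to write the same record at exactly that offset, reporting an error (and forcing a client retry) if any secondary fails.

```python
# Sketch (all names assumed) of the primary's role in record append.
def primary_record_append(primary, secondaries, handle, record: bytes) -> int:
    offset = primary.chunk_length(handle)        # the primary picks the offset
    primary.write_at(handle, offset, record)     # apply locally first
    ok = all(s.write_at(handle, offset, record)  # same offset on every replica
             for s in secondaries)
    if not ok:
        # Some replicas may now hold a partial or duplicate copy of the record;
        # the primary reports the error and the client retries the append.
        raise IOError("append failed at a secondary; client must retry")
    return offset                                # returned to the client
```
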
Summary

File namespace mutations are managed by requiring the Master to implement ACID properties: locking guarantees atomicity, consistency, and isolation; the operation log provides durability.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations:
A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
A file region is defined after a mutation if it is consistent and clients will see what the mutation writes in its entirety.
When a mutation succeeds without interference from concurrent writers, the affected region is defined, and by implication consistent.
Concurrent successful mutations leave the region consistent but undefined: all clients see the same data, but it may consist of mingled fragments from multiple mutations.