GFS
- Slides by Jatin
Ideology
• Huge amount of data
• Ability to efficiently access data
• Large quantity of cheap machines
• Component failures are the norm rather than the exception
• Atomic append operation so that multiple
clients can append concurrently
Files in GFS
• Files are huge by traditional standards
• Most files are mutated by appending new data
rather than overwriting existing data
• Once written, the files are only read, and
often only sequentially.
• Appending becomes the focus of performance
optimization and atomicity guarantees
Architecture
• GFS cluster consists of a single master and multiple chunk servers
and is accessed by multiple clients.
• Each of these is typically a commodity Linux machine running a
user-level server process.
• Files are divided into fixed-size chunks identified by an immutable
and globally unique 64 bit chunk handle
• For reliability, each chunk is replicated on multiple chunk servers
• The master maintains all file system metadata.
• The master periodically communicates with each chunk server in
HeartBeat messages to give it instructions and collect its state
• Neither the client nor the chunk server caches file data, eliminating
cache coherence issues.
• Clients do cache metadata, however.
Read Process
• Single master vastly simplifies design
• Clients never read and write file data through the master. Instead, a
client asks the master which chunk servers it should contact.
• Using the fixed chunk size, the client translates the file name and
byte offset specified by the application into a chunk index within
the file
• It sends the master a request containing the file name and chunk
index. The master replies with the corresponding chunk handle and
locations of the replicas. The client caches this information using
the file name and chunk index as the key.
• The client then sends a request to one of the replicas, most likely
the closest one. The request specifies the chunk handle and a byte
range within that chunk
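A minimal sketch of this read path in Python, assuming hypothetical master.lookup() and chunkserver read_chunk() RPCs (the names and interfaces are illustrative, not the real GFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

class GFSClient:
    def __init__(self, master, chunkservers):
        self.master = master              # exposes lookup(filename, chunk_index) -> (handle, replica_addrs)
        self.chunkservers = chunkservers  # address -> object exposing read_chunk(handle, offset, length)
        self.cache = {}                   # (filename, chunk_index) -> (handle, replica_addrs)

    def read(self, filename, offset, length):
        # Translate the application's byte offset into a chunk index.
        # (A read crossing a chunk boundary would be split into several requests; not shown.)
        chunk_index = offset // CHUNK_SIZE
        chunk_offset = offset % CHUNK_SIZE

        # Ask the master only for metadata, caching the reply keyed by
        # file name and chunk index.
        key = (filename, chunk_index)
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]

        # Read the byte range from one replica (ideally the closest; here the first).
        return self.chunkservers[replicas[0]].read_chunk(handle, chunk_offset, length)
```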
Specifications
• Chunk Size = 64 MB
• Chunks stored as plain Unix files on chunk server.
• The client keeps a persistent TCP connection to the chunk server
over an extended period of time (reduces network overhead)
• The client caches all the chunk location information to
facilitate small random reads.
• Master keeps the metadata in memory
• Disadvantage – small files may become hot spots.
• Solution – higher replication for such files.
Metadata
• File and chunk namespaces
• The mapping from files to chunks
• The locations of each chunk’s replicas
• Namespaces and file-to-chunk mapping are also
kept persistent by logging mutations to an
operation log stored on the master’s local disk
and replicated on remote machines
• Master does not store chunk location information
persistently, instead asks each chunk server about
its chunks at master startup and whenever a
chunkserver joins the cluster.
In-Memory Data Structures
• Tasks to be kept in mind:
– Chunk garbage collection
– re-replication in presence of chunk server failures
– chunk migration to balance load and disk space usage
across chunkservers
• The master maintains less than 64 bytes of metadata
for each 64 MB chunk
• The file namespace data typically requires less than 64
bytes per file because it stores file names compactly
using prefix compression
• Thus the memory capacity of the master is not a serious limitation.
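A minimal sketch of the master's in-memory tables implied above; the class and field names are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 0                                   # chunk version number
    replicas: list = field(default_factory=list)       # chunkserver addresses (not persisted)

@dataclass
class FileInfo:
    chunk_handles: list = field(default_factory=list)  # ordered chunk handles of the file

class MasterMetadata:
    def __init__(self):
        self.namespace = {}   # full pathname -> FileInfo (persisted via the operation log)
        self.chunks = {}      # chunk handle -> ChunkInfo (locations rebuilt from chunkservers)
        self.next_handle = 1

    def add_chunk(self, path):
        # Allocate a new immutable, globally unique handle for a chunk of this file.
        handle = self.next_handle
        self.next_handle += 1
        self.namespace.setdefault(path, FileInfo()).chunk_handles.append(handle)
        self.chunks[handle] = ChunkInfo()
        return handle
```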
Chunk Locations
• Master simply polls chunkservers for that
information at startup and periodically thereafter
• This eliminates the problem of keeping the
master and chunkservers in sync as chunkservers
join and leave the cluster, change names, fail,
restart, and so on
• A chunkserver has the final word over what chunks
it does or does not have on its own disks.
• A consistent view at the Master need not be
maintained.
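A sketch of how the master could rebuild its replica map from a chunkserver's report at startup or in a HeartBeat; the report shape and data structures here are assumptions:

```python
def apply_chunkserver_report(replica_map, chunkserver_addr, reported_handles):
    """replica_map: chunk handle -> set of chunkserver addresses.
    The chunkserver's report is authoritative for what it stores."""
    reported = set(reported_handles)
    for handle, servers in replica_map.items():
        if handle in reported:
            servers.add(chunkserver_addr)       # the server (still) holds this chunk
        else:
            servers.discard(chunkserver_addr)   # the server no longer holds this chunk
    # Reported handles the master does not know about are orphans and become
    # candidates for garbage collection (not shown).

# Usage:
# replica_map = {7: set(), 9: {"cs2"}}
# apply_chunkserver_report(replica_map, "cs1", [7, 9])   # cs1 now listed for 7 and 9
```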
Operation Log
• contains a historical record of critical metadata changes
• central to GFS
• serves as a logical time line that defines the order of
concurrent operations
• must store it reliably and not make changes visible to
clients until metadata changes are made persistent
• respond to a client operation only after flushing the
corresponding log record to disk both locally and remotely.
• The master checkpoints its state whenever the log grows
beyond a certain size.
• The checkpoint is in a compact B-tree like form
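A minimal sketch of the "flush locally and remotely before replying" rule; the record format and the remote-replica interface are assumptions:

```python
import json, os

class OperationLog:
    def __init__(self, path, remote_replicas):
        self.f = open(path, "ab")
        self.remote_replicas = remote_replicas   # objects exposing append_and_flush(record)

    def commit(self, mutation):
        record = (json.dumps(mutation) + "\n").encode()
        # 1. Append and flush locally so the mutation survives a master crash.
        self.f.write(record)
        self.f.flush()
        os.fsync(self.f.fileno())
        # 2. Replicate the record to remote machines before it takes effect.
        for replica in self.remote_replicas:
            replica.append_and_flush(record)
        # Only now may the master apply the change and respond to the client.
```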
Consistency Model
• Relaxed
• Depending on its outcome, a mutation can leave a file region
in different states (consistent, defined, or inconsistent).
• GFS achieves “defined” status by
– (a) applying mutations to a chunk in the same
order on all its replicas
– (b) using chunk version numbers to detect any
replica that has become stale because it has
missed mutations while its chunkserver was down
• Problem: clients cache chunk locations, so they may read from a
stale replica.
• Solution: the cache timeout and the append-only access pattern limit
this window (a stale replica typically returns a premature end of
chunk rather than outdated data).
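The version-number check in (b) above can be sketched as follows; the master bumps a chunk's version when it grants a new lease, and a replica whose version lags is treated as stale. The data structures are illustrative:

```python
def grant_lease(master_version, replica_versions, handle, live_replica_addrs):
    """Bump the chunk's version on a new lease and inform the up-to-date replicas."""
    master_version[handle] = master_version.get(handle, 0) + 1
    for addr in live_replica_addrs:
        replica_versions[(addr, handle)] = master_version[handle]

def fresh_replicas(master_version, replica_versions, handle, replica_addrs):
    """A replica whose version lags the master's missed mutations while its
    chunkserver was down; it is considered stale and later garbage collected."""
    current = master_version[handle]
    return [a for a in replica_addrs
            if replica_versions.get((a, handle), -1) >= current]
```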
Failure Identification
• Regular handshakes
• Checksum data.
• Versioning of Chunks.
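For the checksum bullet, a minimal sketch of per-block checksumming as the paper describes it (32-bit checksums over 64 KB blocks of a chunk); CRC-32 and the function names are illustrative choices:

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: 64 KB blocks of a chunk

def block_checksums(chunk_bytes):
    return [zlib.crc32(chunk_bytes[i:i + BLOCK])
            for i in range(0, len(chunk_bytes), BLOCK)]

def verify_range(chunk_bytes, checksums, offset, length):
    """Verify every block overlapping the requested range before returning data,
    so corruption is not propagated to readers or to other chunkservers."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_bytes[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}: report to the master, read another replica")
    return chunk_bytes[offset:offset + length]
```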
Applications
• Make applications mutate files by appending
rather than overwriting.
• Checksum each record prepared by the writer (see the
record-framing sketch after this slide).
• The reader has the ability to identify and discard
extra padding.
• The master grants a chunk lease to one of the replicas, which
we call the primary. The primary picks a serial order for all
mutations to the chunk.
• If a client request fails, the modified region at the primary is
left in an inconsistent state.
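A sketch of the record framing referred to above: the writer checksums each record, and the reader validates records and skips padding or garbage between them. The header layout (magic byte, length, CRC-32) is an assumption; GFS leaves the format to applications:

```python
import struct, zlib

MAGIC = 0xA5                      # marks the start of a record (illustrative)
HEADER = struct.Struct("<BII")    # magic byte, payload length, CRC-32 of payload

def frame_record(payload: bytes) -> bytes:
    """Writer side: prefix each record with its length and checksum."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload

def read_records(blob: bytes):
    """Reader side: yield valid records, resynchronizing past padding,
    partial writes, or corrupt bytes."""
    pos = 0
    while pos + HEADER.size <= len(blob):
        magic, length, crc = HEADER.unpack_from(blob, pos)
        payload = blob[pos + HEADER.size: pos + HEADER.size + length]
        if magic == MAGIC and len(payload) == length and zlib.crc32(payload) == crc:
            yield payload
            pos += HEADER.size + length
        else:
            pos += 1              # padding or damaged data: skip a byte and retry
```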
Namespace Management and Locking
• GFS does not have a per-directory data
structure that lists all the files in that directory
• GFS logically represents its namespace as a
lookup table mapping full pathnames to
metadata
• Whenever a modification needs to be made, every node on
the path to the file is locked: read locks on all ancestor
directory names, and a write lock on the full pathname
itself.
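A sketch of the lock set for a mutation on the flat pathname table; the helper below only computes which locks are needed (any reader-writer lock implementation would do):

```python
def locks_for_mutation(path):
    """For /d1/d2/.../leaf: read locks on every ancestor directory name,
    and a write lock on the full pathname being mutated."""
    parts = [p for p in path.strip("/").split("/") if p]
    read_locks = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    write_locks = [path]
    return read_locks, write_locks

# Example: creating /home/user/foo takes read locks on /home and /home/user and
# a write lock on /home/user/foo, so it serializes with any operation that
# write-locks /home/user (e.g. snapshotting that directory).
print(locks_for_mutation("/home/user/foo"))
```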
FAULT TOLERANCE
• Achieved by:
– Fast Recovery
– Chunk Replication
– Master Replication
– “shadow” masters (read-only access to the file system)
• If the master’s machine or disk fails, monitoring
infrastructure outside GFS starts a new master process
elsewhere with the replicated operation log. Clients
use only the canonical name of the master (e.g. gfs-test),
which is a DNS alias that can be changed if the
master is relocated to another machine.