The Google File System
【 Ghemawat, Gobioff, Leung 】
2014. 3. 24
Presenter: Seikwon Kim @ KAIST
Contents
Introduction
Ⅰ Design Overview
Ⅱ System Interactions
Ⅲ Master Operation
Ⅳ Fault Tolerance
Ⅴ Conclusion
【 Introduction 】
What is GFS?
- Distributed file system
- Goal: performance, scalability, reliability, availability
Why GFS?
- To meet Google’s data processing needs
- Different in design assumptions
Component failures are the norm
Files are huge
Files are mutated by appending rather than overwriting
Co-designing the applications and the file system API
2014 Internet System Technology
3/27
Ⅰ
Design Overview
1.1 Assumptions
1.2 Interface
1.3 Architecture
1.4 Chunk
1.5 Metadata
1.6 Consistency Model
1.1 Assumptions
Design Overview
Basic Assumptions
Built from inexpensive commodity components that fail often
Optimized for a modest number of large files; small files are supported but not optimized for
Workloads
- Large streaming reads
- Small random reads
- Large sequential appends
High sustained bandwidth matters more than low latency
1.2 Interface
Design Overview
create
delete
open
close
read
write
snapshot : Copies a file or directory tree at low cost
record append : Lets multiple clients append data to the same file concurrently
Features
POSIX-like API, but not POSIX-compliant
Files are organized hierarchically in directories.
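The contrast between offset-based write and record append can be shown with a toy in-memory sketch (the `MockGFSFile` class and its method semantics are invented for illustration, not the real GFS API): with `write` the client chooses the offset, while with `record_append` the system chooses it and returns it.

```python
class MockGFSFile:
    """Illustrative in-memory stand-in for a GFS file (not the real API)."""

    def __init__(self):
        self.data = bytearray()

    def write(self, offset, payload):
        # Traditional write: the client specifies the offset.
        end = offset + len(payload)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

    def record_append(self, payload):
        # Record append: the system chooses the offset and returns it,
        # which is how concurrent appenders learn where their record landed.
        offset = len(self.data)
        self.data.extend(payload)
        return offset
```

The returned offset is what lets many concurrent appenders share one file without coordinating offsets among themselves.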
1.3 Architecture
Design Overview
Architecture Overview
Single Master, Multiple Chunk Servers, Multiple Clients
Files are divided into 64MB chunks
Chunks are replicated in multiple chunk servers : Default 3
Master communicates with chunk servers via periodic HeartBeat messages
Master maintains all file system metadata
No file data caching (on clients or chunk servers), but clients do cache metadata
1.4 Chunk
Design Overview
Unit of data stored in GFS
Large chunk size is key design parameter : 64MB
Chunk replica is stored as a plain Linux file on a chunk server
Pros of Large Chunk Size
Minimize interaction between client and master
Reduce network overhead
Reduce metadata size stored in master.
Cons of Large Chunk Size
A chunk server can become a hot spot when many clients access the same chunk
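Because the chunk size is fixed, a client can translate any byte offset into a chunk index with simple arithmetic, then ask the master only for that chunk's handle and replica locations. A minimal sketch (the function name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(byte_offset):
    # The client computes the chunk index locally from the file offset;
    # this is what keeps client-master interaction minimal.
    return byte_offset // CHUNK_SIZE
```

The client caches the (chunk index → chunk handle, locations) mapping, so subsequent reads in the same chunk need no master round-trip at all.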
1.5 Metadata
Design Overview
Metadata Types:
File and chunk namespaces : Persistent
Mapping from files to chunks : Persistent
Locations of chunk replicas : Not Persistent
All metadata is in memory.
Why Stored in Memory?
Fast Master Operations
Efficient to periodically scan the entire state in the background
Chunk garbage collection
Re-replication
Chunk migration
Low cost of adding Extra Memory
1.5 Metadata Cont.
Design Overview
Chunk locations
Chunk locations are not persistent
Master polls chunk servers for them at startup
Kept up to date via HeartBeat messages
Operation Logs
Persistent historical record of critical metadata changes
Also serves as the logical timeline defining the order of concurrent operations
If the master fails, it recovers by replaying the operation log
Checkpoints keep the log small and speed up recovery
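Recovery from a checkpoint plus log replay can be sketched as follows (the record format and operation names here are invented for illustration; the real log records namespace and file-to-chunk mutations):

```python
def recover(checkpoint, log):
    """Rebuild master state from the last checkpoint plus the
    operation-log records written after that checkpoint."""
    state = dict(checkpoint)      # start from the checkpointed state
    for op, path, value in log:   # replay mutations in log order
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state
```

Checkpointing matters because replay time grows with log length; a fresh checkpoint bounds how much log must be replayed after a crash.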
1.6 Consistency Model
Design Overview
Relaxed Consistency Model
Guarantees atomic file-namespace mutations
Levels of Consistency on Data
Inconsistent : Different clients see different data
Consistent : All clients see the same data, regardless of replica
Defined : Consistent, and clients see what the mutation wrote in its entirety
What the Application Has to Do
Append rather than overwrite
Self-validate records (e.g. with checksums)
Ⅱ
System Interactions
2.1 Write Control
2.2 Data Flow
2.3 Atomic Record Append
2.4 Snapshot
2.1 Write Control
System Interactions
Lease
•
Granted by the master to one chunk replica, which becomes the primary
Mutation
•
Operation that changes the contents or metadata of a chunk
Write Process
•
The primary assigns a serial order to all mutations; secondaries apply them in the same order
2.2 Data Flow
System Interactions
Data is pushed linearly along a chain of chunk servers
Each machine forwards the data to the closest machine that has not yet received it
Distances are estimated from IP addresses
Network Construction in GFS
Linear chain (not a tree) : uses each machine's full outbound bandwidth
Pipelining : a chunk server starts forwarding as soon as data arrives, minimizing latency and maximizing throughput
Elapsed time for transferring
Elapsed time = B/T + RL
•
B : bytes to transfer
•
T : network throughput
•
R : # of replicas
•
L : per-hop latency between machines
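Plugging numbers into the formula: with a 100 Mbps (12.5 MB/s) full-duplex link and roughly 1 ms per-hop latency, 1 MB reaches 3 replicas in about 80 ms. A quick sketch:

```python
def transfer_time(b, t, r, l):
    # Ideal pipelined transfer: B/T for pushing the byte stream itself,
    # plus R*L of per-hop latency along the replica chain.
    return b / t + r * l

# 1 MB, 12.5 MB/s (100 Mbps), 3 replicas, ~1 ms per hop
t = transfer_time(1e6, 12.5e6, 3, 1e-3)  # ~0.083 s
```

Note how the R·L term stays tiny relative to B/T for large transfers, which is why pipelining makes replication nearly free in elapsed time.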
2.3 Atomic Record Appends
System Interactions
In Traditional Writes
Client specifies the offset at which the data is to be written
•
Concurrent writes to the same region can leave the data fragmented
In Record Append
Client specifies only the data; GFS chooses and returns the offset
Record Append Process
Much like a write in GFS
GFS appends the data to the file at least once atomically
The chunk is padded when the record does not fit in the remaining space
The client retries the append when an error occurs
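The padding rule can be sketched as a check the primary might run (this helper is invented for illustration, not GFS code): if the record does not fit in the chunk's remaining space, the remainder is padded and the client retries on the next chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk

def try_append(chunk_used, record_len):
    """Illustrative sketch of the primary's record-append decision.
    Returns ("pad_and_retry", pad_bytes) if the record would cross the
    chunk boundary, else ("appended_at", offset)."""
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of the chunk; the client will retry on a new chunk.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used)
    return ("appended_at", chunk_used)
```

Padding instead of splitting is what keeps each record atomic: a record never straddles two chunks.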
2.4 Snapshot
System Interactions
Instant File/Directory Copy
Snapshot Process
① Master receives a snapshot request
② Master revokes outstanding leases on the affected chunks
③ Master logs the operation
④ Master duplicates the metadata; the new snapshot points to the same chunks as the source
⑤ A shared chunk is duplicated locally (copy-on-write) when the next write to it arrives
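The copy-on-write step can be sketched with reference counts (all names here are invented for illustration): a chunk still referenced by a snapshot is duplicated only when a write actually touches it.

```python
def write_after_snapshot(chunk, ref_counts, chunks):
    """Copy-on-write sketch: if a snapshot still references this chunk
    (count > 1), duplicate it locally before mutating; otherwise the
    write can go to the chunk in place."""
    if ref_counts[chunk] > 1:
        new = chunk + "-copy"            # illustrative new chunk handle
        chunks[new] = bytes(chunks[chunk])  # local duplication
        ref_counts[chunk] -= 1
        ref_counts[new] = 1
        return new                       # writes now target the copy
    return chunk
```

Deferring the copy until a write arrives is what makes taking the snapshot itself near-instant: only metadata is duplicated up front.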
Ⅲ
Master Operation
3.1 Namespace Management
3.2 Replica Placement
3.3 Creation, Re-replication, Rebalancing
3.4 Garbage Collection
3.5 Stale Replica Detection
3.1 Namespace Management
Master Operation
Allows multiple master operations to run at the same time
Each master operation acquires locks over a region of the namespace
Namespace is a lookup table mapping full pathnames to metadata, with prefix compression
Locking Example
Snapshot of /home/user
read-lock on /home
write-lock on /home/user
File creation for /home/user/foo
read-lock on /home
read-lock on /home/user
write-lock on /home/user/foo
The two operations conflict on /home/user (write-lock vs. read-lock)
The conflict serializes them
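The locking pattern above — read-locks on every ancestor directory plus one lock on the target — can be sketched as follows (the function is invented for illustration):

```python
def locks_for(path, op):
    """Sketch of GFS-style namespace locking: read-locks on each
    ancestor directory, plus `op` ("read" or "write") on the target."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [("read", a) for a in ancestors] + [(op, path)]
```

Two operations conflict exactly when they request incompatible locks on the same path, so a snapshot's write-lock on /home/user blocks a concurrent creation that read-locks /home/user.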
3.2 Replica Placement
Master Operation
Purpose of Replica Placement
Maximize data reliability and availability
Maximize bandwidth utilization
Spread Chunk Replicas across Racks
Data stays available even if an entire rack fails (e.g. a shared power circuit or switch)
3.3 Creation, Re-replication, Rebalancing
Master Operation
Movements for Chunk Replicas
Chunk creation
Re-replication
Rebalancing
Creation
Place new replicas on chunk servers with below-average disk utilization
Spread replicas across racks
Re-replication
Re-replicate a chunk as soon as the number of available replicas falls below the goal
Rebalancing
Master periodically examines replica distribution and moves replicas for better disk space and load balance
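A creation-time placement heuristic along these lines might look like the sketch below (the server records and field names are invented; the real master also considers, e.g., how many recent creations each server received):

```python
def pick_servers(servers, n=3):
    """Sketch of creation-time placement: prefer chunk servers with
    below-average disk utilization, then spread picks across racks."""
    avg = sum(s["util"] for s in servers) / len(servers)
    candidates = sorted((s for s in servers if s["util"] <= avg),
                        key=lambda s: s["util"])
    chosen, racks = [], set()
    for s in candidates:
        if s["rack"] not in racks:   # one replica per rack
            chosen.append(s["name"])
            racks.add(s["rack"])
        if len(chosen) == n:
            break
    return chosen
```

Combining the two criteria equalizes disk usage over time while keeping the rack-failure guarantee from the replica-placement slide.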
3.4 Garbage Collection
Master Operation
Garbage Collection in GFS
Garbage Collection + Delete Process
Garbage Collection Process
Delete Process
① User deletes a file
② Master renames the file to a hidden name
③ During the master's regular scan, hidden files older than the grace period (default: three days) are removed
Regular Garbage Collection
① Chunk servers report their chunk sets in regular HeartBeat messages
② Master compares the reported chunks with its own metadata
③ Chunk servers delete the chunks the master no longer maps to any file (orphaned chunks)
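The orphan check reduces to a set difference between the chunks a server reports and those the master's metadata still maps to files. A sketch (names illustrative):

```python
def orphaned_chunks(reported, master_metadata):
    """Chunks a chunk server reports that the master no longer maps to
    any file are orphans; the master tells the server to delete them."""
    known = set(master_metadata)
    return [c for c in reported if c not in known]
```

Piggybacking this exchange on HeartBeat messages is what makes garbage collection a cheap background activity rather than a separate protocol.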
3.5 Stale Replica Detection
Master Operation
How Stale Replicas Arise
A replica misses mutations while its chunk server is down
Master Maintains a Version Number for Each Chunk
Distinguishes up-to-date replicas from stale ones
Stale Replicas Are Removed in the Regular Garbage Collection
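The version comparison itself is simple; a sketch (all names invented for illustration):

```python
def stale_replicas(master_version, replica_versions):
    """Replicas whose chunk version number lags the master's are stale:
    they missed at least one mutation while their server was down."""
    return [srv for srv, v in replica_versions.items()
            if v < master_version]
```

The master bumps the version number whenever it grants a new lease, so any replica that was unreachable at that moment is detectably behind.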
Ⅳ
Fault Tolerance
4.1 High Availability
4.2 Data Integrity
4.1 High Availability
Fault Tolerance
Fast Recovery
Master and chunk servers are designed to restart and recover state in seconds
Chunk Replication
Master clones replicas as needed
Master Replication
Master state replicated synchronously
Shadow masters for read-only
For simplicity, only one master process is active at a time; its restart is fast
4.2 Data Integrity
Fault Tolerance
Checksumming to Detect Data Corruption
Each chunk is broken into 64KB blocks, each with its own checksum
Checksums are kept in memory and stored persistently on disk
On a checksum mismatch, the chunk server returns an error to the requestor and reports it to the master
The requestor reads from other replicas
The master re-replicates the chunk, then instructs the server to delete the corrupt replica
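Block-level verification on read can be sketched as follows (CRC-32 is used here as a stand-in checksum; the 64KB block size matches the paper, but the function itself is illustrative):

```python
import zlib

BLOCK = 64 * 1024  # each 64 KB block of a chunk carries its own checksum

def verify(chunk_data, checksums):
    """Verify each 64 KB block against its stored checksum.
    Returns the index of the first corrupt block, or None if all match."""
    for i in range(0, len(chunk_data), BLOCK):
        block = chunk_data[i:i + BLOCK]
        if zlib.crc32(block) != checksums[i // BLOCK]:
            return i // BLOCK
    return None
```

Checking only the blocks that overlap a read keeps verification cheap, and because each server checksums independently, corruption on one replica never silently propagates to the others.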
Ⅴ
Conclusion
5 Conclusion
Conclusion
Supports Large-scale Data Processing Workloads on Commodity Hardware.
Provides Fault Tolerance
By constant monitoring
By replicating crucial data
By fast and automatic recovery
Delivers High Throughput to Concurrent Clients