The Google File System
【 Ghemawat, Gobioff, Leung 】
2014. 3. 24
Presenter: Seikwon Kim @ KAIST
Contents
Introduction
Ⅰ Design Overview
Ⅱ System Interactions
Ⅲ Master Operation
Ⅳ Fault Tolerance
Ⅴ Conclusion
【 Introduction 】
What is GFS?
- Distributed file system
- Goal: performance, scalability, reliability, availability
Why GFS?
- To meet Google’s data processing needs
- Different in design assumptions
Component failures are the norm
Files are huge
Files are mutated by appending rather than overwriting
Co-designing the applications and the file system API
2014 Internet System Technology
3/27
Ⅰ
Design Overview
1.1 Assumptions
1.2 Interface
1.3 Architecture
1.4 Chunk
1.5 Metadata
1.6 Consistency Model
1.1 Assumptions
Design Overview
Basic Assumptions
Built from inexpensive commodity components that fail often
Optimized for a modest number of large files; small files are supported but not optimized for
Workloads
- Large streaming reads
- Small random reads
- Large sequential appends
High sustained bandwidth matters more than low latency
1.2 Interface
Design Overview
create
delete
open
close
read
write
snapshot : Copies a file or directory tree at low cost
record append : Lets multiple clients append data to the same file concurrently
Features
POSIX-like API, but not POSIX-compliant
Files are organized hierarchically in directories.
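The contrast between offset-based write and record append can be shown with a toy in-memory sketch (the `MockGFSFile` class and its method semantics are invented for illustration, not the real GFS API): with `write` the client chooses the offset, while with `record_append` the system chooses it and returns it.

```python
class MockGFSFile:
    """Illustrative in-memory stand-in for a GFS file (not the real API)."""

    def __init__(self):
        self.data = bytearray()

    def write(self, offset, payload):
        # Traditional write: the client specifies the offset.
        end = offset + len(payload)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

    def record_append(self, payload):
        # Record append: the system chooses the offset and returns it,
        # which is how concurrent appenders learn where their record landed.
        offset = len(self.data)
        self.data.extend(payload)
        return offset
```

The returned offset is what lets many concurrent appenders share one file without coordinating offsets among themselves.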
1.3 Architecture
Design Overview
Architecture Overview
Single Master, Multiple Chunk Servers, Multiple Clients
Files are divided into 64MB chunks
Chunks are replicated in multiple chunk servers : Default 3
Master communicates with chunk servers via periodic HeartBeat messages
Master maintains all file system metadata
No file data caching (on clients or chunk servers), but clients do cache metadata
1.4 Chunk
Design Overview
Unit of data stored in GFS
Large chunk size is key design parameter : 64MB
Chunk replica is stored as a plain Linux file on a chunk server
Pros of Large Chunk Size
Minimize interaction between client and master
Reduce network overhead
Reduce metadata size stored in master.
Cons of Large Chunk Size
A chunk server can become a hot spot when many clients access the same chunk
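Because the chunk size is fixed, a client can translate any byte offset into a chunk index with simple arithmetic, then ask the master only for that chunk's handle and replica locations. A minimal sketch (the function name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(byte_offset):
    # The client computes the chunk index locally from the file offset;
    # this is what keeps client-master interaction minimal.
    return byte_offset // CHUNK_SIZE
```

The client caches the (chunk index → chunk handle, locations) mapping, so subsequent reads in the same chunk need no master round-trip at all.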
1.5 Metadata
Design Overview
Metadata Types:
File and chunk namespaces : Persistent
Mapping from files to chunks : Persistent
Locations of chunk replicas : Not Persistent
All metadata is in memory.
Why Stored in Memory?
Fast Master Operations
Efficient to periodically scan the entire state in the background
Chunk garbage collection
Re-replication
Chunk migration
Low cost of adding Extra Memory
1.5 Metadata Cont.
Design Overview
Chunk locations
Chunk locations are not persistent
Master polls chunk servers for them at startup
Kept up to date via HeartBeat messages
Operation Logs
Persistent historical record of critical metadata changes
Also serves as the logical timeline defining the order of concurrent operations
If the master fails, it recovers by replaying the operation log
Checkpoints keep the log small and speed up recovery
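Recovery from a checkpoint plus log replay can be sketched as follows (the record format and operation names here are invented for illustration; the real log records namespace and file-to-chunk mutations):

```python
def recover(checkpoint, log):
    """Rebuild master state from the last checkpoint plus the
    operation-log records written after that checkpoint."""
    state = dict(checkpoint)      # start from the checkpointed state
    for op, path, value in log:   # replay mutations in log order
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state
```

Checkpointing matters because replay time grows with log length; a fresh checkpoint bounds how much log must be replayed after a crash.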
1.6 Consistency Model
Design Overview
Relaxed Consistency Model
Guarantees atomic file-namespace mutations
Levels of Consistency on Data
Inconsistent : Different clients see different data
Consistent : All clients see the same data, regardless of replica
Defined : Consistent, and clients see what the mutation wrote in its entirety
What the Application Has to Do
Append rather than overwrite
Self-validate records (e.g. with checksums)
Ⅱ
System Interactions
2.1 Write Control
2.2 Data Flow
2.3 Atomic Record Append
2.4 Snapshot
2.1 Write Control
System Interactions
Lease
•
Granted by the master to one chunk replica, which becomes the primary
Mutation
•
Operation that changes the contents or metadata of a chunk
Write Process
•
The primary assigns a serial order to all mutations; secondaries apply them in the same order
2.2 Data Flow
System Interactions
Data is pushed linearly along a chain of chunk servers
Each machine forwards the data to the closest machine that has not yet received it
Distances are estimated from IP addresses
Network Construction in GFS
Linear chain (not a tree) : uses each machine's full outbound bandwidth
Pipelining : a chunk server starts forwarding as soon as data arrives, minimizing latency and maximizing throughput
Elapsed time for transferring
Elapsed time = B/T + RL
•
B : bytes to transfer
•
T : network throughput
•
R : # of replicas
•
L : per-hop latency between machines
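Plugging numbers into the formula: with a 100 Mbps (12.5 MB/s) full-duplex link and roughly 1 ms per-hop latency, 1 MB reaches 3 replicas in about 80 ms. A quick sketch:

```python
def transfer_time(b, t, r, l):
    # Ideal pipelined transfer: B/T for pushing the byte stream itself,
    # plus R*L of per-hop latency along the replica chain.
    return b / t + r * l

# 1 MB, 12.5 MB/s (100 Mbps), 3 replicas, ~1 ms per hop
t = transfer_time(1e6, 12.5e6, 3, 1e-3)  # ~0.083 s
```

Note how the R·L term stays tiny relative to B/T for large transfers, which is why pipelining makes replication nearly free in elapsed time.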
2.3 Atomic Record Appends
System Interactions
In Traditional Writes
Client specifies the offset at which the data is to be written
•
Concurrent writes to the same region can leave the data fragmented
In Record Append
Client specifies only the data; GFS chooses and returns the offset
Record Append Process
Much like a write in GFS
GFS appends the data to the file at least once atomically
The chunk is padded when the record does not fit in the remaining space
The client retries the append when an error occurs
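The padding rule can be sketched as a check the primary might run (this helper is invented for illustration, not GFS code): if the record does not fit in the chunk's remaining space, the remainder is padded and the client retries on the next chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunk

def try_append(chunk_used, record_len):
    """Illustrative sketch of the primary's record-append decision.
    Returns ("pad_and_retry", pad_bytes) if the record would cross the
    chunk boundary, else ("appended_at", offset)."""
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of the chunk; the client will retry on a new chunk.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used)
    return ("appended_at", chunk_used)
```

Padding instead of splitting is what keeps each record atomic: a record never straddles two chunks.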
2.4 Snapshot
System Interactions
Instant File/Directory Copy
Snapshot Process
① Master receives a snapshot request
② Master revokes outstanding leases on the affected chunks
③ Master logs the operation
④ Master duplicates the metadata; the new snapshot points to the same chunks as the source
⑤ A shared chunk is duplicated locally (copy-on-write) when the next write to it arrives
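The copy-on-write step can be sketched with reference counts (all names here are invented for illustration): a chunk still referenced by a snapshot is duplicated only when a write actually touches it.

```python
def write_after_snapshot(chunk, ref_counts, chunks):
    """Copy-on-write sketch: if a snapshot still references this chunk
    (count > 1), duplicate it locally before mutating; otherwise the
    write can go to the chunk in place."""
    if ref_counts[chunk] > 1:
        new = chunk + "-copy"            # illustrative new chunk handle
        chunks[new] = bytes(chunks[chunk])  # local duplication
        ref_counts[chunk] -= 1
        ref_counts[new] = 1
        return new                       # writes now target the copy
    return chunk
```

Deferring the copy until a write arrives is what makes taking the snapshot itself near-instant: only metadata is duplicated up front.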
Ⅲ
Master Operation
3.1 Namespace Management
3.2 Replica Placement
3.3 Creation, Re-replication, Rebalancing
3.4 Garbage Collection
3.5 Stale Replica Detection
3.1 Namespace Management
Master Operation
Allows multiple master operations to run at the same time
Each master operation acquires locks over a region of the namespace
Namespace is a lookup table mapping full pathnames to metadata, with prefix compression
Locking Example
Snapshot of /home/user
read-lock on /home
write-lock on /home/user
File creation for /home/user/foo
read-lock on /home
read-lock on /home/user
write-lock on /home/user/foo
The two operations conflict on /home/user (write-lock vs. read-lock)
The conflict serializes them
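The locking pattern above — read-locks on every ancestor directory plus one lock on the target — can be sketched as follows (the function is invented for illustration):

```python
def locks_for(path, op):
    """Sketch of GFS-style namespace locking: read-locks on each
    ancestor directory, plus `op` ("read" or "write") on the target."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [("read", a) for a in ancestors] + [(op, path)]
```

Two operations conflict exactly when they request incompatible locks on the same path, so a snapshot's write-lock on /home/user blocks a concurrent creation that read-locks /home/user.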
3.2 Replica Placement
Master Operation
Purpose of Replica Placement
Maximize data reliability and availability
Maximize bandwidth utilization
Spread Chunk Replicas across Racks
Data stays available even if an entire rack fails (e.g. a shared power circuit or switch)
3.3 Creation, Re-replication, Rebalancing
Master Operation
Movements for Chunk Replicas
Chunk creation
Re-replication
Rebalancing
Creation
Place new replicas on chunk servers with below-average disk utilization
Spread replicas across racks
Re-replication
Re-replicate a chunk as soon as the number of available replicas falls below the goal
Rebalancing
Master periodically examines replica distribution and moves replicas for better disk space and load balance
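A creation-time placement heuristic along these lines might look like the sketch below (the server records and field names are invented; the real master also considers, e.g., how many recent creations each server received):

```python
def pick_servers(servers, n=3):
    """Sketch of creation-time placement: prefer chunk servers with
    below-average disk utilization, then spread picks across racks."""
    avg = sum(s["util"] for s in servers) / len(servers)
    candidates = sorted((s for s in servers if s["util"] <= avg),
                        key=lambda s: s["util"])
    chosen, racks = [], set()
    for s in candidates:
        if s["rack"] not in racks:   # one replica per rack
            chosen.append(s["name"])
            racks.add(s["rack"])
        if len(chosen) == n:
            break
    return chosen
```

Combining the two criteria equalizes disk usage over time while keeping the rack-failure guarantee from the replica-placement slide.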
3.4 Garbage Collection
Master Operation
Garbage Collection in GFS
Garbage Collection + Delete Process
Garbage Collection Process
Delete Process
① User deletes a file
② Master renames the file to a hidden name
③ During the master's regular scan, hidden files older than the grace period (default: three days) are removed
Regular Garbage Collection
① Chunk servers report their chunk sets in regular HeartBeat messages
② Master compares the reported chunks with its own metadata
③ Chunk servers delete the chunks the master no longer maps to any file (orphaned chunks)
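The orphan check reduces to a set difference between the chunks a server reports and those the master's metadata still maps to files. A sketch (names illustrative):

```python
def orphaned_chunks(reported, master_metadata):
    """Chunks a chunk server reports that the master no longer maps to
    any file are orphans; the master tells the server to delete them."""
    known = set(master_metadata)
    return [c for c in reported if c not in known]
```

Piggybacking this exchange on HeartBeat messages is what makes garbage collection a cheap background activity rather than a separate protocol.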
3.5 Stale Replica Detection
Master Operation
How Stale Replicas Arise
A replica misses mutations while its chunk server is down
Master Maintains a Version Number for Each Chunk
Distinguishes up-to-date replicas from stale ones
Stale Replicas Are Removed in the Regular Garbage Collection
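The version comparison itself is simple; a sketch (all names invented for illustration):

```python
def stale_replicas(master_version, replica_versions):
    """Replicas whose chunk version number lags the master's are stale:
    they missed at least one mutation while their server was down."""
    return [srv for srv, v in replica_versions.items()
            if v < master_version]
```

The master bumps the version number whenever it grants a new lease, so any replica that was unreachable at that moment is detectably behind.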
Ⅳ
Fault Tolerance
4.1 High Availability
4.2 Data Integrity
4.1 High Availability
Fault Tolerance
Fast Recovery
Master and chunk servers are designed to restart and recover state in seconds
Chunk Replication
Master clones replicas as needed
Master Replication
Master state replicated synchronously
Shadow masters for read-only
For simplicity, only one master process is active at a time; its restart is fast
4.2 Data Integrity
Fault Tolerance
Checksumming to Detect Data Corruption
Each chunk is broken into 64KB blocks, each with its own checksum
Checksums are kept in memory and stored persistently on disk
On a checksum mismatch, the chunk server returns an error to the requestor and reports it to the master
The requestor reads from other replicas
The master re-replicates the chunk, then instructs the server to delete the corrupt replica
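Block-level verification on read can be sketched as follows (CRC-32 is used here as a stand-in checksum; the 64KB block size matches the paper, but the function itself is illustrative):

```python
import zlib

BLOCK = 64 * 1024  # each 64 KB block of a chunk carries its own checksum

def verify(chunk_data, checksums):
    """Verify each 64 KB block against its stored checksum.
    Returns the index of the first corrupt block, or None if all match."""
    for i in range(0, len(chunk_data), BLOCK):
        block = chunk_data[i:i + BLOCK]
        if zlib.crc32(block) != checksums[i // BLOCK]:
            return i // BLOCK
    return None
```

Checking only the blocks that overlap a read keeps verification cheap, and because each server checksums independently, corruption on one replica never silently propagates to the others.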
Ⅴ
Conclusion
5 Conclusion
Conclusion
Supports Large-scale Data Processing Workloads on Commodity Hardware.
Provides Fault Tolerance
By constant monitoring
By replicating crucial data
By fast and automatic recovery
Delivers High Throughput to Concurrent Clients