The Google File System

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
Google, ACM SOSP 2003
DB LAB Regular Seminar
2013.03.12 (Mon)
Tae-Hoon Kim
Copyright©2013.03 JBNU. DBLAB All Rights Reserved.
Contents
Introduction
Design Overview
System Interactions
Master Operation
Fault Tolerance and Diagnosis
Measurements
Experiences
Related Work
Conclusions









Introduction

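GFS (Google File System) is similar in many ways to previous distributed file systems
• Shared goals: performance, scalability
• Shared goals: reliability, availability
Re-examination of the assumptions behind previous distributed file systems
• 1st: component failures are the norm rather than the exception
• 2nd: files are huge; multi-GB files are common
• 3rd: most files are mutated by appending new data rather than by overwriting existing data
• 4th: co-designing the applications and the file system API benefits the overall system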
Design Overview(Assumptions)
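First, the system is built from inexpensive, readily available commodity components
Second, the system must efficiently manage a large number of large files
• Multi-GB files must be managed efficiently
Third, the workload has two kinds of reads
• Large streaming reads and small random reads
Fourth, the workload also has many large, sequential writes that append data to files
Fifth, the system must implement well-defined semantics for multiple clients that concurrently append to the same file
Sixth, high sustained bandwidth is more important than low latency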
Design Overview(Interface)
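Similar to a conventional file system interface
• Does not implement a standard API such as POSIX
• Files are organized hierarchically in directories and identified by path names
• Supports create, delete, open, close, read, and write operations on files
Additional GFS operations
• Snapshot: creates a copy of a file or a directory tree at low cost
• Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append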
Design Overview(Architecture)
GFS cluster
• Single master
• Multiple chunkservers
Chunk
• Files are divided into fixed-size chunks
• Each chunk is replicated; the default is 3 replicas
Master
• Maintains the metadata and communicates with the chunkservers through HeartBeat messages to collect their state
Client
• Interacts with the master
• The client side does not support the POSIX API
Design Overview(Single Master)

Single master
• Simplifies the design
• Lets the master make sophisticated chunk placement decisions
• Lets the master make replication decisions using global knowledge
Minimize the master's involvement in reads and writes
• So that the master does not become a bottleneck
Design Overview(Single Master)
[Figure: read path through the single master. The application passes (file name, byte offset) to the GFS client; numbered steps show the client contacting the master and then the chunkservers.]
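As a rough illustration of how a read can start without loading the master, the client can turn the (file name, byte offset) given by the application into a chunk index before it contacts the master. The sketch below assumes the fixed 64 MB chunk size used by GFS; the function name `to_chunk_request` is hypothetical and not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size: 64 MB


def to_chunk_request(file_name: str, byte_offset: int) -> tuple[str, int, int]:
    """Translate (file name, byte offset) into (file name, chunk index, offset within chunk)."""
    if byte_offset < 0:
        raise ValueError("byte offset must be non-negative")
    chunk_index, offset_in_chunk = divmod(byte_offset, CHUNK_SIZE)
    return file_name, chunk_index, offset_in_chunk


# An offset of 200 MB falls in the fourth chunk (index 3), 8 MB into that chunk.
request = to_chunk_request("/home/user/foo", 200 * 1024 * 1024)
print(request)
```

Only the file name and chunk index need to go to the master; the data itself is then read from a chunkserver.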

Design Overview(Chunk size)


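Chunk size: 64 MB
Advantages of a large chunk size
1. Fewer interactions between clients and the master
2. Less network overhead for clients
3. Less metadata stored on the master
• This allows the metadata to be kept in memory
Disadvantages of a large chunk size
• A small file consists of a small number of chunks, perhaps just one
• The chunkservers storing those chunks become hot spots when many clients access the same file
• For example, an executable stored on only a few chunkservers overloads them when many clients run it at once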
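A back-of-the-envelope calculation shows why a large chunk size keeps the master's metadata small enough for memory. The sketch assumes 64 bytes of metadata per chunk, which is the upper bound reported in the GFS paper rather than a figure from these slides, and it assumes the per-chunk cost stays the same for other chunk sizes. The function name is hypothetical.

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB chunks
METADATA_PER_CHUNK = 64          # bytes; assumed upper bound, taken from the GFS paper


def master_metadata_bytes(stored_bytes: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Upper bound on the chunk metadata the master keeps for `stored_bytes` of file data."""
    num_chunks = -(-stored_bytes // chunk_size)   # ceiling division
    return num_chunks * METADATA_PER_CHUNK


one_pib = 2**50
print(master_metadata_bytes(one_pib) / 2**30)          # GiB of metadata with 64 MB chunks
print(master_metadata_bytes(one_pib, 2**20) / 2**30)   # the same data with 1 MB chunks
```

Under these assumptions, 1 PiB of data needs about 1 GiB of chunk metadata with 64 MB chunks, against 64 GiB with 1 MB chunks.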
Design Overview(Metadata)

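Three major types of metadata stored on the master
• The file and chunk namespaces
• The mapping from files to chunks
• The locations of each chunk's replicas
Periodic scanning of the in-memory state
• Used for chunk garbage collection, re-replication, chunk migration, and balancing disk space
Chunk location metadata
• Not stored persistently; the master collects it from each chunkserver at startup
Benefits of handling chunk locations this way
• Eliminates the problem of keeping the master and chunkservers in sync as chunkservers join, fail, restart, and so on
• A chunkserver has the final word on which chunks it does or does not have on its own disks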
Design Overview(Metadata)

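Operation log characteristics
• A historical record of critical metadata changes
• Serves as a logical timeline
• The master checkpoints its state whenever the log grows beyond a certain size
• Older checkpoints and log files can be deleted
Using the operation log
• The file system state can be recovered by replaying the log
• Keeping the log small minimizes startup time and the number of operations to replay
• Recovery loads the latest checkpoint and replays only the log records written after it
• The checkpoint can be used for namespace lookup without extra parsing
• This speeds up recovery and improves availability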
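The recovery procedure can be sketched as loading the latest checkpoint and replaying the log tail. This is a minimal illustration, not the master's actual implementation: the record format, the `MasterState` class, and the operation names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MasterState:
    """Hypothetical in-memory metadata: maps a file path to its list of chunk handles."""
    files: dict[str, list[int]] = field(default_factory=dict)

    def apply(self, record: tuple) -> None:
        """Apply one operation log record to the in-memory state."""
        op, path, *args = record
        if op == "create":
            self.files[path] = []
        elif op == "add_chunk":
            self.files[path].append(args[0])
        elif op == "delete":
            self.files.pop(path, None)
        else:
            raise ValueError(f"unknown log record: {op}")


def recover(checkpoint: dict[str, list[int]], log_tail: list[tuple]) -> MasterState:
    """Load the latest checkpoint, then replay only the log records written after it."""
    state = MasterState(files={path: list(chunks) for path, chunks in checkpoint.items()})
    for record in log_tail:
        state.apply(record)
    return state


checkpoint = {"/a": [1, 2]}
log_tail = [("create", "/b"), ("add_chunk", "/b", 3), ("delete", "/a")]
recovered = recover(checkpoint, log_tail)
print(recovered.files)
```

The shorter the log tail, the less work recovery has to do, which is why the master checkpoints once the log passes a size threshold.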
Design Overview(Consistency Model)

File namespace mutations are atomic
• Namespace locking guarantees atomicity
• and correctness
File region state
• Consistent: all clients see the same data
• Defined: the region is consistent and clients see what the mutation wrote in its entirety
• When a mutation succeeds without interference from concurrent writers, the affected region is defined
For applications
• Relaxed consistency model
System Interactions

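Mutation order
• A mutation is an operation that changes the contents or metadata of a chunk
• Each mutation is performed at all of the chunk's replicas
Lease
• Maintains a consistent mutation order across replicas
• Designed to minimize management overhead at the master
Master side
• Grants a chunk lease to one of the replicas, which becomes the primary
• May try to revoke a lease before it expires, for example when it wants to disable mutations on a file that is being renamed
• After the lease has expired, it can grant a new lease to another replica
Primary side
• Picks a serial order for all mutations to the chunk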
System Interactions
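[Figure: write control flow and data flow among the client, the master, the primary replica, and the secondary replicas.]
1. The client asks the master which chunkserver holds the current lease
2. The master replies with the identity of the primary and the locations of the other replicas
3. The client pushes the data to all replicas
4. Once all replicas have received the data, the client sends the write request to the primary
5. The mutation is applied in serial number order
6. The secondaries report that they have completed the operation
7. Any errors encountered at the replicas are reported to the client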
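The control flow above can be sketched with a primary that assigns serial numbers and secondaries that apply mutations in that same order. This is a simplified single-process model: the `Replica` and `Primary` classes and their methods are hypothetical, and networking, leases, and retries are left out.

```python
class Replica:
    """Hypothetical chunk replica: buffers pushed data, then applies mutations in serial order."""

    def __init__(self, name: str):
        self.name = name
        self.buffer: dict[str, bytes] = {}            # data pushed by clients, keyed by data id
        self.applied: list[tuple[int, bytes]] = []    # applied mutations: (serial number, data)

    def push(self, data_id: str, data: bytes) -> None:
        self.buffer[data_id] = data                   # step 3: data arrives before the request

    def apply(self, serial: int, data_id: str) -> None:
        self.applied.append((serial, self.buffer.pop(data_id)))


class Primary(Replica):
    """The replica holding the lease; it picks the serial order for all mutations."""

    def __init__(self, name: str, secondaries: list[Replica]):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id: str) -> list[str]:
        serial = self.next_serial                     # step 5: assign a serial number
        self.next_serial += 1
        self.apply(serial, data_id)
        errors = []
        for replica in self.secondaries:              # secondaries apply in the same order
            try:
                replica.apply(serial, data_id)        # step 6: secondary completes the operation
            except KeyError:
                errors.append(replica.name)           # this replica never received the data
        return errors                                 # step 7: errors go back to the client


secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
for data_id, data in [("w1", b"alpha"), ("w2", b"beta")]:
    for replica in [primary, *secondaries]:
        replica.push(data_id, data)
errors = [primary.write("w2"), primary.write("w1")]
print(primary.applied, errors)
```

Even though the write requests arrive in a different order than the data was pushed, every replica ends up with the same sequence of mutations because only the primary chooses the order.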
System Interactions

Data flow
• Goals: fully use each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize latency
• Latency is reduced by pipelining the data transfer over TCP connections
Atomic append operation
• Record append is GFS's atomic append operation
• Record append is used by distributed applications
• Record append is a kind of mutation
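One way to picture record append is a file of fixed-size chunks where the system, not the client, chooses the offset, and a record never straddles a chunk boundary. The padding rule below follows the description in the GFS paper rather than these slides; the class name and the tiny chunk size are assumptions for illustration, and concurrency (which the primary serializes in GFS) is not modeled.

```python
class AppendOnlyFile:
    """Hypothetical file made of fixed-size chunks that supports record append."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.chunks: list[bytearray] = [bytearray()]

    def record_append(self, record: bytes) -> int:
        """Append the record as one contiguous unit and return the offset chosen for it."""
        if len(record) > self.chunk_size:
            raise ValueError("record does not fit in a chunk")
        last = self.chunks[-1]
        if len(last) + len(record) > self.chunk_size:
            # The record would straddle a chunk boundary: pad the current chunk
            # and place the whole record at the start of a new chunk.
            last.extend(b"\0" * (self.chunk_size - len(last)))
            last = bytearray()
            self.chunks.append(last)
        offset = (len(self.chunks) - 1) * self.chunk_size + len(last)
        last.extend(record)
        return offset


f = AppendOnlyFile(chunk_size=16)  # tiny chunk size, for illustration only
offsets = [f.record_append(b"a" * 10), f.record_append(b"b" * 10)]
print(offsets)
```

The second record does not fit in the remaining 6 bytes of the first chunk, so it lands at the start of the second chunk and the gap is padding that readers must skip.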
System Interactions

Snapshot operation
• Makes a copy of a file or a directory tree
• Implemented with copy-on-write
Purpose of snapshots
• Creating branch copies of huge data sets
• Checkpointing the current state
Performing a snapshot
• Once the leases have been revoked or have expired, the master logs the operation to disk
• It then duplicates the metadata for the source files or directory tree in its in-memory state
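Copy-on-write snapshots can be sketched with a reference count per chunk: the snapshot duplicates only metadata, and a chunk is copied the first time it is written afterwards. The `Master` class below is a hypothetical model of this bookkeeping, with chunk data copying reduced to allocating a new handle.

```python
class Master:
    """Hypothetical master metadata: file -> chunk handles, plus a reference count per chunk."""

    def __init__(self):
        self.files: dict[str, list[int]] = {}
        self.refcount: dict[int, int] = {}
        self.next_handle = 0

    def _new_chunk(self) -> int:
        handle = self.next_handle
        self.next_handle += 1
        self.refcount[handle] = 1
        return handle

    def create(self, path: str, num_chunks: int) -> None:
        self.files[path] = [self._new_chunk() for _ in range(num_chunks)]

    def snapshot(self, src: str, dst: str) -> None:
        """Duplicate only the metadata; both files now point at the same chunks."""
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def chunk_for_write(self, path: str, index: int) -> int:
        """Return a chunk handle that is safe to mutate, copying the chunk on first write."""
        handle = self.files[path][index]
        if self.refcount[handle] > 1:          # shared with a snapshot: copy before writing
            self.refcount[handle] -= 1
            handle = self._new_chunk()         # stands in for copying the chunk data
            self.files[path][index] = handle
        return handle


m = Master()
m.create("/home/user/foo", 2)                  # chunk handles 0 and 1
m.snapshot("/home/user/foo", "/save/user/foo")
written = m.chunk_for_write("/home/user/foo", 0)
print(written, m.files, m.refcount)
```

The snapshot itself touches no chunk data, which is what keeps it cheap; only chunks that are later written get copied.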
Master Operation (Namespace Management and Locking)
[Figure: locks taken by two operations on the namespace tree. Snapshot operation: /home/user to /save/user. File creation operation: /home/user/foo. Legend: read lock, write lock.]
Locking can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user
The two operations are serialized properly
• because they try to obtain conflicting locks on /home/user
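The conflict between the two operations can be made concrete by computing the lock set of each. The locking scheme below (read locks on ancestor directories, write locks on the target names) follows the description in the GFS paper; the function names are hypothetical.

```python
def ancestors(path: str) -> list[str]:
    """Return every proper ancestor of an absolute path, e.g. /a/b/c -> ['/a', '/a/b']."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def locks_for_create(path: str) -> dict[str, str]:
    """File creation: read locks on all ancestor directories, write lock on the new file name."""
    locks = {p: "read" for p in ancestors(path)}
    locks[path] = "write"
    return locks


def locks_for_snapshot(src: str, dst: str) -> dict[str, str]:
    """Snapshot: read locks on the ancestors of both paths, write locks on src and dst."""
    locks = {p: "read" for p in ancestors(src) + ancestors(dst)}
    locks[src] = "write"
    locks[dst] = "write"
    return locks


def conflicts(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Two lock sets conflict on a path if both lock it and at least one lock is a write lock."""
    return sorted(p for p in a.keys() & b.keys() if "write" in (a[p], b[p]))


snapshot = locks_for_snapshot("/home/user", "/save/user")
create = locks_for_create("/home/user/foo")
print(conflicts(snapshot, create))
```

Both operations hold a read lock on /home, which is compatible, but the snapshot's write lock on /home/user conflicts with the read lock that the file creation needs, so one must wait for the other.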
Master Operation

Chunk replica placement policy
• Maximize data reliability and availability
• Maximize network bandwidth utilization
Chunk replicas are created for three reasons
• Chunk creation: the master chooses where to place the initially empty replicas
• Re-replication: the master re-replicates a chunk
• Rebalancing: the master rebalances replicas periodically
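A placement decision can be illustrated with a simplified heuristic that prefers lightly used chunkservers and spreads replicas across racks. This is not the actual GFS policy, which weighs further factors; the function name, the input format, and the tie-breaking rule are assumptions of this sketch.

```python
def place_replicas(servers: dict[str, tuple[str, float]], count: int = 3) -> list[str]:
    """Pick `count` chunkservers, least-utilized first, using each rack at most once if possible.

    `servers` maps a chunkserver name to (rack, disk utilization between 0 and 1).
    """
    by_utilization = sorted(servers, key=lambda s: (servers[s][1], s))
    chosen: list[str] = []
    used_racks: set[str] = set()
    for name in by_utilization:                 # first pass: at most one replica per rack
        rack = servers[name][0]
        if len(chosen) < count and rack not in used_racks:
            chosen.append(name)
            used_racks.add(rack)
    for name in by_utilization:                 # second pass: fill up if racks ran out
        if len(chosen) < count and name not in chosen:
            chosen.append(name)
    return chosen


servers = {
    "cs1": ("rack-a", 0.20),
    "cs2": ("rack-a", 0.10),
    "cs3": ("rack-b", 0.50),
    "cs4": ("rack-c", 0.70),
}
print(place_replicas(servers))
```

Here cs1 is skipped even though it is lightly used, because cs2 already covers rack-a; spreading across racks protects against the loss of a whole rack.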
Master Operation
[Figure: garbage collection. An application deletes a file (Starcraft); the GFS master records the deletion in its log and updates its namespace and in-memory metadata.]
Master Operation
[Figure: the GFS master's in-memory metadata and chunks 01 to 05.]
Master Operation

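Stale replica detection
• Whenever the master grants a lease, it increases the chunk version number and informs the up-to-date replicas
• If the master sees a version number greater than the one in its records
  • it assumes that it failed while granting the lease
  • and takes the higher version to be up-to-date
• The master removes stale replicas in its regular garbage collection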
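The version check can be sketched as a comparison between the master's recorded version and the versions reported by chunkservers. The function name and the input format are hypothetical; the rule for a higher-than-recorded version follows the bullet above.

```python
def classify_replicas(master_version: int, reported: dict[str, int]) -> tuple[int, list[str], list[str]]:
    """Compare the versions reported by chunkservers with the master's record.

    Returns (up-to-date version, up-to-date replicas, stale replicas).
    """
    # A version newer than the master's record means the master failed while
    # granting a lease, so the higher version is taken to be up-to-date.
    current = max([master_version, *reported.values()])
    fresh = sorted(s for s, v in reported.items() if v == current)
    stale = sorted(s for s, v in reported.items() if v < current)
    return current, fresh, stale


# cs3 was down while the lease was granted and missed the version bump from 6 to 7.
print(classify_replicas(7, {"cs1": 7, "cs2": 7, "cs3": 6}))
```

Replicas in the stale list would then be left for the regular garbage collection to remove.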
Fault Tolerance and Diagnosis

Fast recovery (master and chunkservers)
• Both restore their state and start within seconds, no matter how they terminated
Chunk replication
• Each chunk is replicated on multiple chunkservers on different racks (the default is 3 replicas)
Master replication
• The operation log and checkpoints are replicated on multiple machines
Fault Tolerance and Diagnosis

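Data integrity on the chunkserver side
• Checksumming is used to detect data corruption
• Each chunkserver verifies the integrity of its own copy by maintaining checksums
Data integrity on the chunk side
• A chunk is broken into 64 KB blocks, each with a 32-bit checksum
• A corrupted chunk can be restored from another chunk replica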
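Per-block checksumming can be sketched as follows. The 64 KB block size and 32-bit checksum width come from the GFS paper; the use of CRC-32 via `zlib.crc32` is an assumption of this sketch, since the checksum algorithm is not named, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # checksum granularity: 64 KB blocks


def block_checksums(chunk: bytes) -> list[int]:
    """Compute one 32-bit checksum per 64 KB block (CRC-32 is an assumption of this sketch)."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE]) for i in range(0, len(chunk), BLOCK_SIZE)]


def corrupted_blocks(chunk: bytes, stored: list[int]) -> list[int]:
    """Return the indices of blocks whose current checksum differs from the stored one."""
    return [i for i, (now, then) in enumerate(zip(block_checksums(chunk), stored)) if now != then]


chunk = bytes(3 * BLOCK_SIZE)                 # three zero-filled blocks
stored = block_checksums(chunk)
damaged = bytearray(chunk)
damaged[BLOCK_SIZE + 5] ^= 0xFF               # flip bits in the second block
print(corrupted_blocks(bytes(damaged), stored))
```

A chunkserver that finds a mismatch would report the error instead of returning the data, and the chunk can then be restored from another replica.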

What the diagnostic tools help with
• Problem isolation, debugging, and performance analysis
• Logs record many significant events
  • for example, chunkservers going up and down
Impact of the diagnostic tools
• The impact on performance is small
  • Logs are written sequentially and asynchronously
Measurements

Test environment
• 16 chunkservers
• 16 clients
• 1.4 GHz Pentium III processors
• 2 master replicas
• 2 GB of memory
• Two 80 GB 5400 rpm disks
• 100 Mbps full-duplex Ethernet connection (HP 2524 switch)
Clusters
• A, X: research and development clusters
• B, Y: production data processing clusters
Measurements
Table 4: Operation Breakdown by Size (%)
Table 5: Bytes Transferred Breakdown by Operation Size (%)
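Experiences

Linux and disk problems
• Mismatches between the kernel and the protocol
  • Drivers supporting a wide range of IDE protocol versions mostly worked well, but mismatches occasionally arose between the drive and the kernel
• Problems caused by fsync() in the Linux 2.2 kernel
  • Worked around temporarily with synchronous writes, then solved by moving to Linux 2.4
Single reader-writer lock problem
• Occurs when a thread holds the reader lock or calls mmap() (writer lock)
• Solved by replacing mmap() with pread()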
Related work

Large distributed file systems
• AFS: Scale and performance in a distributed file system
Spreading a file's data across storage servers
• xFS: Serverless network file systems
• Swift: Using distributed disk striping to provide high I/O data rates
Relatively cheap disks, replication
• RAID: A case for redundant arrays of inexpensive disks
Related work

A primary-copy scheme
• Harp: Replication in the Harp file system
Addressing a similar problem
• Lustre: http://www.lustre.org, 2003.
Network-attached disk drives
• NASD: A cost-effective, high-bandwidth storage architecture
Distributed queues
• River: Cluster I/O with River
Conclusions


Supports large-scale data processing workloads on commodity hardware
GFS provides fault tolerance
• Constant monitoring, replication of crucial data, and automatic recovery
• Checksumming to detect data corruption
Minimal master involvement in common operations
• Delivers high aggregate throughput to many concurrent readers and writers
Thank you for listening to my presentation :)


Summary




The Google File System (GFS) is a scalable, fault-tolerant distributed file system custom-designed to handle Google's data-intensive workloads. GFS provides high aggregate throughput for a large number of readers and writers that mostly append rather than overwrite, in a fault-tolerant manner, while running on inexpensive commodity hardware. It is specially optimized for large files, sequential reads/writes, and high sustained throughput instead of low latency. GFS uses a single master with minimal involvement in regular operations and a relaxed consistency model to simplify concurrent operations by many clients on the same files.
GFS sits on top of the local file systems of individual machines to create a meta-file system, where each participating machine is called a chunkserver. The master keeps track of metadata; chunk locations are soft state that is periodically refreshed by talking to chunkservers. GFS divides files into large 64 MB chunks and replicates these chunks across machines and racks for fault tolerance and availability. It uses daisy-chained writing to make better use of the network bandwidth. It is not POSIX compliant but supports a reasonable set of file system operations (e.g., create, delete, open, close, read, and write) useful for Google's operations. Unlike traditional file systems, GFS supports atomic record appends, which allow multiple clients to append records to the same file concurrently.
Key tradeoffs in GFS's design include throughput vs. latency, using an expensive storage system vs. designing a distributed file system on commodity hardware, write traffic for replication vs. savings in read traffic due to data locality, and the simplicity of a single master and a relaxed consistency model vs. a highly consistent but more complex design. It is noticeable that the authors almost always choose the simpler design at the expense of many highly regarded traditional file system properties.
Reference : http://www.mosharaf.com/blog/2011/09/25/the-google-file-system/
