The content of these slides has been mostly taken from the complementary
teaching material of the “Distributed Systems: Concepts and Design” (3rd
Ed.) book, available at the www.cdk3.net web site. The slides have been
extended and integrated with additional material by Prof. Rajkumar Buyya
and Dr. Christian Vecchiola.
Distributed File Systems (DFS)
Dr. Christian Vecchiola
Postdoctoral Research Fellow
[email protected]
Cloud Computing and Distributed Systems (CLOUDS) Lab
Dept. of Computer Science and Software Engineering
The University of Melbourne
Most concepts are drawn from Chapter 8 (© Pearson Education).
Outline
• Introduction
• File System Basics
– What is a File System?
– DFSs
• Case Studies
– NFS
– Andrew File System
– Google File System
– Amazon Dynamo
• Summary
– New Design Approaches
– Conclusions
Introduction
• Learning Objectives
– Understand the requirements that affect the design of distributed services
– Have a practical understanding of real DFS implementations:
• simple designs that sometimes just work
• the importance of caching
• the implications of security
• redundancy and availability
– Get a glimpse of how ongoing research can impact real implementations.
• Why do we need a DFS?
– Primary purpose of a Distributed System…
Connecting users and resources
– Resources…
• … can be inherently distributed
• … can actually be data (files, databases, …) and…
• … their availability becomes a crucial issue for the
performance of a Distributed System
• A case for DFS
[Cartoon: several users ask to store a thesis, a book, reports and analyses on a shared server; the administrator decides that the time has come to buy a rack of servers (Server A).]
• A case for DFS (continued)
[Cartoon: with several servers (A, B and C) users can store many more documents, but they no longer remember whether their files are on server A, B or C; the administrator wonders whether a DFS is needed.]
• A case for DFS (continued)
[Cartoon: a Distributed File System now sits in front of Servers A, B and C. It is reliable, fault tolerant, highly available and location transparent: users no longer have to remember which server stores their data and can access their folders from anywhere.]
• What is a DFS?
– DFSs…
• … constitute the basis of many distributed applications
• … allow multiple processes to share data in a secure and reliable way
• … constitute the storage facility of a distributed system
– They may exhibit several characteristics:
• high availability
• fault tolerance
• location transparency
• replication
• A Little Bit of History….
– First generation of DFSs (1974–1995):
• NFS-based models
• Andrew FS, …
– Distributed data stores for distributed objects (1995–today)
– Large-scale, scalable storage (2007–now):
• Google File System
• Amazon S3
• Cloud storage…
• Storage Systems and their Properties

Type | Sharing | Persistence | Replication & caching | Consistency maintenance | Example
Main memory | no | no | no | 1 | RAM
File system | no | yes | no | 1 | UNIX file system
Distributed file system | yes | yes | yes | yes | Sun NFS
Web | yes | yes | yes | no | Web server
Distributed shared memory | yes | no | yes | yes | Ivy (Ch. 16, CDK)
Remote objects | yes | no | no | 1 | CORBA, RMI
Persistent object store | yes | yes | no | 1 | CORBA Persistent Object Service
Persistent distributed object store | yes | yes | yes | yes | PerDiS, Khazana

(Consistency column: 1 = strict one-copy consistency; yes = approximate consistency; no = no automatic consistency.)
File System Basics
• Properties of a File System
– Persistent stored data sets
– Hierarchical name space visible to all processes
– APIs
• access and update operations
• sequential access model (plus random access facilities)
– Data shared among users with access control
– Concurrent access
• read-only (for sure…)
• what about updates?
– Other features..
• mountable file store…
• more…
• Example: UNIX File System API

filedes = open(path, mode): opens an existing file located at path.
filedes = create(path, mode): creates a new file at path.
status = close(filedes): closes the file identified by the descriptor filedes.
count = read(filedes, buffer, n): reads up to n bytes from the file mapped by filedes into buffer.
count = write(filedes, buffer, n): transfers n bytes from buffer to the file mapped by filedes.
pos = lseek(filedes, offset, whence): moves the read-write pointer of filedes according to (offset, whence).
status = unlink(path): removes the link path from the directory structure; if there is no other link to the same file, the file is deleted.
status = link(path1, path2): adds the link path2 to the file referred to by path1.
status = stat(path, buffer): gets the attributes of the file located at path and copies them into buffer.

Question: how would you write a function such as copy(char* source, char* dest)?
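One possible answer, sketched here in Python using the os module's thin wrappers over the corresponding POSIX calls (the question is posed in C, but the loop has the same shape): read into a buffer until an empty read signals end of file, writing each chunk out.

    import os

    def copy(source: str, dest: str, bufsize: int = 4096) -> None:
        """Copy source to dest using only open/read/write/close-style calls."""
        src = os.open(source, os.O_RDONLY)
        dst = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                buf = os.read(src, bufsize)   # read at most bufsize bytes
                if not buf:                   # empty read means end of file
                    break
                os.write(dst, buf)            # a production version would also handle short writes
        finally:
            os.close(src)
            os.close(dst)

    # usage: copy("/tmp/report.txt", "/tmp/report.bak")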
• Typical File System Implementation
Directory module: maps file paths to file identifiers.
File module: maps file identifiers to physical files.
Access control module: checks the permissions associated with the requested operation.
File access module: reads or writes file data or attributes.
Block module: accesses and allocates disk blocks.
Device module: performs disk I/O and buffering.

(Layered from top to bottom: directories, files, blocks, device.)
• File Attribute Record Structure
– System-managed properties: file length, creation timestamp, read timestamp, write timestamp, attribute timestamp, reference count.
– Owner-managed properties: owner, file type, access control list (e.g. UNIX: -rwxr-x---).
• Distributed File System Requirements
– Transparency
– Concurrency
– Replication
– Heterogeneity
– Fault tolerance
– Consistency
– Security
– Efficiency
• DFS Requirements
– Transparency
• Access:
– same operations (client programs are unaware of distribution of files)
• Location:
– same namespace after relocation of files or processes
– client programs should see a uniform file namespace
• Mobility:
– automatic relocation of files when possible
– neither client programs nor system admin tables in the client nodes need
to be changed when files are moved
• Performance:
– must be satisfactory across a specified range of system loads
• Scaling:
– the infrastructure can be expanded to support additional loads and growth
• DFS Requirements
– Concurrency
• Main goal:
– Changes made to one file by one client should not interfere
with the operations simultaneously made by other clients
while updating or accessing the same file.
• Properties:
– Isolation
– File level and record level locking
– Other forms of concurrency control to minimize contention
• DFS Requirements
– Replication
• Main goal:
– The infrastructure maintains multiple copies of files
• Properties:
– Load sharing between nodes makes files easily accessible
– Local access (or near..) has better performance in terms of
latency
– Fault tolerance
Full Replication is difficult to implement.
Caching (of all or a portion of the file) gives
most of the benefit (except fault tolerance).
• DFS Requirements
– Heterogeneity
• The infrastructure can be accessed by clients running
on (almost) any OS or hardware platform.
• The design must be compatible with the file systems
of different OSes.
• The interface to the infrastructure must be open.
• Precise specifications of the APIs are published.
• DFS Requirements
– Fault Tolerance
• The infrastructure must continue to operate in the presence of failures (e.g. a client makes errors or crashes):
– at-most-once semantics
– at-least-once semantics (requires idempotent operations; see the sketch below)
• The service must resume after the hosting node
crashes.
• If the service is replicated it must continue to operate
even in the presence of (partial) crashes.
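A minimal illustration (a hypothetical in-memory "server", not real NFS code) of why at-least-once semantics needs idempotent operations: retrying a positioned write leaves the result unchanged, while retrying an append duplicates data.

    # Hypothetical in-memory server state: file contents as byte arrays.
    files = {"f1": bytearray(b"hello world")}

    def write_at(fid, offset, data):
        """Positioned write: idempotent, repeating it yields the same file."""
        files[fid][offset:offset + len(data)] = data

    def append(fid, data):
        """Append: not idempotent, repeating it duplicates the data."""
        files[fid].extend(data)

    write_at("f1", 0, b"HELLO")
    write_at("f1", 0, b"HELLO")   # duplicated request: harmless
    append("f1", b"!")
    append("f1", b"!")            # duplicated request: data appended twice
    print(files["f1"])            # bytearray(b'HELLO world!!')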
• DFS Requirements
– Consistency
• Different clients should see the same representation of the same file, regardless of their location.
• Cache synchronization is crucial for consistency.
• Observations
– UNIX provides one-copy update semantics for operations
on local files, and caching is completely transparent.
– It is really difficult to achieve the same result in the case of
a DFS. Issues are:
» Performance
» Scalability
• DFS Requirements
– Security
• Goal:
– access control and privacy as happens for local files.
• Properties:
– based on the identity of the user making the request
– identities of remote users must be authenticated
– privacy requires secure communication
– service interfaces are open to all processes not excluded by a firewall, and therefore vulnerable to impersonation and other attacks
– Efficiency
• Goal
– performance comparable to a local system.
• Architecture of a DFS
– Separation of Concerns for…
• file management
• file access
– Three main modules:
• Client component
• Flat File Service
• Directory Service
• Architecture of a DFS
– File Service (client-server model)
[Diagram: application programs on the client computer use the client module, which talks to two server-side interfaces: the directory service (Lookup, AddName, UnName, GetNames) and the flat file service (Read, Write, Create, Delete, GetAttributes, SetAttributes).]
• Architecture of a DFS
– File Service Modules
• Flat File Service
– Content file management (operations)
– Use of Unique File Identifiers (UFIDs) to uniquely refer to files.
• Directory Service
– Mapping between logical file names and UFIDs
– Directory manipulation.
– Add/Remove file from directory.
• Client Module
– Integrated access to both services (File & Directory Service)
– Maintain information about the network location of file services
and the files in use.
– Exposes the standard operations available for local files.
• Architecture of a DFS
– File Service (Example Interfaces)
Flat File Service (FileIds are unique across the entire DFS):
Read(FileId, i, n) → data
Write(FileId, i, data)
Create() → FileId
Delete(FileId)
GetAttributes(FileId) → Attr
SetAttributes(FileId, Attr)

Directory Service:
Lookup(Dir, Name) → FileId
AddName(Dir, Name, FileId)
UnName(Dir, Name)
GetNames(Dir, Pattern) → Names

Composed paths (e.g. /usr/bin/tar) are resolved by recursive calls to Lookup, one per path component.
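A sketch of how the client module might resolve a composed path against the directory service above. The directory contents and the UFID strings are made up for illustration; ROOT_DIR stands for the UFID of the root directory.

    # Hypothetical directory service state: maps (directory UFID, name) -> UFID.
    ROOT_DIR = "ufid-root"
    ENTRIES = {
        ("ufid-root", "usr"): "ufid-usr",
        ("ufid-usr", "bin"): "ufid-usr-bin",
        ("ufid-usr-bin", "tar"): "ufid-tar",
    }

    def lookup(dir_ufid, name):
        """Directory service: Lookup(Dir, Name) -> FileId."""
        return ENTRIES[(dir_ufid, name)]

    def resolve(path):
        """Client module: resolve an absolute path with one Lookup per component."""
        ufid = ROOT_DIR
        for component in path.strip("/").split("/"):
            ufid = lookup(ufid, component)
        return ufid

    print(resolve("/usr/bin/tar"))   # -> ufid-tar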
• Architecture of a DFS
– File Groups
• Collection of files located on any server.
• They can be moved between servers and maintain
the same name.
• Observations:
– Similar to the UNIX file system.
– They help distribute the load among servers.
– They have identifiers that are unique across the entire DFS.
Case Studies
• Real World File Systems
– Sun NFS (Network File System)
– Andrew File System
– Google File System
– Amazon Dynamo
• NFS (Network File System)
– Info:
• Developed by Sun Microsystems.
• An open standard (IETF RFCs).
• Closely follows the abstract file service model.
• The most popular and most widely used DFS (an industry standard since 1985).
– Good support for:
• transparency
• heterogeneity
• efficiency
• fault tolerance
– Limited achievement of:
• concurrency
• replication
• consistency
• security
• NFS (Network File System)
– An industry standard for file sharing on local networks since the 1980s.
– An open standard with clear and simple interfaces.
– Closely follows the abstract file service model defined above.
– Supports many of the design requirements already mentioned (transparency, heterogeneity, efficiency, fault tolerance), with limited achievement of concurrency, replication, consistency and security.
• Network File System (NFS)
– History
• 1985: Original Version (in-house use)
• 1989: NFSv2 (RFC 1094)
– Operated entirely over UDP
– Stateless protocol (the core)
– Support for 2GB files
• 1995: NFSv3 (RFC 1813)
– Support for 64-bit file sizes (> 2 GB files)
– Support for asynchronous writes
– Support for TCP
– Support for additional attributes
– Other improvements
• 2000-2003: NFSv4 (RFC 3010, RFC 3530)
– Collaboration with IETF
– Sun hands over the development of NFS
• 2010: NFSv4.1
– Adds Parallel NFS (pNFS) for parallel data access
• NFS Architecture
[Diagram: on the client node, application programs issue UNIX system calls to the kernel; the Virtual File System layer routes local requests to the UNIX file system (or another local file system) and remote requests to the NFS client. The NFS client talks over the network to the NFS server on the server node, which accesses the stored files through its own Virtual File System and local UNIX file system. The NFS client and server modules can also be implemented in user space.]
• NFS Architecture
– Kernel vs User Space implementation
• Kernel space:
– Native integration with the UNIX kernel.
– System calls that access remote files can be routed properly.
– Efficiency and performance (use of the kernel's file caching).
– Direct access to the i-nodes (local files).
– Binary compatibility.
• User space:
– Implementation as a library or a process.
– Portability to other operating systems (Mac OS, Windows).
• NFS Operations
Flat file service operations:
read(fh, offset, count) → attr, data
write(fh, offset, count, data) → attr
create(dirfh, name, attr) → newfh, attr
remove(dirfh, name) → status
getattr(fh) → attr
setattr(fh, attr) → attr

Directory service operations (and more):
lookup(dirfh, name) → fh, attr
rename(dirfh, name, todirfh, toname)
link(newdirfh, newname, dirfh, name)
readdir(dirfh, cookie, count) → entries
symlink(newdirfh, newname, string) → status
readlink(fh) → string
mkdir(dirfh, name, attr) → newfh, attr
rmdir(dirfh, name) → status
statfs(fh) → fsstats

A file handle (fh) is composed of a file system identifier, an i-node number, and an i-node generation number.
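The file handle can be pictured as a small opaque record; the sketch below (with hypothetical field values) simply mirrors the three components listed above.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FileHandle:
        fs_id: int        # file system identifier (which exported file system)
        inode: int        # i-node number of the file within that file system
        generation: int   # i-node generation number, bumped when an i-node is reused

    fh = FileHandle(fs_id=3, inode=112358, generation=7)
    # The handle is opaque to the client: it is simply passed back in read/write/getattr calls.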
• NFS
– Architecture Components (UNIX / Linux)
[Diagram: the server node (nfs-server) runs the nfsd and mountd daemons and rpcbind, the RPC port mapper (e.g. nfs: port 10003, …: port 30410). Its /etc/exports file lists what is exported and to whom:
/dir/to/export *.mydomain.com(ro,root_squash)
/dir/to/share host2.mydomain.com(ro)
/dir/read_write *.mydomain.com(rw)
The client node (host2) mounts these exports with the mount command or through /etc/fstab:
nfs-server:/dir/to/export /mnt/exp timeo=14,intr
nfs-server:/dir/to/share /mnt/share rsize=8192
nfs-server:/dir/read_write /mnt/rw wsize=8192]
• NFS
– Architecture Components (UNIX / Linux)
• Server:
– nfsd: NFS server daemon that services requests from clients.
– mountd: NFS mount daemon that carries out the mount request
passed on by nfsd.
– rpcbind: RPC port mapper used to locate the nfsd daemon.
– /etc/exports: configuration file that defines which portions of the local file systems are exported through NFS and how.
• Client:
– mount: standard file system mount command.
– /etc/fstab: file system table file.
– nfsiod: (optional) local asynchronous NFS I/O server.
• NFS
– Architecture Components (UNIX / Linux)
[Diagram: the exported directories /dir/to/export, /dir/to/share and /dir/read_write in the server's tree (nfs-server) appear, through the NFS protocol, under /mnt/exp, /mnt/share and /mnt/rw in the client's tree (host2).]
• NFS
– Architectural Components
• Mount Service
– used to “mount” (i.e.: connect to the file system) several
sources (nfs, devices, virtual file systems…)
– NFS mount:
mount -t nfs server:/exported/path /local/path
– On the server:
» Table mapping exports to IPs (nfs clients)
– On the client:
» Table mapping nfs mounts to <server, port> pairs.
• NFS
– Security
• /etc/exports allows for selecting the users/hosts that
can mount the file system
• Early implementations of the NFS protocol were stateless: the identity behind each request has to be verified on every call.
• Each NFS client request carries a user-id and a group-id.
• Disadvantage: impersonation attacks.
• Integration with Kerberos might be a solution!
• NFS
– Security
• Kerberized NFS
– Kerberos is too costly to be applied to each request.
• Approach:
– Kerberos is used by the mount service to authenticate the request.
– User-id and Group-id are stored in the server together with the IP
address of the mount request.
– All subsequent requests (read, write,…) must come from the same
user and the same IP as those stored in the server.
• Disadvantages:
– What about multiple users?
– All remote file system trees must be mounted on user login.
• NFS
– Summary
• NFS is a good example of a simple, robust,
transparent and high performance distributed
service.
• Observations:
– Access: good support and integration within the UNIX kernel.
– Concurrency: limited but enough for several scenarios (the
consistency is not perfect for read-write operations on the same
file)
– Location: not guaranteed, naming is controlled by the client
mount operation.
– Replication: limited to read-only file systems; for writable files, the Network Information Service (NIS) runs over NFS and is used to replicate essential system files.
• NFS
– Summary
• Observations (…continues):
– Failure: limited but effective. Thanks to the stateless protocol,
when the service resumes the system recovers.
– Mobility: hardly achieved, relocation of files is not possible while
relocation of file systems is supported.
– Performance: good, multiprocessor systems can reach very high
performance (the number of processors/cores is the limit).
– Scaling: thanks to the transparent way in which remote portions of a file system can be attached to a local tree, large file systems can be scattered over multiple servers, thus distributing the load.
• AFS (Andrew File System)
– Key features:
• Single namespace
• Security (trusted servers)
• Designed to support thousands of clients
• Volumes
• Caching strategy
– History:
• Developed at CMU as part of the Andrew Project.
• Three major implementations: Transarc (IBM), Arla, Linux kernel AFS (kernel 2.6.x).
• A reference for several DFSs (e.g. Coda, NFS v4, DCE/DFS).
• AFS (Andrew File System)
– Assumptions:
• Files are small (i.e. an entire file can be cached).
• Reads are much more frequent than writes.
• Sequential access is common.
• Most files are not shared (i.e. read and written by only one user).
• Shared files are usually not written.
• Disk space is plentiful.
– These assumptions influence:
• the architecture
• the client-server interaction
• the caching strategy
• AFS (Andrew File System)
– Architecture
[Diagram: on an AFS workstation (client), the file tree is rigidly divided into a local part (e.g. bin, tmp, usr, home) and a shared part that is mapped onto volumes exported by the AFS (VICE) server.]
• AFS (Andrew File System)
– Architecture
• On the server(s):
– VICE daemon exports a volume (directories, files, …)
– RPC process.
• On the client (s):
– Venus daemon similar to NFS client
– A rigid division into:
» local (temporary files)
» shared (single namespace, shared file system)
• Infrastructure:
– RPC on top of UDP
• AFS (Andrew File System)
– Volume
• A volume is a portion of a file system containing
– directories
– files
– volume mount points (link to other volumes)
• A volume identifies the file system exported by a
VICE server
• Volumes can be moved between servers in a
completely transparent manner.
• Volumes have a quota to limit the disk space.
• AFS (Andrew File System)
– Caching strategy
• AFS servers are stateful.
• Files are cached on the clients.
• Each time a file is requested from a server…
– the server registers a callback, used to inform the client of updates;
– if another client changes the file, all the clients that hold a cached copy are notified;
– callbacks have to be re-established if the server crashes.
• A client operates on its local copy.
• The updates are sent to the server when the file is closed.
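A toy sketch of this callback-based caching scheme (not the real Venus/VICE protocol; all names are illustrative): the server remembers which clients cache a file and breaks their callbacks when an updated copy is stored on close.

    class ViceServer:
        """Stateful server: tracks which clients hold a cached copy of each file."""
        def __init__(self):
            self.files = {}          # path -> contents
            self.callbacks = {}      # path -> set of clients caching the file

        def fetch(self, client, path):
            self.callbacks.setdefault(path, set()).add(client)   # register a callback
            return self.files.get(path, "")

        def store(self, client, path, contents):
            self.files[path] = contents
            # break callbacks: every *other* caching client is told its copy is stale
            for other in self.callbacks.get(path, set()) - {client}:
                other.invalidate(path)

    class Venus:
        """Client-side daemon: whole-file caching, updates shipped on close."""
        def __init__(self, server):
            self.server, self.cache = server, {}
        def open(self, path):
            if path not in self.cache:
                self.cache[path] = self.server.fetch(self, path)
            return self.cache[path]
        def close(self, path, contents):
            self.cache[path] = contents
            self.server.store(self, path, contents)
        def invalidate(self, path):
            self.cache.pop(path, None)

    server = ViceServer()
    a, b = Venus(server), Venus(server)
    a.open("/share/doc"); b.open("/share/doc")
    a.close("/share/doc", "v2")   # b's cached copy is invalidated through its callback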
• AFS (Andrew File System)
– Security
• Identity (workstation, servers): Kerberos
• Users modeling: users and groups
• Permissions: access control lists
• AFS (Andrew File System)
– Summary
• AFS is a reference model for several large scale DFS
• Compared to NFS, it is based on a stateful protocol:
– advanced caching strategies
– possible inconsistencies after server crashes
• Mostly designed for:
– Dedicated and high performance servers (VICE servers)
– Thousands of users that mostly “read” rather than “write”
• Failures are a weak point for the system
• Google File System (GFS)
– GFS is the distributed file system that supports
the execution of large scale data intensive
applications within the Google infrastructure.
– GFS originates from the re-evaluation of the
common assumptions made in DFS design.
– GFS is designed to:
• handle hundreds of terabytes of data
• manage thousands of disks
• span thousands of machines
• serve hundreds of clients
• Google File System (GFS)
– Design assumptions
• Component failures are a norm rather than the
exception.
• Files are huge by traditional standards (Multi-GB)
• Most files are mutated by appending new data rather
than rewriting existing data.
• Co-designing applications and the file system API
benefits the overall system by increasing flexibility.
• Google File System (GFS)
– Design assumptions
• Failure is the norm.
– The system is built from many inexpensive commodity components that often fail.
– The system must constantly monitor itself; detecting and promptly recovering from failures is a routine task.
• The system stores large files.
– Each file is typically 100 MB or larger.
– Multi-GB files are the common case and must be managed efficiently.
– Small files must be supported, but there is no need to optimize for them.
• Google File System (GFS)
– Design Assumptions
• Append operations are more common than random
writes.
– Two kinds of workloads: large streaming reads and small reads.
– Large streaming reads operate over contiguous region of the file.
– Small reads can be batched together for performance reasons by
applications.
– The system must efficiently implement well-defined semantics for multiple clients appending to the same file.
– Files are often used as producer-consumer queues or for many-way merging.
• Google File System (GFS)
– Design Assumptions
• High sustained bandwidth is more important than
low latency.
– Most of the supported applications put a premium on processing data in bulk at a high rate.
– Few applications have stringent response time
requirements for an individual read or write.
• Google File System (GFS)
– Interface and Operations
• The interface is POSIX-like but not compliant
• POSIX operations
– create, delete
– open, close
– read, write
• Additional operations
– snapshot : creates a copy of a file or a directory tree at a low cost.
– record append: allows multiple clients to append data to the same file concurrently.
• Structure:
– directory based from a logical point of view.
• Google File System (GFS)
– Architecture
[Diagram: an application issues requests through the GFS client. The client sends (file name, chunk index) to the GFS master, which keeps the namespace mapping (e.g. /foo/bar → chunk ef80, …) and replies with the chunk handle and the chunk locations. The client then sends (chunk handle, byte range) directly to a GFS chunkserver and receives the chunk data. Chunkservers store chunks as files in their local Linux file system; the master exchanges instructions and chunk state with the chunkservers.]
• Google File System (GFS)
– Architecture
• The file system is the result of a collection of chunks.
• A chunk stores a contiguous portion of a file and is of fixed size (normally 64 MB, but configurable).
• The master node maintains the mapping between logical files and chunks, together with the chunk locations.
• Chunkservers use the local Linux file system to store chunks.
• The master is contacted when a client requires access to a file and replies with the location of the requested chunk.
• Subsequent operations on the same chunk occur directly between the GFS client and the chunkserver.
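A rough sketch of this read path, with a hypothetical in-memory master and chunkserver. The 64 MB chunk size is the commonly quoted default; file names, handles and locations are illustrative.

    CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB chunks (configurable)

    # Hypothetical master state: (file name, chunk index) -> (chunk handle, chunkserver)
    chunk_table = {("/foo/bar", 0): ("ef80", "chunkserver-17")}
    # Hypothetical chunkserver state: chunk handle -> bytes kept on the local Linux FS
    chunk_store = {"ef80": b"x" * 1000}

    def master_lookup(path, chunk_index):
        """Master: map (file name, chunk index) to (chunk handle, location)."""
        return chunk_table[(path, chunk_index)]

    def chunkserver_read(handle, offset, length):
        """Chunkserver: return a byte range of a chunk it stores locally."""
        return chunk_store[handle][offset:offset + length]

    def gfs_read(path, offset, length):
        """Client: translate the file offset, ask the master once, then read directly."""
        chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
        handle, location = master_lookup(path, chunk_index)
        # further operations on this chunk would go to `location` without the master
        return chunkserver_read(handle, chunk_offset, length)

    print(len(gfs_read("/foo/bar", 128, 256)))   # -> 256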
• Google File System (GFS)
– Disaster Recovery
• Each process (master, chunkserver) is designed to quickly recover its state after a crash or termination.
• Processes are routinely killed to keep the system clean.
• The master can fail over: it can be started on another node if it is not alive.
• To tolerate failures, each chunk is replicated multiple times on different servers.
• Heartbeat monitoring and logging are fundamental for the integrity of the system.
• Google File System (GFS)
– Consistency Model
• Relaxed consistency model (simple and efficient).
• File namespace mutations (i.e. file creation):
– Atomic and handled by the master.
– The operation is very fast.
– Namespace locking guarantees atomicity and correctness.
• The state of a file region after a mutation depends on the type of mutation and on whether it succeeds or fails:
– single successful mutation: all clients see the same data (what was written);
– concurrent successful mutations: all clients see the same data, but it may not reflect what any single mutation wrote;
– failed mutation: clients might see different data at different times.
• Long after a successful mutation, failures might still corrupt data:
– continuous handshakes, checksumming, and restoration from healthy replicas are used to detect and repair corruption.
• Google File System (GFS)
– Consistency Model
[Diagram: a write involving the client, the master, the primary replica and two secondary replicas A and B.]
1. The client asks the master which chunkserver holds the lease for the chunk of interest.
2. The master replies with the address of the primary (and of the secondary replicas).
3. The client pushes the data to all the replicas.
4. Once all the replicas have acknowledged the reception of the data, the client sends a write request to the primary.
5. The primary forwards the request to the secondary replicas and waits for their acknowledgement.
6. The secondary replicas perform the write and reply back to the primary.
7. The primary replies to the client; any errors that occurred are reported to the client.
• Google File System (GFS)
– Summary
• Target
– Support large scale distributed applications
• Assumptions
– Failure is a norm
– Large files are common
– Appends vs random writes
• Characteristics
– Simple structure
– Chunk-based model
– Extremely fault tolerant
• Amazon Dynamo
– Dynamo is a highly available, distributed key-value store serving as a storage facility for maintaining the state of many services exposed by Amazon.
– Dynamo is not a distributed file system, but a distributed storage infrastructure facing similar problems:
• performance
• high availability
• scalability
• consistency
– Therefore, it is useful to study Dynamo within the context of DFSs.
• Amazon Dynamo
– Scenario Requirements
• Dynamo needs to support reliability at massive scale:
– Amazon.com platform:
» ten million customers served all over the world during peak periods;
» an infrastructure composed of tens of thousands of servers and network components located in many datacenters worldwide.
– Amazon.com services:
» Shopping cart: 10 million requests (over 3 million checkouts) per day.
» Session state: hundreds of thousands of concurrently active sessions.
• Reliability is fundamental.
• Even a few seconds of downtime causes massive losses and a bad customer experience.
• Amazon Dynamo
– Observations
• Failure is part of the standard mode of operation when dealing
with an infrastructure of millions of components.
• As a result, Amazon applications need to be constructed so that they treat failure as the normal case.
• Application state management has a fundamental impact on
how the infrastructure scales and is reliable.
• Not all the applications have the same storage needs for their
state.
• Dynamo is a storage solution specifically designed
– to provide a scalable and reliable object store
– to be configurable in several modes of operation
• Amazon Dynamo
– Features
• In order to address this problem Dynamo composes
together several technologies:
– Data partitioning and replication: consistent hashing
– Data consistency: object versioning, and quorum-like
updates
– Failure detection: gossip protocol and explicit membership
– Maintenance: completely distributed infrastructure with
minimal needs of manual administration
• Dynamo is a good example of how theory and algorithms are applied in a production environment.
• Amazon Dynamo
– Assumptions
• Query Model:
– simple read and write operations to a data item that is
identified by a key
– both keys and objects are managed as binary blobs
• ACID properties:
– to avoid poor availability, the ACID properties are relaxed;
– Dynamo targets applications that operate with weak consistency if this favors high availability;
– there is no isolation guarantee and only single-key updates are permitted.
• Amazon Dynamo
– Assumptions
• Efficiency
– the system needs to function on a commodity hardware
infrastructure
– services have stringent latency requirements, generally measured at the 99.9th percentile of the distribution
– services must be able to configure Dynamo to achieve such results
• Security
– Dynamo is used in an internal environment that is assumed
to be non hostile
– there are no security requirements such as authentication or
authorization
• Amazon Dynamo
– SLA-driven design
• General parameters are:
– average, median, and expected variance
• Amazon solution:
– 99.9% of the distribution (rather than average, median,…)
• Observations:
– targeting the 99.9th percentile provides a good customer experience for (nearly) all customers;
– important customer segments are not cut out;
– a higher percentile has proven not to be cost effective;
– Amazon service logic is generally lightweight, so most of the responsibility falls on the storage component.
• Amazon Dynamo
– Design Considerations
Availability is preferred over Consistency
• Observations:
– In systems prone to server and network failures, availability can be increased with optimistic replication:
» changes are propagated in the background;
» concurrent and disconnected work is tolerated;
» the challenge is to detect and resolve conflicts on the data.
– Two problems: who resolves the conflicts, and how are they resolved?
– Dynamo has been designed to be eventually consistent, which means that in the end all the replicas get the update.
• Amazon Dynamo
– Design Considerations
• When to resolve conflicts?
– Two options:
» during writes (the common case: the resolution is performed by the data store, and the write is rejected if there is an inconsistency);
» during reads (quite uncommon: let the application figure out what to do).
– Dynamo's solution:
» conflicts are resolved during reads;
» applications are responsible for resolving conflicts;
» this allows an "always writable" store.
– Outcome: customers are always able to update their shopping cart, despite network partitions or server failures.
• Amazon Dynamo
– Design Considerations
• Who performs conflict resolution?
– If resolution is done by the data store, the options are limited and only simple policies can be used ("last write wins").
– If resolution is done by the application, it can choose the strategy that best suits the case.
• Observations:
– The second solution provides better flexibility but forces application developers to write their own conflict resolution logic.
– Amazon implements the second solution and provides a "last write wins" default policy for lazy applications.
• Amazon Dynamo
– Design Considerations
• Key principles
– Incremental scalability: Dynamo should be able to scale
one host at time, without major impacts on both systems
operators and the system itself.
– Symmetry: every node should have the same set of responsibilities as its peers; this simplifies the process of system provisioning and maintenance.
– Decentralization: the design should favor decentralized P2P techniques over centralized control, to avoid outages.
– Heterogeneity: the system should be able to exploit the heterogeneity of the infrastructure it runs on, for example for work distribution.
• Amazon Dynamo
– Architecture
[Diagram: storage peers arranged on a key ring; each peer runs a request coordinator, a pluggable storage engine (BDB, MySQL) and failure/membership detection.]
• Amazon Dynamo
– Architecture
• Very complex: it includes scalable and robust solutions for
– data persistence
– load balancing
– membership and failure detection
– replica synchronization
– overload handling
– state transfer
– concurrency and job scheduling
– request routing
– system monitoring and alarming
– configuration management
• The following slides focus on partitioning, replication, versioning, and membership and failure handling.
• Amazon Dynamo
– Infrastructure Features

Problem | Solution | Advantages
Partitioning | Consistent hashing | Incremental scalability.
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from the update rate.
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available.
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background.
Membership and failure detection | Gossip-based protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node liveness information.
• Amazon Dynamo
– Partitioning
• The key hash space defines the domain in which data is partitioned among nodes.
• This space is treated as “circular” (highest value wraps around
the smallest).
• Each node is assigned a random value in the domain and the
space is equally partitioned among nodes.
• Each node is responsible of the region in the ring between it and
its predecessor.
• To locate an object, its key is hashed and the ring is walked
clockwise to find the first node whose value is bigger than the
key hash.
• That node will contain the master copy of the object.
• Amazon Dynamo
– Partitioning
• Challenges
– Random partitioning leads to unbalanced distribution
– Random partitioning is oblivious to the heterogeneity of
nodes.
• Solution
– Virtual nodes
» A virtual node is mapped to a key-range
» Each physical node can have several virtual nodes
» Virtual nodes belonging to the same physical node do
not hold contiguous areas.
• Amazon Dynamo
– Partitioning
• Advantages of virtual nodes:
– If a node becomes unavailable, the load associated with it is evenly spread across the other nodes.
– When a node becomes available again, it accepts a roughly equivalent amount of load from all the other nodes.
– The number of virtual nodes a physical node is responsible for can be decided based on its capacity, accounting for the heterogeneity of the physical infrastructure.
• Amazon Dynamo
– Replication
• Data is replicated on multiple hosts.
• The number of hosts is N (configurable).
• Each key K is assigned to a coordinator node and this node
manages its replication.
• Each key a node is responsible for is replicated to the N-1
clockwise successors in the ring.
• The list of nodes that store a specific key is called preference
list.
• The preference list contains more than N nodes to account for
failures.
• Because of virtual nodes, some positions are skipped in the
clockwise walk to have only distinct physical nodes in the list.
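A compact sketch of the partitioning and replication scheme just described: virtual nodes placed on a hash ring, a clockwise walk from the key hash, and a preference list that skips further virtual nodes of a physical node that is already in the list. The node names, the number of virtual nodes per host and the use of MD5 are illustrative choices, not Dynamo's actual parameters.

    import bisect, hashlib

    def h(key: str) -> int:
        """Position on the ring: a hash value treated as an integer."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    # Build the ring: each physical node owns several virtual nodes (tokens).
    physical_nodes = ["A", "B", "C", "D"]
    VIRTUAL_PER_NODE = 8
    ring = sorted((h(f"{node}#{v}"), node)
                  for node in physical_nodes for v in range(VIRTUAL_PER_NODE))
    positions = [pos for pos, _ in ring]

    def preference_list(key: str, n: int = 3):
        """Walk clockwise from hash(key) and keep the first n distinct physical nodes."""
        start = bisect.bisect_right(positions, h(key)) % len(ring)
        result = []
        for i in range(len(ring)):
            node = ring[(start + i) % len(ring)][1]
            if node not in result:        # skip other virtual nodes of the same host
                result.append(node)
            if len(result) == n:
                break
        return result    # result[0] is the coordinator, the others hold the replicas

    print(preference_list("shopping-cart-42"))   # e.g. ['C', 'A', 'D']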
• Amazon Dynamo
– Data Versioning
• Dynamo provides eventual consistency:
– updates are propagated asynchronously.
– put() returns to the caller before all replicas are synched.
• This scheme ensures that operations return in bounded time.
• The disadvantages are…
– under failure scenarios, updates may not reach all the replicas for an extended period;
– there can be different versions of the same object in the store.
• It is important to ensure causality among different
versions.
• Amazon Dynamo
– Data Versioning
• Dynamo….
– … treats each operation on an object (deletion, insertion,
update) as an immutable version of it.
– … uses vector clocks to provide temporal ordering of
versions of the same object.
– … in case conflicts cannot be resolved by the
infrastructure, they are reported to the client application.
• Vector clocks…
– … help to merge different versions of an object
– … provide applications with a clear view of conflicting
versions
84
7/7/2015
Distributed File Systems
Distributed Systems Principles and Paradigms
Case Studies
• Amazon Dynamo
– Data Versioning
• Vector Clocks
– A vector clock is a list of <Nid, counter> pairs.
– Nid is the identifier of the node that performed the operation.
– counter is a per-node counter for that object.
– Pairs are appended to (or updated on) an object as operations are performed on it.
– The resulting clock helps…
» … to identify the history of each version of the object
» … to merge different versions when possible
» … to give complete context to applications in case of unsolvable conflicts
• Amazon Dynamo
– Data Versioning
• Vector Clocks (example)
– D1: ([Sx,1]), write handled by Sx
– D2: ([Sx,2]), a later write handled by Sx
– D3: ([Sx,2],[Sy,1]), write handled by Sy starting from D2
– D4: ([Sx,2],[Sz,1]), write handled by Sz starting from D2
– D5: ([Sx,3],[Sy,1],[Sz,1]), D3 and D4 reconciled and written by Sx
• Amazon Dynamo
– Data Versioning
• Vector Clocks
– A node that receives D1 and D2 can infer that D1 is an older copy of D2 and can be garbage collected.
– A node that receives D2 and D4 (or D3) can infer that D4 (or D3) is a newer copy.
– A node that is aware of D3 and receives D4 can detect that there is no causal relation between them:
» each contains changes that are not reflected in the other;
» both versions must be kept and presented to the client.
– A client receiving both D3 and D4 can resolve the conflict.
– If Sx coordinates the reconciling write, it merges the two versions and updates its own entry in the clock (D5: ([Sx,3],[Sy,1],[Sz,1])).
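A minimal vector clock sketch matching the example above: clocks are dictionaries of <Nid, counter> pairs, descends tells whether one version subsumes another, and concurrent versions are surfaced so that the client (or the coordinator) can reconcile them. The function names are made up for illustration.

    def descends(a: dict, b: dict) -> bool:
        """True if clock a has seen everything recorded in clock b (a is newer or equal)."""
        return all(a.get(node, 0) >= counter for node, counter in b.items())

    def reconcile(a: dict, b: dict, coordinator: str) -> dict:
        """Merge two concurrent versions and bump the coordinating node's counter."""
        merged = {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
        merged[coordinator] = merged.get(coordinator, 0) + 1
        return merged

    D2 = {"Sx": 2}
    D3 = {"Sx": 2, "Sy": 1}
    D4 = {"Sx": 2, "Sz": 1}

    print(descends(D3, D2))                    # True: D3 is a newer copy of D2
    print(descends(D3, D4), descends(D4, D3))  # False False: concurrent, must be reconciled
    print(reconcile(D3, D4, "Sx"))             # {'Sx': 3, 'Sy': 1, 'Sz': 1}, i.e. D5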
• Amazon Dynamo
– Handling Failures: Hinted Handoff
• The quorum like method implies the definition of
two numbers:
– R: minimum number of nodes that must participate in a
successful read operation.
– W: minimum number of nodes that must participate in a
successful write operation.
• If the replicas are N, we have a quorum system if:
– W+R > N
• The latency of a read(write) is determined by the
slowest node in the R(W) chain.
• Amazon Dynamo
– Handling Failures: Hinted Handoff
• The quorum like method is not applicable during
server failures and network partitions.
• To solve this, the sloppy quorum is implemented:
– All read and write operations are performed by the first N
healthy nodes in the preference list.
– When nodes become healthy again a background
synchronization process is started.
• Amazon Dynamo
– Handling Failures: Hinted Handoff
• Example (N = 3; nodes A, B, C, D):
– A is unreachable.
– All the write operations that would have been sent to A are sent to D.
– D keeps track that it holds the object on behalf of A and keeps such hinted replicas in a separate local database.
– Upon detecting that A has recovered, D sends the replicas back to A and deletes them after a successful transfer.
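A toy sketch of the sloppy quorum with hinted handoff for this example (N = 3, nodes A, B, C, D): writes meant for the unreachable node are accepted by the next healthy node together with a hint, and handed back when the node recovers. All names and data structures are illustrative.

    N = 3
    preference = ["A", "B", "C", "D"]                        # extended preference list for one key
    healthy = {"A": False, "B": True, "C": True, "D": True}  # A is currently unreachable

    store = {node: {} for node in preference}                # node -> {key: value}
    hints = {node: [] for node in preference}                # node -> [(home node, key, value)]

    def put(key, value):
        """Sloppy quorum: write to the first N healthy nodes of the preference list."""
        targets = [n for n in preference if healthy[n]][:N]          # e.g. B, C, D
        down = [n for n in preference[:N] if not healthy[n]]         # intended nodes that are down
        stand_ins = [n for n in targets if n not in preference[:N]]  # nodes standing in for them
        for node in targets:
            store[node][key] = value
        for stand_in, home in zip(stand_ins, down):
            hints[stand_in].append((home, key, value))    # D remembers it holds A's replica

    def recover(node):
        """Hinted handoff: when a node comes back, hinted replicas are returned to it."""
        healthy[node] = True
        for holder in preference:
            for hint in [x for x in hints[holder] if x[0] == node]:
                _, key, value = hint
                store[node][key] = value      # hand the replica back to its home node
                hints[holder].remove(hint)    # drop the hint after the successful transfer

    put("cart-42", "3 items")
    recover("A")
    print(store["A"])    # {'cart-42': '3 items'}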
• Amazon Dynamo
– Permanent Failures: Replica Synchronization
• Hinted handoff works best if…
– … system membership churn is low
– … failures are temporary
• What happens if hinted replicas become unavailable before they can be returned to the original replica node?
• Dynamo uses an anti-entropy protocol to address this and other threats to durability.
• This protocol is based on Merkle trees for replica synchronization.
• Amazon Dynamo
– Permanent Failures: Replica Synchronization
• Merkle Trees
– A Merkle tree is a tree whose nodes contain hash values:
» leaf nodes contain the hashes of individual keys;
» parent nodes contain the hash of their children.
– Advantages:
» each subtree can be checked independently;
» the tree reduces the amount of data that needs to be transferred to perform the check.
– A Merkle tree therefore minimizes…
» … the amount of data that needs to be transferred for a synchronization check
» … the number of disk reads performed during the synchronization process
• Amazon Dynamo
– Permanent Failures: Replica Synchronization
• Anti-entropy protocol:
– Each node maintains a separate Merkle tree for each key range it manages.
– This allows nodes to compare key ranges and determine whether they are up to date.
– Two nodes exchange the roots of the Merkle trees of the key ranges they have in common.
– By traversing the children of the nodes whose hashes differ, the differing keys are identified.
• Disadvantage:
– The trees have to be recalculated when nodes join or leave.
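A small sketch of the anti-entropy idea: each replica builds a Merkle tree over the keys of a range, two replicas exchange the roots, and they descend only into subtrees whose hashes differ. The binary tree over a sorted key list is a simplification; a real implementation would choose its own layout.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build(items):
        """items: sorted list of (key, value) pairs. Returns (hash, left, right, items)."""
        if len(items) <= 1:
            return (h(repr(items).encode()), None, None, items)   # leaf: hash of the key/value
        mid = len(items) // 2
        left, right = build(items[:mid]), build(items[mid:])
        return (h(left[0] + right[0]), left, right, items)        # parent: hash of the children

    def diff(a, b):
        """Return the keys whose values differ, descending only where the hashes disagree."""
        if a[0] == b[0]:
            return []                                 # identical subtrees: nothing to transfer
        if a[1] is None or b[1] is None:              # reached leaves: report the differing keys
            return [k for k, _ in a[3] + b[3]]
        return diff(a[1], b[1]) + diff(a[2], b[2])

    replica1 = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
    replica2 = [("k1", "v1"), ("k2", "XX"), ("k3", "v3"), ("k4", "v4")]
    print(sorted(set(diff(build(replica1), build(replica2)))))   # ['k2']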
• Amazon Dynamo
– Failure Detection: Ring Membership
• Common facts:
– An outage rarely indicates a permanent failure, therefore it should not result in rebalancing of the partitions.
– Manual errors can result in the unintentional startup of new Dynamo nodes.
• Result:
– Joining and leaving of Dynamo nodes is made explicit, rather than being implied by the network status.
– An administrator has to manually add or remove nodes from the ring.
• Amazon Dynamo
– Failure Detection: Ring Membership
• Protocol:
– The admin issues a write request for joining a node to or
removing it from the ring.
– The node that serves the write request persists it together
with its time of issue.
– A gossip-based protocol propagates the membership
changes and maintains an eventually consistent view of the
system.
– Every second, each node contacts a peer chosen at random, and the two nodes efficiently reconcile their membership change histories.
• Amazon Dynamo
– Failure Detection: Ring Membership
• Failure detection is used to avoid communicating with
unreachable nodes during:
– put() and get() operations
– partitions and hinted replica transfer
• A purely local notion of failure is sufficient:
– A considers B unreachable if B does not respond to A’s
messages.
– A uses alternate nodes to service requests mapping B’s
partition.
– A periodically retries B to check the latter’s recovery.
• In the absence of traffic between A and B (triggered by clients), there is no need for A to know whether B is unreachable.
• Amazon Dynamo
– Summary
• Dynamo is a highly available, distributed key-value store.
• Its design is strongly influenced by the business needs of Amazon:
– delivering a good customer experience to (nearly) all customers;
– meeting the SLAs agreed with customers.
• As a result, Dynamo:
– privileges availability over consistency;
– provides a simple but extremely efficient infrastructure;
– puts part of the burden on applications.
Summary
• Recent Advancements in DFS
– Evolution of NFS
• Web-NFS
– Web like service for NFS on a well known port.
– Requests use a 'public file handle' and a pathname-capable
variant of lookup().
– Enables applications to access NFS servers directly, e.g. to
read a portion of a large file.
• One-copy-update Semantics (Spritely NFS, NQNFS)
– Includes an open() operation.
– Maintains tables of open files on the server:
» Prevention from multiple writers.
» Callback for clients when files are updated.
– Performance improved by reducing GetAttr(…) traffic.
• Recent Advancements in DFS
– Disk Storage and Organization
• RAID
– Improves performance and reliability by striping data
redundantly across several disk drives.
• Log-Structured File Storage
– Updated pages are stored contiguously in memory and
committed to disk in large contiguous blocks (~ 1 Mbyte).
– File maps are modified whenever an update occurs.
– Garbage collection to recover disk space.
• New Design Approaches
– Distribute file data across several servers
• Exploits high-speed networks (InfiniBand, Gigabit
Ethernet)
• Layered approach, lowest level is like a
'distributed virtual disk'
• Achieves scalability even for a single heavily-used
file
– 'Serverless' architecture
• Exploits processing and disk resources in all
available network nodes
• Service is distributed at the level of individual files
• New Design Approaches
– Examples:
• xFS (Section 8.5): Experimental implementation
demonstrated a substantial performance gain over
NFS and AFS
• Frangipani (Section 8.5): Performance similar to
local UNIX file access
• Tiger Video File System (see Chapter 15)
• Peer-to-peer systems: Napster, OceanStore (UCB),
Farsite (MSR), Publius (AT&T research) - see web
for documentation on these very recent systems
• New Design Approaches
– Replicated read-write files
• High availability
• Disconnected working
– re-integration after disconnection is a major problem if
conflicting updates have occurred
• Examples:
– Bayou system (Section 14.4.2)
– Coda system (Section 14.4.3)
• Conclusions
– Distributed File Systems…
• … provide users with the illusion of a logical file system that is mapped onto a distributed infrastructure
• … hide the complexity of the network and its management from users
– Case Studies:
• Sun NFS: an excellent example of a DFS meeting many important design requirements
• Andrew FS, Google FS: examples of scalable DFSs
• Amazon Dynamo: an example of a scalable object store
– Observations:
• Effective file caching can improve DFS performance
• The DFS architecture allows for masking of failures
• Consistency vs update semantics vs fault tolerance (trade-offs)