Transcript Document
Distributed File Systems
Andy Wang
COP 5611
Advanced Operating Systems
Outline
Basic concepts
NFS
Andrew File System
Replicated file systems
Ficus
Coda
Serverless file systems
Basic Distributed FS Concepts
You are here, the file’s there, what do
you do about it?
Important questions
What files can I access?
How do I name them?
How do I get the data?
How do I synchronize with others?
What files can be accessed?
Several possible choices
Every file in the world
Every file stored in this kind of system
Every file in my local installation
Selected volumes
Selected individual files
What dictates the choice?
Why not make every file available?
Naming issues
Scaling issues
Local autonomy
Security
Network traffic
Naming Files in a Distributed
System
How much transparency?
Does every user/machine/sub-network
need its own namespace?
How do I find a site that stores the file
that I name? Is it implicit in the name?
Can my naming scheme scale?
Must everyone agree on my scheme?
How do I get remote files?
Fetch it over the network?
How much caching?
Replication?
What security is required for data
transport?
Synchronization and
Consistency
Will there be trouble if multiple sites
want to update a file?
Can I get any guarantee that I always
see consistent versions of data?
i.e., will I ever see old data after new?
How soon do I see new data?
NFS
Network File System
Provides distributed filing by remote access to files
With a high degree of transparency
Developed by Sun
NFS Characteristics
Volume-level access
RPC-based (uses XDR)
Stateless remote file access
Location (not name) transparent
Implementation for many systems
All interoperate, even non-Unix ones
Currently based on VFS
VFS/Vnode Review
VFS—Virtual File System
Common interface allowing multiple file
system implementations on one system
Plugged in below user level
Files represented by vnodes
NFS Diagram
[Diagram: an NFS client and an NFS server, each with its own directory tree (/, /tmp, /mnt, /home, /bin and files x, y, foo, bar); part of the server's tree is mounted into the client's namespace]
NFS File Handles
On clients, files are represented by
vnodes
The client internally represents remote
files as handles
Opaque to client
But meaningful to server
To name remote file, provide handle to
server
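As a rough illustration (field names are assumptions, not the real NFS wire format), a handle can be an opaque byte array to the client while the server packs a file system ID, inode number, and generation number into it:

```c
/* Hypothetical sketch of an NFS-style file handle: the client treats it as
 * an opaque array of bytes, while the server packs identifying information
 * into it.  Field names are illustrative, not the real NFS wire format. */
#include <stdint.h>
#include <string.h>

#define FHSIZE 32                 /* size of the opaque handle the client sees */

struct nfs_fhandle {              /* client view: just bytes */
    unsigned char data[FHSIZE];
};

struct srv_fid {                  /* server view: what the bytes mean */
    uint32_t fsid;                /* which exported file system */
    uint32_t inode;               /* file's inode number */
    uint32_t generation;          /* inode generation, to detect reuse */
};

/* Server side: build an opaque handle from its internal identifiers. */
void make_handle(struct nfs_fhandle *fh, uint32_t fsid, uint32_t inode,
                 uint32_t gen)
{
    struct srv_fid fid = { fsid, inode, gen };
    memset(fh->data, 0, FHSIZE);
    memcpy(fh->data, &fid, sizeof fid);
}

/* Server side: recover the identifiers from a handle a client sent back. */
void parse_handle(const struct nfs_fhandle *fh, struct srv_fid *fid)
{
    memcpy(fid, fh->data, sizeof *fid);
}
```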
NFS Handle Diagram
[Diagram: on the client side, a user process's file descriptor leads to a vnode at the VFS level, which the NFS level maps to a handle; on the server side, the handle arrives at the NFS server and passes through the VFS level to a vnode backed by a UFS inode]
How to make this work?
Could integrate it into the kernel
Non-portable, non-distributable
Instead, use existing features to do the
work
VFS for common interface
RPC for data transport
Using RPC for NFS
Must have some process at server that
answers the RPC requests
Continuously running daemon process
Somehow, must perform mounts over
machine boundaries
A second daemon process for this
NFS Processes
nfsd daemons—server daemons that
accept RPC calls for NFS
rpc.mountd daemons—server
daemons that handle mount requests
biod daemons—optional client
daemons that can improve performance
NFS from the Client’s Side
User issues a normal file operation
Like read()
Passes through vnode interface to
client-side NFS implementation
Client-side NFS implementation formats
and sends an RPC packet to perform
operation
Client blocks until the RPC returns
NFS RPC Procedures
16 RPC procedures to implement NFS
Some for files, some for file systems
Including directory ops, link ops, read, write, etc.
Lookup() is the key operation
Because it fetches handles
Other NFS file operations use the handle
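A hedged sketch of that flow: nfs_lookup_rpc() and nfs_read_rpc() below are placeholders standing in for the stubs that would really be generated from the protocol's XDR specification, and remote_read() is a made-up wrapper, not an actual NFS client routine.

```c
/* Illustrative client-side flow: a lookup RPC turns (directory handle, name)
 * into a file handle, and later operations such as read name the file by
 * that handle.  The *_rpc() functions are placeholders for generated stubs. */
#include <string.h>
#include <sys/types.h>

struct nfs_fhandle { unsigned char data[32]; };

static int nfs_lookup_rpc(const struct nfs_fhandle *dir, const char *name,
                          struct nfs_fhandle *out)
{
    (void)dir; (void)name;
    memset(out->data, 0, sizeof out->data);  /* placeholder: real stub sends LOOKUP */
    return 0;
}

static ssize_t nfs_read_rpc(const struct nfs_fhandle *file, off_t off,
                            void *buf, size_t len)
{
    (void)file; (void)off; (void)buf;        /* placeholder: real stub sends READ */
    return (ssize_t)len;
}

/* Read from a file named within an already-mounted remote directory. */
ssize_t remote_read(const struct nfs_fhandle *dir_fh, const char *name,
                    off_t off, void *buf, size_t len)
{
    struct nfs_fhandle fh;

    if (nfs_lookup_rpc(dir_fh, name, &fh) != 0)  /* lookup fetches the handle */
        return -1;
    return nfs_read_rpc(&fh, off, buf, len);     /* later ops use the handle */
}
```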
Mount Operations
Must mount an NFS file system on the
client before you can use it
Requires local and remote operations
Local ops indicate the mount point has an NFS-type VFS at that point in the hierarchy
Remote operations go to remote
rpc.mountd
Mount provides “primal” file handle
NFS on the Server Side
The server side is represented by the
local VFS actually storing the data
Plus rpc.mountd and nfsd daemons
NFS is stateless—servers do not keep
track of clients
Each NFS operation must be self-contained (from server’s point of view)
Implications of Statelessness
Self-contained NFS RPC requests
NFS operations should be idempotent
NFS should use a stateless transport
protocol (e.g., UDP)
Servers don’t worry about client crashes
Server crashes won’t leave junk
More Implications of
Statelessness
Servers don’t know what files clients
think are open
Unlike in UFS, LFS, and most local VFS file systems
Makes it much harder to provide certain
semantics
Scales nicely, though
Preserving UNIX File
Operation Semantics
NFS works hard to provide identical
semantics to local UFS operations
Some of this is tricky
Especially given statelessness of server
E.g., how do you avoid discarding pages of an unlinked file a client still has open?
Sleazy NFS Tricks
Used to provide desired semantics
despite statelessness of the server
E.g., if client unlinks open file, send
rename to server rather than remove
Perform actual remove when file is closed
Won’t work if file removed on server
Won’t work with cooperating clients
File Handles
Method clients use to identify files
Created by the server on file lookup
Must be unique mappings of server file
identifier to universal identifier
File handles become invalid when
server frees or reuses inode
Inode generation number in handle
shows when stale
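A small sketch of the server-side staleness check, using a toy inode table rather than a real file system:

```c
/* Illustrative staleness check: the generation number packed into the handle
 * is compared with the generation currently stored in the inode, so a handle
 * for a freed-and-reused inode is rejected as stale. */
#include <stdint.h>
#include <errno.h>
#include <stddef.h>

struct srv_fid { uint32_t fsid, inode, generation; };
struct inode   { uint32_t generation; int in_use; };

static struct inode itable[64];            /* toy inode table for illustration */

static struct inode *i_lookup(uint32_t fsid, uint32_t number)
{
    (void)fsid;                            /* pretend there is one file system */
    return (number < 64) ? &itable[number] : NULL;
}

int check_handle(const struct srv_fid *fid)
{
    struct inode *ip = i_lookup(fid->fsid, fid->inode);

    if (ip == NULL || !ip->in_use)
        return -ESTALE;                    /* inode has been freed */
    if (ip->generation != fid->generation)
        return -ESTALE;                    /* inode reused for a different file */
    return 0;                              /* handle still names the same file */
}
```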
NFS Daemon Processes
nfsd daemon
biod daemon
rpc.mountd daemon
rpc.lockd daemon
rpc.statd daemon
nfsd Daemon
Handle incoming RPC requests
Often multiple nfsd daemons per site
An nfsd daemon makes kernel calls to do
the real work
Allows multiple threads
biod Daemon
Does readahead for clients
To make use of kernel file buffer cache
Only improves performance—NFS works
correctly without biod daemon
Also flushes buffered writes for clients
rpc.mountd Daemon
Runs on server to handle VFS-level
operations for NFS
Particularly remote mount requests
Provides initial file handle for a remote
volume
Also checks that incoming requests are
from privileged ports (in UDP/IP packet
source address)
rpc.lockd Daemon
NFS server is stateless, so it does not
handle file locking
rpc.lockd provides locking
Runs on both client and server
Client side catches request, forwards to
server daemon
rpc.lockd handles lock recovery when
server crashes
rpc.statd Daemon
Also runs on both client and server
Used to check status of a machine
Server’s rpc.lockd asks rpc.statd to
store permanent lock information (in file
system)
And to monitor status of locking machine
If client crashes, clear its locks from
server
Recovering Locks After a
Crash
If server crashes and recovers, its
rpc.lockd contacts clients to reestablish
locks
If client crashes, rpc.statd contacts
client when it becomes available again
Client has short grace period to
revalidate locks
Then they’re cleared
Caching in NFS
What can you cache at NFS clients?
How do you handle invalid client
caches?
What can you cache?
Data blocks read ahead by biod daemon
Cached in normal file system cache area
What can you cache, con’t?
File attributes
Specially cached by NFS
Directory attributes handled a little
differently than file attributes
Especially important because many
programs get and set attributes frequently
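A minimal sketch of attribute caching with a timeout; the structure names and the 3-second value are assumptions for illustration, not the actual NFS client implementation.

```c
/* Client-side attribute cache with a simple timeout: return cached
 * attributes if fresh enough, otherwise refetch from the server. */
#include <time.h>

struct nfs_attr  { long size, mtime, mode; };
struct attrcache {
    struct nfs_attr attr;
    time_t          fetched;     /* when we last asked the server */
};

#define ATTR_TIMEOUT 3           /* seconds; illustrative value */

static void getattr_rpc(struct nfs_attr *out)   /* stand-in for the GETATTR RPC */
{
    out->size = 0; out->mtime = 0; out->mode = 0;
}

struct nfs_attr *get_attributes(struct attrcache *c)
{
    time_t now = time(NULL);

    if (now - c->fetched >= ATTR_TIMEOUT) {
        getattr_rpc(&c->attr);   /* expired: ask the server again */
        c->fetched = now;
    }
    return &c->attr;             /* many programs call this very frequently */
}
```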
Security in NFS
NFS inherits RPC mechanism security
Some RPC mechanisms provide decent
security
Some don’t
Mount security provided via knowing
which ports are permitted to mount
what
The Andrew File System
A different approach to remote file
access
Meant to service a large organization
Such as a university campus
Scaling is a major goal
Basic Andrew Model
Files are stored permanently at file
server machines
Users work from workstation machines
With their own private namespace
Andrew provides mechanisms to cache
user’s files from shared namespace
User Model of AFS Use
Sit down at any AFS workstation
anywhere
Log in and authenticate who I am
Access all files without regard to which
workstation I’m using
The Local Namespace
Each workstation stores a few files
Mostly system programs and
configuration files
Workstations are treated as generic,
interchangeable entities
Virtue and Vice
Vice is the system run by the file
servers
Distributed system
Virtue is the protocol client workstations
use to communicate to Vice
Overall Architecture
System is viewed as a WAN composed
of LANs
Each LAN has a Vice cluster server
Which stores local files
But Vice makes all files available to all
clients
Andrew Architecture Diagram
[Diagram: several LANs connected by a WAN; each LAN has its own Vice cluster server]
Caching the User Files
Goal is to offload work from servers to
clients
When must servers do work?
To answer requests
To move data
Whole files cached at clients
Why Whole-file Caching?
Minimizes communications with server
Most files used in entirety, anyway
Easier cache management problem
Requires substantial free disk space on
workstations
- Doesn’t address huge file problems
The Shared Namespace
An Andrew installation has globally
shared namespace
All client’s files in the namespace with
the same names
High degree of name and location
transparency
How do servers provide the
namespace?
Files are organized into volumes
Volumes are grafted together into
overall namespace
Each file has globally unique ID
Volumes are stored at individual servers
But a volume can be moved from server to
server
Finding a File
At high level, files have names
Directory translates name to unique ID
If client knows where the volume is, it
simply sends unique ID to appropriate
server
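A small illustration of that path, with made-up structure names and fid values: the directory maps the name to a (volume, vnode, uniquifier) ID, and the volume number then selects which server to contact.

```c
/* Sketch of an AFS-style name lookup: directory entries map names to
 * globally unique file IDs; the volume part of the ID picks the server. */
#include <stdio.h>
#include <string.h>

struct afs_fid    { unsigned volume, vnode, uniquifier; };
struct dirent_fid { char name[64]; struct afs_fid fid; };

/* Translate a name using the (possibly cached) directory contents. */
static int dir_lookup(const struct dirent_fid *entries, int n,
                      const char *name, struct afs_fid *out)
{
    for (int i = 0; i < n; i++) {
        if (strcmp(entries[i].name, name) == 0) {
            *out = entries[i].fid;
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    struct dirent_fid dir[] = {             /* illustrative directory contents */
        { "paper.tex", { 536870913, 42, 7 } },
        { "notes.txt", { 536870913, 88, 1 } },
    };
    struct afs_fid fid;

    if (dir_lookup(dir, 2, "notes.txt", &fid) == 0)
        /* Next step: look up fid.volume in the cached volume-location data
         * to pick a server, then send the request there. */
        printf("fid = %u.%u.%u\n", fid.volume, fid.vnode, fid.uniquifier);
    return 0;
}
```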
Finding a Volume
What if you enter a new volume?
How do you find which server stores the
volume?
Volume-location database stored on
each server
Once information on volume is known,
client caches it
Moving a Volume
When a volume moves from server to
server, update database
Heavyweight distributed operation
What about clients with cached
information?
Old server maintains forwarding info
Also eases server update
Handling Cached Files
Files fetched transparently when
needed
File system traps opens
Sends them to local Venus process
The Venus Daemon
Responsible for handling single client
cache
Caches files on open
Writes modified versions back on close
Cached files saved locally after close
Cache directory entry translations, too
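A minimal sketch of that open/close behavior, assuming hypothetical fetch/store helpers and a local cache path; the real Venus is considerably more involved.

```c
/* Whole-file caching as Venus might do it: fetch the entire file into a
 * local cache file on open, store it back on close if it was modified. */
#include <stdio.h>

struct cached_file {
    char local_path[256];    /* copy in the client's cache directory */
    int  dirty;              /* modified since it was fetched? */
};

/* Stand-ins for the fetch/store traffic to the Vice servers. */
static void fetch_whole_file(const char *remote, const char *local) { (void)remote; (void)local; }
static void store_whole_file(const char *local, const char *remote) { (void)local; (void)remote; }

/* open(): bring the entire file into the local disk cache. */
void venus_open(struct cached_file *cf, const char *remote_name)
{
    snprintf(cf->local_path, sizeof cf->local_path, "/var/cache/afs/%s", remote_name);
    fetch_whole_file(remote_name, cf->local_path);   /* whole-file transfer */
    cf->dirty = 0;
}

/* close(): if the local copy changed, ship the whole file back. */
void venus_close(struct cached_file *cf, const char *remote_name)
{
    if (cf->dirty)
        store_whole_file(cf->local_path, remote_name);
    /* the cached copy stays on local disk for later reuse */
}
```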
Consistency for AFS
If my workstation has a locally cached
copy of a file, what if someone else
changes it?
Callbacks used to invalidate my copy
Requires servers to keep info on who
caches files
Write Consistency in AFS
What if I write to my cached copy of a
file?
Need to get write permission from
server
Which invalidates other copies
Permission obtained on open for write
Need to obtain new data at this point
Write Consistency in AFS,
Con’t
Initially, written only to local copy
On close, Venus sends update to server
Extra mechanism to handle failures
Storage of Andrew Files
Stored in UNIX file systems
Client cache is a directory on local
machine
Low-level names do not match Andrew
names
Venus Cache Management
Venus keeps two caches
Status cache kept in virtual memory
For fast attribute lookup
Data cache kept on disk
Venus Process Architecture
Venus is a single user-level process
But multithreaded
Uses RPC to talk to server
RPC is built on low level datagram service
AFS Security
Only server/Vice are trusted here
Client machines might be corrupted
No client programs run on Vice
machines
Clients must authenticate themselves to
servers
Encrypted transmissions
AFS File Protection
AFS supports access control lists
Each file has list of users who can access it
And permitted modes of access
Maintained by Vice
Used to mimic UNIX access control
AFS Read-only Replication
For volumes containing files that are
used frequently, but not changed often
E.g., executables
AFS allows multiple servers to store
read-only copies
Distributed FS, Continued
Andy Wang
COP 5611
Advanced Operating Systems
Outline
Replicated file systems
Ficus
Coda
Serverless file systems
Replicated File Systems
NFS provides remote access
AFS provides high quality caching
Why isn’t this enough?
More precisely, when isn’t this enough?
When Do You Need Replication?
For write performance
For reliability
For availability
For mobile computing
For load sharing
Optimistic replication increases these
advantages
Some Replicated File Systems
Locus
Ficus
Coda
Rumor
All optimistic: few conservative file
replication systems have been built
Ficus
Optimistic file replication based on
peer-to-peer model
Built in Unix context
Meant to service large network of
workstations
Built using stackable layers
Peer-to-peer Replication
All replicas are equal
No replicas are masters, or servers
All replicas can provide any service
All replicas can propagate updates to all
other replicas
Client/server is the other popular model
Basic Ficus Architecture
Ficus replicates at volume granularity
Given volume can be replicated many
times
Updates propagated as they occur
On a single, best-effort basis
Performance limitations on scale
Consistency achieved by periodic reconciliation
Stackable Layers in Ficus
Ficus is built out of stackable layers
Exact composition depends on what
generation of system you look at
Ficus Stackable Layers Diagram
[Diagram: Ficus’s stackable layer composition — select and FLFS layers above a transport layer, with an FPFS layer over a storage layer at each replica site]
Ficus Diagram
[Diagram: three sites, A, B, and C, each storing a replica (1, 2, 3) of the same volume]
An Update Occurs
[Diagram: an update is applied to the replica at site A; the replicas at sites B and C are now out of date]
Reconciliation in Ficus
Reconciliation process runs periodically
on each Ficus site
For each local volume replica
Reconciliation strategy implies eventual
consistency guarantee
Frequency of reconciliation affects how
long “eventually” takes
Steps in Reconciliation
1. Get information about the state of a
remote replica
2. Get information about the state of the
local replica
3. Compare the two sets of information
4. Change local replica to reflect remote
changes
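A compact outline of one such pass, with stand-in types and a scalar version field for brevity; the real comparison uses per-file version vectors, described shortly.

```c
/* Outline of one pair-wise reconciliation pass, following the four steps
 * above.  Types and helpers are illustrative, not Ficus's interfaces. */
#include <stdio.h>

struct replica_state { int version; };      /* stand-in summary of a replica */

static struct replica_state get_remote_state(int site, int volume)
{
    (void)site; (void)volume;
    return (struct replica_state){ .version = 2 };   /* step 1: ask the remote site */
}

static struct replica_state get_local_state(int volume)
{
    (void)volume;
    return (struct replica_state){ .version = 1 };   /* step 2: examine local replica */
}

static void reconcile_once(int remote_site, int volume)
{
    struct replica_state remote = get_remote_state(remote_site, volume);
    struct replica_state local  = get_local_state(volume);

    /* Step 3: compare the two; step 4: pull any changes we are missing. */
    if (remote.version > local.version)
        printf("volume %d: fetching newer data from site %d\n", volume, remote_site);
}

int main(void)
{
    /* Run periodically, for each local volume replica, against one peer. */
    reconcile_once(/*remote_site=*/1, /*volume=*/7);
    return 0;
}
```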
Ficus Reconciliation Diagram
[Diagram: site C reconciles with site A and picks up the update made at A]
Ficus Reconciliation Diagram
Con’t
[Diagram: site B then reconciles with site C and receives A’s update indirectly]
Gossiping and Reconciliation
Reconciliation benefits from the use of
gossip
In the example just shown, an update
originating at A got to B through
communications between B and C
So B can get the update without talking
to A directly
Benefits of Gossiping
Potentially less communications
Shares load of sending updates
Easier recovery behavior
Handles disconnections nicely
Handles mobile computing nicely
Peer model systems get more benefit
than client/server model systems
Reconciliation Topology
Reconciliation in Ficus is pair-wise
In the general case, which pairs of
replicas should reconcile?
Reconciling all pairs is unnecessary
Due to gossip
Want to minimize number of recons
But propagate data quickly
Ring Reconciliation Topology
Adaptive Ring Topology
Problems in File Reconciliation
Recognizing updates
Recognizing update conflicts
Handling conflicts
Recognizing name conflicts
Update/remove conflicts
Garbage collection
Ficus has solutions for all these problems
Recognizing Updates in Ficus
Ficus keeps per-file version vectors
Updates detected by version vector
comparisons
The data for the later version can then
be propagated
Ficus propagates full files
Recognizing Update Conflicts
Concurrent updates can lead to update
conflicts
Version vectors permit detection of
update conflicts
Works for n-way conflicts, too
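A minimal version-vector comparison, assuming one counter per replica (the vector length and the example values are illustrative): if one vector dominates, that copy is simply newer; if each is ahead somewhere, the updates were concurrent and a conflict has been detected.

```c
/* Version-vector comparison for update and conflict detection. */
#include <stdio.h>

#define NREPLICAS 3

enum vv_order { VV_EQUAL, VV_DOMINATES, VV_DOMINATED, VV_CONFLICT };

enum vv_order vv_compare(const unsigned a[], const unsigned b[])
{
    int a_ahead = 0, b_ahead = 0;

    for (int i = 0; i < NREPLICAS; i++) {
        if (a[i] > b[i]) a_ahead = 1;
        if (b[i] > a[i]) b_ahead = 1;
    }
    if (a_ahead && b_ahead) return VV_CONFLICT;   /* concurrent updates */
    if (a_ahead)            return VV_DOMINATES;  /* a is strictly newer */
    if (b_ahead)            return VV_DOMINATED;  /* b is strictly newer */
    return VV_EQUAL;                              /* same version */
}

int main(void)
{
    unsigned at_A[NREPLICAS] = { 2, 1, 0 };   /* A has its own 2 updates, one of B's */
    unsigned at_B[NREPLICAS] = { 1, 2, 0 };   /* B missed one of A's, made one more of its own */

    if (vv_compare(at_A, at_B) == VV_CONFLICT)
        printf("concurrent updates: call a resolver\n");
    return 0;
}
```

The same componentwise comparison works unchanged when more than two replicas have updated concurrently.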
Handling Update Conflicts
Ficus uses resolver programs to handle
conflicts
Resolvers work on one pair of replicas
of one file
System attempts to deduce file type
and call proper resolver
If all resolvers fail, notify user
Ficus also blocks access to file
Handling Directory Conflicts
Directory updates have very limited
semantics
So directory conflicts are easier to deal
with
Ficus uses in-kernel mechanisms to
automatically fix most directory conflicts
Directory Conflict Diagram
[Diagram: replica 1 of the directory contains Earth, Mars, and Saturn; replica 2 contains Earth, Mars, and Sedna]
How Did This Directory Get Into
This State?
If we could figure out what operations
were performed on each side that caused
each replica to enter this state,
We could produce a merged version
But there are several possibilities
Possibility 1
1. Earth and Mars exist
2. Create Saturn at replica 1
3. Create Sedna at replica 2
Correct result is directory containing
Earth, Mars, Saturn, and Sedna
The Create/delete Ambiguity
This is an example of a general problem
with replicated data
Cannot be solved with per-file version
vectors
Requires per-entry information
Ficus keeps such information
Must save removed files’ entries for a
while
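One way such per-entry information might look, sketched with illustrative fields and a simplified merge rule rather than Ficus’s actual on-disk format:

```c
/* Per-entry directory state that disambiguates creates from deletes: a
 * removed name stays in the directory for a while, marked deleted, so
 * reconciliation can tell "never created here" from "created, then deleted". */
struct dir_entry {
    char     name[64];
    unsigned create_id;      /* identifies the create operation */
    int      deleted;        /* entry is kept but marked deleted */
};

/* Merge rule for one name seen at a remote replica (simplified). */
void merge_entry(struct dir_entry *local, const struct dir_entry *remote)
{
    if (local->name[0] == '\0') {
        *local = *remote;                 /* new entry created remotely */
    } else if (remote->deleted && remote->create_id == local->create_id) {
        local->deleted = 1;               /* the same entry was deleted remotely */
    }
    /* If the create_ids differ, two different files got the same name:
     * that is a name conflict, handled separately. */
}
```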
Possibility 2
1. Earth, Mars, and Saturn exist
2. Delete Saturn at replica 2
3. Create Sedna at replica 2
Correct result is directory containing
Earth, Mars, and Sedna
And there are other possibilities
Recognizing Name Conflicts
Name conflicts occur when two
different files are concurrently given
same name
Ficus recognizes them with its per-entry
directory info
Then what?
Handle similarly to update conflicts
Add disambiguating suffixes to names
Internal Representation of
Problem Directory
[Diagram: internal per-entry representation — replica 1 lists Earth, Mars, and Saturn; replica 2 lists Earth, Mars, Saturn (as a retained entry), and Sedna]
Update/remove Conflicts
Consider case where file “Saturn” has
two replicas
1. Replica 1 receives an update
2. Replica 2 is removed
What should happen?
A matter of systems semantics,
basically
Ficus’ No-lost-updates Semantics
Ficus handles this problem by defining
its semantics to be no-lost-updates
In other words, the update must not
disappear
But the remove must happen
Put “Saturn” in the orphanage
Requires temporarily saving removed files
Removals and Hard Links
Unix and Ficus support hard links
Effectively, multiple names for a file
Cannot remove a file’s bits until the last
hard link to the file is removed
Tricky in a distributed system
Link Example
[Diagram: replicas 1 and 2 each hold directory foodir containing files red and blue]
Link Example, Part II
[Diagram: blue is updated at replica 1; both replicas still show foodir with red and blue]
Link Example, Part III
[Diagram: at replica 2, foodir/blue is deleted and a hard link to blue is created in directory bardir, concurrently with replica 1’s update to blue]
What Should Happen Here?
Clearly, the link named foodir/blue
should disappear
And the link in bardir should remain
But what version of the data should the bardir link point to?
No-lost-update semantics say it must be
the update at replica 1
Garbage Collection in Ficus
Ficus cannot throw away removed
things at once
Directory entries
Updated files for no-lost-updates
Non-updated files due to hard links
When can Ficus reclaim the space these
use?
When Can I Throw Away My Data?
Not until all links to the file disappear
Moreover, just because I know all links
have disappeared doesn’t mean I can
throw everything away
Global information, not local
Must wait till everyone knows
Requires two trips around the ring
Why Can’t I Forget When I Know
There Are No Links
I can throw the data away
But I can’t forget that I knew this
Can’t assume that because I don’t need it, nobody else does either
Because not everyone knows it yet
For them to throw their data away, they must learn
So I must remember for their benefit
Coda
A different approach to optimistic
replication
Inherits a lot from Andrew
Basically, a client/server solution
Developed at CMU
Coda Replication Model
Files stored permanently at server
machines
Client workstations download temporary
replicas, not cached copies
Can perform updates without getting
a token from the server
So concurrent updates possible
Detecting Concurrent Updates
Workstation replicas only reconcile with
their server
At recon time, they compare their state
of files with server’s state
Detecting any problems
Since workstations don’t gossip,
detection is easier than in Ficus
Handling Concurrent Updates
Basic strategy is similar to Ficus’
Resolver programs are called to deal
with conflicts
Coda allows resolvers to deal with
multiple related conflicts at once
Also has some other refinements to
conflict resolution
Server Replication in Coda
Unlike Andrew, writable copies of a file
can be stored at multiple servers
Servers have peer-to-peer replication
Servers have strong connectivity, crash
infrequently
Thus, Coda uses simpler peer-to-peer
algorithms than Ficus must
Why Is Coda Better Than AFS?
Writes don’t lock the file
Writes happen quicker
More local autonomy
Less write traffic on the network
Workstations can be disconnected
Better load sharing among servers
Comparing Coda to Ficus
Coda uses simpler algorithms
Less likely to be bugs
Less likely to be performance problems
Coda doesn’t allow client gossiping
Coda has built-in security
Coda garbage collection simpler
Serverless Network File Systems
New network technologies are much
faster, with much higher bandwidth
In some cases, going over the net is
quicker than going to local disk
How can we improve file systems by
taking advantage of this change?
Fundamental Ideas of xFS
Peer workstations providing file service
for each other
High degree of location independence
Make use of all machines’ caches
Provide reliability in case of failures
xFS
Developed at Berkeley
Inherits ideas from several sources
LFS
Zebra (RAID-like ideas)
Multiprocessor cache consistency
Built for Network of Workstations
(NOW) environment
What Does a File Server Do?
Stores file data blocks on its disks
Maintains file location information
Maintains cache of data blocks
Manages cache consistency for its
clients
xFS Must Provide These Services
In essence, every machine takes on
some of the server’s responsibilities
Any data or metadata might be located
at any machine
Key challenge is providing same
services centralized server provided in a
distributed system
Key xFS Concepts
Metadata manager
Stripe groups for data storage
Cooperative caching
Distributed cleaning processes
How Do I Locate a File in xFS?
I’ve got a file name, but where is it?
Assuming it’s not locally cached
File’s director converts name to a
unique index number
Consult the metadata manager to find
out where file with that index number is
stored-the manager map
The Manager Map
Data structure that allows translation of
index numbers to file managers
Not necessarily file locations
Kept by each metadata manager
Globally replicated data structure
Simply says what machine manages the
file
Using the Manager Map
Look up index number in local map
Index numbers are clustered, so many
fewer entries than files
Send request to responsible manager
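A toy illustration of that lookup, with a made-up grouping rule and map contents; the real map and group assignment policy belong to xFS itself.

```c
/* Manager map lookup: a globally replicated table from index-number
 * groups to manager machines. */
#include <stdio.h>

#define MAP_ENTRIES 16           /* index numbers are clustered into groups */

/* manager_map[g] = machine that manages index group g (illustrative contents) */
static int manager_map[MAP_ENTRIES] = {
    0, 0, 1, 1, 2, 2, 3, 3, 0, 1, 2, 3, 0, 1, 2, 3
};

/* The file's index number determines its group, hence its manager. */
int manager_for(unsigned index_number)
{
    unsigned group = index_number % MAP_ENTRIES;   /* illustrative grouping rule */
    return manager_map[group];
}

int main(void)
{
    unsigned idx = 4711;                 /* index number from the file's directory */
    printf("send request to manager machine %d\n", manager_for(idx));
    return 0;
}
```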
What Does the Manager Do?
Manager keeps two types of
information
1. imap information
2. caching information
If some other site has the file in its
cache, tell requester to go to that site
Always use cache before disk
Even if cache is remote
What if No One Caches the
Block?
Metadata manager for this file then
must consult its imap
Imap tells which disks store the data
block
Files are striped across disks stored on
multiple machines
Typically single block is on one disk
Writing Data
xFS uses RAID-like methods to store
data
RAID sucks for small writes
So xFS avoids small writes
By using LFS-style operations
Batch writes until you have a full stripe’s
worth
Stripe Groups
Set of disks that cooperatively store
data in RAID fashion
xFS uses single parity disk
Alternative to striping all data across all
disks
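A tiny illustration of single-parity striping: the parity block is the XOR of the data blocks in the stripe, so any one lost block can be rebuilt from the others (group and block sizes here are made up).

```c
/* Single-parity stripe group: parity = XOR of the data blocks. */
#include <stdio.h>

#define GROUP  4        /* data disks in the stripe group */
#define BLOCK  8        /* bytes per block, tiny for illustration */

void compute_parity(unsigned char data[GROUP][BLOCK], unsigned char parity[BLOCK])
{
    for (int i = 0; i < BLOCK; i++) {
        parity[i] = 0;
        for (int d = 0; d < GROUP; d++)
            parity[i] ^= data[d][i];          /* XOR across the stripe */
    }
}

int main(void)
{
    unsigned char data[GROUP][BLOCK] = { { 'x' }, { 'f' }, { 's' }, { '!' } };
    unsigned char parity[BLOCK];

    compute_parity(data, parity);
    /* Losing one data block: XOR of the parity block and the surviving
     * blocks regenerates the missing one. */
    printf("parity[0] = 0x%02x\n", parity[0]);
    return 0;
}
```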
Cooperative Caching
Each site’s cache can service requests
from all other sites
Working from assumption that network
access is quicker than disk access
Metadata managers used to keep track
of where data is cached
So remote cache access takes 3 network
hops
Getting a Block from a Remote
Cache
[Diagram: (1) the client’s request goes to the metadata server, which consults its manager map and cache consistency state; (2) the server forwards the request to the caching site; (3) the caching site returns the block from its Unix cache to the client — three network hops]
Providing Cache Consistency
Per-block token consistency
To write a block, client requests token
from metadata server
Metadata server retrieves the token from
whoever has it
And invalidates other caches
Writing site keeps token
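A simplified, single-threaded sketch of that token protocol; the data structures and message functions are assumptions for illustration, not xFS’s actual implementation.

```c
/* Per-block write-token consistency: the manager grants the write token,
 * recalling it from the current holder and invalidating other cached
 * copies first. */
#include <stdio.h>

#define MAX_CLIENTS 8

struct block_state {
    int token_holder;              /* client holding the write token, or -1 */
    int cached_at[MAX_CLIENTS];    /* which clients cache this block */
};

static void invalidate(int client)   { printf("invalidate cache at client %d\n", client); }
static void recall_token(int client) { printf("recall token from client %d\n", client); }

/* A client asks the manager for permission to write the block. */
void request_write_token(struct block_state *b, int client)
{
    if (b->token_holder >= 0 && b->token_holder != client)
        recall_token(b->token_holder);          /* fetch token from current holder */

    for (int c = 0; c < MAX_CLIENTS; c++)
        if (b->cached_at[c] && c != client) {
            invalidate(c);                       /* other copies become invalid */
            b->cached_at[c] = 0;
        }

    b->token_holder = client;                    /* writing site keeps the token */
    b->cached_at[client] = 1;
}

int main(void)
{
    struct block_state b = { .token_holder = 2,
                             .cached_at = { [1] = 1, [2] = 1, [5] = 1 } };
    request_write_token(&b, 1);
    return 0;
}
```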
Which Sites Should Manage
Which Files?
Could randomly assign equal number of
file index groups to each site
Better if the site using a file also
manages it
In particular, if most frequent writer
manages it
Can reduce network traffic by ~50%
Cleaning Up
File data (and metadata) is stored in log
structures spread across machines
A distributed cleaning method is
required
Each machine stores info on its usage
of stripe groups
Each cleans up its own mess
Basic Performance Results
Early results from an incomplete system
Can provide up to 10 times the file data bandwidth of a single NFS server
Even better on creating small files
Doesn’t compare xFS to multi-machine servers