Distributed File System


A distributed file system (DFS) is a file system spread over multiple, autonomous computers. A distributed file system should provide:
- Network transparency: hide the details of where a file is located.
- High availability: ease of access irrespective of the physical location of the file.

These objectives are difficult to achieve because the distributed file system is vulnerable to problems in the underlying networks as well as to crashes of the systems that are the "file sources". Replication/mirroring can be used to alleviate this problem. However, replication/mirroring introduces additional issues such as consistency.
B. Prabhakaran
1
DFS: Architecture



- In general, files in a DFS can be located on "any" system.
- We call the "source(s)" of files servers and those accessing them clients.
- Potentially, a server for one file can become a client for another file.
- However, most distributed systems distinguish between clients and servers in a stricter way:
  - Clients simply access files and do not have/share local files. Even if clients have disks, the disks are used for swapping, caching, loading the OS, etc.
  - Servers are the actual sources of files. In most cases, servers are more powerful machines (in terms of CPU, physical memory, disk bandwidth, ...).
DFS: Architecture …
[Figure: several servers and several clients, all connected by a computer network.]
DFS Data Access
[Figure: DFS data-access flow. On a request to access data, the client first checks its cache; if the data is present, it is returned to the client. If not, the client checks its local disk (if any). If the data is still not present, a request is sent over the network to the file server, which checks the server cache and, on a miss, issues a disk read to load the server cache. The data is then loaded into the client cache and returned to the client.]
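The data-access flow above can be sketched in code. This is an illustrative model, not any real DFS API: the caches and disks are plain dictionaries, and all function names are invented.

```python
# Hypothetical sketch of the DFS data-access flow: check the client cache,
# then the local disk (if any), then ask the file server, which checks its
# own cache before issuing a disk read.

def server_read(block_id, server_cache, server_disk):
    """Server side: serve from the server cache, else load from disk."""
    if block_id not in server_cache:
        server_cache[block_id] = server_disk[block_id]  # issue disk read
    return server_cache[block_id]

def client_read(block_id, client_cache, local_disk, server_cache, server_disk):
    """Client side: cache -> local disk -> network request to the server."""
    if block_id in client_cache:                 # data present in client cache
        return client_cache[block_id]
    if block_id in local_disk:                   # data present on local disk
        data = local_disk[block_id]
    else:                                        # send request to file server
        data = server_read(block_id, server_cache, server_disk)
    client_cache[block_id] = data                # load data into client cache
    return data
```

Note that every miss populates the client cache, so a repeated access is served locally without touching the network.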
Mechanisms for DFS




- Mounting: helps combine files/directories on different systems to form a single file-system structure.
- Caching: reduces the response time of bringing data from remote machines.
- Hints: a modified form of caching.
- Bulk data transfer: reduces the delay due to transfer of files over the network. Bulk transfer can:
  - Obtain multiple blocks with a single seek.
  - Format and transfer a large number of packets in a single context switch.
  - Reduce the number of acknowledgements to be sent.
  - (e.g.) useful when downloading an OS onto a diskless client.
- Encryption: establish a key for encryption with the help of an authentication server.
Mounting





- Mounting helps build a hierarchy of file directories.
- A collection of files can be mounted at an internal node of the hierarchy.
- The node at which this collection of files is mounted is called the mount point.
- The operating system kernel maintains a structure called the mount table, mapping mount points to the appropriate storage devices.
- The mount table can be maintained at:
  - Each client: employed in the Sun Network File System (NFS).
  - The servers: all clients see the same file-system structure; employed in the Sprite file system.
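A client-side mount table can be sketched as a mapping from mount points to remote locations, resolved by longest matching prefix. The server names and paths here are made up for illustration:

```python
# Illustrative client-side mount table: maps mount points to
# (server, remote path). A pathname is resolved by its longest
# matching mount-point prefix, as an NFS-style client might do.

MOUNT_TABLE = {
    "/home": ("serverY", "/export/home"),
    "/home/projects": ("serverZ", "/export/projects"),
}

def resolve(path):
    """Map a local pathname to (server, server-side pathname)."""
    best = ""
    for mount_point in MOUNT_TABLE:
        if path == mount_point or path.startswith(mount_point + "/"):
            if len(mount_point) > len(best):
                best = mount_point            # longest prefix wins
    if not best:
        return ("local", path)                # not under any mount point
    server, remote_root = MOUNT_TABLE[best]
    return (server, remote_root + path[len(best):])
```

Because the table is per-client, two clients with different tables can see different name spaces, exactly the situation NFS allows.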
Name Space Hierarchy
[Figure: name-space hierarchy. The root (/) resides on server X with children a, b, and c; mount points attach subtrees from server Y and server Z (directories d through i).]
Caching






- The performance of a distributed file system, in terms of response time, depends on its ability to "get" files to the user.
- When files are on different servers, caching may be needed to improve the response time.
- A copy of the data (in files) is brought to the client when referenced; subsequent data accesses are made on the client cache.
- The client cache can be on disk or in main memory.
- Cached data may include future blocks that may be referenced too.
- Caching implies the DFS needs to guarantee consistency of data.
Hints





- Hints can be used when cached data need not be completely accurate.
- Example: the mapping of a file/directory name to the actual physical device. The address/name of the device can be stored as a hint.
- If this address fails to access the requested file, the cached data can be purged.
- The file server can then refer to a name server, determine the actual location of the file/directory, and update the cache.
- With hints, a cache is neither updated nor invalidated when a change occurs to the content.
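The hint discipline above can be sketched as follows. All names and data structures are hypothetical; the point is the control flow: trust the hint, and only on failure purge it and consult the name server.

```python
# Sketch of hint-based caching: a cached name->device mapping is used
# optimistically; if it turns out stale, the hint is purged and the
# name server is consulted to refresh the cache.

def lookup_with_hint(name, hint_cache, devices, name_server):
    """Return file data, trusting the hint but falling back to the name server."""
    device = hint_cache.get(name)
    if device is not None and name in devices.get(device, {}):
        return devices[device][name]          # hint was accurate: fast path
    hint_cache.pop(name, None)                # purge the stale hint
    device = name_server[name]                # ask name server for actual location
    hint_cache[name] = device                 # update the cache with a new hint
    return devices[device][name]
```

Unlike a consistent cache, nothing invalidates the hint when the file moves; staleness is only discovered, and repaired, at access time.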
Design Issues







- Naming: locating a file/directory in a DFS based on its name.
- Location of cache: disk, main memory, or both.
- Writing policy: updating the original data source when cache content gets modified.
- Cache consistency: modifying the cache when the data source gets modified.
- Availability: maintaining more copies of files/resources.
- Scalability: the ability to handle more clients/users.
- Semantics: the meaning of different operations (read, write, ...).
Naming





- Name space: a collection of names, e.g., /home/students/jack, /home/staff/jill.
- Location transparency: file names do not indicate their physical locations.
- Name resolution: mapping the name space to an object/device/file/directory.
- Naming approaches:
  - Simple concatenation: add the hostname to file names.
    - Guarantees unique names.
    - No transparency: moving a file to another host involves a file-name change.
Naming: Approaches ...

- Naming approaches (continued):
  - Mounting: mount remote directories onto local ones. Location transparent after mounting (followed in Sun NFS).
    - Example: /students is mounted at /home.
    - Remember: different clients in the system can mount in different ways. (e.g.) Client 1 mounts /students at /, giving /students/jack, /students/jill. Client 2 mounts /students at /usr, giving /usr/students/jack, /usr/students/jill.
  - Single global directory: all files in the system belong to a single name space (followed in the Sprite OS).
    - System-wide unique names, i.e., all clients mount the same way.
    - Difficult to enforce this restriction; it can work only among (highly) cooperating systems (or system administrators!).
Naming: Context


- Context: identifying the name space within which name resolution is to be done.
- Example: context using ~ (tilde).
  - ~jill/t resolves to /home/staff/jill/t.
  - ~john/t resolves to /home/students/john/t.
  - ~name represents the directory structure associated with a person or a project.
  - Whenever file "t" is accessed, it is interpreted with reference to ~'s environment.
  - ~ helps when different clients mount in different ways but still share the same set of users and their home directories. (e.g.) ~john may be mapped to /home/students/john on client 1 and to /usr/students/john on client 2.
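Per-client contexts can be sketched as a table of tilde mappings per client. The client names and paths below are the ones from the slide's example, wired up illustratively:

```python
# Sketch of per-client ~ (tilde) contexts: each client has its own
# context table, so ~john resolves to different absolute paths on
# different clients while still naming the same user's home directory.

CONTEXTS = {
    "client1": {"~john": "/home/students/john"},
    "client2": {"~john": "/usr/students/john"},
}

def expand(client, path):
    """Resolve a ~name prefix using the given client's context."""
    for tilde, home in CONTEXTS[client].items():
        if path == tilde or path.startswith(tilde + "/"):
            return home + path[len(tilde):]
    return path                      # no context entry: leave path unchanged
```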
Name Resolution


- Done by name servers that map file names to actual files.
- Centralized name server: send names to the server and get the path of servers+devices that lead to the requested file.
  - The name server becomes a bottleneck.
- Distributed name servers: (e.g.) consider access to a file /a/b/c/d/e.
  - The local name server identifies the remote server that handles the part /b/c/d/e.
  - This procedure may be applied recursively until .../e is resolved.
Caching

- In main memory:
  - Faster than disks.
  - Diskless workstations can also cache.
  - The server cache is in main memory, so the same design can be used in clients as well.
  - Disadvantage: clients need main memory for virtual memory management too.
- On disk:
  - Large files can be cached.
  - Virtual memory management is straightforward.
  - After caching the necessary files, the client can disconnect from the network (if needed, for instance, to support its mobility).
Writing Policy


- When should modified cache content be transferred to the server?
- Write-through policy:
  - Write immediately at the server when cache content is modified.
  - Advantage: reliability; a crash of the cache (client) does not mean loss of data.
  - Disadvantage: several writes for each small change.
- Delayed writing policy:
  - Write at the server after a delay, or at the time of file closing.
  - Advantage: small/frequent changes do not increase network traffic.
  - Disadvantage: less reliable; susceptible to client crashes.
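The two policies can be contrasted in a few lines of code. The class interfaces are invented for illustration; the server is modeled as a plain dictionary:

```python
# Sketch of the two writing policies: write-through pushes every
# modification to the server immediately, while delayed writing
# buffers dirty blocks until the file is closed (or a timer fires).

class WriteThroughCache:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def write(self, block, data):
        self.cache[block] = data
        self.server[block] = data        # immediate write at the server

class DelayedWriteCache:
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)            # server not updated yet

    def close(self):
        for block in self.dirty:         # flush on file close / after a delay
            self.server[block] = self.cache[block]
        self.dirty.clear()
```

The trade-off is visible directly: with write-through the server never lags the cache, while with delayed writing the dirty set is data that would be lost if the client crashed before `close()`.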
Cache Consistency


- When should modified source content be transferred to the cache?
- Server-initiated policy:
  - The server cache manager informs the client cache managers, which can then retrieve the data.
- Client-initiated policy:
  - The client cache manager checks the freshness of data before delivering it to users. Overhead for every data access.
- Concurrent-write sharing policy:
  - Multiple clients have the file open, and at least one client is writing.
  - The file server asks the other clients to purge/remove the cached data for the file, to maintain consistency.
- Sequential-write sharing policy: a client opens a file that was recently closed after writing.
Cache Consistency ...

- Sequential-write sharing policy: a client opens a file that was recently closed after writing.
  - This client may have outdated cached blocks of the file (since another client might have modified the file contents).
    - Use timestamps for both the cache and the files; compare the timestamps to determine the freshness of blocks.
  - The other client (which was writing previously) may still have modified data in its cache that has not yet been updated on the server, e.g., due to delayed writing.
    - The server can force the previous client to flush its cache whenever a new client opens the file.
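The timestamp comparison used for sequential-write sharing can be sketched directly. The function names and parameters are illustrative: a cached block is trusted only if the cache's timestamp is at least as recent as the file's modification time at the server.

```python
# Sketch of timestamp-based freshness checking: compare the time the
# block was cached against the server's modification time for the file.

def cache_is_fresh(cached_at, server_mtime):
    """True if the cached copy reflects the latest write at the server."""
    return cached_at >= server_mtime

def read_block(block, cache, cached_at, server_blocks, server_mtime):
    """Serve from cache when fresh; otherwise refetch from the server."""
    if block in cache and cache_is_fresh(cached_at, server_mtime):
        return cache[block]
    cache[block] = server_blocks[block]   # refresh the stale or missing block
    return cache[block]
```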
Availability



- Intention: overcome the failure of servers or network links.
- Solution: replication, i.e., maintain copies of files at different servers.
- Issues:
  - Maintaining consistency.
  - Detecting inconsistencies, if they happen despite best efforts. Possible reasons for such inconsistencies:
    - A replica is not updated due to a server failure or a broken network link.
  - Inconsistency problems and their recovery may reduce the benefit of replication.
Availability: Replication

- Unit of replication: mostly a file.
  - Replicas of a file in a directory may be handled by different servers, requiring extra name resolutions to locate the replicas.
- Replication unit as a group of files:
  - Advantage: the process of name resolution, etc., to locate replicas can be done once for a set of files rather than for individual files.
  - Disadvantage: wasteful of disk space if users often need only a few files of the group.
Replica Management


- Two-phase commit protocols can be used to update all replicas.
- Other schemes:
  - Weighted votes:
    - A certain number of votes, r or w, must be obtained before reading or writing.
  - Current synchronization site (CSS):
    - Designate a process/site to control the modifications.
    - File open/close operations are done through the CSS.
    - The CSS can become a bottleneck.
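The weighted-voting idea can be sketched as a quorum check. The vote assignments below are an invented example; the standard requirement (not stated on the slide) is that r + w exceed the total votes N, so any read quorum overlaps any write quorum:

```python
# Sketch of weighted voting: each replica holds some number of votes;
# an operation proceeds only after collecting enough votes from
# reachable replicas (r votes for a read, w for a write).

def has_quorum(replica_votes, reachable, needed):
    """Sum the votes of reachable replicas and compare to the quorum size."""
    return sum(replica_votes[rep] for rep in reachable) >= needed

# Illustrative assignment: three replicas with 1 vote each, r = w = 2,
# so r + w = 4 > 3 total votes and quorums always intersect.
votes = {"A": 1, "B": 1, "C": 1}
```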
Scalability



- Ease of adding more servers and clients with respect to the problems/design issues discussed before, such as caching, replica management, etc.
- Server-initiated cache invalidation scales up better.
- Using the clients' caches:
  - A server serves only X clients directly.
  - New clients (after the first X) are informed of the X clients from whom they can get the data (a form of chaining/hierarchy).
  - Cache misses and invalidations are propagated up and down this hierarchy, i.e., each node serves as a mini file server for its children.
- Structure of a server:
  - I/O operations through threads (lightweight processes) can help in handling more clients.
Semantics




- What is the effect/meaning of an operation?
- (e.g.) a read returns the data written by the latest write operation.
- Guaranteeing the above semantics in the presence of caching can be difficult.
- We saw techniques for this under caching.
Case Study: Sun NFS





- Major goal: keep the distributed file system independent of the underlying hardware and operating system.
- NFS (Network File System) uses Remote Procedure Calls (RPC) for remote file operations.
- Virtual file system (VFS) interface: provides uniform, virtual file operations that are mapped to the actual file system. (e.g.) VFS can be mapped to DOS, so NFS can work with PCs.
- VFS uses a structure called a vnode (virtual node) that is unique within an NFS.
- Each vnode has a mount table that provides a pointer to its parent file system and to the system over which it is mounted.
Sun NFS...





- A vnode can be a mount point.
- Using mount tables, the VFS interface can distinguish between local and remote file systems.
- Requests to remote files are routed to NFS by the VFS interface.
- RPCs are used to reach the remote VFS interface.
- The remote VFS invokes the appropriate local file operation.
Sun NFS Architecture
[Figure: Sun NFS architecture. Client side: OS interface -> VFS interface -> either local file systems (Unix and others, with their disks) or NFS -> RPC/XDR -> network. Server side: network -> RPC/XDR -> server routines -> VFS interface -> disks.]
NFS: Naming & Location


- Each client can configure its file system independently of the others, i.e., different clients can see different name spaces.
- Name resolution example:
  - Look up /a/b/c. Assume a corresponds to vnode1.
  - A lookup on vnode1/b returns vnode2, which might say the object is on server X.
  - The lookup on vnode2/c is sent to X. X returns a file handle (if the file exists, permissions match, etc.).
  - The file handle is used for subsequent file operations.
- Name resolution in NFS is an iterative process (slow).
- Name-space information is not maintained at each server, as NFS servers are stateless (to be discussed later).
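The iterative, component-by-component lookup can be sketched as follows. The in-memory namespace structure is invented for illustration; in real NFS each step would be a lookup RPC, and only the final component yields the file handle:

```python
# Sketch of NFS-style iterative name resolution: resolve /a/b/c one
# component at a time, descending through directory vnodes until the
# last component returns a handle usable in later file operations.

def lookup(namespace, path):
    """Resolve each component against the current vnode's directory table."""
    vnode = namespace["/"]                 # start at the root vnode
    handle = None
    for component in path.strip("/").split("/"):
        handle = vnode[component]          # one lookup (RPC) per component
        vnode = handle.get("entries", {})  # descend if it is a directory
    return handle
```

The per-component round trips are exactly why the slide calls NFS name resolution slow, and why the directory-name lookup cache (next slide) matters.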
NFS: Caching

- NFS client cache:
  - File blocks: cached on demand.
    - Employs read-ahead. Large block sizes (8 Kbytes) are used for data transfer to improve sequential read performance.
    - Entire files are cached if they are small. Timestamps of files are also cached.
    - Cached blocks are valid for a certain period, after which validation is needed from the server. Validation is done by comparing timestamps with the file at the server.
    - A delayed writing policy is used. Modified files are flushed after closing, to handle sequential-write sharing.
  - File-name-to-vnode translations: a directory-name lookup cache holds the vnodes for remote directory names.
    - The cache is updated when a lookup fails (the cache acts as hints).
  - Attributes of files and directories:
NFS: Caching ...

- NFS client cache (continued):
  - Attributes of files and directories:
    - Attribute inquiries form 90% of the calls made to servers.
    - Cache entries are updated every time new attributes are received from the server.
    - File attributes are discarded after 3 seconds, and directory attributes after 30 seconds.
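The attribute-cache timeouts translate directly into a TTL check. The cache-entry layout here is invented; only the 3-second and 30-second lifetimes come from the slide:

```python
# Sketch of NFS attribute-cache expiry: cached file attributes are
# trusted for 3 seconds, directory attributes for 30 seconds; after
# that the entry is treated as expired and must be refetched.

FILE_TTL, DIR_TTL = 3.0, 30.0

def attrs_valid(entry, now):
    """entry = (attributes, cached_at_time, is_directory)."""
    attrs, cached_at, is_dir = entry
    ttl = DIR_TTL if is_dir else FILE_TTL
    return (now - cached_at) < ttl
```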
NFS: Stateless Server




- NFS servers are stateless, to help crash recovery.
- Stateless: no record of past requests (e.g., whether a file is open, the position of the file pointer, etc.).
- Client requests contain all the needed information. If there is no response, the client simply re-sends the request.
- After a crash, a stateless server simply restarts. There is no need to:
  - Restore previous transaction records.
  - Update clients or negotiate with clients on file status.
- Disadvantages:
  - Client message sizes are larger.
  - Server cache management is difficult, since the server has no idea which files have been opened/closed.
  - The server can provide little information for file sharing.
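Statelessness can be illustrated with a self-contained read request. The message format below is made up; the point is that the request carries the file handle, offset, and count itself, so the server keeps no open-file table and a retried request is answered identically:

```python
# Sketch of a stateless server request: every read is self-contained,
# so the server needs no per-client state (no file pointers, no open
# table) and re-sent requests after a timeout are harmless.

def handle_request(files, request):
    """Serve a self-contained read; nothing is remembered between calls."""
    data = files[request["handle"]]
    start = request["offset"]
    return data[start:start + request["count"]]
```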
Un/mounting in NFS




- Mounting of file systems in Unix is recorded in a mount table stored in a file: /etc/mnttab.
- mnttab is read by programs using procedures such as getmntent.
- The mount command adds an entry to mnttab every time a file system is mounted on the system.
- The umount command removes an entry from mnttab every time a file system is unmounted from the system.
Un/Mounting

- The first entry in mnttab is the file system that was mounted first. Usually, file systems get mounted at boot time.
- ("Mount" is likely a term carried over from mounting tapes onto systems.)
- Each entry is a line of fields separated by spaces, in the form:
  <special> <mount_point> <fstype> <options> <time>
  - <special>: the name of the resource to be mounted.
  - <mount_point>: the pathname of the directory on which the file system is mounted.
  - <fstype>: the file-system type of the mounted file system.
  - <options>: mount options.
  - <time>: the time at which the file system was mounted.
- Entries for <special>: the pathname of a block-special device (e.g., /dev/fd0), the name of a remote file system (casa:/export/home, i.e., host:pathname), or the name of a swap file.
Sharing Filesystems




- In SunOS, the share command is used to specify the file systems that can be mounted by other systems.
- (e.g.) share [ -F FSType ] [ -o specific_options ] [ -d description ] [ pathname ]
- The share command makes a resource available to remote systems through a file system of FSType.
- <specific_options> control access to the shared resource:
  - rw: pathname is shared read/write to all clients. This is also the default behavior.
  - rw=client[:client]...: pathname is shared read/write only to the listed clients. No other systems can access pathname.
  - ro: pathname is shared read-only to all clients.
  - ro=client[:client]...: pathname is shared read-only, and only to the listed clients. No other systems can access pathname.
Sharing Filesystems…


- <-d description>: the -d flag may be used to provide a description of the resource being shared.
- Examples:
  - To share the /disk file system read-only at boot time: share -F nfs -o ro /disk
  - share -F nfs -o rw=usera:userb /somefs
- Multiple share commands on the same file system? The last command supersedes.
- Try:
  - /etc/dfs/dfstab: list of share commands to be executed at boot time.
  - /etc/dfs/fstypes: list of file-system types, NFS by default.
  - /etc/dfs/sharetab: system record of shared file systems.
Automounting






- Automounting: mount a remote file system only when it is accessed, perhaps for a guessed duration of time.
- The automount utility installs autofs mount points and associates an automount map with each mount point.
- The autofs file system monitors attempts to access directories within it and notifies the automountd daemon.
- automountd uses the map to locate a file system, then mounts it at the point of reference within the autofs file system.
- A map can be assigned to an autofs mount using an entry in the /etc/auto_master map or a direct map.
- If the file system is not accessed within an appropriate interval (10 minutes by default), the automountd daemon unmounts it.
Cluster File System

- System model: a set of storage devices that can be accessed by a set of workstations.

[Figure: systems 1 through n and a pool of storage devices (RAIDs, tapes/CDs) attached to a very high speed network. RAID: Redundant Array of Inexpensive Disks.]
Cluster File System





- Storage devices can be viewed as a "pool of centralized resources".
- The storage devices are shared by a set of workstations/systems, called a cluster.
- Both the pool of storage and the cluster are attached to very high speed networks (typically optical networks).
- Devices can be mounted on different systems: e.g., RAID 1 to system n, RAID 2 to system 1, etc.
- Features:
  - Mirroring: replication of entire disks.
  - Striping: data (e.g., multimedia) spread over multiple disks.
  - Online reconfiguration: add/delete storage devices dynamically.
  - Assign/remove devices to applications/systems dynamically.
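Striping can be illustrated with a simple block-layout function. This round-robin layout is a generic sketch, not the actual layout used by any particular volume manager:

```python
# Sketch of disk striping: logical block i of a file is placed on disk
# i % n_disks, so a large sequential read touches all disks in parallel
# and aggregate bandwidth scales with the number of disks.

def stripe_location(block_index, n_disks):
    """Return (disk number, block offset within that disk)."""
    return (block_index % n_disks, block_index // n_disks)
```

With 4 disks, blocks 0 to 3 land on 4 different disks, which is what makes striping attractive for high-bandwidth multimedia retrieval.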
Storage Virtualization



- Storage virtualization means a logical representation of the physical resources: storage devices and workstations.
- Virtualization specifies details such as which devices are meant for which host, how they can be shared, etc.
- Possible places for virtualization (each choice has its own advantages and disadvantages):
  - Workstations or hosts:
    - Volume managers (software) run on the hosts, providing control over how data is stored and accessed over the different devices.
Storage Virtualization...

- Possible places for virtualization (continued):
  - In the storage subsystem:
    - Associated with large-scale RAID subsystems (many terabytes). Virtualization services are embedded on the storage controllers.
  - In special appliances, "in-band" or "out-of-band":
    - Special, intelligent appliances are used to provide virtualization.
    - Appliance name: NAS (Network Attached Storage).
    - In-band: the NAS is part of the storage pool.
    - Out-of-band: the NAS is not part of the storage pool.
Veritas Volume Manager






- Works on both Unix and Windows.
- Builds a diskgroup spanning multiple devices.
- Dynamic diskgroup management.
- Striping of data across multiple RAIDs.
- Striping distributes data over multiple disks and hence increases the disk bandwidth for retrieval. Suitable for multimedia data.
- Cluster Volume Manager:
  - Allows a volume to be simultaneously mounted for use across multiple servers, for both reads and writes.
Veritas Cluster Server




- The cluster server handles up to 32 systems.
- It monitors, controls, and restarts applications in response to a variety of faults.
- (e.g.) application A1 may be started on system n if system 1 fails; disk group D1 will be automatically assigned to system n.
- (e.g.) disk group D2 may be assigned to system 1 if D1 fails, and application A1 will continue.

[Figure: systems S1 ... Sn with disk groups D1 and D2.]
Service Groups


- A service group is a set of resources working together to provide application services to clients.
- Service group example:
  - Disk groups holding the data.
  - A volume built using the disk group.
  - A file system (directories) using the volume.
  - Servers/systems providing the application.
  - The application program + libraries.
- Types of service groups:
  - Failover groups: run on one system in a cluster at a time. Used for applications that are not designed to maintain data consistency across multiple copies.
    - The cluster server monitors the heartbeat of the system. If it fails, the backup is brought online.
Service Groups...

- Types of service groups (continued):
  - Parallel groups: run concurrently on more than one system.
- Time-to-recovery: the time taken to bring the backup online.
  - On a failure, an application service is moved to another server in the cluster.
  - Disk groups are de-imported from the crashed server and imported by the backup server.
  - The volume manager helps manage disk-group ownership and accelerates the recovery process of the cluster.
  - New ownership properties are broadcast to the cluster to ensure data security.
Disaster Tolerance


- More than one cluster connected by very high speed networks over a wide area network.
- Clusters 1 and 2 are geographically distributed.

[Figure: cluster 1 and cluster 2 connected by a very high speed link over a wide area network.]
Veritas Volume Replicator









- The redundant copy of an application in another cluster must be kept up to date.
- Volume Replicator allows a disk group to be replicated at one or more remote clusters.
- Initialization of replication: the entire disk group is replicated.
- At runtime: only modifications to the data are communicated, which conserves network bandwidth.
- Disk groups at the remote cluster are not usually active.
- An identical instance of the application is run on the remote cluster in idle mode.
- A disaster is identified by the volume replicator using heartbeats.
- It then puts the remote cluster online for the applications.
- Time-to-recovery: less than 1 minute.