Chapter 9. File System Implementation

• Introduction
• System V File System
• Berkeley Fast File System
• Temporary File System
• Special-purpose File Systems
• Old Buffer Cache
File System Implementation 1
Introduction
• Two local general-purpose file systems
– System V file system (s5fs)
– Berkeley fast file system (FFS)
• S5fs
– original UNIX file system
• FFS
– introduced in 4.2BSD
• Vnode/vfs
– the version of FFS integrated with the vnode/vfs
framework is known as the UNIX file system (ufs)
System V File System
• On-disk layout
– [Figure: partition laid out in order as boot area (B), superblock (S), inode list, data blocks]
• Boot area
– contains code required to bootstrap
• Superblock
– contains attributes and metadata of the file
system
System V File System (cont)
• Inode list
– linear array of inodes
– one inode for each file
– size of inode is 64 bytes
– inode list has a fixed size
• limits the maximum number of files the partition can
contain
S5fs Directories
• Contains fixed size records of 16 bytes
• First two bytes: inode number
• Next fourteen bytes: filename
• Limits
– 65535 files per disk partition
– 14 characters per filename
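The fixed 16-byte record format can be illustrated with a short Python sketch. It assumes little-endian byte order and treats inode number 0 as an unused slot; S5_DIRENT, pack_entry, and parse_directory are hypothetical names used only for illustration, not s5fs kernel identifiers.

```python
import struct

# Layout from the slide: 2-byte inode number, then a 14-byte name.
# Little-endian byte order is an assumption; it depends on the machine.
S5_DIRENT = struct.Struct("<H14s")

def pack_entry(ino, name):
    """Build one 16-byte directory record."""
    raw = name.encode("ascii")
    if len(raw) > 14:
        raise ValueError("s5fs file names are limited to 14 characters")
    if not 0 <= ino <= 0xFFFF:
        raise ValueError("inode number must fit in 2 bytes")
    return S5_DIRENT.pack(ino, raw)   # struct pads short names with NUL bytes

def parse_directory(block):
    """Yield (inode number, name) for every in-use 16-byte record."""
    for ino, raw in S5_DIRENT.iter_unpack(block):
        if ino != 0:                  # assumption: inode 0 marks an unused slot
            yield ino, raw.rstrip(b"\0").decode("ascii")
```

Both limits on the slide fall out of the layout: a 2-byte inode number cannot exceed 65535, and the name field has room for exactly 14 characters.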
S5fs Inodes
• On-disk inode and In-core inode
– struct dinode, struct inode
struct dinode
Field       Size (bytes)   Description
di_mode     2              File type, permissions, etc.
di_nlinks   2              Number of hard links to file
di_uid      2              Owner UID
di_gid      2              Owner GID
di_size     4              Size in bytes
di_addr     39             Array of block addresses
di_gen      1              Generation number
di_atime    4              Time of last access
di_mtime    4              Time file was last modified
di_ctime    4              Time inode was last changed
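As a check on the field sizes, the following Python sketch packs and unpacks a 64-byte on-disk inode. The field order and sizes come from the struct dinode table; the byte order and the split of di_addr into 13 addresses of 3 bytes each are assumptions, and DINODE and unpack_dinode are hypothetical names.

```python
import struct

# Field order and sizes follow the struct dinode table.
# "<" selects little-endian with no padding; the byte order is an assumption.
DINODE = struct.Struct("<HHHHI39sBIII")

def unpack_dinode(raw):
    """Decode one 64-byte on-disk inode into a dictionary."""
    (mode, nlinks, uid, gid, size, addr, gen,
     atime, mtime, ctime) = DINODE.unpack(raw)
    # Assumption: di_addr packs 13 block addresses of 3 bytes each (13 * 3 = 39).
    blocks = [int.from_bytes(addr[i:i + 3], "little") for i in range(0, 39, 3)]
    return {"mode": mode, "nlinks": nlinks, "uid": uid, "gid": gid,
            "size": size, "blocks": blocks, "gen": gen,
            "atime": atime, "mtime": mtime, "ctime": ctime}
```

The sizes in the table add up to 2 + 2 + 2 + 2 + 4 + 39 + 1 + 4 + 4 + 4 = 64 bytes, which matches the inode size given earlier.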
S5fs Inodes (cont)
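di_mode (16 bits): type (4 bits) | suid, sgid, sticky (u, g, s) | owner r w x | group r w x | others r w x
Disk block array in the inode: entries 0 to 9 hold direct block addresses, entry 10 the indirect block, entry 11 the double indirect block, entry 12 the triple indirect block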
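To show how the block array is used, here is a hedged Python sketch that maps a logical block number to the array entry and the indexes followed through the indirect blocks. The value of NINDIR (256 addresses per indirect block, for example 1024-byte blocks holding 4-byte addresses) is an assumption, and classify_block is a hypothetical helper.

```python
NDIRECT = 10      # entries 0 to 9 of the inode block array are direct
NINDIR = 256      # assumed number of block addresses per indirect block

def classify_block(lbn):
    """Return (level, path) for logical block number lbn.

    level 0 = direct, 1 = indirect, 2 = double indirect, 3 = triple indirect.
    path[0] is the inode array entry; later items index each indirect block.
    """
    if lbn < 0:
        raise ValueError("negative block number")
    if lbn < NDIRECT:
        return 0, [lbn]
    lbn -= NDIRECT
    span = NINDIR                      # blocks reachable at the current level
    for level in (1, 2, 3):
        if lbn < span:
            indexes = []
            for _ in range(level):     # peel off one index per indirection
                indexes.append(lbn % NINDIR)
                lbn //= NINDIR
            return level, [NDIRECT + level - 1] + indexes[::-1]
        lbn -= span
        span *= NINDIR
    raise ValueError("block number beyond the triple indirect range")
```

Under these assumptions, blocks 0 to 9 need no extra disk reads, while a block in the triple indirect range needs three additional reads before the data block itself.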
S5fs Superblock
• Metadata about the file system
– The kernel reads the superblock when mounting
the file system and stores it in memory until the
file system is unmounted
• Contains the following information
– size in blocks of the file system
– size in blocks of the inode list
– number of free blocks and inodes
– free block list, free inode list
• does not keep free list completely in the superblock
S5fs Kernel Organization
• In-core inodes
– struct inode
– contains all the fields of the on-disk inode, and
some additional fields, such as
– vnode
• the i_vnode field of the inode contains the vnode of
the file
– Device ID of the partition containing the file
– Inode number of the file
S5fs Kernel Organization (cont)
– Flags for synchronization and cache
management
– Pointers to keep the inode on a free list
– Pointers to keep the inode on a hash queue
• The kernel hashes inodes by their inode numbers, so
as to locate them quickly when needed
– Block number of last block read
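The hash-queue lookup can be sketched in a few lines of Python. This is an illustration only: the number of queues, the modulo hash, the dictionary representation of an inode, and the names InodeTable and iget as written here are assumptions, not the kernel's actual data structures.

```python
NHASH = 4   # hypothetical number of hash queues

class InodeTable:
    """In-core inodes kept on hash queues, found by device and inode number."""

    def __init__(self):
        self.queues = [[] for _ in range(NHASH)]
        self.disk_reads = 0            # counts simulated reads of on-disk inodes

    def iget(self, dev, ino):
        queue = self.queues[ino % NHASH]        # hash by inode number
        for inode in queue:                     # short scan of one queue only
            if inode["dev"] == dev and inode["number"] == ino:
                return inode
        self.disk_reads += 1                    # miss: fetch from the inode list
        inode = {"dev": dev, "number": ino}
        queue.append(inode)
        return inode
```

Because two partitions can both contain, say, inode 40, the device ID has to be compared along with the inode number even though only the number is hashed.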
S5fs Kernel Organization (cont)
[Figure: in-core inodes linked on hash queues and threaded onto the inode free list]
hash queue 0: i_number = 40, 268, 1056
hash queue 1: i_number = 73, 17, 593
hash queue 2: i_number = 86
hash queue 3: i_number = 11, 199, 27, 8, 103
S5fs Inode Lookup
• lookuppn( )
– in the file-system-independent layer
– performs pathname parsing
– parses one component at a time, invoking
VOP_LOOKUP operation
– when searching an s5fs directory, translates
to a call to s5lookup( ) function
• s5lookup( )
– Check the directory name lookup cache
• In case of a cache miss, it reads the directory one
block at a time, searching the entries for the
specified file name
S5fs Inode Lookup (cont)
– If the directory contains a valid entry for the
file, s5lookup( ) obtains the inode number from
the entry
– Calls iget( ) to locate that inode and initializes
the vnode
– Finally, iget( ) returns a pointer to the inode to
s5lookup( ). s5lookup( ), in turn, returns a
pointer to the vnode to lookuppn( )
S5fs File I/O
• read and write system calls
– accept a file descriptor (the index returned by
open)
• File descriptor
– used as an index into the descriptor table to
obtain the pointer to the open file object (struct
file)
– the kernel obtains the vnode pointer from the
file structure
• Before starting I/O
– the kernel invokes the VOP_RWLOCK operation
to serialize access to the file
S5fs File I/O (cont)
• The kernel then invokes the VOP_READ or
VOP_WRITE operation
– This results in a call to s5read( ) or s5write( )
• In case of s5read( )
– s5read( ) translates the starting offset to the
logical block number
– it then reads the data one page at a time
• by mapping the block into the kernel virtual address
space and calling uiomove( ) to copy the data into
user space
S5fs File I/O (cont)
• uiomove( ) calls the copyout( ) routine to perform the
actual data transfer
• if the page is not in memory, copyout( ) will generate
a page fault
• the page fault handler will invoke VOP_GETPAGE
operation on its vnode
• in s5fs, VOP_GETPAGE is implemented by
s5getpage( )
• the calling process sleeps until the I/O completes
– s5read( ) returns when all data has been read
– the file-system-independent code then
• unlocks the vnode, advances the offset pointer in the
file structure, and returns to the user
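The offset arithmetic behind the page-at-a-time read loop can be sketched as below. The sketch assumes the logical block size equals a 4096-byte page and only computes which pieces would be copied; it does not model the mapping, uiomove( ), or page faults. PAGE_SIZE and read_chunks are hypothetical names.

```python
PAGE_SIZE = 4096   # assumed page size, also used as the logical block size

def read_chunks(offset, count, file_size):
    """Split a read into (logical block number, offset in block, bytes) pieces."""
    end = min(offset + count, file_size)   # never read past end of file
    chunks = []
    while offset < end:
        block, within = divmod(offset, PAGE_SIZE)
        n = min(PAGE_SIZE - within, end - offset)   # stop at the page boundary
        chunks.append((block, within, n))
        offset += n                        # mirrors advancing the file offset
    return chunks
```

For example, a 5000-byte read starting at offset 4000 touches three pages: the last 96 bytes of page 0, all of page 1, and the first 808 bytes of page 2.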
Allocating and Reclaiming Inodes
• An inode remains active as long as its
vnode has a non-zero reference count
• When the count drops to zero, the file-system-independent
code invokes the VOP_INACTIVE operation, which frees the
inode
• When an inode becomes inactive, the
kernel puts it on the free list, but does not
invalidate it
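Why the free list keeps inodes valid can be seen in a small Python sketch. The capacity limit, the oldest-first recycling order, and the names InodeCache, iget, and iput as used here are assumptions for illustration; only the idea of keeping inactive inodes valid on a free list comes from the text.

```python
from collections import OrderedDict

class InodeCache:
    """Active inodes plus a free list of inactive but still valid inodes."""

    def __init__(self, capacity):
        self.capacity = capacity       # total number of in-core inodes
        self.active = {}               # inode number -> reference count
        self.free = OrderedDict()      # inactive inodes, oldest first
        self.disk_reads = 0

    def iget(self, ino):
        if ino in self.active:
            self.active[ino] += 1
        elif ino in self.free:         # still valid: reuse without disk I/O
            del self.free[ino]
            self.active[ino] = 1
        else:
            if len(self.active) + len(self.free) >= self.capacity:
                if not self.free:
                    raise RuntimeError("inode table full")
                self.free.popitem(last=False)   # recycle the oldest free inode
            self.disk_reads += 1       # read the on-disk inode
            self.active[ino] = 1

    def iput(self, ino):
        self.active[ino] -= 1
        if self.active[ino] == 0:      # the point where VOP_INACTIVE is invoked
            del self.active[ino]
            self.free[ino] = True      # kept valid on the free list
```

A file that is closed and reopened shortly afterwards is then found on the free list and costs no extra disk read.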
Analysis of s5fs
• Simple design introduces problems in
– reliability, performance, functionality
• Reliability
– the single superblock contains vital information about
the entire file system, so the file system becomes
unusable if it is corrupted
• Performance
– s5fs groups all inodes together at the
beginning of the file system
• accessing a file requires reading the inode and then the
file data, which causes a long seek on the disk
• e.g. ls -l causes a random disk access pattern
Analysis of s5fs (cont)
– Disk block allocation is also suboptimal
• After the file system has been used for a while, the
order of blocks in the free block list becomes
completely random
• This slows down sequential access operations on
files, since logically consecutive blocks may be very
far apart on the disk
– File names are restricted to 14 characters
Berkeley Fast File System
• Addresses many limitations of s5fs
• Hard disk structure
– platter, disk head, track, sector, cylinder
– head seek, rotational latency
• FFS on-disk organization
– FFS divides the partition into one or more
cylinder groups, each containing a small set of
consecutive cylinders
• This allows UNIX to store related data in the same
cylinder group to minimize disk head movement
Berkeley FFS (cont)
– Superblock is divided into two structures
• The FFS superblock contains information about the
entire file system; it does not change unless the file
system is rebuilt
• Each cylinder group has a data structure describing
summary information about that group, including the
free inode and free block lists.
• Each cylinder group contains a duplicate copy of the
superblock
• FFS maintains these duplicates at different offsets in
each cylinder group in such a way that no single
track, cylinder, or platter contains all copies of the
superblock
FFS Blocks
• Blocks and Fragments
– FFS allows each block to be divided into one
or more fragments
– The number of fragments per block may be set
to 1, 2, 4, or 8, allowing a lower bound of 512
bytes, the same as the disk sector size
– An FFS file is composed entirely of complete
blocks, except for the last block, which may
contain one or more consecutive fragments
– This scheme reduces space wastage, but
requires occasional recopying of file data
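The space saving from fragments can be worked through with a short Python sketch. The 4096-byte block size is an assumption chosen so that 8 fragments per block gives the 512-byte lower bound mentioned above; ffs_layout and allocated_bytes are hypothetical helpers, and the model ignores any restrictions a real FFS places on which files may end in fragments.

```python
def ffs_layout(size, bsize=4096, frags_per_block=8):
    """Return (full blocks, trailing fragments) for a file of `size` bytes."""
    if frags_per_block not in (1, 2, 4, 8):
        raise ValueError("fragments per block must be 1, 2, 4, or 8")
    fsize = bsize // frags_per_block
    full, rest = divmod(size, bsize)
    frags = -(-rest // fsize)             # ceiling division
    if frags == frags_per_block:          # a full set of fragments is a block
        full, frags = full + 1, 0
    return full, frags

def allocated_bytes(size, bsize=4096, frags_per_block=8):
    """Disk space consumed by the file's data under the same assumptions."""
    full, frags = ffs_layout(size, bsize, frags_per_block)
    return full * bsize + frags * (bsize // frags_per_block)
```

For a 10000-byte file this gives two full blocks plus four 512-byte fragments, wasting 240 bytes, whereas whole 4096-byte blocks would waste 2288 bytes. The recopying cost arises when such a file grows and its trailing fragments must be moved into a larger run of fragments or a full block.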
FFS Disk Allocation
• Allocation policies
– FFS aims to colocate related information on
the disk and optimize sequential access
– 1. Attempt to place the inodes of all files of a
single directory in the same cylinder group
– 2. Create each new directory in a different
cylinder group from its parent, so as to
distribute data uniformly over the disk
– 3. Try to place the data blocks of the file in the
same cylinder group as the inode
FFS Disk Allocation (cont)
– 4. To avoid filling an entire cylinder group with
one large file, change the cylinder group when
the file size reaches 48Kbytes and again at
every megabyte
– 5. Allocate sequential blocks of a file at
rotationally optimal positions
• Rotational optimization tries to determine the
number of sectors to skip so that the desired sector
is under the disk head when the read is initiated.
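Policy 4 can be illustrated with a small Python sketch that lists the file offsets at which allocation would move to another cylinder group. It rests on one reading of the rule, namely a switch at 48 Kbytes and then at every multiple of one megabyte; that interpretation, and the name group_switch_offsets, are assumptions rather than the FFS allocator's actual code.

```python
KB, MB = 1024, 1024 * 1024

def group_switch_offsets(file_size, first=48 * KB, step=MB):
    """Offsets at which allocation would move to another cylinder group.

    Assumed reading of the policy: switch once at `first`, then at every
    multiple of `step` that lies inside the file.
    """
    offsets = []
    if file_size > first:
        offsets.append(first)
    offset = step
    while offset < file_size:
        offsets.append(offset)
        offset += step
    return offsets
```

Under this reading, a file of up to 48 Kbytes stays entirely in its inode's cylinder group, while a 3-megabyte file is spread over four groups.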
FFS Functionality Enhancements
• Long file names
– maximum size of the filename is 255 characters
• Symbolic links, and atomic rename( )
FFS Directory (figure)
Each entry: inode number, allocation size, name length, name plus extra space
(a) initial state
  7, 4, 2, ‘f’ ‘1’ 0 0
  14, 8, 5, ‘f’ ‘i’ ‘l’ ‘e’ ‘2’ 0 0 0
(b) after deleting file2
  7, 24, 2, ‘f’ ‘1’ 0 0, padding
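How deletion works with variable-length entries can be sketched in Python. The 8-byte entry header, the 4-byte name alignment, and the handling of the first entry are assumptions, so the byte counts produced here need not match those in the figure; entry_size, make_directory, and delete are hypothetical helpers.

```python
HEADER = 8   # assumed bytes for inode number, allocation size, and name length

def entry_size(name):
    """Header plus the name and a NUL, rounded up to 4 bytes (assumption)."""
    return HEADER + (len(name) + 1 + 3) // 4 * 4

def make_directory(files):
    """Build entries from (inode number, name) pairs."""
    return [{"ino": ino, "alloc": entry_size(name), "name": name}
            for ino, name in files]

def delete(entries, name):
    """Remove an entry by letting the previous entry absorb its space."""
    for i, e in enumerate(entries):
        if e["name"] == name:
            if i == 0:
                e["ino"] = 0      # assumption: first entry is only marked unused
            else:
                entries[i - 1]["alloc"] += e["alloc"]
                del entries[i]
            return
    raise FileNotFoundError(name)
```

After deleting file2, the entry for f1 keeps its name length of 2 but its allocation size grows to cover the freed space, which is the change shown between states (a) and (b).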
Analysis of FFS
• Substantial performance gains
– read throughput
• 29 Kbytes/sec in s5fs → 221 Kbytes/sec in FFS
• CPU utilization: 11% → 43%
– write throughput
• 48 Kbytes/sec → 142 Kbytes/sec
• CPU utilization: 29% → 43%
• Disk space wastage
– half a block per file in s5fs
– half a fragment per file in FFS
• more space is required to monitor the free blocks
and fragments
Analysis of FFS (cont)
• Modern SCSI disks do not have fixed size
cylinders
– FFS is oblivious to this
• Overall, FFS provides great benefits
– wide acceptance
• 4.3BSD added two types of caching to
speed up name lookups
Temporary File Systems
• Basic concepts
– Many utilities and applications extensively use
temporary files to store results of intermediate
phases of execution
– The synchronous updates that disk file systems perform
are unnecessary for temporary files, because they
are not meant to be persistent
– Addressed by using RAM disks, which provide
file systems that reside entirely in physical
memory (dedicating a large amount of memory)
– RAM disks are implemented by a device driver
that emulates a disk
Temporary File Systems (cont)
• Two implementations
– Memory File System (mfs)
– tmpfs File System
• mfs
– Developed by UC Berkeley
– Entire file system is built in the virtual address
space of the process that handled the mount
operation
– This process does not return from the mount
call, but remains in the kernel, waiting for I/O
requests to the file system
Temporary File Systems (cont)
– Each mfsnode, which is the file-system-dependent part of the vnode, contains the PID
of the mount process, which now functions as
an I/O server
– The pages of the mfs files compete with all
other processes for physical memory
– Using a separate process to handle all I/O
requires two context switches for each
operation
– The file system still resides in a separate
address space, which means we still need
extra in-memory copy operations
Temporary File Systems (cont)
• tmpfs file system
– Developed by Sun Microsystems
– Combined the powerful facilities of the
vnode/vfs interface and the new VM
architecture
– tmpfs is implemented entirely in the kernel
– All file metadata is stored in non-paged
memory, dynamically allocated from the kernel
heap
– The data blocks are in paged memory and are
represented using the anonymous pages
facility in the VM subsystem
Temporary File Systems (cont)
– Each page is mapped by an anonymous object
(struct anon), which contains the location of
the page in physical memory or on the swap
space
– The tmpnode, which is the file-system-dependent object for each file, has a pointer to
the anonymous map (struct anon_map) for the
file
– Pages can be swapped out by the paging
system and compete for physical memory
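The chain from tmpnode to page location can be modelled with a few Python dataclasses. The class names mirror the structures named in the text, but every field name (frame, swap_slot, anons, amap), the 4096-byte page size, and the locate helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PAGE_SIZE = 4096   # assumed page size

@dataclass
class Anon:
    frame: Optional[int] = None       # physical page frame, if resident
    swap_slot: Optional[int] = None   # swap location, if paged out

@dataclass
class AnonMap:
    anons: List[Anon] = field(default_factory=list)   # one entry per file page

@dataclass
class Tmpnode:
    amap: AnonMap = field(default_factory=AnonMap)

def locate(node, offset):
    """Follow tmpnode -> anon_map -> anon to find where a page lives."""
    anon = node.amap.anons[offset // PAGE_SIZE]
    if anon.frame is not None:
        return ("memory", anon.frame)
    return ("swap", anon.swap_slot)
```

The lookup never consults a disk-style block map: the file offset selects an anon entry directly, and that entry says whether the page is resident or on swap.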
Temporary File Systems (cont)
– Advantages of tmpfs
• does not use a separate I/O server and thus avoids
wasteful context switches
• holding the metadata in unpaged kernel memory
eliminates the memory-to-memory copies and some
disk I/O
• the support for memory mapping allows fast, direct
access to file data
Locating tmpfs pages
[Figure: struct vnode and struct tmpnode lead to struct anon_map, whose struct anon entries each locate a page, either a page in memory or a page in the swap area on disk]
Special-Purpose File Systems
• The specfs file system
– Provides a uniform interface to device files
– The primary purpose of specfs is to intercept
I/O calls to device files and translate them to
calls to the appropriate device driver routines
• The /proc file system
– Provides an elegant and powerful interface to
the address space of any process
• The processor file system
– Provides an interface to the individual
processors on a multiprocessor machine
Old Buffer Cache
• Background
– Traditional UNIX systems use a dedicated area
in memory called the block buffer cache to cache
blocks accessed through the file system
– Backing store of a cache is the persistent
location of the data
– A cache can be write-through or write-behind
– write-through: modified data is written
to the backing store immediately
– write-behind: modified blocks are simply
marked as dirty, and written to the disk at a
later time
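The difference between the two policies can be made concrete with a minimal Python sketch. The backing store is a plain dictionary standing in for the disk, and BufferCache, sync, and the disk_writes counter are hypothetical names; the sketch ignores buffer replacement and locking.

```python
class BufferCache:
    """Toy block cache supporting write-through or write-behind."""

    def __init__(self, disk, write_through):
        self.disk = disk                  # backing store: block number -> data
        self.write_through = write_through
        self.buffers = {}                 # cached copies of blocks
        self.dirty = set()                # modified blocks not yet on disk
        self.disk_writes = 0

    def read(self, blkno):
        if blkno not in self.buffers:     # miss: fetch from the backing store
            self.buffers[blkno] = self.disk[blkno]
        return self.buffers[blkno]

    def write(self, blkno, data):
        self.buffers[blkno] = data
        if self.write_through:
            self._flush(blkno)            # update the disk immediately
        else:
            self.dirty.add(blkno)         # just mark dirty; write later

    def _flush(self, blkno):
        self.disk[blkno] = self.buffers[blkno]
        self.disk_writes += 1

    def sync(self):
        """Write out every dirty block, as a periodic flush would."""
        for blkno in sorted(self.dirty):
            self._flush(blkno)
        self.dirty.clear()
```

Two writes to the same block cost two disk writes under write-through but only one under write-behind, at the price of losing the update if the system crashes before sync runs.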
Old Buffer Cache (cont)
• Advantages
– Reduce disk traffic and eliminate unnecessary
disk I/O
– Synchronizes access to disk blocks through the
locked and wanted flags
• Disadvantages
– The write-behind nature of the cache means the
data may be lost if the system crashes
– Reducing disk access greatly improves
performance, but the data must be copied twice
• disk → buffer, then buffer → user address space
• e.g. cache wiping problem