Transcript General

UCDavis, ecs150
Spring 2006
Operating System
ecs150 Spring 2006
#5: File Systems
(chapters: 6.4~6.7, 8)
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
05/10/2006
ecs150, spring 2006
1
File System Abstraction
Files
 Directories

System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
DIR *dirp = opendir(filename);            /* filename is a const char * */
struct dirent *direntp = readdir(dirp);
struct dirent {
    ino_t d_ino;                  /* i-node number */
    char  d_name[NAME_MAX+1];     /* file name */
};
[Figure: a directory is a sequence of dirent entries; each dirent pairs an inode with a file_name, and each inode leads to a file]
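A minimal sketch of how the opendir/readdir calls above can be used to walk a directory and print each entry's i-node number and name. The function name list_directory is hypothetical, and error handling is kept to a bare minimum.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Print "<inode> <name>" for every entry in the directory at `path`.
 * Returns the number of entries seen, or -1 if the directory cannot be opened. */
int list_directory(const char *path)
{
    assert(path != NULL);
    DIR *dirp = opendir(path);
    if (dirp == NULL)
        return -1;

    int count = 0;
    struct dirent *direntp;
    while ((direntp = readdir(dirp)) != NULL) {
        printf("%lu %s\n", (unsigned long)direntp->d_ino, direntp->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

Calling list_directory(".") prints one line per dirent, which makes the (inode, file_name) pairing in the figure visible.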
Local versus Remote
System Call Interface
 V-node
 Local versus remote

– NFS or i-node
– Stackable File System

Hard-disk blocks
File-System Structure

File structure
– Logical storage unit
– Collection of related information
File system resides on secondary storage
(disks).
 File system organized into layers.
 File control block – storage structure
consisting of information about a file.

File → Disk
separate the disk into blocks
separate the file into blocks as well
paging from file to disk

blocks: 4 - 7 - 2 - 10 - 12
How to represent the file?
How to link these 5 pages together?

Hard Disk
Track, Sector, Head
– Track + Heads → Cylinder

Performance
– seek time
– rotation time
– transfer time

LBA
– Logical Block Addressing (blocks numbered linearly, 0, 1, 2, ...)
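As a sketch of how a cylinder/head/sector address maps onto a single linear block number, the conventional conversion is LBA = (C * heads_per_cylinder + H) * sectors_per_track + (S - 1), with sectors numbered from 1. The function name and the geometry parameters below are illustrative assumptions, not something taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cylinder/head/sector triple to a linear block address.
 * Sectors are 1-based by convention; cylinders and heads are 0-based. */
uint64_t chs_to_lba(uint32_t cylinder, uint32_t head, uint32_t sector,
                    uint32_t heads_per_cylinder, uint32_t sectors_per_track)
{
    assert(sector >= 1 && sector <= sectors_per_track);
    assert(head < heads_per_cylinder);
    return ((uint64_t)cylinder * heads_per_cylinder + head) * sectors_per_track
           + (sector - 1);
}
```

With a hypothetical geometry of 16 heads and 63 sectors per track, the first sector of cylinder 1 is block 1008, so consecutive LBAs stay on one track and then one cylinder for as long as possible, which keeps seek time down.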
File → Disk blocks
file block:   0   1   2   3   4
disk block:   4   7   2  10  12
What are the disadvantages?
1. disk access can be slow for “random access”.
2. How big is each block? 64 bytes? 68 bytes?
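A minimal sketch of the file-block to disk-block lookup shown above, assuming the file's block list (4, 7, 2, 10, 12) is held in a simple array. The function name and the 1 KB block size used in the usage note are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Return the disk block that holds byte `offset` of a file whose logical
 * blocks map to block_map[0..nblocks-1], or -1 if the offset is past the end. */
long disk_block_for_offset(const long *block_map, size_t nblocks,
                           size_t block_size, size_t offset)
{
    assert(block_map != NULL && block_size > 0);
    size_t file_block = offset / block_size;   /* logical block index */
    if (file_block >= nblocks)
        return -1;
    return block_map[file_block];
}
```

With the map {4, 7, 2, 10, 12} and 1024-byte blocks, byte 5000 falls in file block 4 and therefore in disk block 12. Random access is one array lookup here; the open question on the slide is where that array itself lives on disk.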
A File System
[Figure: a disk divided into partitions; within one partition: boot block (b), superblock (s), the i-list of i-nodes, then directory and data blocks (d)]
One Logical File → Physical Disk Blocks
efficient representation & access
A file
An i-node
??? entries in
one disk block
Typical:
each block 8K or 16K bytes
inode (index node) structure

meta-data of the file (field: size in bytes).
– di_mode    02
– di_nlinks  02
– di_uid     02
– di_gid     02
– di_size    04
– di_addr    39
– di_gen     01
– di_atime   04
– di_mtime   04
– di_ctime   04
System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
A File System
[Figure repeated: partition layout with boot block, superblock, i-list, and directory and data blocks]
struct ufs2_dinode {
    u_int16_t    di_mode;         /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;        /*   2: File link count. */
    u_int32_t    di_uid;          /*   4: File owner. */
    u_int32_t    di_gid;          /*   8: File group. */
    u_int32_t    di_blksize;      /*  12: Inode blocksize. */
    u_int64_t    di_size;         /*  16: File byte count. */
    u_int64_t    di_blocks;       /*  24: Bytes actually held. */
    ufs_time_t   di_atime;        /*  32: Last access time. */
    ufs_time_t   di_mtime;        /*  40: Last modified time. */
    ufs_time_t   di_ctime;        /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;    /*  56: Inode creation time. */
    int32_t      di_mtimensec;    /*  64: Last modified time. */
    int32_t      di_atimensec;    /*  68: Last access time. */
    int32_t      di_ctimensec;    /*  72: Last inode change time. */
    int32_t      di_birthnsec;    /*  76: Inode creation time. */
    int32_t      di_gen;          /*  80: Generation number. */
    u_int32_t    di_kernflags;    /*  84: Kernel flags. */
    u_int32_t    di_flags;        /*  88: Status flags (chflags). */
    int32_t      di_extsize;      /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR]; /*  96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR];   /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR];   /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];     /* 232: Reserved; currently unused */
};
struct ufs1_dinode {
    u_int16_t    di_mode;         /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;        /*   2: File link count. */
    union {
        u_int16_t oldids[2];      /*   4: Ffs: old user and group ids. */
    } di_u;
    u_int64_t    di_size;         /*   8: File byte count. */
    int32_t      di_atime;        /*  16: Last access time. */
    int32_t      di_atimensec;    /*  20: Last access time. */
    int32_t      di_mtime;        /*  24: Last modified time. */
    int32_t      di_mtimensec;    /*  28: Last modified time. */
    int32_t      di_ctime;        /*  32: Last inode change time. */
    int32_t      di_ctimensec;    /*  36: Last inode change time. */
    ufs1_daddr_t di_db[NDADDR];   /*  40: Direct disk blocks. */
    ufs1_daddr_t di_ib[NIADDR];   /*  88: Indirect disk blocks. */
    u_int32_t    di_flags;        /* 100: Status flags (chflags). */
    int32_t      di_blocks;       /* 104: Blocks actually held. */
    int32_t      di_gen;          /* 108: Generation number. */
    u_int32_t    di_uid;          /* 112: File owner. */
    u_int32_t    di_gid;          /* 116: File group. */
    int32_t      di_spare[2];     /* 120: Reserved; currently unused */
};



i-node
How many disk blocks can a FS have?
How many levels of i-node indirection will be
necessary to store a file of 2G bytes? (I.e., 0, 1, 2
or 3)
What is the largest possible file size in i-node?

Answer
How many disk blocks can a FS have?
– 2^64 or 2^32: Pointer (to blocks) size is 8/4 bytes.

How many levels of i-node indirection will be
necessary to store a file of 2G (2^31) bytes? (I.e., 0,
1, 2 or 3)
– 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10 >? 2^31

What is the largest possible file size in i-node?
– 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10
– 2^64 - 1
– 2^32 * 2^10
You need to consider three issues and find the minimum!
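The arithmetic above can be checked with a short sketch. It assumes what the numbers on the slide imply: 12 direct pointers, 1 KB (2^10) blocks, and 2^8 pointers per indirect block. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes reachable through the direct blocks plus indirection levels 1..level. */
uint64_t capacity_up_to_level(int level, uint64_t ndirect,
                              uint64_t block_size, uint64_t ptrs_per_block)
{
    assert(level >= 0 && level <= 3);
    uint64_t blocks = ndirect;
    uint64_t fanout = 1;
    for (int i = 1; i <= level; i++) {
        fanout *= ptrs_per_block;   /* ptrs^1, ptrs^2, ptrs^3 data blocks */
        blocks += fanout;
    }
    return blocks * block_size;
}

/* Smallest indirection level (0..3) that can hold file_size bytes, or -1. */
int levels_needed(uint64_t file_size, uint64_t ndirect,
                  uint64_t block_size, uint64_t ptrs_per_block)
{
    for (int level = 0; level <= 3; level++)
        if (capacity_up_to_level(level, ndirect, block_size, ptrs_per_block)
            >= file_size)
            return level;
    return -1;
}

/* The largest file is bounded by the minimum of the three limits. */
uint64_t max_file_size(uint64_t index_limit, uint64_t size_field_limit,
                       uint64_t address_limit)
{
    uint64_t m = index_limit;
    if (size_field_limit < m) m = size_field_limit;
    if (address_limit < m) m = address_limit;
    return m;
}
```

Under these assumptions, direct plus single plus double indirection reaches only about 64 MB, so a 2^31-byte file needs the triple-indirect level, and the index structure (about 16 GB) is the smallest of the three limits.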
A File System
[Figure repeated: partition layout with boot block, superblock, i-list, and directory and data blocks]
FFS and UFS

/usr/src/sys/ufs/ffs/*
– Higher-level: directory structure
– Soft updates & Snapshot

/usr/src/sys/ufs/ufs/*
– Lower-level: buffer, i-node
# of i-nodes

UFS1: pre-allocation
– 3% of HD, about < 25% used.

UFS2: dynamic allocation
– Still limited # of i-nodes
di_size vs. di_blocks

???
One Logical File → Physical Disk Blocks
efficient representation & access
di_size vs. di_blocks
di_size: logical size (what fstat reports)
di_blocks: physical size (what du reports)
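A small sketch of the logical versus physical distinction as seen from user space: st_size is the byte count of the file, while st_blocks counts the storage actually allocated. The helper names are hypothetical, and the code assumes st_blocks is expressed in 512-byte units, which is the common convention but not guaranteed on every system.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Fill in the logical size (bytes in the file) and the physical size
 * (bytes of disk actually allocated). Returns 0 on success, -1 on error. */
int file_sizes(const char *path, long long *logical, long long *physical)
{
    assert(path != NULL && logical != NULL && physical != NULL);
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *logical = (long long)st.st_size;
    *physical = (long long)st.st_blocks * 512;  /* assumes 512-byte units */
    return 0;
}

/* Print both sizes for one file. */
void print_sizes(const char *path)
{
    long long logical, physical;
    if (file_sizes(path, &logical, &physical) == 0)
        printf("%s: logical %lld bytes, physical %lld bytes\n",
               path, logical, physical);
}
```

For a sparse file the physical size can be much smaller than the logical size; for a tiny file it is usually larger, because allocation happens in whole blocks or fragments.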

Extended Attributes in UFS2

Attributes associated with the File
– di_extb[2];
– two blocks, but indirection if needed.

Format (field: size in bytes)
– Length: 4
– Name Space: 1
– Content Pad Length: 1
– Name Length: 1
– Name: mod 8
– Content: variable
Applications: ACL, Data Labelling
Some thoughts….
What can you do with “extended attributes”?
 How to design/implement?

– Should/can we do it with “Stackable File Systems”?
– Otherwise, the program to manipulate the EAs
will have to be very UFS2-dependent, or use FiST with
a UFS2 optimization option.

Are there any counter examples?
– security and performance considerations.
struct dirent {
ino_t d_ino;
char d_name[NAME_MAX+1];
};
struct stat {…
short nlinks;
…};
[Figure repeated: a directory of dirent entries, each pairing an inode with a file_name, and each inode leading to a file]
A File System
[Figure repeated: partition layout with boot block, superblock, i-list, and directory and data blocks]
ln -s /usr/src/sys/sys/proc.h ppp.h    (symbolic link)
ln /usr/src/sys/sys/proc.h ppp.h       (hard link)
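The difference between the two ln forms can be sketched with the underlying system calls: link() adds a second directory entry for the same i-node and so raises its link count, while symlink() creates a separate small file that merely names the target. The helper names below are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an empty regular file. Returns 0 on success, -1 on error. */
int make_empty_file(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    return close(fd);
}

/* Link count (nlink) of the i-node that `path` names, or -1 on error.
 * lstat() is used so a symbolic link is examined itself, not its target. */
long link_count(const char *path)
{
    assert(path != NULL);
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* 1 if `path` is a symbolic link, 0 if not, -1 on error. */
int is_symlink(const char *path)
{
    assert(path != NULL);
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

After link("a", "b") the count for "a" goes from 1 to 2; after symlink("a", "c") it stays at 2, because the symbolic link has its own i-node.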

File System Buffer Cache
application:
OS:
read/write files
translate file to disk blocks
maintains
...buffer cache ...
controls disk accesses: read/write blocks
hardware:
Any problems?
File System Consistency
To maintain file system consistency the
ordering of updates from buffer cache to
disk is critical
 Example:

– if the directory block is written back before the
i-node and the system crashes, the directory
structure will be inconsistent
File System Consistency





File systems almost always use a buffer/disk cache for
performance reasons
Two copies of a disk block (buffer cache, disk) →
consistency problem if the system crashes before all the
modified blocks are written back to disk
This problem is especially critical for the blocks that
contain control information: i-node, free-list, directory
blocks
Write back critical blocks from the buffer cache to disk
immediately
Data blocks are also written back periodically: sync
Two Strategies

Prevention
– Use un-buffered I/O when writing i-nodes or pointer
blocks
– Use buffered I/O for other writes and force sync every
30 seconds

Detect and Fix
– Detect the inconsistency
– Fix them according to the “rules”
– Fsck (File System Checker)
File System Integrity

Block consistency:
– Block-in-use table:  0 1 1 1 0 0 0 1 0 0 0 2
– Free-list table:     1 0 0 0 1 1 1 0 1 0 2 0
File consistency:
– how many directory entries point to that i-node? (D)
– nlink? (L)
– three cases: D == L, L > D, D > L

What to do with the latter two cases?
File System Integrity

File system states
(a) consistent
(b) missing block
(c) duplicate block in free list
(d) duplicate data block
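The four states can be told apart from two per-block counters, in the style of an fsck block check. This sketch assumes, as the table on the previous slide suggests, that one row counts how often each block appears in files (in use) and the other counts how often it appears in the free list. The enum and function names are hypothetical, and the "used and free" case is an extra state added here for completeness.

```c
#include <assert.h>
#include <stddef.h>

enum block_state {
    BLOCK_CONSISTENT,     /* exactly one of the two counters is 1 */
    BLOCK_MISSING,        /* in no file and not in the free list */
    BLOCK_DUP_FREE,       /* appears more than once in the free list */
    BLOCK_DUP_DATA,       /* appears in more than one file */
    BLOCK_USED_AND_FREE   /* in a file and in the free list (not on the slide) */
};

/* Classify one block from its in-use count and its free-list count. */
enum block_state classify_block(int in_use, int in_free)
{
    assert(in_use >= 0 && in_free >= 0);
    if (in_use > 1)
        return BLOCK_DUP_DATA;
    if (in_free > 1)
        return BLOCK_DUP_FREE;
    if (in_use == 0 && in_free == 0)
        return BLOCK_MISSING;
    if (in_use == 1 && in_free == 1)
        return BLOCK_USED_AND_FREE;
    return BLOCK_CONSISTENT;
}

/* Count the blocks that are in any state other than consistent. */
size_t count_inconsistent(const int *in_use, const int *in_free, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (classify_block(in_use[i], in_free[i]) != BLOCK_CONSISTENT)
            bad++;
    return bad;
}
```

Applied to the two 12-entry rows from the block consistency table, blocks 9, 10 and 11 come out as missing, duplicated in the free list, and duplicated in data, respectively.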
Metadata Operations

Metadata operations modify the structure
of the file system
– Creating, deleting, or renaming
files, directories, or special files
– Directory & I-node

Data must be written to disk in such a way
that the file system can be recovered to a
consistent state after a system crash
Metadata Integrity

FFS uses synchronous writes to guarantee
the integrity of metadata
– Any operation modifying multiple pieces of
metadata will write its data to disk in a specific
order
– These writes will be blocking

Guarantees integrity and durability of
metadata updates
Deleting a file (I)
abc → i-node-1
def → i-node-2
ghi → i-node-3
Assume we want to delete file “def”
Deleting a file (II)
abc → i-node-1
def → ?
ghi → i-node-3
Cannot delete i-node before directory entry “def”
Deleting a file (III)

Correct sequence is
1. Write to disk directory block containing deleted
directory entry “def”
2. Write to disk i-node block containing deleted i-node
Leaves the file system in a consistent state
Creating a file (I)
abc → i-node-1
ghi → i-node-3
Assume we want to create new file “tuv”
Creating a file (II)
abc → i-node-1
ghi → i-node-3
tuv → ?
Cannot write directory entry “tuv” before i-node
Creating a file (III)

Correct sequence is
1. Write to disk i-node block containing new i-node
2. Write to disk directory block containing new
directory entry
Leaves the file system in a consistent state
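The ordering rules for deleting and creating can be checked with a toy model: the disk holds one i-node and one directory entry, a crash may happen after any single write, and the state is inconsistent whenever a directory entry points at an i-node that is not valid. Everything below (types, names, the two-field disk state) is a simplifying assumption for illustration.

```c
#include <assert.h>

enum write_target { WRITE_INODE_BLOCK, WRITE_DIR_BLOCK };

struct disk_state {
    int inode_valid;      /* 1 if the i-node on disk is initialized */
    int dirent_present;   /* 1 if the directory entry on disk exists */
};

/* A dangling directory entry is the one state that must never reach disk. */
static int is_consistent(struct disk_state d)
{
    return !(d.dirent_present && !d.inode_valid);
}

/* creating = 1 models file creation, creating = 0 models deletion.
 * Returns 1 if the disk is consistent after every individual write,
 * that is, no crash point leaves a dangling directory entry. */
int order_is_crash_safe(int creating, const enum write_target order[2])
{
    struct disk_state d;
    d.inode_valid = !creating;      /* deletion starts with both present */
    d.dirent_present = !creating;   /* creation starts with neither */
    assert(is_consistent(d));

    for (int i = 0; i < 2; i++) {
        if (order[i] == WRITE_INODE_BLOCK)
            d.inode_valid = creating;
        else
            d.dirent_present = creating;
        if (!is_consistent(d))
            return 0;               /* a crash here would corrupt the FS */
    }
    return 1;
}
```

The model confirms the two sequences: i-node block first when creating, directory block first when deleting. The intermediate states that remain (an unreferenced i-node) waste a resource but do not corrupt the directory structure.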
Synchronous Updates

Used by FFS to guarantee consistency of
metadata:
– All metadata updates are done through blocking
writes
Increases the cost of metadata updates
 Can significantly impact the performance
of whole file system

SOFT UPDATES
Use delayed writes (write back)
 Maintain dependency information about
cached pieces of metadata:

This i-node must be updated before/after this
directory entry

Guarantee that metadata blocks are written
to disk in the required order
3 Soft Update Rules
Never point to a structure before it has
been initialized.
 Never reuse a resource before nullifying
all previous pointers to it.
 Never reset the old pointer to a live
resource before the new pointer has been
set.

Problem #1 with S.U.
Synchronous writes guaranteed that
metadata operations were durable once the
system call returned
 Soft Updates guarantee that file system will
recover into a consistent state but not
necessarily the most recent one

– Some updates could be lost
What are the dependency relationships?
We want to delete file “foo”
and create new file “bar”
Block A (directory block): foo, NEW bar
Block B (i-node block): i-node-2, NEW i-node-3
Circular Dependency
X-2nd
Y-1st
We want to delete file “foo”
and create new file “bar”
Block A
Block B
foo
i-node-2
NEW bar
NEW i-node-3
Problem #2 with S.U.

Cyclical dependencies:
– Same directory block contains entries to be
created and entries to be deleted
– These entries point to i-nodes in the same block

Brainstorming:
– How to resolve this issue in S.U.?
How to update? i-node first or directory block first?
Solution in S.U.

Roll back metadata in one of the blocks to
an earlier, safe state
Block A’
def
(Safe state does not contain new directory
entry)
Write first block with metadata that were
rolled back (block A’ of example)
 Write blocks that can be written after first
block has been written (block B of
example)
 Roll forward block that was rolled back
 Write that block
Breaks the cyclical dependency, but block A must
now be written twice

Before any Write Operation
SU Dependency Checking
(roll back if necessary)
After any Write Operation
SU Dependency Processing
(task list updating)
(roll forward if necessary)

two most popular approaches for improving
the performance of metadata operations and
recovery:
– Journaling
– Soft Updates
Journaling systems record metadata
operations on an auxiliary log
 Soft Updates uses ordered writes

JOURNALING
Journaling systems maintain an auxiliary
log that records all meta-data operations
 Write-ahead logging ensures that the log is
written to disk before any blocks containing
data modified by the corresponding
operations.

– After a crash, can replay the log to bring the file
system to a consistent state
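A toy sketch of write-ahead logging, assuming an in-memory "disk" of eight integer blocks and a fixed-size log. Real journaling file systems log richer records and handle transaction boundaries and log wrap-around; all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 8
#define LOG_CAPACITY 16

struct log_record {
    size_t block;      /* which metadata block the operation changes */
    int new_value;     /* the value it should end up holding */
};

struct journaled_disk {
    int blocks[NBLOCKS];                  /* in-place metadata blocks */
    struct log_record log[LOG_CAPACITY];  /* auxiliary log, written first */
    size_t log_len;
};

/* Write-ahead step: record the update in the log only. The in-place
 * block is left untouched until the log is replayed. */
int journal_update(struct journaled_disk *d, size_t block, int new_value)
{
    assert(d != NULL && block < NBLOCKS);
    if (d->log_len == LOG_CAPACITY)
        return -1;                        /* log full */
    d->log[d->log_len].block = block;
    d->log[d->log_len].new_value = new_value;
    d->log_len++;
    return 0;
}

/* Recovery (or checkpoint): apply the logged updates in order, then
 * truncate the log. Replaying twice gives the same result. */
void replay_log(struct journaled_disk *d)
{
    assert(d != NULL);
    for (size_t i = 0; i < d->log_len; i++)
        d->blocks[d->log[i].block] = d->log[i].new_value;
    d->log_len = 0;
}
```

Because every change reaches the log before the block it modifies, a crash at any point leaves either the old block plus a log record to replay, or the new block; the log writes are sequential, which is what makes them cheap.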
JOURNALING
Log writes are performed in addition to the
regular writes
 Journaling systems incur log write overhead
but

– Log writes can be performed efficiently
because they are sequential (block operation
consideration)
– Metadata blocks do not need to be written
back after each update
JOURNALING

Journaling systems can provide
– same durability semantics as FFS if log is
forced to disk after each meta-data operation
– the laxer semantics of Soft Updates if log
writes are buffered until entire buffers are
full
Soft Updates vs. Journaling
Advantages
 disadvantages

With Soft Updates??
Do we still need “FSCK”? at boot time?
CPU
Recover the Missing Resources

In the background, in an active FS…
– We don’t want to wait for the lengthy FSCK
process to complete…

A related issue:
– the virus scanning process
– what happens if we get a new virus signature?
Snapshot of the FS
backup and restore
 dump reliably an active File System

– what will we do today to dump our 40GB FS
“consistent” snapshots? (in the midnight…)

“background FSCK checks”
What is a snapshot?
(I mean “conceptually”.)
Freeze all activities related to the FS.
 Copy everything to “some space”.
 Resume the activities.

How do we efficiently implement
this concept such that the
activities will only be blocked for
about 0.25 seconds, and we don’t
have to buy a really big hard
drive?
Copy-on-Write
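A minimal sketch of the copy-on-write idea behind a snapshot, assuming a tiny in-memory file system of four integer blocks. Taking the snapshot only resets a map, so it is quick; an old block is copied aside the first time the live file system overwrites it. The structure and function names are hypothetical and far simpler than the real FFS snapshot code.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 4

struct snapshot_fs {
    int live[NBLOCKS];     /* the active file system's blocks */
    int saved[NBLOCKS];    /* old contents preserved for the snapshot */
    int copied[NBLOCKS];   /* 0 = not yet copied, 1 = saved[] holds the old block */
};

/* Taking a snapshot copies no data: every block starts as "not yet copied". */
void take_snapshot(struct snapshot_fs *fs)
{
    assert(fs != NULL);
    for (size_t i = 0; i < NBLOCKS; i++)
        fs->copied[i] = 0;
}

/* Copy-on-write: preserve the old contents once, then modify the live block. */
void write_block(struct snapshot_fs *fs, size_t b, int value)
{
    assert(fs != NULL && b < NBLOCKS);
    if (!fs->copied[b]) {
        fs->saved[b] = fs->live[b];
        fs->copied[b] = 1;
    }
    fs->live[b] = value;
}

/* The snapshot view: the saved copy if one exists, otherwise the
 * still-unmodified live block. */
int read_snapshot(const struct snapshot_fs *fs, size_t b)
{
    assert(fs != NULL && b < NBLOCKS);
    return fs->copied[b] ? fs->saved[b] : fs->live[b];
}
```

Only blocks that change after the snapshot consume extra space, which is why the snapshot's physical size can be much smaller than its logical size.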
Snapshot: a file
Logical size
Versus physical size
Example
# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon
Copy-on-Write
A file
A File System
??? entries in
one disk block
A file
A Snapshot i-node
??? entries in
one disk block
Not used or
not yet copied
A file
Copy-on-write
??? entries in
one disk block
Not used or
not yet copied
Multiple Snapshots
about 20 snapshots
 Interactions/sharing among snapshots

VFS: the FS Switch

Sun Microsystems introduced the virtual file system
interface in 1985 to accommodate diverse filesystem types
cleanly.

VFS allows diverse specific file systems to coexist in a file tree,
isolating all FS-dependencies in pluggable filesystem modules.
[Figure: layering, top to bottom: user space; syscall layer (file, uio, etc.); Virtual File System (VFS) beside the network protocol stack (TCP/IP); NFS, FFS, LFS, *FS, etc.; device drivers]
VFS was an internal kernel restructuring
with no effect on the syscall interface.
Other abstract interfaces in the kernel: device drivers,
file objects, executable files, memory objects.
Incorporates object-oriented concepts:
a generic procedural interface with
multiple implementations.
Based on abstract objects with dynamic
method binding by type...in C.
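A minimal sketch of "dynamic method binding by type in C": each vnode carries a pointer to a table of function pointers supplied by its file system, and generic code calls through that table without knowing which file system is underneath. The operations getsize and fsname, the toy_inode and toy_rnode structs, and the vop_ wrapper names are all hypothetical and are not the real BSD vnode operations.

```c
#include <assert.h>
#include <stddef.h>

struct vnode;

/* The generic interface: one function pointer per operation. */
struct vnode_ops {
    long (*getsize)(const struct vnode *vp);
    const char *(*fsname)(const struct vnode *vp);
};

struct vnode {
    const struct vnode_ops *ops;  /* method table chosen by file system type */
    void *fs_data;                /* file-system-specific struct, opaque here */
};

/* Private per-file data of two toy file systems. */
struct toy_inode { long size; };          /* local file system */
struct toy_rnode { long cached_size; };   /* remote file system */

static long local_getsize(const struct vnode *vp)
{
    return ((const struct toy_inode *)vp->fs_data)->size;
}
static const char *local_fsname(const struct vnode *vp)
{
    (void)vp;
    return "local";
}
static long remote_getsize(const struct vnode *vp)
{
    return ((const struct toy_rnode *)vp->fs_data)->cached_size;
}
static const char *remote_fsname(const struct vnode *vp)
{
    (void)vp;
    return "remote";
}

const struct vnode_ops local_ops = { local_getsize, local_fsname };
const struct vnode_ops remote_ops = { remote_getsize, remote_fsname };

/* Generic entry points: these vector to the file-system-specific procedure. */
long vop_getsize(const struct vnode *vp)
{
    assert(vp != NULL && vp->ops != NULL);
    return vp->ops->getsize(vp);
}
const char *vop_fsname(const struct vnode *vp)
{
    assert(vp != NULL && vp->ops != NULL);
    return vp->ops->fsname(vp);
}
```

Adding another file system then means supplying one more ops table; the callers of vop_getsize do not change, which is the sense in which the file system modules are pluggable.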
vnode

In the VFS framework, every file or directory in active use
is represented by a vnode object in kernel memory.
[Figure: the syscall layer refers to vnodes; NFS and UFS each own some of them; unused ones sit on a list of free vnodes]
Each vnode has a standard
file attributes struct.
Generic vnode points at
filesystem-specific struct
(e.g., inode, rnode), seen
only by the filesystem.
Each specific file system
maintains a cache of its
resident vnodes.
Vnode operations are
macros that vector to
filesystem-specific
procedures.
vnode Operations and Attributes
vnode attributes (vattr)
type (VREG, VDIR, VLNK, etc.)
mode (9+ bits of permissions)
nlink (hard link count)
owner user ID
owner group ID
filesystem ID
unique file ID
file size (bytes and blocks)
access time
modify time
generation number
generic operations
vop_getattr (vattr)
vop_setattr (vattr)
vhold()
vholdrele()
directories only
vop_lookup (OUT vpp, name)
vop_create (OUT vpp, name, vattr)
vop_remove (vp, name)
vop_link (vp, name)
vop_rename (vp, name, tdvp, tvp, name)
vop_mkdir (OUT vpp, name, vattr)
vop_rmdir (vp, name)
vop_symlink (OUT vpp, name, vattr, contents)
vop_readdir (uio, cookie)
vop_readlink (uio)
files only
vop_getpages (page**, count, offset)
vop_putpages (page**, count, sync, offset)
vop_fsync ()
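Network File System (NFS)
[Figure: on the client, user programs enter the syscall layer and VFS, which dispatches to local UFS or to the NFS client; the NFS client reaches the NFS server over the network; on the server, the NFS server goes through VFS to UFS]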
vnode Cache
HASH(fsid, fileid)
VFS free list head
Active vnodes are reference-counted
by the structures that hold pointers to
them.
- system open file table
- process current directory
- file system mount points
- etc.
Each specific file system maintains its
own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
struct vnode {
    struct mtx       v_interlock;      /* lock for "i" things */
    u_long           v_iflag;          /* i vnode flags (see below) */
    int              v_usecount;       /* i ref count of users */
    long             v_numoutput;      /* i writes in progress */
    struct thread   *v_vxthread;       /* i thread owning VXLOCK */
    int              v_holdcnt;        /* i page & buffer references */
    struct buflists  v_cleanblkhd;     /* i SORTED clean blocklist */
    struct buf      *v_cleanblkroot;   /* i clean buf splay tree */
    int              v_cleanbufcnt;    /* i number of clean buffers */
    struct buflists  v_dirtyblkhd;     /* i SORTED dirty blocklist */
    struct buf      *v_dirtyblkroot;   /* i dirty buf splay tree */
    int              v_dirtybufcnt;
Transaction-based FS
Performance versus consistency
 “Atomic Writes” on Multiple Blocks

– See the paper titled “Atomic Writes for Data
Integrity and Consistency in Shared Storage
Devices for Clusters” by Okun and Barak,
FGCS, vol. 20, pages 539-547, 2004.
– Modify SCSI handling