Operating System
ecs150 Fall 2007
#5: File Systems
(chapters: 6.4~6.7, 8)
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]
File System Abstraction
• Files
• Directories
The file-system layering (top to bottom):
System-call interface
Active file entries
VNODE layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
DIR *dirp = opendir(filename);
struct dirent *direntp = readdir(dirp);

struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};
[Figure: a directory is a list of dirent entries; each dirent maps a file_name to an inode, and each inode points to a file.]
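As a concrete illustration of this interface, a minimal sketch (hypothetical: it takes the directory name on the command line) that prints each entry's inode number and name:

#include <stdio.h>
#include <dirent.h>

int
main(int argc, char *argv[])
{
    /* Open the directory named on the command line (default ".") */
    DIR *dirp = opendir(argc > 1 ? argv[1] : ".");
    struct dirent *dp;

    if (dirp == NULL)
        return 1;
    while ((dp = readdir(dirp)) != NULL)
        printf("%lu\t%s\n", (unsigned long)dp->d_ino, dp->d_name);
    closedir(dirp);
    return 0;
}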
Local versus Remote
• System Call Interface
• V-node
• Local versus remote
  – NFS or i-node
  – Stackable File System
• Hard-disk blocks
File-System Structure
• File structure
  – Logical storage unit
  – Collection of related information
• The file system resides on secondary storage (disks).
• The file system is organized into layers.
• File control block – a storage structure holding information about a file.
File → Disk
• Separate the disk into blocks
• Separate the file into blocks as well
• Paging from file to disk

blocks: 4 - 7 - 2 - 10 - 12
How do we represent the file?
How do we link these 5 pages together?
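One minimal sketch of an answer (hypothetical, assuming 1 KB blocks): keep an index of disk block numbers per file instead of chaining the blocks themselves:

#define BLOCK_SIZE 1024

/* Hypothetical per-file block map: logical block i of the file
 * lives in disk block blkno[i].  For the example file above:
 * blkno = { 4, 7, 2, 10, 12 }. */
struct file_map {
    unsigned nblocks;      /* number of blocks in the file */
    unsigned blkno[5];     /* disk block number of each file block */
};

/* Translate a byte offset in the file to a disk block number. */
unsigned
offset_to_disk_block(const struct file_map *fm, unsigned long off)
{
    return fm->blkno[off / BLOCK_SIZE];
}

The i-node, discussed below, is essentially this idea plus indirection so the map itself can grow.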
BitTorrent pieces
• One big file (X gigabytes) with a number of pieces (5%) already in (and being shared with others).
• How much disk space do we need at this moment?
Hard Disk
• Track, Sector, Head
  – Track + Heads → Cylinder
• Performance
  – seek time
  – rotation time
  – transfer time
• LBA
  – Linear Block Addressing
File → Disk blocks
[Figure: file blocks 0–4 map to disk blocks 4, 7, 2, 10, and 12.]
What are the disadvantages?
1. disk access can be slow for “random access”.
2. How big is each block? 64 bytes? 68 bytes?
Kernel Hacking Session
• This Friday from 7:30 p.m. until midnight
• 3083 Kemper
  – Bring your laptop
  – And bring your mug…
A File System
[Figure: a partition holds the boot block (b), the superblock (s), the i-list of i-nodes, and the directory and data blocks.]
One Logical File → Physical Disk Blocks
Efficient representation & access
A file and its i-node
[Figure: an i-node maps the file through direct and indirect blocks. How many entries fit in one disk block? Typical block size: 8K or 16K bytes.]
inode (index node) structure
• Metadata of the file:

  field      size (bytes)
  di_mode         2
  di_nlinks       2
  di_uid          2
  di_gid          2
  di_size         4
  di_addr        39
  di_gen          1
  di_atime        4
  di_mtime        4
  di_ctime        4
The file-system layering again (top to bottom):
System-call interface
Active file entries
VNODE layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
A File System
[Figure: a partition holds the boot block (b), the superblock (s), the i-list of i-nodes, and the directory and data blocks.]
struct ufs2_dinode {
    u_int16_t    di_mode;       /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;      /*   2: File link count. */
    u_int32_t    di_uid;        /*   4: File owner. */
    u_int32_t    di_gid;        /*   8: File group. */
    u_int32_t    di_blksize;    /*  12: Inode blocksize. */
    u_int64_t    di_size;       /*  16: File byte count. */
    u_int64_t    di_blocks;     /*  24: Bytes actually held. */
    ufs_time_t   di_atime;      /*  32: Last access time. */
    ufs_time_t   di_mtime;      /*  40: Last modified time. */
    ufs_time_t   di_ctime;      /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;  /*  56: Inode creation time. */
    int32_t      di_mtimensec;  /*  64: Last modified time. */
    int32_t      di_atimensec;  /*  68: Last access time. */
    int32_t      di_ctimensec;  /*  72: Last inode change time. */
    int32_t      di_birthnsec;  /*  76: Inode creation time. */
    int32_t      di_gen;        /*  80: Generation number. */
    u_int32_t    di_kernflags;  /*  84: Kernel flags. */
    u_int32_t    di_flags;      /*  88: Status flags (chflags). */
    int32_t      di_extsize;    /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR];/* 96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR]; /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR]; /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];   /* 232: Reserved; currently unused */
};
struct ufs1_dinode {
    u_int16_t    di_mode;       /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;      /*   2: File link count. */
    union {
        u_int16_t oldids[2];    /*   4: Ffs: old user and group ids. */
    } di_u;
    u_int64_t    di_size;       /*   8: File byte count. */
    int32_t      di_atime;      /*  16: Last access time. */
    int32_t      di_atimensec;  /*  20: Last access time. */
    int32_t      di_mtime;      /*  24: Last modified time. */
    int32_t      di_mtimensec;  /*  28: Last modified time. */
    int32_t      di_ctime;      /*  32: Last inode change time. */
    int32_t      di_ctimensec;  /*  36: Last inode change time. */
    ufs1_daddr_t di_db[NDADDR]; /*  40: Direct disk blocks. */
    ufs1_daddr_t di_ib[NIADDR]; /*  88: Indirect disk blocks. */
    u_int32_t    di_flags;      /* 100: Status flags (chflags). */
    int32_t      di_blocks;     /* 104: Blocks actually held. */
    int32_t      di_gen;        /* 108: Generation number. */
    u_int32_t    di_uid;        /* 112: File owner. */
    u_int32_t    di_gid;        /* 116: File group. */
    int32_t      di_spare[2];   /* 120: Reserved; currently unused */
};
Bittorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    for (i = 0; i < 1000; i++) {
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0) sleep(1);
    }
    fflush(f1);
    return 0;
}

Run it and look at the result:
# ./t
# ls -l ./sss.txt
A file and its i-node
[Figure: an i-node maps the file through direct and indirect blocks. How many entries fit in one disk block? Typical block size: 1K.]
i-node
• How many disk blocks can a FS have?
• How many levels of i-node indirection will be necessary to store a file of 2G bytes (i.e., 0, 1, 2, or 3)?
• What is the largest possible file size in i-node?
• What is the size of the i-node itself for a file of 10GB with only 512 MB downloaded?
Answer
• How many disk blocks can a FS have?
  – 2^64 or 2^32: a pointer (to blocks) is 8 or 4 bytes.
• How many levels of i-node indirection will be necessary to store a file of 2G (2^31) bytes (i.e., 0, 1, 2, or 3)?
  – 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10 >? 2^31
• What is the largest possible file size in i-node?
  – 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10
  – 2^64 - 1
  – 2^32 * 2^10
You need to consider these three limits and take the minimum!
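As a sanity check on this arithmetic, a small sketch (hypothetical ufs1-like geometry: 1 KB blocks, 4-byte block pointers, 12 direct pointers, and one single-, double-, and triple-indirect pointer each):

#include <stdio.h>

int
main(void)
{
    unsigned long long bs  = 1024;     /* block size: 2^10 */
    unsigned long long ppb = bs / 4;   /* pointers per block: 2^8 */

    unsigned long long direct = 12 * bs;
    unsigned long long single = ppb * bs;
    unsigned long long dbl    = ppb * ppb * bs;
    unsigned long long triple = ppb * ppb * ppb * bs;

    /* Largest file the pointer tree itself can address. */
    printf("max addressable: %llu bytes\n",
           direct + single + dbl + triple);
    return 0;
}

It prints 17247252480 (about 16 GiB); direct plus single plus double indirection alone cover only about 64 MB, so a 2G-byte file needs triple indirection.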
Answer: Lower Bound
• How many pointers?
  – 512 MB divided by the block size (1K)
  – 512K pointers times 8 (or 4) bytes = 4 (or 2) MB
Bittorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?
Answer: Upper Bound
• In the worst case, EVERY indirection block has at least one entry!
• How many indirection blocks?
  – Single: 1 block
  – Double: 1 + 2^8
  – Triple: 1 + 2^8 + 2^16
• Total ≈ 2^16 blocks times 1K = 64 MB
  – 2^14 times 1K = 16 MB (ufs2 inode)
Answer (4)
• ufs1: 2 MB ~ 64 MB
• ufs2: 4 MB ~ 16 MB
• Answer: sss.txt ~ 17 MB
  – ~16 MB (inode indirection blocks)
  – 1000 writes times 1K ~ 1 MB
A file and its i-node
[Figure: an i-node maps the file through direct and indirect blocks. How many entries fit in one disk block? Typical block size: 1K.]
A File System
[Figure: a partition holds the boot block (b), the superblock (s), the i-list of i-nodes, and the directory and data blocks.]
FFS and UFS
• /usr/src/sys/ufs/ffs/*
  – Higher-level: directory structure
  – Soft updates & snapshots
• /usr/src/sys/ufs/ufs/*
  – Lower-level: buffer, i-node
# of i-nodes
• UFS1: pre-allocation
  – about 3% of the disk, of which typically < 25% is used
• UFS2: dynamic allocation
  – still a limited # of i-nodes
di_size vs. di_blocks
• ???
One Logical File → Physical Disk Blocks
Efficient representation & access
di_size vs. di_blocks
• Logical vs. physical size
• fstat vs. du
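The distinction shows up directly in stat(2); a small sketch (hypothetical file name; on FreeBSD st_blocks counts 512-byte units):

#include <stdio.h>
#include <sys/stat.h>

int
main(void)
{
    struct stat sb;

    /* Compare the logical size with the space actually allocated. */
    if (stat("./sss.txt", &sb) != 0)
        return 1;
    printf("logical  (di_size):   %lld bytes\n", (long long)sb.st_size);
    printf("physical (di_blocks): %lld bytes\n",
           (long long)sb.st_blocks * 512);
    return 0;
}

For the sparse sss.txt above, the logical size (set by fseek) can be gigabytes while the physical size stays around 17 MB.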
Extended Attributes in UFS2
• Attributes associated with the file
  – di_extb[2];
  – two blocks, but indirection if needed
• Format:

  field                 size (bytes)
  Length                     4
  Name Space                 1
  Content Pad Length         1
  Name Length                1
  Name                  padded to a multiple of 8
  Content               variable

• Applications: ACLs, data labelling
Some thoughts…
• What can you do with "extended attributes"?
• How to design/implement them?
  – Should/can we do it with "Stackable File Systems"?
  – Otherwise, the program manipulating the EAs will have to be very UFS2-dependent, or use FiST with a UFS2 optimization option.
• Are there any counter-examples?
  – security and performance considerations
A File System
[Figure: a partition holds the boot block (b), the superblock (s), the i-list of i-nodes, and the directory and data blocks.]
struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};

struct stat { …
    short nlinks;
… };

[Figure: a directory is a list of dirent entries; each dirent maps a file_name to an inode, and each inode points to a file.]
[Figure: walking the on-disk structures for a path. The root directory "/" is i-node 2 and lists "." (2), ".." (2), "usr" (4), and "vmunix" (5). Directory /usr is i-node 4 and lists "." (4), ".." (2), "bin" (7), and "foo" (6). Directory /usr/bin is i-node 7 and lists "." (7), ".." (4), "ex" (9), "groff" (10), and "vi" (9). Each i-node records owner, group, permissions, and dates (e.g., root/wheel drwxr-xr-x Apr 1 2004) and points at the directory or file data blocks; the file /usr/foo contains the text "Hello World!".]
What is the difference?
• ln -s /usr/src/sys/sys/proc.h ppp.h
• ln /usr/src/sys/sys/proc.h ppp.h
Hard versus Symbolic
• ln -s /usr/src/sys/sys/proc.h ppp.h
  – Links to anything, on any mounted partition
  – Delete a symbolic link?
• ln /usr/src/sys/sys/proc.h ppp.h
  – Links only to a "file" (not a directory)
  – Links only within the same partition -- why?
  – Delete a hard link?
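A minimal sketch of the two system calls behind ln (hypothetical link names; link(2) and symlink(2) are the standard interfaces):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* Hard link: a second directory entry for the same i-node;
     * the i-node's link count (di_nlink) is incremented. */
    if (link("/usr/src/sys/sys/proc.h", "ppp.h") != 0)
        perror("link");

    /* Symbolic link: a new i-node whose data is the target path. */
    if (symlink("/usr/src/sys/sys/proc.h", "qqq.h") != 0)
        perror("symlink");
    return 0;
}

Because a hard link is just another reference to an i-node number, it cannot cross partitions: i-node numbers are only meaningful within one file system.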
struct ufs2_dinode {
    u_int16_t    di_mode;       /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;      /*   2: File link count. */
    u_int32_t    di_uid;        /*   4: File owner. */
    u_int32_t    di_gid;        /*   8: File group. */
    u_int32_t    di_blksize;    /*  12: Inode blocksize. */
    u_int64_t    di_size;       /*  16: File byte count. */
    u_int64_t    di_blocks;     /*  24: Bytes actually held. */
    ufs_time_t   di_atime;      /*  32: Last access time. */
    ufs_time_t   di_mtime;      /*  40: Last modified time. */
    ufs_time_t   di_ctime;      /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;  /*  56: Inode creation time. */
    int32_t      di_mtimensec;  /*  64: Last modified time. */
    int32_t      di_atimensec;  /*  68: Last access time. */
    int32_t      di_ctimensec;  /*  72: Last inode change time. */
    int32_t      di_birthnsec;  /*  76: Inode creation time. */
    int32_t      di_gen;        /*  80: Generation number. */
    u_int32_t    di_kernflags;  /*  84: Kernel flags. */
    u_int32_t    di_flags;      /*  88: Status flags (chflags). */
    int32_t      di_extsize;    /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR];/* 96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR]; /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR]; /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];   /* 232: Reserved; currently unused */
};
struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};

struct stat { …
    short nlinks;
… };

[Figure: a directory is a list of dirent entries; each dirent maps a file_name to an inode, and each inode points to a file.]
File System Buffer Cache
application: read/write files
OS:          translates files to disk blocks
             maintains the buffer cache
             controls disk accesses: read/write blocks
hardware:    the disk
Any problems?
File System Consistency
• To maintain file-system consistency, the ordering of updates from the buffer cache to disk is critical.
• Example:
  – If the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent.
File System Consistency
• File systems almost always use a buffer/disk cache for performance reasons.
• This problem is especially critical for the blocks that contain control information: i-nodes, the free list, and directory blocks.
• Two copies of a disk block (buffer cache, disk) → a consistency problem if the system crashes before all modified blocks are written back to disk.
• Write critical blocks from the buffer cache back to disk immediately.
• Data blocks are also written back periodically: sync.
Two Strategies
• Prevention
  – Use unbuffered I/O when writing i-nodes or pointer blocks
  – Use buffered I/O for other writes, and force a sync every 30 seconds
• Detect and Fix
  – Detect the inconsistency
  – Fix it according to the "rules"
  – fsck (the File System Checker)
File System Integrity
• Block consistency:
  – block-in-use table:  0 1 1 1 0 0 0 1 0 0 0 2
  – free-list table:     1 0 0 0 1 1 1 0 1 0 2 0
• File consistency:
  – how many directory entries point to that i-node (D)?
  – nlink (L)?
  – three cases: D == L, L > D, D > L
What to do with the latter two cases?
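A sketch of the block-consistency pass over these two tables (hypothetical arrays taken from the slide; the real fsck builds them by walking every i-node's block pointers and the free list):

#include <stdio.h>

#define NBLOCKS 12

int
main(void)
{
    /* How many times each block appears in some file (in_use)
     * and on the free list (freelist), as on the slide. */
    int in_use[NBLOCKS]   = {0,1,1,1,0,0,0,1,0,0,0,2};
    int freelist[NBLOCKS] = {1,0,0,0,1,1,1,0,1,0,2,0};

    for (int b = 0; b < NBLOCKS; b++) {
        if (in_use[b] + freelist[b] == 0)
            printf("block %d: missing -> add it to the free list\n", b);
        else if (freelist[b] > 1)
            printf("block %d: duplicated in free list -> rebuild it\n", b);
        else if (in_use[b] > 1)
            printf("block %d: shared by two files -> copy the block\n", b);
    }
    return 0;
}

The three anomalies it reports correspond to states (b), (c), and (d) on the next slide.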
File System Integrity
• File system states:
  (a) consistent
  (b) missing block
  (c) duplicate block in the free list
  (d) duplicate data block
Metadata Operations
• Metadata operations modify the structure of the file system
  – creating, deleting, or renaming files, directories, or special files
  – directory & i-node
• Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash
Metadata Integrity
• FFS uses synchronous writes to guarantee the integrity of metadata
  – any operation modifying multiple pieces of metadata writes its data to disk in a specific order
  – these writes are blocking
• Guarantees integrity and durability of metadata updates
Deleting a file (I)
[Figure: a directory block with entries "abc" (i-node-1), "def" (i-node-2), and "ghi" (i-node-3).]
Assume we want to delete file "def".
Deleting a file (II)
[Figure: i-node-2 has been freed, but the directory entry "def" still points at it (?).]
Cannot delete the i-node before the directory entry "def".
Deleting a file (III)
• The correct sequence is:
  1. Write to disk the directory block containing the deleted directory entry "def"
  2. Write to disk the i-node block containing the deleted i-node
• This leaves the file system in a consistent state.
Creating a file (I)
[Figure: a directory block with entries "abc" (i-node-1) and "ghi" (i-node-3).]
Assume we want to create new file "tuv".
Creating a file (II)
[Figure: the new entry "tuv" would point at an i-node (?) that is not on disk yet.]
Cannot write directory entry "tuv" before the i-node.
Creating a file (III)
• The correct sequence is:
  1. Write to disk the i-node block containing the new i-node
  2. Write to disk the directory block containing the new directory entry
• This leaves the file system in a consistent state.
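A sketch of what this ordered, synchronous discipline looks like at the buffer level (hypothetical write_block helper standing in for the kernel's blocking buffer write):

#include <unistd.h>

/* Hypothetical blocking metadata write: does not return until
 * the block is on the platter. */
static void
write_block(int fd, off_t blkno, const void *buf, size_t bs)
{
    pwrite(fd, buf, bs, blkno * bs);
    fsync(fd);               /* block until the disk has it */
}

/* Create: i-node block first, directory block second.  A crash
 * between the two leaves, at worst, an allocated-but-unreferenced
 * i-node, which fsck can reclaim. */
void
create_file(int fd, off_t iblk, const void *ib,
            off_t dblk, const void *db, size_t bs)
{
    write_block(fd, iblk, ib, bs);   /* 1: the new i-node */
    write_block(fd, dblk, db, bs);   /* 2: the new directory entry */
}

Deleting reverses the order: the directory block goes out first, so no entry ever points at a freed i-node.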
Synchronous Updates
• Used by FFS to guarantee the consistency of metadata:
  – all metadata updates are done through blocking writes
• Increases the cost of metadata updates
• Can significantly impact the performance of the whole file system
SOFT UPDATES
• Use delayed writes (write back)
• Maintain dependency information about cached pieces of metadata:
  – "this i-node must be updated before/after this directory entry"
• Guarantee that metadata blocks are written to disk in the required order
3 Soft Update Rules
• Never point to a structure before it has been initialized.
• Never reuse a resource before nullifying all previous pointers to it.
• Never reset the old pointer to a live resource before the new pointer has been set.
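One way to picture the dependency bookkeeping (a hypothetical, much-simplified record; the real FreeBSD implementation in ffs_softdep.c uses many specialized dependency types):

struct buf;                     /* a cached disk block */

/* Hypothetical soft-updates dependency record:
 * "depender must not be written before dependee". */
struct dep {
    struct buf *depender;       /* e.g., the directory block */
    struct buf *dependee;       /* e.g., the i-node block */
    struct dep *next;           /* outstanding dependencies */
};

Before writing a buffer, the syncer walks its dependency list; if a dependee is still dirty it is written first, or the depender is rolled back to a safe state (as described below).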
Problem #1 with S.U.
• Synchronous writes guaranteed that metadata operations were durable once the system call returned
• Soft Updates guarantees that the file system will recover into a consistent state, but not necessarily the most recent one
  – some updates could be lost
What are the dependency relationships?
We want to delete file "foo" and create new file "bar".
[Figure: directory block A holds the old entry "foo" and the NEW entry "bar"; i-node block B holds i-node-2 (foo's, to be freed) and the NEW i-node-3 (bar's).]
Circular Dependency
We want to delete file "foo" and create new file "bar".
[Figure: the same blocks A and B. The delete requires block A (remove "foo") to be written before block B (free i-node-2); the create requires block B (initialize NEW i-node-3) to be written before block A (add "bar"). Each block is both X-2nd and Y-1st: a cycle.]
Problem #2 with S.U.
• Cyclical dependencies:
  – the same directory block contains entries to be created and entries to be deleted
  – these entries point to i-nodes in the same block
• Brainstorming:
  – how to resolve this issue in S.U.?
FS: buffer or disk??
• Blocks appear in both, and we try to synchronize them…
Disk
[Figure: on disk, directory block A still holds "foo" and i-node block B still holds i-node-2.]
Buffer
[Figure: in the buffer cache, block A holds the NEW entry "bar" and block B holds the NEW i-node-3.]
Synchronize??
[Figure: block A ("foo", NEW "bar") and block B (i-node-2, NEW i-node-3) must be reconciled between buffer and disk.]
How to update? i-node first or directory block first?
Solution in S.U.
• Roll back the metadata in one of the blocks to an earlier, safe state.
[Figure: block A' holds only the old entry "def"; the safe state does not contain the new directory entry.]
• Write first the block with the metadata that was rolled back (block A' of the example)
• Write the blocks that can be written once the first block is on disk (block B of the example)
• Roll forward the block that was rolled back
• Write that block
• This breaks the cyclical dependency, but block A must now be written twice
Before any Write Operation
• SU dependency checking (roll back if necessary)
After any Write Operation
• SU dependency processing (task-list updating; roll forward if necessary)
The two most popular approaches for improving the performance of metadata operations and recovery:
• Journaling
• Soft Updates
Journaling systems record metadata operations on an auxiliary log; Soft Updates uses ordered writes.
JOURNALING
• Journaling systems maintain an auxiliary log that records all metadata operations
• Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations
  – after a crash, we can replay the log to bring the file system to a consistent state
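A sketch of the write-ahead discipline (hypothetical record format and file descriptors; real journals also batch, checksum, and checkpoint their records):

#include <unistd.h>

/* Hypothetical log record: which metadata block changed, and how. */
struct log_rec {
    unsigned long blkno;      /* metadata block being modified */
    char          op[16];     /* e.g., "create", "delete" */
};

/* Write-ahead rule: the log record must be durable on disk
 * BEFORE the metadata block itself may be written. */
void
log_metadata_op(int log_fd, int disk_fd, const struct log_rec *r,
                const void *blk, size_t bs)
{
    write(log_fd, r, sizeof(*r));             /* 1: append to the log */
    fsync(log_fd);                            /* 2: force log to disk */
    pwrite(disk_fd, blk, bs, r->blkno * bs);  /* 3: block may now go */
}

Because the log is appended sequentially, step 1 is cheap; step 3 can even be deferred and batched, which is where the performance win comes from.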
JOURNALING
• Log writes are performed in addition to the regular writes
• Journaling systems incur log-write overhead, but
  – log writes can be performed efficiently, because they are sequential (a block-operation consideration)
  – metadata blocks do not need to be written back after each update
JOURNALING
• Journaling systems can provide
  – the same durability semantics as FFS, if the log is forced to disk after each metadata operation
  – the laxer semantics of Soft Updates, if log writes are buffered until entire buffers are full
Soft Updates vs. Journaling
• Advantages?
• Disadvantages?
With Soft Updates??
Do we still need "FSCK" at boot time?
Recover the Missing Resources
• In the background, in an active FS…
  – we don't want to wait for the lengthy FSCK process to complete…
• A related issue:
  – the virus-scanning process
  – what happens if we get a new virus signature?
Snapshot of the FS
• Backup and restore
• Reliably dump an active file system
  – what do we do today to dump our 40GB FS as "consistent" snapshots? (at midnight…)
• "Background FSCK checks"
What is a snapshot? (I mean "conceptually".)
• Freeze all activities related to the FS.
• Copy everything to "some space".
• Resume the activities.
How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don't have to buy a really big hard drive?
Copy-on-Write
Snapshot: a file
• Logical size versus physical size
Example
# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;

    for (i = 0; i < 1000; i++) {
        fseek(f1, rand(), SEEK_SET);
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0) sleep(1);
    }
    fflush(f1);
    return 0;
}
Example
# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon
Example
# mkdir /backups/usr/noon
# mount -u -o snapshot /usr/snap.noon /usr
# mdconfig -a -t vnode -u 0 -f /usr/snap.noon
# mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
# umount /backups/usr/noon
# mdconfig -d -u 0
# rm -f /usr/snap.noon
Copy-on-Write
A file in a File System
[Figure: the file's i-node and its block pointers; how many entries fit in one disk block?]
A Snapshot i-node
[Figure: the snapshot i-node mirrors the file system's blocks, but most entries are marked "not used" or "not yet copied".]
Copy-on-write
[Figure: when a block is about to be modified, the old block is first copied into the snapshot; untouched entries remain "not used" or "not yet copied".]
Copy-on-write
[Figure: after the copy, the snapshot points at the preserved old block while the live file points at the updated one.]
Multiple Snapshots
• About 20 snapshots
• Interactions/sharing among snapshots
Snapshot of the FS
• Backup and restore
• Reliably dump an active file system
  – what do we do today to dump our 40GB FS as "consistent" snapshots? (at midnight…)
• "Background FSCK checks"
VFS: the FS Switch
• Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.
• VFS allows diverse specific file systems to coexist in a file tree, isolating all FS dependencies in pluggable filesystem modules.
[Figure: user space enters the syscall layer (file, uio, etc.), which sits on the Virtual File System (VFS); below it are NFS, FFS, LFS, *FS, etc., beside the network protocol stack (TCP/IP) and the device drivers.]
• VFS was an internal kernel restructuring with no effect on the syscall interface.
• It incorporates object-oriented concepts: a generic procedural interface with multiple implementations, based on abstract objects with dynamic method binding by type… in C.
• Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
vnode
• In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.
• Each vnode has a standard file-attributes struct.
• The generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
• Each specific file system maintains a cache of its resident vnodes.
• Vnode operations are macros that vector to filesystem-specific procedures.
[Figure: the syscall layer and the free-vnode list above the NFS- and UFS-specific vnode caches.]
vnode Operations and Attributes

vnode attributes (vattr):
  type (VREG, VDIR, VLNK, etc.)
  mode (9+ bits of permissions)
  nlink (hard link count)
  owner user ID
  owner group ID
  filesystem ID
  unique file ID
  file size (bytes and blocks)
  access time
  modify time
  generation number

generic operations:
  vop_getattr (vattr)
  vop_setattr (vattr)
  vhold ()
  vholdrele ()

directories only:
  vop_lookup (OUT vpp, name)
  vop_create (OUT vpp, name, vattr)
  vop_remove (vp, name)
  vop_link (vp, name)
  vop_rename (vp, name, tdvp, tvp, name)
  vop_mkdir (OUT vpp, name, vattr)
  vop_rmdir (vp, name)
  vop_symlink (OUT vpp, name, vattr, contents)
  vop_readdir (uio, cookie)
  vop_readlink (uio)

files only:
  vop_getpages (page**, count, offset)
  vop_putpages (page**, count, sync, offset)
  vop_fsync ()
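The "dynamic method binding in C" mentioned above is essentially a per-filesystem table of function pointers; a minimal sketch (hypothetical, much-reduced types, not the real FreeBSD vop_vector):

struct vnode;
struct vattr;

/* Hypothetical vnode-operations vector: each file system fills in
 * its own implementations, and the VFS dispatches through it. */
struct vnodeops {
    int (*vop_getattr)(struct vnode *vp, struct vattr *va);
    int (*vop_lookup)(struct vnode *dvp, const char *name,
                      struct vnode **vpp);
};

struct vnode {
    const struct vnodeops *v_op;  /* bound when the vnode is created */
    void *v_data;                 /* FS-specific part: inode, rnode, … */
};

/* The generic call vectors to UFS, NFS, … without knowing which. */
static inline int
VOP_GETATTR(struct vnode *vp, struct vattr *va)
{
    return vp->v_op->vop_getattr(vp, va);
}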
Network File System (NFS)
[Figure: on the client, user programs enter the syscall layer and VFS, which routes to the local UFS or to the NFS client; the NFS client talks over the network to the NFS server on the server machine, which goes through its own VFS to UFS.]
vnode Cache
[Figure: vnodes are located via HASH(fsid, fileid); inactive vnodes hang off the VFS free-list head.]
• Active vnodes are reference-counted by the structures that hold pointers to them:
  – the system open-file table
  – a process's current directory
  – file-system mount points
  – etc.
• Each specific file system maintains its own hash of vnodes (BSD):
  – the specific FS handles initialization
  – the free list is maintained by VFS
• vget(vp): reclaim a cached inactive vnode from the VFS free list
• vref(vp): increment the reference count on an active vnode
• vrele(vp): release a reference count on a vnode
• vgone(vp): the vnode is no longer valid (the file is removed)
struct vnode {
    struct mtx      v_interlock;    /* lock for "i" things */
    u_long          v_iflag;        /* i vnode flags (see below) */
    int             v_usecount;     /* i ref count of users */
    long            v_numoutput;    /* i writes in progress */
    struct thread  *v_vxthread;     /* i thread owning VXLOCK */
    int             v_holdcnt;      /* i page & buffer references */
    struct buflists v_cleanblkhd;   /* i SORTED clean blocklist */
    struct buf     *v_cleanblkroot; /* i clean buf splay tree */
    int             v_cleanbufcnt;  /* i number of clean buffers */
    struct buflists v_dirtyblkhd;   /* i SORTED dirty blocklist */
    struct buf     *v_dirtyblkroot; /* i dirty buf splay tree */
    int             v_dirtybufcnt;
    …
Distributed FS
[Figure: ftp.cs.ucdavis.edu serves fs0 (/dev/hd0a), whose root holds usr, sys, dev, lib, bin, etc; Server.yahoo.com serves fs0 (/dev/hd0e), whose root holds local, adm, home, bin.]
logical disks
[Figure: fs0 (/dev/hd0a) holds the root tree (/, usr, sys, dev, lib, bin, etc); fs1 (/dev/hd0e) holds /, local, adm, home, and is grafted onto /usr.]

mount -t ufs /dev/hd0e /usr
mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom
Correctness
• One-copy Unix Semantics
  – every modification to every byte of a file has to be immediately and permanently visible to every client
Correctness
• One-copy Unix Semantics
  – every modification to every byte of a file has to be immediately and permanently visible to every client
  – conceptually, the FS serializes accesses
• Makes sense in a local file system
  – single processor versus shared memory
• Is this necessary?
DFS Architecture
• Server
  – storage for the distributed/shared files
  – provides an access interface for the clients
• Client
  – consumer of the files
  – runs applications in a distributed environment
[Figure: applications issue open, read, opendir, readdir, close, write, stat against the client, which forwards them to the server.]
NFS (SUN, 1985)
• Based on RPC (Remote Procedure Call) and XDR (eXternal Data Representation)
• The server maintains no state
  – a READ on the server opens, seeks, reads, and closes
  – a WRITE is similar, but the buffer is flushed to disk before closing
• Server crash: the client retries until the server reboots – no loss
• Client crash: the client must rebuild its own state – no effect on the server
RPC - XDR
• RPC: a standard protocol for calling procedures on another machine
• The procedure call is packaged with authorization and admin info
• XDR: a standard format for data, because computer manufacturers cannot agree on byte ordering
rpcgen
[Figure: an RPC program definition is fed to rpcgen, which generates RPC.h plus the client stub (RPC client.c) and the server stub (RPC server.c), sharing the declared data structures.]
NFS Operations
• Every operation is independent: the server opens the file for every operation
• A file is identified by a handle – no state information is retained by the server
• The client maintains the mount table, v-node, offset in the file table, etc.
What do these imply?
[Figure: application programs on the client computers enter the UNIX kernel through UNIX system calls; the virtual file system directs operations on local files to the UNIX file system (or another file system) and operations on remote files to the NFS client, which speaks the NFS protocol (remote operations) to the NFS server on the server computer; the server in turn uses its local UNIX file system.]

mount -t nfs home.yahoo.com:/pub/linux /mnt/linux
State-ful vs. State-less
• A state-ful server is fully aware of its clients
  – does the client have the newest copy?
  – what is the offset of an opened file?
  – "a session" between a client and a server!
• A state-less server is completely unaware of its clients
  – memory-less: "I do not remember you!"
  – "just tell me what you want to get (and where)"
  – "I am not responsible for your offset values" (the client maintains the state)
The State
[Figure: applications issue open, read, stat, lseek; in the state-less design, the offset state lives on the client side, next to the applications.]
Network File Sharing
• Server side:
  – rpcbind (portmap)
  – mountd – responds to mount requests (sometimes called rpc.mountd); relies on several files:
      /etc/dfs/dfstab, /etc/exports, /etc/netgroup
  – nfsd – serves files (actually a call to kernel-level code)
  – lockd – the file-locking daemon
  – statd – manages locks for lockd
  – rquotad – manages quotas for exported file systems
Network File Sharing
• Client side:
  – biod – the client-side caching daemon
  – mount must understand the hostname:directory convention
  – filesystem entries in /etc/[v]fstab tell the client what filesystems to mount
Unix file semantics
• NFS:
  – open a file in read-write mode
  – later, the server's copy becomes read-only
  – now, the application tries to write to it!!
Problems with NFS
• Performance is not scalable:
  – maybe it is OK for a local office
  – it will be horrible with large-scale systems
• Similar to UNIX file caching for local files:
  – pages (blocks) from disk are held in a main-memory buffer cache until the space is required for newer pages; read-ahead and delayed-write optimisations
  – for local files, writes are deferred to the next sync event (30-second intervals)
  – this works well in the local context, where files are always accessed through the local cache, but in the remote case it doesn't offer the necessary synchronization guarantees to clients
• NFS v3 servers offer two strategies for updating the disk:
  – write-through: altered pages are written to disk as soon as they are received at the server; when a write() RPC returns, the NFS client knows that the page is on the disk
  – delayed commit: pages are held only in the cache until a commit() call is received for the relevant file; this is the default mode used by NFS v3 clients; a commit() is issued by the client whenever a file is closed
• Server caching does nothing to reduce RPC traffic between client and server
  – further optimisation is essential to reduce server load in large networks
  – the NFS client module caches the results of read, write, getattr, lookup, and readdir operations
  – synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file
• Timestamp-based validity check
  – reduces inconsistency, but doesn't eliminate it
  – validity condition for a cache entry at the client:
      (T - Tc < t) ∨ (Tm_client = Tm_server)
    where t is the freshness guarantee, Tc the time when the cache entry was last validated, Tm the time when the block was last updated at the server, and T the current time
  – t is configurable (per file) but is typically set to 3 seconds for files and 30 seconds for directories
  – it remains difficult to write distributed applications that share files with NFS
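The validity condition translates directly into code; a sketch (hypothetical cache-entry struct):

#include <stdbool.h>
#include <time.h>

/* Hypothetical client-side cache entry for one cached block. */
struct cache_entry {
    time_t tc;         /* Tc: when this entry was last validated */
    time_t tm_client;  /* server's Tm as of the last validation */
    time_t t;          /* freshness interval: ~3 s files, ~30 s dirs */
};

/* (T - Tc < t) OR (Tm_client == Tm_server): trust the cached copy
 * if it was validated recently, or if the server's copy has not
 * changed since we last checked. */
bool
entry_is_valid(const struct cache_entry *e, time_t tm_server)
{
    time_t now = time(NULL);   /* T */
    return (now - e->tc < e->t) || (e->tm_client == tm_server);
}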
AFS
• State-ful clients and servers
• Caches files on the clients
  – file close ==> check in the changes
• How to maintain consistency?
  – using a "callback" in v2/3 (Valid or Cancelled)
[Figure: applications open and read through the local cache; the server's callback invalidates the cache and forces a re-cache.]
Why AFS?
• Shared files are infrequently updated
• Local cache of a few hundred megabytes
  – now 50~100 gigabytes
• Unix workload:
  – files are small; read operations dominate; sequential access is common; files are read/written by one user; reference bursts
  – are these still true?
Fault Tolerance in AFS
• A server crashes
• A client crashes
  – check for call-back tokens first
Problems with AFS
• Availability
• What happens if the call-back itself is lost??
GFS – Google File System
• "Failures" are the norm
• Multiple-GB files are common
• Append rather than overwrite
  – random writes are rare
• Can we relax the consistency?
The Master
• Maintains all file-system metadata:
  – name space, access-control info, file-to-chunk mappings, chunk (including replica) locations, etc.
• Periodically communicates with the chunkservers in HeartBeat messages to give instructions and check state
The Master
• Helps make sophisticated chunk-placement and replication decisions, using global knowledge
• For reading and writing, a client contacts the Master to get chunk locations, then deals directly with the chunkservers
  – the Master is not a bottleneck for reads/writes
Chunkservers
• Files are broken into chunks; each chunk has an immutable, globally unique 64-bit chunk handle
  – the handle is assigned by the master at chunk creation
• Chunk size is 64 MB
• Each chunk is replicated on 3 (default) servers
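A sketch of the metadata this implies per chunk (hypothetical types; the actual GFS structures are not public):

#include <stdint.h>

#define NREPLICAS  3
#define CHUNK_SIZE (64ULL * 1024 * 1024)   /* 64 MB */

/* Hypothetical master-side record for one chunk of one file. */
struct chunk_info {
    uint64_t handle;               /* immutable, globally unique */
    uint32_t version;              /* to detect stale replicas */
    uint32_t replica[NREPLICAS];   /* chunkserver IDs holding a copy */
};

/* A file is a sequence of chunks: byte offset -> chunk index. */
static inline uint64_t
chunk_index(uint64_t offset)
{
    return offset / CHUNK_SIZE;
}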
Clients
• Linked into applications via the file-system API
• Communicate with the master and chunkservers for reading and writing
  – master interactions only for metadata
  – chunkserver interactions for data
• Cache only metadata information
  – the data is too large to cache
Chunk Locations
• The master does not keep a persistent record of the locations of chunks and replicas
• It polls the chunkservers at startup, and when new chunkservers join/leave
• It stays up to date by controlling the placement of new chunks and through the HeartBeat messages (when monitoring chunkservers)
CODA
• Server replication:
  – if one server goes down, I can get another
• Disconnected operation:
  – if all go down, I will use my own cache
Disconnected Operation
• Continue critical work when the repository is inaccessible
• Key idea: caching data
  – performance
  – availability
• Server replication
Consistency
• If John updates file X on server A and Mary reads file X on server B…
• Read-one & Write-all
Read x & Write (N-x+1)
[Figure: any read quorum of x replicas intersects any write quorum of N-x+1 replicas.]
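A sketch of why a read always sees the latest write (hypothetical replica array; with N = 6 and x = 3, the write quorum is N - x + 1 = 4, and 3 + 4 > 6 forces an overlap):

#include <stdint.h>

#define RQ 3   /* read quorum: x */

/* Each replica stores a value tagged with a version number. */
struct replica {
    uint32_t version;
    int      value;
};

/* Read any x replicas and take the highest version: at least one
 * of them was in the last write quorum, so the max is current. */
int
quorum_read(const struct replica r[RQ])
{
    int best = 0;
    for (int i = 1; i < RQ; i++)
        if (r[i].version > r[best].version)
            best = i;
    return r[best].value;
}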
Example: R3W4 (6+1)
[Table in the original slide: per-replica version numbers after the sequence Initial, Alice-W, Bob-W, Alice-R, Chris-W, Dan-R, Emily-W, Frank-R; each write bumps the version on its write quorum of 4 replicas, and each read of 3 replicas always includes at least one replica holding the latest version.]