Linux Virtual File System

Peter J. Braam
P.J.Braam/CMU -- 1
Aims
• Present the data structures in Linux VFS
• Provide information about flow of control
• Describe methods and invariants needed to
implement a new file system
• Illustrate with some examples
P.J.Braam/CMU -- 2
History
• BSD implemented VFS for NFS; the aim was to dispatch to different filesystems
• VMS had an elaborate filesystem
• NT/Win95 have VFS-type interfaces
• Newer systems integrate VM with the buffer cache
[Figure: File access]
P.J.Braam/CMU -- 3
Linux Filesystems
• Media based
– ext2 - Linux native
– ufs - BSD
– fat - DOS FS
– vfat - Win 95
– hpfs - OS/2
– minix - well….
– isofs - CDROM
– sysv - Sysv Unix
– hfs - Macintosh
– affs - Amiga Fast FS
– NTFS - NT’s FS
– adfs - Acorn-strongarm
• Network
– nfs
– Coda
– AFS - Andrew FS
– smbfs - LanManager
– ncpfs - Novell
• Special ones
– procfs - /proc
– umsdos - Unix in DOS
– userfs - redirector to user
P.J.Braam/CMU -- 4
Linux Filesystems (ctd)
• Forthcoming:
– devfs - device file system
– DFS - DCE distributed FS
• Varia:
– cfs - crypt filesystem
– cfs - cache filesystem
– ftpfs - ftp filesystem
– mailfs - mail filesystem
– pgfs - Postgres versioning file system
• Linux serves (unrelated to the VFS!)
– NFS - user & kernel
– Coda
– AppleShare netatalk/CAP
– SMB - samba
– NCP - Novell
P.J.Braam/CMU -- 5
Usefulness
Linux is Obsolete
Andrew Tanenbaum
P.J.Braam/CMU -- 6
Linux VFS
• Multiple interfaces build up VFS:
– files
– dentries
– inodes
– superblock
– quota
• VFS can do all caching & provides utility functions to the FS
• FS provides methods to VFS; many are optional
[Figure: File access]
P.J.Braam/CMU -- 7
User level file access
• Typical user level types and code:
– pathnames: "/myfile"
– file descriptors: fd = open("/myfile"…)
– attributes in struct stat: stat("/myfile", &mybuf), chmod, chown...
– offsets: write, read, lseek
– directory handles: DIR *dh = opendir("/mydir")
– directory entries: struct dirent *ent = readdir(dh)
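The user-level types listed above can be exercised with a few POSIX calls. The following is a minimal sketch, assuming a POSIX system; the helper names file_size, count_entries and write_then_read_first are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Attributes: return the size reported by stat(), or -1 on error. */
long file_size(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Directory handles and entries: count what readdir() reports
 * (this includes any "." and ".." entries), or return -1 on error. */
int count_entries(const char *dirpath)
{
    DIR *dh = opendir(dirpath);
    int n = 0;

    if (dh == NULL)
        return -1;
    while (readdir(dh) != NULL)
        n++;
    closedir(dh);
    return n;
}

/* File descriptors and offsets: write data, seek back to offset 0,
 * and return the first byte read, or -1 on error. */
int write_then_read_first(const char *path, const char *data, size_t len)
{
    char c;
    int fd;

    assert(path != NULL && data != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||
        lseek(fd, 0, SEEK_SET) != 0 ||
        read(fd, &c, 1) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (unsigned char)c;
}
```

Each of these calls enters the kernel as a system call, which the VFS then routes to the file system that owns the path.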
P.J.Braam/CMU -- 8
VFS
• Manages kernel level file abstractions in one
format for all file systems
• Receives system call requests from user level (e.g.
write, open, stat, link)
• Interacts with a specific file system based on
mount point traversal
• Receives requests from other parts of the kernel,
mostly from memory management
P.J.Braam/CMU -- 9
File system level
• Individual File Systems
– responsible for managing file & directory data
– responsible for managing meta-data: timestamps,
owners, protection etc
– translates data between
• particular FS data: e.g. disk data, NFS data,
Coda/AFS data
• VFS data: attributes etc in standard format
– e.g. nfs_getattr(….) returns attributes in VFS format,
acquires attributes in NFS format to do so.
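The translation step can be sketched in user space. This is a toy model, not kernel code: struct vfs_attr, struct toy_inode, struct netfs_fattr and all function names are hypothetical, and the "network" attribute layout is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VFS-format attributes, identical for every file system. */
struct vfs_attr {
    unsigned long long size;
    unsigned int mode;   /* permission bits only */
    unsigned int uid;
};

struct toy_inode;

/* Method table a file system hands to the VFS. */
struct toy_inode_ops {
    int (*getattr)(struct toy_inode *inode, struct vfs_attr *out);
};

struct toy_inode {
    const struct toy_inode_ops *i_op;
    void *fs_private;    /* FS-specific data */
};

/* Invented FS-specific format: size split in two 32-bit halves,
 * file type and permissions packed into one word. */
struct netfs_fattr {
    unsigned int size_hi, size_lo;
    unsigned int type_and_mode;
    unsigned int owner;
};

/* FS side: translate the FS-specific format into VFS format. */
int netfs_getattr(struct toy_inode *inode, struct vfs_attr *out)
{
    const struct netfs_fattr *fa = inode->fs_private;

    out->size = ((unsigned long long)fa->size_hi << 32) | fa->size_lo;
    out->mode = fa->type_and_mode & 07777u;
    out->uid = fa->owner;
    return 0;
}

const struct toy_inode_ops netfs_iops = { netfs_getattr };

/* VFS side: dispatch through the method table; the method is optional. */
int vfs_getattr(struct toy_inode *inode, struct vfs_attr *out)
{
    assert(inode != NULL && out != NULL);
    if (inode->i_op == NULL || inode->i_op->getattr == NULL)
        return -1;       /* toy error code */
    return inode->i_op->getattr(inode, out);
}
```

The caller of vfs_getattr never sees struct netfs_fattr; only the file system's own method knows that layout.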
P.J.Braam/CMU -- 10
Anatomy of stat system call
sys_stat(path, buf) {
    /* Establish VFS data */
    dentry = namei(path);
    if ( dentry == NULL ) return -ENOENT;
    inode = dentry->d_inode;
    /* Call into inode layer of filesystem */
    rc = inode->i_op->i_permission(inode);
    if ( rc ) {
        dput(dentry);   /* drop the reference on the error path too */
        return -EPERM;
    }
    /* Call into inode layer of filesystem */
    rc = inode->i_op->i_getattr(inode, buf);
    dput(dentry);
    return rc;
}
P.J.Braam/CMU -- 11
Anatomy of fstatfs system call
sys_fstatfs(fd, buf) {
    /* for things like "df" */
    /* Translate fd to VFS data structure */
    file = fget(fd);
    if ( file == NULL ) return -EBADF;
    superb = file->f_dentry->d_inode->i_super;
    /* Call into superblock layer of filesystem */
    rc = superb->sb_op->sb_statfs(superb, buf);
    fput(file);   /* release the reference taken by fget */
    return rc;
}
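From user level, the same kind of superblock information is available through the POSIX call fstatvfs. The sketch below assumes a POSIX system; how the C library maps fstatvfs onto the kernel's statfs path is not shown on the slides, and the helper name fs_block_info is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Report the block size and total block count of the file system
 * that holds path. Returns 0 on success, -1 on error. */
int fs_block_info(const char *path, unsigned long *block_size,
                  unsigned long long *total_blocks)
{
    struct statvfs sv;
    int fd, rc;

    assert(block_size != NULL && total_blocks != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fstatvfs(fd, &sv);   /* fd identifies the file, hence its file system */
    close(fd);
    if (rc != 0)
        return -1;
    *block_size = sv.f_bsize;
    *total_blocks = sv.f_blocks;
    return 0;
}
```

A tool like "df" makes this kind of query once per mounted file system.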
P.J.Braam/CMU -- 12
Data structures
• VFS data structures for:
– VFS handle to the file: inode (BSD: vnode)
– User instantiated file handle: file (BSD: file)
– The whole filesystem: superblock (BSD: vfs)
– A name to inode translation: dentry
P.J.Braam/CMU -- 13
Shorthand method notation
• super block methods: sss_methodname
• inode methods: iii_methodname
• dentry methods: ddd_methodname
• file methods: fff_methodname
• instead of inode->i_op->lookup we write iii_lookup
P.J.Braam/CMU -- 14
namei
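/* VFS functions (namei, iget) call FS methods (ddd_, iii_, sss_) */
struct dentry *namei(parent, name) {
    if (dentry = d_lookup(parent, name)) {
        /* cache hit: FS methods involved */
        ddd_hash(parent, name)
        ddd_revalidate(dentry)
    } else
        /* cache miss: the FS looks the name up */
        iii_lookup(parent, name)
}
struct inode *iget(ino, dev) {
    /* try cache else .. */
    sss_read_inode(…)
}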
P.J.Braam/CMU -- 15
Superblocks
• Handle metadata only (attributes etc)
• Responsible for retrieving and storing metadata from the FS media or peers
• Struct superblock holds things like:
– device, blocksize, dirty flags, list of dirty inodes
– super operations
– wait queue
– pointer to the root inode of this FS
P.J.Braam/CMU -- 16
Super Operations (sss_)
• Ops on Inodes:
– read_inode
– put_inode
– write_inode
– delete_inode
– clear_inode
– notify_change
• Superblock manips:
– read_super (mount)
– put_super (unmount)
– write_super (unmount)
– statfs (attributes)
P.J.Braam/CMU -- 17
Inodes
• Inodes are the VFS abstraction for the file
• Inode has operations (iii_methods)
• VFS maintains an inode cache, NOT the
individual FS’s (compare NT, BSD etc)
• Inodes contain an FS specific area where:
– ext2 stores disk block numbers etc
– AFS would store the FID
• Extraordinary inode ops are good for
dealing with stale NFS file handles etc.
P.J.Braam/CMU -- 18
What’s inside an inode - 1
• caching: list_head i_hash, list_head i_list, list_head i_dentry, int i_count
• identifies file: long i_ino, int i_dev
• usual stuff: {m,a,c}time, {u,g}id, mode, size, n_link
P.J.Braam/CMU -- 19
What’s inside an inode -2
• which FS: superblock i_sb, inode_ops i_op
• waiting: wait objects, semaphore, lock
• for mmap, networking: vm_area_struct, pipe/socket info, page information
• FS specific info (blockno’s, fids etc):
union {
    ext2fs_inode_info i_ext2
    nfs_inode_info i_nfs
    coda_inode_info i_coda
    ..} u
P.J.Braam/CMU -- 20
Inode state
• Inode can be on one or two lists:
– (hash & in_use) or (hash & dirty) or unused
– inode has a use count i_count
• Transitions
– unused -> hash: iget calls sss_read_inode
– dirty -> in_use: sss_write_inode
– hash -> unused: calls sss_clear_inode, but if i_nlink = 0, iput calls sss_delete_inode when i_count falls to 0
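The transitions above can be sketched as a small user-space model. This is an assumption-laden toy, not the kernel's inode cache: the fixed slot array, the counters standing in for the sss_ methods, and the names toy_iget and toy_iput are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NINODES 4

struct toy_inode {
    unsigned long i_ino;
    int i_count;   /* use count */
    int i_nlink;   /* 0 means the file has been unlinked */
    int hashed;    /* 1 while the inode can be found by number */
};

struct toy_sb {
    struct toy_inode slots[TOY_NINODES];
    /* counters standing in for sss_read_inode, sss_clear_inode,
     * and sss_delete_inode */
    int read_calls, clear_calls, delete_calls;
};

struct toy_inode *toy_iget(struct toy_sb *sb, unsigned long ino)
{
    struct toy_inode *free_slot = NULL, *reclaim = NULL;

    for (int i = 0; i < TOY_NINODES; i++) {
        struct toy_inode *in = &sb->slots[i];
        if (in->hashed && in->i_ino == ino) {
            in->i_count++;               /* cache hit */
            return in;
        }
        if (!in->hashed && free_slot == NULL)
            free_slot = in;
        if (in->hashed && in->i_count == 0 && reclaim == NULL)
            reclaim = in;                /* cached but not in use */
    }
    if (free_slot == NULL && reclaim != NULL) {
        sb->clear_calls++;               /* hash -> unused */
        free_slot = reclaim;
    }
    if (free_slot == NULL)
        return NULL;                     /* every inode is in use */
    free_slot->i_ino = ino;              /* unused -> hash */
    free_slot->i_count = 1;
    free_slot->i_nlink = 1;
    free_slot->hashed = 1;
    sb->read_calls++;
    return free_slot;
}

void toy_iput(struct toy_sb *sb, struct toy_inode *in)
{
    assert(in->i_count > 0);
    if (--in->i_count > 0)
        return;
    if (in->i_nlink == 0) {              /* unlinked and last user gone */
        sb->delete_calls++;
        in->hashed = 0;
    }
    /* otherwise the inode stays hashed so a later toy_iget hits the cache */
}
```

In this model a second toy_iget for the same number does not call the read method again, which is the point of keeping the cache in the VFS rather than in each file system.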
P.J.Braam/CMU -- 21
Inode Cache
Players:
1. iget: if i_count>0 ++
2. iput: if i_count>1 --
3. free_inodes
4. syncing inodes
[Figure: inode cache. Lists: inode_hashtable, unused inodes, used inodes, dirty inodes (media fs only, via mark_inode_dirty). Labelled transitions to and from FS storage: sss_read_inode (iget); sss_write_inode (sync one); sss_clear_inode (freeing inos) or sss_delete_inode (iput).]
P.J.Braam/CMU -- 22
Sales
Red Hat Software sold 240,000 copies of Red Hat
Linux in 1997 and expects to reach 400,000 in
1998.
Estimates of installed servers (InfoWorld):
- Linux: 7 million
- OS/2: 5 million
- Macintosh: 1 million
P.J.Braam/CMU -- 23
Inode operations (iii_)
• lookup: return inode
– calls iget
• creation/removal
– create
– link
– unlink
– symlink
– mkdir
– rmdir
– mknod
– rename
• symbolic links
– readlink
– follow link
• pages
– readpage, writepage, updatepage - read or write page. Generic for mediafs.
– bmap - return disk block number of logical block
• special operations
– revalidate - see dentry sect
– truncate
– permission
P.J.Braam/CMU -- 24
Dentry world
• Dentry is a name to inode translation structure
• Cached aggressively by VFS
• Eliminates lookups by FS & private caches
– timing on Coda FS: ls -lR 1000 files after priming cache
• linux 2.0.32: 7.2secs
• linux 2.1.92: 0.6secs
– disk fs: less benefit, NFS even more
• Negative entries!
• Namei is dramatically simplified
P.J.Braam/CMU -- 25
Inside dentries
• name
• pointer to inode
• pointer to parent dentry
• list head of children
• chains for lots of lists
• use count
P.J.Braam/CMU -- 26
Dentry associated lists
[Figure legend]
• dentry inode relationship
– list head: inode i_dentry
– pointer: d_inode
– chains: d_alias
– place: d_instantiate
– remove: dentry_iput
• dentry tree relationship
– list head: parent dentry d_subdirs
– pointer: d_parent
– chains: d_child
– place: d_alloc
– remove: d_prune, d_invalidate, d_put
P.J.Braam/CMU -- 27
Dcache
[Figure: dcache. dentry_hashtable (d_hash chains) with list head dhash(parent, name); entries added by namei through iii_lookup and d_add; entries removed by prune, d_invalidate, d_drop; unused dentries kept on d_lru chains.]
• namei tries cache: d_lookup
– ddd_compare
• Success: ddd_revalidate
– d_invalidate if fails
– proceed if success
• Failure: iii_lookup
– find inode
– iget
  • sss_read_inode
– finish:
  • d_add
– can give negative entry in dcache
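The hit and miss paths, including negative entries, can be sketched as a toy name cache. Everything here is hypothetical: a single flat directory, a fixed-size table instead of hash chains, no revalidation step, and a stand-in toy_fs_lookup that knows only two names.

```c
#include <assert.h>
#include <string.h>

#define TOY_DCACHE_SIZE 8
#define TOY_NAME_MAX 32

struct toy_dentry {
    char name[TOY_NAME_MAX];
    long ino;      /* 0 marks a negative entry: the name is known not to exist */
    int in_use;
};

struct toy_dcache {
    struct toy_dentry entries[TOY_DCACHE_SIZE];
    int fs_lookups;   /* how often the stand-in FS lookup ran */
};

/* Stand-in for the FS lookup method: only two names exist. */
static long toy_fs_lookup(const char *name)
{
    if (strcmp(name, "passwd") == 0)
        return 11;
    if (strcmp(name, "hosts") == 0)
        return 12;
    return 0;
}

/* Returns the inode number for name, or 0 if it does not exist. */
long toy_namei(struct toy_dcache *dc, const char *name)
{
    struct toy_dentry *free_slot = NULL;
    long ino;

    assert(dc != NULL && name != NULL);
    for (int i = 0; i < TOY_DCACHE_SIZE; i++) {
        struct toy_dentry *d = &dc->entries[i];
        if (d->in_use && strcmp(d->name, name) == 0)
            return d->ino;            /* hit, positive or negative */
        if (!d->in_use && free_slot == NULL)
            free_slot = d;
    }
    ino = toy_fs_lookup(name);        /* miss: ask the FS */
    dc->fs_lookups++;
    if (free_slot != NULL && strlen(name) < TOY_NAME_MAX) {
        strcpy(free_slot->name, name);   /* cache the result, even a miss */
        free_slot->ino = ino;
        free_slot->in_use = 1;
    }
    return ino;
}
```

Repeated lookups of a missing name cost one FS call in this model, which is what a negative entry buys.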
P.J.Braam/CMU -- 28
Dentry methods
• ddd_revalidate: can force new lookup
• ddd_hash: compute hash value of name
• ddd_compare: are names equal?
• ddd_delete, ddd_put, ddd_iput: FS cleanup opportunity
P.J.Braam/CMU -- 29
Dentry particulars:
• ddd_hash and ddd_compare have to deal
with extraordinary cases for msdos/vfat:
– case insensitive
– long and short filename pleasantries
• ddd_revalidate -- can force new lookup if
inode not in use:
– used for NFS/SMBfs aging
– used for Coda/AFS callbacks
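For the case-insensitive situation, the hash and compare methods have to agree: names that compare equal must hash to the same value, or a cached entry would never be found. A minimal sketch, assuming ASCII-only case folding; the function names, signatures and the multiplier 31 are hypothetical and not taken from the msdos/vfat code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hash a name after folding it to lower case. */
unsigned long toy_ci_hash(const char *name, size_t len)
{
    unsigned long h = 0;

    assert(name != NULL);
    for (size_t i = 0; i < len; i++)
        h = h * 31 + (unsigned long)tolower((unsigned char)name[i]);
    return h;
}

/* Return 0 when the names are equal ignoring ASCII case, nonzero otherwise. */
int toy_ci_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return 1;
    for (size_t i = 0; i < alen; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 1;
    return 0;
}
```

Both functions fold case the same way, so "README.TXT" and "readme.txt" land on the same hash chain and then compare equal.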
P.J.Braam/CMU -- 30
Style
Dijkstra probably hates me
Linus Torvalds
P.J.Braam/CMU -- 31
Memory mapping
• vm_area structure has
– vm_operations
– inode, addresses etc.
• vm_operations
– map, unmap
– swapin, swapout
– nopage -- read when page isn’t in VM
• mmap
– calls on iii_readpage
– keeps a use count on the inode until unmap
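Seen from user level, the use count held by a mapping shows up as the mapping staying valid after the file descriptor is closed. The sketch below assumes a POSIX system; the helper name sum_bytes_mmap is hypothetical, and the remark about page faults describes the expected mechanism rather than something the code can observe.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum the bytes of a file through a read-only shared mapping.
 * Returns -1 on error or for an empty file. */
long sum_bytes_mmap(const char *path)
{
    struct stat st;
    unsigned char *p;
    long sum = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return -1;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];            /* first touch of a page may fault it in */
    munmap(p, (size_t)st.st_size);
    return sum;
}
```

No read call appears in the loop; the data arrives through the page fault path when a page is first touched.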
P.J.Braam/CMU -- 32