ecs150 Fall 2007: Operating System
#5: File Systems (chapters: 6.4~6.7, 8)
Dr. S. Felix Wu
Computer Science Department
University of California, Davis
http://www.cs.ucdavis.edu/~wu/
[email protected]

File System Abstraction
– Files
– Directories

System-call interface
Active file entries
VNODE layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware

dirp = opendir(const char *filename);
struct dirent *direntp = readdir(dirp);

struct dirent {
        ino_t d_ino;
        char  d_name[NAME_MAX+1];
};

(diagram: a directory holds dirent entries; each dirent pairs an inode number with a file name, and each inode leads to a file)

Local versus Remote
– System call interface
– V-node
– Local versus remote: i-node or NFS – Stackable File System
– Hard-disk blocks

File-System Structure
File structure
– Logical storage unit
– Collection of related information
File system resides on secondary storage (disks).
File system organized into layers.
File control block – storage structure consisting of information about a file.

File → Disk
– separate the disk into blocks
– separate the file into blocks as well
– paging from file to disk
– blocks: 4 - 7 - 2 - 10 - 12
How to represent the file??
How to link these 5 pages together??

Bit torrent pieces
1 big file (X Gigabytes) with a number of pieces (5%) already in (and sharing with others).
How much disk space do we need at this moment?

Hard Disk
– Track, Sector, Head
– Track + Heads → Cylinder
Performance
– seek time
– rotation time
– transfer time
LBA – Linear Block Addressing

File → Disk blocks
file block 0 → 4, file block 1 → 7, file block 2 → 2, file block 3 → 10, file block 4 → 12
What are the disadvantages?
1. disk access can be slow for "random access".
2. How big is each block? 64 bytes? 68 bytes?

Kernel Hacking Session
This Friday from 7:30 p.m. until midnight, 3083 Kemper
– Bring your laptop
– And bring your mug…

A File System
(diagram: each partition holds the boot and superblock areas, an i-list of i-nodes, and the directory and data blocks)

One Logical File → Physical Disk Blocks
efficient representation & access

A file
An i-node
??? entries in one disk block
Typical: each block 8K or 16K bytes
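To make the "??? entries in one disk block" question concrete, here is a small back-of-the-envelope calculation. The block and pointer sizes are assumptions for illustration (4-byte block addresses as in UFS1, 8-byte as in UFS2); the lecture's later examples use 1K blocks with 256 entries.

#include <stdio.h>

/* An index (indirect) block is just an array of block addresses, so the
 * number of entries is simply block size / pointer size. */
int main(void)
{
        unsigned long sizes[] = { 1024, 8192, 16384 };   /* 1K, 8K, 16K blocks */

        for (int i = 0; i < 3; i++)
                printf("%5lu-byte block: %5lu entries (4-byte ptrs), %5lu entries (8-byte ptrs)\n",
                       sizes[i], sizes[i] / 4, sizes[i] / 8);
        return 0;
}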
inode (index node) structure
meta-data of the file:
– di_mode    (2 bytes)
– di_nlinks  (2 bytes)
– di_uid     (2 bytes)
– di_gid     (2 bytes)
– di_size    (4 bytes)
– di_addr    (39 bytes)
– di_gen     (1 byte)
– di_atime   (4 bytes)
– di_mtime   (4 bytes)
– di_ctime   (4 bytes)

System-call interface
Active file entries
VNODE layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware

A File System
(diagram: each partition holds the boot and superblock areas, an i-list of i-nodes, and the directory and data blocks)

struct ufs2_dinode {
        u_int16_t       di_mode;        /*   0: IFMT, permissions; see below. */
        int16_t         di_nlink;       /*   2: File link count. */
        u_int32_t       di_uid;         /*   4: File owner. */
        u_int32_t       di_gid;         /*   8: File group. */
        u_int32_t       di_blksize;     /*  12: Inode blocksize. */
        u_int64_t       di_size;        /*  16: File byte count. */
        u_int64_t       di_blocks;      /*  24: Bytes actually held. */
        ufs_time_t      di_atime;       /*  32: Last access time. */
        ufs_time_t      di_mtime;       /*  40: Last modified time. */
        ufs_time_t      di_ctime;       /*  48: Last inode change time. */
        ufs_time_t      di_birthtime;   /*  56: Inode creation time. */
        int32_t         di_mtimensec;   /*  64: Last modified time. */
        int32_t         di_atimensec;   /*  68: Last access time. */
        int32_t         di_ctimensec;   /*  72: Last inode change time. */
        int32_t         di_birthnsec;   /*  76: Inode creation time. */
        int32_t         di_gen;         /*  80: Generation number. */
        u_int32_t       di_kernflags;   /*  84: Kernel flags. */
        u_int32_t       di_flags;       /*  88: Status flags (chflags). */
        int32_t         di_extsize;     /*  92: External attributes block. */
        ufs2_daddr_t    di_extb[NXADDR];/*  96: External attributes block. */
        ufs2_daddr_t    di_db[NDADDR];  /* 112: Direct disk blocks. */
        ufs2_daddr_t    di_ib[NIADDR];  /* 208: Indirect disk blocks. */
        int64_t         di_spare[3];    /* 232: Reserved; currently unused */
};

struct ufs1_dinode {
        u_int16_t       di_mode;        /*   0: IFMT, permissions; see below. */
        int16_t         di_nlink;       /*   2: File link count. */
        union {
                u_int16_t oldids[2];    /*   4: Ffs: old user and group ids. */
        } di_u;
        u_int64_t       di_size;        /*   8: File byte count. */
        int32_t         di_atime;       /*  16: Last access time. */
        int32_t         di_atimensec;   /*  20: Last access time. */
        int32_t         di_mtime;       /*  24: Last modified time. */
        int32_t         di_mtimensec;   /*  28: Last modified time. */
        int32_t         di_ctime;       /*  32: Last inode change time. */
        int32_t         di_ctimensec;   /*  36: Last inode change time. */
        ufs1_daddr_t    di_db[NDADDR];  /*  40: Direct disk blocks. */
        ufs1_daddr_t    di_ib[NIADDR];  /*  88: Indirect disk blocks. */
        u_int32_t       di_flags;       /* 100: Status flags (chflags). */
        int32_t         di_blocks;      /* 104: Blocks actually held. */
        int32_t         di_gen;         /* 108: Generation number. */
        u_int32_t       di_uid;         /* 112: File owner. */
        u_int32_t       di_gid;         /* 116: File group. */
        int32_t         di_spare[2];    /* 120: Reserved; currently unused */
};

Bittorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?
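The di_db[NDADDR] and di_ib[NIADDR] arrays above are what the indirection questions on the following slides are about. Here is a minimal sketch (not the FreeBSD code) of how a byte offset maps to a direct pointer or to the single/double/triple indirect pointers, assuming 12 direct pointers, 1K blocks, and 4-byte block addresses (256 entries per indirect block).

#include <stdio.h>

#define NDIRECT 12
#define BSIZE   1024UL
#define NINDIR  (BSIZE / 4)            /* pointers per indirect block */

static void locate(unsigned long long off)
{
        unsigned long long blk = off / BSIZE;   /* logical block number */

        if (blk < NDIRECT)
                printf("offset %llu -> di_db[%llu]\n", off, blk);
        else if ((blk -= NDIRECT) < NINDIR)
                printf("offset %llu -> di_ib[0], entry %llu\n", off, blk);
        else if ((blk -= NINDIR) < NINDIR * NINDIR)
                printf("offset %llu -> di_ib[1], entries [%llu][%llu]\n",
                       off, blk / NINDIR, blk % NINDIR);
        else
                printf("offset %llu -> di_ib[2] (triple indirect)\n", off);
}

int main(void)
{
        locate(0);               /* first block: direct */
        locate(12 * BSIZE);      /* first block reached through the single indirect */
        locate(300 * BSIZE);     /* lands in the double-indirect range */
        return 0;
}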
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* sleep() */

int main(void)
{
        FILE *f1 = fopen("./sss.txt", "w");
        int i;

        for (i = 0; i < 1000; i++) {
                fseek(f1, rand(), SEEK_SET);
                fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
                if (i % 100 == 0)
                        sleep(1);
        }
        fflush(f1);
}

# ./t
# ls -l ./sss.txt

A file
An i-node
??? entries in one disk block
Typical: each block 1K

i-node
How many disk blocks can a FS have?
How many levels of i-node indirection will be necessary to store a file of 2G bytes? (i.e., 0, 1, 2 or 3)
What is the largest possible file size in i-node?
What is the size of the i-node itself for a file of 10 GB with only 512 MB downloaded?

Answer
How many disk blocks can a FS have?
– 2^64 or 2^32: pointer (to blocks) size is 8/4 bytes.
How many levels of i-node indirection will be necessary to store a file of 2G (2^31) bytes? (i.e., 0, 1, 2 or 3)
– 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10 >? 2^31
What is the largest possible file size in i-node?
– 12*2^10 + 2^8*2^10 + 2^8*2^8*2^10 + 2^8*2^8*2^8*2^10
– 2^64 – 1
– 2^32 * 2^10
You need to consider three issues and find the minimum!

Answer: Lower Bound
How many pointers?
– 512 MB divided by the block size (1K)
– 512K pointers times 8 (4) bytes = 4 (2) MB

Bittorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?

Answer: Upper Bound
In the worst case, EVERY indirection block has at least one entry!
How many indirection blocks?
– Single: 1 block
– Double: 1 + 2^8
– Triple: 1 + 2^8 + 2^16
Total ~ 2^16 blocks times 1K = 64 MB
– 2^14 times 1K = 16 MB (ufs2 inode)

Answer
– ufs1: 2 MB ~ 64 MB
– ufs2: 4 MB ~ 16 MB
Answer: sss.txt ~ 17 MB
– ~16 MB (inode indirection blocks)
– 1000 writes times 1K ~ 1 MB

A file
An i-node
??? entries in one disk block
Typical: each block 1K

A File System
(diagram: each partition holds the boot and superblock areas, an i-list of i-nodes, and the directory and data blocks)

FFS and UFS
/usr/src/sys/ufs/ffs/*
– Higher-level: directory structure
– Soft updates & Snapshot
/usr/src/sys/ufs/ufs/*
– Lower-level: buffer, i-node

# of i-nodes
UFS1: pre-allocation
– 3% of HD, about < 25% used.
UFS2: dynamic allocation
– Still limited # of i-nodes

di_size vs. di_blocks ???

One Logical File → Physical Disk Blocks
efficient representation & access

di_size vs. di_blocks
– Logical: di_size (fstat)
– Physical: di_blocks (du)

Extended Attributes in UFS2
Attributes associated with the File
– di_extb[2];
– two blocks, but indirection if needed.
Format – Length – Name Space – Content Pad Length – Name Length – Name – Content 4 1 1 1 mod 8 variable Applications: ACL, Data Labelling 11/13/2007 ecs150, Fall 2007 44 UCDavis, ecs150 Fall 2007 Some thoughts…. What can you do with “extended attributes”? How to design/implement? – Should/can we do it “Stackable File Systems”? – Otherwise, the program to manipulate the EA’s will have to be very UFS2-dependent or FiST with an UFS2 optimization option. Are there any counter examples? – security and performance considerations. 11/13/2007 ecs150, Fall 2007 45 UCDavis, ecs150 Fall 2007 A File System partition b s i-node 11/13/2007 partition i-list i-node d ……. partition directory and data blocks i-node ecs150, Fall 2007 46 UCDavis, ecs150 Fall 2007 struct dirent { ino_t d_ino; char d_name[NAME_MAX+1]; }; struct stat {… short nlinks; …}; directory dirent dirent dirent inode file_name inode file_name inode file_name file file file 11/13/2007 ecs150, Fall 2007 47 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 48 UCDavis, ecs150 Fall 2007 2 . .. root wheel drwxr-xr-x Apr 1 2004 usr vmunix 3 4 5 6 7 root wheel drwxr-xr-x Apr 1 2004 root wheel rwxr-xr-x Apr 15 2004 kirk staff rw-rw-r-Jan 19 2004 root wheel drwxr-xr-x Apr 1 2004 . .. bin foo text 9 11/13/2007 directory / 4 2 7 6 directory /usr data Hello World! . .. ex groff vi 8 bin bin rwxr-xr-x Apr 15 2004 2 2 4 5 text ecs150, Fall 2007 7 4 9 10 9 data file /vmunix file /usr/foo directory /usr/bin file /usr/bin/vi 49 UCDavis, ecs150 Fall 2007 What is the difference? ln –s /usr/src/sys/sys/proc.h ppp.h ln /usr/src/sys/sys/proc.h ppp.h 11/13/2007 ecs150, Fall 2007 50 UCDavis, ecs150 Fall 2007 Hard versus Symbolic ln –s /usr/src/sys/sys/proc.h ppp.h – Link to anything, any mounted partitions – Delete a Symbolic link? ln /usr/src/sys/sys/proc.h ppp.h – Link only to “file” (not directory) – Link only within the same partition -- why? – Delete a Hard Link? 11/13/2007 ecs150, Fall 2007 51 UCDavis, ecs150 ufs2_dinode { 125 struct Fall 2007 126 u_int16_t di_mode; /* 0: IFMT, permissions; see below. */ 127 int16_t di_nlink; /* 2: File link count. */ 128 u_int32_t di_uid; /* 4: File owner. */ 129 u_int32_t di_gid; /* 8: File group. */ 130 u_int32_t di_blksize; /* 12: Inode blocksize. */ 131 u_int64_t di_size; /* 16: File byte count. */ 132 u_int64_t di_blocks; /* 24: Bytes actually held. */ 133 ufs_time_t di_atime; /* 32: Last access time. */ 134 ufs_time_t di_mtime; /* 40: Last modified time. */ 135 ufs_time_t di_ctime; /* 48: Last inode change time. */ 136 ufs_time_t di_birthtime; /* 56: Inode creation time. */ 137 int32_t di_mtimensec; /* 64: Last modified time. */ 138 int32_t di_atimensec; /* 68: Last access time. */ 139 int32_t di_ctimensec; /* 72: Last inode change time. */ 140 int32_t di_birthnsec; /* 76: Inode creation time. */ 141 int32_t di_gen; /* 80: Generation number. */ 142 u_int32_t di_kernflags; /* 84: Kernel flags. */ 143 u_int32_t di_flags; /* 88: Status flags (chflags). */ 144 int32_t di_extsize; /* 92: External attributes block. */ 145 ufs2_daddr_t di_extb[NXADDR];/* 96: External attributes block. */ 146 ufs2_daddr_t di_db[NDADDR]; /* 112: Direct disk blocks. */ 147 ufs2_daddr_t di_ib[NIADDR]; /* 208: Indirect disk blocks. 
*/ 148 int64_t di_spare[3]; /* 232: Reserved; currently unused */ 149 }; 11/13/2007 ecs150, Fall 2007 52 UCDavis, ecs150 Fall 2007 struct dirent { ino_t d_ino; char d_name[NAME_MAX+1]; }; struct stat {… short nlinks; …}; directory dirent dirent dirent inode file_name inode file_name inode file_name file file file 11/13/2007 ecs150, Fall 2007 53 UCDavis, ecs150 Fall 2007 File System Buffer Cache application: OS: read/write files translate file to disk blocks maintains ...buffer cache ... controls disk accesses: read/write blocks hardware: Any problems? 11/13/2007 ecs150, Fall 2007 54 UCDavis, ecs150 Fall 2007 File System Consistency To maintain file system consistency the ordering of updates from buffer cache to disk is critical Example: – if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent 11/13/2007 ecs150, Fall 2007 55 UCDavis, ecs150 Fall 2007 File System Consistency File system almost always use a buffer/disk cache for performance reasons This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks Two copies of a disk block (buffer cache, disk) consistency problem if the system crashes before all the modified blocks are written back to disk Write back critical blocks from the buffer cache to disk immediately Data blocks are also written back periodically: sync 11/13/2007 ecs150, Fall 2007 56 UCDavis, ecs150 Fall 2007 Two Strategies Prevention – Use un-buffered I/O when writing i-nodes or pointer blocks – Use buffered I/O for other writes and force sync every 30 seconds Detect and Fix – Detect the inconsistency – Fix them according to the “rules” – Fsck (File System Checker) 11/13/2007 ecs150, Fall 2007 57 UCDavis, ecs150 Fall 2007 File System Integrity Block consistency: – Block-in-use table – Free-list table 0 1 1 1 0 0 0 1 0 0 0 2 1 0 0 0 1 1 1 0 1 0 2 0 File consistency: – how many directories pointing to that i-node? – nlink? – three cases: D == L, L > D, D > L 11/13/2007 What to do with the latter two cases? ecs150, Fall 2007 58 UCDavis, ecs150 Fall 2007 File System Integrity File system states (a) consistent (b) missing block (c) duplicate block in free list (d) duplicate data block 11/13/2007 ecs150, Fall 2007 59 UCDavis, ecs150 Fall 2007 Metadata Operations Metadata operations modify the structure of the file system – Creating, deleting, or renaming files, directories, or special files – Directory & I-node Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash 11/13/2007 ecs150, Fall 2007 60 UCDavis, ecs150 Fall 2007 Metadata Integrity FFS uses synchronous writes to guarantee the integrity of metadata – Any operation modifying multiple pieces of metadata will write its data to disk in a specific order – These writes will be blocking Guarantees integrity and durability of metadata updates 11/13/2007 ecs150, Fall 2007 61 UCDavis, ecs150 Fall 2007 Deleting a file (I) i-node-1 abc def i-node-2 ghi i-node-3 Assume we want to delete file “def” 11/13/2007 ecs150, Fall 2007 62 UCDavis, ecs150 Fall 2007 Deleting a file (II) i-node-1 abc ? def ghi i-node-3 Cannot delete i-node before directory entry “def” 11/13/2007 ecs150, Fall 2007 63 UCDavis, ecs150 Fall 2007 Deleting a file (III) Correct sequence is 1. 2. 
Write to disk directory block containing deleted directory entry “def” Write to disk i-node block containing deleted inode Leaves the file system in a consistent state 11/13/2007 ecs150, Fall 2007 64 UCDavis, ecs150 Fall 2007 Creating a file (I) i-node-1 abc ghi i-node-3 Assume we want to create new file “tuv” 11/13/2007 ecs150, Fall 2007 65 UCDavis, ecs150 Fall 2007 Creating a file (II) i-node-1 abc ghi i-node-3 tuv ? Cannot write directory entry “tuv” before i-node 11/13/2007 ecs150, Fall 2007 66 UCDavis, ecs150 Fall 2007 Creating a file (III) Correct sequence is 1. 2. Write to disk i-node block containing new i-node Write to disk directory block containing new directory entry Leaves the file system in a consistent state 11/13/2007 ecs150, Fall 2007 67 UCDavis, ecs150 Fall 2007 Synchronous Updates Used by FFS to guarantee consistency of metadata: – All metadata updates are done through blocking writes Increases the cost of metadata updates Can significantly impact the performance of whole file system 11/13/2007 ecs150, Fall 2007 68 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 69 UCDavis, ecs150 Fall 2007 SOFT UPDATES Use delayed writes (write back) Maintain dependency information about cached pieces of metadata: This i-node must be updated before/after this directory entry Guarantee that metadata blocks are written to disk in the required order 11/13/2007 ecs150, Fall 2007 70 UCDavis, ecs150 Fall 2007 3 Soft Update Rules Never point to a structure before it has been initialized. Never reuse a resource before nullifying all previous pointers to it. Never reset the old pointer to a live resource before the new pointer has been set. 11/13/2007 ecs150, Fall 2007 71 UCDavis, ecs150 Fall 2007 Problem #1 with S.U. Synchronous writes guaranteed that metadata operations were durable once the system call returned Soft Updates guarantee that file system will recover into a consistent state but not necessarily the most recent one – Some updates could be lost 11/13/2007 ecs150, Fall 2007 72 UCDavis, ecs150 Fall 2007 What are the dependency relationship? We want to delete file “foo” and create new file “bar” Block A Block B foo i-node-2 NEW bar NEW i-node-3 11/13/2007 ecs150, Fall 2007 73 UCDavis, ecs150 Fall 2007 Circular Dependency X-2nd Y-1st We want to delete file “foo” and create new file “bar” Block A Block B foo i-node-2 NEW bar NEW i-node-3 11/13/2007 ecs150, Fall 2007 74 UCDavis, ecs150 Fall 2007 Problem #2 with S.U. Cyclical dependencies: – Same directory block contains entries to be created and entries to be deleted – These entries point to i-nodes in the same block Brainstorming: – How to resolve this issue in S.U.? 11/13/2007 ecs150, Fall 2007 75 UCDavis, ecs150 Fall 2007 FS: buffer or disk?? They appear in both and we try to synchronize them.. 11/13/2007 ecs150, Fall 2007 76 UCDavis, ecs150 Fall 2007 Disk Block A-Dir Block B-i-Node foo i-node-2 11/13/2007 ecs150, Fall 2007 77 UCDavis, ecs150 Fall 2007 Buffer Block A-Dir Block B-i-Node NEW bar NEW i-node-3 11/13/2007 ecs150, Fall 2007 78 UCDavis, ecs150 Fall 2007 Synchronize?? Block A Block B foo i-node-2 NEW bar NEW i-node-3 11/13/2007 ecs150, Fall 2007 79 UCDavis, ecs150 Fall 2007 How to update?? i-node first or director block first? 11/13/2007 ecs150, Fall 2007 80 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 81 UCDavis, ecs150 Fall 2007 Solution in S.U. 
Roll back metadata in one of the blocks to an earlier, safe state Block A’ def (Safe state does not contain new directory entry) 11/13/2007 ecs150, Fall 2007 82 UCDavis, ecs150 Fall 2007 Write first block with metadata that were rolled back (block A’ of example) Write blocks that can be written after first block has been written (block B of example) Roll forward block that was rolled back Write that block Breaks the cyclical dependency but must now write twice block A 11/13/2007 ecs150, Fall 2007 83 UCDavis, ecs150 Fall 2007 Before any Write Operation SU Dependency Checking (roll back if necessary) After any Write Operation SU Dependency Processing (task list updating) (roll forward if necessary) 11/13/2007 ecs150, Fall 2007 84 UCDavis, ecs150 Fall 2007 two most popular approaches for improving the performance of metadata operations and recovery: – Journaling – Soft Updates Journaling systems record metadata operations on an auxiliary log Soft Updates uses ordered writes 11/13/2007 ecs150, Fall 2007 85 UCDavis, ecs150 Fall 2007 JOURNALING Journaling systems maintain an auxiliary log that records all meta-data operations Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations. – After a crash, can replay the log to bring the file system to a consistent state 11/13/2007 ecs150, Fall 2007 86 UCDavis, ecs150 Fall 2007 JOURNALING Log writes are performed in addition to the regular writes Journaling systems incur log write overhead but – Log writes can be performed efficiently because they are sequential (block operation consideration) – Metadata blocks do not need to be written back after each update 11/13/2007 ecs150, Fall 2007 87 UCDavis, ecs150 Fall 2007 JOURNALING Journaling systems can provide – same durability semantics as FFS if log is forced to disk after each meta-data operation – the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full 11/13/2007 ecs150, Fall 2007 88 UCDavis, ecs150 Fall 2007 Soft Updates vs. Journaling Advantages disadvantages 11/13/2007 ecs150, Fall 2007 89 UCDavis, ecs150 Fall 2007 With Soft Updates?? Do we still need “FSCK”? at boot time? CPU 11/13/2007 ecs150, Fall 2007 90 UCDavis, ecs150 Fall 2007 Recover the Missing Resources In the background, in an active FS… – We don’t want to wait for the lengthy FSCK process to complete… A related issue: – the virus scanning process – what happens if we get a new virus signature? 11/13/2007 ecs150, Fall 2007 91 UCDavis, ecs150 Fall 2007 Snapshot of the FS backup and restore dump reliably an active File System – what will we do today to dump our 40GB FS “consistent” snapshots? (in the midnight…) “background FSCK checks” 11/13/2007 ecs150, Fall 2007 92 UCDavis, ecs150 Fall 2007 What is a snapshot? (I mean “conceptually”.) Freeze all activities related to the FS. Copy everything to “some space”. Resume the activities. How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don’t have to buy a really big hard drive? 
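One way to make this question concrete is the copy-on-write idea developed on the following slides. The sketch below is illustrative only; the names (disk, snap_copy, snap_map, fs_write_block, ...) are invented for the example and this is not the FFS snapshot code.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 1024            /* toy file system size, in blocks */
#define BSIZE   1024

static char disk[NBLOCKS][BSIZE];        /* the "live" blocks */
static char snap_copy[NBLOCKS][BSIZE];   /* blocks saved for the snapshot */
static enum { NOT_USED, NOT_YET_COPIED, COPIED } snap_map[NBLOCKS];

void snapshot_create(void)
{
        /* cheap: only mark blocks "not yet copied" (a real FS marks the
         * blocks actually in use); no data is copied at this point */
        for (int b = 0; b < NBLOCKS; b++)
                snap_map[b] = NOT_YET_COPIED;
}

void fs_write_block(int b, const char *newdata)
{
        if (snap_map[b] == NOT_YET_COPIED) {        /* first overwrite since snapshot */
                memcpy(snap_copy[b], disk[b], BSIZE);  /* save the old contents */
                snap_map[b] = COPIED;
        }
        memcpy(disk[b], newdata, BSIZE);            /* then let the write proceed */
}

const char *snapshot_read_block(int b)
{
        /* unchanged blocks are read from the live disk;
         * overwritten ones come from the saved copy */
        return snap_map[b] == COPIED ? snap_copy[b] : disk[b];
}

int main(void)
{
        char newdata[BSIZE] = "new contents";

        memcpy(disk[0], "old contents", sizeof "old contents");
        snapshot_create();
        fs_write_block(0, newdata);                 /* triggers the copy */
        printf("snapshot still sees: %s\n", snapshot_read_block(0));
        printf("live copy now has:   %s\n", disk[0]);
        return 0;
}

Creating the snapshot only initializes the map, which is why activity needs to be blocked only briefly, and space is consumed only as live blocks are overwritten, so no second full-size disk is needed.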
11/13/2007 ecs150, Fall 2007 93 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 94 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 95 UCDavis, ecs150 Fall 2007 Copy-on-Write 11/13/2007 ecs150, Fall 2007 96 UCDavis, ecs150 Fall 2007 Snapshot: a file Logical size Versus physical size 11/13/2007 ecs150, Fall 2007 97 UCDavis, ecs150 Fall 2007 Example # # # # mkdir /backups/usr/noon mount –u –o snapshot /usr/snap.noon /usr mdconfig –a –t vnode –u 0 –f /usr/snap.noon mount –r /dev/md0 /backups/usr/noon /* do whatever you want to test it */ # umount /backups/usr/noon # mdconfig –d –u 0 # rm –f /usr/snap.noon 11/13/2007 ecs150, Fall 2007 98 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 99 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 100 UCDavis, ecs150 Fall 2007 #include <stdio.h> #include <stdlib.h> int main (void) { FILE *f1 = fopen("./sss.txt", "w"); int i; for (i = 0; i < 1000; i++) { fseek(f1, rand(), SEEK_SET); fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand()); if (i % 100 == 0) sleep(1); } fflush(f1); } 11/13/2007 ecs150, Fall 2007 101 UCDavis, ecs150 Fall 2007 Example # # # # mkdir /backups/usr/noon mount –u –o snapshot /usr/snap.noon /usr mdconfig –a –t vnode –u 0 –f /usr/snap.noon mount –r /dev/md0 /backups/usr/noon /* do whatever you want to test it */ # umount /backups/usr/noon # mdconfig –d –u 0 # rm –f /usr/snap.noon 11/13/2007 ecs150, Fall 2007 102 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 103 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 104 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 105 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 106 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 107 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 108 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 109 UCDavis, ecs150 Fall 2007 Example # # # # mkdir /backups/usr/noon mount –u –o snapshot /usr/snap.noon /usr mdconfig –a –t vnode –u 0 –f /usr/snap.noon mount –r /dev/md0 /backups/usr/noon /* do whatever you want to test it */ # umount /backups/usr/noon # mdconfig –d –u 0 # rm –f /usr/snap.noon 11/13/2007 ecs150, Fall 2007 110 UCDavis, ecs150 Fall 2007 Copy-on-Write 11/13/2007 ecs150, Fall 2007 111 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 112 UCDavis, ecs150 Fall 2007 A file 11/13/2007 A File System ??? entries in one disk block ecs150, Fall 2007 113 UCDavis, ecs150 Fall 2007 A file A Snapshot i-node ??? entries in one disk block Not used or Not yet copy 11/13/2007 ecs150, Fall 2007 114 UCDavis, ecs150 Fall 2007 A file Copy-on-write ??? entries in one disk block Not used or Not yet copy 11/13/2007 ecs150, Fall 2007 115 UCDavis, ecs150 Fall 2007 A file Copy-on-write ??? entries in one disk block Not used or Not yet copy 11/13/2007 ecs150, Fall 2007 116 UCDavis, ecs150 Fall 2007 Multiple Snapshots about 20 snapshots Interactions/sharing among snapshots 11/13/2007 ecs150, Fall 2007 117 UCDavis, ecs150 Fall 2007 Snapshot of the FS backup and restore dump reliably an active File System – what will we do today to dump our 40GB FS “consistent” snapshots? (in the midnight…) “background FSCK checks” 11/13/2007 ecs150, Fall 2007 118 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 119 UCDavis, ecs150 Fall 2007 VFS: the FS Switch Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly. VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules. 
user space syscall layer (file, uio, etc.) network protocol stack (TCP/IP) Virtual File System (VFS) NFS FFS LFS *FS etc. etc. VFS was an internal kernel restructuring with no effect on the syscall interface. Incorporates object-oriented concepts: a generic procedural interface with multiple implementations. device drivers Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects. 11/13/2007 ecs150, Fall 2007 Based on abstract objects with dynamic method binding by type...in C. 120 UCDavis, ecs150 Fall 2007 vnode In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory. free vnodes syscall layer Each vnode has a standard file attributes struct. Generic vnode points at filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem. Each specific file system maintains a cache of its resident vnodes. NFS 11/13/2007 UFS ecs150, Fall 2007 Vnode operations are macros that vector to filesystem-specific procedures. 121 UCDavis, ecs150 Fall 2007 vnode Operations and Attributes vnode attributes (vattr) type (VREG, VDIR, VLNK, etc.) mode (9+ bits of permissions) nlink (hard link count) owner user ID owner group ID filesystem ID unique file ID file size (bytes and blocks) access time modify time generation number generic operations vop_getattr (vattr) vop_setattr (vattr) vhold() vholdrele() 11/13/2007 directories only vop_lookup (OUT vpp, name) vop_create (OUT vpp, name, vattr) vop_remove (vp, name) vop_link (vp, name) vop_rename (vp, name, tdvp, tvp, name) vop_mkdir (OUT vpp, name, vattr) vop_rmdir (vp, name) vop_symlink (OUT vpp, name, vattr, contents) vop_readdir (uio, cookie) vop_readlink (uio) files only vop_getpages (page**, count, offset) vop_putpages (page**, count, sync, offset) vop_fsync () ecs150, Fall 2007 122 UCDavis, ecs150 Fall 2007 Network File System (NFS) server client syscall layer user programs VFS syscall layer NFS server VFS UFS UFS NFS client network 11/13/2007 ecs150, Fall 2007 123 UCDavis, ecs150 Fall 2007 vnode Cache HASH(fsid, fileid) VFS free list head Active vnodes are reference- counted by the structures that hold pointers to them. - system open file table - process current directory - file system mount points - etc. Each specific file system maintains its own hash of vnodes (BSD). 
- specific FS handles initialization - free list is maintained by VFS vget(vp): reclaim cached inactive vnode from VFS free list vref(vp): increment reference count on an active vnode vrele(vp): release reference count on a vnode vgone(vp): vnode is no longer valid (file is removed) 11/13/2007 ecs150, Fall 2007 124 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 125 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 126 UCDavis, ecs150 Fall 2007 struct vnode { struct mtx v_interlock; /* lock for "i" things */ u_long v_iflag; /* i vnode flags (see below) */ int v_usecount; /* i ref count of users */ long v_numoutput; /* i writes in progress */ struct thread *v_vxthread; /* i thread owning VXLOCK */ int v_holdcnt; /* i page & buffer references */ struct buflists v_cleanblkhd; /* i SORTED clean blocklist */ struct buf *v_cleanblkroot;/* i clean buf splay tree */ int v_cleanbufcnt; /* i number of clean buffers */ struct buflists v_dirtyblkhd; /* i SORTED dirty blocklist */ struct buf *v_dirtyblkroot; /* i dirty buf splay tree */ int v_dirtybufcnt; 11/13/2007 ecs150, Fall 2007 127 UCDavis, ecs150 Fall 2007 Distributed FS ftp.cs.ucdavis.edu fs0: /dev/hd0a / usr sys dev lib bin etc bin / local adm home Server.yahoo.com fs0: /dev/hd0e 11/13/2007 ecs150, Fall 2007 128 UCDavis, ecs150 Fall 2007 logical disks fs0: /dev/hd0a / usr sys dev lib bin etc bin mount -t ufs /dev/hd0e /usr / local adm home fs1: /dev/hd0e mount -t nfs 152.1.23.12:/export/cdrom /mnt/cdrom 11/13/2007 ecs150, Fall 2007 129 UCDavis, ecs150 Fall 2007 Correctness One-copy Unix Semantics – every modification to every byte of a file has to be immediately and permanently visible to every client. 11/13/2007 ecs150, Fall 2007 130 UCDavis, ecs150 Fall 2007 Correctness One-copy Unix Semantics – every modification to every byte of a file has to be immediately and permanently visible to every client. – Conceptually FS sequent access Make sense in a local file system Single processor versus shared memory Is this necessary? 11/13/2007 ecs150, Fall 2007 131 UCDavis, ecs150 Fall 2007 DFS Architecture Server – storage for the distributed/shared files. – provides an access interface for the clients. Client – consumer of the files. – runs applications in a distributed environment. open read opendir readdir close write stat applications 11/13/2007 ecs150, Fall 2007 132 UCDavis, ecs150 Fall 2007 NFS (SUN, 1985) Based on RPC (Remote Procedure Call) and XDR (Extended Data Representation) Server maintains no state – a READ on the server opens, seeks, reads, and closes – a WRITE is similar, but the buffer is flushed to disk before closing Server crash: client continues to try until server reboots – no loss Client crashes: client must rebuild its own state – no effect on server 11/13/2007 ecs150, Fall 2007 133 UCDavis, ecs150 Fall 2007 RPC - XDR RPC: Standard protocol for calling procedures in another machine Procedure is packaged with authorization and admin info XDR: standard format for data, because manufacturers of computers cannot agree on byte ordering. 11/13/2007 ecs150, Fall 2007 134 UCDavis, ecs150 Fall 2007 rpcgen RPC program data structure RPC client.c 11/13/2007 rpcgen RPC.h ecs150, Fall 2007 data structure RPC server.c 135 UCDavis, ecs150 Fall 2007 NFS Operations Every operation is independent: server opens file for every operation File identified by handle -- no state information retained by server client maintains mount table, v-node, offset in file table etc. What do these imply??? 
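To illustrate what a state-less server implies, here is a sketch of an NFS-style READ in which every request carries the file handle, offset, and count, and the server effectively opens, seeks, reads, and closes for each call. The request struct and the lookup_handle() helper (with its stub path) are invented for the illustration; this is not the actual NFS code.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

struct read_request {
        char   handle[64];   /* identifies the file, not an open session */
        long   offset;       /* the client, not the server, tracks this  */
        size_t count;
};

static const char *lookup_handle(const char *handle)
{
        (void)handle;        /* stub: a real server maps the handle to an inode */
        return "/tmp/example-file";
}

ssize_t serve_read(const struct read_request *req, char *buf)
{
        int fd = open(lookup_handle(req->handle), O_RDONLY);   /* open  */
        if (fd < 0)
                return -1;
        if (lseek(fd, req->offset, SEEK_SET) < 0) {            /* seek  */
                close(fd);
                return -1;
        }
        ssize_t n = read(fd, buf, req->count);                 /* read  */
        close(fd);                                             /* close */
        return n;
}

int main(void)
{
        struct read_request req = { "handle-42", 0, 64 };
        char buf[64];

        printf("read returned %zd bytes\n", serve_read(&req, buf));
        return 0;
}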
11/13/2007 ecs150, Fall 2007 136 UCDavis, ecs150 Fall 2007 Client computer Client computer Application Application program program NFS Application program Client Kernel Server computer Application program UNIX system calls Virtual file system Operations on local files UNIX file system Other file system UNIX kernel Virtual file system Operations on remote files NFS client NFS NFS Client server UNIX file system NFS protocol (remote operations) mount –t nfs home.yahoo.com:/pub/linux /mnt/linux 11/13/2007 ecs150, Fall 2007 137 * UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 138 UCDavis, ecs150 Fall 2007 State-ful vs. State-less A server is fully aware of its clients – does the client have the newest copy? – what is the offset of an opened file? – “a session” between a client and a server! A server is completely unaware of its clients – memory-less: I do not remember you!! – Just tell me what you want to get (and where). – I am not responsible for your offset values (the client needs to maintain the state). 11/13/2007 ecs150, Fall 2007 139 UCDavis, ecs150 Fall 2007 The State open read stat lseek applications open read stat lseek offset applications 11/13/2007 ecs150, Fall 2007 140 UCDavis, ecs150 Fall 2007 Network File Sharing Server side: – Rpcbind (portmap) – Mountd - respond to mount requests (sometimes called rpc.mountd). Relies on several files – /etc/dfs/dfstab, – /etc/exports, – /etc/netgroup – – – – nfsd - serves files - actually a call to kernel level code. lockd – file locking daemon. statd – manages locks for lockd. rquotad – manages quotas for exported file systems. 11/13/2007 ecs150, Fall 2007 141 UCDavis, ecs150 Fall 2007 Network File Sharing Client Side – biod - client side caching daemon – mount must understand the hostname:directory convention. – Filesystem entries in /etc/[v]fstab tell the client what filesystems to mount. 11/13/2007 ecs150, Fall 2007 142 UCDavis, ecs150 Fall 2007 Unix file semantics NFS: – open a file with read-write mode – later, the server’s copy becomes read-only mode – now, the application tries to write it!! 11/13/2007 ecs150, Fall 2007 143 UCDavis, ecs150 Fall 2007 Problems with NFS Performance not scaleable: – maybe it is OK for a local office. – will be horrible with large scale systems. 11/13/2007 ecs150, Fall 2007 144 UCDavis, ecs150 Fall 2007 Similar to UNIX file caching for local files: – pages (blocks) from disk are held in a main memory buffer cache until the space is required for newer pages. Read-ahead and delayed-write optimisations. – For local files, writes are deferred to next sync event (30 second intervals) – Works well in local context, where files are always accessed through the local cache, but in the remote case it doesn't offer necessary synchronization guarantees to clients. NFS v3 servers offers two strategies for updating the disk: – write-through - altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk. – delayed commit - pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed. 
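A toy model of the two server update strategies just described, write-through versus delayed commit. All names here (page_cache, stable_store, handle_commit, ...) are invented for the illustration and are not the NFS server code.

#include <stdbool.h>
#include <string.h>

#define NPAGES 64
#define PSIZE  4096

static char page_cache[NPAGES][PSIZE];
static bool dirty[NPAGES];

static void stable_store(int page) { dirty[page] = false; /* stand-in for disk I/O */ }

/* write-through: the reply to a WRITE is not sent until the page is on disk */
void handle_write_through(int page, const char *data)
{
        memcpy(page_cache[page], data, PSIZE);
        stable_store(page);              /* client knows the data is durable */
}

/* delayed commit: WRITE only updates the cache; durability is deferred
 * until the client sends COMMIT (typically when it closes the file) */
void handle_write_delayed(int page, const char *data)
{
        memcpy(page_cache[page], data, PSIZE);
        dirty[page] = true;
}

void handle_commit(void)
{
        for (int p = 0; p < NPAGES; p++)
                if (dirty[p])
                        stable_store(p);
}

int main(void)
{
        char page[PSIZE] = "hello";

        handle_write_through(0, page);   /* durable when the call returns */
        handle_write_delayed(1, page);   /* durable only after COMMIT */
        handle_commit();
        return 0;
}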
11/13/2007 ecs150, Fall 2007 145 * UCDavis, ecs150 Fall 2007 Server caching does nothing to reduce RPC traffic between client and server – further optimisation is essential to reduce server load in large networks – NFS client module caches the results of read, write, getattr, lookup and readdir operations – synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file. Timestamp-based validity check – reduces inconsistency, but doesn't eliminate it – validity condition for cache entries at the client: (T - Tc < t) v (Tmclient = Tmserver) – t is configurable (per file) but is typically set to 3 seconds for files and 30 secs. for directories – it remains difficult to write distributed applications that share files with NFS 11/13/2007 ecs150, Fall 2007 t Tc freshness guarantee time when cache entry was last validated Tm time when block was last updated at server T current time 146 * UCDavis, ecs150 Fall 2007 AFS State-ful clients and servers. Caching the files to clients. – File close ==> check-in the changes. How to maintain consistency? – Using “Callback” in v2/3 (Valid or Cancelled) open read applications invalidate and re-cache 11/13/2007 ecs150, Fall 2007 147 UCDavis, ecs150 Fall 2007 Why AFS? Shared files are infrequently updated Local cache of a few hundred mega bytes – Now 50~100 giga bytes Unix workload: – Files are small, Read Operations dominated, sequential access is common, read/written by one user, reference bursts. – Are these still true? 11/13/2007 ecs150, Fall 2007 148 UCDavis, ecs150 Fall 2007 Fault Tolerance in AFS a server crashes a client crashes – check for call-back tokens first. 11/13/2007 ecs150, Fall 2007 151 UCDavis, ecs150 Fall 2007 Problems with AFS Availability what happens if call-back itself is lost?? 11/13/2007 ecs150, Fall 2007 152 UCDavis, ecs150 Fall 2007 GFS – Google File System “failures” are norm Multiple-GB files are common Append rather than overwrite – Random writes are rare Can we relax the consistency? 11/13/2007 ecs150, Fall 2007 153 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 154 UCDavis, ecs150 Fall 2007 The Master Maintains all file system metadata. names space, access control info, file to chunk mappings, chunk (including replicas) location, etc. Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state 11/13/2007 ecs150, Fall 2007 155 UCDavis, ecs150 Fall 2007 The Master Helps make sophisticated chunk placement and replication decision, using global knowledge For reading and writing, client contacts Master to get chunk locations, then deals directly with chunkservers Master is not a bottleneck for reads/writes 11/13/2007 ecs150, Fall 2007 156 UCDavis, ecs150 Fall 2007 Chunkservers Files are broken into chunks. Each chunk has a immutable globally unique 64-bit chunkhandle. handle is assigned by the master at chunk creation Chunk size is 64 MB Each chunk is replicated on 3 (default) servers 11/13/2007 ecs150, Fall 2007 157 UCDavis, ecs150 Fall 2007 Clients Linked to apps using the file system API. Communicates with master and chunkservers for reading and writing Master interactions only for metadata Chunkserver interactions for data Only caches metadata information Data is too large to cache. 11/13/2007 ecs150, Fall 2007 158 UCDavis, ecs150 Fall 2007 Chunk Locations Master does not keep a persistent record of locations of chunks and replicas. Polls chunkservers at startup, and when new chunkservers join/leave for this. 
Stays up to date by controlling placement of new chunks and through HeartBeat messages (when monitoring chunkservers) 11/13/2007 ecs150, Fall 2007 159 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 160 UCDavis, ecs150 Fall 2007 CODA Server Replication: – if one server goes down, I can get another. Disconnected Operation: – if all go down, I will use my own cache. 11/13/2007 ecs150, Fall 2007 161 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 162 UCDavis, ecs150 Fall 2007 Disconnected Operation Continue critical work when that repository is inaccessible. Key idea: caching data. – Performance – Availability Server Replication 11/13/2007 ecs150, Fall 2007 163 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 164 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 165 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 166 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 167 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 168 UCDavis, ecs150 Fall 2007 11/13/2007 ecs150, Fall 2007 169 UCDavis, ecs150 Fall 2007 Consistency If John update file X on server A and Mary read file X on server B…. Read-one & Write-all 11/13/2007 ecs150, Fall 2007 170 UCDavis, ecs150 Fall 2007 Read x & Write (N-x+1) read write 11/13/2007 ecs150, Fall 2007 171 UCDavis, ecs150 Fall 2007 Initial Alice-W Bob-W Alice-R Chris-W Dan-R Emily-W Frank-R 11/13/2007 Example: R3W4 (6+1) 0 2 2 2 2 2 7 7 0 2 3 3 1 1 7 7 0 0 3 3 1 1 1 1 ecs150, Fall 2007 0 2 3 3 1 1 1 1 0 2 3 3 1 1 1 1 0 0 0 0 0 0 7 7 172
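The R3W4 example above can be made concrete with a small sketch of the read-x / write-(N-x+1) scheme. Assumed parameters to match the R3W4 label: N = 6 replicas, read quorum R = 3, write quorum W = 4, so R + W > N and every read quorum overlaps every write quorum in at least one replica. This is a toy model, not CODA/AFS code.

#include <stdio.h>

#define N 6
#define R 3
#define W 4

struct replica { int version; int value; };
static struct replica rep[N];

/* write: install the new value, with a higher version, on any W replicas */
void quorum_write(const int members[W], int value)
{
        int v = 0;
        for (int i = 0; i < N; i++)              /* next version = global max + 1 */
                if (rep[i].version > v)
                        v = rep[i].version;
        for (int i = 0; i < W; i++) {
                rep[members[i]].version = v + 1;
                rep[members[i]].value   = value;
        }
}

/* read: ask any R replicas and keep the value with the highest version;
 * the overlap with the last write quorum guarantees it is the newest one */
int quorum_read(const int members[R])
{
        int best = members[0];
        for (int i = 1; i < R; i++)
                if (rep[members[i]].version > rep[best].version)
                        best = members[i];
        return rep[best].value;
}

int main(void)
{
        quorum_write((const int[]){0, 1, 2, 3}, 42);    /* write to 4 replicas */
        int seen = quorum_read((const int[]){3, 4, 5}); /* read 3 others, still see 42 */

        printf("read returned %d\n", seen);
        return 0;
}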