
CSC 660: Advanced OS
Distributed Filesystems
CSC 660: Advanced Operating Systems
Slide #1
Topics
1. Filesystem History
2. Distributed Filesystems
3. AFS
4. GoogleFS
5. Common filesystem issues
CSC 660: Advanced Operating Systems
Slide #2
Filesystem History
• FS (1974)
• Fast Filesystem (FFS) / UFS (1984)
• Log-structured Filesystem (1991)
• ext2 (1993)
• ext3 (2001)
• WAFL (1994)
• XFS (1994)
• Reiserfs (1998)
• ZFS (2004)
CSC 660: Advanced Operating Systems
Slide #3
FS
• First UNIX filesystem (1974)
• Simple
– Layout: superblock, inodes, then data blocks.
– Unused blocks stored in a free linked list, not a bitmap.
– 512-byte blocks, no fragments.
– Short filenames.
• Slow: 2% of raw disk bandwidth.
– Disk seeks consume most file access time due to small block size and high fragmentation.
– Performance was later doubled by using 1 KB blocks.
CSC 660: Advanced Operating Systems
Slide #4
FFS
• BSD (1984), basis for SYSV UFS
• More complex
– Cylinder groups: inodes, bitmaps, data blocks.
– Larger blocks (4K) with 1K fragments.
– Block layout based on physical disk parameters.
– Long filenames, symlinks, file locks, quotas.
– 10% space reserved by default.
• Faster: 14-47% of raw disk bandwidth.
– Creating a new file requires 5 seeks: 2 inode seeks, 1 file data, 1 directory data, 1 directory inode.
– User/kernel memory copies take 40% of disk operation time.
CSC 660: Advanced Operating Systems
Slide #5
Log-structured Filesystem (LFS)
• All data stored as sequential log entries.
– Divided into large log segments.
– Cleaner defragments, produces new segments.
• Fast recovery: checkpoint + roll forward.
• Performance: 70% of raw disk bandwidth.
– Large sequential writes vs multiple writes/seeks.
– Inode map tracks dynamic locations of inodes.
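A toy sketch of the log-structured idea in Python: every write, data and inode alike, is appended to a single log, and an inode map remembers where the latest copy of each inode landed. All names and the record format are illustrative, not LFS's actual on-disk layout.

```python
# The "disk" is one append-only log; an in-memory inode map tracks where the
# newest copy of each inode sits in that log.
log = bytearray()
inode_map = {}             # inode number -> offset of its latest copy in the log

def append(record):
    """Append a record to the log and return the offset where it landed."""
    offset = len(log)
    log.extend(record)
    return offset

def write_file(inum, data):
    data_offset = append(data)                        # data block goes into the log
    inode = f"inode {inum} -> data@{data_offset};".encode()
    inode_map[inum] = append(inode)                   # so does the new inode

write_file(7, b"hello")
write_file(7, b"hello world")   # rewrite: older copies become garbage for the cleaner
print(inode_map[7], bytes(log))
```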
CSC 660: Advanced Operating Systems
Slide #6
ext2 and ext3
FFS + performance features.
– Variable block size (1K-4K), no fragments.
– Partitions disk into block groups.
– Data block preallocation + read ahead.
– Fast symlinks (stored in the inode).
– 5% space reserved by default.
– Very fast.
ext3 adds journaling capabilities.
CSC 660: Advanced Operating Systems
Slide #7
WAFL
Network Appliance (1994)
Metadata in files
– Root inode points to the inode file.
– Filesystem is a tree of blocks reached through the inode file.
– Writing metadata anywhere makes writes faster with RAID.
– Allows the filesystem to be expanded on the fly.
CSC 660: Advanced Operating Systems
Slide #8
WAFL
Copy on write snapshots
– Hourly (4/day, keep 2d), Daily (keep 7d).
– Users can get deleted files from .snapshot dirs.
– Snapshots created by just copying the root inode (see the sketch below).
– Creates a consistency point snapshot every few seconds.
– Writes only to unused blocks between consistency snapshots.
– Recovery = last consistency point + replay NVRAM log.
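A toy sketch of copy-on-write snapshots as described above: taking a snapshot copies only the root pointer, and later writes allocate fresh blocks, so the snapshot keeps pointing at the old tree. The block representation here is made up for illustration.

```python
# Block store: block id -> content. Live data is never overwritten in place.
blocks = {}
next_id = 0

def alloc(content):
    """Allocate a new block; existing blocks are left untouched."""
    global next_id
    blocks[next_id] = content
    next_id += 1
    return next_id - 1

# Root block maps filenames to data block ids (a stand-in for the inode file).
root = alloc({"file.txt": alloc(b"version 1")})

snapshot = root            # snapshot creation: copy the root pointer, nothing else

# Copy on write: a new data block and a new root block; the old ones stay intact.
root = alloc({"file.txt": alloc(b"version 2")})

print(blocks[blocks[root]["file.txt"]])       # b'version 2'  (live filesystem)
print(blocks[blocks[snapshot]["file.txt"]])   # b'version 1'  (the snapshot)
```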
CSC 660: Advanced Operating Systems
Slide #9
XFS
SGI (1994)
Complex journaling filesystem
– Uses B+ trees to track free space, index directories, and locate file blocks and inodes.
– Dynamic inode allocation, metadata journaling, volume manager, multithreaded, allocate on flush.
– 64-bit filesystem (filesystems up to 2^63 bytes).
– Fast: 90-95% of raw disk bandwidth.
CSC 660: Advanced Operating Systems
Slide #10
Reiserfs
Multiple different versions (v1-4)
Complex tree-based filesystem
– Uses B+ trees (v3) or dancing trees (v4).
– Journaling, allocate on flush, COW, tail-packing.
– High performance with small files and large directories.
– Second to ext2 in performance (v3).
CSC 660: Advanced Operating Systems
Slide #11
ZFS
Sun (2004)
Copy-on-write + volume management
– Variable block size + compression.
– Built-in volume manager (striping, pooling).
– Self-healing with 64-bit checksums + mirroring.
– COW transactional model (live data is never overwritten).
– Fast snapshots (just don't release old blocks).
– 128-bit filesystem.
CSC 660: Advanced Operating Systems
Slide #12
Distributed Filesystems
Use filesystem to transparently share data between computers.
Accessing files via a distributed filesystem:
1. Client mounts network filesystem.
2. Client makes a request for file access.
3. Client kernel sends network request to server.
4. Server performs file ops on physical disk.
5. Server sends response across network to client.
CSC 660: Advanced Operating Systems
Slide #13
Naming
Mapping between logical and physical objects.
UNIX filenames mapped to inodes.
Network filenames map to <hostname, vnode> pairs.
Location-independent names
Filename is a dynamic one-to-many mapping.
Files can migrate to other servers without renaming.
Files can be replicated across multiple servers.
CSC 660: Advanced Operating Systems
Slide #14
Naming Implementation
Location-dependent (non-transparent)
filename -> <system,disk,inode>
Location-independent (transparent)
filename -> file_identifier -> <system,disk,inode>
Identifiers must be unique.
Identifiers must be updated to point to a new physical location when a file is moved.
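A minimal Python sketch of the two mappings; all names, hosts, and inode numbers are illustrative. The location-independent scheme adds a file-identifier table, so moving a file only updates that table, never the name.

```python
# Location-dependent: the name itself resolves directly to a physical location.
location_dependent = {
    "/docs/report.txt": ("server1", "disk0", 4711),   # <system, disk, inode>
}

# Location-independent: names map to a stable file identifier, and a separate
# table maps identifiers to physical locations.
name_to_fid = {"/docs/report.txt": "fid-0001"}
fid_to_location = {"fid-0001": ("server1", "disk0", 4711)}

def lookup(path):
    """Resolve a path to its physical location via the file identifier."""
    fid = name_to_fid[path]
    return fid_to_location[fid]

def migrate(fid, new_location):
    """Moving a file only updates the identifier table; names are untouched."""
    fid_to_location[fid] = new_location

print(lookup("/docs/report.txt"))              # ('server1', 'disk0', 4711)
migrate("fid-0001", ("server2", "disk1", 99))
print(lookup("/docs/report.txt"))              # ('server2', 'disk1', 99)
```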
CSC 660: Advanced Operating Systems
Slide #15
Caching
Problem: Every file access uses network.
Solution: Store remote data on local system.
Cache can be memory or disk based.
Read-ahead can reduce accesses further.
CSC 660: Advanced Operating Systems
Slide #16
Cache Update Policies
Write Through
Write data to server and cache at once.
Return to program when server write complete.
High reliability, poor performance.
Delayed Write
Write data to cache, then return to program.
Modifications written through to server later.
High performance, poor reliability.
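A toy sketch of the two policies, with a plain dictionary standing in for the remote server; the class and method names are hypothetical.

```python
class WriteThroughCache:
    """Write-through: update cache and server before returning to the program."""
    def __init__(self, server):
        self.server, self.cache = server, {}

    def write(self, name, data):
        self.cache[name] = data
        self.server[name] = data          # wait for the server write to complete

class DelayedWriteCache:
    """Delayed write: return after caching, flush to the server later."""
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()

    def write(self, name, data):
        self.cache[name] = data
        self.dirty.add(name)              # returns immediately; data only in cache

    def flush(self):
        for name in self.dirty:
            self.server[name] = self.cache[name]
        self.dirty.clear()

server = {}
wt = WriteThroughCache(server)
wt.write("a.txt", b"v1")
assert server["a.txt"] == b"v1"           # already safe on the server

dw = DelayedWriteCache(server)
dw.write("b.txt", b"v1")
assert "b.txt" not in server              # would be lost if the client crashed now
dw.flush()
assert server["b.txt"] == b"v1"
```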
CSC 660: Advanced Operating Systems
Slide #17
NFS with Cachefs
CSC 660: Advanced Operating Systems
Slide #18
Cache Consistency Problem
Keeping cached copies consistent with the server.
Consistency overhead can decrease performance if too many writes are done on the same set of files.
Client-initiated consistency
Client asks server if data is consistent.
When: every file access, periodically.
Server-initiated consistency
Server detects conflicts and invalidates client caches.
Server has to maintain state of what is cached where.
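A toy sketch of both approaches, with hypothetical classes: the client can ask whether its cached copy is current (client-initiated), or the server can track which clients cache a file and invalidate them on write (server-initiated).

```python
class Server:
    """Tracks a version per file and which clients cache it (server-initiated)."""
    def __init__(self):
        self.versions = {}                  # name -> version number
        self.cachers = {}                   # name -> set of caching clients

    def register(self, name, client):
        self.cachers.setdefault(name, set()).add(client)

    def write(self, name):
        self.versions[name] = self.versions.get(name, 0) + 1
        for client in self.cachers.get(name, set()):
            client.invalidate(name)         # server pushes the invalidation

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}   # name -> (version, data)

    def invalidate(self, name):
        self.cache.pop(name, None)

    def is_fresh(self, name):
        """Client-initiated check: ask the server whether our copy is current."""
        entry = self.cache.get(name)
        return entry is not None and entry[0] == self.server.versions.get(name, 0)

srv = Server()
c1 = Client(srv)
c1.cache["f"] = (0, b"old")
srv.register("f", c1)
srv.write("f")                  # server-initiated: c1's copy is invalidated
print(c1.is_fresh("f"))         # False, so c1 must refetch before using "f"
```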
CSC 660: Advanced Operating Systems
Slide #19
Stateful File Access
Stateful process:
1. Client sends open request to server.
2. Server opens file, inserts into open file table.
3. Server returns file identifier to client.
4. Client uses identifier to read/write file.
5. Client closes file.
6. Server removes file from open file table.
Features
High performance, because of fewer disk accesses.
Problem: clients can crash without closing files.
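A minimal sketch of a stateful server, with a hypothetical open-file table keyed by the identifier returned from open; the names are illustrative, not any real protocol's API.

```python
import itertools

class StatefulServer:
    """Keeps an open-file table; clients pass back the identifier it returns."""
    def __init__(self, files):
        self.files = files                      # name -> bytes ("disk" contents)
        self.open_table = {}                    # fid -> [name, current offset]
        self._next_fid = itertools.count(1)

    def open(self, name):
        fid = next(self._next_fid)
        self.open_table[fid] = [name, 0]        # server remembers the position
        return fid

    def read(self, fid, nbytes):
        name, offset = self.open_table[fid]
        data = self.files[name][offset:offset + nbytes]
        self.open_table[fid][1] = offset + len(data)
        return data

    def close(self, fid):
        del self.open_table[fid]                # a client crash before this leaks an entry

server = StatefulServer({"log.txt": b"abcdefgh"})
fid = server.open("log.txt")
print(server.read(fid, 4), server.read(fid, 4))   # b'abcd' b'efgh'
server.close(fid)
```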
CSC 660: Advanced Operating Systems
Slide #20
Stateless File Service
Every request is self-contained.
Must specify filename, position in every request.
Server doesn’t know which files are open.
Server crashes have minimal effect.
Stateful servers must poll clients to recover state.
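For contrast, a stateless sketch of the same read path: every request carries the filename and offset, so there is no server-side table to rebuild after a crash. Again, all names are illustrative.

```python
class StatelessServer:
    """Every request names the file and offset; no open-file table to lose."""
    def __init__(self, files):
        self.files = files

    def read(self, name, offset, nbytes):
        return self.files[name][offset:offset + nbytes]

server = StatelessServer({"log.txt": b"abcdefgh"})
# The client tracks its own position and repeats it in every request, so a
# server restart between the two reads would be invisible to the client.
print(server.read("log.txt", 0, 4))   # b'abcd'
print(server.read("log.txt", 4, 4))   # b'efgh'
```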
CSC 660: Advanced Operating Systems
Slide #21
NFS
Sun
v2 (1984)
v3 (1992): TCP transport + 64-bit file sizes.
Implementation
– System calls via Sun RPC calls.
– Stateless: client obtains a filesystem ID on mount, then uses it (like a filehandle) in subsequent requests.
– UNIX-centric (UIDs, GIDs, permissions).
– Server authenticates by client IP address.
• Client UIDs mapped to server UIDs with root squashing.
• Danger: a root user on the client can su to any desired UID.
CSC 660: Advanced Operating Systems
Slide #22
CIFS
Microsoft (1998)
Derived from 1980s IBM SMB net filesystem.
Implementation
Originally ran over NetBIOS, not TCP/IP.
Universal Naming Convention paths: \\svr\share\path
Auth: NTLM (insecure), NTLMv2, Kerberos
MS Windows-centric (filenames, ACLs, EOLs)
CSC 660: Advanced Operating Systems
Slide #23
AFS
CMU (1983)
– Sold by Transarc/IBM, then free as OpenAFS.
Features
– Uniform /afs name space.
– Location-independent file sharing.
– Whole file caching on client.
– Secure authentication via Kerberos.
CSC 660: Advanced Operating Systems
Slide #24
AFS
Global namespace divided into cells
– Cells separate authorization domains.
– Cells included in pathname: /afs/CELL/
– Ex: cmu.edu, intel.com
Cells contain multiple servers
– Location independence managed via the volume database.
– Files are located on volumes.
– Volumes can migrate between servers.
– Volumes can be replicated in read-only fashion.
CSC 660: Advanced Operating Systems
Slide #25
NFSv4
IETF (2000)
Based on 1998 Sun draft.
New Features
– Only one protocol.
– Global namespace.
– Security (ACLs, Kerberos, encryption).
– Cross platform + internationalized.
– Better caching via delegation of files to clients.
CSC 660: Advanced Operating Systems
Slide #26
GoogleFS Assumptions
1. High rate of commodity hardware failures.
2. Small number of huge files (multi-GB +).
3. Reads: large streaming + small random.
4. Most modifications are appends.
5. High bandwidth >> low latency.
6. Applications / filesystem co-designed.
CSC 660: Advanced Operating Systems
Slide #27
GoogleFS Architecture
CSC 660: Advanced Operating Systems
Slide #28
GoogleFS Architecture
• Master server
– Metadata: namespace, ACL, chunk mapping.
– Chunk lease management, garbage collection, chunk migration.
• Chunk servers
– Serve chunks (64MB + checksum) of files.
– Chunks replicated on multiple (3) servers.
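A minimal sketch of the chunk lookup implied above, with hypothetical metadata tables: a byte offset selects a chunk handle, and the master's map gives the replica servers that hold it.

```python
CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB chunks

# Master metadata (illustrative): file name -> list of chunk handles,
# chunk handle -> the (3) chunk servers holding replicas.
chunk_map = {"/logs/web.log": ["ch-0", "ch-1"]}
chunk_locations = {"ch-0": ["cs1", "cs2", "cs3"], "ch-1": ["cs2", "cs4", "cs5"]}

def locate(path, offset):
    """Translate a byte offset into a chunk handle plus its replica servers."""
    handle = chunk_map[path][offset // CHUNK_SIZE]
    return handle, chunk_locations[handle]

print(locate("/logs/web.log", 70 * 1024 * 1024))   # ('ch-1', ['cs2', 'cs4', 'cs5'])
```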
CSC 660: Advanced Operating Systems
Slide #29
GoogleFS Writing
1. Client asks master which chunkserver holds the lease.
2. Master responds: leaseholder + replicas.
3. Client pushes data to all replicas.
4. Client sends write to the primary replica.
5. Primary forwards the request.
6. Secondaries reply to the primary on completion.
7. Primary replies to the client.
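A toy sketch of this write flow; the classes and method names are hypothetical and the network is elided, but the comments map each step to the numbers above.

```python
class Master:
    """Holds only metadata: which chunkserver currently holds the lease."""
    def __init__(self, primary, secondaries):
        self.primary, self.secondaries = primary, secondaries

    def locate(self, chunk_id):
        return self.primary, self.secondaries

class ChunkServer:
    def __init__(self, name):
        self.name, self.buffer, self.chunks = name, {}, {}

    def push(self, chunk_id, data):
        self.buffer[chunk_id] = data            # 3. data is only buffered here

    def apply(self, chunk_id):
        self.chunks.setdefault(chunk_id, b"")
        self.chunks[chunk_id] += self.buffer.pop(chunk_id)

class Primary(ChunkServer):
    def write(self, chunk_id, secondaries):
        self.apply(chunk_id)                    # 4. primary applies the buffered data
        acks = []
        for s in secondaries:                   # 5-6. forward to secondaries, collect replies
            s.apply(chunk_id)
            acks.append(s.name)
        return acks

def client_write(master, chunk_id, data):
    primary, secondaries = master.locate(chunk_id)   # 1-2. ask master for leaseholder + replicas
    for server in [primary] + secondaries:           # 3. push data to all replicas
        server.push(chunk_id, data)
    primary.write(chunk_id, secondaries)              # 4-6. send the write to the primary
    return "ok"                                       # 7. primary replies to the client

primary = Primary("cs1")
secondaries = [ChunkServer("cs2"), ChunkServer("cs3")]
master = Master(primary, secondaries)
print(client_write(master, "chunk-42", b"record"))    # ok
```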
CSC 660: Advanced Operating Systems
Slide #30
GoogleFS Consistency
File regions can be
Consistent: all clients see the same data.
Defined: consistent + clients will see entire write.
Inconsistent: different clients see different data.
Files can be modified by
Random write: data written at specified offset.
Record append: data is appended atomically at least once. Padding or record duplicates may be inserted as part of an append operation.
CSC 660: Advanced Operating Systems
Slide #31
GoogleFS Consistency
Writers deal with consistency issues by
1. Preferring appends to random writes.
2. Application-level checkpoints.
3. Self-identifying records with checksums.
Readers deal with consistency issues by
1. Processing file only up until checkpoint.
2. Ignoring padding.
3. Discarding records with duplicate checksums.
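A minimal reader sketch under an assumed record format (4-byte checksum + 2-byte length + payload). The format is made up for illustration, and it uses the checksum as the duplicate key as the list above suggests; a production format would more likely carry an explicit record ID as well.

```python
import zlib

PAD = b"\x00"

def make_record(payload):
    """Hypothetical self-identifying record: crc32 + length + payload."""
    return zlib.crc32(payload).to_bytes(4, "big") + len(payload).to_bytes(2, "big") + payload

def read_records(data):
    """Skip padding, drop corrupt regions, and discard duplicate records."""
    seen, out, i = set(), [], 0
    while i < len(data):
        if data[i:i + 1] == PAD:            # ignore padding inserted by appends
            i += 1
            continue
        crc = int.from_bytes(data[i:i + 4], "big")
        length = int.from_bytes(data[i + 4:i + 6], "big")
        payload = data[i + 6:i + 6 + length]
        i += 6 + length
        if zlib.crc32(payload) != crc:      # checksum mismatch: skip this record
            continue
        if crc in seen:                     # duplicate from a retried append
            continue
        seen.add(crc)
        out.append(payload)
    return out

log = make_record(b"a") + PAD * 3 + make_record(b"b") + make_record(b"b")
print(read_records(log))   # [b'a', b'b']
```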
CSC 660: Advanced Operating Systems
Slide #32
Chunk Replication
New Chunks
– Replicate new chunks on servers with below-average disk utilization.
– Limit the number of recent chunk creations on each server, due to imminent writes.
Re-replication
– Prioritize chunks based on how far each chunk is from its replication goal.
– Master clones a chunk by choosing a server and telling it to replicate the chunk from the closest replica.
– Master re-balances chunk distribution periodically.
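A hypothetical sketch of the two policies above: ordering chunks by distance from the replication goal, and placing new chunks on servers with below-average utilization. The data-structure fields are made up for illustration.

```python
GOAL = 3   # desired number of replicas per chunk

def rereplication_order(chunks):
    """Chunks furthest below the replication goal are cloned first."""
    return sorted(chunks, key=lambda c: GOAL - len(c["replicas"]), reverse=True)

def place_new_chunk(servers):
    """Prefer servers with below-average disk utilization."""
    average = sum(s["util"] for s in servers) / len(servers)
    candidates = [s for s in servers if s["util"] <= average]
    return min(candidates, key=lambda s: s["util"])

chunks = [
    {"id": "c1", "replicas": ["s1", "s2"]},
    {"id": "c2", "replicas": ["s1"]},            # furthest from goal: cloned first
    {"id": "c3", "replicas": ["s1", "s2", "s3"]},
]
servers = [{"name": "s1", "util": 0.9}, {"name": "s2", "util": 0.4}, {"name": "s3", "util": 0.6}]

print([c["id"] for c in rereplication_order(chunks)])   # ['c2', 'c1', 'c3']
print(place_new_chunk(servers)["name"])                  # 's2'
```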
CSC 660: Advanced Operating Systems
Slide #33
GoogleFS Reliability
Chunk-level reliability
Incremental checksums on each chunk.
Chunks replicated by default across 3 servers.
Single master server
Metadata stored in memory and in an operation log.
Metadata recovered by polling chunk servers.
Shadow masters provide read-only access if the primary is down.
CSC 660: Advanced Operating Systems
Slide #34
Common Problems
1. Consistency after crash.
2. Large contiguous allocations.
3. Metadata allocation.
CSC 660: Advanced Operating Systems
Slide #35
Consistency
• Detect + Repair
– Use fsck to repair.
– Journal replay.
• Always Consistent
– Copy on write.
CSC 660: Advanced Operating Systems
Slide #36
Large Contiguous Allocations
• Pre-allocation.
• Block groups.
• Multiple block sizes.
CSC 660: Advanced Operating Systems
Slide #37
Metadata Allocation
• Fixed number in one location.
• Fixed number spread across disk.
• Dynamically allocated in files.
CSC 660: Advanced Operating Systems
Slide #38
References
1. Jerry Breecher, “Distributed Filesystems,” http://cs.clarku.edu/~jbreecher/os/lectures/Section17-Dist_File_Sys.ppt
2. Florian Buchholz, “The structure of the Reiser file system,” http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php, 2006.
3. Remy Card, Theodore Ts'o, Stephen Tweedie, “Design and Implementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.
4. Sanjay Ghemawat et al., “The Google File System,” SOSP, 2003.
5. Christopher Hertel, Implementing CIFS, Prentice Hall, 2003.
6. Val Henson, “A Brief History of UNIX Filesystems,” http://infohost.nmt.edu/~val/fs_slides.pdf
7. Dave Hitz, James Lau, Michael Malcolm, “File System Design for an NFS File Server Appliance,” Proceedings of the USENIX Winter 1994 Technical Conference, http://www.netapp.com/library/tr/3002.pdf
8. John Howard et al., “Scale and Performance in a Distributed File System,” ACM Transactions on Computer Systems 6(1), 1988.
9. Marshall K. McKusick, “A Fast File System for Unix,” ACM Transactions on Computer Systems 2(3), 1984.
10. Brian Pawlowski et al., “The NFS Version 4 Protocol,” SANE 2000.
11. Daniel Robbins, “Advanced File System Implementor’s Guide,” IBM Developer Works, http://www128.ibm.com/developerworks/linux/library/l-fs9.html, 2002.
12. Claudia Rodriguez et al., The Linux Kernel Primer, Prentice Hall, 2005.
13. Mendel Rosenblum and John K. Ousterhout, “The Design and Implementation of a Log-structured Filesystem,” 13th ACM SOSP, 1991.
14. R. Sandberg, “Design and Implementation of the Sun Network Filesystem,” Proceedings of the USENIX 1985 Summer Conference, 1985.
15. Adam Sweeney et al., “Scalability in the XFS File System,” Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
16. Wikipedia, “Comparison of file systems,” http://en.wikipedia.org/wiki/Comparison_of_file_systems
CSC 660: Advanced Operating Systems
Slide #39