File Systems: Designs
Kamen Yotov
CS 614 Lecture, 04/26/2001
Overview
The Design and Implementation of a
Log-Structured File System
Sequential Structure
Speeds up Writes & Crash Recovery
The Zebra Striped Network File
System
Striping across multiple servers
RAID equivalent data recovery
Log-structured FS: Intro
Order of magnitude faster!?!
The future is dominated by writes
Main memory increases
Reads are handled by cache
Logging is old, but now is different
NTFS, Linux Kernel 2.4
Challenge – finding free space
Bandwidth utilization 65-75% vs. 5-10%
Design issues of the 1990's
Importance factors
CPU – exponential growth
Main memory – caches, buffers
Disks – bandwidth, access time
Workloads
Small files – single random I/Os
Large files – bandwidth vs. FS policies
Problems with current FS
Scattered information
5 I/O operations to access a file under BSD
Synchronous writes
May be only the meta-data, but it’s enough
Hard to benefit from the faster CPUs
Network file systems
More synchrony in the wrong place
Log-structured FS (LFS)
Fundamentals
Buffering many small write operations
Writing them out at once in a single,
contiguous disk write (sketched below)
Simple or not?
How to retrieve information?
How to manage the free space?
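A minimal C sketch of this buffering idea, under stated assumptions: the names (segment_buffer, log_append, write_segment_to_disk) are illustrative and not from Sprite LFS; small writes accumulate in an in-memory segment and reach the disk in one large sequential transfer.

/* Hypothetical sketch of LFS write buffering; names are illustrative. */
#include <stddef.h>
#include <string.h>

#define SEGMENT_SIZE (1 << 20)                 /* 1 MB segment, as in Sprite LFS */

static char   segment_buffer[SEGMENT_SIZE];    /* in-memory segment              */
static size_t segment_used;                    /* bytes buffered so far          */

/* Placeholder for the single large sequential disk write. */
static void write_segment_to_disk(const char *buf, size_t len) { (void)buf; (void)len; }

/* Append one small write to the current segment; flush when it fills up.  */
/* Assumes len <= SEGMENT_SIZE.                                            */
void log_append(const void *data, size_t len)
{
    if (segment_used + len > SEGMENT_SIZE) {             /* segment is full  */
        write_segment_to_disk(segment_buffer, segment_used);
        segment_used = 0;                                 /* start a new one  */
    }
    memcpy(segment_buffer + segment_used, data, len);     /* buffer the write */
    segment_used += len;
}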
LFS: File Location and Reading
Index structures to permit random
access retrieval
Inodes again, …but at arbitrary positions
in the log!
inode map: indexed, memory resident (lookup sketched below)
Writes are better, while reads are at
least as good!
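A hedged sketch of the read path through the inode map; the structures and helpers (inode_map, read_inode_at, read_block) are hypothetical stand-ins, not Sprite LFS code. Once the inode is found, random access proceeds exactly as in FFS.

/* Hypothetical sketch of reading one block via the inode map. */
#include <stdint.h>

struct inode {                        /* simplified inode                   */
    uint64_t block_addr[12];          /* direct pointers to data blocks     */
};

/* Memory-resident inode map: inode number -> log address of the inode.    */
extern uint64_t inode_map[];

extern struct inode *read_inode_at(uint64_t log_addr);   /* one disk read  */
extern void read_block(uint64_t log_addr, void *buf);    /* one disk read  */

void read_file_block(uint32_t inum, int blkno, void *buf)
{
    struct inode *ip = read_inode_at(inode_map[inum]);    /* locate the inode */
    read_block(ip->block_addr[blkno], buf);               /* then the data    */
}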
Example: Creating 2 Files
[Figure: log layout after creating two files – data blocks, inodes, directory blocks, and the inode map written sequentially to the log]
LFS: Free Space Management
Need large free chunks of space
Threading
Excessive fragmentation of free space
Not better than other file systems
Copying
Can be to different places…
Big costs
Combination
Threading & copying
Solution: Segmentation
Large fixed-size blocks (e.g. 1MB)
Threading through segments
Copying inside segments
Transfer time longer than seek time
Segment cleaning
Which chunks are live
To which file they belong, and at what position
(for inode updates)
Segment summary block(s) (layout sketched below)
File version stamps
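A sketch of what a segment summary entry might record, matching the bullets above; the field names are assumptions for illustration, not the Sprite LFS on-disk layout.

/* Hypothetical segment summary entry; field names are illustrative. */
#include <stdint.h>

struct summary_entry {
    uint32_t inum;      /* which file the block belongs to            */
    uint32_t blkno;     /* position of the block within that file     */
    uint32_t version;   /* file version stamp at the time of writing  */
};

extern uint32_t current_version(uint32_t inum);   /* from the inode map */

/* Quick liveness pre-check: a stale version stamp means the block is dead  */
/* without reading the inode; a matching stamp still requires comparing the */
/* inode's block pointer against this block's log address.                  */
int possibly_live(const struct summary_entry *e)
{
    return current_version(e->inum) == e->version;
}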
Segment Cleaning Policies
When should the cleaner execute?
Continuously, At night, When exhausted
How many segments to clean at once?
Which segments are to be cleaned?
Most fragmented ones or…
How should the live blocks be grouped
when written back?
Locality for future reads…
Measuring & Analysing
Write cost
Average amount of time the disk is busy per byte of new data
written, including cleaning overhead
1.0 is perfect – full bandwidth, no overhead
Bigger is worse
LFS: seek and rotational latency are negligible, so it's
just total bytes moved divided by new data written (formula below)
Performance trade-off: utilization vs. speed
The key: bimodal segment distribution!
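Under the paper's steady-state model the write cost follows directly from the utilization u of the segments being cleaned: freeing the space of one segment reads the whole segment and writes back the live fraction u, leaving (1 − u) for new data.

\[
\text{write cost} \;=\; \frac{\text{total bytes read and written}}{\text{new data written}}
\;=\; \frac{1 + u + (1 - u)}{1 - u} \;=\; \frac{2}{1 - u}
\]

(Special case: a completely empty segment need not be read or copied at all, so the write cost is 1 when u = 0.)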
Simulation & Results
Purpose: Analyze different cleaning
policies
Harsh model
File system is modeled as a set of 4K files
At each step a file is chosen and rewritten
Uniform: each file equally likely to be chosen
Hot-and-cold: 90% of the writes go to 10% of the files
Runs until write cost is stabilized
Write Cost vs. Disk Utilization
[Figure: write cost vs. disk utilization (0.1–0.9); curves for FFS today, FFS improved, LFS uniform, and LFS hot-and-cold, plus a no-variance reference]
Hot & Cold Segments
Why is locality worse than no locality?
Free space valuable in cold segments
Value based on data stability
Approximate stability with age
Cost-benefit policy (formula below)
Benefit: free space reclaimed (1 – segment utilization) times how long
it is likely to stay free (estimated from the age of the youngest block)
Cost: reading the whole segment plus writing back its live data
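Putting the two together gives the selection criterion from the paper, with u the segment's utilization and age the time since its youngest block was modified; the cleaner picks the segments with the highest ratio.

\[
\frac{\text{benefit}}{\text{cost}} \;=\; \frac{\text{free space generated} \times \text{age of data}}{\text{cost}}
\;=\; \frac{(1 - u) \cdot \text{age}}{1 + u}
\]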
Segment Utilization Distributions
[Figure: fraction of segments vs. segment utilization (0.1–0.9); distributions for uniform, hot-and-cold with the greedy policy, and hot-and-cold with the cost-benefit policy]
Write Cost vs. Disk Utilization
(revisited)
[Figure: write cost vs. disk utilization (0.1–0.9), revisited; curves for FFS today, FFS improved, LFS uniform, and LFS cost-benefit, plus a no-variance reference]
Crash Recovery
Current file systems require a full scan
Log-based systems are definitely better
Check-pointing (two-phase, trade-offs)
(Meta-)information – log
Checkpoint region – fixed position (layout sketched below)
inode map blocks
segment usage table
time & last segment written
Roll-forward
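A hedged C sketch of what a checkpoint region could hold, matching the bullets above; field names and array sizes are illustrative assumptions, not the Sprite LFS layout.

/* Hypothetical checkpoint region; names and sizes are illustrative. */
#include <stdint.h>

#define IMAP_BLOCKS  64     /* assumed counts, chosen only for the sketch */
#define USAGE_BLOCKS 64

struct checkpoint_region {
    uint64_t timestamp;                    /* time of the checkpoint             */
    uint64_t last_segment_written;         /* where roll-forward starts scanning */
    uint64_t inode_map_blocks[IMAP_BLOCKS];      /* addresses of inode map blocks      */
    uint64_t segment_usage_blocks[USAGE_BLOCKS]; /* addresses of segment usage table   */
};

/* Two checkpoint regions at fixed disk positions are written alternately   */
/* (two-phase); on reboot the one with the newest valid timestamp is used.  */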
Crash Recovery (cont.)
Naïve method: On a crash, just use the latest
checkpoint and go from there!
Roll-forward recovery (sketched below)
Scan segment summary blocks for new inodes
If there is data but no inode, assume the write was incomplete and
ignore it
Adjust utilization of segments
Restore consistency between directory entries and
inodes (special records in the log for the purpose)
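A pseudocode-style C sketch of the roll-forward loop just described; all the helpers (next_segment_after, read_summary, and so on) are hypothetical names used only to make the steps concrete.

/* Hypothetical roll-forward after loading the latest checkpoint. */
#include <stdint.h>

struct summary { int placeholder; /* real summary contents elided */ };

extern int  next_segment_after(uint64_t *seg);             /* walk the log tail   */
extern int  read_summary(uint64_t seg, struct summary *s);
extern int  summary_has_new_inode(const struct summary *s);
extern void update_inode_map(const struct summary *s);     /* adopt the new data  */
extern void adjust_segment_usage(uint64_t seg);
extern void fix_directory_consistency(void);                /* special log records */

void roll_forward(uint64_t last_checkpointed_segment)
{
    uint64_t seg = last_checkpointed_segment;
    struct summary s;

    while (next_segment_after(&seg) && read_summary(seg, &s)) {
        if (summary_has_new_inode(&s)) {       /* data with an inode: keep it   */
            update_inode_map(&s);
            adjust_segment_usage(seg);
        }                                      /* data without an inode: ignore */
    }
    fix_directory_consistency();               /* directory entries vs. inodes  */
}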
Experience with Sprite LFS
Part of Sprite Network Operating System
Everything implemented, but roll-forward disabled
Short 30 second check-pointing interval
Not more complicated to implement than a
normal “allocation” file system
NTFS and Ext2 even more…
No great improvement for the user, since few
applications are disk-bound!
So, let’s go micro!
Micro-benchmarks were produced
10 times faster when creating small files
Faster at reading, provided files are read in the order they were written
Only case slower in Sprite is
Write file randomly
Read it sequentially
The locality each system produces differs a lot!
Sprite LFS vs. Sun OS 4.0.3
Sprite LFS: 4 KB blocks, 1 MB segments; ~10x speed-up in writes/deletes;
temporal locality (layout follows write order, so random writes are the favourable case); saturates the CPU
SunOS 4.0.3: 8 KB blocks; slow on writes/deletes;
logical locality (layout follows file order, so sequential reads are the favourable case); keeps the disk busy
Related Work
WORM media – have always been log-structured
Maintain indexes
No deletion necessary
Garbage collection
Scavenging = Segment cleaning
Generational = Cost-benefit scheme
Difference: random vs. sequential
Logging similar to database systems
Use of the log differs (like NTFS, Linux 2.4)
Recovery resembles database redo logging
Zebra Networked FS: Intro
Multi-server networked file system
Clients stripe their data across the storage servers
Redundancy ensures fault-tolerance &
recoverability
Suitable for multimedia & parallel tasks
Borrows from RAID and LFS principles
Achieves speed-ups from 20% to 5x
Zebra: Background
[Figure: RAID background – data blocks plus a parity block forming a stripe across disks]
Definitions
Stripes
Fragments
Problems
Bandwidth bottleneck
Small files
Differences with Distributed File Systems
Per-file vs. per-client striping
RAID standard: 4 I/Os for a small write (2 reads + 2 writes of data and parity)
LFS: batches small writes to avoid this penalty
Data distribution
Parity distribution
Storage efficient
[Figure: per-file striping of one large file vs. LFS-style per-client striping of many small files across six fragments, showing data and parity placement]
Zebra: Networked LFS
Logging between clients and servers
(as opposed to file server and disks)
Per-client striping
More efficient storage space usage
Parity mechanism is simplified
No overhead for small files
Never needs to be modified
Typical distributed computing problems
Zebra: Components
File Manager
Stripe Cleaner
Storage Servers
Clients
File Manager and Stripe Cleaner may reside on a Storage Server as separate processes – useful for fault tolerance!
[Diagram: several clients connected through a fast network to multiple Storage Servers, with the File Manager and Stripe Cleaner alongside]
Zebra: Component Dynamics
Clients
Location, fetching &
delivery of fragments
Striping, parity
computation, writing
Storage servers
Bulk data repositories
Fragment operations: Store, Append,
Retrieve, Delete, Identify (interface sketched below)
Synchronous, non-overwrite semantics
File Manager
Meta-data repository
Just pointers to blocks
RPC bottleneck for many
small files
Can run as a separate
process on a Storage
Server
Stripe Cleaner
Similar to the Sprite LFS
we discussed
Runs as a separate, user
mode process
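A hypothetical client-side view of the Storage Server fragment operations named above; the signatures are assumptions made for illustration and are not the actual Zebra RPC interface.

/* Hypothetical Storage Server fragment interface; signatures are illustrative. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t fragment_id_t;

int ss_store(fragment_id_t id, const void *data, size_t len);         /* write a new fragment  */
int ss_append(fragment_id_t id, const void *data, size_t len);        /* extend a fragment     */
int ss_retrieve(fragment_id_t id, size_t off, void *buf, size_t len); /* read part of one      */
int ss_delete(fragment_id_t id);                                      /* free after cleaning   */
int ss_identify(fragment_id_t *ids, size_t max);                      /* list stored fragments */

/* Fragments are never overwritten in place (non-overwrite semantics), and  */
/* each call is synchronous: it returns only after the data is on disk.     */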
Zebra: System Operation - Deltas
Communication via Deltas
Fields (sketched below)
File ID, File version, Block #
Old & New block pointers
Types
Update, Cleaner, Reject
Reliable, because stored in the log
Replay after crashes
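A hedged sketch of a delta record carrying the fields listed above; the names and widths are illustrative, not the actual Zebra on-log format.

/* Hypothetical Zebra delta record; field names are illustrative. */
#include <stdint.h>

enum delta_type { DELTA_UPDATE, DELTA_CLEANER, DELTA_REJECT };

struct delta {
    enum delta_type type;
    uint64_t file_id;         /* which file                               */
    uint64_t file_version;    /* detects conflicting updates              */
    uint64_t block_no;        /* which block of the file                  */
    uint64_t old_block_ptr;   /* previous location of the block (if any)  */
    uint64_t new_block_ptr;   /* new location in the client's log         */
};

/* Deltas are stored in the log itself, so they survive crashes and can be  */
/* replayed by the File Manager and the Stripe Cleaner during recovery.     */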
Zebra: System Operation (cont.)
Writing files
Flushes on
Threshold age (30 s)
Cache full & dirty
Application fsync
File manager request
Striping
Deltas update
Concurrent transfers
Reading files
Nearly identical to
conventional FS
Good client caching
Consistency
Stripe cleaning
Choosing which to…
Space utilization
through deltas
Stripe Status File
Zebra: Advanced
System Operations
Adding Storage Servers
Scalable
Restoring from crashes
Consistency & Availability
Specifics due to distributed system state
Internally inconsistent stripes
Stripe information inconsistent with File Manager
Stripe cleaner state consistency with Storage Servers
Logging and check-pointing
Fast recoveries after failures
Prototyping
Most of the interesting parts only on paper
Included
All UNIX file commands, file system semantics
Functional cleaner
Clients construct fragments and write parities
File Manager and Storage Servers checkpoint
Some advanced crash recovery methods omitted
Metadata not yet stored on Storage Servers
Clients do not automatically reconstruct fragments upon a
Storage Server crash
Storage Servers do not reconstruct fragments on recovery
File Manager and Cleaner not automatically restarted
Measurements: Platform
Cluster of DECstation-5000 Model 200
100 Mb/s FDDI local network ring
20 SPECint
32 MB RAM
12 MB/s memory to memory copy
8 MB/s memory to controller copy
RZ57 1GB disks, 15ms seek
2 MB/s native transfer bandwidth
1.6 MB/s real transfer bandwidth (due to controller)
Caching disk controllers (1MB)
Measurements: Results (1)
Large File Writes
[Figure: total throughput (MB/s) vs. number of servers (1–4) for large file writes; curves for 1, 2, and 3 Zebra clients, 1 client with parity, Sprite, and NFS]
Measurements: Results (2)
Large File Reads
[Figure: total throughput (MB/s) vs. number of servers (1–4) for large file reads; curves for 1, 2, and 3 Zebra clients, 1 client with reconstruction, Sprite, and NFS]
Measurements: Results (3)
Small File Writes
[Figure: elapsed time (seconds) for small file writes under NFS, Sprite, Zebra, Sprite N.C., and Zebra N.C.; bars broken down into open/close, write, client flush, and server flush]
Measurements: Results (4)
Resource Utilization
[Figure: percentage utilization of File Manager CPU and disk, client CPU, and Storage Server CPU and disk for Zebra and Sprite under large writes (LW), large reads (LR), and small writes (SW)]
Zebra: Conclusions
Pros
Applies parity and
log structure to
network file systems
Performance
Scalability
Cost-effective
servers
Availability
Simplicity
Cons
Lacks name caching,
causing severe
performance
degradations
Not well suited for
transaction processing
Metadata problems
Small reads are
problematic again