
File Systems: Designs
Kamen Yotov
CS 614 Lecture, 04/26/2001
Overview
The Design and Implementation of a Log-Structured File System
  Sequential structure
  Speeds up writes & crash recovery
The Zebra Striped Network File System
  Striping across multiple servers
  RAID-equivalent data recovery
Log-structured FS: Intro
An order of magnitude faster!?!
The future is dominated by writes
  Main memory keeps increasing
  Reads are handled by the cache
Logging is old, but now it is different
  NTFS, Linux kernel 2.4
Challenge: finding free space
Disk bandwidth utilization of 65-75% vs. 5-10%
Design Issues of the 1990s
Important factors
  CPU – exponential growth
  Main memory – caches, buffers
  Disks – bandwidth, access time
Workloads
  Small files – single random I/Os
  Large files – bandwidth vs. FS policies
Problems with current FS
Scattered information
  5 I/O operations to access a file under BSD
Synchronous writes
  May be only the metadata, but that is enough
  Hard to benefit from faster CPUs
Network file systems
  More synchrony, in the wrong place
Log-structured FS (LFS)
Fundamentals
  Buffer many small write operations
  Write them all at once to a single, contiguous region of the disk (see the sketch below)
Simple or not?
  How to retrieve information?
  How to manage the free space?
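
A minimal sketch of the buffering idea, in Python (my own illustration, not Sprite LFS code; the log is modeled as an append-only file and SEGMENT_SIZE matches the paper's 1 MB segments):

    SEGMENT_SIZE = 1 << 20  # 1 MB segments, as in the paper

    class LogBuffer:
        def __init__(self, log_path):
            self.log = open(log_path, "ab")   # append-only file stands in for the disk log
            self.pending = []                 # buffered (inode_no, block_no, data) writes
            self.pending_bytes = 0

        def write_block(self, inode_no, block_no, data):
            # small writes only accumulate in memory...
            self.pending.append((inode_no, block_no, data))
            self.pending_bytes += len(data)
            if self.pending_bytes >= SEGMENT_SIZE:
                self.flush()                  # ...and hit the disk as one large sequential write

        def flush(self):
            addresses = {}
            for inode_no, block_no, data in self.pending:
                addresses[(inode_no, block_no)] = self.log.tell()  # log offset = new block address
                self.log.write(data)
            self.log.flush()
            self.pending, self.pending_bytes = [], 0
            return addresses                  # caller updates inodes / the inode map from this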

LFS: File Location and Reading
Index structures permit random-access retrieval
  Inodes again, ...but now at arbitrary positions in the log!
  inode map: indexed, memory resident (sketch below)
Writes are better, while reads are at least as good!
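
A minimal sketch of how a read resolves through the inode map (hypothetical structures of my own; read_at and read_inode_at are assumed helpers on the log object, not Sprite LFS calls):

    class Inode:
        def __init__(self, block_addresses):
            self.block_addresses = block_addresses   # file block number -> address in the log

    class LFS:
        def __init__(self, log):
            self.log = log              # assumed to provide read_at(addr, length) and read_inode_at(addr)
            self.inode_map = {}         # memory resident: inode number -> log address of that inode
            self.inode_cache = {}

        def read_inode(self, inode_no):
            if inode_no not in self.inode_cache:
                addr = self.inode_map[inode_no]               # in-memory lookup, no extra disk I/O
                self.inode_cache[inode_no] = self.log.read_inode_at(addr)
            return self.inode_cache[inode_no]

        def read_block(self, inode_no, block_no, block_size=4096):
            inode = self.read_inode(inode_no)
            addr = inode.block_addresses[block_no]            # the inode points wherever the log put the block
            return self.log.read_at(addr, block_size)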
Example: Creating 2 Files
[Diagram: the log after creating two files – data blocks, directory blocks, inodes, and the inode map written sequentially]
LFS: Free Space Management
Need large free chunks of space
Threading
  Excessive fragmentation of free space
  Not better than other file systems
Copying
  Live data can be compacted to different places...
  Big copying costs
Combination
  Threading & copying
Solution: Segmentation
Large fixed-size segments (e.g. 1 MB)
  Threading through segments
  Copying inside segments
  Transfer time longer than seek time
Segment cleaning (liveness check sketched below)
  Which are the live chunks?
  To what file do they belong, and at what position? (inode update)
  Segment summary block(s)
  File version stamps
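
A minimal sketch of the liveness check the segment summary makes possible (hypothetical structures; the version-stamp shortcut for deleted or rewritten files is reflected below):

    def live_blocks(segment, inode_map, read_inode):
        """Return the summary entries of this segment whose blocks are still live."""
        live = []
        for entry in segment.summary:                # entry: inode_no, version, block_no, address
            if inode_map.get_version(entry.inode_no) != entry.version:
                continue                             # file deleted or superseded: block is dead
            inode = read_inode(entry.inode_no)
            if inode.block_addresses.get(entry.block_no) == entry.address:
                live.append(entry)                   # the inode still points here: block is live
        return live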
Segment Cleaning Policies
When should the cleaner execute?
  Continuously, at night, when space is exhausted
How many segments should be cleaned at once?
Which segments should be cleaned?
  The most fragmented ones, or...
How should the live blocks be grouped when written back?
  Locality for future reads...
Measuring & Analysing
Write cost (derivation below)
  Average time the disk is busy per byte of new data written, including cleaning overhead
  1.0 is perfect – full bandwidth, no overhead
  Bigger is worse
  In LFS, seek and rotational latency are negligible, so it is just total bytes moved / new data written
Performance trade-off: utilization vs. speed
The key: a bimodal segment distribution!
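
The write-cost number behind these bullets follows the paper's derivation: cleaning N segments reads N segments of data, writes back N·u of live data (u = average utilization of the cleaned segments), and yields N·(1-u) of space for new data, so

    \text{write cost}
      = \frac{\text{total bytes read and written}}{\text{new data written}}
      = \frac{N + Nu + N(1-u)}{N(1-u)}
      = \frac{2}{1-u}

At u = 0.8 the write cost is 10, while at u = 0.2 it is only 2.5, which is exactly the utilization vs. speed trade-off above.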
Simulation & Results
Purpose: analyze different cleaning policies
Harsh model (sketch below)
  The file system is modeled as a set of 4 KB files
  At each step a file is chosen and rewritten in its entirety
  Uniform: each file equally likely to be chosen
  Hot-and-cold: the 10-90 formula (10% of the files receive 90% of the writes)
Runs until the write cost stabilizes
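
A minimal sketch of the two access patterns (my reconstruction, not the paper's simulator; file and step counts are arbitrary):

    import random

    def choose_file(num_files, hot_and_cold=False):
        if not hot_and_cold:
            return random.randrange(num_files)      # uniform: every file equally likely
        hot = num_files // 10                       # 10% of the files are "hot"...
        if random.random() < 0.9:                   # ...and receive 90% of the writes
            return random.randrange(hot)
        return random.randrange(hot, num_files)

    def simulate(num_files=10000, steps=100000, hot_and_cold=False):
        overwrites = [0] * num_files
        for _ in range(steps):
            overwrites[choose_file(num_files, hot_and_cold)] += 1
        return overwrites                           # per-file churn that drives segment cleaning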
Write Cost vs. Disk Utilization
[Chart: write cost (no variance) vs. disk utilization, 0.1-0.9, with curves for FFS today, FFS improved, LFS uniform, and LFS hot-and-cold]
Hot & Cold Segments
Why is locality worse than no locality?
  Free space in cold segments is more valuable
  Value is based on data stability
  Approximate stability with age
Cost-benefit policy (sketch below)
  Benefit
    Space cleaned (inverse of the segment's utilization)
    Time the space stays free (timestamp of the youngest block)
  Cost
    Reading the segment + writing back the live data
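
The policy reduces to a single ratio (this is the formula from the LFS paper; the selection code around it is my own sketch):

    def cost_benefit(u, age_of_youngest_block):
        # benefit = free space generated * how long it is likely to stay free
        # cost    = 1 (read the whole segment) + u (write back the live data)
        return (1.0 - u) * age_of_youngest_block / (1.0 + u)

    def pick_segments_to_clean(segments, count):
        # segments: iterable of (segment_id, utilization, age_of_youngest_block)
        ranked = sorted(segments, key=lambda s: cost_benefit(s[1], s[2]), reverse=True)
        return [seg_id for seg_id, _, _ in ranked[:count]]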
Segment Utilization Distributions
Fraction of segments (0.001)
9
8
7
Hot-and-cold (greedy)
6
5
Uniform
4
3
2
1
Hot-and-cold (cost-benefit) Segment utilization
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Write Cost vs. Disk Utilization (revisited)
[Chart: write cost (no variance) vs. disk utilization, 0.1-0.9, with curves for FFS today, FFS improved, LFS uniform, and LFS cost-benefit]
Crash Recovery
Current file systems require a full scan
Log-based systems are definitely better
Check-pointing (two-phase, with trade-offs)
  (Meta-)information – in the log
  Checkpoint region – at a fixed position, containing:
    inode map blocks
    segment usage table
    time & last segment written
Roll-forward
Crash Recovery (cont.)
Naïve method: on a crash, just use the latest checkpoint and go from there!
Roll-forward recovery (sketch below)
  Scan the segment summary blocks written after the checkpoint for new inodes
  If there is data but no inode, assume the write was incomplete and ignore it
  Adjust segment utilizations
  Restore consistency between directory entries and inodes (special records in the log serve this purpose)
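
A minimal sketch of roll-forward (hypothetical structures; recompute_live_bytes stands in for the utilization adjustment the slide mentions):

    def roll_forward(checkpoint, segments_after_checkpoint):
        inode_map = dict(checkpoint.inode_map)            # inode number -> log address of the inode
        segment_usage = dict(checkpoint.segment_usage)    # segment id -> live bytes

        for segment in segments_after_checkpoint:
            for entry in segment.summary:
                if entry.kind == "inode":
                    inode_map[entry.inode_no] = entry.address   # the newest inode wins
                # data blocks whose inode never reached the log are simply ignored:
                # the write that produced them is treated as incomplete
            segment_usage[segment.id] = recompute_live_bytes(segment, inode_map)

        return inode_map, segment_usage

    def recompute_live_bytes(segment, inode_map):
        # crude stand-in for the utilization adjustment described above
        return sum(e.length for e in segment.summary if e.inode_no in inode_map)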
Experience with Sprite LFS
Part of the Sprite Network Operating System
Everything implemented, but roll-forward disabled
  A short 30-second check-pointing interval is used instead
Not more complicated to implement than a normal "allocation" file system
  NTFS and Ext2 are even more complex...
No great improvement visible to the user, as few applications are disk-bound!
So, let's go micro!
Micro-benchmarks were produced
10 times faster when creating small files
Faster at reading when the read order matches the write order
The only case where Sprite LFS is slower:
  Write a file randomly
  Read it back sequentially
  The locality produced on disk differs a lot!
Sprite LFS vs. Sun OS 4.0.3
                          Sprite LFS                      SunOS 4.0.3
  Sizes                   4 KB block, 1 MB segment        8 KB block
  Small writes/deletes    ~10x speed-up                   Slow
  Locality                Temporal                        Logical
  Bottleneck              Saturates the CPU!!!            Keeps the disk busy
  Strong case             Random write                    Sequential read
Related Work
WORM media – have always been logged
  Maintain indexes
  No deletion necessary
Garbage collection
  Scavenging corresponds to segment cleaning
  Generational collection corresponds to the cost-benefit scheme
  Difference: random vs. sequential access
Logging similar to database systems
  Use of the log differs (as in NTFS, Linux 2.4)
  Recovery is like "redo" logging
Zebra Networked FS: Intro
Multi-server networked file system
  Clients stripe their data across the servers
  Redundancy ensures fault tolerance & recoverability
Suitable for multimedia & parallel workloads
Borrows principles from RAID and LFS
Achieves speed-ups from 20% to 5x
Zebra: Background
[Diagram: a RAID – data and parity fragments organized into stripes]
Definitions
  Stripes
  Fragments
Problems
  Bandwidth bottleneck
  Small files: a standard RAID small write takes 4 I/Os (2 reads + 2 writes of data and parity)
Differences with Distributed File Systems
  Per-file vs. per-client striping
  Data distribution, parity distribution, storage efficiency
[Diagram: per-file striping of a large file, of many small files packed LFS-style, and of individual small files (1) and (2) into fragments 1-6 across the servers]
Zebra: Network LFS
Logging between clients and servers (as opposed to between the file server and its disks)
Per-client striping (sketch below)
  More efficient use of storage space
  The parity mechanism is simplified
    No overhead for small files
    Parity never needs to be modified
Typical distributed-computing problems remain
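
A minimal sketch of per-client striping with a parity fragment (my own illustration, not Zebra code; FRAGMENT_SIZE is an assumed value):

    FRAGMENT_SIZE = 512 * 1024   # assumed fragment size, for illustration only

    def make_stripe(client_log, num_data_servers):
        # split this client's log into one fragment per data server (zero-padded)
        fragments = []
        for i in range(num_data_servers):
            chunk = client_log[i * FRAGMENT_SIZE:(i + 1) * FRAGMENT_SIZE]
            fragments.append(chunk.ljust(FRAGMENT_SIZE, b"\0"))
        # parity fragment = XOR of the data fragments; since fragments are written once
        # and never overwritten, the parity never has to be updated in place
        parity = bytearray(FRAGMENT_SIZE)
        for frag in fragments:
            for j, byte in enumerate(frag):
                parity[j] ^= byte
        return fragments, bytes(parity)

Any single lost fragment can be rebuilt by XOR-ing the parity fragment with the surviving data fragments, which is the RAID-style recovery the slides refer to.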
Zebra: Components
File Manager
Stripe Cleaner
Storage Servers
Clients
The File Manager and Stripe Cleaner may reside on a Storage Server as separate processes – useful for fault tolerance!
[Diagram: clients, the File Manager, and the Stripe Cleaner connected to multiple Storage Servers over a fast network]
Zebra: Component Dynamics
Clients
  Location, fetching & delivery of fragments
  Striping, parity computation, writing
Storage Servers (interface sketched below)
  Bulk data repositories
  Fragment operations: Store, Append, Retrieve, Delete, Identify
  Synchronous, non-overwrite semantics
File Manager
  Metadata repository – just pointers to blocks
  RPC bottleneck for many small files
  Can run as a separate process on a Storage Server
Stripe Cleaner
  Similar to the Sprite LFS cleaner we discussed
  Runs as a separate, user-mode process
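
A minimal sketch of the fragment operations listed above (hypothetical in-memory API of my own, not Zebra's actual RPC interface):

    class StorageServer:
        def __init__(self):
            self.fragments = {}                        # fragment_id -> bytes

        def store(self, fragment_id, data):
            if fragment_id in self.fragments:
                raise ValueError("non-overwrite semantics: fragment already exists")
            self.fragments[fragment_id] = bytes(data)  # "synchronous": durable when the call returns

        def append(self, fragment_id, data):
            self.fragments[fragment_id] = self.fragments.get(fragment_id, b"") + bytes(data)

        def retrieve(self, fragment_id, offset=0, length=None):
            frag = self.fragments[fragment_id]
            return frag[offset:] if length is None else frag[offset:offset + length]

        def delete(self, fragment_id):
            del self.fragments[fragment_id]            # whole fragments only, driven by the cleaner

        def identify(self):
            return sorted(self.fragments)              # lets others discover what is stored, e.g. after a crash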
Zebra: System Operation - Deltas
Communication via deltas (sketch below)
  Fields
    File ID, file version, block #
    Old & new block pointers
  Types
    Update, Cleaner, Reject
  Reliable, because they are stored in the log
    Replayed after crashes
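
A minimal sketch of a delta record with the fields from this slide (hypothetical Python representation, not Zebra's wire format):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Delta:
        kind: str                          # "update", "cleaner", or "reject"
        file_id: int
        file_version: int
        block_number: int
        old_block_pointer: Optional[int]   # previous log address of the block (None for a new block)
        new_block_pointer: Optional[int]   # current log address (None if the block was deleted)

    # Example: a client rewrote block 7 of file 42 at a new position in its log.
    update = Delta("update", file_id=42, file_version=3, block_number=7,
                   old_block_pointer=0x1000, new_block_pointer=0x9F000)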

Zebra: System Operation (cont.)
Writing files (flush logic sketched below)
  Flushes are triggered by
    Threshold age (30 s)
    Cache full & dirty
    Application fsync
    File Manager request
  Striping
  Delta updates
  Concurrent transfers
Reading files
  Nearly identical to a conventional FS
  Good client caching
  Consistency
Stripe cleaning
  Choosing which stripes to clean...
  Space utilization is tracked through deltas
  Stripe Status File
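
A minimal sketch of the client's flush decision (my own illustration; only the 30-second threshold comes from the slide, the rest of the structure is assumed):

    import time

    FLUSH_AGE_SECONDS = 30        # threshold age from the slide

    class ClientWriteBuffer:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.dirty_bytes = 0
            self.oldest_dirty_time = None

        def note_write(self, nbytes):
            if self.oldest_dirty_time is None:
                self.oldest_dirty_time = time.time()
            self.dirty_bytes += nbytes

        def should_flush(self, fsync_requested=False, file_manager_request=False):
            if fsync_requested or file_manager_request:
                return True                            # explicit requests always force a flush
            if self.dirty_bytes >= self.capacity:
                return True                            # cache full of dirty data
            if (self.oldest_dirty_time is not None and
                    time.time() - self.oldest_dirty_time >= FLUSH_AGE_SECONDS):
                return True                            # dirty data has aged past the threshold
            return False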
Zebra: Advanced System Operations
Adding Storage Servers
  Scalable
Restoring from crashes
  Consistency & availability
  Specific issues due to distributed system state
    Internally inconsistent stripes
    Stripe information inconsistent with the File Manager
    Stripe Cleaner state inconsistent with the Storage Servers
Logging and check-pointing
  Fast recovery after failures
Prototyping
Most of the interesting parts exist only on paper
Included
  All UNIX file commands and file system semantics
  A functional cleaner
  Clients construct fragments and write parity
  File Manager and Storage Servers checkpoint
Some advanced crash recovery methods omitted
  Metadata not yet stored on the Storage Servers
  Clients do not automatically reconstruct fragments upon a Storage Server crash
  Storage Servers do not reconstruct fragments on recovery
  File Manager and Cleaner are not automatically restarted
Measurements: Platform
Cluster of DECstation-5000 Model 200 machines
  100 Mb/s FDDI local network ring
  20 SPECint
  32 MB RAM
  12 MB/s memory-to-memory copy
  8 MB/s memory-to-controller copy
  RZ57 1 GB disks, 15 ms seek
    2 MB/s native transfer bandwidth
    1.6 MB/s real transfer bandwidth (due to the controller)
    Caching disk controllers (1 MB)
Measurements: Results (1)
Large File Writes
[Chart: total throughput (MB/s) vs. number of servers (1-4) for 1, 2, and 3 Zebra clients, 1 client with parity, Sprite, and NFS]
Measurements: Results (2)
Large File Reads
[Chart: total throughput (MB/s) vs. number of servers (1-4) for 1, 2, and 3 Zebra clients, 1 client reconstructing, Sprite, and NFS]
Measurements: Results (3)
Small File Writes
[Chart: elapsed time (seconds) per file system – NFS, Sprite, Zebra, Sprite N.C., Zebra N.C. – broken down into open/close, write, client flush, and server flush]
Measurements: Results (4)
Resource Utilization
[Chart: % utilization of the File Manager CPU and disk, client CPU, and Storage Server CPU and disk, for Zebra and Sprite under large writes (LW), large reads (LR), and small writes (SW)]
Zebra: Conclusions
Pros
  Applies parity and log structure to network file systems
  Performance
  Scalability
  Cost-effective servers
  Availability
  Simplicity
Cons
  Lacks name caching, causing severe performance degradation
  Not well suited for transaction processing
  Metadata problems
  Small reads are problematic again