The design and implementation of a log-structured file system


M. Rosenblum and J.K. Ousterhout
The design and implementation
of a log-structured file system
Proceedings of the 13th ACM Symposium on
Operating Systems Principles, December 1991
Lanfranco Muzi
PSU – May 26th, 2005
Presentation outline
• Motivation
• Basic functioning of Sprite LFS
• Design issues and choices
• Performance of Sprite LFS
• Conclusions
Motivation
“New” facts…
• CPU speeds were dramatically increasing
(1991 - and continued to do so…)
• Memory was becoming cheaper and larger
• Disks had larger capacities, but performance did not keep
up with other components: access time dominated by
seek and rotational latency (mechanical issues)
Consequences…
Motivation
…consequences
• Applications become more disk-bound
• Size of cache increases
• Most read requests hit in cache
• All writes must eventually go to disk (safety)
• Higher write traffic fraction
• But a file system optimized for reads pays a high
cost during writes (to achieve logical locality)
Problems with conventional
file systems - 1
Information is “spread around” on disk
E.g. creating a new file in FFS requires 5 disk
I/Os:
 2 for the file i-node
 1 for the file data
 2 for the directory i-node and data
Seeking takes much longer than writing the data
itself for small files, which are the focus of
this study
Problems with conventional
file systems - 2
Tendency to write synchronously
E.g. Unix FFS:
 Data blocks written asynchronously…
 …But metadata (i-nodes, directories)
written synchronously
• Synchronous writes tie application
performance (and CPU usage) to the disk
• Again, seeking for metadata updates
dominates write performance for small files
The Sprite LFS
• Write asynchronously: buffer a series of
writes in memory
• Periodically copy the buffer to disk
 In a single write
 To a single contiguous segment (data
blocks, attributes, directories…)
 Rewrite instead of updating in place
• All information on disk is in a single sequential
structure: the log
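
A minimal sketch of this write path (Python; the class and method names are hypothetical illustrations, not from Sprite LFS, which is kernel C code):

class LogWriter:
    """Buffer updates in memory and flush them as one sequential segment write."""

    def __init__(self, disk, segment_size=512 * 1024):
        self.disk = disk              # assumed to expose a sequential write()
        self.segment_size = segment_size
        self.buffer = []              # pending blocks: data, i-nodes, directories…
        self.buffered_bytes = 0

    def write_block(self, block: bytes):
        # Never update in place: every change is appended to the buffer.
        self.buffer.append(block)
        self.buffered_bytes += len(block)
        if self.buffered_bytes >= self.segment_size:
            self.flush()

    def flush(self):
        # One contiguous write of a whole segment: a single seek,
        # instead of one seek per block as in a conventional file system.
        self.disk.write(b"".join(self.buffer))
        self.buffer.clear()
        self.buffered_bytes = 0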
Sprite LFS – main issues
1) How to retrieve information from the
log?
2) How to make large extents of free
space always available for writing
contiguous segments?
File location and reading
Basic data structures analogous to Unix FFS:
 One i-node per file: attributes, addresses of
the first 10 blocks or of indirect blocks
…But i-nodes are in the log, i.e. NOT at fixed
locations on disk…
New data structure: inode map
 Located in the log (Sprite LFS keeps no free list)
 A fixed checkpoint region on disk holds the
addresses of all map blocks
 Indexed by a file's identifying number; gives
the location of the corresponding i-node
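
A minimal sketch of the two-level lookup (Python; all names and the block geometry are assumptions for illustration):

ENTRIES_PER_MAP_BLOCK = 1024   # assumed inode map block geometry

def find_inode(disk, checkpoint, file_number):
    """Locate a file's i-node in the log via the inode map.

    checkpoint.inode_map_addrs holds the disk addresses of all inode map
    blocks, as read from the fixed checkpoint region."""
    block_index = file_number // ENTRIES_PER_MAP_BLOCK
    entry_index = file_number % ENTRIES_PER_MAP_BLOCK

    map_block = disk.read(checkpoint.inode_map_addrs[block_index])
    entry = map_block.entries[entry_index]    # (inode address, version, …)
    return disk.read(entry.inode_address)     # the i-node itself lives in the log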
Checkpoint regions
• Contain the addresses of all blocks in the inode
map and the segment usage table, the current time,
and a pointer to the last segment written
• Two of them (for safety)
• Located at fixed positions on disk
• Used for crash recovery
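
As a rough sketch (Python; the field names are assumptions, not the actual on-disk layout of Sprite LFS), a checkpoint region might carry:

from dataclasses import dataclass
from typing import List

@dataclass
class CheckpointRegion:
    # Written alternately to two fixed disk locations, so that a crash
    # during a checkpoint write leaves the other copy intact.
    inode_map_addrs: List[int]       # addresses of all inode map blocks
    segment_usage_addrs: List[int]   # addresses of segment usage table blocks
    timestamp: float                 # time the checkpoint was written
    last_segment_written: int        # where roll-forward resumes after a crash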
Free space management - I
GOAL: keep large free extents
to write new data
 Disk divided into fixed-length segments
(512 KB or 1 MB)
 A segment is always written sequentially
from beginning to end
 Segments are written in succession until the
end of the disk is reached (older segments
become fragmented in the meantime)
…and then?
Free space management - II
Segment cleaning – copying live data out of a
segment:
 Read a number of segments into memory
 Identify the live data
 Write only the live data back to a smaller
number of clean segments
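
The example slides below walk through these steps; a minimal sketch of the cleaning pass (Python, with a hypothetical disk interface):

def clean(disk, candidate_segments, clean_segments, is_live):
    # 1) Read a number of segments into memory.
    blocks = [b for seg in candidate_segments for b in disk.read_segment(seg)]
    # 2) Identify the live data (see the segment summary block, later).
    live = [b for b in blocks if is_live(b)]
    # 3) Write only the live data back, filling a smaller number of clean segments.
    n = disk.blocks_per_segment
    for out_seg, start in zip(clean_segments, range(0, len(live), n)):
        disk.write_segment(out_seg, live[start:start + n])
    # The old segments now hold no live data and can be reused.
    disk.mark_free(candidate_segments)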
Free space management - example
[Figure sequence: old log end, free segments, and the writing memory buffer, shown at each step]
1) Cleaner thread: copy segments into the memory buffer
2) Cleaner thread: identify live blocks
3) Cleaner thread: queue the compacted data for writing
4) Writer thread: write compacted and new data to segments, then mark the old segments as free
Free space management - implementation
Segment summary block – identifies each piece
of information in the segment
E.g.: for a file, each data block is identified
by version number + inode number (= unique
identifier, UID) and block number
 The version number is incremented in the inode
map when a file is deleted
 If a block's UID does not match the one in the
inode map when scanned, the block is discarded
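
A minimal sketch of this liveness test as described here (Python; the record layouts are assumptions):

def is_live(block, summary, inode_map):
    """A block is live iff the (inode number, version) recorded for it in the
    segment summary still matches the current entry in the inode map."""
    info = summary[block.offset]          # (inode_no, version, block_no)
    current = inode_map[info.inode_no]
    # Deleting a file bumps its version in the inode map, so blocks that
    # belonged to deleted files fail this test and are discarded.
    return info.version == current.version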
Free space management –
cleaning policies
1) Which segments to clean?
2) How should live blocks be grouped
when they are written out?
Free space management –
cleaning policies
Cleaning policies can be compared in terms of the
write cost:

write cost = (total bytes read and written) / (new data written)
           = (N + N·u + N·(1−u)) / (N·(1−u)) = 2 / (1−u)

N = number of segments read
u = fraction of live data in the read segments (0 ≤ u < 1)
• Average amount of time the disk is busy per byte of
new data written (seek and rotational latency are
negligible in LFS)
• Note: includes the cleaning overhead
• Note the dependence on u
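
A quick worked check of the formula (Python; the utilization values are arbitrary examples):

def write_cost(u: float) -> float:
    # Read N segments (N bytes), write back N·u live bytes plus N·(1−u)
    # new bytes, divided by the N·(1−u) bytes of genuinely new data.
    assert 0 <= u < 1
    return 2 / (1 - u) if u > 0 else 1.0   # a fully dead segment needs no read

for u in (0.0, 0.2, 0.5, 0.8, 0.9):
    print(f"u = {u:.1f}  ->  write cost = {write_cost(u):4.1f}")
# At u = 0.8 the system already moves 10 bytes for every byte of new data.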
Free space mgmt – cleaning policies
Low u = low write cost
• Note: an underutilized disk gives a low write cost, but a high
storage cost!
• …But u is defined only over the segments that are read (not over
the whole disk)
• Goal: achieve a bimodal distribution – keep most segments nearly
full, but a few nearly empty (and have the cleaner work on those)
How to achieve a bimodal distribution?
• First attempt: the cleaner always chooses the lowest-u
segments and sorts live blocks by age before writing – FAILURE!
• Free space in “cold” (i.e. more stable) segments is more
“valuable” (it will last longer)
• Assumption: the stability of a segment is proportional to the
age of its youngest block (i.e. older = colder)
• Replace the greedy policy with a cost-benefit criterion:

benefit / cost = (free space generated × age) / cost = ((1 − u) × age) / (1 + u)

• Clean the segments with the highest ratio
• Still group live blocks by age before rewriting
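
A minimal sketch contrasting the two selection policies (Python; the segment records are hypothetical):

from collections import namedtuple

Segment = namedtuple("Segment", "id u age")   # u = live fraction, age of youngest block

def greedy_pick(segments, k):
    # Greedy: always clean the segments with the least live data.
    return sorted(segments, key=lambda s: s.u)[:k]

def cost_benefit_pick(segments, k):
    # Cost-benefit: weight the space freed (1 − u) by the segment's age and
    # divide by the cleaning cost (1 + u: read the whole segment, write back
    # the live fraction u).
    return sorted(segments, key=lambda s: (1 - s.u) * s.age / (1 + s.u),
                  reverse=True)[:k]

# A cold, fairly full segment can outrank a hotter, emptier one:
segs = [Segment(0, u=0.75, age=100.0), Segment(1, u=0.30, age=1.0)]
print([s.id for s in greedy_pick(segs, 1)])        # -> [1]
print([s.id for s in cost_benefit_pick(segs, 1)])  # -> [0]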
Cost-benefit - Results
[Two simulation plots]
• Left: a bimodal distribution is achieved – cold segments cleaned at
u = 75%, hot at u = 15%
• Right: cost-benefit beats greedy, especially at disk utilization > 60%
Support for Cost-benefit
• Segment usage table: records the number of live bytes
and the most recent modification time (used by the cleaner
to choose segments: u and age)
• Segment summary: records the age of the youngest block
(used by the writer to sort live blocks)
Performance – micro-benchmarks I
Small file performance
• SunOS: based on Unix FFS
• NB: this is the best case for Sprite LFS: no cleaning overhead
• Sprite keeps the disk only 17% busy (85% for SunOS) with the
CPU saturated: performance will improve with CPU speed (right panel)
Performance – micro-benchmarks II
Large file performance
Single 100 MB file
• Traditional FS: logical locality – pays an extra cost to organize the
disk layout, assuming particular read patterns
• LFS: temporal locality – groups information created at the same
time – not optimal for reading randomly written files
Performance – cleaning overhead
• This experiment: statistics over several months of real usage
• The previous results did not include cleaning
• Write cost ranged from 1.2 to 1.6 – more than half of the cleaned
segments were empty
• Cleaning overhead limits write performance: about 70% of the
disk bandwidth remains available for writing
• Possible improvement: cleaning could be performed at night or
during idle periods
Conclusions
• Prototype log-structured file system implemented
and tested
• Due to cleaning overhead, segment cleaning
policies are crucial - tested in simulations before
implementation
• Results in tests (without cleaning overhead):
 Higher write performance than FFS for both small
and large files
 Comparable read performance (except in one case)
• Results in real usage (with cleaning):
 Simulation results confirmed
 70% of the disk bandwidth can be used for writing
References
• M. Rosenblum and J. K. Ousterhout, “The Design and
Implementation of a Log-Structured File System”, Proceedings
of the 13th ACM Symposium on Operating Systems Principles,
December 1991
• M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry,
“A Fast File System for UNIX”, ACM Transactions on Computer
Systems, 2(3), August 1984, pp. 181-197
• A. S. Tanenbaum, “Modern Operating Systems”, 2nd ed., Ch. 4
(“File Systems”), Prentice Hall