Transactions and Reliability
Sarah Diesburg
Operating Systems
COP 4610
Motivation
File systems have lots of metadata:
Free blocks, directories, file headers, indirect
blocks
Metadata is heavily cached for
performance
Problem
System crashes
OS needs to ensure that the file system
does not reach an inconsistent state
Example: move a file between directories
Remove a file from the old directory
Add a file to the new directory
What happens when a crash occurs in the
middle?
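A minimal sketch of the crash window, assuming directories are modeled as Python dicts mapping names to file header numbers. The Crash exception, the move function, and the crash_midway flag are hypothetical names used only for illustration.

```python
class Crash(Exception):
    """Stands in for a system crash between two metadata updates."""


def move(old_dir, new_dir, name, crash_midway=False):
    header = old_dir.pop(name)   # step 1: remove the file from the old directory
    if crash_midway:
        raise Crash()            # the system goes down before step 2
    new_dir[name] = header       # step 2: add the file to the new directory


old_dir = {"notes.txt": 42}
new_dir = {}
try:
    move(old_dir, new_dir, "notes.txt", crash_midway=True)
except Crash:
    pass

# The file is now in neither directory: an inconsistent state.
lost = "notes.txt" not in old_dir and "notes.txt" not in new_dir
print(lost)  # True
```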
UNIX File System (Ad Hoc Failure Recovery)
Metadata handling:
Uses a synchronous write-through caching
policy
A call to update metadata does not return until the
changes are propagated to disk
Updates are ordered
When crashes occur, run fsck to repair in-progress operations
Some Examples of Metadata Handling
Undo effects not yet visible to users
If a new file is created, but not yet added to the
directory
Delete the file
Continue effects that are visible to users
If file blocks are already allocated, but not
recorded in the bitmap
Update the bitmap
UFS User Data Handling
Uses a write-back policy
Modified blocks are written to disk at 30-second
intervals
Unless a user issues the sync system call
Data updates are not ordered
In many cases, consistent metadata is good
enough
Example: Vi
Vi saves changes by doing the following
1. Writes the new version in a temp file
Now we have old_file and new_temp file
2. Moves the old version to a different temp file
Now we have new_temp and old_temp
3. Moves the new version into the real file
Now we have new_file and old_temp
4. Removes the old version
Now we have new_file
Example: Vi
When crashes occur
Looks for the leftover files
Moves forward or backward depending on the
integrity of files
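The save steps and the recovery pass can be sketched as follows. This is an illustration, not Vi's actual code: the ".new" and ".old" suffixes, the save and recover functions, and the crash_after parameter are all assumptions, and recovery here checks only which files exist rather than their integrity. The sketch also assumes the file being saved already exists.

```python
import os
import tempfile


def save(path, data, crash_after=None):
    """Follow the four save steps; crash_after stops early to mimic a crash."""
    new_temp, old_temp = path + ".new", path + ".old"
    with open(new_temp, "w") as f:      # 1. write the new version to a temp file
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    if crash_after == 1:
        return
    os.rename(path, old_temp)           # 2. move the old version aside
    if crash_after == 2:
        return
    os.rename(new_temp, path)           # 3. move the new version into place
    if crash_after == 3:
        return
    os.remove(old_temp)                 # 4. remove the old version


def recover(path):
    """Look for leftover files and move forward or backward."""
    new_temp, old_temp = path + ".new", path + ".old"
    if os.path.exists(path):
        # The real file exists (crash after step 1 or 3): drop the leftovers.
        for leftover in (new_temp, old_temp):
            if os.path.exists(leftover):
                os.remove(leftover)
    elif os.path.exists(new_temp):
        # Crash after step 2: move forward to the new version.
        os.rename(new_temp, path)
        if os.path.exists(old_temp):
            os.remove(old_temp)
    elif os.path.exists(old_temp):
        # Only the old version survives: move backward.
        os.rename(old_temp, path)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "doc.txt")
    with open(path, "w") as f:
        f.write("old")
    save(path, "new", crash_after=2)    # crash between steps 2 and 3
    recover(path)
    with open(path) as f:
        result = f.read()
    leftovers = sorted(os.listdir(d))

print(result, leftovers)  # new ['doc.txt']
```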
Transaction Approach
A transaction groups operations as a unit,
with the following characteristics:
Atomic: all operations either happen or they
do not (no partial operations)
Serializable: transactions appear to happen
one after the other
Durable: once a transaction happens, it is
recoverable and can survive crashes
More on Transactions
A transaction is not done until it is
committed
Once committed, a transaction is durable
If a transaction fails to complete, it must
roll back as if it did not happen at all
Critical sections are atomic and
serializable, but not durable
Transaction Implementation (One Thread)
Example: money transfer
Begin transaction
x = x - 1;
y = y + 1;
Commit
Transaction Implementation (One Thread)
Common implementations involve the use
of a log, a journal that is never erased
A file system uses a write-ahead log to
track all transactions
Transaction Implementation (One Thread)
Once the changes to accounts x and y are in the
log, the log is committed to disk in a single write
Actual changes to those accounts are
done later
Transaction Illustrated
In memory:
x = 1;
y = 1;
On disk:
x = 1;
y = 1;
Transaction Illustrated
In memory:
x = 0;
y = 2;
On disk:
x = 1;
y = 1;
Transaction Illustrated
In memory:
x = 0;
y = 2;
Log on disk:
begin transaction
old x: 1 new x: 0
old y: 1 new y: 2
commit
On disk:
x = 1;
y = 1;
Commit the log to disk before updating the actual values on disk
Transaction Steps
Mark the beginning of the transaction
Log the changes in account x
Log the changes in account y
Commit
Modify account x on disk
Modify account y on disk
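The six steps can be sketched in Python, assuming the disk is modeled as a dict and the log as a list of tuples. The record layout (kind, name, old value, new value) and the transfer function are illustrative assumptions, not a real file system's log format.

```python
def transfer(disk, log):
    """Move one unit from account x to account y using write-ahead logging."""
    log.append(("begin",))                                  # mark the beginning
    log.append(("update", "x", disk["x"], disk["x"] - 1))   # log the change to x
    log.append(("update", "y", disk["y"], disk["y"] + 1))   # log the change to y
    log.append(("commit",))                                 # commit
    for record in log:                                      # only now modify x and y
        if record[0] == "update":
            _, name, _old, new = record
            disk[name] = new


disk = {"x": 1, "y": 1}
log = []
transfer(disk, log)
print(disk)  # {'x': 0, 'y': 2}
```

The accounts on disk are touched only after the commit record is in the log, which is what makes replay after a crash possible.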
Scenarios of Crashes
If a crash occurs after the commit
Replay the log to update accounts
If a crash occurs before or during the
commit
Roll back and discard the transaction
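A minimal sketch of recovery under both scenarios, assuming log records are tuples of the form ("begin",), ("update", name, old, new), and ("commit",). The recover function and this record layout are hypothetical.

```python
def recover(disk, log):
    """Replay committed transactions; discard any that never committed."""
    pending = []
    for record in log:
        kind = record[0]
        if kind == "begin":
            pending = []
        elif kind == "update":
            pending.append(record)
        elif kind == "commit":
            for _, name, _old, new in pending:
                disk[name] = new      # redo; safe to repeat after another crash
            pending = []
    # Updates still in `pending` belong to an uncommitted transaction
    # and are simply ignored.
    return disk


committed = [("begin",), ("update", "x", 1, 0), ("update", "y", 1, 2), ("commit",)]
uncommitted = committed[:-1]          # crash before the commit record was written

after_commit = recover({"x": 1, "y": 1}, committed)
before_commit = recover({"x": 1, "y": 1}, uncommitted)
half_applied = recover({"x": 0, "y": 1}, committed)   # crash between the two disk writes

print(after_commit)   # {'x': 0, 'y': 2}
print(before_commit)  # {'x': 1, 'y': 1}
print(half_applied)   # {'x': 0, 'y': 2}
```

Because the log stores new values rather than increments, replaying the same record twice leaves the same result.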
Two-Phase Locking (Multiple Threads)
Logging alone is not enough to prevent
multiple transactions from trashing one
another (not serializable)
Solution: two-phase locking
1. Acquire all locks
2. Perform updates and release all locks
Thread A cannot see thread B’s changes
until thread B commits and releases locks
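Two-phase locking can be sketched with Python threads. The accounts, the transfer function, and the sorted lock order are assumptions for illustration; in particular, acquiring locks in a fixed order is an addition here to avoid deadlock and is not part of the two steps listed above.

```python
import threading

accounts = {"x": 100, "y": 100}
locks = {name: threading.Lock() for name in accounts}


def transfer(src, dst, amount):
    needed = sorted([src, dst])          # fixed order so two transfers cannot deadlock
    for name in needed:                  # phase 1: acquire all locks
        locks[name].acquire()
    try:
        accounts[src] -= amount          # phase 2: perform updates...
        accounts[dst] += amount
    finally:
        for name in reversed(needed):    # ...then release all locks
            locks[name].release()


threads = [threading.Thread(target=transfer, args=("x", "y", 1)) for _ in range(50)]
threads += [threading.Thread(target=transfer, args=("y", "x", 1)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = accounts["x"] + accounts["y"]
print(accounts, total)  # {'x': 100, 'y': 100} 200
```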
Transactions in File Systems
Almost all file systems built since 1985 use
write-ahead logging
NTFS, HFS+, ext3, ext4, …
+ Eliminates running fsck after a crash
+ Write-ahead logging provides reliability
- All modifications need to be written twice
Log-Structured File System (LFS)
If logging is so great, why don’t we treat
everything as log entries?
Log-structured file system
Everything is a log entry (file headers,
directories, data blocks)
Write the log only once
Use version stamps to distinguish between old and
new entries
More on LFS
New log entries are always appended to
the end of the existing log
All writes are sequential
Seeks occur only during reads
Not so bad due to temporal locality and caching
Problem:
Need to create more contiguous space all the
time
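A toy model of these ideas, assuming the log is a Python list of (version, key, value) entries. The LogFS class and its write, read, and clean methods are hypothetical; clean shows one simple way to reclaim space by dropping superseded entries, not how a real LFS cleaner works.

```python
class LogFS:
    def __init__(self):
        self.log = []        # append-only list of (version, key, value) entries
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.log.append((self.version, key, value))   # always append at the end

    def read(self, key):
        # The entry with the highest version stamp is the current one.
        best = None
        for version, k, value in self.log:
            if k == key and (best is None or version > best[0]):
                best = (version, value)
        return None if best is None else best[1]

    def clean(self):
        # Keep only the newest entry per key to free up log space.
        latest = {}
        for entry in self.log:
            version, key, _ = entry
            if key not in latest or version > latest[key][0]:
                latest[key] = entry
        self.log = sorted(latest.values())


fs = LogFS()
fs.write("a.txt", "v1")
fs.write("b.txt", "hello")
fs.write("a.txt", "v2")          # the old entry for a.txt stays in the log
size_before = len(fs.log)
current = fs.read("a.txt")
fs.clean()
size_after = len(fs.log)
print(size_before, current, size_after)  # 3 v2 2
```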
RAID and Reliability
So far, we assume that we have a single disk
What if we have multiple disks?
The chance of a single-disk failure increases
RAID: redundant array of independent disks
Standard way of organizing disks and classifying the
reliability of multi-disk systems
General methods: data duplication, parity, and error-correcting codes (ECC)
RAID 0
No redundancy
Uses block-level striping across disks
i.e., 1st block stored on disk 1, 2nd block stored
on disk 2
Failure causes data loss
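Block-level striping amounts to a simple mapping from block number to disk. The sketch below assumes round-robin placement with 0-based numbering; the locate function is a hypothetical helper.

```python
def locate(block, num_disks):
    """Return (disk index, stripe index) for a logical block, both 0-based."""
    return block % num_disks, block // num_disks


placement = [locate(b, 3) for b in range(6)]
print(placement)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```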
Non-Redundant Disk Array Diagram (RAID Level 0)
[Diagram: the file system sends open(foo), read(bar), and write(zoo) requests to the disk array]
Mirrored Disks (RAID Level 1)
Each disk has a second disk that mirrors
its contents
Writes go to both disks
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient
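A minimal sketch of mirroring, assuming each disk is modeled as a dict from block number to data. The Mirror class and its failed flags are hypothetical names for illustration.

```python
class Mirror:
    def __init__(self):
        self.disks = [{}, {}]            # two disks holding identical contents
        self.failed = [False, False]

    def write(self, block, data):
        for disk in self.disks:          # writes go to both disks
            disk[block] = data

    def read(self, block):
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:       # any surviving copy can serve the read
                return disk[block]
        raise IOError("both disks have failed")


m = Mirror()
m.write(0, b"abc")
m.failed[0] = True                       # simulate losing the first disk
survivor_data = m.read(0)
print(survivor_data)  # b'abc'
```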
Mirrored Disk Diagram (RAID Level 1)
[Diagram: the file system sends open(foo), read(bar), and write(zoo) requests to the mirrored disks]
Memory-Style ECC (RAID Level 2)
Some disks in array are used to hold ECC
A parity bit per byte detects an error; extra bits
allow error correction
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
e.g., 4 data disks require 3 ECC disks
Memory-Style ECC Diagram (RAID Level 2)
[Diagram: the file system sends open(foo), read(bar), and write(zoo) requests to the data and ECC disks]
Bit-Interleaved Parity (RAID Level 3)
Uses bit-level striping across disks
i.e., 1st bit stored on disk 1, 2nd bit stored on
disk 2
One disk in the array stores parity for the
other disks
No detection bits needed, relies on disk
controller to detect errors
+ More efficient than Levels 1 and 2
- Parity disk doesn’t add bandwidth
Parity Method
Disk 1: 1001
Disk 2: 0101
Disk 3: 1000
Parity: 0100 = 1001 xor 0101 xor 1000
To recover disk 2
Disk 2: 0101 = 1001 xor 1000 xor 0100
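The same arithmetic in Python, using the values from the slide. The parity function is a hypothetical helper; it works for both computing parity and rebuilding a lost disk because xor is its own inverse.

```python
from functools import reduce


def parity(blocks):
    """Xor all blocks together."""
    return reduce(lambda a, b: a ^ b, blocks)


disks = [0b1001, 0b0101, 0b1000]
p = parity(disks)

# Disk 2 is lost: xor the surviving disks with the parity to rebuild it.
recovered = parity([disks[0], disks[2], p])

print(format(p, "04b"), format(recovered, "04b"))  # 0100 0101
```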
Bit-Interleaved RAID Diagram (Level 3)
[Diagram: the file system sends open(foo), read(bar), and write(zoo) requests to the data disks and the parity disk]
Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved, but data is interleaved
in blocks
+ More efficient data access than level 3
- Parity disk can be a bottleneck
- Small writes require 4 I/Os
Read the old block
Read the old parity
Write the new block
Write the new parity
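The four I/Os can be sketched as follows, assuming each disk is a list of integer blocks indexed by stripe. The small_write function and its parameters are hypothetical. The new parity comes from the old parity, the old block, and the new block alone, so the other data disks are never read.

```python
def small_write(data_disks, parity_disk, disk, stripe, new_block):
    old_block = data_disks[disk][stripe]      # I/O 1: read the old block
    old_parity = parity_disk[stripe]          # I/O 2: read the old parity
    data_disks[disk][stripe] = new_block      # I/O 3: write the new block
    # I/O 4: write the new parity (cancel the old block, add the new one)
    parity_disk[stripe] = old_parity ^ old_block ^ new_block


data_disks = [[0b1001], [0b0101], [0b1000]]
parity_disk = [0b0100]
small_write(data_disks, parity_disk, disk=1, stripe=0, new_block=0b1111)

print(format(parity_disk[0], "04b"))  # 1110
```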
Block-Interleaved Parity Diagram (RAID Level 4)
[Diagram: the file system sends open(foo), read(bar), and write(zoo) requests to the data disks and the parity disk]
Block-Interleaved Distributed-Parity
(RAID Level 5)
Sort of the most general level of RAID
Spreads the parity out over all disks
+ No parity disk bottleneck
+ All disks contribute read bandwidth
- Requires 4 I/Os for small writes
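One way to spread parity across disks is to rotate its position from stripe to stripe. The rotation below is an assumed layout chosen for illustration; real arrays use several different placements. The parity_disk and data_disk functions are hypothetical helpers with 0-based numbering.

```python
def parity_disk(stripe, num_disks):
    """Disk holding the parity block for a stripe (rotates one disk per stripe)."""
    return (num_disks - 1 - stripe) % num_disks


def data_disk(stripe, index, num_disks):
    """Disk holding the index-th data block of a stripe, skipping the parity disk."""
    p = parity_disk(stripe, num_disks)
    return index if index < p else index + 1


layout = [parity_disk(s, 4) for s in range(5)]
print(layout)  # [3, 2, 1, 0, 3]
```

With four disks, every disk holds parity for one stripe in four, so parity writes no longer queue up on a single disk.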
Block-Interleaved Distributed-Parity Diagram (RAID Level 5)
[Diagram: the file system sends open(foo), read(bar), and write(zoo) requests to disks that each hold both data and parity]