
Transactions and Reliability
Sarah Diesburg
Operating Systems
COP 4610
Motivation
File systems have lots of metadata:
Free blocks, directories, file headers, indirect
blocks
Metadata is heavily cached for
performance
Problem
System crashes
OS needs to ensure that the file system
does not reach an inconsistent state
Example: move a file between directories
Remove a file from the old directory
Add a file to the new directory
What happens when a crash occurs in the
middle?
UNIX File System (Ad Hoc Failure Recovery)
Metadata handling:
Uses a synchronous write-through caching
policy
A call to update metadata does not return until the
changes are propagated to disk
Updates are ordered
When crashes occur, run fsck to repair in-progress operations
Some Examples of Metadata Handling
Undo effects not yet visible to users
If a new file is created, but not yet added to the
directory
Delete the file
Continue effects that are visible to users
If file blocks are already allocated, but not
recorded in the bitmap
Update the bitmap
UFS User Data Handling
Uses a write-back policy
Modified blocks are written to disk at 30-second
intervals
Unless a user issues the sync system call
Data updates are not ordered
In many cases, consistent metadata is good
enough
Example: Vi
Vi saves changes by doing the following:
1. Writes the new version to a temp file
Now we have old_file and new_temp
2. Moves the old version to a different temp file
Now we have new_temp and old_temp
3. Moves the new version into the real file
Now we have new_file and old_temp
4. Removes the old version
Now we have new_file
Example: Vi
When a crash occurs, vi:
Looks for the leftover files
Rolls forward or backward depending on the
integrity of the files
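A minimal C sketch of this rename dance, assuming the hypothetical names file, file.new, and file.old (a real editor would also check errors and fsync at each step):

    #include <stdio.h>

    int save(const char *contents)
    {
        /* 1. Write the new version to a temp file. */
        FILE *f = fopen("file.new", "w");
        if (!f) return -1;
        fputs(contents, f);
        fclose(f);

        /* 2. Move the old version aside: now file.old holds the old data. */
        if (rename("file", "file.old") != 0) return -1;

        /* 3. Move the new version into the real name. */
        if (rename("file.new", "file") != 0) return -1;

        /* 4. Remove the old version: only file remains. */
        remove("file.old");
        return 0;
    }

A crash between any two steps leaves some combination of file, file.new, and file.old behind, which is exactly what the recovery scan looks for.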
Transaction Approach
A transaction groups operations as a unit,
with the following characteristics:
Atomic: all operations either happen or they
do not (no partial operations)
Serializable: transactions appear to happen
one after the other
Durable: once a transaction commits, it is
recoverable and can survive crashes
More on Transactions
A transaction is not done until it is
committed
Once committed, a transaction is durable
If a transaction fails to complete, it must
roll back as if it did not happen at all
Critical sections are atomic and
serializable, but not durable
Transaction Implementation (One Thread)
Example: money transfer
Begin transaction
x = x - 1;
y = y + 1;
Commit
Transaction Implementation (One Thread)
Common implementations involve the use
of a log, a journal that is never erased
A file system uses a write-ahead log to
track all transactions
Transaction Implementation (One Thread)
Once the old and new values of x and y are
in the log, the log is committed to disk in a
single write
The actual changes to those accounts are
made later
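A minimal write-ahead log sketch for this transfer in C, assuming a hypothetical text log wal.log; real file systems log metadata blocks rather than account values, and error checking is omitted here:

    #include <stdio.h>
    #include <unistd.h>

    void transfer(int *x, int *y)
    {
        FILE *log = fopen("wal.log", "a");
        fprintf(log, "begin transaction\n");
        fprintf(log, "old x: %d new x: %d\n", *x, *x - 1);
        fprintf(log, "old y: %d new y: %d\n", *y, *y + 1);
        fprintf(log, "commit\n");
        fflush(log);
        fsync(fileno(log));  /* the commit point: the log is now durable */
        fclose(log);

        /* Apply the actual changes only after the commit record is
         * safely on disk; a crash from here on can be redone. */
        *x = *x - 1;
        *y = *y + 1;
    }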
Transaction Illustrated
Memory: x = 1; y = 1;
Disk: x = 1; y = 1;
Transaction Illustrated
Memory: x = 0; y = 2;
Disk: x = 1; y = 1;
Transaction Illustrated
Memory: x = 0; y = 2;
Log: begin transaction
     old x: 1 new x: 0
     old y: 1 new y: 2
     commit
Disk: x = 1; y = 1;
Commit the log to disk before updating the actual values on disk
Transaction Steps
Mark the beginning of the transaction
Log the changes in account x
Log the changes in account y
Commit
Modify account x on disk
Modify account y on disk
Scenarios of Crashes
If a crash occurs after the commit
Replays the log to update accounts
If a crash occurs before or during the
commit
Rolls back and discards the transaction
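A matching recovery sketch, under the same single-transaction wal.log assumption: scan the log for a commit record, replay the logged new values if it is present, and otherwise discard the transaction:

    #include <stdio.h>
    #include <string.h>

    void recover(int *x, int *y)
    {
        FILE *log = fopen("wal.log", "r");
        if (!log) return;  /* no log, nothing to recover */

        char line[128];
        int new_x = *x, new_y = *y, old, committed = 0;
        while (fgets(line, sizeof line, log)) {
            sscanf(line, "old x: %d new x: %d", &old, &new_x);
            sscanf(line, "old y: %d new y: %d", &old, &new_y);
            if (strncmp(line, "commit", 6) == 0)
                committed = 1;
        }
        fclose(log);

        if (committed) {  /* crash after the commit: redo the updates */
            *x = new_x;
            *y = new_y;
        }
        /* Otherwise roll back by doing nothing: the actual values
         * were never modified before the commit. */
    }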
Two-Phase Locking (Multiple Threads)
Logging alone is not enough to prevent
multiple transactions from trashing one
another (not serializable)
Solution: two-phase locking
1. Acquire all locks
2. Perform updates and release all locks
Thread A cannot see thread B’s changes
until thread A commits and releases locks
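A two-phase locking sketch with pthreads, one lock per account; the fixed x-before-y acquisition order is an added assumption to avoid deadlock:

    #include <pthread.h>

    int x = 1, y = 1;
    pthread_mutex_t lock_x = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;

    void transfer_locked(void)
    {
        /* Phase 1: acquire every lock the transaction needs. */
        pthread_mutex_lock(&lock_x);
        pthread_mutex_lock(&lock_y);

        /* Phase 2: perform the updates, then release all locks.
         * No other thread can observe x decremented while y is
         * not yet incremented. */
        x = x - 1;
        y = y + 1;

        pthread_mutex_unlock(&lock_y);
        pthread_mutex_unlock(&lock_x);
    }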
Transactions in File Systems
Almost all file systems built since 1985 use
write-ahead logging
NTFS, HFS+, ext3, ext4, …
+ Eliminates running fsck after a crash
+ Write-ahead logging provides reliability
- All modifications need to be written twice
Log-Structured File System (LFS)
If logging is so great, why don’t we treat
everything as log entries?
Log-structured file system
Everything is a log entry (file headers,
directories, data blocks)
Write the log only once
Use version stamps to distinguish between old and
new entries
More on LFS
New log entries are always appended to
the end of the existing log
All writes are sequential
Seeks occur only during reads
Not so bad due to temporal locality and caching
Problem:
Need to keep creating contiguous free space
all the time
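A toy in-memory model of the LFS idea, with illustrative names throughout: every update appends an entry stamped with a version number, and a read returns the entry with the newest stamp:

    #include <stdio.h>

    struct entry { int block; int version; char data[16]; };

    static struct entry log_entries[1000];
    static int log_end = 0;
    static int next_version = 1;

    /* Writes never seek: they always append at the end of the log. */
    void write_block(int block, const char *data)
    {
        struct entry *e = &log_entries[log_end++];
        e->block = block;
        e->version = next_version++;
        snprintf(e->data, sizeof e->data, "%s", data);
    }

    /* Reads scan for the latest version of the block. */
    const char *read_block(int block)
    {
        const char *latest = NULL;
        for (int i = 0; i < log_end; i++)
            if (log_entries[i].block == block)
                latest = log_entries[i].data;  /* later entries carry newer stamps */
        return latest;
    }

Stale entries for overwritten blocks pile up and are never reclaimed here, which is exactly the contiguous-space problem noted above.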
RAID and Reliability
So far, we have assumed a single disk
What if we have multiple disks?
The chance that some disk fails increases
RAID: redundant array of independent disks
Standard way of organizing disks and classifying the
reliability of multi-disk systems
General methods: data duplication, parity, and
error-correcting codes (ECC)
RAID 0
No redundancy
Uses block-level striping across disks
i.e., 1st block stored on disk 1, 2nd block stored
on disk 2
Failure causes data loss
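A minimal sketch of the striping arithmetic, assuming ndisks identical disks and 0-based numbering: logical block b lands on disk b % ndisks at offset b / ndisks:

    void map_raid0(int block, int ndisks, int *disk, int *offset)
    {
        *disk   = block % ndisks;  /* round-robin across the array */
        *offset = block / ndisks;  /* position within that disk */
    }

With 2 disks, blocks 0, 2, 4, ... land on disk 0 and blocks 1, 3, 5, ... on disk 1, so large sequential reads draw bandwidth from both disks.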
Non-Redundant Disk Array Diagram
(RAID Level 0)
[Diagram: the file system issues open(foo), read(bar), and write(zoo); blocks are striped across the disks with no redundancy]
Mirrored Disks (RAID Level 1)
Each disk has a second disk that mirrors
its contents
Writes go to both disks
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient
Mirrored Disk Diagram (RAID Level 1)
[Diagram: the file system issues open(foo), read(bar), and write(zoo); every write goes to both disks of a mirrored pair]
Memory-Style ECC (RAID Level 2)
Some disks in the array are used to hold ECC
Enough bits to detect an error, plus extra
bits to correct it
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
e.g., 4 data disks require 3 ECC disks
Memory-Style ECC Diagram
(RAID Level 2)
[Diagram: the file system issues open(foo), read(bar), and write(zoo); the array holds data disks plus dedicated ECC disks]
Bit-Interleaved Parity (RAID Level 3)
Uses bit-level striping across disks
i.e., 1st byte stored on disk 1, 2nd byte stored on
disk 2
One disk in the array stores parity for the
other disks
No detection bits needed, relies on disk
controller to detect errors
+ More efficient than Levels 1 and 2
- Parity disk doesn’t add bandwidth
Parity Method
Disk 1: 1001
Disk 2: 0101
Disk 3: 1000
Parity: 0100 = 1001 xor 0101 xor 1000
To recover disk 2:
Disk 2: 0101 = 1001 xor 1000 xor 0100
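The same arithmetic in C, using the slide's values (1001, 0101, and 1000 written in hex):

    #include <stdio.h>

    int main(void)
    {
        unsigned d1 = 0x9;               /* 1001 */
        unsigned d2 = 0x5;               /* 0101 */
        unsigned d3 = 0x8;               /* 1000 */
        unsigned parity = d1 ^ d2 ^ d3;  /* 0100 */

        /* Recover disk 2 by XORing the survivors with the parity. */
        unsigned recovered = d1 ^ d3 ^ parity;
        printf("parity = %x, recovered disk 2 = %x\n", parity, recovered);
        return 0;
    }

This prints parity = 4 and recovered disk 2 = 5, matching the slide.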
Bit-Interleaved RAID Diagram (Level 3)
[Diagram: the file system issues open(foo), read(bar), and write(zoo); data is bit-interleaved across the data disks with one dedicated parity disk]
Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved, but data is interleaved
in blocks
+ More efficient data access than level 3
- Parity disk can be a bottleneck
- Small writes require 4 I/Os
Read the old block
Read the old parity
Write the new block
Write the new parity
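Why only those four I/Os: the new parity can be computed from the old block and the old parity alone, with no reads of the other data disks. A sketch of the arithmetic:

    /* new_parity = old_parity xor old_block xor new_block:
     * XOR the old data out of the parity, then XOR the new data in. */
    unsigned small_write_parity(unsigned old_parity,
                                unsigned old_block,
                                unsigned new_block)
    {
        return old_parity ^ old_block ^ new_block;
    }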
Block-Interleaved Parity Diagram
(RAID Level 4)
[Diagram: the file system issues open(foo), read(bar), and write(zoo); blocks are interleaved across the data disks with one dedicated parity disk]
Block-Interleaved Distributed-Parity
(RAID Level 5)
Sort of the most general level of RAID
Spreads the parity out over all disks
+ No parity disk bottleneck
+ All disks contribute read bandwidth
- Requires 4 I/Os for small writes
Block-Interleaved Distributed-Parity
Diagram (RAID Level 5)
[Diagram: the file system issues open(foo), read(bar), and write(zoo); parity blocks are distributed across all disks]