Database Availability Benchmarking: an example


Breaking databases for fun and publications: availability benchmarks
Aaron Brown
UC Berkeley ROC Group
HPTS 2001
Slide 1
Motivation
• Drinking the availability Kool-Aid
– availability is the key metric for modern apps.
• Database stack’s availability is especially important
– guardians of the world’s hard state
– almost any user’s request for electronic information hits a database stack
» web services, directories, enterprise apps, ...
• Can we trust database software stacks in the face of failure?
Slide 2
Availability benchmarking 101
• Availability benchmarks quantify system behavior under failures, maintenance, and recovery
[Figure: QoS metric vs. time: normal behavior (99% conf.) until a failure causes QoS degradation, then a repair time before the metric returns to normal]
• They require
– a realistic workload for the system: TPC-C
– quality of service metrics: rates of successful and aborted txns
– fault-injection to simulate failures: single-disk errors
Slide 3
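A minimal sketch of this benchmark loop, assuming hypothetical run_one_minute and inject_fault hooks that stand in for the TPC-C driver and the fault injector (the 99% band via 2.58 sigma is an illustrative choice, not the exact method used in the study):

```python
# Availability-benchmark loop: sample the QoS metric (txn/min) each minute,
# establish the normal-behavior band, inject one fault, keep measuring.
import statistics
from typing import Callable, Tuple

def availability_benchmark(run_one_minute: Callable[[], Tuple[int, int]],
                           inject_fault: Callable[[], None],
                           warmup_min: int = 5,
                           total_min: int = 15) -> dict:
    ok_rates = []
    # Phase 1: fault-free operation -> mean and ~99% band (2.58 sigma, assuming normality)
    for _ in range(warmup_min):
        ok, _aborted = run_one_minute()
        ok_rates.append(ok)
    mean = statistics.mean(ok_rates)
    band = 2.58 * statistics.stdev(ok_rates)

    # Phase 2: inject the single fault and keep sampling QoS until the run ends
    inject_fault()
    degraded_minutes = 0
    for minute in range(warmup_min, total_min):
        ok, aborted = run_one_minute()
        if ok < mean - band:          # below the normal band => QoS degradation
            degraded_minutes += 1
        print(f"minute {minute}: {ok} ok txn/min, {aborted} aborted txn/min")

    return {"normal_mean": mean, "band_99": band,
            "degraded_minutes": degraded_minutes}  # rough proxy for repair time
```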
Well, what happens?
• Setup
– 3-tier: Microsoft SQL Server/COM+/IIS & bus. logic
– TPC-C-like workload; faults injected into DB data & log
• Results
– DBMS tolerates transient and recoverable failures, reflecting errors back via transaction aborts
– middleware highly unstable: degrades or crashes when DBMS fails or undergoes lengthy recovery
[Figure: two throughput-vs-time plots (txn/min over 15 minutes, successful and failed txns). Sticky uncorrectable write error, log disk: the middleware causes degraded performance until the database recovers. Disk hang during write to data disk: the database fails, the middleware degrades and then crashes.]
Slide 4
Summary
• Database is pretty resilient
– transaction abort == good error-reflection mechanism
• Middleware/applications suck
(well, at least this instance of them)
• Robustness is end-to-end
– user cannot distinguish DBMS and middleware failures
– failure recovery must go beyond the DBMS
• Achievable Grand Challenges?
– build and run availability benchmarks on your systems
– tolerate and recover from non-failstop system-level faults
Does performance matter?
Slide 5
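To make "transaction abort as error-reflection" and "recovery must go beyond the DBMS" concrete, here is a hedged sketch of a middleware tier that retries or reports an aborted transaction instead of crashing; the connect factory, the tpcc_new_order stored procedure, and the retry/back-off policy are illustrative assumptions, not part of the benchmarked stack:

```python
# Middleware-side handling of DBMS transaction aborts: report or retry, don't crash.
import time

def submit_new_order(connect, order, max_retries=3):
    """Submit one NewOrder txn via any DB-API 2.0 style connection factory."""
    for attempt in range(1, max_retries + 1):
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("EXEC tpcc_new_order ?, ?, ?",   # hypothetical stored procedure
                        (order["w_id"], order["d_id"], order["c_id"]))
            conn.commit()
            return "ok"
        except Exception as err:          # the DBMS reflected the fault as a txn abort
            try:
                conn.rollback()
            except Exception:
                pass                      # connection may already be gone
            if attempt == max_retries:
                return f"failed after {attempt} attempts: {err}"   # surface the error
            time.sleep(2 ** attempt)      # back off while the DBMS recovers
        finally:
            try:
                conn.close()
            except Exception:
                pass
```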
Backup slides
Slide 6
Experimental setup
• Database
– Microsoft SQL Server 2000, default configuration
• Middleware/front-end software
– Microsoft COM+ transaction monitor/coordinator
– IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
– Microsoft BenchCraft remote terminal emulator
• TPC-C-like OLTP order-entry workload
– 10 warehouses, 100 active users, ~860 MB database
• Measured metrics
– throughput of correct NewOrder transactions/min
– rate of aborted NewOrder transactions (txn/min)
Slide 7
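A small sketch of how these two metrics could be computed from a raw transaction trace; the (timestamp_s, txn_type, ok) record layout is an assumption for illustration, not the format BenchCraft actually emits:

```python
# Per-minute NewOrder throughput and abort rate from a transaction trace.
from collections import defaultdict

def per_minute_metrics(trace):
    """trace: iterable of (timestamp_s, txn_type, ok) tuples."""
    ok_per_min = defaultdict(int)
    aborted_per_min = defaultdict(int)
    for ts, txn_type, ok in trace:
        if txn_type != "NewOrder":
            continue                      # both metrics count NewOrder txns only
        minute = int(ts // 60)
        if ok:
            ok_per_min[minute] += 1       # throughput of correct NewOrder txns/min
        else:
            aborted_per_min[minute] += 1  # rate of aborted NewOrder txns/min
    return ok_per_min, aborted_per_min
```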
Experimental setup (2)
[Figure: test-bed diagram; machines connected by 100 Mb Ethernet, disk connections are Fast/Wide SCSI buses at 20 MB/sec]
– Front End: Intel P-III/450, 256 MB DRAM, Windows 2000 AS, IDE system disk; runs MS BenchCraft RTE, IIS + MS tpcc.dll, and MS COM+
– DB Server: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS, SCSI system disk; runs SQL Server 2000; an Adaptec 3940 attaches the emulated disk and an AdvStor ASC-U2W attaches the IBM 18 GB 10k RPM DB data/log disks
– Disk Emulator: Intel P-II/300, 128 MB DRAM, Windows NT 4.0, SCSI system disk; the ASC VirtualSCSI lib. presents the emulated disk, backed by an IBM 18 GB 10k RPM disk (NTFS) on an Adaptec 2940
• Database installed in one of two configurations:
– data on emulated disk, log on real (IBM) disk
– data on real (IBM) disk, log on emulated disk
Slide 8
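The emulated disk is what makes fault injection possible: the emulator sits in the disk request path and can overlay a fault before forwarding the request to its backing disk. The sketch below illustrates that idea with a hypothetical wrapper class; it is not the ASC VirtualSCSI API used in the real test bed:

```python
# Illustrative fault-injecting disk layer: intercept reads/writes, optionally
# return sector-level errors or hang, otherwise forward to the backing disk.
class FaultInjectingDisk:
    def __init__(self, backing_disk):
        self.backing = backing_disk   # object exposing read(lba, n) / write(lba, data)
        self.fault = None             # e.g. ("sticky-write-error", lba)

    def inject(self, kind, lba=None):
        self.fault = (kind, lba)

    def write(self, lba, data):
        kind, bad_lba = self.fault or (None, None)
        if kind == "sticky-write-error" and lba == bad_lba:
            raise IOError("medium error: write failed")  # reported sector-level fault
        if kind == "disk-hang":
            while True:                                  # non-failstop: call never returns
                pass
        return self.backing.write(lba, data)

    def read(self, lba, n):
        kind, bad_lba = self.fault or (None, None)
        if kind == "sticky-read-error" and lba == bad_lba:
            raise IOError("medium error: read failed")
        return self.backing.read(lba, n)
```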
Results
• All results are from single-fault microbenchmarks
• 14 different fault types
– injected once for each of data and log partitions
• 4 categories of behavior detected
1) normal
2) transient glitch
3) degraded
4) failed
Slide 9
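For illustration, a sketch of how a run's per-minute throughput trace might be binned into these four categories; the thresholds are assumptions made here, not the criteria used in the study:

```python
# Classify a run from its per-minute successful-txn throughput trace.
def classify_run(ok_per_min, baseline, band, fault_minute):
    post = ok_per_min[fault_minute:]              # minutes after the injected fault
    low = [t for t in post if t < baseline - band]
    if not low:
        return "normal"                           # fault tolerated with no visible effect
    if post and all(t == 0 for t in post[-3:]):
        return "failed"                           # throughput never comes back
    if len(low) <= 1:
        return "transient glitch"                 # one brief dip, then back to normal
    return "degraded"                             # sustained but nonzero drop
```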
Type 1: normal behavior
[Figure: throughput (txn/min) vs. time (minutes), successful and failed txns; no visible disturbance at the injected fault]
• System tolerates fault
• Demonstrated for all sector-level faults except:
– sticky uncorrectable read, data partition
– sticky uncorrectable write, log partition
Slide 10
Type 2: transient glitch
[Figure: throughput (txn/min) vs. time (minutes), successful and failed txns; a momentary dip at the injected fault, then return to normal]
• One transaction is affected, aborts with error
• Subsequent transactions using same data would fail
• Demonstrated for one fault only:
– sticky uncorrectable read, data partition
Slide 11
Type 3: degraded behavior
[Figure: throughput (txn/min) vs. time (minutes), successful and failed txns; after the fault, throughput settles at a degraded level]
• DBMS survives error after running log recovery
• Middleware partially fails, results in degraded perf.
• Demonstrated for one fault only:
– sticky uncorrectable write, log partition
Slide 12
Type 4: failure
• Example behaviors (10 distinct variants observed)
[Figure: two throughput-vs-time plots (txn/min over 15 minutes, successful and failed txns): a disk hang during a write to the data disk, and a simulated log disk power failure; in both, throughput collapses after the fault]
• DBMS hangs or aborts all transactions
• Middleware behaves erratically, sometimes crashing
• Demonstrated for all fatal disk-level faults
– SCSI hangs, disk power failures
Slide 13