CS 136, Advanced Architecture
Storage Performance Measurement
Outline
• I/O Benchmarks, Performance, and Dependability
• Introduction to Queueing Theory
I/O Performance
Metrics: Response Time vs. Throughput
[Figure: response time (ms, 0-300) vs. throughput (0-100% of total bandwidth); response time rises sharply as throughput approaches 100% of bandwidth]
[Diagram: Proc → Queue → IOC → Device]
Response time = Queue Time + Device Service Time
I/O Benchmarks
• For better or worse, benchmarks shape a field
– Processor benchmarks classically aimed at response time for
fixed-size problem
– I/O benchmarks typically measure throughput, possibly with
upper limit on response times (or 90% of response times)
• Transaction Processing (TP) (or On-Line TP =
OLTP)
– If bank computer fails when customer withdraws money, TP
system guarantees account debited if customer gets $ &
account unchanged if no $
– Airline reservation systems & banks use TP
• Atomic transactions make this work
• Classic metric is Transactions Per Second (TPS)
I/O Benchmarks:
Transaction Processing
• Early 1980s great interest in OLTP
– Expecting demand for high TPS (e.g., ATMs, credit cards)
– Tandem’s success implied medium-range OLTP expanding
– Each vendor picked own conditions for TPS claims, reported
only CPU times with widely different I/O
– Conflicting claims led to disbelief in all benchmarks ⇒ chaos
• 1984 Jim Gray (Tandem) distributed paper to
Tandem + 19 in other companies proposing
standard benchmark
• Published “A measure of transaction processing
power,” Datamation, 1985 by Anonymous et al.
– To indicate that this was effort of large group
– To avoid delays in legal departments at each author’s firm
• Led to Transaction Processing Council in 1988
– www.tpc.org
I/O Benchmarks: TP1 by Anon et al.
• Debit/Credit scalability: sizes of account, branch, teller, and history files are a function of throughput
TPS      Number of ATMs   Account-file size
10       1,000            0.1 GB
100      10,000           1.0 GB
1,000    100,000          10.0 GB
10,000   1,000,000        100.0 GB
– Each input TPS ⇒ 100,000 account records, 10 branches, 100 ATMs
– Accounts must grow since customer unlikely to use bank
more often just because they have faster computer!
• Response time: 95% transactions take ≤ 1 second
• Report price (initial purchase price + 5 year
maintenance = cost of ownership)
• Hire auditor to certify results
Unusual Characteristics of TPC
• Price included in benchmarks
– Cost of HW, SW, 5-year maintenance agreement
» Price-performance as well as performance
• Data set must scale up as throughput increases
– Trying to model real systems: demand on system and size of
data stored in it increase together
• Benchmark results are audited
– Ensures only fair results submitted
• Throughput is performance metric but response
times are limited
– E.g., TPC-C: 90% of transaction response times < 5 seconds
• Independent organization maintains the
benchmarks
– Ballots on changes, holds meetings to settle disputes...
TPC Benchmark History/Status

Benchmark                            Data Size (GB)     Performance Metric             1st Results
A: Debit Credit (retired)            0.1 to 10          transactions/s                 Jul-90
B: Batch Debit Credit (retired)      0.1 to 10          transactions/s                 Jul-91
C: Complex Query OLTP                100 to 3000        new order trans/min (tpm)      Sep-92
                                     (min. 0.07 * tpm)
D: Decision Support (retired)        100, 300, 1000     queries/hour                   Dec-95
H: Ad hoc decision support           100, 300, 1000     queries/hour                   Oct-99
R: Business reporting decision       1000               queries/hour                   Aug-99
   support (retired)
W: Transactional web                 ~ 50, 500          web interactions/sec           Jul-00
App: app. server & web services      —                  Web Service Interactions/sec   Jun-05
                                                        (SIPS)
I/O Benchmarks via SPEC
• SFS 3.0: Attempt by NFS companies to agree on
standard benchmark
– Run on multiple clients & networks (to prevent bottlenecks)
– Same caching policy in all clients
– Reads: 85% full-block & 15% partial-block
– Writes: 50% full-block & 50% partial-block
– Average response time: 40 ms
– Scaling: for every 100 NFS ops/sec, increase capacity 1 GB
• Results: plot of server load (throughput) vs.
response time & number of users
– Assumes: 1 user => 10 NFS ops/sec
– 3.0 for NFS 3.0
• Added SPECMail (mailserver), SPECWeb
(webserver) benchmarks
2005 Example SPEC SFS Result:
NetApp FAS3050c NFS servers
• 2.8 GHz Pentium Xeons, 2 GB DRAM per processor, 1 GB non-volatile memory per system
• 4 FDDI nets; 32 NFS daemons, 24 GB file size
• 168 Fibre Channel disks: 72 GB, 15,000 RPM, 2 or 4 FC controllers
[Figure: response time (ms, 0-8) vs. operations/second (0-60,000); the 2-processor configuration peaks at 34,089 ops/s, the 4-processor configuration at 47,927 ops/s]
Availability Benchmark Methodology
• Goal: quantify variation in QoS metrics as events
occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macrobenchmarks
Example single-fault result
[Figure: hits per second and number of failures tolerated vs. time (minutes), during RAID reconstruction after a single injected disk fault, for Linux and Solaris]
• Compares Linux and Solaris reconstruction
– Linux: minimal performance impact but longer window of vulnerability to second fault
– Solaris: large perf. impact but restores redundancy fast
Reconstruction policy (2)
• Linux: favors performance over data availability
– Automatically-initiated reconstruction, idle bandwidth
– Virtually no performance impact on application
– Very long window of vulnerability (>1hr for 3GB RAID)
• Solaris: favors data availability over app. perf.
– Automatically-initiated reconstruction at high BW
– As much as 34% drop in application performance
– Short window of vulnerability (10 minutes for 3GB)
• Windows: favors neither!
– Manually-initiated reconstruction at moderate BW
– As much as 18% app. performance drop
– Somewhat short window of vulnerability (23 min/3GB)
Introduction to Queueing Theory
[Diagram: arrivals → black-box system → departures]
• More interested in long-term steady state than in
startup
⇒ Arrivals = Departures
• Little’s Law:
Mean number tasks in system = arrival rate x
mean response time
– Observed by many, Little was first to prove
– Makes sense: large number of customers means long waits
• Applies to any system in equilibrium, as long as
black box not creating or destroying tasks
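Little's Law is easy to check empirically. The sketch below (rates and horizon are illustrative values, not from the slides) simulates a single-server FIFO queue with exponential interarrival and service times, then compares a directly sampled time-average of N(t) against arrival rate × mean time in system:

```python
import random, bisect

# Empirical check of Little's Law on a simulated single-server FIFO queue:
# time-averaged number in system ≈ arrival rate × mean time in system.
# Rates (lam, mu) and horizon below are illustrative example values.
random.seed(1)
lam, mu = 4.0, 5.0                 # arrival and service rates (per second)
horizon = 20000.0
arr, dep = [], []
t = random.expovariate(lam)
server_free_at = 0.0
while t < horizon:
    arr.append(t)
    server_free_at = max(t, server_free_at) + random.expovariate(mu)
    dep.append(server_free_at)     # FIFO, one server: departures stay sorted
    t += random.expovariate(lam)

# Mean time in system, averaged over customers
mean_T = sum(d - a for a, d in zip(arr, dep)) / len(arr)

# Directly sample N(s) = (arrivals so far) - (departures so far)
samples = [random.uniform(0, horizon) for _ in range(20000)]
mean_N = sum(bisect.bisect(arr, s) - bisect.bisect(dep, s)
             for s in samples) / len(samples)

print(mean_N, (len(arr) / horizon) * mean_T)   # the two sides agree closely
```

With these rates the utilization is 0.8, so both sides should come out near 4; the point is only that the measured mean number in system matches arrival rate × mean time in system without assuming anything about the service discipline.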
Deriving Little’s Law
• Define arr(t) = # arrivals in interval (0,t)
• Define dep(t) = # departures in (0,t)
• Clearly, N(t) = # in system at time t = arr(t) – dep(t)
• Area between curves = spent(t) = total time spent in system by all customers (measured in customer-seconds)
[Figure “Arrivals and Departures”: arr(t) and dep(t) plotted as staircase functions of time t; the vertical gap between the curves is N(t), and the area between them is spent(t)]
Deriving Little’s Law (cont’d)
• Define average arrival rate during interval t, in customers/second, as λ_t = arr(t)/t
• Define T_t as system time/customer, averaged over all customers in (0,t)
– Since spent(t) = accumulated customer-seconds, divide by arrivals up to that point to get T_t = spent(t)/arr(t)
• Mean tasks in system over (0,t) is accumulated customer-seconds divided by seconds:
Mean_tasks_t = spent(t)/t
• Above three equations give us:
Mean_tasks_t = λ_t × T_t
• Assuming limits of λ_t and T_t exist, limit of Mean_tasks_t also exists and gives Little’s result:
Mean tasks in system = arrival rate × mean time in system
A Little Queuing Theory: Notation
[Diagram: Proc → Queue → server (IOC + Device); queue plus server = system]
• Notation:
Time_server: average time to service a task; average service rate = 1 / Time_server (traditionally µ)
Time_queue: average time/task in queue
Time_system: average time/task in system = Time_queue + Time_server
Arrival rate: average number of arriving tasks/sec (traditionally λ)
• Length_server: average number of tasks in service
Length_queue: average length of queue
Length_system: average number of tasks in system = Length_queue + Length_server
Little’s Law applied to the server: Length_server = Arrival rate × Time_server
Server Utilization
• For a single server, service rate = 1 / Time_server
• Server utilization must be between 0 and 1, since system is in equilibrium (arrivals = departures); often called traffic intensity (traditionally ρ)
• Server utilization = mean number of tasks in service = Arrival rate × Time_server
• What is disk utilization if we get 50 I/O requests per second and average disk service time is 10 ms (0.01 sec)?
• Server utilization = 50/sec × 0.01 sec = 0.5
• So the server is busy 50% of the time on average
Time in Queue vs. Length of Queue
• We assume First In First Out (FIFO) queue
• Relationship of time in queue (Time_queue) to mean number of tasks in queue (Length_queue)?
• Time_queue = Length_queue × Time_server + “mean time to complete service of task in progress when new task arrives, if server is busy”
• New task can arrive at any instant; how do we predict that last term?
• To predict performance, need to know something
about distribution of events
I/O Request Distributions
• I/O request arrivals can be modeled by random
variable
– Multiple processes generate independent I/O requests
– Disk seeks and rotational delays are probabilistic
• What distribution to use for model?
– True distributions are complicated
» Self-similar (fractal)
» Zipf
– We often ignore that and use Poisson
» Highly tractable for analysis
» Intuitively appealing (independence of arrival times)
The Poisson Distribution
• Probability of exactly k arrivals in (0,t) is:
P_k(t) = (λt)^k e^(−λt) / k!
– λ is the arrival-rate parameter
• More useful formulation is the interarrival-time distribution:
– CDF A(t) = P[next arrival takes time ≤ t] = 1 − e^(−λt)
– pdf a(t) = λe^(−λt)
– Also known as the exponential distribution
– Mean = standard deviation = 1/λ
• Poisson distribution is memoryless:
– Assume P[arrival within 1 second] at time t0 = x
– Then P[arrival within 1 second] at time t1 > t0 is also x
» I.e., no memory that time has passed
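The memoryless property can be illustrated numerically. This sketch (the arrival rate λ = 0.5 is an arbitrary example value) compares the chance of an arrival within one second starting fresh against the same chance after three seconds have already passed with no arrival:

```python
import random

# Numerical illustration of the memoryless property of exponential
# interarrival times: P[arrival within the next 1 s] is the same whether
# we just started waiting or have already waited 3 s.
random.seed(0)
lam = 0.5                                        # example arrival rate
waits = [random.expovariate(lam) for _ in range(200_000)]

p_fresh = sum(w <= 1.0 for w in waits) / len(waits)
already = [w for w in waits if w > 3.0]          # condition on 3 s elapsed
p_after = sum(w <= 4.0 for w in already) / len(already)

print(p_fresh, p_after)   # both ≈ 1 - e^(-0.5) ≈ 0.39
```

A deterministic (D) or uniform interarrival time would fail this test badly, which is why the M in Kendall's notation (next slide's topic) carries so much analytical weight.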
Kendall’s Notation
• Queueing system is notated A/S/s/c, where:
– A encodes the interarrival distribution
– S encodes the service-time distribution
» Both A and S can be M (Memoryless, Markov, or
exponential), D (deterministic), Er (r-stage Erlang)
– s is the number of servers
– c is the capacity of the queue, if finite (omitted when infinite)
• Examples:
– D/D/1 is arrivals on clock tick, fixed service times, one server
– M/M/m is memoryless arrivals, memoryless service, multiple
servers (this is good model of a bank)
– M/M/m/m is case where customers go away rather than wait in
line
– G/G/1 is what disk drive is really like (but mostly intractable to
analyze)
M/M/1 Queuing Model
• System is in equilibrium
• Exponential interarrival and service times
• Unlimited source of customers (“infinite
population model”)
• FIFO queue
• Book also derives M/M/m
• Most important results:
– Let arrival rate λ = 1/average interarrival time
– Let service rate μ = 1/average service time
– Define utilization ρ = λ/μ
– Then average number in system = ρ/(1−ρ)
– And time in system = (1/μ)/(1−ρ)
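As a quick sketch (valid only under the M/M/1 assumptions above), the two results can be wrapped in a small helper:

```python
# The two M/M/1 results above as a helper. A sketch under the exponential
# assumptions; real systems rarely satisfy them exactly.
def mm1(lam, mu):
    """Return (utilization, mean number in system, mean time in system)."""
    assert lam < mu, "queue is unstable at rho >= 1"
    rho = lam / mu
    n = rho / (1 - rho)            # average number in system
    t = (1 / mu) / (1 - rho)       # average time in system
    return rho, n, t

print(mm1(40, 50))   # ≈ (0.8, 4.0, 0.1): 80% utilized, 4 tasks, 100 ms
```

The call shown uses the disk numbers from the example two slides ahead (40 I/Os/sec, 20 ms service time, i.e. μ = 50).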
Explosion of Load with Utilization
[Figure: mean number in system, ρ/(1−ρ), plotted against utilization ρ from 0 to 1; the curve is nearly flat at low utilization (about 4 at ρ = 0.8) and blows up as ρ approaches 1]
Example M/M/1 Analysis
• Assume 40 disk I/Os per second
– Exponential interarrival times
– Exponential service time with mean 20 ms
⇒ λ = 40, Time_server = 1/μ = 0.02 sec
• Server utilization ρ = Arrival rate × Time_server = λ/μ = 40 × 0.02 = 0.8 = 80%
• Time_queue = Time_server × ρ/(1−ρ) = 20 ms × 0.8/(1−0.8) = 20 × 4 = 80 ms
• Time_system = Time_queue + Time_server = 80 + 20 ms = 100 ms
How Much Better
With 2X Faster Disk?
• Average service time is now 10 ms
⇒ Arrival rate = 40/sec, Time_server = 0.01 sec
• Now server utilization = Arrival rate × Time_server = 40 × 0.01 = 0.4 = 40%
• Time_queue = Time_server × ρ/(1−ρ) = 10 ms × 0.4/(1−0.4) = 10 × 2/3 = 6.7 ms
• Time_system = Time_queue + Time_server = 6.7 + 10 ms = 16.7 ms
• 6X faster response time with 2X faster disk!
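The nonlinearity is the whole point: halving the service time also halves ρ, so queueing delay collapses. A few lines reproduce both examples using Time_system = Time_server + Time_server × ρ/(1−ρ):

```python
# Reproducing the two M/M/1 disk examples above: halving service time
# gives ~6X better response time because queueing delay shrinks nonlinearly.
def response_time_ms(service_ms, arrivals_per_sec):
    rho = arrivals_per_sec * service_ms / 1000.0   # utilization
    return service_ms + service_ms * rho / (1 - rho)   # T_server + T_queue

slow = response_time_ms(20, 40)    # 100 ms   (rho = 0.8)
fast = response_time_ms(10, 40)    # ~16.7 ms (rho = 0.4)
print(slow, fast, slow / fast)     # speedup ≈ 6
```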
Value of Queueing Theory in Practice
• Quick lesson:
– Don’t try for 100% utilization
– But how far to back off?
• Theory allows designers to:
– Estimate impact of faster hardware on utilization
– Find knee of response curve
– Thus find impact of HW changes on response time
• Works surprisingly well
Crosscutting Issues:
Buses ⇒ Point-to-Point Links & Switches

Standard              Width   Length   Clock rate      MB/s   Max devices
(Parallel) ATA        8b      0.5 m    133 MHz         133    2
Serial ATA            2b      2 m      3 GHz           300    ?
(Parallel) SCSI       16b     12 m     80 MHz (DDR)    320    15
Serial Attach SCSI    1b      10 m     —               375    16,256
PCI                   32/64   0.5 m    33 / 66 MHz     533    ?
PCI Express           2b      0.5 m    3 GHz           250    ?

• No. of bits and BW are per direction ⇒ 2X for both directions (not shown)
• Since serial links use fewer wires, BW is commonly increased via versions with 2X-12X the number of wires and BW
– …but timing problems arise
Storage Example: Internet Archive
• Goal of making a historical record of the Internet
– Internet Archive began in 1996
– Wayback Machine interface performs time travel to see what a
web page looked like in the past
• Contains over a petabyte (10^15 bytes)
– Growing by 20 terabytes (10^12 bytes) of new data per month
• Besides storing historical record, same hardware
crawls Web to get new snapshots
Internet Archive Cluster
• 1U storage node PetaBox GB2000 from
Capricorn Technologies
• Has 4 500-GB Parallel ATA (PATA) drives, 512 MB of DDR266 DRAM, Gbit Ethernet, and 1 GHz C3 processor from VIA (80x86)
• Node dissipates ≈ 80 watts
• 40 GB2000s in a standard VME rack, ≈ 80 TB raw storage capacity
• 40 nodes connected with 48-port Ethernet switch
• Rack dissipates about 3 kW
• 1 Petabyte = 12 racks
Estimated Cost
• VIA processor, 512 MB of DDR266 DRAM, ATA disk controller, power supply, fans, and enclosure = $500
• 7200 RPM 500-GB PATA drive = $375 (in 2006)
• 48-port 10/100/1000 Ethernet switch and all cables for a rack = $3000
• Total cost: $84,500 for an 80-TB rack
• 160 disks are ≈ 60% of total cost
Estimated Performance
• 7200 RPM drive:
– Average seek time = 8.5 ms
– Transfer bandwidth 50 MB/second
– PATA link can handle 133 MB/second
– ATA controller overhead is 0.1 ms per I/O
• VIA processor is 1000 MIPS
– OS needs 50K CPU instructions for a disk I/O
– Network stack uses 100K instructions per data block
• Average I/O size:
– 16 KB for archive fetches
– 50 KB when crawling Web
• Disks are the limit:
– ≈ 75 I/Os/s per disk, thus 300/s per node, 12,000/s per rack
– About 200-600 MB/sec bandwidth per rack
• Switch must do 1.6-3.8 Gb/s over 40 Gb/s links
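The rack-level figures follow directly from the per-disk numbers above; a quick sanity check (pure arithmetic on the slide's inputs, with the 200-600 MB/s range coming from the two average I/O sizes):

```python
# Back-of-the-envelope check of the rack-level numbers above. The 75 I/Os/s
# per disk and the 16/50 KB I/O sizes come from the slide; the rest is
# just multiplication.
ios_per_disk = 75
disks_per_node, nodes_per_rack = 4, 40
ios_per_node = ios_per_disk * disks_per_node          # 300 I/Os/s
ios_per_rack = ios_per_node * nodes_per_rack          # 12,000 I/Os/s
bw_low = ios_per_rack * 16 * 1024 / 1e6               # ~197 MB/s (16 KB fetches)
bw_high = ios_per_rack * 50 * 1024 / 1e6              # ~614 MB/s (50 KB crawls)
print(ios_per_rack, round(bw_low), round(bw_high))
```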
Estimated Reliability
• CPU/memory/enclosure MTTF is 1,000,000 hours
(x 40)
• Disk MTTF 125,000 hours (x 160)
• PATA controller MTTF 500,000 hours (x 40)
• PATA cable MTTF 1,000,000 hours (x 40)
• Ethernet switch MTTF 500,000 hours (x 1)
• Power supply MTTF 200,000 hours (x 40)
• Fan MTTF 200,000 hours (x 40)
• MTTF for system works out to 531 hours (≈ 3 weeks)
• 70% of failures in time are disks
• 20% of failures in time are fans or power supplies
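The system MTTF comes from summing component failure rates, since any single component failure takes the system down (series reliability). With the counts and MTTFs listed above the sum gives roughly 540 hours; the slide's 531 presumably reflects slightly different rounding of the inputs:

```python
# Series-reliability model: failure rates add, so
# MTTF_system = 1 / sum(count / MTTF_component).
components = [                      # (count, MTTF in hours), from the slide
    (40, 1_000_000),   # CPU/memory/enclosure
    (160, 125_000),    # disks
    (40, 500_000),     # PATA controllers
    (40, 1_000_000),   # PATA cables
    (1, 500_000),      # Ethernet switch
    (40, 200_000),     # power supplies
    (40, 200_000),     # fans
]
failure_rate = sum(n / mttf for n, mttf in components)   # failures per hour
disk_share = (160 / 125_000) / failure_rate
print(round(1 / failure_rate), round(disk_share, 2))   # ≈ 540 h; disks ≈ 0.7
```

The disk share of the failure rate also confirms the "70% of failures are disks" bullet.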
Summary
[Diagram: Proc → Queue → server (IOC + Device)]
• Little’s Law: Length_system = arrival rate × Time_system
(mean number of customers = arrival rate × mean time in system)
• Appreciation for the relationship of latency and utilization:
• Time_system = Time_server + Time_queue
• Time_queue = Time_server × ρ/(1−ρ)
• Clusters for storage as well as computation
• RAID: Reliability matters, not performance