CS 136, Advanced Architecture
Storage Performance Measurement

Outline
• I/O Benchmarks, Performance, and Dependability
• Introduction to Queueing Theory
I/O Performance
Metrics: Response Time vs. Throughput
[Figure: response time (ms) vs. throughput (% of total bandwidth); response time rises sharply as throughput approaches 100%]
[Diagram: Proc → Queue → IOC → Device]
Response time = Queue Time + Device Service Time

I/O Benchmarks
• For better or worse, benchmarks shape a field
  – Processor benchmarks classically aimed at response time for a fixed-size problem
  – I/O benchmarks typically measure throughput, possibly with an upper limit on response times (or on 90% of response times)
• Transaction Processing (TP) (or On-Line TP = OLTP)
  – If the bank's computer fails when a customer withdraws money, a TP system guarantees the account is debited if the customer gets the cash, and left unchanged if not
  – Airline reservation systems and banks use TP
    » Atomic transactions make this work
  – The classic metric is Transactions Per Second (TPS)

I/O Benchmarks: Transaction Processing
• Early 1980s: great interest in OLTP
  – Expected demand for high TPS (e.g., ATMs, credit cards)
  – Tandem's success implied that medium-range OLTP was expanding
  – Each vendor picked its own conditions for TPS claims, and reported only CPU times with widely different I/O
  – Conflicting claims led to disbelief in all benchmarks → chaos
• 1984: Jim Gray (Tandem) distributed a paper to Tandem and 19 people in other companies proposing a standard benchmark
• Published "A measure of transaction processing power," Datamation, 1985, by Anonymous et al.
  – To indicate that this was the effort of a large group
  – To avoid delays in the legal departments at each author's firm
• Led to the Transaction Processing Council in 1988
  – www.tpc.org

I/O Benchmarks: TP1 by Anon. et al.
• Debit/Credit scalability: the sizes of the account, branch, teller, and history files are a function of throughput

    TPS     Number of ATMs    Account-file size
    10      1,000             0.1 GB
    100     10,000            1.0 GB
    1,000   100,000           10.0 GB
    10,000  1,000,000         100.0 GB

  – Each input TPS ⇒ 100,000 account records, 10 branches, 100 ATMs
  – Accounts must grow, since a customer is unlikely to use the bank more often just because it has a faster computer!
• Response time: 95% of transactions take ≤ 1 second
• Report price (initial purchase price + 5-year maintenance = cost of ownership)
• Hire an auditor to certify the results

Unusual Characteristics of TPC
• Price is included in the benchmarks
  – Cost of HW, SW, and a 5-year maintenance agreement
    » Price-performance as well as performance
• The data set must scale up as throughput increases
  – Trying to model real systems: demand on a system and the size of the data stored in it grow together
• Benchmark results are audited
  – Ensures only fair results are submitted
• Throughput is the performance metric, but response times are limited
  – E.g., TPC-C: 90% of transaction response times < 5 seconds
• An independent organization maintains the benchmarks
  – Ballots on changes, holds meetings to settle disputes, ...

TPC Benchmark History/Status

    Benchmark                                         Data Size (GB)                 Performance Metric                 1st Results
    A: Debit Credit (retired)                         0.1 to 10                      transactions/s                     Jul-90
    B: Batch Debit Credit (retired)                   0.1 to 10                      transactions/s                     Jul-91
    C: Complex Query OLTP                             100 to 3000 (min. 0.07 × tpm)  new-order transactions/min (tpm)   Sep-92
    D: Decision Support (retired)                     100, 300, 1000                 queries/hour                       Dec-95
    H: Ad hoc decision support                        100, 300, 1000                 queries/hour                       Oct-99
    R: Business reporting decision support (retired)  1000                           queries/hour                       Aug-99
    W: Transactional web                              ~50, 500                       web interactions/s                 Jul-00
    App: app. server & web services                   —                              Web Service Interactions/s (SIPS)  Jun-05

I/O Benchmarks via SPEC
• SFS 3.0: an attempt by NFS companies to agree on a standard benchmark
  – Run on multiple clients & networks (to prevent bottlenecks)
  – Same caching policy in all clients
  – Reads: 85% full-block & 15% partial-block
  – Writes: 50% full-block & 50% partial-block
  – Average response time: 40 ms
  – Scaling: for every 100 NFS ops/sec, increase capacity by 1 GB
• Results: plot of server load (throughput) vs. response time, and number of users
  – Assumes 1 user ⇒ 10 NFS ops/sec
  – "3.0" is for NFS 3.0
• Later added SPECMail (mail server) and SPECWeb (web server) benchmarks

2005 Example SPEC SFS Result: NetApp FAS3050c NFS Servers
• 2.8 GHz Pentium Xeons, 2 GB DRAM per processor, 1 GB non-volatile memory per system
• 4 FDDI nets; 32 NFS daemons, 24 GB file size
• 168 Fibre Channel disks: 72 GB, 15,000 RPM; 2 or 4 FC controllers
[Figure: response time (ms) vs. operations/second; the 2-processor system peaks at 34,089 ops/sec, the 4-processor system at 47,927 ops/sec]

Availability Benchmark Methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
  – to generate fair workloads
  – to measure & trace quality-of-service metrics
• Use fault injection to compromise the system
  – hardware faults (disk, memory, network, power)
  – software faults (corrupt input, driver error returns)
  – maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
  – the availability analogues of performance micro- and macro-benchmarks

Example Single-Fault Result
[Figure: hits per second and number of failures tolerated vs. time (minutes) during RAID reconstruction after a single disk fault, for Linux and Solaris]
• Compares Linux and Solaris reconstruction
  – Linux: minimal performance impact but longer
window of vulnerability to a second fault
  – Solaris: large performance impact but restores redundancy fast

Reconstruction Policy (2)
• Linux: favors performance over data availability
  – Automatically-initiated reconstruction using idle bandwidth
  – Virtually no performance impact on the application
  – Very long window of vulnerability (> 1 hour for a 3 GB RAID)
• Solaris: favors data availability over application performance
  – Automatically-initiated reconstruction at high bandwidth
  – As much as a 34% drop in application performance
  – Short window of vulnerability (10 minutes for 3 GB)
• Windows: favors neither!
  – Manually-initiated reconstruction at moderate bandwidth
  – As much as an 18% application performance drop
  – Somewhat short window of vulnerability (23 minutes for 3 GB)

Introduction to Queueing Theory
[Diagram: arrivals enter a black-box system, departures leave]
• We are more interested in the long-term steady state than in startup
  ⇒ Arrivals = Departures
• Little's Law: mean number of tasks in system = arrival rate × mean response time
  – Observed by many; Little was the first to prove it
  – Makes sense: a large number of customers means long waits
• Applies to any system in equilibrium, as long as the black box is not creating or destroying tasks

Deriving Little's Law
• Define arr(t) = number of arrivals in the interval (0, t)
• Define dep(t) = number of departures in (0, t)
• Clearly, N(t) = number in system at time t = arr(t) − dep(t)
• The area between the two curves = spent(t) = total time spent in the system by all customers (measured in customer-seconds)
[Figure: arr(t) and dep(t) as staircase functions of time t; N(t) is the vertical gap between them]

Deriving Little's Law (cont'd)
• Define the average arrival rate during interval t, in customers/second, as λ_t = arr(t)/t
• Define T_t as system time per customer, averaged over all customers in (0, t)
  – Since spent(t) = accumulated customer-seconds, divide by the arrivals up to that point to get T_t = spent(t)/arr(t)
• The mean number of tasks in the system over (0, t) is the accumulated customer-seconds divided by seconds: Mean_tasks_t = spent(t)/t
• The above three equations give us: Mean_tasks_t = λ_t × T_t
• Assuming the limits of λ_t and T_t exist, the limit of Mean_tasks_t also exists and gives Little's result:
  Mean tasks in system = arrival rate × mean time in system

A Little Queueing Theory: Notation
[Diagram: Proc → Queue → server (IOC, Device)]
• Notation:
  – Time_server: average time to service a task; average service rate = 1/Time_server (traditionally µ)
  – Time_queue: average time per task in the queue
  – Time_system: average time per task in the system = Time_queue + Time_server
  – Arrival rate: average number of arriving tasks per second (traditionally λ)
  – Length_server: average number of tasks in service
  – Length_queue: average length of the queue
  – Length_system: average number of tasks in the system = Length_queue + Length_server
• Little's Law: Length_server = Arrival rate × Time_server

Server Utilization
• For a single server, service rate = 1/Time_server
• Server utilization must be between 0 and 1, since the system is in equilibrium (arrivals = departures); it is often called traffic intensity (traditionally ρ)
• Server utilization = mean number of tasks in service = Arrival rate × Time_server
• What is the disk utilization if we get 50 I/O requests per second for a disk whose average service time is 10 ms (0.01 sec)?
• Server utilization = 50/sec × 0.01 sec = 0.5
  – I.e., the server is busy 50% of the time on average

Time in Queue vs. Length of Queue
• We assume a First In, First Out (FIFO) queue
• What is the relationship of time in queue (Time_queue) to the mean number of tasks in the queue (Length_queue)?
• Time_queue = Length_queue × Time_server + "mean time to complete service of the task in progress when the new task arrives, if the server is busy"
• A new task can arrive at any instant; how do we predict that last part?
• To predict performance, we need to know something about the distribution of events

I/O Request Distributions
• I/O request arrivals can be modeled by a random variable
  – Multiple processes generate independent I/O requests
  – Disk seeks and rotational delays are probabilistic
• What distribution should we use for the model?
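Before turning to specific arrival distributions, the relationships introduced so far — server utilization (ρ = arrival rate × Time_server) and Little's Law — can be sanity-checked with a short simulation. This is an illustrative sketch, not from the slides: the function and parameter names are invented, and the run reuses the 50 I/Os-per-second, 10-ms disk example, so the measured utilization should land near 0.5. Over the measured interval Little's Law holds exactly, because it is the identity spent(t)/t = (arr(t)/t) × (spent(t)/arr(t)) from the derivation above.

```python
import random

def simulate_fifo_queue(arrival_rate, mean_service_time, n_tasks, seed=1):
    """Simulate a single-server FIFO queue with exponential interarrival
    and service times.  Returns (mean tasks in system, observed arrival
    rate, mean time in system, server utilization) over one long run."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_tasks):
        t += rng.expovariate(arrival_rate)       # exponential interarrivals
        arrivals.append(t)
    server_free = busy = spent = 0.0
    for a in arrivals:
        start = max(a, server_free)              # FIFO: wait if server busy
        service = rng.expovariate(1.0 / mean_service_time)
        server_free = start + service
        busy += service                          # accumulate server busy time
        spent += server_free - a                 # customer-seconds in system
    end = server_free                            # time of last departure
    return spent / end, n_tasks / end, spent / n_tasks, busy / end

# Disk example from the utilization slide: 50 I/Os/sec, 10 ms service time
mean_n, rate, mean_t, util = simulate_fifo_queue(50.0, 0.01, 20000)
# Little's Law: mean tasks in system == arrival rate x mean time in system
assert abs(mean_n - rate * mean_t) < 1e-9
```

With 20,000 tasks the measured utilization settles close to the predicted ρ = 50 × 0.01 = 0.5; the Little's Law check is exact by construction, since all three quantities are ratios over the same interval.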
  – True distributions are complicated
    » Self-similar (fractal)
    » Zipf
  – We often ignore that and use Poisson
    » Highly tractable for analysis
    » Intuitively appealing (independence of arrival times)

The Poisson Distribution
• The probability of exactly k arrivals in (0, t) is:
  P_k(t) = (λt)^k e^(−λt) / k!
  – λ is the arrival-rate parameter
• A more useful formulation is the distribution of the interarrival times:
  – CDF A(t) = P[next arrival takes time ≤ t] = 1 − e^(−λt)
  – pdf a(t) = λ e^(−λt)
  – Also known as the exponential distribution
  – Mean = standard deviation = 1/λ
• The exponential distribution is memoryless:
  – Assume P[arrival within 1 second] at time t0 is x
  – Then P[arrival within 1 second] at any time t1 > t0 is also x
    » I.e., there is no memory that time has passed

Kendall's Notation
• A queueing system is denoted A/S/s/c, where:
  – A encodes the interarrival distribution
  – S encodes the service-time distribution
    » Both A and S can be M (Memoryless, Markov, or exponential), D (deterministic), or E_r (r-stage Erlang)
  – s is the number of servers
  – c is the capacity of the queue, if not infinite
• Examples:
  – D/D/1: arrivals on a clock tick, fixed service times, one server
  – M/M/m: memoryless arrivals, memoryless service, multiple servers (a good model of a bank)
  – M/M/m/m: the case where customers go away rather than wait in line
  – G/G/1: what a disk drive is really like (but mostly intractable to analyze)

M/M/1 Queueing Model
• The system is in equilibrium
• Exponential interarrival and service times
• Unlimited source of customers ("infinite population model")
• FIFO queue
• The book also derives M/M/m
• Most important results:
  – Let the arrival rate λ = 1 / (average interarrival time)
  – Let the service rate μ = 1 / (average service time)
  – Define utilization ρ = λ/μ
  – Then the average number in the system = ρ/(1−ρ)
  – And the time in the system = (1/μ)/(1−ρ)

Explosion of Load with Utilization
[Figure: number in system vs. utilization; nearly flat at low utilization, exploding as utilization approaches 1]

Example M/M/1 Analysis
• Assume 40 disk I/Os per second, with –
Exponential interarrival time
  – Exponential service time with mean 20 ms
  ⇒ λ = 40, Time_server = 1/μ = 0.02 sec
• Server utilization ρ = Arrival rate × Time_server = λ/μ = 40 × 0.02 = 0.8 = 80%
• Time_queue = Time_server × ρ/(1−ρ) = 20 ms × 0.8/(1−0.8) = 20 × 4 = 80 ms
• Time_system = Time_queue + Time_server = 80 + 20 ms = 100 ms

How Much Better With a 2X Faster Disk?
• The average service time is now 10 ms
  ⇒ Arrival rate = 40/sec, Time_server = 0.01 sec
• Now server utilization = Arrival rate × Time_server = 40 × 0.01 = 0.4 = 40%
• Time_queue = Time_server × ρ/(1−ρ) = 10 ms × 0.4/(1−0.4) = 10 × 2/3 = 6.7 ms
• Time_system = Time_queue + Time_server = 6.7 + 10 ms = 16.7 ms
• 6X faster response time with a 2X faster disk!

Value of Queueing Theory in Practice
• Quick lesson:
  – Don't try for 100% utilization
  – But how far should you back off?
• Theory allows designers to:
  – Estimate the impact of faster hardware on utilization
  – Find the knee of the response curve
  – And thus find the impact of HW changes on response time
• Works surprisingly well

Crosscutting Issues: Buses → Point-to-Point Links & Switches

    Standard             Width   Length   Clock rate      MB/s   Max devices
    (Parallel) ATA       8b      0.5 m    133 MHz         133    2
    Serial ATA           2b      2 m      3 GHz           300    ?
    (Parallel) SCSI      16b     12 m     80 MHz (DDR)    320    15
    Serial Attach SCSI   1b      10 m     --              375    16,256
    PCI                  32/64   0.5 m    33 / 66 MHz     533    ?
    PCI Express          2b      0.5 m    3 GHz           250    ?

• Number of bits and bandwidth are per direction; 2X for both directions (not shown)
• Since serial links use fewer wires, bandwidth is commonly increased via versions with 2X-12X the number of wires and bandwidth
  – …but timing problems arise

Storage Example: Internet Archive
• Goal: make a historical record of the Internet
  – The Internet Archive began in 1996
  – The Wayback Machine interface performs time travel to show what a web page looked like in the past
• Contains over a petabyte (10^15 bytes)
  – Growing by 20 terabytes (10^12 bytes) of new data per month
• Besides storing the historical record, the same hardware crawls the Web to get new snapshots

Internet Archive Cluster
• 1U storage node: PetaBox GB2000 from Capricorn Technologies
• Has 4 500-GB Parallel ATA (PATA) drives, 512 MB of DDR266 DRAM, Gbit Ethernet, and a 1 GHz VIA C3 (80x86) processor
• Each node dissipates 80 watts
• 40 GB2000s in a standard VME rack = 80 TB of raw storage capacity
• The 40 nodes are connected with a 48-port Ethernet switch
• A rack dissipates about 3 KW
• 1 petabyte = 12 racks

Estimated Cost
• VIA processor, 512 MB of DDR266 DRAM, ATA disk controller, power supply, fans, and enclosure = $500
• 7200-RPM 500-GB PATA drive = $375 (in 2006)
• 48-port 10/100/1000 Ethernet switch and all cables for a rack = $3,000
• Total cost: $84,500 for an 80-TB rack
• The 160 disks are 60% of the total

Estimated Performance
• 7200-RPM drive:
  – Average seek time = 8.5 ms
  – Transfer bandwidth = 50 MB/second
  – The PATA link can handle 133 MB/second
  – ATA controller overhead is 0.1 ms per I/O
• The VIA processor is 1000 MIPS
  – The OS needs 50K CPU instructions for a disk I/O
  – The network stack uses 100K instructions per data block
• Average I/O size:
  – 16 KB for archive fetches
  – 50 KB when crawling the Web
• Disks are the limit:
  – 75 I/Os/sec per disk, thus 300/sec per node and 12,000/sec per rack
  – About 200-600 MB/sec of bandwidth per rack
• The switch must handle 1.6-3.8 Gbit/s over its 40 Gbit/s of links

Estimated Reliability
• CPU/memory/enclosure MTTF is 1,000,000 hours (×40)
• Disk MTTF is 125,000 hours (×160)
• PATA controller MTTF is 500,000 hours (×40)
• PATA cable
MTTF is 1,000,000 hours (×40)
• Ethernet switch MTTF is 500,000 hours (×1)
• Power supply MTTF is 200,000 hours (×40)
• Fan MTTF is 200,000 hours (×40)
• MTTF for the whole system works out to 531 hours (≈ 3 weeks)
• 70% of failures in time are disks
• 20% of failures in time are fans or power supplies

Summary
[Diagram: Proc → Queue → server (IOC, Device)]
• Little's Law: Length_system = Arrival rate × Time_system
  (mean number of customers = arrival rate × mean time in system)
• Appreciation for the relationship of latency and utilization:
  – Time_system = Time_server + Time_queue
  – Time_queue = Time_server × ρ/(1−ρ)
• Clusters for storage as well as computation
• RAID: reliability matters, not performance
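The M/M/1 results summarized above can be exercised in a few lines of code. A minimal sketch (the function name is mine) that reproduces the earlier disk examples: 40 I/Os per second against a 20 ms disk gives 80% utilization and a 100 ms response time, while the 2X faster 10 ms disk gives 40% utilization and about 16.7 ms — 6X better.

```python
def mm1(arrival_rate, service_time):
    """M/M/1 in equilibrium: return (utilization, Time_queue, Time_system),
    using Time_queue = Time_server * rho / (1 - rho)."""
    rho = arrival_rate * service_time           # server utilization
    assert 0.0 <= rho < 1.0, "equilibrium requires rho < 1"
    t_queue = service_time * rho / (1.0 - rho)  # expected wait in queue
    return rho, t_queue, t_queue + service_time

# 20 ms disk: 80% utilization, 100 ms total response time
rho, tq, ts = mm1(40.0, 0.020)
# 10 ms disk: 40% utilization, ~16.7 ms response time
rho2, tq2, ts2 = mm1(40.0, 0.010)
```

The steep payoff — 6X lower response time from a 2X faster disk — comes entirely from the 1/(1−ρ) factor, which is the "knee of the response curve" mentioned above.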
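The rack MTTF estimate from the reliability slide can likewise be reproduced by summing per-component failure rates: assuming independent components with exponentially distributed lifetimes, failure rates add, so MTTF_system = 1 / Σ(count_i / MTTF_i). This sketch (function name mine) lands within a few percent of the quoted 531 hours — roughly three weeks — and confirms that disks account for about 70% of the failures.

```python
def system_mttf(components):
    """components: list of (mttf_hours, count) pairs.  With independent
    exponential lifetimes, component failure rates add, so the system
    MTTF is the reciprocal of the summed rate."""
    total_rate = sum(count / mttf for mttf, count in components)
    return 1.0 / total_rate

rack = [                  # per-rack component MTTFs from the slides
    (1_000_000, 40),      # CPU/memory/enclosure
    (125_000, 160),       # disks
    (500_000, 40),        # PATA controllers
    (1_000_000, 40),      # PATA cables
    (500_000, 1),         # Ethernet switch
    (200_000, 40),        # power supplies
    (200_000, 40),        # fans
]

mttf_hours = system_mttf(rack)                  # roughly three weeks
disk_share = (160 / 125_000) * mttf_hours       # disks' share of failures
```

Small differences from the slide's 531-hour figure come from rounding and from exactly which components are counted; the dominant terms are clear either way: disks (~70%) and fans plus power supplies (~20%).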