
Design and Evaluation of Architectures for Commercial Applications
Part I: Benchmarks
Luiz André Barroso
Western Research Laboratory
Why should architects learn about commercial applications?

Because they are very different from typical benchmarks

Because they place demands on many interesting architectural features

Because they are driving the sales of mid-range and high-end systems
Shortcomings of popular benchmarks

SPEC
– uniprocessor-oriented
– small cache footprints
– exacerbates impact of CPU core issues

SPLASH
– small cache footprints
– extremely optimized sharing

STREAMS
– no real sharing/communication
– mainly bandwidth-oriented
SPLASH vs. Online Transaction Processing (OLTP)

Compared to an OLTP app, a typical SPLASH app has:
– > 3x the issue rate
– ~26x fewer cycles spent in memory barriers
– 1/4 the TLB miss ratio
– < 1/2 the fraction of cache-to-cache transfers
– ~22x smaller instruction cache miss ratio
– ~1/2 the L2$ miss ratio
But the real reason we care? $$$!

Server market:
– Total: > $50 billion
– Numeric/scientific computing: < $2 billion
– Remaining $48 billion? OLTP, DSS, Internet/Web

Trend is for numerical/scientific to remain a niche
Relevance of server vs. PC market

High profit margins

Performance is a differentiating factor

If you sell the server, you will probably also sell:
– the client
– the storage
– the networking infrastructure
– the middleware
– the service
– ...
Need for speed in the commercial market

Applications pushing the envelope:
– Enterprise resource planning (ERP)
– Electronic commerce
– Data mining/warehousing
– ADSL servers

Specialized solutions:
– Intel splitting the Pentium line into 3 tiers
– Oracle's "raw iron" initiative
– Network Appliance's machines
Seminar disclaimer

Hardware-centric approach:
– target is to build better machines, not better software
– focus on fundamental behavior, not on software "features"

Stick to the general-purpose paradigm

Emphasis on CPU + memory system issues

Lots of things missing:
– object-relational and object-oriented databases
– public-domain/academic database engines
– many others
Overview

Day 1: Introduction and workloads
– Background on commercial applications
– Software structure of a commercial RDBMS
– Standard benchmarks: TPC-B, TPC-C, TPC-D, TPC-W
– Cost and pricing trends
– Scaling down TPC benchmarks
Overview (2)

Day 2: Evaluation methods/tools
– Introduction
– Software instrumentation (ATOM)
– Hardware measurement & profiling: IPROBE, DCPI, ProfileMe
– Tracing & trace-driven simulation
– User-level simulators
– Complete machine simulators (SimOS)
Overview (3)

Day 3: Architecture studies
– Memory system characterization
– Out-of-order processors
– Simultaneous multithreading
– Final remarks
Background on commercial applications

Database applications:

Online Transaction Processing (OLTP)
– massive number of short queries
– read/update indexed tables
– canonical example: banking system

Decision Support Systems (DSS)
– smaller number of complex queries
– mostly read-only over large (non-indexed) tables
– canonical example: business analysis
Background (2)

Web/Internet applications:

Web server
– many requests for small/medium files

Proxy
– many short-lived connection requests
– content caching and coherence

Web search index
– DSS with a Web front-end

E-commerce site
– OLTP with a Web front-end
Background (3)

Common characteristics:
– Large amounts of data manipulation
– Interactive response times required
– Highly multithreaded by design: suitable for large multiprocessors
– Significant I/O requirements
– Extensive/complex interactions with the operating system
– Require robustness and resiliency to failures
Database performance bottlenecks

I/O-bound until recently (Thakkar, ISCA'90)

Many improvements since then:
– multithreading of the DB engine
– I/O prefetching
– VLM (very large memory) database caching
– more efficient OS interactions
– RAIDs
– non-volatile DRAM (NVDRAM)

Today's bottlenecks:
– Memory system
– Processor architecture
Structure of a database workload

[Diagram] clients (simple logic checks) → application server (optional; formulates and issues the DB query) → database server (executes the query)
Who is who in the database market?

DB engine:
– Oracle is dominant
– other players: Microsoft, Sybase, Informix

Database applications:
– SAP is dominant
– other players: Oracle Apps, PeopleSoft, Baan

Hardware:
– players: Sun, IBM, HP and Compaq
Who is who in the database market? (2)

Historically, mainly mainframe proprietary OS

Today:
– Unix: 40%
– NT: 8%
– Proprietary: 52%

In two years:
– Unix: 46%
– NT: 19%
– Proprietary: 35%
Overview of an RDBMS: Oracle8

Similar in structure to most commercial engines

Runs on:
– uniprocessors
– SMP multiprocessors
– NUMA multiprocessors*

For clusters or message-passing multiprocessors: Oracle Parallel Server (OPS)
The Oracle RDBMS

Physical structure:

Control files
– basic info on the database, its structure and status

Data files
– tables: actual database data
– indexes: sorted lists of pointers to data
– rollback segments: keep data for recovery upon a failed transaction

Log files
– compressed storage of DB updates
Index files

Critical in speeding up access to data by avoiding expensive scans

The more selective the index, the faster the access

Drawbacks:
– very selective indexes may occupy lots of storage
– updates to indexed data are more expensive
Files or raw disk devices

Most DB engines can directly access disks as raw devices

The idea is to bypass the file system
– manageability/flexibility somewhat compromised
– performance boost not large (~10-15%)

Most customer installations use file systems
Transactions & rollback segments

A single transaction can access/update many items

Atomicity is required: a transaction either happens or not

Example: bank transfer

    Transaction A (accounts X, Y; value M) {
        read account balance(X)
        subtract M from balance(X)
        add M to balance(Y)
        commit
    }

On failure:
– the old value of balance(X) is kept in a rollback segment
– rollback: old values restored, all locks released
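To make the atomicity mechanism concrete, here is a minimal Python sketch (not Oracle's actual implementation; the in-memory model and names are assumed for illustration): old values are saved in a rollback segment before each update, and restored if the transaction fails.

    # Sketch: atomicity via a rollback segment (illustrative model only).
    def transfer(balances, x, y, m):
        rollback_segment = {}                  # old values, saved before each update
        try:
            rollback_segment[x] = balances[x]
            balances[x] -= m                   # subtract M from balance(X)
            if balances[x] < 0:
                raise ValueError("insufficient funds")
            rollback_segment[y] = balances[y]
            balances[y] += m                   # add M to balance(Y)
            rollback_segment.clear()           # commit: undo info no longer needed
        except Exception:
            balances.update(rollback_segment)  # rollback: restore old values
            raise

    balances = {"X": 100, "Y": 50}
    transfer(balances, "X", "Y", 30)           # balances == {"X": 70, "Y": 80}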
Transactions & log files

A transaction is only committed after its side effects are in stable storage

Writing all modified DB blocks would be too expensive:
– random disk writes are costly
– a whole DB block has to be written back
– no coalescing of updates

Alternative: write only a log of modifications
– sequential I/O writes (enables NVDRAM optimizations)
– batching of multiple commits

A background process periodically writes dirty data blocks out
Transactions & log files (2)

When a block is written to disk, its log file entries are deleted

If the system crashes, in-memory dirty blocks are lost

Recovery procedure: goes through the log files and applies all updates to the database
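A small Python sketch of this redo-log scheme (an assumed, simplified model, not the Oracle code path): commits append to a sequential log, a checkpoint flushes dirty blocks and truncates the log, and recovery replays the log after a crash.

    # Sketch: redo logging and crash recovery (illustrative model only).
    log = []              # stable storage: sequential redo log
    disk_blocks = {}      # stable storage: data blocks
    cache = {}            # volatile memory: dirty blocks

    def commit(txn_id, updates):
        log.append((txn_id, dict(updates)))  # cheap sequential write, batchable
        cache.update(updates)                # dirty blocks stay in memory

    def checkpoint():
        disk_blocks.update(cache)            # background writer flushes blocks
        log.clear()                          # flushed updates leave the log

    def recover():
        for _txn_id, updates in log:         # after a crash (cache lost),
            disk_blocks.update(updates)      # replay the log onto the blocks

    commit(1, {"blk7": "new value"})
    cache.clear()                            # simulate a crash
    recover()                                # disk_blocks now holds blk7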
Transactions & concurrency control

Many transactions are in flight at any given time, so locking of data items is required

Lock granularity:
– Table
– Block
– Row

Efficient row-level locking is needed for high transaction throughput
Row-level locking

Each new transaction is assigned a unique ID

A transaction table keeps track of all active transactions

Lock: write the ID in the directory entry for the row

Unlock: remove the ID from the transaction table
– simultaneous release of all the transaction's locks

[Diagram] Data block directory entries hold the IDs of the locking transactions (e.g., 233), which point into the transaction table (233, 234, 235)
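A hedged Python sketch of this scheme (data structures and names are assumed for illustration): each locked row's directory entry records the owning transaction ID, and removing the ID from the transaction table releases every lock in one step.

    # Sketch: row-level locking via a transaction table (illustrative only).
    active_txns = set()      # transaction table: IDs of active transactions
    row_owner = {}           # data block directory: row -> locking txn ID

    def begin(txn_id):
        active_txns.add(txn_id)

    def lock(txn_id, row):
        owner = row_owner.get(row)
        if owner in active_txns and owner != txn_id:
            raise RuntimeError(f"row {row} is locked by txn {owner}")
        row_owner[row] = txn_id          # write ID into the directory entry

    def commit(txn_id):
        active_txns.discard(txn_id)      # one update releases all locks; stale
                                         # IDs left in directory entries are dead

    begin(233); lock(233, "row 5"); lock(233, "row 9")
    commit(233)                          # both rows unlocked simultaneously
    begin(234); lock(234, "row 5")       # now succeeds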
Transaction read consistency

A transaction that reads a full table should see a consistent snapshot

For performance, reads shouldn't lock the table

Problem: intervening writes

Solution: leverage the rollback mechanism
– an intervening write saves the old value in a rollback segment, where the reader can still find it
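The sketch below (an illustrative multiversion model, not Oracle's exact mechanism) shows the idea: a reader notes its snapshot point, and for rows written after that point it reads the saved old value instead.

    # Sketch: read consistency from saved old values (illustrative only).
    versions = {"row1": [(0, "a")]}   # row -> [(write_time, value), ...]
    clock = 0

    def write(row, value):
        global clock
        clock += 1
        versions[row].append((clock, value))   # old value kept, as in a
                                               # rollback segment

    def snapshot_read(row, snapshot_time):
        # newest version no later than the reader's snapshot point
        older = [v for v in versions[row] if v[0] <= snapshot_time]
        return max(older)[1]

    snap = clock                   # reader begins: snapshot at current time
    write("row1", "a2")            # intervening write
    assert snapshot_read("row1", snap) == "a"   # reader sees the old value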
Oracle: software structure

Server processes
– actual execution of transactions

DB writer
– flushes dirty blocks to disk

Log writer
– writes redo logs to disk at commit time

Process and system monitors
– misc. activity monitoring and recovery

Processes communicate through the SGA and IPC
Oracle: software structure (2)

System Global Area (SGA):
– shared memory segment mapped by all processes

Block buffer area
– cache of database blocks
– the larger portion of physical memory

Metadata area
– where most communication takes place
– synchronization structures, shared procedures, directory information

[Diagram] SGA layout in increasing virtual address order: metadata area (redo buffers, data dictionary, shared pool, fixed region), then the block buffer area
Oracle: software structure (3)

Hiding I/O latency:
– many server processes per processor
– large block buffer area

Process dynamics:
– server reads/updates the database (allocating entries in the redo buffer pool)
– at commit time, the server signals the log writer and sleeps
– the log writer wakes up, coalesces multiple commits, and issues a log file write
– after the log is written, the log writer signals the suspended servers
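These process dynamics amount to group commit. A minimal Python sketch, with threads and queues standing in for Oracle's processes and SGA (an assumed simplification):

    # Sketch: servers sleep on commit; the log writer batches them (illustrative).
    import queue
    import threading

    log_requests = queue.Queue()

    def server_commit(txn_id):
        done = threading.Event()
        log_requests.put((txn_id, done))   # signal the log writer...
        done.wait()                        # ...and sleep until the log is durable

    def log_writer():
        while True:
            batch = [log_requests.get()]   # wait for the first commit request
            while not log_requests.empty():
                batch.append(log_requests.get())   # coalesce pending commits
            # one sequential log write would cover the whole batch here
            for _txn_id, done in batch:
                done.set()                 # wake the suspended servers

    threading.Thread(target=log_writer, daemon=True).start()
    server_commit(42)                      # returns once its batch is "written"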
Oracle: NUMA issues

A single SGA region complicates NUMA localization

The single log writer process becomes a bottleneck

Oracle8 is incorporating NUMA-friendly optimizations

Current large NUMA systems use OPS even on a single address space
Oracle Parallel Server (OPS)

Runs on clusters of SMPs/NUMAs

Layered on top of the RDBMS engine

Shares data through disk

Performance very dependent on how well data can be partitioned

Not supported by most application vendors
Running Oracle: other issues

Most memory is allocated to the block buffer area

Need to eliminate OS double buffering

Best performance is attained by limiting process migration

In large SMPs, dedicating one processor to I/O may be advantageous
TPC Database Benchmarks

Transaction Processing Performance Council (TPC)
– Established about 10 years ago
– Mission: define representative benchmark standards for vendors (hardware/software) to compare their products
– Focus on both performance and price/performance
– Strict rules about how the benchmark is run
– The only widely used benchmarks of this kind
TPC pricing rules

Must include:

All hardware
– server, I/O, networking, switches, clients

All software
– OS, any middleware, database engine

A 5-year maintenance contract

Can include usual discounts

Audited components must be products
TPC history of benchmarks

TPC-A
– First OLTP benchmark
– Based on Jim Gray's Debit-Credit benchmark

TPC-B
– Simpler version of TPC-A
– Meant as a stress test of the server only

TPC-C
– Current TPC OLTP benchmark
– Much more complex than TPC-A/B

TPC-D
– Current TPC DSS benchmark

TPC-W
– New Web-based e-commerce benchmark
The TPC-B benchmark

Models a bank with many branches (tables: Branch, Teller, Account, History)

1 transaction type: account update

    Begin transaction
        Update account balance
        Write entry in history table
        Update teller balance
        Update branch balance
    Commit

Metrics:
– tpsB (transactions/second)
– $/tpsB

Scale requirement: 1 tpsB needs 100,000 accounts
TPC-B: other requirements

System must be ACID:

(A)tomicity
– transactions either commit or leave the system as if they were never issued

(C)onsistency
– transactions take the system from one consistent state to another

(I)solation
– concurrent transactions execute as if in some serial order

(D)urability
– results of committed transactions are resilient to faults
The TPC-C benchmark

Current TPC OLTP benchmark

Moderately complex OLTP

Models a wholesale supplier managing orders

Workload consists of five transaction types

Users and database scale linearly with throughput

Specification was approved July 23, 1992
TPC-C: schema

[Schema diagram] Tables with cardinalities (W = number of warehouses; one-to-many relationships and secondary indexes are marked in the original figure):
– Warehouse (W)
– District (W*10)
– Customer (W*30K; 3K per district)
– History (W*30K+)
– Order (W*30K+; 1+ per customer)
– New-Order (W*5K; 0-1 per order)
– Order-Line (W*300K+; 10-15 per order)
– Stock (W*100K)
– Item (100K, fixed)
TPC-C: transactions

New-order: enter a new order from a customer

Payment: update customer balance to reflect a payment

Delivery: deliver orders (done as a batch transaction)

Order-status: retrieve the status of a customer's most recent order

Stock-level: monitor warehouse inventory
TPC-C: transaction flow

1. Select a transaction from the menu (measure menu response time):
– New-Order: 45%
– Payment: 43%
– Order-Status: 4%
– Delivery: 4%
– Stock-Level: 4%

2. Input screen; keying time

3. Output screen; measure transaction response time; think time; go back to 1
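A Python sketch of this client loop (the mix percentages come from the slide; the timing constants and stub executor below are placeholders, not the TPC-C-specified distributions):

    # Sketch: TPC-C-style client loop (illustrative timings).
    import random
    import time

    MIX = [("New-Order", 0.45), ("Payment", 0.43), ("Order-Status", 0.04),
           ("Delivery", 0.04), ("Stock-Level", 0.04)]

    def pick_txn():
        r = random.random()
        for name, weight in MIX:       # weighted menu selection
            r -= weight
            if r <= 0:
                return name
        return MIX[-1][0]

    def client_loop(execute_txn, iterations=3):
        for _ in range(iterations):
            txn = pick_txn()           # 1. select txn from menu
            time.sleep(0.01)           # 2. keying time (placeholder)
            start = time.time()
            execute_txn(txn)           # 3. execute; measure response time
            print(txn, "response time:", round(time.time() - start, 4))
            time.sleep(0.01)           # think time, then back to step 1

    client_loop(lambda txn: None)      # stub transaction executor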
TPC-C: other requirements

Transparency
– tables can be split horizontally and vertically, provided it is hidden from the application

Skew
– 1% of New-Order txns go to a random remote warehouse
– 15% of Payment txns go to a random remote warehouse

Metrics:
– performance: New-Order transactions/minute (tpmC)
– cost/performance: $/tpmC
TPC-C: scale

Maximum of 12 tpmC per warehouse

Consequently, a quad-Xeon system today (~20,000 tpmC) needs:
– over 1668 warehouses
– over 1 TB of disk storage!!

That's a VERY expensive benchmark to run! (See the arithmetic sketched below.)
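The warehouse count follows directly from the 12 tpmC cap; the per-warehouse data size used below is an assumed round figure for illustration, not a number from the slide:

    # Sketch: TPC-C sizing arithmetic (per-warehouse size is assumed).
    import math

    tpmc = 20_000
    warehouses = math.ceil(tpmc / 12)        # ~1,667; the slide quotes "over 1668"
    mb_per_warehouse = 100                   # assumed raw data per warehouse
    raw_gb = warehouses * mb_per_warehouse / 1024
    print(warehouses, "warehouses, ~", round(raw_gb), "GB of raw data")
    # Indexes, logs, and redundancy push the configured total past 1 TB.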
TPC-C: side effects of the skew rules

A very small fraction of transactions go to remote warehouses

Transparency rules allow data partitioning

Consequence:
– clusters of powerful machines show exceptional numbers
– Compaq holds the current TPC-C record of over 100 KtpmC with an 8-node Memory Channel cluster

Skew rules are expected to change in the future
The TPC-D benchmark

Current DSS benchmark from the TPC

Moderately complex decision support workload

Models a worldwide reseller of parts

Queries ask real-world business questions

17 ad hoc DSS queries (Q1 to Q17)

2 update queries
TPC-D: schema

[Schema diagram] Tables with row counts (SF = Scale Factor):
– Customer (SF*150K)
– Order (SF*1500K)
– LineItem (SF*6000K)
– Supplier (SF*10K)
– Part (SF*200K)
– PartSupp (SF*800K)
– Nation (25)
– Region (5)
TPC-D: scale

Unlike TPC-C, scale is not tied to performance

Size is determined by a Scale Factor (SF):
– SF = {1, 10, 30, 100, 300, 1000, 3000, 10000}
– SF=1 means a 1 GB database size

The majority of current results are in the 100 GB and 300 GB range

Indexes and temporary tables can significantly increase the total disk capacity (3-5x is typical)
TPC-D example query

Forecasting Revenue Query (Q6)
– Quantifies the amount of revenue increase that would have resulted from eliminating company-wide discounts in a given percentage range in a given year
– This type of "what if" query can be used to look for ways to increase revenue
– Considers all line items shipped in a given year

Query definition:

    SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
    FROM LINEITEM
    WHERE L_SHIPDATE >= DATE '[DATE]'
      AND L_SHIPDATE < DATE '[DATE]' + INTERVAL '1' YEAR
      AND L_DISCOUNT BETWEEN [DISCOUNT] - 0.01 AND [DISCOUNT] + 0.01
      AND L_QUANTITY < [QUANTITY]
TPC-D execution rules

Power Test
– Queries submitted in a single stream (i.e., no concurrency)
– Each query set is a permutation of the 17 read-only queries
– Sequence: cache flush, then an optional warm-up Query Set 0 (not timed), then the timed sequence: UF1, Query Set 0, UF2

Throughput Test
– Multiple concurrent query streams (Query Set 1 through Query Set N)
– Single update stream: UF1, UF2, UF1, UF2, ...
TPC-D: metrics

Power metric (QppD): based on the geometric mean of the timing intervals,

    QppD@Size = \frac{3600 \cdot SF}{\left( \prod_{i=1}^{17} QI(i,0) \cdot \prod_{j=1}^{2} UI(j,0) \right)^{1/19}}

where:
– QI(i,0) = timing interval for query i, stream 0
– UI(j,0) = timing interval for update function j, stream 0
– SF = scale factor

Throughput metric (QthD): an arithmetic mean,

    QthD@Size = \frac{S \cdot 17 \cdot 3600}{TS} \cdot SF

where:
– S = number of query streams
– TS = elapsed time of the test (in seconds)

Both metrics represent "queries per gigabyte-hour"
TPC-D: metrics (2)

Composite Query-per-Hour Rating (QphD): the power and throughput metrics are combined (as a geometric mean) to get the composite queries per hour:

    QphD@Size = \sqrt{QppD@Size \cdot QthD@Size}

Reported metrics are:
– Power: QppD@Size
– Throughput: QthD@Size
– Price/Performance: $/QphD@Size
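A worked Python example of the three metrics (the timing values are made up; QphD is computed as the geometric mean shown above):

    # Sketch: computing QppD, QthD, and QphD from assumed timing intervals.
    from math import prod, sqrt

    def qppd(qi, ui, sf):
        """Power: 3600*SF over the geometric mean of the 19 intervals."""
        assert len(qi) == 17 and len(ui) == 2
        return 3600 * sf / prod(qi + ui) ** (1 / 19)

    def qthd(s, ts_seconds, sf):
        """Throughput: S streams x 17 queries, as queries per hour, x SF."""
        return s * 17 * 3600 * sf / ts_seconds

    qi = [42.0] * 17            # per-query timing intervals, stream 0 (seconds)
    ui = [120.0, 95.0]          # UF1, UF2 timing intervals (seconds)
    power = qppd(qi, ui, sf=100)
    throughput = qthd(s=5, ts_seconds=7200, sf=100)
    print(round(power), round(throughput), round(sqrt(power * throughput)))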
TPC-D: other issues

Queries are complex and long-running

Crucial that the DB engine parallelizes queries for acceptable performance

Quality of the query parallelizer is the most important factor

Large improvements are still observed from generation to generation of software
The TPC-W benchmark

Just introduced

Represents a business that markets and sells over the Internet

Includes security/authentication

Uses dynamically generated pages (e.g., cgi-bins)

Metric: Web Interactions Per Second (WIPS)

Transactions: browse, shopping-cart, buy, user-registration, and search
A look at current audited TPC-C systems

Leader in price/performance:
– Compaq ProLiant 7000-6/450, MS SQL 7.0, NT
– 4x 450 MHz Xeons, 2 MB cache, 4 GB DRAM, 1.4 TB disk
– 22,479 tpmC, $18.84/tpmC

Leader in non-cluster performance:
– Sun Enterprise 6500, Sybase 11.9, Solaris 7
– 24x 336 MHz UltraSPARC IIs, 4 MB cache, 24 GB DRAM, 4 TB disk
– 53,050 tpmC, $76.00/tpmC
Audited TPC-C systems: price breakdown

Server sub-component prices:

                 Compaq ProLiant    Sun E6500
    $/CPU        $4,816.00          $15,375.00
    $/MB DRAM    $3.92              $9.16
    $/GB Disk    $145.33            $382.03

[Chart] Server price breakdown (percent of total) into Base, CPU, Memory, and Disk for the Compaq ProLiant and the Sun E6500
Using TPC benchmarks for architecture studies

Brute-force approach: use a full audit-sized system
– Who can afford it?
– How can you run it on top of a simulator?
– How can you explore a wide design space?

Solution: scaling down the size
Careful Scaling of Workloads

Identify the architectural issue under study

Apply appropriate scaling to simplify monitoring and enable simulation studies

Run most scaling experiments on real machines
– simulation-only is not a viable option!

Validate through sanity checks and comparison with audit-sized runs
Scaling OLTP

Forget about TPC compliance

Determine a lower bound on DB size
– monitor contention for smaller tables/indexes
– DB size will change with the number of processors

I/O bandwidth requirements vary with the fraction of the DB resident in memory
– completely in-memory run: no special I/O requirements
– favor more small disks over a few large ones
– place all redo logs on a separate disk
– reduce OS double-buffering

Limit the number of transactions executed
Scaling OLTP (2)

Achieve representative cache behavior
– relevant data structures >> size of hardware caches (metadata area size is key)
– maintain the same number of processes per CPU as the larger run

Simplify setup by running clients on the server machine
– need to make lighter-weight versions of the clients

Ensure efficient execution
– excessive migration, idle time, and OS or application spinning distort metrics
Scaling DSS

Determine a lower-bound DB size
– sufficient work in the parallel section

Ensure representative cache behavior
– DB >> hardware caches
– maintain the same number of processes per CPU as the large run

Reduce execution time through sampling

Major difficulty is ensuring representative query plans

DSS results are more volatile due to improvements in query optimizers
Tuning, tuning, tuning

Ensure the scaled workload is running efficiently

Requires a large number of monitoring runs on the actual hardware platform

Resembles a "black art" on Oracle

Self-tuning features in Microsoft SQL 7.0 are promising
– the ability for user overrides is desirable, but missing
Does Scaling Work?
TPC-C: scaled vs. full size

Breakdown profile of CPU cycles (platform: 8-processor AlphaServer 8400):

    Cycle component   TPC-C, scaled   TPC-C, full-size
    1-issue           8%              11%
    2-issue           8%              8%
    bcache hit        30%             20%
    bcache miss       24%             27%
    scache hit        17%             22%
    tlb               3%              1%
    repl trap         5%              2%
    br/pc mispr.      2%              3%
    mb                3%              6%
Using simpler OLTP benchmarks

Breakdown profile of CPU cycles:

    Cycle component   TPC-B, scaled   TPC-C, full-size
    1-issue           7%              11%
    2-issue           6%              8%
    bcache hit        16%             20%
    bcache miss       37%             27%
    scache hit        16%             22%
    tlb               2%              1%
    repl trap         5%              2%
    br/pc mispr.      2%              3%
    mb                9%              6%

Although "obsolete," TPC-B can be used in architectural studies
Benchmarks wrap-up

Commercial applications are complex, but they need to be considered during design evaluation

TPC benchmarks cover a wide range of commercial application areas

Scaled-down TPC benchmarks can be used for architecture studies

The architect needs a deep understanding of the workload