L2-Cache Miss Profiling on the p690 for a Large
Memory Performance Profiling via Sampled Performance Monitor Event Traces
Diana Villa, Patricia J. Teller, and Jaime Acosta
The University of Texas at El Paso
Department of Computer Science
Trevor Morgan
Exxon/Mobil
Bret Olszewski
IBM Corporation-Austin
5th Annual IBM Austin CAS Conference – 20 February 2004
Outline
Motivation
Data
Events Profiled
Information Collected
Analysis
Approach
Performance Evaluation Framework
Results
Conclusions and Future Work
Motivation
Overall research goal
General workload characterization model
Project goal
Develop a performance evaluation framework to facilitate analysis of large sampled event traces
Study load access patterns of key applications
Identify and remedy performance impediments
Data Collection Environment
IBM eServer pSeries 690 (p690) architecture
8- and 32-processor configurations
TPC-C benchmark
Data collected via event trace sampling:
Timestamp
Effective instruction and data addresses
CPU id
Process id
Thread id
Platform 1
8-processor p690 configuration
[Diagram: two multi-chip modules (MCM 0 and MCM 1) with eight processors (P), eight elements labeled X, eight L2 caches, and two L3 caches.]
Platform 2
32-processor p690 configuration
[Diagram: four multi-chip modules (MCM 0 through MCM 3) with 32 processors (P), 16 L2 caches, and four L3 caches.]
Events
Resolution of L2-cache data-load misses
L2.5
L2.5 shared
L2.5 modified
L2.75
L2.75 shared
L2.75 modified
L3
L3.5
L2.5
[Diagram: the 8-processor p690 configuration (MCM 0 and MCM 1; elements labeled P, X, L2, and L3), illustrating where an L2.5 load is resolved.]
Penalty: 73 cycles
L2.75
[Diagram: the 8-processor p690 configuration (MCM 0 and MCM 1; elements labeled P, X, L2, and L3), illustrating where an L2.75 load is resolved.]
Penalty: 96 cycles
L3
[Diagram: the 8-processor p690 configuration (MCM 0 and MCM 1; elements labeled P, X, L2, and L3), illustrating where an L3 load is resolved.]
Penalty: 112 cycles
L3.5
[Diagram: the 8-processor p690 configuration (MCM 0 and MCM 1; elements labeled P, X, L2, and L3), illustrating where an L3.5 load is resolved.]
Penalty: 143 cycles
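The penalties on the preceding slides can be combined into an average cost per L2 data-load miss once the share of misses resolved at each level is known. The sketch below is illustrative only: the cycle counts come from the slides, while the function name and the example mix are hypothetical, and loads resolved from main memory are left out because no penalty is stated for them here.

```python
# Penalties in cycles, as stated on the slides.
PENALTY_CYCLES = {"L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def average_penalty(fractions):
    """Weighted mean penalty (cycles) over the levels named in `fractions`.

    `fractions` maps a level name to its share of resolved misses; the
    shares are normalized, so they need not sum to exactly 1.
    """
    total = sum(fractions.values())
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    weighted = sum(PENALTY_CYCLES[level] * share
                   for level, share in fractions.items())
    return weighted / total


# Hypothetical mix for illustration; these are not measured values.
example = {"L2.5": 0.25, "L2.75": 0.25, "L3": 0.25, "L3.5": 0.25}
print(average_penalty(example))
```

Substituting the measured fractions from the results charts for the hypothetical mix would give the corresponding estimate for each configuration.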
Analysis
Identify application-specific sources of performance degradation associated with data references
[Diagram: data references classified by level of memory hierarchy, address-space region (kernel, text, data,bss,heap, buffer pool), segment, page, and page offset/cache line.]
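The breakdown of a data address into segment, page, and cache line can be sketched as bit slicing. The sizes used here are assumptions, not values stated on the slides: 256 MB segments, 4 KB pages (consistent with the 0 to 65536 page range shown later), and 128-byte cache lines. The function name is hypothetical.

```python
SEGMENT_BITS = 28  # assumed 256 MB segments
PAGE_BITS = 12     # assumed 4 KB pages
LINE_BITS = 7      # assumed 128-byte cache lines


def decompose(addr):
    """Split an effective address into (segment, page, line).

    segment: segment number (address bits above the segment offset)
    page:    page index within the segment
    line:    cache-line index within the page
    """
    segment = addr >> SEGMENT_BITS
    seg_offset = addr & ((1 << SEGMENT_BITS) - 1)
    page = seg_offset >> PAGE_BITS
    line = (seg_offset & ((1 << PAGE_BITS) - 1)) >> LINE_BITS
    return segment, page, line


# Made-up address for illustration.
seg, page, line = decompose(0x0700000010002080)
print(f"segment={seg:09X} page={page} line={line}")
```

Grouping samples by each of the three fields in turn gives the progressively finer views (segment, then page, then cache line) used in the results.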
Performance Evaluation Framework
Data Collection Environment: TPC-C on the p690
Sampled Event Traces: records of PID, TID, Timestamp, Instr. Addr., Data Addr.
Load DB Java Tool: loads the traces into the Database
Report Generation Java Tool: produces Reports and Graphs
Sample report rows:
5 BufferPool 56893 29384
6 Data,BSS,Heap 8799 4855
1 Kernel 23485 9840
[Sample graphs: Distribution of L3 Data Load Hits by address region; Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment.]
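The load-then-report flow of the framework can be sketched with an in-memory SQLite database. This is not the Java tooling named on the slide: the table layout, the function name, the 128-byte cache-line size, and the precomputed region label on each sample are all assumptions made for illustration.

```python
import sqlite3

LINE_BITS = 7  # assumed 128-byte cache lines


def build_report(samples):
    """Load samples and return (region, total loads, unique cache lines).

    `samples` is an iterable of
    (pid, tid, timestamp, instr_addr, data_addr, region) tuples; the
    region label is assumed to have been derived from the data address.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE samples (pid INTEGER, tid INTEGER, ts INTEGER, "
        "instr_addr INTEGER, data_addr INTEGER, region TEXT)"
    )
    db.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)", samples)
    rows = db.execute(
        "SELECT region, COUNT(*), COUNT(DISTINCT data_addr >> ?) "
        "FROM samples GROUP BY region "
        "ORDER BY COUNT(*) DESC, region",
        (LINE_BITS,),
    ).fetchall()
    db.close()
    return rows


# Made-up trace records for illustration.
trace = [
    (101, 1, 10, 0x1000, 0x2000, "BufferPool"),
    (101, 1, 20, 0x1004, 0x2008, "BufferPool"),  # same cache line as 0x2000
    (102, 2, 30, 0x1008, 0x2080, "BufferPool"),  # next cache line
    (103, 3, 40, 0x100C, 0x9000, "Kernel"),
]
for row in build_report(trace):
    print(row)
```

Keeping the samples in a database lets each report (by region, segment, page, or cache line) be expressed as a different grouping over the same table.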
Results
Resolution of L2 Data Load Misses
[Bar chart: fraction of loads satisfied (axis 0 to 0.6) by each event (L2.5 Shared, L2.5 Modified, L2.75 Shared, L2.75 Modified, L3, L3.5, Memory) for the 8-way and 32-way configurations.]
Results - Memory Regions
Distribution of Memory Data Load Hits
[Bar chart: fraction of data loads (axis 0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, Stack, Ublock&KernelStack, M_BUF, KERN_HEAP); series: Hit % and Unique cache line.]
Results - L3 Cache
Distribution of L3 Data Load Hits
[Bar chart: fraction of data loads (axis 0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, Stack, UBlockandKernelStack, M_BUF, KERN_HEAP); series: Hit % and Unique cache line.]
Results - Segment
Distribution of L3 Data Load Hits in Buffer Pool by Segment
[Bar chart: fraction of data loads (axis 0 to 0.5) by segment (070000000, 070000001, 070000002, 070000009, 07000039C); series: Hit % and Unique cache line.]
Results - Pages
Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment
[Plot: hit/cache line count (axis 0 to 400) versus page (range 0 to 65536); series: Total loads and Unique cache line.]
Results - Cache Lines
Distribution of L3 Data Load Hits by Cache line
[Plot: cache line (axis 0 to 30) versus time in seconds (axis 0 to 600).]
Results - Instructions
Lock Operations: simple_lock, simple_lock_ppc, simple_unlock, disable_lock, unlock_enable, simple_unlock_mem, unlock_enable_mem
Atomic Operations: fetch_and_add, fetch_and_add_h, fetch_and_addlp, fetch_and_or, fetch_and_orlp, fetch_and_and, fetch_and_andlp
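One way to gauge how much of the sampled load traffic falls in these routines is to tally samples by routine class. The sketch below assumes each sample's instruction address has already been resolved to a routine name by some other means; the function names and the example input are hypothetical.

```python
from collections import Counter

# Routine names as listed on the slide.
LOCK_OPS = {
    "simple_lock", "simple_lock_ppc", "simple_unlock", "disable_lock",
    "unlock_enable", "simple_unlock_mem", "unlock_enable_mem",
}
ATOMIC_OPS = {
    "fetch_and_add", "fetch_and_add_h", "fetch_and_addlp", "fetch_and_or",
    "fetch_and_orlp", "fetch_and_and", "fetch_and_andlp",
}


def classify(symbol):
    """Return 'lock', 'atomic', or 'other' for a routine name."""
    if symbol in LOCK_OPS:
        return "lock"
    if symbol in ATOMIC_OPS:
        return "atomic"
    return "other"


def share_by_class(symbols):
    """Fraction of samples in each class; empty dict if there are none."""
    counts = Counter(classify(s) for s in symbols)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts[name] / total
            for name in ("lock", "atomic", "other")}


# Made-up sample of resolved routine names.
print(share_by_class(["simple_lock", "fetch_and_add", "memcpy", "memcpy"]))
```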
Conclusions
Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
buffer pool
data, bss, heap
TPC-C lock instructions are not a key source of performance degradation
The 8- and 32-processor data show the same reference pattern; thus, a model of TPC-C memory access may be possible
Future Work
Suggest ways to improve performance of applications executed on p690
Enhance performance evaluation framework
Quantify representativeness of sampled event traces
Expand study of application data load behavior
Process characterization
Process migration
Other performance issues
Compulsory vs. capacity/conflict misses
False sharing
Contention for resources
Develop synthetic applications that mimic the behavior of key p690 applications; use these to study application behavior and experiment with modifications to applications that may affect performance
Questions?