L2-Cache Miss Profiling on the p690 for a Large


Memory Performance Profiling via Sampled Performance Monitor Event Traces
Diana Villa, Patricia J. Teller, and Jaime Acosta
The University of Texas at El Paso
Department of Computer Science
Trevor Morgan
Exxon/Mobil
Bret Olszewski
IBM Corporation-Austin
5th Annual IBM Austin CAS Conference – 20 February 2004
Outline

Motivation
Data
  - Events Profiled
  - Information Collected
Analysis
  - Approach
  - Performance Evaluation Framework
Results
Conclusions and Future Work
Motivation

Overall research goal
  - General workload characterization model

Project goal
  - Develop a performance evaluation framework to facilitate analysis of large sampled event traces
  - Study load access patterns of key applications
  - Identify and remedy performance impediments
Data Collection Environment

IBM eServer pSeries 690 architecture
  - 8- and 32-processor configurations

TPC-C benchmark

Data collected via event trace sampling:
  - Timestamp
  - Effective instruction and data addresses
  - CPU id
  - Process id
  - Thread id
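
The fields listed above map naturally onto one record per sample. The following is a minimal sketch, assuming a hypothetical whitespace-separated text layout; the slides do not give the actual trace format, and `TraceSample` and `parse_sample` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSample:
    """One sampled performance-monitor event (fields as listed on the slide)."""
    timestamp: int   # time of the sample; the unit is an assumption
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address
    cpu_id: int
    pid: int         # process id
    tid: int         # thread id


def parse_sample(line: str) -> TraceSample:
    """Parse 'timestamp instr_addr data_addr cpu pid tid'.

    The column order and the hex encoding of addresses are assumptions.
    """
    ts, iaddr, daddr, cpu, pid, tid = line.split()
    return TraceSample(int(ts), int(iaddr, 16), int(daddr, 16),
                       int(cpu), int(pid), int(tid))


sample = parse_sample("1024 0x10000abc 0x0700000012345680 3 4242 17")
print(sample.cpu_id, hex(sample.data_addr))
```

Keeping the record immutable makes it safe to use samples as dictionary keys or set members when counting unique references later.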

Platform - 1
8-processor p690 configuration
[Diagram: MCM 0 and MCM 1, with eight processors (P), eight positions marked X, eight L2 caches, and L3 caches]
Platform - 2
32-processor p690 configuration
[Diagram: MCM 0 through MCM 3, with 32 processors (P), L2 caches, and L3 caches]
Events

Resolution of L2-cache data-load misses
  - L2.5
      - L2.5 shared
      - L2.5 modified
  - L2.75
      - L2.75 shared
      - L2.75 modified
  - L3
  - L3.5
L2.5
[Diagram: 8-processor p690 configuration (MCM 0 and MCM 1) illustrating an L2.5 access]
Penalty: 73 cycles
L2.75
[Diagram: 8-processor p690 configuration (MCM 0 and MCM 1) illustrating an L2.75 access]
Penalty: 96 cycles
L3
[Diagram: 8-processor p690 configuration (MCM 0 and MCM 1) illustrating an L3 access]
Penalty: 112 cycles
L3.5
[Diagram: 8-processor p690 configuration (MCM 0 and MCM 1) illustrating an L3.5 access]
Penalty: 143 cycles
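
The four penalties (73, 96, 112, and 143 cycles) can be combined with event fractions to estimate an average cost per L2 data-load miss. As background not stated on the slides, POWER4 documentation usually reads these labels as: L2.5, the L2 of another chip on the same MCM; L2.75, an L2 on a different MCM; L3, the local L3; L3.5, an L3 on a different MCM. The sketch below uses placeholder fractions, not measured values, and leaves out memory accesses because the slides give no memory penalty; `mean_penalty` is an illustrative name.

```python
# Penalties in cycles, as given on the slides.
PENALTY_CYCLES = {"L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def mean_penalty(fractions: dict[str, float]) -> float:
    """Average penalty over the misses resolved by the listed sources.

    `fractions` maps a source to its share of L2 data-load misses. Shares are
    renormalized over the listed sources, so misses resolved by memory (whose
    penalty the slides do not give) are excluded from the average.
    """
    total = sum(fractions.values())
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    return sum(PENALTY_CYCLES[src] * f for src, f in fractions.items()) / total


# Placeholder shares, NOT measured values from the study.
example = {"L2.5": 0.25, "L2.75": 0.25, "L3": 0.25, "L3.5": 0.25}
print(mean_penalty(example))  # (73 + 96 + 112 + 143) / 4 = 106.0
```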
Analysis

Identify application-specific sources of performance degradation associated with data references
[Diagram: drill-down from level of memory hierarchy, to address-space region (kernel, text, data/bss/heap, buffer pool, ...), to segment, to page, to page offset/cache line]
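
The drill-down from address space to segment, page, and cache line amounts to slicing bit fields out of each sampled data address. The sketch below assumes 256 MB segments, 4 KB pages, and 128-byte cache lines. These sizes are assumptions, although they agree with the 9-hex-digit segment ids and the page range [0-65536] that appear on the results slides. `decompose` is an illustrative name.

```python
SEGMENT_BITS = 28  # 256 MB segments (assumed)
PAGE_BITS = 12     # 4 KB pages (assumed)
LINE_BITS = 7      # 128-byte cache lines (assumed)


def decompose(ea: int) -> dict[str, int]:
    """Split a 64-bit effective address into the drill-down levels."""
    return {
        "segment": ea >> SEGMENT_BITS,
        "page": (ea >> PAGE_BITS) & ((1 << (SEGMENT_BITS - PAGE_BITS)) - 1),
        "line_in_page": (ea >> LINE_BITS) & ((1 << (PAGE_BITS - LINE_BITS)) - 1),
        "offset_in_line": ea & ((1 << LINE_BITS) - 1),
    }


parts = decompose(0x0700000012345680)
print(f"segment {parts['segment']:09X}, page {parts['page']}, "
      f"line {parts['line_in_page']}, offset {parts['offset_in_line']}")
```

For the example address this yields segment 070000001, page 9029 (0x2345), cache line 13 within the page, and offset 0 within the line.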
Performance Evaluation Framework

Data Collection Environment: TPC-C on p690
Sampled Event Traces (fields per record): PID, TID, Timestamp, Instr. Addr., Data Addr.
Load DB Java Tool
Database
Report Generation Java Tool
Reports and Graphs

Sample report rows:
5 BufferPool 56893 29384
6 Data,BSS,Heap 8799 4855
1 Kernel 23485 9840

[Sample graph: Distribution of L3 Data Load Hits; fraction of data loads (0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, SharedData, Stack, U-BlockandKernelStack, KERN_HEAP); series: Hit %, Unique cache line]
[Sample graph: Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment; Hit/Cache line count (0 to 400) by Page [0-65536]; series: Total loads, Unique cache line]
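
The pipeline on this slide (sampled traces loaded into a database, then queried to generate reports) can be illustrated with a small sketch. The original tools are written in Java and the database engine is not named; here Python with an in-memory SQLite database stands in for both. The region boundaries, the 128-byte line size, and the names `region_of`, `load_db`, and `region_report` are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical region boundaries (start, end, name); NOT taken from the study.
REGIONS = [
    (0x0000000000000000, 0x0000000010000000, "Kernel"),
    (0x0000000010000000, 0x0000000020000000, "Text"),
    (0x0000000020000000, 0x0000000030000000, "Data,BSS,Heap"),
    (0x0700000000000000, 0x0800000000000000, "BufferPool"),
]


def region_of(addr: int) -> str:
    """Name of the (hypothetical) address region containing addr."""
    for start, end, name in REGIONS:
        if start <= addr < end:
            return name
    return "Other"


def load_db(samples):
    """Load (pid, tid, timestamp, instr_addr, data_addr) tuples into SQLite."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE samples (pid INTEGER, tid INTEGER, ts INTEGER, "
               "instr_addr INTEGER, data_addr INTEGER, region TEXT, line INTEGER)")
    db.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?)",
        # da >> 7 identifies the cache line, assuming 128-byte lines.
        [(p, t, ts, ia, da, region_of(da), da >> 7) for p, t, ts, ia, da in samples],
    )
    return db


def region_report(db):
    """Per region: total sampled loads and number of distinct cache lines."""
    return db.execute(
        "SELECT region, COUNT(*), COUNT(DISTINCT line) FROM samples "
        "GROUP BY region ORDER BY COUNT(*) DESC, region"
    ).fetchall()


samples = [
    (1, 1, 10, 0x10000100, 0x0700000000001000),
    (1, 1, 11, 0x10000104, 0x0700000000001008),  # same cache line as above
    (1, 2, 12, 0x10000200, 0x0700000000002000),
    (2, 5, 13, 0x10000300, 0x0000000020000040),
]
db = load_db(samples)
for row in region_report(db):
    print(row)
```

Storing the derived region and cache-line number at load time keeps each report a single aggregate query, which matters when traces are large.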
Results
Resolution of L2 Data Load Misses
[Bar chart: fraction of loads satisfied (0 to 0.6) by event (L2.5 Shared, L2.5 Modified, L2.75 Shared, L2.75 Modified, L3, L3.5, Memory); series: 8-way, 32-way]
Results - Memory Regions
Distribution of Memory Data Load Hits
[Bar chart: fraction of data loads (0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, Stack, Ublock&KernelStack, M_BUF, KERN_HEAP); series: Hit %, Unique cache line]
Results - L3 Cache
Distribution of L3 Data Load Hits
[Bar chart: fraction of data loads (0 to 0.5) by address region (Kernel, Text, Data,BSS,Heap, BufferPool, Stack, UBlockandKernelStack, M_BUF, KERN_HEAP); series: Hit %, Unique cache line]
Results - Segment
Distribution of L3 Data Load Hits in Buffer Pool by Segment
[Bar chart: fraction of data loads (0 to 0.5) by segment (070000000, 070000001, 070000002, 070000009, 07000039C); series: Hit %, Unique cache line]
Results - Pages
Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment
[Chart: Hit/Cache line count (0 to 400) by Page [0-65536]; series: Total loads, Unique cache line]
Results - Cache Lines
Distribution of L3 Data Load Hits by Cache line
[Chart: Cache line (0 to 30) versus Time (s) (0 to 600)]
Results - Instructions

Lock Operations
  - simple_lock
  - simple_lock_ppc
  - simple_unlock
  - disable_lock
  - unlock_enable
  - simple_unlock_mem
  - unlock_enable_mem

Atomic Operations
  - fetch_and_add
  - fetch_and_add_h
  - fetch_and_addlp
  - fetch_and_or
  - fetch_and_orlp
  - fetch_and_and
  - fetch_and_andlp
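
One way to attribute sampled misses to the routines named on this slide is to map each sampled instruction address to the routine containing it and tally the lock, atomic, and other categories. The slides do not describe how this attribution was done; the sketch below is an assumption, with a hypothetical symbol table and made-up addresses.

```python
from bisect import bisect_right
from collections import Counter

# Routine names as listed on the slide.
LOCK_OPS = {"simple_lock", "simple_lock_ppc", "simple_unlock", "disable_lock",
            "unlock_enable", "simple_unlock_mem", "unlock_enable_mem"}
ATOMIC_OPS = {"fetch_and_add", "fetch_and_add_h", "fetch_and_addlp",
              "fetch_and_or", "fetch_and_orlp", "fetch_and_and",
              "fetch_and_andlp"}

# Hypothetical symbol table: (start address, routine name), sorted by address.
SYMBOLS = [(0x1000, "simple_lock"), (0x1100, "simple_unlock"),
           (0x1200, "fetch_and_add"), (0x1300, "some_other_routine")]
STARTS = [start for start, _ in SYMBOLS]


def routine_at(instr_addr: int) -> str:
    """Routine whose start address is the closest one at or below instr_addr."""
    i = bisect_right(STARTS, instr_addr) - 1
    return SYMBOLS[i][1] if i >= 0 else "unknown"


def classify(instr_addrs):
    """Count samples falling in lock routines, atomic routines, or elsewhere."""
    counts = Counter()
    for addr in instr_addrs:
        name = routine_at(addr)
        if name in LOCK_OPS:
            counts["lock"] += 1
        elif name in ATOMIC_OPS:
            counts["atomic"] += 1
        else:
            counts["other"] += 1
    return counts


counts = classify([0x1004, 0x1108, 0x1210, 0x1350, 0x1400, 0x0800])
print(dict(counts))
```

Because the table holds only start addresses, an address past the last routine is attributed to that routine; a real symbol table would also carry sizes.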
Conclusions

Targets for performance improvement of TPC-C are associated mainly with two regions of the address space:
  - buffer pool
  - data, bss, heap

TPC-C lock instructions are not key to performance degradation

8- and 32-processor data have the same reference pattern; thus, a model of TPC-C memory access may be possible
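
The claim that two configurations share a reference pattern can be made quantitative by comparing their per-region load distributions. The slides do not say which measure, if any, was used; the sketch below uses total variation distance as one possible choice, with placeholder fractions that are not results from the study.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions over the same categories.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder fractions of data loads per region, NOT results from the study.
eight_way = {"BufferPool": 0.5, "Data,BSS,Heap": 0.25, "Kernel": 0.25}
thirty_two_way = {"BufferPool": 0.5, "Data,BSS,Heap": 0.125, "Kernel": 0.375}
print(total_variation(eight_way, thirty_two_way))  # 0.125
```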
Future Work

Suggest ways to improve performance of applications executed on p690

Enhance performance evaluation framework

Quantify representativeness of sampled event traces

Expand study of application data load behavior
  - Process characterization
  - Process migration
  - Other performance issues
      - Compulsory vs. capacity/conflict misses
      - False sharing
      - Contention for resources

Develop synthetic applications that mimic the behavior of key p690 applications; use these to study application behavior and experiment with modifications to applications that may affect performance
Questions?