Low Overhead Interrupt Handling with SMT
CS252 Spring 2005 – Final Presentation
Greg Gibeling
Andrew Schultz
5/10/2005
Problem

- Traditional OS interrupt handling
  - High-latency context switches
  - Excessive data copies
- I/O architecture for high-bandwidth devices
  - Ethernet speed is increasing faster than processor speed
  - 1 Gb/s Ethernet is already faster than a standard architecture can handle
  - 10 Gb/s Ethernet is becoming common
- Existing attempted solutions
  - Interrupt coalescing: poor latency
  - TOEs: not well supported, application specific
  - Smart NICs: typically special purpose (e.g. MPI)
  - Polling: unreliable, machine dependent
Observations

- Trend towards many hardware contexts
  - Intel now ships Hyper-Threading as standard in its cores
  - Multi-core chips are starting to ship
  - Multiprocessor SoCs
  - Many contexts make the machine easier to partition
- How do we take advantage?
  - Devoting one context to interrupts alone won't reduce processing time
- Programming inertia slows change
  - Programmers are tied to certain interfaces
  - Solutions should not require massive changes
- Major changes take forever to be adopted
  - TOEs have been around for a while, yet still lack wide deployment
Solution (1)

- SMT hot thread for interrupts
  - Accelerate interrupt handling
  - Remove context-switch overhead
  - Pin cache contents
  - Allow OS control
- Increase interrupt-handler flexibility
  - Reduce hardware dependence (vs. a smart NIC)
  - Use programmer intuition about program structure
  - Cache pinning acts as a very high-level cache prefetch hint
- Zero copy (future work)
  - Data is placed directly in user space
  - Does not require a large physical memory block
Solution (2)

- SMT hot thread
  - One of many hardware threads, reserved for interrupt handling
- Cache pinning mechanisms (applied per cache level: L1I, L1D, combined L2)
  - Limited set pinning: within an N-way associative set, based on LRU
  - Trace pinning
  - Range pinning
[Diagram: the reserved interrupt thread alongside regular SMT threads on one CPU, sharing the L1I/L1D caches, the combined L2, and memory]
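The hot-thread idea on this slide can be illustrated with a small sketch (all names here are ours, not from the actual implementation): one hardware context is reserved and drains pending interrupt work in a loop, so dispatching a handler is an ordinary function call rather than a trap plus a full context switch.

```python
from collections import deque

# Hypothetical model of a reserved "hot" SMT context. Instead of each
# interrupt forcing a context switch (register save/restore, kernel
# entry) on a regular thread, devices post work to a queue that the
# reserved context drains.
class HotThread:
    def __init__(self):
        self.pending = deque()   # interrupt vectors raised by devices
        self.handlers = {}       # vector -> handler function
        self.handled = 0

    def register(self, vector, handler):
        self.handlers[vector] = handler

    def raise_irq(self, vector):
        # Device side: post an interrupt for the hot thread.
        self.pending.append(vector)

    def drain(self):
        # Hot-thread body: handler dispatch is a plain call, with no
        # save/restore of any interrupted thread's state.
        while self.pending:
            vector = self.pending.popleft()
            self.handlers[vector](vector)
            self.handled += 1

hot = HotThread()
hot.register(42, lambda v: None)   # stand-in for NIC RX processing
for _ in range(3):
    hot.raise_irq(42)
hot.drain()
```

This is only a functional sketch of the dispatch structure; the latency win in the real design comes from the reserved context already being resident in the pipeline and caches.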
Experimental Setup

- M5 simulator
  - Full-system simulation of multiple machines
  - Event-driven memory system
  - Cycle-accurate in-order or out-of-order CPU models
- Our parameters
  - Current Pentium 4-like system
  - Alpha ISA (EV6 21264)
  - Simple CPU model with multiple contexts to simulate SMT
  - Full memory model
[Diagram: CPU with L1I/L1D/L2 on a memory bus, connected through north and south bridges to an I/O bus with the NIC]
Benchmarks

- Netperf
  - Basic UDP traffic flood
  - Simulates high network load, low CPU load
- Netperf w/NAT
  - Insert a NAT box between server and client
  - Simulates high network load, high CPU load
[Diagram: NETPERF runs client → server; NETPERF w/NAT runs client → NAT → server]
Instrumentation

- Simple in-order SMT model
  - Puts less pressure on the memory system than an aggressive out-of-order model, but is sufficient for measuring functional memory accesses
- Instrumented Linux kernel
  - Creates "bins" to measure statistics in different parts of the OS (kernel, user, interrupt, tasklets)
  - Captures an SMT thread to service interrupts and run NIC-related tasklets
- "Cache pinning" policy
  - Replaces the basic LRU replacement policy
  - Pseudo-instructions explicitly pin data structures or turn on trace pinning (in the I or D cache)
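The pinning-aware replacement policy described above can be sketched roughly as follows (a simplified model of the idea, not the actual M5 code): within a set, the victim is the least recently used way among the unpinned ways, so pinned lines such as interrupt-handler code survive; plain LRU is the fallback when every way is pinned.

```python
# Simplified model of LRU replacement extended with pinning (our
# sketch). Each way in a set carries an age (higher = less recently
# used) and a pinned flag, which in the real design would be set by
# the pin pseudo-instructions.
class Way:
    def __init__(self, tag, age, pinned=False):
        self.tag = tag
        self.age = age
        self.pinned = pinned

def choose_victim(ways):
    """Return the index of the way to evict from one cache set."""
    unpinned = [i for i, w in enumerate(ways) if not w.pinned]
    # Normal case: evict the LRU way among unpinned ways, so pinned
    # lines are never displaced. If the whole set is pinned, fall
    # back to plain LRU over all ways.
    candidates = unpinned if unpinned else range(len(ways))
    return max(candidates, key=lambda i: ways[i].age)

ways = [Way("irq_code", age=9, pinned=True),
        Way("app_a", age=5),
        Way("app_b", age=7)]
victim = choose_victim(ways)   # "app_b": oldest unpinned way
```

Note the trade-off the results slides later quantify: pinning protects the interrupt path's working set, but every pinned way shrinks the effective associativity available to everything else.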
Instrumentation

- Metrics
  - Cache misses (L2/I/D)
    - M5 allows binning, to separate misses during the interrupt handler from those in other code
    - Measures the impact of cache pinning
  - NetPerf bandwidth
    - Application-level performance: "What can this accomplish?"
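The binning used for these metrics can be mimicked with a toy sketch (names are ours; M5's actual binning mechanism differs): every miss is charged to whichever part of the OS is currently executing, which is what lets interrupt-handler misses be reported separately from user and kernel misses.

```python
# Toy version of statistics binning in the style of M5 (our sketch):
# the kernel instrumentation announces transitions between OS regions,
# and each cache miss is charged to the currently active bin.
BINS = ("kernel", "user", "interrupt", "tasklet")

class BinnedStats:
    def __init__(self):
        self.current = "user"
        self.misses = {b: 0 for b in BINS}

    def enter(self, bin_name):
        # Called on kernel entry/exit, interrupt entry, tasklet run.
        assert bin_name in BINS
        self.current = bin_name

    def record_miss(self):
        self.misses[self.current] += 1

stats = BinnedStats()
stats.record_miss()            # charged to "user"
stats.enter("interrupt")
stats.record_miss()
stats.record_miss()            # both charged to "interrupt"
stats.enter("kernel")
stats.record_miss()
```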
NETPERF Performance (1)

[Charts: L2 miss rate and bandwidth (Mb/s) for the Uni, SMT, and Ded configurations, plotted against the fraction of the cache pinned (no pinning, 1/128, 1/32, 1/8, 1/2)]
NETPERF Performance (2)

[Charts: I-cache and D-cache miss rates for the Uni, SMT, and Ded configurations, plotted against the fraction of the cache pinned]
NETPERF Results

- Bandwidth increase of 30%
  - The dedicated SMT thread is a success
- Cache performance is even across configurations
  - This is an I/O-bound benchmark
- Cache pinning
  - L2: no impact
  - I-cache and D-cache: performance hit
NATPERF Performance (1)

[Charts: L2 miss rate and bandwidth (Mb/s) for the Uni, SMT, and Ded configurations, plotted against the fraction of the cache pinned]
NATPERF Performance (2)

[Charts: I-cache and D-cache miss rates for the Uni, SMT, and Ded configurations, plotted against the fraction of the cache pinned]
NATPERF Results

- Bandwidth decrease
  - All processing for NAT is handled in the kernel
  - The dedicated thread is doing all the work
- Cache performance improvement
  - This is a processing-dependent benchmark
- Cache pinning
  - L2: 30% improvement with 1/2 of the cache pinned; note that this corresponds to the bandwidth peak
  - I-cache and D-cache: performance hit
Conclusions

- SMT hot thread
  - Definite success: 130% of baseline speed (a 30% bandwidth increase) on NETPERF
  - Requires a multithreaded application
- Cache pinning
  - Qualified success
  - Helpful at the L2 level for NATPERF
  - Seemingly detrimental at L1
  - Deserves more testing
Future Work (1)

- A processing-intensive benchmark
  - NETPERF: excessively simple; doesn't include processing overhead
  - NATPERF: single threaded; doesn't fully test an SMT system
  - SPECweb99: died horribly on the M5 system, but might provide a better benchmark
Future Work (2)

- Detailed profiling of memory usage
  - Determine the optimal data structures/traces to pin
- Examine performance with an out-of-order CPU model
  - Ready, but would require massive simulation time
- Extend the basic architecture
  - Implement a zero-copy protocol, preferably consistent with traditional interfaces
- Comparisons
  - Click: can this match polling performance?
  - IDS/packet capture: can this handle line speed?
Related Work

- Hardware
  - EMP: true zero copy on the NIC
  - SMT exception handling: done for TLB misses
  - On-chip NICs: tighter coupling between NIC and CPU
  - TOEs and accelerators: shift processing to special-purpose devices
- Software (OS)
  - U-Net: ADUs + fewer copies, more application control
  - Click: changes the method from interrupts to polling
  - Interrupt coalescing: adds latency, reduces thrashing
- Commercial products
  - Products approaching this solution exist: network analyzers, IDS systems