Low Overhead Interrupt Handling with SMT

CS252 Spring 2005 – Final Presentation
Greg Gibeling
Andrew Schultz
5/10/2005

Problem

- Traditional OS interrupt handling
  - High latency context switch
  - Excessive data copies
- High bandwidth devices stress the I/O architecture
  - Ethernet speed is increasing faster than processor speed
  - 1 Gb/s Ethernet is already faster than a standard architecture can handle
  - 10 Gb/s Ethernet becoming common
- Existing attempted solutions
  - Interrupt coalescing – poor latency
  - TOEs – not well supported, application specific
  - Smart NICs – typically for special purposes (MPI)
  - Polling – unreliable, machine dependent

Observations

- Trend towards many hardware contexts
  - Intel now has HT standard in their cores
  - Multi-core chips starting to ship
  - Multiprocessor SoCs
- Many contexts
  - Easier to partition
  - How do we take advantage?
  - Devoting one to interrupts won't reduce processing
- Programming inertia slows changes
  - Programmers use certain interfaces
  - Solutions should not require massive changes
  - Major changes take forever to be adopted
- TOEs
  - Have been around for a while
  - Still don't have wide deployment

Solution (1)

- SMT Hot Thread for Interrupts (sketched below)
  - Accelerate interrupt handling
    - Remove context switch overhead
    - Pin cache contents
    - Allow OS control
  - Increase interrupt handler flexibility
    - Reduce hardware dependence (Smart NIC)
    - Use programmer intuition about program structure
- Zero Copy (future work)
  - Very high level cache prefetch hints
  - Data is placed directly in user space
  - Does not require a large physical memory block

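To make the hot-thread idea concrete, here is a minimal C sketch of a reserved SMT context that services NIC interrupts in a loop, avoiding the normal trap-and-context-switch path. The names mwait_for_interrupt(), handle_nic_rx(), and NIC_IRQ are hypothetical, invented for illustration; they are not part of the design above or of any real kernel API.

    /* Minimal sketch, under assumed primitives: one SMT hardware context
     * is reserved at boot and parked in this loop, so a NIC interrupt is
     * serviced without a context switch. */

    #define NIC_IRQ 10                    /* hypothetical IRQ number */

    struct intr_event { int irq; void *data; };

    /* Hypothetical primitive: parks this hardware thread until an
     * interrupt arrives, then delivers it directly instead of taking
     * the normal trap path. */
    extern struct intr_event mwait_for_interrupt(void);
    extern void handle_nic_rx(void *data);

    void hot_thread_main(void)
    {
        for (;;) {
            /* The handler's register file and cached working set stay
             * resident in this reserved context between interrupts. */
            struct intr_event ev = mwait_for_interrupt();
            if (ev.irq == NIC_IRQ)
                handle_nic_rx(ev.data);
        }
    }
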
Solution (2)

- SMT Hot Thread
  - Reserved
  - One of many threads
- Cache Pinning
  - L1 Instruction
    - N-way associative
    - Limited set pinning, based on LRU
    - Trace pinning
  - L2 Combined
    - Range pinning

[Diagram: CPU with several SMT threads plus a dedicated interrupt thread, sharing the L1I, L1D, and combined L2 caches in front of memory]

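The following C sketch illustrates "limited set pinning" layered on LRU replacement, as described above for the L1 I-cache. It models victim selection in one N-way set; the pinned bitmap and all structure names are our own illustration, not the actual simulator implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4                     /* illustrative associativity */

    struct cache_set {
        uint64_t tag[WAYS];
        uint32_t last_use[WAYS];       /* LRU timestamps */
        uint8_t  pinned;               /* bitmap: pinned ways are never evicted */
    };

    /* Pick a victim way: plain LRU, but skip pinned ways.  If every way
     * is pinned, fall back to ordinary LRU so the set still makes
     * progress. */
    int choose_victim(const struct cache_set *set)
    {
        int victim = -1;
        uint32_t oldest = UINT32_MAX;
        bool all_pinned = (set->pinned == (1u << WAYS) - 1);

        for (int w = 0; w < WAYS; w++) {
            if ((set->pinned & (1u << w)) && !all_pinned)
                continue;              /* respect the pin */
            if (set->last_use[w] < oldest) {
                oldest = set->last_use[w];
                victim = w;
            }
        }
        return victim;
    }
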
Experimental Setup

- M5 Simulator
  - Multiple full system simulation
  - Event driven memory system
  - Cycle accurate, in-order or OoO
- Our Parameters
  - Current Pentium 4 like system
  - Alpha ISA (EV6 / 21264)
  - Simple CPU model
  - Multiple contexts to simulate SMT
  - Full memory model

[Diagram: simulated system — CPU with L1I, L1D, and L2 on a memory bus to memory and the north bridge, with the NIC attached behind the south bridge on the I/O bus]

Benchmarks

- Netperf
  - Basic UDP traffic flood
  - Simulates high network load, low CPU load
- Netperf (w/NAT)
  - Insert a NAT box between server and client
  - Simulates high network load, high CPU load

[Diagram: NETPERF — client sending directly to server; NETPERF w/NAT — client sending to server through a NAT box]

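For readers unfamiliar with the workload, this is roughly what the netperf UDP flood boils down to: a sender pushing fixed-size datagrams as fast as the host allows. The address, port, and payload size below are illustrative choices, not netperf defaults.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst;
        char payload[1472] = {0};      /* fills one 1500-byte Ethernet frame */

        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(12865);   /* hypothetical server port */
        inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);

        for (;;)                       /* high network load, low CPU load */
            sendto(fd, payload, sizeof payload, 0,
                   (struct sockaddr *)&dst, sizeof dst);
    }
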
Instrumentation

- Simple in-order SMT model
  - Lower potential pressure on the memory system than an aggressive out-of-order model, but sufficient for measuring functional memory access
- Instrumented Linux kernel
  - Create "bins" to measure statistics in different parts of the OS (kernel, user, interrupt, tasklets)
  - Capture SMT thread to service interrupts and run NIC related tasklets
- Created "cache pinning" policy (see the sketch below)
  - Replaces the basic LRU replacement policy
  - Pseudo-instructions explicitly pin data structures or turn on trace pinning (in the I or D cache)

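As a sketch of how the instrumented kernel might invoke such pseudo-instructions, consider the fragment below. m5_pin_range() and m5_trace_pin() are hypothetical names standing in for the pseudo-instructions the slides mention; they are not the actual M5 opcodes or their real signatures.

    /* Hypothetical pseudo-instruction wrappers, for illustration only. */
    extern void m5_pin_range(const void *start, unsigned long len); /* D-cache */
    extern void m5_trace_pin(int enable);                           /* I-cache */

    void nic_rx_irq(void *rx_ring, unsigned long ring_bytes)
    {
        /* Pin the NIC's receive descriptor ring so the handler never
         * misses on it across interrupts. */
        m5_pin_range(rx_ring, ring_bytes);

        /* Pin the handler's instruction trace while it runs. */
        m5_trace_pin(1);

        /* ... service the device, schedule tasklets ... */

        m5_trace_pin(0);
    }
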
Instrumentation

- Metrics
  - Cache misses (L2/I/D)
    - M5 allows binning, in order to separate misses during the interrupt handler from other code
    - Measure the impact of cache pinning
  - NetPerf bandwidth
    - Application level performance
    - "What can this accomplish?"

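Conceptually, the binning works like a per-CPU "current bin" that selects which counter set each event is charged to. The sketch below shows the idea only; the names and the two-function interface are our invention, not M5's.

    /* Illustrative model of statistics binning. */
    enum stat_bin { BIN_USER, BIN_KERNEL, BIN_INTERRUPT, BIN_TASKLET, BIN_MAX };

    static unsigned long l2_misses[BIN_MAX];
    static enum stat_bin current_bin = BIN_USER;

    /* The instrumented kernel switches bins at each boundary, e.g. on
     * interrupt entry, so misses in the handler are counted separately. */
    void enter_interrupt(void) { current_bin = BIN_INTERRUPT; }
    void record_l2_miss(void)  { l2_misses[current_bin]++; }
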
NETPERF Performance (1)

[Charts: L2 miss rate and bandwidth (Mb/s) for the Uni, SMT, and Ded configurations, plotted against the cache fraction pinned (no pinning, 1/128, 1/32, 1/8, 1/2)]

NETPERF Performance (2)

[Charts: I-cache and D-cache miss rates for the Uni, SMT, and Ded configurations, plotted against the cache fraction pinned]

NETPERF Results

- Bandwidth increase of 30%
  - Dedicated SMT thread successful
  - Cache performance is even
- Cache Pinning
  - L2 – no impact
  - ICache – performance hit
  - DCache – performance hit
- This is an I/O bound benchmark

NATPERF Performance (1)

[Charts: L2 miss rate and bandwidth (Mb/s) for the Uni, SMT, and Ded configurations, plotted against the cache fraction pinned (no pinning, 1/128, 1/32, 1/8, 1/2)]

NATPERF Performance (2)

[Charts: I-cache and D-cache miss rates for the Uni, SMT, and Ded configurations, plotted against the cache fraction pinned]

NATPERF Results

- Bandwidth decrease
  - All processing for NAT is handled in the kernel
  - Dedicated thread is doing all the work
  - Cache performance improvement
- Cache Pinning
  - L2 – 30% improvement at 1/2 pinned; note that this corresponds to the bandwidth peak
  - ICache – performance hit
  - DCache – performance hit
- Processing dependent benchmark

Conclusions

- SMT Hot Thread
  - Definite success
  - Requires a multithreaded application
  - 130% speed increase on NETPERF
- Cache Pinning
  - Qualified success
  - Helpful at the L2 level for NATPERF
  - Seemingly detrimental at L1
  - Deserves more testing

Future Work (1)

- Processing intensive benchmark
  - NETPERF
    - Excessively simple
    - Doesn't include processing overhead
  - NATPERF
    - Single threaded
    - Doesn't fully test an SMT system
  - SPECWeb99
    - Died horribly on the M5 system
    - Might provide a better benchmark

Future Work (2)

- Detailed profiling of memory usage
  - Determine the optimal data structures/traces to pin
- Examine performance with an OoO CPU model
  - Ready, but would require massive simulation time
- Extend the basic architecture
- Implement a zero-copy protocol (sketched below)
  - Preferably consistent with traditional interfaces
- Comparisons
  - Click – can this match polling performance?
  - IDS/packet capture – can this handle line speed?

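One way a zero-copy receive could stay "consistent with traditional interfaces" is a recv()-like call that hands back a pointer to the NIC buffer already mapped into user space instead of copying. The sketch below is purely hypothetical: recv_zcopy() and release_zcopy() are invented names, not a proposed or existing API.

    #include <sys/types.h>

    /* Hypothetical API: returns packet length and a pointer to the data
     * mapped in place, with no copy into a caller-supplied buffer. */
    extern ssize_t recv_zcopy(int sock, void **buf);
    extern void release_zcopy(int sock, void *buf);

    ssize_t consume_packet(int sock)
    {
        void *pkt;
        ssize_t len = recv_zcopy(sock, &pkt);
        if (len > 0) {
            /* ... process pkt in place ... */
            release_zcopy(sock, pkt);   /* return the buffer to the NIC */
        }
        return len;
    }
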
Related Work

- Hardware
  - EMP – true zero copy on the NIC
  - SMT exception handling – done for TLB misses
  - On-chip NICs – tighter coupling between NIC and CPU
  - TOEs and accelerators – shift processing to special devices
- Software (OS)
  - U-Net – ADU + fewer copies, more application control
  - Click – change method from interrupt to polling
  - Interrupt coalescing – add latency, reduce thrashing
- Commercial Products
  - There are commercial products approaching this solution
  - Network analyzers
  - IDS systems