On-chip Monitoring Infrastructures and Strategies for Many-core Systems
Russell Tessier, Jia Zhao, Justin Lu, Sailaja Madduri, and Wayne Burleson
Research supported by the Semiconductor Research Corporation
Outline
• Motivation
• Contributions
• On-chip monitoring infrastructures
• Extensions to 3D architectures
• Monitoring for voltage droop prevention
• Conclusion and future work
On-chip Sensors and Monitoring
• Thermal sensors, processor activity monitors, delay monitors, reliability monitors, etc.
• Challenges involved:
  1. Sensor design (VLSI design research)
  2. Sensor data collection
  3. Sensor data processing, e.g. identifying imprecise sensor measurements
  4. Remediation strategies
[Diagram: a core with an attached activity monitor, thermal sensor, and delay monitor, plus a cache and network interface connecting to memories and peripherals]
Multi-core and Many-core Systems
• From single-core systems to multi-cores and many-cores
• Need to monitor system temperature, supply voltage fluctuation, reliability, among others
• Remediation strategies include voltage change, frequency change, error protection, etc.
[Image: AMD FX-8150 8-Core Bulldozer processor. Courtesy: silentpcreview.com]
System-Level Requirements
• Monitor data collected in a scalable, coherent fashion
  – Interconnect flexibility: many different monitor interfaces and bandwidth requirements
  – Low overhead: interconnect should provide low-overhead interfaces (buses, direct connects) while integrating a NoC
  – Low latency: priority data needs immediate attention (thermal, errors)
• Collate data with centralized processing
  – Focus on collaborative use of monitor data
• Validate monitoring system with applications
  – Layout and system-level simulation
MNoC Overview
• A dedicated infrastructure for on-chip monitoring – Monitor Network-on-Chip (MNoC)
• An on-chip network connects all sensors to a monitor executive processor (MEP) through MNoC routers
• Low latency for sensor data transmission and low cost
[Diagram: an MNoC mesh of routers connecting monitors, data sources, and a timer module to MEPs, with a control crossbar port interface. Legend: MEP – Monitor Executive Processor, R – Router, M – Monitor, D – Data, T – Timer module]
Priority Based Interfacing
• Multiple transfer channels available for critical data
• Interface synchronized with the router via a buffer
• Time stamps used to prioritize and coordinate responses (sketched below)
[Diagram: monitors feed a bus-router interface that forwards data to the network router over a high-priority channel]
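The priority mechanism can be pictured with a small software model. The Python sketch below is illustrative only (the class and field names are assumptions, not the authors' RTL): critical samples enter a high-priority channel that is always drained first, and every sample carries a time stamp so the MEP can order responses and discard stale readings.

# A minimal sketch of the priority-based bus-router interface: critical
# monitor samples go to a high-priority channel, and a time stamp lets the
# MEP order and discard stale data.  Names and queue handling are assumptions.
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass(order=True)
class Sample:
    timestamp: int                       # cycle when the monitor produced the value
    seq: int                             # tie-breaker so equal timestamps stay FIFO
    monitor_id: int = field(compare=False)
    value: int = field(compare=False)
    critical: bool = field(compare=False, default=False)

class BusRouterInterface:
    def __init__(self):
        self.ticker = count()
        self.priority_channel = []       # heap ordered by timestamp
        self.regular_channel = []        # heap ordered by timestamp

    def push(self, cycle, monitor_id, value, critical=False):
        s = Sample(cycle, next(self.ticker), monitor_id, value, critical)
        heapq.heappush(self.priority_channel if critical else self.regular_channel, s)

    def next_sample(self):
        # The priority channel is always drained before the regular channel.
        for channel in (self.priority_channel, self.regular_channel):
            if channel:
                return heapq.heappop(channel)
        return None

iface = BusRouterInterface()
iface.push(cycle=100, monitor_id=3, value=71)                # thermal reading
iface.push(cycle=102, monitor_id=7, value=1, critical=True)  # error flag
print(iface.next_sample())   # the critical sample leaves first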
MNoC Router and Monitor Interface
• Bus and multiplexer interfaces are both supported
• Data types:
  – Periodically sampled
  – Occasionally reported
• Critical monitor data is transferred in the priority channel
• A thermal monitor [1] interface example:
  – FSM controlled (Idle, Sample, Wait for buffer, Packetization, Send out; a software sketch follows below)
  – Time stamp attached to identify out-of-date data
[Diagram: the network router interface (master) FSM and packetization path. A timer and FIFO build monitor packets (destination address, monitor address, time stamp, monitor data) that are multiplexed onto the priority and regular channels toward the router. The thermal monitor uses track/hold sensors, a VCO, a frequency divider, and a digital counter under a controller block to produce 8-bit data.]

[1] B. Datta and W. Burleson, "Low-Power, Process-Variation Tolerant On-Chip Thermal Monitoring using Track and Hold Based Thermal Sensors," in Proc. ACM Great Lakes Symposium on VLSI, pp. 145-148, 2009.
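The interface FSM on this slide can be summarized in software. The sketch below follows the states named in the figure; the packet fields match the slide, but the class, parameter values, and callbacks are illustrative assumptions rather than the authors' hardware.

# A minimal sketch of the FSM-controlled monitor-to-router interface
# (Idle -> Sample -> Wait for buffer -> Packetization -> Send out).
from enum import Enum, auto

class State(Enum):
    IDLE = auto(); SAMPLE = auto(); WAIT_BUFFER = auto()
    PACKETIZE = auto(); SEND = auto()

class MonitorInterface:
    def __init__(self, sample_period, buffer_depth=4):
        self.state = State.IDLE
        self.sample_period = sample_period   # e.g. 800 cycles for a thermal monitor
        self.counter = 0
        self.buffer = []                     # FIFO toward the router
        self.buffer_depth = buffer_depth

    def step(self, cycle, read_monitor, router_ready):
        if self.state is State.IDLE:
            self.counter += 1
            if self.counter >= self.sample_period:
                self.counter = 0
                self.state = State.SAMPLE
        elif self.state is State.SAMPLE:
            self.latched = read_monitor()    # e.g. 8-bit thermal sample
            self.state = (State.PACKETIZE if len(self.buffer) < self.buffer_depth
                          else State.WAIT_BUFFER)
        elif self.state is State.WAIT_BUFFER:
            if len(self.buffer) < self.buffer_depth:
                self.state = State.PACKETIZE
        elif self.state is State.PACKETIZE:
            packet = {"dest": 0, "monitor": 1,            # MEP / monitor addresses (placeholders)
                      "timestamp": cycle, "data": self.latched}
            self.buffer.append(packet)
            self.state = State.SEND
        elif self.state is State.SEND:
            if router_ready and self.buffer:
                self.buffer.pop(0)                        # flit accepted by the router
            self.state = State.IDLE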
MNoC area with differing router parameters
Total MNoC area for different buffer sizes and data widths at 65 nm:

Data width (bits) | Input buffer size | Gate count per router
        8         |         4         |        15017
        8         |         8         |        15234
       10         |         4         |        15505
       10         |         8         |        15763
       12         |         4         |        17902
       12         |         8         |        18196
       14         |         4         |        18871
       14         |         8         |        19222

[Plot: total MNoC area (mm²) versus data width (6-20 bits) for buffer sizes of 2, 4, 8, and 16 flits]

• Desirable to minimize interconnect data width to just meet latency requirements
• Most of the router area is consumed by the data path
  – Each delay monitor generates 12-bit data plus a 6-bit header (see the flit-count sketch below)
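To make the data-width trade-off concrete, the short sketch below computes how many flits one delay-monitor record (12-bit data plus a 6-bit header, per the slide) would occupy at various widths; the calculation is illustrative and ignores any additional time-stamp or routing fields.

# A small worked example: narrower MNoC data widths need more flits,
# and hence more cycles per hop, to carry one 18-bit delay-monitor record.
from math import ceil

def flits_needed(payload_bits, flit_width):
    """Number of flits to carry one monitor record."""
    return ceil(payload_bits / flit_width)

record_bits = 12 + 6   # delay-monitor data + header (per the slide)
for width in (8, 10, 12, 14, 16, 18, 20):
    print(f"{width:2d}-bit flits: {flits_needed(record_bits, width)} flit(s) per record")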
Data Width Effect on Latency
[Plots: regular-channel and priority-channel network latencies (clock cycles) versus cycles between injections, for data widths of 8-18 bits with buffer size = 4]

• Regular channel latency (e.g. 100 cycles) is tolerable for low-priority data
• Priority channel provides a fast path for critical data (about 20 cycles)
NoC and MNoC Architectures
• A shared-memory multicore based on Tile64 [1] and TRIPS OCN [2]
  – 4×4 mesh as in Tile64
  – 256-bit data width
  – 2-flit buffer size
• Monitors in the multicore system (injection model sketched below)
  – Thermal monitor (1/800 injection rate)
  – Delay monitor (around 1/200 injection rate)
• MNoC configuration from the suggested design flow
  – 4×4 MNoC
  – 24-bit flit width
  – 2 virtual channels
  – 2-flit buffer size

[1] S. Bell, et al., "TILE64 Processor: A 64-Core SoC with Mesh Interconnect," in Proc. International Solid-State Circuits Conference, pp. 88-598, 2008.
[2] P. Gratz, C. Kim, R. McDonald, S. Keckler and D. Burger, "Implementation and Evaluation of On-Chip Network Architectures," in Proc. International Conference on Computer Design, pp. 477-484, Oct. 2007.
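The monitor injection rates above can be modeled as a simple per-cycle injection probability. The sketch below is an illustrative Bernoulli traffic generator, not the actual Popnet configuration.

# Each monitor injects a packet with a fixed per-cycle probability
# (1/800 for the thermal monitor, around 1/200 for the delay monitor).
import random

def generate_injections(rate, cycles, rng=random.Random(0)):
    """Return the cycles at which one monitor injects a packet."""
    return [c for c in range(cycles) if rng.random() < rate]

thermal = generate_injections(1 / 800, cycles=10_000)
delay   = generate_injections(1 / 200, cycles=10_000)
print(len(thermal), len(delay))   # roughly 12 and 50 injections over 10,000 cycles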
MNoC vs. Regular Network-on-chip
• Network-on-chip (NoC) is already used in multi-cores
• Why bother with a separate MNoC for monitor data?
• Three cases
  – NoC with MNoC – monitor data is transferred via MNoC; inter-processor traffic ("application data") uses the multi-core NoC
  – MT-NoC (mixed-traffic NoC) – all monitor and application traffic share a four-virtual-channel (VC) NoC
  – Iso-NoC (isolated-channel NoC) – monitor data in one VC, application traffic in the remaining three VCs
• A virtual channel is like another lane on a road; unfortunately, the intersections still have only one lane
Three Cases for Comparison
1. NoC with MNoC – monitor data is transferred via MNoC; inter-processor traffic ("application data") uses the multicore NoC
2. MT-NoC (mixed-traffic NoC) – all monitor and application traffic share a four-virtual-channel (VC) NoC
3. Iso-NoC (isolated-channel NoC) – monitor data in one VC, application traffic in the remaining three VCs
• Infrastructure for MT-NoC and Iso-NoC shown
[Diagram: a 4×4 mesh of routers; the 16 monitors are multiplexed onto each router alongside the processor and memory traffic for the MT-NoC and Iso-NoC cases]
Application Data Latency
1. NoC with MNoC: lowest latency
2. Iso-NoC: highest latency
3. MT-NoC: lower latency than Iso-NoC
• A standalone MNoC ensures low latency for application data
Monitor Data Latency
• Iso-NoC achieves low monitor data latency but has high application data latency
• A standalone MNoC ensures low latency for monitor data
• Results obtained with a modified Popnet network simulator
New On-chip Monitoring Requirements in Many-cores
• Many-core systems demand greater sensor data collection and processing capability
• New remediation strategies at both local and global scales
  – Signature sharing for voltage droop compensation at the global scale
  – Distributed and centralized dynamic thermal management (DTM)
• Three-dimensional (3D) systems add more complexity
  – Stacking memory layers on top of a core layer
• No previous on-chip monitoring infrastructure addresses all of these requirements
  – Simple infrastructures based on buses are not suitable
  – The MNoC infrastructure has no support for communication between MEPs
  – Can it scale to many-core systems with hundreds to a thousand cores?
  – Is there support for 3D systems?
3D Many-core Systems
[Fig. 1: A three-layer 3D system example. Thermal TSVs are for heat dissipation]
• 3D technology stacks dies on top of each other
• Through-silicon vias (TSVs) are used for vertical communication or heat dissipation
• High bandwidth and high performance
Image courtesy: Fig. 1 – S. Pasricha, "Exploring Serial Vertical Interconnects for 3D ICs," in Proc. ACM/IEEE Design Automation Conference, pp. 581-586, Jul. 2009.
A Hierarchical Sensor Data Interconnect Infrastructure
• An example for a 36-core system
• One sensor NoC router per core
• Sensors are connected to sensor routers (similar to MNoC)
• Sensor routers send data to sensor data processors (SDPs)
  – Through the SDP routers
  – One SDP per 9 cores in this example, which may not be the optimal configuration (core-to-SDP mapping sketched below)
[Diagram: a 6×6 mesh of sensor routers, one per core, grouped into four 3×3 regions; each region's routers feed an SDP through its SDP NoC router]
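The region-based forwarding described above can be expressed as a small mapping function. The sketch below assumes row-major mesh coordinates and the 36-core example's 3×3 regions; the coordinate scheme is an assumption for illustration.

# Map each core's sensor router to the SDP responsible for its region
# (one SDP per 3x3 block of cores in the 36-core example).

def sdp_for_core(x, y, cores_per_row=6, region=3):
    """Return the SDP index serving the core at mesh coordinate (x, y)."""
    sdps_per_row = cores_per_row // region
    return (y // region) * sdps_per_row + (x // region)

# 36-core example: cores (0,0)..(2,2) share SDP 0, (3,0)..(5,2) share SDP 1, etc.
assert sdp_for_core(1, 2) == 0
assert sdp_for_core(4, 1) == 1
assert sdp_for_core(2, 5) == 2
assert sdp_for_core(5, 5) == 3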
A Hierarchical Sensor Data Interconnect Infrastructure (cont.)
• SDP routers are connected by another layer of network, the SDP NoC
• More traffic patterns are supported in the SDP NoC
• Both the sensor NoC and the SDP NoC have low cost
  – Small data width (e.g. 24 bits)
  – Shallow buffers (4-8 flits)
[Diagram: the same 36-core mesh as on the previous slide, highlighting the SDP NoC routers that link the four SDPs]
Hierarchical Sensor Data Processing Infrastructure for a 256-core 3D System
[Diagram: Layer 2 memory blocks carry thermal sensors that connect through TSVs to sensor routers on Layer 1 (cores). Each core also has a voltage droop sensor with a signature generation module and a performance counter feeding its sensor router. Sensor NoC routers forward data to SDP routers, which are linked by the SDP NoC.]
• Thermal sensors in the memory layer are connected to sensor routers using through-silicon vias (TSVs)
• One SDP per 64 cores
SDP Router Design
• SDP routers receive packets from sensor routers
• SDP routers also support broadcast
  – Send broadcast packets to the SDP
  – Generate new broadcast packets when necessary
• Two virtual channels supported
[Diagram: SDP router datapath. Link controllers (LC) on the sensor NoC and SDP NoC input channels feed packetization/de-packetization logic, a sensor NoC packet buffer, and a broadcast controller; routing and arbitration drive a switch onto the SDP NoC output channels through regular and priority virtual channels, with a port to the local SDP.]
Packet Transmission in the SDP NoC
• Traffic in the sensor NoC is similar to MNoC
• The SDP NoC supports more complicated traffic patterns
  – Hotspot, for global-scale DTM [4]
  – Broadcast, for a voltage droop signature sharing method [5]
• Hotspot traffic is supported by most routing algorithms
• An SDP router design that supports a simple broadcast strategy with a broadcast controller (sketched below)
  – Send the packet vertically first
  – Then send it horizontally
[Diagram: a 4×4 mesh of SDP routers labeled (0,0) through (3,3) illustrating the broadcast order]

[4] R. Jayaseelan, et al., "A Hybrid Local-global Approach for Multi-core Thermal Management," in Proc. International Conf. on Computer-Aided Design, pp. 314-320, Nov. 2009.
[5] J. Zhao, et al., "Thermal-aware Voltage Droop Compensation for Multi-core Architectures," in Proc. Great Lakes Symposium on VLSI, pp. 335-340, May 2010.
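The broadcast strategy above (vertical first, then horizontal) can be sketched as follows for a 4×4 SDP mesh; the coordinates follow the figure, while the code is only an illustration of the forwarding order, not the router logic.

# Column-first broadcast: the source SDP forwards vertically along its own
# column, then every SDP in that column forwards horizontally along its row,
# so each of the 16 SDPs receives exactly one copy.

def broadcast_order(src, mesh=4):
    """Return (hop_count, node) pairs for a column-first broadcast."""
    sx, sy = src
    reached = []
    for y in range(mesh):                     # vertical phase along column sx
        reached.append((abs(y - sy), (sx, y)))
    for y in range(mesh):                     # horizontal phase along each row
        for x in range(mesh):
            if x != sx:
                reached.append((abs(y - sy) + abs(x - sx), (x, y)))
    return sorted(reached)

for hops, node in broadcast_order((1, 2)):
    print(f"{node} reached after {hops} hop(s)")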
Experimental Approach
• Our infrastructure is simulated using a heavily modified Popnet simulator
• Simulated for 256-, 512- and 1024-core systems
  – Packet transmission delay
• Synthesized using 45 nm technology
  – Hardware cost
• On-chip sensors
  – Thermal sensors
  – Performance counters
  – Voltage droop sensors
  – Signature-based voltage droop predictors
• A system-level experiment is performed using the modified Graphite many-core simulator
  – Run-time temperature is modeled
Modified Graphite Simulator
• Simulation speed is the key due to the large number of cores
• Graphite is from the Carbon Research Group at MIT
• Graphite maps each thread in the application to a tile of the target architecture
• These threads are distributed among multiple host processes running on multiple host machines
• Mean slowdown versus native execution is around 1000x
• SPLASH2 and PARSEC benchmarks supported
• Power model under development
[Figure: Graphite simulator overview]
Image courtesy: J. Miller, et al., "Graphite: A Distributed Parallel Simulator for Multicores," in Proc. IEEE International Symposium on High-Performance Computer Architecture (HPCA), Jan. 2010.
Graphite Initial Experiments
• Graphite compiles and runs on a Core 2 Quad machine with 4 GB of memory
• SPLASH2 benchmarks tested with up to 128 cores (1024 next)
• Modifications for thermal testing and power evaluation are underway
• Integration with the modified sensor data network simulator
Experiments using Graphite
• 3D architecture simulation
  – Memory blocks stacked up
  – 3D memory latency numbers adopted in Graphite
• Integrated with the on-chip sensor interconnect simulator (co-simulation loop sketched below)
  – Thermal, activity, etc. information extracted or estimated using Graphite
  – Data collection and processing using a modified Popnet
  – Remediation simulation
• Dynamic frequency and voltage scaling experiments in Graphite
  – Change simulation parameters at run time
[Flow diagram: a benchmark and configuration drive the modified Graphite, which sends thermal, critical path, processor activity, voltage droop, vulnerability, etc. information as sensor data packets to the modified Popnet; monitor data processing returns remediation decisions to Graphite, and results include performance and power statistics.]
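The integration between the two simulators can be pictured as an epoch-based co-simulation loop. In the sketch below, the object and function names (graphite.run, popnet.transport, sdp.process, apply_dfs) are placeholders for illustration, not the real simulator APIs.

# High-level sketch of the co-simulation loop: Graphite produces per-core
# estimates, Popnet transports them as sensor packets to the SDPs, and
# remediation decisions (e.g. DFS settings) are fed back at run time.

def cosimulation_loop(graphite, popnet, sdp, epochs, epoch_cycles=10_000):
    for epoch in range(epochs):
        # 1. Advance the architectural simulation by one epoch.
        core_stats = graphite.run(epoch_cycles)        # activity, temperature, droop...

        # 2. Turn per-core readings into sensor packets and inject them.
        packets = [make_sensor_packet(core, stats) for core, stats in core_stats.items()]
        delivered = popnet.transport(packets)          # models sensor/SDP NoC latency

        # 3. Sensor data processors decide on remediation.
        decisions = sdp.process(delivered)             # e.g. {core: new_frequency}

        # 4. Apply the decisions before the next epoch.
        graphite.apply_dfs(decisions)

def make_sensor_packet(core, stats):
    return {"src": core, "temperature": stats["temp"], "droop": stats["droop"]}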
Comparison against a Flat Sensor NoC Infrastructure

Core and SDP num.  | Latency type | Flat sensor NoC (cycles) | Our method (cycles) | Latency reduction w.r.t. flat sensor NoC (%)
256 core (4 SDP)   | Inter-SDP    | 45.43                    | 7.85                | 82.72
256 core (4 SDP)   | Total        | 62.82                    | 25.24               | 59.82
512 core (8 SDP)   | Inter-SDP    | 67.57                    | 11.37               | 83.17
512 core (8 SDP)   | Total        | 84.96                    | 28.76               | 66.15
1024 core (16 SDP) | Inter-SDP    | 90.36                    | 14.38               | 84.08
1024 core (16 SDP) | Total        | 107.75                   | 31.77               | 70.52

• Simulated using a modified Popnet; one SDP per 64 cores is chosen; sensor data from the memory layer is included
• Compare the latency of our infrastructure versus a flat sensor NoC infrastructure
  – Only sensor routers, no SDP NoC
  – Packets between SDPs (inter-SDP) are transmitted using sensor routers
• Our infrastructure significantly reduces the inter-SDP (>82%) and total (>59%) latency
• Hardware cost increase with respect to the flat sensor NoC is less than 6%
• Our infrastructure provides higher throughput versus the flat sensor NoC
Core to SDP Ratio Experiment
Core num. | Core/SDP ratio | SDP num. | Sensor NoC latency | SDP latency | Total latency | SDP NoC to sensor NoC HW cost ratio (%)
256       | 32             | 8        | 12.25              | 11.01       | 23.26         | 7.94
256       | 64             | 4        | 17.31              | 7.72        | 25.03         | 4.19
256       | 128            | 2        | 24.14              | 5.25        | 29.39         | 1.72
512       | 32             | 16       | 12.25              | 13.37       | 25.62         | 8.78
512       | 64             | 8        | 17.31              | 11.42       | 28.73         | 4.99
512       | 128            | 4        | 24.14              | 7.88        | 32.02         | 2.65
1024      | 32             | 32       | 12.25              | 20.87       | 33.12         | 9.20
1024      | 64             | 16       | 17.31              | 15.48       | 32.79         | 5.46
1024      | 128            | 8        | 24.14              | 11.84       | 35.98         | 3.15

• One SDP per 64 cores is chosen
  – Low latency
  – Moderate hardware cost, less than 6% versus the sensor NoC alone
Throughput Comparison
• Compare the throughput of inter-SDP packet transmission
  – Same throughput for packet transmission in the sensor NoC
• Our infrastructure provides higher throughput versus the flat sensor NoC
Signature-based Voltage Droop Compensation
• Event history table content is compared with signatures at run time to predict emergencies (sketched below)
  – Large table -> more accurate prediction
• Signature table
  – Larger table -> higher performance
  – Larger table -> higher cost
• Extensively studied in Reddi, et al. [1]
[Diagram: a five-stage pipeline (IF ID EX MEM WB) with L1 cache, TLB, page table, shared L2 cache, and shared memory. Events such as pipeline flushes, control flow instructions, DTLB misses, DL1 misses, and L2 misses feed an event history table, which is compared against a signature table; a match triggers frequency throttling, and captured signatures plus voltage droop monitor data are sent to the MNoC router.]

[1] V. Reddi, M. Gupta, G. Holloway, M. Smith, G. Wei and D. Brooks, "Voltage emergency prediction: A signature-based approach to reducing voltage emergencies," in Proc. International Symposium on High-Performance Computer Architecture, pp. 18-27, Feb. 2009.
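The signature mechanism can be summarized in a few lines of code. The sketch below hashes the recent event history and looks it up in a signature table; a hit predicts an emergency and would trigger frequency throttling. The history length, event encoding, and hash are illustrative assumptions, not the Reddi et al. design details.

# Minimal sketch of signature-based voltage emergency prediction.
from collections import deque

HISTORY_LEN = 16          # a larger history generally improves prediction accuracy

class VoltageEmergencyPredictor:
    def __init__(self):
        self.history = deque(maxlen=HISTORY_LEN)   # event history table
        self.signatures = set()                    # signature table

    def signature(self):
        return hash(tuple(self.history))

    def observe(self, event):
        """Record one microarchitectural event (e.g. 'L2_MISS', 'FLUSH')."""
        self.history.append(event)

    def predict(self):
        """True if the current history matches a known emergency signature."""
        return self.signature() in self.signatures

    def capture(self):
        """Called when a voltage droop monitor flags an actual emergency."""
        self.signatures.add(self.signature())

p = VoltageEmergencyPredictor()
for e in ["DL1_MISS", "L2_MISS", "FLUSH"]:
    p.observe(e)
p.capture()             # droop monitor fired: remember this signature
print(p.predict())      # True: the same history would now trigger throttling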
A Thermal-aware Voltage Droop Compensation Method with Signature Sharing
• High voltage droops cause voltage emergencies
• Voltage droop prediction is based on signatures
  – Signatures: footprints of a series of instructions
  – Prediction of incoming high voltage droops
• Signature-based methods were developed for single-core systems
• Initial signature detection involves a performance penalty
• Idea 1: signature sharing across cores (sketched below)
  – Fewer penalties for initial signature detection
[Diagram: two processors, each with thermal monitors and voltage droop monitors, share an L2 cache and shared memory over a shared bus; signatures travel through MNoC routers (R) to the MEP (Monitor Executive Processor)]
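Idea 1 can be sketched as a small clearing-house protocol: a core that captures a new signature forwards it (one MNoC message) to the MEP, which distributes it to the other cores, so they can throttle pre-emptively without paying the initial detection penalty themselves. The classes below are illustrative, not the authors' implementation.

# Sketch of signature sharing through the MEP.

class MEP:
    """Monitor executive processor acting as the signature clearing-house."""
    def __init__(self):
        self.cores = []

    def register(self, core):
        self.cores.append(core)

    def share(self, signature, source):
        for core in self.cores:
            if core is not source:
                core.remote_signatures.add(signature)

class Core:
    def __init__(self, mep):
        self.local_signatures = set()
        self.remote_signatures = set()
        self.mep = mep
        mep.register(self)

    def capture(self, signature):
        self.local_signatures.add(signature)
        self.mep.share(signature, source=self)    # one MNoC message per new signature

    def knows(self, signature):
        return signature in self.local_signatures or signature in self.remote_signatures

mep = MEP()
c0, c1 = Core(mep), Core(mep)
c0.capture(0xBEEF)          # c0 pays the detection penalty once...
print(c1.knows(0xBEEF))     # ...and c1 can already predict the emergency: True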
Voltage Droop Signature Sharing in Multicore Systems
• 8- and 16-core systems simulated with SESC; 8-core results shown here
• Comparison between the signature sharing method and the local signature method
• Four benchmarks show a significant reduction in signature count
• The performance benefit comes mainly from a smaller rollback penalty

Test bench    | Case           | Sign. num. | Sign. num. reduction (%)
Water-spatial | Local only     | 18,474     |
Water-spatial | Global sharing | 5,167      | 72
Fmm           | Local only     | 832        |
Fmm           | Global sharing | 202        | 76
LU            | Local only     | 40,838     |
LU            | Global sharing | 15,655     | 62
Ocean         | Local only     | 271,179    |
Ocean         | Global sharing | 224,151    | 17

Test bench    | Case           | Exec. time (ms) | Exec. time reduction (%)
Water-spatial | Local only     | 14.59           |
Water-spatial | Global sharing | 14.23           | 2.44
Fmm           | Local only     | 10.13           |
Fmm           | Global sharing | 10.13           | 0
LU            | Local only     | 17.20           |
LU            | Global sharing | 16.43           | 4.48
Ocean         | Local only     | 25.90           |
Ocean         | Global sharing | 24.46           | 5.57
A Thermal-aware Voltage Droop Compensation Method with Signature Sharing (cont.)
• Reduce system frequencies to combat high voltage droops
  – Previous research considers only one reduced frequency
• Our experiments show that voltage droop decreases as temperature increases for the same processor activity
• Idea 2: choose different reduced frequencies according to temperature
Thermal-aware Voltage Compensation Method for Multi-core Systems
• Five frequency cases; the normal frequency is 2 GHz (a table-lookup sketch follows below)
• Case 1 uses one reduced frequency
• Cases 2, 3 and 4 best show the performance benefits of the proposed method
• Performance benefits are 5% on average (8-core and 16-core systems) [9]

Temp. range | Case 1 | Case 2  | Case 3  | Case 4  | Case 5
20-40°C     | 1 GHz  | 1 GHz   | 1 GHz   | 1 GHz   | 1.3 GHz
40-60°C     | 1 GHz  | 1 GHz   | 1 GHz   | 1.3 GHz | 1.3 GHz
60-80°C     | 1 GHz  | 1 GHz   | 1.3 GHz | 1.3 GHz | 1.3 GHz
80-100°C    | 1 GHz  | 1.3 GHz | 1.3 GHz | 1.3 GHz | 1.3 GHz

[9] J. Zhao, B. Datta, W. Burleson and R. Tessier, "Thermal-aware Voltage Droop Compensation for Multi-core Architectures," in Proc. of ACM Great Lakes Symposium on VLSI (GLSVLSI'10), pp. 335-340, May 2010.
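Idea 2 amounts to a temperature-indexed frequency lookup. The sketch below reproduces frequency-table case 3 from this slide; the function and constants are otherwise illustrative, not the authors' controller.

# Pick the reduced (throttled) frequency from a temperature-indexed table,
# since a hotter core sees a smaller droop for the same activity.

FREQ_TABLE_CASE3 = [      # (upper temperature bound in C, throttled frequency in GHz)
    (40, 1.0),
    (60, 1.0),
    (80, 1.3),
    (100, 1.3),
]
NORMAL_FREQ_GHZ = 2.0

def throttled_frequency(temp_c, table=FREQ_TABLE_CASE3):
    """Frequency to use while a voltage emergency is predicted."""
    for upper_bound, freq in table:
        if temp_c < upper_bound:
            return freq
    return table[-1][1]

print(throttled_frequency(45))   # 1.0 GHz
print(throttled_frequency(85))   # 1.3 GHz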
A System-Level Experiment Result

Benchmark      | Perf. Case 1 (billion cycles) | Perf. Case 2 | Perf. Case 3 | Benefit Case 2 (%) | Benefit Case 3 (%)
LU (contig)    | 24.23                         | 23.70        | 22.21        | 2.20               | 8.32
Ocean (contig) | 2.76                          | 2.54         | 2.51         | 7.79               | 9.09
Radix          | 9.78                          | 9.29         | 9.08         | 5.03               | 7.18
FFT            | 115.14                        | 115.06       | 114.73       | 0.07               | 0.36
Cholesky       | 189.66                        | 185.05       | 182.28       | 2.43               | 3.89
Radiosity      | 121.42                        | 114.71       | 111.28       | 5.53               | 8.35

• A 128-core 3D system with 2 layers simulated using a modified Graphite
• Dynamic frequency scaling (DFS) used for thermal management and voltage droop compensation
  – Case 1: DFS for thermal management only
  – Case 2: DFS for voltage droop, using the flat sensor NoC
  – Case 3: DFS for voltage droop, using our hierarchical infrastructure
Conclusion
• On-chip monitoring of temperature, performance, supply voltage and other environmental conditions
• New infrastructures for on-chip sensor data processing for multi-core and many-core systems
  – The MNoC infrastructure
  – A hierarchical infrastructure for many-core systems, with significant latency reduction (>50%) versus a flat sensor NoC
• New remediation strategies using dedicated on-chip monitoring infrastructures
  – A thermal-aware voltage droop compensation method with signature sharing; performance benefit of 5% on average
• Other monitoring efforts
  – Sensor calibration
Publications
1. J. Zhao, J. Lu, W. Burleson and R. Tessier, "Run-time Probabilistic Detection of Miscalibrated Thermal Sensors in Many-core Systems," in Proc. IEEE/ACM Design, Automation and Test in Europe Conference, Grenoble, France, March 2013.
2. J. Lu, R. Tessier and W. Burleson, "Collaborative Calibration of On-Chip Thermal Sensors Using Performance Counters," in Proc. IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 2012.
3. J. Zhao, R. Tessier and W. Burleson, "Distributed Sensor Processing for Many-cores," in Proc. ACM Great Lakes Symposium on VLSI (GLSVLSI'12), to appear, 6 pages, May 2012.
4. J. Zhao, S. Madduri, R. Vadlamani, W. Burleson and R. Tessier, "A Dedicated Monitoring Infrastructure for Multicore Processors," IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 19, no. 6, pp. 1011-1022, 2011.
5. J. Zhao, B. Datta, W. Burleson and R. Tessier, "Thermal-aware Voltage Droop Compensation for Multi-core Architectures," in Proc. Great Lakes Symposium on VLSI (GLSVLSI'10), pp. 335-340, May 2010.
6. R. Vadlamani, J. Zhao, W. Burleson and R. Tessier, "Multicore Soft Error Rate Stabilization Using Adaptive Dual Modular Redundancy," in Proc. Design, Automation and Test in Europe (DATE'10), pp. 27-32, Mar. 2010.
7. S. Madduri, R. Vadlamani, W. Burleson and R. Tessier, "A Monitor Interconnect and Support Subsystem for Multicore Processors," in Proc. IEEE/ACM Design, Automation and Test in Europe Conference, Nice, France, April 2009.