On-chip Monitoring Infrastructures and Strategies for Many-core Systems
Russell Tessier, Jia Zhao, Justin Lu, Sailaja Madduri, and Wayne Burleson
Research supported by the Semiconductor Research Corporation

Outline
• Motivation
• Contributions
• On-chip monitoring infrastructures
• Extensions to 3D architectures
• Monitoring for voltage droop prevention
• Conclusion and future work

On-chip Sensors and Monitoring
• Thermal sensors, processor activity monitors, delay monitors, reliability monitors, and others
• Challenges involved:
  1. Sensor design (VLSI design research)
  2. Sensor data collection
  3. Sensor data processing, e.g. identifying imprecise sensor measurements
  4. Remediation strategies
[Figure: a core with an activity monitor, thermal sensor, delay monitor, cache, and network interface connecting to memories and peripherals]

Multi-core and Many-core Systems
• From single-core systems to multi-cores and many-cores
• Need to monitor system temperature, supply voltage fluctuation, reliability, among others
• Remediation strategies include voltage changes, frequency changes, error protection, and others
[Image: AMD FX-8150 8-core Bulldozer processor; courtesy silentpcreview.com]

System-Level Requirements
• Monitor data collected in a scalable, coherent fashion
  – Interconnect flexibility: many different monitor interfaces and bandwidth requirements
  – Low overhead: the interconnect should provide low-overhead interfaces (buses, direct connects) while integrating a NoC
  – Low latency: priority data needs immediate attention (thermal, errors)
• Collate data with centralized processing
  – Focus on collaborative use of monitor data
• Validate the monitoring system with applications
  – Layout and system-level simulation

MNoC Overview
• A dedicated infrastructure for on-chip monitoring: the Monitor Network-on-Chip (MNoC)
• An on-chip network that connects all sensors to a monitor executive processor (MEP) through MNoC routers
• Low latency for sensor data transmission and low cost
[Figure: MNoC mesh with a control X-bar port and interface; legend: MEP – Monitor Executive Processor, R – Router, M – Monitor, D – Data, T – Timer module]

Priority Based Interfacing
• Multiple transfer channels available for critical data
• Interface synchronized with the router via a buffer
• Time stamps used to prioritize/coordinate responses
[Figure: monitors connected through a bus-router interface to the network router, with a dedicated high-priority channel]

MNoC Router and Monitor Interface
• Bus and multiplexer interfaces are both supported
• Data types:
  – Periodically sampled
  – Occasionally reported
• Critical monitor data is transferred on the priority channel
• A thermal monitor [1] interface example:
  – FSM controlled (idle, sample, wait for buffer, packetization, wait for data)
  – Time stamp attached to identify out-of-date data
  – Monitor packet fields: monitor data, monitor address, destination address, time stamp
  – Packets are multiplexed onto the priority or regular channel toward the network router interface (master)
[Figure: track/hold thermal sensors feeding a VCO, multiplexer, frequency divider, and digital counter under a controller block, producing 8-bit data for the router interface]
[1] B. Datta and W. Burleson, "Low-Power, Process-Variation Tolerant On-Chip Thermal Monitoring using Track and Hold Based Thermal Sensors," in Proc. ACM Great Lakes Symposium on VLSI, pp. 145-148, 2009.
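As a concrete illustration of the priority-based interfacing above, the following C++ sketch packetizes a monitor sample, attaches a time stamp, and steers critical data onto the priority virtual channel. The packet fields (monitor data, monitor address, destination address, time stamp) follow the slides; the class names, the alarm threshold, and the threshold-based priority rule are illustrative assumptions rather than the actual MNoC hardware.

```cpp
// Minimal sketch of the monitor-to-router interface described above.
// Packet fields follow the slides; names and the priority rule are assumptions.
#include <cstdint>
#include <deque>
#include <iostream>

struct MonitorPacket {
    uint16_t monitor_data;   // e.g. 12-bit sample from a thermal/delay monitor
    uint8_t  monitor_addr;   // which monitor produced the sample
    uint8_t  dest_addr;      // MEP address the packet is routed to
    uint32_t time_stamp;     // lets the MEP identify out-of-date data
    bool     priority;       // true -> priority virtual channel
};

class MonitorInterface {
public:
    MonitorInterface(uint8_t monitor_addr, uint8_t mep_addr, uint16_t alarm_threshold)
        : monitor_addr_(monitor_addr), mep_addr_(mep_addr), threshold_(alarm_threshold) {}

    // Called by the interface FSM when a sample is ready and the buffer is not full.
    MonitorPacket packetize(uint16_t sample, uint32_t now) const {
        return MonitorPacket{sample, monitor_addr_, mep_addr_, now,
                             sample >= threshold_ /* critical -> priority channel */};
    }

private:
    uint8_t  monitor_addr_, mep_addr_;
    uint16_t threshold_;
};

int main() {
    MonitorInterface thermal_if(/*monitor*/ 3, /*MEP*/ 0, /*alarm*/ 0x700);
    std::deque<MonitorPacket> regular_vc, priority_vc;  // router input buffers

    uint32_t cycle = 0;
    for (uint16_t sample : {0x2A0, 0x6F0, 0x7A0}) {     // periodic samples
        MonitorPacket p = thermal_if.packetize(sample, cycle);
        (p.priority ? priority_vc : regular_vc).push_back(p);
        cycle += 800;                                   // e.g. one sample per 800 cycles
    }
    std::cout << "priority packets: " << priority_vc.size()
              << ", regular packets: " << regular_vc.size() << "\n";
}
```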
MNoC Area with Differing Router Parameters
Total MNoC area for different buffer sizes and data widths at 65 nm:

  Data width (bits) | Input buffer size | Gate count per router
  8                 | 4                 | 15,017
  8                 | 8                 | 15,234
  10                | 4                 | 15,505
  10                | 8                 | 15,763
  12                | 4                 | 17,902
  12                | 8                 | 18,196
  14                | 4                 | 18,871
  14                | 8                 | 19,222

[Figure: total MNoC area (mm²) versus data width (6-20 bits) for buffer sizes of 2, 4, 8, and 16 flits]
• Desirable to minimize the interconnect data width to just meet latency requirements
• Most of the router area is consumed by the data path
  – Each delay monitor generates 12-bit data plus a 6-bit header

Data Width Effect on Latency
[Figure: regular-channel and priority-channel network latency (clock cycles) versus cycles between injection, for data widths of 8-18 bits and a buffer size of 4]
• Regular channel latency (e.g. 100 cycles) is tolerable for low-priority data
• The priority channel provides a fast path for critical data (about 20 cycles)

NoC and MNoC Architectures
• A shared-memory multicore based on Tile64 [1] and the TRIPS OCN [2]
  – 4×4 mesh as in Tile64
  – 256-bit data width
  – 2-flit buffer size
• Monitors in the multicore system
  – Thermal monitor (1/800 injection rate)
  – Delay monitor (around 1/200 injection rate)
• MNoC configuration from the suggested design flow
  – 4×4 MNoC
  – 24-bit flit width
  – 2 virtual channels
  – 2-flit buffer size
[1] S. Bell, et al., "TILE64 Processor: A 64-Core SoC with Mesh Interconnect," in Proc. International Solid-State Circuits Conference, pp. 88-598, 2008.
[2] P. Gratz, C. Kim, R. McDonald, S. Keckler and D. Burger, "Implementation and Evaluation of On-Chip Network Architectures," in Proc. International Conference on Computer Design, pp. 477-484, Oct. 2007.

MNoC vs. Regular Network-on-Chip
• A network-on-chip (NoC) is already used in multi-cores, so why use a separate MNoC for monitor data?
• Three cases:
  – NoC with MNoC: monitor data is transferred via the MNoC; inter-processor traffic ("application data") stays in the multi-core NoC
  – MT-NoC (mixed traffic NoC): all monitor and application traffic in a four-virtual-channel (VC) NoC
  – Iso-NoC (isolated channel NoC): monitor data in one VC; application traffic in the remaining three VCs
• A virtual channel is like an extra lane on a road; unfortunately, there is still only one lane at the intersections

Three Cases for Comparison
1. NoC with MNoC: monitor data is transferred via the MNoC; inter-processor traffic ("application data") stays in the multicore NoC
2. MT-NoC (mixed traffic NoC): all monitor and application traffic in a four-virtual-channel (VC) NoC
3. Iso-NoC (isolated channel NoC): monitor data in one VC; application traffic in the remaining three VCs
• Infrastructure for MT-NoC and Iso-NoC shown
[Figure: a 4×4 mesh of routers with 16 monitors multiplexed into one router and processors and memory attached to the others]
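To make the data-width trade-off concrete, the short sketch below counts the flits needed to carry one delay-monitor packet (12-bit data plus a 6-bit header, as noted on the area slide) at different flit widths: narrower flits shrink the router data path but require more flits, and therefore more link cycles, per packet. The ceiling-division packet model is an illustrative assumption, not the exact MNoC packet format.

```cpp
// Illustrative sketch of the flit-count side of the data-width trade-off.
#include <iostream>

constexpr int flits_per_packet(int payload_bits, int flit_width_bits) {
    return (payload_bits + flit_width_bits - 1) / flit_width_bits;  // ceiling division
}

int main() {
    const int payload = 12 + 6;  // delay-monitor data + header (per the area slide)
    for (int width : {8, 10, 12, 14, 18, 24}) {
        std::cout << "flit width " << width << " bits -> "
                  << flits_per_packet(payload, width) << " flit(s) per packet\n";
    }
}
```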
Application Data Latency
1. NoC with MNoC: lowest latency
2. Iso-NoC: highest latency
3. MT-NoC: lower latency than Iso-NoC
• A standalone MNoC ensures low latency for application data

Monitor Data Latency
• Iso-NoC achieves low monitor data latency but has high application data latency
• A standalone MNoC ensures low latency for monitor data
• Results obtained with a modified Popnet network simulator

New On-chip Monitoring Requirements in Many-cores
• Many-core systems demand greater sensor data collection and processing capability
• New remediation strategies at both local and global scales
  – Signature sharing for voltage droop compensation at the global scale
  – Distributed and centralized dynamic thermal management (DTM)
• Three-dimensional (3D) systems add more complexity
  – Stacking memory layers on top of a core layer
• No previous on-chip monitoring infrastructure addresses all of these requirements
  – Simple bus-based infrastructures are not suitable
  – The MNoC infrastructure has no support for communication between MEPs
  – Can it scale to many-core systems with hundreds to a thousand cores?
  – Support for 3D systems?

3D Many-core Systems
• 3D technology stacks dies on top of each other
• Through-silicon vias (TSVs) are used for vertical communication or heat dissipation
• High bandwidth and high performance
[Fig. 1: a three-layer 3D system example; thermal TSVs are for heat dissipation. Image courtesy: S. Pasricha, "Exploring Serial Vertical Interconnects for 3D ICs," in Proc. ACM/IEEE Design Automation Conference, pp. 581-586, Jul. 2009.]

A Hierarchical Sensor Data Interconnect Infrastructure
• An example for a 36-core system
• One sensor NoC router per core
• Sensors are connected to sensor routers (similar to MNoC)
• Sensor routers send data to sensor data processors (SDPs)
  – Through the SDP routers
  – One SDP per 9 cores in this example, which may not be the optimal configuration
[Figure: a 6×6 mesh of sensor routers partitioned among four SDPs, each attached through an SDP NoC router]

A Hierarchical Sensor Data Interconnect Infrastructure (cont.)
• SDP routers are connected by another layer of network, the SDP NoC
• More traffic patterns are supported in the SDP NoC
• Both the sensor NoC and the SDP NoC have low cost
  – Small data width (e.g. 24 bits)
  – Shallow buffers (4-8 flits)
[Figure: the same 36-core example, showing sensor routers, SDP NoC routers, and the four SDPs]
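One plausible way to realize the core-to-SDP assignment described above is to split the mesh into square regions, with every sensor router in a region reporting to that region's SDP. The slides only give the ratios (one SDP per 9 cores in the 36-core example, one per 64 cores in the 256-core system), so the rectangular-region mapping below is an assumption for illustration.

```cpp
// Sketch of one plausible core-to-SDP assignment for the hierarchical
// infrastructure above; the region-based mapping is an assumption.
#include <iostream>

struct SdpId { int rx, ry; };  // SDP (region) coordinates in the SDP NoC

SdpId sdp_for_core(int core_x, int core_y, int region_dim) {
    return {core_x / region_dim, core_y / region_dim};
}

int main() {
    // 36-core example from the slide: 6x6 mesh, one SDP per 9 cores -> 3x3 regions.
    const int region_dim = 3;
    for (int y = 0; y < 6; ++y) {
        for (int x = 0; x < 6; ++x) {
            SdpId s = sdp_for_core(x, y, region_dim);
            std::cout << "(" << s.rx << "," << s.ry << ") ";
        }
        std::cout << "\n";
    }
}
```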
Hierarchical Sensor Data Processing Infrastructure for a 256-core 3D System
• Thermal sensors in the memory layer are connected to sensor routers using through-silicon vias (TSVs)
• One SDP per 64 cores
[Figure: a two-layer 3D system; layer 1 holds the cores with voltage droop sensors, voltage droop signature generation modules, performance counters, and sensor routers; layer 2 holds memory blocks with thermal sensors connected through TSVs; sensor routers form the sensor NoC and SDP routers form the SDP NoC]

SDP Router Design
• SDP routers receive packets from sensor routers
• SDP routers also support broadcast
  – Send broadcast packets to the SDP
  – Generate new broadcast packets when necessary
• Two virtual channels supported (regular and priority)
[Figure: SDP router block diagram with link controllers (LC) on the sensor NoC and SDP NoC input channels, packetization and de-packetization logic, a sensor NoC packet buffer, a broadcast controller, routing and arbitration, and a switch driving the SDP NoC output channels]

Packet Transmission in the SDP NoC
• Traffic in the sensor NoC is similar to MNoC
• The SDP NoC supports more complicated traffic patterns
  – Hotspot, for global-scale DTM [4]
  – Broadcast, for a voltage droop signature sharing method [5]
• Hotspot traffic is supported by most routing algorithms
• An SDP router design supports a simple broadcast strategy through a broadcast controller (a sketch of this ordering appears after the Graphite slides below)
  – Send packets vertically first
  – Then send them horizontally
[Figure: a 4×4 grid of SDP routers labeled (0,0) through (3,3)]
[4] R. Jayaseelan, et al., "A Hybrid Local-global Approach for Multi-core Thermal Management," in Proc. International Conference on Computer-Aided Design, pp. 314-320, Nov. 2009.
[5] J. Zhao, et al., "Thermal-aware Voltage Droop Compensation for Multi-core Architectures," in Proc. ACM Great Lakes Symposium on VLSI, pp. 335-340, May 2010.

Experimental Approach
• Our infrastructure is simulated using a heavily modified Popnet simulator
• Simulated for 256-, 512- and 1024-core systems
  – Packet transmission delay
• Synthesized using 45 nm technology
  – Hardware cost
• On-chip sensors
  – Thermal sensors
  – Performance counters
  – Voltage droop sensors
  – Signature-based voltage droop predictors
• A system-level experiment is performed using the modified Graphite many-core simulator
  – Run-time temperature is modeled

Modified Graphite Simulator
• Simulation speed is key due to the large number of cores
• Graphite is from the Carbon Research Group at MIT
• Graphite maps each thread in the application to a tile of the target architecture
• These threads are distributed among multiple host processes running on multiple host machines
• Mean slowdown versus native execution is around 1000×
• SPLASH2 and PARSEC benchmarks supported
• Power model under development
[Figure: Graphite simulator overview. Image courtesy: J. Miller, et al., "Graphite: A Distributed Parallel Simulator for Multicores," in Proc. IEEE International Symposium on High-Performance Computer Architecture (HPCA), Jan. 2010.]

Graphite Initial Experiments
• Graphite compiled and runs on a Core2 Quad machine with 4 GB of memory
• SPLASH2 benchmarks tested with up to 128 cores (1024 next)
• Modifications for thermal testing and power evaluation are underway
• Integration with the modified sensor data network simulator
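Looking back at the broadcast strategy on the "Packet Transmission in the SDP NoC" slide, the sketch below reproduces the "vertical first, then horizontal" ordering on a 4×4 SDP grid and prints the resulting hop count from the source to every SDP. The grid size matches the slide figure, but the source position and the hop-count bookkeeping are illustrative assumptions rather than the actual broadcast controller.

```cpp
// Sketch of the simple "vertical first, then horizontal" SDP broadcast order:
// a packet first travels up and down the source column, and each router in
// that column then forwards copies left and right along its own row.
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    const int N = 4;                    // 4x4 SDP NoC, as in the slide figure
    const int src_x = 1, src_y = 2;     // arbitrary source SDP (assumption)
    std::vector<std::vector<int>> hops(N, std::vector<int>(N, -1));

    // Vertical phase: walk the source column in both directions.
    for (int y = 0; y < N; ++y)
        hops[src_x][y] = std::abs(y - src_y);

    // Horizontal phase: every router in the source column forwards along its row.
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x)
            if (x != src_x)
                hops[x][y] = hops[src_x][y] + std::abs(x - src_x);

    for (int y = 0; y < N; ++y) {       // hop count from the source to each SDP
        for (int x = 0; x < N; ++x) std::cout << hops[x][y] << " ";
        std::cout << "\n";
    }
}
```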
Experiments using Graphite
• 3D architecture simulation
  – Memory blocks stacked on top of the core layer
  – 3D memory latency numbers adopted in Graphite
• Integrated with the on-chip sensor interconnect simulator
  – Thermal, activity, and other information extracted or estimated using Graphite
  – Data collection and processing using a modified Popnet
  – Remediation simulation
• Dynamic frequency and voltage scaling experiments in Graphite
  – Simulation parameters changed at run-time
[Figure: simulation flow; a benchmark and configuration drive the modified Graphite, which sends sensor data packets (thermal, critical path, processor activity, voltage droop, vulnerability, etc.) to the modified Popnet and monitor data processing; remediation decisions are fed back to Graphite, which produces performance, power, and other statistics]

Comparison against a Flat Sensor NoC Infrastructure

  Core and SDP num.  | Latency type | Flat sensor NoC (cycles) | Our method (cycles) | Latency reduction w.r.t. flat sensor NoC (%)
  256 core (4 SDP)   | Inter-SDP    | 45.43                    | 7.85                | 82.72
                     | Total        | 62.82                    | 25.24               | 59.82
  512 core (8 SDP)   | Inter-SDP    | 67.57                    | 11.37               | 83.17
                     | Total        | 84.96                    | 28.76               | 66.15
  1024 core (16 SDP) | Inter-SDP    | 90.36                    | 14.38               | 84.08
                     | Total        | 107.75                   | 31.77               | 70.52

• Simulated using a modified Popnet; one SDP per 64 cores is chosen; sensor data from the memory layer is included
• The latency of our infrastructure is compared against a flat sensor NoC infrastructure
  – Only sensor routers, no SDP NoC
  – Packets between SDPs (inter-SDP) are transmitted using sensor routers
• Our infrastructure significantly reduces inter-SDP latency (>82%) and total latency (>59%)
• The hardware cost increase with respect to the flat sensor NoC is less than 6%
• Our infrastructure also provides higher throughput than the flat sensor NoC

Core to SDP Ratio Experiment

  Core num. | Core/SDP ratio | SDP num. | Sensor NoC latency (cycles) | SDP latency (cycles) | Total latency (cycles) | SDP NoC to sensor NoC HW cost ratio (%)
  256       | 32             | 8        | 12.25                       | 11.01                | 23.26                  | 7.94
  256       | 64             | 4        | 17.31                       | 7.72                 | 25.03                  | 4.19
  256       | 128            | 2        | 24.14                       | 5.25                 | 29.39                  | 1.72
  512       | 32             | 16       | 12.25                       | 13.37                | 25.62                  | 8.78
  512       | 64             | 8        | 17.31                       | 11.42                | 28.73                  | 4.99
  512       | 128            | 4        | 24.14                       | 7.88                 | 32.02                  | 2.65
  1024      | 32             | 32       | 12.25                       | 20.87                | 33.12                  | 9.20
  1024      | 64             | 16       | 17.31                       | 15.48                | 32.79                  | 5.46
  1024      | 128            | 8        | 24.14                       | 11.84                | 35.98                  | 3.15

• One SDP per 64 cores is chosen
  – Low latency
  – Moderate hardware cost, less than 6% versus the sensor NoC alone

Throughput Comparison
• The throughput of inter-SDP packet transmission is compared
  – Throughput for packet transmission within the sensor NoC is the same in both cases
• Our infrastructure provides higher throughput than the flat sensor NoC

Signature-based Voltage Droop Compensation
• Event history table content is compared with signatures at run-time to predict emergencies
  – Larger table -> more accurate prediction
• Signature table
  – Larger table -> higher performance
  – Larger table -> higher cost
• Extensively studied in Reddi, et al. [1]
[Figure: a processor pipeline (IF, ID, EX, MEM, WB) feeding an event history table with events such as pipeline flushes, control-flow instructions, DTLB misses, DL1 misses, and L2 misses; the event history is compared against a signature table, and a match triggers frequency throttling; voltage droop monitors capture new signatures and report to an MNoC router]
[1] V. Reddi, M. Gupta, G. Holloway, M. Smith, G. Wei and D. Brooks, "Voltage emergency prediction: A signature-based approach to reducing voltage emergencies," in Proc. International Symposium on High-Performance Computer Architecture, pp. 18-27, Feb. 2009.
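The sketch below illustrates the signature mechanism described above: a window of recent microarchitectural events is encoded into a signature and looked up in a signature table, and a hit predicts an imminent voltage emergency that triggers frequency throttling. The event encoding, window length, and hash are illustrative assumptions and do not reproduce the exact scheme of Reddi et al. [1].

```cpp
// Minimal sketch of signature-based voltage emergency prediction.
// Event encoding, window length, and hash are assumptions for illustration.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_set>

enum class Event : uint8_t { PipelineFlush, DL1Miss, L2Miss, DTLBMiss, Branch };

class EmergencyPredictor {
public:
    explicit EmergencyPredictor(std::size_t window) : window_(window) {}

    void learn(uint64_t sig) { table_.insert(sig); }   // captured after a measured droop

    // Returns true if the current event window matches a known signature.
    bool observe(Event e) {
        history_.push_back(e);
        if (history_.size() > window_) history_.pop_front();
        return table_.count(signature()) != 0;
    }

    uint64_t signature() const {                       // simple encoding of the window
        uint64_t sig = 0;
        for (Event e : history_) sig = sig * 31 + static_cast<uint64_t>(e) + 1;
        return sig;
    }

private:
    std::size_t window_;
    std::deque<Event> history_;
    std::unordered_set<uint64_t> table_;
};

int main() {
    EmergencyPredictor pred(/*window=*/4);

    // An event sequence that once preceded a measured droop: capture its signature.
    for (Event e : {Event::L2Miss, Event::PipelineFlush, Event::DTLBMiss, Event::Branch})
        pred.observe(e);
    pred.learn(pred.signature());

    // Later execution: when the same window of events recurs, throttle.
    bool hit = false;
    for (Event e : {Event::Branch, Event::DL1Miss, Event::L2Miss,
                    Event::PipelineFlush, Event::DTLBMiss, Event::Branch})
        hit = pred.observe(e) || hit;
    std::cout << (hit ? "emergency predicted: throttle frequency\n"
                      : "no emergency predicted\n");
}
```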
A Thermal-aware Voltage Droop Compensation Method with Signature Sharing
• High voltage droops cause voltage emergencies
• Voltage droop prediction is based on signatures
  – A signature is the footprint of a series of instructions
  – Prediction of incoming high voltage droops
• Signature-based methods have previously targeted single-core systems
• Initial signature detection involves a performance penalty
• Idea 1: signature sharing across cores
  – Fewer penalties for initial signature detection
[Figure: a multicore with shared memory, a shared bus, and a shared L2 cache; each processor has thermal monitors and voltage droop monitors and exchanges signatures with the MEP through MNoC routers (R – router, MEP – monitor executive processor)]

Voltage Droop Signature Sharing in Multicore Systems
• 8- and 16-core systems simulated with SESC; 8-core results shown here
• Comparison between the signature sharing method and the local signature method
• Four benchmarks show a significant reduction in signature count
• The performance benefit comes mainly from a reduced rollback penalty

  Test bench    | Case           | Sign. num. | Sign. num. reduction (%)
  Water-spatial | Local only     | 18,474     |
                | Global sharing | 5,167      | 72
  Fmm           | Local only     | 832        |
                | Global sharing | 202        | 76
  LU            | Local only     | 40,838     |
                | Global sharing | 15,655     | 62
  Ocean         | Local only     | 271,179    |
                | Global sharing | 224,151    | 17

  Test bench    | Case           | Exec. time (ms) | Exec. time reduction (%)
  Water-spatial | Local only     | 14.59           |
                | Global sharing | 14.23           | 2.44
  Fmm           | Local only     | 10.13           |
                | Global sharing | 10.13           | 0
  LU            | Local only     | 17.20           |
                | Global sharing | 16.43           | 4.48
  Ocean         | Local only     | 25.90           |
                | Global sharing | 24.46           | 5.57

A Thermal-aware Voltage Droop Compensation Method with Signature Sharing (cont.)
• Reduce the system frequency to combat high voltage droops
  – Previous research considers only one reduced frequency
• Our experiments show that, for the same processor activity, voltage droop decreases as temperature increases
• Idea 2: choose different reduced frequencies according to temperature

Thermal-aware Voltage Compensation Method for Multi-core Systems
• 5 frequency cases; the normal frequency is 2 GHz
• Case 1 uses a single reduced frequency
• Cases 2, 3 and 4 best show the performance benefits of the proposed method
• Performance benefit of 5% on average (8-core and 16-core systems) [9]

  Temp. range | Frequency table case 1 | Case 2  | Case 3  | Case 4  | Case 5
  20-40°C     | 1 GHz                  | 1 GHz   | 1 GHz   | 1 GHz   | 1.3 GHz
  40-60°C     | 1 GHz                  | 1 GHz   | 1 GHz   | 1.3 GHz | 1.3 GHz
  60-80°C     | 1 GHz                  | 1 GHz   | 1.3 GHz | 1.3 GHz | 1.3 GHz
  80-100°C    | 1 GHz                  | 1.3 GHz | 1.3 GHz | 1.3 GHz | 1.3 GHz

[9] J. Zhao, B. Datta, W. Burleson and R. Tessier, "Thermal-aware Voltage Droop Compensation for Multi-core Architectures," in Proc. ACM Great Lakes Symposium on VLSI (GLSVLSI'10), pp. 335-340, May 2010.
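The temperature-dependent frequency choice of Idea 2 can be read directly off the frequency table above; the sketch below implements case 4, where only cores in the 20-40°C range fall back to 1 GHz during a predicted droop while hotter cores use 1.3 GHz. The function and variable names are illustrative.

```cpp
// Sketch of temperature-dependent reduced-frequency selection (table case 4):
// hotter cores see smaller droops for the same activity, so they can keep a
// higher reduced frequency when a voltage emergency is predicted.
#include <iostream>

double reduced_frequency_ghz(double temp_c) {
    if (temp_c < 40.0) return 1.0;   // 20-40 C row of the table
    return 1.3;                      // 40-100 C rows
}

int main() {
    const double normal_ghz = 2.0;   // normal operating frequency from the slide
    for (double t : {25.0, 45.0, 65.0, 85.0})
        std::cout << "T = " << t << " C: run at " << reduced_frequency_ghz(t)
                  << " GHz during a predicted droop (normal " << normal_ghz << " GHz)\n";
}
```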
A System Level Experiment Result

  Benchmark      | Case 1 perf. (billion cycles) | Case 2 perf. | Case 3 perf. | Case 2 benefit (%) | Case 3 benefit (%)
  LU (contig)    | 24.23                         | 23.70        | 22.21        | 2.20               | 8.32
  Ocean (contig) | 2.76                          | 2.54         | 2.51         | 7.79               | 9.09
  Radix          | 9.78                          | 9.29         | 9.08         | 5.03               | 7.18
  FFT            | 115.14                        | 115.06       | 114.73       | 0.07               | 0.36
  Cholesky       | 189.66                        | 185.05       | 182.28       | 2.43               | 3.89
  Radiosity      | 121.42                        | 114.71       | 111.28       | 5.53               | 8.35

• A 128-core, 2-layer 3D system simulated using a modified Graphite
• Dynamic frequency scaling (DFS) is used for thermal management and voltage droop compensation
  – Case 1: DFS for thermal management only
  – Case 2: DFS for voltage droop, using the flat sensor NoC
  – Case 3: DFS for voltage droop, using our hierarchical infrastructure

Conclusion
• On-chip monitoring of temperature, performance, supply voltage and other environmental conditions
• New infrastructures for on-chip sensor data processing in multi-core and many-core systems
  – The MNoC infrastructure
  – A hierarchical infrastructure for many-core systems, with a significant latency reduction (>50%) versus a flat sensor NoC
• New remediation strategies using dedicated on-chip monitoring infrastructures
  – A thermal-aware voltage droop compensation method with signature sharing; 5% average performance benefit
• Other monitoring efforts
  – Sensor calibration

Publications
1. J. Zhao, J. Lu, W. Burleson and R. Tessier, "Run-time Probabilistic Detection of Miscalibrated Thermal Sensors in Many-core Systems," in Proc. IEEE/ACM Design, Automation and Test in Europe Conference (DATE), Grenoble, France, Mar. 2013.
2. J. Lu, R. Tessier and W. Burleson, "Collaborative Calibration of On-Chip Thermal Sensors Using Performance Counters," in Proc. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, Nov. 2012.
3. J. Zhao, R. Tessier and W. Burleson, "Distributed Sensor Processing for Many-cores," in Proc. ACM Great Lakes Symposium on VLSI (GLSVLSI'12), to appear, 6 pages, May 2012.
4. J. Zhao, S. Madduri, R. Vadlamani, W. Burleson and R. Tessier, "A Dedicated Monitoring Infrastructure for Multicore Processors," IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 19, no. 6, pp. 1011-1022, 2011.
5. J. Zhao, B. Datta, W. Burleson and R. Tessier, "Thermal-aware Voltage Droop Compensation for Multi-core Architectures," in Proc. ACM Great Lakes Symposium on VLSI (GLSVLSI'10), pp. 335-340, May 2010.
6. R. Vadlamani, J. Zhao, W. Burleson and R. Tessier, "Multicore Soft Error Rate Stabilization Using Adaptive Dual Modular Redundancy," in Proc. Design, Automation and Test in Europe (DATE'10), pp. 27-32, Mar. 2010.
7. S. Madduri, R. Vadlamani, W. Burleson and R. Tessier, "A Monitor Interconnect and Support Subsystem for Multicore Processors," in Proc. IEEE/ACM Design, Automation and Test in Europe Conference (DATE), Nice, France, Apr. 2009.