Transcript Slide 1
Design in the Nano-meter Regime: From Devices to System Architecture
Kaushik Roy, Purdue University

Challenges ahead … in Si nanometer regime

Exponential Increase in Leakage
[Figure: technology timeline, 1970 to 2020, feature size from 5 µm through 1 µm and 100 nm to 10 nm, spanning silicon microelectronics, silicon nanoelectronics, and non-silicon technology. ION/IOFF annotations: 10^6, 10^3, and ~10^2 to 10^6.]
[Figure: transistor cross-section (gate, source, drain, n+ regions, bulk) marking subthreshold leakage, gate leakage, and junction leakage.]
[Figure: leakage power (% of total), 0% to 50%, versus technology (µm): 1.5, 0.7, 0.35, 0.18, 0.09, 0.05. Annotation: "Must stop at 50%". Source: A. Grove, IEDM 2002.]

Technology Trend
[Figure: device roadmap with year markers 2003, 2009, 2020. Single gate devices: Bulk-CMOS, PD/SOI (floating body), FD/SOI (fully-depleted body, buried oxide). Multi-gate devices: DGMOS, FinFET, Trigate. Nano devices: carbon nanotube, III-V devices, nano-wires, spintronics.]
Design methods to exploit the advantages of technology innovations

Variation in Process Parameters
[Figure: Device 1 and Device 2 differing in channel length and number of dopant atoms.]
[Figure: delay and leakage spread at 130nm, normalized frequency versus normalized leakage (Isb); 30% frequency spread, 5X leakage spread. Source: Intel.]
[Figure: random dopant fluctuation versus technology node (nm): 1000, 500, 250, 130, 65, 32. Source: Intel.]
• Inter- and intra-die variations
• Device parameters are no longer deterministic

Reliability
[Figure: failure probability versus time, showing defects and lifetime degradation, by technology generation.]
• Temporal degradation of performance: NBTI

Device-Aware Circuit/Architecture is Essential
Right type of device with right circuit and architecture

Low-Power and High-Performance VLSI Research
Research map labels, in extraction order: Wireless Communications (Low Power, Coding / Modulation); High Speed Arithmetic (Sharing Multiplier for Vector Scaling); Carbon Nano-tubes (circuits, architecture); Nano Circuits & Arch.; NBTI; Differential / Redundant Coeff.; Distributed Multiplication; Filter / Image Compression; Power Delivery (Analysis, Design for Rel.);
Self-Healing / Self-Calibration; Process-Tolerant Design; Low Power VLSI Signal Processing; Reliability, Noise & Power Del.; Logic (Sizing, Body Bias); Memory Failure Analysis & Yield; Performance/Power Aware Computing; Process; Subthreshold, Gate, Jn. BTBT, GIDL, ..; Transistor Stacking; Multiple Vt; Dynamic Vt; Low Complexity; Leakage Control; Variation; Wavelet based Idd Analysis; Device/Circuit Design; Idd Testing; Mixed Signal; Active Leakage Reduction; Low Leakage Memory (Dynamic Vt, DRG Cache); Digital Sub-threshold Logic (Ultra Low Power, Self Adjusting Vt); Device Modeling & Circuits (Bulk, SOI); Caches (Reconfigurable Cache; Gated Gnd, Clocking; Dynamic Vdd); Optimal SOI; Electro-thermal Design; Devices (DG-SOI, 3D-SOI, Device/Circuit/Arch co-design).
Professor Kaushik Roy, ECE, Purdue University

Memories: Leakage Reduction & Process Compensation

Device-aware Circuit/Microarch: Cache
Device options: Bulk (Ultra-high Vt, Nominal Vt), Ground-plane SOI, FinFET
Circuit Design Issues
• Leakage: sub-threshold, gate, junction, BTBT
• Stability: read noise margin, writability, soft errors
• Delay: decoder, wordline, bitline, MUX, sense-amp, driver
• Transition between active and standby modes
• Variations: process, Vdd, temperature
Microarch Design Issues
• Array aspect ratio: # cells WL/BL
• Sub-array structure and selection strategy
• Active-standby transition frequency, delay, energy
How do you co-design?
Bulk Nominal Vt: Source-biased (Supply Gated) Cache
SB-SRAM Circuit Design Issues
• Data retention (VGND should be strapped)
• Noise issue
• Process variation
[Figure: SRAM array with hot cache line, column I/O, and periodic sleep generation; self-decay sleep control circuit and VGND holding circuit. Signal labels: SLEEP, VSB, VGND, VDECAY, VREF, VGEN, CLK, BIAS; further labels: delay, tracking sleep control.]
SB-SRAM Microarch Design Issues
• Use locality of reference in cache to reduce transition energy
• Optimum memory sub-array size selection
• Sleep time Tsleep selection
Co-design approach leads to higher payoffs and more opportunities

Basic Idea: Supply Gating
[Figure: a single off transistor compared with a two-transistor stack (M1 on top of M2), gate inputs at '0', intermediate node voltage VM > 0.]
• Single off transistor: Vgs = 0, Vbs = 0, Vds = Vdd
• For M1: Vgs = -VM < 0, Vbs = -VM < 0, Vds = Vdd - VM < Vdd
• For M2: Vgs = 0, Vbs = 0, Vds = VM < Vdd
• Negative Vgs, negative Vbs (more body effect), reduced Vds (less DIBL)
• 2-T stack has lower subthreshold leakage

Source-Biasing: Retaining Data During Inactive Mode
[Figure: SRAM cells on a virtual ground rail VGND with sleep transistor (SLEEP), source bias VSB, and VGND holding circuit.]
• Sleep transistor cuts off VGND from ground during sleep mode
• VGND is strapped using different circuit schemes

16K-Byte SRAM Organization
Figure labels: A<0:1>, A<2:4>, A<5:7>, predecoder, self-decay circuit, WL<3:0>, VSB, MP1, VGND, bitlines, 512 cells, 4 cells, SLEEP, BLOCK_SEL, decoder/driver X4, SL, distributed sleep TR cells, Col.
I/O, Φ PRE
• Active leakage reduction SRAM
• Distributed sleep transistors
• SRAM block turned on ahead of time
• Self-decay circuit for low dynamic power overhead

2x16K-Byte SRAM Testchip (Kim, Roy, ISSCC'05)
Technology: 180nm 6-metal CMOS
Chip Size: 3.3 x 2.9 mm^2
Supply Voltage: 1.8V
Threshold Voltage: NMOS 0.53V, PMOS -0.53V
Read Access Cycle: 984MHz @ 1.8V, RT
Active Current: 0.14mW/MHz @ 1.8V
Standby Current: 7.27μA (16KB array)

Measured Leakage Reduction
[Figure: leakage (A), 0 to 8.E-06, for "Conventional" versus "This work", split into junction leakage, bitline leakage, and cell leakage; measured at 1.8V, 45 C.]
• 94.2% total leakage reduction at VGND = 0.9V
• Raising VGND also reduces gate tunneling leakage

Bulk Ultra-High Vt: Forward-biased Cache
Strong halo, low ISUB; FBB to ↑ ION
FB-SRAM Circuit Design Issues
• Zero body bias in standby to reduce leakage
• FBB in active mode to improve speed
• Early sub-array selection to hide body-bias transition latency
[Figure: SRAM cell with WL, VDD, BL, BLB, PWELL, GND.]
FB-SRAM Microarch Design Issues
• Use MSB of memory address for early selection of memory sub-array
• Use locality of reference in cache to reduce transition energy
Co-design approach gives large leakage savings

32x32 Forward Body-Biased Sub-array
[Figure: 32x32 sub-array with 0.4V power supply; labels M1, M2, M3, MA, MP, MN, SUBSL, WL0 to WL31, VPWELL.]
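The FB-SRAM microarchitecture points above (select the sub-array early from the address MSBs, and rely on locality of reference to keep body-bias transitions rare) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and does not come from the slides: the class name `SubArraySelector`, the 15-bit address, and the choice of 32 sub-arrays are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubArraySelector:
    """Toy model: the top address bits pick the sub-array to forward-bias."""
    addr_bits: int = 15          # assumed: 32KB, byte-addressed
    subarray_bits: int = 5       # assumed: 32 sub-arrays of 1KB each
    active: Optional[int] = None # sub-array currently in active (FBB) mode
    transitions: int = 0         # number of body-bias mode changes so far

    def subarray_of(self, addr: int) -> int:
        # The MSBs are available before full decode, so selection can start early.
        shift = self.addr_bits - self.subarray_bits
        return (addr >> shift) & ((1 << self.subarray_bits) - 1)

    def access(self, addr: int) -> int:
        target = self.subarray_of(addr)
        if target != self.active:
            # A different sub-array must be woken: one transition is paid.
            self.active = target
            self.transitions += 1
        return target


def count_transitions(addresses):
    selector = SubArraySelector()
    for addr in addresses:
        selector.access(addr)
    return selector.transitions


sequential = list(range(2048))        # high locality: two sub-arrays touched
alternating = [0, 3 * 1024] * 1024    # no locality: ping-pong between two sub-arrays

print(count_transitions(sequential))   # 2
print(count_transitions(alternating))  # 2048
```

With the same number of accesses, the sequential stream pays two transitions and the alternating stream pays one per access, which is the sense in which locality of reference reduces transition energy.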
Comparison
[Table: Conventional, SBSRAM, and FBSRAM cells. Extracted entries: VSL, VDD, VPWELL; 0V, 0.2V, 0.5V; Active, Standby; VT=270mV, VT=270mV, VT=350mV.]
• SBSRAM (DRG) has been proven with Si measurements
• Dynamic VDD, RBB SRAM have fundamental design issues
• MEDICI: gate/BTBT leakage is also modeled

32KB Cache Total Leakage Reduction
[Figure: power consumption (W), 0.00 to 0.25, for Conventional, SBSRAM, and FBSRAM, split into dynamic power overhead, leakage power (selected subarray), and leakage power (unselected subarrays). Bar labels: 230mW, 83mW, 84mW.]
• SBSRAM and FBSRAM are designed to give iso-leakage savings
• 64% total leakage reduction including overhead

Process Variations & Process Tolerance

Robust Design: Process Variations in On-chip SRAMs
[Figure: 6T SRAM cell with WL, BL, BR, PL, PR, NL, NR, AXL, AXR, storage nodes '1' and '0', and Low-Vt / High-Vt device annotations.]
[Figure: fault statistics, chip count versus number of faulty cells (NFaulty-Cells); simulation example of a 64KB cache, σVt ≈ 30mV, BPTM 45nm technology; yield ≈ 33%.]
• Parametric failures: read, write, access, hold
• Parametric failures can degrade SRAM yield

Inter-die Variation & Memory Failure
[Figure: cell and memory failure probability versus inter-die Vt shift [V], BPTM 70nm devices. Region A (LVT): high RF/HF. Region B (nominal Vt): low failures. Region C (HVT): high AF/WF.]
Memory failure probabilities are high when inter-die shift in process is high

Self-Repairing SRAM Array
LVT: Region A; Nom.
Vt: Region B; HVT: Region C
• Region A (LVT corner): read & hold failures dominate; reduce RF & HF
• Region C (HVT corner): access & write failures dominate; reduce AF & WF
Reduce the dominant failures at different inter-die corners to increase width of low failure region

Self-Repair using Leakage Monitoring
[Figure: SRAM array supplied from VDD through a bypass switch; on-chip leakage monitor (current monitor circuit) producing VOUT; comparator against VREF1 and VREF2; body-bias generator driving the array body bias; calibrate signal.]
[Figure: VOUT for LVT, nominal Vt, and HVT dies relative to VREF1 and VREF2, BPTM 70nm.]
Entire array leakage is monitored to detect inter-die corner and proper body-bias is applied

Test-Chip of Self-Repairing SRAM
[Figure: die photo labels: VCO, isolated cell VCO, 16 KB block, 64 KB LVT array, sensor + ref. gen., BB gen.]
• Technology: IBM 0.13 µm
• 128KB SRAM, leakage sensor, reference & body-bias gen
• Dual-Vt triple-well tech.
• Number of trans: ~7 million
• Die size: 16mm^2

Yield Enhancement using Self-Repair
[Figure: memory failure probability versus inter-die Vt shift [mV], BPTM 70nm, for a 256KB self-repairing SRAM (RBB, ZBB, FBB regions) and a 256KB SRAM with no body-bias.]
Self-Repairing SRAM using body-bias can significantly improve design yield

Self-repair: Architecture Level

Fault-Tolerant Cache Architecture
[Figure: cache with tag, index, and offset address fields; row decoder, column decoder, column muxes, sense amps, 512 rows of 256b data blocks and 16b tag blocks with faulty blocks marked; BIST, configurator, config storage (fault memory locations), hit/miss comparison; test mode and operating condition paths.]
• BIST detects the faulty blocks
• Config Storage stores the fault information
• Idea is to resize the cache to avoid faulty blocks during regular operation

Effective Yield of 64K Cache
Chart labels: Optimum r = 3; Proposed Arch.; Yield without any Redundancy; Conv.
Yield; Proposed Architecture with r = 3; ECC; Redundancy; Proposed Architecture
[Figure: % yield versus redundant rows in config storage (r = 0 to 4) and versus redundant rows in cache (R = 0 to 32).]
• ECC + redundancy yield ~ 77%
• Proposed architecture + redundancy yield ~ 94%

Fault Tolerant Capability
[Figure: fault statistics, chip count (Nchip) versus NFaulty-Cells, marking chips saved by the proposed architecture + redundancy (R=8, r=3) and chips saved by ECC + redundancy (R=16), with a region where ECC fails to save any chips.]
• Proposed architecture can handle more faulty cells than ECC, as many as 890 faulty cells
• Saves more chips than ECC for a given NFaulty-Cells

CPU Performance Loss
[Figure: % CPU performance loss, 0.0 to 2.5, versus NFaulty-Cells, for a 64K cache averaged over SPEC 2000 benchmarks.]
• Increase in miss rate due to downsizing of cache
• Average CPU performance loss over all SPEC 2000 benchmarks for a cache with 890 faulty cells is ~ 2%

Logic: Active Leakage Reduction
- Dual-Vt
- Transistor Stacking

Leakage Reduction: Supply Gating for Logic
[Figure: logic block between VDD and GND with input and output, VDD-gating control, and GND-gating control.]
Pros
• 5-20X leakage reduction
• Scalable
• Design ease
Cons
• Delay/area overhead
• Floated output
• Can be applied to idle sections only
How to use supply gating dynamically in active mode?

Dynamic Supply Gating (DSG): An Example
[Figure: row decoder built from predecoder, 3-to-8 row decoder, and postdecoder; power saving % (0 to 100) versus row address bits (8, 12, 16) for 70nm and 50nm technology.]
How to do it for random logic?
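One common way to approach the question above is Shannon's expansion, f = xi·f(xi=1) + xi'·f(xi=0): for any input vector only one of the two cofactors determines the output, so the other cofactor block is idle and is a candidate for supply gating. The sketch below is an illustration under that assumption only, not the authors' implementation; the function names and the example function are hypothetical.

```python
from itertools import product


def cofactors(f, i):
    """Return (CF1, CF2): f with input i forced to 1 and to 0, respectively."""
    def cf1(*x):
        return f(*x[:i], 1, *x[i + 1:])

    def cf2(*x):
        return f(*x[:i], 0, *x[i + 1:])

    return cf1, cf2


def evaluate_with_gating(f, i, x):
    """Evaluate f through its Shannon expansion around control variable i.

    Returns (output, name of the idle cofactor block that could be gated).
    """
    cf1, cf2 = cofactors(f, i)
    if x[i]:
        return cf1(*x), "CF2"   # xi = 1: CF2 does not affect the output
    return cf2(*x), "CF1"       # xi = 0: CF1 does not affect the output


def example_f(a, b, c):
    # Hypothetical 3-input function over 0/1 integers: (a AND b) OR (NOT a AND c)
    return (a & b) | ((1 - a) & c)


def equivalent(f, n, i):
    """Check the expansion against f on all 2**n input vectors."""
    return all(
        evaluate_with_gating(f, i, x)[0] == f(*x)
        for x in product((0, 1), repeat=n)
    )


print(equivalent(example_f, 3, 0))                    # True
print(evaluate_with_gating(example_f, 0, (1, 1, 0)))  # (1, 'CF2')
```

The expansion holds for any choice of control variable; which variable gives the largest idle block is a separate design decision that this sketch does not model.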
Dynamic Supply Gating for General Circuits
Shannon's expansion:
$f(x_1,\ldots,x_i,\ldots,x_n) = x_i \cdot f(x_1,\ldots,x_i=1,\ldots,x_n) + x_i' \cdot f(x_1,\ldots,x_i=0,\ldots,x_n) = x_i \cdot CF_1 + x_i' \cdot CF_2$
$CF_1 = f(x_1,\ldots,x_i=1,\ldots,x_n)$; $CF_2 = f(x_1,\ldots,x_i=0,\ldots,x_n)$
• xi is referred to as the control variable
[Figure: f implemented as CF1 and CF2 blocks feeding a MUX selected by xi (gating signals xi and xi'); second-level split of f1 into CF11 and CF12 selected by xj (gating signals xixj and xixj').]
Control variable selection is important

Simulation Results
[Figure: active leakage saving, leakage power (µW), 0 to 160, original versus DSG, for benchmarks x2, sct, pcle, pcler8, cht, mux, alu2, decod, cm150a, count.]
MCNC benchmarks, 70nm process, Vdd = 1V, Temp = 100°C

Logic: Process Variation & Tolerance
- Transistor Sizing for Yield (Statistical Design)
- Transistor Sizing for Efficient Speed-Binning
- Shadow Latches (Razor)
- Pipeline Balancing/Imbalancing
- Vdd Scaling & Critical Path Isolation (ICCAD'06)

Design Considerations for Low Power and Robust Circuit
Number of critical paths predictable and restricted to a logic section having low activation probability
[Figure: path delay distributions against clock period Tc for Design A (sections S1, S2, S3, VDD = 1V) and Design B (VDD < 1V).]
• Few predictable critical paths
• Low activation probability of critical paths
• Slack between critical and non-critical paths under variations

Proposed Approach: Critical Path Isolation By Control Variable Selection
[Figure: original circuit with inputs, functions f1 to f4, decoder, OR network, and primary output (PO); example functions f1 and f2 over inputs X1 to X9 split into f1(CF1), f1(CF2), f2(CF1), f2(CF2) for control variables x4 and x1.]
• Control-variable metric Mi, expressed using | ai bi | and max(ai, bi)
• ai: literal count of xi
• bi: literal count of xi'

Further Isolation by Hierarchical Partitioning and Sizing
[Figure: original circuit partitioned into cofactor blocks (CF10, CF20, CF32, CF42, CF53, CF63) selected by a MUX network on Xi, (Xi, Xj), and (Xi, Xj, Xk); LEVEL1 (50%), LEVEL2 (25%), LEVEL3 (12.5%).]
• Stopping conditions: area, delay constraints

Advantages of Shannon decomposition
• Critical paths can be isolated
• Activation of errors
  can be predicted ahead of time
• Activation probability of critical paths can be reduced

Simulation Results for Pipeline-based Design
[Figure: pipeline with CLK and freeze signals; stages cht, mux, and cm150a with delays 80ps, 70ps, and 85ps; D1, D2, D3 are decoding logic outputs.]
[Figure: % improvement in power (0 to 100) at input switching probability 0.2 and 0.5, and VDD [V] (0 to 1), for benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2, count.]
• Avg performance penalty = 5.9% for switching activity = 0.5
• Avg power saving = 60%, avg area penalty = 18%

Ultra Low Power
Subthreshold Leakage for computation??
- Soeleman, Kim, Roy: ISLPED 2000/2001, TVLSI 2001, TVLSI 2003
- Raychowdhury, Kim, Roy: ISLPED 04, TED 2004/2005, TVLSI 2005…

Subthreshold Operation
[Figure: IDS (A/µm), 1.E-9 to 1.E-3, versus VGS (Volts), 0 to 0.8, with Vth marked; region of operation is Vdd < Vth.]
[Figure: CGATE (fF/µm), 0.5 to 1, versus VGS (Volts), with Vth marked; in the region of operation CGATE < COX.]
• IDS ∝ exp(VGS - VTH), and not ∝ (VGS - VTH)

Design Goal
[Figure: power versus throughput with a power ceiling; sub-threshold region (medical app., wireless app.) and super-threshold region; arrows for device optimization and circuit/architecture optimization.]
Dev/Cir/Arch optimization is necessary
Is scaling necessary? Device for sub-threshold operation??

Scaling & Subthreshold Operation
• Reduced L => reduced capacitance
[Figure: average power (x 10^-7 J), 0 to 4, at iso-performance (3.4ns) versus technology node: 250nm (500 mV), 180nm (420 mV), 130nm (280 mV), 90nm (200 mV).]
Scaling is essential even for subthreshold operation
Are standard transistors good for subthreshold operation too?

Doping Profile: Std. vs. Proposed
[Figure: doping profiles of the standard device and the proposed device.]

Proposed device vs. Std.
Device @ iso-performance (3.4ns)
[Figure: average power (x 10^-7 J), 0 to 4, versus technology node (250, 180, 130, 90 nm); supply labels 500mV, 420mV, 280mV, 200mV, 180mV; annotation 48%.]
Raychowdhury, Paul, Roy; IEEE TED, Feb'05, ISLPED'04

Circuit Considerations
[Figure: CMOS NAND (inputs A, B, pull-up network PUP, pull-down network PDN) and pseudo-NMOS NAND.]
Pseudo-NMOS over CMOS
- Less power
- Faster operation

Pseudo-NMOS logic
[Figure: VTC of an inverter (350nm tech), Vout versus Vin with the Vin = Vout line, for standard operation (Vdd = 3.3V) and sub-threshold operation (0.5V); P/N ratios 4, 0.25, 0.1.]
Pseudo-NMOS logic is good for sub-threshold operation

Improvement Through Circuit Innovation
Pseudo-NMOS over CMOS (sub-threshold)
- Faster operation
- Reasonable power
Pseudo-NMOS logic is suitable for sub-threshold operation

Architecture Optimization
[Figure: parallelism (replicated logic blocks between IN and OUT latches with control) and pipelining (logic stages separated by latches, clocked by CLK).]

Architecture Optimization
[Figure: 5-tap FIR filter at iso-performance, 90nm predictive tech., comparing parallelism and pipelining.]
Optimum number of pipeline stages and parallel blocks needs to be chosen

Dev/Cir/Arc Co-design: Summary
[Figure: 5-tap FIR filter, 90nm predictive tech.; successive steps: standard CMOS, CMOS to pseudo-NMOS, optimal parallelization and pipelining, device optimization; supply labels from 0.8V down to 0.13V.]
Under review, TVLSI

Process-Tolerant Sub-threshold Chips
• IBM 130nm 8RF: 16-bit adder, process-tolerant adder, SRAM (for read/write/hold test), 1KB SRAM, filter core, adaptive -ratio circuit (tR/tF/tD)
• MITLL 3D FDSOI: process-tolerant pipeline

Summary
• A design paradigm shift is essential to meet the growing demands for power dissipation, yield, and reliability
• Device/Circuit/Architecture co-design can address some of the design problems associated with scaling

Performance/Power Aware Computing & Communications
NSF, DARPA, Northrop Grumman, MARCO GSRC, SRC, Intel, ATT/Lucent, HP, IBM, Convergys
Faculty: Kaushik Roy
Bipul Paul (Post-Doc. Res.), Keejong Kim (Post-Doc. Res.),
Swarup Bhunia, Tamer Cakici, James Gallaghar, Mark Budnik, Yiran Chen, Arijit Raychowdhury, Aditya Bansal, Amit Agarwal, Ashish Goel, Hunsoo Choo, Nilanjan Banerjee, Swaroop Ghosh, Jung Hwan Choi, Konhyuk Kang, Arjun Guha, Hamid Mahmoodi-Meimand, Saibal Mukhopadhyay, Animesh Datta, Jongsun Park, Hari Ananthan, Yongtao Wang, Myeong Hwang