Transcript Course HJ86
ASCI Winterschool on Embedded Systems, March 2004, Renesse

Processor Components: the cornerstones of future platforms, with emphasis on ILP exploitation
Henk Corporaal, Peter Knijnenburg

Future
We foresee that many characteristics of current high-performance architectures will find their way into the embedded domain.

What are we talking about?
ILP = Instruction-Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.

Overview
• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions

Motivation for ILP (and other types of parallelism)
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like:
– multimedia (image, audio, video, 3-D)
– intelligent search and filtering engines
– neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: P = fCV²

Low power through parallelism
• Sequential processor: switching capacitance C, frequency f, voltage V, so P = fCV²
• Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V, so P = (f/2) · 2C · V'² = fCV'²
Since V' < V, the parallel version delivers the same throughput at lower power.

ILP Goals
• Making the most powerful single-chip processor
• Exploiting parallelism between independent instructions (or operations) in programs
• Exploit hardware concurrency: multiple FUs, buses, register files, bypass paths, etc.
• Code compatibility: binary (superscalar and superpipelined) or HLL (VLIW)
• Incorporate enhanced functionality (ASIP)

Overview — next: Trends in Computer Architecture.

Trends in Computer Architecture
• Bridging the semantic gap
• Performance increase
• VLSI developments
• Architecture developments: design space
• The role of the compiler
• The right match

Very simple processor
[Figure: processor datapath — register file (r0, r1, r2, ...), function unit(s), MAR/MDR interface to data memory, instruction register plus decode logic.]

Bridging the Semantic Gap
Programming domains: application domain, architecture domain, datapath domain.
Example:
L_application:  A := B + C
  | SW compilation or interpretation
L_architecture: LD r1,M(&B); LD r2,M(&C); ADD r1,r1,r2; ST r1,M(&A)
  | HW interpretation
L_datapath:     &B → MAR; MDR → r1; &C → MAR; MDR → r2; r1 → ALUinput-1; r2 → ALUinput-2; ALUoutput := ALUinput-1 + ALUinput-2; ALUoutput → r1; r1 → MDR; &A → MAR

Bridging the Semantic Gap: Different Methods
[Figure: four ways of bridging the gap from application to operations & data transports — direct execution architectures (direct hardware interpretation of the application level); CISC architectures (compilation and/or software interpretation down to the architecture level, then micro-code interpretation); RISC architectures (compilation and/or software interpretation down to the architecture level, then direct hardware interpretation); microcoded architectures (architecture close to the application level, micro-code interpretation below it).]
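The "Low power through parallelism" slide above leaves the final comparison implicit. A worked instance, where the scaled supply voltage V' = 0.7V is an assumed, illustrative value (not from the slides):

```latex
P_{\mathrm{seq}} = f C V^2
\qquad
P_{\mathrm{par}} = \tfrac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2
% Halving the clock permits a lower supply voltage; assuming V' = 0.7 V:
P_{\mathrm{par}} = f C \,(0.7\,V)^2 = 0.49\, f C V^2 \approx \tfrac{1}{2}\, P_{\mathrm{seq}}
```

The doubled hardware then delivers the original throughput at roughly half the power, which is why parallelism appears here as a low-power technique.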
Bridging the Semantic Gap: what happens to the semantic level?
[Figure: semantic level of architectures, 1950–2010 — the architecture level climbed from the datapath domain toward the application domain (CISC), then dropped back (RISC); compilation and/or interpretation bridges the gap from above, hardware interpretation from below. Where it goes next ("?") is the open question.]

Performance Increase
[Figure: SPECint92 and SPECfp92 ratings of microprocessors, 1978–2002, log scale.]
• ~50% SPECint improvement per year
• ~60% SPECfp improvement per year

VLSI Developments
[Figure: DRAM density in transistors/chip grows ~ 2^((year−1956)·2/3); minimum feature size (µm) shrinks correspondingly, 1970–2000.]
Cycle time: t_cycle ~ t_gate · #gate_levels + wiring_delay + pad_delay.
What happens to each of these contributions?

Architecture Developments
How to improve performance?
• (Super)pipelining
• Powerful instructions
– MD-technique: multiple data operands per operation
– MO-technique: multiple operations per instruction
• Multiple instruction issue

Architecture Developments: Pipelined Execution of Instructions
[Figure: simple 5-stage pipeline — IF (instruction fetch), DC (instruction decode), RF (register fetch), EX (execute instruction), WB (write result register); a new instruction enters the pipeline every cycle.]
Purpose:
• Reduce #gate_levels in the critical path
• Reduce CPI to close to one
• More efficient hardware
Problems — hazards cause pipeline stalls:
• Structural hazards: add more hardware
• Control hazards, branch penalties: use branch prediction
• Data hazards: bypassing required
Superpipelining: split one or more of the critical pipeline stages.

Architecture Developments: Powerful Instructions (1)
MD-technique: multiple data operands per operation, e.g. a = B*c + d.
Two styles: vector and SIMD.
[Figure: vector execution — instructions stream through pipelined FUs over time; SIMD execution — K nodes execute the same instruction on different data each cycle.]

Vector computing:
• FU mix may match the application domain
• Use of interleaved memory
• FUs need to be tightly connected
SIMD computing:
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploits data locality of e.g. image-processing applications
• SIMD on a restricted scale: multimedia instructions — MMX, Sun VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia, ...
– Example: Σ_{i=1..4} |a_i − b_i|

Architecture Developments: Powerful Instructions (2)
MO-technique: multiple operations per instruction.
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one field per FU):
FU 1: sub r8,r5,3 | FU 2: and r1,r5,12 | FU 3: mul r6,r5,r2 | FU 4: ld r3,0(r5) | FU 5: bnez r5,13

VLIW characteristics:
• Only RISC-like operation support, hence short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions
• Not binary compatible
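The multimedia-instruction example above, Σ_{i=1..4} |a_i − b_i|, is a sum of absolute differences (SAD). A minimal scalar C sketch of what one such SIMD instruction computes in a single step (the function name is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Scalar sketch of the slide's multimedia example: sum_{i=1..4} |a_i - b_i|.
 * A SIMD multimedia instruction computes all four subword differences in
 * parallel; the loop below makes the per-element work explicit. */
static int sad4(const short a[4], const short b[4])
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

int main(void)
{
    short a[4] = {10, 20, 30, 40};
    short b[4] = { 7, 25, 30, 33};
    printf("SAD = %d\n", sad4(a, b));   /* 3 + 5 + 0 + 7 = 15 */
    return 0;
}
```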
Architecture Developments: Multiple Instruction Issue (per cycle)
Who guarantees semantic correctness?
• The user specifies multiple instruction streams: MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions: superscalar
• Compile into a dataflow representation: dataflow processors

Multiple instruction issue: three approaches
Example code:
a := b + 15;
c := 3.14 * d;
e := c / f;
Translation to a DDG (data dependence graph): [Figure: three chains — ld &b, +15, st &a; ld &d, *3.14, st &c, with the product also feeding the division; ld &f feeding /, st &e.]

Generated code:

Instr  Sequential code       Dataflow code
I1     ld   r1,M(&b)         ld   M(&b)   -> I2
I2     addi r1,r1,15         addi 15      -> I3
I3     st   r1,M(&a)         st   M(&a)
I4     ld   r1,M(&d)         ld   M(&d)   -> I5
I5     muli r1,r1,3.14       muli 3.14    -> I6, I8
I6     st   r1,M(&c)         st   M(&c)
I7     ld   r2,M(&f)         ld   M(&f)   -> I8
I8     div  r1,r1,r2         div          -> I9
I9     st   r1,M(&e)         st   M(&e)

Notes:
• An MIMD may execute two streams: (1) I1–I3, (2) I4–I9. There are no dependences between the streams; in practice, communication and synchronization are required between streams.
• A superscalar issues multiple instructions from the sequential stream, obeying dependences (true and name dependences); reverse engineering of the DDG is needed at run time.
• Dataflow code is a direct representation of the DDG.

Instruction Pipeline Overview
[Figure: pipeline organizations compared — CISC (IF DC RF EX); RISC (IF DC/RF EX WB); superscalar (k parallel pipelines IF DC ISSUE RF EX ROB WB, with reorder buffer); superpipelined (IF and EX split into substages IF1…IFs and EX1…EX5); VLIW (one wide IF/DC feeding k RF/EX/WB lanes); dataflow.]

Four-dimensional representation of the architecture design space ⟨I, O, D, S⟩
I = instructions per cycle, O = operations per instruction, D = data per operation, S = superpipelining degree.
[Figure: CISC, RISC, superscalar, dataflow and MIMD spread along the I axis; VLIW along O; vector and SIMD along D; superpipelined along S.]

Architecture design space: typical values of K (number of functional units or processor nodes) and ⟨I, O, D, S⟩ for different architectures:

Architecture    K    I    O    D    S    Mpar
CISC            1    0.2  1.2  1.1  1    0.26
RISC            1    1    1    1    1.2  1.2
VLIW            10   1    10   1    1.2  12
Superscalar     3    3    1    1    1.2  3.6
Superpipelined  1    1    1    1    3    3
Vector          7    0.1  1    64   5    32
SIMD            128  1    1    128  1.2  154
MIMD            32   32   1    1    1.2  38
Dataflow        10   10   1    1    1.2  12

S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op), where f(Op) is the relative frequency of operation Op and lt(Op) its latency.
Mpar = I · O · D · S.

The Role of the Compiler
Nine steps are required to translate an HLL program:
• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports

Division of responsibilities between hardware and compiler
[Figure: the steps above listed top to bottom, with the compiler/hardware boundary drawn per architecture — superscalar: hardware takes over from dependence determination onward; dataflow: from operand binding; multi-threaded: from scheduling; independence architectures: from operation binding; VLIW: from transport binding; TTA: hardware only executes.]

The Right Match
[Figure: transistors per CPU chip, 1972–2000, log scale — 8-bit microprocessors (~10^4), 32-bit CISC and RISC cores (~10^5), RISC + MMU + FP and 64-bit (~10^6), superscalar/VLIW/dataflow (~10^7), MIMD (~10^8): each architecture becomes feasible when the transistor budget matches it.]
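A small C sketch that recomputes the Mpar column of the design-space table above from the ⟨I, O, D, S⟩ values (numbers copied from the table; rounding explains small differences such as 153.6 vs 154):

```c
#include <stdio.h>

/* Recompute Mpar = I * O * D * S for the design-space table above. */
struct arch { const char *name; double I, O, D, S; };

int main(void)
{
    struct arch t[] = {
        {"CISC",           0.2, 1.2, 1.1, 1.0},
        {"RISC",           1,   1,   1,   1.2},
        {"VLIW",           1,   10,  1,   1.2},
        {"Superscalar",    3,   1,   1,   1.2},
        {"Superpipelined", 1,   1,   1,   3  },
        {"Vector",         0.1, 1,   64,  5  },
        {"SIMD",           1,   1,   128, 1.2},
        {"MIMD",           32,  1,   1,   1.2},
        {"Dataflow",       10,  1,   1,   1.2},
    };
    for (unsigned i = 0; i < sizeof t / sizeof t[0]; i++)
        printf("%-15s Mpar = %6.2f\n", t[i].name,
               t[i].I * t[i].O * t[i].D * t[i].S);
    return 0;   /* e.g. CISC: 0.2 * 1.2 * 1.1 * 1.0 = 0.26 */
}
```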
Overview — next: RISC processors.

RISC basics
[Figure: RISC datapath — register file with bypass (forwarding) buses, immediate mux, function unit with operand registers, memory unit; the instruction-fetch path is not shown. Pipeline: IF DC EX WB, with forwarding resolving data hazards between overlapping instructions.]

Why RISC? Make the common case fast:
• Reduced number of instructions
• Limited addressing modes: load-store architecture
• Large uniform register set
• Limited number of instruction sizes (preferably one), so the processor knows directly where the following instruction starts
• Limited number of instruction formats
All of this enables pipelining.

Overview — next: ILP Processors.

ILP Processors
• Overview
• General ILP organization
• VLIW concept — examples: TriMedia, Mpact, TMS320C6x, IA-64
• Superscalar concept — examples: HP PA-8000, Alpha 21264, MIPS R10k/R12k, Pentium I–IV, AMD K5–K7, UltraSPARC (ref: IEEE Micro, April 1996, HotChips issue)
• Comparing superscalar and VLIW

General ILP processor organization
[Figure: central processing unit — instruction fetch and instruction decode units fed from instruction memory; function units FU-1 … FU-K sharing a register file; data memory.]

ILP processor characteristics
• Issue multiple operations/instructions per cycle
• Multiple concurrent function units
• Pipelined execution
• Shared register file
• Four superscalar variants: in-order/out-of-order execution × in-order/out-of-order completion

VLIW concept
[Figure: a VLIW architecture with 7 FUs — three integer FUs and two load/store units on an integer register file, two FP FUs on a floating-point register file; instruction memory and data memory.]

VLIW example: TriMedia
• 5-issue; 128 registers; 27 FUs; 32-bit
• 8-way set-associative caches (32K I$, 16K D$); dual-ported data cache
• Guarded operations
[Block diagram: VLIW processor core with VLD coprocessor (Huffman decoder, MPEG-1/2); SDRAM memory interface; timers; PCI interface (32-bit, 33 MHz); video in (19 Mpix/s) and video out (40 Mpix/s); stereo and multi-channel digital audio in/out; I2C interface; serial interface.]

VLIW example: TMS320C62x (VelociTI)
• 8 operations (of 32 bits each) per instruction (256-bit instruction)
• Two clusters: 8 FUs, 4 FUs per cluster (2 multipliers, 6 ALUs); 2 × 16 registers; one port available to read from the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All operations conditional
• 5 ns cycle time, 200 MHz, 0.25 µm, 5-layer CMOS
• 128 KB on-chip RAM

VelociTI C64 datapath
[Figure: one cluster of the C64 datapath.]
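TriMedia's guarded operations and the C6x's "all operations conditional" enable if-conversion, which IA-64 (next slide) also relies on heavily. A hedged C sketch of the idea; the guarded-move notation in the comments is generic, not any machine's actual syntax:

```c
#include <stdio.h>

/* Sketch of if-conversion: replace a branch by straight-line guarded
 * (predicated) operations, removing the control hazard. */

/* With a branch: the compiler must schedule around a control hazard. */
int max_branching(int a, int b)
{
    if (a > b)              /* conditional branch */
        return a;
    return b;
}

/* If-converted: the comparison produces a guard p; both arms become
 * guarded operations that always issue, so no branch is needed. */
int max_predicated(int a, int b)
{
    int p = (a > b);        /* guard (predicate) register */
    int r;
    r =  p ? a : 0;         /* [p]  r = a   (guarded move) */
    r = !p ? b : r;         /* [!p] r = b   (guarded move) */
    return r;
}

int main(void)
{
    printf("%d %d\n", max_branching(3, 7), max_predicated(3, 7)); /* 7 7 */
    return 0;
}
```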
VLIW example: IA-64 (Intel/HP)
A 64-bit VLIW-like architecture:
• 128-bit instruction bundle containing 3 instructions
• 128 integer + 128 floating-point registers: 7-bit register ids
• Guarded instructions: a 64-entry boolean (predicate) register file; relies heavily on if-conversion to remove branches
• Instruction independence specified explicitly: some extra bits per bundle
• Fully interlocked, i.e. no delay slots: operations are latency-compatible within a family of architectures
• Split loads: non-trapping load + exception check

Intel Itanium 2
• EPIC; 0.18 µm, 6 metal layers
• 8 issue slots
• 1 GHz (8000 MIPS)
• 130 W (max); 61 MOPS/W
• 128-bit bundle (3 × 41 bits + 5 template bits)

Superscalar: concept
[Figure: instruction memory → instruction cache → decoder → reservation stations feeding a branch unit, ALU-1, ALU-2, a logic & shift unit, and load and store units; reorder buffer and register file; data cache to data memory.]

Intel Pentium 4
• Superscalar; 0.12 µm, 6 metal layers; 1.0 V
• 3 issue; >3 GHz; 58 W
• 20-stage pipeline; ALUs clocked at 2×
• Trace cache

Pentium 4 (continued)
• Trace cache
• Hyper-threading
• Add with ½-cycle throughput (1½-cycle latency): one half-cycle adds the least-significant 16 bits, the next adds the most-significant 16 bits (with the carry forwarded), the third calculates the flags

P4 vs P-II/P-III pipeline
Basic P6 pipeline (10 stages; introduced at 733 MHz in 0.18 µ):
1–2 Fetch, 3–5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec.
Basic Pentium 4 processor pipeline (20 stages; introduced at 1.4 GHz in 0.18 µ):
1–2 TC Nxt IP, 3–4 TC Fetch, 5 Drive, 6 Alloc, 7–8 Rename, 9 Que, 10–12 Sch, 13–14 Disp, 15–16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive.

Example with higher IPC and faster clock
Code sequence: Ld, Add, Add, Ld, Add, Add.
• P6 @ 1 GHz: 10 clocks = 10 ns, IPC = 0.6
• Pentium 4 @ 1.4 GHz: 6 clocks = 4.3 ns, IPC = 1.0

Superscalar issues
• How to fetch multiple instructions in time (across basic-block boundaries)? Trace cache
• Handling control hazards: branch prediction
• Non-blocking memory system: hit under miss
• Handling dependencies: renaming
• How to support precise interrupts? ROB
• How to recover from a mispredicted branch path? ROB

Renaming example:

#  Original code    Dependence  Latency  Renamed version
1  mul r1,r2,r3     —           4        mul p1,p2,p3
2  st  r1,3(r2)     RaW         1        st  p1,3(p2)
3  add r1,r5,#4     WaW, WaR    1        add p4,p5,#4
4  shl r2,r1,r3     RaW, WaR    1        shl p6,p4,p3

All four instructions may issue simultaneously (if resources are available).
Renaming is implemented using:
• a reorder buffer: Pentium II/III, HP PA-8000, PowerPC 604, SPARC64
• direct register remapping: MIPS R10k/R12k, DEC 21264

Renaming mapping (after I4): r1 → p4, r2 → p6, r3 → p3, r5 → p5.
Note: the old mapping r1 → p1 is not needed anymore; however, p1 is still active. When may we reuse physical register p1? When the old mapping has changed (r1 → p4) and p1 has been committed.

Branch Prediction
Why branch prediction techniques?
• Speculatively execute beyond branches
• Reduce branch penalties
Classification:
• Static techniques — prediction based on:
– profiling information
– static analysis of the code: use of heuristics
• Dynamic techniques:
– 1-level: branch prediction buffer with n-bit prediction counters
– 2-level: branch correlation using branch history
– hybrid methods (e.g. Alpha 21264)
• Combinations of static and dynamic
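A minimal C sketch of the renaming mechanics in the example above: every destination gets a fresh physical register from a free list, and sources read the current mapping. The physical-register numbers differ from the slide's, and commit/recycling (the p1-reuse question) is not modelled:

```c
#include <stdio.h>

/* Renaming table sketch: logical -> physical map plus a trivial free list.
 * Within an instruction, sources are renamed before the destination. */
enum { NLOG = 6 };
static int map[NLOG];          /* logical r -> physical p */
static int next_free = NLOG;   /* simple free "list"      */

static int src(int r) { return map[r]; }
static int dst(int r) { return map[r] = next_free++; }

int main(void)
{
    for (int r = 0; r < NLOG; r++) map[r] = r;     /* initially r_i -> p_i */

    /* mul r1,r2,r3 ; st r1,3(r2) ; add r1,r5,#4 ; shl r2,r1,r3 */
    int s2 = src(2), s3 = src(3);
    printf("mul p%d,p%d,p%d\n", dst(1), s2, s3);   /* fresh dest for r1  */
    printf("st  p%d,3(p%d)\n", src(1), src(2));    /* reads mul's result */
    int s5 = src(5);
    printf("add p%d,p%d,#4\n", dst(1), s5);        /* WaW/WaR removed    */
    int s1 = src(1), s3b = src(3);
    printf("shl p%d,p%d,p%d\n", dst(2), s1, s3b);  /* RaW on add kept    */
    return 0;
}
```

After the four renames, the WaW and WaR dependences on r1 and r2 are gone, which is exactly why all four instructions may issue simultaneously.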
Static techniques: heuristic-based (Ball and Larus '93)
• Loop branch heuristic: the back edge will be taken 88% of the time.
• Pointer heuristic: a comparison of two pointers will fail 60% of the time.
• Call heuristic: a successor block containing a call, and which does not post-dominate the block containing the branch, will not be taken 78% of the time.
• Opcode heuristic: a test of an integer for < 0, ≤ 0, or equality to some constant will fail 84% of the time.
• Loop exit heuristic: a branch in a loop in which no successor block is a loop head will not exit the loop 80% of the time.

Static heuristics (Ball and Larus '93, continued)
• Return heuristic: a successor block containing a return will not be taken 72% of the time.
• Store heuristic: a successor block containing a store instruction, and which does not post-dominate, will not be taken 55% of the time.
• Loop header heuristic: a successor block which is a loop header or a loop pre-header (i.e., it passes control unconditionally to a loop head which it dominates), and which does not post-dominate, will be taken 75% of the time.
• Guard heuristic: a successor block in which a register is used before being defined, and which does not post-dominate, will be taken 62% of the time if that register is an operand of the branch.

Static heuristic-based prediction
When multiple heuristics apply, we combine them with Dempster-Shafer evidence combination:
P_new = (P_old · P_heuristic) / (P_old · P_heuristic + (1 − P_old) · (1 − P_heuristic))
For example, if both the loop exit (0.8) and store (0.45) heuristics apply:
P_new = 0.8 · 0.45 / (0.8 · 0.45 + (1 − 0.8) · (1 − 0.45)) = 0.766

Dynamic techniques: branch prediction buffer, 1-bit prediction
The lower K bits of the branch address index a table of 2^K entries, each holding one prediction bit.
Problems:
• Aliasing: the lower K bits of different branch instructions can be the same.
– Solution: use tags (the buffer becomes a cache); however, this is very expensive.
• Loops are predicted wrong twice.
– Solution: use an n-bit saturating counter: predict taken if counter ≥ 2^(n−1), not taken if counter < 2^(n−1). A 2-bit saturating counter predicts a loop wrong only once.

Using n-bit saturating counters
[Figure: the branch address indexes a table of n-bit saturating up/down counters whose value gives the prediction. 2-bit scheme: states 11/T and 10/T predict taken, 01/N and 00/N predict not taken; a taken branch counts up, a not-taken branch counts down.]

Branch correlation using branch history
Two schemes ⟨a, k, m, n⟩:
• PA: per-address history, a > 0
• GA: global history, a = 0
A branch history table (2^a entries of k history bits) and m branch-address bits together index a pattern history table of 2^k × 2^m n-bit saturating up/down counters, which deliver the prediction.
Table size (usually n = 2): #bits = k · 2^a + 2^k · 2^m · n.
Variant: gshare (Scott McFarling '93): a GA scheme which takes the logical XOR of PC address bits and branch history bits.

Predicting the target address
1. Branch target buffer (BTB)
2. Branch folding (store the target instruction in the BTB)
3. Return stack

Accuracy (taking the best combination of parameters)
[Figure: branch prediction accuracy (%) versus predictor size (64 bytes to 64 KB) for bimodal, GAs and PAs predictors; GA(0,11,5,2) reaches ~98%, PA(10,6,4,2) ~97%.]
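A compact C sketch of the 1-level scheme above: a branch prediction buffer of 2-bit saturating counters, indexed by the lower K bits of the branch address (untagged, so aliasing is possible). The demo shows the "loop predicted wrong only once" property; a gshare variant would only change the index to (pc XOR global history):

```c
#include <stdint.h>
#include <stdio.h>

#define K 10
#define ENTRIES (1u << K)
static uint8_t ctr[ENTRIES];            /* states 00,01 -> N; 10,11 -> T */

static int predict(uint32_t pc)         /* taken iff counter >= 2^(n-1)  */
{
    return ctr[pc & (ENTRIES - 1)] >= 2;
}

static void update(uint32_t pc, int taken)
{
    uint8_t *c = &ctr[pc & (ENTRIES - 1)];
    if (taken)  { if (*c < 3) (*c)++; }  /* saturate at 11 */
    else        { if (*c > 0) (*c)--; }  /* saturate at 00 */
}

int main(void)
{
    uint32_t pc = 0x40;                  /* one loop back-edge branch */
    ctr[pc & (ENTRIES - 1)] = 3;         /* warmed up: strongly taken */
    int mispred = 0;
    /* Two runs of a 10-trip loop: back edge taken 9x, then not taken. */
    for (int run = 0; run < 2; run++)
        for (int trip = 0; trip < 10; trip++) {
            int taken = (trip < 9);
            mispred += (predict(pc) != taken);
            update(pc, taken);
        }
    printf("mispredictions: %d (one per loop run, at the exit)\n", mispred);
    return 0;
}
```

The exit misprediction only drops the counter from 11 to 10, so the next run of the loop is still predicted taken; a 1-bit scheme would mispredict twice per run.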
Comparing superscalar and VLIW

Characteristic               Superscalar        VLIW
Architecture type            multiple issue     multiple operations
Complexity                   high               low
Binary code compatibility    yes                no
Source code compatibility    yes                yes, if good compiler
Scheduling                   dynamic            static
Scheduling window            10 instructions    100–1000 instructions
Speculation                  dynamic            static
Branch prediction            dynamic            static
Memory ref. disambiguation   dynamic            static
Scalability                  medium             high
Functional flexibility       high               very high
Application                  general purpose    special purpose

Overview — next: Transport Triggered Architectures.

Reducing datapath complexity: TTA
TTA = Transport Triggered Architecture.
Philosophy: mirror the programming paradigm.
• Program the transports; operations are side effects of transports.
• The compiler is in control of the hardware transport capacity.

General structure of a TTA
[Figure: function units, an integer register file, an FP register file and a boolean register file, all connected through sockets to a set of data-transport (move) buses.]

Programming TTAs
How to do data operations:
1. Transport the operands to the FU: operand move(s) plus a trigger move (writing the trigger register starts the operation in the internal FU pipeline).
2. Transport the results from the FU: result move(s).
Example: add r3,r1,r2 becomes
r1 → Oint    // operand move to integer unit
r2 → Tadd    // trigger move to integer unit
             // addition operation in progress in the FU pipeline
Rint → r3    // result move from integer unit
How to do control flow:
1. Jump: #jump-address → pc
2. Branch: #displacement → pcd
3. Call: pc → r; #call-address → pcd

Programming TTAs: scheduling advantages
1. Software bypassing:
   Rint → r1; r1 → Tadd   becomes   Rint → r1; Rint → Tadd
2. Dead writeback removal:
   Rint → r1; Rint → Tadd   becomes   Rint → Tadd
3. Common operand elimination:
   #4 → Oint; r1 → Tadd; #4 → Oint; r2 → Tadd   becomes   #4 → Oint; r1 → Tadd; r2 → Tadd
4. Operand, trigger and result moves are decoupled completely:
   r1 → Oint; r2 → Tadd; Rint → r3   may be scheduled as three independent moves:
   r1 → Oint … r2 → Tadd … Rint → r3

TTA advantages (summary)
• Better usage of the transport capacity:
– instead of 3 transports per dyadic operation, about 2 are needed
– the number of register ports is reduced by at least 50%
– inter-FU connectivity reduces by 50–70%
• No full connectivity required
• Both the transport capacity and the number of register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #register files, etc. can be changed
• TTAs are easy to design and can have short cycle times

TTA automatic design space exploration
[Figure: an optimizer proposes architecture parameters to a parametric compiler (the Move framework), which produces parallel object code and feedback; a hardware generator produces the chip; the explored solutions form a Pareto curve in the cost/performance space, steered by user interaction.]

Overview — next: Configurable components.

Tensilica Xtensa
• Configurable RISC; 0.13 µm, 0.9 V
• 1 issue slot / 5-stage pipeline
• 490 MHz typical; 39.2 mW (excluding memories); 12500 MOPS/W
• Tool support
• Optional vector unit
• Special function units
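A toy C interpreter for the TTA add example above, showing "operations as side effects of transports": writing the trigger register Tadd is what performs the addition. This is a deliberately minimal sketch (one FU, one bus, no FU pipeline), not the Move framework:

```c
#include <stdio.h>

/* One integer FU with an operand register (Oint), a trigger register
 * (Tadd: writing it starts the add) and a result register (Rint). */
static int reg[8];                    /* general registers r0..r7 */
static int Oint, Rint;                /* FU-visible registers     */

static void move_Oint(int v) { Oint = v; }
static void move_Tadd(int v) { Rint = Oint + v; }  /* trigger: add happens */

int main(void)
{
    reg[1] = 3; reg[2] = 4;
    /* add r3,r1,r2 expressed as transports: */
    move_Oint(reg[1]);                /* r1   -> Oint  (operand move) */
    move_Tadd(reg[2]);                /* r2   -> Tadd  (trigger move) */
    reg[3] = Rint;                    /* Rint -> r3    (result move)  */
    printf("r3 = %d\n", reg[3]);      /* 7 */
    return 0;
}
```

Because each move is an explicit instruction, a compiler can apply the optimizations listed above, e.g. feed Rint straight into the next Tadd (software bypassing) and skip the write to r3 if it is dead.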
Fine-grained reconfigurable: Xilinx XC4000 FPGA
[Figure: an array of configurable logic blocks (CLBs) embedded in a programmable interconnect with switch matrices, surrounded by I/O blocks (IOBs) with input/output buffers, slew-rate control and passive pull-up/pull-down at each pad. Each CLB contains two 4-input function generators (F, G), a third function generator (H), two flip-flops, and select/control logic.]

Coarse-grained reconfigurable: Chameleon CS2000
Highlights:
• 32-bit datapath (ALU/shift)
• 16×24 multiplier
• Distributed local memory
• Fixed timing

Hybrid FPGAs: Xilinx Virtex-II Pro
• Up to 16 multi-gigabit serial transceivers
• Embedded PowerPC cores
• Memory blocks
• Reconfigurable logic blocks
(Courtesy of Xilinx.)

HW or SW reconfigurable?
[Figure: reconfiguration time versus datapath granularity — fine-grained FPGAs are configured at reset time (spatial mapping); loop-buffer and context reconfiguration sit in between; coarse-grained VLIWs are "reconfigured" every cycle (temporal mapping, subword parallelism).]

Granularity makes differences:

                     Fine-grained    Coarse-grained
Clock speed          low             high
Configuration time   long            short
Unit amount          large           small
Flexibility          high            low
Power                high            low
Area                 large           small

Overview — next: Multi-threading (a topic added to the original list).

Simultaneous multithreading (SMT): characteristics
• An SMT has separate front-ends for the different threads but shares the back-end between all threads.
• Each thread has its own re-order buffer and branch history register.
• Registers, caches, branch prediction tables, instruction queues, FUs, etc. are shared.

Multi-threading in uniprocessor architectures
[Figure: issue slots over clock cycles — a superscalar runs one thread and leaves many slots empty; concurrent multithreading switches between threads 1–4 from cycle to cycle; simultaneous multithreading mixes instructions of different threads within a single cycle.]

Instruction fetch policies
• The instruction fetch policy decides from which threads to fetch each cycle.
• Performance and throughput are highly sensitive to the instruction fetch policy.
• The "standard" icount policy fetches from the thread with the fewest instructions in the front-end.
• The performance of a thread depends on the policy as well as on the workload, and becomes highly unpredictable.

Resource allocation in SMT
• It is better to perform dynamic resource allocation to drive instruction fetch.
• DCRA outperforms icount in many cases.
• Resource allocation can be used to guarantee a certain percentage of single-thread performance.
• This improves predictability, and hence the suitability of SMT for real-time embedded systems.

Future processor components
• The new TriMedia has a deep pipeline, L1 and L2 caches, and branch prediction.
• META is a (simple) simultaneous multithreaded architecture.
• Calistro is an embedded multiprocessor platform for mobile applications.
• Imagine (Stanford) combines operation-level (VLIW) and data-level (SIMD) parallelism.
• TRIPS (Texas Austin / IBM) and SCALE (MIT) combine task-, operation- and data-level parallelism.
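A minimal C sketch of the icount fetch policy described above: each cycle the front-end fetches from the thread with the fewest instructions currently in its front-end stages. The counts are illustrative:

```c
#include <stdio.h>

/* icount policy sketch: pick the thread with the fewest instructions
 * in the front-end (decode/rename/issue queues). */
#define NTHREADS 4

static int pick_icount(const int in_frontend[NTHREADS])
{
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (in_frontend[t] < in_frontend[best])
            best = t;
    return best;
}

int main(void)
{
    int in_frontend[NTHREADS] = {12, 3, 7, 9};
    printf("fetch from thread %d\n", pick_icount(in_frontend)); /* 1 */
    return 0;
}
```

The policy starves no thread outright, but a thread whose instructions linger in the front-end (e.g. behind cache misses) fetches less, which is the unpredictability that DCRA-style resource allocation addresses.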
Summary and Conclusions
ILP architectures have great potential.
• Superscalars: binary-compatible upgrade path.
• VLIWs: very flexible ASIPs.
• TTAs: avoid control and datapath bottlenecks; completely compiler controlled; very good cost-performance ratio; low power.
• Multi-threading: surpasses the exploitable ILP in applications; open issue: how to choose the threads?