The Garp Architecture and C Compiler Liao Jirong



The Garp Architecture and C Compiler

T.J. Callahan, J.R. Hauser & J. Wawrzynek, U.C. Berkeley

Brought to you by

Liao Jirong

[email protected]

http://www.comp.nus.edu.sg/~liaojiro

Outline

 Background
 The Garp Architecture
 The Compiler for Garp
 Simulation results
 Summary

Background

 Emergence of reconfigurable hardware: FPGAs, etc.
 Impressive speedups for various tasks: DNA sequence matching, encryption, etc.
 Obstacles to be overcome: configuration time, size, floating-point operations, compatibility of the various implementations on the market, ...
 Past work (PRISC, NAPA, PRISM, etc.): limited to specific application domains; not fully automatic compilation

The Big Picture

The application is split into a non-computation kernel, compiled for the processor (CPU), and a computation kernel, compiled/synthesized for the coprocessor (FPGA, ASIC, etc.); the two communicate during execution.

Execution Flow of a Kernel

1. Load a configuration
2. Copy any initial register data to the coprocessor
3. Start execution on the coprocessor
4. Copy results back to the processor

Steps 1, 2 & 4 are overhead.
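The four steps above can be sketched in C. This is only an illustration of the flow, not Garp's real interface: the coprocessor is modeled as a plain struct, and each step is a stand-in function (all names here are hypothetical).

```c
#include <string.h>
#include <stdint.h>

/* Illustrative sketch of the four-step kernel invocation flow.
 * The "coprocessor" is just a struct; on real Garp each step is
 * performed with dedicated processor instructions. */
typedef struct {
    uint32_t config;    /* currently loaded configuration id */
    uint32_t regs[4];   /* data registers visible to the array */
} Coprocessor;

/* Step 1 (overhead): load a configuration. */
static void load_configuration(Coprocessor *cp, uint32_t id) {
    cp->config = id;
}

/* Step 2 (overhead): copy initial register data to the coprocessor. */
static void copy_registers_in(Coprocessor *cp, const uint32_t *src, int n) {
    memcpy(cp->regs, src, (size_t)n * sizeof(uint32_t));
}

/* Step 3 (the useful work): here the "kernel" just adds two registers. */
static void start_execution(Coprocessor *cp) {
    cp->regs[0] = cp->regs[1] + cp->regs[2];
}

/* Step 4 (overhead): copy the result back to the processor. */
static uint32_t copy_result_back(const Coprocessor *cp) {
    return cp->regs[0];
}

uint32_t run_kernel(uint32_t a, uint32_t b) {
    Coprocessor cp = {0, {0}};
    uint32_t in[4] = {0, a, b, 0};
    load_configuration(&cp, 42);    /* 1 */
    copy_registers_in(&cp, in, 4);  /* 2 */
    start_execution(&cp);           /* 3 */
    return copy_result_back(&cp);   /* 4 */
}
```

The kernel only pays off when the work in step 3 is large enough to amortize steps 1, 2 and 4.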

Motivation

 Integrate reconfigurable hardware more closely with the processor, addressing:
  - long reconfiguration times
  - low-bandwidth paths for data transfer
  - the need for hardware design expertise

Assumption

 A few cycles of overhead for register data transfer is acceptable
 The coprocessor needs its own direct path to the processor's memory system (it is impossible for the processor to provide this bandwidth on its behalf)
 The coprocessor needs to be rapidly reconfigurable

The Garp Architecture

 Single-issue MIPS processor core with reconfigurable hardware (the coprocessor)
 The coprocessor is on the same die as the processor
 The coprocessor and processor share the same memory
 The reconfigurable hardware architecture and its interfaces have been designed
 Does not exist as real silicon (simulation only)

The Blueprint

The Garp Arch. (Cont)

 For general-purpose applications
 Fits into an ordinary processing environment
 The main thread of control through a program is managed by the processor:
  1. A configuration can be loaded only when the coprocessor is idle
  2. The coprocessor can work independently
  3. Coprocessor execution can be halted or resumed
  4. Configurations cannot be loaded, nor the coprocessor accessed, while it is active

The reconfigurable hardware

 Two-dimensional array of blocks
 The number of rows is implementation-specific, in an upward-compatible fashion
 Blocks are interconnected by programmable wiring
 A fixed global clock (sequencer)
 Configuration cache
 Memory buses
 Memory queues

Blocks

 Configurable Logic Blocks (CLBs): 2 bits wide; 16 CLBs in a row form a 32-bit data path; each takes up to four 2-bit inputs; (a<<10)|(b&c) can be implemented in one row
 Control blocks: one for each row, in the leftmost column; serve as liaison to the processor
 Boolean values for if-conversion, used in hyperblocks
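For reference, the one-row example from the slide in plain C. Each of the 32 output bits of this expression depends on at most one bit of `a` and one bit each of `b` and `c`, which is why it fits within a single row's per-CLB input limit:

```c
#include <stdint.h>

/* The slide's example of an expression that one row of CLBs can
 * compute: a fixed 10-bit shift of a, OR'd with the bitwise AND
 * of b and c. */
uint32_t one_row_example(uint32_t a, uint32_t b, uint32_t c) {
    return (a << 10) | (b & c);
}
```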

Wires

 Vertical wires connect blocks in the same column
 Horizontal wires connect blocks in the same or adjacent rows
 Built-in carry chains support addition, subtraction, and comparison
 Multi-bit shifts across a row make multiplication and division by constants fairly efficient
 The wire network is passive: a value cannot jump from one wire to another without passing through a logic block
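Constant multiplication is efficient because it decomposes into multi-bit shifts across a row plus additions along the carry chain. A minimal C sketch of the decomposition a compiler could emit (the function name is ours, not from the paper):

```c
#include <stdint.h>

/* Multiply by the constant 10 using only shifts and an add:
 * 10x = 8x + 2x = (x << 3) + (x << 1). On Garp the two shifts
 * are cheap multi-bit shifts across a row, and the add uses the
 * built-in carry chain. */
uint32_t mul_by_10(uint32_t x) {
    return (x << 3) + (x << 1);
}
```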

Memory tricks

 Configuration cache: holds recently displaced configurations; reloading from the cache takes only 5 cycles; can hold 4 full-sized configurations
 Wide path between the coprocessor and memory, used for both data transfers and configuration loads
 Memory buses: four 32-bit data buses and one 32-bit address bus; the coprocessor is master of the memory buses when active and can initiate one access every cycle
 Memory queues

Garp vs. VLIW

 Garp resembles a VLIW processor
 Advantages over VLIW: Garp doesn't have VLIW's per-cycle limits on instruction issue, functional units, or register file bandwidth; pipelining on Garp is more straightforward than software pipelining on a VLIW (no competition for functional units); the processor maintains high performance for sequential code
 Disadvantages versus VLIW: kernel size is limited; cannot exploit ILP outside of loops

Garp vs. Vector

 Garp resembles a memory-to-memory vector processor when synthesizing a vectorizable loop
 Feedback loops can be constructed arbitrarily, while vector units can handle only very specialized recurrences
 Garp can easily handle data-dependent loop exits, which are a problem for vector architectures

Garp vs. Superscalar

Because of its modest number of instruction-issue slots, a superscalar processor cannot compete with the Garp coprocessor in cases with a large amount of ILP.

Any Question About Garp?

For further details: "Garp: A MIPS Processor with a Reconfigurable Coprocessor", J.R. Hauser and J. Wawrzynek, IEEE FCCM 1997.

Automatic Compilation

 Standard ANSI C as input
 SUIF C compiler for the front-end phase (parsing and standard optimizations)
 Fully automatic compilation

Compilation Flow

The application goes through kernel selection. Selected kernels go through optimization & synthesis to a bit-stream for the coprocessor; the non-kernel code goes through optimization to an executable file for the processor.

Kernel selection

 Kernels are loops
 The whole loop? No:
  - the loop may be too large
  - it may contain infrequently executed code (longer load time, longer interconnects)
  - some operations cannot be implemented in hardware
 ILP within a single basic block is limited

Hyperblock

 Join all the basic blocks of a loop body using predication (boolean values)
 Increases ILP
 Precedence edges: array subscript analysis, inter-procedural pointer analysis
 Contains the loop back edges, to avoid switching control back and forth
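A hand-done sketch of what if-conversion does to a loop body, in C. This is an illustration of the idea, not the compiler's actual output: the branch inside the loop is replaced by a boolean predicate that selects between both computed results, so the whole body becomes one straight-line hyperblock.

```c
#include <stdint.h>

/* Before if-conversion: control flow inside the loop body. */
int32_t sum_abs_branching(const int32_t *v, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] < 0) s -= v[i];
        else          s += v[i];
    }
    return s;
}

/* After if-conversion: both sides are computed every iteration,
 * and a boolean predicate selects which result to keep. No
 * control transfer remains in the body, so it can be merged
 * into a single hyperblock and pipelined. */
int32_t sum_abs_predicated(const int32_t *v, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++) {
        int32_t neg = -v[i];     /* "then" side */
        int32_t pos = v[i];      /* "else" side */
        int p = v[i] < 0;        /* predicate: a boolean value */
        s += p ? neg : pos;      /* select, not a branch */
    }
    return s;
}
```

Both versions compute the same sum of absolute values; the predicated form simply trades a branch for redundant computation plus a select.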

Hyperblock (Cont)

 Loops whose speedup doesn't make up for the overhead are rejected, based on profiling and execution-time estimates
 Exceptional exit cases: execution continues on the processor; these occur only a small fraction of the time

Optimization Techs.

 Speculative loads: crucial for pipelining
 Pipelining: loop-carried dependencies, simultaneous memory accesses
 Memory queues: 3 memory queues; buffering, with reading ahead and writing behind; non-cache-allocating
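A minimal software model of the streaming idea behind the memory queues, assuming only what the slide states (a read stream that buffers ahead of the consumer and a write stream that buffers behind the producer). The struct layout and function names are ours; Garp's hardware queues are not a C API.

```c
#include <stdint.h>

/* Toy model of a sequential read queue: hands out consecutive
 * elements, as hardware that reads ahead of the consumer would. */
typedef struct {
    const uint32_t *base;
    int next;               /* next element to hand out */
} ReadQueue;

static uint32_t rq_pop(ReadQueue *q) { return q->base[q->next++]; }

/* Toy model of a sequential write queue: buffers stores behind
 * the producer and drains them to memory in order. */
typedef struct {
    uint32_t *base;
    int next;               /* next slot to fill */
} WriteQueue;

static void wq_push(WriteQueue *q, uint32_t v) { q->base[q->next++] = v; }

/* A kernel consuming one stream and producing another, one
 * element per iteration, as a pipelined loop on the array would. */
void scale_stream(const uint32_t *in, uint32_t *out, int n) {
    ReadQueue rq = {in, 0};
    WriteQueue wq = {out, 0};
    for (int i = 0; i < n; i++)
        wq_push(&wq, rq_pop(&rq) * 2);
}
```

Because the queues impose sequential access, the loop body never computes addresses, which is part of why such streams pipeline well.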

Configuration Synthesis

 Module mapping: groups of nodes in the DFG are mapped to compound modules in the configuration, minimizing its size and critical path
 Placement: connected modules are placed close to one another
 Generating the bit-stream file

Simulation Results

 32-row array
 Adapted UltraSPARC processor
 Cycle-accurate simulator
 Models cache misses and interlocks

Wavelet image compression

Gzip compression

 Gzip has irregular memory accesses, which reduce parallelism and prevent pipelining
 Each loop executes for only a few cycles, so the overhead is more significant
 The overhead negates the benefit

Compilation time & Code expansion

 Compilation time: typically much less than double that of compiling for software only
 Code size: typically increases by 10 to 50 percent (wavelet benchmark: 16 percent)

Garp vs. UltraSPARC

 UltraSPARC: a four-way superscalar at 167 MHz
 Garp: implemented using the same VLSI process, at 133 MHz
 Wavelet: Garp is 68% faster than the UltraSPARC
 Gzip: the UltraSPARC is 14% faster than Garp

Garp vs. UltraSPARC (Cont.)

 With hand-coded functions, Garp has great potential

Future

 More experiments over a broader range of benchmarks
 Development of new optimizations
 Identify the strengths and weaknesses of the Garp architecture

Summary

 The Garp architecture: processor + coprocessor; configuration cache; memory queues; high-bandwidth, low-latency data access
 A synthesizing compiler for Garp

The End Thank you!

Any feedback will be appreciated [email protected]

http://www.comp.nus.edu.sg/~liaojiro