Efficient Performance Scaling of Future CGRAs for Mobile Applications

Yongjun Park, Jason Jong Kyu Park , and Scott Mahlke

December 11, 2012, University of Michigan, Ann Arbor

University of Michigan Electrical Engineering and Computer Science

Convergence of Functionalities

Flexible Accelerator!

[Figure: Anatomy of an iPhone 4. Labeled functions: 4G wireless, audio, video, 3D, navigation.]

Convergence of functionalities demands a flexible solution, because of design cost and the need for programmability.

CGRA: Attractive Alternative to ASICs

• Array of PEs connected by a mesh-like interconnect
• High throughput from a large number of resources
• Distributed hardware offers low cost and low power consumption
• High flexibility through dynamic reconfiguration

Bridging the Gap Between Market Demand and Computation Power

[Figure: CPU, Audio, and Video series plotted by year, 2009 to 2015; vertical axis from 0 to 2000]

How can performance scale while retaining energy efficiency?

[Canali, Internet Computing Magazine, IEEE, 2009]

Agenda: Scaling the Energy Efficiency of CGRAs

• Investigate the key factors and their feasibility in terms of performance and power efficiency
  – Hardware scalability vs. hardware flexibility
    • Interconnection topology
    • Complex PE vs. simple PE
    • Vector memory operation support
    • Homogeneity vs. heterogeneity

Experimental Setup

• Target applications
  – Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
  – Game physics benchmarks: line of sight, convolution, and conjugate
• Target architecture: various types of CGRAs
  – 16 to 64 heterogeneous or homogeneous resources
• IMPACT frontend compiler + edge-centric modulo scheduler
• Power measurement
  – IBM 65 nm technology at 200 MHz / 1 V

Q1: Interconnection Topology

• Overview
  – Routing overhead limits performance as the size of the CGRA increases
  – Common solution: clustering
  – What is the optimal interconnection topology?
• Methodology
  – Compare the performance of three different clustering schemes
    • Baseline
    • Fixed partition: the CGRA is physically split into multiple partitions
    • Flexible partition: the number of partitions can be changed dynamically from 1 to 8
  – Total number of PEs: 4 to 128
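The flexible-partition scheme can be pictured as a small enumeration. The sketch below is illustrative only: the function name `partition_options` and the restriction to partition counts that divide the array evenly are assumptions, not details from the slides. For a given array size, it lists each allowed partition count together with the number of PEs per partition.

```python
def partition_options(total_pes, max_partitions=8):
    """Return (partition_count, pes_per_partition) pairs for a flexible CGRA.

    Assumption (not from the slides): only partition counts that divide
    the array evenly are considered.
    """
    options = []
    for count in range(1, max_partitions + 1):
        if total_pes % count == 0:
            options.append((count, total_pes // count))
    return options


if __name__ == "__main__":
    # Print the candidate configurations for a few array sizes.
    for size in (4, 16, 64):
        print(size, partition_options(size))
```

A count of 1 corresponds to the unpartitioned array; larger counts split the same PEs into smaller, equally sized groups.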

Q1: Interconnection Topology

[Figure: application loops (no-DLP loops and DLP loops) mapped onto the baseline, fixed-partition, and flexible-mapping schemes]

Performance Comparison (Base, Fixed, Flex)

[Figure: performance of the Base, Fixed, and Flex schemes versus architecture size (4, 8, 16, 32, 64, 128 PEs); panels: Media and Game]

• Fixed partitioning does not always perform better than the baseline.

• Flexible architectures show the best performance and retain scalability.

Q2: Complex PEs vs. Simple PEs

• Overview
  – CGRAs with complex PEs have been introduced
    • Two-level interconnect
    • The number of RFs can decrease
    • Multiple instructions can be chained
  – Challenge: resource utilization
  – Goal: determine the feasibility of complex PEs in terms of energy consumption
• Methodology
  – Compare the energy consumption of different PE styles
    • Number of FUs inside a PE: 1 to 6
    • Uniform vs. optimized

PE Designs

[Figure: PE designs. Legend: simple integer ALU, simple integer + complex ALU, register file.]

Energy Consumption

[Figure: energy consumption versus number of FUs per PE (1 to 6); series: Media uniform, Game uniform, Media optimized, Game optimized; a 1.5x energy level is marked]

• Energy consumption does not increase dramatically as the number of FUs per PE grows
• Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be suitable solutions

Q3: SIMD Memory Support

• Overview
  – SIMD memory support lowers power and reduces the number of instructions
  – Challenge: degree of DLP
  – Goal: determine the feasibility of SIMD memory access in terms of energy consumption
• Methodology
  – Compare the energy consumption at different SIMD widths: 1 to 16

Relative Energy Consumption

[Figure: relative power per access, relative number of accesses, and relative total energy versus vector width (1, 2, 4, 8, 16)]

• Total energy consumption at wider vector widths can be similar to that of a scalar memory unit
  – A high degree of spatial locality can compensate for the power overhead
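The trade-off plotted here can be written as a product: relative total energy is relative power per access times relative number of accesses. The sketch below assumes that relationship, and further assumes that a fraction `coalesced` of the scalar accesses can be merged into full-width vector accesses. The function names and every numeric input are hypothetical placeholders, not measurements from the talk.

```python
def relative_accesses(width, coalesced):
    """Relative number of memory accesses at a given vector width.

    Assumption: a fraction `coalesced` (0.0 to 1.0) of the scalar accesses
    merge into full-width vector accesses; the rest remain single accesses.
    """
    if width < 1 or not 0.0 <= coalesced <= 1.0:
        raise ValueError("width must be >= 1 and coalesced within [0, 1]")
    return (1.0 - coalesced) + coalesced / width


def relative_total_energy(rel_power_per_access, rel_accesses):
    """Relative total energy, modeled as per-access cost times access count.

    Simplification: one per-access cost is applied to every access.
    """
    return rel_power_per_access * rel_accesses


if __name__ == "__main__":
    # Hypothetical inputs: width 8, perfect coalescing, 8x power per access.
    accesses = relative_accesses(width=8, coalesced=1.0)
    print(accesses, relative_total_energy(8.0, accesses))
```

In this model a vector memory unit matches a scalar one whenever the per-access power grows no faster than the access count shrinks, which is the compensation that high spatial locality provides.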

• Flexible partitioning should be supported to further improve performance.

• Complex PEs can be more energy efficient even at low resource utilization.

• Wide SIMD memory support can be realistic given the characteristics of mobile applications.

Questions?

For more information: http://cccp.eecs.umich.edu

Q1: Homogeneity vs. Heterogeneity

• Overview
  – Heterogeneous CGRAs are common
  – No experiments have measured the effect of heterogeneity relative to homogeneity
• Methodology
  – Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)
  – Decrease the number of PEs that support the complex ALU and the memory unit
  – Performance goal: 80% of the performance of the homogeneous CGRA

How about performance?

Performance Degradation

[Figure: two panels, Media and Game; vertical axis from 0 to 1]

• The performance degradation is not substantial
  – Performance is normally not constrained by the complex instructions
• Performance degradation depends much more on memory operations
• For 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%
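As a worked instance of the 75% figure, the sketch below counts the units left in the 16-PE starting point after the reduction. It is illustrative arithmetic only: the function name `heterogeneous_config`, the assumption that every PE keeps its integer ALU, and the assumption that complex ALUs and memory units shrink by the same fraction are not taken from the slides.

```python
def heterogeneous_config(total_pes=16, reduction=0.75):
    """Count PEs per capability after removing complex ALUs and memory units.

    Assumptions: every PE keeps its integer ALU, and `reduction` is the
    fraction of complex ALUs and memory units removed from the homogeneous
    starting point (the same fraction for both).
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be within [0, 1]")
    kept = round(total_pes * (1.0 - reduction))
    return {"integer_alu": total_pes, "complex_alu": kept, "memory_unit": kept}


if __name__ == "__main__":
    print(heterogeneous_config())
```

With the defaults, 4 of the 16 PEs retain a complex ALU and 4 retain a memory unit.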

• Heterogeneous FU organization is highly effective.

• Flexible partitioning should be supported to further improve performance.

• Complex PEs can be more energy efficient even at low resource utilization.

• Wide SIMD memory support can be realistic given the characteristics of mobile applications.

CGRA: Attractive Alternative to ASICs

• Suitable for running multimedia applications for future embedded systems
• High throughput, low power consumption, high flexibility
• Example CGRAs
  – Morphosys: 8x8 array with RISC processor
  – SiliconHive: hierarchical systolic array
  – ADRES: 4x4 array with tightly coupled VLIW
• Reported figures
  – Viterbi at 80 Mbps
  – H.264 at 30 fps
  – 50-60 MOps/mW
