
A Survey of the Current State of the Art in SIMD:

Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel?

Wojtek Rajski, Nels Oscar, David Burri, Alex Diede

Introduction

• We have seen how to improve performance through the exploitation of:
  o Instruction-level parallelism
  o Thread-level parallelism
• One form of parallelism we have not yet discussed: data-level parallelism

Introduction

• Flynn's Taxonomy
  o An organization of computer architectures based on their instruction and data streams
  o Divides all architectures into 4 categories:
    1. SISD
    2. SIMD
    3. MISD
    4. MIMD

Introduction

• Implementations of SIMD
  o Prevalent in GPUs
  o SIMD extensions in CPUs
  o Embedded systems and mobile platforms

Introduction

• Software for SIMD
  o Many libraries utilize and encapsulate SIMD
  o Adopted in these areas:
     Graphics
     Signal processing
     Video encoding/decoding
     Some scientific applications

Introduction

• SIMD implementations fall into three high-level categories:
  1. Vector processors
  2. Multimedia extensions
  3. Graphics processors

Introduction

• Going forward:
  o Streaming SIMD Extensions (MMX/SSE/AVX)
     Similar technology in GPUs
  o Compiler techniques for DLP
  o Problems in the world of SIMD

Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers.

This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years.

Copyright © 2011, Elsevier Inc.


SIMD in Hardware

• Register size / hardware changes
• Intel Core i7 example
• The 'Roofline' model
• Limitations of streaming extensions in a CPU

SIMD in Hardware

• Streaming SIMD requires some basic components:
  o Wide registers
     Rather than 32 bits, have 64-, 128-, or 256-bit-wide registers.
  o Additional control lines
  o Additional ALUs to handle simultaneous operation on operands up to 16 bytes wide
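As a concrete illustration of these wide registers in use, here is a minimal sketch in C, assuming x86 SSE intrinsics (the function and array names are hypothetical, not from the slides):

  #include <xmmintrin.h>  /* SSE: 128-bit registers, 4 floats at a time */

  void add4(float *c, const float *a, const float *b) {
      __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register */
      __m128 vb = _mm_loadu_ps(b);
      __m128 vc = _mm_add_ps(va, vb);   /* one instruction, 4 simultaneous adds */
      _mm_storeu_ps(c, vc);             /* store all 4 results */
  }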

Hardware

Figure 4.4 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B.

The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector processor (b) on the right has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group.

Intel i7

• The Intel Core i7
  o Superscalar processor
  o Contains several SIMD extensions
     16 × 256-bit-wide registers, plus physical registers in the pipeline
     Support for 2- and 3-operand instructions
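To make the register width concrete, a minimal AVX sketch in C (assuming the immintrin.h intrinsics; array names are hypothetical). The 3-operand form means the destination does not have to overwrite a source:

  #include <immintrin.h>  /* AVX: 256-bit YMM registers, 8 floats at a time */

  void add8(float *c, const float *a, const float *b) {
      __m256 va = _mm256_loadu_ps(a);       /* 8 floats per 256-bit register */
      __m256 vb = _mm256_loadu_ps(b);
      __m256 vc = _mm256_add_ps(va, vb);    /* 3-operand: vc = va + vb, sources preserved */
      _mm256_storeu_ps(c, vc);
  }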

The Roofline Model of Performance

• The Roofline model of performance aggregates into a single plot:
  o Floating-point performance
  o Operational intensity
  o Memory bandwidth
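The model caps attainable throughput with a single bound (the standard formulation, spelled out here for reference; it is implied but not written on the slide):

\[
\text{Attainable GFLOPs/sec} = \min\left(\text{Peak Memory Bandwidth} \times \text{Operational Intensity},\ \text{Peak Floating-Point Performance}\right)
\]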

The Roofline Model of Performance

[Roofline plots for the AMD Opteron X2]


Limitations

• Memory latency
• Memory bandwidth
• The actual amount of vectorizable code


SIMD at the software level

• SIMD is not a new field.
• But more focus has been brought to it by the GPGPU movement.

SIMD at the software level

• CUDA
  o Developed by Nvidia
  o Compute Unified Device Architecture
  o Limited to GPUs with Nvidia chips
  o Graphics cards G8x and newer
  o Provides both high- and low-level APIs

SIMD at the software level

• OpenCL
  o Developed by Apple
  o Open to any vendor that decides to support it
  o Designed to execute across GPUs and CPUs
  o Graphics cards G8x and newer
  o Provides both high- and low-level APIs

SIMD at the software level

• DirectCompute
  o Developed by Microsoft
  o Open to any vendor that supports DirectX 11
  o Windows only
  o Graphics cards GTX 400 and HD 5000
  o Intel's Ivy Bridge will also be supported


Compiler Optimization

• Not everyone programs in SIMD-based languages.
• Languages such as C and Java were never designed with SIMD in mind.
• Compiler technology had to improve to detect vectorizable code.

Compiler Optimization

• Before optimization can begin:
  o Data dependencies have to be understood
  o But only dependencies within the vector window size matter
  o Vector window size: the amount of data executed in parallel by one SIMD instruction

Compiler Optimization

• Before optimization can begin
• Example:

Original loop:

  for( int i = 0; i < 16; i++ ){
      C[i] = c[i+1];
      C[i] = c[i+16];
  }

Unrolled by the vector window (4):

  for( int i = 0; i < 16; i+=4 ){
      C[i]   = c[i+1];
      C[i+1] = c[i+2];    (Wrong)
      C[i+2] = c[i+3];    (Wrong)
      C[i+3] = c[i+4];    (Wrong)
      C[i]   = c[i+16];
      C[i+1] = c[i+17];
      C[i+2] = c[i+18];
      C[i+3] = c[i+19];
  }

The c[i+1] accesses carry a dependence distance of 1, inside the vector window of 4, so packing them into a single vector instruction is wrong; the c[i+16] accesses have distance 16, outside the window, and are safe to vectorize.

Compiler Optimization

• Framework for vectorization
  o Prelude
  o Loop
  o Postlude
  o Cleanup

Compiler Optimization

• Framework for vectorization
  o Prelude
     Loop-independent variables are prepared for use.
     Run-time checks that vectorization is possible.
  o Loop
     Vectorizable instructions are performed in the order of the original code.
     The loop could be split into multiple loops.
     Vectorizable sections could be separated by more complex code in the original loop.

Compiler Optimization

• Framework for vectorization
  o Postlude
     All loop-independent variables are returned.
  o Cleanup (sketched below)
     Non-vectorizable iterations of the loop are run.
     These include the remaining vectorizable iterations that do not fit evenly into the vector size.
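A minimal C sketch of the loop and cleanup stages, assuming x86 SSE intrinsics and a vector window of 4 floats (the vec_* calls on the following slides are pseudocode; function and array names here are hypothetical):

  #include <xmmintrin.h>

  void vec_add(float *C, const float *A, const float *B, int n) {
      int i = 0;
      /* Loop: vectorized iterations while a full window of 4 remains */
      for (; i + 4 <= n; i += 4) {
          __m128 va = _mm_loadu_ps(&A[i]);
          __m128 vb = _mm_loadu_ps(&B[i]);
          _mm_storeu_ps(&C[i], _mm_add_ps(va, vb));
      }
      /* Cleanup: scalar iterations for the remainder that does not fill a window */
      for (; i < n; i++)
          C[i] = A[i] + B[i];
  }

The prelude would sit before the first loop (e.g., run-time overlap checks on A, B, and C), and the postlude would copy any loop-carried scalars back out.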

Compiler Optimization

• Compiler techniques
  o Loop-level automatic vectorization
  o Basic-block-level automatic vectorization
  o Vectorization in the presence of control flow

Compiler Optimization

• Loop-Level Automatic Vectorization
  1. Find the innermost loop that can be vectorized.
  2. Transform the loop and create vector instructions.

Original code:

  for (i = 0; i < 1024; i+=1)
      C[i] = A[i]*B[i];

Vectorized code:

  for (i = 0; i < 1024; i+=4){
      vA = vec_ld( A[i] );
      vB = vec_ld( B[i] );
      vC = vec_mul( vA, vB );
      vec_st( vC, C[i] );
  }
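The vec_* calls above are pseudocode. A compilable equivalent in C, assuming SSE intrinsics and single-precision arrays of length 1024 (the function name is hypothetical):

  #include <xmmintrin.h>

  void mul1024(float *C, const float *A, const float *B) {
      /* 1024 is a multiple of 4, so no cleanup loop is needed */
      for (int i = 0; i < 1024; i += 4) {
          __m128 vA = _mm_loadu_ps(&A[i]);   /* vec_ld  */
          __m128 vB = _mm_loadu_ps(&B[i]);
          __m128 vC = _mm_mul_ps(vA, vB);    /* vec_mul */
          _mm_storeu_ps(&C[i], vC);          /* vec_st  */
      }
  }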

Compiler Optimization

• Basic-Block-Level Automatic Vectorization
  1. The innermost loop is unrolled by the size of the vector window.
  2. Isomorphic scalar instructions are packed into vector instructions.

Original code:

  for (i = 0; i < 1024; i+=1)
      C[i] = A[i]*B[i];

Unrolled code (step 1, before packing):

  for (i = 0; i < 1024; i+=4){
      C[i]   = A[i]*B[i];
      C[i+1] = A[i+1]*B[i+1];
      C[i+2] = A[i+2]*B[i+2];
      C[i+3] = A[i+3]*B[i+3];
  }

After step 2 packs these four isomorphic multiplies, the result is the same vector loop shown on the previous slide.

Compiler Optimization

• In the presence of control flow:
  1. Apply predication
  2. Apply method from above
  3. Remove vector predication
  4. Remove scalar predication

Original code:

  for (i = 0; i < 1024; i+=1){
      if (A[i] > 0)
          C[i] = B[i];
      else
          D[i] = D[i-1];
  }

After predication:

  for (i = 0; i < 1024; i+=1){
      P  = A[i] > 0;
      NP = !P;
      C[i] = B[i];      (P)
      D[i] = D[i-1];    (NP)
  }

Compiler Optimization

• In the presence of control flow

After vectorization:

  for (i = 0; i < 1024; i+=4){
      vP  = A[i:i+3] > (0,0,0,0);
      vNP = vec_not(vP);
      C[i:i+3] = B[i:i+3];    (vP)
      (NP1,NP2,NP3,NP4) = vNP;
      D[i+3] = D[i+2];        (NP4)
      D[i+2] = D[i+1];        (NP3)
      D[i+1] = D[i];          (NP2)
      D[i]   = D[i-1];        (NP1)
  }

After removing predicates:

  for (i = 0; i < 1024; i+=4){
      vP  = A[i:i+3] > (0,0,0,0);
      vNP = vec_not(vP);
      C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
      (NP1,NP2,NP3,NP4) = vNP;
      if (NP4) D[i+3] = D[i+2];
      if (NP3) D[i+2] = D[i+1];
      if (NP2) D[i+1] = D[i];
      if (NP1) D[i]   = D[i-1];
  }
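The vec_sel pseudocode above is a masked select. A hedged sketch of the same idea in C with SSE intrinsics, using compare and bitwise blending (the function name is hypothetical):

  #include <xmmintrin.h>

  /* C[0..3] = (A > 0) ? B : C, with no branches */
  void select4(float *C, const float *A, const float *B) {
      __m128 vA = _mm_loadu_ps(A);
      __m128 vB = _mm_loadu_ps(B);
      __m128 vC = _mm_loadu_ps(C);
      __m128 vP = _mm_cmpgt_ps(vA, _mm_setzero_ps());  /* per-lane predicate mask */
      /* select: (vB AND vP) OR (vC AND NOT vP) */
      __m128 r  = _mm_or_ps(_mm_and_ps(vP, vB), _mm_andnot_ps(vP, vC));
      _mm_storeu_ps(C, r);
  }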


CPU vs GPU

• The GPU as we know it today was founded by Nvidia in 1999.
• Popularity has increased in recent years.

[Images: VisionTek GeForce 256 [Wikipedia]; Nvidia GeForce GTX 590 [Nvidia]]

CPU vs GPU

• Theoretical GFLOP/s & Bandwidth [Nvidia, NVIDIA CUDA C Programming Guide]

CPU vs GPU

• Intel Core i7 Nehalem Die Shot [NVIDIA’s Fermi: The First Complete GPU Computing Architecture]

CPU vs GPU

• Game: LittleBigPlanet [http://trendygamers.com]

CPU vs GPU

• OpenGL Graphics Pipeline [Wojtek Palubicki; http://pages.cpsc.ucalgary.ca/~wppalubi/]

CPU vs GPU

• CPU SIMD vs. GPU SIMD
  o Intel's Sandy Bridge architecture: 256-bit AVX operates on 8 operands per register in parallel
  o A CUDA multiprocessor: up to 512 raw mathematical operations in parallel

CPU vs GPU

• Nvidia’s Fermi [Source: http://www.legitreviews.com/article/1193/2/]

CPU vs GPU

• Nvidia’s Fermi [Nvidia; NVIDIA’s Next Generation CUDA Compute Architecture: Fermi]

Standardization Problems and Industry Challenges

[Widescreen Wallpapers; http://widescreen.dpiq.org/30__AMD_vs_Intel_Challenge.htm]

Standardization Problems and Industry Challenges

• 1998
  o AMD - 3DNow!
  o Intel - the SSE instruction set a few years later, without supporting 3DNow!
  o Intel won this battle since SSE was better

Standardization Problems and Industry Challenges

• 2001
  o Intel - Itanium processor (64-bit, parallel computing instruction set)
  o AMD - its own 64-bit instruction set (backward compatible)
  o AMD won this time because of its backward compatibility
• 2007
  o AMD - SSE5
  o Intel - AVX

Standardization Problems and Industry Challenges

• Example: fused multiply-add (FMA)
  o d = a + b * c
• AMD
  o Has supported FMA4 since 2011
  o FMA4 - 4-operand form
• Intel
  o Will support FMA3 in 2013 with Haswell
  o FMA3 - 3-operand form
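The difference is visible at the instruction level: the 3-operand form must overwrite one of its sources, while the 4-operand form encodes a separate destination. A hedged sketch using the FMA3 intrinsic in C (immintrin.h, compiled with FMA support; the function name is hypothetical):

  #include <immintrin.h>

  /* d = a + b * c on 8 floats at once.
     FMA3 encodes 3 registers, so the instruction overwrites one source;
     AMD's FMA4 encodes a 4th, separate destination register. */
  __m256 fma_example(__m256 a, __m256 b, __m256 c) {
      return _mm256_fmadd_ps(b, c, a);   /* (b * c) + a, fused into one rounding step */
  }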

Standardization Problems and Industry Challenges

• This causes:
  o More work for the programmer
  o Code that is nearly impossible to maintain

Standardization required!


Conclusion

• SIMD processors exploit data-level parallelism to increase performance.
• The hardware requirements are easily met as transistor sizes decrease.
• HPC languages have been created to give programmers access to high- and low-level SIMD operations.


Conclusion

• Compiler technology has improved to recognize some potential SIMD operations in serial code.
• The utility of SIMD instructions in modern microprocessors is diminishing, except in special-purpose applications, due to standardization problems and industry in-fighting.
• The increasing adoption of GPGPU computing has the potential to supplant SIMD-type instructions in the CPU.
• On-chip GPUs appear to be on the horizon, so wider really is better.