John Cavazos Institute for Computing Systems Architecture

Lecture 3
Laws, Equality, and
Inside a Cell
John Cavazos
Dept of Computer & Information Sciences
University of Delaware
www.cis.udel.edu/~cavazos/cisc879
CISC 879 : Software Support for Multicore Architectures
Lecture 2: Overview
• Know the Laws
• All are NOT Created Equal
• Inside a Cell
Two Important Laws
• Amdahl’s Law
  • Gene Amdahl’s observation in 1967
  • Speedup is limited by serial portions
  • Assumes fixed workloads and fixed problem size
• Gustafson’s Law
  • John Gustafson’s observation in 1988
  • Rescues parallel processing from Amdahl’s Law
  • Proposes fixed time and increasing work
  • Sequential portions have diminishing effect
Amdahl’s Law
Parallelize parts 2 and 4 with 2 processors
[Figure: five parts of 100 time units each, run in sequence (500 units total). Parts 1, 3, and 5 stay sequential; parts 2 and 4 each drop to 50 units on 2 processors (400 units total).]
Speedup: 25%
Amdahl’s Law (cont’d)
Parallelize parts 2 and 4 with 4 processors
[Figure: the same five 100-unit parts (500 units total). Parts 2 and 4 drop from 100 to 50 to 25 units each on 4 processors (350 units total).]
Speedup: about 43% (500/350 ≈ 1.43)
Amdahl’s Law (cont’d)
Parallelize parts 2 and 4 with infinite processors
[Figure: the same five 100-unit parts (500 units total). Parts 2 and 4 shrink from 100 to 50 to 25 and finally to 0 units, leaving only the three sequential parts (300 units total).]
Speedup: only about 67% (500/300 ≈ 1.67)
Multicore doesn’t look very appealing!
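The arithmetic behind the three Amdahl slides can be checked with a short sketch. This is an illustration only: the function names are hypothetical, and it assumes each parallelized part divides perfectly evenly across processors with no overhead. The percentages quoted on the slides are rounded versions of these ratios.

```python
def amdahl_time(parts, parallel_idx, processors):
    """Total run time when the parts listed in parallel_idx are split
    evenly across `processors` and all other parts stay sequential."""
    total = 0.0
    for i, t in enumerate(parts):
        total += t / processors if i in parallel_idx else t
    return total


def amdahl_speedup(parts, parallel_idx, processors):
    """Original (one-processor) time divided by the parallelized time."""
    return sum(parts) / amdahl_time(parts, parallel_idx, processors)


# Five parts of 100 time units each; parts 2 and 4 (zero-based 1 and 3)
# are the only ones that can be parallelized.
PARTS = [100, 100, 100, 100, 100]
PARALLEL = {1, 3}

if __name__ == "__main__":
    for p in (2, 4, float("inf")):
        s = amdahl_speedup(PARTS, PARALLEL, p)
        print(f"{p} processors: {s:.2f}x ({(s - 1) * 100:.0f}% faster)")
```

Even with unlimited processors the three sequential parts still cost 300 units, so the ratio never exceeds 500/300.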
Gustafson’s Law (cont’d)
Boxes contain units of work now!
[Figure: five parts, each taking 100 time units. Parts 1, 3, and 5 are sequential and do 100 units of work each; parts 2 and 4 each complete 200 units of work in the same 100 units of time.]
500 units of time, but 700 units of work!
Speedup: 40%
Gustafson’s Law (cont’d)
Boxes contain units of work now!
[Figure: five parts, each taking 100 time units. Parts 1, 3, and 5 are sequential and do 100 units of work each; parts 2 and 4 grow from 100 to 200 to 400 units of work each in the same 100 units of time.]
500 units of time, but 1100 units of work!
Speedup: 120% (1100/500 = 2.2x)
Gustafson’s Law (cont’d)
• Gustafson made an important observation
  • As processors grow, people scale problem size
  • Serial bottlenecks do not grow with problem size
• Increasing processors gives linear speedup
  • 20 processors roughly twice as fast as 10
• This is why supercomputers are successful
  • More processors allow increased dataset size
Reference: http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
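The Gustafson slides can be reproduced the same way. The sketch below is illustrative: the function names are hypothetical, and it assumes the parallel parts grow in exact proportion to the processor count while the sequential parts stay fixed. It also evaluates the usual closed form S(N) = s + (1 - s)·N, where s is the serial fraction of the run time (0.6 in the five-part example).

```python
def scaled_work(parts, parallel_idx, processors):
    """Work finished in the original time budget when the parallel parts
    grow in proportion to the processor count (fixed time, more work)."""
    return sum(w * processors if i in parallel_idx else w
               for i, w in enumerate(parts))


def scaled_speedup(parts, parallel_idx, processors):
    """Work done with N processors divided by work done with one,
    both measured over the same amount of time."""
    return scaled_work(parts, parallel_idx, processors) / sum(parts)


def gustafson_closed_form(serial_fraction, processors):
    """Closed form of the scaled speedup: S(N) = s + (1 - s) * N."""
    return serial_fraction + (1.0 - serial_fraction) * processors


# Same five-part example: parts 2 and 4 (zero-based 1 and 3) are parallel.
PARTS = [100, 100, 100, 100, 100]
PARALLEL = {1, 3}

if __name__ == "__main__":
    for n in (2, 4, 10, 20):
        print(f"{n} processors: {scaled_work(PARTS, PARALLEL, n)} units of work, "
              f"{scaled_speedup(PARTS, PARALLEL, n):.2f}x")
```

Because the scaled speedup grows linearly in N, doubling the processor count comes close to doubling the speedup once the parallel term dominates.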
Lecture 2: Overview
• Know the Laws
• All are NOT Created Equal
• Inside a Cell
All Multicores Not Equal
• Multicore CPUs and GPUs are very different!
• CPUs run general-purpose programs well
• GPUs run graphics (or similar programs) well
• General-purpose programs have
  • Less parallelism
  • More complex control requirements
• GPU programs are
  • Highly parallel
  • Arithmetic intensive
  • Simple in their control requirements
Floating-Point Operations
[Figure: 32-bit FP operations per second]
GPUs have more computational units and take better advantage of them.
Slide Source: NVIDIA CUDA Programming Guide 1.1
CPUs versus GPUs
CPUs devote lots of area to control and storage.
GPUs devote most area to computational units.
Slide Source: NVIDIA CUDA Programming Guide 1.1
CPU Programming Model
• Scalar programming model
  • No native data parallelism
• Few arithmetic units
  • Very small area
• Optimized for complex control
• Optimized for low latency, not high bandwidth
Slide Source: John Owens, EEC 227 Graphics Arch course
AMD K7 “Deerhound”
Slide Source: John Owens, EEC 227 Graphics Arch course
GPU Programming Model
• Streams
  • Collections of data records
  • Amenable to data parallelism
• Kernels
  • Inputs/outputs are streams
  • Perform computation on each element of a stream
  • No dependencies between stream elements
• Stream storage
  • Not cache (input read once/output written once)
  • Producer-consumer locality
Slide Source: John Owens (EEC 227 Graphics Arch) and Pat Hanrahan (Stream Prog. Env., GP^2 Workshop)
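The stream/kernel idea can be sketched on a CPU. This is only an analogy, not GPU code: the names saxpy_kernel and run_kernel are hypothetical, and a thread pool stands in for the GPU's many arithmetic units. The point it illustrates is that a kernel sees one record at a time and never reads another element's result, so elements can be processed in any order or all at once.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(record):
    """Kernel: computes a*x + y for one stream record.
    It depends only on its own input record."""
    a, x, y = record
    return a * x + y


def run_kernel(kernel, stream, workers=4):
    """Apply `kernel` to every record of the input stream and return the
    output stream. Each input is read once and each output written once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps output order aligned with input order even though
        # the records may be processed concurrently.
        return list(pool.map(kernel, stream))


if __name__ == "__main__":
    input_stream = [(2.0, float(i), 1.0) for i in range(8)]
    print(run_kernel(saxpy_kernel, input_stream))
```

Because no record depends on another, raising `workers` changes only how the work is scheduled, not the result.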
Lecture 2: Overview
• Know the Laws
• All are NOT Created Equal
• Inside a Cell
Cell B.E. Design Goals
• An accelerator extension to Power
  • Exploits parallelism and achieves high frequency
  • Sustains high memory bandwidth through DMA
• Designed for flexibility
  • Heterogeneous architecture
    • PPU for control, general-purpose
    • SPU for computation-intensive, little control
  • Applicable to a wide variety of applications
The Cell Architecture has characteristics of both a CPU and GPU.
Cell Chip Highlights
• 241M Transistors
• 9 cores, 10 threads
• >200 GFlops (SP)
• >20 GFlops (DP)
• >300 GB/s EIB
• 3.2 GHz shipping
• Top freq. 4.0 GHz (in lab)
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Cell Details
• Heterogeneous multicore architecture
  • Power Processor Element (PPE) for control tasks
  • Synergistic Processor Element (SPE) for data-intensive processing
• SPE Features
  • No cache
  • Large unified register file
  • Synergistic Memory Flow Control (MFC)
    • Interface to high-perf. EIB
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Cell PPE Details
• Power Processor Element (PPE)
  • General Purpose 64-bit PowerPC RISC processor
  • 2-way hardware threaded
  • L1 32KB I; 32KB D
  • L2 512 KB
  • For operating systems and program control
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Cell SPE Details
• Synergistic Processor Element (SPE)
  • 128-bit SIMD architecture
  • Dual Issue
  • Register File 128x128-bit
  • Local Store (256KB)
  • Simplified Branch Arch.
    • No hardware BR predictor
    • Compiler-managed hint
  • Memory Flow Controller
    • Dedicated DMA engine - Up to 16 outstanding requests
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
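The ">200 GFlops (SP)" figure from the chip highlights can be roughly reconstructed from the SPE details. The sketch below is a back-of-the-envelope estimate, not a vendor formula: it assumes the 9 cores are 1 PPE plus 8 SPEs, that a 128-bit SIMD register holds four 32-bit floats, and (an assumption not stated on the slides) that each lane can retire one fused multiply-add, counted as 2 floating-point operations, per cycle. The function name is hypothetical.

```python
def peak_sp_gflops(spes=8, simd_bits=128, float_bits=32,
                   flops_per_lane_per_cycle=2, clock_ghz=3.2):
    """Rough single-precision peak for the SPEs alone.

    flops_per_lane_per_cycle=2 assumes one fused multiply-add per lane
    per cycle (an assumption, not a number taken from the slides).
    """
    lanes = simd_bits // float_bits  # 128 / 32 = 4 floats per register
    return spes * lanes * flops_per_lane_per_cycle * clock_ghz


if __name__ == "__main__":
    print(f"3.2 GHz: {peak_sp_gflops():.1f} GFlops")
    print(f"4.0 GHz: {peak_sp_gflops(clock_ghz=4.0):.1f} GFlops")
```

Under these assumptions the 3.2 GHz part comes out at 8 × 4 × 2 × 3.2 = 204.8 GFlops, consistent with the ">200" on the slide.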
Compiler Tools
• GNU-based C/C++ compiler (Sony)
  • ppu-gcc/ppu-g++ - generates PPU code
  • spu-gcc/spu-g++ - generates SPU code
• GDB debugger
  • Supports both PPU and SPU debugging
  • Different modes of execution
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Compiler Tools
• The XL C/C++ compiler
  • ppuxlc/ppuxlc++ - generates PPU code
  • spuxlc/spuxlc++ - generates SPU code
• Includes the following optimization levels
  • -O0: almost no optimization
  • -O2: strong, low-level optimization
  • -O3: intense, low-level opts with basic loop opts
  • -O4: all of -O3 and detailed loop analysis and good whole-program analysis
  • -O5: all of -O4 and detailed whole-program analysis
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Performance Tools
• GNU-based tools
  • OProfile - system-level profiler (only PPU)
  • gprof - generates call graphs
• IBM Tools
  • Static analysis tool (spu_timing)
    • Annotates assembly file with scheduling and instruction issue estimates
  • Dynamic analysis tool (CellBE system simulator)
    • Can run your code on an x86 machine
    • Can collect a variety of statistics
Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Compiling with the SDK
• README_build_env.txt (IMPORTANT: you should read this!)
  • Provides details on the build environment features, including files, structure and variables.
• make.footer
  • Specifies all of the build rules needed to properly build binaries
  • Must be included in all SDK Makefiles (referenced relatively if $CELL_TOP is not defined)
  • Includes make.header
• make.header
  • Specifies definitions needed to process the Makefiles
  • Includes make.env
• make.env
  • Specifies the default compilers and tools to be used by make
• make.footer and make.header should not be modified
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Compiling with the SDK
• Defaults to gcc
  • Set in make.env with three variables set to gcc or xlc
    • PPU32_COMPILER
    • PPU64_COMPILER
    • PPU_COMPILER [overrides PPU32_COMPILER and PPU64_COMPILER]
    • SPU_COMPILER
• Can change from the command line
  • PPU_COMPILER=xlc SPU_COMPILER=xlc make
  • make -e PPU64_COMPILER:=gcc -e PPU32_COMPILER:=gcc -e SPU_COMPILER:=gcc
  • export PPU_COMPILER=xlc SPU_COMPILER=xlc ; make
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
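The precedence among the compiler variables can be modeled in a few lines. This is a toy model of the rules as stated on the slide (everything defaults to gcc, and PPU_COMPILER overrides both PPU32_COMPILER and PPU64_COMPILER); it is not the SDK's actual make logic, and resolve_compilers is a hypothetical helper.

```python
def resolve_compilers(env):
    """Return the compiler chosen for each target, given a dict of
    variable settings such as those passed on the make command line.

    Toy model only: mirrors the rules described on the slide, not the
    real contents of make.env.
    """
    default = "gcc"
    ppu32 = env.get("PPU32_COMPILER", default)
    ppu64 = env.get("PPU64_COMPILER", default)
    # PPU_COMPILER, when set, overrides both PPU-specific variables.
    if "PPU_COMPILER" in env:
        ppu32 = ppu64 = env["PPU_COMPILER"]
    spu = env.get("SPU_COMPILER", default)
    return {"ppu32": ppu32, "ppu64": ppu64, "spu": spu}


if __name__ == "__main__":
    print(resolve_compilers({}))
    print(resolve_compilers({"PPU_COMPILER": "xlc", "SPU_COMPILER": "xlc"}))
    print(resolve_compilers({"PPU32_COMPILER": "xlc", "PPU_COMPILER": "gcc"}))
```

In the last call the PPU32_COMPILER setting is ignored because PPU_COMPILER takes priority.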
Compiling with the SDK
• Use CELL_TOP or maintain relative directory structure
ifdef CELL_TOP
  include $(CELL_TOP)/make.footer
else
  include ../../../make.footer
endif
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Makefile variables
• DIRS
  • List of subdirectories to build first
• PROGRAM_ppu PROGRAMS_ppu
  • 32-bit PPU program (or list of programs) to build.
• PROGRAM_ppu64 PROGRAMS_ppu64
  • 64-bit PPU program (or list of programs) to build.
• PROGRAM_spu PROGRAMS_spu
  • SPU program (or list of programs) to build.
  • If written as a standalone binary, can run without being embedded in a PPU program.
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Makefile variables (cont’d)
• LIBRARY_embed LIBRARY_embed64
  • Creates a linked library from an SPU program to be embedded into a 32-bit or 64-bit PPU program.
• CC_OPT_LEVEL
  • Optimization level for compiler to use
• CFLAGS, CFLAGS_gcc, CFLAGS_xlc
  • Additional flags for compiler to use (general or specific to gcc/xlc)
• TARGET_INSTALL_DIR
  • Specifies where built targets are installed
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Sample Project
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Next Time
• Chapters 1-3
  • NVIDIA CUDA Programming Guide version 1.1
• And all of
  • Chapter 29 from GPU Gems 2
• Links on website