A. Sazegari
AltiVec Technical Lead
Introduction
• AltiVec™ is an extension to the PowerPC Instruction Set Architecture
• Designed to extend Apple’s leadership position in multimedia processing
AltiVec is a trademark of Motorola, Inc.
What You’ll Learn
• About the AltiVec Architecture
• Its performance potential
• AltiVec programming
AltiVec Technology
• Vector/SIMD technology
– Fixed-length vector operands (packed data)
– Single Instruction Multiple Data
– RISC-style instruction set
– Optimized for digital signal processing
• Elevates multimedia to first-class data type
AltiVec Architecture
• New Vector Register File:
– 32 new 128-bit wide registers
• New data-types:
– Packed byte, halfword, and word integers
– Packed IEEE single-precision floats
• Saturation Arithmetic capability
• 160 new PowerPC instructions
PowerPC Architecture
[Block diagram: Instruction Stream, Branch Unit, IU with GRF (32-bit), FPU with FPRF (64-bit), Memory]
AltiVec Architecture
[Block diagram: Instruction Stream, Branch Unit, IU with GRF (32-bit), FPU with FPRF (64-bit), Vector Unit with Vector Register File (128-bit), Memory]
Programming Model
• Separate Vector Register File
– More space for coefficients, variables, etc.
– More names for scheduling
– Wider for more parallelism
– No interference with FP or integer
[Diagram: register model. General Register File GPR0 to GPR31 (32 bits wide), Floating-Point Register File FPR0 to FPR31 (64 bits wide), Vector Register File VR0 to VR31 (128 bits wide), 32 registers each; status registers XER, FPSCR, VSCR; branch registers Cond, Count, Link; Time (two registers); VRSave]
Vector Data Types
One Vector (128 bits)
16 signed or unsigned integer bytes
8 signed or unsigned integer halfwords
4 signed or unsigned integer words
or
4 IEEE single-precision floating-point numbers
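The four views of one vector can be pictured with a small portable C sketch. The `vec128_t` union and the `VEC_LANES` macro are hypothetical names introduced here for illustration; they are not the AltiVec `vector` types themselves, only a stand-in that shows how 128 bits divide into elements.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit vector register:
   the same 16 bytes of storage viewed four different ways. */
typedef union {
    uint8_t  bytes[16];     /* 16 integer bytes                */
    uint16_t halfwords[8];  /*  8 integer halfwords            */
    uint32_t words[4];      /*  4 integer words                */
    float    floats[4];     /*  4 single-precision floats      */
} vec128_t;

/* Number of elements of a given type that fit in one vector. */
#define VEC_LANES(type) (sizeof(vec128_t) / sizeof(type))
```

For example, `VEC_LANES(uint16_t)` evaluates to 8, matching the "8 halfwords" line above.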
Simple SIMD Example
vaddshs T, A, B
T = vec_adds (A, B); // vector signed short T, A, B
[Diagram: the eight halfwords of VRA and VRB are added pairwise into VRT]
• 8 halfword additions in one instruction
• Saturation arithmetic (clamp to max or min on overflow)
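What the single instruction computes can be written out as a scalar loop. This is a portable emulation for illustration only; `add_sat_s16` and `vec_adds_s16_emul` are hypothetical helper names, and real AltiVec code would use `vec_adds` directly.

```c
#include <stdint.h>

/* Add two signed halfwords, clamping to the int16_t range on overflow. */
static int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Emulates the eight parallel saturating additions, one lane at a time. */
static void vec_adds_s16_emul(int16_t t[8], const int16_t a[8], const int16_t b[8])
{
    for (int i = 0; i < 8; i++)
        t[i] = add_sat_s16(a[i], b[i]);
}
```

With modulo arithmetic 32767 + 1 would wrap to -32768; with saturation it stays at 32767, which is usually what pixel and audio code wants.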
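Vector Dot Product
vec_msum( ) followed by vec_sums( )
[Diagram: the 16 bytes of VRA1 and VRB1 are multiplied pairwise; each group of four products is summed with the matching word of VRC1 into VRT1/A2; the four word sums are then summed with VRB2 into a single result in VRT2]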
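The two-stage reduction in the diagram can be sketched in portable C. This is an emulation under stated assumptions: it models the unsigned-byte form of multiply-sum and the signed saturating sum-across, and the names `msum_u8_emul`, `sums_s32_emul`, and `dot16_u8` are hypothetical.

```c
#include <stdint.h>

/* Stage 1 (modelled on vec_msum for unsigned bytes): multiply 16 byte
   pairs and add each group of four products to the matching word of c. */
static void msum_u8_emul(uint32_t t[4], const uint8_t a[16],
                         const uint8_t b[16], const uint32_t c[4])
{
    for (int w = 0; w < 4; w++) {
        uint32_t acc = c[w];
        for (int j = 0; j < 4; j++)
            acc += (uint32_t)a[4 * w + j] * b[4 * w + j];
        t[w] = acc;                       /* modulo 2^32 */
    }
}

/* Stage 2 (modelled on vec_sums): add the four words of a to the last
   word of b, saturate, and place the result in the last word of t. */
static void sums_s32_emul(int32_t t[4], const int32_t a[4], const int32_t b[4])
{
    int64_t acc = b[3];
    for (int w = 0; w < 4; w++)
        acc += a[w];
    if (acc > INT32_MAX) acc = INT32_MAX;
    if (acc < INT32_MIN) acc = INT32_MIN;
    t[0] = t[1] = t[2] = 0;
    t[3] = (int32_t)acc;
}

/* Dot product of two 16-byte vectors using the two stages above. */
static int32_t dot16_u8(const uint8_t a[16], const uint8_t b[16])
{
    const uint32_t zero_u[4] = {0, 0, 0, 0};
    const int32_t  zero_s[4] = {0, 0, 0, 0};
    uint32_t partial[4];
    int32_t  partial_s[4], total[4];

    msum_u8_emul(partial, a, b, zero_u);
    for (int w = 0; w < 4; w++)
        partial_s[w] = (int32_t)partial[w];  /* at most 4*255*255, fits */
    sums_s32_emul(total, partial_s, zero_s);
    return total[3];
}
```

For longer arrays, the accumulator input of the first stage lets the partial sums carry over from one pair of vectors to the next, so the second stage only runs once at the end.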
Arithmetic Operations
• Add, Subtract, Average
• Multiply, Multiply-add, Multiply-sum
• Logicals (and, andc, or, nor, xor)
• Rotates and shifts
• Compares
• Convert float <-> fixed (scaled)
• ÷ and √ via Newton-Raphson
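The Newton-Raphson approach to division can be illustrated with the reciprocal: start from a rough estimate of 1/x and refine it with multiplies and adds only. The sketch below is scalar portable C; `recip_refine` and `recip_newton` are hypothetical names, and the `estimate` argument stands in for whatever low-precision starting value the hardware or a table provides.

```c
/* One Newton-Raphson step for y ~ 1/x:  y' = y + y * (1 - x * y).
   The relative error is roughly squared on every step. */
static float recip_refine(float x, float y)
{
    return y + y * (1.0f - x * y);
}

/* Refine an initial estimate of 1/x for a fixed number of steps. */
static float recip_newton(float x, float estimate, int steps)
{
    float y = estimate;
    for (int i = 0; i < steps; i++)
        y = recip_refine(x, y);
    return y;
}
```

A quotient a/b is then a multiplied by the refined reciprocal of b. Each step is a multiply-add shape, which is why the technique suits a unit that has multiply-add but no divide.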
Vector Permute
T = vec_perm (A, B, C);
[Diagram: each byte of the control vector VRC holds an index that selects one of the 32 source bytes, 0 to F from VRA and 10 to 1F from VRB, for the corresponding byte of VRT]
• Arbitrary bytewise data reorganization
• Small table-lookup
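The selection rule behind `vec_perm` is short enough to state as a scalar loop. This is a portable emulation for illustration; `vec_perm_emul` is a hypothetical name, and the output array must not overlap the inputs.

```c
#include <stdint.h>

/* Each control byte picks one of 32 source bytes: indices 0x00 to 0x0F
   come from a, indices 0x10 to 0x1F come from b. Only the low five
   bits of each control byte are used. */
static void vec_perm_emul(uint8_t t[16], const uint8_t a[16],
                          const uint8_t b[16], const uint8_t c[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned sel = c[i] & 0x1F;
        t[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}
```

Treating `a` and `b` together as a 32-entry byte table and `c` as 16 indices gives the small table-lookup use; a control vector of 0F, 0E, ... 00 gives a byte reversal.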
Compare and Select
vec_cmpeq( ) followed by vec_sel( )
VRA1     C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
VRB1     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
VRT1/C2  00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00
VRA1/A2  C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
VRB2     9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A
VRT2     C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1
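The compare-then-select pattern replaces a branch per element with a mask. A portable emulation of the two steps, using hypothetical names `vec_cmpeq_u8_emul` and `vec_sel_emul`, might look like this:

```c
#include <stdint.h>

/* Bytewise compare: 0xFF where the elements are equal, 0x00 elsewhere. */
static void vec_cmpeq_u8_emul(uint8_t mask[16], const uint8_t a[16],
                              const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}

/* Bitwise select: take bits from b where the mask is 1, from a where 0. */
static void vec_sel_emul(uint8_t t[16], const uint8_t a[16],
                         const uint8_t b[16], const uint8_t c[16])
{
    for (int i = 0; i < 16; i++)
        t[i] = (uint8_t)((a[i] & ~c[i]) | (b[i] & c[i]));
}
```

Applied to the slide's data, comparing the source bytes against zero and then selecting 9A under the resulting mask replaces every zero byte with 9A and leaves the other bytes unchanged.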
Other AltiVec Instructions
• Load and Store (vector or scalar element)
• Pack, Unpack, and Merge elements
• Splat (element or literal replication)
• Bitwise vector shifts
• Double-vector bytewise shifts
Data Stream Prefetch
• Software directed prefetch into cache
• 4 simultaneous streams
– Independent and asynchronous
– Can be non-contiguous
[Diagram: a stream in memory made of blocks 1, 2, 3, … N; Block Size = 0-32 Vectors, Stride = ±32 KBytes, 0-256 Blocks]
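The three stream parameters describe a set of equally spaced blocks in memory. The sketch below only works out that geometry; it does not issue any prefetch. The `stream_desc_t` structure and both helper names are hypothetical, introduced here to make the arithmetic concrete.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical description of one prefetch stream. */
typedef struct {
    uintptr_t base;           /* address of the first block           */
    unsigned  block_vectors;  /* vectors per block, 16 bytes each     */
    unsigned  block_count;    /* number of blocks in the stream       */
    int32_t   stride_bytes;   /* signed distance between block starts */
} stream_desc_t;

/* Start address of block k (k = 0 is the first block). */
static uintptr_t stream_block_addr(const stream_desc_t *s, unsigned k)
{
    return (uintptr_t)((intptr_t)s->base + (intptr_t)k * s->stride_bytes);
}

/* Total number of bytes the stream covers. */
static size_t stream_bytes_touched(const stream_desc_t *s)
{
    return (size_t)s->block_vectors * 16u * s->block_count;
}
```

A stride larger than the block size gives the non-contiguous case, for example fetching one column strip of a 2-D image; a negative stride walks backwards through memory.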
Typical Implementation
• ALL instructions fully-pipelined with single-cycle throughput
– Simple ops: 1 cycle latency
– Compound ops: 3–4 cycle latency
• Dual AltiVec instruction issue
– One arithmetic, one “permute”
• No restriction on issue with scalar instructions
AltiVec vs. MMX
• Both SIMD, but AltiVec:
– Does everything MMX does, plus
– Twice the SIMD parallelism
– 4x the register namespace
– 8x the register storage space
– No mode switch or use overhead
– Permute
– Richer set of DSP instructions
AltiVec Performance
• Peak Performance
• Multimedia “kernels”
• DSP benchmarks
– Performance based on cycle-accurate simulator with real memory effects included
– Performance stated relative to optimized PowerPC scalar code
Peak Performance
• Vector operations at 400MHz:
– Integer
• 12.8 billion arithmetic ops/sec
• + 6.4 billion byte crossbar ops/sec
– Floating-point
• 3.2 gigaflops
• + 1.6 billion FP crossbar ops/sec
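These peak figures are consistent with a simple product of clock rate, elements per vector, and operations per element. The reading below is an assumption, not something the slide states: one arithmetic and one permute instruction issue per cycle, and a multiply-add counts as two operations per element. `peak_ops_per_sec` is a hypothetical helper.

```c
#include <stdint.h>

/* Peak rate = clock (Hz) x elements per vector x ops per element. */
static uint64_t peak_ops_per_sec(uint64_t clock_hz, unsigned lanes,
                                 unsigned ops_per_lane)
{
    return clock_hz * lanes * ops_per_lane;
}

/* Assumed breakdown at 400 MHz:
   integer arithmetic: 16 byte lanes  x 2 ops -> 12.8 billion ops/sec
   byte crossbar:      16 byte lanes  x 1 op  ->  6.4 billion ops/sec
   floating point:      4 float lanes x 2 ops ->  3.2 gigaflops
   FP crossbar:         4 float lanes x 1 op  ->  1.6 billion ops/sec */
```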
Multimedia Kernels
• Video and Audio
– 11.4x Discrete Cosine Transform (DCT)
– 16.1x* Motion estimation (* by ∑|A-B|)
– 12.5x Quantization
– 9.6x RGB -> YCbCr (CCIR601)
– 3.6x Inverse FFT (FP)
– 4.9x Windowing (FP)
Multimedia Kernels
• Image Processing
– 6.2x Bilinear interpolation
– 1.1cy/px Separable convolution
– 2.2cy/px RGB to YUV
– 1.3cy/px Median Filter (3x3)
Multimedia Kernels
• Graphics
– 6.2x Vector-matrix multiply (FP)
– 17.5x Buffer accumulation
– 6.6x Line clipping
– 6.3x Bezier curves
Communication Kernels
• Modems and Telephony
– 2.5x CRC-32
– 10.5x 64-QAM Demodulator
– 7.6x Linear prediction
– 9.3x Real 13-tap FIR
– 30.7x Autocorrelation
– 12.5x GSM Module 4.2.11
Miscellaneous DSP Kernels
• Miscellaneous
– 2.5 to 20x Parallel table lookup
– 10.0x Sorting
– 5.8x Associative search
– 16.0x Galois field multiply
– 4.0x Gamma Correction
– 12.0cy/block Haar Transform (wavelet)
DSP Benchmarks
• Results from an independent DSP benchmarking firm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is:
– Twice as fast as the world’s fastest DSP (TMS320C6201) per clock, and four times faster including frequency
– 2 to 5 times faster than Pentium™ II per clock (but µP would still be 35%
AltiVec Tools
• Programming Model and ABI
• Compilers and assemblers
– Motorola’s MCC CodeWarrior plug-in
– Apple’s MrC and PPCASM in MPW and MW
– Metrowerks C/C++
• Emulator/Trace generator
• MacsBug
• Cycle-accurate simulator
Programming in C
• 11 new fundamental packed data types
• AltiVec operators
– Parse like function calls
– Specific operators -> assembly instructions
– Generic operators type sensitive
– sizeof(), a=b, &a, *p, etc.
• Compiler does register allocation,
C Program Example
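// 128-bit add: z = ( x >> 11 ) + y, carrying between 32-bit words
// Declarations are assumed here; x and y hold the inputs
vector unsigned long x, y, z, carry, zero;
vector unsigned char shiftFactor;

zero = ( vector unsigned long ) ( 0 );
zero = vec_xor ( zero, zero );
shiftFactor = vec_splat_u8 ( 11 );   // 11 bits = 1 byte + 3 bits
z = vec_sro ( x, shiftFactor );      // shift right by whole bytes
z = vec_srl ( z, shiftFactor );      // shift right by the remaining bits
do
{
    carry = vec_addc ( z, y );       // carry out of each word
    z = vec_add ( z, y );            // modulo sum of each word
    y = vec_sld ( carry, zero, 4 );  // move each carry up one word
}
while ( !vec_all_eq ( y, zero ) );   // repeat until no carries remain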
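The carry loop in this example can be followed in portable C by treating the vector as four 32-bit words, most significant first. This is an emulation for illustration only; `add128_carry_loop` is a hypothetical name, and the comments map each step to the operator it stands in for.

```c
#include <stdint.h>

/* z += y over 128 bits, with z[0] and y_in[0] the most significant words.
   A carry out of the top word is discarded. */
static void add128_carry_loop(uint32_t z[4], const uint32_t y_in[4])
{
    uint32_t y[4] = { y_in[0], y_in[1], y_in[2], y_in[3] };

    do {
        uint32_t carry[4];
        for (int i = 0; i < 4; i++) {
            uint32_t sum = z[i] + y[i];          /* vec_add: modulo 2^32 */
            carry[i] = (sum < z[i]) ? 1u : 0u;   /* vec_addc: carry out  */
            z[i] = sum;
        }
        /* vec_sld ( carry, zero, 4 ): shift the carries up one word. */
        y[0] = carry[1];
        y[1] = carry[2];
        y[2] = carry[3];
        y[3] = 0;
    } while ((y[0] | y[1] | y[2] | y[3]) != 0);  /* !vec_all_eq ( y, zero ) */
}
```

Most additions finish in one or two passes; the worst case, a carry rippling through every word, takes four.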
Vector Shifts
The ‘shiftFactor’ vector is populated in two sections: one for “vector shift right by octet” (vsro) and one for “vector shift right” (vsr).
bit #   ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
used by     || <------ vsro -------> || <---- vsr ----> ||
vsro is based on the permute crossbar and shifts by whole bytes; vsr is a 0 to 7 bit shift.
Used sequentially, the two instructions shift a vector register right (or left, with the corresponding left-shift instructions) by 0 to 127 bits, as specified in bits 121:127 of ‘shiftFactor’.
For a shift of 11 bits (1 byte plus 3 bits):
bit #         ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
shiftFactor = ... ||  0  |  0  |  0  |  1  ||  0  |  1  |  1  ||
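The split of a shift count into a byte part and a bit part can be sketched in portable C on a 16-byte array, most significant byte first. This is an emulation under that layout assumption; `shift_right_128` is a hypothetical name.

```c
#include <stdint.h>
#include <assert.h>

/* Shift a 128-bit value right by n bits (0 to 127); v[0] is the most
   significant byte. The count is split the same way as shiftFactor. */
static void shift_right_128(uint8_t v[16], unsigned n)
{
    assert(n < 128);
    int      byte_shift = (int)((n >> 3) & 0xF);  /* bits 121:124, vsro */
    unsigned bit_shift  = n & 0x7;                /* bits 125:127, vsr  */

    /* Step 1 (vsro): move whole bytes toward the least significant end. */
    for (int i = 15; i >= 0; i--)
        v[i] = (i >= byte_shift) ? v[i - byte_shift] : 0;

    /* Step 2 (vsr): shift the remaining 1 to 7 bits across byte borders. */
    if (bit_shift != 0) {
        for (int i = 15; i >= 0; i--) {
            uint8_t above = (i > 0) ? v[i - 1] : 0;
            v[i] = (uint8_t)((v[i] >> bit_shift) | (above << (8 - bit_shift)));
        }
    }
}
```

With n = 11 the byte part is 1 and the bit part is 3, which is the 0001 011 pattern shown in the table.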
AltiVec at Apple
• Mac OS (blockmove, etc.)
• QuickDraw
• QTML (codecs, rasterizers…)
• Media source code library
• [email protected]
AltiVec Summary
• Major architectural extension will make future PowerPCs great media processors
• Early programming tools available now
• Development systems 2H98 (Now)
• AltiVec based systems in 1H99