
Vector Processors
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Pipelining
 Vector machines exploit pipelining in all their activities
    Computations
    Movement of data from/to memory
 Pipelining provides overlapped execution
    Increases throughput
    Hides latency …
Pipelining (cont’d)
[Figure: pipeline overlaps execution, 6 versus 18 cycles]
Pipelining (cont’d)
 One measure of performance:

        Speedup = Non-pipelined execution time / Pipelined execution time

 Ideal case:
    An n-stage pipeline should give a speedup of n
 Two factors affect this:
    Pipeline fill
    Pipeline drain
Pipelining (cont’d)
 N computations, each takes n * T time
    Non-pipelined time = N * n * T
    Pipelined time = n * T + (N - 1) * T
                   = (n + N - 1) * T

        Speedup = (n * N) / (n + N - 1)
                = 1 / (1/N + 1/n - 1/(n * N))

    (the sketch below evaluates this formula numerically)
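A quick way to check the formula is to evaluate it for a few cases; the following C sketch (illustrative, not part of the original slides) prints the speedup for the pipeline depths used in the plot that follows:

    #include <stdio.h>

    /* Speedup of an n-stage pipeline over non-pipelined execution
       for N computations: (n * N) / (n + N - 1).                  */
    static double speedup(int n, int N)
    {
        return (double)(n * N) / (double)(n + N - 1);
    }

    int main(void)
    {
        int depths[] = {3, 6, 9};
        for (int i = 0; i < 3; i++) {
            int n = depths[i];
            /* Speedup approaches n as the number of elements N grows. */
            printf("n = %d: N = 10 -> %.2f, N = 100 -> %.2f\n",
                   n, speedup(n, 10), speedup(n, 100));
        }
        return 0;
    }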
Pipelining (cont’d)
[Plot: speedup versus number of elements N (up to 150) for pipeline depths n = 3, 6, and 9; the speedup approaches n as N grows]
Pipelining (cont’d)
[Plot: speedup versus pipeline depth, n]
Vector Machines
 Provide high-level operations
    Work on vectors (linear arrays of numbers)
 A typical vector operation
    Add two 64-element floating-point vectors
    Equivalent to an entire loop (see the scalar sketch below)
 CRAY format:

        V3  V2 VOP V1        (V3 = V2 VOP V1)
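For comparison, a minimal C sketch (array names are illustrative) of the scalar loop that a single 64-element vector add replaces:

    /* Scalar equivalent of one 64-element vector add:
       what a single CRAY-style V3 = V2 + V1 instruction performs. */
    #define VLEN 64

    void vector_add(double v3[VLEN], const double v2[VLEN], const double v1[VLEN])
    {
        for (int i = 0; i < VLEN; i++)   /* the entire loop folds into one instruction */
            v3[i] = v2[i] + v1[i];
    }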
Vector Machines (cont’d)
 Consists of
 Scalar unit
    Works on scalars
    Address arithmetic
 Vector unit
    Responsible for vector operations
    Several vector functional units
       Integer add, FP add, FP multiply …
Vector Machines (cont’d)
 Two types of architecture
 Memory-to-memory architecture
    Vectors are memory resident
    The first machines were of this type
    Examples: CDC Star 100, CYBER 205
 Vector-register architecture
    Vectors are stored in registers
    Modern vector machines belong to this type
    Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820
Components
 Primary components of a vector-register machine
 Vector registers
    Each register can hold a small vector
    Example: Cray-1 has 8 vector registers
       Each vector register can hold 64 doublewords (64-bit values)
    Two read ports and one write port
       Allows overlap among the vector operations
Cray-1 Architecture
[Figure: Cray-1 architecture]
Components
 Vector functional units
    Each unit is fully pipelined
       Can start a new operation on every clock cycle
    Cray-1 has six functional units
       FP add, FP multiply, FP reciprocal, integer add, logical, shift
 Scalar registers
    Store scalars
    Compute addresses to pass on to the load/store unit
Components
 Vector load/store unit
    Moves vectors between memory and vector registers
       Load and store operations are pipelined
    Some processors have more than one load/store unit
       NEC SX/2 has 8 load/store units
 Memory
    Designed to allow pipelined access
    Typically uses interleaved memories
       Will be discussed later
Some Example Vector Machines

Machine          Year   # VR     VR size    # LSUs
CRAY-2           1985   8        64         1
Cray Y-MP        1988   8        64         2 loads / 1 store
Fujitsu VP100    1982   8-256    32-1024    2
Hitachi S810     1983   32       256        4
NEC SX/2         1984   8+8192   256+var.   8
Convex C-1       1985   8        128        1

(# VR = number of vector registers, VR size = elements per register, # LSUs = vector load/store units)
Some Example Vector Machines (cont’d)
 Vector functional units
 Cray X-MP/Y-MP
    8 units
       FP add, FP multiply, FP reciprocal
       Integer add
       2 logical
       Shift
       Population count/parity
Some Example Vector Machines (cont’d)
 Vector functional units (cont’d)
 NEC SX/2
    16 units
       4 FP add
       4 FP multiply/divide
       4 integer add/logical
       4 shift
Advantages of Vector Machines
 Flynn’s bottleneck can be reduced
    Vector instructions significantly improve code density
 A single vector instruction specifies a great deal of work
    Reduces the number of instructions needed to execute a program
    Eliminates the control overhead of a loop
       A vector instruction represents the entire loop
       Loop overhead can be substantial
Advantages of Vector Machines (cont’d)
 Impact of main memory latency can be reduced
 Vector instructions that access memory have a known pattern
    Pipelined access can be used
    Can exploit interleaved memory
    High memory latency can be amortized over the entire vector
       Latency is not incurred for each data item, as it is when accessing a single floating-point number
Advantages of Vector Machines (cont’d)
 Control hazards can be reduced
    Vector machines organize data operands into regular sequences
       Suitable for pipelined access in hardware
    Vector operation ≡ loop (one vector operation replaces an entire loop)
 Data hazards can be eliminated
    Due to the structured nature of data
    Allows planned prefetching of data
Example Problem
 A typical vector problem:

        Y = a * X + Y

    X and Y are vectors
 This problem is known as
    SAXPY (single-precision a*X plus Y)
    DAXPY (double-precision a*X plus Y)
 SAXPY/DAXPY represents a small piece of code that takes most of the time in the Linpack benchmark (see the C sketch below)
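A reference DAXPY loop in C (a sketch for illustration; the scalar and vector code fragments on the next slides implement exactly this computation):

    /* DAXPY: Y = a*X + Y, double precision, over n elements. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }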
Example Problem (cont’d)
 Non-vector code fragment

        LD    F0,a            ;load scalar a
        ADDI  R4,Rx,#512      ;last address to load
  loop: LD    F2,0(Rx)        ;F2 := M[0+Rx], i.e., load X[i]
        MULT  F2,F0,F2        ;a*X[i]
Example Problem (cont’d)

        LD    F4,0(Ry)        ;load Y[i]
        ADD   F4,F2,F4        ;a*X[i] + Y[i]
        SD    F4,0(Ry)        ;store into Y[i]
        ADDI  Rx,Rx,#8        ;increment index to X
        ADDI  Ry,Ry,#8        ;increment index to Y
        SUB   R20,R4,Rx       ;R20 := R4 - Rx
        JNZ   R20,loop        ;jump if not done

 9 instructions in the loop
Example Problem (cont’d)
 Vector code fragment

        LD      F0,a          ;load scalar a
        LV      V1,Rx         ;load vector X
        MULTSV  V2,F0,V1      ;V2 := F0 * V1
        LV      V3,Ry         ;load vector Y
        ADDV    V4,V2,V3      ;V4 := V2 + V3
        SV      Ry,V4         ;store the result

 Only 6 vector instructions!
Example Problem (cont’d)
 Two main observations
 Execution efficiency
    Vector code
       Executes 6 instructions
    Non-vector code
       Nearly 600 instructions (9 * 64)
       Lots of control overhead
          4 out of 9 instructions!
       Absent in the vector code
Example Problem (cont’d)
 Two main observations
 Frequency of pipeline interlock
    Non-vector code:
       Every ADD must wait for MULT
       Every SD must wait for ADD
       Loop unrolling can eliminate this interlock (see the unrolled sketch below)
    Vector code
       Each instruction is independent
       Pipeline stalls once per vector operation
          Not once per vector element
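A sketch of the unrolling idea in C (unroll factor of 4 chosen for illustration; for brevity it assumes n is a multiple of 4):

    /* DAXPY with the loop unrolled by 4.  The four statements in the
       body are mutually independent, so their multiplies and adds can
       be interleaved to hide the multiply-to-add latency. */
    void daxpy_unrolled(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i += 4) {
            y[i]     = a * x[i]     + y[i];
            y[i + 1] = a * x[i + 1] + y[i + 1];
            y[i + 2] = a * x[i + 2] + y[i + 2];
            y[i + 3] = a * x[i + 3] + y[i + 3];
        }
    }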
Vector Length
 Vector register has a natural vector length
    64 elements in CRAY systems
 What if the vector has a different length?
 Three cases
    Vector length < vector register length
       Use a vector length (VL) register to indicate the actual vector length
    Vector length = vector register length
    Vector length > vector register length
Vector Length (cont’d)
 Vector length > vector register length
    Use strip mining (see the C sketch below)
    Vector is partitioned into strips whose lengths are less than or equal to the vector register length
       The odd (shorter) strip handles the leftover elements; all other strips are full length
[Figure: a long vector divided into strips, one of which is the odd strip]
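A minimal C sketch of strip-mined DAXPY, assuming a maximum vector length of 64 as in the CRAY machines (the inner loop stands in for one set of vector instructions executed with VL set to the strip length):

    #define MVL 64   /* maximum vector length (vector register size) */

    /* Strip-mined DAXPY: the vector hardware processes at most MVL
       elements per pass, so an arbitrary-length vector is split into
       strips.  The first strip carries the odd remainder (n % MVL). */
    void daxpy_stripmined(int n, double a, const double *x, double *y)
    {
        int start = 0;
        int vl = n % MVL;            /* length of the first (odd) strip */
        if (vl == 0) vl = MVL;
        while (start < n) {
            /* in hardware: set VL = vl, then issue one set of vector ops */
            for (int i = start; i < start + vl; i++)
                y[i] = a * x[i] + y[i];
            start += vl;
            vl = MVL;                /* remaining strips are full length */
        }
    }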
Vector Stride
 Vector stride
    Distance separating the elements that are to be merged into a single vector
       Measured in elements, not bytes
 Multidimensional matrices typically have non-unit-stride access patterns
    Example: matrix multiply
Vector Stride (cont’d)
 Matrix multiplication

        for (i = 1, 100)
          for (j = 1, 100)
            A[i,j] = 0
            for (k = 1, 100)
              A[i,j] = A[i,j] + B[i,k] * C[k,j]

    In the inner loop over k, one of the accesses B[i,k] and C[k,j] has unit stride and the other has non-unit stride; which is which depends on the storage order (next slide)
Vector Stride (cont’d)
 Access pattern of B and C depends on how the matrix is stored (the offset sketch below makes this concrete)
 Row-major
    Matrix is stored row-by-row
    Used by most languages except FORTRAN
 Column-major
    Matrix is stored column-by-column
    Used by FORTRAN
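The difference can be shown with a small C sketch (illustrative) that computes the element offset of A[i][j] under each storage order:

    /* Offset (in elements) of A[i][j] in a matrix with the given
       number of rows or columns.  Walking along a row (j varies,
       i fixed) is unit stride in row-major order but has stride
       num_rows in column-major order; walking down a column is
       the opposite.                                              */
    int offset_row_major(int i, int j, int num_cols) { return i * num_cols + j; }
    int offset_col_major(int i, int j, int num_rows) { return i + j * num_rows; }

For the inner k-loop of the matrix multiply above, B[i,k] walks along a row of B and C[k,j] walks down a column of C, so exactly one of the two accesses is unit stride under either storage convention.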
Vector Stride (cont’d)
[Figure: a 4 x 4 matrix with elements labeled by (row, column) indices 11 through 44, illustrating the order in which the elements are laid out in memory]
Cray X-MP Instructions
 Integer addition
        Vi  Vj+Vk        Vi = Vj + Vk
        Vi  Sj+Vk        Vi = Sj + Vk
    Sj is a scalar
 Floating-point addition
        Vi  Vj+FVk       Vi = Vj + Vk
        Vi  Sj+FVk       Vi = Sj + Vk
    Sj is a scalar
Cray X-MP Instructions (cont’d)
 Load instructions
        Vi ,A0,Ak
    Vector load with stride Ak
    Loads VL elements from memory starting at address A0 (i.e., from addresses A0, A0+Ak, A0+2*Ak, …)
        Vi ,A0,1
    Vector load with stride 1
    Special case
    (a C sketch of the strided load follows below)
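The effect of a strided load can be sketched in C as a simple gather (names such as vi, a0, and ak are illustrative stand-ins for the registers above):

    /* Sketch of a strided vector load: copy vl elements, ak apart
       (stride measured in elements), from memory starting at address
       a0 into vector register vi.                                    */
    void vector_load(double *vi, const double *a0, int ak, int vl)
    {
        for (int i = 0; i < vl; i++)
            vi[i] = a0[i * ak];
    }

The strided store on the next slide is the mirror image: a0[i * ak] = vi[i].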
Cray X-MP Instructions (cont’d)
 Store instructions
        ,A0,Ak Vi
    Vector store with stride Ak
    Stores VL elements starting at memory address A0
        ,A0,1 Vi
    Vector store with stride 1
    Special case
Cray X-MP Instructions
 Logical AND
instructions
 Vi
Vj&Vk
 Vi Sj&Vk
 Sj
 Shift
(cont’d)
Vi = Vj & Vk
Vi = Sj & Vk
is a scalar
instructions
 Vi
Vj>Ak
 Vi Vj<Ak
 Left/right
Carleton University
Vi = Vj >> Ak
Vi = Vj << Ak
shift each element of Vj and store the result in Vi
© S. Dandamudi
36
Sample Vector Functional Units

Vector functional unit    # Stages   Available to chain   Vector results
Integer ADD (64-bit)          3              8                VL+8
64-bit shift                  3              8                VL+8
128-bit shift                 4              9                VL+9
Floating ADD                  6             11                VL+11
Floating MULTIPLY             7             12                VL+12

(times in clock cycles; VL = vector length)
X-MP Pipeline Operation
 Three phases
 Setup phase
    Sets up functional units to perform the appropriate operation
    Establishes routes to source and destination vector registers
    Requires 3 clock cycles for all functional units
 Execution phase
 Shutdown phase
X-MP Pipeline Operation (cont’d)
 Three phases (cont’d)
 Execution phase
    Source and destination vector registers are reserved
       Cannot be used by another instruction
    Source vector register is reserved for VL+3 clock cycles
       VL = vector length
    One pair of operands enters the first stage each clock cycle
X-MP Pipeline Operation (cont’d)
 Three phases (cont’d)
 Shutdown phase
    Shutdown time = 3 clock cycles
    Shutdown time is the time difference between
       when the last result emerges and
       when the destination vector register becomes available for other instructions
X-MP Pipeline Operation (cont’d)
 Three phases (cont’d)
 Shutdown phase
    Destination register becomes available after
        3 + n + (VL - 1) + 3 = n + VL + 5 clock cycles
       Setup time = shutdown time = 3 clock cycles
       First result emerges after n clock cycles
       Remaining (VL - 1) results come out at one per clock cycle
    (see the timing sketch below)
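A small C helper (a sketch, not taken from Cray documentation) reproduces this timing and matches the VL + n + 5 pattern in the functional-unit table earlier:

    /* Clock cycles until the destination vector register is free again:
       3 (setup) + n (pipeline fill) + (VL - 1) (remaining results) + 3 (shutdown). */
    int dest_busy_cycles(int n_stages, int vl)
    {
        return 3 + n_stages + (vl - 1) + 3;      /* = n_stages + vl + 5 */
    }

    /* Example: floating add has n = 6 stages, so with VL = 64 the
       destination register is busy for 6 + 64 + 5 = 75 cycles.     */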
A Simple Vector Add Operation

        A1  5            ;A1 = 5
        VL  A1           ;set vector length VL = A1
        V1  V2+FV3       ;V1 = V2 + V3 (floating-point add)
Overlapped Vector Operations

        A1  5            ;A1 = 5
        VL  A1           ;set vector length VL = A1
        V1  V2+FV3       ;floating-point add
        V4  V5*FV6       ;floating-point multiply, overlapped with the add (no shared registers)
Chaining Example

        A1  5            ;A1 = 5
        VL  A1           ;set vector length VL = A1
        V1  V2+FV3       ;floating-point add
        V4  V5*FV1       ;multiply uses V1, the result of the add, so the two operations chain
Vector Processing Performance
Interleaved Memories
 Traditional memory designs
    Provide sequential, non-overlapped access
    Use high-order interleaving
 Interleaved memories
    Facilitate overlapped, pipelined access
    Used by vector and high-performance systems
    Use low-order interleaving
Interleaved Memories (cont’d)
Interleaved Memories (cont’d)
 Two types of designs
 Synchronized access organization
    Upper m bits of the address are given to all memory banks simultaneously
    Requires output latches
    Does not efficiently support non-sequential access
 Independent access organization
    Supports pipelined access for arbitrary access patterns
    Requires address registers
Interleaved Memories (cont’d)
[Figure: synchronized access organization]
Interleaved Memories (cont’d)
[Figure: pipelined transfer of data in interleaved memories]
Interleaved Memories (cont’d)
[Figure: independent access organization]
Interleaved Memories (cont’d)
 Number of banks B

        B ≥ M,  where M = memory access time in cycles

 Accesses become fully sequential if stride = B
 Example: B = 8, M = 6 clock cycles
    With stride 1, time to read 16 words = 6 + 16 = 22 clock cycles
    With stride 8, every access hits the same bank: 16 * 6 = 96 clock cycles
 (both cases are reproduced by the sketch below)
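A short C sketch following the simple timing model used on this slide (overlapped access when successive references fall in different banks, fully serialized when they all hit one bank; both bank counts and stride are illustrative parameters):

    /* Approximate time (in cycles) to read n words from b interleaved
       banks with the given stride, where each bank needs m cycles per
       access: if the stride maps every access to the same bank the
       accesses serialize completely; otherwise the first word takes m
       cycles and the rest arrive one per cycle.                        */
    int read_time(int n, int b, int m, int stride)
    {
        if (stride % b == 0)          /* every access hits the same bank    */
            return n * m;             /* e.g. b=8, m=6, stride=8: 16*6 = 96 */
        return m + n;                 /* e.g. b=8, m=6, stride=1: 6+16 = 22 */
    }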
Last slide