Introduction - Carleton University
Vector Processors
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Pipelining
Vector machines exploit pipelining in all their activities
  Computations
  Movement of data from/to memory
Pipelining provides overlapped execution
  Increases throughput
  Hides latency …
Carleton University
© S. Dandamudi
Pipelining (cont’d)
Pipeline overlaps execution: 6 versus 18 cycles
Pipelining (cont’d)
One measure of performance:
$$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$
Ideal case:
  n-stage pipeline should give a speedup of n
Two factors affect this:
  Pipeline fill
  Pipeline drain
Pipelining (cont’d)
N computations, each takes n * T time
  Non-pipelined time = N * n * T
  Pipelined time = n * T + (N - 1) * T = (n + N - 1) * T
$$\text{Speedup} = \frac{n N}{n + N - 1} = \frac{1}{\frac{1}{N} + \frac{1}{n} - \frac{1}{n N}}$$
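The speedup formula can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names are illustrative, and T is assumed to be the time of one pipeline stage (it cancels out of the ratio).

```python
def non_pipelined_time(n_stages, n_items, stage_time=1.0):
    """Each of the N computations takes n * T, one after another."""
    return n_items * n_stages * stage_time


def pipelined_time(n_stages, n_items, stage_time=1.0):
    """First result after n * T (pipeline fill), then one result every T."""
    return (n_stages + n_items - 1) * stage_time


def speedup(n_stages, n_items):
    """Speedup = n * N / (n + N - 1); the stage time T cancels out."""
    return non_pipelined_time(n_stages, n_items) / pipelined_time(n_stages, n_items)


if __name__ == "__main__":
    # Speedup approaches n only as N grows large.
    for n in (3, 6, 9):
        row = [round(speedup(n, big_n), 2) for big_n in (1, 10, 50, 150)]
        print(f"n={n}: {row}")
```

For n = 9 and N = 150 the speedup is 1350/158, about 8.54, which stays below the ideal value of 9 because of pipeline fill and drain.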
Pipelining (cont’d)
[Figure: speedup (1.0 to 9.0) versus number of elements, N (0 to 150), with one curve each for n = 3, n = 6, and n = 9]
Pipelining (cont’d)
[Figure: plot with horizontal axis "Pipeline depth, n"]
Vector Machines
Provide high-level operations
  Work on vectors (linear arrays of numbers)
A typical vector operation
  Add two 64-element floating-point vectors
  Equivalent to an entire loop
CRAY format
  V3  V2 VOP V1    (V3 receives the result of applying VOP to V2 and V1)
Vector Machines (cont’d)
Consists of
  Scalar unit
    Works on scalars
    Address arithmetic
  Vector unit
    Responsible for vector operations
    Several vector functional units
      Integer add, FP add, FP multiply …
Vector Machines (cont’d)
Two types of architecture
  Memory-to-memory architecture
    Vectors are memory resident
    The first vector machines were of this type
    Examples: CDC Star 100, CYBER 205
  Vector-register architecture
    Vectors are stored in registers
    Modern vector machines belong to this type
    Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820
Components
Primary components of vector-register machine
  Vector registers
    Each register can hold a small vector
    Example: Cray-1 has 8 vector registers
      Each vector register can hold 64 doublewords (64-bit values)
    Two read ports and one write port
      Allows overlap among the vector operations
Cray-1 Architecture
Components
Vector functional units
  Each unit is fully pipelined
    Can start a new operation on every clock cycle
  Cray-1 has six functional units
    FP add, FP multiply, FP reciprocal, Integer add, Logical, Shift
Scalar registers
  Store scalars
  Compute addresses to pass on to the load/store unit
Components
Vector load/store unit
  Moves vectors between memory and vector registers
  Load and store operations are pipelined
  Some processors have more than one load/store unit
    NEC SX/2 has 8 load/store units
Memory
  Designed to allow pipelined access
  Typically uses interleaved memories
    Will discuss later
Some Example Vector Machines
Machine          Year   # VR      VR size     # LSUs
CRAY-2           1985   8         64          1
Cray Y-MP        1988   8         64          2 loads/1 store
Fujitsu VP100    1982   8-256     32-1024     2
Hitachi S810     1983   32        256         4
NEC SX/2         1984   8+8192    256+var.    8
Convex C-1       1985   8         128         1
# VR = number of vector registers, VR size = elements per vector register, # LSUs = number of load/store units
Some Example Vector Machines (cont’d)
Vector functional units
  Cray X-MP/Y-MP
    8 units
      FP add, FP multiply, FP reciprocal
      Integer add, 2 logical
      Shift
      Population count/parity
Some Example Vector Machines (cont’d)
Vector functional units (cont’d)
  NEC SX/2
    16 units
      4 FP add, 4 FP multiply/divide
      4 Integer add/logical, 4 Shift
Advantages of Vector Machines
Flynn’s bottleneck can be reduced
  Vector instructions significantly improve code density
    A single vector instruction specifies a great deal of work
    Reduce the number of instructions needed to execute a program
  Eliminate control overhead of a loop
    A vector instruction represents the entire loop
    Loop overhead can be substantial
Advantages of Vector Machines (cont’d)
Impact of main memory latency can be reduced
  Vector instructions that access memory have a known pattern
    Pipelined access can be used
    Can exploit interleaved memory
  High latency associated with memory can be amortized over the entire vector
    Latency is not associated with each data item
      When accessing a floating-point number
Advantages of Vector Machines (cont’d)
Control hazards can be reduced
  Vector machines organize data operands into regular sequences
    Suitable for pipelined access in hardware
    A vector operation corresponds to a loop
Data hazards can be eliminated
  Due to structured nature of data
  Allows planned prefetching of data
Example Problem
A Typical Vector Problem
  Y = a * X + Y
  X and Y are vectors
This problem is known as
  SAXPY (single precision A*X Plus Y)
  DAXPY (double precision A*X Plus Y)
SAXPY/DAXPY represents a small piece of code that takes most of the time in the Linpack benchmark
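As a point of reference for the assembly fragments that follow, here is a minimal Python sketch of the DAXPY computation written as an element-by-element loop. It is an illustration only, not code from the slides; the function name `daxpy` and the use of Python lists are assumptions.

```python
def daxpy(a, x, y):
    """Compute Y = a * X + Y in place, one element per loop iteration."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same length")
    for i in range(len(x)):
        # One multiply and one add per element, as in the scalar loop.
        y[i] = a * x[i] + y[i]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    daxpy(2.0, x, y)
    print(y)
```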
Example Problem (cont’d)
Non-vector code fragment
        LD    F0,a          ;load scalar a
        ADDI  R4,Rx,#512    ;last address to load
  loop: LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
        MULT  F2,F0,F2      ;a*X[i]
        LD    F4,0(Ry)      ;load Y[i]
        ADD   F4,F2,F4      ;a*X[i] + Y[i]
        SD    F4,0(Ry)      ;store into Y[i]
        ADDI  Rx,Rx,#8      ;increment index to X
        ADDI  Ry,Ry,#8      ;increment index to Y
        SUB   R20,R4,Rx     ;R20 := R4-Rx
        JNZ   R20,loop      ;jump if not done
9 instructions in the loop
Example Problem (cont’d)
Vector code fragment
        LD      F0,a        ;load scalar a
        LV      V1,Rx       ;load vector X
        MULTSV  V2,F0,V1    ;V2 := F0 * V1
        LV      V3,Ry       ;load vector Y
        ADDV    V4,V2,V3    ;V4 := V2 + V3
        SV      Ry,V4       ;store the result
Only 6 instructions (one scalar load and five vector instructions)!
Example Problem (cont’d)
Two main observations
  Execution efficiency
    Vector code
      Executes 6 instructions
    Non-vector code
      Nearly 600 instructions (9 * 64)
      Lots of control overhead
        4 out of 9 instructions!
        Absent in the vector code
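The instruction counts can be tallied directly from the two code fragments. The sketch below is only a bookkeeping aid under the assumption of 64 elements; the constant and function names are illustrative and not from the slides.

```python
VECTOR_LENGTH = 64        # elements processed, as in the slides

SCALAR_SETUP = 2          # LD F0,a and ADDI R4,Rx,#512 execute once
SCALAR_LOOP_BODY = 9      # instructions from loop: through JNZ
SCALAR_LOOP_CONTROL = 4   # ADDI Rx, ADDI Ry, SUB, JNZ
VECTOR_INSTRUCTIONS = 6   # LD, LV, MULTSV, LV, ADDV, SV


def scalar_instruction_count(elements=VECTOR_LENGTH):
    """Dynamic instruction count of the non-vector fragment."""
    return SCALAR_SETUP + SCALAR_LOOP_BODY * elements


def control_overhead_fraction():
    """Share of loop-body instructions that only manage the loop."""
    return SCALAR_LOOP_CONTROL / SCALAR_LOOP_BODY


if __name__ == "__main__":
    print("non-vector:", scalar_instruction_count())
    print("vector:    ", VECTOR_INSTRUCTIONS)
    print("control overhead:", round(control_overhead_fraction(), 2))
```

The loop body alone contributes 9 * 64 = 576 instructions; counting the two setup instructions gives 578, which is the "nearly 600" quoted above.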
Example Problem (cont’d)
Two main observations
  Frequency of pipeline interlock
    Non-vector code:
      Every ADD must wait for MULT
      Every SD must wait for ADD
      Loop unrolling can eliminate this interlock
    Vector code
      The element operations within a vector instruction are independent
      Pipeline stalls once per vector operation
        Not once per vector element
Vector Length
Vector register has a natural vector length
  64 elements in CRAY systems
What if the vector has a different length?
Three cases
  Vector length < Vector register length
    Use a vector length register to indicate the vector length
  Vector length = Vector register length
  Vector length > Vector register length
Vector Length (cont’d)
Vector length > Vector register length
  Use strip mining
    Vector is partitioned into strips that are less than or equal to the vector register length
    Odd strip: the one strip that may be shorter than the vector register length
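Strip mining can be sketched in a few lines of Python. This is an illustrative model rather than code from the slides: `MVL`, `strip_mine`, and `daxpy_strip_mined` are hypothetical names, and placing the odd strip last is an assumption (it could equally be processed first).

```python
MVL = 64  # vector register length in elements (64 on CRAY systems per the slides)


def strip_mine(n, mvl=MVL):
    """Yield (start, length) for each strip; every length is <= mvl."""
    start = 0
    while start < n:
        length = min(mvl, n - start)  # the final strip may be the odd one
        yield start, length
        start += length


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Y = a * X + Y, processed one strip (one vector operation) at a time."""
    for start, vl in strip_mine(len(x), mvl):
        # vl plays the role of the vector length register for this strip.
        stop = start + vl
        y[start:stop] = [a * xi + yi for xi, yi in zip(x[start:stop], y[start:stop])]


if __name__ == "__main__":
    print(list(strip_mine(150)))
```

A 150-element vector is split into two full 64-element strips and one odd strip of 22 elements.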
Vector Stride
Vector stride
  Distance separating the elements that are to be merged into a single vector
    In elements, not bytes
Typically multidimensional matrices may have non-unit stride access patterns
  Example: matrix multiply
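Vector Stride (cont’d)
Matrix multiplication
  for (i = 1, 100)
    for (j = 1, 100)
      A[i,j] = 0
      for (k = 1, 100)
        A[i,j] = A[i,j] + B[i,k] * C[k,j]
In the inner k loop, B[i,k] walks along a row and C[k,j] walks down a column
  One of the two is accessed with unit stride, the other with non-unit stride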
Vector Stride (cont’d)
Access pattern of B and C depends on how the matrix is stored
  Row-major
    Matrix is stored row-by-row
    Used by most languages except FORTRAN
  Column-major
    Matrix is stored column-by-column
    Used by FORTRAN
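The dependence of stride on storage order can be made concrete. The following sketch is an illustration under stated assumptions: the function name `element_stride` and its string arguments are hypothetical, and strides are measured in elements, not bytes.

```python
def element_stride(rows, cols, layout, walk):
    """Stride in elements between successive accesses to a rows x cols matrix.

    layout: "row-major" or "column-major"
    walk:   "along-row"   (column index varies, like B[i,k] as k changes)
            "down-column" (row index varies, like C[k,j] as k changes)
    """
    if walk not in ("along-row", "down-column"):
        raise ValueError("unknown walk: " + walk)
    if layout == "row-major":
        # Neighbours within a row are adjacent; a step down skips a whole row.
        return 1 if walk == "along-row" else cols
    if layout == "column-major":
        # Neighbours within a column are adjacent; a step right skips a column.
        return rows if walk == "along-row" else 1
    raise ValueError("unknown layout: " + layout)


if __name__ == "__main__":
    for layout in ("row-major", "column-major"):
        b = element_stride(100, 100, layout, "along-row")
        c = element_stride(100, 100, layout, "down-column")
        print(f"{layout}: B[i,k] stride {b}, C[k,j] stride {c}")
```

For the 100 x 100 matrices of the example, row-major storage gives B unit stride and C a stride of 100, while column-major storage reverses the two.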
Vector Stride (cont’d)
[Figure: 4 × 4 matrix with elements labeled 11, 21, 31, 41, 12, 22, 32, 42, 13, 23, 33, 43, 14, 24, 34, 44]
Cray X-MP Instructions
Integer addition
  Vi  Vj+Vk       Vi = Vj + Vk
  Vi  Sj+Vk       Vi = Sj + Vk
    Sj is a scalar
Floating-point addition
  Vi  Vj+FVk      Vi = Vj + Vk
  Vi  Sj+FVk      Vi = Sj + Vk
    Sj is a scalar
Cray X-MP Instructions (cont’d)
Load instructions
  Vi  ,A0,Ak      Vi = M(A0)+Ak
    Vector load with stride Ak
    Loads VL elements starting at memory address A0
  Vi  ,A0,1       Vi = M(A0)+1
    Vector load with stride 1
    Special case
Cray X-MP Instructions (cont’d)
Store instructions
  ,A0,Ak  Vi
    Vector store with stride Ak
    Stores VL elements starting at memory address A0
  ,A0,1  Vi
    Vector store with stride 1
    Special case
Cray X-MP Instructions (cont’d)
Logical AND instructions
  Vi  Vj&Vk       Vi = Vj & Vk
  Vi  Sj&Vk       Vi = Sj & Vk
    Sj is a scalar
Shift instructions
  Vi  Vj>Ak       Vi = Vj >> Ak
  Vi  Vj<Ak       Vi = Vj << Ak
    Left/right shift each element of Vj and store the result in Vi
Sample Vector Functional Units
Vector functional unit   # Stages   Available to chain   Vector results
                                    (clock cycles)       (clock cycles)
Integer ADD (64-bit)     3          8                    VL+8
64-bit shift             3          8                    VL+8
128-bit shift            4          9                    VL+9
Floating ADD             6          11                   VL+11
Floating MULTIPLY        7          12                   VL+12
X-MP Pipeline Operation
Three phases
  Setup phase
    Sets functional units to perform the appropriate operation
    Establishes routes to source and destination vector registers
    Requires 3 clock cycles for all functional units
  Execution phase
  Shutdown phase
X-MP Pipeline Operation (cont’d)
Three phases (cont’d)
  Execution phase
    Source and destination vector registers are reserved
      Cannot be used by another instruction
    Source vector register is reserved for VL+3 clock cycles
      VL = vector length
    One pair of operands/clock cycle enters the first stage
X-MP Pipeline Operation (cont’d)
Three phases (cont’d)
  Shutdown phase
    Shutdown time = 3 clock cycles
    Shutdown time
      Time difference between
        when the last result emerges and
        when the destination vector register becomes available for other instructions
X-MP Pipeline Operation (cont’d)
Three phases (cont’d)
  Shutdown phase
    Destination register becomes available after
      3 + n + (VL-1) + 3 = n + VL + 5 clock cycles
        Setup time = shutdown time = 3 clock cycles
        First result comes after n clock cycles
        Remaining (VL-1) results come out at one/clock cycle
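The timing expression can be evaluated for the functional units listed in the "Sample Vector Functional Units" table. This is a minimal sketch, not code from the slides; the names are illustrative, and the stage counts are copied from that table.

```python
SETUP_CYCLES = 3
SHUTDOWN_CYCLES = 3

# Number of pipeline stages n per functional unit, from the table.
STAGES = {
    "Integer ADD (64-bit)": 3,
    "64-bit shift": 3,
    "128-bit shift": 4,
    "Floating ADD": 6,
    "Floating MULTIPLY": 7,
}


def destination_available_after(n_stages, vector_length):
    """Setup + first result + remaining results + shutdown = n + VL + 5."""
    return SETUP_CYCLES + n_stages + (vector_length - 1) + SHUTDOWN_CYCLES


def source_reserved_for(vector_length):
    """Source vector registers are reserved for VL + 3 clock cycles."""
    return vector_length + 3


if __name__ == "__main__":
    vl = 64
    for unit, n in STAGES.items():
        print(f"{unit}: destination free after {destination_available_after(n, vl)} cycles")
```

With n = 6 the result is VL + 11, matching the "Vector results" entry for the floating-point adder.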
A Simple Vector Add Operation
  A1  5
  VL  A1
  V1  V2+FV3
Overlapped Vector Operations
  A1  5
  VL  A1
  V1  V2+FV3
  V4  V5*FV6
Chaining Example
  A1  5
  VL  A1
  V1  V2+FV3
  V4  V5*FV1
Vector Processing Performance
Interleaved Memories
Traditional memory designs
  Provide sequential, non-overlapped access
  Use high-order interleaving
Interleaved memories
  Facilitate overlapped, pipelined access
    Used by vector and high performance systems
  Use low-order interleaving
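The difference between the two interleaving schemes is which address bits select the bank. The sketch below illustrates this with word addresses; the function names and the `words_per_bank` parameter are assumptions made for the example, not names from the slides.

```python
def low_order_bank(address, num_banks):
    """Low-order interleaving: the low address bits pick the bank.

    Returns (bank, offset within bank). Consecutive addresses fall in
    consecutive banks, so sequential accesses can overlap.
    """
    return address % num_banks, address // num_banks


def high_order_bank(address, words_per_bank):
    """High-order interleaving: the high address bits pick the bank.

    Returns (bank, offset within bank). Consecutive addresses stay in
    the same bank until that bank is exhausted.
    """
    return address // words_per_bank, address % words_per_bank


if __name__ == "__main__":
    addresses = range(10)
    print("low-order :", [low_order_bank(a, 8)[0] for a in addresses])
    print("high-order:", [high_order_bank(a, 4)[0] for a in addresses])
```

With low-order interleaving a unit-stride vector access touches a different bank on every reference, which is what makes pipelined access possible.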
Interleaved Memories (cont’d)
Interleaved Memories (cont’d)
Two types of designs
  Synchronized access organization
    Upper m bits are given to all memory banks simultaneously
    Requires output latches
    Does not efficiently support non-sequential access
  Independent access organization
    Supports pipelined access for arbitrary access pattern
    Requires address registers
Interleaved Memories (cont’d)
[Figure: Synchronized access organization]
Interleaved Memories (cont’d)
[Figure: Pipelined transfer of data in interleaved memories]
Interleaved Memories (cont’d)
[Figure: Independent access organization]
Interleaved Memories (cont’d)
Number of banks B
  B ≥ M
    M = memory access time in cycles
Access becomes sequential if stride = B (every reference hits the same bank)
  B = 8, M = 6 clock cycles, stride = 1
    Time to read 16 words = 6 + 16 = 22 clock cycles
  If stride is 8, it takes 16 * 6 = 96 clock cycles
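The two timing cases above can be written as a small function. This is a sketch of the slide's simplified model only: the name `read_time` is hypothetical, treating any stride that is a multiple of B like stride = B is an added assumption, and other strides are deliberately left unmodeled because they would need a bank-conflict simulation.

```python
def read_time(num_words, num_banks, access_time, stride):
    """Clock cycles to read num_words under the slide's simplified model."""
    if num_banks < access_time:
        raise ValueError("model assumes B >= M")
    if stride % num_banks == 0:
        # Every reference hits the same bank, so accesses are serialized.
        return num_words * access_time
    if stride == 1:
        # One access latency to fill the pipeline, then one word per cycle.
        return access_time + num_words
    raise NotImplementedError("other strides need a bank-conflict simulation")


if __name__ == "__main__":
    print(read_time(16, num_banks=8, access_time=6, stride=1))
    print(read_time(16, num_banks=8, access_time=6, stride=8))
```

With B = 8 and M = 6, reading 16 words takes 22 cycles at unit stride and 96 cycles at stride 8, reproducing the figures on the slide.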