Advanced Computer Architecture -


SIMD Computers
ECE/CS 757 Spring 2007
J. E. Smith
Copyright (C) 2007 by James E. Smith (unless noted otherwise)
All rights reserved. Except for use in ECE/CS 757, no part of these notes may be
reproduced, stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording, or otherwise, without prior written
permission from the author.
Outline

- Automatic Parallelization
- Vector Architectures
  • Cray-1 case study
- Data Parallel Programming
  • CM-2 case study
- CUDA Overview (separate slides)
- Readings
  • W. Daniel Hillis and Guy L. Steele, "Data Parallel Algorithms," Communications of the ACM, December 1986, pp. 1170-1183.
  • S. Ryoo et al., "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," Proceedings of PPoPP, Feb. 2008.

Automatic Parallelization

- Start with a sequential programming model
- Let the compiler attempt to find parallelism
  • It can be done…
  • We will look at one of the success stories
- Commonly used for SIMD computing – vectorization
  • Useful for MIMD systems, also – concurrentization
- Often done with FORTRAN
  • But some success can be achieved with C
    (Compiler address disambiguation is more difficult with C)

Automatic Parallelization

- Consider operations on arrays of data

    do I=1,N
      A(I,J) = B(I,J) + C(I,J)
    end do

  • Operations along one dimension involve vectors
- Loop level parallelism
  • Do all – all loop iterations are independent
    (completely parallel)
  • Do across – some dependence across loop iterations
    (partly parallel)

    A(I,J) = A(I-1,J) + C(I,J) * B(I,J)

Data Dependence

- Independence => Parallelism
  (or: dependence inhibits parallelism)

    S1: A=B+C
    S2: D=A+2
    S3: A=E+F

- True Dependence (RAW):   S1 → S2
- Antidependence (WAR):    S2 -→ S3
- Output Dependence (WAW): S1 o→ S3

Data Dependence Applied to Loops

- Similar relationships for loops, but consider iterations

    do I=1,2
    S1: A(I)=B(I)+C(I)
    S2: D(I)=A(I)
    end do

- S1 = S2
  • Dependence involving A, but on the same loop iteration

Data Dependence Applied to Loops

- S1 < S2

    do I=1,2
    S1: A(I)=B(I)+C(I)
    S2: D(I)=A(I-1)
    end do

  • Dependence involving A, but the read occurs on the next loop iteration
  • Loop carried dependence

- S2 -< S1

    do I=1,2
    S1: A(I)=B(I)+C(I)
    S2: D(I)=A(I+1)
    end do

  • Antidependence involving A; the write occurs on the next loop iteration

Loop Carried Dependence

- Definition

    do I = 1, N
    S1: X(f(I)) = F(...)
    S2: A = X(g(I)) ...
    end do

  S1 → S2 is loop-carried if there exist i1, i2 where
  1 <= i1 < i2 <= N and f(i1) = g(i2)

- If f and g can be arbitrary functions, the problem is essentially unsolvable.
- However, if (for example)
  f(i) = c*i + j and g(i) = d*i + k,
  there are methods for detecting dependence.

Loop Carried Dependences

- GCD test

    do I = 1, N
    S1: X(c*I + j) = F(...)
    S2: A = X(d*I + k) ...
    end do

  A dependence exists if c*i1 + j = d*i2 + k for some iterations i1, i2.
  This has a solution iff gcd(c, d) divides k - j.

- Example

    A(2*I) = ...
    ...     = A(2*I + 1)

  gcd(2,2) = 2 does not divide 1 - 0, so the references are independent.

- The GCD test is of limited use because it is very conservative; often gcd(c,d) = 1

    X(4*I+1) = F(X(5*I+2))

- Other, more complex tests have been developed, e.g. Banerjee's Inequality.

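To make the test concrete, here is a minimal sketch (my illustration, not from the slides) of the GCD test as a small Fortran program; may_depend and the test values are hypothetical names chosen for the example:

    ! Minimal sketch of the GCD dependence test for the references
    ! X(c*I + j) (write) and X(d*I + k) (read): a dependence is possible
    ! only if gcd(c, d) divides (k - j). Loop bounds are ignored, which
    ! is what makes the test conservative.
    program gcd_demo
      implicit none
      print *, may_depend(2, 0, 2, 1)   ! A(2*I) vs A(2*I+1): F (independent)
      print *, may_depend(4, 1, 5, 2)   ! X(4*I+1) vs X(5*I+2): T (possible dep.)
    contains
      logical function may_depend(c, j, d, k)
        integer, intent(in) :: c, j, d, k
        may_depend = mod(abs(k - j), gcd(abs(c), abs(d))) == 0
      end function may_depend
      recursive function gcd(a, b) result(g)
        integer, intent(in) :: a, b
        integer :: g
        if (b == 0) then
          g = a
        else
          g = gcd(b, mod(a, b))
        end if
      end function gcd
    end program gcd_demo

The second call reports a possible dependence because gcd(4,5) = 1 divides everything, even though the subscripts may never collide within the actual loop bounds; this is exactly the conservatism noted above.
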
Vector Code Generation

- In a vector architecture, a vector instruction performs identical operations on vectors of data
- Generally, the vector operations are independent
  • A common exception is reductions
- In general, to vectorize:
  • There should be no cycles in the dependence graph
  • Dependence flows should be downward
  => some rearranging of code may be needed

Vector Code Generation: Example

    do I = 1, N
    S1: A(I) = B(I)
    S2: C(I) = A(I) + B(I)
    S3: E(I) = C(I+1)
    end do

- Construct the dependence graph over S1, S2, S3:
  S1 → S2 (true dependence on A, same iteration);
  S3 -< S2 (loop-carried antidependence on C)
- Vectorizes (after re-ordering S2 and S3 due to the antidependence):

    S1: A(1:N) = B(1:N)
    S3: E(1:N) = C(2:N+1)
    S2: C(1:N) = A(1:N) + B(1:N)

Multiple Processors (Concurrentization)

- Often used on outer loops
- Example

    do I = 1, N
      do J = 2, N
    S1:   A(I,J) = B(I,J) + C(I,J)
    S2:   C(I,J) = D(I,J)/2
    S3:   E(I,J) = A(I,J-1)**2 + E(I,J-1)
      end do
    end do

- Data Dependences & Directions
  S1 =,< S3
  S1 =,= S2
  S3 =,< S3
- Observations
  • All dependence directions for the I loop are =
  => Iterations of the I loop can be scheduled in parallel

Scheduling

- Data Parallel Programming Model
  • SPMD (single program, multiple data)
- Compiler can pre-schedule:
  • Processor 1 executes 1st N/P iterations,
  • Processor 2 executes next N/P iterations, ...
  • Processor P executes last N/P iterations
  • Pre-scheduling is effective if execution time is nearly identical for each iteration
- Self-scheduling is often used (see the sketch after this list):
  • If each iteration is large
  • Time varies from iteration to iteration
    - iterations are placed in a "work queue"
    - a processor that is idle, or becomes idle, takes the next block of work from the queue (critical section)

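The two policies can be sketched with OpenMP directives (my illustration; the slides do not prescribe a particular mechanism): schedule(static) pre-schedules fixed blocks of iterations per thread, while schedule(dynamic, 4) implements self-scheduling by handing out 4-iteration blocks from a shared work queue.

    ! Sketch: pre-scheduling vs. self-scheduling of a parallel loop.
    ! The dynamic schedule's shared queue update is the critical
    ! section mentioned above.
    program scheduling_demo
      implicit none
      integer, parameter :: n = 1000
      real :: a(n), b(n), c(n)
      integer :: i
      b = 1.0
      c = 2.0

      !$omp parallel do schedule(static)       ! pre-scheduled: fixed N/P blocks
      do i = 1, n
        a(i) = b(i) + c(i)
      end do

      !$omp parallel do schedule(dynamic, 4)   ! self-scheduled from a work queue
      do i = 1, n
        a(i) = a(i) * b(i) + c(i)
      end do

      print *, a(1), a(n)                      ! 5.0 5.0
    end program scheduling_demo
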
Code Generation with Dependences

    do I = 2, N
    S1: A(I) = B(I) + C(I)
    S2: C(I) = D(I) * 2
    S3: E(I) = C(I) + A(I-1)
    end do

- Data Dependences & Directions
  S1 -= S2   (antidependence on C, same iteration)
  S1 <  S3   (true dependence on A, loop-carried)
  S2 =  S3   (true dependence on C, same iteration)
- Parallel Code on N-1 Processors

    S1: A(I) = B(I) + C(I)
        signal(I)
    S2: C(I) = D(I) * 2
    S3: if (I > 2) wait(I-1)
        E(I) = C(I) + A(I-1)

- Observation
  • Weak data-dependence tests may add unnecessary synchronization.
    Good dependence testing is crucial for high performance.

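One plausible way to realize signal/wait on a shared-memory machine is a shared flag array with spin-waiting. The OpenMP Fortran sketch below is my illustration, not the slide's code; the flag array done and the flush directives are assumptions of this particular realization.

    ! Sketch: signal(I) sets done(I); wait(I-1) spins until done(I-1)
    ! is observed. Flushes make A(I-1) visible before done(I) is set
    ! and after it is seen.
    program signal_wait_demo
      implicit none
      integer, parameter :: n = 16
      real :: a(n), b(n), c(n), d(n), e(n)
      logical :: done(n), ok
      integer :: i
      a = 0.0; b = 1.0; c = 2.0; d = 3.0; done = .false.
      !$omp parallel do schedule(static, 1) private(ok)
      do i = 2, n
        a(i) = b(i) + c(i)        ! S1
        !$omp flush
        !$omp atomic write
        done(i) = .true.          ! signal(I)
        c(i) = d(i) * 2           ! S2
        if (i > 2) then           ! wait(I-1)
          ok = .false.
          do while (.not. ok)
            !$omp atomic read
            ok = done(i - 1)
          end do
          !$omp flush
        end if
        e(i) = c(i) + a(i - 1)    ! S3
      end do
      print *, e(3)               ! 6 + 3 = 9
    end program signal_wait_demo
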
Reducing Synchronization

    do I = 1, N
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I) * 2
    S3: SUM = SUM + A(I)
    end do

- Parallel Code: Version 1 (processor p of P)

    do I = p, N, P
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I) * 2
        if (I > 1) wait(I-1)
    S3: SUM = SUM + A(I)
        signal(I)
    end do

Reducing Synchronization, contd.

- Parallel Code: Version 2

    SUMX(p) = 0
    do I = p, N, P
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I) * 2
    S3: SUMX(p) = SUMX(p) + A(I)
    end do
    barrier synchronize
    add partial sums

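Version 2 is essentially what an OpenMP reduction clause generates: each thread keeps a private partial sum, and the combine-plus-barrier at the end of the construct corresponds to the "barrier synchronize / add partial sums" steps above. A minimal sketch (my illustration):

    ! Sketch: per-thread partial sums with no per-iteration
    ! synchronization; partial sums are combined once at the end.
    program partial_sums_demo
      implicit none
      integer, parameter :: n = 1000
      real :: a(n), b(n), c(n), d(n), total
      integer :: i
      b = 1.0; c = 2.0; total = 0.0
      !$omp parallel do reduction(+:total)
      do i = 1, n
        a(i) = b(i) + c(i)
        d(i) = a(i) * 2
        total = total + a(i)
      end do
      print *, total   ! 3.0 * 1000 = 3000.0
    end program partial_sums_demo
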
Vectorization vs Concurrentization

- When a system is a vector MP, when should vector/concurrent code be generated?

    do J = 1,N
      do I = 1,N
    S1:   A(I,J+1) = B(I,J) + C(I,J)
    S2:   D(I,J) = A(I,J) * 2
      end do
    end do

- Parallel & Vector Code: Version 1

    doacross J = 1,N
    S1: A(1:N,J+1) = B(1:N,J) + C(1:N,J)
        signal(J)
        if (J > 1) wait(J-1)
    S2: D(1:N,J) = A(1:N,J) * 2
    end do

Vectorization vs Concurrentization

- Parallel & Vector Code: Version 2
  Vectorize on J, but with non-unit-stride memory access
  (assuming Fortran column-major storage order)

    doall I = 1,N
    S1: A(I,2:N+1) = B(I,1:N) + C(I,1:N)
    S2: D(I,1:N) = A(I,1:N) * 2
    end do

Summary

- Vectorizing compilers have been a success
- Dependence analysis is critical to any autoparallelizing scheme
  • Software (static) disambiguation
  • C pointers are especially difficult
- Can also be used for improving performance of sequential programs
  • Loop interchange
  • Fusion
  • Etc. (see add'l slides at end of lecture)

Cray-1 Architecture

- Circa 1976
- 80 MHz clock
  • When high-performance mainframes were 20 MHz
- Scalar instruction set
  • 16/32-bit instruction sizes
  • Otherwise conventional RISC
  • 8 S registers (64 bits); 8 A registers (24 bits)
- In-order pipeline
  • Issue in order
  • Can complete out of order (no precise traps)

Cray-1 Vector ISA

- 8 vector registers
  • 64 elements
  • 64 bits per element (word length)
  • Vector length (VL) register
- RISC format
  • Vi ← Vj OP Vk
  • Vi ← mem(Aj, disp)
- Conditionals via vector mask (VM) register
  • VM ← Vi pred Vj
  • Vi ← Vj conditional on VM

Vector Example

    Do 10 i=1,looplength
      a(i) = b(i) * x + c(i)
 10 continue

            A1 ← looplength     .initial values:
            A2 ← address(a)     .for the arrays
            A3 ← address(b)     .
            A4 ← address(c)     .
            A5 ← 0              .index value
            A6 ← 64             .max hardware VL
            S1 ← x              .scalar x in register S1
            VL ← A1             .set VL – performs mod function
            BrC done, A1<=0     .branch if nothing to do
    more:   V3 ← A4,A5          .load c indexed by A5 – addr mode not in Cray-1
            V1 ← A3,A5          .load b indexed by A5
            V2 ← V1 * S1        .vector times scalar
            V4 ← V2 + V3        .add in c
            A2,A5 ← V4          .store to a indexed by A5
            A7 ← VL             .read actual VL
            A1 ← A1 - A7        .remaining iteration count
            A5 ← A5 + A7        .increment index value
            VL ← A6             .set VL for next iteration
            BrC more, A1>0      .branch if more work
    done:

Compare with Scalar

    Do 10 i=1,looplength
      a(i) = b(i) * x + c(i)
 10 continue

  2 loads
  1 store
  2 FP
  1 branch
  1 index increment (at least)
  1 loop count increment
  total -- 8 instructions per iteration

  4-wide superscalar => up to 1 FP op per cycle
  vector, with chaining => up to 2 FP ops per cycle (assuming sufficient memory bandwidth)

  Also, in a CMOS microprocessor, the vector version would save a lot of energy.

Vector Conditional Loop

    do 80 i = 1,looplen
      if (a(i).eq.b(i)) then
        c(i) = a(i) + e(i)
      endif
 80 continue

    V1 ← A1             .load a(i)
    V2 ← A2             .load b(i)
    VM ← V1 == V2       .compare a and b; result to VM
    V3 ← A3; VM         .load e(i) under mask
    V4 ← V1 + V3; VM    .add under mask
    A4 ← V4; VM         .store to c(i) under mask

Vector Conditional Loop

- Gather/Scatter Method (used in later Cray machines)

    do 80 i = 1,looplen
      if (a(i).eq.b(i)) then
        c(i) = a(i) + e(i)
      endif
 80 continue

    V1 ← A1          .load a(i)
    V2 ← A2          .load b(i)
    VM ← V1 == V2    .compare a and b; result to VM
    V5 ← IOTA(VM)    .form index set
    VL ← pop(VM)     .find new VL (population count)
    V6 ← A1, V5      .gather a(i) values
    V3 ← A3, V5      .gather e(i) values
    V4 ← V6 + V3     .add a and e
    A4,V5 ← V4       .scatter sum into c(i)

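For intuition, the same pattern can be written in array Fortran (a sketch of the idea, not Cray code): PACK builds the index set that IOTA(VM) produces, the size of that set plays the role of pop(VM), and vector-subscripted assignment performs the gathers and the scatter.

    ! Sketch: gather/scatter form of the conditional loop.
    program gather_scatter_demo
      implicit none
      integer, parameter :: n = 8
      real :: a(n), b(n), c(n), e(n)
      integer, allocatable :: idx(:)
      integer :: i
      a = [1., 2., 3., 4., 5., 6., 7., 8.]
      b = [1., 0., 3., 0., 5., 0., 7., 0.]
      e = 10.0
      c = 0.0
      idx = pack([(i, i = 1, n)], a == b)   ! index set; size(idx) = new VL
      c(idx) = a(idx) + e(idx)              ! gather a and e, add, scatter to c
      print *, c                            ! 11 0 13 0 15 0 17 0
    end program gather_scatter_demo
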
Thinking Machines CM1/CM2

- Fine-grain parallelism
- Looks like intelligent RAM to host (front end)
- Front end dispatches "macro" instructions to sequencer
- Macro instructions decoded by sequencer and broadcast to bit-serial parallel processors

[Diagram: host sends instructions to the sequencer, which broadcasts to an array of processor/memory (P M) pairs]

CM Basics, contd.

- All instructions are executed by all processors, subject to context flag
- Context flags
  • Processor is selected if context flag = 1
  • Saving and restoring of context is unconditional
  • AND, OR, NOT operations can be done on the context flag
- Operations
  • Can do logical, integer, floating point as a series of bit-serial operations (see the sketch below)

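To see what "a series of bit-serial operations" means, here is a sketch (my reconstruction, not CM code) of four processors adding 8-bit integers one bit plane at a time; each whole-array statement stands for one broadcast micro-operation applied by every selected processor.

    ! Sketch: bit-serial addition across np processors. Each pass over k
    ! handles one bit plane for all processors at once:
    !   sum bit = a XOR b XOR carry
    !   carry   = majority(a, b, carry)
    program bitserial_add_demo
      implicit none
      integer, parameter :: np = 4, nbits = 8
      integer :: av(np), bv(np), result(np), k
      logical :: a(np, nbits), b(np, nbits), s(np, nbits), carry(np)
      av = [3, 10, 25, 100]
      bv = [5, 7, 12, 27]
      do k = 1, nbits               ! unpack operands into bit planes, LSB first
        a(:, k) = btest(av, k - 1)
        b(:, k) = btest(bv, k - 1)
      end do
      carry = .false.
      do k = 1, nbits               ! nbits serial steps, all processors in parallel
        s(:, k) = a(:, k) .neqv. b(:, k) .neqv. carry
        carry = (a(:, k) .and. b(:, k)) .or. (carry .and. (a(:, k) .neqv. b(:, k)))
      end do
      result = 0
      do k = 1, nbits               ! repack sum bits into integers
        where (s(:, k)) result = ibset(result, k - 1)
      end do
      print *, result               ! 8 17 37 127
    end program bitserial_add_demo
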
CM Basics, contd.

- Front end can broadcast data
  • (e.g. immediate values)
- SEND instruction does communication
  • within each processor, pointers can be computed, stored and re-used
- Virtual processor abstraction
  • time multiplexing of processors

Data Parallel Programming Model

- "Parallel operations across large sets of data"
- SIMD is an example, but the model can also be driven by multiple (identical) threads
  • Thinking Machines CM-2 used SIMD
  • Thinking Machines CM-5 used multiple threads

Connection Machine Architecture

- Nexus: 4x4, 32-bits wide
  • Cross-bar interconnect for host communications
- 16K processors per sequencer
- Memory
  • 4K mem per processor (CM-1)
  • 64K mem per processor (CM-2)
- CM-1 Processor
  • 16 processors per processor chip

Instruction Processing

- HLLs: C* and FORTRAN 8X
- Paris virtual machine instruction set
  • Allows some hardware independence
- Virtual processors
  • Time-share real processors
  • V virtual processors per real processor
    => 1/V as much memory per virtual processor
- Nexus contains sequencer
  • AMD 2900 bit-sliced micro-sequencer
  • 16K of 96-bit horizontal microcode
- Inst. processing:
  • 32-bit virtual machine insts (host)
    -> 96-bit microcode (nexus sequencer)
    -> nanocode (to processors)

CM-2

- Re-designed sequencer; 4x microcode memory
- New processor chip
- FP accelerator (1 per 32 processors)
- 16x memory capacity (4K -> 64K)
- SEC/DED on RAM
- I/O subsystem
- Data vault
- Graphics system
- Improved router

Performance

- Computation
  • 4000 MIPS 32-bit integer
  • 20 GFLOPS 32-bit FP
  • 4K x 4K matrix mult: 5 GFLOPS
- Communication
  • 2-d grid: 3 microseconds per bit, 96 microseconds per 32 bits
    => 20 billion bits/sec
  • general router: 600 microseconds per 32 bits
    => 3 billion bits/sec
- Compare with CRAY Y-MP (8 procs.)
  • 2.4 GFLOPS, but could come much closer to peak than the CM-2
  • 246 billion bits/sec to/from shared memory

Outline

- Automatic Parallelization
- Vector Architectures
  • Cray-1 case study
- Data Parallel Programming
  • CM-1/2 case study
- CUDA Overview (separate slides)
- Readings
  • W. Daniel Hillis and Guy L. Steele, "Data Parallel Algorithms," Communications of the ACM, December 1986, pp. 1170-1183.
  • S. Ryoo et al., "Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA," Proceedings of PPoPP, Feb. 2008.

Additional Slides on Code Generation

Improving Parallelism

- Loop Interchange

    do J = 1,N
      do I = 2,N
    S1:   A(I,J) = A(I-1,J) + B(I)
      end do
    end do

    do I = 2,N
      do J = 1,N
    S1:   A(I,J) = A(I-1,J) + B(I)
      end do
    end do

  (The recurrence is carried by the I loop; after interchange, the inner J loop carries no dependence and can be vectorized.)

Improving Parallelism

- Induction Variable Recognition
  • Successive values form an arithmetic progression

    INC = N
    do I = 1,N
      I2 = 2*I-1
      X(INC) = Y(I) + Z(I2)
      INC = INC - 1
    end do

    do I = 1,N
      X(N-I+1) = Y(I) + Z(2*I - 1)
    end do

    X(N:1:-1) = Y(1:N) + Z(1:2*N-1:2)

Improving Parallelism

- Wraparound Variable Recognition

    J = N
    do I = 1,N
      B(I) = (A(J) + A(I))/2
      J = I                    (Is J an induction variable?)
    end do

  Peel first iteration:

    if (N >= 1) then
      B(1) = (A(N) + A(1))/2
      do I = 2,N
        B(I) = (A(I-1) + A(I))/2
      end do
    end if

    if (N >= 1) then
      B(1) = (A(N) + A(1))/2
      B(2:N) = (A(1:N-1) + A(2:N))/2
    end if

Improving Parallelism

- Symbolic Dependence Testing

    do I = 1,N
    S1: A(LOW+I-1) = B(I)
    S2: B(I+N) = A(LOW+I)
    end do

- Global Forward Substitution

    NP1 = N+1
    NP2 = N+2
    ...
    do I = 1,N
    S1: B(I) = A(NP1) + C(I)
    S2: A(I) = A(I) - 1
      do J = 2,N
    S3:   D(J,NP1) = D(J-1,NP2)*C(J) + 1
      end do
    end do

- Observations
  • Useful for symbolic dependence testing
  • Constant propagation is a special case

Improving Parallelism

- Semantic Analysis

    do I = LOW,HIGH
    S1: A(I) = B(I) + A(I+M)
    end do
                               (M is unknown)
    if (M >= 0) then
      do I = LOW,HIGH
    S1:   A(I) = B(I) + A(I+M)
      end do
    else
      do I = LOW,HIGH
    S1:   A(I) = B(I) + A(I+M)
      end do
    end if

  The loop in the first branch is vectorizable: when M >= 0, the reads of A(I+M) create only an antidependence.

Improving Parallelism

- Interprocedural Dependence Analysis
  • If a procedure call is present in a loop, can we generate parallel code?
- Procedure in-lining
  • may reduce overhead, and lead to exact dependence analysis
  • may increase compilation time

Improving Parallelism

- Removal of Output and Antidependences
- Eliminate them by renaming
  • But may require extra vector storage

    do I = 1,N
    S1: A(I) = B(I) + C(I)
    S2: D(I) = (A(I) + A(I+1))/2
    end do

    do I = 1,N
    S3: ATEMP(I) = A(I+1)
    S1: A(I) = B(I) + C(I)
    S2: D(I) = (A(I) + ATEMP(I))/2
    end do

Improving Parallelism

- Scalar Expansion

    do I = 1,N
    S1: X = A(I) + B(I)
    S2: C(I) = X ** 2
    end do

    allocate (XTEMP(1:N))
    do I = 1,N
    S1: XTEMP(I) = A(I) + B(I)
    S2: C(I) = XTEMP(I) ** 2
    end do
    X = XTEMP(N)
    free (XTEMP)

Improving Parallelism

- Fission by Name
  • Break a loop into several adjacent loops to improve memory behavior
- Loop Fusion

    do I = 2,N
    S1: A(I) = B(I) + C(I)
    end do
    do I = 2,N
    S2: D(I) = A(I-1)
    end do

    do I = 2,N
    S1: A(I) = B(I) + C(I)
    S2: D(I) = A(I-1)
    end do

- Observations
  • fusion reduces start-up costs for loops
  • when does fusion yield benefits over fission?

Improving Parallelism

- Strip Mining

    do I = 1,N
      A(I) = B(I) + 1
      D(I) = B(I) - 1
    end do

    do J = 1,N,32              (vector reg. len. = 32)
      do I = J,MIN(J+31,N)
        A(I) = B(I) + 1
        D(I) = B(I) - 1
      end do
    end do

Loop Collapsing

    real A(5,5), B(5,5)
    do I = 1,5
      do J = 1,5
        A(I,J) = B(I,J) + 2
      end do
    end do

    real A(25), B(25)
    do IJ = 1,25
      A(IJ) = B(IJ) + 2
    end do

Conditional Statements

    do I = 1,N
      IF (A(I) .LE. 0) GOTO 100
      A(I+1) = B(I) + 3
    end do

    do I = 1,N                 (Is this vectorizable?)
      BR1 = A(I) .LE. 0
      IF (.NOT. BR1) A(I+1) = B(I) + 3
    end do

    do I = 1,N                 (Is this vectorizable?)
      BR1 = A(I) .LE. 0
      IF (.NOT. BR1) A(I) = B(I) + 3
    end do

Conditional Statements, contd.

    do I = 1,N
      BR1(I) = A(I) .LE. 0
      IF (.NOT. BR1(I)) A(I) = B(I) + 3
    end do

    BR1(1:N) = A(1:N) .LE. 0                    (scalar expansion)
    WHERE (.NOT. BR1(1:N)) A(1:N) = B(1:N) + 3

Data Parallel Examples

Example: Reduction

- Each processor (k) unconditionally tests its context flag
- contributes its value if the flag is set; else 0 (see the sketch below)
- perform unconditional summation of the integers
  • Special case: count processors

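A sketch of the idea in array Fortran (my analogy for the CM's masked summation; ctx stands for the per-processor context flags):

    ! Sketch: every processor contributes its value where the context
    ! flag is set, and 0 otherwise; the summation itself runs
    ! unconditionally over all processors.
    program masked_reduce_demo
      implicit none
      integer :: x(8)
      logical :: ctx(8)
      x = [5, 2, 7, 1, 9, 4, 6, 3]
      ctx = [.true., .false., .true., .true., .false., .true., .false., .true.]
      print *, sum(merge(x, 0, ctx))   ! 5 + 7 + 1 + 4 + 3 = 20
      print *, count(ctx)              ! special case, count processors: 5
    end program masked_reduce_demo
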
Example: Compute Partial Sums

- Similar to reduction, except use a sum-prefix computation (sketched below)
  • Special case: enumerate – give each active processor a sequence number

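The sum-prefix step is the scan described in the Hillis and Steele reading. A minimal sketch (my illustration) in array Fortran, where each whole-array assignment models one simultaneous step across all processors:

    ! Sketch: data-parallel sum-prefix (scan). In each step, every
    ! position adds the value d slots to its left; d doubles each step,
    ! so N partial sums finish in ceil(log2 N) steps.
    program scan_demo
      implicit none
      integer, parameter :: n = 8
      integer :: x(n), d
      x = 1                 ! all-ones input: the result enumerates positions
      d = 1
      do while (d < n)
        x(d+1:n) = x(d+1:n) + x(1:n-d)   ! one simultaneous parallel step
        d = 2 * d
      end do
      print *, x            ! 1 2 3 4 5 6 7 8
    end program scan_demo
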
Example: Radix Sort
Example: Find End of Linked List
Bonus Slides: BSP

Burroughs BSP

- Developed during the "supercomputer wars" of the late '70s, early '80s
- Taken to prototype stage, but never shipped
- Draws a distinct division between vector and scalar processing
  • control and parallel processors have totally different memories (for both insts. and data)

Burroughs BSP

- Control (scalar) processor
  • processes all instructions from control memory
  • 80 ns clock => up to 1.5 MFLOPS
- Parallel (vector) processor
  • 16 processors
  • 160 ns clock
  • 2 cp latency for major FP operations
  • pipelined at a high level

BSP FAPAS Pipeline

- Stages: Fetch, Align, Process, Align, Store (FAPAS)

[Diagram: Fetch (from Memory) -> Align (Input Alignment Net) -> Process (Processor) -> Align (Output Alignment Net) -> Store (to Memory)]

Example

- Example: D = A*B + C

    Cycle:    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    Fetch:    A1  B1  C1  A2  B2  C2  A3  B3  C3  A4  B4
    Align:        A1  B1  C1  A2  B2  C2  A3  B3  C3  A4
    Process:          *   *   +   +   *   *   +   +   *   *   +
    Align:                            D1              D2
    Store:                                D1              D2

- Note that memory bandwidth and FP bandwidth are both fully committed for the triad.

FAPAS Pipeline, contd.

- Throughput: 1 FP op per 320 ns x 16 ops in parallel
  => 50 MFLOPS peak
  => big difference between performance in scalar and vector modes
- Also, many scalar arithmetic operations were done in the vector unit because of the partitioning of memory

Architecture

- Memory to Memory
- Unlimited vector lengths
  • (stripmining is automatic in hardware)
- Up to 5 input operands per instruction (pentads)
  • takes into account all evaluation trees
  • refer to table 1 for list of forms
- Any fixed stride
- Supports two levels of loop nesting

Example:

    VFORM    TRIAD, op1, op2
    OBV              (operand bit vector)
    RBV              (result bit vector)
    VOPERAND A
    VOPERAND B
    VOPERAND C
    VRESULT  Z

    => RBV,Z = A op1 B op2 C, OBV

FORTRAN Example

- Livermore Fortran Kernel 1, Hydro Excerpt
  • inner loop: x(k) = u(k)*(r*z(k+10) + t*z(k+11))

    VLEN     "one level of nesting, length 100"
    VFORM    PENTAD2, *, +, *, *    (no bit vectors: no OBV, no RBV)
    VOPERAND r        (broadcast)
    VOPERAND z+10     (stride 1)
    VOPERAND t        (broadcast)
    VOPERAND z+11     (stride 1)
    VOPERAND u        (stride 1)
    VRESULT  x        (stride 1)

Architecture, contd.

- Strided loads/stores
  • Implemented with a prime memory system to avoid conflicts
- Sparse matrices
  • compress
  • expand
  • random fetch
  • random store
- IF statements
  • bit vectors embedded in vector insts.
- Recurrences/reductions
  • special instructions
- Chaining
  • built into instructions via multi-operands
  • saves loads and stores in a mem-to-mem arch.
  • 10 temp registers per arithmetic unit

Template Processing

- Template Descriptor Memory
  • control memory: 120 bits, 16 entries
  • fields filled by the control processor
- Vector Data Buffer
  • from 16 GPR's (if needed)
  • shipped to vector unit
- Vector Input and Validation Unit
  • assembles a sequence of instructions into a global description of the operation
  • checks memory hazards
- Template Control Unit
  • 16 entries
  • reads descriptors from Template Descriptor Memory (written by the Vector Input and Validation Unit) and controls the F-A-P-A-S pipeline stages

[Diagram: Vector Input and Validation Unit writes Template Descriptor Memory; Template Control Unit reads it and drives the F, A, P, A, S stages]
