Chap. 9 Pipeline and Vector Processing

9-1 Parallel Processing

 Simultaneous data processing tasks for the purpose of increasing the computational speed
 Perform concurrent data processing to achieve faster execution time
Parallel Processing Example
 Multiple Functional Units : Fig. 9-1
» Separate the execution unit into eight functional units operating in parallel
 Computer Architectural Classification
 Data-Instruction Stream : Flynn
 Serial versus Parallel Processing : Feng
 Parallelism and Pipelining : Händler
 Flynn’s Classification
 1) SISD (Single Instruction - Single Data stream)
» for practical purposes: only one processor is useful
» Example systems : Amdahl 470V/6, IBM 360/91
[Fig. 9-1: processor registers feed eight functional units operating in parallel - adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, floating-point divide - with results going to memory]
[SISD organization: CU -- IS --> PU -- DS --> MM]
 2) SIMD (Single Instruction - Multiple Data stream)
» A form well suited to vector or array operations
 One vector operation includes many operations on a data stream
» Example systems : CRAY-1, ILLIAC-IV
[SIMD organization: one CU broadcasts the IS to PU 1 … PU n; each PU has its own data stream DS 1 … DS n to memory modules MM1 … MMn of a shared memory]
 3) MISD (Multiple Instruction - Single Data stream)
» Not used in practice because the single data stream becomes a bottleneck
[MISD organization: CU 1 … CU n each issue their own IS 1 … IS n to PU 1 … PU n, all of which operate on a single DS through the shared memory MM1 … MMn]
 4) MIMD (Multiple Instruction - Multiple Data stream)
» Used by most multiprocessor systems
[MIMD organization: CU 1 … CU n each issue their own IS 1 … IS n to PU 1 … PU n, which access the shared memory MM1 … MMn over separate data streams]
 Main topics in this Chapter
 Pipeline processing : Sec. 9-2
» Arithmetic pipeline : Sec. 9-3
» Instruction pipeline : Sec. 9-4
 Vector processing : uses an adder/multiplier pipeline, Sec. 9-6
 Array processing : uses a separate array processor, Sec. 9-7
 For computation on large vectors, matrices, and array data
» Attached array processor : Fig. 9-14
» SIMD array processor : Fig. 9-15
9-2 Pipelining
 Principle of pipelining
 Decomposing a sequential process into suboperations
 Each subprocess is executed in a special dedicated segment concurrently
 Pipelining example : Fig. 9-2
 Multiply and add operation : Ai * Bi + Ci ( for i = 1, 2, …, 7 )
 Decomposed into 3 suboperation segments (a minimal C sketch follows below)
» 1) R1 ← Ai, R2 ← Bi : Input Ai and Bi
» 2) R3 ← R1 * R2, R4 ← Ci : Multiply and input Ci
» 3) R5 ← R3 + R4 : Add Ci
 Content of registers in pipeline example : Tab. 9-1
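A minimal C sketch of the three-segment sequence above. Each loop iteration models one clock period, with segments updated in reverse order to mimic simultaneous register transfers; the array contents are assumed sample data, not taken from Fig. 9-2.

#include <stdio.h>

#define N 7   /* seven (Ai, Bi, Ci) triples, as in the example */

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7};      /* assumed sample data */
    double B[N] = {7, 6, 5, 4, 3, 2, 1};
    double C[N] = {1, 1, 1, 1, 1, 1, 1};
    double R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    /* Each task needs 3 clocks to flow through, but a new task enters every clock. */
    for (int t = 0; t < N + 2; t++) {
        if (t >= 2) {
            R5 = R3 + R4;                               /* segment 3: add Ci */
            printf("clock %d: A%dB%d + C%d = %g\n", t + 1, t - 1, t - 1, t - 1, R5);
        }
        if (t >= 1) { R3 = R1 * R2; R4 = C[t - 1]; }    /* segment 2: multiply, input Ci */
        if (t < N)  { R1 = A[t];    R2 = B[t];     }    /* segment 1: input Ai and Bi    */
    }
    return 0;
}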
 General considerations
 4 segment pipeline : Fig. 9-3
» S : Combinational circuit for a suboperation
» R : Register (intermediate results between the segments)
 Space-time diagram : Fig. 9-4
» Shows segment utilization (segment number versus clock cycle) as a function of time
 Task : T1, T2, T3, …, T6
» The total operation performed in going through all the segments
 Speedup S : Nonpipeline / Pipeline
 S = n • tn / ( ( k + n - 1 ) • tp ) = ( 6 • 6 tp ) / ( ( 4 + 6 - 1 ) • tp ) = 36 tp / 9 tp = 4
 n : number of tasks ( 6 )
 tn : time to complete each task in the nonpipelined unit ( 6 cycle times = 6 tp )
 tp : clock cycle time ( 1 clock cycle )
 k : number of segments ( 4 )
 Processing time in the pipeline = k + n - 1 = 9 clock cycles
 If n → ∞, then S = tn / tp
» Assuming the time to process one task is the same in both cases, i.e., nonpipeline ( tn ) = pipeline ( k • tp ), then S = tn / tp = k • tp / tp = k
» So the speedup is theoretically a factor of k (the number of segments)
(A small C check of this formula follows the space-time diagram below.)
[Fig. 9-4 space-time diagram : k = 4 segments, n = 6 tasks, k + n - 1 = 9 clock cycles]
Clock cycle : 1   2   3   4   5   6   7   8   9
Segment 1 :   T1  T2  T3  T4  T5  T6
Segment 2 :       T1  T2  T3  T4  T5  T6
Segment 3 :           T1  T2  T3  T4  T5  T6
Segment 4 :               T1  T2  T3  T4  T5  T6
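A small C check of the speedup formula, using the same parameter values as the example above (k = 4, n = 6, tn = 6 tp):

#include <stdio.h>

/* Speedup of a k-segment pipeline over a nonpipelined unit:
   S = (n * tn) / ((k + n - 1) * tp) */
static double speedup(int n, int k, double tn, double tp) {
    return (n * tn) / ((k + n - 1) * tp);
}

int main(void) {
    double tp = 1.0;                                   /* one clock cycle */
    printf("S = %.2f\n", speedup(6, 4, 6 * tp, tp));   /* prints S = 4.00 */
    /* As n grows large, S approaches tn / tp; if tn = k * tp, S approaches k. */
    printf("S(n = 100000) = %.2f\n", speedup(100000, 4, 4 * tp, tp));  /* close to k = 4 */
    return 0;
}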
 Pipelines include the arithmetic pipeline (Sec. 9-3) and the instruction pipeline (Sec. 9-4)
9-3 Arithmetic Pipeline
 Floating-point Adder Pipeline Example : Fig. 9-6
 Add / Subtract two normalized floating-point binary numbers
» X = A x 2^a = 0.9504 x 10^3
» Y = B x 2^b = 0.8200 x 10^2
 4 segment suboperations
» 1) Compare exponents by subtraction : 3 - 2 = 1
 X = 0.9504 x 10^3
 Y = 0.8200 x 10^2
» 2) Align mantissas
 X = 0.9504 x 10^3
 Y = 0.08200 x 10^3
» 3) Add mantissas
 Z = 1.0324 x 10^3
» 4) Normalize result
 Z = 0.10324 x 10^4
[Fig. 9-6: Segment 1 receives the exponents a, b and mantissas A, B and compares the exponents by subtraction; Segment 2 chooses the larger exponent and aligns the mantissas using the difference; Segment 3 adds or subtracts the mantissas; Segment 4 adjusts the exponent and normalizes the result, with registers R between the segments]
(A minimal C sketch of these four suboperations follows below.)
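A minimal C sketch of the four suboperations, using base-10 mantissa/exponent pairs so the numbers match the 0.9504 x 10^3 + 0.8200 x 10^2 example; the struct and function names are assumptions for illustration, not part of Fig. 9-6.

#include <stdio.h>

struct fp { double m; int e; };                 /* value = m * 10^e, 0.1 <= |m| < 1 */

static struct fp fp_add(struct fp x, struct fp y) {
    /* Segment 1: compare exponents by subtraction, keep the larger one in x */
    if (x.e < y.e) { struct fp t = x; x = y; y = t; }
    int diff = x.e - y.e;                       /* 3 - 2 = 1 in the example */

    /* Segment 2: align the mantissa of the smaller number */
    for (int i = 0; i < diff; i++) y.m /= 10.0; /* Y becomes 0.08200 x 10^3 */

    /* Segment 3: add (or subtract) mantissas */
    struct fp z = { x.m + y.m, x.e };           /* Z = 1.0324 x 10^3 */

    /* Segment 4: normalize the result and adjust the exponent */
    while (z.m >= 1.0) { z.m /= 10.0; z.e++; }  /* Z = 0.10324 x 10^4 */
    while (z.m != 0.0 && z.m < 0.1) { z.m *= 10.0; z.e--; }
    return z;
}

int main(void) {
    struct fp x = {0.9504, 3}, y = {0.8200, 2};
    struct fp z = fp_add(x, y);
    printf("Z = %.5f x 10^%d\n", z.m, z.e);     /* Z = 0.10324 x 10^4 */
    return 0;
}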
9-4 Instruction Pipeline
 Instruction Cycle
1) Fetch the instruction from memory
2) Decode the instruction
3) Calculate the effective address
4) Fetch the operands from memory
5) Execute the instruction
6) Store the result in the proper place
 Example : Four-segment Instruction Pipeline
 Four-segment CPU pipeline : Fig. 9-7
» 1) FI : Instruction Fetch
» 2) DA : Decode Instruction & calculate EA
» 3) FO : Operand Fetch
» 4) EX : Execution
[Fig. 9-7: Segment 1 fetches the instruction from memory; Segment 2 decodes the instruction and calculates the effective address, then tests for a branch; Segment 3 fetches the operand from memory; Segment 4 executes the instruction; the pipeline then checks for an interrupt, empties the pipe for interrupt handling if necessary, and updates the PC]
 Timing of Instruction Pipeline : Fig. 9-8
» Instruction 3 is a branch instruction
[Fig. 9-8: space-time diagram over 13 steps showing FI, DA, FO, EX for seven instructions, with the no-branch and branch cases; because instruction 3 is a branch, the instructions fetched after it cannot proceed until the branch is executed, so the pipeline runs with empty slots before it refills]
 Pipeline Conflicts : 3 major difficulties
 1) Resource conflicts
» memory access by two segments at the same time
 2) Data dependency
» an instruction depends on the result of a previous instruction, but this result is not yet available
 3) Branch difficulties
» branch and other instructions (interrupt, ret, …) that change the value of the PC
 Solutions to data dependency
 Hardware approaches
» Hardware interlock
 A hardware-inserted delay stalls the pipeline until the result of the previous instruction is available
» Operand forwarding
 The result of the previous instruction is passed directly to the ALU (normally it would first go through a register)
 Software approach
» Delayed load
 The compiler inserts no-operation instructions until the result of the previous instruction is available
 Handling of Branch Instructions
 Prefetch target instruction
» In a conditional branch, fetch both the branch target instruction (condition satisfied) and the next sequential instruction (condition not satisfied)
 Branch Target Buffer : BTB
» 1) Using an associative memory, several instructions following the branch target address are stored in the BTB in advance
» 2) When a branch instruction is encountered, the BTB is searched first; if the target is already in the BTB, the instructions are taken directly from it (a cache-like concept; a C sketch of this lookup follows below)
 Loop Buffer
» 1) A small, very high-speed register file (RAM) is used to detect loops in the program
» 2) When a loop is found, the entire loop is loaded into the loop buffer and executed without accessing external memory
 Branch Prediction
» Additional hardware logic is used to predict the outcome of the branch
 Delayed Branch
 Handles the case where a branch instruction delays the pipeline operation, as in Fig. 9-8
 Example : Fig. 9-10, p. 318, Sec. 9-5
» 1) Insert no-operation instructions
» 2) Rearrange the instructions : with compiler support

[Fig. 9-10(a) Using no-operation instructions]
Clock cycles :      1  2  3  4  5  6  7  8  9  10
1. Load             I  A  E
2. Increment           I  A  E
3. Add                    I  A  E
4. Subtract                  I  A  E
5. Branch to X                  I  A  E
6. No-operation                    I  A  E
7. No-operation                       I  A  E
8. Instruction in X                      I  A  E

[Fig. 9-10(b) Rearranging the instructions]
Clock cycles :      1  2  3  4  5  6  7  8
1. Load             I  A  E
2. Increment           I  A  E
3. Branch to X            I  A  E
4. Add                       I  A  E
5. Subtract                     I  A  E
6. Instruction in X                I  A  E
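A minimal C sketch of the branch target buffer idea, using a small direct-mapped table instead of a true associative memory; the table size, field names, and the example values in main are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 64

/* Each entry remembers a branch's address, its target address, and the
   instruction already fetched from that target (kept like a cache line). */
struct btb_entry {
    bool     valid;
    uint32_t branch_pc;
    uint32_t target_pc;
    uint32_t target_instr;
};

static struct btb_entry btb[BTB_ENTRIES];

/* On meeting a branch: a hit means the target instruction is available at once;
   a miss means it must be fetched from memory and then cached with btb_fill(). */
static bool btb_lookup(uint32_t branch_pc, uint32_t *target_pc, uint32_t *target_instr) {
    struct btb_entry *e = &btb[branch_pc % BTB_ENTRIES];
    if (e->valid && e->branch_pc == branch_pc) {
        *target_pc = e->target_pc;
        *target_instr = e->target_instr;
        return true;
    }
    return false;
}

static void btb_fill(uint32_t branch_pc, uint32_t target_pc, uint32_t target_instr) {
    struct btb_entry *e = &btb[branch_pc % BTB_ENTRIES];
    e->valid = true;
    e->branch_pc = branch_pc;
    e->target_pc = target_pc;
    e->target_instr = target_instr;
}

int main(void) {
    uint32_t tgt_pc, tgt_ins;
    btb_fill(0x1000, 0x2000, 0xDEADBEEF);              /* assumed example values */
    return btb_lookup(0x1000, &tgt_pc, &tgt_ins) ? 0 : 1;
}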
9-5 RISC Pipeline
 Characteristics of a RISC CPU
 Uses an instruction pipeline
 Single-cycle instruction execution
 Compiler support
 Example : Three-segment Instruction Pipeline
 Instruction cycle with 3 suboperations
» 1) I : Instruction fetch
» 2) A : Instruction decoded and ALU operation
» 3) E : Transfer the output of ALU to a register, memory, or PC
 Delayed Load : Fig. 9-9(a)
» A conflict occurs at the 3rd instruction (ADD R1 + R2)
 In the 4th clock cycle the 2nd instruction (LOAD R2) is still in its E stage while the 3rd instruction already operates on R2 in its A stage
» Solution by delayed load : Fig. 9-9(b)
 Insert a no-operation instruction (a minimal C sketch of this compiler pass follows below)

[Fig. 9-9(a) Pipeline timing with data conflict]
Clock cycles :      1  2  3  4  5  6
1. Load R1          I  A  E
2. Load R2             I  A  E
3. Add R1+R2              I  A  E
4. Store R3                  I  A  E

[Fig. 9-9(b) Pipeline timing with delayed load]
Clock cycles :      1  2  3  4  5  6  7
1. Load R1          I  A  E
2. Load R2             I  A  E
3. No-operation           I  A  E
4. Add R1+R2                 I  A  E
5. Store R3                     I  A  E

 Delayed Branch : already covered in Sec. 9-4
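A minimal C sketch of the delayed-load idea from Fig. 9-9: a tiny compiler-style pass that emits a no-operation whenever an instruction uses a register loaded by the immediately preceding instruction. The instruction encoding here is made up for the example, not a real ISA.

#include <stdio.h>

/* A toy three-address instruction: op dst, src1, src2 (register numbers, -1 = unused). */
struct instr { const char *op; int dst, src1, src2; };

static int uses(struct instr i, int reg) {
    return reg >= 0 && (i.src1 == reg || i.src2 == reg);
}

int main(void) {
    /* The conflicting sequence of Fig. 9-9(a). */
    struct instr prog[] = {
        {"LOAD",  1, -1, -1},   /* 1. Load R1  */
        {"LOAD",  2, -1, -1},   /* 2. Load R2  */
        {"ADD",   3,  1,  2},   /* 3. Add R1+R2 -> R3 (needs R2 one cycle too early) */
        {"STORE", -1, 3, -1},   /* 4. Store R3 */
    };
    int n = sizeof prog / sizeof prog[0];

    for (int i = 0; i < n; i++) {
        /* If the previous instruction is a load whose result this one uses,
           emit a NOP first, mimicking what the compiler does in Fig. 9-9(b). */
        if (i > 0 && prog[i - 1].op[0] == 'L' && uses(prog[i], prog[i - 1].dst))
            printf("NOP\n");
        printf("%s R%d\n", prog[i].op, prog[i].dst >= 0 ? prog[i].dst : prog[i].src1);
    }
    return 0;
}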
9-6 Vector Processing
 Science and Engineering Applications
 Long-range weather forecasting, Petroleum explorations, Seismic data analysis,
Medical diagnosis, Aerodynamics and space flight simulations, Artificial
intelligence and expert systems, Mapping the human genome, Image processing
 Vector Operations
 Arithmetic operations on large arrays of numbers
 Conventional scalar processor
» Machine language
Initialize I = 0
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I ≤ 100 go to 20
Continue
» Fortran language
DO 20 I = 1, 100
20 C(I) = A(I) + B(I)
 Vector processor
» Single vector instruction
C(1:100) = A(1:100) + B(1:100)
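A C rendering of the same comparison; the loop body is what the Fortran DO loop does element by element, while the comment at the end notes what the single vector instruction replaces. The sample data in main is an assumption.

#include <stdio.h>

#define N 100

/* Scalar processor: one read A(I), read B(I), add, store C(I) per iteration,
   plus the loop-counter update and branch - the overhead a vector instruction removes. */
void scalar_add(double c[N], const double a[N], const double b[N]) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];              /* C(I) = A(I) + B(I) */
}

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   /* sample data */
    scalar_add(c, a, b);
    printf("c[99] = %g\n", c[99]);       /* 297 */
    /* On a vector processor the whole loop above is one instruction:
       C(1:100) = A(1:100) + B(1:100), with operands streamed through a pipelined adder. */
    return 0;
}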
 Vector Instruction Format : Fig. 9-11
Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length
ADD            | A                     | B                     | C                        | 100
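A C struct mirroring the five fields of the Fig. 9-11 format; the field widths are not specified in the figure, and the addresses and opcode value used below are assumptions.

#include <stdio.h>
#include <stdint.h>

/* One vector instruction as laid out in Fig. 9-11. */
struct vector_instr {
    uint8_t  opcode;        /* operation code, e.g. ADD          */
    uint32_t base_src1;     /* base address of source 1 (A)      */
    uint32_t base_src2;     /* base address of source 2 (B)      */
    uint32_t base_dst;      /* base address of destination (C)   */
    uint32_t length;        /* vector length, 100 in the example */
};

int main(void) {
    /* C(1:100) = A(1:100) + B(1:100) encoded with assumed addresses and opcode. */
    struct vector_instr vadd = { 1 /* ADD */, 0x1000, 0x2000, 0x3000, 100 };
    printf("op=%u A=0x%X B=0x%X C=0x%X len=%u\n",
           (unsigned)vadd.opcode, (unsigned)vadd.base_src1, (unsigned)vadd.base_src2,
           (unsigned)vadd.base_dst, (unsigned)vadd.length);
    return 0;
}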
 Matrix Multiplication
 3 x 3 matrix multiplication : n^2 = 9 inner products

| a11 a12 a13 |   | b11 b12 b13 |   | c11 c12 c13 |
| a21 a22 a23 | x | b21 b22 b23 | = | c21 c22 c23 |
| a31 a32 a33 |   | b31 b32 b33 |   | c31 c32 c33 |

» c11 = a11 b11 + a12 b21 + a13 b31 : there are 9 such inner products
 Cumulative multiply-add operation : n^3 = 27 multiply-adds
 c ← c + a • b
» c11 ← c11 + a11 b11 + a12 b21 + a13 b31 : 3 multiply-adds per inner product, so 9 x 3 multiply-adds = 27 in total
» The initial value of c11 is 0
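A C version of the counting argument above: the triple loop performs exactly n^3 = 27 multiply-add steps, three for each of the n^2 = 9 inner products, with every c[i][j] initialized to 0. The matrix values are assumed sample data.

#include <stdio.h>

#define N 3

int main(void) {
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};   /* assumed sample matrices */
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};
    int multiply_adds = 0;

    for (int i = 0; i < N; i++)            /* 9 inner products (one per c[i][j]) */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {  /* 3 multiply-adds per inner product  */
                c[i][j] += a[i][k] * b[k][j];   /* c <- c + a * b */
                multiply_adds++;
            }

    printf("c11 = %g, total multiply-adds = %d\n", c[0][0], multiply_adds);  /* 27 */
    return 0;
}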
 Pipeline for calculating an inner product : Fig. 9-12
 Floating-point multiplier pipeline : 4 segments
 Floating-point adder pipeline : 4 segments
 Example ) C = A1 B1 + A2 B2 + A3 B3 + … + Ak Bk
» After the 1st clock input : A1 B1 has entered the multiplier pipeline
» After the 4th clock input : A4 B4, A3 B3, A2 B2, A1 B1 fill the four multiplier segments
» After the 8th clock input : A8 B8 … A5 B5 occupy the multiplier pipeline while A4 B4 … A1 B1 occupy the adder pipeline
» After the 9th, 10th, 11th, … clock inputs : products leaving the multiplier are added to the partial sums leaving the adder (e.g. A1 B1 + A5 B5, A2 B2 + A6 B6, …), which are fed back into the adder pipeline
[Fig. 9-12: Source A and Source B feed the 4-segment multiplier pipeline; its output and the fed-back adder output feed the 4-segment adder pipeline]
» Four-section summation
C = ( A1 B1 + A5 B5 + A9 B9  + A13 B13 + … )
  + ( A2 B2 + A6 B6 + A10 B10 + A14 B14 + … )
  + ( A3 B3 + A7 B7 + A11 B11 + A15 B15 + … )
  + ( A4 B4 + A8 B8 + A12 B12 + A16 B16 + … )
(A short C sketch of this four-way accumulation follows below.)
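A short C sketch of the four-section summation: because the adder pipeline is 4 segments deep, the running sums circulating through it naturally form four interleaved partial sums, which are combined at the end. The array contents and the value of K are assumed sample data.

#include <stdio.h>

#define K 16   /* number of product terms, assumed for the example */

int main(void) {
    double A[K], B[K], partial[4] = {0, 0, 0, 0};

    for (int i = 0; i < K; i++) { A[i] = i + 1; B[i] = 1.0; }  /* sample data */

    /* Product Ai*Bi joins partial sum i mod 4, mimicking the 4-segment adder
       pipeline: section 0 holds A1B1 + A5B5 + A9B9 + A13B13 + ..., and so on. */
    for (int i = 0; i < K; i++)
        partial[i % 4] += A[i] * B[i];

    double C = partial[0] + partial[1] + partial[2] + partial[3];
    printf("C = %g\n", C);   /* equals the full inner product, 136 here */
    return 0;
}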
 Memory Interleaving : Fig. 9-13
 Simultaneous access to memory from two or more sources using one memory bus system
 The low-order 2 bits of the AR select one of the 4 memory modules
 Example ) Even / odd address memory access
[Fig. 9-13: the address bus feeds four memory modules, each with its own address register AR, memory array, and data register DR on a common data bus]
(A short C sketch of the module-selection rule follows below.)
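A short C sketch of the module-selection rule: the two low-order address bits pick one of the four modules, so consecutive addresses fall in different modules and can be accessed simultaneously. The module/word split shown is an assumption consistent with Fig. 9-13.

#include <stdio.h>
#include <stdint.h>

#define MODULES 4
#define WORDS_PER_MODULE 256   /* assumed module size */

static uint32_t memory[MODULES][WORDS_PER_MODULE];

/* Decompose an address: the low 2 bits select the module, and that module's AR
   receives the remaining high-order bits as the word address. */
static uint32_t read_word(uint32_t addr) {
    uint32_t module = addr & 0x3;          /* low-order 2 bits */
    uint32_t word   = addr >> 2;           /* remaining bits   */
    return memory[module][word];
}

int main(void) {
    for (uint32_t addr = 0; addr < 8; addr++)
        printf("address %u -> module %u, word %u\n", addr, addr & 0x3, addr >> 2);
    return read_word(5) == 0 ? 0 : 1;      /* memory is zero-initialized */
}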
 Supercomputer
 Supercomputer = Vector Instruction + Pipelined floating-point arithmetic
 Performance Evaluation Index
» MIPS : Millions of Instructions Per Second
» FLOPS : Floating-point Operations Per Second
 megaflops : 10^6, gigaflops : 10^9
 Cray supercomputer : Cray Research
» Cray-1 : 80 megaflops, 4 million 64-bit words of memory
» Cray-2 : 12 times more powerful than the Cray-1
 VP supercomputer : Fujitsu
» VP-200 : 300 megaflops, 32 million words of memory, 83 vector instructions, 195 scalar instructions
» VP-2600 : 5 gigaflops
9-7 Array Processors
 Performs computations on large arrays of data
 Vector processing : uses an adder/multiplier pipeline
 Array processing : uses a separate array processor
 Array Processing
 Attached array processor : Fig. 9-14
» Auxiliary processor attached to a general-purpose computer
[Fig. 9-14: a general-purpose computer with main memory is connected through an input-output interface and a high-speed memory-to-memory bus to the attached array processor and its local memory]
 SIMD array processor : Fig. 9-15
» Computer with multiple processing units operating in parallel
» In the vector computation C = A + B, each element operation ci = ai + bi is executed simultaneously in its own PEi (a thread-based sketch of this idea follows below)
[Fig. 9-15: a master control unit and main memory drive processing elements PE 1 … PE n, each with its own local memory M1 … Mn]
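A minimal sketch of the SIMD idea using one POSIX thread per processing element, so every PEi computes its own ci = ai + bi at (conceptually) the same time. A real SIMD array processor does this in lock-step under a single control unit rather than with independent threads, and the data values here are assumed.

#include <pthread.h>
#include <stdio.h>

#define N 4                       /* number of processing elements, assumed */

static double a[N] = {1, 2, 3, 4}, b[N] = {10, 20, 30, 40}, c[N];

/* Each thread plays the role of one PE operating on its own element. */
static void *pe(void *arg) {
    int i = *(int *)arg;
    c[i] = a[i] + b[i];           /* ci = ai + bi */
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int idx[N];

    for (int i = 0; i < N; i++) { idx[i] = i; pthread_create(&t[i], NULL, pe, &idx[i]); }
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);

    for (int i = 0; i < N; i++) printf("c[%d] = %g\n", i, c[i]);
    return 0;
}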