Medical Image Processing Lab. (醫學影像處理實驗室) Chuan

Chapter Six
Enhancing Performance with Pipelining
Instructor: Chuan-Yu Chang (張傳育), Ph.D.
E-mail: [email protected]
Tel: (05)5342601 ext. 4337
Medical Image Processing Lab. (醫學影像處理實驗室), Chuan-Yu Chang Ph.D.
An Overview of Pipelining
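• Pipelining is an implementation technique in which the execution of multiple instructions is overlapped
• For example
– A 4-stage pipeline (assuming every stage takes the same amount of time), processing four tasks
• Without pipelining, the work takes 4 × 4 = 16 time units
• With pipelining, it takes 4 + (4 − 1) = 7 time units
• Pipelining does not shorten the execution time of a single instruction; it increases instruction throughput
• If every stage takes the same time and there is enough work to do, the speedup from pipelining is approximately equal to the number of pipeline stages.
• For example
– With many loads of laundry to wash, pipelining is very efficient
– With only one load of laundry, the total time is the same with or without pipelining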
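The 16-versus-7 comparison can be checked with a short sketch. It assumes four tasks, four stages, and one time unit per stage; the function names are illustrative and not part of the lecture.

```python
def unpipelined_time(num_tasks: int, num_stages: int) -> int:
    """Each task runs through every stage before the next task starts."""
    return num_tasks * num_stages


def pipelined_time(num_tasks: int, num_stages: int) -> int:
    """The first task fills the pipeline; each later task finishes one stage-time after it."""
    return num_stages + (num_tasks - 1)


print(unpipelined_time(4, 4))  # 16
print(pipelined_time(4, 4))    # 7
print(pipelined_time(1, 4))    # 4, no gain for a single task

# With a lot of work the speedup approaches the number of stages.
speedup = unpipelined_time(1000, 4) / pipelined_time(1000, 4)
print(round(speedup, 2))
```

With a single task both functions return the same value, which matches the one-load laundry case.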
An Overview of Pipelining
(Figure: the laundry analogy for pipelining, with washer and dryer stages)
An Overview of Pipelining
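• MIPS instructions are classically divided into the following five pipeline stages:
– Fetch the instruction from instruction memory
– Read registers while decoding the instruction
– Execute the operation (R-type) or calculate an address (memory access)
– Access an operand in data memory
– Write the result back into a register
• Executing MIPS instructions in a pipeline therefore requires five pipeline stages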
An Overview of Pipelining
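• Example: single-cycle versus pipelined performance (page 438)
(1) Single-cycle, non-pipelined execution:
– Execution time of one instruction = 8 ns
– Execution time of three instructions = 3 × 8 = 24 ns
(2) Pipelined execution:
– Every pipeline stage takes the same time (a single clock cycle)
– The clock cycle must be long enough to accommodate the slowest stage
– Stage length = 2 ns
– Execution time = time to complete the first instruction + (n − 1) × stage length
= 10 + (3 − 1) × 2 = 14 ns

Instruction class                  Instruction fetch  Register read  ALU operation  Data access  Register write  Total time
Load word (lw)                     2 ns               1 ns           2 ns           2 ns         1 ns            8 ns
Store word (sw)                    2 ns               1 ns           2 ns           2 ns                         7 ns
R-format (add, sub, and, or, slt)  2 ns               1 ns           2 ns                        1 ns            6 ns
Branch (beq)                       2 ns               1 ns           2 ns                                        5 ns

Only these eight instructions are considered.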
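As a rough check of this example, the sketch below derives both clock periods from the instruction-class timing table and reproduces the 24 ns and 14 ns totals. The dictionary layout and function names are assumptions made for illustration; a 0 marks a stage that the instruction class does not use.

```python
# Stage latencies in ns: fetch, register read, ALU, data access, register write.
STAGE_NS = {
    "lw":       [2, 1, 2, 2, 1],
    "sw":       [2, 1, 2, 2, 0],
    "R-format": [2, 1, 2, 0, 1],
    "beq":      [2, 1, 2, 0, 0],
}
NUM_STAGES = 5

# Single-cycle design: the clock must cover the slowest whole instruction.
single_cycle_clock = max(sum(stages) for stages in STAGE_NS.values())
# Pipelined design: the clock must cover the slowest single stage.
pipelined_clock = max(max(stages) for stages in STAGE_NS.values())


def single_cycle_time(num_instructions: int) -> int:
    return num_instructions * single_cycle_clock


def pipelined_time(num_instructions: int) -> int:
    # First instruction takes NUM_STAGES cycles, each later one adds one cycle.
    return (NUM_STAGES + num_instructions - 1) * pipelined_clock


print(single_cycle_clock, pipelined_clock)        # 8 2
print(single_cycle_time(3), pipelined_time(3))    # 24 14
```

Note that the pipelined lw takes 5 × 2 = 10 ns on its own, longer than the 8 ns single-cycle lw; the gain comes from throughput.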
An Overview of Pipelining
Fig 6.3 Single-cycle, non-pipelined execution (top) vs. pipelined execution (bottom) of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). In the top panel a new instruction starts every 8 ns; in the bottom panel a new instruction starts every 2 ns.
• What if there is a branch instruction?
"Holes" will appear in the pipeline.
An Overview of Pipelining (cont.)
• Under ideal conditions
– The speedup from pipelining equals the number of pipe
stages.
• In fact
– Pipelining involves some overhead
– The time per instruction in the pipelined machine will exceed
the minimum possible, and the speed up will be less than the
number of pipeline stages.
• Pipelining improves performance by increasing
instruction throughput, as opposed to decreasing the
execution time of an individual instruction.
An Overview of Pipelining
• Designing Instruction Sets for Pipelining
– all instructions are the same length
– just a few instruction formats, with the source register fields
being located in the same place in each instruction.
– memory operands appear only in loads and stores
– Operands must be aligned in memory
An Overview of Pipelining
• Pipeline Hazards
A situation in which the next instruction cannot execute in the following clock cycle of the pipeline is called a "hazard".
• Three types of hazards
– Structural hazards:
– suppose we had only one memory
– Control hazards:
– need to worry about branch instructions
– Data hazards:
– an instruction depends on a previous instruction
An Overview of Pipelining
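• 1. Structural hazards
The hardware does not have enough resources to support the combination of instructions that need to execute in the same clock cycle.
• Example
Suppose we had a single memory instead of two separate memories. If the pipeline in Fig. 6.3 had a fourth instruction, then in one clock cycle the first instruction would be accessing data in memory while the fourth instruction was being fetched from that same memory. Two instructions accessing the same memory at the same time is called a "structural hazard".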
An Overview of Pipelining
2. Control hazards
A control hazard arises when a decision must be made based on the result of one instruction while other instructions are already executing.
• Example: the lw instruction in Fig. 6.4:
(Fig. 6.4: pipelined execution of add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). beq starts 2 ns after add, but lw starts 4 ns after beq.)
• Solution 1: stall
– Assume we have enough extra hardware to test the registers, calculate the branch address, and update the PC during the second pipeline stage.
– The lw instruction is delayed by one extra 2-ns clock cycle. This delay is called a "pipeline stall", also known as a "bubble".
An Overview of Pipelining
• Solution 2: predict
– Predict that the branch condition will never hold (branch not taken)
– When the prediction is correct, the pipeline runs at full speed (Fig. 6.5 (a))
– The pipeline stalls only when the branch is taken (Fig. 6.5 (b))
Fig 6.5 Predicting that branches are not taken as a solution to control hazards. (a) Branch not taken: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) start 2 ns apart. (b) Branch taken: a bubble follows beq, so or $7, $8, $9 starts 4 ns after beq.
An Overview of Pipelining
• Solution 3: delayed decision
– Some instructions must execute whether or not the branch is taken (safe instructions), and they do not affect the correctness of the pipeline, so they can be placed in the clock cycle that would otherwise be a stall.
– MIPS places a safe instruction in the slot immediately after the branch instruction.
(Figure: beq $1, $2, 40; add $4, $5, $6 in the delayed branch slot; lw $3, 300($0). The three instructions start 2 ns apart.)
An Overview of Pipelining
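3. Data hazards
An instruction in the pipeline needs a result produced by an earlier instruction that is still in the pipeline.
• Example
add $s0, $t0, $t1
sub $t2, $s0, $t3
The add instruction does not write its result back to the register file until the fifth pipeline stage, which means three bubbles would have to be inserted.
• The solution to this data hazard is not to wait for the instruction to complete, but to pass the ALU result directly to the next instruction. Getting the data early from internal resources in this way is called forwarding or bypassing (Fig. 6.8).
(Fig. 6.8: add $s0, $t0, $t1 and sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB one cycle apart. Shading on the left half of a stage indicates a write; shading on the right half indicates a read.)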
An Overview of Pipelining
• Example
lw $s0, 20($t1)
sub $t2, $s0, $t3
Even with forwarding, the pipeline still has to stall, because a memory-reference instruction does not complete its data memory access until the fourth stage.
(Figure: lw $s0, 20($t1) passes through IF, ID, EX, MEM, WB; a bubble is inserted before sub $t2, $s0, $t3 proceeds through IF, ID, EX, MEM, WB.)
A Pipeline Datapath
• The five stages of MIPS instruction execution
– IF: instruction fetch
– ID: instruction decode and register file read
– EX: execute or effective memory address calculation
– MEM: data memory access
– WB: write back
• Fig. 6.9 shows the single-cycle datapath divided into the five pipeline stages (same as Fig. 5.17)
(Fig. 6.9: the datapath with each portion labeled by its stage: IF, ID, EX, MEM, WB.)
A Pipeline Datapath
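• Up to five instructions are in execution during any single clock cycle
• Example: see Fig. 6.10
(Fig. 6.10: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) over clock cycles CC 1 to CC 7. Each instruction passes through IM, Reg, ALU, DM, Reg, starting one cycle after the previous one. The abbreviation in each box names the resource used in that stage: IM is instruction memory and DM is data memory. Shading on the left half indicates a write; shading on the right half indicates a read.)
There are two exceptions to this flow of instructions:
1. The write-back stage places the result back into the register file in the middle of the datapath, which can cause data hazards.
2. The next PC value is either PC + 4 or the branch address obtained from the MEM stage, which can cause control hazards.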
Pipelined Datapath
• Pipeline registers
– The instruction read from instruction memory is stored in a pipeline register so that its information is preserved for use by the remaining four stages.
– Fig. 6.11 shows the pipelined datapath with the pipeline registers highlighted in color
(Fig. 6.11: the datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the stages.)
Pipelined Datapath
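• When a memory or register is read, its right half is shaded; when it is written, its left half is shaded.
• The five stages of the lw instruction are as follows (see Figs. 6.12-6.14):
– Instruction fetch:
The instruction is read from memory using the address stored in the program counter (PC) and placed in the IF/ID pipeline register (this is done for every instruction, because the computer does not know in advance which type of instruction is being fetched).
– Instruction decode and register read:
The register numbers are used to read the register contents. The register contents, the 16-bit immediate field, and the incremented PC value are placed in the ID/EX pipeline register.
– Execute or effective address calculation:
The load instruction reads the sign-extended offset and the contents of register 1 from the ID/EX pipeline register, adds the two values with the ALU, and places the sum in the EX/MEM pipeline register.
– Memory access:
The load instruction reads the data memory using the address in the EX/MEM pipeline register.
– Write back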
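The five lw stages can be traced with a small model in which each pipeline register is a dictionary. This is only a sketch: the field names, the dictionary-based data memory, and the `run_lw` helper are hypothetical, and the destination register number is carried forward in every pipeline register so that it is still available at write back.

```python
def run_lw(rt, base, offset, regs, data_mem, pc=0):
    """Walk one lw through the four pipeline registers (hypothetical layout)."""
    # IF: fetch the instruction and save the incremented PC in IF/ID.
    if_id = {"pc_plus_4": pc + 4, "rt": rt, "rs": base, "imm": offset}
    # ID: read the base register; carry the immediate and destination number.
    id_ex = {"pc_plus_4": if_id["pc_plus_4"],
             "read_data_1": regs[if_id["rs"]],
             "imm": if_id["imm"],
             "rt": if_id["rt"]}
    # EX: the ALU adds base and offset to form the effective address.
    ex_mem = {"address": id_ex["read_data_1"] + id_ex["imm"], "rt": id_ex["rt"]}
    # MEM: read data memory at that address.
    mem_wb = {"read_data": data_mem[ex_mem["address"]], "rt": ex_mem["rt"]}
    # WB: write the loaded value into the destination register.
    regs[mem_wb["rt"]] = mem_wb["read_data"]
    return if_id, id_ex, ex_mem, mem_wb


regs = [0] * 32
regs[1] = 100                 # base register $1
data_mem = {120: 42}          # word stored at address 120
stages = run_lw(rt=10, base=1, offset=20, regs=regs, data_mem=data_mem)
print(stages[2]["address"], regs[10])  # 120 42
```

The example models lw $10, 20($1): the EX stage produces address 120 and the write-back stage stores 42 in register 10.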
Fig. 6.12 lw (load word): instruction fetch stage (top) and instruction decode and register file read stage (bottom)
Pipelined Datapath
Fig. 6.13 lw (load word): execute or address calculation stage
The sign-extended offset and the register contents are read from the ID/EX pipeline register. The ALU adds these two values and the sum is placed in the EX/MEM pipeline register.
Pipelined Datapath
Fig. 6.14 lw (load word): memory access stage (top) and write-back stage (bottom)
Memory access: the data memory is read using the address from the EX/MEM pipeline register, and the result is written into the MEM/WB pipeline register.
Write back: the data is read from the MEM/WB pipeline register and written into the register file.
Pipelined Datapath
Fig. 6.15 The third stage of the store word (sw) instruction (address calculation)
The effective memory address is calculated and placed in the EX/MEM pipeline register.
Pipelined Datapath
Fig. 6.16 The memory stage (top) and write-back stage (bottom) of the store word (sw) instruction
The fourth stage performs the memory write. Because sw has already written its data to memory in the fourth stage, nothing happens in the write-back stage.
Pipelined Datapath
Fig. 6.17 The pipelined datapath corrected to handle the lw instruction properly
The destination register number from the IF/ID register must be preserved until the write-back stage, so that the data read from memory can be written into the register file.
Pipelined Datapath
Fig. 6.18 All portions of the datapath used by the lw instruction
Graphically Representing Pipelines
Multiple-clock-cycle pipeline diagram (1)
Time (in clock cycles); program execution order runs from top to bottom
                  CC 1  CC 2  CC 3  CC 4  CC 5  CC 6
lw $10, 20($1)    IM    Reg   ALU   DM    Reg
sub $11, $2, $3         IM    Reg   ALU   DM    Reg
• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
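The questions above can also be answered mechanically. The sketch below builds a multiple-clock-cycle diagram for a hazard-free instruction sequence; the `pipeline_diagram` and `alu_user` helpers are illustrative names, and the model assumes one new instruction starts every cycle with no stalls.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def pipeline_diagram(instructions):
    """Return (total_cycles, rows); row i holds the stage used in each cycle."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(instructions):
        cells = [""] * total_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name      # instruction i starts in cycle i + 1
        rows.append((instr, cells))
    return total_cycles, rows


def alu_user(rows, cycle):
    """Return the instruction whose ALU stage falls in the 1-based cycle."""
    for instr, cells in rows:
        if cells[cycle - 1] == "ALU":
            return instr
    return None


total, rows = pipeline_diagram(["lw $10, 20($1)", "sub $11, $2, $3"])
for instr, cells in rows:
    print(f"{instr:<18}" + "".join(f"{c:<5}" for c in cells))
print(total)              # 6
print(alu_user(rows, 4))  # sub $11, $2, $3
```

For the two-instruction example the code takes six cycles, and during cycle 4 the ALU is working on the sub instruction.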
Graphically Representing Pipelines
Multiple-clock-cycle pipeline diagram (2)
Time (in clock cycles); program execution order runs from top to bottom
                  CC 1                CC 2                CC 3                CC 4                CC 5          CC 6
lw $10, 20($1)    Instruction fetch   Instruction decode  Execution           Data access         Write back
sub $11, $2, $3                       Instruction fetch   Instruction decode  Execution           Data access   Write back
Single-clock-cycle pipeline diagrams
Each diagram shows the whole datapath during one clock cycle for the sequence lw $10, 20($1); sub $11, $2, $3:
Clock 1: lw in instruction fetch (first stage of the first instruction)
Clock 2: lw in instruction decode; sub in instruction fetch
Clock 3: lw in execution; sub in instruction decode
Clock 4: lw in memory; sub in execution
Clock 5: lw in write back; sub in memory
Clock 6: sub in write back (fifth stage of the second instruction)
Pipeline Control
(Figure: the pipelined datapath with the control signals identified: PCSrc, Branch, RegWrite, MemRead, MemWrite, ALUSrc, ALUOp, RegDst, and MemtoReg.)
Pipeline control
• We have 5 stages. What needs to be controlled in
each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
• How would control be handled in an automobile
plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
Pipeline Control
• Pass control signals along just like the data
             Execution/Address Calculation    Memory access stage        Write-back stage
             stage control lines              control lines              control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc   Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0        0       0        0         1         0
lw           0       0       0       1        0       1        0         1         1
sw           X       0       0       1        0       0        1         0         X
beq          X       0       1       0        1       0        0         0         X
(Figure: the Control unit produces the EX, M, and WB control groups, which are stored in ID/EX; the M and WB groups continue to EX/MEM, and the WB group continues to MEM/WB.)
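The idea of passing control along with the data can be sketched as follows. The values come from the control-line table on this slide, with don't-care entries (X) stored as None; the dictionary layout and the `propagate` helper are assumptions made for illustration.

```python
X = None  # don't care

CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M":  {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M":  {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M":  {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M":  {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}


def propagate(instr_class):
    """Return the control fields held in ID/EX, EX/MEM, and MEM/WB."""
    ctrl = CONTROL[instr_class]
    id_ex = {"EX": ctrl["EX"], "M": ctrl["M"], "WB": ctrl["WB"]}
    ex_mem = {"M": id_ex["M"], "WB": id_ex["WB"]}   # EX bits are used up in EX
    mem_wb = {"WB": ex_mem["WB"]}                   # M bits are used up in MEM
    return id_ex, ex_mem, mem_wb


id_ex, ex_mem, mem_wb = propagate("lw")
print(sorted(ex_mem))   # ['M', 'WB']
print(mem_wb)           # {'WB': {'RegWrite': 1, 'MemtoReg': 1}}
```

Each stage consumes its own group of control bits, so the pipeline registers get narrower from left to right.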
Datapath with Control
(Figure: the pipelined datapath with the control signals connected to the EX, M, and WB control fields of the pipeline registers.)
Dependencies
• Problem with starting next instruction before first is
finished
– dependencies that “go backward in time” are data hazards
Time (in clock cycles)  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/-20  -20   -20   -20   -20
Program execution order (in instructions), each instruction starting one clock cycle after the previous one:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops” ?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Problem: this really slows us down!
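One way to answer the nop question is to let a small scheduler place them. The sketch below assumes a five-stage pipeline with no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer must issue at least three slots after its producer. The `insert_nops` helper and the tuple format are hypothetical.

```python
def insert_nops(program):
    """program: list of (text, dest_register_or_None, [source_registers])."""
    scheduled = []     # instruction texts in issue order, including nops
    ready_slot = {}    # register -> earliest issue slot for a reader
    for text, dest, sources in program:
        earliest = max((ready_slot.get(r, 0) for r in sources), default=0)
        while len(scheduled) < earliest:
            scheduled.append("nop")
        slot = len(scheduled)
        scheduled.append(text)
        if dest is not None:
            # Producer writes in WB (slot + 4); a reader's ID stage is its
            # own slot + 1, so the reader may issue from slot + 3 onward.
            ready_slot[dest] = slot + 3
    return scheduled


program = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]
result = insert_nops(program)
print("\n".join(result))
```

Under these assumptions two nops are needed, both directly after the sub, which turns five instructions into seven issue slots.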
Forwarding
• Use temporary results, don’t wait for them to be
written
– register file forwarding to handle read/write to same
register
– ALU forwarding
Time (in clock cycles)  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:   10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:        X     X     X     -20   X       X     X     X     X
Value of MEM/WB:        X     X     X     X     -20     X     X     X     X
Program execution order (in instructions), each instruction starting one clock cycle after the previous one:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
what if this $2 was $13?
Forwarding
(Figure: the pipelined datapath with a forwarding unit. The unit compares the Rs and Rt register numbers of the instruction in the EX stage, carried from IF/ID.RegisterRs and IF/ID.RegisterRt, with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and controls the multiplexers at the ALU inputs.)
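The decision made by the forwarding unit for one ALU operand can be sketched as below. The conditions follow the usual textbook formulation (a write is pending, the destination is not register 0, and the destination matches the source), which the slide itself does not spell out, so treat them as an assumption; the function name and string return values are illustrative.

```python
def forward_select(id_ex_src, ex_mem_rd, ex_mem_regwrite,
                   mem_wb_rd, mem_wb_regwrite):
    """Choose where one ALU operand comes from: 'EX/MEM', 'MEM/WB', or 'REG'."""
    # EX hazard: the previous instruction's result is the most recent value.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_src:
        return "EX/MEM"
    # MEM hazard: the result from two instructions back.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_src:
        return "MEM/WB"
    return "REG"  # no hazard: use the value read from the register file


# and $12, $2, $5 is in EX while sub $2, $1, $3 is in MEM.
print(forward_select(2, 2, True, 0, False))    # EX/MEM
# or $13, $6, $2 is in EX: 'and' (Rd = 12) is in MEM, sub (Rd = 2) is in WB.
print(forward_select(2, 12, True, 2, True))    # MEM/WB
print(forward_select(6, 12, True, 2, True))    # REG
```

Checking EX/MEM before MEM/WB makes the unit pick the newest value when both stages hold a result for the same register.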
Can't always forward
• Load word can still cause a hazard:
– an instruction tries to read a register following a load
instruction that writes to the same register.
(Figure: pipeline diagram over CC 1 to CC 9 for the sequence below; each instruction starts one clock cycle after the previous one.)
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
– Thus, we need a hazard detection unit to “stall” the instruction that follows the load
Stalling
• We can stall the pipeline by keeping an instruction
in the same stage
(Figure: pipeline diagram over CC 1 to CC 10 for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7, with one bubble inserted in the pipeline.)
Hazard Detection Unit
• Stall by letting an instruction that won’t write anything go
forward
(Figure: the pipelined datapath with a hazard detection unit and a forwarding unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt, and drives PCWrite, IF/IDWrite, and the multiplexer that zeroes the control signals.)
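The check performed by the hazard detection unit can be sketched as a single function. The condition is the usual textbook load-use test, assumed here because the slide shows only the signal names; the function name and the returned dictionary are illustrative.

```python
def hazard_detection(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must wait for a load in EX."""
    stall = bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,        # hold the PC so the next fetch repeats
        "IF/IDWrite": not stall,     # hold the instruction in IF/ID
        "zero_control": stall,       # send a do-nothing bubble down the pipe
    }


# lw $2, 20($1) is in EX and and $4, $2, $5 is in ID: stall one cycle.
print(hazard_detection(True, 2, 2, 5))
# An R-format instruction in EX never triggers this check.
print(hazard_detection(False, 2, 2, 5))
```

Zeroing the control signals is what lets "an instruction that won't write anything" move forward in place of the stalled one.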
Branch Hazards
• When we decide to branch, other instructions are in the
pipeline!
(Figure: pipeline diagram over CC 1 to CC 9; the number before each instruction is its address.)
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
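The addresses in this example can be reproduced with a short sketch. It assumes the MIPS rule that the branch offset counts words relative to the instruction after the branch, and that the branch decision is made in the MEM stage, so three later instructions have already been fetched; both helper names are hypothetical.

```python
def branch_target(branch_pc, word_offset):
    """Target address = (PC + 4) + offset shifted left by 2."""
    return branch_pc + 4 + (word_offset << 2)


def fetched_before_decision(branch_pc, cycles_until_decision=3):
    """Addresses fetched while the branch is still being resolved."""
    return [branch_pc + 4 * i for i in range(1, cycles_until_decision + 1)]


print(branch_target(40, 7))          # 72
print(fetched_before_decision(40))   # [44, 48, 52]
```

If the beq at address 40 is taken, the instructions at 44, 48, and 52 are already in the pipeline when the lw at address 72 is fetched.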
Flushing Instructions
(Figure: the pipelined datapath with an IF.Flush control signal, a hazard detection unit, a register equality comparator (=), and a forwarding unit.)
Improving Performance
• Try and avoid stalls! E.g., reorder these
instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
• Add a “branch delay slot”
– the next instruction after a branch is always executed
– rely on compiler to “fill” the slot with something useful
• Superscalar: start more than one instruction in the
same cycle
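The benefit of reordering the lw/sw example can be estimated with a simple counter. The sketch assumes a one-cycle load-use stall whenever the instruction directly after a lw reads the loaded register, and it treats the data register of sw as a source; the tuple format and the `load_use_stalls` helper are hypothetical.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest_register_or_None, [source_registers])."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "lw" and dest in cur[2]:
            stalls += 1   # the next instruction needs the loaded value too early
    return stalls


original = [
    ("lw", "$t0", ["$t1"]),          # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),          # lw $t2, 4($t1)
    ("sw", None,  ["$t2", "$t1"]),   # sw $t2, 0($t1)
    ("sw", None,  ["$t0", "$t1"]),   # sw $t0, 4($t1)
]
# Swap the two stores; they write different addresses, so the result is the same.
reordered = [original[0], original[1], original[3], original[2]]

print(load_use_stalls(original))    # 1
print(load_use_stalls(reordered))   # 0
```

Moving sw $t0, 4($t1) ahead of sw $t2, 0($t1) puts one independent instruction between the second load and its use.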
Dynamic Scheduling
• The hardware performs the “scheduling”
– hardware tries to find instructions to execute
– out of order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
– PowerPC and Pentium: branch history table
– Compiler technology important
• This class has given you the background you need to
learn more
• Video: An Overview of Intel’s Pentium Processor
(available from University Video Communications)
Superscalar and Dynamic Pipelining
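• n-way superscalar
– Replicate the processor's internal units so that n instructions can be processed in every pipeline stage
– The ideal CPI is 1/n
• Superscalar MIPS
– Assume two instructions are issued every clock cycle
– One instruction can be an integer ALU operation; the other can be a load or a store

Instruction type             Pipe stages
ALU or branch instruction    IF  ID  EX  MEM WB
Load or store instruction    IF  ID  EX  MEM WB
ALU or branch instruction        IF  ID  EX  MEM WB
Load or store instruction        IF  ID  EX  MEM WB
ALU or branch instruction            IF  ID  EX  MEM WB
Load or store instruction            IF  ID  EX  MEM WB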
Superscalar and Dynamic Pipelining
• Example
Loop:
lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
bne $s1, $zero, Loop
sw $t0, 4($s1)

        ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:                               lw $t0, 0($s1)              1
        addi $s1, $s1, -4                                       2
        addu $t0, $t0, $s2                                      3
        bne $s1, $zero, Loop        sw $t0, 4($s1)              4
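The efficiency of this two-issue schedule can be measured by counting filled slots. In the sketch below each tuple is one clock cycle holding an (ALU or branch, data transfer) pair, with None for an empty slot; the list layout is an assumption made for illustration.

```python
# One tuple per clock cycle: (ALU or branch slot, data transfer slot).
schedule = [
    (None,                   "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",    None),
    ("addu $t0, $t0, $s2",   None),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

cycles = len(schedule)
instructions = sum(slot is not None for pair in schedule for slot in pair)
cpi = cycles / instructions
ideal_cpi = 1 / 2   # two-way superscalar

print(instructions, cycles)   # 5 4
print(cpi, ideal_cpi)         # 0.8 0.5
```

Five instructions in four cycles gives a CPI of 0.8, short of the ideal 0.5 because only one cycle fills both issue slots.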
Superscalar and Dynamic Pipelining
• Additional hardware required (Fig. 6.58)
– Another 32 bits from the instruction memory
– Additional ports on the register file
– One more ALU to calculate addresses for data transfers
Superscalar and Dynamic Pipelining
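• Dynamic pipeline scheduling
– While waiting for a stall to be resolved, dynamic pipeline scheduling looks past the stall to find later instructions to execute
– Dynamic pipeline scheduling is normally combined with extra hardware resources, so that later instructions can execute in parallel
– The cost is much more complicated pipeline control and a more complicated instruction execution model
• Example
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
Even though the sub and slti instructions are ready to execute, they must first wait for the lw and addu instructions to complete.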
Superscalar and Dynamic Pipelining
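• The pipeline is divided into three major units
– Instruction fetch and issue unit:
• Fetches instructions and decodes them.
• Sends each instruction to the corresponding functional unit of the execute stage.
• Issues in order.
– Execute units:
• Each functional unit has buffers, called reservation stations, that hold the operands and the operation.
• As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated.
• Executes out of order.
– Commit unit:
• Decides when it is safe to put the result into the register file or memory.
• Commits in order.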
Superscalar and Dynamic Pipelining
• Example
lw $t0, 20($s2)
addu $t1, $t0, $t2
sub $s4, $s4, $t3
slti $t5, $s4, 20
sub $s4, $s5, $t6
Superscalar and Dynamic Pipelining
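• Example: DEC Alpha 21264
– Fetches four instructions per clock cycle and can issue up to six.
– Uses out-of-order execution with in-order completion.
– The pipeline takes 9 stages for simple integer and floating-point operations.
– Clock rate of 600 MHz in 1997.
• Dynamic pipelines are more complex than traditional static pipelines
– Combined with branch prediction:
the commit unit must be able to discard results in the execution unit that were produced by instructions executed after a mispredicted branch.
– Combined with superscalar execution.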
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, penalty increases and static
branch prediction drastically hurts performance
• Solution: dynamic branch prediction
Figure: A 2-bit prediction scheme. The state diagram has two "Predict taken" states and two "Predict not taken" states, with transitions labeled "Taken" and "Not taken".
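A 2-bit scheme can be sketched as a saturating counter. This is one common variant and may not match the exact transitions of the figure; the state numbering (0 and 1 predict not taken, 2 and 3 predict taken) and the class name are assumptions.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch that is taken nine times and then falls through, run twice.
outcomes = ([True] * 9 + [False]) * 2
mispredictions = count_mispredictions(TwoBitPredictor(), outcomes)
print(mispredictions)  # 2
```

The predictor misses only at each loop exit, because a single not-taken outcome is not enough to flip its prediction for the next loop entry.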
Branch Prediction
• Sophisticated Techniques:
– A “branch target buffer” to help us look up the destination
– Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
– Tournament predictors that use different types of prediction
strategies and keep track of which one is performing best.
– A “branch delay slot” which the compiler tries to fill with a
useful instruction (make the one cycle delay part of the ISA)
• Branch prediction is especially important because it
enables other more advanced pipelining techniques
to be effective!