ARM Cortex A8 Pipeline

EE126 Wei Wang
• Cortex A8 is a processor core designed by ARM Holdings.
• Applications: Apple A4, Samsung Exynos 3110.
What is the pipeline architecture of the Cortex A8?
A deeper pipeline combined with a superscalar pipeline.
Deeper Pipeline
Why does it break one stage into several sub-stages?
[Figure: the three classic stages are split into sub-stages. IF becomes F0, F1, F2; ID becomes D0, D1, D2, D3, D4; EXE becomes E0, E1, E2, E3, E4, E5.]
For a pipeline, the speed is limited by the longest stage, because the longest
stage sets the clock cycle time. In the deeper pipeline each new sub-stage takes
less time, so the clock cycle can be shorter. Each instruction still passes through
every sub-stage, but one instruction can complete per (shorter) cycle, so a stream of
instructions finishes in less time.
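The effect of the shorter cycle can be checked with a little arithmetic. The sketch below is only an illustration: the stage delays are made-up numbers, the pipeline is assumed ideal (no stalls, no latch overhead), and the function names are hypothetical.

```python
# Illustrative model of an ideal pipeline: no stalls and no latch overhead.
# All stage delays below are invented numbers, not Cortex A8 measurements.

def clock_period(stage_delays_ns):
    """The clock period is set by the slowest stage."""
    return max(stage_delays_ns)

def total_time_ns(stage_delays_ns, n_instructions):
    """Time for n instructions: fill the pipeline once, then one per cycle."""
    depth = len(stage_delays_ns)
    cycles = depth + n_instructions - 1
    return cycles * clock_period(stage_delays_ns)

shallow = [3.0, 5.0, 6.0]   # IF, ID, EXE (assumed delays, 14 ns of logic)
deep = [1.0] * 14           # F0-F2, D0-D4, E0-E5 (assumed 1 ns each)

shallow_time = total_time_ns(shallow, 1000)
deep_time = total_time_ns(deep, 1000)
print(shallow_time, deep_time)
```

Under these assumptions the same 14 ns of logic runs 1000 instructions in roughly one sixth of the time, purely because the clock period drops from 6 ns to 1 ns.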
Superscalar Pipeline
• It is a form of instruction-level parallelism: more than one instruction is issued per cycle, so throughput is higher than in a normal (scalar) pipeline.
[Figure: timing diagrams over cycles 0 to 9 comparing a simple 4-stage pipeline (IF, ID, EX, WB) with a superscalar pipeline, in which two instructions are executed at the same time.]
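A rough cycle count shows what issuing two instructions at once buys. This is a sketch under simplifying assumptions: an ideal in-order 4-stage pipeline, all instructions independent, and a hypothetical helper named `cycles_needed`.

```python
import math

def cycles_needed(n_instructions, n_stages=4, issue_width=1):
    """Cycles for an ideal pipeline when every instruction is independent."""
    # Instructions enter the pipeline in groups of `issue_width`.
    issue_groups = math.ceil(n_instructions / issue_width)
    # The first group needs n_stages cycles; each later group adds one cycle.
    return n_stages + issue_groups - 1

scalar_cycles = cycles_needed(8, issue_width=1)
dual_cycles = cycles_needed(8, issue_width=2)
print(scalar_cycles, dual_cycles)
```

The saving shrinks as soon as instructions depend on each other, which is why the ideal figure here is an upper bound.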
Cortex A8 Pipeline Main Architecture:
14-Stage Integer Pipeline
F0 F1 F2: instruction fetch
D0 D1 D2 D3 D4: instruction decode
E0 E1 E2 E3 E4 E5: instruction execute and load/store, ending in integer register writeback
10-Stage NEON Pipeline
M0 M1 M2 M3
N1 N2 N3 N4 N5 N6: ending in NEON register writeback
[Figure: block diagram. Integer side: instruction fetch, instruction decode, architecture register file, ALU pipeline 0, ALU pipeline 1, MUL pipeline 0, load/store pipeline 0/1. NEON side: NEON instruction decode, load/store data queue, NEON register file, integer ALU pipeline, integer MUL pipeline, integer shift pipeline, non-IEEE FP ADD pipeline, non-IEEE FP MUL pipeline, IEEE FP engine, load/store permute pipeline, NEON store data.]
• Execution stages: a 6-stage pipeline, E0 to E5, ending in integer register writeback.
It contains two symmetric ALU pipelines, a multiply pipeline and an address generator for
load and store.
1. For the ALU pipeline:
E0 access register file;
E1 shift if needed;
E2 ALU function;
E3 complete saturation if needed;
E4 change in control flow;
E5 write back to register file.
2. For the Mul pipeline:
E1-E3 implement multiply;
E4 perform addition;
E5 write back.
[Figure: block diagram of the execution stages, all fed from the architecture register file. ALU pipelines: Shift, ALU + Flags, Sat, BP Update, WB. Multiply pipeline: MUL 1, MUL 2, MUL 3, ACC, WB. Load/store pipeline: AGU, WB.]
• It provides extensive support for key forwarding paths. Result data is forwarded from the outputs of the shift, ALU and
MUL stages as soon as it is produced, so intermediate execution-stage results can be forwarded.
In a simple pipeline, by contrast, only the final execution-stage result can be forwarded.
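The benefit of forwarding intermediate results can be estimated with a small stall-count model. This is a simplified sketch, not a cycle-accurate description of the real core. It assumes an ALU result is produced in E2 and a multiply result in E4 (taken from the stage list above). It also assumes the consumer needs its operand when it enters E2, and that each instruction advances one stage per cycle. The names `RESULT_STAGE` and `stall_cycles` are hypothetical.

```python
# Execute stage in which each pipeline produces its result (assumed from the
# stage list: ALU function in E2, multiply addition in E4).
RESULT_STAGE = {"ALU": 2, "MUL": 4}
FINAL_STAGE = 5  # E5, the last execution stage

def stall_cycles(producer, need_stage=2, issue_gap=1, intermediate_forwarding=True):
    """Stall cycles for a consumer issued `issue_gap` cycles after the producer."""
    if intermediate_forwarding:
        ready_stage = RESULT_STAGE[producer]
    else:
        ready_stage = FINAL_STAGE          # only the final result is forwarded
    first_usable_cycle = ready_stage + 1   # cycle after the result is produced
    wanted_cycle = issue_gap + need_stage  # cycle the consumer reaches need_stage
    return max(0, first_usable_cycle - wanted_cycle)

print(stall_cycles("ALU"), stall_cycles("ALU", intermediate_forwarding=False))
print(stall_cycles("MUL"), stall_cycles("MUL", intermediate_forwarding=False))
```

In this model a back-to-back ALU pair needs no stall when the E2 result is forwarded, but would wait three cycles if only the E5 result were available.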
• Deep and superscalar pipelines perform well. Why
not keep increasing the number of sub-stages and of parallel instructions?
• What are the limitations?
• Data dependency
No data dependency:
MUL t3,t2,t1
ADD t6,t5,t4
Data dependency (ADD reads t3, which MUL writes):
MUL t3,t2,t1
ADD t6,t3,t4
[Figure: timing diagram over cycles 0 to 5; a bubble is added in front of the ADD.]
Solution: Stall the adder until the multiplier has finished.
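The stall can be sketched as a small pass over the instruction list. This is an illustration only: it inserts a single `BUBBLE` whenever an instruction reads the register written by the instruction immediately before it, whereas the real number of stall cycles depends on the multiplier latency. The helper names `parse` and `add_bubbles` are hypothetical.

```python
def parse(text):
    """Split 'MUL t3,t2,t1' into (op, dest, [sources])."""
    op, operands = text.split(None, 1)
    dest, *sources = [reg.strip() for reg in operands.split(",")]
    return op, dest, sources

def add_bubbles(program):
    """Insert one BUBBLE before any instruction that reads the previous result."""
    scheduled = []
    previous_dest = None
    for text in program:
        op, dest, sources = parse(text)
        if previous_dest in sources:
            scheduled.append("BUBBLE")
        scheduled.append(text)
        previous_dest = dest
    return scheduled

independent = add_bubbles(["MUL t3,t2,t1", "ADD t6, t5,t4"])
dependent = add_bubbles(["MUL t3,t2,t1", "ADD t6, t3,t4"])
print(independent)
print(dependent)
```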
• Output dependency:
MUL t3,t2,t1;
ADD t3,t4,t5;
• An output dependency occurs if two instructions issued in parallel
write to the same location. An error occurs if the second
instruction writes its result before the first one, because the location
then ends up holding the first instruction's value.
• Antidependency:
MUL t3,t2,t1;
ADD t2,t4,t5;
• An antidependency exists if an instruction uses a location as an
operand while a following one is writing into that location; if the first
one is still using the location when the second one writes into it, an
error occurs.
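The three kinds of dependency can be told apart by comparing the destination and source registers of two consecutive instructions. The function below is a minimal sketch; `classify` and the tuple layout `(dest, sources)` are assumptions made for this example.

```python
def classify(first, second):
    """Return the dependencies of `second` on `first`.

    Each instruction is given as (dest, sources), e.g. ("t3", ["t2", "t1"]).
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    found = set()
    if first_dest in second_sources:
        found.add("data dependency")      # second reads what first writes
    if first_dest == second_dest:
        found.add("output dependency")    # both write the same location
    if second_dest in first_sources:
        found.add("antidependency")       # second overwrites what first reads
    return found

mul = ("t3", ["t2", "t1"])                # MUL t3,t2,t1
print(classify(mul, ("t6", ["t3", "t4"])))  # ADD t6,t3,t4
print(classify(mul, ("t3", ["t4", "t5"])))  # ADD t3,t4,t5
print(classify(mul, ("t2", ["t4", "t5"])))  # ADD t2,t4,t5
```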
• Solution for output dependency and antidependency: use another
register.
Output dependency:
MUL t3,t2,t1;
ADD t3,t4,t5;
becomes
MUL t3,t2,t1;
ADD t6,t4,t5;
Antidependency:
MUL t3,t2,t1;
ADD t2,t4,t5;
becomes
MUL t3,t2,t1;
ADD t6,t4,t5;
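Choosing another register can be sketched as a tiny renaming step. This is only an illustration for a pair of instructions: the function name `rename_second_dest` and the list of free registers are assumptions, and a complete renamer would also have to update every later instruction that reads the renamed register.

```python
def rename_second_dest(first, second, free_registers):
    """Give `second` a fresh destination if it collides with `first`.

    Instructions are (op, dest, sources) tuples.
    """
    _, first_dest, first_sources = first
    op, dest, sources = second
    output_dependency = dest == first_dest
    antidependency = dest in first_sources
    if output_dependency or antidependency:
        dest = free_registers[0]          # assumed to be unused elsewhere
    return (op, dest, sources)

mul = ("MUL", "t3", ["t2", "t1"])
fixed_output = rename_second_dest(mul, ("ADD", "t3", ["t4", "t5"]), ["t6"])
fixed_anti = rename_second_dest(mul, ("ADD", "t2", ["t4", "t5"]), ["t6"])
print(fixed_output, fixed_anti)
```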
Alternative way to handle dependencies:
The compiler can generate and order instructions so that there are fewer dependencies between neighbouring instructions.
• Summary:
The Cortex A8 achieves high speed by combining a deeper
pipeline with a superscalar pipeline.
Thank you