Bulldozer: An Approach to Multithreaded Compute Performance

by
Michael Butler, Leslie Barnes,
Debjit Das Sarma, Bob Gelinas
This paper appears in: Micro, IEEE
March/April 2011 (vol. 31 no. 2)
pp. 6-15
Microprocessor Architecture
speaker: 박세준
Contents
1. Motivation
2. Introduction
3. Block diagram
4. Key features
5. Function block highlights
6. Bulldozer-based SoC
Motivation
AMD has been focusing on core count and highly parallel server
workloads
Two basic observations
1. Future SoCs will support multiple execution threads
• The smallest possible building module
2. Cores will operate in a power-constrained environment
• Power reduction techniques:
filtering, speculation reduction, data movement minimization
Performance per watt!!
Introduction
Bulldozer is a new direction in microarchitecture
• Bulldozer is the first x86 design to share substantial hardware between multiple cores
• Bulldozer is a hierarchical design with sharing at nearly every level
• Bulldozer is a high-frequency optimized CPU
• Average performance is increased, rather than peak performance
Introduction
Major contributions
• Scaling the core structures
• Aggressive frequency goal
• Low gates per clock
Block diagram
It combines two independent cores into a module
• Implementation of a shared level 2 cache
• Improved area and power efficiency
The module can fetch and decode up to four x86 instructions per clock.
Each core can service two loads per cycle.
Shared frontend
• Decoupled predict and fetch pipelines
Block diagram
• ALU performance 33% decrease FPU performance 33% increase
• ALU performance 33% increase FPU performance 33% increase
Key features
1. Multithreading microarchitecture
• Appropriate use of replication and shared hardware
• Main advantage comes from sharing the instruction cache and branch prediction
• Reinforcing the frontend (larger ROB, BTB)
2. Branch prediction decoupled from the instruction fetch pipeline
• Enables instruction prefetch using the prediction queue
• Instruction control unit (reorder buffer) increased to 128 entries
3. Register renaming and operand delivery
• Scheduling and operand handling are the biggest power consumers in the integer execution unit
• PRF-based renaming microarchitecture for power efficiency
• Eliminates data replication
4. FMAC and media extensions
• FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth
• One is provided per module, operating like a coprocessor
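To illustrate why a multiply-accumulate unit is more than a multiplier followed by an adder, the sketch below compares a separate multiply and add against an emulated fused operation. It assumes the FMAC rounds only once, after the exact a*b + c is formed; the slides do not state the rounding behavior, and the function names are made up for this example. The fused result is emulated with exact rational arithmetic, not with any hardware instruction.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fused operation (assumption: single rounding at the end).

    The exact value of a*b + c is computed with rationals and then
    converted to a double once.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 when the multiply is done on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

With these inputs the separate multiply loses the low-order term, so the unfused result is 0.0, while the single-rounding version keeps it and returns -2**-60.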
Function block highlights
Branch prediction
multilevel BTB
Instruction cache
64-Kbyte, two-way set-associative
cache shared between both threads
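As a rough illustration of what "64 Kbyte, two-way set-associative" implies for lookups, the sketch below splits an address into tag, set index, and byte offset. The capacity and associativity come from the slide; the 64-byte line size is an assumption made for the example, and the helper name is hypothetical.

```python
CACHE_BYTES = 64 * 1024  # from the slide: 64 Kbyte
WAYS = 2                 # from the slide: two-way set-associative
LINE_BYTES = 64          # assumption: the slide does not give the line size

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2 of the line size
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2 of the number of sets


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set_index, offset) for a byte address."""
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset


# Two addresses 32 Kbyte apart land in the same set with different tags,
# so they occupy the two ways of that set.
first = split_address(0x1000)
second = split_address(0x1000 + 32 * 1024)
print(NUM_SETS, first, second)
```

Under these assumptions there are 512 sets, and lines from either thread that map to the same set compete for its two ways.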
Function block highlights
Decode
Branch fusion (Intel: macro-fusion), four x86 instructions per cycle
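Branch fusion can be pictured as a decode-time pass that merges a compare with the conditional jump that follows it, so the pair occupies one decoded operation. The sketch below is a toy model only: the instruction strings, the set of eligible compare mnemonics, and the function names are assumptions for illustration, not the actual fusion rules of the hardware.

```python
COMPARE_OPS = {"cmp", "test"}  # assumption: which first instructions may fuse


def is_conditional_jump(mnemonic: str) -> bool:
    """Treat any 'j..' mnemonic other than the unconditional 'jmp' as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fuse_branches(instructions: list[str]) -> list[str]:
    """Merge each compare that is immediately followed by a conditional jump."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (
            nxt is not None
            and current.split()[0] in COMPARE_OPS
            and is_conditional_jump(nxt.split()[0])
        ):
            fused.append(f"{current} + {nxt}")
            i += 2  # both instructions are consumed by the fused operation
        else:
            fused.append(current)
            i += 1
    return fused


stream = [
    "mov rax, [rbx]",
    "add rax, 1",
    "cmp rax, rcx",
    "jne loop",
    "mov [rbx], rax",
]
ops = fuse_branches(stream)
print(ops)
```

In this toy stream the five instructions become four decoded operations, because the compare and the jump travel together.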
Bulldozer execution pipeline
Function block highlights
Integer scheduler and execution
Renaming by PRF (physical register file)
Floating point
The FPU is a coprocessor shared between the two integer cores
L2 cache
The two cores share the unified L2 cache
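The PRF-based renaming mentioned under the integer scheduler can be sketched as a map table plus a free list: each destination gets a fresh physical register, and sources are looked up through the map, so only register numbers (not data values) move through the scheduler. This is a minimal sketch under assumed sizes; the class name, register counts, and the omission of freeing registers at retirement are simplifications for illustration.

```python
class Renamer:
    """Toy register renamer backed by a physical register file (PRF)."""

    def __init__(self, num_arch: int, num_phys: int):
        # Initially architectural register i maps to physical register i.
        self.map_table = list(range(num_arch))
        # The remaining physical registers are free for new destinations.
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dest: int, srcs: list[int]) -> tuple[int, list[int]]:
        """Return (physical destination, physical sources) for one instruction."""
        # Sources are read before the destination mapping is updated,
        # so an instruction like r1 = r1 + 1 sees the old r1.
        phys_srcs = [self.map_table[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register; dispatch would stall")
        phys_dest = self.free_list.pop(0)
        self.map_table[dest] = phys_dest
        # Returning the previous mapping to the free list at retirement
        # is left out of this sketch.
        return phys_dest, phys_srcs


renamer = Renamer(num_arch=4, num_phys=8)  # hypothetical sizes
first = renamer.rename(dest=1, srcs=[2, 3])   # r1 = r2 op r3
second = renamer.rename(dest=1, srcs=[1])     # r1 = op r1
third = renamer.rename(dest=2, srcs=[1, 2])   # r2 = r1 op r2
print(first, second, third)
```

Because each value lives in exactly one physical register and consumers hold only its number, no copy of the data is made when an instruction is scheduled or retired, which is the sense in which the slide says data replication is eliminated.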
Bulldozer-based SoC
Summary
1. Single-thread peak performance is sacrificed in exchange for higher throughput
2. For single-threaded workloads, the FPU is more important
3. Server workloads need ALU performance
Bulldozer can deliver a significant performance improvement within the same
power budget.
The end