SUN ULTRASPARC-III ARCHITECTURE CMPE 511 PRESENTATION Prepared by:Balkır Kayaaltı


Introduction



SPARC stands for Scalable Processor ARChitecture.
It is an open processor architecture (i.e., member companies of
the SPARC community can freely produce the processor).
Sun UltraSPARC (SPARC V9) is a robust RISC architecture with
-64-bit integer address and data
-Superscalar implementations
-Extremely fast trap handling and context switching
This presentation looks in detail at Sun Microsystems’ UltraSPARC III
(SPARC V9) architecture.
Major Architectural units






The processor’s micro-architecture design
has six major functional units that perform
relatively independently:
Instruction issue unit (IIU)
Floating point unit (FPU)
Integer execution unit (IEU)
Data cache unit (DCU)
External memory unit (EMU)
System interface unit (SIU)
The units communicate requests and results among themselves
through well-defined interface protocols, as the next figure shows.
Communication paths between architectural units
Instruction issue unit






This unit feeds the execution pipelines with instructions.
It independently predicts the control flow through a program
and fetches the predicted path from the memory system.
Fetched instructions are staged in a queue before being forwarded to
the two execution units (integer and floating point).
This unit includes:
A 32-Kbyte, four-way associative instruction cache
The instruction address translation buffer
A 16K-entry branch predictor
UltraSPARC-III pipeline and physical data

Pipeline feature        Parameter
Instruction issue       4 integer
                        2 floating point
                        2 graphics
Level-one (L1) caches   Data:         64-Kbyte, 4-way
                        Instructions: 32-Kbyte, 4-way
                        Prefetch:     2-Kbyte, 4-way
                        Write:        2-Kbyte, 4-way
Level-two (L2) cache    Unified (data and instructions)
                        4- and 8-Mbyte, 1-way
                        On-chip tags; off-chip data
Pipeline
Pipeline blocks

Stage  Function
A      Generate instruction fetch addresses; generate
       pre-decoded instruction bits
P      Fetch first cycle of instructions from cache; access
       first cycle of branch prediction
F      Fetch second cycle of instructions from cache; access
       second cycle of branch prediction; translate
       virtual-to-physical address
B      Calculate branch target addresses; decode first cycle of
       instructions
I      Decode second cycle of instructions; enqueue
       instructions into the queue
J      Steer instructions to execution units
R      Read integer register file operands; check operand
       dependencies
E      Execute integer arithmetic, logical, and shift
       instructions; first cycle of data cache access; read, and
       check dependency of, floating-point register file
Pipeline blocks[2]

Stage  Function
C      Access second cycle of data cache, and forward load data
       for word and doubleword loads; execute first cycle of
       floating-point instructions
M      Load data alignment for half-word and byte loads; execute
       second cycle of floating-point instructions
W      Write speculative integer register file; execute third cycle of
       floating-point instructions
X      Extend integer pipeline for precise floating-point traps;
       execute fourth cycle of floating-point instructions
T      Report traps
D      Write architectural register file
Pipeline




The instruction issue unit: stages A-J
The execution unit: stages R-D
Data cache: the E, C, M, and W stages of the pipe, in parallel with
the integer execution unit stages
Floating point unit: a side pipeline parallel to the E through D stages of
the integer pipeline
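The stage-to-unit assignment above can be written down as a small lookup. This is only a sketch: the stage letters and ranges come from the slides, while the names `STAGES`, `UNITS`, `stage_range`, and `units_active_in` are made up for illustration.

```python
# Stage letters in pipeline order, as listed in the "Pipeline blocks" tables.
STAGES = "APFBIJRECMWXTD"


def stage_range(first, last):
    """Return the stages from `first` to `last`, inclusive."""
    i, j = STAGES.index(first), STAGES.index(last)
    return list(STAGES[i:j + 1])


# Hypothetical lookup table built from the ranges given on the slide.
UNITS = {
    "instruction issue": stage_range("A", "J"),
    "integer execution": stage_range("R", "D"),
    "data cache": stage_range("E", "W"),
    "floating point": stage_range("E", "D"),
}


def units_active_in(stage):
    """List the units whose stage range includes `stage`."""
    return sorted(name for name, stages in UNITS.items() if stage in stages)
```

For example, `units_active_in("C")` names the data cache, the floating-point side pipeline, and the integer execution unit, which is the overlap the slide describes.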
Pipeline
Instruction issue unit cont.


To increase performance, a high level of instruction-level parallelism
is desired.
UltraSPARC is a static speculation machine.
-Dynamic speculation machines require very high fetch
bandwidths to fill an instruction window and find
instruction-level parallelism.
-In a static speculation machine the compiler can make the
speculated path sequential, which places fewer
demands on the instruction fetch unit.
Instruction issue unit:
Stage A:
Address lines enter the instruction cache.
All fetch address generation and selection occurs here.
Stages P, F:
Instruction cache access
Branch prediction access
Instruction address translation access
By the time the instructions are available from the cache in the B
stage, we also have the physical address from the translator and a
prediction for any branch that was fetched.
The processor uses all this information in the B stage to
determine whether to follow a sequential or taken-branch path.
Branch prediction

The processor also determines whether the instruction cache access
was a hit or miss. If the processor predicts a taken branch in the B
stage, the processor sends back the target address for the branch to
the A stage to redirect the fetch stream.

Waiting until the B stage to redirect the fetch stream lets us use a
large, accurate branch predictor.

The branch predictor uses a gshare algorithm with 16K 2-bit saturating
up/down counters.

Because the predictor is large, its access is pipelined.
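A gshare predictor of this size can be sketched in a few lines. The table size (16K entries) and the 2-bit saturating counters come from the slide; the PC shift, the history length, the initial counter value, and all names below are assumptions for illustration, not details of the actual hardware.

```python
TABLE_BITS = 14                  # 2**14 = 16K counters, as on the slide
TABLE_SIZE = 1 << TABLE_BITS
MASK = TABLE_SIZE - 1


class GsharePredictor:
    """Toy gshare predictor with 2-bit saturating up/down counters."""

    def __init__(self):
        # Counter values 0..3; starting at 1 ("weakly not taken") is an assumption.
        self.counters = [1] * TABLE_SIZE
        self.history = 0         # global branch history register

    def _index(self, pc):
        # gshare: XOR branch address bits with the global history.
        return ((pc >> 2) ^ self.history) & MASK

    def predict(self, pc):
        """Return True when the branch is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & MASK


def train_always_taken(pc, n):
    """Count correct predictions for a branch that is taken n times in a row."""
    predictor = GsharePredictor()
    correct = 0
    for _ in range(n):
        if predictor.predict(pc):
            correct += 1
        predictor.update(pc, True)
    return correct
```

In this sketch a branch that is always taken is mispredicted while the history register is still filling, because each new history value selects a fresh counter; once the history is stable the same counter is reused and the prediction settles on "taken".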
Instruction buffer (queue)

There are two instruction queues: the instruction queue
and the miss queue.

The 20-entry instruction queue decouples the fetch unit from
the execution units, allowing each to proceed at its own rate.

When a branch is taken, two cycles must pass before the queue is
filled with the correct instructions; in the meantime, instructions in
the miss queue can be used immediately.
Integer execute unit
The execution pipelines can launch up to six
instructions concurrently, drawn from:
-two integer operations, A0/A1 pipelines
-two FP operations, FP pipelines
-one memory operation (load/store), MS pipeline
-one special-purpose memory operation (prefetch cache
load only)
-one control transfer instruction (CTI), BR pipeline
However, only four instructions per cycle (IPC) can be executed in
a sustained manner.

Working and Architectural Register
File (WARF)





Physically it is one block, but logically it can be seen as two
separate register files (the working register file and the architectural
register file).
SPARC architectures use register files with a windowing
technique.
The 8 global registers, g0 to g7, can be reached at any time.
Global register g0 always reads as 0.
At any time, an instruction can access the 8 global registers and a
24-register window into the registers. A register window comprises
the 8 ‘in’ and 8 ‘local’ registers of a particular register set,
together with the 8 ‘in’ registers of an adjacent register set,
which are addressable from the current window as ‘out’
registers.
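The overlap between windows can be illustrated with a small mapping function. This is a sketch under stated assumptions: the register numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) follows the SPARC convention, but the choice of eight windows, the direction of adjacency (window + 1), the storage layout, and the name `physical_register` are illustrative only.

```python
NWINDOWS = 8   # assumed number of register windows


def physical_register(cwp, r):
    """Map register r (0..31) as seen from window cwp to a storage slot.

    Returns ("global", n) or ("windowed", n), where n indexes a pool of
    NWINDOWS * 16 windowed registers. The layout is hypothetical.
    """
    if not 0 <= r < 32:
        raise ValueError("register number must be 0..31")
    if r < 8:                                  # g0..g7
        return ("global", r)
    base = (cwp % NWINDOWS) * 16
    if r < 16:                                 # o0..o7: 'in' slots of the adjacent window
        adjacent = ((cwp + 1) % NWINDOWS) * 16
        return ("windowed", adjacent + (r - 8))
    if r < 24:                                 # l0..l7
        return ("windowed", base + 8 + (r - 16))
    return ("windowed", base + (r - 24))       # i0..i7
```

With this layout, `physical_register(0, 8)` and `physical_register(1, 24)` name the same slot: the first ‘out’ register of one window is the first ‘in’ register of the adjacent one, so arguments can be passed without copying.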
Register windows
WARF



The WRF consists of 32 64-bit registers with 3 write ports and 7 read
ports, plus a wide write port (32 × 64 = 2048 bits, minus 64 = 1984 bits) to
transport data from the architectural register file.
The ARF has 160 entries (8 register windows in total):
8 × 8 = 64 local registers across the windows
8 × 8 = 64 registers for the 16 shared in/out registers of each window
32 registers for the 4 sets of 8 global registers
The WRF holds a single window and is updated as results are
computed.






The processor accesses the WRF in the pipeline’s R stage and
supplies integer operands to the execution units.
Most integer operations complete in one cycle, so results can be
written immediately, at the C stage.
If an exceptional event occurs, the results already written must be undone,
so the original values of the integer registers are restored with a
broadside copy of all integer registers from the appropriate ARF window.
The architectural register file is written at the end
of the pipeline, by which point all exceptions have been resolved.
The ARF fills 16 WRF entries after a window change.
On an exception, the 31 nonzero registers of the WRF must be
updated.
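The interplay between the two register files can be modelled roughly as follows. This is a toy sketch, not the hardware design: it keeps one 32-entry window, treats the commit order as a simple queue, and the class and method names are invented for illustration.

```python
class RegisterFiles:
    """Toy model: a speculative working file backed by an architectural copy."""

    def __init__(self):
        self.arf = [0] * 32    # committed state for the current window
        self.wrf = [0] * 32    # speculative state, read in the R stage
        self.pending = []      # results written to the WRF but not yet committed

    def write_speculative(self, reg, value):
        if reg == 0:           # g0 always reads as zero, so writes are dropped
            return
        self.wrf[reg] = value
        self.pending.append((reg, value))

    def commit_oldest(self):
        # End of the pipeline: exceptions are resolved, so update the ARF.
        reg, value = self.pending.pop(0)
        self.arf[reg] = value

    def exception(self):
        # Undo speculative writes with a broadside copy of the 31 nonzero registers.
        self.wrf[1:] = self.arf[1:]
        self.pending.clear()
```

A write that has been committed survives `exception()`, while a write that was still speculative is rolled back to the architectural value.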
On chip memory system
Cache diagram used in the architecture
On chip memory system
Level-one (L1) caches   Data:         64-Kbyte, 4-way
                        Instructions: 32-Kbyte, 4-way
                        Prefetch:     2-Kbyte, 4-way
                        Write:        2-Kbyte, 4-way
Level-two (L2) cache    Unified (data and instructions)
                        4- and 8-Mbyte, 1-way
                        On-chip tags; off-chip data
average latency = L1 hit time + L1 miss rate * L1 miss time +
L2 miss rate * L2 miss time
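The formula above can be evaluated directly. In this sketch both miss rates are taken per memory access, which is an assumption about how the slide intends them; every number in the example is hypothetical and chosen only to show the arithmetic.

```python
def average_latency(l1_hit, l1_miss_rate, l1_miss_time, l2_miss_rate, l2_miss_time):
    """Average access latency in cycles, following the slide's formula.

    Both miss rates are per memory access (an assumption), so the L2 term
    is added directly rather than nested inside the L1 term.
    """
    return l1_hit + l1_miss_rate * l1_miss_time + l2_miss_rate * l2_miss_time


# Hypothetical inputs, not measured UltraSPARC-III figures.
example = average_latency(
    l1_hit=2,            # cycles for an L1 hit
    l1_miss_rate=0.05,   # 5% of accesses miss in L1
    l1_miss_time=12,     # cycles to fetch the line from L2
    l2_miss_rate=0.01,   # 1% of accesses also miss in L2
    l2_miss_time=100,    # cycles to reach main memory
)
print(round(example, 2))  # 2 + 0.6 + 1.0 = 3.6
```

Even with small miss rates, the two miss terms contribute almost as much as the hit time in this example, which is why the following slides concentrate on hiding miss latency.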
Prefetch cache

Performance is increased substantially by using a prefetch cache in
parallel with the L1 data cache.

By issuing up to eight in-flight prefetches to main memory, the
prefetch cache enables a program to use 100% of the available
main memory bandwidth without slowing down because of
main memory latency.
Prefetch cache




The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of
64 bytes, using four-way associativity with an LRU
replacement policy.
A multi-port SRAM design achieves very high
throughput.
Data can be streamed through the prefetch cache in a manner
similar to stream buffers.
On every cycle, each of two independent read ports supplies 8
bytes of data to the pipeline while a third write port fills the
cache with 16 bytes.
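The geometry implied by these numbers can be checked with a few lines of arithmetic. The entry count, line size, associativity, and port widths come from the slide; the modulo-based address split and the function names are assumptions for illustration.

```python
ENTRIES = 32        # cache entries (slide)
LINE_BYTES = 64     # bytes per entry (slide)
WAYS = 4            # four-way associative (slide)
READ_PORTS = 2      # independent read ports (slide)
READ_BYTES = 8      # bytes per read port per cycle (slide)


def geometry():
    """Return (total size in bytes, number of sets, read bytes per cycle)."""
    size = ENTRIES * LINE_BYTES
    sets = ENTRIES // WAYS
    read_bandwidth = READ_PORTS * READ_BYTES
    return size, sets, read_bandwidth


def split_address(addr):
    """Split an address into (tag, set index, byte offset).

    Plain modulo indexing is assumed; the real index function is not
    described on the slide.
    """
    sets = ENTRIES // WAYS
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % sets
    tag = addr // (LINE_BYTES * sets)
    return tag, set_index, offset
```

`geometry()` gives 2048 bytes in 8 sets, with 16 bytes per cycle readable by the pipeline, which matches the 16 bytes per cycle the write port can fill.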
Prefetch cache




Some earlier processors, such as the UltraSPARC II, rely on prefetch
instructions.
The prefetch cache also has an autonomous stride prefetch engine that
tracks the program counters of load instructions and detects when a load
instruction is striding through memory.
When the prefetch engine detects a striding load, it
issues a hardware prefetch independent of any software
prefetch.
This allows the prefetch cache to be effective even on code
that does not include prefetch instructions.
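The idea of a stride detector keyed by load program counter can be sketched as follows. The detection rule used here (the same nonzero stride seen twice in a row) and the prefetch distance of one stride are assumptions, as are all names; the slide does not describe the actual engine at this level.

```python
class StridePrefetcher:
    """Toy stride detector keyed by the program counter of each load."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        self.table = {}    # load PC -> (last address, last stride)

    def observe_load(self, pc, addr):
        """Record a load; return a line address to prefetch, or None."""
        last = self.table.get(pc)
        if last is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = last
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            # Same nonzero stride twice in a row: treat the load as striding.
            target = addr + stride
            return target - (target % self.line_bytes)   # align to a cache line
        return None
```

A load at one PC that touches addresses 1000, 1064, 1128 triggers no prefetch for the first two accesses and then requests the line holding address 1192, without any prefetch instruction in the program.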
Write cache





Write caching is an excellent way to reduce the
bandwidth due to store traffic.
A write cache is used in the UltraSPARC-III to reduce the store
traffic bandwidth to the off-chip L2 data cache.
Its size is 2 Kbytes, four-way associative.
An advantage: being the sole source of on-chip
dirty data, the write cache easily handles both
multiprocessor and on-chip cache consistency.
Error recovery also becomes easier with the write cache,
since the write cache keeps all other on-chip caches
clean and simply invalidates them when an error is
detected.
Write caching

A byte-validate policy is used on the write cache. Rather than
reading the data from the L2 cache for the bytes within the line
that are not being overwritten, we just keep an individual valid
bit for each byte. Not performing the read-on-allocate saves
considerable L2 cache bandwidth by postponing a read-modify-write
until the write cache evicts a line. Frequently, by eviction
time the entire line has been written, so the write cache can
eliminate the read.

The write cache is included in the L2 data cache, and write-cache
data can supersede read data from the L2 data cache. We
handle this with a byte-merging multiplexer on the incoming L2
cache data bus that can choose either write-cache data or L2
cache data for each byte.
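The byte-validate policy and the byte-merging multiplexer can be illustrated with a single cache line. This is a minimal sketch: the 64-byte line size is an assumption (the slides do not state the write-cache line size), and the class and method names are invented for illustration.

```python
LINE_BYTES = 64   # assumed line size; not stated on the slide


class WriteCacheLine:
    """One write-cache line with a valid bit per byte."""

    def __init__(self):
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def store(self, offset, payload):
        # No read-on-allocate: write the bytes and mark only them valid.
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def fully_written(self):
        # When true, eviction needs no read-modify-write against the L2 cache.
        return all(self.valid)

    def merge(self, l2_line):
        # Byte-merging multiplexer: write-cache byte where valid, else L2 byte.
        return bytes(w if v else l
                     for w, v, l in zip(self.data, self.valid, l2_line))
```

Storing two bytes at offset 2 and then merging with an L2 line replaces only those two bytes; once every byte of the line has been stored, `fully_written()` reports that the L2 read can be skipped.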
Floating point unit




This unit contains the data paths and control logic to execute floating-point
and partitioned fixed-point data type instructions.
Three data paths concurrently execute floating-point or graphics
instructions, one each per cycle from the following classes:
-Divide/multiply (single or double precision or partitioned)
-Add/subtract/compare (single or double precision or partitioned)
-An independent division data path, which lets a non-pipelined divide
proceed concurrently with the fully pipelined multiply and add paths.
To meet the cycle time, latency cycles must be added to the
floating-point operations.
With advanced circuit techniques in the floating-point add and multiply
units, one added latency cycle is enough.
External memory interface




External memory consists of a large off-chip L2 cache
and an off-chip main memory built from synchronous
DRAMs.
Size of the L2 cache: 4 or 8 Mbytes
Latency: 12 clock cycles to supply a 32-byte line to L1
The L2 tags are placed on chip to detect an L2 miss early
(the L2 cache controller accesses the on-chip tags in parallel with
the start of the off-chip SRAM access and provides a way-select
signal to a late-select address pin on the off-chip
SRAMs).



The L2 caches are wave-pipelined and operate at 600 MHz.
The main memory DRAM controller is on chip, which reduces memory
latency and scales the memory bandwidth with the number of
processors.
The memory controller supports up to 4 Gbytes of SDRAM
memory organized as four independent banks.
Trap stage in the pipeline



In this architecture the classical stall signal (which freezes the state
of the pipeline) is eliminated for performance reasons.
Instead, a trap stage is placed at the end of the pipeline to restore the
state when an unexpected event occurs.
The event is handled like a trap: the instructions that are in the pipeline
are refetched starting from stage A.
Conclusion



The Sun Microsystems UltraSPARC is one of the advanced RISC
microprocessors. It finds many applications in
desktops, network systems, and scientific computing machines.
This presentation described the internal architecture of the UltraSPARC-III.
Various parts of the processor were examined: instruction
issue, execution, and on-chip and external memory.
References

1) ‘UltraSPARC-III: Designing Third-Generation 64-Bit
Performance’, IEEE Micro, June 1999

2) ‘Design Decisions Influencing the UltraSPARC’s Instruction
Fetch Architecture’, 29th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 178-190, 1996, Paris

3) UltraSPARC III v9 Manual, Sun Microsystems.
THANK YOU