Current and Future Trends in Processor Architecture

Download Report

Transcript Current and Future Trends in Processor Architecture

1

Current and Future Trends in Processor Architecture

Theo Ungerer Borut Robic Jurij Silc

2

Tutorial Background Material

 

Jurij Silc, Borut Robic, Theo Ungerer: Processor Architecture From Dataflow to

Superscalar and Beyond (Springer-Verlag, Berlin, Heidelberg, New York 1999).

Book homepage: http://goethe.ira.uka.de/people/ungerer/proc-arch/

Slide collection of tutorial slides: http://goethe.ira.uka.de/people/ungerer/

Slide collection of book contents (in 15 lectures): http://goethe.ira.uka.de/people/ungerer/prozarch/prslides99-00.html

3

Outline of the Tutorial

 

Part I: State-of-the-art multiple-issue processors

– – –

Superscalar – Overview Superscalar in more detail: Instruction Fetch and Branch Prediction, Decode, Rename, Issue, Dispatch, Execution Units, Completion, Retirement VLIW/EPIC Part II: Solutions for future high-performance processors

– – –

Technology prognosis Speed-up of a single-threaded application

Advanced superscalar, Trace Cache, Superspeculative, Multiscalar processors Speed-up of multi-threaded applications

Chip multiprocessors (CMPs) and Simultaneous multithreading

4

Part I: State-of-the-art multiple-issue processors

  

Superscalar

Overview Superscalar in more detail

– – – – – – – –

Instruction Fetch and Branch Prediction Decode Rename Issue Dispatch Execution Units Completion Retirement VLIW/EPIC

5

Multiple-issue Processors

Today's microprocessors utilize instruction-level parallelism by a multi-stage instruction pipeline and by the superscalar or the VLIW/EPIC technique.

Most of today's general-purpose microprocessors are four- or six-issue superscalars.

VLIW (very long instruction word) is the choice for most signal processors.

VLIW is enhanced to EPIC (explicitly parallel instruction computing) by HP/Intel for its IA-64 ISA.

6

Instruction Pipelining

7

Superscalar Pipeline

IF ID and Rename Issue EX EX EX EX Retire and Write Back

 

Instructions in the instruction window are free from control dependencies branch prediction, and free from name dependences due to due to register renaming. So, only (true) data dependences and structural conflicts remain to be solved.

8

Superscalar vs. VLIW

Superscalar and VLIW : More than a single instruction can be issued to the execution units per cycle.

Superscalar machines are able to dynamically issue multiple instructions each clock cycle from a conventional linear instruction stream.

VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued, and executed synchronously.

Superscalar: dynamic issue, VLIW: static issue

9

Sections of a Superscalar Pipeline

The ability to issue and execute instructions out-of-order partitions a superscalar pipeline in three distinct sections:

in-order section with the instruction fetch, decode and rename stages - the issue is also part of the in-order section in case of an in-order issue,

out-of-order section starting with the issue in case of an out-of-order issue processor, the execution stage, and usually the completion stage, and again an

in-order section that comprises the retirement and write-back stages.

10

Components of a Superscalar Processor

Load/ Store Unit MMU Branch Unit BHT Instruction Issue Unit I-cache MMU BTAC Instruction Fetch Unit Instruction Decode and Register Rename Unit Instruction Buffer Reorder Buffer Floating Point Unit(s) Floating Point Registers Integer Unit(s) General Purpose Registers D-cache Retire Unit Rename Registers 32 (64) Data Bus Bus Inter face Unit 32 (64) Address Bus Control Bus

11

Branch-Target Buffer or Branch-Target Address Cache

 

The Branch Target Buffer (BTB) information. or Branch-Target Address Cache (BTAC) stores branch and jump addresses, their target addresses, and optionally prediction The BTB is accessed during the IF stage.

Branch address Target address Prediction bits

...

...

...

12

Branch Prediction

  

Branch prediction foretells the outcome of conditional branch instructions.

Excellent branch handling techniques are essential for today's and for future microprocessors.

Requirements of high performance branch handling:

– – – – –

an early determination of the branch outcome (the so-called branch resolution ), buffering of the branch target address in a BTAC, an excellent branch predictor execution mechanism, (i.e. branch prediction technique) and speculative often another branch is predicted while a previous branch is still unresolved, so the processor must be able to pursue two or more speculation levels , and an efficient rerolling mechanism branch misprediction penalty ).

when a branch is mispredicted (minimizing the

Misprediction Penalty

13

  

The performance of branch prediction depends on the prediction accuracy cost of misprediction . and the Misprediction penalty depends on many organizational features:

– – – –

the pipeline length (favoring shorter pipelines over longer pipelines), the overall organization of the pipeline, the fact if misspeculated instructions can be removed from internal buffers, or have to be executed and can only be removed in the retire stage, the number of speculative instructions in the instruction window or the reorder buffer. Typically only a limited number of instructions can be removed each cycle.

Mispredicted is expensive:

– –

4 to 9 cycles in the Alpha 21264, 11 or more cycles in the Pentium II.

14

Static Branch Prediction

   

Static Branch Prediction predicts always the same direction for the same branch during the whole program execution.

It comprises hardware-fixed prediction and compiler-directed prediction.

Simple hardware-fixed direction mechanisms can be:

– – –

Predict always not taken Predict always taken Backward branch predict taken, forward branch predict not taken Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction.

15

Dynamic Branch Prediction

 

Dynamic Branch Prediction : the hardware influences the prediction while execution proceeds. Prediction is decided on the computation history of the program.

During the start-up phase of the program execution, where a static branch prediction might be effective, the history information is gathered and dynamic branch prediction gets effective.

In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.

16

One-bit Predictor

T

Predict Taken

NT T

Predict Not Taken

NT

17

One-bit vs. Two-bit Predictors

   

A one-bit predictor correctly predicts a branch at the end of a loop iteration, as long as the loop does not exit. In nested loops, a one-bit prediction scheme will cause two mispredictions for the inner loop:

– –

One at the end of the loop, when the iteration exits the loop instead of looping again, and one when executing the first loop iteration, when it predicts exit instead of looping. Such a double misprediction in nested loops is avoided by a two-bit predictor scheme.

Two-bit Prediction : A prediction must miss twice before it is changed when a two bit prediction scheme is applied.

18

Two-bit Predictors (Saturation Counter Scheme)

T (11) Predict Strongly Taken NT T (10) Predict Weakly Taken NT T (01) Predict Weakly Not Taken NT T (00) Predict Strongly Not Taken NT

19

Two-bit Predictors (Hysteresis Scheme)

T (11) Predict Strongly Taken NT T (10) Predict Weakly Taken T NT (01) Predict Weakly Not Taken NT T (00) Predict Strongly Not Taken NT

20

Two-bit Predictors

   

Two-bit predictors can be implemented in the Branch Target Buffer (BTB) assigning two state bits to each entry in the BTB. Another solution is to use a BTB for target addresses and a separate Branch History Table (BHT) as prediction buffer.

A mispredict in the BHT occurs due to two reasons:

– –

either a wrong guess for that branch, or the branch history of a wrong branch is used because the table is indexed.

In an indexed table lookup part of the instruction address is used as index to identify a table entry.

Two-bit Predictors and Correlation based Prediction

21

 

Two-bit predictors work well for programs which contain many frequently executed loop-control branches (floating-point intensive programs). Shortcomings arise from dependent (correlated) branches, which are frequent in integer-dominated programs.

Example:

if (d==0) /* branch b1*/ d=1; if (d==1) /*branch b2 */ ...

22

Predictor Behavior in Example

   

A one-bit predictor initialized to “ predict taken” for branches b1 and b2

every branch is mispredicted.

A two-bit predictor of of saturation counter scheme starting from the state “predict weakly taken”

every branch is mispredicted.

The two-bit predictor of hysteresis scheme mispredicts every second branch execution of b1 and b2.

A (1,1) correlating predictor takes advantage of the correlation of the two branches; it mispredicts only in the first iteration when d = 2.

Correlation-based Predictor

23

     

The two-bit predictor scheme uses only the recent behavior of a single branch to predict the future of that branch. Correlations between different branch instructions are not taken into account.

Correlation-based predictors or correlating predictors behavior of other branches to make a prediction. additionally use the While two-bit predictors use self-history neighbor history additionally . only, the correlating predictor uses Notation: (m,n)-correlation-based predictor the last m branches to choose from 2 m or (m,n)-predictor uses the behavior of branch predictors, each of which is a n-bit predictor for a single branch. Branch history register (BHR): can be recorded in a m-bit shift register where each bit records whether the branch was taken or not taken.

The global history of the most recent m branches

24

Correlation-based Prediction (2,2)-predictor

Pattern History Tables PHTs (2-bit predictors) Branch address 10 ...

...

...

1 1 ...

...

...

...

...

Branch History Register BHR (2-bit shift register) select 1 0

25

Two-level Adaptive Predictors

     

Developed by Yeh and Patt at the same time (1992) as the correlation-based prediction scheme.

The basic two-level predictor uses a single global branch history register (BHR) of k bits to index in a pattern history table (PHT) of 2-bit counters.

Global history schemes correspond to correlation-based predictor schemes.

Example for the notation: GAg :

– – –

a single global BHR (denoted G ) and a single global PHT (denoted g ), A stands for adaptive. All PHT implementations of Yeh and Patt use 2-bit predictors. GAg-predictor with a 4-bit BHR length is denoted as GAg(4).

Implementation of a GAg(4)-predictor

Branch History Register (BHR) shift direction 1 1 0 0 Index ...

1100 1 1 ...

Branch Pattern History Table (PHT) predict: taken

26

In the GAg predictor schemes the PHT lookup depends entirely on the bit pattern in the BHR and is completely independent of the branch address.

27

Variations of Two-level Adaptive Predictors

Mispredictions can be restrained by additionally using:

    

the full branch address to distinguish multiple PHTs (called per-address PHTs), a subset of branches (e.g. n bits of the branch address) to distinguish multiple PHTs (called per-set PHTs), the full branch address to distinguish multiple BHRs (called per-address BHRs), a subset of branches to distinguish multiple BHRs (called per-set BHRs), or a combination scheme.

28

Two-level Adaptive Predictors

single global PHT per-set PHTs per-address PHTs single global BHR per-address BHT per-set BHT GAg GAs GAp PAg PAs PAp SAg SAs SAp

29

gselect and gshare Predictors

  

gselect predictor : concatenates some lower order bit of the branch address and the global history gshare predictor : uses the bitwise exclusive OR of part of the branch address and the global history as hash function.

McFarling: gshare slightly better than gselect

Branch Address

00000000 00000000 11111111 11111111

BHR

00000001 00000000 00000000 10000000

gselect4/4

00000001 00000000 11110000 11110000

gshare8/8

00000001 00000000 11111111 01111111

30

Hybrid Predictors

The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.

Two or more predictors and a predictor selection mechanism are necessary in a combining or hybrid predictor .

– – –

McFarling: combination of two-bit predictor and gshare two-level adaptive, Young and Smith: a compiler-based static branch prediction with a two-level adaptive type, and many more combinations!

Hybrid predictors often better than single-type predictors.

31

Simulations of Grunwald 1998

committed conditional Application instructions branches branches (in millions) (in millions) taken (%)

compress 80.4

14.4

54.6

gcc 250.9

50.4

49.0

perl go m88ksim xlisp 228.2

548.1

416.5

183.3

43.8

80.3

89.8

41.8

52.6

54.5

71.7

39.5

vortex jpeg 180.9

252.0

29.1

20.0

50.1

70.0

mean 267.6

46.2

54.3

SAg

10.1

12.8

9.2

25.6

4.7

10.3

2.0

10.3

8.6

misprediction rate (%) gshare

10.1

23.9

combining

9.9

12.2

25.9

34.4

8.6

10.2

8.3

12.5

11.4

24.1

4.7

6.8

1.7

10.4

14.5

8.1

SAg, gshare and MCFarling‘s combining predictor for some SPECmarks

32

Results

Simulation of Keeton et al. 1998 using an OLTP (online transaction workload) on a PentiumPro multiprocessor reported a misprediction rate of 14% with an branch instruction frequency of about 21%.

Two different conclusions may be drawn from these simulation results:

– –

Branch predictors should be further improved and/or branch prediction is only effective if the branch is predictable.

If a branch outcome is dependent on irregular data inputs, the branch often shows an irregular behavior.

Question: Confidence of a branch prediction?

Confidence Estimation

33

   

Confidence estimation particular prediction. is a technique for assessing the quality of a Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.

A low confidence branch direction in an irregular way making its outcome hard to predict or even unpredictable.

is a branch which frequently changes its branch Four classes possible:

– – – –

correctly predicted with high confidence C(HC), correctly predicted with low confidence C(LC), incorrectly predicted with high confidence I(HC), and incorrectly predicted with low confidence I(LC).

Predicated Instructions

34

    

Method to “remove” branches Predicated or conditional instructions and one or more predicate registers predicate register as additional input operand.

use a The Boolean result of a condition testing is recorded in a (one-bit) predicate register .

Predicated instructions are fetched, decoded and placed in the instruction window like non predicated instructions. It is dependent on the processor architecture, how far a predicated instruction proceeds speculatively in the pipeline before its predication is resolved:

– –

A predicated instruction executes only if its predicate is true, otherwise the instruction is discarded.

Alternatively the predicated instruction may be executed, but commits only if the predicate is true, otherwise the result is discarded.

35

Predication Example

/*branch b1 */

if

(

x

= = 0) {

a

=

b

+

c

;

d

=

e

-

f

; }

g

=

h

*

i

; /* instruction independent of branch b1 */ (

Pred

= (

x

= = 0) ) if Pred then a =

b

+

c

; if Pred then e =

e

-

f

;

g

=

h

*

i

; /* branch b1:

Pred

is set to true in

x

equals 0 */ /* The operations are only performed */ /* if

Pred

is set to true */

36

Predication

 

Able to eliminate a branch and therefore the associated branch prediction

increasing the distance between mispredictions.

The the run length of a code block is increased

better compiler scheduling.

Predication affects the instruction set, adds a port to the register file, and complicates instruction execution. Predicated instructions that are discarded still consume processor resources; especially the fetch bandwidth.

 

Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body. The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

37

Eager (Multipath) Execution

   

Execution proceeds down both paths of a branch, and no prediction is made. When a branch resolves, all operations on the non-taken path are discarded.

With limited resources, the eager execution strategy must be employed carefully.

Mechanism is required that decides when to employ prediction and when eager execution: e.g. a confidence estimator

Rarely implemented (IBM mainframes) but some research projects:

Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution

38

Branch handling techniques and implementations

Technique No branch prediction Static prediction always not taken always taken backward taken, forward not taken semistatic with profiling Dynamic prediction: 1-bit 2-bit two-level adaptive Hybrid prediction Predication Eager execution (limited) Implementation examples Intel 8086 Intel i486 Sun SuperSPARC HP PA-7x00 early PowerPCs DEC Alpha 21064, AMD K5 PowerPC 604, MIPS R10000, Cyrix 6x86 & M2, NexGen 586 Intel PentiumPro, Pentium II, AMD K6 DEC Alpha 21264 Intel/HP Itanium, ARM processors, TI TMS320C6201 IBM mainframes: IBM 360/91, IBM 3090

39

High-Bandwidth Branch Prediction

  

Future microprocessor will require more than one prediction per cycle starting speculation over multiple branches in a single cycle When multiple branches are predicted per cycle, then instructions must be fetched from multiple target addresses per cycle, complicating I-cache access. Solution: Trace cache in combination with next trace prediction.

Back to the Superscalar Pipeline

IF ID and Rename Issue EX EX EX EX Retire and Write Back 40

In-order delivery of instructions to the out-of-order execution kernel!

41

Decode Stage

    

Delivery task: Keep instruction window full

the deeper instruction look-ahead allows to find more instructions to issue to the execution units. Fetch and decode instructions at a higher bandwidth than execute them. The processor fetches and decodes today about 1.4 to twice as many instructions than it commits (because of mispredicted branch paths).

Typically the decode bandwidth is the same as the instruction fetch bandwidth. Multiple instruction fetch and decode is supported by a fixed instruction length.

42

Decoding variable-length instructions

Variable instruction length: often the case for legacy CISC instruction sets as the Intel IA32 ISA.

a multistage decode is necessary.

– –

The first stage determines the instruction limits within the instruction stream.

The second stage decodes the instructions generating one or several micro-ops from each instruction.

Complex CISC instructions are split into micro-ops which resemble ordinary RISC instructions.

43

Two principal techniques to implement renaming

  

Separate sets of architectural registers provided.

and rename (physical) registers are

– – –

The physical registers contain values (of completed but not yet retired instructions), the architectural registers store the committed values. After commitment of an instruction, copying architectural register is required.

its result from the rename register to the Only a single set of registers is provided and architectural registers are dynamically mapped to physical registers.

– –

The physical registers contain committed values and temporary results. After commitment of an instruction, the physical register is made permanent and no copying is necessary.

Alternative to the dynamic renaming is static renaming in combination with a large register file as defined for the Intel Itanium.

Issue and Dispatch

44

    

The notion of the instruction window decode (rename) and execute stages. comprises all the waiting stations between The instruction window isolates the decode/rename from the execution stages of the pipeline.

Instruction issue is the process of initiating instruction execution in the processor's functional units.

– –

issue to a FU or a reservation station dispatch, if a second issue stage exists to denote when an instruction is started to execute in the functional unit.

The instruction-issue policy is the protocol used to issue instructions. The processor's lookahead capability beyond the current point of execution in hope of finding independent instructions to execute.

is the ability to examine instructions

Issue

45

  

The issue logic examines the waiting instructions in the instruction window and simultaneously assigns ( issues ) a number of instructions to the FUs up to a maximum issue bandwidth . The program order of the issued instructions is stored in the reorder buffer . Instruction issue from the instruction window can be:

– – –

in-order (only in program order) or out-of-order it can be subject to simultaneous data dependences and resource constraints, or it can be divided in two (or more) stages

 

checking structural conflict in the first and data dependences in the next stage (or vice versa). In the case of structural conflicts first, the instructions are issued to reservation stations (buffers) in front of the FUs where the issued instructions await missing operands.

46

Reservation Station(s)

Two definitions in literature:

– –

A reservation station is a buffer for a single instruction with its operands (original Tomasulo paper, Flynn's book, Hennessy/Patterson book).

A reservation station is a buffer (in front of one or more FUs) with one or more entries and each entry can buffer an instruction with its operands(PowerPC literature).

 

Depending on the specific processor, reservation stations can be central to a number of FUs or each FU has one or more own reservation stations.

Instructions await their operands in the reservation stations, as in the Tomasulo algorithm.

47

Dispatch

    

An instruction is then said to be dispatched from a reservation station to the FU when all operands are available, and execution starts.

If all its operands are available during issue and the FU is not busy, an instruction is immediately dispatched, starting execution in the next cycle after the issue. So, the dispatch is usually not a pipeline stage. An issued instruction may stay in the reservation station for zero to several cycles.

Dispatch and execution is performed out of program order.

Other authors interchange the meaning of issue and dispatch or use different semantic.

48

The following issue schemes are commonly used

Single-level, central issue : single-level issue out of a central window as in Pentium II processor Issue and Dispatch Decode and Rename Functional Units

49

Single-level, two-window issue

Single-level, two-window issue : single-level issue with a instruction window decoupling using two separate windows

most common: separate floating point and integer windows as in HP 8000 processor Issue and Dispatch Functional Units Decode and Rename Functional Units

Two-level issue with multiple windows

50

Two-level issue with multiple windows with a centralized window in the first stage and separate windows in the second stage (PowerPC 604 and 620 processors).

Issue Dispatch Functional Unit Decode and Rename Functional Unit Functional Unit Functional Unit Reservation Stations

Execution Stages

51

    

Various types of FUs classified as:

– –

single-cycle (latency of one) or multiple-cycle (latency more than one) units.

Single-cycle units Usually they are also able to accept a new instruction each cycle (throughput of one). produce a result one cycle after an instruction started execution. Multi-cycle units perform more complex operations that cannot be implemented within a single cycle. Multi-cycle units

– –

can be pipelined to accept a new operation each cycle or each other cycle or they are non-pipelined.

Another class of units exists that perform the operations with variable cycle times.

Types of FUs

52

    

single-cycle (single latency) units:

(simple) integer and (integer-based) multimedia units, multicycle units that are pipelined (throughput of one):

complex integer, floating-point, and (floating-point -based) multimedia unit (also called multimedia vector units), multicycle units that are pipelined but do not accept a new operation each cycle (throughput of 1/2 or less):

often the 64-bit floating-point operations in a floating-point unit, multicycle units that are often not pipelined:

division unit, square root units, complex multimedia units variable cycle time units:

load/store unit (depending on cache misses) and special implementations of e.g. floating-point units.

Multimedia Units

Utilization of subword parallelism (data parallel instructions, SIMD)

R1: R2: x1 x2 x3 x4 y1 y2 y3 y4

* * * *

53

   R3: x1*y1x2*y2x3*y3x4*y4

Saturation arithmetic Additional arithmetic instructions, e.g. pavgusb (average instruction), masking and selection instructions, reordering and conversion MM streams and/or 3D graphics supported

54

Finalizing Pipelined Execution - Completion, Commitment

An instruction is completed when the FU finished the execution of the instruction and the result is made available for forwarding and buffering.

Instruction completion is out of program order.

Committing an operation means that the results of the operation have been made permanent and the operation retired from the scheduler.

55

Finalizing Pipelined Execution - Retirement and Write-Back

Retiring means removal from the scheduler with or without the commitment of operation results, whichever is appropriate.

Retiring an operation does not imply the results of the operation are either permanent or non permanent.

A result is made permanent:

– –

either by making the mapping of architectural to physical register permanent (if no separate physical registers exist) or by copying the result value from the rename register to the architectural register ( in case of separate physical and architectural registers) in an own write-back stage after the commitment!

56

Reorder Buffers

     

The reorder buffer keeps the original program order of the instructions after instruction issue and allows result serialization during the retire stage.

State bits store if an instruction is on a speculative path, and when the branch is resolved, if the instruction is on a correct path or must be discarded. When an instruction completes, the state is marked in its entry. Exceptions are marked in the reorder buffer entry of the triggering instruction. The reorder buffer is implemented as a circular FIFO buffer. Reorder buffer entries are allocate in the (first) issue stage and deallocated serially when the instruction retires.

57

Precise Interrupt (Precise Exception)

   

An interrupt or exception is called precise if the saved processor state corresponds with the sequential model of program execution where one instruction execution ends before the next begins.

Precise exception means that all instructions before the faulting instruction are committed and those after it can be restarted from scratch. If an interrupt occurred, all instructions that are in program order before the interrupt signaling instruction are committed, and all later instructions are removed. Depending on the architecture and the type of exception, the faulting instruction should be committed or removed without any lasting effect.

58

VLIW and EPIC

VLIW (very long instruction word) and EPIC (explicit parallel instruction computing):

Compiler packs a fixed number of instructions into a single VLIW/EPIC instruction.

The instructions within a VLIW instruction are issued and executed in parallel, EPIC is more flexible.

Examples: VLIW: High-end signal processors (TMS320C6201) EPIC: Intel Merced/Itanium

59

Intel's IA-64 EPIC Format

   

IA-64 instruction word Instruction 2 41 bits Instruction 1 41 bits Instruction 0 41 bits Template 5 bits 128 bits IA-64 instructions are packed by compiler into bundles.

A bundle is a 128-bit long instruction word (LIW) containing three IA-64 instructions along with a so-called template that contains instruction grouping information.

IA-64 does not insert no-op instructions to fill slots in the bundles. The template explicitly indicates parallelism, that is,

– – –

whether the instructions in the bundle can be executed in parallel or if one or more must be executed serially and whether the bundle can be executed in parallel with the neighbor bundles.

60

Part II: Microarchitectural solutions for future microprocessors

  

Technology prognosis Speed-up of a single-threaded application

– – – –

Advanced superscalar Trace Cache Superspeculative Multiscalar processors Speed-up of multi-threaded applications

– –

Chip multiprocessors (CMPs) Simultaneous multithreading

61

Technological Forecasts

Moore's Law: number of transistors per chip double every two years

SIA (semiconductor industries association) prognosis 1998:

Year of 1 st shipment Local Clock (GHz) Across Chip (GHz) Chip Size (mm²) Dense Lines (nm) Number of chip I/O Transistors per chip 1997 0,75 0,75 300 250 1515 11M 1999 1,25 1,2 340 180 1867 21M 2002 2,1 1,6 430 130 2005 3,5 2 520 100 2008 6 2,5 620 70 2553 3492 4776 76M 200M 520M 2011 10 3 3,674 750 901 50 2014 16,9 35 6532 8935 1,4B 3,62B

62

Design Challenges

  

Increasing clock speed, the amount of work that can be performed per cycle, and the number of instructions needed to perform a task.

  

Today's general trend toward more complex designs is opposed by the wiring delay within the processor chip as main technological problem. higher clock rates with subquarter-micron designs

on-chip interconnecting wires cause a significant portion of the delay time in circuits. Functional partitioning becomes more important!

63

Architectural Challenges and Implications

    

Preserve object code compatibility (may be avoided by a virtual machine that targets run-time ISAs) Find ways of expressing and exposing more parallelism to the processor. It is doubtful if enough ILP is available. Harness thread-level paralelism (TLP) additionally.

Memory bottleneck Power consumption for mobile computers and appliances.

Soft errors by cosmic rays of gamma radiation may be faced with fault-tolerant design through the chip.

64

Future Processor Architecture Principles

 

Speed-up of a single-threaded application

– – – –

Advanced superscalar Trace Cache Superspeculative Multiscalar processors Speed-up of multi-threaded applications

– –

Chip multiprocessors (CMPs) Simultaneous multithreading

65

Processor Techniques to Speed-up Single-threaded Application

   

Advanced superscalar processors instructions per cycle.

scale current designs up to issue 16 or 32 Trace cache facilitates instruction fetch and branch prediction Superspeculative processors enhance wide-issue superscalar performance by speculating aggressively at every point.

Multiscalar processors to a number of parallel processing units under control of a single hardware sequencer.

divide a program in a collection of tasks that are distributed

66

Advanced Superscalar Processors for Billion Transistor Chips

      

Aggressive speculation, such as a very aggressive dynamic branch predictor, a large trace cache, very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle), a large number of reservation stations to accommodate 2,000 instructions, 24 to 48 highly optimized, pipelined functional units, sufficient on-chip data cache, and sufficient resolution and forwarding logic.

67

The Trace Cache

     

Trace cache is a special I-cache that captures dynamic instruction sequences in contrast to the I-cache that contains static instruction sequences. Like the I-cache, the trace cache is accessed using the starting address of the next block of instructions. Unlike the I-cache, it stores logically contiguous instructions in physically contiguous storage. A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches.

Each line stores a snapshot, or trace , of the dynamic instruction stream. The trace construction is of the critical path.

68

I-cache and Trace Cache

I-cache Trace cache

Superspeculative Processors

69

 

Idea: Instructions generate many highly predictable result values in real programs

Speculate on source operand values and begin execution without waiting for result from the previous instruction .

Speculate about true data dependences!!

reasons for the existence of value locality

Register spill code.

– – –

Input sets often contain data with little variation.

A compiler often generates run-time constants due to error-checking, switch statement evaluation, and virtual function calls.

The compiler also often loads program constants from memory rather than using immediate operands.

Strong- vs. Weak-dependence Model

70

 

Strong-dependence model for program execution: a total instruction ordering of a sequential program.

– –

Two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist. Dependences are never allowed to be violated and are enforced during instruction processing. Weak-dependence model :

– –

specifying that dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state. Advantage: the machine can speculate aggressively and temporarily violate the dependences. The machine can exceed the performance limit imposed by the strong-dependence model.

71

Implementation of a Weak-dependence Model

The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions to aggressively speculate past them.

The back-end engine still uses the strong-dependence model speculations, recover from misspeculation, and provide history and guidance information to the speculative engine.

to validate the

72

Superflow processor

The Superflow processor speculates on

– – –

instruction flow : two-phase branch predictor combined with trace cache register data flow : dependence prediction: predict the register value dependence between instructions

 

source operand value prediction constant value prediction

 

value stride prediction: speculate on constant, incremental increases in operand values dependence prediction predicts inter-instruction dependences memory data flow : prediction of load values, of load addresses and alias prediction

73

Superflow Processor Proposal

74

Multiscalar Processors

    

A program is represented as a control flow graph nodes, and arcs represent flow of control. (CFG), where basic blocks are A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task.

The tasks are distributed to a number of parallel PEs within a processor. Each PE fetches and executes instructions belonging to its assigned task. The primary constraint: it must preserve the sequential program semantics.

75

Multiscalar mode of execution

A B C D E PE 0 Task A PE 1 Task B PE 2 Task D PE 3 Task E

76

Multiscalar processor

77

Multiscalar, Trace and Speculative Multithreaded Processors

  

Multiscalar : A program is statically partitioned into tasks which are marked by annotations of the CFG.

Trace Processor : Tasks are generated from traces of the trace cache.

Speculative multithreading : Tasks are otherwise dynamically constructed.

 

Common target: Increase of single-thread program performance by dynamically utilizing thread-level speculation additionally to instruction-level parallelism.

A „thread“ means a „HW thread“

78

Additional utilization of more coarse grained parallelism

Chip multiprocessors (CMPs) or multiprocessor chips

– –

integrate two or more complete processors on a single chip, every functional unit of a processor is duplicated.

Simultaneous multithreaded processors (SMPs)

– – –

store multiple contexts in different register sets on the chip, the functional units are multiplexed between the threads, instructions of different contexts are simultaneously executed.

79

Shared memory candidates for CMPs

Pro cessor Pro cessor Pro cessor Primary Cache Pro cessor Secondary Cache Global Memory Shared primary cache

80

Shared memory candidates for CMPs

Pro cessor Primary Cache Pro cessor Primary Cache Pro cessor Primary Cache Pro cessor Primary Cache Secondary Cache Secondary Cache Secondary Cache Secondary Cache Global Memory

Shared caches and memory

Pro cessor Pro cessor Pro cessor Pro cessor Primary Cache Primary Cache Primary Cache Primary Cache Secondary Cache Global Memory

Shared secondary cache

81

Hydra: A Single-Chip Multiprocessor

CPU 0 Primary I-cache Primary D-cache CPU 0 Memory Controller Centralized Bus Arbitration Mechanisms CPU 1 Primary I-cache Primary D-cache CPU 1 Memory Controller CPU 2 Primary I-cache Primary D-cache CPU2 Memory Controller CPU 3 Primary I-cache Primary D-cache CPU 3 Memory Controller On-chip Secondary Cache Off-chip L3 Interface Cache SRAM Array Rambus Memory Interface DMA I/O Bus Interface DRAM Main Memory I/O Device

82

Shared memory candidates for CMPs

Pro cessor Pro cessor Pro cessor Primary Cache Pro cessor Global Memory Shared global memory, no caches

83

Motivation for Processor-in-Memory

     

Technological trends have produced a large and growing gap between processor speed and DRAM access latency. Today, it takes dozens of cycles for data to travel between the CPU and main memory.

CPU-centric design philosophy with deep pipelines. has led to very complex superscalar processors Much of this complexity is devoted to hiding memory access latency. Memory wall : the phenomenon that access times are increasingly limiting system performance.

Memory-centric design is envisioned for the future!

84

PIM or Intelligent RAM (IRAM)

  

PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks.

PIM or IRAM merge processor and memory into a single chip.

Advantages:

– – – –

The processor-DRAM gap in access speed increases in future. PIM provides higher bandwidth and lower latency for (on-chip-)memory accesses.

DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches.

On-chip memory may be treated as main memory - in contrast to a cache which is just a redundant memory copy.

PIM decreases energy consumption in the memory system due to the reduction of off-chip accesses.

85

PIM Challenges

  

Scaling a system beyond a single PIM. The DRAM technology today does not allow on-chip coupling of high performance processors with DRAM memory since the clock rate of DRAM memory is too low.

Logic and DRAM manufacturing processes are fundamentally different.

The PIM approach can be combined with most processor organizations.

– – – –

The processor(s) itself may be a simple or moderately superscalar standard processor, it may also include a vector unit as in the vector IRAM type, or be designed around a smart memory system.

In future: potentially memory-centric architectures.

86

Conclusions on CMP

Usually, a CMP will feature:

– –

separate L1 I-cache and D-cache per on-chip CPU and an optional unified L2 cache.

If the CPUs always execute threads of the same process, the L2 cache organization will be simplified, because different processes do not have to be distinguished.

Recently announced commercial processors with CMP hardware:

– –

IBM Power4 processor with 2 processor on a single die Sun MAJC5200 two processor on a die (each processor a 4-threaded block interleaving VLIW)

87

Motivation for Multithreaded Processors

 

Aim : Latency tolerance What is the problem? Load access latencies measured on an Alpha Server 4100 SMP with four 300 MHz Alpha 21164 processors are:

– – – –

7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor, 21 cycles for a L2 cache miss which hits in the L3 (board-level) cache, 80 cycles for a miss that is served by the memory, and 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory.

88

Multithreading

Multithreading

– –

The ability to pursue two or more threads of control in parallel within a processor pipeline.

Advantage : The latencies that arise in the computation of a single instruction stream are filled by computations of another thread.

Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors.

89

Multithreaded Processors

Multithreading:

– –

Provide several program counters registers (and usually several register sets) on chip Fast context switching by switching to another thread of control

Thread 1: Thread 2: Thread 3: Thread 4:

...

Register set 1 Register set 2 Register set 3 Register set 4

...

FP PC PSR 1 PC PSR 2 PC PSR 3 PC PSR 4

...

90

Approaches of Multithreaded Processors

  

Cycle-by-cycle interleaving

An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle.

Block-interleaving

The instructions of a thread are executed successively until an event occurs that may cause latency. This event induces a context switch.

Simultaneous multithreading

– –

Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor.

combines a wide issue superscalar instruction issue with multithreading .

Comparision of Multithreading with Non-Multithreading Approaches

(a) single-threaded scalar (b) cycle-by-cycle interleaving multithreaded scalar (c) block interleaving multithreaded scalar 91

(a) (b) (c)

Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP)

(a) SMT (b) CMP 92

Issue slots (a) (b)

93

Simultaneous Multithreading

State of research

– – –

SMT is simulated and evaluated with Spec92, Spec95, and with database transaction and decision support workloads Mostly unrelated programs are loaded in the thread slots!

Typical result: 8-threaded SMT reaches a two- to threefold IPC increase over single threaded superscalar.

State of industrial development

DEC/Compaq announced Alpha EV8 ( 21464 ) as 4-threaded 8-wide superscalar SMT processor

94

Combining SMT and Multimedia

Start with a wide-issue superscalar general-purpose processor

Enhance by simultaneous multithreading

Enhance by multimedia unit(s)

Enhance by on-chip RAM memory for constants and local variables

95

The SMT Multimedia Processr Model

To Memory Memory interface DCache ICache IF IF ID ID RI Global L/S Local L/S Thread Control Branch Simple Integer Compl Integer Local Memory I/O RT Rename WB BTAC Register

96

IPC of Maximum Processor Models

8 Threads 6,32 6,33 5,56 5,64 5,67 5,34 4 1 1 3,84 3,89 3,91 3,53 3,52 1,98 1,99 1,99 1,96 1 1 1,57 3,27 1,86 1,86 1,86 0,96 1 1 2 4 6 Issue 8 7 6 5 4 3 2 1 0 IPC

97

CMP or SMT?

   

The performance race between SMT and CMP is not yet decided. CMP is easier to implement, but only SMT has the ability to hide latencies. A functional partitioning is not easily reached within a SMT processor due to the centralized instruction issue.

– –

A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.

A combination of simultaneous multithreading with the CMP may be superior.

Research: combine SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread

– –

thread-level speculation close to multiscalar

98

This is the End!

Several alternative processor design principles were introduced:

– –

fine grain techniques (increasing performance of a single thread of control) coarse grain techniques to speed up a multiprogramming mix

Nothing is so hard to predict like the future.