CS 152 Computer Architecture and Engineering

Lecture 19 -- Dynamic Scheduling II, 2014-4-3. John Lazzaro (not a prof -- “John” is always OK)

TA: Eric Love

www-inst.eecs.berkeley.edu/~cs152/

Case studies of dynamic execution

DEC Alpha 21264: High performance from a relatively simple implementation of a modern instruction set.

Short Break

Simultaneous Multi-threading: Adapting multi-threading to dynamic scheduling.

IBM Power: Evolving dynamic designs over many generations.

DEC Alpha 21164: 4-issue in-order design.

21264: 4-issue out-of-order design.

21264 was 50% to 200% faster in real-world applications.

500 MHz 0.5µ parts for in-order 21164 and out-of-order 21264.

21264 has a 1.7x advantage on integer code, and a 2.7x advantage on floating-point code. 21264 has 55% more transistors than the 21164. The die is 44% larger.

21264 consumes 46% more power than the 21164.

Similarly sized on-chip caches (116K vs. 128K). The in-order 21164 has a larger off-chip cache.

The Real Difference: Speculation

If the ability to recover from mis-speculation is built into an implementation ... it offers the option to add speculative features to all parts of the design.

21264 die photo: Separate OoO control for integer and floating point. RISC decode happens in the OoO blocks. Unlabeled areas are devoted to memory-system control.

(Die photo labels: Fetch and predict, OoO, I-Cache, Int Pipe, FP Pipe, Data Cache.)

21264 pipeline diagram: Rename and Issue stages are the primary locations of dynamic scheduling logic. Load/store disambiguation support resides in the Memory stage.

Slot: absorbs delay of long path on last slide.

Fetch stage close-up: Each cache line stores a prediction of the next line and the cache way to be fetched. If the predictions are correct, the fetcher maintains the required 4 instructions/cycle pace. These fetches are speculative.
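To make the line/way prediction idea concrete, here is a minimal Python sketch of a fetcher chasing per-line predictions; the class, field names, and icache layout are illustrative assumptions, not the actual 21264 fetch hardware.

```python
# Toy model of next-line / way prediction, as described above.
# ICacheLine and the icache layout are illustrative assumptions.

class ICacheLine:
    def __init__(self, instructions, next_index, next_way):
        self.instructions = instructions   # the 4 instructions in this fetch block
        self.next_index = next_index       # predicted index of the next line
        self.next_way = next_way           # predicted way of the next line

def fetch_stream(icache, index, way, n_cycles):
    """icache[index][way] -> ICacheLine. Chase predictions for n_cycles."""
    fetched = []
    for _ in range(n_cycles):
        line = icache[index][way]
        fetched.extend(line.instructions)  # 4 instructions/cycle while predictions hold
        # Speculative: a wrong prediction is detected later and redirects fetch
        # (recovery is not modeled here).
        index, way = line.next_index, line.next_way
    return fetched
```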

Rename stage close-up: (1) Allocates new physical registers for destinations, (2) looks up physical register numbers for sources, (3) handles rename dependences among the 4 issuing instructions, all in one clock cycle!

The rename mappings are time-stamped, for mis-speculation recovery.

Output: 12 physical register numbers: 1 destination and 2 sources for each of the 4 instructions to be issued.

Input: 4 instructions specifying architected registers.

Recall: malloc() -- free() in hardware

The record-keeping shown in this diagram occurs in the rename stage.
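A minimal sketch of the malloc()/free() analogy, assuming a simple free list and map table (not the 21264's actual circuits):

```python
# Minimal register-renaming sketch: "malloc()/free() in hardware".
# The free list, map table, and instruction format are simplifying assumptions.

class RenameStage:
    def __init__(self, n_arch, n_phys):
        self.map = {r: r for r in range(n_arch)}      # architected -> physical
        self.free_list = list(range(n_arch, n_phys))  # unallocated physical regs

    def rename(self, group):
        """Rename a group of up to 4 (dest, src1, src2) instructions."""
        out = []
        for dest, src1, src2 in group:
            # Sources read the current map, so a dependence on an earlier
            # instruction in the same group picks up its new physical register.
            p1, p2 = self.map[src1], self.map[src2]
            pd = self.free_list.pop(0)                # "malloc()" a physical reg
            self.map[dest] = pd
            out.append((pd, p1, p2))
        # "free()" happens at retirement, when an older mapping of dest can
        # no longer be referenced (not modeled here).
        return out  # 12 physical register numbers for a 4-instruction group

rs = RenameStage(n_arch=32, n_phys=80)
print(rs.rename([(1, 2, 3), (4, 1, 5)]))  # 2nd instr's src 1 maps to the new phys reg 32
```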

CS 152 L18: Dynamic Scheduling I UC Regents Spring 2014 © UCB

Issue stage close-up: (1) Newly issued instructions placed in top of queue.

(2) Instructions check the scoreboard: are 2 sources ready?

(3) Arbiter selects 4 oldest “ready” instructions.

(4) Update removes these 4 from queue.

Input: 4 just-issued instructions, renamed to use physical registers.

Output: The 4 oldest instructions whose 2 source registers are ready for use.

Scoreboard: Tracks writes to physical registers.
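A small sketch of the selection policy just described (oldest-ready-first, up to 4 wide), with the scoreboard modeled as a set of ready physical registers; the data structures are assumptions for illustration.

```python
# Sketch of the issue policy above: scan oldest-first, pick up to 4
# instructions whose two sources are ready in the scoreboard.

def select_ready(queue, scoreboard, width=4):
    """queue: oldest-first list of (tag, src1, src2).
    scoreboard: set of physical registers already written."""
    issued = []
    for inst in list(queue):
        _tag, src1, src2 = inst
        if src1 in scoreboard and src2 in scoreboard:   # (2) check scoreboard
            issued.append(inst)                         # (3) oldest "ready" first
            queue.remove(inst)                          # (4) remove from queue
            if len(issued) == width:
                break
    return issued
```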

Execution close-up: (1) Two copies of register files, to reduce port pressure.

(2) Forwarding buses are low-latency paths through the CPU. (Relies on speculation.)

Latencies, from issue to retirement.

Short latencies keep buffers to a reasonable size.

Retirement managed here.

8 retirements per cycle can be sustained over short time periods.

Peak rate is 11 retirements in a single cycle.

Execution unit close-up: (1) Two arbiters: one for top pipes, one for bottom pipes.

(2) Instructions are statically assigned to the top or bottom pipes. (3) The arbiter dynamically selects left or right.

Thus, 2 dual-issue dynamic machines, not a 4-issue machine. Why?

Simplifies arbiter. Performance penalty?

A few %.
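A toy sketch of the clustering just described, assuming instructions carry a static top/bottom tag and the arbiter picks left or right each cycle; this illustrates the idea and the small penalty, not the real arbiter.

```python
# Toy sketch of the two dual-issue clusters: the top/bottom row is fixed per
# instruction; the left/right cluster is chosen dynamically. Illustrative only.

def dispatch(ready):
    """ready: list of (inst, row) with row in {"top", "bottom"}.
    Returns {(cluster, row): inst} with at most one inst per slot."""
    slots = {}
    for inst, row in ready:
        for cluster in ("left", "right"):     # dynamic choice of cluster
            if (cluster, row) not in slots:   # row itself is statically assigned
                slots[(cluster, row)] = inst
                break
    return slots

# Four "top" instructions can only use 2 of the 4 pipes this cycle -- the
# source of the "few %" penalty versus a true 4-issue machine:
print(dispatch([("a", "top"), ("b", "top"), ("c", "top"), ("d", "top")]))
```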

Memory stages close-up: Loads and stores from the execution unit appear as the “Cluster 0/1 memory unit” in the diagram below.

1st stop: TLB, to convert virtual memory addresses.

2nd stop: Load Queue (LDQ) and Store Queue (STQ), each holding up to 32 instructions until retirement, so we can roll back.

3rd stop: Flush the STQ to the “double pumped” data cache; on a miss, place the access in the Miss Address File (MAF, the 21264's name for an MSHR).

LDQ/STQ close-up: To prevent memory-ordering hazards, the LDQ and STQ keep lists of up to 32 loads and stores, in issued order. When a new load or store arrives, addresses are compared to detect and fix hazards.

LDQ/STQ speculation: On a load's first execution, if it is found to have issued ahead of an older store to the same address, the load is squashed and replayed. The hardware also marks the load instruction in a predictor, so that future invocations are not speculatively executed ahead of stores.

(Figure: first execution vs. subsequent execution.)
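A rough sketch of the disambiguation and store-wait-predictor idea above, with simplified, assumed data structures (a flat store queue and a set of marked load PCs):

```python
# Rough sketch of load/store disambiguation plus a store-wait predictor.

store_wait = set()   # PCs of loads that previously caused an ordering violation

def execute_load(pc, addr, stq, dcache):
    """stq: oldest-first list of (store_addr, store_data) not yet retired."""
    if pc in store_wait:
        return None                      # hold the load until older stores drain
    for st_addr, st_data in reversed(stq):
        if st_addr == addr:
            return st_data               # forward from the youngest matching store
    return dcache.get(addr)              # no conflict seen: read the cache

def on_ordering_violation(load_pc):
    """An older store to the same address showed up after the load completed."""
    store_wait.add(load_pc)              # future runs of this load will wait
    # ... squash the load and everything younger, then replay (not modeled) ...
```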

Designing a microprocessor is a team sport. Below are the author and acknowledgement lists for the papers whose figures I use.

(Author lists labeled by role: micro-architects, architects, circuits.) There is no “i” in T-E-A-M ...

Break

Multi-Threading (Dynamic Scheduling)

Power 4 (predates Power 5 shown earlier)

Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.

For most apps, most execution units lie idle

Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware? (Figure: utilization for an 8-way superscalar.)

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995.

Simultaneous Multi-threading ...

One thread, 8 units vs. two threads, 8 units.

(Figure: issue slots M M FX FX FP FP BR CC across cycles 1-9, showing slot occupancy for one thread vs. two threads.)

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
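A toy model of the slot diagram above: each cycle, fill the 8 issue slots from whichever threads have a ready instruction of the matching type. The thread representation is an assumption for illustration.

```python
# Toy illustration of SMT slot sharing for the slots M M FX FX FP FP BR CC.

SLOTS = ["M", "M", "FX", "FX", "FP", "FP", "BR", "CC"]

def issue_cycle(threads):
    """threads: list of dicts, unit type -> count of ready instructions.
    Returns (slots used, slots available) for one cycle."""
    used = 0
    for slot in SLOTS:
        for ready in threads:              # any thread may claim the slot
            if ready.get(slot, 0) > 0:
                ready[slot] -= 1
                used += 1
                break
    return used, len(SLOTS)

print(issue_cycle([{"M": 1, "FX": 2, "BR": 1}]))       # (4, 8): one thread
print(issue_cycle([{"M": 1, "FX": 2, "BR": 1},
                   {"M": 2, "FP": 2, "CC": 1}]))       # (8, 8): two threads
```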

Power 4 vs. Power 5 pipelines: Power 5 adds 2 fetch paths (2 PCs), 2 initial decodes, and 2 commits (2 architected register sets).

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to becoming a bottleneck.

Power 5 thread performance ...

Relative priority of each thread controllable in hardware.

For balanced operation, both threads run slower than if they “owned” the machine.
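A toy model of hardware thread priority, splitting a cycle's slots in proportion to priority; this is only an illustration of the idea, not IBM's actual Power 5 mechanism.

```python
# Toy model of thread priority: split a cycle's 8 slots in proportion
# to each thread's priority. Illustrative only.

def split_slots(slots, prio_a, prio_b):
    a = round(slots * prio_a / (prio_a + prio_b))
    return a, slots - a

print(split_slots(8, 1, 1))   # (4, 4): balanced, both slower than running alone
print(split_slots(8, 7, 1))   # (7, 1): thread A strongly favored
```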

Multi-Core

Recall: Superscalar utilization by a thread

For an 8-way superscalar.

Observation: In many cases, the on-chip cache and DRAM I/O bandwidth is also underutilized by one CPU. So, let 2 cores share them.

Most of the Power 5 die is shared hardware.

(Die photo labels: Core #1, Core #2. Shared components: L2 Cache, L3 Cache Control, DRAM Controller.)

Core-to-core interactions stay on chip

(1) Threads on two cores that use shared libraries conserve L2 memory.

(2) Threads on two cores share memory via L2 cache operations.

Much faster than 2 CPUs on 2 chips.

Sun Niagara

The case for Sun’s Niagara ...

For an 8-way superscalar.

Observation: Some apps struggle to reach a CPI == 1. For throughput on these apps, a large number of single-issue cores is better than a few superscalars.
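A back-of-the-envelope version of that argument, with made-up numbers for a wide core stuck at high CPI versus many simple multi-threaded cores; the specific figures are assumptions, not measurements.

```python
# Illustrative throughput comparison: a wide core near CPI 4 on
# cache-miss-heavy code vs. many simple cores whose threads hide the misses.

def mips(cores, ghz, ipc_per_core):
    return cores * ghz * 1000 * ipc_per_core   # millions of instructions/sec

wide   = mips(cores=2, ghz=2.0, ipc_per_core=0.25)  # CPI ~ 4
simple = mips(cores=8, ghz=1.2, ipc_per_core=0.7)   # threads keep pipes busy

print(wide, simple)   # ~1000 vs. ~6700 MIPS with these assumed numbers
```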

Niagara (original): 32 threads on one chip

8 cores: single-issue, 1.2 GHz, 6-stage pipeline, 4-way multi-threaded, fast crypto support.

Die size: 340 mm² in 90 nm.

Power: 50-60 W

Shared resources: 3 MB on-chip cache; 4 DDR2 interfaces (32 GB DRAM, 20 GB/s); 1 shared FP unit; Gigabit Ethernet ports.

Sources: Hot Chips (via EE Times), Infoworld, J. Schwartz weblog (Sun COO).

The board that booted Niagara first-silicon

Source: J. Schwartz weblog (then Sun COO, now CEO).

Used in Sun Fire T2000: “Coolthreads”

Claim: server uses 1/3 the power of competing servers.

Web server benchmarks used to position the T2000 in the market.

IBM RISC chips, since Power 4 (2001) ...

(Figure: timeline of IBM RISC chips from 2001 through 2014.)

Recap: Dynamic Scheduling

Three big ideas: register renaming, data-driven detection of RAW hazard resolution, and a bus-based architecture.

Very complex, but enables many things: out-of-order execution, multiple issue, loop unrolling, etc.

Has saved architectures that have a small number of registers: IBM 360 floating-point ISA, Intel x86 ISA.

On Tuesday: Epilogue ...

Have a good weekend!