## ECE-3056-B Quiz-2 Topic Areas John Copeland March 28, 2014

### 05 Single-Cycle Datapath

- Asynchronous logic – combinational logic
- Synchronous logic – latches, registers
- Datapath – the actual path is determined by control signals and muxes
- Instruction decoding – determines the values of the control signals
- 3 types of instruction formats – how they are used to simplify decoding
- Opcode (bits 31–26); "funct" field (bits 5–0) determines the ALU operation
- PC register – 3 possible inputs: PC+4, branch target, jump target
- Given a diagram like slide 05-28, or VHDL like 05-32, know how to figure out which control lines are set (true) for: add, lw, sw, j, jr, beq (see 05-33, 34, 35, or 37)
- Energy use depends on the average number of gates that switch per cycle – which uses more during an add: the register set or the ALU?

Given VHDL code, or combinational logic, for a simple ALU, be able to modify it to add another operation (e.g., shift, subtract, bit-wise logic, …).
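The course's ALU is given in VHDL, but the same exercise can be sketched in Python to show what "adding an operation" involves. The operation names and the added `sll` case below are illustrative assumptions, not the MIPS funct encodings:

```python
def alu(op, a, b, width=32):
    """Minimal ALU model: returns (result, zero flag).

    The string op codes here are illustrative, not MIPS funct values."""
    mask = (1 << width) - 1
    if op == "add":
        r = (a + b) & mask
    elif op == "sub":
        r = (a - b) & mask
    elif op == "and":
        r = a & b
    elif op == "or":
        r = a | b
    elif op == "sll":              # the newly added operation: shift left logical
        r = (a << b) & mask
    else:
        raise ValueError("unknown ALU op: %s" % op)
    return r, int(r == 0)
```

Adding an operation means adding one more branch to the case structure (in VHDL, one more `when` arm) and, if needed, widening the ALU-control encoding.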

## Datapath With Jumps Added

Slide 05-37

### 06 Multi-Cycle Datapath

- Divide the datapath into five phases – Fetch Instr., Decode Instr., Execute (calculate), Memory R/W, Write Back (IF, ID, EX, MEM, WB)
- Given a diagram like slide 06-8, 11, 17, 18, 20, 22, 23, 24, know how to figure out which control lines (ALUOp, ALUSrcA, MemWrite, …) are set (true) for: add, lw, sw, j, jr, beq, …
- Interpret a state diagram like 06-10 or 06-30 – how many clock cycles for br, add, lw?

For given MIPS assembly code: how many cycles does it take to execute?
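Counting cycles for a code sequence is just a weighted sum. A quick Python sketch, assuming the standard multi-cycle MIPS cycle counts (lw 5, sw 4, R-type 4, branch 3, jump 3 – check these against the state diagram on the slides):

```python
# Cycles per instruction class in the multi-cycle design
# (assumed standard multi-cycle MIPS values; verify on slide 06-10/06-30).
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3, "j": 3}

def total_cycles(instr_classes):
    """Sum the cycle counts for a sequence of instruction classes."""
    return sum(CYCLES[c] for c in instr_classes)

# e.g., a lw followed by an add (R-type) and a beq takes 5 + 4 + 3 cycles
total = total_cycles(["lw", "rtype", "beq"])
```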

State diagram for decoding instructions – ways to implement:
- ROM implementation – very inefficient
- PLA implementation – inputs: 5 opcode bits and 4 state bits ("funct" bits decoded separately); outputs: one for each control line (control signal); vertical lines are minimized by choosing similar bit patterns for closely related functions (06-32)

- CISC (complex instruction set computer) – uses microcode; more clock ticks per instruction
- Unexpected change in flow of control: exception (error) vs. interrupt (I/O)
- Two registers just for exception handling: Cause and Status

## Pipelining Example

Instructions in flight (slide 08a-21), from the IF stage back to the WB stage: add \$14, \$5, \$6; lw \$13, 24(\$1); add \$12, \$3, \$4; sub \$11, \$2, \$3; lw \$10, 20(\$1).

Note what is happening in the register file.

[Figure 08a-21: the pipelined datapath – PC, instruction memory, register file, sign-extend, ALU, data memory, and muxes – with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB marking the pipeline stage execution times.]

## Pipelined Control

Control signals by pipeline stage (slide 08a-23):

| Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg |
|-------------|--------|--------|--------|--------|--------|---------|----------|----------|----------|
| R-format    | 1      | 1      | 0      | 0      | 0      | 0       | 0        | 1        | 0        |
| lw          | 0      | 0      | 0      | 1      | 0      | 1       | 0        | 1        | 1        |
| sw          | X      | 0      | 0      | 1      | 0      | 0       | 1        | 0        | X        |
| beq         | X      | 0      | 1      | 0      | 1      | 0       | 0        | 0        | X        |

RegDst, ALUOp1, ALUOp0, and ALUSrc are the Execution/Address Calculation (EX) stage control lines; Branch, MemRead, and MemWrite are the Memory access (MEM) stage lines; RegWrite and MemtoReg are the Write-Back (WB) stage lines.

- Control signals are derived from the instruction – as in the single-cycle implementation
- Control signals are passed along the pipeline like data
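"Pass control signals along like data" can be modeled in a few lines of Python. This is a hypothetical minimal sketch: each clock, the control word shifts from ID/EX to EX/MEM to MEM/WB, and each pipeline register keeps only the fields later stages still need (signal names from the slide 08a-23 table):

```python
# Fields still needed after each stage (slide 08a-23 signal names).
MEMWB_FIELDS = ("RegWrite", "MemtoReg")
EXMEM_FIELDS = ("Branch", "MemRead", "MemWrite") + MEMWB_FIELDS

def step(idex, exmem):
    """Advance the control words one pipeline stage.

    Returns the new (EX/MEM, MEM/WB) register contents; EX-only
    fields such as RegDst and ALUSrc are consumed and dropped."""
    new_memwb = {k: v for k, v in exmem.items() if k in MEMWB_FIELDS}
    new_exmem = {k: v for k, v in idex.items() if k in EXMEM_FIELDS}
    return new_exmem, new_memwb

# lw's full control word enters ID/EX; after one clock only the
# MEM- and WB-stage fields remain in EX/MEM.
lw_ctl = {"RegDst": 0, "ALUSrc": 1, "Branch": 0, "MemRead": 1,
          "MemWrite": 0, "RegWrite": 1, "MemtoReg": 1}
exmem, memwb = step(lw_ctl, {})
```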

### 07 Performance

- Performance is determined by: algorithm & language, compiler, (hardware) architecture
- Different metrics, and different numbers for different "benchmark" programs
- SPEC2005 – performance on a standard set of benchmark programs
- "Throughput" – work done per unit time
- "Response time" (latency) – time to do a task (e.g., display a Web page)
- Trading throughput versus latency (for multitasking or parallel processing)
- Instructions per second = clock rate / average cycles per instruction
- Programs with heavy I/O or memory reads/writes may depend less on CPU speed
- Amdahl's Law – know the definition, and do calculations like 07-18
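Amdahl's Law calculations like the one on slide 07-18 all use the same formula; a small Python helper (the example numbers are illustrative, not taken from the slide):

```python
def amdahl_speedup(fraction_enhanced, speedup_factor):
    """Overall speedup when only 'fraction_enhanced' of the original
    execution time is improved by 'speedup_factor' (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_factor)

# If 50% of the time is sped up 2x, the overall speedup is only
# about 1.33x, not 2x -- hence "make the common case fast".
```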

Corollary – "make the common case fast." Increased clock rate → higher temperature and lower energy efficiency.

### 08a Pipelined Datapath

Use clocked registers to divide the datapath into five stages – Fetch Instr., Decode Instr., Execute, Memory R/W, Write Back (IF, ID, EX, MEM, WB). Use more clocked registers to delay control signals as the instruction moves through the stages (08a-23).

Hazards: structural, control, data.
- Structural – need to separate instruction and data memory (or caches)
- Control – guess "to branch" or not; stall when wrong – how? Can the branch be determined 2 or 3 instructions ahead – if so, how? How many real branch instructions are there? Only beq (branch if equal). A jump/branch must have the next instruction started before it resolves (kill it if the guess was wrong).
- Data – add logic devices to "forward" data needed by the next instruction. The forwarding unit controls when to forward – from the EX or MEM stage. Problem with "load word" and use – requires a nop insertion unless the order of the following instructions can be swapped ("code scheduling").
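The load-use check the hazard detection unit performs can be sketched in Python. The tuple encoding of instructions below is an assumption made for illustration:

```python
def load_use_stalls(instrs):
    """Find positions where a load-use hazard forces a one-cycle stall
    (a nop), assuming forwarding handles all other data hazards.

    Each instruction is an assumed tuple: (op, dest, src1, src2)."""
    stalls = []
    for i in range(len(instrs) - 1):
        op, dest = instrs[i][0], instrs[i][1]
        uses = instrs[i + 1][2:]          # source registers of the next instr
        if op == "lw" and dest in uses:   # loaded value needed one cycle too soon
            stalls.append(i + 1)
    return stalls

prog = [("lw",  "$t0", "$s0", None),     # $t0 arrives from the MEM stage...
        ("add", "$t2", "$t0", "$t1")]    # ...but is needed in EX: stall (or swap code)
```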

MIPS can predict branches as always taken (wrong when the loop ends). More complicated CPUs have "dynamic" branch prediction.

MIPS – an exception causes a jump to the "Exception Handler" procedure in the OS.

### 08b Pipelined – Thread Level Parallelism

- Flynn's Taxonomy – Single/Multiple Instruction streams, Single/Multiple Data streams
- Multithreading – MIMD – parallel threads or LWPs (light-weight processes)
- Process – a running program with state (register values, including the PC and pipeline registers); scheduled by the OS – swapped out, restarted later
- Threads have their own state and stack memory (local data), but share static data
- Hardware-assisted switching between threads – separate hardware registers; instructions may be interleaved in the pipeline ("fine-grained multithreading")
- "Coarse-grained multithreading" – swap threads on a long stall (like a cache miss)
- Synchronization between threads requires hardware "atomic commands" – e.g., "test and set" without switching threads between the test and the set
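A test-and-set spin lock can be modeled in Python. This is a sketch only: the inner `threading.Lock` stands in for the hardware guarantee that the test and the set happen in one indivisible step, since Python has no real atomic instruction:

```python
import threading

class SpinLock:
    """Sketch of a test-and-set spin lock.

    The inner threading.Lock models the atomicity the hardware
    test-and-set instruction provides."""

    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Atomically read the old flag value and set the flag to True."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        # Spin until the old value was False: that caller owns the lock.
        while self.test_and_set():
            pass

    def release(self):
        self._flag = False
```

A second `test_and_set` by another thread returns True (lock already held), so that thread keeps spinning until `release` clears the flag.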

- When altering common data, a thread must tell the OS where its critical section starts and ends
- Simultaneous multithreading on multiple cores (datapaths)
- Amdahl's Law – increased multithreading reaches a "point of diminishing returns"
- Big-time parallelism requires special many-core computers, high-level languages, and shared and local memory; a "grid" has cores on a 2-D grid of buses
- Communication (assigning tasks, collecting results) – by passing messages


### 09a Memory

Different levels – large, slow, and low-cost → small, fast, and high-cost.

| Level      | Typical size | Cost per Gigabyte | Access time             |
|------------|--------------|-------------------|-------------------------|
| Hard disk  | Terabytes    | \$ 0.25           | millisecond             |
| DRAM       | Gigabytes    | \$ 20.00          | several nanoseconds     |
| Static RAM | Megabytes    | \$ 2000           | –                       |
| Registers  | on CPU – limited area and power | –  | fraction of a nanosecond |

Use multiple levels to get low average cost and high speed for the data being used.
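The "high speed at low average cost" claim can be quantified with the average memory access time formula. A sketch, with illustrative numbers that are not from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a single cache level:
    the hit time, plus the miss penalty weighted by the miss rate."""
    return hit_time + miss_rate * miss_penalty

# e.g., a 1 ns SRAM cache with a 5% miss rate in front of 100 ns DRAM
# averages about 6 ns -- much closer to SRAM speed than to DRAM speed.
```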

- Multicore computers need a fast cache memory for each core
- Data movement takes more energy than computation
- What to put in the cache? Temporal (recent) and spatial (nearby) locality
- How to locate data?

Direct-mapped cache – address parts: [ tag ][ cache address or index ][ byte location ]. Size of cache = 2^(no. of index bits) × 2^(no. of byte-location bits) = number of blocks × block size. Each block also has a valid bit, dirty bit, and tag.
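The address split can be checked in Python; the field widths are parameters, and the word-addressed, 3-index-bit example below is an illustrative assumption in the style of slide 09a-20:

```python
def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, byte offset) fields
    for a direct-mapped cache."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Cache capacity = 2**index_bits blocks x 2**offset_bits bytes per block.
# Word address 0b10110 with 3 index bits and no offset bits splits into
# tag 0b10, index 0b110, offset 0.
```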

To read or write data, look at the indexed block, match the tag, and select the byte location.

If not found (the tag did not match), put the block's contents in the write queue if "dirty" before getting new data from memory (all byte addresses that start with tag + index). "Write through" – immediately update memory; "write back" – just write the cache.


Previous state of cache (slide 09a-20):

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Y | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Y | 10  | Mem[10110] |
| 111   | N |     |            |

Then this happens:

| Word addr | Binary addr | Hit/miss? | Cache block |
|-----------|-------------|-----------|-------------|
| 16        | 10 000      | ?         | 000         |
| 3         | 00 011      | ?         | 011         |
| 16        | 10 000      | ?         | 000         |

What is the new state of the cache? Answer on slide 09a-21.
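The exercise is deterministic, so a short simulation can check an answer before looking at 09a-21. This sketch assumes one-word blocks and a 3-bit index, matching the table above:

```python
def simulate(cache, word_addrs, index_bits=3):
    """Direct-mapped cache with one-word blocks.

    'cache' maps index -> tag for the valid blocks. Returns a
    "hit"/"miss" result per access and fills the cache on each miss."""
    results = []
    for addr in word_addrs:
        index = addr & ((1 << index_bits) - 1)
        tag = addr >> index_bits
        if cache.get(index) == tag:
            results.append("hit")
        else:
            results.append("miss")
            cache[index] = tag   # replace the block at this index
    return results

# Previous state from slide 09a-20: index 010 holds tag 11,
# index 110 holds tag 10.
cache = {0b010: 0b11, 0b110: 0b10}
results = simulate(cache, [16, 3, 16])   # word addresses 10 000, 00 011, 10 000
```

After the run, `cache` holds the new state to compare against slide 09a-21.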