Transactors and RAMP


RAMP Design Infrastructure
Krste Asanovic
[email protected]
MIT Computer Science and Artificial Intelligence Laboratory
http://cag.csail.mit.edu/scale
Embedded RAMP Workshop, BWRC
August 23, 2006
RAMP Approach
• Detailed target-cycle-accurate emulation of the proposed machine, NOT running applications as fast as possible on the underlying platform
• But must run applications fast enough (~100MHz) to allow software development
• Initially, should boot and run standard software (OS + applications unchanged)
• Challenges:
– Accurate target-cycle emulation
– Efficient use of FPGA resources
– Providing reproducibility, debugging, monitoring
– Managing design complexity with multiple contributing authors
– Providing flexibility for rapid architectural exploration
• Approach:
– Generate a distributed cycle-accurate hardware event simulator from the transactor model
RAMP Design Framework Overview
[With Greg Gibeling, Andrew Schultz, UCB]

(Diagram: the target system, i.e., the machine being emulated: CPUs connected by an interconnect network to DRAM. RDL compiles it either to FPGA emulation on a BEE2 host platform with five 2VP70 FPGAs, or to software simulation on a workstation host platform.)
• Describe structure as a transactor netlist in the RAMP Description Language (RDL)
• Describe the behavior of each leaf unit in your favorite language (Verilog, VHDL, Bluespec, C/C++, Java)
• Host platforms are the systems that run the emulation or simulation
• Can have part of the target mapped to FPGA emulation and part mapped to software simulation
Units and Channels in RAMP
(Diagram: a sending unit's port connects through a channel to a receiving unit's port)
• Units
– Large pieces of functionality, >10,000 gates (e.g., CPU + L1$)
– Leaf units implemented in a “host” language (e.g., Verilog, C++)
• Channels
– Unidirectional, point-to-point, FIFO semantics
– Unknown latency and buffering (fixed when the system is instantiated)
– Implementation generated automatically by the RDL compiler
RAMP Channels Generated Automatically During System Instantiation
Channel parameters for timing-accurate simulations are given in the RAMP description file:
• Bitwidth (in bits per target clock cycle)
• Latency (in target clock cycles)
• Buffering (in either fragments or messages)
• Fragments (one target clock cycle's worth of data)
– Smaller than messages
– Convey the simulation time through idles
(Diagram: a 32b-wide channel, annotated with its bitwidth, latency, and buffering)
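To make the timing model concrete, here is a minimal software sketch of such a channel in C++ (all names are hypothetical, not the code RDLC emits): one fragment is delivered per target clock cycle after a fixed latency, buffering is finite, and idle fragments carry target time forward when no message is in flight.

#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical software model of a RAMP channel. Each target clock
// cycle the sender deposits exactly one fragment: either one
// bitwidth-sized piece of a message, or an explicit idle.
struct Fragment {
    uint64_t send_cycle;  // target cycle when the fragment entered
    bool     idle;        // idles convey simulation time, no data
    uint32_t payload;     // one target cycle's worth of data (<= bitwidth)
};

class Channel {
public:
    Channel(unsigned latency, unsigned buffering)
        : latency_(latency), buffering_(buffering) {}

    // Sender side: returns false when buffering is exhausted, so the
    // sending unit must stall its target clock until space frees up.
    bool push(uint64_t now, bool idle, uint32_t payload) {
        if (fifo_.size() >= buffering_) return false;
        fifo_.push_back({now, idle, payload});
        return true;
    }

    // Receiver side: a fragment becomes visible only latency_ target
    // cycles after it was sent, preserving target-cycle accuracy.
    std::optional<Fragment> pop(uint64_t now) {
        if (fifo_.empty() || fifo_.front().send_cycle + latency_ > now)
            return std::nullopt;
        Fragment f = fifo_.front();
        fifo_.pop_front();
        return f;
    }

private:
    unsigned latency_, buffering_;
    std::deque<Fragment> fifo_;
};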
Mapping Target Units to Host Platform
(Diagram: a unit with ports A–D wrapped by RDL-generated logic. Incoming links A and B pass through timing, unpacking, and buffering stages to ports A and B; outgoing ports C and D pass through buffering, packing, and timing stages to links C and D. Start/done and control/status signals sit on the wrapper's inside edge, physical links on its outside edge.)
• Inside edge: free from host implementation dependencies
– Needs a language-specific version of the interface (e.g., Verilog, Bluespec, C++)
• Outside edge: implementation dependent
– Deals with physical links
• The RDL compiler generates the wrapper and all of the links
– Plugins allow extension to new host languages or new link types
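As an illustration of the inside edge, a C++ leaf unit might be handed an interface along these lines (a hedged sketch with invented names, not the actual RDLC-generated API):

#include <cstdint>

// Illustrative inside-edge interface. The unit sees only ports with
// FIFO semantics; packing, timing, and physical links live outside.
struct InPort {
    virtual bool empty() = 0;                // implicit guard: data available
    virtual uint32_t peek() = 0;             // inspect head without popping
    virtual void pop() = 0;                  // consume one record
    virtual ~InPort() = default;
};

struct OutPort {
    virtual bool full() = 0;                 // implicit guard: space available
    virtual void push(uint32_t record) = 0;  // emit one record
    virtual ~OutPort() = default;
};

// Once per target cycle the wrapper asserts "start"; the unit performs
// one target cycle of work against its ports and returns ("done").
struct LeafUnit {
    virtual void target_cycle(InPort& a, InPort& b,
                              OutPort& c, OutPort& d) = 0;
    virtual ~LeafUnit() = default;
};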
Targets Mapped Across Hardware and Software Host Platforms
(Diagram: wrapped units 1–3 and library units spread across a hardware/FPGA host, a workstation host, and a miscellaneous platform; links A–L implement channels A–H, including an RS232 hop and a TCP/IP hop between hosts.)
• Cross-platform
– Units implemented in many languages
– Library units for I/O (output, input, debug)
– Links implement channels
• Links
– Can be mapped to anything that transmits data (e.g., FPGA wires, high-speed serial links, Ethernet)
Virtualization to Improve FPGA Resource Usage
• RAMP allows units to run at varying target-to-host clock ratios to optimize area and overall performance without changing the cycle-accurate accounting
• Example 1: multiported register file (see the sketch after this list)
– E.g., Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
– If the RTL is mapped directly, it requires 48K flip-flops: slow cycle time, large area
– If mapped into block RAMs (one read + one write per cycle), it takes 3 host cycles per target cycle and 3x2KB block RAMs: ~3X faster cycle time and far fewer resources
• Example 2: large L2/L3 caches
– Current FPGAs have only ~1MB of on-chip SRAM
– Use the on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses, and fetch the data from off-chip DRAM
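The register-file example can be sketched in code. The model below is illustrative only (neither Niagara's RTL nor the actual RAMP mapping): it serializes the three reads over three host cycles against a single block-RAM image, showing how host cycles are spent while target-cycle accounting is untouched.

#include <cstdint>

// Sketch: emulate a 3-read/2-write register file on block RAM with one
// read and one write port. One target cycle costs three host cycles;
// the target's own cycle count is unchanged.
struct VirtualizedRegFile {
    uint32_t mem[1536];        // 6KB image (e.g., three 2KB block RAMs)
    uint64_t host_cycles = 0;

    void target_cycle(const unsigned raddr[3], uint32_t rdata[3],
                      const unsigned waddr[2], const uint32_t wdata[2]) {
        // Reads observe state as of the start of the target cycle;
        // one read port means one read per host cycle: 3 host cycles.
        for (int p = 0; p < 3; ++p) rdata[p] = mem[raddr[p]];
        // The two writes commit at the end of the target cycle; in
        // hardware they would share the write port during host cycles
        // 1-2 (same-address collisions need a small bypass, omitted).
        mem[waddr[0]] = wdata[0];
        mem[waddr[1]] = wdata[1];
        host_cycles += 3;
    }
};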
Debugging and Monitoring Support
The channel model plus the target time model supports:
• Monitoring
– All communication over channels can be examined and controlled
• Single-stepping by cycle or by transaction
– Target time can be paused or slowed down
• Simulation steering
– Inject messages into channels
• Mixed-mode emulation/simulation
– Can move some units into software simulation
– Cross-platform communication is hidden by the RDL compiler (RDLC)
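Because every inter-unit interaction crosses a channel, debug features can be implemented by interposing on the channel itself. A hedged C++ sketch with hypothetical names:

#include <cstdint>
#include <cstdio>
#include <deque>

// Sketch: a channel tap supporting monitoring (log every message),
// single-stepping (freeze delivery to pause target time), and steering
// (inject debugger-sourced messages). Units on either side are unchanged.
class MonitoredChannel {
public:
    bool paused = false;                 // freeze delivery for single-stepping

    void push(uint32_t msg) {            // sender side, unchanged for the unit
        std::printf("log: message %u\n", (unsigned)msg);
        fifo_.push_back(msg);
    }
    void inject(uint32_t msg) {          // steering: insert a message
        fifo_.push_back(msg);
    }
    bool pop(uint32_t& msg) {            // receiver side honors the pause
        if (paused || fifo_.empty()) return false;
        msg = fifo_.front(); fifo_.pop_front();
        return true;
    }
private:
    std::deque<uint32_t> fifo_;
};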
Related Approaches
FPGA-Based Approaches:
• Quickturn, Axis, IKOS, Thara:
– FPGA- or special-processor-based gate-level hardware emulators
– Slow clock rate (~1MHz vs. RAMP's ~100MHz)
– Limited memory capacity (a few GB vs. RAMP's 256GB)
• RPM at USC in the early 1990s:
– Only up to 8 processors, with only the memory controller in configurable logic
Other approaches:
• Software simulators
• Clusters (standard microprocessors)
• PlanetLab (distributed environment)
• Wisconsin Wind Tunnel (used a CM-5 to simulate shared memory)
All suffer from some combination of slowness, inaccuracy, target inflexibility, poor scalability, or an unbalanced computation-to-communication ratio.
RAMP White Structure
(Diagram of one node: two CPU + L1$ + coherence blocks attach to an ISA-independent coherence engine and an optional L2$ + coherence; a router links to other nodes; a memory scheduler and DRAM controller serve DRAM, which also absorbs non-target accesses.)
• L2$ optional
• Target router topology is independent of the host link topology
• Multiple different ISAs will eventually be supported
• RAMP White uses a scalable directory-based coherence protocol
• Host DRAM is used to support host emulation (e.g., the L2 cache image) and tracing, as well as target memory
RAMP for MP-SoC Emulation
(Annotations on a standard TI OMAP 2420 design:)
• Off-chip memory held in DRAM, with accurate target timing models
• Large on-chip memories virtualized/cached in off-chip DRAM
• CPU and DSP mapping optimized with virtualized RTL
• Slower-rate I/O modeled in software on the host workstation
• Selected blocks' RTL mapped directly onto the FPGA
Backup
Computing Devices Then
EDSAC, University of Cambridge, UK, 1949
Computing Devices Now
Sensor nets, cameras, games, media players, set-top boxes, laptops, smart phones, servers, routers, robots, automobiles, supercomputers.
Requirements Converging and Growing
• Traditional “general-purpose” computing
– Focus on programming effort to implement a large and extensible feature set
• Traditional embedded computing
– Focus on resource constraints (cost, execution time, power, memory size, …) to implement a fixed function
• Current and future computing platforms
– Large and growing feature sets and resource constraints (e.g., web browsers on cellphones, power consumption of server farms)
• But also, new concerns:
– Reliability (hardware and software errors)
– Security
– Manageability (labor costs)
Uniprocessor Performance (SPECint)
(Plot: SPECint performance relative to the VAX-11/780, 1978–2006, log scale: 25%/year growth until 1986, 52%/year until ~2002, then ??%/year, leaving a 3X gap from historical growth. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.)
=> All major manufacturers moving to multicore architectures
• General-purpose uniprocessors have stopped historic performance scaling
– Power consumption
– Wire delays
– DRAM access latency
– Diminishing returns of more instruction-level parallelism
Custom Chip Design Cost Growing
(Chart: number of ASIC design starts per year, 1996–2006, declining over the decade. Source: Dr. Raul Camposano, CTO, Synopsys.)
=> Fewer chips, increasingly programmable to support a wider range of applications
• Development cost is rising rapidly because of growing design effort
– Logic complexity and new physical design challenges (wire delay, switching and leakage power, coupling, inductance, variability, …)
• New ASIC development with automated design tools: ~$10-30M (<400MHz @ 90nm)
– Assuming 10% R&D cost and 10% market share => needs a $1-3B market
• Development cost is much higher for hand-crafted layout, e.g., the IBM Cell microprocessor: >$400M (4GHz in 90nm)
Convergence of Platforms
The only way to meet system feature-set, cost, power, and performance requirements is by programming a processor array:
– Multiple parallel general-purpose processors (GPPs)
– Multiple application-specific processors (ASPs)
Examples:
• IBM Cell: 1 GPP (2 threads) + 8 ASPs
• Intel IXP2800 network processor: 1 GPP core (XScale) + 16 ASPs (128 threads)
• Sun Niagara: 8 GPP cores (32 threads)
• Picochip DSP: 1 GPP core + 248 ASPs
• Cisco CSR-1: 188 Tensilica GPPs
(Block diagrams omitted. For scale: the Intel 4004 of 1971 was a 4-bit processor with 2312 transistors, ~100 KIPS, in 10 micron PMOS on an 11 mm2 chip.)
=> 1000s of processor cores per die: “The Processor is the new Transistor” [Rowen]
New Abstraction Stack Needed
• Challenge: Desperate need to improve the state of the art of parallel computing for complex applications
• Opportunity: Everything is open to change
– Programming languages
– Operating systems
– Instruction set architecture (ISA)
– Microarchitectures
• How do we work across traditional abstraction boundaries?
Stratification of Research Communities
(Abstraction stack: Application, Algorithm, Programming Language, Operating System, Instruction Set Architecture (ISA), Microarchitecture, Gates/Register-Transfer Level (RTL), Circuits, Devices. Software community: “Hardware cannot be changed!” Hardware community: “Software cannot be changed!”)
• The problem is not just one of mindset
• Software developers are not interested unless hardware is available
– Software simulations are too slow, ~10-100 kHz for detailed models of one CPU
– Software simulations are not credible
• But it takes 5 years to complete a prototype hardware system!
– Then, in a few months of software development, all the mistakes become clear…
RAMP: Build Research MPP from FPGAs
• As ≈25 CPUs will fit in a Field Programmable Gate Array (FPGA), a 1000-CPU system from ≈40 FPGAs?
– 16 simple 32-bit “soft core” RISC processors at 150MHz in 2004 (Virtex-II)
– FPGA generations every 1.5 yrs; ≈2X CPUs, ≈1.2X clock rate
• The HW research community does the logic design (“gate shareware”) to create an out-of-the-box MPP
– E.g., a 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ≈200 MHz/CPU in 2007
• Multi-university collaboration
– RAMPants: Arvind (MIT), Krste Asanovic (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (UCB), Jan Rabaey (UCB), and John Wawrzynek (UCB)
RAMP Goals
• Provide credible prototypes with sufficient performance to support co-development of software and hardware ideas
• Turn around new hardware ideas in minutes or hours
• Support reproducible comparison of ideas across different groups
• Architects distribute usable hardware designs by FTP, improving visibility to industry
RAMP-1 Hardware
• BEE2: Berkeley Emulation Engine 2
– By John Wawrzynek and Bob Brodersen with students Chen Chang and Pierre Droz
– Completed Dec. 2004 (14x17 inch, 22-layer PCB)
• Board: 5 Virtex-II FPGAs, 18 banks of DDR2-400 memory, 20 10GigE connectors
• Box: 8 compute modules in an 8U rack-mount chassis
• 1000 CPUs: 1.5 KW, ≈$100,000 (1.5W, 5 cu. in., and $100 per computer)
Transactors
• A transactor (transactional actor) is an abstract unit of computation which is easy to understand and verify, but which can also be automatically translated into high-quality hardware or software implementations
Original Transactor Motivation
(Abstraction stack: Application, Algorithm, Programming Language, Operating System, Instruction Set Architecture (ISA), Transactors/Microarchitecture (UTL), Gates/Register-Transfer Level (RTL), Circuits, Devices. Example: the Scale vector-thread processor: 128 threads/core, ~1M gates, 17mm2, 400MHz, 0.18um [IEEE Micro, Top Picks, 2004].)
• Design the chip at the microarchitecture level rather than at the RTL level
• Abstract away pipeline depth and communication latencies
• Separate global communication from local computation
• Avoid over-specification of behavior, particularly local pipelining & scheduling
• Encode “best practice” in concurrency management
Transactor Anatomy
A transactor unit comprises:
• Architectural state (registers + RAMs)
• Input queues and output queues connected to other units
• Transactions (guarded atomic actions on state and queues)
• Scheduler (selects the next ready transaction to run)
(Diagram: input queues feed a set of transactions and a scheduler over architectural state, which feed output queues)
Advantages:
• Handles non-deterministic inputs
• Allows concurrent operations on mutable state within a unit
• Natural representation for formal verification
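This anatomy maps directly onto code. Below is a minimal C++ skeleton (hypothetical, not the RAMP tooling) of a transactor as a list of guarded atomic actions under a fixed-priority scheduler; a concrete instance appears after the Scheduler slide below.

#include <functional>
#include <vector>

// Hypothetical transactor skeleton. A transaction is a guarded atomic
// action: the guard inspects architectural state and queue heads only
// (including the implicit data-available/space-available conditions);
// the action pops at most one record per input queue and pushes at
// most one record per output queue.
struct Transaction {
    std::function<bool()> guard;   // may this transaction fire now?
    std::function<void()> action;  // atomic update of state and queues
};

struct Transactor {
    std::vector<Transaction> txns; // in fixed priority order

    // Simplest scheduler: fire the first ready transaction, if any.
    bool step() {
        for (Transaction& t : txns)
            if (t.guard()) { t.action(); return true; }
        return false;              // nothing ready this step
    }
};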
Transactor Networks
(Diagram: a network of transactors; global inter-unit communication travels over FIFO-buffered point-to-point channels, while short-range local communication stays within each unit)
• Decompose the system into a network of transactor units
• Decouple global communication from local computation
– The only communication between units is via buffered point-to-point channels
– All computation operates only on local state and channel end-points
Message Queues or “Channels”
• Queues decouple units' execution and require units to use latency-insensitive protocols [Carloni et al., ICCAD'99]
• Queues are point-to-point channels only
– No fanout: a unit must replicate messages onto multiple queues
– No buses in a transactor design (though an implementation may use them)
• Transactions can only pop the head of an input queue and push at most one element onto each output queue
– Avoids exposing the size of the buffers in the queues
– Also avoids the synchronization inherent in waiting for multiple elements
Transactions
• A transaction is a guarded atomic action on local state and input and output queues
• The guard is a predicate that specifies when the transaction can execute
– The predicate is over architectural state and the heads of the input queues
– There are implicit conditions on the input queues (data available) and output queues (space available) that the transaction accesses
• A transaction can pop at most one record from an input queue and push at most one record onto each output queue

transaction route(input int[32] in,
                  output int[32] out0,
                  output int[32] out1)
{
  when (routable(in)) {
    if (route_func(in) == 0)
      out0 = in;
    else
      out1 = in;
  };
};

transaction route_kill(input int[32] in)
{
  when (!routable(in)) {
    bad_packets++;
  };
};

(Diagram: a Route Stage unit with input channels in0, in1 and output channels out0, out1)
Scheduler
• The scheduling function decides on transaction priority based on local state and the state of the input queues
– The simplest scheduler picks among ready transactions in a fixed priority order
• Transactions may have additional predicates which indicate when they can fire
– E.g., an implicit condition on all necessary output queues being ready

unit route_stage(input int[32] in0,   // First input channel.
                 input int[32] in1,   // Second input channel.
                 output int[32] out0, // First output channel.
                 output int[32] out1) // Second output channel.
{
  int[32] bad_packets;
  int[1] last; // Fair scheduler state.
  schedule {
    reset { bad_packets = 0; last = 0; };
    route_kill(in0);
    route_kill(in1);
    schedule round_robin(last) {
      (0): route(in0, out0, out1);
      (1): route(in1, out0, out1);
    };
  };
}

(Diagram: the Route Stage unit with input channels in0, in1 and output channels out0, out1)
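Rendered with the hypothetical C++ skeleton from the Transactor Anatomy slide, route_stage might look as follows; routable() and route_func() are stand-ins, and the round-robin guards encode the `last` state. Output-space guards are omitted because this sketch uses unbounded deques.

#include <cstdint>
#include <deque>
// Reuses Transaction/Transactor from the earlier sketch.

struct RouteStage : Transactor {
    std::deque<uint32_t> in0, in1, out0, out1;
    uint32_t bad_packets = 0;
    int last = 0;  // fair-scheduler state, as in round_robin(last)

    static bool routable(uint32_t p)   { return (p >> 31) == 0; } // stand-in
    static int  route_func(uint32_t p) { return p & 1; }          // stand-in

    void route(std::deque<uint32_t>& in, int next) {
        uint32_t p = in.front(); in.pop_front();
        (route_func(p) == 0 ? out0 : out1).push_back(p);
        last = next;  // hand priority to the other input
    }

    RouteStage() {
        auto ready0 = [this]{ return !in0.empty() && routable(in0.front()); };
        auto ready1 = [this]{ return !in1.empty() && routable(in1.front()); };
        // route_kill: drop unroutable packets, fixed (highest) priority.
        txns.push_back({[this]{ return !in0.empty() && !routable(in0.front()); },
                        [this]{ in0.pop_front(); ++bad_packets; }});
        txns.push_back({[this]{ return !in1.empty() && !routable(in1.front()); },
                        [this]{ in1.pop_front(); ++bad_packets; }});
        // Round-robin: the input selected by `last` wins when both are ready.
        txns.push_back({[this, ready0, ready1]{ return ready0() && (last == 0 || !ready1()); },
                        [this]{ route(in0, 1); }});
        txns.push_back({[this, ready0, ready1]{ return ready1() && (last == 1 || !ready0()); },
                        [this]{ route(in1, 0); }});
    }
};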
Raise Abstraction Level for Communication
RTL model: cycles and wires
(Diagram: blocks of combinational logic joined by wires, all driven by CLK)
• The designer allocates signals to wires and orchestrates cycle-by-cycle communication across the chip
• Global and local wires are specified identically

Transactor model: messages and queues
(Diagram: blocks of combinational logic joined by message queues)
• All global communication uses latency-insensitive messages on buffered point-to-point channels
• Global wires are separated from local intra-unit wires
Problems in RTL Implementation
• Long signal paths may need more pipelining to hit the frequency goal, requiring manual RTL changes
• Repeaters used to reduce latency burn leakage power
• Dedicated wires for each signal cause wiring congestion and waste repeater power, because many wires are mostly idle
• Neighbor wire coupling may reduce speed and inject errors, requiring manual rework
• Error detection and correction circuitry cannot be added automatically; it requires manual RTL redesign

Transactor Communications
(Diagram: units A and B connected by pipelined, multiplexed channels)
• The latency-insensitive model allows automatic insertion of pipeline registers to meet frequency goals
• Multiplexed channels reduce congestion and save repeater power; can use an on-chip network
• Can also trade increased end-to-end latency for reduced repeater power
• Use optimized signaling on known long wires: e.g., dual data rate for high throughput, low-swing for low power, shields to cut noise
• Can automatically insert error correction/retry to cover communication soft errors
Raise Abstraction Level for Computation
RTL model: manual concurrency management
• The designer has to divide application operations into pieces that fit within a clock cycle, then develop control logic to manage the concurrent execution of many overlapping operations
• A single application operation is manually divided across multiple pipeline stages, then interleaved with other operations:

  If (condA1) { Astage1 } else if (condB1) { Bstage1 }
  If (condA2) { Astage2 } else if (condB2) { Bstage2 }

• Dependencies between concurrently executing operations are managed manually
• Input and output communication rates and the flow-control protocol are manually built into the code

Transactor model: synthesis from guarded atomic actions
• The designer describes each atomic transaction in isolation, together with a priority for scheduling transactions:

  Schedule: A > B
  Transaction A: If (condA) {…}
  Transaction B: If (condB) {…}

• Tools synthesize a pipelined transactor implementation, including all the control logic to manage dependencies between operations and the flow control of communications
• No pipeline registers or other internal bookkeeping state are exposed in the specification
• Communication flow control is automatically generated from the transactions' use of input and output queues
Design Template for Transactor
(Diagram: a pipelined transactor: a scheduler stage followed by pipeline stages accessing architectural state 1 and architectural state 2)
• The scheduler only fires a transaction when it can complete without stalls
– Avoids driving heavily loaded stall signals
• Architectural state (and outputs) are written in only one stage of the pipeline; use bypass/interlocks to read in earlier stages
– Simplifies hazard detection/prevention
• Have different transaction types access expensive units (RAM read ports, shifters, multiply units) in the same pipeline stage to reduce area
Transactor VLSI Design Flow
(Flow, left to right:)
1. Specification: the designer specifies the desired transactor microarchitecture and channel bandwidths (microarchitectural parameters)
2. Manual translation: the designer converts the specification into a transactor network
3. Transactor synthesis: automated synthesis produces an optimized gate netlist plus channel ports
4. Place and local route: units are placed on the die according to the designer's relative placement directives, with no global routing
5. Global routing: channels are routed with post-placement repeater insertion
System Design Flow
(Flow diagram: a single body of transactor code feeds “Generate software code” and “Generate hardware RTL”. The software side produces C programs and assembly, which go through DSP compilers, general-purpose compilers, and assemblers to run, with an OS, on DSPs and CPUs. The hardware side produces Verilog/VHDL RTL code, which goes through logic synthesis and FPGA tools to an FPGA, or through logic synthesis and physical design to an ASIC, alongside SRAM and DRAM.)
Related Models
• CSP/Occam
– Rendezvous communications expose system latencies in the design
– No mutable shared state within a unit
• Kahn process networks (and simpler SDF models)
– Do not support non-deterministic inputs
– Sequential execution within a unit
• Latency-insensitive design [Carloni et al.]
– Channels are similar to transactor channels
– Units are described as stallable RTL
• TRS/Bluespec [Arvind & Hoe]
– Uses guarded atomic actions at the RTL level (single-cycle transactions)
– Microarchitectural state is explicit
– No unit-level discipline enforced
RAMP Implementation Plans
Name            Goal                Target  CPUs                                      Details
Red (Stanford)  Get Started         1H06    8 PowerPC 32b hard cores                  Transactional memory SMP
Blue (Cal)      Scale               2H06    1000 32b soft (Microblaze)                Cluster, MPI
White (All)     Full Features       1H07?   128? soft 64b, multiple commercial ISAs   CC-NUMA, shared address, deterministic, debug/monitor
2.0             3rd party sells it  2H07?   4X CPUs of '04 FPGA                       New '06 FPGA, new board
Summary
• All computing systems will use many concurrent processors (1,000s of processors/chip)
– Unlike previously, this is not just a prediction; it is already happening
• We desperately need a new stack of system abstractions to manage the complexity of concurrent system design
• The RAMP project is building an emulator “watering hole” to bring everyone together to help make rapid progress
– Architects, OS, programming language, compiler, algorithm, and application developers, …
• Transactors provide a unifying model for describing complex concurrent hardware and software systems
– Complex digital applications
– The RAMP target hardware itself