Transcript Outline

RAMP Tutorial
Introduction/Overview
Krste Asanovic
UC Berkeley
RAMP Tutorial, ASPLOS, Seattle, WA
March 2, 2008
Technology Trends: CPU

Microprocessor: Power Wall + Memory Wall + ILP Wall = Brick Wall
• End of uniprocessors and faster clock rates
• Every program(mer) is a parallel program(mer); sequential algorithms are slow algorithms
• Since parallel is more power efficient (W ≈ CV²F), the new "Moore's Law" is 2X processors ("cores") per socket every 2 years, at the same clock frequency (see the numeric sketch after this list)
• Conservative: 2007 4 cores, 2009 8 cores, 2011 16 cores for embedded, desktop, & server
• Sea change for HW and SW industries, since it changes the programming model and responsibilities
• HW/SW industries have bet the farm that parallelism succeeds
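
A minimal numeric sketch of the power argument, in Python; the 0.6 voltage-scaling factor is an assumed illustrative number, not a measured one:

    def dynamic_power(c, v, f):
        # Dynamic switching power: W ~= C * V^2 * F
        return c * v**2 * f

    baseline = dynamic_power(c=1.0, v=1.0, f=1.0)     # one core at full V and F
    # Two cores at half frequency deliver the same total throughput, and the
    # lower frequency permits a lower supply voltage (assumed to drop to 0.6).
    parallel = 2 * dynamic_power(c=1.0, v=0.6, f=0.5)
    print(baseline, parallel)  # 1.0 0.36 -> same work, about 1/3 the power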
Problems with “Manycore” Sea Change
1. Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … are not ready for 1000 CPUs / chip
2. Only companies can build HW, and it takes years
3. Software people don't start working hard until hardware arrives
   • 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
4. How do we get 1000-CPU systems into the hands of researchers to innovate in a timely fashion in algorithms, compilers, languages, OS, architectures, … ?
5. Can we avoid waiting years between HW/SW iterations?
Vision: Build Research MPP from FPGAs


• As ≈16 CPUs will fit in one Field Programmable Gate Array (FPGA), build a 1000-CPU system from ≈64 FPGAs?
  – 8 32-bit simple "soft core" RISCs at 100 MHz in 2004 (Virtex-II)
  – FPGA generations every 1.5 yrs: ≈2X CPUs, ≈1.2X clock rate
• HW research community does the logic design ("gate shareware") to create an out-of-the-box MPP
  – E.g., a 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ≈150 MHz/CPU in 2007
• 6 universities, 10 faculty
• 3rd party sells RAMP 2.0 (BEE3) hardware at low cost
• RAMP = "Research Accelerator for Multiple Processors"
Why RAMP Good for Research MPP?
                        SMP                  Cluster              Custom               Simulate                RAMP
Scalability (1k)        C                    A                    A                    A                       A
Cost (1k CPUs)          F ($20M)             C ($1M)              F ($3M)              A+ ($0M)                A ($0.1M)
Cost to own             A                    D                    A                    A                       A
Power/Space
 (kilowatts, racks)     D (120 kW, 6 racks)  D (120 kW, 6 racks)  A (100 kW, 3 racks)  A+ (0.1 kW, 0.1 racks)  A (1.5 kW, 0.3 racks)
Community               D                    A                    F                    A                       A
Observability           D                    C                    D                    A+                      A+
Reproducibility         B                    D                    B                    A+                      A+
Reconfigurability       D                    C                    D                    A+                      A+
Credibility             A+                   A+                   A-                   F                       B
Perform. (clock)        A (2 GHz)            A (3 GHz)            B (0.4 GHz)          F (0 GHz)               C (0.1 GHz)
GPA                     C                    B-                   C+                   B                       A-
Partnerships

• Co-PIs: Krste Asanović (UCB), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)
• RAMP hardware development activity is centered at the Berkeley Wireless Research Center
• Three-year NSF grant for staff (awarded 3/06)
• GSRC (Jan Rabaey) has paid for partial staff and some students
• Major continuing commitment from Xilinx
• Collaboration with MSR (Chuck Thacker) on the BEE3 FPGA platform
• Sun and IBM contributing processor designs; IBM faculty awards

High-speed, high-confidence emulation is widely recognized as a necessary component of multiprocessor research and development. FPGA emulation is the only practical approach.
BEE3 Design
Chuck Thacker (MSR), Chen Chang (UC Berkeley)
• New RAMP systems to be based on the Berkeley Emulation Engine version 3 (BEE3)
  – BEE3 1st prototype, 11/07
• BEECube, Inc.
  – UC Berkeley spinout startup company
  – Provides manufacturing, distribution, and support to commercial and academic users
  – General availability 2Q08
• For small-scale designs, or to get started, use the Xilinx ML505
RAMP: An infrastructure to build simulators using FPGAs
Run Target Model on Host Platform

[Figure: a target model of CPUs, an interconnect network, and DRAM is mapped onto the host platform; the hard work is in that mapping]

Reduce, Reuse, Recycle
• Reduce effort to build target models
  – Users just build components (units); the infrastructure handles connections (the RDL Compiler)
• Reuse units by having good abstractions
  – Across different target models
  – Across different host platforms: XUP, Calinx, BEE2, BEE3, ML505, and also Altera platforms
• Recycle existing IP for use as simulation models
  – Commercial processor RTL is (almost) its own model
RAMP Target Model
[Figure: Units A, B, and C connected by a FIFO channel and a pipeline channel]

Units
• Relatively large chunks of functionality, e.g., processor + L1 cache
• User-written in some HDL or software

Channels
• Point-to-point, unidirectional, two kinds (sketched in code below):
  – FIFO channel: flow-controlled interface
  – Pipeline channel: simple shift register; bits drop off the end
• Generated by the RAMP infrastructure
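
A minimal Python sketch of the two channel kinds, illustrative only and not RAMP's actual implementation:

    from collections import deque

    class FifoChannel:
        """Flow-controlled: sender waits when full, receiver waits when empty."""
        def __init__(self, depth):
            self.buf, self.depth = deque(), depth
        def can_enq(self):
            return len(self.buf) < self.depth
        def enq(self, item):
            assert self.can_enq()
            self.buf.append(item)
        def can_deq(self):
            return len(self.buf) > 0
        def deq(self):
            return self.buf.popleft()

    class PipelineChannel:
        """Simple shift register: shifts every cycle; bits drop off the end."""
        def __init__(self, latency):
            self.stages = [None] * latency
        def cycle(self, item=None):
            out = self.stages[-1]                   # oldest datum falls off the end
            self.stages = [item] + self.stages[:-1]
            return out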
Target Pipeline Channel Parameters

[Figure: a pipeline channel parameterized by its datawidth D and its forward latency]
RAMP Description Language (RDL)
[Greg Gibeling, UCB]

[Figure: the target model (Units A, B, C) is compiled by RDLC onto the host: generated unit wrappers hold Units A and B on FPGA1 and Unit C on FPGA2, with generated links carrying the channels between FPGAs]

• User describes the target model topology, channel parameters, and the (manual) mapping to host platform FPGAs using RDL (illustrated below)
• The RDL Compiler (RDLC) generates the configurations
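
Not actual RDL syntax, but a hypothetical Python-flavored illustration of the information an RDL description carries (the unit names, parameters, and mapping are invented for the example):

    target_model = {
        "units": ["UnitA", "UnitB", "UnitC"],
        "channels": [
            # (source, destination, kind, datawidth, forward latency)
            ("UnitA", "UnitB", "fifo", 32, 1),
            ("UnitB", "UnitC", "pipeline", 64, 4),
        ],
        # manual assignment of units to host platform FPGAs
        "mapping": {"FPGA1": ["UnitA", "UnitB"], "FPGA2": ["UnitC"]},
    }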
Virtual Target Clock
Virtualized RTL Improves FPGA Resource Usage


• RAMP allows units to run at varying target-to-host clock ratios to optimize area and overall performance (see the sketch after this list)
• Example 1: Multiported register file
  – E.g., Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  – If the RTL is mapped directly, it requires 48K flip-flops: slow cycle time, large area
  – If mapped into block RAMs (one read + one write per cycle), it takes 3 host cycles per target cycle and 3x2KB block RAMs: faster cycle time (~3X) and far fewer resources
• Example 2: Large L2/L3 caches
  – Current FPGAs have only ~1MB of on-chip SRAM
  – Use on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch the data from off-chip DRAM
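
A minimal Python sketch of the register-file virtualization, with the port sequencing simplified to one read and one write per host cycle; the storage size and ordering rules are illustrative assumptions:

    def emulate_regfile_target_cycle(regs, read_idxs, writes):
        """Emulate one 3-read/2-write target cycle of a register file on
        storage with one read port and one write port, using 3 host cycles."""
        read_values = []
        for host_cycle in range(3):
            read_values.append(regs[read_idxs[host_cycle]])  # one read per host cycle
            if host_cycle < len(writes):                     # one write per host cycle
                idx, value = writes[host_cycle]
                regs[idx] = value
        return read_values

    regs = [0] * 1536   # 6KB of 32-bit registers, as in the Niagara example
    regs[7] = 42
    print(emulate_regfile_target_cycle(regs, [7, 0, 1], [(2, 5), (3, 6)]))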
Start/Done Timing Interface
[Figure: an RDL-generated wrapper around a unit, with Start, In1, and In2 going in and Out and Done coming out]

• The wrapper generated by RDL asserts "Start" on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
• The unit asserts "Done" when it finishes the target cycle and its outputs are ready
• The unit can take a variable amount of time
• An unvirtualized RTL unit can connect "Done" to "Start" (but must not clock until "Start"); the handshake is sketched below
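
A minimal Python sketch of the wrapper's handshake, reusing the FifoChannel sketch above; unit.run_target_cycle is an assumed method standing in for the unit's own logic:

    def wrapper_step(unit, in_channels, out_channels):
        """One host-side step: fire Start only when every input channel has
        data for the next target cycle; the unit may take any number of host
        cycles inside run_target_cycle before its outputs (Done) are ready."""
        if all(ch.can_deq() for ch in in_channels):      # inputs ready: Start
            inputs = [ch.deq() for ch in in_channels]
            outputs = unit.run_target_cycle(inputs)      # returns on Done
            for ch, value in zip(out_channels, outputs):
                ch.enq(value)                            # outputs ready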
Distributed Timing Models
Distributed Timing Example
[Figure: target, Unit A sends data D over a pipeline channel of latency L to Unit B; host, the two units' wrappers exchange Start/Done and RDY/ENQ/DEQ signals around a distributed FIFO]

• A pipeline target channel of latency L is implemented as a distributed FIFO with at least L buffers (sketched below)
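
A minimal Python sketch of that distributed FIFO; pre-filling with L empty slots makes data enqueued in sender target cycle t emerge in receiver target cycle t + L, regardless of host-cycle skew between the units:

    from collections import deque

    class DistributedPipelineChannel:
        def __init__(self, latency):
            self.buf = deque([None] * latency)   # L initial bubbles
        def enq(self, item):
            """Sender side: once per sender target cycle."""
            self.buf.append(item)
        def can_deq(self):
            """Receiver stalls (no Start) if the sender has fallen behind."""
            return len(self.buf) > 0
        def deq(self):
            """Receiver side: once per receiver target cycle."""
            return self.buf.popleft()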
Other Automatically Generated Networks

• Control network has the workstation as master and every unit as a slave device (see the sketch after this list)
  – Memory-mapped interface with block transfers
  – Used for initialization, stats gathering, debugging, and monitoring
• Units can connect to DRAM resources outside of timed target channels
  – Used to support emulation and virtualization state
• Units can communicate with each other outside of timed target channels
  – Supports arbitrary communication, e.g., for distributed stats gathering
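
A minimal Python sketch of the kind of memory-mapped access the control network provides; the address map, register offset, and bus primitive are all hypothetical:

    CONTROL_BASE = {"UnitA": 0x1000, "UnitB": 0x2000}   # hypothetical address map
    STATS_OFFSET = 0x10                                  # hypothetical stats registers

    def read_stats(bus, unit_name, num_words):
        """Workstation master block-reads a unit's stats counters."""
        addr = CONTROL_BASE[unit_name] + STATS_OFFSET
        return bus.read_block(addr, num_words)   # assumed block-transfer primitive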
Wide Variety of RAMP Simulators
Simulator Design Choices

Structural Analog versus Highly Virtualized

Functional-only versus Functional+Timing

Timing via (virtual) RTL design versus separate
functional and timing models

Hybrid software/hardware simulators
Host Multithreading
(Zhangxi Tan (UCB), Chung (CMU))

[Figure: a target model of CPUs 1-4 emulated by a single multithreaded host engine on the FPGA: one hardware pipeline (PC, I$, IR, GPRs, X, Y, D$) with multiple copies of per-CPU state such as the PCs and GPRs]

• Single hardware pipeline with multiple copies of CPU state
• The multithreaded emulation engine reduces FPGA resource use and improves emulator throughput (see the sketch below)
• Hides emulation latencies (e.g., communicating across FPGAs)
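
A minimal Python sketch of host multithreading; `execute` stands in for the emulation pipeline and is an assumed callback:

    class MultithreadedHostEngine:
        """One emulation pipeline round-robins over per-target-CPU state, so a
        long-latency step for one target CPU overlaps work for the others."""
        def __init__(self, num_cpus):
            # one copy of architectural state per target CPU
            self.cpu_states = [{"pc": 0, "gpr": [0] * 32} for _ in range(num_cpus)]
        def step_all(self, execute):
            for state in self.cpu_states:   # round-robin thread selection
                execute(state)              # emulate one target instruction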
Split Functional/Timing Models
(HAsim: Emer (MIT/Intel); FAST: Chiou (UT Austin))

[Figure: a functional model coupled to a separate timing model]

• The functional model executes the CPU ISA correctly, with no timing information
  – Only need to develop the functional model once for each ISA
• The timing model captures pipeline timing details, and does not need to execute code
  – Much easier to change the timing model for architectural experimentation
  – Without an RTL design, cannot be 100% certain that timing is accurate
• Many possible splits between the timing and functional model (one possible split is sketched below)
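
A minimal Python sketch of one such split; the instruction fields and per-kind latencies are invented for illustration:

    # Functional model: executes one instruction correctly; knows nothing about
    # time. `state.fetch` and `state.execute` are assumed methods standing in
    # for a real ISA interpreter.
    def functional_step(state):
        instr = state.fetch(state.pc)
        state.execute(instr)     # updates registers, memory, and pc
        return instr             # stream the instruction to the timing model

    # Timing model: charges cycles per instruction; swapping this table or
    # function changes the modeled microarchitecture without touching
    # correctness.
    LATENCY = {"alu": 1, "load": 3, "store": 1, "branch": 2}  # assumed timings

    def timing_step(instr, cycles_so_far):
        return cycles_so_far + LATENCY[instr.kind]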
Multithreaded Func. & Timing Models
(RAMP Gold: Tan, Gibeling, Asanović, UCB)

[Figure: a multithreaded functional-model pipeline holding architectural state and a multithreaded timing-model pipeline holding timing state, joined by MT-channels to form an MT-unit]

• An MT-unit multiplexes multiple target units on a single host engine
• An MT-channel multiplexes multiple target channels over a single host link
Schedule
 9:00- 9:45   Welcome/Overview
 9:45-10:15   RAMP Blue Overview & Demo
10:15-10:45   Break
10:45-12:30   RAMP White Live Demo; BEE3 Rollout (MSR/BEEcube/Q&A)
12:30-13:30   Lunch
13:30-15:00   ATLAS Transactional Memory (RAMP Red)
15:00-15:15   Break
15:15-16:45   CMU Simics/RAMP Cache Study
16:45         Wrap-up
RAMP Blue Release, 2/25/2008
• Design available from the RAMP website: ramp.eecs.berkeley.edu
RAMP White
Hari Angepat, Derek Chiou (UT Austin)
• Scalable coherent shared-memory multiprocessor
• Supports standard shared memory programming models

[Figure: RAMP White block diagram; two nodes, each with a Leon 3 core (Mst/Slv/Int/Dbg), a Leon3 shim, an intersection unit, an AHB shim, and an AHB bus with interrupt controller, DSU, Ethernet, and DDR2, joined by NIUs and routers over the MP network]
CMU Simics/RAMP Simulator
[Figure: Simics on a PC simulates a 16-CPU shared-memory UltraSPARC III server (SunFire 3800), including memory, MMU, graphics, PCI, DMA, NIC, terminal, and SCSI I/O devices; the BEE2 platform's Xilinx XC2VP70 FPGA hosts the 16 CPU contexts on an interleaved pipeline, backed by PowerPC and DDR2 memory]
RAMP Home Page/Repository


• ramp.eecs.berkeley.edu
• Remotely accessible Subversion repository
Thank You!

Questions?