
The Design and Application of Berkeley Emulation Engines
John Wawrzynek, Bob Brodersen, Chen Chang
University of California, Berkeley
Berkeley Wireless Research Center
7/20/05, FDIS 2005
Berkeley Emulation Engine (BEE), 2002

FPGA-based system for real-time hardware emulation:
- Emulation speeds up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents (although not a logic-gate emulator), corresponding to 600 GOPS (16-bit adds)
- 2400 external parallel I/Os providing 192 Gbps raw bandwidth
- 20 Xilinx Virtex-E 2000 chips, 16 1-MB ZBT SRAM chips
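The raw-bandwidth figure is consistent with a simple per-pin calculation. A sketch; the 80 Mb/s per-pin rate is inferred from the slide's totals, not stated on the slide:

```python
# Sanity-check BEE's I/O bandwidth: 2400 parallel I/Os at an assumed
# 80 Mb/s per pin give the quoted 192 Gbps raw bandwidth.
n_ios = 2400
per_pin_mbps = 80  # inferred: 192e9 bits/s / 2400 pins = 80 Mb/s per pin
total_gbps = n_ios * per_pin_mbps / 1000
print(total_gbps)  # 192.0
```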
Real-Time Processing Allows In-System Emulation

[Figure: the BEE sits in the loop between a transmitter and a receiver; shown are the transmitter output spectrum, the receiver output on a SCSI connector, and "Frame O.K." / "Data Match" indicators on the data output.]
Matlab/Simulink Programming Tools:
Discrete-Time Block Diagrams with FSMs

[Figure: design entry combines control FSMs (states S1, S2) with a datapath (memory ports DI, DO, A, R/W); Matlab/Simulink provides functional simulation and hardware emulation, with StateFlow, CoreGen, and the Matlab Module Compiler generating HDL, plus user macros and black boxes.]

- Tool flow developed by Mathworks, Xilinx, and UCB.
- The user specifies a design as block diagrams (for datapaths) and finite state machines (for control).
- The tools automatically map to both FPGA and ASIC implementations.
- User-assisted partitioning with automatic system-level routing.
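The control/datapath split above can be sketched in software: a hypothetical two-state FSM (S1/S2, echoing the figure) driving a memory datapath through DI/DO/A/R/W-style ports. This illustrates the specification style only; it is not the Simulink tool output:

```python
# Sketch of the block-diagram + FSM design style: a two-state controller
# (S1 = write phase, S2 = read-back phase) driving a memory datapath.
class MemoryDatapath:
    def __init__(self, depth=16):
        self.mem = [0] * depth

    def cycle(self, a, di, rw):
        """One clock: rw=1 writes di to address a, rw=0 reads address a."""
        if rw:
            self.mem[a] = di
            return None
        return self.mem[a]

class Controller:
    def __init__(self):
        self.state = "S1"

    def step(self, dp, a, di):
        if self.state == "S1":      # S1: drive a write, go to S2
            dp.cycle(a, di, rw=1)
            self.state = "S2"
            return None
        self.state = "S1"           # S2: drive a read, return DO
        return dp.cycle(a, 0, rw=0)

dp, ctrl = MemoryDatapath(), Controller()
ctrl.step(dp, a=3, di=42)        # S1: write 42 to address 3
print(ctrl.step(dp, a=3, di=0))  # S2: read back -> 42
```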
BEE Status

- Four BEE processing units built
  - Three in near-continuous "production" use
- Other supported universities:
  - CMU, USC, Tampere, UMass, Stanford
- Successful tapeouts:
  - 3.2M-transistor PicoRadio chip
  - 1.8M-transistor LDPC decoder chip
- Systems emulated:
  - QPSK radio transceiver
  - BCJR decoder
  - MPEG IDCT
- Ongoing projects:
  - UWB mixed-signal SoC
  - MPEG/PRISM transcoder
  - PicoRadio multi-node system
  - Infineon SIMD processor for SDR
Lessons from BEE

1. Real-time performance vastly eases the debugging/verification/tuning process.
2. A Simulink-based tool flow is a very effective FPGA programming model in the DSP domain.
3. System emulation tasks are significant computations in their own right: high-performance emulation hardware makes for high-performance general computing.

Is this the right way to build high-end (super)computers?
BEE could be scaled up with the latest FPGAs and by using multiple boards → BEE2 (and beyond).
BEE2 Hardware

1. Modular design, scalable from a few to hundreds of FPGAs.
2. High memory capacity and bandwidth to support general computing applications.
3. High-bandwidth / low-latency inter-module communication to support massive parallelism.
4. All off-the-shelf components; no custom chips.

Thanks to Xilinx for engineering assistance, FPGAs, and interaction on application development.
Basic Computing Element

Single Xilinx Virtex-II Pro 70 FPGA:
- 130nm technology, ~70K logic cells
- 1704-pin package with 996 user I/O pins
- 2 PowerPC 405 cores
- 326 dedicated multipliers (18-bit)
- 5.8 Mbit on-chip SRAM
- 20 3.125-Gbit/s duplex serial communication links (MGTs)
- 4 physical DDR2-400 banks

Per FPGA: up to 12.8 GByte/s memory bandwidth and a maximum of 8 GByte capacity.
Virtex-4 (90nm) is out now: 2x capacity, 2x frequency. Virtex-5 (65nm) is due next spring.

[Figure: the FPGA connects to four DDR2-400 DRAM banks, each over 72-, 18-, and 38-bit buses.]
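The per-FPGA bandwidth figures follow from the parts list. A sketch; the 64-bit data path per DDR2 bank is an assumption chosen to be consistent with the quoted 12.8 GB/s total:

```python
# Per-FPGA memory and serial bandwidth from the slide's parts list.
banks = 4
transfers_per_s = 400e6       # DDR2-400: 400 MT/s per bank
bytes_per_transfer = 8        # assumed 64-bit data path per bank
mem_bw = banks * transfers_per_s * bytes_per_transfer   # bytes/s
print(mem_bw / 1e9)           # 12.8 GB/s, matching the slide

mgts = 20
mgt_rate = 3.125e9            # bits/s per duplex link
print(mgts * mgt_rate / 1e9)  # 62.5 Gb/s aggregate serial bandwidth
```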
Compute Module Diagram

[Figure: five FPGAs (2VP70FF1704) per module. Four compute FPGAs each connect through a memory controller to 4 GB of DDR2 DRAM (12.8 GB/s at 400 DDR) and bring out a 40 Gbps IB4X/CX4 MGT link. A fifth control FPGA links to the compute FPGAs over 138-bit, 300 MHz DDR connections (41.4 Gb/s each) and provides a 20 Gbps IB4X/CX4 link (10GigE or Infiniband) plus 100BT Ethernet.]
Compute Module

Completed 12/04. 14x17-inch, 22-layer PCB.

The module also includes I/O for administration and maintenance:
- 10/100 Ethernet
- HDMI / DVI
- USB
Inter-Module Connections

[Figure: global communication tree of N compute modules joined by 4X links, with a compute module acting as each tree node; stream packets travel the tree. A 10G Ethernet switch and a 100 Base-T Ethernet switch carry admin, UI, and NFS traffic to NAS.]
Alternative Topology: 3D Mesh or Torus

- The 4 compute FPGAs can be used to extend the system to a 3D mesh/torus
- 6 directional links per node:
  - 4 off-board MGT links
  - 2 on-board LVCMOS links
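A 3D torus gives each node exactly six neighbors: one step in each direction along each dimension, wrapping at the edges. A minimal sketch of the neighbor computation, with hypothetical module coordinates:

```python
def torus_neighbors(x, y, z, dims):
    """Six neighbors of node (x, y, z) in a 3D torus of size dims,
    with wrap-around in every dimension."""
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),
    ]

# Corner node of a 4x4x4 torus: each link either steps inward or wraps.
print(torus_neighbors(0, 0, 0, (4, 4, 4)))
# [(1,0,0), (3,0,0), (0,1,0), (0,3,0), (0,0,1), (0,0,3)]
```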
19" Rack Cabinet Capacity

- 40 compute modules in 5 chassis (8U) per rack
- ~40 TeraOPS, ~1.5 TeraFLOPS
- 150 Watt AC/DC power supply to each blade
- ~6 kW power consumption
- Hardware cost: ~$500K
Why are these systems interesting?

1. Best solution in several domains:
   a) Emulation for custom chip design
   b) Extreme real-time signal-processing tasks
   c) Scientific and supercomputing workloads
2. A good model for how to build future chips and systems:
   a) Massively parallel
   b) Fine-grained reconfigurability enables:
      - Robust performance/power efficiency on a wide range of problems
      - Manufacturing defect tolerance
Moore's Law in the FPGA World

[Figure: two log-scale charts of MOPS and MOPS/MHz/million-transistors versus release date (6/15/1994 through 10/10/2006) for Xilinx FPGAs and Intel Xeon processors.]

FPGA performance doubles every 12 months: 100X higher performance and 100X more efficient than microprocessors.
Extreme Digital-Signal-Processing

BEE2 is a promising computing platform for the Allen Telescope Array (ATA, 350 antennas) and the proposed Square Kilometer Array (SKA, 1K antennas):
- SETI spectrometer
- Image formation for radio-astronomy research

These workloads share:
- A massive arithmetic-operations-per-second requirement
- A "stream-based" computation model
- A real-time requirement
- High-bandwidth data I/O
- Low numerical-precision requirements
  - Mostly fixed-point operations; rarely needs floating point
- Data-flow-dominated processing with few control branch points
SETI Spectrometer

- Target: 0.7 Hz channels over 800 MHz → a 1-billion-channel real-time spectrometer
- Result: one BEE2 module meets the target and yields 333 GOPS (16-bit mults, 32-bit adds) at 150 Watts (similar to a desktop computer)
  - >100x the peak integer throughput of a current Pentium-4 system, and >100x better throughput per unit energy

[Figure: signal chain: an 8 Gbps input enters a band-pass filter (BPF, 4 channels, 128 taps), then fans out at 16 Gbps to four parallel pipelines, each a polyphase filter bank (PFB, 8K channels, 64K taps), corner turn (CT, 8K/32K), 32K-point FFT, power spectrum, and threshold stage feeding the report output.]
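The PFB → FFT → power-spectrum core of each pipeline can be sketched in floating point. This is a toy version: the real design is fixed-point with 8K channels and 64K taps, and the Hamming-windowed sinc prototype filter and sizes here are illustrative assumptions. The channel-count arithmetic from the target also checks out: 800 MHz / 0.7 Hz ≈ 1.14 billion channels.

```python
import numpy as np

def pfb_power_spectrum(x, n_chan, n_taps):
    """One polyphase-filter-bank frame: weight n_chan*n_taps input
    samples, fold the taps onto n_chan points, FFT, and square."""
    n = n_chan * n_taps
    # Prototype low-pass filter: windowed sinc, cutoff at channel spacing.
    h = np.hamming(n) * np.sinc(np.arange(n) / n_chan - n_taps / 2)
    seg = x[:n] * h
    folded = seg.reshape(n_taps, n_chan).sum(axis=0)
    return np.abs(np.fft.fft(folded)) ** 2

# A complex tone centered on channel 5 of 32 lands in bin 5.
t = np.arange(32 * 4)
x = np.exp(2j * np.pi * 5 * t / 32)
print(int(np.argmax(pfb_power_spectrum(x, n_chan=32, n_taps=4))))  # 5

# Channel count implied by the target spec: ~1.14 billion channels.
print(round(800e6 / 0.7 / 1e9, 2))  # 1.14
```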
FPGA versus DSP Chips

- Benchmarks: the spectrometer and polyphase filter bank (PFB) use 18-bit multiplies; the correlator uses 4-bit multiplies with 32-bit accumulation.
- Cost is based on street price.
- Peak numbers are assumed for the DSPs; the FPGA designs are actually mapped (automatic Simulink tools).
- TI DSPs: C6415-7E, 130nm (720 MHz); C6415T-1G, 90nm (1 GHz).
- FPGAs: 130nm, 200-250 MHz.

[Figure: three bar charts comparing the XC2VP70-7 FPGA against the C6415-7E and C6415T-1G DSPs on the spectrometer, PFB, and correlator benchmarks: performance in GMAC/s (log scale, 1 to 1000), cost-performance in MMAC/s/$ (0 to 35), and energy efficiency in GMAC/s/watt (0 to 140).]

Metrics include chips only (not the full system). FPGAs provide extra benefit at the PC-board level.
Active Application Areas

- High-performance DSP
  - SETI spectroscopy, ATA / SKA image formation
- Scientific computation and simulation
  - E&M simulation for antenna design
- Communication-systems development platform
  - Algorithms for SDR and cognitive radio
  - Large wireless ad-hoc sensor networks
  - In-the-loop emulation of SoCs and reconfigurable architectures
- Bioinformatics
  - BLAST (Basic Local Alignment Search Tool) biosequence alignment
- System design acceleration
  - Full-chip transistor-level circuit simulation (Xilinx)
  - RAMP (Research Accelerator for MultiProcessing)
Opportunity for a New Research Platform: RAMP
(Research Accelerator for Multiple Processors)

Krste Asanovic (MIT), Christos Kozyrakis (Stanford), Dave Patterson (UCB), Jan Rabaey (UCB), John Wawrzynek (UCB)
July 2005
Change in the Computer Landscape

- Old conventional wisdom: uniprocessor performance 2X every 1.5 years ("Moore's Law")
- New conventional wisdom: 2X CPUs per socket every ~2 years
- Problem: compilers, operating systems, and architectures are not ready for 1000s of CPUs per chip, but that's where we're headed
- How do we do research on 1000-CPU systems in compilers, OS, and architecture?
FPGA Boards as a New Research Platform

- Given that ~25 soft CPUs can fit in one FPGA, what if we made a 1000-CPU system from ~40 FPGAs?
  - 64-bit simple RISC cores at 100 MHz
- The research community does the logic design ("gate shareware") to create an out-of-the-box massively parallel processor that runs standard binaries of OS and applications
  - Processors, caches, coherency, switches, Ethernet interfaces, ...
- Recreate the synergy of old VAX + BSD Unix?
Why RAMP Attractive?
Priorities for Research Parallel Computers

1a. Cost of purchase
1b. Cost of ownership (staff to administer it)
1c. Scalability (1000 much better than 100 CPUs)
4. Observability (measure, trace everything)
5. Reproducibility (to debug, run experiments)
6. Community synergy (share code, ...)
7. Flexibility (change for different experiments)
8. Performance
Why RAMP Attractive?
Grading SMP vs. Cluster vs. RAMP

                                  SMP             Cluster           RAMP
Cost of purchase
  (1 CPU, 1 GB DRAM)*      D ($40k, $4k)    B ($2k, $0.4k)   A+ ($0.1k, $0.2k)
Cost of ownership                A                D                 B
Scalability                      C                A                 A
Observability                    D                C                 A+
Reproducibility                  B                D                 A+
Community                        D                A                 A
Flexibility                      D                C                 A+
Performance (clock)          A (2 GHz)        A (3 GHz)        D (0.2 GHz)

* Costs from TPC-C Benchmark: IBM eServer P5 595, IBM eServer x346/Apple Xserver, BWRC BEE2
Internet in a Box?

Could RAMP radically change research in distributed computing? (Armando Fox, Ion Stoica, Scott Shenker)

- Existing distributed environments (like PlanetLab) are very hard to use for development:
  - The computers are live on the Internet and subject to all kinds of problems (security, ...), and there is no reproducibility.
  - You cannot reserve the whole thing for yourself and change the OS or routing or ....
  - They are very expensive to support, which is why the biggest ones are on the order of 200 to 300 nodes and there are lots of restrictions on using them.
Internet in a Box?

- RAMP promises a private "internet in a box" for $50k to $100k.
- A collection of 1000 computers running independent OSes that could take real checkpoints and have reproducible behavior.
- We can set parameters for network delays, bandwidth, number of disks, disk latency and bandwidth, ...
- Every board could run synchronously to the same clock cycle,
  - so that we could take a checkpoint at clock cycle 4,000,000,000, then reload from that point later and cause the network interrupt to occur at exactly clock cycle 4,000,000,100 for CPU 104, every single time.
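The cycle-deterministic checkpoint/replay idea can be illustrated with a toy cycle-driven simulator. Entirely hypothetical code, not RAMP's implementation: state is snapshotted at one cycle, an event is bound to a fixed later cycle, and replaying from the snapshot reproduces identical behavior every run.

```python
import copy

class ToyCPU:
    """Toy cycle-driven CPU model: a counter plus an interrupt log."""
    def __init__(self):
        self.cycle = 0
        self.counter = 0
        self.interrupts = []

    def tick(self, events):
        self.cycle += 1
        self.counter += 1
        if self.cycle in events:        # event bound to an exact cycle
            self.interrupts.append(self.cycle)

def run(cpu, until, events):
    while cpu.cycle < until:
        cpu.tick(events)
    return cpu

# Checkpoint at cycle 100, schedule a "network interrupt" at cycle 150,
# then replay from the checkpoint twice: behavior is bit-identical.
events = {150}
snapshot = copy.deepcopy(run(ToyCPU(), 100, events))

run_a = run(copy.deepcopy(snapshot), 200, events)
run_b = run(copy.deepcopy(snapshot), 200, events)
print(run_a.interrupts == run_b.interrupts == [150])  # True
```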