Transcript Slide 1

The Cray XD1
Computer and its
Reconfigurable
Architecture
Dave Strenski
[email protected]
July 11, 2005
Outline
XD1 overview
 Architecture
 Interconnect
 Active Manager
XD1 FPGAs
 Architecture
 Example execution
 Core development strategy
FORTRAN to VHDL considerations
 Memory allocation
 Unrolling
 One versus many cores
XD1 FPGA running examples
 MTA kernel and Ising Model
 FFT kernel from DSPlogic
 Smith-Waterman kernel from Cray
 LANL Traffic simulation code
 Other works in progress
Slide ‹#›
Cray Today
Nasdaq: CRAY
Formed on April 1, 2000 as Cray Inc.
Headquartered in Seattle, WA
Roughly 900 employees across 30 countries
Four Major Development Sites:
 Chippewa Falls, WI
 Mendota Heights, MN
 Seattle, WA
 Vancouver, Canada
Significant Progress in the market
 X1 sales and Sandia National Laboratory Red Storm contract
 Oak Ridge National Laboratory Leadership Class system
 DARPA HPCS Phase II funding of $50M through 2006 for Cascade
 Acquired OctigaBay – 70+ Cray XD1s sold to date
Slide ‹#›
Cray XD1
Overview
Slide ‹#›
Cray XD1 System Architecture
Compute: 12 AMD Opteron 32/64-bit x86 processors; High Performance Linux
RapidArray Interconnect: 12 communications processors; 1 Tb/s switch fabric
Active Management: dedicated processor
Application Acceleration: 6 co-processors
Processors directly connected via integrated switch fabric
Slide ‹#›
Cray XD1 Chassis
Chassis front: six two-way Opteron blades, fans, six SATA hard drives, six FPGA modules
Chassis rear: 0.5 Tb/s switch, 12 x 2 GB/s ports to fabric, three I/O slots (e.g. JTAG), four 133 MHz PCI-X slots, connector for a second 0.5 Tb/s switch and 12 more 2 GB/s ports to fabric
Slide ‹#›
Compute Blade
Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory
RapidArray communications processor
Connector to main board
Slide ‹#›
Cray Innovations
Balanced Interconnect
Active Management
Application Acceleration
Cray XD1: Performance and Usability
Slide ‹#›
Architecture
Diagram: in a typical dual Intel Xeon(TM) server, both processors share a Northbridge DDR memory controller (6.4 GB/s) and reach I/O through a Southbridge or PCI-X bridge, so each PCI-X slot is speed-limited to about 1 GB/s. In the Cray XD1, the AMD Opteron's integrated memory controller and 6.4 GB/s HyperTransport links connect it directly to memory and, over 3.2 GB/s HT links, to the RapidArray communications processors, avoiding the PCI-X I/O bottleneck.
Slide ‹#›
Removing the Bottleneck
Figure: memory (GigaBytes), processor (GFLOPS), interconnect, and I/O bandwidth (GigaBytes per second) compared across systems:
 Xeon server: 1 GB/s PCI-X I/O, 0.25 GB/s GigE interconnect, 5.3 GB/s DDR 333 memory
 Cray XD1: 8 GB/s RapidArray interconnect, 6.4 GB/s DDR 400 memory
 Cray XT3: SeaStar (SS) interconnect, 6.4 GB/s DDR 400 memory
 Cray X1: 31 GB/s, 34.1 GB/s, and 102 GB/s (interconnect and memory figures as shown)
Slide ‹#›
Communications Optimizations
Cray Communications Libraries
 MPI 1.2 library
 TCP/IP
 PVM
 Shmem
 Global Arrays
 System-wide process & time synchronization
RapidArray Communications Processor
 HT/RA tunnelling with bonding
 Routing with route redundancy
 Reliable transport
 Short message latency optimization
 DMA operations
 System-wide clock synchronization
Diagram: each AMD Opteron 2XX processor connects to a RapidArray communications processor over a 3.2 GB/s HyperTransport link, which provides 2 GB/s links into the RapidArray fabric.
Direct Connected Processor Architecture
Slide ‹#›
Synchronized Linux Scheduler
Diagram: without synchronization, system overhead strikes each processor (Proc 1–3) at a different time, so the other processors burn wasted CPU cycles waiting at each barrier (barriers 1–3). With the synchronized scheduler, system overhead is aligned across all processors and the wasted cycles between barriers disappear. (Key: compute cycles, system cycles, wasted cycles.)
Slide ‹#›
Reducing OS Jitter
Chart: Linux synchronization speedup (% speedup, 0–50%) versus processor count (1–64).
Cray XD1 Linux synchronization increases application scaling, improves efficiency by 42%, and lowers application license fees for an equivalent processor count.
Slide ‹#›
Direct Connect Topology
1 Cray XD1 Chassis
12 AMD Opteron Processors
58 GFLOPS
8 GB/s between SMPs
1.8 µsec interconnect
Integrated switching
3 Cray XD1 Chassis
36 AMD Opteron Processors
173 GFLOPS
8 GB/s between SMPs
2.0 µsec interconnect
Integrated switching
25 Cray XD1 Chassis, two racks
300 AMD Opteron Processors
1.4 TFLOPS
2 - 8 GB/s between SMPs
2.0 µsec interconnect
Integrated switching
Slide ‹#›
Fat Tree Topology
Spine switch
Spine switch
Spine switch
12 Cray XD1 chassis
144 AMD Opteron
Processors
691 GFLOPS
4/8 GB/s between SMPs
2.0 µsec interconnect
Fat tree switching,
integrated first & third
order
6/12 RapidArray spine
switches (24-ports)
Slide ‹#›
MPI Latency
Chart: MPI latency versus message size (latency 0–35 microseconds; message length 0–4096 bytes) for Cray XD1 (RapidArray), Quadrics (Elan 4), 4x Infiniband, and Myrinet (D card).
RapidArray Short Message Latency is 4 times lower than Infiniband
The Cray XD1 has sent 2 KB before others have sent their first byte
Slide ‹#›
MPI Throughput
Chart: bandwidth versus message size (bandwidth 0–1400 MB/s; data length in bytes) for Cray XD1 (1/2 RapidArray fabric), Quadrics Elan 4, 4x Infiniband, and Myrinet (D card).
The Cray XD1 Delivers 2X the Bandwidth of Infiniband
(1 KB Message Size)
Slide ‹#›
Active Manager System
Usability
 Single System Command and Control
Resiliency
 Dedicated management processors, real-time OS and communications fabric
 Proactive background diagnostics with self-healing
Diagram: CLI and web access to the Active Management software.
Automated management for exceptional reliability, availability, serviceability
Slide ‹#›
Active Manager GUI: SysAdmin
GUI provides quick access to status info and system functions
Slide ‹#›
Automated Management
Diagram: users and administrators manage a front-end partition, a file services partition, and multiple compute partitions from a single point.
Management functions:
 Partition management
 Linux configuration
 Hardware monitoring
 Software upgrades
 File system management
 Data backups
 Network configuration
 Accounting & user management
 Security
 Performance analysis
 Resource & queue management
Single System Command and Control
Slide ‹#›
Self-Monitoring
Diagram: the Active Manager's dedicated management processor, OS, and fabric monitor parity, heartbeat, temperature, fan speed, diagnostics, air velocity, voltage, current, and thermals across the hard drives, processors, memory, fans, power supplies, and interconnect.
Slide ‹#›
Thermal Management
Slide ‹#›
File Systems: Local Disks
One S-ATA HD per SMP; Local Linux directory per HD
Diagram: six SMPs in a Cray XD1 chassis, each with its own EXT2/3 file system on its local disk, connected by RapidArray.
Slide ‹#›
File Systems: SAN
SMP acting as a File Server for the SAN
Diagram: one SMP acts as a file server with an EXT2/3 file system and a Fibre Channel HBA into the FC SAN, exporting the file system over NFS to the compute SMPs in the Cray XD1.
Slide ‹#›
Programming Environment
Operating System: Cray HPC Enhanced Linux distribution (derived from SuSE 8.2)
System Management: Active Manager for system administration & workload management
Application Acceleration Kit: IP cores, reference designs, command-line tools, API, JTAG interface card
Scientific Libraries: AMD Core Math Library (ACML)
Shared Memory Access: Shmem, Global Arrays, OpenMP
3rd Party Tools: Fortran 77/90/95, HPF, C/C++, Java, Etnus TotalView
Communications Libraries: MPI 1.2
Cray XD1 is standards-based for ease of programming –
Linux, x86, MPI
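Because the programming model is standard Linux/x86/MPI, ordinary MPI code runs unchanged on the XD1. Below is a minimal, hypothetical C/MPI ping-pong sketch (not from the original slides) of the kind of exchange measured in the latency and bandwidth charts above:

/* Minimal MPI ping-pong between ranks 0 and 1 (1 KB message). */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    char buf[1024];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, sizeof(buf));

    if (size >= 2) {
        if (rank == 0) {            /* rank 0 sends and waits for the echo */
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {     /* rank 1 echoes the message back */
            MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}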
Slide ‹#›
Cray XD1’s FPGA
Architecture
Slide ‹#›
The Rebirth of Co-processing
1976: 8086 processor with 8087 coprocessor
2004: AMD Opteron with Xilinx Virtex II Pro FPGA
Slide ‹#›
Application Acceleration
Reconfigurable Computing
 Tightly coupled to the Opteron (via the RAP)
 FPGA acts like a programmable coprocessor
 Performs vector operations
 Well-suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation
 Superlinear speedup for key algorithms
Slide ‹#›
Diagram: four chassis configurations, with one or two RapidArray switches.
Slide ‹#›
Application Acceleration FPGA
Diagram: the compute processor streams a data set to the Application Acceleration FPGA, which applies the inner loop ("do for each array element ... end") across the data.
Fine-grained parallelism applied for 100x potential speedup
Slide ‹#›
Compute Blade
Diagram: compute blade with expansion module, RapidArray processor, DDR 400 DRAM, Opteron processors, and Application Acceleration FPGA.
Slide ‹#›
Interconnections
Diagram: the expansion module sits between its neighbor modules, with HyperTransport (HT) links to the blade and RapidArray connections to the fabric and neighbors.
Slide ‹#›
Module Detail
Diagram: the Acceleration FPGA connects to the RAP over a 3.2 GB/s HyperTransport link, to four QDR II SRAMs, to the RapidArray fabric at 2 GB/s, and to each neighbor compute module at 2 GB/s.
Slide ‹#›
Virtex II Pro FPGA
Diagram: Virtex-II series fabric with multi-gigabit transceivers (Rocket I/O MGTs), block RAM, and embedded 300 MHz PowerPC cores.
XC2VP30 – XC2VP50:
 422 MHz max. clock rate
 30,000 – 53,000 LEs
 3 – 5 million 'system gates'
 136 – 232 Block RAMs
 136 – 232 18x18 multipliers
 8 – 16 Rocket I/O MGTs
Slide ‹#›
Virtex II Family Logic Blocks
Diagram: a Virtex-II slice contains two 4-input LUTs (usable as RAM16/SRL16), carry logic (CY), and two registers.
1 LE = LUT + register
1 slice = 2 LEs
1 CLB = 4 slices
XC2VP30-6 Examples

Function | f (MHz) | LEs | BRAMs | Multipliers | Number possible
64-bit adder | 194 | 66 | 0 | 0 | 450
64-bit accumulator | 198 | 64 | 0 | 0 | 450
18x18 multiplier | 259 | 88 | 0 | 1 | 136
SP FP multiplier | 188 | 252 | 0 | 4 | 34
1024-point FFT (16-bit complex) | 140 | 5526 | 22 | 12 | 5
Slide ‹#›
Module Variants
A variety of Application Acceleration variants can be manufactured by
populating different pin compatible FPGAs and QDR II RAMs.
FPGA | Speed grade | Logic elements | PowerPC cores | 18x18 multipliers
XC2VP30 | -6 | 30,816 | 2 | 136
XC2VP40 | -6 | 43,632 | 2 | 192
XC2VP50 | -7 | 53,136 | 2 | 232

RAM | Speed | Dimensions | Quantity | Module memory
K7R163682 | 200 MHz | 512K x 36 | 4 | 8 MByte
K7R323682 | 200 MHz | 1M x 36 | 4 | 16 MByte
K7R643682 (future) | 200 MHz | 2M x 36 | 4 | 32 MByte
Slide ‹#›
Processor to FPGA
Diagram: processor read/write requests and responses travel over HyperTransport to the RAP and then over RapidArray Transport to the FPGA.
• Since the Acceleration FPGA is connected to the local processing node through its HyperTransport I/O bus, the FPGA can be accessed directly using reads and writes.
• Additionally, a node can also transfer large blocks of data to and from the Acceleration FPGA using a simple DMA engine in the FPGA's RapidArray Transport Core.
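To make the two access styles concrete, here is a hypothetical host-side sketch. The fpga_memmap and fpga_put names come from the FPGA Linux API slide later in this deck; the prototypes, device name, and sizes below are assumptions made up for illustration, not the real interface:

/* Hypothetical sketch: direct register access vs. DMA block transfer. */
#include <stdint.h>
#include <stddef.h>

int   fpga_open(const char *dev);                     /* assumed prototype */
void *fpga_memmap(int fd, size_t len);                /* assumed prototype */
int   fpga_put(int fd, const void *src, size_t len);  /* assumed prototype: DMA block to FPGA */

void example(void)
{
    int fd = fpga_open("/dev/ufp0");                  /* device name is a guess */

    /* Small transfers: map FPGA registers/RAM and use ordinary loads and
       stores, which travel as individual reads and writes over HT/RapidArray. */
    volatile uint64_t *regs = (volatile uint64_t *) fpga_memmap(fd, 4096);
    regs[0] = 42;                                     /* one write request to the FPGA */

    /* Large transfers: hand a whole buffer to the DMA engine in the
       RapidArray Transport core instead of issuing one store per word. */
    static uint64_t block[8192];
    fpga_put(fd, block, sizeof block);
}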
Slide ‹#›
FPGA to Processor
Diagram: read and write requests and responses flow from the FPGA through the RAP to the processor.
• The Acceleration FPGA can also directly access the memory of a processor. Read and write requests can be performed in bursts of up to 64 bytes.
• The Acceleration FPGA can access processor memory without interrupting the processor.
• Memory coherency is maintained by the processor.
Slide ‹#›
FPGA to Neighbor
Diagram: the six SMPs' FPGAs are linked to their neighbors at 2–3 GB/s per link.
• Each Acceleration FPGA is connected to its neighbors in a ring using the Virtex II Pro MGT (Rocket I/O) transceivers.
• The XC2VP40 FPGAs provide a 2 GB/s link to each neighbor FPGA.
• The XC2VP50 FPGAs provide a 3 GB/s link to each neighbor FPGA.
Slide ‹#›
Cray XD1 FPGA
Programming
Slide ‹#›
Hard, but it could be worse!
Slide ‹#›
Application Acceleration Interfaces
Diagram: user logic in the FPGA sits between the RapidArray Transport core (TX/RX to the RAP) and four QDR RAM interface cores, each presenting ADDR(20:0), D(35:0), and Q(35:0) to a QDR II SRAM.
• XC2VP30-50 running at up to 200 MHz.
• 4 QDR II RAMs with over 400 HSTL-I I/Os at 200 MHz DDR (400 MTransfers/s).
• 16-bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s).
• QDR and HT I/Fs take up <20% of the XC2VP30; the rest is available for user applications.
Slide ‹#›
FPGA Linux API
Administration Commands
 fpga_open – allocate and open the FPGA
 fpga_close – close the allocated FPGA
 fpga_load – load a binary into the FPGA
Operation Commands
 fpga_start – start the FPGA (release from reset)
 fpga_reset – soft-reset the FPGA
Mapping Commands
 fpga_set_ftrmem – map an application virtual address to allow access by the FPGA
 fpga_memmap – map FPGA RAM into application virtual space
Control Commands
 fpga_wrt_appif_val – write data into the application interface (register space)
 fpga_rd_appif_val – read data from the application interface (register space)
Status Commands
 fpga_status – get status of the FPGA
DMA Commands
 fpga_put – send data to the FPGA
 fpga_get – receive data from the FPGA
Interrupt/Blocking Commands
 fpga_intwait – block the process waiting for an FPGA interrupt
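A minimal host-side sketch of how these calls fit together; the call order follows the MTA example later in the deck, but the prototypes, device path, and binary name are assumptions made up for illustration, since the slides list only the command names:

/* Hypothetical lifecycle of an FPGA accelerator from a Linux application. */
#include <stdio.h>

int fpga_open(const char *dev);             /* assumed prototype */
int fpga_load(int fd, const char *binfile); /* assumed prototype */
int fpga_start(int fd);                     /* assumed prototype */
int fpga_status(int fd);                    /* assumed prototype */
int fpga_reset(int fd);                     /* assumed prototype */
int fpga_close(int fd);                     /* assumed prototype */

int main(void)
{
    int fd = fpga_open("/dev/ufp0");        /* allocate and open the FPGA (device name is a guess) */
    if (fd < 0) { perror("fpga_open"); return 1; }

    fpga_load(fd, "design.bin.ufp");        /* load the synthesized binary (file name is a guess) */
    fpga_start(fd);                         /* release the user logic from reset */

    printf("fpga status: %d\n", fpga_status(fd));

    /* ... exchange data via fpga_memmap / fpga_set_ftrmem / fpga_put / fpga_get ... */

    fpga_reset(fd);                         /* soft-reset before releasing */
    fpga_close(fd);                         /* give up the FPGA */
    return 0;
}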
Slide ‹#›
Additional High Level Tools
Diagram: tool flow from high-level source to FPGA binary.
High Level Flow:
 C synthesis from SystemC / ANSI C/C++ (e.g. int mask(a, m) { return (a & m); }) – Adelante, Celoxica, Forte Design Systems, Mentor Graphics, Prosilog, Synopsys
 MATLAB/Simulink (The MathWorks) with the DSPlogic RCIO Lib, via Xilinx System Generator for DSP
Standard Flow:
 VHDL/Verilog (e.g. process (a, m) is begin z <= a and m; end process;) synthesized by Mentor Graphics, Synopsys, Synplicity, or Xilinx tools into a gate-level EDIF file
 Xilinx place and route then produces the binary file loaded into the FPGA
Slide ‹#›
Standard Development Flow
Diagram: standard development flow.
 Write the design in HDL (VHDL, Verilog, C) and merge it with the supplied cores (RAP I/F, QDR RAM I/F, DSPlogic RCIO core)
 Simulate with ModelSim; synthesize and implement with Xilinx ISE
 Download the binary file and metadata to the XD1 and load/run it on the Acceleration FPGA from the command line or an application
 Verify on target with Xilinx ChipScope and ModelSim
Slide ‹#›
On Target Debugging
• Integrated Logic Analyzer (ILA) blocks are used to capture and store internal logic events based on user-defined triggers.
• Trapped events can then be read out over JTAG and displayed on a PC by the ChipScope software.
Diagram: ILA blocks attached to user functions 1 and 2 inside the Acceleration FPGA connect through the OctigaBay JTAG I/O card and a Xilinx Parallel Cable III/IV or MultiLINX (parallel or USB) to the Xilinx ChipScope Plus software.
Slide ‹#›
FORTRAN to VHDL ideas
      program test
      integer xyz
      integer a, b, c, n(1000), temp(1000)
      do i = 1, 1000
        n(i) = xyz(a, b, c, temp)
      end do
      end

The scratch array temp is allocated once, outside the loop that calls the function. This is efficient FORTRAN code because you only allocate the space once.
With an FPGA design you would instead want to allocate the temporary space on the FPGA itself.
Slide ‹#›
FORTRAN to VHDL ideas
Original (real delta):
      program test
      integer xyz
      integer a, b, n(1000)
      real delta
      delta = 0.01
      do i = 1, 1000
        n(i) = xyz(a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a .gt. b*delta) then
        xyz = a
      else
        xyz = b
      endif
      return
      end

Integer version (delta replaced by 1/delta):
      program test
      integer xyz
      integer a, b, n(1000)
      integer delta
      delta = 100   ! 1/delta
      do i = 1, 1000
        n(i) = xyz(a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a*delta .gt. b) then
        xyz = a
      else
        xyz = b
      endif
      return
      end

Convert real variables to integers where possible.
Slide ‹#›
FORTRAN to VHDL ideas
      function xyz (i, j, mode)
      integer i, j, mode
      do i = 1, 1000
        do j = 1, 1000
          if (mode .eq. 2) then
            if (a(i,j,k) .gt. b(i,j,k)) then
              xyz = a
            else
              xyz = b
            end if
          else
            xyz = 0
          end if
        end do
      end do
      return
      end

Move code that doesn't change outside the function. Maybe make multiple cores, one for each mode.
Slide ‹#›
Mixing FPGAs and MPI
It gets a bit tricky mixing FPGAs with an MPI code. The XD1 has 2 or
4 Opterons per node and only one FPGA. Only one Opteron is able
to grab the FPGA at a time.
Diagram: example placements of two jobs across nodes. Each node's single FPGA can be claimed by only one of the jobs sharing that node's CPUs, so depending on placement a job may find the FPGA already taken by the other job, or not available at all. A sketch of one way an MPI code can handle this follows.
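One way to cope, sketched below under the assumption of the fpga_open call from the FPGA Linux API slide (its prototype and the device name are guesses): every rank tries to claim its node's FPGA, and ranks that fail simply take the CPU-only code path.

/* Hypothetical MPI + FPGA coexistence sketch. */
#include <mpi.h>
#include <stdio.h>

int fpga_open(const char *dev);   /* assumed prototype; fails if the FPGA is already claimed */

int main(int argc, char **argv)
{
    int rank, have_fpga;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only one Opteron per node can grab the FPGA, so treat failure as normal. */
    have_fpga = (fpga_open("/dev/ufp0") >= 0);   /* device name is a guess */

    if (have_fpga)
        printf("rank %d: accelerating with the FPGA\n", rank);
    else
        printf("rank %d: FPGA busy or absent, using the CPU path\n", rank);

    /* ... run the FPGA-accelerated or the CPU-only kernel accordingly ... */

    MPI_Finalize();
    return 0;
}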
Slide ‹#›
Cray XD1 FPGA
Examples
Slide ‹#›
Random Number Example
Diagram: the FPGA's Mersenne Twister RNG feeds pseudo-random numbers through the RAP into the processor's memory.
• The FPGA implements the "Mersenne Twister" RNG algorithm often used for Monte Carlo analysis. The algorithm generates integers with a uniform distribution and won't repeat for 2^19937 - 1 values.
• The FPGA automatically transfers generated numbers into two buffers located in the processor's local memory.
• The processor application alternately reads the pseudo-random numbers from the two buffers. As the processor marks a buffer 'empty', the FPGA refills it with new numbers.
Slide ‹#›
MTA Example
Sequence of operations between the Opteron and the application accelerator:
 Load/start a.out in the Opteron's memory
 Call FPGA_OPEN
 Call FPGA_LOAD
 Call FPGA_SET_FTRMEM (allocate buffers A and B in processor memory)
 Call FPGA_START
 FPGA checks the buffer flags
 FPGA generates random numbers
 FPGA toggles the buffer flag
 Opteron consumes the random numbers
 Opteron and FPGA run asynchronously
 Call FPGA_CLOSE
 Opteron exits
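A host-side sketch of the double-buffer handshake described above. The fpga_* names come from the API slide; their prototypes, the flag layout, and the buffer sizes are assumptions made up for illustration:

/* Hypothetical consumer loop for the two-buffer RNG scheme. */
#include <stdint.h>
#include <stdlib.h>

int fpga_set_ftrmem(int fd, void *addr, size_t len);  /* assumed prototype: expose host memory to the FPGA */
int fpga_start(int fd);                               /* assumed prototype */

#define BUF_WORDS 4096

struct rng_region {                   /* hypothetical layout shared with the FPGA design */
    volatile uint32_t full[2];        /* one flag per buffer: FPGA sets it, host clears it */
    uint32_t buf[2][BUF_WORDS];       /* buffers A and B of random numbers */
};

static void consume(const uint32_t *p, int n) { (void)p; (void)n; /* ... use the numbers ... */ }

void run(int fd)
{
    struct rng_region *r = calloc(1, sizeof *r);
    fpga_set_ftrmem(fd, r, sizeof *r);    /* let the FPGA write into this region */
    fpga_start(fd);

    for (int pass = 0, which = 0; pass < 100; pass++, which ^= 1) {
        while (!r->full[which])           /* spin until the FPGA has filled this buffer */
            ;                             /* (could block with fpga_intwait instead) */
        consume(r->buf[which], BUF_WORDS);
        r->full[which] = 0;               /* mark 'empty' so the FPGA refills it */
    }
}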
Slide ‹#›
Random Number Results
Source | Platform | Speed (32-bit integers/second) | Size
Original C code | 2.2 GHz Opteron | ~101 million | N/A
VHDL code | FPGA (XC2VP30-6) @ 200 MHz | ~319 million | ~25% of chip (includes RapidArray core)
• FPGA provides 3X performance of fastest available Opteron
• Algorithm takes up a small portion of the smallest FPGA.
• Performance is limited by speed at which numbers can be written
into processor memory, not by FPGA logic. The logic could easily
produce 1.6 billion integers/second by increasing parallelism.
Slide ‹#›
Ising Model with Monte Carlo
Code was developed by Martin Siegert at Simon Fraser University
Uses the MTA random number generation design
Runs 2.5 times faster with the FPGA
Should run faster with the newest MTA design, which returns floating-point random numbers instead of integers.
Tar file available for the Cray XD1
Slide ‹#›
FFT design from DSPlogic
Code was developed by Mike Babst and Rod Swift at DSPlogic
Uses 16-bit fixed point data as input and 32-bit fixed point as output,
which yields an accuracy similar to single precision results posted
at FFTW web site (www.fftw.org)
A one-dimensional complex FFT of length 65536 on the FPGA is about 5 times faster than on the 2.2 GHz Opteron using FFTW.
Packing the data more tightly can double the performance to 10x.
Performance depends on the size of the data.
Slide ‹#›
Smith-Waterman
Code was developed internally by Cray
CUPS = Cell Updates Per Second
Rate = FPGA frequency * cells per clock * number of S-W processing elements
Current: 80 MHz * 1 * 32 = 2.6 billion CUPS, using 60% of the chip
Optimization: 100 MHz * 1 * 50 = 5 billion CUPS
Virtex 4 FPGA: 100 MHz * 1 * 150 = 15 billion CUPS
Opteron using SSEARCH34 = 100 million CUPS
The current version runs 25 times faster than a 2.2 GHz Opteron.
The nucleotide (4-bit) version is running in house. The amino acid (8-bit) version is just finished; it is being incorporated into SSEARCH to make it easier to use.
Smith-Waterman on the FPGA is about 10 times faster than BLAST on the Opteron.
Slide ‹#›
Los Alamos Traffic Simulation
Code was developed by Justin Tripp, Henning Mortveit, Anders
Hansson, and Maya Gokhale at Los Alamos National Labs
Uses FPGA for straight road sections and Opteron for everything
else.
Runs 34.4 times faster with the FPGA relative to a 2.2 GHz Opteron
System integration must be optimized to exploit this speedup in the overall simulation.
Slide ‹#›
Other XD1 FPGA Projects
Financial company using the random number generation core for a
Monte Carlo simulation.
Seismic companies using FPGAs for FFT and convolutions.
Pharmaceutical companies using FPGAs for searching and sorting.
NCSA is working on a civil engineering “dirt” code.
University of Illinois is working on porting part of NAMD to an FPGA.
Slide ‹#›
Other Useful FPGA designs
JPEG2000 developed by Barco Silex, currently running on Virtex FPGAs. Working with them on a real-time, high-resolution compression project.
64-bit floating point matrix multiplication by Ling Zhuo and Viktor Prasanna at the University of Southern California. It gets 8.3 GFLOPS on an XC2VP125, compared to 5.5 GFLOPS on a 3.2 GHz Xeon.
Finite-Difference Time-Domain (FDTD) by Ryan Schneider, Laurence Turner, and Michal Okoniewski at the University of Calgary.
Slide ‹#›
Questions
Slide ‹#›