Transcript Outline
RAMP Tutorial Introduction/Overview
Krste Asanovic, UC Berkeley
RAMP Tutorial, ASPLOS, Seattle, WA, March 2, 2008

Technology Trends: CPU
- Microprocessor: Power Wall + Memory Wall + ILP Wall = Brick Wall
- End of uniprocessors and of faster clock rates
- Every program(mer) is a parallel program(mer); sequential algorithms are slow algorithms
- Since parallel is more power efficient (W ≈ CV²F), the new "Moore's Law" is 2X processors ("cores") per socket every 2 years, at the same clock frequency
- Conservative: 4 cores in 2007, 8 cores in 2009, 16 cores in 2011, for embedded, desktop, & server
- Sea change for the HW and SW industries, since it changes the programming model and responsibilities
- HW/SW industries bet the farm that parallel succeeds

Problems with the "Manycore" Sea Change
1. Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready for 1000 CPUs per chip
2. Only companies can build HW, and it takes years
3. Software people don't start working hard until the hardware arrives
   - 3 months after the HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
4. How do we get 1000-CPU systems into the hands of researchers to innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, ...?
5. Can we avoid waiting years between HW/SW iterations?

Vision: Build Research MPP from FPGAs
- As 16 CPUs will fit in a Field Programmable Gate Array (FPGA), build a 1000-CPU system from 64 FPGAs?
  - 8 simple 32-bit "soft core" RISCs at 100 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs: 2X CPUs, 1.2X clock rate
- HW research community does the logic design ("gate shareware") to create an out-of-the-box MPP
  - E.g., a 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ 150 MHz/CPU in 2007
  - 6 universities, 10 faculty
  - 3rd party sells RAMP 2.0 (BEE3) hardware at low cost
- "Research Accelerator for Multiple Processors"

Why RAMP Good for Research MPP? (grades for SMP / Cluster / Custom / Simulate / RAMP)
- Scalability (1k CPUs): C / A / A / A / A
- Cost (1k CPUs): F ($20M) / C ($1M) / F ($3M) / A+ ($0M) / A ($0.1M)
- Cost to own: A / D / A / A / A
- Power/Space (kilowatts, racks): D (120 kW, 6 racks) / D (120 kW, 6 racks) / A (100 kW, 3 racks) / A+ (0.1 kW, 0.1 racks) / A (1.5 kW, 0.3 racks)
- Community: D / A / F / A / A
- Observability: D / C / D / A+ / A+
- Reproducibility: B / D / B / A+ / A+
- Reconfigurability: D / C / D / A+ / A+
- Credibility: A+ / A+ / A- / F / B
- Performance (clock): A (2 GHz) / A (3 GHz) / B (0.4 GHz) / F (0 GHz) / C (0.1 GHz)
- GPA: C / B- / C+ / B / A-

Partnerships
- Co-PIs: Krste Asanović (UCB), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)
- RAMP hardware development activity centered at the Berkeley Wireless Research Center; three-year NSF grant for staff (awarded 3/06)
- GSRC (Jan Rabaey) has paid partial staff and some students
- Major continuing commitment from Xilinx
- Collaboration with MSR (Chuck Thacker) on the BEE3 FPGA platform
- Sun and IBM contributing processor designs; IBM faculty awards
- High-speed, high-confidence emulation is widely recognized as a necessary component of multiprocessor research and development; FPGA emulation is the only practical approach

BEE3 Design (Chuck Thacker; Chen Chang, UC Berkeley)
- New RAMP systems to be based on the Berkeley Emulation Engine version 3 (BEE3); first BEE3 prototype 11/07 (photo)
- BEECube, Inc. (UC Berkeley spinout startup company) to provide manufacturing, distribution, and support to commercial and academic users; general availability 2Q08
- For small-scale designs, or to get started, use the Xilinx ML505

RAMP: An Infrastructure to Build Simulators Using FPGAs
- Run the target model on the host platform (diagram: a target model with CPUs, an interconnect network, and DRAM, mapped onto the host platform; the mapping is the hard work)

Reduce, Reuse, Recycle
- Reduce the effort to build target models: users just build components (units); the infrastructure handles the connections (the RDL Compiler)
- Reuse units by having good abstractions: across different target models, and across different host platforms (XUP, Calinx, BEE2, BEE3, ML505, also Altera platforms)
- Recycle existing IP for use as simulation models: commercial processor RTL is (almost) its own model

RAMP Target Model
- Diagram: Units A, B, and C connected by FIFO and pipeline channels
- Units: relatively large chunks of functionality (e.g., processor + L1 cache), user-written in some HDL or software
- Channels: point-to-point, unidirectional, generated by the RAMP infrastructure; two kinds:
  - FIFO channel: flow-controlled interface
  - Pipeline channel: simple shift register, bits drop off the end

Target Pipeline Channel Parameters
- Datawidth and forward latency (diagram)

RAMP Description Language (RDL) [Greg Gibeling, UCB]
- The user describes the target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL
- The RDL Compiler (RDLC) generates configurations: generated unit wrappers and generated links that carry the channels (diagram: Units A and B mapped to FPGA1, Unit C to FPGA2)

Virtual Target Clock

Virtualized RTL Improves FPGA Resource Usage
- RAMP allows units to run at varying target-to-host clock ratios to optimize area and overall performance
- Example 1: multiported register file. Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage. If the RTL is mapped directly, it requires 48K flip-flops: slow cycle time and large area. If it is mapped into block RAMs (one read + one write per cycle), it takes 3 host cycles per target cycle and 3x2KB block RAMs: faster cycle time (~3X) and far fewer resources
- Example 2: large L2/L3 caches. Current FPGAs have only ~1MB of on-chip SRAM, so use the on-chip SRAM as a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch the data from off-chip DRAM

Start/Done Timing Interface
- Diagram: an RDL-generated wrapper around a unit with inputs In1 and In2, output Out, and Start/Done signals
- The wrapper generated by RDL asserts "Start" on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
- The unit asserts "Done" when it finishes the target cycle and its outputs are ready
- The unit can take a variable amount of time
- An unvirtualized RTL unit can connect "Done" to "Start" (but must not clock until "Start"); see the sketch below
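To make the Start/Done handshake concrete, here is a minimal behavioral sketch in Python (the slides contain no code; every class and signal name below is an illustrative stand-in, not the actual RDL-generated interface). It models a wrapper that asserts Start only once every input for the next target cycle is ready, and a unit that takes a variable number of host cycles before asserting Done.

```python
# Behavioral sketch of the Start/Done timing interface (illustrative names only).
import random

class Unit:
    """Target unit: may take a variable number of host cycles per target cycle."""
    def step_target_cycle(self, in1, in2):
        host_cycles = random.randint(1, 4)   # variable amount of host time
        out = in1 + in2                      # stand-in for the unit's real work
        return out, host_cycles              # outputs are valid when Done asserts

class Wrapper:
    """Stand-in for the RDL-generated wrapper around a unit."""
    def __init__(self, unit):
        self.unit = unit
        self.host_cycles = 0
        self.target_cycles = 0

    def run_target_cycle(self, in_channels, out_channel):
        # Start asserts only on the host cycle when every input for the next
        # target cycle is ready; until then the unit must not clock.
        assert all(in_channels), "inputs not ready: Start stays deasserted"
        in1, in2 = (ch.pop(0) for ch in in_channels)
        out, used = self.unit.step_target_cycle(in1, in2)   # Start ... Done
        self.host_cycles += used             # one target cycle cost `used` host cycles
        out_channel.append(out)
        self.target_cycles += 1

# One target cycle, with both input channels already holding data.
w = Wrapper(Unit())
a, b, out = [3], [4], []
w.run_target_cycle([a, b], out)
print(out, w.target_cycles, w.host_cycles)   # e.g. [7] 1 3
```

An unvirtualized RTL unit corresponds to the degenerate case in which Done follows Start after a fixed number of host cycles.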
Distributed Timing Models

Distributed Timing Example
- A pipeline target channel of latency L is implemented as a distributed FIFO with at least L buffers
- Diagram: Units A and B with Start/Done on the host side; data D moves through the FIFO under RDY/ENQ/DEQ handshakes

Other Automatically Generated Networks
- Control network with the workstation as master and every unit as a slave device: memory-mapped interface with block transfers; used for initialization, stats gathering, debugging, and monitoring
- Units can connect to DRAM resources outside of the timed target channels: used to support emulation and virtualization state
- Units can communicate with each other outside of the timed target channels: supports arbitrary communication, e.g., for distributed stats gathering

Wide Variety of RAMP Simulators

Simulator Design Choices
- Structural analog versus highly virtualized
- Functional-only versus functional + timing
- Timing via a (virtual) RTL design versus separate functional and timing models
- Hybrid software/hardware simulators

Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Target model: CPUs 1-4 emulated by a multithreaded host emulation engine (on FPGA)
- A single hardware pipeline with multiple copies of CPU state (diagram: per-thread PCs and GPR copies, I$, IR, X, Y, D$)
- The multithreaded emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)

Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
- The functional model executes the CPU ISA correctly but carries no timing information; it only needs to be developed once per ISA
- The timing model captures pipeline timing details but does not need to execute code; it is much easier to change for architectural experimentation
- Without an RTL design, one cannot be 100% certain that the timing is accurate
- Many possible splits between the timing and functional models; a minimal sketch of one such split follows
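As a rough illustration of the functional/timing split (again a hedged Python sketch with made-up instruction names and latencies, not the HASIM or FAST code), the functional model below executes instructions while the timing model only decides how many target cycles each one costs:

```python
# Sketch of a split functional/timing simulator (illustrative, not HASIM/FAST code).

class FunctionalModel:
    """Executes the ISA correctly; knows nothing about timing."""
    def __init__(self):
        self.regs = [0] * 32
        self.pc = 0

    def execute(self, inst):
        op, rd, rs1, rs2 = inst
        if op == "add":
            self.regs[rd] = self.regs[rs1] + self.regs[rs2]
        elif op == "mul":
            self.regs[rd] = self.regs[rs1] * self.regs[rs2]
        self.pc += 4
        return op                        # tell the timing model what happened

class TimingModel:
    """Models pipeline timing; never executes code, only charges cycles."""
    LATENCY = {"add": 1, "mul": 3}       # made-up latencies for illustration

    def __init__(self):
        self.cycles = 0

    def account(self, op):
        self.cycles += self.LATENCY.get(op, 1)

# Swapping in a different TimingModel changes the architecture under study
# without touching the (once-per-ISA) functional model.
func, timing = FunctionalModel(), TimingModel()
program = [("add", 1, 0, 0), ("mul", 2, 1, 1), ("add", 3, 2, 1)]
for inst in program:
    timing.account(func.execute(inst))
print(func.regs[1:4], timing.cycles)     # [0, 0, 0] 5 (registers start at zero)
```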
Multithreaded Functional & Timing Models (RAMP Gold: Tan, Gibeling, Asanovic, UCB)
- An MT-Unit multiplexes multiple target units on a single host engine (diagram: a functional model pipeline with architectural state and a timing model pipeline with timing state, connected by MT-Channels)
- An MT-Channel multiplexes multiple target channels over a single host link

Schedule
- 9:00-9:45 Welcome/Overview
- 9:45-10:15 RAMP Blue Overview & Demo
- 10:15-10:45 Break
- 10:45-12:30 RAMP White Live Demo; BEE3 Rollout (MSR/BEEcube/Q&A)
- 12:30-13:30 Lunch
- 13:30-15:00 ATLAS Transactional Memory (RAMP Red)
- 15:00-15:15 Break
- 15:15-16:45 CMU Simics/RAMP Cache Study
- 16:45 Wrapup

RAMP Blue Release 2/25/2008
- Design available from the RAMP website: ramp.eecs.berkeley.edu

RAMP White (Hari Angepat, Derek Chiou, UT Austin)
- Scalable coherent shared-memory multiprocessor
- Supports standard shared-memory programming models
- Block diagram: Leon3 cores (master/slave, interrupt, and debug interfaces) with Leon3 shims on AHB buses, plus AHB shims, an interrupt controller, DSU, Ethernet, intersection units, routers, NIUs, and DDR2 memory

CMU Simics/RAMP Simulator
- Simulates a 16-CPU shared-memory UltraSPARC III server (SunFire 3800) on the BEE2 platform
- Diagram: the target's CPUs, MMU, memory, and I/O (graphics, PCI, DMA, NIC, terminal, SCSI); a Xilinx XCV2P70 with PowerPC and DDR2 memory runs an interleaved pipeline holding 16 CPU contexts, while Simics on a PC provides the simulated I/O devices

RAMP Home Page/Repository
- ramp.eecs.berkeley.edu
- Remotely accessible Subversion repository

Thank You! Questions?