Hardware-based CIL-machine
Nizhniy Novgorod State University, Russia
Laboratory of Physical Fundamentals and Technologies of Wireless Communications
Reporter: Maxim Shuralev
Head of the project: Dr. Alexey Umnov

Hardware CIL processor project team
Hardware: Maxim Shuralev, Maxim Sokolov, Dmitry Mordvinov (NNSU, Wireless Lab)
Software, workloads and tools: Andrey Eltsov (NNSU, Wireless Lab), Roman Mitin, Sergey Lyalin, Sergey Galkin, Ilia Golubev (NNSU, IT Lab)
Support: Dmitry Golovachev, Svetlana Surova, Elena Pankratova (NNSU, Wireless Lab)
Consultants: Aliaksei Chapyzhenka (Intel), Dmitry Ragozin (Intel), Sergey Chernyshov (Nizhniy Novgorod State Technology University)
Head of Wireless Lab: Alexey Umnov
Slide 2

Agenda
• Introduction
• Architecture of the CIL processor
• Description of the DSP core
• Description of the CIL core
• Speed-up features of the CIL core: a metainformation cache, a hardware stack, a hardware type control engine
• Garbage collector implementation
• Example of DSP workload for the processor
• Development board for processor implementation
• HW implementation results
• Software support & libraries
• Conclusion and comparison
Slide 3

Introduction
Port of the .NET engine to an energy-efficient, low-power mobile platform.
Advantages and disadvantages of a stack-based CIL engine:
• the maximum execution speed cannot exceed one CIL instruction per clock
• a stack engine is the simplest way to execute machine code, because instruction decoding and the processor structure are very simple
• limited ability for parallel instruction execution
• low complexity and low power consumption
Slide 4

Introduction: application target and target market
.NET is intended for various Web-oriented services, distributed business databases, online transactions, CRM system support, etc.
The CIL processor is not supposed to compete with desktop processors and PDAs on performance, but it is well suited to the mobile market and the digital home!
The target is end-user specialized and oriented toward: MOBILE DEVICES, Web terminals, Web browsers, interactive TV, HOUSE CONTROL SYSTEMS
Slide 5

Introduction: requirements for the CIL processor
• Execute .NET (CIL) code directly: .NET is the native code
• Consume little power from the power supply: mobile low-power devices
• Handle DSP tasks effectively: new generation of interactive multimedia mobile devices
Slide 6

Architecture: High-level structure of the CIL processor implementation
[Diagram: programmer's model. Layers shown: CIL application; DSP library (codecs, protocols, software defined radio, modems, multimedia processing libraries); DSP kernel; standard CIL class libraries, custom CIL class libraries, other performance libraries; CIL instruction decoder; CIL metainformation support; hardware CIL processor]
Slide 7

Architecture: High-level hardware structure of the CIL processor
[Diagram: hardware structure. Blocks shown: CIL meta-information caches; CIL instruction decoder; native DSP set instruction decoder; main data bus (X); secondary data bus (Y); main arithmetic unit; X data bus address generation unit; Y data bus address generation unit; X-space address bus; Y-space address bus; instruction and data cache unit; system control unit]
Slide 8

Architecture: Why DSP-based?
Is it a waste of development time or a necessity for the digital home?
The CIL processor is an excellent solution for the digital home.
Pro:
• We have a firmware layer for executing very complex CIL instructions
• 5-10x higher performance in multimedia applications
Contra:
• Increased development time
• We need to implement only the “standard” CIL set, not DSP
Slide 9

Architecture: Why DSP-based?
Hardware implementation
Pro:
• Effective and low-power computational kernel
• Good mapping “CIL instruction -> DSP instruction”
• Low power consumption in multimedia tasks
• Technology similar to the existing and efficient ARM/Java Jazelle
Contra:
• Only serial instruction execution (because the CIL instruction set is stack-based and we do not want to use superscalar techniques)
Slide 10

Architecture: Why DSP-based?
• “2-in-1”: 2 native instruction sets on board
• Complex CIL instructions (e.g. type hierarchy checks and safety checks) are simply implemented in firmware as DSP instructions
• 5x-10x speed improvement for DSP workloads
• Low overhead in terms of extra transistors on chip
Slide 11

Description of DSP core units
[Block diagram of the DSP core. Blocks shown: data memory bus; program memory bus; data memory vector input register; program memory vector input register; shifters; two 16*16 multipliers; temporary product registers 1 and 2; ALU adder; special functional unit; comparator; saturation units; accumulator register file A0 ... An (A0 is the top of stack in CIL mode, A1 is the under-top-of-stack in CIL mode); stack control; AGU-1 and AGU-2 with pointer register files, index register files (2-4-8 registers), start and end of circular buffer pointer registers (2-4 registers), immediate value or standard increment/decrement value inputs, and ALUs driving the address buses; cross-bar switch units; MUX/DEMUX blocks]
Slide 12

Description of CIL core
In CIL execution mode, the programmer has an exact implementation of the ECMA-335 standard CIL engine.
[Pipeline diagram. Labels: stage 1; stage 2 and other stages; stage 3; last stage; instruction fetch unit; CIL instruction buffer; DSP instruction decoder; CIL decoder; CIL instructions pipeline; metainformation cache memory; execution unit; register file; type registers; type generation and check unit; typed stack; program memory (PM); data memory (DM)]
Slide 13

Speed-up features of CIL core: Metainformation cache
[Diagram labels: index / key; RAM fetch, using the index LSBs as the address in RAM; comparator (Yes/No); selected block; information access; MUX with inputs from other cache lines; information table in main RAM]
• Constant table
• String table
• Method table
• Class field table
• Type table
• Smart array table
Slide 14

Speed-up features of CIL core: Hardware typed stack
[Diagram labels: stack memory; memory cell tag; register file with type tags; metainformation cache table; instruction; operand type checking unit; exception; stack pointer; type setting unit; immediate type tag; new type]
Slide 15

Garbage collector
Automatic memory management
• Division of objects into “big” and “small”
• A generational garbage collector with two generations for “small” objects
• A separate area of memory for “big” objects
[Diagram labels: large heap; generation 1; generation 0]
A special coprocessor based on a reduced DSP kernel may be used for processing garbage collector tasks.
Slide 16

Example of DSP workload
Our CIL processor is an excellent target for multimedia applications
Slide 17

Development board
495 USD only
• Virtex-4 FPGA chip
• 64 MBytes DDR SDRAM
• 100 MHz clock oscillator
• Expansion bus with up to 32 I/O lines
• Stereo AC97 audio codec
• RS-232 serial port
• LCD display for debugging messages
• VGA output (50 MHz 24-bit video DAC)
• PS/2 mouse and PS/2 keyboard connectors
• System ACE™ configuration controller, access to external flash cards
• 10/100/1000 Mbit Ethernet transceiver for networking
• USB interface chip
• Xilinx XC95144XL CPLD for FPGA configuration
• Xilinx XCF32P Platform Flash configuration
• JTAG configuration port for design loading or remote debugging from PC
Slide 18

Development board: Testing process for processor cores
• The C++ model is a full-scale analog of the Verilog HDL model
• The C++ model is considered the reference model
Slide 19

Implementation results
Unit | Spartan-3: Slices | Slice FlipFlops | 4-input LUTs | Maximum frequency, MHz | Virtex-4: Slices | Slice FlipFlops | 4-input LUTs | Maximum frequency, MHz
AGU-1 | 331 | 220 | 548 | N/A | 300 | 200 | 560 | 228
AGU-2 | 385 | 320 | 543 | N/A | 300 | 200 | 560 | 228
ALU | 4368 | 587 | 7917 | N/A | 4216 | 593 | 8056 | 55.4
Decoder | 1227 | 60 | 2139 | N/A | 1319 | 40 | 2303 | 971
DSP | 5365 | 628 | 9508 | 46.9 | 4981 | 628 | 9191 | 77.8
The ALU consumes most of the FPGA resources.
The DSP core uses only a small part of the Virtex-4 LX25, and the CIL processor implementation takes only up to 5500 cells (~35 %) of our Virtex-4 FPGA (without optimizations).
Slide 20

Implementation results
[Figures: main ALU unit structure; Bit Manipulation Unit (a part of the ALU unit); whole DSP kernel]
Slide 21

Implementation results
• Y-memory .NET RAM unit
• DSP core X-bus addressing units (including XAU registers)
• X-memory prefetch unit
• X-bus FPGA internal memory
• DSP core Y-bus addressing units (including YAU registers)
• Y-bus FPGA internal memory
• DSP core DAU units (accumulator registers, ALU, adder unit and BMU)
• CIL pre-decoder unit
• Meta-information cache memory
• Stack block transfer controller and address generator
• Fast internal stack memory (X-Bus)
• 1-stage DSP decoder
• CIL prefetch X-memory unit
• External 16-bit video memory
• Fast internal stack tag memory (X-Bus)
• Firmware: exception & interrupt handlers
• Pipeline starter
• Interrupt controller
• Stack control decoder
• Exception decoder
• Exception mapper
• DSP DAU signal mapper
• DSP AGU signal mapper
• .NET complex instruction pipeline
• Type operation decoder
• Prefetch operation decoder
• Type check decoder
• Pipeline table memory ROM
• .NET instruction decoder
• Pipeline signal mapper
• Metainformation cache access controller
• Type setting unit
• Type checking unit
• Meta-information exception generator
• Metainformation cache access controller
• Metainformation cache blocks
• Pipeline automaton
• Meta-information cache to meta-information memory transfer controller
Moderate-detail-level structure of the implemented CIL processor
Slide 22

Software support
• Exception microcode: complex CIL instruction implementation in DSP code
• Class library: may be ported from PC
• Supporting system libraries: I/O, memory management
• Multimedia libraries: for the DSP core
• User applications
• Just-in-time compiler for CIL code, if necessary
• Compiler: we are using a retargeted GCC version
• Assembler / disassembler: retargetable utilities, used with the compiler; they are specially tuned for the CIL core
• Linker
• Hardware and software codesign suite (compiler, assembler, disassembler, Verilog instruction decoder generator)
Slide 23

Conclusion & comparison
Comparison with an ARM-based software .NET engine for embedded systems (www.dotnetcpu.com)
Hardware-based CIL-machine | ARM-based .NET execution engine
80-100 MHz FPGA implementation | 27 MHz
1-2 CIL operations per cycle (40-50 million CIL operations per second) | 450,000 CIL operations per second
hardware execution for basic CIL operations, hardware-assisted stack implementation | interpreted CIL operations execution
50x faster than interpreted execution | 50x slower than hardware execution of basic operations
hardware type control | software type control
garbage collector may be implemented as a hardware coprocessor or “intelligent” memory | software garbage collector
hardware meta-information cache | software meta-information processing
DSP core with two memory spaces | ARM core
2 Multiply-Accumulate instructions and 2 ALU operations per cycle = up to 4 instructions per cycle | 1 ALU operation per cycle
DSP core power consumption is 3-4x less than ARM core | ARM core power consumption is 3-4x more than DSP core
Slide 24

Conclusion & comparison
1. The CIL processor is not only a software concept: it may be successfully implemented in hardware
2. Our dual architecture, a CIL processor based on a DSP core, enables multimedia applications with low power consumption, so the CIL processor may be successfully used for the digital home and digital entertainment
3. CIL typed engines are implemented in hardware, which greatly reduces the run-time overhead of type checking
4. The hardware CIL implementation greatly outperforms non-optimized software implementations (in performance and power consumption)
Slide 25

Project participants
Slide 26

We express our gratitude to:
• Microsoft Corporation, for the grant that allowed us to bring together people from different faculties of Nizhny Novgorod State University into one team and develop our hardware solution
• Laboratory of Physical Foundations and Technologies of Wireless Communications, Nizhny Novgorod State University, which is supported by Intel Corporation, for help during our research activities
• Special thanks to Aliaksei Chapyzhenka, Intel Corp., for spending his time advising us on hardware architectures
Slide 27
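
Illustration: hardware typed stack (Slide 15)
The typed stack on Slide 15 pairs every stack memory cell with a type tag, sets the tag when a value is written, and raises an exception when an instruction's operand tags do not match. The Python sketch below is only a software illustration of that idea under stated assumptions, not the project's hardware design. The tag names (INT32, FLOAT64, OBJREF), the class names and the rule "both operands of add must carry the same numeric tag" are hypothetical simplifications; the slides do not list the real tag set, and ECMA-335 permits more operand combinations than this.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    """Hypothetical type tags; the slides do not list the real tag set."""
    INT32 = "int32"
    FLOAT64 = "float64"
    OBJREF = "objref"


@dataclass(frozen=True)
class Cell:
    """One stack memory cell: a value stored together with its type tag."""
    tag: Tag
    value: object


class TypeCheckError(Exception):
    """Stands in for the exception raised by the operand type checking unit."""


class TypedStack:
    """Software model of a tagged evaluation stack (assumed behaviour)."""

    def __init__(self):
        # Stack memory; the list length plays the role of the stack pointer.
        self._cells = []

    def push(self, tag, value):
        # Type setting unit: every written cell receives a tag.
        self._cells.append(Cell(tag, value))

    def pop(self):
        if not self._cells:
            raise TypeCheckError("stack underflow")
        return self._cells.pop()

    def add(self):
        """Simplified model of the CIL `add` instruction with tag checking."""
        if len(self._cells) < 2:
            raise TypeCheckError("add needs two operands")
        left, right = self._cells[-2], self._cells[-1]
        # Operand type checking unit: tags must match and must be numeric.
        # The operands stay on the stack if the check fails.
        if left.tag != right.tag or left.tag is Tag.OBJREF:
            raise TypeCheckError(
                f"add: bad operand types {left.tag.value}, {right.tag.value}"
            )
        del self._cells[-2:]
        # The result inherits the operand tag (integer overflow is not modelled).
        self.push(left.tag, left.value + right.value)
```

For example, pushing two INT32 cells holding 2 and 3 and calling add() leaves a single INT32 cell holding 5, while pushing an INT32 and an OBJREF and calling add() raises TypeCheckError. In the hardware described on the slides this check is performed by dedicated units alongside execution, which is the source of the reduced type-checking overhead claimed in the conclusion.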