Hardware-based CIL-machine Nizhniy Novgorod State University, Russia Laboratory of Physical Fundamentals and Technologies of Wireless Communications reporter: Maxim Shuralev [email protected] Head of the project: Dr.


Hardware-based
CIL-machine
Nizhniy Novgorod State University, Russia
Laboratory of Physical Fundamentals and
Technologies of Wireless Communications
reporter: Maxim Shuralev
[email protected]
Head of the project: Dr. Alexey Umnov
[email protected]
Hardware CIL processor project team
Hardware:
Maxim Shuralev, Maxim Sokolov, Dmitry Mordvinov
(NNSU, Wireless Lab)
Software, workloads and tools:
Andrey Eltsov (NNSU, Wireless Lab),
Roman Mitin, Sergey Lyalin, Sergey Galkin,
Ilia Golubev (NNSU, IT Lab)
Support:
Dmitry Golovachev, Svetlana Surova, Elena Pankratova
(NNSU, Wireless Lab)
Consultants:
Aliaksei Chapyzhenka (Intel), Dmitry Ragozin (Intel),
Sergey Chernyshov (Nizhniy Novgorod State Technology University)
Head of Wireless Lab: Alexey Umnov
Slide 2
Agenda
Introduction
Architecture of the CIL processor
Description of the DSP core
Description of the CIL core
Speed up features of the CIL core
a metainformation cache
a hardware stack
a hardware type control engine
Garbage collector implementation
Example of DSP workload for the processor
Development board for processor implementation
HW Implementation results
Software support & libraries
Conclusion and comparison
Slide 3
Introduction
Port of the .NET engine to an energy-efficient, low-power mobile platform
Advantages and disadvantages of a stack-based CIL engine:
• the maximum execution speed cannot exceed one CIL instruction per clock
• a stack engine is the simplest way to execute machine code, because instruction decoding and the processor structure are very simple
• limited ability for parallel instruction execution
• low complexity and low power consumption
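The one-instruction-per-clock limit can be illustrated with a minimal software sketch of a stack engine. This is not the project's decoder: the opcode subset (`ldc.i4`, `add`, `mul`), the `run` function and the step counter are assumptions made for illustration, with one loop iteration standing in for one clock.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, Optional[int]]  # (opcode, optional immediate operand)


def run(program: List[Instr]) -> Tuple[int, int]:
    """Execute a tiny CIL-like program and return (result, steps)."""
    stack: List[int] = []
    steps = 0
    for op, arg in program:
        steps += 1  # strictly serial: one instruction retired per step
        if op == "ldc.i4":  # push a constant
            stack.append(arg)
        elif op == "add":  # pop two operands, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":  # pop two operands, push their product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return stack.pop(), steps


# (2 + 3) * 4 takes five instructions, hence five steps
program = [("ldc.i4", 2), ("ldc.i4", 3), ("add", None), ("ldc.i4", 4), ("mul", None)]
result, steps = run(program)
```

Every instruction depends on the top of the stack left by the previous one, which is why decoding stays simple and why parallel execution is hard.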
Slide 4
Introduction
application target and target market
.NET is intended for Web-oriented services, distributed business databases, online transactions, CRM system support, etc.
The CIL processor is not meant to compete with desktop processors and PDAs on performance, but it is great for the mobile market and the digital home!
The target is specialized, end-user-oriented devices:
MOBILE DEVICES,
Web terminals, Web browsers, interactive TV,
HOUSE CONTROL SYSTEMS
Slide 5
Introduction
requirements for the CIL processor
• Execute the .NET (CIL) code directly
.NET is native code
• Consume low power from power supply
Mobile low power devices
• Effectively handle DSP tasks
New generation of
interactive multimedia mobile devices
Slide 6
Architecture
High-level structure of the CIL processor implementation
Programmer's model
CIL application
DSP library: codecs,
protocols, software
defined radio, modems,
multimedia processing
libraries
DSP kernel
Standard CIL class libraries,
custom CIL class libraries, other
performance libraries
CIL instruction decoder
CIL metainformation support
Hardware CIL processor
Slide 7
Architecture
High-level hardware structure of the CIL processor
Hardware structure
CIL meta-information
caches
CIL instruction decoder
native DSP instruction set
decoder
Main data bus (X)
Secondary data bus (Y)
Main arithmetic unit
X data bus address
generation unit
Y data bus address
generation unit
X-space address bus
Y-space address bus
Slide 8
Instruction and data
cache unit, system
control unit
Architecture
Why DSP-based ?
Is it a waste of development time or a necessity for the digital home?
The CIL processor is an excellent solution for the digital home
Pro:
• We have a firmware layer for executing very complex CIL instructions
• 5-10 times higher performance in multimedia applications
Contra:
• Increased development time
• We need to implement only the “standard” CIL set, not DSP
Slide 9
Architecture
Why DSP-based ?
Hardware implementation
Pro:
Effective & low-power computational kernel
Good mapping “CIL instruction -> DSP instruction”
Low power consumption in multimedia tasks
Similar technology to existing and efficient ARM/Java Jazelle
Contra:
• Only serial instruction execution
(as we have CIL stack based
instruction set and do not want to use
superscalar techniques)
Slide 10
Architecture
Why DSP-based ?
• “2-in-1”: 2 native instruction sets on-board
• Complex CIL instructions (e.g. type hierarchy checks and
safety checks) are simply implemented in firmware as DSP
instructions
• 5x-10x speed improvement for DSP workloads
• Low overhead in terms of extra transistors on-chip
Slide 11
Description of DSP core units
Block diagram of the DSP core. Units shown:
• Data memory bus and program memory bus
• Data memory vector input register and program memory vector input register
• AGU-1 and AGU-2 (address generation units): pointer register files, index register files (2-4-8 registers), start and end of circular buffer pointer registers (2-4 registers each), immediate value or standard increment/decrement value inputs; outputs go to the address bus
• Two 16*16 multipliers with temporary product registers 1 and 2
• ALU, adder and special functional unit
• Accumulator register file A0 … An with stack control: A0 is the top of stack in CIL mode, A1 is the under-top-of-stack
• Shifters, comparator, saturation units, MUX/DEMUX and cross-bar switch units
Slide 12
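How the multipliers, accumulators, saturation units and circular buffer pointer registers cooperate can be sketched with a FIR filter step. This is only an illustration: the Q15 fixed-point format, the function names and the point at which saturation is applied are assumptions, not details taken from the slides.

```python
from typing import List, Tuple


def saturate16(x: int) -> int:
    """Clamp a value to the signed 16-bit range, as a saturation unit would."""
    return max(-32768, min(32767, x))


def fir_step(coeffs: List[int], delay: List[int], start: int, sample: int) -> Tuple[int, int]:
    """Compute one FIR output over a circular delay line.

    coeffs are assumed to be Q15 values; delay is the circular buffer;
    start is the write position (the role of a circular buffer pointer).
    Returns (output sample, next write position).
    """
    n = len(delay)
    delay[start] = sample
    acc = 0  # wide accumulator
    idx = start
    for c in coeffs:
        acc += c * delay[idx]  # 16x16 multiply, then accumulate
        idx = (idx - 1) % n  # pointer update with wrap-around
    return saturate16(acc >> 15), (start + 1) % n


# Two-tap averaging filter: both coefficients are 0.5 in Q15
coeffs = [16384, 16384]
delay = [0, 0]
y0, pos = fir_step(coeffs, delay, 0, 1000)
y1, pos = fir_step(coeffs, delay, pos, 3000)
```

In hardware the pointer wrap and the multiply-accumulate would happen in the same cycle; the software loop only shows the data flow.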
Description of CIL core
In CIL execution mode, the programmer sees
an exact implementation of the ECMA-335 standard CIL engine
stage 1
Instruction
fetch unit
CIL
instruction
buffer
stage 1
stage 2
DSP instruction
decoder
CIL
instructions
pipeline
Metainformation
cache memory
2 and other stages
Slide 13
stage 3
Execution unit
Register file
Type
registers
CIL decoder
Type generation
and check unit
Last stage
Program memory
(PM)
Typed stack
Data memory
(DM)
Speed up features of CIL core
Metainformation cache
Indexkey
index
RAM fetch, using
index LSBs as comparator
address in RAM
Selected block
Information access
Yes/No
• Constant table
• String table
• Method table
Slide 14
information
Table in main RAM
From other cache lines
MUX
• Class field table
• Type table
• Smart array table
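The lookup path on this slide (index LSBs address the cache RAM, a comparator answers Yes/No, a miss falls back to the table in main RAM) can be sketched as follows. A direct-mapped organization is assumed here for simplicity; the slide's "From other cache lines" MUX may indicate a different arrangement. The class name, the dictionary standing in for main RAM and the hit/miss counters are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class MetaCache:
    """Direct-mapped cache sketch for metainformation table entries."""

    def __init__(self, index_bits: int, main_table: Dict[int, str]) -> None:
        self.mask = (1 << index_bits) - 1
        # Each line holds (full index key, cached information) or None
        self.lines: List[Optional[Tuple[int, str]]] = [None] * (1 << index_bits)
        self.main_table = main_table  # stands in for the table in main RAM
        self.hits = 0
        self.misses = 0

    def lookup(self, key: int) -> str:
        slot = key & self.mask  # index LSBs used as the address in cache RAM
        line = self.lines[slot]
        if line is not None and line[0] == key:  # comparator says "Yes"
            self.hits += 1
            return line[1]
        self.misses += 1  # comparator says "No": fetch from main RAM
        info = self.main_table[key]
        self.lines[slot] = (key, info)
        return info


method_table = {0x10: "method A", 0x14: "method B", 0x20: "method C"}
cache = MetaCache(index_bits=4, main_table=method_table)
cache.lookup(0x10)  # miss, fills slot 0
cache.lookup(0x10)  # hit
cache.lookup(0x20)  # miss, same slot 0, evicts 0x10
```

The same structure would serve any of the listed tables (constant, string, method, class field, type, smart array), with a different backing table per cache.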
Speed up features of CIL core
Hardware typed stack
Stack memory
Memory
cell tag
Register file with
type tags
Metainformation
cache table
instruction
Operand type exception
checking unit
Stack
pointer
Type setting
unit
Immediate type tag
Slide 15
New
type
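The idea of a typed stack, where every cell carries a type tag, operands are checked before an operation and the result receives a new tag, can be sketched in software. The tag names, the `TypedStack` class and the rule that both operands of `add` must have equal tags are simplifying assumptions; the ECMA-335 rules for mixed operand types are richer than this.

```python
from typing import List, Tuple, Union

Value = Union[int, float]


class OperandTypeError(Exception):
    """Raised where the hardware would signal an operand type exception."""


class TypedStack:
    def __init__(self) -> None:
        self.cells: List[Tuple[str, Value]] = []  # (type tag, value) per cell

    def push(self, tag: str, value: Value) -> None:
        self.cells.append((tag, value))

    def pop(self) -> Tuple[str, Value]:
        return self.cells.pop()

    def add(self) -> None:
        tag_b, b = self.pop()
        tag_a, a = self.pop()
        if tag_a != tag_b:  # role of the operand type checking unit
            raise OperandTypeError(f"cannot add {tag_a} and {tag_b}")
        self.push(tag_a, a + b)  # role of the type setting unit: tag the result


stack = TypedStack()
stack.push("int32", 2)
stack.push("int32", 3)
stack.add()  # top of stack is now ("int32", 5)
```

Doing the tag comparison in parallel with the arithmetic is what removes the type-checking cost from the instruction stream.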
Garbage collector
Automatic memory management
• Division of objects into “big” and “small”
• The generational garbage collector with two generations for
“small” objects
• Separate area of memory for “big” objects
large heap
generation 1
generation 0
Special coprocessor, based on reduced DSP
kernel may be used for processing garbage
collector tasks
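The memory layout described above can be sketched as an allocation and promotion policy. The slides give neither the size threshold nor the promotion rule, so the `BIG_OBJECT_THRESHOLD` value, the `Heap` class and the rule "gen-0 survivors move to generation 1" are assumptions for illustration only.

```python
from typing import Dict, List, Set

BIG_OBJECT_THRESHOLD = 1024  # bytes; hypothetical cut-off between "small" and "big"


class Heap:
    def __init__(self) -> None:
        self.gen0: List[Dict] = []  # young "small" objects
        self.gen1: List[Dict] = []  # "small" objects that survived a collection
        self.large: List[Dict] = []  # separate area for "big" objects

    def allocate(self, name: str, size: int) -> Dict:
        obj = {"name": name, "size": size}
        if size >= BIG_OBJECT_THRESHOLD:
            self.large.append(obj)  # big objects bypass the generations
        else:
            self.gen0.append(obj)
        return obj

    def collect_gen0(self, live: Set[str]) -> int:
        """Promote live gen-0 objects to gen 1 and free the rest.

        Returns the number of objects freed.
        """
        survivors = [o for o in self.gen0 if o["name"] in live]
        freed = len(self.gen0) - len(survivors)
        self.gen1.extend(survivors)
        self.gen0 = []
        return freed


heap = Heap()
heap.allocate("a", 16)
heap.allocate("b", 32)
heap.allocate("c", 4096)  # goes to the large heap
heap.collect_gen0(live={"a"})  # "a" is promoted, "b" is freed
```

A real collector would find live objects by tracing references; the `live` set here simply stands in for that result.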
Slide 16
Example of DSP workload
Our CIL processor is an
excellent target for
multimedia applications
Slide 17
Development board
495 USD only
Slide 18
Virtex-4 FPGA chip
64 MBytes DDR SDRAM
100 MHz clock oscillator
Expansion bus up to 32 I/O lines
Stereo AC97 audio codec
RS-232 serial port
LCD display for debugging messages
VGA output (50 MHz 24-bit video DAC)
PS/2 mouse and PS/2 keyboard connectors
System ACE™ configuration controller
Access to external flash cards
10/100/1000 Mbit Ethernet transceiver for networking
USB interface chip
Xilinx XC95144XL CPLD for FPGA configuration
Xilinx XCF32P Platform Flash configuration
JTAG configuration port for design loading or remote debugging from PC
Development board
Testing process for processor cores
The C++ model is a full-scale counterpart of the Verilog HDL model
The C++ model is treated as the reference model
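One common way to use a reference model is to compare its output trace with the trace from the HDL simulation, cycle by cycle. The slides do not describe the team's actual test flow, so the sketch below is an assumption: the traces are plain lists of values and `first_mismatch` is a hypothetical helper.

```python
from typing import List, Optional


def first_mismatch(reference_trace: List[int], dut_trace: List[int]) -> Optional[int]:
    """Return the first cycle where the traces differ, or None if they match.

    reference_trace: values produced by the reference model
    dut_trace: values produced by the design under test
    """
    for cycle, (ref, dut) in enumerate(zip(reference_trace, dut_trace)):
        if ref != dut:
            return cycle
    if len(reference_trace) != len(dut_trace):
        # One trace ended early: report the first cycle missing from it
        return min(len(reference_trace), len(dut_trace))
    return None


# The second trace diverges at cycle 1
divergence = first_mismatch([1, 2, 3], [1, 9, 3])
```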
Slide 19
Implementation results
Device: Spartan-3
Unit      Slices   Slice FlipFlops   4-input LUTs   Maximum frequency, MHz
AGU-1     331      220               548            N/A
AGU-2     385      320               543            N/A
ALU       4368     587               7917           N/A
Decoder   1227     60                2139           N/A
DSP       5365     628               9508           46.9
Device: Virtex-4
Unit      Slices   Slice FlipFlops   4-input LUTs   Maximum frequency, MHz
AGU-1     300      200               560            228
AGU-2     300      200               560            228
ALU       4216     593               8056           55.4
Decoder   1319     40                2303           971
DSP       4981     628               9191           77.8
The ALU consumes most of the FPGA resources
The DSP core uses only a small part of Virtex-4 LX25,
and the CIL processor implementation takes only up to 5500
cells (~35 %) of our Virtex-4 FPGA (without optimizations)
Slide 20
Implementation results
main ALU unit structure
Bit Manipulation Unit
(a part of the ALU unit)
whole DSP kernel
Slide 21
Implementation results
Moderate detail-level structure of the implemented CIL processor. Blocks shown:
Y-memory .NET RAM unit
DSP core X-bus addressing units (including XAU registers)
X-memory prefetch unit
X-bus FPGA internal memory
DSP core Y-bus addressing units (including YAU registers)
Y-bus FPGA internal memory
DSP core DAU units (accumulator registers, ALU, adder unit and BMU)
CIL pre-decoder unit
Meta-information cache memory
Stack block transfer controller and address generator
Fast internal stack memory (X-Bus)
Fast internal stack tag memory (X-Bus)
1-stage DSP decoder
CIL prefetch X-memory unit
External 16-bit video memory
Firmware: exception & interrupt handlers
Pipeline starter
Interrupt controller
Stack control decoder
Exception decoder
Exception mapper
DSP DAU signal mapper
DSP AGU signal mapper
.NET complex instruction pipeline
Type operation decoder
Prefetch operation decoder
Type check decoder
Pipeline table memory ROM
.NET instruction decoder
Pipeline signal mapper
Pipeline automaton
Metainformation cache access controller
Metainformation cache blocks
Meta-information cache - meta-information memory transfer controller
Type setting unit
Type checking unit
Meta-information exception generator
Slide 22
Software support
• Exception microcode: complex CIL instructions implemented in DSP code
• Class library, which may be ported from the PC
• Supporting system libraries: I/O, memory management
• Multimedia libraries for the DSP core
• User applications
• Just-in-time compiler for CIL code, if necessary
• Compiler: we are using a retargeted GCC version
• Assembler / disassembler: retargetable utilities used with the compiler; they are specially tuned for the CIL core
• Linker
• Hardware and software codesign suite (compiler, assembler, disassembler, Verilog instruction decoder generator)
Slide 23
Conclusion & comparison
Comparison with ARM-based software .NET engine for
embedded systems (www.dotnetcpu.com)
Hardware-based CIL-machine | ARM-based .NET execution engine
80-100 MHz FPGA implementation | 27 MHz
1-2 CIL operations per cycle (40-50 million CIL operations per second) | 450,000 CIL operations per second
hardware execution of basic CIL operations, hardware-assisted stack implementation | interpreted execution of CIL operations
50x faster than interpreted execution | 50x slower than hardware execution of basic operations
hardware type control | software type control
garbage collector may be implemented as a hardware coprocessor or “intelligent” memory | software garbage collector
meta-information cache hardware | software meta-information processing
DSP core with two memory spaces | ARM core
2 multiply-accumulate instructions and 2 ALU operations per cycle = up to 4 instructions per cycle | 1 ALU operation per cycle
DSP core power consumption is 3-4x less than that of the ARM core | ARM core power consumption is 3-4x more than that of the DSP core
Slide 24
Conclusion & comparison
1. The CIL processor is not only a software concept: it can be successfully implemented in hardware
2. Our dual architecture, a CIL processor based on a DSP core, enables multimedia applications with low power consumption, so the CIL processor can be used successfully for digital home and digital entertainment
3. CIL typed engines are implemented in hardware, which greatly reduces the run-time overhead of type checking
4. The hardware CIL implementation greatly outperforms non-optimized software implementations (in both performance and power consumption)
Slide 25
Project participants
Slide 26
We express our gratitude to:
Microsoft Corporation, for the grant that allowed us to bring people from different faculties of Nizhny Novgorod State University together into one team and develop our hardware solution
The Laboratory of Physical Foundations and Technologies of Wireless Communications, Nizhny Novgorod State University, which is supported by Intel Corporation, for help during our research activities
Special thanks to Aliaksei Chapyzhenka, Intel Corp., for spending his time advising us on hardware architectures
Slide 27
Slide 28