DRM BSP Implementation Flow

Andrea Michelotti - Atmel
Toolchain Overview & Demo
hArtes
2010 Final Review
Toolchain Overview & Demo
1/20
Andrea Michelotti, Atmel
WP2 Leader
• Key Achievements
• Developing an Application for a Heterogeneous Platform
– Initialization
– Sharing resources
– Calling a DSP function
– Targeting Execution
– Expressing parallelism
• Conclusions
Agenda
2/20
The hArtes toolchain…
• dramatically shortens the learning curve
• hides heterogeneous complexity: no expert knowledge of the platform, tools or new programming languages is needed; use ‘C’ or tools that produce ‘C’ (Scilab/Nutech)
• automatically speeds up the application (with respect to GPP-only execution) by exploiting Processing Element (PE) capabilities
Key Achievements (1)
3/20
• Easily retarget new platforms:
– basic OMAP porting took ~1 month
– other popular platforms like GPUs can be targeted as well
– the platform must conform to the Molen machine
• Quickly evaluate how an application behaves on different architectures
– important for time to market
Key Achievements (2)
4/20
Heterogeneous Platforms:
• Atmel DEB (Diopsis Evaluation Board): ARM9 GPP + mAgicV DSP
• UniFE’s hArtes HW Platform: ARM9 GPP + mAgicV DSP + Xilinx FPGA
• TI Experimenter OMAPL138: ARM9 GPP + C67 TI DSP
• Scaleo’s hArtes Emulation Platform: ARM9 GPP + mAgicV DSP + Altera FPGA
Targeted Platforms/Architectures
5/20
Before hArtes, toolchains for heterogeneous hardware (Diopsis and OMAP) consisted of separate subchains: one for the GPP and one for the DSP.
MAIN PROBLEMS:
• High learning curve: target tools, architecture, APIs…
• Without knowledge of the underlying software/hardware, it is difficult to use the PEs and to benefit from them;
• Code maintainability: two distinct projects must be kept aligned;
• Code portability: GPP code usually contains specific APIs to load, execute and access PE resources; identical C code cannot produce correct results across all platforms;
• Debugging: no unified image, no unified debugger.
Developing an Application for a
Heterogeneous Platform in a nutshell
6/20
Suppose we want to port our legacy code to a powerful but heterogeneous architecture like OMAP or DIOPSIS…
C code running on my host PC
int main(void){
    …
    unsigned *shared_array=malloc(SIZE);
    …
    my_fft(param,shared_array…);
    another_kernel(…);
    …
}
It seems easy but, after a while, it becomes a nightmare! …
Writing an application for a
heterogeneous platform in a nutshell
7/20
Initialization
GPP code
int main(int argc,char**argv){
    …
    if (DSP_SUCCEEDED (status)) {
        status = PROC_setup (NULL) ;
    }
    if (DSP_SUCCEEDED (status)) {
        status = PROC_attach (processorId, NULL) ;
        if (DSP_FAILED (status)) {
            RDWR_1Print ("PROC_attach failed. Status: [0x%x]\n", status) ;
        }
    }
    else {
        RDWR_1Print ("PROC_setup failed. Status: [0x%x]\n", status) ;
    }
    if (DSP_SUCCEEDED (status)) {
        args [0] = strNumIterations ;
        {
            status = PROC_load (processorId, dspExecutable_myfft, NUM_ARGS,
                                args) ;
        }
        if (DSP_FAILED (status)) {
            RDWR_1Print ("PROC_load failed. Status: [0x%x]\n", status) ;
        }
    }
    if (DSP_SUCCEEDED (status)) {
        status = PROC_start (processorId) ;
        if (DSP_FAILED (status)) {
            RDWR_1Print ("PROC_start failed. Status: [0x%x]\n", status) ;
        }
    }
Use of a specific API to access the DSP. The DSP image is loaded as a file. The DSP has its own main.
DSP code
void myfft();
int main(int argc,char**argv){
    myfft();
}
OMAP toolchain (through dsplink)
8/20
Initialization
GPP code
int main (int argc, char *argv[]){
    ....
    ret = mAgicV_load_PM("myfft.bin",_m_fd_extm);
    if(ret!=0) return ret;
    ret = mAgicV_load_DM("myfft_datamem.bin");
    if(ret!=0) return ret;
    ret = mAgicV_load_XM("myfft_extmem.bin",(0x365890));
    if(ret!=0) return ret;
    ret = mAgicV_init_PMU();
    if(ret!=0) return ret;
    magicV_start();
    ..
    magicV_wait();
    ....
}
Very low-level API to access the DSP. The DSP image is loaded as a binary file. The DSP has its own main.
DSP code
void myfft();
int main(int argc,char**argv){
    …
    myfft();
    …
}
Diopsis toolchain case
9/20
Initialization
GPP/Application code
int main (int argc, char *argv[]){
....
}
The hArtes Toolchain and hArtes Runtime take care of hiding all the initialization details.
JUST one code.
hArtes toolchain case
10/20
GPP code
OMAP toolchain case
int main(int argc,char**argv){
    SMAPOOL_Attrs poolAttrs ;
    poolAttrs.bufSizes      = (Uint32 *) &size ;
    poolAttrs.numBuffers    = (Uint32 *) &numBufs ;
    poolAttrs.numBufPools   = NUM_BUF_SIZES ;
    poolAttrs.exactMatchReq = TRUE ;
    volatile unsigned* my_shared_array;
    status = POOL_open (POOL_makePoolId(processorId, SAMPLE_POOL_ID),
                        &poolAttrs) ;
    if (DSP_FAILED (status)) {
        MPCSXFER_1Print ("POOL_open () failed. Status = [0x%x]\n",
                         status) ;
    }
    if (DSP_SUCCEEDED (status)) {
        status = POOL_alloc (POOL_makePoolId(processorId, SAMPLE_POOL_ID),
                             &my_shared_array,
                             SIZE,
                             DSPLINK_BUF_ALIGN) ;
        /* Get the translated DSP address to be sent to the DSP. */
        if (DSP_SUCCEEDED (status)) {
            status = POOL_translateAddr (
                         POOL_makePoolId(processorId,
                                         SAMPLE_POOL_ID),
                         &dspCtrlBuf,
                         AddrType_Dsp,
                         (Void *) &my_shared_array_from_dsp,
                         AddrType_Usr) ;
        }
    }
    …
}
Use of a specific API to create a shared area, translate the address for the DSP, and then pass the translated address to the DSP via messages.
Memory layout must be configured by recompiling drivers!!
Sharing resources
11/20
DSP code
int main(int argc,char**argv){
DSPlink_init();
….
}
GPP code
Diopsis toolchain case
#define MY_SHARED_ARRAY_ADDR 2
int main(int argc,char**argv){
    ....
    unsigned local_my_shared_array[SIZE];
    mAgicV_read_buff(local_my_shared_array,MY_SHARED_ARRAY_ADDR,sizeof(local_my_shared_array));
    ....
    // modify local copy, write back
    mAgicV_write_buff(local_my_shared_array,MY_SHARED_ARRAY_ADDR,sizeof(local_my_shared_array));
    …
}
Very raw access: addresses are shared directly and mapped manually, using a local copy and write-back.
Use of specific APIs and specific compiler directives.
NO LINKER used: many problems with source alignment and debugging.
DSP code
#define MY_SHARED_ARRAY_ADDR 2
volatile long chess_storage(DATA:MY_SHARED_ARRAY_ADDR) my_shared_array;
volatile long chess_storage(DATA:MY_SHARED_ARRAY_ADDR+SIZEOF(my_shared_array)) my_other_variable;
int main(int argc,char**argv){
    // access the variable
}
Sharing resources
12/20
Sharing resources
GPP code
int main(int argc,char**argv){
    ....
    unsigned* my_shared_array;
    my_shared_array = malloc(MYSIZE);
    #pragma map call_hw dsp0
    dsp_func(my_shared_array);
    ....
}
Intuitive and portable. Very natural access.
The DSE automatically turns malloc into hmalloc (hArtes API), which allocates and tracks memory in a shared physical space of the target platform.
hArtes toolchain case
13/20
Calling a DSP routine
GPP code (OMAP)
int main(int argc,char**argv){
    … // initialization, see main
    if (DSP_SUCCEEDED (status)) {
        status = PROC_start (processorId) ;
        if (DSP_FAILED (status)) {
            RDWR_1Print ("PROC_start failed. Status: [0x%x]\n", status) ;
        }
    }
    …
}
GPP code (Diopsis)
int main(int argc,char**argv){
    … // initialization, see main
    mAgicV_start();
    …
}
DSP code
void my_fft(int pp, float*…);
int main(int argc,char**argv){
    … // initialization, see main
    my_fft();
    …
}
There is no concept of a DSP call from the GPP: the GPP can start a DSP process that executes the desired function.
The call can be inefficient, or may not execute correctly on the DSP. The programmer must know the underlying architecture!
For example, on the DSP in the DIOPSIS architecture the type int is 16 bits wide. Typically it is 32 bits.
OMAP/Diopsis toolchain cases
14/20
Calling a DSP routine
GPP code
void my_fft(…){
…
}
int main(int argc,char**argv){
..
#pragma call_hw dsp 0
my_fft();
..
}
Intuitive and portable.
The DSE checks whether the my_fft function can be executed on the target DSP (it checks the parameters and the stack memory used). It also estimates the cost of the call to decide whether it is convenient to move the execution to the DSP.
To call the DSP function, the DSE adds a pragma to the function call and generates a C source that can be compiled by the DSP toolchain.
hArtes toolchain case
15/20
Expressing Parallelism
NOT KNOWN/NOT IMPLEMENTED
OMAP/Diopsis toolchain
16/20
Expressing Parallelism
int main(void){
    …
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma call_hw dsp 0
            my_fft();
        }
        #pragma omp section
        {
            another_kernel(…);
        }
    }
}
Intuitive and portable.
hArtes supports some OpenMP constructs to express parallelism.
In some cases the DSE automatically detects kernels that can run in parallel and adds OpenMP annotations to the C source.
Parallelism can also be made explicit via POSIX threads.
#pragma call_hw dsp 0
void my_fft(…);
int main(int argc,char**argv){
    … // initialization, see
    hthread_create(my_fft()…);
    another_kernel();
    hthread_join();
}
hArtes toolchain case
17/20
Target Execution
Two separate binaries, no common symbols, no unified debugger; I/O messages (printf) often rely on a JTAG connection.
RUN and HOPE!
(under Linux)
OMAP/Diopsis toolchain
18/20
bash$ ./my_fft_arm.elf <my_fft_dsp.bin>
Target Execution
Single ELF binary, common symbols,
unified debugger, I/O messages (printf)
on the target.
RUN!
(under Linux)
hArtes toolchain
19/20
bash$ ./my_fft.elf
• Although the original “Brain to Bit” (B2B) objective was very ambitious, the hArtes toolchain fulfilled its original promise: to support software development for heterogeneous hardware without expert knowledge of the target platform, allowing developers to achieve high-performance applications through fully automated solutions that abstract low-level hardware details.
• Areas for improvement mainly concern data flow analysis, automatic parallelization, AET integration in Eclipse, and debugging capabilities.
Conclusions
20/20