Transcript pptx
SLAM Accelerated
Using Hardware to Improve SLAM Algorithm Performance
Project Overview
Team Members
Roy Lycke
Ji Li
Ryan Hamor
Take an existing SLAM algorithm and implement it on a PC
Analyze the performance of the algorithm to determine which kernels to
accelerate in HW
Implement the SLAM algorithm on the PowerPC with the previously identified
kernels in HW
RH
What is SLAM?
SLAM stands for Simultaneous Localization and Mapping
Predict pose using previous and current data
Types of pose sensors
Wheel Encoders
GPS
Detect landmarks and correlate them to the robot using the predicted pose.
Types of Observation Sensors
Sonar
Infrared
Laser Scanners
Video
RH
Current State of SLAM
Algorithms
SLAM algorithms fall into two main categories
Extended Kalman Filter
Large Covariance Matrix to Process
Particle Filter
Each Particle contains pose estimate and map
RH
Particle Filter Algorithm
RH
What We Have Decided to Do
Started with existing SLAM implementation
ratbot-slam developed by Kris Beevers
ratbot-slam
Uses a particle filter algorithm and multiple observation scans, using just
wheel encoders and 5 IR sensors
We modified ratbot-slam to use log files taken from
radish.sourceforge.net
RH
Ratbot-slam Modifications
Created a new observation function using laser scans instead of the original
IR sensors.
Modified the motion model to use dead-reckoned odometry
RH
Demo of Modified ratbot-slam
RH
Profile of Modified Code
RL
Areas That Can Be Accelerated
Decided to accelerate the predict step, which includes:
motion_model_deadreck
gaussian_pose
Estimated maximum speedup: 39%, or 1.64x
Why not squared_distance_point_segment?
The least understood of the algorithms we could accelerate
If we had more time we would have developed this
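The 39% / 1.64x figure is Amdahl's law applied to the predict step; a quick sketch of the arithmetic (the 0.39 fraction comes from the profile above):

```c
/* Ideal speedup when a fraction f of the runtime is reduced to ~0
 * (Amdahl's law with an infinitely fast accelerated portion). */
double amdahl_speedup(double f)
{
    return 1.0 / (1.0 - f);
}
```

With f = 0.39 this gives roughly 1.64x, matching the estimate on the slide.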
RL
Function Acceleration
Design Decisions
Fixed or Floating Point?
Fixed point
Implementation done in fixed point
The resources required for floating point were significantly greater
Heavily Pipeline or Create Predict Stage for each particle?
Heavily Pipelined
Data is serially loaded to the coprocessor through load and save functions
It would take too many resources to implement a predict stage in parallel
for each particle
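A minimal sketch of what the fixed-point arithmetic looks like (the Q16.16 format and the names here are assumptions for illustration; ratbot-slam's actual fixed_t layout may differ):

```c
#include <stdint.h>

typedef int32_t fixed_t;                 /* assumed Q16.16: 16 integer, 16 fraction bits */
#define FP_SHIFT 16
#define FP_ONE   ((fixed_t)1 << FP_SHIFT)

/* Multiply in a 64-bit intermediate so the product doesn't overflow,
 * then shift back down to Q16.16. */
static fixed_t fp_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FP_SHIFT);
}

static fixed_t fp_from_int(int n)
{
    return (fixed_t)n << FP_SHIFT;
}
```

In hardware this maps onto a single multiplier and a shifter, which is why fixed point is so much cheaper than a floating-point unit.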
RL
Top Level Design
RL
Motion Model C-Code
RH
MotionModel Data Flow
RH
MotionModel Data Flow
RH
MotionModel HDL Stats
RH
Gaussian Pose
void gaussian_pose(const pose_t *mean, const cov3_t *cov, pose_t *sample)
{
    sample->x = gaussian(mean->x, fp_sqrt(cov->xx));
    sample->y = gaussian(mean->y, fp_sqrt(cov->yy));
    sample->t = gaussian(mean->t, fp_sqrt(cov->tt));
}
JL
Gaussian Pose
fixed_t gaussian(fixed_t mean, fixed_t stddev)
{
    static int cached = 0;
    static fixed_t extra;
    static fixed_t a, b, c, t;

    if (cached) {
        cached = 0;
        return fp_mul(extra, stddev) + mean;
    }

    // pick random point in unit circle
    do {
        a = fp_mul(fp_2, fp_rand_0_1()) - fp_1;
        b = fp_mul(fp_2, fp_rand_0_1()) - fp_1;
        c = fp_mul(a, a) + fp_mul(b, b);
    } while (c > fp_1 || c == 0);

    t = pgm_read_fixed(&unit_gaussian_table[c >> unit_gaussian_shift]);
    extra = fp_mul(t, a);
    cached = 1;
    return fp_mul(fp_mul(t, b), stddev) + mean;
}
JL
Parallelism & Acceleration
Techniques
Parallelism
The gaussian_pose function consists of three calls to the gaussian function.
Each gaussian function can be separated into two parts
Acceleration Techniques: Pipeline, Multi-thread
JL
Top Level Diagram of gaussian_pose
JL
Random Number Generator
Xorshift random number generators generate the next number in their
sequence by repeatedly taking the exclusive or (XOR) of a number with a
bit-shifted version of itself.
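A 32-bit xorshift generator of this kind might look like the following (the 13/17/5 shift constants are the classic choice; whether the HDL uses these exact shifts is an assumption):

```c
#include <stdint.h>

/* One xorshift32 step: XOR the state with shifted copies of itself.
 * The state must be seeded nonzero; the sequence has period 2^32 - 1. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}
```

In hardware each step is just three XOR/shift stages, which is why this family of generators is popular for FPGA implementations.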
JL
Random_Number_Manager
JL
Gaussian Entity
JL
Demo of FPGA System
RL
Timing Analysis of Original System
Timing analysis was performed via run-time clock counts and
print statements to the minicom
Sections of code timed include: Predict Step, Multiscan
Feature Extraction and Data Association Step, & Filter Health
Evaluation and Re-sample Step
The Predict Step was implemented on the FPGA for acceleration
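The measurement approach amounts to wrapping each step in a pair of clock reads; a generic sketch (the real code used run-time clock counts on the board, for which the portable clock() stands in here):

```c
#include <time.h>

/* Busy work standing in for one SLAM step being timed. */
static void spin(void)
{
    volatile long s = 0;
    for (long i = 0; i < 100000; i++)
        s += i;
}

/* Time one call to fn in microseconds using the C clock.
 * On the board this was done with hardware cycle counts instead. */
long time_us(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (long)((end - start) * 1000000L / CLOCKS_PER_SEC);
}
```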
Initial timing analysis:

Operation                 | Average Runtime (microseconds) | Present in percentage of runs
Predict Step - Original   | 107,502                        | 100%
Multiscan Step - Original | 2,487,969                      | 2.17%
Filter Step - Original    | 3,394                          | 2.17%
RL
Timing Analysis of Accelerated System
Timing analysis for accelerated implementation was performed
in same manner as original implementation
Results shown along with original timing analysis
From the data collected, the Predict Step was accelerated by 88%

Operation                    | Average Runtime (microseconds) | Present in percentage of runs
Predict Step - Original      | 107,502                        | 100%
Multiscan Step - Original    | 2,487,969                      | 2.17%
Filter Step - Original       | 3,394                          | 2.17%
Predict Step - Accelerated   | 12,784                         | 100%
Multiscan Step - Accelerated | 1,982,950                      | 1.94%
Filter Step - Accelerated    | 13,291                         | 1.94%
RL
Result Analysis
With the Predict Step accelerated by 88.108%, the overall system
is accelerated by:
34% ≈ 39% × 88%
Result is a reliable and sizable acceleration to the system
execution time
Analysis of other components
Multiscan Step accelerated by 20.29%
Filter Step slowed by 74.46%
Differences may be due to different values generated by FPGA
implementation vs. Original implementation
Both implementations use random values
More accurate values may lead to longer calculation in other
components
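The overall figure follows from combining the profiled fraction with the measured per-step reduction; a quick check of the arithmetic (numbers from the tables above):

```c
/* Overall runtime reduction when a fraction f of the runtime is
 * reduced by r: overall = f * r (Amdahl's law in reduction form). */
double overall_reduction(double f, double r)
{
    return f * r;
}

/* Equivalent overall speedup factor. */
double overall_speedup(double f, double r)
{
    return 1.0 / (1.0 - f * r);
}
```

With f = 0.39 and r = 0.88 this gives about a 34% reduction, i.e. roughly a 1.52x speedup, consistent with the result above.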
RL
Difficulties with Project Implementation
Networking issues
Data transfer - differences between PowerPC and Linux
Limitations of FPGA
Unpredictable execution halting
Lack of resource libraries
Timing performed with specialized Xilinx library
Code needed to be modified to run
PC vs. FPGA Environment
Output file format is different
Issue figuring out how to add multiple files to custom IP
RL
Conclusions
Based on the run-time analysis of our implementation of the
accelerated SLAM algorithm, an appreciable speedup was achieved.
Our implementation achieved a speedup of approximately 34%,
or 1.51x, out of an ideal 39%, or 1.64x
This result shows that if more of the SLAM algorithm were
implemented on an FPGA, there could be greater acceleration.
A top issue in SLAM implementations is getting algorithms
implemented on embedded real-time systems
RH
Future Directions
Add more regions of the Algorithm to the FPGA acceleration
Current implementation only accelerates 39% of system
Run SLAM system on different FPGA
FPGAs with more robust processors may overcome some of the
limitations our implementation faced
Run different SLAM algorithm
Current implementation is a particle filter algorithm, a Kalman filter
algorithm would be next
Load data onto board rather than using PC interaction
Load data via memory card
Perform single data load and perform memory management on the
FPGA
RL
References
1. Durrant-Whyte, Bailey, "Simultaneous Localization and Mapping:
Part 1", IEEE Robotics and Automation Magazine, June 2006, pp. 99-108
2. Durrant-Whyte, Bailey, "Simultaneous Localization and Mapping:
Part 2", IEEE Robotics and Automation Magazine, September 2006, pp. 108-117
3. Bonato, Peron, Wolf, Holanda, Marques, Cardoso, "An FPGA
Implementation for a Kalman Filter with Application to Mobile
Robotics", Industrial Embedded Systems, 2007, pp. 148-155
4. Bonato, Marques, Constantinides, "A Floating-point Extended
Kalman Filter Implementation for Autonomous Mobile Robots",
Field Programmable Logic and Applications, 2007, pp. 576-579
5. Beevers, K.R., Huang, W.H., "SLAM with Sparse Sensing", Robotics
and Automation 2006, pp. 2285-2290
RL
Questions?
RL