Accelerating Multimedia Applications using the Intel SSE and AVX ISA
ACCELERATING MULTIMEDIA APPLICATIONS USING THE INTEL SSE AND AVX ISA
MIN LI
05/08/2013
INTEL SSE AND AVX ISA
Intel ISA
SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
SSE4.2: specialized for string and text applications (suitable for applications like template matching and genome sequence comparison)
AVX (mainly for floating-point operations)
AVX1: 256 bits
AVX2: 256 bits (extends most integer instructions to 256 bits)
XMM and YMM registers
XMM: 128 bits
YMM: 256 bits
INTEL OPENCV LIBRARY
OpenCV library
A wide variety of multimedia applications
Object detection, face recognition, image processing…
Good candidates for speedup with the Intel SSE or AVX ISA
Computation-intensive
I made videos on YouTube to show some tricks for using the OpenCV library:
https://www.youtube.com/watch?v=ISap9zEGE2I
https://www.youtube.com/watch?v=pqSgT0quMBc
GUIDELINES FOR ENABLING THE ISA
Intel SSE and AVX
Run cat /proc/cpuinfo and make sure the SSE and AVX flags are listed. If they are not, enable them.
On my machine:
All SSE extensions are available
However, only AVX1 is available, which means 256-bit YMM operations are limited to floating point; integer SIMD operations can only use the 128-bit XMM registers
Note: AVX2 is only available on processors from the Haswell generation (2013) onward
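The check above can also be done from a program. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo contains a "flags" line; the helper names has_flag and read_cpu_flags are hypothetical and not part of any library. Note that the kernel reports SSE3 as "pni", and that a plain substring search would wrongly match "avx" inside "avx2", so the sketch matches whole tokens only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `flag` occurs in `flags` as a whole
 * whitespace-separated token (so "avx" does not match "avx2"). */
int has_flag(const char *flags, const char *flag) {
    size_t n = strlen(flag);
    assert(n > 0);
    for (const char *p = strstr(flags, flag); p != NULL;
         p = strstr(p + 1, flag)) {
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: copies the first "flags" line of /proc/cpuinfo
 * into buf. Returns 1 on success, 0 if the file or the line is missing.
 * The flags line can be long, so a buffer of a few KiB is advisable. */
int read_cpu_flags(char *buf, size_t size) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return 0;
    int found = 0;
    while (fgets(buf, (int)size, f) != NULL) {
        if (strncmp(buf, "flags", 5) == 0) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A caller would typically read the line into a 4096-byte buffer with read_cpu_flags and then query has_flag for "sse4_2", "avx", and "avx2" before selecting a code path.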
ACCELERATION CASE I
Original:
for( int i = 0; i < length; i += 4 ){
    double t0 = d1[i] - d2[i];
    double t1 = d1[i+1] - d2[i+1];
    double t2 = d1[i+2] - d2[i+2];
    double t3 = d1[i+3] - d2[i+3];
    total_cost += t0*t0 + t1*t1
                + t2*t2 + t3*t3;
}
After modification:
// Note: the SSE version computes in single precision, and _mm_load_ps
// requires d1 and d2 to be 16-byte aligned.
int chunk = length / 4;
for( int i = 0; i < chunk; i++ ){
    __m128 m0, m1, m2;
    m0 = _mm_load_ps(&d1[4 * i]);
    m1 = _mm_load_ps(&d2[4 * i]);
    m1 = _mm_sub_ps(m0, m1);        // four differences
    m1 = _mm_mul_ps(m1, m1);        // four squared differences
    m1 = _mm_hadd_ps(m1, m1);       // pairwise sums (SSE3)
    m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));
    m1 = _mm_add_ps(m1, m2);        // every lane now holds the full sum
    total_cost += ((float*)&m1)[0];
    if( total_cost > best )
        break;
}
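For comparison, here is a self-contained sketch of the same sum of squared differences. It is a variant of the approach on the slide rather than a copy of it: it accumulates vertically and performs a single horizontal sum after the loop, uses unaligned loads, sticks to plain SSE instructions (no _mm_hadd_ps), and leaves out the early exit against best. The function names ssd_scalar and ssd_sse are hypothetical, and the code assumes an x86 target with SSE enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_loadu_ps, _mm_shuffle_ps, ... */

/* Scalar reference: sum of squared differences of two float arrays. */
float ssd_scalar(const float *d1, const float *d2, size_t length) {
    float total = 0.0f;
    for (size_t i = 0; i < length; i++) {
        float t = d1[i] - d2[i];
        total += t * t;
    }
    return total;
}

/* SSE sketch: processes four floats per iteration.
 * Assumption: length is a multiple of 4. */
float ssd_sse(const float *d1, const float *d2, size_t length) {
    assert(length % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < length; i += 4) {
        __m128 a = _mm_loadu_ps(d1 + i); /* unaligned load is allowed */
        __m128 b = _mm_loadu_ps(d2 + i);
        __m128 d = _mm_sub_ps(a, b);
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d)); /* per-lane partial sums */
    }
    /* Horizontal sum of the four lanes (a0, a1, a2, a3). */
    __m128 hi = _mm_movehl_ps(acc, acc);  /* (a2, a3, a2, a3) */
    __m128 s = _mm_add_ps(acc, hi);       /* (a0+a2, a1+a3, ...) */
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    s = _mm_add_ss(s, s1);                /* lane 0 = a0+a1+a2+a3 */
    return _mm_cvtss_f32(s);
}
```

Moving the horizontal sum out of the loop removes three instructions per iteration, at the cost of a different floating-point summation order than the scalar loop, so results may differ in the last bits for general inputs.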
ACCELERATION CASE II
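Original:
float minval = FLT_MAX, maxval = -FLT_MAX;
for( i = 0; i < N; i++, ++it )
{
    float v = *(const float*)it.ptr;
    if( v < minval )
    {
        minval = v;
        minidx = it.node()->idx;
    }
    if( v > maxval )
    {
        maxval = v;
        maxidx = it.node()->idx;
    }
}
if( _minval )
    *_minval = minval;
if( _maxval )
    *_maxval = maxval;
After modification:
// Assumes four consecutive, 16-byte-aligned floats at it.ptr.
// _mm_cmp_ps with _CMP_LT_OS / _CMP_GT_OS requires AVX.
__m128 m0, m1, m2, m3, m4, minArray, maxArray;
int minPos[4] = {0, 1, 2, 3}, maxPos[4] = {0, 1, 2, 3};
int chunk = N / 4;
// The first chunk initializes the per-lane running minima and maxima,
// which is why the loop below starts at i = 1.
minArray = maxArray = _mm_load_ps( (const float*)it.ptr );
it += 4;
for(i = 1; i < chunk; i++){
    m0 = _mm_load_ps( (const float*)it.ptr );
    it += 4;
    m1 = _mm_min_ps(m0, minArray);              // new per-lane minima
    m2 = _mm_max_ps(m0, maxArray);              // new per-lane maxima
    m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);  // lanes with a new minimum
    m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);  // lanes with a new maximum
    int* mask1 = (int*) &m3;
    int* mask2 = (int*) &m4;
    for(int j = 0; j < 4; j++){
        if(mask1[j] == -1)
            minPos[j] = 4 * i + j;
        if(mask2[j] == -1)
            maxPos[j] = 4 * i + j;
    }
    minArray = m1; maxArray = m2;   // keep the values, not the masks
}
// The four per-lane results still need a final scalar reduction.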
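The idea of tracking per-lane minima, maxima, and their positions can be illustrated on a plain contiguous float array. The sketch below is not the OpenCV code: the function name minmax_idx_sse is hypothetical, it assumes the element count is a nonzero multiple of 4, and it uses the SSE comparison intrinsics _mm_cmplt_ps and _mm_cmpgt_ps with _mm_movemask_ps in place of the AVX _mm_cmp_ps form shown above. It also includes the final reduction across the four lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE */

/* Finds the minimum and maximum of v[0..n-1] together with the index of
 * their first occurrence. Assumption: n >= 4 and n is a multiple of 4. */
void minmax_idx_sse(const float *v, size_t n,
                    float *minval, size_t *minidx,
                    float *maxval, size_t *maxidx) {
    assert(n >= 4 && n % 4 == 0);
    __m128 vmin = _mm_loadu_ps(v); /* first chunk seeds both trackers */
    __m128 vmax = vmin;
    size_t minpos[4] = {0, 1, 2, 3};
    size_t maxpos[4] = {0, 1, 2, 3};

    for (size_t i = 4; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(v + i);
        /* One bit per lane: set where x improves on the running value. */
        int lt = _mm_movemask_ps(_mm_cmplt_ps(x, vmin));
        int gt = _mm_movemask_ps(_mm_cmpgt_ps(x, vmax));
        for (int j = 0; j < 4; j++) {
            if (lt & (1 << j)) minpos[j] = i + (size_t)j;
            if (gt & (1 << j)) maxpos[j] = i + (size_t)j;
        }
        vmin = _mm_min_ps(x, vmin);
        vmax = _mm_max_ps(x, vmax);
    }

    /* Final reduction across the four lanes, preferring lower indices
     * when two lanes hold the same value. */
    float lo[4], hi[4];
    _mm_storeu_ps(lo, vmin);
    _mm_storeu_ps(hi, vmax);
    *minval = lo[0]; *minidx = minpos[0];
    *maxval = hi[0]; *maxidx = maxpos[0];
    for (int j = 1; j < 4; j++) {
        if (lo[j] < *minval || (lo[j] == *minval && minpos[j] < *minidx)) {
            *minval = lo[j]; *minidx = minpos[j];
        }
        if (hi[j] > *maxval || (hi[j] == *maxval && maxpos[j] < *maxidx)) {
            *maxval = hi[j]; *maxidx = maxpos[j];
        }
    }
}
```

The inner position-update loop is scalar, which illustrates why tracking positions costs much more than finding the values alone.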
LOAD OF STRUCTURES
Structures like this:
typedef struct point_ {
    int x;
    int y;
} point;
point* points;
Memory layout: points[0].x, points[0].y, points[1].x, points[1].y, ...
_mm_load_ only takes consecutive memory space!
What does it look like inside the YMM register after loading four points?
X0 Y0 X1 Y1 X2 Y2 X3 Y3
How to achieve the following using the SSE and AVX ISA?
X0 X1 X2 X3 Y0 Y1 Y2 Y3
Not easy!
PERMUTE AND BLEND
Register contents are shown after each step; | separates the two 128-bit lanes.
(1) __m256i temp = _mm256_load_si256((__m256i*) &points[4 * i]);
    X0 Y0 X1 Y1 | X2 Y2 X3 Y3
(2) __m256 temp2 = _mm256_cvtepi32_ps(temp);
(3) v4si mask1 = {9,8,8,9};
(4) __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);
    X0 X1 Y0 Y1 | Y2 Y3 X2 X3
(5) __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);
    Y2 Y3 X2 X3 | X0 X1 Y0 Y1
(6) temp3 = _mm256_blend_ps(temp3, temp4, 0b00110011);
    X0 X1 X2 X3 | Y2 Y3 Y0 Y1
(7) v4si mask2 = {0xd,4,4,0xd};
(8) temp3 = _mm256_permutevar_ps(temp3, mask2);
    X0 X1 X2 X3 | Y0 Y1 Y2 Y3
(9) __m128 m1 = _mm256_extractf128_ps(temp3, 1);
    X0 X1 X2 X3
(10) __m128 m2 = _mm256_extractf128_ps(temp3, 0);
    Y0 Y1 Y2 Y3
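As a point of comparison with the ten-step AVX sequence, the same x/y separation for four points can be sketched with two 128-bit loads and two _mm_shuffle_ps operations. This is a different technique from the permute-and-blend approach on the slide, not a reconstruction of it; the function name deinterleave4 is hypothetical, and the code assumes an x86 target with SSE2 and a point struct made of two 32-bit ints with no padding.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: __m128i, _mm_loadu_si128, _mm_cvtepi32_ps */

typedef struct point_ {
    int x;
    int y;
} point;

/* Splits four consecutive points into separate x and y float arrays. */
void deinterleave4(const point *p, float xs[4], float ys[4]) {
    /* Two unaligned 128-bit loads cover the four interleaved points. */
    __m128i lo_i = _mm_loadu_si128((const __m128i *)&p[0]); /* x0 y0 x1 y1 */
    __m128i hi_i = _mm_loadu_si128((const __m128i *)&p[2]); /* x2 y2 x3 y3 */
    __m128 lo = _mm_cvtepi32_ps(lo_i); /* int32 -> float, lane by lane */
    __m128 hi = _mm_cvtepi32_ps(hi_i);
    /* _mm_shuffle_ps takes its two low result lanes from the first
     * operand and its two high result lanes from the second. */
    __m128 x = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)); /* x0 x1 x2 x3 */
    __m128 y = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)); /* y0 y1 y2 y3 */
    _mm_storeu_ps(xs, x);
    _mm_storeu_ps(ys, y);
}
```

Even in this shorter form, six instructions are spent on rearranging data before any useful arithmetic starts, which is the loading overhead the results below refer to.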
SIMULATION RESULTS
Too much overhead for loading structures
Not only the min/max values must be found, but also their positions
[Figure: Runtime Comparison. Bar chart comparing Original and AVX runtimes for CSD, MML, CNVP, and CLPB; vertical axis from 0 to 1.6.]
CONCLUSION AND FUTURE WORK
OpenCV is well suited to SSE or AVX acceleration
A single, self-contained task has a better chance of getting a speedup
Loading and rearranging a structure is a really cumbersome task
Future work: hints for smarter automated compilation (such as for loading structures)
Future work: suggestions for expanding the ISA (introducing new instructions)