Presented by:
Tal Klein
Omer Manor
Digital Interactive Photomontage
• The project focuses on digital photomontage: a computer-assisted framework for combining parts of a set of photographs into a single composite picture
• We focused on one feature: extended depth of field (DOF)
• Extended DOF matters most in macro photography, where the depth of field is very shallow
Digital Interactive Photomontage
• Extended DOF allows a photographer to take several pictures of the same frame, focusing on a different area in each picture, and then combine them using this feature
• Alongside its benefits, extended DOF is a "heavy resource consumer" due to the complex calculations and image manipulations it requires, so our goal was to speed up this process
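To make the combining idea concrete, here is a minimal sketch (not the graph-cut compositing used by the actual photomontage framework; all names and types are illustrative) that builds a naive extended-DOF label map by picking, for every pixel, the source image whose 3x3 neighborhood has the highest local contrast:

// Naive extended-DOF sketch (illustrative only; the real framework uses graph-cut
// compositing): for each pixel, pick the focus-stack image whose 3x3 neighborhood
// has the highest local contrast (variance).
#include <cstddef>
#include <vector>

// stack: one grayscale image per focus setting, each stored row-major with size w*h.
std::vector<int> selectSharpestSource(const std::vector<std::vector<float> >& stack,
                                      int w, int h)
{
    std::vector<int> labels(static_cast<std::size_t>(w) * h, 0);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            float bestContrast = -1.f;
            for (std::size_t s = 0; s < stack.size(); ++s) {
                const std::vector<float>& img = stack[s];
                float sum = 0.f, sumSq = 0.f;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        float v = img[(y + dy) * w + (x + dx)];
                        sum += v;
                        sumSq += v * v;
                    }
                float mean = sum / 9.f;
                float var = sumSq / 9.f - mean * mean;   // local variance = contrast
                if (var > bestContrast) {
                    bestContrast = var;
                    labels[y * w + x] = static_cast<int>(s);
                }
            }
        }
    }
    return labels;   // labels[y*w + x] = index of the sharpest source image at that pixel
}

The composite would then be assembled by copying each pixel from its selected source; the photomontage framework instead solves for labels that also keep the seams between sources invisible.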
Digital Interactive Photomontage
System Configuration
• Intel® Core 2 Duo E6600 @ 2.4 GHz
• 2 GB RAM
• Microsoft Windows XP x64
• Given the nature of our platform (2 cores), we assumed that optimization could achieve a major boost in performance
The Optimization Process
• Analyzing the application
• Code Optimization
• SIMD
• Multithreading
Analyzing The Application
• We analyzed the application in 3 different ways:
1. The VTune performance analyzer, to search for our program's bottlenecks.
2. Counters of our own, added to functions we suspected were called many times (a minimal sketch follows below).
3. A call graph (using Intel's VTune).
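A minimal sketch of the second method, the hand-added call counters (the instrumented function, its parameters, and its body are placeholders, not the project's real signatures):

// Sketch of a hand-added call counter (placeholder function; the real counters were
// added inside the project's suspected hot functions).
#include <cstdio>
#include <cstdlib>

static unsigned long g_dataPenaltyCalls = 0;

static void reportCounters()
{
    std::printf("BVZ_data_penalty calls: %lu\n", g_dataPenaltyCalls);
}

float BVZ_data_penalty_instrumented(int p, int d)      // placeholder parameters
{
    ++g_dataPenaltyCalls;                              // count every invocation
    return static_cast<float>(p + d);                  // stand-in for the original body
}

int main()
{
    std::atexit(reportCounters);                       // print the totals when the run ends
    for (int i = 0; i < 1000; ++i)
        BVZ_data_penalty_instrumented(i, i % 7);
    return 0;
}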
Analyzing The Application
• displace() - we changed its content into a macro instead of a function call.
• transformedPixel() - declares an unnecessary variable.
• BVZ_data_penalty - calls displace(), which we changed into a macro.
• GetDataCost - we optimized the code and used SIMD instructions.
• BVZ_interaction_penalty - we optimized the code by merging two loops into one and used SIMD instructions.
Analyzing The Application
BVZ_Expand() - the function that calls the smaller functions above and consumes the most CPU time; we applied multithreading to it.
Code Optimization
• Replacement of FP variables with integer variables when no FP operation is needed
• Merging of 2 consecutive "for loops" into one
• Removal of two assignments to the same pointer where the first assignment is never used
• Code replacement (a macro) instead of a function call
• Removal of an unnecessary variable declaration
Code Optimization
Replacement of FP variables with integer variables
Merging of 2 consecutive "for loops" into one
Original Code:
float PortraitCut::BVZ_interaction_penalty
{
  int c, k;
  float a, M=0;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    a=0;
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M = sqrt(a);
    a=0;
    Il = _imptr(l,np);
    Inl = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M += sqrt(a);
  }
Optimized Code:
float PortraitCut::BVZ_interaction_penalty
{
  int c;
  int ap = 0, anp = 0;
  float M=0;
  int kp = 0, knp = 0;
  unsigned char *Il_np, *Inl_np;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    Il_np = _imptr(l,np);
    Inl_np = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      kp = Il[c] - Inl[c];
      knp = Il_np[c] - Inl_np[c];
      ap += kp*kp;
      anp += knp*knp;
    }
  }
  M = sqrt(float(ap)) + sqrt(float(anp));
Code Optimization
Two assignments to the same pointer without using the 1st assignment
Original Code:
for (y=p.y-2, i=0; y<=p.y+2; ++y) {
for (x=p.x-2; x<=p.x+2; ++x, ++i) {
I = _id->_imptr(d,Coord(x,y));
I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
mean += lum*_gaussianK5[i];
} // x
} // y
Optimized Code:
for (y=p.y-2, i=0; y<=p.y+2; ++y) {
for (x=p.x-2; x<=p.x+2; ++x, ++i) {
I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
mean += lum*_gaussianK5[i];
} // x
} // y
Code Optimization
Code replacement instead of function
Original Code:
float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
assert(0);
Coord dp = p;
_displace(dp,d);
Optimized Code:
#define _displacedef(p,l) _idata->images(l)->displace(p)
float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
Coord dp = p;
_displacedef(dp,d);
Code Optimization
Unnecessary variable declaration
Original Code:
const unsigned char* ImageAbs::transformedPixel(Coord p) const {
if (transformed())
displace(p);
if (p>=Coord(0,0) && p<_size) {
unsigned char *res = _data + 3*(p.y * _size.x + p.x);
return res; }
else return __black;
}
Optimized Code:
const unsigned char* ImageAbs::transformedPixel(Coord p) const {
if (transformed())
displace(p);
if (p>=Coord(0,0) && p<_size) {
return (_data + 3*(p.y * _size.x + p.x)); }
else return __black;
}
Code Optimization
[Chart: Optimized code vs. original code, time-based comparison - 18% improvement]
SIMD - Single Instruction Multiple Data
• The main issue when using SIMD instructions is that a 128-bit register is available to us, so we have to use it wisely (a small illustration follows below)
• We used this 128-bit register in the places in our code where we expected it to boost the application's performance
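As a standalone illustration (not project code, but using the same MSVC alignment syntax as our platform), one 128-bit XMM register holds four packed single-precision floats, so a single SSE instruction performs four operations at once:

// A single 128-bit XMM register holds four floats, so one mulps does four multiplies.
#include <xmmintrin.h>
#include <cstdio>

int main()
{
    __declspec(align(16)) float a[4] = { 1.f, 2.f, 3.f, 4.f };
    __declspec(align(16)) float b[4] = { 10.f, 20.f, 30.f, 40.f };
    __declspec(align(16)) float r[4];

    __m128 va = _mm_load_ps(a);              // aligned load of 4 floats into one register
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(r, _mm_mul_ps(va, vb));     // 4 multiplications in one instruction

    std::printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}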
SIMD
Original Code:
float PortraitCut::BVZ_interaction_penalty
{
  int c, k;
  float a, M=0;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    a=0;
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M = sqrt(a);
    a=0;
    Il = _imptr(l,np);
    Inl = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M += sqrt(a);
    M /= 6.f;
Optimized Code:
float PortraitCut::BVZ_interaction_penalty
{
  int c;
  __m128 SimdM;
  int ap = 0, anp = 0;
  float M=0;
  int kp = 0, knp = 0;
  unsigned char *Il_np, *Inl_np;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    Il_np = _imptr(l,np);
    Inl_np = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      kp = Il[c] - Inl[c];
      knp = Il_np[c] - Inl_np[c];
      ap += kp*kp;
      anp += knp*knp;
    }
  }
  SimdM = _mm_sqrt_ps(_mm_set_ps(0, 0, float(ap), float(anp)));
  M = (SimdM.m128_f32[0] + SimdM.m128_f32[1]) / 6.f;
SIMD
• In the following example we used SIMD in order to compute a dot product of 2 vectors
• To make the process efficient, the data must be aligned in memory, so we used the __declspec(align(16)) directive
SIMD
Original Code:
float ContrastCut::getDataCost (Coord p, ushort d) {
  float mean=0, lum, contrast=0;
  const unsigned char* I;
  int y, x, i;
  for (y=p.y-2, i=0; y<=p.y+2; ++y) {
    for (x=p.x-2; x<=p.x+2; ++x, ++i) {
      I = _id->_imptr(d,Coord(x,y));
      I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
      lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
      mean += lum*_gaussianK5[i];
    } // x
  } // y
  mean /= .9997f;
Optimized Code:
float ContrastCut::getDataCost (Coord p, ushort d) {
  float mean=0, lum, contrast=0;
  const unsigned char* I;
  int y, x, i;
  __declspec(align(16)) float lumarr[25];
  __m128 SimdMult;
  __m128 SimdMean;
  __m128 *pLumArr = (__m128*)lumarr;
  __m128 *pGaussArr = (__m128*)_gaussianK5;
  SimdMean = _mm_set1_ps(0.f);
  for (y=p.y-2, i=0; y<=p.y+2; ++y) {
    for (x=p.x-2; x<=p.x+2; ++x, ++i) {
      I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
      lumarr[i] = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
    } // x
  } // y
  for (i = 0; i < 24; i+=4) {
    SimdMult = _mm_mul_ps(*pLumArr, *pGaussArr);
    SimdMean = _mm_add_ps(SimdMult, SimdMean);
    pLumArr++;
    pGaussArr++;
  }
  mean = SimdMean.m128_f32[0] + SimdMean.m128_f32[1] + SimdMean.m128_f32[2] + SimdMean.m128_f32[3];
  mean = (mean + lumarr[24]*_gaussianK5[24]) / .9997f;
SIMD Optimization
[Chart: SIMD vs. original code, time-based comparison - 1.5% improvement??]
SIMD
• Instead of keeping the data (the variables ap & anp) in registers, the compiler stores it to memory, an action that causes a store-forwarding block when the sqrtps instruction then loads it
• The use of SIMD accelerates the function by approximately 1 second; however, the delay caused by the store forwarding is larger than the speedup SIMD gained, and so we got a slowdown (a possible register-only alternative is sketched below)
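One possible way to keep both values in registers and avoid the blocked store-forward (a sketch, not taken from the project code, assuming ap and anp are live in general-purpose registers) is to use the scalar conversion and square-root instructions instead of building a packed vector through memory:

// Register-only alternative sketch: convert each int with cvtsi2ss and take scalar
// square roots, so nothing is written to the stack before the sqrt.
#include <xmmintrin.h>

float sumOfSqrtsOverSix(int ap, int anp)
{
    __m128 va = _mm_cvtsi32_ss(_mm_setzero_ps(), ap);   // int -> float in a register
    __m128 vb = _mm_cvtsi32_ss(_mm_setzero_ps(), anp);
    va = _mm_sqrt_ss(va);                               // sqrtss on the low element
    vb = _mm_sqrt_ss(vb);
    return (_mm_cvtss_f32(va) + _mm_cvtss_f32(vb)) / 6.f;
}

Whether this actually beats the plain scalar sqrt version would have to be measured, since the compiler may generate similar code for sqrtf.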
SIMD
SimdM = _mm_sqrt_ps(_mm_set_ps(0, 0, float(ap), float(anp)));

0041133B  xorps     xmm0,xmm0
0041133E  sub       eax,edi
00411340  mov       edi,dword ptr [esp+18h]
00411344  add       ecx,ebx
00411346  movzx     ebx,byte ptr [edi+2]
0041134A  mov       edi,dword ptr [esp+20h]
0041134E  movzx     edi,byte ptr [edi+2]
00411352  sub       edi,ebx
00411354  mov       ebx,eax
00411356  imul      ebx,eax
00411359  mov       eax,edi
0041135B  imul      eax,edi
0041135E  movss     dword ptr [esp+2Ch],xmm0
00411364  movss     dword ptr [esp+28h],xmm0
0041136A  add       ecx,ebx
0041136C  cvtsi2ss  xmm0,ecx
00411370  movss     dword ptr [esp+24h],xmm0
00411376  add       edx,eax

M = (SimdM.m128_f32[0] + SimdM.m128_f32[1]) / 6.f;
if (_cuttype == C_GRAD) {

00411378  cmp       dword ptr [esi+50h],1
0041137C  cvtsi2ss  xmm0,edx
00411380  movss     dword ptr [esp+20h],xmm0
00411386  sqrtps    xmm0,xmmword ptr [esp+20h]
0041138B  movaps    xmmword ptr [esp+20h],xmm0
00411390  movss     xmm0,dword ptr [esp+24h]
00411396  addss     xmm0,dword ptr [esp+20h]
0041139C  mulss     xmm0,dword ptr [__real@3e2aaaab (5397A0h)]
004113A4  movss     dword ptr [esp+0Ch],xmm0
004113AA  jne       004114AE

Store Forwarding Blocked
Multithreading
• Our major attempt to improve the original application was to divide the massive calculation between two independent threads that run simultaneously, one on each core
• The main procedure used in this application is the function "compute"
Multithreading
[Flowchart: Original Compute function flow]
• ITER_MAX - defined so the external loop won't run forever.
• _n - number of pictures in the stack.
• Step - image index descriptor.
• BVZ_Expand - calculates the max flow on the image's labels and returns the energy of the current step; according to the calculation, it also updates the final image labels (the outcome).
• Inner loop - executed once per image: E_old = E, then E = BVZ_Expand; if E_old > E the step counter is reset to 0, otherwise it is incremented.
• External loop - runs as long as there is improvement in the max-flow calculation: as long as the old energy (from the previous step) is bigger than the new one, we continue iterating over the next image. If no improvement in the flow was made, we achieved the maximum improvement and the function ends.
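In code terms, the flow above corresponds roughly to the following loop (a simplified sketch; the names follow the flowchart, and BVZ_Expand is replaced by a dummy stub rather than the project's real function):

// Simplified sketch of the original Compute loop (names follow the flowchart;
// BVZ_Expand_stub is a dummy stand-in for the project's per-image expansion).
static float g_energy = 1000.f;

static float BVZ_Expand_stub(int /*step*/)
{
    g_energy *= 0.99f;                 // dummy: pretend each expansion improves the energy
    return g_energy;
}

void compute_sketch(int _n)            // _n = number of pictures in the stack
{
    const int ITER_MAX = 100;          // keeps the external loop from running forever
    float E = g_energy;                // energy of the initial labeling
    int step_counter = 0;

    for (int iter = 0; iter < ITER_MAX && step_counter < _n; ++iter) {
        for (int step = 0; step < _n && step_counter < _n; ++step) {
            float E_old = E;
            E = BVZ_Expand_stub(step); // max flow on this image's labels
            if (E_old > E)
                step_counter = 0;      // improvement: keep sweeping the stack
            else
                ++step_counter;        // no improvement on this image
        }
    }
}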
Multithreading
[Flowchart: Optimized Compute function flow (Multi 1)]
• Our goal was to parallelize the energy computation in each step so we can advance the steps by two in each iteration: thread 1 calculates the odd steps (images) and thread 2 the even steps.
• Thread synchronization appears in two places:
• BVZ_Expand - the calculation part of the max flow is parallelized (both threads); at this point thread 2 waits for thread 1 to finish its energy calculation and label updates, so that thread 2 has the right E_old.
• Compute - if thread 1 changed the labels, thread 2 must recalculate the last step on the updated labels, so its calculation is ignored (Step--).
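A condensed sketch of one Multi-1 step pair (std::thread is used here only for illustration, the project targeted Windows XP x64 with its own threading API, and BVZ_Expand is again a dummy stub; treating "improved the energy" as equivalent to "changed the labels" is a simplification of the scheme above):

// Condensed sketch of one Multi-1 iteration: thread 1 runs the odd step and thread 2
// the even step; thread 2's result is only accepted if thread 1 left the labels unchanged.
#include <thread>

static float BVZ_Expand_stub2(int step) { return 100.f - step; }  // dummy energies

void multi1_step_pair(int step, float& E, int& step_counter)
{
    float E_thread1 = 0.f, E_thread2 = 0.f;

    std::thread t1([&] { E_thread1 = BVZ_Expand_stub2(step); });      // odd step
    std::thread t2([&] { E_thread2 = BVZ_Expand_stub2(step + 1); });  // even step
    t1.join();
    t2.join();   // sync point: thread 2's result may only be used after thread 1 is done

    // Evaluate thread 1's step exactly as the serial loop would.
    bool thread1_improved = (E > E_thread1);
    if (thread1_improved) step_counter = 0; else ++step_counter;

    if (thread1_improved) {
        // Simplification: an improvement means thread 1 changed the labels, so thread 2
        // worked on stale data; ignore its calculation and redo that step (Step--).
        E = E_thread1;
        return;
    }

    // Otherwise E_old for thread 2's step is thread 1's energy.
    if (E_thread1 > E_thread2) step_counter = 0; else ++step_counter;
    E = E_thread2;
}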
Multithreading Optimization
[Chart: Multithreading vs. original code, time-based comparison - 25% improvement]
Multithreading Optimization
Multithreading vs. Original Code Time Based Comparison
• Theoretically, when using 2 threads that work simultaneously, we would expect a 50% speedup
• Because the result of each thread depends on the previous iteration, synchronization points are required in the code
• These synchronization points halt the threads' runs and therefore cause delays
Multithreading - Second Attempt
[Flowchart: Optimized Compute function flow (Multi 2)]
• We tried to enhance the speed by taking a different approach to the synchronization.
• In this attempt, each thread changes the labels (temporary labels) in its own memory segment, and we merge the results after the completion of both threads. Each label that thread 2 changes is marked using an auxiliary array.
• In the merging process, the labels are updated using the temp labels of thread 1, unless the specific label was also changed by thread 2; in that case, the specific label is updated using thread 2's temp labels.
• We can see, however, that the results do not show noticeable differences.
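A minimal sketch of the merge step described above (the container types and names are illustrative; the project stored its labels and the auxiliary marker array in its own structures):

// Merge step sketch: take thread 1's temporary labels by default, and overwrite any
// pixel that thread 2 marked as changed with thread 2's value.
#include <cstddef>
#include <vector>

void mergeLabels(std::vector<unsigned short>&       labels,       // final labeling
                 const std::vector<unsigned short>& temp_label1,  // thread 1's result
                 const std::vector<unsigned short>& temp_label2,  // thread 2's result
                 const std::vector<bool>&           changed2)     // thread 2's auxiliary marks
{
    for (std::size_t i = 0; i < labels.size(); ++i) {
        labels[i] = temp_label1[i];
        if (changed2[i])
            labels[i] = temp_label2[i];  // thread 2 changed this label too: prefer its value
    }
}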
Multithreading Optimization
[Chart: Multithreading 2 vs. original code, time-based comparison - 99.8% improvement]
Multithreading
Thread Profiler view of Multithreaded Code
[Screenshot: Thread Profiler timeline highlighting our code's 2 threads]
Our program utilization is high and, except for our threads' sync points, both cores are working.
Multithreading
Thread Profiler view of Multithreaded Code
The serial part at the beginning of the program is required to load the image stack and to compute the first energy of the image; after that, our threads begin their computation.
Multithreading
Thread Profiler view of Multithreaded Code
The threads' sync points take little time; most of the code's runtime runs simultaneously on both cores.
Intel® Compiler
• The Intel compiler did not build our SIMD configuration (class error).
• We used the Intel compiler on 3 of our configurations and compared their runtimes with the same configurations compiled with Visual Studio's compiler
[Chart: Time Comparison - Intel vs. Microsoft. Runtime [sec] with the Visual Studio compiler and the Intel compiler for three configurations; values shown: code optimization - 56.95, 58.17; thread 1 - 53.91, 55.38; thread 2 - 34.83, 34.7]
Intel® Tuning Assistant
• Using Intel's Tuning Assistant, we found no significant areas where our code caused a slowdown
• All events collected by the Tuning Assistant indicate that our optimization is satisfactory
• The store-forwarding issue in the SIMD code was not detected as a "hotspot" because the time consumed by the faulty code was only 0.7% of the overall time spent on the entire function (less than 1%)
Optimization Summary
[Chart: Time Comparison (seconds) per run type - original code, code optimization, SIMD instructions, thread 1, thread 2, thread 1+SIMD, thread 2+SIMD; values shown: 69.34, 68.36, 58.17, 65.33, 55.38, 34.7, 33.94]
Optimization Summary
[Chart: Speed Up Comparison (original = 100%) per run type - original code, code optimization, SIMD instructions, thread 1, thread 2, thread 1+SIMD, thread 2+SIMD; values shown: 100, 101.41, 105.78, 116.11, 120.13, 149.96, 151.05]
Thank you,
Tal Klein & Omer Manor