
Binocular Stereo
Left Image
Right Image
Binocular Stereo
© 2006 by Davi Geiger
Computer Vision
November 2006
L1.1
Binocular Stereo
There are various methods of extracting relative depth
from images. Some of the “passive” ones are based on
(i) the relative size of known objects,
(ii) occlusion cues, such as the presence of T-junctions,
(iii) motion information,
(iv) focusing and defocusing,
(v) relative brightness.
Moreover, there are active methods, such as
(i) radar or sonar, which emit beams of radio or sound waves, and
(ii) laser range-finding, which uses beams of light.
Stereo vision is unique because it is both passive and accurate.
Human Stereo: Random Dot Stereogram
Julesz’s random dot stereogram. The left image, a black-and-white
image, is generated by a program that assigns a black or white value
to each pixel according to a random number generator.
The right image is constructed by copying the left image, but an imaginary
square inside the left image is displaced a few pixels to the left and the empty
space is filled with black and white values chosen at random. When the stereo pair
is shown, observers can identify/match the imaginary square in both images
and consequently “see” a square in front of the background. This shows that stereo
matching can occur without recognition.
Human Stereo: Illusory Contours
Stereo matching occurs in the presence of illusory contours.
Here not only do the illusory figures in the left and right
images not match, but stereo matching also yields
illusory figures not seen in either the left or the right
image alone.
Not even the identification/matching of illusory contours is known
a priori to the stereo process. These pairs give evidence that the
human visual system does not process illusory contours/surfaces
before processing binocular vision.
Accordingly, binocular vision will hereafter be described as a
process that does not require any recognition or contour detection
a priori.
Human Stereo: Half Occlusions
[Figure: two stereo pairs (left/right images) illustrating half-occlusion.]
An important aspect of the stereo geometry
is half-occlusion. There are regions of the
left image that have no match in the
right image, and vice versa. These unmatched
regions, or half-occlusions, contain important
information about the reconstruction of the
scene. Even though these regions can be
small, they affect the overall matching
scheme, because the rest of the matching
must reconstruct a scene that accounts for
the half-occlusion.
Leonardo da Vinci noted that the larger the discontinuity between
two surfaces, the larger the half-occlusion. Nakayama and Shimojo in
1991 were the first to show stereo pairs where adding one dot to one
image, as above, thereby inducing occlusions, affected the overall
matching of the stereo pair.
Projective Camera
Let P = (X, Y, Z) be a point in the 3D world,
represented in a “world” coordinate
system. Let O be the center of projection
of a camera, where a camera reference
frame is placed. The camera coordinate
system has its z axis perpendicular to
the camera frame (where the image is
produced), and the distance between the
center O and the camera frame is the focal
length, f. In this coordinate system the
point P = (X, Y, Z) is described by the vector
P_O = (X_O, Y_O, Z_O), and the projection of this
point onto the image (the intersection of the
line through P and O with the camera frame)
is the point p_o = (x_o, y_o, f), where

    p_o = (f / Z_O) P_O
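The projection equation can be checked with a minimal sketch (the function name and the NumPy dependency are illustrative, not from the slides):

```python
import numpy as np

def project(P, f):
    """Project a 3D point P = (X, Y, Z), given in the camera frame,
    onto the image plane at focal length f: p_o = (f / Z) * P."""
    X, Y, Z = P
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return np.array([f * X / Z, f * Y / Z, f])

p = project(np.array([2.0, 1.0, 4.0]), f=2.0)
# p = (1.0, 0.5, 2.0): the third component is always the focal length f
```

Note that scaling P by any positive constant along the ray through O yields the same image point, which is the depth ambiguity a single camera cannot resolve.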
Projective Camera Coordinate System
Pixel coordinates q_o = (q_x0, q_y0, 1) are related to image-plane coordinates by

    x_0 = (q_x0 - o_x) s_x ;    y_0 = -(q_y0 - o_y) s_y ,

where the intrinsic parameters of the camera,
(s_x, s_y); (o_x, o_y); f, represent the size of the
pixels (say, in millimeters) along the x and y
directions, the coordinates in pixels of the center of the
image (also called the principal point), and the
focal length of the camera.
We have neglected to account for the radial distortion of the lenses, which would
give an additional intrinsic parameter. The equations above can be described by the
linear transformation p_o = Q q_o, with inverse q_o = Q^{-1} p_o, where

    Q = | s_x    0    -s_x o_x |          Q^{-1} = | 1/s_x     0     o_x / f |
        |  0   -s_y    s_y o_y |                   |   0    -1/s_y   o_y / f |
        |  0     0        f    |                   |   0       0      1 / f  |
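The pixel-to-image-plane conversion can be sketched numerically; the intrinsic values below are illustrative, not from the slides:

```python
import numpy as np

# Illustrative intrinsics: pixel sizes sx, sy (mm), principal point
# (ox, oy) in pixels, focal length f (mm).
sx, sy, ox, oy, f = 0.01, 0.01, 320.0, 240.0, 4.0

# p_o = Q q_o maps pixel coordinates q_o = (qx0, qy0, 1)
# to camera-frame image coordinates p_o = (x0, y0, f).
Q = np.array([[sx, 0.0, -sx * ox],
              [0.0, -sy, sy * oy],
              [0.0, 0.0, f]])

Q_inv = np.linalg.inv(Q)          # q_o = Q^{-1} p_o recovers pixels

q = np.array([420.0, 200.0, 1.0])
p = Q @ q   # x0 = (420-320)*0.01 = 1.0, y0 = -(200-240)*0.01 = 0.4
assert np.allclose(Q_inv @ p, q)  # the round trip is exact
```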
Two Projective Cameras
A 3D point P = (X, Y, Z) is projected onto both cameras, as p_l and p_r. The
change of coordinate system from left to right is described by a rotation
matrix R and a translation vector T. More precisely, a point P described
as P_l in the left frame will be described in the right frame as

    P_r = R^{-1} (P_l - T)
Two Projective Cameras
Epipolar Lines
Each 3D point P defines a plane P O_l O_r. This plane intersects the two camera frames,
creating two corresponding epipolar lines. The line O_l O_r intersects the camera
planes at e_l and e_r, known as the epipoles. The line O_l O_r is common to every plane
P O_l O_r, and thus the two epipoles belong to all pairs of epipolar lines (the epipoles are
the “center/intersection” of all epipolar lines.)
Estimating Epipolar Lines and Epipoles
The two vectors T and P_l span a 2-dimensional space, and their cross product
T × P_l is perpendicular to this space. Therefore, for any point P'_l = λ P_l on
the ray through P_l, and with P_r = R^{-1}(P_l - T),

    (P'_l - T) · (T × P_l) = 0  ⇒  (P_l - T) · (T × P_l) = 0  ⇒  (R P_r) · (T × P_l) = 0
        ⇒  P_r^T R^T S(T) P_l = 0  ⇒  P_r^T E(R,T) P_l = 0  ⇒  p_r^T E(R,T) p_l = 0 ,

the last step because p_r^T E p_l = (f^2 / (Z_l Z_r)) P_r^T E P_l, where

    S(T) = |   0   -T_z   T_y |
           |  T_z    0   -T_x |
           | -T_y   T_x    0  |

and E(R,T) = R^T S(T) is the essential matrix. In pixel coordinates,

    p_r^T E(R,T) p_l = 0  ⇒  q_r^T Q_r^T E(R,T) Q_l q_l = 0  ⇒  q_r^T F(R,T,i_l,i_r) q_l = 0 .

F is known as the fundamental matrix and needs to be estimated.
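The epipolar constraint can be verified numerically. This sketch assumes the convention P_r = R^{-1}(P_l - T) with essential matrix E = R^T S(T); the specific rotation, translation, and point are illustrative:

```python
import numpy as np

def skew(T):
    """S(T) such that S(T) @ v = T x v (the cross-product matrix)."""
    Tx, Ty, Tz = T
    return np.array([[0.0, -Tz, Ty],
                     [Tz, 0.0, -Tx],
                     [-Ty, Tx, 0.0]])

# A small rotation about the y axis and a baseline translation.
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
T = np.array([1.0, 0.0, 0.0])

E = R.T @ skew(T)                  # essential matrix E(R, T)

P_l = np.array([2.0, 1.0, 5.0])    # a point in the left camera frame
P_r = R.T @ (P_l - T)              # the same point in the right frame (R^{-1} = R^T)
assert abs(P_r @ E @ P_l) < 1e-12  # the epipolar constraint holds
```

The constraint holds for any point and any rigid motion, since P_r^T E P_l = (P_l - T) · (T × P_l) = 0 identically.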
Computing F (fundamental matrix)
“Eight point algorithm”:
(i) Given two images, we need to identify eight or more points on both
images, i.e., we provide n ≥ 8 points with their correspondences. The
points have to be non-degenerate.
(ii) Then we have n linear and homogeneous equations q_r^T F(R,T,i_l,i_r) q_l = 0
with 9 unknowns, the components of F. We need to estimate F only up to
a scale factor, so there are only 8 unknowns to be computed from the
n ≥ 8 linear and homogeneous equations.
(iii) If n = 8 there is a unique solution (with non-degenerate points), and if n >
8 the solution is overdetermined and we can use the SVD decomposition
to find the best-fit solution.
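The linear estimation step can be sketched as follows (a minimal version without the usual coordinate normalization; the function name is illustrative):

```python
import numpy as np

def eight_point(ql, qr):
    """Estimate F from n >= 8 correspondences by least squares via SVD.
    ql, qr: (n, 2) arrays of matching image coordinates."""
    n = len(ql)
    assert n >= 8
    A = np.zeros((n, 9))
    for i, ((xl, yl), (xr, yr)) in enumerate(zip(ql, qr)):
        # each row encodes q_r^T F q_l = 0 for homogeneous (x, y, 1) points
        A[i] = [xr*xl, xr*yl, xr, yr*xl, yr*yl, yr, xl, yl, 1.0]
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)      # null-space direction = best-fit solution
    # Enforce rank 2, since a fundamental matrix is singular.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```

In practice the input coordinates are normalized (centered and scaled) before building A, which greatly improves conditioning on real pixel data.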
Stereo Correspondence: Ambiguities
Each potential match is
represented by a square.
The black ones represent
the most likely scene to
“explain” the images, but
other combinations could
have given rise to the same
images (e.g., the red ones).
What makes the set of black squares preferred/unique is that they have
similar disparity values, the ordering constraint is satisfied, and there is a
unique match for each point. Any other set that could have given rise to the
two images would have more widely varying disparity values, and either the
ordering constraint or the uniqueness constraint would be violated. The disparity
values are inversely proportional to the depth values.
Stereo Correspondence: Matching Space
[Figure: the matching space for one pair of epipolar lines, with left-image pixels
A–F on one axis and right-image pixels on the other; paths through the nodes encode
depth-discontinuity boundaries, surface-orientation discontinuities, and unmatched
(no-match) pixels.]
In the matching space, a point (or a node) represents a match of a pixel
in the left image with a pixel in the right image.
Note 1: Depth discontinuities and very tilted surfaces can yield the same
images (with half-occluded pixels).
Note 2: Due to pixel discretization, points A and C in the right frame are neighbors.
Cyclopean Eye
The cyclopean eye “sees” the world in 3D, where x represents
the coordinate system of this eye and w is the disparity axis:

    x = (r + l)/2   and   w = (r - l)/2 ,   with inverse   r = x + w   and   l = x - w .

For manipulating integer coordinate values, one can also use the representation

    x = r + l   and   w = r - l ,   with inverse   r = (x + w)/2   and   l = (x - w)/2 ,

restricted to integer values. Thus, for l, r = 0, …, N-1
we have x = 0, …, 2N-2 and w = -N+1, …, 0, …, N-1.
Note: Not every pair (x, w) corresponds to a pair (l, r) when only integer
coordinates are considered. For x + w even we have integer values for the pixels
r and l, and for x + w odd we have subpixel locations.
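The integer representation can be sketched directly (function names are illustrative); the example uses the slide's values l = 3, r = 5:

```python
def to_cyclopean(l, r):
    """Integer cyclopean representation: x = r + l, w = r - l."""
    return r + l, r - l

def to_left_right(x, w):
    """Inverse map; defined on integer pixels only when x + w is even."""
    assert (x + w) % 2 == 0, "x + w odd corresponds to subpixel locations"
    return (x - w) // 2, (x + w) // 2   # (l, r)

x, w = to_cyclopean(3, 5)     # l=3, r=5  ->  x=8, w=2
assert to_left_right(x, w) == (3, 5)
```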
Surface Constraints I
Smoothness: In nature most surfaces are smooth in depth compared to their distance to
the observer, but depth discontinuities also occur. Usually smoothness implies an
ordering constraint, where points to the right of q_l must match points to the right of q_r.
[Figure: left and right epipolar lines in the (x, w) matching space; steps that preserve
the left-to-right order of matches are allowed (“YES”), while steps that swap the order
violate the ordering constraint (“NO”).]
Surface Constraints II
Uniqueness: There should be only one disparity value associated with each cyclopean
coordinate x. Note: multiple matches for left-eye points or right-eye points are allowed.
[Figure: left and right epipolar lines in the (x, w) matching space; a step to w ± 1
re-uses a pixel of one eye (allowed: a multiple match for the left or the right eye),
while two disparity values at the same x violate uniqueness (“NO”).]
Bayesian Formulation
The probability of a surface w(x,e) accounting for the left and right images can
be described by Bayes' formula as

    P({w(x,e)} | {I^L(q_l,e), I^R(q_r,e)}) =
        P({I^L(q_l,e), I^R(q_r,e)} | {w(x,e)}) P({w(x,e)}) / P({I^L(q_l,e), I^R(q_r,e)}) ,

where e indexes the epipolar lines. Let us develop formulas for both probability
terms in the numerator. The denominator can be computed as the
normalization constant that makes the probability sum to 1.
The Image Formation
N 1 2 N 2
   C ( e , x , w )
1
P({I L (ql , e), I R (qr , e)} |{w( x, e)})  e e0 x0
Z
C(e,x,w) Є [0,1], for x+w even, represents how similar the images are between
pixels (e,l) in the left image and (e,r) in the right image, given that they match.
The epipolar lines are indexed by e.
left
left
right
  Iˆ L (l , e,0,5)  Iˆ R (r, e,0,5) 2  Iˆ L (l , e,  ,5)  Iˆ R (r, e,  ,5) 2
 ,
,
C (e, x, w)  min 




128
128





right

1 , x  w even


We use “left” and “right” windows to account for occlusions. We also “spread” the
difference in intensities that are below 128 values as much as possible, assuming
differences above 128 to be all “unacceptable”, i.e., C(e,x,w)=1.
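A minimal sketch of this saturating match cost, assuming the normalization is chosen so that a windowed difference of 128 or more saturates at 1 (an assumption; `d_left`, `d_right` stand for the two windowed intensity differences):

```python
def match_cost(d_left, d_right):
    """C(e, x, w) for x + w even, from the absolute windowed intensity
    differences over the left and the right window.  Differences below
    128 are spread over [0, 1); 128 or more saturates at 1 (assumed)."""
    spread = lambda d: min(1.0, (d / 128.0) ** 2)
    return min(spread(d_left), spread(d_right))

assert match_cost(0, 0) == 0.0
assert match_cost(200, 300) == 1.0   # differences >= 128 are "unacceptable"
assert match_cost(300, 64) == 0.25   # the better window wins: (64/128)^2
```

Taking the minimum over the two windows means a pixel near an occlusion boundary is judged by whichever side of it is actually visible in both images.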
The Image Formation II
N 1 2 N 2
   C ( e , x , w )
1
P({I L (ql , e), I R (qr , e)} |{w( x, e)})  e e0 x0
Z
C(e,x,w) Є [0,1], for x+w odd, represents how similar intensity edges are at (e,l > e,l+1) in the left image and at (e,r ->e,r+1) in the right image.
 l DI L (l ,0, e)  DI R (r ,0, e)  1 
,
C (e, x, w)  min1,
 3 DI L (l ,0, e)  DI R (r ,0, e)  l 


 x  w
 x  w
x  w odd , where l  
,
r

 2 
 2 
R
Note that when DI L (l,0, e)  0 and DI (r,0, e)  0 than C(e,x,w) ~ 0.3.
An intensity edge should not be encouraged to match a non-intensity edge, i.e,
when DI R (r,0, e)  DI L (l ,0, e)  0 or DI L (l,0, e)  DI R (r,0, e)  0 l/3 as long
as DI R (r,0, e)  l or DI L (l,0, e)  l . Note that at occlusions this may occur
and thus, l/3 becomes the occlusion cost. If we set l ~ 2 we obtain an
oclusion cost of about C(e,x,w)~0.7.
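A minimal sketch, assuming the edge cost has the form min(1, (λ|DI_L - DI_R| + 1) / (3(|DI_L| + |DI_R| + 1))), which reproduces the three regimes quoted above (≈0.3 with no edges, small for matching edges, ≈λ/3 for edge vs. non-edge):

```python
LAMBDA = 2.0  # the slide suggests lambda ~ 2

def edge_cost(di_left, di_right, lam=LAMBDA):
    """C(e, x, w) for x + w odd, from the intensity-derivative magnitudes
    DI_L, DI_R at the candidate match (assumed functional form)."""
    num = lam * abs(di_left - di_right) + 1.0
    den = 3.0 * (abs(di_left) + abs(di_right) + 1.0)
    return min(1.0, num / den)

assert abs(edge_cost(0, 0) - 1 / 3) < 1e-12   # no edges: cost ~ 0.3
assert edge_cost(100, 100) < 0.01             # matching strong edges: cheap
assert abs(edge_cost(0, 100) - 2 / 3) < 0.01  # edge vs. non-edge: ~ lambda/3
```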
The Prior Model of Surfaces I
N 1 2 N 2
 F ( e , x , j , w( x ,e ), w( x 1,e ))
1  
Pw( x, e)  e e0 x0
Z
r+1
Tilted
where w( x, e)  w( x 1, e)  1 , e, x
r=5
w
r-1
For x  w even
F (e, x, w( x, e), w( x  1, e)):
w=2
l-1 l=3 l+1
if w( x, e)  w( x  1, e)
F (e, x, w( x, e), w( x  1, e))  0
 w( x, e)  w( x  1, e)  1 


if 
or

 w( x, e)  w( x  1, e)  1 


x
Right Epipolar Line
x
r+1
r=5
x=8
r-1
w
F (e, x, w( x, e), w( x  1, e))  TILT _ Cost
w=2
l-1
© 2006 by Davi Geiger
Computer Vision
l=3
l+1
November 2006
L1.20
The Prior Model of Surfaces I, cont...
For x + w odd, F(e,x,w(x,e),w(x+1,e)) is given by:

    F = 0                                       if w(x+1,e) = w(x,e)       (pixel to subpixel) ;
    F = Occlusion_Cost / (|DI^L(l,0,e)| + 1)    if w(x+1,e) = w(x,e) + 1   (occlusion) ;
    F = Occlusion_Cost / (|DI^R(r,0,e)| + 1)    if w(x+1,e) = w(x,e) - 1   (occlusion) ,

where l = ⌊(x - w)/2⌋ and r = ⌊(x + w)/2⌋. In the occluded cases a pixel is left with
no match, and the cost is lower where the image has a strong intensity edge.
The Prior Model of Surfaces II
N 1 2 N 2
 D ( e , x , w( x ,e ), w( x ,e 1))
1  
Pw( x, e)  e e0 x0
Z
D(e, x, w( x, e), w( x, e  1))) 
we ( x, e)  we1 ( x, e  1)
3
3


Iˆ L (l ( x, we ), e, ,5)  Iˆ R (r ( x, we ), e, ,5)  Iˆ L (l ( x, we1 ), e  1, ,5)  Iˆ R (r ( x, we1 ), e  1, ,5)  1
2
2
2
2
Epipolar interaction: the larger the intensity edges the less the cost (the
higher the probability) to have disparity changes across epipolar lines
x
r+1
r=5
w
r-1
w=2
l-1 l=3 l+1
© 2006 by Davi Geiger
Computer Vision
November 2006
L1.22
Limit Disparity
The search is restricted to a range of 2D+1 disparity values, i.e., |w| = |r - l| ≤ D.
The rationale is:
(i) fewer computations;
(ii) larger disparity matches imply larger errors in the 3D estimation;
(iii) humans only fuse stereo images within a limit, called Panum's limit.
We may start the computations at x = D to avoid limiting the range of w
values. In this case, we also limit the computations to x = 2N-2-D.
[Figure: the matching space restricted to the band |w| ≤ D (here D = 3).]
Dynamic Programming
Stereo-Matching-DP(ImageLeft, ImageRight, D, e)   (to be solved line by line)
Initialize
    Create the graph G(V(x,w), E(x,w,x-1,w'))  (length is 2N-1 and width is 2D+1)
    /* the match and transition costs are precomputed and stored in arrays C and F */
    loop for v = (x,w) in V  (i.e., loop for x and loop for w)
        Set-Array C(x, 2w+1, e)   (see previous slides for the formula)
        loop for u = (x',w') such that x' = x-1, |w'-w| ≤ 1
            Set-Array F(x, 2w+1, 2w'+1, e)   (see previous slides for the formula)
        end loop
    end loop
Main loop
    loop for x = D, D+1, ..., 2N-2-D
        loop for w = -D, ..., 0, ..., D
            Cost = ∞
            loop for w' = w-1, w, w+1
                Temp = F*_{x-1}(2w'+1) + F(x, 2w+1, 2w'+1, e)
                if Temp < Cost
                    Cost = Temp
                    back_x(2w+1) = 2w'+1
            end loop
            F*_x(2w+1) = Cost + C(x, 2w+1, e)
        end loop
    end loop
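The same recurrence can be sketched as runnable code for one scanline. The match cost (normalized absolute difference, with a flat cost on subpixel/occlusion slots) and the fixed transition cost are toy stand-ins for the C and F arrays of the previous slides, not Geiger's actual choices:

```python
def scanline_dp(left, right, D, trans_cost=0.3):
    """Toy dynamic-programming stereo matcher for one epipolar line, in
    cyclopean coordinates x = r + l, w = r - l with |w| <= D."""
    N = len(left)
    INF = float("inf")
    best = {w: 0.0 for w in range(-D, D + 1)}   # accumulated costs F*
    back = []                                   # backpointers per column
    for x in range(D, 2 * N - 1 - D):
        new, bp = {}, {}
        for w in range(-D, D + 1):
            if (x + w) % 2 == 0:                # integer-pixel match
                l, r = (x - w) // 2, (x + w) // 2
                match = abs(left[l] - right[r]) / 255.0
            else:                               # subpixel/occlusion slot
                match = 0.5
            cost, prev = INF, w
            for wp in (w - 1, w, w + 1):        # |w' - w| <= 1 transitions
                if -D <= wp <= D:
                    c = best[wp] + (0.0 if wp == w else trans_cost)
                    if c < cost:
                        cost, prev = c, wp
            new[w], bp[w] = cost + match, prev
        best = new
        back.append(bp)
    w = min(best, key=best.get)                 # best final disparity
    path = [w]
    for bp in reversed(back[1:]):               # trace backpointers
        w = bp[w]
        path.append(w)
    path.reverse()
    return path
```

On a scanline that is a copy of the other shifted by one pixel, the recovered path sits at disparity w = 1 almost everywhere, deviating only near the ends where the costs are not yet informative.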
Stereo Correspondence: Belief Propagation (BP)
We now have the posterior distribution for the disparity values (surface {w(x,e)}):

    P({w(x,e)} | I^L, I^R) = (1/Z) e^{- Σ_{e=0}^{N-1} Σ_{x=0}^{2N-2} [ C(w(x,e),x,e)
                               + F(e,x,w(x,e),w(x+1,e)) + λ D(e,x,w(x,e),w(x,e+1)) ]}

We want to obtain/compute the marginal

    P(w(x,e)) = Σ P({w(x',e')} | I^L, I^R) ,

where the sum runs over all assignments w(x',e') ∈ {-D, …, D} of every site
(x',e') ≠ (x,e). These are exponentially many computations in the size (length)
of the grid, N.
Kai Ju’s approximation to BP
We use Kai Ju's Ph.D. thesis work to approximate the (x,e) graph/lattice by horizontal
and vertical graphs, which are singly connected. Thus, exact computation of the
marginals in these graphs can be obtained in linear time. We combine the probabilities
P^h(w(x,e)) and P^v(w(x,e)) obtained for the horizontal and vertical graphs, for each
lattice site, by “picking” the “best” one (the one with lower entropy, where
S(x,e) = - Σ_{w=-D}^{D} P(w(x,e)) log P(w(x,e)) ).
[Figure: the “horizontal” and “vertical” belief trees over the (x, e) lattice.]
Result
Some Issues in Stereo:
Junctions and their properties: false matches that reveal information
from vertical disparities (see Malik 94, ECCV).
[Figure: junction configurations for Regions A and B as seen in the left and
right images.]