Binocular Stereo
© 2010 by Davi Geiger, Computer Vision, March 2010
Binocular Stereo

There are various methods of extracting relative depth from images. Some of the "passive" ones are based on:
(i) relative size of known objects,
(ii) texture variations,
(iii) occlusion cues, such as the presence of T-junctions,
(iv) motion information,
(v) focusing and defocusing,
(vi) relative brightness.

Moreover, there are active methods, such as:
(i) radar, which emits beams of radio waves (sonar uses sound waves),
(ii) laser, which uses beams of light.

Stereo vision is unique because it is both passive and accurate.
Human Stereo: Random Dot Stereogram

Julesz's Random Dot Stereogram. The left image, a black and white image, is generated by a program that assigns a black or white value to each pixel according to a random number generator.

The right image is constructed by copying the left image, but an imaginary square inside the left image is displaced a few pixels to the left, and the empty space is filled with black and white values chosen at random. When the stereo pair is shown, observers can identify/match the imaginary square in both images and consequently "see" a square in front of the background. This shows that stereo matching can occur without recognition.
Human Stereo: Illusory Contours

Stereo matching occurs in the presence of illusory contours. Here not only do the illusory figures in the left and right images not match, but stereo matching also yields illusory figures not seen in either the left or the right image alone.

Not even the identification/matching of the illusory contour is known prior to the stereo process. These pairs give evidence that the human visual system does not process illusory contours/surfaces before processing binocular vision. Accordingly, binocular vision will hereafter be described as a process that does not require any recognition or contour detection a priori.
Human Stereo: Half Occlusions

[Figure: left and right views of a stereo pair illustrating half-occluded regions.]

An important aspect of the stereo geometry is half-occlusion. There are regions of the left image that have no match in the right image, and vice versa. These unmatched regions, or half-occlusions, contain important information about the reconstruction of the scene. Even though these regions can be small, they affect the overall matching scheme, because the rest of the matching must reconstruct a scene that accounts for the half-occlusion.

Leonardo da Vinci noted that the larger the discontinuity between two surfaces, the larger the half-occlusion. Nakayama and Shimojo (1991) first showed stereo pairs where adding a single dot to one image, as above, thereby inducing occlusions, affected the overall matching of the stereo pair.
Projective Camera

[Figure: pinhole camera with center of projection O, image plane at focal length f, and a 3D point P_O projecting to the image point p_o = (x_o, y_o, f).]

Let $P = (X, Y, Z)$ be a point in the 3D world represented by a "world" coordinate system. Let $O$ be the center of projection of a camera, where a camera reference frame is placed. The camera coordinate system has its $z$ component perpendicular to the camera frame (where the image is produced), and the distance between the center $O$ and the camera frame is the focal length, $f$. In this coordinate system the point $P = (X, Y, Z)$ is described by the vector $\vec{P}_O = (X_O, Y_O, Z_O)$, and the projection of this point to the image (the intersection of the line $\overline{PO}$ with the camera frame) is given by the point $\vec{p}_o = (x_o, y_o, f)$, where

$$\vec{p}_o = \frac{f}{Z_O}\, \vec{P}_O .$$
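As a quick illustration, here is a minimal sketch of this projection in Python (the function name and the example point are mine, not from the slides):

```python
import numpy as np

def project(P_cam, f):
    """Pinhole projection: map a 3D point in camera coordinates
    to the image plane at distance f from the center of projection."""
    P_cam = np.asarray(P_cam, dtype=float)
    if P_cam[2] <= 0:
        raise ValueError("point must be in front of the camera")
    # p_o = (f / Z) * P_cam, so the third component is exactly f
    return (f / P_cam[2]) * P_cam

# Example: a point 2 units away projects to (f*X/Z, f*Y/Z, f)
print(project((0.4, 0.2, 2.0), f=1.0))  # -> [0.2 0.1 1. ]
```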
Projective Camera and Image Coordinate System

[Figure: image plane with pixel-coordinate origin and principal point (o_x, o_y); $\vec{q}_o = (q_{x_0}, q_{y_0}, 1)$ are the pixel coordinates.]

$$x_0 = (q_{x_0} - o_x)\, s_x; \qquad y_0 = (q_{y_0} - o_y)\, s_y,$$

where the intrinsic parameters of the camera, $(s_x, s_y)$, $(o_x, o_y)$, and $f$, represent the size of the pixels (say, in millimeters) along the x and y directions, the coordinates in pixels of the image center (also called the principal point), and the focal length of the camera.

We have neglected to account for the radial distortion of the lenses, which would give additional intrinsic parameters. The equation above can be described by the linear transformation

$$\vec{p}_o = Q\, \vec{q}_0, \qquad Q = \begin{pmatrix} s_x & 0 & -s_x o_x \\ 0 & s_y & -s_y o_y \\ 0 & 0 & f \end{pmatrix};$$

$$\vec{q}_o = Q^{-1}\, \vec{p}_0, \qquad Q^{-1} = \begin{pmatrix} \frac{1}{s_x} & 0 & \frac{o_x}{f} \\ 0 & \frac{1}{s_y} & \frac{o_y}{f} \\ 0 & 0 & \frac{1}{f} \end{pmatrix}.$$
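A small sketch of the pixel-to-image-plane conversion, assuming the Q reconstructed above (the numeric values are toy values of my own):

```python
import numpy as np

def make_Q(sx, sy, ox, oy, f):
    """Intrinsic matrix mapping homogeneous pixel coords q = (qx, qy, 1)
    to image-plane coords p = (x0, y0, f)."""
    return np.array([[sx, 0.0, -sx * ox],
                     [0.0, sy, -sy * oy],
                     [0.0, 0.0, f]])

Q = make_Q(sx=0.01, sy=0.01, ox=320, oy=240, f=4.0)  # toy intrinsics
q = np.array([400.0, 300.0, 1.0])                    # a pixel
p = Q @ q                        # image-plane point (x0, y0, f)
q_back = np.linalg.inv(Q) @ p    # recovers (400, 300, 1)
print(p, q_back)
```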
Two Projective Cameras

[Figure: a 3D point P = (X, Y, Z) viewed by two cameras with centers O_l and O_r, focal length f, image points p_l and p_r, and translation vector T between the camera origins.]

A 3D point P, viewed in the cyclopean coordinate system, is projected on both cameras. The same point P described by a coordinate system in the left eye is P_l, and described by a coordinate system in the right eye is P_r. The translation vector T brings the origin of the left coordinate system to the origin of the right coordinate system.
Two Projective Cameras: Transformations

[Figure: same two-camera geometry as above.]

The transformation of coordinate systems, from left to right, is described by a rotation matrix R and a translation vector T. More precisely, a point P described as $\vec{P}_l$ in the left frame will be described in the right frame as

$$R\, \vec{P}_r = \vec{P}_l - \vec{T}, \qquad \text{i.e.,} \qquad \vec{P}_r = R^{-1}(\vec{P}_l - \vec{T}).$$
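A minimal sketch of this frame change, assuming the convention above (the pose values are made up for illustration):

```python
import numpy as np

def left_to_right(P_l, R, T):
    """Express a left-frame point in the right frame: P_r = R^{-1}(P_l - T)."""
    return np.linalg.inv(R) @ (np.asarray(P_l, float) - np.asarray(T, float))

# Example: 30-degree rotation about y, baseline T along x
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
T = np.array([0.1, 0.0, 0.0])
print(left_to_right([0.5, 0.2, 2.0], R, T))
```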
Two Projective Cameras: Epipolar Lines

[Figure: the plane through P, O_l, and O_r intersecting both image planes; epipoles e_l and e_r on the line through O_l and O_r; $\vec{P}_r = R^{-1}(\vec{P}_l - \vec{T})$.]

Each 3D point P defines a plane $P\, O_l\, O_r$. This plane intersects the two camera frames, creating two corresponding epipolar lines. The line $\overline{O_l O_r}$ intersects the camera planes at $\vec{e}_l$ and $\vec{e}_r$, known as the epipoles. The line $\overline{O_l O_r}$ is common to every plane $P\, O_l\, O_r$, and thus the two epipoles belong to all pairs of epipolar lines (the epipoles are the "center/intersection" of all epipolar lines).
Estimating Epipolar Lines and Epipoles

The two vectors $\vec{T}$ and $\vec{P}_l$ span a 2-dimensional space. Their cross product $(\vec{T} \times \vec{P}_l)$ is perpendicular to this 2-dimensional space. Therefore, writing $\vec{P}'_l = \vec{P}_l - \vec{T} = R\, \vec{P}_r$,

$$\vec{P}_l \cdot (\vec{T} \times \vec{P}_l) = 0 \;\Rightarrow\; (\vec{P}_l - \vec{T}) \cdot (\vec{T} \times \vec{P}_l) = 0 \;\Rightarrow\; \vec{P}'_l \cdot (\vec{T} \times \vec{P}_l) = 0$$

$$\Rightarrow\; (R\, \vec{P}_r) \cdot (\vec{T} \times \vec{P}_l) = 0 \;\Rightarrow\; \vec{P}_r^{\,T} R^T S(\vec{T})\, \vec{P}_l = \vec{P}_r^{\,T} E(R, \vec{T})\, \vec{P}_l = 0 \;\Rightarrow\; \vec{p}_r^{\,T} E(R, \vec{T})\, \vec{p}_l = 0,$$

where

$$S(\vec{T}) = \begin{pmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{pmatrix} \qquad\text{and}\qquad E(R, \vec{T}) = R^T S(\vec{T})$$

is the essential matrix. (The step from $\vec{P}$ to $\vec{p}$ uses $\vec{p} = \frac{f}{Z}\vec{P}$ in each camera; the overall factor $\frac{f^2}{Z_l Z_r}$ does not affect the equality to zero.)

$$\vec{p}_r^{\,T} E(R, \vec{T})\, \vec{p}_l = 0 \;\Rightarrow\; \vec{q}_r^{\,T} Q_r^T\, E(R, \vec{T})\, Q_l\, \vec{q}_l = 0 \;\Rightarrow\; \vec{q}_r^{\,T} F(R, \vec{T}, i_l, i_r)\, \vec{q}_l = 0,$$

where $Q_l$, $Q_r$ are the intrinsic matrices of the two cameras and $i_l$, $i_r$ denote their intrinsic parameters. F is known as the fundamental matrix and needs to be estimated.
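A small numeric check of this construction (a sketch; the pose and the point are made up):

```python
import numpy as np

def skew(t):
    """S(T): the matrix with S(T) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0, -tz, ty],
                     [tz, 0, -tx],
                     [-ty, tx, 0]])

def essential(R, T):
    """Essential matrix E(R, T) = R^T S(T), as on the slide."""
    return R.T @ skew(T)

# Verify p_r^T E p_l = 0 for a point seen in both frames
c, s = np.cos(0.2), np.sin(0.2)
R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
T = np.array([0.2, 0.0, 0.0])
P_l = np.array([0.3, -0.1, 2.5])
P_r = np.linalg.inv(R) @ (P_l - T)          # P_r = R^{-1}(P_l - T)
f = 1.0
p_l, p_r = (f / P_l[2]) * P_l, (f / P_r[2]) * P_r
print(p_r @ essential(R, T) @ p_l)           # ~0 up to round-off
```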
Computing F (fundamental matrix)

"Eight point algorithm":

(i) Given two images, we need to identify eight points or more on both images, i.e., we provide $n \ge 8$ points with their correspondences. The points have to be non-degenerate.

(ii) The constraint $\vec{q}_r^{\,T} F(R, \vec{T}, i_l, i_r)\, \vec{q}_l = 0$ then gives n linear and homogeneous equations in the 9 unknown components of F. We need to estimate F only up to a scale factor, so there are only 8 unknowns to be computed from the $n \ge 8$ linear and homogeneous equations.

(iii) If n = 8 there is a unique solution (with non-degenerate points), and if n > 8 the system is overdetermined and we can use the singular value decomposition (SVD) to find the best-fit solution, as sketched below.
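A minimal sketch of the linear step (the coordinate normalization and rank-2 projection that practical implementations add are omitted; `ql` and `qr` are assumed to be n×3 arrays of homogeneous pixel coordinates):

```python
import numpy as np

def eight_point(ql, qr):
    """Estimate F from n >= 8 correspondences via q_r^T F q_l = 0.
    Each constraint is linear in the 9 entries of F."""
    A = np.array([np.outer(r, l).ravel() for l, r in zip(ql, qr)])
    # Least-squares null vector of A: the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)   # F, defined up to scale
```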
Stereo Correspondence: Ambiguities

Each potential match is represented by a square. The black ones represent the most likely scene to "explain" the images, but other combinations could have given rise to the same images (e.g., the red ones).

What makes the set of black squares preferred/unique is that they have similar disparity values, the ordering constraint is satisfied, and there is a unique match for each point. Any other set that could have given rise to the two images would have more varied disparity values, and either the ordering constraint or the uniqueness constraint would be violated. The disparity values are inversely proportional to the depth values.
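For the standard fronto-parallel two-camera geometry with baseline T and focal length f (a standard fact, not stated explicitly on the slide), this inverse relation is the triangulation formula

$$d = x_l - x_r = \frac{f\,T}{Z}, \qquad Z = \frac{f\,T}{d},$$

so large disparities correspond to near points and small disparities to far points.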
Stereo Correspondence: Matching Space

[Figure: the matching space for an epipolar line pair, with left-image pixels A–F on one axis and right-image pixels on the other; labeled features include surface-orientation discontinuities, depth discontinuities, boundaries, and "no match" (half-occluded) regions.]

In the matching space, a point (or node) represents a match of a pixel in the left image with a pixel in the right image.

Note 1: Depth discontinuities and very tilted surfaces can yield the same images (with half-occluded pixels).
Note 2: Due to pixel discretization, points A and C in the right frame are neighbors.
Cyclopean Eye

The cyclopean eye "sees" the world in 3D, where x represents the coordinate system of this eye and w is the disparity axis:

$$x = \frac{r + l}{2} \quad\text{and}\quad w = \frac{r - l}{2}, \qquad \begin{pmatrix} w \\ x \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} r \\ l \end{pmatrix}, \qquad \begin{pmatrix} r \\ l \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} w \\ x \end{pmatrix}.$$

To manipulate integer coordinate values, one can also use the following representation:

$$\begin{pmatrix} w \\ x \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} r \\ l \end{pmatrix}, \qquad \begin{pmatrix} r \\ l \end{pmatrix} = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} w \\ x \end{pmatrix}, \qquad \text{i.e.,} \quad r = \frac{x + w}{2} \quad\text{and}\quad l = \frac{x - w}{2}.$$

Restricted to integer values, for l, r = 0, …, N-1 we have x = 0, …, 2N-2 and w = -N+1, …, 0, …, N-1.

[Figure: matching space over the right epipolar line, with the example node l=3, r=5, i.e., w=2.]

Note: Not every pair (x, w) has a correspondence to (l, r) when only integer coordinates are considered. For x+w even we have integer values for the pixels r and l, and for x+w odd we have subpixel locations. Thus, the cyclopean coordinate system for integer values of (x, w) includes a subpixel image resolution.
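A sketch of the integer-valued conversion (the function names are mine):

```python
def lr_to_xw(l, r):
    """Cyclopean (integer) coordinates: x = r + l, w = r - l."""
    return r + l, r - l

def xw_to_lr(x, w):
    """Inverse map; integer pixels only when x + w is even,
    otherwise (l, r) land on half-integer (subpixel) positions."""
    return (x - w) / 2, (x + w) / 2

x, w = lr_to_xw(l=3, r=5)      # -> (8, 2), the slide's example
print(x, w, xw_to_lr(x, w))    # recovers l=3.0, r=5.0
```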
The Uniqueness-Opaque Constraint

There should be only one disparity value, i.e., one depth value, associated with each cyclopean coordinate x (see figure). The assumption is that objects are opaque, and so a 3D point P, seen at the cyclopean coordinate x with disparity value w, will cause all other disparity values at x to be disallowed. Points closer than P along the same x coordinate must be transparent air, and points farther away will not be seen, since P has already been seen (opaqueness). However, multiple matches for left-eye points or right-eye points are allowed. This is indeed required to model tilted surfaces and occluding surfaces, as we will discuss later. This constraint is a physically motivated one and is easily understood in the cyclopean coordinate system.

[Figure: given that l=3 and r=5 are matched (blue square), the red squares represent violations of the uniqueness-opaqueness constraint, while the yellow squares represent matches that are unique in the cyclopean coordinate system but are multiple matches in the left- or right-eye coordinate systems.]
Surface Constraints I

Smoothness: in nature, most surfaces are smooth in depth compared to their distance to the observer, but depth discontinuities also occur.

[Figure: given that l=3 and r=5 are matched (blue square), the red squares represent violations of the ordering constraint, while the yellow squares represent smooth matches.]
Surface Constraints: Discontinuities and Occlusions

[Figure: matching-space diagrams for an ordering violation and for a depth discontinuity.]

Discontinuities: note that in these cases some pixels will not be matched to any pixel, e.g., "l+1", and other pixels will have multiple matches, e.g., "r-1". In fact, the number of unmatched pixels in the left image is the same as the number of multiple matches in the right image.
Neighborhood (with no subpixel accuracy)

[Figure: neighborhood structure in the matching space; a jump "at the right eye" moves to w' = w + 2, a jump "at the left eye" moves to w' = w - 2, and the predecessor column is x' = x - 2 - |w' - w|.]

Neighborhood structure for a node (e, x, w), consisting of flat, tilted, or occluded surfaces. Note that when an occlusion/discontinuity occurs, the contrast matches on the front surface. Jumps "at the right eye" are from back to front, while jumps "at the left eye" are from front to back.
Limit Disparity

The search is within a range of disparity values, 2D+1, i.e., $|w| = |r - l| \le D$.

The rationale is:
(i) less computation;
(ii) larger disparity matches imply larger errors in the 3D estimation;
(iii) humans only fuse stereo images within a limit, called Panum's limit.

[Figure: matching space with the band |w| <= D, here D=3, around the example node l=3, r=5, w=2.]

We may start the computations at x = D to avoid limiting the range of w values. In this case, we also limit the computations to up to x = 2N-2-D.
Bayesian Formulation

The probability of a surface w(x,e) accounting for the left and right images can be described by Bayes' formula as

$$P(\{w(x,e)\} \mid \{I^L(q_l,e), I^R(q_r,e)\}) = \frac{P(\{I^L(q_l,e), I^R(q_r,e)\} \mid \{w(x,e)\})\; P(\{w(x,e)\})}{P(\{I^L(q_l,e), I^R(q_r,e)\})},$$

where e indexes the epipolar lines. Let us develop formulas for both probability terms in the numerator. The denominator can be computed as the normalization constant that makes the probability sum to 1.
The Image Formation I (special case)

$$P(\{I^L(q_l,e), I^R(q_r,e)\} \mid \{w(x,e)\}) = \frac{1}{Z}\, e^{-\frac{W(e,x,w)}{T}}$$

The factor $e^{-W(e,x,w)/T} \in [0,1]$, for x+w even, represents how similar the images are between pixel (e,l) in the left image and pixel (e,r) in the right image, given that they match:

$$W(e,x,w) = \min\left( \left| \hat{I}^L(l,e,0,3) - \hat{I}^R(r,e,0,3) \right|^2,\; \left| \hat{I}^L(l,e,\pi,3) - \hat{I}^R(r,e,\pi,3) \right|^2 \right), \qquad x+w \text{ even}.$$

We use "left" and "right" windows (orientations $\pi$ and $0$; the last argument is the window size) to account for occlusions.
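A rough sketch of such a left/right-window cost on one epipolar line (my own minimal reading of the formula, assuming a window of size 3 extending right for θ = 0 and left for θ = π):

```python
import numpy as np

def window_cost(IL, IR, l, r, size=3):
    """W(e,x,w): best of a right-extending and a left-extending
    window comparison, so an occluded side can be ignored."""
    right = np.sum((IL[l:l + size] - IR[r:r + size]) ** 2)
    left = np.sum((IL[l - size + 1:l + 1] - IR[r - size + 1:r + 1]) ** 2)
    return min(right, left)

IL = np.array([0., 1, 3, 3, 2, 5, 7, 2])   # toy epipolar lines
IR = np.array([0., 1, 2, 3, 3, 2, 5, 7])
print(window_cost(IL, IR, l=3, r=4))       # low cost: shifted copy
```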
The Image Formation I (intensity)

Note that when x+w is odd, the coordinates l and r are half-integers, and so an interpolated/averaged intensity value needs to be computed. For half-integer values of l and r we use the formulas

$$\hat{I}^L(e,l,\theta,s) = \frac{1}{2}\left( \hat{I}^L(e,\lfloor l \rfloor + 1,\theta,s) + \hat{I}^L(e,\lfloor l \rfloor,\theta,s) \right)$$

$$\text{and} \quad \hat{I}^R(e,r,\theta,s) = \frac{1}{2}\left( \hat{I}^R(e,\lfloor r \rfloor + 1,\theta,s) + \hat{I}^R(e,\lfloor r \rfloor,\theta,s) \right),$$

where $\lfloor x \rfloor$ is the floor value of x.

We expand the previous formula to include the matching of windows with orientations other than just $\theta = 0$, and we expand the formula to any integer value of x+w:

$$W^\theta(e,x,w) = W^\theta(e,l,r) = \min\left( \left| \hat{I}^L(l,e,\theta,3) - \hat{I}^R(r,e,\theta,3) \right|^2,\; \left| \hat{I}^L(l,e,\theta+\pi,3) - \hat{I}^R(r,e,\theta+\pi,3) \right|^2 \right)$$

$$P(\{I^L(q_l,e), I^R(q_r,e)\} \mid \{w(x,e)\}) = \frac{1}{Z}\, e^{-\frac{1}{2}\left( \frac{W^\theta(e,x,w)}{T} + \frac{W^0(e,x,w)}{T} \right)}$$
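A small sketch of the half-integer interpolation (assuming a 1D intensity array):

```python
import math

def subpixel(I, t):
    """Intensity at a possibly half-integer coordinate t:
    the average of the two neighboring integer pixels."""
    lo = math.floor(t)
    if t == lo:                 # integer coordinate: no interpolation
        return I[lo]
    return 0.5 * (I[lo] + I[lo + 1])

I = [10.0, 20.0, 40.0]
print(subpixel(I, 1), subpixel(I, 1.5))   # -> 20.0 30.0
```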
The Image Formation II (Occlusions)

Occlusions are regions where no feature match occurs. In order to consider them, we introduce an "occlusion field" O(e,x), which is 1 if an occlusion occurs at column (e,x) and zero otherwise. Thus, the likelihood of the left and right images given the feature match must take into account whether an occlusion occurs. We modify the data probability to include occlusions:

$$P(I^L, I^R \mid O(M), e, x, w) = \frac{1}{Z}\, e^{-\left( (1 - O(e,x)) \sum_{w=-D}^{D} M(e,x,w)\, \Psi(e,x,w,s) \;+\; O(e,x)\, \eta \right)},$$

where O is a binary occlusion variable determined by the binary match field $M(e,x,w) \in \{0,1\}$:

$$O(e,x) = 1 - \sum_{w=-D}^{D} M(e,x,w) \in \{0,1\}.$$

The cost $\eta$ is introduced as a prior to encourage matches; otherwise it would be better to occlude everything.
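A sketch of the occlusion field derived from a binary match field M (the array shapes are my assumption):

```python
import numpy as np

def occlusion_field(M):
    """O(e, x) = 1 - sum_w M(e, x, w), for M of shape
    (num_epipolar_lines, num_x, 2*D + 1) with entries in {0, 1}."""
    matched = M.sum(axis=2)          # 0 or 1 per column under uniqueness
    return 1 - matched               # 1 where the column is occluded

M = np.zeros((1, 4, 5), dtype=int)   # one epipolar line, 4 columns, D=2
M[0, 0, 2] = 1                       # column x=0 matched at w=0
print(occlusion_field(M))            # -> [[0 1 1 1]]
```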
The Image Formation III (Occluded Surfaces)

$$P_e(M \mid D) = \prod_{x=1}^{2N} \prod_{w=-D}^{D} \prod_{w'=-D}^{D} P(M, e, x, w, w' \mid D(e,x,w,s), D(e,x',w',s))$$

$$= \frac{1}{Z}\, e^{-\sum_{x=1}^{2N} \sum_{w=-D}^{D} \sum_{w'=-D}^{D} M(e,x,w)\, M(e,x',w') \left[ \delta(w-w')\, D(e,x,w,s) + (1 - \delta(w-w'))\, D(e,x',w',s) \right]},$$

where

$$x' = x - 1 - |w - w'|, \qquad \delta(w-w') = \begin{cases} 1 & w = w' \\ 0 & \text{otherwise} \end{cases}, \qquad \text{and} \quad \sum_{w'=-D}^{D} M(e, x - 1 - |w - w'|, w') \in \{0,1\}.$$

[Figure: example of a jump/occlusion with w=2 and w'=-1; three x coordinates will have O(e,x)=1.]
The Image Formation IV (Tilted Surfaces)

The tilted-surface term has the same structure, with $\Psi$ in place of D:

$$P_e(M \mid \Psi) = \prod_{x=1}^{2N} \prod_{w=-D}^{D} \prod_{w'=-D}^{D} P(M, e, x, w, w' \mid \Psi(e,x,w,s), \Psi(e,x',w',s))$$

$$= \frac{1}{Z}\, e^{-\sum_{x=1}^{2N} \sum_{w=-D}^{D} \sum_{w'=-D}^{D} M(e,x,w)\, M(e,x',w') \left[ \delta(w-w')\, \Psi(e,x,w,s) + (1 - \delta(w-w'))\, \Psi(e,x',w',s) \right]},$$

with $x' = x - 1 - |w - w'|$ and $\delta$ as before.

[Figure: same jump/occlusion example, w=2 and w'=-1; three x coordinates will have O(e,x)=1.]
Posterior Model

Combining the data term with the smoothness and occlusion/tilt terms, the posterior over the match field M and the disparities w is

$$P(M, w \mid I^L, I^R) = \frac{1}{Z} \prod_{e=1}^{N} \prod_{x=1}^{2N} \prod_{w=-D}^{D} e^{-\left( M(e,x,w)\, \Psi(e,x,w,s) + \eta \right)}\; e^{-\sum_{w'=w-1}^{w+1} M(e,x,w)\, M(e,x-1,w')}\; e^{-\sum_{w'=-D}^{D} M(e,x,w)\, M(e,x',w') \left[ \delta(w-w')\, D(e,x,w,s) + (1 - \delta(w-w'))\, D(e,x',w',s) \right]},$$

with $x' = x - 1 - |w - w'|$.

[Figure: neighborhood structure for a node (e,x,w) consisting of flat, tilted, or occluded surfaces.]

If there is a match at (e,x,w), then either it lies on a flat/tilted surface or it is an occluded surface:

$$M(e,x,w)\left[ 1 - \sum_{w'=w-1}^{w+1} M(e,x-1,w') \right] = \sum_{w'=-D}^{D} M(e,x',w'), \qquad x' = x - 1 - |w - w'|,$$

and the uniqueness-opaqueness constraint reads

$$\sum_{w=-D}^{D} M(e,x,w) = 1 - O(e,x) \in \{0,1\}.$$
Posterior Model (with no subpixel accuracy)

With no subpixel accuracy, the same posterior is restricted to the coarser neighborhood (jumps w' = w ± 2, arriving from x' = x - 2 - |w' - w|):

$$P(M, w \mid I^L, I^R) = \frac{1}{Z} \prod_{e=1}^{N} \prod_{x=1}^{2N} \prod_{w=-D}^{D} e^{-\left( M(e,x,w)\, \Psi(e,x,w,s) + \eta \right)}\; e^{-\sum_{w'=-D}^{D} M(e,x,w)\, M(e,x',w') \left[ \delta(w-w')\, D(e,x,w,s) + (1 - \delta(w-w'))\, D(e,x',w',s) \right]}.$$

[Figure: neighborhood structure for a node (e,x,w) consisting of flat, tilted, or occluded surfaces, now with w' = w ± 2 and x' = x - 2 - |w' - w|.]
Need for prior model: Flat or (Double) Tilted?

[Figure: two matching-space interpretations, (a) a flat plane versus (b) a tilted surface; a second pair contrasts (a) a flat plane with (b) a doubly tilted surface.]

The images and the probabilities for (a) and (b) are the same. Since we are not considering curvature preferences, the preference for flat surfaces must be built in as a prior model of surfaces.
Dynamic Programming (pixel-pixel matching only)

There are 2D+1 states per column. With $x' = x - 2 - |w - w'|$, the recursion is

$$F[x, w+D] = C[x, w+D] + \min_{w' = -D, \ldots, D} \left\{ F[x', w'+D] + F[x, x', w, w'] \right\},$$

where C[x, w+D] is the local match cost and the last term is the transition cost between nodes.

[Figure: trellis over x = 1, ..., 2N-2 with states w = -D, ..., D; each node (x, w+D) carries the local cost C[x, w+D] and looks back to candidate predecessors (x', w'+D).]
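A minimal dynamic-programming sketch of this recursion along one epipolar line. This is a simplified reading: `cost` plays the role of C, `trans` the role of the transition term, nodes step back one column at a time rather than using the slide's x' = x - 2 - |w - w'| lookback, and occlusion handling is omitted.

```python
import numpy as np

def dp_stereo(C, trans):
    """C: (num_x, 2D+1) local match costs; trans(w, w2): transition cost.
    Returns the minimum-cost state (disparity index) per column."""
    num_x, num_states = C.shape
    F = np.full((num_x, num_states), np.inf)
    back = np.zeros((num_x, num_states), dtype=int)
    F[0] = C[0]
    for x in range(1, num_x):
        for w in range(num_states):
            prev = F[x - 1] + np.array([trans(w, w2) for w2 in range(num_states)])
            back[x, w] = int(np.argmin(prev))
            F[x, w] = C[x, w] + prev[back[x, w]]
    # Backtrack the optimal state sequence from the best final node
    path = [int(np.argmin(F[-1]))]
    for x in range(num_x - 1, 0, -1):
        path.append(back[x, path[-1]])
    return path[::-1]

C = np.random.rand(8, 5)                        # toy costs, D = 2
print(dp_stereo(C, lambda w, w2: 0.3 * abs(w - w2)))
```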