2. Motivation and scope
Compiling a complete survey of existing stereo methods,
even restricted to dense two-frame methods, would be a
formidable task, as a large number of new methods are pub-
lished every year. It is also arguable whether such a survey
would be of much value to other stereo researchers, besides
being an obvious catch-all reference. Simply enumerating
different approaches is unlikely to yield new insights.
Clearly, a comparative evaluation is necessary to assess
the performance of both established and new algorithms and
to gauge the progress of the field. The publication of a simi-
lar study by Barron et al. [8] has had a dramatic effect on the
development of optical flow algorithms. Not only is the per-
formance of commonly used algorithms better understood
by researchers, but novel publications have to improve in
some way on the performance of previously published tech-
niques [86]. A more recent study by Mitiche and Bouthemy
[78] reviews a large number of methods for image flow com-
putation and isolates central problems, but does not provide
any experimental results.
In stereo correspondence, two previous comparative pa-
pers have focused on the performance of sparse feature
matchers [54, 19]. Two recent papers [111, 80] have devel-
oped new criteria for evaluating the performance of dense
stereo matchers for image-based rendering and tele-presence
applications. Our work is a continuation of the investiga-
tions begun by Szeliski and Zabih [116], which compared
the performance of several popular algorithms, but did not
provide a detailed taxonomy or as complete a coverage of
algorithms. A preliminary version of this paper appeared
in the CVPR 2001 Workshop on Stereo and Multi-Baseline
Vision [99].
An evaluation of competing algorithms has limited value
if each method is treated as a “black box” and only final
results are compared. More insights can be gained by exam-
ining the individual components of various algorithms. For
example, suppose a method based on global energy mini-
mization outperforms other methods. Is the reason a better
energy function, or a better minimization technique? Could
the technique be improved by substituting different matching
costs?
In this paper we attempt to answer such questions by
providing a taxonomy of stereo algorithms. The taxonomy
is designed to identify the individual components and de-
sign decisions that go into a published algorithm. We hope
that the taxonomy will also serve to structure the field and
to guide researchers in the development of new and better
algorithms.
2.1. Computational theory
Any vision algorithm, explicitly or implicitly, makes as-
sumptions about the physical world and the image formation
process. In other words, it has an underlying computational
theory [74, 72]. For example, how does the algorithm mea-
sure the evidence that points in the two images match, i.e.,
that they are projections of the same scene point? One com-
mon assumption is that of Lambertian surfaces, i.e., surfaces
whose appearance does not vary with viewpoint. Some al-
gorithms also model specific kinds of camera noise, or dif-
ferences in gain or bias.
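To make the gain/bias point concrete, here is a minimal sketch (the function and test values are ours, not drawn from any particular published method) of a matching cost that is invariant to affine intensity changes between the two views, using zero-mean normalized cross-correlation:

```python
import numpy as np

def zncc(patch_left, patch_right):
    """Zero-mean normalized cross-correlation between two image patches.

    Each patch is centered (removing a bias offset) and scaled by its
    own energy (removing a gain factor), so the score is unchanged
    under affine intensity changes I' = a*I + b between the views.
    Returns a value in [-1, 1]; 1 means a perfect match.
    """
    a = patch_left.astype(float) - patch_left.mean()
    b = patch_right.astype(float) - patch_right.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0:
        return 0.0  # flat patch: no matching evidence either way
    return float((a * b).sum() / denom)

# A patch and a gain/bias-altered copy of it still match perfectly:
p = np.array([[10, 20], [30, 40]])
q = 2 * p + 5  # simulated gain (x2) and bias (+5) between cameras
```

Simpler costs such as absolute or squared intensity differences lack this invariance, which is why some methods explicitly model gain and bias instead.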
Equally important are assumptions about the world
or scene geometry and the visual appearance of objects.
Starting from the fact that the physical world consists
of piecewise-smooth surfaces, algorithms have built-in
smoothness assumptions (often implicit) without which the
correspondence problem would be underconstrained and ill-
posed. Our taxonomy of stereo algorithms, presented in Sec-
tion 3, examines both matching assumptions and smoothness
assumptions in order to categorize existing stereo methods.
Finally, most algorithms make assumptions about camera
calibration and epipolar geometry. This is arguably the best-
understood part of stereo vision; we therefore assume in
this paper that we are given a pair of rectified images as
input. Recent references on stereo camera calibration and
rectification include [130, 70, 131, 52, 39].
2.2. Representation
A critical issue in understanding an algorithm is the represen-
tation used internally and output externally by the algorithm.
Most stereo correspondence methods compute a univalued
disparity function d(x, y) with respect to a reference image,
which could be one of the input images, or a “cyclopean”
view in between some of the images.
Other approaches, in particular multi-view stereo meth-
ods, use multi-valued [113], voxel-based [101, 67, 34, 33,
24], or layer-based [125, 5] representations. Still other ap-
proaches use full 3D models such as deformable models
[120, 121], triangulated meshes [43], or level-set methods
[38].
Since our goal is to compare a large number of methods
within one common framework, we have chosen to focus on
techniques that produce a univalued disparity map d(x, y)
as their output. Central to such methods is the concept of a
disparity space (x, y, d). The term disparity was first intro-
duced in the human vision literature to describe the differ-
ence in location of corresponding features seen by the left
and right eyes [72]. (Horizontal disparity is the most com-
monly studied phenomenon, but vertical disparity is possible
if the eyes are verged.)
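A hypothetical sketch (ours, not a method from the literature) may help make the disparity space (x, y, d) and the univalued disparity map d(x, y) concrete: build a cost volume indexed by pixel position and disparity, then pick the per-pixel winner. Real algorithms aggregate costs over windows or minimize a global energy rather than taking the raw per-pixel minimum.

```python
import numpy as np

def disparity_map(left, right, max_d):
    """Winner-take-all disparity from a simple (x, y, d) cost volume.

    left, right: rectified grayscale images as 2-D float arrays.
    Cost is raw absolute intensity difference, the simplest choice.
    """
    h, w = left.shape
    cost = np.full((h, w, max_d + 1), np.inf)
    for d in range(max_d + 1):
        # Left pixel (x, y) is compared against right pixel (x - d, y).
        cost[:, d:, d] = np.abs(left[:, d:] - right[:, : w - d])
    return np.argmin(cost, axis=2)  # univalued d(x, y)

# Synthetic rectified pair with a constant disparity of 3:
rng = np.random.default_rng(0)
left = rng.random((4, 12))
right = np.zeros_like(left)
right[:, :9] = left[:, 3:]  # right view is the left view shifted by 3
disp = disparity_map(left, right, max_d=5)
```

On this synthetic pair the recovered map equals 3 wherever the match is visible in both views; the leftmost columns are unreliable because their true matches fall outside the right image, a border effect every disparity-space method must handle.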
In computer vision, disparity is often treated as synony-
mous with inverse depth [20, 85]. More recently, several re-
searchers have defined disparity as a three-dimensional pro-
jective transformation (collineation or homography) of 3-D
space (X, Y, Z). The enumeration of all possible matches
in such a generalized disparity space can be easily achieved
with a plane sweep algorithm [30, 113], which for every
disparity d projects all images onto a common plane using