2. Motivation and scope
Compiling a complete survey of existing stereo methods,
even restricted to dense two-frame methods, would be a
formidable task, as a large number of new methods are pub-
lished every year. It is also arguable whether such a survey
would be of much value to other stereo researchers, besides
being an obvious catch-all reference. Simply enumerating
different approaches is unlikely to yield new insights.
Clearly, a comparative evaluation is necessary to assess
the performance of both established and new algorithms and
to gauge the progress of the field. The publication of a simi-
lar study by Barron et al. [8] has had a dramatic effect on the
development of optical flow algorithms. Not only is the per-
formance of commonly used algorithms better understood
by researchers, but novel publications have to improve in
some way on the performance of previously published tech-
niques [86]. A more recent study by Mitiche and Bouthemy
[78] reviews a large number of methods for image flow com-
putation and isolates central problems, but does not provide
any experimental results.
In stereo correspondence, two previous comparative pa-
pers have focused on the performance of sparse feature
matchers [54, 19]. Two recent papers [111, 80] have devel-
oped new criteria for evaluating the performance of dense
stereo matchers for image-based rendering and tele-presence
applications. Our work is a continuation of the investiga-
tions begun by Szeliski and Zabih [116], which compared
the performance of several popular algorithms, but did not
provide a detailed taxonomy or as complete a coverage of
algorithms. A preliminary version of this paper appeared
in the CVPR 2001 Workshop on Stereo and Multi-Baseline
Vision [99].
An evaluation of competing algorithms has limited value
if each method is treated as a “black box” and only final
results are compared. More insights can be gained by exam-
ining the individual components of various algorithms. For
example, suppose a method based on global energy mini-
mization outperforms other methods. Is the reason a better
energy function, or a better minimization technique? Could
the technique be improved by substituting different matching
costs?
In this paper we attempt to answer such questions by
providing a taxonomy of stereo algorithms. The taxonomy
is designed to identify the individual components and de-
sign decisions that go into a published algorithm. We hope
that the taxonomy will also serve to structure the field and
to guide researchers in the development of new and better
algorithms.
2.1. Computational theory
Any vision algorithm, explicitly or implicitly, makes as-
sumptions about the physical world and the image formation
process. In other words, it has an underlying computational
theory [74, 72]. For example, how does the algorithm mea-
sure the evidence that points in the two images match, i.e.,
that they are projections of the same scene point? One com-
mon assumption is that of Lambertian surfaces, i.e., surfaces
whose appearance does not vary with viewpoint. Some al-
gorithms also model specific kinds of camera noise, or dif-
ferences in gain or bias.
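To make the gain/bias point concrete, here is a minimal sketch (the function and test values are ours, not drawn from any particular published method) of a matching cost that is invariant to affine intensity changes between the two views, using zero-mean normalized cross-correlation:

```python
import numpy as np

def zncc(patch_left, patch_right):
    """Zero-mean normalized cross-correlation between two image patches.

    Each patch is centered (removing a bias offset) and scaled by its
    own energy (removing a gain factor), so the score is unchanged
    under affine intensity changes I' = a*I + b between the views.
    Returns a value in [-1, 1]; 1 means a perfect match.
    """
    a = patch_left.astype(float) - patch_left.mean()
    b = patch_right.astype(float) - patch_right.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0:
        return 0.0  # flat patch: no matching evidence either way
    return float((a * b).sum() / denom)

# A patch and a gain/bias-altered copy of it still match perfectly:
p = np.array([[10, 20], [30, 40]])
q = 2 * p + 5  # simulated gain (x2) and bias (+5) between cameras
```

Simpler costs such as absolute or squared intensity differences lack this invariance, which is why some methods explicitly model gain and bias instead.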
Equally important are assumptions about the world
or scene geometry and the visual appearance of objects.
Starting from the fact that the physical world consists
of piecewise-smooth surfaces, algorithms have built-in
smoothness assumptions (often implicit) without which the
correspondence problem would be underconstrained and ill-
posed. Our taxonomy of stereo algorithms, presented in Sec-
tion 3, examines both matching assumptions and smoothness
assumptions in order to categorize existing stereo methods.
Finally, most algorithms make assumptions about camera
calibration and epipolar geometry. This is arguably the best-
understood part of stereo vision; we therefore assume in
this paper that we are given a pair of rectified images as
input. Recent references on stereo camera calibration and
rectification include [130, 70, 131, 52, 39].
2.2. Representation
A critical issue in understanding an algorithm is the represen-
tation used internally and output externally by the algorithm.
Most stereo correspondence methods compute a univalued
disparity function d(x, y) with respect to a reference image,
which could be one of the input images, or a “cyclopean”
view in between some of the images.
Other approaches, in particular multi-view stereo meth-
ods, use multi-valued [113], voxel-based [101, 67, 34, 33,
24], or layer-based [125, 5] representations. Still other ap-
proaches use full 3D models such as deformable models
[120, 121], triangulated meshes [43], or level-set methods
[38].
Since our goal is to compare a large number of methods
within one common framework, we have chosen to focus on
techniques that produce a univalued disparity map d(x, y)
as their output. Central to such methods is the concept of a
disparity space (x, y, d). The term disparity was first intro-
duced in the human vision literature to describe the differ-
ence in location of corresponding features seen by the left
and right eyes [72]. (Horizontal disparity is the most com-
monly studied phenomenon, but vertical disparity is possible
if the eyes are verged.)
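A hypothetical sketch (ours, not a method from the literature) may help make the disparity space (x, y, d) and the univalued disparity map d(x, y) concrete: build a cost volume indexed by pixel position and disparity, then pick the per-pixel winner. Real algorithms aggregate costs over windows or minimize a global energy rather than taking the raw per-pixel minimum.

```python
import numpy as np

def disparity_map(left, right, max_d):
    """Winner-take-all disparity from a simple (x, y, d) cost volume.

    left, right: rectified grayscale images as 2-D float arrays.
    Cost is raw absolute intensity difference, the simplest choice.
    """
    h, w = left.shape
    cost = np.full((h, w, max_d + 1), np.inf)
    for d in range(max_d + 1):
        # Left pixel (x, y) is compared against right pixel (x - d, y).
        cost[:, d:, d] = np.abs(left[:, d:] - right[:, : w - d])
    return np.argmin(cost, axis=2)  # univalued d(x, y)

# Synthetic rectified pair with a constant disparity of 3:
rng = np.random.default_rng(0)
left = rng.random((4, 12))
right = np.zeros_like(left)
right[:, :9] = left[:, 3:]  # right view is the left view shifted by 3
disp = disparity_map(left, right, max_d=5)
```

On this synthetic pair the recovered map equals 3 wherever the match is visible in both views; the leftmost columns are unreliable because their true matches fall outside the right image, a border effect every disparity-space method must handle.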
In computer vision, disparity is often treated as synony-
mous with inverse depth [20, 85]. More recently, several re-
searchers have defined disparity as a three-dimensional pro-
jective transformation (collineation or homography) of 3-D
space (X, Y, Z). The enumeration of all possible matches
in such a generalized disparity space can be easily achieved
with a plane sweep algorithm [30, 113], which for every
disparity d projects all images onto a common plane using