
Learning 3-D Scene Structure from a Single Still Image
Ashutosh Saxena, Min Sun and Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305
{asaxena,aliensun,ang}@cs.stanford.edu
Abstract
We consider the problem of estimating detailed 3-d struc-
ture from a single still image of an unstructured environment.
Our goal is to create 3-d models that are both quantitatively
accurate and visually pleasing.
For each small homogeneous patch in the image, we use a
Markov Random Field (MRF) to infer a set of “plane param-
eters” that capture both the 3-d location and 3-d orienta-
tion of the patch. The MRF, trained via supervised learning,
models both image depth cues and the relationships
between different parts of the image. Inference in our model
is tractable, and requires only solving a convex optimiza-
tion problem. Other than assuming that the environment is
made up of a number of small planes, our model makes no
explicit assumptions about the structure of the scene; this
enables the algorithm to capture much more detailed 3-d
structure than does prior art (such as Saxena et al., 2005,
Delage et al., 2005, and Hoiem et al., 2005), and also gives
a much richer experience in the 3-d fly-throughs created us-
ing image-based rendering, even for scenes with significant
non-vertical structure.
Using this approach, we have created qualitatively cor-
rect 3-d models for 64.9% of 588 images downloaded from
the internet, as compared to Hoiem et al.’s performance of
33.1%. Further, our models are quantitatively more accu-
rate than either Saxena et al. or Hoiem et al.
1. Introduction
When viewing an image such as that in Fig. 1a, a human
has no difficulty understanding its 3-d structure (Fig. 1b).
However, inferring the 3-d structure remains extremely chal-
lenging for current computer vision systems—there is an in-
trinsic ambiguity between local image features and the 3-d
location of the point, due to perspective projection.
Most work on 3-d reconstruction has focused on using
methods such as stereovision [16] or structure from mo-
tion [6], which require two (or more) images. Some methods
can estimate 3-d models from a single image, but they make
strong assumptions about the scene and work in specific set-
tings only. For example, shape from shading [18] relies on
purely photometric cues and is difficult to apply to surfaces
that do not have fairly uniform color and texture. Criminisi, Reid and Zisserman [1] used known vanishing points to determine an affine structure of the image.
Figure 1. (a) A single image. (b) A screenshot of the 3-d model generated by our algorithm.
In recent work, Saxena, Chung and Ng (SCN) [13, 14]
presented an algorithm for predicting depth from monocular
image features. However, their depthmaps, although use-
ful for tasks such as robot driving [12] or improving the per-
formance of stereovision [15], were not accurate enough to
produce visually-pleasing 3-d fly-throughs. Delage, Lee and
Ng (DLN) [4, 3] and Hoiem, Efros and Hebert (HEH) [9, 7]
assumed that the environment is made of a flat ground with
vertical walls. DLN considered indoor images, while HEH
considered outdoor scenes. They classified the image into
ground and vertical (and also sky, in the case of HEH) to produce a
simple “pop-up” type fly-through from an image. HEH focused on creating “visually-pleasing” fly-throughs, but did not produce quantitatively accurate results. More recently,
Hoiem et al. (2006) [8] also used geometric context to im-
prove object recognition performance.
In this paper, we focus on inferring detailed 3-d structure that is both quantitatively accurate and visually pleasing. Other than “local planarity,” we make no explicit
assumptions about the structure of the scene; this enables our
approach to generalize well, even to scenes with significant
non-vertical structure. We infer both the 3-d location and the
orientation of the small planar regions in the image using a
Markov Random Field (MRF). Using supervised learning, we learn both the relation between image features and the location/orientation of the planes, and the relationships between different parts of the image. For comparison, we
also present a second MRF, which models only the location
of points in the image. Although quantitatively accurate, this
method is unable to give visually pleasing 3-d models. MAP
inference in our models is efficiently performed by solving
a linear program.
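To make the plane-parameter representation and the LP-based inference concrete, here is a minimal sketch, not the exact formulation developed later in this paper. It assumes each patch i is summarized by a vector alpha_i in R^3 such that points q on the patch's plane satisfy alpha_i'q = 1, so the depth along a unit viewing ray R is d = 1/(R'alpha_i); the toy rays, feature-based depth estimates, patch adjacency, and the use of the cvxpy modeling library are all illustrative assumptions, not details taken from the paper.

# Minimal sketch (assumed formulation, see lead-in): MAP inference over
# per-patch plane parameters as a convex program with L1 penalties.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_patches = 4
rays = rng.normal(size=(n_patches, 3))             # one viewing ray per patch (toy values)
rays /= np.linalg.norm(rays, axis=1, keepdims=True)
d_hat = np.array([2.0, 2.1, 3.0, 3.2])             # depths predicted from image features (toy values)
neighbors = [(0, 1), (1, 2), (2, 3)]               # which patches are adjacent in the image

alpha = cp.Variable((n_patches, 3))                # plane parameters, one row per patch

# Data term: since depth d = 1/(R'alpha), penalize |R_i'alpha_i - 1/d_hat_i|.
data = sum(cp.abs(rays[i] @ alpha[i] - 1.0 / d_hat[i]) for i in range(n_patches))
# Smoothness term: neighboring patches should have similar plane parameters,
# a crude stand-in for the learned relationships between parts of the image.
smooth = sum(cp.norm1(alpha[i] - alpha[j]) for (i, j) in neighbors)

cp.Problem(cp.Minimize(data + 0.5 * smooth)).solve()
depths = 1.0 / np.sum(rays * alpha.value, axis=1)  # recover per-patch depths
print(depths)

Because absolute values and L1 norms are linear-program representable, MAP inference in a model of this shape reduces to an LP, which is the tractability property claimed above.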
Using this approach, we have inferred qualitatively correct 3-d models for 64.9% of 588 images downloaded from the internet, as compared to HEH's performance of 33.1%.