GRAHAM TAYLOR
UNSUPERVISED LEARNING
SCHOOL OF ENGINEERING
UNIVERSITY OF GUELPH
Deep Learning for Computer Vision Tutorial @ CVPR 2014
Columbus, OH
23 June 2014
Motivation
• Most impressive results in deep learning have been obtained with purely supervised learning methods (see previous talk)
• In vision, typically classification (e.g. object recognition)
• Though progress has been slower, it is likely that unsupervised learning will be important to future advances in DL
Image: Krizhevsky (2012) - AlexNet, the “hammer” of DL (excerpt shows the two-GPU CNN architecture figure and the paper's section on reducing overfitting with data augmentation)
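The excerpt above describes two cheap, label-preserving augmentations: random 224×224 crops with horizontal reflections at training time, and averaging the softmax over ten fixed crops at test time. The sketch below illustrates that crop/flip scheme in plain NumPy; the image shapes and the stand-in `predict` callable are assumptions for illustration, not AlexNet's actual implementation.

```python
# Minimal sketch of crop/flip augmentation: random 224x224 crops plus horizontal
# reflections at training time, ten-crop (4 corners + centre, each mirrored)
# averaging at test time. Shapes and the `predict` stand-in are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def random_crop_flip(img, crop=224):
    """Draw one random crop (and maybe mirror it) from an HxWx3 image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]          # horizontal reflection
    return patch

def ten_crop(img, crop=224):
    """Four corner crops + centre crop, plus their horizontal reflections."""
    h, w, _ = img.shape
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    patches = [img[t:t + crop, l:l + crop] for t, l in zip(tops, lefts)]
    return patches + [p[:, ::-1] for p in patches]

def test_time_prediction(img, predict, crop=224):
    """Average the softmax outputs of `predict` over the ten crops."""
    return np.mean([predict(p) for p in ten_crop(img, crop)], axis=0)

# Toy usage: a 256x256 RGB image and a fake softmax standing in for a trained CNN.
img = rng.random((256, 256, 3))
fake_softmax = lambda patch: np.full(1000, 1 / 1000)
print(random_crop_flip(img).shape)                    # (224, 224, 3)
print(test_time_prediction(img, fake_softmax).shape)  # (1000,)
```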
An Interesting Historical Fact
• Unsupervised learning was the catalyst for the present DL revolution that started around 2006
• Now we can train deep supervised neural nets without “pre-training”, thanks to
- Algorithms (nonlinearities, regularization)
- More data
- Better computers (e.g. GPUs)
• Should we still care about unsupervised learning?
[Figure: greedy layer-wise pre-training (circa 2006) - hidden layers h1, h2, h3 stacked on input x, trained one at a time with weights W1, W2, W3]
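The figure sketched above is the stacked, greedy layer-wise recipe: train the first hidden layer unsupervised on the data, freeze it, train the next layer on its codes, and so on, then use the weights W1, W2, W3 to initialize a deep supervised net. Below is a minimal NumPy sketch of that recipe using tied-weight sigmoid autoencoders as a stand-in for the RBMs of the original 2006 work; layer sizes, learning rate, and epoch count are illustrative assumptions, not the tutorial's code.

```python
# Minimal sketch of greedy layer-wise pre-training with stacked tied-weight
# sigmoid autoencoders (an illustrative stand-in for RBM pre-training).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=50):
    """Train one tied-weight sigmoid autoencoder on X; return encoder (W, b)."""
    n_visible = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)      # hidden bias
    c = np.zeros(n_visible)     # reconstruction bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)          # encode
        R = sigmoid(H @ W.T + c)        # decode with the transposed weights
        dR = (R - X) * R * (1 - R)      # error signal at the reconstruction
        dH = (dR @ W) * H * (1 - H)     # error signal at the hidden layer
        W -= lr * (X.T @ dH + dR.T @ H) / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * dR.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Stack autoencoders: layer 1 sees the data, layer 2 sees layer 1's codes, ..."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)          # frozen features feed the next layer
    return params                        # e.g. used to initialize W1, W2, W3

# Toy usage: 200 random 64-dimensional inputs, three stacked layers.
X = rng.random((200, 64))
params = greedy_pretrain(X, layer_sizes=[32, 16, 8])
print([W.shape for W, _ in params])      # [(64, 32), (32, 16), (16, 8)]
```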
Why Unsupervised Learning?
Reason 1: We can exploit unlabelled data; much more readily available and often free.
Why Unsupervised Learning?
Reason 2: We can capture enough information about the observed variables so as to ask new questions about them; questions that were not anticipated at training time.
Image: Features from a convolutional net (Zeiler and Fergus, 2013) - excerpt shows the evolution of layer 1-5 features during training, occlusion sensitivity, and correspondence analysis
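The occlusion-sensitivity experiment described in the excerpt is easy to reproduce in outline: slide a grey square across the image and record the classifier's probability for the true class at each occluder position. The sketch below assumes a generic `predict` callable returning a softmax vector; the occluder size, stride, and fill value are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of an occlusion-sensitivity map: occlude each region with a
# grey square and record the probability of the true class. `predict`, the
# occluder size, stride, and fill value are assumptions for illustration.
import numpy as np

def occlusion_map(img, predict, true_class, size=32, stride=16, fill=0.5):
    """Return a 2-D map of P(true_class) as a grey square scans the image."""
    h, w, _ = img.shape
    rows = (h - size) // stride + 1
    cols = (w - size) // stride + 1
    heat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = img.copy()
            occluded[r * stride:r * stride + size,
                     c * stride:c * stride + size] = fill   # grey square
            heat[r, c] = predict(occluded)[true_class]
    return heat     # low values mark regions the classifier relies on

# Toy usage with a stand-in classifier.
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
fake_softmax = lambda x: np.full(1000, 1 / 1000)
print(occlusion_map(img, fake_softmax, true_class=283).shape)   # (13, 13)
```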