Learning a selectivity–invariance–selectivity feature extraction architecture for images
Michael U. Gutmann ([email protected])
Aapo Hyvärinen ([email protected])
University of Helsinki
Motivation

■ We are very good at detecting specific patterns while being invariant/tolerant to possible variations.
■ It is the pairing of selectivity with invariance which is important ("tolerant selectivity").
■ Tolerant selectivities occur at multiple levels. Lower- and higher-level examples:
  (a) "Low-level": same face, luminance and contrast vary
  (b) "Higher-level": same face, facial expression varies
      (From "Facial Expressions: A Visual Reference for Artists" by Mark Simon.)
Question asked and methodology

■ Basic hypothesis: Higher-level tolerant selectivities emerge through a sequence of elementary selectivity and invariance computations.
  (see for example: Riesenhuber & Poggio, Nature 1999; Kouh & Poggio, NeCo 2008; Rust & Stocker, Curr Op Neurobiol 2010)
■ Question asked: In a system with three processing layers, what should be selected and tolerated at each level of the hierarchy?
■ Methodology:
  ◆ Learn the selectivity and invariance computations from images, using as few assumptions as possible.
  ◆ Learning ≡ fitting a probability density function
Data and preprocessing

■ Tiny images dataset, converted to gray scale: complete scenes downsampled to 32 by 32 images (Torralba et al, TPAMI 2008)
■ Preprocessing:
  ◆ Removing the DC component
  ◆ Normalizing the norm after whitening
  ◆ Reducing the dimension from 32 · 32 = 1024 to 200
■ Preprocessing can be considered a form of luminance and contrast gain control, followed by low-pass filtering.

[Figure: examples from the tiny images dataset before preprocessing.]
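The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: the slide does not specify the whitening method, so PCA whitening is assumed here, and the function name `preprocess` and its exact ordering of steps are hypothetical.

```python
import numpy as np

def preprocess(images, dim=200):
    """Sketch of the preprocessing: DC removal, PCA whitening with
    dimension reduction (assumed method), then norm normalization."""
    X = images.reshape(len(images), -1).astype(float)   # flatten 32x32 -> 1024
    X -= X.mean(axis=1, keepdims=True)                  # remove the DC component
    X -= X.mean(axis=0)                                 # center across the sample
    C = np.cov(X, rowvar=False)                         # covariance for whitening
    eigval, eigvec = np.linalg.eigh(C)
    keep = np.argsort(eigval)[::-1][:dim]               # keep the top `dim` components
    W = eigvec[:, keep] / np.sqrt(eigval[keep])         # whitening matrix (1024 x dim)
    Z = X @ W                                           # whitened, dimension-reduced data
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)       # normalize the norm
    return Z
```

The dimension reduction to 200 components discards the high-frequency directions, which is the low-pass filtering interpretation mentioned on the slide.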
Feature extraction architecture

■ Let $x \in \mathbb{R}^{200}$ be a vectorized image after preprocessing. Feature extraction with three processing layers:

  $y_i^{(1)} = w_i^{(1)\top} x, \quad i = 1, \ldots, 100$

  $y_k^{(2)} = f_{\mathrm{th}}\!\left( \ln\!\left( \sum_{i=1}^{100} w_{ki}^{(2)} \big(y_i^{(1)}\big)^2 + 1 \right) + b_k^{(2)} \right), \quad k = 1, \ldots, 50$

  $\tilde{y}^{(2)} = \text{gain control}\big(y^{(2)}\big), \qquad y_j^{(3)} = f_{\mathrm{th}}\!\left( w_j^{(3)\top} \tilde{y}^{(2)} + b_j^{(3)} \right), \quad j = 1, \ldots, n^{(3)}$

■ Thresholding function $f_{\mathrm{th}}(u)$: smooth version of $\max(u, 0)$
■ Gain control: centering, normalizing the norm after whitening, dimension reduction (similar to the preprocessing)
■ Parameters of interest: feature vectors $w_i^{(1)}$, pooling weights $w_{ki}^{(2)} \geq 0$, higher-order feature vectors $w_j^{(3)}$
■ Other parameters: the thresholds $b_k^{(2)}$ and $b_j^{(3)}$
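A forward-pass sketch of the three layers: a sketch only, since the slide does not give the exact form of $f_{\mathrm{th}}$ (a softplus with sharpness `beta` is assumed) and the gain control is simplified here to centering plus norm normalization, omitting the whitening and dimension-reduction steps. All names and shapes are hypothetical.

```python
import numpy as np

def f_th(u, beta=10.0):
    # smooth version of max(u, 0): a softplus with sharpness beta (assumed form)
    return np.logaddexp(0.0, beta * u) / beta

def features(x, W1, W2, b2, W3, b3):
    """Three-layer feature extraction following the slide's equations.
    W1: (200, 100) first-layer vectors w_i^(1); W2: (50, 100) nonnegative
    pooling weights w_ki^(2); W3: (50, n3) higher-order vectors w_j^(3)."""
    y1 = W1.T @ x                                 # y_i^(1) = w_i^(1)^T x
    y2 = f_th(np.log(W2 @ y1**2 + 1.0) + b2)      # pooled, log-compressed, thresholded
    y2t = y2 - y2.mean()                          # simplified gain control:
    y2t = y2t / np.linalg.norm(y2t)               #   centering + norm normalization
    y3 = f_th(W3.T @ y2t + b3)                    # y_j^(3)
    return y1, y2, y3

rng = np.random.default_rng(0)
x = rng.normal(size=200)
W1 = rng.normal(size=(200, 100))
W2 = rng.random((50, 100))                        # pooling weights kept >= 0
W3 = rng.normal(size=(50, 10))
y1, y2, y3 = features(x, W1, W2, b2=-1.0, W3=W3, b3=0.0)
```

The squaring of $y_i^{(1)}$ before the nonnegative pooling is what makes the second-layer outputs insensitive to the sign and, with suitable pooling weights, to the exact phase of the first-layer features.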
Learning

■ First, learn the parameters of layers one and two. Keeping them fixed, learn the parameters of layer three.
■ For layers one and two, fit the pdf

  $p\big(x; \underbrace{w_i^{(1)}, w_{ki}^{(2)}, b_k^{(2)}}_{\text{Parameters}}\big) \propto \exp\left( \sum_{k=1}^{50} y_k^{(2)} \right)$

■ For layer three, fit the pdf

  $p\big(x; \underbrace{w_j^{(3)}, b_j^{(3)}}_{\text{Parameters}}\big) \propto \exp\left( \sum_{j=1}^{n^{(3)}} y_j^{(3)} \right)$

■ Basic idea: the overall activity of the feature outputs determines how probable the input is.
■ We do not know the partition functions, so the likelihood is intractable. Use noise-contrastive estimation for the fitting. (Gutmann and Hyvärinen, JMLR 2012)
Noise-contrastive estimation (Gutmann and Hyvärinen, JMLR 2012)

■ Purpose: learn the parameters $\theta$ of a pdf $p_\theta$ when you do not know the partition function.
  Here: $p_\theta(x) = p\big(x; w_i^{(1)}, w_{ki}^{(2)}, b_k^{(2)}\big)$ or $p_\theta(x) = p\big(x; w_j^{(3)}, b_j^{(3)}\big)$
■ Intuition: learn the differences between the data and auxiliary "noise" whose properties you know; from the differences, deduce the properties of the observed data.
■ More concretely:
  1. Choose a random variable $z$ with a known pdf $p_z$ from which sampling is easy. Here: the uniform distribution on the sphere where the data is defined.
  2. Obtain an auxiliary sample of $z$ ("noise").
  3. Perform logistic regression on the data and the auxiliary "noise"; use the ratio $p_\theta / p_z$ in the regression function.
■ The procedure provides a consistent estimator of $\theta$.
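The three steps can be illustrated on a toy problem: estimating the precision and the unknown log-normalizer of an unnormalized 1-D Gaussian, contrasted against uniform noise. This is not the image model from the slides, and all names (`nce_fit`, `log_pm`, etc.) are illustrative; the objective is the standard NCE logistic-regression loss with equal data and noise sample sizes.

```python
import numpy as np
from scipy.optimize import minimize

def nce_fit(data, noise, log_p_model, log_p_noise, theta0):
    """Noise-contrastive estimation sketch: logistic regression between data
    and noise, with the log-ratio log p_theta - log p_z as regression function."""
    def neg_obj(theta):
        gx = log_p_model(data, theta) - log_p_noise(data)    # log-ratio on data
        gz = log_p_model(noise, theta) - log_p_noise(noise)  # log-ratio on noise
        # -[mean log sigmoid(gx) + mean log(1 - sigmoid(gz))], written stably
        return np.logaddexp(0.0, -gx).mean() + np.logaddexp(0.0, gz).mean()
    return minimize(neg_obj, theta0).x

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)                   # observed data, N(0, 1)
z = rng.uniform(-5.0, 5.0, 5000)                 # auxiliary noise with known pdf
log_pz = lambda u: np.full_like(u, -np.log(10.0))
# unnormalized model: log p_theta(u) = -0.5 * theta[0] * u^2 + theta[1],
# where theta[1] plays the role of the unknown (negative) log partition function
log_pm = lambda u, th: -0.5 * th[0] * u**2 + th[1]
prec, log_c = nce_fit(x, z, log_pm, log_pz, np.array([0.5, 0.0]))
```

Note that the normalizer is treated as just another parameter: this is the point of NCE, since logistic regression only needs the ratio $p_\theta / p_z$, not a normalized $p_\theta$.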
Results, layers one and two

■ The $w_i^{(1)}$ are Gabor-like; the $w_{ki}^{(2)}$ are sparse (94.5% of the weights are $< 10^{-6}$; 5.1% are $> 10$).
■ Mostly complex-cell-like pooling. Each row of the figure corresponds to a different

  $y_k^{(2)} = f_{\mathrm{th}}\!\left( \ln\!\left( \sum_{i=1}^{100} w_{ki}^{(2)} \big(w_i^{(1)\top} x\big)^2 + 1 \right) + b_k^{(2)} \right)$

[Figures: a subset of the features $w_i^{(1)}$ and their icons; all the learned features for layers one and two.]
Results, layer three

■ Features with enhanced selectivity to orientation and space (horizontal, vertical, diagonal).
■ The features act on the gain-controlled second-layer outputs:

  $\tilde{y}^{(2)} = \text{gain control}\big(y^{(2)}\big), \qquad y_j^{(3)} = f_{\mathrm{th}}\!\left( w_j^{(3)\top} \tilde{y}^{(2)} + b_j^{(3)} \right)$

■ If the $k$-th element of $w_j^{(3)}$ is positive, activity of $y_k^{(2)}$ is detected (corresponding icon colored in red); if it is negative, inactivity of $y_k^{(2)}$ is detected (icon colored in blue).

[Figure: the complete set $w_1^{(3)}, \ldots, w_{10}^{(3)}$ for $n^{(3)} = 10$. See paper for $n^{(3)} = 100$.]
Results, layer three

■ Descriptors of overall image properties?

[Figure: the features, together with the images giving maximal and minimal activation. Feature outputs were computed for 10000 randomly chosen tiny images.]
Summary

■ Selectivity and invariance/tolerance are important for any feature extraction system.
■ Question asked: In a system with three processing layers, what should be selected and tolerated at each level of the hierarchy?
■ Looked for an answer by fitting probabilistic models to images:
  → First layer: selectivity to Gabor-like image structure
  → Second layer: tolerance to the exact orientation or localization of the stimulus ("complex cells")
  → Third layer: enhanced selectivity to the orientation and/or location of the stimulus