Spatio-Temporal Layout of Human Actions for Improved Bag-of-Words Action Detection

G.J. Burghouts*, K. Schutte

TNO, Intelligent Imaging, Oude Waalsdorperweg 63, The Hague, The Netherlands
Abstract

We investigate how human action recognition can be improved by considering the spatio-temporal layout of actions. From the literature, we adopt a pipeline consisting of STIP features, a random forest to quantize the features into histograms, and an SVM classifier. Our goal is to detect 48 human actions, ranging from simple actions such as walk to complex actions such as exchange. Our contribution is to improve the performance of this pipeline by exploiting a novel spatio-temporal layout of the 48 actions. Here, each STIP feature in the video does not contribute to the histogram bins by a unit value, but rather by a weight given by its spatio-temporal probability. We propose 6 configurations of spatio-temporal layout, where the varied parameters are the coordinate system and the modeling of the action and its context. Our model of layout does not change any other parameter of the pipeline, requires no re-learning of the random forest, increases the size of the resulting representation by at most a factor of two, and adds a minimal computational cost of only a handful of operations per feature. Extensive experiments demonstrate that the layout is distinctive of actions that involve trajectories, (dis)appearance, kinematics, and interactions. The visualization of each action's layout illustrates that our approach is indeed able to model the spatio-temporal pattern of each action. Each layout is experimentally shown to be optimal for a specific set of actions. Generally, the context has more effect than the choice of coordinate system. The most impressive improvements are achieved for complex actions involving items. For 43 out of 48 human actions, the performance is better or equal when spatio-temporal layout is included. In addition, we show that our method outperforms the state-of-the-art on the IXMAS and UT-Interaction datasets.

Keywords: human action recognition, spatio-temporal layout, STIP features, Gaussian kernel, mixture model, random forest, support vector machines.

* Corresponding author. Tel.: +31 888 663 997. E-mail address: [email protected].
1. Introduction
We consider the challenge set in the DARPA Mind's Eye program of automated detection of 48 human actions from 4,774 videos. This dataset is novel (released early 2012), and it involves many complex actions. The actions vary from a single person (e.g., walk) to two or more persons (e.g., follow). Some of these actions are defined by the involvement of some object (e.g., give), or an interaction with the environment (e.g., leave). The most complex actions involve two persons and an object (e.g., exchange). See Figure 1 for a few illustrations of the visint.org dataset, including persons, cars, interactions with other persons or cars, involvement of items (of which some, as shown, are not detectable), clutter in the background (such as small moving cars in the background), and the varying scenes and recording conditions. The dataset contains a test set of 1,294 realistic videos with highly varying recording conditions and on average 195 variations of each of the 48 actions. A complicating factor of this dataset is that the actions are highly unbalanced: e.g., within the train set of 3,480 videos there are 1,947 positive learning samples for "move" against only 50 samples for "bury", see Figure 2. Also, on average 7 actions are annotated per clip. We argue that the complexity of simultaneously detecting 48 human actions in this dataset makes the problem we face interesting.
Fig. 1. Human actions in the visint.org dataset include persons, cars, interactions with other persons or cars, involvement of items like the exchanged item (in the middle and right, where the item itself is not detectable), clutter in the background (like the small cars in the back of the right image), and the varying scenes and recording conditions. Images are extracted from videos from the visint.org dataset.

Fig. 2. The 48 human behaviors in this paper and their (logarithmic) prevalence in the train set.
This paper improves a popular pipeline from the literature (from features to action detection) by exploiting a novel spatio-temporal layout of human actions. We adopt recent algorithms to construct a basic pipeline. We create action detectors from a pipeline [1] of local spatio-temporal STIP features [2], a random forest to quantize the features into action histograms [3], and an SVM classifier with a χ2 kernel [4] serving as a detector for each action [5]. This rationale has been used in e.g. [1,3,5,6,14]. The novel spatio-temporal layout model that we propose addresses two objectives:
1. Improving the selectivity of the action histograms by considering the action's spatial layout.
2. Optimizing the options for modeling the spatio-temporal layout of each action.
The main driver for these improvements is the observation that most human actions follow a particular spatio-temporal pattern. Such a pattern can be sequential, as in the case where somebody falls, shown in Figure 3 (also from the visint.org dataset). Clearly, from left to right, from the beginning to the end of the action, we can see the horizontal and then downward movement. We expect this to hold for more cases from the 48 human actions of the visint.org dataset. For instance, in Figure 1, in the case of an exchange (images in the middle and on the right), the moving arms and hands will be in between the two persons. In other words, the action is spatially confined: in the middle. In the case of an approach of two vehicles (Figure 1, left image), two trajectories get closer in time: at the end they will be close to each other. In the next sections, we aim to exploit such spatio-temporal layout of actions.
Fig. 3. Example of the action "fall". From left to right, from the beginning (blue) to the end of the action (red), we can see the horizontal and then downward movement. In this visualization, the STIP features as considered in this paper are shown. The frame in which they are detected is indicated by the text inside the detections.
We achieve this under the following constraints:
a. No significant increase of the size of the histograms (at maximum a factor of two). This is beneficial for learning of the back-end SVM: it avoids the curse of dimensionality and improves efficiency. The latter advantage also holds for the later application of the learned classifier.
b. No adaptation of a pre-learned feature quantizer (in this paper a random forest). This is beneficial as learning a new quantizer is time-consuming; iterative adaptive learning is foreseen to be prohibitively time-consuming.
c. No significant additional computational cost to compute the spatio-temporal layout.
Exploiting the spatio-temporal layout improves the detection of the 48 human actions. We will demonstrate a relative improvement of 19% with respect to the baseline pipeline for human action recognition, and up to >100% improvement for 5 actions.
In Section 2, we discuss previous work. Section 3 considers the modeling of spatio-temporal layout, and visualizes the layout of each of the 48 human actions. Section 4 describes the experimental setup. In addition, we evaluate the human action recognition performance with and without spatio-temporal layout, and we compare the various configurations of modeling the layout. Section 5 provides comparisons with state-of-the-art results for the IXMAS and UT-Interaction datasets. Finally, Section 6 concludes with a discussion of our results.
2. Previous Work
2.1. Pipeline: Action Detectors from a Bag-of-Words Model
We adopt our action detectors from recent literature. In Section 3 we extend them to also include the spatio-temporal layout of actions. We summarize the baseline here; more details can be found in [1].
Features. STIP features [2] have been proven to be discriminative for human action recognition [5]. Therefore we consider them as our standard feature in this paper. The advantages of these local spatio-temporal features are that they do not require any segmentation of the scene, they are able to capture detailed motion patterns of both the whole body and of the limbs, and they encode motion patterns together with local shape. The STIP features outperformed bounding-box based features [6]. They are computed regionally at spatio-temporal interest points, i.e., by a 3D Harris detector that is an extension of the well-known 2D corner detector. The features comprise histograms of gradients (HOG) and optical flow (HOF). Together these two feature types capture local shape and motion. The STIP features are computed with Laptev's implementation from [6], version 1.1, with default parameters and input images reduced in size to 640x480 pixels. The STIP-based feature vector consists of the 162 HOG-HOF values.
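As an illustration of how such detections can be consumed downstream, the following is a minimal Python sketch. It assumes that the STIP detector writes one detection per line (comment lines starting with '#'), with the location among the leading columns and the 162-dimensional HOG-HOF descriptor as the trailing columns; the exact column layout is an assumption and should be checked against the stipdet documentation.

import numpy as np

def load_stip(path):
    # Assumption: plain-text output, one detection per row, '#' comment lines,
    # with the 162-d HOG-HOF descriptor occupying the last 162 columns.
    rows = np.loadtxt(path, comments='#')
    locations = rows[:, 4:7]      # assumed (x, y, t) columns; verify for your stipdet version
    descriptors = rows[:, -162:]  # 72-d HOG followed by 90-d HOF
    return locations, descriptors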
Codebook / Quantizer. We choose the random forest as a codebook / quantizer (from here on we refer to it as quantizer). The random forest has proven to be more distinctive than k-means and it is also more efficient [3]. An additional advantage is that it serves as a feature selector, which k-means does not provide. It selects the combinations of particular features and their thresholds that give the best separation between the target and non-target class during training. We consider this property important, as we do not know a priori which motion patterns and local shapes, encoded by the STIP features, are relevant. For each action, we create a random forest with 10 trees and 32 leaves, based on 200K feature vectors: 100K from randomly selected positive videos, and 100K from randomly selected negative videos. For the random forest we use Breiman and Cutler's implementation [7], with the M-parameter equal to the total number of features (M=162). The random forest quantizes the features into histograms of length 10 x 32 = 320. We will call each bin a "word", in accordance with bag-of-features, or bag-of-words, terminology.
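A minimal sketch of this quantization step, using scikit-learn as a stand-in for Breiman and Cutler's implementation [7] (the forest size, leaf count and 320-bin histogram follow the description above; the helper names are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Forest with 10 trees and at most 32 leaves per tree; each (tree, leaf) pair
# is one of the 10 x 32 = 320 "words".
forest = RandomForestClassifier(n_estimators=10, max_leaf_nodes=32)
# forest.fit(train_descriptors, train_labels)  # 162-d HOG-HOF vectors, pos/neg labels

def leaf_maps(forest):
    # Map each tree's leaf node ids to consecutive ranks 0..(n_leaves-1).
    maps = []
    for est in forest.estimators_:
        leaves = np.where(est.tree_.children_left == -1)[0]
        maps.append({node: rank for rank, node in enumerate(leaves)})
    return maps

def quantize(forest, descriptors, n_leaves=32):
    # Unweighted bag-of-words histogram: each feature adds +1 to its word.
    maps = leaf_maps(forest)
    node_ids = forest.apply(descriptors)  # (n_features, n_trees) leaf node ids
    hist = np.zeros(len(maps) * n_leaves)
    for t, m in enumerate(maps):
        for node in node_ids[:, t]:
            hist[t * n_leaves + m[node]] += 1.0
    return hist / max(hist.sum(), 1.0)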
Action Detectors. We adopt the same bag-of-words pipeline as in our ICPR 2012 paper [1] (Section 2.3 explains the differences between the current and our ICPR paper). As a detector, we select an SVM. For various tasks, ranging from image classification [8] to action recognition [5], it was found to be the best classifier. Compared to our earlier work on action recognition [9], where we used Tag-Propagation [10] as a classifier, the SVM showed better performance. For each action, we train an SVM classifier with a χ2 kernel [4] that serves as a detector for that action. For the SVM we use the libSVM implementation [11], where the χ2 kernel is normalized by the mean distance across the full training set [4], and the SVM's slack parameter is kept at its default of 1. The weight of the positive class is set to (P+N)/P and the weight of the negative class to (P+N)/N, where P is the size of the positive class and N of the negative class [12].
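A hedged sketch of this detector, using scikit-learn's SVC (which wraps libSVM) with a precomputed kernel; the exact χ2 distance convention (with or without the factor 1/2) should be taken from [4], so the version below is an assumption, and the labels 0/1 and function names are illustrative:

import numpy as np
from sklearn.svm import SVC  # wraps libSVM

def chi2_distance(H1, H2, eps=1e-10):
    # Pairwise chi-square distances between rows of H1 and rows of H2.
    d = np.zeros((H1.shape[0], H2.shape[0]))
    for i, h in enumerate(H1):
        d[i] = 0.5 * np.sum((h - H2) ** 2 / (h + H2 + eps), axis=1)
    return d

def train_detector(H_train, y_train, P, N):
    # Normalize the kernel by the mean chi-square distance on the training set,
    # and weight the classes as (P+N)/P and (P+N)/N.
    D = chi2_distance(H_train, H_train)
    A = D.mean()
    K = np.exp(-D / A)
    svm = SVC(kernel='precomputed', C=1.0,
              class_weight={1: (P + N) / P, 0: (P + N) / N})
    svm.fit(K, y_train)
    return svm, A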
2.2. Encoding Spatial Layout
Bag-of-features approaches, including the approach described in Section 2.1, are discriminative, yet they ignore the potential discriminative power that lies in the spatial layout of the local features. The encoding of layout has been explored for image retrieval, e.g., [8], and also for action recognition, e.g., [5]. We discuss these and other approaches below. For images, the layout is spatially defined in two dimensions. In this paper, we consider action recognition, in three dimensions. However, the ideas that were laid down in papers on image retrieval are similar to the novel idea investigated in this paper. Therefore, we also include them in this discussion of previous work. Below we only consider methods that model layout, because we aim to model patterns and order. Fitting a bounding box to an action does not model those aspects and therefore such localization methods, e.g., [13,14], are out of the scope of this paper.
Spatial Pyramids. To encode the spatial arrangements of local features, like SIFT features, various methods have been proposed recently. They share the rationale that the locations of the local features, which are by themselves orderless, can be put into fixed spatial cells to obtain some order. With spatial pyramids [8] an approximate global geometric correspondence is achieved (similar to [15]), by dividing the image into levels of increasingly fine sub-regions. For each level, and all regions, a bag-of-features histogram is created. All histograms are combined in a matching kernel that is fed into an SVM classifier. In these papers, the pyramids were based on bag-of-features histograms where the quantizer or codebook was obtained by k-means clustering. As an alternative, random forests (used in our pipeline) have also been considered as a quantizer in combination with the spatial pyramid approach [16]. The spatial pyramid has been considered for action recognition in [5].
Non-rigid Layouts. Whereas spatial pyramids are rigid cells within the image or video, recently a model of non-rigid layout has been proposed in [17]. Spatial Fisher vectors deal with non-rigid layouts by retrieving the words in the bag-of-features model through joint learning from both the feature values and their locations. The learning is performed using EM, and the underlying models of appearance (the feature values) and locations are Gaussian. For each of the K words, a C-component Mixture-of-Gaussians is learned. This results in a bag-of-features histogram of length K * (1 + 2D) + K * C * (1 + 2d), where K is the number of words, C the number of components of the mixture, D the dimensionality of the feature vector, and d the number of dimensions of the feature's location. The authors also propose a more compact representation, where the appearance words are pre-learned by k-means. In that case, the histogram length becomes K + K * C * (1 + 2d).
2.3. Novelty of This Paper
Compact Histograms that encode Spatio-Temporal Layout. We propose compact histograms, where the inclusion of spatio-temporal layout leads to a histogram of the same size, or double the size in case we also consider the layout of the background features (see Section 3.2). This means that we have histograms of length 320 (same) or 640 (plus background). The Mixture-of-Gaussians approach described above would be significantly larger (with one component only, C=1, and in video, d=3): 320*(1+2*162)+320*1*(1+2*3) = 106,240. With the replacement of the appearance part of the mixture by a k-means based codebook, the histogram length becomes 320+320*1*(1+2*3) = 2,560. We propose small histograms, which is beneficial for efficient learning and classification, and which avoids the curse of dimensionality during learning.
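As a quick cross-check of these sizes (plain arithmetic, no assumptions beyond the formulas quoted above):

K, C, D, d = 320, 1, 162, 3
print(K * (1 + 2 * D) + K * C * (1 + 2 * d))  # spatial Fisher vector: 106240
print(K + K * C * (1 + 2 * d))                # compact variant with k-means words: 2560
print(K, 2 * K)                               # ours: 320 (Action) or 640 (Action + Context)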
Re-use of Pre-learned Quantizers / Codebooks. Rather than jointly learning appearance and locations to obtain the quantizer/codebook, we consider pre-learned quantizers that are extended to encode spatio-temporal layout in a second, independent stage (see Section 3). This property of our approach is beneficial as learning a new quantizer is time-consuming.
Extensive Evaluation of Spatio-Temporal Layout Configurations. The contribution of this paper is that we model the spatio-temporal layout of 48 actions. In this model, two fundamental choices are identified (see Sections 3.1 and 3.2). These configurations are extensively evaluated for 48 human actions in 1,294 realistic videos which have been recorded under highly varying recording conditions. We explore which configuration works best for each action, and we visualize the spatio-temporal layout of each action. We show that modeling spatio-temporal layout, without a significant increase in the histogram size and without re-learning of feature quantizers, yields a significant improvement over the pipeline without spatio-temporal layout.
Difference of this paper with our earlier ICPR 2012 paper [1]. The baseline bag-of-words action detector setup is identical to our ICPR paper [1], i.e., the features, random forests, kernel and SVM are the same, and we have re-used the pre-trained random forests from that paper. The novelty of our ICPR paper was the second-stage setup for classification, which is not considered in the current paper. In the current paper, the novelty is the extension of the bag-of-words action detectors to also encode spatio-temporal layout, which is not part of the ICPR paper.
3. Spatio-Temporal Layout of Human Actions
Two factors dominate the modeling of spatio-temporal layout: the coordinate system and the model of the layout itself. Both factors are discussed here.
3.1. Coordinate Systems
The detected STIP features vary with the location and extent of the observed action. To add robustness to such variations within videos of the same action, we consider two schemes of coordinate transformations that achieve increasing levels of invariance. The first scheme is to normalize the locations per video to zero mean and unit variance. With this scheme, we lose information about the absolute position in the image coordinates, and about the absolute frame number in which the STIP features were detected. We achieve that the middle of an action in one video is aligned to the middle of another video. The second scheme is that we apply the first scheme and, in addition, achieve invariance to horizontal flip. This results in a representation that is independent of whether the action takes place from left to right, or from right to left. We consider only horizontal invariance, as vertical patterns in the up-down direction (e.g., fall) are not considered to be similar to the down-up direction (e.g., throw). The same holds for temporal ordering: walking and then falling does not have a similar meaning when turned around. The horizontal invariance is achieved by aligning the action's horizontal direction (leftward or rightward) to the temporal axis. For the current video, if the correlation between the detections on the horizontal axis and the temporal axis is negative, R(X,T) < 0 with X the horizontal coordinates and T the temporal coordinates, we mirror the horizontal coordinates of the detections around the horizontal mean. In summary: the original coordinates will be referred to in the experiment sections as "Original", the normalization to zero mean and unit variance is indicated by "Normalized", and when horizontal flip invariance is added we use the term "Normalized + Flip Invariance". These schemes are investigated in the experiments in Section 4.
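A minimal sketch of these two schemes, assuming the STIP locations of one video are given as an (N, 3) array with columns (x, y, t); the function names are illustrative:

import numpy as np

def normalize(locations):
    # "Normalized": zero mean and unit variance per dimension, per video.
    return (locations - locations.mean(axis=0)) / (locations.std(axis=0) + 1e-12)

def normalize_flip_invariant(locations):
    # "Normalized + Flip Invariance": if the horizontal and temporal coordinates
    # are negatively correlated, R(X,T) < 0, mirror x around its mean so the
    # dominant motion runs left-to-right; then normalize as above.
    x, t = locations[:, 0], locations[:, 2]
    if np.corrcoef(x, t)[0, 1] < 0:
        locations = locations.copy()
        locations[:, 0] = 2 * x.mean() - x
    return normalize(locations)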
3.2. Action-Specific Spatio-Temporal 3D Gaussians
We consider the modeling of the spatio-temporal layout of an action by the locations of the STIP features. Note that these locations may have been transformed as described in Section 3.1, if such a scheme has been applied.
3D Gaussian-Mixture per "Word". Recall (from Section 2.1) that we consider a random forest to quantize the features into words, i.e., the histogram bins. Our random forest quantizes the features into 320 words. For each word, we consider all the features from the training set (see Section 4) that project onto it, and we collect their locations. The locations are 3-dimensional: (x,y,t). These word-specific locations are modeled by a 3D Mixture-of-Gaussians [18], which we refer to as Ga,w(x,y,t), where G denotes the Gaussian mixture, a is the action for which we model the layout, w is the index of the word, and (x,y,t) is the location of the feature. For each word, a probability density function (pdf) is learned by the EM algorithm [18],

G_{a,w}(x, y, t) = \sum_{k=1}^{C} \alpha_k N(x, y, t | \mu_{k,a,w}, \Sigma_{k,a,w}),

where Ga,w denotes the mixture that has been learned for action a and its wth word, k indicates one of the C components, αk is the learned mixing weight of the kth component, and N is a Gaussian function parameterized by a learned mean μk,a,w and covariance matrix Σk,a,w.
In this way, we obtain 320 word-specific pdfs Ga,w(x,y,t) for one action a. This is our model of the spatio-temporal layout of a particular action. A new feature from a test video that gets projected onto a particular word is assigned a posterior probability that the feature is in accordance with the word's extent.
Construction of Histograms. In the original random forest approach, e.g., [3], each feature that gets projected onto a particular word i adds +1 to entry i of the histogram. After all features have contributed to the histogram, it is normalized to 1. In our approach, each feature instead adds the posterior probability obtained from the word-specific Gaussian Ga,w(x,y,t) to histogram entry i. Again we normalize the histogram to 1.
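A minimal sketch of this weighting, assuming scikit-learn's GaussianMixture and the quantizer from Section 2.1. The paper speaks of a posterior probability per word; the sketch simply uses the Gaussian density of the feature's location as the weight, which is a simplification, and the helper names are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_layout(locations_per_word, n_components=1):
    # One 3D Gaussian mixture per word (C=1 in the experiments), fitted on the
    # (x, y, t) locations of the training features that quantize to that word.
    models = {}
    for w, locs in locations_per_word.items():
        locs = np.asarray(locs)
        if len(locs) > n_components:
            models[w] = GaussianMixture(n_components=n_components).fit(locs)
    return models

def layout_weighted_histogram(word_ids, locations, models, n_words=320):
    # Each feature adds the density of its location under its word's model,
    # instead of a unit count; the histogram is normalized to sum to 1.
    hist = np.zeros(n_words)
    for w, loc in zip(word_ids, locations):
        if w in models:
            hist[w] += np.exp(models[w].score_samples(loc[None, :])[0])
    return hist / max(hist.sum(), 1e-12)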
Action and Context. The word-specific pdf Ga,w(x,y,t) can be learned from feature locations that were obtained from training videos that contain the particular action of interest. More precisely, we learn Ga,w(x,y,t|a). In addition, we also consider the pdf of locations that were obtained from other videos, where the action was not present: Ga,w(x,y,t|¬a). In the experiments, we consider Ga,w(x,y,t|a) only, or alternatively, both Ga,w(x,y,t|a) and Ga,w(x,y,t|¬a). In the first configuration, we refer to the layout as "Action", whereas in the second configuration we refer to "Action + Context". In the case of "Action + Context", the histogram size is doubled, as we add the posterior probabilities of both pdf models to the histogram, and because they are different by construction, we use separate bins (two bins for each word rather than one). Here, normalization is done for the action and context parts separately, to weight their contributions equally.
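Continuing the sketch above, the "Action + Context" representation concatenates two separately normalized 320-bin parts into a 640-bin histogram (again with illustrative names, re-using layout_weighted_histogram from the previous sketch):

import numpy as np

def action_context_histogram(word_ids, locations, action_models, context_models):
    # Weight the features with the action layout G_a,w(x,y,t|a) and, separately,
    # with the context layout G_a,w(x,y,t|not a); each part is normalized on its
    # own so that action and context contribute equally, then concatenated.
    h_action = layout_weighted_histogram(word_ids, locations, action_models)
    h_context = layout_weighted_histogram(word_ids, locations, context_models)
    return np.concatenate([h_action, h_context])  # length 640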
3.3. Visualization of Each Action’s Spatio-Temporal Layout
We are interested in the layout of each of the 48 human actions, to understand the models obtained by our technique. Figure 4 visualizes the configuration "Normalized + Flip Invariance / Action", as this configuration enables us to depict the layouts in a shared coordinate frame (see Section 3.1), such that the layouts can be compared between the actions. A few observations in Figure 4 are key to the ideas laid down in this paper. The layouts should be interpreted as follows: blue indicates detections at the beginning of the action, whereas red indicates detections at the end of the action. A general observation is that all actions show a dominant pattern in the horizontal direction. People clearly tend to walk and move horizontally through the image field. Another observation is that for many actions no clear and direct interpretation follows from their visualized layout. Yet, clearly, the layouts are distinct. Below we highlight a few examples which have an obvious semantic interpretation.
Trajectory Actions: examples are Stop and Jump. For the action Stop, the layout shows that this action can start anywhere, which follows from the scatter of the blue parts of the model. The Stop action typically ends when somebody has arrived at the person or scene object that he or she moved to. Usually, that is somewhere in the middle, as can be learned from the red parts in the middle. Jump is the action with the most distinct vertical pattern. The beginning of the action is upward (blue to green), whereas the downward motion is visible later (orange to red).
(Dis)Appear Actions: examples are Enter and Exit. Typically someone enters from the side of the image field, which follows from the pattern: at the beginning the location is at the side of the image (blue), whereas Enter tends to end when somebody has arrived at somebody or some item in the middle (red). For the dataset under investigation, Exit is not the inverse of Enter: usually the exit happens by stepping into a vehicle, which is in the middle (note the red in the middle).
Kinematic Actions: examples are Fall and Put down. Fall clearly shows a pattern from top to bottom. A similar pattern is observed for Put down.

Interaction Actions: examples are Take and Collide. Take typically shows somebody who picks up something and walks away. The part of Take where the person walks away is clearly visible in its layout, from left to right, from the beginning (blue) to the end (red) of the action. Collide involves two persons that touch each other after some movement of one or both, where the collision happens approximately in the middle (orange to red).
4. Experiments on the visint.org dataset of 48 human actions
4.1. Experimental setup
The dataset includes 48 human actions in 1,294 short test videos of 10-30 seconds, given a train set of 3,480 similar videos. This dataset is novel and contributed by the DARPA Mind's Eye program on www.visint.org. The annotation is as follows: for each of the 48 human actions, a human has assessed whether the action is present in each video or not ("Is action X present?"). Typically, multiple actions are reported for every video, with on average seven reported actions. We use the train set for the training of classifiers and the optimization of combining schemes (as described later in the experiments), and we use the test set for performance evaluation only.
The only part of the pipeline from Section 2.1 that we vary in our experiments is the weighting of each feature's contribution to its histogram bin. This weighting scheme depends on the spatio-temporal layout, for which we consider the variations described in Section 3. For the 3D Gaussian, we use a single component for simplicity, C=1.
The performance will be measured by the MCC measure,

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},

with T=true, F=false, P=positive and N=negative. The MCC measure has the advantage of its independence of the sizes of the positive and negative classes. Recall from the Introduction, and specifically Figure 2, that for some actions we have many positive training samples, whereas for others we only have few. Of the 3,480 samples in the training set, there are 1,947 samples for "move" (56%), against only 50 samples for "bury" (1.4%). Clearly some classes are highly unbalanced, and because the MCC is insensitive to this, it enables us to compare performance across all actions independent of their prevalence in the dataset.
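In code, the measure reads as follows (a direct transcription of the formula; the zero-denominator fallback is a common convention, not specified in the paper):

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; insensitive to class imbalance.
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den > 0 else 0.0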
Fig. 4. Visualization of the spatio-temporal layouts of all 48 human actions. The figure depicts the configuration "Normalized + Flip Invariance / Action". All boxes have the same coordinate frame, where the displayed extent is [-σ, +σ] in both spatial dimensions, and blue to red indicates [-σ, +σ] in the temporal dimension, where green represents 0. Each box visualizes the 320 spatio-temporal 3D Gaussians (only the mean is displayed) that model the layout for that action.
4.2. Spatio-Temporal Layout Configurations: Comparison per Action
Figure 5 depicts the results of using no spatial layout and the 6 configurations that model the spatio-temporal layout, as proposed in Section 3: three coordinate systems (Original, Normalized, Normalized + Flip Invariance) combined with two extents of the layout (Action, Action + Context). Our main findings are summarized below.

For 43 out of 48 actions improvements are achieved by adding spatio-temporal layout. For the layout "Normalized + Flip Invariance / Action" the improvements are marginal, as can also be seen in Table 1. "Normalized / Action + Context" performs badly. For the other 4 layouts, significant improvements are achieved with respect to the representation without spatio-temporal layout. The layouts add discriminative power. For 5 actions the performance is better without spatio-temporal layout.

Truly different layouts have the most distinct performances. For those actions where "Original / Action" performs best, a change of both coordinate system and context, i.e., "Normalized + Flip Invariance / Action + Context", performs worst. The opposite comparison also holds.

Context has more effect than Coordinate System. Whereas "Original / Action" produces results that are very comparable to "Normalized / Action", it produces distinct results compared to "Original / Action + Context". The same shows for "Normalized + Flip Invariance / Action + Context": these results are comparable to "Original / Action + Context" and very distinct from "Normalized + Flip Invariance / Action".

For some actions, the merit of layout is significant. For kick, the performance without layout is ~0.1 and the improvement is ~0.1. Kick is well-localized: somebody stands somewhere and the legs move. The same holds for fall, which improves from ~0.0 to ~0.1. Many actions that involve items improve, for example: carry improves from ~0.15 to ~0.2, push from ~0.05 to ~0.1, lift from ~0.25 to ~0.3, and haul and raise from ~0.15 to ~0.2. Actions that involve two persons who move relative to each other also improve: flee improves from ~0.3 to ~0.4, chase improves from a negative score to a small positive score, follow improves from ~0.05 to ~0.1, and receive from ~0.1 to ~0.15.
Fig. 5. The performance of recognizing 48 human actions (measured by MCC on the horizontal axis), with the various spatio-temporal layouts vs. no spatio-temporal layout. For each layout, we indicate for which actions it works best (see the lines that connect the optimal layout for a set of actions).
4.3. Spatio-Temporal Layout Configurations: Across the 48 Actions
Table 1 summarizes the results shown in Figure 5. The MCC averaged over all actions shows that no single layout is best overall: for each layout the performance is approximately 0.17, whereas without layout the performance is 0.16. When the optimal layout per action is selected, the performance becomes 0.20.

Normalization of Action and its Context requires Horizontal Alignment. "Normalized / Action + Context" performs badly. The other coordinate systems plus "Action + Context" perform much better. It appears that the action can be modelled relative to its context only in the original coordinate system, or if both are also aligned with respect to their dominant trajectory. That makes sense: in the original coordinate system, the STIP features will be far apart, which makes it possible to discriminate between them. In the normalized scheme, all features will be projected into the same area: dominant leftward and rightward actions now overlap and their distinction becomes impossible. Such actions can be distinguished again when the dominant horizontal motions are aligned in the "Normalized + Flip Invariance" layout.

There is no single best layout. This is confirmed by the number of actions for which each layout performs best. Without any layout, 5 actions are recognized best. For all other actions, one of the spatio-temporal layouts performs better. Two layouts give improvements for only 4 actions, whereas "Original / Action + Context" performs best for the most actions: 10.

All layouts improve significantly for a subset of actions. We consider the relative improvement of each layout for the subset of actions for which it performs best. The relative improvements for the layouts with only the "Action", without "Context", are all below 20%. "Original / Action + Context" achieves an average improvement of 32%. Its best improvement is a gain of 0.100 in the MCC score of the fall action. For "Normalized + Flip Invariance / Action + Context", the best average relative improvement is achieved: 37%, with a maximum gain of 0.126 in the MCC score of the replace action.
Table 1. The performance of human action recognition for all configurations of spatio-temporal layout as described in Section 3 (coordinate systems and 3D Gaussians) and without layout. The table indicates for each configuration the MCC averaged across all 48 actions, the number of actions for which it performed best, the average improvement for the actions where it performed best (both absolute and relative), and the maximum improvement.

Coordinate System           3D Gaussian         MCC avg.   actions best   MCC impr. avg.   MCC impr. max.
Original                    Action              0.170      9              0.030 (18%)      0.078
Original                    Action + Context    0.167      10             0.052 (32%)      0.100
Normalized                  Action              0.168      4              0.031 (19%)      0.037
Normalized                  Action + Context    0.005      0              -                -
Normalized + Flip Inv.      Action              0.169      4              0.027 (16%)      0.052
Normalized + Flip Inv.      Action + Context    0.167      9              0.060 (37%)      0.126
None                        -                   0.164      5              -                -
Best configuration per action                   0.195      48             0.031 (19%)      0.126
4.4. Merit for each Action
For almost all actions improvements are achieved. Most notably, for 7 actions the absolute improvement due to spatio-temporal layout is > 0.07. These actions are: replace, fall, kick, follow, bury, flee, hit. Indeed, as can be seen in Figure 4, all these actions have a clear spatio-temporal motion pattern. For 13 actions, the improvement is > 0.05, whereas for 21 actions the improvement is > 0.03.

For eleven actions the method fails to improve the action MCC. For five actions, the performance is best without layout (see the upper part of Figure 5 and the right part of Figure 6), and for these actions the results degrade with layout: walk, approach, open, enter, get. It seems that these are actions that can happen anywhere in the scene and have a large spatio-temporal extent, e.g., walk. Open is ill-localized, because the opening of a bag can be done on the ground as well as on a table, for instance. For six actions, the performance is similar with and without layout (see Figure 6 where the merit is zero or close to zero): run, bounce, exchange, putdown, throw, open. Note that for three other verbs the MCC = 0 both with and without layout: attach, catch, move.
Improvements are achieved for complex actions involving items. An example is replace (MCC improved significantly, by ~0.13); another example is push (improved by ~0.07). Such actions are hard to detect purely based on STIP features, because they involve an item that is being manipulated. Yet, the pushing motion toward the item is always from one side. We conclude that such information aids the detection of these more complex actions. For many actions that involve items, e.g., snatch, carry, haul, pickup, lift and drop, the improvement in MCC is more than 0.03.
Fig. 6. Absolute improvements for the recognition of 48 human actions, by considering spatio-temporal layout (best configuration per action), relative to the representation without spatio-temporal layout.
5. Comparison to the State-of-the-Art
In this section, we compare our spatio-temporal layout to state-of-the-art methods on commonly used datasets. To the best of our knowledge, besides our earlier work [1,9] no results have been published on the visint.org dataset yet; it has been released only very recently. Therefore, we compare to the state-of-the-art on the single-person multi-viewpoint IXMAS dataset of subtle actions [19] and the UT-Interaction dataset [20] that contains two-person interactions. We select these datasets as together they represent a broad scope of actions.
5.1. IXMAS
The IXMAS dataset [19] consists of 12 complete action classes, with each action executed three times by 12 subjects and recorded by five cameras with a frame size of 390 × 291 pixels. These actions are: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point and pick up. The body position and orientation are freely decided by the different subjects. The standard setup on this dataset is the leave-one-subject-out cross-validation setting. We compare against the state-of-the-art result of [21], who achieved a 78.0% recognition accuracy across the five cameras using Multiple-Kernel Learning with Augmented Features (AFMKL). We have re-used the random forests from the experiments in Section 4, because we want to demonstrate that our approach is generic (further tuning of the random forests specifically to the IXMAS actions may further improve the results). For each action in IXMAS, we select the best forest and spatio-temporal layout configuration, based on cross-validation. For each action we obtain a detector, and we assign the test sample to the single action that maximizes the a-posteriori probability. The performance of our "Best Configuration" method is 79.1%, thereby outperforming [21] by 1.1% on average, see Table 2.
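A hedged sketch of this multi-class decision, re-using chi2_distance from the detector sketch in Section 2.1 and using the SVM decision value as a stand-in for the posterior score mentioned above; the dictionaries and names are illustrative:

import numpy as np

def detect_score(svm, A, H_train, h_test):
    # Score of one action detector for a test histogram: chi-square kernel values
    # to the training histograms (as in the training sketch), then the SVM margin.
    k = np.exp(-chi2_distance(h_test[None, :], H_train) / A)
    return svm.decision_function(k)[0]

def classify_clip(hists_per_config, detectors, best_config):
    # detectors[action] = (svm, A, H_train); best_config[action] is the layout
    # configuration selected by cross-validation for that action. The clip is
    # assigned to the action whose detector scores highest.
    scores = {a: detect_score(*det, hists_per_config[best_config[a]])
              for a, det in detectors.items()}
    return max(scores, key=scores.get)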
Table 2. Performance on the IXMAS dataset [19] of the state-of-the-art method (AFMKL [21]) and of our method, per camera viewpoint.

Camera viewpoint     1      2      3      4      5      Average (%)
AFMKL [21]           81.9   80.1   77.1   77.6   73.4   78.0
Our method           80.8   81.4   77.8   81.7   73.7   79.1
5.2. UT-Interaction
We used the segmented version of the UT-Interaction dataset [20], containing videos of six types of human activities: hand-shaking, hugging, kicking, pointing, punching, and pushing. The UT-Interaction dataset is a public video dataset containing high-level human activities of multiple actors. We consider the #1 set of this dataset, following the setup in [22]. The #1 set contains a total of 60 videos of the six types of human-human interactions. Each set is composed of 10 sequences, and each sequence contains one execution per activity. The videos involve camera jitter and/or background movements (e.g., trees). Several pedestrians are present in the videos as well, making the recognition harder. Following [22], we consider leave-one-sequence-out cross-validation, i.e., a 10-fold cross-validation. That is, for each round, the videos in one sequence were selected for testing, and the videos in the other sequences were used for training. Identical to the experiment on IXMAS in Section 5.1, we have re-used the random forests from the experiments in Section 4, and we established for each action in the UT-Interaction set the best forest and spatio-temporal layout configuration, based on cross-validation. For each action we obtain a detector, and we assign the test sample to the single action that maximizes the a-posteriori probability. We compare against three well-performing methods [20,22,23]. In [22] a Hough-voting scheme is proposed. In [23] a dynamical bag-of-words model is proposed, which, together with the cuboid + SVM setup in [20], is very similar to our method, being also a bag-of-words method, except that [20] and [23] do not model spatio-temporal layout. The performance of our "Best Configuration" method is 93.3%, thereby outperforming the best result on UT-Interaction [22] by a significant gain of 5.3%, see Table 3. Also notable is the increase of 8.3% over the best score of [20] and [23], which can be attributed solely to the spatio-temporal layout as proposed in this paper.
Table 3. Performance on the UT-Interaction dataset [20] of the state-of-the-art methods on this dataset and of our method.

Method                    Accuracy
Waltisberg et al. [22]    88.0%
Ryoo [23]                 85.0%
Ryoo et al. [20]          83.3%
Our method                93.3%
6. Discussion
For human action recognition, we have considered the visint.org dataset of 48 actions in 3,480 train and 1,294 test videos, ranging from simple actions such as walk to complex actions such as exchange. The state-of-the-art bag-of-features approach discards any spatio-temporal location information. Our results show that for action recognition this information can be utilized by modeling it in a pdf. We have modeled the spatio-temporal layout of all 48 actions, by considering the locations of each action's STIP features. We have used a pipeline of STIP features, a random forest to quantize the features into histograms, and an SVM classifier as a detector for each of the 48 actions. We have proposed 6 configurations of modeling the layout, where the varied parameters are the coordinate system and the modeling of the action and its context. We have visualized the spatio-temporal layouts and demonstrated that the patterns for each action are distinct and that some can be semantically interpreted.

In terms of action recognition performance, we have shown that there is no single best layout for all actions. Rather, we have considered the optimal layout for each action. For 43 actions, the performance is better or equal when spatio-temporal layout is included, while the other 5 actions do not degrade significantly. The improvement is achieved without changing any other parameter of the processing pipeline, without re-learning of the quantizer/codebook (a random forest), with a limited increase of the size of the representation by a factor of only two, and at a limited additional computational cost of only a handful of operations per feature's location (i.e., evaluating a 3D Gaussian function). For 7 actions, the improvement is large (MCC score improved by > 0.07), and for 21 actions we improve reasonably (> 0.03). We have found experimentally that modeling the layout of the action's context (by a model of all non-action STIP features) is more important than the configuration of the action's coordinate system. We have learned that the most impressive improvements are achieved for complex actions involving items. In total, relative to the processing pipeline without spatio-temporal selectivity, our extended pipeline improves by 19% on average over all actions.

Finally, we have compared our method to the state-of-the-art. On the IXMAS dataset, we outperform the state-of-the-art by 1.1% (was 78.0%, ours 79.1%). On the UT-Interaction dataset, we outperform the state-of-the-art by 5.3% (was 88.0%, ours 93.3%).
Acknowledgements
This work is supported by DARPA (Mind's Eye program). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.
References

[1] G.J. Burghouts, K. Schutte, Correlations Between 48 Human Actions Improve Their Detection, ICPR (2012)
[2] I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, 64 (2/3) (2005)
[3] F. Moosmann, B. Triggs, F. Jurie, Randomized Clustering Forests for Building Fast and Discriminative Visual Vocabularies, NIPS (2006)
[4] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: A comprehensive study, International Journal of Computer Vision, 73 (2) (2007)
[5] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning Realistic Human Actions from Movies, CVPR (2008)
[6] G.J. Burghouts, K. Schutte, R. den Hollander, H. Bouma, Selection of Negative Samples and Two-Stage Combination of Multiple Features for Action Detection in Thousands of Videos, MVAP, submitted (2012)
[7] L. Breiman, Random forests, Machine Learning, 45 (1) (2001)
[8] S. Lazebnik, C. Schmid, J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR (2006)
[9] H. Bouma, P. Hanckmann, J-W. Marck, L. de Penning, R. den Hollander, J-M. ten Hove, S.P. van den Broek, K. Schutte, G.J. Burghouts, Automatic human action recognition in a scene from visual inputs, Proc. SPIE Vol. 8388 (2012)
[10] M. Guillaumin, T. Mensink, J. Verbeek, TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, ICCV (2011)
[11] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001)
[12] K.E.A. van de Sande, T. Gevers, C.G.M. Snoek, Evaluating Color Descriptors for Object and Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32 (9) (2010)
[13] C.H. Lampert, M.B. Blaschko, T. Hofmann, Beyond sliding windows: Object localization by efficient subwindow search, CVPR (2008)
[14] J. Liu, J. Luo, M. Shah, Recognizing Realistic Actions from Videos "in the Wild", CVPR (2009)
[15] K. Grauman, T. Darrell, Pyramid match kernels: Discriminative classification with sets of image features, ICCV (2005)
[16] A. Bosch, A. Zisserman, X. Muñoz, Image Classification using Random Forests and Ferns, ICCV (2007)
[17] J. Krapac, J. Verbeek, F. Jurie, Modeling Spatial Layout with Fisher Vectors for Image Categorization, ICCV (2011)
[18] C.M. Bishop, Pattern Recognition and Machine Learning, Springer (2006)
[19] D. Weinland, E. Boyer, R. Ronfard, Action Recognition from Arbitrary Views using 3D Exemplars, ICCV (2007)
[20] M.S. Ryoo, C.-C. Chen, J.K. Aggarwal, A. Roy-Chowdhury, An Overview of Contest on Semantic Description of Human Activities, ICPR (2010)
[21] X. Wu, D. Xu, L. Duan, J. Luo, Action Recognition using Context and Appearance Distribution Features, CVPR (2011)
[22] D. Waltisberg, A. Yao, J. Gall, L. Van Gool, Variations of a Hough-voting action recognition system, ICPR (2010)
[23] M.S. Ryoo, Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos, ICCV (2011)
Graphical abstract. Pipeline to recognize 48 human actions in 1,294 realistic videos, adopted from the state-of-the-art (see references inside the paper): video, STIP features, a histogram (bag-of-features) built with a codebook (random forest), and an action detector (SVM). Our method adds a spatio-temporal layout for each of the 48 actions: +19.2% average improvement, >100% improvement for 5 actions, >30% improvement for 14 actions, >10% improvement for 26 actions.
Highlights:
* Using the spatio-temporal layout of human actions improves discrimination.
* The method is a weighting scheme on top of the popular bag-of-words model.
* No need to re-learn the action codebook.
* Tested on 1,294 test videos, under varying recording conditions, with 195 variations per action: 19.2% improvement.
* Outperforms state-of-the-art on the IXMAS and UT-Interaction datasets.