Spatio-Temporal Layout of Human Actions for Improved Bag-of-Words Action Detection

G.J. Burghouts*, K. Schutte

TNO, Intelligent Imaging, Oude Waalsdorperweg 63, The Hague, The Netherlands
Abstract

We investigate how human action recognition can be improved by considering the spatio-temporal layout of actions. From the literature, we adopt a pipeline consisting of STIP features, a random forest to quantize the features into histograms, and an SVM classifier. Our goal is to detect 48 human actions, ranging from simple actions such as walk to complex actions such as exchange. Our contribution is to improve the performance of this pipeline by exploiting a novel spatio-temporal layout of the 48 actions. Here, each STIP feature in the video does not contribute to the histogram bins by a unit value, but rather by a weight given by its spatio-temporal probability. We propose 6 configurations of spatio-temporal layout, where the varied parameters are the coordinate system and the modeling of the action and its context. Our model of layout does not change any other parameter of the pipeline, requires no re-learning of the random forest, increases the size of the resulting representation by at most a factor of two, and adds a minimal computational cost of only a handful of operations per feature. Extensive experiments demonstrate that the layout is distinctive of actions that involve trajectories, (dis)appearance, kinematics, and interactions. The visualization of each action's layout illustrates that our approach is indeed able to model the spatio-temporal pattern of each action. Each layout is experimentally shown to be optimal for a specific set of actions. Generally, the context has more effect than the choice of coordinate system. The most impressive improvements are achieved for complex actions involving items. For 43 out of 48 human actions, the performance is better or equal when spatio-temporal layout is included. In addition, we show that our method outperforms the state-of-the-art on the IXMAS and UT-Interaction datasets.

Keywords: human action recognition, spatio-temporal layout, STIP features, Gaussian kernel, mixture model, random forest, support vector machines.

* Corresponding author. Tel.: +31 888 663 997. E-mail address: [email protected].
1. Introduction
We consider the challenge set in the DARPA Mind's Eye program of automated detection of 48 human actions from 4,774 videos. This dataset is novel (released early 2012), and it involves many complex actions. The actions vary from a single person (e.g., walk) to two or more persons (e.g., follow). Some of these actions are defined by the involvement of some object (e.g., give), or an interaction with the environment (e.g., leave). The most complex actions involve two persons and an object (e.g., exchange). See Figure 1 for a few illustrations of the visint.org dataset, including persons, cars, interactions with other persons or cars, involvement of items (of which some, as shown, are not detectable), clutter in the background (such as small moving cars in the background), and the varying scenes and recording conditions. The dataset contains a test set of 1,294 realistic videos with highly varying recording conditions and on average 195 variations of each of the 48 actions. A complicating factor of this dataset is that the actions are highly unbalanced: e.g., within the train set of 3,480 videos there are 1,947 positive learning samples for "move" against only 50 samples for "bury", see Figure 2. Also, on average 7 actions are annotated per clip. We argue that the complexity of simultaneously detecting 48 human actions in this dataset makes the problem we face interesting.
Fig. 1. Human actions in the visint.org dataset include persons, cars, interactions with other persons or cars, involvement of items like the exchanged item (in the middle and right, where the item itself is not detectable), clutter in the background (like the small cars in the back of the right image), and the varying scenes and recording conditions. Images are extracted from videos from the visint.org dataset.

Fig. 2. The 48 human behaviors in this paper and their (logarithmic) prevalence in the train set.
This paper improves a popular pipeline from the literature (from features to action detection) by exploiting a novel spatio-temporal layout of human actions. We adopt recent algorithms to construct a basic pipeline. We create action detectors from a pipeline [1] of local spatio-temporal STIP features [2], a random forest to quantize the features into action histograms [3], and an SVM classifier with a χ2 kernel [4] serving as a detector for each action [5]. This rationale has been used in e.g. [1,3,5,6,14]. The novel spatio-temporal layout model that we propose addresses two objectives:
1. Improving the selectivity of the action histograms by considering the action's spatial layout.
2. Optimizing the options for modeling the spatio-temporal layout of each action.
The main driver for these improvements is the observation that most human actions follow a particular spatio-temporal pattern. Such a pattern can be sequential, as in the case where somebody falls, shown in Figure 3 (also from the visint.org dataset). Clearly, from left to right, from the beginning to the end of the action, we can see the horizontal and then downward movement. We expect this to hold for more cases from the 48 human actions of the visint.org dataset. For instance, in Figure 1, in the case of an exchange (images in the middle and on the right), the moving arms and hands will be in between the two persons. In other words, the action is spatially confined: in the middle. In the case of an approach of two vehicles (Figure 1, left image), two trajectories get closer in time: at the end they will be close to each other. In the next sections, we aim to exploit such spatio-temporal layout of actions.
Fig. 3. Example of the action "fall". From left to right, from the beginning (blue) to the end of the action (red), we can see the horizontal and then downward movement. In this visualization, the STIP features as considered in this paper are shown. The frame in which they are detected is indicated by the text inside the detections.
We achieve this under the following constraints:
a. No significant increase of the size of the histograms (at maximum a factor of two). This is beneficial for learning of the back-end SVM: it avoids the curse of dimensionality and improves efficiency. The latter advantage also holds for the later application of the learned classifier.
b. No adaptation of a pre-learned feature quantizer (in this paper a random forest). This is beneficial as learning a new quantizer is time-consuming; iterative adaptive learning is foreseen to be prohibitively time-consuming.
c. No significant additional computational cost to compute the spatio-temporal layout.
Exploiting the spatio-temporal layout improves the detection of the 48 human actions. We will demonstrate a relative improvement of 19% with respect to the baseline pipeline for human action recognition, and up to >100% improvement for 5 actions.
In Section 2, we discuss previous work. Section 3 considers the modeling of spatio-temporal layout, and visualizes the layout of each of the 48 human actions. Section 4 describes the experimental setup. In addition, we evaluate the human action recognition performance with and without spatio-temporal layout, and we compare the various configurations of modeling the layout. Section 5 provides comparisons with state-of-the-art results for the IXMAS and UT-Interaction datasets. Finally, Section 6 concludes with a discussion of our results.
2. Previous Work
2.1. Pipeline: Action Detectors from a Bag-of-Words Model
We adopt our action detectors from recent literature. In Section 3 we extend them to also include the spatio-temporal layout of actions. We summarize the baseline here; more details can be found in [1].
Features. STIP features [2] have been proven to be discriminative for human action recognition [5]. Therefore we consider them as our standard feature in this paper. The advantages of these local spatio-temporal features are that they do not require any segmentation of the scene, they are able to capture detailed motion patterns of both the whole body and of the limbs, and they encode motion patterns together with local shape. The STIP features outperformed bounding-box based features [6]. They are computed regionally at spatio-temporal interest points, i.e., by a 3D Harris detector that is an extension of the well-known 2D corner detector. The features comprise histograms of gradients (HOG) and optical flow (HOF). Together these two feature types capture local shape and motion. The STIP features are computed with Laptev's implementation from [6], version 1.1, with default parameters and input images reduced in size to 640x480 pixels. The STIP-based feature vector consists of the 162 HOG-HOF values.
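As an illustration of how such detections can be consumed downstream, the following is a minimal Python sketch. It assumes that the STIP detector writes one detection per line (comment lines starting with '#'), with the location among the leading columns and the 162-dimensional HOG-HOF descriptor as the trailing columns; the exact column layout is an assumption and should be checked against the stipdet documentation.

import numpy as np

def load_stip(path):
    # Assumption: plain-text output, one detection per row, '#' comment lines,
    # with the 162-d HOG-HOF descriptor occupying the last 162 columns.
    rows = np.loadtxt(path, comments='#')
    locations = rows[:, 4:7]      # assumed (x, y, t) columns; verify for your stipdet version
    descriptors = rows[:, -162:]  # 72-d HOG followed by 90-d HOF
    return locations, descriptors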
Codebook / Quantizer. We choose the random forest as a codebook / quantizer (from here on we refer to it as quantizer). The random forest has proven to be more distinctive than k-means and it is also more efficient [3]. An additional advantage is that it serves as a feature selector, which k-means does not provide. It selects the combinations of particular features and their thresholds that give the best separation between the target and non-target class during training. We consider this property important, as we do not know a priori which motion patterns and local shapes, encoded by the STIP features, are relevant. For each action, we create a random forest with 10 trees and 32 leaves, based on 200K feature vectors: 100K from randomly selected positive videos, and 100K from randomly selected negative videos. For the random forest we use Breiman and Cutler's implementation [7], with the M-parameter equal to the total number of features (M=162). The random forest quantizes the features into histograms of length 10 x 32 = 320. We will call each bin a "word", in accordance with bag-of-features, or bag-of-words, terminology.
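A minimal sketch of this quantization step, using scikit-learn as a stand-in for Breiman and Cutler's implementation [7] (the forest size, leaf count and 320-bin histogram follow the description above; the helper names are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Forest with 10 trees and at most 32 leaves per tree; each (tree, leaf) pair
# is one of the 10 x 32 = 320 "words".
forest = RandomForestClassifier(n_estimators=10, max_leaf_nodes=32)
# forest.fit(train_descriptors, train_labels)  # 162-d HOG-HOF vectors, pos/neg labels

def leaf_maps(forest):
    # Map each tree's leaf node ids to consecutive ranks 0..(n_leaves-1).
    maps = []
    for est in forest.estimators_:
        leaves = np.where(est.tree_.children_left == -1)[0]
        maps.append({node: rank for rank, node in enumerate(leaves)})
    return maps

def quantize(forest, descriptors, n_leaves=32):
    # Unweighted bag-of-words histogram: each feature adds +1 to its word.
    maps = leaf_maps(forest)
    node_ids = forest.apply(descriptors)  # (n_features, n_trees) leaf node ids
    hist = np.zeros(len(maps) * n_leaves)
    for t, m in enumerate(maps):
        for node in node_ids[:, t]:
            hist[t * n_leaves + m[node]] += 1.0
    return hist / max(hist.sum(), 1.0)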
Action Detectors. We adopt the same bag-of-words pipeline as in our ICPR 2012 paper [1] (Section 2.3 explains the differences between the current and our ICPR paper). As a detector, we select an SVM. For various tasks, ranging from image classification [8] to action recognition [5], it was found to be the best classifier. Compared to our earlier work on action recognition [9], where we used Tag-Propagation [10] as a classifier, the SVM showed better performance. For each action, we train an SVM classifier with a χ2 kernel [4] that serves as a detector for that action. For the SVM we use the libSVM implementation [11], where the χ2 kernel is normalized by the mean distance across the full training set [4], and the SVM's slack parameter is kept at its default of 1. The weight of the positive class is set to (P+N)/P and the weight of the negative class to (P+N)/N, where P is the size of the positive class and N of the negative class [12].
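A hedged sketch of this detector, using scikit-learn's SVC (which wraps libSVM) with a precomputed kernel; the exact χ2 distance convention (with or without the factor 1/2) should be taken from [4], so the version below is an assumption, and the labels 0/1 and function names are illustrative:

import numpy as np
from sklearn.svm import SVC  # wraps libSVM

def chi2_distance(H1, H2, eps=1e-10):
    # Pairwise chi-square distances between rows of H1 and rows of H2.
    d = np.zeros((H1.shape[0], H2.shape[0]))
    for i, h in enumerate(H1):
        d[i] = 0.5 * np.sum((h - H2) ** 2 / (h + H2 + eps), axis=1)
    return d

def train_detector(H_train, y_train, P, N):
    # Normalize the kernel by the mean chi-square distance on the training set,
    # and weight the classes as (P+N)/P and (P+N)/N.
    D = chi2_distance(H_train, H_train)
    A = D.mean()
    K = np.exp(-D / A)
    svm = SVC(kernel='precomputed', C=1.0,
              class_weight={1: (P + N) / P, 0: (P + N) / N})
    svm.fit(K, y_train)
    return svm, A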
2.2. Encoding Spatial Layout
Bag-of-features approaches, including the approach described in Section 2.1, are discriminative, yet they ignore the potential discriminative power that lies in the spatial layout of the local features. The encoding of layout has been explored for image retrieval, e.g., [8], and also for action recognition, e.g., [5]. We discuss these and other approaches below. For images, the layout is spatially defined in two dimensions. In this paper, we consider action recognition, in three dimensions. However, the ideas that were laid down in papers on image retrieval are similar to the novel idea investigated in this paper. Therefore, we also include them in this discussion of previous work. Below we only consider methods that model layout, because we aim to model patterns and order. Fitting a bounding box to an action does not model those aspects and therefore such localization methods, e.g., [13,14], are out of the scope of this paper.
Spatial Pyramids. To encode the spatial arrangements of local features, like SIFT features, various methods have been proposed recently. They share the rationale that the locations of the local features, which are by themselves orderless, can be put into fixed spatial cells to obtain some order. With spatial pyramids [8] an approximate global geometric correspondence is achieved (similar to [15]), by dividing the image into levels of increasingly fine sub-regions. For each level, and all regions, a bag-of-features histogram is created. All histograms are combined in a matching kernel that is fed into an SVM classifier. In these papers, the pyramids were based on bag-of-features histograms where the quantizer or codebook was obtained by k-means clustering. As an alternative, random forests (used in our pipeline) have also been considered as a quantizer in combination with the spatial pyramid approach [16]. The spatial pyramid has been considered for action recognition in [5].
Non-rigid Layouts. Whereas spatial pyramids are rigid cells within the image or video, recently a model of non-rigid layout has been proposed in [17]. Spatial Fisher vectors deal with non-rigid layouts by retrieving the words in the bag-of-features model through joint learning from both the feature values and their locations. The learning is performed using EM, and the underlying models of appearance (the feature values) and locations are Gaussian. For each of the K words, a C-component Mixture-of-Gaussians is learned. This results in a bag-of-features histogram of length K * (1 + 2D) + K * C * (1 + 2d), where K is the number of words, C the number of components of the mixture, D the dimensionality of the feature vector, and d the number of dimensions of the feature's location. The authors also propose a more compact representation, where the appearance words are pre-learned by k-means. In that case, the histogram length becomes K + K * C * (1 + 2d).
2.3. Novelty of This Paper
Compact Histograms that encode Spatio-Temporal Layout. We propose compact histograms, where the inclusion of spatio-temporal layout leads to a histogram of the same size, or double the size in case we also consider the layout of the background features (see Section 3.2). This means that we have histograms of length 320 (same) or 640 (plus background). The Mixture-of-Gaussians approach described above would be significantly larger (with one component only, C=1, and in video, d=3): 320*(1+2*162)+320*1*(1+2*3) = 106,240. With the replacement of the appearance part of the mixture by a k-means based codebook, the histogram length becomes 320+320*1*(1+2*3) = 2,560. We propose small histograms, which is beneficial for efficient learning and classification, and which avoids the curse of dimensionality during learning.
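As a quick cross-check of these sizes (plain arithmetic, no assumptions beyond the formulas quoted above):

K, C, D, d = 320, 1, 162, 3
print(K * (1 + 2 * D) + K * C * (1 + 2 * d))  # spatial Fisher vector: 106240
print(K + K * C * (1 + 2 * d))                # compact variant with k-means words: 2560
print(K, 2 * K)                               # ours: 320 (Action) or 640 (Action + Context)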
Re-use of Pre-learned Quantizers / Codebooks. Rather than jointly learning appearance and locations to obtain the quantizer/codebook, we consider pre-learned quantizers that are extended to encode spatio-temporal layout in a second, independent stage (see Section 3). This property of our approach is beneficial as learning a new quantizer is time-consuming.
Extensive Evaluation of Spatio-Temporal Layout Configurations. The contribution of this paper is that we model the spatio-temporal layout of 48 actions. In this model, two fundamental choices are identified (see Sections 3.1 and 3.2). These configurations are extensively evaluated for 48 human actions in 1,294 realistic videos which have been recorded under highly varying recording conditions. We explore which configuration works best for each action, and we visualize the spatio-temporal layout of each action. We show that modeling spatio-temporal layout, without a significant increase in the histogram size and without re-learning of feature quantizers, yields a significant improvement over the pipeline without spatio-temporal layout.
Difference of this paper with our earlier ICPR 2012 paper [1]. The baseline bag-of-words action detector setup is identical to our ICPR paper [1], i.e., the features, random forests, kernel and SVM are the same, and we have re-used the pre-trained random forests from that paper. The novelty of our ICPR paper was the second-stage setup for classification, which is not considered in the current paper. In the current paper, the novelty is the extension of the bag-of-words action detectors to also encode spatio-temporal layout, which is not part of the ICPR paper.
3. Spatio-Temporal Layout of Human Actions
Two factors dominate the modeling of spatio-temporal layout: the coordinate system and the model of the layout itself. Both factors are discussed here.
3.1. Coordinate Systems
The detected STIP features vary with the location and extent of the observed action. To add robustness to such variations within videos of the same action, we consider two schemes of coordinate transformations that achieve increasing levels of invariance. The first scheme is to normalize the locations per video to zero mean and unit variance. With this scheme, we lose information about the absolute position in the image coordinates, and about the absolute frame number in which the STIP features were detected. We achieve that the middle of an action in one video is aligned to the middle of another video. The second scheme is that we apply the first scheme and, in addition, achieve invariance to horizontal flip. This results in a representation that is independent of whether the action takes place from left to right, or from right to left. We consider only horizontal invariance, as vertical patterns in the up-down direction (e.g., fall) are not considered to be similar to the down-up direction (e.g., throw). The same holds for temporal ordering: walking and then falling does not have a similar meaning when turned around. The horizontal invariance is achieved by aligning the action's horizontal direction (leftward or rightward) to the temporal axis. For the current video, if the correlation between the detections on the horizontal axis and the temporal axis is negative, R(X,T) < 0 with X the horizontal coordinates and T the temporal coordinates, we mirror the horizontal coordinates of the detections around the horizontal mean. In summary: the original coordinates will be referred to in the experiment sections as "Original", the normalization to zero mean and unit variance is indicated by "Normalized", and when horizontal flip invariance is added we use the term "Normalized + Flip Invariance". These schemes are investigated in the experiments in Section 4.
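A minimal sketch of these two schemes, assuming the STIP locations of one video are given as an (N, 3) array with columns (x, y, t); the function names are illustrative:

import numpy as np

def normalize(locations):
    # "Normalized": zero mean and unit variance per dimension, per video.
    return (locations - locations.mean(axis=0)) / (locations.std(axis=0) + 1e-12)

def normalize_flip_invariant(locations):
    # "Normalized + Flip Invariance": if the horizontal and temporal coordinates
    # are negatively correlated, R(X,T) < 0, mirror x around its mean so the
    # dominant motion runs left-to-right; then normalize as above.
    x, t = locations[:, 0], locations[:, 2]
    if np.corrcoef(x, t)[0, 1] < 0:
        locations = locations.copy()
        locations[:, 0] = 2 * x.mean() - x
    return normalize(locations)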
3.2. Action-Specific Spatio-Temporal 3D Gaussians
We consider the modeling of the spatio-temporal layout of an action by the locations of the STIP features. Note that these locations may have been transformed as described in Section 3.1, if such a scheme has been applied.
3D Gaussian-Mixture per "Word". Recall (from Section 2.1) that we consider a random forest to quantize the features into words, i.e., the histogram bins. Our random forest quantizes the features into 320 words. For each word, we consider all the features from the training set (see Section 4) that project onto it, and we collect their locations. The locations are 3-dimensional: (x,y,t). These word-specific locations are modeled by a 3D Mixture-of-Gaussians [18], which we refer to as Ga,w(x,y,t), where G denotes the Gaussian mixture, a is the action for which we model the layout, w is the index of the word, and (x,y,t) is the location of the feature. For each word, a probability density function (pdf) is learned by the EM algorithm [18],

G_{a,w}(x, y, t) = \sum_{k=1}^{C} \alpha_k N(x, y, t | \mu_{k,a,w}, \Sigma_{k,a,w}),

where Ga,w denotes the mixture that has been learned for action a and its wth word, k indicates one of the C components, αk is the learned mixing weight of the kth component, and N is a Gaussian function parameterized by a learned mean μk,a,w and covariance matrix Σk,a,w.
In this way, we obtain 320 word-specific pdfs Ga,w(x,y,t) for one action a. This is our model of the spatio-temporal layout of a particular action. A new feature from a test video that gets projected onto a particular word is assigned a posterior probability that the feature is in accordance with the word's extent.
Construction of Histograms. In the original random forest approach, e.g., [3], each feature that gets projected onto a particular word i adds +1 to entry i of the histogram. After all features have contributed to the histogram, it is normalized to 1. In our approach, each feature instead adds the posterior probability obtained from the word-specific Gaussian Ga,w(x,y,t) to histogram entry i. Again we normalize the histogram to 1.
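A minimal sketch of this weighting, assuming scikit-learn's GaussianMixture and the quantizer from Section 2.1. The paper speaks of a posterior probability per word; the sketch simply uses the Gaussian density of the feature's location as the weight, which is a simplification, and the helper names are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_layout(locations_per_word, n_components=1):
    # One 3D Gaussian mixture per word (C=1 in the experiments), fitted on the
    # (x, y, t) locations of the training features that quantize to that word.
    models = {}
    for w, locs in locations_per_word.items():
        locs = np.asarray(locs)
        if len(locs) > n_components:
            models[w] = GaussianMixture(n_components=n_components).fit(locs)
    return models

def layout_weighted_histogram(word_ids, locations, models, n_words=320):
    # Each feature adds the density of its location under its word's model,
    # instead of a unit count; the histogram is normalized to sum to 1.
    hist = np.zeros(n_words)
    for w, loc in zip(word_ids, locations):
        if w in models:
            hist[w] += np.exp(models[w].score_samples(loc[None, :])[0])
    return hist / max(hist.sum(), 1e-12)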
Action and Context. The word-specific pdf Ga,w(x,y,t) can be learned from feature locations that were obtained from training videos that contain the particular action of interest. More precisely, we learn Ga,w(x,y,t|a). In addition, we also consider the pdf of locations that were obtained from other videos, where the action was not present: Ga,w(x,y,t|¬a). In the experiments, we consider Ga,w(x,y,t|a) only, or alternatively, both Ga,w(x,y,t|a) and Ga,w(x,y,t|¬a). In the first configuration, we refer to the layout as "Action", whereas in the second configuration we refer to "Action + Context". In the case of "Action + Context", the histogram size is doubled, as we add the posterior probabilities of both pdf models to the histogram, and because they are different by construction, we use separate bins (two bins for each word rather than one). Here, normalization is done for the action and context parts separately, to weight their contributions equally.
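Continuing the sketch above, the "Action + Context" representation concatenates two separately normalized 320-bin parts into a 640-bin histogram (again with illustrative names, re-using layout_weighted_histogram from the previous sketch):

import numpy as np

def action_context_histogram(word_ids, locations, action_models, context_models):
    # Weight the features with the action layout G_a,w(x,y,t|a) and, separately,
    # with the context layout G_a,w(x,y,t|not a); each part is normalized on its
    # own so that action and context contribute equally, then concatenated.
    h_action = layout_weighted_histogram(word_ids, locations, action_models)
    h_context = layout_weighted_histogram(word_ids, locations, context_models)
    return np.concatenate([h_action, h_context])  # length 640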
3.3. Visualization of Each Action’s Spatio-Temporal Layout
We are interested in the layout of each of the 48 human actions, to understand the models obtained by our technique. Figure 4 visualizes the configuration "Normalized + Flip Invariance / Action", as this configuration enables us to depict the layouts in a shared coordinate frame (see Section 3.1), such that the layouts can be compared between the actions. A few observations in Figure 4 are key to the ideas laid down in this paper. The layouts should be interpreted as follows: blue indicates detections at the beginning of the action, whereas red indicates detections at the end of the action. A general observation is that all actions show a dominant pattern in the horizontal direction. People clearly tend to walk and move horizontally through the image field. Another observation is that for many actions no clear and direct interpretation follows from their visualized layout. Yet, clearly, the layouts are distinct. Below we highlight a few examples which have an obvious semantic interpretation.
Trajectory Actions: examples are Stop and Jump. For the action Stop, the layout shows that this action can start anywhere, which follows from the scatter of the blue parts of the model. The Stop action typically ends when somebody has arrived at the person or scene object that he or she moved to. Usually, that is somewhere in the middle, as can be learned from the red parts in the middle. Jump is the action with the most distinct vertical pattern. The beginning of the action is upward (blue to green), whereas the downward motion is visible later (orange to red).
(Dis)Appear Actions: examples are Enter and Exit. Typically someone enters from the side of the image field, which follows from the pattern: at the beginning the location is at the side of the image (blue), whereas Enter tends to end when somebody has arrived at somebody or some item in the middle (red). For the dataset under investigation, Exit is not the inverse of Enter: usually the exit happens by stepping into a vehicle, which is in the middle (note the red in the middle).
Kinematic Actions: examples are Fall and Put down. Fall clearly shows a pattern from top to bottom. A similar pattern is observed for Put down.

Interaction Actions: examples are Take and Collide. Take typically shows somebody who picks up something and walks away. The part of Take where the person walks away is clearly visible in its layout, from left to right, from the beginning (blue) to the end (red) of the action. Collide involves two persons that touch each other after some movement of one or both, where the collision happens approximately in the middle (orange to red).
4. Experiments on the visint.org dataset of 48 human actions
4.1. Experimental setup
The dataset includes 48 human actions in 1,294 short test videos of 10-30 seconds, given a train set of 3,480 similar videos. This dataset is novel and contributed by the DARPA Mind's Eye program on www.visint.org. The annotation is as follows: for each of the 48 human actions, a human has assessed whether the action is present in each video or not ("Is action X present?"). Typically, multiple actions are reported for every video, with on average seven reported actions. We use the train set for the training of classifiers and the optimization of combining schemes (as described later in the experiments), and we use the test set for performance evaluation only.
The only part of the pipeline from Section 2.1 that we vary in our experiments is the weighting of each feature's contribution to its histogram bin. This weighting scheme depends on the spatio-temporal layout, for which we consider the variations described in Section 3. For the 3D Gaussian, we use a single component for simplicity, C=1.
The performance will be measured by the MCC measure,

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},

with T=true, F=false, P=positive and N=negative. The MCC measure has the advantage of its independence of the sizes of the positive and negative classes. Recall from the Introduction, and specifically Figure 2, that for some actions we have many positive training samples, whereas for others we only have few. Of the 3,480 samples in the training set, there are 1,947 samples for "move" (56%), against only 50 samples for "bury" (1.4%). Clearly some classes are highly unbalanced, and because the MCC is insensitive to this, it enables us to compare performance across all actions independent of their prevalence in the dataset.
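In code, the measure reads as follows (a direct transcription of the formula; the zero-denominator fallback is a common convention, not specified in the paper):

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient; insensitive to class imbalance.
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den > 0 else 0.0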
Fig. 4. Visualization of the spatio-temporal layouts of all 48 human actions. The figure depicts the configuration "Normalized + Flip Invariance / Action". All boxes have the same coordinate frame, where the displayed extent is [-σ, +σ] in both spatial dimensions, and blue to red indicates [-σ, +σ] in the temporal dimension, where green represents 0. Each box visualizes the 320 spatio-temporal 3D Gaussians (only the mean is displayed) that model the layout for that action.
4.2. Spatio-Temporal Layout Configurations: Comparison per Action
Figure 5 depicts the results of using no spatial layout and the 6 configurations that model the spatio-temporal layout, as proposed in Section 3: three coordinate systems (Original, Normalized, Normalized + Flip Invariance) combined with two extents of the layout (Action, Action + Context). Our main findings are summarized below.

For 43 out of 48 actions improvements are achieved by adding spatio-temporal layout. For the layout "Normalized + Flip Invariance / Action" the improvements are marginal, as can also be seen in Table 1. "Normalized / Action + Context" performs badly. For the other 4 layouts, significant improvements are achieved with respect to the representation without spatio-temporal layout. The layouts add discriminative power. For 5 actions the performance is better without spatio-temporal layout.

Truly different layouts have the most distinct performances. For those actions where "Original / Action" performs best, a change of both coordinate system and context, i.e., "Normalized + Flip Invariance / Action + Context", performs worst. The opposite comparison also holds.

Context has more effect than Coordinate System. Whereas "Original / Action" produces results that are very comparable to "Normalized / Action", it produces distinct results compared to "Original / Action + Context". The same shows for "Normalized + Flip Invariance / Action + Context": these results are comparable to "Original / Action + Context" and very distinct from "Normalized + Flip Invariance / Action".

For some actions, the merit of layout is significant. For kick, the performance without layout is ~0.1 and the improvement is ~0.1. Kick is well-localized: somebody stands somewhere and the legs move. The same holds for fall, which improves from ~0.0 to ~0.1. Many actions that involve items improve, for example: carry improves from ~0.15 to ~0.2, push from ~0.05 to ~0.1, lift from ~0.25 to ~0.3, and haul and raise from ~0.15 to ~0.2. Actions that involve two persons who move relative to each other also improve: flee improves from ~0.3 to ~0.4, chase improves from a negative score to a small positive score, follow improves from ~0.05 to ~0.1, and receive from ~0.1 to ~0.15.
Fig. 5. The performance of recognizing 48 human actions (measured by MCC on the horizontal axis), with the various spatio-temporal layouts vs. no spatio-temporal layout. For each layout, we indicate for which actions it works best (see the lines that connect the optimal layout for a set of actions).
4.3. Spatio-Temporal Layout Configurations: Across the 48 Actions
Table 1 summarizes the results shown in Figure 5. The MCC averaged over all actions shows that no single layout is best overall: for each layout the performance is approximately 0.17, whereas without layout the performance is 0.16. When the optimal layout per action is selected, the performance becomes 0.20.

Normalization of Action and its Context requires Horizontal Alignment. "Normalized / Action + Context" performs badly. The other coordinate systems plus "Action + Context" perform much better. It appears that the action can be modelled relative to its context only in the original coordinate system, or if both are also aligned with respect to their dominant trajectory. That makes sense: in the original coordinate system, the STIP features will be far apart, which makes it possible to discriminate between them. In the normalized scheme, all features will be projected into the same area: dominant leftward and rightward actions now overlap and their distinction becomes impossible. Such actions can be distinguished again when the dominant horizontal motions are aligned in the "Normalized + Flip Invariance" layout.

There is no single best layout. This is confirmed by the number of actions for which each layout performs best. Without any layout, 5 actions are recognized best. For all other actions, one of the spatio-temporal layouts performs better. Two layouts give improvements for only 4 actions, whereas "Original / Action + Context" performs best for the most actions: 10.

All layouts improve significantly for a subset of actions. We consider the relative improvement of each layout for the subset of actions for which it performs best. The relative improvements for the layouts with only the "Action", without "Context", are all below 20%. "Original / Action + Context" achieves an average improvement of 32%. Its best improvement is a gain of 0.100 in the MCC score of the fall action. For "Normalized + Flip Invariance / Action + Context", the best average relative improvement is achieved: 37%, with a maximum gain of 0.126 in the MCC score of the replace action.
Table 1. The performance of human action recognition for all configurations of spatio-temporal layout as described in Section 3 (coordinate systems and 3D Gaussians) and without layout. The table indicates for each configuration the MCC averaged across all 48 actions, the number of actions for which it performed best, the average improvement for the actions where it performed best (both absolute and relative), and the maximum improvement.

Coordinate System           3D Gaussian         MCC avg.   actions best   MCC impr. avg.   MCC impr. max.
Original                    Action              0.170      9              0.030 (18%)      0.078
Original                    Action + Context    0.167      10             0.052 (32%)      0.100
Normalized                  Action              0.168      4              0.031 (19%)      0.037
Normalized                  Action + Context    0.005      0              -                -
Normalized + Flip Inv.      Action              0.169      4              0.027 (16%)      0.052
Normalized + Flip Inv.      Action + Context    0.167      9              0.060 (37%)      0.126
None                        -                   0.164      5              -                -
Best configuration per action                   0.195      48             0.031 (19%)      0.126
4.4. Merit for each Action
For almost all actions improvements are achieved. Most notably, for 7 actions the absolute improvement due to spatio-temporal layout is > 0.07. These actions are: replace, fall, kick, follow, bury, flee, hit. Indeed, as can be seen in Figure 4, all these actions have a clear spatio-temporal motion pattern. For 13 actions, the improvement is > 0.05, whereas for 21 actions the improvement is > 0.03.

For eleven actions the method fails to improve the action MCC. For five actions, the performance is best without layout (see the upper part of Figure 5 and the right part of Figure 6), and for these actions the results degrade with layout: walk, approach, open, enter, get. It seems that these are actions that can happen anywhere in the scene and have a large spatio-temporal extent, e.g., walk. Open is ill-localized, because the opening of a bag can be done on the ground as well as on a table, for instance. For six actions, the performance is similar with and without layout (see Figure 6 where the merit is zero or close to zero): run, bounce, exchange, putdown, throw, open. Note that for three other verbs the MCC = 0 both with and without layout: attach, catch, move.
Improvements are achieved for complex actions involving items. An example is replace (MCC improved significantly, by ~0.13); another example is push (improved by ~0.07). Such actions are hard to detect purely based on STIP features, because they involve an item that is being manipulated. Yet, the pushing motion toward the item is always from one side. We conclude that such information aids the detection of these more complex actions. For many actions that involve items, e.g., snatch, carry, haul, pickup, lift and drop, the improvement in MCC is more than 0.03.
Fig. 6. Absolute improvements for the recognition of 48 human actions, by considering spatio-temporal layout (best configuration per action), relative to the representation without spatio-temporal layout.
5. Comparison to the State-of-the-Art
In this section, we compare our spatio-temporal layout to state-of-the-art methods on commonly used datasets. To the best of our knowledge, besides our earlier work [1,9] no results have been published on the visint.org dataset yet; it has been released only very recently. Therefore, we compare to the state-of-the-art on the single-person multi-viewpoint IXMAS dataset of subtle actions [19] and the UT-Interaction dataset [20] that contains two-person interactions. We select these datasets as together they represent a broad scope of actions.
5.1. IXMAS
The IXMAS dataset [19] consists of 12 complete action classes, with each action executed three times by 12 subjects and recorded by five cameras with a frame size of 390 × 291 pixels. These actions are: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point and pick up. The body position and orientation are freely decided by the different subjects. The standard setup on this dataset is the leave-one-subject-out cross-validation setting. We compare against the state-of-the-art result of [21], who achieved a 78.0% recognition accuracy across the five cameras using Multiple-Kernel Learning with Augmented Features (AFMKL). We have re-used the random forests from the experiments in Section 4, because we want to demonstrate that our approach is generic (further tuning of the random forests specifically to the IXMAS actions may further improve the results). For each action in IXMAS, we select the best forest and spatio-temporal layout configuration, based on cross-validation. For each action we obtain a detector, and we assign the test sample to the single action that maximizes the a-posteriori probability. The performance of our "Best Configuration" method is 79.1%, thereby outperforming [21] by 1.1% on average, see Table 2.
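A hedged sketch of this multi-class decision, re-using chi2_distance from the detector sketch in Section 2.1 and using the SVM decision value as a stand-in for the posterior score mentioned above; the dictionaries and names are illustrative:

import numpy as np

def detect_score(svm, A, H_train, h_test):
    # Score of one action detector for a test histogram: chi-square kernel values
    # to the training histograms (as in the training sketch), then the SVM margin.
    k = np.exp(-chi2_distance(h_test[None, :], H_train) / A)
    return svm.decision_function(k)[0]

def classify_clip(hists_per_config, detectors, best_config):
    # detectors[action] = (svm, A, H_train); best_config[action] is the layout
    # configuration selected by cross-validation for that action. The clip is
    # assigned to the action whose detector scores highest.
    scores = {a: detect_score(*det, hists_per_config[best_config[a]])
              for a, det in detectors.items()}
    return max(scores, key=scores.get)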
Table 2. Performance on the IXMAS dataset [19] of the state-of-the-art method (AFMKL [21]) and of our method, per camera viewpoint.

Camera viewpoint     1      2      3      4      5      Average (%)
AFMKL [21]           81.9   80.1   77.1   77.6   73.4   78.0
Our method           80.8   81.4   77.8   81.7   73.7   79.1
5.2. UT-Interaction
We used the segmented version of the UT-Interaction dataset [20], containing videos of six types of human activities: hand-shaking, hugging, kicking, pointing, punching, and pushing. The UT-Interaction dataset is a public video dataset containing high-level human activities of multiple actors. We consider the #1 set of this dataset, following the setup in [22]. The #1 set contains a total of 60 videos of the six types of human-human interactions. Each set is composed of 10 sequences, and each sequence contains one execution per activity. The videos involve camera jitter and/or background movements (e.g., trees). Several pedestrians are present in the videos as well, making the recognition harder. Following [22], we consider leave-one-sequence-out cross-validation, i.e., a 10-fold cross-validation. That is, for each round, the videos in one sequence were selected for testing, and the videos in the other sequences were used for training. Identical to the experiment on IXMAS in Section 5.1, we have re-used the random forests from the experiments in Section 4, and we established for each action in the UT-Interaction set the best forest and spatio-temporal layout configuration, based on cross-validation. For each action we obtain a detector, and we assign the test sample to the single action that maximizes the a-posteriori probability. We compare against three well-performing methods [20,22,23]. In [22] a Hough-voting scheme is proposed. In [23] a dynamical bag-of-words model is proposed, which, together with the cuboid + SVM setup in [20], is very similar to our method, being also a bag-of-words method, except that [20] and [23] do not model spatio-temporal layout. The performance of our "Best Configuration" method is 93.3%, thereby outperforming the best result on UT-Interaction [22] by a significant gain of 5.3%, see Table 3. Also notable is the increase of 8.3% over the best score of [20] and [23], which can be attributed solely to the spatio-temporal layout as proposed in this paper.
Table 3. Performance on the UT-Interaction dataset [20] of the state-of-the-art methods on this dataset and of our method.

Method                    Accuracy
Waltisberg et al. [22]    88.0%
Ryoo [23]                 85.0%
Ryoo et al. [20]          83.3%
Our method                93.3%
6. Discussion
For human action recognition, we have considered the visint.org dataset of 48 actions in 3,480 train and 1,294 test videos, ranging from simple actions such as walk to complex actions such as exchange. The state-of-the-art bag-of-features approach discards any spatio-temporal location information. Our results show that for action recognition this information can be utilized by modeling it in a pdf. We have modeled the spatio-temporal layout of all 48 actions, by considering the locations of each action's STIP features. We have used a pipeline of STIP features, a random forest to quantize the features into histograms, and an SVM classifier as a detector for each of the 48 actions. We have proposed 6 configurations of modeling the layout, where the varied parameters are the coordinate system and the modeling of the action and its context. We have visualized the spatio-temporal layouts and demonstrated that the patterns for each action are distinct and that some can be semantically interpreted.

In terms of action recognition performance, we have shown that there is no single best layout for all actions. Rather, we have considered the optimal layout for each action. For 43 actions, the performance is better or equal when spatio-temporal layout is included, while the other 5 actions do not degrade significantly. The improvement is achieved without changing any other parameter of the processing pipeline, without re-learning of the quantizer/codebook (a random forest), with a limited increase of the size of the representation by a factor of only two, and at a limited additional computational cost of only a handful of operations per feature's location (i.e., evaluating a 3D Gaussian function). For 7 actions, the improvement is large (MCC score improved by > 0.07), and for 21 actions we improve reasonably (> 0.03). We have found experimentally that modeling the layout of the action's context (by a model of all non-action STIP features) is more important than the configuration of the action's coordinate system. We have learned that the most impressive improvements are achieved for complex actions involving items. In total, relative to the processing pipeline without spatio-temporal selectivity, our extended pipeline improves by 19% on average over all actions.

Finally, we have compared our method to the state-of-the-art. On the IXMAS dataset, we outperform the state-of-the-art by 1.1% (was 78.0%, ours 79.1%). On the UT-Interaction dataset, we outperform the state-of-the-art by 5.3% (was 88.0%, ours 93.3%).
Acknowledgements
This work is supported by DARPA (Mind's Eye program). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.
References

[1] G.J. Burghouts, K. Schutte, Correlations Between 48 Human Actions Improve Their Detection, ICPR (2012)
[2] I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision, 64 (2/3) (2005)
[3] F. Moosmann, B. Triggs, F. Jurie, Randomized Clustering Forests for Building Fast and Discriminative Visual Vocabularies, NIPS (2006)
[4] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: A comprehensive study, International Journal of Computer Vision, 73 (2) (2007)
[5] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning Realistic Human Actions from Movies, CVPR (2008)
[6] G.J. Burghouts, K. Schutte, R. den Hollander, H. Bouma, Selection of Negative Samples and Two-Stage Combination of Multiple Features for Action Detection in Thousands of Videos, MVAP, submitted (2012)
[7] L. Breiman, Random forests, Machine Learning, 45 (1) (2001)
[8] S. Lazebnik, C. Schmid, J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR (2006)
[9] H. Bouma, P. Hanckmann, J-W. Marck, L. de Penning, R. den Hollander, J-M. ten Hove, S.P. van den Broek, K. Schutte, G.J. Burghouts, Automatic human action recognition in a scene from visual inputs, Proc. SPIE Vol. 8388 (2012)
[10] M. Guillaumin, T. Mensink, J. Verbeek, TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, ICCV (2011)
[11] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001)
[12] K.E.A. van de Sande, T. Gevers, C.G.M. Snoek, Evaluating Color Descriptors for Object and Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32 (9) (2010)
[13] C.H. Lampert, M.B. Blaschko, T. Hofmann, Beyond sliding windows: Object localization by efficient subwindow search, CVPR (2008)
[14] J. Liu, J. Luo, M. Shah, Recognizing Realistic Actions from Videos "in the Wild", CVPR (2009)
[15] K. Grauman, T. Darrell, Pyramid match kernels: Discriminative classification with sets of image features, ICCV (2005)
[16] A. Bosch, A. Zisserman, X. Muñoz, Image Classification using Random Forests and Ferns, ICCV (2007)
[17] J. Krapac, J. Verbeek, F. Jurie, Modeling Spatial Layout with Fisher Vectors for Image Categorization, ICCV (2011)
[18] C.M. Bishop, Pattern Recognition and Machine Learning, Springer (2006)
[19] D. Weinland, E. Boyer, R. Ronfard, Action Recognition from Arbitrary Views using 3D Exemplars, ICCV (2007)
[20] M.S. Ryoo, C.-C. Chen, J.K. Aggarwal, A. Roy-Chowdhury, An Overview of Contest on Semantic Description of Human Activities, ICPR (2010)
[21] X. Wu, D. Xu, L. Duan, J. Luo, Action Recognition using Context and Appearance Distribution Features, CVPR (2011)
[22] D. Waltisberg, A. Yao, J. Gall, L. Van Gool, Variations of a Hough-voting action recognition system, ICPR (2010)
[23] M.S. Ryoo, Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos, ICCV (2011)
Graphical abstract. Pipeline to recognize 48 human actions in 1,294 realistic videos, adopted from the state-of-the-art (see references inside the paper): video, STIP features, a histogram (bag-of-features) built with a codebook (random forest), and an action detector (SVM). Our method adds a spatio-temporal layout for each of the 48 actions: +19.2% average improvement, >100% improvement for 5 actions, >30% improvement for 14 actions, >10% improvement for 26 actions.
Highlights:
* Using the spatio-temporal layout of human actions improves discrimination.
* The method is a weighting scheme on top of the popular bag-of-words model.
* No need to re-learn the action codebook.
* Tested on 1,294 test videos, under varying recording conditions, with 195 variations per action: 19.2% improvement.
* Outperforms state-of-the-art on the IXMAS and UT-Interaction datasets.