Adversarial Methods Improve Object Localization

Sima Behpour and Brian D. Ziebart
University of Illinois at Chicago
{sbehpo2,bziebart}@uic.edu

Workshop on Adversarial Training, NIPS 2016, Barcelona, Spain.

Abstract

We propose deep adversarial object localization, which poses object localization as an adversarial game between a loss-minimizing prediction player and a loss-maximizing adversarial player; rather than approximating the loss function, the adversary approximates the ground truth annotations of the training images. We constrain the adversary to match specified properties of the training data that are uncovered from a convolutional neural network's feature representation. We demonstrate the efficiency and predictive performance of our approach on the ILSVRC2012 image dataset, showing significant improvements over other prediction methods.

1 Introduction

Performance in computer vision tasks is often assessed using specialized evaluation measures. For example, successful object detection is often defined by predicting a bounding box for an object that overlaps with more than some threshold (e.g., 50%, 70%) of the area of the object's ground truth bounding box. Though deep learning architectures have drastically improved the proficiency of computer vision systems in recent years by constructing extremely rich and general feature representations for images [Szegedy et al., 2015, Simonyan and Zisserman, 2014a], there is still a significant mismatch between the objective functions they are designed to optimize and the application performance measures for which they will be applied. This degrades task performance both in theory [Liu, 2007] and in practice, as we show in this paper. Adversarial methods have found success when used in generative formulations to train deep architectures and obtain useful feature representations [Goodfellow et al., 2014, Salimans et al., 2016, Chen et al., 2016, Mirza and Osindero, 2014, Denton et al., 2015] due to the additional robustness that considering an adversary introduces. We explore a similar hypothesis in this work: does better aligning predictor training objectives with evaluation performance measures, using adversarial prediction in conjunction with deep learning architectures, provide advantages in object localization tasks?

2 Approach

Game formulation: We view object localization using bounding box proposals (Figure 1) as a two-player game between a predictor player $\hat{Y}$ and an adversarial player $\check{Y}$ determining the evaluation distribution [Asif et al., 2015]. Player $\hat{Y}$ first chooses a predictive distribution of bounding boxes, $\hat{P}(\hat{y}|x)$, to minimize the expected loss; player $\check{Y}$ then stochastically chooses an evaluation distribution, $\check{P}(\check{y}|x)$, that maximizes the expected loss while also (approximately) matching a set of statistics, $\Phi(x, y)$, in expectation. We measure these statistics/features from labeled data and, in this paper, define them by leveraging the feature representations learned by a Convolutional Neural Network (CNN) [Vedaldi and Lenc, 2015].

Figure 1: Prediction game bounding boxes.

Definition 1. The Adversarial Object Localization (AOL) game is:

$$\min_{\hat{P}(\hat{y}|x)} \; \max_{\check{P}(\check{y}|x)} \; \mathbb{E}_{X,Y \sim \tilde{P};\, \hat{Y}|X \sim \hat{P};\, \check{Y}|X \sim \check{P}}\!\left[\mathrm{loss}(\hat{Y}, \check{Y})\right] \;\; \text{such that:} \;\; \mathbb{E}_{X \sim \tilde{P};\, \check{Y}|X \sim \check{P}}\!\left[\phi(X, \check{Y})\right] = \mathbb{E}_{X,Y \sim \tilde{P}}\!\left[\phi(X, Y)\right] \tag{1}$$

$$\Longleftrightarrow \; \min_{\theta} \; \mathbb{E}_{X,Y \sim \tilde{P}}\!\left[ \min_{\hat{P}(\hat{y}|x)} \; \max_{\check{P}(\check{y}|x)} \; \mathbb{E}_{\hat{Y}|X \sim \hat{P};\, \check{Y}|X \sim \check{P}}\!\left[ \mathrm{loss}(\hat{Y}, \check{Y}) + \theta \cdot \phi(X, \check{Y}) \,\middle|\, X \right] - \theta \cdot \phi(X, Y) \right] \tag{2}$$

where $\hat{P}(\hat{y}|x)$ and $\check{P}(\check{y}|x)$ are distributions over the $|\mathcal{Y}|$ predicted bounding boxes and $\phi(\cdot,\cdot)$ are "features" characterizing relationships between the input pixels $x$ and bounding box existence $y$. Due to strong Lagrangian duality, the dual solution (Equation (2)) is equivalent to the original formulation's solution [Asif et al., 2015]. We consider two types of losses: the non-overlap loss, $\mathrm{loss}_{1-o}(\hat{y}, \check{y}) = 1 - \mathrm{area}(\hat{y} \cap \check{y})/\mathrm{area}(\hat{y} \cup \check{y})$; and the thresholded overlap loss, $\mathrm{loss}_{o<\alpha}(\hat{y}, \check{y}) = \mathbb{1}\!\left[\mathrm{area}(\hat{y} \cap \check{y})/\mathrm{area}(\hat{y} \cup \check{y}) < \alpha\right]$.

Table 1: The payoffs of the adversarial object localization game based on losses, $\ell(\hat{y}, \check{y})$, between predicted ($\hat{y}$) and adversarial ($\check{y}$) bounding boxes, and potential terms $\psi(\check{y}) = \theta \cdot \phi(\check{y})$. [The table's entries, payoff terms $\ell(\hat{y}, \check{y}) + \psi(\check{y})$ for pairs of example bounding boxes shown as images, are omitted here.]
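The equivalence between (1) and (2) is the standard Lagrangian argument of Asif et al. [2015]; as a brief sketch (details omitted), dualizing the adversary's moment-matching constraint with multipliers $\theta$ gives, for any fixed $\hat{P}$,

$$\max_{\check{P}:\, \mathbb{E}[\phi(X,\check{Y})] = \mathbb{E}[\phi(X,Y)]} \mathbb{E}\!\left[\mathrm{loss}(\hat{Y},\check{Y})\right] \;=\; \min_{\theta}\, \max_{\check{P}} \; \mathbb{E}\!\left[\mathrm{loss}(\hat{Y},\check{Y})\right] + \theta \cdot \Big( \mathbb{E}\!\left[\phi(X,\check{Y})\right] - \mathbb{E}\!\left[\phi(X,Y)\right] \Big),$$

where equality holds because the objective and constraint are linear in $\check{P}$ and the constraint set is non-empty (the empirical distribution satisfies it). Exchanging the outer $\min_{\hat{P}}$ with $\min_{\theta}$ and noting that, for a fixed $\theta$, the inner game decomposes over examples conditioned on $X$ then yields Equation (2).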

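To make the two losses concrete, the following is a minimal Python sketch (illustrative only, not part of the paper's implementation), assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def loss_non_overlap(y_hat, y_check):
    """loss_{1-o}: one minus the overlap between predicted and adversarial boxes."""
    return 1.0 - iou(y_hat, y_check)


def loss_thresholded(y_hat, y_check, alpha=0.5):
    """loss_{o<alpha}: 1 if the overlap falls below the threshold alpha, else 0."""
    return 1.0 if iou(y_hat, y_check) < alpha else 0.0
```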

Efficient game solutions: We employ a constraint-generation method [McMahan et al., 2003, Wang et al., 2015] to solve the AOL game more efficiently. Its operation is detailed in Algorithm 1. It iteratively obtains a Nash equilibrium for a game defined over a subset of the bounding boxes, finds a player's best response strategy (bounding box) to that equilibrium distribution, and then adds the best response to the set of strategies defining the game. When additional best responses no longer improve either player's game value (Figure 2), the subgame equilibrium is guaranteed to be an equilibrium of the larger game [McMahan et al., 2003]. At testing time, we employ the most probable strategy under this distribution.

Algorithm 1 Adversarial Object Localization equilibrium computation
Input: Image img; Parameters θ
Output: Nash equilibrium (P̂, P̌)
  BoxProposals ← EdgeBox(img)
  Φ ← VggNet.LastLayer(img, BoxProposals)
  ψ ← θ · Φ
  Š ← Ŝ ← argmax ψ
  repeat
    (P̂, P̌, V̌al) ← solveGame(Ŝ, Š, ψ(Š), loss(Š, Ŝ))
    (Š_new, maxV) ← argmax E_{P̂(Ŝ)}[loss(Š, Ŝ) + ψ(Š)]
    if V̌al ≠ maxV then
      Š ← Š ∪ Š_new
    end if
    (P̂, P̌, V̂al) ← solveGame(Ŝ, Š, ψ(Š), loss(Š, Ŝ))
    (Ŝ_new, minV) ← argmin E_{P̌(Š)}[loss(Š, Ŝ)]
    if V̂al ≠ minV then
      Ŝ ← Ŝ ∪ Ŝ_new
    end if
  until V̌al = maxV = V̂al = minV
  return (P̂, P̌)

Figure 2: Final strategies for predictor (black/blue bounding boxes) and adversary (black/red bounding boxes) with non-zero probability. Black boxes are the highest probability strategies.
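Algorithm 1's solveGame step computes a Nash equilibrium of a zero-sum matrix game over the current strategy sets, and the surrounding loop is a double-oracle/constraint-generation scheme. The following Python sketch is an illustrative re-implementation of that scheme, not the authors' code: it builds the payoff matrix G[i, j] = loss(ŷ_i, y̌_j) + ψ(y̌_j), solves each player's equilibrium mixture as a linear program via scipy.optimize.linprog, and grows the strategy sets with best responses until neither player can improve. The helpers loss_fn and psi are assumed to be supplied (e.g., the IoU losses above and potentials ψ = θ · Φ over all proposals).

```python
import numpy as np
from scipy.optimize import linprog


def solve_zero_sum(G):
    """Equilibrium of a zero-sum matrix game: the row player minimizes p^T G q,
    the column player maximizes it. Returns (p_hat, p_check, game value)."""
    m, n = G.shape
    # Row player: minimize v subject to (G^T p)_j <= v for all j and sum(p) = 1.
    res_row = linprog(
        c=np.r_[np.zeros(m), 1.0],
        A_ub=np.c_[G.T, -np.ones(n)], b_ub=np.zeros(n),
        A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
        bounds=[(0, None)] * m + [(None, None)], method="highs")
    p_hat, value = res_row.x[:m], float(res_row.x[m])
    # Column player: maximize u subject to (G q)_i >= u for all i and sum(q) = 1.
    res_col = linprog(
        c=np.r_[np.zeros(n), -1.0],
        A_ub=np.c_[-G, np.ones(m)], b_ub=np.zeros(m),
        A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
        bounds=[(0, None)] * n + [(None, None)], method="highs")
    return p_hat, res_col.x[:n], value


def aol_equilibrium(boxes, psi, loss_fn, tol=1e-6):
    """Constraint generation (double oracle) over bounding-box strategies."""
    start = int(np.argmax(psi))          # both players start from the highest-potential box
    S_hat, S_check = [start], [start]
    while True:
        # Payoff matrix restricted to the current strategy sets.
        G = np.array([[loss_fn(boxes[i], boxes[j]) + psi[j] for j in S_check]
                      for i in S_hat])
        p_hat, p_check, value = solve_zero_sum(G)
        # Adversary's best pure response to the predictor's mixture.
        adv = [sum(p_hat[a] * (loss_fn(boxes[i], boxes[j]) + psi[j])
                   for a, i in enumerate(S_hat)) for j in range(len(boxes))]
        # Predictor's best pure response to the adversary's mixture.
        pred = [sum(p_check[b] * (loss_fn(boxes[i], boxes[j]) + psi[j])
                    for b, j in enumerate(S_check)) for i in range(len(boxes))]
        j_star, i_star = int(np.argmax(adv)), int(np.argmin(pred))
        improved = False
        if adv[j_star] > value + tol and j_star not in S_check:
            S_check.append(j_star)
            improved = True
        if pred[i_star] < value - tol and i_star not in S_hat:
            S_hat.append(i_star)
            improved = True
        if not improved:                 # neither player can improve: subgame equilibrium
            return S_hat, S_check, p_hat, p_check, value
```

Solving each player's linear program separately avoids relying on dual variables from a single solve; for the small subgames produced by constraint generation, the overhead is negligible.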


3 Experimental Evaluation

Experimental setup: We evaluate the effectiveness of our method in solving object localization tasks using 10 classes (Table 2) from the Imagenet2015 dataset [Russakovsky et al., 2015]. Since the images vary greatly in size, we first re-size all of the images to 1360 by 800 pixels, then apply EdgeBox [Zitnick and Dollár, 2014] to generate a relatively small set of candidate bounding boxes (up to 250) that cover the objects in the image. We represent bounding boxes for images using sets of 1000 CNN descriptors [Simonyan and Zisserman, 2014b, Vedaldi and Lenc, 2015] for each bounding box proposal provided by EdgeBox.

Table 2: Train and test samples for our experiments.
Class       #Training  #Testing
Airplane    400        50
Bird        1600       200
Bus         330        50
Car         565        100
Cat         325        55
Cow         246        40
Dog         500        100
Horse       520        50
Monitor/TV  385        50
Sofa        380        50

To show the relative performance of AOL, we benchmark it against a support vector machine trained to incorporate the overlap into its hinge loss function (a structured SVM, denoted SSVM) and against multiclass logistic regression. We use an existing SSVM implementation [Vedaldi, 2011] for the former to learn and produce predictions. It employs constraint generation and uses a technique to accelerate the learning process by adding multiple diverse constraints at each pass through the bounding boxes. We train one variant of the SSVM model (denoted SSVM) using the non-overlap loss function, loss_{1-o}, and two additional variants (denoted SSVM50 and SSVM70) using the thresholded losses, loss_{o<50%} and loss_{o<70%}. For logistic regression, we estimate a distribution over all proposed bounding boxes that maximizes the conditional likelihood of proposed bounding boxes with an overlap of at least 50% with the example's ground truth bounding box annotation. We produce bounding box predictions from the Bayes-optimal decision under the estimated conditional bounding box distribution for the non-overlap loss (LR) and for the thresholded overlap losses (LR50, LR70). Similarly, we train our AOL method for the non-overlap (AOL) and for the thresholded overlap (AOL50, AOL70) loss functions.

Evaluation results: We evaluate the performance of each approach using the overlap between the predicted bounding box and the ground truth bounding box. Figure 3 shows the amount of predicted bounding box overlap with the object's ground truth bounding box across the entire set of examples for each method and object class. We note a few general trends. First, the two AOL localizers are either the best or competitive with the best for all of the datasets. Second, we note that the AOL localizer, the AOL50 localizer, or the AOL70 localizer provides the best performance for the amount of overlap and the thresholded overlaps (50% and 70%), with the exception of Sofa 50% and Sofa 70%.
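The per-proposal CNN descriptors described in the setup can be sketched as follows. This is a hypothetical stand-in, assuming a torchvision VGG16 in place of the MatConvNet network used in the paper and proposals already given as (x1, y1, x2, y2) pixel boxes; the EdgeBox proposal step itself is not reproduced here.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import vgg16

# Stand-in for the "VggNet.LastLayer" step of Algorithm 1: one 1000-dimensional
# descriptor per box proposal, taken from the network's final layer.
model = vgg16(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])


def proposal_descriptors(image_path, proposals):
    """Return an (N, 1000) tensor of descriptors for N (x1, y1, x2, y2) proposals."""
    image = Image.open(image_path).convert("RGB")
    crops = torch.stack([preprocess(image.crop(tuple(box))) for box in proposals])
    with torch.no_grad():
        return model(crops)


# Potentials for Algorithm 1, given learned parameters theta (a 1000-vector):
# psi = proposal_descriptors("image.jpg", proposals) @ theta
```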

Figure 3 panels: (a) Airplane, (b) Bird, (c) Bus, (d) Car, (e) Cat, (f) Cow, (g) Dog, (h) Horse, (i) Monitor/TV, (j) Sofa.

Figure 3: The number of test examples (x axis) having at most a specified amount of overlap with the ground truth bounding box (y axis) for three methods: adversarial object localization (AOL), logistic regression (LR), and structured support vector machines (SSVM), for ten classes from the Imagenet 2012 dataset. Larger values are better.

Of more significance is how each method adapts to loss function modification. The adaptation from AOL to AOL50 and AOL70 is visually evident for many of these datasets: the black line (AOL50) and black dotted line (AOL70) are "inflated" to the left of each plot so that more examples exceed the critical 50% or 70% overlap threshold of the targeted loss function. For instance, averaged over all classes, 10.8% more (absolute) of the testing examples predicted using AOL50 exceed the critical 50% threshold than those predicted by AOL. The corresponding reduction in test example "mistakes" (those below the 50% overlap threshold) between AOL and AOL50 is 65%. The minimal reduction across the 10 datasets (ignoring Car, for which no improvement is possible) is over 40%. Similar reductions exist for AOL70 and the 70% threshold. From these results, we affirm the central hypothesis of this work: by better aligning localizer training objectives with performance measures of interest, significant improvements to generalized application performance can be realized. An unexpected result of these experiments is that in optimizing for the 50% and 70% threshold losses, AOL50 and AOL70 also realize increases in overlap. Additionally, AOL70 provided improved performance for the 50% overlap measure. A similar pattern held for the SSVM and LR models. Further investigation is needed, but one possible explanation is that the thresholded loss functions create a sharper "strategic" game between predictor and adversary in the AOL formulation, more influential support vectors for the SSVM, and more focused predictions of the Bayes-optimal action under logistic regression.
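The summary statistics quoted above (the fraction of test examples exceeding an overlap threshold, and the relative reduction in below-threshold mistakes between two models) can be computed from per-example overlaps with small helpers like the following illustrative sketch (not the paper's evaluation code):

```python
import numpy as np


def fraction_above(overlaps, threshold):
    """Fraction of test examples whose predicted box reaches the overlap threshold."""
    return float(np.mean(np.asarray(overlaps) >= threshold))


def mistake_reduction(overlaps_base, overlaps_new, threshold):
    """Relative reduction in below-threshold examples, e.g., between AOL and AOL50."""
    base = np.mean(np.asarray(overlaps_base) < threshold)
    new = np.mean(np.asarray(overlaps_new) < threshold)
    return float((base - new) / base) if base > 0 else 0.0


# With hypothetical per-example IoU arrays `aol` and `aol50`:
# fraction_above(aol50, 0.5) - fraction_above(aol, 0.5)   # absolute gain at 50%
# mistake_reduction(aol, aol50, 0.5)                      # relative reduction in mistakes
```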

4 Conclusions

In this paper, we have developed an adversarial formulation for object localization that better leverages the rich feature representations provided by deep learning for optimizing specific application performance measures. We demonstrated the benefits for object localization on ten different object classes, showing significant improvements for our approach when tuned to the non-overlap evaluation measure. In our experiments to date, we have used a pre-trained deep architecture to inform our predictions.

Acknowledgements: This research was supported as part of the Future of Life Institute (futureoflife.org) FLI-RFP-AI1 program, grant #2016-158710, and by NSF grant RI-#1526379.

References

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014a.

Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016. URL http://arxiv.org/abs/1606.03657.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. CoRR, abs/1506.05751, 2015. URL http://arxiv.org/abs/1506.05751.

Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia, 2015.

H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the International Conference on Machine Learning, pages 536–543, 2003.

Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision (ECCV), September 2014. URL https://www.microsoft.com/en-us/research/publication/edge-boxes-locating-object-proposals-from-edges/.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014b.

A. Vedaldi. A MATLAB wrapper of SVMstruct. http://www.vlfeat.org/~vedaldi/code/svm-struct-matlab.html, 2011.
