Adversarial Methods Improve Object Localization
Sima Behpour Brian D. Ziebart University of Illinois at Chicago {sbehpo2,bziebart}@uic.edu
Abstract We propose deep adversarial object localization, which approximates ground truth annotations of training images instead of approximating the loss function by posing object localization as an adversarial game between a loss-minimizing prediction player and a loss-maximizing adversarial player. We constrain the adversary to match specified properties of training data that are uncovered from a convolutional neural network’s feature representation. We demonstrate the efficiency and predictive performance of our approach on the ILSVRC2012 image dataset, showing significant improvements over other prediction methods.
1
Introduction
Performance in computer vision tasks is often assessed using specialized evaluation measures. For example, successful object detection is often defined as predicting a bounding box that overlaps with more than some threshold (e.g., 50% or 70%) of the area of the object's ground truth bounding box. Though deep learning architectures have drastically improved the proficiency of computer vision systems in recent years by constructing extremely rich and general feature representations for images [Szegedy et al., 2015, Simonyan and Zisserman, 2014a], there is still a significant mismatch between the objective functions they are designed to optimize and the application performance measures by which they are evaluated. This mismatch degrades task performance both in theory [Liu, 2007] and in practice, as we show in this paper. Adversarial methods have found success when used in generative formulations to train deep architectures and obtain useful feature representations [Goodfellow et al., 2014, Salimans et al., 2016, Chen et al., 2016, Mirza and Osindero, 2014, Denton et al., 2015] due to the additional robustness that considering an adversary introduces. We explore a similar hypothesis in this work: Does better aligning predictor training objectives with evaluation performance measures, using adversarial prediction in conjunction with deep learning architectures, provide advantages in object localization tasks?
2
Approach
Game formulation: We view object localization using bounding box proposals (Figure 1) as a two-player game between a predictor player $\hat{Y}$ and an adversarial player $\check{Y}$ who determines the evaluation distribution [Asif et al., 2015]. Player $\hat{Y}$ first chooses a predictive distribution over bounding boxes, $\hat{P}(\hat{y}|x)$, to minimize the expected loss; player $\check{Y}$ then stochastically chooses an evaluation distribution, $\check{P}(\check{y}|x)$, that maximizes the expected loss while also (approximately) matching a set of statistics, $\phi(x, y)$, in expectation. We measure these statistics/features from labeled data and, in this paper, define them using feature representations learned by a Convolutional Neural Network (CNN) [Vedaldi and Lenc, 2015].
Figure 1: Prediction game bounding boxes.
Workshop on Adversarial Training, NIPS 2016, Barcelona, Spain.
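Definition 1. The Adversarial Object Localization (AOL$_p$) game is:

$$\min_{\hat{P}} \max_{\check{P}} \; \mathbb{E}_{X \sim \tilde{P},\; \hat{Y}|X \sim \hat{P},\; \check{Y}|X \sim \check{P}} \left[ \text{loss}(\hat{Y}, \check{Y}) \right] \quad \text{such that:} \quad \mathbb{E}_{X \sim \tilde{P},\; \check{Y}|X \sim \check{P}} \left[ \phi(X, \check{Y}) \right] = \mathbb{E}_{X,Y \sim \tilde{P}} \left[ \phi(X, Y) \right] \tag{1}$$

$$\iff \min_{\theta} \; \mathbb{E}_{X,Y \sim \tilde{P}} \left[ \min_{\hat{P}} \max_{\check{P}} \; \mathbb{E}_{\hat{Y}|X \sim \hat{P},\; \check{Y}|X \sim \check{P}} \left[ \text{loss}(\hat{Y}, \check{Y}) + \theta \cdot \phi(X, \check{Y}) \,\middle|\, X \right] - \theta \cdot \phi(X, Y) \right], \tag{2}$$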
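where $\hat{P}(\hat{y}|x)$ and $\check{P}(\check{y}|x)$ are distributions over the $|\mathcal{Y}|$ predicted bounding boxes and $\phi(\cdot, \cdot)$ are "features" characterizing relationships between the input pixels $x$ and bounding box existence $y$. Due to strong Lagrangian duality, the dual solution (Equation (2)) is equivalent to the original formulation's solution [Asif et al., 2015]. We consider two types of losses: the non-overlap loss,

$$\text{loss}_{1-o}(\hat{y}, \check{y}) = 1 - \frac{\text{area}(\hat{y} \cap \check{y})}{\text{area}(\hat{y} \cup \check{y})},$$

and the thresholded overlap loss, which is 1 when the overlap ratio falls below $\alpha$ and 0 otherwise,

$$\text{loss}_{o<\alpha}(\hat{y}, \check{y}) = \mathbb{1}\left[ \frac{\text{area}(\hat{y} \cap \check{y})}{\text{area}(\hat{y} \cup \check{y})} < \alpha \right].$$

Table 1: The payoffs of the adversarial object localization game based on losses, $\ell(\hat{y}, \check{y})$, between predicted ($\hat{y}$) and adversarial ($\check{y}$) bounding boxes, and potential terms $\psi(\check{y}) = \theta \cdot \phi(\check{y})$. Rows are the predictor's boxes $\hat{y}_i$ and columns are the adversary's boxes $\check{y}_j$.

$$\begin{array}{c|ccccc}
 & \check{y}_1 & \check{y}_2 & \check{y}_3 & \check{y}_4 & \cdots \\ \hline
\hat{y}_1 & \ell(\hat{y}_1, \check{y}_1) + \psi(\check{y}_1) & \ell(\hat{y}_1, \check{y}_2) + \psi(\check{y}_2) & \ell(\hat{y}_1, \check{y}_3) + \psi(\check{y}_3) & \ell(\hat{y}_1, \check{y}_4) + \psi(\check{y}_4) & \cdots \\
\hat{y}_2 & \ell(\hat{y}_2, \check{y}_1) + \psi(\check{y}_1) & \ell(\hat{y}_2, \check{y}_2) + \psi(\check{y}_2) & \ell(\hat{y}_2, \check{y}_3) + \psi(\check{y}_3) & \ell(\hat{y}_2, \check{y}_4) + \psi(\check{y}_4) & \cdots \\
\hat{y}_3 & \ell(\hat{y}_3, \check{y}_1) + \psi(\check{y}_1) & \ell(\hat{y}_3, \check{y}_2) + \psi(\check{y}_2) & \ell(\hat{y}_3, \check{y}_3) + \psi(\check{y}_3) & \ell(\hat{y}_3, \check{y}_4) + \psi(\check{y}_4) & \cdots \\
\hat{y}_4 & \ell(\hat{y}_4, \check{y}_1) + \psi(\check{y}_1) & \ell(\hat{y}_4, \check{y}_2) + \psi(\check{y}_2) & \ell(\hat{y}_4, \check{y}_3) + \psi(\check{y}_3) & \ell(\hat{y}_4, \check{y}_4) + \psi(\check{y}_4) & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{array}$$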
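The two losses and the payoff entries of Table 1 can be sketched in a few lines of Python. This is only an illustration: the `(x_min, y_min, x_max, y_max)` box encoding, the function names, and the use of one shared proposal list for both players are assumptions rather than details taken from the paper, and `psi` stands for the precomputed potentials $\theta \cdot \phi$ of each box.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed encoding: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap(a: Box, b: Box) -> float:
    """Intersection area divided by union area."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, width) * max(0.0, height)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def non_overlap_loss(pred: Box, adv: Box) -> float:
    """Non-overlap loss: one minus the overlap ratio."""
    return 1.0 - overlap(pred, adv)


def thresholded_loss(pred: Box, adv: Box, alpha: float) -> float:
    """Thresholded loss: 1 if the overlap ratio is below alpha, else 0."""
    return 1.0 if overlap(pred, adv) < alpha else 0.0


def payoff_matrix(
    boxes: Sequence[Box],
    psi: Sequence[float],
    loss: Callable[[Box, Box], float],
) -> List[List[float]]:
    """Rows index the predictor's box, columns the adversary's box.

    Entry [i][j] is loss(boxes[i], boxes[j]) + psi[j], as in Table 1.
    """
    return [[loss(p, a) + psi[j] for j, a in enumerate(boxes)] for p in boxes]
```

A thresholded variant can be passed to `payoff_matrix` by fixing the threshold first, for example `lambda p, a: thresholded_loss(p, a, 0.5)`.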
Efficient game solutions: We employ a constraint-generation method [McMahan et al., 2003, Wang et al., 2015] to more efficiently solve the AOL game. Its operation is detailed in Algorithm 1. It iteratively obtains a Nash equilibrium for a game defined over a subset of the bounding boxes, finds a player's best response strategy (bounding box) to that equilibrium distribution, and then adds the best response to the set of strategies defining the game. When additional best responses no longer improve either player's game value (Figure 2), the subgame equilibrium is guaranteed to be an equilibrium of the larger game [McMahan et al., 2003]. We employ the most probable strategy under this distribution at testing time.

Algorithm 1 Adversarial Object Localization equilibrium computation
Input: Image img; Parameters $\theta$
Output: Nash equilibrium, $(\hat{P}, \check{P})$
  BoxProposals ← EdgeBox(img)
  $\Phi$ = VggNet.LastLayer(img, BoxProposals)
  $\psi$ ← $\theta \cdot \Phi$
  $\check{S}$ ← $\hat{S}$ ← argmax $\psi$
  repeat
    $(\hat{P}, \check{P}, \check{V}al)$ ← solveGame($\hat{S}$, $\check{S}$, $\psi(\check{S})$, loss($\check{S}$, $\hat{S}$))
    $(\check{S}_{new}, maxV)$ ← $\text{argmax}_{\check{S}} \; \mathbb{E}_{\hat{P}(\hat{S})}[\text{loss}(\check{S}, \hat{S}) + \psi(\check{S})]$
    if ($\check{V}al \neq maxV$) then
      $\check{S}$ ← $\check{S} \cup \check{S}_{new}$
    end if
    $(\hat{P}, \check{P}, \hat{V}al)$ ← solveGame($\hat{S}$, $\check{S}$, $\psi(\check{S})$, loss($\check{S}$, $\hat{S}$))
    $(\hat{S}_{new}, minV)$ ← $\text{argmin}_{\hat{S}} \; \mathbb{E}_{\check{P}(\check{S})}[\text{loss}(\check{S}, \hat{S})]$
    if ($\hat{V}al \neq minV$) then
      $\hat{S}$ ← $\hat{S} \cup \hat{S}_{new}$
    end if
  until $\check{V}al = maxV = \hat{V}al = minV$
  return $(\hat{P}, \check{P})$

Figure 2: Final strategies for predictor (black/blue bounding boxes) and adversary (black/red bounding boxes) with non-zero probability. Black boxes are the highest probability strategies.
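The constraint-generation loop can be illustrated on a small payoff matrix. The sketch below is a simplified variant and not the authors' implementation: it approximates each subgame equilibrium with fictitious play (a stand-in for an exact `solveGame`, which would normally be a linear program), it computes both players' best responses from a single subgame solution instead of re-solving between them, and it stops when neither best response is a new strategy. The names `fictitious_play` and `double_oracle` are hypothetical, and the full payoff matrix is assumed to be available up front.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def fictitious_play(game: Matrix, iters: int = 2000) -> Tuple[List[float], List[float]]:
    """Approximate equilibrium of a zero-sum matrix game.

    The row player minimizes and the column player maximizes. Each round,
    both players best-respond to the opponent's empirical play so far.
    """
    n_rows, n_cols = len(game), len(game[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = col_counts[0] = 1  # arbitrary initial plays
    for _ in range(iters):
        row_cost = [sum(game[i][j] * col_counts[j] for j in range(n_cols)) for i in range(n_rows)]
        col_gain = [sum(game[i][j] * row_counts[i] for i in range(n_rows)) for j in range(n_cols)]
        row_counts[min(range(n_rows), key=row_cost.__getitem__)] += 1
        col_counts[max(range(n_cols), key=col_gain.__getitem__)] += 1
    total = iters + 1
    return [c / total for c in row_counts], [c / total for c in col_counts]


def double_oracle(
    payoff: Matrix, start: int = 0, iters: int = 2000
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Grow both strategy sets with best responses until neither changes.

    Returns (predictor mix, adversary mix) as {strategy index: probability}.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    rows, cols = [start], [start]
    while True:
        sub = [[payoff[i][j] for j in cols] for i in rows]
        p_row, p_col = fictitious_play(sub, iters)
        # Best responses are searched over ALL strategies, not only the subsets.
        col_gain = [sum(p * payoff[i][j] for p, i in zip(p_row, rows)) for j in range(n_cols)]
        row_cost = [sum(q * payoff[i][j] for q, j in zip(p_col, cols)) for i in range(n_rows)]
        best_col = max(range(n_cols), key=col_gain.__getitem__)
        best_row = min(range(n_rows), key=row_cost.__getitem__)
        new_col = best_col not in cols
        new_row = best_row not in rows
        if new_col:
            cols.append(best_col)
        if new_row:
            rows.append(best_row)
        if not (new_col or new_row):
            return dict(zip(rows, p_row)), dict(zip(cols, p_col))
```

Because the subgame solutions are approximate, the returned distributions are only an approximate equilibrium of the full game; replacing `fictitious_play` with an exact linear-programming solver would bring the sketch closer to Algorithm 1.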
3
Experimental Evaluation
Experimental setup: We evaluate the effectiveness of our method in solving object localization tasks using 10 classes (Table 2) from the Imagenet2015 dataset [Russakovsky et al., 2015]. Since the images vary greatly in size, we first re-size all of the images to 1360 by 800 pixels, then apply EdgeBox [Zitnick and Dollár, 2014] to generate a relatively small set of candidate bounding boxes (up to 250) that cover the objects in the image. We represent bounding boxes for images using sets of 1000 CNN descriptors [Simonyan and Zisserman, 2014b, Vedaldi and Lenc, 2015] for each bounding box proposal provided by EdgeBox.

Table 2: Train and test samples for our experiments.
Class        #Training   #Testing
Airplane     400         50
Bird         1600        200
Bus          330         50
Car          565         100
Cat          325         55
Cow          246         40
Dog          500         100
Horse        520         50
Monitor/TV   385         50
Sofa         380         50

To show the relative performance of AOL, we benchmark it against a multiclass structured support vector machine (SSVM) trained to incorporate the overlap into its hinge loss function, and against multiclass logistic regression. We use an existing SSVM implementation [Vedaldi, 2011] for the former to learn and produce predictions. It employs constraint generation and accelerates learning by adding multiple diverse constraints at each pass through the bounding boxes. We train one variant of the SSVM model (denoted SSVM) using non-overlap as the loss function, $\text{loss}_{1-o}$, and two additional variants (denoted SSVM50 and SSVM70) using thresholded losses, $\text{loss}_{o<50\%}$ and $\text{loss}_{o<70\%}$. For logistic regression, we estimate a distribution over all proposed bounding boxes that maximizes the conditional likelihood of proposed bounding boxes with an overlap of at least 50% with the example's ground truth bounding box annotation. We produce bounding box predictions by taking the Bayes-optimal decision under the estimated conditional bounding box distribution for the non-overlap loss (LR) and for the thresholded overlap losses (LR50, LR70).
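The Bayes-optimal decision used for the logistic regression baselines can be sketched as follows: given an estimated distribution over proposals, pick the proposal with the smallest expected loss. This is an illustrative reading of the description above, not the authors' code; the box encoding `(x_min, y_min, x_max, y_max)` and the names `overlap` and `bayes_optimal_index` are assumptions.

```python
from typing import Callable, Sequence, Tuple

# Assumed encoding: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def overlap(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def bayes_optimal_index(
    boxes: Sequence[Box],
    probs: Sequence[float],
    loss: Callable[[Box, Box], float],
) -> int:
    """Index of the box minimizing expected loss under the distribution probs."""
    def expected_loss(i: int) -> float:
        return sum(p * loss(boxes[i], truth) for p, truth in zip(probs, boxes))

    return min(range(len(boxes)), key=expected_loss)


# Two nearly identical boxes share 60% of the probability; a distant box has 40%.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 11.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
probs = [0.35, 0.25, 0.40]

non_overlap_choice = bayes_optimal_index(boxes, probs, lambda a, b: 1.0 - overlap(a, b))
threshold_choice = bayes_optimal_index(
    boxes, probs, lambda a, b: 1.0 if overlap(a, b) < 0.5 else 0.0
)
most_probable = max(range(len(boxes)), key=probs.__getitem__)
```

In this toy case the most probable single box is the distant one, while the expected-loss minimizer is one of the two clustered boxes, which shows how the decision depends on the loss and not only on the mode of the distribution.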
Similarly, we train our AOL method for the non-overlap (AOL) and for the thresholded overlap (AOL50, AOL70) loss functions.

Evaluation results: We evaluate the performance of each approach using the overlap between the predicted bounding box and the ground truth bounding box. Figure 3 shows the amount of predicted bounding box overlap with the object's ground truth bounding box across the entire set of examples for each method and object class. We note a few general trends. First, the two AOL localizers are either the best or competitive with the best for all of the datasets. Second, the AOL localizer, the AOL50 localizer, or the AOL70 localizer provides the best performance for the amount of overlap and the thresholded overlaps (50% and 70%), with the exception of Sofa 50% and Sofa 70%.

[Figure 3 panels: (a) Airplane, (b) Bird, (c) Bus, (d) Car, (e) Cat, (f) Cow, (g) Dog, (h) Horse, (i) Monitor/TV, (j) Sofa]
Figure 3: The number of test examples (x axis) having at most a specified amount of overlap with the ground truth bounding box (y axis) for three methods: adversarial object localization (AOL), logistic regression (LR), and structured support vector machines (SSVM) for ten classes from the Imagenet 2012 dataset. Larger values are better.

Of more significance is how each method adapts to loss function modification. The adaptation between AOL and AOL50 and AOL70 is visually evident for many of these datasets: the black line (AOL50) and black dotted line (AOL70) are "inflated" to the left of each plot so that more examples exceed the critical 50% or 70% overlap threshold of the targeted loss function. For instance, 10.8% more (absolute) of the testing examples predicted using AOL50 exceed the critical 50% threshold than those predicted by AOL, when averaged over all classes. The corresponding reduction in test example "mistakes" (less than 50% overlap threshold) between AOL and AOL50 is 65%. The minimal reduction across the 10 datasets (ignoring Car, for which no improvement is possible) is over 40%. Similar reductions exist for AOL70.

From these results, we affirm the central hypothesis of this work: that by better aligning localizer training objectives with performance measures of interest, significant improvements to generalized application performance can be realized. An unexpected result of these experiments is that in optimizing for the 50% and 70% threshold losses, AOL50 and AOL70 also realize increases in overlap. Additionally, AOL70 provided improved performance for the 50% overlap measure. The SSVM and LR models behaved similarly. Further investigation is needed, but one possible explanation is that the thresholded loss functions create a sharper "strategic" game between predictor and adversary in the AOL formulation, more influential support vectors for SSVM, and more focused predictions of the Bayes optimal action under logistic regression.
4
Conclusions
In this paper, we have developed an adversarial formulation for object localization that better leverages the rich feature representations provided by deep learning for optimizing specific application performance measures. We demonstrated the benefits for object localization on ten different object classes, showing significant improvements for our approach when tuned to the non-overlapping evaluation measure. In our experiments to date, we have used a pre-trained deep architecture to inform our predictions.
Acknowledgements
This research was supported as part of the Future of Life Institute (futureoflife.org) FLI-RFP-AI1 program, grant #2016-158710, and by NSF grant RI-#1526379.
References
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014a.
Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016. URL http://arxiv.org/abs/1606.03657.
Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. CoRR, abs/1506.05751, 2015. URL http://arxiv.org/abs/1506.05751.
Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.
A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. In Proceedings of the ACM Int. Conf. on Multimedia, 2015.
H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the International Conference on Machine Learning, pages 536–543, 2003.
Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision (ECCV), September 2014. URL https://www.microsoft.com/en-us/research/publication/edge-boxes-locating-object-proposals-from-edges/.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014b.
A. Vedaldi. A MATLAB wrapper of SVMstruct. http://www.vlfeat.org/~vedaldi/code/svm-struct-matlab, 2011.