Under review as a conference paper at ICLR 2017

Regularizing Neural Networks by Penalizing Confident Output Distributions

Gabriel Pereyra, George Tucker, Łukasz Kaiser & Geoffrey Hinton
Google Brain
{pereyra, gjt, lukaszkaiser, geoffhinton}@google.com

Abstract

One way to reduce overfitting when performing classification with a deep neural network is to penalize the confidence of the network's output distribution. This forces the network to hedge its bets, preventing it from being overconfident. Recent work has shown that label noise or label smoothing acts to regularize neural networks for image classification. However, both these techniques typically assume a uniform distribution over labels, an assumption which does not hold in many machine learning tasks, particularly sequence modeling. To address this, we propose a form of regularization which acts on the output distribution of a neural network and does not force the user to make arbitrary decisions about the target probabilities of incorrect labels. We extensively evaluate this form of regularization and show that it improves state-of-the-art models for image classification, language modeling, machine translation, and speech recognition.

1 Introduction

Large neural networks with millions of parameters achieve strong performance on image classification (Szegedy et al., 2015a), machine translation (Sutskever et al., 2014; ?), language modeling (Jozefowicz et al., 2016), and speech recognition (Graves et al., 2013). Despite the large datasets used for these tasks, however, neural networks are still prone to overfitting. A number of techniques have been proposed to prevent overfitting, including early stopping, L1/L2 regularization (weight decay), dropout (Srivastava et al., 2014), and batch normalization (Ioffe & Szegedy, 2015). These techniques, along with most other forms of regularization, act on the hidden activities or the weights of a neural network. Very little work has been done to explore regularizing the output distribution of a neural network.

We commonly associate the knowledge of a model with the learned values of its parameters, but we can also view the knowledge as the distribution it produces over outputs given an input (Hinton et al., 2015). Given this functional view of the knowledge, it is clear that the probabilities assigned to class labels that are incorrect (according to the training data) are part of the knowledge the network has. For example, when shown an image of a BMW, a network that assigns a probability of $10^{-3}$ to "Audi" and $10^{-9}$ to "carrot" is clearly better than a network that assigns $10^{-9}$ to "Audi" and $10^{-3}$ to "carrot", other things being equal. One reason it is better is that the probabilities assigned to incorrect classes are an indication of how the network generalizes. Distillation (Hinton et al., 2015; ?) exploits this fact by explicitly training a small network to assign the same probabilities to incorrect classes as a big network or ensemble that already generalizes well.

In this paper, we use the fact that the relative probabilities of incorrect classes contain a lot of knowledge to motivate a better regularizer. Instead of simply smoothing one-of-N target distributions by taking a weighted average with the uniform distribution, our new regularizer penalizes the confidence of the output distribution for each training case. It does this by adding a term to the objective function that rewards high-entropy output distributions. This prevents overconfidence by forcing the model to distribute some probability mass to other classes, but it does not micro-manage the way in which it achieves this. In particular, there is no pressure to assign the same probability to all incorrect classes.

We show that penalizing low-entropy output distributions leads to better generalization across many different tasks, including image classification, language modeling, machine translation, and speech recognition. We improve on very strong baselines for all these tasks without re-tuning the hyper-parameters of the baseline models.

2 Related Work

A large number of methods have been proposed for regularizing neural networks and recurrent neural networks. These forms of regularization typically constrain the network weights or inject noise into the weights or activations. Less explored are methods that constrain the output distribution of a model. Below, we discuss existing forms of regularization that implicitly lead to a higher entropy distribution during training.

2.1 Label Noise

The simplest approach is to corrupt each training example with a small probability by replacing the true label with a label sampled uniformly from all possible labels. This reduces overfitting during training because a network is penalized for placing a very small probability on an incorrect label when the corruption makes that label correct. Empirically, this leads to better generalization (DisturbLabel; Xie et al., 2016).
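As a concrete illustration, a minimal sketch of this corruption step follows. The function name `corrupt_labels` and the parameter `noise_prob` are our notation, not taken from the paper.

```python
import numpy as np

def corrupt_labels(labels, num_classes, noise_prob, rng=None):
    """Replace each label with a uniformly sampled label with probability noise_prob."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels).copy()
    corrupt = rng.random(labels.shape[0]) < noise_prob               # which examples to corrupt
    random_labels = rng.integers(0, num_classes, labels.shape[0])    # uniform over all labels
    labels[corrupt] = random_labels[corrupt]                         # may re-draw the true label
    return labels

# Example: corrupt 10% of labels drawn from 10 classes (MNIST-style).
noisy = corrupt_labels(labels=np.array([3, 7, 1, 0]), num_classes=10, noise_prob=0.1)
```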

2.2 Label Smoothing

Label smoothing was recently shown to improve very deep convolutional neural networks for image classification (Szegedy et al., 2015b). Label smoothing regularizes the output layer of a neural network by redistributing probability mass among all labels, preventing a model from becoming overconfident. We can also recover this form of regularization by marginalizing out the effect of label noise (Bengio et al., 2015). One drawback of label smoothing is that it assumes a uniform prior over the labels, which does not hold for many machine learning tasks. For example, in language modeling, we know that words are not uniformly distributed, so choosing a suitable prior becomes complicated. We could smooth using the unigram distribution instead of the uniform distribution, but there are many other possibilities (e.g., higher-order n-gram distributions).
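For concreteness, here is a minimal sketch of how smoothed targets are usually constructed: the one-hot target is mixed with a prior (uniform by default). The function and argument names (`smoothed_targets`, `epsilon`) are ours; the unigram variant simply passes word frequencies as the prior.

```python
import numpy as np

def smoothed_targets(labels, num_classes, epsilon, prior=None):
    """Mix one-hot targets with a prior distribution over labels (uniform by default)."""
    if prior is None:
        prior = np.full(num_classes, 1.0 / num_classes)   # uniform prior over labels
    one_hot = np.eye(num_classes)[labels]                  # shape: [batch, num_classes]
    return (1.0 - epsilon) * one_hot + epsilon * prior     # each row still sums to 1

# Uniform smoothing with epsilon = 0.1, the value found best in the experiments below.
targets = smoothed_targets(np.array([2, 5]), num_classes=10, epsilon=0.1)
```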

2.3 Self-Distillation

One way to create soft targets without assuming a prior distribution over labels is to use a model's own output distribution. Distillation is a technique for transferring knowledge from a larger neural network to a smaller network by teaching the small network to mimic the output distribution of the larger network (Hinton et al., 2015). During training, one typically uses a combination of the larger network's output distribution and the true distribution. If, instead of the larger model's output distribution, we use the model's own output distribution together with the true labels, we perform self-distillation (Reed et al., 2014). This discourages the model from changing its predictions too much, leading to a form of trust-region optimization via self-created soft targets.
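A rough sketch of this soft-target construction is given below, assuming a fixed mixing weight (`beta`, our notation) between the one-hot labels and the model's own predictions, in the spirit of the "soft" bootstrapping targets of Reed et al. (2014); the draft does not spell out the exact formulation it has in mind.

```python
import numpy as np

def self_distillation_targets(model_probs, labels, num_classes, beta):
    """Soft targets: a convex combination of the one-hot labels and the model's
    own current predictions (a sketch of self-distillation / soft bootstrapping)."""
    one_hot = np.eye(num_classes)[labels]            # shape: [batch, num_classes]
    return beta * one_hot + (1.0 - beta) * model_probs

# model_probs would be the softmax outputs of the current model on this batch.
```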

2.4 Entropy Maximization

The techniques above implicitly increase the entropy of a model's output distribution. A simpler way to achieve this is to add a regularization term that directly encourages the model to produce an output distribution with high entropy. We are not aware of any work that has attempted this for supervised learning, but in reinforcement learning, a form of entropy maximization is used to improve exploration when learning a policy (Williams & Peng, 1991). In reinforcement learning, actions are sampled from the output distribution of the policy, and an output distribution with low entropy leads to poor exploration. Adding a regularization term that increases the entropy of the policy being learned has been shown to improve exploration.


Figure 1: Training accuracy (left) and validation entropy (right) for an unregularized model, a model regularized with dropout, and models regularized with label noise, label smoothing, and a confidence penalty. Note that while label smoothing and label noise implicitly raise the validation entropy of the model, a confidence penalty does so directly and leads to higher validation entropy.

3 Directly Penalizing Confidence

In this section, we present a way of penalizing output distributions that have low entropy. Given a training example (x, y), a neural network produces a distribution over classes y given x through the softmax function

$$ q(y = j \mid x) = \frac{\exp(z_j)}{\sum_{i=1}^{K} \exp(z_i)}, \qquad (1) $$

where z is the output of the final layer of the neural network (sometimes called the unnormalized logits) and K is the number of classes. During training, we minimize the cross entropy between the true distribution p and our network's output distribution q:

$$ H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q), \qquad (2) $$

where H(p) is the entropy of the true distribution, which we can drop during optimization since it is a constant. Typically, p places all probability on the "correct" label, which encourages q to do the same. However, as noted in Hinton et al. (2015), during training, assigning all probability to the correct label is often a symptom of overfitting. In Hinton et al. (2015), the authors argue that "dark knowledge", i.e. the ratios between the probabilities assigned to incorrect classes, provides important information about how the network generalizes. To encourage our model to redistribute some of its probability mass to "incorrect" classes, we add a term to the objective function that penalizes confidence. To achieve an output distribution with high entropy, the simplest regularization term R is

$$ R(\theta) = \lambda H(q), \qquad (3) $$

where λ controls the strength of the regularization. In reinforcement learning, this form of entropy maximization has been found to improve exploration when added to a policy (Williams & Peng, 1991) by preventing early convergence.
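To make equations (1)-(3) concrete, the following is a minimal NumPy sketch of the penalized training objective for one batch: the entropy bonus λH(q) is subtracted from the cross-entropy loss, so high-entropy outputs are rewarded. The names (`confidence_penalized_loss`, `lam`) are illustrative and not taken from the paper's code.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_penalized_loss(logits, labels, lam):
    """Cross-entropy minus lam * H(q): minimizing this rewards high-entropy outputs."""
    q = softmax(logits)                                        # equation (1)
    batch = np.arange(labels.shape[0])
    cross_entropy = -np.log(q[batch, labels] + 1e-12)          # H(p, q) with one-hot p
    entropy = -(q * np.log(q + 1e-12)).sum(axis=-1)            # H(q)
    return (cross_entropy - lam * entropy).mean()              # quantity to minimize

# Example: 3 classes, 2 examples, penalty weight lambda = 0.1.
loss = confidence_penalized_loss(np.array([[2.0, 0.5, -1.0],
                                           [0.1, 0.2, 0.3]]),
                                 labels=np.array([0, 2]), lam=0.1)
```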

3.1 Comparison to Label Smoothing and Label Noise

For classification tasks where all of the output labels occur with about the same frequency, label smoothing and label noise have one free parameter: the proportion of the uniform distribution that is mixed with the hard one-of-N distribution defined by the correct label.


Model                                            Test Error
2-layer, 1024-unit NN                            1.97%
2-layer, 1024-unit NN (dropout)                  1.40%
2-layer, 1024-unit NN (label noise)              1.24%
2-layer, 1024-unit NN (label smoothing)          1.23%
2-layer, 1024-unit NN (entropy maximization)     1.22%
2-layer, 1024-unit NN (confidence penalty)       1.17%

Table 1: Test error (%) on MNIST.

A confidence penalty also has one free parameter: the coefficient that multiplies the derivatives of the entropy of the output distribution with respect to the logits. Even in this simple case, confidence penalties have the advantage that, conditional on the input, they do not try to force the incorrect labels to have equal small probabilities. Given an image of the digit 7, for example, it is generally better to have a model that associates higher probabilities with the classes 4 and 9 than with the classes 5 and 8. If the class frequencies are far from equal, a confidence penalty has the additional advantage that it requires fewer decisions than label smoothing, because label smoothing forces us to decide what probabilities to give to rare classes when they are incorrect.
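For reference, the derivative mentioned above can be written in closed form; this derivation is ours (it is not given in the draft) but follows directly from the softmax Jacobian:

$$ \frac{\partial H(q)}{\partial z_j} \;=\; -\,q_j\bigl(\log q_j + H(q)\bigr), $$

so increasing the entropy pushes down the logits of classes whose probability exceeds $e^{-H(q)}$ and pushes up the rest, without forcing the incorrect classes toward equal probabilities.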

4 Experiments

We evaluated our confidence penalty on MNIST and CIFAR-10 for image classification, Penn Treebank for language modeling, WMT'14 English-to-German for machine translation, and TIMIT for speech recognition. All models were implemented using TensorFlow and trained on NVIDIA Tesla K40s. We describe the datasets and models used below.

4.1 Image Classification

For image classification, we performed experiments on the MNIST and CIFAR-10 datasets, using fully-connected neural networks and very deep convolutional neural networks.

4.1.1 MNIST

MNIST is an image classification dataset consisting of 28x28 black-and-white images of handwritten digits. The dataset is split into 60k training images and 10k testing images. As is common practice, we use the last 10k images of the training set as a held-out validation set for all hyper-parameter tuning. We trained fully-connected neural networks with [2, 3, 4, 5] layers and [1024, 2048, 4096, 8192] units; so far we have results for the 2-layer, 1024-unit configuration, and a grid search over the 3-layer, 1024-unit configuration is in progress. All models used rectified linear activations. Weights were initialized from a normal distribution with a standard deviation of 0.01. Models were optimized with stochastic gradient descent with a learning rate of 0.05, except when dropout was used, in which case we used a learning rate of 0.001 (whether label noise also requires a lower learning rate remains to be confirmed). For label smoothing and label noise, we tried smoothing or noise parameter values of [0.05, 0.1, 0.2, 0.3, 0.4, 0.5] and found 0.1 to work best for both methods. For the confidence penalty, results without an entropy threshold have not yet been run. Initial results suggest that label smoothing and our confidence penalty allow models to be trained with higher learning rates than label noise and dropout, and that, compared to dropout, the effectiveness of the various forms of output regularization diminishes as the model becomes deeper; both observations need to be confirmed once results at other depths and widths are available.


Model                                                                Test Error
110-layer residual CNN He et al. (2015)                              13.63%
110-layer residual CNN with stochastic depth Huang et al. (2016b)    11.66%
21-layer fractal CNN Larsson et al. (2016)                           7.33%
40-layer dense CNN (without dropout)                                 9.29%
40-layer dense CNN (label smoothing)                                 ?
40-layer dense CNN (label noise)                                     ?
40-layer dense CNN (entropy maximization)                            9.24%
40-layer dense CNN (entropy penalty)                                 8.04%
40-layer dense CNN (dropout)                                         7.00%
40-layer dense CNN (dropout + entropy penalty)                       ?

Table 2: Test error (%) on CIFAR-10 without data augmentation. Models with citations are existing results.

4.1.2 CIFAR-10

CIFAR-10 is an image classification dataset consisting of 32x32 color images from 10 classes. The dataset is split into 50k training images and 10k testing images. We use the last 5k images of the training set as a held-out validation set for all hyper-parameter tuning, as is common practice. For our experiments, we used a densely connected convolutional neural network, which represents the current state-of-the-art on CIFAR-10 (Huang et al., 2016a). We use the small configuration from Huang et al. (2016a), which consists of 40 layers with a growth rate of 12 (each layer produces 12 new feature maps and takes the feature maps of all preceding layers in its block as input). All models were trained using stochastic gradient descent for 300 epochs, with a batch size of 50 and a learning rate of 0.1. The learning rate was reduced by a factor of 10 at 150 and 225 epochs. We present results for training without data augmentation; we found that entropy regularization did not lead to improved performance when training with data augmentation, but neither did other regularization techniques, including dropout. For our final test scores, we trained on the entire training set. For label smoothing and label noise, we tried smoothing or noise parameter values of [0.05, 0.1, 0.2, 0.3, 0.4, 0.5], and found 0.1 to work best for both methods. For the confidence penalty, we performed a grid search over entropy threshold values of [0.1, 0.25, 0.5, 1.0, 1.5] and entropy weights of [0.1, 0.25, 0.5, 1.0], and found an entropy threshold of 1.5 and an entropy weight of 0.1 to work best; results without a threshold have not yet been run.
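The exact form of the thresholded penalty is not written out in this draft. One plausible reading of the "entropy threshold" hyper-parameter (our assumption, not the authors' stated definition) is a hinge that only penalizes confidence once the output entropy falls below the threshold:

```python
import numpy as np

def thresholded_confidence_penalty(q, weight, threshold):
    """Penalize low entropy only when H(q) drops below the threshold (hinge form).

    This sketches one possible interpretation of the 'entropy threshold' and
    'entropy weight' hyper-parameters; the draft does not give the exact formula."""
    entropy = -(q * np.log(q + 1e-12)).sum(axis=-1)                # per-example H(q), in nats
    return weight * np.maximum(0.0, threshold - entropy).mean()    # added to the training loss

# E.g. the CIFAR-10 setting found best above: threshold 1.5 (nats), weight 0.1.
penalty = thresholded_confidence_penalty(np.array([[0.9, 0.05, 0.05]]),
                                         weight=0.1, threshold=1.5)
```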

4.2 Language Modeling

We show that our confidence penalty significantly outperforms label noise and label smoothing for language modeling. We performed word-level language modeling experiments using the Penn Treebank dataset (PTB; Marcus et al., 1993). We used the hyper-parameter settings of the large configuration in Zaremba et al. (2014): briefly, a 2-layer, 1500-unit LSTM with 65% dropout applied to all non-recurrent connections. We trained using stochastic gradient descent for 55 epochs, decaying the learning rate by a factor of 1.15 after 14 epochs, and clipped the norm of the gradients when it exceeded 10. For label noise and label smoothing, we performed a grid search over noise and smoothing values of [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5]. For label noise, we found ? to be the best value, which led to a 3 point improvement in test perplexity. For label smoothing, we found 0.5 to be the best value, although it still hurt performance compared to not using label smoothing. For the entropy penalty, we performed a grid search over entropy target values of [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0] and entropy weight values of [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]. We found an entropy target of 6.5 and an entropy weight of 3.5 to work best, which led to an improvement of 3.7 perplexity points.

For reference, we also include results of existing state-of-the-art models for word-level language modeling on PTB. Variational dropout (Gal, 2015) applies a fixed dropout mask (sampled once per example) at each time-step, instead of resampling it at every step as in traditional dropout. Note that we do not include the variational dropout results that use Monte Carlo (MC) model averaging, which achieve lower perplexity on the test set but require 1,000 model evaluations that are then averaged. Recurrent highway networks (Zilly et al., 2016) currently represent the state-of-the-art on PTB; however, we note that these results were achieved with a 5-layer model.


Model                                                  Valid Perplexity    Test Perplexity
2-layer, 1000-unit LSTM Zaremba et al. (2014)          86.2                82.7
2-layer, 650-unit LSTM + Char CNN Kim et al. (2015)    -                   78.9
2-layer, 1500-unit LSTM Zaremba et al. (2014)          81.2                78.4
2-layer, 1500-unit variational LSTM Gal (2015)         79.6                75.0
5-layer, 1000-unit RHN Zilly et al. (2016)             72.8                71.3

2-layer, 1500-unit LSTM (label smoothing)              82.3                78.5
2-layer, 1500-unit LSTM (label noise)                  79.7                77.7
2-layer, 1500-unit LSTM (entropy maximization)         79.5                76.8
2-layer, 1500-unit LSTM (entropy penalty)              78.6                74.7

Table 3: Word-level Penn Treebank validation and test perplexity. Models with citations are existing results. A 2-layer LSTM with an entropy penalty outperforms all existing results except for the 5-layer RHN of Zilly et al. (2016).

Model                                                  Valid BLEU    Test BLEU
4-layer, 1000-unit LSTM Luong et al. (2015)            -             20.9
8-layer, 1024-unit LSTM                                22.33         21.24
8-layer, 1024-unit LSTM (label smoothing)              23.85         22.42
8-layer, 1024-unit LSTM (entropy penalty)              23.35         22.26
8-layer, 1024-unit LSTM (dropout)                      24.11         23.41
8-layer, 1024-unit LSTM (dropout + label smoothing)    24.60         23.79
8-layer, 1024-unit LSTM (dropout + entropy penalty)    24.46         23.52

Table 4: WMT'14 English-to-German validation and test BLEU scores.


4.3 Machine Translation

We show that our entropy penalty outperforms label smoothing for machine translation, although not as significantly as in our language modeling experiments. We performed our translation experiments on the WMT'14 English-to-German task. For all experiments, we used an 8-layer, 1024-unit LSTM encoder-decoder architecture with attention; a reference and a fuller description of this translation model are not yet available. We compare label smoothing and our entropy penalty on models with and without dropout; the model with dropout uses a dropout rate of 30% applied to all non-recurrent connections. For label smoothing, we performed a grid search over smoothing values of [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5]. We found a smoothing value of 0.4 to work best, leading to an improvement of 1 BLEU point over the model without dropout and 0.4 BLEU points over the model with dropout. For our entropy penalty, we performed a grid search over entropy target values of [2, 4, 6] and entropy weights of [0.5, 2.5, 4.5]. We found an entropy target and entropy weight of ? and ? to work best, respectively, which led to an improvement of ?. A comparison to recent state-of-the-art results on the WMT'14 English-to-German task will be added.


Model                                                        Valid PER    Test PER
3/1-layer, 256-unit, bidirectional LSTM                      21.0         23.0
3/1-layer, 256-unit, bidirectional LSTM (label smoothing)    19.3         21.3
3/1-layer, 256-unit, bidirectional LSTM (entropy penalty)    19.3         20.7

Table 5: TIMIT validation and test phoneme error rates (PER). Multiple trials for the best configuration remain to be run.

4.4 Speech Recognition

We show that our entropy penalty outperforms label smoothing for speech recognition, although by a smaller margin, as in our machine translation experiments. We used the model defined in Chan et al. (2015) and evaluated it on the standard TIMIT phone recognition task (Garofolo et al., 1993). For all experiments, we used a 3-layer encoder and a 1-layer decoder, where all layers had 256 units, with 15% dropout on the non-recurrent connections (to be confirmed). For label smoothing, we performed a grid search over smoothing values of [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8] and found 0.2 to work best. For our entropy penalty, we performed a grid search over entropy target values of [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 5] and entropy weight values of [0.125, 0.25, 0.5, 1, 2, 4, 8]. We found an entropy target of 3.5 and an entropy weight of 1.0 to perform best, leading to an improvement of approximately 2% absolute test error over the baseline model. A comparison to recent state-of-the-art results on TIMIT will be added.

5 Conclusion

We introduce a new form of regularization for neural networks. Our confidence penalty penalizes a model for having an output distribution with low entropy. We extensively evaluate this form of regularization for image classification, language modeling, machine translation, and speech recognition. We compare to label noise and label smoothing, two forms of regularization that also act on the output layer of a neural network, although they act on the labels rather than on the model's output distribution. We find that, in all experiments, our confidence penalty outperforms these other techniques.

Acknowledgments

We would like to thank Sergey Ioffe and Navdeep Jaitly for helpful discussions. We would also like to thank Prajit Ramachandran, Barret Zoph, Mohammad Norouzi, and Yonghui Wu for technical help with the various models used in our experiments.

References

Yoshua Bengio, Ian J Goodfellow, and Aaron Courville. Deep learning. An MIT Press book in preparation. Draft chapters available at http://www.iro.umontreal.ca/~bengioy/dlbook, 2015.

William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

Yarin Gal. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287, 2015.

John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, David S Pallett, Nancy L Dahlgren, and Victor Zue. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia, 33, 1993.


Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645-6649. IEEE, 2013.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Gao Huang, Zhuang Liu, and Kilian Q Weinberger. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016a.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv preprint arXiv:1603.09382, 2016b.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. arXiv preprint arXiv:1508.06615, 2015.

Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015a.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015b.

Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241-268, 1991.

Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.

