Neural Optimizers with Hypergradients for Tuning Parameter-Wise Learning Rates Jie Fu*, Ritchie Ng*, Danlu Chen, Ilija Ilievski, Christopher Pal, Tat-Seng Chua

Introduction

Training the Optimizer

- Recently, there has been rising interest in learning to learn, i.e., learning to update an optimizee's parameters directly
- In this work, we combine hand-designed and learned optimizers: an LSTM-based optimizer learns to propose parameter-wise learning rates for the hand-designed optimizer and is trained with hypergradients
- Our method can be used to tune any continuous dynamic hyperparameter (including momentum), but we focus on learning rates as a case study

- Because an LSTM is limited to a few hundred time-steps (its long-term memory contents are diluted at every step), we do not propose learning rates at every iteration
- We conjecture that learning rates tend not to change dramatically across iterations, so the LSTM optimizer only proposes learning rates every S iterations
- Essentially, we use a straight line to approximate the optimal learning rate curve within S steps, in the hope that the actual curve is not highly bumpy; the corresponding computational graph is shown in Fig. 2
- As a by-product, proposing learning rates every S iterations also makes training and testing efficient, since the LSTM consumes gradients averaged over S iterations (a sketch of this loop follows below)
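A minimal sketch, assuming PyTorch (the interface `controller(avg_grads, hidden)`, the data iterator, and all constants are illustrative, not from the paper), of the interaction between the learned controller and a hand-designed Adam rule: the controller proposes parameter-wise learning rates every S iterations from gradients averaged over the previous S steps.

```python
import torch

def adam_step(p, grad, m, v, lr, t, b1=0.9, b2=0.999, eps=1e-8):
    """One hand-designed Adam update with an externally supplied,
    parameter-wise learning rate `lr` (same shape as `p`)."""
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p.data.add_(-lr * m_hat / (v_hat.sqrt() + eps))

def train_with_controller(model, loss_fn, data_iter, controller, S=5, T=50):
    """Every S iterations the controller (an LSTM in the paper) sees the
    gradients averaged over the last S steps and proposes new
    parameter-wise learning rates; Adam uses them for the next S steps."""
    params = list(model.parameters())
    m = [torch.zeros_like(p) for p in params]
    v = [torch.zeros_like(p) for p in params]
    lrs = [1e-3 * torch.ones_like(p) for p in params]   # initial guess
    grad_sums = [torch.zeros_like(p) for p in params]
    hidden = None                                       # controller state

    for t in range(1, T + 1):
        x, y = next(data_iter)
        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()
        for i, p in enumerate(params):
            grad_sums[i] += p.grad.detach()
            adam_step(p, p.grad.detach(), m[i], v[i], lrs[i], t)
        if t % S == 0:                                  # propose new rates
            avg_grads = [g / S for g in grad_sums]
            lrs, hidden = controller(avg_grads, hidden)
            grad_sums = [torch.zeros_like(p) for p in params]
```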

Setup
- Fig. 1 shows one step of the LSTM optimizer, where all LSTMs share the same parameters but have separate hidden states
- The red lines indicate gradients, and the blue lines indicate hypergradients

Figure 2: Computational graph for chaining the hypergradients, dφ, from the optimizee.

Experiments

Experiment Setup

Figure 1: One step of an LSTM optimizer, showing hypergradient and gradient flow.

- In (Andrychowicz et al., 2016), an LSTM optimizer g with its own set of parameters φ is used to minimize the loss ℓ of the optimizee via the update rule
  $\theta_{t+1} = \theta_t + g_t\!\left(\nabla_\theta \ell(\theta_t), \phi\right)$  (1)

- We have the following objective function, which depends on the entire training trajectory:
  $\mathcal{L}(\phi) = \mathbb{E}\!\left[\sum_{t=1}^{T} w_t\, \ell(\theta_t)\right]$  (2)
  where $w_t$ are weights associated with each time-step
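A minimal sketch, assuming PyTorch and a differentiable functional evaluation of the optimizee loss (the names `controller`, `loss_fn`, `theta0`, `data`, and `w` are illustrative, not from the paper), of how the trajectory loss in Eq. 2 can be unrolled so that a single backward pass yields the hypergradient with respect to the controller parameters φ.

```python
import torch

def meta_loss(controller, theta0, loss_fn, data, w, lr_init=1e-3):
    """Unroll the optimizee for len(data) steps, keeping every update
    differentiable, so that backward() on the weighted trajectory loss
    gives the hypergradient for the controller's parameters."""
    theta = [p.detach().clone().requires_grad_(True) for p in theta0]
    hidden = None
    lrs = [lr_init * torch.ones_like(p) for p in theta]
    total = 0.0
    for t, (x, y) in enumerate(data):
        loss = loss_fn(theta, x, y)
        grads = torch.autograd.grad(loss, theta, create_graph=True)
        lrs, hidden = controller(grads, hidden)       # proposed learning rates
        # SGD-style inner update, kept differentiable w.r.t. the proposed lrs
        theta = [p - lr * g for p, lr, g in zip(theta, lrs, grads)]
        total = total + w[t] * loss_fn(theta, x, y)   # w_t * l(theta_t)
    return total

# One meta-iteration: meta_loss(...).backward() followed by a step of an
# outer optimizer over controller.parameters().
```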

Generating Learning Rates

- We evaluate our method on MNIST, SVHN, and CIFAR-10
- A 2-hidden-layer MLP (20 neurons per layer) with sigmoid activation functions is trained using Adam; the batch size is 120 (see the configuration sketch below)
- Our LSTM optimizer controls the learning rates of Adam
- We use a fixed set of random initial parameters for every meta-iteration on all datasets
- The learning rates for the baseline optimizers are grid-searched from $10^{-1}$ to $10^{-10}$
- The LSTM optimizer has 3 layers of 20 hidden units each and is itself trained by Adam with a fixed learning rate of $10^{-7}$
- The LSTM is trained for 5 meta-iterations and unrolled for 50 steps

Results
- Fig. 3 shows that our LSTM optimizer (NOH) significantly improves the optimizee's convergence rate and final accuracy after training for 5 meta-iterations on MNIST
- We freeze the NOH and transfer it to SVHN and CIFAR-10, where it also improves the optimizee performance
- Fig. 3 also shows that the learning rates for SVHN, and especially CIFAR-10, decay more slowly than those on MNIST
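For concreteness, a sketch of the configuration quoted in the experiment setup above, assuming PyTorch and MNIST-sized inputs (784 → 10); the variable names and the input/output dimensions are illustrative.

```python
import torch.nn as nn

# Optimizee: 2-hidden-layer MLP, 20 sigmoid units per layer (MNIST: 784 -> 10)
optimizee = nn.Sequential(
    nn.Linear(784, 20), nn.Sigmoid(),
    nn.Linear(20, 20), nn.Sigmoid(),
    nn.Linear(20, 10),
)

batch_size = 120
baseline_lrs = [10.0 ** -k for k in range(1, 11)]   # grid: 1e-1 ... 1e-10

# LSTM optimizer (controller): 3 layers x 20 hidden units, itself trained
# by Adam with a fixed learning rate of 1e-7, for 5 meta-iterations,
# unrolled for 50 steps.
controller_cfg = dict(num_layers=3, hidden_size=20,
                      meta_lr=1e-7, meta_iterations=5, unroll_steps=50)
```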

Issues
- The loss defined in Eq. 2 reduces to optimizing only the final loss when we set $w_T = 1$ (and $w_t = 0$ for $t < T$), in which case back-propagation through time becomes inefficient
- The neural optimizer learns to update the parameters directly, which makes the learning task unnecessarily difficult
- The LSTM optimizer needs to unroll for too many time-steps, and thus training the LSTM optimizer itself is difficult

Proposed Solution
- In contrast to Eq. 1, we adopt the following update rule, shown as Eq. 3 (3)

- where h(·), the input to the LSTM optimizer, is defined as the state description vector of the optimizee gradients at iteration t
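The exact form of h(·) is not given here; the following is a hypothetical sketch in which the state description encodes each gradient coordinate by its log-magnitude and sign (the preprocessing used by Andrychowicz et al., 2016), and a coordinate-wise LSTM with shared weights but separate hidden states (cf. Fig. 1) maps it to positive, parameter-wise learning rates. PyTorch and all names are assumptions.

```python
import torch
import torch.nn as nn

def state_description(grad, eps=1e-8):
    """Hypothetical h(.): encode each gradient coordinate by its
    log-magnitude and sign (a common learned-optimizer preprocessing)."""
    flat = grad.reshape(-1, 1)
    return torch.cat([torch.log(flat.abs() + eps), torch.sign(flat)], dim=1)

class LearningRateController(nn.Module):
    """Coordinate-wise LSTM: shared weights, separate hidden state per
    coordinate; maps the averaged-gradient state description to a
    positive, parameter-wise learning rate."""
    def __init__(self, hidden_size=20, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_size,
                            num_layers=num_layers)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, avg_grads, hidden=None):
        # treat every coordinate of every parameter as one batch element
        flat = torch.cat([g.reshape(-1) for g in avg_grads])
        h = state_description(flat).unsqueeze(0)              # (1, n_coords, 2)
        out, hidden = self.lstm(h, hidden)                     # per-coord state
        lr = torch.exp(self.head(out.squeeze(0))).squeeze(-1)  # positive rates
        # split the flat rates back into the original parameter shapes
        lrs, offset = [], 0
        for g in avg_grads:
            n = g.numel()
            lrs.append(lr[offset:offset + n].reshape(g.shape))
            offset += n
        return lrs, hidden
```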

Figure 3: Learning curves of the optimizee and randomly picked parameter-wise learning rate schedules for Adam on different datasets.

References

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. Neural Information Processing Systems, 2016.
