LHOPT: A Generalizable Approach to Learning Optimizers

2021. 6. 6. 02:24AI Paper Review/AutoML Papers

https://arxiv.org/abs/2106.00958

 


 

A core issue with learning to optimize neural networks has been the lack of generalization to real world problems.

 

0. Learning to optimize

 Most previous work on "learning to optimize" attempts to learn a general update rule from scratch. These approaches are quite promising and can even outperform human-designed optimizers on specific tasks. But when tested out of distribution, they have shown limited generalization to real-world problems: for example, optimizers trained on image-based datasets don't work on text-based datasets. In contrast, the method in this paper manages to generalize across different tasks without tuning.

 

1. LHOPT: A generalization-first perspective

As described above, most previous work attempts to learn optimization rules from scratch, which leads to a generalization problem. In contrast, LHOPT (Learned Hyperparameter OPTimizers) leverages the priors of existing optimization algorithms such as Adam. The resulting optimizers can be interpreted as hand-designed optimizers equipped with data-driven hyperparameter schedules.

 

2. Training Setup

So, in practice, this method sequentially chooses optimizer hyperparameters based on statistics of the inner training run. As in many NAS (Neural Architecture Search) papers, it uses an LSTM controller trained with a policy-gradient method (PPO) to generate its outputs. Similar to previous works, the controller runs through a whole inner training run and is then updated via PPO with the final reward, which is based on the validation loss.
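To make the outer/inner loop concrete, here is a toy sketch. The quadratic objective and the names `inner_task` and `stub_policy` are my own illustration, not the paper's setup; the stub stands in for the learned LSTM policy, which the paper actually trains with PPO:

```python
def inner_task(policy_fn, steps=100):
    """Toy inner task: minimize f(w) = w^2 with SGD whose learning rate
    is rescaled every 10 steps by the controller's action."""
    w, lr = 5.0, 0.1
    for t in range(steps):
        grad = 2.0 * w
        if t % 10 == 0:                       # controller acts periodically
            stats = {"grad_norm": abs(grad), "step_frac": t / steps}
            lr *= policy_fn(stats)            # action = multiplicative rescale
        w -= lr * grad
    return w * w                              # final "validation loss"

def stub_policy(stats):
    """Stand-in for the learned LSTM policy: decay LR late in training."""
    return 0.5 if stats["step_frac"] > 0.5 else 1.0

reward = -inner_task(stub_policy)             # PPO maximizes this reward
```

The key structural point this mirrors is that the reward arrives only once, at the end of the whole inner run, which is why a policy-gradient method like PPO fits.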

 

Simplified diagram of LHOPT

 

2.1 Feature Space

The feature space definition is quite interesting. This method's major features are transformations: log-ratios, cosine similarities, averages of booleans, or CDF features. This is quite different from existing learned optimizers, which use raw statistics such as the gradient, the squared gradient, or loss values.
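The first three transformations are simple to sketch; the function names below are my own, and the example inputs are only illustrative. What they share is that every output lives on a task-independent scale:

```python
import math

def log_ratio(a, b, eps=1e-8):
    """Scale-invariant comparison of two positive statistics,
    e.g. current loss vs. initial loss."""
    return math.log((a + eps) / (b + eps))

def cosine_similarity(u, v):
    """Direction-only comparison of two vectors, e.g. gradient vs.
    momentum, ignoring their (task-dependent) magnitudes."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv + 1e-12)

def bool_average(flags):
    """Fraction of steps on which some event happened,
    e.g. 'loss increased this step'."""
    return sum(flags) / len(flags)

print(log_ratio(0.5, 1.0))                      # negative => statistic shrank
print(cosine_similarity([1, 0], [1, 1]))        # in [-1, 1] regardless of norms
print(bool_average([True, False, True, True]))  # 0.75
```

Raw gradients or losses differ by orders of magnitude across tasks; these normalized quantities do not, which is presumably why they generalize better.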

 

2.1.1 The CDF Features

 As far as I know, the introduced CDF features are quite a novel concept in this field. Fundamentally, I think the CDF features' motivation is to generalize across many tasks. The motivations introduced in the paper are:

 

(1) there are use cases where we would like to know the relative values of a feature within an inner task, and

 

(2) ranking features should be invariant to details of the inner task. One example is knowing whether the validation loss is plateauing without seeing its exact value. To achieve this, the authors keep an estimate of the feature's mean and variance and map each value through a Gaussian cumulative distribution function onto the interval [0, 1].
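A minimal sketch of such a CDF feature, assuming an exponential-moving-average estimator for the mean and variance (the paper's exact estimator may differ; the class name and decay value are my own):

```python
import math

class CDFFeature:
    """Track a running estimate of a statistic's mean and variance, then
    map each new value through the Gaussian CDF onto [0, 1]. An output
    near 0.5 means 'typical for this run', near 0 or 1 means 'unusual' --
    regardless of the statistic's absolute scale."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.mean = 0.0
        self.var = 1.0
        self.initialized = False

    def update(self, x):
        if not self.initialized:
            self.mean, self.initialized = x, True
        delta = x - self.mean
        self.mean += (1 - self.decay) * delta
        self.var = self.decay * self.var + (1 - self.decay) * delta * delta
        z = delta / (math.sqrt(self.var) + 1e-8)
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Gaussian CDF

feat = CDFFeature()
for loss in [1.0, 0.9, 0.8, 0.8, 0.8, 0.8]:
    p = feat.update(loss)
# While the loss keeps dropping, p sits below 0.5; as it plateaus, p drifts
# back toward 0.5 -- so a plateau is visible without the absolute loss value.
```

This is exactly the plateau-detection use case from motivation (2): the controller only ever sees where a value ranks within its own run.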

 

2.2 Action Space

The action space is limited to just two ways of updating hyperparameters: scaling for most hyperparameters, and logit shifting (applying the logit function, adding a constant, then applying a sigmoid) for hyperparameters bounded in (0, 1). The important thing is that the model doesn't have access to the exact values of the hyperparameters, which improves robustness by forcing the model to react to the underlying inner task.
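Both action types are one-liners; this is a sketch with illustrative function names and example values of my own choosing:

```python
import math

def scale_action(value, factor):
    """Multiplicative update for scale-type hyperparameters (e.g. LR)."""
    return value * factor

def logit_shift_action(value, shift):
    """For hyperparameters living in (0, 1), such as Adam's beta terms:
    map to logit space, add a constant, map back with a sigmoid.
    The result stays inside (0, 1) no matter how large the shift."""
    logit = math.log(value / (1 - value))
    return 1 / (1 + math.exp(-(logit + shift)))

# The policy only emits relative moves; it never sees absolute values:
lr = scale_action(1e-3, 0.5)           # halve the learning rate
beta1 = logit_shift_action(0.9, -1.0)  # nudge beta1 down, still in (0, 1)
```

Note how both updates are relative: the same action means "halve the LR" whether the current LR is 1e-2 or 1e-5, which is what lets one policy transfer across tasks with very different hyperparameter scales.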

 

3. Experimental Results

3.1 Basic Tasks

Tasks on which the model gets more than a 2x speedup

3.2 GPT-2 on WikiText-103 

 

GPT-2 on WikiText-103 (1 epoch)

3.3 ResNet on ImageNet

 

ResNet18 on ImageNet

 

3.4 OOD(Out-of-distribution) evaluations

Since this method takes a "generalization-first" approach, experiments on OOD tasks are really important, and task diversity is key. So the paper applies LHOPTs to two unseen tasks: Neural Collaborative Filtering and Speech Recognition.

 

3.4.1 Neural Collaborative Filtering (NCF)

The paper trains both the Generalized Matrix Factorization (GMF) and Neural Matrix Factorization (NeuMF) models from scratch on the MovieLens 1M dataset.

 

Comparison of NeuMF models: (a) NDCG and (b) hit ratio

These results demonstrate generalization to an out-of-distribution architecture, input modality, and loss function without tuning.

 

3.4.2 Speech Recognition

 

Loss curves for Deep Speech 2 on LibriSpeech

 

4. Limitations

1. The speedup metric ignores how much better or worse the models are than the baselines.

2. Possible overfitting of LHOPTs to the validation set.

3. A single run only provides an approximate upper or lower bound on the speedup; it cannot give an exact measurement.

4. And more task-specific limitations, discussed in the full paper.

 

5. My Opinion 

I think this research's generalization-first concept makes LHOPTs more practical and efficient. Previous works train update rules from scratch, which is not really practical in the real world due to high computational requirements and poor generalization. But given the overfitting issues and the metric's limitations, I think the performance and stability can still be improved. This approach is relatively new, so I'm really looking forward to future improvements.

 

Disclaimer:

This is a short review of the paper. It only shows a few essential ideas and plots.

If needed, always read the full paper :)

 

Materials & Previous Works that you need to understand this post:

0. Adam: A Method for Stochastic Optimization (Kingma et al)

1. Neural Architecture Search with Reinforcement Learning (Zoph et al.)

2. Neural Optimizer Search with Reinforcement Learning (Bello et al)

3. Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves (Metz et al)