2021. 6. 6. 02:24ㆍAI Paper Review/AutoML Papers
<arxiv> https://arxiv.org/abs/2106.00958
A core issue with learning to optimize neural networks has been the lack of generalization to real world problems.
0. Learning to optimize
Most previous work on "learning to optimize" attempts to learn a general update rule from scratch. These approaches are quite promising and even outperform human-designed optimizers on specific tasks. But when tested out-of-distribution, they have shown limited generalization to real-world problems: for example, optimizers trained on image-based datasets don't work on text-based datasets. In contrast, the method in this paper manages to generalize across different tasks without tuning.
1. LHOPT: A generalization-first perspective
As described above, most previous work attempts to learn optimization rules from scratch, which suffers from this generalization problem. In contrast, LHOPT (Learned Hyperparameter OPTimizers) leverages the priors of existing optimization algorithms, such as Adam. The resulting optimizers can be interpreted as having data-driven schedules that interpolate with hand-designed optimizers.
2. Training Setup
So, in practice, this method sequentially finds optimal optimizer hyperparameters based on input statistics. As in much NAS (Neural Architecture Search) research, it uses an LSTM cell and a policy-gradient method (PPO) to generate outputs. Similar to previous works, the method runs for a whole inner training run and updates the controller via PPO with the final reward, which is based on the validation loss.
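The outer loop can be sketched roughly as follows. This is a toy illustration of the episode structure (observe statistics, act on hyperparameters, reward at the end), not the paper's implementation; the feature names, the controller policy, and the inner-task dynamics here are all made up.

```python
import math

# Toy sketch of the outer loop: the controller observes features of the
# inner task at fixed intervals, adjusts hyperparameters, and the
# episode's reward is based on the final validation loss. In the paper
# the policy is an LSTM trained with PPO; here it is a stub function.
def run_inner_task(controller_policy, n_intervals=10):
    lr = 1e-3                  # a hyperparameter under the controller's control
    val_loss = 5.0             # toy "validation loss" of the inner task
    for step in range(n_intervals):
        feats = {
            "log_val_loss": math.log(val_loss),
            "progress": step / n_intervals,
        }
        action = controller_policy(feats)   # e.g. a scaling factor for lr
        lr *= action
        # toy training dynamics: a larger lr shrinks the loss faster
        val_loss *= (1.0 - min(lr * 50, 0.5))
    return -val_loss           # final reward fed back to the PPO update

# a hand-written stand-in policy, just to run the loop end to end
reward = run_inner_task(lambda feats: 1.1 if feats["log_val_loss"] > 0 else 0.9)
```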
2.1 Feature Space
The feature space definition is quite interesting. The method's main features are transformations: log-ratios, cosine similarities, averages of booleans, and CDF features. This is quite different from existing learned optimizers, which use raw statistics such as the gradient, the squared gradient, and loss values.
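To make the first two transformations concrete, here are minimal illustrative versions of a log-ratio and a cosine-similarity feature. The exact definitions and the quantities they are applied to in the paper may differ; the point is that both are scale-invariant, unlike raw gradients or losses.

```python
import numpy as np

# Illustrative scale-invariant feature transforms (my reading, not the
# paper's exact code).
def log_ratio(a, b, eps=1e-12):
    """Log of a ratio, e.g. current loss vs. a running average of the loss."""
    return float(np.log((a + eps) / (b + eps)))

def cosine_similarity(u, v, eps=1e-12):
    """Direction agreement, e.g. between the gradient and a momentum buffer."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

grad = np.array([0.10, -0.20, 0.05])
momentum = np.array([0.08, -0.15, 0.02])
r = log_ratio(2.0, 4.0)                 # negative: numerator is smaller
c = cosine_similarity(grad, momentum)   # close to 1: directions agree
```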
2.1.1 The CDF Features
As far as I know, the introduced CDF features are quite a novel concept in this field. Fundamentally, I think the CDF features' motivation is to generalize across many tasks. The motivations introduced in the paper are:
(1) there are use-cases for which we would like to know the relative values of a feature within an inner task.
(2) ranking features should be invariant to details of the inner task. One example is knowing whether the validation loss is plateauing without seeing its exact value. To achieve this, we calculate an estimate of that value's mean and variance and map the feature with a Gaussian cumulative distribution function onto the interval [0, 1].
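A minimal sketch of this idea, assuming exponential moving averages for the mean and variance (the paper's exact estimator and decay constants may differ):

```python
import math

# Sketch of a CDF feature: keep a running estimate of a raw feature's
# mean and variance, then map the current value through the Gaussian CDF
# so the controller's input lies in [0, 1] regardless of the inner
# task's loss scale.
class CDFFeature:
    def __init__(self, decay=0.99):
        self.decay = decay
        self.mean = 0.0
        self.var = 1.0

    def update(self, x):
        self.mean = self.decay * self.mean + (1 - self.decay) * x
        self.var = self.decay * self.var + (1 - self.decay) * (x - self.mean) ** 2
        z = (x - self.mean) / math.sqrt(self.var + 1e-12)
        # Gaussian CDF via the error function
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

feat = CDFFeature()
vals = [feat.update(loss) for loss in [3.0, 2.5, 2.4, 2.4, 2.39]]
```

A plateauing loss then shows up in the relative position of each value within its running distribution, independent of the loss's absolute scale.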
2.2 Action Space
The action space is limited to just two ways of updating hyperparameters: scaling for most hyperparameters, and logit shifting (applying the logit function, adding a constant, then applying a sigmoid). Importantly, the model doesn't have access to the exact values of the hyperparameters, which improves robustness by forcing it to react to the underlying inner task.
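The two action types are easy to write down. This is an illustrative sketch (the paper's constants and exact parameterization may differ); the key property is that logit shifting keeps bounded hyperparameters like Adam's beta1 inside (0, 1).

```python
import math

def scale(value, factor):
    """Multiplicative update, e.g. for the learning rate."""
    return value * factor

def logit_shift(p, delta):
    """Apply the logit, add a constant, then map back with a sigmoid."""
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-(logit + delta)))

lr = scale(1e-3, 0.5)            # halve the learning rate
beta1 = logit_shift(0.9, -0.5)   # nudge beta1 down; result stays in (0, 1)
```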
3. Experimental Results
3.1 Basic Tasks
3.2 GPT-2 on WikiText-103
3.3 ResNet on ImageNet
3.4 OOD(Out-of-distribution) evaluations
Since this method takes a "generalization-first" approach, experimenting on OOD tasks is really important, and task diversity is the key. So the paper applies LHOPTs to two unseen tasks: Neural Collaborative Filtering and Speech Recognition.
3.4.1 Neural Collaborative Filtering (NCF)
The paper trains both Generalized Matrix Factorization (GMF) and Neural Matrix Factorization (NeuMF) models from scratch on the MovieLens 1M dataset.
The results demonstrate generalization to an out-of-distribution architecture, input modality, and loss function without tuning.
3.4.2 Speech Recognition
4. Limitations
1. The metric ignores how much better or worse the models are than the baselines.
2. LHOPTs may overfit to the validation set.
3. A single run only provides an approximate upper or lower bound on the speedup, not an exact measurement.
4. and more (task-specific limitations are discussed in the full paper)
5. My Opinion
I think this research's generalization-first concept makes LHOPTs more practical and efficient. Previous works train update rules from scratch, which is not really practical in the real world due to high computational requirements, poor generalization, and so on. But given the overfitting issues and the metric's limitations, I think the performance and stability can still be improved. This approach is relatively new, so I'm really looking forward to future improvements.
Disclaimer:
This is a short review of the paper. It only shows a few essential equations and plots.
If needed, always read the full paper :)
Materials & Previous Works that you need to understand this post:
0. Adam: A Method for Stochastic Optimization (Kingma et al)
1. Neural Architecture Search with Deep Reinforcement Learning (Zoph et al)
2. Neural Optimizer Search with Reinforcement Learning (Bello et al)
3. Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves (Metz et al)
'AI Paper Review > AutoML Papers' 카테고리의 다른 글
[3줄 AutoML] 모든 사람을 면접할 수 없고, 모든 모델을 학습할 수 없다. (0) | 2021.07.12 |
---|---|
[3줄 AutoML] 솔직히 까놓고 말해서 기존 NAS 비효율적이지 않냐? (3) | 2021.07.09 |
[3줄 AutoML] 옵티마이저도 한번 찾아볼까? (1) | 2021.07.07 |
[3줄 AutoML] 언제까지 ReLU에 만족할래? (0) | 2021.07.07 |
[3줄 AutoML] NAS with RL: 그 원대한 시작 (0) | 2021.07.07 |