Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Abstract

Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods.
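For context, the update rule the paper builds on is the standard Adam optimizer (Kingma & Ba); the sketch below shows only that baseline update in NumPy, not the memory-augmented, critical-momenta variant proposed in the paper, which the abstract does not detail.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a single parameter tensor."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate for adaptive scaling
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```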

Type: Publication
Publication: arXiv, 2023