
Project: Learning to Optimize at Scale with Attention Mechanisms

Description

Continual learning refers to the ability of a system to continually acquire new knowledge over time while retaining previously learned experience [1]. Conventional neural networks typically update all model parameters (weights) when adapting to new tasks, which often leads to catastrophic forgetting [2]. Ideally, parameters would instead be updated selectively: some protected to avoid forgetting, others left free to adapt and thereby accelerate future learning.

Learned optimizers are neural networks trained to update the weights of target models directly, and they have recently emerged as a promising alternative to manually designed optimization strategies (e.g., SGD and Adam) [3,4]. However, these learned optimizers are not designed for continual learning scenarios, where tasks arrive sequentially and previously seen tasks cannot be revisited.
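To illustrate the idea with a toy sketch (not the architecture of [3,4]): a learned optimizer replaces a hand-designed rule such as SGD's `w -= lr * g` with a parametric function, trained by meta-learning, that maps per-parameter features (e.g., the gradient and a momentum statistic) to an update. The coefficients below stand in for meta-learned weights and are purely illustrative.

```python
# Toy illustration of a "learned optimizer": instead of a fixed rule
# (w -= lr * g), a parametric function maps per-parameter features
# (gradient, momentum) to an update. The coefficients c_g and c_m
# stand in for meta-learned weights; real learned optimizers use
# small neural networks trained by meta-learning.

def learned_update(grad, momentum, c_g=-0.01, c_m=-0.005):
    """Hypothetical learned update rule: a linear function of features."""
    new_momentum = 0.9 * momentum + grad
    step = c_g * grad + c_m * new_momentum
    return step, new_momentum

# Minimize f(w) = w^2 (gradient 2w) with the toy learned rule.
w, m = 5.0, 0.0
for _ in range(200):
    g = 2.0 * w
    step, m = learned_update(g, m)
    w += step
```

In a real learned optimizer the update function is a neural network, and its weights are meta-trained so that the target models it optimizes reach low loss quickly.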

Vettoruzzo et al. [5] address this gap by introducing a transformer-based optimizer that dynamically updates a specific subset of the parameters of a classifier network for each task. This technique minimizes interference with previously acquired knowledge while promoting knowledge transfer across related tasks. While this approach demonstrates the potential of meta-optimizers for continual learning, it is currently limited to small model sizes and a specific model architecture, namely convolutional neural networks (CNNs).
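The core mechanism of selective per-task updates can be caricatured in a few lines. This is a toy sketch only: the mask here is fixed by hand, whereas the method in [5] learns which parameters to update per task via attention.

```python
# Toy sketch of selective parameter updates in continual learning:
# each task updates only a chosen subset of weights, leaving the
# rest frozen to protect knowledge from earlier tasks. The mask is
# hand-picked here; in [5] an attention mechanism selects it.

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Apply an SGD step only where mask is 1; freeze the rest."""
    return [w - lr * g if m else w
            for w, g, m in zip(weights, grads, mask)]

w = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 1.0, 1.0, 1.0]
task_mask = [1, 0, 1, 0]   # only weights 0 and 2 adapt to this task
w = masked_sgd_step(w, g, task_mask)
```

After the step, the masked-out weights (indices 1 and 3) are unchanged, so whatever earlier tasks encoded in them is preserved.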

This project aims to extend and transform the meta-optimizer proposed in [5] into a scalable and practical tool that can be used as a drop-in replacement for standard optimizers (e.g., SGD, Adam) in PyTorch. The improved optimizer should generalize across various domains and neural network architectures, including CNNs, MLPs, and transformers, while effectively addressing the challenges of continual learning.
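A drop-in replacement would follow PyTorch's standard optimizer interface. The sketch below shows what that could look like; `MetaOptimizer` is a hypothetical name for illustration, and its `step()` body is plain SGD with momentum as a placeholder, whereas the envisioned tool would compute the update with a meta-trained network.

```python
# Sketch of the drop-in interface this project targets: a custom
# optimizer subclassing torch.optim.Optimizer, usable wherever SGD
# or Adam would be. "MetaOptimizer" is a hypothetical name; the
# step() body is a hand-coded placeholder (SGD with momentum) where
# a learned optimizer network would go.
import torch

class MetaOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault(
                    "momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                # A learned optimizer would map (p.grad, buf, ...) to
                # the step with a small network instead of this rule.
                p.add_(buf, alpha=-group["lr"])

# Usage is identical to torch.optim.SGD or Adam.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
opt = MetaOptimizer(model.parameters(), lr=0.05)
x, y = torch.randn(32, 4), torch.randn(32, 1)
first = torch.nn.functional.mse_loss(model(x), y).item()
for _ in range(50):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```

Because the class follows the `torch.optim.Optimizer` contract (`param_groups`, per-parameter `state`, `zero_grad()`/`step()`), existing training loops need no changes to adopt it.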


Objectives:

- Refactor the existing codebase to improve computational efficiency and to ensure generalization across different model architectures.

- Evaluate the performance of the learned optimizer against standard optimizers (SGD, Adam) using consistent training setups across multiple datasets and model architectures.

- Develop and release a pip-installable tool, with full documentation, an API reference, example training scripts, and tutorials to support adoption by the PyTorch community.

References:

[1] Son, Jaehyeon, Soochan Lee, and Gunhee Kim. "When meta-learning meets online and continual learning: A survey." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).

[2] Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences 114.13 (2017): 3521-3526.

[3] Jain, Deepali, et al. "Mnemosyne: Learning to train transformers with transformers." Advances in Neural Information Processing Systems 36 (2023): 77331-77358.

[4] Metz, Luke, et al. "VeLO: Training Versatile Learned Optimizers by Scaling Up." arXiv preprint arXiv:2211.09760 (2022).

[5] Vettoruzzo, Anna, et al. "Learning to Learn without Forgetting using Attention." Conference on Lifelong Learning Agents (CoLLAs). 2024.



Details
Supervisor
Joaquin Vanschoren
Secondary supervisor
Anna Vettoruzzo