PROFIT: A Specialized Optimizer for Deep Fine-Tuning
Motivation
Fine-tuning pre-trained deep learning models has become the norm across various domains such as computer vision, natural language processing, and autonomous driving. However, widely used optimizers like SGD or Adam were originally designed for training models from scratch, not for the nuanced needs of fine-tuning.
This mismatch leads to problems such as:
- Overwriting of useful pre-trained knowledge (a.k.a. catastrophic forgetting)
- Slower convergence or suboptimal minima when only a small amount of task-specific data is available
- Lack of control over how much the pre-trained weights should change
Hence, there is a need for an optimizer explicitly designed to fine-tune models without destroying their pre-trained capabilities.
Approach
To tackle the above challenges, we introduce PROFIT (Proximally Restricted Optimizer For Iterative Training), a novel optimization algorithm specifically tailored for fine-tuning tasks.
PROFIT constrains updates to remain close to the pre-trained parameters while still adapting to the new task. It does so via proximal updates and temporal gradient orthogonalization, which reduce interference between old and new knowledge.
In other words, rather than allowing the optimizer to freely explore the loss landscape (as SGD or Adam would), PROFIT restricts the optimization path to directions that don’t “erase” what the model already knows.
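To make this concrete, here is a minimal PyTorch-style sketch of such a constrained update. The class name `ProximalFinetuneSGD`, the `prox` coefficient, and the exact projection used for the orthogonalization are illustrative assumptions of ours, not PROFIT's actual update rule, which is specified in the paper.

```python
import torch


class ProximalFinetuneSGD(torch.optim.Optimizer):
    """Illustrative sketch (not the reference implementation): SGD with
    (1) a proximal pull toward the pre-trained weights and
    (2) projection of each gradient onto the component orthogonal to the
    previous step's gradient, to reduce interference between updates."""

    def __init__(self, params, lr=1e-3, prox=0.1):
        super().__init__(params, dict(lr=lr, prox=prox))
        # Snapshot the pre-trained weights so every update stays anchored to them.
        for group in self.param_groups:
            for p in group["params"]:
                self.state[p]["anchor"] = p.detach().clone()

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, prox = group["lr"], group["prox"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                prev = state.get("prev_grad")
                if prev is not None:
                    # Remove the component of g along the previous gradient:
                    # one plausible form of "temporal gradient orthogonalization".
                    coeff = (g * prev).sum() / prev.pow(2).sum().clamp_min(1e-12)
                    g = g - coeff * prev
                state["prev_grad"] = p.grad.detach().clone()
                # Proximal term: the gradient of (prox/2) * ||p - anchor||^2,
                # which penalizes drift from the pre-trained parameters.
                p.add_(g + prox * (p - state["anchor"]), alpha=-lr)
```

With `prox=0` this reduces to plain SGD plus the orthogonalization step; larger values of `prox` keep the fine-tuned solution closer to the pre-trained weights.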
Please refer to our paper for more details.