Catastrophic forgetting occurs when a neural network “overwrites” what it learned in a previous task while training on a new one. This happens because the weights optimized for the first task are changed to minimize the error on the second task, often disregarding the original objective.
To reduce this when fine-tuning, you can draw on three main categories of strategies: rehearsal (replay), architecture (freezing and adapters), and regularization, plus simple learning-rate adjustments.
1. The “Replay” Strategy (Data Mixing)
This is generally the most effective method. If you can still access some of your old data, do not train on the new data in isolation.
- Mix the Datasets: Create a training set that consists mostly of your new task data, but includes a small percentage (e.g., 5-10%) of the data from the previous task.
- Why it works: By seeing old examples occasionally during the new training, the model is constantly reminded of what it previously learned, preventing it from drifting too far.
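The mixing step above can be sketched in a few lines of plain Python. `build_replay_mixture` is a hypothetical helper, and the 5% replay fraction is just one reasonable choice from the 5-10% range mentioned above:

```python
import random

def build_replay_mixture(new_data, old_data, replay_fraction=0.05, seed=0):
    """Combine the full new-task dataset with a small random sample
    of old-task examples, then shuffle the result."""
    rng = random.Random(seed)
    n_replay = max(1, int(len(new_data) * replay_fraction))
    replay = rng.sample(old_data, min(n_replay, len(old_data)))
    mixed = list(new_data) + replay
    rng.shuffle(mixed)
    return mixed

# Hypothetical datasets: 1000 new-task examples, a larger pool of old ones.
new_task = [("new", i) for i in range(1000)]
old_task = [("old", i) for i in range(5000)]

# 1000 new examples + 50 replayed old examples, shuffled together.
mixed = build_replay_mixture(new_task, old_task, replay_fraction=0.05)
```

Each fine-tuning epoch then iterates over `mixed` instead of `new_task` alone, so the old task keeps appearing in every batch schedule.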
2. Freeze the “Backbone”
Instead of retraining the entire neural network, you keep the majority of the weights fixed.
- Freeze Early Layers: The earlier layers of a network usually detect general features (like edges or shapes in images, or grammar in text). Freeze these layers so they cannot be changed. Only train the final few layers that make the specific decision.
- Add “Adapter” Layers: Freeze the entire original model and add small, new layers in between the old ones. Train only these new small layers.
- Why it works: Since the original weights are frozen, the model cannot overwrite the old capabilities. It only learns a small transformation on top of them.
3. Change How the Model Updates (Regularization)
If you cannot keep old data (due to privacy or storage), you can change the math behind how the model learns.
- Penalize Weight Changes: Modify the loss function to make it “expensive” for the model to change weights that were important for the previous task. The model can still learn new things, but it is pushed toward solutions that do not drastically alter the weights the old task relied on.
- Elastic Weight Consolidation (EWC):
- How it works: After training on Task A, the method calculates the importance of each weight (using the Fisher Information Matrix). When training on Task B, it penalizes changes to high-importance weights while allowing low-importance weights to change freely.
- Pros: Rigorous mathematical foundation; no need to store old data.
- Cons: Computationally expensive to calculate the importance matrix.
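The EWC penalty itself is just a weighted quadratic term added to the Task B loss. A sketch with toy numbers, where `ewc_penalty`, the Fisher values, and the strength `lam` are all illustrative:

```python
import numpy as np

def ewc_penalty(weights, old_weights, fisher, lam=100.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (w_i - w*_i)^2,
    where w* are the Task A weights and F_i is each weight's
    estimated importance (diagonal of the Fisher information)."""
    return 0.5 * lam * np.sum(fisher * (weights - old_weights) ** 2)

# Toy example: weight 0 was important for Task A, weight 1 was not.
old_w  = np.array([1.0, -2.0])
fisher = np.array([5.0, 0.01])

# Candidate weights after some Task B updates: weight 0 moved by 0.5,
# weight 1 moved by 2.0.
w = np.array([1.5, 0.0])

penalty = ewc_penalty(w, old_w, fisher)
```

During Task B training, this penalty is added to the ordinary task loss, so gradient descent trades off new-task accuracy against moving high-Fisher weights.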
- Learning without Forgetting (LwF):
- How it works: It uses Knowledge Distillation. When training on Task B, you pass the new data through both the old (frozen) model and the current model. The loss function tries to match the current model’s predictions to the old model’s predictions (for old task classes) while also learning the true labels for Task B.
- Pros: Simple to implement; no old data required.
- Cons: Depends heavily on the new data being somewhat related to the old data.
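An LwF-style objective for a single example can be sketched as below. It assumes the current model keeps a head for the old task (`new_logits_old_head`) alongside the Task B head; the function name, temperature `T`, and blend weight `alpha` are illustrative choices:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lwf_loss(new_logits_old_head, old_logits, new_logits_new_head, label,
             T=2.0, alpha=0.5):
    """Blend of (a) cross-entropy with the true Task B label and
    (b) a distillation term keeping the old-task outputs close to
    the frozen model's outputs (softened by temperature T)."""
    p_new = softmax(new_logits_new_head)
    ce = -np.log(p_new[label])                      # learn Task B
    p_teacher = softmax(old_logits, T)              # frozen old model
    p_student = softmax(new_logits_old_head, T)     # current model, old head
    distill = -np.sum(p_teacher * np.log(p_student))  # don't forget Task A
    return alpha * ce + (1 - alpha) * distill

# Toy logits for one Task B example.
label = 2
new_logits_new_head = np.array([0.1, -0.3, 1.2])  # current model, Task B head
old_logits          = np.array([2.0, 0.5, -1.0])  # frozen model, Task A head
new_logits_old_head = np.array([1.8, 0.6, -0.9])  # current model, Task A head

loss = lwf_loss(new_logits_old_head, old_logits, new_logits_new_head, label)
```

The distillation term is minimized when the current model reproduces the frozen model's (softened) old-task distribution exactly, which is what anchors the old behavior.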
4. Adjust the Learning Rate
Simple training adjustments can sometimes be enough if the tasks are similar.
- Lower Learning Rate: Use a much smaller learning rate for fine-tuning than you did for the initial training. This limits how much the weights can shift.
- Discriminative Learning Rates: Use a tiny learning rate for the early layers (to preserve general knowledge) and a larger learning rate for the final layers (to learn the new task).
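One common heuristic for discriminative learning rates is a geometric decay going backwards from the head. A sketch in plain Python, where the layer names, base rate, and decay factor are all placeholders:

```python
# Hypothetical layer ordering, from earliest (most general) to the head.
layer_names = ["embed", "block1", "block2", "head"]

base_lr = 1e-3  # rate for the head
decay = 0.1     # each earlier layer gets 10x less

# head: 1e-3, block2: 1e-4, block1: 1e-5, embed: 1e-6
lrs = {name: base_lr * decay ** (len(layer_names) - 1 - i)
       for i, name in enumerate(layer_names)}
```

During fine-tuning, each layer's update is then `w -= lrs[layer] * grad`, so early layers barely move while the head adapts quickly. Most frameworks support this directly via per-parameter-group learning rates.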
Summary Checklist
| Strategy | When to use it |
|---|---|
| Replay / Mixing | If you have access to the old data (most reliable). |
| Freezing / Adapters | If you want to keep the model efficient and modular. |
| Weight Penalties | If you cannot store old data but need to update the whole model. |
| Low Learning Rate | If the new task is very similar to the old one. |