The paper “Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer” by Sergey Zagoruyko and Nikos Komodakis (Université Paris-Est) proposes a novel training methodology called attention transfer, which improves the performance of convolutional neural networks by having them mimic the focus of more powerful models. By defining attention maps based on either neuron activations or network gradients, the authors provide a mechanism for a “student” network to learn which spatial areas a “teacher” network deems most important. This technique consistently enhances the accuracy of smaller or shallower architectures across diverse datasets, including CIFAR-10, ImageNet, and specialized fine-grained recognition tasks. The study demonstrates that transferring these spatial focus patterns is often more effective than traditional knowledge distillation or transferring full activation tensors. Furthermore, the approach speeds up model convergence and can be combined with existing training methods without significant computational overhead. Ultimately, the work highlights that guiding a model on where to look is a vital component in training efficient and high-performing artificial vision systems. The code is available on GitHub.
To be more specific, the main idea of this paper is the proposal of Attention Transfer, a novel method for improving the performance of a student Convolutional Neural Network (CNN) by forcing it to mimic the “attention maps” of a more powerful teacher network.
Here are the core components of this approach:
1. Attention as a Transfer Mechanism
The authors hypothesize that artificial neural networks, much like humans, use attention to perceive their surroundings and gather high-level information. Consequently, a teacher network can improve a student network not just by sharing its final predictions (as in Knowledge Distillation), but by providing information about where it looks and concentrates its focus within an image.
2. Two Types of Attention Maps
To implement this, the authors define attention as a set of spatial maps that encode the areas of the input the network focuses on to make a decision. They propose two specific methods for defining these maps:
- Activation-based attention: This derives attention maps from the 3D activation tensors of the network’s layers. By calculating statistics (such as the sum of absolute values) across the channel dimension, they flatten the activations into a 2D spatial map that highlights the most discriminative regions, such as eyes or faces in object recognition (see the first sketch after this list).
- Gradient-based attention: This treats attention as the gradient of the loss with respect to the input. It functions as a sensitivity map, indicating which pixels would affect the output most if changed, essentially showing where the network is “paying attention” (see the second sketch after this list).
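To make the activation-based definition concrete, here is a minimal PyTorch sketch of collapsing a 4D activation tensor into a 2D spatial map. The function name and the exponent p are illustrative choices, not the authors’ exact implementation.

```python
import torch

def activation_attention_map(activations: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Collapse a (batch, channels, H, W) activation tensor into a (batch, H, W)
    spatial map by summing |A_c|**p over the channel dimension. p = 1 is the plain
    sum of absolute values; larger p puts more weight on the strongest activations."""
    return activations.abs().pow(p).sum(dim=1)

# Example with a hypothetical intermediate feature map:
feats = torch.randn(8, 256, 14, 14)           # batch of 8, 256 channels, 14x14 spatial
attn = activation_attention_map(feats, p=2)   # -> shape (8, 14, 14)
```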
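A similar sketch for the gradient-based view, assuming a standard PyTorch classifier trained with cross-entropy; reducing the colour channels with an absolute sum at the end is one reasonable choice for producing a single 2D map, not something the paper prescribes.

```python
import torch
import torch.nn.functional as F

def gradient_attention_map(model, images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sensitivity map: the gradient of the classification loss with respect to the
    input pixels, showing which pixels would change the output most if perturbed."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    (grad,) = torch.autograd.grad(loss, images)
    return grad.abs().sum(dim=1)  # collapse the colour channels -> (batch, H, W)
```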
3. Training Process and Results
The authors define a loss function that penalises the difference between the student’s and the teacher’s attention maps, effectively training the student to generate similar spatial focus points (a sketch of this loss follows the findings below). Their experiments demonstrate several key findings:
- Broad Applicability: Attention transfer consistently improves performance across various datasets (CIFAR, ImageNet, CUB) and architectures (ResNet, Network-In-Network).
- Efficiency: Transferring attention maps (activation-based) often yields better improvements than trying to transfer the full activation tensors themselves, suggesting that the spatial attention maps contain the most critical information for knowledge transfer.
- Complementary Nature: Activation-based attention transfer can be successfully combined with Knowledge Distillation (KD) to achieve even higher accuracy than either method used alone.
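As a rough illustration of how the training objective fits together, below is a sketch of an activation-based attention-transfer term in PyTorch, following the general recipe of comparing l2-normalised, vectorised attention maps at matching depths. The layer pairing, the weight `beta`, and the helper names are assumptions for illustration, not the authors’ exact code.

```python
import torch
import torch.nn.functional as F

def attention_map(feats: torch.Tensor) -> torch.Tensor:
    """Vectorised, l2-normalised spatial attention map from a (B, C, H, W) tensor."""
    return F.normalize(feats.pow(2).sum(dim=1).flatten(1), dim=1)   # (B, H*W)

def attention_transfer_loss(student_feats, teacher_feats, beta: float = 1e3) -> torch.Tensor:
    """Penalise the distance between student and teacher attention maps, summed over
    the chosen layer pairs (assumed here to share the same spatial resolution)."""
    return beta * sum((attention_map(fs) - attention_map(ft)).pow(2).mean()
                      for fs, ft in zip(student_feats, teacher_feats))

# Training-step sketch: add the transfer term to the usual cross-entropy on the
# student's logits; a knowledge-distillation term can be added in the same way.
# total_loss = F.cross_entropy(student_logits, labels) \
#            + attention_transfer_loss(student_feats, teacher_feats)
```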
Analogy
To understand the difference between this method and standard training, imagine a master art forger teaching an apprentice to paint a specific landscape.
- Standard Training (Ground Truth): The apprentice looks at the real landscape and tries to paint it.
- Knowledge Distillation: The apprentice looks at the master’s completed painting and tries to copy the final result.
- Attention Transfer: The master stands next to the apprentice and points specifically to the shadows under the trees or the glint on the water, saying, “Focus your eyes here; this is the detail that matters most.” The apprentice learns not just what to paint, but where to look to understand the scene.