The paper “ImputeFormer: Low Rankness-Induced Transformers for Generalizable Spatiotemporal Imputation” (KDD ’24) addresses the pervasive problem of missing data in spatiotemporal datasets, contrasting traditional low-rank models with modern deep learning methods such as Transformers. The authors introduce ImputeFormer, a model that injects the structural prior of low-rankness into a Transformer architecture via projected temporal attention and embedded spatial attention, striking a better balance between signal and noise. The result is a generalizable and efficient imputation solution that is shown to outperform state-of-the-art models across diverse benchmarks, especially under highly sparse observations.
ImputeFormer effectively leverages low-rankness to balance signal and noise by embedding structural priors directly into the Transformer architecture, thereby bridging the gap between the high expressivity of deep learning and the noise-filtering capabilities of analytic low-rank models.
While traditional low-rank models often oversmooth data by truncating informative signals, and standard deep learning models tend to overfit by preserving high-frequency noise, ImputeFormer achieves a balance through three specific mechanisms:
1. Projected Temporal Attention
ImputeFormer exploits the inherent redundancy of time series data, where most of the information can be reconstructed from a few dominant patterns. Instead of computing a full-rank attention matrix, which is computationally expensive and prone to modeling noise, the model employs a projected attention mechanism (sketched in code after this list):
- Low-Rank Factorization: The model projects initial features into a lower-dimensional space using a learnable projector, effectively performing a channel-wise matrix factorization.
- Noise Filtering: By forcing the information flow through this compact representation (dimension $C \ll T$), the model retains principal temporal patterns while filtering out non-essential high-frequency noise.
- Reconstruction: The model then disperses this concentrated information back to the full sequence, ensuring that the reconstruction relies on dominating correlational structures rather than spurious correlations.
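To make the idea concrete, here is a minimal PyTorch-style sketch of attention routed through a low-dimensional temporal bottleneck. It is an illustration, not the authors' implementation: the proxy parameter, the use of two standard multi-head attention layers, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn


class ProjectedTemporalAttention(nn.Module):
    """Sketch: temporal attention routed through C learnable proxies, with C << T."""

    def __init__(self, d_model: int, num_proxies: int, num_heads: int = 4):
        super().__init__()
        # Learnable projector: a small set of proxy vectors that absorb
        # the dominant temporal patterns (the low-rank bottleneck).
        self.proxies = nn.Parameter(torch.randn(num_proxies, d_model))
        self.compress = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.disperse = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model) features of one sensor's time series
        proxies = self.proxies.unsqueeze(0).expand(x.size(0), -1, -1)
        # Step 1: compress T steps into C proxy slots (information flow of rank <= C).
        z, _ = self.compress(proxies, x, x)          # (batch, C, d_model)
        # Step 2: disperse the condensed patterns back to all T time steps.
        out, _ = self.disperse(x, z, z)              # (batch, T, d_model)
        return out


# Usage: a 24-step window with 64-dim features and 8 proxies (C << T).
x = torch.randn(2, 24, 64)
print(ProjectedTemporalAttention(d_model=64, num_proxies=8)(x).shape)  # (2, 24, 64)
```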
2. Embedded Spatial Attention
To manage noise in the spatial dimension, ImputeFormer avoids the full pairwise attention found in standard Transformers, which can be memory-intensive and learn misleading correlations from sparse data.
- Node Embeddings: The model assigns learnable node embeddings to sensors, acting as abstract, low-dimensional representations of the series.
- Factorized Attention: Spatial correlations are estimated from these embeddings, yielding an attention map whose rank is bounded by the embedding dimension rather than by the number of sensors. This acts as a factorized low-rank approximation, reducing the risk of learning noise from specific spatial events or missing values (see the sketch following this list).
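A rough sketch of the idea, assuming the attention scores come purely from learnable node embeddings and are shared across the batch; the class name, sizes, and projection layers are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddedSpatialAttention(nn.Module):
    """Sketch: spatial attention whose score matrix is factorized through node embeddings."""

    def __init__(self, num_nodes: int, d_model: int, d_emb: int = 16):
        super().__init__()
        # One low-dimensional embedding per sensor (abstract representation of its series).
        self.node_emb = nn.Parameter(torch.randn(num_nodes, d_emb))
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model) features of all sensors in the current window.
        # Scores depend only on the embeddings, not on sparse or noisy observations,
        # so the (N, N) score matrix has rank at most d_emb.
        scores = self.node_emb @ self.node_emb.T / self.node_emb.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)              # (N, N), shared across the batch
        v = self.value_proj(x)                        # (batch, N, d_model)
        out = torch.einsum("ij,bjd->bid", attn, v)    # mix sensors with the factorized map
        return self.out_proj(out)


# Usage: 50 sensors, 64-dim features, 16-dim embeddings (d_emb << N).
x = torch.randn(2, 50, 64)
print(EmbeddedSpatialAttention(num_nodes=50, d_model=64)(x).shape)  # (2, 50, 64)
```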
3. Fourier Imputation Loss (FIL)
To further regularize the balance between signal and noise during training, ImputeFormer introduces a loss function defined in the frequency domain (a minimal sketch follows the list below).
- Spectral Regularization: Recognizing that spatiotemporal data possesses a sparse Fourier spectrum (which is mathematically equivalent to low-rankness in the time domain), the model minimizes the $L_1$ norm of the Fourier spectrum of the imputed series.
- Global Compatibility: This loss encourages the imputed values to align globally with observed data, suppressing high-frequency noise in the spectrum that would otherwise appear as high-rank artifacts.
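A minimal sketch of such a penalty, assuming a one-sided FFT over the time axis and a plain mean of spectral magnitudes; the exact normalization and weighting of the paper's FIL may differ, and `lam` is a hypothetical coefficient.

```python
import torch


def fourier_imputation_loss(x_imputed: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Penalize the L1 norm of the Fourier spectrum of the fully imputed series.

    A sparse spectrum is the frequency-domain counterpart of low-rankness, so this
    term discourages high-frequency noise in the imputations.
    """
    # x_imputed: (batch, num_nodes, T) series with the missing entries filled in.
    spectrum = torch.fft.rfft(x_imputed, dim=-1)   # one-sided FFT along time
    return lam * spectrum.abs().mean()             # averaged L1 norm of the magnitudes


# Hypothetical training step: combine with a masked reconstruction term on observed values.
x_hat = torch.randn(4, 207, 24)                    # model output (batch, nodes, T)
target = torch.randn(4, 207, 24)
mask = (torch.rand_like(target) > 0.3).float()     # 1 = observed entry
recon = ((x_hat - target).abs() * mask).sum() / mask.sum()
loss = recon + fourier_imputation_loss(x_hat)
```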
Empirical Evidence of Signal-Noise Balance
The effectiveness of this approach is validated through singular value (SV) spectrum analysis (a short sketch of such a check follows the list):
- Preventing Oversmoothing: Unlike pure matrix factorization models, which truncate too much spectral energy and lose signal, ImputeFormer preserves the dominant singular values associated with valid signals.
- Suppressing Noise: Unlike canonical Transformers that maintain excessive energy in the tail of the spectrum (indicating preserved noise), ImputeFormer aligns closely with the SV distribution of the ground truth data.
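This kind of diagnostic is easy to reproduce: compute the singular values of the (sensors × time) matrix for the ground truth and for the imputed data, then compare how much energy sits in the tail. The shapes, noise level, and cutoff index below are arbitrary placeholders.

```python
import torch


def singular_value_spectrum(series: torch.Tensor) -> torch.Tensor:
    """Singular values (descending) of a (num_nodes, T) spatiotemporal matrix."""
    return torch.linalg.svdvals(series)


# Placeholder comparison: an "imputation" that keeps some high-frequency noise.
truth = torch.randn(200, 288)                      # e.g. 200 sensors x 288 time steps
imputed = truth + 0.05 * torch.randn_like(truth)
sv_truth = singular_value_spectrum(truth)
sv_imputed = singular_value_spectrum(imputed)

# A heavy tail (ratio well above 1) indicates preserved noise; dominant values far
# below the ground truth's would instead indicate oversmoothing.
tail_ratio = sv_imputed[20:].sum() / sv_truth[20:].sum()
print(f"tail-energy ratio (imputed / truth): {tail_ratio:.2f}")
```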
Analogy: You can think of ImputeFormer’s approach like a professional audio engineer cleaning up a noisy recording.
- A standard deep learning model is like a microphone that amplifies everything, making the voice (signal) loud but also amplifying the background hiss (noise).
- A traditional low-rank model is like an aggressive mute button that cuts out the hiss but also muffles the voice, making it sound flat and robotic.
- ImputeFormer acts as an intelligent equalizer. By “projecting” the sound, it identifies the specific frequencies where the human voice lives (dominant patterns) and focuses its power there, while ignoring the frequencies where only static hiss exists. The result is a recording that is crisp and detailed (expressive) without being fuzzy (noisy).