
The Relation Between Knowledge Distillation and Perceptual Loss

Knowledge distillation and perceptual loss are distinct concepts in machine learning, but they can be used together effectively, especially in computer vision tasks. Here’s the simple breakdown: 🧠 What is Knowledge Distillation? Knowledge Distillation is… 

Rust for AI Software Development

What is Rust? Rust is a modern systems programming language focused on three core goals: performance, memory safety, and concurrency. Think of it as having the raw speed and low-level control of languages like C… 

AdamW optimization and implementation in PyTorch

The AdamW method was proposed in the paper “Decoupled Weight Decay Regularization” by Ilya Loshchilov and Frank Hutter. While the paper was officially published at the prestigious International Conference on Learning Representations (ICLR) in 2019,… 
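A minimal sketch of using the decoupled formulation in PyTorch (the model shape and hyperparameter values here are illustrative, not taken from the post): AdamW applies weight decay directly to the weights during the step, rather than folding an L2 penalty into the gradient as classic Adam does.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)  # hypothetical model, just to show the call

# weight_decay is applied to the parameters themselves ("decoupled"),
# not mixed into the adaptive gradient statistics.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()  # Adam update plus lr * weight_decay * w shrinkage
```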

Transformers Architectures: A Comprehensive Review

The Transformer architecture, introduced in the seminal “Attention Is All You Need” paper in 2017, has fundamentally reshaped the landscape of artificial intelligence. By exclusively leveraging self-attention mechanisms and entirely dispensing with traditional recurrent and… 

Interactive Cosine Annealing with Warmup Visualizer

Explore the two-phase learning rate schedule — linear warmup followed by cosine annealing — by adjusting the parameters. Controls: Warmup Ratio (10%), Peak Learning Rate η_max (0.01), Min Learning Rate η_min (0.0001)… 
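The schedule the visualizer plots can be sketched as a plain function of the training step; the default values below mirror the widget's controls, and the function name is my own:

```python
import math

def lr_at_step(step, total_steps, warmup_ratio=0.10,
               eta_max=0.01, eta_min=0.0001):
    """Two-phase schedule: linear warmup to eta_max, then cosine decay to eta_min."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Phase 1: learning rate rises linearly from ~0 to eta_max.
        return eta_max * (step + 1) / warmup_steps
    # Phase 2: cosine annealing from eta_max down to eta_min.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

Sweeping `step` from 0 to `total_steps` reproduces the warmup ramp and the cosine tail seen in the interactive plot.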

Knowledge Distillation Techniques: A Comprehensive Analysis

Knowledge Distillation (KD) has emerged as a critical model compression technique in machine learning, facilitating the deployment of complex, high-performing models in resource-constrained environments. This methodology involves transferring learned “knowledge” from a powerful, often cumbersome,… 
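A common concrete form of this transfer (one variant among those the post surveys; the function and hyperparameters are illustrative) blends a softened teacher–student KL term with the usual hard-label loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KL (teacher -> student) blended with hard-label cross-entropy.

    T softens both distributions; the T*T factor keeps the soft-term
    gradient magnitude comparable across temperatures."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```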

Lipschitz Continuity In Machine Learning

Let (X, ‖·‖_X) and (Y, ‖·‖_Y) be normed vector spaces. A function f: X → Y is called Lipschitz continuous if there exists a real constant K ≥ 0 such that for all x₁, x₂ ∈ X: ‖f(x₁) − f(x₂)‖_Y ≤ K ‖x₁ − x₂‖_X. Here: For a real-valued function of a real variable (with the… 

Gradient Clipping and PyTorch Codes

Gradient clipping is a technique used to address the problem of exploding gradients in deep neural networks. It involves capping the gradients during the backpropagation process to prevent them from becoming excessively large, which can… 
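A minimal sketch of the capping step in PyTorch (model, data, and `max_norm` are illustrative): the global gradient norm is rescaled to at most `max_norm` after `backward()` and before the optimizer step.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their combined norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```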

Minibatch learning and variations of Gradient Descent

Minibatch learning in neural networks is akin to dancers learning a complex routine by breaking it down into smaller, manageable sections. This approach allows both the dancers and the neural network to focus on incremental… 

A Comical Introduction to Neural Networks

The idea of neural networks is inspired by the structure and functioning of a brain, where interconnected neurons process and transmit information through complex networks. Neural networks have various applications, such as: generating and telling jokes… 

Backpropagation Explained: A Step-by-Step Guide

Backpropagation is crucial for training neural networks. It involves a forward pass to compute activations, a loss calculation, a backward pass to compute gradients, and weight updates via gradient descent. This iterative process minimizes the loss and effectively trains the network.
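The four steps above map directly onto one PyTorch training loop; this sketch uses a tiny illustrative network and random data, not anything from the post:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(3, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(16, 3), torch.randn(16, 1)

loss_before = loss_fn(model(x), y).item()
for _ in range(100):
    pred = model(x)          # 1. forward pass: compute activations
    loss = loss_fn(pred, y)  # 2. loss calculation
    optimizer.zero_grad()
    loss.backward()          # 3. backward pass: compute gradients
    optimizer.step()         # 4. weight update via gradient descent
loss_after = loss_fn(model(x), y).item()
```

Repeating the loop drives `loss_after` below `loss_before` — the iterative minimization described above.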

Gradient Descent Algorithm & Codes in PyTorch

Gradient Descent is an optimization algorithm that iteratively adjusts the model’s parameters (weights and biases) to find the values that minimize the loss function. The intuition behind gradient descent is learning how to move from… 
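The iterative adjustment can be shown on a one-parameter toy loss (my own example, not from the post): for f(w) = (w − 3)², the gradient is 2(w − 3), and stepping against it walks w toward the minimum at 3.

```python
# Minimize f(w) = (w - 3)^2 by hand.
w = 0.0    # initial parameter value
lr = 0.1   # learning rate (step size)
for _ in range(100):
    grad = 2 * (w - 3)  # slope of the loss at the current w
    w = w - lr * grad   # step downhill, against the gradient
```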

Batch normalization & Codes in PyTorch

Batch normalization is a crucial technique for training deep neural networks, offering benefits such as stabilized learning, reduced internal covariate shift, and acting as a regularizer. Its process involves computing the mean and variance for each mini-batch and normalizing the activations accordingly. In PyTorch, it can be easily implemented.
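A minimal sketch of that per-mini-batch normalization in PyTorch (feature count and input statistics are illustrative): in training mode each feature is standardized with the batch mean and variance, then scaled and shifted by the learnable gamma and beta (initialized to 1 and 0).

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(num_features=4)
x = torch.randn(64, 4) * 5 + 10  # features with large mean and variance

bn.train()       # training mode: use mini-batch statistics
out = bn(x)      # per-feature: (x - batch_mean) / sqrt(batch_var + eps)
```

After the layer, each feature column has roughly zero mean and unit variance.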

Early Stopping & Restore Best Weights & Codes in PyTorch on MNIST dataset

When using early stopping, it’s important to save and reload the model’s best weights to maximize performance. In PyTorch, this involves tracking the best validation loss, saving the best weights, and then reloading them after early stopping. Practical considerations include model checkpointing and choosing the right validation metric.
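The track-save-reload loop can be sketched as follows; the model, patience value, and hard-coded validation losses are stand-ins for a real training run, not values from the post:

```python
import copy
import torch

model = torch.nn.Linear(5, 1)
best_loss, best_state = float("inf"), None
patience, bad_epochs = 3, 0

# Hypothetical per-epoch validation losses standing in for real evaluation.
for val_loss in [0.9, 0.7, 0.8, 0.6, 0.65, 0.7, 0.72]:
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())  # checkpoint best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for `patience` epochs
            break                   # early stop

model.load_state_dict(best_state)   # restore the best checkpoint
```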

Overfitting, Underfitting, Early Stopping, Restore Best Weights & Codes in PyTorch

Early stopping is a vital technique in deep learning training to prevent overfitting by monitoring model performance on a validation dataset and stopping training when performance stops improving. It saves time and resources, and enhances model performance. Implementing it involves monitoring, defining patience, and training termination. Practical considerations include metric selection, patience tuning, checkpointing, and monitoring multiple metrics.

Learning Rate Strategy & PyTorch Codes

The learning rate is a hyperparameter that determines the size of the steps taken during the optimization process to update the model parameters. One can analogize it to riding a bike in a valley: Just… 
