Backpropagation Explained: A Step-by-Step Guide

Backpropagation is a fundamental algorithm used to train neural networks. Its core idea is to learn from errors: the gradients of the loss function are used to adjust the model’s parameters.

This process allows the network to iteratively improve its predictions, much like how continuous practice and adjustment help improve a skill. By systematically reducing the loss, the network becomes better at making accurate predictions on new, unseen data.

Backpropagation involves a two-step process: the forward pass and the backward pass.

  • In the forward pass, the input data is passed through the network to compute the output.
  • In the backward pass, the error between the predicted output and the actual output is propagated back through the network, and the weights are updated so the model learns from its mistakes.

These two passes are repeated over many iterations, with the network gradually correcting its errors, much like a learner improving through repeated practice and feedback.

Now, suppose that we use gradient descent to minimize the loss. Gradient descent involves computing the gradient (partial derivative) of the loss function with respect to each weight in the network. This gradient indicates the direction and rate at which the weights should be adjusted to reduce the loss.

The backpropagation algorithm uses the chain rule of calculus to compute these gradients efficiently. It starts from the output layer and moves backward through the network, layer by layer, to calculate the gradient for each parameter. Once the gradients are computed, the weights are updated in the opposite direction of the gradient (since we want to minimize the loss) by a small step controlled by the learning rate. This update process iteratively adjusts the weights so that the network gradually learns to make better predictions.
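As a minimal illustration of the update step (a sketch, not part of the article’s example), consider gradient descent on the one-variable loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2(w - 3):

```python
# Minimal gradient-descent sketch: minimize L(w) = (w - 3)^2.
# The gradient dL/dw = 2*(w - 3) points uphill, so we step against it.
w = 0.0            # initial weight (arbitrary starting point)
eta = 0.1          # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # dL/dw at the current w
    w -= eta * grad      # step opposite the gradient
print(w)                 # w converges toward the minimizer, 3
```

Each step shrinks the distance to the minimum by a constant factor (here 1 - 2·eta = 0.8), which is why the loop converges quickly.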

Backpropagation process: a toy example

Let’s assume we have a simple neural network with one hidden layer and one output layer. Here’s a step-by-step outline of the backpropagation process and the relevant formulas.

Network Structure

  • Input Layer: x (input vector)
  • Hidden Layer:
    • Weights: W_1
    • Biases: b_1
    • Pre-activation: g_1 = W_1^T x + b_1
    • Activation: h_1 = \phi(g_1)
  • Output Layer:
    • Weights: W_2
    • Biases: b_2
    • Pre-activation: g_2 = W_2^T h_1 + b_2
    • Activation: h_2 = \phi(g_2)

Here, \phi is the activation function.

Forward Pass

  1. Hidden Layer:
    g_1 = W_1^T x + b_1
    h_1 = \phi(g_1)
  2. Output Layer:
    g_2 = W_2^T h_1 + b_2
    h_2 = \phi(g_2)
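The forward pass above can be sketched in NumPy. The sigmoid activation and the layer sizes here are illustrative assumptions; the article leaves \phi generic:

```python
import numpy as np

# Forward-pass sketch for the toy network above.
# Assumptions: sigmoid phi, layer sizes 3 -> 4 -> 1, random weights.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 1

x  = rng.normal(size=(n_in, 1))                  # input vector
W1 = rng.normal(size=(n_in, n_hid));  b1 = np.zeros((n_hid, 1))
W2 = rng.normal(size=(n_hid, n_out)); b2 = np.zeros((n_out, 1))

def phi(z):                      # sigmoid activation (an assumed choice)
    return 1.0 / (1.0 + np.exp(-z))

g1 = W1.T @ x + b1               # hidden pre-activation: g1 = W1^T x + b1
h1 = phi(g1)                     # hidden activation:     h1 = phi(g1)
g2 = W2.T @ h1 + b2              # output pre-activation: g2 = W2^T h1 + b2
h2 = phi(g2)                     # network output:        h2 = phi(g2)
print(h2.shape)                  # (1, 1)
```

Note that the weight shapes follow the article’s convention g = W^T x + b, so W_1 has shape (n_in, n_hid).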

Loss Function

Assume we use the Mean Squared Error (MSE) loss:
L = \frac{1}{2} (y - h_2)^2
where y is the true label.

Backward Pass (Derivatives)

Step 1: Output Layer

  1. Compute the error term:
    \delta_2 = \frac{\partial L}{\partial g_2} = \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial g_2}
    Since L = \frac{1}{2} (y - h_2)^2 :
    \frac{\partial L}{\partial h_2} = -(y - h_2)
    For h_2 = \phi(g_2) , the derivative \frac{\partial h_2}{\partial g_2} = \phi'(g_2) :
    \delta_2 = -(y - h_2) \cdot \phi'(g_2)
  2. Compute the gradients w.r.t. weights and biases:
    \frac{\partial L}{\partial W_2} = h_1 \delta_2^T
    \frac{\partial L}{\partial b_2} = \delta_2

Step 2: Hidden Layer

  1. Compute the error term:
    \delta_1 = \frac{\partial L}{\partial g_1} = \left( W_2 \delta_2 \right) \odot \phi'(g_1)
    where \odot denotes the elementwise (Hadamard) product.
  2. Compute the gradients w.r.t. weights and biases:
    \frac{\partial L}{\partial W_1} = x \delta_1^T
    \frac{\partial L}{\partial b_1} = \delta_1
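The two backward-pass steps can be sketched together in NumPy. Sigmoid \phi, the layer sizes, and the random data are assumptions for illustration:

```python
import numpy as np

# Backward-pass sketch matching the formulas above.
# Assumptions: sigmoid phi, sizes 3 -> 4 -> 1, scalar target y.
rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 4, 1
x  = rng.normal(size=(n_in, 1))
y  = np.array([[1.0]])
W1 = rng.normal(size=(n_in, n_hid));  b1 = np.zeros((n_hid, 1))
W2 = rng.normal(size=(n_hid, n_out)); b2 = np.zeros((n_out, 1))

phi  = lambda z: 1.0 / (1.0 + np.exp(-z))       # sigmoid
dphi = lambda z: phi(z) * (1.0 - phi(z))        # sigmoid derivative phi'(z)

# Forward pass (needed to cache g1, h1, g2, h2)
g1 = W1.T @ x + b1;  h1 = phi(g1)
g2 = W2.T @ h1 + b2; h2 = phi(g2)

# Step 1: output layer
delta2 = -(y - h2) * dphi(g2)      # delta_2 = -(y - h2) * phi'(g2)
dW2 = h1 @ delta2.T                # dL/dW2 = h1 delta_2^T
db2 = delta2                       # dL/db2 = delta_2

# Step 2: hidden layer
delta1 = (W2 @ delta2) * dphi(g1)  # delta_1 = (W2 delta_2) (elementwise) phi'(g1)
dW1 = x @ delta1.T                 # dL/dW1 = x delta_1^T
db1 = delta1                       # dL/db1 = delta_1
```

A useful sanity check is that each gradient has the same shape as the parameter it updates: dW1 matches W1, db2 matches b2, and so on.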

Update Rules

Using gradient descent, the update rules for the weights and biases are:

  1. Hidden Layer:
    W_1 := W_1 - \eta \frac{\partial L}{\partial W_1}
    b_1 := b_1 - \eta \frac{\partial L}{\partial b_1}
  2. Output Layer:
    W_2 := W_2 - \eta \frac{\partial L}{\partial W_2}
    b_2 := b_2 - \eta \frac{\partial L}{\partial b_2}

Here, \eta is the learning rate.
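A minimal sketch of these update rules in NumPy, with placeholder values standing in for the gradients that the backward pass would produce:

```python
import numpy as np

# Gradient-descent update sketch for the rules above.
# dW1 and db1 are placeholder gradients; in practice they come from backprop.
eta = 0.1                                  # learning rate (assumed value)
W1 = np.ones((3, 4));  dW1 = np.full((3, 4), 0.5)
b1 = np.zeros((4, 1)); db1 = np.full((4, 1), 0.5)

W1 -= eta * dW1                            # W1 := W1 - eta * dL/dW1
b1 -= eta * db1                            # b1 := b1 - eta * dL/db1
print(W1[0, 0], b1[0, 0])                  # 0.95 -0.05
```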

Summary

To summarize, the backpropagation process involves:

  1. Forward pass: Compute activations for each layer.
  2. Loss calculation: Calculate the loss.
  3. Backward pass: Compute the gradients of the loss with respect to weights and biases using the chain rule.
  4. Weight update: Update weights and biases using the gradients and a learning rate.

This iterative process minimizes the loss function and trains the neural network effectively.
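The four steps above can be combined into a small training loop. This is a sketch under assumptions (sigmoid activation, a single input–target pair, illustrative layer sizes), not a production implementation:

```python
import numpy as np

# End-to-end training sketch of the toy network.
# Assumptions: sigmoid phi, sizes 3 -> 4 -> 1, one fixed (x, y) pair.
rng = np.random.default_rng(42)
n_in, n_hid, n_out = 3, 4, 1
x = rng.normal(size=(n_in, 1))
y = np.array([[0.8]])                       # target inside sigmoid's range
W1 = rng.normal(size=(n_in, n_hid)) * 0.5;  b1 = np.zeros((n_hid, 1))
W2 = rng.normal(size=(n_hid, n_out)) * 0.5; b2 = np.zeros((n_out, 1))
phi  = lambda z: 1.0 / (1.0 + np.exp(-z))
dphi = lambda z: phi(z) * (1.0 - phi(z))
eta = 0.5                                   # learning rate (assumed)

for step in range(500):
    # 1. Forward pass
    g1 = W1.T @ x + b1;  h1 = phi(g1)
    g2 = W2.T @ h1 + b2; h2 = phi(g2)
    # 2. Loss calculation: L = 0.5 * (y - h2)^2
    L = 0.5 * float((y - h2) ** 2)
    # 3. Backward pass (chain rule)
    delta2 = -(y - h2) * dphi(g2)
    delta1 = (W2 @ delta2) * dphi(g1)
    # 4. Weight update (gradient descent)
    W2 -= eta * (h1 @ delta2.T); b2 -= eta * delta2
    W1 -= eta * (x @ delta1.T);  b1 -= eta * delta1

print(L)   # final loss, near zero: the network has fit the target
```

Each iteration performs exactly the four summary steps in order, and the loss shrinks toward zero as the network fits the single training pair.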

