Initially, an artificial neural network is like a child: it knows almost nothing, so it needs to learn. Training a neural network involves using a loss function. The loss function allows the neural network to receive feedback from the data, just as a child learns by exploring the world and receiving feedback from experiences.
Why loss, not profit? Imagine a child got stung by a bee, causing pain. The child learns to avoid similar situations. Similarly, a neural network makes predictions on the training data. The loss function calculates the error between the predictions and the actual outcomes. Then, based on that loss, the neural network modifies the network’s parameters to improve future predictions.
Practice makes perfect! With each new exploration, the child refines their understanding of what is safe and what is not. The child learns to differentiate between harmless and harmful situations. Similarly, through iterations over the data, the network continuously adjusts its weights to minimize the loss. Each epoch reinforces learning from previous errors.
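This predict-compare-adjust loop can be sketched in plain Python with a toy one-parameter model and squared-error loss (everything here is illustrative, not from any particular library):

```python
# Toy (x, y) pairs generated by the rule y = 2x, so the "right answer" is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0      # like the child, the model starts knowing almost nothing
lr = 0.05    # learning rate: how strongly each piece of feedback is applied

for epoch in range(100):          # each epoch revisits all the data
    for x, y in data:
        pred = w * x              # make a prediction
        error = pred - y          # feedback: how wrong was it?
        grad = 2 * error * x      # gradient of the squared error (pred - y)^2 w.r.t. w
        w -= lr * grad            # adjust the parameter to reduce future loss

print(round(w, 3))                # w approaches the true value 2.0
```

Each pass through the data nudges `w` toward the value that minimizes the loss, which is exactly the "practice makes perfect" dynamic described above.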
Generalization to unseen data: Over time, the child generalizes their learning to avoid not only flowers with bees but also other potentially dangerous situations. Similarly, the trained neural network generalizes from the training data to make accurate predictions on new, unseen data.
Overfitting & Underfitting Review:
- Overfitting is when a neural network learns the training data too well, capturing noise and details that don’t generalize to new data.
- Underfitting occurs when a neural network fails to capture the underlying patterns in the training data, resulting in poor performance on both the training and new data.

Weights are the parameters associated with the connections between nodes in adjacent layers, determining the strength and direction of the input signals. Often, neural networks use a linear transformation that contains the weights stored inside the weight matrix $W$, followed by a non-linear function called an activation function. The linear transformation is:
$$z = W^\top x + b,$$
where $W$ is a weight matrix, $x$ is the input vector, $W^\top$ is the transpose of the weight matrix, and $b$ is a bias vector.
So the full operation, including the activation function $\sigma$, is:
$$a = \sigma\!\left(W^\top x + b\right),$$
where $\sigma$ is the non-linear activation function applied element-wise ($a$ is a vector).
Now, $z = W^\top x + b$ may look a bit similar to the matrix form of multiple linear regression, but it’s not the same. Here, $x$ is a vector (a single input), not a data matrix. So, you see that neural networks are heavily overparameterized compared to linear regression. That’s why they can learn more complex data structures.
However, also due to overparameterization, neural networks, without careful training, can be prone to overfitting, i.e., doing a good job on training data, but not on testing data.
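The linear transformation plus activation can be sketched with NumPy (the shapes here are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # input vector with 3 features
W = rng.normal(size=(3, 4))      # weight matrix: 3 inputs -> 4 units
b = np.zeros(4)                  # bias vector

z = W.T @ x + b                  # linear transformation z = W^T x + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = sigmoid(z)                   # non-linear activation applied element-wise
print(a.shape)                   # (4,) -- one value per unit
```

Note that `W` alone holds 3 × 4 = 12 parameters for just four output units, which is the overparameterization mentioned above.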

So, the first layer can be represented by
$$a^{(1)} = \sigma\!\left(W^{(1)\top} x + b^{(1)}\right),$$
where $\sigma$ is the activation function. Next, $a^{(1)}$ is the input for the second layer. So, the second layer can be represented by
$$a^{(2)} = \sigma\!\left(W^{(2)\top} a^{(1)} + b^{(2)}\right).$$
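Stacking the two layers is just function composition; a minimal NumPy sketch (layer sizes are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                           # input vector

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)    # layer 1: 3 inputs -> 5 units
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)    # layer 2: 5 inputs -> 2 units

def sigma(z):                                    # activation (sigmoid here)
    return 1.0 / (1.0 + np.exp(-z))

a1 = sigma(W1.T @ x + b1)    # first layer output
a2 = sigma(W2.T @ a1 + b2)   # second layer takes a1 as its input
print(a2.shape)              # (2,)
```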
Common Activation Functions
Sigmoid (Logistic) activation function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
It has smooth gradients and produces output values bounded between 0 and 1, which can be interpreted as probabilities. So, it is often used in the output layer for binary classification.

Rectified Linear Unit (ReLU) activation function:
$$f(z) = \max(0, z)$$
It is computationally efficient, requiring only a threshold at zero.
Hyperbolic Tangent (Tanh) activation function:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Its range is (-1, 1) and the outputs are zero-centered, making optimization easier. It is often used in hidden layers.
Softmax activation function:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
The range of softmax is (0, 1). It converts logits into probabilities that sum to 1. It is typically used in the output layer for multi-class classification tasks.
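The four activation functions above are easy to implement and compare directly in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))    # each value in (0, 1)
print(relu(z))       # [0. 0. 3.] -- negatives thresholded at zero
print(np.tanh(z))    # each value in (-1, 1), zero-centered
print(softmax(z))    # non-negative values that sum to 1
```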
Gradient Explosion

Exploding gradients are a problem that occurs during the training of deep neural networks when the gradients of the loss function with respect to the model parameters become excessively large. Gradient explosion can occur due to very deep networks multiplying gradients exponentially, poor weight initialization, and high learning rates causing large updates.
When gradients explode, the model parameters update excessively, leading to unstable and erratic behavior during training. The optimization process struggles to converge because the large gradients cause the model to repeatedly overshoot the optimal solution. Solutions to gradient explosion include using proper initialization techniques like Xavier or He, employing advanced optimizers like Adam, and applying gradient clipping to limit gradient values.
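One of those remedies, gradient clipping, is a one-liner in PyTorch; a minimal sketch with a placeholder model and dummy data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                         # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 10), torch.randn(8, 1)     # dummy batch

opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients so their combined norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

# After clipping, the total gradient norm cannot exceed max_norm
total = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(total)
```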
Gradient Vanishing

The vanishing gradient problem occurs in deep neural networks when the gradients of the loss function with respect to the weights in early layers become very small, preventing significant weight updates and slowing or halting learning. In very deep networks, gradients are propagated backward from the output layer to the input layer. If the activation functions used (like sigmoid or tanh) have derivatives that are less than 1, repeated multiplication of these small numbers causes the gradients to shrink exponentially as they move backward through the layers.
When the gradients become very small, the updates to the weights in the early layers are minimal, almost negligible. This leads to slow learning for the weights in the initial layers, while the later layers may still be learning effectively. Certain activation functions are more prone to causing vanishing gradients. For instance, the sigmoid and tanh functions can saturate for large positive or negative inputs, leading to gradients near zero. The ReLU activation function or its variants (Leaky ReLU, Parametric ReLU) helps mitigate this problem by not saturating for positive inputs.
Batch Normalization also helps to maintain healthy gradients and accelerates training.
Xavier (Glorot) initialization or He initialization helps set the initial weights to values that prevent the gradients from vanishing or exploding.
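The three remedies above (ReLU activations, Batch Normalization, and He initialization) combine naturally in a deep PyTorch stack; the depth and widths here are arbitrary examples:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A six-block stack: Linear -> BatchNorm -> ReLU, with He (Kaiming)
# initialization so gradients neither vanish nor explode at the start.
layers = []
for _ in range(6):
    lin = nn.Linear(64, 64)
    nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')  # He init
    layers += [lin, nn.BatchNorm1d(64), nn.ReLU()]
model = nn.Sequential(*layers)

x = torch.randn(32, 64)    # a batch of 32 samples
out = model(x)
print(out.shape)           # torch.Size([32, 64])
```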
Some other activation functions:

Recall that the Rectified Linear Unit (ReLU) activation function has the form $f(z) = \max(0, z)$.
It is computationally efficient but can suffer from the dying ReLU problem, where neurons get stuck on the negative side and stop learning.

Leaky ReLU activation function takes the value $z$ if $z > 0$ and the value $\alpha z$, for a small fixed constant $\alpha$ (e.g., 0.01), if $z \le 0$. It allows a small, non-zero gradient when $z \le 0$, which helps prevent the dying ReLU problem.

Parametric ReLU (PReLU) activation function is similar to Leaky ReLU, but with a learnable parameter $\alpha$ for the negative slope. So, $f(z) = z$ if $z > 0$, and $f(z) = \alpha z$ if $z \le 0$. However, it adds additional parameters to be learned during training.
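A quick numerical illustration of the leaky behavior (PReLU would look identical, except that `alpha` would be learned during training rather than fixed):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):          # alpha: small fixed negative slope
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))   # [-0.03  -0.005  0.  2.] -- negatives shrunk, not zeroed
```

Because negative inputs still produce a small non-zero output (and gradient), a unit that lands on the negative side can keep learning instead of dying.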
Initialization
Initialization methods are critical in training deep neural networks as they set the starting point for the weights in the model. Proper initialization can lead to faster convergence and better performance. If not properly scaled, it can lead to exploding gradients (gradients too big) or vanishing gradients (gradients too small, the network can’t learn due to small updates).
Xavier (Glorot) Initialization:
Weights are initialized from a distribution with a mean of zero and a variance of $\frac{2}{n_{\text{in}} + n_{\text{out}}}$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of input and output units, respectively. So, the initialization is
$$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \quad \text{or} \quad W \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right].$$
He Initialization:
Weights are initialized from a normal or uniform distribution with a mean of zero and a variance of $\frac{2}{n_{\text{in}}}$, i.e.,
$$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right) \quad \text{or} \quad W \sim U\!\left[-\sqrt{\frac{6}{n_{\text{in}}}},\ \sqrt{\frac{6}{n_{\text{in}}}}\right].$$
He Initialization works well with ReLU and its variants by accounting for their specific activation properties.
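Both schemes are just scaled random draws, so the stated variances are easy to verify empirically (layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 400, 300

# Xavier (Glorot): target variance 2 / (n_in + n_out)
xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

# He: target variance 2 / n_in (tuned for ReLU-family activations)
he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

print(round(xavier.var(), 4))   # close to 2/700 ≈ 0.0029
print(round(he.var(), 4))       # close to 2/400 = 0.005
```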
TensorFlow and PyTorch for training neural networks
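As one example, here is a minimal PyTorch training loop on synthetic data. It is a sketch, not a complete recipe: there is no validation split, mini-batching, or early stopping, and the toy labels are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 4)                          # 200 samples, 4 features
y = (X.sum(dim=1, keepdim=True) > 0).float()     # toy binary labels

model = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),                 # hidden layer
    nn.Linear(16, 1), nn.Sigmoid(),              # sigmoid output for binary labels
)
loss_fn = nn.BCELoss()                           # binary cross-entropy loss
opt = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)   # error between predictions and labels
    loss.backward()               # gradients of the loss w.r.t. all weights
    opt.step()                    # adjust the weights to reduce the loss

print(round(loss.item(), 3))      # loss shrinks as training progresses
```

The same loop in TensorFlow/Keras would typically be replaced by `model.compile(...)` followed by `model.fit(...)`.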






Quizzes: Initialization of Neural Networks
Why are initial values important in machine learning?
A) They determine the size of the dataset
B) They affect how quickly and effectively the model converges during training
C) They control the number of features in the model
D) They set the target values for predictions
Answer: B) They affect how quickly and effectively the model converges during training
Which of the following problems can occur if the initial values are not chosen properly?
A) The model may fail to converge
B) The model may have too many parameters
C) The model may require more epochs
D) The model may be too simple
Answer: A) The model may fail to converge
Why is proper initialization of weights crucial for training deep neural networks?
A) To avoid redundancy in the training dataset
B) To ensure that the model learns efficiently and effectively
C) To maximize the use of computing resources
D) To simplify the model architecture
Answer: B) To ensure that the model learns efficiently and effectively
What can happen if all weights are initialized to zero?
A) The model will converge too quickly
B) The model will become more complex
C) The model will fail to learn meaningful features due to symmetry
D) The model will perform better on the training data
Answer: C) The model will fail to learn meaningful features due to symmetry
Which initialization method can help avoid the vanishing and exploding gradient problems in deep neural networks?
A) Zero initialization
B) Constant initialization
C) Xavier/Glorot initialization
D) Batch initialization
Answer: C) Xavier/Glorot initialization
What is He initialization particularly useful for?
A) Shallow networks
B) Linear regression models
C) Deep networks with ReLU activation functions
D) Binary classification problems
Answer: C) Deep networks with ReLU activation functions
What is the vanishing gradient problem?
A) When gradients become too large during training
B) When gradients become too small during training, slowing down learning
C) When the model fails to generalize to new data
D) When the model overfits to the training data
Answer: B) When gradients become too small during training, slowing down learning
How does proper weight initialization help mitigate the vanishing gradient problem?
A) By ensuring weights are too large initially
B) By scaling weights to prevent the gradients from shrinking too much
C) By ensuring weights are the same across all layers
D) By simplifying the network architecture
Answer: B) By scaling weights to prevent the gradients from shrinking too much
What is a common symptom of the exploding gradient problem?
A) The model converges too quickly
B) The model’s performance suddenly degrades during training
C) The model fails to learn any patterns
D) The loss function becomes very small
Answer: B) The model’s performance suddenly degrades during training