AdaGrad
The AdaGrad algorithm individually adapts the learning rate of each model parameter by scaling it in inverse proportion to the square root of the cumulative sum of that parameter's past squared gradients. Parameters with large partial derivatives of the loss therefore experience a rapid reduction in their learning rates, while those with small partial derivatives see only a modest reduction. As a result, the algorithm makes greater progress in the more gently sloped directions of parameter space.
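To make the rule concrete, here is a minimal sketch of a single AdaGrad update for one parameter tensor, written with PyTorch tensors; the names adagrad_step, state_sum, and eps are illustrative, not part of any library:

import torch

def adagrad_step(param, grad, state_sum, lr=0.01, eps=1e-10):
    # accumulate the sum of squared gradients over the entire history
    state_sum += grad ** 2
    # the effective learning rate shrinks fastest where gradients have been large
    param -= lr * grad / (state_sum.sqrt() + eps)
    return param, state_sum

# example usage with random tensors
param, grad = torch.randn(3), torch.randn(3)
state_sum = torch.zeros(3)
param, state_sum = adagrad_step(param, grad, state_sum)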
RMSProp
The RMSProp algorithm improves upon AdaGrad by modifying the gradient accumulation into an exponentially weighted moving average, making it more effective in nonconvex settings.
While AdaGrad is designed for rapid convergence when applied to convex functions, its learning rate can become too small when navigating the various structures of a nonconvex function, such as those encountered while training a neural network. This issue arises because AdaGrad reduces the learning rate based on the entire history of squared gradients.
In contrast, RMSProp uses an exponentially decaying average to disregard distant past gradients, allowing it to maintain a suitable learning rate. This enables RMSProp to converge quickly once it reaches a locally convex region, functioning as if it were an AdaGrad instance initialized within that region.
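The only change relative to AdaGrad is that the accumulator becomes an exponentially decaying average. A minimal sketch, using the same illustrative naming as above and a decay rate alpha:

import torch

def rmsprop_step(param, grad, avg_sq, lr=0.01, alpha=0.99, eps=1e-8):
    # exponentially weighted moving average of squared gradients:
    # older history is discounted by alpha at every step
    avg_sq = alpha * avg_sq + (1 - alpha) * grad ** 2
    param -= lr * grad / (avg_sq.sqrt() + eps)
    return param, avg_sq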
Adam

Adam is another adaptive learning rate optimization algorithm. Adam incorporates momentum by directly estimating the first-order moment (with exponential weighting) of the gradient. In contrast, adding momentum to RMSProp typically involves applying it to the rescaled gradients.
Adam includes bias corrections for both the first-order moments (momentum term) and the (uncentered) second-order moments to account for their initialization at the origin, unlike RMSProp, which estimates the second-order moment without a correction factor. This makes the RMSProp second-order moment estimate potentially biased early in training.
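The two moment estimates and their bias corrections can be written out in a few lines. A minimal sketch, assuming the step count t starts at 0 and the moment buffers m and v start at zero (all names are illustrative):

import torch

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    t += 1
    m = beta1 * m + (1 - beta1) * grad           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v, t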
Adam is generally considered robust to hyperparameter choices, although the learning rate may occasionally need adjustment from its default value.
Implementation
To use optimization algorithms such as AdaGrad, RMSProp, and Adam in PyTorch, first import the torch.optim module, then construct the optimizer of your choice from your model's parameters. Here's how you can do it:
- AdaGrad:
import torch.optim as optim
# collect the parameters to optimize, typically the model's parameters
params = model.parameters()
optimizer = optim.Adagrad(params, lr=0.01)
- RMSProp:
import torch.optim as optim
# collect the parameters to optimize, typically the model's parameters
params = model.parameters()
optimizer = optim.RMSprop(params, lr=0.01)
- Adam:
import torch.optim as optim
# collect the parameters to optimize, typically the model's parameters
params = model.parameters()
optimizer = optim.Adam(params, lr=0.01)
In the above examples, lr stands for the learning rate; you can adjust it according to your needs. After defining the optimizer, you can use it in your training loop to update your model's parameters.
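For example, a typical loop looks like the sketch below, where model, loss_fn, and dataloader stand in for your own network, loss function, and data (illustrative names, not defined here):

for inputs, targets in dataloader:
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)   # forward pass and loss
    loss.backward()                          # backpropagate to compute gradients
    optimizer.step()                         # apply the chosen update rule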