Stochastic Gradient Descent#
In this chapter we close the circle that allows us to train a model: we need an algorithm that searches the weight space efficiently to find an optimal set of weights.
Gradient Descent#
For us to be able to find the right weights, we need to pose the learning problem via a suitable objective (loss) function such as the cross-entropy. Optimization refers to the task of either minimizing or maximizing some function by altering its argument.
As the simplest possible example, the following figure shows a one-dimensional objective function and what an optimization algorithm does.

Gradient descent: an illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum.
The global minimum of such a nicely convex function $f(x)$ can be obtained by solving the equation

$$\frac{df(x)}{dx} = 0$$

for $x$, where $\frac{df(x)}{dx}$ denotes the derivative of $f$ with respect to $x$. The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement: we can reduce $f(x)$ by moving $x$ in small steps in the direction opposite to the sign of the derivative.
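The descent rule above can be sketched in a few lines. This is a minimal illustration, not a production optimizer; the function $f(x) = (x-3)^2$, the learning rate, and the step count are all illustrative choices.

```python
# Minimal sketch of 1D gradient descent on the convex function
# f(x) = (x - 3)**2, whose derivative is f'(x) = 2 * (x - 3).

def gradient_descent_1d(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)  # derivative of (x - 3)^2 at the current x
        x = x - lr * grad       # step against the sign of the derivative
    return x

x_min = gradient_descent_1d(x0=0.0)
print(x_min)  # approaches the minimizer x = 3
```

Each step shrinks the distance to the minimizer by a constant factor here, which is why a fixed learning rate suffices for this simple convex case.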
Local minima in optimizing over complex loss functions
We often minimize loss functions that have multiple inputs: $L: \mathbb{R}^n \to \mathbb{R}$. Element $i$ of the gradient $\nabla_{\mathbf{w}} L$ is the partial derivative of $L$ with respect to $w_i$. In the generic case, gradient descent iterates

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} L(\mathbf{w})$$

where $\eta$ is the learning rate, a positive scalar that determines the size of each step.
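The multivariate update rule can be sketched on a least-squares loss, where the gradient has a closed form. The synthetic data, the learning rate, and the step count below are assumptions made for illustration only.

```python
import numpy as np

# Sketch: full-batch gradient descent on the least-squares loss
# L(w) = (1/m) * ||X w - y||^2, with gradient (2/m) * X^T (X w - y).

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # synthetic inputs, m = 100 examples
true_w = np.array([2.0, -1.0])       # weights that generated the targets
y = X @ true_w                       # noiseless synthetic targets

w = np.zeros(2)                      # initial guess
lr = 0.1                             # learning rate (eta)
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the loss at w
    w = w - lr * grad                        # update: w <- w - eta * grad

print(w)  # approaches [2, -1]
```

Note that every iteration touches all $m$ examples, which is exactly the cost that motivates the stochastic variants discussed next.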
Iterations of gradient descent towards the (in this case global) minimum.
Here is an animation of how it works on a more complicated loss function:
Stochastic Gradient Descent (SGD)#
To calculate the new $\mathbf{w}$ at each iteration, we need the gradient of the loss over the training set. For a dataset of $m$ examples the loss is an average,

$$L(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} L_i(\mathbf{w}),$$

and its partial derivative with respect to each weight is likewise an average over all examples,

$$\nabla_{\mathbf{w}} L(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\mathbf{w}} L_i(\mathbf{w}),$$

which necessitates going over the whole dataset at each iteration. This would be extremely slow, so instead we approximate gradient descent with two modifications:
1. We define a mini-batch over which we estimate the gradient. When the mini-batch size is 1, we obtain the Stochastic Gradient Descent algorithm. Note that in practice people may say SGD but mean mini-batch gradient descent.
2. We define a schedule of learning rates instead of sticking to a single value.
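The two modifications above can be sketched together on the same least-squares loss. The batch size, the decay schedule, and the synthetic data are illustrative assumptions; many other schedules are used in practice.

```python
import numpy as np

# Sketch of mini-batch SGD with a simple decaying learning-rate schedule
# on the least-squares loss L(w) = (1/m) * ||X w - y||^2.

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))        # synthetic inputs, m = 200 examples
true_w = np.array([1.5, -0.5])
y = X @ true_w                       # noiseless synthetic targets

w = np.zeros(2)
batch_size = 16
for epoch in range(100):
    lr = 0.1 / (1.0 + 0.01 * epoch)          # decaying schedule, not a fixed eta
    perm = rng.permutation(len(y))           # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]  # indices of the current mini-batch
        # gradient estimated on the mini-batch only, not the full dataset
        grad = 2.0 / len(idx) * X[idx].T @ (X[idx] @ w - y[idx])
        w = w - lr * grad

print(w)  # approaches [1.5, -0.5]
```

Setting `batch_size = 1` recovers plain SGD; each update then uses a single example's gradient, which is noisier but far cheaper per step.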
The main advantage of mini-batch GD over stochastic GD is that you get a performance boost from hardware-optimized matrix operations, especially when using GPUs. The following video showcases the advantages of SGD over GD.
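The hardware argument comes from the fact that a whole mini-batch can be processed as one matrix product instead of a Python loop over examples. The toy linear layer and shapes below are assumptions for illustration; the two computations produce identical results, but the batched form maps onto optimized BLAS or GPU kernels.

```python
import numpy as np

# Sketch: a mini-batch pushed through a toy linear layer, computed two ways.

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))          # weights of a toy linear layer
batch = rng.normal(size=(32, 4))     # mini-batch of 32 four-dimensional examples

out_loop = np.stack([x @ W for x in batch])  # one example at a time
out_batched = batch @ W                      # whole batch in a single matmul

print(np.allclose(out_loop, out_batched))  # True
```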
For an overview of the various algorithms that are considered enhancements of SGD, you can read this blog post; Python implementations are also included in the d2l.ai book. Momentum and Adam are two of the most popular enhancements.
Digging further#
For an overview of optimization theory, please go through Ian Goodfellow's chapter 4 below. Stochastic gradient descent is also treated in section 5.9 of that book.