If you have done any machine learning (as opposed to statistical learning), you will have come across gradient descent. How does it work, and what are its limitations?
In the gradient descent scenario, we have a function \(f\) which we need to optimise. We start with an initial guess \(\theta_0\), which is then updated iteratively given a learning rate \(\alpha\). We also need to know the gradient of the function, \(\nabla f\). The update rule is typically:
$$\theta_{n+1} = \theta_{n} - \alpha \nabla f(\theta_{n})$$
In our case, I’ll provide both the function \((x-2)^2\) and its analytically calculated gradient \(2(x-2)\). With a suitable choice of learning rate, the algorithm will take smaller and smaller steps (tempered by the shrinking gradient) as it approaches the final estimate. Unfortunately, if our learning rate is too low, it may fail to converge in a reasonable number of steps, but if it’s too high, it can “escape” and lead to a suboptimal solution. This demonstrates why it’s important to treat \(\alpha\) as a hyperparameter that needs to be tuned itself!
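To make this concrete, here is a minimal sketch in Python of the update rule above applied to our function. The `gradient_descent` helper, the step budget `n_steps`, and the stopping tolerance `tol` are my own illustrative choices, not part of the interactive widget:

```python
def f(x):
    """The function we want to minimise: (x - 2)^2."""
    return (x - 2) ** 2

def grad_f(x):
    """Analytical gradient of f: 2(x - 2)."""
    return 2 * (x - 2)

def gradient_descent(grad, theta0, alpha, n_steps=100, tol=1e-8):
    """Iterate theta <- theta - alpha * grad(theta); return the trajectory."""
    theta = theta0
    trajectory = [theta]
    for _ in range(n_steps):
        step = alpha * grad(theta)
        theta = theta - step
        trajectory.append(theta)
        if abs(step) < tol:  # stop once the updates become negligible
            break
    return trajectory

# A well-behaved learning rate: the steps shrink as the gradient
# flattens out near the minimum at x = 2.
print(gradient_descent(grad_f, theta0=0.0, alpha=0.1)[:5])
# [0.0, 0.4, 0.72, 0.976, 1.1808]
```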
Try moving the sliders! If the learning rate is too high, we’ll “escape” this region; if it’s too low, we’ll fail to converge.
With a learning rate of \(\alpha = 1\), even with unlimited steps we will fail to converge: the update becomes \(\theta_{n+1} = \theta_n - 2(\theta_n - 2) = 4 - \theta_n\), so each learning step is “too large” and the estimate will continually bounce backwards and forwards around \(2\).
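Using the sketch from earlier, we can check this oscillation directly (again, just an illustration with the hypothetical helper above):

```python
# With alpha = 1.0 the update theta <- theta - 2(theta - 2) = 4 - theta
# simply reflects the iterate around 2, so it never settles.
print(gradient_descent(grad_f, theta0=0.0, alpha=1.0, n_steps=6))
# [0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0]
```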
In a future post, we’ll look at some strategies that go beyond gradient descent to mitigate the need to search over hyperparameters like \(\alpha\) and to address other challenges with gradient descent. We’ll also look at the history of these methods, as well as some other strategies for minimising a cost function.