Imagine for a moment that you’re in the late 19th century, without a computer - what would spur you to think about gradient descent? And what variations were already being considered at a time when there weren’t the computational resources to apply it effectively?
Starting out
In our previous post we considered gradient descent with one variable. Here we consider the class of functions \(f\) which are non-negative, that is \(f(x) \geq 0 \; \forall x\), and continuous. The reasoning (n.b. Cauchy’s approach was multivariate, but we’ll start with the univariate case) is that if we consider the derivative from first principles, we get the following approximation
$$f^\prime(x) \approx \frac{f(x+h) - f(x)}{h} $$
for some sufficiently small \(h\). Rearranging, we can easily see that
$$f(x+h) \approx f(x) + h(f^\prime(x))$$
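As a quick numerical sanity check (my own, not part of Cauchy’s argument), we can compare both sides of this approximation for a hypothetical test function \(f(x) = x^2\):

```python
# Compare f(x + h) with its first-order approximation f(x) + h * f'(x),
# using the illustrative function f(x) = x**2, so f'(x) = 2x.
f = lambda x: x ** 2
f_prime = lambda x: 2 * x

x, h = 1.5, 1e-4
exact = f(x + h)                 # 2.25030001
approx = f(x) + h * f_prime(x)   # 2.2503
print(exact, approx, exact - approx)   # the gap is of order h**2
```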
Then suppose we choose a \(\theta > 0\) (spoiler alert: our step size) and set \(h\) according to the relationship
$$h = -\theta f^\prime(x)$$
Substituting this choice of \(h\) into the approximation above gives
$$f(x-\theta f^\prime(x)) \approx f(x) -\theta (f^\prime(x))^2$$
Remember that we want to find the minimum point of this function. Since \(\theta (f^\prime(x))^2 \geq 0\), we reason that our new “guess” \(x^\prime = x-\theta f^\prime(x)\) should satisfy
$$f(x^\prime) \leq f(x)$$
provided the step size is “small enough”, because we have assumed the function to be continuous.
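To make this concrete, here is a minimal sketch of the resulting iteration, assuming an illustrative function \(f(x) = (x - 2)^2\) (non-negative, with its minimum at \(x = 2\)) and a hand-picked fixed step size; neither choice comes from Cauchy’s notes:

```python
# A minimal univariate gradient descent sketch using the update x' = x - theta * f'(x).
# f(x) = (x - 2)**2 is an illustrative non-negative function minimised at x = 2.

def f(x):
    return (x - 2) ** 2

def f_prime(x):
    return 2 * (x - 2)

x = 10.0      # arbitrary starting guess
theta = 0.1   # fixed step size, chosen by hand

for _ in range(50):
    x = x - theta * f_prime(x)   # each step satisfies f(x') <= f(x)

print(x, f(x))   # x ends up close to 2, and f(x) close to 0
```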
Extending this to the multivariate case
The multivariate setting was framed by choosing a single \(\theta\) for all of the variables. If we change the target function to \(f(x, y, z)\), the extension is that
$$f(x-\theta f^\prime(x, ., .), y-\theta f^\prime(., y, .), z-\theta f^\prime(., ., z)) \approx f(x, y, z) -\theta ((f^\prime(x, ., .))^2 + (f^\prime(., y, .))^2 + (f^\prime(., ., z))^2)$$
is approximately true, where \(f^\prime(\dots, i, \dots)\) represents the partial derivative with respect to variable \(i\). Cauchy postulated three different ways of choosing \(\theta\) in this multivariate setting (a rough code sketch of all three follows the list):
- Backtracking line search - The most obvious method: start with a “large” value of \(\theta\) and keep decreasing it until the new estimate improves on the previous one.
- Steepest descent - Determine \(\theta\) by solving $$\frac{\partial f(x-\theta f^\prime(x, ., .), y-\theta f^\prime(., y, .), z-\theta f^\prime(., ., z))}{\partial \theta} = 0,$$ which can be done via any kind of root-finding algorithm.
- The last approach presumes that the current estimate is already “close” to the minimum, where \(f\) is close to \(0\); setting the right-hand side of the approximation above to zero then gives the value of \(\theta\) directly, that is $$\theta = \frac{f(x, y, z)}{(f^\prime(x, ., .))^2 + (f^\prime(., y, .))^2 + (f^\prime(., ., z))^2}$$
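Below is a rough, modern sketch of all three strategies, assuming an illustrative non-negative quadratic \(f(x, y, z) = (x-1)^2 + 1.5(y+1)^2 + 0.5z^2\) with minimum value \(0\) at \((1, -1, 0)\); the test function, step counts and tolerances are my own choices rather than anything from Cauchy’s notes:

```python
import numpy as np

# Illustrative non-negative test function with minimum value 0 at (1, -1, 0).
def f(p):
    x, y, z = p
    return (x - 1) ** 2 + 1.5 * (y + 1) ** 2 + 0.5 * z ** 2

def grad(p):
    x, y, z = p
    return np.array([2 * (x - 1), 3 * (y + 1), 1.0 * z])

# 1. Backtracking line search: start with a "large" theta and shrink it
#    until the step improves on the current value.
def theta_backtracking(p, g, theta0=1.0, shrink=0.5, max_halvings=50):
    theta = theta0
    for _ in range(max_halvings):
        if f(p - theta * g) < f(p):
            break
        theta *= shrink
    return theta

# 2. Steepest descent: choose theta by solving d/dtheta f(p - theta * g) = 0,
#    here via bisection on a numerically estimated derivative.
def theta_steepest(p, g, hi=10.0, eps=1e-6, iters=60):
    def phi_prime(t):
        return (f(p - (t + eps) * g) - f(p - (t - eps) * g)) / (2 * eps)
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if phi_prime(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# 3. Cauchy's closed-form guess, which assumes f is already close to 0:
#    theta = f(p) / ||grad f(p)||^2.
def theta_cauchy(p, g):
    return f(p) / np.dot(g, g)

def descend(theta_rule, p0, steps=40):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        g = grad(p)
        p = p - theta_rule(p, g) * g
    return p, f(p)

for name, rule in [("backtracking", theta_backtracking),
                   ("steepest descent", theta_steepest),
                   ("cauchy's formula", theta_cauchy)]:
    p, val = descend(rule, [4.0, 3.0, -2.0])
    print(f"{name:>16}: p = {p.round(4)}, f(p) = {val:.2e}")
```

On this well-behaved function all three rules drive the iterate towards \((1, -1, 0)\); the differences mainly show up in the cost of choosing \(\theta\) at each step and in how quickly the iterates converge.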
In Cauchy’s original notes, the notion of convergence wasn’t given much thought. Instead he believed that this approach could fail to find the proper solution (which can be observed by constructing a function with a local minimum in which gradient descent gets stuck). These ideas were certainly interesting at the time and still reflect the various approaches which are used today.
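To illustrate the local-minimum failure mode mentioned above, here is a small example using an illustrative non-negative function of my own choosing, \(f(x) = (x^2 - 1)^2 + 0.3(x + 1)^2\), whose global minimum is \(f(-1) = 0\) but which also has a local minimum near \(x \approx 0.82\):

```python
# Illustrative non-negative function: global minimum at x = -1 (f = 0),
# local minimum near x ≈ 0.82 (f ≈ 1.1).
def f(x):
    return (x ** 2 - 1) ** 2 + 0.3 * (x + 1) ** 2

def f_prime(x):
    return 4 * x * (x ** 2 - 1) + 0.6 * (x + 1)

x, theta = 2.0, 0.02   # start to the right of the local minimum
for _ in range(200):
    x = x - theta * f_prime(x)

print(x, f(x))   # stuck near x ≈ 0.82 with f ≈ 1.1, far from the global minimum at x = -1
```

Each step decreases \(f\), exactly as the argument above promises, yet the iterates settle in the wrong basin.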