I’m trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness.
What is the simplest possible algebraic function that would be impossible, or at least quite difficult, to optimize by setting its first derivative to zero, but that is easy to handle with gradient descent? I would preferably like to demonstrate this in the context of linear regression or some other extremely simple machine learning model.
A simple example: you have a set of n points, and you want to find a local maximum of “distance to the closest point in the set”. The gradient is easy to compute (or approximate), and you can show students visually what is happening. There are algorithms that solve this exactly, but they are much more complicated than gradient descent.
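Here is a minimal sketch of that idea in Python. The point set, step size, iteration count, and the bounding square are all made up for illustration; the gradient at x is just the unit vector pointing away from the nearest point (clipping to the square keeps the iterate from wandering off to infinity):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(20, 2))     # the fixed point set, inside a 10x10 square

def nearest(x, points):
    """Return (distance to closest point, the closest point itself)."""
    d = np.linalg.norm(points - x, axis=1)
    return d.min(), points[d.argmin()]

def gradient(x, points):
    """(Sub)gradient of f(x) = min_i ||x - p_i||: unit vector away from the nearest point."""
    _, p = nearest(x, points)
    diff = x - p
    return diff / np.linalg.norm(diff)

x = rng.uniform(0, 10, size=2)                 # random starting location
lr = 0.05
for _ in range(500):
    x = x + lr * gradient(x, points)           # gradient *ascent*: move away from the nearest point
    x = np.clip(x, 0, 10)                      # stay inside the square

print("local maximum found at", x)
print("distance to nearest point:", nearest(x, points)[0])
```

Plotting the points plus the trajectory of x after each step makes a nice animation for students.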
For one step up in complexity you could do an SVM. That is, divide the points into red and blue, and find the line that separates the two groups while maximizing the margin to the closest point. This is a real problem for which gradient descent (well, some variant of it) is a standard solution. It is also nicely visualizable.
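A minimal sketch of the soft-margin version via subgradient descent on the hinge loss (the data, regularization strength, learning rate, and iteration count are all illustrative choices, not a tuned implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two roughly separable point clouds: "blue" labelled -1, "red" labelled +1
blue = rng.normal(loc=[-2.0, -2.0], scale=0.8, size=(50, 2))
red = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(50, 2))
X = np.vstack([blue, red])
y = np.hstack([-np.ones(50), np.ones(50)])

w = np.zeros(2)   # normal vector of the separating line
b = 0.0           # intercept
lam = 0.01        # regularization strength (trades margin width against violations)
lr = 0.1

for _ in range(500):
    margins = y * (X @ w + b)
    viol = margins < 1                      # points inside or on the wrong side of the margin
    # Subgradient of  lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))
    grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"separating line: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
```

Drawing the line and the two margin lines over the colored points at a few stages of training shows the margin widening as the loss decreases.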