I’m trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness.
What is the simplest possible algebraic function that would be impossible or rather difficult to optimize for, by setting its 1st derivative to 0, but easily doable with gradient descent? I preferably want to demonstrate this in context linear regression or some extremely simple machine learning model.
ignoring the fact that you can’t prove anything about gradient descent for nonsmooth functions (you need subgradients), a finite step size will NOT work. You can end up at the minimum and compute a subgradient which is different from zero there, and end up moving AWAY from the minimizer. You need to decay your step size to ensure that you end up at the minimizer (asymptotically).