Ignoring the fact that you can’t prove anything about gradient descent for nonsmooth functions (you need subgradients), a constant step size will NOT work. You can end up at the minimizer, compute a subgradient there that is different from zero, and move AWAY from the minimizer. You need to decay your step size to ensure that you end up at the minimizer (asymptotically).
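A minimal sketch of the difference on f(x) = |x|, whose minimizer is x = 0 (the step sizes and iteration count here are arbitrary, chosen just for illustration): with a constant step the iterates overshoot and oscillate around 0 forever, while a 1/k decay drives them toward it.

```python
# Subgradient method on f(x) = |x|.
def subgrad_abs(x):
    # A valid subgradient of |x|: sign(x) away from 0, and (say) 1 at x = 0,
    # which is nonzero even though 0 is the minimizer -- exactly the issue above.
    return 1.0 if x >= 0 else -1.0

def run(x0, step, iters=50):
    x = x0
    for k in range(1, iters + 1):
        x -= step(k) * subgrad_abs(x)
    return x

print(run(0.5, step=lambda k: 0.3))      # constant step: keeps bouncing around 0
print(run(0.5, step=lambda k: 0.3 / k))  # decaying step: approaches 0
```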
It’s not just Sam, he’s bringing talent with him, people like Jakub Pachocki.
Of course, every locally Lipschitz function is differentiable almost everywhere (Rademacher's theorem), i.e., the set of points where it is not differentiable has measure zero. So if you drew a point at random (from a distribution absolutely continuous w.r.t. the Lebesgue measure), you would avoid such points with probability one. However, this has NOTHING to do with actually running an algorithm like gradient descent: in these algorithms you are NOT simply drawing points at random, and you cannot guarantee a priori that you avoid these points.
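A toy illustration of that point (the starting point and step size are made up, just to show the mechanism): gradient descent iterates are a deterministic function of x0 and the step size, so nothing stops them from landing exactly on a kink, e.g. on f(x) = |x| starting from x0 = 1 with step size 1.

```python
def grad_abs(x):
    # Gradient of |x| where it exists; it does NOT exist at x = 0.
    if x == 0.0:
        raise ValueError("f(x) = |x| is not differentiable at x = 0")
    return 1.0 if x > 0 else -1.0

x = 1.0                       # x0 is chosen, not sampled from a continuous distribution
step = 1.0
x = x - step * grad_abs(x)    # lands exactly on the nondifferentiable point
print(x)                      # 0.0
grad_abs(x)                   # raises: the gradient simply doesn't exist here
```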