Often when I read ML papers, the authors compare their results against a benchmark (e.g. using RMSE, accuracy, …) and say "our new method improved results by X%". Nobody runs a significance test to check whether the new method Y actually outperforms benchmark Z. Is there a reason why? Especially when you break your results down, e.g. to the analysis of certain classes in object classification, this seems important to me. Or am I overlooking something?
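For context, such a test is cheap when you have per-example predictions from both methods on the same test set. Below is a minimal sketch of a paired bootstrap test on the accuracy difference; the per-example correctness arrays are synthetic placeholders, not real results:

```python
# Paired bootstrap significance test for the accuracy difference between
# a new method Y and a benchmark Z, evaluated on the same test set.
# The correctness arrays below are synthetic stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of test examples

# Hypothetical per-example correctness (True = correct prediction).
correct_y = rng.random(n) < 0.82
correct_z = rng.random(n) < 0.80

observed_diff = correct_y.mean() - correct_z.mean()

# Resample test examples with replacement; because the same indices are
# used for both methods, the pairing between predictions is preserved.
B = 10_000
idx = rng.integers(0, n, size=(B, n))
diffs = correct_y[idx].mean(axis=1) - correct_z[idx].mean(axis=1)

# Two-sided p-value: how often the resampled difference crosses zero.
p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
print(f"accuracy diff = {observed_diff:.3f}, bootstrap p ~ {p:.3f}")
```

For paired binary outcomes, McNemar's test is a standard alternative; the bootstrap version above generalizes directly to metrics like RMSE.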
Older papers did. But now that we have decided NNs are bricks to be trained with no data left out, we no longer care about statistical significance. Anyway, the test set is probably partially included in the training set of the foundation model you downloaded. It started with DeepMind and RL, where experiments were very expensive to run (Joelle Pineau gave a nice talk about these issues). Yet since the alpha_whatever systems were undeniable successes, researchers pursued this path. Now go compute "useless" confidence intervals when a single training run costs 100 million in compute… Nah, better to have a bunch of humans rate the outputs.