Often when I read ML papers the authors compare their results against a benchmark (e.g. using RMSE, accuracy, …) and say "our new method improved results by X%". Nobody runs a significance test to check whether the new method Y actually outperforms benchmark Z. Is there a reason why? Especially when you break your results down, e.g. to the analysis of certain classes in object classification, this seems important to me. Or am I overlooking something?
Statistical significance testing is best suited to establishing group differences, whereas ML models are evaluated on individual datapoint predictions. If you reach 75% accuracy on a reasonably sized dataset, it is trivial to attach a p-value establishing statistical significance against a baseline, but the result may still be unimpressive by ML standards (depending on the task).
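To illustrate why such a p-value is "trivial to include": here is a minimal sketch of a one-sided exact binomial test comparing a classifier's accuracy against a fixed baseline accuracy. The counts (750 correct out of 1000, baseline 0.70) are hypothetical numbers chosen for the example, not from any paper; only the Python standard library is used.

```python
from math import comb

def binomial_test_greater(successes: int, n: int, p0: float) -> float:
    """One-sided exact binomial test.

    Returns P(X >= successes) for X ~ Binomial(n, p0), i.e. the
    probability of seeing at least this many correct predictions
    if the true accuracy were only the baseline p0.
    """
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )

# Hypothetical example: 750 correct out of 1000 test points,
# tested against a baseline accuracy of 70%.
p_value = binomial_test_greater(750, 1000, 0.70)
print(f"p-value: {p_value:.5f}")
```

Even a modest-looking 5-point improvement over the baseline is highly significant here (p well below 0.01), which is the answer's point: statistical significance is easy to establish and says little about whether the accuracy is good by ML standards. Note this test assumes independent test points and a fixed baseline; to compare two classifiers evaluated on the *same* test set, a paired test such as McNemar's test is the more appropriate choice.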