Often when I read ML papers, the authors compare their results against a benchmark (e.g. using RMSE, accuracy, …) and claim “our new method improves on it by X%”. Nobody performs a significance test to check whether the new method Y actually outperforms benchmark Z. Is there a reason for that? Especially when results are broken down further, e.g. into an analysis of individual classes in object classification, this seems important to me. Or am I overlooking something?
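
For concreteness, here is a minimal sketch of the kind of test I mean (my own illustration with placeholder data, not taken from any particular paper): an exact McNemar test comparing a new classifier Y against a benchmark Z on the same test set, using only the discordant examples where exactly one of the two is correct.

```python
# Sketch: exact McNemar test for paired classifier accuracy.
# y_true, pred_y (new method Y) and pred_z (benchmark Z) are placeholder arrays.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                           # placeholder labels
pred_y = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)   # ~85% accurate
pred_z = np.where(rng.random(500) < 0.80, y_true, 1 - y_true)   # ~80% accurate

correct_y = pred_y == y_true
correct_z = pred_z == y_true

# Discordant pairs: examples where exactly one method is correct.
b = int(np.sum(correct_y & ~correct_z))   # Y right, Z wrong
c = int(np.sum(~correct_y & correct_z))   # Y wrong, Z right

# Under H0 (equal error rates) the discordant outcomes split 50/50,
# i.e. b ~ Binomial(b + c, 0.5); this is the exact McNemar test.
result = binomtest(b, b + c, 0.5)
print(f"accuracy Y={correct_y.mean():.3f}, Z={correct_z.mean():.3f}, "
      f"b={b}, c={c}, p={result.pvalue:.4f}")
```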

  • isparavanje@alien.top · 10 months ago
    You’re right. This is likely one of the reasons why ML has a reproducibility crisis, together with other effects like data leakage (see: https://reproducible.cs.princeton.edu/).

    Sometimes, indeed, results are so different that the improvement is obviously statistically significant, even by eye, which is uncommon in the natural sciences. Even then, however, the researchers should state clearly that they believe this to be the case and give some evidence for it.
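
    As a concrete illustration of what that evidence could look like for a regression metric like RMSE (a sketch with simulated residuals, not data from the linked study), a paired bootstrap over the same held-out examples gives a confidence interval for the difference between the two methods:

    ```python
    # Sketch: paired bootstrap test for the RMSE difference between a new
    # method Y and a benchmark Z, evaluated on the same held-out examples.
    # err_y and err_z are placeholder per-example residuals.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 1000
    err_z = rng.normal(0, 1.00, size=n)   # placeholder residuals, benchmark Z
    err_y = rng.normal(0, 0.95, size=n)   # placeholder residuals, new method Y

    def rmse(errors):
        return np.sqrt(np.mean(errors ** 2))

    observed_diff = rmse(err_z) - rmse(err_y)   # > 0 means Y looks better

    # Resample examples (paired) to get a bootstrap distribution of the difference.
    boot_diffs = np.empty(10_000)
    for i in range(boot_diffs.size):
        idx = rng.integers(0, n, size=n)
        boot_diffs[i] = rmse(err_z[idx]) - rmse(err_y[idx])

    ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
    print(f"RMSE diff = {observed_diff:.4f}, "
          f"95% bootstrap CI = [{ci_low:.4f}, {ci_high:.4f}]")
    # If the interval excludes 0, the improvement is unlikely to be sampling noise.
    ```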