PAPER: https://arxiv.org/abs/2310.16764

SUMMARY

The paper “ConvNets Match Vision Transformers at Scale” from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%.
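
To make the "log-log scaling law" concrete: it says held-out loss falls roughly as a power of compute, so the points lie on a straight line once both axes are log-transformed. Below is a minimal sketch of fitting such a line with ordinary least squares. It is not the authors' code. The function names (`fit_power_law`, `predict_loss`), the intermediate budgets, and the loss values are all assumptions for illustration; the losses are synthetic, generated from a made-up power law, and are not results from the paper. Only the two endpoint budgets (0.4k and 110k TPU-v4 core hours) come from the summary.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = intercept + slope * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard closed-form solution for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_loss(compute, slope, intercept):
    """Evaluate the fitted power law at a given compute budget."""
    return math.exp(intercept + slope * math.log(compute))

# Hypothetical budgets in TPU-v4 core hours; only 400 and 110000 are
# taken from the summary, the rest are made up for illustration.
budgets = [400, 1600, 6400, 25600, 110000]

# SYNTHETIC losses from an assumed law loss = 5 * C^-0.1.
# These are NOT the paper's measurements.
losses = [5.0 * c ** -0.1 for c in budgets]

slope, intercept = fit_power_law(budgets, losses)
print(f"fitted exponent: {slope:.3f}, prefactor: {math.exp(intercept):.3f}")
```

Because the synthetic points were generated from an exact power law, the fit recovers the assumed exponent and prefactor. With real measurements the points would scatter around the line, and the fitted slope would summarize how quickly loss improves as compute grows.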

The crux of the paper’s argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison that accounts for compute and data scale. In other words, how well a model performs at large-scale image classification depends more on the available data and compute than on the choice between ConvNet and Vision Transformer architectures. This challenges the community’s leaning towards ViTs and underscores the importance of equitable benchmarking when evaluating different neural network architectures.

  • linearmodality@alien.topB
    11 months ago

    Wasn’t this already known? I thought the ConvNeXt paper already showed this a year and a half ago.

    • qalis@alien.topB
      11 months ago

      Yes and no. In my opinion, ConvNeXt is less about data and more about careful architecture design and smart training. But yeah, CNNs are better than ViTs if done well, that’s true.

    • That_Flamingo_4114@alien.topB
      11 months ago

      Not necessarily. A maxed-out system under perfect conditions could match the newest developing technology. The paper’s whole point was that how you use a technique can matter as much as the algorithm itself. Another paper, from Google, made the same point in the world of recommender systems.