PAPER: https://arxiv.org/abs/2310.16764

SUMMARY

The paper “ConvNets Match Vision Transformers at Scale” from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%.
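As an illustration of what a log-log scaling law between held-out loss and compute means, here is a minimal sketch that fits a straight line in log-log space. The function name `fit_power_law` and all data points are hypothetical: the losses are synthetic, generated from an assumed power law, and are not the paper's measurements or fitted coefficients.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line through (log compute, log loss).

    Returns (a, b) such that loss is approximately a * compute**b.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the line in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept, mapped back to linear scale
    return a, b

# Synthetic example only: budgets span the range quoted in the summary,
# losses are generated from an assumed law loss = 5 * C**-0.1.
budgets = [400, 1_600, 6_400, 25_600, 110_000]
losses = [5.0 * c ** -0.1 for c in budgets]

a, b = fit_power_law(budgets, losses)
print(f"loss ~ {a:.3f} * compute^{b:.3f}")
```

A negative exponent `b` means the loss falls by a constant factor each time the compute budget is multiplied by a constant factor, which shows up as a straight line on log-log axes.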

The crux of the paper’s argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community’s leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures.

    • currentscurrents@alien.topB
      11 months ago

      You can train CNNs on unlabeled data too. Unsupervised learning works with any model type, and diffusion models or VAEs are often CNN-based.

  • neu_jose@alien.topB
    11 months ago

    I’m saving my excitement for the “fully-connected is all you need” paper, 2026.

    • Miss-Quiz-Mis@alien.topB
      11 months ago

      Aren’t transformers a sort of fully connected network, with weights that change dynamically based on the specific input?
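The intuition in the question above can be sketched as follows: in self-attention, the matrix that mixes tokens is computed from the input, so two different inputs are mixed with two different sets of weights. This is a simplified toy, assuming queries and keys are the raw tokens with no learned projections; the helper names are hypothetical. In a real transformer only the attention mixing is input-dependent, while the projection and MLP weights stay fixed after training.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mixing_matrix(tokens):
    """n x n token-mixing weights of scaled dot-product self-attention.

    Simplification: queries and keys are the tokens themselves.
    """
    d = len(tokens[0])
    scores = [
        [sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in tokens]
        for qi in tokens
    ]
    return [softmax(row) for row in scores]

def mix(weights, tokens):
    """Apply the mixing matrix: each output token is a weighted sum of inputs."""
    dim = len(tokens[0])
    return [
        [sum(w * t[c] for w, t in zip(row, tokens)) for c in range(dim)]
        for row in weights
    ]

x1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x2 = [[2.0, 0.0], [0.0, 0.5], [1.0, 1.0]]

w1 = attention_mixing_matrix(x1)  # mixing weights derived from x1
w2 = attention_mixing_matrix(x2)  # different input, different weights
out1 = mix(w1, x1)
```

A fixed fully connected layer would apply the same weight matrix to `x1` and `x2`; here `w1` and `w2` differ because they are functions of the data.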

  • GFrings@alien.topB
    11 months ago

    Has there been a study that performed a deep dive into the opposite end of the spectrum? There are myriad edge applications out there which cannot rely on training a large model and pruning it down for deployment. I wonder which architectures are most suited to learning at small scales.

    • currentscurrents@alien.topB
      11 months ago

      Generally, models with stronger inductive biases (like CNNs) work better at small scales - as long as those biases are correct for the kind of data you’re working with.
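One concrete way to see the inductive bias of a CNN is to count parameters: locality and weight sharing mean a convolution reuses one small kernel at every position, whereas a fully connected layer learns a separate weight for every input-output pair. The sketch below uses hypothetical layer sizes (a 32x32 RGB input mapped to 16 feature maps of the same spatial size) chosen only for illustration.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)

def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return out_features * (in_features + 1)

height = width = 32          # assumed input resolution
in_ch, out_ch = 3, 16        # assumed channel counts

conv = conv2d_params(in_ch, out_ch, kernel_size=3)
dense = dense_params(height * width * in_ch, height * width * out_ch)

print(f"conv layer:  {conv:,} parameters")
print(f"dense layer: {dense:,} parameters")
```

With far fewer free parameters to fit, the convolution needs less data to generalize, provided the assumption it encodes (nearby pixels matter, and the same pattern can occur anywhere) actually holds for the data.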

  • linearmodality@alien.topB
    11 months ago

    Wasn’t this already known? I thought the ConvNeXt paper already showed this a year and a half ago.

    • That_Flamingo_4114@alien.topB
      11 months ago

      Not necessarily. A maxed-out system under perfect conditions could match the newest developing technology. The paper’s whole point was that how you use a technique can matter as much as the algorithm itself. Another paper, by Google, made the same point in the world of recommender systems.

    • qalis@alien.topB
      11 months ago

      Yes and no. In my opinion, ConvNeXt is less about data and more about careful architecture design and smart training. But yeah, CNNs are better than ViTs if done well, that’s true.

  • ewanmcrobert@alien.topB
    11 months ago

    I was going to say vision transformers still have the advantage, as they are often pre-trained on unlabelled images. But now that I think of it, I don’t see any reason why you couldn’t pre-train a convolutional neural network in the same manner. I just seem to read about it more with vision transformers than with CNNs.

  • Smallpaul@alien.topB
    11 months ago

    The “it” in AI models is the dataset.

    … trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.

    • currentscurrents@alien.topB
      11 months ago

      Maybe it’s less about having as many parameters as the human brain, and more about having datasets as rich and diverse as the real world.

      • TheCrazyAcademic@alien.topB
        11 months ago

        Well, people with mutations like megacephaly, which is an enlarged brain, aren’t any smarter and somehow become even dumber because it messes with neuronal density, so we know brain size does not correlate to intelligence at all. Animals with bigger brains, meaning more neurons than humans, aren’t smarter, at least in theory; scientists could just be using bad benchmarks.

      • TikiTDO@alien.topB
        11 months ago

        People talk a lot about datasets being “rich” and “diverse,” but I wish they would also mention “not full of crap” in the same breath. Whether it be AI or humans, garbage-in, garbage-out still applies. You can have a rich and diverse dataset that teaches AI horrific, terrible ideas and practices.

        We know with humans you get a very different effect based on the quality of the teacher and the teaching material, and we know that a bad teacher teaching bad lessons can be even worse than nothing at all. AI isn’t really that different.

        • shanereid1@alien.topB
          11 months ago

          Was at a big data industry conference yesterday, and one of the big takeaways was that data quality is going to be critical in the age of genAI.