PAPER: https://arxiv.org/abs/2310.16764
SUMMARY
The paper “ConvNets Match Vision Transformers at Scale” from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%.
The crux of the paper’s argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community’s leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures.
The other advantage is multimodality, you can tokenize anything.