[R] ConvNets Match Vision Transformers at Scale

psyyduck@alien.top · 2 years ago

[R] ConvNets Match Vision Transformers at Scale

ReasonablyBadass@alien.top · 2 years ago

The abstract says they trained on a labeled dataset. ViTs work on unlabeled ones, right?

currentscurrents@alien.top · 2 years ago

You can train CNNs on unlabeled data too. Unsupervised learning works with any model type, and diffusion models or VAEs are often CNN-based.

neu_jose@alien.top · 2 years ago

I’m saving my excitement for the “fully-connected is all you need” paper, 2026.

Miss-Quiz-Mis@alien.top · 2 years ago

Aren’t transformers a sort of fully connected network whose weights are dynamic, depending on the specific input?
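
To make that intuition concrete, here is a minimal NumPy sketch of single-head self-attention, assuming random projection matrices and toy sizes (all names and shapes are made up for illustration, not taken from the paper). The token-mixing matrix is recomputed from each input, whereas a plain fully connected layer would apply the same fixed matrix every time.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mixing_matrix(x, w_q, w_k):
    """Return the (tokens x tokens) mixing weights computed from the input x."""
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

# Toy sizes and random parameters, assumed purely for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))

x_a = rng.normal(size=(n_tokens, d_model))
x_b = rng.normal(size=(n_tokens, d_model))

# Same learned parameters, different inputs, different mixing weights.
mix_a = attention_mixing_matrix(x_a, w_q, w_k)
mix_b = attention_mixing_matrix(x_b, w_q, w_k)

# The output mixes value vectors using the input-dependent weights.
out_a = mix_a @ (x_a @ w_v)
```

The learned parameters (`w_q`, `w_k`, `w_v`) stay fixed; only the mixing matrix built from them changes per input.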

GFrings@alien.top · 2 years ago

Has there been a study that performed a deep dive into the opposite end of the spectrum? There are myriad edge applications out there which cannot rely on training a large model and pruning it down for deployment. I wonder which architectures are most suited to learning at small scales.

currentscurrents@alien.top · 2 years ago

Generally, models with stronger inductive biases (like CNNs) work better at small scales - as long as those biases are correct for the kind of data you’re working with.
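
One rough way to see the inductive bias is parameter count. The sketch below assumes a 32x32 RGB input mapped to 16 feature channels (arbitrary sizes picked for illustration) and compares a 3x3 convolution, which shares its weights across positions, with a fully connected layer producing an output of the same shape.

```python
def conv2d_params(kernel, c_in, c_out):
    # One kernel x kernel x c_in filter per output channel, plus one bias each.
    return kernel * kernel * c_in * c_out + c_out

def dense_params(height, width, c_in, c_out):
    # Every input value connects to every output value, plus one bias per output.
    n_in = height * width * c_in
    n_out = height * width * c_out
    return n_in * n_out + n_out

conv_count = conv2d_params(kernel=3, c_in=3, c_out=16)
dense_count = dense_params(height=32, width=32, c_in=3, c_out=16)

print(conv_count)   # 448
print(dense_count)  # 50348032
```

The convolution bakes in locality and translation equivariance, so it has far fewer weights to fit from a small dataset. That only helps when the data actually has that structure.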

linearmodality@alien.top · 2 years ago

Wasn’t this already known? I thought the ConvNeXt paper already showed this a year and a half ago.

That_Flamingo_4114@alien.top · 2 years ago

Not necessarily. A maxed-out system under perfect conditions could match the newest developing technology. The paper’s whole point was that how you use a technique can matter as much as the algorithm itself. A Google paper made the same point for recommender systems.

RobbinDeBank@alien.top · 2 years ago

This group might have too many TPU credits and not know what to do with them.

qalis@alien.top · 2 years ago

Yes and no. In my opinion, ConvNeXt is less about data and more about careful architecture design and smart training. But yeah, CNNs are better than ViTs if done well, that’s true.

ewanmcrobert@alien.top · 2 years ago

I was going to say vision transformers still have the advantage, as they are often pre-trained on unlabelled images. But now that I think of it, I don’t see any reason why you couldn’t pre-train a convolutional neural network in the same manner. I just seem to read about it more with vision transformers than with CNNs.

qalis@alien.top · 2 years ago

That’s exactly what ConvNeXt V2 does

mileseverett@alien.top · 2 years ago

Masked Image Modelling objectives are just harder with CNNs compared to ViTs

Smallpaul@alien.top · 2 years ago

The “it” in AI models is the dataset.

… trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.

currentscurrents@alien.top · 2 years ago

Maybe it’s less about having as many parameters as the human brain, and more about having datasets as rich and diverse as the real world.

hoppyJonas@alien.top · 2 years ago

It’s probably both. In the Chinchilla paper, they showed that for compute-optimal training, the model size and the training dataset size should be scaled up in equal proportion.
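
As a back-of-the-envelope sketch of what that scaling implies, the snippet below assumes two common rules of thumb that are not stated in this thread: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and roughly 20 tokens per parameter. Both come from language models, so treat the numbers as illustrative rather than anything specific to vision.

```python
import math

# Assumed rules of thumb, not exact constants.
FLOPS_PER_PARAM_TOKEN = 6.0   # C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # D ~ 20 * N at the compute-optimal point

def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, training tokens) under the assumptions above."""
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

base_params, base_tokens = compute_optimal(5.88e23)
big_params, big_tokens = compute_optimal(4 * 5.88e23)

print(f"{base_params:.2e} params, {base_tokens:.2e} tokens")
print(f"{big_params:.2e} params, {big_tokens:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is what scaling the two in equal proportion means.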

TheCrazyAcademic@alien.top · 2 years ago

Well, people with conditions like megalencephaly, which is an enlarged brain, aren’t any smarter and somehow become even dumber because it messes with neuronal density, so we know brain size does not correlate to intelligence at all. Animals with bigger brains, meaning more neurons than humans, aren’t smarter, at least in theory; scientists could just be using bad benchmarks.

TikiTDO@alien.top · 2 years ago

People talk a lot about datasets being “rich” and “diverse,” but I wish they would also mention “not full of crap” in the same breath. Whether it be AI or humans, garbage-in, garbage-out still applies. You can have a rich and diverse dataset that teaches AI horrific, terrible ideas and practices.

We know with humans you get a very different effect based on the quality of the teacher and the teaching material, and we know that a bad teacher teaching bad lessons can be even worse than nothing at all. AI isn’t really that different.

shanereid1@alien.top · 2 years ago

Was at a big data industry conference yesterday, and one of the big takeaways was that data quality is going to be critical in the age of genAI.

Dankmemexplorer@alien.top · 2 years ago

Isn’t the biggest advantage of ViTs that they’re easier to distribute training for?

currentscurrents@alien.top · 2 years ago

The other advantage is multimodality: you can tokenize anything.
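
A minimal sketch of what “tokenize anything” can look like, assuming ViT-style non-overlapping patches for images and fixed-length frames for a 1D signal (the function names and sizes here are hypothetical, chosen for illustration). Both inputs end up as a 2D array of tokens that the same transformer could consume after a linear projection.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two patch-grid axes together before flattening each patch.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)

def frame_signal(signal, frame):
    """Split a 1D signal into non-overlapping frames, dropping any leftover samples."""
    usable = (len(signal) // frame) * frame
    return signal[:usable].reshape(-1, frame)

image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
audio = np.arange(1000, dtype=np.float32)

image_tokens = patchify(image, patch=16)    # shape (4, 768)
audio_tokens = frame_signal(audio, frame=160)  # shape (6, 160)
```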