I mean, if your hypothesis is that the human brain is the product of one billion years of evolution ‘searching’ for a configuration of neurons and synapses that is very efficient at sampling the environment, detecting any changes, acting accordingly to increase the likelihood of survival, and also communicating with other such configurations in order to devise and execute more complicated plans, then that…doesn’t bode very well for current AI architectures, does it? Their training sessions are incredibly weak by comparison, simply learning to predict and interpolate some sparse dataset that some human brains produced.
If by ‘there’s no fundamental reason we can’t jam together perceptrons this way’ you mean that we can always throw a bunch of them into an ever-changing virtual world, let them mutate and multiply, and after some long time fish out the survivors and have them work for us, sure, but we’re talking about A LOT of compute here. Our hope is that we can find some sort of shortcut, because if we truly have to do it like evolution did, it probably won’t happen this side of the millennium.
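For concreteness, here’s a minimal sketch of the mutate-and-select loop being described, in Python with completely made-up numbers for population size, mutation scale, and what counts as ‘fitness’; the thing to notice is how many evaluations even a toy version burns through.

```python
# A toy version of the "do it like evolution did" loop described above:
# random mutation plus selection in a world that keeps shifting under the
# population. All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def fitness(pop, env):
    # Stand-in for "survives in the virtual world": how closely each
    # genome tracks a target that drifts every generation.
    return -np.sum((pop - env) ** 2, axis=-1)

pop = rng.normal(size=(256, 32))        # 256 random "organisms"
env = rng.normal(size=32)               # the environment they must track

for generation in range(10_000):        # a *lot* of these, and real
    env += 0.01 * rng.normal(size=32)   # evolution needed vastly more
    scores = fitness(pop, env)
    parents = pop[np.argsort(scores)[-64:]]              # keep the survivors
    children = parents[rng.integers(0, 64, size=192)]    # let them multiply
    children += 0.05 * rng.normal(size=children.shape)   # ...and mutate
    pop = np.concatenate([parents, children])
```

Every capability has to be discovered by blind variation against whatever the environment happens to reward that generation, which is why the compute bill looks so hopeless compared with descending a gradient on data we already have.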
We don’t currently know exactly why gradient descent works to find powerful, generalizing minima
But, like, it does
The minima we can reliably find, in practice, don’t just interpolate the training data. I mean, they do that, but they find compressions which seem to actually represent knowledge, in the sense that they can identify true relationships between concepts which reliably hold outside the training distribution.
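If you want to poke at that claim yourself, here’s a tiny, hedged illustration; the function, architecture, and hyperparameters are all arbitrary choices of mine. Fit a simple relationship with plain gradient descent on one slice of inputs, then check whether the minimum it lands in still holds just outside the training range.

```python
# Arbitrary toy setup: fit y = 3x + sin(x) on x in [-2, 2] with plain
# gradient descent, then measure the error on x in [2, 3], which the
# model never saw during training. None of these choices are special.
import torch

torch.manual_seed(0)
true_fn = lambda x: 3.0 * x + torch.sin(x)

x_train = torch.linspace(-2.0, 2.0, 256).unsqueeze(1)   # training slice
x_outside = torch.linspace(2.0, 3.0, 64).unsqueeze(1)   # just past its edge

model = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(5_000):
    opt.zero_grad()
    loss = torch.mean((model(x_train) - true_fn(x_train)) ** 2)
    loss.backward()
    opt.step()

with torch.no_grad():
    mse_in = torch.mean((model(x_train) - true_fn(x_train)) ** 2).item()
    mse_out = torch.mean((model(x_outside) - true_fn(x_outside)) ** 2).item()
    print(f"inside the training range: {mse_in:.4f}, outside it: {mse_out:.4f}")
```

The claim above is precisely that the second number tends to stay small, i.e. that the minimum found is a compression of the relationship rather than a lookup table of the training points.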
I want to stress, “predict the next token” is what the models are trained to do, it is not what they learn to do. They learn deep representations and learn to deploy those representations in arbitrary contexts. They learn to predict tokens the same way a high-school student learns to fill in scantrons: the scantron is designed so that filling it out requires other more useful skills.
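To be concrete about what ‘trained to predict the next token’ literally means, here’s a stripped-down sketch of the objective. The embedding-plus-linear ‘model’ is just a stand-in I made up; a real Transformer changes what sits between the input ids and the logits, not the loss.

```python
# The training objective, literally: cross-entropy between the model's
# distribution over the vocabulary and whichever token actually came next.
import torch

torch.manual_seed(0)
vocab_size, d_model = 100, 32

embed = torch.nn.Embedding(vocab_size, d_model)   # stand-in for the network
head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))    # a fake token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict position t+1 from t

logits = head(embed(inputs))                      # (1, 15, vocab_size)
loss = torch.nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # the gradient step nudges the weights toward better guesses
```

Nothing in that loss says ‘memorize’ or ‘understand’; whatever internal representations make the guessing easier are what the model ends up with, which is the scantron point.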
It’s unclear if gradient descent will continue to work so unreasonably well as we try to push it farther and farther, but so long as the current paradigm holds I don’t see a huge difference between human inference ability and Transformer inference ability. Number of neurons* and amount of training data seem to be the things holding LLMs back. Humans beat LLMs on both counts, but in some ways LLMs seem to outperform biology in terms of what they can learn with a given quantity of neurons/data. As for the “billions of years” issue, that’s why we are using human-generated data, so they can catch up instead of starting from scratch.
By “number of neurons” I really mean something like “expressive power in some universally quantified sense.” Obviously you can’t directly compare perceptrons to biological neurons
I have to say, this is completely the *opposite* of what I have gotten by playing around with those models (GPT-4). At no point did I get the impression that I’m dealing with something that, had you taught it all humanity knew in the early 1800s about, say, electricity and magnetism, would have learned ‘deep representations’ of those concepts to a degree that would allow it to synthesize something truly novel, like the prediction of electromagnetic waves.
I mean, the model has already digested most of what’s written out there, so what’s the probability that something with the ability to ‘learn deep representations and learn to deploy those representations in arbitrary contexts’ would have made zero contributions, drawn zero new connections that had escaped humans, in something more serious than ‘write an Avengers movie in the style of Shakespeare’? I’m not talking about something as big as electromagnetism, but…something? Anything? It has ‘grokked’, as you say, pretty much the entirety of Stack Overflow, and yet I know of zero new programming techniques or design patterns or concepts it has come up with.