We need a name for the fallacy where people call highly nonlinear algorithms with billions of parameters “just statistics”, as if all they’re doing is linear regression.
ChatGPT isn’t AGI yet, but it is a huge leap in modeling natural language. The fact that there’s some statistics involved explains neither of those two points.
Let’s ask GPT4!
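You’re probably talking about the “fallacy of composition”. This logical fallacy occurs when it’s assumed that what is true for individual parts will also be true for the whole group or system. It’s a mistaken belief that specific attributes of individual components must necessarily be reflected in the larger structure or collection they are part of.
Here are some clearly flawed examples illustrating the fallacy of composition.
Building Strength: Believing that if a single brick can hold a certain amount of weight, a wall made of these bricks can hold the same amount of weight per brick. This ignores the structural integrity and distribution of weight in a wall.
Athletic Team: Assuming that a sports team will be unbeatable because it has a few star athletes. This ignores the importance of teamwork, strategy, and the fact that the performance of a team is not just the sum of its individual players’ skills.
These examples highlight the danger of oversimplifying complex systems or groups by extrapolating from individual components. They show that the interactions and dynamics within a system play a crucial role in determining the overall outcome, and these interactions can’t be understood by just looking at individual parts in isolation.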
How… did it map oversimplification to… holistic thinking??? Saying that it’s “just statistics” is wrong because “just statistics” covers some very complicated models in principle. They weren’t saying that simple subsystems are incapable of generating complex behavior.
God, why do people think these things are intelligent? I guess people fall for cons all the time…
I dunno. The “fallacy of composition” is just made up of 3 words, and there’s not a lot that you can explain with only three words.
Well, thanks to quantum mechanics, pretty much all of existence is probably “just statistics”.
Well, practically all interesting statistics are NONlinear regressions. Including ML. And your brain. And physics.
What a lot of people don’t understand is that linear regression can still handle non-linear relationships.
For a statistician, linear regression just means the model is linear in the coefficients; it doesn’t mean the relationship itself is a straight line.
That’s why linear models are still incredibly powerful and are used so widely across so many fields.
Yet still limited compared to even not-very-deep NNs. If you want to fit a parabola with linear regression, you pretty much have to add the quadratic term as a feature yourself.
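For example, a minimal numpy sketch (the data here is synthetic, purely for illustration): the fit is linear in the coefficients, but you hand it the x² column yourself.

    import numpy as np

    # Fit a parabola with ordinary least squares by manually adding the
    # quadratic term as a feature. The data is synthetic, for illustration.
    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=200)
    y = 2.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=200)

    X = np.column_stack([np.ones_like(x), x, x**2])  # the manual basis expansion
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coeffs)  # roughly [2.0, 0.5, -1.5]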
I think they’re widely used primarily because they’re widely taught in school.
Statisticians use nonlinear models all the time
It’s not a fallacy at all. It is just statistics, combined with some very useful inductive biases. The fallacy is trying to smuggle some extra magic into the description of what it is.
Actual AGI would be able to explain something that no human has understood before. We aren’t really close to that at all. Falling back on “___ may not be AGI yet, but…” is a lot like saying “rocket ships may not be FTL yet, but…”
Why would a human-level AGI need to be able to explain something that no human has understood before? That sounds more like ASI than AGI.
And the human brain is FTL then?
The fallacy is the part where you imply that humans have magic.
“An LLM is just doing statistics, therefore an LLM can’t match human intellect unless you add pixie dust somewhere.” Clearly the implication is that human intellect involves pixie dust somehow?
Or maybe, idk, humans are just the result of random evolutionary processes jamming together neurons into a configuration that happens to behave in a way that lets us build steam engines, and there’s no fundamental reason that jamming together perceptrons can’t accomplish the same thing?
I mean, if your hypothesis is that the human brain is the product of one billion years of evolution ‘searching’ for a configuration of neurons and synapses that is very efficient at sampling the environment, detecting changes, and acting accordingly to increase the likelihood of survival, and also at communicating with other such configurations in order to devise and execute more complicated plans, then that…doesn’t bode very well for current AI architectures, does it? Their training runs are incredibly weak by comparison: they simply learn to predict and interpolate a sparse dataset that some human brains produced.
If by ‘there’s no fundamental reason we can’t jam together perceptrons this way’ you mean that we can always throw a bunch of them into an ever-changing virtual world, let them mutate and multiply, and after some long time fish out the survivors and have them work for us, sure, but we’re talking about A LOT of compute here. Our hope is that we can find some sort of shortcut, because if we truly have to do it the way evolution did, it probably won’t happen this side of the millennium.
We don’t currently know exactly why gradient descent works to find powerful, generalizing minima
But, like, it does
The minima we can reliably find, in practice, don’t just interpolate the training data. I mean, they do that, but they find compressions which seem to actually represent knowledge, in the sense that they can identify true relationships between concepts which reliably hold outside the training distribution.
I want to stress, “predict the next token” is what the models are trained to do, it is not what they learn to do. They learn deep representations and learn to deploy those representations in arbitrary contexts. They learn to predict tokens the same way a high-school student learns to fill in scantrons: the scantron is designed so that filling it out requires other more useful skills.
It’s unclear whether gradient descent will continue to work so unreasonably well as we push it further and further, but so long as the current paradigm holds I don’t see a huge difference between human inference ability and Transformer inference ability. Number of neurons* and amount of training data seem to be the things holding LLMs back. Humans beat LLMs on both counts, but in some ways LLMs seem to outperform biology in terms of what they can learn with a given quantity of neurons/data. As for the “billions of years” issue, that’s why we are using human-generated data, so they can catch up instead of starting from scratch.
*By “number of neurons” I really mean something like “expressive power in some universally quantified sense.” Obviously you can’t directly compare perceptrons to biological neurons.
I have to say, this is completely the *opposite* of what I have gotten by playing around with those models (GPT4). At no point did I get the impression that I’m dealing with something that, had you taught it everything humanity knew in the early 1800s about, say, electricity and magnetism, would have learned ‘deep representations’ of those concepts to a degree that would allow it to synthesize something truly novel, like the prediction of electromagnetic waves.
I mean, the model has already digested most of what’s written out there. What’s the probability that something with the ability to ‘learn deep representations and learn to deploy those representations in arbitrary contexts’ would have made zero contributions, drawn zero new connections that had escaped humans, in something more serious than ‘write an Avengers movie in the style of Shakespeare’? I’m not talking about something as big as electromagnetism, but…something? Anything? It has ‘grokked’, as you say, pretty much the entirety of Stack Overflow, and yet I know of zero new programming techniques or design patterns or concepts it has come up with?
Real brains aren’t perceptrons. They don’t learn by back-propagation or by evaluating performance on a training set. They’re not mathematical models, or even mathematical functions in any reasonable sense. This is a “god of the gaps” scenario, wherein there are a lot of things we don’t understand about how real brains work, and people jump to fill in the gap with something they do understand (e.g. ML models).
Brains are absolutely mathematical functions in a very reasonable sense, and anyone who says otherwise is a crazy person
You think brains aren’t Turing machines? Like, you really think that? Every physical process ever studied, all of them, are Turing machines. Every one. Saying that brains aren’t Turing machines is no different from saying that humans have souls. You’re positing the existence of extra-special magic outside the realm of science just to justify your belief that humans are too special for science to ever comprehend.
(By “is a Turing machine” I mean that its behavior can be predicted to arbitrary accuracy by a Turing machine, and so observing its behavior is mathematically equivalent to running a Turing machine.)
LLMs might still lack something that the human brain has. An internal monologue, for example, which allows us to allocate more than a fixed amount of compute per output token.
You can just give an LLM an internal monologue. It’s called a scratchpad.
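Something like this rough sketch, say, where complete() is a hypothetical stand-in for whatever text-completion call you have, not any particular API: the model gets to spend as many intermediate tokens as it likes before committing to an answer.

    # Rough sketch of a scratchpad loop. complete() is a hypothetical stand-in
    # for a text-completion call, not a real API.
    def answer_with_scratchpad(question, complete, max_steps=8):
        scratchpad = ""
        for _ in range(max_steps):
            step = complete(
                f"Question: {question}\n"
                f"Reasoning so far:\n{scratchpad}\n"
                "Write the next reasoning step, or 'ANSWER: ...' when done:"
            )
            scratchpad += step + "\n"  # the 'internal monologue' accumulates here
            if step.strip().startswith("ANSWER:"):
                return step.strip()[len("ANSWER:"):].strip()
        return scratchpad  # give up and hand back the reasoning so far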
I’m not sure how this applies to the broader discussion, like honestly I can’t tell if we’re off-topic. But once you have LLMs you can implement basically everything humans can do. The only limitations I’m aware of that aren’t trivial from an engineering perspective are
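current LLMs mostly aren’t as smart as humans, like literally they have fewer neurons and can’t model systems as complexly
humans have more complex memory, with a mix of short-term and long-term and a fluid process of moving between them
humans can learn on-the-go, this is equivalent to “online training” and is probably related to long-term memory
humans are multimodal, it’s unclear to what extent this is a “limitation” vs just a pedantic nit-pick, I’ll let you decide how to account for it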
And the network still uses skills that it learned in a fixed-computation-per-token regime.
Sure, future versions will lift many existing limitations, but I was talking about current LLMs.
ChatGPT predicts the most probable next token, or the next token that yields the highest probability of a thumbs up, depending on whether you’re talking about the semi-supervised learning or the reinforcement learning stage of training. That is the conceptual underpinning of how the parameter updates are calculated. It only achieves the ability to communicate because it was trained on text that successfully communicates.
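In other words, the pretraining objective is just a cross-entropy loss over the next token at each position; roughly, as a toy numpy sketch (not the actual training code):

    import numpy as np

    # Toy sketch of the next-token objective: the loss is the negative
    # log-probability the model assigned to the token that actually came next.
    def next_token_loss(logits, next_tokens):
        # logits: (seq_len, vocab_size); next_tokens: (seq_len,) integer ids
        logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(len(next_tokens)), next_tokens].mean()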
I think it’s a vacuous truth.
The vacuous truth is saying that AI is statistical. It certainly is, but it’s also much more.
The fallacy part is taking that fact and claiming that, because an AI algorithm is “just statistics”, it therefore cannot exhibit “true” intelligence and is somehow only faking or mimicking intelligence.
We need a name for the illness that is people throwing some shoddy homebrew "L"LM at their misshapen prompts - if you can call them that - and then concluding that it’s just a bad imitation of speech because they keep asking their models to produce page after page of Sonic fanfic. Or thinking everything is equally hallucination-prone.
Really, the moronic takes about all this are out of this world, never mind how people all of a sudden have a very clear idea of what intelligence, among the most ambiguous and ill-defined notions we debate, entails. Except they’re struggling to put this knowledge into words; it’s more about the feel of it all, y’know.
Embeddings are statistics. They evolved from linear statistical models, but they are now non-linear statistical models. Bengio et al. (2003) explains this.
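For what it’s worth, a very rough numpy sketch in the spirit of that model (shapes and initialization are made up, just to show where the non-linearity sits between the embeddings and the next-word scores):

    import numpy as np

    # Very rough sketch in the spirit of Bengio et al. (2003): context words map
    # to learned embedding vectors, which feed a non-linear layer that scores
    # every word in the vocabulary as the possible next word.
    vocab, dim, context, hidden = 1000, 32, 3, 64
    rng = np.random.default_rng(0)
    C = rng.normal(size=(vocab, dim)) * 0.01             # embedding table
    W = rng.normal(size=(context * dim, hidden)) * 0.01  # hidden layer weights
    U = rng.normal(size=(hidden, vocab)) * 0.01          # output layer weights

    def next_word_logits(context_ids):
        x = C[context_ids].reshape(-1)  # look up and concatenate the embeddings
        h = np.tanh(x @ W)              # the non-linear part
        return h @ U                    # one score per vocabulary word

    print(next_word_logits([1, 2, 3]).shape)  # (1000,)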