  • We don’t currently know exactly why gradient descent works to find powerful, generalizing minima

    But, like, it does

    The minima we can reliably find, in practice, don’t just interpolate the training data. I mean, they do that, but they find compressions which seem to actually represent knowledge, in the sense that they can identify true relationships between concepts which reliably hold outside the training distribution.

    I want to stress: “predict the next token” is what the models are trained to do; it is not what they learn to do. They learn deep representations and learn to deploy those representations in arbitrary contexts. They learn to predict tokens the same way a high-school student learns to fill in scantrons: the scantron is designed so that filling it out requires other, more useful skills.
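
    For concreteness, “trained to predict the next token” just means the training objective is next-token cross-entropy. Here’s a toy version of that objective fit by gradient descent, using a bigram logit table instead of a Transformer; the vocabulary size, data, and learning rate are made up for illustration:

    ```python
    import numpy as np

    # Toy "predict the next token" training: a bigram model (one row of logits
    # per current token) fit with gradient descent on next-token cross-entropy.
    rng = np.random.default_rng(0)
    vocab_size = 8
    tokens = rng.integers(0, vocab_size, size=1000)    # made-up training stream
    logits_table = np.zeros((vocab_size, vocab_size))  # the "model parameters"

    lr = 0.5
    for step in range(200):
        x, y = tokens[:-1], tokens[1:]                 # current token -> next token
        logits = logits_table[x]                       # (N, vocab) predicted logits
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)      # softmax over the vocabulary
        loss = -np.log(probs[np.arange(len(y)), y]).mean()   # next-token cross-entropy

        grad = probs.copy()
        grad[np.arange(len(y)), y] -= 1                # d(loss)/d(logits) = softmax - onehot
        grad /= len(y)
        np.add.at(logits_table, x, -lr * grad)         # gradient descent step

    print(f"final next-token loss: {loss:.3f}")
    ```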

    It’s unclear if gradient descent will continue to work so unreasonably well as we try to push it farther and farther, but so long as the current paradigm holds, I don’t see a huge difference between human inference ability and Transformer inference ability. Number of neurons* and amount of training data seem to be the things holding LLMs back. Humans beat LLMs on both counts, but in some ways LLMs seem to outperform biology in terms of what they can learn with a given quantity of neurons/data. As for the “billions of years” issue, that’s why we are using human-generated data: so they can catch up instead of starting from scratch.

    * By “number of neurons” I really mean something like “expressive power in some universally quantified sense.” Obviously you can’t directly compare perceptrons to biological neurons.


  • These answers seem weird to me. Am I misunderstanding? Here’s the answer that seems obvious to me:

    You need two different matrices because you need an attention coefficient for every single pair of vectors.

    If there are n tokens, then the nth token needs n-1 different attention coefficients (one for each earlier token it attends to). The (n-1)th token needs n-2 coefficients, and so on, down to the 2nd token, which needs only one, and the 1st, which needs zero (it has nothing to attend to).

    That’s ~n^2 coefficients in total. If you compute key and query vectors instead, you only need 2n vectors (one key and one query for each of the n tokens). If the key/query vectors are d-dimensional, that’s 2dn numbers, which is smaller than n^2 whenever the context length n is more than twice the key/query dimension d.

    So using separate vectors is more efficient and more scalable.

    The other answers on this thread seem different, which is surprising to me since this answer feels very straightforward. If I’m missing something, I’d love an explanation.
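
    Here’s the counting argument as a rough sketch in code, assuming plain single-head causal dot-product attention; the sizes and random weights are made up for illustration:

    ```python
    import numpy as np

    n, d_model, d_k = 1024, 512, 64      # context length, model width, key/query width (made up)

    X = np.random.randn(n, d_model)      # one d_model-dim vector per token
    W_Q = np.random.randn(d_model, d_k)  # query projection matrix
    W_K = np.random.randn(d_model, d_k)  # key projection matrix

    Q = X @ W_Q                          # n x d_k query vectors -> n*d_k numbers
    K = X @ W_K                          # n x d_k key vectors   -> n*d_k numbers

    # The full coefficient matrix (~n^2 scores) is only materialized here,
    # derived from just 2*d_k*n numbers (Q and K).
    scores = Q @ K.T / np.sqrt(d_k)      # n x n attention logits

    # Causal mask: token i only attends to tokens j <= i
    # (the count above excluded self-attention, which doesn't change the ~n^2 scaling).
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    coeffs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    coeffs /= coeffs.sum(axis=-1, keepdims=True)   # softmax over each row

    print(coeffs.shape)      # (1024, 1024): ~n^2 coefficients
    print(Q.size + K.size)   # 2*d_k*n = 131072, far fewer numbers than n^2 = 1048576
    ```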



  • You can just give an LLM an internal monologue. It’s called a scratchpad.

    I’m not sure how this applies to the broader discussion; honestly, I can’t tell if we’re off-topic. But once you have LLMs you can implement basically everything humans can do. The only limitations I’m aware of that aren’t trivial from an engineering perspective are:

    1. current LLMs mostly aren’t as smart as humans; they literally have fewer neurons and can’t model systems with as much complexity
    2. humans have more complex memory, with a mix of short-term and long-term memory and a fluid process for moving between them
    3. humans can learn on the go; this is equivalent to “online training” and is probably related to long-term memory
    4. humans are multimodal; it’s unclear to what extent this is a “limitation” vs. just a pedantic nitpick, so I’ll let you decide how to account for it
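
    To make the scratchpad idea at the top concrete, here’s a minimal sketch in Python. The generate(prompt) stub and the prompt wording are placeholders, not any particular library’s API; swap in whatever model call you actually use:

    ```python
    def generate(prompt: str) -> str:
        # Stand-in for a real completion call; replace with whatever API you use.
        # This stub just terminates the loop immediately.
        return "FINAL: (stub answer)"

    def answer_with_scratchpad(question: str, max_steps: int = 5) -> str:
        scratchpad = ""
        for _ in range(max_steps):
            prompt = (
                "Question: " + question + "\n"
                "Scratchpad (private reasoning so far):\n" + scratchpad + "\n"
                "Write the next reasoning step, or 'FINAL:' followed by the answer.\n"
            )
            step = generate(prompt)
            if step.strip().startswith("FINAL:"):
                return step.strip()[len("FINAL:"):].strip()
            scratchpad += step + "\n"   # the "internal monologue" accumulates here
        # Out of steps: force an answer using whatever reasoning has accumulated.
        return generate(
            "Question: " + question + "\nScratchpad:\n" + scratchpad + "\nGive the final answer."
        )

    print(answer_with_scratchpad("What is 17 * 24?"))
    ```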