- Tokenization Techniques: Most people default to the Llama 2 BPE tokenizer or another common off-the-shelf tokenizer, but there's a lot of room to experiment with tokenizers tailored to specific kinds of data. Vocabulary size is an important hyperparameter for large language models: for a dataset with a restricted vocabulary, you could train a much smaller tokenizer and vocabulary, then train a model on top of it. That might let us train smaller models that still perform well on limited data. I'd love pointers to any research papers on this.
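To make the vocabulary-size knob concrete, here's a minimal from-scratch BPE trainer sketch in pure Python (not the Llama 2 tokenizer or any real library; the corpus and target size are invented for illustration). It starts from characters and merges the most frequent adjacent pair until the vocabulary reaches the requested size:

```python
from collections import Counter

def merge_pair(word, a, b):
    """Replace every adjacent (a, b) in a token list with the merged token a+b."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(corpus, vocab_size):
    """Toy BPE trainer: grow the vocab one merge at a time up to vocab_size."""
    words = [list(w) for w in corpus.split()]
    vocab = set(ch for w in words for ch in w)  # base vocab = characters
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:  # nothing left to merge
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        words = [merge_pair(w, a, b) for w in words]
    return vocab, merges

vocab, merges = train_bpe("low low lower lowest", vocab_size=10)
```

The point of the sketch is just that `vocab_size` is a free parameter: on a narrow-domain corpus you can stop merging far earlier than the ~32k entries typical of general-purpose tokenizers.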
- Sampling Mechanisms: There's plenty of discussion about models making things up, but less about how that might be connected to the way we pick the next token during generation. We usually treat the model's output as a probability distribution and sample the next token at random from it. That doesn't always make sense, especially for prompts with one clear answer. For example, if the prompt is “The capital of Slovakia is”, random sampling can return a wrong continuation even though the model assigns the highest probability to “Bratislava”. This random selection could itself contribute to the model making things up. I wonder whether we could train another model to decide how to pick the next token, or whether there are better ways to do this sampling.
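To make the concern concrete, here's a toy sketch (the token probabilities are invented, not from any real model) contrasting greedy decoding with temperature sampling over a model's output distribution. Greedy decoding always returns the top token; plain sampling will occasionally pick a lower-probability continuation:

```python
import random

def sample_next(probs, temperature=1.0, greedy=False):
    """probs: dict mapping token -> probability (the model's softmax output).
    greedy=True always takes the argmax; otherwise sample, with
    temperature < 1 sharpening the distribution toward the top token."""
    if greedy:
        return max(probs, key=probs.get)
    weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for t, w in weights.items():
        acc += w
        if acc >= r:
            return t
    return t  # floating-point fallback: last token

# Hypothetical distribution for "The capital of Slovakia is"
probs = {"Bratislava": 0.85, "a": 0.06, "located": 0.05, "not": 0.04}
```

Even with the correct token at 85%, plain sampling returns something else roughly 15% of the time, which is the commenter's point: for factual prompts, the sampling scheme, not the model, can produce the wrong answer.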
- Softmax Alternatives in Neural Networks: I've worked on designing processors for neural networks, and the softmax function is tricky to implement in hardware. However, I've had good results using log(exp(x)+1) (softplus) instead: it's cheaper and simpler to implement in both hardware and software. I've tried this with smaller GPT models, and the results looked just as good as with softmax.
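The commenter doesn't say exactly how they turned softplus values into a distribution, but one plausible reading is softplus followed by divide-by-sum. A sketch of that next to a standard softmax (note softplus is monotonic, so the argmax is unchanged; this naive version is not numerically stable for large x):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softplus_norm(xs):
    """Assumed scheme: log(exp(x)+1) per element, then normalize to sum to 1.
    Avoids the exp/sum-of-exps coupling that makes softmax awkward in hardware."""
    sp = [math.log1p(math.exp(x)) for x in xs]
    s = sum(sp)
    return [v / s for v in sp]
```

Both produce a valid distribution, but they weight the logits very differently: softmax is exponential in the logit gaps, while normalized softplus is roughly linear for large positive logits, so it yields much flatter distributions.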
On 2, the user's intent is unclear from what you've given. Do they want the answer, or is it part of some other narrative? There are a ton of valid continuations, like “located”, “larger”, “smaller”, “the”, “not”… How would different sampling be better than what's currently available?
I’d add a few others to this list, but I largely agree with the premise that we focus too much on attention. We lavish praise on the Transformer model, but there is so much extra machinery that goes into making it work even a little bit, and now papers are coming out claiming ConvNets scale at the same rate, and the RetNet paper claims you can swap out attention altogether.
Obv. the issue is “emergence” (terrible term, but I mean non-linear training performance) and the sheer cost of testing permutations of LLM architecture at scale. To what extent has the ML community become the victim of sunk cost?
As another commenter has pointed out, 2 is an active area of research; it’s much easier to experiment with sampling in decoding because it generally involves a fixed model.
For your example, I believe nucleus sampling would solve that, because the probability of the correct token should be very high (although I've only read cursory summaries, not the paper or implementation in depth).
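For reference, nucleus (top-p) sampling can be sketched like this in pure Python (toy probabilities, simplified version): keep the smallest set of top tokens whose cumulative probability reaches p, renormalize, and sample only within that set.

```python
import random

def nucleus_sample(probs, p=0.9):
    """probs: dict token -> probability. Sample from the smallest set of
    highest-probability tokens whose cumulative mass is >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)  # renormalize within the nucleus
    r, acc = random.random() * total, 0.0
    for tok, pr in nucleus:
        acc += pr
        if acc >= r:
            return tok
    return nucleus[-1][0]
```

In the "capital of Slovakia" case, if the model puts, say, 0.92 on “Bratislava”, then with p=0.9 the nucleus contains only that token and the tail of wrong continuations is cut off entirely.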
What are the current areas of research with regards to tokenization?
> But this doesn’t always make sense, especially for sentences that should have a clear answer. For example, if the sentence starts with “The capital of Slovakia is”
A city? An interesting place? A place with amazing restaurants and culture? Language is extremely flexible and modular.
To 1: I remember a recent paper saying they got better results without tokenisation, at least in one area. Don’t have the link right now though.
Could someone explain how/why the log(exp(x)+1) works?
On 2 totally agree!
On 3, how is log(exp(x)+1) an alternative to softmax? The outputs are not class probabilities. But I agree in general: I have many, many problems with the use of softmax and hope there is a better alternative.