- Tokenization Techniques: Most people default to the Llama 2 BPE tokenizer or another common off-the-shelf tokenizer, but there's a lot of room to experiment with tokenizers tailored to specific kinds of data. Vocabulary size is an important hyperparameter for large language models: for a dataset with a restricted vocabulary, you could train a much smaller tokenizer and vocabulary, then train a model on top of it. That might let us train smaller models that still perform well on limited data. I'd love pointers to any research papers on this.
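To make the vocabulary-size knob concrete, here's a minimal from-scratch BPE trainer sketch in pure Python (not the Llama 2 tokenizer or any real library; the corpus and target size are invented for illustration). It starts from characters and merges the most frequent adjacent pair until the vocabulary reaches the requested size:

```python
from collections import Counter

def merge_pair(word, a, b):
    """Replace every adjacent (a, b) in a token list with the merged token a+b."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(corpus, vocab_size):
    """Toy BPE trainer: grow the vocab one merge at a time up to vocab_size."""
    words = [list(w) for w in corpus.split()]
    vocab = set(ch for w in words for ch in w)  # base vocab = characters
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:  # nothing left to merge
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        words = [merge_pair(w, a, b) for w in words]
    return vocab, merges

vocab, merges = train_bpe("low low lower lowest", vocab_size=10)
```

The point of the sketch is just that `vocab_size` is a free parameter: on a narrow-domain corpus you can stop merging far earlier than the ~32k entries typical of general-purpose tokenizers.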
- Sampling Mechanisms: There's plenty of discussion about models making things up, but less about how that might be connected to the way we pick the next token during generation. We usually treat the model's output as a probability distribution and sample the next token at random from it. That doesn't always make sense, especially for prompts with one clear answer. For example, if the prompt is “The capital of Slovakia is”, random sampling can return a wrong continuation even though the model assigns the highest probability to “Bratislava”. This random selection could itself contribute to the model making things up. I wonder whether we could train another model to decide how to pick the next token, or whether there are better ways to do this sampling.
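To make the concern concrete, here's a toy sketch (the token probabilities are invented, not from any real model) contrasting greedy decoding with temperature sampling over a model's output distribution. Greedy decoding always returns the top token; plain sampling will occasionally pick a lower-probability continuation:

```python
import random

def sample_next(probs, temperature=1.0, greedy=False):
    """probs: dict mapping token -> probability (the model's softmax output).
    greedy=True always takes the argmax; otherwise sample, with
    temperature < 1 sharpening the distribution toward the top token."""
    if greedy:
        return max(probs, key=probs.get)
    weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for t, w in weights.items():
        acc += w
        if acc >= r:
            return t
    return t  # floating-point fallback: last token

# Hypothetical distribution for "The capital of Slovakia is"
probs = {"Bratislava": 0.85, "a": 0.06, "located": 0.05, "not": 0.04}
```

Even with the correct token at 85%, plain sampling returns something else roughly 15% of the time, which is the commenter's point: for factual prompts, the sampling scheme, not the model, can produce the wrong answer.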
- Softmax Alternatives in Neural Networks: I've worked on designing processors for neural networks, and the softmax function is tricky to implement in hardware. However, I've had good results using log(exp(x)+1) (softplus) instead: it's cheaper and simpler to implement in both hardware and software. I've tried this with smaller GPT models, and the results looked just as good as with softmax.
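The commenter doesn't say exactly how they turned softplus values into a distribution, but one plausible reading is softplus followed by divide-by-sum. A sketch of that next to a standard softmax (note softplus is monotonic, so the argmax is unchanged; this naive version is not numerically stable for large x):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softplus_norm(xs):
    """Assumed scheme: log(exp(x)+1) per element, then normalize to sum to 1.
    Avoids the exp/sum-of-exps coupling that makes softmax awkward in hardware."""
    sp = [math.log1p(math.exp(x)) for x in xs]
    s = sum(sp)
    return [v / s for v in sp]
```

Both produce a valid distribution, but they weight the logits very differently: softmax is exponential in the logit gaps, while normalized softplus is roughly linear for large positive logits, so it yields much flatter distributions.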
On 2, the user's intent is unclear from what you've given. Do they want the answer, or is it part of some other narrative? There are a ton of valid continuations, like “located”, “larger”, “smaller”, “the”, “not”… How would different sampling be better than what's currently available?
I’d add a few others to this list, but I largely agree with the premise that we focus too much on attention. We lavish praise on the Transformer model, but there is so much extra machinery that goes into making it work even a little bit, and now papers are coming out claiming ConvNets scale at the same rate, and the RetNet paper claims you can swap out attention altogether.
Obv. the issue is “emergence” (terrible term, but I mean non-linear training performance) and the sheer cost of testing permutations of LLM architecture at scale. To what extent has the ML community become the victim of sunk cost?
As another commenter has pointed out, 2 is an active area of research; it’s much easier to experiment with sampling in decoding because it generally involves a fixed model.
For your example, I believe nucleus sampling would solve that, because the probability of the correct token should be very high (although I've only read cursory summaries, not the paper or implementation in depth).
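For reference, nucleus (top-p) sampling can be sketched like this in pure Python (toy probabilities, simplified version): keep the smallest set of top tokens whose cumulative probability reaches p, renormalize, and sample only within that set.

```python
import random

def nucleus_sample(probs, p=0.9):
    """probs: dict token -> probability. Sample from the smallest set of
    highest-probability tokens whose cumulative mass is >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)  # renormalize within the nucleus
    r, acc = random.random() * total, 0.0
    for tok, pr in nucleus:
        acc += pr
        if acc >= r:
            return tok
    return nucleus[-1][0]
```

In the "capital of Slovakia" case, if the model puts, say, 0.92 on “Bratislava”, then with p=0.9 the nucleus contains only that token and the tail of wrong continuations is cut off entirely.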
What are the current areas of research with regards to tokenization?
> But this doesn’t always make sense, especially for sentences that should have a clear answer. For example, if the sentence starts with “The capital of Slovakia is”
A city? An interesting place? A place with amazing restaurants and culture? Language is extremely flexible and modular.
To 1: I remember a recent paper saying they got better results without tokenisation, at least in one area. Don’t have the link right now though.
Could someone explain how/why the log(exp(x)+1) works?
On 2 totally agree!
On 3, how is log(exp(x)+1) an alternative to softmax? The outputs are not class probabilities. But I agree in general: I have many, many problems with the use of softmax and hope there is a better alternative.