ExaminationNo8522@alien.top to Machine Learning@academy.garden · English · 1 year ago

[D] Three things I think should get more attention in large language models

1. Tokenization Techniques: Many people use the default BPE tokenizer for Llama 2 or another common tokenizer. I think there is room for a lot of experimentation with different kinds of tokenizers, especially ones designed for a particular type of data. Vocabulary size is a really important hyperparameter for large language models. For a dataset that only uses a restricted set of words, you could try a much smaller vocabulary and tokenizer, and then train a model on that. This might let us train smaller models that still work well on smaller amounts of data. I’d love to read any research papers about this.
2. Sampling Mechanisms: There’s a lot of discussion about models making things up, but not many people talk about how this could be connected to the way we pick the next token when generating text. Most of the time, we treat the model’s output as a probability distribution and randomly sample the next token from it. But this doesn’t always make sense, especially for sentences that should have one clear answer. For example, if the sentence starts with “The capital of Slovakia is”, random sampling can still return a wrong answer, even though the model assigns the highest probability to “Bratislava”. Picking tokens randomly in this way could itself lead to the model making things up. I wonder if we could train another model to decide how to pick the next token, or if there are better ways to do this sampling.
3. Softmax Alternatives in Neural Networks: I’ve worked on designing processors for neural networks, and I’ve found that the softmax function is tricky to implement in hardware. However, I’ve had good results using the log(exp(x)+1) function (commonly called softplus) instead. It’s cheaper and easier to implement in both hardware and software. I’ve tried this with smaller GPT models, and the results looked just as good as with softmax.
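
To make the sampling point concrete, here is a minimal sketch in plain Python. The candidate tokens and their logits are made-up numbers chosen for illustration, not output from any real model, and the helper names (`greedy_pick`, `sample_pick`) are hypothetical. It only shows that random sampling keeps returning lower-ranked tokens some of the time, while greedy decoding or a lower temperature does so less often.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, logits):
    """Always return the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]

def sample_pick(tokens, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-token candidates for "The capital of Slovakia is"
# (illustrative logits only, not taken from a real model).
tokens = ["Bratislava", "Prague", "Vienna", "Budapest"]
logits = [3.0, 1.5, 1.0, 0.5]

probs = softmax(logits)            # temperature 1.0
sharp = softmax(logits, 0.5)       # lower temperature

# Estimate how often plain sampling picks something other than the top token.
rng = random.Random(0)
draws = [sample_pick(tokens, logits, 1.0, rng) for _ in range(1000)]
wrong_rate = sum(d != "Bratislava" for d in draws) / len(draws)

print("greedy:", greedy_pick(tokens, logits))
print("P(top token), T=1.0:", round(probs[0], 3))
print("P(top token), T=0.5:", round(sharp[0], 3))
print("sampled wrong-answer rate:", wrong_rate)
```

With these assumed logits the top token has a probability of about 0.69 at temperature 1.0, so close to a third of sampled completions would be wrong even though the preferred answer is clear. Greedy decoding or a lower temperature are the simplest mitigations; whether they are the right ones for open-ended text is a separate question.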

Dangerous-Flan-6581@alien.top
1 year ago
On 2, I totally agree!
On 3, how is log(exp(x)+1) an alternative to softmax? Its outputs are not class probabilities. But I agree in general: I have many, many problems with the use of softmax and hope there is a better alternative.
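
A small sketch of the objection, in plain Python with made-up scores: softmax maps a whole vector to values that sum to 1, while log(exp(x)+1) is applied to each element separately and gives positive values with no fixed sum. The final renormalization step is purely an assumption about how the two could be reconciled; the original post does not say how the function was used.

```python
import math

def softmax(xs):
    """Vector in, probabilities out (non-negative, summing to 1)."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softplus(x):
    """Numerically stable log(exp(x) + 1), applied to a single value."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Illustrative scores only.
scores = [2.0, 0.0, -1.0]

sm = softmax(scores)
sp = [softplus(x) for x in scores]

# ASSUMPTION: one possible way to get probability-like values from softplus
# is to divide by the sum. This is a guess, not the original poster's method.
sp_normalized = [v / sum(sp) for v in sp]

print("softmax:            ", [round(v, 3) for v in sm], "sum =", round(sum(sm), 3))
print("softplus:           ", [round(v, 3) for v in sp], "sum =", round(sum(sp), 3))
print("softplus normalized:", [round(v, 3) for v in sp_normalized])
```

The raw softplus outputs here are all positive but sum to roughly 3.13, so they cannot be read as class probabilities without some extra normalization.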
