
  • You could also use this to measure different models against each other, right? And more generally, use it as a model benchmark:

    1. Get dataset of text.
    2. Tokenize dataset.
    3. Estimate the true (empirical) next-token probabilities straight from the dataset.
    4. Train model number 1 on tokenized dataset.
    5. Measure the KL divergence of the model's predictions from the true probabilities.
    6. Repeat steps 4 and 5 for model number 2.
    7. Compare KL divergence of model 1 to model 2.
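    A minimal sketch of steps 1–7, assuming "true probabilities" means bigram-level empirical next-token frequencies, and standing in a hypothetical fixed probability table for each trained model:

    ```python
    import math
    from collections import Counter, defaultdict

    def empirical_next_token_probs(tokens):
        """Estimate P(next | prev) by counting bigrams in the corpus."""
        counts = defaultdict(Counter)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
        return {
            prev: {t: c / sum(ctr.values()) for t, c in ctr.items()}
            for prev, ctr in counts.items()
        }

    def avg_kl(true_probs, model_probs):
        """Mean KL(true || model) across contexts, in nats.

        Assumes model_probs assigns nonzero probability to every token
        the corpus actually produced after each context.
        """
        kls = []
        for prev, dist in true_probs.items():
            kl = sum(p * math.log(p / model_probs[prev][t]) for t, p in dist.items())
            kls.append(kl)
        return sum(kls) / len(kls)

    tokens = ["a", "b", "a", "b", "a", "c"]          # toy tokenized dataset
    true_p = empirical_next_token_probs(tokens)       # step 3
    # Hypothetical "model 1": predicts uniformly over the 3-token vocab
    model1 = {prev: {t: 1 / 3 for t in "abc"} for prev in true_p}
    score1 = avg_kl(true_p, model1)                   # step 5; lower is better
    ```

    Comparing `score1` against the same number for a second model is step 7; a model that exactly reproduces the empirical distribution scores 0.
    
    
    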

    -Separate idea- Isn't getting the true probabilities useful anyway? Because then the training process could be:

    1. Get dataset.
    2. Tokenize.
    3. Get true probabilities.
    4. Train on probabilities instead of directly on the tokens.
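    Step 4 here amounts to cross-entropy against a full target distribution (soft labels) rather than a one-hot token. A minimal sketch, assuming a softmax model; the logit values are made up for illustration:

    ```python
    import math

    def softmax(logits):
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def soft_cross_entropy(logits, target_probs):
        """Cross-entropy against a target *distribution*, not a one-hot label."""
        probs = softmax(logits)
        return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

    # One context where the corpus says the next token is a 50/50 split:
    logits = [2.0, 0.0]
    loss = soft_cross_entropy(logits, [0.5, 0.5])
    # For a softmax model the gradient w.r.t. the logits is probs - targets,
    # so training pushes the model's distribution toward the empirical one:
    grads = [p - t for p, t in zip(softmax(logits), [0.5, 0.5])]
    ```

    The loss bottoms out when the model's softmax output equals the target distribution, which is exactly the "train on probabilities" behavior described above.
    
    
    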

    Like, instead of training twice on the same sequence (sequence -> probabilities):

    1. sequence1 -> [1, 0]
    2. sequence1 -> [0, 1]

    you train it once with:

    1. sequence1 -> [0.5, 0.5]

    So you'd be training on fewer examples, which would reduce training costs and whatnot.
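    The collapsing step in that example can be sketched directly; `collapse_to_soft_targets` and the `tok_a`/`tok_b` names are hypothetical, just to mirror the [1, 0] / [0, 1] / [0.5, 0.5] example above:

    ```python
    from collections import Counter, defaultdict

    def collapse_to_soft_targets(pairs, vocab):
        """Merge repeated (sequence, next_token) examples into one
        (sequence, probability_vector) example per distinct sequence."""
        counts = defaultdict(Counter)
        for seq, nxt in pairs:
            counts[seq][nxt] += 1
        out = {}
        for seq, ctr in counts.items():
            total = sum(ctr.values())
            out[seq] = [ctr[t] / total for t in vocab]
        return out

    # sequence1 appears twice with different next tokens, as in the example:
    pairs = [("sequence1", "tok_a"), ("sequence1", "tok_b")]
    targets = collapse_to_soft_targets(pairs, vocab=["tok_a", "tok_b"])
    # targets["sequence1"] is [0.5, 0.5]: two training examples become one
    ```

    One caveat worth noting: the dataset shrinks only where contexts actually repeat, so the savings depend on how much duplication the corpus has at the chosen context length.
    
    
    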