Thanks for writing this, it’s an interesting idea and very relevant to the problem I’m trying to solve too: creative writing, which definitely suffers from repetition. I’m very interested in trying out what you proposed once it’s available :)
One technical question about this approach: wouldn’t it change the original distribution of the training data / output, especially in cases where there is one obviously good next token to choose from? I can see the value when multiple next tokens are all considered great with close probabilities, but I’m curious how it would behave otherwise in terms of consistency and correctness.
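To make the question concrete, here is a minimal sketch (using a generic repetition-style penalty as a stand-in, not necessarily what the post proposes) of how a peaked distribution with one obviously good token reacts to a penalty compared with a flat distribution of near-equivalent candidates:

```python
# Minimal sketch: compare the effect of a hypothetical penalty on the top
# token when the next-token distribution is peaked vs. nearly flat.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def penalize(logits, token_id, penalty=1.5):
    # Hypothetical repetition-style penalty: shrink a positive logit
    # (or amplify a negative one) for the penalized token.
    out = logits.copy()
    out[token_id] = out[token_id] / penalty if out[token_id] > 0 else out[token_id] * penalty
    return out

peaked = np.array([8.0, 2.0, 1.5, 1.0])   # one obviously correct next token
flat   = np.array([2.2, 2.1, 2.0, 1.9])   # several near-equivalent candidates

for name, logits in [("peaked", peaked), ("flat", flat)]:
    before = softmax(logits)[0]
    after = softmax(penalize(logits, 0))[0]
    print(f"{name}: top-token prob {before:.3f} -> {after:.3f}")
```

In the peaked case the top token stays dominant after the penalty (roughly 0.99 → 0.93 here), while in the flat case it can drop below the other candidates, which is where I’d expect the behavior to differ most.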
That sounds like CPU speed. What do you see from `watch -d -n 0.1 nvidia-smi` while you’re running inference?
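If GPU utilization stays near 0% there, a quick way to confirm where the model actually lives (assuming a PyTorch-style setup; `model` stands for whatever object you loaded):

```python
# Sanity check that CUDA is visible and the weights are on the GPU.
import torch

print(torch.cuda.is_available())            # should print True
# print(next(model.parameters()).device)    # should print cuda:0, not cpu
```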