Let’s say, hypothetically, that I’m GPU poor and a simpleton who has never gone beyond oobabooga-ing and koboldcpp-ing, and I want to run models larger than Mistral at more than 2 tokens per second. Speculative decoding is my only option, right? What’s the easiest way to do this? Do any UIs support it out of the box?
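For anyone landing here with the same question, here is a minimal sketch of one way to try speculative decoding, using Hugging Face transformers' assisted generation (`assistant_model`). The model names below are only placeholders, not a recommendation, and the draft model generally needs to share the main model's tokenizer/vocabulary for this to work.

```python
# Sketch of speculative (assisted) decoding with transformers.
# Assumption: both checkpoints fit in your available VRAM/RAM and
# share a tokenizer; swap in whatever large/draft pair you actually use.
from transformers import AutoModelForCausalLM, AutoTokenizer

main_model_id = "meta-llama/Llama-2-7b-chat-hf"          # big, slow target model (placeholder)
draft_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"    # small, fast draft model (placeholder)

tokenizer = AutoTokenizer.from_pretrained(main_model_id)
main_model = AutoModelForCausalLM.from_pretrained(main_model_id, device_map="auto")
draft_model = AutoModelForCausalLM.from_pretrained(draft_model_id, device_map="auto")

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.",
    return_tensors="pt",
).to(main_model.device)

# The draft model proposes several tokens per step; the main model verifies
# them in one forward pass, so you get a speedup without changing the output
# distribution of the big model.
outputs = main_model.generate(
    **inputs,
    assistant_model=draft_model,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Outside of Python, llama.cpp also ships speculative decoding (a draft model is passed alongside the main one), so a GGUF-based setup is another route if that is closer to your current workflow.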