Measly 12 t/s then. I mean, that's great for hosting your own LLMs if you're a business - awesome cost savings, since you only need an 8-pack of those and you can serve roughly 20-80k concurrent users, given that most of the time they're reading the replies rather than immediately sending new context. For people like us who don't share the GPU, it doesn't make much sense outside of rare cases. Do you by any chance know how I could set up a kobold-like completion API that does a batch size of 4/8? I want to create a synthetic dataset locally, based on certain provided context. I was doing it with a batch size of 1 so far, but I have enough spare VRAM now that I should be able to up my batch size. Is it possible with AutoAWQ and oobabooga webui? Does it quickly run into a CPU bottleneck?
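Not sure about exposing it through the ooba API, but if you're willing to drop down to the transformers library directly, batched generation is pretty simple. A minimal sketch, assuming an AWQ-quantized model that transformers can load (the model id and prompts are placeholders):

```python
# Minimal sketch of batch-of-4 generation with transformers.
# Assumes a single GPU and an AWQ model loadable via from_pretrained;
# "some-org/some-model-awq" is a placeholder, not a real repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-model-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"            # decoder-only models need left padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One prompt per dataset example; batch size 4 here.
prompts = [f"Context {i}: ...\nQuestion: ..." for i in range(4)]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True)

# Strip the prompt tokens and keep only the generated continuations.
texts = tokenizer.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```

Whether you hit a CPU bottleneck mostly comes down to tokenization and sampling overhead per step; at batch 4-8 the GPU is usually still the limit.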
For people like us who don’t share the gpu, it doesn’t make much sense outside of rare cases.
Multiple agents talking to each other. Quickly parsing a knowledge base. Sampling methods like tree of thought, plain old beam search, or running multiple prompts at once (a beam search sketch below).
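The beam search case is already exposed through the stock transformers generate API, no batching plumbing needed on your side. A rough sketch, with a placeholder model id:

```python
# Sketch: beam search returning all candidates instead of only the best beam.
# "some-org/some-model" is a placeholder for any causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-model")
model = AutoModelForCausalLM.from_pretrained("some-org/some-model", device_map="auto")

inputs = tok("Describe the scene:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,              # plain old beam search
    num_return_sequences=4,   # keep all four beams, not just the top one
    early_stopping=True,
)
for seq in tok.batch_decode(out, skip_special_tokens=True):
    print(seq)
```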
I don’t want to spend that amount of money, but I definitely want to play on one for a few months.
There was a paper where you’d run a faster model to come up with a sentence and then run a batch on the big model, with each prompt being the same sentence at a different length, each ending in a different word predicted by the small model, to see where the small one went wrong. That gets you a speedup if the two models are more or less aligned.
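That's speculative decoding. The batch-of-prefixes framing above is equivalent to what current implementations do with a single forward pass over the whole draft, since one pass scores every position at once. A toy greedy-verification sketch, with placeholder model names and assuming both models share a tokenizer:

```python
# Sketch of speculative decoding's verification step (greedy, no rejection
# sampling): small model drafts k tokens, big model scores the whole draft in
# one forward pass, and we keep the prefix up to the first disagreement.
# Model ids are placeholders; assumes both fit on one GPU with one tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/big-model")
big = AutoModelForCausalLM.from_pretrained("some-org/big-model").to("cuda")
small = AutoModelForCausalLM.from_pretrained("some-org/small-model").to("cuda")

prompt_ids = tok("Once upon a time", return_tensors="pt").input_ids.to("cuda")
n_prompt = prompt_ids.shape[1]

# 1) Draft k tokens cheaply with the small model.
k = 8
draft = small.generate(prompt_ids, max_new_tokens=k, do_sample=False)
k_drafted = draft.shape[1] - n_prompt  # may be < k if EOS was hit

# 2) One big-model forward pass scores every draft position at once.
with torch.no_grad():
    logits = big(draft).logits  # (1, n_prompt + k_drafted, vocab)

# 3) Accept drafted tokens while the big model's greedy pick agrees.
accepted = 0
for i in range(k_drafted):
    # logits at position p predict the token at position p + 1
    if logits[0, n_prompt + i - 1].argmax() == draft[0, n_prompt + i]:
        accepted += 1
    else:
        break
print(f"big model agrees with {accepted}/{k_drafted} drafted tokens")
```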
Other than that, I could imagine things like batches with one sentence being generated for each actor, one for descriptions, one for actions, etc. Or simply generating multiple options for you to choose from.
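That per-role idea is the same batching pattern as the synthetic-dataset snippet above, just with one prompt per role. A sketch with placeholder prompts and model id:

```python
# Sketch: one batch, one prompt per "role" (dialogue / description / action),
# decoded together in a single generate call. Placeholders throughout.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-model")
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("some-org/some-model", device_map="auto")

prompts = [
    "Write the innkeeper's next line of dialogue:",
    "Describe the room the party just entered:",
    "Name the goblin's next action:",
]
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=80, do_sample=True)

new_tokens = out[:, inputs["input_ids"].shape[1]:]
for role, text in zip(("dialogue", "description", "action"),
                      tok.batch_decode(new_tokens, skip_special_tokens=True)):
    print(role, "->", text)
```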