With a 3090 and sufficient system RAM, you can run 70b models but they’ll be slow. About 1.5 tokens/second. Plus quite a bit of time for prompt ingestion. It’s doable but not fun.
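If you want to try it anyway, here's a minimal sketch using llama-cpp-python with partial GPU offload: as many layers as fit go into the 3090's 24GB, the rest stay in system RAM. The model filename and layer count are placeholders; tune them for your quant.

```python
from llama_cpp import Llama

# Placeholder path to a 70b GGUF quant; raise n_gpu_layers until VRAM is nearly full.
llm = Llama(
    model_path="models/llama2-70b.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=40,   # partial offload; the remaining layers run from system RAM
    n_ctx=4096,
)

out = llm("Write a haiku about waiting for tokens.", max_tokens=64)
print(out["choices"][0]["text"])
```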
Because a frightening number of people still think Twitter matters.
The models don’t have memory per se; they reprocess the entire context (i.e. the conversation) with each generation. As the context becomes larger and more complex, models with fewer parameters struggle.
You can try adding certain instructions to the system prompt, such as “advance the story”, but ultimately, more parameters means a better grasp of the conversation. I haven’t come across any model below an 8-bit 13b that could keep a story together, so that’s the minimum I go for when I want to RP.
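To make the “no memory” point concrete, here’s a rough sketch of what a frontend does every turn: it rebuilds one big prompt from the system prompt plus the whole chat so far, so each generation has to chew through everything again. The prompt format and names are made up for illustration.

```python
system_prompt = "You are the narrator. Advance the story; don't stall or repeat yourself."
history = []  # grows every turn

def build_prompt(user_message: str) -> str:
    history.append(("User", user_message))
    # The entire conversation is re-sent with every single generation.
    lines = [system_prompt] + [f"{who}: {text}" for who, text in history]
    lines.append("Narrator:")
    return "\n".join(lines)

prompt = build_prompt("We enter the tavern.")
# send `prompt` to the backend, then append the reply to `history` as ("Narrator", reply)
```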
As for the 70b’s writing being less interesting, I’d say that’s independent of the model’s capabilities and more down to style. Again, giving it instructions on how to write, as well as example messages, can help, but it does somewhat come down to what it was trained on.
It’s a rule of thumb that yes, a higher-parameter model at low quant beats a lower-parameter model at high quant (or no quant), but take it with a grain of salt, as you may still prefer a lower-parameter model that’s better tuned for the task you prefer.
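Rough back-of-the-envelope numbers for why the rule of thumb exists: weight memory scales with parameters times bits per weight, so even a heavily quantized 70b carries far more parameters than an unquantized 13b of similar size. These are approximations that ignore context and overhead.

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    # billions of parameters * bits per weight / 8 -> rough size in GB
    return params_b * bits_per_weight / 8

print(f"70b @ 2.5 bpw ~ {weight_gb(70, 2.5):.1f} GB")   # ~21.9 GB
print(f"13b @ 8.0 bpw ~ {weight_gb(13, 8.0):.1f} GB")   # ~13.0 GB
print(f"13b @ 16  bpw ~ {weight_gb(13, 16.0):.1f} GB")  # ~26.0 GB
```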
The model, called Q* – and pronounced as “Q-Star” – was able to solve basic maths problems it had not seen before, according to the tech news site the Information, which added that the pace of development behind the system had alarmed some safety researchers.
Sounds like a load of bollocks to me. How would anybody working in AI be “alarmed” by a model solving basic maths problems?
Try just exllama2, no HF.
I know but it’s slowing down quite a bit at 32k already so I don’t think it’s worth pushing it further. But hey, even at just 16k it’s four times what we usually get, so I’m not complaining.
With this particular model, I can crank it up to 32k if I enable “Use 8-bit cache to save VRAM”, and that’s as high as it can go in the Oobabooga WebUI.
The base Yi can handle 200k. The version I used can do 48k (though I only tested 16k so far). Larger context size requires more VRAM.
The size that TheBloke lists for a GGUF is the minimum size at 0 context. As the context fills up, VRAM use increases.
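A rough sketch of where that extra VRAM goes: the KV cache grows linearly with context length. The formula and the Llama-2-13b config numbers below are the standard ones; treat the results as ballpark figures, and note that an 8-bit cache roughly halves them. Models with grouped-query attention (like Yi) store far fewer KV elements per token, so their cache is smaller.

```python
def kv_cache_gb(n_layers: int, hidden_size: int, ctx_len: int, bytes_per_elem: int = 2) -> float:
    # K and V: one vector of hidden_size per layer per token
    return 2 * n_layers * hidden_size * ctx_len * bytes_per_elem / 1024**3

# Llama-2-13b: 40 layers, hidden size 5120, no grouped-query attention
print(f"4k ctx, fp16 cache:  {kv_cache_gb(40, 5120, 4096, 2):.1f} GB")   # ~3.1 GB
print(f"8k ctx, fp16 cache:  {kv_cache_gb(40, 5120, 8192, 2):.1f} GB")   # ~6.3 GB
print(f"8k ctx, 8-bit cache: {kv_cache_gb(40, 5120, 8192, 1):.1f} GB")   # ~3.1 GB
```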
I was hoping for a shakeup and all we got was an expensive game of musical chairs? Meh.
It should work with those specs. Not sure what “connection” it means. Perhaps post a screenshot of the console?
Now would be a good time for a disgruntled employee to leak some models and make OpenAI actually open. ;)
What are you looking for?
With a 3090, you can run any 13b model in 8 bit, group size 128, act order true, at decent speed.
Go-tos for the more spicy stuff would be MythoMax and Tiefighter.
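If it helps, here’s a hedged sketch of loading one of those 13b GPTQ quants with transformers (plus optimum and auto-gptq installed). The repo and branch names follow TheBloke’s usual naming convention but are from memory, so double-check the actual model card before copying them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo/branch names follow TheBloke's convention and may differ; check the model card.
repo = "TheBloke/MythoMax-L2-13B-GPTQ"
branch = "gptq-8bit-128g-actorder_True"  # 8 bit, group size 128, act order true

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, revision=branch, device_map="auto")

inputs = tokenizer("Once upon a time,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```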
Under full load, and if thermals allow it, that machine can draw up to 120 W from the wall. Likely the tool isn’t reading the SoC power draw correctly.
My poor liver!
Hadn’t thought of that. I have 24gb so I’ve always used GPTQ and with that, you really need more than 16gb.
With 16gb you could run q8.
Not really though. Any kind of context will push you over 16gb. Or I’m doing something wrong.
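For what it’s worth, the back-of-the-envelope math agrees: Q8_0 works out to roughly 8.5 bits per weight, so the 13b weights alone are close to 14 GB before any context is added.

```python
# Rough numbers, assuming Q8_0 ~ 8.5 bits per weight for a 13b model
weights_gb = 13 * 8.5 / 8          # ~13.8 GB of weights
kv_cache_gb = 3.1                  # fp16 cache at 4k context for a 13b (see earlier estimate)
print(weights_gb + kv_cache_gb)    # ~16.9 GB -> over a 16GB card
```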
Obviously. There aren’t many people in the world with 50k burning a hole in their pockets and of those, even fewer are nerdy enough to want to set up their own AI server in their basement just for themselves to tinker with.
Use 10-second clips of clean audio: no music, no background noise. I like to record samples from audiobooks; free samples on Amazon recorded with Audacity work well for me.
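A minimal sketch of the prep I mean, using pydub (any trim tool works; the filenames are placeholders): take roughly 10 seconds of clean speech, downmix to mono, and export as WAV. The 22.05 kHz sample rate is just a common choice, not a hard requirement.

```python
from pydub import AudioSegment  # needs ffmpeg installed

# Placeholder filenames; use your own recorded audiobook sample.
clip = AudioSegment.from_file("audiobook_sample.mp3")

# Take ~10 seconds of clean speech (times are in milliseconds), mono, 22.05 kHz.
voice = clip[5_000:15_000].set_channels(1).set_frame_rate(22050)
voice.export("voice_sample.wav", format="wav")
```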
One thing to note: my install (an implementation for SillyTavern) somehow got corrupted, no idea how. It still worked but sounded way worse. A reinstall fixed that, so maybe that’s happening to you too.