
  • If you’re on Windows, I’d download KoboldCPP and TheBloke’s GGUF models from HuggingFace.

    Then you just launch KoboldCPP, select the .gguf file, pick your GPU, enter the number of layers to offload, set the context size (4096 for those models), etc., and start it up.

    Then you’re good to start messing around. You can use the Kobold interface that pops up, or hit it through the API with something like SillyTavern.
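
    If you’d rather script against it than click around, here’s a minimal sketch of hitting the Kobold-compatible API that KoboldCPP exposes. I’m assuming the default local address (http://localhost:5001), the /api/v1/generate endpoint, and the usual response shape; check what your console actually prints when it starts up.

    ```python
    import requests  # pip install requests

    # Minimal sketch of calling KoboldCPP's Kobold-compatible API.
    # Assumptions: default http://localhost:5001 address, /api/v1/generate
    # endpoint, and a response shaped like {"results": [{"text": ...}]}.
    payload = {
        "prompt": "Write a short scene set in a rainy harbor town.",
        "max_length": 200,             # tokens to generate
        "max_context_length": 4096,    # matches the context size set at launch
        "temperature": 0.7,
    }

    resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
    resp.raise_for_status()
    print(resp.json()["results"][0]["text"])
    ```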


  • Adding onto Automata’s theoretical info, I can say that anecdotally I find 4-bit 70B substantially better than 8-bit 34B or below, but it’ll depend on your task.

    It seems like right now the 70Bs are really good for storywriting, RP, logic, etc. If you’re doing programming, data classification, or similar, you might be better off with a higher-precision smaller model that’s been fine-tuned for the task at hand.

    In the 70B circle jerk rant thread I posted a couple of days ago, I noticed that most of the people saying they didn’t find the 70Bs much better (or better at all) were doing programming or data-classification type stuff.
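
    For a sense of why 4-bit 70B vs 8-bit 34B is even a fair comparison, a rough back-of-envelope on weight memory alone (ignoring the KV cache and runtime overhead) puts them at about the same footprint:

    ```python
    def weight_gb(params_b: float, bits_per_weight: float) -> float:
        """Very rough weight-only memory estimate: billions of params * bits / 8 bits-per-byte = GB."""
        return params_b * bits_per_weight / 8

    print(f"70B @ 4-bit ~ {weight_gb(70, 4):.0f} GB")  # ~35 GB of weights
    print(f"34B @ 8-bit ~ {weight_gb(34, 8):.0f} GB")  # ~34 GB of weights
    ```

    So you’re spending roughly the same memory either way; the question is just whether more parameters or more precision buys you more for your task.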



  • Q4_0 and Q4_1 would both be legacy.

    The K_M suffix marks the newer “k-quant” (I guess it’s not that new anymore; it’s been around for months now).

    The idea is that the more important layers are done at a higher precision, while the less important layers are done at a lower precision.

    It seems to work well, which is why it’s become the new standard for the most part.

    Q4_K_M does the most important layers at a higher bit width and the less important ones at 4-bit.

    It is closer in quality/perplexity to Q5_0, while being closer in size to Q4_0.
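
    To make the “higher precision where it matters” idea concrete, here’s a toy round-to-nearest sketch in Python. This is not the actual llama.cpp k-quant code (which works block-by-block with its own scales and mins); it just shows that giving a tensor more bits cuts its quantization error, which is the budget the k-quants spend on the important layers.

    ```python
    import numpy as np

    def quantize(weights: np.ndarray, bits: int) -> np.ndarray:
        """Toy round-to-nearest quantization onto a symmetric n-bit grid."""
        levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels at 4-bit
        scale = np.abs(weights).max() / levels    # one scale per tensor (real k-quants use per-block scales)
        return np.round(weights / scale) * scale  # snap to the grid, then dequantize

    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096)

    for bits in (4, 5, 8):
        mse = float(np.mean((w - quantize(w, bits)) ** 2))
        print(f"{bits}-bit quantization error (MSE): {mse:.6f}")
    ```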


  • I’d try Goliath 120B and lzlv 70B. Those are the absolute best I’ve used, assuming you’re doing story writing / RP and stuff.

    lzlv should be as speedy as can be and fit easily in VRAM.

    Goliath won’t quite fit at 4-bit, but you could drop to a lower precision, or sacrifice some speed and run a Q4_K_M GGUF with most of the layers offloaded. That’d be my choice, but I have a high tolerance for slow generation.
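
    If you go the partial-offload route, a rough way to guess the “GPU layers” number KoboldCPP asks for is to split the GGUF file size over the model’s layer count and see how many layers fit in your free VRAM. This is only a back-of-envelope heuristic, and the numbers below are hypothetical placeholders rather than actual Goliath measurements:

    ```python
    def estimate_gpu_layers(gguf_size_gb: float, n_layers: int, free_vram_gb: float) -> int:
        """Rough guess at how many layers to offload to the GPU.

        Ignores the KV cache, context size, and runtime overhead, so leave a
        few GB of headroom and tune from there.
        """
        gb_per_layer = gguf_size_gb / n_layers
        return min(n_layers, int(free_vram_gb // gb_per_layer))

    # Hypothetical placeholder numbers, not real measurements:
    print(estimate_gpu_layers(gguf_size_gb=70.0, n_layers=140, free_vram_gb=48.0))
    ```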