- DeepSeek 67B still beats XVERSE-65B in benchmark scores.
- The benchmarks indicate strong math and coding performance for these two model series.
- Yuan has a unique optional attention mechanism that enhances output quality.
The bandwidth utilization on the GPU is not the best yet; it's only about a third of the potential 400 GB/s.
CPU RAM bandwidth utilization in llama.cpp, on the other hand, is nearly 100%. With my 32 GB of DDR4, I get 1.5 t/s with the 70B Q3_K_S model.
There will hopefully be more optimizations to speed this up.
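As a rough sanity check on that claim, here's a minimal sketch; the ~29 GB file size for a 70B Q3_K_S GGUF and dual-channel DDR4-3200 are my assumptions, not numbers from the comment above:

```python
# Token generation is roughly memory-bandwidth-bound:
# t/s ~= effective bandwidth / bytes streamed per token (roughly the whole model).
model_size_gb = 29.0   # assumed size of a 70B Q3_K_S GGUF
measured_tps = 1.5     # tokens/s reported above
theoretical_bw = 51.2  # GB/s, dual-channel DDR4-3200 (assumption)

effective_bw = measured_tps * model_size_gb
print(f"effective bandwidth: {effective_bw:.1f} GB/s")
print(f"utilization: {effective_bw / theoretical_bw:.0%}")
# -> ~43.5 GB/s, roughly 85% of theoretical peak, i.e. close to saturated
```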
Cramming Mistral at 2.7bpw, I get 2k context. Are you talking about VRAM, though?
Same story here.
8k context with 2.4bpw at 20 t/s; VRAM usage reads 23.85/24.00 GB.
16k context with 2.4bpw at 20 t/s with the FP8 cache.
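For a sense of why the FP8 cache frees up room for 16k, here's a minimal KV-cache size sketch; the Llama-2-70B-style shape (80 layers, 8 KV heads, head dim 128) is my assumption about the model being run:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * tokens
n_layers, n_kv_heads, head_dim = 80, 8, 128  # assumed Llama-2-70B-style GQA shape

def kv_cache_gb(context_len: int, bytes_per_elem: int) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1024**3

print(f"16k fp16 cache: {kv_cache_gb(16384, 2):.2f} GB")  # ~5.0 GB
print(f"16k fp8  cache: {kv_cache_gb(16384, 1):.2f} GB")  # ~2.5 GB
```

Halving the cache saves a couple of GB, which is roughly the margin in play when the card sits at 23.85/24.00 GB.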
I have 0.5-0.6 GB used for driving the monitor graphics on Ubuntu.
Did you disable the NVIDIA system memory fallback that they pushed on Windows users? That's probably what you need.
Last I checked, 38 t/s was the minimum prompt processing speed with zero layers offloaded on a 3090 for 70B Q4_K_M.
I'm sure it's way higher now. When you offload layers you can do more, but I think you need to know the maximum context length ahead of time so your GPU doesn't OOM towards the end.
I think you're also supposed to adjust the prompt-processing batch size settings.
I highly recommend checking the NVIDIA PRs in llama.cpp for prompt processing speeds to see the differences between GPUs. If one has double or triple the speed, that tells you something, and you can calculate how long processing your text will take, as in the sketch below.
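A minimal sketch of that calculation; the 38 t/s figure is the 3090 number from above, and the 3x-faster GPU is just a hypothetical for illustration:

```python
# Prompt processing time ~= prompt tokens / prompt-processing speed (tokens/s)
prompt_tokens = 8000  # example prompt length

for gpu, pp_speed in {"3090, zero layers offloaded": 38,
                      "hypothetical GPU at 3x": 114}.items():
    print(f"{gpu}: {prompt_tokens / pp_speed:.0f} s to ingest {prompt_tokens} tokens")
# -> ~211 s vs ~70 s
```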
possibly going even larger than 120B parameters
I didn't know that was possible. Have people made a 1T model yet?
Threads are best at 4-5, unless that's changed, so I think the default in the “batched” binary is set up that way.
I reach maximum CPU utilization (30-36%) after a batch size of 64, but still see further gains at 256.
It's basically… 0?
From GitHub:
> More friendly than usual GPT. Because you don’t need to keep a huge context (or kv cache). You just need the hidden state of the last single token.
RWKV-4 7B does not increase RAM usage at all with --nommap at 13k context in koboldcpp. Is that normal? Is there no KV cache and no extra RAM usage for context?
Would the amount of RAM used at the end of 16k or 32k be less than with Mistral?
Is the t/s the same as at the beginning?
Looks like something to test in kobold.cpp later if nobody has done those tests yet.
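A minimal sketch of why the RAM shouldn't grow: a transformer's KV cache scales linearly with context, while an RWKV-style recurrent state has a fixed size. The Mistral-7B shape (32 layers, 8 KV heads, head dim 128) and the RWKV-4 7B state layout (roughly 5 vectors of n_embd per layer) are my assumptions:

```python
GiB = 1024**3

def transformer_kv_gb(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    # Mistral-7B-ish GQA shape (assumption): cache grows with every token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per * ctx / GiB

def rwkv_state_gb(n_layers=32, n_embd=4096, vecs_per_layer=5, bytes_per=4):
    # RWKV-4 7B-ish recurrent state (assumption): fixed size, independent of context.
    return n_layers * n_embd * vecs_per_layer * bytes_per / GiB

for ctx in (13_000, 16_384, 32_768):
    print(f"transformer KV cache @ {ctx:>6} tokens: {transformer_kv_gb(ctx):.2f} GB")
print(f"RWKV state (any context length): {rwkv_state_gb() * 1024:.1f} MB")
```

So if the implementation really only keeps the recurrent state, flat RAM usage and flat t/s at long context would be the expected behaviour.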
I get 1.33 t/s with 180B Q4_K_S at a batch size of 64. Here's my test: https://www.reddit.com/r/LocalLLaMA/comments/17jhwpa/tested_batched_decoding_on_cpu/
Yes, speculative decoding does work with the Llama models + TinyLlama, but we don't have an optimal draft model trained alongside the original models, so we get no more than a 1.3-1.5x speedup for chat usage.
Lookahead decoding is another approach; I assume it will be better!
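For intuition, here's a minimal greedy speculative-decoding sketch; `draft_next` and `target_logits` are hypothetical stand-ins for a small draft model (e.g. TinyLlama) and the big target model, not any particular library's API:

```python
from typing import Callable, List

def speculative_step(tokens: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_logits: Callable[[List[int]], List[List[float]]],
                     k: int = 4) -> List[int]:
    """One round of greedy speculative decoding: the cheap draft model proposes
    k tokens, the expensive target model checks them all in one forward pass,
    and we keep the longest agreeing prefix plus one corrected token."""
    # 1. Draft k tokens autoregressively with the small model.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify all k positions with a single target-model call.
    logits = target_logits(tokens + draft)  # one logit vector per input position
    accepted = []
    for i, proposed in enumerate(draft):
        pos = len(tokens) + i - 1           # logits here predict draft[i]
        best = max(range(len(logits[pos])), key=logits[pos].__getitem__)
        if best == proposed:
            accepted.append(proposed)
        else:
            accepted.append(best)           # take the target's token and stop
            break
    # (A full implementation also grabs the target's bonus token when all k match.)
    return tokens + accepted
```

In the best case one target forward pass yields several tokens; the 1.3-1.5x figure above reflects how often a generic draft model actually agrees with the target.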
Thanks for sharing!
I use GGML mmap inference, 0 GB of RAM or VRAM needed. I use this model, which is 360 GB in size: https://huggingface.co/imi2/airoboros-180b-2.2.1-gguf/blob/main/airoboros-180b-2.2.1-f16.gguf.a
When I tried running the f16 180B purely from disk, I got ~90 s/t over PCIe 4.0.
With Q4_K_S, that becomes ~22 s/t.
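Those numbers are consistent with the disk being the bottleneck; a minimal sketch (the ~100 GB Q4_K_S file size is my assumption):

```python
# Streaming the model from disk via mmap means each token reads roughly the
# whole file, so s/t ~= model size / sustained disk read bandwidth.
for name, size_gb, s_per_tok in [("f16 180B", 360, 90),
                                 ("Q4_K_S 180B", 100, 22)]:  # ~100 GB assumed
    print(f"{name}: implied read speed ~{size_gb / s_per_tok:.1f} GB/s")
# -> ~4.0 and ~4.5 GB/s, about what a single PCIe 4.0 NVMe drive sustains
```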
Also try this out for running on multiple machines:
Not sure if your layer method is fast enough; I think it's going to be a bottleneck if you get any faster.
BTW, CPU memory bandwidth can match that of good GPUs.
Here's a good post on a potential 1 TB RAM setup:
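A minimal sketch of the bandwidth math behind that claim; the channel counts and speeds are generic examples, not taken from the linked post:

```python
# Theoretical DRAM bandwidth = channels * transfers/s * 8 bytes per transfer
def dram_bw_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # GB/s

print(f"dual-channel DDR4-3200:      {dram_bw_gbs(2, 3200):.0f} GB/s")
print(f"8-channel DDR5-4800 server:  {dram_bw_gbs(8, 4800):.0f} GB/s")
print(f"12-channel DDR5-4800 server: {dram_bw_gbs(12, 4800):.0f} GB/s")
print("RTX 3090 GDDR6X:             936 GB/s (for comparison)")
```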
Long context is useless without flash-attention.
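A minimal sketch of why: without a fused attention kernel, the attention-score matrix gets materialized and grows quadratically with context. The 32-head fp16 shape is a generic 7B-style assumption:

```python
# Naive attention stores an (n_ctx x n_ctx) score matrix per head.
def score_matrix_gb(n_ctx: int, n_heads: int = 32, bytes_per: int = 2) -> float:
    return n_ctx * n_ctx * n_heads * bytes_per / 1024**3

for n_ctx in (4096, 16384, 32768):
    print(f"{n_ctx:>6} ctx: {score_matrix_gb(n_ctx):.0f} GB of scores per layer")
# Flash attention computes the same result in tiles without ever storing this.
```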
For CPU-only it is not viewable due to mmap loading, which saves time during startup. To view it, use --no-mmap.
What value specifically worked?