- DeepSeek 67B still beats XVERSE-65B in benchmark scores.
- The benchmarks indicate strong math and coding performance for these two model series.
- Yuan has a unique optional attention mechanism that enhances output quality.
The bandwidth utilization on the GPU is not the best yet; it's only about a third of the potential 400 GB/s.
CPU RAM bandwidth utilization in llama.cpp, on the other hand, is nearly 100%. With my 32 GB of DDR4, I get 1.5 t/s with the 70B Q3_K_S model.
There will hopefully be more optimizations to speed this up.
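As a rough sanity check on that claim, here's a minimal sketch; the ~29 GB file size for a 70B Q3_K_S GGUF and dual-channel DDR4-3200 are my assumptions, not numbers from the comment above:

```python
# Token generation is roughly memory-bandwidth-bound:
# t/s ~= effective bandwidth / bytes streamed per token (roughly the whole model).
model_size_gb = 29.0   # assumed size of a 70B Q3_K_S GGUF
measured_tps = 1.5     # tokens/s reported above
theoretical_bw = 51.2  # GB/s, dual-channel DDR4-3200 (assumption)

effective_bw = measured_tps * model_size_gb
print(f"effective bandwidth: {effective_bw:.1f} GB/s")
print(f"utilization: {effective_bw / theoretical_bw:.0%}")
# -> ~43.5 GB/s, roughly 85% of theoretical peak, i.e. close to saturated
```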
Cramming Mistral at 2.7bpw, I get 2k context. Are you talking about VRAM, though?
Same story here.
8k context with 2.4bpw at 20 t/s; VRAM usage reads 23.85/24.00 GB.
16k context with 2.4bpw at 20 t/s with the FP8 cache.
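For a sense of why the FP8 cache frees up room for 16k, here's a minimal KV-cache size sketch; the Llama-2-70B-style shape (80 layers, 8 KV heads, head dim 128) is my assumption about the model being run:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * tokens
n_layers, n_kv_heads, head_dim = 80, 8, 128  # assumed Llama-2-70B-style GQA shape

def kv_cache_gb(context_len: int, bytes_per_elem: int) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1024**3

print(f"16k fp16 cache: {kv_cache_gb(16384, 2):.2f} GB")  # ~5.0 GB
print(f"16k fp8  cache: {kv_cache_gb(16384, 1):.2f} GB")  # ~2.5 GB
```

Halving the cache saves a couple of GB, which is roughly the margin in play when the card sits at 23.85/24.00 GB.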
I have 0.5-0.6 GB used for driving the monitor graphics on Ubuntu.
Did you disable the NVIDIA system memory fallback that they pushed on Windows users? That's probably what you need.
Last I checked, 38 t/s was the minimum prompt processing speed with zero layers offloaded on a 3090 for 70B Q4_K_M.
I'm sure it's way higher now. When you offload layers you can do more, but I think you need to know the maximum context length ahead of time so your GPU doesn't OOM towards the end.
I think you're also supposed to adjust the prompt-processing batch size settings.
I highly recommend checking the NVIDIA PRs in llama.cpp for prompt processing speeds to see the differences between GPUs. If one has double or triple the speed, that tells you something, and you can calculate how long processing your text will take, as in the sketch below.
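A minimal sketch of that calculation; the 38 t/s figure is the 3090 number from above, and the 3x-faster GPU is just a hypothetical for illustration:

```python
# Prompt processing time ~= prompt tokens / prompt-processing speed (tokens/s)
prompt_tokens = 8000  # example prompt length

for gpu, pp_speed in {"3090, zero layers offloaded": 38,
                      "hypothetical GPU at 3x": 114}.items():
    print(f"{gpu}: {prompt_tokens / pp_speed:.0f} s to ingest {prompt_tokens} tokens")
# -> ~211 s vs ~70 s
```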
possibly going even larger than 120B parameters
I didn't know that was possible. Have people made a 1T model yet?
Threads are best at 4-5, unless that's changed, so I think the default in the “batched” binary is set up that way.
I reach maximum CPU utilization (30-36%) after a batch size of 64, but still see further gains at 256.
It's basically… 0?
From GitHub:
> More friendly than usual GPT. Because you don’t need to keep a huge context (or kv cache). You just need the hidden state of the last single token.
RWKV-4 7B does not increase RAM usage at all with --nommap at 13k context in koboldcpp. Is that normal? Is there no KV cache and no extra RAM usage for context?
Would the amount of RAM used at the end of 16k or 32k be less than with Mistral?
Is the t/s the same as at the beginning?
Looks like something to test in kobold.cpp later if nobody has done those tests yet.
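A minimal sketch of why the RAM shouldn't grow: a transformer's KV cache scales linearly with context, while an RWKV-style recurrent state has a fixed size. The Mistral-7B shape (32 layers, 8 KV heads, head dim 128) and the RWKV-4 7B state layout (roughly 5 vectors of n_embd per layer) are my assumptions:

```python
GiB = 1024**3

def transformer_kv_gb(ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    # Mistral-7B-ish GQA shape (assumption): cache grows with every token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per * ctx / GiB

def rwkv_state_gb(n_layers=32, n_embd=4096, vecs_per_layer=5, bytes_per=4):
    # RWKV-4 7B-ish recurrent state (assumption): fixed size, independent of context.
    return n_layers * n_embd * vecs_per_layer * bytes_per / GiB

for ctx in (13_000, 16_384, 32_768):
    print(f"transformer KV cache @ {ctx:>6} tokens: {transformer_kv_gb(ctx):.2f} GB")
print(f"RWKV state (any context length): {rwkv_state_gb() * 1024:.1f} MB")
```

So if the implementation really only keeps the recurrent state, flat RAM usage and flat t/s at long context would be the expected behaviour.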
I get 1.33 t/s with 180B Q4_K_S at a batch size of 64. Here's my test: https://www.reddit.com/r/LocalLLaMA/comments/17jhwpa/tested_batched_decoding_on_cpu/
Yes, speculative decoding does work with the Llama models + TinyLlama, but we don't have an optimal draft model trained alongside the original models, so we get no more than a 1.3-1.5x speedup for chat usage.
Lookahead decoding is another approach; I assume it will be better!
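For intuition, here's a minimal greedy speculative-decoding sketch; `draft_next` and `target_logits` are hypothetical stand-ins for a small draft model (e.g. TinyLlama) and the big target model, not any particular library's API:

```python
from typing import Callable, List

def speculative_step(tokens: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_logits: Callable[[List[int]], List[List[float]]],
                     k: int = 4) -> List[int]:
    """One round of greedy speculative decoding: the cheap draft model proposes
    k tokens, the expensive target model checks them all in one forward pass,
    and we keep the longest agreeing prefix plus one corrected token."""
    # 1. Draft k tokens autoregressively with the small model.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify all k positions with a single target-model call.
    logits = target_logits(tokens + draft)  # one logit vector per input position
    accepted = []
    for i, proposed in enumerate(draft):
        pos = len(tokens) + i - 1           # logits here predict draft[i]
        best = max(range(len(logits[pos])), key=logits[pos].__getitem__)
        if best == proposed:
            accepted.append(proposed)
        else:
            accepted.append(best)           # take the target's token and stop
            break
    # (A full implementation also grabs the target's bonus token when all k match.)
    return tokens + accepted
```

In the best case one target forward pass yields several tokens; the 1.3-1.5x figure above reflects how often a generic draft model actually agrees with the target.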
Thanks for sharing!
I use GGML mmap inference, 0 GB of RAM or VRAM needed. I use this model, which is 360 GB in size: https://huggingface.co/imi2/airoboros-180b-2.2.1-gguf/blob/main/airoboros-180b-2.2.1-f16.gguf.a
When I tried running the f16 180B purely from disk, I got ~90 s/t over PCIe 4.0.
With Q4_K_S, that becomes ~22 s/t.
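Those numbers are consistent with the disk being the bottleneck; a minimal sketch (the ~100 GB Q4_K_S file size is my assumption):

```python
# Streaming the model from disk via mmap means each token reads roughly the
# whole file, so s/t ~= model size / sustained disk read bandwidth.
for name, size_gb, s_per_tok in [("f16 180B", 360, 90),
                                 ("Q4_K_S 180B", 100, 22)]:  # ~100 GB assumed
    print(f"{name}: implied read speed ~{size_gb / s_per_tok:.1f} GB/s")
# -> ~4.0 and ~4.5 GB/s, about what a single PCIe 4.0 NVMe drive sustains
```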
Also try this out for running on multiple machines:
Not sure if your layer method is fast enough; I think it's going to be a bottleneck if you get any faster.
BTW, CPU memory bandwidth can match that of good GPUs.
Here's a good post on a potential 1 TB RAM setup:
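A minimal sketch of the bandwidth math behind that claim; the channel counts and speeds are generic examples, not taken from the linked post:

```python
# Theoretical DRAM bandwidth = channels * transfers/s * 8 bytes per transfer
def dram_bw_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # GB/s

print(f"dual-channel DDR4-3200:      {dram_bw_gbs(2, 3200):.0f} GB/s")
print(f"8-channel DDR5-4800 server:  {dram_bw_gbs(8, 4800):.0f} GB/s")
print(f"12-channel DDR5-4800 server: {dram_bw_gbs(12, 4800):.0f} GB/s")
print("RTX 3090 GDDR6X:             936 GB/s (for comparison)")
```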
Long context is useless without flash-attention.
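A minimal sketch of why: without a fused attention kernel, the attention-score matrix gets materialized and grows quadratically with context. The 32-head fp16 shape is a generic 7B-style assumption:

```python
# Naive attention stores an (n_ctx x n_ctx) score matrix per head.
def score_matrix_gb(n_ctx: int, n_heads: int = 32, bytes_per: int = 2) -> float:
    return n_ctx * n_ctx * n_heads * bytes_per / 1024**3

for n_ctx in (4096, 16384, 32768):
    print(f"{n_ctx:>6} ctx: {score_matrix_gb(n_ctx):.0f} GB of scores per layer")
# Flash attention computes the same result in tiles without ever storing this.
```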
For CPU-only it is not viewable due to mmap loading, which saves time during startup. To view it, use --no-mmap.
What value specifically worked?