Remember to try the 8-bit cache if you haven't yet; it should get you to about 5.5k tokens of context length.
You can get around 10-20k context length with 4bpw yi-34b 200k quants on a single 24GB card.
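For reference, this is roughly what enabling the 8-bit cache looks like when loading through the exllamav2 Python API directly. Class and argument names are from memory, so treat this as a sketch rather than the exact current API, and the model path is just an example:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/yi-34b-200k-4.0bpw-exl2"  # example path, point at your quant
config.prepare()
config.max_seq_len = 16384  # whatever fits once the cache is halved

model = ExLlamaV2(config)
# ExLlamaV2Cache_8bit stores K/V in 8 bits instead of FP16, roughly halving cache VRAM
cache = ExLlamaV2Cache_8bit(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
print(generator.generate_simple("Hello,", ExLlamaV2Sampler.Settings(), 64))
```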
I’ve only seen merging of same-upstream-pretrained-model-at-same-size.
Not anymore.
Here’s a merge of llama 2 13B and llama 1 33B https://huggingface.co/chargoddard/llama2-22b
Have you checked DeepSeek Coder Instruct 33B already? I don't know about its knowledge of PyTorch, but it's pretty much the best local coding model you can run, so it's your best shot.
Instead, it uses an Amazon platform known as Bedrock, which connects several A.I. systems together, including Amazon’s own Titan as well as ones developed by Anthropic and META.
It's a llama! :D I wonder how they can comply with the llama license, I think they have more than 700M monthly active users.
Good to see more competitors at least. Enterprise office people are totally in MS's hands, so that's not an area where open-source end-to-end solutions have much chance of competing; the only way to get them there is if a big corp like Amazon adopts them in its infrastructure for a product like this.
Are you talking about base yi-34B or a fine-tuned one? The base model will be hard to use but will score pretty high. Benchmarks are generally written with completion in mind, so they work really well on base models; instruct tuning may make a model much easier to work with, but not necessarily score higher on benchmarks.
I can't corroborate those results for Pascal cards. They had very limited FP16 performance, usually 1:64 of FP32. Switching from a gtx 1080 to an rtx 3090 ti got me around 10-20x gains in qlora training, keeping the exact same batch size and ctx length and changing only the compute dtype from fp16 to bf16.
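For context, the fp16/bf16 difference above is just the compute dtype in a typical QLoRA setup. A rough sketch of where it gets set, assuming transformers + bitsandbytes (the base model name is only an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# On Ampere+ (rtx 3090/4090) bf16 is supported and fast; Pascal (gtx 1080) has no bf16
# and roughly 1:64-rate fp16, which is where the huge training speed gap comes from.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on pre-Ampere cards
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,
    bf16=True,                              # set fp16=True instead on pre-Ampere hardware
)
```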
Yeah, it will be the sum of tokens that the next token is generated on. I don't know how often the KV cache is updated.
Yeah, if that's the case, I can see gpt-4 requiring about 220-250B loaded parameters to do token decoding.
Here’s the formula
batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
Formula to calculate the KV cache size, i.e. the space used by context:
batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
This blog post is really good, I recommend reading it.
Usually bigger models have more layers, heads and dimensions, but I am not sure whether heads or dimensions grow faster. It’s something you can look up though.
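To make the formula concrete, here's a quick back-of-the-envelope calculation plugging in Llama 2 70B's published config (80 layers, d_model 8192, 64 heads, 8 KV heads with GQA):

```python
def kv_cache_bytes(batch_size, seqlen, d_model, n_heads, n_layers, n_kv_heads, bytes_per_param=2):
    # batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per FP16) * n_kv_heads
    head_dim = d_model // n_heads
    return batch_size * seqlen * head_dim * n_layers * 2 * bytes_per_param * n_kv_heads

# Llama 2 70B: 80 layers, d_model 8192, 64 heads, 8 KV heads (GQA)
size = kv_cache_bytes(batch_size=1, seqlen=4096, d_model=8192,
                      n_heads=64, n_layers=80, n_kv_heads=8)
print(f"{size / 2**30:.2f} GiB")  # ~1.25 GiB for a full 4k context at FP16
```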
Jondurbin made something like this with qlora.
The explanation that gpt-4 is a MoE model doesn't make sense to me. The gpt-4 API is 30x more expensive than gpt-3.5-turbo. gpt-3.5-turbo is 175B parameters, right? So if they had 8 experts of 220B each, it wouldn't need to cost 30x more for API use; it would be more like 20-50% more. There was also some speculation that 3.5-turbo is 22B. In that case it also doesn't make sense to me that gpt-4 would be 30x as expensive.
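A rough way to sanity-check that argument; every parameter count below is a rumor, this is just the arithmetic the comparison rests on:

```python
# Per-token decode cost scales roughly with the number of *active* parameters.
gpt35_dense = 175e9   # rumored gpt-3.5-turbo size
expert_size = 220e9   # rumored size of each gpt-4 expert

for active_experts in (1, 2):
    ratio = (expert_size * active_experts) / gpt35_dense
    print(f"{active_experts} expert(s) active: ~{ratio:.2f}x compute per token")

# 1 expert active:  ~1.26x  (the "20-50% more" ballpark above)
# 2 experts active: ~2.51x  -- still nowhere near the 30x price difference
```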
I upgraded from a gtx 1080 to an rtx 3090 ti two weeks ago. I think going with an rtx 3090 / 3090 ti / 4090 would be a good option for you. I don't know how big a difference a stronger cpu would make; exllama v2 seems to have some cpu bottlenecking going on, but I have no idea what is computed on the cpu and why. There were moments during generation where it seemed to be using only one thread and maxing it out, becoming a bottleneck for the gpu. I don't think ram matters a lot unless you train and merge loras and models.
I use DeepSeek Coder Instruct at work for writing PowerShell scripts and some help with troubleshooting. I set up a pc that wasn't used anymore and had a quadro rtx 4000 with the DeepSeek Coder Instruct 6.7B model and shared it among the team; earlier I was sharing the 33B version from home to my work computer and using it at work only myself. I find it better than Bing Chat Enterprise for my use case: it's much faster and I don't have to fight with it just to get it to generate some code. It's also all local, so I don't have to worry about how private Bing Chat Enterprise actually is.
At home I use various models for questions that I would be too embarrassed to ask a real human, or that I would have to pay a human to answer. It's a really big deal for me to have some private pocket intelligence I can talk to that won't remember what I talked with it about and won't log the conversation god knows where.
Isn't cuBLAS specific to Nvidia cards, and CLBlast compatible with both Nvidia and AMD? I'm not sure how cuBLAS could work with AMD cards. ROCm?
They are really small, one person can make them relatively quickly, so I don’t think there are huge gains to be had by splitting the work. You can always push the dataset to huggingface and make it public, allowing others to add their samples.
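If you go that route, pushing a tiny hand-written dataset to the hub is only a few lines with the datasets library; the repo name and column schema here are just examples, and you need to be logged in with huggingface-cli first:

```python
from datasets import Dataset

# A handful of hand-written samples; the column names are just an example schema.
samples = [
    {"instruction": "Explain what a KV cache is.", "output": "The KV cache stores ..."},
]

# Requires `huggingface-cli login` beforehand; the repo name is hypothetical.
Dataset.from_list(samples).push_to_hub("your-username/tiny-handmade-instruct")
```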
That's assuming batch size 1. A 4090, for example, can serve multiple requests to a 7B model at once, at around 850 t/s total: https://github.com/casper-hansen/AutoAWQ. Now take a bigger gpu that has more vram and can host multiple llama 70b streams, or split the layers across multiple gpus. You can get a 10-20x t/s uplift from batched generation.
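As a minimal illustration of static batching with AutoAWQ (the model repo below is just an example; the really big t/s numbers additionally rely on continuous batching in a proper serving stack):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"  # example AWQ quant, any 7B repo works
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False)  # keep it simple for batching
tokenizer = AutoTokenizer.from_pretrained(quant_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # decoder-only models want left padding for batched generate

prompts = [f"Write a haiku about GPU number {i}." for i in range(16)]  # one batch, 16 requests
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

# All 16 sequences decode in parallel, so total tokens/sec scales far beyond single-stream speed.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```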
I believe gpu offloading in llama.cpp can be used to combine your vram and ram. I would suggest trying some airoboros llama 2 70b q3_k_m quant and Tess-M-1.3 q5_k_m once TheBloke makes quants. There will be some leftover space in your RAM after loading Tess, but it's a model with 200k context, so you will need it for context. Max out your vram, maybe use a batch size of -1 to trade prompt processing speed for more vram space, and try offloading with both cuBLAS and CLBlast. Last time I checked, CLBlast seemed to allow offloading more layers to the gpu in the same memory footprint.
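If you'd rather script it than use a frontend, partial offload looks roughly like this with llama-cpp-python (the filename and layer count are examples; whether cuBLAS or CLBlast is used is decided when the package is built):

```python
from llama_cpp import Llama

# Partial offload: n_gpu_layers controls how many transformer layers go to VRAM,
# the rest stay in system RAM. Raise it until VRAM is nearly full.
llm = Llama(
    model_path="airoboros-l2-70b.Q3_K_M.gguf",  # example filename
    n_gpu_layers=45,   # tune until you just fit in 24GB
    n_ctx=4096,
    n_batch=256,       # smaller prompt-processing batch trades speed for VRAM headroom
)

print(llm("Hello, how are you?", max_tokens=64)["choices"][0]["text"])
```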
I like the idea of trying to get a tiny 13-sample dataset to work. Can you upload the adapter files or the full fp16 model? With what you have uploaded currently, only llama.cpp and derivatives can be used for inference. If you uploaded the adapter files, someone could merge them with the base model and, for example, run it in exllama.
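For reference, merging uploaded adapter files back into the base model is only a few lines with peft; the base model id and adapter path below are placeholders:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"     # whichever base the adapter was trained on
adapter_dir = "./my-13-sample-adapter"   # hypothetical path to the uploaded adapter files

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("./merged-fp16")  # this fp16 folder is what exllama quantizers consume
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-fp16")
```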
What do you mean by base + extra?
Merging models can be unpredictable; it isn't an established science yet. A merge can absolutely end up better at a particular benchmark than any of its components. I don't think it's evidence of anything, to be honest.
I really really enjoy seeing perpetual irrevocable licenses.