file size, which impacts load time:
with load_in_4bit, it downloads and parses the full-precision file (4x bigger than the 4-bit quants if it is bfloat16, 8x bigger if it is float32) and then quantizes on the fly,
with pre-quantized files, it downloads only the quants, so expect roughly a 4x to 8x faster load for 4-bit quants
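For concreteness, here is a minimal sketch of the two load paths with Hugging Face transformers; the model IDs are only examples (any bfloat16 checkpoint and a pre-quantized GPTQ/AWQ counterpart will do), and the pre-quantized path also assumes the relevant quantization backend (e.g. auto-gptq/optimum) is installed.

```python
# Sketch: on-the-fly 4-bit quantization vs. loading a pre-quantized checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# On-the-fly: downloads the ~14 GB bfloat16 weights, then quantizes to 4-bit in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_slow = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",        # full-precision repo (example)
    quantization_config=bnb_config,
    device_map="auto",
)

# Pre-quantized: downloads only the ~4 GB 4-bit weights, no conversion step.
model_fast = AutoModelForCausalLM.from_pretrained(
    "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ",  # pre-quantized repo (example)
    device_map="auto",
)
```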
you need a knowledge base (with your legal or financial data) and semantic search to feed the relevant context to a model fine-tuned to follow instructions and answer from that context (see the sketch after this list),
for small (7B) models there are OpenHermes-2.5-Mistral-7B, Mistral-7B-OpenOrca, and dolphin-2.1-mistral-7b,
for bigger models there is Nous-Capybara-34B.
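A minimal sketch of that retrieval step, assuming sentence-transformers for the semantic search; the documents, embedding model, and ChatML template are illustrative, and the resulting prompt would be passed to whichever instruct model you pick from the list above.

```python
# Semantic search over a small knowledge base, then build a context-grounded prompt.
from sentence_transformers import SentenceTransformer, util

# 1. Knowledge base: your legal/financial documents, pre-split into chunks (toy data here).
documents = [
    "Clause 4.2: the lessee must give 60 days written notice before termination.",
    "Invoices are payable within 30 days of the invoice date.",
    "The liability cap is limited to the total fees paid in the preceding 12 months.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 2) -> str:
    # 2. Semantic search: retrieve the chunks most similar to the question.
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=top_k)[0]
    context = "\n".join(documents[hit["corpus_id"]] for hit in hits)
    # 3. Feed the retrieved context to an instruction-following model
    #    (ChatML shown here, which OpenHermes-2.5 uses; adapt to your model's template).
    return (
        "<|im_start|>system\n"
        "Answer only from the provided context.<|im_end|>\n"
        "<|im_start|>user\n"
        f"Context:\n{context}\n\nQuestion: {question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("How much notice is needed to terminate the lease?"))
```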