You got me there 😊
Yes, you could do that, but there's no need to load the full model on each node, only the layers assigned to that node.
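Roughly, the per-node part looks like this in PyTorch (just a sketch: the model id, the 2-node split, and the LLaMA-style `model.model.layers` attribute are placeholders, not my actual code):

```python
import torch
from transformers import AutoModelForCausalLM

NODE_RANK, NUM_NODES = 1, 2          # hypothetical: this is node 1 of 2

# Load the weights on CPU first; only this node's share ever touches the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "some-120b-model",               # placeholder model id
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

layers = model.model.layers          # decoder blocks of a LLaMA-style model
per_node = len(layers) // NUM_NODES
start, end = NODE_RANK * per_node, (NODE_RANK + 1) * per_node

# Move only the assigned slice onto this node's GPU; the rest stays off it.
for block in layers[start:end]:
    block.to("cuda")
```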
Kinda, but not exactly.
Petals.dev also splits the work so that volunteers can pick it up, but it doesn't reuse the same GPU multiple times during a single inference. So to achieve similar performance with Petals.dev, you would need 16 volunteers with 3090 cards, while here you have 8 3090s locally.
It’s not bad at all! I just wanted to see the full model. The approach can be applied to quantized models too, I just wanted the most extreme example in terms of model and context size. It only gets better from there! Light quantization + speculative decoding gets you close to real-time.
Quantized would run significantly faster, although I haven’t measured it extensively yet. That’s because you avoid most of the data transfer, and the layers themselves take a lot less memory and run much faster.
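For reference, light quantization in transformers is basically a config switch (placeholder model id, not necessarily my exact settings):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights via bitsandbytes: far less data to move and much smaller layers.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-120b-model",               # placeholder model id
    quantization_config=quant_config,
    device_map="auto",
)
```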
The model is definitely not the best, but what was important for me was to see something close to GPT-3.5 in terms of size. That way I have a blueprint for running newer open-source models of similar size.
Yes, agreed. But there is effort involved in releasing it: I have to document it, test it, think carefully about naming so that it’s intuitive, etc.
If I release it and no one cares, that would be a waste of time, but the interest here motivates me to release it.
Thanks guys 😊
If there are multiple GPUs in the same machine, via PCIe. If they’re on different machines, via networking over a 1 Gbit switch.
Yes, I agree. The libraries I wrote actually make this sort of approach easier, because you can run parts of the bigger model as ordinary PyTorch modules :)
That sounds like a great idea. I don’t have a Mac Studio, but in theory it should totally work, since every part of this experiment is a normal PyTorch module. So if you can run PyTorch on a Mac (which you definitely can), you can run it on two Mac Studio Ultras.
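In PyTorch the only change is the device selection, something like:

```python
import torch

# Pick the Apple-silicon (MPS) backend when available, else CUDA, else CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

x = torch.randn(4, 8, device=device)
print(device, x.sum().item())
```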
Thanks for the info! What is the context size? Is it small or big? Because that definitely matters.
Thanks for sharing, that’s very useful! What GPUs and how many are you using, just to make sure I understand correctly?
EDIT: What CPU are you using? Because 90s/t is pretty impressive to be honest.
The layer method basically uses the time when the node is idle, so it works well with large context sizes or if you have many GPUs (so you can load a small number of layers on the GPU and reload them super fast).
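In pseudo-PyTorch the idea is roughly this (toy blocks instead of real transformer layers, and not my actual library):

```python
import torch
import torch.nn as nn

def run_layer_streaming(blocks, hidden, device):
    """Load one block at a time onto the GPU, push the whole batch of hidden
    states through it, then free it so the next block can be loaded."""
    hidden = hidden.to(device)
    for block in blocks:
        block.to(device)              # reload the layer onto the GPU
        with torch.no_grad():
            hidden = block(hidden)    # a big batch keeps the GPU busy meanwhile
        block.to("cpu")               # free VRAM for the next layer
    return hidden

# Toy stand-ins for transformer blocks; real blocks also take masks and caches.
blocks = [nn.Linear(512, 512) for _ in range(8)]
device = "cuda" if torch.cuda.is_available() else "cpu"
out = run_layer_streaming(blocks, torch.randn(1024, 512), device)
```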
The intended long-term use case is extracting data from documents. One document is typically around 1,500 tokens. Since I know the output should be contained in the original document, I restrict the output to predefined choices from the document, and a single pass gives me the choice with the highest probability. This way I don’t expose my data, and it’s actually faster than the OpenAI API, because there I cannot restrict the output to just a few tokens and it goes on to write irrelevant stuff. Moreover, the data is very sensitive and I obviously cannot send it to an external service just like that.

With this fully local approach, at a one-time cost of less than 10k USD, I can process about 100k documents per month, which is good enough for now. And because it’s a one-time cost, it’s way cheaper than the OpenAI API in the long run; it pays for itself in just 2-3 months.
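The selection itself is just a log-prob comparison over the candidates, roughly like this (placeholder model id and choices; my setup batches this into a single pass, while the sketch below scores one candidate per forward pass to keep it readable):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-local-model")          # placeholder
model = AutoModelForCausalLM.from_pretrained("some-local-model").eval()

def score_choice(document: str, choice: str) -> float:
    """Sum of log-probs of the choice tokens, conditioned on the document."""
    prompt_ids = tok(document, return_tensors="pt").input_ids
    choice_ids = tok(choice, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so shift by one.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    tail = logprobs[prompt_ids.shape[1] - 1:]
    return tail.gather(1, choice_ids[0].unsqueeze(1)).sum().item()

choices = ["Invoice", "Contract", "Receipt"]   # hypothetical predefined choices
document = "..."                               # the document text goes here
best = max(choices, key=lambda c: score_choice(document, c))
```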
Thanks for your comment! I will, and I’ll share the Jupyter notebooks as well. It will probably be next week.
I haven’t run the full Goliath yet. Soon 😊