• 1 Post
  • 13 Comments
Joined 1 year ago
Cake day: November 24th, 2023

    1. It's not bad at all! I just wanted to see the full model. The approach can be applied to quantized models too; I just wanted the most extreme example in terms of model and context size. It only gets better from there! Light quantization + speculative decoding gets you close to real-time (a rough sketch follows this list).

    2. A quantized model would run significantly faster, although I haven't measured it extensively yet. That's because you avoid most of the data transfer, and the layers themselves take far less memory and run much faster (see the back-of-the-envelope numbers after this list).

    3. The model is definitely not the best, but what mattered to me was running something close to GPT-3.5 in size, so I now have a blueprint for running newer open-source models of similar sizes.
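    To illustrate the speculative-decoding point in (1): here is a minimal sketch using Hugging Face transformers' assisted generation, where a small draft model proposes tokens and the large target model verifies them in a single pass. The model names and 8-bit quantization settings are placeholders, not the exact setup discussed above.

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Placeholder model names; any target/draft pair that shares a tokenizer works.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
    target = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf",
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # light quantization
    )
    draft = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )

    inputs = tokenizer("Extract the key facts from the following document:\n...",
                       return_tensors="pt").to(target.device)
    # assistant_model enables assisted (speculative) generation: the draft model
    # proposes several tokens, the target model accepts or rejects them in one forward pass.
    out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    ```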
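    To put rough numbers on the memory point in (2), here is a back-of-the-envelope comparison of weight memory at fp16 versus 4-bit. The 70B parameter count is an assumption for illustration only, not the exact model discussed above, and activations and KV cache are ignored.

    ```python
    # Approximate weight memory for a hypothetical 70B-parameter model.
    params = 70e9
    bytes_fp16 = params * 2    # 2 bytes per weight at fp16
    bytes_int4 = params * 0.5  # ~0.5 bytes per weight at 4-bit
    print(f"fp16 : {bytes_fp16 / 1e9:.0f} GB")  # ~140 GB
    print(f"4-bit: {bytes_int4 / 1e9:.0f} GB")  # ~35 GB
    ```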

  • The intended long-term use case is extracting data from documents. One document is typically around 1,500 tokens. Since I know the output should be contained in the original document, I restrict the output to predefined choices taken from the document, and a single pass gives me the choice with the highest probability (a rough sketch of this choice-scoring step follows below). This way I do not expose my data, and it is actually faster than the OpenAI API, where I cannot restrict the output to just a few tokens and the model goes on to write irrelevant text. Moreover, the data is very sensitive, and I obviously cannot send it to an external service just like that. With this fully local setup, a one-time cost of less than 10k USD, I can process about 100k documents per month, which is good enough for now. And because it's a one-time cost, it's far cheaper than the OpenAI API in the long run; it pays for itself in just 2-3 months.
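    A minimal sketch of the choice-scoring idea, assuming a transformers-based stack: score each predefined candidate by the log-probability the model assigns to it after the prompt, then keep the best one. The model name, the prompt, the candidate values, and the `choice_logprob` helper are all illustrative assumptions, not the exact setup described above.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf", device_map="auto", torch_dtype=torch.float16
    )

    def choice_logprob(prompt: str, choice: str) -> float:
        """Sum of log-probabilities the model assigns to `choice` after `prompt`."""
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            logits = model(full_ids).logits
        # Log-probability of each token given the preceding context.
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_logprobs = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        # Only score the tokens belonging to the choice, not the prompt.
        n_prompt = prompt_ids.shape[1]
        return token_logprobs[0, n_prompt - 1:].sum().item()

    document = "..."  # the ~1,500-token document
    prompt = f"{document}\n\nAnswer:"
    choices = [" option A", " option B", " option C"]  # candidates taken from the document
    best = max(choices, key=lambda c: choice_logprob(prompt, c))
    print(best)
    ```

    Because only a handful of candidates are scored and no free-form text is generated, the per-document cost is a few forward passes regardless of how verbose the model would otherwise be.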