Hey guys,
I’m running the quantized version of mistral-7B-instruct and its pretty fast and accurate for my use case. On my PC I’m generating approximately 4 tokens per second with the idea of generating one-sentence responses for my NPC characters, which is good enough for what I need.
After fiddling around with oobabooga a bit I found out that you can perform API calls on localhost and print out the text, which is exactly what I need for this to work.
The issue I’m running into here is that if I were to make a game with AI-generated content, how can I make it easy for players to run their own localhost and perform api calls in the game this way? I feel like for the unexperienced, setting all this up would be a nightmare for them and I don’t want to alienate non-tech savvy players.
I’d go the Koboldcpp route instead because its portable for them so its much simpler to install and use. Koboldcpp has API documentation available if you add /api to a working link (Or you can just check it here). If you already made it for the OpenAI compatible API stuff it supports that to.
The fastest way would be to ingest the ggerganov server.cpp module & make HTTP calls to it. Way easier to package into other apps & supports parallel decoding with 30tok/s on Apple Silicon(M1 Pro)
look at the api examples in the ooba code.