I think demand for "do 10,000 generations at once" workloads will grow over the longer term. Chatbots and chat-based interfaces, with their fairly spread-out, consistent traffic, are the first widely propagating use case for LLMs, but they are a bit gimmicky. There are, and will be, plenty of very specific, niche domain use cases where you want hundreds or thousands of generations at once and then see no traffic for days or weeks until the next sudden spike.
If your current demand is from chatbots, then build for that; but once other industries and domains figure out how best to use LLMs, I reckon there will be growing demand for cloud compute that can handle infrequent but very spiky inference requests.
Yep.
Looks like I might be missing a PyInstaller hook. Someone appears to have made a pull request that addresses the same thing for llama-cpp-python:
https://github.com/abetlen/llama-cpp-python/pull/709
Will try the same thing and post here if it works.
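In case it helps anyone else, here is a minimal sketch of what such a hook could look like, assuming the issue is that PyInstaller isn't picking up llama_cpp's compiled shared library and data files. The file name hook-llama_cpp.py and the specific collect calls are my own guess at the approach, not necessarily what that PR actually does:

```python
# hook-llama_cpp.py -- put this in a directory passed to PyInstaller
# via --additional-hooks-dir so it is picked up at build time.
from PyInstaller.utils.hooks import collect_data_files, collect_dynamic_libs

# Bundle the compiled shared library (e.g. libllama.so / llama.dll)
# that llama_cpp loads at runtime, plus any non-Python data files
# shipped inside the package.
binaries = collect_dynamic_libs("llama_cpp")
datas = collect_data_files("llama_cpp")
```

Then build with something like `pyinstaller --additional-hooks-dir=./hooks your_script.py`.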