on-demand inference or batch inference?

Ok_Post_149@alien.top · 1 year ago

on-demand inference or batch inference?

AdamDhahabi@alien.top · 1 year ago

I think batched inference is a must for companies that want to put an on-premise chatbot in front of their users. This is a use case many are busy with at the moment. I saw that llama.cpp now supports batched inference, but only as of about 2 weeks ago, and I don’t have hands-on experience with it yet.

Ok_Post_149@alien.top · 1 year ago

Thanks for this feedback. What is your definition of an on-prem chatbot? One hosted on their own physical infrastructure?

matkley12@alien.top · 1 year ago

Does llama.cpp support batch inference on CPU?

Hoblywobblesworth@alien.top · 1 year ago

I think in the longer term the demand for “do 10,000 generations at once” workloads will rise. Chatbots and chat-based interfaces, which have fairly spread-out, consistent traffic, are the first widely adopted use case for LLMs, but they are a bit gimmicky. There are and will be plenty of very specific, niche domain use cases where you will want hundreds or thousands of generations at once and then see no traffic for days or weeks until the next sudden spike.

If your current demand is from chatbots then build that, but once other industries and domains start to figure out how best to use LLMs, I reckon there will be growth in demand for cloud compute that can handle infrequent but super spiky inference requests.

Ok_Post_149@alien.top · 1 year ago

This is really useful feedback. I’d definitely be able to produce a revenue-generating product faster if I focus on chatbots… so in terms of trying to get funding for this idea, that seems to be the better avenue. In the future I could definitely address both use cases, but I’m trying not to spread myself too thin at the moment.