Hi guys, I am new to LLMs, especially to running them locally. I have done some basics, like building RAG with the LangChain framework on Colab and locally on my CPU machine using quantised models from TheBloke. Now I want to move on to development and production work for some potential clients. I will have lots of questions along the way, but I'll start with learning about GPUs.
What is the minimum GPU server required to run a model like Mistral-7B or LLaMA-13B for inference in a simple RAG application with an 8K context length? Basically, I have no idea what kind of GPU to look for for different LLM workloads. And what should one look for when building such LLM apps in production for a small to mid-size company?
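To make the question more concrete, here is the rough back-of-the-envelope VRAM math I tried myself in Python (assuming 4-bit quantised weights plus an fp16 KV cache; the layer/head numbers are just what I found in the model configs, so please correct me if any of this is off):

```python
# Very rough VRAM estimate: quantised weights + KV cache + some overhead.
# This is my own guess at the sizing, not a definitive rule.

def estimate_vram_gb(n_params_b, n_layers, n_kv_heads, head_dim, ctx_len,
                     bits_per_weight=4, kv_bytes=2, overhead_gb=1.5):
    """Weights (quantised) + KV cache (fp16) + fixed overhead, in GB."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes / 1e9
    return weights_gb + kv_cache_gb + overhead_gb

# Mistral-7B: 32 layers, 8 KV heads (GQA), head_dim 128
print("Mistral-7B @ 8K ctx:  ~", round(estimate_vram_gb(7.2, 32, 8, 128, 8192), 1), "GB")
# LLaMA-2-13B: 40 layers, 40 KV heads (no GQA), head_dim 128
print("LLaMA-2-13B @ 8K ctx: ~", round(estimate_vram_gb(13, 40, 40, 128, 8192), 1), "GB")
```

That gives me roughly 6 GB for Mistral-7B and roughly 15 GB for LLaMA-13B at 8K context, which makes me think something like a single 24 GB card would cover both, but I have no idea whether this kind of estimate holds up in practice once batching and serving frameworks come into the picture.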
A quick Google search landed me on https://www.gpu-mart.com/gpu-dedicated-server, but I don't have enough background to make sense of that information. I would also appreciate it if someone could point me to an up-to-date guide on choosing a server configuration (GPU plus everything else) for this type of LLM app.