Yeah man, just use LangChain with a Pydantic class, or Microsoft's guidance library, with Mistral Instruct or Zephyr & you're golden
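To illustrate the idea without pulling in LangChain itself, here's a minimal stdlib sketch of what the Pydantic layer buys you: the model is prompted to emit JSON, and the output is parsed and type-checked against a schema before anything downstream touches it. The field names and the raw completion below are made up for the example; Pydantic does the same check declaratively.

```python
import json

# Hypothetical raw completion from an instruct model (assumes the prompt
# asked for JSON with exactly these fields).
raw = '{"name": "Ada Lovelace", "year": 1815}'

# Expected schema: field name -> required type.
REQUIRED = {"name": str, "year": int}

def validate(payload: str) -> dict:
    """Parse and type-check model output; a Pydantic model does this declaratively."""
    data = json.loads(payload)
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    return data

print(validate(raw))
```

The point is that anything that fails the schema raises immediately, so you can retry the generation instead of letting malformed output leak into your pipeline.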
All you need is a 32K-context LLM. Everything beyond that goes through a tool invocation that pulls from the archived text. You'll have to make your orchestrator smart enough to know there's content beyond the window that needs to be fetched
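A toy sketch of that routing decision, assuming a 32K token budget and a crude chars/4 token estimate (all names and the heuristic are illustrative, not any particular framework's API):

```python
CONTEXT_LIMIT = 32_000  # token budget of the model's window

def rough_token_count(text: str) -> int:
    # Crude heuristic (~4 chars per token); a real orchestrator
    # would use the model's actual tokenizer.
    return len(text) // 4

def route(live_context: str) -> str:
    """Decide whether the query fits in-context or needs the archive tool."""
    if rough_token_count(live_context) <= CONTEXT_LIMIT:
        return "answer_in_context"
    # Content has aged out of the window: pull it back via retrieval.
    return "invoke_archive_tool"

print(route("short conversation"))            # answer_in_context
print(route("x" * 200_000))                   # invoke_archive_tool
```

The hard part the comment alludes to is the second branch: the orchestrator has to know that relevant content *exists* outside the window, which usually means keeping an index or summary of what was archived.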
Just run it on TGI or vLLM to get FlashAttention & continuous batching for parallel requests
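For reference, a typical way to stand up vLLM's OpenAI-compatible server (the model id and port are just examples, and flags can differ between vLLM versions):

```shell
# Serve an instruct model behind an OpenAI-compatible API with vLLM.
# Continuous batching is on by default; no extra flag needed.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000
```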
It’s extremely overpriced. With INT4 quantization, llama.cpp puts up even crazier numbers. A system with 4090s can be built for $2500 in India, & cheaper elsewhere for sure.
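For anyone trying the INT4 route, this is roughly what quantizing a GGUF model with llama.cpp looks like (the binary name varies by version: older builds ship `quantize`, newer ones `llama-quantize`; filenames here are placeholders):

```shell
# Convert an FP16 GGUF to 4-bit (Q4_K_M is a common quality/size tradeoff).
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```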
I feel like verifiable math & physics simulation should be something every LLM just invokes as a tool, instead of slowly grinding through it internally
vLLM, TGI, TensorRT-LLM
Fuyu-8B
By that logic every LLM out there will engage in talk about Xi