You mean we don’t need to use llama-cpp-python anymore to serve this at an OpenAI-compatible endpoint?
A bit related. I think all the tools mentioned here are for using an existing UI.
But what if you want to easily roll your own, preferably in Python? I know of some options:
- Gradio: https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks
- Panel: https://www.anaconda.com/blog/how-to-build-your-own-panel-ai-chatbots
- Reflex (formerly Pynecone): https://github.com/reflex-dev/reflex-chat https://news.ycombinator.com/item?id=35136827
- Solara: https://news.ycombinator.com/item?id=38196008 https://github.com/widgetti/wanderlust
I like Streamlit (simple but not very versatile), and Reflex seems to have a richer feature set.
My questions: Which of these do people like to use the most? And are the tools mentioned by OP also good for rolling your own UI on top of your own software? To make the question concrete, a minimal Gradio sketch is below.
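This is roughly what "rolling your own" looks like with Gradio's ChatInterface pointed at an OpenAI-compatible endpoint (just a minimal sketch; the base URL and model name are placeholders, and it assumes `pip install gradio openai`):

```python
# Minimal "roll your own" chat UI: a Gradio front-end talking to any
# OpenAI-compatible endpoint (e.g. a locally served model).
# Assumes `pip install gradio openai`; base_url and model name are placeholders.
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def chat_fn(message, history):
    # Gradio passes history as (user, assistant) pairs in its default format
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    resp = client.chat.completions.create(
        model="local-model",  # placeholder: whatever your server exposes
        messages=messages,
    )
    return resp.choices[0].message.content

gr.ChatInterface(chat_fn).launch()
```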
Langroid has a DocChatAgent; you can see an example script here:
https://github.com/langroid/langroid-examples/blob/main/examples/docqa/chat.py
Every generated answer is accompanied by a Source (doc link or local path) and an Extract (the first few and last few words of the reference; I avoid quoting the whole sentence to save on token costs).
There are other variants of RAG scripts in that same folder, like multi-agent RAG (doc-chat-2.py), where a master agent delegates smaller questions to a retrieval agent and asks in different ways if it can’t get an answer, etc. There’s also doc-chat-multi-llm.py, where the master agent is powered by GPT-4 and the RAG agent by a local LLM (after all, it only needs to do extraction and summarization).
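The delegation pattern in those scripts is roughly the following. This is a generic sketch of the idea using the plain openai client, not Langroid’s actual code; retrieve_chunks(), the endpoints, and the model names are placeholders:

```python
# Generic sketch of the master/RAG-agent split: a stronger "master" LLM breaks a
# question into sub-questions, a cheaper local LLM answers each from retrieved
# chunks, and the master composes the final answer.
# retrieve_chunks(), endpoints, and model names are placeholders.
from openai import OpenAI

master = OpenAI()  # e.g. GPT-4; needs OPENAI_API_KEY
local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def retrieve_chunks(query: str) -> list[str]:
    """Placeholder for vector-store retrieval of the top-k chunks."""
    return []

def ask(client: OpenAI, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def rag_answer(sub_question: str) -> str:
    # the RAG agent only does extraction/summarization over retrieved chunks
    chunks = "\n\n".join(retrieve_chunks(sub_question))
    prompt = f"Answer ONLY from these passages:\n{chunks}\n\nQuestion: {sub_question}"
    return ask(local, "local-model", prompt)

def answer(question: str) -> str:
    subs = ask(
        master, "gpt-4",
        f"Break this into 2-3 simpler sub-questions, one per line:\n{question}",
    )
    findings = "\n\n".join(
        f"Q: {s}\nA: {rag_answer(s)}" for s in subs.splitlines() if s.strip()
    )
    return ask(
        master, "gpt-4",
        f"Using these findings:\n\n{findings}\n\nAnswer the original question: {question}",
    )
```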
> intuitively it seems like you might be able to avoid calling a model at all b/c shouldn’t the relevant sentences just be closer to the search
Not really, as I mention in my reply to u/jsfour above: embeddings give you similarity to the query, whereas an LLM can identify relevance to answering the query. Specifically, embeddings won’t be able to resolve cross-references (e.g. "Giraffes are tall. They eat mostly leaves.") and won’t be able to zoom in on answers, e.g. the President Biden question I mention there.
Here is the comparison for that specific example.
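If you want to poke at the embedding side yourself, here is a rough sketch with sentence-transformers (the model name is just an example). The pronoun in "They eat mostly leaves." has no surface link to giraffes, so pure query similarity can miss the sentence that actually answers the question, whereas an LLM reading both sentences together resolves the reference:

```python
# "Similar to the query" vs "relevant to answering the query": embed the query
# and each sentence, then compare cosine similarities.
# Assumes `pip install sentence-transformers`; the model name is just an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What do giraffes eat?"
sentences = ["Giraffes are tall.", "They eat mostly leaves."]

q_emb = model.encode(query, convert_to_tensor=True)
s_emb = model.encode(sentences, convert_to_tensor=True)

for sent, score in zip(sentences, util.cos_sim(q_emb, s_emb)[0]):
    print(f"{score.item():.3f}  {sent}")
```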
That was exactly my thought! In Langroid (the agent-oriented LLM framework from ex-CMU/UW-Madison researchers), we call it Relevance Extraction: given a passage and a query, use the LLM to extract only the portions relevant to the query. In a RAG pipeline where you optimistically retrieve the top k chunks (to improve recall), the chunks can be large and hence contain irrelevant or distracting text. We do relevance extraction on these k chunks concurrently: https://github.com/langroid/langroid/blob/main/langroid/agent/special/doc_chat_agent.py#L801
One thing often missed here is the unnecessary cost (latency and tokens) of having the LLM parrot back verbatim text from the context. In Langroid we use a numbering trick to mitigate this: pre-annotate the passage’s sentences with numbers, and ask the LLM to simply return the numbers of the relevant sentences. We have an elegant implementation of this in our RelevanceExtractorAgent, using tools/function-calling.
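The gist of it, as a rough standalone sketch (simplified, not the actual RelevanceExtractorAgent code; the prompt wording, sentence splitting, and model name here are illustrative):

```python
# Sketch of the sentence-numbering trick: number the sentences, ask the LLM for
# the relevant numbers only, and map the numbers back to text locally, so the
# LLM never has to parrot verbatim text. Simplified stand-in for Langroid's
# RelevanceExtractorAgent (which returns the numbers via a tool/function-call);
# the prompt wording, sentence splitting, and model name are illustrative.
import re
from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY

def extract_relevant(passage: str, query: str, model: str = "gpt-4") -> str:
    # naive sentence split; the real implementation is more careful
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    numbered = "\n".join(f"({i + 1}) {s}" for i, s in enumerate(sentences))
    prompt = (
        "Below is a numbered passage and a query. Reply with ONLY the numbers of "
        "the sentences relevant to answering the query, comma-separated, or NONE "
        "if no sentence is relevant.\n\n"
        f"PASSAGE:\n{numbered}\n\nQUERY: {query}"
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    nums = [int(n) for n in re.findall(r"\d+", reply)]
    return " ".join(sentences[i - 1] for i in nums if 1 <= i <= len(sentences))
```

The completion is then just a handful of digits, so the extraction step costs a few output tokens instead of re-generating the relevant text verbatim.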
Here’s a post I wrote comparing Langroid’s method with LangChain’s naive equivalent of relevance extraction, `LLMChainExtractor.compress`; no surprise, Langroid’s method is far faster and cheaper:
https://www.reddit.com/r/LocalLLaMA/comments/17k39es/relevance_extraction_in_rag_pipelines/
If I had the time, the next steps would have been to (1) give it a fancy name and (2) post it on arXiv with a bunch of experiments, but I’d rather get on with building 😄