TL;DR: Is there an example someone can point me to of RAG over highly structured documents, where the assistant returns its answer along with cross-references to document paragraphs or sections? Input: a long text document (~500-1000 pages); output: Q&A with references to paragraph, page, or another simple cross-reference.

I’ve been looking into RAG in my (extremely limited) spare time for a few months now, but I keep getting hung up on vector databases. That may be because my use case revolves around highly structured specification documents, where I want to recover section and paragraph references during a Q&A session with a RAG assistant.

Most off-the-shelf solutions seem not to care what your data looks like and just provide a black-box pipeline for chunking and embedding: point them at a single HTML link and it magically works. This confuses me, because LangChain has a great learning path with quite a bit of focus on proper data chunking and vector-database structuring, yet practically every example treats the chunking and vector-store steps as an afterthought. I don’t like doing things I don’t understand, so I’ve focused on building a database for my data that makes sense in my brain.
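For concreteness, the kind of structure-aware chunking I have in mind looks something like this (a minimal sketch only; the heading regex and field names are assumptions about how a numbered spec might be laid out):

```python
import re

# Hypothetical: split a spec's plain text on numbered headings like "3.2.1 ..."
# so every chunk carries its own paragraph ID for later cross-referencing.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s", re.MULTILINE)

def chunk_by_paragraph(text: str) -> list[dict]:
    matches = list(HEADING.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({"para_id": m.group(1), "text": text[m.start():end].strip()})
    return chunks
```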

I have successfully created a local vector database (SQLite) with SBERT that returns paragraph numbers from a similarity search, but I haven’t bridged that to feeding those results into an LLM.
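The step I haven’t built yet would, I think, look roughly like this (a sketch only: the table schema, model names, and prompt wording are all placeholders, and it assumes embeddings are stored as JSON arrays):

```python
import json
import sqlite3
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder SBERT model

def top_k(conn: sqlite3.Connection, question: str, k: int = 5):
    # Hypothetical schema: chunks(para_id TEXT, text TEXT, embedding TEXT),
    # with each embedding stored as a JSON array.
    q = model.encode(question)
    scored = []
    for para_id, text, emb in conn.execute("SELECT para_id, text, embedding FROM chunks"):
        e = np.asarray(json.loads(emb))
        sim = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        scored.append((sim, para_id, text))
    scored.sort(reverse=True)
    return scored[:k]

def answer(conn: sqlite3.Connection, question: str) -> str:
    # Stuff the retrieved paragraphs, IDs included, into the prompt so the
    # model can cite them back.
    excerpts = "\n\n".join(f"[{pid}] {text}" for _, pid, text in top_k(conn, question))
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[
            {"role": "system",
             "content": "Answer only from the excerpts and cite paragraph IDs like [3.2.1]."},
            {"role": "user", "content": f"Excerpts:\n{excerpts}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```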

Am I overthinking this? Can the off-the-shelf RAG solutions handle paragraph numbers without me explicitly cramming them into a database structure? Or am I on the right path, and should I continue with the database that makes sense to me and keep figuring out how to implement the LLM step after the vector search?

I started looking at LlamaIndex, then LangChain, now AutoGen. But my spare time is limited enough that I haven’t implemented anything with any of them, only the (successful) SBERT similarity search, which didn’t use any of these. If someone has an example for structured documents where the Q&A provides cross-references, I’d really appreciate it.

  • SatoshiNotMe@alien.top · 11 months ago

    Langroid has a DocChatAgent; you can see an example script here:

    https://github.com/langroid/langroid-examples/blob/main/examples/docqa/chat.py

    Every generated answer is accompanied by Source (doc link or local path), and Extract (the first few and last few words of the reference — I avoid quoting the whole sentence to save on token costs).

    There are other RAG script variants in that same folder, like multi-agent RAG (doc-chat-2.py), where a master agent delegates smaller questions to a retrieval agent and re-asks in different ways if it can’t answer, etc. There’s also doc-chat-multi-llm.py, where the master agent is powered by GPT-4 and the RAG agent by a local LLM (after all, it only needs to do extraction and summarization).
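    For orientation, a minimal sketch of what chat.py boils down to (class and field names are from my memory of the Langroid README, so treat this as an approximation and check the linked script for the current API):

    ```python
    import langroid as lr
    from langroid.agent.special import DocChatAgent, DocChatAgentConfig

    # Point the agent at your document(s); the path here is a placeholder.
    config = DocChatAgentConfig(doc_paths=["specs/my-spec.pdf"])
    agent = DocChatAgent(config)

    # Wrapping the agent in a Task starts an interactive Q&A loop; answers
    # carry the Source and Extract fields described above.
    lr.Task(agent).run()
    ```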

  • grumpy_autist@alien.top · 10 months ago

    @smerfj - I’m currently researching the same problem. You can find some information in the LlamaIndex project docs. What you probably need is a so-called composite index: a vector database combined with a knowledge graph that links particular knowledge bits or text paragraphs together. Alternatively, you can try restricting the vector search to chunks computed from one particular document.

    I suspect that knowledge graphs are “the shit” here, because you can keep and query really small but highly relevant pieces of data without overflowing the LLM context and slowing it down.
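    Roughly what I mean, as a toy in-memory sketch (field names are made up; a real composite index would live in LlamaIndex or a proper graph store):

    ```python
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(chunks, links, query_vec, doc_id, k=3):
        # chunks: [{"doc_id", "para_id", "text", "emb"}, ...]
        # links: {para_id: [cross-referenced para_ids]} -- a tiny knowledge graph
        # 1) Restrict the vector search to one document, as suggested above.
        pool = [c for c in chunks if c["doc_id"] == doc_id]
        hits = sorted(pool, key=lambda c: cosine(query_vec, c["emb"]), reverse=True)[:k]
        # 2) Graph expansion: also pull in paragraphs the hits cross-reference,
        #    so the LLM gets small but connected pieces without a huge context.
        wanted = {p for c in hits for p in links.get(c["para_id"], [])}
        wanted -= {c["para_id"] for c in hits}
        return hits + [c for c in pool if c["para_id"] in wanted]
    ```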

    • Smerfj@alien.top (OP) · 10 months ago

      Thanks for the pointers. Since I’m aiming to run local models eventually, I’ll take any efficiency I can squeeze out.