I came across this interesting problem in RAG, what I call Relevance Extraction.

The retrieved documents (or chunks) are often large and may contain several portions irrelevant to the query at hand. Stuffing the entire chunk into an LLM prompt increases token cost, hurts response accuracy (the irrelevant text distracts the LLM), and can also bump into context-length limits.

So a critical step in most pipelines is Relevance Extraction: use the LLM to extract verbatim only the portions relevant to the query. This is known by other names, e.g. LangChain calls it Contextual Compression, and the RECOMP paper calls it Extractive Compression.

Thinking about how best to do this, I realized it is highly inefficient to simply ask the LLM to “parrot” out the relevant portions of the text verbatim: this is obviously slow, consumes valuable token-generation space (and can again bump into context-length limits), and of course it is expensive: for GPT-4, output tokens cost 6c/1K vs 3c/1K for input.
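To make the cost asymmetry concrete, here is a rough back-of-envelope calculation. The token counts are illustrative (loosely based on the ~900 vs ~40 output tokens reported below), not from any specific run:

```python
# Back-of-envelope cost comparison (illustrative numbers, GPT-4 pricing circa 2023).
INPUT_COST_PER_1K = 0.03   # $ per 1K prompt tokens
OUTPUT_COST_PER_1K = 0.06  # $ per 1K generated tokens

passage_tokens = 1000          # the (numbered) passage sent as input, in both cases
parrot_output_tokens = 900     # relevant sentences written out verbatim
numbering_output_tokens = 40   # e.g. a short function-call listing sentence numbers

parrot_cost = (passage_tokens * INPUT_COST_PER_1K
               + parrot_output_tokens * OUTPUT_COST_PER_1K) / 1000
numbering_cost = (passage_tokens * INPUT_COST_PER_1K
                  + numbering_output_tokens * OUTPUT_COST_PER_1K) / 1000

print(f"parrot:    ${parrot_cost:.4f}")     # ~$0.0840
print(f"numbering: ${numbering_cost:.4f}")  # ~$0.0324
```

Since generation latency also scales roughly with output tokens, the same asymmetry explains most of the speed difference.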

I realized the best way (or at least a good way) to do this is to number the sentences and have the LLM simply return the relevant sentence numbers. Langroid’s Multi-Agent + function-calling architecture allows an elegant implementation of this in the RelevanceExtractorAgent: the agent annotates the passage with sentence numbers and instructs the LLM to pick out the numbers of the relevant sentences (rather than the sentences themselves) via a function call (SegmentExtractTool); the agent’s function-handler then interprets this message and extracts the indicated sentences by number. To extract from a set of passages, Langroid runs this async + concurrently, so latencies in practice are much, much lower than with the sentence-parroting approach.
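Here is a minimal, library-agnostic sketch of the numbering idea, not Langroid’s actual implementation: the sentence splitter is naive, and the prompt/JSON shape and helper name are hypothetical:

```python
import re
import json
from openai import OpenAI  # any chat-completion client would do

client = OpenAI()

def extract_relevant(passage: str, query: str) -> str:
    # 1. Annotate each sentence with a number (naive splitting, for illustration only).
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))

    # 2. Ask the LLM for the relevant sentence *numbers* only -- a few output tokens,
    #    instead of parroting whole sentences verbatim.
    prompt = (
        f"Passage (sentences are numbered):\n{numbered}\n\n"
        f"Query: {query}\n\n"
        'Return ONLY JSON like {"segments": [2, 5]} listing the numbers of the '
        "sentences relevant to answering the query."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    picked = json.loads(resp.choices[0].message.content)["segments"]

    # 3. The "function handler" step: map the numbers back to verbatim sentences.
    return " ".join(sentences[i - 1] for i in picked if 1 <= i <= len(sentences))
```

In Langroid the second step is a structured function call (SegmentExtractTool) rather than free-form JSON, which makes the “handler” step robust instead of relying on the model to emit parseable text.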

[FD – I am the lead dev of Langroid]

I thought this numbering idea was fairly obvious in theory, so I looked at LangChain’s equivalent, LLMChainExtractor.compress_documents (they call this Contextual Compression), and was surprised to see it uses the simple “parrot” method, i.e. the LLM writes out whole sentences verbatim from its input. I thought it would be interesting to compare Langroid vs LangChain; you can see the comparison in this Colab.
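For reference, the LangChain side of the comparison looks roughly like this (a sketch against the 2023-era API; import paths may have moved in later versions):

```python
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.schema import Document

llm = ChatOpenAI(model_name="gpt-4", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

docs = [Document(page_content="Giraffes are tall. They eat mostly leaves. ...")]
# The LLM writes the relevant sentences out verbatim in its response ("parroting").
compressed = compressor.compress_documents(docs, query="What do giraffes eat?")
print([d.page_content for d in compressed])
```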

On the specific example in the notebook, with GPT-4 the Langroid numbering approach is 22x faster (under 7 secs vs 145 secs for LangChain) and 36% cheaper (~40 output tokens with Langroid vs ~900 with LangChain) than LangChain’s parrot method (I promise this name is not inspired by their logo :)

I wonder if anyone has thoughts on relevance extraction, or other approaches. At the very least, I hope Langroid’s implementation is useful to you: you can use DocChatAgent.get_verbatim_extracts(query, docs) as part of your pipeline, regardless of whether you use Langroid for your entire system or not.
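A rough sketch of how that call might slot into a pipeline; the import paths, config fields, and Document construction here are approximations, so check Langroid’s docs/examples for the current API:

```python
# Sketch only: import paths and types are approximate; consult Langroid's docs.
from langroid.agent.special.doc_chat_agent import DocChatAgent, DocChatAgentConfig
from langroid.mytypes import Document

agent = DocChatAgent(DocChatAgentConfig())

docs = [
    Document(content="Giraffes are tall. They eat mostly leaves.",
             metadata={"source": "wiki"}),
]
extracts = agent.get_verbatim_extracts("What do giraffes eat?", docs)
for d in extracts:
    print(d.content)  # only the portions relevant to the query
```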

  • SatoshiNotMe@alien.top (OP) · 11 months ago

    > intuitively it seems like you might be able to avoid calling a model at all b/c shouldn’t the relevant sentences just be closer to the search

    Not really, as I mention in my reply to u/jsfour above: embeddings give you similarity to the query, whereas an LLM can identify relevance to answering the query. Specifically, embeddings won’t be able to resolve cross-references (e.g. “Giraffes are tall. They eat mostly leaves.”, where “They” refers back to giraffes), and won’t be able to zoom in on answers, e.g. the President Biden question I mention there.