I came across this interesting problem in RAG, what I call Relevance Extraction.

After retrieving relevant documents (or chunks), these chunks are often large and may contain several portions irrelevant to the query at hand. Stuffing the entire chunk into an LLM prompt hurts both token cost and response accuracy (the irrelevant text distracts the LLM), and can also push you into context-length limits.

So a critical step in most pipelines is Relevance Extraction: use the LLM to extract verbatim only the portions relevant to the query. This is known by other names, e.g. LangChain calls it Contextual Compression, and the RECOMP paper calls it Extractive Compression.

Thinking about how best to do this, I realized it is highly inefficient to simply ask the LLM to “parrot” out the relevant portions of the text: this is obviously slow, it consumes valuable token-generation space (which can again push you into context-length limits), and of course it is expensive -- e.g. for GPT-4 we know generation costs 6c/1K tokens vs 3c/1K tokens for input.
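To make the cost asymmetry concrete, here is a back-of-the-envelope comparison. The chunk and extract sizes below are made-up, illustrative numbers; only the 3c/6c per-1K-token GPT-4 rates come from the paragraph above:

```python
# Back-of-the-envelope cost comparison; the token counts are assumptions,
# only the GPT-4 rates (3c/1K input, 6c/1K output) come from the text above.
INPUT_RATE, OUTPUT_RATE = 0.03, 0.06  # dollars per 1K tokens

chunk_tokens = 1_000     # size of a retrieved chunk (assumed)
relevant_tokens = 300    # relevant portion the LLM would parrot back verbatim (assumed)
numbers_tokens = 10      # a terse reply like "sentences 2,5-7" (assumed)

parrot_cost = (chunk_tokens * INPUT_RATE + relevant_tokens * OUTPUT_RATE) / 1_000
numbering_cost = (chunk_tokens * INPUT_RATE + numbers_tokens * OUTPUT_RATE) / 1_000

print(f"parroting: ${parrot_cost:.4f} per chunk")    # $0.0480
print(f"numbering: ${numbering_cost:.4f} per chunk")  # $0.0306
```

(The numbering approach does add a few input tokens for the sentence-number annotations, but input tokens are the cheap side of the ledger and are processed in a single forward pass, whereas output tokens are generated one at a time -- which is where most of the latency savings come from.)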

I realized the best way (or at least a good way) to do this is to number the sentences and have the LLM simply spit out the relevant sentence numbers. Langroid’s unique Multi-Agent + function-calling architecture allows an elegant implementation of this in the RelevanceExtractorAgent: the agent annotates the docs with sentence numbers and instructs the LLM to pick out the numbers of the relevant sentences, rather than the sentences themselves, via a function call (SegmentExtractTool); the agent’s function-handler then interprets this message and extracts the indicated sentences by number. To extract from a set of passages, Langroid automatically runs this async + concurrently, so latencies in practice are much, much lower than with the sentence-parroting approach.
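For anyone who wants the gist without digging into Langroid internals, here is a minimal, library-agnostic sketch of the technique. The prompt wording and the `llm_complete` callable are placeholders of my own, not Langroid’s actual prompts or API; Langroid does this step via a function call (SegmentExtractTool) rather than free-form text:

```python
import re

def number_sentences(text: str) -> tuple[str, list[str]]:
    """Naively split text into sentences and prefix each with its index."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    numbered = " ".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))
    return numbered, sentences

def extract_relevant(passage: str, query: str, llm_complete) -> str:
    """Ask the LLM for relevant sentence NUMBERS only, then map back to text.

    `llm_complete(prompt) -> str` is a placeholder for whatever LLM client you use.
    """
    numbered, sentences = number_sentences(passage)
    prompt = (
        f"Passage with numbered sentences:\n{numbered}\n\n"
        f"Query: {query}\n"
        "Reply with ONLY the numbers of the sentences relevant to the query, "
        "comma-separated (e.g. 2,5). Reply NONE if nothing is relevant."
    )
    reply = llm_complete(prompt)
    if reply.strip().upper() == "NONE":
        return ""
    idxs = [int(n) for n in re.findall(r"\d+", reply)]
    return " ".join(sentences[i - 1] for i in idxs if 1 <= i <= len(sentences))
```

The LLM’s output is a handful of digits instead of whole sentences, which is where the speed and cost savings come from; a structured function call (as in Langroid) also makes the reply more reliable to parse than the regex above.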

[Full disclosure: I am the lead dev of Langroid]

I thought this numbering idea is fairly obvious in theory, so I looked at LangChain’s equivalent, LLMChainExtractor.compress_docs (they call this Contextual Compression), and was surprised to see it uses the simple “parrot” method, i.e. the LLM writes out whole sentences verbatim from its input. I thought it would be interesting to compare Langroid vs LangChain; you can see the comparison in this Colab.

On the specific example in the notebook, with GPT-4, the Langroid numbering approach is 22x faster than LangChain’s parrot method (LangChain takes 145 secs vs under 7 secs for Langroid) and 36% cheaper (~900 output tokens with LangChain vs ~40 with Langroid). (I promise the “parrot” name is not inspired by their logo :)

I wonder if anyone has thoughts on relevance extraction, or other approaches. At the very least, I hope Langroid’s implementation is useful to you – you can use DocChatAgent.get_verbatim_extracts(query, docs) as part of your pipeline, regardless of whether you are using Langroid for your entire system or not.
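If you want to try just this piece, here is a rough usage sketch. The import paths, config defaults, and Document/DocMetaData fields are written from memory and may differ across Langroid versions, so treat them as assumptions; only the get_verbatim_extracts(query, docs) call is the method referenced above:

```python
# Sketch only: import paths and Document/DocMetaData fields are assumptions
# and may differ by Langroid version; an OpenAI API key is assumed to be set.
from langroid.agent.special.doc_chat_agent import DocChatAgent, DocChatAgentConfig
from langroid.mytypes import Document, DocMetaData

agent = DocChatAgent(DocChatAgentConfig())

docs = [
    Document(
        content="Giraffes are tall. They eat mostly leaves. Lions hunt in groups.",
        metadata=DocMetaData(source="example"),
    ),
]

# Returns verbatim extracts of only the sentences relevant to the query,
# obtained via the sentence-numbering approach described in the post.
extracts = agent.get_verbatim_extracts("What do giraffes eat?", docs)
for d in extracts:
    print(d.content)
```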

  • spirobel@alien.top · 11 months ago

    just to double check: you embed the sentence numbers into the context, right?

    so the llm will see: “1: Giraffes have long necks. 2: They eat mostly leaves….”

    or does the llm learn by itself what sentence is what number?

    The general optimization behind this is to reduce the number of tokens to generate even at a slight increase in context size, correct?

    Wonder where the trade off here is … there are probably more tricks like this, but I assume at some point there will be diminishing returns, where the added context size makes it not worth it …

  • PopeSalmon@alien.top · 11 months ago

    yeah i thought of numbering sections too, i agree that is or should be obvious, just now it occurred to me what if you took an embedding of each sentence & compared those, intuitively it seems like you might be able to avoid calling a model at all b/c shouldn’t the relevant sentences just be closer to the search

    • SatoshiNotMe@alien.top (OP) · 11 months ago

      > intuitively it seems like you might be able to avoid calling a model at all b/c shouldn’t the relevant sentences just be closer to the search

      Not really, as I mention in my reply to u/jsfour above: Embeddings will give you similarity to the query, whereas an LLM can identify relevance to answering a query. Specifically, embeddings won’t be able to handle cross-references (e.g. “Giraffes are tall. They eat mostly leaves.”, where the second sentence only answers a question about giraffe diets if “They” is resolved), and won’t be able to zoom in on answers -- e.g. the President Biden question I mention there.