Paper: https://arxiv.org/abs/2311.11829
Abstract:
Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next token generations. To help rectify these issues, we introduce System 2 Attention (S2A), which leverages the ability of LLMs to reason in natural language and follow instructions in order to decide what to attend to. S2A regenerates the input context to only include the relevant portions, before attending to the regenerated context to elicit the final response. In experiments, S2A outperforms standard attention-based LLMs on three tasks containing opinion or irrelevant information, QA, math word problems and longform generation, where S2A increases factuality and objectivity, and decreases sycophancy.
The method in the paper is indeed simple and effective: removing irrelevant information through prompting. But is it necessary to dress up such a simple method with a fancy neuroscience term?
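Stripped of the branding, it's basically two LLM calls. A minimal sketch of that shape (the prompts are my paraphrase, not the paper's actual ones, and `call_llm` is a stand-in for whatever chat-completion client you use):

```python
def call_llm(prompt: str) -> str:
    """Placeholder: one chat-completion call to whatever LLM you use."""
    raise NotImplementedError

def s2a_answer(context: str, question: str) -> str:
    # Step 1: regenerate the context, keeping only what is relevant to the
    # question and dropping opinions / irrelevant details.
    regen_prompt = (
        "Rewrite the text below, keeping only the parts that are relevant to "
        "the question and dropping opinions and irrelevant information.\n\n"
        f"Text:\n{context}\n\nQuestion: {question}\n\nRelevant text:"
    )
    filtered_context = call_llm(regen_prompt)

    # Step 2: answer the question from the regenerated context only.
    return call_llm(
        f"Context:\n{filtered_context}\n\nQuestion: {question}\nAnswer:"
    )
```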
I hate that part but at the same time enjoy it.
Well, it’s more like a psychological term, and “attention” is already there to illustrate the intended meaning of a dot product. The analogy holds up, so why doubt the validity of using “System 2 attention” rather than that of using “attention” at all?
That was exactly my thought! In Langroid (the agent-oriented LLM framework from ex-CMU/UW-Madison researchers), we call it Relevance Extraction — given a passage and a query, use the LLM to extract only the portions relevant to the query. In a RAG pipeline where you optimistically retrieve the top-k chunks (to improve recall), the chunks could be large and hence contain irrelevant/distracting text. We do relevance extraction from these k chunks concurrently: https://github.com/langroid/langroid/blob/main/langroid/agent/special/doc_chat_agent.py#L801
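Roughly, that step looks something like this (a simplified sketch, not the actual Langroid code; `acall_llm` is just a stand-in for an async LLM call):

```python
import asyncio

async def acall_llm(prompt: str) -> str:
    """Placeholder: one async chat-completion call to whatever LLM you use."""
    raise NotImplementedError

async def extract_relevant(chunk: str, query: str) -> str:
    # Ask the LLM to pull out only the text in this chunk that bears on the query.
    prompt = (
        "From the passage below, extract verbatim only the parts relevant to "
        "the query. If nothing is relevant, reply NONE.\n\n"
        f"Passage:\n{chunk}\n\nQuery: {query}\n\nRelevant text:"
    )
    return await acall_llm(prompt)

async def extract_from_chunks(chunks: list[str], query: str) -> list[str]:
    # Run the extraction over all top-k retrieved chunks concurrently.
    return list(await asyncio.gather(*(extract_relevant(c, query) for c in chunks)))
```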
One thing often missed here is the unnecessary cost (latency and token cost) of parroting verbatim text back out of the context. In Langroid we use a numbering trick to mitigate this: pre-annotate the passage sentences with numbers, and ask the LLM to simply specify the relevant sentence numbers. We have an elegant implementation of this in our RelevanceExtractorAgent using tools/function-calling.
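The gist of the numbering idea, again as a rough sketch rather than the actual RelevanceExtractorAgent implementation: number the sentences locally, have the LLM return only the numbers, and map them back yourself, so the model never re-types the text.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder: one chat-completion call to whatever LLM you use."""
    raise NotImplementedError

def extract_by_sentence_numbers(passage: str, query: str) -> str:
    # Pre-annotate each sentence with a number (naive sentence split).
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    numbered = "\n".join(f"({i}) {s}" for i, s in enumerate(sentences, start=1))

    # The LLM only has to emit a few number tokens, not a verbatim copy.
    prompt = (
        "Below is a numbered passage and a query. Reply with the numbers of the "
        "sentences relevant to the query, comma-separated, or NONE.\n\n"
        f"{numbered}\n\nQuery: {query}\n\nRelevant sentence numbers:"
    )
    reply = call_llm(prompt)

    # Map the returned numbers back to the original sentences locally.
    picked = [int(n) for n in re.findall(r"\d+", reply)]
    return " ".join(sentences[i - 1] for i in picked if 1 <= i <= len(sentences))
```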
Here’s a post I wrote comparing Langroid’s method with LangChain’s naive equivalent of relevance extraction, `LLMChainExtractor.compress`, and, no surprise, Langroid’s method is far faster and cheaper:
https://www.reddit.com/r/LocalLLaMA/comments/17k39es/relevance_extraction_in_rag_pipelines/
If I had the time, the next steps would have been: 1. give it a fancy name, 2. post it on arXiv with a bunch of experiments, but I’d rather get on with building 😄