Hi,

I’m trying to understand all the stuff you’re talking about. I have no ambitions of actually implementing anything. And I’m rather a beginner in the field.

I have a few questions about retrieval-augmented generation (RAG):

As I understand it, RAG means that the shell around the LLM proper (say, the ChatGPT web app) uses your prompt to search a vector database for relevant documents. The database stores embeddings (vectors in a high-dimensional semantic, or “latent”, space). The shell gets the most relevant embeddings (encoded chunks of documents) and feeds them into the LLM as a (user-invisible) part of the prompt.
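To make that picture concrete, here is a minimal toy sketch of the retrieve-then-prompt flow. Everything in it is an assumption made for illustration: `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, and the names `retrieve` and `build_prompt` are hypothetical. It also assumes that the database keeps the original chunk text next to each vector and that this text is what ends up in the prompt; whether that assumption holds is exactly what question 2 below asks.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy "vector database": each entry keeps the vector AND the chunk text
# (assumption: the text is stored alongside the vector).
chunks = [
    "Solr is a search server built on Lucene.",
    "Embeddings map text to vectors in a latent space.",
    "Base64 encodes binary data as ASCII text.",
]
index = [(embed(c), c) for c in chunks]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the text of the k chunks whose vectors are closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question: str) -> str:
    """Prepend the retrieved chunk text to the user's question."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("what are embeddings in a latent space"))
```

In this sketch the vectors are used only to rank the chunks; the string returned by `build_prompt` is ordinary text that a wrapper could send to any LLM.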

  1. Why do we need embeddings here? We could use a regular text search, say in Solr, and stuff the prompt with human-readable documents. Is it just because embeddings compress the documents? Is it because the transformer’s encoder makes embeddings out of it anyway, so you can skip that step? On the other hand, having the documents user-readable (and usable for regular search in other applications) would be a plus, wouldn’t it?

  2. If we get back embeddings from the database, we cannot simply prepend the result to the prompt, can we? Because embeddings are something different from user input, they would need to “skip” the encoder part, right? Or can LLMs handle embeddings in the user prompt, as they sometimes seem to be able to handle base64? I’m quite confused here, because all those introductory articles seem to say that the retrieval result is prepended to the prompt, but that is only a conceptual view, isn’t it?

  3. If embeddings need to “skip” part of the LLM, doesn’t that mean that a RAG-enabled system cannot be a mere wrapper around a closed LLM, but that the LLM needs to change its implementation/architecture, if only slightly?

  4. What exactly makes RAG so difficult? I keep reading that RAG is very hard to get right, with chunking and tuning and so on. Is it difficult only because relevance search itself is difficult, so that it is just as hard as a regular text relevance search with result snippets, faceting and so on, or is there “more” difficulty on top of that? Especially regarding chunking: what is the conceptual difference between chunking for the vector database and breaking up documents for regular search?

Thank you for taking some time out of your Sunday to answer “stupid” questions!