Improving the Retrieval Step of RAG Systems
Basics of RAG
Retrieval Augmented Generation (RAG) is the go-to method for connecting Large Language Models (LLMs) to stores of knowledge such as document databases. Given a prompt from the user, a RAG system selects the document chunks most relevant to that query and attaches them to the prompt as context. As such, the LLM does not need to be retrained, and updates to the knowledge store do not require updates to the overall system. Furthermore, hallucinations are reduced because the LLM is instructed to answer the user prompt using the attached context.
The most important step in RAG is finding the relevant chunks of information for a given query. To find these chunks, we compare the user prompt to all document chunks. We do not compare the raw text; instead, we compute a text embedding for the prompt and for each chunk and compare those, typically with a distance or similarity measure. There are dedicated vector databases such as Pinecone or Chroma, but many other database systems also support vector search, e.g. Redis, PostgreSQL, and Elasticsearch, among others.
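The comparison step can be illustrated with a minimal sketch in plain Python. It assumes the embeddings have already been computed by some embedding model; the toy 3-dimensional vectors, the chunk ids, and the function names are made up for illustration, and a real system would delegate this search to a vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, chunk_embeddings, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), chunk_id)
        for chunk_id, embedding in chunk_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
chunk_embeddings = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "warranty": [0.7, 0.2, 0.3],
}
query_embedding = [1.0, 0.0, 0.1]

print(top_k_chunks(query_embedding, chunk_embeddings, k=2))
```

The returned chunk ids are the ones whose text would be attached to the prompt as context.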
However, just comparing out-of-the-box embeddings is often not good enough, as vital contextual information is missing. The following is a list of ways to improve the matching step.
For an introduction and a great overview of the benefits of RAG, see this Pinecone post.
Embedding Finetuning
In a first step, using a pre-trained embedding is perfectly fine. But in many cases, it can be beneficial to train an embedding that is aware of specific wordings, products or abbreviations, resulting in better retrieval.
For an overview of finetuning embeddings with LLaMA Index, please see https://gpt-index.readthedocs.io/en/latest/end_to_end_tutorials/finetuning.html#finetuning-embeddings-for-better-retrieval-performance
To finetune an embedding, we need a training set consisting of questions paired with the documents that should be retrieved for them. However, it is often expensive to create such a dataset by hand. An alternative is to generate synthetic questions from the knowledge store, for example by asking an LLM to write questions that each chunk answers. This does not require any human annotation, but can still improve the retrieval accuracy by 5-10%.
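As a rough sketch of how such a synthetic training set could be assembled: the code below pairs every chunk with generated questions. The function `build_training_pairs`, the record field names, and `fake_question_generator` are hypothetical; in practice the generator would be an LLM call, which is replaced here by a trivial stand-in so the example runs on its own.

```python
def build_training_pairs(chunks, generate_questions, questions_per_chunk=2):
    """Pair each chunk with synthetic questions that it should answer.

    `generate_questions(text, n)` is any callable returning n questions
    for the given chunk text (assumed to wrap an LLM in a real setup).
    """
    pairs = []
    for chunk_id, text in chunks.items():
        for question in generate_questions(text, questions_per_chunk):
            pairs.append({"question": question, "positive_chunk_id": chunk_id})
    return pairs


def fake_question_generator(text, n):
    """Stand-in for an LLM call; only here to keep the sketch runnable."""
    first_words = " ".join(text.split()[:4])
    return [f"Question {i + 1} about: {first_words}" for i in range(n)]


chunks = {
    "c1": "Returns are accepted within 30 days of purchase.",
    "c2": "Standard shipping takes three to five business days.",
}
training_pairs = build_training_pairs(chunks, fake_question_generator)
for pair in training_pairs:
    print(pair)
```

Each resulting record links a question to the chunk it was generated from, which is the kind of question/document pair an embedding finetuning run consumes.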
Note that the embedding does not have to be finetuned again every time we add new documents to the document database. Retraining the embedding might be beneficial in the following cases:
- We add different types of documents, for example if our previous source of documents only contained emails and we now want to extend the system with legal documents.
- We add new vocabulary. For example, if we add descriptions of new products with specific names, a pre-trained embedding might not understand those names and how they relate to user queries.
For technical tutorials on how to finetune an embedding, see https://github.com/run-llama/finetune-embedding.
Metadata Attachment
Attaching isolated chunks of text from a larger document often leads to a loss of contextual information. If you get a single paragraph from a 20-page document, it is difficult to make sense of that information without understanding the broader context, such as where in the document that text comes from. All of the following is relevant contextual information: document title, section name, subsection, overall structure and intent of the document, date, intended reader, etc. Much of this is lost if we simply extract text from a broader document.
Adding metadata can recover some of that lost context.
There are various sources of metadata that can be added:
- Source filename
- Date and time
- Document title, section title, subsection name
- Position in document
- Keywords such as product name, document type
There are multiple ways of automatically extracting metadata from documents; see the LLaMA Index functions for examples. There are also advanced LLM/NLP techniques to automatically extract and create keywords, such as RAKE (Rapid Automatic Keyword Extraction). However, these techniques tend to be more expensive to build, to run during the indexing of documents, and to maintain. For a great overview, see this blog post.
There are two ways of using the metadata for retrieval: either you include it when creating the embedding for each chunk, or you use it in a hybrid search approach (see next section). Either way, adding it as context for the LLM query will probably improve the generated response.
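The first option, including metadata when creating the embedding, can be as simple as prepending a small header to the chunk text before it is embedded. The following is a minimal sketch; the function name, the metadata keys, and the header layout are assumptions, not a format required by any particular library.

```python
def chunk_with_metadata(text, metadata):
    """Prepend non-empty metadata fields to a chunk as 'key: value' lines."""
    header_lines = [f"{key}: {value}" for key, value in metadata.items() if value]
    if not header_lines:
        return text
    # A blank line separates the metadata header from the chunk itself.
    return "\n".join(header_lines + ["", text])


# Hypothetical chunk and metadata for illustration.
enriched = chunk_with_metadata(
    "The device must be charged for two hours before first use.",
    {
        "source": "manual_v2.pdf",
        "document title": "Model X User Manual",
        "section": "Getting Started",
        "date": "",  # empty values are skipped
    },
)
print(enriched)
```

The enriched string can be used both as the input to the embedding model and as the context passed to the LLM.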
Hybrid Retrieval
Instead of relying only on the standard embedding and distance search, a more sophisticated retrieval that also uses classical ranking can be used. For an e-commerce store, this could be based on product keywords plus embeddings. For a generic application, classical keyword-based search can still be used alongside embeddings. For example, Azure reports that hybrid search outperforms embedding-only search.
Depending on your application, a hybrid approach might be a good option, but keep in mind that it adds complexity and cost.
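One common way to combine a keyword ranking with an embedding ranking is Reciprocal Rank Fusion (RRF). The sketch below is one possible fusion method, not the only one; the document ids and the two input rankings are made up, and the constant `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1 for the top result of that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword search and an embedding search.
keyword_ranking = ["doc_b", "doc_a", "doc_d"]
embedding_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused)
```

Because RRF only looks at ranks, it avoids having to put keyword scores and embedding distances on a common scale.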
Date and time information might also be highly relevant for some applications. For example, when dealing with emails, a system that weights chunks from an email by how recent they are will probably generate much more relevant content.
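A simple way to express such a recency weighting is to multiply the similarity score by an exponential decay. This is a sketch under assumptions: the function name, the 30-day half-life, and the example scores are illustrative and would need tuning for a real application.

```python
from datetime import datetime, timezone


def recency_weighted_score(similarity, sent_at, now, half_life_days=30.0):
    """Scale a similarity score so that it halves every `half_life_days`."""
    age_days = max((now - sent_at).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay


now = datetime(2024, 1, 31, tzinfo=timezone.utc)

# An older email with a slightly better match vs. a recent email.
old_email_score = recency_weighted_score(
    0.80, datetime(2024, 1, 1, tzinfo=timezone.utc), now
)
new_email_score = recency_weighted_score(
    0.70, datetime(2024, 1, 28, tzinfo=timezone.utc), now
)
print(old_email_score, new_email_score)
```

With these settings the recent email outranks the older one even though its raw similarity is lower.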
User Query Rephrasing and Augmentation
Instead of just calculating the embedding of the user's prompt directly, we can rephrase and augment it. This could mean adding keywords to the user's prompt or rewriting the prompt using an LLM. Rewriting the prompt also gives the option of generating multiple prompts, doing one retrieval per prompt, and then averaging or ranking the results, thereby creating a kind of 'ensemble' retrieval method. Of course, this adds cost, as we introduce an additional LLM inference step and multiple retrieval steps. For an example of rewriting user prompts, check out Metaphor, which rewrites user queries to optimized search queries.
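The 'ensemble' idea can be sketched as follows. Both `rewrite` (assumed to wrap an LLM) and `retrieve` (assumed to wrap the vector search) are passed in as callables and replaced by hard-coded stand-ins here; the voting scheme is just one simple way to merge the results.

```python
def multi_query_retrieve(user_query, rewrite, retrieve, top_k=3):
    """Retrieve with the original query plus rewrites and merge by votes."""
    queries = [user_query] + rewrite(user_query)
    votes = {}      # chunk id -> number of queries that returned it
    best_rank = {}  # chunk id -> best (lowest) rank seen across queries
    for query in queries:
        for rank, chunk_id in enumerate(retrieve(query)):
            votes[chunk_id] = votes.get(chunk_id, 0) + 1
            best_rank[chunk_id] = min(best_rank.get(chunk_id, rank), rank)
    # Most votes first; ties are broken by the best rank.
    ranked = sorted(votes, key=lambda c: (-votes[c], best_rank[c]))
    return ranked[:top_k]


# Hypothetical stand-ins so the sketch runs without an LLM or a database.
FAKE_SEARCH_RESULTS = {
    "how do i return shoes": ["returns", "shipping"],
    "return policy footwear": ["returns", "sizing"],
    "refund for shoes": ["refunds", "returns"],
}


def fake_rewrite(query):
    return ["return policy footwear", "refund for shoes"]


def fake_retrieve(query):
    return FAKE_SEARCH_RESULTS.get(query, [])


merged = multi_query_retrieve("how do i return shoes", fake_rewrite, fake_retrieve)
print(merged)
```

Chunks returned by several rewrites rise to the top, which makes the result less sensitive to the exact wording of the original prompt.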
More advanced techniques also exist, such as HyDE (Hypothetical Document Embeddings), which takes the user query, has an LLM generate a hypothetical document that answers it, and then searches the database for documents similar to that generated document.
Query Routing
Instead of using just one single index, multiple indices can be used, each one for a specific type of user question. E.g. one could handle summarization questions, another product specific queries, and a third one technical support questions. An initial LLM step will detect the best route from the user query.
Query routing is not limited to retrieval: different routes can also use different LLMs! For example, for more difficult tasks, the more expensive gpt-4 could be used, while for simpler tasks, gpt-3.5 or an even smaller model might give good enough results, thereby saving costs.
See the LLaMA Index documentation and the LangChain documentation on how to implement query routing.
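Independent of any framework, the routing idea can be sketched in a few lines. Everything here is hypothetical: the route names, index names, and model names are placeholders, and the keyword-based `classify_query` merely stands in for the initial LLM step that would detect the route in a real system.

```python
from dataclasses import dataclass


@dataclass
class Route:
    index_name: str  # which index to retrieve from
    model_name: str  # which LLM answers queries on this route


ROUTES = {
    "summarization": Route("summary_index", "large-model"),
    "product": Route("product_index", "small-model"),
    "support": Route("support_index", "large-model"),
}


def classify_query(query):
    """Stand-in for the LLM routing step: a crude keyword heuristic."""
    lowered = query.lower()
    if "summar" in lowered:
        return "summarization"
    if any(word in lowered for word in ("error", "install", "crash")):
        return "support"
    return "product"


def route_query(query, classify=classify_query):
    """Pick the index and model for a query based on its detected type."""
    return ROUTES[classify(query)]


print(route_query("Please summarize the Q3 report"))
print(route_query("The app shows an error after install"))
print(route_query("Does the blue jacket come in size M?"))
```

Swapping `classify` for an LLM-backed function changes how the route is detected without touching the rest of the pipeline.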
Conclusion
There are many ways to improve the retrieval step of a RAG system. It is usually best to keep it simple initially, and not plan for improvements until it becomes clear what will actually work best. As such, I advocate starting with a vanilla approach and collecting experience first. Prompting and manually looking at retrieved chunks usually gives great insights into what is missing during the retrieval step. Is contextual information missing? Then add meta information. Is the retrieval working for one group of prompts but not another? Then add query routing. Does the retrieval not work for some specific wording? Maybe finetuning an embedding can help.
Improving the retrieval will need a lot of experimentation. As such, it is important that you define a consistent way to evaluate experiments and potential improvements.
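As one possible starting point for such an evaluation, the sketch below computes two common retrieval metrics, hit rate at k and mean reciprocal rank (MRR), over a small labelled set. The query ids, chunk ids, and the assumption of exactly one relevant chunk per query are simplifications for illustration.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = sum(
        1 for query, ranked in results.items() if relevant[query] in ranked[:k]
    )
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the relevant chunk (0 if it was not retrieved)."""
    total = 0.0
    for query, ranked in results.items():
        if relevant[query] in ranked:
            total += 1.0 / (ranked.index(relevant[query]) + 1)
    return total / len(results)


# Hypothetical retrieval output and ground truth for three test queries.
results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
}
relevant = {"q1": "a", "q2": "e", "q3": "z"}

print(hit_rate_at_k(results, relevant, k=2))
print(mean_reciprocal_rank(results, relevant))
```

Running the same test set before and after each change makes it possible to compare experiments on equal terms.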
