Munshi Alam

Chat with Documents with Improved Response Accuracy


Chatting with documents using natural language is one of the most sought-after enterprise use cases for LLMs. The short name for this problem is RAG - Retrieval Augmented Generation. However, RAG offers little transparency into what it retrieves, and you cannot know in advance which questions the system will encounter. As a result, valuable information can get lost amid a mass of irrelevant text, which is not ideal for a production-grade application.


Techniques for improving RAG performance:


After almost a year of building with LLMs, I have learned many techniques to improve RAG performance and have summarized some of my lessons here. In this section, I will go over a few tested techniques:

  1. Adding additional info in the header or footer of the chunk

  2. Adding metadata in each chunk

  3. Adding summarized info in each chunk

  4. Use Langchain's "Parent Document Retrieval" with two sets of chunk sizes (see the sketch after this list)
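
Since the fourth technique is the least self-explanatory, here is a minimal sketch of Langchain's ParentDocumentRetriever: small child chunks are embedded for precise vector search, while the larger parent chunks they belong to are what actually gets handed to the LLM. The Chroma vector store, OpenAIEmbeddings, and the chunk sizes below are my own assumptions, so adapt them to your stack and Langchain version.

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Two sets of chunk sizes: small chunks for accurate vector search,
# large "parent" chunks for richer context at generation time.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

vectorstore = Chroma(collection_name="resumes", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # holds the full parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)  # `docs` is your list of loaded Documents
relevant_parents = retriever.get_relevant_documents("Which candidates know Python?")
```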


Pre-retrieval Step:

Despite the recent tremendous interest in applying NLP to a wider range of real-world applications, most NLP papers, tasks, and pipelines assume raw, clean text. However, many texts we encounter in the wild are not so clean; many of them are visually structured documents (VSDs) such as PDFs. Conventional preprocessing tools for VSDs have mainly focused on word segmentation and coarse layout analysis. PDFs are versatile and preserve the visual integrity of documents, but they often pose a significant challenge when it comes to extracting and manipulating their contents.


We have all heard of “garbage in, garbage out”. It applies to RAG as well, yet many people ignore this step and focus on optimizing everything that comes after this crucial initial step. You cannot simply extract text from your documents, throw it into a vector database, and expect reliable, accurate answers. Extraction of the text and tables from the documents has to be semantically accurate and coherent.


Here is an example from my own experience. I had 10 resumes from different candidates. The candidate's name appears at the beginning of each resume; the rest of the pages (assume each resume is two pages long) make no mention of it.

In this case, chunks may lose that information when we split up each resume using some fixed chunk size. One easy way to solve this is to add the missing information (e.g. the candidate's name) to each chunk as a header or a footer, as in the sketch below.
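
Here is a minimal sketch of that idea, which also covers technique 2 (metadata). The helper name and chunk sizes are illustrative, not taken from my actual pipeline.

```python
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_resume(resume_text: str, candidate_name: str) -> list[Document]:
    """Split one resume and repeat the candidate's name in every chunk."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return [
        Document(
            # Header keeps the name visible to the LLM even in later chunks.
            page_content=f"Candidate: {candidate_name}\n{chunk}",
            # Metadata (technique 2) lets you filter or rerank by candidate later.
            metadata={"candidate_name": candidate_name},
        )
        for chunk in splitter.split_text(resume_text)
    ]
```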


The second technique is chunk optimization. Based on your downstream task, you need to determine the optimal chunk length and how much overlap you want between chunks. If a chunk is too small, it may not include all the information the LLM needs to answer the user’s query; if it is too big, it may contain too much irrelevant information, which reduces vector search accuracy, confuses the LLM, and may sometimes not even fit into the context window.


From my own experience, you don’t have to stick to one chunk optimization method for all the steps in your pipeline. For example, if your pipeline involves both high-level tasks like summarization and low-level tasks like coding based on a function definition, you could use a bigger chunk size for summarization and smaller chunks for coding references, as in the sketch below.
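
A rough sketch of per-task chunk sizes; the exact numbers and the idea of indexing the two sets into separate collections are assumptions on my part.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Larger chunks keep whole sections together for summarization.
summary_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Smaller chunks roughly match a single function definition for coding reference.
code_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)

summary_chunks = summary_splitter.split_documents(docs)  # `docs`: your loaded Documents
code_chunks = code_splitter.split_documents(docs)
# Index each set into its own vector store collection and route queries by task.
```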


When your query is such that the LLM needs to search many documents and then return a list of documents as the answer, it is better to use the similarity_search_with_score search type.
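
With a Langchain vector store this looks roughly like the following; the query, the k value, and the 0.4 threshold are placeholders, and note that for distance-based stores such as Chroma a lower score means a closer match.

```python
# `vectorstore` is any Langchain vector store (e.g. the Chroma instance built earlier).
results = vectorstore.similarity_search_with_score("candidates with AWS experience", k=10)

# Each result is a (Document, score) pair. For Chroma the score is a distance,
# so smaller means more similar; keep only the reasonably close matches.
shortlist = [doc for doc, score in results if score < 0.4]
for doc in shortlist:
    print(doc.metadata.get("candidate_name"), "-", doc.page_content[:80])
```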


If your query requires the LLM to perform a multi-step search to arrive at an answer, you can add the instruction "Think step by step" to the prompt. This helps the engine break the query down into multiple sub-queries.
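
A minimal sketch of such a prompt; the exact wording, the llm object, and the variable names are illustrative only.

```python
from langchain.prompts import ChatPromptTemplate

decompose_prompt = ChatPromptTemplate.from_template(
    "Think step by step. Break the following question into the smaller "
    "sub-questions you would need to answer first, one per line.\n\n"
    "Question: {question}"
)

# `llm` is whichever chat model you have configured (e.g. ChatOpenAI).
# Each returned sub-question can then be sent through retrieval separately.
sub_questions = (decompose_prompt | llm).invoke({"question": user_question}).content
```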


Post-retrieval Step:

After you retrieve the relevant chunks from your database, there are still some more techniques to improve the generation quality. You can use one or more of the following techniques based on the nature of your task and the format of your text chunks.

If your task depends mostly on one specific chunk, one commonly used technique is reranking or scoring. As I mentioned earlier, a high score in vector similarity search does not mean the chunk will always be the most relevant one. You should do a second round of reranking or scoring to pick out the text chunks that are actually useful for generating the answer. For reranking or scoring, you can ask the LLM to rank the relevance of the documents, or you can use other methods like keyword frequency or metadata matching to refine the selection before passing those documents to the LLM to generate a final answer.
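
Here is a sketch of LLM-based reranking. The 0-10 scale, the keep count, and the llm object are my own choices for illustration, not a fixed recipe.

```python
def rerank_with_llm(question: str, docs, llm, keep: int = 3):
    """Ask the LLM to score each retrieved chunk, then keep only the best ones."""
    scored = []
    for doc in docs:
        prompt = (
            "On a scale of 0 to 10, how useful is the following text for answering "
            f"the question?\nQuestion: {question}\nText: {doc.page_content}\n"
            "Reply with a single number."
        )
        reply = llm.invoke(prompt).content.strip()
        try:
            score = float(reply)
        except ValueError:
            score = 0.0  # unparseable reply -> treat the chunk as irrelevant
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:keep]]
```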



Balancing quality and latency

There are also some other tips that I have found useful for improving and balancing generation quality and latency. In actual production, your users may not have time to wait for a multi-step RAG process to finish, especially when there is a chain of LLM calls. The following choices may help if you want to improve the latency of your RAG pipeline.


The first is to use a smaller, faster model for some steps. You don’t necessarily need to use the most powerful model (which is often the slowest) for all the steps in the RAG process. For example, for some easy query rewriting, generation of hypothetical documents, or summarizing a text chunk, you can probably use a faster model (like a 7B or 13B local model). Some of these models may even be capable of generating a high-quality final output for the user.
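
A sketch of that split; the model names and the Ollama/OpenAI pairing are placeholders rather than a recommendation.

```python
from langchain_community.chat_models import ChatOllama
from langchain_openai import ChatOpenAI

small_llm = ChatOllama(model="mistral:7b")  # fast local model for the cheap steps
big_llm = ChatOpenAI(model="gpt-4o")        # stronger model for the final answer

# `user_question` and `context` come from earlier steps in the pipeline.
# Cheap step: rewrite the user's query so retrieval works better.
rewritten_query = small_llm.invoke(
    f"Rewrite this search query to be more specific: {user_question}"
).content

# Expensive step: generate the final answer from the retrieved context.
answer = big_llm.invoke(
    f"Answer the question using only this context:\n{context}\n\nQuestion: {user_question}"
).content
```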

