Data Science
On a quest to build production-grade LLM apps
Samuel Surulere
July 23, 2024
6 min

After completing a successful transition into data science, I noticed a growing buzz around generative AI and LLMs in particular. To improve my chances of landing a job, it became apparent that acquiring skills in this field would serve me well. That prediction turned out to be timely: my first job as a data scientist required me to develop RAG architectures for interacting with PDF and CSV files. In this blog, I will give a high-level overview of what I have learned so far about advanced Retrieval-Augmented Generation (RAG) techniques. My knowledge of RAG architectures largely started with watching YouTube videos. After learning about different use cases of LLMs, the first project I worked on was a naive RAG pipeline (as of early February 2024). Back then, I didn't even know that the technical name for that use case was RAG. That generative AI chatbot served as the backend of a mobile app for helping newcomers settle into Calgary, providing curated resources so they could take care of everything they needed to do after arriving. OpenAI's GPT models are trained on data up to around 2022, so the RAG architecture essentially extends the knowledge of the GPT model by exposing it to specific documents it has never seen. Hence the "augmented" part of the pipeline: we are augmenting the knowledge of the LLM.

That generative AI app won second place at Calgary's biggest hackathon, an achievement strong enough to inspire me to keep developing skills in building LLM apps. I have since worked on a few other projects involving conversational chains over PDF files (question-and-answering chatbots). LangChain is the framework I have mostly used to build RAG pipelines, and I recently began working with LlamaIndex, which I found to be an awesome library. I also built a proof-of-concept summarization bot that picks up information from a CSV file and summarizes a sports event (a water-cooler kind of ideation). All of my work on RAG pipelines so far has been at the prototype stage; none of it made it to production. The tech stack I am most familiar with is LangChain, OpenAI's GPT-3.5 model, and Hugging Face. I have also played around with open-source models, such as a 4-bit quantized version of Llama 2 through the LlamaCPP framework, but the model took too long to answer queries (about a minute on average) even on a machine with a 16GB GPU. I tried some other open-source models from Hugging Face as well but didn't make much progress.

In recent weeks, I needed to move beyond building prototype LLM apps and build a production-grade app. The naive RAG process I am familiar with goes like this (a minimal code sketch follows the list):

  1. Ingesting and loading the data (PDF, CSV, or text file) using a PDF or CSV reader from LangChain (or an external library).
  2. Splitting the loaded data into chunks using the RecursiveCharacterTextSplitter from LangChain to work within the context window limitation of LLMs.
  3. Converting the document chunks into vector embeddings with an encoder model (I mostly used the text-embedding-ada-002 model from OpenAIEmbeddings).
  4. Storing the vector embeddings in a vector store (FAISS or Chroma) for computationally efficient retrieval.
  5. Setting up the LLM (usually ChatOpenAI).
  6. Overriding the default prompt so the model returns answers only from the given context, and crafting the prompt to reduce the possibility of hallucinations (also helped by setting the LLM's temperature to 0).
  7. Instantiating the question-and-answering chain using ConversationalRetrievalChain, along with a ConversationBufferMemory so the bot remembers previous chat history and uses it as additional context.
  8. Invoking the chain to test functionality. In the backend, the user's query is encoded into a vector and a similarity search is performed against the vector store; the document chunks with the highest semantic similarity to the query are retrieved and passed to the LLM.
  9. Deploying the modular code (Python scripts) with Streamlit or as an API backend (POST request).
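Below is a minimal sketch of that naive pipeline using the classic LangChain APIs I was working with in early 2024. The file name, chunk sizes, and query are hypothetical placeholders, and the snippet assumes an OPENAI_API_KEY is available in the environment.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# 1-2. Load the PDF and split it into overlapping chunks.
docs = PyPDFLoader("example_report.pdf").load()  # hypothetical file
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# 3-4. Embed the chunks and store them in a FAISS vector store.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 5-7. LLM, conversational memory, and the question-answering chain.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
)

# 8. Invoke the chain: the query is embedded, a similarity search runs against
# the vector store, and the retrieved chunks are passed to the LLM as context.
result = chain.invoke({"question": "What does the document say about settling into Calgary?"})
print(result["answer"])
```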

This naive RAG pipeline has limitations, including low-quality retrieval and hallucinations, which can breed skepticism and leave users unsatisfied once the app is in production. Due to the stochastic nature of machine learning models (even advanced ones), hallucination is a hard problem to solve; it can persist despite carefully crafted prompts and a temperature of 0. That can easily lead to customer churn, which is bad for the product. The obvious next step was to study how to address these issues in production, which led me to advanced RAG techniques. It turns out some of these techniques were proposed as far back as November 2023. The goal of most of them is to improve the retrieval part of the RAG process, and they are often grouped into pre-retrieval and post-retrieval techniques. In LangChain, examples include the multi-query retriever, the parent document retriever, the self-query retriever, and the contextual compression retriever (reranking). In LlamaIndex, some advanced techniques include the child-parent recursive retriever, sentence window retrieval plus a sentence reranker, and auto-merging retrieval plus a sentence reranker. These do not in any way constitute the full set of advanced techniques; I mention them because I was able to experiment with them and obtain preliminary results. More advanced techniques include Self-RAG, Corrective RAG, and Adaptive RAG, among others. The extensive research and the many articles published in the quest to improve the RAG process show how important RAG is to the advancement of LLMs.

Another reason for my deep dive into advanced RAG techniques was that I wanted a way to monitor the LLM app in production. Traditional machine learning models are evaluated with well-known metrics (accuracy, precision, recall, area under the curve, Hamming loss, among others) to understand how robust the training process was and how well the trained model is likely to perform in production. But I had not been evaluating the RAG pipelines of the LLM apps I developed; I mostly treated them as black boxes until I recently came across the RAGAs and LangSmith frameworks. It turns out that metrics for evaluating the performance of LLM apps do exist. Rummaging through the official RAGAs documentation was really interesting: I learned about metrics like faithfulness, answer relevancy, context precision, context relevancy, context recall, answer semantic similarity, and answer correctness, among a few others. I also came across the LangSmith ecosystem, which makes monitoring an LLM app very convenient. One only needs to create an account, create an API key, save the key into a .env file, and set tracing to true; after that, every call made to the encoder model or the LLM is automatically tracked on the LangSmith projects page. Lastly, I stumbled on a library that can optimize the pre-retrieval step (loading the data). Unstructured is an API that extracts complex data and transforms it into formats compatible with the major vector databases and LLM frameworks. Depending on the quantity of data, this API is not free.
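For reference, this is roughly what the LangSmith setup looks like in code: a minimal sketch assuming you already have a LangSmith API key saved in a .env file; the project name is a hypothetical placeholder.

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads LANGCHAIN_API_KEY (and OPENAI_API_KEY) from the local .env file

os.environ["LANGCHAIN_TRACING_V2"] = "true"           # turn tracing on
os.environ["LANGCHAIN_PROJECT"] = "advanced-rag-poc"  # hypothetical project name

# From this point on, every call made through LangChain (embedding model,
# retriever, LLM, chains) is logged as a trace on the LangSmith projects page.
```

And here is a small sketch of scoring a single sample with some of the RAGAs metrics mentioned above, using the 0.1-style RAGAs API; the question, answer, contexts, and ground truth are made-up placeholders.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Each row pairs a user question with the chain's answer, the retrieved
# contexts, and a reference ("ground truth") answer.
eval_dataset = Dataset.from_dict({
    "question": ["What services are available for newcomers to Calgary?"],
    "answer": ["Newcomers can access settlement agencies and language classes."],
    "contexts": [["Calgary offers settlement agencies, language classes and housing support."]],
    "ground_truth": ["Settlement agencies, language classes and housing support."],
})

scores = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # one aggregate score per metric
```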

To improve the proof-of-concept LLM apps I had built in the past, I started experimenting with the Parent Document Retriever and the Contextual Compression Retriever. The Parent Document Retriever creates two levels of chunks from the loaded data: larger parent chunks are created first, and smaller child chunks are then created from them. The small chunks are the ones embedded and searched, which increases the precision of retrieval; the tradeoff is that a small chunk on its own gives the LLM limited context. So when the similarity search over the vector database finds a match among the child chunks, the parent chunk containing that child chunk is returned instead. The Contextual Compression Retriever is essentially a reranking approach: it takes the documents returned by the base (naive) retriever and rescores them for relevance, returning the top_n documents specified when the cross-encoder model used for the scoring is instantiated. For example, if the base retriever returns the 10 most relevant documents, the cross-encoder can rescore them and return only the top 3. The drawback is that costs can grow as the number of calls made to the cross-encoder model increases. Comparing the retrieved documents and the generated answers for the same user query, I noticed that both methods gave better retrieval and better output than naive RAG. Out of curiosity, I tried combining the Parent Document Retriever and the Contextual Compression Retriever, but the results were no better than either technique on its own. I also experimented with the multi-query retriever. This technique takes the user's query and generates several variants of the original question (five, in my setup); the documents that answer these generated queries are retrieved by the final chain, the chain combines the queries and their corresponding retrieved documents, and the LLM then processes the combined documents and generates an output based on the aggregated information presented to it. Minimal sketches of these three retrievers follow below.
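These sketches again use LangChain, with hypothetical chunk sizes, model names, and queries; they assume the `docs`, `llm`, and OpenAI embeddings from the naive pipeline sketch earlier, and the cross-encoder model shown is just one example of a reranker.

```python
from langchain.retrievers import ContextualCompressionRetriever, ParentDocumentRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.storage import InMemoryStore
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# --- Parent Document Retriever: search small child chunks, return the parent chunk ---
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
child_vectorstore = Chroma(collection_name="child_chunks", embedding_function=embeddings)
parent_retriever = ParentDocumentRetriever(
    vectorstore=child_vectorstore,      # holds the embedded child chunks
    docstore=InMemoryStore(),           # holds the full parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
parent_retriever.add_documents(docs)    # docs loaded as in the naive pipeline

# --- Contextual Compression Retriever: rerank the base retriever's results ---
base_vectorstore = Chroma.from_documents(child_splitter.split_documents(docs), embeddings)
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base"),  # example cross-encoder
    top_n=3,                            # keep only the 3 highest-scoring documents
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# --- Multi-Query Retriever: the LLM rewrites the question into several variants ---
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=base_vectorstore.as_retriever(),
    llm=llm,                            # ChatOpenAI instance from the naive pipeline
)

query = "What services are available for newcomers to Calgary?"
for retriever in (parent_retriever, compression_retriever, multi_query_retriever):
    print(type(retriever).__name__, len(retriever.invoke(query)))
```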

There are several more approaches that I am hoping to experiment with, and I am also working on combining some of them into a single pipeline that avoids conflicts and optimizes the entire RAG pipeline for both retrieval and generation. In the not-too-distant future, I plan to turn the Jupyter notebooks into modular code (Python scripts), which will be easier to deploy and to monitor with the LangSmith ecosystem. Thank you for reading, and do feel free to share the advanced RAG techniques that you have found most efficient, both computationally and in terms of retrieval quality.

Some helpful resources

  1. https://python.langchain.com/v0.2/docs/concepts/#retrieval
  2. https://medium.com/@kbdhunga/advanced-rag-multi-query-retriever-approach-ad8cd0ea0f5b
  3. https://towardsdatascience.com/advanced-retriever-techniques-to-improve-your-rags-1fac2b86dd61
  4. https://medium.com/@ranadevrat/build-a-chatbot-with-advance-rag-system-with-llamaindex-opensource-llm-flask-and-langchain-1bf875be3ec6
  5. https://unstructured.io/
  6. https://medium.aiplanet.com/advanced-rag-using-llama-index-e06b00dc0ed8
  7. https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6
  8. https://towardsdatascience.com/advanced-rag-01-small-to-big-retrieval-172181b396d4
  9. https://docs.ragas.io/en/stable/getstarted/index.html
  10. https://www.langchain.com/langsmith
  11. https://medium.com/dscier/build-production-grade-llm-applications-using-langsmith-e526c8e9eb3a
  12. https://medium.com/aiguys/rag-2-0-retrieval-augmented-language-models-3762f3047256
  13. https://towardsdatascience.com/17-advanced-rag-techniques-to-turn-your-rag-app-prototype-into-a-production-ready-solution-5a048e36cdc8
  14. https://blog.langchain.dev/query-transformations/
