RAG Implementation: Ground LLM Answers on Your Private Data
Retrieval-Augmented Generation (RAG) is an architecture that improves the answers generated by an LLM by supplementing it with custom data. This enables GenAI applications to provide grounded answers based on private data without the need for fine-tuning.
There is a lot to unpack in these statements, so I'll do my best to explain what it all means and how we apply it.
Why do we need RAG?
LLMs (Large Language Models) like GPT-4, Mistral and Claude are very powerful deep learning models trained on massive amounts of (public) data. This means these models have no knowledge of anything outside their training data, such as your private data or events that happened after they were trained. Another pitfall of LLMs is that they are always very confident, even in matters they have no data on, which results in hallucinations when they generate text about subjects they know little about.
Imagine asking ChatGPT to help you write a blog post about a recent achievement using your personal tone of voice. It won't be able to do this properly, because the achievement occurred after the training data cut-off date and it has no knowledge of your tone of voice. To fix this we need to provide the LLM with this data.
Supplying LLMs with private or more recent data can be done in roughly two ways:
Training on the desired data
Adding the desired data to your prompt when asking a question
Training a new LLM or fine-tuning an existing one can get very expensive quickly, and it still cannot offer the most recent data because (re)training takes time. For these reasons, adding the data to your prompt has become the industry standard, and this is exactly what RAG does!
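To make this concrete, here is a minimal sketch of what "adding the data to your prompt" boils down to; the context snippet and the question are made up purely for illustration.

# A hypothetical snippet retrieved from your own data.
retrieved_context = (
    "In March 2024 our team won the internal hackathon with a tool "
    "that summarises customer calls."
)

question = "Help me write a short blog post about our recent hackathon win."

# The retrieved context is simply prepended to the user question.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {question}"
)

print(prompt)  # This combined prompt is what gets sent to the LLM.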
How does RAG work?
We now know that RAG basically means retrieving information relevant to the user query and adding it to the prompt so the LLM can generate a more relevant answer. This information can come from any data source imaginable, such as a web search or a database lookup. The only requirement is that the source can return relevant information based solely on the user query.
Vector databases are used a lot nowadays. These databases specialise in being a retriever for semantic search. Semantic search is a technique where relevant information is looked up by searching for text that is semantically similar to the user query. This works by creating an embedding (a long list of numbers, a.k.a. a vector) for every piece of information and for the user query, and comparing these embeddings to see which ones are closest to the query. How this works deserves its own blog post, but if you're interested this article is a good place to start.
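As a rough illustration of the idea, the sketch below embeds a few documents and a query with an embedding API and ranks the documents by cosine similarity. The model name and example texts are assumptions, and a real setup would use a vector database instead of an in-memory list.

import numpy as np
from openai import OpenAI  # assumes the openai Python package and an API key

client = OpenAI()

documents = [
    "Our office is located in Amsterdam.",
    "The quarterly revenue grew by 12 percent.",
    "Employees get 25 vacation days per year.",
]
query = "How many days off do I get?"

# One embedding (vector) per text; the model name is an assumption.
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents + [query],
)
vectors = [np.array(item.embedding) for item in response.data]
doc_vectors, query_vector = vectors[:-1], vectors[-1]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by how semantically close they are to the query.
ranked = sorted(
    zip(documents, doc_vectors),
    key=lambda pair: cosine_similarity(pair[1], query_vector),
    reverse=True,
)
print(ranked[0][0])  # the most relevant document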
Implementation
To bring all these steps together and create a good LLM experience for the user, we need the following parts:
Vector database
A vector database contains the information you want to add to the user query. In the architecture picture below you can see we have chosen PGVector as our vector database (a PostgreSQL extension). It's a great place to start if you already have a Postgres server running somewhere in your stack, but there are many other great vector database options like Weaviate, Pinecone or Qdrant. You can find a nice comparison between different options here.
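As a reference point, a minimal pgvector setup could look something like the sketch below; the connection string, table name and embedding dimension are assumptions for illustration.

import psycopg2  # assumes psycopg2 and the pgvector extension are installed

conn = psycopg2.connect("postgresql://user:password@localhost:5432/ragdb")
cur = conn.cursor()

# Enable the extension and create a table for chunks plus their embeddings.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- dimension depends on your embedding model
    );
""")
conn.commit()

# Retrieve the 5 chunks closest to a query embedding (cosine distance).
query_embedding = [0.0] * 1536  # placeholder; produced by your embedding model
embedding_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5;",
    (embedding_literal,),
)
for (content,) in cur.fetchall():
    print(content)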
Workflow orchestrator
Why a workflow orchestrator, you might ask? Well, that's because we need to fill and update the vector database somehow. This means that every once in a while we need a process to connect to our data sources and fetch all documents that have been added or updated. These documents are then cut up into “chunks” and sent to an embedding API to generate a vector. These vectors are then stored in the database together with the chunks.
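A single run of such an ingestion job could look roughly like the sketch below; the paragraph-based chunking, the embedding model and the table layout (matching the pgvector sketch above) are all assumptions.

import psycopg2
from openai import OpenAI  # any embedding API would work here

client = OpenAI()
conn = psycopg2.connect("postgresql://user:password@localhost:5432/ragdb")
cur = conn.cursor()

def chunk_text(text, max_chars=1000):
    # Naive chunking: split on paragraphs and cap the chunk size.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p[:max_chars] for p in paragraphs]

def ingest_document(text):
    chunks = chunk_text(text)
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model
        input=chunks,
    )
    for chunk, item in zip(chunks, response.data):
        embedding = "[" + ",".join(str(x) for x in item.embedding) + "]"
        cur.execute(
            "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector);",
            (chunk, embedding),
        )
    conn.commit()

# The orchestrator (Airflow, Dagster, cron, ...) would call this for every new or updated document.
ingest_document(open("example_document.txt").read())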
Web application
Finally, a web application is used as the interface for the user. We have used the LangChain framework to quickly connect to the required LLMs and vector databases. By using an existing framework you don't have to reinvent the wheel and build a whole host of connections yourself, and can focus more on the fun stuff like prompt engineering and LLM agents.
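A bare-bones version of such a retrieval-and-answer flow could look like the sketch below. Import paths and constructor arguments differ between LangChain versions, so treat the exact names, the connection string and the model names as assumptions rather than the definitive API.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import PGVector

# LangChain's PGVector wrapper manages its own tables in the same Postgres server.
vectorstore = PGVector(
    connection_string="postgresql+psycopg2://user:password@localhost:5432/ragdb",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="documents",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini")

def answer(question: str) -> str:
    # Retrieve the most relevant chunks and stuff them into the prompt.
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content

print(answer("How many vacation days do employees get?"))

The k parameter controls how many chunks end up in the prompt, which ties directly into the chunking trade-offs discussed below.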
Challenges
After implementing all these parts, you are definitely not guaranteed amazing results from the start. Depending on your use case and dataset, some further tuning might be required. In our developments we have already encountered the following challenges:
Extracting proper metadata. Sometimes relevant information is not contained in the document text itself but in its metadata, such as where the document is stored. Extracting these extra pieces of metadata and using them in the embedding can really improve search results.
Chunking strategies. Since LLM context windows are finite, it is sometimes necessary to split larger documents into chunks. But finding a good way to do this and a target chunk size remains an art, as it may depend on how many chunks you want to add to the prompt and how large the context window is. Also, for different types of text (a Word document vs. a file containing Python code), using different methods to determine where to split a document can prevent incoherent chunks; see the sketch after this list.
Prompt engineering. The behaviour of an LLM can be influenced a lot by the prompt that is used. Finding a good prompt for your situation can be quite challenging.
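As a small illustration of the chunking challenge mentioned above, the sketch below splits prose and Python code with different strategies using LangChain's text splitters; the chunk size, overlap and file names are assumptions you would tune for your own data.

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Plain prose: try paragraph, sentence and word boundaries before cutting mid-text.
prose_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
prose_chunks = prose_splitter.split_text(open("report.txt").read())

# Python source: prefer class and function boundaries over blank lines.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=500, chunk_overlap=0
)
code_chunks = code_splitter.split_text(open("pipeline.py").read())

print(len(prose_chunks), len(code_chunks))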
Conclusion
I hope this gives a bit more insight into the workings of a RAG architecture. If you’d like to know more about RAG, LLMs or how you can apply this in your context, you can contact me at nathan@wolk.work.