An Expanded Explanation of RAG, Vector Search, and how it is implemented on IRIS in the IRIS RAG App

#ai #chatgpt #javascript #python

I received some really excellent feedback from a community member on my submission to the Python 2024 contest. I hope it’s okay if I repost it here:

you build a container more than 5 times the size of pure IRIS

and this takes time

container start is also slow but completes

backend is accessible as described

a production is hanging around

frontend reacts

I fail to understand what is intended to show

the explanation is meant for experts other than me

The submission is here: https://openexchange.intersystems.com/package/IRIS-RAG-App

I really appreciate this feedback, not least because it is a great prompt for an article about the project. This project includes fairly comprehensive documentation, but it does assume familiarity with vector embeddings, RAG pipelines, and LLM text generation, as well as Python and certain popular Python libraries, like LlamaIndex.

This article, written completely without AI, is an attempt to explain those things and how they fit together in this project to demonstrate the RAG workflow on IRIS.

The container is large because the library dependencies needed for the python packages involved in creating vector embeddings are very large. It is possible that through more selective imports, the size could be cut down considerably.

It does take time to initially build the container, but once you have done so, it takes less time to start it. The startup time could still definitely be improved. The main reason startup takes so long is that entrypoint.sh was updated with the assumption that any part of the application might have changed since the last startup, including the database migrations, CSS configurations, JavaScript configurations, and the Python backend code, so it recompiles the entire project every time it starts up. This makes it easier to get started with developing on this project, since otherwise it can be tricky to properly run the frontend and backend builds whenever changes are made. This way, if you change any of the code in the project, you just need to restart the container, maybe recover the production in the backend, and your changes should be reflected in the interface and operation of the application.

I am fairly sure that the production in the backend is what passes the HTTP requests to the Django application, and that it is crucial to the interoperability in this package. I am new to the IRIS platform, however, and have more to learn about productions.

Next I’d like to provide a comprehensive explanation of vector embeddings, LLMs, and RAG. The first of these to be invented was the vector embedding. First we can describe a vector. In most contexts a vector is a direction. It’s an arrow pointing somewhere in space. More formally, a vector is “a quantity having direction as well as magnitude”. This could be exemplified by a firework, which travels in a particular direction and explodes at a particular point in space. Let’s say every firework is fired from the same central point, a point of origin, [0,0,0], but they all fly out and explode in a cloud around that origin point. Mathematically, you could describe the location of each firework explosion using a three-coordinate system, [x,y,z], and that would be a “vector embedding” for a firework explosion. If you took lots of video of a fireworks display and recorded all of the firework explosions as a dataset, you would be creating a kind of vector embedding database, or vector store, of the fireworks display.

What could you do with that information about the fireworks display? If I pointed out a particular firework and asked for the fireworks that exploded closest to the same point throughout the entire display, you could find those other fireworks that exploded at nearby points in space. You just find the ones that are closest, and there’s math to do that.
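As a minimal sketch of that math, the snippet below uses straight-line (Euclidean) distance to find the explosions nearest to a chosen firework. The firework names and coordinates are made up for illustration.

```python
import math

# Hypothetical explosion points recorded as [x, y, z],
# with the launcher at the origin [0, 0, 0].
fireworks = {
    "red": [10.0, 40.0, 5.0],
    "green": [12.0, 42.0, 4.0],
    "blue": [-30.0, 35.0, 20.0],
    "gold": [11.0, 38.0, 6.0],
}

def nearest(name, points, k=2):
    """Return the k other fireworks that exploded closest to `name`."""
    target = points[name]
    # Pair each other firework with its straight-line distance to the target.
    distances = [
        (math.dist(target, point), other)
        for other, point in points.items()
        if other != name
    ]
    # Sorting the pairs puts the smallest distances first.
    return [other for _, other in sorted(distances)[:k]]

print(nearest("red", fireworks))  # ['gold', 'green']
```

Here "gold" is about 2.4 units from "red" and "green" is 3 units away, so they are returned in that order, while "blue" is far off on the other side of the launcher.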

Remember, we only recorded three numbers for each firework, the x, y, and z coordinates in a three dimensional space with [0,0,0] being the firework launcher on the ground.

What if I wanted to also know the firework that exploded both closest in distance, and closest in time to another particular firework? To know that, we would have to go back through our video footage of the fireworks display and record the time of each explosion as well. Now we have a 4-dimensional vector with 4 numbers: the three dimensional position of the firework explosion and the time of the explosion. Now we have a more descriptive type of embedding for the firework display by adding another dimension to our vector embeddings.
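The effect of the extra dimension can be sketched the same way. The records below are hypothetical [x, y, z, t] values, with t in seconds since the show started. Note that this sketch simply treats a second as if it were the same size as a unit of distance, which is an assumption made here for illustration; in practice you would have to decide how to scale the dimensions against each other.

```python
import math

# Hypothetical [x, y, z, t] records: position plus time of explosion.
fireworks_4d = {
    "red": [10.0, 40.0, 5.0, 12.0],
    "green": [12.0, 42.0, 4.0, 13.0],
    "gold": [11.0, 38.0, 6.0, 95.0],
}

def closest_to(name, points, dims):
    """Find the closest other firework using only the first `dims` coordinates."""
    target = points[name][:dims]
    return min(
        (other for other in points if other != name),
        key=lambda other: math.dist(target, points[other][:dims]),
    )

print(closest_to("red", fireworks_4d, dims=3))  # gold  (position only)
print(closest_to("red", fireworks_4d, dims=4))  # green (position and time)
```

With position alone, "gold" is the nearest neighbor of "red". Once time is included, "gold" exploded more than a minute later, so "green" becomes the closest match.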

How does this translate to machine learning? Well, long story short, by processing a huge amount of text data, computer scientists managed to create embedding models that can take a piece of text like a phrase, sentence, paragraph, or even a page, and turn it into a very long series of numbers that represents a point in a theoretical high-dimensional space.

Instead of 4 numbers, there are 300, or 700, or even 1500. These represent 1500 ways in which one piece of text can be “close” or “far” away from another, or 1500 dimensions of meaning. It’s a captivating concept for many that we have the means to create numbers that represent in some way the semantic meaning of a piece of text.

Using math, two of these high-dimensional text vector embeddings can be compared to find out how similar or “close” they are to one another, provided they were created by the same model.
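One common way to do that comparison is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below assumes cosine similarity for illustration; this article does not say which measure the app uses. The short 4-number vectors are invented stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        # Embeddings from different models often have different lengths
        # and cannot be compared directly.
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional "embeddings" (real ones have hundreds of numbers).
same = cosine_similarity([3.0, 4.0, 0.0, 0.0], [3.0, 4.0, 0.0, 0.0])
unrelated = cosine_similarity([3.0, 4.0, 0.0, 0.0], [0.0, 0.0, 2.0, 5.0])
print(same, unrelated)
```

Identical vectors score 1, and vectors with nothing in common score 0. The length check reflects the point above: two embeddings can only be compared if they have the same number of dimensions, which in practice means they came from the same model.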

That’s the first thing that happens in this app. The user must put in a document and name it, and then choose a type of embedding. The server takes that document, breaks it into text chunks, turns each of those chunks into a vector embedding, and saves each chunk as a row in a dedicated table for that document. Each document is stored in its own dedicated table because different text embedding models create vector embeddings of different lengths.
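The shape of that ingestion step can be sketched in a few lines. This is not the app's code: `chunk_text`, `toy_embed`, and `build_rows` are hypothetical helpers, the chunks are split by a fixed word count for simplicity, and `toy_embed` is only a stand-in that counts letters. A real embedding model would go where `toy_embed` is called.

```python
def chunk_text(text, words_per_chunk=50):
    """Split text into consecutive chunks of at most `words_per_chunk` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

def toy_embed(chunk, dims=8):
    """Stand-in for a real embedding model: counts letters into `dims` buckets.

    It carries no semantic meaning; it only produces a fixed-length vector.
    """
    vector = [0.0] * dims
    for ch in chunk.lower():
        if ch.isalpha():
            vector[ord(ch) % dims] += 1.0
    return vector

def build_rows(text, words_per_chunk=50):
    """Produce one row per chunk, as might be stored in a per-document table."""
    return [
        {"chunk_id": i, "text": chunk, "embedding": toy_embed(chunk)}
        for i, chunk in enumerate(chunk_text(text, words_per_chunk))
    ]

rows = build_rows("To be, or not to be, that is the question", words_per_chunk=4)
for row in rows:
    print(row["chunk_id"], repr(row["text"]), len(row["embedding"]))
```

Every row for a given document ends up with an embedding of the same length, because one model produced all of them. A different model could produce a different length, which is the reason given above for one table per document.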

Once a document is stored in the database as vector embeddings, the user can enter a query to “ask” the document. The query is used in two ways. The first way is to search the document. We don’t do a traditional text search; instead, we do a “vector search”. The app takes the query, turns it into a vector embedding, and then finds the sections of the document with embeddings that are most similar to the query vector embedding. A similarity score between 0 and 1 is then generated for every document section, and several sections are retrieved from the vector database based on the top_k_similarity and similarity_threshold settings. Basically, you can tell it how many document sections to retrieve, and how similar they must be to your query to qualify for retrieval.
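Here is a rough sketch of how those two settings could work together, done in plain Python rather than in the database. The parameter names `top_k_similarity` and `similarity_threshold` come from the app; the `retrieve` function, the use of cosine similarity, and the tiny 3-number embeddings are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def retrieve(query_embedding, sections, top_k_similarity=3, similarity_threshold=0.5):
    """Return up to top_k_similarity (score, text) pairs that meet the threshold."""
    scored = [
        (cosine_similarity(query_embedding, section["embedding"]), section["text"])
        for section in sections
    ]
    # Keep only sections similar enough to qualify, best match first.
    qualifying = [pair for pair in scored if pair[0] >= similarity_threshold]
    qualifying.sort(key=lambda pair: pair[0], reverse=True)
    return qualifying[:top_k_similarity]

# Hypothetical document sections with made-up 3-dimensional embeddings.
sections = [
    {"text": "A villain plots betrayal", "embedding": [0.9, 0.1, 0.0]},
    {"text": "A storm on the heath", "embedding": [0.1, 0.9, 0.2]},
    {"text": "Treachery at court", "embedding": [0.8, 0.3, 0.1]},
    {"text": "A letter arrives", "embedding": [0.0, 0.2, 0.9]},
]
query_embedding = [1.0, 0.0, 0.0]

for score, text in retrieve(query_embedding, sections, top_k_similarity=2):
    print(round(score, 2), text)
```

Raising `similarity_threshold` makes retrieval stricter, and lowering `top_k_similarity` caps how many sections come back even when many qualify.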

That’s the Retrieval in Retrieval Augmented Generation. The next step is the generation.

Once computer scientists figured out how to convert text to semantically significant numeric vector embeddings, the next step was to create models that could produce text. They did so with great success, and now we have Large Language Models like GPT-4, Llama 3, and Claude 3.5. These LLMs can take a prompt, or query, and deliver a completion, or answer, which is the text the model judges most likely to continue from the prompt.

LLMs must be trained on large amounts of text data, and their responses, or completions, are limited to that training data. When we want the LLMs to provide completions that might include data not in their training sets, or ground their completions in a particular set of knowledge, one way to do that is to include extra contextual data in the prompt. Basically, if we want an answer from an LLM about something it wasn’t trained on, we have to give it the information in the prompt.

Many people found themselves in a situation in which they wished ChatGPT or their local Llama installation could provide answers based on their own personal documents. It’s simple enough to search your documents for that information, paste it into the prompt, and put in your question, and people found themselves doing it manually. That is its own form of Retrieval Augmented Generation. RAG is just the automation of finding information relevant to the user query and providing it with the query to the LLM for a more accurate or useful response.

In this app, the document sections we retrieve with the vector search are sent with the query to the chosen LLM, labeled in the interface as the Model, to provide context for the answer.
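Combining the retrieved sections with the query amounts to building one larger prompt string. The sketch below shows the idea only: `build_prompt` and its wording are hypothetical, not the template the app or its libraries actually use, and the sample sections are invented.

```python
def build_prompt(query, retrieved_sections):
    """Combine retrieved document sections and the user's query into one prompt."""
    context = "\n\n".join(
        f"[Section {i}]\n{text}"
        for i, text in enumerate(retrieved_sections, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Hypothetical retrieved sections, standing in for the vector search results.
prompt = build_prompt(
    "Who is the villain in this play?",
    ["A villain plots betrayal", "Treachery at court"],
)
print(prompt)
```

The resulting string is what would be sent to the LLM, so the model sees the relevant document sections before it sees the question.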

In the video example I made for this project, I ask the question “Who is the villain in this play?” with the documents “Hamlet” and “King Lear”, which contain the entire text of the two Shakespeare plays. The IRIS database already has two tables, one for Hamlet, and the other for King Lear. Each table is filled with rows of vector embeddings created from splitting the text of each play into sections. These embeddings are long series of numbers representing the many dimensions of meaning in each of the document sections.

The server converts the question “Who is the villain in this play?” into a numeric vector using the same text-to-vector model that generated the vector embeddings for King Lear, and finds the sections in the King Lear table that are most similar to it. These are probably sections that mention the word “villain”, yes, but possibly also sections about other villainous things, such as treachery, betrayal, and deceit, even if villainy is not explicitly mentioned. These document sections are added to the query and sent together as a prompt to an LLM, which then answers the question based on the provided document sections.

This is done separately for each document, and this is why the answer to the query is different depending on the document being queried. This completes the acronym, since we are Augmenting the Generation of our answer from the LLM with the Retrieval of relevant context information using the power of vector search.

Many thanks to anyone who takes the time to read this, and I would be happy to expand on any of these topics in a future article. Feedback is always welcome.