Almost regardless of what you're doing with AI, you'll need Vector Similarity Search (VSS). VSS is what enables Retrieval Augmented Generation (RAG), which lets you "seed" the LLM with context. It's similar to what happens when you upload a document to ChatGPT and ask questions about it. This allows you to create LLMs that answer questions based upon your context. Below is an example.
The point of the above, of course, is that if you go to ChatGPT and ask the same question, it'll probably answer Sam Altman or something similar. With RAG context data, however, it answers based upon the context you provide.
How RAG works
When a user asks a question to our AI chatbot, the middleware will first create embeddings of the question itself. It will then use these embeddings to perform a VSS search through our RAG database and extract somewhere between 1,000 and 50,000 tokens of context, depending upon how it's configured.
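To make the retrieval step concrete, here's a minimal sketch in TypeScript of how a middleware could embed the question and rank stored snippets with cosine similarity. It assumes OpenAI's embeddings API and a simple in-memory list of pre-embedded snippets; a real RAG database (such as Magic Cloud, described below) would use a proper vector index instead of a linear scan, and the model name and token budget are examples only.

```typescript
// Minimal sketch of the retrieval step: embed the question, then rank stored
// snippets by cosine similarity. A real RAG database would use a vector index
// rather than a linear scan; model name and token budget are examples only.
type Snippet = { text: string; embedding: number[] };

async function embed(text: string): Promise<number[]> {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  });
  const json = await response.json();
  return json.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the most relevant snippets for a question, up to a rough token budget.
async function retrieveContext(
  question: string,
  snippets: Snippet[],
  maxTokens = 4000
): Promise<string> {
  const questionEmbedding = await embed(question);
  const ranked = snippets
    .map(s => ({ ...s, score: cosineSimilarity(questionEmbedding, s.embedding) }))
    .sort((a, b) => b.score - a.score);
  const context: string[] = [];
  let used = 0;
  for (const snippet of ranked) {
    const approxTokens = Math.ceil(snippet.text.length / 4); // ~4 characters per token
    if (used + approxTokens > maxTokens) break;
    context.push(snippet.text);
    used += approxTokens;
  }
  return context.join("\n\n");
}
```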
Once we've got context, we create a list of messages resembling the following, illustrated in the sketch after the list.
- System message
- Contextual RAG information retrieved using VSS
- Historical messages
- The user's question
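Below is a sketch of how that message list could be assembled and sent to the LLM, assuming OpenAI's chat completions API. The context string is assumed to have been produced by the retrieval step above, and the model name is just an example.

```typescript
// Sketch of how the message list is assembled before invoking the LLM.
// Assumes OpenAI's chat completions API; the context string comes from the
// retrieval step sketched earlier.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

async function answerQuestion(
  question: string,
  history: ChatMessage[],
  context: string
): Promise<string> {
  const messages: ChatMessage[] = [
    // 1. System message, instructing the model to only answer from context.
    {
      role: "system",
      content:
        "You are a company chatbot. Answer ONLY using the context provided below. " +
        "If the answer is not in the context, say you don't know.",
    },
    // 2. Contextual RAG information retrieved using VSS.
    { role: "system", content: `Context:\n${context}` },
    // 3. Historical messages from the conversation.
    ...history,
    // 4. The user's question.
    { role: "user", content: question },
  ];

  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages }),
  });
  const json = await response.json();
  return json.choices[0].message.content;
}
```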
Notice: we prompt engineer the LLM to only answer the question using the existing context. This almost completely eliminates AI hallucinations, in addition to ensuring the LLM knows "everything" about your company. Below is another example from one of our AI chatbots.
This process allows us to deliver AI chatbots based upon your company's data, answering questions about your company 24/7.
FYI, the above screenshot also demonstrates how we can deliver AI chatbots that display images.
Magic Cloud, a RAG Database Server
If you want to create any type of AI agent or AI chatbot today, you'll need to create some sort of RAG database. Building such a RAG database is tedious, complex, and requires a lot of time. That's why we released Magic Cloud as open source.
Basically, you can clone and set up Magic as your RAG database server, and create your own inference logic on top of it. This eliminates the second step from the above list, allowing you to focus on your integration code and your inference invocations. The process for a chat message then becomes as follows.
- Attach system message
- Retrieve RAG from Magic Cloud
- Invoke the LLM
- Return the response
The point being that step number 2 is basically 80% of the workload.
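As a sketch, the whole flow could then look as follows. Here, retrieveContextFromMagic() is a hypothetical wrapper around Magic's retrieval endpoint (sketched in the next section), and answerQuestion() is the message-assembly sketch from earlier.

```typescript
// Sketch of the four steps above. retrieveContextFromMagic() is a hypothetical
// wrapper around Magic's retrieval endpoint (sketched further down), and
// answerQuestion() is the message-assembly sketch from earlier.
async function handleChatMessage(
  question: string,
  history: ChatMessage[]
): Promise<string> {
  // Steps 1 and 2: the system message is attached inside answerQuestion(),
  // and the RAG context is retrieved from Magic Cloud.
  const context = await retrieveContextFromMagic(question);

  // Step 3: invoke the LLM with system message, context, history and question.
  const answer = await answerQuestion(question, history, context);

  // Step 4: return the response to the caller.
  return answer;
}
```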
Importing RAG data
Magic Cloud supports importing all sorts of documents:
- CSV files
- XML files
- PDF files
- Raw images
- Basically "everything" ...
In addition, it also contains a plugin called "utilities". This plugin contains a couple of important HTTP endpoints related to RAG.
- POST snippet - Imports a piece of text into your RAG database
- POST snippets-match - Returns RAG context to you
Both endpoints require a JWT token, which you can create by clicking "username/Generate Token" in your dashboard.
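Below is a sketch of how you could invoke these two endpoints from TypeScript. The exact paths, payload field names, and response shape are assumptions here, so verify them against the endpoints listed in your own Magic dashboard; MAGIC_URL and MAGIC_JWT are placeholder environment variables.

```typescript
// Assumed environment variables: MAGIC_URL points at your Magic Cloud server,
// MAGIC_JWT holds the token generated from "username/Generate Token".
const MAGIC_URL = process.env.MAGIC_URL ?? "";
const MAGIC_JWT = process.env.MAGIC_JWT ?? "";

// POST snippet - imports a piece of text into your RAG database.
// NB: the path and payload field names below are assumptions; verify the exact
// endpoint signature in your own Magic dashboard.
async function importSnippet(text: string): Promise<void> {
  await fetch(`${MAGIC_URL}/magic/modules/utilities/snippet`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${MAGIC_JWT}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ snippet: text }),
  });
}

// POST snippets-match - returns RAG context matching the question.
// Same caveat: path, payload and response shape are assumptions.
async function retrieveContextFromMagic(question: string): Promise<string> {
  const response = await fetch(`${MAGIC_URL}/magic/modules/utilities/snippets-match`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${MAGIC_JWT}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ prompt: question }),
  });
  const json = await response.json();
  return typeof json === "string" ? json : JSON.stringify(json);
}
```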
This allows you to use Magic Cloud as a "RAG server", while building your own GUI and inference logic.
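As a hypothetical end-to-end usage example tying the sketches above together, with made-up example data:

```typescript
// Hypothetical end-to-end usage, with made-up example data.
async function demo() {
  await importSnippet("Our CEO is Jane Doe, and our headquarters are in Oslo.");
  const answer = await handleChatMessage("Who is your CEO?", []);
  console.log(answer); // should answer based upon the imported context
}
```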
Wrapping up
Before Magic, you had to use Pinecone or something similar. Such vector databases are often expensive, without any self-hosting opportunities. With Magic you can self-host your own RAG server, significantly reducing latency, in addition to saving money.
However, creating such a RAG server from scratch is expensive, takes a lot of time, and isn't really something you should focus on either. With Pinecone and other vendors, you also have to invoke OpenAI's embeddings API yourself. With Magic Cloud, everything is encapsulated behind a couple of cohesive HTTP endpoints. The only thing you need to do is to integrate the above two endpoints and configure Magic with your OpenAI API key.
If you're interested in trying out Magic, you can find its open source repository below.
If you want a hosted solution, you can read more about it here.

