Richard Kovacs

Posted on Jun 4, 2024 • Originally published at richardkovacs.dev

Three Key Concepts to Build Future-Proof AI Applications

#ai #rag #embedding #finetuning

Current AI models won't reach AGI. Not even larger ones. GPT-5, GPT-6, Llama 4, Llama 5, Gemini Ultra Pro Max, it doesn't matter. We are currently living in the age of mixed computing. Deterministic computing is still the dominant type, as the majority of humanity isn't even aware of the capabilities of probabilistic computing, aka Artificial Intelligence.

As the age of commercial chatbots has just started, many feel that current state-of-the-art language models aren't capable enough to remove significant weight from our shoulders. Hallucinations are frequent, calculations are incorrect, and running inference on problems that don't require AI just because it is the buzzword nowadays is expensive compared to running deterministic algorithms.

Current chatbots are simply not powerful enough. However, future models will also be insufficient, as they will just combine and rephrase information from their training set faster and better.

But what if they don't even have to be more powerful? What if I told you that a model today is just enough if you know how to give it some steroids? And these steroids are RAG, fine-tuning, and function calling.

This post is not about the how but the why. I won't go into how to fine-tune a model, embed documents, or add tools to the model's hands because each is a large enough topic to cover in a separate post later. What I want to answer below is the why.

"That's Just an AI Wrapper"

Many indie developers have heard at least one of the following phrases:

"Your product is just a frontend for ChatGPT."
"That's just an AI wrapper."
"Why would I use your product if Llama is free and open source?"

And so on. This post is not about deciding whether an AI wrapper is a valid product. Although there are indeed apps that are really just a better frontend before the OpenAI API, I want to point out a different kind. I am speaking about the apps that offer much more than a better UI and some folders for your prompts. These are the apps that will survive the next OpenAI release or the emergence of a better model.

In fact, these apps will be among the first to utilize it. When an app offers more than a better design that OpenAI will probably ship sometime in the future, it is protected against future releases. When a language model is not the product itself but just a tool to augment something happening in the background, it becomes harder replaceable.

The following points also answer comments from larger companies, such as "Why should I start using ChatGPT? It just steals my company data." or "Llama is our preference since you can self-host it, but it isn't powerful enough for our use cases yet."

Well, it is already powerful enough. You just lack some information about the best ways to augment it. Lucky for you, this post contains exactly what you need.

Fine-Tuning

Depending on how long you have used ChatGPT or any other chatbot, you probably have developed a perfect sense to detect whether it wrote a piece of text or was human-written. By the way, the same is true for image models, but let's stay at language models for now. Another useful test to conduct on raw models is using them in languages other than English. They probably won't be as capable since less information was found on the internet written in that language when the providers trained the model.

But this must not necessarily be the case. With fine-tuning, you can change the default style of the model to suit your needs better. Since I am Hungarian, I have plenty of use cases requiring a fine-tuned model for the Hungarian language. Since it is an extremely rare language (only official in Hungary), the sources on the internet that can be used for training are minimal compared to English. Large providers probably won't care about improving their global models in Hungarian, but they usually offer the ability to fine-tune them for ourselves.

A fine-tuned Hungarian GPT-4 model would probably handle Hungarian questions much better than the base model. The same is true for programming languages. Let's say you want to build a coding assistant for Python programmers. Meta did exactly this with Code Llama. This model will perform much better in answering Python-related questions than the Llama foundation model.

But languages are not the only thing you can fine-tune for. The standard guideline is that if you require a specific style from your model, you should fine-tune it for that style. Some examples are:

Sarcasm
Formal style
Short answers
Viral tweets
Texts without the word "delve"

This is a non-exhaustive list. Of course, you could write these in the system prompt as well, but you would waste countless precious tokens (not to mention the cost) in every message. When you fine-tune a model, it will inherently know the style you want to achieve without further prompting. You achieve better results for less money.

A fine-tuned model is also less susceptible to new model releases. If your application's added value lies in a well-constructed training set, you can easily fine-tune a new model on it and switch to that one when released.

Fine-tuning will help you tailor the style of any model, but it won't extend its knowledge. You need something else for that.

Retrieval Augmented Generation (RAG)

Public chatbots have plenty of knowledge about the world. However, one thing they don't have access to is private documents. Whether they are your private files or the internal files of the company you work for, these files could not have been part of any commercial model's training set because they are inaccessible on the open internet. And unless you don't know about Retrieval Augmented Generation (RAG), you might think that the time of personal and private company assistants is still far away.

With RAG, this hypothesis quickly becomes false. Imagine that you have a bunch of internal software documentation, financial statements, legal documents, design guidelines, and much more in your company that employees frequently use. Since these are internal documents, any commercial chatbot without RAG would be unusable for any question regarding these files. However, you can give the model access to these documents with RAG.

First, you have to embed every document into a vector database. Then, when a user asks something, related sentences from the embedded documents can be retrieved with the help of the same embedding model that was used to embed them. In the next step, these sentences must be injected into the model's context, and voilà, you just extended a foundation model's knowledge with thousands of documents without requiring a larger model or fine-tuning.

Of course, you can combine these if you want. It makes perfect sense to do so. When you have a dataset ready for fine-tuning and a knowledge base embedded in a vector database, the model you use in the background matters less. Let's say you use a fine-tuned GPT-3.5 and have 1000 documents embedded. Then, OpenAI releases GPT-4. Two things can happen in this case:

Either your use case is so unique that the fine-tuned model performs much better than GPT-4 or
You fine-tune the new model on your dataset and switch to it immediately.

In neither case did you have to change your embedding logic since a different model handles that (an embedding model). Also, in neither of the cases did the new model's release pose any danger to your application, regardless of whether it is an internal or a SaaS application.

At this point, hopefully, I could convince you that smaller models with some extensions can be more than enough for a variety of use cases. Also, these use cases can be completely outside the normal capabilities of state-of-the-art foundation models.

However, you might still think that today's models lack a crucial feature: the ability to interact with the outside world. This is what the next section is about.

Function Calling

I first encountered function calling in the OpenAI API, but today, they aren't the only ones offering this capability. In fact, you could also try it yourself with a small model. Just write a prompt that tells the model to return a JSON object that you will use to call a function in the next step.

Yes, OpenAI doesn't really call any function with their models. The way function calling works is that they fine-tuned some of their models to recognize when they face a problem for which a better tool is available. The available tools are functions that you, the developer, wrote and provided the documentation for. When the model decides it is time to call a function for a given task, it will return a specific message containing the function's name to call and its parameters. What you do with that information is up to you, but your implementation will probably pass these parameters to the chosen function. Then, you have to pass the function's response back to the model, based on which it will create the answer.

Here is a simplified example of a message history with function calls.

You ask the model for the price of Bitcoin.
The model has access to a function named get_price(symbol). The model will return a message telling you to call the get_price function with "BTC" as the symbol. It will also give you a unique ID that represents this function call.
You call the function in your application. It returns 64,352. You must add a message to the message history with the unique ID from the previous step and the returned price, 64,352. The model will know from the unique ID that this number answers its previous question.
Based on the previous three messages, the model answers: "The current price of Bitcoin is $64,352."

This is how a typical function calling scenario looks like with a simple tool or function. When the model has access to more tools, it may return multiple tool calls, and your job is to call each function and provide the answers. Note that the model never calls any function. It is your job to do so. What the model is capable of depends on your implementation. You can write your functions with the least possible privileges (as you should), and the model won't cause any trouble.

Putting It All Together

Let's stop for a moment and consider the implications of the above examples. Let's say you are using a simpler model like GPT-3.5. (Crazy how GPT-3.5 is now considered a simpler model, right?) Many more capable models are out there, but you are still using GPT 3.5. It has no access to the internet, has a knowledge cutoff a few months back in the past, speaks too vaguely, and cannot do anything else than provide textual answers. Sounds pretty limited compared to the newest alternatives.

Unless...

We give it some superpowers in the form of the above concepts. I currently have an idea in my pipeline that is ideal for demonstration purposes. It's time to build a financial analyst chatbot.

GPT 3.5 out of the box is pretty far from this dream. My first step was to add some tools in its hand to fetch real-time market information such as the actual price of stocks, dividends, well-known ratios, financial statements, analyst recommendations, etc. I could implement this for free since the yfinance Python module is more than enough for a simple purpose like mine.

At this point, the model could tell from the numbers the actual state of each company. The amount of information available for the model was only dependent on me since the API can handle 128 functions, more than enough for most use cases.

However, one key input of financial analysis was still missing: news. For that, I needed something up-to-date and free for experimentation. I found Tavily to be the perfect candidate because it has access to real-time information from the web, such as news and blog articles. By the way, Tavily uses RAG under the hood.

One last flaw in my application is that the answers are too vague. It doesn't provide specific insights; it just summarizes what it retrieves with the configured functions and RAG. Also, it usually never answers direct questions like "Which stock should I buy out of these two"? This behavior is the result of OpenAI's training. It looks like one of the key points of alignment is that it won't provide financial advice no matter how you ask. Of course, I could write a long system prompt to convince it to answer, but sending it at the start of every conversation would probably cost more in the long run than creating a fine-tuned model that behaves exactly as I desire.

Fine-tuning is still ahead of me, but that is the missing piece in my chatbot. When I have my dataset ready, it won't matter if OpenAI releases a stronger model next week. I can still use Tavily for news, and the new model will still be able to call my functions. I can still fine-tune it with my dataset to fit my needs.

This way, the app will offer much more than just a chat frontend, meaning it will be future-proof for longer than simple API wrappers. Any application utilizing these techniques will quadruple the underlying model's capabilities.

Also, Tavily is just one specific example that is ideal for my use case. It's not the only one and definitely won't suit everyone's needs. There are other providers out there. You could also build a vector database and integrate RAG into your application using your custom records.

Case Studies

The concepts I explained above aren't new. Two excellent examples of apps that utilize at least some of them are Perplexity and Consensus. In my honest opinion, Perplexity nailed the combination of language model answers and web browsing. It uses RAG under the hood and probably also utilizes function calls and fine-tuning.

Consensus is well-known for providing scientific paper references to user questions. It also provides a consensus based on multiple retrieved papers. As you probably have guessed, they also use RAG in the background.

Neither of these apps is in danger of future model releases because their use case is very specific, and both of them offer much more than the underlying model's raw answer.

There is literally no limit to what you can already do with current models if you are creative enough.

Let's build!

Forem

Three Key Concepts to Build Future-Proof AI Applications

"That's Just an AI Wrapper"

Fine-Tuning

Retrieval Augmented Generation (RAG)

Function Calling

Putting It All Together

Case Studies

Top comments (0)

Read next

A Practical Guide to Reducing LLM Hallucinations with Sandboxed Code Interpreter

Just Launched RobinReach: Multi-Channel Social Media Management 🚀

A beginner's guide to the Remove-Bg model by Lucataco on Replicate

🚀 Amazon Nova: AWS's New Foundation Model for GenAI🤖