Kirk Kirkconnell

Train it or feed it? Teaching LLMs your data the smart way

I received an interesting question during a webinar yesterday, and I wanted to do some research to explore the topic further without getting too detailed. The question was, and I am paraphrasing, "Which is best: training or fine-tuning an LLM with specific data to create a custom LLM, or using Retrieval-Augmented Generation (RAG) with your application?" If you’re not familiar with RAG, it refers to the process of retrieving data from an external source, such as a database (e.g., MongoDB) or an external API, and sending it to an LLM to augment its existing knowledge. The LLM then uses that data to generate its response.

Training or Fine-Tuning with Specific Data to Create a Custom LLM

Using this method, you either train a model from scratch with your data (which is rare due to the cost and level of effort) or fine-tune an existing LLM with your data.

Fine-tuning an LLM

The specifics of fine-tuning are beyond the scope of this post, but essentially, you continue training an existing LLM on data it wouldn’t otherwise have seen, adjusting its weights (and various training settings) so that its predictions better align with the patterns in your data. With that, you get an LLM customized for your needs. As with everything, there are trade-offs, so let’s get into those.
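To make that a bit more concrete, here is a minimal sketch of parameter-efficient fine-tuning (LoRA) using the Hugging Face transformers, peft, and datasets libraries. The base model name, the my_domain_data.jsonl file, and every hyperparameter below are placeholder assumptions, not recommendations:

```python
# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft + datasets.
# The base model, data file, and hyperparameters below are placeholders, not recommendations.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-3.2-1B"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model)
# LoRA trains small adapter matrices instead of updating every weight in the model.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Your domain-specific text, one {"text": "..."} record per line (hypothetical path).
dataset = load_dataset("json", data_files="my_domain_data.jsonl", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    # Pads batches and copies input_ids into labels for causal (next-token) training.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("my-custom-adapter")  # only the small adapter weights are saved
```

The point of a parameter-efficient approach like this is that you adapt an existing model to your data without paying the full cost of training from scratch, which is the next option.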

Training an LLM

If you go this route, it’s all on you: picking the LLM’s purpose, deciding exactly what data it learns from, preparing that data, choosing every setting, and handling all of the processing, compute, and performance characteristics. There is a ton more to it, but it reminds me of the days of doing custom builds of Apache Web Server to include exactly the capabilities we needed and nothing more. There’s a reason this approach is less common, but if you require complete control and precision for your use case, it’s the method you might want to consider.

Data Preparation

Regardless of which method you use, you are responsible for data preparation, and this can be an intensive task. This includes, but is not limited to, formatting, cleaning, labeling (if supervised), chunking, and tokenization of the data.
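As an illustration of one of those steps, here is a minimal sketch of chunking raw text into fixed-size token windows, assuming a Hugging Face tokenizer; the tokenizer name, chunk size, and file path are hypothetical:

```python
# Minimal data-prep sketch: split raw text into fixed-size token chunks.
# The tokenizer, chunk size, and file path are assumptions, not recommendations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Tokenize the text, slice the token ids into windows, and decode each window back to a string."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

with open("knowledge_base_article.txt") as f:  # hypothetical input file
    chunks = chunk_text(f.read())
print(f"Produced {len(chunks)} chunks")
```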

Data Timeliness

A custom LLM is only as up-to-date as the last time you trained or fine-tuned it. If that was last week and you have new data, the LLM doesn’t know anything about that new data. Iterating on a custom model is slow because every time you make updates, you have to train, test, tune, and ensure there are no regressions, among other steps.

When to choose a custom model

Either method, fine-tuning or training from scratch, is appropriate when:

  • Your data is stable, closed-domain, and you require tight control over output behavior (e.g., legal, medical, scientific text). An example is the voyage-law-2 model by Voyage AI, which is optimized for legal topics.
  • You need low-latency responses or offline capability.
  • You're building a productized LLM with predictable behavior and no reliance on real-time external sources.

Pros

  • No need for external retrieval at runtime.
  • Tighter integration of domain-specific language or tone.
  • Potential for lower latency and cost once deployed.
  • Deployable in a software product with no external data sources.

Cons

  • Expensive, time-consuming, hard to update.
  • Prone to “hallucination” if the domain shifts.
  • Risk of catastrophic forgetting during fine-tuning if not managed carefully.

Retrieval-Augmented Generation (RAG)

In this case, you use an "off-the-shelf" LLM, but your app augments the LLM’s response with data the model wasn’t trained on. The app retrieves data from an API, a database, or any other source, and that data becomes part of the LLM prompt. For example, say you have an LLM that doesn’t have access to non-public knowledge base articles. A user asks a question, and the app searches for semantically matching articles using MongoDB Atlas Vector Search. The app injects the retrieved articles into the prompt, enabling the LLM to generate a better response for the user.
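Here is a minimal sketch of that flow, assuming a MongoDB Atlas collection of pre-embedded articles and the OpenAI Python SDK; the connection string, database/collection/index names, field names, and model choices are all assumptions you would adapt to your own setup:

```python
# Minimal RAG sketch: MongoDB Atlas Vector Search for retrieval + an off-the-shelf LLM for generation.
# Connection string, database/collection/index names, field names, and models are assumptions.
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
articles = MongoClient("mongodb+srv://<user>:<password>@cluster.example.mongodb.net")["kb"]["articles"]

def answer(question: str) -> str:
    # 1. Embed the user's question with the same model used to embed the stored articles.
    q_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Retrieve the most semantically similar articles via Atlas Vector Search.
    docs = articles.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",   # assumed Atlas Vector Search index name
            "path": "embedding",       # field holding each article's stored vector
            "queryVector": q_embedding,
            "numCandidates": 100,
            "limit": 3,
        }
    }])
    context = "\n\n".join(doc["body"] for doc in docs)

    # 3. Inject the retrieved articles into the prompt so the LLM can ground its answer.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided articles."},
            {"role": "user", "content": f"Articles:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I rotate my API keys?"))  # hypothetical user question
```

The key design point is that retrieval happens at request time, so updating the knowledge base (and its embeddings) immediately changes what the LLM can see, with no retraining.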

Additionally, this method is significantly faster to set up, as it eliminates the training time associated with custom models. A RAG pipeline can be built and tuned in hours to days, rather than days to weeks.

When to choose it

  • You need dynamic access to evolving or non-public data.
  • You don’t have the resources, or the need, to fine-tune.
  • You want to prototype quickly and iterate based on user feedback.

Pros

  • Cheaper and faster to build and maintain.
  • Updatable in real time, and no retraining is needed when documents change.
  • Works well with off-the-shelf models, such as OpenAI GPT-5, Anthropic Claude, or open-source LLMs.

Cons

  • Context length limits can be a bottleneck.
  • Needs a robust retrieval system, or you'll get garbage in, garbage out.
  • Output quality depends heavily on chunking strategy, embedding quality, and retrieval relevance.

Which one should you choose? TL;DR

Use RAG if:

  • You want to quickly prototype and integrate with existing LLMs.
  • Your data changes often, is usually private, or includes long documents.
  • You value flexibility, lower cost, and ease of iteration.

Use a custom LLM if:

  • Your application demands fast, self-contained models with no reliance on external systems.
  • You have highly specialized content or tone that general LLMs can't reproduce.
  • You’re delivering a product where latency, control, and data privacy are paramount.
