Pierfelice Menga
The Real Engineering Challenges of Using LLMs in Production Systems

" Large Language Models are no longer experimental novelties. They are now embedded into internal copilots, support systems, search interfaces, analytics assistants, coding workflows, document pipelines, and increasingly, decision-support platforms. At the prototype stage, they often appear surprisingly capable. A well-written prompt produces fluent answers, clean code, and convincing reasoning. But the moment an LLM is placed inside a production system, the engineering reality changes."


The central problem is simple to state and difficult to solve:

An LLM can produce output that looks correct, sounds correct, and fits the requested format, while being fundamentally wrong.


That single property reshapes everything about system design. Traditional software engineering is built on deterministic assumptions. Given the same input and the same state, the system should behave in the same way. LLM-based systems violate that expectation at the component level. They are probabilistic, not deterministic. They generate, rather than retrieve. They imitate valid structure without actually guaranteeing semantic correctness. As a result, the main challenge is not how to make an LLM answer beautifully, but how to make a larger system remain reliable when one of its core components is inherently uncertain.

This is where the real engineering work begins.

Why hallucinations are a system problem, not a model quirk

Hallucination is often described too casually, as if it were just an occasional mistake. In practice, it is much more structural than that. An LLM does not check a truth table before replying. It predicts the next token based on learned statistical patterns. If the available context is weak, incomplete, conflicting, or slightly off-distribution, the model does not pause like a careful engineer and say, “I do not have enough verified information.” Instead, it continues the pattern of plausible generation.


That behavior becomes dangerous because the output usually preserves the surface signals humans trust most:

  • correct grammar
  • correct formatting
  • domain vocabulary
  • coherent flow
  • confident tone

In other words, the answer often fails at the exact layer that is hardest to detect quickly: **meaning.**

A generated function may compile and even pass a few happy-path tests while still failing on edge cases. A generated API call may look perfectly aligned with the target service while using parameters that do not actually exist. A generated SQL transformation may execute successfully while applying the wrong filter condition, quietly corrupting downstream metrics. In all of these cases, the visible structure suggests correctness, but the hidden logic is flawed.

That distinction matters. A broken JSON response is easy to reject. A beautifully structured but incorrect JSON response is much more expensive to catch.
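The difference can be made concrete with a minimal sketch. The payload strings here are illustrative: a parser rejects the broken response immediately, while the well-formed but wrong one passes without complaint.

```python
import json

def structurally_valid(raw: str) -> bool:
    # Structural check only: does the payload parse at all?
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

broken = '{"endpoint": "/v1/payment_intents",'             # truncated JSON
plausible = '{"endpoint": "/v1/charge", "method": "POST"}'  # parses fine, wrong endpoint

print(structurally_valid(broken))     # False: cheap to reject
print(structurally_valid(plausible))  # True: nothing here notices the wrong endpoint
```

The second payload needs a semantic check, not a parser, which is exactly why it is the expensive one.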

Example: valid syntax, invalid logic

Consider a simple function generated for discount calculation:

```python
def apply_discount(price, discount):
    return price - price * discount
```

Example of Incorrect RAG Code and Why It Fails

One of the most common mistakes in early RAG systems is assuming that retrieval alone guarantees correctness. In reality, a poorly designed retrieval pipeline can silently inject irrelevant context into the prompt, which makes the final answer look grounded while still being wrong.

Here is a deliberately incorrect RAG example:

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

documents = [
    "Stripe uses PaymentIntents for modern payments.",
    "Redis is an in-memory database.",
    "The Eiffel Tower is in Paris.",
    "Legacy charges API exists in older Stripe workflows."
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents)

index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings, dtype="float32"))

def rag_answer(query, llm):
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, k=3)

    context = " ".join([documents[i] for i in indices[0]])

    prompt = f"""
    Answer the question using the context below.

    Context:
    {context}

    Question:
    {query}
    """

    return llm(prompt)
```

At first glance, this looks reasonable. It encodes documents, runs similarity search, builds a context string, and passes everything to the model. But from an engineering perspective, this implementation is fragile in several ways.

Why this code is incorrect

First, it retrieves chunks only by vector similarity and blindly trusts the top results. That means semantically related but operationally useless text can enter the context. If the query is about Stripe, the retriever may still include general or outdated chunks, or even partially related noise.

Second, there is no threshold for retrieval quality. Even if the top matches are weak, the pipeline still sends them to the LLM. The model then receives low-confidence evidence and often turns it into a high-confidence answer.

Third, there is no reranking or filtering. The code assumes the vector index already returned the most useful chunks in the best order. In practice, top-k similarity is often only the first stage.
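In production this second stage is usually a cross-encoder model. To keep the sketch below self-contained and runnable, a naive lexical-overlap scorer stands in for the model; the scoring function and document strings are illustrative, not a real reranker.

```python
def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    # Stand-in scorer: term overlap between query and candidate.
    # A real pipeline would call a cross-encoder here instead.
    query_terms = set(query.lower().split())

    def score(text: str) -> float:
        terms = set(text.lower().split())
        return len(query_terms & terms) / max(len(terms), 1)

    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:top_n]

docs = [
    "The Eiffel Tower is in Paris.",
    "Stripe uses PaymentIntents for modern payments.",
    "Legacy charges API exists in older Stripe workflows.",
]
print(rerank("How do I integrate Stripe payments?", docs, top_n=2))
```

The point is the shape of the stage: take more candidates than you need from the vector index, rescore them with a stronger signal, and keep only the best.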

Fourth, the context is merged into one flat block. There is no metadata, no source labeling, no freshness information, and no separation between high-trust and low-trust documents. The LLM sees one blended text surface and may combine unrelated facts into a single polished response.

Fifth, there is no validation after generation. Even if the LLM produces a well-written answer based on outdated or irrelevant chunks, nothing in the system detects that failure.

This is the core engineering danger of bad RAG: the pipeline keeps running and the answer keeps sounding grounded, even when the retrieved evidence is weak.

What can go wrong in practice

Imagine the user asks:

How should I integrate Stripe payments in a new application?

The retriever may return:

  • a correct chunk about PaymentIntents
  • an old chunk about legacy Charges API
  • an unrelated chunk because the embedding similarity was only loosely relevant

The model now has mixed evidence. Instead of refusing or expressing uncertainty, it may generate a blended answer such as:

Use the Charges API for direct payment creation, or PaymentIntents if needed.

A stronger RAG version

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

documents = [
    {"text": "Stripe uses PaymentIntents for modern payments.", "source": "official", "topic": "stripe"},
    {"text": "Redis is an in-memory database.", "source": "official", "topic": "redis"},
    {"text": "The Eiffel Tower is in Paris.", "source": "general", "topic": "travel"},
    {"text": "Legacy charges API exists in older Stripe workflows.", "source": "archive", "topic": "stripe"}
]

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [doc["text"] for doc in documents]
doc_embeddings = model.encode(texts)

index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings, dtype="float32"))

def retrieve_relevant(query, topic_filter=None, k=5, max_distance=1.2):
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        doc = documents[idx]
        if dist <= max_distance:
            if topic_filter is None or doc["topic"] == topic_filter:
                results.append({
                    "text": doc["text"],
                    "source": doc["source"],
                    "distance": float(dist)
                })
    return results

def build_context(results):
    approved = [r for r in results if r["source"] == "official"]
    if not approved:
        return None
    return "\n".join([f"[Source: {r['source']}] {r['text']}" for r in approved])

def rag_answer(query, llm):
    retrieved = retrieve_relevant(query, topic_filter="stripe")
    context = build_context(retrieved)

    if not context:
        return "I do not have enough reliable retrieved context to answer safely."

    prompt = f"""
    Use only the context below.
    If the answer is not explicitly supported, say you do not know.

    Context:
    {context}

    Question:
    {query}
    """

    return llm(prompt)
```

Contrast this with the naive pipeline's blended answer earlier. That answer sounded professional, but it is not a reliable recommendation for a modern production system. The problem is not that retrieval failed completely. The problem is that retrieval failed partially, which is harder to notice: the system appears grounded, but the grounding itself is weak.

1) Main libraries used in LLM systems

| Library | Main role | Typical use | Notes |
|---|---|---|---|
| openai | Model inference, embeddings, API access | Generate answers, structured outputs, embeddings | OpenAI's API includes Responses and Embeddings endpoints. (OpenAI Platform) |
| langchain | Orchestration framework | Prompting, chains, retrievers, agents | LangChain docs cover retrieval flows including 2-step RAG and agentic RAG. (LangChain Docs) |
| sentence-transformers | Local embedding models | Encode queries/docs into vectors | Common for semantic search and RAG embedding pipelines; `SentenceTransformer(...).encode(...)` is the core pattern. (SentenceTransformers) |
| faiss | Dense vector similarity search | Fast local ANN/vector search | FAISS is designed for efficient similarity search and clustering of dense vectors. (Faiss) |
| qdrant-client / Qdrant | Production vector DB | Store/search vectors with payload filters | Qdrant stores points made of vectors plus optional payload metadata and supports search/filtering. (Qdrant) |
| pydantic | Output/schema validation | Validate structured LLM outputs | Not a model library, but widely used to make LLM responses safer in production. |
| requests | External API/tool calls | Fetch docs, APIs, webpages | Frequently used inside tool-using or retrieval workflows; LangChain's examples use it in agentic retrieval flows. (LangChain Docs) |
| numpy | Vector/matrix handling | Embedding arrays, FAISS inputs | Standard companion library for local embedding and vector search pipelines. |
| transformers | Local HF model inference/training | Run local LLMs/embeddings | Often used when you do not want hosted inference. |
| tiktoken or tokenizer libs | Token counting/chunking | Split context safely | Useful for prompt budgeting and chunk sizing. |
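The last row mentions token counting for chunk sizing. A minimal sketch of budget-based chunking follows; to stay self-contained it uses a whitespace word count as a stand-in tokenizer, where a real pipeline would use tiktoken or the model's own tokenizer so counts match the API's accounting.

```python
def chunk_by_budget(text: str, max_tokens: int = 50) -> list[str]:
    # Stand-in tokenizer: whitespace-separated words.
    # Swap in tiktoken for accurate, model-specific token counts.
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

doc = "word " * 120
pieces = chunk_by_budget(doc, max_tokens=50)
print(len(pieces))  # 3 chunks: 50 + 50 + 20 words
```

Fixed-size splitting like this is the simplest strategy; production systems often split on semantic boundaries first and only then enforce the token budget.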

2) Main libraries used specifically in RAG systems

| Library | RAG stage | What it usually does | Notes |
|---|---|---|---|
| langchain | Pipeline orchestration | Load docs, split, embed, retrieve, chain to LLM | Its retrieval docs explicitly describe RAG architectures and retriever-driven flows. (LangChain Docs) |
| sentence-transformers | Embedding | Converts chunks and queries into vectors | Common local embedding choice for semantic retrieval. (SentenceTransformers) |
| openai | Embedding + generation | Hosted embeddings and answer generation | OpenAI embeddings return vectors whose length depends on the selected model. (OpenAI Platform) |
| faiss | Vector index | Local similarity search over dense vectors | Strong for fast local prototypes and single-node systems. (Faiss) |
| qdrant-client / Qdrant | Vector storage + filtering | Production search with metadata/payload | Supports dense, sparse, and hybrid retrieval in the LangChain integration. (LangChain Docs) |
| langchain-community | Integrations | FAISS, loaders, utilities | LangChain's FAISS integration lives in langchain-community. (LangChain Docs) |
| langchain-qdrant | Qdrant integration | Qdrant vector store wrapper for LangChain | Official LangChain integration package for Qdrant. (LangChain Docs) |
| rank-bm25 or sparse search tools | Keyword retrieval | Lexical retrieval complement | Often paired with dense retrieval for hybrid RAG. |
| Cross-encoders (sentence-transformers) | Re-ranking | Reorder retrieved results more accurately | Sentence Transformers provides Cross-Encoder reranking models for passage reranking. (SentenceTransformers) |

Return now to the discount function from earlier. At first glance, it looks fine. It is short, readable, and syntactically correct. But what does discount mean? Is it 0.2 for twenty percent? Is it 20? What happens with negative values? What if the value exceeds 1? The model has produced a function that looks complete, but key semantic assumptions are left unresolved.

A production-safe implementation would make those assumptions explicit:

```python
def apply_discount(price: float, discount_rate: float) -> float:
    if price < 0:
        raise ValueError("price must be non-negative")
    if not 0 <= discount_rate <= 1:
        raise ValueError("discount_rate must be between 0 and 1")
    return round(price * (1 - discount_rate), 2)
```

The important lesson is not that the second version is longer. It is that engineering requires explicit constraints, while generation often omits them unless forced by the system.
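One way to pin those constraints down is a handful of edge-case tests. The sketch below repeats the production-safe version so it runs standalone; the naive version would fail the boundary and error-handling cases.

```python
def apply_discount(price: float, discount_rate: float) -> float:
    if price < 0:
        raise ValueError("price must be non-negative")
    if not 0 <= discount_rate <= 1:
        raise ValueError("discount_rate must be between 0 and 1")
    return round(price * (1 - discount_rate), 2)

# Happy path
assert apply_discount(100.0, 0.2) == 80.0

# Boundary cases the naive version never considered
assert apply_discount(100.0, 0.0) == 100.0
assert apply_discount(100.0, 1.0) == 0.0

# Semantic violations must fail loudly, not return nonsense
for bad in [(100.0, 20), (100.0, -0.1), (-5.0, 0.2)]:
    try:
        apply_discount(*bad)
        raise AssertionError(f"expected ValueError for {bad}")
    except ValueError:
        pass
```

Tests like these are exactly the kind of deterministic check that generated code should have to pass before the system accepts it.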

A useful question to ask whenever an LLM produces code is this:

Does this output merely look like an implementation, or does it encode the actual business rules?

That question separates demo-quality output from production-quality output.


Why reliability is harder than accuracy

Many teams initially frame the problem as accuracy: how do we get more correct answers? Accuracy matters, but reliability is broader and often more important. A system can be reasonably accurate on average and still be operationally unreliable if its failures are inconsistent, irreproducible, and hard to debug.

This is the second major engineering challenge of LLM systems: non-determinism.

Traditional software systems are expected to behave consistently. If a bug appears, engineers try to reproduce it, isolate the state, inspect the inputs, and trace the logic path. With LLMs, that workflow becomes less stable. Two runs with nearly identical conditions can yield different wording, different assumptions, different decomposition steps, and sometimes different final conclusions.


This variability affects much more than output style. It changes how systems must be tested, monitored, and maintained.
A small variation in an early classification step can alter retrieval. Altered retrieval changes context. Changed context changes generation. Changed generation may trigger or avoid a validator. In a multi-step pipeline, small probabilistic differences can cascade into materially different outcomes.
That is why reproducibility becomes a first-class engineering concern.

A practical question for any production LLM pipeline is:

If the same request fails today, can we reproduce the same failure tomorrow?
If the answer is no, debugging becomes slower, monitoring becomes noisier, and rollback analysis becomes more difficult.
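One practical step toward reproducibility is to fingerprint every input that shaped a generation. The sketch below is a minimal, assumption-laden illustration: the field names and the idea of a "template version" are hypothetical, but the principle is that identical fingerprints isolate model sampling as the only remaining source of variation.

```python
import hashlib
import json

def request_fingerprint(template_version: str, context: str,
                        model_settings: dict, query: str) -> str:
    # Hash every input that influences generation. If two runs share a
    # fingerprint, any difference in output came from model sampling,
    # not from the surrounding system.
    payload = json.dumps({
        "template_version": template_version,
        "context": context,
        "model_settings": model_settings,
        "query": query,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

fp = request_fingerprint("v3", "PaymentIntents docs...", {"temperature": 0}, "How to charge?")
print(fp[:12])  # stable across runs for identical inputs
```

Stored alongside each logged request, a fingerprint lets you replay yesterday's failure with exactly yesterday's inputs.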

The shape of a production-safe architecture

Because LLMs are probabilistic generators, they should almost never sit alone between user input and final output in a serious system. A production architecture needs surrounding layers that constrain, ground, verify, and observe behavior.

A useful high-level diagram looks like this:

Request → Routing → Retrieval → Context processing → LLM generation → Validation → Decision (accept / reject / retry / escalate)

This diagram matters because it shows the correct mental model: the LLM is one stage in a larger reliability pipeline, not the pipeline itself.
Each layer exists because a different class of failure must be handled outside the model.

  • Routing reduces ambiguity by deciding what kind of problem this is.
  • Retrieval grounds the response in actual data.
  • Context processing removes noise before generation.
  • Validation checks whether the output is structurally and semantically acceptable.
  • The decision layer determines whether to accept, reject, retry, or escalate.

The deeper point is architectural: you do not solve hallucinations by asking the model to “be more careful.” You solve them by reducing the amount of unverified freedom the model is allowed to exercise.


Context processing is one of the most underestimated layers

Even with good retrieval, raw context is rarely ready to pass directly into the model. Retrieved material can contain redundancy, conflicting information, outdated fragments, or irrelevant passages. Many teams focus heavily on embeddings and the LLM itself, while underinvesting in the layer that prepares context.

That is a mistake, because the model’s answer quality depends as much on context hygiene as on model capability.

Context processing is where the system decides what evidence is allowed to influence generation. This may include:

  • removing duplicate chunks
  • filtering low-confidence results
  • keeping only chunks from approved sources
  • normalizing formats
  • ordering evidence by priority
  • truncating to preserve only the strongest signal

A simple illustration:

```python
def process_context(chunks: list[str], max_chars: int = 1200) -> str:
    seen = set()
    cleaned = []

    for chunk in chunks:
        normalized = chunk.strip()
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)

    return "\n\n".join(cleaned)[:max_chars]
```

This is a basic example, but it reflects an important idea: context is not raw input to the model. It is curated evidence.

A strong question to ask at this stage is:

If the model fails, did it fail because it reasoned poorly, or because we handed it noisy evidence?

That question often reveals that the failure belongs to upstream system design, not to the model alone.


Validation is where probabilistic output meets deterministic engineering

If there is one layer that most clearly separates prototypes from production systems, it is validation.

Without validation, an LLM system is essentially trusting generated output based on presentation quality. With validation, the system begins to behave like engineered software again. The goal is not to prove the model is always right. The goal is to ensure the system does not accept high-risk outputs without deterministic checks.

The type of validation depends on the task.

For structured outputs, schema validation is the first barrier. If the model is supposed to return an object with specific fields, those fields should be validated strictly.

```python
from pydantic import BaseModel, ValidationError

class ApiCall(BaseModel):
    method: str
    endpoint: str
    requires_auth: bool

def validate_structured_output(data: dict):
    try:
        return ApiCall(**data)
    except ValidationError:
        # Malformed structure: reject rather than pass it downstream
        return None
```

This catches malformed responses, but it does not catch false content inside a valid structure. A perfectly shaped object can still be wrong.

That is why semantic validation must follow structural validation.

Example: a valid structure with invalid semantics
The model returns:

```json
{
  "method": "POST",
  "endpoint": "/v1/charge",
  "requires_auth": true
}
```

This may pass a schema validator because the fields exist and types are correct. But the endpoint is still wrong. Structural validation succeeded. Semantic validation failed.
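A semantic check for this case can be as simple as validating the generated call against an allowlist derived from trusted documentation. The endpoint names below are illustrative, not Stripe's actual API surface.

```python
# Ground truth derived from trusted documentation (illustrative entries)
KNOWN_ENDPOINTS = {
    ("POST", "/v1/payment_intents"),
    ("GET", "/v1/payment_intents"),
}

def semantically_valid(call: dict) -> bool:
    # Structure may be fine; this checks the content against ground truth.
    return (call["method"], call["endpoint"]) in KNOWN_ENDPOINTS

call = {"method": "POST", "endpoint": "/v1/charge", "requires_auth": True}
print(semantically_valid(call))  # False: well-shaped, but the endpoint does not exist
```

The allowlist is deterministic and auditable, which is exactly what the probabilistic generator is not.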

For code generation, semantic validation often means execution plus tests.

```python
def run_generated_code_safely(code: str, test_func):
    namespace = {}
    try:
        # Single namespace so generated functions can reference each other
        exec(code, namespace)
        return test_func(namespace)
    except Exception:
        return False
```

The critical insight is that validation must answer a harder question than formatting:

Could this output be accepted by the system and still be wrong?

If yes, more validation is needed.


Comparing traditional software and LLM systems

One reason teams underestimate these challenges is that they unconsciously apply the wrong engineering intuition. The table below shows why LLM systems need a different mindset.

| Dimension | Traditional Software | LLM Component |
|---|---|---|
| Output behavior | Deterministic | Probabilistic |
| Truth source | Rules and state | Learned token distributions |
| Failure mode | Explicit error or exception | Plausible but incorrect response |
| Debugging | Reproduce exact path | Analyze distributions and context |
| Testing | Exact expected output | Statistical and scenario-based |
| Safety strategy | Unit/integration tests | Validation, grounding, observability |

This comparison explains why a prompt-only approach usually breaks at scale. Prompting can improve local performance, but it does not change the underlying failure model.


Consistency requires control, not hope

Because non-determinism cannot be eliminated completely, it must be managed. The system needs mechanisms that reduce variance where consistency matters.

One common control is lower-temperature generation. Lower temperature reduces randomness and usually improves consistency. But it is not a magic fix. A confidently repeated wrong answer is still wrong. Consistency without verification can simply stabilize the wrong behavior.

Another control is structured prompting. When prompts specify the expected reasoning path and output format, they reduce ambiguity and narrow the model’s action space.

For example, compare these two prompts.

Too open-ended:

```
Explain how to call the API and give the right parameters.
```

More controlled:

```
Using only the provided documentation context, return a JSON object with:

1. HTTP method
2. exact endpoint
3. required headers
4. required body fields
If any field is not explicitly supported by the context, return 
```

The second prompt is better not because it is longer, but because it reduces hidden assumptions and creates output that is easier to validate.

A further step is multi-candidate generation with ranking or verification. Instead of trusting one answer, the system can generate several and choose the one that best satisfies rules or passes validation.

```python
def choose_best_output(prompt: str, generator, scorer, n: int = 3):
    candidates = [generator(prompt) for _ in range(n)]
    scored = [(candidate, scorer(candidate)) for candidate in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[0][0]
```

This is especially useful when a task admits multiple plausible phrasings but only some are fully grounded or structurally compliant.

A practical question here is:

Should the system optimize for one eloquent answer, or for the most verifiable answer?

In production, the second is usually the right choice.


Observability is mandatory because failures are often silent

In ordinary software systems, obvious failures trigger obvious investigation. In LLM systems, some of the worst failures are silent. The answer is accepted, no exception is thrown, and the problem emerges only later as an incorrect report, a bad integration, or a flawed decision.

That is why observability is not optional. The system needs to record enough information to reconstruct what happened:

  • the original user request
  • the prompt or template version
  • the retrieved context
  • model settings
  • raw outputs
  • validation outcomes
  • final decision
  • user feedback where available

A minimal logging example might look like this:

```python
import time

def log_event(query, context, raw_output, validated, decision):
    return {
        "timestamp": time.time(),
        "query": query,
        "context": context,
        "raw_output": raw_output,
        "validated": validated,
        "decision": decision
    }
```

In a real system, this data becomes the basis for regression analysis, failure clustering, and evaluation dataset creation.
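As a minimal sketch of that analysis, logged events can be grouped by their decision outcome so recurring failure modes surface quickly. The decision labels here are hypothetical; they would come from whatever the decision layer actually records.

```python
from collections import Counter

def failure_breakdown(events: list[dict]) -> Counter:
    # Count only events that were not accepted, keyed by decision,
    # so the dominant failure mode is visible at a glance.
    return Counter(e["decision"] for e in events if e["decision"] != "accepted")

events = [
    {"decision": "accepted"},
    {"decision": "schema_rejected"},
    {"decision": "semantic_rejected"},
    {"decision": "semantic_rejected"},
]
print(failure_breakdown(events))  # semantic rejections dominate this sample
```

Even this crude grouping answers a useful question: are we losing requests to malformed output, or to well-formed output that fails semantic checks?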

A strong engineering question is:

If a user reports a wrong answer, do we have enough information to diagnose whether retrieval, prompting, generation, or validation failed?

Without that visibility, the team is not really operating a system.

It is operating a black box.

The evaluation mindset must change

Testing LLM systems is fundamentally different from testing ordinary code. You cannot rely only on exact-match assertions. Many tasks allow multiple acceptable outputs, while dangerous failures may still look polished.

Evaluation must therefore reflect real usage conditions, not just benchmark convenience. Good evaluation sets should include:

  • normal cases
  • ambiguous cases
  • adversarial phrasing
  • edge conditions
  • outdated context scenarios
  • conflicting evidence scenarios
  • incomplete data scenarios
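The scenario categories above can be encoded as a small evaluation harness in which a refusal on an unanswerable case counts as a pass. Everything below is a toy sketch: the refusal-detection string and the example system are illustrative stand-ins.

```python
def evaluate(system, cases: list[dict]) -> float:
    # Each case declares whether the system should answer or refuse.
    # A safe refusal on an unanswerable case is scored as a pass.
    passed = 0
    for case in cases:
        answer = system(case["query"])
        refused = "do not have enough" in answer.lower()
        if case["expect_refusal"] == refused:
            passed += 1
    return passed / len(cases)

# Toy system: refuses anything outside its known topic
def toy_system(query: str) -> str:
    if "stripe" in query.lower():
        return "Use PaymentIntents."
    return "I do not have enough reliable context to answer safely."

cases = [
    {"query": "How do I integrate Stripe?", "expect_refusal": False},
    {"query": "What is our 2026 pricing?", "expect_refusal": True},
]
print(evaluate(toy_system, cases))  # 1.0 on this toy set
```

In a real harness, refusal detection would be structured (a flag in the output schema) rather than substring matching, and the case set would cover all the scenario categories listed above.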

The aim is not simply to ask, “Did the model answer correctly?” The better question is:

Under what conditions does the entire system fail, and does it fail safely?

That wording matters because a safe refusal can be more valuable than a polished but incorrect answer.

A practical production pattern

A strong LLM system often follows a decision-oriented pipeline like this:

Request → Router → { answer directly | retrieve then answer | call a tool | escalate to a human | reject cleanly }
This diagram is useful because it shows an engineering principle that applies broadly: the system should not force every request down the same path. Some tasks need retrieval. Some need tools. Some need human escalation. Some should be rejected cleanly.
That is how the architecture absorbs uncertainty instead of pretending uncertainty does not exist.


Questions every production LLM team should keep asking

The strongest teams tend to ask better operational questions than everyone else. Here are some of the most important ones:

Can the system detect a well-formatted but incorrect output?

Does retrieval improve truthfulness, or just increase answer confidence?

Which failures come from the model, and which come from upstream context design?

Can we reproduce a bad output under the same conditions?

Are we optimizing for linguistic quality or decision reliability?

When the system is uncertain, does it expose uncertainty or hide it behind fluency?

What happens if the validator passes a structurally valid but semantically false response?

Which classes of requests should never be answered without human review?

These are not philosophical questions. They are production questions.


Final perspective

The hardest part of deploying LLMs is not integrating an API or writing a better prompt. It is accepting that a fluent model is not the same thing as a reliable system.

A model can generate.
A system must decide.

A model can imitate valid structure.
A system must verify meaning.

A model can produce plausible answers.
A production architecture must control when those answers are trusted, retried, constrained, or rejected.

That is the real engineering challenge of using LLMs in production systems. The teams that succeed are not the ones that merely use advanced models. They are the ones that design robust pipelines around the model’s limitations: grounded retrieval, disciplined context preparation, deterministic validation, controlled generation, observability, and continuous evaluation.

The line between experimenting with AI and engineering with AI is drawn exactly there.

Top comments (3)

Pierfelice Menga

Please leave your opinion about my post.

James Rony

> Even with good retrieval, raw context is rarely ready to pass directly into the model. Retrieved material can contain redundancy, conflicting information, outdated fragments, or irrelevant passages. Many teams focus heavily on embeddings and the LLM itself, while underinvesting in the layer that prepares context.

Wonderful. Thank you for the kind explanation. Looking forward to the next post!

Agent-Dev-Well

> An LLM does not check a truth table before replying. It predicts the next token based on learned statistical patterns. If the available context is weak, incomplete, conflicting, or slightly off-distribution…

Excuse me, how long have you been working in AI?
I look forward to discussing this with you.
Perfect 👍👍👍👍