" Large Language Models are no longer experimental novelties. They are now embedded into internal copilots, support systems, search interfaces, analytics assistants, coding workflows, document pipelines, and increasingly, decision-support platforms. At the prototype stage, they often appear surprisingly capable. A well-written prompt produces fluent answers, clean code, and convincing reasoning. But the moment an LLM is placed inside a production system, the engineering reality changes."
The central problem is simple to state and difficult to solve: an LLM can produce output that looks correct, sounds correct, and fits the requested format, while being fundamentally wrong.
That single property reshapes everything about system design. Traditional software engineering is built on deterministic assumptions. Given the same input and the same state, the system should behave in the same way. LLM-based systems violate that expectation at the component level. They are probabilistic, not deterministic. They generate, rather than retrieve. They imitate valid structure without actually guaranteeing semantic correctness. As a result, the main challenge is not how to make an LLM answer beautifully, but how to make a larger system remain reliable when one of its core components is inherently uncertain.
This is where the real engineering work begins.
Why hallucinations are a system problem, not a model quirk
Hallucination is often described too casually, as if it were just an occasional mistake. In practice, it is much more structural than that. An LLM does not check a truth table before replying. It predicts the next token based on learned statistical patterns. If the available context is weak, incomplete, conflicting, or slightly off-distribution, the model does not pause like a careful engineer and say, “I do not have enough verified information.” Instead, it continues the pattern of plausible generation.
That behavior becomes dangerous because the output usually preserves the surface signals humans trust most:
- correct grammar
- correct formatting
- domain vocabulary
- coherent flow
- confident tone
In other words, the answer often fails at the exact layer that is hardest to detect quickly: **meaning.**
A generated function may compile and even pass a few happy-path tests while still failing on edge cases. A generated API call may look perfectly aligned with the target service while using parameters that do not actually exist. A generated SQL transformation may execute successfully while applying the wrong filter condition, quietly corrupting downstream metrics. In all of these cases, the visible structure suggests correctness, but the hidden logic is flawed.
That distinction matters. A broken JSON response is easy to reject. A beautifully structured but incorrect JSON response is much more expensive to catch.
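The difference can be made concrete in a few lines. A structural check rejects malformed output for free, while a well-formed but wrong payload passes every cheap gate (the endpoint names here are illustrative):

```python
import json

malformed = '{"endpoint": "/v1/payments"'       # broken JSON: cheap to reject
wrong_but_valid = '{"endpoint": "/v1/charge"}'  # parses fine, but the endpoint is wrong

# Structural failure is detected immediately by the parser.
try:
    json.loads(malformed)
    structurally_ok = True
except json.JSONDecodeError:
    structurally_ok = False

# Semantic failure raises nothing, so nothing flags the bad value.
parsed = json.loads(wrong_but_valid)
```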
Example: valid syntax, invalid logic
Consider a simple function generated for discount calculation:

```python
def apply_discount(price, discount):
    return price - price * discount
```
Example: incorrect RAG code and why it fails
One of the most common mistakes in early RAG systems is assuming that retrieval alone guarantees correctness. In reality, a poorly designed retrieval pipeline can silently inject irrelevant context into the prompt, which makes the final answer look grounded while still being wrong.
Here is a deliberately incorrect RAG example:
```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

documents = [
    "Stripe uses PaymentIntents for modern payments.",
    "Redis is an in-memory database.",
    "The Eiffel Tower is in Paris.",
    "Legacy charges API exists in older Stripe workflows."
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents)

index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings, dtype="float32"))

def rag_answer(query, llm):
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, 3)
    context = " ".join(documents[i] for i in indices[0])
    prompt = f"""
Answer the question using the context below.

Context:
{context}

Question:
{query}
"""
    return llm(prompt)
```
At first glance, this looks reasonable. It encodes documents, runs similarity search, builds a context string, and passes everything to the model. But from an engineering perspective, this implementation is fragile in several ways.
Why this code is incorrect
First, it retrieves chunks only by vector similarity and blindly trusts the top results. That means semantically related but operationally useless text can enter the context. If the query is about Stripe, the retriever may still include general or outdated chunks, or even partially related noise.
Second, there is no threshold for retrieval quality. Even if the top matches are weak, the pipeline still sends them to the LLM. The model then receives low-confidence evidence and often turns it into a high-confidence answer.
Third, there is no reranking or filtering. The code assumes the vector index already returned the most useful chunks in the best order. In practice, top-k similarity is often only the first stage.
Fourth, the context is merged into one flat block. There is no metadata, no source labeling, no freshness information, and no separation between high-trust and low-trust documents. The LLM sees one blended text surface and may combine unrelated facts into a single polished response.
Fifth, there is no validation after generation. Even if the LLM produces a well-written answer based on outdated or irrelevant chunks, nothing in the system detects that failure.
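That last gap can be narrowed with even a crude post-generation check: require that the answer's content is actually covered by the retrieved context. Word overlap here is only a stand-in for real entailment or citation checking; the point is where the gate sits, not the scoring method:

```python
import re

def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    # Crude gate: what fraction of the answer's content words appear in the
    # retrieved context? A stand-in for real entailment/citation checking.
    answer_terms = set(re.findall(r"[a-z]+", answer.lower()))
    context_terms = set(re.findall(r"[a-z]+", context.lower()))
    if not answer_terms:
        return False
    return len(answer_terms & context_terms) / len(answer_terms) >= threshold

context = "Stripe uses PaymentIntents for modern payments."
grounded = is_grounded("Stripe uses PaymentIntents for payments.", context)
ungrounded = is_grounded("Use the Charges API to create a charge.", context)
```

A failed check does not prove the answer is wrong; it flags it for retry, stricter prompting, or escalation.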
This is the core engineering danger of bad RAG: well-formed context does not imply correct context.

What can go wrong in practice
Imagine the user asks:
How should I integrate Stripe payments in a new application?
The retriever may return:
- a correct chunk about PaymentIntents
- an old chunk about legacy Charges API
- an unrelated chunk because the embedding similarity was only loosely relevant
The model now has mixed evidence. Instead of refusing or expressing uncertainty, it may generate a blended answer such as:
Use the Charges API for direct payment creation, or PaymentIntents if needed.
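One of the missing stages called out above is reranking. A minimal second-stage sketch, using plain word overlap as a stand-in for the cross-encoder model a production pipeline would actually use here:

```python
import re

def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    # Second-stage reranking sketch. Real pipelines score (query, passage)
    # pairs with a cross-encoder; simple word overlap stands in so the
    # control flow is visible without a model download.
    query_terms = set(re.findall(r"\w+", query.lower()))

    def overlap(text: str) -> int:
        return len(query_terms & set(re.findall(r"\w+", text.lower())))

    return sorted(candidates, key=overlap, reverse=True)[:top_n]

retrieved = [
    "The Eiffel Tower is in Paris.",
    "Legacy charges API exists in older Stripe workflows.",
    "Stripe uses PaymentIntents for modern payments.",
]
top = rerank("How do I integrate Stripe payments?", retrieved)
```

Even this toy version pushes the unrelated chunk out of the context window before the LLM ever sees it.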
A stronger RAG version
```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

documents = [
    {"text": "Stripe uses PaymentIntents for modern payments.", "source": "official", "topic": "stripe"},
    {"text": "Redis is an in-memory database.", "source": "official", "topic": "redis"},
    {"text": "The Eiffel Tower is in Paris.", "source": "general", "topic": "travel"},
    {"text": "Legacy charges API exists in older Stripe workflows.", "source": "archive", "topic": "stripe"}
]

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [doc["text"] for doc in documents]
doc_embeddings = model.encode(texts)

index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings, dtype="float32"))

def retrieve_relevant(query, topic_filter=None, k=5, max_distance=1.2):
    query_embedding = model.encode([query]).astype("float32")
    distances, indices = index.search(query_embedding, k)
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when k exceeds the index size
            continue
        doc = documents[idx]
        if dist <= max_distance:
            if topic_filter is None or doc["topic"] == topic_filter:
                results.append({
                    "text": doc["text"],
                    "source": doc["source"],
                    "distance": float(dist)
                })
    return results

def build_context(results):
    approved = [r for r in results if r["source"] == "official"]
    if not approved:
        return None
    return "\n".join(f"[Source: {r['source']}] {r['text']}" for r in approved)

def rag_answer(query, llm):
    # In a real system the topic would come from a routing step,
    # not a hard-coded value.
    retrieved = retrieve_relevant(query, topic_filter="stripe")
    context = build_context(retrieved)
    if not context:
        return "I do not have enough reliable retrieved context to answer safely."
    prompt = f"""
Use only the context below.
If the answer is not explicitly supported, say you do not know.

Context:
{context}

Question:
{query}
"""
    return llm(prompt)
```
Recall the blended answer from the naive pipeline: it sounds professional, but it is not a reliable recommendation for a modern production system. The problem is not that retrieval failed completely. The problem is that retrieval failed partially, which is harder to notice: the system appears grounded, but the grounding itself is weak.
1) Main libraries used in LLM systems

| Library | Main role | Typical use | Notes |
|---|---|---|---|
| openai | Model inference, embeddings, API access | Generate answers, structured outputs, embeddings | OpenAI’s API includes Responses and Embeddings endpoints. (OpenAI Platform) |
| langchain | Orchestration framework | Prompting, chains, retrievers, agents | LangChain docs cover retrieval flows including 2-step RAG and agentic RAG. (LangChain Docs) |
| sentence-transformers | Local embedding models | Encode queries/docs into vectors | Common for semantic search and RAG embedding pipelines; `SentenceTransformer(...).encode(...)` is the core pattern. (SentenceTransformers) |
| faiss | Dense vector similarity search | Fast local ANN/vector search | FAISS is designed for efficient similarity search and clustering of dense vectors. (Faiss) |
| qdrant-client / Qdrant | Production vector DB | Store/search vectors with payload filters | Qdrant stores points made of vectors plus optional payload metadata and supports search/filtering. (Qdrant) |
| pydantic | Output/schema validation | Validate structured LLM outputs | Not a model library, but widely used to make LLM responses safer in production. |
| requests | External API/tool calls | Fetch docs, APIs, webpages | Frequently used inside tool-using or retrieval workflows; LangChain’s examples use it in agentic retrieval flows. (LangChain Docs) |
| numpy | Vector/matrix handling | Embedding arrays, FAISS inputs | Standard companion library for local embedding and vector search pipelines. |
| transformers | Local HF model inference/training | Run local LLMs/embeddings | Often used when you do not want hosted inference. |
| tiktoken or other tokenizer libs | Token counting/chunking | Split context safely | Useful for prompt budgeting and chunk sizing. |
2) Main libraries used specifically in RAG systems

| Library | RAG stage | What it usually does | Notes |
|---|---|---|---|
| langchain | Pipeline orchestration | Load docs, split, embed, retrieve, chain to LLM | Its retrieval docs explicitly describe RAG architectures and retriever-driven flows. (LangChain Docs) |
| sentence-transformers | Embedding | Converts chunks and queries into vectors | Common local embedding choice for semantic retrieval. (SentenceTransformers) |
| openai | Embedding + generation | Hosted embeddings and answer generation | OpenAI embeddings return vectors whose length depends on the selected model. (OpenAI Platform) |
| faiss | Vector index | Local similarity search over dense vectors | Strong for fast local prototypes and single-node systems. (Faiss) |
| qdrant-client / Qdrant | Vector storage + filtering | Production search with metadata/payload | Supports dense, sparse, and hybrid retrieval in the LangChain integration. (LangChain Docs) |
| langchain-community | Integrations | FAISS, loaders, utilities | LangChain’s FAISS integration lives in langchain-community. (LangChain Docs) |
| langchain-qdrant | Qdrant integration | Qdrant vector store wrapper for LangChain | Official LangChain integration package for Qdrant. (LangChain Docs) |
| rank-bm25 or other sparse search tools | Keyword retrieval | Lexical retrieval complement | Often paired with dense retrieval for hybrid RAG. |
| Cross-encoders (sentence-transformers) | Re-ranking | Reorder retrieved results more accurately | Sentence Transformers provides Cross-Encoder models for passage reranking. (SentenceTransformers) |
Now return to the discount function from earlier. At first glance, it looks fine. It is short, readable, and syntactically correct. But what does discount mean? Is it 0.2 for twenty percent? Is it 20? What happens with negative values? What if the value exceeds 1? The model has produced a function that looks complete, but key semantic assumptions are left unresolved.
A production-safe implementation would make those assumptions explicit:
```python
def apply_discount(price: float, discount_rate: float) -> float:
    if price < 0:
        raise ValueError("price must be non-negative")
    if not 0 <= discount_rate <= 1:
        raise ValueError("discount_rate must be between 0 and 1")
    return round(price * (1 - discount_rate), 2)
```
The important lesson is not that the second version is longer. It is that engineering requires explicit constraints, while generation often omits them unless forced by the system.
A useful question to ask whenever an LLM produces code is this:
Does this output merely look like an implementation, or does it encode the actual business rules?
That question separates demo-quality output from production-quality output.
Why reliability is harder than accuracy
Many teams initially frame the problem as accuracy: how do we get more correct answers? Accuracy matters, but reliability is broader and often more important. A system can be reasonably accurate on average and still be operationally unreliable if its failures are inconsistent, irreproducible, and hard to debug.
This is the second major engineering challenge of LLM systems: non-determinism.
Traditional software systems are expected to behave consistently. If a bug appears, engineers try to reproduce it, isolate the state, inspect the inputs, and trace the logic path. With LLMs, that workflow becomes less stable. Two runs with nearly identical conditions can yield different wording, different assumptions, different decomposition steps, and sometimes different final conclusions.
This variability affects much more than output style. It changes how systems must be tested, monitored, and maintained.
A small variation in an early classification step can alter retrieval. Altered retrieval changes context. Changed context changes generation. Changed generation may trigger or avoid a validator. In a multi-step pipeline, small probabilistic differences can cascade into materially different outcomes.
That is why reproducibility becomes a first-class engineering concern.
A practical question for any production LLM pipeline is:
If the same request fails today, can we reproduce the same failure tomorrow?
If the answer is no, debugging becomes slower, monitoring becomes noisier, and rollback analysis becomes more difficult.
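One low-cost step toward reproducibility is logging a deterministic fingerprint of every request's inputs, so "the same request" is a well-defined thing you can search for later. A sketch; the field names are illustrative:

```python
import hashlib
import json

def request_fingerprint(prompt_version: str, model_params: dict,
                        context: str, query: str) -> str:
    # Deterministic ID for "the same request". Logging this with every call
    # lets you re-run yesterday's failure under identical inputs, even though
    # the model itself may still sample differently.
    payload = json.dumps(
        {"prompt_version": prompt_version, "params": model_params,
         "context": context, "query": query},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

fp1 = request_fingerprint("v3", {"temperature": 0.2}, "ctx", "How do refunds work?")
fp2 = request_fingerprint("v3", {"temperature": 0.2}, "ctx", "How do refunds work?")
```

Identical inputs always map to the same ID, which makes failure clustering and replay tooling much simpler.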
The shape of a production-safe architecture
Because LLMs are probabilistic generators, they should almost never sit alone between user input and final output in a serious system. A production architecture needs surrounding layers that constrain, ground, verify, and observe behavior.
A useful high-level flow looks like this:

User request → Routing → Retrieval → Context processing → LLM generation → Validation → Decision layer → Final response

This flow matters because it shows the correct mental model: the LLM is one stage in a larger reliability pipeline, not the pipeline itself.
Each layer exists because a different class of failure must be handled outside the model.
- Routing reduces ambiguity by deciding what kind of problem this is.
- Retrieval grounds the response in actual data.
- Context processing removes noise before generation.
- Validation checks whether the output is structurally and semantically acceptable.
- The decision layer determines whether to accept, reject, retry, or escalate.
The deeper point is architectural: you do not solve hallucinations by asking the model to “be more careful.” You solve them by reducing the amount of unverified freedom the model is allowed to exercise.
Context processing is one of the most underestimated layers
Even with good retrieval, raw context is rarely ready to pass directly into the model. Retrieved material can contain redundancy, conflicting information, outdated fragments, or irrelevant passages. Many teams focus heavily on embeddings and the LLM itself, while underinvesting in the layer that prepares context.
That is a mistake, because the model’s answer quality depends as much on context hygiene as on model capability.
Context processing is where the system decides what evidence is allowed to influence generation. This may include:
- removing duplicate chunks
- filtering low-confidence results
- keeping only chunks from approved sources
- normalizing formats
- ordering evidence by priority
- truncating to preserve only the strongest signal
A simple illustration:
```python
def process_context(chunks: list[str], max_chars: int = 1200) -> str:
    seen = set()
    cleaned = []
    for chunk in chunks:
        normalized = chunk.strip()
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return "\n\n".join(cleaned)[:max_chars]
```
This is a basic example, but it reflects an important idea: context is not raw input to the model. It is curated evidence.
A strong question to ask at this stage is:
If the model fails, did it fail because it reasoned poorly, or because we handed it noisy evidence?
That question often reveals that the failure belongs to upstream system design, not to the model alone.
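Curation can also be score-aware: drop weak matches and prefer trusted sources before anything reaches the prompt. A sketch in which the `official` tier and the distance cutoff are illustrative assumptions, not recommended values:

```python
def curate(chunks: list[dict], max_chunks: int = 3, max_distance: float = 1.0) -> list[str]:
    # Keep only sufficiently close matches, then order trusted sources
    # first and closer matches earlier within each tier.
    trusted_first = sorted(
        (c for c in chunks if c["distance"] <= max_distance),
        key=lambda c: (c["source"] != "official", c["distance"]),
    )
    return [c["text"] for c in trusted_first[:max_chunks]]

chunks = [
    {"text": "PaymentIntents guide", "source": "official", "distance": 0.4},
    {"text": "Old forum thread", "source": "web", "distance": 0.3},
    {"text": "Unrelated travel doc", "source": "web", "distance": 1.7},
]
result = curate(chunks)
```

The weak match is discarded entirely, and the official source outranks a slightly closer but less trusted one.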
Validation is where probabilistic output meets deterministic engineering
If there is one layer that most clearly separates prototypes from production systems, it is validation.
Without validation, an LLM system is essentially trusting generated output based on presentation quality. With validation, the system begins to behave like engineered software again. The goal is not to prove the model is always right. The goal is to ensure the system does not accept high-risk outputs without deterministic checks.
The type of validation depends on the task.
For structured outputs, schema validation is the first barrier. If the model is supposed to return an object with specific fields, those fields should be validated strictly.
```python
from pydantic import BaseModel, ValidationError

class ApiCall(BaseModel):
    method: str
    endpoint: str
    requires_auth: bool

def validate_structured_output(data: dict):
    try:
        return ApiCall(**data)
    except ValidationError:
        return None
```
This catches malformed responses, but it does not catch false content inside a valid structure. A perfectly shaped object can still be wrong.
That is why semantic validation must follow structural validation.
Example: a valid structure with invalid semantics
The model returns:
```json
{
  "method": "POST",
  "endpoint": "/v1/charge",
  "requires_auth": true
}
```
This may pass a schema validator because the fields exist and types are correct. But the endpoint is still wrong. Structural validation succeeded. Semantic validation failed.
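A semantic gate for this case can be as simple as checking the pair against an allowlist derived from the real API surface. A sketch; the endpoints listed are illustrative and would normally be generated from the provider's API specification rather than written by hand:

```python
# Illustrative allowlist; in practice generated from the API's spec.
KNOWN_ENDPOINTS = {
    ("POST", "/v1/payment_intents"),
    ("GET", "/v1/payment_intents"),
}

def semantically_valid(call: dict) -> bool:
    # Runs after schema validation: the shape is already correct,
    # so the only remaining question is whether the call actually exists.
    return (call["method"], call["endpoint"]) in KNOWN_ENDPOINTS

good = {"method": "POST", "endpoint": "/v1/payment_intents", "requires_auth": True}
bad = {"method": "POST", "endpoint": "/v1/charge", "requires_auth": True}
```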
For code generation, semantic validation often means execution plus tests.
```python
def run_generated_code_safely(code: str, test_func):
    # NOTE: exec is not a sandbox; in production this belongs in an
    # isolated worker (container, subprocess with resource limits).
    namespace = {}
    try:
        # Single namespace so functions defined in the code can see each other.
        exec(code, namespace)
        return test_func(namespace)
    except Exception:
        return False
```
The critical insight is that validation must answer a harder question than formatting:
Could this output be accepted by the system and still be wrong?
If yes, more validation is needed.
Comparing traditional software and LLM systems
One reason teams underestimate these challenges is that they unconsciously apply the wrong engineering intuition. The table below shows why LLM systems need a different mindset.
| Dimension | Traditional Software | LLM Component |
|---|---|---|
| Output behavior | Deterministic | Probabilistic |
| Truth source | Rules and state | Learned token distributions |
| Failure mode | Explicit error or exception | Plausible but incorrect response |
| Debugging | Reproduce exact path | Analyze distributions and context |
| Testing | Exact expected output | Statistical and scenario-based |
| Safety strategy | Unit/integration tests | Validation, grounding, observability |
This comparison explains why a prompt-only approach usually breaks at scale. Prompting can improve local performance, but it does not change the underlying failure model.
Consistency requires control, not hope
Because non-determinism cannot be eliminated completely, it must be managed. The system needs mechanisms that reduce variance where consistency matters.
One common control is lower-temperature generation. Lower temperature reduces randomness and usually improves consistency. But it is not a magic fix. A confidently repeated wrong answer is still wrong. Consistency without verification can simply stabilize the wrong behavior.
Another control is structured prompting. When prompts specify the expected reasoning path and output format, they reduce ambiguity and narrow the model’s action space.
For example, compare these two prompts.
Too open-ended:

```text
Explain how to call the API and give the right parameters.
```

More controlled:

```text
Using only the provided documentation context, return a JSON object with:
1. HTTP method
2. exact endpoint
3. required headers
4. required body fields
If any field is not explicitly supported by the context, return null for that field.
```
The second prompt is better not because it is longer, but because it reduces hidden assumptions and creates output that is easier to validate.
A further step is multi-candidate generation with ranking or verification. Instead of trusting one answer, the system can generate several and choose the one that best satisfies rules or passes validation.
```python
def choose_best_output(prompt: str, generator, scorer, n: int = 3):
    candidates = [generator(prompt) for _ in range(n)]
    scored = [(candidate, scorer(candidate)) for candidate in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[0][0]
```
This is especially useful when a task admits multiple plausible phrasings but only some are fully grounded or structurally compliant.
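A rule-based scorer makes the selection step concrete. Everything here is a toy stand-in: the candidates are canned, the endpoint string is illustrative, and a real scorer would run schema validation, grounding checks, or tests:

```python
candidates = [
    "Call /v1/charge with an API key.",
    "POST to /v1/payment_intents with amount and currency.",
    "It depends on your setup; many options exist.",
]

def verification_score(candidate: str) -> int:
    # Toy verification: reward the expected endpoint, penalize vagueness.
    score = 0
    if "/v1/payment_intents" in candidate:
        score += 2
    if "depends" in candidate:
        score -= 1
    return score

best = max(candidates, key=verification_score)  # picks the most verifiable answer
```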
A practical question here is:
Should the system optimize for one eloquent answer, or for the most verifiable answer?
In production, the second is usually the right choice.
Observability is mandatory because failures are often silent
In ordinary software systems, obvious failures trigger obvious investigation. In LLM systems, some of the worst failures are silent. The answer is accepted, no exception is thrown, and the problem emerges only later as an incorrect report, a bad integration, or a flawed decision.
That is why observability is not optional. The system needs to record enough information to reconstruct what happened:
- the original user request
- the prompt or template version
- the retrieved context
- model settings
- raw outputs
- validation outcomes
- final decision
- user feedback where available
A minimal logging example might look like this:
```python
import time

def log_event(query, context, raw_output, validated, decision):
    return {
        "timestamp": time.time(),
        "query": query,
        "context": context,
        "raw_output": raw_output,
        "validated": validated,
        "decision": decision
    }
```
In a real system, this data becomes the basis for regression analysis, failure clustering, and evaluation dataset creation.
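A common persistence choice for these events is append-only JSONL: one event per line, trivially greppable, and easy to load later for failure clustering or building evaluation sets. A sketch (a temporary file stands in for a real log destination):

```python
import json
import os
import tempfile

def append_event(path: str, event: dict) -> None:
    # One JSON object per line; append-only keeps writes cheap and atomic enough
    # for a single writer.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")

path = os.path.join(tempfile.mkdtemp(), "llm_events.jsonl")
append_event(path, {"query": "refund policy?", "validated": True, "decision": "accept"})
append_event(path, {"query": "fee in 2031?", "validated": False, "decision": "escalate"})

with open(path, encoding="utf-8") as f:
    events = [json.loads(line) for line in f]
```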
A strong engineering question is:
If a user reports a wrong answer, do we have enough information to diagnose whether retrieval, prompting, generation, or validation failed?
Without that visibility, the team is not really operating a system. It is operating a black box.
The evaluation mindset must change
Testing LLM systems is fundamentally different from testing ordinary code. You cannot rely only on exact-match assertions. Many tasks allow multiple acceptable outputs, while dangerous failures may still look polished.
Evaluation must therefore reflect real usage conditions, not just benchmark convenience. Good evaluation sets should include:
- normal cases
- ambiguous cases
- adversarial phrasing
- edge conditions
- outdated context scenarios
- conflicting evidence scenarios
- incomplete data scenarios
The aim is not simply to ask, “Did the model answer correctly?” The better question is:
Under what conditions does the entire system fail, and does it fail safely?
That wording matters because a safe refusal can be more valuable than a polished but incorrect answer.
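The scenario mix above can be wired into a small harness in which a safe refusal counts as a pass for unanswerable cases. The case labels and the stub system are assumptions for illustration:

```python
def evaluate(system, cases: list[dict]) -> dict:
    # Scenario-based scoring: refusing is correct on unanswerable cases
    # and incorrect everywhere else.
    results = {}
    for case in cases:
        answer = system(case["query"])
        refused = answer == "REFUSE"
        if case["kind"] == "unanswerable":
            passed = refused
        else:
            passed = not refused and case["expected"] in answer
        results.setdefault(case["kind"], []).append(passed)
    return {kind: sum(flags) / len(flags) for kind, flags in results.items()}

def stub_system(query: str) -> str:
    if "2031" in query:  # unknowable future fact -> safe refusal
        return "REFUSE"
    return "Use PaymentIntents."

cases = [
    {"kind": "normal", "query": "How do I take a payment?", "expected": "PaymentIntents"},
    {"kind": "unanswerable", "query": "What will the fee be in 2031?", "expected": ""},
]
report = evaluate(stub_system, cases)
```

Reporting pass rates per scenario kind, rather than one aggregate number, is what surfaces unsafe failure modes.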
A practical production pattern
A strong LLM system often follows a decision-oriented pipeline like this:

Classify request → choose a path (retrieval, tools, direct answer, human escalation, or clean rejection) → generate → validate → accept, retry, or escalate

This flow is useful because it shows an engineering principle that applies broadly: the system should not force every request down the same path. Some tasks need retrieval. Some need tools. Some need human escalation. Some should be rejected cleanly.
That is how the architecture absorbs uncertainty instead of pretending uncertainty does not exist.
Questions every production LLM team should keep asking
The strongest teams tend to ask better operational questions than everyone else. Here are some of the most important ones:
- Can the system detect a well-formatted but incorrect output?
- Does retrieval improve truthfulness, or just increase answer confidence?
- Which failures come from the model, and which come from upstream context design?
- Can we reproduce a bad output under the same conditions?
- Are we optimizing for linguistic quality or decision reliability?
- When the system is uncertain, does it expose uncertainty or hide it behind fluency?
- What happens if the validator passes a structurally valid but semantically false response?
- Which classes of requests should never be answered without human review?
These are not philosophical questions. They are production questions.
Final perspective
The hardest part of deploying LLMs is not integrating an API or writing a better prompt. It is accepting that a fluent model is not the same thing as a reliable system.
A model can generate.
A system must decide.
A model can imitate valid structure.
A system must verify meaning.
A model can produce plausible answers.
A production architecture must control when those answers are trusted, retried, constrained, or rejected.
That is the real engineering challenge of using LLMs in production systems. The teams that succeed are not the ones that merely use advanced models. They are the ones that design robust pipelines around the model’s limitations: grounded retrieval, disciplined context preparation, deterministic validation, controlled generation, observability, and continuous evaluation.
The line between experimenting with AI and engineering with AI is drawn exactly there.



