<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pierfelice Menga</title>
    <description>The latest articles on DEV Community by Pierfelice Menga (@agen-it).</description>
    <link>https://dev.to/agen-it</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3718196%2F73251327-38e5-4779-b4df-473972eb7165.jpg</url>
      <title>DEV Community: Pierfelice Menga</title>
      <link>https://dev.to/agen-it</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agen-it"/>
    <language>en</language>
    <item>
      <title>The Real Engineering Challenges of Using LLMs in Production Systems</title>
      <dc:creator>Pierfelice Menga</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:21:17 +0000</pubDate>
      <link>https://dev.to/agen-it/the-real-engineering-challenges-of-using-llms-in-production-systems-3h67</link>
      <guid>https://dev.to/agen-it/the-real-engineering-challenges-of-using-llms-in-production-systems-3h67</guid>
      <description>&lt;p&gt;&lt;em&gt;" Large Language Models are no longer experimental novelties. They are now embedded into internal copilots, support systems, search interfaces, analytics assistants, coding workflows, document pipelines, and increasingly, decision-support platforms. At the prototype stage, they often appear surprisingly capable. A well-written prompt produces fluent answers, clean code, and convincing reasoning. But the moment an LLM is placed inside a production system, the engineering reality changes."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjc26r64euz08iycwu6n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjc26r64euz08iycwu6n.jpg" alt="Title image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;The central problem is simple to state and difficult to solve:&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;an LLM can produce output that looks correct, sounds correct, and fits the requested format, while being fundamentally wrong.&lt;/strong&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.aws.amazon.com%2Fimages%2Fsagemaker%2Flatest%2Fdg%2Fimages%2Fjumpstart%2Fjumpstart-fm-rag.jpg" height="474" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer" class="c-link"&gt;
            What is RAG? - Retrieval-Augmented Generation AI Explained - AWS
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            What is Retrieval-Augmented Generation (RAG), how and why businesses use RAG AI, and how to use RAG with AWS.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fa0.awsstatic.com%2Flibra-css%2Fimages%2Fsite%2Ffav%2Ffavicon.ico" width="16" height="16"&gt;
          aws.amazon.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.linkedin.com/top-content/artificial-intelligence/understanding-ai-systems/understanding-the-role-of-rag-in-ai-applications/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.licdn.com%2Faero-v1%2Fsc%2Fh%2Fen3f1pk3qk4cxtj2j4fff0gtr" height="21" class="m-0" width="84"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.linkedin.com/top-content/artificial-intelligence/understanding-ai-systems/understanding-the-role-of-rag-in-ai-applications/" rel="noopener noreferrer" class="c-link"&gt;
            Understanding the Role of Rag in AI Applications
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Explore how RAG combines real-time data to refine AI responses, boosting accuracy and context. Delve into its uses and advancements in natural language…
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.licdn.com%2Faero-v1%2Fsc%2Fh%2Fal2o9zrvru7aqj8e1x2rzsrca" width="64" height="64"&gt;
          linkedin.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.forbes.com/councils/forbesbusinesscouncil/2024/04/24/the-rag-effect-how-ai-is-becoming-more-relevant-and-accurate/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;forbes.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;That single property reshapes everything about system design. Traditional software engineering is built on deterministic assumptions. Given the same input and the same state, the system should behave in the same way. LLM-based systems violate that expectation at the component level. They are probabilistic, not deterministic. They generate, rather than retrieve. They imitate valid structure without actually guaranteeing semantic correctness. As a result, the main challenge is not how to make an LLM answer beautifully, but how to make a larger system remain reliable when one of its core components is inherently uncertain.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;This is where the real engineering work begins.&lt;/u&gt;&lt;/p&gt;

&lt;h2&gt;Why hallucinations are a system problem, not a model quirk&lt;/h2&gt;

&lt;p&gt;Hallucination is often described too casually, as if it were just an occasional mistake. In practice, it is much more structural than that. An LLM does not check a truth table before replying. It predicts the next token based on learned statistical patterns. If the available context is weak, incomplete, conflicting, or slightly off-distribution, the model does not pause like a careful engineer and say, “I do not have enough verified information.” Instead, it continues the pattern of plausible generation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yifazyx32xkymwa0z06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yifazyx32xkymwa0z06.png" alt="realibility" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That behavior becomes dangerous because the output usually preserves the surface signals humans trust most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correct grammar&lt;/li&gt;
&lt;li&gt;correct formatting&lt;/li&gt;
&lt;li&gt;domain vocabulary&lt;/li&gt;
&lt;li&gt;coherent flow&lt;/li&gt;
&lt;li&gt;confident tone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In other words, the answer often fails at the exact layer that is hardest to detect quickly: meaning.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A generated function may compile and even pass a few happy-path tests while still failing on edge cases. A generated API call may look perfectly aligned with the target service while using parameters that do not actually exist. A generated SQL transformation may execute successfully while applying the wrong filter condition, quietly corrupting downstream metrics. In all of these cases, the visible structure suggests correctness, but the hidden logic is flawed.&lt;/p&gt;

&lt;p&gt;That distinction matters. A broken JSON response is easy to reject. A beautifully structured but incorrect JSON response is much more expensive to catch.&lt;/p&gt;
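
&lt;p&gt;&lt;em&gt;A minimal sketch of that cost difference (the order data and field names here are invented for illustration): the structural checks a pipeline typically runs all pass, and only a semantic comparison against the system of record exposes the error.&lt;/em&gt;&lt;/p&gt;

```python
import json

# Hypothetical model output for an order-status query: it parses cleanly
# and contains exactly the fields the consumer expects.
raw = '{"order_id": "A-1001", "status": "refunded", "amount": 49.99}'
payload = json.loads(raw)

# Structural validation passes: well-formed JSON, expected keys, expected types.
assert set(payload) == {"order_id", "status", "amount"}
assert isinstance(payload["amount"], float)

# Only a semantic check against the system of record catches the problem.
# In this invented scenario the order was actually shipped, not refunded.
orders_db = {"A-1001": {"status": "shipped", "amount": 49.99}}
record = orders_db[payload["order_id"]]
print(record["status"] == payload["status"])  # False: valid structure, wrong meaning
```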

&lt;p&gt;&lt;strong&gt;Example: valid syntax, invalid logic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a simple function generated for discount calculation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;discount&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
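
&lt;p&gt;&lt;em&gt;Why this is risky in practice (the sample calls and the guarded variant below are illustrative additions, not part of the generated code): the function is correct when the discount is a fraction, but a caller passing a percentage gets a silently absurd number rather than an error.&lt;/em&gt;&lt;/p&gt;

```python
def apply_discount(price, discount):
    # The generated function: implicitly assumes discount is a fraction in [0, 1].
    return price - price * discount

# Happy-path tests pass:
assert apply_discount(100, 0.2) == 80.0

# An input the happy-path tests never exercise: a percentage-style discount.
assert apply_discount(100, 20) == -1900  # valid syntax, absurd result

def apply_discount_checked(price, discount):
    # Hypothetical defensive variant: fail loudly instead of returning nonsense.
    if not 0 <= discount <= 1:
        raise ValueError("discount must be a fraction between 0 and 1")
    return price - price * discount
```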



&lt;p&gt;&lt;strong&gt;Example of Incorrect RAG Code and Why It Fails&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One of the most common mistakes in early RAG systems is assuming that retrieval alone guarantees correctness. In reality, a poorly designed retrieval pipeline can silently inject irrelevant context into the prompt, which makes the final answer look grounded while still being wrong.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here is a deliberately incorrect RAG example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stripe uses PaymentIntents for modern payments.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Redis is an in-memory database.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Eiffel Tower is in Paris.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Legacy charges API exists in older Stripe workflows.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Answer the question using the context below.

    Context:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Question:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, this looks reasonable. It encodes documents, runs similarity search, builds a context string, and passes everything to the model. But from an engineering perspective, this implementation is fragile in several ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this code is incorrect&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;First, it retrieves chunks only by vector similarity and blindly trusts the top results. That means semantically related but operationally useless text can enter the context. If the query is about Stripe, the retriever may still include general or outdated chunks, or even partially related noise.&lt;/p&gt;

&lt;p&gt;Second, there is no threshold for retrieval quality. Even if the top matches are weak, the pipeline still sends them to the LLM. The model then receives low-confidence evidence and often turns it into a high-confidence answer.&lt;/p&gt;

&lt;p&gt;Third, there is no reranking or filtering. The code assumes the vector index already returned the most useful chunks in the best order. In practice, top-k similarity is often only the first stage.&lt;/p&gt;

&lt;p&gt;Fourth, the context is merged into one flat block. There is no metadata, no source labeling, no freshness information, and no separation between high-trust and low-trust documents. The LLM sees one blended text surface and may combine unrelated facts into a single polished response.&lt;/p&gt;

&lt;p&gt;Fifth, there is no validation after generation. Even if the LLM produces a well-written answer based on outdated or irrelevant chunks, nothing in the system detects that failure.&lt;/p&gt;
&lt;/blockquote&gt;
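
&lt;p&gt;&lt;em&gt;The fifth gap is the cheapest to narrow. Even a crude post-generation check helps; the sketch below (the function name, the word-length cutoff, and the 0.5 threshold are all illustrative choices, not a production recipe) flags answers whose content words barely overlap the retrieved evidence.&lt;/em&gt;&lt;/p&gt;

```python
def naive_grounding_check(answer, context_chunks, min_overlap=0.5):
    # Crude lexical grounding check: what fraction of the answer's content
    # words actually appear somewhere in the retrieved context?
    def content_words(text):
        return {w.lower().strip(".,:;") for w in text.split() if len(w) > 3}

    answer_words = content_words(answer)
    if not answer_words:
        return True  # nothing substantive to verify

    context_words = set()
    for chunk in context_chunks:
        context_words |= content_words(chunk)

    overlap = len(answer_words & context_words) / len(answer_words)
    return overlap >= min_overlap

context = ["Stripe uses PaymentIntents for modern payments."]
print(naive_grounding_check("Stripe uses PaymentIntents for payments.", context))   # True
print(naive_grounding_check("Use the Charges API with webhooks enabled.", context)) # False
```

A real system would replace the lexical overlap with entailment scoring or claim-level verification, but even this kind of gate turns a silent failure into a detectable one.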

&lt;p&gt;&lt;u&gt;This is the core engineering danger of bad RAG: retrieval makes the answer look grounded without making it correct.&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can go wrong in practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine the user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How should I integrate Stripe payments in a new application?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The retriever may return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a correct chunk about PaymentIntents&lt;/li&gt;
&lt;li&gt;an old chunk about legacy Charges API&lt;/li&gt;
&lt;li&gt;an unrelated chunk because the embedding similarity was only loosely relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model now has mixed evidence. Instead of refusing or expressing uncertainty, it may generate a blended answer such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the Charges API for direct payment creation, or PaymentIntents if needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;A stronger RAG version&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stripe uses PaymentIntents for modern payments.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;official&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Redis is an in-memory database.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;official&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Eiffel Tower is in Paris.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;travel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Legacy charges API exists in older Stripe workflows.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;archive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;doc_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_distance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic_filter&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;topic_filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;distance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;approved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;official&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Source: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retrieved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I do not have enough reliable retrieved context to answer safely.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use only the context below.
    If the answer is not explicitly supported, say you do not know.

    Context:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Question:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That answer sounds professional, but it is not a reliable recommendation for a modern production system.&lt;br&gt;
The problem is not that retrieval failed completely.&lt;br&gt;
The problem is that retrieval failed partially, which is harder to notice.&lt;/p&gt;

&lt;p&gt;The system appears grounded, but the grounding itself is weak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Main libraries used in LLM systems&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Main role&lt;/th&gt;
&lt;th&gt;Typical use&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Model inference, embeddings, API access&lt;/td&gt;
&lt;td&gt;Generate answers, structured outputs, embeddings&lt;/td&gt;
&lt;td&gt;OpenAI’s API includes Responses and Embeddings endpoints. (&lt;a href="https://platform.openai.com/docs/api-reference/embeddings" rel="noopener noreferrer"&gt;OpenAI Platform&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;langchain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Orchestration framework&lt;/td&gt;
&lt;td&gt;Prompting, chains, retrievers, agents&lt;/td&gt;
&lt;td&gt;LangChain docs cover retrieval flows including 2-step RAG and agentic RAG. (&lt;a href="https://docs.langchain.com/oss/python/langchain/retrieval" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local embedding models&lt;/td&gt;
&lt;td&gt;Encode queries/docs into vectors&lt;/td&gt;
&lt;td&gt;Common for semantic search and RAG embedding pipelines. &lt;code&gt;SentenceTransformer(...).encode(...)&lt;/code&gt; is the core pattern. (&lt;a href="https://www.sbert.net/docs/sentence_transformer/pretrained_models.html" rel="noopener noreferrer"&gt;SentenceTransformers&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;faiss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dense vector similarity search&lt;/td&gt;
&lt;td&gt;Fast local ANN/vector search&lt;/td&gt;
&lt;td&gt;FAISS is designed for efficient similarity search and clustering of dense vectors. (&lt;a href="https://faiss.ai/index.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Faiss&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;qdrant-client&lt;/code&gt; / Qdrant&lt;/td&gt;
&lt;td&gt;Production vector DB&lt;/td&gt;
&lt;td&gt;Store/search vectors with payload filters&lt;/td&gt;
&lt;td&gt;Qdrant stores points made of vectors plus optional payload metadata and supports search/filtering. (&lt;a href="https://docs.langchain.com/oss/python/integrations/vectorstores/qdrant" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pydantic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Output/schema validation&lt;/td&gt;
&lt;td&gt;Validate structured LLM outputs&lt;/td&gt;
&lt;td&gt;Not a model library, but widely used to make LLM responses safer in production.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;External API/tool calls&lt;/td&gt;
&lt;td&gt;Fetch docs, APIs, webpages&lt;/td&gt;
&lt;td&gt;Frequently used inside tool-using or retrieval workflows. LangChain’s examples use it in agentic retrieval flows. (&lt;a href="https://docs.langchain.com/oss/python/langchain/retrieval" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;numpy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector/matrix handling&lt;/td&gt;
&lt;td&gt;Embedding arrays, FAISS inputs&lt;/td&gt;
&lt;td&gt;Standard companion library for local embedding and vector search pipelines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transformers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local HF model inference/training&lt;/td&gt;
&lt;td&gt;Run local LLMs/embeddings&lt;/td&gt;
&lt;td&gt;Often used when you do not want hosted inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;tiktoken&lt;/code&gt; or tokenizer libs&lt;/td&gt;
&lt;td&gt;Token counting/chunking&lt;/td&gt;
&lt;td&gt;Split context safely&lt;/td&gt;
&lt;td&gt;Useful for prompt budgeting and chunk sizing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
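&lt;p&gt;As a small illustration of the token-budgeting row above, a context assembler can keep chunks in priority order until an estimated budget is spent. This is a sketch: &lt;code&gt;tiktoken&lt;/code&gt; gives exact counts for OpenAI models, while the 4-characters-per-token heuristic and the chunk sizes below are illustrative assumptions.&lt;/p&gt;

```python
def estimate_tokens(text: str) -> int:
    # Heuristic: roughly 4 characters per token for English text.
    # tiktoken gives exact counts; this stand-in is an assumption.
    return max(1, len(text) // 4)

def fit_chunks(chunks: list[str], budget: int) -> list[str]:
    # Keep chunks in priority order until the estimated budget is spent.
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept

selected = fit_chunks(["a" * 400, "b" * 400, "c" * 400], budget=250)
# each 400-char chunk costs ~100 estimated tokens, so only two fit
```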

&lt;p&gt;&lt;strong&gt;2) Main libraries used specifically in RAG systems&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;RAG stage&lt;/th&gt;
&lt;th&gt;What it usually does&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;langchain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pipeline orchestration&lt;/td&gt;
&lt;td&gt;Load docs, split, embed, retrieve, chain to LLM&lt;/td&gt;
&lt;td&gt;Its retrieval docs explicitly describe RAG architectures and retriever-driven flows. (&lt;a href="https://docs.langchain.com/oss/python/langchain/retrieval" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding&lt;/td&gt;
&lt;td&gt;Converts chunks and queries into vectors&lt;/td&gt;
&lt;td&gt;Common local embedding choice for semantic retrieval. (&lt;a href="https://www.sbert.net/docs/sentence_transformer/pretrained_models.html" rel="noopener noreferrer"&gt;SentenceTransformers&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding + generation&lt;/td&gt;
&lt;td&gt;Hosted embeddings and answer generation&lt;/td&gt;
&lt;td&gt;OpenAI embeddings return vectors whose length depends on the selected model. (&lt;a href="https://platform.openai.com/docs/api-reference/embeddings?_clear=true&amp;amp;lang=node.js&amp;amp;utm_source=chatgpt.com" rel="noopener noreferrer"&gt;OpenAI Platform&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;faiss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Vector index&lt;/td&gt;
&lt;td&gt;Local similarity search over dense vectors&lt;/td&gt;
&lt;td&gt;Strong for fast local prototypes and single-node systems. (&lt;a href="https://faiss.ai/index.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Faiss&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;qdrant-client&lt;/code&gt; / Qdrant&lt;/td&gt;
&lt;td&gt;Vector storage + filtering&lt;/td&gt;
&lt;td&gt;Production search with metadata/payload&lt;/td&gt;
&lt;td&gt;Supports dense, sparse, and hybrid retrieval in the LangChain integration. (&lt;a href="https://docs.langchain.com/oss/python/integrations/vectorstores/qdrant" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;langchain-community&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Integrations&lt;/td&gt;
&lt;td&gt;FAISS, loaders, utilities&lt;/td&gt;
&lt;td&gt;LangChain’s FAISS integration lives in &lt;code&gt;langchain-community&lt;/code&gt;. (&lt;a href="https://docs.langchain.com/oss/python/integrations/vectorstores/faiss" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;langchain-qdrant&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Qdrant integration&lt;/td&gt;
&lt;td&gt;Qdrant vector store wrapper for LangChain&lt;/td&gt;
&lt;td&gt;Official LangChain integration package for Qdrant. (&lt;a href="https://docs.langchain.com/oss/python/integrations/vectorstores/qdrant" rel="noopener noreferrer"&gt;LangChain Docs&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;rank-bm25&lt;/code&gt; or sparse search tools&lt;/td&gt;
&lt;td&gt;Keyword retrieval&lt;/td&gt;
&lt;td&gt;Lexical retrieval complement&lt;/td&gt;
&lt;td&gt;Often paired with dense retrieval for hybrid RAG.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-encoders (&lt;code&gt;sentence-transformers&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Re-ranking&lt;/td&gt;
&lt;td&gt;Reorder retrieved results more accurately&lt;/td&gt;
&lt;td&gt;Sentence Transformers provides Cross-Encoder reranking models for passage reranking. (&lt;a href="https://www.sbert.net/docs/pretrained-models/ce-msmarco.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;SentenceTransformers&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
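&lt;p&gt;To make the hybrid-retrieval and re-ranking rows concrete, here is a minimal sketch that fuses a dense similarity score with a toy lexical overlap score. In a real pipeline, &lt;code&gt;rank-bm25&lt;/code&gt; or a Cross-Encoder reranker would replace the toy scorer; the equal weighting and the sample documents are illustrative assumptions.&lt;/p&gt;

```python
def lexical_score(query: str, text: str) -> float:
    # Toy lexical overlap: fraction of query terms present in the text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query: str, docs: list[dict]) -> list[dict]:
    # Each doc carries a precomputed dense-similarity score in [0, 1].
    # The 0.5 / 0.5 weighting is an illustrative choice.
    for doc in docs:
        doc["score"] = 0.5 * doc["dense"] + 0.5 * lexical_score(query, doc["text"])
    return sorted(docs, key=lambda d: d["score"], reverse=True)

docs = [
    {"text": "refund a stripe charge", "dense": 0.40},
    {"text": "rotate api keys", "dense": 0.80},
]
ranked = hybrid_rank("how do I refund a charge", docs)
# lexical overlap lifts the first doc above the one with the higher dense score
```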

&lt;p&gt;At first glance, a naively generated &lt;code&gt;apply_discount(price, discount)&lt;/code&gt; looks fine: it is short, readable, and syntactically correct. But what does &lt;code&gt;discount&lt;/code&gt; mean? Is it 0.2 for twenty percent? Is it 20? What happens with negative values? What if the value exceeds 1? The model has produced a function that looks complete, but key semantic assumptions are left unresolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A production-safe implementation would make those assumptions explicit:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;discount_rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price must be non-negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;discount_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;discount_rate must be between 0 and 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;discount_rate&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important lesson is not that the second version is longer. It is that engineering requires explicit constraints, while generation often omits them unless forced by the system.&lt;/p&gt;

&lt;p&gt;A useful question to ask whenever an LLM produces code is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does this output merely look like an implementation, or does it encode the actual business rules?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev.tourl"&gt;That question separates demo-quality output from production-quality output.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why reliability is harder than accuracy
&lt;/h2&gt;

&lt;p&gt;Many teams initially frame the problem as accuracy: how do we get more correct answers? Accuracy matters, but reliability is broader and often more important. A system can be reasonably accurate on average and still be operationally unreliable if its failures are inconsistent, irreproducible, and hard to debug.&lt;/p&gt;

&lt;p&gt;(This is the second major engineering challenge of LLM systems: non-determinism.)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Traditional software systems are expected to behave consistently. If a bug appears, engineers try to reproduce it, isolate the state, inspect the inputs, and trace the logic path. With LLMs, that workflow becomes less stable. Two runs with nearly identical conditions can yield different wording, different assumptions, different decomposition steps, and sometimes different final conclusions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mpsyx15250dl7vwbekk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mpsyx15250dl7vwbekk.png" alt="Work process" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This variability affects much more than output style. It changes how systems must be tested, monitored, and maintained.&lt;br&gt;
A small variation in an early classification step can alter retrieval. Altered retrieval changes context. Changed context changes generation. Changed generation may trigger or avoid a validator. In a multi-step pipeline, small probabilistic differences can cascade into materially different outcomes.&lt;br&gt;
That is why reproducibility becomes a first-class engineering concern.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;A practical question for any production LLM pipeline is:&lt;/u&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the same request fails today, can we reproduce the same failure tomorrow?&lt;br&gt;
If the answer is no, debugging becomes slower, monitoring becomes noisier, and rollback analysis becomes more difficult.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The shape of a production-safe architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because LLMs are probabilistic generators, they should almost never sit alone between user input and final output in a serious system. A production architecture needs surrounding layers that constrain, ground, verify, and observe behavior.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A useful high-level diagram looks like this:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmujpwsktkzc0md1evur9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmujpwsktkzc0md1evur9.jpg" alt="Accuracy" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram matters because it shows the correct mental model: the LLM is one stage in a larger reliability pipeline, not the pipeline itself.&lt;br&gt;
&lt;em&gt;Each layer exists because a different class of failure must be handled outside the model.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routing reduces ambiguity by deciding what kind of problem this is.&lt;/li&gt;
&lt;li&gt;Retrieval grounds the response in actual data.&lt;/li&gt;
&lt;li&gt;Context processing removes noise before generation.&lt;/li&gt;
&lt;li&gt;Validation checks whether the output is structurally and semantically acceptable.&lt;/li&gt;
&lt;li&gt;The decision layer determines whether to accept, reject, retry, or escalate.&lt;/li&gt;
&lt;/ul&gt;
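&lt;p&gt;The decision layer in this list can be sketched as a small policy function; the retry limit and confidence threshold below are illustrative assumptions:&lt;/p&gt;

```python
def decide(valid: bool, confidence: float, attempts: int,
           max_retries: int = 2, min_confidence: float = 0.7) -> str:
    # Accept only validated, confident output; otherwise retry a bounded
    # number of times, then escalate. Thresholds are illustrative.
    if valid and confidence >= min_confidence:
        return "accept"
    if attempts < max_retries:
        return "retry"     # regenerate, possibly with a stricter prompt
    return "escalate"      # hand off to a human or a deterministic fallback
```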

&lt;p&gt;&lt;u&gt;The deeper point is architectural: you do not solve hallucinations by asking the model to “be more careful.” You solve them by reducing the amount of unverified freedom the model is allowed to exercise.&lt;/u&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Context processing is one of the most underestimated layers
&lt;/h2&gt;

&lt;p&gt;Even with good retrieval, raw context is rarely ready to pass directly into the model. Retrieved material can contain redundancy, conflicting information, outdated fragments, or irrelevant passages. Many teams focus heavily on embeddings and the LLM itself, while underinvesting in the layer that prepares context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is a mistake, because the model’s answer quality depends as much on context hygiene as on model capability.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Context processing is where the system decides what evidence is allowed to influence generation. This may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removing duplicate chunks&lt;/li&gt;
&lt;li&gt;filtering low-confidence results&lt;/li&gt;
&lt;li&gt;keeping only chunks from approved sources&lt;/li&gt;
&lt;li&gt;normalizing formats&lt;/li&gt;
&lt;li&gt;ordering evidence by priority&lt;/li&gt;
&lt;li&gt;truncating to preserve only the strongest signal&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A simple illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a basic example, but it reflects an important idea: context is not raw input to the model. It is curated evidence.&lt;/p&gt;
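&lt;p&gt;The same pattern extends to the other curation steps listed above. A sketch that also filters by confidence and source, where the score threshold and the &lt;code&gt;official&lt;/code&gt; source label are illustrative assumptions:&lt;/p&gt;

```python
def curate(chunks: list[dict], min_score: float = 0.75,
           approved: tuple = ("official",)) -> list[str]:
    # Drop low-confidence and unapproved chunks, then deduplicate.
    # The 0.75 threshold and "official" label are illustrative.
    kept, seen = [], set()
    for c in chunks:
        text = c["text"].strip()
        if c["source"] not in approved or c["score"] < min_score:
            continue
        if text and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept

chunks = [
    {"text": "Refunds post within 5-10 days.", "source": "official", "score": 0.91},
    {"text": "Refunds post within 5-10 days.", "source": "official", "score": 0.91},
    {"text": "I think refunds are instant?", "source": "forum", "score": 0.88},
]
curated = curate(chunks)  # one chunk survives: deduped, approved, high-score
```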

&lt;p&gt;A strong question to ask at this stage is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the model fails, did it fail because it reasoned poorly, or because we handed it noisy evidence?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That question often reveals that the failure belongs to upstream system design, not to the model alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Validation is where probabilistic output meets deterministic engineering
&lt;/h2&gt;

&lt;p&gt;If there is one layer that most clearly separates prototypes from production systems, it is validation.&lt;/p&gt;

&lt;p&gt;Without validation, an LLM system is essentially trusting generated output based on presentation quality. With validation, the system begins to behave like engineered software again. The goal is not to prove the model is always right. The goal is to ensure the system does not accept high-risk outputs without deterministic checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The type of validation depends on the task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For structured outputs, schema validation is the first barrier. If the model is supposed to return an object with specific fields, those fields should be validated strictly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApiCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;requires_auth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_structured_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ApiCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ValidationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches malformed responses, but it does not catch false content inside a valid structure. A perfectly shaped object can still be wrong.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That is why semantic validation must follow structural validation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Example: a valid structure with invalid semantics&lt;br&gt;
The model returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/v1/charge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requires_auth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may pass a schema validator because the fields exist and the types are correct. But the endpoint is still wrong: Stripe’s real endpoint is &lt;code&gt;/v1/charges&lt;/code&gt;, not &lt;code&gt;/v1/charge&lt;/code&gt;. Structural validation succeeded. Semantic validation failed.&lt;/p&gt;
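&lt;p&gt;One simple form of semantic validation is checking values against a registry of known-good options. The allowlist below is an illustrative stand-in for one derived from real API documentation:&lt;/p&gt;

```python
# Illustrative allowlist standing in for a registry built from real API docs.
KNOWN_ENDPOINTS = {("POST", "/v1/charges"), ("GET", "/v1/charges")}

def semantically_valid(call: dict) -> bool:
    # The shape can be right while the value is wrong: check content
    # against known-good values, not just the schema.
    return (call["method"], call["endpoint"]) in KNOWN_ENDPOINTS

bad = {"method": "POST", "endpoint": "/v1/charge", "requires_auth": True}
semantically_valid(bad)  # False: well-formed, but not a known endpoint
```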

&lt;p&gt;For code generation, semantic validation often means execution plus tests, ideally inside a sandboxed environment: a bare &lt;code&gt;exec&lt;/code&gt; call provides no real isolation and should only run trusted or containerized code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_generated_code_safely&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;test_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical insight is that validation must answer a harder question than formatting:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Could this output be accepted by the system and still be wrong?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If yes, more validation is needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparing traditional software and LLM systems
&lt;/h2&gt;

&lt;p&gt;One reason teams underestimate these challenges is that they unconsciously apply the wrong engineering intuition. The table below shows why LLM systems need a different mindset.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Traditional Software&lt;/th&gt;
&lt;th&gt;LLM Component&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Output behavior&lt;/td&gt;
&lt;td&gt;Deterministic&lt;/td&gt;
&lt;td&gt;Probabilistic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Truth source&lt;/td&gt;
&lt;td&gt;Rules and state&lt;/td&gt;
&lt;td&gt;Learned token distributions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Explicit error or exception&lt;/td&gt;
&lt;td&gt;Plausible but incorrect response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Reproduce exact path&lt;/td&gt;
&lt;td&gt;Analyze distributions and context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Exact expected output&lt;/td&gt;
&lt;td&gt;Statistical and scenario-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safety strategy&lt;/td&gt;
&lt;td&gt;Unit/integration tests&lt;/td&gt;
&lt;td&gt;Validation, grounding, observability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This comparison explains why a prompt-only approach usually breaks at scale. Prompting can improve local performance, but it does not change the underlying failure model.&lt;/p&gt;
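&lt;p&gt;The testing row of the table is worth making concrete. Instead of asserting one exact output, a statistical test runs the same prompt repeatedly and gates on a pass rate. The acceptance check and the stand-in generator below are illustrative assumptions, not a real client:&lt;/p&gt;

```python
import random

random.seed(0)  # deterministic for the demonstration

def passes_check(output: str) -> bool:
    """Task-specific acceptance check (here: 'looks like JSON')."""
    return output.strip().startswith("{")

def pass_rate(generate, prompt: str, trials: int = 100) -> float:
    """Statistical test: run the same prompt many times and measure how
    often the output satisfies the check, instead of asserting one exact
    expected string."""
    passed = sum(passes_check(generate(prompt)) for _ in range(trials))
    return passed / trials

# Stand-in for a real model call (assumption: swap in your own client).
def fake_generate(prompt: str) -> str:
    return '{"ok": true}' if random.random() < 0.9 else "Sure! Here is the answer."

rate = pass_rate(fake_generate, "Return JSON only.", trials=200)
assert rate > 0.7  # gate deployment on a threshold, not an exact output
```

&lt;p&gt;The threshold itself becomes an engineering decision: it encodes how much residual variance the downstream consumer can tolerate.&lt;/p&gt;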




&lt;h2&gt;
  
  
  Consistency requires control, not hope
&lt;/h2&gt;

&lt;p&gt;Because non-determinism cannot be eliminated completely, it must be managed. The system needs mechanisms that reduce variance where consistency matters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One common control is lower-temperature generation. Lower temperature reduces randomness and usually improves consistency. But it is not a magic fix. A confidently repeated wrong answer is still wrong. Consistency without verification can simply stabilize the wrong behavior.&lt;/em&gt;&lt;/p&gt;
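&lt;p&gt;Consistency can also be measured rather than hoped for. A minimal sketch, assuming a client wrapper that accepts a temperature parameter, generates several candidates for the same prompt and reports how often the modal answer appears:&lt;/p&gt;

```python
from collections import Counter

def agreement_rate(generate, prompt: str, k: int = 5) -> float:
    """Generate k candidates for the same prompt and measure how often the
    most common answer appears. High agreement means *stable* behavior,
    not *correct* behavior -- verification is still needed downstream."""
    outputs = [generate(prompt) for _ in range(k)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / k

# Hypothetical client wrapper; 'temperature' is an assumed parameter name,
# standing in for something like client.generate(prompt, temperature=0.1).
def low_temp_generate(prompt: str) -> str:
    return "42"  # deterministic stand-in for the demonstration

assert agreement_rate(low_temp_generate, "What is 6*7?") == 1.0
```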

&lt;p&gt;&lt;strong&gt;Another control is structured prompting. When prompts specify the expected reasoning path and output format, they reduce ambiguity and narrow the model’s action space.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, compare these two prompts.&lt;/p&gt;

&lt;p&gt;Too open-ended:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Explain how to call the API and give the right parameters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More controlled:&lt;/p&gt;

&lt;p&gt;Using only the provided documentation context, return a JSON object with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. HTTP method
2. exact endpoint
3. required headers
4. required body fields
If any field is not explicitly supported by the context, return "unknown" for that field.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second prompt is better not because it is longer, but because it reduces hidden assumptions and creates output that is easier to validate.&lt;/p&gt;
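&lt;p&gt;That validation advantage can be made concrete. A deterministic checker for the structured output requested above might look like this; the key names mirror the prompt, and the rules are illustrative:&lt;/p&gt;

```python
import json

# Keys mirror the fields requested in the controlled prompt above.
REQUIRED_KEYS = {"method", "endpoint", "headers", "body_fields"}
ALLOWED_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE"}

def validate_api_plan(raw: str):
    """Deterministic check of the structured output requested by the prompt.
    Returns the parsed plan, or None if any contract rule is violated."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, dict) or not REQUIRED_KEYS.issubset(plan):
        return None
    if plan["method"] not in ALLOWED_METHODS:
        return None
    if not isinstance(plan["endpoint"], str) or not plan["endpoint"].startswith("/"):
        return None
    return plan

good = '{"method": "POST", "endpoint": "/v1/users", "headers": {}, "body_fields": ["name"]}'
assert validate_api_plan(good) is not None
assert validate_api_plan("Call the users endpoint with POST.") is None
```

&lt;p&gt;The open-ended prompt admits no such checker, which is exactly why it is riskier in production.&lt;/p&gt;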

&lt;p&gt;A further step is multi-candidate generation with ranking or verification. Instead of trusting one answer, the system can generate several and choose the one that best satisfies rules or passes validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose_best_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is especially useful when a task admits multiple plausible phrasings but only some are fully grounded or structurally compliant.&lt;/p&gt;
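&lt;p&gt;The scorer passed to choose_best_output is where grounding enters. One crude but useful proxy, shown here as an assumption rather than a truth check, scores a candidate by how much of it appears in the retrieved context:&lt;/p&gt;

```python
def grounding_scorer(context: str):
    """Build a scorer that rates a candidate by the fraction of its words
    found in the retrieved context -- a crude groundedness proxy, not a
    verification of correctness."""
    context_words = set(context.lower().split())

    def score(candidate: str) -> float:
        words = candidate.lower().split()
        if not words:
            return 0.0
        return sum(w in context_words for w in words) / len(words)

    return score

context = "the orders endpoint accepts POST with customer_id and items"
score = grounding_scorer(context)
assert score("POST with customer_id") > score("DELETE the invoices table")
```

&lt;p&gt;In practice this slot is usually filled by a stronger verifier: schema validation, citation checking, or a separate judge model.&lt;/p&gt;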

&lt;p&gt;A practical question here is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should the system optimize for one eloquent answer, or for the most verifiable answer?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In production, the second is usually the right choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability is mandatory because failures are often silent
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In ordinary software systems, obvious failures trigger obvious investigation. In LLM systems, some of the worst failures are silent. The answer is accepted, no exception is thrown, and the problem emerges only later as an incorrect report, a bad integration, or a flawed decision.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is why observability is not optional. The system needs to record enough information to reconstruct what happened:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the original user request&lt;/li&gt;
&lt;li&gt;the prompt or template version&lt;/li&gt;
&lt;li&gt;the retrieved context&lt;/li&gt;
&lt;li&gt;model settings&lt;/li&gt;
&lt;li&gt;raw outputs&lt;/li&gt;
&lt;li&gt;validation outcomes&lt;/li&gt;
&lt;li&gt;final decision&lt;/li&gt;
&lt;li&gt;user feedback where available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal logging example might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a real system, this data becomes the basis for regression analysis, failure clustering, and evaluation dataset creation.&lt;/p&gt;

&lt;p&gt;A strong engineering question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If a user reports a wrong answer, do we have enough information to diagnose whether retrieval, prompting, generation, or validation failed?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without that visibility, the team is not really operating a system. It is operating a black box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evaluation mindset must change
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Testing LLM systems is fundamentally different from testing ordinary code. You cannot rely only on exact-match assertions. Many tasks allow multiple acceptable outputs, while dangerous failures may still look polished.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Evaluation must therefore reflect real usage conditions, not just benchmark convenience. Good evaluation sets should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;normal cases&lt;/li&gt;
&lt;li&gt;ambiguous cases&lt;/li&gt;
&lt;li&gt;adversarial phrasing&lt;/li&gt;
&lt;li&gt;edge conditions&lt;/li&gt;
&lt;li&gt;outdated context scenarios&lt;/li&gt;
&lt;li&gt;conflicting evidence scenarios&lt;/li&gt;
&lt;li&gt;incomplete data scenarios&lt;/li&gt;
&lt;/ul&gt;
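&lt;p&gt;These categories can be encoded directly as labeled evaluation cases, where the expected behavior may be a refusal rather than an answer. The case set and the system interface below are illustrative assumptions:&lt;/p&gt;

```python
# Scenario-based evaluation sketch: each case carries a category from the
# list above and the *behavior* we expect, which may be a safe refusal.
CASES = [
    {"category": "normal", "query": "What is the POST body for /v1/orders?", "expect": "answer"},
    {"category": "ambiguous", "query": "How do I call the API?", "expect": "clarify"},
    {"category": "outdated_context", "query": "Use the v0 endpoint.", "expect": "refuse"},
    {"category": "incomplete_data", "query": "What is field X for?", "expect": "refuse"},
]

def evaluate(system, cases):
    """Return failure records; 'system' is any callable mapping a query to
    one of 'answer' / 'clarify' / 'refuse' (an assumed interface)."""
    failures = []
    for case in cases:
        got = system(case["query"])
        if got != case["expect"]:
            failures.append({"category": case["category"], "got": got, "expected": case["expect"]})
    return failures

def always_answers(query: str) -> str:
    return "answer"

fails = evaluate(always_answers, CASES)
# A system that never refuses fails every non-normal scenario.
assert len(fails) == 3
```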

&lt;p&gt;The aim is not simply to ask, “Did the model answer correctly?” The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Under what conditions does the entire system fail, and does it fail safely?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That wording matters because a safe refusal can be more valuable than a polished but incorrect answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical production pattern
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/dI_TmTW9S4c"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;A strong LLM system often follows a decision-oriented pipeline like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdayjird4clggwm5df7dl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdayjird4clggwm5df7dl.jpg" alt="Amazon" width="800" height="1200"&gt;&lt;/a&gt;&lt;br&gt;
This diagram is useful because it shows an engineering principle that applies broadly: the system should not force every request down the same path. Some tasks need retrieval. Some need tools. Some need human escalation. Some should be rejected cleanly.&lt;br&gt;
That is how the architecture absorbs uncertainty instead of pretending uncertainty does not exist.&lt;/p&gt;
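&lt;p&gt;That routing principle can be sketched as a small dispatcher. The labels and conditions are illustrative assumptions, not a fixed taxonomy:&lt;/p&gt;

```python
def route(request: dict) -> str:
    """Decision-oriented routing sketch: not every request takes the same
    path. The request flags and route names here are illustrative."""
    if request.get("risk") == "high":
        return "human_review"        # escalate, never auto-answer
    if request.get("out_of_scope"):
        return "reject"              # fail cleanly instead of guessing
    if request.get("needs_facts"):
        return "retrieval_pipeline"  # ground the answer before generation
    if request.get("needs_action"):
        return "tool_call"           # delegate to a deterministic tool
    return "direct_generation"

assert route({"risk": "high"}) == "human_review"
assert route({"needs_facts": True}) == "retrieval_pipeline"
assert route({}) == "direct_generation"
```

&lt;p&gt;The real versions of these flags would come from classifiers or policy rules, but the shape of the decision stays the same.&lt;/p&gt;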


&lt;h2&gt;
  
  
  Questions every production LLM team should keep asking
&lt;/h2&gt;

&lt;p&gt;The strongest teams tend to ask better operational questions than everyone else. Here are some of the most important ones:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the system detect a well-formatted but incorrect output?&lt;/p&gt;

&lt;p&gt;Does retrieval improve truthfulness, or just increase answer confidence?&lt;/p&gt;

&lt;p&gt;Which failures come from the model, and which come from upstream context design?&lt;/p&gt;

&lt;p&gt;Can we reproduce a bad output under the same conditions?&lt;/p&gt;

&lt;p&gt;Are we optimizing for linguistic quality or decision reliability?&lt;/p&gt;

&lt;p&gt;When the system is uncertain, does it expose uncertainty or hide it behind fluency?&lt;/p&gt;

&lt;p&gt;What happens if the validator passes a structurally valid but semantically false response?&lt;/p&gt;

&lt;p&gt;Which classes of requests should never be answered without human review?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These are not philosophical questions. They are production questions.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final perspective
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The hardest part of deploying LLMs is not integrating an API or writing a better prompt. It is accepting that a fluent model is not the same thing as a reliable system.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A model can generate.
A system must decide.

A model can imitate valid structure.
A system must verify meaning.

A model can produce plausible answers.
A production architecture must control when those answers are trusted, retried, constrained, or rejected.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the real engineering challenge of using LLMs in production systems. The teams that succeed are not the ones that merely use advanced models. They are the ones that design robust pipelines around the model’s limitations: grounded retrieval, disciplined context preparation, deterministic validation, controlled generation, observability, and continuous evaluation.&lt;/p&gt;

&lt;p&gt;The line between experimenting with AI and engineering with AI is drawn exactly there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Let’s Grow and Support Together! 💛</title>
      <dc:creator>Pierfelice Menga</dc:creator>
      <pubDate>Wed, 18 Mar 2026 15:58:08 +0000</pubDate>
      <link>https://dev.to/agen-it/lets-grow-and-support-together-1i9f</link>
      <guid>https://dev.to/agen-it/lets-grow-and-support-together-1i9f</guid>
      <description>&lt;p&gt;Hey everyone! 🌟&lt;/p&gt;

&lt;p&gt;This community is all about supporting each other and growing together. Let’s make it a place where everyone feels encouraged and celebrated.&lt;/p&gt;

&lt;p&gt;Here’s how we can help each other:&lt;/p&gt;

&lt;p&gt;Follow each other – Let’s increase our follower counts together.&lt;br&gt;
Like and comment – Every like and comment counts! It shows support and helps our posts reach more people.&lt;br&gt;
Share positivity – A kind word goes a long way.&lt;br&gt;
By supporting each other, we all rise together. Let’s make this community stronger, more connected, and full of energy! 🚀&lt;/p&gt;

&lt;p&gt;So, join in, engage, and let’s grow together! 💛&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzzi577jvbzssp2k8hrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzzi577jvbzssp2k8hrg.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Seeking the Heraculess</title>
      <dc:creator>Pierfelice Menga</dc:creator>
      <pubDate>Thu, 26 Feb 2026 09:03:38 +0000</pubDate>
      <link>https://dev.to/agen-it/seeking-the-heraculess-3189</link>
      <guid>https://dev.to/agen-it/seeking-the-heraculess-3189</guid>
      <description>&lt;p&gt;I’m a remote software and AI developer working with international online clients. I’m currently looking for a reliable, US/EU-based partner to collaborate with me in a long-term remote working arrangement.&lt;/p&gt;

&lt;p&gt;No technical or AI background is required.&lt;/p&gt;

&lt;p&gt;Your role would include:&lt;br&gt;
Assisting with coordination on the European side&lt;br&gt;
Helping with applications, communication, and interview scheduling&lt;br&gt;
Acting as a local contact for EU-based platforms and clients&lt;br&gt;
What I offer:&lt;/p&gt;

&lt;p&gt;15–20% of my monthly income, plus US$30&lt;br&gt;
Fully remote cooperation&lt;br&gt;
Long-term partnership&lt;br&gt;
A clear, transparent, and honest agreement&lt;/p&gt;

&lt;p&gt;This opportunity may be a good fit for:&lt;/p&gt;

&lt;p&gt;Individuals looking for additional income&lt;br&gt;
People comfortable communicating in English and following instructions&lt;br&gt;
This opportunity is legal, genuine, and low-risk. All details will be clearly discussed and agreed upon before we begin.&lt;/p&gt;

&lt;p&gt;Contact:&lt;/p&gt;

&lt;p&gt;Discord: sada.ko&lt;br&gt;
Telegram: @devdavid6&lt;br&gt;
WhatsApp: +1 (503) 446-7790&lt;br&gt;
Email: &lt;a href="mailto:RonnyHukuda@gmail.com"&gt;RonnyHukuda@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>Seeking the Hera of the business support</title>
      <dc:creator>Pierfelice Menga</dc:creator>
      <pubDate>Thu, 26 Feb 2026 09:02:51 +0000</pubDate>
      <link>https://dev.to/agen-it/seeking-the-hera-of-the-business-support-201d</link>
      <guid>https://dev.to/agen-it/seeking-the-hera-of-the-business-support-201d</guid>
      <description>&lt;p&gt;I’m a remote software and AI developer working with international online clients. I’m currently looking for a reliable, US/EU-based partner to collaborate with me in a long-term remote working arrangement.&lt;/p&gt;

&lt;p&gt;No technical or AI background is required.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxo9ztlnne1jl9wewgvbb.png%40buy-belbien-online4" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxo9ztlnne1jl9wewgvbb.png%40buy-belbien-online4" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your role would include:&lt;br&gt;
Assisting with coordination on the European side&lt;br&gt;
Helping with applications, communication, and interview scheduling&lt;br&gt;
Acting as a local contact for EU-based platforms and clients&lt;br&gt;
What I offer:&lt;/p&gt;

&lt;p&gt;15–20% of my monthly income, plus US$30&lt;br&gt;
Fully remote cooperation&lt;br&gt;
Long-term partnership&lt;br&gt;
A clear, transparent, and honest agreement&lt;/p&gt;

&lt;p&gt;This opportunity may be a good fit for:&lt;/p&gt;

&lt;p&gt;Individuals looking for additional income&lt;br&gt;
People comfortable communicating in English and following instructions&lt;br&gt;
This opportunity is legal, genuine, and low-risk. All details will be clearly discussed and agreed upon before we begin.&lt;/p&gt;

&lt;p&gt;Contact:&lt;/p&gt;

&lt;p&gt;Discord: sada.ko&lt;br&gt;
Telegram: @devdavid6&lt;br&gt;
WhatsApp: +1 (503) 446-7790&lt;br&gt;
Email: &lt;a href="mailto:RonnyHukuda@gmail.com"&gt;RonnyHukuda@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>programming</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Pierfelice Menga</dc:creator>
      <pubDate>Thu, 26 Feb 2026 09:01:38 +0000</pubDate>
      <link>https://dev.to/agen-it/-4omo</link>
      <guid>https://dev.to/agen-it/-4omo</guid>
      <description></description>
    </item>
    <item>
      <title>In 5 Years, “Knowing Syntax” Will Be the Least Important Dev Skill</title>
      <dc:creator>Pierfelice Menga</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:31:44 +0000</pubDate>
      <link>https://dev.to/agen-it/in-5-years-knowing-syntax-will-be-the-least-important-dev-skill-3ieg</link>
      <guid>https://dev.to/agen-it/in-5-years-knowing-syntax-will-be-the-least-important-dev-skill-3ieg</guid>
      <description>&lt;p&gt;&lt;strong&gt;I learned JavaScript by memorizing syntax.&lt;br&gt;
AI learned JavaScript by eating the entire internet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1wkanhqkfqink74xoyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1wkanhqkfqink74xoyt.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Guess who won? 😅&lt;br&gt;
Writing code is no longer the hard part.&lt;br&gt;
AI can already generate functions, APIs, tests, and configs faster than any human with caffeine.&lt;br&gt;
What actually matters now isn’t how to write code, but:&lt;/p&gt;

&lt;p&gt;❤❤  What code should exist&lt;br&gt;
✌✌  Why it should exist&lt;br&gt;
👍👍 When not to write it&lt;br&gt;
🎇🎇 AI writes working code.&lt;/p&gt;

&lt;p&gt;But it doesn’t:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understand business context&lt;/li&gt;
&lt;li&gt;care about maintainability&lt;/li&gt;
&lt;li&gt;feel tech debt slowly ruining a project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future developer won’t say:&lt;/p&gt;

&lt;p&gt;🎉🎉“I know 12 frameworks.”&lt;/p&gt;

&lt;p&gt;They’ll say:&lt;br&gt;
“I know why this system is built this way — and how not to break production.”&lt;/p&gt;

&lt;p&gt;Syntax is becoming cheap.&lt;br&gt;
Judgment is becoming priceless 💎&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
