<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lei Ye</title>
    <description>The latest articles on DEV Community by Lei Ye (@lei_ye_2cc01a0af9e8260e).</description>
    <link>https://dev.to/lei_ye_2cc01a0af9e8260e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808468%2F5b0247f8-5d88-4e05-ad2c-f1af8a1ade2e.png</url>
      <title>DEV Community: Lei Ye</title>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lei_ye_2cc01a0af9e8260e"/>
    <language>en</language>
    <item>
      <title>Why Your Production RAG System Slowly Gets Worse</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Mon, 29 Jun 2026 04:17:36 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/why-your-production-rag-system-slowly-gets-worse-25fi</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/why-your-production-rag-system-slowly-gets-worse-25fi</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Production RAG systems rarely fail through a single catastrophic event. More commonly, reliability erodes through a sequence of operational changes: documentation evolves, retrieval behavior shifts, prompts are revised, dependencies change, and evaluation datasets become stale.&lt;/p&gt;

&lt;p&gt;Traditional engineering practices classify failures by system components—retrievers, prompts, vector databases, or language models. While useful for implementation, this perspective provides limited guidance for operating production AI systems over time.&lt;/p&gt;

&lt;p&gt;This article proposes a reliability framework based on three complementary dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failure Dynamics&lt;/strong&gt; — how reliability changes over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability Control Surface&lt;/strong&gt; — where engineers can observe and intervene&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detectability&lt;/strong&gt; — how easily the failure is discovered before users are affected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate the framework, a controlled experiment simulates seven weeks of gradual documentation evolution in a production-style RAG system. The experiment demonstrates one representative failure class—&lt;strong&gt;Gradual Knowledge Drift&lt;/strong&gt;—and shows why this class of failure frequently escapes traditional operational monitoring.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction — AI Systems Rarely Fail the Way Traditional Software Does
&lt;/h2&gt;

&lt;p&gt;Modern software systems fail in ways that operations teams understand well. A bad deployment increases error rates. A database outage causes requests to fail. A networking issue adds latency. Infrastructure becomes unavailable. These failures are disruptive, but they are also highly visible. Dashboards turn red, alerts fire, and engineers know where to start investigating.&lt;/p&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) systems introduce a different class of failure. A production RAG application can appear perfectly healthy from an operational perspective: requests complete successfully, APIs return HTTP 200 responses, latency remains within service-level objectives, and every component in the architecture is online. Traditional monitoring tools report a healthy system. Yet users begin to lose confidence in the answers.&lt;/p&gt;

&lt;p&gt;Fundamentally, this is an AI reliability problem rather than a traditional software reliability problem.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4gnv0k5gjb6tzn6g9nv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4gnv0k5gjb6tzn6g9nv7.png" alt="Figure 1 - Traditional Software Reliability vs AI Reliability Timeline&lt;br&gt;
" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the figure shows, the key difference is that traditional software failures center on discrete events and give immediate feedback, while RAG systems degrade gradually and are usually invisible to infrastructure-level monitoring. Traditional software reliability is typically judged by correctness and availability: either the service works or it doesn't. RAG systems add another dimension: knowledge quality. A system can achieve excellent uptime while steadily becoming less reliable.&lt;/p&gt;

&lt;p&gt;This reframes reliability from a problem of system correctness to a problem of sustained knowledge quality.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why Existing Classifications Are Insufficient
&lt;/h2&gt;

&lt;p&gt;What do we know about RAG system failures? Perhaps newly published documentation isn't being retrieved. Maybe document metadata has drifted, reducing retrieval accuracy. Or an embedding model has changed, but only part of the corpus has been re-indexed…&lt;/p&gt;

&lt;p&gt;Current discussions usually classify failures by component. Some examples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Typical failures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Poor semantic representations, embedding drift after model changes, domain mismatch, multilingual mismatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low recall, indexing errors, stale or missing vectors, incorrect filtering, ANN search inaccuracies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chunks too large/small, broken context boundaries, duplicated information, loss of semantic coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retriever&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Irrelevant documents retrieved, low recall, poor ranking, metadata filtering mistakes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reranker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Relevant documents demoted, irrelevant documents promoted, unstable ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hallucinations, ignored context, prompt injection, poor instruction following, format inconsistencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM / Generator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hallucination, incorrect synthesis, unsupported claims, reasoning errors, overconfidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge base&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Outdated documents, incomplete corpus, inconsistent information, stale data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ingestion pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Failed indexing, partial ingestion, parsing/OCR errors, metadata extraction failures&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Figure 2 - AI Failure Examples&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These classifications explain &lt;strong&gt;where&lt;/strong&gt; failures originate. However, they say little about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how failures evolve&lt;/li&gt;
&lt;li&gt;when engineers discover them&lt;/li&gt;
&lt;li&gt;which operational strategy is appropriate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production RAG system operations require a reliability model, not only an architecture model.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. A Reliability Framework for Production AI Systems
&lt;/h2&gt;

&lt;p&gt;Imagine an engineer receiving the following incident report:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"The RAG system is hallucinating more than usual."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although the statement describes a symptom, it immediately raises several unanswered questions.&lt;/p&gt;

&lt;p&gt;Has the system failed suddenly after a deployment, or has answer quality been declining for weeks? Is the root cause likely to be in the knowledge base, the retrieval pipeline, or the generation stage? Should engineers inspect operational dashboards, rerun evaluation suites, or begin a deeper investigation?&lt;/p&gt;

&lt;p&gt;The difficulty is not a lack of observability—it is a lack of structure for reasoning about production AI failures.&lt;/p&gt;

&lt;p&gt;From examining recurring production incidents, I found that most failures can be described along three complementary dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failure Dynamics&lt;/strong&gt; describe how reliability changes over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability Control Surfaces&lt;/strong&gt; identify where corrective action is most effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detectability&lt;/strong&gt; characterizes how easily the failure is discovered before affecting users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than treating every incident as unique, these dimensions provide a common language for understanding, classifying, and responding to production AI failures.&lt;/p&gt;




&lt;h3&gt;
  
  
  Dimension 1 — Failure Dynamics
&lt;/h3&gt;

&lt;p&gt;When a RAG incident occurs, the first question engineers should ask is not &lt;em&gt;what&lt;/em&gt; failed, but &lt;em&gt;how&lt;/em&gt; reliability changed over time.&lt;/p&gt;

&lt;p&gt;Traditional software systems are typically designed around discrete failures. A deployment introduces a regression, a dependency fails, or a resource becomes exhausted. Reliability changes are usually tied to identifiable events, allowing engineers to reason about incidents as immediate failures.&lt;/p&gt;

&lt;p&gt;Production RAG systems behave differently. Reliability often changes continuously rather than discretely. Documentation evolves, retrieval behavior shifts, prompts are revised, and evaluation datasets become stale. Individually, these changes appear harmless; collectively, they reshape the behavior of the system. As a result, understanding a production AI incident begins with a different question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How did reliability evolve over time?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This leads to the first dimension of the framework: &lt;strong&gt;Failure Dynamics&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immediate&lt;/strong&gt;&lt;br&gt;
Immediate failures appear right after a discrete system change or unexpected input. They are typically associated with deployments, prompt revisions, tool misconfiguration, or invalid context injection. Engineers usually observe an abrupt drop in correctness or task completion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradual&lt;/strong&gt;&lt;br&gt;
Gradual failures emerge through a sequence of individually harmless changes. Documentation evolves, retrieval behavior shifts, evaluation datasets become stale, or models are upgraded incrementally. No single change is sufficient to trigger an incident, but their cumulative effect steadily erodes reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold&lt;/strong&gt;&lt;br&gt;
Threshold failures remain latent until accumulated changes push the system beyond a critical operating boundary. Reliability appears stable until a tipping point is reached, after which performance degrades abruptly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Oscillating&lt;/strong&gt;&lt;br&gt;
Oscillating failures exhibit inconsistent reliability under similar operating conditions. Performance alternates between successful and unsuccessful outcomes because the underlying system behavior depends on input distribution, retrieval ordering, model stochasticity, or changing operational conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cascading&lt;/strong&gt;&lt;br&gt;
Cascading failures originate from a local defect that propagates through downstream workflow stages. A retrieval error may influence planning, which in turn affects tool selection and memory updates, ultimately producing a significantly larger end-user failure than the original defect alone.&lt;/p&gt;




&lt;h3&gt;
  
  
  Dimension 2 — Reliability Control Surface
&lt;/h3&gt;

&lt;p&gt;Once the failure dynamics have been identified, the next engineering question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Where should engineers intervene?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Failure Dynamics describe &lt;strong&gt;how&lt;/strong&gt; reliability changes. Reliability Control Surfaces describe &lt;strong&gt;where&lt;/strong&gt; reliability can be observed, influenced, and improved.&lt;/p&gt;

&lt;p&gt;In traditional software systems, the answer is often localized. Engineers scale infrastructure to address resource contention, upgrade dependencies to resolve compatibility issues, or adjust service-level trade-offs between latency, availability, and consistency. The intervention point is usually well-defined because the system itself is deterministic.&lt;/p&gt;

&lt;p&gt;Production RAG systems are different. A single user-visible failure may emerge from interactions across multiple stages of the pipeline. Corrective actions therefore require engineers to identify the control surface where reliability can be most effectively improved.&lt;/p&gt;

&lt;p&gt;We define five primary &lt;strong&gt;Reliability Control Surfaces&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge&lt;/strong&gt;&lt;br&gt;
The knowledge surface governs the quality of the information available to the system. Engineers intervene here by improving the corpus itself: removing stale documents, eliminating duplicates, correcting inconsistencies, or refining document organization. If the system retrieves incorrect knowledge, no downstream component can reliably recover the correct answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;br&gt;
The retrieval surface determines which knowledge reaches the model. Engineers adjust retrieval algorithms, chunking strategies, embedding models, metadata filters, rerankers, and search parameters to improve the relevance and completeness of retrieved context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt;&lt;br&gt;
The generation surface governs how retrieved context is transformed into an answer. Prompt design, model selection, decoding strategies, and structured output constraints all influence whether the model produces accurate, complete, and faithful responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;br&gt;
The evaluation surface determines how reliability is measured and enforced. Rather than improving answers directly, evaluation establishes quality gates through automated benchmarks, regression tests, and production monitoring. It answers the question: &lt;em&gt;Has reliability changed enough to require intervention?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations&lt;/strong&gt;&lt;br&gt;
The operations surface coordinates how the entire system behaves in production. Version management, deployment policies, rollout strategies, monitoring, traffic routing, and incident response all influence the long-term reliability of the application, even when individual components remain unchanged.&lt;/p&gt;




&lt;h3&gt;
  
  
  Dimension 3 — Detectability
&lt;/h3&gt;

&lt;p&gt;The previous dimension answered &lt;strong&gt;where engineers should intervene&lt;/strong&gt;. Detectability answers a different operational question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How likely is this failure to be discovered before users experience it?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not all failures are equally visible. Some immediately trigger monitoring systems, while others remain hidden behind apparently successful requests and fluent model responses. From an operational perspective, the cost of a failure depends not only on its severity but also on how long it remains undetected.&lt;/p&gt;

&lt;p&gt;Traditional software systems have benefited from decades of investment in observability. Infrastructure failures, resource exhaustion, deployment regressions, and service interruptions typically produce measurable signals that monitoring systems can detect automatically.&lt;/p&gt;

&lt;p&gt;Production AI systems introduce a different class of reliability problems. A request may complete successfully, latency may remain stable, and no infrastructure alarms may fire, yet answer quality can still deteriorate. In these cases, correctness—not availability—becomes the primary operational concern.&lt;/p&gt;

&lt;p&gt;We therefore classify production AI failures according to their detectability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D0 — Immediately observable&lt;/strong&gt;&lt;br&gt;
Failures are directly visible through conventional operational signals or obvious incorrect behavior. Engineers are typically alerted immediately through monitoring systems or user-facing errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D1 — Operationally observable&lt;/strong&gt;&lt;br&gt;
Failures become apparent through changes in production telemetry, deployment behavior, or runtime characteristics. Although the application continues functioning, operational metrics indicate that reliability has changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D2 — Evaluation observable&lt;/strong&gt;&lt;br&gt;
Failures cannot be detected reliably through infrastructure monitoring alone. Instead, they require scheduled or continuous evaluation using representative workloads to identify declining correctness, retrieval quality, or answer fidelity before users notice the regression.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D3 — Investigation observable&lt;/strong&gt;&lt;br&gt;
Failures remain operationally invisible until a specific customer incident triggers investigation. Root cause identification requires manual analysis, reproduction, and engineering judgment, making this class the most expensive and operationally disruptive.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Complete AI Reliability Framework
&lt;/h3&gt;

&lt;p&gt;The three dimensions introduced in this section are intended to be used together rather than independently.&lt;/p&gt;

&lt;p&gt;Failure Dynamics describe &lt;strong&gt;how&lt;/strong&gt; reliability changes over time. Reliability Control Surfaces identify &lt;strong&gt;where&lt;/strong&gt; engineers should investigate and intervene. Detectability determines &lt;strong&gt;how readily&lt;/strong&gt; failures become visible during operation. Together, these dimensions transform an isolated production incident into a structured reliability problem with an appropriate operational response.&lt;/p&gt;
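
&lt;p&gt;One way to make the three dimensions concrete is to record them together for each incident. The following is a minimal Python sketch under stated assumptions: the class names, field names, and the &lt;code&gt;summary&lt;/code&gt; helper are hypothetical and do not come from an existing library. Only the category names are taken from the framework itself.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class FailureDynamics(Enum):
    IMMEDIATE = "immediate"
    GRADUAL = "gradual"
    THRESHOLD = "threshold"
    OSCILLATING = "oscillating"
    CASCADING = "cascading"


class ControlSurface(Enum):
    KNOWLEDGE = "knowledge"
    RETRIEVAL = "retrieval"
    GENERATION = "generation"
    EVALUATION = "evaluation"
    OPERATIONS = "operations"


class Detectability(Enum):
    D0 = "immediately observable"
    D1 = "operationally observable"
    D2 = "evaluation observable"
    D3 = "investigation observable"


@dataclass(frozen=True)
class FailureClassification:
    """One incident described along the three framework dimensions."""

    dynamics: FailureDynamics
    surface: ControlSurface
    detectability: Detectability

    def summary(self) -> str:
        return (
            f"{self.dynamics.value} failure on the {self.surface.value} surface, "
            f"{self.detectability.name} ({self.detectability.value})"
        )


# Illustrative label: a gradual, knowledge-driven, evaluation-observable failure.
knowledge_drift = FailureClassification(
    FailureDynamics.GRADUAL, ControlSurface.KNOWLEDGE, Detectability.D2
)
print(knowledge_drift.summary())
```

&lt;p&gt;Keeping the three labels in one immutable record makes it harder to file an incident under a component name alone, which is the habit the framework is meant to replace.&lt;/p&gt;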

&lt;p&gt;Figure 3 summarizes the complete AI Reliability Framework. Rather than treating AI failures as disconnected symptoms such as hallucination, retrieval drift, or poor answer quality, the framework provides a systematic reasoning process—from identifying the failure dynamics to selecting the most appropriate operational response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fht677zuwrzcmsw6od1ce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fht677zuwrzcmsw6od1ce.png" alt="Figure 3 - AI Reliability Framework" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Controlled Experiment
&lt;/h2&gt;

&lt;p&gt;To illustrate how the proposed framework can be applied in practice, we conducted a controlled experiment on a representative production RAG failure. Rather than attempting to validate every combination of failure dynamics, control surfaces, and detectability levels, this article focuses on a single, realistic scenario that demonstrates how the framework guides diagnosis and operational response.&lt;/p&gt;

&lt;p&gt;The selected failure represents a &lt;strong&gt;Gradual&lt;/strong&gt; failure dynamic, occurring on the &lt;strong&gt;Knowledge&lt;/strong&gt; control surface with &lt;strong&gt;D2 (Evaluation Observable)&lt;/strong&gt; detectability. This class of failure was chosen because it is both common in production RAG systems and operationally expensive. Reliability degrades incrementally as the knowledge corpus evolves, yet the decline often remains invisible until systematic evaluation reveals a measurable regression.&lt;/p&gt;

&lt;p&gt;The experimental system consists of a Retrieval-Augmented Generation (RAG) application serving API documentation. Over a seven-week period, the underlying documentation corpus is progressively modified to simulate realistic knowledge evolution while the retrieval pipeline, model configuration, and evaluation dataset remain unchanged. This isolates knowledge evolution as the primary independent variable.&lt;/p&gt;

&lt;p&gt;Each week, the system is evaluated against the same benchmark using four reliability metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source Hit@k&lt;/strong&gt; — whether the correct supporting documents are successfully retrieved.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer Pass Rate&lt;/strong&gt; — proportion of responses meeting predefined correctness criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness&lt;/strong&gt; — degree to which generated answers are supported by retrieved evidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsupported Answer Rate&lt;/strong&gt; — frequency of responses containing unsupported or fabricated information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Figure 4 illustrates the overall experimental design. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fyso6mffgkg7qcnp47roa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fyso6mffgkg7qcnp47roa.png" alt="Figure 4 - experiment architecture" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Results
&lt;/h2&gt;

&lt;p&gt;The controlled experiment produced three observations that characterize &lt;strong&gt;Gradual Knowledge Drift&lt;/strong&gt; in production RAG systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  Observation 1 — User-visible behavior changed before the system appeared to fail
&lt;/h3&gt;

&lt;p&gt;Figure 5 follows a single production question across seven weeks.&lt;/p&gt;

&lt;p&gt;Although the retrieval pipeline, evaluation dataset, and application remained unchanged, the answer gradually shifted from a correct response to repeated uncertainty before later converging on an incomplete explanation.&lt;/p&gt;

&lt;p&gt;Importantly, the application never failed operationally. Every request completed successfully, yet the quality of the answer changed substantially as the knowledge corpus evolved.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;week&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;week_0&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Clients must include an &lt;code&gt;Authorization: Bearer &amp;lt;api_key&amp;gt;&lt;/code&gt; header in all API requests.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_1&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Unknown from the available documentation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_2&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Unknown from the available documentation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_3&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Unknown from the available documentation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_4&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Unknown from the available documentation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_5&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Clients authenticate API requests using a token, as indicated by the per-token rate limits and the &lt;code&gt;401&lt;/code&gt; status code returned when authentication fails.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_6&lt;/td&gt;
&lt;td&gt;How should clients authenticate API requests?&lt;/td&gt;
&lt;td&gt;Clients authenticate API requests using a token, as indicated by the per-token rate limits and the &lt;code&gt;401&lt;/code&gt; status code returned when authentication fails.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Figure 5 - Q&amp;amp;A Result&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 5 demonstrates that user-visible reliability can deteriorate even when the system appears operationally healthy.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Observation 2 — Reliability degraded without operational failures
&lt;/h3&gt;

&lt;p&gt;To determine whether this qualitative change reflected a measurable reliability decline, we evaluated the system weekly using four metrics.&lt;/p&gt;

&lt;p&gt;Figure 6 records the evaluation results throughout the experiment.&lt;/p&gt;

&lt;p&gt;Several observations stand out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source Hit@4 remained consistently high for most weeks.&lt;/li&gt;
&lt;li&gt;Faithfulness changed only slightly.&lt;/li&gt;
&lt;li&gt;No infrastructure failures or retrieval outages occurred.&lt;/li&gt;
&lt;li&gt;However, Answer Pass Rate declined significantly, reaching its lowest point during Week 3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The degradation therefore cannot be explained by infrastructure instability or catastrophic retrieval failure. Instead, the experiment demonstrates a gradual reduction in answer quality while conventional operational indicators remained largely stable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Week&lt;/th&gt;
&lt;th&gt;Source Hit@4&lt;/th&gt;
&lt;th&gt;Answer Pass Rate&lt;/th&gt;
&lt;th&gt;Faithfulness Pass Rate&lt;/th&gt;
&lt;th&gt;Unsupported Answer Rate&lt;/th&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;week_0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;Early drift visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_1&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;Early drift visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_2&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;Early drift visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_3&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;Material decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_4&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;Early drift visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_5&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Early drift visible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;week_6&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;Early drift visible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Figure 6 - Evaluation Record&lt;/p&gt;




&lt;h3&gt;
  
  
  Observation 3 — Progressive degradation only becomes obvious longitudinally
&lt;/h3&gt;

&lt;p&gt;Figure 7 plots the evaluation metrics over time.&lt;/p&gt;

&lt;p&gt;Viewed week-by-week, each individual regression appears relatively minor. None would typically justify an operational incident on its own.&lt;/p&gt;

&lt;p&gt;Viewed longitudinally, however, the pattern becomes unmistakable. Small fluctuations accumulate into a sustained decline in answer quality despite stable infrastructure and largely unchanged retrieval metrics.&lt;/p&gt;

&lt;p&gt;This illustrates an important characteristic of Gradual Knowledge Drift: &lt;strong&gt;the operational signal emerges from the trend rather than any individual observation.&lt;/strong&gt;&lt;/p&gt;
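
&lt;p&gt;A trend-level view can be computed directly from the weekly series. The sketch below is one possible approach, not the analysis used to produce Figure 7: it fits a least-squares line to the Answer Pass Rate values from the evaluation record and reports the slope alongside the change from baseline and the largest single-week drop. The function names are hypothetical, and any alert thresholds applied to these numbers would be a team-specific choice.&lt;/p&gt;

```python
def linear_slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (change per week)."""
    n = len(values)
    if n in (0, 1):
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    variance = sum((i - x_mean) ** 2 for i in range(n))
    return covariance / variance


def trend_summary(values: list[float]) -> dict[str, float]:
    """Summarize a weekly metric series at the trend level."""
    weekly_drops = [prev - cur for prev, cur in zip(values, values[1:])]
    return {
        "slope_per_week": linear_slope(values),
        "drop_from_baseline": values[0] - values[-1],
        "largest_weekly_drop": max(weekly_drops),
    }


# Answer Pass Rate for week_0 through week_6, copied from the evaluation record.
answer_pass_rate = [0.85, 0.8, 0.85, 0.6, 0.8, 0.8, 0.75]
print(trend_summary(answer_pass_rate))
```

&lt;p&gt;With only seven points, the fitted slope is a rough indicator rather than a statistical test. A longer series or a rolling window would give a steadier signal.&lt;/p&gt;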

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Focmpfqkyqg0069x2ngr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Focmpfqkyqg0069x2ngr1.png" alt="Figure 7 - Performance Curve" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Interpretation
&lt;/h3&gt;

&lt;p&gt;The experiment supports the hypothesis that some production RAG failures evolve gradually rather than immediately.&lt;/p&gt;

&lt;p&gt;Unlike a deployment regression, where a single event produces an obvious operational change, gradual knowledge drift emerges through a sequence of individually reasonable modifications to the knowledge corpus.&lt;/p&gt;

&lt;p&gt;Each weekly change appears harmless in isolation. Collectively, however, they alter the behavior of the retrieval-generation system enough to produce measurable reliability degradation.&lt;/p&gt;

&lt;p&gt;Consequently, conventional monitoring—which is designed to detect discrete failures—provides little warning before users begin experiencing degraded answers.&lt;/p&gt;

&lt;p&gt;The appropriate operational response is therefore fundamentally different. Immediate failures require incident response and rollback. Gradual failures require continuous evaluation, longitudinal monitoring, and periodic reliability assessment to identify trends before they become customer-visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Operational Principles
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Principle 1: Reliability should be monitored longitudinally rather than episodically.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The experiment demonstrates that no individual weekly change was sufficient to trigger an operational incident. Instead, reliability declined through a sequence of small, individually reasonable modifications to the knowledge corpus. This suggests that production AI reliability is fundamentally temporal. The reliability state of the system cannot be inferred from a single evaluation or deployment event; it emerges from trends observed over time.&lt;/p&gt;

&lt;p&gt;Consequently, evaluation should not be treated as a periodic validation activity performed before release. It should become a continuous operational process capable of identifying long-term changes in system behavior before they become customer-visible.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Principle 2: Operational strategies should depend on failure dynamics rather than implementation components.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The observed degradation originated from changes to the knowledge corpus, yet the operational response was determined by the behavior of the failure rather than its technical origin. An immediate deployment regression and a gradual knowledge drift may involve the same retrieval pipeline, but they require fundamentally different operational strategies. One favors rollback and incident response; the other favors continuous evaluation and trend monitoring.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Principle 3: Different detectability classes require different operational controls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not every reliability failure can be discovered through conventional monitoring. The experiment illustrates a D2 (Evaluation Observable) failure. Operational metrics remained largely stable while answer quality gradually declined. Traditional infrastructure monitoring therefore provided little indication that reliability had changed.&lt;/p&gt;

&lt;p&gt;Operational controls should therefore be matched to the detectability class of the failure. Immediately observable failures benefit from alerting and monitoring, whereas evaluation-observable failures require continuous benchmark evaluation capable of detecting subtle quality regressions before users encounter them.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Principle 4: Knowledge evolution is an operational concern, not merely a documentation concern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Throughout the experiment, the application code remained unchanged. Reliability changed because the knowledge available to the system changed. This observation highlights an important distinction between traditional software systems and production AI systems. Documentation is no longer passive reference material consumed only by engineers; it has become executable operational state consumed directly by the application.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Limitations
&lt;/h2&gt;

&lt;p&gt;This article intentionally evaluates a single representative failure profile rather than the complete AI Reliability Framework.&lt;/p&gt;

&lt;p&gt;The objective of this experiment is not to validate every possible combination of Failure Dynamics, Reliability Control Surfaces, and Detectability levels. Instead, it demonstrates how the proposed framework can be applied to reason about a realistic production incident under controlled experimental conditions.&lt;/p&gt;

&lt;p&gt;The selected scenario—&lt;strong&gt;Gradual&lt;/strong&gt; failure dynamics on the &lt;strong&gt;Knowledge&lt;/strong&gt; control surface with &lt;strong&gt;D2 (Evaluation Observable)&lt;/strong&gt; detectability—was chosen because it represents a common production failure that is difficult to identify using conventional operational monitoring.&lt;/p&gt;

&lt;p&gt;Future work will extend the experimental series to additional failure profiles, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threshold&lt;/strong&gt; failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Oscillating&lt;/strong&gt; failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascading&lt;/strong&gt; failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;using the same controlled experimental methodology. Evaluating multiple failure classes will allow the proposed framework to be assessed across a broader range of production AI reliability scenarios.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;Production AI systems cannot be understood solely through their architectural components. Models, retrievers, prompts, and vector databases explain &lt;strong&gt;how&lt;/strong&gt; an AI system is built, but they do not explain &lt;strong&gt;how reliability changes once the system enters production&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article proposed an operational reliability framework that complements architectural thinking by introducing three additional dimensions: &lt;strong&gt;Failure Dynamics&lt;/strong&gt;, &lt;strong&gt;Reliability Control Surfaces&lt;/strong&gt;, and &lt;strong&gt;Detectability&lt;/strong&gt;. Together, these dimensions provide a structured way to reason from a production incident toward an appropriate engineering response.&lt;/p&gt;

&lt;p&gt;The controlled experiment on Gradual Knowledge Drift demonstrates one representative failure class within this framework. More importantly, it illustrates a broader operational reality: many production AI failures emerge not through catastrophic regressions, but through the accumulation of individually reasonable changes that gradually alter system behavior.&lt;/p&gt;

&lt;p&gt;Traditional software engineering matured by developing shared languages for correctness, availability, and performance. As AI systems become long-lived production systems, reliability engineering will likewise require new operational concepts that describe how AI behavior evolves, how failures should be classified, and how reliability should be managed over time.&lt;/p&gt;

&lt;p&gt;The framework presented here is one step toward that operational vocabulary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix A — Experiment Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Corpus composition&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;auth.md&lt;/code&gt; - Explains bearer-token authentication and how to interpret &lt;code&gt;401&lt;/code&gt; and &lt;code&gt;403&lt;/code&gt; responses.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;errors.md&lt;/code&gt; - Summarizes HTTP error handling, including &lt;code&gt;429&lt;/code&gt; rate limits, &lt;code&gt;5xx&lt;/code&gt; retries, and auth-related errors.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pagination.md&lt;/code&gt; - Describes cursor-based pagination using &lt;code&gt;limit&lt;/code&gt;, &lt;code&gt;next_cursor&lt;/code&gt;, and &lt;code&gt;cursor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rate_limits.md&lt;/code&gt; - Defines rate-limit responses, relevant headers, and worker coordination after &lt;code&gt;429&lt;/code&gt;s.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry_behavior.md&lt;/code&gt; - Specifies retry rules for &lt;code&gt;429&lt;/code&gt;, &lt;code&gt;5xx&lt;/code&gt;, idempotent requests, jitter, backoff, and &lt;code&gt;Retry-After&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;troubleshooting.md&lt;/code&gt; - Lists support-debugging details, timeout guidance, and webhook retry behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Weekly mutations&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;week 1 - Added overlapping troubleshooting content&lt;/li&gt;
&lt;li&gt;week 2 - Added stale migration guide with outdated retry behavior&lt;/li&gt;
&lt;li&gt;week 3 - Re-indexed retry behavior doc with worse chunking&lt;/li&gt;
&lt;li&gt;week 4 - Added legacy SDK docs with similar wording&lt;/li&gt;
&lt;li&gt;week 5 - Switched to looser prompt version&lt;/li&gt;
&lt;li&gt;week 6 - Added noisy FAQ content competing with correct source&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Support questions&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;When should clients retry after a 429 error? &lt;br&gt;&lt;br&gt;
  What should an API client do when it receives 429 Too Many Requests? &lt;br&gt;&lt;br&gt;
Should clients use Retry-After or retry immediately after a 429? &lt;br&gt;&lt;br&gt;
How should workers behave after a rate limit retry? &lt;br&gt;&lt;br&gt;
Is a fixed 60 seconds the current retry policy for 429 responses? &lt;br&gt;&lt;br&gt;
Which header tells clients how long to wait after a rate limit? &lt;br&gt;&lt;br&gt;
Which operations should be retried automatically? &lt;br&gt;&lt;br&gt;
How should clients handle 5xx server errors? &lt;br&gt;&lt;br&gt;
How should clients authenticate API requests? &lt;br&gt;&lt;br&gt;
What does a 401 response mean? &lt;br&gt;&lt;br&gt;
What does a 403 response mean? &lt;br&gt;&lt;br&gt;
How do clients request the next page of results? &lt;br&gt;&lt;br&gt;
When should clients stop paginating? &lt;br&gt;&lt;br&gt;
What should a client do after a timeout? &lt;br&gt;&lt;br&gt;
What information should be included when contacting support? &lt;br&gt;&lt;br&gt;
Are webhook retries the same as client retry behavior? &lt;br&gt;&lt;br&gt;
What should clients do to request volume after receiving 429s? &lt;br&gt;&lt;br&gt;
Which headers describe the rate limit budget? &lt;br&gt;&lt;br&gt;
Should clients automatically retry requests that create side effects? &lt;br&gt;&lt;br&gt;
Which response value helps support investigate failures? &lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Repository link&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/leiye-07/rag-decay-experiment" rel="noopener noreferrer"&gt;rag-decay-experiment&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix B — Framework Summary
&lt;/h2&gt;

&lt;p&gt;A summary of the three-dimension reliability framework:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Dynamics&lt;/th&gt;
&lt;th&gt;Typical Control Surface&lt;/th&gt;
&lt;th&gt;Typical Detectability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Immediate&lt;/td&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;D0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gradual&lt;/td&gt;
&lt;td&gt;Knowledge&lt;/td&gt;
&lt;td&gt;D2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Threshold&lt;/td&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;D1–D2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oscillating&lt;/td&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;D3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cascading&lt;/td&gt;
&lt;td&gt;Cross-surface&lt;/td&gt;
&lt;td&gt;D3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Suggested Citation
&lt;/h2&gt;

&lt;p&gt;Lei Ye. Why Your Production RAG System Slowly Gets Worse. 2026. &lt;a href="https://lei-ye.dev/blog/reliability-framework-for-ai-engineers/" rel="noopener noreferrer"&gt;https://lei-ye.dev/blog/reliability-framework-for-ai-engineers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>systemdesign</category>
      <category>rag</category>
    </item>
    <item>
      <title>The Hidden Problem With Prompts in Production AI</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:36:45 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/prompt-as-code-build-prompt-registry-with-versioning-1n5f</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/prompt-as-code-build-prompt-registry-with-versioning-1n5f</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at: &lt;a href="https://lei-ye.dev/blog/prompt-as-code//" rel="noopener noreferrer"&gt;Prompt as Code — Build Prompt Registry with Versioning&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;When teams first build AI features, prompts usually start simple.&lt;/p&gt;

&lt;p&gt;A string in a function.&lt;br&gt;
A template inside a route.&lt;br&gt;
Maybe a small helper function.&lt;/p&gt;

&lt;p&gt;Something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following system event:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, as the system evolves, prompts start changing.&lt;/p&gt;

&lt;p&gt;A word here.&lt;br&gt;
A constraint there.&lt;br&gt;
Someone adds a new instruction for better formatting.&lt;/p&gt;

&lt;p&gt;Before long, the system behaves differently and nobody can explain why.&lt;/p&gt;

&lt;p&gt;That’s when &lt;strong&gt;prompt chaos&lt;/strong&gt; begins.&lt;/p&gt;



&lt;h2&gt;
  
  
  1. The Problem: Prompt Chaos
&lt;/h2&gt;

&lt;p&gt;Unlike normal code, prompts are often invisible infrastructure.&lt;/p&gt;

&lt;p&gt;They live inside strings scattered across services. They change quietly during experimentation.&lt;/p&gt;

&lt;p&gt;Over time this creates several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responses change unexpectedly&lt;/li&gt;
&lt;li&gt;Evaluation metrics become unreliable&lt;/li&gt;
&lt;li&gt;Debugging becomes difficult&lt;/li&gt;
&lt;li&gt;Prompt history disappears&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an output changes today, you may not know whether the cause was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A prompt change?&lt;/li&gt;
&lt;li&gt;A model change?&lt;/li&gt;
&lt;li&gt;A parameter change?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without prompt identity, the system becomes difficult to reason about.&lt;/p&gt;



&lt;h2&gt;
  
  
  2. Why Prompts Need Versioning
&lt;/h2&gt;

&lt;p&gt;Prompts influence system behavior as much as code does.&lt;/p&gt;

&lt;p&gt;In fact, prompts are closer to &lt;strong&gt;configuration that drives behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means prompts deserve the same discipline as code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Version control&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Traceability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of treating prompts as strings, we can treat them as versioned assets.&lt;/p&gt;

&lt;p&gt;This approach allows us to answer important questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which prompt generated this output?&lt;br&gt;
Which version was deployed last week?&lt;br&gt;
Which prompt version performs best during evaluation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the idea behind &lt;strong&gt;Prompt as Code&lt;/strong&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  3. What a Prompt Registry Is
&lt;/h2&gt;

&lt;p&gt;A Prompt Registry is a small service responsible for managing prompt templates.&lt;/p&gt;

&lt;p&gt;Instead of constructing prompts directly in application logic, the application resolves them from a registry.&lt;/p&gt;

&lt;p&gt;A prompt registry provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prompt templates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deterministic rendering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt hashing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms prompts from ad-hoc strings into structured runtime assets.&lt;/p&gt;

&lt;p&gt;Example prompt template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a production AI assistant focused on reliability.
Summarize the following system event:
{event_text}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now prompts have &lt;em&gt;identity&lt;/em&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  4. Architecture
&lt;/h2&gt;

&lt;p&gt;The prompt registry sits between the API layer and the model gateway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
      ↓
Prompt Registry
      ↓
Rendered Prompt
      ↓
Model Gateway
      ↓
Provider Adapter
      ↓
Cost Metering
      ↓
Evaluation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompts are resolved before inference&lt;/li&gt;
&lt;li&gt;Prompt versions are logged&lt;/li&gt;
&lt;li&gt;Evaluation remains &lt;strong&gt;reproducible&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also cleanly &lt;strong&gt;separates prompt management from model execution&lt;/strong&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  5. Implementation
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; toolkit, the prompt registry lives inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;packages/
  prompt_registry/
      models.py
      registry.py
      service.py
      hashing.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A prompt template is defined as a structured object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
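&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;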
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Templates are stored in a registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following system event:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;{event_text}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the API receives a request, the prompt service resolves and renders the prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rendered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User downloaded a large dataset.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
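
&lt;p&gt;To show how resolution and rendering could fit together, here is a minimal in-memory sketch. It is not the Maester implementation: the &lt;code&gt;PromptRegistry&lt;/code&gt; class, its &lt;code&gt;get&lt;/code&gt; method, the rule that a registered version cannot be overwritten, and the assumption that versions look like &lt;code&gt;v1&lt;/code&gt;, &lt;code&gt;v2&lt;/code&gt; are all hypothetical choices made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: str
    template: str


class PromptRegistry:
    """Hypothetical in-memory registry keyed by (name, version)."""

    def __init__(self):
        self._templates = {}

    def register(self, prompt):
        key = (prompt.name, prompt.version)
        if key in self._templates:
            # Treat registered versions as immutable: a change needs a new version.
            raise ValueError(f"{prompt.name} {prompt.version} is already registered")
        self._templates[key] = prompt

    def get(self, name, version=None):
        if version is not None:
            return self._templates[(name, version)]
        candidates = [t for (n, _), t in self._templates.items() if n == name]
        if not candidates:
            raise KeyError(name)
        # Assumes versions look like "v1", "v2", ... and orders them numerically.
        return max(candidates, key=lambda t: int(t.version.lstrip("v")))

    def render(self, name, variables, version=None):
        prompt = self.get(name, version)
        return prompt, prompt.template.format(**variables)


registry = PromptRegistry()
registry.register(PromptTemplate(
    "system_summary", "v1",
    "Summarize the following system event:\n\n{event_text}",
))
registry.register(PromptTemplate(
    "system_summary", "v2",
    "You are a production AI assistant focused on reliability.\n"
    "Summarize the following system event:\n{event_text}",
))

prompt, text = registry.render(
    "system_summary", {"event_text": "User downloaded a large dataset."}
)
print(prompt.version)  # v2, because no version was requested
```

&lt;p&gt;Returning the template alongside the rendered text keeps the name and version attached to whatever is sent to the model, which is what makes later logging possible.&lt;/p&gt;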



&lt;p&gt;The rendered prompt includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompt name&lt;/li&gt;
&lt;li&gt;prompt version&lt;/li&gt;
&lt;li&gt;prompt content&lt;/li&gt;
&lt;li&gt;prompt hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hash identifies the exact prompt text used during inference, so any output can be traced back to it later.&lt;/p&gt;
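
&lt;p&gt;A content hash of this kind can be computed in a few lines. The sketch below assumes SHA-256 over the UTF-8 bytes of the rendered prompt; the article does not state which algorithm or input the registry actually uses, and the &lt;code&gt;prompt_hash&lt;/code&gt; function name is hypothetical.&lt;/p&gt;

```python
import hashlib


def prompt_hash(rendered_prompt):
    """Hex SHA-256 digest of the exact rendered prompt text (UTF-8 encoded)."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()


original = "Summarize the following system event:\n\nUser downloaded a large dataset."
reworded = "Summarize the following system event clearly:\n\nUser downloaded a large dataset."

print(prompt_hash(original) == prompt_hash(original))  # True: same text, same hash
print(prompt_hash(original) == prompt_hash(reworded))  # False: one added word changes it
print(len(prompt_hash(original)))                      # 64 hex characters
```

&lt;p&gt;Because the digest depends on every character, even a whitespace edit produces a different value, so a logged hash can confirm whether two requests used the same prompt text.&lt;/p&gt;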



&lt;h2&gt;
  
  
  6. Prompt Versioning Examples
&lt;/h2&gt;

&lt;p&gt;Once prompts become versioned assets, evaluation becomes much more reliable.&lt;/p&gt;

&lt;p&gt;Each request now records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model&lt;/li&gt;
&lt;li&gt;provider&lt;/li&gt;
&lt;li&gt;prompt name&lt;/li&gt;
&lt;li&gt;prompt version&lt;/li&gt;
&lt;li&gt;prompt hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows teams to compare prompt performance across versions.&lt;/p&gt;
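
&lt;p&gt;As a rough illustration of such a comparison, logged responses can be grouped by prompt name and version and their evaluation scores averaged. The field names follow the response format used in this article; the records themselves and the &lt;code&gt;mean_score_by_version&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(records):
    """Average reliability_score for each (prompt_name, prompt_version) pair."""
    scores = defaultdict(list)
    for record in records:
        key = (record["prompt_name"], record["prompt_version"])
        scores[key].append(record["evaluation"]["reliability_score"])
    return {key: mean(values) for key, values in scores.items()}


# Hypothetical logged responses, reduced to the fields needed here.
records = [
    {"prompt_name": "system_summary", "prompt_version": "v1",
     "evaluation": {"reliability_score": 1.0}},
    {"prompt_name": "system_summary", "prompt_version": "v1",
     "evaluation": {"reliability_score": 0.5}},
    {"prompt_name": "system_summary", "prompt_version": "v2",
     "evaluation": {"reliability_score": 1.0}},
]

for key, score in sorted(mean_score_by_version(records).items()):
    print(key, score)
```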

&lt;p&gt;&lt;strong&gt;Example Prompt v1&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"event_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Admin revoked API key for user account 742."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"61050fd4e94849d791e566ead8c8f1c6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"5ba4cce2a985f8234698a63fe2260428b029dfd7d61e53a5793cc963b8737036"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Summarize the following system event clearly:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Admin revoked API key for user account"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000016"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000077"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000093"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"evaluation"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reliability_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"non_empty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"max_length"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Prompt v2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Request (version omitted, so the registry defaults to the latest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"event_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"System latency increased above 300ms for the inference service."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"7ec85dc989dc4da8a0ac9bb73f2317a7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"06c08f6125a189abf90b44c9a63a5bc0f5307f06319363a922a476b38776b8c6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant focused on reliability.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Summarize the following system event.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Be concise, mention operational impact, and keep the tone factual.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;System latency increased above 300ms for the inference service."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;112&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000026"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000074"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000100"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"evaluation"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reliability_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"non_empty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"max_length"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now prompt optimization becomes measurable rather than guesswork.&lt;/p&gt;



&lt;h2&gt;
  
  
  7. Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Building a prompt registry revealed a few important lessons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prompts evolve quickly&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
Even small systems accumulate many prompt variations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reproducibility matters early&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
Without prompt versioning, an evaluation result cannot be tied to the exact prompt that produced it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Prompt identity simplifies debugging&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
When responses change, engineers can check the prompt version and hash first to confirm or rule out a prompt edit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prompts should live outside business logic&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
Separating prompts from application code improves maintainability.&lt;/p&gt;
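
&lt;p&gt;As a rough illustration of lessons 2 and 3, a prompt's identity can be captured as a name, a version, and a content hash. This is a minimal sketch, not Maester's implementation: &lt;code&gt;PromptIdentity&lt;/code&gt; and &lt;code&gt;identify_prompt&lt;/code&gt; are hypothetical names, and hashing the UTF-8 template text with SHA-256 is an assumed scheme.&lt;/p&gt;

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptIdentity:
    """Hypothetical record that travels with every response."""
    name: str
    version: str
    content_hash: str


def identify_prompt(name: str, version: str, template: str) -> PromptIdentity:
    # Assumption: hash the exact template text, so any edit (even punctuation)
    # produces a different identity.
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return PromptIdentity(name=name, version=version, content_hash=digest)


v2 = identify_prompt("system_summary", "v2", "Summarize the following system event.")
v2_edited = identify_prompt("system_summary", "v2", "Summarize the following system event!")

# Same name and version, different text: the hash exposes the silent edit.
print(v2.content_hash == v2_edited.content_hash)
```

&lt;p&gt;Comparing hashes in this way catches the case where someone edits a prompt without bumping its version label.&lt;/p&gt;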



&lt;h2&gt;
  
  
  8. The Code
&lt;/h2&gt;

&lt;p&gt;The implementation described in this article is part of an open-source project called Maester.&lt;/p&gt;

&lt;p&gt;Maester is a lightweight toolkit focused on AI API reliability, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model gateway routing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost metering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation pipelines&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt registry&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;maester&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to explore how production AI systems can remain observable, reproducible, and resilient as they grow.&lt;/p&gt;




&lt;br&gt;&lt;br&gt;
&lt;em&gt;Note: This article was originally published on my engineering blog where I’m documenting the design of Maester, a production AI SaaS infrastructure system built in public. Original post: &lt;a href="https://lei-ye.dev/blog/prompt-as-code//" rel="noopener noreferrer"&gt;Prompt as Code — Build Prompt Registry with Versioning&lt;/a&gt;&lt;/em&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Maester — Enable Multi-provider LLM APIs</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Tue, 10 Mar 2026 21:28:17 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/building-maester-enable-multi-provider-llm-apis-4lp7</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/building-maester-enable-multi-provider-llm-apis-4lp7</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://lei-ye.dev/blog/multi-llm-provider-apis/" rel="noopener noreferrer"&gt;Building Maester — Enable Multi-provider LLM APIs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  We Locked Ourselves Into GCP
&lt;/h2&gt;

&lt;p&gt;Most infrastructure mistakes don’t start as mistakes. They start as reasonable decisions. This one started with a discount.&lt;/p&gt;




&lt;h3&gt;
  
  
  It Worked Beautifully
&lt;/h3&gt;

&lt;p&gt;In the beginning, the decision felt obvious. We had a large GCP startup credit, so our entire stack ran there.&lt;/p&gt;

&lt;p&gt;Compute.&lt;br&gt;
Storage. &lt;br&gt;
Data pipelines. &lt;br&gt;
Model training. &lt;br&gt;
... &lt;br&gt;
Everything.&lt;/p&gt;

&lt;p&gt;And honestly, it worked beautifully! &lt;strong&gt;Monitoring&lt;/strong&gt; was already integrated.&lt;br&gt;
&lt;strong&gt;Identity management&lt;/strong&gt; was built in. &lt;strong&gt;IAM policies&lt;/strong&gt; were easy to manage.&lt;br&gt;
Even &lt;strong&gt;LDAP&lt;/strong&gt; integration was already available.&lt;/p&gt;

&lt;p&gt;One of my teammates said something that sounded perfectly reasonable:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Don’t reinvent the wheel.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And he was right. &lt;em&gt;Why build infrastructure when the cloud already solved it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We were a small team. Most of our compute was tied to token usage, so costs looked predictable. Everything felt &lt;em&gt;lightweight&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;So we did what most startups do. We committed.&lt;br&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Where Did the Cost Come From?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Don't spend like a billionaire with the company's money!"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What? Months later, at a Monday morning standup, we were all confused by complaints about a bill. It was the cloud bill. It had arrived, nobody could clearly explain it, and it had started eating into margins.&lt;/p&gt;

&lt;p&gt;Where did the cost come from? &lt;br&gt;
Storage? &lt;br&gt;
Network egress? &lt;br&gt;
Pipelines?&lt;br&gt;
Inference traffic?&lt;/p&gt;

&lt;p&gt;Someone suggested hiring a cloud optimization engineer. Another suggested redesigning the entire data pipeline.&lt;/p&gt;

&lt;p&gt;But we were still a startup. Every time we opened the roadmap we saw something else staring at us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer requests.&lt;/li&gt;
&lt;li&gt;Feature releases.&lt;/li&gt;
&lt;li&gt;Revenue milestones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Infrastructure work always lost that fight. So the bills kept climbing. We weren't bankrupt. But we were trapped.&lt;/p&gt;


&lt;h3&gt;
  
  
  We Split the Stack
&lt;/h3&gt;

&lt;p&gt;Eventually we did something radical. We split the stack. The architecture finally looked like this:&lt;/p&gt;

&lt;p&gt;Azure → Identity / Compliance &lt;br&gt;
AWS   → Applications / Storage &lt;br&gt;
GCP   → Data Pipelines / Training&lt;/p&gt;

&lt;p&gt;And the cost?&lt;/p&gt;

&lt;p&gt;Still &lt;em&gt;expensive&lt;/em&gt;. But &lt;strong&gt;predictable&lt;/strong&gt;. Even without our original startup discount, the system became easier to control.&lt;/p&gt;

&lt;p&gt;Vendor lock-in is &lt;strong&gt;invisible&lt;/strong&gt; when things work. It becomes &lt;strong&gt;obvious&lt;/strong&gt; only when you try to leave.&lt;/p&gt;


&lt;h2&gt;
  
  
  We Are Not Going to Lock into OpenAI
&lt;/h2&gt;

&lt;p&gt;So when we started building the AI APIs, I began seeing the same pattern again.&lt;/p&gt;

&lt;p&gt;It was just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And honestly, that works.&lt;/p&gt;

&lt;p&gt;But I kept remembering the GCP moment. The moment when switching vendors became impossible. We were about to repeat the same mistake.&lt;/p&gt;

&lt;p&gt;Except this time the vendor was not a cloud. It was a model provider. So I made a decision. &lt;strong&gt;We are not going to lock into OpenAI.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Approach 1 — Let the client choose the model
&lt;/h3&gt;

&lt;p&gt;The simplest idea was letting the client select the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /generate
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"model"&lt;/span&gt;: &lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allowed switching between providers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Others later&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technically it worked. But users quickly complained.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I don't want to choose the model. I just want the best answer.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The user is always right. They just wanted results.&lt;/p&gt;




&lt;h3&gt;
  
  
  Approach 2 — Introduce a Model Gateway
&lt;/h3&gt;

&lt;p&gt;So we moved the decision out of the client. Instead of clients choosing providers, we introduced a Model Gateway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
     ↓
Model Gateway
     ↓
Provider Router
     ↓
Provider Adapter
(OpenAI / Anthropic / others)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gateway would manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provider routing&lt;/li&gt;
&lt;li&gt;fallback logic&lt;/li&gt;
&lt;li&gt;cost tracking&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application now simply asks for a response. And the infrastructure decides how to produce it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Code
&lt;/h2&gt;

&lt;p&gt;The implementation lives inside a small reference project I’ve been building called &lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal of the project is not to build a full AI platform, but to demonstrate a reliable AI API architecture.&lt;/p&gt;

&lt;p&gt;The gateway sits inside the system like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps/
   api/
      routes/
         reliable_completion.py

packages/
   model_gateway/
      base.py
      provider_openai.py
      provider_anthropic.py
      router.py
      client.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  The Provider Contract
&lt;/h3&gt;

&lt;p&gt;The first step was defining a provider interface. This follows the Adapter Pattern, allowing different model vendors to conform to a shared interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
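&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Protocol&lt;/span&gt;


&lt;span class="c1"&gt;# Structural interface: any class with these two methods counts as a provider.
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;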

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GenerationRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GenerationResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each provider adapter simply implements this contract.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenAIProvider
AnthropicProvider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both produce the same normalized response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GenerationResponse
 ├─ provider
 ├─ model
 ├─ content
 └─ usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the rest of the system never deals with vendor-specific formats.&lt;/p&gt;
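
&lt;p&gt;To make the normalization concrete, here is a minimal sketch under stated assumptions. The two raw payload shapes are invented for illustration and do not match any real vendor's response format, and the &lt;code&gt;Usage&lt;/code&gt; fields are assumed, since the outline above only names &lt;code&gt;usage&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    # Assumed fields; the article only says the response carries "usage".
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class GenerationResponse:
    provider: str
    model: str
    content: str
    usage: Usage


def normalize_vendor_a(raw: dict) -> GenerationResponse:
    # Hypothetical flat payload shape.
    return GenerationResponse(
        provider="vendor_a",
        model=raw["model"],
        content=raw["text"],
        usage=Usage(raw["tokens_in"], raw["tokens_out"]),
    )


def normalize_vendor_b(raw: dict) -> GenerationResponse:
    # A second hypothetical shape with nested fields and different names.
    return GenerationResponse(
        provider="vendor_b",
        model=raw["meta"]["model_id"],
        content="".join(part["text"] for part in raw["parts"]),
        usage=Usage(raw["meta"]["prompt_tokens"], raw["meta"]["completion_tokens"]),
    )


a = normalize_vendor_a({"model": "m-small", "text": "ok", "tokens_in": 5, "tokens_out": 1})
b = normalize_vendor_b({
    "meta": {"model_id": "m-large", "prompt_tokens": 5, "completion_tokens": 1},
    "parts": [{"text": "o"}, {"text": "k"}],
})

# Downstream code reads the same fields regardless of which adapter ran.
print(a.content == b.content, a.usage == b.usage)
```

&lt;p&gt;Each adapter owns the translation from its vendor's format, so cost metering and evaluation only ever see &lt;code&gt;GenerationResponse&lt;/code&gt;.&lt;/p&gt;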




&lt;h3&gt;
  
  
  The Router
&lt;/h3&gt;

&lt;p&gt;Next comes the router. The router decides which provider handles a request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ModelProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_provider&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production systems this layer can later evolve into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost-aware routing&lt;/li&gt;
&lt;li&gt;latency-aware routing&lt;/li&gt;
&lt;li&gt;capability routing&lt;/li&gt;
&lt;li&gt;traffic shaping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the interface stays the same.&lt;/p&gt;
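
&lt;p&gt;As a sketch of what "the interface stays the same" could look like, the example below gives the router an explicit constructor and adds a cost-aware variant that overrides only the selection policy. &lt;code&gt;StubProvider&lt;/code&gt;, &lt;code&gt;CostAwareRouter&lt;/code&gt;, and the per-token costs are hypothetical and exist only to make the example runnable.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StubProvider:
    """Hypothetical stand-in for a provider adapter."""
    name: str
    models: frozenset
    cost_per_1k_tokens: float  # made-up number, for illustration only

    def supports(self, model: str) -> bool:
        return model in self.models


class ModelRouter:
    def __init__(self, providers, fallback_provider):
        self.providers = list(providers)
        self.fallback_provider = fallback_provider

    def route(self, model: str) -> StubProvider:
        # First provider that supports the model wins.
        for provider in self.providers:
            if provider.supports(model):
                return provider
        return self.fallback_provider


class CostAwareRouter(ModelRouter):
    def route(self, model: str) -> StubProvider:
        # Same signature as ModelRouter.route; only the selection policy changes.
        candidates = [p for p in self.providers if p.supports(model)]
        if not candidates:
            return self.fallback_provider
        return min(candidates, key=lambda p: p.cost_per_1k_tokens)


pricey = StubProvider("pricey", frozenset({"shared-model"}), 2.0)
cheap = StubProvider("cheap", frozenset({"shared-model"}), 0.5)

first_match = ModelRouter([pricey, cheap], fallback_provider=pricey)
cost_aware = CostAwareRouter([pricey, cheap], fallback_provider=pricey)

print(first_match.route("shared-model").name)
print(cost_aware.route("shared-model").name)
print(cost_aware.route("unknown-model").name)
```

&lt;p&gt;Because callers only depend on &lt;code&gt;route(model)&lt;/code&gt;, swapping one router for the other requires no change in the gateway or the API layer.&lt;/p&gt;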




&lt;h3&gt;
  
  
  The Gateway Client
&lt;/h3&gt;

&lt;p&gt;Finally the application talks to the gateway through a simple client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelGateway&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API layer doesn't know which provider was selected. It just receives a normalized response.&lt;/p&gt;




&lt;h3&gt;
  
  
  The API Layer
&lt;/h3&gt;

&lt;p&gt;The FastAPI route becomes extremely simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;requested_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After generation, the system runs the reliability pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost metering 
&lt;/li&gt;
&lt;li&gt;Evaluation
&lt;/li&gt;
&lt;li&gt;Structured logging
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model_routed
requested_model: gpt-4.1-mini
selected_provider: openai
fallback_used: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives operators visibility without leaking provider logic into application code.&lt;/p&gt;
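
&lt;p&gt;The three pipeline steps listed above can be sketched as follows. This is an illustration, not Maester's code: the function names, the price table, the event name, and the two checks are assumptions, and the prices are placeholders rather than any provider's actual rates.&lt;/p&gt;

```python
import json
from decimal import Decimal

# Hypothetical USD prices per one million tokens; real prices vary by provider and date.
PRICES_PER_MILLION = {"example-model": {"input": Decimal("0.40"), "output": Decimal("1.60")}}
MICRO = Decimal("0.000001")  # round reported costs to six decimal places


def meter_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    price = PRICES_PER_MILLION[model]
    input_cost = price["input"] * input_tokens / 1_000_000
    output_cost = price["output"] * output_tokens / 1_000_000
    return {
        "input_cost_usd": str(input_cost.quantize(MICRO)),
        "output_cost_usd": str(output_cost.quantize(MICRO)),
        "total_cost_usd": str((input_cost + output_cost).quantize(MICRO)),
    }


def evaluate(content: str, max_length: int = 2000) -> dict:
    # Two deliberately simple checks: the response is non-empty and not too long.
    metrics = [
        {"name": "non_empty", "passed": bool(content.strip())},
        {"name": "max_length", "passed": max_length >= len(content)},
    ]
    status = "pass" if all(m["passed"] for m in metrics) else "fail"
    return {"status": status, "metrics": metrics}


def run_reliability_pipeline(model, content, input_tokens, output_tokens) -> str:
    record = {
        "event": "completion_finished",
        "model": model,
        "cost": meter_cost(model, input_tokens, output_tokens),
        "evaluation": evaluate(content),
    }
    # One JSON object per line keeps the log machine-parseable.
    return json.dumps(record, sort_keys=True)


line = run_reliability_pipeline("example-model", "Latency rose above 300ms.", 66, 46)
print(line)
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; instead of floats keeps very small per-request costs exact until the final rounding step.&lt;/p&gt;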




&lt;h3&gt;
  
  
  Why This Architecture Matters
&lt;/h3&gt;

&lt;p&gt;This design combines several classic software engineering principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Inversion&lt;/strong&gt;
Application code depends on abstractions, not providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapter Pattern&lt;/strong&gt;
Each vendor SDK is wrapped behind a provider adapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategy Pattern&lt;/strong&gt;
Routing policies are interchangeable strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separation of Concerns&lt;/strong&gt; 
The API layer handles orchestration. The gateway handles provider logic.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What This Enables Later
&lt;/h3&gt;

&lt;p&gt;Once this boundary exists, the system becomes far easier to evolve.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-provider fallback&lt;/li&gt;
&lt;li&gt;provider benchmarking&lt;/li&gt;
&lt;li&gt;cost-aware routing&lt;/li&gt;
&lt;li&gt;latency optimization&lt;/li&gt;
&lt;li&gt;evaluation-based routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of those changes can happen inside the gateway. The application API never changes. That is the real value of the design.&lt;/p&gt;
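
&lt;p&gt;As one example, multi-provider fallback might look roughly like the sketch below. The class names, the &lt;code&gt;ProviderError&lt;/code&gt; exception, and the simulated outage are all hypothetical. The point is only that the retry logic lives inside the gateway while the caller still makes a single &lt;code&gt;generate&lt;/code&gt; call.&lt;/p&gt;

```python
class ProviderError(Exception):
    """Hypothetical error raised by an adapter when its vendor call fails."""


class FlakyProvider:
    name = "primary"

    def generate(self, prompt: str) -> str:
        raise ProviderError("simulated outage")


class SteadyProvider:
    name = "secondary"

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class FallbackGateway:
    def __init__(self, providers):
        # Providers are tried in order; the first one is the preferred choice.
        self.providers = list(providers)

    def generate(self, prompt: str) -> dict:
        errors = []
        for index, provider in enumerate(self.providers):
            try:
                content = provider.generate(prompt)
            except ProviderError as exc:
                errors.append(f"{provider.name}: {exc}")
                continue
            return {
                "selected_provider": provider.name,
                "fallback_used": index > 0,
                "content": content,
            }
        raise ProviderError("all providers failed: " + "; ".join(errors))


gateway = FallbackGateway([FlakyProvider(), SteadyProvider()])
result = gateway.generate("Summarize the incident.")
print(result["selected_provider"], result["fallback_used"])
```

&lt;p&gt;The returned &lt;code&gt;fallback_used&lt;/code&gt; flag can be written straight into the routing log, so operators can see how often the primary provider fails.&lt;/p&gt;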




&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Vendor lock-in rarely feels dangerous at the beginning. Everything works. Costs look reasonable. The roadmap is full of features.&lt;/p&gt;

&lt;p&gt;Then one day something changes. Prices rise. Performance shifts. A better provider appears. And suddenly the architecture makes switching painful.&lt;/p&gt;

&lt;p&gt;The lesson I learned from our cloud migration was simple: &lt;strong&gt;Always design one layer where you can change your mind later&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For our AI systems, that layer became the &lt;strong&gt;Model Gateway&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The application talks to the gateway.&lt;br&gt;
The gateway talks to providers.&lt;br&gt;
And the providers can change.&lt;/p&gt;

&lt;p&gt;Because eventually they always do.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.&lt;br&gt;
Original post: &lt;a href="https://lei-ye.dev/blog/multi-llm-provider-apis/" rel="noopener noreferrer"&gt;Building Maester — Enable Multi-provider LLM APIs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Breaks After Your AI Demo Works</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Sun, 08 Mar 2026 05:32:45 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/what-breaks-after-your-ai-demo-works-2g8p</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/what-breaks-after-your-ai-demo-works-2g8p</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://lei-ye.dev/blog/design-reliable-ai-apis/" rel="noopener noreferrer"&gt;What Breaks After Your AI Demo Works&lt;/a&gt;.&lt;/em&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  A Short Story of How My AI Demo Worked and Failed
&lt;/h2&gt;

&lt;p&gt;A few weeks ago I built a small AI API. Nothing fancy. Just a simple endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It worked.&lt;/p&gt;

&lt;p&gt;Requests came in. The model responded. Everything looked good.&lt;br&gt;
Until the second week.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The First Question
&lt;/h3&gt;

&lt;p&gt;A teammate asked: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which request generated this output?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I checked the logs. There was nothing useful there.&lt;/p&gt;

&lt;p&gt;NO request ID.&lt;br&gt;
NO trace.&lt;br&gt;
NO connection between the prompt and the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system worked — but it wasn’t traceable.&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Second Question
&lt;/h3&gt;

&lt;p&gt;Very quickly another question appeared.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why did our AI bill jump yesterday?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I had no answer.&lt;/p&gt;

&lt;p&gt;We were calling models through an API wrapper, but we weren’t recording:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Model pricing&lt;/li&gt;
&lt;li&gt;Request-level cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We had built an AI system that spent money invisibly.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Third Question
&lt;/h3&gt;

&lt;p&gt;Then something more subtle happened.&lt;/p&gt;

&lt;p&gt;A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do we know if a model response is acceptable?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We didn’t.&lt;/p&gt;

&lt;p&gt;The API only knew whether the model responded, not whether the result made sense.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Realization
&lt;/h3&gt;

&lt;p&gt;The model wasn't the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Can we trace what happened?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Economics&lt;/td&gt;
&lt;td&gt;How much did this request cost?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output reliability&lt;/td&gt;
&lt;td&gt;Was the response acceptable?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it &lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  The Minimal Reliability Architecture
&lt;/h2&gt;

&lt;p&gt;A reliable AI API request should pass through a few structured steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
      ↓
API Middleware
(request_id + trace_id)
      ↓
Route Handler
      ↓
Model Gateway
      ↓
Cost Metering
      ↓
Evaluation
      ↓
Structured Logs
      ↓
Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step adds operational clarity.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Observability: Making AI Requests Traceable
&lt;/h3&gt;

&lt;p&gt;The first primitive is &lt;strong&gt;observability&lt;/strong&gt;. Every request should be traceable.&lt;br&gt;
In &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;, middleware attaches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;
&lt;span class="n"&gt;trace_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to the request context.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;new_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;start_trace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These identifiers propagate through the entire request lifecycle. Then operations are wrapped in spans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The span records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operation name&lt;/li&gt;
&lt;li&gt;Duration&lt;/li&gt;
&lt;li&gt;Attributes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example log output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span_end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;412&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives immediate insight into where time is spent.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
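
&lt;p&gt;A span helper of this kind can be small. The following is a minimal sketch built on &lt;code&gt;contextlib.contextmanager&lt;/code&gt;, assuming the log line is printed as JSON; it illustrates the pattern and is not taken from the Maester repository. The &lt;code&gt;output_chars&lt;/code&gt; attribute is made up for the example.&lt;/p&gt;

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def span(name, **attributes):
    """Time the wrapped block and emit one structured span_end log line."""
    record = {"event": "span_end", "span": name, **attributes}
    start = time.perf_counter()
    try:
        yield record  # callers may add attributes while the span is open
    finally:
        # The finally clause runs even if the wrapped block raises,
        # so failed operations still report their duration.
        record["duration_ms"] = int((time.perf_counter() - start) * 1000)
        print(json.dumps(record))


with span("model_generate", model="gpt-4o-mini") as sp:
    time.sleep(0.01)  # stand-in for gateway.generate(prompt)
    sp["output_chars"] = 128  # hypothetical extra attribute
```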
&lt;h3&gt;
  
  
  2. Cost Metering: AI Systems Spend Money Per Request
&lt;/h3&gt;

&lt;p&gt;Unlike traditional APIs, AI requests have a direct monetary cost: token usage translates into real spend, so every request should produce a cost record.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cost_record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The meter uses a pricing catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODEL_PRICING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_per_1k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_per_1k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00060&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;
&lt;span class="n"&gt;output_tokens&lt;/span&gt;
&lt;span class="n"&gt;total_cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example response fragment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00042&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the API answers a critical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What did this request cost?”&lt;/p&gt;
&lt;/blockquote&gt;
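
&lt;p&gt;The arithmetic behind a cost record is a per-1K-token multiplication. Here is a minimal sketch that reuses the catalog shown earlier; the &lt;code&gt;CostRecord&lt;/code&gt; dataclass and the standalone &lt;code&gt;record&lt;/code&gt; function are assumptions made for illustration, not Maester's actual meter.&lt;/p&gt;

```python
from dataclasses import dataclass

# Prices in USD per 1,000 tokens, copied from the catalog in the article.
MODEL_PRICING = {
    "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
}


@dataclass(frozen=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    total_cost_usd: float


def record(model, input_tokens, output_tokens):
    """Price one request; raises KeyError for models missing from the catalog."""
    price = MODEL_PRICING[model]
    total = (
        input_tokens / 1000 * price["input_per_1k"]
        + output_tokens / 1000 * price["output_per_1k"]
    )
    # Round to 8 places to hide float noise without losing sub-cent detail.
    return CostRecord(model, input_tokens, output_tokens, round(total, 8))


cost_record = record("gpt-4o-mini", input_tokens=1200, output_tokens=350)
print(cost_record)
```

&lt;p&gt;Failing loudly on an unknown model is a deliberate choice in this sketch: a silent zero would hide spend rather than report it.&lt;/p&gt;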

&lt;h3&gt;
  
  
  3. Evaluation: Successful Calls Aren’t Always Correct
&lt;/h3&gt;

&lt;p&gt;Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;, responses pass through a simple evaluator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current checks include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Non-empty response&lt;/li&gt;
&lt;li&gt;Required term presence&lt;/li&gt;
&lt;li&gt;Maximum length&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example evaluation result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required_terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
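
&lt;p&gt;The three checks can be expressed as plain rules. The sketch below keeps the &lt;code&gt;evaluate(prompt, response)&lt;/code&gt; call shape used above; the constructor arguments &lt;code&gt;required_terms&lt;/code&gt; and &lt;code&gt;max_chars&lt;/code&gt; are hypothetical settings, and the real evaluator may be configured differently.&lt;/p&gt;

```python
class Evaluator:
    """Rule-based checks; required_terms and max_chars are assumed settings."""

    def __init__(self, required_terms=(), max_chars=2000):
        self.required_terms = [term.lower() for term in required_terms]
        self.max_chars = max_chars

    def evaluate(self, prompt, response):
        # prompt is accepted to match the call shape; these checks ignore it.
        text = response.strip()
        lowered = text.lower()
        checks = {
            "non_empty": bool(text),
            "required_terms": all(t in lowered for t in self.required_terms),
            "max_length": self.max_chars >= len(text),
        }
        return {"passed": all(checks.values()), "checks": checks}


evaluator = Evaluator(required_terms=["refund"], max_chars=200)
question = "What is the refund policy?"
ok = evaluator.evaluate(question, "Refund requests are accepted within 30 days.")
bad = evaluator.evaluate(question, "   ")
print(ok)
print(bad)
```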



&lt;p&gt;This pattern becomes more important as systems grow. Evaluation can evolve into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured output validation&lt;/li&gt;
&lt;li&gt;Hallucination detection&lt;/li&gt;
&lt;li&gt;Policy enforcement&lt;/li&gt;
&lt;li&gt;Safety filters&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  Why Not Just Use OpenTelemetry?
&lt;/h2&gt;

&lt;p&gt;I considered adopting OpenTelemetry at the very beginning of this project, but decided to build a home-grown layer instead, because OpenTelemetry solves a different problem. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed tracing&lt;/li&gt;
&lt;li&gt;Metrics exporters&lt;/li&gt;
&lt;li&gt;Telemetry pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; focuses on application-level reliability primitives. Think of it as the layer that answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happened in this AI request?&lt;br&gt;
What model was called?&lt;br&gt;
What did it cost?&lt;br&gt;
Did the result pass validation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These signals can later be exported to full observability stacks.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Worker Path
&lt;/h2&gt;

&lt;p&gt;AI systems rarely run only inside HTTP requests. Background jobs often run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch inference&lt;/li&gt;
&lt;li&gt;Evaluation pipelines&lt;/li&gt;
&lt;li&gt;Data enrichment tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracing&lt;/li&gt;
&lt;li&gt;Cost metering&lt;/li&gt;
&lt;li&gt;Evaluation&lt;/li&gt;
&lt;li&gt;Structured logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliability should not depend on the entrypoint.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Architecture Achieves
&lt;/h2&gt;

&lt;p&gt;With only a few modules, the system now answers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What request generated this output?&lt;/td&gt;
&lt;td&gt;tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How long did the model call take?&lt;/td&gt;
&lt;td&gt;spans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How many tokens were used?&lt;/td&gt;
&lt;td&gt;cost meter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What did it cost?&lt;/td&gt;
&lt;td&gt;pricing model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Was the output valid?&lt;/td&gt;
&lt;td&gt;evaluator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These signals turn a black-box AI API into a traceable system.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Most reliability discussions around AI focus on models. But reliability often comes from system design, not model quality.&lt;/p&gt;

&lt;p&gt;A simple architecture that records:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What happened&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. What it costs&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3. Whether the result was acceptable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;can dramatically improve how AI systems are operated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The earlier these ideas are introduced into a system, the easier that system will be to maintain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This article was originally published on my engineering blog, where I document the design of Maester, an AI SaaS infrastructure system built in public.&lt;br&gt;
Original post: &lt;a href="https://lei-ye.dev/blog/design-reliable-ai-apis/" rel="noopener noreferrer"&gt;What Breaks After Your AI Demo Works&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Thu, 05 Mar 2026 20:12:17 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/-hcp</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/-hcp</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/lei_ye_2cc01a0af9e8260e" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808468%2F5b0247f8-5d88-4e05-ad2c-f1af8a1ade2e.png" alt="lei_ye_2cc01a0af9e8260e"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/lei_ye_2cc01a0af9e8260e/introducing-maester-the-knowledge-engine-of-your-company-h22" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Introducing Maester&lt;/h2&gt;
      &lt;h3&gt;Lei Ye ・ Mar 5&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#saas&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#architecture&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>saas</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introducing Maester</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Thu, 05 Mar 2026 20:03:16 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/introducing-maester-the-knowledge-engine-of-your-company-h22</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/introducing-maester-the-knowledge-engine-of-your-company-h22</guid>
      <description>&lt;p&gt;&lt;em&gt;The Knowledge Engine of Your Company&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most companies today want the same thing from AI: to turn their internal knowledge into something &lt;strong&gt;queryable, explainable, and operational&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practice, the starting point usually looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents scattered across tools
&lt;/li&gt;
&lt;li&gt;Institutional knowledge trapped in teams
&lt;/li&gt;
&lt;li&gt;Data that exists but cannot be used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the typical solution becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Let’s build an AI assistant.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But building an &lt;strong&gt;AI demo&lt;/strong&gt; and building &lt;strong&gt;AI infrastructure that survives production&lt;/strong&gt; are very different things. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt; is our attempt to build the latter. &lt;/p&gt;




&lt;h2&gt;
  
  
  What Maester Is
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt; is a reference implementation of a &lt;strong&gt;B2B SaaS AI knowledge engine&lt;/strong&gt;. It demonstrates how a company can transform internal data into a &lt;strong&gt;production-grade knowledge system&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;At its core, &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; allows organizations to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest internal documents&lt;/li&gt;
&lt;li&gt;structure and embed them&lt;/li&gt;
&lt;li&gt;retrieve relevant knowledge&lt;/li&gt;
&lt;li&gt;generate responses with citations&lt;/li&gt;
&lt;li&gt;trace every operation across the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But more importantly, &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; is designed as &lt;strong&gt;infrastructure&lt;/strong&gt;, not just an AI feature. That means we are focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reliability&lt;/li&gt;
&lt;li&gt;traceability&lt;/li&gt;
&lt;li&gt;operational cost control&lt;/li&gt;
&lt;li&gt;asynchronous pipelines&lt;/li&gt;
&lt;li&gt;multi-tenant architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are building this project &lt;strong&gt;in public&lt;/strong&gt;, both as a working system and as a learning artifact. Every design choice will be documented. Every architecture decision will be explained.&lt;/p&gt;

&lt;p&gt;This blog serves as a &lt;strong&gt;system design journal&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Infrastructure Problem Most AI SaaS Products Ignore
&lt;/h2&gt;

&lt;p&gt;When teams first add AI to a product, the initial version often works. A prototype connects an LLM, retrieves some documents, and produces answers. But once the system meets real users, things break quickly. We repeatedly see the same failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Timeouts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLM calls are slow and unpredictable. Without proper timeouts and retries, requests cascade into system failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Uncontrolled costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every query triggers embedding calls, retrieval operations, and model inference. Without cost tracking and guardrails, usage grows faster than expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Queues and ingestion pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document ingestion is not instantaneous. Parsing, chunking, and embedding require asynchronous pipelines that many systems lack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Traceability gaps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When something goes wrong, teams often cannot answer simple questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What document generated this answer?&lt;/li&gt;
&lt;li&gt;Which embedding version was used?&lt;/li&gt;
&lt;li&gt;Which model responded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, AI becomes a &lt;strong&gt;black box in production&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Production-Ready AI Infrastructure” Actually Means
&lt;/h2&gt;

&lt;p&gt;For us, production readiness is not about model quality. It is about &lt;strong&gt;system design&lt;/strong&gt;. A production AI SaaS system must provide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Asynchronous ingestion pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Documents must move through structured stages:&lt;br&gt;
parse → chunk → embed → index. &lt;br&gt;
Each stage should be observable and retryable.&lt;/p&gt;
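
&lt;p&gt;As a rough illustration of "observable and retryable", the sketch below runs the four stages through one wrapper that logs every attempt and retries a bounded number of times. It is synchronous for brevity, whereas a real pipeline would dispatch each stage as a queued job. The stage functions are toy stand-ins, and &lt;code&gt;run_stage&lt;/code&gt;, &lt;code&gt;ingest&lt;/code&gt;, and &lt;code&gt;INDEX&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import time


def run_stage(name, fn, payload, attempts=3):
    """Run one stage with a bounded retry, logging each outcome."""
    for attempt in range(1, attempts + 1):
        start = time.perf_counter()
        try:
            result = fn(payload)
        except Exception as exc:
            print(f"stage={name} attempt={attempt} status=error error={exc!r}")
            if attempt == attempts:
                raise  # give up after the final attempt
        else:
            ms = int((time.perf_counter() - start) * 1000)
            print(f"stage={name} attempt={attempt} status=ok duration_ms={ms}")
            return result


# Toy stand-ins for real parsing, chunking, embedding, and indexing.
def parse(raw):
    return raw.strip()


def chunk(text):
    return [text[i:i + 20] for i in range(0, len(text), 20)]


def embed(chunks):
    return [(c, [float(len(c))]) for c in chunks]


INDEX = []


def index(vectors):
    INDEX.extend(vectors)
    return len(vectors)


def ingest(raw):
    """Pass the document through parse, chunk, embed, index in order."""
    payload = raw
    stages = [("parse", parse), ("chunk", chunk), ("embed", embed), ("index", index)]
    for name, fn in stages:
        payload = run_stage(name, fn, payload)
    return payload


indexed = ingest("  Maester ingests internal documents in stages.  ")
```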

&lt;p&gt;&lt;strong&gt;2. Reliable model access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All LLM access must go through a gateway that manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeouts&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;provider fallback&lt;/li&gt;
&lt;/ul&gt;
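
&lt;p&gt;A gateway with these three responsibilities could look roughly like the sketch below. It assumes providers are plain callables and uses a thread pool to enforce the timeout; &lt;code&gt;ModelGateway&lt;/code&gt; and its parameters are illustrative names, not Maester's actual interface.&lt;/p&gt;

```python
import concurrent.futures


class ModelGateway:
    """Try providers in order; each call gets a timeout and bounded retries."""

    def __init__(self, providers, timeout_s=10.0, retries=2):
        self.providers = providers  # list of (name, callable) pairs
        self.timeout_s = timeout_s
        self.retries = retries
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def generate(self, prompt):
        errors = []
        for name, call in self.providers:
            for attempt in range(1, self.retries + 1):
                future = self._pool.submit(call, prompt)
                try:
                    # result() raises a timeout error if the call is too slow.
                    # The worker thread itself is not cancelled by the timeout.
                    return name, future.result(timeout=self.timeout_s)
                except Exception as exc:
                    errors.append(f"{name}#{attempt}: {exc!r}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary(prompt):
    raise ConnectionError("primary unavailable")  # simulated outage


def secondary(prompt):
    return "echo: " + prompt


gateway = ModelGateway([("primary", primary), ("secondary", secondary)], timeout_s=1.0)
provider, text = gateway.generate("hello")
print(provider, text)
```

&lt;p&gt;Returning the provider name alongside the text keeps the "which model responded?" question answerable after a fallback.&lt;/p&gt;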

&lt;p&gt;&lt;strong&gt;3. Usage and cost accounting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every request must produce a &lt;strong&gt;usage record&lt;/strong&gt;. Production systems must answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tenant generated this cost?&lt;/li&gt;
&lt;li&gt;Which model generated this response?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Traceability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Requests must carry a &lt;strong&gt;correlation ID&lt;/strong&gt; through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API layer&lt;/li&gt;
&lt;li&gt;worker queues&lt;/li&gt;
&lt;li&gt;model calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how production systems become &lt;strong&gt;debuggable&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Maester is Structured
&lt;/h2&gt;

&lt;p&gt;Instead of treating AI as a feature, we treat it as &lt;strong&gt;infrastructure&lt;/strong&gt;. &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; separates the system into clear operational layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwrh43w2j3gt1bwueh2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwrh43w2j3gt1bwueh2v.png" alt="Architecture Design" width="800" height="1077"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Layers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Handles request entry, tenant routing, and request validation. This layer also generates &lt;strong&gt;request IDs&lt;/strong&gt; used for tracing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Engine Core&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;’s core logic lives. Responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document retrieval&lt;/li&gt;
&lt;li&gt;query orchestration&lt;/li&gt;
&lt;li&gt;interaction with the model gateway&lt;/li&gt;
&lt;li&gt;enforcing cost budgets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Async Worker System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All heavy processing moves to asynchronous workers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document parsing&lt;/li&gt;
&lt;li&gt;chunking&lt;/li&gt;
&lt;li&gt;embedding&lt;/li&gt;
&lt;li&gt;vector indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents ingestion tasks from blocking user requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of calling models directly, all inference flows through a gateway. This gateway manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provider abstraction&lt;/li&gt;
&lt;li&gt;retry logic&lt;/li&gt;
&lt;li&gt;token usage tracking&lt;/li&gt;
&lt;li&gt;future fallback support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We treat observability as a first-class concern. Every request is traceable across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API requests&lt;/li&gt;
&lt;li&gt;worker jobs&lt;/li&gt;
&lt;li&gt;model calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows production debugging without guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt; is not just an AI application.&lt;/p&gt;

&lt;p&gt;It is an exploration of how &lt;strong&gt;AI systems should be engineered&lt;/strong&gt;. In the coming posts, we will document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture decisions&lt;/li&gt;
&lt;li&gt;reliability patterns&lt;/li&gt;
&lt;li&gt;cost control strategies&lt;/li&gt;
&lt;li&gt;production ML infrastructure design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our goal is simple: to build a &lt;strong&gt;knowledge engine that companies can trust in production&lt;/strong&gt;, and to make every engineering decision &lt;strong&gt;transparent and explainable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The system starts &lt;strong&gt;SMALL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But the architecture is designed to &lt;strong&gt;SCALE&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;Originally published on my engineering blog: &lt;a href="https://lei-ye.dev/blog/introducing-maester" rel="noopener noreferrer"&gt;https://lei-ye.dev/blog/introducing-maester&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
