<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lei Ye</title>
    <description>The latest articles on DEV Community by Lei Ye (@lei_ye_2cc01a0af9e8260e).</description>
    <link>https://dev.to/lei_ye_2cc01a0af9e8260e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808468%2F5b0247f8-5d88-4e05-ad2c-f1af8a1ade2e.png</url>
      <title>DEV Community: Lei Ye</title>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lei_ye_2cc01a0af9e8260e"/>
    <language>en</language>
    <item>
      <title>The Hidden Problem With Prompts in Production AI</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:36:45 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/prompt-as-code-build-prompt-registry-with-versioning-1n5f</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/prompt-as-code-build-prompt-registry-with-versioning-1n5f</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at: &lt;a href="https://lei-ye.dev/blog/prompt-as-code//" rel="noopener noreferrer"&gt;Prompt as Code — Build Prompt Registry with Versioning&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;When teams first build AI features, prompts usually start simple.&lt;/p&gt;

&lt;p&gt;A string in a function.&lt;br&gt;
A template inside a route.&lt;br&gt;
Maybe a small helper function.&lt;/p&gt;

&lt;p&gt;Something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following system event:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, as the system evolves, prompts start changing.&lt;/p&gt;

&lt;p&gt;A word here.&lt;br&gt;
A constraint there.&lt;br&gt;
Someone adds a new instruction for better formatting.&lt;/p&gt;

&lt;p&gt;Before long, the system behaves differently and nobody can explain why.&lt;/p&gt;

&lt;p&gt;That’s when &lt;strong&gt;prompt chaos&lt;/strong&gt; begins.&lt;/p&gt;



&lt;h2&gt;
  
  
  1. The Problem: Prompt Chaos
&lt;/h2&gt;

&lt;p&gt;Unlike normal code, prompts are often invisible infrastructure.&lt;/p&gt;

&lt;p&gt;They live inside strings scattered across services. They change quietly during experimentation.&lt;/p&gt;

&lt;p&gt;Over time this creates several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responses change unexpectedly&lt;/li&gt;
&lt;li&gt;Evaluation metrics become unreliable&lt;/li&gt;
&lt;li&gt;Debugging becomes difficult&lt;/li&gt;
&lt;li&gt;Prompt history disappears&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If an output changes today, you may not know whether the cause was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A prompt change?&lt;/li&gt;
&lt;li&gt;A model change?&lt;/li&gt;
&lt;li&gt;A parameter change?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without prompt identity, the system becomes difficult to reason about.&lt;/p&gt;



&lt;h2&gt;
  
  
  2. Why Prompts Need Versioning
&lt;/h2&gt;

&lt;p&gt;Prompts influence system behavior as much as code does.&lt;/p&gt;

&lt;p&gt;In fact, prompts are closer to &lt;strong&gt;configuration that drives behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means prompts deserve the same discipline as code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Version control&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Traceability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of treating prompts as strings, we can treat them as versioned assets.&lt;/p&gt;

&lt;p&gt;This approach allows us to answer important questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which prompt generated this output?&lt;br&gt;
Which version was deployed last week?&lt;br&gt;
Which prompt version performs best during evaluation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the idea behind &lt;strong&gt;Prompt as Code&lt;/strong&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  3. What a Prompt Registry Is
&lt;/h2&gt;

&lt;p&gt;A Prompt Registry is a small service responsible for managing prompt templates.&lt;/p&gt;

&lt;p&gt;Instead of constructing prompts directly in application logic, the application resolves them from a registry.&lt;/p&gt;

&lt;p&gt;A prompt registry provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Prompt templates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deterministic rendering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt hashing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms prompts from ad-hoc strings into structured runtime assets.&lt;/p&gt;

&lt;p&gt;Example prompt template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a production AI assistant focused on reliability.
Summarize the following system event:
{event_text}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now prompts have &lt;em&gt;identity&lt;/em&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  4. Architecture
&lt;/h2&gt;

&lt;p&gt;The prompt registry sits between the API layer and the model gateway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
      ↓
Prompt Registry
      ↓
Rendered Prompt
      ↓
Model Gateway
      ↓
Provider Adapter
      ↓
Cost Metering
      ↓
Evaluation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompts are resolved before inference&lt;/li&gt;
&lt;li&gt;Prompt versions are logged&lt;/li&gt;
&lt;li&gt;Evaluation remains &lt;strong&gt;reproducible&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also cleanly &lt;strong&gt;separates prompt management from model execution&lt;/strong&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  5. Implementation
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; toolkit, the prompt registry lives inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;packages/
  prompt_registry/
      models.py
      registry.py
      service.py
      hashing.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A prompt template is defined as a structured object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
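&lt;p&gt;One way to back this dataclass is a small in-memory registry keyed by &lt;code&gt;(name, version)&lt;/code&gt;, with a helper that resolves the latest version when none is requested. This is an illustrative sketch, simplified from what a real registry would need (persistence, validation), and the names are assumptions rather than Maester's exact API:&lt;br&gt;
&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class PromptTemplate:
    name: str
    version: str
    template: str


class PromptRegistry:
    """Minimal in-memory registry keyed by (name, version)."""

    def __init__(self):
        self._templates = {}

    def register(self, tpl: PromptTemplate) -> None:
        self._templates[(tpl.name, tpl.version)] = tpl

    def get(self, name: str, version: str = None) -> PromptTemplate:
        if version is not None:
            return self._templates[(name, version)]
        # No explicit version: resolve "latest" by sorting "v1", "v2", ...
        versions = sorted(
            (v for (n, v) in self._templates if n == name),
            key=lambda v: int(v.lstrip("v")),
        )
        if not versions:
            raise KeyError(name)
        return self._templates[(name, versions[-1])]
```

&lt;p&gt;With this shape, callers that pin a version always get the same template, while callers that omit it pick up the newest one, which matches the "default to latest" behavior shown in the examples below.&lt;/p&gt;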



&lt;p&gt;Templates are stored in a registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following system event:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;{event_text}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the API receives a request, the prompt service resolves and renders the prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rendered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User downloaded a large dataset.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rendered prompt includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompt name&lt;/li&gt;
&lt;li&gt;prompt version&lt;/li&gt;
&lt;li&gt;prompt content&lt;/li&gt;
&lt;li&gt;prompt hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hash guarantees that the exact prompt used during inference can be traced later.&lt;/p&gt;
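&lt;p&gt;A simple way to produce such a traceable bundle is to hash the fully rendered text, for example with SHA-256 (the 64-character hex hashes in the responses below are consistent with that choice). The following is a sketch with illustrative names, not necessarily Maester's exact implementation:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RenderedPrompt:
    name: str
    version: str
    content: str
    hash: str


def render(name: str, version: str, template: str, variables: dict) -> RenderedPrompt:
    """Fill in the template and attach a SHA-256 content hash for traceability."""
    content = template.format(**variables)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return RenderedPrompt(name=name, version=version, content=content, hash=digest)
```

&lt;p&gt;Because the digest is computed over the rendered content, identical prompts always produce identical hashes, so a hash logged at inference time pins down exactly what the model saw.&lt;/p&gt;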



&lt;h2&gt;
  
  
  6. Prompt Versioning Examples
&lt;/h2&gt;

&lt;p&gt;Once prompts become versioned assets, evaluation becomes much more reliable.&lt;/p&gt;

&lt;p&gt;Each request now records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model&lt;/li&gt;
&lt;li&gt;provider&lt;/li&gt;
&lt;li&gt;prompt name&lt;/li&gt;
&lt;li&gt;prompt version&lt;/li&gt;
&lt;li&gt;prompt hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows teams to compare prompt performance across versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Prompt v1&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"event_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Admin revoked API key for user account 742."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"61050fd4e94849d791e566ead8c8f1c6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"prompt_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"5ba4cce2a985f8234698a63fe2260428b029dfd7d61e53a5793cc963b8737036"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Summarize the following system event clearly:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Admin revoked API key for user account"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000016"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000077"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000093"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"evaluation"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reliability_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"non_empty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"max_length"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example Prompt v2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Request (the version defaults to the latest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"variables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"event_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"System latency increased above 300ms for the inference service."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"7ec85dc989dc4da8a0ac9bb73f2317a7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"system_summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"prompt_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"06c08f6125a189abf90b44c9a63a5bc0f5307f06319363a922a476b38776b8c6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant focused on reliability.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Summarize the following system event.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Be concise, mention operational impact, and keep the tone factual.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;System latency increased above 300ms for the inference service."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;46&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;112&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"input_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000026"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000074"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_cost_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.000100"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"evaluation"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reliability_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"non_empty"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"max_length"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"passed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now prompt optimization becomes measurable rather than guesswork.&lt;/p&gt;



&lt;h2&gt;
  
  
  7. Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Building a prompt registry revealed a few important lessons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prompts evolve quickly&lt;/strong&gt;&lt;br&gt;
Even small systems accumulate many prompt variations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reproducibility matters early&lt;/strong&gt;&lt;br&gt;
Without prompt versioning, evaluation results become meaningless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Prompt identity simplifies debugging&lt;/strong&gt;&lt;br&gt;
When responses change, engineers can immediately identify the cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prompts should live outside business logic&lt;/strong&gt;&lt;br&gt;
Separating prompts from application code improves maintainability.&lt;/p&gt;



&lt;h2&gt;
  
  
  8. The Code
&lt;/h2&gt;

&lt;p&gt;The implementation described in this article is part of an open-source project called Maester.&lt;/p&gt;

&lt;p&gt;Maester is a lightweight toolkit focused on AI API reliability, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Model gateway routing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost metering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation pipelines&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt registry&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;maester&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to explore how production AI systems can remain observable, reproducible, and resilient as they grow.&lt;/p&gt;




&lt;br&gt;&lt;br&gt;
&lt;em&gt;Note: This article was originally published on my engineering blog, where I’m documenting the design of Maester, a production AI SaaS infrastructure system built in public. Original post: &lt;a href="https://lei-ye.dev/blog/prompt-as-code//" rel="noopener noreferrer"&gt;Prompt as Code — Build Prompt Registry with Versioning&lt;/a&gt;&lt;/em&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Maester — Enable Multi-provider LLM APIs</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Tue, 10 Mar 2026 21:28:17 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/building-maester-enable-multi-provider-llm-apis-4lp7</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/building-maester-enable-multi-provider-llm-apis-4lp7</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://lei-ye.dev/blog/multi-llm-provider-apis/" rel="noopener noreferrer"&gt;Building Maester — Enable Multi-provider LLM APIs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  We Locked Ourselves Into GCP
&lt;/h2&gt;

&lt;p&gt;Most infrastructure mistakes don’t start as mistakes. They start as reasonable decisions. This one started with a discount.&lt;/p&gt;




&lt;h3&gt;
  
  
  It Worked Beautifully
&lt;/h3&gt;

&lt;p&gt;In the beginning, the decision felt obvious. We had a large GCP startup credit, so our entire stack ran there.&lt;/p&gt;

&lt;p&gt;Compute.&lt;br&gt;
Storage. &lt;br&gt;
Data pipelines. &lt;br&gt;
Model training. &lt;br&gt;
... &lt;br&gt;
Everything.&lt;/p&gt;

&lt;p&gt;And honestly, it worked beautifully! &lt;strong&gt;Monitoring&lt;/strong&gt; was already integrated.&lt;br&gt;
&lt;strong&gt;Identity management&lt;/strong&gt; was built in. &lt;strong&gt;IAM policies&lt;/strong&gt; were easy to manage.&lt;br&gt;
Even &lt;strong&gt;LDAP&lt;/strong&gt; integration was already available.&lt;/p&gt;

&lt;p&gt;One of my teammates said something that sounded perfectly reasonable:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Don’t reinvent the wheel.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And he was right. &lt;em&gt;Why build infrastructure when the cloud already solved it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We were a small team. Most of our compute was tied to token usage, so costs looked predictable. Everything felt &lt;em&gt;lightweight&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;So we did what most startups do. We committed.&lt;/p&gt;


&lt;h3&gt;
  
  
  Where Did the Cost Come From?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;"Don't spend like a billionaire with the company's money !"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What? Months later, at a Monday morning standup, we were all confused by the complaints about the bill. The cloud bill had arrived, nobody could clearly explain it, and it was starting to eat into our margins.&lt;/p&gt;

&lt;p&gt;Where did the cost come from? &lt;br&gt;
Storage? &lt;br&gt;
Network egress? &lt;br&gt;
Pipelines?&lt;br&gt;
Inference traffic?&lt;/p&gt;

&lt;p&gt;Someone suggested hiring a cloud optimization engineer. Another suggested redesigning the entire data pipeline.&lt;/p&gt;

&lt;p&gt;But we were still a startup. Every time we opened the roadmap we saw something else staring at us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer requests.&lt;/li&gt;
&lt;li&gt;Feature releases.&lt;/li&gt;
&lt;li&gt;Revenue milestones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Infrastructure work always lost that fight. So the bills kept climbing. We weren't bankrupt. But we were trapped.&lt;/p&gt;


&lt;h3&gt;
  
  
  We Split the Stack
&lt;/h3&gt;

&lt;p&gt;Eventually we did something radical. We split the stack. The architecture finally looked like this:&lt;/p&gt;

&lt;p&gt;Azure → Identity / Compliance &lt;br&gt;
AWS   → Applications / Storage &lt;br&gt;
GCP   → Data Pipelines / Training&lt;/p&gt;

&lt;p&gt;And the cost?&lt;/p&gt;

&lt;p&gt;Still &lt;em&gt;expensive&lt;/em&gt;. But &lt;strong&gt;predictable&lt;/strong&gt;. Even without our original startup discount, the system became easier to control.&lt;/p&gt;

&lt;p&gt;Vendor lock-in is &lt;strong&gt;invisible&lt;/strong&gt; when things work. It becomes &lt;strong&gt;obvious&lt;/strong&gt; only when you try to leave.&lt;/p&gt;


&lt;h2&gt;
  
  
  We Are Not Going to Lock into OpenAI
&lt;/h2&gt;

&lt;p&gt;So when we started building the AI APIs, I began seeing the same pattern again.&lt;/p&gt;

&lt;p&gt;It was just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And honestly, that works.&lt;/p&gt;

&lt;p&gt;But I kept remembering the GCP moment. The moment when switching vendors became impossible. We were about to repeat the same mistake.&lt;/p&gt;

&lt;p&gt;Except this time the vendor was not a cloud. It was a model provider. So I made a decision. &lt;strong&gt;We are not going to lock into OpenAI.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Approach 1 — Let the client choose the model
&lt;/h3&gt;

&lt;p&gt;The simplest idea was letting the client select the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /generate
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"model"&lt;/span&gt;: &lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allowed switching between providers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Others later&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technically it worked. But users quickly complained.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I don't want to choose the model. I just want the best answer.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The user is always right. They just wanted results.&lt;/p&gt;




&lt;h3&gt;
  
  
  Approach 2 — Introduce a Model Gateway
&lt;/h3&gt;

&lt;p&gt;So we moved the decision out of the client. Instead of clients choosing providers, we introduced a Model Gateway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
     ↓
Model Gateway
     ↓
Provider Router
     ↓
Provider Adapter
(OpenAI / Anthropic / others)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gateway would manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provider routing&lt;/li&gt;
&lt;li&gt;fallback logic&lt;/li&gt;
&lt;li&gt;cost tracking&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application now simply asks for a response. And the infrastructure decides how to produce it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Code
&lt;/h2&gt;

&lt;p&gt;The implementation lives inside a small reference project I’ve been building called &lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal of the project is not to build a full AI platform, but to demonstrate a reliable AI API architecture.&lt;/p&gt;

&lt;p&gt;The gateway sits inside the system like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps/
   api/
      routes/
         reliable_completion.py

packages/
   model_gateway/
      base.py
      provider_openai.py
      provider_anthropic.py
      router.py
      client.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  The Provider Contract
&lt;/h3&gt;

&lt;p&gt;The first step was defining a provider interface. This follows the Adapter Pattern, allowing different model vendors to conform to a shared interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GenerationRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GenerationResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each provider adapter simply implements this contract.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenAIProvider
AnthropicProvider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both produce the same normalized response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GenerationResponse
 ├─ provider
 ├─ model
 ├─ content
 └─ usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the rest of the system never deals with vendor-specific formats.&lt;/p&gt;
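&lt;p&gt;As a sketch, the normalized contract might look like the following. The field names follow the diagrams above, but Maester’s actual types may differ:&lt;br&gt;
&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the normalized contract; field names follow the
# diagrams above, but Maester's real types may differ.

@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0

@dataclass
class GenerationRequest:
    model: str       # requested model name, e.g. "gpt-4.1-mini"
    prompt: str      # prompt text
    max_tokens: int  # generation budget

@dataclass
class GenerationResponse:
    provider: str    # which adapter produced the answer
    model: str       # model actually used
    content: str     # normalized text output
    usage: Usage = field(default_factory=Usage)
```

&lt;p&gt;Because every adapter returns the same shape, downstream cost metering and evaluation code only ever sees one type.&lt;/p&gt;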




&lt;h3&gt;
  
  
  The Router
&lt;/h3&gt;

&lt;p&gt;Next comes the router. The router decides which provider handles a request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ModelProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;providers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_provider&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production systems this layer can later evolve into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost-aware routing&lt;/li&gt;
&lt;li&gt;latency-aware routing&lt;/li&gt;
&lt;li&gt;capability routing&lt;/li&gt;
&lt;li&gt;traffic shaping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the interface stays the same.&lt;/p&gt;
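&lt;p&gt;For example, cost-aware routing can be sketched as an interchangeable strategy. The stub provider and price table below are illustrative, not Maester’s real data:&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical sketch: cost-aware routing as an interchangeable strategy.
# StubProvider and the price table are illustrative, not Maester's real data.

class StubProvider:
    """Minimal stand-in for a ModelProvider adapter."""
    def __init__(self, name, models):
        self.name = name
        self.models = set(models)

    def supports(self, model):
        return model in self.models


class CostAwareRouter:
    def __init__(self, providers, price_per_1k_output):
        self.providers = providers
        self.prices = price_per_1k_output  # e.g. {"openai": 0.0006}

    def route(self, model):
        # Among providers that support the model, pick the cheapest.
        candidates = [p for p in self.providers if p.supports(model)]
        if not candidates:
            raise LookupError("no provider supports " + model)
        return min(candidates, key=lambda p: self.prices.get(p.name, float("inf")))
```

&lt;p&gt;The gateway keeps calling &lt;code&gt;router.route(model)&lt;/code&gt;; only the policy object behind that call changes.&lt;/p&gt;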




&lt;h3&gt;
  
  
  The Gateway Client
&lt;/h3&gt;

&lt;p&gt;Finally the application talks to the gateway through a simple client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelGateway&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API layer doesn't know which provider was selected. It just receives a normalized response.&lt;/p&gt;




&lt;h3&gt;
  
  
  The API Layer
&lt;/h3&gt;

&lt;p&gt;The FastAPI route becomes extremely simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;requested_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After generation, the system runs the reliability pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost metering 
&lt;/li&gt;
&lt;li&gt;Evaluation
&lt;/li&gt;
&lt;li&gt;Structured logging
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model_routed
requested_model: gpt-4.1-mini
selected_provider: openai
fallback_used: false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives operators visibility without leaking provider logic into application code.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why This Architecture Matters
&lt;/h3&gt;

&lt;p&gt;This design combines several classic software engineering principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Inversion&lt;/strong&gt;
Application code depends on abstractions, not providers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapter Pattern&lt;/strong&gt;
Each vendor SDK is wrapped behind a provider adapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategy Pattern&lt;/strong&gt;
Routing policies are interchangeable strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separation of Concerns&lt;/strong&gt;
The API layer handles orchestration. The gateway handles provider logic.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What This Enables Later
&lt;/h3&gt;

&lt;p&gt;Once this boundary exists, the system becomes far easier to evolve.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-provider fallback&lt;/li&gt;
&lt;li&gt;provider benchmarking&lt;/li&gt;
&lt;li&gt;cost-aware routing&lt;/li&gt;
&lt;li&gt;latency optimization&lt;/li&gt;
&lt;li&gt;evaluation-based routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of those changes can happen inside the gateway. The application API never changes. That is the real value of the design.&lt;/p&gt;
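&lt;p&gt;One example: multi-provider fallback can live entirely inside the gateway. A minimal sketch, with illustrative stub providers and a deliberately broad &lt;code&gt;except&lt;/code&gt; (real adapters raise SDK-specific errors):&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical sketch of multi-provider fallback inside the gateway.
# The stub providers and the bare try/except are illustrative; real
# adapters raise SDK-specific errors.

class AlwaysFails:
    def generate(self, request):
        raise RuntimeError("provider outage")

class AlwaysWorks:
    def generate(self, request):
        return "normalized response"

class FallbackGateway:
    def __init__(self, primary, secondary):
        self.primary = primary      # e.g. an OpenAI adapter
        self.secondary = secondary  # e.g. an Anthropic adapter

    def generate(self, request):
        try:
            return self.primary.generate(request)
        except Exception:
            # The caller never notices; the application API never changes.
            return self.secondary.generate(request)
```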




&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;Vendor lock-in rarely feels dangerous at the beginning. Everything works. Costs look reasonable. The roadmap is full of features.&lt;/p&gt;

&lt;p&gt;Then one day something changes. Prices rise. Performance shifts. A better provider appears. And suddenly the architecture makes switching painful.&lt;/p&gt;

&lt;p&gt;The lesson I learned from our cloud migration was simple: &lt;strong&gt;Always design one layer where you can change your mind later&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For our AI systems, that layer became the &lt;strong&gt;Model Gateway&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The application talks to the gateway.&lt;br&gt;
The gateway talks to providers.&lt;br&gt;
And the providers can change.&lt;/p&gt;

&lt;p&gt;Because eventually they always do.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.&lt;br&gt;
Original post: &lt;a href="https://lei-ye.dev/blog/multi-llm-provider-apis/" rel="noopener noreferrer"&gt;Building Maester — Enable Multi-provider LLM APIs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Breaks After Your AI Demo Works</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Sun, 08 Mar 2026 05:32:45 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/what-breaks-after-your-ai-demo-works-2g8p</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/what-breaks-after-your-ai-demo-works-2g8p</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://lei-ye.dev/blog/design-reliable-ai-apis/" rel="noopener noreferrer"&gt;What Breaks After Your AI Demo Works&lt;/a&gt;.&lt;/em&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  A Short Story of How My AI Demo Worked and Failed
&lt;/h2&gt;

&lt;p&gt;A few weeks ago I built a small AI API. Nothing fancy. Just a simple endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It worked.&lt;/p&gt;

&lt;p&gt;Requests came in. The model responded. Everything looked good.&lt;br&gt;
Until the second week.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The First Question
&lt;/h3&gt;

&lt;p&gt;A teammate asked: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which request generated this output?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I checked the logs. There was nothing useful there.&lt;/p&gt;

&lt;p&gt;No request ID.&lt;br&gt;
No trace.&lt;br&gt;
No connection between the prompt and the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system worked — but it wasn’t traceable.&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Second Question
&lt;/h3&gt;

&lt;p&gt;Very quickly another question appeared.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Why did our AI bill jump yesterday?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I had no answer.&lt;/p&gt;

&lt;p&gt;We were calling models through an API wrapper, but we weren’t recording:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Model pricing&lt;/li&gt;
&lt;li&gt;Request-level cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We had built an AI system that spent money invisibly.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Third Question
&lt;/h3&gt;

&lt;p&gt;Then something more subtle happened.&lt;/p&gt;

&lt;p&gt;A user reported that an output looked wrong. The model had responded successfully, but the answer was clearly not useful. Which raised another question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do we know if a model response is acceptable?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We didn’t.&lt;/p&gt;

&lt;p&gt;The API only knew whether the model responded, not whether the result made sense.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Realization
&lt;/h3&gt;

&lt;p&gt;The model wasn't the problem. The system around the model was. AI APIs are fundamentally different from traditional APIs. They introduce three operational challenges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Can we trace what happened?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Economics&lt;/td&gt;
&lt;td&gt;How much did this request cost?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output reliability&lt;/td&gt;
&lt;td&gt;Was the response acceptable?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without solving these, AI systems quickly become hard to operate. So I built a small reference project to explore this problem. I called it &lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  The Minimal Reliability Architecture
&lt;/h2&gt;

&lt;p&gt;A reliable AI API request should pass through a few structured steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Request
      ↓
API Middleware
(request_id + trace_id)
      ↓
Route Handler
      ↓
Model Gateway
      ↓
Cost Metering
      ↓
Evaluation
      ↓
Structured Logs
      ↓
Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each step adds operational clarity.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Observability: Making AI Requests Traceable
&lt;/h3&gt;

&lt;p&gt;The first primitive is &lt;strong&gt;observability&lt;/strong&gt;. Every request should be traceable.&lt;br&gt;
In &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;, middleware attaches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;
&lt;span class="n"&gt;trace_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to the request context.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;new_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;start_trace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
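&lt;p&gt;A minimal sketch of those helpers, assuming simple UUID-based identifiers (real implementations may emit W3C trace-context IDs instead):&lt;br&gt;
&lt;/p&gt;

```python
import uuid

# Hypothetical sketch of new_id() and start_trace(); real implementations
# may emit W3C trace-context identifiers instead of bare UUIDs.

def new_id():
    return uuid.uuid4().hex

def start_trace():
    return "trace-" + uuid.uuid4().hex
```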



&lt;p&gt;These identifiers propagate through the entire request lifecycle. Then operations are wrapped in spans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The span records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operation name&lt;/li&gt;
&lt;li&gt;Duration&lt;/li&gt;
&lt;li&gt;Attributes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example log output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span_end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;412&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives immediate insight into where time is spent.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
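&lt;p&gt;A minimal sketch of such a &lt;code&gt;span()&lt;/code&gt; helper, assuming a print-based structured log (Maester’s real version may attach trace IDs and export to a log pipeline):&lt;br&gt;
&lt;/p&gt;

```python
import time
from contextlib import contextmanager

# Hypothetical sketch of the span() helper; Maester's real version may
# attach trace ids and export to a log pipeline instead of printing.

@contextmanager
def span(name, **attributes):
    record = {"event": "span_end", "span": name, **attributes}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000)
        print(record)  # stand-in for structured log emission
```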
&lt;h3&gt;
  
  
  2. Cost Metering: AI Systems Spend Money Per Request
&lt;/h3&gt;

&lt;p&gt;Unlike traditional APIs, AI requests have direct monetary cost. Token usage translates into real spend. So every request should produce a cost record.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cost_record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The meter uses a pricing catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODEL_PRICING&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_per_1k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_per_1k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00060&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The request returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;
&lt;span class="n"&gt;output_tokens&lt;/span&gt;
&lt;span class="n"&gt;total_cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example response fragment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.00042&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the API answers a critical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What did this request cost?”&lt;/p&gt;
&lt;/blockquote&gt;
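&lt;p&gt;A minimal sketch of how such a meter could compute the record from the pricing catalog above (the function name &lt;code&gt;record_cost&lt;/code&gt; is hypothetical; the real &lt;code&gt;meter.record&lt;/code&gt; signature may differ):&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical sketch of the cost meter; prices mirror the catalog shown
# above, but the real meter.record() in Maester may differ.

MODEL_PRICING = {
    "gpt-4o-mini": {"input_per_1k": 0.00015, "output_per_1k": 0.00060},
}

def record_cost(model, input_tokens, output_tokens):
    price = MODEL_PRICING[model]
    cost = (
        (input_tokens / 1000) * price["input_per_1k"]
        + (output_tokens / 1000) * price["output_per_1k"]
    )
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_cost_usd": round(cost, 6),
    }
```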

&lt;h3&gt;
  
  
  3. Evaluation: Successful Calls Aren’t Always Correct
&lt;/h3&gt;

&lt;p&gt;Even if a model responds successfully, the output may still be unusable. That is where evaluation comes in.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;, responses pass through a simple evaluator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Current checks include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Non-empty response&lt;/li&gt;
&lt;li&gt;Required term presence&lt;/li&gt;
&lt;li&gt;Maximum length&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example evaluation result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non_empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required_terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
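&lt;p&gt;A minimal sketch of those three checks (the &lt;code&gt;evaluate()&lt;/code&gt; signature and defaults here are assumptions, not Maester’s real interface):&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical sketch of the three checks; the evaluate() signature and
# defaults are assumptions, not Maester's real interface.

def evaluate(prompt, response, required_terms=(), max_length=4000):
    # prompt is accepted for future prompt-aware checks
    checks = {
        "non_empty": bool(response.strip()),
        "required_terms": all(term in response for term in required_terms),
        "max_length": not len(response) > max_length,
    }
    return {"passed": all(checks.values()), "checks": checks}
```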



&lt;p&gt;This pattern becomes more important as systems grow. Evaluation can evolve into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured output validation&lt;/li&gt;
&lt;li&gt;Hallucination detection&lt;/li&gt;
&lt;li&gt;Policy enforcement&lt;/li&gt;
&lt;li&gt;Safety filters&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  Why Not Just Use OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;I thought about adopting OpenTelemetry at the very beginning of this project, but decided to build a small in-house layer instead, because OpenTelemetry solves a different problem. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed tracing&lt;/li&gt;
&lt;li&gt;Metrics exporters&lt;/li&gt;
&lt;li&gt;Telemetry pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; focuses on application-level reliability primitives. Think of it as the layer that answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happened in this AI request?&lt;br&gt;
What model was called?&lt;br&gt;
What did it cost?&lt;br&gt;
Did the result pass validation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These signals can later be exported to full observability stacks.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
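
&lt;p&gt;One way to sketch that export path: keep one flat record per request and translate it into the kind of attribute map that OpenTelemetry-style backends consume. The field and key names below are assumptions for illustration, not a fixed schema:&lt;/p&gt;

```python
from dataclasses import dataclass, asdict

# Sketch: one flat record per AI request. Because it is plain data,
# it can later be mapped onto span attributes in any telemetry stack.
@dataclass
class RequestRecord:
    request_id: str
    model: str
    latency_ms: float
    cost_usd: float
    eval_passed: bool

def to_span_attributes(record):
    # Prefix keys the way attribute-based backends expect.
    return {f"ai.{key}": value for key, value in asdict(record).items()}

record = RequestRecord("req-123", "gpt-4o-mini", 840.0, 0.0007, True)
attrs = to_span_attributes(record)
```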

&lt;h2&gt;
  
  
  The Worker Path
&lt;/h2&gt;

&lt;p&gt;AI systems rarely run only inside HTTP requests. Background jobs often run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch inference&lt;/li&gt;
&lt;li&gt;Evaluation pipelines&lt;/li&gt;
&lt;li&gt;Data enrichment tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; includes a worker example to demonstrate that the same reliability primitives apply there. Worker execution uses the same tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracing&lt;/li&gt;
&lt;li&gt;Cost metering&lt;/li&gt;
&lt;li&gt;Evaluation&lt;/li&gt;
&lt;li&gt;Structured logs&lt;/li&gt;
&lt;/ul&gt;
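
&lt;p&gt;A minimal sketch of that idea, with illustrative names: the same wrapper that would instrument an HTTP handler instruments a batch job, so the entrypoint changes but the primitives do not:&lt;/p&gt;

```python
import time, json, uuid

def instrumented(task_name, fn, *args):
    # Same primitives for any entrypoint: trace id, timing, structured log.
    trace_id = str(uuid.uuid4())
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(json.dumps({"task": task_name, "trace_id": trace_id,
                      "elapsed_ms": round(elapsed_ms, 2)}))
    return result

def batch_summarize(events):
    # Stand-in for a batch inference job.
    return [event.upper() for event in events]

summaries = instrumented("batch_summarize", batch_summarize, ["disk full"])
```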

&lt;p&gt;Reliability should not depend on the entrypoint.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Architecture Achieves
&lt;/h2&gt;

&lt;p&gt;With only a few modules, the system now answers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What request generated this output?&lt;/td&gt;
&lt;td&gt;tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How long did the model call take?&lt;/td&gt;
&lt;td&gt;spans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How many tokens were used?&lt;/td&gt;
&lt;td&gt;cost meter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What did it cost?&lt;/td&gt;
&lt;td&gt;pricing model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Was the output valid?&lt;/td&gt;
&lt;td&gt;evaluator&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These signals turn a black-box AI API into a traceable system.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Most reliability discussions around AI focus on models. But reliability often comes from system design, not model quality.&lt;/p&gt;

&lt;p&gt;A simple architecture that records:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. What happened&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. What it cost&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3. Whether the result was acceptable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;can dramatically improve how AI systems are operated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The earlier these ideas are introduced into a system, the easier that system will be to maintain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Note: This article was originally published on my engineering blog where I document the design of Maester, an AI SaaS infrastructure system built in public.&lt;br&gt;
Original post: &lt;a href="https://lei-ye.dev/blog/design-reliable-ai-apis/" rel="noopener noreferrer"&gt;What Breaks After Your AI Demo Works&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Thu, 05 Mar 2026 20:12:17 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/-hcp</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/-hcp</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/lei_ye_2cc01a0af9e8260e" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3808468%2F5b0247f8-5d88-4e05-ad2c-f1af8a1ade2e.png" alt="lei_ye_2cc01a0af9e8260e"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/lei_ye_2cc01a0af9e8260e/introducing-maester-the-knowledge-engine-of-your-company-h22" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Introducing Maester&lt;/h2&gt;
      &lt;h3&gt;Lei Ye ・ Mar 5&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#saas&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#architecture&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>saas</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introducing Maester</title>
      <dc:creator>Lei Ye</dc:creator>
      <pubDate>Thu, 05 Mar 2026 20:03:16 +0000</pubDate>
      <link>https://dev.to/lei_ye_2cc01a0af9e8260e/introducing-maester-the-knowledge-engine-of-your-company-h22</link>
      <guid>https://dev.to/lei_ye_2cc01a0af9e8260e/introducing-maester-the-knowledge-engine-of-your-company-h22</guid>
      <description>&lt;p&gt;&lt;em&gt;The Knowledge Engine of Your Company&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most companies today want the same thing from AI: Turn their internal knowledge into something &lt;strong&gt;queryable, explainable, and operational&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practice this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents scattered across tools
&lt;/li&gt;
&lt;li&gt;Institutional knowledge trapped in teams
&lt;/li&gt;
&lt;li&gt;Data that exists but cannot be used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the typical solution becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Let’s build an AI assistant.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But building an &lt;strong&gt;AI demo&lt;/strong&gt; and building &lt;strong&gt;AI infrastructure that survives production&lt;/strong&gt; are very different things. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt; is our attempt to build the latter. &lt;/p&gt;




&lt;h2&gt;
  
  
  What Maester Is
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt; is a reference implementation of a &lt;strong&gt;B2B SaaS AI knowledge engine&lt;/strong&gt;. It demonstrates how a company can transform internal data into a &lt;strong&gt;production-grade knowledge system&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;At its core, &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; allows organizations to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest internal documents&lt;/li&gt;
&lt;li&gt;structure and embed them&lt;/li&gt;
&lt;li&gt;retrieve relevant knowledge&lt;/li&gt;
&lt;li&gt;generate responses with citations&lt;/li&gt;
&lt;li&gt;trace every operation across the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But more importantly, &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; is designed as &lt;strong&gt;infrastructure&lt;/strong&gt;, not just an AI feature. That means we are focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reliability&lt;/li&gt;
&lt;li&gt;traceability&lt;/li&gt;
&lt;li&gt;operational cost control&lt;/li&gt;
&lt;li&gt;asynchronous pipelines&lt;/li&gt;
&lt;li&gt;multi-tenant architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are building this project &lt;strong&gt;in public&lt;/strong&gt;, both as a working system and as a learning artifact. Every design choice will be documented. Every architecture decision will be explained.&lt;/p&gt;

&lt;p&gt;This blog serves as a &lt;strong&gt;system design journal&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Infrastructure Problem Most AI SaaS Products Ignore
&lt;/h2&gt;

&lt;p&gt;When teams first add AI to a product, the initial version often works. A prototype connects an LLM, retrieves some documents, and produces answers. But once the system meets real users, things break quickly. We repeatedly see the same failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Timeouts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLM calls are slow and unpredictable. Without proper timeouts and retries, requests cascade into system failures.&lt;/p&gt;
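
&lt;p&gt;A minimal sketch of the bounded-call pattern, using only the standard library. Here &lt;code&gt;call_model&lt;/code&gt; is a stand-in for a provider client, not a real API:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_model(prompt):
    # Stand-in for a slow, unpredictable LLM call.
    time.sleep(0.01)
    return f"summary of: {prompt}"

def call_with_timeout(prompt, timeout_s=2.0, retries=2):
    # Bound each attempt and retry a fixed number of times, so one
    # stuck call cannot cascade into an unbounded request pile-up.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for attempt in range(retries + 1):
            future = pool.submit(call_model, prompt)
            try:
                return future.result(timeout=timeout_s)
            except TimeoutError:
                continue
    raise RuntimeError("model call timed out on every attempt")

print(call_with_timeout("service restarted"))
```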

&lt;p&gt;&lt;strong&gt;2. Uncontrolled costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every query triggers embedding calls, retrieval operations, and model inference.  Without cost tracking and guardrails, usage grows faster than expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Queues and ingestion pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document ingestion is not instantaneous.  Parsing, chunking, and embedding require asynchronous pipelines that many systems lack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Traceability gaps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When something goes wrong, teams often cannot answer simple questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What document generated this answer?&lt;/li&gt;
&lt;li&gt;Which embedding version was used?&lt;/li&gt;
&lt;li&gt;Which model responded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, AI becomes a &lt;strong&gt;black box in production&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Production-Ready AI Infrastructure” Actually Means
&lt;/h2&gt;

&lt;p&gt;For us, production readiness is not about model quality. It is about &lt;strong&gt;system design&lt;/strong&gt;. A production AI SaaS system must provide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Asynchronous ingestion pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Documents must move through structured stages:&lt;br&gt;
parse → chunk → embed → index. &lt;br&gt;
Each stage should be observable and retryable.&lt;/p&gt;
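
&lt;p&gt;Those stages can be sketched as named, retryable units. The stage bodies below are stand-ins; only the shape of the pipeline matters here:&lt;/p&gt;

```python
def parse(doc):
    return doc["raw"].strip()

def chunk(text, size=20):
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks):
    # Stand-in embedding: one number per chunk.
    return [float(len(c)) for c in chunks]

def run_stage(name, fn, payload, retries=1):
    # Each stage is observable (logged) and retryable on failure.
    for attempt in range(retries + 1):
        try:
            result = fn(payload)
            print(f"stage={name} attempt={attempt} status=ok")
            return result
        except Exception:
            print(f"stage={name} attempt={attempt} status=retry")
    raise RuntimeError(f"stage {name} failed after retries")

doc = {"raw": "  Incident report: database timeout at 02:00.  "}
text = run_stage("parse", parse, doc)
chunks = run_stage("chunk", chunk, text)
vectors = run_stage("embed", embed, chunks)
index = dict(enumerate(vectors))  # stand-in for a vector index
```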

&lt;p&gt;&lt;strong&gt;2. Reliable model access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All LLM access must go through a gateway that manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeouts&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;provider fallback&lt;/li&gt;
&lt;/ul&gt;
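
&lt;p&gt;A minimal sketch of provider fallback inside such a gateway. Both provider functions are stand-ins; timeout and retry logic per provider would sit inside the loop in a real system:&lt;/p&gt;

```python
def flaky_primary(prompt):
    raise TimeoutError("primary provider timed out")

def stable_fallback(prompt):
    return {"text": f"answer to {prompt}"}

def gateway(prompt, providers):
    # Try providers in order; the first success wins.
    for name, call in providers:
        try:
            result = call(prompt)
            result["provider"] = name
            return result
        except Exception:
            continue
    raise RuntimeError("all providers failed")

answer = gateway("why did ingestion stall?", [
    ("primary", flaky_primary),
    ("fallback", stable_fallback),
])
```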

&lt;p&gt;&lt;strong&gt;3. Usage and cost accounting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every request must produce a &lt;strong&gt;usage record&lt;/strong&gt;. Production systems must answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which tenant generated this cost?&lt;/li&gt;
&lt;li&gt;Which model generated this response?&lt;/li&gt;
&lt;/ul&gt;
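
&lt;p&gt;A sketch of what such a usage record can look like, with illustrative pricing numbers (not any provider's real rates):&lt;/p&gt;

```python
PRICE_PER_1K_TOKENS = {"gpt-4o-mini": 0.00015}  # illustrative pricing

def usage_record(tenant_id, model, prompt_tokens, completion_tokens):
    # One record per request, attributing both tenant and model.
    total = prompt_tokens + completion_tokens
    cost = total / 1000 * PRICE_PER_1K_TOKENS[model]
    return {
        "tenant_id": tenant_id,
        "model": model,
        "total_tokens": total,
        "cost_usd": round(cost, 6),
    }

rec = usage_record("acme-corp", "gpt-4o-mini", 820, 180)
```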

&lt;p&gt;&lt;strong&gt;4. Traceability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Requests must carry a &lt;strong&gt;correlation ID&lt;/strong&gt; through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API layer&lt;/li&gt;
&lt;li&gt;worker queues&lt;/li&gt;
&lt;li&gt;model calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how production systems become &lt;strong&gt;debuggable&lt;/strong&gt;.&lt;/p&gt;
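
&lt;p&gt;A minimal sketch of that propagation, with illustrative function names: the API layer mints the ID once, and every later hop reuses it instead of minting its own:&lt;/p&gt;

```python
import uuid

def api_handler(query):
    # The API layer mints the correlation ID once per request.
    correlation_id = str(uuid.uuid4())
    job = {"correlation_id": correlation_id, "query": query}
    return worker(job)

def worker(job):
    # Workers pass the same ID along instead of minting a new one.
    return model_call(job["correlation_id"], job["query"])

def model_call(correlation_id, query):
    # Every log line at every hop carries the same ID.
    print(f"correlation_id={correlation_id} stage=model query={query}")
    return {"correlation_id": correlation_id, "answer": "ok"}

result = api_handler("summarize outage")
```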




&lt;h2&gt;
  
  
  How Maester Is Structured
&lt;/h2&gt;

&lt;p&gt;Instead of treating AI as a feature, we treat it as &lt;strong&gt;infrastructure&lt;/strong&gt;. &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt; separates the system into clear operational layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwrh43w2j3gt1bwueh2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwrh43w2j3gt1bwueh2v.png" alt="Architecture Design" width="800" height="1077"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Layers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Handles request entry, tenant routing, and request validation. This layer also generates &lt;strong&gt;request IDs&lt;/strong&gt; used for tracing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Engine Core&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;’s core logic lives. Responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document retrieval&lt;/li&gt;
&lt;li&gt;query orchestration&lt;/li&gt;
&lt;li&gt;interaction with the model gateway&lt;/li&gt;
&lt;li&gt;enforcing cost budgets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Async Worker System&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All heavy processing moves to asynchronous workers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;document parsing&lt;/li&gt;
&lt;li&gt;chunking&lt;/li&gt;
&lt;li&gt;embedding&lt;/li&gt;
&lt;li&gt;vector indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents ingestion tasks from blocking user requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of calling models directly, all inference flows through a gateway. This gateway manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provider abstraction&lt;/li&gt;
&lt;li&gt;retry logic&lt;/li&gt;
&lt;li&gt;token usage tracking&lt;/li&gt;
&lt;li&gt;future fallback support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We treat observability as a first-class concern. Every request is traceable across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API requests&lt;/li&gt;
&lt;li&gt;worker jobs&lt;/li&gt;
&lt;li&gt;model calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows production debugging without guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/leiye-07/maester" rel="noopener noreferrer"&gt;Maester&lt;/a&gt;&lt;/strong&gt; is not just an AI application.&lt;/p&gt;

&lt;p&gt;It is an exploration of how &lt;strong&gt;AI systems should be engineered&lt;/strong&gt;. In the coming posts, we will document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;architecture decisions&lt;/li&gt;
&lt;li&gt;reliability patterns&lt;/li&gt;
&lt;li&gt;cost control strategies&lt;/li&gt;
&lt;li&gt;production ML infrastructure design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our goal is simple: To build a &lt;strong&gt;knowledge engine that companies can trust in production&lt;/strong&gt;. And to make every engineering decision &lt;strong&gt;transparent and explainable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The system starts &lt;strong&gt;SMALL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But the architecture is designed to &lt;strong&gt;SCALE&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;Originally published on my engineering blog: &lt;a href="https://lei-ye.dev/blog/introducing-maester" rel="noopener noreferrer"&gt;https://lei-ye.dev/blog/introducing-maester&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
