<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sm1ck</title>
    <description>The latest articles on DEV Community by sm1ck (@sm1ck).</description>
    <link>https://dev.to/sm1ck</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3035707%2F79914038-15a3-4a68-9e07-803c587c48a8.png</url>
      <title>DEV Community: sm1ck</title>
      <link>https://dev.to/sm1ck</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sm1ck"/>
    <language>en</language>
    <item>
      <title>When the LLM Refuses: A Fallback Chain That Salvages Most Refusals</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Sun, 31 May 2026 01:45:23 +0000</pubDate>
      <link>https://dev.to/sm1ck/when-the-llm-refuses-a-fallback-chain-that-salvages-most-refusals-52i7</link>
      <guid>https://dev.to/sm1ck/when-the-llm-refuses-a-fallback-chain-that-salvages-most-refusals-52i7</guid>
      <description>&lt;p&gt;Every production LLM app eats false-positive refusals. A user asks something perfectly fine, the safety filter trips, the model emits two sentences of "I can't help with that," and your UI shows a wall. Do that a few times and the user leaves.&lt;/p&gt;

&lt;p&gt;We've measured this on &lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt; — Telegram-native AI companion, ~300 DAU, 17 languages. Across a normal day, &lt;strong&gt;somewhere between 2% and 8%&lt;/strong&gt; of model calls land in a refusal or &lt;code&gt;finish_reason="content_filter"&lt;/code&gt; state. Most of those are not actually problematic content — they're the model being twitchy about edge phrasing, polysemous words, or roleplay framing. The pattern below recovers about &lt;strong&gt;70%&lt;/strong&gt; of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HoneyChat LLM routing at a glance&lt;/strong&gt; (&lt;code&gt;core/llm.py&lt;/code&gt;, plan-gated via OpenRouter):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier(s)&lt;/th&gt;
&lt;th&gt;Pace&lt;/th&gt;
&lt;th&gt;Primary model (OpenRouter slug)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;free&lt;/code&gt; / &lt;code&gt;basic&lt;/code&gt; / &lt;code&gt;premium&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;natural&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen/qwen3-235b-a22b-2507&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;free&lt;/code&gt; / &lt;code&gt;basic&lt;/code&gt; / &lt;code&gt;premium&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;instant / explicit&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;vip&lt;/code&gt; / &lt;code&gt;elite&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;any&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google/gemini-3.1-flash-lite-preview&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Emergency &lt;code&gt;content_filter&lt;/code&gt; fallback chain (&lt;code&gt;GEMINI_CONTENT_FILTER_FALLBACK_CHAIN&lt;/code&gt;): &lt;code&gt;x-ai/grok-4.20&lt;/code&gt; → an open roleplay-tuned model. The rescue chain below is what feeds traffic into that fallback only when it's actually needed.&lt;/p&gt;

&lt;p&gt;Four steps, numbered 0 to 3, roughly in order of cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 0: Don't trigger it in the first place
&lt;/h2&gt;

&lt;p&gt;Free, and where most posts on this topic stop. Two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Relax the adjustable safety knobs the provider exposes.&lt;/strong&gt; For Gemini via OpenRouter, that's &lt;code&gt;safety_settings&lt;/code&gt; in the extra body. Default is &lt;code&gt;BLOCK_MEDIUM_AND_ABOVE&lt;/code&gt; on four categories; for roleplay/chat traffic we lower them to &lt;code&gt;BLOCK_NONE&lt;/code&gt; via a helper called &lt;code&gt;_maybe_inject_gemini_safety_off()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;extra_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safety_settings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_HARASSMENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_HATE_SPEECH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_SEXUALLY_EXPLICIT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HARM_CATEGORY_DANGEROUS_CONTENT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCK_NONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Probe before/after on the same fictional-scene prompt: 130-char refusal → 2,571-char full response. The hard, non-negotiable filters (CSAM, etc.) stay on at the provider level regardless of this knob; only the &lt;em&gt;adjustable&lt;/em&gt; sliders move.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't apply this to moderation/vision calls.&lt;/strong&gt; Those calls &lt;em&gt;want&lt;/em&gt; the filter on. The helper is scoped to the chat/roleplay code path only.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This alone cuts refusals roughly in half on our traffic.&lt;/p&gt;
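&lt;p&gt;A minimal sketch of how the scoping in point 2 might look. The function name, its &lt;code&gt;purpose&lt;/code&gt; argument and the &lt;code&gt;google/&lt;/code&gt; slug check are assumptions for illustration; the real &lt;code&gt;_maybe_inject_gemini_safety_off()&lt;/code&gt; signature is not shown in this post.&lt;/p&gt;

```python
# Hypothetical sketch: relax the four adjustable Gemini categories,
# but only for chat/roleplay calls, never for moderation or vision.
ADJUSTABLE_CATEGORIES = (
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
)
RELAXED_PURPOSES = {"chat", "roleplay"}  # assumed labels for the code path


def maybe_inject_safety_off(model, purpose, extra_body=None):
    """Return a copy of extra_body, with safety_settings added when in scope."""
    body = dict(extra_body or {})
    if model.startswith("google/") and purpose in RELAXED_PURPOSES:
        body["safety_settings"] = [
            {"category": category, "threshold": "BLOCK_NONE"}
            for category in ADJUSTABLE_CATEGORIES
        ]
    return body
```

&lt;p&gt;Returning a copy rather than mutating the caller's dict keeps a moderation call from accidentally inheriting settings that were injected for an earlier chat call.&lt;/p&gt;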

&lt;h2&gt;
  
  
  Step 1: Partial salvage before fallback
&lt;/h2&gt;

&lt;p&gt;When you do get a refusal, the model still sent &lt;em&gt;something&lt;/em&gt;. Check the streamed buffer or the partial completion before declaring failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;salvage_partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract usable content from a partial/filtered response. None = unsalvageable.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_try_extract_json_field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_strip_trailing_refusal_markers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extracted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 17-lang marker set
&lt;/span&gt;    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_truncate_to_sentence_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 17-language refusal marker list (one per supported HoneyChat locale) is the boring part — &lt;code&gt;"I can't"&lt;/code&gt;, &lt;code&gt;"I'm not able"&lt;/code&gt;, &lt;code&gt;"As an AI"&lt;/code&gt;, plus their localised equivalents (&lt;code&gt;"Я не могу"&lt;/code&gt;, &lt;code&gt;"Lo siento, no puedo"&lt;/code&gt;, &lt;code&gt;"申し訳ありません"&lt;/code&gt;, …). Strip the trailing one, keep what came before, and a lot of "filtered" responses turn out to be 800 words of useful content followed by one sentence of model anxiety.&lt;/p&gt;

&lt;p&gt;The length gate (&lt;code&gt;len ≥ 150&lt;/code&gt;) is what stops a bare "I can't help with that" from being salvaged into an empty or near-empty fragment once its marker is stripped. We have &lt;strong&gt;70 unit tests&lt;/strong&gt; on this function; &lt;code&gt;tests/test_salvage_partial.py&lt;/code&gt; is the largest single test file in the codebase.&lt;/p&gt;
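&lt;p&gt;To make the strip-then-gate idea concrete, here is a simplified sketch. The marker tuple is a tiny illustrative subset, the sentence splitting is deliberately naive, and &lt;code&gt;strip_trailing_refusal&lt;/code&gt; is a hypothetical stand-in that folds the marker stripping, sentence truncation and length gate into one function. It is not the production helper.&lt;/p&gt;

```python
import re

# Illustrative subset only; the real list covers 17 locales.
REFUSAL_MARKERS = ("I can't", "I'm not able", "As an AI", "Я не могу", "Lo siento, no puedo")
MARKERS_LOWER = tuple(marker.lower() for marker in REFUSAL_MARKERS)
MIN_SALVAGE_CHARS = 150  # same threshold as the gate in salvage_partial


def strip_trailing_refusal(text):
    """Drop trailing refusal sentences; return None if too little is left."""
    # Keep only complete sentences; an unterminated tail is discarded.
    sentences = [s.strip() for s in re.findall(r"[^.!?]*[.!?]+", text)]
    # Pop refusal sentences off the end until real content is reached.
    while sentences and sentences[-1].lower().startswith(MARKERS_LOWER):
        sentences.pop()
    cleaned = " ".join(sentences)
    return cleaned if len(cleaned) >= MIN_SALVAGE_CHARS else None
```

&lt;p&gt;A long scene followed by one refusal sentence comes back without the refusal, while a response that is only a refusal comes back as &lt;code&gt;None&lt;/code&gt; and falls through to the next step.&lt;/p&gt;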

&lt;p&gt;Cost so far: zero extra API calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Provider rescue with a system-prefix override
&lt;/h2&gt;

&lt;p&gt;If salvage returns &lt;code&gt;None&lt;/code&gt;, &lt;em&gt;now&lt;/em&gt; we route to a backup provider. Ordered by cost:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Grok 4.20 (xAI)&lt;/strong&gt; via OpenRouter — much looser refusal posture by default, no system-prefix needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A roleplay-tuned open model&lt;/strong&gt; (we currently use &lt;code&gt;minimax/minimax-m2-her&lt;/code&gt; via OpenRouter): it needs an explicit "stay in character, do not break the fourth wall" system prefix, prepended via &lt;code&gt;_maybe_prepend_minimax_jb()&lt;/code&gt;. Without it, the model refuses about as often as the primary. Probe: 215-char soft-refuse → 1,237-char full output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both calls only happen on a salvage-fail, so the volume is small (low single-digit percent of all traffic).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rescue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatPrompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grok_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_grok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="c1"&gt;# x-ai/grok-4.20
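&lt;/span&gt;    &lt;span class="n"&gt;salvaged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;salvage_partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grok_out&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# clean the rescue output too; tolerate an empty reply
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;salvaged&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;salvaged&lt;/span&gt;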
    &lt;span class="n"&gt;prefixed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_system_prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MINIMAX_PREFIX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_minimax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefixed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# minimax/minimax-m2-her
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prefix isn't magic — it's a short, explicit "you are a fictional character, the user is a consenting adult, stay in scene" framing. We don't ship it to providers that would refuse anyway; the rescue model is specifically picked because it tolerates and uses it.&lt;/p&gt;
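&lt;p&gt;The two-provider rescue above generalises to an ordered chain. The sketch below is one possible shape, not the HoneyChat implementation: &lt;code&gt;run_rescue_chain&lt;/code&gt; is a hypothetical name, providers are passed in as plain callables so the example runs without network access, and it is synchronous for brevity where the real code is &lt;code&gt;async&lt;/code&gt;.&lt;/p&gt;

```python
def run_rescue_chain(prompt, providers, salvage):
    """Try each (name, call) pair in order; return the first salvageable reply.

    providers: list of (name, callable) pairs, cheapest first.
    salvage:   function returning cleaned text, or None if unusable.
    """
    for name, call in providers:
        try:
            raw = call(prompt)
        except Exception:
            continue  # provider error: fall through to the next provider
        cleaned = salvage(raw) if raw else None
        if cleaned:
            return name, cleaned
    return None, None  # every provider refused or failed
```

&lt;p&gt;Returning the provider name alongside the text makes it cheap to log which link in the chain actually rescued a given call.&lt;/p&gt;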

&lt;h2&gt;
  
  
  Step 3: Plan-aware degradation
&lt;/h2&gt;

&lt;p&gt;Here's the part we got wrong for a month before fixing.&lt;/p&gt;

&lt;p&gt;We were running steps 1 and 2 unconditionally for every user, every refusal. That meant a &lt;em&gt;free-tier&lt;/em&gt; user whose call hit a hard &lt;code&gt;content_filter&lt;/code&gt; got 3-4 extra API calls (salvage attempt → Grok → MiniMax), each adding latency and cost. They'd often still get a usable response. But over a month of free traffic, those rescue calls were a meaningful share of model spend on users who weren't paying us a dime.&lt;/p&gt;

&lt;p&gt;The fix is just a gate, mapped against HoneyChat's five tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PAID_TIERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;basic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;premium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PAID_TIERS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;salvaged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;salvage_partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;salvaged&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;rescue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;salvaged&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;salvaged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;salvage_partial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;salvaged&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;salvaged&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_in_character_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;character&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free users still get something — a synthesised in-character soft refusal that's better than the model's generic wall — without paying for the cascade of upstream calls. Paid users get the full chain because their economics support it.&lt;/p&gt;
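&lt;p&gt;A local in-character refusal can be as simple as a lookup table. The sketch below assumes characters are keyed by an archetype string and uses made-up deflection lines; the real &lt;code&gt;_in_character_refusal()&lt;/code&gt; takes &lt;code&gt;prompt.character&lt;/code&gt; and its internals are not shown in this post.&lt;/p&gt;

```python
import random

# Hypothetical deflection lines; real ones would be localised per persona.
SOFT_REFUSALS = {
    "default": [
        "Hmm, let's talk about something else for a bit.",
        "I'd rather not go there. What else is on your mind?",
    ],
    "tsundere": [
        "Tch. I'm not answering that. Ask me something else.",
    ],
}


def in_character_refusal(archetype, rng=random):
    """Pick a canned soft refusal for the archetype, with no API call."""
    lines = SOFT_REFUSALS.get(archetype, SOFT_REFUSALS["default"])
    return rng.choice(lines)
```

&lt;p&gt;Because nothing here touches the network, the free-tier path costs only the original model call.&lt;/p&gt;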

&lt;p&gt;Effect on our cost graph: free-tier refusal cost dropped to near zero. Paid-tier user-perceived "the bot refused me" rate dropped by about 70%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons we'd pin to the wall
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Refusals are not all-or-nothing.&lt;/strong&gt; Most "filtered" responses contain usable content before the refusal sentence — salvage before fallback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider safety knobs work, but only on the adjustable categories.&lt;/strong&gt; &lt;code&gt;BLOCK_NONE&lt;/code&gt; doesn't disable the non-negotiables; it just turns off the over-eager middle ground.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't apply the knob globally.&lt;/strong&gt; Moderation and vision calls &lt;em&gt;want&lt;/em&gt; the filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make rescue plan-aware.&lt;/strong&gt; A 4-call rescue cascade for every free user adds up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesise an in-character refusal locally&lt;/strong&gt; when you can't or won't rescue.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole pattern is a couple hundred lines of glue (&lt;code&gt;core/llm.py&lt;/code&gt;, helpers &lt;code&gt;_maybe_inject_gemini_safety_off&lt;/code&gt;, &lt;code&gt;_maybe_prepend_minimax_jb&lt;/code&gt;, &lt;code&gt;salvage_partial&lt;/code&gt;). The unit-test suite around &lt;code&gt;salvage_partial&lt;/code&gt; keeps the regression risk low.&lt;/p&gt;




&lt;p&gt;This pattern is in production at &lt;strong&gt;&lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt;&lt;/strong&gt; — Telegram-native AI companion bot where a single refusal mid-conversation kills the experience. Canonical version: &lt;a href="https://honeychat.bot/en/blog/llm-content-filter-fallback-rescue-chain/" rel="noopener noreferrer"&gt;honeychat.bot/en/blog/llm-content-filter-fallback-rescue-chain&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;HoneyChat Engineering&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/gemini-api/docs/safety-settings" rel="noopener noreferrer"&gt;Google — Gemini safety settings&lt;/a&gt; — the four adjustable harm categories, threshold semantics, what &lt;code&gt;BLOCK_NONE&lt;/code&gt; does and doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/api-reference/parameters" rel="noopener noreferrer"&gt;OpenRouter — Provider parameters / extra_body&lt;/a&gt; — passthrough to provider-specific knobs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/features/model-routing" rel="noopener noreferrer"&gt;OpenRouter — Model routing &amp;amp; fallback&lt;/a&gt; — declarative fallback chain semantics.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/api/messages" rel="noopener noreferrer"&gt;Anthropic — &lt;code&gt;stop_reason&lt;/code&gt; and &lt;code&gt;finish_reason&lt;/code&gt; reference&lt;/a&gt; — how providers signal a content-filter stop vs a token-limit stop.&lt;/li&gt;
&lt;li&gt;HoneyChat engineering notes: &lt;a href="https://honeychat.bot/en/blog/llm-routing-per-tier-openrouter/" rel="noopener noreferrer"&gt;LLM routing per tier on OpenRouter&lt;/a&gt; · &lt;a href="https://honeychat.bot/en/blog/llm-prompt-caching-in-production/" rel="noopener noreferrer"&gt;prompt caching measured&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Inworld TTS Paralinguistic Tags Don't Work — Here's What Does</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Sun, 31 May 2026 01:42:57 +0000</pubDate>
      <link>https://dev.to/sm1ck/inworld-tts-paralinguistic-tags-dont-work-heres-what-does-50pj</link>
      <guid>https://dev.to/sm1ck/inworld-tts-paralinguistic-tags-dont-work-heres-what-does-50pj</guid>
      <description>&lt;p&gt;If you've worked with expressive TTS in the last year you've probably seen the pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;She paused. [sigh] "Fine, you can come in."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inline paralinguistic tags. Half the model demos use them. So when we wired up &lt;strong&gt;Inworld TTS-1.5 Max&lt;/strong&gt; for &lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt; — Telegram-native AI companion where voice messages are a first-class output — we sprinkled &lt;code&gt;[laugh]&lt;/code&gt;, &lt;code&gt;[sigh]&lt;/code&gt;, &lt;code&gt;[breathe]&lt;/code&gt; through the prompts and shipped.&lt;/p&gt;

&lt;p&gt;The audio sounded fine. Just… exactly the same as before. No laugh. No sigh. The tags were getting read out as silence at best, and as the literal text "sigh" at worst, depending on the voice.&lt;/p&gt;

&lt;p&gt;We tested all the variants we could find. None of them moved the needle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HoneyChat voice stack at a glance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engine:&lt;/strong&gt; Inworld TTS-1.5 Max — $10 per 1M characters, currently #1 on the TTS Arena ELO board at &lt;strong&gt;1259 ELO&lt;/strong&gt;, &lt;strong&gt;15 languages&lt;/strong&gt; with native pronunciation: &lt;code&gt;en, ru, ja, zh, ko, es, fr, de, it, pt, pl, hi, ar, he, nl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice catalog:&lt;/strong&gt; 312 designed voices (26 character archetypes × 12 languages), stored as &lt;code&gt;voiceId&lt;/code&gt; strings in &lt;code&gt;config/archetype_voice_ids.json&lt;/code&gt;. Generated via the Voice Design API and managed with &lt;code&gt;core/voice_design.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom voices:&lt;/strong&gt; Voice Clone Manager (&lt;code&gt;core/voice_clone_manager.py&lt;/code&gt;) — persistent &lt;code&gt;voiceId&lt;/code&gt; minted from a WAV/MP3 sample.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache:&lt;/strong&gt; voice previews + test samples are lazy-loaded from &lt;strong&gt;Storj S3&lt;/strong&gt; via &lt;code&gt;core/voice_cache.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback:&lt;/strong&gt; gTTS (Google) — free, no API key, used if Inworld returns 5xx or budget is exhausted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What we removed to get here:&lt;/strong&gt; Kokoro (CPU Docker, latency too high) and Chatterbox (GPU on Vast.ai, ops cost too high). Inworld replaced both for a flat per-char cost and dramatically better expressivity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One API gotcha:&lt;/strong&gt; gender enum is &lt;code&gt;VOICE_GENDER_MALE&lt;/code&gt;/&lt;code&gt;VOICE_GENDER_FEMALE&lt;/code&gt;, not &lt;code&gt;"male"&lt;/code&gt;/&lt;code&gt;"female"&lt;/code&gt; strings. Passing the strings 400s silently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What actually doesn't work
&lt;/h2&gt;

&lt;p&gt;Tried on the same sentence, same voice, side-by-side audio comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;What it did&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;[laugh]&lt;/code&gt; &lt;code&gt;[sigh]&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Silence in output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;(laughs)&lt;/code&gt; &lt;code&gt;(sighs)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Sometimes read literally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;*laughs*&lt;/code&gt; &lt;code&gt;*sighs*&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Silence (asterisks get stripped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;laugh/&amp;gt;&lt;/code&gt; &lt;code&gt;&amp;lt;sigh/&amp;gt;&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Silence (not valid SSML on Inworld)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;emotion&amp;gt;laugh&amp;lt;/emotion&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Silence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Inworld API does not document support for any of these. We had assumed (because every other TTS post on the internet uses them) that they were a universal convention. They are not.&lt;/p&gt;

&lt;p&gt;What Inworld &lt;em&gt;does&lt;/em&gt; expose is &lt;strong&gt;&lt;code&gt;temperature&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;speakingRate&lt;/code&gt;&lt;/strong&gt; as request parameters, plus a small subset of SSML. The expressivity has to come from those plus how you shape the &lt;em&gt;text itself&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually does work
&lt;/h2&gt;

&lt;p&gt;After enough A/B-ing across 26 archetypes × 15 languages, four patterns reliably change the audio output.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Asterisks for emphasis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"You did *what?*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The asterisks get stripped from the spoken text but the emphasised word lands with audible stress. Works in every voice we tried. The cheapest, highest-hit-rate marker.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Ellipsis for pause-with-mood
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Fine... you can come in."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three dots produce a real pause with a tonal drop, the voice equivalent of a sigh, without trying to fake &lt;code&gt;[sigh]&lt;/code&gt;. Five dots give a longer pause. The model interprets them as prosodic cues.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. SSML &lt;code&gt;&amp;lt;break&amp;gt;&lt;/code&gt; for hard pauses
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;speak&amp;gt;&lt;/span&gt;
  She paused. &lt;span class="nt"&gt;&amp;lt;break&lt;/span&gt; &lt;span class="na"&gt;time=&lt;/span&gt;&lt;span class="s"&gt;"0.4s"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt; "Fine, you can come in."
&lt;span class="nt"&gt;&amp;lt;/speak&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inworld accepts a useful subset of SSML, and &lt;code&gt;&amp;lt;break&amp;gt;&lt;/code&gt; is the one that matters most for expressive speech. &lt;code&gt;0.2s&lt;/code&gt; for a beat, &lt;code&gt;0.4s&lt;/code&gt; for a sigh-pause, &lt;code&gt;0.8s&lt;/code&gt; for a beat-before-a-line-delivery moment. Wrap the whole text in &lt;code&gt;&amp;lt;speak&amp;gt;&lt;/code&gt; and the parser handles it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Onomatopoeia for laughs, moans, breath
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Mmm... ha-ha, you're right."
"ahh... I needed that."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model &lt;em&gt;will&lt;/em&gt; render &lt;code&gt;ha-ha&lt;/code&gt;, &lt;code&gt;mmm&lt;/code&gt;, &lt;code&gt;ahh&lt;/code&gt;, &lt;code&gt;oh&lt;/code&gt;, &lt;code&gt;nnn&lt;/code&gt; as the actual sound, because they're spellings of sounds rather than meta-tags. They sound far more natural than a synthesised &lt;code&gt;[laugh]&lt;/code&gt; even when one exists.&lt;/p&gt;

&lt;p&gt;For emotional/intimate scenes, rhythmic repeats (&lt;code&gt;ah... ah... ah&lt;/code&gt;) carry actual prosody. We use this for breath patterns where another TTS would want a &lt;code&gt;[breathe]&lt;/code&gt; marker.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrapper that ties it together
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;core/voice.py&lt;/code&gt; we run every chunk through &lt;code&gt;enrich_for_tts()&lt;/code&gt; (line ~772) before handing it to Inworld. Regex-based, language-aware, idempotent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enrich_for_tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (preprocessed_text, request_params).
    Strips fake paralinguistic tags, adds SSML breaks where appropriate,
    and bumps temperature/speakingRate for high-emotion scenes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_STRIP_FAKE_TAGS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ELLIPSIS_TO_BREAK&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;break time=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.3s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;break&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;speak&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/speak&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_detect_mood_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mood detector looks for emotional cues (intensity words, repeated punctuation, onomatopoeia density) and bumps &lt;code&gt;temperature&lt;/code&gt; and &lt;code&gt;speakingRate&lt;/code&gt; for the more expressive scenes. Same model, same voice, much more dynamic output, all without any inline tag that the model would have ignored.&lt;/p&gt;
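&lt;p&gt;To make the idea concrete, here is a minimal sketch of what such a detector could look like. It is not the production &lt;code&gt;_detect_mood_params&lt;/code&gt;: the function name &lt;code&gt;detect_mood_params&lt;/code&gt;, the cue lists, the thresholds and the parameter values are all assumptions picked for illustration, and the valid ranges for &lt;code&gt;temperature&lt;/code&gt; and &lt;code&gt;speakingRate&lt;/code&gt; should be taken from the provider's docs.&lt;/p&gt;

```python
import re

# Hypothetical cue lists and thresholds; tune per language and voice.
_INTENSITY_WORDS = {"never", "always", "love", "hate", "please", "amazing"}
_REPEATED_PUNCT = re.compile(r"[!?]{2,}")
_ONOMATOPOEIA = re.compile(r"\b(?:a+h+|o+h+|m+h*m+|ha(?:ha)+|ugh+)\b", re.IGNORECASE)

# Assumed neutral defaults, not values taken from the provider's docs.
_BASE = {"temperature": 0.7, "speakingRate": 1.0}


def detect_mood_params(text: str, lang: str = "en") -> dict:
    """Score emotional cues in the text and return request params."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return dict(_BASE)
    score = 0
    # The intensity list above is English-only, so skip it for other languages.
    if lang == "en":
        score += sum(1 for w in words if w in _INTENSITY_WORDS)
    # Runs like "!!" or "?!" count double.
    score += 2 * len(_REPEATED_PUNCT.findall(text))
    # Onomatopoeia density: share of tokens that are vocal sounds.
    density = len(_ONOMATOPOEIA.findall(text)) / len(words)
    if density >= 0.2:
        score += 2
    if score >= 2:
        return {"temperature": 0.9, "speakingRate": 1.1}
    return dict(_BASE)


print(detect_mood_params("The meeting is at noon."))
print(detect_mood_params("Ahh... ohh, no way!!"))
```

&lt;p&gt;The first call returns the neutral defaults and the second returns the bumped values. A real detector would need per-language cue lists; this sketch only scores intensity words for English.&lt;/p&gt;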

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't assume &lt;code&gt;[laugh]&lt;/code&gt;/&lt;code&gt;[sigh]&lt;/code&gt; is universal.&lt;/strong&gt; It isn't. Check the provider's docs and probe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probe with side-by-side audio, not just visual diffs.&lt;/strong&gt; A &lt;code&gt;[sigh]&lt;/code&gt; that emits silence looks identical to one that emits a sigh in any log.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use what the API actually exposes.&lt;/strong&gt; For Inworld that's &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;speakingRate&lt;/code&gt;, and a useful subset of SSML — not inline tags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onomatopoeia beats meta-tags for emotional sounds.&lt;/strong&gt; &lt;code&gt;"ahh..."&lt;/code&gt; is a thing the model can read; &lt;code&gt;[sigh]&lt;/code&gt; is a meta-instruction it can't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strip the fake tags out of your prompt before sending.&lt;/strong&gt; Otherwise they leak as text on some voices.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The audio quality jump from these four patterns is meaningful — users notice. The cost is a 30-line preprocessor and the courage to delete every &lt;code&gt;[laugh]&lt;/code&gt; your team has been sprinkling for months.&lt;/p&gt;




&lt;p&gt;This is from production work at &lt;strong&gt;&lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt;&lt;/strong&gt; — Telegram-native AI companion where voice messages are a first-class output. Canonical version: &lt;a href="https://honeychat.bot/en/blog/inworld-tts-paralinguistic-tags-alternatives/" rel="noopener noreferrer"&gt;honeychat.bot/en/blog/inworld-tts-paralinguistic-tags-alternatives&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;HoneyChat Engineering&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.inworld.ai/" rel="noopener noreferrer"&gt;Inworld TTS — documentation&lt;/a&gt; — supported request parameters (&lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;speakingRate&lt;/code&gt;), SSML subset, voice design API.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.w3.org/TR/speech-synthesis11/" rel="noopener noreferrer"&gt;W3C — Speech Synthesis Markup Language (SSML) 1.1&lt;/a&gt; — full SSML spec; &lt;code&gt;&amp;lt;break&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;speak&amp;gt;&lt;/code&gt;, prosody elements.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena" rel="noopener noreferrer"&gt;TTS Arena (Hugging Face)&lt;/a&gt; — community ELO ranking; Inworld TTS-1.5 Max top-position context.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://gtts.readthedocs.io/" rel="noopener noreferrer"&gt;gTTS — Python library&lt;/a&gt; — the free fallback we use when Inworld is unavailable.&lt;/li&gt;
&lt;li&gt;HoneyChat engineering notes: &lt;a href="https://honeychat.bot/en/blog/llm-prompt-caching-in-production/" rel="noopener noreferrer"&gt;LLM prompt caching measured&lt;/a&gt; · &lt;a href="https://honeychat.bot/en/blog/llm-content-filter-fallback-rescue-chain/" rel="noopener noreferrer"&gt;LLM refusal rescue chain&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tts</category>
      <category>llm</category>
      <category>voice</category>
    </item>
    <item>
      <title>We Deleted 10 Real Users with a Test-Cleanup Script — RCA</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Thu, 28 May 2026 10:39:49 +0000</pubDate>
      <link>https://dev.to/sm1ck/we-deleted-10-real-users-with-a-test-cleanup-script-rca-1lb1</link>
      <guid>https://dev.to/sm1ck/we-deleted-10-real-users-with-a-test-cleanup-script-rca-1lb1</guid>
      <description>&lt;h2&gt;
  
  
  The incident, in two lines
&lt;/h2&gt;

&lt;p&gt;On &lt;strong&gt;2026-05-11&lt;/strong&gt;, a test-cleanup script on &lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt; (Telegram-native AI companion, ~3 months in production, ~300 DAU, PostgreSQL 16 + Redis) ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;91111200&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;91111100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About &lt;strong&gt;ten real OAuth users&lt;/strong&gt; had IDs in that narrow window. They were now gone. Their &lt;code&gt;users&lt;/code&gt; row, their &lt;code&gt;subscriptions&lt;/code&gt; row, their &lt;code&gt;chat_sessions&lt;/code&gt; / &lt;code&gt;web_messages&lt;/code&gt; — all gone from Postgres, and recovery from backup was effectively impossible (more on that below).&lt;/p&gt;

&lt;p&gt;This is the postmortem and the contract we now run instead. The honest version: &lt;strong&gt;the destructive script went to prod on a schema I never verified end-to-end&lt;/strong&gt;. Three separate design mistakes lined up to make it possible, and &lt;em&gt;not one&lt;/em&gt; of them was caught before the script ran that night.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the same negative IDs ended up shared between test and real users
&lt;/h2&gt;

&lt;p&gt;Two signup paths feed the &lt;code&gt;users&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Population&lt;/th&gt;
&lt;th&gt;ID source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Telegram users (most of base)&lt;/td&gt;
&lt;td&gt;Positive integers — Telegram's own user IDs come in on the message envelope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OAuth users (Google / Discord, web sign-in)&lt;/td&gt;
&lt;td&gt;Negative integers from a Postgres sequence &lt;code&gt;web_user_id_seq&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OAuth IDs were negative on purpose — to keep them out of the positive Telegram-ID space and avoid collisions when a Telegram user later signed in via web. The minter in &lt;code&gt;api/web_auth.py&lt;/code&gt; looked roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_allocate_negative_user_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# retry on rare UniqueViolation
&lt;/span&gt;        &lt;span class="n"&gt;new_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT nextval(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;web_user_id_seq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO users (id, ...) VALUES (%s, ...)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_id&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;UniqueViolation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# someone else took it; bump the sequence past current MIN(id) and retry
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT setval(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;web_user_id_seq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, GREATEST(-MIN(id), currval(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;web_user_id_seq&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)))&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; FROM users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;could not allocate user id after 5 retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;setval(GREATEST(-MIN(id), current))&lt;/code&gt; step is the load-bearing piece you have to keep in mind. It says: &lt;em&gt;whatever the most-negative &lt;code&gt;users.id&lt;/code&gt; is right now, my sequence should be at least that far advanced, so I never collide with it again&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For QA I was creating test users by hand with &lt;strong&gt;hardcoded negative IDs&lt;/strong&gt; like &lt;code&gt;-91111101&lt;/code&gt;, &lt;code&gt;-91111102&lt;/code&gt;, … via &lt;code&gt;INSERT ... ON CONFLICT (id) DO UPDATE&lt;/code&gt;. Easy to remember, easy to clean up later by range.&lt;/p&gt;

&lt;p&gt;That choice triggered three independent failure modes, each on its own benign, lethal in combination:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The first hardcoded test-user insert pushed &lt;code&gt;web_user_id_seq&lt;/code&gt; to 91 111 101.&lt;/strong&gt; Because of the &lt;code&gt;setval(GREATEST(...))&lt;/code&gt; line above, the very next OAuth signup retry saw the new test row with &lt;code&gt;id = -91111101&lt;/code&gt;, computed &lt;code&gt;-MIN(id) = 91111101&lt;/code&gt;, and advanced its own sequence. From that moment on, all real OAuth signups were drawing IDs in the neighbourhood of &lt;code&gt;-91111111&lt;/code&gt;, &lt;code&gt;-91111112&lt;/code&gt;, … — right inside the window where my test users lived.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My test-user inserts used &lt;code&gt;INSERT ... ON CONFLICT (id) DO UPDATE&lt;/code&gt;.&lt;/strong&gt; When a real OAuth signup happened to land on the same ID I'd hardcoded, my script silently overwrote that user's &lt;code&gt;plan&lt;/code&gt;, &lt;code&gt;auth_source&lt;/code&gt; and several other fields instead of erroring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cleanup script then ran &lt;code&gt;DELETE … WHERE id BETWEEN -91111200 AND -91111100&lt;/code&gt;&lt;/strong&gt; to remove the test users. Any OAuth ID that had drifted into that window (101 IDs, since &lt;code&gt;BETWEEN&lt;/code&gt; is inclusive) belonged to a real user, and those rows went too.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these three behaviors is exotic. The &lt;code&gt;setval(GREATEST(...))&lt;/code&gt; retry pattern is a normal way to handle UniqueViolation on a seeded sequence. &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; is a normal Postgres upsert. Range-DELETE is a normal cleanup pattern. &lt;strong&gt;Each was safe on its own; the &lt;em&gt;interaction&lt;/em&gt; of all three was lethal — and I never set up a staging run that would have surfaced the interaction before it touched prod.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 30-second sanity check after the first insert ("did adding &lt;code&gt;id = -91111101&lt;/code&gt; move &lt;code&gt;web_user_id_seq&lt;/code&gt;? what does the next OAuth signup land on?") would have shown the cascading effect immediately. Nobody ran it, and by nobody I mean me. The cleanup script ran nightly for weeks and looked healthy, because real OAuth signup volume hadn't yet pushed a real ID into the deletion window.&lt;/p&gt;
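&lt;p&gt;The interaction is easier to see in a toy model than in prose. The sketch below is plain Python, not Postgres: &lt;code&gt;ToyUsersTable&lt;/code&gt; is a hypothetical in-memory stand-in for the &lt;code&gt;users&lt;/code&gt; table and &lt;code&gt;web_user_id_seq&lt;/code&gt;, the starting sequence value of 40 is made up, and the collision on &lt;code&gt;-41&lt;/code&gt; is contrived purely to force the retry path. Only the retry logic and the range delete mirror the code and query shown above.&lt;/p&gt;

```python
class ToyUsersTable:
    """In-memory stand-in for the users table plus web_user_id_seq."""

    def __init__(self, seq_last_value: int):
        self.rows = {}                 # id: auth_source
        self.seq = seq_last_value      # last value the sequence handed out

    def nextval(self) -> int:
        self.seq += 1
        return self.seq

    def oauth_signup(self) -> int:
        for _ in range(5):
            new_id = -self.nextval()
            if new_id not in self.rows:
                self.rows[new_id] = "oauth"
                return new_id
            # UniqueViolation path: setval(GREATEST(-MIN(id), currval))
            self.seq = max(-min(self.rows), self.seq)
        raise RuntimeError("could not allocate user id after 5 retries")

    def range_delete(self, low: int, high: int) -> list:
        """Model of DELETE ... WHERE id BETWEEN low AND high (inclusive)."""
        doomed = sorted(i for i in self.rows if i in range(low, high + 1))
        for i in doomed:
            del self.rows[i]
        return doomed


table = ToyUsersTable(seq_last_value=40)
table.rows[-41] = "oauth"          # contrived: a row the sequence has not passed yet
table.rows[-91111101] = "test"     # hand-seeded test user

real_id = table.oauth_signup()     # collides on -41, runs the setval step, retries
deleted = table.range_delete(-91111200, -91111100)
print(real_id, deleted)            # -91111102 [-91111102, -91111101]
```

&lt;p&gt;One retry is enough: the sequence jumps from 41 to 91111101, the real signup lands on &lt;code&gt;-91111102&lt;/code&gt;, and the range delete removes it together with the test row.&lt;/p&gt;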

&lt;h2&gt;
  
  
  What got deleted, what we couldn't recover
&lt;/h2&gt;

&lt;p&gt;Recovery from Postgres backup was effectively impossible. The chain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most recent &lt;code&gt;pg_dump&lt;/code&gt; to Storj was about 22 hours old — &lt;em&gt;and pre-dated my test-user inserts&lt;/em&gt;. The dump didn't contain even the affected rows in their pre-overwrite state, because the &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; had already mutated their &lt;code&gt;plan&lt;/code&gt; and &lt;code&gt;auth_source&lt;/code&gt; columns earlier the same day.&lt;/li&gt;
&lt;li&gt;WAL archiving was on the "after the next sprint" list and wasn't enabled, so there was no point-in-time recovery between the daily dumps.&lt;/li&gt;
&lt;li&gt;Autovacuum had run between the DELETE and our discovery of the incident, so dead tuples on the relevant &lt;code&gt;users&lt;/code&gt; pages were gone too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we &lt;em&gt;could&lt;/em&gt; salvage came from side channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recent chat turns&lt;/strong&gt; — Redis with a &lt;strong&gt;90-day TTL&lt;/strong&gt; held the most-recent ~20 turns per affected session. We &lt;code&gt;PERSIST&lt;/code&gt;-ed what looked important and reconstructed recent conversations for affected users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan / subscription state&lt;/strong&gt; — rebuilt from each payment provider's webhook log. Our payments run over three providers (&lt;strong&gt;Telegram Stars&lt;/strong&gt; as global primary, &lt;strong&gt;card payments&lt;/strong&gt; through a regional web checkout, and &lt;strong&gt;CryptoBot&lt;/strong&gt; for TON on the non-RU surface), all of which keep their own server-side record of who paid what.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;chat_sessions&lt;/code&gt; and &lt;code&gt;web_messages&lt;/code&gt;&lt;/strong&gt; rows — lost. These are the canonical web-app message store and they only existed in Postgres. The 90-day Redis TTL covers the bot side, not the web-side conversation tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Net: people kept their accounts and most of their &lt;em&gt;recent&lt;/em&gt; conversations, but lost web-side scene context older than the Redis window. We comped the affected users. The cost of the incident wasn't the rows — it was the trust dent and the day-and-a-half of recovery work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root causes (plural — they always are)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The schema interaction (sequence retry + &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; + range-DELETE) was never verified end-to-end before any of it touched production.&lt;/strong&gt; Each piece was a fine pattern in isolation. The interaction was lethal. A single &lt;code&gt;INSERT&lt;/code&gt; of &lt;code&gt;id = -91111101&lt;/code&gt; in staging followed by &lt;em&gt;one&lt;/em&gt; OAuth signup, then a &lt;code&gt;SELECT id FROM users ORDER BY id LIMIT 5&lt;/code&gt;, would have shown the sequence had jumped to the test neighbourhood. Nobody ran it. &lt;em&gt;This is the primary cause and the one I lost the most sleep over.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test data was distinguished from real data by &lt;em&gt;ID range&lt;/em&gt;, not by an attribute.&lt;/strong&gt; A range is something a &lt;code&gt;BETWEEN&lt;/code&gt; query can sweep, and real IDs can drift into it. An attribute filter such as &lt;code&gt;WHERE auth_source = 'test'&lt;/code&gt; only matches rows that were explicitly tagged as test data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test-user seeding used &lt;code&gt;INSERT ON CONFLICT (id) DO UPDATE&lt;/code&gt;.&lt;/strong&gt; This silently overwrote real OAuth users when their IDs collided, instead of raising. Pure &lt;code&gt;INSERT&lt;/code&gt; would have failed loudly and surfaced the collision &lt;em&gt;days&lt;/em&gt; before the DELETE.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The cleanup script had no dry-run, no safety check, no assertion of expected row count.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups were daily, not continuous, and the most recent one pre-dated the corrupting writes.&lt;/strong&gt; WAL archiving was on the "soon" list and hadn't shipped.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Getting any one of these five right would have saved us; we had all five wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contract we now run
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Test users have an attribute, not a range
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;auth_source&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'oauth'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- backfill: 'telegram' for positive Telegram IDs, 'oauth' for legacy negative,&lt;/span&gt;
&lt;span class="c1"&gt;-- 'test' for known test rows that we then deleted via the new path.&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;users_auth_source_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_source&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/test_user_factory.py
&lt;/span&gt;&lt;span class="n"&gt;TEST_ID_RANGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000_001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1_999_999_999&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# high *positive* — out of all real paths
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_test_user&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_next_test_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO users (id, auth_source, ...) VALUES (%s, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, ...)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/test_user_cleanup.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cleanup_test_users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dry_run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT id FROM users WHERE auth_source = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dry_run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Would delete &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; test users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM users WHERE auth_source = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script defaults to &lt;code&gt;dry_run=True&lt;/code&gt;. The CLI flag to actually run it is explicit and shows the count first.&lt;/p&gt;

&lt;p&gt;We've also banned, in our engineering doc and in code review: any &lt;code&gt;DELETE … WHERE id BETWEEN …&lt;/code&gt; on the &lt;code&gt;users&lt;/code&gt; table, for any reason; any &lt;code&gt;INSERT … ON CONFLICT (id) DO UPDATE&lt;/code&gt; on &lt;code&gt;users.id&lt;/code&gt;.&lt;/p&gt;
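&lt;p&gt;The missing row-count assertion can be sketched as follows. This is an illustration under stated assumptions, not our production script: it uses the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module so it runs anywhere, the &lt;code&gt;max_expected&lt;/code&gt; parameter is a hypothetical name, and the two-column table is a cut-down stand-in for the real &lt;code&gt;users&lt;/code&gt; schema.&lt;/p&gt;

```python
import sqlite3


def cleanup_test_users(conn, max_expected: int, dry_run: bool = True) -> int:
    """Delete rows tagged auth_source='test', refusing if the count looks wrong."""
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM users WHERE auth_source = 'test'")]
    # Stop before touching anything if far more rows match than expected.
    if len(ids) > max_expected:
        raise RuntimeError(
            f"refusing to delete {len(ids)} rows; expected at most {max_expected}")
    if dry_run:
        print(f"Would delete {len(ids)} test users")
        return len(ids)
    # Delete exactly the ids that were counted, inside one transaction.
    with conn:
        conn.executemany(
            "DELETE FROM users WHERE id = ? AND auth_source = 'test'",
            [(i,) for i in ids])
    return len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, auth_source TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1_000_000_001, "test"), (1_000_000_002, "test"), (-17, "oauth")])
conn.commit()

cleanup_test_users(conn, max_expected=5)                  # dry run, prints the count
cleanup_test_users(conn, max_expected=5, dry_run=False)   # real delete
print(conn.execute("SELECT id, auth_source FROM users").fetchall())
```

&lt;p&gt;Deleting by the ids that were just counted, rather than re-running the filter, means the number reported is the number removed even if new test rows appear between the two statements.&lt;/p&gt;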

&lt;h3&gt;
  
  
  2. Backup cadence with explicit RPO
&lt;/h3&gt;

&lt;p&gt;We rebuilt the backup story around explicit recovery point objectives. Off-site is &lt;strong&gt;Storj&lt;/strong&gt; (~7 GB total, ~$0.03/month — cost is not the constraint).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backup tier&lt;/th&gt;
&lt;th&gt;Cadence&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;RPO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Postgres &lt;code&gt;pg_dump&lt;/code&gt; (logical)&lt;/td&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;Local disk&lt;/td&gt;
&lt;td&gt;≤ 1 h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres &lt;code&gt;pg_dump&lt;/code&gt; (logical)&lt;/td&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;Storj S3&lt;/td&gt;
&lt;td&gt;≤ 24 h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-site cold copy&lt;/td&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;Storj S3&lt;/td&gt;
&lt;td&gt;≤ 7 d&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis snapshot (RDB)&lt;/td&gt;
&lt;td&gt;Every 6 h&lt;/td&gt;
&lt;td&gt;Local + Storj&lt;/td&gt;
&lt;td&gt;≤ 6 h&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;WAL archiving to S3-compatible storage is still pending — that's the next item. With it, RPO drops to seconds. Without it, hourly logical dumps are the floor.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Recovery rehearsal, not just backups
&lt;/h3&gt;

&lt;p&gt;A backup you've never restored from is a hope, not a backup. Once a month we restore one of the previous day's hourly dumps into a scratch container. The first time we tried, the restore script had bit-rotted and failed to run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify the partition scheme end-to-end before any destructive script touches prod.&lt;/strong&gt; "Run the query without &lt;code&gt;DELETE&lt;/code&gt;, in staging, against real data, and read the results" is thirty seconds of work. It is also the cheapest check that would have caught this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Range-based partitioning of test vs real data is an accident waiting to happen.&lt;/strong&gt; Use attributes. Filter on them. Index them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default cleanup scripts to dry-run.&lt;/strong&gt; Make the destructive flag explicit and noisy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assert expected counts.&lt;/strong&gt; If the cleanup script suddenly finds 10× the usual rows, &lt;em&gt;that&lt;/em&gt; is the signal to stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick an RPO, then pick a backup cadence that meets it.&lt;/strong&gt; Not the other way around.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restore from your backups on a schedule.&lt;/strong&gt; Untested backups silently rot.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We've run the new contract for two weeks now. No range-DELETE incidents. The new &lt;code&gt;auth_source = 'test'&lt;/code&gt; filter is boring and explicit and impossible to fat-finger. Boring is the goal.&lt;/p&gt;




&lt;p&gt;This postmortem is from production work at &lt;strong&gt;&lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt;&lt;/strong&gt; — a Telegram-native AI companion. Canonical version: &lt;a href="https://honeychat.bot/en/blog/range-delete-test-user-incident-rca/" rel="noopener noreferrer"&gt;honeychat.bot/en/blog/range-delete-test-user-incident-rca&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;HoneyChat Engineering&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.org/docs/current/continuous-archiving.html" rel="noopener noreferrer"&gt;PostgreSQL — Continuous archiving (WAL)&lt;/a&gt; — the right way to get sub-second RPO.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.postgresql.org/docs/current/app-pgdump.html" rel="noopener noreferrer"&gt;PostgreSQL — &lt;code&gt;pg_dump&lt;/code&gt; documentation&lt;/a&gt; — what hourly logical dumps actually give you.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sre.google/sre-book/postmortem-culture/" rel="noopener noreferrer"&gt;Google SRE Book — Postmortem culture&lt;/a&gt; — blameless postmortems, why root-cause-singular framing is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://core.telegram.org/bots/payments" rel="noopener noreferrer"&gt;Telegram Bot Payments API&lt;/a&gt; — Telegram Stars webhook semantics.&lt;/li&gt;
&lt;li&gt;HoneyChat engineering notes: &lt;a href="https://honeychat.bot/en/blog/chromadb-lru-memory-leak-production/" rel="noopener noreferrer"&gt;ChromaDB 0.5 leak fix&lt;/a&gt; · &lt;a href="https://honeychat.bot/en/blog/oauth-state-validation-client-side/" rel="noopener noreferrer"&gt;OAuth state on the client&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>ChromaDB 0.5 Silently Leaks Memory Until You Set One Env Var</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Thu, 28 May 2026 10:12:32 +0000</pubDate>
      <link>https://dev.to/sm1ck/chromadb-05-silently-leaks-memory-until-you-set-one-env-var-1ghh</link>
      <guid>https://dev.to/sm1ck/chromadb-05-silently-leaks-memory-until-you-set-one-env-var-1ghh</guid>
      <description>&lt;h2&gt;
  
  
  The TL;DR
&lt;/h2&gt;

&lt;p&gt;If you run ChromaDB 0.5.x with more than a few hundred collections, set these two env vars before anything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CHROMA_SEGMENT_CACHE_POLICY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LRU
&lt;span class="nv"&gt;CHROMA_MEMORY_LIMIT_BYTES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10737418240   &lt;span class="c"&gt;# 10 GiB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without them, ChromaDB 0.5.x has an unresolved memory leak in the segment cache. Upstream issues &lt;a href="https://github.com/chroma-core/chroma/issues/3336" rel="noopener noreferrer"&gt;#3336&lt;/a&gt; and &lt;a href="https://github.com/chroma-core/chroma/issues/5843" rel="noopener noreferrer"&gt;#5843&lt;/a&gt; are still open. We discovered this the slow way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HoneyChat at a glance (for context):&lt;/strong&gt; Telegram-native AI companion bot, ~300 DAU, 17 languages. Stack: &lt;code&gt;aiogram&lt;/code&gt; bot + FastAPI (uvicorn, 4 workers) + Celery workers (queues &lt;code&gt;llm&lt;/code&gt; / &lt;code&gt;images&lt;/code&gt; / &lt;code&gt;gifs&lt;/code&gt; / &lt;code&gt;voice&lt;/code&gt;) + Celery beat (RedBeat) + Next.js 15 web + Astro blog + React/Vite Mini App. Storage: PostgreSQL 16 + Redis + &lt;strong&gt;ChromaDB 0.5.x&lt;/strong&gt; + Storj S3. Host: 32 GB / 16-core Xeon, single box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of the leak
&lt;/h2&gt;

&lt;p&gt;We run &lt;strong&gt;2,233 ChromaDB collections&lt;/strong&gt; in production — one per &lt;code&gt;(character_id, session_id)&lt;/code&gt; pair, so each conversation gets isolated semantic memory and scene context never bleeds between sessions. Mean collection size: &lt;strong&gt;4.9 documents&lt;/strong&gt; (small per-collection, large in aggregate).&lt;/p&gt;

&lt;p&gt;On 0.4 this ran fine for months. We upgraded to 0.5 for some new features, and within a week the &lt;code&gt;chromadb&lt;/code&gt; container was getting OOM-killed nightly. The pattern was unmistakable: every time a fresh collection got queried, RSS bumped a few MiB and never came back down. With ~10K collection touches a day across that fleet of 2,233, the container budget filled in about three days. Restart, repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we tried first (and what didn't work)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Restarting the container.&lt;/strong&gt; Buys a day, doesn't fix the cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrading ChromaDB.&lt;/strong&gt; The underlying behavior hasn't changed in the 0.5.x line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increasing the container memory limit.&lt;/strong&gt; Just delays the OOM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sharding collections further.&lt;/strong&gt; We already split per &lt;code&gt;(character, session)&lt;/code&gt;. Narrower sharding means more collections, so it would have &lt;em&gt;worsened&lt;/em&gt; the cache growth, not helped it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blaming the embedding model.&lt;/strong&gt; Profiling pointed elsewhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Profiling pointed at the segment cache: ChromaDB caches per-collection segment metadata, and on 0.5 the cache is &lt;strong&gt;unbounded by default&lt;/strong&gt;. The "fix" of "let's just give it more RAM" never converges if the cache only grows.&lt;/p&gt;
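&lt;p&gt;To make the difference concrete, here is a toy byte-budgeted LRU cache. It is a minimal sketch of the general technique, not ChromaDB's implementation: the class name, the 3 MiB per-collection size, and the 10 MiB budget are all made up for illustration.&lt;/p&gt;

```python
from collections import OrderedDict


class BoundedSegmentCache:
    """Toy byte-budgeted LRU cache (illustrative only, not ChromaDB's code)."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # collection name -> size in bytes

    def touch(self, name, size_bytes):
        """Record a query against a collection, loading it if needed."""
        if name in self._entries:
            self._entries.move_to_end(name)  # mark as most recently used
            return
        self._entries[name] = size_bytes
        self.used_bytes += size_bytes
        # Evict least-recently-used entries until we are back under budget.
        while self.used_bytes > self.limit_bytes and len(self._entries) > 1:
            _, evicted_size = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size


MIB = 1024 * 1024
cache = BoundedSegmentCache(limit_bytes=10 * MIB)
for i in range(100):                      # 100 distinct collections, 3 MiB each
    cache.touch(f"collection-{i}", 3 * MIB)

unbounded_bytes = 100 * 3 * MIB           # what a never-evicting cache would hold
print(cache.used_bytes // MIB, "MiB held vs", unbounded_bytes // MIB, "MiB unbounded")
```

&lt;p&gt;The bounded cache plateaus just under its budget no matter how many collections get touched, while the never-evicting total scales with the number of distinct collections. That second curve is the one more RAM cannot fix.&lt;/p&gt;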

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;The env vars above tell ChromaDB to use an LRU eviction policy on the segment cache, capped at a memory limit you set. Once we set them and bounced the container, RSS stabilised in a 6-8 GiB band and has stayed there for months.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;chromadb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chromadb/chroma:0.5.18&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;CHROMA_SEGMENT_CACHE_POLICY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LRU"&lt;/span&gt;
      &lt;span class="na"&gt;CHROMA_MEMORY_LIMIT_BYTES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10737418240"&lt;/span&gt;   &lt;span class="c1"&gt;# 10 GiB&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12G&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CHROMA_SEGMENT_CACHE_POLICY=LRU&lt;/code&gt; switches the cache from unbounded to least-recently-used eviction. &lt;code&gt;CHROMA_MEMORY_LIMIT_BYTES&lt;/code&gt; is the budget LRU operates against: 10 GiB out of 32 GB host RAM, leaving room for Postgres, Redis, FastAPI, four Celery workers, nginx, ChromaDB's own non-cache overhead, and the OS.&lt;/p&gt;

&lt;p&gt;Pick a &lt;code&gt;CHROMA_MEMORY_LIMIT_BYTES&lt;/code&gt; that's well under your container's hard limit — the policy needs headroom to actually evict before the kernel kills you.&lt;/p&gt;
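&lt;p&gt;The byte value is easy to get wrong by a factor of 1,000 vs 1,024, so it can help to compute it rather than type it. A small sketch, assuming the Compose &lt;code&gt;12G&lt;/code&gt; limit is interpreted as 12 GiB; the helper names are ours, not part of any ChromaDB or Docker API.&lt;/p&gt;

```python
GIB = 1024 ** 3


def gib_to_bytes(gib):
    """Convert whole GiB to the byte count the env var expects."""
    return gib * GIB


def headroom_ratio(cache_limit_bytes, container_limit_bytes):
    """Fraction of the container's hard limit left outside the cache budget."""
    return 1 - cache_limit_bytes / container_limit_bytes


cache_limit = gib_to_bytes(10)       # value used for CHROMA_MEMORY_LIMIT_BYTES
container_limit = gib_to_bytes(12)   # assumption: Compose "12G" means 12 GiB
print(f"CHROMA_MEMORY_LIMIT_BYTES={cache_limit}")
print(f"headroom: {headroom_ratio(cache_limit, container_limit):.0%}")
```

&lt;p&gt;With our numbers the cache budget leaves roughly a sixth of the container limit as headroom. If your ratio comes out near zero, lower the cache budget or raise the container limit.&lt;/p&gt;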

&lt;h2&gt;
  
  
  The catch (don't forget this one)
&lt;/h2&gt;

&lt;p&gt;These env vars are &lt;strong&gt;only applied at container creation&lt;/strong&gt;. &lt;code&gt;docker compose restart chromadb&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; enough — you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--force-recreate&lt;/span&gt; &lt;span class="nt"&gt;--no-deps&lt;/span&gt; chromadb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We learned this the second time we changed limits while debugging, watching RSS climb again and wondering why the fix had stopped working. It hadn't; the new env never got picked up. If you change the limits, always recreate, not restart.&lt;/p&gt;
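&lt;p&gt;A quick way to avoid repeating our mistake is to check the env the running container actually has, not the env in the Compose file. The sketch below reads it with &lt;code&gt;docker inspect&lt;/code&gt;; the function names are ours and the container name you pass in is whatever Compose called yours.&lt;/p&gt;

```python
import json
import subprocess

REQUIRED = ("CHROMA_SEGMENT_CACHE_POLICY", "CHROMA_MEMORY_LIMIT_BYTES")


def parse_env(env_list):
    """Turn Docker's ["KEY=value", ...] list into a dict."""
    return dict(item.split("=", 1) for item in env_list if "=" in item)


def missing_vars(env_list, required=REQUIRED):
    """Return the required variable names absent from the container env."""
    env = parse_env(env_list)
    return [name for name in required if name not in env]


def container_env(container):
    """Read the env a container was created with via `docker inspect`."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{json .Config.Env}}", container],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


# Sample of what a stale container might report (made-up values).
stale = ["PATH=/usr/bin", "CHROMA_SEGMENT_CACHE_POLICY=LRU"]
print(missing_vars(stale))
```

&lt;p&gt;Against a live box you would call &lt;code&gt;missing_vars(container_env("your-container-name"))&lt;/code&gt;. An empty list means both variables made it into the container; anything else means you restarted when you should have recreated.&lt;/p&gt;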

&lt;h2&gt;
  
  
  Why this isn't on the docs landing page
&lt;/h2&gt;

&lt;p&gt;Most ChromaDB benchmarks and getting-started guides assume one big collection, which is the documented happy path. If you partition per user or per session (multi-tenant SaaS, per-conversation memory, per-document RAG silos), you hit cache-and-eviction behaviour the docs don't warn about. The issues are real and open in the repo; the docs just haven't caught up.&lt;/p&gt;

&lt;p&gt;This isn't a knock on the team — 0.5 was a big jump and they're shipping fast. It's just a heads-up that if your workload is "many small collections," your config has to be different from the tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"It's a leak" is usually "it's a cache without an eviction policy."&lt;/strong&gt; Read your dependency's cache config before chasing valgrind ghosts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Many-small-collections is not the documented happy path.&lt;/strong&gt; Per-user/per-session partitioning needs a config nobody's tutorial mentions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check open issues before assuming your config is wrong.&lt;/strong&gt; &lt;a href="https://github.com/chroma-core/chroma/issues/3336" rel="noopener noreferrer"&gt;#3336&lt;/a&gt; and &lt;a href="https://github.com/chroma-core/chroma/issues/5843" rel="noopener noreferrer"&gt;#5843&lt;/a&gt; are community-known, not docs-known.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set both env vars together.&lt;/strong&gt; Without &lt;code&gt;CHROMA_MEMORY_LIMIT_BYTES&lt;/code&gt;, the LRU policy has nothing to evict against and effectively no-ops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recreate, don't restart, when changing startup env.&lt;/strong&gt; Standard Docker gotcha, doubly painful when you're debugging memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're on Chroma 0.5+ with many collections and seeing slow RSS creep — that's almost certainly it. Three lines of YAML, one container recreate, done.&lt;/p&gt;




&lt;p&gt;This write-up is from production work at &lt;strong&gt;&lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt;&lt;/strong&gt; — a Telegram-native AI companion bot where each &lt;code&gt;(character, session)&lt;/code&gt; pair gets its own ChromaDB collection for isolated semantic memory. The canonical version (with our other engineering notes) lives at &lt;a href="https://honeychat.bot/en/blog/chromadb-lru-memory-leak-production/" rel="noopener noreferrer"&gt;honeychat.bot/en/blog/chromadb-lru-memory-leak-production&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;HoneyChat Engineering&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB docs&lt;/a&gt; — segment cache, deployment, configuration reference.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chroma-core/chroma/issues/3336" rel="noopener noreferrer"&gt;ChromaDB issue #3336&lt;/a&gt; — memory leak in segment cache, open.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chroma-core/chroma/issues/5843" rel="noopener noreferrer"&gt;ChromaDB issue #5843&lt;/a&gt; — many-collections behaviour, open.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.docker.com/reference/cli/docker/compose/up/" rel="noopener noreferrer"&gt;Docker Compose: env vs --force-recreate&lt;/a&gt; — why &lt;code&gt;restart&lt;/code&gt; doesn't pick up new env.&lt;/li&gt;
&lt;li&gt;HoneyChat engineering notes: &lt;a href="https://honeychat.bot/en/blog/persistent-memory-ai-companion-architecture/" rel="noopener noreferrer"&gt;persistent-memory architecture&lt;/a&gt; · &lt;a href="https://honeychat.bot/en/blog/llm-prompt-caching-in-production/" rel="noopener noreferrer"&gt;prompt caching measured&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>chromadb</category>
    </item>
    <item>
      <title>We Measured LLM Prompt Caching in Production — Same Prompt, 0% to 91% Hit Rates</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Thu, 28 May 2026 08:21:47 +0000</pubDate>
      <link>https://dev.to/sm1ck/we-measured-llm-prompt-caching-in-production-same-prompt-0-to-91-hit-rates-oio</link>
      <guid>https://dev.to/sm1ck/we-measured-llm-prompt-caching-in-production-same-prompt-0-to-91-hit-rates-oio</guid>
      <description>&lt;p&gt;We run an AI companion bot. Every chat turn, the model sees the same ~5K-token prefix — character persona, content-tier rules, formatting guardrails, a memory blob — plus one new user line. Without caching, we pay for those 5K input tokens &lt;em&gt;every single turn&lt;/em&gt;. So we turned on prompt caching across the providers we route through, measured it, and the spread was bigger than any of the marketing pages prepared us for.&lt;/p&gt;

&lt;p&gt;Here's the table that survived four weeks in production, plus the one gotcha that ate two weeks before we figured it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hit-rate table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / model&lt;/th&gt;
&lt;th&gt;Hit rate&lt;/th&gt;
&lt;th&gt;Latency Δ&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cydonia (via OpenRouter)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−43 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Just works, no marker needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Flash Lite&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−49 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Requires &lt;code&gt;cache_control&lt;/code&gt; marker&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grok (xAI)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−40 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Sticky" — best on active sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same code, 600-token test prompt&lt;/td&gt;
&lt;td&gt;0 %&lt;/td&gt;
&lt;td&gt;0 %&lt;/td&gt;
&lt;td&gt;Methodology bug — see below&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same exact 5K-token system prefix across the first three rows. Same 10 follow-up turns. Wildly different cache behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  The marker that "didn't matter" (until it did)
&lt;/h2&gt;

&lt;p&gt;Most OpenAI-compat examples skip any cache hint and assume the provider figures it out from prefix repetition. Some do. Anthropic-style routes — and anything going through OpenRouter that supports &lt;code&gt;cache_control&lt;/code&gt; — don't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# the long, stable prefix
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_msg&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;      &lt;span class="c1"&gt;# the only volatile part
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cydonia caches without it. Grok caches without it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3.1 Flash Lite caches at exactly 0 % without it.&lt;/strong&gt; The same model jumps to 75 % with one extra field on the last cacheable content block.&lt;/p&gt;

&lt;p&gt;We had Gemini 3.1 routed in production for a week showing zero cache reads in usage. We concluded the model "just didn't support caching." It does; we were calling the API the way every other model wanted to be called. Cost of including the marker on providers that ignore it: zero. Cost of skipping it on a provider that needs it: the entire caching discount on that route.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why our first "no, it doesn't cache" test was wrong
&lt;/h2&gt;

&lt;p&gt;Before we caught the marker thing, we'd already wrongly concluded a couple of models "don't cache" — because we'd tested with the wrong prompt.&lt;/p&gt;

&lt;p&gt;The first probe was a ~600-token prompt repeated 10 times. Cache reads: zero, across every provider. Conclusion: none of these providers cache.&lt;/p&gt;

&lt;p&gt;Conclusion: wrong. Most providers have a minimum prefix length before caching kicks in (≥ 1K tokens for some routes, closer to 4K for others). Below that floor, you pay full price even though the prompt repeats verbatim. The cache simply doesn't engage.&lt;/p&gt;

&lt;p&gt;The corrected probe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefix &lt;strong&gt;≥ 5K tokens&lt;/strong&gt;, shaped like real production (system prompt + persona + retrieved memory).&lt;/li&gt;
&lt;li&gt;10 identical follow-up turns, fresh request each time.&lt;/li&gt;
&lt;li&gt;For Anthropic-style providers, include the &lt;code&gt;cache_control&lt;/code&gt; marker on the last cacheable content block.&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;usage.cache_creation_input_tokens&lt;/code&gt; and &lt;code&gt;usage.cache_read_input_tokens&lt;/code&gt; (or the provider's equivalent) back — don't trust round-trip latency alone.&lt;/li&gt;
&lt;/ul&gt;
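&lt;p&gt;The last bullet is the one that matters most, so here is a minimal sketch of turning those usage fields into a hit rate. It assumes Anthropic-style accounting, where &lt;code&gt;input_tokens&lt;/code&gt; excludes cached tokens; the function name and the sample numbers are made up, and other providers may report usage differently.&lt;/p&gt;

```python
def cache_hit_rate(usages):
    """Share of prompt tokens served from cache across a list of usage dicts.

    Assumes Anthropic-style accounting, where input_tokens excludes both
    cache writes and cache reads; other providers may report differently.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    fresh = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + fresh
    return read / total if total else 0.0


# Made-up probe: turn 1 writes the 5K prefix, turns 2-10 read it back.
probe = [{"input_tokens": 20, "cache_creation_input_tokens": 5000}]
probe += [{"input_tokens": 20, "cache_read_input_tokens": 5000}] * 9
print(f"hit rate: {cache_hit_rate(probe):.1%}")
```

&lt;p&gt;Note that even a perfect run never reads 100 %, because the first turn has to write the cache. A result of exactly zero across all ten turns is the signal that the cache never engaged.&lt;/p&gt;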

&lt;p&gt;Once we did that, every "broken" provider started reporting cache reads.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "sticky" caching looks like (Grok)
&lt;/h2&gt;

&lt;p&gt;Grok was the weird one. Hit rate 51 % — lower than Cydonia and Gemini — but the cache &lt;em&gt;survived longer&lt;/em&gt; between calls. Other providers behaved like a ~5-minute ephemeral cache; Grok looked more like a hot-window-then-slow-decay curve. Practical consequence: Grok did &lt;em&gt;better&lt;/em&gt; than its hit rate suggested when the same user kept chatting actively, and &lt;em&gt;worse&lt;/em&gt; when they came back hours later.&lt;/p&gt;

&lt;p&gt;Lesson — a single hit-rate number per provider lies a little. The shape (how it decays, how it warms) matters as much as the headline percentage when your traffic is bursty.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually saved
&lt;/h2&gt;

&lt;p&gt;We route turns through different model tiers depending on the user's plan. After caching landed and the marker was wired in everywhere it was needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cached input tokens are billed at roughly 10 % of normal price (provider-dependent, sometimes lower).&lt;/li&gt;
&lt;li&gt;Cost per turn on the heavy-tier routes dropped about 40–45 %, broadly in line with the hit rates above.&lt;/li&gt;
&lt;li&gt;End-to-end latency dropped 40–49 %, which users &lt;em&gt;actually notice&lt;/em&gt; — the typing-dots animation snapping back faster feels like a different product.&lt;/li&gt;
&lt;/ul&gt;
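&lt;p&gt;For a back-of-envelope check on the cost side, the saving on input tokens is roughly hit rate times the cached-token discount. The sketch below is a simplification under stated assumptions: it uses the ~10 % cached price from the list above, treats the whole input as cacheable prefix by default, and ignores output tokens and any cache-write surcharge, so real per-turn savings land lower. The function and parameter names are ours.&lt;/p&gt;

```python
def input_cost_saving(hit_rate, cached_price_ratio=0.10, prefix_share=1.0):
    """Estimated fractional saving on input-token cost.

    hit_rate: fraction of prefix tokens read from cache
    cached_price_ratio: cached-token price relative to normal price
    prefix_share: fraction of input tokens that belong to the cacheable prefix
    """
    return prefix_share * hit_rate * (1 - cached_price_ratio)


# Hit rates taken from the table earlier in the post.
for name, rate in [("Cydonia", 0.91), ("Gemini", 0.75), ("Grok", 0.51)]:
    print(f"{name}: {input_cost_saving(rate):.0%} off input cost")
```

&lt;p&gt;This is an upper bound on the input side only. Output tokens are never cached, which is one reason the observed per-turn drop sits below what the best hit rate alone would suggest.&lt;/p&gt;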

&lt;p&gt;The pleasant surprise was that latency mattered to retention more than cost mattered to the P&amp;amp;L. Cheaper turns are nice; faster replies are felt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons we'd pin to the wall
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test with a production-shaped prompt.&lt;/strong&gt; Short toy prompts will tell you caching doesn't work on providers where it works fine. The minimum-prefix floor is real and silent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read provider-specific cache hints.&lt;/strong&gt; Anthropic-style &lt;code&gt;cache_control&lt;/code&gt; is required on some routes (Gemini 3.1 line via OpenRouter, in our case) and ignored by others. Always send it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify with usage fields, not vibes.&lt;/strong&gt; &lt;code&gt;cache_read_input_tokens&lt;/code&gt; doesn't lie. End-to-end latency does, because TTFB swings add a lot of noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One hit-rate per provider lies a little.&lt;/strong&gt; The decay curve matters more than the headline number for bursty vs. steady chat patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-probe quarterly.&lt;/strong&gt; Providers ship cache changes silently. The 75 % on Gemini 3.1 Flash Lite is a &lt;em&gt;2026&lt;/em&gt; number: the same model gave us 0 % earlier this year, before the marker was wired in.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're running an AI app where the system prompt dwarfs the user input — companion bots, RAG with chunky retrieved context, agentic loops — you almost certainly leave 40 % of your bill and half a second of latency on the table by trusting the defaults. The marker is one line. The corrected methodology is one afternoon.&lt;/p&gt;




&lt;p&gt;If you've got hit-rate numbers from a different routing setup (Bedrock, Fireworks, Together, direct Anthropic), drop them in the comments — curious how the marker situation compares outside the OpenRouter ecosystem.&lt;/p&gt;

&lt;p&gt;This write-up is from production work at &lt;strong&gt;&lt;a href="https://honeychat.bot/" rel="noopener noreferrer"&gt;HoneyChat&lt;/a&gt;&lt;/strong&gt; — a Telegram-native AI companion where the system prompt is the load-bearing wall (persona + content tier + memory blob = the whole 5K). The canonical version of this post lives at &lt;a href="https://honeychat.bot/en/blog/llm-prompt-caching-in-production/" rel="noopener noreferrer"&gt;honeychat.bot/en/blog/llm-prompt-caching-in-production&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;HoneyChat Engineering&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic — Prompt caching&lt;/a&gt; — &lt;code&gt;cache_control&lt;/code&gt; field reference, ephemeral cache, billing rates.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.openai.com/docs/guides/prompt-caching" rel="noopener noreferrer"&gt;OpenAI — Prompt caching&lt;/a&gt; — automatic caching, minimum prefix length, &lt;code&gt;cached_tokens&lt;/code&gt; in usage.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.google.dev/gemini-api/docs/caching" rel="noopener noreferrer"&gt;Google — Context caching&lt;/a&gt; — Gemini API caching, supported models.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openrouter.ai/docs/features/prompt-caching" rel="noopener noreferrer"&gt;OpenRouter — Prompt caching&lt;/a&gt; — provider-specific cache passthrough, Anthropic-style marker support.&lt;/li&gt;
&lt;li&gt;HoneyChat engineering notes: &lt;a href="https://honeychat.bot/en/blog/llm-routing-per-tier-openrouter/" rel="noopener noreferrer"&gt;LLM routing per tier on OpenRouter&lt;/a&gt; · &lt;a href="https://honeychat.bot/en/blog/persistent-memory-ai-companion-architecture/" rel="noopener noreferrer"&gt;Persistent-memory companion architecture&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>IP-Adapter + LoRA for product catalog rendering — putting shop items on AI characters</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Sat, 25 Apr 2026 02:35:59 +0000</pubDate>
      <link>https://dev.to/sm1ck/ip-adapter-lora-for-product-catalog-rendering-putting-shop-items-on-ai-characters-5h36</link>
      <guid>https://dev.to/sm1ck/ip-adapter-lora-for-product-catalog-rendering-putting-shop-items-on-ai-characters-5h36</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Runnable workflow:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/04-ipadapter" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/04-ipadapter&lt;/a&gt; — a ComfyUI &lt;code&gt;workflow.json&lt;/code&gt; (with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders for IP-Adapter weight/end_at) plus a stdlib Python client that posts it to your ComfyUI instance and saves the output.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the previous post I argued that &lt;strong&gt;LoRA per character&lt;/strong&gt; is often the strongest fit for visual identity. But what happens when you want to render that character wearing a &lt;em&gt;specific&lt;/em&gt; item — a shop product, a user-uploaded outfit, a gift from another user?&lt;/p&gt;

&lt;p&gt;LoRA helps stabilize the character. To also preserve an arbitrary reference image, IP-Adapter is a common fit. Those two techniques can compete unless you configure them carefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LoRA stabilizes the character's face. IP-Adapter pulls features from a reference image. If both are too strong late in sampling, the face can drift toward the reference.&lt;/li&gt;
&lt;li&gt;Balance: &lt;strong&gt;moderate IP-Adapter weight&lt;/strong&gt; (lower half of 0–1) with &lt;strong&gt;early handoff&lt;/strong&gt; (IP-Adapter releases control before the final denoising steps). The final steps belong to the LoRA.&lt;/li&gt;
&lt;li&gt;A useful node order: &lt;code&gt;Checkpoint → LoRA → FreeU → IP-Adapter → KSampler&lt;/code&gt;. Feeding IP-Adapter into the model conditioning &lt;em&gt;after&lt;/em&gt; LoRA lets LoRA reassert on late steps.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Render your first outfit preview
&lt;/h2&gt;

&lt;p&gt;This section walks you from clone to a generated image in under ten minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prereqs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A running ComfyUI instance (local GPU, rented box, or a friend's)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/cubiq/ComfyUI_IPAdapter_plus" rel="noopener noreferrer"&gt;ComfyUI_IPAdapter_plus&lt;/a&gt; installed in it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip-adapter-plus_sdxl_vit-h.safetensors&lt;/code&gt; in &lt;code&gt;models/ipadapter/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CLIP-ViT-H-14-laion2B-s32B-b79K.safetensors&lt;/code&gt; in &lt;code&gt;models/clip_vision/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Your own SDXL base checkpoint&lt;/li&gt;
&lt;li&gt;A character LoRA — if you don't have one, go through &lt;a href="https://honeychat.bot/en/blog/character-consistency-custom-lora/" rel="noopener noreferrer"&gt;the previous article&lt;/a&gt; first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Clone and install the client&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/04-ipadapter
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Put your outfit reference next to the client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anything flat-lay, clean-background works best. &lt;code&gt;./my-dress.png&lt;/code&gt; for this example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Run — start at the middle of both tuning ranges&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;COMFY_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8188
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REFERENCE_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./my-dress.png
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CHECKPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-sdxl-base.safetensors
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LORA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-character-v1.safetensors
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IPADAPTER_WEIGHT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.4      &lt;span class="c"&gt;# lower half of 0–1&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IPADAPTER_END_AT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.8      &lt;span class="c"&gt;# upper half of 0–1&lt;/span&gt;

python client.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output lands in &lt;code&gt;./out/outfit_preview_&amp;lt;n&amp;gt;.png&lt;/code&gt;. First run should usually show your character wearing something that resembles the reference dress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Tune&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inspect the output. Two failure modes tell you how to adjust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Face drifted&lt;/strong&gt; → lower &lt;code&gt;IPADAPTER_WEIGHT&lt;/code&gt; or lower &lt;code&gt;IPADAPTER_END_AT&lt;/code&gt; by 0.05 and re-run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Item doesn't resemble the reference&lt;/strong&gt; → raise &lt;code&gt;IPADAPTER_WEIGHT&lt;/code&gt; by 0.05, or raise &lt;code&gt;IPADAPTER_END_AT&lt;/code&gt; slightly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sweep in 0.05 steps, not 0.1. The usable range can be narrower than expected, and a new base model may take several tuning sweeps before the balance feels stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Validate the workflow JSON with pytest&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pytest &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five tests make sure &lt;code&gt;workflow.json&lt;/code&gt; stays valid JSON, every node class is still referenced, and &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders haven't been accidentally committed with real values.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;You have a character (Anna) stabilized by a custom LoRA. She appears reasonably consistent across generations. Now the user buys a specific dress in your shop. The dress is a reference image. You want:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Anna's face&lt;/strong&gt; — unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This specific dress&lt;/strong&gt; — rendered faithfully on Anna.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prompt engineering usually can't guarantee this. "Anna wearing a red silk dress with a white collar" generates &lt;em&gt;a&lt;/em&gt; red silk dress, not necessarily &lt;em&gt;this&lt;/em&gt; red silk dress. SKU-level fidelity needs the reference image in the generation path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why naive IP-Adapter breaks the character
&lt;/h2&gt;

&lt;p&gt;IP-Adapter pulls features from a reference image into the model's cross-attention. If you set it too high, it can preserve the reference image aggressively — including &lt;em&gt;its face&lt;/em&gt;, if there is one. Even if the reference is an unworn product shot, IP-Adapter can pull in lighting, backdrop, and styling from the reference photo.&lt;/p&gt;

&lt;p&gt;At high weight: Anna's face may start looking more like whoever (or whatever) is in the reference. Lighting and pose can bias toward the reference.&lt;/p&gt;

&lt;p&gt;At low weight: The character is fine. The dress is approximately the right color and cut but not recognizable as &lt;em&gt;this&lt;/em&gt; dress. Your product catalog becomes decorative rather than accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The balance: moderate weight + early handoff
&lt;/h2&gt;

&lt;p&gt;The two knobs that matter are &lt;strong&gt;weight&lt;/strong&gt; and &lt;strong&gt;end_at&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weight&lt;/strong&gt; — the multiplier on IP-Adapter's contribution to cross-attention. Below the lower-middle of the 0–1 range, the reference is a "mood" more than a fact. Above the upper-middle, the reference dominates. Somewhere in the lower half is where you find the range that preserves item identity without killing face identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;end_at&lt;/strong&gt; — the fraction of denoising steps during which IP-Adapter is active. If it runs through all steps, it has a say in the final face details. If it ends earlier (say 70–90% of the way through), the last steps belong to the rest of the pipeline, and LoRA face features reassert.&lt;/p&gt;

&lt;p&gt;In rough terms: the item gets baked in during the middle of denoising, the face re-sharpens at the end.&lt;/p&gt;
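&lt;p&gt;To get a feel for what &lt;code&gt;end_at&lt;/code&gt; buys you, it helps to count steps. This is a rough sketch that maps the fraction linearly onto the step count; real samplers may apply it along the noise schedule instead, so treat the numbers as an approximation. The 30-step run and the function name are assumptions for illustration.&lt;/p&gt;

```python
def split_steps(total_steps, end_at):
    """Return (steps with IP-Adapter active, steps left to the LoRA alone)."""
    if end_at > 1.0 or 0.0 > end_at:
        raise ValueError("end_at must be within 0..1")
    active = round(total_steps * end_at)
    return active, total_steps - active


# Assumed 30-step run; compare a few handoff points.
for end_at in (0.7, 0.8, 0.9, 1.0):
    active, lora_only = split_steps(30, end_at)
    print(f"end_at={end_at}: {active} steps with reference, {lora_only} LoRA-only")
```

&lt;p&gt;At &lt;code&gt;end_at=1.0&lt;/code&gt; the LoRA never gets a step to itself, which is the configuration where face drift is most likely. Each 0.05 you take off hands one or two more closing steps back to the character.&lt;/p&gt;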

&lt;h2&gt;
  
  
  Workflow node order (ComfyUI)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtv84aiq459r1zyduchz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtv84aiq459r1zyduchz.webp" alt="IP-Adapter plus LoRA ComfyUI workflow chain: checkpoint, character LoRA, FreeU, outfit reference image through IP-Adapter, KSampler, and VAE decode" width="800" height="280"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Checkpoint Loader]
  → [LoRA Loader: character_lora]
    → [FreeU: quality touch-up]
      → [IPAdapter Advanced: reference, weight=W, end_at=E]
        → [KSampler]
          → [VAE Decode]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things about this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LoRA comes before IP-Adapter in the chain.&lt;/strong&gt; The LoRA modifies the checkpoint weights; IP-Adapter modifies cross-attention during sampling. When IP-Adapter switches off at the &lt;code&gt;end_at&lt;/code&gt; fraction of the schedule, the remaining steps run on the LoRA-modified weights without IP-Adapter influence, which is what lets the face reassert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FreeU is optional.&lt;/strong&gt; It rescales the U-Net's backbone and skip-connection features at inference time, which can improve quality without retraining and with negligible extra compute.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tutorial client takes the base &lt;code&gt;workflow.json&lt;/code&gt;, rewrites the &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders with env-supplied values, uploads the reference image to ComfyUI, and queues the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rewrite_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fill in the `&amp;lt;tune&amp;gt;` and `&amp;lt;path&amp;gt;` placeholders with actual values.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# deep copy
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ckpt_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lora_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_strength&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength_clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_strength&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ref_filename&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_at&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFFFFFFFF&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/04-ipadapter/client.py#L55-L77" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/04-ipadapter/workflow.json" rel="noopener noreferrer"&gt;workflow.json&lt;/a&gt; in the tutorial folder ships with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders on every field you should touch. The test suite asserts those placeholders stay in the template — a safety net against accidentally committing your tuned production values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Weight tuning loop
&lt;/h2&gt;

&lt;p&gt;The practical process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a reference item with a clean product photo.&lt;/li&gt;
&lt;li&gt;Pick a character with a strong LoRA.&lt;/li&gt;
&lt;li&gt;Render around &lt;code&gt;weight=0.3, end_at=0.8&lt;/code&gt;. Check face, check item.&lt;/li&gt;
&lt;li&gt;Face drifts → lower weight or lower end_at.&lt;/li&gt;
&lt;li&gt;Item doesn't resemble the reference → raise weight carefully, or leave weight and raise end_at.&lt;/li&gt;
&lt;li&gt;Sweep in 0.05 increments, not 0.1. The usable range is narrower than you'd expect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Several tuning sweeps on realistic and anime bases usually land you on a working pair.&lt;/p&gt;
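
&lt;p&gt;The loop above can be sketched as a small render plan. This is an assumption-laden sketch, not the tutorial's code: &lt;code&gt;sweep_values&lt;/code&gt; and &lt;code&gt;sweep_plan&lt;/code&gt; are hypothetical helpers, and the starting point simply reuses the &lt;code&gt;weight=0.3, end_at=0.8&lt;/code&gt; pair from step 3. It varies one knob at a time in 0.05 steps.&lt;/p&gt;

```python
def sweep_values(center, radius, step=0.05):
    """Values from center - radius to center + radius in `step` increments.

    Results are clamped to [0, 1] and rounded to two decimals so float
    noise does not leak into filenames or logs.
    """
    count = round(radius / step)
    raw = [center + k * step for k in range(-count, count + 1)]
    return sorted({round(min(max(v, 0.0), 1.0), 2) for v in raw})


def sweep_plan(weight=0.3, end_at=0.8, radius=0.1):
    """Vary one knob at a time: weight with end_at fixed, then the reverse."""
    weight_pass = [(w, end_at) for w in sweep_values(weight, radius)]
    end_at_pass = [(weight, e) for e in sweep_values(end_at, radius)]
    return weight_pass, end_at_pass


weight_pass, end_at_pass = sweep_plan()
print("weight pass:", weight_pass)
print("end_at pass:", end_at_pass)
```

&lt;p&gt;Render each pass with a fixed seed and prompt so that the only visible difference between images is the knob being swept.&lt;/p&gt;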

&lt;h2&gt;
  
  
  Production integration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Outfit catalog as reference images.&lt;/strong&gt; Each shop item has a reference image stored in object storage. At generation time, pass the reference URL to the GPU worker, which downloads it once and caches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catalog pre-rendering for previews.&lt;/strong&gt; When a user browses the shop, they see a preview of each item rendered on their active character. These previews don't need to be rendered on every page load: generate them asynchronously (for example, in a Celery worker), store them in S3, and serve them from cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency across image and video.&lt;/strong&gt; The same IP-Adapter + LoRA pair used for images can often drive the start-frame of video generation (e.g., Kling). Tune the still-image path first, then reuse it carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fallback when the item isn't visual.&lt;/strong&gt; Some "items" in a shop are stats buffs, relationship flags, or dialogue unlocks: things without a visual. Gate the IP-Adapter pathway so that only items flagged as visual reach it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production issues that came up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Face drifted on a noticeable slice of catalog previews.&lt;/strong&gt; We had been running IP-Adapter weight too high "for stronger outfit adherence" and rolled back to the lower-half range after face-drift complaints spiked. Lesson: tune one variable at a time, even when it feels slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cached reference URLs expired.&lt;/strong&gt; Shop items in S3 had time-limited presigned URLs. Generation workers fetched the URL at queue-time, but the URL expired before ComfyUI actually downloaded it. Fix: pre-fetch on the worker side, pass the ComfyUI-side filename instead of the external URL.&lt;/p&gt;
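
&lt;p&gt;A minimal sketch of that worker-side pre-fetch, under stated assumptions: &lt;code&gt;local_reference&lt;/code&gt; is a hypothetical helper, &lt;code&gt;fetch_bytes&lt;/code&gt; stands in for whatever signs a fresh URL and downloads it immediately, and &lt;code&gt;cache_dir&lt;/code&gt; would be a directory ComfyUI can read inputs from. The queue then carries a storage key and a local filename, never a time-limited URL.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def local_reference(storage_key, fetch_bytes, cache_dir):
    """Return a local filename for `storage_key`, downloading at most once.

    `fetch_bytes` is caller-supplied and is only invoked on a cache miss,
    so no long-lived URL ever sits in the queue.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    suffix = Path(storage_key).suffix or ".png"
    digest = hashlib.sha256(storage_key.encode("utf-8")).hexdigest()[:16]
    target = cache_dir / (digest + suffix)
    if not target.exists():
        target.write_bytes(fetch_bytes(storage_key))
    return target.name


calls = []


def fake_fetch(key):
    """Stand-in downloader used only for this demonstration."""
    calls.append(key)
    return b"fake image bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = local_reference("shop/items/red-dress.png", fake_fetch, tmp)
    second = local_reference("shop/items/red-dress.png", fake_fetch, tmp)
    print(first, first == second, "downloads:", len(calls))
```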

&lt;p&gt;&lt;strong&gt;IP-Adapter model version mismatch with SDXL base.&lt;/strong&gt; IP-Adapter Plus ships multiple weights keyed to specific SDXL base models. Mixing can produce worse output without an obvious runtime error — just lower fidelity. Pin the IP-Adapter version to the base in your deployment config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-visual shop items crashed the workflow.&lt;/strong&gt; The API tried to render "stat boost" items through the image pipeline. Fix: a &lt;code&gt;visual: true|false&lt;/code&gt; flag on catalog entries, checked at the API boundary before queuing.&lt;/p&gt;
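
&lt;p&gt;One way that boundary check might look. Everything named here (&lt;code&gt;CatalogItem&lt;/code&gt;, &lt;code&gt;NotRenderable&lt;/code&gt;, &lt;code&gt;build_render_job&lt;/code&gt;, the field names) is hypothetical and only illustrates rejecting non-visual items before anything is queued.&lt;/p&gt;

```python
from dataclasses import dataclass


class NotRenderable(ValueError):
    """Raised before queuing when an item has nothing to draw."""


@dataclass(frozen=True)
class CatalogItem:
    item_id: str
    name: str
    visual: bool
    reference_key: str = ""  # storage key of the reference image, if any


def build_render_job(item, character_id):
    """Validate at the API boundary and return the payload for the GPU queue."""
    if not item.visual or not item.reference_key:
        raise NotRenderable(f"item {item.item_id!r} has no visual reference")
    return {"character_id": character_id, "reference_key": item.reference_key}


dress = CatalogItem("it_1", "Red dress", visual=True, reference_key="shop/it_1.png")
buff = CatalogItem("it_2", "Stat boost", visual=False)

print(build_render_job(dress, "anna"))
try:
    build_render_job(buff, "anna")
except NotRenderable as exc:
    print("skipped:", exc)
```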

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with a clean catalog.&lt;/strong&gt; Reference images with consistent backgrounds, consistent lighting, no model already wearing the item if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version the tuning.&lt;/strong&gt; When you move base models, your IP-Adapter weight/end_at values probably move too. Treat them as part of the deployment, not as constants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache the pre-rendered previews aggressively.&lt;/strong&gt; A character × item grid grows multiplicatively. Pre-render on character creation and whenever a new item is added.&lt;/li&gt;
&lt;/ul&gt;
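
&lt;p&gt;The last two points combine naturally: if the cache key includes a tuning version, changing the base model or the weight/end_at pair invalidates old previews without any cleanup job. A rough sketch, where the key layout and the helper names are my own assumptions:&lt;/p&gt;

```python
def preview_key(character_id, item_id, tuning_version):
    """Cache key that changes whenever the tuned settings change."""
    return f"previews/{tuning_version}/{character_id}/{item_id}.webp"


def jobs_for_new_character(character_id, visual_item_ids, tuning_version):
    """One preview per visual item when a character is created."""
    return [preview_key(character_id, i, tuning_version) for i in visual_item_ids]


def jobs_for_new_item(item_id, character_ids, tuning_version):
    """One preview per character when an item is added to the shop."""
    return [preview_key(c, item_id, tuning_version) for c in character_ids]


items = ["dress_red", "hat_straw", "scarf_blue"]
characters = ["anna", "mira"]

print(jobs_for_new_character("anna", items, "sdxl-v1"))
print(jobs_for_new_item("scarf_blue", characters, "sdxl-v1"))
print(len(characters) * len(items), "previews in the full grid")
```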

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat's shop renders outfits, accessories, and gifts on active characters using IP-Adapter Plus layered over per-character LoRA. Public architecture doc: &lt;a href="https://github.com/sm1ck/honeychat/blob/main/docs/architecture.md" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/blob/main/docs/architecture.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter (tencent-ailab)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cubiq/ComfyUI_IPAdapter_plus" rel="noopener noreferrer"&gt;ComfyUI IPAdapter Plus extension&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.11497" rel="noopener noreferrer"&gt;FreeU paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0" rel="noopener noreferrer"&gt;SDXL base model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you've shipped an IP-Adapter + LoRA combo in production, I'm curious what weight / end_at pairs you landed on and for which base. The sweet spot seems to shift meaningfully between anime and realistic bases.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>comfyui</category>
    </item>
    <item>
      <title>Character consistency in AI image generation — where prompts break down and LoRA helps</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:22:02 +0000</pubDate>
      <link>https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b</link>
      <guid>https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Training template:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/03-lora" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/03-lora&lt;/a&gt; — a generic Kohya SDXL config with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They're not wrong to feel that way. Character identity is part of the product.&lt;/p&gt;

&lt;p&gt;This post is about why that happens, why the obvious fixes (seed-pinning, more prompt detail, reference images) often don't fully solve it, and what class of solution works better.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identical seed + identical prompt + different batch size = different face.&lt;/strong&gt; A seed only reproduces an image when the rest of the run (batch size, sampler, model version) is identical too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt detail plateaus fast.&lt;/strong&gt; Past a certain tag count, the model interpolates anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference image (IP-Adapter) works but can bleed stylistic features&lt;/strong&gt; — outfit, lighting, background — into generations where you only wanted identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom LoRA per character makes identity much more stable&lt;/strong&gt; by encoding it at the weights level instead of relying only on prompt text.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Train your own character LoRA — the short walkthrough
&lt;/h2&gt;

&lt;p&gt;LoRA training is GPU-heavy and doesn't belong in a docker-compose, so the tutorial folder at &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/03-lora" rel="noopener noreferrer"&gt;tutorial/03-lora&lt;/a&gt; ships the &lt;strong&gt;config template and recipe&lt;/strong&gt;. You bring the GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Get a GPU&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;24 GB VRAM (RTX 3090/4090) fits SDXL LoRA training at batch size 2–4 comfortably. Don't own one? Rent one from Vast.ai, RunPod, Modal, Paperspace, or Lambda. A full training run typically costs a few dollars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install Kohya_ss&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/kohya_ss &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Grab the template&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; honeychat/tutorial/03-lora ./my-character-lora
&lt;span class="nb"&gt;cd &lt;/span&gt;my-character-lora
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Prepare your dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Drop 15–30 varied images of your subject into &lt;code&gt;dataset/train/5_character/&lt;/code&gt; (the &lt;code&gt;5_&lt;/code&gt; is the repeat count). For each image, create a same-named &lt;code&gt;.txt&lt;/code&gt; caption describing the &lt;em&gt;scene&lt;/em&gt; — not the character. See &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/03-lora/dataset/README.md" rel="noopener noreferrer"&gt;dataset/README.md&lt;/a&gt; for the full curation checklist.&lt;/p&gt;
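
&lt;p&gt;Before training, it can help to confirm that every image has its caption. The check below is a small sketch of my own, not part of the tutorial repo: &lt;code&gt;parse_repeats&lt;/code&gt; and &lt;code&gt;missing_captions&lt;/code&gt; are hypothetical helpers, and the list of image extensions is an assumption.&lt;/p&gt;

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}  # assumed set of image types


def parse_repeats(folder_name):
    """Split a folder name like '5_character' into (5, 'character')."""
    prefix, _, label = folder_name.partition("_")
    return int(prefix), label


def missing_captions(folder):
    """Names of images in `folder` that lack a same-named .txt caption."""
    folder = Path(folder)
    images = [p for p in sorted(folder.iterdir()) if p.suffix.lower() in IMAGE_SUFFIXES]
    return [p.name for p in images if not p.with_suffix(".txt").exists()]


folder = Path("dataset/train/5_character")
if folder.is_dir():
    repeats, label = parse_repeats(folder.name)
    print(f"{label}: {repeats} repeats per image")
    print("images without captions:", missing_captions(folder))
```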

&lt;p&gt;&lt;strong&gt;5. Fill the &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; slots in &lt;code&gt;kohya-config.toml&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; with a real value. The safety check in &lt;code&gt;train.sh&lt;/code&gt; will refuse to run if any placeholder remains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Train&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KOHYA_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/kohya_ss
bash train.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trained LoRA lands at &lt;code&gt;./output/&amp;lt;your-character&amp;gt;.safetensors&lt;/code&gt;. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "same prompt, same face" doesn't hold
&lt;/h2&gt;

&lt;p&gt;Users naturally assume this works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"anime girl, long silver hair, green eyes, Arknights operator outfit"
+ seed=12345
→ Anna, always. Or so it seems.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not reliably. Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch size changes the output.&lt;/strong&gt; In most Stable Diffusion runs, &lt;code&gt;batch_size=1&lt;/code&gt; and &lt;code&gt;batch_size=4&lt;/code&gt; with the same seed produce &lt;em&gt;different&lt;/em&gt; images for position 0. The initial noise is sampled for the whole batch tensor, so the values that land in position 0 can depend on the batch dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider-side sampler drift.&lt;/strong&gt; If you're calling a managed API (fal.ai, Replicate, Together), provider-side changes — model updates, sampler tweaks, default parameter shifts — can produce visually different outputs across weeks. Your "locked" character can drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt detail saturates.&lt;/strong&gt; At some point, adding more tags ("sharp nose, high cheekbones, narrow eyes, specific mole position") stops helping much. The model has a rough template and interpolates within it.&lt;/p&gt;
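
&lt;p&gt;Of the three, only the batch-size effect has a mechanical workaround: give each image its own generator so that its noise no longer depends on its neighbours. The sketch below shows the idea with the standard library; &lt;code&gt;image_noise&lt;/code&gt; and &lt;code&gt;batch_noise&lt;/code&gt; are hypothetical names, and real pipelines draw tensors rather than lists (Diffusers, for instance, accepts a list of generators, one per image). It does nothing about provider drift or prompt saturation.&lt;/p&gt;

```python
import random


def image_noise(seed, index, size):
    """Noise for one image, drawn from its own generator keyed on (seed, index)."""
    rng = random.Random(seed * 1_000_003 + index)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def batch_noise(seed, batch_size, size=4):
    """Build a batch by sampling each image independently."""
    return [image_noise(seed, i, size) for i in range(batch_size)]


single = batch_noise(seed=12345, batch_size=1)
quad = batch_noise(seed=12345, batch_size=4)

# Position 0 uses the same generator in both calls, so batch size is irrelevant.
print("position 0 identical:", single[0] == quad[0])
```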

&lt;h2&gt;
  
  
  The in-between fix that doesn't quite work: IP-Adapter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter&lt;/a&gt; lets you pass a reference image alongside the prompt. It encodes the reference and injects its features through additional cross-attention layers. For product photography, this works very well.&lt;/p&gt;

&lt;p&gt;For character identity, it has a practical drawback: &lt;strong&gt;IP-Adapter can carry stylistic baggage&lt;/strong&gt;. A reference photo with specific lighting, pose, outfit, and background can bleed those into the generated image. You can turn the weight down, but then identity may weaken; turn it up, and the reference can dominate.&lt;/p&gt;

&lt;p&gt;IP-Adapter is a good fit when the &lt;em&gt;reference&lt;/em&gt; is what you want preserved (e.g., rendering a shop item on a character — next post in the series). It's usually a poor fit when what you want preserved is only the face.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: custom LoRA per character
&lt;/h2&gt;

&lt;p&gt;A LoRA (Low-Rank Adaptation) is a small set of additional weights layered on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights themselves, not into the prompt.&lt;/p&gt;

&lt;p&gt;Inference pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# base SDXL model
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LoRA: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# the character's custom LoRA
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FreeU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# quality touch-up
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KSampler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# actual diffusion
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Anna is much more likely to stay Anna across pose, outfit, and lighting changes. The face is represented in the weights, not only in the words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training a character LoRA (public-friendly template)
&lt;/h3&gt;

&lt;p&gt;The conceptual shape of the training job using the publicly available &lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss SDXL trainer&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kohya_ss SDXL LoRA training config — generic template&lt;/span&gt;
&lt;span class="c"&gt;# Replace every &amp;lt;tune&amp;gt; value based on your dataset and base model.&lt;/span&gt;
&lt;span class="c"&gt;# See Kohya docs for the full parameter reference.&lt;/span&gt;

&lt;span class="nn"&gt;[model_arguments]&lt;/span&gt;
&lt;span class="py"&gt;pretrained_model_name_or_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;path/to/sdxl-base-or-finetune.safetensors&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[dataset_arguments]&lt;/span&gt;
&lt;span class="py"&gt;train_data_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./dataset/train"&lt;/span&gt;
&lt;span class="py"&gt;resolution&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1024,1024"&lt;/span&gt;
&lt;span class="py"&gt;caption_extension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".txt"&lt;/span&gt;

&lt;span class="nn"&gt;[training_arguments]&lt;/span&gt;
&lt;span class="py"&gt;output_dir&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./output"&lt;/span&gt;
&lt;span class="py"&gt;output_name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;your_character_v1&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;save_model_as&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"safetensors"&lt;/span&gt;

&lt;span class="c"&gt;# Training steps and batch — VRAM-bound. Tune for your hardware.&lt;/span&gt;
&lt;span class="py"&gt;learning_rate&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;max_train_steps&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;train_batch_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[network_arguments]&lt;/span&gt;
&lt;span class="py"&gt;network_module&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"networks.lora"&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/03-lora/kohya-config.toml" rel="noopener noreferrer"&gt;full template on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal "best" setting.&lt;/p&gt;

&lt;p&gt;What to actually optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset quality over dataset size.&lt;/strong&gt; 20 clean, varied, captioned images beat 100 messy ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Varied pose and lighting&lt;/strong&gt;, constant face. Same angle 30 times teaches "this angle," not "this character."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean captions&lt;/strong&gt; — describe the scene, not the character. "Woman standing in a garden" is better than "Anna standing in a garden" because you want the model to learn the face from context, not from the token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enough rank for face detail, and no more.&lt;/strong&gt; Ranks that are too low underfit the identity; ranks that are too high overfit and kill flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Marginal cost: usually manageable
&lt;/h2&gt;

&lt;p&gt;If you're running inference on a rented or owned GPU, training one character LoRA is a one-time cost usually measured in minutes to hours of GPU time, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation rather than training compute.&lt;/p&gt;

&lt;p&gt;This is why a LoRA-per-character pipeline can be viable for products with many characters: once the pipeline exists, adding a new character is mostly a dataset and QA exercise, not a research project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production concerns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LoRA hot-swapping.&lt;/strong&gt; Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset hygiene.&lt;/strong&gt; LoRAs memorize whatever's in the dataset. Enforce licensing when images enter the dataset; once the LoRA is trained, it is too late to fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage at scale.&lt;/strong&gt; LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Face ≠ body.&lt;/strong&gt; A LoRA trained on face crops will not lock body proportions. Include full-body shots in the dataset if you need full-body consistency.&lt;/p&gt;
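
&lt;p&gt;For the hot-LoRA pinning mentioned above, a least-recently-used cache is one simple policy. This is a sketch under assumptions, not HoneyChat's implementation: &lt;code&gt;HotLoraCache&lt;/code&gt; is a hypothetical class, and &lt;code&gt;load&lt;/code&gt; stands in for whatever pulls a LoRA from object storage into memory.&lt;/p&gt;

```python
from collections import OrderedDict


class HotLoraCache:
    """Keep the most recently used LoRAs resident; evict the oldest one."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load  # caller-supplied loader: takes a name, returns weights
        self._items = OrderedDict()

    def get(self, name):
        if name in self._items:
            self._items.move_to_end(name)  # mark as most recently used
            return self._items[name]
        weights = self.load(name)  # cold path: fetch from storage
        self._items[name] = weights
        if len(self._items) == self.capacity + 1:
            self._items.popitem(last=False)  # drop the least recently used
        return weights


loaded = []


def fake_load(name):
    """Stand-in loader that records every cold fetch."""
    loaded.append(name)
    return f"weights for {name}"


cache = HotLoraCache(capacity=2, load=fake_load)
for name in ["anna", "mira", "anna", "zoe", "anna", "mira"]:
    cache.get(name)
print("cold loads:", loaded)
```

&lt;p&gt;Size the capacity to what fits in worker memory next to the base checkpoint; popular characters then stay warm while the long tail pays the storage round trip.&lt;/p&gt;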

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ship the LoRA pipeline from day 1&lt;/strong&gt;, even for three characters. Inconsistent visuals in the free tier can hurt activation before users ever see the stronger parts of the product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curate datasets manually, don't scrape.&lt;/strong&gt; Five iterations of a hand-picked set of 20 images beat a scraped 200.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store base-model version with each LoRA.&lt;/strong&gt; When you update the base, you need to know which LoRAs need retraining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version LoRAs (v1, v2) and keep old versions live.&lt;/strong&gt; If v2 ships with a regression, roll back per-character without reverting a whole release.&lt;/li&gt;
&lt;/ul&gt;
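&lt;p&gt;The last two points can be sketched as a small registry. This is a minimal illustration under assumptions: &lt;code&gt;LoraVersion&lt;/code&gt;, &lt;code&gt;LoraRegistry&lt;/code&gt;, the character names, and the storage paths are all hypothetical, and a real system would persist this in a database.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LoraVersion:
    character: str
    version: int
    base_model: str  # base checkpoint this LoRA was trained against
    path: str        # hypothetical storage key


@dataclass
class LoraRegistry:
    versions: dict[tuple[str, int], LoraVersion] = field(default_factory=dict)
    live: dict[str, int] = field(default_factory=dict)  # character -> live version

    def register(self, lora: LoraVersion, make_live: bool = True) -> None:
        # Old versions stay registered so they can be rolled back to.
        self.versions[(lora.character, lora.version)] = lora
        if make_live:
            self.live[lora.character] = lora.version

    def rollback(self, character: str, version: int) -> None:
        if (character, version) not in self.versions:
            raise KeyError(f"{character} v{version} was never registered")
        self.live[character] = version

    def needs_retraining(self, new_base: str) -> list[str]:
        # Characters whose live LoRA targets a different base model.
        return sorted(
            c for c, v in self.live.items()
            if self.versions[(c, v)].base_model != new_base
        )
```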

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat uses custom LoRA per character for visual identity in image and video generation. Public architecture: &lt;a href="https://github.com/sm1ck/honeychat" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Previous: &lt;a href="https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami"&gt;LLM routing per tier via OpenRouter&lt;/a&gt;.&lt;br&gt;
Next: &lt;a href="https://dev.to/sm1ck/ip-adapter-lora-for-product-catalog-rendering-putting-shop-items-on-ai-characters-5h36"&gt;IP-Adapter Plus for a product catalog&lt;/a&gt; — how to put arbitrary shop items on a character while keeping the character's face locked.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA paper — Hu et al.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss SDXL training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter (for comparison)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0" rel="noopener noreferrer"&gt;SDXL base model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you've trained character LoRAs in production and have opinions on rank selection or caption strategy, I'd love to hear them in the comments. There's very little public writing on this outside the anime generation community.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>LLM routing per tier via OpenRouter — when one model doesn't fit all</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Tue, 21 Apr 2026 23:50:29 +0000</pubDate>
      <link>https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami</link>
      <guid>https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Full runnable example:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/02-routing" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/02-routing&lt;/a&gt; — &lt;code&gt;docker compose up&lt;/code&gt; exposes &lt;code&gt;POST /complete&lt;/code&gt; on localhost:8000. Every snippet below is pulled from that repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most introductory "chat with AI" tutorials pick one model and call it a day. That works in a toy. It stops being enough in production, where users have different price sensitivity, different conversation styles, and different expectations for what the product should allow.&lt;/p&gt;

&lt;p&gt;Here's how to route LLM calls across a handful of providers via &lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;, how that routing handles &lt;code&gt;finish_reason=content_filter&lt;/code&gt; empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Route by &lt;strong&gt;tier&lt;/strong&gt; (price elasticity) &lt;em&gt;and&lt;/em&gt; by &lt;strong&gt;content mode&lt;/strong&gt; (what kind of turn this is). A single default model can't do both.&lt;/li&gt;
&lt;li&gt;Some reasoning-model/provider combinations can return &lt;code&gt;finish_reason=content_filter&lt;/code&gt; with empty content on borderline turns. A retry policy that only catches HTTP errors can miss this.&lt;/li&gt;
&lt;li&gt;The working pattern: &lt;code&gt;primary → different-provider fallback → specialized last resort&lt;/code&gt;, with retries triggered by both error responses &lt;em&gt;and&lt;/em&gt; suspicious empty completions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run it yourself in 3 minutes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and configure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/02-routing
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt;, paste your &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; (&lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;get one here&lt;/a&gt;). The three default model slots all point to free-tier OpenRouter models so you can experiment without spending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start the service&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
curl http://localhost:8000/health   &lt;span class="c"&gt;# {"ok":true}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Send a normal turn — primary answers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/complete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apples, pears, and cloudberries..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meta-llama/llama-3.1-8b-instruct:free"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"used_fallback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;attempt: 0&lt;/code&gt; means the primary model answered. &lt;code&gt;used_fallback: false&lt;/code&gt; means no retry was needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Force a fallback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/complete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'.model, .attempt, .used_fallback'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;attempt: 1&lt;/code&gt; (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move the content to a different primary, not to tweak retry logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Run the unit tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pytest &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seven tests cover the failure modes in this chain: an empty &lt;code&gt;content_filter&lt;/code&gt; completion, transient 5xx, non-transient 4xx, and all models failing.&lt;/p&gt;

&lt;p&gt;With the service running and the tests green, the rest of this post explains why the chain is shaped this way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why one model doesn't fit all
&lt;/h2&gt;

&lt;p&gt;Three distinct pressures push against a single-model setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Price elasticity by tier.&lt;/strong&gt; A free user generating 20 messages a day at flagship-model prices costs you money every month for zero revenue. A paying top-tier user sending the same 20 messages may reasonably expect higher quality. The unit economics of the two tiers call for different models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content mode.&lt;/strong&gt; Mainstream-aligned models can refuse content that some legitimate companion/roleplay products allow on paid tiers. Conversely, less-restrictive models can have weaker long-context coherence. The right model depends on the turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. depth.&lt;/strong&gt; Instant conversational turns need sub-3-second responses. Long scene-writing turns can tolerate 10+ seconds for better prose. Hardcoding a single model optimizes for one and sacrifices the other.&lt;/p&gt;
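&lt;p&gt;One way to picture routing on both axes is a lookup keyed by tier and content mode. This is a minimal sketch: the tier names, mode names, and model slugs below are placeholders invented for the example, not HoneyChat's actual map.&lt;/p&gt;

```python
# Placeholder slugs: swap in real OpenRouter model IDs for your product.
ROUTES = {
    ("free", "chat"): "example/small-fast-model",
    ("free", "scene"): "example/small-fast-model",
    ("paid", "chat"): "example/mid-model",
    ("paid", "scene"): "example/long-form-model",
}
DEFAULT_MODEL = "example/small-fast-model"


def pick_primary(tier: str, mode: str) -> str:
    """Return the primary model for this turn, falling back to the cheapest slot."""
    return ROUTES.get((tier, mode), DEFAULT_MODEL)
```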

&lt;h2&gt;
  
  
  The reasoning-model empty-completion edge case
&lt;/h2&gt;

&lt;p&gt;This is the one that cost me a full afternoon to diagnose.&lt;/p&gt;

&lt;p&gt;Some reasoning-class model/provider combinations do server-side moderation or filtering before returning a final answer. On borderline turns, they may not return an HTTP error. Instead, they can return a valid response with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content_filter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Empty string. No exception. No status code to check. If you don't guard for it, your user sees a blank reply.&lt;/p&gt;

&lt;p&gt;If your retry logic only triggers on &lt;code&gt;httpx.HTTPStatusError&lt;/code&gt;, this response passes straight through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The guard
&lt;/h2&gt;

&lt;p&gt;The whole failure mode is caught by a tiny function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_silent_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finish_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/02-routing/app/router.py#L64-L73" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilient fallback chain
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbeynrwjlgh7bsgzvb90.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbeynrwjlgh7bsgzvb90.webp" alt="LLM routing fallback chain: a chat turn tries a tier-specific primary model, retries on a different-provider fallback after empty content_filter responses, then falls back to a specialized last resort" width="800" height="373"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CompletionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the fallback chain. Return the first usable response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;_build_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPStatusError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadTimeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_is_silent_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CompletionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AllModelsFailedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no model returned usable content; tried &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/02-routing/app/router.py#L90-L128" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two details worth calling out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Empty content check is separate from the finish reason.&lt;/strong&gt; Some models can return &lt;code&gt;finish_reason=stop&lt;/code&gt; with empty content when they refuse. Always check &lt;code&gt;not content.strip()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track which model ultimately answered.&lt;/strong&gt; Log &lt;code&gt;attempt &amp;gt; 0&lt;/code&gt; as a fallback event. If your primary fails 10% of the time on a class of content, that's a routing decision, not a retry problem — move that content to a different primary.&lt;/li&gt;
&lt;/ol&gt;
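&lt;p&gt;To make the second point concrete, here is a minimal sketch of a per-content-class fallback counter. &lt;code&gt;FallbackStats&lt;/code&gt; and the content-class labels are hypothetical; in production these counts would more likely go to your metrics system than to an in-process object.&lt;/p&gt;

```python
from collections import Counter


class FallbackStats:
    """Counts calls and fallback events per content class (hypothetical helper)."""

    def __init__(self) -> None:
        self.calls: Counter = Counter()
        self.fallbacks: Counter = Counter()

    def record(self, content_class: str, attempt: int) -> None:
        self.calls[content_class] += 1
        if attempt > 0:  # any rung past the primary counts as a fallback
            self.fallbacks[content_class] += 1

    def rate(self, content_class: str) -> float:
        total = self.calls[content_class]
        return self.fallbacks[content_class] / total if total else 0.0
```

&lt;p&gt;Feeding it &lt;code&gt;CompletionResult.attempt&lt;/code&gt; after each call gives a fallback rate per class, which is the number to watch when deciding whether to change the primary.&lt;/p&gt;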

&lt;h2&gt;
  
  
  Picking the fallback order
&lt;/h2&gt;

&lt;p&gt;For a permissive roleplay mode, the shape looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content-mode primary   → first model for this type of turn
  ↓ (on failure / empty)
diff-provider fallback → avoids the same upstream failure mode
  ↓
specialized last resort
  ↓
abort — ask the user to try a shorter or clearer prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ordering rule: &lt;strong&gt;different-provider fallbacks&lt;/strong&gt;. If the primary is hosted on provider A and fails for a provider-side reason, prefer a fallback hosted on provider B. Same-provider fallbacks can fail on the same content because the provider's moderation layer may be upstream of the model. OpenRouter makes this easier because each model's provider metadata is visible.&lt;/p&gt;
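&lt;p&gt;A minimal sketch of that ordering rule follows. It is not the repo's &lt;code&gt;_build_chain&lt;/code&gt;: &lt;code&gt;ModelSlot&lt;/code&gt; and &lt;code&gt;order_chain&lt;/code&gt; are hypothetical names, and the &lt;code&gt;provider&lt;/code&gt; field is assumed to be filled in by hand from each model's provider metadata.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSlot:
    name: str
    provider: str  # assumption: recorded manually per model


def order_chain(
    primary: ModelSlot,
    candidates: list[ModelSlot],
    last_resort: ModelSlot,
) -> list[str]:
    """Primary first, then other-provider fallbacks, then same-provider ones."""
    others = [m for m in candidates if m != primary and m != last_resort]
    # list.sort is stable and False sorts before True, so models hosted on a
    # different provider than the primary move ahead without other reordering.
    others.sort(key=lambda m: m.provider == primary.provider)
    # dict.fromkeys removes duplicate names while preserving order.
    return list(dict.fromkeys(m.name for m in [primary, *others, last_resort]))
```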

&lt;h2&gt;
  
  
  Content-level gating happens before the LLM, not after
&lt;/h2&gt;

&lt;p&gt;The fallback chain handles &lt;em&gt;model-level&lt;/em&gt; refusals. But if the user's intent is clearly above your product's content ceiling, retrying on a more permissive model just burns extra tokens before the user hits the real limit. Gate the content level in your system prompt assembly — don't rely on the model to enforce policy.&lt;/p&gt;

&lt;p&gt;Keep the tier-level policy simple: the escalation class (detected from user intent) must be &lt;code&gt;≤&lt;/code&gt; the user's plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists — it just gets a system prompt with the right constraints for this turn.&lt;/p&gt;
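&lt;p&gt;The ceiling check itself can be very small. The sketch below assumes the escalation class has already been detected upstream; the &lt;code&gt;Level&lt;/code&gt; names, plan names, and &lt;code&gt;gate&lt;/code&gt; function are hypothetical and only illustrate the comparison.&lt;/p&gt;

```python
from enum import IntEnum


class Level(IntEnum):
    BASE = 0
    ELEVATED = 1
    MAX = 2


# Hypothetical plan names and ceilings.
PLAN_CEILING = {"free": Level.BASE, "plus": Level.ELEVATED, "premium": Level.MAX}


def gate(plan: str, requested: Level) -> tuple[Level, bool]:
    """Return (level to use in the system prompt, whether to send the upsell)."""
    ceiling = PLAN_CEILING.get(plan, Level.BASE)  # unknown plans get the lowest ceiling
    if ceiling >= requested:
        return requested, False
    return ceiling, True
```

&lt;p&gt;The returned level drives system prompt assembly, so the model only ever sees constraints for a level the plan allows.&lt;/p&gt;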

&lt;h2&gt;
  
  
  Instrumentation that matters
&lt;/h2&gt;

&lt;p&gt;Log three things per LLM call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model that answered&lt;/strong&gt; (primary or fallback index)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to first token&lt;/strong&gt; vs &lt;strong&gt;total time&lt;/strong&gt;, which helps separate time spent waiting for the model to start (network, queueing, prompt processing) from time spent generating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost&lt;/strong&gt; (input + output) per message, bucketed by tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Costs are tracked in Redis counters with a short TTL: a global daily sum and a per-user daily sum. A global daily ceiling blocks new generations if spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don't pass). This helped cap a runaway generation loop at a known ceiling.&lt;/p&gt;
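&lt;p&gt;The fail-closed part is easy to get backwards, so here is a minimal sketch of the check. &lt;code&gt;CounterStore&lt;/code&gt; is a hypothetical interface standing in for a Redis client, and the key format and cent units are assumptions for the example.&lt;/p&gt;

```python
from typing import Protocol


class CounterStore(Protocol):
    """Hypothetical interface; a Redis-backed class would implement it."""

    def get_cents(self, key: str) -> int: ...


def may_generate(store: CounterStore, day_key: str, ceiling_cents: int) -> bool:
    """Allow a generation only if today's spend is known and under the ceiling."""
    try:
        spent = store.get_cents(day_key)
    except Exception:
        # Fail-closed: an unreachable counter blocks instead of passing.
        return False
    return ceiling_cents > spent
```

&lt;p&gt;In real code you would likely narrow the &lt;code&gt;except&lt;/code&gt; clause to your client's connection and timeout errors so that programming mistakes still surface.&lt;/p&gt;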

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Route by content mode from day 1&lt;/strong&gt;, not as an afterthought. Retrofitting the split into an existing handler is painful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument the silent-refusal rate&lt;/strong&gt;. It may be rare, but you won't know unless you measure it specifically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't share a single OpenRouter key across environments.&lt;/strong&gt; Rate limits are per-key and dev noise eats prod quota.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish the tier → model map in your public docs.&lt;/strong&gt; Users comparing products care. Competitors already know. Keeping the docs in sync with the code forces alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat's LLM router sits behind the chat handler on both the Telegram bot and the web app. Public architecture: &lt;a href="https://github.com/sm1ck/honeychat/blob/main/docs/architecture.md" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/blob/main/docs/architecture.md&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Previous in the series: &lt;a href="https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k"&gt;dual-layer memory with Redis + ChromaDB&lt;/a&gt;.&lt;br&gt;
Next: &lt;a href="https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b"&gt;character consistency with custom LoRA&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/models" rel="noopener noreferrer"&gt;OpenRouter model list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/api-reference/chat/object" rel="noopener noreferrer"&gt;Chat Completions &lt;code&gt;finish_reason&lt;/code&gt; semantics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Curious how others have solved the silent-refusal pattern. If you've hit it on a different provider, drop a comment — I want to know which models ship which behavior.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>openrouter</category>
    </item>
    <item>
      <title>Building an AI companion with persistent memory — Redis + ChromaDB</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Mon, 20 Apr 2026 12:16:42 +0000</pubDate>
      <link>https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k</link>
      <guid>https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Full runnable example:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/01-memory" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/01-memory&lt;/a&gt; — clone, &lt;code&gt;docker compose up&lt;/code&gt;, chat with the demo bot on Telegram. Every code snippet below is pulled from that repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT's Memory is built for useful preferences and details, not verbatim replay of long sessions.&lt;/p&gt;

&lt;p&gt;I wanted a chat companion with &lt;strong&gt;practical persistent memory&lt;/strong&gt; — not just the current conversation, but older facts and events surfaced when they matter. Here's the architecture that worked well for this use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot layer (Redis)&lt;/strong&gt; — recent messages per conversation, short TTL, low-latency reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold layer (ChromaDB)&lt;/strong&gt; holds &lt;em&gt;summaries of chunks&lt;/em&gt;, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document. This keeps the vector index small and queries fast.&lt;/li&gt;
&lt;li&gt;On every user message, three retrieval paths fire in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt;: recent buffer, latest summary, top-K semantic search. All three get assembled into the system prompt.&lt;/li&gt;
&lt;li&gt;Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run it yourself in 5 minutes
&lt;/h2&gt;

&lt;p&gt;Before the architectural deep-dive, boot the demo so you can poke the memory layers live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and enter the folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/01-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure two tokens&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt; and fill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TELEGRAM_BOT_TOKEN&lt;/code&gt; — get it from &lt;a href="https://t.me/BotFather" rel="noopener noreferrer"&gt;@BotFather&lt;/a&gt; (30 seconds: &lt;code&gt;/newbot&lt;/code&gt;, pick a name, copy the token)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; — from &lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;openrouter.ai/keys&lt;/a&gt;. The default &lt;code&gt;LLM_MODEL&lt;/code&gt; is a free-tier Llama 3.1 8B so you don't spend a cent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Start the stack&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; bot       &lt;span class="c"&gt;# watch the bot come alive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four containers: &lt;code&gt;redis&lt;/code&gt;, &lt;code&gt;chromadb&lt;/code&gt;, &lt;code&gt;api&lt;/code&gt; (FastAPI inspector on &lt;code&gt;localhost:8000&lt;/code&gt;), &lt;code&gt;bot&lt;/code&gt; (your Telegram bot polling).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Talk to your bot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open it on Telegram, hit &lt;code&gt;/start&lt;/code&gt;, chat for 10–20 turns. Tell it things about yourself. Come back later and reference something you said earlier — it'll pull it from ChromaDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Peek at what each layer holds&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace 12345 with your own Telegram user ID (ask @userinfobot)&lt;/span&gt;
curl http://localhost:8000/memory/12345/demo/recent  | jq
curl http://localhost:8000/memory/12345/demo/summary | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;recent&lt;/code&gt; shows the raw Redis buffer. &lt;code&gt;summary&lt;/code&gt; shows the latest ChromaDB document.&lt;/p&gt;

&lt;p&gt;With the demo running, the rest of this post explains what you just booted.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why rolling summaries alone don't work
&lt;/h2&gt;

&lt;p&gt;A common pattern for chatbot memory is a rolling summary — every N messages, regenerate a compressed version of older context. It's cheap. It's also &lt;strong&gt;lossy in a very specific way&lt;/strong&gt;: nuance dies in repeated compression.&lt;/p&gt;

&lt;p&gt;Walk it through three regenerations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original message: "She said she hates her boss because he takes credit for her work"
Summary, pass 1: "User mentioned workplace frustration with manager"
Summary, pass 2: "User has job-related stress"
Summary, pass 3: "User has a job"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the third pass, the &lt;em&gt;reason&lt;/em&gt; is gone, and a companion bot starts sounding generic. The fix used here: &lt;strong&gt;keep raw recent messages verbatim&lt;/strong&gt; and only summarize chunks that are genuinely old, while keeping every summary semantically retrievable for when the current conversation calls back to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7qpvljz5wjeh349kel.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7qpvljz5wjeh349kel.webp" alt="Dual-layer memory architecture: Redis recent buffer and ChromaDB summaries retrieved in parallel before LLM prompt assembly" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched. Reads from both happen in parallel on every message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hot layer — Redis
&lt;/h2&gt;

&lt;p&gt;Each &lt;code&gt;(user_id, character_id)&lt;/code&gt; conversation is stored as a bounded Redis list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;HOT_BUFFER_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;HOT_BUFFER_TTL_DAYS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/01-memory/app/memory.py#L75-L89" rel="noopener noreferrer"&gt;full source on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three things matter here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ltrim&lt;/code&gt; on every write.&lt;/strong&gt; The list is bounded. Memory per conversation is O(1), not O(conversation length).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL extended on every write.&lt;/strong&gt; Inactive users' history expires automatically. Set &lt;code&gt;maxmemory&lt;/code&gt; and the &lt;code&gt;allkeys-lru&lt;/code&gt; policy so that Redis evicts old keys when memory fills up instead of refusing writes. &lt;code&gt;noeviction&lt;/code&gt; is the default, and it's a footgun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelined writes.&lt;/strong&gt; &lt;code&gt;rpush + ltrim + expire&lt;/code&gt; in one round trip.&lt;/li&gt;
&lt;/ol&gt;
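
&lt;p&gt;To see what &lt;code&gt;rpush&lt;/code&gt; followed by &lt;code&gt;ltrim(key, -N, -1)&lt;/code&gt; does to the list, here is a Redis-free model of the same bounded buffer. It is only a sketch: &lt;code&gt;make_buffer&lt;/code&gt; and &lt;code&gt;save_message_local&lt;/code&gt; are hypothetical names, and the buffer size of 4 is picked for the demo, not taken from the repo.&lt;/p&gt;

```python
import json
from collections import deque
from datetime import datetime, timezone

HOT_BUFFER_SIZE = 4  # assumed demo value; the real setting lives in the repo config


def make_buffer(size=HOT_BUFFER_SIZE):
    # deque(maxlen=size) drops the oldest entry on overflow, which gives the
    # same end state as RPUSH followed by LTRIM key -size -1.
    return deque(maxlen=size)


def save_message_local(buffer, role, content):
    # Same JSON shape as save_message, minus the network round trip.
    buffer.append(json.dumps({
        "role": role,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    }))


buf = make_buffer()
for i in range(10):
    save_message_local(buf, "user", f"message {i}")

kept = [json.loads(m)["content"] for m in buf]
print(kept)  # ['message 6', 'message 7', 'message 8', 'message 9']
```

&lt;p&gt;Ten writes, four survivors: the buffer never grows past its cap, no matter how long the conversation runs.&lt;/p&gt;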

&lt;h2&gt;
  
  
  The cold layer — ChromaDB with summaries, not messages
&lt;/h2&gt;

&lt;p&gt;A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume, and individual messages are often too short or context-free to retrieve meaningfully ("yeah" returns a lot of "yeah" matches).&lt;/p&gt;

&lt;p&gt;Instead: &lt;strong&gt;embed LLM-generated summaries of chunks&lt;/strong&gt;. Every N bot turns, compress the window via a cheap LLM and write it as one document to a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.&lt;/p&gt;
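
&lt;p&gt;The "every N bot turns" trigger can be sketched without any LLM or vector store in the loop. Everything here is an assumption made for illustration: the class name &lt;code&gt;SummaryWindow&lt;/code&gt;, the &lt;code&gt;"assistant"&lt;/code&gt; role label, and the value of N. The repo may track the window differently.&lt;/p&gt;

```python
SUMMARY_EVERY_N_BOT_TURNS = 3  # assumed; the post only says "every N bot turns"


class SummaryWindow:
    """Collects messages and hands back a chunk once N bot turns have accumulated."""

    def __init__(self, every_n=SUMMARY_EVERY_N_BOT_TURNS):
        self.every_n = every_n
        self.pending = []   # messages not yet summarized
        self.bot_turns = 0  # bot replies since the last chunk

    def add(self, role, content):
        """Record one message; return the full window when it is due, else None."""
        self.pending.append({"role": role, "content": content})
        if role == "assistant":
            self.bot_turns += 1
        if self.bot_turns == self.every_n:
            window, self.pending, self.bot_turns = self.pending, [], 0
            return window
        return None


def window_to_text(window):
    """Flatten a window into the transcript a summarizer LLM would receive."""
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in window)
```

&lt;p&gt;In a real pipeline, the window returned by &lt;code&gt;add&lt;/code&gt; would go to the cheap LLM in a background task, and the resulting summary would be written as one document to the per-(user, character) collection.&lt;/p&gt;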

&lt;h2&gt;
  
  
  Retrieval — three paths in parallel
&lt;/h2&gt;

&lt;p&gt;On every user message, the chat handler fires three reads in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parallel fire the three reads. Returns everything the handler needs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;get_recent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;get_latest_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;get_relevant_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/01-memory/app/memory.py#L163-L173" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;
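
&lt;p&gt;What happens to that dict next is the prompt assembly step. One possible layout, as a minimal sketch: &lt;code&gt;assemble_system_prompt&lt;/code&gt;, the &lt;code&gt;persona&lt;/code&gt; argument, and the section headings are all hypothetical, and it assumes &lt;code&gt;memories&lt;/code&gt; is a list of strings and &lt;code&gt;recent&lt;/code&gt; is a list of dicts with &lt;code&gt;role&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt; keys.&lt;/p&gt;

```python
def assemble_system_prompt(persona, context):
    """Build one system prompt from the three retrieval results.

    The section headings are assumptions for this sketch; `context` is the
    dict shape returned by build_prompt_context.
    """
    parts = [persona.strip()]
    if context.get("summary"):
        parts.append("Summary of the conversation so far:\n" + context["summary"])
    if context.get("memories"):
        bullets = "\n".join("- " + m for m in context["memories"])
        parts.append("Relevant older memories:\n" + bullets)
    if context.get("recent"):
        lines = "\n".join(m["role"] + ": " + m["content"] for m in context["recent"])
        parts.append("Most recent messages:\n" + lines)
    # Empty sections are skipped, so a brand-new user gets just the persona.
    return "\n\n".join(parts)
```

&lt;p&gt;Skipping empty sections matters for new users: with no summary and no memories yet, the prompt should not contain headings with nothing under them.&lt;/p&gt;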

&lt;p&gt;The fast path for the summary hits Redis. The slower path queries ChromaDB only when the Redis cache has expired, then writes the result back so the next call is hot again.&lt;/p&gt;
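
&lt;p&gt;That read-through pattern looks roughly like this. It is a sketch under stated assumptions: &lt;code&gt;DictCache&lt;/code&gt; is an in-memory stand-in that mimics only the &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;setex&lt;/code&gt; calls of an async Redis client, and &lt;code&gt;load_from_cold&lt;/code&gt; is a hypothetical coroutine standing in for the ChromaDB lookup.&lt;/p&gt;

```python
import asyncio

SUMMARY_CACHE_TTL = 3 * 86400  # seconds; assumes a 3-day cache lifetime


class DictCache:
    """In-memory stand-in for the two cache calls used below (get, setex)."""

    def __init__(self):
        self.data = {}

    async def get(self, key):
        return self.data.get(key)

    async def setex(self, key, ttl_seconds, value):
        self.data[key] = value  # the TTL is ignored in this stand-in


async def get_latest_summary_cached(cache, key, load_from_cold):
    cached = await cache.get(key)
    if cached is not None:
        return cached                 # fast path: cache hit
    summary = await load_from_cold()  # slow path: vector store lookup
    if summary:                       # never write back an empty result
        await cache.setex(key, SUMMARY_CACHE_TTL, summary)
    return summary
```

&lt;p&gt;The second call for the same key never reaches the loader, which is the whole point: the vector store is only touched on a cache miss.&lt;/p&gt;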

&lt;h2&gt;
  
  
  Production issues that came up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Double-summarize race.&lt;/strong&gt; Two concurrent messages for the same pair both trigger summarization, writing overlapping summaries. Fix: per-key task tracking, cancel the pending task if a new one fires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User clears history mid-summarize.&lt;/strong&gt; A user hits "reset chat" while a summary is in flight. The summary then writes to a collection that just got deleted. Fix: re-check &lt;code&gt;r.exists(key)&lt;/code&gt; before writing; bail if the list is gone.&lt;/p&gt;
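
&lt;p&gt;The important detail is that the existence check runs &lt;em&gt;after&lt;/em&gt; the slow LLM call, not before it. A minimal sketch, assuming hypothetical &lt;code&gt;summarize&lt;/code&gt; and &lt;code&gt;store&lt;/code&gt; coroutines and a &lt;code&gt;FakeRedis&lt;/code&gt; stand-in that mimics only the async &lt;code&gt;exists&lt;/code&gt; call:&lt;/p&gt;

```python
import asyncio


class FakeRedis:
    """Tiny stand-in exposing only exists(); real code would use an async Redis client."""

    def __init__(self, keys):
        self.keys = set(keys)

    async def exists(self, key):
        return 1 if key in self.keys else 0


async def summarize_and_store(r, messages_key, summarize, store):
    """Run the slow LLM call, then write only if the chat still exists."""
    summary = await summarize()
    if not summary:
        return False  # nothing worth storing
    if not await r.exists(messages_key):
        return False  # user reset the chat while the summary was in flight
    await store(summary)
    return True
```

&lt;p&gt;This narrows the race window rather than closing it completely, since a reset can still land between the check and the write.&lt;/p&gt;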

&lt;p&gt;&lt;strong&gt;Empty summaries cached.&lt;/strong&gt; When the LLM was rate-limited it returned empty content, and I was caching that empty string with a 3-day TTL. Fix: &lt;code&gt;if summary:&lt;/code&gt; guard before &lt;code&gt;setex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChromaDB collection doesn't exist for new users.&lt;/strong&gt; &lt;code&gt;col.query&lt;/code&gt; raises on a non-existent collection. Wrap in try/except and return empty — normal for a user's first few messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skip pgvector for this shape of workload.&lt;/strong&gt; I spent two weeks on it first; for my short-query summaries, recall was worse than with ChromaDB, and the reindexing pain wasn't worth it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't embed per message.&lt;/strong&gt; Index exploded, recall didn't improve. Summary-level is the right granularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize fixed-size windows, not time-based batches.&lt;/strong&gt; Daily summaries are useless for users who chatted 500 times in one day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the cancellation pattern from day 1.&lt;/strong&gt; Race conditions around user actions (clear history, switch character) became one of the top sources of production bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat — an AI companion that runs both as a Telegram bot and a web app on the same backend. The architecture above is in production. Try it: &lt;a href="https://t.me/HoneyChatAIBot" rel="noopener noreferrer"&gt;@HoneyChatAIBot&lt;/a&gt; on Telegram or &lt;a href="https://honeychat.bot" rel="noopener noreferrer"&gt;honeychat.bot&lt;/a&gt; in the browser.&lt;/p&gt;

&lt;p&gt;Public docs: &lt;a href="https://github.com/sm1ck/honeychat" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat&lt;/a&gt; — service topology, API surface, major flows.&lt;/p&gt;

&lt;p&gt;Next in the series: &lt;a href="https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami"&gt;LLM routing per tier&lt;/a&gt;, on why one model doesn't fit all and how to handle &lt;code&gt;content_filter&lt;/code&gt; errors from reasoning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redis.io/commands/ltrim/" rel="noopener noreferrer"&gt;Redis &lt;code&gt;LTRIM&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aiogram.dev/" rel="noopener noreferrer"&gt;aiogram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.character.ai/hc/en-us/articles/24327914463003-New-Feature-Pinned-Memories" rel="noopener noreferrer"&gt;Character.AI pinned memories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.character.ai/helping-characters-remember-what-matters-most/" rel="noopener noreferrer"&gt;Character.AI chat memories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.replika.com/hc/en-us/categories/4410741916045-Conversation-Memory" rel="noopener noreferrer"&gt;Replika memory docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.openai.com/en/articles/10303002-how-does-memory-use-past-conversations" rel="noopener noreferrer"&gt;ChatGPT Memory FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're building something similar and have questions about the memory layout or the summarization pipeline, drop a comment. Especially curious how others handle race conditions around user-initiated state resets.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
