<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sm1ck</title>
    <description>The latest articles on DEV Community by sm1ck (@sm1ck).</description>
    <link>https://dev.to/sm1ck</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3035707%2F79914038-15a3-4a68-9e07-803c587c48a8.png</url>
      <title>DEV Community: sm1ck</title>
      <link>https://dev.to/sm1ck</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sm1ck"/>
    <language>en</language>
    <item>
      <title>IP-Adapter + LoRA for product catalog rendering — putting shop items on AI characters</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Sat, 25 Apr 2026 02:35:59 +0000</pubDate>
      <link>https://dev.to/sm1ck/ip-adapter-lora-for-product-catalog-rendering-putting-shop-items-on-ai-characters-5h36</link>
      <guid>https://dev.to/sm1ck/ip-adapter-lora-for-product-catalog-rendering-putting-shop-items-on-ai-characters-5h36</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Runnable workflow:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/04-ipadapter" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/04-ipadapter&lt;/a&gt; — a ComfyUI &lt;code&gt;workflow.json&lt;/code&gt; (with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders for IP-Adapter weight/end_at) plus a stdlib Python client that posts it to your ComfyUI instance and saves the output.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the previous post I argued that &lt;strong&gt;LoRA per character&lt;/strong&gt; is often the strongest fit for visual identity. But what happens when you want to render that character wearing a &lt;em&gt;specific&lt;/em&gt; item — a shop product, a user-uploaded outfit, a gift from another user?&lt;/p&gt;

&lt;p&gt;LoRA helps stabilize the character. To also preserve an arbitrary reference image, IP-Adapter is a common fit. Those two techniques can compete unless you configure them carefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LoRA stabilizes the character's face. IP-Adapter pulls features from a reference image. If both are too strong late in sampling, the face can drift toward the reference.&lt;/li&gt;
&lt;li&gt;Balance: &lt;strong&gt;moderate IP-Adapter weight&lt;/strong&gt; (lower half of 0–1) with &lt;strong&gt;early handoff&lt;/strong&gt; (IP-Adapter releases control before the final denoising steps). The final steps belong to the LoRA.&lt;/li&gt;
&lt;li&gt;A useful node order: &lt;code&gt;Checkpoint → LoRA → FreeU → IP-Adapter → KSampler&lt;/code&gt;. Feeding IP-Adapter into the model conditioning &lt;em&gt;after&lt;/em&gt; LoRA lets LoRA reassert on late steps.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Render your first outfit preview
&lt;/h2&gt;

&lt;p&gt;This section walks you from clone to a generated image in under ten minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prereqs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A running ComfyUI instance (local GPU, rented box, or a friend's)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/cubiq/ComfyUI_IPAdapter_plus" rel="noopener noreferrer"&gt;ComfyUI_IPAdapter_plus&lt;/a&gt; installed in it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip-adapter-plus_sdxl_vit-h.safetensors&lt;/code&gt; in &lt;code&gt;models/ipadapter/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CLIP-ViT-H-14-laion2B-s32B-b79K.safetensors&lt;/code&gt; in &lt;code&gt;models/clip_vision/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Your own SDXL base checkpoint&lt;/li&gt;
&lt;li&gt;A character LoRA — if you don't have one, go through &lt;a href="https://honeychat.bot/en/blog/character-consistency-custom-lora/" rel="noopener noreferrer"&gt;the previous article&lt;/a&gt; first&lt;/li&gt;
&lt;/ul&gt;
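&lt;p&gt;Before cloning anything, it's worth confirming the instance is reachable. A minimal stdlib preflight check (assuming your instance exposes ComfyUI's standard &lt;code&gt;/system_stats&lt;/code&gt; endpoint; the function name is mine) might look like:&lt;/p&gt;

```python
import json
import urllib.error
import urllib.request


def comfy_is_up(base_url, timeout=5):
    """Return True if a ComfyUI instance answers its /system_stats endpoint."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/system_stats",
                                    timeout=timeout) as resp:
            stats = json.load(resp)
        # A healthy instance reports at least one device entry (GPU or CPU).
        return bool(stats.get("devices"))
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

&lt;p&gt;If this returns &lt;code&gt;False&lt;/code&gt;, fix connectivity first; every later step assumes the URL you put in &lt;code&gt;COMFY_URL&lt;/code&gt; answers.&lt;/p&gt;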

&lt;p&gt;&lt;strong&gt;2. Clone and install the client&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/04-ipadapter
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Put your outfit reference next to the client&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A flat-lay product shot on a clean background works best. This example uses &lt;code&gt;./my-dress.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Run — start at the middle of both tuning ranges&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;COMFY_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8188
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REFERENCE_IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./my-dress.png
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CHECKPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-sdxl-base.safetensors
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LORA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-character-v1.safetensors
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IPADAPTER_WEIGHT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.4      &lt;span class="c"&gt;# lower half of 0–1&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IPADAPTER_END_AT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.8      &lt;span class="c"&gt;# upper half of 0–1&lt;/span&gt;

python client.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output lands in &lt;code&gt;./out/outfit_preview_&amp;lt;n&amp;gt;.png&lt;/code&gt;. The first run usually shows your character wearing something that resembles the reference dress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Tune&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inspect the output. Two failure modes tell you how to adjust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Face drifted&lt;/strong&gt; → lower &lt;code&gt;IPADAPTER_WEIGHT&lt;/code&gt; or lower &lt;code&gt;IPADAPTER_END_AT&lt;/code&gt; by 0.05 and re-run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Item doesn't resemble the reference&lt;/strong&gt; → raise &lt;code&gt;IPADAPTER_WEIGHT&lt;/code&gt; by 0.05, or raise &lt;code&gt;IPADAPTER_END_AT&lt;/code&gt; slightly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sweep in 0.05 steps, not 0.1. The usable range can be narrower than expected, and a new base model may take several tuning sweeps before the balance feels stable.&lt;/p&gt;
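&lt;p&gt;A tiny helper makes that sweep mechanical. This sketch (the names are mine, not from the tutorial client) enumerates the neighboring &lt;code&gt;(weight, end_at)&lt;/code&gt; pairs one 0.05 step away from the current pair, clamped to the 0–1 range:&lt;/p&gt;

```python
def sweep_pairs(weight, end_at, step=0.05):
    """Candidate (weight, end_at) pairs one tuning step away from the
    current pair, clamped so a sweep never leaves the 0-1 range."""
    def clamp(v):
        return max(0.0, min(1.0, round(v, 2)))

    pairs = set()
    for dw in (-step, 0.0, step):
        for de in (-step, 0.0, step):
            pairs.add((clamp(weight + dw), clamp(end_at + de)))
    # Drop the current pair itself; only the neighbors are candidates.
    pairs.discard((round(weight, 2), round(end_at, 2)))
    return sorted(pairs)
```

&lt;p&gt;Feed each candidate pair back through the env vars from step 4 and compare the outputs side by side.&lt;/p&gt;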

&lt;p&gt;&lt;strong&gt;6. Validate the workflow JSON with pytest&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pytest &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five tests make sure &lt;code&gt;workflow.json&lt;/code&gt; stays valid JSON, every node class is still referenced, and &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders haven't been accidentally committed with real values.&lt;/p&gt;
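&lt;p&gt;As a sketch of what such a placeholder guard can look like (node id &lt;code&gt;"6"&lt;/code&gt; matches the IPAdapter node in the client snippet later in the post; the function names here are illustrative, not the tutorial's actual test code):&lt;/p&gt;

```python
# The tune placeholder string, assembled via chr() (60 and 62 are the
# angle-bracket characters).
PLACEHOLDER = chr(60) + "tune" + chr(62)


def placeholder_fields(workflow):
    """Yield 'node_id.input_name' for every input still holding the placeholder."""
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            if value == PLACEHOLDER:
                yield node_id + "." + name


def test_template_keeps_placeholders():
    # Illustrative mini-template: node "6" is the IPAdapter node.
    wf = {"6": {"inputs": {"weight": PLACEHOLDER, "end_at": PLACEHOLDER}}}
    assert sorted(placeholder_fields(wf)) == ["6.end_at", "6.weight"]
```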




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;You have a character (Anna) stabilized by a custom LoRA. She appears reasonably consistent across generations. Now the user buys a specific dress in your shop. The dress is a reference image. You want:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Anna's face&lt;/strong&gt; — unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This specific dress&lt;/strong&gt; — rendered faithfully on Anna.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prompt engineering usually can't guarantee this. "Anna wearing a red silk dress with a white collar" generates &lt;em&gt;a&lt;/em&gt; red silk dress, not necessarily &lt;em&gt;this&lt;/em&gt; red silk dress. SKU-level fidelity needs the reference image in the generation path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why naive IP-Adapter breaks the character
&lt;/h2&gt;

&lt;p&gt;IP-Adapter pulls features from a reference image into the model's cross-attention. If you set it too high, it can preserve the reference image aggressively — including &lt;em&gt;its face&lt;/em&gt;, if there is one. Even if the reference is an unworn product shot, IP-Adapter can pull in lighting, backdrop, and styling from the reference photo.&lt;/p&gt;

&lt;p&gt;At high weight: Anna's face may start looking more like whoever (or whatever) is in the reference. Lighting and pose can bias toward the reference.&lt;/p&gt;

&lt;p&gt;At low weight: The character is fine. The dress is approximately the right color and cut but not recognizable as &lt;em&gt;this&lt;/em&gt; dress. Your product catalog becomes decorative rather than accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The balance: moderate weight + early handoff
&lt;/h2&gt;

&lt;p&gt;The two knobs that matter are &lt;strong&gt;weight&lt;/strong&gt; and &lt;strong&gt;end_at&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weight&lt;/strong&gt; — the multiplier on IP-Adapter's contribution to cross-attention. Below the lower-middle of the 0–1 range, the reference is a "mood" more than a fact. Above the upper-middle, the reference dominates. Somewhere in the lower half is where you find the range that preserves item identity without killing face identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;end_at&lt;/strong&gt; — the fraction of denoising steps during which IP-Adapter is active. If it runs through all steps, it has a say in the final face details. If it ends earlier (say 70–90% of the way through), the last steps belong to the rest of the pipeline, and LoRA face features reassert.&lt;/p&gt;

&lt;p&gt;In rough terms: the item gets baked in during the middle of denoising, the face re-sharpens at the end.&lt;/p&gt;
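&lt;p&gt;The arithmetic behind the handoff is simple. Exactly how the node rounds the cutoff is an implementation detail, but the idea is:&lt;/p&gt;

```python
def ipadapter_active_steps(end_at, total_steps):
    """Sampler steps during which IP-Adapter still injects the reference."""
    return int(round(end_at * total_steps))


# With 30 sampler steps and end_at=0.8: IP-Adapter shapes the first 24 steps,
# and the final 6 belong to the LoRA-modified model alone.
```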

&lt;h2&gt;
  
  
  Workflow node order (ComfyUI)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtv84aiq459r1zyduchz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtv84aiq459r1zyduchz.webp" alt="IP-Adapter plus LoRA ComfyUI workflow chain: checkpoint, character LoRA, FreeU, outfit reference image through IP-Adapter, KSampler, and VAE decode" width="800" height="280"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Checkpoint Loader]
  → [LoRA Loader: character_lora]
    → [FreeU: quality touch-up]
      → [IPAdapter Advanced: reference, weight=W, end_at=E]
        → [KSampler]
          → [VAE Decode]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things about this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LoRA comes before IP-Adapter in the chain.&lt;/strong&gt; The LoRA modifies the checkpoint weights; IP-Adapter modifies cross-attention during sampling. When IP-Adapter releases at the &lt;code&gt;end_at&lt;/code&gt; fraction of sampling, the remaining steps operate on the LoRA-modified weights without IP-Adapter influence; this is what lets the face reassert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FreeU is optional.&lt;/strong&gt; It re-weights the UNet's backbone and skip-connection features, which often improves quality at negligible extra compute.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The tutorial client takes the base &lt;code&gt;workflow.json&lt;/code&gt;, rewrites the &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders with env-supplied values, uploads the reference image to ComfyUI, and queues the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rewrite_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fill in the `&amp;lt;tune&amp;gt;` and `&amp;lt;path&amp;gt;` placeholders with actual values.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# deep copy
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ckpt_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lora_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_strength&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strength_clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_strength&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ref_filename&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_at&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFFFFFFFF&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/04-ipadapter/client.py#L55-L77" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/04-ipadapter/workflow.json" rel="noopener noreferrer"&gt;workflow.json&lt;/a&gt; in the tutorial folder ships with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders on every field you should touch. The test suite asserts those placeholders stay in the template — a safety net against accidentally committing your tuned production values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Weight tuning loop
&lt;/h2&gt;

&lt;p&gt;The practical process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a reference item with a clean product photo.&lt;/li&gt;
&lt;li&gt;Pick a character with a strong LoRA.&lt;/li&gt;
&lt;li&gt;Render around &lt;code&gt;weight=0.3, end_at=0.8&lt;/code&gt;. Check face, check item.&lt;/li&gt;
&lt;li&gt;Face drifts → lower weight or lower end_at.&lt;/li&gt;
&lt;li&gt;Item doesn't resemble the reference → raise weight carefully, or leave weight and raise end_at.&lt;/li&gt;
&lt;li&gt;Sweep in 0.05 increments, not 0.1. The usable range is narrower than you'd expect.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Several tuning sweeps on realistic and anime bases usually land you on a working pair.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production integration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Outfit catalog as reference images.&lt;/strong&gt; Each shop item has a reference image stored in object storage. At generation time, pass the reference URL to the GPU worker, which downloads it once and caches.&lt;/p&gt;
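&lt;p&gt;A worker-side cache can be keyed on a hash of the URL. A stdlib-only sketch (names are mine); note that if the same object arrives under rotating presigned URLs, the caller should key on the stable object path instead, or the cache never hits:&lt;/p&gt;

```python
import hashlib
import os
import urllib.request


def cache_key(url):
    """Stable cache filename derived from the URL alone."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".png"


def cached_reference(url, cache_dir="/tmp/ref-cache"):
    """Download a reference image once; later calls for the same URL hit the cache."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```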

&lt;p&gt;&lt;strong&gt;Catalog pre-rendering for previews.&lt;/strong&gt; When a user browses the shop, they see a preview of each item rendered on their active character. These previews don't need to happen on every page load — generate them asynchronously (Celery worker), store in S3, serve from cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency across image and video.&lt;/strong&gt; The same IP-Adapter + LoRA pair used for images can often drive the start-frame of video generation (e.g., Kling). Tune the still-image path first, then reuse it carefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fallback when the item isn't visual.&lt;/strong&gt; Some "items" in a shop are stats buffs, relationship flags, or dialogue unlocks — things without a visual. Gate the IP-Adapter pathway to items flagged as visual-only.&lt;/p&gt;
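&lt;p&gt;That gate is a one-line branch at the API boundary. A minimal sketch, assuming catalog entries carry the &lt;code&gt;visual&lt;/code&gt; flag mentioned later in the post (the injected callables are purely for illustration):&lt;/p&gt;

```python
def route_purchase(item, queue_render, grant_directly):
    """Send visual items to the render queue; grant non-visual items
    (stat buffs, dialogue unlocks) without touching the image pipeline."""
    if item.get("visual") and item.get("reference_image"):
        return queue_render(item)
    return grant_directly(item)
```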

&lt;h2&gt;
  
  
  Production issues that came up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Face drifted on a noticeable slice of catalog previews.&lt;/strong&gt; The cause was an IP-Adapter weight set too high "for stronger outfit adherence." We rolled back to the lower-half range after face-drift complaints spiked. Lesson: tune one variable at a time, even when it feels slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cached reference URLs expired.&lt;/strong&gt; Shop items in S3 had time-limited presigned URLs. Generation workers fetched the URL at queue-time, but the URL expired before ComfyUI actually downloaded it. Fix: pre-fetch on the worker side, pass the ComfyUI-side filename instead of the external URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IP-Adapter model version mismatch with SDXL base.&lt;/strong&gt; IP-Adapter Plus ships multiple weights keyed to specific SDXL base models. Mixing can produce worse output without an obvious runtime error — just lower fidelity. Pin the IP-Adapter version to the base in your deployment config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-visual shop items crashed the workflow.&lt;/strong&gt; The API tried to render "stat boost" items through the image pipeline. Fix: a &lt;code&gt;visual: true|false&lt;/code&gt; flag on catalog entries, checked at the API boundary before queuing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with a clean catalog.&lt;/strong&gt; Reference images with consistent backgrounds, consistent lighting, no model already wearing the item if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version the tuning.&lt;/strong&gt; When you move base models, your IP-Adapter weight/end_at values probably move too. Treat them as part of the deployment, not as constants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache the pre-rendered previews aggressively.&lt;/strong&gt; A character × item grid grows multiplicatively. Pre-render on character creation and on new item add.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat's shop renders outfits, accessories, and gifts on active characters using IP-Adapter Plus layered over per-character LoRA. Public architecture doc: &lt;a href="https://github.com/sm1ck/honeychat/blob/main/docs/architecture.md" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/blob/main/docs/architecture.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter (tencent-ailab)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cubiq/ComfyUI_IPAdapter_plus" rel="noopener noreferrer"&gt;ComfyUI IPAdapter Plus extension&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.11497" rel="noopener noreferrer"&gt;FreeU paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0" rel="noopener noreferrer"&gt;SDXL base model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you've shipped an IP-Adapter + LoRA combo in production, I'm curious what weight / end_at pairs you landed on and for which base. The sweet spot seems to shift meaningfully between anime and realistic bases.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>comfyui</category>
    </item>
    <item>
      <title>Character consistency in AI image generation — where prompts break down and LoRA helps</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:22:02 +0000</pubDate>
      <link>https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b</link>
      <guid>https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Training template:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/03-lora" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/03-lora&lt;/a&gt; — a generic Kohya SDXL config with &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They're not wrong to feel that way. Character identity is part of the product.&lt;/p&gt;

&lt;p&gt;This post is about why that happens, why the obvious fixes (seed-pinning, more prompt detail, reference images) often don't fully solve it, and what class of solution works better.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identical seed + identical prompt + different batch size = different face.&lt;/strong&gt; Seeds only reproduce output when every other sampling parameter, batch size included, stays identical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt detail plateaus fast.&lt;/strong&gt; Past a certain tag count, the model interpolates anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reference image (IP-Adapter) works but can bleed stylistic features&lt;/strong&gt; — outfit, lighting, background — into generations where you only wanted identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom LoRA per character makes identity much more stable&lt;/strong&gt; by encoding it at the weights level instead of relying only on prompt text.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Train your own character LoRA — the short walkthrough
&lt;/h2&gt;

&lt;p&gt;LoRA training is GPU-heavy and doesn't belong in a docker-compose, so the tutorial folder at &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/03-lora" rel="noopener noreferrer"&gt;tutorial/03-lora&lt;/a&gt; ships the &lt;strong&gt;config template and recipe&lt;/strong&gt;. You bring the GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Get a GPU&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;24 GB of VRAM (RTX 3090/4090) fits SDXL LoRA training at batch size 2–4 comfortably. Don't own one? Rent a GPU by the hour: Vast.ai, RunPod, Modal, Paperspace, Lambda. A full training run costs a few dollars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install Kohya_ss&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/kohya_ss &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Grab the template&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; honeychat/tutorial/03-lora ./my-character-lora
&lt;span class="nb"&gt;cd &lt;/span&gt;my-character-lora
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Prepare your dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Drop 15–30 varied images of your subject into &lt;code&gt;dataset/train/5_character/&lt;/code&gt; (the &lt;code&gt;5_&lt;/code&gt; is the repeat count). For each image, create a same-named &lt;code&gt;.txt&lt;/code&gt; caption describing the &lt;em&gt;scene&lt;/em&gt; — not the character. See &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/03-lora/dataset/README.md" rel="noopener noreferrer"&gt;dataset/README.md&lt;/a&gt; for the full curation checklist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Fill the &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; slots in &lt;code&gt;kohya-config.toml&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each &lt;code&gt;&amp;lt;tune&amp;gt;&lt;/code&gt; with a real value. The safety check in &lt;code&gt;train.sh&lt;/code&gt; will refuse to run if any placeholder remains.&lt;/p&gt;
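&lt;p&gt;The check amounts to scanning the config for leftover placeholders before launching. Sketched in Python (the actual &lt;code&gt;train.sh&lt;/code&gt; guard may be implemented differently):&lt;/p&gt;

```python
# The tune placeholder string, assembled via chr() (60 and 62 are the
# angle-bracket characters).
PLACEHOLDER = chr(60) + "tune" + chr(62)


def unfilled_slots(config_text):
    """1-based line numbers in the config that still contain a placeholder."""
    return [i for i, line in enumerate(config_text.splitlines(), start=1)
            if PLACEHOLDER in line]
```

&lt;p&gt;If &lt;code&gt;unfilled_slots(...)&lt;/code&gt; is non-empty, refuse to start the run and print the offending line numbers.&lt;/p&gt;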

&lt;p&gt;&lt;strong&gt;6. Train&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;KOHYA_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/kohya_ss
bash train.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The checkpoint lands at &lt;code&gt;./output/&amp;lt;your-character&amp;gt;.safetensors&lt;/code&gt;. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why "same prompt, same face" doesn't hold
&lt;/h2&gt;

&lt;p&gt;Users naturally assume this works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"anime girl, long silver hair, green eyes, Arknights operator outfit"
+ seed=12345
→ Anna, always. Or so it seems.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not reliably. Three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch size changes the output.&lt;/strong&gt; In most Stable Diffusion runs, &lt;code&gt;batch_size=1&lt;/code&gt; and &lt;code&gt;batch_size=4&lt;/code&gt; with the same seed produce &lt;em&gt;different&lt;/em&gt; images for position 0, because the sampler's RNG state depends on the batch dimension.&lt;/p&gt;
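
&lt;p&gt;A toy numpy model of an ancestral-style sampler shows the mechanism: every step draws fresh noise from one shared RNG stream, so a bigger batch advances the stream faster and image 0 diverges after the first step. (This sketches the mechanism only, not any particular sampler's exact RNG layout.)&lt;/p&gt;

```python
import numpy as np

def step_noises(batch, steps, d=4, seed=12345):
    # One shared generator, fresh noise drawn at every sampler step.
    rng = np.random.default_rng(seed)
    return [rng.normal(size=(batch, d)) for _ in range(steps)]

solo = step_noises(batch=1, steps=2)
batched = step_noises(batch=4, steps=2)

# Step 0: image 0 gets identical noise either way (arrays fill in draw order).
assert (solo[0][0] == batched[0][0]).all()
# Step 1: the batch has already consumed 4x more of the stream,
# so image 0 now sees different noise. Same seed, different image.
assert not (solo[1][0] == batched[1][0]).all()
```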

&lt;p&gt;&lt;strong&gt;Provider-side sampler drift.&lt;/strong&gt; If you're calling a managed API (fal.ai, Replicate, Together), provider-side changes — model updates, sampler tweaks, default parameter shifts — can produce visually different outputs across weeks. Your "locked" character can drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt detail saturates.&lt;/strong&gt; At some point, adding more tags ("sharp nose, high cheekbones, narrow eyes, specific mole position") stops helping much. The model has a rough template and interpolates within it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The in-between fix that doesn't quite work: IP-Adapter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter&lt;/a&gt; lets you pass a reference image alongside the prompt. The model bakes the reference's features into the cross-attention. For product photography, excellent.&lt;/p&gt;

&lt;p&gt;For character identity, it has a practical drawback: &lt;strong&gt;IP-Adapter can carry stylistic baggage&lt;/strong&gt;. A reference photo with specific lighting, pose, outfit, and background can bleed those into the generated image. You can turn the weight down, but then identity may weaken; turn it up, and the reference can dominate.&lt;/p&gt;

&lt;p&gt;IP-Adapter is a good fit when the &lt;em&gt;reference&lt;/em&gt; is what you want preserved (e.g., rendering a shop item on a character — next post in the series). It's usually a poor fit when what you want preserved is only the face.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: custom LoRA per character
&lt;/h2&gt;

&lt;p&gt;A LoRA (Low-Rank Adaptation) is a small set of additional weights layered on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights themselves, not into the prompt.&lt;/p&gt;
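
&lt;p&gt;Numerically, a LoRA adds a low-rank delta &lt;code&gt;B @ A&lt;/code&gt; on top of each frozen weight matrix, which is why the files stay small. A toy numpy sketch (the width, rank, and alpha are illustrative, not SDXL's real shapes):&lt;/p&gt;

```python
import numpy as np

d, r, alpha = 1280, 16, 16           # layer width, LoRA rank, scaling (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trained
B = np.zeros((d, r))                 # starts at zero: an untrained LoRA is a no-op

W_eff = W + (alpha / r) * (B @ A)    # what inference actually uses

full, delta = d * d, 2 * d * r
print(f"base layer: {full:,} params; LoRA delta: {delta:,} ({delta/full:.1%})")
# → base layer: 1,638,400 params; LoRA delta: 40,960 (2.5%)
```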

&lt;p&gt;Inference pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# base SDXL model
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LoRA: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# the character's custom LoRA
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FreeU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# quality touch-up
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KSampler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# actual diffusion
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Anna is much more likely to stay Anna across pose, outfit, and lighting changes. The face is represented in the weights, not only in the words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training a character LoRA (public-friendly template)
&lt;/h3&gt;

&lt;p&gt;The conceptual shape of the training job using the publicly available &lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss SDXL trainer&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kohya_ss SDXL LoRA training config — generic template&lt;/span&gt;
&lt;span class="c"&gt;# Replace every &amp;lt;tune&amp;gt; value based on your dataset and base model.&lt;/span&gt;
&lt;span class="c"&gt;# See Kohya docs for the full parameter reference.&lt;/span&gt;

&lt;span class="nn"&gt;[model_arguments]&lt;/span&gt;
&lt;span class="py"&gt;pretrained_model_name_or_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;path/to/sdxl-base-or-finetune.safetensors&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[dataset_arguments]&lt;/span&gt;
&lt;span class="py"&gt;train_data_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./dataset/train"&lt;/span&gt;
&lt;span class="py"&gt;resolution&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1024,1024"&lt;/span&gt;
&lt;span class="py"&gt;caption_extension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".txt"&lt;/span&gt;

&lt;span class="nn"&gt;[training_arguments]&lt;/span&gt;
&lt;span class="py"&gt;output_dir&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./output"&lt;/span&gt;
&lt;span class="py"&gt;output_name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;your_character_v1&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;save_model_as&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"safetensors"&lt;/span&gt;

&lt;span class="c"&gt;# Training steps and batch — VRAM-bound. Tune for your hardware.&lt;/span&gt;
&lt;span class="py"&gt;learning_rate&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;max_train_steps&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;train_batch_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[network_arguments]&lt;/span&gt;
&lt;span class="py"&gt;network_module&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"networks.lora"&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;tune&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/03-lora/kohya-config.toml" rel="noopener noreferrer"&gt;full template on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal "best" setting.&lt;/p&gt;

&lt;p&gt;What to actually optimize for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset quality over dataset size.&lt;/strong&gt; 20 clean, varied, captioned images beat 100 messy ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Varied pose and lighting&lt;/strong&gt;, constant face. Same angle 30 times teaches "this angle," not "this character."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean captions&lt;/strong&gt; — describe the scene, not the character. "Woman standing in a garden" is better than "Anna standing in a garden" because you want the model to learn the face from context, not from the token.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A rank matched to face detail.&lt;/strong&gt; Lower ranks underfit the identity; higher ranks overfit and kill flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Marginal cost: usually manageable
&lt;/h2&gt;

&lt;p&gt;If you're running inference on a rented or owned GPU, training one character LoRA is a one-time cost usually measured in minutes to hours of GPU time, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation, not just training compute.&lt;/p&gt;

&lt;p&gt;This is why a LoRA-per-character pipeline can be viable for products with many characters: once the pipeline exists, adding a new character is mostly a dataset and QA exercise, not a research project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production concerns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LoRA hot-swapping.&lt;/strong&gt; Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset hygiene.&lt;/strong&gt; LoRAs memorize whatever's in the dataset. Enforce licensing upstream — the LoRA is downstream of the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage at scale.&lt;/strong&gt; LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.&lt;/p&gt;
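
&lt;p&gt;The pinning logic itself is just an LRU cache keyed by character. A minimal sketch of the idea (names are hypothetical, and a real worker would load weights rather than strings):&lt;/p&gt;

```python
from collections import OrderedDict

class LoraCache:
    """Keep the N hottest LoRAs resident; evict the least recently used."""
    def __init__(self, capacity=8, loader=None):
        self.capacity = capacity
        self.loader = loader or (lambda name: f"weights:{name}")
        self._cache = OrderedDict()

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)        # mark as hot
        else:
            if len(self._cache) == self.capacity:
                self._cache.popitem(last=False)  # evict the coldest entry
            self._cache[name] = self.loader(name)  # cold load: the slow path
        return self._cache[name]
```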

&lt;p&gt;&lt;strong&gt;Face ≠ body.&lt;/strong&gt; A LoRA trained on face crops will not lock body proportions. Include full-body shots in the dataset if you need full-body consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ship the LoRA pipeline from day 1&lt;/strong&gt;, even for three characters. Inconsistent visuals in the free tier can hurt activation before users ever see the stronger parts of the product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curate datasets manually, don't scrape.&lt;/strong&gt; Five iterations of a hand-picked set of 20 images beat a scraped 200.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store base-model version with each LoRA.&lt;/strong&gt; When you update the base, you need to know which LoRAs need retraining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version LoRAs (v1, v2) and keep old versions live.&lt;/strong&gt; If v2 ships with a regression, roll back per-character without reverting a whole release.&lt;/li&gt;
&lt;/ul&gt;
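
&lt;p&gt;The last two points collapse into one small registry record per LoRA. A hedged sketch (field names are made up for illustration):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoraRecord:
    character: str
    version: int
    base_model: str   # which base checkpoint this LoRA was trained on
    uri: str          # object-storage location
    live: bool = True

def pick(records, character):
    """Serve the newest live version; older live versions stay as rollbacks."""
    live = [r for r in records if r.character == character and r.live]
    return max(live, key=lambda r: r.version)
```

&lt;p&gt;If v2 regresses, flip its &lt;code&gt;live&lt;/code&gt; flag and &lt;code&gt;pick()&lt;/code&gt; serves v1 again with no redeploy.&lt;/p&gt;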

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat uses custom LoRA per character for visual identity in image and video generation. Public architecture: &lt;a href="https://github.com/sm1ck/honeychat" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Previous: &lt;a href="https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami"&gt;LLM routing per tier via OpenRouter&lt;/a&gt;.&lt;br&gt;
Next: &lt;a href="https://dev.to/sm1ck/ip-adapter-lora-for-product-catalog-rendering-putting-shop-items-on-ai-characters-5h36"&gt;IP-Adapter Plus for a product catalog&lt;/a&gt; — how to put arbitrary shop items on a character while keeping the character's face locked.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;LoRA paper — Hu et al.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss SDXL training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tencent-ailab/IP-Adapter" rel="noopener noreferrer"&gt;IP-Adapter (for comparison)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0" rel="noopener noreferrer"&gt;SDXL base model&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you've trained character LoRAs in production and have opinions on rank selection or caption strategy, I'd love to hear them in the comments. There's very little public writing on this outside the anime generation community.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>LLM routing per tier via OpenRouter — when one model doesn't fit all</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Tue, 21 Apr 2026 23:50:29 +0000</pubDate>
      <link>https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami</link>
      <guid>https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Full runnable example:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/02-routing" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/02-routing&lt;/a&gt; — &lt;code&gt;docker compose up&lt;/code&gt; exposes &lt;code&gt;POST /complete&lt;/code&gt; on localhost:8000. Every snippet below is pulled from that repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most introductory "chat with AI" tutorials pick one model and call it a day. That works in a toy. It stops being enough in production, where users have different price sensitivity, different conversation styles, and different expectations for what the product should allow.&lt;/p&gt;

&lt;p&gt;Here's how to route LLM calls across a handful of providers via &lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;, how that routing handles &lt;code&gt;finish_reason=content_filter&lt;/code&gt; empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Route by &lt;strong&gt;tier&lt;/strong&gt; (price elasticity) &lt;em&gt;and&lt;/em&gt; by &lt;strong&gt;content mode&lt;/strong&gt; (what kind of turn this is). A single default model can't do both.&lt;/li&gt;
&lt;li&gt;Some reasoning/model-provider combinations can return &lt;code&gt;finish_reason=content_filter&lt;/code&gt; with an empty completion on borderline prompts. A retry policy that only catches HTTP errors can miss this.&lt;/li&gt;
&lt;li&gt;The working pattern: &lt;code&gt;primary → different-provider fallback → specialized last resort&lt;/code&gt;, with retries triggered by both error responses &lt;em&gt;and&lt;/em&gt; suspicious empty completions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run it yourself in 3 minutes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and configure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/02-routing
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt;, paste your &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; (&lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;get one here&lt;/a&gt;). The three default model slots all point to free-tier OpenRouter models so you can experiment without spending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start the service&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
curl http://localhost:8000/health   &lt;span class="c"&gt;# {"ok":true}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Send a normal turn — primary answers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/complete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apples, pears, and cloudberries..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"meta-llama/llama-3.1-8b-instruct:free"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"attempt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"used_fallback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;attempt: 0&lt;/code&gt; means the primary model answered. &lt;code&gt;used_fallback: false&lt;/code&gt; means no retry was needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Force a fallback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/complete &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'content-type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'.model, .attempt, .used_fallback'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;attempt: 1&lt;/code&gt; (or 2) — the next rung answered. In production, log this metric: a rising fallback rate for a class of content means it's time to move that class to a different primary, not to tweak retry logic.&lt;/p&gt;
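
&lt;p&gt;Tracking that rate per content class is a few lines of bookkeeping. A sketch of the counter (class labels are illustrative):&lt;/p&gt;

```python
from collections import Counter

class FallbackMeter:
    """Track what fraction of turns in each content class needed a fallback."""
    def __init__(self):
        self.total = Counter()
        self.fell_back = Counter()

    def record(self, content_class, attempt):
        self.total[content_class] += 1
        if attempt:  # attempt 0 means the primary answered
            self.fell_back[content_class] += 1

    def rate(self, content_class):
        t = self.total[content_class]
        return self.fell_back[content_class] / t if t else 0.0
```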

&lt;p&gt;&lt;strong&gt;5. Run the unit tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pytest &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seven tests cover the failure modes in this chain: empty &lt;code&gt;content_filter&lt;/code&gt; completions, transient 5xx errors, non-transient 4xx errors, and the all-models-fail case.&lt;/p&gt;

&lt;p&gt;With the service running and the tests green, the rest of this post explains why the chain is shaped this way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why one model doesn't fit all
&lt;/h2&gt;

&lt;p&gt;Three distinct pressures push against a single-model setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Price elasticity by tier.&lt;/strong&gt; A free user generating 20 messages a day at flagship-model prices burns real money every month and brings in zero revenue. A paying top-tier user sending the same 20 messages may reasonably expect higher quality. The unit economics do not agree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content mode.&lt;/strong&gt; Mainstream-aligned models can refuse content that some legitimate companion/roleplay products allow on paid tiers. Conversely, less-restrictive models can have weaker long-context coherence. The right model depends on the turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. depth.&lt;/strong&gt; Instant conversational turns need sub-3-second responses. Long scene-writing turns can tolerate 10+ seconds for better prose. Hardcoding a single model optimizes for one and sacrifices the other.&lt;/p&gt;
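
&lt;p&gt;These pressures are why the routing key is a &lt;code&gt;(tier, mode)&lt;/code&gt; pair rather than a single model name. A minimal sketch of the lookup (the paid-tier model IDs are placeholders, not recommendations):&lt;/p&gt;

```python
# (tier, mode) → primary model. Fallback chains hang off each primary.
ROUTES = {
    ("free", "chat"):  "meta-llama/llama-3.1-8b-instruct:free",
    ("plus", "chat"):  "placeholder/mid-tier-chat-model",
    ("plus", "scene"): "placeholder/long-form-writer-model",
}

def pick_model(tier, mode):
    # Fall back along mode, then tier, so an unknown pair still resolves.
    return (ROUTES.get((tier, mode))
            or ROUTES.get((tier, "chat"))
            or ROUTES[("free", "chat")])
```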

&lt;h2&gt;
  
  
  The reasoning-model empty-completion edge case
&lt;/h2&gt;

&lt;p&gt;This is the one that cost me a full afternoon to diagnose.&lt;/p&gt;

&lt;p&gt;Some reasoning-class model/provider combinations do server-side moderation or filtering before returning a final answer. On borderline turns, they may not return an HTTP error. Instead, they can return a valid response with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"content_filter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Empty string. No exception. No status code to check. If you don't guard for it, your user sees a blank reply.&lt;/p&gt;

&lt;p&gt;If your retry logic only triggers on &lt;code&gt;httpx.HTTPStatusError&lt;/code&gt;, this can pass through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The guard
&lt;/h2&gt;

&lt;p&gt;The whole failure mode is caught by a tiny function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_silent_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finish_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/02-routing/app/router.py#L64-L73" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilient fallback chain
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbeynrwjlgh7bsgzvb90.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbeynrwjlgh7bsgzvb90.webp" alt="LLM routing fallback chain: a chat turn tries a tier-specific primary model, retries on a different-provider fallback after empty content_filter responses, then falls back to a specialized last resort" width="800" height="373"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CompletionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the fallback chain. Return the first usable response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;_build_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPStatusError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadTimeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_is_silent_refusal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CompletionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AllModelsFailedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no model returned usable content; tried &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/02-routing/app/router.py#L90-L128" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two details worth calling out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Empty content check is separate from the finish reason.&lt;/strong&gt; Some models can return &lt;code&gt;finish_reason=stop&lt;/code&gt; with empty content when they refuse. Always check &lt;code&gt;not content.strip()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track which model ultimately answered.&lt;/strong&gt; Log &lt;code&gt;attempt &amp;gt; 0&lt;/code&gt; as a fallback event. If your primary fails 10% of the time on a class of content, that's a routing decision, not a retry problem — move that content to a different primary.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Picking the fallback order
&lt;/h2&gt;

&lt;p&gt;For a permissive roleplay mode, the shape looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content-mode primary   → first model for this type of turn
  ↓ (on failure / empty)
diff-provider fallback → avoids the same upstream failure mode
  ↓
specialized last resort
  ↓
abort — ask the user to try a shorter or clearer prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ordering rule: &lt;strong&gt;different-provider fallbacks&lt;/strong&gt;. If the primary is hosted on provider A and fails for a provider-side reason, prefer a fallback hosted on provider B. Same-provider fallbacks can fail on the same content because the provider's moderation layer may be upstream of the model. OpenRouter makes this easier because each model's provider metadata is visible.&lt;/p&gt;
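&lt;p&gt;One way to enforce that rule mechanically is a greedy reorder so that no two adjacent entries in the chain share a provider. The chain, model names, and provider labels below are made up for illustration:&lt;/p&gt;

```python
def diversify(chain: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Greedily reorder (model, provider) pairs so adjacent entries use
    different providers where possible. A sketch, not the repo's code."""
    out: list[tuple[str, str]] = []
    rest = list(chain)
    while rest:
        # Prefer a model hosted somewhere other than the last pick's provider;
        # if none exists, fall back to the first remaining model.
        pick = next((m for m in rest if not out or m[1] != out[-1][1]), rest[0])
        rest.remove(pick)
        out.append(pick)
    return out

# Hypothetical chain: two models on provider "a", one on provider "b".
chain = [("a/model-x", "a"), ("a/model-y", "a"), ("b/model-z", "b")]
```

&lt;p&gt;Running &lt;code&gt;diversify(chain)&lt;/code&gt; interleaves providers, so a provider-side moderation failure on the primary never retries against the same upstream first.&lt;/p&gt;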

&lt;h2&gt;
  
  
  Content-level gating happens before the LLM, not after
&lt;/h2&gt;

&lt;p&gt;The fallback chain handles &lt;em&gt;model-level&lt;/em&gt; refusals. But if the user's intent is clearly above your product's content ceiling, retrying on a more permissive model just burns extra tokens before the user hits the real limit. Gate the content level in your system prompt assembly — don't rely on the model to enforce policy.&lt;/p&gt;

&lt;p&gt;Keep the tier-level policy simple: the escalation class (detected from user intent) must be &lt;code&gt;≤&lt;/code&gt; the user's plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists — it just gets a system prompt with the right constraints for this turn.&lt;/p&gt;
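&lt;p&gt;The tier check itself can stay a one-liner over an ordered enum. The level names and helper below are illustrative stand-ins, not HoneyChat's actual tiers:&lt;/p&gt;

```python
from enum import IntEnum

class Level(IntEnum):
    # Illustrative escalation classes; only the ordering matters.
    NEUTRAL = 0
    SUGGESTIVE = 1
    EXPLICIT = 2

def within_ceiling(detected: Level, plan_ceiling: Level) -> bool:
    """True if this turn's detected escalation class fits the user's plan."""
    return plan_ceiling >= detected

def effective_level(detected: Level, plan_ceiling: Level) -> Level:
    """Over the ceiling: clamp to the allowed level so the character can
    still respond in-character while the bot sends the upsell."""
    return min(detected, plan_ceiling)
```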

&lt;h2&gt;
  
  
  Instrumentation that matters
&lt;/h2&gt;

&lt;p&gt;Log three things per LLM call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model that answered&lt;/strong&gt; (primary or fallback index)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to first token&lt;/strong&gt; vs &lt;strong&gt;total time&lt;/strong&gt; — a slow first token points at provider-side queueing or prompt processing; a long tail after it is just generation length&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost&lt;/strong&gt; (input + output) per message, bucketed by tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Costs are tracked in Redis counters with short TTLs — a daily sum and a per-user daily sum. A global daily ceiling blocks new generations once spend crosses a configured threshold (fail-closed: if the counter is unreachable, block rather than pass). This capped a runaway generation loop at a known ceiling.&lt;/p&gt;
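&lt;p&gt;The fail-closed check reduces to a small function once the counter read is injectable. The ceiling value and names here are placeholders:&lt;/p&gt;

```python
DAILY_CEILING_USD = 50.0  # placeholder threshold

def spend_allows(read_counter, est_cost: float) -> bool:
    """Fail-closed daily ceiling check.

    read_counter is any callable returning today's spend so far (e.g. a
    Redis GET); it may raise if the store is unreachable. A sketch, with
    names and the ceiling value chosen for illustration.
    """
    try:
        spent = float(read_counter() or 0.0)
    except Exception:
        # Fail-closed: if the counter is unreachable, block the generation.
        return False
    return DAILY_CEILING_USD >= spent + est_cost
```

&lt;p&gt;Injecting the read as a callable also makes the fail-closed branch trivially testable — hand it a function that raises and assert the block.&lt;/p&gt;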

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Route by content mode from day 1&lt;/strong&gt;, not as an afterthought. Retrofitting the split into an existing handler is painful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument the silent-refusal rate&lt;/strong&gt;. It may be rare, but you won't know unless you measure it specifically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't share a single OpenRouter key across environments.&lt;/strong&gt; Rate limits are per-key and dev noise eats prod quota.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish the tier → model map in your public docs.&lt;/strong&gt; Users comparing products care. Competitors already know. Keeping the docs in sync with the code forces alignment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat's LLM router sits behind the chat handler on both the Telegram bot and the web app. Public architecture: &lt;a href="https://github.com/sm1ck/honeychat/blob/main/docs/architecture.md" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/blob/main/docs/architecture.md&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Previous in the series: &lt;a href="https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k"&gt;dual-layer memory with Redis + ChromaDB&lt;/a&gt;.&lt;br&gt;
Next: &lt;a href="https://dev.to/sm1ck/character-consistency-in-ai-image-generation-where-prompts-break-down-and-lora-helps-320b"&gt;character consistency with custom LoRA&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/models" rel="noopener noreferrer"&gt;OpenRouter model list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/api-reference/chat/object" rel="noopener noreferrer"&gt;Chat Completions &lt;code&gt;finish_reason&lt;/code&gt; semantics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Curious how others have solved the silent-refusal pattern. If you've hit it on a different provider, drop a comment — I want to know which models ship which behavior.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>openrouter</category>
    </item>
    <item>
      <title>Building an AI companion with persistent memory — Redis + ChromaDB</title>
      <dc:creator>sm1ck</dc:creator>
      <pubDate>Mon, 20 Apr 2026 12:16:42 +0000</pubDate>
      <link>https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k</link>
      <guid>https://dev.to/sm1ck/building-an-ai-companion-with-persistent-memory-redis-chromadb-4i8k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📦 Full runnable example:&lt;/strong&gt; &lt;a href="https://github.com/sm1ck/honeychat/tree/main/tutorial/01-memory" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat/tree/main/tutorial/01-memory&lt;/a&gt; — clone, &lt;code&gt;docker compose up&lt;/code&gt;, chat with the demo bot on Telegram. Every code snippet below is pulled from that repo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT's Memory is built for useful preferences and details, not verbatim replay of long sessions.&lt;/p&gt;

&lt;p&gt;I wanted a chat companion with &lt;strong&gt;practical persistent memory&lt;/strong&gt; — not just the current conversation, but older facts and events surfaced when they matter. Here's the architecture that worked well for this use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot layer (Redis)&lt;/strong&gt; — recent messages per conversation, short TTL, low-latency reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold layer (ChromaDB)&lt;/strong&gt; holds &lt;em&gt;summaries of chunks&lt;/em&gt;, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document. Keeps the vector index tiny, queries fast.&lt;/li&gt;
&lt;li&gt;On every user message, three retrieval paths fire in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt;: recent buffer, latest summary, top-K semantic search. All three get assembled into the system prompt.&lt;/li&gt;
&lt;li&gt;Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run it yourself in 5 minutes
&lt;/h2&gt;

&lt;p&gt;Before the architectural deep-dive, boot the demo so you can poke the memory layers live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and enter the folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sm1ck/honeychat
&lt;span class="nb"&gt;cd &lt;/span&gt;honeychat/tutorial/01-memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Configure two tokens&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;.env&lt;/code&gt; and fill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TELEGRAM_BOT_TOKEN&lt;/code&gt; — get it from &lt;a href="https://t.me/BotFather" rel="noopener noreferrer"&gt;@BotFather&lt;/a&gt; (30 seconds: &lt;code&gt;/newbot&lt;/code&gt;, pick a name, copy the token)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; — from &lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;openrouter.ai/keys&lt;/a&gt;. The default &lt;code&gt;LLM_MODEL&lt;/code&gt; is a free-tier Llama 3.1 8B so you don't spend a cent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Start the stack&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; bot       &lt;span class="c"&gt;# watch the bot come alive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four containers: &lt;code&gt;redis&lt;/code&gt;, &lt;code&gt;chromadb&lt;/code&gt;, &lt;code&gt;api&lt;/code&gt; (FastAPI inspector on &lt;code&gt;localhost:8000&lt;/code&gt;), &lt;code&gt;bot&lt;/code&gt; (your Telegram bot polling).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Talk to your bot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open it on Telegram, hit &lt;code&gt;/start&lt;/code&gt;, chat for 10–20 turns. Tell it things about yourself. Come back later and reference something you said earlier — it'll pull it from ChromaDB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Peek at what each layer holds&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replace 12345 with your own Telegram user ID (ask @userinfobot)&lt;/span&gt;
curl http://localhost:8000/memory/12345/demo/recent  | jq
curl http://localhost:8000/memory/12345/demo/summary | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;recent&lt;/code&gt; shows the raw Redis buffer. &lt;code&gt;summary&lt;/code&gt; shows the latest ChromaDB document.&lt;/p&gt;

&lt;p&gt;With the demo running, the rest of this post explains what you just booted.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why rolling summaries alone don't work
&lt;/h2&gt;

&lt;p&gt;A common pattern for chatbot memory is a rolling summary — every N messages, regenerate a compressed version of older context. It's cheap. It's also &lt;strong&gt;lossy in a very specific way&lt;/strong&gt;: nuance dies in repeated compression.&lt;/p&gt;

&lt;p&gt;Walk it through three regenerations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1: "She said she hates her boss because he takes credit for her work"
Turn 2 summary: "User mentioned workplace frustration with manager"
Turn 3 summary: "User has job-related stress"
Turn 4 summary: "User has a job"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By turn 4, the &lt;em&gt;reason&lt;/em&gt; is gone. A companion bot starts sounding generic. The fix used here: &lt;strong&gt;keep raw recent messages verbatim&lt;/strong&gt; and only summarize chunks that are genuinely old, while still being able to semantically retrieve any summary from the full history when the current conversation calls back to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7qpvljz5wjeh349kel.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkh7qpvljz5wjeh349kel.webp" alt="Dual-layer memory architecture: Redis recent buffer and ChromaDB summaries retrieved in parallel before LLM prompt assembly" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched. Reads from both happen in parallel on every message.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hot layer — Redis
&lt;/h2&gt;

&lt;p&gt;Each &lt;code&gt;(user_id, character_id)&lt;/code&gt; conversation is stored as a bounded Redis list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rpush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;HOT_BUFFER_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;HOT_BUFFER_TTL_DAYS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/01-memory/app/memory.py#L75-L89" rel="noopener noreferrer"&gt;full source on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three things matter here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ltrim&lt;/code&gt; on every write.&lt;/strong&gt; The list is bounded. Memory per user is O(1), not O(conversation length).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL extended on every write.&lt;/strong&gt; Inactive users' history evicts automatically. Configure Redis with &lt;code&gt;allkeys-lru&lt;/code&gt; so overflow evicts instead of refusing writes — &lt;code&gt;noeviction&lt;/code&gt; is the default and it's a footgun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelined writes.&lt;/strong&gt; &lt;code&gt;rpush + ltrim + expire&lt;/code&gt; in one round trip.&lt;/li&gt;
&lt;/ol&gt;
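&lt;p&gt;The read side is an &lt;code&gt;LRANGE&lt;/code&gt; plus tolerant decoding. A sketch of the decode step — skipping corrupt entries rather than failing the turn is my assumption about the sensible behavior, not necessarily the repo's:&lt;/p&gt;

```python
import json

def decode_recent(raw_items: list) -> list[dict]:
    """Turn an LRANGE result (bytes or str JSON blobs) back into message
    dicts, skipping corrupt entries rather than failing the whole turn."""
    out = []
    for item in raw_items:
        try:
            out.append(json.loads(item))
        except (json.JSONDecodeError, TypeError):
            continue  # one bad entry shouldn't take down the reply
    return out
```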

&lt;h2&gt;
  
  
  The cold layer — ChromaDB with summaries, not messages
&lt;/h2&gt;

&lt;p&gt;A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume, and individual messages are often too short or context-free to retrieve meaningfully ("yeah" returns a lot of "yeah" matches).&lt;/p&gt;

&lt;p&gt;Instead: &lt;strong&gt;embed LLM-generated summaries of chunks&lt;/strong&gt;. Every N bot turns, compress the window via a cheap LLM and write it as one document to a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.&lt;/p&gt;
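&lt;p&gt;The summarization call is just prompt assembly plus one cheap-LLM request. The prompt wording below is illustrative, not the repo's actual prompt; the point is instructing the model to keep concrete facts verbatim, since those are exactly what repeated compression loses:&lt;/p&gt;

```python
def chunk_to_summary_prompt(messages: list[dict]) -> str:
    """Build the cheap-LLM prompt for one summarization window.

    Prompt wording is a sketch; messages follow the {role, content} shape
    stored in the hot buffer.
    """
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return (
        "Summarize this conversation chunk in 3-5 sentences. "
        "Keep concrete facts (names, reasons, dates, preferences) verbatim.\n\n"
        + transcript
    )
```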

&lt;h2&gt;
  
  
  Retrieval — three paths in parallel
&lt;/h2&gt;

&lt;p&gt;On every user message, the chat handler fires three reads in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parallel fire the three reads. Returns everything the handler needs.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;get_recent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;get_latest_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;get_relevant_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;char_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/sm1ck/honeychat/blob/main/tutorial/01-memory/app/memory.py#L163-L173" rel="noopener noreferrer"&gt;full source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fast path for the summary hits Redis. The slower path queries ChromaDB only when the Redis cache has expired, then writes the result back so the next call is hot again.&lt;/p&gt;
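&lt;p&gt;That read-through pattern, including the empty-value guard from the production-issues list below, fits in a few lines. All the callables are stand-ins for the actual Redis and ChromaDB calls:&lt;/p&gt;

```python
SUMMARY_TTL_SECONDS = 3 * 86400  # 3 days, matching the TTL mentioned below

def read_through(cache_get, cache_set, slow_fetch, key: str) -> str:
    """Read-through sketch: try the hot cache, fall back to the cold store,
    write back so the next call is hot. Callables stand in for Redis/Chroma."""
    val = cache_get(key)
    if val:
        return val
    val = slow_fetch(key)
    if val:
        # Guard: never cache an empty summary (e.g. after an LLM rate limit),
        # or the empty string sticks around for the full TTL.
        cache_set(key, val, SUMMARY_TTL_SECONDS)
    return val
```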

&lt;h2&gt;
  
  
  Production issues that came up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Double-summarize race.&lt;/strong&gt; Two concurrent messages for the same pair both trigger summarization, writing overlapping summaries. Fix: per-key task tracking, cancel the pending task if a new one fires.&lt;/p&gt;
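&lt;p&gt;The per-key cancel-and-replace fix can be sketched like this — the registry name and callback wiring are assumptions, not the repo's exact code:&lt;/p&gt;

```python
import asyncio

_pending: dict = {}  # key -> in-flight summarization task

def schedule_summary(key: str, coro) -> asyncio.Task:
    """Cancel any pending summarization for this key, then schedule a new
    one. A sketch of the cancel-and-replace pattern."""
    old = _pending.get(key)
    if old is not None and not old.done():
        old.cancel()  # the newer window supersedes the overlapping one
    task = asyncio.ensure_future(coro)
    _pending[key] = task
    # Drop the registry entry when the task finishes, but only if it is
    # still the current task for this key.
    task.add_done_callback(
        lambda t: _pending.pop(key, None) if _pending.get(key) is t else None
    )
    return task
```

&lt;p&gt;With this in place, two concurrent triggers for the same &lt;code&gt;(user, character)&lt;/code&gt; pair resolve to exactly one summary write.&lt;/p&gt;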

&lt;p&gt;&lt;strong&gt;User clears history mid-summarize.&lt;/strong&gt; A user hits "reset chat" while a summary is in flight. The summary then writes to a collection that just got deleted. Fix: re-check &lt;code&gt;r.exists(key)&lt;/code&gt; before writing; bail if the list is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty summaries cached.&lt;/strong&gt; LLM rate-limited, returned empty content — and I was caching the empty string with a 3-day TTL. Fix: &lt;code&gt;if summary:&lt;/code&gt; guard before &lt;code&gt;setex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChromaDB collection doesn't exist for new users.&lt;/strong&gt; &lt;code&gt;col.query&lt;/code&gt; raises on a non-existent collection. Wrap in try/except and return empty — normal for a user's first few messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skip pgvector for this shape of workload.&lt;/strong&gt; I spent two weeks on it first; for my short-query summaries, recall was worse than with ChromaDB, and the reindexing pain wasn't worth it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't embed per message.&lt;/strong&gt; Index exploded, recall didn't improve. Summary-level is the right granularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize fixed-size windows, not time-based batches.&lt;/strong&gt; Daily summaries are useless for users who chatted 500 times in one day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the cancellation pattern from day 1.&lt;/strong&gt; Race conditions around user actions (clear history, switch character) became one of the top sources of production bugs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this lives
&lt;/h2&gt;

&lt;p&gt;HoneyChat — an AI companion that runs both as a Telegram bot and a web app on the same backend. The architecture above is in production. Try it: &lt;a href="https://t.me/HoneyChatAIBot" rel="noopener noreferrer"&gt;@HoneyChatAIBot&lt;/a&gt; on Telegram or &lt;a href="https://honeychat.bot" rel="noopener noreferrer"&gt;honeychat.bot&lt;/a&gt; in the browser.&lt;/p&gt;

&lt;p&gt;Public docs: &lt;a href="https://github.com/sm1ck/honeychat" rel="noopener noreferrer"&gt;github.com/sm1ck/honeychat&lt;/a&gt; — service topology, API surface, major flows.&lt;/p&gt;

&lt;p&gt;Next in the series: &lt;a href="https://dev.to/sm1ck/llm-routing-per-tier-via-openrouter-when-one-model-doesnt-fit-all-3ami"&gt;LLM routing per tier&lt;/a&gt; — why one model doesn't fit all, and how to handle content_filter errors from reasoning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redis.io/commands/ltrim/" rel="noopener noreferrer"&gt;Redis &lt;code&gt;LTRIM&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aiogram.dev/" rel="noopener noreferrer"&gt;aiogram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.character.ai/hc/en-us/articles/24327914463003-New-Feature-Pinned-Memories" rel="noopener noreferrer"&gt;Character.AI pinned memories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.character.ai/helping-characters-remember-what-matters-most/" rel="noopener noreferrer"&gt;Character.AI chat memories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.replika.com/hc/en-us/categories/4410741916045-Conversation-Memory" rel="noopener noreferrer"&gt;Replika memory docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.openai.com/en/articles/10303002-how-does-memory-use-past-conversations" rel="noopener noreferrer"&gt;ChatGPT Memory FAQ&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're building something similar and have questions about the memory layout or the summarization pipeline, drop a comment. Especially curious how others handle race conditions around user-initiated state resets.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
