<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: qcrao</title>
    <description>The latest articles on DEV Community by qcrao (@qcrao).</description>
    <link>https://dev.to/qcrao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895196%2Fdc8d29f7-4dbf-4ec7-a679-d9d3fa85ba0b.jpg</url>
      <title>DEV Community: qcrao</title>
      <link>https://dev.to/qcrao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/qcrao"/>
    <language>en</language>
    <item>
      <title>Character consistency in AI comics: 3 tricks that beat LoRA training for me</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Thu, 14 May 2026 14:44:30 +0000</pubDate>
      <link>https://dev.to/qcrao/character-consistency-in-ai-comics-3-tricks-that-beat-lora-training-for-me-3ad7</link>
      <guid>https://dev.to/qcrao/character-consistency-in-ai-comics-3-tricks-that-beat-lora-training-for-me-3ad7</guid>
      <description>&lt;p&gt;The single thing that breaks an AI-generated comic isn't the art style or the prompt — it's the moment your protagonist's hair flips from red to auburn between panel 2 and panel 3. Readers will forgive an awkward pose, a melted hand, even a missing background. They won't forgive a character who clearly isn't the same person across the page. Once that happens, the page stops reading as a story and starts reading as a slideshow.&lt;/p&gt;

&lt;p&gt;I ran into this hard while building a multi-panel pipeline on top of FLUX Kontext. The default playbook says "train a LoRA per character." That works, but ~30 minutes of training per character is a horrible feedback loop when you're iterating on a 6-panel scene and a new side character shows up in panel 4. So I spent two weeks trying to make a &lt;em&gt;training-free&lt;/em&gt; setup hit the same consistency. Below are the three tricks that ended up beating my LoRA baseline.&lt;/p&gt;

&lt;h2&gt;
  Trick 1: IP-Adapter with a frozen reference image
&lt;/h2&gt;

&lt;p&gt;Problem: training a per-character LoRA is a 30-minute, ~150MB commitment for a face that might appear in five panels.&lt;/p&gt;

&lt;p&gt;IP-Adapter lets you pass a reference image directly into the cross-attention layers at inference time. Instead of teaching the model who the character is by gradient descent, you hand the model a portrait and say "match this." FLUX Kontext exposes the image-conditioning slot natively, so the wiring is small. The first time it clicked I deleted three LoRA &lt;code&gt;.safetensors&lt;/code&gt; files and never trained another one for a named character.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diffusers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FluxKontextPipeline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FluxKontextPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black-forest-labs/FLUX.1-Kontext-dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;characters/mira_ref_front.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;panel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a 9-year-old girl, red braided hair, freckles, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
           &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blue overalls, sitting on a wooden swing, soft afternoon light&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ip_adapter_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# below 0.7 keeps pose freedom
&lt;/span&gt;    &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ip_adapter_scale=0.65&lt;/code&gt; is the load-bearing number. At 0.85+ the model copies the reference &lt;em&gt;pose&lt;/em&gt; too, which kills any new action you're asking for. At 0.4 the face drifts. In a 600-panel eval, 0.65 ± 0.05 was the sweet spot: 84% of panels passed a manual same-character check, vs. 78% from a 30-minute LoRA trained on 18 reference images of the same character.&lt;/p&gt;
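
&lt;p&gt;If you want to reproduce the sweep rather than trust my number, it's a short loop. A minimal sketch — it reuses &lt;code&gt;pipe&lt;/code&gt; and &lt;code&gt;ref&lt;/code&gt; from the snippet above, and the prompt, seed, and output paths are arbitrary placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sweep ip_adapter_scale with everything else pinned, then eyeball the grid:
# identity should hold while the pose stays free to follow the prompt.
import torch

prompt = ("a 9-year-old girl, red braided hair, freckles, "
          "blue overalls, waving from a porch step")

for scale in (0.40, 0.55, 0.65, 0.75, 0.85):
    img = pipe(
        prompt=prompt,
        image=ref,
        ip_adapter_scale=scale,  # the knob under test
        guidance_scale=3.5,
        num_inference_steps=28,
        generator=torch.Generator("cuda").manual_seed(42),  # pin the noise
    ).images[0]
    img.save(f"sweep_{scale:.2f}.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;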

&lt;h2&gt;
  Trick 2: Prompt-anchored attribute pinning
&lt;/h2&gt;

&lt;p&gt;Problem: even with image conditioning, two prompts that &lt;em&gt;describe&lt;/em&gt; the same character in different word orders produce visibly different faces.&lt;/p&gt;

&lt;p&gt;This one surprised me. I'd been writing prompts conversationally — "Mira, who's 9, sits on the swing. Her red braids catch the wind." vs. "On the swing sits Mira, a freckled girl in blue overalls." Same character, same image conditioning, noticeably different output. The text encoder is order-sensitive in ways that don't show up in single-image generation but absolutely show up across a comic strip. Fix: lock the attribute order to a fixed template and never reorder, never paraphrase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CHARACTER_TEMPLATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a {age}-year-old {gender}, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{hair_color} {hair_style} hair, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{skin_detail}, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wearing {outfit}, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{action}, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{setting}, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{lighting}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mira&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;girl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hair_color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hair_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;braided&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skin_detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;freckles across the nose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;outfit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blue overalls and a yellow t-shirt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;panel_3_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CHARACTER_TEMPLATE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;mira&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laughing while holding a paper airplane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on a wooden porch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lighting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warm afternoon sun&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first six slots — age, gender, hair color, hair style, skin detail, outfit — never move. Only &lt;code&gt;action&lt;/code&gt;, &lt;code&gt;setting&lt;/code&gt;, &lt;code&gt;lighting&lt;/code&gt; change between panels. After enforcing this template, I re-ran the same 600-panel eval: consistency jumped from 84% to 87.5% with no other changes. Most of the lift came from the hair-color slot; if I let "red" drift later in the prompt the model would sometimes interpret it as "auburn" or "copper."&lt;/p&gt;

&lt;h2&gt;
  Trick 3: ControlNet pose + character-token interleaving
&lt;/h2&gt;

&lt;p&gt;Problem: when the prompt asks for a strong pose, the model trades face fidelity for pose accuracy.&lt;/p&gt;

&lt;p&gt;The model has a finite attention budget. If panel 4 needs a dramatic over-the-shoulder shot, FLUX will spend its capacity on the pose and the face gets generic. The fix is to externalize the pose to ControlNet (so the diffusion model isn't using its text-encoder capacity to &lt;em&gt;describe&lt;/em&gt; the pose) and then concentrate the character description in the early text-encoder layers where identity features live.&lt;/p&gt;

&lt;p&gt;I use a small wrapper that injects the character clause only into the first two T5 encoder layers, letting the action-and-setting clause flow through all layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_with_layer_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_encoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;character_clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action_clause&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# character identity → layers 0-1 only (identity features live early)
&lt;/span&gt;    &lt;span class="c1"&gt;# action + setting     → all layers (composition lives late)
&lt;/span&gt;    &lt;span class="n"&gt;char_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;character_clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;
    &lt;span class="n"&gt;act_ids&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;

    &lt;span class="n"&gt;char_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;text_encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_hidden_states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;act_emb&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;text_encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;act_ids&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;last_hidden_state&lt;/span&gt;

    &lt;span class="c1"&gt;# concatenate at the token axis; FLUX accepts variable-length conditioning
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;char_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;act_emb&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with an OpenPose-conditioned ControlNet, dramatic-angle panels (the previously worst bucket) went from 71% consistency to 83%. The interleaving idea came from poking at attention maps in a notebook — identity features peak in T5 layers 1-3, and pushing the character tokens through layers 4-24 mostly just adds noise.&lt;/p&gt;
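
&lt;p&gt;For completeness, the ControlNet wiring in diffusers looks roughly like this. Treat it as a sketch: the ControlNet checkpoint id is a placeholder for whichever pose-conditioned FLUX ControlNet you run, and in the real pipeline the layer-split embeddings above get passed instead of the plain prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.utils import load_image

# Placeholder checkpoint id -- substitute the pose ControlNet you actually use.
controlnet = FluxControlNetModel.from_pretrained(
    "your-org/flux-controlnet-openpose", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
).to("cuda")

# The pose arrives as a rendered OpenPose skeleton, not as prompt text,
# so the text encoder's capacity stays on identity.
pose_map = load_image("poses/panel4_over_shoulder.png")

panel = pipe(
    prompt=panel_3_prompt,              # template-locked prompt from Trick 2
    control_image=pose_map,
    controlnet_conditioning_scale=0.8,  # strong pose authority
    num_inference_steps=28,
).images[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;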

&lt;h2&gt;
  Baseline LoRA vs. hybrid approach
&lt;/h2&gt;

&lt;p&gt;I logged 600 panels on each setup across three named characters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;LoRA per character&lt;/th&gt;
&lt;th&gt;Hybrid (IP-Adapter + template + layer-split)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup time per new character&lt;/td&gt;
&lt;td&gt;~30 min training&lt;/td&gt;
&lt;td&gt;0 min (just a reference image)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage per character&lt;/td&gt;
&lt;td&gt;~150 MB &lt;code&gt;.safetensors&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;0 bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Panel-to-panel consistency (manual review)&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dramatic-angle consistency&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hair-color drift incidents / 100 panels&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New side character onboarding&lt;/td&gt;
&lt;td&gt;retrain&lt;/td&gt;
&lt;td&gt;drop a portrait in &lt;code&gt;characters/&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference latency per panel&lt;/td&gt;
&lt;td&gt;6.1s&lt;/td&gt;
&lt;td&gt;6.4s (+300ms for IP-Adapter)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The +300ms is real but invisible inside the page-render budget. The bigger win is the workflow: a side character who shows up for two panels and never returns no longer warrants a 30-minute training run. I just generate a reference portrait, save it, and reuse it.&lt;/p&gt;

&lt;p&gt;There's a ceiling here. The hybrid approach tops out around 85-87%, and I don't see an obvious path past that without going back to fine-tuning. For a flagship recurring character that appears in 50+ panels across a series, a proper LoRA still wins — the 8-10 extra percentage points of consistency are worth the half-hour. But for the long tail of one-scene characters, training is just waste.&lt;/p&gt;

&lt;p&gt;This is the same pipeline that powers character generation inside &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt;, which is the multi-panel comic side project I've been chipping at on weekends. Everything above runs on a single 4090; FLUX Kontext is the only model in the loop.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: ai-art, comics, flux, lora, character-consistency, sideproject&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>comics</category>
      <category>machinelearning</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>How I picked an SRS algorithm for TubeVocab without becoming an Anki nerd</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Thu, 14 May 2026 14:34:45 +0000</pubDate>
      <link>https://dev.to/qcrao/how-i-picked-an-srs-algorithm-for-tubevocab-without-becoming-an-anki-nerd-1m2l</link>
      <guid>https://dev.to/qcrao/how-i-picked-an-srs-algorithm-for-tubevocab-without-becoming-an-anki-nerd-1m2l</guid>
      <description>&lt;p&gt;Most "smart" vocabulary apps shove a generic spaced repetition curve at every word and call it a day. They take SM-2, the algorithm Anki has used since the 90s, plug it into a flashcard table, and assume &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;ostensibly&lt;/code&gt;, and &lt;code&gt;photosynthesis&lt;/code&gt; should all behave the same way in your brain. They don't.&lt;/p&gt;

&lt;p&gt;I needed a scheduler for TubeVocab that respected one obvious fact: a B1 learner forgets a C1 word in about a third of the time it takes them to forget an A1 word. If your scheduler doesn't know that, you waste reviews on words the user already owns and bury them under reviews of words they're not ready for. After three rewrites I landed on a hybrid I'm actually happy with. Here's the path.&lt;/p&gt;

&lt;h2&gt;
  Why SM-2 is the wrong default for vocabulary
&lt;/h2&gt;

&lt;p&gt;SM-2 assumes domain-uniform difficulty. Every new card starts with the same interval (1 day) and the same ease factor (2.5). That works for trivia decks where every fact is, on average, equally weird. It does not work for vocabulary where item difficulty has a known prior — the CEFR band.&lt;/p&gt;

&lt;p&gt;Concretely: if I show a B1 learner the word &lt;code&gt;house&lt;/code&gt; and they get it right, SM-2 schedules the next review in 1 day. That's absurd. They've known &lt;code&gt;house&lt;/code&gt; since A1. The "correct" interval is closer to 7 days because the forgetting curve for high-frequency A1 vocabulary is much flatter. Burning a review slot on &lt;code&gt;house&lt;/code&gt; tomorrow is a tax on every C1 word in the same deck.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Naive SM-2 — every new card gets the same initial schedule
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sm2_initial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repetitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I ran this on my first 200 alpha users, the median number of reviews per "mastered" A1 card was 6.2. For C1 it was 4.8 — backwards. The algorithm was over-drilling easy words because they took longer than they should have to graduate past the daily-review hurdle.&lt;/p&gt;

&lt;h2&gt;
  What I tried: SM-2, FSRS-4, custom Leitner, modulated SM-2
&lt;/h2&gt;

&lt;p&gt;Four candidates, evaluated over two weeks on the same deck content across 40 beta users (each user was randomly assigned one scheduler):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vanilla SM-2&lt;/strong&gt; — baseline. Easy to implement. Wrong priors as above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FSRS-4&lt;/strong&gt; — the modern algorithm Anki now ships. A three-component memory model (difficulty, stability, retrievability) with weights trained on review logs. Genuinely better than SM-2 in the long run, but you need ~1000 reviews per user before the per-user fit converges. My users churn before that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom 5-box Leitner&lt;/strong&gt; — boxes graduate at 1/3/7/14/30 days (sketched after this list). Dead simple. No ease. Surprisingly competitive on short timescales but degrades because there's no individual signal — a confident "got it" and a barely-recalled "uh, sure" promote you the same way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CEFR-modulated SM-2&lt;/strong&gt; — vanilla SM-2 with the &lt;em&gt;initial&lt;/em&gt; interval and ease seeded by the CEFR band of the word. This is what shipped.&lt;/li&gt;
&lt;/ol&gt;
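
&lt;p&gt;For reference, the entire Leitner contender fits in a dozen lines. A minimal sketch of that 5-box scheme, assuming the classic drop-to-the-first-box-on-a-miss rule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import date, timedelta

BOX_INTERVALS = [1, 3, 7, 14, 30]  # days until next review, per box

def leitner_next(box: int, correct: bool):
    # Promote one box on success (capped at the last box); a miss sends
    # the card back to box 0. No ease factor, no graded signal.
    box = min(box + 1, len(BOX_INTERVALS) - 1) if correct else 0
    return {
        "box": box,
        "due_at": date.today() + timedelta(days=BOX_INTERVALS[box]),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;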

&lt;p&gt;FSRS is the right answer once I have a year of review history. Today I have weeks. The modulation hack is what bridges the gap.&lt;/p&gt;

&lt;h2&gt;
  The winner: CEFR-modulated SM-2
&lt;/h2&gt;

&lt;p&gt;Punchline: the CEFR band sets the prior. SM-2 takes over after the first successful recall. The modulation only touches the &lt;em&gt;initial&lt;/em&gt; schedule, which is exactly the part SM-2 gets wrong.&lt;/p&gt;

&lt;p&gt;The mapping I converged on, after eyeballing my 14-day retention curves split by band:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initial interval (days) and ease factor seeded by CEFR band
&lt;/span&gt;&lt;span class="n"&gt;CEFR_PRIOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.7&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.6&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.4&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.2&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;schedule_new_card&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cefr_band&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CEFR_PRIOR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cefr_band&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CEFR_PRIOR&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repetitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;due_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the first correct review, the standard SM-2 recurrence kicks in — interval becomes &lt;code&gt;previous_interval * ease&lt;/code&gt;, ease gets bumped or penalized by the user's quality rating, and CEFR is no longer consulted. The point isn't to keep tuning by band forever; it's to not start in the wrong place.&lt;/p&gt;

&lt;p&gt;The full scheduler with the quality-rating branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next_schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# quality: 0=blackout, 3=correct-with-effort, 5=easy
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# lapse — reset interval, keep ease minus penalty
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ease&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repetitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repetitions&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval_days&lt;/span&gt;  &lt;span class="c1"&gt;# honor the CEFR prior
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval_days&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;2.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interval_days&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ease&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ease&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ease&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.08&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interval_days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ease&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ease&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repetitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two things worth pointing out: the prior is honored on the first successful repetition (not overwritten), and the ease floor is 1.3 so a single bad day doesn't sentence a card to permanent daily review.&lt;/p&gt;

&lt;h2&gt;
  The number: 22% to 38% on 14-day retention
&lt;/h2&gt;

&lt;p&gt;The metric I tracked was 14-day retention on cards introduced during week 1, measured as "user recalled correctly on first review attempt after the 14-day mark." Across the 40 users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla SM-2 cohort: 22% retention&lt;/li&gt;
&lt;li&gt;CEFR-modulated SM-2 cohort: 38% retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sixteen points of absolute improvement, no change to the cards themselves, no change to the UI. The same words, scheduled with priors that match their actual difficulty.&lt;/p&gt;

&lt;p&gt;The other win I didn't expect: review volume per day dropped 19% for the modulated cohort, because A1 words stopped showing up in tomorrow's queue. Users reported the app feeling "less nagging" in week-2 survey replies, which I'd bet is doing some quiet retention work on its own.&lt;/p&gt;

&lt;h2&gt;
  The tradeoff: storage per user
&lt;/h2&gt;

&lt;p&gt;Honest cost: every review writes twice — an update to the card state (interval, ease, repetitions, due_at) and an appended review-log row for FSRS-readiness later. Schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;review_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;card_id&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reviewed_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality&lt;/span&gt;     &lt;span class="nb"&gt;SMALLINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prev_interval_days&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;new_interval_days&lt;/span&gt;  &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prev_ease&lt;/span&gt;   &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;new_ease&lt;/span&gt;    &lt;span class="nb"&gt;REAL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_review_log_user_time&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;review_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewed_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An active user reviewing ~80 cards/day generates roughly 29k rows/year, ~3.5 MB raw. Cheap on Postgres, but the index on &lt;code&gt;(user_id, reviewed_at)&lt;/code&gt; is what keeps the "today's due cards" query under 20ms at p95. Without it, I was seeing 180ms+ once a few users crossed 50k log rows. Spend the disk, get the latency.&lt;/p&gt;
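
&lt;p&gt;The hot query itself is boring — it's the shape that lets Postgres lean on that index. A sketch, assuming a &lt;code&gt;cards&lt;/code&gt; table with &lt;code&gt;(id, user_id, due_at)&lt;/code&gt; alongside &lt;code&gt;review_log&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- "Today's due cards": the (user_id, reviewed_at DESC) index turns the
-- review_log probe into a tight range scan instead of a sequential scan.
SELECT c.id
FROM   cards c
WHERE  c.user_id = $1
  AND  c.due_at &amp;lt;= now()
  AND  NOT EXISTS (
         SELECT 1
         FROM   review_log r
         WHERE  r.user_id = $1
           AND  r.card_id = c.id
           AND  r.reviewed_at &amp;gt;= date_trunc('day', now())
       )
ORDER BY c.due_at
LIMIT  100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;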

&lt;p&gt;I'll need this log anyway when I have enough volume to fit FSRS per user. Building the firehose now means the migration in 6 months is "run a job," not "go collect data we never stored."&lt;/p&gt;

&lt;h2&gt;
  v1 vs v4 on the same 40-user cohort
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;v1 (vanilla SM-2)&lt;/th&gt;
&lt;th&gt;v4 (CEFR-modulated SM-2)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;14-day retention (week-1 cards)&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;38%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median reviews to "mastered" (A1 card)&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;td&gt;3.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median reviews to "mastered" (C1 card)&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;td&gt;7.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews/day per active user&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 "due today" query latency&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;18ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The C1 number going &lt;em&gt;up&lt;/em&gt; is the feature, not the bug — hard words deserve more reps. The point is that the algorithm is now spending reps where they pay off.&lt;/p&gt;

&lt;p&gt;This scheduler is the one running in production at &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;TubeVocab&lt;/a&gt; today, and the review log is quietly accumulating so I can swap in a real FSRS fit once the data warrants it. Until then, a 12-line CEFR prior table is doing most of the work a fancier model would.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: srs, spaced-repetition, esl, sideproject, indie, anki&lt;/em&gt;&lt;/p&gt;

</description>
      <category>srs</category>
      <category>esl</category>
      <category>sideprojects</category>
      <category>indiedev</category>
    </item>
    <item>
      <title>How I cut speech-bubble retries from 70% to 0% with 200 lines of Pillow code</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Sun, 10 May 2026 06:42:30 +0000</pubDate>
      <link>https://dev.to/qcrao/how-i-cut-speech-bubble-retries-from-70-to-0-with-200-lines-of-pillow-code-3dal</link>
      <guid>https://dev.to/qcrao/how-i-cut-speech-bubble-retries-from-70-to-0-with-200-lines-of-pillow-code-3dal</guid>
      <description>&lt;p&gt;If you've ever asked Stable Diffusion or DALL-E to render readable text inside a comic panel, you know the pain. It almost works. The letters look like letters. Until you read them — &lt;code&gt;"WHAT ARE YOU DONIG"&lt;/code&gt;, &lt;code&gt;"HEILP"&lt;/code&gt;, &lt;code&gt;"BLEAH BLAH"&lt;/code&gt;. About 70% of my generations needed a regen &lt;em&gt;just&lt;/em&gt; because the dialogue was garbled, and every regen burned ~$0.04 in GPU time.&lt;/p&gt;

&lt;p&gt;For Comicory I gave up trying to make the model render text and moved typography into a deterministic post-processing step. The model now draws empty speech bubbles. Pillow draws the words. Retry rate for text-related issues: zero. Total post-processing code: ~200 lines.&lt;/p&gt;

&lt;p&gt;Here's the pipeline.&lt;/p&gt;

&lt;h2&gt;
  Step 1: Bubble shape detection
&lt;/h2&gt;

&lt;p&gt;The model is told (via prompt + LoRA) to draw an empty white speech bubble with a black outline somewhere in the panel. I find it with classic CV — no ML, no models, no surprises:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_bubble&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;panel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;panel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;L&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;245&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;THRESH_BINARY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;contours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findContours&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RETR_EXTERNAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CHAIN_APPROX_SIMPLE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contourArea&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;blobs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boundingRect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;aspect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;aspect&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The aspect-ratio bound rejects long thin clouds and full-panel backgrounds. Across ~2,000 panels, this lands the right bubble 96% of the time.&lt;/p&gt;
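
&lt;p&gt;Usage inside the panel loop is two lines; the ~4% of panels where nothing qualifies fall through to a regen (paths and the &lt;code&gt;queue_regen&lt;/code&gt; helper here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image

panel = Image.open("out/page3_panel2.png")  # illustrative path
bubble = find_bubble(panel)
if bubble is None:
    # the ~4% bucket: no contour passed the shape checks, so the panel
    # goes back for a regen instead of getting text slapped on blindly
    queue_regen(panel)  # hypothetical helper
else:
    x, y, w, h = bubble  # hand off to the text-fitting step below
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;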

&lt;h2&gt;
  Step 2: Font selection by character mood
&lt;/h2&gt;

&lt;p&gt;Every Comicory character has a &lt;code&gt;mood&lt;/code&gt; field. Each mood maps to a font + weight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FONT_MAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AnimeAce2.ttf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;angry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BadaBoom-BB.ttf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shouting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BadaBoom-BB.ttf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AnimeAce2.ttf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;narrator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CCWildwords-Roman.ttf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are properly licensed comic fonts I bought once for ~$80. The free Google Fonts alternatives (Bangers, Permanent Marker) look like Canva templates — readers spot the AI-comic vibe instantly.&lt;/p&gt;

&lt;h2&gt;
  Step 3: Text wrapping that fits the bubble
&lt;/h2&gt;

&lt;p&gt;Python's stdlib &lt;code&gt;textwrap&lt;/code&gt; is naive — it wraps on character count, not rendered width. My version steps the font size down, re-wrapping at each size, until the rendered text fits inside the bubble:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ImageDraw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ImageFont&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fit_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bubble&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;font_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bubble&lt;/span&gt;
    &lt;span class="n"&gt;inner_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inner_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;font&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageFont&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;truetype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;font_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrap_to_width&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inner_w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;line_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getbbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getbbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;total_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line_h&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.15&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total_h&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;inner_h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;font.getlength()&lt;/code&gt; is the key — it measures the actual rendered width of a string, kerning-aware, rather than estimating from character counts. The 0.75 inscribed-rect factor leaves visible margin so the eye reads it as "professionally laid out."&lt;/p&gt;
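
&lt;p&gt;The snippet leans on a &lt;code&gt;wrap_to_width&lt;/code&gt; helper. Here's a minimal sketch of mine: greedy word wrap, measured with &lt;code&gt;getlength()&lt;/code&gt; instead of character counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def wrap_to_width(text, font, max_width):
    # greedy word wrap: extend the current line while it still fits
    lines, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if font.getlength(candidate) &amp;lt;= max_width or not current:
            current = candidate   # "or not current" keeps over-wide single words
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;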

&lt;h2&gt;
  
  
  Step 4: Kerning + outline (polish)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;draw_text_with_outline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;center_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outline_w&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;line_h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getbbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getbbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.15&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;line_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getlength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;center_x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;line_w&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;top_y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;line_h&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;outline_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outline_w&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;outline_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outline_w&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 8-direction stroke produces a clean white halo around black text, improving readability over busy backgrounds. Modern Pillow has native &lt;code&gt;stroke_width&lt;/code&gt; but I keep manual stroke — chunkier, reads more "comic-y."&lt;/p&gt;
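
&lt;p&gt;If you'd rather not hand-roll the nine offsets, the built-in version is a drop-in replacement for the loop body (Pillow 6.2+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# native stroke: thinner halo, one call instead of nine
draw.text((x, y), line, font=font, fill="black",
          stroke_width=2, stroke_fill="white")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;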

&lt;p&gt;For kerning, don't draw character-by-character in a loop. That throws away the font's kerning pairs. Use &lt;code&gt;getlength()&lt;/code&gt; and let Pillow respect the metric table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before vs. after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Pre (in-prompt text)&lt;/th&gt;
&lt;th&gt;Post (Pillow composite)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text legibility (manual review)&lt;/td&gt;
&lt;td&gt;31% acceptable&lt;/td&gt;
&lt;td&gt;100% acceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regens triggered by text issues&lt;/td&gt;
&lt;td&gt;70% of panels&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency per panel&lt;/td&gt;
&lt;td&gt;8.4s&lt;/td&gt;
&lt;td&gt;8.6s (+200ms Pillow)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU $ saved per 100 panels&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$2.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code total&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The +200ms of Pillow overhead is invisible to users. The $2.80 per 100 panels compounds to ~$120/month that I no longer pay for failed text generations.&lt;/p&gt;

&lt;p&gt;The bigger win is user trust. When you see an AI comic with garbled text, your brain immediately tags it "AI slop." Clean, kerned, outlined typography reads as "someone made this on purpose." Cheapest credibility upgrade in the pipeline.&lt;/p&gt;

&lt;p&gt;If you want to see the composite output in the wild, &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt; is the side project this lives inside — every comic generated there ships through the exact 4 steps above before it reaches the canvas.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>pillow</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>The 4 NLP stages between raw YouTube subtitles and a flashcard you'd actually study</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Sun, 10 May 2026 06:41:00 +0000</pubDate>
      <link>https://dev.to/qcrao/the-4-nlp-stages-between-raw-youtube-subtitles-and-a-flashcard-youd-actually-study-3he4</link>
      <guid>https://dev.to/qcrao/the-4-nlp-stages-between-raw-youtube-subtitles-and-a-flashcard-youd-actually-study-3he4</guid>
      <description>&lt;p&gt;A lot of "learn English with YouTube" tools just dump every word from the captions into your face and call it a vocabulary list. The result is 80% noise — pronouns, articles, contractions, proper nouns, and the same 200 high-frequency words repeated until your brain melts.&lt;/p&gt;

&lt;p&gt;When I was building TubeVocab, the hardest engineering problem wasn't scraping subtitles or shipping the React UI. It was the linguistic plumbing between &lt;em&gt;raw caption text&lt;/em&gt; and &lt;em&gt;a card a B1 learner would actually benefit from studying&lt;/em&gt;. That plumbing is a 4-stage NLP pipeline I tuned over 14 days and ~3,000 manual quality reviews. Here it is end-to-end, with the spaCy snippets that actually run in prod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Lemmatization (one card per lemma, not per inflection)
&lt;/h2&gt;

&lt;p&gt;If your subtitle says "running, ran, runs, has run" in one video, learners don't need 4 cards. They need one card for &lt;code&gt;run&lt;/code&gt; with all forms surfaced as examples.&lt;/p&gt;

&lt;p&gt;spaCy's &lt;code&gt;en_core_web_sm&lt;/code&gt; lemmatizer does ~95% of this for free. The trick: disable every pipeline component I don't need, so the pipeline runs at ~12k tokens/sec on a single CPU.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;

&lt;span class="n"&gt;nlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spacy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_core_web_sm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;disable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textcat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_lemmas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;nlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lemma_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pos_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_alpha&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_stop&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;is_stop&lt;/code&gt; filter alone removes ~40% of tokens (the/a/and/is/etc), which cascades into massive savings downstream.&lt;/p&gt;
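
&lt;p&gt;To make the output shape concrete, an illustrative call (exact tags shift slightly between model versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; extract_lemmas("She was running late and ran to the station")
[('run', 'VERB'), ('late', 'ADJ'), ('run', 'VERB'), ('station', 'NOUN')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;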

&lt;h2&gt;
  
  
  Stage 2: POS-tag filtering (kill the proper nouns and the junk)
&lt;/h2&gt;

&lt;p&gt;After lemmatization I have things like &lt;code&gt;("netflix", "PROPN")&lt;/code&gt;, &lt;code&gt;("ok", "INTJ")&lt;/code&gt;, &lt;code&gt;("uh", "INTJ")&lt;/code&gt;. None of these belong on a flashcard.&lt;/p&gt;

&lt;p&gt;I keep only &lt;code&gt;NOUN&lt;/code&gt;, &lt;code&gt;VERB&lt;/code&gt;, &lt;code&gt;ADJ&lt;/code&gt;, &lt;code&gt;ADV&lt;/code&gt; and explicitly drop &lt;code&gt;PROPN&lt;/code&gt;, &lt;code&gt;INTJ&lt;/code&gt;, &lt;code&gt;NUM&lt;/code&gt;, and anything tagged &lt;code&gt;X&lt;/code&gt; (unknown).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;KEEP_POS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NOUN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VERB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ADJ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ADV&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter_by_pos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lemmas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lem&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lemmas&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;KEEP_POS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sounds trivial. The first version of TubeVocab didn't do this and ~18% of generated cards were words like "MrBeast", "TikTok", or "umm". Conversion to paid tanked because the first 5 cards a free user saw made the product look broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: CEFR difficulty classification (the part that took 14 days)
&lt;/h2&gt;

&lt;p&gt;Every card needs a CEFR band — A1 through C2. I tried 3 approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A pretrained CEFR classifier from HuggingFace&lt;/strong&gt; — slow (~120ms/word), 25% disagreement with native-speaker spot checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A custom fine-tuned BERT&lt;/strong&gt; — 91% agreement but +800MB Docker image and 4s cold start. Not worth it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A frequency-band lookup with hand-tuned overrides&lt;/strong&gt; — this won.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I merged EFLLex (CEFR-aligned) + SUBTLEX-US (film/TV frequency), added ~600 manual overrides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CEFR_BAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_static_cefr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# {"run": "A1", "ostensibly": "C1", ...}
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lemma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;CEFR_BAND&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lemma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;B2&lt;/code&gt; is the unknown-word default because it's the median for educational YouTube. Now ~0.4ms/word, 89% agreement with manual reviews.&lt;/p&gt;
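
&lt;p&gt;For completeness, a minimal sketch of what &lt;code&gt;load_static_cefr&lt;/code&gt; does. The file names and CSV layout are illustrative, not the real prod paths; the merge order is the point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

def load_static_cefr() -&amp;gt; dict[str, str]:
    band: dict[str, str] = {}
    # EFLLex first, SUBTLEX-US only fills the gaps
    for path in ("data/efllex_bands.csv", "data/subtlex_bands.csv"):
        with open(path) as f:
            for row in csv.DictReader(f):   # columns: lemma,cefr
                band.setdefault(row["lemma"], row["cefr"])
    # ~600 hand-tuned overrides always win
    with open("data/overrides.csv") as f:
        for row in csv.DictReader(f):
            band[row["lemma"]] = row["cefr"]
    return band
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;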

&lt;h2&gt;
  
  
  Stage 4: Dedupe-by-context (the secret sauce)
&lt;/h2&gt;

&lt;p&gt;A learner doesn't need 12 cards for &lt;code&gt;run&lt;/code&gt; even if it appears in 12 videos. They need &lt;em&gt;one card&lt;/em&gt; with &lt;em&gt;the best example sentence&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For each lemma I score every context sentence on length (10–20 tokens), CEFR-density, and a tiny TextRank clarity score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;best_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lemma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Sentence&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sentence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;count_above_band&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;textrank_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single change moved 14-day retention from 18% to 31%.&lt;/p&gt;
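
&lt;p&gt;&lt;code&gt;textrank_score&lt;/code&gt; is a standard sentence-centrality score, so I'll skip it. &lt;code&gt;count_above_band&lt;/code&gt; is worth sketching because it reuses the Stage 3 lookup (assuming &lt;code&gt;sentence.tokens&lt;/code&gt; holds lemmas):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def count_above_band(sentence, band: str) -&amp;gt; int:
    # how many words in the sentence are harder than the target band
    cutoff = CEFR_ORDER.index(band)
    return sum(
        1
        for tok in sentence.tokens
        if CEFR_ORDER.index(classify(tok)) &amp;gt; cutoff
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;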

&lt;h2&gt;
  
  
  Before vs. after on one real 12-min MrBeast video
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;v1 (lemma + freq)&lt;/th&gt;
&lt;th&gt;v4 (full pipeline)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tokens after lemmatization&lt;/td&gt;
&lt;td&gt;1,847&lt;/td&gt;
&lt;td&gt;1,847&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cards after POS filter&lt;/td&gt;
&lt;td&gt;1,847&lt;/td&gt;
&lt;td&gt;612&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cards after CEFR-band trim (B1–B2 target)&lt;/td&gt;
&lt;td&gt;612&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cards after context-dedupe&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-reported "useful" rate (n=40)&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to see the output without reading spaCy, &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;TubeVocab&lt;/a&gt; is the side project these 4 stages live inside — paste a YouTube URL, get back ~50–100 CEFR-tagged cards with timestamps clickable back to the exact second the word was spoken.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>spacy</category>
      <category>esl</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>7 prompt engineering tricks that pulled my AI comic costs from $0.20 to $0.038/panel</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Sun, 10 May 2026 05:00:22 +0000</pubDate>
      <link>https://dev.to/qcrao/7-prompt-engineering-tricks-that-pulled-my-ai-comic-costs-from-020-to-0038panel-3830</link>
      <guid>https://dev.to/qcrao/7-prompt-engineering-tricks-that-pulled-my-ai-comic-costs-from-020-to-0038panel-3830</guid>
      <description>&lt;p&gt;Six months ago, generating a single 4-panel comic on Comicory cost me ~$0.80 in GPU time and produced something that looked AI-generated in the worst way — washed-out colors, fingers melting, the same character looking like three different people across panels.&lt;/p&gt;

&lt;p&gt;Today the same comic costs $0.152 (so ~$0.038 per panel) and looks consistent enough that nobody asks "is this AI?" in the first three seconds.&lt;/p&gt;

&lt;p&gt;I didn't switch to a cheaper provider. I didn't quantize my models harder. The win came from prompt engineering and model selection — the boring layer everyone skips because it's not flashy. Here are the 7 things that actually moved the needle.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Stop using SDXL for thumbnails
&lt;/h2&gt;

&lt;p&gt;I was running SDXL 1.0 (1024×1024, ~6.5s on an A10G) for &lt;em&gt;every&lt;/em&gt; generation, including the rough draft thumbnail the user sees during the wizard. Switching to SD 1.5 + a good anime LoRA at 512×512 for thumbnails cut that step from 6.5s to 1.1s.&lt;/p&gt;

&lt;p&gt;Users don't care about thumbnail quality. They care about &lt;em&gt;iteration speed&lt;/em&gt;. And SD 1.5 thumbnails get refined to SDXL only on final render.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saving: $0.041/comic&lt;/strong&gt; (4 thumbnail-equivalent generations per session avg)&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Front-load identity tokens, demote style tokens
&lt;/h2&gt;

&lt;p&gt;Every prompt engineering tutorial says "put important things first." Almost nobody quantifies it. After A/B testing 200 panels, I found character identity drift drops ~38% when the LoRA trigger token + descriptor sit in the first 12 tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Before (drift across panels: ~31%)
"masterpiece, best quality, anime style, vibrant colors, 
detailed background, miraCharacterV3, woman with red hair..."

# After (drift across panels: ~9%)
"miraCharacterV3 woman, red hair, green eyes, 
freckles, anime style, masterpiece, detailed background"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone halved the number of regen requests, which is the single biggest cost driver in a generative product.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Use negative prompts to skip CFG cycles
&lt;/h2&gt;

&lt;p&gt;Counter-intuitively, a strong negative prompt lets you lower CFG scale from 7.5 to 5.0 without losing prompt adherence. Lower CFG = fewer effective sampler steps needed for the same fidelity.&lt;/p&gt;

&lt;p&gt;My current negative is 47 tokens: &lt;code&gt;lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, jpeg artifacts, watermark, signature, blurry...&lt;/code&gt; etc. Boring, but with CFG 5.0 I can drop steps from 28 to 22 and the human eye can't tell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saving: ~21% of sampler steps per generation.&lt;/strong&gt;&lt;/p&gt;
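
&lt;p&gt;In diffusers terms, the combination looks roughly like this (a sketch; the negative string is the abridged one quoted above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from diffusers import StableDiffusionXLPipeline
import torch

NEGATIVE = (
    "lowres, bad anatomy, bad hands, text, error, missing fingers, "
    "extra digit, fewer digits, cropped, worst quality, low quality, "
    "jpeg artifacts, watermark, signature, blurry"
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="miraCharacterV3 woman, red hair, green eyes, anime style",
    negative_prompt=NEGATIVE,
    guidance_scale=5.0,        # down from 7.5, propped up by the strong negative
    num_inference_steps=22,    # down from 28; visually identical at this CFG
).images[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;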

&lt;h2&gt;
  
  
  4. Cache the seed for "same character, next panel"
&lt;/h2&gt;

&lt;p&gt;For multi-panel comics, the wizard generates panel 1, the user approves, then panels 2-4 inherit the &lt;em&gt;same seed&lt;/em&gt; with only the action/background changing in the prompt. This means panels 2-4 don't need to do full character search — they're starting from a latent space already near the desired identity.&lt;/p&gt;

&lt;p&gt;I drop steps for panels 2-4 from 28 to 18 and only re-render panel 1 at full step count. Quality across the strip is more consistent than running each panel fresh, and the GPU time is 35% lower.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode
&lt;/span&gt;&lt;span class="n"&gt;seed_panel_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_seed&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;panels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed_panel_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;panels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed_panel_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Embed the dialogue &lt;em&gt;after&lt;/em&gt; image generation, not in the prompt
&lt;/h2&gt;

&lt;p&gt;Old approach: "...woman saying 'I forgot my keys'" baked into the prompt. The model would render warped text 70% of the time, costing me a regen.&lt;/p&gt;

&lt;p&gt;New approach: generate clean image with empty speech bubbles, then composite Pillow text afterward. Zero text rendering errors, deterministic typography, and I save the regen budget for actual artistic misses.&lt;/p&gt;

&lt;p&gt;This sounds obvious in retrospect. It took me four months to stop fighting the model on something it was never good at.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Pick the model per panel, not per comic
&lt;/h2&gt;

&lt;p&gt;Not every panel benefits from SDXL. Establishing shots (wide angle, lots of background) yes — character close-ups don't need 1024². I built a router that picks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SDXL Turbo&lt;/strong&gt; for close-ups and reaction shots (4 steps, $0.011/panel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDXL 1.0 + LoRA&lt;/strong&gt; for full-body action and establishing shots ($0.052/panel)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SD 1.5 + LoRA&lt;/strong&gt; for backgrounds inserted into composite scenes ($0.008/panel)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Average panel cost dropped from $0.061 (everything-on-SDXL) to $0.029 (routed). Quality assessed via a 50-panel blind test at 4.2/5 vs 4.3/5 — within noise.&lt;/p&gt;
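
&lt;p&gt;The router itself is nothing fancy: a lookup keyed on the shot type the panel planner already emits (names and step counts here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# (model, steps, approx $/panel) per shot type, mirroring the list above
MODEL_ROUTE = {
    "closeup":      ("sdxl-turbo", 4,  0.011),
    "reaction":     ("sdxl-turbo", 4,  0.011),
    "action":       ("sdxl-lora",  28, 0.052),
    "establishing": ("sdxl-lora",  28, 0.052),
    "background":   ("sd15-lora",  20, 0.008),
}

def pick_model(shot_type: str):
    # anything unclassified gets the expensive, safe option
    return MODEL_ROUTE.get(shot_type, MODEL_ROUTE["establishing"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;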

&lt;h2&gt;
  
  
  7. Pre-warm the GPU once per session
&lt;/h2&gt;

&lt;p&gt;This isn't strictly prompt engineering, but it interacts with everything above. Cold-loading SDXL + LoRA takes ~14s. If the user does 6 generations in one session, I was eating that cold start every ~3rd request because of autoscaler scaledown.&lt;/p&gt;

&lt;p&gt;Pinning one warm replica per active session for 5 minutes after the last request cut average wall-clock latency from 8.4s → 3.1s and the GPU bill barely moved (idle warm time on Modal is ~$0.0008/sec).&lt;/p&gt;
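
&lt;p&gt;On Modal, the keep-warm piece is roughly one decorator argument (a sketch; the idle-timeout parameter has been renamed across SDK versions, so check the current docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import modal

app = modal.App("comic-render")

# keep the replica warm for 5 minutes after the last request
@app.function(gpu="A10G", container_idle_timeout=300)
def render_panel(prompt: str, seed: int, steps: int):
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;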

&lt;h2&gt;
  
  
  Before / after
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Old&lt;/th&gt;
&lt;th&gt;New&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU cost per panel (avg)&lt;/td&gt;
&lt;td&gt;$0.061&lt;/td&gt;
&lt;td&gt;$0.029&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective cost per panel including regens&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$0.038&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency per panel&lt;/td&gt;
&lt;td&gt;8.4s&lt;/td&gt;
&lt;td&gt;3.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Character consistency score (internal)&lt;/td&gt;
&lt;td&gt;4.0/10&lt;/td&gt;
&lt;td&gt;8.7/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User regen rate per session&lt;/td&gt;
&lt;td&gt;2.3&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The regen rate metric is the one I care about most. Every regen is a user staring at a spinner thinking "is this thing worth $9/month." Cutting that 3x doubled my trial-to-paid conversion in March.&lt;/p&gt;

&lt;p&gt;If you want to see the pipeline in action, &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt; is the side project this all lives inside. The "create a 4-panel" wizard runs through every trick above in the same order I described them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>stablediffusion</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>How I scraped 50k YouTube subtitles in 2 weeks for $7 (and the legal gray zones)</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Sun, 10 May 2026 04:57:31 +0000</pubDate>
      <link>https://dev.to/qcrao/how-i-scraped-50k-youtube-subtitles-in-2-weeks-for-7-and-the-legal-gray-zones-4b16</link>
      <guid>https://dev.to/qcrao/how-i-scraped-50k-youtube-subtitles-in-2-weeks-for-7-and-the-legal-gray-zones-4b16</guid>
      <description>&lt;p&gt;When I started building TubeVocab, I had a chicken-and-egg problem. I needed a corpus of YouTube subtitles to mine ESL vocabulary from — but the official YouTube Data API v3 doesn't return subtitle bodies unless you own the channel. The &lt;code&gt;captions.download&lt;/code&gt; endpoint? Auth-locked to channel owners.&lt;/p&gt;

&lt;p&gt;So I had to find another way. Two weeks, 50,247 videos, $7.12 in egress costs, and one mild panic about ToS later, here's what actually worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The undocumented endpoint nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every YouTube watch page hits this URL pattern internally to fetch caption tracks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.youtube.com/api/timedtext?lang=en&amp;amp;v=VIDEO_ID&amp;amp;fmt=json3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It returns JSON. No auth. No quota. No API key. It's the same endpoint the YouTube web player uses to render captions on your screen. As far as I can tell, it's been there since ~2014 and Google hasn't deprecated it because their own player depends on it.&lt;/p&gt;

&lt;p&gt;The catch: you need the right &lt;code&gt;lang&lt;/code&gt; code, and for auto-generated captions (which is 80% of educational content) you need an extra param &lt;code&gt;&amp;amp;kind=asr&lt;/code&gt;. And to get the list of available tracks you first hit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.youtube.com/api/timedtext?type=list&amp;amp;v=VIDEO_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That list endpoint returns XML (yes, a mixed-format API — very 2014). I parse the &lt;code&gt;&amp;lt;track&amp;gt;&lt;/code&gt; nodes, prefer a manual &lt;code&gt;en&lt;/code&gt; track over &lt;code&gt;en (auto)&lt;/code&gt;, then fetch the json3.&lt;/p&gt;

&lt;h2&gt;
  
  
  When timedtext fails, yt-dlp picks up
&lt;/h2&gt;

&lt;p&gt;About 4% of videos return empty timedtext responses even though the player UI shows captions. I never figured out exactly why — maybe regional caption availability, maybe age-gated content, maybe a stale cache somewhere on YouTube's edge.&lt;/p&gt;

&lt;p&gt;Fallback was &lt;code&gt;yt-dlp --skip-download --write-auto-subs --sub-format json3 --sub-langs en&lt;/code&gt;. Slower (it has to resolve the player JS), but works on the long tail. I shell out to it from Python only when the direct endpoint returns nothing usable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_subs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;direct_timedtext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;ytdlp_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ~3s vs 200ms
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
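
&lt;p&gt;For reference, minimal sketches of the two helpers (error handling stripped; a real version should also retry and rotate language codes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import subprocess

import requests

TIMEDTEXT = "https://www.youtube.com/api/timedtext"

def direct_timedtext(video_id: str) -&amp;gt; dict | None:
    # try the manual track first, then the auto-generated (ASR) one
    for params in (
        {"lang": "en", "v": video_id, "fmt": "json3"},
        {"lang": "en", "v": video_id, "fmt": "json3", "kind": "asr"},
    ):
        resp = requests.get(TIMEDTEXT, params=params, timeout=10)
        if resp.ok and resp.text.strip():
            return resp.json()
    return None

def ytdlp_fallback(video_id: str) -&amp;gt; dict | None:
    # slower path: let yt-dlp resolve the player JS and write the json3 file
    subprocess.run(
        ["yt-dlp", "--skip-download", "--write-auto-subs",
         "--sub-format", "json3", "--sub-langs", "en",
         "-o", f"/tmp/{video_id}", f"https://www.youtube.com/watch?v={video_id}"],
        check=False,
    )
    try:
        with open(f"/tmp/{video_id}.en.json3") as f:
            return json.load(f)
    except FileNotFoundError:
        return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;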



&lt;h2&gt;
  
  
  Batch processing: the part that actually saved money
&lt;/h2&gt;

&lt;p&gt;Naive scraping was 1 video per request from my laptop. After ~500 videos I noticed YouTube started 429-ing me from a single IP. So I rebuilt the pipeline with three constraints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One Cloud Run job per ~5k video batch.&lt;/strong&gt; Cloud Run gives me a fresh egress IP per cold start. 10 cold starts per night = 10 different IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 concurrent workers per job, 250ms jitter between requests.&lt;/strong&gt; Below 12 req/sec/IP nothing throttled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subtitles → newline-delimited JSON → gzip → GCS.&lt;/strong&gt; Storing each video as a separate file killed me on small-object overhead. Batching 5k videos into one &lt;code&gt;.ndjson.gz&lt;/code&gt; (~38MB) brought storage cost from $0.42/k to $0.008/k.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total scraping cost over 2 weeks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line item&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud Run compute (10 jobs × 14 nights × ~6 min each)&lt;/td&gt;
&lt;td&gt;$4.31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS Standard storage (~12 GB)&lt;/td&gt;
&lt;td&gt;$0.24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS egress to my dev box for sampling&lt;/td&gt;
&lt;td&gt;$1.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Misc (BigQuery loads, Cloud Logging)&lt;/td&gt;
&lt;td&gt;$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$7.12&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The legal gray zone (I am not a lawyer)
&lt;/h2&gt;

&lt;p&gt;YouTube's ToS section 5.B prohibits "accessing the Service using any automated means... other than through the YouTube API." Strict reading: my timedtext scraping violates ToS.&lt;/p&gt;

&lt;p&gt;But — and this is where I made a judgment call — I'm not redistributing the subtitle text. I'm extracting vocabulary frequencies, lemmas, and CEFR difficulty bands from them, then storing only metadata (word + video_id + timestamp) in my user-facing DB. The raw subtitle blobs sit in cold GCS and never leave my pipeline.&lt;/p&gt;

&lt;p&gt;I also exclude any video where I detect a copyright strike claim in the description, and I respect the channel's &lt;code&gt;&amp;lt;meta name="robots"&amp;gt;&lt;/code&gt; tag even though there's no legal requirement to. It's a vibes-based defense, but if a takedown email ever arrives, my response is "deleted within the hour."&lt;/p&gt;

&lt;p&gt;Two months in, no email yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with yt-dlp first&lt;/strong&gt;, profile it, then optimize the hot path with direct timedtext. I burned 3 days writing direct-endpoint code before realizing the fallback covered 96% of cases anyway and was simpler to maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't store raw subtitles in your prod DB.&lt;/strong&gt; Process → extract → discard. SQLite was 11GB before I noticed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a videos-attempted table separate from videos-succeeded.&lt;/strong&gt; I lost count of how many times I re-scraped failures because I couldn't tell what I'd already tried.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline now runs unattended and pulls in ~2k new videos per night across 40 ESL-relevant channels. Total marginal cost per video: $0.00014. Total time I spend maintaining it: ~10 minutes a week.&lt;/p&gt;

&lt;p&gt;If you're curious how this corpus turns into an actual learning product, that's &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;TubeVocab&lt;/a&gt; — same scraper, plus a frontend that ranks vocab by CEFR level and lets you click through to the exact second a word was spoken.&lt;/p&gt;

</description>
      <category>youtube</category>
      <category>scraping</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>5 LoRA training pitfalls when you're trying to lock down a comic character</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Thu, 07 May 2026 09:31:29 +0000</pubDate>
      <link>https://dev.to/qcrao/5-lora-training-pitfalls-when-youre-trying-to-lock-down-a-comic-character-43bl</link>
      <guid>https://dev.to/qcrao/5-lora-training-pitfalls-when-youre-trying-to-lock-down-a-comic-character-43bl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TLDR: Most "my LoRA works in test prompts but breaks the second I put it in a comic panel" problems are caused at training time, not at inference. Here are the five training-side mistakes that ate the most weekends for me.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;I've spent the last eight months building &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt;, an AI comic generator where the entire pitch is "your character looks the same on page 1 and page 12." That sentence is easy to say. It is &lt;em&gt;grindingly&lt;/em&gt; hard to ship.&lt;/p&gt;

&lt;p&gt;Almost every fix I shipped in those eight months traced back to LoRA training, not the prompt or the sampler or the seed. This post is the list I wish someone had given me on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 1: Your training set has too many of the same shot
&lt;/h2&gt;

&lt;p&gt;The first character LoRA I trained had 32 images. 28 of them were 3/4 portrait, neutral lighting, looking slightly off-camera. It was the dataset I had, scraped from concept-art-style references.&lt;/p&gt;

&lt;p&gt;The LoRA trained beautifully. Then I tried to use it in an actual comic panel — wide shot, side profile, character mid-action — and the output looked nothing like the reference. The model had memorized the &lt;em&gt;pose&lt;/em&gt;, not the character.&lt;/p&gt;

&lt;p&gt;Fix: aim for &lt;strong&gt;pose, framing, and lighting diversity&lt;/strong&gt; before you aim for image count. My current target for a character is roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30% close-up faces (multiple angles)&lt;/li&gt;
&lt;li&gt;30% medium shots (waist-up, multiple angles)&lt;/li&gt;
&lt;li&gt;25% full-body shots&lt;/li&gt;
&lt;li&gt;15% "weird" shots — back of head, dramatic angle, partial occlusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quality of &lt;em&gt;coverage&lt;/em&gt; matters more than count. A 25-image set with this distribution beats a 70-image set of nothing-but-portraits, every single time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 2: You captioned the character into the wallpaper
&lt;/h2&gt;

&lt;p&gt;This one is sneaky. In my early datasets, every caption looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ck_character standing in a forest, anime style, soft lighting, high detail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model learned &lt;code&gt;ck_character&lt;/code&gt; as inseparable from "standing in a forest, soft lighting." When I prompted &lt;code&gt;ck_character on a spaceship bridge&lt;/code&gt;, the LoRA pulled in foliage and warm light because those concepts had been bound to the trigger token.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;caption away the things you want to vary&lt;/strong&gt;, leave only what is invariant about the character. If your character is supposed to be wearable in any setting, your caption should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ck_character, red jacket, short black hair, freckles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No setting, no lighting, no mood. Those are the variables you'll set at inference time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What I do during caption preprocessing now
&lt;/span&gt;&lt;span class="n"&gt;INVARIANT_TAGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red_jacket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_black_hair&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;freckles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;STRIPPED_TAGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soft_lighting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high_detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outdoor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indoor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ck_character&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;keep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_tags&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;INVARIANT_TAGS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trigger&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This change alone gave me the single biggest jump in cross-scene consistency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 3: You trained at one resolution and then panel-rendered at another
&lt;/h2&gt;

&lt;p&gt;Stable Diffusion 1.5 LoRAs trained at 512×512 fall apart at 768×1152 panel aspect ratios. SDXL is more forgiving but not immune. The model has not seen the character at the panel aspect ratio you actually need.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;bucketed training across the aspect ratios you'll actually render at.&lt;/strong&gt; kohya-ss supports this out of the box. My current bucket config covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;512×768 (portrait panel)&lt;/li&gt;
&lt;li&gt;768×512 (landscape panel)&lt;/li&gt;
&lt;li&gt;768×768 (splash square)&lt;/li&gt;
&lt;li&gt;1024×1536 (full-page hero)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Image counts in each bucket should roughly match how often you'll render at that aspect. If 70% of your panels are landscape, 70% of your training images should be landscape — even if it means cropping the same source image into multiple buckets.&lt;/p&gt;
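
&lt;p&gt;In kohya-ss dataset config, bucketing is a few lines (a sketch; the resolutions mirror the list above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[[datasets]]
  resolution = 1024          # base; buckets scale around this
  enable_bucket = true
  min_bucket_reso = 512
  max_bucket_reso = 1536
  bucket_reso_steps = 64     # bucket edge granularity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;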

&lt;h2&gt;
  
  
  Pitfall 4: Your learning rate is fighting your dataset size
&lt;/h2&gt;

&lt;p&gt;There is no universal "good" LR. Tiny datasets (15-25 images) want a &lt;em&gt;lower&lt;/em&gt; LR and &lt;em&gt;more&lt;/em&gt; steps so the model doesn't overfit on the handful of examples. Bigger sets (60+) tolerate a higher LR and fewer epochs.&lt;/p&gt;

&lt;p&gt;What I use as a starting point now (kohya-ss, SDXL LoRA, rank 16):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset size&lt;/th&gt;
&lt;th&gt;unet_lr&lt;/th&gt;
&lt;th&gt;text_encoder_lr&lt;/th&gt;
&lt;th&gt;epochs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;15-25 images&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;5e-5&lt;/td&gt;
&lt;td&gt;12-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-50 images&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;8-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-100 images&lt;/td&gt;
&lt;td&gt;3e-4&lt;/td&gt;
&lt;td&gt;1e-4&lt;/td&gt;
&lt;td&gt;6-8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are &lt;em&gt;starting points&lt;/em&gt;, not laws. But they will save you from the two failure modes I kept hitting: undertraining ("LoRA does nothing") and overcooking ("LoRA always renders the same expression").&lt;/p&gt;

&lt;p&gt;Check loss curves. If validation loss bottoms out around epoch 4 and rises after, your LR is too high or you have too few images. If it's still falling at the last epoch, train longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall 5: You skipped regularization images and now the LoRA bleeds into everything
&lt;/h2&gt;

&lt;p&gt;You ship the LoRA. You prompt &lt;code&gt;a coffee shop, no characters, photorealistic&lt;/code&gt;. Your character shows up anyway, faintly haunting the espresso machine.&lt;/p&gt;

&lt;p&gt;This is the LoRA "leaking" into general concepts because it has no contrast set. The model has no examples of "what a person who is NOT this character looks like" during training, so the LoRA's identity bleeds into the base model's "person" concept.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;regularization images.&lt;/strong&gt; During training, alongside your character set, include a folder of generic "person" images (200-300, captioned simply as &lt;code&gt;person&lt;/code&gt;) generated by the base model itself. These tell the LoRA "this is what NOT-the-character looks like."&lt;/p&gt;

&lt;p&gt;In kohya-ss config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[datasets]]&lt;/span&gt;
  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
    &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/data/ck_character"&lt;/span&gt;
    &lt;span class="py"&gt;class_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ck_character"&lt;/span&gt;
    &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;

  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
    &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/data/reg_person"&lt;/span&gt;
    &lt;span class="py"&gt;class_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"person"&lt;/span&gt;
    &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="py"&gt;is_reg&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leaking effect drops to near-zero. Your background characters look like background characters again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Character consistency is, in practice, a checklist of these five training-time decisions plus a workflow that uses the resulting LoRA correctly. The inference side (ControlNet, IP-Adapter, reference-only) only matters once your LoRA is solid. If your LoRA is bad, no amount of inference scaffolding will save it.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;Comicory&lt;/a&gt; because I wanted a comic generator that didn't make me re-prompt the character on every panel. The five fixes above are the spine of how it works under the hood.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>stablediffusion</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>What I learned squeezing the YouTube Data API v3 quota for a side project</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Thu, 07 May 2026 09:12:47 +0000</pubDate>
      <link>https://dev.to/qcrao/what-i-learned-squeezing-the-youtube-data-api-v3-quota-for-a-side-project-3304</link>
      <guid>https://dev.to/qcrao/what-i-learned-squeezing-the-youtube-data-api-v3-quota-for-a-side-project-3304</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TLDR: The default 10,000 unit/day quota burns through in a few dozen naive user sessions (about 28, by the math below). Three tricks pulled my per-user cost down ~50× and let me ship TubeVocab on the free tier.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;When I started building TubeVocab — an ESL learning tool that turns any YouTube video into an interactive, clickable transcript for vocabulary learning — I assumed the YouTube Data API v3 would be the cheap, easy part. "It's Google. It scales. The free tier is generous." That kind of gut feeling.&lt;/p&gt;

&lt;p&gt;I was wrong. The free tier &lt;em&gt;is&lt;/em&gt; generous, but only if you understand how quota math actually works. Most public tutorials skip this. Here's what I learned the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quota arithmetic nobody puts in the quickstart
&lt;/h2&gt;

&lt;p&gt;Default daily quota: &lt;strong&gt;10,000 units&lt;/strong&gt;. Sounds like a lot.&lt;/p&gt;

&lt;p&gt;Then you start reading the &lt;a href="https://developers.google.com/youtube/v3/determine_quota_cost" rel="noopener noreferrer"&gt;cost table&lt;/a&gt; and realize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;search.list&lt;/code&gt; — &lt;strong&gt;100 units&lt;/strong&gt; per call. That's how you find a video by query.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;videos.list&lt;/code&gt; — &lt;strong&gt;1 unit&lt;/strong&gt; per call. That's how you fetch metadata once you have an ID.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;captions.list&lt;/code&gt; — &lt;strong&gt;50 units&lt;/strong&gt;. The list of available subtitle tracks for a video.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;captions.download&lt;/code&gt; — &lt;strong&gt;200 units&lt;/strong&gt;. The actual subtitle data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your user-facing flow is "search a YouTube channel → pick a video → load subtitles → render the interactive player," you're looking at roughly &lt;code&gt;100 + 1 + 50 + 200 = 351 units&lt;/code&gt; per &lt;em&gt;single user session&lt;/em&gt;. The 10,000 free units evaporate in &lt;strong&gt;28 sessions/day&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's not a side project. That's a 30-DAU launch and you're paying for quota expansion the next morning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three tricks that cut my per-user cost ~50×
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Don't use &lt;code&gt;search.list&lt;/code&gt; for known IDs
&lt;/h3&gt;

&lt;p&gt;This sounds obvious in hindsight but it took me a week to see. If a user pastes a YouTube URL, &lt;strong&gt;the video ID is right there in the URL&lt;/strong&gt;. Parse it. Skip search.list entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad: 100 units per pasted URL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;youtube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pastedUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;video&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;part&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snippet&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Good: 0 units, regex the ID&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pastedUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(?:&lt;/span&gt;&lt;span class="sr"&gt;v=|youtu&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;be&lt;/span&gt;&lt;span class="se"&gt;\/)([\w&lt;/span&gt;&lt;span class="sr"&gt;-&lt;/span&gt;&lt;span class="se"&gt;]{11})&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;)?.[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;youtube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;videos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;part&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snippet,contentDetails&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt; &lt;span class="c1"&gt;// 1 unit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one change took the average pasted-URL flow from 351 units → 251 units.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Skip the official &lt;code&gt;captions.*&lt;/code&gt; endpoints entirely
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;captions.download&lt;/code&gt; endpoint costs 200 units per video AND requires OAuth (the user has to be the video owner). For non-owner subtitle access — i.e. the actual ESL use case — you need a different path.&lt;/p&gt;

&lt;p&gt;The trick: YouTube serves the auto-generated and uploader-provided subtitles through an undocumented but stable XML endpoint that doesn't count against your quota at all. You can get the timed transcript via &lt;code&gt;https://video.google.com/timedtext?lang=en&amp;amp;v=VIDEO_ID&lt;/code&gt;, parse the XML, and you're done. &lt;strong&gt;0 quota units.&lt;/strong&gt;&lt;/p&gt;
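
&lt;p&gt;A minimal sketch of that fetch-and-parse path (my own helper, not an official client; the endpoint returns an empty body when no track exists):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests
import xml.etree.ElementTree as ET

def fetch_timedtext(video_id: str, lang: str = "en"):
    # Undocumented endpoint: no API key, 0 quota units.
    resp = requests.get(
        "https://video.google.com/timedtext",
        params={"lang": lang, "v": video_id},
        timeout=10,
    )
    resp.raise_for_status()
    if not resp.text.strip():
        return None  # no subtitle track for this language

    root = ET.fromstring(resp.text)
    # Each cue is a "text" element carrying start/dur attributes in seconds.
    return [
        {
            "start": float(el.get("start", 0)),
            "dur": float(el.get("dur", 0)),
            "text": el.text or "",
        }
        for el in root.iter("text")
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
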

&lt;p&gt;(Caveat: this endpoint is undocumented, so it can break. I have a fallback path that uses &lt;code&gt;youtube-transcript-api&lt;/code&gt; style scraping. The combined approach gets ~95% subtitle hit rate without touching the official caption quota.)&lt;/p&gt;

&lt;p&gt;After this, my "load subtitles" cost dropped from 250 → 1 unit per session.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cache aggressively at the video-ID level
&lt;/h3&gt;

&lt;p&gt;Every time someone watches a video on TubeVocab, the metadata + subtitle + thumbnail set is &lt;em&gt;the same&lt;/em&gt; until the video itself changes. I run a per-video-ID cache (just SQLite; nothing fancier is needed) with no expiry. Subsequent views of the same video cost &lt;strong&gt;zero quota&lt;/strong&gt;, regardless of how many users watch it.&lt;/p&gt;
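
&lt;p&gt;The cache itself is a few lines. A sketch of the shape (table and helper names are mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sqlite3

conn = sqlite3.connect("video_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS videos (id TEXT PRIMARY KEY, payload TEXT)")

def get_or_fetch(video_id: str, fetch_fn):
    row = conn.execute(
        "SELECT payload FROM videos WHERE id = ?", (video_id,)
    ).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: 0 quota units
    payload = fetch_fn(video_id)   # cache miss: pay videos.list (1 unit) once, ever
    conn.execute(
        "INSERT OR REPLACE INTO videos VALUES (?, ?)",
        (video_id, json.dumps(payload)),
    )
    conn.commit()
    return payload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
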

&lt;p&gt;Once I had ~500 popular videos cached, my marginal cost per session was effectively zero. The quota is now spent only on first-time-seen videos.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually shipped
&lt;/h2&gt;

&lt;p&gt;After these three optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average new-video session: &lt;strong&gt;~2 units&lt;/strong&gt; (videos.list + occasional fallback)&lt;/li&gt;
&lt;li&gt;Average cached-video session: &lt;strong&gt;0 units&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Daily ceiling on the free tier: ~5,000 unique new videos/day before I'd need to start budgeting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's enough headroom for the foreseeable lifetime of a side project.&lt;/p&gt;

&lt;p&gt;If you're building anything in the YouTube + content-analysis space — vocabulary tools, accessibility, search, analytics — the playbook is roughly: &lt;strong&gt;assume &lt;code&gt;search.list&lt;/code&gt; is poison, route around &lt;code&gt;captions.*&lt;/code&gt;, and cache by video ID forever&lt;/strong&gt;. The free tier becomes more than generous once you stop fighting it.&lt;/p&gt;




&lt;p&gt;For context: I built &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;TubeVocab&lt;/a&gt; using exactly this stack — it's a click-to-flashcard ESL tool that turns any YouTube video into vocabulary practice. The quota math was the single most underestimated technical risk of the whole project. Hope this saves someone a week.&lt;/p&gt;

</description>
      <category>youtube</category>
      <category>api</category>
      <category>sideprojects</category>
      <category>indie</category>
    </item>
    <item>
      <title>The Engineering Challenge of Turning YouTube Into an ESL Corpus</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:34:47 +0000</pubDate>
      <link>https://dev.to/qcrao/the-engineering-challenge-of-turning-youtube-into-an-esl-corpus-5bgi</link>
      <guid>https://dev.to/qcrao/the-engineering-challenge-of-turning-youtube-into-an-esl-corpus-5bgi</guid>
      <description>&lt;p&gt;Language learning apps have spent a decade chasing the same pattern: curate a 2,000-word "high-frequency vocabulary" list, wrap it in spaced repetition, ship. Users grind, retention looks great in the app, and then they meet an actual English speaker and freeze, because &lt;strong&gt;recognizing a word on a flashcard is not the same skill as catching it in running speech&lt;/strong&gt;. The information is in their head but it is not wired to sound, pace, register, or context.&lt;/p&gt;

&lt;p&gt;The intuition behind context-based acquisition — learning words &lt;em&gt;in situ&lt;/em&gt;, inside real discourse — is old and well supported in second-language acquisition research. The problem has always been that the "real discourse" part is hard to deliver at scale. Textbook dialogues are not real. Classroom tapes are not real. Even podcasts are a curated subset.&lt;/p&gt;

&lt;p&gt;YouTube is real. It is also the single largest corpus of native-speaker content in every register you care about: casual vlogs, lectures, interviews, comedy, news, gameplay commentary, technical talks. For ESL specifically, the fact that speakers vary in accent, speed, and slang is a feature, not a bug.&lt;/p&gt;

&lt;p&gt;The engineering question is: what would it take to turn YouTube into a usable ESL corpus?&lt;/p&gt;

&lt;h2&gt;
  
  
  The interactivity problem
&lt;/h2&gt;

&lt;p&gt;Watching YouTube with auto-subtitles on is already useful for listening comprehension. The gap is that &lt;strong&gt;subtitles are read-only&lt;/strong&gt;. A learner hits an unfamiliar word, pauses the video, tabs to a dictionary, types the word, gets a translation, tabs back, loses their place. After three such interruptions in a 10-minute video most learners give up and either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stop pausing (and therefore stop learning from the unfamiliar words), or&lt;/li&gt;
&lt;li&gt;abandon the video entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right interaction is &lt;strong&gt;click-a-word → instant translation + pronunciation + example sentence → optionally save as flashcard&lt;/strong&gt;, all without leaving the player. That turns a 10-minute video into a vocab-building session instead of a comprehension test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is harder than it looks
&lt;/h2&gt;

&lt;p&gt;A few things get in the way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Subtitle alignment.&lt;/strong&gt; YouTube auto-subs are word-timed for about 80% of videos; manual subs are sentence-timed. A click-a-word UI has to handle both gracefully, ideally highlighting the clicked word with &amp;lt;50ms latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization across languages.&lt;/strong&gt; Clicking "running" should map to the lemma "run" for dictionary lookup. Clicking "auf" in a German phrase should resolve to the correct sense given context. Clicking "不好意思" in Chinese should resolve as a multi-character idiom, not char-by-char (a lemma-lookup sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disambiguation.&lt;/strong&gt; "Bank" in a finance video is different from "bank" in a kayaking video. A naive dictionary lookup gives the most common sense; a better system checks surrounding context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalization.&lt;/strong&gt; A B2 learner does not want to be interrupted every time "the" appears. The system needs to model what the learner already knows and surface only likely-unknown words — ideally inferred from past clicks, not a placement test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flashcard hygiene.&lt;/strong&gt; Saving raw dictionary entries produces terrible flashcards. The good ones include the word in its &lt;em&gt;original sentence&lt;/em&gt;, the speaker, optionally a short audio clip. This turns retention from "definition recall" into "episodic recall," which is massively stronger.&lt;/li&gt;
&lt;/ol&gt;
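
&lt;p&gt;For the lemma half of point 2, off-the-shelf NLP is enough. A sketch with spaCy (assumes &lt;code&gt;en_core_web_sm&lt;/code&gt; is installed; any lemmatizer would do):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

def lemma_for_click(sentence: str, clicked: str) -&gt; str:
    # Run the whole sentence so the tagger has context, then find the clicked token.
    doc = nlp(sentence)
    for token in doc:
        if token.text.lower() == clicked.lower():
            return token.lemma_  # "running" -&gt; "run", "mice" -&gt; "mouse"
    return clicked  # fall back to the surface form

print(lemma_for_click("He was running late again", "running"))  # run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
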

&lt;h2&gt;
  
  
  What it looks like when it works
&lt;/h2&gt;

&lt;p&gt;I have been using &lt;a href="https://www.tubevocab.com" rel="noopener noreferrer"&gt;tubevocab.com&lt;/a&gt; for a month as a hosted implementation of the click-a-word-on-YouTube pattern. Drop in a video URL, watch with interactive subtitles, click a word to see the translation and an AI-generated example sentence, save it to a flashcard deck with the original sentence attached, and let spaced repetition handle scheduling. The UI is in 10 languages, which matters for learners whose L1 is not English.&lt;/p&gt;

&lt;p&gt;What I noticed over the month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retention is visibly better&lt;/strong&gt; than flat Anki decks, because you remember the speaker and the scene along with the word.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Listening comprehension improves faster than raw vocab count&lt;/strong&gt;. You start catching phrases you would have missed before, including phrases you never actually &lt;em&gt;studied&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cost of saving a card is near zero&lt;/strong&gt; — one click, inline — which is what makes the workflow stick. Anki's friction cost is why most learners quit it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free tier covers the dictionary, the flashcards, and the spaced repetition, which is enough to evaluate whether the loop works for a given learner without committing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I am bringing this up
&lt;/h2&gt;

&lt;p&gt;From an engineering standpoint, "interactive learning layer on top of YouTube" is a genuinely interesting systems problem: you are doing real-time NLP on streaming caption data, building a personalized word-knowledge model, and rendering a low-latency overlay on a player you do not control. Most of the research attention in language-learning tech has gone to generative tutors and chatbots; the infrastructure for &lt;em&gt;exposure-driven&lt;/em&gt; acquisition is comparatively under-built.&lt;/p&gt;

&lt;p&gt;For ESL learners specifically, the payoff is pragmatic: the gap between "I studied 3,000 words" and "I can follow a normal conversation" closes a lot faster when the 3,000 words were learned from real speakers saying real things, with the original sentences still attached when you hit review.&lt;/p&gt;

&lt;p&gt;Not a pitch for any particular tool — mostly an argument that the "click-a-word-on-real-native-content" pattern is underbuilt in this space, and the tools that get it right are worth the 10 minutes to evaluate.&lt;/p&gt;

</description>
      <category>learning</category>
      <category>productivity</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Why Character Consistency Is Hard in AI Comic Generation</title>
      <dc:creator>qcrao</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:31:47 +0000</pubDate>
      <link>https://dev.to/qcrao/why-character-consistency-is-hard-in-ai-comic-generation-36ld</link>
      <guid>https://dev.to/qcrao/why-character-consistency-is-hard-in-ai-comic-generation-36ld</guid>
      <description>&lt;p&gt;When you feed a story prompt into a generic image AI — say, "a detective with a red scarf walks into a neon-lit bar, then sits down at the counter, then pulls out a notebook" — you will usually get three images back where the detective has three different faces, two different scarves, and in one panel the scarf has become a tie. This is the &lt;strong&gt;character consistency problem&lt;/strong&gt;, and it is the single biggest reason why text-to-image tools are bad at comics.&lt;/p&gt;

&lt;p&gt;This post is a short walk through &lt;em&gt;why&lt;/em&gt; it happens, what the current workarounds look like, and where the FLUX.1-Kontext-based approach fits in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do characters drift?
&lt;/h2&gt;

&lt;p&gt;Every text-to-image inference is in effect a &lt;strong&gt;fresh sample from a very high-dimensional distribution&lt;/strong&gt;. The model has no state between generations. Prompt A and prompt B may both say "detective with red scarf," but the specific pixel arrangement that the sampler lands on is governed by the noise seed, the scheduler, and a thousand tiny decisions inside the denoising network. Two calls that share a prompt but not a seed will produce two different people who both roughly match the description.&lt;/p&gt;

&lt;p&gt;Put differently: the model does not have a &lt;em&gt;character&lt;/em&gt;. It has a &lt;em&gt;prompt&lt;/em&gt;. Every panel is a new roll of the dice against the same loose description.&lt;/p&gt;

&lt;p&gt;Classical diffusion workflows try to fix this with three tricks, none of which are great:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Seed locking.&lt;/strong&gt; Use the same random seed for every panel. Works only if the prompt is essentially unchanged — the moment you add "sitting down" or "pulling out a notebook," the composition changes and the seed lock stops helping (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Textual inversion / DreamBooth.&lt;/strong&gt; Train a token embedding (textual inversion) or fine-tune the model (DreamBooth) on reference photos of the character. Effective but slow, expensive, and brittle — you are training something new for every character in your comic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-image prompting.&lt;/strong&gt; Paste the previous panel into the prompt as a reference. Some models accept it; most do not; when they do, they often regress to the mean face after a few hops.&lt;/li&gt;
&lt;/ol&gt;
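
&lt;p&gt;For concreteness, trick 1 in diffusers looks like this (a sketch; SDXL stands in for any seed-based text-to-image model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "a detective with a red scarf walks into a neon-lit bar",
    "a detective with a red scarf sits down at the counter",
]

# Same seed for every panel. Helps only while the composition barely changes.
panels = [
    pipe(p, generator=torch.Generator("cuda").manual_seed(42)).images[0]
    for p in prompts
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
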

&lt;h2&gt;
  
  
  What FLUX.1-Kontext adds
&lt;/h2&gt;

&lt;p&gt;FLUX.1-Kontext is Black Forest Labs' image-to-image-conditioned variant of FLUX. The relevant design choice is that it treats the reference image not as "inspiration" (loose style transfer) but as &lt;strong&gt;hard conditioning&lt;/strong&gt; during the denoising process. You pass in a reference sheet — the character's face, outfit, key features — and the generation is pulled toward that reference, not just textually but pixel-wise, through cross-attention.&lt;/p&gt;

&lt;p&gt;For comics this is almost exactly the right primitive. The workflow becomes (sketched in code after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a reference sheet for each character once (face, outfit, distinctive props).&lt;/li&gt;
&lt;li&gt;For every panel, pass the relevant character's sheet + the scene description.&lt;/li&gt;
&lt;li&gt;The model respects the sheet as a constraint, not a suggestion.&lt;/li&gt;
&lt;/ol&gt;
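
&lt;p&gt;A sketch of that loop with the diffusers &lt;code&gt;FluxKontextPipeline&lt;/code&gt; (file names and prompts are mine; &lt;code&gt;guidance_scale&lt;/code&gt; follows the release's suggested value):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

ref_sheet = load_image("detective_reference_sheet.png")  # generated once, step 1

scenes = [
    "walks into a neon-lit bar, wide shot",
    "sits down at the counter, medium shot",
    "pulls out a notebook, close-up",
]

# Step 2: every panel gets the same sheet as hard conditioning.
panels = [
    pipe(
        image=ref_sheet,
        prompt=f"the detective with the red scarf {scene}",
        guidance_scale=2.5,
    ).images[0]
    for scene in scenes
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
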

&lt;p&gt;The same detective now has the same face, the same red scarf, and the scarf actually stays a scarf.&lt;/p&gt;

&lt;h2&gt;
  
  
  What breaks and what does not
&lt;/h2&gt;

&lt;p&gt;In practice the approach works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontal and three-quarter faces.&lt;/strong&gt; The reference sheet is usually a clean portrait; panels that echo that framing stay on-model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinctive clothing and props.&lt;/strong&gt; A red scarf, a specific hat, a tattoo — these get preserved reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short stories (6–12 panels).&lt;/strong&gt; Drift is minimal within a single story.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It still struggles with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extreme poses.&lt;/strong&gt; A character leaping mid-air from behind is a composition the reference sheet does not cover, so the model extrapolates and sometimes loses the face.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background characters.&lt;/strong&gt; Secondary characters without their own reference sheet still drift. You either sheet them too or accept drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-form continuity across chapters.&lt;/strong&gt; After 50+ panels the accumulated small variations become visible. Re-anchoring to the sheet every 10 panels helps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A practical note on tooling
&lt;/h2&gt;

&lt;p&gt;You can run this stack yourself — the FLUX.1-Kontext weights are open — but assembling the pipeline (reference sheet generator, scene scripter, panel renderer, single-panel regenerator, style picker) is a fair amount of plumbing.&lt;/p&gt;

&lt;p&gt;I have been using &lt;a href="https://www.comicory.com" rel="noopener noreferrer"&gt;comicory.com&lt;/a&gt; as a hosted implementation of roughly this architecture. Drop in a story paragraph, the system handles the scripting and reference-sheet step, and the multi-panel output keeps the same character recognizable. Eight art styles available (manga, Western comic, watercolor, ink wash, etc.), and critically, &lt;strong&gt;single-panel regeneration&lt;/strong&gt; is supported — if panel 4 drifts, you redo only that panel without rebuilding the rest of the story. Free tier is 30 images per month which is enough to evaluate the workflow.&lt;/p&gt;

&lt;p&gt;Not a pitch; mostly flagging it because I spent a couple of weeks trying to glue the same pipeline together locally and it was a lot of YAML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;The character consistency problem is a nice example of how &lt;strong&gt;architectural fixes beat clever prompting&lt;/strong&gt;. For the first three years of diffusion-for-comics, the whole field was trying to solve consistency at the prompt level — longer prompts, locked seeds, character templates, multi-image prompting. None of it really worked. The real unlock was a model class that takes a reference image as first-class conditioning.&lt;/p&gt;

&lt;p&gt;When a generation problem resists prompt engineering for long enough, the answer is usually that the model architecture is wrong for the task, and someone will eventually ship a new one. FLUX.1-Kontext is that ship for multi-panel comics. I am curious what the equivalent "right architecture" looks like for the remaining hard cases — long-form continuity, multi-character scenes with physical interaction, and expressive pose variation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
