DEV Community: qcrao

Why I Run Claude Code in Git Worktrees (And You Probably Should Too)

qcrao — Sat, 23 May 2026 04:04:48 +0000

For the first month I used Claude Code, I ran it the way I ran every other coding tool: from my repo's root directory, one session at a time. It worked. But every time Claude was halfway through a feature and I needed to pop over to main to fix a typo in production, I'd hit the same wall — stash, checkout, find that Claude's session was now confused about what branch it was on, lose context, swear quietly.

The fix turned out to be a Git feature that's been in the toolbox since 2015 and that I had never used: git worktree.

What git worktree actually does

A worktree is a second (third, tenth) working directory attached to the same .git folder. Different directory, different branch checked out, same repo history. Disk usage is minimal — only the working files duplicate, not the object database.

# from inside your main repo
git worktree add ../myrepo-hotfix main
git worktree add ../myrepo-feature-x -b feature/x

You now have three sibling directories, each on its own branch, each fully usable.

Why this matters for Claude Code

Once you have worktrees, the workflow becomes obvious: one worktree per Claude session.

mkdir -p ~/work/myrepo-worktrees
cd ~/work/myrepo
git worktree add ../myrepo-worktrees/feat-auth -b feat/auth
git worktree add ../myrepo-worktrees/feat-billing -b feat/billing
git worktree add ../myrepo-worktrees/hotfix-prod -b hotfix/prod

# now spin up three Claude sessions in three terminals
cd ~/work/myrepo-worktrees/feat-auth && claude
cd ~/work/myrepo-worktrees/feat-billing && claude
cd ~/work/myrepo-worktrees/hotfix-prod && claude

Each Claude session is rooted in a different directory, on a different branch, with its own file context. They cannot step on each other's edits. The "branch is already checked out at..." error goes away because each worktree owns its branch. And when one agent finishes, merging is a normal git merge — no branch acrobatics needed.

For me this turned "one feature at a time" into "three parallel features all day", and the only thing it cost was 200 MB of duplicated node_modules.

The tmux setup that made it click

I wrap the three sessions in a tmux window so I can glance at all three at once:

tmux new-session -d -s claude-pool
tmux send-keys -t claude-pool 'cd ~/work/myrepo-worktrees/feat-auth && claude' C-m
tmux split-window -h -t claude-pool
tmux send-keys -t claude-pool 'cd ~/work/myrepo-worktrees/feat-billing && claude' C-m
tmux split-window -v -t claude-pool
tmux send-keys -t claude-pool 'cd ~/work/myrepo-worktrees/hotfix-prod && claude' C-m
tmux attach -t claude-pool

Three panes, three agents, zero context collision.

Gotchas

node_modules: each worktree needs its own. Either eat the disk cost or use pnpm with a shared store.
.env files: these are gitignored, so they don't follow you into a new worktree. Symlink or copy.
Hooks: git hooks live in .git/hooks which is shared across worktrees. Test once, runs for all.
Removing a worktree: never rm -rf the directory. Use git worktree remove <path> so Git's bookkeeping stays consistent, or you'll see worktree is dirty errors later.

When it's overkill

If you only ever run one agent at a time and you're not context-switching across hotfixes, plain branches are fine. Worktrees pay off when parallelism × interruption rate is high — which, with AI agents, it nearly always is.

I ended up cataloging every git worktree command, every common error, and the setups for Cursor / Codex / OpenCode too over at gitworktree.org — the Claude Code + worktree guide is the one I refer back to most.

If you've got your own worktree tricks for AI agents, drop them in the comments — always looking for the next 10% of speedup.

Why backstory drafting comes before scene rendering in Comicorys pipeline

qcrao — Sun, 17 May 2026 17:46:32 +0000

Why Comicory uses six panels as the default ceiling for AI comic pages

qcrao — Sun, 17 May 2026 17:42:25 +0000

Why Comicory keeps text bubbles separate from panel generation

qcrao — Sun, 17 May 2026 17:38:24 +0000

Why TubeVocab caches flashcards in the browser instead of relying on the backend

qcrao — Sun, 17 May 2026 17:34:33 +0000

Why I scoped TubeVocab to YouTube only and skipped TikTok and Bilibili

qcrao — Sun, 17 May 2026 17:29:58 +0000

The most common feedback on a YouTube-based vocabulary tool is some version of "this is great, can you add TikTok next?" Or Instagram Reels. Or Bilibili for Chinese learners. Or Netflix. Or any platform with subtitled video.

I considered all of them. I built TubeVocab as a YouTube-only product on purpose. The decision was less about engineering effort and more about content quality and learner outcomes.

Captions are not a commodity

A vocabulary tool that mines comprehensible input from video stands or falls on caption quality. Without a clean caption track, every downstream feature — click-to-flashcard, dictation, cloze tests, vocabulary in context — degrades.

YouTube's caption ecosystem has three things going for it that other platforms do not:

A massive base of human-edited captions from creators who care about accessibility and SEO.
Automatic captions that, while imperfect, are usable for popular speakers and standard accents.
A long-standing public API that lets a learning tool read the caption track without scraping fragile HTML.

TikTok captions are heavily auto-generated, often inaccurate, and frequently burned into the video as visual overlays rather than text. Instagram captions are similar. Netflix subtitles are high quality but locked inside DRM. Bilibili has good captions but a Chinese-speaking user base that is not the dominant target for an English learning tool.

A vocabulary tool built on the wrong caption source is built on sand.

Watch time and content length matter

YouTube is unique among video platforms in the variety of content length it supports. A learner can find a five-minute explainer or a ninety-minute documentary on essentially any topic, in any English accent.

Vocabulary acquisition through video requires extended exposure to coherent speech. A ten-second TikTok cannot give a learner the same kind of language exposure that a fifteen-minute YouTube video can. The information density per minute is similar, but the contextual scaffolding inside a longer video helps learners attach new words to surrounding ideas.

A tool that depends on long-form context simply works better on a platform that hosts long-form content. Scoping TubeVocab to YouTube made each saved word more useful, because each word came from a more complete contextual moment.

Focused scope beats feature breadth

There is a familiar pull in product design to support every adjacent platform "for completeness". The downside is that completeness drains energy from the depth of the experience on any single platform.

For TubeVocab, supporting YouTube alone means I can build features that exploit YouTube specifically: deep linking to a specific second of a video, channel-aware vocabulary grouping, recommendation against a learner's watched list, transcript-level cloze tests. Each of those would have to be reinvented per platform if the tool tried to serve all video sources at once.

Depth on one platform beats shallow coverage of five.

The exception is the rule

Plenty of learners genuinely watch a mix of YouTube and other video sources. For those learners, TubeVocab covers only part of their habit. That is fine. A vocabulary tool does not need to capture every word the learner ever encounters. It needs to capture enough words, often enough, with enough context, to build a useful flashcard deck and a stable review habit.

For learners whose YouTube watching is a meaningful chunk of their English input, scoping to YouTube is plenty. For learners whose YouTube watching is incidental, no tool would help much anyway.

The lesson

Picking the right single platform and building deeply on it is a more honest strategy than trying to be everywhere. It also keeps the development surface small enough that a tiny team can ship real improvements every week instead of patching integration breakage on five different APIs.

For a vocabulary tool, content quality and content length determine the ceiling. YouTube wins on both, by a margin that makes adding TikTok or Bilibili a worse product, not a more inclusive one.

Why TubeVocab keeps the dictionary popup quiet instead of overwhelming learners

qcrao — Sun, 17 May 2026 17:25:58 +0000

Most dictionary popups inside language-learning tools try to do too much at once. Click a word and you get the translation, three definitions, four example sentences, frequency rank, etymology, conjugation table, audio button, save button, and a related-words list. All at once, all on the same panel.

It looks generous. It actually breaks the moment of reading.

When I built the click-to-look-up popup for TubeVocab, I started with that maximalist design and slowly stripped it down. The version that actually retains learners is intentionally quiet: one translation, one short example, one save button. Everything else lives behind a single tap.

A popup is a reading interruption, not a reference page

When someone watches a YouTube video with subtitles and clicks an unknown word, they are not opening a dictionary. They are buying themselves a half second of clarification so they can keep watching.

If the popup contains a dense reference layout, three things happen, all of which break the loop:

The video stays paused longer than the learner intended.
The learner starts reading the popup instead of returning to the subtitle.
The cognitive load of choosing what to read inside the popup competes with the cognitive load of the next subtitle line.

The popup is a tooltip, not a destination. Treating it like a destination is the most common mistake in this kind of interface.

Defaults should match the dominant intent

Among learners hovering over an unknown word, the dominant intent is "what does this mean, roughly, so I can keep going". Maybe 80 percent of clicks fit that pattern.

A smaller share genuinely wants to dive in: see all senses of the word, hear it pronounced multiple ways, check a usage table, save it with context, compare regional variants. That is real, but it is the minority intent.

A good popup serves the majority intent in zero clicks and the minority intent in one click. A bad popup serves both at the same time and accidentally serves neither.

What the quiet popup actually contains

The default state in TubeVocab shows a single line for the most common translation, a single line for a short usage example pulled from the subtitle context, and a small heart icon to save. Nothing else.

If the learner wants more, a single tap expands the panel into the fuller reference view with all senses, all examples, the audio button, and the save-with-tags flow. Tapping outside collapses it back to quiet mode.

That structure rewards both modes of attention. A learner who clicks ten unknown words in a five-minute video gets ten near-instant peeks. A learner who pauses on one fascinating word can drill in without losing the rest of the popup.

The quieter the default, the higher the volume

Counterintuitively, the quieter default produced higher save rates and longer session lengths. Two reasons.

First, learners click more unknown words because the cost of clicking is genuinely low. Volume goes up, which is exactly what a vocabulary tool needs to build up a meaningful flashcard collection.

Second, the few learners who care about deep reference do not feel locked out. The fuller view is still there, just intentionally one tap away.

A maximalist popup feels generous on the surface and exhausting in practice. A quiet popup feels minimal at first and rewards extended use.

The lesson

Reference data in a learning tool should be layered, not dumped. The first layer is the answer to the question the learner actually asked: what does this word mean. Every additional layer should require a deliberate gesture to unlock.

A vocabulary tool is not a dictionary. It is a reading aid. Designing the dictionary popup to behave more like a reading aid and less like a dictionary is what made TubeVocab finally feel calm to use, even when learners click dozens of words inside a single video.

Why dialogue placement is the hardest part of AI comic generation

qcrao — Sat, 16 May 2026 15:43:47 +0000

If you ask people what is hard about AI comics, most will say character consistency. That is a real problem, but it is not the worst one in practice.

The harder problem, in my experience building Comicory, is dialogue placement.

Putting words inside a comic panel sounds trivial. It is one of the parts that breaks down most often, and when it breaks down the whole page reads wrong, even if every individual panel looks beautiful.

Speech bubbles are constraint puzzles

A speech bubble is not just a graphic. It is a constraint that ties together text length, panel composition, character position, reading order, and the actual story logic.

A bubble needs to:

Sit somewhere that does not cover the speaker's face.
Not cover other important visual information in the panel.
Point clearly to the speaker.
Be readable in size and contrast.
Fit the text without overflowing or shrinking the font.
Come before the next bubble in reading order.
Match the tone of the line (a whisper bubble looks different from a shout).

Each constraint sounds simple. Together, they collide. A model that generates a perfect-looking panel often leaves no room for the bubble, or covers the character's mouth, or breaks the reading order with the next panel.

Text rendering is still unreliable inside images

A second problem is that most image models still cannot reliably render text inside the generated image.

Even when the model places a bubble shape correctly, the letters inside it may be unreadable. Half a word missing. A weird artifact. A misspelling.

For a comic, this is not cosmetic. The dialogue is half the storytelling. A misread bubble flips the meaning of a panel.

That is why most working systems, including Comicory, do not let the image model write the text at all. The model produces an empty bubble or a marked region. The actual text is composited on top by a typography layer that knows about font, kerning, and bubble fit. That gives clean, predictable letters.

But it also moves the hard problem somewhere new. Now the bubble shape and position have to match a text length that was decided separately.

Length mismatch breaks everything

The most common failure is the mismatch between the planned dialogue length and the visual space the model leaves for it.

Imagine the storyboard says character A has a four-word line. The model generates a panel with a small empty corner for the bubble. Fine. Then in iteration, the user rewrites the line to eighteen words. Suddenly there is no room. The compositor either shrinks the text until it is unreadable, or overflows the panel art.

Solving this is not a one-shot problem. The system has to negotiate between text length and panel composition at every revision step. It needs to know:

How big can this bubble grow before it hits important art?
How much of the panel is safe to cover?
Should the bubble break across multiple shapes if the line is long?
Should the line be split into two bubbles for the same speaker?

That is a layout engine, not a prompt.

Reading order is a separate layer

Even if every single bubble fits, the bubbles in a multi-panel page must follow a reading order. Western comics read left to right, top to bottom. Within a panel, bubbles read roughly the same way.

A model that generates panels in isolation has no incentive to keep this order consistent. Panel 1 might have its bubble in the top right, while panel 2 has it in the bottom left, and the eye has to ricochet around the page.

Reading order is the cheap part to fix if the system explicitly models it. It is hard to fix if you only realize after rendering that the page is unreadable.

What the product actually has to do

So the working pipeline for dialogue in an AI comic ends up doing more layout work than image generation work:

Decide dialogue lines per panel during the storyboard stage, so the rendering stage knows how much room each bubble needs.
Tell the image model to leave clear, neutral space in a specific region of each panel.
Composite the real text in a typography layer that the user can tweak font, size, and shape on.
Validate that no bubble covers an important face or hand.
Validate reading order across the page.
Provide a small editor so the user can drag a bubble or split a long line into two.

None of these are heroic. All of them are essential. Skip any one and the page reads worse than a manually drawn comic by an amateur.

The product lesson

The flashy demos of AI comics all show character art. The unglamorous reality is that dialogue placement determines whether the page is actually readable.

For Comicory, treating bubbles as a real layout problem, not a rendering hint, was one of the larger architectural decisions. The model produces art and space. The typography layer produces text and bubble shape. The product glues them together with constraints the user can override.

Image quality matters. Character consistency matters. But until the words sit cleanly in the right place at the right time, no amount of pretty art makes the page feel like a comic.

Why I avoided gamification in a vocabulary tool for adult learners

qcrao — Sat, 16 May 2026 15:42:51 +0000

When you build a language learning product in 2026, the default playbook is gamification. Streaks, leagues, mascots, daily quests, lives, hearts, gems. The big apps have proven that this stuff drives retention.

I left almost all of it out of TubeVocab on purpose.

That decision was not about being contrarian. It was about who the tool is actually for. The learners I had in mind are adults who already know they want to improve their English, who already have other reasons to come back, and who are easy to push away with the wrong kind of pressure.

Gamification optimizes for app opens, not for learning

The honest version of streak-based design is that it optimizes a metric. The metric is daily app opens. The user opens the app every day to protect their streak, even if all they do is tap through five seconds of trivial review.

That is good for engagement charts. It is not always good for learning.

A vocabulary tool used three minutes a day for ninety days in a row will probably teach less than a tool used twenty focused minutes three times a week. The shorter sessions are easier to fake, easier to skim through, and easier to do on autopilot. The user feels productive because the streak says so, but the words do not stick any harder than before.

For an adult learner, this matters. They have limited study time. They cannot afford to feel productive while learning nothing.

Streak anxiety pushes the wrong users away

The other failure mode of streak systems is that they punish exactly the people you want to retain.

A serious adult learner often has a busy week. A work trip, a sick kid, a deadline. They skip three days. They come back to the app and their 87-day streak is gone. The mascot is sad. There is a tiny offer to "restore" the streak.

For some users this works. For others, the streak collapse becomes the moment they delete the app. The signal they receive is: this tool punishes me for being human. That feeling is hard to unwind.

I would rather have a learner come back after two weeks, find their saved words exactly where they left them, and feel welcomed than chase them with a guilt-trip notification.

Adults already know why they are here

Children and teenagers often need a reason to start studying. Gamification creates that reason. Adults usually do not.

A working adult who installs an ESL tool already has a clear motivation. Maybe they want a job in an English-speaking team. Maybe they want to watch their favorite YouTubers without subtitles. Maybe they want to read research papers more comfortably. They came to TubeVocab with a goal.

Adding gems and leagues on top of that goal can dilute it. The user starts optimizing for the in-app reward instead of the actual outcome they came for. The product becomes a slot machine wearing an education costume.

I would rather make the underlying activity satisfying than wrap a thin layer of dopamine on top of weak content.

What replaces gamification

Removing streaks does not mean removing motivation. It means moving the motivation closer to real progress.

A few things I lean on instead:

Show what the learner can now read or watch that they could not before. Concrete capability change is the strongest motivator for adults.
Show their saved words in context, with the real video and timestamp where they captured each one. A learner remembers the moment a word was first useful much better than a number.
Make review feel light when life is busy. Five minutes is a fine session, not a failure compared to twenty.
Default to recall correctness, not session length. A learner who reviewed two words deeply did more than a learner who skimmed twenty.

None of this is flashy. None of it puts a mascot on the home screen. That is fine. The reward is supposed to come from the learning, not from the app.

The product lesson

Gamification works. It is also extremely easy to overdo, and the cost falls on the exact users who would have stuck around without it.

For TubeVocab specifically, I am building for adult learners who use real YouTube content. They are picky, busy, and skeptical of edtech tricks. Treating them like adults turns out to be a stronger retention strategy than turning the app into a streak treadmill.

The boring version is also the honest one: design for the learning outcome, surface concrete capability gains, and trust adult learners to come back when the product earns it.

Why splitting storyboard from rendering beats one mega-prompt for AI comics

qcrao — Sat, 16 May 2026 15:26:04 +0000

The most tempting way to build an AI comic generator is also the worst one: take a paragraph of story, send it to an image model, and ask for a finished page.

It feels like magic when it works. It almost never works.

The story-to-page mega-prompt collapses for the same reason that one-shot UI generation collapses. Too many constraints land on the model at once, and the model has no clean way to negotiate them. So I separated storyboard from rendering early in Comicory, and that single split made everything downstream more controllable.

The mega-prompt confuses two different jobs

A comic page is doing two unrelated things at the same time.

One job is structural. How many panels are on the page. What each panel is about. Who is in it. What they are doing. Where the camera sits. What the dialogue is.

The other job is visual. The actual drawing. Lighting. Style. Character likeness. Background detail. Lettering.

These two jobs need different reasoning. Structure is closer to a writer's mental model. Rendering is closer to an illustrator's. Asking one prompt to do both forces the model to do both at once, and the result is usually a compromise that satisfies neither.

Storyboard is cheap, rendering is expensive

There is also a cost asymmetry that the mega-prompt ignores.

Generating a structural storyboard, even with a strong language model, is fast and cheap. It is essentially a long-form structured response: panel count, panel descriptions, dialogue, camera notes.

Generating the actual panel images is slow and expensive. Every retry costs real money and real time.

If both happen inside one big prompt, every visual retry forces the structural decisions to be rolled too. That is wasteful, and it makes the workflow harder to debug. The user cannot tell whether the failure was a story problem or an image problem.

Splitting them gives the system a chance to validate the cheap part before paying for the expensive part. The structural storyboard can be checked, edited, or regenerated quickly. Only when it is acceptable does the system commit to rendering.

The user is the right person to gate the transition

Even with a perfect model, the storyboard stage should usually be visible to the user.

This sounds like extra friction, but it is exactly where the user wants control. People bring a specific story idea to a comic tool. They have opinions about pacing, about how a beat lands, about which panel should be a close-up. They do not have opinions about brush stroke density.

A short, readable storyboard, presented as a list of panels with a one-line summary and the dialogue for each, gives them a place to spend that attention. They can tweak panel count, rewrite a line, swap a camera angle, before any image is generated.

That is also the part of Comicory that benefits the most from a fast iteration loop. Storyboard edits should feel like editing a doc, not like negotiating with an image model.

Rendering inherits a clear contract

Once the storyboard is settled, the rendering stage has a much smaller and clearer job.

For each panel, it receives an explicit description, a character reference, a camera intent, and the dialogue that must fit. It does not have to invent the page structure or guess what the user meant. It just has to draw the one panel.

That clean handoff makes character consistency easier. It makes panel composition easier. It makes targeted regeneration possible, because the storyboard remains the source of truth across retries.

It also makes the system easier to reason about. If a panel renders badly, the question is just "did the rendering stage do its job for this panel description?" There is no ambiguity about whether the story intent was right.

What this looks like in the product

In Comicory the pipeline ends up roughly like this:

Take the user's story or prompt and produce a structural storyboard.
Show the storyboard to the user so they can edit panel count, descriptions, and dialogue.
Lock the storyboard.
Render each panel against the locked storyboard, reusing the same character reference.
Allow per-panel regeneration without changing the rest of the page.

Each step is its own product surface. Each can be improved on its own. The model does not have to be a comic genius. It just has to be reliable at each stage.

The product lesson

The biggest mistake I see in AI comic demos is collapsing storyboard and rendering into one heroic prompt and then blaming the model when it fails.

The model is not the bottleneck. The architecture is.

Once storyboard and rendering are separated, the model becomes a much more useful collaborator. The user gets a real place to apply taste. The system gets a clean contract between stages. The retries get cheaper. The final page gets better.

That is the kind of structural decision that does not show up in a flashy demo, but determines whether the product is actually usable for someone trying to make a real comic.

Why I chose human-edited subtitles over AI auto-captions for vocabulary mining

qcrao — Sat, 16 May 2026 15:23:48 +0000

When I was prototyping TubeVocab, the obvious shortcut was to use YouTube's auto-generated captions for every video. They are free, they exist on almost every clip, and the API gives them back instantly.

I tried that path for a few weeks. It did not survive contact with real learners.

The vocabulary mining experience depends on subtitle quality much more than on UI polish, and human-edited subtitles consistently beat auto-captions in ways that matter to ESL learners.

Auto-captions get the easy words right

For clean studio audio with a single speaker, auto-captions are good enough. A YouTuber sitting in front of a microphone reading a script will produce captions that match the spoken words at 95 percent accuracy or better.

That accuracy collapses when the audio is messy. Two speakers overlap. Background music kicks in. A guest has a strong accent. A scene cuts to street noise. Suddenly the caption track misses half a sentence, merges words, or hallucinates filler.

Those are exactly the moments where a learner needs the most help. Easy sentences they can already follow. The hard sentences are the ones that need clean subtitles to be saved as a flashcard.

The errors hurt the learning loop more than the UI

A wrong transcription is not just inconvenient. It silently teaches the wrong thing.

If the speaker says "I could have told you," and the caption says "I could of told you," and the learner clicks the phrase to save as vocabulary, they save a piece of folk-grammar that does not exist in formal English. They will be confused later when a teacher marks it wrong.

If the speaker says a specialized term and the caption substitutes a similar-sounding common word, the learner saves the wrong word entirely. Their flashcard deck quietly fills with noise.

That is a worse failure than a missing caption. A missing caption means the learner moves on. A wrong caption means the learner trusts it.

Human-edited subtitles encode rhythm

There is a second, less obvious reason human subtitles are better.

Auto-captions split lines mostly by silence detection. A human editor splits lines by meaning and reading rhythm. They group a phrase like "as a matter of fact" on a single line. They break before a clause boundary, not in the middle of a noun phrase.

That rhythm matters for ESL learning because most of the value comes from reading the line as a chunk, not as isolated words. When I save a phrase from TubeVocab, I want the natural unit a fluent speaker would say in one breath, not whatever the silence detector happened to align with.

Human captions preserve those chunks. Auto-captions chop them up.

The cost of better subtitles

The downside is obvious. Human-edited subtitles do not exist for most YouTube videos.

A lot of educational creators upload only auto-captions, or no captions at all. A learner who only watches the top 1 percent of curated channels gets perfect subtitles. A learner who watches whatever interests them gets a mixed bag.

So in practice the system has to handle three cases:

The video has good human subtitles. Use them directly.
The video has only auto-captions. Use them, but flag the line as machine-generated so the learner knows to double-check before saving a flashcard.
The video has no subtitles. Either skip it or run a higher-quality transcription pass before exposing it as a vocabulary mining source.

The interesting product question is the second case. It is not honest to pretend the captions are clean. It is also not useful to refuse to work at all.

What I show the learner

The compromise I landed on is to render the subtitles as usual, but mark auto-captions visually so the learner knows what they are looking at. When they hover or click to save a phrase, the saved card carries metadata about whether the source was human-edited or auto-generated.

That changes how the system can treat the saved item later. A learner reviewing a phrase saved from a human-captioned video can trust the original text. A phrase saved from an auto-captioned video can be re-checked against the audio before being promoted into spaced repetition.

This is more work than treating every caption as equally valid. But it matches reality, and it stops the flashcard deck from filling with subtle errors over months of use.

The product lesson

The naive version of TubeVocab treated all YouTube subtitles as a uniform data source. The honest version treats them as a quality spectrum.

For an ESL tool built on top of real videos, that distinction shows up everywhere: in the flashcards learners save, in the dictation lines they replay, in the sentences they trust as examples. A vocabulary tool is only as good as the text underneath it.

That is why I now treat subtitle source quality as a first-class signal, not a free input I can use without thinking. It is one of the boring parts of building TubeVocab, and one of the parts that quietly determines whether learners get value or get misled.

Why single-panel editing matters more than perfect first-shot AI comics

qcrao — Sat, 16 May 2026 15:08:17 +0000

The biggest misunderstanding about AI comic generation is the expectation that the model should produce a finished page in one perfect pass.

That is a nice demo. It is not how people actually make comics.

Even with a strong image model, a comic page has too many fragile constraints: the character has to stay recognizable, the pose has to match the story beat, the camera angle has to vary, the background cannot contradict the previous panel, and the speech bubble has to leave enough room for readable text.

One bad panel can ruin the page. That is why single-panel editing became one of the most important parts of Comicory.

The weak link is usually local

When a generated comic page fails, it rarely fails everywhere.

Panel 1 might have the right establishing shot. Panel 2 might capture the emotion. Panel 4 might land the joke. But panel 3 has the character facing the wrong direction, or the hand turns into an unreadable shape, or the room suddenly changes.

Regenerating the entire page is wasteful. It throws away three good panels to fix one local problem.

That matters for cost, but it matters even more for creative control. A user does not want to negotiate with the model from zero every time. They want to say: keep this page, fix this panel.

Consistency is easier when the edit boundary is small

Character consistency is the headline problem in AI comics, but consistency is not only a model issue. It is also a workflow issue.

If the whole page is regenerated, every panel has another chance to drift. The character's haircut changes. The jacket loses a stripe. The face becomes younger or older. The model may solve the original error while introducing two new ones.

A smaller edit boundary reduces the blast radius. The system can preserve the panels that already work, reuse the same character reference, and focus the prompt on one scene.

That is one reason Comicory treats regeneration as a panel-level action instead of only a page-level action.

"Good enough" needs a second pass

Most AI image demos reward the first shot. You type a prompt, get a pretty image, and share it.

Comics are different because they are sequential. A single pretty image is not enough. The panel must serve the story before and after it.

A panel can be visually attractive and still fail the comic:

The character looks away when the line implies direct confrontation.
The camera repeats the same angle for three panels in a row.
The mood is too dramatic for a small joke.
The composition leaves no room for dialogue.
The background implies a different location.

These are editing problems. They need iteration, not just a better prompt.

The interface should assume revision

Once I accepted that revision is normal, the product design changed.

The important question became: how quickly can someone identify a weak panel, change only that panel, and keep the rest of the comic intact?

That means the UI should make each panel feel addressable. The user should not have to restart from the story paragraph. They should be able to keep the script, keep the character identity, and adjust the one visual beat that missed.

This also makes the tool less intimidating for non-artists. They do not need to understand model parameters. They just need a clear revision loop: pick panel, describe change, regenerate, compare.

The product lesson

Perfect first-shot generation is still worth improving, but it is not the whole game.

For real comic creation, control after generation is where the product starts to feel usable. Users forgive a model that needs one retry. They do not forgive a workflow that makes every retry destroy the parts they liked.

That is why I now think of AI comic tools less like image vending machines and more like lightweight editing systems. The model creates the draft. The product decides whether the draft can become a finished comic without making the user fight the whole page again.