DEV Community

Uya
Uya

Posted on • Originally published at zenn.dev

Don't Let Claude Haiku Do the Math — A Two-Stage Read-Aloud Coach Design, and the Prompt Swamp

📝 Originally published in Japanese on Zenn. This is the English version.
Canonical: https://zenn.dev/uya0526_design/articles/satellite2_haiku-coaching

📚 This is satellite article #2 in my "Read-Aloud Speed Meter dev log" series. For the whole picture, see the main article; for the AmiVoice integration, see satellite #1.

Where This Sits

This article covers the part of the read-aloud speed meter that hands the measured numbers to Claude Haiku to generate coaching as "one compliment + one improvement."

Many articles that use generative AI just dump it on the model: "here's the recognized text, evaluate it nicely." This article is the opposite. Don't let the LLM do the math. All computation and rule-based decisions are settled in code, and Haiku only does the wording. I'll go through how I designed this "two-stage" split, and which swamps I sank into during implementation.

Four things:

  • Why "code does the math, Haiku only does the wording" (the role-split design)
  • The finalized prompt, and the messages / system / cache_control implementation
  • First-hand findings: the prompt cache that didn't work
  • The moment real data slapped me with "written in the prompt ≠ obeyed"

💡 I'm an ex-Java engineer learning TypeScript in public, so I drop in Java comparisons.


Why Not Let Haiku Do the "Math"?

Up front: Haiku is more than enough for feedback generation — in fact, you can shape the task into something Haiku is good at. The key is the role split.

I settle all the metric computation (speaking speed, stagnation rate, threshold decisions) entirely in code. So by the time it reaches Haiku, it's already settled facts like this (below is from running labelMetrics on a real measurement — reading the Heike sample during development; stagnationRate is a number for percentage display):

{
  "pureSpeakingSpeed": 322,
  "pureSpeakingSpeedEvaluation": "slightly fast",
  "stagnationRate": 0,
  "stagnationRateEvaluation": "few"
}
Enter fullscreen mode Exit fullscreen mode

Haiku's only job is to translate these numbers and labels into warm, concrete words. That's exactly what small models are best at — wording, summarizing, tone — and it demands no numeric precision.

Conversely, I clearly decided what not to let Haiku do:

  • Precise numeric computation (deriving speed or scores) → done in code
  • Multi-step threshold branching (judging "is 322 chars/min fast?") → code interprets it as "slightly fast" before passing it on

Java comparison: This is exactly the separation of the Service layer (computation = code) from the presentation layer (wording = Haiku). Just as you don't write business logic in a template engine, I draw the line at "don't make the LLM write the decisions."

This split has a practical benefit beyond cost: pushing threshold decisions into code makes the output stable. Asking Haiku to judge "is this fast?" every time wavers with its mood, but passing it the label "slightly fast" leaves only the wording free to vary.


The Finalized Prompt

Here's the system prompt (a static coach persona). After running it against real data several times, it settled into this shape. (Translated here for readability; the production version is in Japanese.)

You are a kind, specific coach who reviews Japanese read-aloud practice.

# Premises (strict)
- The JSON you receive is already-computed, already-evaluated fact.
- Do not recompute numbers. Do not invent new metrics. Do not fill in
  facts that aren't in the JSON by guessing.
- Speak only on the basis of what's written in the JSON.
- Do not mention punctuation (samples can be short and may contain none).

# Output format (strict)
- Polite register. Warm, calm tone.
- No emoji, symbols, line breaks, headings, or preamble. Output exactly two sentences.

# Output format (best-effort)
- About 100 Japanese characters total.

# Output content
Sentence 1: Pick the single best point and praise it concretely.
Sentence 2: Give exactly one improvement, as a concrete action to try next.

# How to pick the improvement (top-down; only the first that matches)
1. "stagnationRateEvaluation" is "somewhat many" or worse → how to use pauses (ma)
2. "pureSpeakingSpeedEvaluation" is "slightly slow"/"slow" → ease the tempo forward
3. "pureSpeakingSpeedEvaluation" is "slightly fast"/"fast" → consciously take a breath and read slower
4. None apply → pick the one most-improvable point and frame it positively as the next goal
Enter fullscreen mode Exit fullscreen mode

Three notes on the intent:

  • I prioritize the improvements and emit only the first match. I also added a fourth branch (a fallback) so the model isn't forced to hunt for flaws when everything is good.
  • Without explicitly saying "exactly two sentences," it tacks on preamble or bullet points. I lock the purity of the output via the prompt.
  • "About 100 chars" is a best-effort goal. I explain why below, but it's because LLMs can't count Japanese characters precisely.

API Implementation: system / messages / cache_control

I call it via messages.create from the Claude Messages API (@anthropic-ai/sdk).

const result = await client.messages.create({
  model: process.env.ANTHROPIC_MODEL!, // e.g. claude-haiku-4-5
  max_tokens: 256,
  system: [
    {
      type: "text",
      text: FEEDBACK_PROMPT,
      cache_control: { type: "ephemeral" }, // goes inside the system block
    },
  ],
  messages: [
    { role: "user", content: JSON.stringify(feedbackFacts) }, // settled facts only
  ],
});
Enter fullscreen mode Exit fullscreen mode

What the structure means:

  • system = Claude's persona and ground rules (outside the conversation). The static coach persona lives here.
  • messages role: "user" = the actual conversational input (the evaluation JSON that changes each time).
  • role is either "user" (your input) or "assistant" (Claude's reply). This is a single round-trip, so one user entry is enough.

Java comparison: A fixed system prompt plus a variable user payload is like passing request/response pairs as an array. For multi-turn conversations, you stack the history in order.

How to think about max_tokens

I initially set 1024, copied from an official example, but this output (two sentences) measured at about 88 tokens. max_tokens is "the maximum you may generate," not "the amount you must generate," so a generous value does almost no harm — generation stops naturally when the output ends. As a safety valve against unintended long output, I tightened it to 256 as in the code above.

Java comparison: It feels close to StringBuilder's initialCapacity. Reserving large doesn't cost anything if you don't use it.


Swamp ① — The Prompt Cache "Didn't Work"

Here's where the first-hand, verify-it-yourself finding begins.

Since the static system prompt is identical every time, I figured I could make it cheaper with prompt caching. I added cache_control: { type: "ephemeral" }, did the break-even math (cache writes cost 1.25× the first time, reads are ~90% off afterward, so "it pays off after two uses"), and concluded "I should use this."

But it wasn't actually working.

The reason is a minimum-token wall. Claude Haiku 4.5's minimum cacheable size is 4,096 tokens. My system prompt was a few hundred tokens — below the threshold. Even with cache_control written, nothing was being cached.

You can check via the response usage:

console.log(result.usage);
// cache_creation_input_tokens: 0
// cache_read_input_tokens: 0   ← if both are 0, nothing was cached
Enter fullscreen mode Exit fullscreen mode

Both were 0. I was so absorbed in the break-even math that the underlying minimum-token condition was my blind spot.

Writing cache_control does no operational harm (it's simply ignored). But writing "I optimized it with caching" in this app would be false. I decided the most honest — and most useful to the reader — thing was to keep it as a verification process: I thought it would help, checked, found it didn't meet the conditions, and it didn't help.

💡 Lesson: "The official feature exists" and "it helps in my use case" are different problems. If you don't check the preconditions — like a minimum token count — you can end up thinking you optimized while doing nothing.


Swamp ② — "Written in the Prompt ≠ Obeyed"

The other swamp. Reading with real data, this happened:

  • Read a short ~10-second sample → the recognized text contains no punctuation
  • Yet Haiku tacked on a concrete tip that wasn't in the input: "take a breath at the punctuation"

The cause was that the finalized prompt's rules had punctuation-dependent wording like "settle at the punctuation." As a fix, I:

  1. Added one line to the premises: "do not mention punctuation"
  2. Changed rule 3 from "settle at the punctuation" → "consciously take a breath and read slower" (punctuation-independent wording)

But there's no guarantee this is obeyed 100%. LLMs can probabilistically break even a "do not ~" prohibition.

In fact, the same run had another wobble. The input was stagnation "somewhat many" + speed "slow," which by the rules should prioritize rule 1 (how to use pauses). But the output leaned toward rule 2 (tempo forward). This surfaced the limit of a design that writes the priority decision into the prompt and lets Haiku choose.

What kicked in here is the same shape as a past lesson: "a passing test ≠ behaving as intended." It's not about whether you instructed it, but whether you verify against real data that it was obeyed. "Written in the prompt" does not equal "obeyed."

Java comparison: It's the same as writing "returns within 100 characters" in a method's Javadoc but still validating on the caller side. Writing a spec and the spec being upheld are separate things. Whether it was upheld is checked on the caller side (in code).

If you want stability, "settle it in code and pass it in"

To make the output fully stable, you can decide which improvement to emit in code and pass it as a single field like { "improvementFocus": "how to use pauses" }. Then Haiku can focus only on wording, and the priority wobble disappears.

Approach Detail Trade-off
A (used this time) Write the priority in the prompt, let Haiku choose Looks smart, but wobbles
B Settle the improvement focus in code, pass as one field Stable output, but rigid

Due to the deadline I left it on approach A this time (watching it after the prompt fix), but I have a clear sense that if the wobble stands out in production, leaning to B is the sure bet. This too was a design-decision point: "looking smart vs. being stable."


How to Handle the Character Limit (Phase 1)

Having Haiku itself count "within 100 chars" isn't trustworthy. LLMs can't count Japanese characters precisely and routinely return over the limit. In Phase 1:

  • Keep the prompt's "about 100 chars" as a best-effort goal
  • Hold the actual ceiling with max_tokens: 256
  • Code-side validation (like truncating on overflow) is not implemented in Phase 1 (a future option)

As with swamp ②, the state here is "the prompt is the spec; the strict guarantee isn't in code yet."


What I Implemented Myself / What I Asked AI For

Area Detail
My decisions / implementation The role split (compute = code / wording = Haiku), the finalized prompt text, the improvement priority order, the max_tokens value, the cache verification, isolating and fixing the punctuation issue, the A/B trade-off judgment
Asked AI for The messages.create skeleton, how to write the system array + cache_control, an example of how to check usage
Fixed on AI's pointer Putting cache_control inside the block rather than on a string system, a sensible max_tokens value, observing the punctuation wobble

The prompt text, the role split, and the verification decisions are all mine. I ran it against real data, sank into the swamps, and isolated and fixed them myself.


Wrapping Up

This was my design for generating read-aloud coaching feedback with Claude Haiku. Three takeaways:

  1. A two-stage design — code does the math, Haiku only does the wording. Keep the LLM on its best skill, translation, and settle the decisions in code.
  2. Prompt caching didn't work in this app (it doesn't reach Haiku's 4,096-token minimum). If you don't check the preconditions, "optimizing" becomes "pretending to."
  3. Written in the prompt ≠ obeyed. Verify the punctuation and priority wobbles against real data, and if you need stability, push the decisions into code.

The thread running through it all is one point: instruction and verification are separate. A prompt is a spec, not a guarantee — on that premise, keeping your verification in code and real data is, I believe, the honest stance when using generative AI in a real product.

The detailed development log is in the repository's LEARNING_LOG_Phase1_Step4.md.

Next time I dig into the rationale behind the metrics — on what basis I set the speed and stagnation-rate thresholds → satellite #3, "The rationale behind the metrics" (https://dev.to/uya0526design/i-went-looking-for-the-basis-of-n-characters-per-minute-is-fast-there-wasnt-one-setting-4967).


This article is part of my public learning journey using AI tools (Claude / Cursor). The design, prompt, and verification decisions are mine, and the output is checked against real data. I collaborate with AI on the article's structure, outline, and draft prose, and I review and revise every line before publishing.

Top comments (0)