
Uya

Originally published at zenn.dev

Measuring Japanese Read-Aloud Speed with AmiVoice Timestamps — A Coaching App That Doesn't Stop at STT-to-Claude

📝 Originally published in Japanese on Zenn. This is the English version.
Canonical: https://zenn.dev/uya0526_design/articles/main_article_reading-speed-meter

Introduction — What I Built

I built a web app that lets you read a Japanese passage aloud, measures your speed and fluency, and has an AI coach return a one-line piece of feedback.

The flow is simple. You read a passage aloud into the mic (up to 10 seconds) while looking at the script. I prepared the opening lines of two Japanese classics, The Tale of the Heike and Hōjōki. When you press "Measure":

  1. AmiVoice API recognizes the audio,
  2. the code computes your pure speaking speed (characters/min) and stagnation rate from that result, and
  3. Claude Haiku returns coaching as "one compliment + one improvement."

(Measurement starts on a button press after recording — it never runs automatically.)

This article aims to be a single, self-contained piece covering the whole picture, the design decisions, and the reproduction steps.

💡 Where this sits in my journey

I'm an ex-Java engineer learning TypeScript and Python in public. This was my first project where I deliberately adopted "AI-collaborative development" as a clear mode. Throughout, I'll drop in comparisons to Java — hopefully useful for anyone coming from a similar background.


What You'll Get From This Article

  • How to design evaluation logic that fully exploits the per-word timestamps AmiVoice returns
  • How to build a BFF (Backend for Frontend) so API keys never reach the browser
  • A "two-stage" design where code does the math and Claude Haiku only does the wording
  • First-hand findings you only learn by verifying — e.g., "I thought I'd optimized it, but it wasn't actually working"

I include concrete endpoints, parameters, and environment variables so you can reproduce it yourself.


My Learning Style (AI Transparency)

💡 Learning companions & how this article is written

I use Claude Pro and Cursor Pro as learning companions. This project had a deadline (the AmiVoice-sponsored contest in Zennfes Spring 2026), so I stepped one notch beyond my usual "I write every line myself" style into AI-collaborative development. I'm stating that plainly rather than blurring it.

The axis of disclosure is not "did I use AI" but "am I the one driving."

| Area | Owner |
| --- | --- |
| Tech selection, evaluation-algorithm design, architecture decisions, code verification | Me |
| Boilerplate examples (recording, fetch, API Route skeletons) | AI (Claude / Cursor) |
| Article structure, outline, draft prose, translation | In collaboration with Claude |
| Reviewing and revising all content before publishing | Me |

In other words: the thinking is mine, the wording is AI-assisted, and I verify all of it. Every design decision, every threshold rationale, and every "where I got stuck" is my own first-hand information. I keep a per-step LEARNING_LOG in the repository separating "what I implemented myself" from "what I asked AI for."


Originality — Why Not Stop at "STT → Claude"?

A lot of articles combining speech-to-text (STT) with generative AI stop at "transcribe the audio, then hand the text straight to Claude." That works, but it throws away the best part of AmiVoice.

AmiVoice isn't just transcription. For each individual word, it returns:

| Field | Meaning |
| --- | --- |
| written | Surface form (kanji + kana) |
| spoken | Reading (kana) |
| confidence | Confidence score (0–1) |
| starttime / endtime | Word start/end time (milliseconds) |

Using starttime / endtime, you know when, which word, and over how long it was read. So my approach is:

Get depth from the evaluation logic (code), then turn it into something human with Claude Haiku — a two-stage design.
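To make "when, which word, and over how long" concrete, here is a minimal sketch of what the timestamps allow. The names TimedToken, TokenTiming, and describeTimings are hypothetical (they are not from the app), and the sample values are made up for illustration.

```typescript
// Hypothetical token shape: only the fields this sketch needs.
interface TimedToken {
  written: string;
  starttime: number; // milliseconds
  endtime: number;   // milliseconds
}

interface TokenTiming {
  written: string;
  durationMs: number;    // how long the word itself took
  pauseBeforeMs: number; // silence between the previous word and this one
}

// Derive each word's duration and the pause that precedes it.
function describeTimings(tokens: TimedToken[]): TokenTiming[] {
  return tokens.map((token, i) => {
    const prev = i > 0 ? tokens[i - 1] : undefined;
    return {
      written: token.written,
      durationMs: token.endtime - token.starttime,
      pauseBeforeMs: prev ? Math.max(0, token.starttime - prev.endtime) : 0,
    };
  });
}

// Made-up sample values, for illustration only.
const sampleTimings = describeTimings([
  { written: "祇園精舎", starttime: 0, endtime: 900 },
  { written: "の", starttime: 900, endtime: 1050 },
  { written: "鐘", starttime: 1400, endtime: 1800 },
]);
console.log(sampleTimings);
```

In this made-up sample the third word lasts 400 ms and follows a 350 ms pause. None of that is recoverable from the transcript text alone.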

In education and phonetics, read-aloud fluency is traditionally discussed along three axes: accuracy, speed, and expressiveness. This app (Phase 1) implements speed primarily, and adds stagnation rate (the proportion of pauses) as an in-house metric that's the flip side of speed. Pauses are academically treated as part of speech rate, so I position stagnation as a speed-adjacent metric.

| Axis | Metric in this app | Phase |
| --- | --- | --- |
| Speed | Pure speaking speed (chars/min) / stagnation rate (in-house, speed-adjacent) | Phase 1 ✅ |
| Accuracy | Matching via edit distance | Phase 2 (planned, see README) |
| Expressiveness | (intonation, etc.) | Not in scope for Phase 1 |

"Read-aloud evaluation that leans hard on the time information" turned out to be a rare sweet spot — low cost (just arithmetic) yet genuinely distinctive.


Architecture — Keep API Keys Out of the Browser

Both AmiVoice and Claude require API keys, and those keys must never be exposed to the browser. So I insert a thin relay (BFF / proxy) that holds the keys.

[Browser]
  Show script / record (getUserMedia → MediaRecorder → Blob)
        │  FormData(audio)
        ▼
[Next.js API Routes (BFF / holds keys)]
  /api/recognize  → AmiVoice synchronous HTTP (speech → text)
  /api/feedback   → Claude Messages API (one-line feedback)
        │
        ▼
[Browser]  Show metrics → show AI feedback

The browser never calls the two external APIs directly. Next.js API Routes act as a thin key-holding relay.

Java comparison: This is the same idea as a Spring @RestController reading an external API key from application.yml, never showing it to the client, and relaying the call. Think of it as "a thin Servlet that relays an external API without leaking the secret."

Tech stack:

| Layer | Choice |
| --- | --- |
| Framework | Next.js 16 (App Router) + API Routes |
| Language | TypeScript |
| Recording | MediaRecorder API |
| Speech recognition | AmiVoice API (synchronous HTTP) |
| AI feedback | Claude API (Haiku / @anthropic-ai/sdk) |
| Testing | Vitest (28 tests) |
| Deploy | Vercel |

I considered splitting the backend off into Python (FastAPI), but to hit the deadline I consolidated on Next.js: one language, one repository, one deploy.


Evaluation Logic (Code Side)

All computation and threshold decisions are settled in code. This is where the "depth" comes from.

Pure speaking speed and stagnation rate (a pure function)

I keep everything inside a pure function calculateMetrics that holds no I/O, so it's directly testable with Vitest.

// Input: a type reshaped from the AmiVoice response
interface AmiVoiceResponse {
  text: string;
  segments: { starttime: number; endtime: number }[]; // milliseconds
}
interface ReadingMetrics {
  pureSpeakingSpeed: number; // chars/min
  stagnationRate: number;    // 0–1
}

export function calculateMetrics(res: AmiVoiceResponse): ReadingMetrics {
  const { text, segments } = res;
  if (text.length === 0 || segments.length === 0) {
    return { pureSpeakingSpeed: 0, stagnationRate: 0 };
  }
  const totalSpeakingTimeMs = segments.reduce(
    (acc, segment) => acc + (segment.endtime - segment.starttime), 0);
  if (totalSpeakingTimeMs === 0) {
    return { pureSpeakingSpeed: 0, stagnationRate: 0 };
  }
  const pureSpeakingSpeed = Math.round(
    [...text].length / (totalSpeakingTimeMs / 60000));
  const totalElapsedTimeMs =
    segments[segments.length - 1].endtime - segments[0].starttime;
  if (totalElapsedTimeMs === 0) {
    return { pureSpeakingSpeed: 0, stagnationRate: 0 };
  }
  // Share of elapsed time spent silent, rounded to three decimals.
  const stagnationRate = Math.round(
    ((totalElapsedTimeMs - totalSpeakingTimeMs) / totalElapsedTimeMs) * 1000) / 1000;
  return { pureSpeakingSpeed, stagnationRate };
}

Two design points:

  • Character count is based on the recognized text, in code points: [...text].length. Basing it on the original script would over-estimate speed when the reader skips ahead, so I avoided that.
  • Keep the numeric type with Math.round. toFixed returns a string, so I don't use it.

Java comparison: [...text].length is a character count that accounts for surrogate pairs; reduce corresponds to stream().mapToLong().sum(). The division-by-zero guards and boundary tests are exactly the kind of error-path and boundary-value analysis you write all the time in JUnit.
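The two design points can be checked in isolation. The sketch below uses hypothetical helper names (countCodePoints, roundTo) that are not part of the app. It only shows why code-point counting and Math.round behave differently from .length and toFixed.

```typescript
// Count characters by code point rather than by UTF-16 code unit.
function countCodePoints(text: string): number {
  return [...text].length;
}

// Round to a fixed number of decimals while keeping the number type.
function roundTo(value: number, decimals: number): number {
  const factor = Math.pow(10, decimals);
  return Math.round(value * factor) / factor;
}

const sampleWord = "𠮷野家"; // "𠮷" (U+20BB7) needs a surrogate pair in UTF-16

console.log(sampleWord.length);           // 4 (UTF-16 code units)
console.log(countCodePoints(sampleWord)); // 3 (code points)
console.log(typeof (0.1234).toFixed(3));  // "string"
console.log(typeof roundTo(0.1234, 3));   // "number"
```

For Java readers, the closest built-in equivalent of the code-point count is text.codePointCount(0, text.length()).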

Decide thresholds honestly

Next, labelMetrics converts the numbers into evaluation labels ("slightly fast," etc.). The hard question: how many characters/min counts as "fast"?

Honestly: I couldn't find an academic threshold for "N chars/min = fast/slow." So for speed I leaned on a general rule of thumb — a news announcer's pace ≈ 300 chars/min — and intentionally placed that professional level as the start of "slightly fast" for an ordinary person. For stagnation rate there's no numeric standard at all, so I treat it frankly as an in-house heuristic.

| Speed label | Chars/min |
| --- | --- |
| Slow | up to 149 |
| Slightly slow | 150 – 199 |
| Standard | 200 – 299 |
| Slightly fast | 300 – 350 |
| Fast | 351 and above |
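As a rough illustration, the speed table translates into a small pure function. This is a sketch and not the app's actual labelMetrics: the name labelSpeed and the English label strings are assumptions, and it assumes the input is the already rounded integer chars/min value.

```typescript
type SpeedLabel =
  | "slow"
  | "slightly slow"
  | "standard"
  | "slightly fast"
  | "fast";

// Map chars/min onto the bands from the table above.
// Assumption: charsPerMin is the rounded integer from calculateMetrics.
function labelSpeed(charsPerMin: number): SpeedLabel {
  if (charsPerMin < 150) return "slow";
  if (charsPerMin < 200) return "slightly slow";
  if (charsPerMin < 300) return "standard";
  if (charsPerMin <= 350) return "slightly fast";
  return "fast";
}

console.log(labelSpeed(322)); // "slightly fast"
```

Keeping the boundaries in one function makes the boundary-value tests (149/150, 199/200, 299/300, 350/351) easy to write.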

The reasoning behind these lines — the in-house placement, why I treat stagnation as an in-house metric, the cultural context of "fluent as water off a board," and the relation to mora — is something I dig into in a separate satellite article: "The rationale behind the metrics" (https://dev.to/uya0526design/i-went-looking-for-the-basis-of-n-characters-per-minute-is-fast-there-wasnt-one-setting-4967). Here, just hold onto the principle: when there's no basis, don't dress it up as science — disclose the process and draw the line.

⚠️ MVP limits (stated honestly): Right now the app does not match the script against the recognized text (accuracy is Phase 2 in the README), and the stagnation rate doesn't distinguish the meaningful pauses (ma) of Japanese speech from hesitation. The latter is a research item not yet listed even in the README's Phase 2. Both limits are spelled out in the app's footer.


Exploiting AmiVoice Fully

/api/recognize receives the recording Blob and relays it to AmiVoice's synchronous HTTP API. Here are the spots that trip people up. (Full implementation and a deeper dive are in satellite #1, "Implementing AmiVoice synchronous HTTP" (https://dev.to/uya0526design/calling-amivoices-synchronous-http-api-through-a-nextjs-bff-auth-multipart-order-and-the-webm-21o5).)

import { NextResponse } from "next/server";

const AMIVOICE_ENDPOINT = "https://acp-api.amivoice.com/v1/nolog/recognize";
const AMIVOICE_ENGINE = "-a-general"; // general conversational

export async function POST(req: Request) {
  const inForm = await req.formData();
  const audio = inForm.get("audio");
  // Reject requests without an audio file instead of forwarding them.
  if (!(audio instanceof Blob)) {
    return NextResponse.json({ error: "audio is required" }, { status: 400 });
  }

  const outForm = new FormData();
  outForm.append("u", process.env.AMIVOICE_API_KEY ?? ""); // auth
  outForm.append("d", AMIVOICE_ENGINE);                    // engine
  outForm.append("a", audio, "recording.webm");            // audio (must be last)

  // Relay AmiVoice's status and body to the browser unchanged.
  const res = await fetch(AMIVOICE_ENDPOINT, { method: "POST", body: outForm });
  const body = await res.text();
  return new NextResponse(body, {
    status: res.status,
    headers: { "Content-Type": res.headers.get("Content-Type") ?? "application/json" },
  });
}

(Simplified. In the real code, a missing AMIVOICE_API_KEY returns 500 and never sends an empty key to AmiVoice.)
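The missing-key guard mentioned in that note could look roughly like this. It is a sketch under assumptions: the name resolveAmiVoiceKey, the KeyLookup type, and the error message are hypothetical, and the real code may be structured differently.

```typescript
type KeyLookup =
  | { ok: true; key: string }
  | { ok: false; status: 500; message: string };

// Look up the AmiVoice key; report a 500 instead of passing on an empty key.
function resolveAmiVoiceKey(
  env: Record<string, string | undefined>,
): KeyLookup {
  const key = env["AMIVOICE_API_KEY"]?.trim();
  if (!key) {
    return {
      ok: false,
      status: 500,
      message: "AMIVOICE_API_KEY is not configured",
    };
  }
  return { ok: true, key };
}

console.log(resolveAmiVoiceKey({}).ok); // false
```

Inside the route handler, the result would be checked before building the outgoing form: if ok is false, return a JSON error with that status and never call AmiVoice.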

Gotchas I confirmed against both the official manual and curl:

  • Auth is the multipart u field, not an Authorization header. I assumed header auth at first and got stuck here.
  • The audio a must be the final multipart part. Add another field after it and it gets ignored.
  • WebM + Opus carries a header in the container, so you can omit the audio-format parameter c on synchronous HTTP (verified with curl). The browser's MediaRecorder output (audio/webm;codecs=opus) went through as-is.

I reshape the raw JSON into AmiVoiceResponse with a pure-function mapper before passing it to calculateMetrics. text is top-level; segments are built from starttime / endtime in results[0].tokens — and yes, I initially referenced result (singular) and had to fix it.
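A minimal sketch of such a mapper follows. The function name toAmiVoiceResponse and the RawAmiVoiceJson type are assumptions, and the raw type lists only the fields this sketch reads. Skipping tokens without numeric timestamps is a defensive choice of mine, not something the original states.

```typescript
// Only the fields this sketch reads; the real response carries more.
interface RawToken {
  starttime?: number;
  endtime?: number;
}
interface RawAmiVoiceJson {
  text?: string;
  results?: { tokens?: RawToken[] }[];
}

// Same shape as the AmiVoiceResponse input type used by calculateMetrics.
interface AmiVoiceResponse {
  text: string;
  segments: { starttime: number; endtime: number }[]; // milliseconds
}

function toAmiVoiceResponse(raw: RawAmiVoiceJson): AmiVoiceResponse {
  // Note the plural: tokens live under results[0], not result.
  const tokens = raw.results?.[0]?.tokens ?? [];
  const segments: AmiVoiceResponse["segments"] = [];
  for (const token of tokens) {
    // Assumption: skip any token that lacks numeric timestamps.
    if (typeof token.starttime === "number" && typeof token.endtime === "number") {
      segments.push({ starttime: token.starttime, endtime: token.endtime });
    }
  }
  return { text: raw.text ?? "", segments };
}

// Made-up payload, for illustration only.
const mappedSample = toAmiVoiceResponse({
  text: "祇園精舎の鐘",
  results: [{ tokens: [{ starttime: 0, endtime: 900 }, { starttime: 900, endtime: 1050 }] }],
});
console.log(mappedSample.segments.length); // 2
```

Because the mapper is pure, a malformed or empty response degrades to empty text and no segments, which calculateMetrics already treats as zero.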


Claude Haiku's "Two-Stage" Design

For feedback generation I'm strict about the split: code does the math, Haiku only does the wording. By the time data reaches Haiku, it is already settled facts. For example, in a real measurement during development (reading the Heike sample), labelMetrics produced "speed = 322 chars/min, label = slightly fast, stagnation = 0%, label = few." Haiku's only job is to translate that into warm words. I keep the small model on what it's best at, wording and tone, and never demand numeric precision from it.

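import Anthropic from "@anthropic-ai/sdk";

// The SDK reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// FEEDBACK_PROMPT (the static coach persona) and feedbackFacts (the labelled
// metrics) are defined elsewhere in the app.
const result = await client.messages.create({
  model: process.env.ANTHROPIC_MODEL!, // e.g. claude-haiku-4-5
  max_tokens: 256,
  system: [
    { type: "text", text: FEEDBACK_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [
    { role: "user", content: JSON.stringify(feedbackFacts) }, // settled facts only
  ],
});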

The two parts of the request play different roles:

  • system = persona and ground rules (a static coach persona)
  • messages role: "user" = the dynamic data that changes each time (the evaluation JSON)

Java comparison: A fixed system prompt plus a variable user payload is the same idea as separating the Service layer (computation = code) from the presentation layer (wording = Haiku).

The "optimization" that wasn't working

This is my biggest first-hand finding from this project. I added cache_control to the static system prompt to make it cheaper via prompt caching. I even did the break-even math and concluded "it pays off after two uses."

It wasn't working at all. Claude Haiku 4.5's minimum cacheable size is 4,096 tokens, and my system prompt was a few hundred. It didn't meet the threshold, so even with cache_control written, nothing was cached. (Confirmed via the response usage: both cache_creation_input_tokens and cache_read_input_tokens were 0.)

Writing "I optimized it with caching" would be false. So I'm keeping it in the article as a verification process: I thought it would help, checked, found it didn't meet the conditions, and it didn't help. Not casually claiming "optimized" is, to me, part of being honest.
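The check itself is small. The sketch below reads the two usage fields named above. The helper name cacheStatus and its return labels are hypothetical, and the CacheUsage type lists only the fields this check needs.

```typescript
// Subset of the `usage` object on a Messages API response.
interface CacheUsage {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheStatus = "read" | "written" | "not cached";

// Classify a single response: did the prompt cache do anything at all?
function cacheStatus(usage: CacheUsage): CacheStatus {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "read";
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "written";
  return "not cached";
}

// The situation described above: both counters are 0.
console.log(
  cacheStatus({ cache_creation_input_tokens: 0, cache_read_input_tokens: 0 }),
); // "not cached"
```

Logging this once per request during development is enough to catch a cache_control that is silently below the minimum size.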

"Written in the prompt ≠ obeyed"

One more. A ~10-second read produces recognized text with no punctuation. Yet Haiku tacked on a concrete tip that wasn't in the input: "take a breath at the punctuation." I added a line to the prompt — "don't mention punctuation" — but even that isn't a 100% guarantee.

This is the same shape as a lesson from a past project: "a passing test ≠ behaving as intended." What matters isn't whether you instructed something, but whether you verify against real data that it was obeyed. If you ultimately want fully stable output, you can push the improvement focus into code and pass it as a single field (a Phase 2 option).
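One way to verify against real data is a small post-check on each generated line. This is a sketch with hypothetical names (checkFeedback, FeedbackCheck). The banned terms and the 100-character limit are passed in as parameters, and the sample feedback is made up.

```typescript
interface FeedbackCheck {
  mentionsBannedTopic: boolean; // e.g. the output talks about punctuation
  overLength: boolean;          // longer than the best-effort limit
}

// Inspect the model's output instead of trusting the prompt.
function checkFeedback(
  feedback: string,
  bannedTerms: string[],
  maxChars: number,
): FeedbackCheck {
  return {
    mentionsBannedTopic: bannedTerms.some((term) => feedback.includes(term)),
    overLength: [...feedback].length > maxChars, // code points, as in calculateMetrics
  };
}

// Made-up output that breaks the "don't mention punctuation" rule.
const sampleCheck = checkFeedback(
  "句読点で一息つきましょう。",
  ["句読点", "punctuation"],
  100,
);
console.log(sampleCheck); // { mentionsBannedTopic: true, overLength: false }
```

A check like this only detects violations. What to do with them (log, retry, or fall back to a fixed sentence) is a separate decision.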

The finalized prompt, the full cache-verification details, and how I isolated the punctuation issue are in satellite #2, "Claude Haiku coaching design and the prompt swamp" (https://dev.to/uya0526design/dont-let-claude-haiku-do-the-math-a-two-stage-read-aloud-coach-design-and-the-prompt-swamp-2ihc).


Reproduction Steps

The minimum to run it locally:

git clone https://github.com/uya0526-design/reading-speed-meter
cd reading-speed-meter
npm install

Create .env.local at the project root (never commit it):

AMIVOICE_API_KEY=your_amivoice_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
ANTHROPIC_MODEL=claude-haiku-4-5

Then start the dev server, run the tests, and check the production build:

npm run dev    # → http://localhost:3000
npm test       # Vitest (28 tests)
npm run build  # production build check

To deploy on Vercel: import from GitHub → confirm the Framework Preset is detected as Next.js → add the three Environment Variables above → Deploy. Crucially, none of the three should have a NEXT_PUBLIC_ prefix (that would expose them to the browser). Recording needs mic permission, so it runs over HTTPS (Vercel's public URL) or localhost.


Design Decisions & Gotchas — Highlights

| Topic | What happened / how I decided |
| --- | --- |
| AmiVoice auth | Assumed an Authorization header → found synchronous HTTP uses the u field; fixed |
| Audio part order | Put a (audio) as the last part; anything after it is ignored |
| WebM/Opus | Header present, so c can be omitted; MediaRecorder output passes as-is |
| Prompt cache | Thought it would help, but didn't meet Haiku's 4,096-token minimum, so it didn't work |
| Punctuation leak | Haiku mentioned punctuation that wasn't in the input → fixed the prompt (no full guarantee) |
| Char-count limit | "Within 100 chars": LLMs can't count precisely, so I demoted it to a best-effort goal |
| Classics attribution | AI explained the Heike / Hōjōki openings backwards → I verified against sources and rejected it |

That last one symbolizes "the human is the one driving" in AI-collaborative development. Don't take AI's suggestions at face value; verify domain knowledge against primary sources. That accumulation is what makes this article and the implementation trustworthy.


Satellite Articles (Deep Dives)

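This article is the overview. Each topic is broken out separately; the ones linked in this article are:

  • "The rationale behind the metrics" (https://dev.to/uya0526design/i-went-looking-for-the-basis-of-n-characters-per-minute-is-fast-there-wasnt-one-setting-4967)
  • Satellite #1, "Implementing AmiVoice synchronous HTTP" (https://dev.to/uya0526design/calling-amivoices-synchronous-http-api-through-a-nextjs-bff-auth-multipart-order-and-the-webm-21o5)
  • Satellite #2, "Claude Haiku coaching design and the prompt swamp" (https://dev.to/uya0526design/dont-let-claude-haiku-do-the-math-a-two-stage-read-aloud-coach-design-and-the-prompt-swamp-2ihc)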


On to Phase 2

Next, following the Phase 2 plan in the README, I'll tackle accuracy (script matching via edit distance), on-screen display of the recognized text, tempo stability, history, and UI polish. Two items remain research topics and are not yet listed as individual Phase 2 tasks: adding "meaningful pause (ma)" detection to the stagnation rate, and stabilizing Haiku's output by settling the improvement focus in code.


Wrapping Up

This was my record of building a "read-aloud coaching app" with AmiVoice and generative AI. Three takeaways:

  1. Exploiting AmiVoice's timestamps fully gave the app a depth that a plain "STT → Claude hand-off" doesn't have.
  2. A two-stage design — code does the math, Haiku only does the wording — kept the small model on the job it's best at.
  3. I kept the first-hand findings I only learned by verifying (the cache that didn't work, the prompt that wasn't obeyed, the AI's wrong attribution) honestly intact.

The detailed development log lives in the repository's LEARNING_LOG.


This article is part of my public learning journey using AI tools (Claude / Cursor). The design, tech selection, and evaluation-algorithm decisions are mine, and the code is verified with Vitest. I collaborate with AI on the article's structure, outline, and draft prose, and I review and revise every line before publishing.
