How we score speaking when "native-like" is the wrong target - the eval rubric behind Elispeak

I build Elispeak, an AI English speaking coach. The first article in this thread covered what was technically hard. The second covered the user-profile layer that makes Eli (the tutor persona) feel like it remembers you. This one is about the piece that sits underneath both: the eval rubric that decides what "you got better today" actually means.

It is the smallest, driest part of the product. It is also the part that keeps every other part honest. If the rubric is wrong, every weakness flagged in the user profile is wrong, every recommendation is wrong, and every "you levelled up" message is a lie.

The wrong target

The default speaking-coach pitch is "talk like a native." That target is broken in three specific ways.

  1. It is not what the user is hiring you for. A QA engineer in Lviv preparing for a hiring panel does not want to sound like a Texan. They want to be understood by a Canadian PM, a German tech lead, and an Indian SRE on the same call. That is also the lens our conversational English coaching surface is built around: comprehensibility is the goal; accent transfer is not.
  2. It is unmeasurable in a useful way. "Sounds native" collapses fluency, accent, vocabulary range, and interaction style into one fuzzy axis. You cannot tell a user what to fix. You can only tell them they are not there yet.
  3. It is demoralising in the wrong direction. Users who are already understood at work hear "still not native" and infer "still not good enough to interview." That is both factually wrong and the reason a lot of competent speakers quietly stop practising.

So we threw out the target. The rubric scores something else.

What we score instead

Five axes, all bounded, all aligned to the CEFR descriptor families because the descriptors are the closest thing the field has to a calibrated scale.

type SpeakingScore = {
  comprehensibility: CEFR;   // can a non-native colleague follow you in real time?
  fluency:           CEFR;   // pacing, hesitation, recovery from a stuck word
  accuracy:          CEFR;   // grammar where wrongness blocks meaning
  range:             CEFR;   // vocabulary and structure flexibility
  interaction:       CEFR;   // turn-taking, repair, asking-for-clarification
};

type CEFR = "A2" | "B1" | "B2" | "C1" | "C2";

Two things are worth flagging.

First, accent is not on this list. Not as an axis, not as a sub-axis, not as a hidden penalty. The only accent question is whether the listener can follow, and that question is already inside comprehensibility. Once we made that explicit, three different bug reports about "Eli kept correcting my Indian English" disappeared in the same week.

Second, accuracy is scoped to meaning-blocking errors. A missing article in front of "report" does not move the needle. A wrong tense that flips "I shipped it" into "I will ship it" does. The rubric prompt makes that distinction up front so the scorer does not penalise an engineer for the things their hiring manager would not penalise them for.
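
For a sense of how that lands in practice, here is a condensed sketch of the accuracy instruction - illustrative wording, not the production prompt:

// Condensed sketch of the accuracy instruction; the exact wording is illustrative.
const ACCURACY_INSTRUCTION = `
Score accuracy ONLY on errors that block or change meaning.
Ignore: dropped articles, minor agreement slips, non-standard but clear phrasing.
Count: tense flips that change when something happened, word choices that
change what happened, and structures a listener cannot parse in real time.
`;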

The structure of the rubric

Each axis has a small, stable set of descriptors. They are not invented; they are lifted from the CEFR speaking grids and tightened where the grids are vague.

{
  "comprehensibility": {
    "B2": "Listener follows without effort across familiar topics; occasional clarification needed on dense or unfamiliar material.",
    "C1": "Listener follows effortlessly across most topics including abstract or domain-specific; clarification rare and topic-driven, not pronunciation-driven."
  },
  "fluency": {
    "B2": "Speaks at near-natural pace on familiar topics; visible hesitation when reaching for a less common word, recovers without breakdown.",
    "C1": "Speaks fluidly across familiar and unfamiliar topics; hesitation is for thought, not vocabulary; can self-rephrase mid-sentence cleanly."
  }
}

The descriptors are short on purpose. Long descriptors invite the scorer to pattern-match keywords ("hesitation" is in the B2 line, the user hesitated, score B2). Short descriptors force the scorer to compare the actual evidence to the actual claim.

How a score gets generated

The scoring pass is a separate model call from the conversation. Same architectural shape as the post-session profile diff from the previous article: a slow, structured pass on the transcript, never inline with the user's turn.

The scorer receives:

  • the full transcript of the session (only this session, never the user's history)
  • the rubric descriptors for B2 and C1 on the relevant axis
  • four to six anchored examples per axis, drawn from a hand-labelled calibration set

It does not receive the user's previous score, level, or goals. We strip those before the call. If the scorer can see "this user was C1 last week" it will anchor on that and stop seeing the evidence in front of it. Calibration drift comes for free if you let the scorer reuse priors.
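
A minimal sketch of that input construction - the shapes and names here are illustrative, not our production types:

type Axis = "comprehensibility" | "fluency" | "accuracy" | "range" | "interaction";

type ScorerInput = {
  transcript: string;                                    // this session only, never history
  descriptors: Record<Axis, { B2: string; C1: string }>; // the two rubric bands per axis
  anchors: Record<Axis, string[]>;                       // 4-6 hand-labelled examples per axis
};

// Rebuilt from scratch every session. Previous scores, declared level, and
// goals exist on the user record but are deliberately never forwarded.
function buildScorerInput(
  session: { transcript: string },
  rubric: { descriptors: ScorerInput["descriptors"]; anchors: ScorerInput["anchors"] }
): ScorerInput {
  return {
    transcript: session.transcript,
    descriptors: rubric.descriptors,
    anchors: rubric.anchors,
  };
}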

Output is structured:

{
  "scores": {
    "comprehensibility": "C1",
    "fluency": "B2",
    "accuracy": "B2",
    "range": "B2",
    "interaction": "C1"
  },
  "evidence": {
    "fluency": [
      "Long pause at 03:42 reaching for `escalate`; recovered with `bring it up`.",
      "Self-rephrased cleanly at 05:11 mid-sentence."
    ]
  },
  "meaning_blocking_errors": [
    { "turn": 7, "issue": "tense flip: `I deploy it` -> intended past" }
  ]
}

The evidence field is non-negotiable. A score with no evidence is silently dropped on the way back. The user never sees a level number that the scorer cannot defend with two specific moments from the transcript.
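
The gate itself is small. Something like this, where the two-moment threshold matches the rule above and the helper names are illustrative:

type CEFR = "A2" | "B1" | "B2" | "C1" | "C2";
type Axis = "comprehensibility" | "fluency" | "accuracy" | "range" | "interaction";

// An axis score survives only if it carries at least two evidence lines.
function dropUnsupportedScores(
  scores: Partial<Record<Axis, CEFR>>,
  evidence: Partial<Record<Axis, string[]>>
): Partial<Record<Axis, CEFR>> {
  const kept: Partial<Record<Axis, CEFR>> = {};
  for (const axis of Object.keys(scores) as Axis[]) {
    const score = scores[axis];
    if (score && (evidence[axis] ?? []).length >= 2) {
      kept[axis] = score; // defensible with two specific transcript moments
    }
    // otherwise: dropped silently, the user never sees it
  }
  return kept;
}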

Where the rubric breaks

Three failure modes show up consistently. None of them are exotic.

1. Short sessions. Three minutes of conversation does not contain enough evidence to move four out of five axes. The rubric returns "insufficient evidence" on those axes instead of guessing. Returning a confident wrong answer here is worse than returning nothing - it sets a fake baseline that the next session has to climb out of.

2. Domain mismatch. A user who is a C1 frontend engineer talking about React is a B2 generalist talking about pension reform. We solved this by tagging each session with a topic family and only updating axis scores within sessions that match the user's declared goal context. Cross-domain extrapolation is off by default.

3. The "fluent fossil" case. Speakers who have plateaued at B2 for a decade can sound very fluent inside their work vocabulary and very stuck outside it. The rubric handles this by requiring range evidence from outside recentTopics before promoting the axis. Without that gate, the scorer happily promotes a fluent fossil to C1 and the user notices something is off the first time Eli treats them like one.

Hooking eval into the user profile

This is where the rubric stops being a measurement and starts being product behaviour.

The previous article described weaknesses[] and strengths[] as bounded tags on the user profile. The rubric is what populates them.

After each session, the rubric output flows into the profile diff:

function rubricToProfileDiff(output: ScorerOutput): ProfileDiff {
  // ScorerOutput is the full structured result shown above:
  // scores + per-axis evidence + meaning_blocking_errors side by side.
  const { scores, evidence, meaning_blocking_errors } = output;
  const addWeaknesses: string[] = [];
  const addStrengths: string[] = [];

  if (scores.accuracy === "B2" && meaning_blocking_errors.some(isTenseError)) {
    addWeaknesses.push("tense-blocks-meaning");
  }
  if (scores.interaction === "C1" && (evidence.interaction ?? []).some(isCleanRepair)) {
    addStrengths.push("self-repair");
  }
  // ...

  return { addWeaknesses, addStrengths };
}

A weakness only enters the profile if it has rubric evidence. A strength only enters if it has rubric evidence. The scorer is the gate; the profile cannot drift into "user struggles with articles" because a single session looked uneven. This is also the answer to a question the previous article skipped: where do weaknesses actually come from? Here. Always here. Never from the conversation model directly.

The gating runs the other way too. When Eli opens a session with "want to keep working on the QA-style interview answers from last time?" - which is the kind of cold-open the QA interview English topic on Elispeak is built around - the topic suggestion is gated by whether the user's range axis has enough evidence inside that domain to make the prep useful. We do not push interview practice on a user who is still B1 in conversational range; the rubric blocks the recommendation upstream.
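
That upstream block is the same gate pattern one more time. A sketch, with an assumed two-evidence-line threshold:

type CEFR = "A2" | "B1" | "B2" | "C1" | "C2";

// Recommendation gate: interview prep is only suggested when the range
// axis is at least B2 AND has rubric evidence inside the interview domain.
function mayRecommendInterviewPrep(
  range: CEFR | undefined,
  rangeEvidenceInDomain: number // evidence lines scored inside the interview topic family
): boolean {
  const atLeastB2 = range === "B2" || range === "C1" || range === "C2";
  return atLeastB2 && rangeEvidenceInDomain >= 2;
}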

What I'd tell someone building the same thing

Four things in order of how much time they saved us:

  1. Decide what you are NOT scoring before deciding what you are. "Native-like" was the load-bearing wrong assumption. Cutting it changed the rubric, the prompts, and the user copy, and settled three weeks of team disagreement in a single afternoon.
  2. Strip user history before the scoring call. The scorer should re-derive the level from the transcript every time, not anchor on last week. Anchoring is a one-way ratchet toward stale scores.
  3. Require evidence per axis. Drop scores without it. A scorer that returns a confident "B2" with no two-line evidence is hallucinating, and you will not catch it until a user asks why. Dropping unsupported scores is cheap and forces the scorer to behave.
  4. Bound the rubric to bounded inputs. Five axes, five CEFR bands, hand-labelled anchors per axis. Anything broader becomes a free-form essay grader, and free-form essay graders are exactly the thing every team eventually rebuilds because the first version drifted.

The rubric is the least glamorous part of an AI tutor. It is also the only piece that decides whether the rest of the product is telling the user the truth.

Try it

The free tier is enough to see whether the rubric reads your speaking the way you read it yourself. For paid plans, the launch promo ELISPEAK50 gets you 50% off any plan (no minimum).

🔗 Try Elispeak
