DEV Community: Elispeak

How I turned fragmented LLM quotas into one larger token budget with an MCP server

Elispeak — Thu, 23 Jul 2026 11:11:27 +0000

If you use coding agents heavily, you have probably run into the same annoying pattern I did.

One provider is rate-limited. Another still has quota left. A third is cheaper for simple tasks but not what your current workflow is wired to use. In practice, that means your session stops anyway, even though you still have perfectly usable model capacity somewhere else.

I built @mrrlin-dev/external-agents to fix that problem.

It is an MCP server that lets tools like Codex and Claude Code route work across multiple LLM providers through one interface. The main value is not "AI magic." It is much more boring and useful than that:

better cost efficiency
a larger effective token budget
fewer rate-limit interruptions
less custom fallback logic glued into every workflow

The actual problem

Most agent setups still assume one provider at a time.

That is fine until you start doing real work:

long review sessions
multi-file edits
repeated reviewer passes
large-context analysis
parallel agent runs

Then the weak point becomes obvious. Your workflow is only as durable as the quota and latency profile of one provider.

That creates two bad outcomes:

You overpay, because you keep stronger or more expensive models in the loop for tasks that do not need them.
You lose flow, because one provider getting busy or rate-limited pauses everything.

The frustrating part is that many of us already have usable capacity spread across multiple providers: Anthropic, OpenAI, Gemini, Groq, OpenRouter, DeepSeek, and others. The capacity exists, but it is fragmented.

The idea

Instead of treating each provider as a separate workflow, treat them as one shared execution pool.

That is what external-agents does.

Your agent talks to one MCP server. Under the hood, that server can dispatch requests across many configured providers. That turns isolated quota buckets into one larger working token budget.

I am deliberately saying effective token budget, not "unlimited tokens" and not "bypass rate limits."

Nothing magical happens here. You are not escaping provider limits. You are just using the capacity you already have more intelligently.

Why MCP is a good fit

MCP is useful here because it gives the agent one stable interface.

Codex or Claude Code does not need custom provider-specific fallback logic every time you want to experiment with routing. You configure the MCP server once, then iterate on policy behind that interface.

That matters more than it sounds.

Without a layer like this, multi-provider setups usually become one of these:

hardcoded scripts no one wants to maintain
ad hoc retries that are invisible until they break
manual API-key switching when a session stalls
separate tools for separate providers, which defeats the point of agent flow

An MCP server is not just a transport choice here. It is the control plane.

What gets better in practice

1. Cost efficiency

Not every task needs your most expensive model.

A practical setup can reserve stronger models for harder reasoning, and let cheaper or free-tier capacity absorb simpler work. Over time, that means more useful output per dollar and less waste from using premium models as the default for everything.
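A minimal sketch of that kind of policy, assuming a hypothetical provider table with a capability tier and a relative cost. The names here (`Provider`, `pickProvider`, the tier labels, the cost numbers) are illustrative assumptions, not the actual API or configuration of external-agents.

```typescript
// Hypothetical types for illustration; not the real external-agents interface.
type Tier = "basic" | "strong";

interface Provider {
  name: string;
  tier: Tier;
  costPerMTok: number; // assumed relative cost per million tokens
}

const tierRank: Record<Tier, number> = { basic: 0, strong: 1 };

// Pick the cheapest provider whose tier is at least what the task needs.
function pickProvider(providers: Provider[], needed: Tier): Provider | undefined {
  return providers
    .filter((p) => tierRank[p.tier] >= tierRank[needed])
    .sort((a, b) => a.costPerMTok - b.costPerMTok)[0];
}

// Made-up pool: one premium model, one cheap model, one free tier.
const pool: Provider[] = [
  { name: "premium", tier: "strong", costPerMTok: 15 },
  { name: "budget", tier: "basic", costPerMTok: 0.5 },
  { name: "free-tier", tier: "basic", costPerMTok: 0 },
];
```

With this table, a task tagged `"basic"` lands on the free tier, and only a task tagged `"strong"` reaches the premium model.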

2. Fewer interruptions

The best routing system is the one you barely notice.

If one provider is busy, exhausted, or temporarily constrained, the session can continue through another route instead of dumping the problem back on the human. That is the real UX win.
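The fallback behaviour described above can be sketched as an ordered list of routes that is walked until one answers. This is a sketch under assumptions: `Route`, `CallResult`, and `dispatch` are hypothetical names, and the real server may order, retry, or score routes differently.

```typescript
// Hypothetical shapes: a sketch of ordered fallback, not the package's real interface.
type CallResult =
  | { ok: true; text: string }
  | { ok: false; reason: "rate_limited" | "error" };

interface Route {
  name: string;
  call: (prompt: string) => Promise<CallResult>;
}

// Try each route in order; move on when one is rate-limited or fails.
async function dispatch(
  routes: Route[],
  prompt: string,
): Promise<{ route: string; text: string }> {
  const skipped: string[] = [];
  for (const r of routes) {
    const res = await r.call(prompt);
    if (res.ok) return { route: r.name, text: res.text };
    skipped.push(`${r.name}:${res.reason}`);
  }
  // Only surface the problem to the human once every route has been tried.
  throw new Error(`all routes exhausted (${skipped.join(", ")})`);
}
```

The caller sees one result and the name of the route that produced it, which keeps the fallback visible instead of silent.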

3. Better use of fragmented quotas

A lot of developers already have partial capacity everywhere:

a paid plan in one place
free-tier credits in another
low-cost overflow through a router
one provider that is better for long context
another that is better for speed

A shared dispatcher turns that from scattered leftovers into something operationally useful.

4. Cleaner reviewer workflows

This was especially relevant for me because I wanted reviewer-panel and consensus-style flows to stop depending on a single vendor path. If you are doing multi-model review, a dispatcher layer is much easier to reason about than wiring every reviewer directly into the outer toolchain.

Where it fits well

I think this pattern is especially useful for people who:

use Codex or Claude Code daily
run long coding sessions
care about spend discipline
already have accounts across several model providers
want resilience without rewriting their agent stack

It is also a nice fit for experimentation. You can compare routing strategies and provider mixes without changing the client surface every time.

Where I would not oversell it

There are a few things I do not think this should be pitched as.

First, it is not a guarantee of identical outputs. Different providers and models behave differently.

Second, it is not a substitute for good judgment about task-model matching. If you route everything everywhere without policy, you just create chaos with extra steps.

Third, it is not about pretending cost disappears. If anything, the point is the opposite: make spend visible, deliberate, and better allocated.

Why I built this as one MCP server for many LLMs

The design goal was simple:

one MCP server, many LLMs, one steadier workflow

I wanted one place to define routing, one place to expose tools, and one place to absorb rate-limit pain before it hits the coding session.

That makes the whole setup easier to run, easier to tweak, and easier to explain to other developers.

If you want to try it

The package is @mrrlin-dev/external-agents.

If you are already using Codex or Claude Code and have capacity spread across multiple providers, this is the exact use case it was built for.

The short version is:

You probably do not need more provider accounts.

You may just need a better way to turn the ones you already have into one reliable working budget.

How we score speaking when "native-like" is the wrong target - the eval rubric behind Elispeak

Elispeak — Thu, 07 May 2026 11:50:36 +0000

I build Elispeak, an AI English speaking coach. The first article in this thread covered what was technically hard. The second covered the user-profile layer that makes Eli (the tutor persona) feel like it remembers you. This one is about the piece that sits underneath both: the eval rubric that decides what "you got better today" actually means.

It is the smallest, driest part of the product. It is also the part that keeps every other part honest. If the rubric is wrong, every weakness flagged in the user profile is wrong, every recommendation is wrong, and every "you levelled up" message is a lie.

The wrong target

The default speaking-coach pitch is "talk like a native." That target is broken in three specific ways.

It is not what the user is hiring you for. A QA engineer in Lviv preparing for a hiring panel does not want to sound like a Texan. They want to be understood by a Canadian PM, a German tech lead, and an Indian SRE on the same call. That is also the lens our conversational English coaching surface is built around: comprehensibility is the goal; accent transfer is not.
It is unmeasurable in a useful way. "Sounds native" collapses fluency, accent, vocabulary range, and interaction style into one fuzzy axis. You cannot tell a user what to fix. You can only tell them they are not there yet.
It is demoralising in the wrong direction. Users who are already understood at work hear "still not native" and infer "still not good enough to interview." That is both factually wrong and the reason a lot of competent speakers quietly stop practicing.

So we threw out the target. The rubric scores something else.

What we score instead

Five axes, all bounded, all aligned to the CEFR descriptor families because the descriptors are the closest thing the field has to a calibrated scale.

type SpeakingScore = {
  comprehensibility: CEFR;   // can a non-native colleague follow you in real time?
  fluency:           CEFR;   // pacing, hesitation, recovery from a stuck word
  accuracy:          CEFR;   // grammar where wrongness blocks meaning
  range:             CEFR;   // vocabulary and structure flexibility
  interaction:       CEFR;   // turn-taking, repair, asking-for-clarification
};

type CEFR = "A2" | "B1" | "B2" | "C1" | "C2";

Two things are worth flagging.

First, accent is not on this list. Not as an axis, not as a sub-axis, not as a hidden penalty. The only accent question is whether the listener can follow, and that question is already inside comprehensibility. Once we made that explicit, three different bug reports about "Eli kept correcting my Indian English" disappeared in the same week.

Second, accuracy is scoped to meaning-blocking errors. A missing article in front of "report" does not move the needle. A wrong tense that flips "I shipped it" into "I will ship it" does. The rubric prompt makes that distinction up front so the scorer does not penalise an engineer for the things their hiring manager would not penalise them for.

The structure of the rubric

Each axis has a small, stable set of descriptors. They are not invented; they are lifted from the CEFR speaking grids and tightened where the grids are vague.

{
  "comprehensibility": {
    "B2": "Listener follows without effort across familiar topics; occasional clarification needed on dense or unfamiliar material.",
    "C1": "Listener follows effortlessly across most topics including abstract or domain-specific; clarification rare and topic-driven, not pronunciation-driven."
  },
  "fluency": {
    "B2": "Speaks at near-natural pace on familiar topics; visible hesitation when reaching for a less common word, recovers without breakdown.",
    "C1": "Speaks fluidly across familiar and unfamiliar topics; hesitation is for thought, not vocabulary; can self-rephrase mid-sentence cleanly."
  }
}

The descriptors are short on purpose. Long descriptors invite the scorer to pattern-match keywords ("hesitation" is in the B2 line, the user hesitated, score B2). Short descriptors force the scorer to compare the actual evidence to the actual claim.

How a score gets generated

The scoring pass is a separate model call from the conversation. Same architectural shape as the post-session profile diff from the previous article: a slow, structured pass on the transcript, never inline with the user's turn.

The scorer receives:

the full transcript of the session (only this session, never the user's history)
the rubric descriptors for B2 and C1 on the relevant axis
four to six anchored examples per axis, drawn from a hand-labelled calibration set

It does not receive the user's previous score, level, or goals. We strip those before the call. If the scorer can see "this user was C1 last week" it will anchor on that and stop seeing the evidence in front of it. Calibration drift comes for free if you let the scorer reuse priors.
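One way to enforce that stripping is to build the scorer payload from an explicit allow-list rather than deleting fields from a larger object. A minimal sketch, assuming hypothetical shapes (`SessionRecord`, `ScorerInput`, and their field names are not from the Elispeak codebase):

```typescript
// Hypothetical shapes; field names are assumptions for illustration.
interface SessionRecord {
  transcript: string;
  previousLevel?: string; // must never reach the scorer
  goals?: string[];       // must never reach the scorer
}

interface ScorerInput {
  transcript: string;
  descriptors: Record<string, string>;
  anchors: string[];
}

// Copy only the allowed fields, so history cannot leak in by accident
// when new fields are added to the session record later.
function buildScorerInput(
  session: SessionRecord,
  descriptors: Record<string, string>,
  anchors: string[],
): ScorerInput {
  return { transcript: session.transcript, descriptors, anchors };
}
```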

Output is structured:

{
  "scores": {
    "comprehensibility": "C1",
    "fluency": "B2",
    "accuracy": "B2",
    "range": "B2",
    "interaction": "C1"
  },
  "evidence": {
    "fluency": [
      "Long pause at 03:42 reaching for `escalate`; recovered with `bring it up`.",
      "Self-rephrased cleanly at 05:11 mid-sentence."
    ]
  },
  "meaning_blocking_errors": [
    { "turn": 7, "issue": "tense flip: `I deploy it` -> intended past" }
  ]
}

The evidence field is non-negotiable. A score with no evidence is silently dropped on the way back. The user never sees a level number that the scorer cannot defend with two specific moments from the transcript.
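The drop rule can be sketched as a filter over the scorer output. The threshold of two evidence lines comes from the text above; the type names and the function name are assumptions for illustration.

```typescript
type Axis = "comprehensibility" | "fluency" | "accuracy" | "range" | "interaction";

// Assumed shape, mirroring the JSON output shown above.
interface ScorerOutput {
  scores: Partial<Record<Axis, string>>;
  evidence: Partial<Record<Axis, string[]>>;
}

const MIN_EVIDENCE = 2; // "two specific moments from the transcript"

// Keep only the axes whose score is backed by enough evidence lines.
function dropUnsupportedScores(out: ScorerOutput): Partial<Record<Axis, string>> {
  const kept: Partial<Record<Axis, string>> = {};
  for (const axis of Object.keys(out.scores) as Axis[]) {
    const level = out.scores[axis];
    const support = out.evidence[axis] ?? [];
    if (level !== undefined && support.length >= MIN_EVIDENCE) {
      kept[axis] = level;
    }
  }
  return kept;
}
```

An axis with a score but only one evidence line, or none, is simply absent from the result, so nothing downstream can display it.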

Where the rubric breaks

Three failure modes show up consistently. None of them are exotic.

1. Short sessions. Three minutes of conversation does not contain enough evidence to move four out of five axes. The rubric returns "insufficient evidence" on those axes instead of guessing. Returning a confident wrong answer here is worse than returning nothing - it sets a fake baseline that the next session has to climb out of.

2. Domain mismatch. A user who scores C1 as a frontend engineer talking about React may score B2 as a generalist talking about pension reform. We solved this by tagging each session with a topic family and only updating axis scores within sessions that match the user's declared goal context. Cross-domain extrapolation is off by default.

3. The "fluent fossil" case. Speakers who have plateaued at B2 for a decade can sound very fluent inside their work vocabulary and very stuck outside it. The rubric handles this by requiring range evidence from outside recentTopics before promoting the axis. Without that gate, the scorer happily promotes a fluent fossil to C1 and the user notices something is off the first time Eli treats them like one.
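The range gate in the third case can be sketched as a simple membership check. This assumes each piece of range evidence carries a topic tag and that `recentTopics` can be reduced to a list of topic ids; both are simplifications for illustration.

```typescript
// Hypothetical shape: one piece of range evidence, tagged with the topic it came from.
interface RangeEvidence {
  topic: string;
  note: string;
}

// Allow a `range` promotion only if at least one piece of evidence
// comes from a topic the user has not practiced recently.
function canPromoteRange(evidence: RangeEvidence[], recentTopics: string[]): boolean {
  const recent = new Set(recentTopics);
  return evidence.some((e) => !recent.has(e.topic));
}
```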

Hooking eval into the user profile

This is where the rubric stops being a measurement and starts being product behaviour.

The previous article described weaknesses[] and strengths[] as bounded tags on the user profile. The rubric is what populates them.

After each session, the rubric output flows into the profile diff:

function rubricToProfileDiff(score: SpeakingScore, evidence: Evidence): ProfileDiff {
  const addWeaknesses: string[] = [];
  const addStrengths: string[] = [];

  if (score.accuracy === "B2" && evidence.meaning_blocking_errors.some(isTenseError)) {
    addWeaknesses.push("tense-blocks-meaning");
  }
  if (score.interaction === "C1" && evidence.interaction.some(isCleanRepair)) {
    addStrengths.push("self-repair");
  }
  // ...

  return { addWeaknesses, addStrengths };
}

A weakness only enters the profile if it has rubric evidence. A strength only enters if it has rubric evidence. The scorer is the gate; the profile cannot drift into "user struggles with articles" because a single session looked uneven. This is also the answer to a question the previous article skipped: where do weaknesses actually come from? Here. Always here. Never from the conversation model directly.

The intersection runs the other way too. When Eli opens a session with "want to keep working on the QA-style interview answers from last time?" - which is the kind of cold-open the QA interview English topic on Elispeak is built around - the topic suggestion is gated by whether the user's range axis has enough evidence inside that domain to make the prep useful. We do not push interview practice on a user who is still B1 in conversational range; the rubric blocks the recommendation upstream.

What I'd tell someone building the same thing

Four things in order of how much time they saved us:

Decide what you are NOT scoring before deciding what you are. "Native-like" was the load-bearing wrong assumption. Cutting it reshaped the rubric, the prompts, and the user copy, and settled three weeks of team disagreement in a single afternoon.
Strip user history before the scoring call. The scorer should re-derive the level from the transcript every time, not anchor on last week. Anchoring is a one-way ratchet toward stale scores.
Require evidence per axis. Drop scores without it. A scorer that returns a confident "B2" with no two-line evidence is hallucinating, and you will not catch it until a user asks why. Dropping unsupported scores is cheap and forces the scorer to behave.
Bound the rubric to bounded inputs. Five axes, five CEFR bands, hand-labelled anchors per axis. Anything broader becomes a free-form essay grader, and free-form essay graders are exactly the thing every team eventually rebuilds because the first version drifted.

The rubric is the least glamorous part of an AI tutor. It is also the only piece that decides whether the rest of the product is telling the user the truth.

Try it

The free tier is enough to see whether the rubric reads your speaking the way you read it yourself. For paid plans, the launch promo ELISPEAK50 gets you 50% off any plan (no minimum).

🔗 Try Elispeak

Making an AI tutor feel like it remembers you — the user-profile layer behind Elispeak

Elispeak — Mon, 27 Apr 2026 14:51:49 +0000

I build Elispeak — an AI English speaking coach. Most of the interesting product work is not the voice pipeline or the scoring rubric. It's the user-profile layer that sits between a user's sessions and the next conversation Eli (the tutor persona) opens with.

Without it, every session starts with the generic "What would you like to practice today?" With it, Eli opens with something like: "Last time you wanted to sound less stiff in standups — still that, or do you want to prep for Friday's interview instead?"

That one sentence changes retention more than any other single thing we shipped. Here's how the layer actually works.

The problem

LLM apps default to two broken modes:

Stateless. Every session starts from zero. The user has to re-explain who they are, what their level is, what they're practicing for. That friction kills daily-use intent by week two.
Full transcript memory. Shove every past message into context. Expensive, slow, leaks old topics into new ones ("you mentioned your mom's surgery three weeks ago — how is she?" when the user just wanted to practice a TOEFL prompt).

What we actually want is somewhere between these two: a compact, structured model of the user that survives across sessions without dragging raw conversation history forward.

What the profile stores

The profile is a JSON-shaped record per user, updated after every session — not during. A few fields that carry weight:

type UserProfile = {
  goals: Goal[];                // "TOEFL in May", "sound natural in standups"
  level: { speaking: CEFR; writing: CEFR; listening: CEFR };
  weaknesses: Weakness[];       // "articles", "past perfect", "th sounds"
  strengths: string[];          // short, positive; used for tone, not praise
  interests: string[];          // "football", "indie dev", "sci-fi"
  recentTopics: Topic[];        // last ~10, with timestamps + summaries
  styleSignals: {               // helps Eli pace/tone replies
    wantsCorrection: "immediate" | "end-of-turn" | "summary-only";
    preferredPace: "slow" | "normal" | "fast";
    emotionalRegister: "direct" | "warm" | "playful";
  };
  openLoops: OpenLoop[];        // things user said they wanted to come back to
  lastSessionAt: Timestamp;
  sessionCount: number;
};

Nothing here is free-form prose. Everything is a bounded enum or a short tagged string. That constraint is the whole point — it's what lets the layer stay cheap to read and safe to pass into a prompt.

How it gets populated

Two paths:

1. Explicit onboarding. The first few sessions ask the user a small number of low-friction questions — "what's the closest thing to why you're practicing?" with 4 options, not a text box. These seed goals, level, and styleSignals.emotionalRegister.

2. Post-session enrichment. This is the interesting part. After a session ends, a second, slower model pass runs on the transcript and answers a short, fixed set of questions:

Did the user mention any new goal, deadline, or context we don't have?
Which grammatical/phonetic weaknesses showed up at least twice?
Did the user ask to come back to anything later?
Did the user's preferred correction cadence shift in this session?

The output of this pass is a structured diff, not a rewrite. Something like:

{
  "addWeaknesses": ["conditional-3rd"],
  "addOpenLoop": { "topic": "salary negotiation", "context": "promo prep" },
  "reinforce": { "goal": "interview prep", "confidence": 0.8 }
}

The diff is applied to the profile with simple merge rules (cap recentTopics at 10, cap openLoops at 5, decay confidence on older items). Keeping this as a diff — not a full overwrite — is what keeps the profile stable. One weird session doesn't erase four weeks of accumulated knowledge about the user.
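A minimal sketch of those merge rules on a cut-down profile. The caps of 10 and 5 come from the text; the shapes, the `addTopic` field, and the 0.9 decay factor are assumptions made to keep the example self-contained.

```typescript
// Simplified, hypothetical shapes; the real profile has more fields.
interface Weighted { label: string; confidence: number; }
interface MiniProfile {
  recentTopics: string[]; // newest first
  openLoops: string[];    // newest first
  goals: Weighted[];
}
interface MiniDiff {
  addTopic?: string;
  addOpenLoop?: string;
  reinforce?: Weighted;
}

const MAX_TOPICS = 10;
const MAX_LOOPS = 5;
const DECAY = 0.9; // assumed per-session decay factor

// Returns a new profile; the input is left untouched.
function applyDiff(p: MiniProfile, d: MiniDiff): MiniProfile {
  // Decay every existing goal, then let the diff reinforce one of them.
  const goals = p.goals.map((g) => ({ ...g, confidence: g.confidence * DECAY }));
  if (d.reinforce) {
    const r = d.reinforce;
    const existing = goals.find((g) => g.label === r.label);
    if (existing) existing.confidence = Math.max(existing.confidence, r.confidence);
    else goals.push({ ...r });
  }
  // Prepend new items, then cap, so the oldest entries fall off the end.
  const topics = d.addTopic ? [d.addTopic, ...p.recentTopics] : p.recentTopics;
  const loops = d.addOpenLoop ? [d.addOpenLoop, ...p.openLoops] : p.openLoops;
  return {
    recentTopics: topics.slice(0, MAX_TOPICS),
    openLoops: loops.slice(0, MAX_LOOPS),
    goals,
  };
}
```

Because the function only adds, caps, and decays, a single odd session can shift the profile a little but cannot wipe it.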

How recommendations use it

When the user opens the app, we don't show a flat list of prompts. We compute a small ranked set.

Roughly:

function rankTopics(profile: UserProfile, pool: Topic[]): Topic[] {
  return pool
    .map((t) => ({
      topic: t,
      // Weighted sum of the four signals; the weights add up to 1.0.
      score:
        goalAlignment(t, profile.goals) * 0.45 +
        weaknessHit(t, profile.weaknesses) * 0.25 +
        interestHit(t, profile.interests) * 0.15 +
        noveltyAgainst(t, profile.recentTopics) * 0.15,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map((ranked) => ranked.topic); // unwrap so the result matches Topic[]
}

The weights are not magic. They came from watching early users either pick the first card or bounce. Three things moved the needle more than tuning the weights:

Novelty penalty against recentTopics. If the user practiced "interview: tell me about yourself" two sessions ago, don't put it first again. This was the single biggest retention move. Users reading the same top card twice don't feel "understood," they feel "lazy AI."
Open-loop surfacing. If the user said "I want to come back to negotiating salary," show that as its own explicit card with the phrase they used. This makes the continuity feel real because the language is theirs, not a paraphrase.
Goal recency decay. Goals aren't permanent. A TOEFL goal with a May date should rank near 1.0 in April and near 0.2 in July. Hard decay beats soft decay here — users notice when stale goals hang around.
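Hard decay in the last item can be read as a step function: full weight until the deadline has passed, then a drop straight to a floor. A sketch under assumptions: the 0.2 floor matches the July figure above, while the function name and the 7-day grace window are illustrative choices, not the production values.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Step-function ("hard") decay: no gradual slope between the two levels.
function goalRecencyWeight(
  deadline: Date,
  now: Date,
  graceDays = 7, // assumed grace window after the deadline
  floor = 0.2,   // weight for a stale goal
): number {
  const daysPast = (now.getTime() - deadline.getTime()) / DAY_MS;
  return daysPast <= graceDays ? 1.0 : floor;
}
```

For a goal with a mid-May deadline, a call in April returns 1.0 and a call in July returns 0.2, with no long tail in between.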

How Eli opens a session

This is where the profile stops being a data structure and starts being a feeling.

The opening line is generated by a small prompt that receives the minimum useful slice of the profile — not the whole thing. Something like:

user's top goal: {top_goal}
most recent open loop: {top_open_loop.topic}
last session ended: {days_ago}d ago
preferred register: {emotional_register}

That's it. No transcripts. No list of weaknesses. No confidence scores. The LLM isn't asked to decide what matters; the profile ranking already did that. The LLM is only asked to say one natural-sounding sentence that threads those three or four facts together.

Two rules the opening line has to follow:

Never invent continuity. If there's no recent open loop, don't fake one. "Last time you wanted X" is the fastest way to destroy trust if the user didn't actually say X. When in doubt, ask.
Match the user's register. A user who set emotionalRegister: "direct" gets "Interview prep or something else?" A user with "warm" gets "Hey — want to pick up the interview prep, or reset?" Same information, different tone. This is the cheapest personalization we have.
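A minimal sketch of how that slice might be rendered into the opener prompt while honouring the first rule. `OpenerSlice` and `renderOpenerContext` are hypothetical names; the line labels follow the template shown earlier.

```typescript
// Hypothetical shape for the minimum slice passed to the opener prompt.
interface OpenerSlice {
  topGoal: string;
  topOpenLoop?: string; // absent when the user left nothing to come back to
  daysAgo: number;
  register: "direct" | "warm" | "playful";
}

function renderOpenerContext(s: OpenerSlice): string {
  const lines = [`user's top goal: ${s.topGoal}`];
  // Never invent continuity: omit the line entirely when there is no open loop,
  // so the model has nothing to build a fake "last time you wanted X" from.
  if (s.topOpenLoop) lines.push(`most recent open loop: ${s.topOpenLoop}`);
  lines.push(`last session ended: ${s.daysAgo}d ago`);
  lines.push(`preferred register: ${s.register}`);
  return lines.join("\n");
}
```

Leaving the line out, rather than passing an empty value, is the design choice here: a model cannot paraphrase a fact it was never shown.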

The privacy line we don't cross

The profile is structured, bounded, and summary-only. Full transcripts are not stored beyond the session's scoring pipeline. That's not just a privacy stance — it's an engineering one. If we kept transcripts, the profile layer would drift toward "shove raw text into context" and we'd be back to the expensive, leaky mode we were avoiding.

The rule we follow internally: if a field can't be expressed as a bounded schema entry, it doesn't belong in the profile. A user saying "I'm nervous about my green card interview next Thursday" becomes { goal: "immigration-interview-prep", deadline: "2026-05-07", register: "warm" }, not a stored quote.

What I'd tell someone building the same thing

Four things in order of how much time they saved us:

Update the profile after the session, not during. Trying to update live made every turn slower and introduced race conditions between the scoring pass and the conversation turn. A slow async pass post-session is fine — the user won't feel it.
Diffs over rewrites. Always. One bad session should never clobber the profile.
Bound every field. Enums, capped arrays, tagged strings. Free-form prose in a profile is technical debt that compounds every session.
Pass the minimum slice to the opener, not the whole profile. Let the ranker decide what matters. The LLM gets four lines of context, not forty.

Once those four are in place, the "feels like Eli knows me" property shows up almost for free. Users describe it as "the AI remembers me" even though technically nothing from last week's transcript is in this week's prompt.

That gap — between what's actually in context and what the user feels — is where the product lives.

Try it

The free tier is enough to see whether the personalized cold-open lands for you. For paid plans, the launch promo ELISPEAK50 gets you 50% off any plan (no minimum).

🔗 Try Elispeak

I built an AI English speaking coach — what was technically hard

Elispeak — Fri, 24 Apr 2026 14:24:57 +0000

I spent the past year building Elispeak, an AI English speaking coach. The user-facing pitch is simple — talk to an AI tutor, get instant pronunciation and fluency feedback, practice TOEFL / IELTS / CELPIP speaking tasks on demand. Under the hood, a few things turned out to be much harder than I expected.

This is a note to myself, and to anyone else building voice-first language tools.

1. Real-time ASR + scoring latency is the whole product

The promise of "instant feedback" falls apart at 4 seconds of latency. At 1.5 seconds it feels like a person listening. At 3.5 it feels like a slow API. The longer that gap, the less the user trusts their own sense of whether they spoke well or messed up.

Getting from end-of-utterance to a scored result — not just transcription, but pronunciation and fluency features — meant:

streaming ASR instead of batch, with interim hypotheses used to start downstream work before the final transcript arrives
precomputing a phoneme-alignment path so pronunciation scoring can start as soon as the audio chunk lands, not after the full sentence
scoring features (pace, filler-word density, stress timing) computed on the audio stream, not derived post-hoc from the transcript

The second-order effect: every piece of UI has to be reactive too. If feedback lands in 1.2s but the UI repaints only every 500ms, the user can perceive up to 1.7s. Shaving animation blocking time ended up mattering almost as much as the model pipeline.

2. Exam rubrics are not a prompt — they are a protocol

TOEFL Independent Speaking, IELTS Part 2, CELPIP Task 4 each have published rubrics. It is tempting to drop the rubric into a system prompt and call it done. It is not done.

Timing windows matter more than content. TOEFL gives 15 seconds to prepare and 45 to speak. A "perfect answer" that runs 38 seconds is actually worse at the exam than a B+ answer at 44 seconds. The coach has to grade with that tension in mind, not just on transcript quality.
Exam-safe framing is a legal surface. You cannot say "this is your TOEFL score." You can say "a tutor applying the public band descriptors might score this around 23-25 of 30." That framing has to be in every response, not just onboarding.
Sample answers drift. A stable system prompt with drifting base-model behavior produces drifting feedback. I had to pin model versions per exam mode and run weekly evals on held-out recordings to catch regression.

Treat each exam mode as its own small product with its own eval set, not one mode with a different prompt.

3. TTS voices that don't sound robotic are half the battle

Students do not want to roleplay with a voice that sounds like a call-center IVR. The moment the voice feels synthetic, the emotional bar for opening their mouth goes up — and you just lost the session.

What actually helped:

neural voices tuned for conversational English, not neutral narration
varying pacing and pause patterns per scenario (airport interview is clipped and fast; therapy-style friend chat has longer pauses and more um / yeah / okay fillers)
supporting accent diversity so the student practices comprehension, not just production
lip-sync style micro-delays — the tutor reacts a beat late, like a human would, not instantly like a bot

On the engineering side, this meant a voice persona config per AI tutor character (we ship multiple tutors) and keeping the latency budget from Section 1 intact while adding TTS synthesis.

Stack and trade-offs

ASR: streaming provider with word-level timestamps and phoneme probabilities. Interim hypotheses + confidence scores shaped more of the architecture than raw accuracy.
Scoring: pronunciation on phoneme-level features and edit distance vs. expected; fluency on stream-level features (pace, filler rate, pause distribution); content via an LLM pass scoped to rubric criteria.
LLM: one pinned model per exam mode, with eval regression suite before upgrading.
TTS: neural conversational voices, persona config per tutor.
Frontend: WebRTC for capture, progressive UI updates keyed to pipeline stages so partial results feel immediate.
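To make the fluency features in the scoring item concrete, here is a simplified sketch that derives pace, filler rate, and long pauses from word-level timestamps. It is an approximation for illustration only: the pipeline described above computes these on the audio stream, and the `Word` shape, the filler list, and the 500ms pause threshold are all assumptions.

```typescript
// Hypothetical word record, as a streaming ASR with word timestamps might emit.
interface Word {
  text: string;
  startMs: number;
  endMs: number;
}

const FILLERS = new Set(["um", "uh", "er", "erm"]); // assumed filler list

function fluencyFeatures(words: Word[], pauseThresholdMs = 500) {
  if (words.length === 0) return { wordsPerMinute: 0, fillerRate: 0, longPauses: 0 };

  const startMs = Math.min(...words.map((w) => w.startMs));
  const endMs = Math.max(...words.map((w) => w.endMs));
  const minutes = (endMs - startMs) / 60000;
  const fillers = words.filter((w) => FILLERS.has(w.text.toLowerCase())).length;

  // Count gaps between consecutive words that reach the pause threshold.
  let longPauses = 0;
  words.forEach((w, i) => {
    const prev = words[i - 1];
    if (prev && w.startMs - prev.endMs >= pauseThresholdMs) longPauses++;
  });

  return {
    wordsPerMinute: minutes > 0 ? words.length / minutes : 0,
    fillerRate: fillers / words.length,
    longPauses,
  };
}
```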

Trade-offs that bit me:

optimizing for end-to-end latency means giving up some scoring quality per step; I keep having to rebalance the two
picking "one best voice" per tutor is false economy — students attach to specific voices and churn when you change them
rubrics are a moving target; budget time to rerun evals after any provider upgrade

What I would build differently

Invest in the eval loop before the product surface. Most debugging pain in months 4-8 traced back to missing eval coverage, not missing features.
Do not ship more than two exam modes until the first two are clean. More modes means more eval sets means more drift surface.
Pay for a proper observability stack earlier. Custom logging runs out of road faster than you expect on a voice pipeline.

Try it

If you want to kick the tires, there is a free tier at elispeak.com. Paid plans are 50% off with code ELISPEAK50 — no minimum, works on any plan.

👉 Start practicing — 50% off any plan

Happy to answer questions in the comments — especially on ASR pipeline design and rubric evals.