<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elispeak</title>
    <description>The latest articles on DEV Community by Elispeak (@elispeak111).</description>
    <link>https://dev.to/elispeak111</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896178%2F93425463-2411-445b-b8a6-09accb2352f0.png</url>
      <title>DEV Community: Elispeak</title>
      <link>https://dev.to/elispeak111</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elispeak111"/>
    <language>en</language>
    <item>
      <title>How we score speaking when "native-like" is the wrong target - the eval rubric behind Elispeak</title>
      <dc:creator>Elispeak</dc:creator>
      <pubDate>Thu, 07 May 2026 11:50:36 +0000</pubDate>
      <link>https://dev.to/elispeak111/how-we-score-speaking-when-native-like-is-the-wrong-target-the-eval-rubric-behind-elispeak-359e</link>
      <guid>https://dev.to/elispeak111/how-we-score-speaking-when-native-like-is-the-wrong-target-the-eval-rubric-behind-elispeak-359e</guid>
      <description>&lt;h1&gt;
  How we score speaking when "native-like" is the wrong target - the eval rubric behind Elispeak
&lt;/h1&gt;

&lt;p&gt;I build Elispeak, an AI English speaking coach. The first article in this thread covered what was technically hard. The second covered the user-profile layer that makes Eli (the tutor persona) feel like it remembers you. This one is about the piece that sits underneath both: the eval rubric that decides what "you got better today" actually means.&lt;/p&gt;

&lt;p&gt;It is the smallest, driest part of the product. It is also the part that keeps every other part honest. If the rubric is wrong, every weakness flagged in the user profile is wrong, every recommendation is wrong, and every "you levelled up" message is a lie.&lt;/p&gt;

&lt;h2&gt;
  The wrong target
&lt;/h2&gt;

&lt;p&gt;The default speaking-coach pitch is "talk like a native." That target is broken in three specific ways.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It is not what the user is hiring you for.&lt;/strong&gt; A QA engineer in Lviv preparing for a hiring panel does not want to sound like a Texan. They want to be understood by a Canadian PM, a German tech lead, and an Indian SRE on the same call. That is also the lens our &lt;a href="https://elispeak.com/ua/rozmovna-angliyska-online/" rel="noopener noreferrer"&gt;conversational English coaching surface&lt;/a&gt; is built around: comprehensibility is the goal; accent transfer is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is unmeasurable in a useful way.&lt;/strong&gt; "Sounds native" collapses fluency, accent, vocabulary range, and interaction style into one fuzzy axis. You cannot tell a user what to fix. You can only tell them they are not there yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is demoralising in the wrong direction.&lt;/strong&gt; Users who are already understood at work hear "still not native" and infer "still not good enough to interview." That is both factually wrong and the reason a lot of competent speakers quietly stop practicing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So we threw out the target. The rubric scores something else.&lt;/p&gt;

&lt;h2&gt;
  What we score instead
&lt;/h2&gt;

&lt;p&gt;Five axes, all bounded, all aligned to the &lt;a href="https://www.coe.int/en/web/common-european-framework-reference-languages" rel="noopener noreferrer"&gt;CEFR&lt;/a&gt; descriptor families because the descriptors are the closest thing the field has to a calibrated scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;SpeakingScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;comprehensibility&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// can a non-native colleague follow you in real time?&lt;/span&gt;
  &lt;span class="nl"&gt;fluency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// pacing, hesitation, recovery from a stuck word&lt;/span&gt;
  &lt;span class="nl"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// grammar where wrongness blocks meaning&lt;/span&gt;
  &lt;span class="nl"&gt;range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;             &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// vocabulary and structure flexibility&lt;/span&gt;
  &lt;span class="nl"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// turn-taking, repair, asking-for-clarification&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;CEFR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;A2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;B1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;B2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;C1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;C2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things are worth flagging.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;accent is not on this list&lt;/strong&gt;. Not as an axis, not as a sub-axis, not as a hidden penalty. The only accent question is whether the listener can follow, and that question is already inside &lt;code&gt;comprehensibility&lt;/code&gt;. Once we made that explicit, three different bug reports about "Eli kept correcting my Indian English" disappeared in the same week.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;accuracy is scoped to meaning-blocking errors&lt;/strong&gt;. A missing article in front of "report" does not move the needle. A wrong tense that flips "I shipped it" into "I will ship it" does. The rubric prompt makes that distinction up front so the scorer does not penalise an engineer for the things their hiring manager would not penalise them for.&lt;/p&gt;

&lt;h2&gt;
  The structure of the rubric
&lt;/h2&gt;

&lt;p&gt;Each axis has a small, stable set of descriptors. They are not invented; they are lifted from the CEFR speaking grids and tightened where the grids are vague.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comprehensibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"B2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Listener follows without effort across familiar topics; occasional clarification needed on dense or unfamiliar material."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"C1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Listener follows effortlessly across most topics including abstract or domain-specific; clarification rare and topic-driven, not pronunciation-driven."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fluency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"B2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Speaks at near-natural pace on familiar topics; visible hesitation when reaching for a less common word, recovers without breakdown."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"C1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Speaks fluidly across familiar and unfamiliar topics; hesitation is for thought, not vocabulary; can self-rephrase mid-sentence cleanly."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The descriptors are short on purpose. Long descriptors invite the scorer to pattern-match keywords ("hesitation" is in the B2 line, the user hesitated, score B2). Short descriptors force the scorer to compare the actual evidence to the actual claim.&lt;/p&gt;

&lt;h2&gt;
  How a score gets generated
&lt;/h2&gt;

&lt;p&gt;The scoring pass is a separate model call from the conversation. Same architectural shape as the post-session profile diff from the previous article: a slow, structured pass on the transcript, never inline with the user's turn.&lt;/p&gt;

&lt;p&gt;The scorer receives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the full transcript of the session (only this session, never the user's history)&lt;/li&gt;
&lt;li&gt;the rubric descriptors for B2 and C1 on the relevant axis&lt;/li&gt;
&lt;li&gt;four to six anchored examples per axis, drawn from a hand-labelled calibration set&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not receive the user's previous score, level, or goals. We strip those before the call. If the scorer can see "this user was C1 last week" it will anchor on that and stop seeing the evidence in front of it. Calibration drift comes for free if you let the scorer reuse priors.&lt;/p&gt;
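&lt;p&gt;A minimal sketch of that stripping step, assuming a hypothetical &lt;code&gt;Session&lt;/code&gt; record and a hypothetical &lt;code&gt;buildScorerInput&lt;/code&gt; helper (the names and field shapes are illustrative, not the production code). The idea is to copy an explicit allow-list of fields rather than delete the forbidden ones, so a profile field added later cannot leak into the scoring call by accident:&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration only.
type Axis = "comprehensibility" | "fluency" | "accuracy" | "range" | "interaction";

type Session = {
  transcript: string;
  // These fields may live on the session record but must never reach the scorer.
  previousScore?: string;
  level?: string;
  goals?: string[];
};

type ScorerInput = {
  axis: Axis;
  transcript: string;
  descriptors: { [band: string]: string };
  anchors: string[];
};

// Build the payload from an allow-list: only the transcript is read from
// the session, so history, level, and goals are dropped by construction.
function buildScorerInput(
  session: Session,
  axis: Axis,
  descriptors: { [band: string]: string },
  anchors: string[],
): ScorerInput {
  return { axis, transcript: session.transcript, descriptors, anchors };
}
```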

&lt;p&gt;Output is structured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scores"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comprehensibility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fluency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"B2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"B2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"B2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"interaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fluency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Long pause at 03:42 reaching for `escalate`; recovered with `bring it up`."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Self-rephrased cleanly at 05:11 mid-sentence."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meaning_blocking_errors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"turn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"issue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tense flip: `I deploy it` -&amp;gt; intended past"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;evidence&lt;/code&gt; field is non-negotiable. A score with no evidence is silently dropped on the way back. The user never sees a level number that the scorer cannot defend with two specific moments from the transcript.&lt;/p&gt;
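&lt;p&gt;One way to enforce that drop, sketched under two assumptions: the scorer output has the shape shown above, and "defensible" means at least two evidence items for the axis. &lt;code&gt;dropUnsupportedScores&lt;/code&gt; and &lt;code&gt;MIN_EVIDENCE_ITEMS&lt;/code&gt; are illustrative names:&lt;/p&gt;

```typescript
// Hypothetical shapes modelled on the JSON output shown above.
type Band = "A2" | "B1" | "B2" | "C1" | "C2";

type ScorerOutput = {
  scores: { [axis: string]: Band };
  evidence: { [axis: string]: string[] };
};

// Assumption: an axis needs at least two transcript moments to be kept.
const MIN_EVIDENCE_ITEMS = 2;

// Return only the axis scores that arrive with enough evidence;
// everything else is dropped before it can reach the user.
function dropUnsupportedScores(output: ScorerOutput): { [axis: string]: Band } {
  const kept: { [axis: string]: Band } = {};
  for (const axis of Object.keys(output.scores)) {
    const items = output.evidence[axis] ?? [];
    if (items.length >= MIN_EVIDENCE_ITEMS) {
      kept[axis] = output.scores[axis];
    }
  }
  return kept;
}
```

&lt;p&gt;Applied literally to the sample output above, only &lt;code&gt;fluency&lt;/code&gt; would survive, because it is the only axis with evidence listed.&lt;/p&gt;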

&lt;h2&gt;
  Where the rubric breaks
&lt;/h2&gt;

&lt;p&gt;Three failure modes show up consistently. None of them are exotic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Short sessions.&lt;/strong&gt; Three minutes of conversation does not contain enough evidence to move four out of five axes. The rubric returns "insufficient evidence" on those axes instead of guessing. Returning a confident wrong answer here is worse than returning nothing: it sets a fake baseline that the next session has to climb out of.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Domain mismatch.&lt;/strong&gt; A frontend engineer who is C1 when talking about React can be B2 when talking about pension reform. We solved this by tagging each session with a topic family and only updating axis scores from sessions whose topic family matches the user's declared goal context. Cross-domain extrapolation is off by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The "fluent fossil" case.&lt;/strong&gt; Speakers who have plateaued at B2 for a decade can sound very fluent inside their work vocabulary and very stuck outside it. The rubric handles this by requiring &lt;code&gt;range&lt;/code&gt; evidence from outside &lt;code&gt;recentTopics&lt;/code&gt; before promoting the axis. Without that gate, the scorer happily promotes a fluent fossil to C1 and the user notices something is off the first time Eli treats them like one.&lt;/p&gt;
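&lt;p&gt;The second and third gates can be sketched as two small pure functions. The shapes here are assumptions: topic families and &lt;code&gt;recentTopics&lt;/code&gt; are treated as plain strings, each piece of &lt;code&gt;range&lt;/code&gt; evidence is assumed to carry the topic it came from, and the function names are illustrative:&lt;/p&gt;

```typescript
type Band = "A2" | "B1" | "B2" | "C1" | "C2";
const BAND_ORDER: Band[] = ["A2", "B1", "B2", "C1", "C2"];

// Hypothetical evidence shape: a transcript quote tagged with its topic.
type RangeEvidence = { topic: string; quote: string };

// Domain gate: a session updates axis scores only when its topic family
// is one of the user's declared goal contexts.
function sessionCounts(sessionTopicFamily: string, goalTopicFamilies: string[]): boolean {
  return goalTopicFamilies.indexOf(sessionTopicFamily) !== -1;
}

// Fluent-fossil gate: a higher `range` band is accepted only when at least
// one piece of evidence comes from a topic outside recentTopics.
// Demotions and unchanged bands pass through untouched.
function gateRangePromotion(
  current: Band,
  proposed: Band,
  evidence: RangeEvidence[],
  recentTopics: string[],
): Band {
  const isPromotion = BAND_ORDER.indexOf(proposed) > BAND_ORDER.indexOf(current);
  if (!isPromotion) return proposed;
  const hasOutsideEvidence = evidence.some((e) => recentTopics.indexOf(e.topic) === -1);
  return hasOutsideEvidence ? proposed : current;
}
```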

&lt;h2&gt;
  Hooking eval into the user profile
&lt;/h2&gt;

&lt;p&gt;This is where the rubric stops being a measurement and starts being product behaviour.&lt;/p&gt;

&lt;p&gt;The previous article described &lt;code&gt;weaknesses[]&lt;/code&gt; and &lt;code&gt;strengths[]&lt;/code&gt; as bounded tags on the user profile. The rubric is what populates them.&lt;/p&gt;

&lt;p&gt;After each session, the rubric output flows into the profile diff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;rubricToProfileDiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;SpeakingScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Evidence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;ProfileDiff&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addWeaknesses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addStrengths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;B2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meaning_blocking_errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isTenseError&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;addWeaknesses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tense-blocks-meaning&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;interaction&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;C1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;interaction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isCleanRepair&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;addStrengths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;self-repair&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;addWeaknesses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;addStrengths&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A weakness only enters the profile if it has rubric evidence. A strength only enters if it has rubric evidence. The scorer is the gate; the profile cannot drift into "user struggles with articles" because a single session looked uneven. This is also the answer to a question the previous article skipped: where do &lt;code&gt;weaknesses&lt;/code&gt; actually come from? Here. Always here. Never from the conversation model directly.&lt;/p&gt;

&lt;p&gt;The intersection runs the other way too. When Eli opens a session with "want to keep working on the QA-style interview answers from last time?" (the kind of cold-open the &lt;a href="https://elispeak.com/ua/topics/qa-interview-english/" rel="noopener noreferrer"&gt;QA interview English topic&lt;/a&gt; on Elispeak is built around), the topic suggestion is gated by whether the user's &lt;code&gt;range&lt;/code&gt; axis has enough evidence inside that domain to make the prep useful. We do not push interview practice on a user who is still B1 in conversational range; the rubric blocks the recommendation upstream.&lt;/p&gt;

&lt;h2&gt;
  What I'd tell someone building the same thing
&lt;/h2&gt;

&lt;p&gt;Four things, in order of how much time they saved us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decide what you are NOT scoring before deciding what you are.&lt;/strong&gt; "Native-like" was the load-bearing wrong assumption. Cutting it changed the rubric, the prompts, and the user copy, and it settled three weeks of disagreement on the team in a single afternoon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strip user history before the scoring call.&lt;/strong&gt; The scorer should re-derive the level from the transcript every time, not anchor on last week. Anchoring is a one-way ratchet toward stale scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require evidence per axis. Drop scores without it.&lt;/strong&gt; A scorer that returns a confident "B2" without two lines of evidence is hallucinating, and you will not catch it until a user asks why. Dropping unsupported scores is cheap and forces the scorer to behave.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound the rubric to bounded inputs.&lt;/strong&gt; Five axes, five CEFR bands, hand-labelled anchors per axis. Anything broader becomes a free-form essay grader, and free-form essay graders are exactly the thing every team eventually rebuilds because the first version drifted.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rubric is the least glamorous part of an AI tutor. It is also the only piece that decides whether the rest of the product is telling the user the truth.&lt;/p&gt;

&lt;h2&gt;
  Try it
&lt;/h2&gt;

&lt;p&gt;The free tier is enough to see whether the rubric reads your speaking the way you read it yourself. For paid plans, the launch promo &lt;code&gt;ELISPEAK50&lt;/code&gt; gets you 50% off any plan (no minimum).&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://app.elispeak.com/plan?promoId=promo_1TPkURDWds7e1FfDHk3YG84w&amp;amp;promoCode=ELISPEAK50&amp;amp;utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ua_cluster_2026_05" rel="noopener noreferrer"&gt;Try Elispeak&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>productivity</category>
      <category>startup</category>
    </item>
    <item>
      <title>Making an AI tutor feel like it remembers you — the user-profile layer behind Elispeak</title>
      <dc:creator>Elispeak</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:51:49 +0000</pubDate>
      <link>https://dev.to/elispeak111/making-an-ai-tutor-feel-like-it-remembers-you-the-user-profile-layer-behind-elispeak-1854</link>
      <guid>https://dev.to/elispeak111/making-an-ai-tutor-feel-like-it-remembers-you-the-user-profile-layer-behind-elispeak-1854</guid>
      <description>&lt;h1&gt;
  Making an AI tutor feel like it remembers you — the user-profile layer behind Elispeak
&lt;/h1&gt;

&lt;p&gt;I build Elispeak — an AI English speaking coach. Most of the interesting product work is not the voice pipeline or the scoring rubric. It's the user-profile layer that sits between a user's sessions and the next conversation Eli (the tutor persona) opens with.&lt;/p&gt;

&lt;p&gt;Without it, every session starts with the generic "What would you like to practice today?" With it, Eli opens with something like: "Last time you wanted to sound less stiff in standups — still that, or do you want to prep for Friday's interview instead?"&lt;/p&gt;

&lt;p&gt;That one sentence changes retention more than any other single thing we shipped. Here's how the layer actually works.&lt;/p&gt;

&lt;h2&gt;
  The problem
&lt;/h2&gt;

&lt;p&gt;LLM apps default to two broken modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stateless.&lt;/strong&gt; Every session starts from zero. The user has to re-explain who they are, what their level is, what they're practicing for. That friction kills daily-use intent on week two.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full transcript memory.&lt;/strong&gt; Shove every past message into context. Expensive, slow, leaks old topics into new ones ("you mentioned your mom's surgery three weeks ago — how is she?" when the user just wanted to practice a TOEFL prompt).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What we actually want is somewhere between these two: a compact, structured model of the user that survives across sessions without dragging raw conversation history forward.&lt;/p&gt;

&lt;h2&gt;
  What the profile stores
&lt;/h2&gt;

&lt;p&gt;The profile is a JSON-shaped record per user, updated after every session — not during. A few fields that carry weight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;UserProfile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;goals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Goal&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;                &lt;span class="c1"&gt;// "TOEFL in May", "sound natural in standups"&lt;/span&gt;
  &lt;span class="nl"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;speaking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;writing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CEFR&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;listening&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CEFR&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;weaknesses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Weakness&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;       &lt;span class="c1"&gt;// "articles", "past perfect", "th sounds"&lt;/span&gt;
  &lt;span class="nl"&gt;strengths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;          &lt;span class="c1"&gt;// short, positive; used for tone, not praise&lt;/span&gt;
  &lt;span class="nl"&gt;interests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;          &lt;span class="c1"&gt;// "football", "indie dev", "sci-fi"&lt;/span&gt;
  &lt;span class="nl"&gt;recentTopics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Topic&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;        &lt;span class="c1"&gt;// last ~10, with timestamps + summaries&lt;/span&gt;
  &lt;span class="nl"&gt;styleSignals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;               &lt;span class="c1"&gt;// helps Eli pace/tone replies&lt;/span&gt;
    &lt;span class="na"&gt;wantsCorrection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;immediate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;end-of-turn&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summary-only&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;preferredPace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;slow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;normal&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fast&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;emotionalRegister&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;direct&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;warm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playful&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nl"&gt;openLoops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;OpenLoop&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;        &lt;span class="c1"&gt;// things user said they wanted to come back to&lt;/span&gt;
  &lt;span class="nl"&gt;lastSessionAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;sessionCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing here is free-form prose. Everything is a bounded enum or a short tagged string. That constraint is the whole point — it's what lets the layer stay cheap to read and safe to pass into a prompt.&lt;/p&gt;

&lt;h2&gt;
  How it gets populated
&lt;/h2&gt;

&lt;p&gt;Two paths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Explicit onboarding.&lt;/strong&gt; The first few sessions ask the user a small number of low-friction questions — "what's the closest thing to why you're practicing?" with 4 options, not a text box. These seed &lt;code&gt;goals&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;, and &lt;code&gt;styleSignals.emotionalRegister&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Post-session enrichment.&lt;/strong&gt; This is the interesting part. After a session ends, a second, slower model pass runs on the transcript and answers a short, fixed set of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the user mention any new goal, deadline, or context we don't have?&lt;/li&gt;
&lt;li&gt;Which grammatical/phonetic weaknesses showed up at least twice?&lt;/li&gt;
&lt;li&gt;Did the user ask to come back to anything later?&lt;/li&gt;
&lt;li&gt;Did the user's preferred correction cadence shift in this session?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output of this pass is a structured &lt;strong&gt;diff&lt;/strong&gt;, not a rewrite. Something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"addWeaknesses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"conditional-3rd"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"addOpenLoop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"salary negotiation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"promo prep"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reinforce"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"interview prep"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The diff is applied to the profile with simple merge rules (cap &lt;code&gt;recentTopics&lt;/code&gt; at 10, cap &lt;code&gt;openLoops&lt;/code&gt; at 5, decay &lt;code&gt;confidence&lt;/code&gt; on older items). Keeping this as a diff — not a full overwrite — is what keeps the profile stable. One weird session doesn't erase four weeks of accumulated knowledge about the user.&lt;/p&gt;
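
&lt;p&gt;A minimal sketch of what those merge rules could look like. The caps (10 and 5) come from the text above; the &lt;code&gt;Profile&lt;/code&gt; and &lt;code&gt;ProfileDiff&lt;/code&gt; shapes, the &lt;code&gt;addRecentTopic&lt;/code&gt; key, and the 0.9 decay factor are assumptions for illustration, not the production code.&lt;/p&gt;

```typescript
// Hypothetical shapes. Only recentTopics, openLoops, confidence and the diff
// keys addWeaknesses / addOpenLoop / reinforce appear in the article.
type OpenLoop = { topic: string; context: string };
type Goal = { name: string; confidence: number };
type Profile = {
  goals: Goal[];
  weaknesses: string[];
  recentTopics: string[];
  openLoops: OpenLoop[];
};
type ProfileDiff = {
  addWeaknesses?: string[];
  addOpenLoop?: OpenLoop;
  addRecentTopic?: string; // assumed key, not in the sample diff
  reinforce?: { goal: string; confidence: number };
};

const MAX_RECENT_TOPICS = 10; // cap stated in the article
const MAX_OPEN_LOOPS = 5; // cap stated in the article
const CONFIDENCE_DECAY = 0.9; // assumed per-session decay factor

function applyDiff(profile: Profile, diff: ProfileDiff): Profile {
  // Decay every existing goal first; copies keep the input profile untouched.
  const goals = profile.goals.map((g) => ({ ...g, confidence: g.confidence * CONFIDENCE_DECAY }));
  const reinforce = diff.reinforce;
  if (reinforce) {
    const match = goals.filter((g) => g.name === reinforce.goal)[0];
    if (match) {
      match.confidence = Math.max(match.confidence, reinforce.confidence);
    } else {
      goals.push({ name: reinforce.goal, confidence: reinforce.confidence });
    }
  }
  // Append, then drop repeats: a diff adds knowledge but never removes it.
  const weaknesses = profile.weaknesses
    .concat(diff.addWeaknesses ?? [])
    .filter((w, i, all) => all.indexOf(w) === i);
  // Newest entries go last, so slice(-N) evicts the oldest past the cap.
  const recentTopics = diff.addRecentTopic
    ? profile.recentTopics.concat(diff.addRecentTopic).slice(-MAX_RECENT_TOPICS)
    : profile.recentTopics;
  const openLoops = diff.addOpenLoop
    ? profile.openLoops.concat(diff.addOpenLoop).slice(-MAX_OPEN_LOOPS)
    : profile.openLoops;
  return { goals, weaknesses, recentTopics, openLoops };
}
```

&lt;p&gt;Because the function returns a new profile instead of mutating the old one, a bad enrichment pass can be discarded by simply not saving the result.&lt;/p&gt;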

&lt;h2&gt;
  
  
  How recommendations use it
&lt;/h2&gt;

&lt;p&gt;When the user opens the app, we don't show a flat list of prompts. We compute a small ranked set.&lt;/p&gt;

&lt;p&gt;Roughly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;rankTopics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;UserProfile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Topic&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nx"&gt;Topic&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;goalAlignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goals&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="nf"&gt;weaknessHit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;weaknesses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="nf"&gt;interestHit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;interests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="nf"&gt;noveltyAgainst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recentTopics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
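    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// unwrap so the result matches the declared Topic[]&lt;/span&gt;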
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The weights are not magic. They came from watching early users either pick the first card or bounce. Three things moved the needle more than tuning the weights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Novelty penalty against &lt;code&gt;recentTopics&lt;/code&gt;.&lt;/strong&gt; If the user practiced "interview: tell me about yourself" two sessions ago, don't put it first again. This was the single biggest retention move. Users reading the same top card twice don't feel "understood," they feel "lazy AI."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-loop surfacing.&lt;/strong&gt; If the user said "I want to come back to negotiating salary," show that as its own explicit card with the phrase they used. This makes the continuity feel real because the language is theirs, not a paraphrase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal recency decay.&lt;/strong&gt; Goals aren't permanent. A TOEFL goal with a May date should rank near 1.0 in April and near 0.2 in July. Hard decay beats soft decay here — users notice when stale goals hang around.&lt;/li&gt;
&lt;/ul&gt;
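
&lt;p&gt;The goal recency decay can be sketched as a step function. This is an illustration under assumptions: the article only gives the two endpoints (near 1.0 before the date, near 0.2 a couple of months after), so the 7-day grace period and the exact 0.2 floor below are made-up constants, and &lt;code&gt;goalRecencyWeight&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumed constants: the article only states the endpoints of the curve.
const GRACE_DAYS = 7;
const STALE_WEIGHT = 0.2;

function goalRecencyWeight(deadline: Date | null, now: Date): number {
  if (deadline === null) return 1.0; // open-ended goals never go stale in this sketch
  const daysPast = (now.getTime() - deadline.getTime()) / DAY_MS;
  // Hard step: full weight through the grace period, then straight to the floor.
  return daysPast > GRACE_DAYS ? STALE_WEIGHT : 1.0;
}
```

&lt;p&gt;One way to use it would be to multiply each goal's contribution inside &lt;code&gt;goalAlignment&lt;/code&gt; by this weight, so a passed exam date stops dominating the top card without the goal being deleted.&lt;/p&gt;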

&lt;h2&gt;
  
  
  How Eli opens a session
&lt;/h2&gt;

&lt;p&gt;This is where the profile stops being a data structure and starts being a feeling.&lt;/p&gt;

&lt;p&gt;The opening line is generated by a small prompt that receives the &lt;strong&gt;minimum useful slice&lt;/strong&gt; of the profile, not the whole thing. Something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;user's top goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;top_goal&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="na"&gt;most recent open loop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;top_open_loop.topic&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="na"&gt;last session ended&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;days_ago&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;&lt;span class="s"&gt;d ago&lt;/span&gt;
&lt;span class="na"&gt;preferred register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;emotional_register&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No transcripts. No list of weaknesses. No confidence scores. The LLM isn't asked to decide what matters; the profile ranking already did that. The LLM is only asked to say one natural-sounding sentence that threads those three or four facts together.&lt;/p&gt;

&lt;p&gt;Two rules the opening line has to follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never invent continuity.&lt;/strong&gt; If there's no recent open loop, don't fake one. "Last time you wanted X" is the fastest way to destroy trust if the user didn't actually say X. When in doubt, ask.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match the user's register.&lt;/strong&gt; A user who set &lt;code&gt;emotionalRegister: "direct"&lt;/code&gt; gets "Interview prep or something else?" A user with &lt;code&gt;"warm"&lt;/code&gt; gets "Hey — want to pick up the interview prep, or reset?" Same information, different tone. This is the cheapest personalization we have.&lt;/li&gt;
&lt;/ol&gt;
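
&lt;p&gt;One way to enforce the first rule in code rather than in the prompt is to leave out any line the profile cannot back up. A sketch, assuming a hypothetical &lt;code&gt;OpenerSlice&lt;/code&gt; type where &lt;code&gt;null&lt;/code&gt; means "we have nothing here"; the line labels mirror the template above.&lt;/p&gt;

```typescript
type EmotionalRegister = "direct" | "warm" | "playful";

// Hypothetical slice type; a null field means the profile has nothing to say.
type OpenerSlice = {
  topGoal: string | null;
  openLoopTopic: string | null;
  daysSinceLastSession: number | null;
  emotionalRegister: EmotionalRegister;
};

function renderOpenerContext(slice: OpenerSlice): string {
  const lines: string[] = [];
  if (slice.topGoal !== null) lines.push(`user's top goal: ${slice.topGoal}`);
  // No open loop on record means no line at all, so the model has nothing to embellish.
  if (slice.openLoopTopic !== null) lines.push(`most recent open loop: ${slice.openLoopTopic}`);
  if (slice.daysSinceLastSession !== null) lines.push(`last session ended: ${slice.daysSinceLastSession}d ago`);
  lines.push(`preferred register: ${slice.emotionalRegister}`);
  return lines.join("\n");
}
```

&lt;p&gt;The model can only thread together the facts it is handed, so a missing open loop turns into a question to the user instead of an invented memory.&lt;/p&gt;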

&lt;h2&gt;
  
  
  The privacy line we don't cross
&lt;/h2&gt;

&lt;p&gt;The profile is structured, bounded, and summary-only. Full transcripts are not stored beyond the session's scoring pipeline. That's not just a privacy stance — it's an engineering one. If we kept transcripts, the profile layer would drift toward "shove raw text into context" and we'd be back to the expensive, leaky mode we were avoiding.&lt;/p&gt;

&lt;p&gt;The rule we follow internally: if a field can't be expressed as a bounded schema entry, it doesn't belong in the profile. A user saying "I'm nervous about my green card interview next Thursday" becomes &lt;code&gt;{ goal: "immigration-interview-prep", deadline: "2026-05-07", emotionalRegister: "warm" }&lt;/code&gt;, not a stored quote.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell someone building the same thing
&lt;/h2&gt;

&lt;p&gt;Four things in order of how much time they saved us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Update the profile after the session, not during.&lt;/strong&gt; Trying to update live made every turn slower and introduced race conditions between the scoring pass and the conversation turn. A slow async pass post-session is fine — the user won't feel it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diffs over rewrites.&lt;/strong&gt; Always. One bad session should never clobber the profile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound every field.&lt;/strong&gt; Enums, capped arrays, tagged strings. Free-form prose in a profile is technical debt that compounds every session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass the minimum slice to the opener, not the whole profile.&lt;/strong&gt; Let the ranker decide what matters. The LLM gets four lines of context, not forty.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once those four are in place, the "feels like Eli knows me" property shows up almost for free. Users describe it as "the AI remembers me" even though technically nothing from last week's transcript is in this week's prompt.&lt;/p&gt;

&lt;p&gt;That gap — between what's actually in context and what the user feels — is where the product lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The free tier is enough to see whether the personalized cold-open lands for you. For paid plans, the launch promo &lt;code&gt;ELISPEAK50&lt;/code&gt; gets you 50% off any plan (no minimum).&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://app.elispeak.com/plan?promoId=promo_1TPkURDWds7e1FfDHk3YG84w&amp;amp;promoCode=ELISPEAK50&amp;amp;utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ELISPEAK50_launch_2026_04" rel="noopener noreferrer"&gt;Try Elispeak&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>productivity</category>
      <category>startup</category>
    </item>
    <item>
      <title>I built an AI English speaking coach — what was technically hard</title>
      <dc:creator>Elispeak</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:24:57 +0000</pubDate>
      <link>https://dev.to/elispeak111/i-built-an-ai-english-speaking-coach-what-was-technically-hard-3m25</link>
      <guid>https://dev.to/elispeak111/i-built-an-ai-english-speaking-coach-what-was-technically-hard-3m25</guid>
      <description>&lt;p&gt;I spent the past year building &lt;a href="https://elispeak.com" rel="noopener noreferrer"&gt;Elispeak&lt;/a&gt;, an AI English speaking coach. The user-facing pitch is simple — talk to an AI tutor, get instant pronunciation and fluency feedback, practice TOEFL / IELTS / CELPIP speaking tasks on demand. Under the hood, a few things turned out to be much harder than I expected.&lt;/p&gt;

&lt;p&gt;This is a note to myself, and to anyone else building voice-first language tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Real-time ASR + scoring latency is the whole product
&lt;/h2&gt;

&lt;p&gt;The promise of "instant feedback" falls apart at 4 seconds of latency. At 1.5 seconds it feels like a person listening. At 3.5 seconds it feels like a slow API. While they wait, the user has no idea whether they spoke well or messed up, and that doubt erodes their confidence.&lt;/p&gt;

&lt;p&gt;Getting from end-of-utterance to a scored result — not just transcription, but pronunciation and fluency features — meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;streaming ASR instead of batch, with interim hypotheses used to start downstream work before the final transcript arrives&lt;/li&gt;
&lt;li&gt;precomputing a phoneme-alignment path so pronunciation scoring can start as soon as the audio chunk lands, not after the full sentence&lt;/li&gt;
&lt;li&gt;scoring features (pace, filler-word density, stress timing) computed on the audio stream, not derived post-hoc from the transcript&lt;/li&gt;
&lt;/ul&gt;
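
&lt;p&gt;To illustrate the incremental idea in the last bullet, here is a simplified sketch that updates pace and filler density one word at a time. It is a stand-in, not the real pipeline: it works from timestamped word events (a hypothetical &lt;code&gt;TimedWord&lt;/code&gt; shape) instead of raw audio, and the filler list is assumed.&lt;/p&gt;

```typescript
// Hypothetical word event; assumes the streaming ASR emits word-level timestamps.
type TimedWord = { text: string; startMs: number; endMs: number };

const FILLERS = ["um", "uh", "er"]; // assumed filler list

class FluencyAccumulator {
  private wordCount = 0;
  private fillerCount = 0;
  private firstStartMs: number | null = null;
  private lastEndMs = 0;

  // Called once per recognized word, as the stream delivers it.
  add(word: TimedWord): void {
    if (this.firstStartMs === null) this.firstStartMs = word.startMs;
    this.lastEndMs = Math.max(this.lastEndMs, word.endMs);
    this.wordCount += 1;
    if (FILLERS.indexOf(word.text.toLowerCase()) !== -1) this.fillerCount += 1;
  }

  // Cheap enough to call after every word, so results exist before the utterance ends.
  snapshot(): { wordsPerMinute: number; fillerRate: number } {
    const spanMs = this.firstStartMs === null ? 0 : this.lastEndMs - this.firstStartMs;
    return {
      wordsPerMinute: spanMs > 0 ? (this.wordCount * 60000) / spanMs : 0,
      fillerRate: this.wordCount > 0 ? this.fillerCount / this.wordCount : 0,
    };
  }
}
```

&lt;p&gt;The point of the shape is that nothing waits for the final transcript: each new word costs constant work, and a snapshot is available at any moment.&lt;/p&gt;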

&lt;p&gt;The second-order effect: every piece of UI has to be reactive too. If feedback lands in 1.2s but the UI only repaints every 500ms, the user can perceive up to 1.7s. Shaving animation blocking time ended up mattering almost as much as the model pipeline.&lt;/p&gt;
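
&lt;p&gt;The arithmetic behind that, as a small sketch. Both helpers are hypothetical and assume the UI repaints on a fixed tick; real frame scheduling is messier.&lt;/p&gt;

```typescript
// Time at which a result becomes visible if the UI only repaints on a fixed tick.
// Assumes ticks at 0, interval, 2 * interval, ... measured from end-of-utterance.
function visibleAtMs(resultReadyMs: number, repaintIntervalMs: number): number {
  return Math.ceil(resultReadyMs / repaintIntervalMs) * repaintIntervalMs;
}

// Worst case when ticks are not aligned: the result lands just after a repaint
// and waits almost a full interval for the next one.
function worstCaseVisibleMs(resultReadyMs: number, repaintIntervalMs: number): number {
  return resultReadyMs + repaintIntervalMs;
}
```

&lt;p&gt;With a 1200ms pipeline and a 500ms repaint interval, the aligned case shows the result at 1500ms and the worst case at 1700ms, so the repaint interval is part of the latency budget.&lt;/p&gt;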

&lt;h2&gt;
  
  
  2. Exam rubrics are not a prompt — they are a protocol
&lt;/h2&gt;

&lt;p&gt;TOEFL Independent Speaking, IELTS Part 2, CELPIP Task 4 each have published rubrics. It is tempting to drop the rubric into a system prompt and call it done. It is not done.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timing windows matter more than content.&lt;/strong&gt; TOEFL gives 15 seconds to prepare and 45 to speak. A "perfect answer" that runs 38 seconds is actually worse at the exam than a B+ answer at 44 seconds. The coach has to grade with that tension in mind, not just on transcript quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exam-safe framing is a legal surface.&lt;/strong&gt; You cannot say "this is your TOEFL score." You can say "a tutor applying the public band descriptors might score this around 23-25 of 30." That framing has to be in every response, not just onboarding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample answers drift.&lt;/strong&gt; A stable system prompt with drifting base-model behavior produces drifting feedback. I had to pin model versions per exam mode and run weekly evals on held-out recordings to catch regressions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat each exam mode as its own small product with its own eval set, not one mode with a different prompt.&lt;/p&gt;
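
&lt;p&gt;The weekly drift check mentioned above can be as simple as comparing scores on the same held-out recordings between a baseline run and the current run. A sketch, where the &lt;code&gt;EvalScore&lt;/code&gt; shape, the &lt;code&gt;findDrift&lt;/code&gt; helper, and the tolerance value are all assumptions:&lt;/p&gt;

```typescript
// Hypothetical eval record: one held-out recording scored under a pinned model.
type EvalScore = { recordingId: string; score: number };

// Returns the recordings whose score moved by more than `tolerance`
// between the baseline run and the current run.
function findDrift(baseline: EvalScore[], current: EvalScore[], tolerance: number): string[] {
  const baselineById: { [id: string]: number } = {};
  for (const b of baseline) baselineById[b.recordingId] = b.score;
  return current
    .filter((c) => c.recordingId in baselineById) // skip recordings new to this run
    .filter((c) => Math.abs(c.score - baselineById[c.recordingId]) > tolerance)
    .map((c) => c.recordingId);
}
```

&lt;p&gt;Running one such comparison per exam mode, each against its own recording set, is what keeps a regression in one mode from hiding behind stable averages in another.&lt;/p&gt;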

&lt;h2&gt;
  
  
  3. TTS voices that don't sound robotic are half the battle
&lt;/h2&gt;

&lt;p&gt;Students do not want to roleplay with a voice that sounds like a call-center IVR. The moment the voice feels synthetic, the emotional bar for opening their mouth goes up — and you just lost the session.&lt;/p&gt;

&lt;p&gt;What actually helped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;neural voices tuned for conversational English, not neutral narration&lt;/li&gt;
&lt;li&gt;varying pacing and pause patterns per scenario (airport interview is clipped and fast; therapy-style friend chat has longer pauses and more um / yeah / okay fillers)&lt;/li&gt;
&lt;li&gt;supporting accent diversity so the student practices comprehension, not just production&lt;/li&gt;
&lt;li&gt;lip-sync style micro-delays — the tutor reacts a beat late, like a human would, not instantly like a bot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the engineering side, this meant a voice persona config per AI tutor character (we ship multiple tutors) and keeping the latency budget from Section 1 intact while adding TTS synthesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack and trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ASR&lt;/strong&gt;: streaming provider with word-level timestamps and phoneme probabilities. Interim hypotheses + confidence scores shaped more of the architecture than raw accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring&lt;/strong&gt;: pronunciation on phoneme-level features and edit distance vs. expected; fluency on stream-level features (pace, filler rate, pause distribution); content via an LLM pass scoped to rubric criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: one pinned model per exam mode, with an eval regression suite run before any upgrade.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS&lt;/strong&gt;: neural conversational voices, persona config per tutor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: WebRTC for capture, progressive UI updates keyed to pipeline stages so partial results feel immediate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trade-offs that bit me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;optimizing for end-to-end latency means giving up some scoring quality per step; I keep having to rebalance the two&lt;/li&gt;
&lt;li&gt;picking "one best voice" per tutor is a false economy: students attach to specific voices and churn when you change them&lt;/li&gt;
&lt;li&gt;rubrics are a moving target; budget time to rerun evals after any provider upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I would build differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Invest in the eval loop before the product surface. Most debugging pain in months 4-8 traced back to missing eval coverage, not missing features.&lt;/li&gt;
&lt;li&gt;Do not ship more than two exam modes until the first two are clean. More modes means more eval sets means more drift surface.&lt;/li&gt;
&lt;li&gt;Pay for a proper observability stack earlier. Custom logging runs out of road faster than you expect on a voice pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you want to kick the tires, there is a free tier at &lt;a href="https://elispeak.com" rel="noopener noreferrer"&gt;elispeak.com&lt;/a&gt;. Paid plans are 50% off with code &lt;strong&gt;ELISPEAK50&lt;/strong&gt; — no minimum, works on any plan.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://app.elispeak.com/plan?promoId=promo_1TPkURDWds7e1FfDHk3YG84w&amp;amp;promoCode=ELISPEAK50&amp;amp;utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=ELISPEAK50_launch_2026_04" rel="noopener noreferrer"&gt;Start practicing — 50% off any plan&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments — especially on ASR pipeline design and rubric evals.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
