<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AYUSH SHARMA</title>
    <description>The latest articles on DEV Community by AYUSH SHARMA (@iussharma).</description>
    <link>https://dev.to/iussharma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939430%2Fdac46396-f180-4d90-ad58-2d9f34611aa3.jpeg</url>
      <title>DEV Community: AYUSH SHARMA</title>
      <link>https://dev.to/iussharma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iussharma"/>
    <language>en</language>
    <item>
      <title>How We Solved the Hidden Problem of Cheap LLMs</title>
      <dc:creator>AYUSH SHARMA</dc:creator>
      <pubDate>Tue, 19 May 2026 06:38:05 +0000</pubDate>
      <link>https://dev.to/iussharma/how-we-solved-the-hidden-problem-of-cheap-llms-54fp</link>
      <guid>https://dev.to/iussharma/how-we-solved-the-hidden-problem-of-cheap-llms-54fp</guid>
      <description>&lt;h1&gt;
  
  
  I Used CascadeFlow After My Cheap Model Got Confident
&lt;/h1&gt;

&lt;p&gt;The first version of my comment-reply agent had a familiar failure mode: the cheapest model often sounded sure of itself even when it had not understood the relationship context. That was worse than a slow reply, because it produced responses that looked plausible until a human noticed they were generic, misprioritized, or quietly missing the point.&lt;/p&gt;

&lt;p&gt;I did not want to solve that by sending every comment to the largest model. Most comments do not need it. A creator replying to “Nice post” should not pay the same inference cost as a creator answering a founder asking how to deploy AI agents for customer onboarding. So I built EchoEngage around a simple constraint: route comments cheaply by default, but make the routing decision visible, testable, and reversible.&lt;/p&gt;

&lt;p&gt;That is where CascadeFlow became useful. I did not use it as decoration around the LLM call. I used it as part of the control plane for deciding which model should answer, when the answer should be checked, and when the system should escalate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What EchoEngage Does
&lt;/h2&gt;

&lt;p&gt;EchoEngage is a creator relationship memory system. It watches incoming social comments, remembers who each follower is, prioritizes the interaction, generates a suggested reply, and stores the new interaction back into memory.&lt;/p&gt;

&lt;p&gt;The backend is a FastAPI service. The frontend is a React dashboard with a comment inbox, follower memory card, generated reply panel, and routing audit. The agent itself is a LangGraph state machine with six nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recall_memory
  -&amp;gt; classify_comment
  -&amp;gt; route_model
  -&amp;gt; generate_reply
  -&amp;gt; quality_gate
  -&amp;gt; retain_memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The memory layer uses &lt;a href="https://github.com/vectorize-io/hindsight" rel="noopener noreferrer"&gt;Hindsight agent memory&lt;/a&gt; so each follower can have a separate long-lived memory bank. The routing layer uses &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;CascadeFlow model routing&lt;/a&gt; to keep model selection observable instead of burying it in prompt folklore.&lt;/p&gt;

&lt;p&gt;The important part is that the system is not just “generate a reply.” It is closer to a tiny workflow engine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch relationship history for the follower.&lt;/li&gt;
&lt;li&gt;Classify the comment’s intent and complexity.&lt;/li&gt;
&lt;li&gt;Choose a model based on that classification.&lt;/li&gt;
&lt;li&gt;Generate a short, context-aware reply.&lt;/li&gt;
&lt;li&gt;Run a quality gate.&lt;/li&gt;
&lt;li&gt;Escalate once if the reply fails.&lt;/li&gt;
&lt;li&gt;Store the new interaction back into memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That loop is where the interesting engineering tradeoff lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Was Not Cost. It Was Confidence.
&lt;/h2&gt;

&lt;p&gt;Cheap models are not bad. In fact, for short social comments they are often the right default. The issue is that a cheap model can produce a fluent reply even when it missed the only thing that mattered.&lt;/p&gt;

&lt;p&gt;For example, if Riya asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any tool recommendations for my content creation workflow? I’ve been struggling with automating my social posts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A generic answer might recommend scheduling tools and call it done. A better answer remembers that Riya previously asked about content research, liked Notion AI, responds well to tool-specific suggestions, and is building a solo creator workflow.&lt;/p&gt;

&lt;p&gt;That memory changes the reply. It also changes the routing decision. A comment that looks simple by length can be important by relationship context.&lt;/p&gt;

&lt;p&gt;The same issue appears with buying signals. When Sara writes that she is happy to pay for a personalized session, the system should not route that as “another question.” It should recognize the commercial intent and spend more care on the answer.&lt;/p&gt;

&lt;p&gt;So the routing logic had to consider more than token count. It needed a small but explicit policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Became a First-Class Object
&lt;/h2&gt;

&lt;p&gt;The routing service encodes the policy in ordinary Python. I like that. It is not hidden in a prompt, not spread across UI state, and not dependent on reading logs after the fact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;appreciation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CHEAP_MODEL&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simple appreciation comment — using efficient model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CHEAP_MODEL&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium complexity — trying efficient model first, will escalate if quality fails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;STRONG_MODEL&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Complex technical/business question — using strong model for quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;buying_signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;STRONG_MODEL&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Buying signal detected — using strong model for quality response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is deliberately boring code. That is the point. Routing is product behavior, operational policy, and cost control all at once. It deserves to be reviewable in a pull request.&lt;/p&gt;

&lt;p&gt;CascadeFlow fits around this because the decision is recorded as data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complexity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality_gate_passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimated_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;savings_percentage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;estimated_cost&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;baseline_cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;baseline_cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made the frontend routing audit possible. When a creator reviews a reply, they can also see why the system chose that model. For engineers, the more important benefit is that routing becomes debuggable. I can inspect whether too many comments are escalating, whether buying signals are being caught, and whether the cheap path is being overused.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;CascadeFlow documentation&lt;/a&gt; frames this as routing and evaluation infrastructure. In this project, I found the most useful mental model was simpler: every model choice should leave a receipt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hindsight Made “Simple” Comments Less Simple
&lt;/h2&gt;

&lt;p&gt;The memory layer complicated routing in a good way. EchoEngage gives each follower a memory bank, which means the same comment can mean different things from different people.&lt;/p&gt;

&lt;p&gt;The memory service abstracts Hindsight behind &lt;code&gt;recall&lt;/code&gt; and &lt;code&gt;retain&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follower_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bank_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_bank_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;follower_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hindsight_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arecall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;bank_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bank_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No previous memories found for this follower.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation also supports local fallback storage, but the shape of the interface stays the same. The agent asks for memories relevant to the current comment. It does not need to know whether they came from a remote memory service or another backing store.&lt;/p&gt;

&lt;p&gt;I found &lt;a href="https://vectorize.io/what-is-agent-memory" rel="noopener noreferrer"&gt;Vectorize’s explanation of agent memory&lt;/a&gt; useful because it separates memory from chat history. EchoEngage does not need a raw transcript dump. It needs durable facts: interests, previous questions, sentiment, buying intent, and relationship context.&lt;/p&gt;

&lt;p&gt;That distinction affects routing. A comment from a first-time follower can be answered cheaply. A similar comment from a high-value follower with a long history may deserve a more careful generation path, not because the text is hard, but because the relationship is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quality Gate Is Where Routing Earns Its Keep
&lt;/h2&gt;

&lt;p&gt;The most important design decision was not the initial route. It was allowing the system to be wrong once.&lt;/p&gt;

&lt;p&gt;The LangGraph pipeline has a quality gate after generation. If the generated reply fails, the graph can loop back to generation with a stronger model. That made me more comfortable using the cheap model for medium-complexity comments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality_gate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality_gate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;should_regenerate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regenerate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retain_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gate itself asks a narrower question than the generator. It checks whether the reply is relevant, personalized when memory exists, short enough for social media, safe, and human-sounding.&lt;/p&gt;

&lt;p&gt;If the reply fails and the system has not already escalated, the routing service updates the decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;escalate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;STRONG_MODEL&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; [ESCALATED: quality gate failed on cheaper model]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;old_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;new_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_estimate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STRONG_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complexity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;old_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;new_cost&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a guarantee of correctness. It is a pressure valve. It lets the system attempt the efficient path without pretending that the first answer is always good enough.&lt;/p&gt;

&lt;p&gt;I prefer this to a static rule like “all medium comments use the strong model.” Static rules are easy to reason about but expensive in the wrong places. A quality loop gives the cheap model a chance while preserving an escalation path.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Interaction
&lt;/h2&gt;

&lt;p&gt;Take three comments from the system.&lt;/p&gt;

&lt;p&gt;The first is a low-stakes appreciation comment:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Just discovered your channel. Great content on AI tools!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The classifier marks it as simple appreciation. The router picks the cheap model. The reply can be short and warm. There is no reason to involve the stronger model unless the quality gate catches something odd.&lt;/p&gt;

&lt;p&gt;The second is a repeat question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hey, I’m still confused about Zapier vs Make.com. Which one should I pick for my e-commerce store? I asked before but I’m still not sure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here, Hindsight matters. The system recalls that Priya has asked about this before and is building an e-commerce workflow. The reply should acknowledge the repeated confusion and give a concrete recommendation. Even if the initial route uses the efficient model, the quality gate should reject a vague answer.&lt;/p&gt;

&lt;p&gt;The third is a buying signal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’re a 5-person startup and AI tools are becoming essential for us. Do you offer any consulting or personalized tool recommendations? Happy to pay for a session.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This routes directly to the stronger model because the intent is different. The business risk of a generic or careless answer is higher. The system should be helpful, specific, and not over-promise.&lt;/p&gt;

&lt;p&gt;None of these examples require exotic agent behavior. They require memory, classification, routing, and a feedback loop that is explicit enough to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model routing should be boring code
&lt;/h3&gt;

&lt;p&gt;I do not want routing policy hidden inside a long prompt. Prompts are useful for classification and generation, but model selection affects cost, latency, and user experience. It should be represented as data and ordinary control flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cheap-first only works with a quality gate
&lt;/h3&gt;

&lt;p&gt;Using a cheaper model first is not a strategy by itself. It becomes a strategy when the system has a way to inspect the output and escalate. Without that, cheap-first just means “hope the first model was good enough.”&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Memory changes priority, not just wording
&lt;/h3&gt;

&lt;p&gt;Hindsight improved personalization, but the larger effect was on prioritization. A short comment from a long-time follower is not equivalent to a short comment from a stranger. Memory belongs upstream of routing, not only inside the final reply prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Every route needs an explanation
&lt;/h3&gt;

&lt;p&gt;The routing audit was not just a UI feature. It forced me to store the reason, estimated cost, baseline cost, escalation flag, and latency. That made the system easier to reason about and easier to challenge.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Fallbacks are part of the architecture
&lt;/h3&gt;

&lt;p&gt;The memory service keeps the same interface whether it uses Hindsight or local storage. That matters because the agent graph should not care about infrastructure details. A stable boundary around memory made the rest of the system simpler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thought
&lt;/h2&gt;

&lt;p&gt;I started with a model-cost problem and ended up with a confidence problem. The cheap model was not failing loudly. It was producing replies that looked acceptable until the missing memory or missed intent became obvious.&lt;/p&gt;

&lt;p&gt;CascadeFlow helped because it made routing decisions visible. Hindsight helped because it made follower context durable. LangGraph helped because the workflow could loop when quality failed.&lt;/p&gt;

&lt;p&gt;The lesson I took from building EchoEngage is that production agents need fewer magic tricks and more receipts. Recall the context. Make a route. Explain the route. Check the output. Store what changed. That loop is not glamorous, but it is the difference between a reply generator and a system I can actually operate.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
