<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: guguloth adithyajadhav</title>
    <description>The latest articles on DEV Community by guguloth adithyajadhav (@guguloth_adithyajadhav_9a).</description>
    <link>https://dev.to/guguloth_adithyajadhav_9a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4006839%2F90356d6d-1eda-490f-8006-09ae9d2a4720.png</url>
      <title>DEV Community: guguloth adithyajadhav</title>
      <link>https://dev.to/guguloth_adithyajadhav_9a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/guguloth_adithyajadhav_9a"/>
    <language>en</language>
    <item>
      <title>A Sales Agent That Remembers Why the Deal Is Stuck</title>
      <dc:creator>guguloth adithyajadhav</dc:creator>
      <pubDate>Sun, 28 Jun 2026 18:16:21 +0000</pubDate>
      <link>https://dev.to/guguloth_adithyajadhav_9a/a-sales-agent-that-remembers-why-the-deal-is-stuck-80c</link>
      <guid>https://dev.to/guguloth_adithyajadhav_9a/a-sales-agent-that-remembers-why-the-deal-is-stuck-80c</guid>
      <description>&lt;h1&gt;
  
  
  A Sales Agent That Remembers Why the Deal Is Stuck
&lt;/h1&gt;

&lt;p&gt;Every sales AI I'd seen before suffered from the same problem: it had no memory. You fed it a transcript and it produced a follow-up email, but the next call started from scratch. Ask it who the real decision-maker is after five conversations and it would answer as if it had never heard of the account. The context that makes a sales rep effective (the accumulating picture of what the customer actually cares about, who matters, and what's already been resolved) doesn't survive a stateless LLM call.&lt;/p&gt;

&lt;p&gt;So I built a system that does remember, not with a vector database bolted on as an afterthought, but with memory as the core architectural concern. This is the story of how that works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the System Does
&lt;/h2&gt;

&lt;p&gt;The system processes sales call transcripts and produces two things: an analysis of the current real blocker and a personalized follow-up email. The twist is that every call builds on all the calls before it.&lt;/p&gt;

&lt;p&gt;The architecture is two cooperating agents powered by &lt;a href="https://www.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;, backed by persistent memory through &lt;a href="https://github.com/vectorize-io/hindsight" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt; and cost-aware model routing through &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt;. The pipeline for each call is exactly four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; everything known about this customer from prior calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst agent&lt;/strong&gt; reads the new transcript plus recalled memory and identifies the real current blocker and decision-makers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer agent&lt;/strong&gt; turns that analysis into a personalized follow-up email&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save&lt;/strong&gt; this call's extracted facts back to memory for next time
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the full recall -&amp;gt; analyze -&amp;gt; write -&amp;gt; save pipeline for one call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;recall_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_strip_think&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask_ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_analyst_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_strip_think&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask_ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_writer_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="n"&gt;new_facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_strip_think&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask_ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_facts_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="nf"&gt;save_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_facts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline is deliberately linear. The Analyst sees the raw transcript plus everything recalled from prior calls. The Writer sees the Analyst's output plus the same recalled context. Memory is saved after writing so the next call gets facts extracted with the benefit of the current analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory That Compounds
&lt;/h2&gt;

&lt;p&gt;The core technical story here is &lt;a href="https://vectorize.io/what-is-agent-memory" rel="noopener noreferrer"&gt;agent memory&lt;/a&gt;: not just storing text, but accumulating structured understanding across sessions.&lt;/p&gt;

&lt;p&gt;I used &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt; as the memory backend. The model is simple: a shared "bank" stores all customer memories. Each customer's memories are isolated from others' by tagging every write with a &lt;code&gt;customer:&amp;lt;slug&amp;gt;&lt;/code&gt; tag and filtering recalls with &lt;code&gt;tags_match="all_strict"&lt;/code&gt;. Customers never bleed into each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;_ensure_bank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bank_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BANK_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales call notes for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;_customer_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recall_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;_ensure_bank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bank_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BANK_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Everything known about &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: their priorities, blockers, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;concerns, budget, timeline, and any context from prior calls.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;_customer_tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;tags_match&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all_strict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What matters isn't the API—it's the compounding behavior. After each call, extracted facts accumulate in the bank. By the fifth call, the system had 33 stored facts compared to 5 after the first call. More importantly, the &lt;em&gt;quality&lt;/em&gt; of what was stored evolved: early facts were surface-level price concerns, later ones captured specific people, their exact authority levels, which security documents were still pending, and what had already been resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deal That Changed Shape
&lt;/h2&gt;

&lt;p&gt;The five calls in the dataset follow a pattern that's common in B2B sales and that a stateless agent handles badly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 1&lt;/strong&gt;: Mike Reynolds, VP Operations, says the $4,800/month price tag is the issue. Jordan focuses on ROI. The system generates a price-focused email to Mike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 2&lt;/strong&gt;: Sarah Chen, IT Security Lead, joins and flags data residency and SOC2 questions. Mike talks over her: "let's not get too deep in the weeds." The system notes Sarah's concerns but Mike is still nominally in charge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 3&lt;/strong&gt;: Jordan returns with a 15% discount. Mike says price is mostly resolved. But Sarah blocks forward motion: she needs SOC2 Type 2 (not Type 1), a written data residency guarantee, and a data deletion policy. The blocker has shifted from budget to compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 4&lt;/strong&gt;: Finance has signed off on the budget. Mike doesn't show up—it's Jordan and Sarah alone. Sarah makes it explicit: &lt;em&gt;"I'm the one who signs off here. Mike owns the budget, but if security doesn't pass, there's no deal."&lt;/em&gt; The real decision-maker was never Mike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 5&lt;/strong&gt;: Sarah has reviewed the SOC2 Type 2 (passes), budget is locked for the Enterprise tier that includes residency controls. One item remains: the EU data residency guarantee in writing. That's it. One document.&lt;/p&gt;

&lt;p&gt;A stateless agent processing Call 5 in isolation would still be pitching ROI to a VP who already has budget approval. The memory-backed system knows that price was resolved two calls ago, that Sarah is the approver, and that one specific document closes the deal.&lt;/p&gt;

&lt;p&gt;The Call 1 email was addressed to Mike and spent most of its words on ROI and cost justification. The Call 5 email went directly to Sarah and referenced the EU data residency guarantee by name. The difference isn't sophistication—it's memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost-Aware Routing
&lt;/h2&gt;

&lt;p&gt;Not every model call deserves the same model. Extracting three bullet points of facts from a transcript is a different task than reasoning about who the real decision-maker is across five calls' worth of context.&lt;/p&gt;

&lt;p&gt;I used &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt; to handle this automatically. The setup is two models: a cheap &lt;code&gt;qwen3-32b&lt;/code&gt; on Groq as the drafter, and &lt;code&gt;gpt-oss-120b&lt;/code&gt; as the verifier that only runs when the drafter's output doesn't clear a quality threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;drafter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;verifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-oss-120b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CascadeAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;drafter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verifier&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;enable_cascade&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call is logged: which model was used, whether it escalated, why, how long it took. Looking at the decision log from a full run through five calls, the pattern is clear: analyst reasoning escalates to the larger model, while simpler extraction tasks stay on the cheaper one. Cascadeflow's routing decision on each call shows up explicitly—&lt;code&gt;"moderate query suitable for cascade optimization"&lt;/code&gt; for simple extraction, &lt;code&gt;"hard query requires best model for quality"&lt;/code&gt; for the analyst and writer calls.&lt;/p&gt;
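
&lt;p&gt;To make the escalation idea concrete, here is a library-free sketch of cheap-first routing. It is not cascadeflow's implementation or API: &lt;code&gt;cascade&lt;/code&gt;, the toy models, and &lt;code&gt;toy_score&lt;/code&gt; are hypothetical names, and the scoring function is only a stand-in for whatever quality check a real router applies.&lt;/p&gt;

```python
from typing import Callable, Tuple


def cascade(
    prompt: str,
    drafter: Callable[[str], str],
    verifier: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float = 0.72,
) -> Tuple[str, str]:
    """Return (answer, model_used), escalating only when the draft scores too low."""
    draft = drafter(prompt)
    if score(prompt, draft) >= threshold:
        return draft, "drafter"  # cheap path: the draft is good enough
    return verifier(prompt), "verifier"  # escalate to the larger model


# Toy stand-ins so the sketch runs without any API keys.
def toy_drafter(prompt: str) -> str:
    return "short answer"


def toy_verifier(prompt: str) -> str:
    return "careful, detailed answer"


def toy_score(prompt: str, answer: str) -> float:
    # Pretend prompts mentioning "decision-maker" are too hard for the drafter.
    return 0.4 if "decision-maker" in prompt else 0.9


easy = cascade("List three facts from the call", toy_drafter, toy_verifier, toy_score)
hard = cascade("Who is the real decision-maker?", toy_drafter, toy_verifier, toy_score)
```

&lt;p&gt;In this toy run the extraction prompt stays on the drafter and the decision-maker prompt escalates, which mirrors the split seen in the decision log.&lt;/p&gt;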

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;p&gt;Three bugs, plus one dependency lesson, caused more pain than the architecture itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The silent empty output.&lt;/strong&gt; The qwen3-32b model, when doing deep reasoning, writes an extended &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; block before its actual answer. If &lt;code&gt;max_tokens&lt;/code&gt; was set too low—my initial value was 512—the model would exhaust its token budget on internal reasoning and return nothing visible. The fix was raising &lt;code&gt;max_tokens&lt;/code&gt; to 2048 across all calls. The symptom was subtle: the model call succeeded with a 200, but the returned content was empty after stripping the think block. I caught it by printing raw output before applying &lt;code&gt;_strip_think&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Speaking of which—that helper is small but essential:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_think&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove qwen-style &amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt; reasoning blocks from model output.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without it, the reasoning model's internal deliberation shows up in the rendered output. It's verbose and irrelevant to the user.&lt;/p&gt;
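
&lt;p&gt;One way to turn the silent empty output into a loud failure is to check for visible text right after stripping. This is a minimal sketch, not part of the original code: &lt;code&gt;visible_or_raise&lt;/code&gt; is a hypothetical helper, and it takes the stripping function as an argument so it can wrap &lt;code&gt;_strip_think&lt;/code&gt; unchanged.&lt;/p&gt;

```python
from typing import Callable


def visible_or_raise(raw: str, strip_fn: Callable[[str], str]) -> str:
    """Apply strip_fn to raw model output and raise if no visible text remains."""
    visible = strip_fn(raw)
    if not visible:
        # A successful response with nothing left after stripping can mean the
        # token budget ran out during reasoning, so report the raw length too.
        raise ValueError(
            f"no visible output ({len(raw)} raw characters); "
            "max_tokens may be too low for reasoning plus answer"
        )
    return visible
```

&lt;p&gt;In the pipeline this would be called as &lt;code&gt;visible_or_raise(ask_ai(prompt), _strip_think)&lt;/code&gt; in place of the bare &lt;code&gt;_strip_think(ask_ai(prompt))&lt;/code&gt;.&lt;/p&gt;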

&lt;p&gt;&lt;strong&gt;The event loop collision.&lt;/strong&gt; Hindsight's sync client drives &lt;code&gt;aiohttp&lt;/code&gt; on its own event loop internally. Calling it from Streamlit's script thread—which runs its own asyncio loop—raises &lt;code&gt;RuntimeError: Timeout context manager should be used inside a task&lt;/code&gt;. The error is confusing because it manifests as a timeout rather than a clear concurrency error.&lt;/p&gt;

&lt;p&gt;The fix: route every Hindsight call through a dedicated &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; with a single worker. That worker thread has no running event loop, so the client creates its own without conflict.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;concurrent.futures&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;

&lt;span class="n"&gt;_executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread_name_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hindsight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run a Hindsight client call in the dedicated worker thread.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One worker keeps calls serialized. The Hindsight client reuses one aiohttp session safely. Streamlit's event loop never interferes. This pattern is broadly applicable any time you need to call async-backed sync code from a framework that already owns an event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cascadeflow's own event loop.&lt;/strong&gt; A similar collision affected &lt;code&gt;cascadeflow&lt;/code&gt;. Using &lt;code&gt;asyncio.run()&lt;/code&gt; for each call worked for the first call but closed the loop, so subsequent calls failed with &lt;code&gt;Event loop is closed&lt;/code&gt;. The fix was creating one persistent event loop at module import time and routing all calls through &lt;code&gt;loop.run_until_complete()&lt;/code&gt; for the lifetime of the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pin your dependencies.&lt;/strong&gt; This one is boring but I'll say it anyway. An open-ended requirement like &lt;code&gt;hindsight-client&amp;gt;=0.8&lt;/code&gt; resolves to whatever release is newest at install time, so a fresh environment can silently pick up a version that didn't exist when you wrote the requirement and that you never tested. I pinned everything to exact versions that actually install cleanly: &lt;code&gt;hindsight-client==0.8.3&lt;/code&gt;, &lt;code&gt;cascadeflow==0.7.1&lt;/code&gt;, &lt;code&gt;crewai==0.86.0&lt;/code&gt;. If you're integrating newer libraries with fast release cycles, locking versions early saves the "works on my machine" conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Is Good For
&lt;/h2&gt;

&lt;p&gt;The compounding-context pattern applies anywhere you have multi-session interactions with an evolving state of knowledge. Customer support is the obvious analog: a support agent that remembers what the customer has already said, which fixes have already been tried, and what the customer's environment looks like would be substantially more useful than one that asks the same diagnostic questions on every call. The same logic applies to research assistants, onboarding flows, and anything where context accumulates faster than a human can reliably track it.&lt;/p&gt;

&lt;p&gt;The model routing layer is separable from the memory layer and useful on its own. If you're making many LLM calls with a mix of simple and complex prompts, paying for a large model on every call is unnecessary. Cascadeflow's automatic escalation keeps the easy calls cheap without requiring you to manually classify which is which.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory as first-class architecture, not a bolt-on.&lt;/strong&gt; The session context that makes follow-ups useful has to be explicitly persisted and recalled. Building around that constraint—tagging per customer, recalling before analyzing, saving after writing—shapes the whole design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The blocker changes. Your system has to notice.&lt;/strong&gt; Price was the stated blocker in Call 1. By Call 5 it was irrelevant. A system without memory keeps addressing a blocker that no longer exists. One with memory can track when something resolves and shift focus to whatever replaced it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async libraries inside a framework that owns the event loop need care.&lt;/strong&gt; Both Hindsight and cascadeflow hit event loop conflicts in Streamlit. The two fixes (a single dedicated worker thread for Hindsight, one persistent event loop for cascadeflow) are reusable solutions for this class of problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reasoning budget matters.&lt;/strong&gt; Chain-of-thought models spend tokens thinking before they answer. If your &lt;code&gt;max_tokens&lt;/code&gt; ceiling is too low, you'll get empty responses and no error. Size your token limits to accommodate both reasoning and output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cheap-first routing is worth the setup.&lt;/strong&gt; It's not a lot of code, but it changes the economics of running many LLM calls per user interaction. Simple operations run fast and cheap; complex reasoning escalates only when needed.&lt;/p&gt;




&lt;p&gt;The code is in Python using &lt;a href="https://www.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;, &lt;a href="https://github.com/vectorize-io/hindsight" rel="noopener noreferrer"&gt;Hindsight&lt;/a&gt;, &lt;a href="https://github.com/lemony-ai/cascadeflow" rel="noopener noreferrer"&gt;cascadeflow&lt;/a&gt;, and Streamlit. Models run on Groq. The Hindsight docs are at &lt;a href="https://hindsight.vectorize.io/" rel="noopener noreferrer"&gt;hindsight.vectorize.io&lt;/a&gt; and cascadeflow's at &lt;a href="https://docs.cascadeflow.ai/" rel="noopener noreferrer"&gt;docs.cascadeflow.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnmyszfzs2xdew3uhtze0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnmyszfzs2xdew3uhtze0.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmy5kbjfo0mf6z0eaq8mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmy5kbjfo0mf6z0eaq8mx.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
