A Sales Agent That Remembers Why the Deal Is Stuck

#python #ai #agents #llm

Every sales AI I'd seen suffered from the same problem: it had no memory. You fed it a transcript and it produced a follow-up email, but the next call started from scratch. Ask it who the real decision-maker is after five conversations and it would answer as if it had never heard of the account. The context that makes a sales rep effective (the accumulating picture of what the customer actually cares about, who matters, and what's already been resolved) doesn't survive a stateless LLM call.

So I built a system that does remember. Not in a vector database slapped on as an afterthought, but as the core architectural concern. This is the story of how that works.

What the System Does

The system processes sales call transcripts and produces two things: an analysis of the current real blocker and a personalized follow-up email. The twist is that every call builds on all the calls before it.

The architecture is two cooperating agents powered by CrewAI, backed by persistent memory through Hindsight and cost-aware model routing through cascadeflow. The pipeline for each call is exactly four steps:

1. Recall everything known about this customer from prior calls
2. Analyst agent reads the new transcript plus recalled memory and identifies the real current blocker and decision-makers
3. Writer agent turns that analysis into a personalized follow-up email
4. Save this call's extracted facts back to memory for next time

def process_call(customer: str, transcript: str) -> dict:
    """Run the full recall -> analyze -> write -> save pipeline for one call."""
    memory = recall_memory(customer)

    analysis = _strip_think(ask_ai(_analyst_prompt(customer, transcript, memory)))
    email    = _strip_think(ask_ai(_writer_prompt(customer, analysis, memory)))

    new_facts = _strip_think(ask_ai(_facts_prompt(customer, transcript, analysis)))
    save_memory(customer, new_facts)

    return {"email": email, "analysis": analysis}

The pipeline is deliberately linear. The Analyst sees the raw transcript plus everything recalled from prior calls. The Writer sees the Analyst's output plus the same recalled context. Fact extraction runs last, with the finished analysis in hand, so what gets saved for the next call reflects the current read of the deal rather than just the raw transcript.

Memory That Compounds

The core technical story here is agent memory: not just storing text, but accumulating structured understanding across sessions.

I used Hindsight as the memory backend. The model is simple: a shared "bank" stores all customer memories. Each customer's memories are isolated from others' by tagging every write with a customer:<slug> tag and filtering recalls with tags_match="all_strict". Customers never bleed into each other.

def save_memory(customer: str, notes: str) -> None:
    _ensure_bank()
    _call(
        _client.retain,
        bank_id=BANK_ID,
        content=notes,
        context=f"Sales call notes for {customer}",
        tags=[_customer_tag(customer)],
    )

def recall_memory(customer: str) -> str:
    _ensure_bank()
    resp = _call(
        _client.recall,
        bank_id=BANK_ID,
        query=(
            f"Everything known about {customer}: their priorities, blockers, "
            "concerns, budget, timeline, and any context from prior calls."
        ),
        tags=[_customer_tag(customer)],
        tags_match="all_strict",
        budget="high",
    )
    lines = [r.text for r in resp.results if getattr(r, "text", None)]
    return "\n".join(f"- {line}" for line in lines)

What matters isn't the API—it's the compounding behavior. After each call, extracted facts accumulate in the bank. By the fifth call, the system had 33 stored facts compared to 5 after the first call. More importantly, the quality of what was stored evolved: early facts were surface-level price concerns, later ones captured specific people, their exact authority levels, which security documents were still pending, and what had already been resolved.

The Deal That Changed Shape

The five calls in the dataset follow a pattern that's common in B2B sales and that a stateless agent handles badly.

Call 1: Mike Reynolds, VP Operations, says the $4,800/month price tag is the issue. Jordan, the sales rep, responds with an ROI argument. The system generates a price-focused email to Mike.

Call 2: Sarah Chen, IT Security Lead, joins and flags data residency and SOC2 questions. Mike talks over her: "let's not get too deep in the weeds." The system notes Sarah's concerns but Mike is still nominally in charge.

Call 3: Jordan returns with a 15% discount. Mike says price is mostly resolved. But Sarah blocks forward motion: she needs SOC2 Type 2 (not Type 1), a written data residency guarantee, and a data deletion policy. The blocker has shifted from budget to compliance.

Call 4: Finance has signed off on the budget. Mike doesn't show up—it's Jordan and Sarah alone. Sarah makes it explicit: "I'm the one who signs off here. Mike owns the budget, but if security doesn't pass, there's no deal." The real decision-maker was never Mike.

Call 5: Sarah has reviewed the SOC2 Type 2 (passes), budget is locked for the Enterprise tier that includes residency controls. One item remains: the EU data residency guarantee in writing. That's it. One document.

A stateless agent processing Call 5 in isolation would still be pitching ROI to a VP who already has budget approval. The memory-backed system knows that price was resolved two calls ago, that Sarah is the approver, and that one specific document closes the deal.

The Call 1 email was addressed to Mike and spent most of its words on ROI and cost justification. The Call 5 email went directly to Sarah and referenced the EU data residency guarantee by name. The difference isn't sophistication—it's memory.

Cost-Aware Routing

Not every model call deserves the same model. Extracting three bullet points of facts from a transcript is a different task than reasoning about who the real decision-maker is across five calls' worth of context.

I used cascadeflow to handle this automatically. The setup is two models: a cheap qwen3-32b on Groq as the drafter, and gpt-oss-120b as the verifier that only runs when the drafter's output doesn't clear a quality threshold.

drafter = ModelConfig(
    name="qwen/qwen3-32b",
    provider="groq",
    cost=0.0,
    api_key=GROQ_API_KEY,
    max_tokens=2048,
    quality_score=0.7
)

verifier = ModelConfig(
    name="openai/gpt-oss-120b",
    provider="groq",
    cost=0.0,
    api_key=GROQ_API_KEY,
    max_tokens=2048,
    quality_score=0.9
)

agent = CascadeAgent(
    models=[drafter, verifier],
    enable_cascade=True,
    quality={"threshold": 0.72}
)

Every call is logged: which model was used, whether it escalated, why, how long it took. Looking at the decision log from a full run through five calls, the pattern is clear: analyst reasoning escalates to the larger model, while simpler extraction tasks stay on the cheaper one. Cascadeflow's routing decision on each call shows up explicitly—"moderate query suitable for cascade optimization" for simple extraction, "hard query requires best model for quality" for the analyst and writer calls.
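
To make the draft-then-escalate idea concrete, here is a minimal sketch in plain Python. It is not cascadeflow's implementation: the `cascade` function, the `Decision` record, the stub models, and the toy scorer are all hypothetical, and a real quality check is considerably more involved than "is the draft non-empty".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    model: str       # which tier produced the final text
    escalated: bool  # True if the draft failed the quality check
    score: float     # quality score assigned to the draft
    text: str


def cascade(
    prompt: str,
    drafter: Callable[[str], str],
    verifier: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float = 0.72,
) -> Decision:
    """Try the cheap model first; fall back to the large one below the threshold."""
    draft = drafter(prompt)
    quality = score(prompt, draft)
    if quality >= threshold:
        return Decision("drafter", False, quality, draft)
    return Decision("verifier", True, quality, verifier(prompt))


# Stub models and scorer, for illustration only.
def cheap_model(prompt: str) -> str:
    return "" if "decision-maker" in prompt else "- budget approved"


def big_model(prompt: str) -> str:
    return "Sarah signs off; Mike owns budget."


def toy_score(prompt: str, draft: str) -> float:
    return 0.9 if draft.strip() else 0.0


easy = cascade("Extract three facts.", cheap_model, big_model, toy_score)
hard = cascade("Who is the real decision-maker?", cheap_model, big_model, toy_score)
```

Returning a record instead of a bare string is what makes a decision log possible: every call carries which tier answered and whether it escalated.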

The Hard Parts

Three bugs caused more pain than the architecture itself, and one dependency lesson belongs next to them.

The silent empty output. The qwen3-32b model, when doing deep reasoning, writes an extended <think>...</think> block before its actual answer. If max_tokens was set too low—my initial value was 512—the model would exhaust its token budget on internal reasoning and return nothing visible. The fix was raising max_tokens to 2048 across all calls. The symptom was subtle: the model call succeeded with a 200, but the returned content was empty after stripping the think block. I caught it by printing raw output before applying _strip_think.

Speaking of which—that helper is small but essential:

import re

def _strip_think(text: str) -> str:
    """Remove qwen-style <think>...</think> reasoning blocks from model output."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

Without it, the reasoning model's internal deliberation shows up in the rendered output. It's verbose and irrelevant to the user.
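
Because the empty-output failure raises no error by itself, it can help to make it loud. The sketch below is one hypothetical way to do that: `visible_answer` and `EmptyCompletionError` are names invented for this example. It also treats an unclosed `<think>` block as a sign that the token budget ran out mid-reasoning, which is an assumption about how truncation shows up in the raw text.

```python
import re

_THINK_CLOSED = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
_THINK_OPEN = re.compile(r"<think>.*\Z", flags=re.DOTALL)


class EmptyCompletionError(RuntimeError):
    """Raised when a model call succeeds but yields no visible answer."""


def visible_answer(raw: str) -> str:
    """Strip reasoning blocks and fail loudly if nothing is left."""
    text = _THINK_CLOSED.sub("", raw)
    # A <think> tag that survives has no closing tag: likely cut off (assumption).
    truncated = "<think>" in text
    text = _THINK_OPEN.sub("", text).strip()
    if not text:
        reason = "reasoning was cut off" if truncated else "no visible text returned"
        raise EmptyCompletionError(f"{reason}; consider raising max_tokens")
    return text


answer = visible_answer("<think>Mike or Sarah?</think>Sarah is the approver.")
```

Raising here turns a blank email in the UI into an exception with a hint attached, which is much easier to trace back to the token limit.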

The event loop collision. Hindsight's sync client drives aiohttp on its own event loop internally. Calling it from Streamlit's script thread—which runs its own asyncio loop—raises RuntimeError: Timeout context manager should be used inside a task. The error is confusing because it manifests as a timeout rather than a clear concurrency error.

The fix: route every Hindsight call through a dedicated ThreadPoolExecutor with a single worker. That worker thread has no running event loop, so the client creates its own without conflict.

from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="hindsight")

def _call(fn, *args, **kwargs):
    """Run a Hindsight client call in the dedicated worker thread."""
    return _executor.submit(fn, *args, **kwargs).result()

One worker keeps calls serialized. The Hindsight client reuses one aiohttp session safely. Streamlit's event loop never interferes. This pattern is broadly applicable any time you need to call async-backed sync code from a framework that already owns an event loop.

cascadeflow's own event loop. A similar collision affected cascadeflow. Using asyncio.run() for each call worked for the first call but closed the loop, so subsequent calls failed with Event loop is closed. The fix was creating one persistent event loop at module import time and routing all calls through loop.run_until_complete() for the lifetime of the process.

Pin your dependencies. This one is boring but I'll say it anyway. An open-ended requirement like hindsight-client>=0.8 can silently resolve to a release that didn't exist when you wrote it, so a fresh environment may install code you never tested. I pinned everything to exact versions that actually install cleanly: hindsight-client==0.8.3, cascadeflow==0.7.1, crewai==0.86.0. If you're integrating newer libraries with fast release cycles, locking versions early saves the "works on my machine" conversation.
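
Pins in a requirements file only help if the running environment actually matches them. As a hedged sketch, a small startup check can compare installed versions against the pins listed above. `find_drift` is a hypothetical helper; it relies on the standard library's `importlib.metadata.version`, and the lookup is injectable so the logic can be exercised without the packages installed.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact versions taken from the pins mentioned in this post.
PINS = {
    "hindsight-client": "0.8.3",
    "cascadeflow": "0.7.1",
    "crewai": "0.86.0",
}


def find_drift(pins: dict[str, str], lookup=version) -> dict[str, tuple]:
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    drift = {}
    for name, expected in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            drift[name] = (expected, installed)
    return drift
```

Calling `find_drift(PINS)` at startup and logging a non-empty result is one way to catch a drifted environment before it shows up as a confusing runtime error.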

What This Is Good For

The compounding-context pattern applies anywhere you have multi-session interactions with an evolving state of knowledge. Customer support is the obvious analog: a support agent that remembered what the customer had already said, which fixes had already been tried, and what the customer's environment looks like would be substantially more useful than one that asks the same diagnostic questions every call. The same logic applies to research assistants, onboarding flows, and anything where context accumulates faster than a human can reliably track it.

The model routing layer is separable from the memory layer and useful on its own. If you're making many LLM calls with a mix of simple and complex prompts, paying for a large model on every call is unnecessary. Cascadeflow's automatic escalation keeps the easy calls cheap without requiring you to manually classify which is which.

Takeaways

Memory as first-class architecture, not a bolt-on. The session context that makes follow-ups useful has to be explicitly persisted and recalled. Building around that constraint—tagging per customer, recalling before analyzing, saving after writing—shapes the whole design.

The blocker changes. Your system has to notice. Price was the stated blocker in Call 1. By Call 5 it was irrelevant. A system without memory keeps addressing a blocker that no longer exists. One with memory can track when something resolves and shift focus to whatever replaced it.

Async libraries in synchronous frameworks need care. Both Hindsight and cascadeflow hit event loop conflicts in Streamlit. The reusable pattern is to give each library one long-lived execution context of its own: a dedicated single-worker thread for Hindsight, and a persistent event loop for cascadeflow.

The reasoning budget matters. Chain-of-thought models spend tokens thinking before they answer. If your max_tokens ceiling is too low, you'll get empty responses and no error. Size your token limits to accommodate both reasoning and output.

Cheap-first routing is worth the setup. It's not a lot of code, but it changes the economics of running many LLM calls per user interaction. Simple operations run fast and cheap; complex reasoning escalates only when needed.

The code is in Python using CrewAI, Hindsight, cascadeflow, and Streamlit. Models run on Groq. The Hindsight docs are at hindsight.vectorize.io and cascadeflow's at docs.cascadeflow.ai.