Drakshavalli Velukuri

Posted on Jun 28

Why I Stopped Using Memoryless Agents for B2B Sales Proposals

#webdev #python #ai #database

Building a software agent that can perform generic text extraction is easy. Building one that can navigate a months-long enterprise B2B sales cycle—where a prospect raises a technical concern in call one, security demands in call three, and expects those baseline agreements reflected in the contract proposal by call five—is where most standard implementations fall apart.

If you build AI agents the traditional way (stateless, relying on single-session history), they will suffer from amnesia. They forget the database constraints, security audits, and competitor pricing sheets discussed weeks ago. In this article, I explain why I moved to a stateful architecture built on persistent memory and cost-controlled routing, and walk through how we built a production-ready Sales Deal Intelligence Agent that remembers objections across staggered calls and keeps per-session LLM spend within a set budget.

The Amnesia Problem in B2B Sales

In a typical enterprise software deal, information is distributed across separate calls and interactions. For example, during a discovery process with a prospective client, the following objections might be raised:

CTO (Week 1): Expresses concerns about PostgreSQL database migrations and strict requirements against proprietary database locks.
SecOps (Week 2): Demands HIPAA compliance logs and a completed SOC2 Type II audit report.
VP of Sales (Week 3): Highlights a strict 60-day deadline for custom production rollouts.
CFO (Week 4): Compares initial pricing baseline tables against a competitor like Snowflake/HealthData-Sync.

When the procurement team finally says, "Please generate the comprehensive business contract proposal," a traditional stateless agent is forced to guess. It has no memory of the database constraints or the 60-day rollout deadline, so it generates a generic proposal. In a real enterprise deal, that can be a multi-million-dollar mistake.

To solve this, we built an agentic system that utilizes:
[Hindsight](https://github.com/vectorize-io/hindsight-skills): An agent memory system by Vectorize that allows agents to retain, recall, and reflect on observations across separate execution runs.
cascadeflow: An in-process runtime intelligence layer that optimizes model routing and enforces budget policy constraints on LLM calls.
Pydantic AI: A Python framework for type-safe, structured agent execution.

System Architecture: Memory & Cost Guardrails

The agent’s operational pipeline consists of two main pillars:

The Persistent Memory Loop: Each incoming call transcript is analyzed by the agent. Key client objections are extracted, parsed, and logged as individual vector nodes inside Hindsight.
The Cost-Controlled Escalation Gate: Simple queries (such as scanning transcripts for objections) are routed to fast, cheap standard models (e.g., qwen-32b via Groq). Complex queries (like drafting the final custom proposal) are automatically escalated by cascadeflow to premium models (e.g., gpt-oss-120b).

      +-----------------------------+
      |  Prospect Call Transcript   |
      +--------------+--------------+
                     |
                     v
      +--------------+--------------+
      |      Objection Scanner      |
      +-------+--------------+------+
              |              |
              | (Objections) | (Check Complexity)
              v              v
      +-------+------+ +-----+---------------+
      |  Hindsight   | |  cascadeflow Gate   |
      |  Retain DB   | |  (Budget Context)   |
      +-------+------+ +-----+-------+-------+
              |              |       |
              |              |       | (Standard: Cost $0.0012)
              |              |       v
              |              |  +----+--------------+
              |              |  |  groq/qwen-32b    |
              |              |  +-------------------+
              |              |
              |              | (Premium: Cost $0.0450)
              |              v
              |        +-----+--------------+
              |        | openai/gpt-oss-120b|
              |        +-----+--------------+
              |              |
              +------>  <----+
                        | (Inject Context)
                        v
         +--------------+---------------+
         |  Custom B2B Sales Proposal   |
         +------------------------------+
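
The escalation gate in the diagram can be sketched in plain Python. This is a minimal illustration, not cascadeflow's API: `ModelTier` and `choose_tier` are hypothetical names, and the per-call costs are simply the figures shown in the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """Hypothetical description of one routing target."""
    name: str
    est_cost_usd: float


# Costs copied from the diagram; treat them as illustrative estimates.
STANDARD = ModelTier("groq/qwen-32b", 0.0012)
PREMIUM = ModelTier("openai/gpt-oss-120b", 0.0450)


def choose_tier(requires_escalation: bool, budget_usd: float) -> ModelTier:
    """Pick the premium tier only when the task asks for it and the budget allows it."""
    if requires_escalation and PREMIUM.est_cost_usd <= budget_usd:
        return PREMIUM
    if STANDARD.est_cost_usd <= budget_usd:
        return STANDARD
    raise ValueError(f"budget ${budget_usd:.4f} is below the cheapest tier")


extraction = choose_tier(requires_escalation=False, budget_usd=0.02)
proposal = choose_tier(requires_escalation=True, budget_usd=0.08)
capped = choose_tier(requires_escalation=True, budget_usd=0.02)
print(extraction.name, proposal.name, capped.name)
```

The third call shows the budget acting as a hard cap: an escalation request under a $0.02 budget still lands on the standard tier because the premium estimate does not fit.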

Coding the Agentic Loop

Let's look at the core technical implementation. First, we define a structured response schema with Pydantic, which ensures the agent always returns validated, predictable fields.

from pydantic import BaseModel, Field

class DiscoveryResponse(BaseModel):
    objections_found: list[str] = Field(description="List of detected prospect objections")
    response_draft: str = Field(description="Context-aware reply addressing client's specific objection history")
    requires_escalation: bool = Field(description="Set to true if multiple criteria trigger high-model routing")

Persistent Context Injection (Hindsight)

Instead of passing the entire historical chat transcript on every API call (which quickly blows past token limits and adds massive context noise), we use Hindsight's recall capability inside Pydantic AI's dependency injection container. This ensures that the agent's system prompt is dynamically populated with only the relevant objection history.

from pydantic_ai import Agent, RunContext
from hindsight_client import Hindsight

# Initialize persistent memory client
hindsight_client = Hindsight(base_url="http://localhost:8888")

discovery_agent = Agent(
    'groq:llama-3.1-70b-versatile',
    deps_type=str,  # deal_id acts as the unique memory bank key
    # Newer Pydantic AI releases rename this parameter to output_type
    # (and result.data to result.output).
    result_type=DiscoveryResponse,
    system_prompt="You are a principal enterprise deal strategist. Audit prospect discovery objections."
)

@discovery_agent.system_prompt
def add_deal_history(ctx: RunContext[str]) -> str:
    # ctx.deps carries the deal_id passed to discovery_agent.run(..., deps=deal_id)
    deal_id = ctx.deps

    # Query Hindsight memory bank for objections relevant to the deal
    past_memories = hindsight_client.recall(
        bank_id=deal_id,
        query="Recall all past deal objections and technical barriers."
    )

    # Format retrieved memories for the LLM context window.
    # This assumes recall() returns a list of dicts with a "content" key;
    # adjust the access to match the response shape of your client version.
    if past_memories:
        memory_context = "\n".join(f"- {m['content']}" for m in past_memories)
    else:
        memory_context = "No prior objections recorded."

    return f"Here is the persistent deal history from Hindsight:\n{memory_context}"

Using this decorator hook, every time discovery_agent.run() is called, Hindsight retrieves historical objections and injects them seamlessly, keeping the context window tight and highly targeted.
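
To make the retain/recall flow concrete without a running Hindsight server, here is a minimal sketch built on an in-memory stand-in. `InMemoryBank` and `build_history_prompt` are hypothetical names, and the keyword-overlap ranking is a crude placeholder for whatever retrieval Hindsight actually performs.

```python
from collections import defaultdict


class InMemoryBank:
    """Hypothetical stand-in for a memory service: one list of notes per bank_id."""

    def __init__(self) -> None:
        self._banks: dict[str, list[str]] = defaultdict(list)

    def retain(self, bank_id: str, content: str) -> None:
        self._banks[bank_id].append(content)

    def recall(self, bank_id: str, query: str, limit: int = 5) -> list[str]:
        # Rank notes by how many query words they share (placeholder scoring).
        words = set(query.lower().split())
        notes = self._banks.get(bank_id, [])
        ranked = sorted(
            notes,
            key=lambda note: len(words & set(note.lower().split())),
            reverse=True,
        )
        return ranked[:limit]


def build_history_prompt(bank: InMemoryBank, deal_id: str, query: str) -> str:
    """Format only the recalled notes for injection into a system prompt."""
    memories = bank.recall(deal_id, query)
    if not memories:
        return "No prior objections recorded."
    return "Deal history:\n" + "\n".join(f"- {m}" for m in memories)


bank = InMemoryBank()
bank.retain("deal-001", "CTO objects to proprietary database locks")
bank.retain("deal-001", "SecOps requires a SOC2 Type II audit report")
print(build_history_prompt(bank, "deal-001", "database objections"))
print(build_history_prompt(bank, "deal-999", "database objections"))
```

Keying every call on the deal ID is what keeps one prospect's objections from leaking into another deal's prompt, and the `limit` argument is what bounds the injected context.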

Runtime Budget Optimization (cascadeflow)

To manage token overhead and model pricing, we wrap each agent invocation in a cascadeflow runtime context. Simple extraction sessions run on a tight $0.02 budget, and the later sessions get $0.08. If task complexity rises (e.g., drafting the proposal), the agent sets requires_escalation=True and cascadeflow routes the task to a larger, premium model:

import asyncio
import cascadeflow
from deal_agent import discovery_agent, hindsight_client

async def run_session(session_num, transcript, deal_id):
    # Set a strict budget based on session complexity:
    # sessions 1-3 are extraction-only, sessions 4-5 get the larger budget
    budget = 0.02 if session_num < 4 else 0.08

    # cascadeflow handles the model routing and budget enforcement
    with cascadeflow.run(budget=budget) as tracker:
        result = await discovery_agent.run(transcript, deps=deal_id)

        # Signal to tracker if LLM detects a complex negotiation state
        if hasattr(tracker, "set_escalation"):
            tracker.set_escalation(result.data.requires_escalation)

        # Store newly detected objections back to Hindsight
        for objection in result.data.objections_found:
            hindsight_client.retain(
                bank_id=deal_id, 
                content=f"Objection raised in session {session_num}: {objection}"
            )

        print(f"Decision Summary: {tracker.summary()}")
        return result.data
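
The article does not show what the budget context does internally, so here is a hedged sketch of the general idea in plain Python. `BudgetTracker`, `BudgetExceeded`, `charge`, and `budget_run` are hypothetical names for illustration, not cascadeflow's API.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the session budget."""


@dataclass
class BudgetTracker:
    budget_usd: float
    spent_usd: float = 0.0
    calls: list[tuple[str, float]] = field(default_factory=list)

    def charge(self, model: str, cost_usd: float) -> None:
        # Reject the call before spending, so the cap is never crossed.
        if self.spent_usd + cost_usd > self.budget_usd:
            raise BudgetExceeded(
                f"{model} call of ${cost_usd:.4f} exceeds remaining budget"
            )
        self.spent_usd += cost_usd
        self.calls.append((model, cost_usd))

    def summary(self) -> str:
        return (
            f"{len(self.calls)} call(s), "
            f"${self.spent_usd:.4f} of ${self.budget_usd:.4f} used"
        )


@contextmanager
def budget_run(budget: float):
    """Scope one tracker to one session."""
    yield BudgetTracker(budget_usd=budget)


with budget_run(budget=0.02) as tracker:
    tracker.charge("groq/qwen-32b", 0.0012)
    try:
        tracker.charge("openai/gpt-oss-120b", 0.0450)
    except BudgetExceeded as exc:
        print(f"blocked: {exc}")
    print(tracker.summary())
```

Checking the charge before recording it means an over-budget premium call fails loudly instead of silently overspending, which is the behavior a per-session cap is meant to guarantee.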

Evaluation: The Side-by-Side Proof

To verify the impact of persistent memory, we simulated a sales discovery cycle of five sessions. At the final proposal generation stage, we compared an agent without memory access against an agent with Hindsight memory access.

Here are the actual logged results:

Case A: Proposal Generation WITHOUT Memory (Amnesia State)

When we generated the proposal using a fresh, unseen deal ID where Hindsight had no records, the agent was forced to fall back to a generic pitch:

[BEFORE] Generic LLM Response (Zero-Memory / Generic context) 
------------------------------------------------------------
GENERIC ENTERPRISE PROPOSAL - NEXUS HEALTH SYSTEMS
ARR Deal Valuation: $245,000

We propose our standard Enterprise Cloud Subscription at $245,000 ARR.
NOTE: This proposal is generic. Custom objections (Postgres database lock-in, 
HIPAA security audits, 60-day rollout targets, and competitor pricing) 
were not resolved as there is no prior historical memory retrieved.
------------------------------------------------------------

Case B: Proposal Generation WITH Memory (Hindsight Enabled)

With Hindsight enabled, the agent recalled the four objections raised across the earlier staggered calls and assembled a tailored proposal draft that addresses each one:

[AFTER] Hindsight Recall Response (5 Staggered Sessions Memory) 
------------------------------------------------------------
ENTERPRISE CONTRACT PROPOSAL - NEXUS HEALTH SYSTEMS (PERSONALIZED)
ARR Deal Valuation: $245,000

1. Architecture: Deployment will be hosted on native PostgreSQL schema instances. 
   All services run database-agnostic interfaces to prevent any database lock-in.
2. Security & Compliance: Full SOC2 Type II certifications and HIPAA compliant logs 
   are supported. Security audit trail reports are auto-generated.
3. Project Rollout Plan: Delivery team is assigned to complete installation in 45 days (limit: 60 days).
4. Cost Comparison: Standard platform features outperform Snowflake/HealthData-Sync with 
   integrated ML analytics, saving $35k in operational overhead.
------------------------------------------------------------

cascadeflow's routing also tracked task complexity across the cycle:
Sessions 1–3 (Objection Extraction): Budget constraint: $0.0200 | Actual cost: $0.0012 | Model: groq/qwen-32b (standard) | Status: OPTIMIZED.
Sessions 4–5 (Competitive & Proposal Generation): Budget constraint: $0.0800 | Actual cost: $0.0450 | Model: openai/gpt-oss-120b (premium) | Status: ESCALATED.
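
To put the two logged costs in perspective, here is a back-of-the-envelope comparison. It rests on two assumptions the logs do not confirm: that each logged cost applies per session, and that an all-premium pipeline would pay the premium cost on every session.

```python
STANDARD_COST = 0.0012  # logged cost for the standard model (assumed per session)
PREMIUM_COST = 0.0450   # logged cost for the premium model (assumed per session)

# Routing from the logs: sessions 1-3 standard, sessions 4-5 premium.
sessions = {1: "standard", 2: "standard", 3: "standard", 4: "premium", 5: "premium"}
price = {"standard": STANDARD_COST, "premium": PREMIUM_COST}

routed_total = sum(price[tier] for tier in sessions.values())
all_premium_total = PREMIUM_COST * len(sessions)
savings_pct = 100 * (1 - routed_total / all_premium_total)

print(f"routed: ${routed_total:.4f}")
print(f"all-premium: ${all_premium_total:.4f}")
print(f"savings: {savings_pct:.1f}%")
```

Under those assumptions the routed cycle costs about $0.094 against $0.225 for an all-premium run, roughly 58% less. The absolute numbers are tiny for one deal, so the ratio is the figure that matters at volume.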

3 Core Engineering Lessons Learned

RAG is Not Memory: Traditional vector RAG is excellent for looking up static documents, but it lacks temporal alignment. An agent needs a memory layer, such as Vectorize's Hindsight, that records interaction timestamps and state changes so it can reconstruct how a deal evolved.
Standardize Mocks for Portability: When agent code has to run offline, or in pipelines where cloud service keys might fail, put fallback clients behind a standard interface. That keeps the core state machine testable without network dependencies.
Escalate, Don't Default: Defaulting your entire pipeline to expensive premium LLMs leads to avoidable cost overruns. Building cost gates via cascadeflow means only complex queries reach premium models, while routine extraction queries stay on the cheaper standard tier.
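
The second lesson can be sketched as a small interface plus a fallback. This is one possible shape under stated assumptions: `MemoryClient`, `OfflineMemory`, and `make_memory_client` are hypothetical names, and the interface assumes recall returns plain strings.

```python
from typing import Callable, Protocol


class MemoryClient(Protocol):
    """Minimal interface the agent code depends on."""

    def retain(self, bank_id: str, content: str) -> None: ...
    def recall(self, bank_id: str, query: str) -> list[str]: ...


class OfflineMemory:
    """Fallback that keeps notes in process memory and returns them all on recall."""

    def __init__(self) -> None:
        self._notes: dict[str, list[str]] = {}

    def retain(self, bank_id: str, content: str) -> None:
        self._notes.setdefault(bank_id, []).append(content)

    def recall(self, bank_id: str, query: str) -> list[str]:
        return list(self._notes.get(bank_id, []))


def make_memory_client(factory: Callable[[], MemoryClient]) -> MemoryClient:
    """Try the real client first; fall back to the offline one if setup fails."""
    try:
        return factory()
    except Exception:
        return OfflineMemory()


def broken_factory() -> MemoryClient:
    # Simulates a memory service that cannot be reached.
    raise ConnectionError("memory service unreachable")


client = make_memory_client(broken_factory)
client.retain("deal-001", "VP of Sales needs rollout within 60 days")
print(type(client).__name__, client.recall("deal-001", "deadline"))
```

Because the agent only sees the `MemoryClient` interface, tests can exercise the retain and recall steps of the loop with no network access, and production code swaps in the real client through the same factory.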
