How I Stopped Losing Sales Context and Built an Agent That Actually Remembers

#ai #llm #agentmemory #hindsight

Every LLM I've shipped has the same embarrassing flaw: ask it about something from last week's call, and it stares at you blankly. That's not a model problem — it's an architecture problem. And it cost me more lost deals than I care to admit before I fixed it.

The Deal Intelligence Agent started as a frustration project. I was watching sales reps jump between Salesforce tabs, sticky notes, and half-remembered Zoom calls, trying to reconstruct context before every customer call. The obvious move was to build an AI layer on top. What I didn't anticipate was how fast "obvious" would become "broken" the moment a deal stretched past a single conversation.

What the system does

The agent gives sales teams a persistent AI memory layer across their entire pipeline. You can ask it natural-language questions about any deal — "What objections has this prospect raised?" or "Who is the real decision-maker at this account?" — and get answers grounded in actual history, not hallucinations.

The stack is straightforward: FastAPI backend, React frontend, Groq running Llama 3.3 70B for completions. The piece that makes everything else work is Hindsight by Vectorize, which handles persistent vectorized memory scoped per deal.

When a rep logs an objection, mentions a competitor, or completes a call, those events are typed, timestamped, and written to Hindsight. When the agent gets a question, it runs a semantic search against that deal's memory store before the LLM ever sees the query. The model isn't smart — it just has better inputs.

Why the naive approach broke immediately

My first implementation was embarrassingly simple: append the last 20 messages to the system prompt. It worked for demos. It died in production.

The failure mode wasn't what I expected. The agent didn't crash. It confidently gave wrong answers. A rep would ask about a pricing objection raised three weeks ago, and the agent would either miss it entirely (truncated out of the window) or confuse it with a different deal's pricing discussion (context contamination from earlier in the thread).

The deeper problem: context windows are sequential, not semantic. Stuffing raw chat history into a prompt doesn't give the model a memory — it gives it a wall of text it has to reason through every time. That's expensive and unreliable for anything non-trivial.

Standard RAG wasn't the answer either. Chunking call transcripts and doing similarity search worked great for "find me information about X." It failed completely for "what is the current status of Y" because it treated all chunks as equally fresh. A competitor mention from six months ago ranked just as high as one from yesterday.

The Retain-Recall architecture

The approach I landed on is what I now call the Retain-Recall loop. Every meaningful event in a deal's lifecycle gets written to Hindsight's persistent memory layer as a typed entry — not as a raw blob, but as a structured fact with an explicit type (objection, competitor, stakeholder, pricing, outcome) and a prefixed embedding text designed for retrieval.

async def store_memory(
    self,
    deal_id: str,
    entry_type: str,
    content: str,
    metadata: Optional[Dict] = None
) -> Dict:
    entry = {
        "id": self._generate_id(deal_id, content),
        "deal_id": deal_id,
        "type": entry_type,
        "content": content,
        "embedding_text": f"[{entry_type.upper()}] Deal {deal_id}: {content}"
    }
    if self.use_hindsight:
        result = await asyncio.to_thread(
            self.client.memory.store,
            user_id=deal_id,
            text=entry["embedding_text"],
            metadata={"deal_id": deal_id, "type": entry_type, "content": content}
        )

The embedding_text prefix is load-bearing. [OBJECTION] Deal abc123: Price is 40% above current vendor embeds differently from the raw sentence. The type prefix clusters related memories together in vector space, so when the agent asks "what objections came up?", Hindsight's search weights OBJECTION-typed entries appropriately.

On retrieval, the top 10 semantically relevant memories get formatted into a numbered block and prepended to the user query:

def _format_memories(self, memories: List[Dict]) -> str:
    lines = []
    for i, mem in enumerate(memories, 1):
        mem_type = mem.get("type", mem.get("metadata", {}).get("type", "note"))
        content = mem.get("content", mem.get("text", ""))
        timestamp = mem.get("timestamp", "")[:10] if mem.get("timestamp") else ""
        ts_str = f" [{timestamp}]" if timestamp else ""
        lines.append(f"{i}. [{mem_type.upper()}]{ts_str} {content}")
    return "\n".join(lines)

The LLM receives [MEMORY CONTEXT]\n{formatted_memories}\n\n[USER QUERY]\n{question}. Separating context from query in the prompt structure matters — it tells the model "this is ground truth, don't speculate past it."

What the before/after actually looks like

Without memory, asking "what objections did this prospect raise?" returns generic advice. The agent has never heard of this prospect.

With Hindsight active, the same question against a deal with six months of history returns:

"Based on interaction history with Meridian Systems, their CTO David Kim flagged API documentation as a blocker on October 28th. On November 3rd, pricing came up — they're asking for 25% off versus the 15% offered. That pricing pattern matches deals where we've successfully closed with a 24-month commitment bundling the discount."

That last sentence isn't in any single memory entry. It's the agent connecting patterns across stored facts. That only happens because the retrieval gave it the right raw material.

What I learned the hard way

Structure memories at write time, not read time. If you dump raw transcripts into Hindsight and search later, you get noisy results. Parsing events into typed categories (objection, competitor, etc.) before storing dramatically improves retrieval precision. The work at ingestion pays back every query.

asyncio.to_thread is not optional. Hindsight's SDK is synchronous. Calling it directly on the FastAPI event loop blocks every concurrent request. This wasn't obvious until load testing showed request queuing under normal usage. Threading the calls is five lines of code and eliminates the problem entirely.

Graceful fallback is a feature, not a crutch. The in-process dict fallback means the entire system runs without an API key during development. It also means Hindsight failures — transient outages, rate limits — degrade gracefully instead of taking the agent down. Every external dependency should have a local fallback.

The scoping key is a design decision. Using deal_id as Hindsight's user_id means memories are perfectly isolated per deal. Cross-deal queries require explicit iteration. That's the right tradeoff for this use case, but it's not automatic — think through what your scoping key should be before you start writing memories.

The agent memory model that Hindsight implements isn't complicated, but it changes what the agent can do in practice. The LLM is still stateless. The system around it isn't. That distinction is everything.

GitHub: github.com/chaitanya07-ai/deal-intelligence-agent | Live: deal-intelligence-agent-1.onrender.com