The problem I couldn’t ignore
Most "AI agents" I built had one thing in common: they didn’t improve.
They could generate messages. Sometimes decent, sometimes awkward. But every request was stateless. No memory. No learning. Just a fresh guess every time.
At some point, it stopped being interesting.
I didn’t need better prompts. I needed a way to give my agent memory.
What I built instead
I built an outbound prospecting system that improves its messaging over time based on actual outcomes.
The loop is simple:
Input a prospect
Retrieve similar past cases
Recall learned patterns
Generate a message
Track outcome
Store it
Learn from it
That loop is the product.
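The steps above can be sketched as a single pipeline. The helper names here are illustrative, not the actual API of any of the tools involved:

```python
def run_outreach_cycle(prospect, retriever, memory, generator, tracker):
    """One pass through the learning loop for a single prospect."""
    similar_cases = retriever.find_similar(prospect)       # vector search
    patterns = memory.recall(f"What works for {prospect['persona']}s?")
    message = generator.generate(prospect, similar_cases, patterns)
    outcome = tracker.send_and_track(prospect, message)    # reply / meeting / ignore
    memory.retain(prospect, message, outcome)              # close the loop
    return outcome
```

Each run feeds the next: what `retain` stores now is what `recall` and `find_similar` surface later.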
The stack is fairly standard:
FastAPI backend
PostgreSQL for structured data
Chroma for vector similarity
OpenAI for generation and embeddings
Hindsight for long-term memory
The interesting part is how these pieces interact.
Why prompt engineering wasn’t enough
I started the usual way: bigger prompts, more instructions, more "context".
```python
prompt = f"""
Write a personalized cold email for a {persona}
working at a {company_type} company.
Focus on ROI and keep it concise.
"""
```
It worked, until it didn’t.
The system couldn’t answer basic questions:
Do CTOs respond better to ROI or technical depth?
Do founders care more about growth or vision?
What actually leads to meetings?
It had no memory of outcomes. No feedback loop.
So I stopped treating generation as the core problem. The real problem was learning.
Adding memory with Hindsight
I came across Hindsight and decided to use it as the memory layer.
Instead of storing raw logs, I started treating every outreach event as a learning unit.
Here’s a simplified retain call:
```python
response = requests.post(
    f"{self.base_url}/api/memories/retain",
    json=memory_data,
    headers=self.headers,
    timeout=10
)
```
Each memory looks something like this:
```json
{
  "persona": "CTO",
  "industry": "SaaS",
  "message_angle": "ROI",
  "outcome": "meeting"
}
```
This isn’t just logging. It’s structured experience.
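One way to keep these events well-formed before they reach the memory layer is a small record type. This is my own sketch, not part of Hindsight:

```python
from dataclasses import dataclass, asdict

@dataclass
class OutreachMemory:
    persona: str
    industry: str
    message_angle: str
    outcome: str  # "reply", "meeting", or "ignore"

    def to_payload(self) -> dict:
        """Shape the record for a retain call."""
        return asdict(self)

memory_data = OutreachMemory("CTO", "SaaS", "ROI", "meeting").to_payload()
```

A typed record catches malformed events at write time, which matters because bad data poisons everything the system later learns.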
Using similarity before generation
Before generating a new message, I retrieve context in two ways.
Vector similarity (Chroma)
```python
results = chroma_collection.query(
    query_texts=[prospect_description],
    n_results=5
)
```
This gives me:
similar prospects
past messages
successful patterns
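Under the hood, "similar prospects" just means nearest embeddings. A stripped-down version of the ranking Chroma performs, using toy vectors instead of real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """corpus: list of (doc_id, vector). Returns the k closest ids."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Real embeddings have hundreds of dimensions and Chroma indexes them for speed, but the retrieval idea is exactly this.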
Memory recall (Hindsight)
```python
client.recall(
    bank_id="outbound",
    query="What messaging works for SaaS CTOs?"
)
```
This returns learned patterns and past outcomes.
Now the system has context that actually matters.
Generation becomes simpler
Once I have:
similar examples (vector search)
learned patterns (memory)
message generation is no longer guesswork.
```python
def generate_message(context):
    prompt = build_prompt(
        prospect=context.prospect,
        examples=context.similar_messages,
        insights=context.memory_patterns
    )
    return openai_client.generate(prompt)
```
The model is guided instead of improvising blindly.
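The `build_prompt` step is just careful string assembly. A hedged sketch of what mine roughly does (the exact wording and structure are simplified here):

```python
def build_prompt(prospect, examples, insights):
    """Combine the prospect, retrieved examples, and learned patterns into one prompt."""
    example_block = "\n---\n".join(examples)
    insight_block = "\n".join(f"- {i}" for i in insights)
    return (
        f"Write a short cold email to a {prospect['persona']} "
        f"at a {prospect['industry']} company.\n\n"
        f"Messages that worked for similar prospects:\n{example_block}\n\n"
        f"Learned patterns to apply:\n{insight_block}\n"
    )
```

The point is that the model never starts from a blank page: retrieved examples and learned insights constrain the output before generation begins.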
Reflection changed everything
The most useful feature wasn’t recall. It was reflection.
Instead of manually analyzing results, I let the system reflect on past data:
```python
client.reflect(
    bank_id="outbound",
    query="What patterns lead to successful outreach?"
)
```
This produces insights like:
CTOs respond better to ROI-focused messaging
Founders engage more with growth narratives
Technical detail increases replies but not meetings
These insights feed back into generation.
That’s where the system starts to feel different.
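The spirit of reflection is aggregation over outcomes. A toy version, assuming the structured memory records shown earlier (the real thing reasons over free-form memories, not just counts):

```python
from collections import defaultdict

def reflect(memories):
    """Compute meeting rates per (persona, message_angle) from stored outcomes."""
    counts = defaultdict(lambda: [0, 0])  # key -> [meetings, total]
    for m in memories:
        key = (m["persona"], m["message_angle"])
        counts[key][1] += 1
        if m["outcome"] == "meeting":
            counts[key][0] += 1
    return {key: meetings / total for key, (meetings, total) in counts.items()}
```

Even this crude version surfaces the kind of insight listed above: if ROI messages to CTOs convert to meetings and technical ones don't, the rates say so directly.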
What this looks like in practice
A typical flow looks like this:
Add a prospect (e.g., CTO at a SaaS company)
Retrieve similar cases
Recall memory insights
Generate a message
Track outcome (reply, meeting, ignore)
Store memory
Example message:
```text
Hi Rahul,

Noticed you're scaling a SaaS platform.
We've helped similar teams improve outbound ROI significantly...
```
If it leads to a meeting, that pattern gets reinforced.
Over time, messaging shifts toward what works.
Why the frontend matters
I underestimated this initially.
If users can’t see learning, they assume it isn’t happening.
So I added:
reply rate trends
persona performance
generated insights
For example:
"CTOs respond better to ROI messaging than feature-heavy emails."
Now the system shows its reasoning, not just outputs.
Lessons learned
- Memory beats better prompts
Prompt engineering helps, but it doesn’t create continuity.
Without memory, the system repeats mistakes.
- Outcome tracking is everything
Bad data ruins learning.
A reply isn’t the same as a meeting.
You need meaningful signals.
- Similarity and memory work together
Vector search gives examples. Memory gives patterns.
Together, they give context.
- Reflection is underrated
Recall gives you data. Reflection gives you insight.
That’s what actually improves behavior.
- Most agents don’t learn
They generate outputs, but they don’t adapt.
The difference is a feedback loop.
Final thoughts
I didn’t set out to build something complex.
I just wanted a system that didn’t forget everything after each request.
Adding memory turned a stateless generator into something that improves over time.
Not because it’s impressive.
Because it accumulates experience.
And that’s the only way it actually gets better.