


The first time our system recommended a pricing experiment that had already failed, I assumed something was broken.
We had the data. We had the logs. We even had a post-mortem explaining exactly why the experiment didn’t work.
And yet, when a similar idea came in, the AI evaluated it as if it had never seen anything like it before.
That’s when it clicked.
Nothing was broken.
The system just couldn’t remember.
⸻
I’ve been working on ExpTracker.AI, a system designed to help businesses track experiments — mostly around pricing, growth strategies, and product decisions — and use that history to guide future decisions.
The idea was straightforward: teams shouldn’t repeat failed experiments just because the insight got buried somewhere.
In practice, that happens all the time.
A pricing test runs. It fails. Someone writes up a summary in Notion or Slack. Maybe there’s a meeting about it. Then everyone moves on.
Six months later, someone proposes the same idea again. Not intentionally — it just sounds reasonable. The context is gone.
We built ExpTracker to solve that.
At least, that’s what we thought.
⸻
The system itself wasn’t complicated.
We had a way to log experiments — hypotheses, pricing changes, outcomes. We stored everything in a structured format. Then we used an LLM to evaluate new proposals.
The flow looked something like this:
1. A new experiment comes in.
2. We fetch relevant past experiments.
3. We pass everything into the model.
4. The model gives a recommendation.
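In code, that flow is only a few lines. This is a minimal sketch, not the actual ExpTracker internals: `recall_similar` and the prompt format are stand-ins, and the naive word-overlap ranking is exactly the kind of retrieval that (as described below) turned out to be too weak.

```python
def recall_similar(hypothesis, store, top_k=5):
    """Fetch the most relevant past experiments.

    Placeholder ranking: naive word overlap between hypotheses.
    A real system would use semantic retrieval instead.
    """
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    ranked = sorted(store, key=lambda e: overlap(hypothesis, e["hypothesis"]),
                    reverse=True)
    return ranked[:top_k]

def evaluate_proposal(hypothesis, store):
    """Build the prompt that pairs past experiments with the new proposal."""
    context = recall_similar(hypothesis, store)
    prompt = "Past experiments:\n"
    for e in context:
        prompt += f"- {e['hypothesis']} -> {e['outcome']}\n"
    prompt += f"\nNew proposal: {hypothesis}\nRecommend or reject, with reasons."
    return prompt  # in the real system this prompt goes to the LLM
```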
On paper, this should have worked.
In reality, it didn’t.
The model kept missing obvious context. It would recommend ideas that had already failed, even when the data existed in our system.
At first, I thought retrieval was the issue.
It wasn’t exactly that.
The deeper problem was that we were treating logs as memory.
⸻
Our early data looked like this:
```
Experiment: Q1 Pricing Test
Old Price:  49.99
New Price:  59.99
Hypothesis: Users are price insensitive
Outcome:    Failure
```
This is fine for humans reading a document.
It’s terrible for a system trying to reason across past decisions.
The moment a new proposal was phrased slightly differently — “premium users won’t react to a 20% increase” — the connection broke.
Keyword matching failed. Even basic filtering failed.
From the model’s perspective, it was seeing a brand new idea.
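You can see the failure with a toy check. The logged hypothesis and the rephrased proposal describe the same idea, yet share almost no vocabulary, so any lexical match comes back nearly empty:

```python
logged = "Users are price insensitive"
proposal = "premium users won't react to a 20% increase"

logged_terms = set(logged.lower().split())
proposal_terms = set(proposal.lower().split())

shared = logged_terms & proposal_terms
print(shared)  # only the word "users" overlaps
```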
That’s when I stopped thinking in terms of storage and started thinking in terms of memory.
⸻
What we needed was a way for the system to recall past experiments based on meaning, not wording.
I started looking into different approaches and ended up using Hindsight as a memory layer. It’s essentially built for this exact problem — giving AI systems a way to store and retrieve context semantically.
More importantly, it forced us to rethink how we structured data.
Instead of just logging outcomes, we started capturing intent.
Each experiment now includes:
- What was being tested
- Why we thought it would work
- How big the change was
- What actually happened
- Why it succeeded or failed
That last part — the “why” — turned out to matter more than anything else.
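As a record, that shape is simple. A sketch of it, with field names that are illustrative rather than ExpTracker's actual schema; the outcome fields stay empty until the experiment finishes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Experiment:
    what: str                     # what was being tested
    why: str                      # why we thought it would work
    magnitude: str                # how big the change was
    outcome: Optional[str] = None # what actually happened
    reason: Optional[str] = None  # why it succeeded or failed
```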
⸻
Here’s roughly how we store an experiment now:
We don’t just say “price increased and churn went up.”
We store something closer to:
“Starter tier price increased by 30% based on the assumption that users were not sensitive to price changes. Within three weeks, churn increased by 18%, indicating high sensitivity in this segment.”
That difference is subtle, but it completely changes how retrieval works.
Now when a new proposal comes in, the system can match on intent.
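One way to get there is to flatten the structured record into a single narrative sentence before storing it, so the "why" travels with the "what." A hedged sketch, assuming a plain-dict record; the real pipeline may compose this text differently:

```python
def to_memory_text(exp):
    """Flatten a structured experiment into one retrieval-friendly narrative."""
    parts = [f"{exp['what']} based on the assumption that {exp['why']}."]
    if exp.get("outcome"):
        parts.append(f"{exp['outcome']}, {exp['reason']}.")
    return " ".join(parts)
```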
⸻
The most important design decision we made came out of this shift:
The system must look backward before it looks forward.
Every time a new experiment is proposed, we run a recall step first.
We take the hypothesis, turn it into a semantic query, and retrieve a small set of similar past experiments — usually the top 3 to 5.
Those results get passed into the model along with the new proposal.
That’s it.
No fancy orchestration. No complex pipelines.
Just: recall first, then reason.
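The recall step itself is just nearest-neighbor search over embedded experiment text. A minimal sketch using cosine similarity; it assumes some embedding function has already turned each stored experiment and the new hypothesis into vectors (in practice that is handled by an embedding model or a memory layer like Hindsight, not shown here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query_vec, memory, top_k=5):
    """memory: list of (vector, experiment) pairs.

    Returns the top_k experiments most similar to the query vector.
    """
    scored = sorted(memory, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [exp for _, exp in scored[:top_k]]
```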
⸻
The difference in behavior was immediate.
Before, the model would respond with something like:
“This pricing change seems reasonable given typical user behavior.”
After adding memory, the response changed to something more like:
“Similar past experiments involving large price increases resulted in significant churn within a short period. A 20% increase may carry similar risks. Consider testing a smaller increment.”
Same model.
Completely different answer.
The only difference was context.
⸻
One thing that surprised me was how little memory we actually needed.
Initially, I assumed more data would lead to better decisions.
It didn’t.
When we passed too many past experiments into the model, the output got worse. It became vague and less decisive.
The sweet spot turned out to be a small set of highly relevant examples.
Usually five or fewer.
Anything beyond that just added noise.
⸻
Another mistake we made early on was only storing experiments after they were completed.
That seemed logical at first — why store incomplete data?
But it turns out even proposed experiments are valuable.
If two teams independently propose the same idea, that’s already useful context.
So we started storing experiments as soon as they’re defined, and then updating them later with outcomes.
That created a continuous loop:
1. Store the idea
2. Recall it during future decisions
3. Update it with results
4. Use it again
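That two-phase lifecycle (store at proposal time, update at completion) can be sketched as a tiny store. The class and method names here are hypothetical, and a real system would persist this rather than keep it in memory:

```python
class ExperimentStore:
    """Toy in-memory store illustrating the propose-then-update lifecycle."""

    def __init__(self):
        self._items = {}

    def propose(self, exp_id, what, why):
        # Stored as soon as the experiment is defined, before any results exist.
        self._items[exp_id] = {"what": what, "why": why,
                               "outcome": None, "reason": None}

    def record_outcome(self, exp_id, outcome, reason):
        # Updated later with results, so future recalls see the full story.
        self._items[exp_id].update(outcome=outcome, reason=reason)

    def all(self):
        return list(self._items.values())
```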
Over time, the system starts to build something that actually resembles learning.
⸻
There’s also an interesting side effect.
As the memory grows, the system becomes more opinionated.
Not because the model changes, but because the context does.
It starts to push back on risky ideas. It highlights patterns. It reinforces what works and flags what doesn’t.
It feels less like a generic assistant and more like someone who has been in every experiment review meeting for the past year.
Which, in a way, it has.
⸻
One of the more subtle lessons from building this was that most teams don’t have a thinking problem.
They have a recall problem.
The insights exist. The data exists. The conclusions are often correct.
But they’re not available at the moment decisions are made.
And that’s what really matters.
Timing, not storage.
⸻
Another takeaway is that memory quality directly affects decision quality.
If your stored data is vague, your retrieval will be weak.
If your retrieval is weak, your model will fall back to generic reasoning.
And that’s exactly how you end up repeating mistakes.
Adding more data doesn’t fix that.
Adding better-structured data does.
⸻
The last thing that became clear is that AI systems don’t improve just because models get better.
You can swap in a stronger model, increase context size, tweak prompts — none of that solves the core issue if the system still doesn’t remember.
What actually improves performance over time is accumulated, accessible experience.
In other words, memory.
⸻
Looking back, the biggest shift wasn’t technical.
It was conceptual.
We stopped treating past experiments as something to archive and started treating them as something to actively use.
That changed how we designed the system.
It also changed how it behaved.
⸻
AI doesn’t fail because it can’t think.
It fails because it can’t remember.
And once you fix that, everything else starts to make a lot more sense.
GitHub repo: https://github.com/sameeralala/Exptracker.AI