Foundation kept everything. After 300 conversations, search returned noise. That was the problem I was actually trying to solve.
The relevance scoring angle is interesting — deciding what's worth remembering is half the problem with AI memory systems. I built a semantic memory layer for my local AI that does something similar, storing exchanges as embeddings and retrieving by cosine similarity. The curation problem is real.
Cosine similarity tells you what's related. It doesn't tell you what's worth keeping. That distinction is what the scoring signals are trying to formalize — usage, validation, specificity as proxies for "did this actually matter."
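To make that distinction concrete, here's a minimal sketch (illustrative only, not either of our actual systems): two memories can be equally similar to a query while differing completely in whether they were ever validated, and cosine similarity alone cannot see that.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
mem_a = [0.8, 0.2, 0.1]  # validated in practice, used often
mem_b = [0.8, 0.2, 0.1]  # wrong, never confirmed

# Identical vectors, identical similarity -- relatedness says
# nothing about which one is worth keeping:
assert cosine_similarity(query, mem_a) == cosine_similarity(query, mem_b)
```

That gap is exactly what usage, validation, and specificity signals have to fill.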
Curious how you handle the promotion decision. Do you auto-promote above a similarity threshold, or is there a manual gate somewhere in the loop?
Right now it's threshold only — no promotion layer. Your point about usage and validation as proxies is exactly the gap I'm going to close. Building a memory promotion layer this weekend — usage frequency + outcome scores from the regret index as the signal for what's worth keeping. I'm open to any ideas you have in mind as well.
The regret index as a promotion signal is clever — outcome scoring after the fact is more honest than trying to predict value upfront. One thing I'd watch: frequency alone promotes what gets repeated, not what's correct. A wrong pattern repeated 10 times scores higher than a right one used once.
The validation signal in my setup tries to catch that — was it confirmed to work, not just used. Might be worth pairing your outcome scores with a recency decay so recent regret-free events outweigh stale frequency. Happy to share the scoring code if useful.
Good catch on the frequency problem — you're right, repeated wrong patterns would outscore a correct one used once. I added recency decay alongside the outcome scores, half-life around 42 days. Recent regret-free retrievals now outweigh stale frequency. The validation signal idea is interesting — do you confirm correctness explicitly or infer it from downstream success?
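For anyone following along, here's a minimal sketch of what that decayed scoring could look like (hypothetical names, not Echo's actual code; the 42-day half-life is the unvalidated starting point discussed above):

```python
HALF_LIFE_DAYS = 42  # assumed starting point, not yet validated against real data

def decayed_score(outcome_score, age_days, half_life=HALF_LIFE_DAYS):
    """Outcome score weighted by exponential recency decay.

    A regret-free retrieval from today keeps its full weight;
    the same score from 42 days ago counts half as much.
    """
    return outcome_score * 0.5 ** (age_days / half_life)

def promotion_score(retrievals):
    """Sum decayed outcome scores over a memory's retrieval history.

    `retrievals` is a list of (outcome_score, age_days) pairs.
    Recent regret-free retrievals outweigh stale frequency.
    """
    return sum(decayed_score(outcome, age) for outcome, age in retrievals)

# A pattern retrieved 10 times, 120 days ago...
stale = [(1.0, 120)] * 10
# ...scores lower than one retrieved twice this week:
fresh = [(1.0, 3), (1.0, 5)]
assert promotion_score(fresh) > promotion_score(stale)
```

Exponential decay is just one choice here; a step cutoff or linear decay would trade off differently, and which is right depends on the runtime data neither of us has yet.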
Inferred, not explicit. The validation signal fires when the excerpt contains confirmation language - "confirmed in production," "tested," "works on," referenced docs. It's pattern matching on the text, not downstream tracking.
Explicit confirmation would be stronger but it requires closing the loop after the fact - knowing whether the thing actually worked. I don't have that yet. The regret index you have is closer to true validation than anything I'm doing. You're measuring actual outcomes. I'm measuring stated confidence at capture time.
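A rough sketch of that pattern-matching check (the phrase list is illustrative; the real signal's vocabulary may differ):

```python
import re

# Hypothetical confirmation-language patterns, based on the phrases
# mentioned above -- this captures stated confidence, not outcomes.
CONFIRMATION_PATTERNS = [
    r"confirmed in production",
    r"\btested\b",
    r"works on",
    r"per the docs",
]

def validation_signal(excerpt: str) -> bool:
    """True if the excerpt *claims* validation (a prior, not a posterior)."""
    text = excerpt.lower()
    return any(re.search(pattern, text) for pattern in CONFIRMATION_PATTERNS)
```

The obvious failure mode: it fires on "tested" whether or not the test actually passed, which is exactly why closing the loop with real outcomes would be stronger.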
The 42-day half-life is interesting - how did you land on that number?
The 42-day half-life was suggested by the AI I build with, not derived from my own data — it maps to roughly 6 weeks, which felt right for 'recent enough to matter, old enough to fade.' The honest answer is I don't have enough runtime yet to validate it. Echo has been running for 8 days. The real calibration will come from watching whether stale memories start surfacing inappropriately or useful recent ones get buried. If the half-life is wrong I'll know it empirically before I know it theoretically.
Your distinction between stated confidence and actual outcomes is the sharper point. Pattern matching on confirmation language is a prior; the regret index is a posterior. Neither of us has the feedback loop tight enough yet to know which signal is more reliable at scale.
8 days runtime. Same position — system works, thresholds unverified.
The prior/posterior split is the right frame. I'm betting confirmation language predicts outcome quality. You're betting regret scores correct bad priors over time. Both unproven at scale.
Could be both are right at different time horizons. Won't know without more runtime. Worth comparing notes in 30 days.
30 days works. I'll have Golem earnings data, more ledger history, and enough retrieval cycles to see if the decay rate needs adjusting. Same time next month.
What's so special about Notion, couldn't you just use whatever database/table(s) as the "review" queue, or was it just that it turned out to be a convenient choice?
You could, and in production Foundation uses Vectorize/D1 for this. For this challenge, Notion was the right choice for two reasons: it's a human-readable UI with no frontend to build, and its MCP server lets Claude Desktop query the same Review Queue the Worker writes to. That bidirectional loop (REST writes, MCP reads) was the point of the submission.
Okay makes sense :-)
(I'm not familiar with Notion, never used it)
Notion is basically a flexible database with a built-in UI that non-developers can actually use. That human-friendly layer is what made it the right fit here.
Personally I would think about Notion as an abstracted front-end interface. It's where the collaboration and data originates, the agent is taking advantage of that using the MCP connection.
Exactly this. And the MCP connection is what makes it more than just a UI. The Worker writes to Notion via REST, Claude Desktop queries it back via the MCP server, and a human resolves it in the same view. Three different actors, one surface.
I love it! It's an idea with a lot of potential. You've identified a significant gap. Best of luck!!
Thanks! The gap felt obvious once I hit it: 300 conversations in Foundation and search started returning noise instead of signal. The evaluator is the piece that was always missing. Appreciate the kind words.
The videos could be improved implementation-wise; to me, some of them were pretty basic.
Appreciate the feedback. The videos are screen recordings of a live system - the value is in watching the Worker evaluate, route to Notion, and have Claude read it back via MCP in real time. Happy to hear what specifically you'd improve.