Foundation kept everything. After 300 conversations, search returned noise. That was the problem I was actually trying to solve.
I've been building Foundation — a federated knowledge system on Cloudflare Workers — and the hardest part isn't capturing insights from conversations. It's deciding which ones deserve to persist. Throw everything into Vectorize and you end up with noise.
The Notion MCP Challenge gave me a reason to isolate the evaluator and ship it as a standalone thing.
The Idea
Most challenge submissions pipe data into Notion. This one uses Notion as the judgment surface — the place where ambiguous knowledge items wait for a human to decide if they're worth keeping.
The architecture:
Conversation excerpt
↓
Workers AI (Llama 3.3 70B) scores 3 signals
↓
Score ≥ 0.67 → Knowledge Memory (auto-promoted)
Score 0.33–0.67 → Notion Review Queue (human judges)
Score < 0.33 → Discarded
Notion isn't just receiving output. The Worker queries it back for pending items and uses it as the approval layer. That's the move most submissions didn't make.
The Three Signals
The evaluator scores each knowledge item on three binary signals:
- Usage — Is there a concrete technique, command, or pattern being applied?
- Validation — Is it confirmed to work (tested, referenced, agreed upon)?
- Specificity — Is it actionable, not just vague advice?
A score of 1.0 means all three fired. That item goes straight to Knowledge Memory. A score of 0.67 means two fired — useful, but unverified. That item surfaces in the Notion Review Queue as Pending. A human makes the call.
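The arithmetic behind those numbers is just an average of the three binary signals. A minimal sketch of the scoring and routing rule (function names are mine, not from the repo):

```typescript
type Signals = { usage: 0 | 1; validation: 0 | 1; specificity: 0 | 1 };

// Each binary signal contributes a third of the score: 3/3 = 1.0, 2/3 ≈ 0.667.
function scoreSignals(s: Signals): number {
  return (s.usage + s.validation + s.specificity) / 3;
}

// Route by threshold. Note that 2/3 lands just under the 0.67 bar, which is
// why two-signal items surface for human review instead of auto-promoting.
function route(score: number): 'knowledge-memory' | 'review-queue' | 'discard' {
  if (score >= 0.67) return 'knowledge-memory';
  return score >= 0.33 ? 'review-queue' : 'discard';
}
```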
Building It
The Worker stack is Hono on Cloudflare Workers for routing, Workers AI for scoring, and the Notion REST API for reads and writes.
The evaluator uses the messages format — not prompt — with a strict system instruction:
const response = await ai.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
  messages: [
    {
      role: 'system',
      content: 'You are a knowledge quality evaluator. Respond with ONLY valid JSON.',
    },
    {
      role: 'user',
      content: `Score this excerpt and return ONLY:
{"usage": 0|1, "validation": 0|1, "specificity": 0|1, "summary": "one sentence"}
Excerpt: ${text}`,
    },
  ],
  max_tokens: 256,
});
One thing I hit: the model returns { response: { usage: 1, ... } } — a nested object, not a string. The prompt parameter made it worse, returning free-form essays instead of JSON. Switching to messages with a system role fixed it.
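Because the shape varies, it's worth normalizing the model output before trusting it. A defensive parser sketch (the helper name is mine, not from the repo):

```typescript
type Evaluation = { usage: 0 | 1; validation: 0 | 1; specificity: 0 | 1; summary: string };

// Workers AI sometimes returns { response: { ... } } as an already-parsed
// object and, with other prompt shapes, a raw JSON string. Handle both.
function parseEvaluation(raw: unknown): Evaluation {
  const payload = (raw as { response?: unknown })?.response ?? raw;
  const obj = ((typeof payload === 'string' ? JSON.parse(payload) : payload) ?? {}) as Partial<Evaluation>;
  return {
    usage: obj.usage ? 1 : 0,
    validation: obj.validation ? 1 : 0,
    specificity: obj.specificity ? 1 : 0,
    summary: typeof obj.summary === 'string' ? obj.summary : '',
  };
}
```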
The routing logic is straightforward:
if (score >= 0.67) {
  // Auto-promote to Knowledge Memory
  await writeToNotion(env.KNOWLEDGE_MEMORY_ID, summary, score, signals, source, text);
} else if (score >= 0.33) {
  // Surface for human judgment
  await writeToNotion(env.REVIEW_QUEUE_ID, summary, score, signals, source, text, 'Pending');
} else {
  // Below the bar: discard, nothing is written
}
Notion as the Judgment Layer
Two databases: Knowledge Memory (permanent store) and Review Queue (human inbox — a Select field with Pending, Approved, Rejected).
The Worker has three endpoints:
- POST /evaluate — scores a knowledge item and routes it
- GET /pending — queries Notion for items awaiting review
- GET / — health check
The /pending endpoint is what makes Notion MCP genuinely active:
const res = await fetch(
  `https://api.notion.com/v1/databases/${env.REVIEW_QUEUE_ID}/query`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${env.NOTION_TOKEN}`,
      'Notion-Version': '2022-06-28',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      filter: { property: 'Status', select: { equals: 'Pending' } },
      sorts: [{ property: 'Created', direction: 'descending' }],
    }),
  }
);
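Notion returns pages with deeply nested property objects, so flattening them makes the /pending response readable. A sketch, assuming the property names used above (the helper is mine, not from the repo):

```typescript
type NotionQueryResponse = {
  results: Array<{ id: string; properties: Record<string, any> }>;
};

// Flatten Notion's nested property format into plain review items.
function extractPending(data: NotionQueryResponse) {
  return data.results.map((page) => ({
    id: page.id,
    name: page.properties.Name?.title?.[0]?.plain_text ?? '',
    score: page.properties.Score?.number ?? 0,
    status: page.properties.Status?.select?.name ?? 'Pending',
  }));
}
```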
When something lands in Review Queue, a human opens Notion and changes Status from Pending to Approved or Rejected. That's the loop. Notion isn't a log — it's a decision surface.
What It Looks Like Running
► Querying Review Queue... Pending: 1
[1/3] HIGH-CONFIDENCE
Score: 100% | Signals: usage, validation, specificity
✓ Auto-promoted → Knowledge Memory
[2/3] AMBIGUOUS
Score: 67% | Signals: usage, specificity
⚡ Surfaced in Notion Review Queue
[3/3] LOW-CONFIDENCE
Score: 0% | Destination: discarded
✗ Not worth preserving.
► Re-querying... Pending: 2
Live endpoint: https://knowledge-evaluator.fpl-test.workers.dev
Closing the Loop with Notion MCP
The Worker writes to Notion via REST. But the other direction — reading pending items back via MCP — is where it gets interesting.
With @notionhq/notion-mcp-server configured in Claude Desktop, you can query the Review Queue directly in a conversation:
"Query my Notion Review Queue database and show me all items with Status Pending"
Claude returns:
1 item found
Name: The excerpt mentions using waitUntil() in Cloudflare Workers for
tasks like logging, but lacks production validation.
Score: 0.67
Status: 🟡 Pending
Signals: usage, specificity
Source: dev-chat
Created: March 10, 2026
And then offers: "Want to promote, flag, or dismiss it?"
That's the full loop. The Worker evaluates and writes. Notion holds the item. Claude reads it back via MCP and surfaces it for a human decision. Two MCP interactions — one outbound (REST), one inbound (MCP server) — both hitting the same Notion database.
The architecture now looks like this:
Conversation excerpt
↓
Workers AI scores 3 signals
↓
Score ≥ 0.67 → Knowledge Memory (auto-promoted via Notion REST)
Score 0.33–0.67 → Review Queue (Pending, written via Notion REST)
Score < 0.33 → Discarded
↓
Claude Desktop queries Review Queue via @notionhq/notion-mcp-server
↓
Human approves or rejects in Notion
Notion isn't a log — it's the decision surface that both the Worker and the AI agent can see.
What I Learned
The scoring threshold matters more than the model. The harder question — what makes something worth keeping — I haven't fully answered. That's why the Review Queue exists.
The model doesn't need fine-tuning for this. Writing the three signals took longer than the entire Worker.
Knowledge provenance is preserved throughout — the original conversation snippet lives in the Raw Context field, so when you're approving something in Notion, you can see exactly where it came from.
What's Next
The thresholds (0.67 and 0.33) are now tunable via env vars — HIGH_THRESHOLD and LOW_THRESHOLD in wrangler.toml. Different knowledge domains need different bars.
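Reading those vars defensively matters because wrangler [vars] arrive as strings. One way to parse them with the defaults above (the helper name is mine):

```typescript
// Parse threshold overrides from the Worker environment, falling back to
// the defaults from the post. Rejects non-numeric or inverted values.
function thresholds(env: Record<string, string | undefined>): { high: number; low: number } {
  const high = Number(env.HIGH_THRESHOLD ?? '0.67');
  const low = Number(env.LOW_THRESHOLD ?? '0.33');
  if (Number.isNaN(high) || Number.isNaN(low) || low >= high) {
    throw new Error('thresholds must be numbers with LOW_THRESHOLD < HIGH_THRESHOLD');
  }
  return { high, low };
}
```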
A fourth signal is the natural evolution: novelty — does this item duplicate something already in Knowledge Memory? Without it, the permanent store will accumulate redundant entries over time. That's the next thing to build.
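A novelty signal could be as simple as an embedding lookup against the permanent store before promotion. A sketch against Workers AI embeddings and Vectorize, where the model name and the 0.92 similarity cutoff are illustrative assumptions, not tuned values:

```typescript
// Pure decision: treat an item as a duplicate when its best cosine match in
// Knowledge Memory exceeds a similarity cutoff. 0.92 is a guess to tune.
function isNovel(topSimilarity: number | undefined, cutoff = 0.92): boolean {
  return (topSimilarity ?? 0) < cutoff;
}

// Wiring sketch against assumed Workers AI embedding + Vectorize bindings.
async function noveltyCheck(env: any, summary: string): Promise<boolean> {
  const emb = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [summary] });
  const res = await env.VECTORIZE.query(emb.data[0], { topK: 1 });
  return isNovel(res.matches[0]?.score);
}
```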
Stack
- Cloudflare Workers — runtime
- Workers AI (Llama 3.3 70B) — evaluator
- Hono — routing
- Notion API — Review Queue + Knowledge Memory (writes)
- @notionhq/notion-mcp-server — MCP query layer (reads)
Top comments (19)
The relevance scoring angle is interesting — deciding what's worth remembering is half the problem with AI memory systems. I built a semantic memory layer for my local AI that does something similar, storing exchanges as embeddings and retrieving by cosine similarity. The curation problem is real.
Cosine similarity tells you what's related. It doesn't tell you what's worth keeping. That distinction is what the scoring signals are trying to formalize — usage, validation, specificity as proxies for "did this actually matter."
Curious how you handle the promotion decision. Do you auto-promote above a similarity threshold, or is there a manual gate somewhere in the loop?
Right now it's threshold only — no promotion layer. Your point about usage and validation as proxies is exactly the gap I'm going to close. Building a memory promotion layer this weekend — usage frequency + outcome scores from the regret index as the signal for what's worth keeping. I'm open to any ideas you have in mind as well.
The regret index as a promotion signal is clever — outcome scoring after the fact is more honest than trying to predict value upfront. One thing I'd watch: frequency alone promotes what gets repeated, not what's correct. A wrong pattern repeated 10 times scores higher than a right one used once.
The validation signal in my setup tries to catch that — was it confirmed to work, not just used. Might be worth pairing your outcome scores with a recency decay so recent regret-free events outweigh stale frequency. Happy to share the scoring code if useful.
Good catch on the frequency problem — you're right, repeated wrong patterns would outscore a correct one used once. I added recency decay alongside the outcome scores, half-life around 42 days. Recent regret-free retrievals now outweigh stale frequency. The validation signal idea is interesting — do you confirm correctness explicitly or infer it from downstream success?
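For concreteness, that half-life decay is a couple of lines; combining it with an outcome score is my reading of the setup described above, not code from either project:

```typescript
// Exponential recency decay: the weight halves every `halfLifeDays` days.
function decayWeight(ageDays: number, halfLifeDays = 42): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Pair with an outcome score so recent regret-free events outweigh
// stale frequency.
function promotionScore(outcome: number, ageDays: number): number {
  return outcome * decayWeight(ageDays);
}
```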
Inferred, not explicit. The validation signal fires when the excerpt contains confirmation language - "confirmed in production," "tested," "works on," referenced docs. It's pattern matching on the text, not downstream tracking.
Explicit confirmation would be stronger but it requires closing the loop after the fact - knowing whether the thing actually worked. I don't have that yet. The regret index you have is closer to true validation than anything I'm doing. You're measuring actual outcomes. I'm measuring stated confidence at capture time.
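A sketch of that matching, using only the example phrases above rather than the full production list:

```typescript
// Confirmation-language check: fires on stated evidence in the excerpt.
// The phrase list here is just the examples from this thread.
const CONFIRMATION = /\b(confirmed in production|tested|works on|referenced docs)\b/i;

function validationSignal(excerpt: string): 0 | 1 {
  return CONFIRMATION.test(excerpt) ? 1 : 0;
}
```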
The 42-day half-life is interesting - how did you land on that number?
The 42-day half-life was suggested by the AI I build with, not derived from my own data — it maps to roughly 6 weeks, which felt right for 'recent enough to matter, old enough to fade.' The honest answer is I don't have enough runtime yet to validate it. Echo has been running for 8 days. The real calibration will come from watching whether stale memories start surfacing inappropriately or useful recent ones get buried. If the half-life is wrong I'll know it empirically before I know it theoretically.
Your distinction between stated confidence and actual outcomes is the sharper point. Pattern matching on confirmation language is a prior; the regret index is a posterior. Neither of us has the feedback loop tight enough yet to know which signal is more reliable at scale.
8 days runtime. Same position — system works, thresholds unverified.
The prior/posterior split is the right frame. I'm betting confirmation language predicts outcome quality. You're betting regret scores correct bad priors over time. Both unproven at scale.
Could be both are right at different time horizons. Won't know without more runtime. Worth comparing notes in 30 days.
30 days works. I'll have Golem earnings data, more ledger history, and enough retrieval cycles to see if the decay rate needs adjusting. Same time next month.
What's so special about Notion, couldn't you just use whatever database/table(s) as the "review" queue, or was it just that it turned out to be a convenient choice?
You could, and in production Foundation uses Vectorize/D1 for this. For this challenge, Notion was the right choice for two reasons: it's a human-readable UI with no frontend to build, and the MCP server let Claude Desktop query the same Review Queue the Worker writes to. That bidirectional loop — REST writes, MCP reads — was the point of the submission.
Okay makes sense :-)
(I'm not familiar with Notion, never used it)
Notion is basically a flexible database with a built-in UI that non-developers can actually use. That human-friendly layer is what made it the right fit here.
Personally I would think about Notion as an abstracted front-end interface. It's where the collaboration and data originates, the agent is taking advantage of that using the MCP connection.
Exactly this. And the MCP connection is what makes it more than just a UI. The Worker writes to Notion via REST, Claude Desktop queries it back via the MCP server and a human resolves it in the same view. 3 different actors, one surface.
I love it! It's an idea with a lot of potential. You've identified a significant gap. Best of luck!!
Thanks, the gap felt obvious once I hit it — 300 conversations in Foundation and search started returning noise instead of signal. The evaluator is the piece that was always missing. Appreciate the kind words.
The videos could be improved implementation-wise; to me, some were point-blank basic.
Appreciate the feedback. The videos are screen recordings of a live system - the value is in watching the Worker evaluate, route to Notion, and have Claude read it back via MCP in real time. Happy to hear what specifically you'd improve.