Daniel Nwaneri
I Built a Knowledge Evaluator That Uses Notion to Judge What's Worth Remembering

Notion MCP Challenge Submission 🧠

Foundation kept everything. After 300 conversations, search returned noise. That was the problem I was actually trying to solve.

I've been building Foundation, a federated knowledge system on Cloudflare Workers, and the hardest part isn't capturing insights from conversations. It's deciding which ones deserve to persist. Throw everything into Vectorize and you end up with noise.

The Notion MCP Challenge gave me a reason to isolate the evaluator and ship it as a standalone thing.

The Idea

Most challenge submissions pipe data into Notion. This one uses Notion as the judgment surface — the place where ambiguous knowledge items wait for a human to decide if they're worth keeping.

The architecture:

Conversation excerpt
        ↓
Workers AI (Llama 3.3 70B) scores 3 signals
        ↓
Score ≥ 0.67 → Knowledge Memory (auto-promoted)
0.33 ≤ score < 0.67 → Notion Review Queue (human judges)
Score < 0.33 → Discarded

Notion isn't just receiving output. The Worker queries it back for pending items and uses it as the approval layer. That's the move most submissions didn't make.

The Three Signals

The evaluator scores each knowledge item on three binary signals:

  • Usage — Is there a concrete technique, command, or pattern being applied?
  • Validation — Is it confirmed to work (tested, referenced, agreed upon)?
  • Specificity — Is it actionable, not just vague advice?

A score of 1.0 means all three fired. That item goes straight to Knowledge Memory. A score of 0.67 means two fired — useful, but unverified. That item surfaces in the Notion Review Queue as Pending. A human makes the call.
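The arithmetic behind the thresholds is worth spelling out. Here's a minimal sketch of how three binary signals fold into the routing decision (function names are mine for illustration, not from the repo):

```typescript
type Signals = { usage: 0 | 1; validation: 0 | 1; specificity: 0 | 1 };

function scoreSignals(s: Signals): number {
  // Equal weight per signal: 3/3 = 1.0, 2/3 ≈ 0.67, 1/3 ≈ 0.33.
  return (s.usage + s.validation + s.specificity) / 3;
}

function routeByScore(score: number): 'knowledge-memory' | 'review-queue' | 'discard' {
  // Note: 2/3 = 0.666… falls just under the 0.67 bar, so only a
  // perfect score auto-promotes; one or two signals land in review.
  if (score >= 0.67) return 'knowledge-memory';
  if (score >= 0.33) return 'review-queue';
  return 'discard';
}
```

The boundary matters: the displayed "0.67" is a rounded 2/3, which sits just below the `>= 0.67` check, so two-signal items route to the queue rather than auto-promoting.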

Building It

The Worker is Hono on Cloudflare Workers, Workers AI for scoring, and the Notion REST API for reads and writes.

The evaluator uses the messages format — not prompt — with a strict system instruction:

const response = await ai.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
  messages: [
    {
      role: 'system',
      content: 'You are a knowledge quality evaluator. Respond with ONLY valid JSON.',
    },
    {
      role: 'user',
      content: `Score this excerpt and return ONLY:
{"usage": 0|1, "validation": 0|1, "specificity": 0|1, "summary": "one sentence"}

Excerpt: ${text}`,
    },
  ],
  max_tokens: 256,
});

One thing I hit: the model returns { response: { usage: 1, ... } } — a nested object, not a string. The prompt parameter made it worse, returning free-form essays instead of JSON. Switching to messages with a system role fixed it.
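To guard against both shapes, a defensive parser helps. Here's a sketch assuming the signal field names from the prompt above (`parseVerdict` is my name, not the repo's):

```typescript
type Verdict = { usage: 0 | 1; validation: 0 | 1; specificity: 0 | 1; summary: string };

function parseVerdict(raw: unknown): Verdict {
  // Workers AI wraps the result in an envelope: { response: ... }. Unwrap it.
  const inner =
    typeof raw === 'object' && raw !== null && 'response' in raw
      ? (raw as { response: unknown }).response
      : raw;
  // Depending on the model/prompt, the payload may arrive as an object
  // or as a JSON string — handle both.
  const obj = typeof inner === 'string' ? JSON.parse(inner) : inner;
  const v = obj as Verdict;
  for (const key of ['usage', 'validation', 'specificity'] as const) {
    if (v[key] !== 0 && v[key] !== 1) {
      throw new Error(`Model output missing binary signal: ${key}`);
    }
  }
  return v;
}
```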

The routing logic is straightforward:

if (score >= 0.67) {
  // Auto-promote to Knowledge Memory
  await writeToNotion(env.KNOWLEDGE_MEMORY_ID, summary, score, signals, source, text);
} else if (score >= 0.33) {
  // Surface for human judgment
  await writeToNotion(env.REVIEW_QUEUE_ID, summary, score, signals, source, text, 'Pending');
} else {
  // Discard
}

Notion as the Judgment Layer

Two databases: Knowledge Memory (permanent store) and Review Queue (human inbox — a Select field with Pending, Approved, Rejected).

The Worker has three endpoints:

  • POST /evaluate — scores a knowledge item and routes it
  • GET /pending — queries Notion for items awaiting review
  • GET / — health check

The /pending endpoint is what makes Notion MCP genuinely active:

const res = await fetch(
  `https://api.notion.com/v1/databases/${env.REVIEW_QUEUE_ID}/query`,
  {
    method: 'POST',
    headers: { Authorization: `Bearer ${env.NOTION_TOKEN}`, 'Notion-Version': '2022-06-28' },
    body: JSON.stringify({
      filter: { property: 'Status', select: { equals: 'Pending' } },
      sorts: [{ property: 'Created', direction: 'descending' }],
    }),
  }
);
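Notion's query response is verbose, so /pending flattens it before returning. A sketch of that flattening, using the property names shown above (the exact Notion property types are my assumption about this schema):

```typescript
// Minimal slice of a Notion page object, limited to the properties we read.
type NotionPage = {
  id: string;
  properties: {
    Name?: { title: { plain_text: string }[] };
    Score?: { number: number };
    Status?: { select: { name: string } | null };
  };
};

function toPendingItems(results: NotionPage[]) {
  return results.map((page) => ({
    id: page.id,
    // Title properties are arrays of rich-text fragments; join them.
    name: page.properties.Name?.title.map((t) => t.plain_text).join('') ?? '',
    score: page.properties.Score?.number ?? 0,
    status: page.properties.Status?.select?.name ?? 'Unknown',
  }));
}
```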

When something lands in Review Queue, a human opens Notion and changes Status from Pending to Approved or Rejected. That's the loop. Notion isn't a log — it's a decision surface.

What It Looks Like Running

► Querying Review Queue...  Pending: 1

[1/3] HIGH-CONFIDENCE
  Score: 100% | Signals: usage, validation, specificity
  ✓ Auto-promoted → Knowledge Memory

[2/3] AMBIGUOUS
  Score: 67% | Signals: usage, specificity
  ⚡ Surfaced in Notion Review Queue

[3/3] LOW-CONFIDENCE
  Score: 0% | Destination: discarded
  ✗ Not worth preserving.

► Re-querying...  Pending: 2

Live endpoint: https://knowledge-evaluator.fpl-test.workers.dev

Closing the Loop with Notion MCP

The Worker writes to Notion via REST. But the other direction — reading pending items back via MCP — is where it gets interesting.

With @notionhq/notion-mcp-server configured in Claude Desktop, you can query the Review Queue directly in a conversation:

"Query my Notion Review Queue database and show me all items with Status Pending"

Claude returns:

1 item found

Name:     The excerpt mentions using waitUntil() in Cloudflare Workers for
          tasks like logging, but lacks production validation.
Score:    0.67
Status:   🟡 Pending
Signals:  usage, specificity
Source:   dev-chat
Created:  March 10, 2026

And then offers: "Want to promote, flag, or dismiss it?"

That's the full loop. The Worker evaluates and writes. Notion holds the item. Claude reads it back via MCP and surfaces it for a human decision. Two interactions — one outbound (REST), one inbound (MCP server) — both hitting the same Notion database.
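On the Claude Desktop side, the wiring is a single config entry. A sketch based on the package's documented setup (the token value is a placeholder — use your own integration token):

```json
{
  "mcpServers": {
    "notionApi": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer YOUR_NOTION_TOKEN\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}
```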

The architecture now looks like this:

Conversation excerpt
        ↓
Workers AI scores 3 signals
        ↓
Score ≥ 0.67 → Knowledge Memory (auto-promoted via Notion REST)
0.33 ≤ score < 0.67 → Review Queue (Pending, written via Notion REST)
Score < 0.33 → Discarded
        ↓
Claude Desktop queries Review Queue via @notionhq/notion-mcp-server
        ↓
Human approves or rejects in Notion

Notion isn't a log — it's the decision surface that both the Worker and the AI agent can see.

What I Learned

The scoring threshold matters more than the model. The harder question — what makes something worth keeping — I haven't fully answered. That's why the Review Queue exists.

The model doesn't need fine-tuning for this. Writing the three signals took longer than the entire Worker.

Knowledge provenance is preserved throughout — the original conversation snippet lives in the Raw Context field, so when you're approving something in Notion, you can see exactly where it came from.

Stack

  • Cloudflare Workers — runtime
  • Workers AI (Llama 3.3 70B) — evaluator
  • Hono — routing
  • Notion API — Review Queue + Knowledge Memory (writes)
  • @notionhq/notion-mcp-server — MCP query layer (reads)

Repo: github.com/dannwaneri/knowledge-evaluator

Top comments (4)

leob

What's so special about Notion, couldn't you just use whatever database/table(s) as the "review" queue, or was it just that it turned out to be a convenient choice?

Daniel Nwaneri

You could, and in production Foundation uses Vectorize/D1 for this. For this challenge, Notion was the right choice for two reasons: it's a human-readable UI with no frontend to build, and the MCP server lets Claude Desktop query the same Review Queue the Worker writes to. That bidirectional loop — REST writes, MCP reads — was the point of the submission.

Swift

Personally I would think about Notion as an abstracted front-end interface. It's where the collaboration and data originates, the agent is taking advantage of that using the MCP connection.

Daniel Nwaneri

Exactly this. And the MCP connection is what makes it more than just a UI. The Worker writes to Notion via REST, Claude Desktop queries it back via the MCP server, and a human resolves it in the same view. Three different actors, one surface.