<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Florent Cleron</title>
    <description>The latest articles on DEV Community by Florent Cleron (@commonlayer).</description>
    <link>https://dev.to/commonlayer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825523%2F56443aa4-016a-4c4f-ae66-82d86ce1d2bf.jpg</url>
      <title>DEV Community: Florent Cleron</title>
      <link>https://dev.to/commonlayer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/commonlayer"/>
    <language>en</language>
    <item>
      <title>I Turned Notion Into a Shared Brain for AI Agents (and it actually made sense)</title>
      <dc:creator>Florent Cleron</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:24:31 +0000</pubDate>
      <link>https://dev.to/commonlayer/i-turned-notion-into-a-shared-brain-for-ai-agents-and-it-actually-made-sense-26ha</link>
      <guid>https://dev.to/commonlayer/i-turned-notion-into-a-shared-brain-for-ai-agents-and-it-actually-made-sense-26ha</guid>
      <description>&lt;h2&gt;
  
  
  The thing that bugged me
&lt;/h2&gt;

&lt;p&gt;Every "AI + Notion" demo I've seen does roughly the same thing: you prompt a model, it generates text, the text gets dumped into a Notion page. Done.&lt;/p&gt;

&lt;p&gt;It works. It's fine. But it's basically a fancy copy-paste.&lt;/p&gt;

&lt;p&gt;The agent writes &lt;em&gt;to&lt;/em&gt; Notion, but Notion doesn't &lt;em&gt;do&lt;/em&gt; anything. It's a filing cabinet, not a workspace.&lt;/p&gt;

&lt;p&gt;And that felt like a waste. Because Notion already has databases, relations, statuses, filters — all the building blocks of a workflow engine. We just... never let the agents use them.&lt;/p&gt;

&lt;p&gt;So I had this thought: &lt;strong&gt;what if Notion wasn't the destination, but the coordination layer?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/CommonLayer/notion-blackboard" rel="noopener noreferrer"&gt;github.com/CommonLayer/notion-blackboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if multiple agents could read and write to shared Notion databases, and that's how they'd collaborate — not through function calls or message queues, but through structured state that a human can see and touch at any point?&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet Notion Blackboard
&lt;/h2&gt;

&lt;p&gt;The idea is dead simple.&lt;/p&gt;

&lt;p&gt;You write an objective in a Notion database. Something like &lt;em&gt;"Prepare a competitive analysis of the AI code editor market."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then a pipeline kicks in:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;manager agent&lt;/strong&gt; reads your objective and breaks it into 3-7 concrete tasks (stored in a Task Queue database)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;worker agent&lt;/strong&gt; picks up each task and produces a result (stored in a Results database)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;reviewer agent&lt;/strong&gt; checks every result, scores it, approves or rejects it (logged in an Audit Log database)&lt;/li&gt;
&lt;li&gt;The system compiles everything into one clean &lt;strong&gt;final report&lt;/strong&gt;, published back to Notion&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The punchline: &lt;strong&gt;none of these agents talk to each other directly.&lt;/strong&gt; They coordinate entirely through Notion. The manager writes tasks, the worker reads them. The worker writes results, the reviewer reads them. It's asynchronous, inspectable, and weirdly elegant.&lt;/p&gt;

&lt;p&gt;I call it a blackboard architecture — an old AI pattern from the 1970s where agents share a common workspace instead of passing messages. Notion just happens to be a very nice blackboard.&lt;/p&gt;
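&lt;p&gt;In code, the pattern is tiny. Here's a minimal sketch with a plain dict standing in for the Notion databases (the field names are illustrative, not the repo's actual schema):&lt;/p&gt;

```python
# Minimal blackboard sketch: agents never call each other; they only
# read and write shared state. A dict stands in for the Notion databases.
blackboard = {"objective": "Competitive analysis of AI code editors",
              "task_queue": [], "results": [], "audit_log": []}

def manager(bb):
    # Break the objective into concrete tasks (an LLM call in the real system).
    for i, topic in enumerate(["market map", "pricing", "differentiators"]):
        bb["task_queue"].append({"id": i, "topic": topic, "status": "pending"})

def worker(bb):
    # Pick up each pending task and write a result back to the board.
    for task in bb["task_queue"]:
        if task["status"] == "pending":
            bb["results"].append({"task_id": task["id"],
                                  "text": f"findings on {task['topic']}"})
            task["status"] = "done"

def reviewer(bb):
    # Check every result and leave an audit trail.
    for r in bb["results"]:
        r["approved"] = True
        bb["audit_log"].append({"action": "review", "task_id": r["task_id"]})

# Sequential pipeline: each agent only ever touches the shared board.
for agent in (manager, worker, reviewer):
    agent(blackboard)
```

&lt;p&gt;Swap the dict for Notion databases and the stub functions for LLM calls, and you have the shape of the whole pipeline.&lt;/p&gt;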

&lt;h2&gt;
  
  
  Why this actually matters
&lt;/h2&gt;

&lt;p&gt;Here's the thing about most multi-agent setups: they're black boxes.&lt;/p&gt;

&lt;p&gt;You fire off a pipeline, stuff happens in memory, logs scroll by in a terminal, and eventually you get an output. If something goes wrong, good luck figuring out where. If you want to tweak the plan mid-run, too bad — the agents already moved on.&lt;/p&gt;

&lt;p&gt;With Notion as the backbone, you get something different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The state is visible.&lt;/strong&gt; Every task, every result, every review sits in a database you can browse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can intervene.&lt;/strong&gt; Don't like a task the manager created? Edit it. Think a result is garbage? Change its status before the reviewer gets to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The audit trail is real.&lt;/strong&gt; Not buried in a log file — it's right there in a database, filterable, sortable, shareable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The output lives where you already work.&lt;/strong&gt; No "export to PDF" step. The final report is a Notion page your team can read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not about making agents smarter. It's about making their work &lt;strong&gt;legible&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tech under the hood
&lt;/h2&gt;

&lt;p&gt;The stack is intentionally minimal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; — because it's the lingua franca for this kind of thing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notion API&lt;/strong&gt; — direct REST calls, no SDK wrapper needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt; — one API to call Claude, GPT-4o, Gemini, whatever you want&lt;/li&gt;
&lt;li&gt;Each agent uses a different model by default (Claude for planning, GPT-4o for execution, Gemini for review), but you can swap them freely&lt;/li&gt;
&lt;/ul&gt;
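&lt;p&gt;"Direct REST calls" really is all it takes. Here's a hedged sketch of querying a task database without the SDK; the database ID and the &lt;code&gt;Status&lt;/code&gt; property name are placeholders for whatever your own workspace uses:&lt;/p&gt;

```python
import json

# Build a raw Notion REST request (no SDK wrapper) that fetches pending
# tasks. Database ID and the "Status" property are illustrative.
def pending_tasks_request(database_id, token):
    return {
        "url": f"https://api.notion.com/v1/databases/{database_id}/query",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Notion-Version": "2022-06-28",  # Notion requires a pinned version
            "Content-Type": "application/json",
        },
        "body": json.dumps({"filter": {"property": "Status",
                                       "select": {"equals": "Pending"}}}),
    }

req = pending_tasks_request("TASK_QUEUE_DB_ID", "secret_token")
```

&lt;p&gt;Send it with any HTTP client, e.g. &lt;code&gt;requests.post(req["url"], headers=req["headers"], data=req["body"])&lt;/code&gt;.&lt;/p&gt;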

&lt;p&gt;The repo comes with some quality-of-life stuff:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bootstrap the full Notion workspace (creates 6 databases)&lt;/span&gt;
python3 main.py &lt;span class="nt"&gt;--bootstrap&lt;/span&gt; &lt;span class="nt"&gt;--parent-page-id&lt;/span&gt; &amp;lt;PAGE_ID&amp;gt;

&lt;span class="c"&gt;# Check that everything is wired correctly&lt;/span&gt;
python3 main.py &lt;span class="nt"&gt;--doctor&lt;/span&gt;

&lt;span class="c"&gt;# Run the pipeline on all pending objectives&lt;/span&gt;
python3 main.py &lt;span class="nt"&gt;--process-objectives&lt;/span&gt;

&lt;span class="c"&gt;# Try it without any API calls&lt;/span&gt;
python3 main.py &lt;span class="s2"&gt;"Your objective here"&lt;/span&gt; &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dry-run mode is particularly nice for poking around — it simulates the full pipeline locally with fake data, so you can see the flow without spending tokens or setting up Notion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Notion workspace looks like
&lt;/h2&gt;

&lt;p&gt;The workspace has two layers, and the split is intentional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you see as a user:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Objectives&lt;/strong&gt; — where you write what you want done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Reports&lt;/strong&gt; — where you read the result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start Here&lt;/strong&gt; — a guide page the system auto-generates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's running backstage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Queue&lt;/strong&gt; — the manager's work breakdown&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results&lt;/strong&gt; — the worker's intermediate outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Log&lt;/strong&gt; — every agent action, timestamped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Registry&lt;/strong&gt; — which agents are active and what model they're running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had everything in one flat view at first. It was technically correct and completely overwhelming. The two-layer split made it click — you get the clean "objectives in, reports out" experience, with full transparency one click away if you want it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd build next
&lt;/h2&gt;

&lt;p&gt;This is a working prototype, not a SaaS product. But the pattern has legs.&lt;/p&gt;

&lt;p&gt;Things I'm thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry loops&lt;/strong&gt; — when the reviewer rejects a result, send it back to the worker automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel workers&lt;/strong&gt; — right now tasks run sequentially, but Notion can handle concurrent writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web research&lt;/strong&gt; — let the worker agent pull in external sources, with citations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A small web UI&lt;/strong&gt; — so you don't need a terminal to trigger a run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better templates&lt;/strong&gt; — richer formatting for the final reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But honestly, the core loop already works and it's satisfying to watch. Write an objective, go make coffee, come back to a structured report with full traceability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;I think there's a real gap between "AI can generate text" and "AI can get work done in a way humans can follow."&lt;/p&gt;

&lt;p&gt;Most agent frameworks optimize for autonomy — let the agents figure it out, minimize human intervention. That's exciting, but it also means you're trusting a pipeline you can't see.&lt;/p&gt;

&lt;p&gt;Notion Blackboard takes the opposite bet: &lt;strong&gt;make the process the product.&lt;/strong&gt; Every step is visible, editable, and stored in a tool people already use daily. The agents aren't hidden behind an API — they're collaborators in a shared workspace.&lt;/p&gt;

&lt;p&gt;That's the idea, anyway. It's a small project, but I think the pattern is worth exploring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check it out
&lt;/h2&gt;

&lt;p&gt;The code is open source:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/CommonLayer/notion-blackboard" rel="noopener noreferrer"&gt;github.com/CommonLayer/notion-blackboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're into multi-agent workflows, Notion hacks, or just want to see what happens when you treat a productivity app as an AI coordination protocol — have a look, break things, and let me know what you think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>notion</category>
      <category>python</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Less Heard — A Terminal Sketch on Gender Equity in Tech</title>
      <dc:creator>Florent Cleron</dc:creator>
      <pubDate>Wed, 18 Mar 2026 07:35:14 +0000</pubDate>
      <link>https://dev.to/commonlayer/less-heard-a-terminal-sketch-on-gender-equity-in-tech-10j8</link>
      <guid>https://dev.to/commonlayer/less-heard-a-terminal-sketch-on-gender-equity-in-tech-10j8</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/wecoded-2026"&gt;2026 WeCoded Challenge&lt;/a&gt;: Frontend Art&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Less Heard&lt;/em&gt; is an interactive piece built for the WeCoded Challenge 2026, on the theme of gender equality in tech. No framework, no UI library — just a terminal, a face, and a path you have to walk yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Show us your Art
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;a href="https://fcwebdesign.github.io/Less-Heard/" rel="noopener noreferrer"&gt;Live demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Less Heard&lt;/em&gt; is an interactive terminal experience. Type &lt;code&gt;help&lt;/code&gt; to get started, then follow the guided path: &lt;code&gt;speak&lt;/code&gt;, &lt;code&gt;listen&lt;/code&gt;, &lt;code&gt;connect&lt;/code&gt;, &lt;code&gt;amplify&lt;/code&gt;, &lt;code&gt;echo&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspiration
&lt;/h2&gt;

&lt;p&gt;There is something ironically equal about an 80s terminal.&lt;/p&gt;

&lt;p&gt;In the early days of personal computing, everyone saw the same thing: monochrome screen, fixed text, green or amber on black. It didn't matter who you were — the interface didn't look at you. It read you. That's where the aesthetic of &lt;em&gt;Less Heard&lt;/em&gt; comes from: a space where form doesn't discriminate.&lt;/p&gt;

&lt;p&gt;The 3D face at the center is neither male nor female. It's an entity, a presence. Behind a computer, we all start at the same point — one of the few equalities tech has always offered. The problem starts after that: when you speak up.&lt;/p&gt;

&lt;p&gt;The core of the piece is the interaction itself. You have to type each word: &lt;code&gt;speak&lt;/code&gt;, &lt;code&gt;listen&lt;/code&gt;, &lt;code&gt;connect&lt;/code&gt;, &lt;code&gt;amplify&lt;/code&gt;, &lt;code&gt;echo&lt;/code&gt;. It doesn't scroll automatically. That's intentional. Because that's exactly what's missing in real life: if you don't speak up, if no one validates you, if no one relays what you say — the signal stays below the line. The terminal mechanic forces you to &lt;em&gt;make the journey&lt;/em&gt;. Not just watch it go by.&lt;/p&gt;

&lt;p&gt;The final message: &lt;strong&gt;Heard. Credited. Carried forward.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "equality is good." Just what's concretely missing when it isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Code
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/fcwebdesign/Less-Heard" rel="noopener noreferrer"&gt;github.com/fcwebdesign/Less-Heard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vanilla HTML/CSS/JS + Three.js. Nothing to install, no build step.&lt;/p&gt;

</description>
      <category>wecoded</category>
      <category>frontend</category>
      <category>threejs</category>
      <category>devchallenge</category>
    </item>
    <item>
      <title>What if LLMs needed a spine, not a bigger brain?</title>
      <dc:creator>Florent Cleron</dc:creator>
      <pubDate>Mon, 16 Mar 2026 14:22:52 +0000</pubDate>
      <link>https://dev.to/commonlayer/what-if-llms-needed-a-spine-not-a-bigger-brain-4812</link>
      <guid>https://dev.to/commonlayer/what-if-llms-needed-a-spine-not-a-bigger-brain-4812</guid>
      <description>&lt;p&gt;I’ve been building something for the past few months, and I’m still trying to figure out whether I’m hitting a real problem or just over-structuring something that better prompting would already solve.&lt;/p&gt;

&lt;p&gt;My starting intuition is simple: LLMs are very good at generating, but much less reliable when you expect continuity from them. As soon as you want an agent that can hold a line, remember things cleanly, recover after tension, and stay coherent over time, you start seeing the limits of the model on its own. Not necessarily because it lacks intelligence, but because it lacks a kind of skeleton.&lt;/p&gt;

&lt;p&gt;In many systems, the LLM does everything at once: it speaks, it decides, it improvises its own memory and its own frame. And that works, until it starts to drift. Prompting can take you pretty far, but it still feels fragile.&lt;/p&gt;

&lt;p&gt;That’s the space I’m exploring. The idea is to move governance outside the model: the LLM generates, but it does not decide on its own. An explicit policy layer handles decisions, state and memory carry continuity, and a timeline keeps an inspectable trace of what happened.&lt;/p&gt;
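&lt;p&gt;To make the split concrete, here is a toy sketch, with all names illustrative rather than taken from any specific framework: the model proposes an action, a deterministic policy layer decides, and the decision lands on a timeline.&lt;/p&gt;

```python
# Toy sketch of "governance outside the model": the LLM proposes, an
# explicit policy layer decides, and a timeline keeps an inspectable
# trace. All names here are illustrative.
timeline = []
state = {"user_verified": False}

def policy_allows(action, state):
    # Deterministic rules that live outside the model and cannot be
    # overridden from inside a prompt.
    if action == "reveal_account_data" and not state["user_verified"]:
        return False
    return True

def govern(proposal, state):
    allowed = policy_allows(proposal["action"], state)
    timeline.append({"proposal": proposal, "allowed": allowed})
    return proposal if allowed else {"action": "refuse"}

# The model "decides" to reveal data; the skeleton says no.
result = govern({"action": "reveal_account_data"}, state)
```

&lt;p&gt;The point of the sketch isn't the rule itself — it's that the decision and its trace exist outside the generation step.&lt;/p&gt;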

&lt;p&gt;What I’m seeing so far is mostly more stability: less invention around internal state, better constraint-following, firmer boundaries under prompt injection, and less drift over long sequences.&lt;/p&gt;

&lt;p&gt;That said, I don’t want to oversell it: I haven’t formally proven state causality, the actual impact of governed memory, or deterministic replay yet. What I have are strong signals, not hard proof.&lt;/p&gt;

&lt;p&gt;I’m posting this mainly to test the framing. Does this way of thinking resonate? Are there better or earlier projects working on the same problem? Or am I just adding structure to something that better prompting will eventually absorb?&lt;/p&gt;

&lt;p&gt;At the core, the question I keep coming back to is simple: if the LLM is the muscle, what does the skeleton look like?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why I make LLMs argue with each other before I make architecture decisions</title>
      <dc:creator>Florent Cleron</dc:creator>
      <pubDate>Sun, 15 Mar 2026 15:48:12 +0000</pubDate>
      <link>https://dev.to/commonlayer/why-i-make-llms-argue-with-each-other-before-i-make-architecture-decisions-5hnn</link>
      <guid>https://dev.to/commonlayer/why-i-make-llms-argue-with-each-other-before-i-make-architecture-decisions-5hnn</guid>
      <description>&lt;h2&gt;
  
  
  The problem with asking one model
&lt;/h2&gt;

&lt;p&gt;You ask Claude about your API design. It gives you a confident, well-structured answer. You move on. Two weeks later, during code review, someone spots the thing the model didn't mention — the thing you would have caught if you'd thought about it from a different angle.&lt;/p&gt;

&lt;p&gt;This happens because LLMs are agreement machines. Ask one model a question and you get one perspective wrapped in confidence. The model won't naturally play devil's advocate against its own answer. It'll give you the best answer &lt;em&gt;it&lt;/em&gt; can produce, not the best answer the &lt;em&gt;problem&lt;/em&gt; deserves.&lt;/p&gt;

&lt;p&gt;I started doing something simple: same prompt, same codebase context, two different models. And I noticed that the interesting part was never where they agreed — it was where they disagreed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured disagreement as a design tool
&lt;/h2&gt;

&lt;p&gt;The idea isn't new. Adversarial review exists in every serious engineering culture: red teams, architecture review boards, RFC processes. What's new is that you can now run a lightweight version of this with LLMs, grounded in your actual code, in minutes instead of days.&lt;/p&gt;

&lt;p&gt;The setup I converged on uses two roles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critic&lt;/strong&gt; — opens each round by pressure-testing the thesis. It looks for fragilities, unexamined assumptions, missing invariants. Its job is to break the argument, not improve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Builder&lt;/strong&gt; — responds with implementation choices, sequencing, and safeguards. Its job is to defend or adapt the approach while staying concrete.&lt;/p&gt;

&lt;p&gt;Both see the full transcript at every turn. This matters — it forces each model to actually address the other's points instead of talking past them.&lt;/p&gt;

&lt;p&gt;After the rounds, a third model (or one of the two) produces a synthesis. Not a "both sides have good points" summary — a structured recommendation that incorporates the strongest objections.&lt;/p&gt;
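&lt;p&gt;The loop itself is simple to sketch. Assuming a stub in place of the real model calls, the structure looks like this:&lt;/p&gt;

```python
# Structure of the debate: exactly two roles in strict alternation, both
# seeing the full transcript each turn, then one synthesis pass.
# call_model is a stub standing in for a real LLM API call.
def call_model(role, context):
    return f"[{role}] reply grounded in {len(context)} chars of transcript"

def debate(topic, rounds=2):
    transcript = [f"Topic: {topic}"]
    for _ in range(rounds):
        for role in ("critic", "builder"):  # critic always opens the round
            turn = call_model(role, "\n".join(transcript))
            transcript.append(turn)
    synthesis = call_model("synthesizer", "\n".join(transcript))
    return transcript, synthesis

transcript, synthesis = debate("Migrate the REST API to GraphQL")
```

&lt;p&gt;Passing the full transcript (not just the last turn) into every call is the detail that keeps the models responding to each other.&lt;/p&gt;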

&lt;h2&gt;
  
  
  The hard part: grounding the debate in code
&lt;/h2&gt;

&lt;p&gt;A debate between two models about abstract architecture is just two blog posts arguing. The value comes from grounding.&lt;/p&gt;

&lt;p&gt;This is the technical problem I spent the most time on. When you point the tool at a codebase, it needs to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build a file tree&lt;/strong&gt; — from a local repo, uploaded files, or a GitHub repository (public or private). This means handling auth tokens, ignoring &lt;code&gt;node_modules&lt;/code&gt;/&lt;code&gt;.git&lt;/code&gt;/&lt;code&gt;dist&lt;/code&gt;, and respecting size limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resolve excerpts&lt;/strong&gt; — you can't dump an entire repo into a prompt. The tool scores and selects the most relevant excerpts from each selected file, given the debate topic and objective. The default limits are 3 excerpts per source, 18 excerpts max per pack, 2 MB per text file, 10 MB per PDF.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inject as an evidence pack&lt;/strong&gt; — the selected excerpts are injected into both the debate prompts and the synthesis prompt, with &lt;code&gt;[SRC-x]&lt;/code&gt; markers. This forces the models to cite specific files when making claims, instead of hand-waving about "your codebase."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The citation system is what makes the output actually useful. When the Critic says "your pagination contracts are inconsistent across 6 endpoints &lt;code&gt;[SRC-2]&lt;/code&gt;", you can go straight to the file and verify.&lt;/p&gt;
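&lt;p&gt;A naive version of the excerpt selection fits in a few lines. The keyword-overlap scoring below is purely illustrative (the real tool's scoring is more involved), but the caps and the &lt;code&gt;[SRC-x]&lt;/code&gt; tagging follow the limits described above:&lt;/p&gt;

```python
# Naive evidence-pack assembly: rank excerpts by keyword overlap with the
# topic, apply the caps (3 per source, 18 per pack), and tag each excerpt
# with a [SRC-x] marker so the models can cite it.
def overlap(excerpt, topic_words):
    return len(set(excerpt.lower().split()).intersection(topic_words))

def build_pack(sources, topic, per_source=3, pack_max=18):
    topic_words = set(topic.lower().split())
    pack = []
    for src_id, excerpts in sources.items():
        ranked = sorted(excerpts,
                        key=lambda e: overlap(e, topic_words), reverse=True)
        for e in ranked[:per_source]:
            pack.append(f"[SRC-{src_id}] {e}")
    return pack[:pack_max]

pack = build_pack({1: ["pagination uses offsets", "auth is JWT based"],
                   2: ["pagination uses cursors in this module"]},
                  topic="pagination consistency")
```

&lt;p&gt;The markers are what let a claim like "inconsistent pagination &lt;code&gt;[SRC-2]&lt;/code&gt;" be traced back to a specific file.&lt;/p&gt;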

&lt;h2&gt;
  
  
  What a debate actually looks like
&lt;/h2&gt;

&lt;p&gt;Say you're debating whether to migrate a REST API to GraphQL. You select the relevant route files, the API client code, and the schema definitions.&lt;/p&gt;

&lt;p&gt;The Critic opens by pointing out that migration won't fix existing inconsistencies — it just relocates them from endpoint logic to resolver logic. It flags the risk of nested query over-fetching before rate-limiting is in place.&lt;/p&gt;

&lt;p&gt;The Builder responds: the inconsistencies are scoped to a handful of endpoints that can be normalized as a pre-migration step. Depth limiting and query cost analysis are standard tools. The migration unlocks typed schema sharing across client teams that are currently maintaining hand-written API wrappers.&lt;/p&gt;

&lt;p&gt;The synthesis doesn't split the difference. It says: proceed with a scoped migration, normalize the problematic endpoints first, gate production access on query cost limits, and prioritize the two client teams that benefit most.&lt;/p&gt;

&lt;p&gt;That's more useful than either model's answer alone. Not because the models are smarter together, but because the structure forced the second-order questions to surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this works (and when it doesn't)
&lt;/h2&gt;

&lt;p&gt;This works well for decisions that are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reversible but expensive&lt;/strong&gt; — you're not sure, and being wrong costs weeks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-dimensional&lt;/strong&gt; — there are real tradeoffs, not a single correct answer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groundable&lt;/strong&gt; — there's actual code or documentation to anchor the discussion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn't work well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The answer is obvious&lt;/strong&gt; — no need for a debate if best practices clearly apply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The context is too large&lt;/strong&gt; — an entire monorepo won't fit, and excerpt selection can miss things&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need domain-specific expertise&lt;/strong&gt; — the models are limited by what they know about your specific business logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the debate is noise. Two models going back and forth without surfacing anything you didn't already know. That's fine — it takes a few minutes and costs a few cents. The signal-to-noise ratio improves a lot when you give it well-scoped files and a clear question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation choices
&lt;/h2&gt;

&lt;p&gt;A few technical decisions worth mentioning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider-agnostic routing.&lt;/strong&gt; The tool supports OpenRouter (single key for any model), direct OpenAI/Anthropic keys, or a mix. Credential resolution happens server-side with a clear priority: UI-provided key &amp;gt; env variable &amp;gt; direct provider routing. This means you can use Claude as the Critic and GPT as the Builder, or any other combination.&lt;/p&gt;
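&lt;p&gt;The priority chain amounts to "first non-empty key wins", sketched here in Python for brevity (the actual tool is TypeScript, and all names are illustrative):&lt;/p&gt;

```python
# Credential resolution priority: UI-provided key, then environment
# variable, then a direct provider key. Names are illustrative.
def resolve_key(ui_key, env, direct_key, env_var="OPENROUTER_API_KEY"):
    for candidate in (ui_key, env.get(env_var), direct_key):
        if candidate:
            return candidate
    return None
```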

&lt;p&gt;&lt;strong&gt;Strict round structure.&lt;/strong&gt; Exactly 2 participants, strict alternating rounds, one synthesis pass. I tried freeform multi-turn and it degenerates quickly — models start being polite instead of precise. The constraint makes the output better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Format-aware synthesis.&lt;/strong&gt; The synthesis step has format presets: tech/architecture, decision/strategy, factual/practical, proof/validation, or auto-detect. This shapes the synthesis prompt to produce the right kind of output rather than a generic summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No persistence.&lt;/strong&gt; No database, no &lt;code&gt;localStorage&lt;/code&gt; for keys, no saved sessions. Each run is self-contained. The output is a Markdown or JSON export you take with you. This was deliberate — the tool is for decision-making, not conversation management.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;Next.js App Router, TypeScript, React, Tailwind. Provider adapters for OpenRouter, OpenAI, and Anthropic. &lt;code&gt;unpdf&lt;/code&gt; for PDF text extraction. Nothing exotic — the complexity is in the prompt engineering and the evidence pack resolution, not the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd like to figure out next
&lt;/h2&gt;

&lt;p&gt;The biggest open question is evaluation. How do you measure whether a debate-produced synthesis is actually better than a single-model answer to the same question? I have intuitions from using it, but no systematic way to prove it.&lt;/p&gt;

&lt;p&gt;The second question is excerpt selection. Right now it's scoring-based, but there's probably room for a retrieval step that's more context-aware.&lt;/p&gt;

&lt;p&gt;If you've tried similar multi-model approaches — or if you think this whole idea is flawed — I'm interested in hearing why.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/CommonLayer/model-debate" rel="noopener noreferrer"&gt;github.com/CommonLayer/model-debate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
