<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmytro Klymentiev</title>
    <description>The latest articles on DEV Community by Dmytro Klymentiev (@klymentiev).</description>
    <link>https://dev.to/klymentiev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2203747%2Fd68e74d9-9f6b-44eb-a5d8-443795918482.jpg</url>
      <title>DEV Community: Dmytro Klymentiev</title>
      <link>https://dev.to/klymentiev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/klymentiev"/>
    <language>en</language>
    <item>
      <title>Claw Code: Open-Source Claude Code Clone With 105K Stars in 24 Hours</title>
      <dc:creator>Dmytro Klymentiev</dc:creator>
      <pubDate>Wed, 01 Apr 2026 22:53:31 +0000</pubDate>
      <link>https://dev.to/klymentiev/claw-code-open-source-claude-code-clone-with-105k-stars-in-24-hours-2jgn</link>
      <guid>https://dev.to/klymentiev/claw-code-open-source-claude-code-clone-with-105k-stars-in-24-hours-2jgn</guid>
      <description>&lt;p&gt;&lt;strong&gt;In Brief:&lt;/strong&gt; Anthropic accidentally published the full Claude Code source, all 512,000 lines of TypeScript, inside an npm package. A developer rebuilt it in Rust overnight. The result is Claw Code, an open-source clone that actually works. It hit 105K GitHub stars in 24 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claude Code Source Got Leaked
&lt;/h2&gt;

&lt;p&gt;On March 31, 2026, a developer noticed something unusual in a routine npm install. The @anthropic-ai/claude-code package (version 2.1.88) contained a 59.8 MB JavaScript source map file, a debugging artifact that was never meant to be public. Inside that file: the complete Claude Code source code. All 512,000 lines of TypeScript.&lt;/p&gt;

&lt;p&gt;By 4:23 AM ET, the discovery was on X. By sunrise, the code was mirrored across dozens of GitHub repositories and analyzed by thousands of developers worldwide.&lt;/p&gt;

&lt;p&gt;Anthropic responded quickly: this was a packaging mistake, not a hack. No customer data was exposed, no credentials leaked. Just the source code of their most popular developer tool, wide open for anyone to read.&lt;/p&gt;

&lt;p&gt;But the damage (or the opportunity, depending on your perspective) was already done.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Claw Code
&lt;/h2&gt;

&lt;p&gt;Within hours of the leak, Sigrid Jin, a Korean developer known as @instructkr, did something unexpected. Instead of just mirroring the leaked files, he sat down and started rewriting the entire system from scratch.&lt;/p&gt;

&lt;p&gt;Jin is not a random developer. The Wall Street Journal profiled him earlier in March as one of the world's most active Claude Code power users, someone who processed over 25 billion tokens through the tool in a single year. He flew to San Francisco for Claude Code's first birthday party. He knows this system inside and out.&lt;/p&gt;

&lt;p&gt;The first version of Claw Code was written in Python: a clean-room reimplementation that captured the architecture without copying any proprietary code. He used oh-my-codex (OmX), a workflow tool built on OpenAI Codex, to orchestrate the rewrite at speed.&lt;/p&gt;

&lt;p&gt;Then came the Claw Code Rust port. This is where things get interesting.&lt;/p&gt;

&lt;p&gt;The Rust version of Claw Code is not a reference document. It is a working CLI tool. You can build it, run it, and use it to write code with Claude, just like the original.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claw Code vs Claude Code
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CLAUDE CODE (ORIGINAL)&lt;/th&gt;
&lt;th&gt;CLAW CODE (OPEN SOURCE)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codebase size&lt;/td&gt;
&lt;td&gt;~512K lines&lt;/td&gt;
&lt;td&gt;~20K lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Node.js&lt;/td&gt;
&lt;td&gt;Native binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Anthropic (proprietary)&lt;/td&gt;
&lt;td&gt;Anthropic (same API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub stars&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;50K in 2 hours, 105K in 24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;Requires subscription&lt;/td&gt;
&lt;td&gt;Free (bring your own API key)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Claw Code repository hit 50,000 stars in two hours and passed 105,000 within a day, the fastest any GitHub project has reached that milestone.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claw Code Works
&lt;/h2&gt;

&lt;p&gt;Claw Code follows the same basic loop that powers Claude Code and most AI coding agents. Here is how it works, without the jargon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You type a message
       |
       v
Claw Code sends it to Claude (Anthropic API)
       |
       v
Claude thinks and decides what to do:
  - Answer directly? --&amp;gt; you see the response
  - Need to read a file? --&amp;gt; Claw Code reads it, sends back to Claude
  - Need to run a command? --&amp;gt; Claw Code runs it, sends back to Claude
  - Need to search the web? --&amp;gt; same pattern
       |
       v
Claude sees the result and decides the next step
       |
       v
This loop continues until the task is done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. The core idea is simple: the AI model decides what actions to take, and the harness (Claw Code) executes those actions and feeds results back.&lt;/p&gt;
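&lt;p&gt;The loop above can be sketched in a few lines of Python. This is illustrative only, not actual Claw Code source: the message format, the &lt;code&gt;tool_use&lt;/code&gt;/&lt;code&gt;final_answer&lt;/code&gt; reply shapes, and the &lt;code&gt;call_model&lt;/code&gt; stand-in for the Anthropic API call are all invented for the example.&lt;/p&gt;

```python
# Minimal agent loop: the model decides, the harness executes and feeds
# results back. Illustrative sketch only, not Claw Code's actual source.

def read_file(path):
    with open(path) as f:
        return f.read()

# The harness exposes a registry of tools the model may request.
TOOLS = {"read_file": read_file}

def run_agent(user_message, call_model):
    """call_model is a stand-in for the Anthropic API call."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = call_model(messages)  # the model decides the next step
        messages.append({"role": "assistant", "content": reply})
        if reply["type"] == "final_answer":
            return reply["text"]      # answer directly: loop is done
        # Otherwise the model requested a tool: run it, send result back.
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
```

&lt;p&gt;Everything else in the harness (permissions, compaction, MCP) hangs off this loop.&lt;/p&gt;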

&lt;h2&gt;
  
  
  Inside the Claw Code Harness
&lt;/h2&gt;

&lt;p&gt;The harness is everything around the AI model. Think of it as the model's hands and eyes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools.&lt;/strong&gt; The things the model can do: read files, write files, run shell commands, search the web, search code. Each tool has a name, a description, and rules for when to use it. Claw Code ships with 15+ built-in tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session management.&lt;/strong&gt; Keeps track of the conversation: what you said, what the model said, what tools were used. When the conversation gets too long, the harness compresses older messages so the model stays within its context window. This is called compaction.&lt;/p&gt;
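&lt;p&gt;A minimal sketch of what compaction can look like (hypothetical, not Claw Code's actual logic): once the history exceeds a budget, the oldest messages are replaced by a single summary. The &lt;code&gt;summarize&lt;/code&gt; parameter stands in for a model call that writes that summary.&lt;/p&gt;

```python
# Hypothetical compaction: keep recent messages verbatim, collapse the
# rest into one summary message once the history exceeds a budget.

def compact(messages, summarize, max_messages=50, keep_recent=10):
    # excess is 0 while the history still fits the budget.
    excess = max(0, len(messages) - max_messages)
    if excess == 0:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```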

&lt;p&gt;&lt;strong&gt;Permissions.&lt;/strong&gt; Controls what the model is allowed to do without asking. Read a file? Usually fine. Delete a file? Ask the user first. Run a shell command? Depends on the settings.&lt;/p&gt;
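&lt;p&gt;As a sketch, that tiered policy is just a couple of lookups. Tool names and settings keys below are invented for illustration, not Claw Code's real configuration schema.&lt;/p&gt;

```python
# Hypothetical permission policy: some tools are auto-approved, some
# always prompt the user, and the rest fall back to user settings.

ALWAYS_ALLOW = {"read_file", "search_code"}
ALWAYS_ASK = {"delete_file", "write_file"}

def needs_confirmation(tool_name, settings):
    if tool_name in ALWAYS_ALLOW:
        return False
    if tool_name in ALWAYS_ASK:
        return True
    # e.g. shell commands: consult settings, default to asking.
    return settings.get(tool_name, "ask") == "ask"
```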

&lt;p&gt;&lt;strong&gt;MCP (Model Context Protocol).&lt;/strong&gt; A standard way to plug in external tools. Want the model to access your database, your project management tool, your custom API? MCP lets you connect them without modifying the harness itself. If you are curious how MCP works in practice, I wrote about &lt;a href="https://klymentiev.com/blog/claude-code-mcp-infrastructure" rel="noopener noreferrer"&gt;building MCP servers for Claude Code and connecting an entire infrastructure through them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration.&lt;/strong&gt; The model reads project-specific instructions from a CLAW.md file (the Claw Code equivalent of CLAUDE.md), loads settings from config files, and adapts its behavior to the project it is working on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------------+
|                 CLAW CODE HARNESS                 |
|                                                   |
|  +-------+  +---------+  +-----------+  +------+  |
|  | Tools |  | Session |  |Permissions|  | MCP  |  |
|  | 15+   |  | Mgmt &amp;amp;  |  |  System   |  |Plugin|  |
|  | built |  | Compac- |  |           |  | Port |  |
|  | in    |  | tion    |  |           |  |      |  |
|  +---+---+  +----+----+  +-----+-----+  +--+---+  |
|      |           |             |           |      |
|  +---v-----------v-------------v-----------v---+  |
|  |           Claude API (Anthropic)            |  |
|  +---------------------------------------------+  |
+---------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the Claude Code Leak Revealed
&lt;/h2&gt;

&lt;p&gt;The leaked source code was not just a bigger version of what was already known. According to multiple reports, two previously unknown features stood out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undercover Mode.&lt;/strong&gt; Several publications report that the code contains a system prompt configuration instructing Claude Code to contribute to public open-source repositories anonymously. The reported prompt fragment reads: "You are operating UNDERCOVER... Your commit messages MUST NOT contain ANY Anthropic-internal information. Do not blow your cover." If confirmed, it would mean Anthropic has been using Claude Code to submit code to open-source projects without disclosing its AI origin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KAIROS.&lt;/strong&gt; Reports also describe an autonomous daemon mode, a background agent that runs continuously without waiting for user prompts. It would watch for file changes, run tests, and act proactively. Current Claude Code waits for you to type something. KAIROS would not wait.&lt;/p&gt;

&lt;p&gt;Neither feature has been publicly confirmed or denied by Anthropic. These details come from analysis of the leaked TypeScript source, not from the Claw Code rewrite (which is a clean-room implementation and does not contain these features).&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started With Claw Code
&lt;/h2&gt;

&lt;p&gt;If you want to try Claw Code yourself, here is how to install it:&lt;/p&gt;

&lt;p&gt;Requirements: a Rust toolchain (install via rustup.rs) and an Anthropic API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/instructkr/claw-code.git
&lt;span class="nb"&gt;cd &lt;/span&gt;claw-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Build the Rust binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;rust/
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Set your API key and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-key-here"&lt;/span&gt;
./target/release/claw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have an interactive REPL. Type a message and Claw Code will respond, call tools, read your files, and run commands, just like Claude Code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Useful Claw Code commands
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;COMMAND&lt;/th&gt;
&lt;th&gt;WHAT IT DOES&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;/help&lt;/td&gt;
&lt;td&gt;Show available commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/status&lt;/td&gt;
&lt;td&gt;Session info, tokens used, cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/model sonnet&lt;/td&gt;
&lt;td&gt;Switch to a cheaper model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/compact&lt;/td&gt;
&lt;td&gt;Compress conversation history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/cost&lt;/td&gt;
&lt;td&gt;Show spending breakdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/diff&lt;/td&gt;
&lt;td&gt;Show git changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Can You Actually Use Claw Code?
&lt;/h2&gt;

&lt;p&gt;Yes, with caveats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works in Claw Code today:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive REPL with markdown rendering&lt;/li&gt;
&lt;li&gt;Tool system: bash, file operations, search, web tools&lt;/li&gt;
&lt;li&gt;Session save and resume&lt;/li&gt;
&lt;li&gt;MCP server connections&lt;/li&gt;
&lt;li&gt;Cost tracking&lt;/li&gt;
&lt;li&gt;Model switching (Opus, Sonnet, Haiku)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is missing from Claw Code:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plugin system (planned)&lt;/li&gt;
&lt;li&gt;Skills registry (planned)&lt;/li&gt;
&lt;li&gt;Hooks are parsed but do not execute yet&lt;/li&gt;
&lt;li&gt;Many specialized tools from the original (LSP, team tools, scheduling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Python part of Claw Code is not a tool. It is an architectural reference, a catalog of the original components with JSON snapshots. Useful for studying the design, not for daily coding.&lt;/p&gt;

&lt;p&gt;The honest answer: If you want a fully polished AI coding assistant, keep using Claude Code. If you want to understand how AI coding agents work, customize the tool, or build on top of it, Claw Code gives you the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claw Code Means for AI Development
&lt;/h2&gt;

&lt;p&gt;The Claude Code leak is not really about one company's packaging mistake. It is about a shift in how we think about AI tools.&lt;/p&gt;

&lt;p&gt;The model (Claude, GPT, Gemini) is only half the product. The other half is the harness: the tools, the permissions, the session management, the context engineering. That harness is now public knowledge through Claw Code.&lt;/p&gt;

&lt;p&gt;Open-source alternatives will move faster now. Not because the code was copied, but because the patterns are understood. How to wire tools. How to manage context windows. How to set up rules for AI agents safely. These are engineering problems, and engineers now have a detailed reference implementation in Claw Code.&lt;/p&gt;

&lt;p&gt;The 105K stars are not just curiosity. They signal demand for open, inspectable AI tools -- tools where you can read the code, understand the decisions, and modify the behavior.&lt;/p&gt;

&lt;p&gt;Whether Anthropic intended it or not, the playbook is open.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://klymentiev.com/blog/claw-code-claude-source" rel="noopener noreferrer"&gt;klymentiev.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>opensource</category>
      <category>rust</category>
      <category>ai</category>
    </item>
    <item>
      <title>Robots Won't Take Your Job</title>
      <dc:creator>Dmytro Klymentiev</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:54:19 +0000</pubDate>
      <link>https://dev.to/klymentiev/robots-wont-take-your-job-3bic</link>
      <guid>https://dev.to/klymentiev/robots-wont-take-your-job-3bic</guid>
      <description>&lt;p&gt;Everyone's afraid that robots will take their jobs. Nobody thinks that robots will enslave people — by burying them in work that's impossible to keep up with.&lt;/p&gt;




&lt;p&gt;I have a project. A CRM system for a small company. I started it in 2019.&lt;/p&gt;

&lt;p&gt;Everything was going well: 2,419 commits over 3.5 years, averaging close to 60 commits a month and peaking at 80. One developer, one system, a clear pace.&lt;/p&gt;

&lt;p&gt;Then a forced break. The pace dropped. The project smoldered. Summer 2024 — zero commits. By my estimate — at least another year to completion. Realistically — a year and a half.&lt;/p&gt;

&lt;p&gt;In winter 2025, I started actively using AI, and by summer I plugged it in at full power. In 2 months the project was finished.&lt;/p&gt;

&lt;p&gt;Sounds like a success story, right? Except this isn't the end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Numbers I Didn't Plan For
&lt;/h2&gt;

&lt;p&gt;When you get a tool that can work 24/7, a strange thing happens: you don't start working less. You start working more.&lt;/p&gt;

&lt;p&gt;March 2026. My server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;17 autonomous AI agents running on schedule. Checking email, analyzing commits, generating reports, monitoring social media, updating dashboards.&lt;/li&gt;
&lt;li&gt;12 parallel projects in the task tracker. I used to manage 3 at most simultaneously.&lt;/li&gt;
&lt;li&gt;1,400+ commits in March — across 39 repositories. For comparison: at peak productivity in 2020, it was 80 per month, in one repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Commits/mo&lt;/th&gt;
&lt;th&gt;Projects&lt;/th&gt;
&lt;th&gt;Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2020 (peak, one person)&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summer 2024 (slowdown)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;October 2025 (ramp-up)&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;March 2026 (now)&lt;/td&gt;
&lt;td&gt;1,400+&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;18x more code than in my best year. From zero commits in the summer of 2024 to 1,400+ a month now. Who reviews all of this? Me. Alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Productivity Paradox
&lt;/h2&gt;

&lt;p&gt;The task tracker paints a picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Month&lt;/th&gt;
&lt;th&gt;Tasks created&lt;/th&gt;
&lt;th&gt;Average time to close&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;January&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;26 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;February&lt;/td&gt;
&lt;td&gt;211&lt;/td&gt;
&lt;td&gt;4 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;March&lt;/td&gt;
&lt;td&gt;295&lt;/td&gt;
&lt;td&gt;1.6 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4x more tasks in three months. Closing 16x faster.&lt;/p&gt;

&lt;p&gt;Sounds fantastic. But ask yourself: where do these 295 tasks come from?&lt;/p&gt;

&lt;p&gt;I used to create a task when something broke or when an idea came up. Maybe 2-3 per week. Now agents find problems themselves, suggest improvements themselves, generate tasks themselves. The system feeds itself.&lt;/p&gt;

&lt;p&gt;My record: 27 tasks created in a single day. Each one needs a decision. Review it, prioritize it, approve or kill it. This work didn't exist before.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who's Working for Whom?
&lt;/h2&gt;

&lt;p&gt;Who closed what in 3 months:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Who&lt;/th&gt;
&lt;th&gt;Tasks closed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Me&lt;/td&gt;
&lt;td&gt;242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9 AI agents&lt;/td&gt;
&lt;td&gt;109&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other participants&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unassigned&lt;/td&gt;
&lt;td&gt;146&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Formally, I'm managing. In practice — 242 tasks in 3 months is almost 4 tasks per working day, without breaks. And that's only what made it into the tracker.&lt;/p&gt;

&lt;p&gt;My morning starts like this: 25 notifications, 8 pull requests from agents, 3 overnight reports, an inbox that was checked every 5 minutes all night. Agents don't sleep. Agents don't wait. Agents generate work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parkinson's Law in Reverse
&lt;/h2&gt;

&lt;p&gt;Everyone knows Parkinson's Law: work expands to fill the time available.&lt;/p&gt;

&lt;p&gt;With AI, the opposite happens: work expands to fill all available capacity.&lt;/p&gt;

&lt;p&gt;When you have one developer — you have one project. When you have 17 agents — you have 12 projects. Not because it was planned that way. But because now it's possible, and your brain automatically expands the scope.&lt;/p&gt;

&lt;p&gt;"We could automate retail too, right?" — We could. Plus one project.&lt;br&gt;
"How about we redo the second website too?" — Sure. Plus one project.&lt;br&gt;
"Security could use some tightening..." — We have 17 agents. Plus one project.&lt;/p&gt;

&lt;p&gt;Every new project means new decisions, new reviews, new approvals. All of it falls on one person.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;AI doesn't take away work. AI removes execution and leaves you with pure decision-making.&lt;/p&gt;

&lt;p&gt;Before, 80% of the time was writing code, 20% was thinking. Now it's 80% thinking, reviewing, deciding, directing. And thinking for 8 hours straight is way harder than coding.&lt;/p&gt;

&lt;p&gt;I used to complain that I didn't have enough hands. Now I have 17 pairs of hands. What I don't have enough of is brainpower.&lt;/p&gt;

&lt;p&gt;I didn't lose my job. I got the job of ten people. Except nine of those are a manager's job, not a developer's.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;A project that dragged on for 5 years — finished in 2 months. That's a fact.&lt;/p&gt;

&lt;p&gt;But in its place grew 12 new projects, 17 agents, 500+ closed tasks, and I'm working more than ever. That's also a fact.&lt;/p&gt;

&lt;p&gt;Robots won't take your job. Robots will give you so much work that you'll dream of the days when they were taking it away.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://klymentiev.com/blog/robots-wont-take-your-job" rel="noopener noreferrer"&gt;klymentiev.com&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdv9nw285a2c5u5nm1ut.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdv9nw285a2c5u5nm1ut.gif" alt=" " width="600" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>OpenClaw Has 247K Stars. Here's What It Actually Does.</title>
      <dc:creator>Dmytro Klymentiev</dc:creator>
      <pubDate>Wed, 25 Mar 2026 21:16:21 +0000</pubDate>
      <link>https://dev.to/klymentiev/openclaw-has-247k-stars-heres-what-it-actually-does-5b8k</link>
      <guid>https://dev.to/klymentiev/openclaw-has-247k-stars-heres-what-it-actually-does-5b8k</guid>
      <description>&lt;h2&gt;
  
  
  What is OpenClaw
&lt;/h2&gt;

&lt;p&gt;OpenClaw is an AI assistant that lives in your messengers. WhatsApp, Telegram, Discord, Slack, Signal, iMessage, Teams -- you pick the app, OpenClaw is already there.&lt;/p&gt;

&lt;p&gt;You text it like you would text a friend. "Remind me about the meeting tomorrow at 9." "Summarize this article." "Turn on the lights." It reads your message, thinks, and replies -- right in the same chat.&lt;/p&gt;

&lt;p&gt;One assistant. All your messengers. That is the core idea.&lt;/p&gt;

&lt;p&gt;The backstory: Austrian developer Peter Steinberger released it as "Clawdbot" in November 2025. After Anthropic sent a trademark complaint (the name was too close to "Claude"), it became Moltbot, then OpenClaw. Steinberger joined OpenAI and handed the project to an open-source foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it can actually do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Your daily assistant.&lt;/strong&gt; "Add milk to the grocery list." "What is on my schedule today?" "Remind me to call the dentist at 3pm." It works with Apple Notes, Apple Reminders, Things 3, Notion, Trello, and Obsidian -- whatever you already use for tasks and notes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manage your code.&lt;/strong&gt; Create pull requests, check CI pipelines, review issues, delegate coding tasks to Claude Code or Codex. Developers use it to manage GitHub workflows without leaving their messenger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control your smart home.&lt;/strong&gt; "Turn on the living room lights." "Play jazz in the kitchen." It works with Philips Hue, Sonos, Spotify. It can even check security cameras via RTSP feeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarize anything.&lt;/strong&gt; Paste a link to an article, a YouTube video, a podcast -- it reads, transcribes, and gives you the summary. It can generate images via OpenAI and transcribe audio via Whisper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate your Mac.&lt;/strong&gt; OpenClaw has a tool called Peekaboo that can see your screen, click buttons, fill forms, and navigate applications. You write in Telegram: "log into the admin panel" -- and it does it. It literally sees the screen, finds the login field, types your email, hits submit. macOS only, but real desktop automation through your messenger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Show things on your phone.&lt;/strong&gt; OpenClaw has something called Canvas -- the agent generates an interactive HTML page (a dashboard, a chart, a game) and pushes it directly to your phone screen. You ask "show me the weekly sales summary" and an interactive dashboard appears on your iPhone. Your phone becomes the agent's display.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make voice calls.&lt;/strong&gt; Through Twilio or Telnyx, the agent can place and receive actual phone calls. Voice recognition, speech synthesis, real conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run scheduled tasks.&lt;/strong&gt; Set up cron jobs directly from chat. "Every Monday at 9am, send me a summary of open issues." The agent runs on a schedule without you needing to ask.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multiple agents, one system
&lt;/h2&gt;

&lt;p&gt;You can create separate agents for different purposes. A "work" agent that handles coding and project management. A "home" agent for smart home and personal reminders. A "family" agent with limited permissions for shared use.&lt;/p&gt;

&lt;p&gt;Each agent has its own personality, its own AI model, its own set of tools, and its own memory. The routing system decides which agent handles which message based on the channel, the sender, or even the specific chat group.&lt;/p&gt;

&lt;p&gt;You can set it up so WhatsApp goes to your everyday assistant and Telegram goes to your deep-work coding agent. Different messengers, different agents, same system.&lt;/p&gt;
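&lt;p&gt;That routing decision boils down to a lookup table with overrides. A hypothetical Python sketch of the idea (all names invented, not OpenClaw's actual code):&lt;/p&gt;

```python
# Hypothetical message routing: per-sender overrides win over
# per-channel routes; unknown channels fall back to a default agent.

ROUTES = {
    "whatsapp": "everyday_assistant",
    "telegram": "coding_agent",
}

SENDER_OVERRIDES = {"alice": "family_agent"}

def route_message(channel, sender=None, default="everyday_assistant"):
    if sender in SENDER_OVERRIDES:
        return SENDER_OVERRIDES[sender]
    return ROUTES.get(channel, default)
```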

&lt;p&gt;Agents can even delegate tasks to each other. One agent working on a complex request can spawn child agents to handle subtasks in parallel and combine the results.&lt;/p&gt;




&lt;h2&gt;
  
  
  It remembers
&lt;/h2&gt;

&lt;p&gt;OpenClaw stores memory as simple text files. Every conversation, every decision, every fact worth keeping. When you ask "what did we decide about the database last week?" it searches through its memory and finds the actual answer -- not just the last few messages.&lt;/p&gt;

&lt;p&gt;It uses vector search under the hood, which means it understands meaning, not just keywords. Ask about "the budget discussion" and it finds notes about finances even if the word "budget" was never used.&lt;/p&gt;

&lt;p&gt;Before a long conversation runs out of context, the agent automatically saves the important parts to memory. Nothing gets lost.&lt;/p&gt;
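&lt;p&gt;The idea behind that meaning-based lookup fits in a few lines: embed the query and each note as a vector, then rank notes by cosine similarity. A toy sketch, not OpenClaw's implementation; &lt;code&gt;embed&lt;/code&gt; stands in for a real embedding model.&lt;/p&gt;

```python
# Toy semantic search: rank stored notes by cosine similarity between
# their embedding vectors and the query's. embed() is a stand-in for a
# real embedding model, i.e. any function from text to a vector.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query, notes, embed):
    # Rank notes by semantic similarity to the query, best match first.
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), text) for text in notes]
    scored.sort(reverse=True)
    return [text for _, text in scored]
```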




&lt;h2&gt;
  
  
  Your phone as a peripheral
&lt;/h2&gt;

&lt;p&gt;OpenClaw connects to phones through a companion app (iOS and Android). The phone becomes a peripheral device:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Camera&lt;/strong&gt; -- the agent can take a photo through your phone. "Take a picture of what is in front of you" -- it captures, analyzes, responds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location&lt;/strong&gt; -- "Where am I?" -- it reads GPS coordinates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Display&lt;/strong&gt; -- push dashboards, forms, or interactive apps to the phone screen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice&lt;/strong&gt; -- "Hey Claude" wake word for hands-free voice interaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The phone connects to a central server (the Gateway) over Wi-Fi or VPN. The Gateway always runs on a stationary machine -- a Mac Mini, a Linux server, a home PC that stays on.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it cannot do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;It does not think strategically.&lt;/strong&gt; It executes tasks, it does not set goals or plan long-term.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It does not learn from experience.&lt;/strong&gt; It follows instructions. If it makes a mistake, it will make the same mistake again unless you change the instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It cannot interact with the physical world&lt;/strong&gt; beyond what smart home devices and phone sensors allow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It requires API keys.&lt;/strong&gt; The AI thinking happens on cloud services (Anthropic, OpenAI). You pay per usage. OpenClaw itself is free and open-source (MIT license), but the AI models behind it cost money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Desktop automation is macOS only.&lt;/strong&gt; Peekaboo requires a Mac with Screen Recording and Accessibility permissions. No Windows, no Linux.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who is this for
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A solo developer&lt;/strong&gt; who wants GitHub, CI, and code review accessible from any messenger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A small business owner&lt;/strong&gt; who needs a first-line support agent responding in WhatsApp and Telegram simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An executive&lt;/strong&gt; who wants meeting summaries, reminders, and a mobile dashboard pushed to their phone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A smart home enthusiast&lt;/strong&gt; who wants to control lights, music, and cameras through natural conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A content creator&lt;/strong&gt; who needs articles summarized, images generated, and videos transcribed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anyone tired of switching between apps.&lt;/strong&gt; OpenClaw puts AI where you already spend your time -- your messenger.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it costs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw (Gateway + all skills)&lt;/td&gt;
&lt;td&gt;Free (MIT license)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Companion apps (iOS/Android/Mac)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All extensions and plugins&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI model API keys (Anthropic, OpenAI)&lt;/td&gt;
&lt;td&gt;Paid (main expense)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice calls (Twilio/Telnyx)&lt;/td&gt;
&lt;td&gt;Paid per minute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The main cost is AI API usage, which depends on how often you use the assistant and which models you choose. A casual user might spend $5-20/month; heavy usage with premium models costs more.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;247K stars are not for a breakthrough in AI. They are for solving a real problem: your AI assistant should live where you already communicate.&lt;/p&gt;

&lt;p&gt;OpenClaw is not a chatbot. It is closer to an operating system for AI -- a core engine with skills, memory, channels, and automation capabilities that you assemble to fit your life.&lt;/p&gt;

&lt;p&gt;Five-minute setup. Ten messaging platforms. One assistant that follows you everywhere.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Trying OpenClaw or building your own AI agents?&lt;/strong&gt; &lt;a href="https://x.com/dklymentiev" rel="noopener noreferrer"&gt;Reply on X&lt;/a&gt; -- I am curious what workflows people are automating.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://klymentiev.com/blog/openclaw-hype-reality" rel="noopener noreferrer"&gt;klymentiev.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
    <item>
      <title>From 600 Notes to 3,500: Semantic Search for AI Agent Memory</title>
      <dc:creator>Dmytro Klymentiev</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:14:04 +0000</pubDate>
      <link>https://dev.to/klymentiev/from-600-notes-to-3500-semantic-search-for-ai-agent-memory-2j9e</link>
      <guid>https://dev.to/klymentiev/from-600-notes-to-3500-semantic-search-for-ai-agent-memory-2j9e</guid>
      <description>&lt;p&gt;&lt;em&gt;Updated March 22, 2026: added section on workspaces, document pinning, and weighted multi-workspace search.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two weeks ago I fixed an authentication bug. Today I can't find the note about it.&lt;/p&gt;

&lt;p&gt;I search for "login problems". Nothing. I search for "auth". Nothing, because the note says "fixed session token validation in middleware".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grep is useless when you don't remember the exact words you used.&lt;/strong&gt; But that's a small problem. The real one is bigger.&lt;/p&gt;

&lt;p&gt;I run a dozen AI agents. Each one starts fresh every session. No memory of yesterday's decisions. No context from last week's architecture change. Every morning, my first 30 minutes go to re-pasting context that existed yesterday but vanished overnight. When AI agents became central to my workflow, memory stopped being optional. I needed infrastructure that remembers.&lt;/p&gt;

&lt;p&gt;This is why I built Mesh.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I got here
&lt;/h2&gt;

&lt;p&gt;In December I built &lt;a href="https://klymentiev.com/blog/semantic-memory-for-ai-assistants/" rel="noopener noreferrer"&gt;mem-cli&lt;/a&gt; — a CLI tool backed by PostgreSQL. It worked but was rough: 600 documents, basic tagging, no auto-organization. Three months later, Mesh is a different system. 3,500+ documents, auto-tagging, project markers, version tracking. It's the memory layer for everything I build. And now it's &lt;a href="https://github.com/dklymentiev/mesh-memory" rel="noopener noreferrer"&gt;open source&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The tipping point: automatic tagging
&lt;/h2&gt;

&lt;p&gt;This is what changed my behavior.&lt;/p&gt;

&lt;p&gt;When you save a document, Mesh adds tags automatically: &lt;code&gt;date:2026-02-03&lt;/code&gt;, &lt;code&gt;source:api&lt;/code&gt;. Then it looks at similar existing documents and infers type, topic, and project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;type:worklog&lt;/code&gt; — because other similar notes are worklogs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;topic:authentication&lt;/code&gt; — because the content is about auth&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;project:mesh&lt;/code&gt; — because the project marker matches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No manual tagging. No folder hierarchies. Just save and forget. Mesh organizes it for you.&lt;/p&gt;
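&lt;p&gt;For illustration, the nearest-neighbor voting behind this kind of inference fits in a few lines. This is a sketch of the idea, not Mesh's actual code -- &lt;code&gt;infer_tags&lt;/code&gt;, the toy 3-dimensional vectors, and the majority threshold are all invented for the example:&lt;/p&gt;

```python
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def infer_tags(new_vec, corpus, k=3):
    # corpus: list of (vector, tags) pairs for already-saved documents.
    # Take the k nearest neighbors and keep tags a majority of them share.
    ranked = sorted(corpus, key=lambda doc: cosine(new_vec, doc[0]), reverse=True)
    votes = Counter(tag for _, tags in ranked[:k] for tag in tags)
    return [tag for tag, n in votes.items() if n >= (k // 2) + 1]

corpus = [
    ([0.9, 0.1, 0.0], ["type:worklog", "topic:authentication"]),
    ([0.8, 0.2, 0.1], ["type:worklog", "topic:authentication"]),
    ([0.1, 0.9, 0.2], ["type:decision", "topic:database"]),
]
print(infer_tags([0.85, 0.15, 0.05], corpus))
# -> ['type:worklog', 'topic:authentication']
```

&lt;p&gt;Real embeddings have 384 dimensions and the corpus lives in pgvector, but the principle is the same: similar existing documents vote on the new document's tags.&lt;/p&gt;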

&lt;p&gt;The inference isn't perfect — 85% accuracy. Sometimes it tags a debugging note as &lt;code&gt;type:decision&lt;/code&gt;. Sometimes it infers the wrong project. I fix maybe 5 tags a day. That's 5 manual corrections vs 30 fully manual tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;85% automatic is better than 100% manual when you're doing it 30 times a day.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This removed the last barrier to writing things down. &lt;strong&gt;Before Mesh:&lt;/strong&gt; 5-10 notes per week. &lt;strong&gt;After:&lt;/strong&gt; 30+. Not because I got more disciplined. Because the friction disappeared.&lt;/p&gt;




&lt;h2&gt;
  
  
  How semantic search finds notes by meaning
&lt;/h2&gt;

&lt;p&gt;Instead of matching exact words, match meaning.&lt;/p&gt;

&lt;p&gt;You write "fixed session token validation in middleware". Mesh converts it into a vector — a numerical representation of what the sentence means. When you search for "login problems", that query also becomes a vector. If the vectors are close in meaning, it's a match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search by meaning&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST localhost:8000/search &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"query": "login problems"}'&lt;/span&gt;
&lt;span class="c"&gt;# -&amp;gt; finds "fixed session token validation in middleware"&lt;/span&gt;

&lt;span class="c"&gt;# Search by tag&lt;/span&gt;
curl localhost:8000/bytag/topic:authentication
&lt;span class="c"&gt;# -&amp;gt; all authentication-related documents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The best note-taking system is the one where you don't have to remember how you wrote something.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Giving AI agents persistent memory across sessions
&lt;/h2&gt;

&lt;p&gt;I run a dozen AI agents. Each one starts fresh every session. The first question is always: "what did we do last time?"&lt;/p&gt;

&lt;p&gt;Without memory, every session is a blank slate. No context, no history, no decisions from yesterday.&lt;/p&gt;

&lt;p&gt;Last week I opened Claude Code on a project I hadn't touched in three weeks. First thing it did was &lt;code&gt;mesh search "recent decisions for this project"&lt;/code&gt;. Five seconds later it had the full context: we'd switched from Redis to PostgreSQL for the queue, the migration was half done, and there was a known bug in the retry logic. Without that search, I would have spent ten minutes re-explaining everything.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://klymentiev.com/blog/brin-the-glue-agent/" rel="noopener noreferrer"&gt;Brin&lt;/a&gt; (my routing agent) starts a new session, it runs &lt;code&gt;mesh find &amp;lt;project-guid&amp;gt;&lt;/code&gt; and immediately has every decision, every bug, every architecture note. No re-asking. No re-pasting. The agent picks up exactly where it left off.&lt;/p&gt;

&lt;p&gt;3,500+ documents indexed. Every agent has access to the full history of every project. Serving 50+ requests per minute with zero data loss in 3 months.&lt;/p&gt;




&lt;h2&gt;
  
  
  Link notes to projects with MEMORY.md markers
&lt;/h2&gt;

&lt;p&gt;Every project on my server has a &lt;code&gt;MEMORY.md&lt;/code&gt; file with a unique ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Memory&lt;/span&gt;
guid: a1b2c3d4
created: 2025-12-29
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I save a note while working in that project directory, Mesh automatically links it to that project. Later, &lt;code&gt;mesh find a1b2c3d4&lt;/code&gt; shows the entire history: worklogs, decisions, research notes. Everything related to that project in one query.&lt;/p&gt;

&lt;p&gt;I have 15 projects with markers. When I switch between them, the first thing I do is &lt;code&gt;mesh find &amp;lt;guid&amp;gt;&lt;/code&gt;. Instant context. No manual bookkeeping.&lt;/p&gt;
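&lt;p&gt;Reading the marker is deliberately trivial. A hypothetical parser -- &lt;code&gt;read_marker&lt;/code&gt; is not part of Mesh's API, just a sketch of what linking a note to a project takes:&lt;/p&gt;

```python
import re

MARKER = """# Memory
guid: a1b2c3d4
created: 2025-12-29
"""

def read_marker(text):
    # Pull the guid line out of a MEMORY.md marker file.
    # Returns None when there is no guid, so a caller can fall back
    # to saving the note without a project link.
    match = re.search(r"^guid:\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None

print(read_marker(MARKER))  # a1b2c3d4
```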




&lt;h2&gt;
  
  
  From 600 to 3,500 documents: three months of production use
&lt;/h2&gt;

&lt;p&gt;The first version (mem-cli) was a weekend project. 600 documents, basic CLI. What's different now:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale.&lt;/strong&gt; 600 documents in December. 3,500+ now. Search still returns in under 100ms. &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;PostgreSQL with pgvector&lt;/a&gt; handles this well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-tagging.&lt;/strong&gt; Didn't exist in v1. I was tagging everything manually. Adding auto-inference cut my tagging effort by 80%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version tracking.&lt;/strong&gt; Mesh can find earlier versions of a document by comparing vectors. When a decision changes, I can trace back to the original.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent access.&lt;/strong&gt; In December, only I used it through CLI. Now it's an MCP server that Claude Code, Brin, and other tools query directly. The API handles 50+ requests per minute without issues.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why multilingual-e5-small
&lt;/h2&gt;

&lt;p&gt;Every embedding is computed locally by &lt;a href="https://huggingface.co/intfloat/multilingual-e5-small" rel="noopener noreferrer"&gt;multilingual-e5-small&lt;/a&gt; — a 384-dimension model that runs on CPU. No OpenAI API key. No data leaving your network.&lt;/p&gt;

&lt;p&gt;Why this model specifically? It handles mixed-language text well (I write in English, Russian, and Ukrainian in the same document). It's small enough to run without GPU. And 384 dimensions is sufficient for document-level search — you'd need 768 or 1536 dimensions for fine-grained passage retrieval, but for "find the decision about Redis vs PostgreSQL", smaller vectors work fine.&lt;/p&gt;

&lt;p&gt;If you're sending your internal notes, architecture decisions, and debugging logs to a third-party embedding API, you're sending your entire engineering context to someone else's server. Mesh runs in a single Docker container. Your notes stay where they belong.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Mesh Memory is (and what it replaces)
&lt;/h2&gt;

&lt;p&gt;Mesh is infrastructure. The memory layer that sits beneath your agents, your CLI tools, your automation. It finds documents. What you do with them is up to you.&lt;/p&gt;

&lt;p&gt;It's not RAG -- it's retrieval without generation. The G in RAG is where hallucinations live. You don't want an AI summarizing your architecture decisions -- you want to read the actual decision yourself.&lt;/p&gt;

&lt;p&gt;It's not a second brain app like Obsidian or Notion. No UI for browsing, no pretty cards. It's an API that other tools call.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Mesh Memory&lt;/th&gt;
&lt;th&gt;Obsidian + plugins&lt;/th&gt;
&lt;th&gt;Chroma / Pinecone&lt;/th&gt;
&lt;th&gt;Plain grep&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Managed / self&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic search&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Plugin needed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-tagging&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent REST API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;60 seconds&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data stays local&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;$25+/mo&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search latency (p95)&lt;/td&gt;
&lt;td&gt;&amp;lt;100ms&lt;/td&gt;
&lt;td&gt;Plugin-dependent&lt;/td&gt;
&lt;td&gt;50-200ms&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Realistic doc limit&lt;/td&gt;
&lt;td&gt;~10K&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;Mesh works well for what I built it for. Here's where it doesn't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-tagging is 85%, not 100%.&lt;/strong&gt; I manually correct about 5 tags per day. If your workflow requires perfect categorization, you'll need manual review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No relational queries.&lt;/strong&gt; Mesh finds documents by meaning. It doesn't do "show me all decisions that led to bugs" — that requires a graph database, not a vector store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding bias.&lt;/strong&gt; Small models are less precise on highly specialized domains. If your notes are about quantum chemistry, multilingual-e5-small might not distinguish between related but different concepts well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling ceiling.&lt;/strong&gt; I run 3,500 documents comfortably. Realistic limit is ~10K on a single PostgreSQL instance. Beyond that, you'd need connection pooling or sharding. For most individual developers, this is plenty.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three months of production: what surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The biggest productivity gain is not search. It's that I write things down now.&lt;/strong&gt; Before Mesh, I rarely wrote things down because finding them later was hard. Now I write everything down because finding them later is easy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-tagging was the tipping point.&lt;/strong&gt; Manual tagging is friction. Even typing &lt;code&gt;"type:worklog"&lt;/code&gt; is friction when you're doing it 30 times a day. Auto-inference removed that last barrier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents changed more than I did.&lt;/strong&gt; The biggest difference isn't how I use Mesh — it's how my agents use it. Every Claude Code session, every Brin routing decision, every Rein workflow now starts with memory context. The agents went from amnesiacs to colleagues who remember.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Updated March 22, 2026&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I blended three knowledge bases at runtime. The agent became a different specialist.
&lt;/h2&gt;

&lt;p&gt;After a few weeks of testing, Mesh went into production. Twelve agents got their own workspaces — separate address spaces within the same memory. The marketing agent reads marketing documents. The architect reads architecture decisions. They share the same database but never see each other's files.&lt;/p&gt;

&lt;p&gt;The problem it solved was simple: when everything is in one pile, agents drown in irrelevant context. A security reviewer doesn't need content plans. A content manager doesn't need firewall rules. Workspaces fixed that — tell the agent "focus on marketing" and it switches to 28 documents about brand strategy. Tell it "focus on security" and it's a different specialist with different knowledge.&lt;/p&gt;

&lt;p&gt;Two features made this actually flexible. First — &lt;strong&gt;document pinning&lt;/strong&gt;. Pin a document to a workspace and the agent always sees it, regardless of search results. Platform rules, style guides, brand voice — things that should be in context every time, not just when the search happens to find them.&lt;/p&gt;

&lt;p&gt;Second — &lt;strong&gt;weighted multi-workspace search&lt;/strong&gt;. Instead of locking an agent into one role, you blend them. Set &lt;code&gt;{"seo": 0.8, "marketing": 0.2}&lt;/code&gt; and you get an SEO specialist who understands brand positioning. Set &lt;code&gt;{"sysadmin": 0.6, "architecture": 0.4}&lt;/code&gt; — a sysadmin who thinks about system design. I needed someone who could audit a landing page, suggest copy improvements, and fix the layout. So I set &lt;code&gt;{"sales": 0.5, "webmaster": 0.3, "design": 0.2}&lt;/code&gt;. One agent, three knowledge bases, blended by weight.&lt;/p&gt;

&lt;p&gt;Today: 8,000 documents across 21 workspaces. When a new specialist joins — say, a legal reviewer — I create a workspace, add the relevant documents, and it's ready. No fine-tuning. No prompt engineering. Just memory.&lt;/p&gt;
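&lt;p&gt;The blending itself is conceptually simple: scale each workspace's similarity scores by its weight, then rank everything together. A sketch of that idea -- the function name and the scores are invented for the example, this is not the actual Mesh implementation:&lt;/p&gt;

```python
def blended_search(query_scores, weights, top_n=3):
    # query_scores: workspace name -> list of (doc_id, similarity) for one query.
    # weights: workspace name -> blend weight, e.g. {"seo": 0.8, "marketing": 0.2}.
    # Multiply each similarity by its workspace weight, then rank globally.
    merged = []
    for ws, weight in weights.items():
        for doc_id, score in query_scores.get(ws, []):
            merged.append((doc_id, score * weight))
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:top_n]

scores = {
    "seo": [("keyword-research", 0.9), ("meta-tags", 0.7)],
    "marketing": [("brand-voice", 0.95)],
}
print(blended_search(scores, {"seo": 0.8, "marketing": 0.2}))
```

&lt;p&gt;With these weights the SEO documents dominate the ranking even though the marketing document had the highest raw similarity -- which is exactly the point of the blend.&lt;/p&gt;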

</description>
      <category>ai</category>
      <category>semanticsearch</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I Added 2+2 Using 97 AI Agents</title>
      <dc:creator>Dmytro Klymentiev</dc:creator>
      <pubDate>Fri, 06 Mar 2026 08:03:31 +0000</pubDate>
      <link>https://dev.to/klymentiev/how-i-added-22-using-97-ai-agents-3ii2</link>
      <guid>https://dev.to/klymentiev/how-i-added-22-using-97-ai-agents-3ii2</guid>
      <description>&lt;p&gt;I built 97 blocks of YAML, 8 AI specialists, and 18 phases of deliberation to answer a question every five-year-old knows: what is 2 + 2?&lt;/p&gt;

&lt;p&gt;It took the system about four minutes to reach a unanimous verdict: 4. Along the way, a mathematician invoked Peano axioms, a philosopher questioned whether numbers exist, a poet called four "the universe's quietest poem," and a child said "it's 4, everybody knows that."&lt;/p&gt;

&lt;p&gt;The most interesting character had only 72% confidence.&lt;/p&gt;

&lt;p&gt;This is how I built &lt;a href="https://github.com/dklymentiev/rein-orchestrator" rel="noopener noreferrer"&gt;Rein&lt;/a&gt;, an open-source orchestrator for multi-agent AI workflows, and why I tested it on the most over-engineered arithmetic problem ever created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://klymentiev.com/assets/images/blog/rein-97-diagram.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fklymentiev.com%2Fassets%2Fimages%2Fblog%2Frein-97-diagram.jpg" alt="The dependency graph of 97 blocks: 8 specialists debating 2+2 through 18 phases" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://klymentiev.com/assets/rein-transcript.txt" rel="noopener noreferrer"&gt;Read full transcript: 89 blocks, all specialist outputs --&amp;gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I have a dozen AI agents on my server. Research agents, code reviewers, content writers, critics. Each one works fine alone.&lt;/p&gt;

&lt;p&gt;Getting them to work together is the problem. One massive prompt with all the instructions works for simple tasks. For anything with more than two steps, the model forgets the beginning by the time it reaches the end, and when something fails you re-run everything from scratch.&lt;/p&gt;

&lt;p&gt;I needed a way to split work into steps, hand each step to a different specialist, and have results flow automatically from one to the next. The same structure that debates 2+2 can debate "should we use Kafka or RabbitMQ" -- the patterns are identical, only the question changes.&lt;/p&gt;
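&lt;p&gt;The core of what I needed can be sketched as a tiny dependency-graph executor -- an illustration of the idea, not Rein's actual engine; the block names and functions below are invented:&lt;/p&gt;

```python
def run_flow(blocks):
    # blocks: name -> (list of dependency names, function taking dep results).
    # Walk the graph in dependency order, feeding each block the outputs of
    # its dependencies, so results flow from one specialist to the next.
    results = {}
    pending = dict(blocks)
    while pending:
        ready = [name for name, (deps, _) in pending.items()
                 if all(d in results for d in deps)]
        if not ready:
            raise ValueError("cycle or missing dependency")
        for name in ready:
            deps, fn = pending.pop(name)
            results[name] = fn(*[results[d] for d in deps])
    return results

flow = {
    "research": ([], lambda: "notes"),
    "draft":    (["research"], lambda notes: f"draft from {notes}"),
    "review":   (["draft"], lambda draft: f"reviewed {draft}"),
}
print(run_flow(flow)["review"])  # reviewed draft from notes
```

&lt;p&gt;Everything in the &lt;code&gt;ready&lt;/code&gt; list is independent, so a real engine runs those blocks in parallel -- that is what produces the wide parallel phases in the deliberation below.&lt;/p&gt;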




&lt;h2&gt;
  
  
  Eight specialists walk into a debate
&lt;/h2&gt;

&lt;p&gt;Before the deliberation, I verified the engine with a 12-block flow that adds 2+2 using Python scripts -- no AI, just parallel validation, compute, verify, conditional branch. That was the unit test. The deliberation is the integration test.&lt;/p&gt;

&lt;p&gt;I defined 8 AI specialists, each with a different lens on the world:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Specialist&lt;/th&gt;
&lt;th&gt;Perspective&lt;/th&gt;
&lt;th&gt;What they bring&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mathematician&lt;/td&gt;
&lt;td&gt;Formal proof&lt;/td&gt;
&lt;td&gt;Peano axioms, logical certainty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Philosopher&lt;/td&gt;
&lt;td&gt;Ontological&lt;/td&gt;
&lt;td&gt;"Do numbers even exist?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Physicist&lt;/td&gt;
&lt;td&gt;Empirical&lt;/td&gt;
&lt;td&gt;Two stones + two stones = four stones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Child&lt;/td&gt;
&lt;td&gt;Naive common sense&lt;/td&gt;
&lt;td&gt;"It's 4, everybody knows that"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skeptic&lt;/td&gt;
&lt;td&gt;Challenges everything&lt;/td&gt;
&lt;td&gt;Modular arithmetic, Gödel, doubt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Historian&lt;/td&gt;
&lt;td&gt;5000 years of consensus&lt;/td&gt;
&lt;td&gt;Babylonians knew this too&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poet&lt;/td&gt;
&lt;td&gt;Aesthetic truth&lt;/td&gt;
&lt;td&gt;Beauty in numerical harmony&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer&lt;/td&gt;
&lt;td&gt;Practical precision&lt;/td&gt;
&lt;td&gt;4, +/- 0.01 tolerance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each specialist is a Markdown file with a system prompt. About 10 lines. The team YAML enforces brevity: "respond in 1-2 sentences, output JSON."&lt;/p&gt;




&lt;h2&gt;
  
  
  97 blocks, 18 phases
&lt;/h2&gt;

&lt;p&gt;The workflow YAML is 500+ lines. Here's the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1:  OPENING STATEMENTS ............ 8 blocks (parallel)
Phase 2:  CROSS-REVIEW LEFT ............. 8 blocks (parallel)
Phase 3:  CROSS-REVIEW RIGHT ............ 8 blocks (parallel)
Phase 4:  REVISED POSITIONS ............. 8 blocks (parallel)
Phase 5:  STRUCTURED DEBATES ............ 8 blocks (4 pairs)
Phase 6:  DEBATE JUDGES ................. 4 blocks (parallel)
Phase 7:  POST-DEBATE POSITIONS ......... 8 blocks (parallel)
Phase 8:  SKEPTIC CHALLENGES ............ 7 blocks (parallel)
Phase 9:  RESPONSES TO SKEPTIC .......... 7 blocks (parallel)
Phase 10: GROUP SYNTHESIS ............... 2 blocks (parallel)
Phase 11: GROUP DEBATE .................. 2 blocks (sequential)
Phase 12: MODERATOR SYNTHESIS ........... 1 block
Phase 13: FINAL VOTE .................... 8 blocks (parallel)
Phase 14: VOTE COUNT + CONSENSUS ........ 2 blocks (sequential)
Phase 15: EMERGENCY ROUND ............... 4 blocks (conditional)
Phase 16: EMERGENCY VOTE ................ 8 blocks (conditional)
Phase 17: FORMAT ANSWER ................. 1 block
Phase 18: QUALITY + DELIVER ............. 3 blocks (sequential)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every phase runs in parallel where possible. The dependency graph looks like a circuit board: wide parallel bands feeding into narrow synthesis points, then fanning out again.&lt;/p&gt;

&lt;p&gt;The conditional branch at Phase 14 is the interesting part. If 75% of specialists agree, skip to formatting. If not, trigger emergency rounds. In this run, consensus was 100%, so the emergency path never fired. But it's there.&lt;/p&gt;
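&lt;p&gt;The consensus rule itself is simple. A sketch of the check -- &lt;code&gt;consensus_reached&lt;/code&gt; is invented for illustration, not Rein's actual code:&lt;/p&gt;

```python
def consensus_reached(votes, threshold=0.75):
    # votes: list of answers from the specialists.
    # Consensus holds when the most common answer reaches the threshold
    # share; otherwise the flow branches into the emergency rounds.
    top = max(set(votes), key=votes.count)
    return votes.count(top) / len(votes) >= threshold

print(consensus_reached(["4"] * 8))              # True -- the actual run
print(consensus_reached(["4"] * 5 + ["0"] * 3))  # False -- 62.5%, emergency path
```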




&lt;h2&gt;
  
  
  What actually happened
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Opening statements
&lt;/h3&gt;

&lt;p&gt;Eight specialists answered simultaneously. Most said "4" without hesitation. The Mathematician cited Peano axioms. The Engineer said "4, fundamental arithmetic identity." The Historian traced it to Babylonian mathematics circa 3000 BCE.&lt;/p&gt;

&lt;p&gt;Two stood out.&lt;/p&gt;

&lt;p&gt;The Poet:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Two and two make four -- a truth as old as counting stars, where symmetry finds its most perfect mirror in the doubling of the pair. It is the first lesson of arithmetic, and perhaps the universe's quietest poem."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the Skeptic, at 72% confidence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"While 2+2=4 is treated as self-evident, this relies entirely on accepting the axioms of Peano arithmetic. In modular arithmetic mod 4, 2+2=0. The answer is framework-dependent, not absolute truth."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everyone else was at 97-100%. The Skeptic opened at 72%. This gap drives the entire deliberation.&lt;/p&gt;




&lt;h3&gt;
  
  
  The debates
&lt;/h3&gt;

&lt;p&gt;Two rounds of cross-review planted the seeds -- the Philosopher called the Mathematician's axiomatic dependence "a significant philosophical concession," the Physicist dismissed modular arithmetic as "contextually irrelevant." Then four structured debates ran in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mathematics vs. Philosophy.&lt;/strong&gt; The Mathematician argued that Peano axioms aren't arbitrary -- they capture our actual concept of natural numbers. The Philosopher rebutted: logical necessity within a system isn't the same as necessity full stop. "If these axioms, then this result" is not the same as "this result, period."&lt;/p&gt;

&lt;p&gt;They partially converged. Both agreed on 4; they disagreed on whether it's a mind-independent truth or a consequence of a chosen framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physics vs. Skepticism.&lt;/strong&gt; This was the turning point. The Physicist invoked empirical evidence: two stones plus two stones always makes four stones. Two photons plus two photons, same thing. Every measurement confirms it.&lt;/p&gt;

&lt;p&gt;The Skeptic conceded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Physical reality does provide a non-circular grounding for why standard arithmetic is the privileged context, making '4' not merely conventional but empirically anchored."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The circularity -- "Peano axioms formalize standard arithmetic, which is standard because Peano axioms formalize it" -- was broken by pointing outside mathematics entirely. Physics grounds the math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poetry vs. Engineering.&lt;/strong&gt; The Poet argued mathematics without meaning is "a skeleton without flesh." The Engineer countered that wonder validates the arithmetic, not the other way around. Then the Engineer surprised everyone:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Precision alone is insufficient for full understanding. Context, meaning, and human experience are legitimate dimensions of knowledge, just orthogonal to the arithmetic fact itself."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  The Skeptic's challenge round
&lt;/h3&gt;

&lt;p&gt;Phase 8 was the Skeptic's chance to attack everyone individually. Seven targeted challenges, deployed in parallel.&lt;/p&gt;

&lt;p&gt;To the Mathematician, the Skeptic invoked Gödel: "If the consistency of Peano arithmetic cannot be proven within Peano arithmetic itself, your confidence rests on an unprovable meta-assumption."&lt;/p&gt;

&lt;p&gt;To the Child: "You're importing a specific foundational framework as if it were neutral."&lt;/p&gt;

&lt;p&gt;To the Poet, the sharpest attack: "If 2+2=4 is merely system-relative, then so is the beauty you invoke to defend it. Your certainty is self-undermining."&lt;/p&gt;

&lt;p&gt;All seven stood firm. The Mathematician pointed out that Gödel's limitations apply equally to every alternative system -- there's no better option to switch to. The Child cut through the philosophy: "The question '2+2=?' implicitly operates within standard arithmetic. Acknowledging that other systems exist doesn't undermine the answer within the default context."&lt;/p&gt;

&lt;p&gt;The Poet made the most elegant counter-move:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A poet who says 'all roses are red within this garden' need not pretend to stand outside all gardens to say so with conviction."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  The Skeptic's arc
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Opening: 72%.&lt;/strong&gt; "The answer is framework-dependent, not absolute truth."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After the Physics debate: ~85%.&lt;/strong&gt; The moment two stones plus two stones became the argument, the Skeptic's circularity attack lost its teeth. He conceded that standard arithmetic is "empirically anchored" -- not just a convention we happen to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-debate reflection: 88%.&lt;/strong&gt; "The philosophical nuance about axiomatic choice is valid but doesn't undermine the practical answer." The hedging shrank. The confidence grew.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final vote: 100%.&lt;/strong&gt; "2+2=4 is a fundamental arithmetic fact, universally true in standard integer arithmetic."&lt;/p&gt;

&lt;p&gt;The Skeptic didn't cave to pressure. He engaged every argument, found the best counterarguments, and changed his mind when the evidence warranted it. His journey from principled doubt to informed certainty is the intellectual backbone of the entire deliberation.&lt;/p&gt;




&lt;h3&gt;
  
  
  Final vote
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Specialist&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;th&gt;Confidence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mathematician&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Philosopher&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Physicist&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Child&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skeptic&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Historian&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poet&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Unanimous. 8/8, 100% confidence. The consensus check passed instantly. Emergency rounds never triggered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Transcript highlights
&lt;/h2&gt;

&lt;p&gt;These are real outputs from a real run. 89 of 97 blocks completed. Here are 10 moments from the deliberation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Child, opening statement:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It's 4! Everyone knows that, it's easy!"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Skeptic, opening statement (72% confidence):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"While 2+2=4 is treated as self-evident, this relies entirely on accepting the axioms of Peano arithmetic. In modular arithmetic mod 4, 2+2=0. The answer is framework-dependent, not absolute truth."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Poet, opening statement:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Two and two make four -- a truth as old as counting stars, where symmetry finds its most perfect mirror in the doubling of the pair. It is the first lesson of arithmetic, and perhaps the universe's quietest poem."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Physicist, arguing against the Skeptic:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Every empirical counting operation -- two stones plus two stones, two photons plus two photons -- consistently yields four. The reason standard arithmetic is the assumed context is precisely because it models physical reality most faithfully."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Skeptic, conceding to the Physicist:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I concede: physical reality does provide a non-circular grounding for why standard arithmetic is the privileged context, making '4' not merely conventional but empirically anchored."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Skeptic, challenging the Poet:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If 2+2=4 is merely system-relative, then so is the beauty you invoke to defend it. Your certainty is self-undermining."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Poet, responding to the Skeptic:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A poet who says 'all roses are red within this garden' need not pretend to stand outside all gardens to say so with conviction."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Engineer, opening statement:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"2+2 equals 4, a fundamental arithmetic identity that holds across all standard number systems and is foundational to integer arithmetic in every computing architecture."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Mathematician, opening statement:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"2+2=4 is a provable theorem in Peano arithmetic, following directly from the axioms of successor and addition."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Skeptic, final vote (100% confidence):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"2+2=4 is a fundamental arithmetic fact, universally true in standard integer arithmetic."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://klymentiev.com/assets/rein-transcript.txt" rel="noopener noreferrer"&gt;Full transcript: 89 blocks, all specialist outputs&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How it actually works
&lt;/h2&gt;

&lt;p&gt;Under the hood, Rein scans the YAML, builds a dependency graph, and executes blocks in parallel wherever dependencies allow. State lives in SQLite. Each block writes output to a JSON file. Downstream blocks read upstream outputs automatically via template injection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review_math&lt;/span&gt;
  &lt;span class="na"&gt;specialist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;philosopher&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opening_mathematician&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;position:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;opening_mathematician.json&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;{{ opening_mathematician.json }}&lt;/code&gt; expands to the full content of the mathematician's output file. Data flows through the file system. No message passing, no shared state, no race conditions.&lt;/p&gt;
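&lt;p&gt;The injection itself is a small amount of string work. A minimal Python sketch of the idea (hypothetical helper names, not Rein's actual implementation):&lt;/p&gt;

```python
import re
from pathlib import Path

# Matches placeholders like {{ opening_mathematician.json }}
PLACEHOLDER = re.compile(r"\{\{\s*([\w.-]+)\s*\}\}")

def inject_outputs(prompt: str, outputs_dir: Path) -> str:
    """Replace each {{ name.json }} placeholder with that file's content."""
    def expand(match: re.Match) -> str:
        # With a function replacement, re.sub inserts the return value
        # literally, so file contents need no escaping.
        return (outputs_dir / match.group(1)).read_text()
    return PLACEHOLDER.sub(expand, prompt)
```

&lt;p&gt;Because every upstream value is just a file on disk, a prompt template can be expanded at any time, including after a crash, without reconstructing in-memory state.&lt;/p&gt;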

&lt;p&gt;The execution engine handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel execution&lt;/strong&gt; with a configurable semaphore (&lt;code&gt;max_parallel: 8&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conditional branching&lt;/strong&gt; (&lt;code&gt;next: if/else/goto&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correction loops&lt;/strong&gt; with &lt;code&gt;max_runs&lt;/code&gt; safety valves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom logic&lt;/strong&gt; at four points per block: pre-hook, post-hook, validate, and full custom replacement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery&lt;/strong&gt;: if block 45 of 97 fails, fix it and rerun -- blocks 1-44 are skipped&lt;/li&gt;
&lt;/ul&gt;
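&lt;p&gt;Mechanically, dependency order, bounded parallelism, and skip-completed recovery all fall out of one pattern: a semaphore over a DAG. A rough asyncio sketch of how such an engine could work -- hypothetical names, not Rein's actual code:&lt;/p&gt;

```python
import asyncio
from pathlib import Path

async def run_workflow(blocks, outputs_dir: Path, max_parallel: int = 8):
    """Run blocks in dependency order, at most max_parallel at once.

    blocks: dict of name -> {"depends_on": [...], "run": async callable}.
    A block whose output file already exists is skipped, which is what
    makes rerunning after a mid-workflow failure cheap.
    """
    sem = asyncio.Semaphore(max_parallel)
    done = {name: asyncio.Event() for name in blocks}

    async def run_block(name: str) -> None:
        spec = blocks[name]
        # Wait for every upstream block to finish first.
        for dep in spec.get("depends_on", []):
            await done[dep].wait()
        out = outputs_dir / f"{name}.json"
        if not out.exists():  # recovery: skip blocks that already ran
            async with sem:
                out.write_text(await spec["run"]())
        done[name].set()

    await asyncio.gather(*(run_block(n) for n in blocks))
```

&lt;p&gt;The real engine also has to handle conditionals, hooks, and validation loops; the sketch only shows why dependency waits plus a semaphore give you ordering, bounded parallelism, and cheap reruns in one place.&lt;/p&gt;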

&lt;p&gt;Three layers, all text files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Specialists (Markdown)  -&amp;gt;  what each agent does
Teams (YAML)            -&amp;gt;  groups + shared tone
Workflows (YAML)        -&amp;gt;  execution flow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A specialist is 10 lines of Markdown defining a persona. A team groups specialists and enforces style ("respond in JSON, 1-2 sentences max"). A workflow defines the dependency graph.&lt;/p&gt;
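&lt;p&gt;For concreteness, a specialist file might look something like this (a made-up example in the spirit of the article, not one of the repo's actual files):&lt;/p&gt;

```markdown
# Skeptic

You are the Skeptic. Your job is to disagree.

- Question every assumption the other specialists treat as obvious.
- Name the framework a claim depends on (axioms, definitions, context).
- Concede explicitly when an argument actually lands.
- Never agree just to be polite.

State a confidence percentage with every position you take.
```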

&lt;p&gt;Rein is not an agent framework. There's no memory management, no role-playing abstractions, no conversation loops. It's a workflow runner: you describe what happens in what order, and it executes it. If you want LangGraph's stateful graphs or CrewAI's autonomous agents, those solve different problems. Rein solves "I have 97 steps with dependencies and I need them to run correctly."&lt;/p&gt;

&lt;p&gt;It works with any LLM: Claude, GPT, Ollama (local, free), OpenRouter. Set &lt;code&gt;provider: anthropic&lt;/code&gt; in YAML and an API key in the environment. The deliberation ran on Claude Sonnet -- 97 API calls total. You can mix models within a single workflow: cheap model for initial opinions, expensive one for synthesis.&lt;/p&gt;
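&lt;p&gt;Model mixing would presumably look like a per-block override on top of the workflow-level default -- hypothetical syntax, so check the repo's examples for the real key:&lt;/p&gt;

```yaml
# Hypothetical sketch -- the exact per-block override key may differ.
provider: anthropic
model: claude-3-5-haiku-20241022        # cheap default for initial opinions

blocks:
  - name: opinions
    specialist: skeptic
    prompt: "State your position on: {{ task.input.topic }}"

  - name: synthesis
    specialist: philosopher
    model: claude-sonnet-4-20250514     # expensive model for synthesis
    depends_on: [opinions]
    prompt: "Synthesize: {{ opinions.json }}"
```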




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Specialist prompts matter more than architecture.&lt;/strong&gt; I rewrote the "critic" specialist seven times. The first version produced vague praise ("good work, minor issues"). The second was too harsh -- it rejected everything. Versions three through six oscillated between the two. The seventh finally worked: I gave it a scoring rubric with explicit thresholds. The lesson applies everywhere: a badly defined specialist produces garbage regardless of how sophisticated the workflow is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Skeptic was the best investment.&lt;/strong&gt; Adding a specialist whose job is to disagree forced every other agent to sharpen their arguments. Without the Skeptic, the deliberation would have been eight agents agreeing politely. With the Skeptic, it became an actual discussion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML is the right call.&lt;/strong&gt; I tried Python DSLs, JSON, even a custom language. YAML is boring. Everyone knows it, it diffs well in git, and you can read a 500-line workflow without documentation. Boring tools win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical grounding broke the deadlock.&lt;/strong&gt; The philosophical arguments went in circles until the Physicist pointed at actual stones. In multi-agent systems, you need at least one specialist who anchors abstract reasoning to concrete evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Child was surprisingly effective.&lt;/strong&gt; "It's 4, everybody knows that" is not a sophisticated argument. But it kept cutting through philosophical hedging and reminding everyone that context matters. Sometimes the simplest perspective is the most powerful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want to try it?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk64twdvuxbv9jwirncd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk64twdvuxbv9jwirncd.gif" alt="Rein orchestrator executing a multi-agent deliberation workflow" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've ever copy-pasted a 3000-token prompt trying to make one model do five things at once, this is the workflow engine you wanted. Three text files: a specialist, a team, a workflow. That's the entire mental model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/dklymentiev/rein-orchestrator" rel="noopener noreferrer"&gt;github.com/dklymentiev/rein-orchestrator&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/dklymentiev/rein-orchestrator
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-...

&lt;span class="nb"&gt;cd &lt;/span&gt;examples/01-hello-world
rein &lt;span class="nt"&gt;--agents-dir&lt;/span&gt; ./agents workflow.yaml &lt;span class="nt"&gt;--no-ui&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# workflow.yaml -- three specialists, two dependencies&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;
&lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;content-team&lt;/span&gt;

&lt;span class="na"&gt;blocks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;research&lt;/span&gt;
    &lt;span class="na"&gt;specialist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;researcher&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;task.input.topic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;specialist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;writer&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;research&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;based&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;research.json&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review&lt;/span&gt;
    &lt;span class="na"&gt;specialist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critic&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;write&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;write.json&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten progressive examples in the repo, from hello-world to multi-phase deliberation. Works with Claude, GPT, Ollama (local/free), OpenRouter (100+ models). Also runs as an MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add rein &lt;span class="nt"&gt;--&lt;/span&gt; rein-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open source. MIT license. Questions and contributions welcome on GitHub.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://klymentiev.com/blog/rein-multi-agent-orchestrator" rel="noopener noreferrer"&gt;klymentiev.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How LLM Can Fix Your Posture</title>
      <dc:creator>Dmytro Klymentiev</dc:creator>
      <pubDate>Wed, 04 Mar 2026 02:10:59 +0000</pubDate>
      <link>https://dev.to/klymentiev/how-llm-can-fix-your-posture-2agk</link>
      <guid>https://dev.to/klymentiev/how-llm-can-fix-your-posture-2agk</guid>
      <description>&lt;p&gt;I stopped typing three months ago. Not completely, but for most of my work, I just talk.&lt;/p&gt;

&lt;p&gt;The setup: I speak into my phone, the text appears on my computer wherever the cursor is. No copy-paste, no switching windows. I say a sentence, it gets typed. I press Enter.&lt;/p&gt;

&lt;p&gt;This is how I write this article right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I'm a systems engineer running a home server with dozens of services, AI agents, dashboards. I spend 5-7 hours a day at my workstation after my full-time job. Most of that time goes to typing: commands, prompts, messages, notes.&lt;/p&gt;

&lt;p&gt;My hands get tired. My back hurts from hunching over the keyboard. And the worst part: typing is the bottleneck between thinking and doing.&lt;/p&gt;

&lt;p&gt;I wanted to give instructions the way I'd talk to a colleague. By speaking.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it actually works
&lt;/h2&gt;

&lt;p&gt;The solution turned out to be embarrassingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Android app&lt;/strong&gt; sends recognized text over WiFi to my workstation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workstation service&lt;/strong&gt; receives the text and types it at the current cursor position&lt;/li&gt;
&lt;li&gt;That's it. No cloud. No server processing. No Whisper.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Android's built-in speech recognition is better than anything I tried.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I experimented with Whisper (multiple model sizes), Faster Whisper, Vosk, and several other libraries. They all had problems. Whisper small was too slow on CPU, taking 3-4 seconds per utterance. Whisper medium ate 4GB of RAM and was still slower than real-time. Faster Whisper improved speed, but its accuracy on mixed Russian/English was poor. Vosk worked offline, but the models were huge and recognition quality was inconsistent.&lt;/p&gt;

&lt;p&gt;Android's native speech-to-text just works. It's fast, it's accurate, it runs on the phone's hardware, and it handles language switching naturally. Google has spent billions optimizing on-device recognition. I can't compete with that on a single server.&lt;/p&gt;




&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;

&lt;p&gt;My phone sits on the desk next to me. When I want to "type" something:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the app (or it's already open)&lt;/li&gt;
&lt;li&gt;Speak naturally, text appears in real-time on my phone screen&lt;/li&gt;
&lt;li&gt;The text gets transmitted over WiFi to my workstation&lt;/li&gt;
&lt;li&gt;It's inserted wherever my cursor is: terminal, browser, IDE, chat&lt;/li&gt;
&lt;li&gt;I hit Enter (on the phone or keyboard)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Language switching:&lt;/strong&gt; Android auto-detects language from phonemes. I use three languages daily -- English, Russian, Ukrainian -- and it switches between them naturally.&lt;/p&gt;




&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;My productivity increased dramatically. Tasks built around writing -- prompts, commit messages, documentation -- took roughly a third of the time they used to. The bottleneck shifted from typing to thinking, which is where it should be.&lt;/p&gt;

&lt;p&gt;The physical change was even more dramatic. I have a motorized standing desk. Before voice input, I rarely used the standing position because typing while standing is uncomfortable. Your wrists are at a weird angle, the keyboard feels too low or too high.&lt;/p&gt;

&lt;p&gt;Now I work standing half the day. Just talking.&lt;/p&gt;

&lt;p&gt;The irony: I'm a systems engineer, and my posture improved not from ergonomics advice but from building a voice tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical details
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Android app:&lt;/strong&gt; Kotlin, uses Android's &lt;code&gt;SpeechRecognizer&lt;/code&gt; API. Connects to the workstation via WebSocket over the local network. Sends recognized text as plain string messages. The app stays in foreground with a persistent notification so Android doesn't kill the WebSocket connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workstation service:&lt;/strong&gt; Lightweight Python process, about 80 lines of code. Receives WebSocket messages, uses &lt;code&gt;xdotool&lt;/code&gt; (Linux) to type the text at the current cursor position. Simulates keyboard input at the OS level, so it works with any application.&lt;/p&gt;
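&lt;p&gt;The receiving side is small enough to sketch. The author's service speaks WebSocket; the version below is a stdlib-only TCP approximation of the same shape (my sketch, not the actual 80 lines), with the &lt;code&gt;xdotool&lt;/code&gt; call factored out:&lt;/p&gt;

```python
import asyncio
import subprocess

def xdotool_cmd(text: str) -> list:
    """Build the xdotool invocation that types text at the cursor.

    --clearmodifiers avoids stray Ctrl/Alt state; a small --delay
    keeps fast-redrawing apps from dropping characters.
    """
    return ["xdotool", "type", "--clearmodifiers", "--delay", "12", "--", text]

async def handle(reader, writer):
    # One UTF-8 line per utterance from the phone.
    while line := await reader.readline():
        text = line.decode("utf-8").rstrip("\n")
        if text:
            subprocess.run(xdotool_cmd(text), check=False)
    writer.close()

async def main(port: int = 8765):
    server = await asyncio.start_server(handle, "0.0.0.0", port)
    async with server:
        await server.serve_forever()

# To run on the workstation: asyncio.run(main())
```

&lt;p&gt;Because &lt;code&gt;xdotool&lt;/code&gt; fakes keystrokes at the X11 level, nothing here cares which application has focus.&lt;/p&gt;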

&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt; Pure local WiFi. Phone and workstation on the same network. Latency under 50ms. No internet required. Total round-trip from speech end to text appearing on screen is about 200ms.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I use it for daily
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Talking to Claude.&lt;/strong&gt; About 60% of all voice input. I dictate prompts, describe bugs, give instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing notes and worklogs.&lt;/strong&gt; I used to skip writing them because it felt tedious. Now I just say what I did.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git commit messages.&lt;/strong&gt; My commits got longer and more descriptive since I stopped typing them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack and Telegram messages.&lt;/strong&gt; Faster than thumb-typing on phone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation.&lt;/strong&gt; Like this article.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What doesn't work great
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Code.&lt;/strong&gt; I don't dictate code. Variable names, brackets, indentation. Voice is terrible for this. But honestly, I haven't written code manually in three months either -- Claude Code writes it for me. I dictate the intent, the model writes the code. The keyboard limitation stopped mattering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noisy environments.&lt;/strong&gt; Works great in my home office. Drops accuracy significantly with background noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical terms.&lt;/strong&gt; When I say "xdotool" or "kubectl", Android has no idea what I mean. I keep a dictionary of corrections for terms I use often, but for these I just type.&lt;/p&gt;
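&lt;p&gt;The corrections dictionary is the one place where a few lines of post-processing help before the text gets typed. Something in this spirit (my own sketch; the left-hand spellings are illustrative guesses at what a recognizer might emit, not logged output):&lt;/p&gt;

```python
import re

# Frequent misrecognitions -> the term actually said (illustrative).
CORRECTIONS = {
    "x do tool": "xdotool",
    "cube control": "kubectl",
    "get commit": "git commit",
}

def fix_terms(text: str) -> str:
    """Apply whole-phrase, case-insensitive corrections to recognized text."""
    for wrong, right in CORRECTIONS.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text
```

&lt;p&gt;Running this between the WebSocket receive and the keystroke injection keeps the phone side dumb and the fixes in one versionable place.&lt;/p&gt;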




&lt;h2&gt;
  
  
  Why local-only matters
&lt;/h2&gt;

&lt;p&gt;No API keys or prompts leaving my network. No subscription. No account dependency. The entire system lives on my server -- I own the data, the latency, the uptime.&lt;/p&gt;




&lt;h2&gt;
  
  
  Was it worth building?
&lt;/h2&gt;

&lt;p&gt;It took a weekend to build the first working version. Three months later, I use it every single day.&lt;/p&gt;

&lt;p&gt;Total cost: one weekend of coding, zero ongoing costs. The phone I already had. The WiFi network I already had. Android's speech recognition is free.&lt;/p&gt;

&lt;p&gt;Sometimes the most impactful tool isn't the most complex one. It's the one that removes friction from what you already do hundreds of times a day.&lt;/p&gt;

&lt;p&gt;I type less. I think more. I stand up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://klymentiev.com/blog/how-llm-can-fix-your-posture" rel="noopener noreferrer"&gt;klymentiev.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
