DEV Community: nwnwnw413

Agents aren't users — why we built a skill life-cycle API instead of a marketplace

nwnwnw413 — Mon, 06 Jul 2026 04:05:33 +0000

Most skill / tool registries today are built for humans browsing.

Hugging Face. The npm registry. Apple's App Store. Smithery. The Chrome Web Store. They're beautiful, they have star ratings and download counts and reviews, and they all share one assumption: a human visits the site, evaluates the card, decides whether to install.

We started building Ornn assuming we were making one of those — a nicer registry for AI agent skills.

We were wrong.

Once we tried to plug an actual AI agent into our own registry — not a human clicking around, an agent calling endpoints in a loop — every marketplace assumption fell apart. This article is the trace of what broke, and what we ended up building instead.

What a marketplace optimizes for

Strip away the branding and almost every "marketplace for X" looks the same underneath:

Discovery is visual — cards, screenshots, ratings, "trending" rails.
Trust is social proof — stars, downloads, "verified publisher" badges, who else uses it.
Decision is human — read the card, scroll the screenshots, scan the reviews, install.
Install is one-shot — click the button, accept permissions once, the thing is yours.
Direction is one-way — there are publishers and there are consumers; consumers don't publish back.

These five assumptions stack into a single design constraint: the consumer is a human reading a page. Take that away and almost nothing in the marketplace UX is load-bearing.

What breaks when the consumer is an agent

Plug an agent into a marketplace and every assumption fails in a slightly different way.

Discovery is visual. An agent has no eyes. A card with a screenshot and a rating means nothing to a runtime. The agent needs structured, machine-readable metadata it can reason over: input/output schemas, declared side effects, runtime requirements. A 4.7-star rating with 12k downloads gives the agent nothing to act on; "this skill requires network access to api.openai.com and reads files from a working directory" does.

Trust is social proof. An agent can't read reviews. "Most users say this works great" is a claim the agent can only take on the marketplace's word. That trust is unverifiable from the agent's side: it's vibes laundered through UI. What an agent can verify is a cryptographic signature, an origin attestation, a security score that came from running the artifact in a known sandbox. Trust shifts from "the marketplace says it's good" to "I can verify this came from where it says it came from."

Decision is human. A marketplace's whole front page is built to help a human decide. An agent can't decide from screenshots; it has to try the skill against its current task and see if the output is useful. That means the marketplace can't be a static catalog — it has to be runnable. Every skill needs a runtime that the agent can invoke before deciding to install.

Install is one-shot. A human installs an app and accepts the permission prompt once; from then on, the OS enforces it silently. An agent installing a skill needs the permission decision per-call, with a paper trail. "Trust this random script" is a fine permission model for a human accepting one prompt; it's a terrible one for an agent that's going to invoke that script a hundred times a day inside an autonomous loop.

Direction is one-way. Marketplaces have publishers and consumers, and they don't overlap. Agents do. The most interesting agent behaviors are the ones where the agent encounters a capability gap during a task, generates a new skill, audits it, runs it, and — if it worked — pushes it back into the registry so the next agent doesn't have to reinvent it. The consumer is also the publisher.

What changes when you design for agents instead

Once you accept that the consumer is an agent, the design of every layer in the registry changes. Not in a "rename the buttons" way — in a "different primitives" way.

Skills become versioned packages with manifests, not app cards. Every skill ships with a machine-readable manifest: declared inputs, outputs, side effects, network usage, runtime requirements. The card UI on the website is now a secondary surface for humans; the primary surface is the manifest the agent reads over HTTP.
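To make the manifest idea concrete, here is a minimal sketch of what an agent-side check against a manifest could look like. The field names (network_hosts, writes_paths, and so on) and the AgentPolicy type are assumptions for illustration, not Ornn's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillManifest:
    """Hypothetical manifest shape; field names are illustrative only."""
    name: str
    version: str
    inputs: dict                      # parameter name -> type name
    outputs: dict
    network_hosts: frozenset = frozenset()
    reads_paths: frozenset = frozenset()
    writes_paths: frozenset = frozenset()


@dataclass(frozen=True)
class AgentPolicy:
    """What the calling agent is willing to allow for the current task."""
    allowed_hosts: frozenset
    allow_file_writes: bool


def violations(manifest: SkillManifest, policy: AgentPolicy) -> list:
    """Return the reasons a skill exceeds the policy; empty means acceptable."""
    problems = []
    for host in sorted(manifest.network_hosts - policy.allowed_hosts):
        problems.append(f"network access to {host} not allowed")
    if manifest.writes_paths and not policy.allow_file_writes:
        problems.append("skill declares file writes")
    return problems


pdf_skill = SkillManifest(
    name="pdf-parse", version="1.2.0",
    inputs={"path": "string"}, outputs={"text": "string"},
    network_hosts=frozenset({"api.openai.com"}),
    reads_paths=frozenset({"./workdir"}),
)
strict = AgentPolicy(allowed_hosts=frozenset(), allow_file_writes=False)
print(violations(pdf_skill, strict))
```

With the strict policy above, the check reports the one declared network host as a violation. The agent gets that answer from data alone, without anyone reading a card.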

Discovery is search + preview, not browsing + ratings. The agent queries for a capability (q=pdf parsing), gets back structured results ranked by relevance + provenance + recency, and then previews the candidate in a sandboxed playground before deciding. Preview is search's second half — without it the agent is back to installing on vibes.
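As a rough illustration of ranking by relevance + provenance + recency, the sketch below blends the three signals into one score. The Candidate fields, the weights, and the 30-day decay are all placeholders chosen for the example; the source does not say how Ornn actually weighs them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    relevance: float   # 0..1, how well the skill matches the query
    signed: bool       # whether provenance could be verified
    age_days: int      # days since this version was published


def score(c: Candidate) -> float:
    """Blend the three signals; the weights are arbitrary placeholders."""
    provenance = 1.0 if c.signed else 0.0
    recency = 1.0 / (1.0 + c.age_days / 30.0)
    return 0.6 * c.relevance + 0.3 * provenance + 0.1 * recency


def rank(candidates: list) -> list:
    """Return candidate names, best score first."""
    return [c.name for c in sorted(candidates, key=score, reverse=True)]


results = rank([
    Candidate("pdf-quick", relevance=0.9, signed=False, age_days=3),
    Candidate("pdf-parse", relevance=0.8, signed=True, age_days=60),
    Candidate("doc-tools", relevance=0.5, signed=True, age_days=1),
])
print(results)
```

Under these weights the unsigned but highly relevant pdf-quick ranks last: verifiable provenance outweighs a small relevance edge. The top result would then go to a sandboxed preview before any install decision.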

Trust is provenance, not popularity. Every install carries a signed audit trail: origin, signer, security score from sandboxed execution. The agent verifies what it's pulling against the manifest; the registry can't bait-and-switch between "what the listing said" and "what the artifact does."
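The bait-and-switch check can be sketched with the standard library alone. This is a simplification under stated assumptions: HMAC with a shared demo key stands in for a real signature scheme (a real registry would more plausibly publish a public key and use asymmetric signatures), and the function names are hypothetical.

```python
import hashlib
import hmac


def digest(artifact: bytes) -> str:
    """SHA-256 of the artifact bytes, as the manifest would declare it."""
    return hashlib.sha256(artifact).hexdigest()


def sign(declared_digest: str, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over the declared artifact digest."""
    return hmac.new(key, declared_digest.encode(), hashlib.sha256).hexdigest()


def verify(artifact: bytes, declared_digest: str, signature: str, key: bytes) -> bool:
    """Accept only if the signature covers the digest AND the bytes match it."""
    if not hmac.compare_digest(sign(declared_digest, key), signature):
        return False
    return hmac.compare_digest(digest(artifact), declared_digest)


key = b"demo-shared-key"          # hypothetical key, for illustration only
artifact = b"def run(path): ..."
declared = digest(artifact)
sig = sign(declared, key)

print(verify(artifact, declared, sig, key))
print(verify(artifact + b"# tampered", declared, sig, key))
```

The first call prints True and the second prints False: once the listing's digest is signed, swapping the artifact after the fact fails verification on the agent's side.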

Install + execute become one API motion. Install isn't a one-time accept-the-permissions click; it's the start of a per-call permission flow with structured logging. The agent's history of what it called, with what inputs, from what skill version is a first-class artifact.
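A per-call permission flow with structured logging might look roughly like this. The invoke wrapper, the only_workdir rule, and the log record fields are assumptions made for the sketch; the point is only that every call is checked and recorded individually, including the ones that get denied.

```python
import json
import time
from typing import Any, Callable


class CallDenied(Exception):
    """Raised when the per-call permission check rejects an invocation."""


def invoke(skill: str, version: str, fn: Callable[..., Any], inputs: dict,
           allow: Callable[[str, dict], bool], log: list) -> Any:
    """Check permission for this one call, log it as a JSON line, then run it."""
    allowed = allow(skill, inputs)
    log.append(json.dumps({
        "ts": time.time(), "skill": skill, "version": version,
        "inputs": inputs, "allowed": allowed,
    }, sort_keys=True))
    if not allowed:
        raise CallDenied(f"{skill}@{version} denied")
    return fn(**inputs)


def only_workdir(skill: str, inputs: dict) -> bool:
    # Toy rule: the skill may only touch files under ./workdir/
    return str(inputs.get("path", "")).startswith("./workdir/")


def file_extension(path: str) -> str:
    # Toy skill body standing in for real work
    return path.rsplit(".", 1)[-1]


audit_log: list = []
result = invoke("ext", "0.1.0", file_extension,
                {"path": "./workdir/a.pdf"}, only_workdir, audit_log)
try:
    invoke("ext", "0.1.0", file_extension,
           {"path": "/etc/passwd"}, only_workdir, audit_log)
    denied = False
except CallDenied:
    denied = True

print(result, denied, len(audit_log))
```

Both the allowed and the denied call end up in audit_log, each tagged with the skill version, which is the "history as a first-class artifact" idea in miniature.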

The registry is bidirectional. Agents both pull from it and push back into it. There's a build step that lints + signs the skill; a publish step that hands it back to the registry; an access-control surface that lets a skill be public, friends-only, or private. Marketplaces have ratings; agent registries need permissions.

Lifecycle is one API. Search → pull → install → execute → build → upload → share. From the agent's side, that's seven calls against the same surface, not seven different UIs in different tabs. The whole thing is the API; the website is just a human reflection of it.

| | Marketplace for humans | Lifecycle API for agents |
|---|---|---|
| Discovery | Cards + screenshots | Structured search + manifest |
| Trust | Stars / reviews | Signed provenance |
| Decision | Reading | Sandboxed preview |
| Install | Accept once | Per-call audit trail |
| Direction | One-way (publisher / consumer split) | Bidirectional (agents publish back) |
| Lifecycle | 7 different UIs | 7 calls on one API |

If you've been building AI agents and felt that the existing registries (Hugging Face for models, npm for libraries, Smithery for MCP servers) don't quite fit the way an agent actually consumes skills — this is why. Marketplaces optimize the wrong things. Different consumer, different design.

We open-sourced Ornn at github.com/ChronoAIProject/Ornn — Apache 2.0, TypeScript, HTTP + MCP, model-agnostic (Claude / GPT / Gemini / your own runtime). The above is the design rationale; the README has the quickstart and the comparison table against MCP servers / Smithery / npm.

If you're on the agent side of this divide and the framing resonates, I'd love to argue about it — open an Issue or a Discussion → Ideas.

Disclosure: I work on Ornn.

P.S. — there's also a launch perk for early users: the first 500 to ⭐ the repo and sign in to ornn.chrono-ai.fun with the same GitHub account get 400 free GPT-5.5 conversations (200 Playground + 200 Skill Generation, no credit card, no expiry). Enter NyxID invite code NYX-2XXJI08A on first sign-in; your redemption code lands in the Ornn notification inbox within 24h.

Meet NyxID: Your AI Agents Get Access. You Keep Control.

nwnwnw413 — Fri, 03 Jul 2026 08:35:34 +0000

Every AI agent you build needs to reach something real: a SaaS API, an internal service, the database on your own machine. And every time, you hit the same two walls.
First, the keys. To let an agent call Stripe or Slack or your internal API, you paste the key somewhere the agent can read it: a .env file, your n8n credentials, a Cursor config, a Home Assistant secrets.yaml. Do this across a few projects and the same secret ends up in a dozen places. Rotating one key turns into an afternoon of editing config files, and you still miss one and find out a week later when something quietly returns 401.
Second, the network. Half the services you actually want an agent to use run on your own machine or inside your network. But the agent runs in the cloud, and the cloud can't reach your localhost. Now there's a networking problem stacked on top of the credential problem.
NyxID is the layer that handles both.

What NyxID is

NyxID is an open-source Agent Connectivity Gateway: it sits between your AI agents and the APIs they need to reach, holds the real credentials itself, and hands each agent only a scoped token. Your agent talks to NyxID; NyxID talks to the real API with the real key. The key is decrypted and injected at the gateway, so it never lands in the agent's context, its prompt, or its config.
It's Apache-2.0, and you can self-host the whole thing (three Docker containers) or use the hosted version. Either way, you own the trust boundary.
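The token-for-key swap described above can be sketched as a toy in-memory gateway. Everything here (the Gateway class, its method names, the Bearer header format, the demo key) is an assumption for illustration and not NyxID's actual API; a real gateway would also decrypt the stored key and send the request upstream rather than just returning it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    service: str
    path: str
    headers: dict


class Gateway:
    """Toy in-memory gateway; illustrative only, not NyxID's actual API."""

    def __init__(self) -> None:
        self._secrets: dict = {}   # service name -> real API key (held only here)
        self._scopes: dict = {}    # agent token -> set of services it may call

    def add_service(self, service: str, api_key: str) -> None:
        self._secrets[service] = api_key

    def issue_token(self, token: str, services: set) -> None:
        self._scopes[token] = set(services)

    def revoke(self, token: str) -> None:
        # Revoking a token leaves the stored credentials untouched
        self._scopes.pop(token, None)

    def upstream_request(self, token: str, request: Request) -> Request:
        """Build the request sent to the real API, with the real key injected."""
        if request.service not in self._scopes.get(token, set()):
            raise PermissionError(f"token not scoped for {request.service}")
        headers = dict(request.headers)
        headers["Authorization"] = f"Bearer {self._secrets[request.service]}"
        return Request(request.service, request.path, headers)


gw = Gateway()
gw.add_service("slack", "sk-demo-not-real")
gw.issue_token("agent-a-token", {"slack"})
gw.issue_token("agent-b-token", {"internal-api"})

agent_request = Request("slack", "/chat.postMessage", {"X-Agent": "a"})
sent = gw.upstream_request("agent-a-token", agent_request)
print("Authorization" in agent_request.headers, sent.headers["Authorization"])
```

The agent's own request never carries the key; only the upstream copy built inside the gateway does. A call with agent-b-token against slack raises PermissionError, which is the per-agent isolation in its simplest form.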

What it does

A few things, and they compose:

Injects credentials so the agent never sees them. You paste an API key into NyxID once; it's stored encrypted (AES-256-GCM). From then on the agent calls through NyxID, which injects the key at request time. The agent gets the response, never the secret.
Reaches services behind NAT, including your localhost. A small credential node runs on your own infrastructure and dials out to NyxID over an outbound connection. No VPN, no port forwarding, no hole opened in your firewall. A cloud agent can now call a service running on the machine under your desk.
Turns REST APIs into MCP tools. Point NyxID at an API's OpenAPI spec and it surfaces each operation as an MCP tool on a single endpoint, with no hand-written MCP wrapper per service. Works with Claude Code, Cursor, VS Code, Codex, and n8n.
Isolates every agent. Each agent gets its own scoped token. Agent A can reach Slack and Gmail; Agent B only your internal API. Revoke any agent's session without touching the underlying credentials or disturbing the others.
Lets you decide what an agent does on its own, and what needs your yes. Turn on approval and, by default, every proxied call waits for your explicit OK: a Telegram or mobile-push prompt shows exactly what the agent is about to do (POST /v1/chat/completions, model gpt-4o, 3 messages), and the call only runs if you approve, auto-rejecting after a timeout you set. Then tune it per service: exempt the harmless ones, keep the checkpoint on the calls that touch money or production, or grant an agent a service for a set window so it stops re-prompting.
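For the "REST APIs into MCP tools" item above, here is a minimal sketch of the mapping: one tool descriptor per OpenAPI operation. The name, description, and inputSchema keys follow the shape MCP uses for tools; the extra method and path keys, the fallback naming rule, and the flat handling of parameters are assumptions of this sketch, and it ignores request bodies entirely.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def tools_from_openapi(spec: dict) -> list:
    """Produce one tool descriptor per OpenAPI operation."""
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            fallback = f"{method}_{path.strip('/').replace('/', '_')}"
            params = op.get("parameters", [])
            tools.append({
                "name": op.get("operationId") or fallback,
                "description": op.get("summary", ""),
                "inputSchema": {
                    "type": "object",
                    "properties": {p["name"]: p.get("schema", {}) for p in params},
                    "required": [p["name"] for p in params if p.get("required")],
                },
                "method": method.upper(),   # routing info for the gateway itself
                "path": path,
            })
    return tools


spec = {
    "paths": {
        "/users/{id}": {
            "get": {
                "operationId": "getUser",
                "summary": "Fetch one user",
                "parameters": [
                    {"name": "id", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                ],
            },
        },
        "/users": {
            "parameters": [],
            "post": {"summary": "Create a user"},
        },
    },
}
tools = tools_from_openapi(spec)
print([t["name"] for t in tools])
```

The two operations come out as getUser and post_users, each with a JSON-schema-shaped input description an agent can call against without a hand-written wrapper.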

What actually changes for you

The clearest way to feel the difference is before and after.
Before: the secret lives wherever the agent can read it. It's copied across .env files and tool configs, it sits in the agent's execution memory for the length of a run, and an agent with a prompt-injection surface has the full blast radius of whatever that key unlocks. Rotation is manual and easy to get wrong.
After: the secret lives in one place behind the gateway, and the agent never holds it. You add a service once and wire a single MCP endpoint. The agent makes the call; the key stays on your side. And for the calls that actually matter, you can require a checkpoint: the agent pauses, asks, and waits for your approval before anything happens. As one early builder put it, NyxID "moves the complexity from the agent side to the infra side, where it belongs." Day to day, that means you stop tracking where your keys are, and you can hand an agent real access without that access turning into a liability.
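The checkpoint behavior (approve by default, exempt some services, grant a time window, auto-reject on timeout) can be modeled in a few lines. The ApprovalPolicy class, its fields, and the ask callback are hypothetical names for this sketch; in NyxID the prompt arrives over Telegram or mobile push, which is abstracted here as a function that returns True, False, or None for "no answer before the timeout".

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ApprovalPolicy:
    exempt_services: set = field(default_factory=set)
    grants: dict = field(default_factory=dict)   # (agent, service) -> expiry time

    def grant(self, agent: str, service: str, until: float) -> None:
        """Let an agent use a service without prompts until the given time."""
        self.grants[(agent, service)] = until

    def needs_approval(self, agent: str, service: str, now: float) -> bool:
        if service in self.exempt_services:
            return False
        return self.grants.get((agent, service), 0.0) <= now


def proxied_call(policy: ApprovalPolicy, agent: str, service: str, summary: str,
                 now: float, ask: Callable[[str], Optional[bool]]) -> str:
    """Forward the call only if no approval is needed or the human said yes."""
    if policy.needs_approval(agent, service, now):
        if ask(summary) is not True:      # False, or None on timeout
            return "rejected"
    return "forwarded"


policy = ApprovalPolicy(exempt_services={"weather"})
policy.grant("agent-a", "stripe", until=100.0)


def no_answer(summary: str) -> Optional[bool]:
    return None   # simulates the human never responding


def approve(summary: str) -> Optional[bool]:
    return True


print(proxied_call(policy, "agent-a", "weather", "GET /today", 50.0, no_answer))
print(proxied_call(policy, "agent-a", "stripe", "POST /v1/charges", 50.0, no_answer))
print(proxied_call(policy, "agent-a", "stripe", "POST /v1/charges", 150.0, no_answer))
print(proxied_call(policy, "agent-a", "stripe", "POST /v1/charges", 150.0, approve))
```

The four calls print forwarded, forwarded, rejected, forwarded: the exempt service and the active grant skip the prompt, the expired grant falls back to asking, and silence counts as a rejection.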

Who it's for

Anyone wiring AI agents into a real stack: self-hosters and homelab folks who want Claude Code or Cursor to reach services behind their firewall; teams who've watched the same API key spread across n8n, Cursor, and three .env files; builders who want per-agent, revocable access instead of one master key shared everywhere. If you run self-hosted tools like Grafana, Jenkins, n8n, or Home Assistant and want an agent to use them without leaking the credentials, that's the case NyxID was built for.

Why it matters

Agents are starting to act in the real world: calling paid APIs, touching internal systems, spending money. The moment they do, two questions stop being details and become the whole risk: where does the key live and who can see it, and what is the agent allowed to do with it unsupervised. NyxID's bet is that both belong in infrastructure you control and can audit, open-source and self-hostable, with the key on your side of the line and a human checkpoint on the actions that count, rather than a master key pasted into every agent that needs a job done.

Try it

NyxID is open-source (Apache-2.0) and available on GitHub at github.com/ChronoAIProject/NyxID.
Self-host the three containers, or request hosted early access with invite code NYX-OJWGP4FZ. https://nyx.chrono-ai.fun/register
Wire your first agent with one command:

Install nyx skills from https://github.com/ChronoAIProject/NyxID/blob/main/skills/INSTALL.md

codex fixing codex: a consensus loop that argues, judges, and merges its own PRs

nwnwnw413 — Mon, 22 Jun 2026 06:44:24 +0000

Last Friday I wrote here about consensus-loop, the agent loop we built and open-sourced that doesn't just suggest code but actually writes it, has agents review it, and merges its own PRs (that post is here). A few people asked what we actually point it at day to day. So here's the experiment I keep coming back to: we aimed the same loop at a fork of the codex CLI and let it fix codex. codex fixing codex.

This is the version with the repo links, so you can decide for yourself whether it's real instead of taking my word for it.

The setup: take a public fork of the open-source codex CLI, and point our own consensus loop at it. The loop's job is to close small upstream bugs in that fork end to end, with no one typing the patch. The whole thing is dogfood. The fork has zero stars, zero forks, no outside users. I'm saying that up front so the rest reads as "here's a mechanism," not "here's a product."

The repo is public: github.com/ChronoAIProject/codex. It's a fork of openai/codex. Nothing below requires you to trust me; every claim is a clickable issue or PR.

And if you'd rather watch than read, we've been livestreaming the loop running this end to end: the stream is here.

How a bug moves through the loop

1. Intake. A real upstream codex bug gets mirrored into the fork as an issue. The title carries the pointer, e.g. "Upstream openai/codex#29131: Unrecognized slash command prevents message from being sent." The issue body states a selection rubric: small-to-medium mechanical bugs, bounded to identifiable files, owned by this repo. It explicitly avoids auth, app-server, desktop, iOS, broad sandbox policy. So the loop is not trying to be a heroic maintainer; it's picking fights it can finish.

2. Solvers argue. Several solver agents take a pass and post their proposals as issue comments. They have different priors:

a minimal solver that wants the smallest change that satisfies the repro,
a structural solver that wants a clean boundary,
a delete-solver that argues for removing code rather than adding it.

They genuinely disagree. On issue #34 the minimal solver proposed a "pre-dispatch validation" tweak, the structural solver proposed a "batch validation boundary," and the delete-solver abstained from deletion. You can read all three.

3. A judge arbitrates rounds. A meta-judge reads the solver outputs. If they're split, it doesn't pick a winner — it posts something like "Design consensus needs one narrower round" and sends it back. Issue #34 went three rounds. The final comment is titled "Round-3 meta-judge arbitration" and spells out the decision:

"the minimal and structural solvers now agree on the same concrete implementation boundary, and the delete solver abstains from deletion while accepting that same boundary."

It even records what got rejected: a new ToolCallBatch module ("a new single-caller codex-core abstraction is not required for correctness"). That's the part I find genuinely useful — the judge writes down the road not taken.

4. Implement, test, merge. Once consensus is reached the loop opens a branch (refactor/iter34-issue-34), writes the patch, runs the guarded build/test, and opens a PR against the consensus-rnd/issues branch. For #34 that's PR #37, which touched codex-rs/core/src/session/turn.rs and codex-rs/core/src/stream_events_utils.rs and added a regression test under codex-rs/core/tests/. Then it merges itself and posts back on the issue: ✅ Auto-merged via PR #37.

The state lives in GitHub. Issues are the work queue, solver comments are the debate transcript, the judge comment is the decision record, the PR is the artifact, and labels track lifecycle: crnd:lifecycle:managed, crnd:phase:design-solving → crnd:phase:consensus-reached → crnd:phase:merged, plus crnd:human:auto meaning the controller may proceed without a maintainer. Every loop-authored PR body ends with ⟦AI:AUTO-LOOP⟧. That marker, not a human, is the thing telling you who wrote it.
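Because the state is just labels, the lifecycle can be read as a small state machine. The sketch below uses the label names quoted above; the advance and may_proceed_unattended helpers are hypothetical, and the real controller may track more phases than the three listed here.

```python
PHASES = [
    "crnd:phase:design-solving",
    "crnd:phase:consensus-reached",
    "crnd:phase:merged",
]


def advance(labels: set) -> set:
    """Swap the current phase label for the next one; other labels are kept."""
    current = [label for label in labels if label in PHASES]
    if len(current) != 1:
        raise ValueError(f"expected exactly one phase label, found {current}")
    index = PHASES.index(current[0])
    if index == len(PHASES) - 1:
        raise ValueError("issue is already merged")
    return (labels - {current[0]}) | {PHASES[index + 1]}


def may_proceed_unattended(labels: set) -> bool:
    """The controller may act without a maintainer only on managed, auto issues."""
    return "crnd:lifecycle:managed" in labels and "crnd:human:auto" in labels


labels = {"crnd:lifecycle:managed", "crnd:human:auto", "crnd:phase:design-solving"}
after_consensus = advance(labels)
after_merge = advance(after_consensus)
print(sorted(after_merge), may_proceed_unattended(after_merge))
```

Keeping the state in labels means anyone can audit where an issue is by looking at it on GitHub, with no separate database to trust.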

A real fix, end to end

Issue #34 mirrors a real codex concurrency bug: when one model response contained several parallel tool calls, a valid apply_patch sibling could start side effects before a malformed sibling in the same response was rejected. The judge framed it as "fail-fast validation for side-effecting batches" — accept the whole batch as well-formed before launching anything that writes.

The merged fix (PR #37) stages tool calls and only flushes them to the run queue at ResponseEvent::Completed, after the whole response batch is known good. It shipped with a regression test: a valid sibling followed by a malformed one in the same response, asserting the valid one does not execute. The PR ran the command just test -p codex-core on the targeted test and reported it green. That's a real bug with a real, reviewable patch, produced by a debate I didn't participate in.
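The stage-then-flush idea is easy to show in isolation. This is an illustration of the technique in Python, not the actual Rust patch from PR #37; the ToolCall and StagedBatch names are made up for the sketch, and a malformed call is modeled simply as one whose arguments failed to parse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    call_id: str
    name: str
    arguments: Optional[dict]   # None models a malformed call that failed to parse


class StagedBatch:
    """Hold tool calls until the whole response is known to be well-formed."""

    def __init__(self) -> None:
        self._staged: list = []

    def on_tool_call(self, call: ToolCall) -> None:
        # Stage only; nothing with side effects starts here
        self._staged.append(call)

    def on_completed(self, run_queue: list) -> bool:
        """Flush to run_queue if every call is valid; otherwise drop the batch."""
        staged, self._staged = self._staged, []
        if any(call.arguments is None for call in staged):
            return False
        run_queue.extend(staged)
        return True


run_queue: list = []
batch = StagedBatch()
batch.on_tool_call(ToolCall("c1", "apply_patch", {"patch": "..."}))
batch.on_tool_call(ToolCall("c2", "apply_patch", None))   # malformed sibling
flushed = batch.on_completed(run_queue)
print(flushed, len(run_queue))
```

This mirrors the regression scenario: a valid call followed by a malformed sibling leaves the run queue empty, so the valid apply_patch never starts writing.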

Where it's honest about doing nothing

PR #16 is the one I'd point a skeptic at. The loop took issue #15 (an apply_patch bug), tried to reproduce it against the current checkout, and couldn't. Instead of inventing a fix to look productive, the PR body says:

"No production fallback was added; the regression passed, so the native tool-call path did not prove an executable lookup bug in this checkout."

So it added a PATH-isolated regression test to lock the behavior and stopped. That's the correct engineering call, and it's also the kind of result that looks like a no-op until you read the reasoning. A loop that knows when not to patch is more interesting to me than one that always produces a diff.

The honest boundaries

It's a fork, dogfood, no users. Nothing here has been proposed upstream, and this is not an OpenAI thing — it's us pointing our loop at our own fork of their open-source CLI.

The bugs are small by design. Status-dot contrast, UTF-8 BOM handling in apply-patch, dedup tool calls by call_id, the slash-command fix. Bounded mechanical stuff. "AI maintains a codebase" would be a lie; "a loop closes small bounded bugs end to end" is what actually happened, ~16 merged PRs so far.

Humans are still in it. Someone mirrors the upstream issues and sets the rubric, and we open every PR to read it. To quote our own status: we still open them half expecting garbage. The code is auto; the attention isn't.

The judge is sometimes ceremony. On easy bugs the three solvers basically agree and the judge rubber-stamps. The 3-round arbitration on #34 is the one case where the disagreement was load-bearing. I don't yet have clean evidence the judge beats a single good agent on the easy 80%.

Repo's public if you want to dig: github.com/ChronoAIProject/codex. Start with issue #34 and PR #37.

Our agent loops have been shipping production features for weeks. Here's the tool.

nwnwnw413 — Fri, 19 Jun 2026 14:10:24 +0000

Everyone's saying the same thing right now: stop prompting your coding agent, start designing the loop that prompts it for you, and let it do the work. We agree. We've just been doing it long enough that it isn't a prediction anymore — autonomous loops have been running our R&D on four production repos for weeks.

Here's a concrete one. On NyxID, our open-source gateway, a loop took a load-balancing feature from a GitHub issue to a merged PR last week: about 1,400 lines of Rust, and the merge metadata records human_touch_count = 0, meaning no human edited the diff. A person still scoped the issue and clicked merge — but the code came out of the loop and survived review without anyone rewriting it. (PR #975)

That's the part everyone's excited about, and it's real. It's also not the hard part, and not the reason we trust the thing enough to leave it running.

The hard part is trust, not autonomy

The failure mode of an autonomous loop isn't that it does nothing. It's that it does something confidently wrong: writes plausible code that doesn't hold up, papers over a failing test, claims a result it can't support, and runs until your budget is gone. A single model is sure of itself even when it shouldn't be, and a naive loop inherits all of that confidence with none of the brakes. That's the real reason most "agent runs for 10 hours" demos stay demos.

So the thing we actually built consensus-loop around isn't "make the agent run." It's "make the agent trustworthy enough that you can walk away." The way you get there is to stop letting one confident model decide alone.

How it works

consensus-loop is a skill you inject into a host you already use — Claude Code, Codex, Cursor, or Gemini. You point it at a repo, hand it one host.env file with that repo's facts, and it takes over the development loop from there.

# the entire host-side contract is a handful of facts
REPO_ROOT=/path/to/your/repo
GH_REPO_SLUG=your-org/your-repo
BUILD_CMD="cargo build"
TEST_CMD="cargo test"
INTEGRATION_BRANCH=consensus-rnd-integration
REVIEW_BASE_BRANCH=main
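
For a sense of how little the host has to supply, here is a minimal sketch of reading such a file. It is not consensus-loop's own loader: the parse_host_env function is hypothetical, and treating all six keys as required is an assumption.

```python
import shlex

REQUIRED = ("REPO_ROOT", "GH_REPO_SLUG", "BUILD_CMD", "TEST_CMD",
            "INTEGRATION_BRANCH", "REVIEW_BASE_BRANCH")


def parse_host_env(text: str) -> dict:
    """Parse KEY=value lines, honoring shell-style quotes and # comment lines."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=value line: {raw!r}")
        config[key.strip()] = " ".join(shlex.split(value))   # strips the quotes
    missing = [k for k in REQUIRED if k not in config]
    if missing:
        raise ValueError(f"host.env is missing: {', '.join(missing)}")
    return config


sample = '''
# the entire host-side contract is a handful of facts
REPO_ROOT=/path/to/your/repo
GH_REPO_SLUG=your-org/your-repo
BUILD_CMD="cargo build"
TEST_CMD="cargo test"
INTEGRATION_BRANCH=consensus-rnd-integration
REVIEW_BASE_BRANCH=main
'''
config = parse_host_env(sample)
print(config["BUILD_CMD"], "|", config["REVIEW_BASE_BRANCH"])
```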

One detail worth pulling out, because it's most of why the consensus means anything: the loop runs across two different systems. The host you install into — Claude Code, in our setup — is the controller. It routes, posts to GitHub, commits, and merges, but it does none of the thinking. The thinking runs on separate Codex workers it spawns in isolated git worktrees. Claude Code drives; Codex reasons. The agent steering the loop isn't the one doing the work, and the work itself is split across independent Codex workers that can't see each other.

The loop itself runs in four steps:

Three Codex solvers argue in isolation. One is biased toward the smallest possible change, one toward structural correctness, one toward deleting code. They each draft a plan without seeing the others' work, so they don't quietly converge on the same wrong answer.
A judge converges them. A fourth role reads all three plans and runs a truth table. If all three propose the same shape of fix, that's consensus and it proceeds. If they disagree, the judge writes a sharper question and sends it back for another round.
It implements, then an independent reviewer tries to reject it. Separate review passes check architecture, quality, and tests, and they're told to err toward "rework" when in doubt, not toward "ship."
It gives up on purpose. If three or more rounds pass with no progress and no new framing, the default is to drop the task rather than burn tokens grinding on something unsolvable.
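
The solve, judge, and give-up steps above can be sketched as one small loop. This is a deliberately reduced model: solvers are plain functions returning a label for the shape of their fix, "consensus" means all labels match, and the stop rule is simplified to a fixed round budget. The run_rounds, judge, and sharpen names are hypothetical, and the review stage is left out.

```python
from typing import Callable, Optional, Tuple


def judge(plans: list) -> Optional[str]:
    """Consensus means every solver proposed the same shape of fix."""
    return plans[0] if len(set(plans)) == 1 else None


def run_rounds(question: str, solvers: list,
               sharpen: Callable[[str, list], str],
               max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Return (decision, rounds_used); decision is None when the task is dropped."""
    for round_no in range(1, max_rounds + 1):
        # Each solver sees only the question, never the other solvers' plans
        plans = [solve(question) for solve in solvers]
        decision = judge(plans)
        if decision is not None:
            return decision, round_no
        question = sharpen(question, plans)   # judge sends back a narrower question
    return None, max_rounds                   # give up on purpose


def minimal(q: str) -> str:
    return "validate whole batch" if "narrow" in q else "smallest tweak"


def structural(q: str) -> str:
    return "validate whole batch" if "narrow" in q else "new boundary"


def deleter(q: str) -> str:
    return "validate whole batch" if "narrow" in q else "delete code"


def narrow(q: str, plans: list) -> str:
    return q + " (narrow to one boundary)"


def keep_as_is(q: str, plans: list) -> str:
    return q   # no new framing, so the solvers never converge


print(run_rounds("fix the bug", [minimal, structural, deleter], narrow))
print(run_rounds("fix the bug", [minimal, structural, deleter], keep_as_is))
```

The first run converges in round 2 after one sharpened question; the second returns (None, 3), the deliberate drop, instead of grinding on.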

There's no algorithmic novelty here, and we won't pretend otherwise. Underneath, this is multi-agent debate, an LLM judge, and self-consistency — patterns you already know. What's hard, and what took us weeks of debugging on real repos, is the reliability engineering around the loop: the daemons that keep it alive, the leases that stop two instances from fighting, the release gates, and the stop rules. The idea is cheap. Making it trustworthy is not.

If you just want to try the consensus idea on a single hard decision without any of the daemon machinery, there's a lightweight skill called sshx that spins up a few isolated workers to give you multiple angles and nothing else.

Why we trust it: it knows what it doesn't know

We'll be straight about the conflict of interest: all the repos below are ours, and consensus-loop has zero external adoption so far. This is our own tape, not third-party validation. Everything is a public issue or PR you can open.

The NyxID feature up top is the loop doing the work. These are the loop deciding not to — which is the behavior that makes the first kind safe to rely on.

It stopped instead of fabricating. On aevatar, the solvers reached consensus, but at implementation time the worker didn't have the real external evidence to make the change safely. Rather than invent the missing piece to produce something, it stopped, changed nothing, and surfaced what it didn't know. A clean stop, not a confident wrong diff. (#2181)

It called a human when it should have. On Ornn, a large feature wouldn't converge after several rounds. The loop didn't force the merge. It opened an escalation, left the half-finished work for review, and flagged it as needing a person. (#1061)

It refused to take credit it couldn't back. On newmath — also ours, written by the same maintainer — the loop ran an experiment and measured a real result, 0.998 on a gap-detection benchmark against a 0.463 baseline. Then it went to claim a separate result, that the model also predicted better, and the statistical gate didn't pass: identical error on both arms. So it marked that claim false and logged why. On a repo where we'd have loved the win, it didn't take a result it couldn't support. (#1687)

Three of those four are the loop choosing not to act. That's the point. An autonomous loop you can actually leave running isn't one that always produces — it's one that produces when it's sure and stops when it isn't.

What we've put into it

We've spent the past couple of months building this loop, tuning it, and running it for real on our own repos — using it in production and improving it at the same time. That's 155 billion tokens and 1.6 million model calls of actually living inside the thing, not a weekend prototype. We trade tokens for time, on purpose.

Loop engineering is having its moment right now, and we don't think the versions that actually work should stay locked inside a handful of companies' private repos. So we're putting ours in the open. Come run loops with us — point it at your repo, break it, tell us where it falls over, and let's find out what these things can really do.

It catches and fixes its own mistakes

It's still early-stage, and a lot of what the loop does is repair itself before anything reaches you. When a test fails or a reviewer rejects the work, it doesn't ship the break: it feeds the error back in, fixes it, and re-checks. The NyxID feature up top went through that four times before it passed. And when it genuinely can't recover on its own, it stops and says so rather than guessing past it.

Take it

It's open source, MIT-licensed. We don't sell it and we're not trying to. We built the loop because we needed it, we run it on our own products every day, and we're giving it to you to run on yours. Inject it into your host, write one host.env, and point it at a repo.

Go break it: https://github.com/ChronoAIProject/consensus-rnd