DEV Community: Tang Weigang

Before You Add More Agents, Design the Control Plane

Tang Weigang — Wed, 27 May 2026 03:09:03 +0000

OpenAI Agents Python makes it easy to describe agents, connect tools, define handoffs, and run agentic workflows. That is useful, but it also creates a trap: teams may start by adding more agents before they define the operational boundaries that make those agents safe to use in a real repository.

The hard part is usually not getting the first demo to run. The hard part is knowing when an agent should start, what it is allowed to touch, what evidence it must leave behind, when it can hand work to another agent, and how the team will recover when the workflow fails.

For production use, I would start with a control plane.

This does not need to be a heavy platform. A Markdown checklist, a JSON policy file, and a trace log can be enough for the first version. The key is that the rules exist outside the model's temporary reasoning. They become part of the workflow, not something you hope the model remembers.

1. Define the task entry contract

An agent should not start from a vague instruction like "fix this feature" or "improve this repo." That may be fine for a toy demo, but it is too wide for real work.

A task entry contract should answer five questions:

What is the goal?
What input is trusted?
What files, services, or systems are in scope?
What is the acceptance standard?
When should the agent stop instead of improvising?

For example, a safer engineering task might say:

Read only the package under src/connectors. Modify only connector_policy.py and related tests. Preserve the git diff. Run the connector test suite. If the requested behavior conflicts with an existing policy rule, stop and return the conflict instead of rewriting the policy.

That kind of instruction is not just prompt polish. It reduces the agent's degrees of freedom. It turns an open-ended request into an executable contract.
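One way to make the five questions above hard to skip is to store them as a structured record and refuse to start when any answer is blank. The sketch below is a minimal illustration, not part of OpenAI Agents Python; the class name, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical entry contract: one field per question a task must answer."""
    goal: str
    trusted_input: str
    scope: str
    acceptance: str
    stop_condition: str

    def missing(self) -> list[str]:
        # Names of fields whose answer is empty or whitespace only.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        # Plain-text form that could be prepended to an agent's instructions.
        return "\n".join(f"{f.name}: {getattr(self, f.name)}" for f in fields(self))


# Example values paraphrase the connector task described above.
contract = TaskContract(
    goal="Change connector policy behavior as requested",
    trusted_input="The task description and the package under src/connectors",
    scope="Read src/connectors; modify connector_policy.py and related tests",
    acceptance="The connector test suite passes",
    stop_condition="Requested behavior conflicts with an existing policy rule",
)
```

A caller would check `contract.missing()` before starting a run and treat a non-empty result as a reason to send the task back to its author.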

The business value is simple: fewer surprising edits, fewer review cycles, and less time spent asking why an agent touched something unrelated.

2. Separate tools by risk

Tool access should not be binary. "The agent can use tools" is too broad. A file search is not the same risk as deleting a directory, publishing an article, or calling a production API.

I prefer three buckets.

Low-risk tools can run directly. Examples: read a file, search for symbols, inspect documentation, list a directory, or open a local artifact.

Medium-risk tools can run if they leave evidence. Examples: modify a draft, generate a patch, run tests, create a report, or produce a migration plan. The output should be inspectable.

High-risk tools require an explicit gate. Examples: destructive git commands, deleting files, pushing to a remote, publishing content, spending money, modifying production infrastructure, or calling external APIs with side effects.

OpenAI Agents Python gives you a framework for building the workflow. It does not automatically know your risk model. That risk model belongs in your engineering system.

If your agent can publish content, the publication action should not be treated the same way as writing a local draft. If your agent can modify code, the modification should not be treated the same way as reading code. If your agent can call production services, the system needs a gate before side effects happen.

This is where many agent workflows become fragile. The model may be capable, but the surrounding system has no authority model.
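The three buckets can be expressed as a small authority check that runs before any tool call. This is a sketch under stated assumptions: the tool names, the `authorize` function, and its parameters are hypothetical, and nothing here uses or mirrors an OpenAI Agents Python API.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = "low"        # may run directly
    MEDIUM = "medium"  # may run if it leaves inspectable evidence
    HIGH = "high"      # requires an explicit gate


# Hypothetical tool names sorted into the three buckets.
TOOL_RISK = {
    "read_file": Risk.LOW,
    "search_symbols": Risk.LOW,
    "generate_patch": Risk.MEDIUM,
    "run_tests": Risk.MEDIUM,
    "git_push": Risk.HIGH,
    "publish_article": Risk.HIGH,
}


def authorize(tool: str, *, evidence_path: Optional[str] = None,
              approved_by: Optional[str] = None) -> tuple[bool, str]:
    """Return (allowed, reason) for a requested tool call."""
    # Tools missing from the table are treated as high risk, not allowed by default.
    risk = TOOL_RISK.get(tool, Risk.HIGH)
    if risk is Risk.LOW:
        return True, "low risk: allowed"
    if risk is Risk.MEDIUM:
        if evidence_path:
            return True, f"medium risk: evidence at {evidence_path}"
        return False, "medium risk: no evidence location given"
    if approved_by:
        return True, f"high risk: approved by {approved_by}"
    return False, "high risk: explicit approval required"
```

The design choice that matters most is the default: an unlisted tool falls into the high-risk bucket, so adding a new tool cannot silently widen the agent's authority.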

3. Make handoffs evidence-based

Multi-agent workflows are attractive because they map nicely to human roles: researcher, planner, coder, reviewer, publisher. But every handoff creates a new failure point.

A handoff table should define:

When a handoff is allowed
Which agent receives the task
What evidence must be passed along
Which cases block the handoff

A research agent should not hand work to a writing agent by saying "I found the sources." It should pass source links, key claims, contradictions, uncertain points, and the reason those sources are relevant.

A coding agent should not hand work to a release agent by saying "fixed." It should pass the diff, tests run, tests skipped, remaining risk, and rollback path.

That evidence is the difference between agentic collaboration and a chain of guesses.
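A handoff table can be as small as a dictionary that names the evidence each transition requires. The following sketch assumes hypothetical agent role names and evidence keys taken from the two examples above; it is an illustration, not a feature of any framework.

```python
# Hypothetical handoff table: (sender, receiver) -> evidence keys that must be present.
REQUIRED_EVIDENCE = {
    ("researcher", "writer"): {
        "source_links", "key_claims", "contradictions",
        "uncertain_points", "relevance",
    },
    ("coder", "release"): {
        "diff", "tests_run", "tests_skipped",
        "remaining_risk", "rollback_path",
    },
}


def check_handoff(sender: str, receiver: str, evidence: dict) -> list[str]:
    """Return the reasons a handoff is blocked; an empty list means it may proceed."""
    required = REQUIRED_EVIDENCE.get((sender, receiver))
    if required is None:
        # Transitions that are not in the table are blocked outright.
        return [f"no handoff rule for {sender} -> {receiver}"]
    # Only presence of each key is checked: an explicit empty value such as
    # tests_skipped=[] still counts as a statement by the sender.
    return [f"missing evidence: {key}" for key in sorted(required) if key not in evidence]
```

For example, `check_handoff("coder", "release", {"diff": "..."})` reports the four missing items instead of letting the release agent infer them.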

The more agents you add, the more important this becomes. Without evidence-based handoffs, every downstream agent has to infer what the upstream agent meant. That makes failures harder to debug and easier to repeat.

4. Treat trace as a product feature

When an agent workflow fails, the least useful conclusion is "the model was unreliable." That may be true, but it does not tell you what to improve.

A useful trace should capture:

The task goal
The input and source material
The rules that were active
The tools that were called
The files or external systems touched
The verification result
The failure reason
The rule or workflow change suggested for next time

You do not need a complex observability backend on day one. A structured Markdown worklog or JSONL trace can be enough. What matters is that failures become training material for the system.
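A JSONL trace needs very little code. The sketch below assumes hypothetical field names that mirror the list above, and writes one JSON object per line using only the Python standard library.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical field names, one per item a useful trace should capture.
TRACE_FIELDS = (
    "goal", "inputs", "active_rules", "tools_called",
    "touched", "verification", "failure_reason", "suggested_change",
)


def append_trace(path, **record) -> dict:
    """Append one run to a JSONL trace file and return the stored entry."""
    unknown = set(record) - set(TRACE_FIELDS)
    if unknown:
        raise ValueError(f"unknown trace fields: {sorted(unknown)}")
    # Fields that were not supplied are stored as null so gaps stay visible.
    entry = {"timestamp": datetime.now(timezone.utc).isoformat()}
    entry.update({name: record.get(name) for name in TRACE_FIELDS})
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_traces(path) -> list[dict]:
    """Load every entry from a JSONL trace file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A run that failed might be logged with `append_trace("trace.jsonl", goal="update docs", failure_reason="vague task", suggested_change="tighten entry contract")`, where the file name is only an example.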

If a failure came from a vague task, improve the task entry contract. If a failure came from excessive permission, tighten the tool policy. If a failure came from a weak handoff, change the handoff table. If a failure came from missing verification, add a test or preflight check.

This is how an agent workflow gets more reliable over time. Not by hoping the next model will magically be better, but by converting failures into rules.

5. Start with one real workflow

The wrong move is to design a giant multi-agent platform first. The better move is to choose one low-risk but real workflow.

Good first workflows include:

Updating documentation after a code change
Reviewing a pull request for missing tests
Classifying issues into actionable buckets
Producing a release note from a verified diff
Preparing a technical article draft with source links and disclosure

For each workflow, define the entry contract, tool policy, handoff table, and trace format. Then run it repeatedly. The goal is not to prove that agents are impressive. The goal is to prove that the workflow reduces repeated coordination while preserving reviewability and rollback.

If a small workflow becomes stable, expand it. If it keeps failing, the trace should tell you whether the problem is the task, the permissions, the handoff, the verification, or the model.

The practical takeaway

OpenAI Agents Python is a useful foundation for building agent workflows. But the production value comes from the control plane around it.

Before adding more agents, define:

How tasks enter the system
Which tools are allowed under which conditions
What evidence is required for handoff
How traces feed back into better rules

That is less exciting than a flashy demo, but it is the difference between an agent that merely runs and an agent workflow that a team can actually trust.

Disclosure: this is an unofficial Doramagic technical note. It is not an official OpenAI publication and does not represent the upstream project unless explicitly stated by that project.

Adopt Codex CLI only after you can explain the source, boundary, review, and rollback model

Tang Weigang — Tue, 26 May 2026 03:51:27 +0000

A lot of teams want to treat Codex CLI as a shortcut: install it, point it at a repository, and hope it saves time immediately. That framing is too shallow for a real codebase.

If you are adopting Codex CLI in a team that cares about quality, the real question is not whether it can write code. The real question is whether the workflow around it is explicit enough to be traced to a verified source, bounded, reviewed, and reversed. Without those four properties, the tool can create output faster, but it cannot create confidence faster.

1. Start from the source of truth

Before any assistant touches a repository, someone needs to answer a basic question: what is the current source of truth?

That sounds trivial, but it is the first place AI-assisted workflows drift. Teams often test a tool against a repository snapshot, an old issue thread, or a blog post that no longer matches the current implementation. Once that happens, every next step becomes fragile because the assistant is reasoning from stale input.

A useful adoption process starts by checking three things:

the repository or package is the current one
the installation or usage instructions still match reality
the command being run is documented for this version, not an older release

If those checks fail, do not treat the tool as “mostly correct.” Treat the source as unresolved. In practice, a fast assistant reading the wrong upstream source is not a productivity gain. It is a faster way to compound confusion.

2. Make the permission boundary visible

The second boundary is operational scope.

A team should be able to answer, in plain language, what the tool may read, what it may change, what it may execute, and what requires human approval. If those boundaries are hidden in the operator’s head, the workflow is already too loose.

This matters because the early demo of an AI coding tool is misleading. It feels safe when it is only producing text. The risk appears when the same tool is allowed to inspect files, write patches, run shell commands, or touch directories that nobody explicitly intended to expose.

A mature setup does not see permission boundaries as friction. It sees them as the thing that makes the workflow repeatable. The point is not to maximize what the tool can do. The point is to define exactly what it can do so the rest of the team can trust the result.

A practical rule is simple:

read access should be explicit
write access should be narrow
destructive actions should require confirmation
privileged steps should be isolated from exploratory steps

If you cannot describe the boundary clearly, you do not yet have a production workflow.
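The practical rule above can be written down as a policy that anyone on the team can read. This is a minimal sketch: the directory names, action names, and the `decide` function are hypothetical, and it is a team-side check, not a description of how Codex CLI enforces permissions.

```python
import posixpath

# Hypothetical policy: explicit read roots, a narrow write root,
# and actions that always need a human.
POLICY = {
    "read": ["src", "docs"],
    "write": ["src/connectors"],
    "confirm": ["delete", "push"],
}


def _inside(path: str, roots: list[str]) -> bool:
    # Normalize first so "src/connectors/../billing" cannot escape its root.
    norm = posixpath.normpath(path)
    return any(norm == root or norm.startswith(root + "/") for root in roots)


def decide(action: str, path: str) -> str:
    """Return 'allow', 'deny', or 'needs_confirmation' for an action on a relative path."""
    if action in POLICY["confirm"]:
        return "needs_confirmation"
    if action in ("read", "write"):
        return "allow" if _inside(path, POLICY[action]) else "deny"
    # Anything the policy does not name is denied.
    return "deny"
```

Because the policy is plain data, it can be reviewed in the same pull request as the workflow that uses it.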

3. Put review back at the center

The third boundary is review.

This is where many teams get the biggest false win. A tool produces a patch quickly, and the team celebrates the speed. But if the patch is hard to inspect, hard to compare, or hard to reject, the tool has not reduced cost. It has merely moved cost into a later phase when context is already lower.

Review is not a ceremonial step after generation. Review is part of the product.

A good AI-assisted workflow makes the output:

easy to inspect
easy to compare
easy to reject
easy to refine

That means the assistant should be optimized for diffs, not theatrics. If a change cannot be understood in a short review cycle, the workflow is not ready. The best sign of maturity is not that the assistant can generate a large patch. It is that a normal engineer can explain why the patch is acceptable in minutes.

This is also where teams should insist on a clear evidence trail. If a change passes, where is the proof? If it fails, what specifically failed? If the answer is vague, the workflow is too soft to rely on.

4. Treat rollback as part of the design

The fourth boundary is rollback.

Rollback is often treated like cleanup after the fact. That is the wrong mental model. Rollback is part of the design of the workflow itself.

Every real repository will eventually see a bad assumption, an incomplete refactor, a broken command, or a change that looked reasonable until someone reviewed it closely. The question is not whether mistakes will happen. The question is whether recovery is fast enough that the team stays calm.

A rollback-capable workflow has three qualities:

you can identify the last safe state
you can return to it quickly
you can explain what changed without guessing

If those three qualities are not present, then every experiment becomes a one-way door. That is too expensive for a solo developer and unacceptable for a shared codebase.
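In a git repository, the three qualities map onto a handful of standard git commands. The sketch below assumes git is installed and the repository has at least one commit; the helper names are hypothetical. Note that `roll_back` is destructive: it discards uncommitted changes and deletes untracked files.

```python
import subprocess


def git(repo: str, *args: str) -> str:
    """Run a git command inside repo and return its stdout; raises on failure."""
    done = subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def mark_safe_state(repo: str) -> str:
    # Refuse to start from a dirty tree, otherwise "last safe state" is ambiguous.
    if git(repo, "status", "--porcelain"):
        raise RuntimeError("working tree is not clean; commit or stash first")
    return git(repo, "rev-parse", "HEAD")


def explain_changes(repo: str, safe_ref: str) -> str:
    # Lists tracked files that differ from the safe state.
    # Untracked files are not included here; "git status" shows those.
    return git(repo, "diff", "--name-status", safe_ref)


def roll_back(repo: str, safe_ref: str) -> None:
    git(repo, "reset", "--hard", safe_ref)  # restore HEAD and tracked files
    git(repo, "clean", "-fd")               # delete untracked files and directories
```

A trial would call `mark_safe_state` before the tool runs, `explain_changes` during review, and `roll_back` only if the change is rejected.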

This is the difference between “the tool can help me write code” and “the tool can participate in an engineering system.” The first is a demo. The second is a capability.

5. Use a better adoption question

The wrong question is: can the tool generate good code?

The better question is: can the team trust the workflow around the tool?

That better question breaks down into four operational checks:

Can we identify the source of truth before the tool starts?
Can we define the tool’s authority without ambiguity?
Can we tell whether the change is acceptable in under five minutes?
Can we return to the last known good state without guesswork?

Those are more useful than any demo because they turn a vague technology discussion into a reviewable operating standard.

If the answer to any of those questions is “not yet,” the right response is not to push harder on the model. The right response is to fix the workflow boundary first.

6. What a real adoption path looks like

For a real team, the best rollout is boring on purpose.

It should begin with a narrow, reversible use case. Not a magical broad permission set. Not an open-ended “let’s see what happens.” A narrow path where the output is easy to inspect and easy to undo.

A good adoption path usually looks like this:

choose one repository
define one class of change
define one reviewer
define one rollback path
measure whether the same standard holds on the second and third run

The repetition matters. The first successful run is easy to overvalue because everybody is paying attention. The real test is the second, third, and tenth run, when the novelty is gone and the tool has to fit ordinary work.

If the workflow does not survive repetition, it is not ready.

7. Why this matters for the team, not just the tool

This approach is bigger than Codex CLI.

Any AI coding tool used in a real repository should be evaluated the same way. The issue is not which vendor is cleverer. The issue is whether the team can maintain control while gaining speed.

When something goes wrong, a mature team should not debate the intelligence of the tool. It should inspect the broken boundary:

was the source stale?
were the permissions too broad?
was the review path unclear?
was rollback not guaranteed?

That framing reduces emotional noise and makes the problem fixable. It also makes the workflow easier to teach to other engineers because the rules are operational, not mystical.

8. The shortest useful conclusion

Codex CLI is worth adopting only when the surrounding workflow is already disciplined enough to keep it honest.

If source is verified, permissions are bounded, review is visible, and rollback is guaranteed, the tool becomes useful. If not, it just helps you create uncertainty faster.

Doramagic project page: https://doramagic.ai/en/projects/codex/
Manual: https://doramagic.ai/en/projects/codex/manual/
Source repository: https://github.com/openai/codex

Non-official note: this is a Doramagic-made, non-official AI capability package. Unless the upstream project states otherwise, it does not represent an official upstream release.

Codex CLI is useful only when the workflow around it is reviewable and reversible

Tang Weigang — Tue, 26 May 2026 03:45:42 +0000

A lot of teams still evaluate AI coding tools by asking whether the tool can generate code quickly. That is a useful question, but it is not the one that decides whether the tool can enter a real workflow.

If you plan to put Codex CLI into a live repository, the real question is whether the surrounding process is reviewable, bounded, and reversible. Without those three properties, fast code generation is only a faster way to create uncertainty.

Start with the source boundary

The first thing a team needs is source clarity.

Is the repository current? Are the docs aligned with the implementation? Is the installation guide still valid? Is the command you are about to run documented for the current version, not for an older release buried in an issue thread? If the source chain is stale, every later decision is built on a false premise.

This sounds basic, but it is where a lot of AI tool adoption goes wrong. People test the tool on whatever seems to work, then discover later that the official docs and the actual behavior drifted apart months ago. A fast assistant reading the wrong repository or an outdated example is not an efficiency gain. It is a faster way to multiply confusion.

The practical move is simple: verify the current upstream source before you trust the assistant’s output.

Define the permission boundary before you try to save time

The second boundary is operational scope.

What can the tool read? What can it modify? What commands can it execute? Which directories are in scope? Which actions require human confirmation? Which steps must be blocked until a reviewer looks at them?

Teams often skip this because the tool feels helpful during the first demo. That is exactly the danger. An AI coding tool becomes risky when everyone assumes it is “just helping” while it is already touching files, shell commands, or environments that nobody explicitly intended to expose.

Good teams do not treat permission boundaries as friction. They treat them as the thing that makes the rest of the workflow usable.

A boundary is not a limitation on productivity. It is what turns productivity from a guess into a repeatable process.

Put review back at the center

The third boundary is review.

If a change cannot be inspected in a diff, if the intent cannot be understood from the patch, or if the test output cannot explain what changed, then the AI has not saved time. It has just moved the cost to a later moment when the team has less context.

The best workflow is not the one that makes the biggest patch quickly. It is the one that makes the patch easy to evaluate.

That means the output needs to be:

easy to inspect,
easy to compare,
easy to reject,
and easy to refine.

In other words, the tool should make review easier, not optional.

Rollback is not cleanup; rollback is part of the design

The fourth boundary is rollback.

Every real workflow will eventually see a bad edit, a wrong assumption, a partial refactor, a failed test run, or a change that looks fine until a reviewer pushes back. The question is not whether failure will happen. The question is whether recovery is simple enough that the team stays calm.

A good rollback path means you can identify the last safe state, return to it quickly, and explain what changed. Without that, every trial becomes a one-way door.

This is where many tools look stronger than they are. They can produce code, but they cannot produce confidence. And in a real team, confidence comes from being able to reverse the move if it turns out to be the wrong move.

A better evaluation model

Instead of asking, “Did it generate good code?”, ask these four questions:

Can we identify the exact source of truth before the tool starts?
Can we define the tool’s authority without ambiguity?
Can we tell whether the change is acceptable in under five minutes?
Can we return to the last known good state without guesswork?

Those questions are more useful than any demo. They shift the discussion from vague enthusiasm to operational control.

That is the standard I would apply to any AI coding tool, not just Codex CLI. It also makes team coordination easier. When something goes wrong, you do not argue about the tool’s intelligence. You inspect the broken boundary: source, permissions, review, or rollback.

What “good” actually looks like

A mature team does not need a heroic pilot project to justify the tool. It needs a repeatable path that a normal engineer can follow on an ordinary day.

The team should be able to say:

why a change was accepted,
why a change was rejected,
where the evidence lives,
and how to undo it.

If that conversation takes more than a few minutes, the workflow is still too vague.
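One way to keep that conversation short is to record each decision with the same four pieces of information. The record below is a hypothetical sketch; the class name, field names, and example values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewDecision:
    """Hypothetical record of one accept/reject decision on a tool-generated change."""
    change_id: str
    accepted: bool
    reason: str    # why the change was accepted or rejected
    evidence: str  # where the proof lives, e.g. a test log or review link
    undo: str      # how to revert, e.g. a commit to reset to

    def problems(self) -> list[str]:
        # A decision with a blank reason, evidence, or undo path is incomplete.
        return [name for name in ("reason", "evidence", "undo")
                if not getattr(self, name).strip()]


decision = ReviewDecision(
    change_id="example-change-1",
    accepted=False,
    reason="Patch edits files outside the agreed scope",
    evidence="review notes attached to the change",
    undo="",
)
# decision.problems() returns ["undo"]: the record cannot yet say how to back out.
```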

That is why I treat Codex CLI as a capability asset, not as a magical terminal replacement. It is useful only when the surrounding system makes its outputs inspectable and reversible. The real win is not speed by itself. The real win is speed with control.

Why this matters for day-to-day adoption

A lot of teams overestimate the importance of the first successful run.

The first run is easy to celebrate because the context is fresh and everybody is paying attention. The real test is whether the same standard can hold on the second, third, and tenth run, when nobody is excited anymore and the tool has to fit into actual work.

That is where the source boundary, permission boundary, review boundary, and rollback boundary become practical rather than theoretical. They stop being abstract ideas and become the difference between a tool that integrates and a tool that merely impresses.

If the workflow cannot survive repetition, it is not ready.

The shortest useful conclusion

Codex CLI should not be adopted because it looks clever in a demo. It should be adopted because the team can trust the workflow around it.

That means:

source is verified,
permissions are bounded,
review is visible,
rollback is guaranteed.

If those four things are true, the tool becomes useful.
If they are not, the tool just creates faster uncertainty.

Codex CLI Is Useful Only When the Workflow Around It Is Reviewable and Reversible

Tang Weigang — Mon, 25 May 2026 02:26:16 +0000

A lot of teams evaluate an AI coding tool the wrong way.

They install it, ask it to change a file, watch it produce something plausible, and then conclude the tool is ready for real work. That conclusion is too optimistic. A terminal coding agent is not useful because it can generate text that looks like code. It becomes useful only when the workflow around it is strong enough to survive a bad run, a partial run, or a misunderstood instruction.

That is the practical lesson I keep seeing with tools like Codex CLI: the capability itself is not the product. The capability plus the source boundary, permission boundary, review path, test path, and rollback path is the product.

If your team skips that layer, you are not adopting an AI workflow. You are importing uncertainty into a place where you used to have discipline.

The first mistake: treating the demo as the operating model

A demo is optimized to look good. A real workflow is optimized to fail safely.

That difference matters more than the model brand, the prompt style, or the UI polish. A terminal agent can make a small repository feel magical because the first few tasks are usually easy: rename a function, add a README section, move a helper, adjust a test expectation. Those are not the hard cases. The hard cases are the ones where the tool touches the wrong directory, edits the wrong branch, oversteps its permissions, or silently assumes a convention that does not exist in your repo.

A strong evaluation therefore starts with a different question:

What exactly must be true before I trust this tool in a real repository?

For me, the answer has five parts.

I can verify the source.
I can define what the tool is allowed to read and change.
I can inspect every meaningful change in git.
I can run the relevant tests before I accept the output.
I can revert the whole trial without ambiguity.

Without those five pieces, the tool may still be impressive, but it is not yet operational.

Source verification is not a nice-to-have

The first thing to check is not whether the tool can write code. The first thing to check is whether you are following the right source.

That sounds obvious, but it is where many teams get sloppy. They pull install instructions from an old blog post, copy a workflow from a stale README, or rely on memory from an earlier version. With AI tools, that creates a subtle failure mode: the model appears capable, but the documentation, flags, defaults, or safeguards may have changed since the last time you looked.

A reviewable workflow begins with source identity:

Which repository is the canonical upstream?
Which version or branch are you actually using?
Which docs are current, and which are historical?
Is the install path consistent with the tool you are actually running?

If your answer to any of those is fuzzy, stop there. Do not move on to performance, prompt tuning, or automation. Those are downstream concerns. Source mismatch is upstream failure.

For a team like mine, the practical rule is simple: before any tool touches a production repository, I want a source page that I can point to and a local note that says exactly what was verified. If I cannot answer “what changed since the last run?”, I am not ready to use the tool again.
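The local note can be a small structured record, which makes "what changed since the last run?" a mechanical comparison. In the sketch below the class, its field names, and the placeholder values are hypothetical; only the upstream repository URL is the one cited in this series, and no real version numbers or install commands are implied.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceNote:
    """Hypothetical record of what was verified before a trial."""
    upstream: str        # canonical repository
    version: str         # version or branch actually in use
    docs_url: str        # documentation treated as current
    install_method: str  # how the tool was installed
    verified_on: str     # ISO date of the check


def changed_since(previous: SourceNote, current: SourceNote) -> dict[str, tuple[str, str]]:
    """Return {field: (old, new)} for every field that differs, ignoring the check date."""
    old, new = asdict(previous), asdict(current)
    return {name: (old[name], new[name])
            for name in old
            if name != "verified_on" and old[name] != new[name]}


# Placeholder values: "v-old" and "v-new" stand in for real version identifiers.
last_run = SourceNote("https://github.com/openai/codex", "v-old",
                      "docs-page-a", "install-method-a", "2026-05-01")
this_run = SourceNote("https://github.com/openai/codex", "v-new",
                      "docs-page-a", "install-method-a", "2026-05-25")
# changed_since(last_run, this_run) returns {"version": ("v-old", "v-new")}
```

A non-empty result is the cue to re-read the docs for the changed item before the tool runs again.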

Permission boundaries define whether the tool is useful or risky

The second mistake is treating permissions as a UI problem rather than a system design problem.

Coding agents can be dangerous for reasons that have nothing to do with raw code quality. They can read files too widely, write to the wrong place, generate edits based on outdated assumptions, or run commands that have side effects you did not intend. The risk is not just “bad code.” The risk is “bad code plus uncontrolled scope.”

A real workflow needs permission boundaries that are explicit enough to explain to another engineer in one minute.

Questions I want answered before a trial:

What directories can the agent inspect?
What files can it edit?
Can it execute shell commands?
If so, which commands are safe and which commands require review?
When should the agent stop and ask for human confirmation?
What is the smallest safe target repository for the first trial?

This is where many teams confuse speed with safety. They want the agent to move fast, so they grant broad access. That is backwards. The smaller and clearer the permission boundary, the faster the team can actually move, because the review burden drops.

I would rather let a tool work in a narrow sandbox and prove itself than give it broad access and spend the rest of the week trying to explain what happened.
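The question of which commands are safe and which require review can be answered with an explicit table instead of judgment in the moment. The sketch below is a simplified illustration: the command lists and the `classify` function are hypothetical, and it does not describe how Codex CLI itself handles command approval.

```python
import shlex

# Hypothetical tables: command prefixes that may run, and prefixes that need review.
SAFE = {("git", "status"), ("git", "diff"), ("ls",), ("pytest",)}
REVIEW = {("git", "push"), ("git", "reset"), ("rm",)}


def classify(command: str) -> str:
    """Return 'allow', 'needs_review', or 'deny' for a single shell command line."""
    # Reject chaining, substitution, and redirection so one line is one command.
    if any(ch in command for ch in ";|&`$<>\n"):
        return "deny"
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:
        return "deny"  # unbalanced quotes or similar
    # Prefer the more specific two-token prefix, then fall back to the program name.
    for length in (2, 1):
        prefix = tokens[:length]
        if len(prefix) == length:
            if prefix in REVIEW:
                return "needs_review"
            if prefix in SAFE:
                return "allow"
    # Commands that appear in neither table are denied.
    return "deny"
```

With these tables, `classify("git status")` returns `"allow"`, `classify("git push origin main")` returns `"needs_review"`, and `classify("git status; rm -rf /")` returns `"deny"`.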

The review path is the real value

If an AI coding tool leaves you with code you cannot inspect, it is not solving engineering work. It is generating opaque artifacts.

The most useful thing about a coding agent is not the code it writes. It is the fact that it can be forced into the same review path as a human contributor:

the diff is visible,
the change is scoped,
the intent is explainable,
and the reviewer can reject it without drama.

That is the standard. If the output cannot go through a normal git review, it is not ready for a real repo.

A good workflow is one where the output can be evaluated with a series of concrete questions:

What file changed?
Why did it change?
What is the behavioral difference?
Is the new behavior covered by a test?
Does the change touch a risky path?
What happens if I reject this patch?

Those questions sound basic, but they are the difference between a coding assistant and a production habit.

The agent should not be evaluated on whether it can produce a long answer. It should be evaluated on whether the answer collapses cleanly into an inspectable diff and a credible explanation.

Tests are not a formality

A coding agent that does not live inside your test loop is just a fast way to create plausible-looking changes.

That is the part teams underestimate. They see a change that compiles or looks reasonable in the editor and assume the work is done. But engineering value is not “the patch looks okay.” Engineering value is “the patch survives the relevant validation path.”

For different repos, that validation path may mean different things:

unit tests,
integration tests,
type checks,
linting,
snapshot updates,
smoke tests,
or a specific manual acceptance check.

The important thing is not which test style you use. The important thing is that you know, ahead of time, what evidence counts as acceptance.

That means every agent workflow should answer:

What tests should run after the change?
Which tests are mandatory versus optional?
What failure is acceptable during the trial, and what failure is a stop signal?
What is the smallest set of commands that proves the change is safe enough?

Without that, the agent may feel productive, but your repo is accumulating unverified state.
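Deciding ahead of time which checks are mandatory can be encoded directly in the runner. The sketch below assumes each check is a command whose exit code signals success; the `Check` and `run_checks` names are hypothetical, and the real commands (test runner, linter, type checker) depend on the repository.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical validation step: a command plus whether failure is a stop signal."""
    name: str
    command: list[str]
    mandatory: bool


def run_checks(checks: list[Check]) -> dict:
    """Run every check and report which mandatory ones failed."""
    results = {}
    for check in checks:
        try:
            code = subprocess.run(check.command, capture_output=True).returncode
        except FileNotFoundError:
            code = 1  # a missing tool counts as a failed check, not a skipped one
        results[check.name] = code == 0
    stop_on = [c.name for c in checks if c.mandatory and not results[c.name]]
    return {"results": results, "accepted": not stop_on, "stop_on": stop_on}
```

Optional checks still run and still appear in `results`, so a failing lint step stays visible even when it does not block acceptance.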

Rollback is the boundary that separates a trial from a commitment

This is the part most teams postpone, and the part they regret postponing.

If a trial with an AI coding tool goes wrong, you need a rollback path that is boring and fast. Boring means it is already decided. Fast means you do not need to re-derive the reversal under pressure.

A real rollback plan should include:

the branch or workspace you used,
the exact command or mechanism to undo the trial,
the files that were allowed to change,
the commands that should be reversed or replayed,
and the notes that explain why the trial failed.

The absence of rollback is what turns experimentation into operational debt.

I do not think “reversible” is a nice philosophical ideal. I think it is the minimum condition for using a coding agent in a serious repository. If you cannot explain how to back out, you have not actually bounded the trial.

Reviewable and reversible is a better adoption criterion than “smart”

One of the reasons AI coding discussions become noisy is that people ask the wrong question. They ask whether the tool is smart enough. That question is too vague to be useful.

A more practical question is:

Can my team understand, verify, and undo what the tool did?

That question is better because it maps to engineering controls instead of vague capability claims.

If the answer is yes, the tool may genuinely improve throughput.

If the answer is no, the tool may still be impressive, but it does not belong in a critical workflow yet.

This applies whether the tool is used for a one-off change, a repetitive maintenance task, or a larger refactor. The core adoption pattern does not change:

verify the source,
define the permission boundary,
keep the diff reviewable,
keep the validation path explicit,
keep rollback easy.

Anything beyond that is optimization.

What this means in practice for Codex CLI

I like Codex CLI for the same reason I like any serious engineering tool: it can be made useful if the system around it is disciplined.

That discipline is not decorative. It is the product.

When I evaluate a terminal coding agent, I want to see a setup that makes the following easy:

confirm the upstream source,
read the install path once,
understand the permission boundary,
inspect the diff in git,
run the right tests,
and roll back if the trial does not hold up.

If those things are hard, the tool is not ready for the way real teams work.

That is why I would not frame adoption as “Can Codex CLI write code?”
I would frame it as “Can Codex CLI operate inside a workflow that is reviewable and reversible?”

That is the difference between a useful capability and a flashy demo.

A simple acceptance checklist

If I were deciding whether to trust a new coding agent in a repo, I would want this checklist to be true:

I know the canonical upstream source.
I know the current docs match the current tool.
I know what the agent can read and edit.
I know what commands are allowed.
I can review every change in git.
I know which tests prove the patch is valid.
I can revert the trial cleanly.
I can explain the trial to another engineer without hand-waving.

That is enough to start.

If you cannot satisfy that checklist, the tool may still be worth experimenting with, but it is not ready to become part of your normal workflow.
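The checklist is short enough to keep as data next to the repository, so a trial can be blocked while any item is unconfirmed. This is a minimal sketch; the item wording paraphrases the list above and the function name is hypothetical.

```python
# Paraphrased checklist items, in the same order as the list above.
CHECKLIST = (
    "canonical upstream source is known",
    "current docs match the current tool",
    "read and edit scope is known",
    "allowed commands are known",
    "every change is reviewable in git",
    "tests that prove the patch are known",
    "trial can be reverted cleanly",
    "trial can be explained to another engineer",
)


def unmet(confirmed: set[str]) -> list[str]:
    """Return checklist items that have not been confirmed, in checklist order."""
    unknown = confirmed - set(CHECKLIST)
    if unknown:
        # Catch typos instead of silently treating them as confirmations.
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    return [item for item in CHECKLIST if item not in confirmed]
```

An empty result from `unmet` means the trial can start; anything else names exactly which boundary still needs work.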

Final judgment

My view is simple: Codex CLI is useful only when the surrounding workflow is reviewable and reversible.

That sounds strict, but it is actually the friendly standard. It keeps the tool honest, keeps the team in control, and prevents a promising demo from becoming a maintenance problem.

The right adoption path is not to ask the agent to do more. It is to ask the workflow to prove more.

If the workflow can prove source identity, permission boundary, reviewability, testability, and rollback discipline, then the agent earns its place.

If it cannot, then the tool is not the problem. The workflow is.

Codex CLI needs a rollback-ready workflow, not just a quick install

Tang Weigang — Sun, 24 May 2026 03:05:35 +0000

When teams evaluate Codex CLI, the first questions are usually simple: can we install it, and can it help with code changes?

Those questions matter, but they are not enough.

A coding agent becomes useful in a real repository only when the workflow around it is reviewable and reversible.

That means checking a few things before treating the tool as production-adjacent:

First, verify the source path.
Make sure the repository, documentation, and install path you are following are actually the current upstream path and not a stale secondary summary.

Second, define the permission boundary.
What can the tool read, what can it edit, what commands can it run, and where must a human step in?
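
One way to make this second check concrete is to write the boundary down as data that a wrapper script can evaluate before the tool acts. The sketch below is an illustration, not a Codex CLI feature: the `PermissionBoundary` class, its field names, and the example paths are all assumptions, and it presumes commands are executed as argv lists rather than through a shell.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PermissionBoundary:
    """Hypothetical policy object; field names are illustrative only."""

    readable: tuple[str, ...]                      # globs the tool may read
    editable: tuple[str, ...]                      # globs the tool may modify
    allowed_commands: tuple[tuple[str, ...], ...]  # argv prefixes it may run
    human_required: tuple[str, ...]                # globs that always need a person

    @staticmethod
    def _matches(path: str, patterns: tuple[str, ...]) -> bool:
        pure = PurePosixPath(path)
        if ".." in pure.parts or pure.is_absolute():
            return False  # refuse traversal and absolute paths outright
        return any(fnmatchcase(path, pattern) for pattern in patterns)

    def can_read(self, path: str) -> bool:
        return self._matches(path, self.readable)

    def can_edit(self, path: str) -> bool:
        if self._matches(path, self.human_required):
            return False  # a human must step in, even inside an editable area
        return self._matches(path, self.editable)

    def can_run(self, argv: list[str]) -> bool:
        # Assumes commands run as argv lists, never through a shell.
        return any(tuple(argv[: len(p)]) == p for p in self.allowed_commands)


# Example values are invented for illustration.
boundary = PermissionBoundary(
    readable=("src/*", "tests/*"),
    editable=("src/connectors/*", "tests/connectors/*"),
    allowed_commands=(("pytest",), ("git", "diff")),
    human_required=("*/migrations/*",),
)
```

The human-required patterns are checked before the editable ones, so a sensitive path stays protected even when it sits inside a directory the tool may otherwise edit.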

Third, require a review path.
A good result is not "the agent wrote code."
A good result is "the change can be inspected in git, checked against tests, and rejected without confusion."

Fourth, keep rollback discipline.
If a trial goes wrong, the team should know the smallest safe environment, the expected output, and how to back out without leaving uncertain state behind.

This is why Doramagic treats Codex CLI as more than a terminal AI demo.
The useful part is the capability asset around it: source review, setup boundaries, validation checks, and rollback guidance.

Project page:
https://doramagic.ai/en/projects/codex/

Source repository:
https://github.com/openai/codex

This is an independent Doramagic resource pack. It is not an official upstream project release unless the upstream project says so.

Aider clone signals point to rollback-ready coding contracts, not just better code generation

Tang Weigang — Fri, 22 May 2026 17:39:52 +0000

Today’s first high-quality Doramagic publishing topic is Aider.

I am not treating the 2026-05-23 GitHub collection as a complete data source. The live collection process stalled during the publishing window and did not write a 2026-05-23.jsonl file. This post therefore uses the newest available local metrics snapshot: the 2026-05-22 GitHub traffic file.

In that snapshot, doramagic-aider-pack had 4 views, 21 clones, and 21 unique cloners. This is not a large number, and it does not prove adoption. A clone can come from a batch scan, a script, a mirror, a one-off check, or a real evaluator. Treating clones as satisfaction would be a weak claim.

The signal is still useful because Aider sits in a category that is easy to misunderstand. People often frame AI coding tools as “chatbots that write more code.” In actual engineering work, the valuable question is different: can the tool keep a code change inside a contract that is inspectable, reversible, and reviewable?

The first layer of an AI coding tool is obvious. It turns a natural-language request into code changes. It can inspect files, propose patches, run tests, and iterate. That is useful, but it is not enough for daily engineering work. Teams also need to know what boundary the agent changed, what evidence it used, what it did not know, and what it will do when the first attempt fails.

This is where Aider-style tools differ from ordinary code completion. They participate in the change loop. Once a tool participates in the change loop, it needs engineering discipline, not just generation ability.

A reusable Aider capability pack should answer at least six questions.

First, how does the agent establish the task boundary before editing?

It should know which files are in scope, which directories are only references, which files must not be touched, and which existing changes may belong to the user. Without that rule, “can edit code” becomes “can change the whole repository.” For a small team or one-person company, that is especially dangerous because there may be no second reviewer waiting downstream.

Second, what evidence must be captured before the patch?

A good coding workflow should not begin with blind edits. It should record the branch, dirty worktree, relevant tests, existing failures, target files, entry points, configuration, and dependencies. Without this evidence, a later claim that the task is fixed may only mean that one narrow path happened to pass.
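
As a rough illustration of the git part of that evidence, the sketch below records the branch, the current commit, and the dirty worktree before any edit. It is an assumption-laden example rather than anything Aider provides: `capture_baseline` and the `Runner` type are hypothetical names, and tests, existing failures, configuration, and dependencies would still need to be recorded separately.

```python
import subprocess
from typing import Callable

Runner = Callable[[list[str]], str]


def git(args: list[str]) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    # Strip only the trailing newline; porcelain lines may start with a space.
    return result.stdout.rstrip("\n")


def capture_baseline(run: Runner = git) -> dict[str, object]:
    """Record repository state before the agent edits anything."""
    status = run(["status", "--porcelain"])
    return {
        "branch": run(["rev-parse", "--abbrev-ref", "HEAD"]),
        "head_commit": run(["rev-parse", "HEAD"]),
        # Porcelain lines are "XY path"; keep the path part of each entry.
        "dirty_files": [line[3:] for line in status.splitlines()],
        "is_dirty": bool(status),
    }
```

Passing the runner as a parameter keeps the function testable without a real repository, and the recorded `dirty_files` make it possible to tell afterwards which changes already belonged to the user.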

Third, how small should the patch be?

AI tools tend to produce large changes when the task is underspecified. Large changes can look productive, but they increase review cost and rollback risk. A better rule is: make the smallest verifiable change first, run the closest relevant check, then decide whether to expand scope. A capability pack should make that rhythm explicit instead of hoping the model chooses it.

Fourth, how should failure narrow the search?

Failing tests are not the problem. Guessing after failure is the problem. The agent should record the command, the error summary, the suspected cause, the next hypothesis, and why that hypothesis is better than the alternatives. Then failure becomes reusable evidence instead of an invisible chain of automated attempts.
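
A failure record of this kind could be as small as one JSON line per attempt. The sketch below assumes hypothetical names (`FailureRecord`, `append_failure`) and an in-memory list standing in for a log file; it only shows that the five fields named above can be made mandatory.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureRecord:
    """One failed attempt, using the five fields described above."""

    command: str
    error_summary: str
    suspected_cause: str
    next_hypothesis: str
    why_this_hypothesis: str


def append_failure(log: list[str], record: FailureRecord) -> None:
    """Append one JSON line per failed attempt; refuse blank fields."""
    blanks = [name for name, value in asdict(record).items() if not value.strip()]
    if blanks:
        raise ValueError(f"failure record is missing: {', '.join(blanks)}")
    log.append(json.dumps(asdict(record), sort_keys=True))
```

Rejecting blank fields is the point of the sketch: an attempt that cannot state its next hypothesis is a guess, and the log should not accept it.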

Fifth, when must the agent stop?

Authentication, billing, database migrations, destructive scripts, production configuration, release actions, user privacy, and external publication are not ordinary coding steps. The agent should stop, explain the risk, propose the next action, and wait for confirmation. The stronger the tool becomes, the more important the stop rule becomes.

Sixth, what proves the result?

“Done” is not proof. A useful final state should include changed files, why the change was made, which checks were run, which checks were not run, remaining risks, and the next recommended action. Code that cannot be reviewed is just maintenance debt produced faster.

For Doramagic, the lesson is that an Aider resource pack should not be a prompt collection. It should be a portable engineering contract. It should connect code generation, context reading, patch boundaries, verification commands, failure recovery, human confirmation, and final evidence into one reusable loop.

More precisely, this is not just a prompt library. A high-quality Aider capability asset needs:

a source map that says which evidence came from upstream Aider, the Doramagic project page, and the local publishing metrics;
host instructions that tell the agent how to read the repository, respect file scope, and protect the dirty worktree;
a prompt preview that shows how the operator should describe a task;
a pitfall log for large patches, accidental deletion, missing tests, and context mistakes;
a boundary card for database changes, production configuration, account permissions, release actions, and external publication;
eval or smoke check criteria for deciding whether the patch passed;
a human manual with a review checklist;
a test log that records what was and was not run;
a feedback path so failure becomes reusable knowledge.

The decision rule should be explicit. GO when the task is scoped, reversible, and covered by a clear verification checklist. HOLD when requirements are vague, tests are missing, or file boundaries are unclear. NO_GO when the change touches production data, deletion, account authority, external publishing, or any action that cannot be safely rolled back. These pass/fail criteria are more useful than saying the model “seems to understand the code.”
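
The rule is simple enough to write as a function. This is a sketch, not part of Aider or of any published pack: `TaskAssessment` and its field names are hypothetical, and two details are my own assumptions because the text does not specify them, namely that NO_GO outranks HOLD and that anything not clearly GO falls back to HOLD.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    GO = "GO"
    HOLD = "HOLD"
    NO_GO = "NO_GO"


@dataclass(frozen=True)
class TaskAssessment:
    """Hypothetical operator answers; field names are illustrative."""

    scoped: bool
    reversible: bool
    has_verification_checklist: bool
    requirements_clear: bool
    tests_exist: bool
    file_boundaries_clear: bool
    touches_production_data: bool = False
    deletes_data: bool = False
    changes_account_authority: bool = False
    publishes_externally: bool = False


def decide(task: TaskAssessment) -> Decision:
    # Assumption: NO_GO outranks HOLD, and anything not clearly GO is HOLD.
    if (
        task.touches_production_data
        or task.deletes_data
        or task.changes_account_authority
        or task.publishes_externally
        or not task.reversible
    ):
        return Decision.NO_GO
    if not (
        task.requirements_clear
        and task.tests_exist
        and task.file_boundaries_clear
    ):
        return Decision.HOLD
    if task.scoped and task.has_verification_checklist:
        return Decision.GO
    return Decision.HOLD
```

Because the inputs are explicit booleans, a reviewer can see which answer produced a NO_GO instead of arguing about whether the model understood the code.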

The limits also need to be visible. This is independent guidance, not official Aider documentation, and it should not claim to represent the upstream project. The pack should help an operator decide how to use an AI coding agent safely; it should not pretend that every repository, team policy, or production environment has the same risk boundary.

That is also why a Doramagic Aider pack should not only explain “how to make Aider write code.” The stronger use case is explaining which tasks are safe for direct edits, which tasks should only produce candidate patches, when the agent must stop, and how each success or failure becomes reusable operating knowledge.

Today’s metrics do not prove that doramagic-aider-pack is successful. They only show that someone or some system is pulling on this type of capability. The better question to keep testing is whether users need a portable AI coding work contract that they can load, reuse, inspect, and verify, rather than another prompt that says “help me change this code.”

source_asset_url:
https://github.com/Aider-AI/aider

doramagic_project_url:
https://doramagic.ai/en/projects/aider/

This is an independent Doramagic resource pack. It is not an official Aider release unless the upstream project says so.

Chrome DevTools MCP traffic shows developers are checking operability, not hype

Tang Weigang — Fri, 22 May 2026 07:16:02 +0000

Today's first Doramagic publishing signal comes from doramagic-chrome-devtools-mcp-pack.

In the partial 2026-05-22 GitHub metrics snapshot, the repository had 44 views, 2 unique viewers, 120 clones, and 60 unique cloners. The more important signal is the path pattern: the traffic page was opened 16 times, the commit-activity page 10 times, and the repository overview 8 times, with further visits to the community health, pulse, README, README.zh-CN, branches, and contributors pages.

I would not interpret this as satisfaction. A GitHub clone can come from real evaluation, a script, a mirror, a batch checker, or a one-off scan. Views and clones are not product love by themselves. But the path pattern still matters because it does not look like casual article reading. It looks like an operability and trust check.

That distinction is important for Chrome DevTools MCP and browser-agent work.

Browser automation creates an easy illusion: if an agent can open a page, click a button, read the DOM, and take a screenshot, then it can complete a real workflow. In practice, the hard part is not the click. The hard part is knowing the page state before the click, verifying the result after the click, and recovering when the page does something unexpected.

A reusable UI acceptance chain has to answer several questions.

First, what is the starting state?

Is the user logged in? Is the active tab the target page? Has the page finished loading? Are cookies, permissions, language settings, cached state, region, popups, or experiments affecting the result? If this state is not recorded, a successful run may be impossible to reproduce.

Second, what counts as verification?

Many browser agents stop at "the page looks fine." That is not acceptance. Acceptance needs inspectable evidence: URL, key DOM state, screenshot, console errors, failed network requests, form state, enabled buttons, post-submit confirmation, public URL, or dashboard status. Without evidence, automation is just an unaudited action.
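
To show what "inspectable evidence" might look like as a record, here is a small sketch. The `BrowserEvidence` fields and the `rejection_reasons` function are hypothetical and are not tied to the schema of Chrome DevTools MCP or any other server; treating every console error and failed request as a rejection is also an assumption that a real team may want to relax.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BrowserEvidence:
    """Hypothetical evidence record; not tied to any MCP server's schema."""

    url: str
    dom_checks: dict[str, bool]  # e.g. {"submit button enabled": True}
    screenshot_path: Optional[str] = None
    console_errors: list[str] = field(default_factory=list)
    failed_requests: list[str] = field(default_factory=list)


def rejection_reasons(evidence: BrowserEvidence, expected_url_prefix: str) -> list[str]:
    """Return every reason the run cannot be accepted; empty means accepted."""
    reasons = []
    if not evidence.url.startswith(expected_url_prefix):
        reasons.append(f"unexpected URL: {evidence.url}")
    if not evidence.dom_checks:
        reasons.append("no DOM state was checked")
    reasons += [
        f"DOM check failed: {name}"
        for name, passed in evidence.dom_checks.items()
        if not passed
    ]
    if evidence.screenshot_path is None:
        reasons.append("no screenshot recorded")
    reasons += [f"console error: {e}" for e in evidence.console_errors]
    reasons += [f"failed request: {r}" for r in evidence.failed_requests]
    return reasons
```

Returning a list of reasons instead of a single boolean keeps the rejection itself inspectable, which matters when a run has to be explained later.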

Third, where is the action boundary and what are the limits?

Chrome DevTools MCP brings an agent closer to a real browser session. That increases capability, but it also increases risk. Login, publish, delete, submit, pay, authorize, upload, or change account settings are not ordinary clicks. A context pack has to say which actions can be automated, which actions require human confirmation, and which actions should stop at draft or preview.
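
A boundary like this can be expressed as a three-tier table. The sketch below is one possible policy, not guidance from Chrome DevTools MCP: the action names, the tier assignments, and the choice to send unknown actions to human confirmation are all assumptions that each team would set for its own accounts.

```python
from enum import Enum


class Tier(Enum):
    AUTOMATE = "automate"
    CONFIRM = "confirm"
    DRAFT_ONLY = "draft_only"


# Assumption: this split is an example policy, not a recommendation from
# any upstream project. Action names are invented for illustration.
ACTION_TIERS = {
    "navigate": Tier.AUTOMATE,
    "read_dom": Tier.AUTOMATE,
    "screenshot": Tier.AUTOMATE,
    "login": Tier.CONFIRM,
    "submit": Tier.CONFIRM,
    "upload": Tier.CONFIRM,
    "authorize": Tier.CONFIRM,
    "change_account_settings": Tier.CONFIRM,
    "pay": Tier.CONFIRM,
    "delete": Tier.CONFIRM,
    "publish": Tier.DRAFT_ONLY,
}


def tier_for(action: str) -> Tier:
    """Unknown actions fail closed: they need a human, not a guess."""
    return ACTION_TIERS.get(action, Tier.CONFIRM)


def may_proceed(action: str, human_confirmed: bool = False) -> bool:
    tier = tier_for(action)
    if tier is Tier.AUTOMATE:
        return True
    if tier is Tier.CONFIRM:
        return human_confirmed
    return False  # DRAFT_ONLY: stop at draft or preview, even if confirmed
```

The fail-closed default is the important design choice here: an action the table has never seen is treated as risky until someone classifies it.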

Fourth, what is the recovery path?

Real UI workflows fail in small ways: the editor is not focused, Markdown is swallowed by a rich-text field, the upload control rejects the file, a platform sends the post to moderation, a loading overlay never ends, labels change across languages, or an account-risk modal appears. A pack that only says "click publish" is not useful. A better pack tells the agent what to record, where to roll back, and how to decide whether the issue is content, account state, or platform behavior.

This is why the Chrome DevTools MCP path data is interesting. Developers were not only opening the main page. They were checking traffic, commit activity, community health, pulse, branches, and contributors. That is closer to asking "can I trust and operationalize this?" than "is this a cool demo?"

For Doramagic, the lesson is simple: AI capability packs should not be prompt collections, especially in high-side-effect domains like browser automation, UI acceptance, and publishing workflows. This is not just a prompt problem. A useful capability asset needs an operating contract:

what to observe before acting;
what the agent may do alone;
where it must stop for confirmation;
what evidence it must leave behind;
how it should recover from failure;
how one successful run becomes a reusable workflow.

That is why a Chrome DevTools MCP pack needs more than a README. README gives the human entry point. AGENTS.md and CLAUDE.md give host instructions. A source map shows what evidence the pack is based on. A prompt preview shows the intended interaction shape. Pitfall notes prevent repeated failure modes. A human manual gives the operator a checklist before loading the pack into a host. Evals define acceptance criteria. Boundary cards keep the agent from turning a powerful browser interface into uncontrolled action. Feedback notes help the next run improve instead of repeating the same failure.

The stronger browser agents become, the less useful it is to optimize only for "can it click?" The better product direction is: can every step be observed, proven, reversed, and reviewed?

Today's metrics do not prove that doramagic-chrome-devtools-mcp-pack is successful. They do show that people are checking the right surface: not whether Chrome DevTools MCP is hot, but whether the capability can fit into a trustworthy execution chain.

source_asset_url:
https://github.com/tangweigang-jpg/doramagic-chrome-devtools-mcp-pack

doramagic_project_url:
https://doramagic.ai/en/projects/chrome-devtools-mcp/

This is an independent Doramagic resource pack. It is not an official upstream project release unless the upstream project says so.

A clone-heavy MCP pack signal is about inspectability, not marketing

Tang Weigang — Thu, 21 May 2026 02:42:44 +0000

Today's second Doramagic signal comes from doramagic-prism-mcp-pack.

In the 2026-05-21 GitHub metrics snapshot, the repository had only 1 view and 1 unique viewer, but 40 clones and 31 unique cloners. That ratio is easy to misread. It does not prove satisfaction. GitHub clone activity can include real evaluation, scripts, CI, mirrors, or one-time scanning. But it is still a useful signal because it suggests that some users or systems are not merely reading the page. They are pulling the pack into an environment where it can be inspected.

For MCP-related resources, that distinction matters.

MCP is often discussed as if connecting a server automatically gives an agent a reliable capability. In practice, the interface is only one layer. The agent still needs task context, host instructions, permission boundaries, input assumptions, error recovery behavior, and acceptance criteria. Without those, the model can call tools while still misunderstanding the job.

A clone-heavy signal says the pack has to survive local inspection.

First, the entrypoint must be obvious. A user should not have to guess whether to start with README, AGENTS.md, CLAUDE.md, the human manual, or a prompt preview. The pack should make the first useful path visible.

Second, the boundary must be explicit. The resource should explain what belongs to the upstream project, what Doramagic added, and what the agent still cannot infer. This is especially important for MCP packs because official-status confusion is easy: a pack can be useful without being an official upstream release.

Third, the pack should contain failure knowledge. If the agent gets missing credentials, an unavailable server, an incompatible host, or incomplete input data, the right behavior is not confident continuation. It is to stop, report the missing evidence, and give the user a recovery path.

Fourth, the pack needs acceptance checks. A user cloning the repo is likely to inspect whether the files are actionable. That means commands, file boundaries, source maps, risk notes, and validation criteria matter more than polished positioning copy.

This is why Doramagic treats a project pack as an inspectable capability asset. The goal is not to make a generic MCP advertisement. The goal is to let a user or agent host answer practical questions:

What task does this pack support?
Which file should I load first?
Which host instructions are available?
Which parts come from upstream evidence?
What should the agent refuse to guess?
How do I know the task is actually complete?

The prism-mcp signal is not a victory claim. It is a reminder that clone-heavy behavior raises the bar for structure. If people pull the repository locally, the pack must be clear enough to be audited without a sales page.

source_asset_url:
https://github.com/tangweigang-jpg/doramagic-prism-mcp-pack

doramagic_project_url:
https://doramagic.ai/zh/projects/prism-mcp/

This is an independent Doramagic resource pack. It is not an official upstream project release unless the upstream project says so.

Complex AI frameworks need acceptance-ready context packs, not longer prompts

Tang Weigang — Thu, 21 May 2026 02:41:19 +0000

Today's first Doramagic publishing signal comes from doramagic-langchain-pack.

In the 2026-05-21 GitHub metrics snapshot, the repository had 12 views, 1 unique viewer, 28 clones, 23 unique cloners, and 2 stars. The more useful signal is not the raw count. It is the path pattern. The visitor did not only open the repository home. They also opened 01_PROMPT_PREVIEW.md, 03_PITFALL_LOG.md, 04_BOUNDARY_RISK_CARD.md, and 05_HUMAN_MANUAL.md.

That pattern matters because LangChain is not a simple library with one obvious happy path. It is a large, evolving framework with multiple subsystems: chains, retrieval, tools, agents, memory, callbacks, evaluation, integrations, and deployment concerns. A coding agent can sound confident while still mixing old APIs, tutorial assumptions, and project-specific requirements.

For a project like this, a longer prompt is not enough.

A prompt can start the interaction. It can tell the model what role to take, how to respond, and what output format to prefer. But it cannot, by itself, give a user a durable operating contract. It does not tell the agent when to stop guessing. It does not separate example code from production constraints. It does not preserve known failure modes across hosts. It does not define what evidence proves that the task is done.

An acceptance-ready context pack needs several layers.

First, it needs a human manual. The user should understand what task the pack is meant to support before loading it into an agent. Is it for reading architecture, migrating code, debugging chains, building retrieval workflows, or preparing safer prompts for a coding host? If the task boundary is vague, the agent will treat every LangChain request as the same generic problem.

Second, it needs a pitfall log. In large frameworks, failure knowledge is often more valuable than ideal-path instructions. A pitfall log tells the agent which common moves are risky: relying on stale examples, skipping version checks, confusing demo snippets with application code, or proposing a high-level chain without showing how it is validated.

Third, it needs a boundary and risk card. A useful agent should know what it cannot safely infer. If the user does not provide the dependency version, runtime error, file layout, or reproduction command, the agent should surface the missing evidence instead of inventing a confident answer.

Fourth, it needs an acceptance path. A response is not complete just because it reads well. The pack should help the agent leave behind inspectable evidence: files touched, commands to run, assumptions made, expected output, and recovery notes when the result fails.
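
A first, very small acceptance check is simply confirming that the layers exist and are not empty placeholders. The sketch below uses the four file names that appear in the visitor paths mentioned earlier in this post; that they sit at the pack root is an assumption, `audit_pack` is a hypothetical helper, and the pack may well contain other files that are not listed here.

```python
from pathlib import Path

# File names as they appear in the visitor paths mentioned in this post.
# Assumption: they live at the pack root; other pack files are not listed.
EXPECTED_FILES = (
    "01_PROMPT_PREVIEW.md",
    "03_PITFALL_LOG.md",
    "04_BOUNDARY_RISK_CARD.md",
    "05_HUMAN_MANUAL.md",
)


def audit_pack(root: Path) -> dict[str, str]:
    """Map each expected file to 'ok', 'empty', or 'missing'."""
    report = {}
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            report[name] = "missing"
        elif not path.read_text(encoding="utf-8").strip():
            report[name] = "empty"
        else:
            report[name] = "ok"
    return report
```

A check like this says nothing about the quality of the content, only that the layers a reader expects are present before the pack is loaded into a host.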

That is why Doramagic treats context packs as portable capability assets rather than prompt collections. A prompt preview is still useful, but it is only the front door. The durable value is in the manual, source map, pitfall log, risk boundary, and eval checklist that let the same capability move across Codex, Claude Code, Cursor, Aider, or another host.

The interesting part of today's LangChain signal is that readers were already inspecting those deeper files. That suggests they were not only asking whether an AI agent can write LangChain code. They were asking whether the pack can make the agent guess less, verify more, and recover when the first answer is wrong.

source_asset_url:

https://github.com/tangweigang-jpg/doramagic-langchain-pack

doramagic_project_url:

https://doramagic.ai/en/projects/langchain/

This is an independent Doramagic resource pack. It is not an official upstream project release unless the upstream project says so.

AI skills should be portable capability assets, not prompt collections

Tang Weigang — Wed, 20 May 2026 14:34:07 +0000

The second signal in today's Doramagic metrics was not another browser automation pack. It was doramagic-skills.

In the 2026-05-20 sample, the repo had 35 views, 8 unique viewers, 10 clones, and 10 unique cloners. The interesting part is that visitors did not only stop at the repository home. They also opened specific skill files such as:

skills/economic-dashboard/human_summary.md
skills/qlib-ai-quant

That behavior matters. It suggests people are checking whether a skill is useful for a concrete task.

A prompt is not enough

Many AI skills are just prompt snippets:

act as this expert;
follow these steps;
return this format;
avoid these mistakes.

That can be useful, but it is not yet a portable capability asset.

If a skill only works in one chat, one model, or one tool, it is hard to maintain. If it has no trigger conditions, no validation path, and no failure boundary, it becomes another piece of text that users hesitate to trust.

What a portable skill needs

A reusable skill should answer at least four questions.

First, what task does it solve?

Concrete beats abstract. "Generate an economic dashboard summary" is more useful than "make the AI understand economics."

Second, when should it be triggered?

The skill should describe the inputs that activate it, the cases where it should not run, and the missing data that should stop the agent.

Third, how is the result validated?

An agent response is not the same as a completed task. A skill should leave evidence: generated files, data sources, commands, acceptance criteria, and failure notes.

Fourth, can it move across hosts?

The same capability should be easy to adapt to Codex, Claude Code, Cursor, Aider, or another AI coding host.

That is why Doramagic treats a skill as a bundle of:

human manual;
host instructions;
prompt preview;
eval checklist;
known pitfalls;
source map;
boundary and risk card.

The goal is not to collect more prompts. The goal is to help people own AI capabilities they can load, reuse, verify, and improve.

Project:

https://github.com/tangweigang-jpg/doramagic-skills

This is unofficial AI capability content prepared by Doramagic. Unless an upstream project explicitly says otherwise, it is not an official upstream release.

AI coding agents need browser evidence, not just code inspection

Tang Weigang — Wed, 20 May 2026 01:48:45 +0000

The strongest signal in today's Doramagic GitHub metrics was not an abstract agent framework. It was a practical browser-verification pack:

https://github.com/tangweigang-jpg/doramagic-chrome-devtools-mcp-pack

In the 2026-05-20 sample, it had 120 clones, 60 unique cloners, and 44 views. The paths included traffic and commit-activity pages, which suggests readers were checking whether the pack was real enough to reuse.

That signal matters because AI coding agents often fail at the same boundary: they can edit code, but they cannot always prove that the browser experience works.

Common failure modes:

the page is blank, but the agent says it is done;
tests pass, but the button cannot be clicked;
console errors are ignored;
network failures are not connected to the code change;
screenshots exist, but no acceptance criteria are checked.

Chrome DevTools MCP-style access gives an agent browser evidence: console output, network behavior, DOM state, and real interaction results.

But a tool is not yet a capability asset. A reusable pack should tell the host agent:

when browser evidence is required;
what evidence must be recorded;
when to stop instead of guessing;
how to connect browser failure back to a code change.

A useful loop looks like this:

open the target URL
-> inspect console errors
-> run the key interaction
-> record the visible result
-> compare against acceptance criteria
-> only then edit code or report success
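
The loop above can be sketched as a small function that returns evidence instead of a bare "done". Everything here is hypothetical: the `Browser` protocol and its method names are an invented adapter rather than a real Chrome DevTools MCP API, and rejecting a run whenever console errors are present is my own assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Browser(Protocol):
    """Hypothetical adapter; these method names are not a real MCP API."""

    def open(self, url: str) -> None: ...
    def console_errors(self) -> list[str]: ...
    def visible_text(self) -> str: ...


@dataclass
class RunReport:
    url: str
    console_errors: list[str] = field(default_factory=list)
    visible_result: str = ""
    accepted: bool = False


def verify_in_browser(
    browser: Browser,
    url: str,
    interaction: Callable[[Browser], None],
    acceptance: Callable[[str], bool],
) -> RunReport:
    """Follow the loop step by step and return the recorded evidence."""
    report = RunReport(url=url)
    browser.open(url)                                 # open the target URL
    report.console_errors = browser.console_errors()  # inspect console errors
    interaction(browser)                              # run the key interaction
    report.visible_result = browser.visible_text()    # record the visible result
    # Compare against acceptance criteria; console errors block acceptance.
    report.accepted = acceptance(report.visible_result) and not report.console_errors
    return report
```

Only when `accepted` is true would the host go on to report success; otherwise the report itself is what gets connected back to the code change. A stricter variant would read the console again after the interaction.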

This is an unofficial AI capability pack prepared by Doramagic. Unless the upstream project explicitly says otherwise, it is not an official upstream release.

Load OpenSkills context into Codex or Claude Code without guessing source boundaries

Tang Weigang — Tue, 19 May 2026 12:17:36 +0000

The useful question is not whether an AI coding agent can read an OpenSkills README.

The useful question is whether the agent can tell the difference between:

what the upstream OpenSkills project actually provides;
what an independent Doramagic resource pack adds;
what the current host still needs to verify before acting.

That boundary matters because coding agents often fail by turning inference into a claim. A resource pack should reduce that behavior, not hide it behind a better prompt.

A practical OpenSkills AI context pack should give the host:

a source map;
host instructions for Codex, Claude Code, Cursor, or Aider;
a prompt preview;
a pitfall log;
a smoke check or eval;
a boundary card;
a human manual;
a test log;
a feedback path.

One minimal acceptance check:

Task: explain how an OpenSkills-style capability asset should be moved into another AI host.
Pass: the answer separates upstream behavior, Doramagic's independent context pack, and host-specific assumptions.
Fail: the answer implies Doramagic is the official OpenSkills release or skips source boundaries.

The point is not to make the workflow heavier. The point is to make the agent stop guessing earlier.

Doramagic pack:
https://github.com/tangweigang-jpg/doramagic-openskills-pack

This is an independent Doramagic resource pack. It is not an official upstream project release unless the upstream project says so.