"Claude Code calls Codex" sounds like one feature. It's at least two different process models, and they have almost nothing in common past the name.
The first spawns a one-shot subprocess with codex exec. You hand it one explicit instruction, it produces a file or a structured result, and it exits. The second runs a persistent runtime with codex app-server and talks to it over JSON-RPC, managing threads, turns, reviews, and interrupts for work that needs to carry state across rounds.
Both let Claude Code borrow Codex. They differ on startup cost, protocol, permissions, error recovery, and the kind of task they fit. Get the distinction wrong and you either over-engineer a one-shot job or reach for a stateless call on work that needs to resume.
The conclusion first: two architectures, not two commands
| Dimension | codex exec one-shot subprocess | codex app-server persistent service |
|---|---|---|
| Reference implementation | baoyu codex-imagegen backend | OpenAI Codex Plugin for Claude Code |
| Process shape | Spawned per task, exits when done | Long-running, reused within a session |
| Transport | Launch args, stdin, JSONL event stream | JSON-RPC requests and notifications |
| State model | Single run, no dependence on the last | Thread holds multiple turns, can resume |
| Permission posture | The example uses danger-full-access | Review is read-only; task can switch to workspace-write |
| Typical task | Image gen, file generation, single deterministic op | Code review, long delegated tasks, multi-turn work |
| Main risk | Full-access child, cold start every time | More protocol and lifecycle complexity |
The one-line test:
- If you need to run once and get a single verifiable artifact, reach for codex exec.
- If you need ongoing collaboration, retained context, and the ability to cancel or resume, reach for codex app-server.
Version scope: keep the numbers honest
The first thing this writeup exposed wasn't architecture. It was version accounting. I had carried over the original draft's phrasing about "the current local version," and only after checking the install records did I confirm that the marketplace source and the active plugin were not the same snapshot.
Local commands and plugin records show:
- Codex CLI is 0.140.0.
- The OpenAI Codex Plugin for Claude Code is 1.0.4, commit 807e03a.
- The baoyu-skills marketplace source snapshot is 2.5.1, commit 441ca30.
- But Claude Code's installed-plugin record still points baoyu-skills at the earlier 1.111.1 snapshot.
So the accurate way to state the baoyu-codex-imagegen analysis below is this:
It's based on the baoyu-skills v2.5.1 source snapshot in the local marketplace, not a claim that the active plugin has been upgraded to v2.5.1.
This is easy to miss. The marketplace source, the cached snapshot, and the active version can all be different commits. Read the directory name or the changelog alone and you'll write "the version I read" when you mean "the version actually running."
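One way to keep the two numbers apart is to record them as separate fields and derive the scope statement from both. A minimal sketch: the `VersionRecord` shape and the `scopeStatement` helper are hypothetical names for illustration, not part of either plugin.

```typescript
// Hypothetical record shape: field names are illustrative, not a real plugin schema.
interface VersionRecord {
  plugin: string;
  sourceSnapshot: string; // version of the marketplace source that was read
  installed: string; // version the installed-plugin record points at
}

// State what an analysis is allowed to claim, given both versions.
function scopeStatement(record: VersionRecord): string {
  if (record.sourceSnapshot === record.installed) {
    return `${record.plugin}: analysis matches the installed version ${record.installed}`;
  }
  return (
    `${record.plugin}: analysis is based on source snapshot ${record.sourceSnapshot}; ` +
    `the installed record points at ${record.installed}`
  );
}

const baoyu: VersionRecord = {
  plugin: "baoyu-skills",
  sourceSnapshot: "2.5.1",
  installed: "1.111.1",
};
console.log(scopeStatement(baoyu));
```

Keeping both fields in the record makes it hard to write one version while meaning the other.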
Path one: codex exec, Codex as a one-shot operator
What it solves
The baoyu-codex-imagegen skill has a narrow job: let a non-Codex host like Claude Code call the image_gen tool built into the Codex CLI, and save the result to a chosen path.
Tasks like that share a shape:
- Clear input boundary, usually a prompt, an aspect ratio, and an output path.
- Clear result boundary, usually one file and one line of structured status.
- No need for multiple rounds, and no need to restore prior context.
So it skips a persistent service and spawns directly:
codex exec \
  --json \
  --sandbox danger-full-access \
  --skip-git-repo-check \
  -
If a reference image exists, it appends one or more --image arguments.
Why each flag is there
exec runs non-interactively for scripting. OpenAI's CLI docs position it as the execution path for automation and CI: run, return a result, done.
--json turns process output into line-delimited JSON events, or JSONL. The caller doesn't parse terminal display text; it reads structured events for the thread, tool calls, usage, and the final message.
--sandbox danger-full-access is here because this implementation needs Codex to copy the image from its default generation directory to an arbitrary target path the caller specifies, so it grants full file permissions.
That is not a general best practice. OpenAI's docs recommend workspace-write for automation and say to avoid unnecessary full access unless the runtime is already isolated.
--skip-git-repo-check lets Codex run outside a Git repo, since image jobs may launch from a temp or plugin directory rather than a trusted repository.
The trailing - tells Codex to read the instruction from stdin. The wrapper writes the task contract with child.stdin.write(instruction) and then closes stdin.
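The flag list and the stdin handoff can be sketched as a small Node wrapper. This is an illustration under assumptions, not the skill's actual spawn.ts: `buildExecArgs` and `runCodexExec` are hypothetical names, and placing `--image` before the trailing `-` is my guess at the ordering.

```typescript
import { spawn } from "node:child_process";

// Build the argument list for a one-shot run. The flags are the ones described above;
// putting --image before the trailing "-" is an assumption of this sketch.
function buildExecArgs(referenceImages: string[] = []): string[] {
  const args = ["exec", "--json", "--sandbox", "danger-full-access", "--skip-git-repo-check"];
  for (const imagePath of referenceImages) {
    args.push("--image", imagePath);
  }
  args.push("-"); // read the instruction from stdin
  return args;
}

// Spawn Codex, write the instruction to stdin, and resolve with the raw JSONL output.
function runCodexExec(instruction: string, referenceImages: string[] = []): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn("codex", buildExecArgs(referenceImages), {
      stdio: ["pipe", "pipe", "inherit"],
    });
    let stdout = "";
    child.stdout.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => {
      stdout += chunk;
    });
    child.on("error", reject); // for example, the codex binary is not on PATH
    child.stdin.on("error", reject); // the pipe can fail if the child never started
    child.on("close", () => resolve(stdout));
    child.stdin.write(instruction);
    child.stdin.end(); // closing stdin marks the end of the instruction
  });
}
```

The sketch leaves out a timeout and exit-code handling, both of which a real wrapper would need.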
The task contract is the real work
This path doesn't pass the user prompt straight through. It wraps a strict instruction, roughly:
TASK:
Generate an image and save it to the given path.
STEPS:
1. You must call the built-in image_gen.
2. Copy the result to the target path.
3. Check that the target file exists.
4. Return one line of JSON only.
HARD CONSTRAINTS:
- Do not call an external image API.
- Do not fake the image with a script.
- You must use image_gen to produce real pixels.
This is the "sub-agent as operator" design:
- Fixed input structure.
- Fixed set of allowed tools.
- Fixed file side effects.
- Fixed output format.
- Explicit prohibitions.
For an automated pipeline, the constraints matter more than the phrasing. The caller wants a verifiable result, not an open conversation.
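A contract like this is easiest to keep stable when it is rendered from structured input rather than concatenated by hand. A minimal sketch, assuming a hypothetical `ImageTask` shape; the wording follows the outline above and is not the skill's exact instruction text.

```typescript
// Hypothetical request shape; field names are illustrative.
interface ImageTask {
  prompt: string;
  aspectRatio: string;
  outputPath: string;
}

// Render a fixed-structure instruction around the caller's input.
function buildInstruction(task: ImageTask): string {
  return [
    "TASK:",
    `Generate an image and save it to ${task.outputPath}.`,
    `Prompt: ${task.prompt}`,
    `Aspect ratio: ${task.aspectRatio}`,
    "STEPS:",
    "1. You must call the built-in image_gen.",
    "2. Copy the result to the target path.",
    "3. Check that the target file exists.",
    "4. Return one line of JSON only.",
    "HARD CONSTRAINTS:",
    "- Do not call an external image API.",
    "- Do not fake the image with a script.",
    "- You must use image_gen to produce real pixels.",
  ].join("\n");
}

console.log(buildInstruction({ prompt: "a red kite", aspectRatio: "16:9", outputPath: "/tmp/kite.png" }));
```

Only the three task fields vary between runs; the steps and prohibitions stay byte-for-byte identical.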
Don't trust self-reported success: three checks
The engineering detail worth keeping is that this implementation does not call the job done just because Codex replied "success."
It checks, in order:
- Whether the JSONL events contain a thread ID.
- Whether an image actually appears under $CODEX_HOME/generated_images/{threadId}/. If that directory check fails, whether the tool calls include a cp or mv from the generation directory to the target path.
- Whether the target file actually exists and has a byte count above zero.
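The checks above can be sketched as a function over the raw JSONL output and the filesystem. Everything here is an assumption for illustration: the event shape (some event carrying a string `thread_id`), the crude text match used for the cp/mv fallback, and the names `parseJsonl` and `failedChecks`.

```typescript
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Assumed event shape: some event in the stream carries a string thread_id.
type CodexEvent = { thread_id?: unknown; [key: string]: unknown };
type Check = "thread_id" | "generated_image" | "target_file";

// Parse JSONL, skipping blank or malformed lines instead of failing the run.
function parseJsonl(output: string): CodexEvent[] {
  const events: CodexEvent[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const value: unknown = JSON.parse(line);
      if (typeof value === "object" && value !== null) events.push(value as CodexEvent);
    } catch {
      // Not JSON: ignore the line.
    }
  }
  return events;
}

// True when dir is a directory containing at least one entry.
function hasFiles(dir: string): boolean {
  try {
    return statSync(dir).isDirectory() && readdirSync(dir).length > 0;
  } catch {
    return false;
  }
}

// True when path is a regular file with a byte count above zero.
function nonEmptyFile(path: string): boolean {
  try {
    const info = statSync(path);
    return info.isFile() && info.size > 0;
  } catch {
    return false;
  }
}

// Crude fallback heuristic: an event mentions cp or mv, the generation directory, and the target.
function sawCopyToTarget(events: CodexEvent[], targetPath: string): boolean {
  return events.some((event) => {
    const text = JSON.stringify(event);
    return /\b(cp|mv)\b/.test(text) && text.includes("generated_images") && text.includes(targetPath);
  });
}

// Return the checks that failed; an empty array means the evidence holds up.
function failedChecks(output: string, codexHome: string, targetPath: string): Check[] {
  const events = parseJsonl(output);
  const failed: Check[] = [];

  const withId = events.find((event) => typeof event.thread_id === "string");
  const threadId = withId ? (withId.thread_id as string) : undefined;
  if (threadId === undefined) failed.push("thread_id");

  const inDir = threadId !== undefined && hasFiles(join(codexHome, "generated_images", threadId));
  if (!inDir && !sawCopyToTarget(events, targetPath)) failed.push("generated_image");

  if (!nonEmptyFile(targetPath)) failed.push("target_file");
  return failed;
}
```

Nothing in `failedChecks` reads the agent's final message, so a reply of "success" cannot turn a failed check into a pass.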
Failure becomes a structured error:
- agent_refused
- no_image_gen_tool_use
- timeout
- codex_not_installed
- spawn_failed
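In TypeScript these codes fit naturally into a closed union, so a caller has to handle each one. The sketch below is an assumption about how such a result type could look; in particular, mapping a spawn ENOENT to codex_not_installed is my guess, not something confirmed from the skill's source.

```typescript
// The five codes listed above, as a closed union.
type CodexExecError =
  | "agent_refused"
  | "no_image_gen_tool_use"
  | "timeout"
  | "codex_not_installed"
  | "spawn_failed";

// Hypothetical result shape: either a verified path or a structured error.
type ExecResult =
  | { ok: true; path: string }
  | { ok: false; error: CodexExecError; detail: string };

// Assumed mapping: ENOENT from spawn means the binary is missing;
// any other spawn error is reported as a generic spawn failure.
function classifySpawnError(err: { code?: string; message: string }): ExecResult {
  const error: CodexExecError = err.code === "ENOENT" ? "codex_not_installed" : "spawn_failed";
  return { ok: false, error, detail: err.message };
}
```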
The point isn't the image. It's a general principle:
An agent's natural-language reply is a claim. Files, events, and repeatable checks are evidence.
Where it fits and where it doesn't
Good fit:
- Single image or file generation.
- A code transform with clear boundaries.
- One-off analysis that returns structured JSON.
- Automation that doesn't need inherited context.
Limits:
- Every run pays process and model cold-start cost.
- No cross-run state by default.
- With danger-full-access, the trust boundary is very wide.
- Timeout, cancellation, and recovery usually fall to the wrapper to build.
Path two: codex app-server, Codex as a stateful service
The OpenAI Codex Plugin for Claude Code does not re-run codex exec per command. It starts codex app-server and manages an ongoing session over JSON-RPC.
OpenAI's docs define the App Server's core abstraction in three layers:
- Thread: a conversation that persists.
- Turn: one round of user input and agent execution inside a thread.
- Item: events inside a turn, such as messages, reasoning, commands, and file edits.
Direct connection and broker
The plugin supports two connection modes.
Direct:
Claude Code
|
| stdin/stdout JSONL
v
codex app-server
The client starts codex app-server itself and sends line-delimited JSON-RPC over stdio.
Broker:
Claude Code command
|
| Unix socket
v
Broker
|
| reuse
v
codex app-server
The plugin stores the broker endpoint in CODEX_COMPANION_APP_SERVER_ENDPOINT so review, rescue, and status commands in the same Claude Code session share one Codex runtime.
If the broker returns the busy error -32001, or the connection hits ENOENT or ECONNREFUSED, the plugin drops the broker and starts an App Server directly to retry.
That's one more layer than a one-shot subprocess, and it buys:
- Runtime reuse within a session.
- Thread persistence.
- Background task management.
- Cancel and resume.
- Permission isolation between review and task.
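The broker fallback rule described above comes down to one predicate plus a retry. A minimal sketch, assuming a hypothetical error shape (`rpcCode` for a JSON-RPC error, `code` for a socket error); the plugin's real error objects may look different.

```typescript
const BROKER_BUSY = -32001; // JSON-RPC error code the broker returns when busy
const CONNECT_FAILURES = new Set(["ENOENT", "ECONNREFUSED"]);

// Hypothetical error shape for this sketch.
interface BrokerError {
  rpcCode?: number;
  code?: string;
}

// True for the failures that justify dropping the broker.
function shouldFallBackToDirect(err: BrokerError): boolean {
  if (err.rpcCode === BROKER_BUSY) return true;
  return err.code !== undefined && CONNECT_FAILURES.has(err.code);
}

// Try the broker first; on a known failure, run the same work against a direct App Server.
async function withFallback<T>(viaBroker: () => Promise<T>, direct: () => Promise<T>): Promise<T> {
  try {
    return await viaBroker();
  } catch (err) {
    if (shouldFallBackToDirect(err as BrokerError)) return direct();
    throw err;
  }
}
```

Any error outside the three listed cases is rethrown and does not trigger a direct retry.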
Handshake: initialize first
Once the App Server connection is up, the client sends initialize, then an initialized notification.
The plugin passes this client identity:
{
  "title": "Codex Plugin",
  "name": "Claude Code",
  "version": "1.0.4"
}
It also uses optOutNotificationMethods to unsubscribe from some token-level delta events, keeping the structured notifications that are worth more to the caller and cutting noise.
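On the wire, the handshake is two line-delimited JSON messages. The sketch below assumes the identity travels in a `clientInfo` params field and that requests carry an `id` while notifications do not; both are my reading of the protocol, so check the App Server docs before relying on the exact shape. Where `optOutNotificationMethods` goes is left out because I did not confirm its placement.

```typescript
interface ClientInfo {
  title: string;
  name: string;
  version: string;
}

// Assumed wire shape: a request carries an id, a notification does not.
function initializeRequest(id: number, clientInfo: ClientInfo) {
  return { id, method: "initialize", params: { clientInfo } };
}

function initializedNotification() {
  return { method: "initialized" };
}

// One JSON document per line, for a line-delimited stdio transport.
function encodeLine(message: object): string {
  return JSON.stringify(message) + "\n";
}

const clientInfo: ClientInfo = { title: "Codex Plugin", name: "Claude Code", version: "1.0.4" };
const first = encodeLine(initializeRequest(1, clientInfo));
// This sketch assumes the client waits for the response to id 1 before sending the notification.
const second = encodeLine(initializedNotification());
```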
Session model: threads and turns
The key RPC methods the plugin uses:
| Method | Purpose |
|---|---|
| thread/start | Create a new thread |
| thread/name/set | Name a thread |
| thread/resume | Resume an existing thread |
| thread/list | Query past threads |
| turn/start | Start a turn in a thread |
| review/start | Start a code review |
| turn/interrupt | Interrupt a running turn |
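What makes these methods a session rather than a prompt call is the plumbing underneath: requests are matched to responses by id while notifications arrive in between. A minimal sketch of that correlation, not the plugin's client; `RpcClient` is a hypothetical name, params are passed through opaquely, and server-initiated requests are ignored.

```typescript
type Pending = { resolve: (result: unknown) => void; reject: (error: Error) => void };
type Incoming = {
  id?: number;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: { message?: string };
};

class RpcClient {
  private nextId = 1;
  private readonly pending = new Map<number, Pending>();
  private readonly send: (line: string) => void;
  private readonly onNotification: (method: string, params: unknown) => void;

  constructor(send: (line: string) => void, onNotification: (method: string, params: unknown) => void) {
    this.send = send;
    this.onNotification = onNotification;
  }

  get pendingCount(): number {
    return this.pending.size;
  }

  // Send a request such as "thread/start" and resolve when the matching response arrives.
  request(method: string, params: Record<string, unknown> = {}): Promise<unknown> {
    const id = this.nextId++;
    const promise = new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
    this.send(JSON.stringify({ id, method, params }) + "\n");
    return promise;
  }

  // Feed one line received from the server.
  handleLine(line: string): void {
    const message = JSON.parse(line) as Incoming;
    if (typeof message.id === "number" && message.method === undefined) {
      const entry = this.pending.get(message.id);
      if (entry === undefined) return; // response to an unknown or already-settled request
      this.pending.delete(message.id);
      if (message.error) entry.reject(new Error(message.error.message ?? "rpc error"));
      else entry.resolve(message.result);
    } else if (message.method !== undefined && message.id === undefined) {
      this.onNotification(message.method, message.params);
    }
  }
}
```

A caller would await `request("thread/start")`, then `request("turn/start", ...)`, and watch notifications for progress; the parameter shapes for each method are not shown here because I did not verify them.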
So the App Server isn't a single-round wrapper that "sends a prompt and waits." It's a managed session runtime.
Review and task have different permissions
The plugin keeps the two actions separate.
Review runs read-only, on a temporary thread, through review/start. It returns findings and does not touch code.
Task defaults to read-only. Pass --write and it switches to workspace-write. It can save the thread, and it can continue prior work with --resume or --resume-last.
This is closer to what an engineering system's default should look like than "run everything with full access." Set the minimum permission by the nature of the task, then decide whether to widen write scope.
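The rule is small enough to write down as a function. A sketch with hypothetical names (`sandboxFor`, `Action`), mirroring the behavior described above rather than the plugin's code:

```typescript
type Sandbox = "read-only" | "workspace-write";
type Action = "review" | "task";

// Review never writes; a task writes only when the caller opts in with --write.
function sandboxFor(action: Action, write = false): Sandbox {
  if (action === "review") return "read-only";
  return write ? "workspace-write" : "read-only";
}

console.log(sandboxFor("review", true)); // read-only
console.log(sandboxFor("task")); // read-only
console.log(sandboxFor("task", true)); // workspace-write
```

In this sketch danger-full-access is not a member of the `Sandbox` type, so neither action can select it.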
Hooks wire Codex into the Claude Code lifecycle
The plugin registers three Claude Code hooks:
- SessionStart: prepare the shared runtime.
- SessionEnd: clean up the broker and session resources.
- Stop: an optional stop-gate review.
With the review gate on, every time Claude Code is about to stop, it can have Codex check whether the last round has a blocking problem.
The value isn't "one more model." It's putting a second model inside the delivery flow:
Claude makes a change
|
v
Codex reviews independently
|
+-- ALLOW: stop is permitted
|
+-- BLOCK: return findings, keep working
It has a cost. The official plugin README warns that the review gate can create long Claude/Codex loops and burn through usage fast, so don't turn it on unconditionally.
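One way to bound that loop is to let the gate block at most once per stop attempt. The sketch below is not the plugin's logic. It assumes Claude Code's Stop hook contract as I understand it (a `stop_hook_active` flag in the hook input, and a printed `{"decision": "block", "reason": ...}` to keep Claude working), so verify both against the hooks docs. `ReviewVerdict` and `stopHookOutput` are hypothetical names.

```typescript
type ReviewVerdict = { verdict: "ALLOW" } | { verdict: "BLOCK"; findings: string[] };

// Return the JSON a Stop hook would print, or null to let Claude Code stop.
function stopHookOutput(review: ReviewVerdict, stopHookActive: boolean): string | null {
  // If a stop hook already forced a continuation, allow the stop instead of looping again.
  if (stopHookActive) return null;
  if (review.verdict === "ALLOW") return null;
  return JSON.stringify({ decision: "block", reason: review.findings.join("\n") });
}
```

The guard trades thoroughness for cost: a second round of findings is dropped rather than sent back to Claude.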
How to choose
When codex exec fits
Use a one-shot subprocess when most of these hold:
- The task is a single round.
- The result can be verified by a file or JSON.
- You don't need to restore prior context.
- Cold-start cost is acceptable.
- The caller can handle timeout and retry on its own.
Examples: generate an image, convert input to a fixed format, run one analysis on a file, run a check once in CI.
When codex app-server fits
Use the persistent service when you need:
- Multiple rounds of conversation.
- Thread resumption.
- Background runs and status queries.
- Interruption of a running task.
- Separate review and write permissions.
- Integration with Claude Code's session lifecycle.
Examples: review a branch continuously, delegate a long investigation, let Codex change code and then add tests, or run an automatic second-model gate before stopping.
How this was verified
This published version doesn't lean on the draft's description. I redid a minimal verification.
The steps I ran:
- Read the draft and listed every factual claim about versions, commands, RPC methods, and permissions.
- Ran codex --version, codex exec --help, and codex app-server --help to confirm the current CLI's commands and flags.
- Checked the OpenAI plugin manifest, install records, app-server.mjs, codex.mjs, and the hook config.
- Checked spawn.ts, main.ts, the version file, and the Git commit in the baoyu marketplace source.
- Cross-checked against the OpenAI Codex CLI, App Server, Codex Plugin, and Claude Code Hooks docs.
- Recorded "current active version" and "source snapshot I actually read" separately.
The mistake and the lesson
I first took the draft's baoyu-skills v2.5.1 as "the current local version." On further checking, the v2.5.1 marketplace source does exist locally, but Claude Code's installed-plugin record still points at an earlier snapshot.
Without checking the install record, that phrasing looks reasonable and is wrong.
The lesson:
When you analyze a local plugin, record at least the marketplace HEAD, the install cache path, the plugin manifest, and the commit. No single one of those stands in for "the version actually running."
Practical advice
One-shot tasks: hardcode the output contract
Don't write "generate an image for me" or "check my code." An automation prompt should include at least:
Goal
Allowed tools
Input and output paths
Prohibitions
Verification steps
Final return format
That cuts the uncertainty of an agent improvising, and it lets the caller judge success or failure.
Long tasks: resume with the delta only
When you resume a thread, send only what changed:
Continue the last task. Apply the first fix and add the matching test.
There's no reason to re-paste the whole background. Repeating context adds noise and can make the model misread the task boundary.
Review tasks: bind every finding to evidence
Whether you run a standard review or an adversarial one, require each finding to carry:
- The file or diff actually examined.
- A reproducible failure path.
- A clear risk level.
- A split between fact, inference, and open question.
A "might be a problem" with no evidence rarely makes it into an engineering decision.
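Those four requirements can be enforced mechanically before a finding reaches a human. A minimal sketch with a hypothetical `Finding` shape; the field names and the risk scale are mine, not a format either plugin emits.

```typescript
type Certainty = "fact" | "inference" | "open-question";
type Risk = "low" | "medium" | "high";

// Hypothetical finding shape for this sketch.
interface Finding {
  summary: string;
  examined: string[]; // files or diffs actually read
  reproduction: string; // steps that reproduce the failure
  risk?: Risk;
  certainty?: Certainty;
}

// List the evidence fields a finding is missing; an empty array means it is actionable.
function missingEvidence(finding: Finding): string[] {
  const missing: string[] = [];
  if (finding.examined.length === 0) missing.push("examined");
  if (finding.reproduction.trim() === "") missing.push("reproduction");
  if (finding.risk === undefined) missing.push("risk");
  if (finding.certainty === undefined) missing.push("certainty");
  return missing;
}
```

A caller can send findings with a non-empty result back to the reviewer with the missing fields named, instead of passing them along.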
Permissions: start at the smallest scope
The order of preference:
read-only
|
v
workspace-write
|
v
danger-full-access
Widen only when the task genuinely needs a larger file scope and the runtime is trusted.
Closing
"Claude Code calls Codex" is not one calling convention.
codex exec is a one-shot, stateless subprocess that's easy to wrap. It fits single tasks with clear boundaries and verifiable results.
codex app-server is a stateful, resumable, manageable agent service. It fits code review, task delegation, and complex work that needs ongoing collaboration.
The real selection criteria aren't "which is more advanced." They are:
- Does the task need state?
- Can the result be verified in one shot?
- Do you need interruption, resume, and background management?
- Can permissions be graded by action?
- Is the extra protocol complexity worth it?
Simple tasks get a simple process. Ongoing collaboration gets a stateful service. Draw that line clearly and the system gets easier to understand and to maintain.