OpenAI Codex now finishes 85% of scoped tasks. Here is the /goal workflow that gets you there.

GDS K S

OpenAI has been circulating an 85 to 90 percent success rate for Codex on well-scoped maintenance work. That number comes from internal testing, not an independent benchmark. But the mechanics behind it are real, and they explain both why it works and when it falls apart.

The feature is /goal. It shipped in Codex CLI 0.128.0 and became generally available across the CLI, IDE extension, and Codex app in version 0.133.0 on May 21, 2026. The short version: you set a goal, Codex loops until it believes the goal is complete, and the only hard stops are an evaluation that says "done" or a token budget that runs dry.

Understanding why that loop succeeds or fails on any given task is the whole game.

TL;DR

| Scenario | Outcome | Why |
| --- | --- | --- |
| Fix a failing test with a known error message | High pass rate | Scope is tight, completion is verifiable |
| Add a typed interface to an existing module | High pass rate | Output shape is checkable |
| Refactor a cross-cutting concern across 12 files | Fails often | Ambiguous scope, no clear done signal |
| Redesign the data model | Fails always | No binary done-check possible |
| Update a dependency and fix breakage | Medium | Depends on how far the breakage spreads |

1. What /goal does and why "persisted" matters

A standard Codex turn is stateless. You ask something, it runs, the session ends. /goal breaks that pattern.

When you set a goal, Codex injects two prompts at the end of every turn automatically: goals/continuation.md and goals/budget_limit.md. The first tells the model to check whether the goal is complete and decide whether to continue. The second tracks token consumption and stops the loop before it exceeds your budget. The loop runs forward until one of those two conditions triggers.

Before version 0.133.0, goals were session-scoped. When the CLI process died, the goal died. The 0.133.0 release backed goals with dedicated storage so they track progress across active turns, including across CLI restarts. That is the "persisted" part: the goal state outlives the process that created it.

Version 0.132.0 (May 19, 2026) added one important fix: goal continuations now stop at usage limits instead of spinning indefinitely. Before that fix, a goal with no clear completion signal would run until the process died or the account hit a rate limit.

The loop pattern OpenAI uses here is not novel. Practitioners call this the "Ralph loop": an agent that checks its own output and decides whether to keep going. Codex adds budget accounting and a persistence layer on top. The prompt injection runs automatically; you never write the continuation prompts yourself.
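To make the control flow concrete, here is a minimal shell sketch of that loop under stated assumptions. Nothing in it is Codex code: goal_loop is a hypothetical name, each turn is pretended to cost a flat 300 tokens, and the goal is pretended to complete on a turn you choose. It only models the two exits described above, a "done" evaluation or an exhausted budget.

```shell
# Toy model of a goal loop. Every name here is hypothetical, not a Codex internal.
goal_loop() {
  budget="$1"      # token budget for the whole goal
  done_at="$2"     # turn on which the simulated goal check starts passing
  used=0
  turn=0

  while :; do
    # Stand-in for one agent turn; assume a flat cost of 300 tokens.
    turn=$((turn + 1))
    used=$((used + 300))

    # Stand-in for the continuation check: is the goal complete?
    if [ "$turn" -ge "$done_at" ]; then
      echo "stopped: goal complete after $turn turns ($used tokens)"
      return 0
    fi

    # Stand-in for the budget check: stop once the budget is spent.
    if [ "$used" -ge "$budget" ]; then
      echo "stopped: budget exhausted after $turn turns ($used tokens)"
      return 1
    fi
  done
}
```

Calling goal_loop 1000 3 ends on the done check after three turns. Calling goal_loop 500 99 never reaches a passing check, so the budget ends the loop instead, which is the case the 0.132.0 fix exists for.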

2. The shape of a task that hits 85%

Three properties push a task into the high success range.

The goal must have a binary success check. "Fix the failing tests in src/auth" works. "Improve the auth module" does not. The agent needs to run a verification step and get a yes or no result. Passing CI is yes or no. "Better code" is not.

The scope must stay tight. A goal that touches one module or one interface definition gives the agent a small search space. If the fix requires changes in five unrelated parts of the codebase, the agent will solve three of them and stall on the fourth with no way to know it stalled.

The success condition must be observable from within the session. Write a shell command that returns 0 on success and non-zero on failure, and the agent can self-check. Tests are the obvious example. Type checks work too. Lint rules work. "The PR passes review" does not, because the agent cannot run that check.
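One way to package that into a single done-check is a small wrapper that runs each check in order and exits non-zero on the first failure. This is a sketch, not a Codex feature: run_checks is a hypothetical helper, and the commands in the usage comment are assumptions about a typical TypeScript project.

```shell
# run_checks: run each command string in order and stop at the first failure.
# Returns 0 only if every check exits 0.
run_checks() {
  for check in "$@"; do
    # Each argument is a full command line; its output is discarded here
    # so that only the verdict is printed.
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "FAIL: $check"
      return 1
    fi
  done
  echo "PASS"
}

# Usage (assumed project commands, adjust to your repo):
# run_checks "npx tsc --noEmit" "npx eslint src --max-warnings 0" "npm test"
```

The point of the wrapper is the single exit code. If you cannot fill in the argument list for a task, the task does not have a binary done-check yet.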

Tasks I have seen work well:

  • Write a missing test for a specific function, run it green
  • Add a TypeScript interface that satisfies an existing "as" cast
  • Bump a dependency version and fix the type errors that surface
  • Extract a repeated code block into a shared utility and update all call sites in one directory

Every one of those has a finish line the agent can reach and measure.

3. The shape of a task that fails

The failure modes split into two categories: scope creep and unprovable completion.

Scope creep happens when the agent fixes one thing and breaks or exposes another. You ask it to fix a failing integration test. It fixes the test by updating the mock. The mock now diverges from the real API. The agent has no instruction to check that, so it declares the goal done. The tests pass locally and the change fails in staging two days later. The agent did exactly what you said. The goal was too narrow.
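A cheap guard against edits that wander is to list every changed path outside the directory the goal was meant to touch. A minimal sketch, assuming the goal maps to a single directory prefix; out_of_scope is a hypothetical helper and src/payments is only an example path.

```shell
# out_of_scope: read changed paths on stdin, print those outside the allowed prefix.
out_of_scope() {
  allowed="$1"
  while IFS= read -r path; do
    case "$path" in
      "$allowed"/*) ;;      # inside the intended scope, ignore
      "") ;;                # skip blank lines
      *) echo "$path" ;;    # anything else is outside the scope
    esac
  done
}

# Usage (assumes the goal was limited to src/payments):
# git diff --name-only origin/main...HEAD | out_of_scope src/payments
```

This would not catch the mock drift described above if the mock lives inside the allowed directory. It only surfaces changes that left the intended area so a human can look at them.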

Unprovable completion happens when the agent cannot self-check. "Refactor this service to be more readable" gives the agent nothing to verify. The agent will make changes, decide they look reasonable, mark the goal complete, and stop. Whether the code reads better is a human judgment, and the agent's confidence tells you nothing about it.

Architectural changes fail almost every time. If the task requires deciding where a module boundary should sit, or which service owns a responsibility, the agent hits the ambiguity and either picks one arbitrarily or loops until the budget runs out. That is not a capability gap. The task is genuinely underdetermined, and no amount of looping changes that.

The 85% number, whatever its exact measurement method, almost certainly applies to a curated set of maintenance tasks with clear success criteria. If you point /goal at open-ended design work, you are not in the 85%. You are in a different distribution entirely.

4. Setup and a sample /goal call

Install or update the Codex CLI:

npm install -g @openai/codex
codex --version
# 0.133.0 or later for persistent goals

Check that goals are active (on by default since 0.133.0, but worth confirming):

codex doctor
# look for: goals: enabled, storage: ok

Set a goal from the CLI:

codex goal set "All tests in src/payments pass with no TypeScript errors"

Start a session in the repo and let it run:

cd /your/repo
codex
# Codex picks up the active goal and begins the loop

Watch it loop:

codex goal status
# shows: active goal, turns completed, tokens used, last evaluation result

The agent runs npm test or your configured test command at the end of each turn, checks the output, and decides whether to continue. If it cannot find a test command, it looks for package.json scripts named test, typecheck, or lint in that order.
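The fallback order described above can be pictured as a first-match search. This sketch only mirrors the order stated here (test, then typecheck, then lint). It is not Codex's actual lookup code, and pick_verify_script is a hypothetical name. It takes the script names found in package.json as arguments.

```shell
# pick_verify_script: given the script names a project defines,
# print the first one that matches the priority order test > typecheck > lint.
pick_verify_script() {
  for wanted in test typecheck lint; do
    for have in "$@"; do
      if [ "$have" = "$wanted" ]; then
        echo "npm run $wanted"
        return 0
      fi
    done
  done
  # No usable script found: the caller has no inferred done-check.
  return 1
}
```

A project that defines only build, lint, and typecheck would resolve to npm run typecheck under this order. A project with none of the three returns non-zero, which is the situation where an explicit verify command matters most.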

For a task with a tighter scope, you can inline the success command:

codex goal set "Fix TypeScript errors in src/api/routes.ts" \
  --verify "npx tsc --noEmit --project tsconfig.json"

The --verify flag tells Codex which command to use as the done-check instead of inferring it. Pass anything that exits 0 on success.

Cancel a goal that has stalled:

codex goal cancel

List past goals and their outcomes:

codex goal list --limit 10

5. Wiring /goal into CI for safety

The loop does not replace CI. Treat it as a way to get closer to green before CI runs. The agent's output goes through type check, lint, and tests before merging, same as any other code.

A GitHub Actions job that verifies Codex-generated changes:

name: verify-codex-output

on:
  pull_request:
    branches: [main]

jobs:
  type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Install
        run: npm ci

      - name: Type check
        run: npx tsc --noEmit

      - name: Lint
        run: npx eslint src --max-warnings 0

      - name: Test
        run: npm test -- --coverage --passWithNoTests

  detect-scope-creep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Count changed files
        run: |
          CHANGED=$(git diff --name-only origin/main...HEAD | wc -l)
          echo "Changed files: $CHANGED"
          if [ "$CHANGED" -gt 20 ]; then
            echo "::warning::PR changes $CHANGED files. Review for unintended scope creep."
          fi

The scope-creep check is the one I added specifically for agent-authored PRs. If Codex touches more than 20 files on what should be a five-file task, someone needs to read what happened. The warning does not block the PR; it flags it for a slower review.

The important CI rule: never relax your existing quality gates for agent-generated code. If anything, add the file-count check. An agent that cannot measure its own scope will not stop itself from editing 40 files to fix a one-line bug.

Pre-commit hooks are the other layer. Add a quick type check before the commit even reaches CI:

# .pre-commit-config.yaml (if using pre-commit)
repos:
  - repo: local
    hooks:
      - id: tsc
        name: TypeScript check
        entry: npx tsc --noEmit
        language: system
        pass_filenames: false

Or wire it directly in package.json using husky. The hook calls a typecheck script, so define one alongside prepare (husky 9 shortens the prepare script to just "husky"):

{
  "scripts": {
    "typecheck": "tsc --noEmit",
    "prepare": "husky install"
  }
}

# .husky/pre-commit
npm run typecheck

Now every commit the agent makes, whether from a /goal loop or a single turn, has to pass the type check locally before it is created, so nothing untyped reaches a push.

The bottom line

The /goal loop works on tasks where "done" has a binary answer the agent can check itself. Write that verify command before you set the goal. If you cannot write that command, the task needs more scoping before you hand it to the agent.

The 85% figure covers curated maintenance tasks. You cannot carry that rate over to every task you hand the tool. Architectural decisions, ambiguous refactors, and cross-cutting changes will not approach that number regardless of turn count.

The persistence layer that shipped in 0.133.0 is the real unlock. A goal that survives a CLI restart means you can set a task running, close the terminal, and come back to a result rather than a dead session. That changes the workflow from "supervised agent" to something closer to a slow async job. Wire it into CI, cap the budget, and treat the output like any other unreviewed PR.

What is the first maintenance task in your backlog that has a clear test-based done condition? That is the one to try /goal on first.


GDS K S · thegdsks.com · follow on X @thegdsks

Set the verify command before the goal. If you cannot write it, the scope is not ready.
