Long-running agents tend to fail in the second half.
The first step is often fine. Fix a CI failure, open an app, tap a button, search for a keyword. Models can produce a reasonable first action. The trouble starts around step ten, when the agent needs to know what has already happened, where the task is stuck, what the original boundary was, and when the task is allowed to stop. Those details slide out of context.
Codex CLI 0.128.0 added /goal. The release note describes a persisted goal workflow: app-server APIs, model tools, runtime continuation, and TUI controls for create, pause, resume, and clear. Simon Willison described it as OpenAI's version of a Ralph loop: set a goal for Codex, then let it keep executing, checking, and correcting until the goal is done or the budget runs out.
In the context of long-running tasks, the change is about where the goal lives. It moves from text in a single prompt to state that can be resumed, paused, cleared, and referenced again later.
Why coding agents need a goal
Take a CI failure. The immediate failure may be one broken test. The agent changes the test, then the implementation, then adjusts a type because the code now feels awkward. Each step can be justified, but the final diff is much larger than the original problem.
Code generation is rarely the hard part here. The run has no stable constraint attached to it. The original goal may have been as small as:
/goal fix the current failing tests, keep the diff as small as possible, and finish by running npm test
Or:
/goal address the review comments on this PR, don't touch unrelated files, and end with a summary of the changes
That kind of goal carries the target, the boundary, and the acceptance condition. It tells the agent where to go, what not to touch, and when to stop.
Without that state, the agent is easily pulled around by the current error. A type looks awkward, so it changes the type. A test is hard to write, so it changes the test. The structure feels messy, so it refactors. Each local move can make sense, while the whole task drifts.
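What the agent needs at that point is a constraint it can re-read. Here is a minimal sketch of goal-as-state; this is not Codex's actual schema, and every name in it (PersistedGoal, withinBoundary) is hypothetical:

```typescript
// A hypothetical sketch of a persisted goal: target, boundary,
// acceptance condition, plus a status for pause/resume/clear.
// Not Codex's real schema.

type GoalStatus = "active" | "paused" | "done" | "cleared";

interface PersistedGoal {
  target: string;      // e.g. "fix the current failing tests"
  boundary: string;    // e.g. "keep the diff as small as possible"
  acceptance: string;  // e.g. "npm test passes"
  status: GoalStatus;
}

// Re-checking each proposed step against the stored goal, instead of
// against the latest error alone, is what keeps local moves from
// adding up to drift.
function withinBoundary(
  goal: PersistedGoal,
  filesToTouch: string[],
  allowedFiles: Set<string>
): boolean {
  if (goal.status !== "active") return false;
  return filesToTouch.every((f) => allowedFiles.has(f));
}
```

The point is not this particular check; it is that the check reads from state that outlives any single prompt.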
On phones, the hard part is screen state
OpenGUI works on a different kind of long-running task: letting AI operate a real Android phone.
Repository: https://github.com/Core-Mate/open-gui
In a codebase, state can still land in files, tests, and diffs. On a phone, state is a live screen.
For example, ask the phone to open X, search for discussions about mobile AI agents, collect the main points, and summarize what people care about. As a sentence, this looks simple. On the phone, it becomes a series of state checks: is the app open, is this the home page, is the search box focused, did the results finish loading, did a login prompt, permission prompt, or follow recommendation appear in the middle.
The loop of screenshot, tap, screenshot can only carry short tasks. If the screen does not change, the system has to decide whether the tap missed, the network is slow, the page is loading, or the action has no visible feedback. If the page jumps somewhere else, it also has to decide whether to go back, retry, or continue from the new page.
So a goal on mobile has to answer a few concrete questions: which step is the task on, whether the current screen supports the next step, where to recover after a failure, and when the run can end.
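To make those questions concrete, here is a minimal sketch of the decision after one action. None of these names come from OpenGUI; they only make the branches visible:

```typescript
// Hedged sketch: the choices a mobile agent faces after an action.
// StepContext and decideRecovery are invented for illustration.

type Recovery = "wait_for_load" | "retry_tap" | "go_back" | "continue" | "fail";

interface StepContext {
  expectedScreen: string;  // the screen the next step assumes
  currentScreen: string;   // classified from the latest screenshot
  screenChanged: boolean;  // did the screenshot differ after the action
  waitedMs: number;        // how long we have already waited
  retries: number;         // how many times this step has been retried
}

function decideRecovery(ctx: StepContext): Recovery {
  if (!ctx.screenChanged) {
    // Missed tap, slow network, loading page, or an action with no
    // visible feedback: wait a bounded time, retry once, then give up.
    if (ctx.waitedMs < 3000) return "wait_for_load";
    return ctx.retries < 1 ? "retry_tap" : "fail";
  }
  if (ctx.currentScreen !== ctx.expectedScreen) {
    // Popup, login prompt, or unexpected navigation: back out and
    // let a higher layer decide whether to replan from here.
    return "go_back";
  }
  return "continue";
}
```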
OpenGUI turns the goal into a state flow
I ran OpenGUI and read through the source. It connects the backend graph, device connection, and Android-side action execution instead of leaving phone automation as a script.
On the backend, the main entry point is server/apps/backend/src/modules/graph-agent/graph/mobile-agent.graph.ts. Complex tasks go through Plan Supervisor, where the plan is split into executable subtasks. Concrete actions enter executor.graph.ts, the device execution subgraph. The execution result goes back to the supervisor, which decides whether to continue, retry, replan, or hand off to Summarizer.
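In graph terms, that supervisor step is a state transition. A sketch of its shape, with the decision names taken from the prose above but types that are mine, not OpenGUI's:

```typescript
// Illustrative transition for the supervisor described above.
// The decisions (continue/retry/replan/summarize) follow the text;
// the types are assumptions, not OpenGUI's actual code.

type SupervisorDecision = "continue" | "retry" | "replan" | "summarize";

interface ExecutionResult {
  success: boolean;
  failureKind?: "transient" | "wrong_screen" | "blocked";
  remainingSubtasks: number;
}

function supervise(result: ExecutionResult, retries: number): SupervisorDecision {
  if (result.success) {
    return result.remainingSubtasks > 0 ? "continue" : "summarize";
  }
  if (result.failureKind === "transient" && retries < 2) return "retry";
  // A wrong screen or a blocked flow means the plan no longer
  // matches the device state, so hand it back to the planner.
  return "replan";
}
```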
On Android, actions are applied to the real device. client/core_accessibility/.../GestureService.kt executes GUI actions such as taps and typing. The device keeps a WebSocket connection to the backend, and client/core_network/.../StandbySocketManager.kt handles the standby connection. Feishu/Lark, Telegram, and a REST API can sit outside this as remote task entry points, turning the phone from a local demo device into something that can receive work.
OpenGUI spreads the goal across several pieces of state: the plan document, current subtask, device screenshot, execution result, failure classification, and final summary. After each device action, the backend gets fresh device state and decides the next move.
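Collected into one place, that distributed goal is roughly a single state record that every node reads and updates. A hypothetical shape, with field names mine rather than OpenGUI's:

```typescript
// Hypothetical graph state for the pieces listed above.
// Field names are assumptions; OpenGUI's actual state may differ.

interface MobileAgentState {
  planDocument: string;            // the supervisor's current plan
  currentSubtask: string;          // what the executor is doing now
  latestScreenshot?: Uint8Array;   // fresh device state after each action
  lastResult?: {                   // execution result fed back upward
    success: boolean;
    failureKind?: string;          // the failure classification
  };
  finalSummary?: string;           // filled in by the Summarizer
}
```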
A simple script assumes the page will follow the expected order. OpenGUI assumes the page will change, so the executor has to keep reporting real state back to the backend.
The cost
Putting the goal into a graph makes the system heavier.
You have to maintain task state, keep WebSocket connections alive, handle device standby, send execution results and screenshots back, and design state transitions for continue, retry, cancel, and summarize. Model calls and screenshot analysis also cost money. The longer the task runs, the more that cost becomes an engineering concern instead of a small detail.
But on mobile, it is hard to avoid this cost. Real apps show popups, hang on loading screens, misread taps, and send users to completely different pages. A prompt loop alone quickly turns into a screenshot-driven while-true loop.
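Spelled out, that anti-pattern is short. A deliberately naive sketch, with every helper name invented:

```typescript
// The screenshot-driven "while true" the text warns about:
// no goal state, no failure classification, no recovery path.

// Hypothetical helpers so the sketch type-checks; real versions
// would talk to a device and a model.
declare function captureScreen(): Promise<Uint8Array>;
declare function askModel(screenshot: Uint8Array): Promise<string>;
declare function executeAction(action: string): Promise<void>;

async function naiveLoop(): Promise<void> {
  while (true) {
    const screenshot = await captureScreen();   // grab current screen
    const action = await askModel(screenshot);  // model picks the next tap
    await executeAction(action);                // apply it to the device
    // Nothing here knows whether the task is stuck, drifting,
    // or already finished; it just keeps tapping.
  }
}
```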
OpenGUI puts that complexity into the system. A bad tap becomes an execution result for the supervisor to consume. The device keeps reporting state. It behaves more like a worker than a screen being clicked. The design is heavier, but it gives long-running tasks a place to be debugged, recovered, and reviewed.
The first use cases I would try are community research, mobile flow testing, ops tasks, and app-only workflows that web automation cannot reach. These tasks may not need the strongest model, but they do need an execution system that can keep following the goal, notice failures, and send state back.
In coding agents, Codex /goal keeps the goal as recoverable state. On phones, OpenGUI connects task progress, device feedback, and failure handling into a state flow. A long-running agent has to keep track of the run, not only execute the next step.
References
- OpenAI Codex 0.128.0 release: https://github.com/openai/codex/releases/tag/rust-v0.128.0
- Simon Willison: https://simonwillison.net/2026/Apr/30/codex-goals/
- OpenGUI: https://github.com/Core-Mate/open-gui