I spent some time with OpenGUI recently, running a long-haul task on a real phone: open X, search for recent discussions on mobile AI agents, collect the main viewpoints, and summarize what people care about.
The task is one sentence in plain English. The execution breaks into dozens of judgments and actions. Is the app open? Are we on the home screen? Did the tap hit the search box? Is the result page loaded? Did a login popup appear? A recommended-follow modal? Did the page navigate away, and should we go back or retry?
Traditional mobile automation struggles with this kind of task. Not because tapping is hard, but because real phones don't follow scripts.
To test this, I ran the same task three times with three different setups.
Pure script (Appium): Failed all three times. Once it got stuck on an update dialog; twice the search results page's xpath changed under it. Average survival: 4 steps.
VLM screenshot loop (v2 agent): One success out of three, taking 18 minutes, including tapping a recommended-follow modal 7 times in a row. The other two runs got stuck in loops at step 12 and step 23: the screenshot showed no change, the model tapped the same spot again, the screenshot still showed no change.
OpenGUI: All three succeeded, averaging 11 minutes. The longest run hit a login popup, slow network loading, and a recommended-follow modal. The supervisor replanned twice. No human intervention.
This doesn't mean OpenGUI is "smarter". It means the system maintains task state explicitly, rather than relying on the model to remember context implicitly.
Traditional Mobile Automation: Assuming the World Cooperates
Three mainstream approaches:
Record and replay. You operate the phone once, the tool records tap coordinates, swipe trajectories, and input text, then replays them verbatim.
UI automation frameworks like Appium and UIAutomator. They locate elements via accessibility tree or xpath, then perform actions.
RPA platforms. Visual workflow builders that wrap the above into flowcharts, with conditional branches and exception handlers.
These work fine for simple jobs. Daily check-ins, timed coupon grabs, batch processing of fixed flows. As long as the page doesn't change, they run reliably. Change the page, or stretch the flow, and things break.
v1: Popups Are the Enemy of Scripts
Record-and-replay is the most intuitive. You do it once, it remembers, it replays.
For that X search task, the recorded flow looks like: tap X icon, wait 3 seconds, tap search box, type "mobile AI agents", tap search button, wait 3 seconds, scroll, screenshot.
This works in an ideal world. The real world doesn't cooperate. An update dialog shifts tap coordinates. Slow network means 3 seconds isn't enough; the skeleton is still showing. Login is required before search, and the script has no concept of where it is. A recommended-follow modal intercepts the scroll. A blogger's content needs a "show more" tap that the script never recorded.
Scripts have no state. Every step assumes the previous page's result and the current page's structure. Reality deviates, the script errors out. No recovery.
v2: Visual Understanding Makes Scripts Smarter, But Doesn't Solve State
Coordinates and xpath are brittle. Can the machine just read the screen?
This is the dominant approach for mobile agent demos in the past couple years: screenshot, feed to a VLM, model returns the next action, execute, screenshot again.
The loop is more flexible than scripts. The model sees the current screen, identifies where the search box is, handles some popups. No predefined coordinates needed. Just tell it "open X and search for mobile AI agents".
The problem: the VLM only sees the current screenshot. Short tasks are fine. Three steps to open an app, tap a button, type some text. The model usually handles it. Stretch the task, and the cracks show.
The model doesn't know what came before. Step 10 fails, it doesn't know whether to backtrack to step 3 or step 7. It doesn't know the overall goal. Looking only at the current screenshot, it drifts toward local optima. It sees a UI element that looks off and "optimizes" it, deviating from the main task. The task is done, it keeps tapping.
v2 solves screen reading. It doesn't solve long-horizon state maintenance.
The deeper problem is context window limits. A VLM processing a screenshot + prompt burns tokens fast. A 1080p screenshot encodes to 1000-3000 tokens. After 5-10 loops, what came before and what the original goal was gets physically evicted from context. This isn't "forgetting". It's eviction.
Many mobile agent demos stop at v2. Three-minute demos are impressive. Thirty-minute real tasks usually get stuck on a popup or loading state, then fall into the screenshot-recognize-tap-no-change-screenshot death loop.
v3: Turn the Goal Into a State Flow
OpenGUI puts the task into a stateful backend graph instead of letting the model run a stateless screenshot loop locally.
The architecture is clean. The core flow looks like this:
User/IM command → Plan Supervisor → Executor Graph → Android Client → Real device
                        ↑                                                  │
                        └───────── Execution result + device state ───────┘
The main graph lives in mobile-agent.graph.ts, built with LangGraph's StateGraph:
const graph = new StateGraph(AgentStateSchema)
.addNode("supervisor", supervisorNode)
.addNode("extract_todo", extractTodoNode)
.addNode("fallback_extract", fallbackExtractNode)
.addNode("gui_executor", executorSubgraph)
.addNode("summarizer", summarizerNode)
.addEdge(START, "supervisor")
.addConditionalEdges("supervisor", routeAfterSupervisor)
.addConditionalEdges("extract_todo", routeAfterExtractTodo)
.addConditionalEdges("gui_executor", routeAfterExecutor)
.addEdge("summarizer", END);
Plan Supervisor maintains task state. A complex task comes in, the supervisor breaks it into executable subtasks, forms a plan document, then dispatches them one by one to the Executor. It's itself an LLM agent with write_todos and read_todos tools, so it can adjust the task list dynamically. First call generates the plan; subsequent calls evaluate the Executor's returned results, deciding whether to mark complete, retry, or replan.
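To make that concrete, here's a rough sketch of what those two tools could look like. The tool names come from the project; the schemas and the in-memory store are my own illustration, not OpenGUI's implementation:

import { tool } from "@langchain/core/tools";
import { z } from "zod";

// Illustrative in-memory store; OpenGUI keeps the plan in graph state instead.
type TodoStatus = "pending" | "in_progress" | "done" | "failed";
type Todo = { id: number; description: string; status: TodoStatus };
const todos: Todo[] = [];

const writeTodos = tool(
  async ({ items }) => {
    // Replace the current plan with the supervisor's latest task breakdown.
    todos.length = 0;
    items.forEach((description, i) =>
      todos.push({ id: i + 1, description, status: "pending" })
    );
    return `Plan updated: ${todos.length} subtasks.`;
  },
  {
    name: "write_todos",
    description: "Overwrite the subtask list for the current plan.",
    schema: z.object({ items: z.array(z.string()) }),
  }
);

const readTodos = tool(
  async () => JSON.stringify(todos, null, 2),
  {
    name: "read_todos",
    description: "Read the current subtask list and statuses.",
    schema: z.object({}),
  }
);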
The routing logic is simple but telling (routing.ts):
export function routeAfterExtractTodo(state: AgentState) {
if (state.isCancelled) return "summarizer";
if (state.planTodoComplete) return "summarizer";
if (state.todoFound) return "gui_executor";
return "fallback_extract"; // Haiku fallback extraction
}
export function routeAfterExecutor(state: AgentState) {
if (state.isCancelled) return "summarizer";
if (isExecutionConnectionLostMessage(state.executorOutput?.fail_reason)) {
return "summarizer"; // Device disconnected, stop retrying
}
return "supervisor"; // Send execution result back for evaluation
}
Executor Graph handles the actual device interaction, itself a subgraph. The execution loop is defined in executor.graph.ts:
const subgraph = new StateGraph(AgentStateSchema)
.addNode("entry", entryNode)
.addNode("sense", senseNode, { retryPolicy: { maxAttempts: 3 } })
.addNode("vision_model", visionModelNode)
.addNode("parse_action", parseActionNode)
.addNode("execute_action", executeActionNode, { retryPolicy: { maxAttempts: 3 } })
.addNode("post_execute", postExecuteNode)
.addEdge(START, "entry")
.addEdge("entry", "sense")
.addEdge("sense", "vision_model")
.addConditionalEdges("vision_model", routeAfterVisionModel)
.addConditionalEdges("parse_action", routeByAction)
.addConditionalEdges("execute_action", routeAfterExecuteAction)
.addConditionalEdges("post_execute", routeAfterPostExecute);
Entry initializes execution state. Sense pulls the screenshot and current app info from the device. Vision Model sends the screenshot and context to the VLM for the next action. Parse Action turns the VLM's output into structured actions. Execute Action sends actions to the Android device over WebSocket. Post Execute runs anomaly detection (more on this below), then decides whether to loop back to Sense or exit the subgraph.
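The action side is easiest to picture as a type. This is an illustrative shape, not the actual schema in parse_action:

// Illustrative action shape; the real schema in OpenGUI's parse_action node may differ.
type DeviceAction =
  | { type: "tap"; x: number; y: number }
  | { type: "swipe"; fromX: number; fromY: number; toX: number; toY: number; durationMs: number }
  | { type: "input_text"; text: string }
  | { type: "key"; key: "back" | "home" }
  | { type: "wait"; ms: number }
  | { type: "finish"; success: boolean; reason: string };

// Parse Action's job, roughly: turn the VLM's free-form output into one of these,
// so Execute Action can ship it to the Android client over WebSocket as JSON.
function toWireMessage(action: DeviceAction): string {
  return JSON.stringify({ kind: "action", payload: action });
}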
Summarizer steps in at the end, packaging key execution info into structured results for the user.
This collaboration turns the goal from a string in a prompt into system state that can be referenced, paused, resumed, and cleaned up.
Where does the state live? Look at state.types.ts, AgentStateSchema:
- planDocument: the supervisor-generated plan (Markdown)
- executorInput/executorOutput: current subtask input and output
- executor: internal Executor subgraph state, including screenshot URI, current prediction, loop count, anomaly flags, message history, token usage
- todoFound/planTodoComplete: boolean flags for supervisor decisions
- isCancelled/isPaused: user interrupt state
This state doesn't live in the model's context. It lives in the backend graph's state object, updated by LangGraph's reducer after every step.
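For a feel of the shape, here's a sketch of such a schema using LangGraph's Annotation API. Field names mirror the list above; the types are my guesses, and the real state.types.ts will differ:

import { Annotation } from "@langchain/langgraph";
import { BaseMessage } from "@langchain/core/messages";

// Illustrative only: field names follow the list above, types are assumptions.
const AgentStateSchema = Annotation.Root({
  planDocument: Annotation<string>,
  executorInput: Annotation<{ task: string } | null>,
  executorOutput: Annotation<{ success: boolean; fail_reason?: string } | null>,
  executor: Annotation<{
    screenshotUri: string;
    loopCount: number;
    needRemind: boolean;
    messages: BaseMessage[];
  } | null>,
  todoFound: Annotation<boolean>,
  planTodoComplete: Annotation<boolean>,
  isCancelled: Annotation<boolean>,
  isPaused: Annotation<boolean>,
});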
The Android side maintains a WebSocket connection to the backend. StandbySocketManager.kt handles device standby. GestureService.kt executes actions on the real device. The device isn't a puppet driven by scripts. It's a worker that feeds back state continuously.
Anomaly Detection: The Fuse Inside the Executor
The most common v2 failure mode is looping: the screenshot doesn't change, the model taps the same spot again, the screenshot still doesn't change. OpenGUI has explicit anomaly detection in the Post Execute node.
Look at post-execute.node.ts:
Action repetition detection: Check the last 10 actions for 5 consecutive similar actions (same type + coordinate distance under 50 pixels). If found, flag as a likely loop.
Action cycle detection: Check for A-B-A-B alternating patterns. For example, tap back, then tap in, then tap back, then tap in. The model oscillates between two pages.
Screenshot anomaly detection: Compare recent screenshots using perceptual hash (pHash). If 3 consecutive screenshots are identical and the action isn't wait/scroll, the page isn't responding. If screenshots show an A-B-A-B alternating pattern, the page is switching between two states.
Consecutive scroll detection: More than 8 consecutive scrolls means the current search strategy isn't making progress. Force the executor to exit and let the supervisor replan.
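The first check is simple enough to sketch. Thresholds come from the description above; the data shapes are illustrative, not the actual post-execute.node.ts code:

type ExecutedAction = { type: string; x?: number; y?: number };

// Flag a likely loop: within the last 10 actions, 5 consecutive actions of the same type
// whose coordinates stay within 50 pixels of the previous one.
function detectActionRepetition(history: ExecutedAction[]): boolean {
  const recent = history.slice(-10);
  let run = 1;
  for (let i = 1; i < recent.length; i++) {
    const prev = recent[i - 1];
    const curr = recent[i];
    const sameType = prev.type === curr.type;
    const dist = Math.hypot((curr.x ?? 0) - (prev.x ?? 0), (curr.y ?? 0) - (prev.y ?? 0));
    run = sameType && dist < 50 ? run + 1 : 1;
    if (run >= 5) return true;
  }
  return false;
}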
When an anomaly is detected, the Post Execute node sets needRemind = true and injects a reminder on the next Vision Model call:
const remindMessage = new HumanMessage(
`The current task may be stuck in a loop or drifting from the goal.
Execution anomaly: ${exec.remindReason}
Original task: ${instruction}
Check whether the execution is drifting from the original goal or stuck in a loop.`
);
The key design choice: anomalies aren't termination reasons. They're consumed inputs. Detect loop → inject reminder → VLM outputs corrected action on next round → execution continues. The entire loop closes inside the system. No human needed.
The Key Difference: State Management
Traditional scripts have no state. They only know "what's the next step". v2 agents have no state. They only know "how do I handle this screen". OpenGUI distributes state across plan documents, current subtasks, execution results, and failure taxonomies. The supervisor makes every decision based on complete state.
The maze analogy holds. Traditional scripts hold a roadmap. One wrong turn, they're lost. v2 agents look around at every intersection but can't remember how they got there. OpenGUI carries a real-time updated map. It knows where it is, where it's going, which paths were tried and failed.
Another key difference is model role separation. v2 usually uses one model for all decisions. OpenGUI splits planning, supervision, and VLM execution across different models.
The README quotes numbers. For a medium-length task (~60 screenshot analyses), all-Opus config runs $8-15. Swap to Qwen 3.6 Plus for Planner and Supervisor, Doubao Pro for VLM, same task drops to $0.6-1.2. A 10-15x cost difference.
This gap comes from two factors. First, Qwen/Doubao are priced far below Opus. Second, OpenGUI's architecture lets different roles use different models. Planner and Supervisor handle text plans. They don't need multimodal capability, so they can use cheap text models. Only the Executor's VLM needs vision. That cost is isolated inside the subgraph.
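The split is easiest to see as a config sketch. Key names and model IDs here are illustrative, not OpenGUI's actual config:

// Illustrative per-role model config; the point is that only the executor's model needs vision.
const modelConfig = {
  planner:     { model: "qwen-plus",         multimodal: false }, // text-only: writes the plan
  supervisor:  { model: "qwen-plus",         multimodal: false }, // text-only: evaluates results, replans
  executorVlm: { model: "doubao-vision-pro", multimodal: true  }, // vision: reads screenshots, picks actions
  summarizer:  { model: "qwen-plus",         multimodal: false }, // text-only: packages the final report
};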
A Concrete Example
Here's how the X search task runs in OpenGUI:
The Plan Supervisor generates the plan first: open X, search the keyword, browse results, collect viewpoints, summarize. Then the subtasks go to the Executor one at a time:
- "Open X": the Executor screenshots, the VLM judges whether we're on the home screen or in another app, then taps. The result comes back: X is open, but a login popup appeared. The supervisor treats this as an anomaly that needs login handling; if it can't be handled, the subtask is marked failed and skipped where possible.
- "Search keyword": dispatched once login is handled. The network is slow and the page hasn't loaded, so the Executor retries internally: wait, screenshot again, judge again.
- "Browse results": a recommended-follow modal appears mid-way. The Executor identifies it as interference and tries to close or skip it.
- All subtasks complete: the supervisor calls the Summarizer for a structured summary.
No step assumes the page will proceed in order. Every step judges based on current real state. Failures are consumed as normal inputs, not termination reasons.
After completion, the Summarizer returns something like this:
## Task Summary
**Goal**: Search X for recent discussions on "mobile AI agents" and summarize key concerns.
**Execution**:
- Opened X app successfully
- Searched "mobile AI agents"
- Scrolled through top 20 results
- Collected 8 relevant posts/threads
**Key Findings**:
1. Privacy concerns dominate: users worried about screen recording and data access
2. Reliability: agents failing on non-standard UI patterns (custom keyboards, overlays)
3. Cost: VLM per-screenshot pricing makes long tasks expensive
4. Latency: 5-15s per action too slow for real-time interaction
**Blocked Items**:
- Login prompt appeared; task continued after handling
- One result required app switch to Safari; skipped per constraints
**Conclusion**: Mobile AI agents are technically feasible but face UX, cost, and trust hurdles before mainstream adoption.
The Cost
This design is heavier. Costs show up in three places.
Model cost. Every VLM screenshot analysis calls an API. A 1080p screenshot encodes to 1000-3000 tokens. A 10-minute task with 60 screenshot analyses might consume 150k-300k tokens total. All-Opus, this is non-trivial. Mixed-model configs bring it to acceptable ranges, but at a capability cost. Qwen plans worse than Opus. Doubao's vision understanding misses details in some scenarios.
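A quick back-of-envelope check on that number (the per-call prompt overhead is my assumption):

// Rough token math for a medium task; prompt overhead per call is an assumed figure.
const screenshotTokens = 2000; // mid-range of the 1000-3000 figure above
const promptOverhead = 1500;   // instructions + trimmed history per call (assumption)
const loops = 60;

const totalInputTokens = loops * (screenshotTokens + promptOverhead);
console.log(totalInputTokens); // 210000 -- inside the 150k-300k range quoted above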
Latency. Screenshot → backend → VLM inference → action decode → network transmission → device execution → wait for UI stable → screenshot again. One loop is typically 5-15 seconds. A 60-step task spends 5-15 minutes just waiting. v2 can be faster if the VLM runs locally or nearby, but OpenGUI's backend graph architecture naturally adds a network hop. For latency-sensitive tasks (real-time game assistance, for example), this architecture doesn't fit.
System complexity. You run a backend (NestJS + LangGraph), database (PostgreSQL), cache (Redis), WebSocket gateway, and maintain the Android client's standby connection. Deploying OpenGUI is much heavier than running a Python script. Devices sleep, background processes get killed by the OS, WebSockets drop. Standby isn't "connect and forget". It needs heartbeat (35-second interval), reconnection, and state sync.
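For a sense of what standby takes, here's a generic server-side heartbeat with the ws package. The 35-second interval is from above; the rest is a sketch, not OpenGUI's gateway code:

import { WebSocketServer, WebSocket } from "ws";

// Generic standby heartbeat: ping every 35s, drop connections that never pong back.
const HEARTBEAT_MS = 35_000;
const wss = new WebSocketServer({ port: 8080 });
const alive = new WeakMap<WebSocket, boolean>();

wss.on("connection", (ws) => {
  alive.set(ws, true);
  ws.on("pong", () => alive.set(ws, true));
});

setInterval(() => {
  for (const ws of wss.clients) {
    if (!alive.get(ws)) { ws.terminate(); continue; } // missed a heartbeat: treat device as disconnected
    alive.set(ws, false);
    ws.ping();
  }
}, HEARTBEAT_MS);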
But mobile is hard to simplify. Real apps throw popups, hang on loading, misregister taps, navigate to completely different pages. Pure prompt loops degrade into a screenshot-flavored while(true) for any non-trivial task length.
OpenGUI puts complexity explicitly into the system. A missed tap becomes an execution result for the supervisor to consume. Device disconnection is detected by the standby gateway and retries stop. A paused task resumes from the current subtask. The design is heavier, but it gives long-haul tasks a place to debug, recover, and replay.
Checkpoint: Tasks Can Resume After Interruption
LangGraph provides checkpointing. OpenGUI wires it into PostgreSQL (PostgresCheckpointerService). This means:
- The task is on step 20, the backend restarts. After restart, it resumes from the step 20 checkpoint, not from scratch.
- The user pauses the task manually. On resume, the supervisor re-evaluates current state and decides whether to continue or replan.
- Multiple subtasks share the same thread ID. State persists across graph nodes.
This matters for long tasks. A two-hour run losing all progress because of a backend rolling restart is unacceptable. Checkpointing turns "state lives in memory" into "state lives in the database". Performance sacrificed, reliability gained.
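The wiring itself is thin. A minimal sketch with the public LangGraph Postgres checkpointer (OpenGUI's PostgresCheckpointerService wraps this its own way; the connection string is a placeholder):

import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

// Placeholder connection string; point it at the real PostgreSQL instance.
const checkpointer = PostgresSaver.fromConnString("postgresql://user:pass@localhost:5432/opengui");
await checkpointer.setup(); // creates the checkpoint tables on first run

// `graph` is the StateGraph built in mobile-agent.graph.ts above.
const app = graph.compile({ checkpointer });

// Every run under the same thread_id shares checkpoints, so a restarted backend
// resumes from the last saved step instead of starting over.
await app.invoke(
  { /* initial state fields per AgentStateSchema */ },
  { configurable: { thread_id: "task-42" } }
);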
The Real Lesson
The capability boundary of an automation system isn't defined by "what actions can it execute". It's defined by "how much state can it maintain".
Scripts maintain one step of state (what's next). v2 agents maintain one round of state (how do I read this screen). OpenGUI maintains the full task lifecycle: plan, progress, anomalies, recovery.
Codex's /goal does something similar for coding agents. It turns the goal from text in a conversation turn into recoverable session state. OpenGUI goes further on mobile. It doesn't just save the goal. It wires device feedback, execution results, and failure handling into a complete state flow. Different domain, same problem: long-horizon agents can't just execute the next step. They need to continuously maintain "where am I, where am I going, what are the boundaries".
Daily check-in? Script is enough. Running an AI on a real phone for tens of minutes, across multiple apps, with complex judgments? Then you need to pull the goal out of the prompt and turn it into continuously maintained state. The choice is heavier. It's also the path from demo to production.
References
- OpenGUI: https://opengui.ai
- OpenGUI repository: https://github.com/Core-Mate/open-gui
- OpenAI Codex 0.128.0 release: https://github.com/openai/codex/releases/tag/rust-v0.128.0
- Simon Willison on Codex goals: https://simonwillison.net/2026/Apr/30/codex-goals/