<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fenix</title>
    <description>The latest articles on DEV Community by Fenix (@fenix_23505d14df386c00ced).</description>
    <link>https://dev.to/fenix_23505d14df386c00ced</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3823217%2F42aca0c2-69b9-4be7-b0f2-519a6b241dc4.jpg</url>
      <title>DEV Community: Fenix</title>
      <link>https://dev.to/fenix_23505d14df386c00ced</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fenix_23505d14df386c00ced"/>
    <language>en</language>
    <item>
      <title>The Evolution of Mobile Automation: From Scripts to State Flows</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Sun, 10 May 2026 08:48:48 +0000</pubDate>
      <link>https://dev.to/fenix_23505d14df386c00ced/shou-ji-zi-dong-hua-de-yan-jin-cong-jiao-ben-dao-zhuang-tai-liu-jbc</link>
      <guid>https://dev.to/fenix_23505d14df386c00ced/shou-ji-zi-dong-hua-de-yan-jin-cong-jiao-ben-dao-zhuang-tai-liu-jbc</guid>
      <description>&lt;p&gt;前阵子跑 OpenGUI，在真机上试了一个长程任务：打开 X，搜索 mobile AI agents 相关的近期讨论，收集主要观点，再总结大家关心的问题。&lt;/p&gt;

&lt;p&gt;用自然语言描述只有一句话，执行起来却拆成了几十个判断和动作。App 打开了吗？是不是首页？搜索框点中了吗？结果加载了吗？中间有没有登录弹窗？有没有推荐关注？页面跳走了是回退还是重试？&lt;/p&gt;

&lt;p&gt;在传统手机自动化的思路里，这种任务很难稳定跑完。点击本身不难，麻烦的是&lt;strong&gt;真实手机不按脚本走&lt;/strong&gt;。&lt;/p&gt;

&lt;p&gt;为了验证这个判断，我用三种方案各跑了三次同一个任务。&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;纯脚本（Appium）&lt;/strong&gt;：三次全失败。一次卡在更新弹窗，两次搜索后页面结构变化导致 xpath 失效。平均存活 4 步。&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VLM 截图循环（v2 Agent）&lt;/strong&gt;：三次里一次成功，耗时 18 分钟，中间在推荐关注弹窗上重试了 7 次。另外两次分别在第 12 步和第 23 步陷入循环：截图显示没变化，模型继续点同一个位置，再截图还是没变化。&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenGUI&lt;/strong&gt;：三次全部成功，平均耗时 11 分钟。最长一次遇到登录弹窗 + 网络加载慢 + 推荐关注，supervisor 做了两次重新规划，没有人工干预。&lt;/p&gt;

&lt;p&gt;这组对比说明的不是 OpenGUI"更聪明"，而是它把任务状态显式地维护在系统里，而不是依赖模型隐式地记住上下文。&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional Mobile Automation: Assuming the World Cooperates
&lt;/h2&gt;

&lt;p&gt;The mainstream approaches fall into three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Record and replay&lt;/strong&gt;: you operate the phone once, the tool records tap coordinates, swipe trajectories, and input text, then replays them verbatim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI automation frameworks&lt;/strong&gt; such as Appium and UIAutomator: they locate elements via the accessibility tree or xpath, then perform actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RPA platforms&lt;/strong&gt;: visual workflow builders that wrap the above into flowcharts, with conditional branches and exception handlers.&lt;/p&gt;

&lt;p&gt;These work fine for simple jobs: daily check-ins, timed coupon grabs, batch processing of fixed flows. As long as the page doesn't change, they run reliably. Change the page, or stretch the flow, and the problems surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  v1: Popups Are the Enemy of Scripts
&lt;/h2&gt;

&lt;p&gt;Record-and-replay is the most intuitive approach. You do it once, it remembers, it replays.&lt;/p&gt;

&lt;p&gt;For that X search task, the recorded flow looks like: tap the X icon, wait 3 seconds, tap the search box, type "mobile AI agents", tap the search button, wait 3 more seconds, scroll through results, save a screenshot.&lt;/p&gt;

&lt;p&gt;This script works in an ideal world, but the real world doesn't cooperate. An update dialog pops up when X opens, and the tap coordinates shift. The network is slow, 3 seconds isn't enough, and the page is still showing a loading skeleton. Login is required before search, and the script has no idea which step it is on. A recommended-follow card sits in the middle of the results and intercepts the scroll. Some content needs a "show more" tap, and the script has no such branch.&lt;/p&gt;

&lt;p&gt;Scripts have no state. Every step assumes the previous page's result and the current page's structure. The moment reality deviates, the script errors out with no ability to recover.&lt;/p&gt;

&lt;h2&gt;
  
  
  v2: Visual Understanding Makes Scripts a Little Smarter, But Doesn't Solve State
&lt;/h2&gt;

&lt;p&gt;Coordinates and xpath are brittle. Can the machine just read the screen?&lt;/p&gt;

&lt;p&gt;This has been the dominant approach for mobile agent demos over the past year or two: screenshot, feed it to a multimodal model (VLM), the model returns the next action, execute, screenshot again.&lt;/p&gt;

&lt;p&gt;This loop is more flexible than a script. The model sees the current screen, identifies where the search box is, and can even handle some popups. No predefined coordinates are needed; just tell it "open X and search for mobile AI agents".&lt;/p&gt;

&lt;p&gt;The problem is that the VLM only sees the current screenshot. Short tasks are fine: open an app, tap a button, type some text, and within three steps the model usually handles it. Stretch the task, and the weaknesses show:&lt;/p&gt;

&lt;p&gt;The model doesn't know what came before, so when step 10 fails it doesn't know which step to backtrack to. It doesn't know the overall goal either; looking only at the current screenshot, it drifts toward local optima. It may see a UI it doesn't like and casually "optimize" it, deviating from the main task. And once the task is done, it may keep tapping away.&lt;/p&gt;

&lt;p&gt;v2 solves reading the screen. It doesn't solve maintaining long-horizon state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper problem is the hard limit of the context window&lt;/strong&gt;. Processing a screenshot plus prompt costs a VLM a lot of tokens; a 1080p screenshot can take 1000-3000 tokens once encoded. After 5-10 loop iterations, what was done earlier and what the original goal was are physically squeezed out of the context. It hasn't "forgotten"; it simply no longer fits.&lt;/p&gt;
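&lt;p&gt;A back-of-the-envelope sketch makes the squeeze concrete. Every number below (window size, tokens per screenshot, per-turn overhead) is an illustrative assumption, not a measurement from any specific model:&lt;/p&gt;

```typescript
// Back-of-the-envelope sketch of the context squeeze: each loop
// iteration keeps one screenshot plus its surrounding text in history,
// so the usable window fills up linearly per round.
// All numbers are illustrative assumptions, not measurements.
function roundsBeforeEviction(
  contextWindow: number,
  tokensPerScreenshot: number,
  tokensPerTurnText: number,
  overheadTokens: number,
): number {
  const perRound = tokensPerScreenshot + tokensPerTurnText;
  return Math.floor((contextWindow - overheadTokens) / perRound);
}

// A 32k window, 2500 tokens per screenshot, 500 tokens of prompt/reply
// text per round, 1500 tokens of system prompt and goal description:
console.log(roundsBeforeEviction(32_000, 2_500, 500, 1_500)); // 10
```

&lt;p&gt;With these assumptions, history starts being evicted after about ten rounds, which is the same order of magnitude as the 5-10 iterations mentioned above.&lt;/p&gt;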

&lt;p&gt;Many mobile agent demos stop at v2. A three-minute demo looks stunning; a thirty-minute real task will most likely get stuck on some popup or loading state, then fall into the screenshot, recognize, tap, no change, screenshot-again death loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  v3: Turning the Goal Into a State Flow
&lt;/h2&gt;

&lt;p&gt;OpenGUI's approach is to put the task into a stateful backend graph, instead of having the model run a stateless screenshot loop locally.&lt;/p&gt;

&lt;p&gt;I read through the source, and the architecture is clear. The core pipeline looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User/IM 命令 → Plan Supervisor → Executor Graph → Android Client → 真实设备
                      ↑                ↓
                      └──── 执行结果 + 设备状态 ────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main graph is built in &lt;code&gt;mobile-agent.graph.ts&lt;/code&gt; using LangGraph's StateGraph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AgentStateSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;supervisorNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;extract_todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;extractTodoNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fallback_extract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fallbackExtractNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gui_executor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;executorSubgraph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summarizerNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterSupervisor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;extract_todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterExtractTodo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gui_executor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;END&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Plan Supervisor&lt;/strong&gt; maintains task state. When a complex task comes in, the supervisor first breaks it into executable subtasks, forming a plan document, then dispatches them to the Executor one by one. It is itself an LLM agent with two tools, &lt;code&gt;write_todos&lt;/code&gt; and &lt;code&gt;read_todos&lt;/code&gt;, so it can adjust the task list dynamically. On the first call it generates the plan; on subsequent calls it evaluates the results the Executor sends back and decides whether to mark the subtask done, retry it, or replan.&lt;/p&gt;

&lt;p&gt;The supervisor's routing logic is simple, but it makes the point (&lt;code&gt;routing.ts&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeAfterExtractTodo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isCancelled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;planTodoComplete&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;todoFound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gui_executor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fallback_extract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 用 Haiku 做兜底提取&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeAfterExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isCancelled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isExecutionConnectionLostMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;executorOutput&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;fail_reason&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 设备断连，停止重试&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 把执行结果送回 supervisor 评估&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Executor Graph&lt;/strong&gt; is responsible for landing subtasks on the device, and is itself a subgraph. The execution loop is defined in &lt;code&gt;executor.graph.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;subgraph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AgentStateSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;entry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;entryNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sense&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;senseNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;retryPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vision_model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;visionModelNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parse_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parseActionNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;executeActionNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;retryPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post_execute&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;postExecuteNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;entry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;entry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sense&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sense&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vision_model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vision_model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterVisionModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parse_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeByAction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterExecuteAction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post_execute&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterPostExecute&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Entry node initializes execution state. The Sense node fetches a screenshot and current app info from the device. The Vision Model node sends the screenshot and context to the VLM to get the next action. The Parse Action node parses the VLM output into a structured action. The Execute Action node sends the action to the Android device over WebSocket. The Post Execute node runs anomaly detection (more on that below), then decides whether to loop back to Sense or exit the subgraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarizer&lt;/strong&gt; steps in at the end, organizing the key information from the run into a structured result for the user.&lt;/p&gt;

&lt;p&gt;Together, these components turn the goal from words in a prompt into a body of state that can be referenced, paused, resumed, and cleaned up.&lt;/p&gt;

&lt;p&gt;Where does that state actually live? See &lt;code&gt;AgentStateSchema&lt;/code&gt; in &lt;code&gt;state.types.ts&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;planDocument&lt;/code&gt;: the plan document the supervisor generates (Markdown)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executorInput&lt;/code&gt; / &lt;code&gt;executorOutput&lt;/code&gt;: input and output of the current subtask&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executor&lt;/code&gt; field: the Executor subgraph's internal state, including screenshot URIs, the current prediction, loop counters, anomaly flags, message history, and token usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;todoFound&lt;/code&gt; / &lt;code&gt;planTodoComplete&lt;/code&gt;: boolean flags the supervisor uses for decisions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;isCancelled&lt;/code&gt; / &lt;code&gt;isPaused&lt;/code&gt;: user interruption state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This state does not live in the model's context. It lives in the backend graph's state object, merged and updated by LangGraph reducers after every step.&lt;/p&gt;

&lt;p&gt;The Android side stays connected to the backend over WebSocket. &lt;code&gt;StandbySocketManager.kt&lt;/code&gt; keeps the device on standby, and &lt;code&gt;GestureService.kt&lt;/code&gt; executes actions on the real device. The device is not a puppet driven by a script; it is a worker that continuously reports its state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anomaly Detection: The Fuses Inside the Executor
&lt;/h2&gt;

&lt;p&gt;The most common failure of v2 setups is the loop: the screenshot shows no change, the model taps the same spot again, and the next screenshot still shows no change. OpenGUI does explicit anomaly detection in the Post Execute node.&lt;/p&gt;

&lt;p&gt;The checks in &lt;code&gt;post-execute.node.ts&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action repetition&lt;/strong&gt;: check whether the last 10 actions contain 5 consecutive similar ones (same type, coordinates within 50 pixels). If so, flag a possible loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action cycles&lt;/strong&gt;: check for an A-B-A-B alternating pattern, such as tapping back, tapping in, tapping back, tapping in. The model is bouncing between two pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot anomalies&lt;/strong&gt;: compare recent screenshots with a perceptual hash (pHash). Three identical screenshots in a row, with actions other than wait/scroll, mean the page is not responding. Screenshots alternating A-B-A-B mean the page is flipping between two states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous scrolling&lt;/strong&gt;: more than 8 scrolls in a row means the current search strategy is making no progress; force an exit from the executor so the supervisor can replan.&lt;/p&gt;
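&lt;p&gt;The repetition check can be sketched as a pure function. The thresholds mirror the description (5 consecutive similar actions within the last 10, same type, coordinates within 50 pixels), but the code is my reconstruction, not the actual &lt;code&gt;post-execute.node.ts&lt;/code&gt; logic:&lt;/p&gt;

```typescript
// Hypothetical reconstruction of the action-repetition check:
// flag a possible loop when, among the last 10 actions, a run of
// 5 consecutive actions share a type and land within 50 px of each other.
interface AgentAction {
  kind: string;   // e.g. "tap", "scroll", "type"
  x: number;
  y: number;
}

function isSimilar(a: AgentAction, b: AgentAction): boolean {
  if (a.kind !== b.kind) return false;
  const dist = Math.hypot(a.x - b.x, a.y - b.y);
  return 50 >= dist; // same type and coordinates within 50 px
}

function looksLikeLoop(history: AgentAction[]): boolean {
  const recent = history.slice(-10); // only the last 10 actions matter
  let run = 1; // length of the current run of mutually similar actions
  for (let i = 1; recent.length > i; i++) {
    run = isSimilar(recent[i - 1], recent[i]) ? run + 1 : 1;
    if (run >= 5) return true;
  }
  return false;
}

// Six identical taps on the same spot trip the detector:
const stuck = Array.from({ length: 6 }, () => ({ kind: "tap", x: 540, y: 1200 }));
console.log(looksLikeLoop(stuck)); // true
```

&lt;p&gt;The other checks follow the same pattern: pure functions over recent history (action pairs for A-B-A-B cycles, hash distances for screenshots, a counter for scrolls), each producing a flag instead of an error.&lt;/p&gt;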

&lt;p&gt;When an anomaly is detected, the Post Execute node sets &lt;code&gt;needRemind = true&lt;/code&gt; and injects a reminder into the next Vision Model call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;remindMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`The current task may be stuck in a loop or drifting from the goal.
Execution anomaly: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;remindReason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
Original task: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
Check whether the execution is drifting from the original goal or stuck in a loop.`&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design point: &lt;strong&gt;an anomaly is not a reason to error out and terminate; it is an input to be consumed&lt;/strong&gt;. Loop detected → reminder injected → the VLM outputs a corrective action next round → execution continues. The whole chain closes inside the system, with no human intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Difference: State Management
&lt;/h2&gt;

&lt;p&gt;A traditional script has no state, only "what to do next". A v2 agent has no state either, only "how to handle the current screen". OpenGUI's state is distributed across the plan document, the current subtask, execution results, and failure classifications, so the supervisor can make every decision against the full state.&lt;/p&gt;

&lt;p&gt;The maze metaphor fits well. A traditional script holds a single route map and gets lost after one wrong turn. A v2 agent looks around at every junction but cannot remember the way it came. OpenGUI carries a live, continuously updated map: it knows where it is, where the goal is, and which paths have already failed.&lt;/p&gt;

&lt;p&gt;The other key difference is &lt;strong&gt;the separation of model roles&lt;/strong&gt;. v2 typically uses one model for every decision; OpenGUI splits planning, supervision, and VLM execution across different models.&lt;/p&gt;

&lt;p&gt;According to the numbers in the README, an all-Claude-Opus configuration running a medium-length task (around 60 screenshot analyses), with Opus as VLM, Planner, and Supervisor, costs an estimated $8-15. Switching to Qwen 3.6 Plus for the Planner and Supervisor and Doubao Pro for the VLM brings the same task down to $0.6-1.2, a 10-15x cost difference.&lt;/p&gt;

&lt;p&gt;That difference comes from two factors: Qwen and Doubao are priced far below Opus, and OpenGUI's architecture lets each role use a different model. The Planner and Supervisor work on textual plans and need no multimodal capability, so a cheap text model is enough. Only the VLM inside the Executor needs to see images, and that expense is isolated inside the subgraph.&lt;/p&gt;
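&lt;p&gt;The role separation can be made concrete with a small configuration sketch. The config shape and model identifiers below are illustrative assumptions, not OpenGUI's actual config format; the cost arithmetic reuses the article's own figures:&lt;/p&gt;

```typescript
// Illustrative sketch of the role-to-model split. The three roles come
// from the article; the shape and model IDs here are assumptions.
interface ModelAssignment {
  planner: string;      // text-only planning, no multimodal capability needed
  supervisor: string;   // text-only plan evaluation and replanning
  vlmExecutor: string;  // the only role that has to read screenshots
}

const allOpus: ModelAssignment = {
  planner: "claude-opus",
  supervisor: "claude-opus",
  vlmExecutor: "claude-opus",
};

const mixed: ModelAssignment = {
  planner: "qwen-plus",      // hypothetical model IDs, for illustration
  supervisor: "qwen-plus",
  vlmExecutor: "doubao-pro",
};

// Midpoints of the article's own cost ranges for a ~60-screenshot task:
const allOpusMidUsd = (8 + 15) / 2;   // $8-15 range, all-Opus
const mixedMidUsd = (0.6 + 1.2) / 2;  // $0.6-1.2 range, mixed setup
const ratioMid = allOpusMidUsd / mixedMidUsd;
console.log(Math.round(ratioMid)); // 13, consistent with the stated 10-15x
```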

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;In OpenGUI, the X search task moves through these states:&lt;/p&gt;

&lt;p&gt;The Plan Supervisor first generates a plan: open X, search the keyword, browse results, collect viewpoints, summarize. It then dispatches the subtask "open X" to the Executor. The Executor takes a screenshot, the VLM judges whether we are on the home screen or in another app, and the tap is executed. The result comes back: X is open, but a login dialog has appeared. The Supervisor treats this as an anomaly that needs handling; if login cannot be completed, it marks the subtask failed and tries to skip it. With login handled, it dispatches "search the keyword". The Executor runs the search; the network is slow and the page has not finished loading, so it retries internally: wait, screenshot again, judge again. Search done, the "browse results" subtask begins. A recommended-follow modal shows up along the way; the Executor recognizes it as an interruption and tries to close or skip it. With all subtasks complete, the Supervisor calls the Summarizer to produce a structured summary.&lt;/p&gt;

&lt;p&gt;No step assumes the pages will arrive in order. Every step is judged against the actual current state, and failures are consumed as normal input, not treated as reasons to terminate.&lt;/p&gt;

&lt;p&gt;After the run, the result the Summarizer returns looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Task Summary&lt;/span&gt;

&lt;span class="gs"&gt;**Goal**&lt;/span&gt;: Search X for recent discussions on "mobile AI agents" and summarize key concerns.

&lt;span class="gs"&gt;**Execution**&lt;/span&gt;: 
&lt;span class="p"&gt;-&lt;/span&gt; Opened X app successfully
&lt;span class="p"&gt;-&lt;/span&gt; Searched "mobile AI agents" 
&lt;span class="p"&gt;-&lt;/span&gt; Scrolled through top 20 results
&lt;span class="p"&gt;-&lt;/span&gt; Collected 8 relevant posts/threads

&lt;span class="gs"&gt;**Key Findings**&lt;/span&gt;:
&lt;span class="p"&gt;1.&lt;/span&gt; Privacy concerns dominate: users worried about screen recording and data access
&lt;span class="p"&gt;2.&lt;/span&gt; Reliability: agents failing on non-standard UI patterns (custom keyboards, overlays)
&lt;span class="p"&gt;3.&lt;/span&gt; Cost: VLM per-screenshot pricing makes long tasks expensive
&lt;span class="p"&gt;4.&lt;/span&gt; Latency: 5-15s per action too slow for real-time interaction

&lt;span class="gs"&gt;**Blocked Items**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Login prompt appeared; task continued after handling
&lt;span class="p"&gt;-&lt;/span&gt; One result required app switch to Safari; skipped per constraints

&lt;span class="gs"&gt;**Conclusion**&lt;/span&gt;: Mobile AI agents are technically feasible but face UX, cost, and trust hurdles before mainstream adoption.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What It Costs
&lt;/h2&gt;

&lt;p&gt;This design is heavier, and the cost shows up in three places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model cost&lt;/strong&gt;. Every screenshot analysis is an API call, and a 1080p screenshot can take 1000-3000 tokens once encoded. A 10-minute task with 60 screenshot analyses can consume 150k-300k tokens in total. Under an all-Opus configuration that is a non-trivial expense. A mixed-model configuration brings the bill into an acceptable range, at the price of capability gaps: Qwen's planning quality is below Opus, and Doubao's visual understanding misses details in some scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;. Screenshot → backend → VLM inference → action decoding → network transfer → device execution → wait for the UI to settle → screenshot again. A single round of this chain typically takes 5-15 seconds, so a 60-step task spends 5-15 minutes purely waiting. A v2 setup with the VLM local or nearby can be faster, but OpenGUI's backend graph architecture inherently adds a network hop. For latency-sensitive tasks (say, real-time game assistance), this architecture does not apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System complexity&lt;/strong&gt;. You need to run a backend (NestJS + LangGraph), a database (PostgreSQL), a cache (Redis), and a WebSocket gateway, and maintain the Android client's standby connection. Deploying OpenGUI is far heavier than running a Python script. Devices sleep, the OS kills background processes, WebSockets drop. The standby mechanism is not "connect and forget"; it has to handle heartbeats (at 35-second intervals), reconnection, and state synchronization.&lt;/p&gt;

&lt;p&gt;But on mobile, this complexity is hard to avoid. Real apps pop dialogs, stall on loading, register stray touches, and throw you onto a completely different page. With nothing but a prompt loop, any moderately long task degenerates into a screenshot-flavored &lt;code&gt;while true&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;OpenGUI puts the complexity explicitly into the system. A missed tap becomes an execution result for the supervisor to consume. A disconnected device is detected by the standby gateway, which stops retrying. A task paused halfway can resume from its current subtask. The design is heavier, but it gives long-horizon tasks a place to be debugged, recovered, and reviewed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Checkpoints: Recovering From Interruptions
&lt;/h2&gt;

&lt;p&gt;LangGraph provides a checkpointing mechanism, and OpenGUI wires it into PostgreSQL (&lt;code&gt;PostgresCheckpointerService&lt;/code&gt;). This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the backend restarts while a task is on step 20, it resumes from the step-20 checkpoint instead of starting over&lt;/li&gt;
&lt;li&gt;If the user pauses a task manually, the supervisor re-evaluates the current state on resume and decides whether to continue or replan&lt;/li&gt;
&lt;li&gt;Subtasks share one thread ID, and state persists across graph nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters for long-horizon tasks. Losing two hours of progress to a rolling backend update is unacceptable. Checkpointing turns "state in memory" into "state in the database", trading some performance for reliability.&lt;/p&gt;
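&lt;p&gt;As a minimal sketch of the idea (not the real &lt;code&gt;PostgresCheckpointerService&lt;/code&gt;, which goes through LangGraph and PostgreSQL), a checkpointer is just state keyed by thread ID, saved after every step and readable after a restart:&lt;/p&gt;

```typescript
// Minimal in-memory stand-in for the checkpointing idea: persist state
// per thread after every step so a restart resumes instead of starting
// over. The interface is an illustration; the object store here plays
// the role PostgreSQL plays in OpenGUI.
interface TaskState {
  threadId: string;
  step: number;
  planDocument: string;
}

class Checkpointer {
  private store: { [threadId: string]: TaskState } = {};

  save(state: TaskState): void {
    // called after every completed step, like a reducer commit
    this.store[state.threadId] = { ...state };
  }

  resume(threadId: string): TaskState | null {
    const found = this.store[threadId];
    return found ? { ...found } : null;
  }
}

// Simulate: run 20 steps, then "restart" and resume from step 20.
const cp = new Checkpointer();
for (let step = 1; 21 > step; step++) {
  cp.save({ threadId: "task-42", step, planDocument: "..." });
}
const restored = cp.resume("task-42");
console.log(restored?.step); // 20
```

&lt;p&gt;The real implementation additionally serializes the full graph state (executor internals, message history, flags) so the supervisor can re-evaluate it on resume rather than blindly continuing.&lt;/p&gt;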

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The capability ceiling of an automation system is set not by what actions it can execute, but by how much state it can maintain.&lt;/p&gt;

&lt;p&gt;A script maintains one step of state (what to do next). A v2 agent maintains one round of state (how to interpret the current screen). OpenGUI maintains state across the whole task lifecycle: plan, progress, anomalies, recovery.&lt;/p&gt;

&lt;p&gt;Codex's &lt;code&gt;/goal&lt;/code&gt; does something similar for coding agents: it turns the goal from words in one conversation turn into recoverable state in the session. OpenGUI goes further on mobile: it saves not just the goal, but wires device feedback, execution results, and failure handling into one continuous state flow. The settings differ, but the problem is the same: a long-horizon agent cannot just execute the next step; it must continuously maintain "where am I, where am I going, what are the boundaries".&lt;/p&gt;

&lt;p&gt;If all you need is a daily check-in, a script is enough. To have AI run a task on a real phone that lasts tens of minutes, switches between apps, and requires complex judgment, you have to pull the goal out of the prompt and turn it into continuously maintained state. It is the heavier choice, but it is the road from demo to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenGUI website: &lt;a href="https://opengui.ai" rel="noopener noreferrer"&gt;https://opengui.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenGUI source: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI Codex 0.128.0 release: &lt;a href="https://github.com/openai/codex/releases/tag/rust-v0.128.0" rel="noopener noreferrer"&gt;https://github.com/openai/codex/releases/tag/rust-v0.128.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Simon Willison on Codex goals: &lt;a href="https://simonwillison.net/2026/Apr/30/codex-goals/" rel="noopener noreferrer"&gt;https://simonwillison.net/2026/Apr/30/codex-goals/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>mobile</category>
    </item>
    <item>
      <title>The Evolution of Mobile Automation: From Scripts to State Flows</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Sun, 10 May 2026 08:48:15 +0000</pubDate>
      <link>https://dev.to/fenix_23505d14df386c00ced/the-evolution-of-mobile-automation-from-scripts-to-state-flows-6b3</link>
      <guid>https://dev.to/fenix_23505d14df386c00ced/the-evolution-of-mobile-automation-from-scripts-to-state-flows-6b3</guid>
      <description>&lt;p&gt;I spent some time with OpenGUI recently, running a long-haul task on a real phone: open X, search for recent discussions on mobile AI agents, collect the main viewpoints, and summarize what people care about.&lt;/p&gt;

&lt;p&gt;The task is one sentence in plain English. The execution breaks into dozens of judgments and actions. Is the app open? Are we on the home screen? Did the tap hit the search box? Is the result page loaded? Did a login popup appear? A recommended-follow modal? Did the page navigate away, and should we go back or retry?&lt;/p&gt;

&lt;p&gt;Traditional mobile automation struggles with this kind of task. Not because tapping is hard, but because &lt;strong&gt;real phones don't follow scripts&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To test this, I ran the same task three times with three different setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure script (Appium)&lt;/strong&gt;: Failed all three times. Once stuck on an update dialog, twice the results page structure changed and broke the xpath. Average survival: 4 steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VLM screenshot loop (v2 Agent)&lt;/strong&gt;: One success out of three, taking 18 minutes. It retry-tapped a recommended-follow modal 7 times. The other two runs got stuck in loops at step 12 and step 23. The screenshot showed no change, the model tapped the same spot again, the screenshot still showed no change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenGUI&lt;/strong&gt;: All three succeeded, averaging 11 minutes. The longest run hit a login popup, slow network loading, and a recommended-follow modal. The supervisor replanned twice. No human intervention.&lt;/p&gt;

&lt;p&gt;This doesn't mean OpenGUI is "smarter". It means the system maintains task state explicitly, rather than relying on the model to remember context implicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional Mobile Automation: Assuming the World Cooperates
&lt;/h2&gt;

&lt;p&gt;Three mainstream approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Record and replay&lt;/strong&gt;. You operate the phone once, the tool records tap coordinates, swipe trajectories, and input text, then replays them verbatim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI automation frameworks&lt;/strong&gt; like Appium and UIAutomator. They locate elements via accessibility tree or xpath, then perform actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RPA platforms&lt;/strong&gt;. Visual workflow builders that wrap the above into flowcharts, with conditional branches and exception handlers.&lt;/p&gt;

&lt;p&gt;These work fine for simple jobs. Daily check-ins, timed coupon grabs, batch processing of fixed flows. As long as the page doesn't change, they run reliably. Change the page, or stretch the flow, and things break.&lt;/p&gt;

&lt;h2&gt;
  
  
  v1: Popups Are the Enemy of Scripts
&lt;/h2&gt;

&lt;p&gt;Record-and-replay is the most intuitive. You do it once, it remembers, it replays.&lt;/p&gt;

&lt;p&gt;For that X search task, the recorded flow looks like: tap X icon, wait 3 seconds, tap search box, type "mobile AI agents", tap search button, wait 3 seconds, scroll, screenshot.&lt;/p&gt;

&lt;p&gt;This works in an ideal world. The real world doesn't cooperate. An update dialog shifts tap coordinates. Slow network means 3 seconds isn't enough; the skeleton is still showing. Login is required before search, and the script has no concept of where it is. A recommended-follow modal intercepts the scroll. A blogger's content needs a "show more" tap that the script never recorded.&lt;/p&gt;

&lt;p&gt;Scripts have no state. Every step assumes the previous page's result and the current page's structure. Reality deviates, the script errors out. No recovery.&lt;/p&gt;
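
&lt;p&gt;The brittleness is easy to see in code. A minimal sketch of a recorded flow (the step shapes, helper names, and coordinates here are hypothetical, not from any real tool):&lt;/p&gt;

```typescript
// Sketch of a recorded flow as a fixed step list. Each step hard-codes
// coordinates and wait times; there is no notion of "what screen am I
// actually on", so any popup shifts every later tap.
type Step = { action: "tap" | "type" | "wait" | "scroll"; x?: number; y?: number; text?: string; ms?: number };

const recordedFlow: Step[] = [
  { action: "tap", x: 540, y: 1200 },           // X app icon
  { action: "wait", ms: 3000 },                 // hope the app loaded in time
  { action: "tap", x: 540, y: 180 },            // search box
  { action: "type", text: "mobile AI agents" },
  { action: "wait", ms: 3000 },
  { action: "scroll" },
];

// Replay blindly: if an update dialog covers the screen at step 0, every
// subsequent tap lands on the wrong element and nothing detects it.
function replay(steps: Step[], perform: (s: Step) => void): void {
  for (const s of steps) perform(s); // no verification between steps
}
```

There is no branch in `replay` for "the page I expected didn't appear", which is exactly the missing state the article describes.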

&lt;h2&gt;
  
  
  v2: Visual Understanding Makes Scripts Smarter, But Doesn't Solve State
&lt;/h2&gt;

&lt;p&gt;Coordinates and xpath are brittle. Can the machine just read the screen?&lt;/p&gt;

&lt;p&gt;This is the dominant approach for mobile agent demos in the past couple years: screenshot, feed to a VLM, model returns the next action, execute, screenshot again.&lt;/p&gt;

&lt;p&gt;The loop is more flexible than scripts. The model sees the current screen, identifies where the search box is, handles some popups. No predefined coordinates needed. Just tell it "open X and search for mobile AI agents".&lt;/p&gt;

&lt;p&gt;The problem: the VLM only sees the current screenshot. Short tasks are fine. Three steps to open an app, tap a button, type some text. The model usually handles it. Stretch the task, and the cracks show.&lt;/p&gt;

&lt;p&gt;The model doesn't know what came before. Step 10 fails, it doesn't know whether to backtrack to step 3 or step 7. It doesn't know the overall goal. Looking only at the current screenshot, it drifts toward local optima. It sees a UI element that looks off and "optimizes" it, deviating from the main task. The task is done, it keeps tapping.&lt;/p&gt;

&lt;p&gt;v2 solves screen reading. It doesn't solve long-horizon state maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper problem is context window limits&lt;/strong&gt;. A VLM processing a screenshot + prompt burns tokens fast. A 1080p screenshot encodes to 1000-3000 tokens. After 5-10 loops, what came before and what the original goal was gets physically evicted from context. This isn't "forgetting". It's eviction.&lt;/p&gt;

&lt;p&gt;Many mobile agent demos stop at v2. Three-minute demos are impressive. Thirty-minute real tasks usually get stuck on a popup or loading state, then fall into the screenshot-recognize-tap-no-change-screenshot death loop.&lt;/p&gt;
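
&lt;p&gt;The eviction mechanics are simple to simulate. A sketch with illustrative numbers (the 2500-token screenshot follows the 1000-3000 range above; the 16k history budget and 200-token goal are assumptions):&lt;/p&gt;

```typescript
// Each loop turn appends a screenshot message; when the total exceeds the
// budget, the oldest messages -- the original goal first -- are trimmed
// from the front. This is eviction, not "forgetting".
type Msg = { role: "goal" | "screenshot"; tokens: number };

function runLoop(turns: number, budget: number, perShot: number): Msg[] {
  const history: Msg[] = [{ role: "goal", tokens: 200 }];
  for (let i = 0; i < turns; i++) {
    history.push({ role: "screenshot", tokens: perShot });
    let total = history.reduce((sum, m) => sum + m.tokens, 0);
    while (total > budget) {
      total -= history.shift()!.tokens; // evict from the front, goal included
    }
  }
  return history;
}
```

With these numbers the goal survives six turns and is physically gone by the seventh, which matches the 5-10 loop horizon described above.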

&lt;h2&gt;
  
  
  v3: Turn the Goal Into a State Flow
&lt;/h2&gt;

&lt;p&gt;OpenGUI puts the task into a stateful backend graph instead of letting the model run a stateless screenshot loop locally.&lt;/p&gt;

&lt;p&gt;The architecture is clean. The core flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User/IM command → Plan Supervisor → Executor Graph → Android Client → Real device
                       ↑                ↓
                       └─── Execution result + device state ───┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main graph lives in &lt;code&gt;mobile-agent.graph.ts&lt;/code&gt;, built with LangGraph's StateGraph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AgentStateSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;supervisorNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;extract_todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;extractTodoNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fallback_extract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fallbackExtractNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gui_executor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;executorSubgraph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summarizerNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterSupervisor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;extract_todo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterExtractTodo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gui_executor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;END&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Plan Supervisor&lt;/strong&gt; maintains task state. A complex task comes in, the supervisor breaks it into executable subtasks, forms a plan document, then dispatches them one by one to the Executor. It's itself an LLM agent with &lt;code&gt;write_todos&lt;/code&gt; and &lt;code&gt;read_todos&lt;/code&gt; tools, so it can adjust the task list dynamically. First call generates the plan; subsequent calls evaluate the Executor's returned results, deciding whether to mark complete, retry, or replan.&lt;/p&gt;
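
&lt;p&gt;The tool pair might manage something like this (the store shape below is an assumption; only the &lt;code&gt;write_todos&lt;/code&gt; / &lt;code&gt;read_todos&lt;/code&gt; names come from the project):&lt;/p&gt;

```typescript
// Sketch of the state a write_todos / read_todos tool pair could manage.
// The supervisor writes the plan once, then on each executor return marks
// a subtask done, retries it, or rewrites the list.
type TodoStatus = "pending" | "in_progress" | "done" | "failed";
interface Todo { id: number; text: string; status: TodoStatus }

class TodoStore {
  private todos: Todo[] = [];

  writeTodos(items: string[]): void {
    this.todos = items.map((text, i): Todo => ({ id: i + 1, text, status: "pending" }));
  }

  readTodos(): Todo[] {
    return this.todos;
  }

  markDone(id: number): void {
    const t = this.todos.find((t) => t.id === id);
    if (t) t.status = "done";
  }

  // the next subtask to dispatch to the Executor
  nextPending(): Todo | undefined {
    return this.todos.find((t) => t.status === "pending");
  }
}
```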

&lt;p&gt;The routing logic is simple but telling (&lt;code&gt;routing.ts&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeAfterExtractTodo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isCancelled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;planTodoComplete&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;todoFound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gui_executor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fallback_extract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Haiku fallback extraction&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeAfterExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isCancelled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isExecutionConnectionLostMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;executorOutput&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;fail_reason&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarizer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Device disconnected, stop retrying&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;supervisor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Send execution result back for evaluation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Executor Graph&lt;/strong&gt; handles the actual device interaction, itself a subgraph. The execution loop is defined in &lt;code&gt;executor.graph.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;subgraph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AgentStateSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;entry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;entryNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sense&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;senseNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;retryPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vision_model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;visionModelNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parse_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;parseActionNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;executeActionNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;retryPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;maxAttempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post_execute&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;postExecuteNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;entry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;entry&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sense&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sense&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vision_model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vision_model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterVisionModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parse_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeByAction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;execute_action&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterExecuteAction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addConditionalEdges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;post_execute&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAfterPostExecute&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Entry initializes execution state. Sense pulls the screenshot and current app info from the device. Vision Model sends the screenshot and context to the VLM for the next action. Parse Action turns the VLM's output into structured actions. Execute Action sends actions to the Android device over WebSocket. Post Execute runs anomaly detection (more on this below), then decides whether to loop back to Sense or exit the subgraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarizer&lt;/strong&gt; steps in at the end, packaging key execution info into structured results for the user.&lt;/p&gt;

&lt;p&gt;Together, these nodes turn the goal from a string in a prompt into system state that can be referenced, paused, resumed, and cleaned up.&lt;/p&gt;

&lt;p&gt;Where does the state live? Look at &lt;code&gt;state.types.ts&lt;/code&gt;, &lt;code&gt;AgentStateSchema&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;planDocument&lt;/code&gt;: the supervisor-generated plan (Markdown)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executorInput&lt;/code&gt; / &lt;code&gt;executorOutput&lt;/code&gt;: current subtask input and output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executor&lt;/code&gt;: internal Executor subgraph state, including screenshot URI, current prediction, loop count, anomaly flags, message history, token usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;todoFound&lt;/code&gt; / &lt;code&gt;planTodoComplete&lt;/code&gt;: boolean flags for supervisor decisions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;isCancelled&lt;/code&gt; / &lt;code&gt;isPaused&lt;/code&gt;: user interrupt state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This state doesn't live in the model's context. It lives in the backend graph's state object, updated by LangGraph's reducer after every step.&lt;/p&gt;
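
&lt;p&gt;Spelled out as a TypeScript interface, the shape might look like this (the field names follow the list above; the exact types and the factory are assumptions):&lt;/p&gt;

```typescript
// A sketch of the state object the backend graph maintains. None of this
// lives in the model's context; it is reduced after every step.
interface ExecutorState {
  screenshotUri?: string;     // latest device screenshot
  currentPrediction?: string; // VLM's proposed next action
  loopCount: number;
  needRemind: boolean;        // anomaly flag consumed by the next VLM call
  tokenUsage: number;
}

interface AgentState {
  planDocument?: string;      // supervisor-generated Markdown plan
  executorInput?: string;     // current subtask
  executorOutput?: { success: boolean; fail_reason?: string };
  executor: ExecutorState;
  todoFound: boolean;
  planTodoComplete: boolean;
  isCancelled: boolean;
  isPaused: boolean;
}

function initialAgentState(): AgentState {
  return {
    executor: { loopCount: 0, needRemind: false, tokenUsage: 0 },
    todoFound: false,
    planTodoComplete: false,
    isCancelled: false,
    isPaused: false,
  };
}
```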

&lt;p&gt;The Android side maintains a WebSocket connection to the backend. &lt;code&gt;StandbySocketManager.kt&lt;/code&gt; handles device standby. &lt;code&gt;GestureService.kt&lt;/code&gt; executes actions on the real device. The device isn't a puppet driven by scripts. It's a worker that feeds back state continuously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anomaly Detection: The Fuse Inside the Executor
&lt;/h2&gt;

&lt;p&gt;The most common v2 failure mode is looping: the screenshot doesn't change, the model taps the same spot again, the screenshot still doesn't change. OpenGUI has explicit anomaly detection in the Post Execute node.&lt;/p&gt;

&lt;p&gt;Look at &lt;code&gt;post-execute.node.ts&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action repetition detection&lt;/strong&gt;: Check the last 10 actions for 5 consecutive similar actions (same type + coordinate distance under 50 pixels). If found, flag as a likely loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action cycle detection&lt;/strong&gt;: Check for A-B-A-B alternating patterns. For example, tap back, then tap in, then tap back, then tap in. The model oscillates between two pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot anomaly detection&lt;/strong&gt;: Compare recent screenshots using perceptual hash (pHash). If 3 consecutive screenshots are identical and the action isn't wait/scroll, the page isn't responding. If screenshots show an A-B-A-B alternating pattern, the page is switching between two states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consecutive scroll detection&lt;/strong&gt;: More than 8 consecutive scrolls means the current search strategy isn't making progress. Force the executor to exit and let the supervisor replan.&lt;/p&gt;
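
&lt;p&gt;The first two checks are mechanical enough to sketch directly (the thresholds come from the description above; the similarity function is an assumption):&lt;/p&gt;

```typescript
// Sketch of the repetition and cycle checks: last 10 actions, 5 consecutive
// similar actions (same type, taps within 50px) flags a loop; an A-B-A-B
// tail flags oscillation between two pages.
interface Action { type: string; x?: number; y?: number }

function similar(a: Action, b: Action): boolean {
  if (a.type !== b.type) return false;
  if (a.x == null || b.x == null) return true; // non-spatial actions compare by type only
  const dx = a.x - b.x;
  const dy = (a.y ?? 0) - (b.y ?? 0);
  return Math.hypot(dx, dy) < 50;
}

function detectRepetition(history: Action[]): boolean {
  const recent = history.slice(-10);
  let run = 1;
  for (let i = 1; i < recent.length; i++) {
    run = similar(recent[i], recent[i - 1]) ? run + 1 : 1;
    if (run >= 5) return true;
  }
  return false;
}

function detectCycle(history: Action[]): boolean {
  if (history.length < 4) return false;
  const [a, b, c, d] = history.slice(-4);
  return similar(a, c) && similar(b, d) && !similar(a, b);
}
```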

&lt;p&gt;When an anomaly is detected, the Post Execute node sets &lt;code&gt;needRemind = true&lt;/code&gt; and injects a reminder on the next Vision Model call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;remindMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`The current task may be stuck in a loop or drifting from the goal.
Execution anomaly: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;remindReason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
Original task: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
Check whether the execution is drifting from the original goal or stuck in a loop.`&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design choice: &lt;strong&gt;anomalies aren't termination reasons. They're consumed inputs&lt;/strong&gt;. Detect loop → inject reminder → VLM outputs corrected action on next round → execution continues. The entire loop closes inside the system. No human needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Difference: State Management
&lt;/h2&gt;

&lt;p&gt;Traditional scripts have no state. They only know "what's the next step". v2 agents have no state. They only know "how do I handle this screen". OpenGUI distributes state across plan documents, current subtasks, execution results, and failure taxonomies. The supervisor makes every decision based on complete state.&lt;/p&gt;

&lt;p&gt;The maze analogy holds. Traditional scripts hold a roadmap. One wrong turn, they're lost. v2 agents look around at every intersection but can't remember how they got there. OpenGUI carries a real-time updated map. It knows where it is, where it's going, which paths were tried and failed.&lt;/p&gt;

&lt;p&gt;Another key difference is &lt;strong&gt;model role separation&lt;/strong&gt;. v2 usually uses one model for all decisions. OpenGUI splits planning, supervision, and VLM execution across different models.&lt;/p&gt;

&lt;p&gt;The README quotes numbers. For a medium-length task (~60 screenshot analyses), all-Opus config runs $8-15. Swap to Qwen 3.6 Plus for Planner and Supervisor, Doubao Pro for VLM, same task drops to $0.6-1.2. A 10-15x cost difference.&lt;/p&gt;

&lt;p&gt;This gap comes from two factors. First, Qwen/Doubao are priced far below Opus. Second, OpenGUI's architecture lets different roles use different models. Planner and Supervisor handle text plans. They don't need multimodal capability, so they can use cheap text models. Only the Executor's VLM needs vision. That cost is isolated inside the subgraph.&lt;/p&gt;
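
&lt;p&gt;A back-of-envelope model of the split (the per-million-token prices below are illustrative placeholders, not published price sheets; only the task-level totals come from the README):&lt;/p&gt;

```typescript
// Only the Executor's VLM pays vision-model prices; Planner and Supervisor
// run on text-only models. All prices here are placeholders per million tokens.
interface RolePricing { textPerMTok: number; vlmPerMTok: number }

function estimateTaskCost(p: RolePricing, textTokens: number, vlmTokens: number): number {
  return (textTokens / 1e6) * p.textPerMTok + (vlmTokens / 1e6) * p.vlmPerMTok;
}

// ~60 screenshot analyses at ~2500 tokens each, plus assumed planning overhead
const vlmTokens = 60 * 2500;  // 150k, the low end of the range cited below
const textTokens = 50_000;

const premium = estimateTaskCost({ textPerMTok: 15, vlmPerMTok: 15 }, textTokens, vlmTokens);
const mixed = estimateTaskCost({ textPerMTok: 0.5, vlmPerMTok: 1 }, textTokens, vlmTokens);
```

With these placeholder prices the two configs land roughly an order of magnitude apart, consistent with the 10-15x gap the README reports.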

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;Here's how the X search task runs in OpenGUI:&lt;/p&gt;

&lt;p&gt;Plan Supervisor generates the plan first: open X, search keyword, browse results, collect viewpoints, summarize. It dispatches the subtask "open X" to the Executor. The Executor screenshots; the VLM judges whether we're on the home screen or inside another app, then taps. The result comes back: X is open, but a login popup appeared. The supervisor treats this as an anomaly that needs login handling. If it can't be handled, the subtask is marked failed and skipped.&lt;/p&gt;

&lt;p&gt;After login is handled, "search keyword" is dispatched. The network is slow and the page hasn't loaded, so the Executor retries internally: wait, screenshot again, judge again. The search completes and the "browse results" subtask begins. A recommended-follow modal appears midway; the Executor identifies it as interference and closes or skips it. Once all subtasks complete, the supervisor calls the Summarizer for a structured summary.&lt;/p&gt;

&lt;p&gt;No step assumes the page will proceed in order. Every step judges based on current real state. Failures are consumed as normal inputs, not termination reasons.&lt;/p&gt;

&lt;p&gt;After completion, the Summarizer returns something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Task Summary&lt;/span&gt;

&lt;span class="gs"&gt;**Goal**&lt;/span&gt;: Search X for recent discussions on "mobile AI agents" and summarize key concerns.

&lt;span class="gs"&gt;**Execution**&lt;/span&gt;: 
&lt;span class="p"&gt;-&lt;/span&gt; Opened X app successfully
&lt;span class="p"&gt;-&lt;/span&gt; Searched "mobile AI agents" 
&lt;span class="p"&gt;-&lt;/span&gt; Scrolled through top 20 results
&lt;span class="p"&gt;-&lt;/span&gt; Collected 8 relevant posts/threads

&lt;span class="gs"&gt;**Key Findings**&lt;/span&gt;:
&lt;span class="p"&gt;1.&lt;/span&gt; Privacy concerns dominate: users worried about screen recording and data access
&lt;span class="p"&gt;2.&lt;/span&gt; Reliability: agents failing on non-standard UI patterns (custom keyboards, overlays)
&lt;span class="p"&gt;3.&lt;/span&gt; Cost: VLM per-screenshot pricing makes long tasks expensive
&lt;span class="p"&gt;4.&lt;/span&gt; Latency: 5-15s per action too slow for real-time interaction

&lt;span class="gs"&gt;**Blocked Items**&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Login prompt appeared; task continued after handling
&lt;span class="p"&gt;-&lt;/span&gt; One result required app switch to Safari; skipped per constraints

&lt;span class="gs"&gt;**Conclusion**&lt;/span&gt;: Mobile AI agents are technically feasible but face UX, cost, and trust hurdles before mainstream adoption.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Cost
&lt;/h2&gt;

&lt;p&gt;This design is heavier. Costs show up in three places.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model cost&lt;/strong&gt;. Every VLM screenshot analysis calls an API. A 1080p screenshot encodes to 1000-3000 tokens. A 10-minute task with 60 screenshot analyses might consume 150k-300k tokens total. All-Opus, this is non-trivial. Mixed-model configs bring it to acceptable ranges, but at a capability cost. Qwen plans worse than Opus. Doubao's vision understanding misses details in some scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;. Screenshot → backend → VLM inference → action decode → network transmission → device execution → wait for UI stable → screenshot again. One loop is typically 5-15 seconds. A 60-step task spends 5-15 minutes just waiting. v2 can be faster if the VLM runs locally or nearby, but OpenGUI's backend graph architecture naturally adds a network hop. For latency-sensitive tasks (real-time game assistance, for example), this architecture doesn't fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System complexity&lt;/strong&gt;. You run a backend (NestJS + LangGraph), database (PostgreSQL), cache (Redis), WebSocket gateway, and maintain the Android client's standby connection. Deploying OpenGUI is much heavier than running a Python script. Devices sleep, background processes get killed by the OS, WebSockets drop. Standby isn't "connect and forget". It needs heartbeat (35-second interval), reconnection, and state sync.&lt;/p&gt;
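
&lt;p&gt;The standby logic reduces to a heartbeat deadline plus a capped reconnect backoff. A sketch (the 35-second interval comes from the text; the backoff schedule and two-missed-heartbeats rule are assumptions):&lt;/p&gt;

```typescript
// Heartbeat and reconnect policy for a standby WebSocket connection.
const HEARTBEAT_INTERVAL_MS = 35_000;

// Exponential backoff with a cap, so a flapping network doesn't hammer
// the gateway with reconnect attempts.
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** Math.min(attempt, 20));
}

// A connection is considered dead after two missed heartbeats.
function isConnectionDead(lastPongAt: number, now: number): boolean {
  return now - lastPongAt > 2 * HEARTBEAT_INTERVAL_MS;
}
```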

&lt;p&gt;But mobile is hard to simplify. Real apps throw popups, hang on loading, misregister taps, navigate to completely different pages. Pure prompt loops degrade into screenshot-flavored &lt;code&gt;while true&lt;/code&gt; for any non-trivial task length.&lt;/p&gt;

&lt;p&gt;OpenGUI puts complexity explicitly into the system. A missed tap becomes an execution result for the supervisor to consume. Device disconnection is detected by the standby gateway and retries stop. A paused task resumes from the current subtask. The design is heavier, but it gives long-haul tasks a place to debug, recover, and replay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Checkpoint: Tasks Can Resume After Interruption
&lt;/h2&gt;

&lt;p&gt;LangGraph provides checkpointing. OpenGUI wires it into PostgreSQL (&lt;code&gt;PostgresCheckpointerService&lt;/code&gt;). This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task is on step 20, the backend restarts. After restart, it resumes from the step 20 checkpoint, not from scratch.&lt;/li&gt;
&lt;li&gt;The user pauses the task manually. On resume, the supervisor re-evaluates current state and decides whether to continue or replan.&lt;/li&gt;
&lt;li&gt;Multiple subtasks share the same thread ID. State persists across graph nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters for long tasks. A two-hour run losing all progress because of a backend rolling restart is unacceptable. Checkpointing turns "state lives in memory" into "state lives in the database". Performance sacrificed, reliability gained.&lt;/p&gt;
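
&lt;p&gt;The mechanism reduces to keying persisted state by thread ID. An in-memory stand-in for the idea (the real checkpointer is LangGraph's, wired into Postgres by &lt;code&gt;PostgresCheckpointerService&lt;/code&gt;; this sketch only illustrates the resume semantics):&lt;/p&gt;

```typescript
// State keyed by thread ID: a restart loads the last checkpoint for the
// thread instead of starting the graph from scratch.
interface Checkpoint { threadId: string; step: number; state: Record<string, unknown> }

class CheckpointStore {
  private byThread = new Map<string, Checkpoint>();

  save(cp: Checkpoint): void {
    this.byThread.set(cp.threadId, cp);
  }

  load(threadId: string): Checkpoint | undefined {
    return this.byThread.get(threadId);
  }
}

// Resume from the last saved step; unknown threads start from step 0.
function resumeStep(store: CheckpointStore, threadId: string): number {
  return store.load(threadId)?.step ?? 0;
}
```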

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The capability boundary of an automation system isn't defined by "what actions can it execute". It's defined by "how much state can it maintain".&lt;/p&gt;

&lt;p&gt;Scripts maintain one step of state (what's next). v2 agents maintain one round of state (how do I read this screen). OpenGUI maintains the full task lifecycle: plan, progress, anomalies, recovery.&lt;/p&gt;

&lt;p&gt;Codex's &lt;code&gt;/goal&lt;/code&gt; does something similar for coding agents. It turns the goal from text in a conversation turn into recoverable session state. OpenGUI goes further on mobile. It doesn't just save the goal. It wires device feedback, execution results, and failure handling into a complete state flow. Different domain, same problem: long-horizon agents can't just execute the next step. They need to continuously maintain "where am I, where am I going, what are the boundaries".&lt;/p&gt;

&lt;p&gt;Daily check-in? Script is enough. Running an AI on a real phone for tens of minutes, across multiple apps, with complex judgments? Then you need to pull the goal out of the prompt and turn it into continuously maintained state. The choice is heavier. It's also the path from demo to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenGUI: &lt;a href="https://opengui.ai" rel="noopener noreferrer"&gt;https://opengui.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenGUI repository: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenAI Codex 0.128.0 release: &lt;a href="https://github.com/openai/codex/releases/tag/rust-v0.128.0" rel="noopener noreferrer"&gt;https://github.com/openai/codex/releases/tag/rust-v0.128.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Simon Willison on Codex goals: &lt;a href="https://simonwillison.net/2026/Apr/30/codex-goals/" rel="noopener noreferrer"&gt;https://simonwillison.net/2026/Apr/30/codex-goals/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>mobile</category>
    </item>
    <item>
      <title>Codex /goal and OpenGUI: long-running tasks need state</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Tue, 05 May 2026 01:55:19 +0000</pubDate>
      <link>https://dev.to/fenix_23505d14df386c00ced/codex-goal-and-opengui-long-running-tasks-need-state-1c61</link>
      <guid>https://dev.to/fenix_23505d14df386c00ced/codex-goal-and-opengui-long-running-tasks-need-state-1c61</guid>
      <description>&lt;p&gt;Long-running agents tend to fail in the second half.&lt;/p&gt;

&lt;p&gt;The first step is often fine. Fix a CI failure, open an app, tap a button, search for a keyword. Models can produce a reasonable first action. The trouble starts around step ten: what has already happened, where the task is stuck, what the original boundary was, and when the task is allowed to stop. Those details slide out of context.&lt;/p&gt;

&lt;p&gt;Codex CLI 0.128.0 added &lt;code&gt;/goal&lt;/code&gt;. The release note describes a persisted goal workflow: app-server APIs, model tools, runtime continuation, and TUI controls for create, pause, resume, and clear. Simon Willison compared it to OpenAI's version of a Ralph loop: set a goal for Codex, then let it keep executing, checking, and correcting until the goal is done or the budget runs out.&lt;/p&gt;

&lt;p&gt;In the context of long-running tasks, the change is about where the goal lives. It moves from text in a single prompt to state that can be resumed, paused, cleared, and referenced again later.&lt;/p&gt;
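&lt;p&gt;The Ralph-loop idea can be sketched in a few lines (all names here are illustrative, not Codex's implementation):&lt;/p&gt;

```python
# Minimal sketch of a Ralph-style goal loop: keep executing and checking
# until the goal is met or the budget runs out. All names are illustrative.
def run_goal(goal_met, act, budget):
    spent = 0
    while spent != budget:
        if goal_met():
            return {"done": True, "spent": spent}
        act()
        spent += 1
    return {"done": goal_met(), "spent": spent}

failing = [3, 1, 2]  # stand-in for a list of failing tests

result = run_goal(
    goal_met=lambda: len(failing) == 0,
    act=lambda: failing.pop(),   # each iteration "fixes" one failure
    budget=10,
)
```

&lt;p&gt;The goal lives outside any single iteration, which is exactly what makes pause, resume, and clear possible.&lt;/p&gt;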

&lt;h2&gt;
  
  
  Why coding agents need a goal
&lt;/h2&gt;

&lt;p&gt;Take a CI failure. The immediate failure may be one broken test. The agent changes the test, then the implementation, then adjusts a type because the code now feels awkward. Each step can be justified, but the final diff is much larger than the original problem.&lt;/p&gt;

&lt;p&gt;Code generation is rarely the hard part here. The run has no stable constraint attached to it. The original goal may have been as small as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/goal Fix the current failing tests, keep the diff as small as possible, and finish by running npm test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/goal Address the review comments on this PR, leave unrelated files untouched, and finish with a summary of the changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That kind of goal carries the target, the boundary, and the acceptance condition. It tells the agent where to go, what not to touch, and when to stop.&lt;/p&gt;
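&lt;p&gt;Sketched as data, such a goal might look like this (the field names and checks are illustrative, not Codex's actual goal schema):&lt;/p&gt;

```python
# Sketch: a goal as structured state instead of prompt text. The fields
# mirror the three parts above: target, boundary, acceptance condition.
from dataclasses import dataclass

@dataclass
class Goal:
    target: str
    allowed_paths: tuple       # boundary: what the agent may touch
    acceptance: object         # callable: when is the task allowed to stop

    def in_scope(self, path):
        return any(path.startswith(p) for p in self.allowed_paths)

goal = Goal(
    target="fix the failing tests, keep the diff small",
    allowed_paths=("src/", "tests/"),
    acceptance=lambda tests_pass: tests_pass,
)

# The agent consults the goal before each edit, not just the latest error.
edits = ["src/parser.py", "tests/test_parser.py", "README.md"]
permitted = [e for e in edits if goal.in_scope(e)]
```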

&lt;p&gt;Without that state, the agent is easily pulled around by the current error. A type looks awkward, so it changes the type. A test is hard to write, so it changes the test. The structure feels messy, so it refactors. Each local move can make sense, while the whole task drifts.&lt;/p&gt;

&lt;h2&gt;
  
  
  On phones, the hard part is screen state
&lt;/h2&gt;

&lt;p&gt;OpenGUI works on a different kind of long-running task: letting AI operate a real Android phone.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a codebase, state can still land in files, tests, and diffs. On a phone, state is a live screen.&lt;/p&gt;

&lt;p&gt;For example, ask the phone to open X, search for discussions about mobile AI agents, collect the main points, and summarize what people care about. As a sentence, this looks simple. On the phone, it becomes a series of state checks: is the app open, is this the home page, is the search box focused, did the results finish loading, did a login prompt, permission prompt, or follow recommendation appear in the middle.&lt;/p&gt;

&lt;p&gt;The loop of screenshot, tap, screenshot can only carry short tasks. If the screen does not change, the system has to decide whether the tap missed, the network is slow, the page is loading, or the action has no visible feedback. If the page jumps somewhere else, it also has to decide whether to go back, retry, or continue from the new page.&lt;/p&gt;

&lt;p&gt;So a goal on mobile has to answer a few concrete questions: which step is the task on, whether the current screen supports the next step, where to recover after a failure, and when the run can end.&lt;/p&gt;
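&lt;p&gt;Those four questions can be sketched as a single dispatch function (the plan and screen shapes are invented for illustration):&lt;/p&gt;

```python
# Sketch: which step, does the screen support it, where to recover, when
# to end. The plan/screen representation is invented for illustration.
def next_move(plan, step, screen):
    if step == len(plan):
        return "finish"                           # when can the run end
    expected = plan[step]["expects"]
    if screen == expected:
        return "execute:" + plan[step]["action"]  # screen supports next step
    if screen in [p["expects"] for p in plan[:step]]:
        return "resume_from_known_screen"         # where to recover
    return "replan"                               # unknown screen

plan = [
    {"expects": "home", "action": "tap_search"},
    {"expects": "search", "action": "type_query"},
    {"expects": "results", "action": "collect"},
]
```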

&lt;h2&gt;
  
  
  OpenGUI turns the goal into a state flow
&lt;/h2&gt;

&lt;p&gt;I ran OpenGUI and read through the source. It connects the backend graph, device connection, and Android-side action execution instead of leaving phone automation as a script.&lt;/p&gt;

&lt;p&gt;On the backend, the main entry point is &lt;code&gt;server/apps/backend/src/modules/graph-agent/graph/mobile-agent.graph.ts&lt;/code&gt;. Complex tasks go through &lt;code&gt;Plan Supervisor&lt;/code&gt;, where the plan is split into executable subtasks. Concrete actions enter &lt;code&gt;executor.graph.ts&lt;/code&gt;, the device execution subgraph. The execution result goes back to the supervisor, which decides whether to continue, retry, replan, or hand off to &lt;code&gt;Summarizer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On Android, actions are applied to the real device. &lt;code&gt;client/core_accessibility/.../GestureService.kt&lt;/code&gt; executes GUI actions such as taps and typing. The device keeps a WebSocket connection to the backend, and &lt;code&gt;client/core_network/.../StandbySocketManager.kt&lt;/code&gt; handles the standby connection. Feishu/Lark, Telegram, and REST API can sit outside this as remote task entry points, turning the phone from a local demo device into something that can receive work.&lt;/p&gt;

&lt;p&gt;OpenGUI spreads the goal across several pieces of state: the plan document, current subtask, device screenshot, execution result, failure classification, and final summary. After each device action, the backend gets fresh device state and decides the next move.&lt;/p&gt;

&lt;p&gt;A simple script assumes the page will follow the expected order. OpenGUI assumes the page will change, so the executor has to keep reporting real state back to the backend.&lt;/p&gt;
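&lt;p&gt;The supervisor's routing step can be sketched like this (the result fields and retry threshold are assumptions, not OpenGUI's actual schema):&lt;/p&gt;

```python
# Sketch of the supervisor's routing decision after each execution result.
# The result fields and the retry budget are illustrative assumptions.
def supervise(result, retries, max_retries=3):
    if result["status"] == "ok" and result["subtasks_left"] == 0:
        return "summarize"                 # hand off to the Summarizer
    if result["status"] == "ok":
        return "continue"                  # move to the next subtask
    if result["failure"] == "transient" and max_retries > retries:
        return "retry"                     # popup, slow load, missed tap
    return "replan"                        # persistent failure or budget spent
```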

&lt;h2&gt;
  
  
  The cost
&lt;/h2&gt;

&lt;p&gt;Putting the goal into a graph makes the system heavier.&lt;/p&gt;

&lt;p&gt;You have to maintain task state, keep WebSocket connections alive, handle device standby, send execution results and screenshots back, and design state transitions for continue, retry, cancel, and summarize. Model calls and screenshot analysis also cost money. The longer the task runs, the more that cost becomes an engineering concern instead of a small detail.&lt;/p&gt;

&lt;p&gt;But on mobile, it is hard to avoid this cost. Real apps show popups, hang on loading screens, misread taps, and send users to completely different pages. A prompt loop alone quickly turns into screenshot-based &lt;code&gt;while true&lt;/code&gt;.&lt;/p&gt;
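&lt;p&gt;One way out of that loop is an explicit stuck check, sketched here with screen hashes standing in for screenshots (the escalation policy is an assumption, not OpenGUI's code):&lt;/p&gt;

```python
# Sketch: break the screenshot while-true by escalating when the screen
# stops changing. Hashes stand in for screenshots; names are illustrative.
def act_with_stuck_check(screens, limit=3):
    """screens: successive screen hashes observed after each tap."""
    unchanged = 0
    for prev, cur in zip(screens, screens[1:]):
        if cur == prev:
            unchanged += 1
            if unchanged == limit:
                return "escalate_to_supervisor"  # stop blind retries
        else:
            unchanged = 0
    return "progressing"
```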

&lt;p&gt;OpenGUI puts that complexity into the system. A bad tap becomes an execution result for the supervisor to consume. The device keeps reporting state. It behaves more like a worker than a screen being clicked. The design is heavier, but it gives long-running tasks a place to be debugged, recovered, and reviewed.&lt;/p&gt;

&lt;p&gt;The first use cases I would try are community research, mobile flow testing, ops tasks, and App-only workflows that web automation cannot reach. These tasks may not need the strongest model, but they do need an execution system that can keep following the goal, recognize failure, and send state back.&lt;/p&gt;

&lt;p&gt;In coding agents, Codex &lt;code&gt;/goal&lt;/code&gt; keeps the goal as recoverable state. On phones, OpenGUI connects task progress, device feedback, and failure handling into a state flow. A long-running agent has to keep track of the run, not only execute the next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Codex 0.128.0 release: &lt;a href="https://github.com/openai/codex/releases/tag/rust-v0.128.0" rel="noopener noreferrer"&gt;https://github.com/openai/codex/releases/tag/rust-v0.128.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Simon Willison: &lt;a href="https://simonwillison.net/2026/Apr/30/codex-goals/" rel="noopener noreferrer"&gt;https://simonwillison.net/2026/Apr/30/codex-goals/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenGUI: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>cli</category>
      <category>llm</category>
    </item>
    <item>
<title>OpenGUI: OpenClaw on the phone</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Mon, 04 May 2026 13:24:24 +0000</pubDate>
      <link>https://dev.to/fenix_23505d14df386c00ced/openguishou-ji-shang-de-openclaw-38od</link>
      <guid>https://dev.to/fenix_23505d14df386c00ced/openguishou-ji-shang-de-openclaw-38od</guid>
      <description>&lt;p&gt;OpenGUI 是一个让 AI 操作真实 Android 手机的项目。&lt;/p&gt;

&lt;p&gt;项目地址：&lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw 把 AI 接到桌面环境，OpenGUI 则把类似的执行层放到了 Android 手机上。它面向的是手机 App 里的任务：点击、输入、截图、阅读页面、执行流程、返回结果。&lt;/p&gt;

&lt;p&gt;很多任务天然发生在移动端：X、Reddit、Hacker News、Telegram、微信、小红书，还有不少只在 App 内完整存在的业务流程。网页自动化覆盖不到这些场景。&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic architecture
&lt;/h2&gt;

&lt;p&gt;OpenGUI consists of two parts: a backend and an Android client.&lt;/p&gt;

&lt;p&gt;The backend understands the task, generates a plan, supervises execution, and summarizes the result; the Android client connects to the backend and performs GUI actions on a real device. Beyond tapping the screen, it also has to handle task state, device state, and recovery after failures.&lt;/p&gt;

&lt;p&gt;A few pieces are visible in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task planning, the Executor Graph, review, and summarization on the backend&lt;/li&gt;
&lt;li&gt;action execution through AccessibilityService on the Android side&lt;/li&gt;
&lt;li&gt;devices staying connected over WebSocket&lt;/li&gt;
&lt;li&gt;remote trigger entry points such as Feishu/Lark, Telegram, and the REST API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this is running, the phone can stay on standby and act like a remote worker: receive tasks, execute them, and send results back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local setup
&lt;/h2&gt;

&lt;p&gt;You first need an Android development environment, with an Android device connected.&lt;/p&gt;

&lt;p&gt;Start the backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;server
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the Android client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;client
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend script prepares the dependent services, database, and API; the client script builds the APK, installs it on the connected Android device, and launches the app.&lt;/p&gt;

&lt;p&gt;On the phone you still have to confirm a few things manually: USB debugging, the Accessibility Service, the overlay permission, plus model API keys and bot credentials. Keeping permission grants as a manual step is the more reasonable choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it gets hard
&lt;/h2&gt;

&lt;p&gt;The trouble with a phone agent usually shows up in the state handling that follows. Tapping the screen is only the start.&lt;/p&gt;

&lt;p&gt;Take a very simple task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Open X, search for recent discussions about mobile AI agents, collect the main points, and summarize what people mainly care about.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The task does not look big, but a lot of uncertain things happen on a phone: the app may be stuck on an old page, the tap on the search box may miss, the results page may load slowly, and login, permission, or follow-recommendation popups can appear in the middle.&lt;/p&gt;

&lt;p&gt;So a mobile agent cannot just "look at the screen and tap once". It also has to know which step the task is on, whether the current screen matches expectations, how to recover after a wrong tap, whether to retry when the page does not change, and how to collect the final result.&lt;/p&gt;

&lt;p&gt;I actually ran it and read through the OpenGUI source. The approach is solid: the backend graph manages task state and the plan, the Executor Graph dispatches concrete steps to the phone, the Android side executes actions through AccessibilityService, and WebSocket carries device state and execution results back.&lt;/p&gt;

&lt;p&gt;This puts the phone inside the execution loop. The backend decides whether the task should continue, retry, or finish; the phone feeds the real screen and action results back.&lt;/p&gt;

&lt;p&gt;This design is far more reliable than a plain script. The phone can stand by, execute, and report back, much closer to a mobile worker.&lt;/p&gt;

&lt;p&gt;The first uses I can think of are community information gathering, mobile flow testing, ops task execution, and the App-only flows that web automation cannot reach.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>OpenGUI: OpenClaw for phones</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Mon, 04 May 2026 13:23:21 +0000</pubDate>
      <link>https://dev.to/fenix_23505d14df386c00ced/opengui-openclaw-for-phones-31d8</link>
      <guid>https://dev.to/fenix_23505d14df386c00ced/opengui-openclaw-for-phones-31d8</guid>
      <description>&lt;p&gt;OpenGUI is a project that lets AI operate a real Android phone.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/Core-Mate/open-gui" rel="noopener noreferrer"&gt;https://github.com/Core-Mate/open-gui&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw connects AI to a desktop environment. OpenGUI brings a similar execution layer to Android. It is aimed at tasks inside mobile apps: tapping, typing, taking screenshots, reading screens, moving through flows, and returning results.&lt;/p&gt;

&lt;p&gt;A lot of work already happens on phones: X, Reddit, Hacker News, Telegram, WeChat, Xiaohongshu, and plenty of business flows that only really exist inside apps. Web automation does not reach those surfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic architecture
&lt;/h2&gt;

&lt;p&gt;OpenGUI has two main parts: a backend and an Android client.&lt;/p&gt;

&lt;p&gt;The backend understands the task, creates a plan, supervises execution, and summarizes the result. The Android client connects to the backend and performs GUI actions on a real device. Beyond tapping the screen, it also has to handle task state, device state, and recovery after failures.&lt;/p&gt;

&lt;p&gt;You can see a few pieces in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;task planning, Executor Graph, review, and summarization on the backend&lt;/li&gt;
&lt;li&gt;AccessibilityService-based action execution on Android&lt;/li&gt;
&lt;li&gt;WebSocket connections for keeping devices online&lt;/li&gt;
&lt;li&gt;remote entry points through Feishu/Lark, Telegram, and REST API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once it is running, the phone can stay on standby, receive a task like a remote worker, execute it, and send results back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local setup
&lt;/h2&gt;

&lt;p&gt;You need an Android development environment and a connected Android device.&lt;/p&gt;

&lt;p&gt;Start the backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;server
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the Android client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;client
./start.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend script prepares the services, database, and API. The client script builds the APK, installs it on the connected Android device, and launches the app.&lt;/p&gt;

&lt;p&gt;Some phone-side steps still need manual approval: USB debugging, Accessibility Service, overlay permission, and model API keys or bot credentials. Keeping those steps explicit makes sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it gets hard
&lt;/h2&gt;

&lt;p&gt;The hard part of a phone agent usually starts after the first tap.&lt;/p&gt;

&lt;p&gt;Take a simple task:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Open X, search for recent discussions about mobile AI agents, collect the main points, and summarize what people care about.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds small, but the phone can be in many different states. The app may open on an old page. The search box may not receive focus. Results may load slowly. A login prompt, permission prompt, or follow recommendation can appear in the middle.&lt;/p&gt;

&lt;p&gt;So a mobile agent cannot just look at a screenshot and tap once. It has to know where the task is, whether the current screen matches expectations, how to recover after a bad tap, when to retry after no visible change, and how to collect the final result.&lt;/p&gt;
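&lt;p&gt;The "recover after a bad tap" part can be sketched as a tiny policy (the policy itself is invented for illustration: verify the effect, retry once, then back off and replan):&lt;/p&gt;

```python
# Sketch: recovery after a tap with no visible effect. The three-way
# policy below is an illustrative assumption, not OpenGUI's actual logic.
def recover(before, after, taps_tried):
    if after != before:
        return "proceed"           # the tap had a visible effect
    if taps_tried == 0:
        return "retry_tap"         # maybe the tap simply missed
    return "go_back_and_replan"    # screen is stuck; stop hammering it
```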

&lt;p&gt;I ran OpenGUI and also spent some time reading the source. The approach is pretty good: the backend graph manages task state and plans, the Executor Graph sends concrete steps to the phone, the Android side performs actions through AccessibilityService, and WebSocket sends device state and execution results back.&lt;/p&gt;

&lt;p&gt;This puts the phone inside the execution loop. The backend decides whether to continue, retry, or finish; the phone reports what actually happened on screen.&lt;/p&gt;

&lt;p&gt;This is much more practical than a plain script. The phone can stand by, execute, and report back. It starts to look like a mobile worker.&lt;/p&gt;

&lt;p&gt;The first use cases I can imagine are community research, mobile flow testing, ops tasks, and App-only workflows that web automation cannot touch.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>NexAgent: a self-evolving AI agent built on Elixir/OTP</title>
      <dc:creator>Fenix</dc:creator>
      <pubDate>Sat, 14 Mar 2026 01:37:33 +0000</pubDate>
      <link>https://dev.to/fenix_23505d14df386c00ced/nexagent-a-self-evolving-ai-agent-built-on-elixirotp-121n</link>
      <guid>https://dev.to/fenix_23505d14df386c00ced/nexagent-a-self-evolving-ai-agent-built-on-elixirotp-121n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;OpenClaw showed the world the future of AI Agents. But it got me wondering: if an Agent is going to stick with me for a decade, what should its architecture look like?&lt;br&gt;
&lt;strong&gt;Links&lt;/strong&gt;: NexAgent GitHub: &lt;a href="https://github.com/gofenix/nex-agent" rel="noopener noreferrer"&gt;https://github.com/gofenix/nex-agent&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw's &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;310k stars&lt;/a&gt; are well-deserved.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It proves that the "personal AI Agent" direction is right—that everyday users are willing to pay for an "AI that can actually do work." As a long-time AI observer, seeing that red lobster logo take over the internet makes me genuinely happy. It means Agents are finally moving from a niche toy to the mainstream.&lt;/p&gt;

&lt;p&gt;I spun one up myself. But using it surfaced some interesting issues that got me thinking: &lt;strong&gt;If an Agent is meant to be a 24/7 "companion" rather than just a "tool"—and if it needs to get smarter over time—how should its underlying architecture change?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's no single right answer. OpenClaw proved the demand using TypeScript/Node.js. I wanted to explore a different path using Elixir/OTP.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/gofenix/nex-agent" rel="noopener noreferrer"&gt;NexAgent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This isn't meant to compete with OpenClaw. It's an experiment focused entirely on the "long-running Agent" niche.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My Experience Raising a Lobster
&lt;/h2&gt;

&lt;p&gt;Like everyone else, I found OpenClaw on Twitter.&lt;/p&gt;

&lt;p&gt;The red lobster logo, &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;310k stars&lt;/a&gt;, the "AI digital employee" pitch—it felt straight out of Iron Man (hello, JARVIS).&lt;/p&gt;

&lt;p&gt;I installed it right away, hooked up a Telegram Bot, and gave it a task: "Check my GitHub Issues every morning at 8 AM and ping the high-priority ones to Lark."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Early Days: Pure Magic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Waking up to find my Issues neatly categorized.&lt;/li&gt;
&lt;li&gt;Asking "Any bugs today?" on Telegram and seeing it actually remember yesterday's context.&lt;/li&gt;
&lt;li&gt;Felt a solid 20% bump in my quality of life.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That Led Me to a Different Question&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if I don't just want to &lt;em&gt;use&lt;/em&gt; an Agent, but &lt;em&gt;raise&lt;/em&gt; it long-term?&lt;/li&gt;
&lt;li&gt;It needs 24/7 rock-solid uptime.&lt;/li&gt;
&lt;li&gt;It should compound its intelligence, not wipe the slate clean on every reboot.&lt;/li&gt;
&lt;li&gt;It needs to self-evolve instead of waiting for author updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s when another tech stack came to mind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Elixir/OTP?
&lt;/h2&gt;

&lt;p&gt;Choosing TypeScript/Node.js for OpenClaw was the right move. It dramatically lowered the barrier to entry and brought more than 310k people into the project. That's how open source wins.&lt;/p&gt;

&lt;p&gt;But I kept wondering: &lt;strong&gt;if the endgame is a "system that never goes down", what else is out there?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That led me to Elixir and OTP. Not for novelty's sake, but because OTP (Open Telecom Platform) was literally built for telecom switches: systems that &lt;em&gt;must&lt;/em&gt; run 24/7, stay resilient, and support hot code upgrades.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Node.js Approach&lt;/th&gt;
&lt;th&gt;OTP Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Process Management&lt;/td&gt;
&lt;td&gt;Single process + external restart&lt;/td&gt;
&lt;td&gt;Supervision tree auto-restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Isolation&lt;/td&gt;
&lt;td&gt;Same process space&lt;/td&gt;
&lt;td&gt;Each task runs in isolated process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hot Updates&lt;/td&gt;
&lt;td&gt;Restart the service&lt;/td&gt;
&lt;td&gt;Zero-downtime hot reload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Recovery&lt;/td&gt;
&lt;td&gt;Manual intervention&lt;/td&gt;
&lt;td&gt;Auto-recovery + graceful degradation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s not about which stack is better—it’s about &lt;strong&gt;optimizing for different use cases&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw optimizes for accessibility, putting Agents in everyone's hands.&lt;/li&gt;
&lt;li&gt;NexAgent optimizes for extreme stability, exploring what long-term AI companionship looks like.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core Experiments with NexAgent
&lt;/h2&gt;

&lt;p&gt;I rewrote the Agent's core in Elixir and ran a few tests:&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 1: Uptime
&lt;/h3&gt;

&lt;p&gt;I left NexAgent running on my local machine for an extended period:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: Rock solid, zero memory leaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Consistently fast, no degradation over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience&lt;/strong&gt;: When a tool crashed, it restarted instantly without taking down the main loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero manual restarts. OTP's supervision tree makes the system far easier to operate over long periods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 2: Hot Reloading in the Wild
&lt;/h3&gt;

&lt;p&gt;My AMap weather tool suddenly broke, throwing API permission errors.&lt;/p&gt;

&lt;p&gt;The Agent self-diagnosed the issue: the API key was bound to iOS, but it was making server-side calls. It autonomously patched its own source code, swapping the logic to read a Web Service key instead.&lt;/p&gt;

&lt;p&gt;Four minutes later, it successfully pulled the weather for Shenzhen. No server restart. The chat session never dropped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the actual screenshot:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe04pax02fs2v3r94dmu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe04pax02fs2v3r94dmu.jpeg" alt="Agent auto-fixing AMap weather tool" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From diagnosing the bug to writing the fix and hot-reloading the module—zero human intervention.&lt;/p&gt;
&lt;h3&gt;
  
  
  Experiment 3: Self-Evolution Pipeline
&lt;/h3&gt;

&lt;p&gt;NexAgent ships with a built-in pipeline for self-improvement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reflect&lt;/strong&gt;: Read the source code of any internal module.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze&lt;/strong&gt;: Figure out what's broken and draft a fix.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade&lt;/strong&gt;: Apply the patch and hot-reload on the fly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means that when tool logic needs to change, for example because an external API changes its response format, the Agent can inspect its own code, patch it, and update itself in memory without me having to restart the daemon.&lt;/p&gt;

&lt;p&gt;The mechanism works flawlessly. I'm currently exploring more complex "fully autonomous repair" scenarios.&lt;/p&gt;


&lt;h2&gt;
  
  
  Under the Hood of NexAgent
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Supervision Trees: Crash and Recover
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/nex/agent/application.ex&lt;/span&gt;
&lt;span class="n"&gt;children&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;InfrastructureSupervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;WorkerSupervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Gateway&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="no"&gt;Supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_link&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;children&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;strategy:&lt;/span&gt; &lt;span class="ss"&gt;:rest_for_one&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the infrastructure tier crashes, all dependent Workers restart. If a single tool crashes, only that specific tool's process restarts. The main Agent loop keeps running.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Process Isolation: Each Task Runs in Its Own Process
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/nex/agent/tool/registry.ex:181&lt;/span&gt;
&lt;span class="no"&gt;Task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Supervisor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;NexAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;ToolTaskSupervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
  &lt;span class="n"&gt;tool_module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every single tool execution gets its own lightweight process. Crashes don't affect the main loop.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Hot Reloading: Upgrades on the Fly
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/nex/agent/code_upgrade.ex:39&lt;/span&gt;
&lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;maybe_validate_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;create_backup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;write_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hot_reload&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;compile_and_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
     &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;maybe_health_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;%{&lt;/span&gt;&lt;span class="ss"&gt;version:&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;hot_reload:&lt;/span&gt; &lt;span class="n"&gt;hot_reload&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Auto rollback on failure&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Dual-Layer Memory
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY.md&lt;/strong&gt;: Long-term state (project context, user quirks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HISTORY.md&lt;/strong&gt;: Grep-able conversation logs.&lt;/li&gt;
&lt;/ul&gt;
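&lt;p&gt;The split between the two files can be pictured with a hypothetical sample (the contents and entry formats below are my illustration, not NexAgent's actual files):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# MEMORY.md — small, curated, rewritten over time
- Project: nex-agent, Elixir ~&amp;gt; 1.18, Mix-based
- User prefers short answers and pipelines over nested calls

# HISTORY.md — append-only, one line per exchange, easy to grep
2026-05-10 08:48 [user]  how do I hot-reload a module?
2026-05-10 08:49 [agent] recompile, load, health-check, rollback on failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;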

&lt;p&gt;Powered by &lt;strong&gt;async consolidation&lt;/strong&gt;: When a chat gets too long, the Agent spins up a background process to summarize the history and extract facts into long-term memory. Zero latency impact on your active chat.&lt;/p&gt;
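&lt;p&gt;That consolidation step could be sketched in a few lines of Elixir. This is a rough sketch of the idea, not NexAgent's actual API: &lt;code&gt;summarize/1&lt;/code&gt;, &lt;code&gt;append_memory/1&lt;/code&gt;, &lt;code&gt;NexAgent.TaskSup&lt;/code&gt;, and &lt;code&gt;@max_turns&lt;/code&gt; are assumed names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;# Fire-and-forget: summarize the old turns off the critical path,
# so the active chat only ever sees the trimmed history.
def maybe_consolidate(history) when length(history) &amp;gt; @max_turns do
  Task.Supervisor.start_child(NexAgent.TaskSup, fn -&amp;gt;
    {_recent, old} = Enum.split(history, @max_turns)
    old |&amp;gt; summarize() |&amp;gt; append_memory()
  end)

  Enum.take(history, @max_turns)
end

def maybe_consolidate(history), do: history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the summarization under a &lt;code&gt;Task.Supervisor&lt;/code&gt; means a crash in the background pass never touches the foreground conversation.&lt;/p&gt;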


&lt;h2&gt;
  
  
  The Six-Layer Evolution Model
&lt;/h2&gt;

&lt;p&gt;NexAgent doesn't just evolve in one way. It has six layers of growth:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SOUL&lt;/strong&gt;: Personality and core values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;USER&lt;/strong&gt;: Your profile and how you like to collaborate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MEMORY&lt;/strong&gt;: Long-term context and project domain knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SKILL&lt;/strong&gt;: Reusable workflows it has learned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOOL&lt;/strong&gt;: Hardcoded integrations and tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CODE&lt;/strong&gt;: The actual Elixir source code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every layer compounds over time, and each can evolve independently.&lt;/p&gt;
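&lt;p&gt;As a rough mental model (the struct and field names below are my own sketch, not NexAgent's source), the six layers can be pictured as one state record whose fields change at very different rates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;defmodule AgentState do
  # Ordered by how often each layer changes:
  # SOUL almost never, CODE on every self-modification.
  defstruct soul: %{},       # personality, core values
            user: %{},       # collaboration preferences
            memory: [],      # long-term facts and domain context
            skills: [],      # learned, reusable workflows
            tools: [],       # hardcoded integrations
            code_version: 0  # currently hot-loaded revision
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;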


&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;Want to take it for a spin?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install Elixir (~&amp;gt; 1.18)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Clone repo&lt;/span&gt;
git clone https://github.com/gofenix/nex-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;nex-agent
mix deps.get

&lt;span class="c"&gt;# 3. Initialize&lt;/span&gt;
mix nex.agent onboard

&lt;span class="c"&gt;# 4. Configure config file&lt;/span&gt;

&lt;span class="c"&gt;# 5. Start gateway&lt;/span&gt;
mix nex.agent gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More docs: &lt;a href="https://github.com/gofenix/nex-agent" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;OpenClaw opened our eyes to what AI Agents can do. That's a massive win for the whole industry.&lt;/p&gt;

&lt;p&gt;NexAgent is simply probing a specific niche: &lt;strong&gt;If an Agent is meant to be a long-term companion, how should we build it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;310k people are raising lobsters&lt;/a&gt;, experiencing what it feels like to have AI at their fingertips.&lt;/p&gt;

&lt;p&gt;I'm planting a tree, waiting for the day it grows into a canopy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two different paths, one shared goal: weaving AI seamlessly into our lives.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
