DEV Community

WonderLab

Open Source Project of the Day (Part 29): Open-AutoGLM - A Phone Agent Framework for Controlling Phones with Natural Language

Introduction

"'Open Meituan and search for nearby hot pot restaurants.' 'Send a message to File Transfer Assistant: deployment successful.' — speak the words, and the phone carries them out automatically."

This is Part 29 of the "Open Source Project of the Day" series. Today we explore Open-AutoGLM (GitHub), open-sourced by zai-org (Zhipu AI ecosystem).

You want to control your phone with natural language: open apps, search, tap, type text — without doing it step by step yourself. Open-AutoGLM delivers two things: a Phone Agent framework (Python code running on your computer) that controls devices via ADB (Android) or HDC (HarmonyOS) in a loop of "screenshot → visual model understands the interface → outputs action (launch app, tap coordinates, type, etc.) → execute"; and the AutoGLM-Phone series of vision-language models (9B parameters) optimized for mobile interfaces, callable via Zhipu BigModel, ModelScope APIs, or your own vLLM/SGLang service. Users simply say something like "open Xiaohongshu and search for food," and the Agent automatically completes the entire flow — with support for sensitive operation confirmation and human takeover during login/CAPTCHA situations. The project supports Android 7.0+ and HarmonyOS NEXT, covering 50+ Android apps and 60+ HarmonyOS apps, and can be integrated with Midscene.js and other UI automation tools.

What You'll Learn

  • Open-AutoGLM's positioning: Phone Agent framework + AutoGLM-Phone model for "natural language → phone operations"
  • The working pipeline: screenshot, visual model, action parsing, ADB/HDC execution, remote debugging, and human takeover
  • Environment setup: Python, ADB/HDC, developer options, ADB Keyboard (Android)
  • Model acquisition and deployment: Zhipu/ModelScope API and self-hosted vLLM/SGLang
  • Supported apps, available actions, and secondary development structure

Prerequisites

  • Familiar with Python 3.10+, pip, and virtual environments
  • Understanding of basic ADB or HDC concepts (connecting devices, executing commands)
  • If self-hosting the model service, basic experience with GPU and vLLM/SGLang; using cloud API only requires applying for a key

Project Background

Project Introduction

Open-AutoGLM comprises an open-source Phone Agent framework and the AutoGLM-Phone vision-language model, targeting "control your phone with natural language": users enter commands on a computer, and the Agent controls the phone via ADB (Android) or HDC (HarmonyOS), combining multimodal screen understanding with planning to automatically complete operations such as opening apps, tapping, typing, and swiping. The framework has built-in sensitive-operation confirmation and human takeover (e.g., for login or CAPTCHA), and supports remote debugging over WiFi, so the phone does not need to stay plugged in throughout.

The model line includes AutoGLM-Phone-9B (Chinese-optimized) and AutoGLM-Phone-9B-Multilingual, downloadable from Hugging Face or ModelScope, or callable through the already-deployed Zhipu BigModel or ModelScope APIs. The project states it is for research and learning only and prohibits illegal use; read the terms of service in the repo before using it.

Core problems the project solves:

  • Want to operate a phone "in natural language," not by memorizing steps or writing scripts
  • Need a working Agent codebase + a visual model optimized for mobile interfaces for reproduction and secondary development
  • Need simultaneous support for Android and HarmonyOS, local and remote devices, and Chinese and English commands

Target user groups:

  • Developers and teams researching or practicing "Phone Agent" and "GUI Agent"
  • Users needing to automate Android/HarmonyOS device operations for testing, demos, or assistance tasks
  • Integrators wanting to incorporate "natural language phone control" into their own products (e.g., Midscene.js integration)

Author/Team Introduction

  • Organization: zai-org (GitHub), related to the Zhipu AI ecosystem; README mentions Zhipu AI Input Method, GLM Coding Plan, AutoGLM Practitioners activities, etc.
  • Papers: AutoGLM (arXiv:2411.00820), MobileRL (arXiv:2509.18119); model architecture same as GLM-4.1V-9B-Thinking — see GLM-V deployment instructions for reference
  • Blog and product: autoglm.z.ai/blog

Project Stats

  • ⭐ GitHub Stars: 23.5k+
  • 🍴 Forks: 3.7k+
  • 📦 Version: main branch as trunk; models on Hugging Face / ModelScope
  • 📄 License: Apache-2.0
  • 🌐 Documentation and blog: autoglm.z.ai/blog, repo README and README_en.md, iOS setup guide
  • 💬 Community: WeChat community, Zhipu AI Input Method X account, GitHub Issues

Main Features

Core Purpose

Open-AutoGLM's core purpose is to translate natural language instructions into actual phone operations:

  1. Receive instruction: User inputs something like "open Meituan and search for nearby hot pot restaurants" or "open WeChat and send to File Transfer Assistant: deployment successful"
  2. Screenshot and understand: Agent takes a screenshot of the current screen via ADB/HDC, calls the AutoGLM-Phone visual model to understand the interface content and user goal
  3. Plan and output action: Model outputs structured actions (e.g., Launch, Tap, Type, Swipe, etc.), Agent parses them and sends to device for execution
  4. Loop until complete: After execution, take another screenshot, understand, plan — until the task is done, the maximum steps are reached, or human takeover is triggered
  5. Safety and takeover: Sensitive operations can be configured with a confirmation callback; login, CAPTCHA, and similar scenarios can trigger a human takeover callback, then continue after completion
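The five steps above can be sketched as a single control loop. Note this is an illustrative outline, not the repo's actual API — the `model` and `device` interfaces here are hypothetical stand-ins for the real ADB/HDC and model-client plumbing:

```python
# Minimal sketch of the Agent loop: screenshot -> model -> action -> execute.
# All helper interfaces (device.screenshot, model.next_action, ...) are
# illustrative placeholders, not the repo's real classes.

SENSITIVE = {"Pay", "Transfer"}  # hypothetical example of a sensitive-action set

def run_task(instruction, model, device, max_steps=25, confirm=lambda a: True):
    """Loop until the task finishes, is cancelled, or hits the step limit."""
    history = []
    for _ in range(max_steps):
        screenshot = device.screenshot()                      # via ADB/HDC
        action = model.next_action(instruction, screenshot, history)
        if action["action"] == "Finish":
            return "done"
        if action["action"] == "Take_over":
            device.wait_for_human()                           # login / CAPTCHA
            continue
        if action["action"] in SENSITIVE and not confirm(action):
            return "cancelled"                                # user declined
        device.execute(action)                                # Tap / Type / Swipe ...
        history.append(action)
    return "max_steps_reached"
```

The confirmation hook mirrors step 5: a sensitive action only executes if the `confirm` callback approves it.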

Use Cases

  1. Automated testing and demos

    • Drive app flows with natural language, reducing the need for hand-written UI automation scripts
  2. Personal assistant-style operations

    • "Open Taobao and search for wireless earphones," "open Xiaohongshu and search for food guides" — Agent automatically completes multi-step operations
  3. Remote device control and debugging

    • Connect via WiFi ADB/HDC, control the phone without a USB connection, convenient for remote demos or development
  4. Integration with Midscene.js and similar tools

    • Midscene.js has adapted AutoGLM — automate iOS/Android using YAML or JavaScript workflows paired with AutoGLM
  5. Research and secondary development

    • Extend new apps, new actions, or new prompts based on the phone_agent package, or integrate with a self-hosted model service

Quick Start

Environment requirements:

  • Python 3.10+
  • Android: ADB, developer mode, USB debugging (some models also need "USB debugging (security settings)"), and ADB Keyboard installed and enabled
  • HarmonyOS: HDC and developer options
  • iOS: see docs/ios_setup

Install the Agent (this repo):

git clone https://github.com/zai-org/Open-AutoGLM.git
cd Open-AutoGLM
pip install -r requirements.txt
pip install -e .

Connect device: After connecting the phone over USB, run adb devices (Android) or hdc list targets (HarmonyOS) and confirm the device is listed.
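If you want your own scripts to gate on device availability, the output of `adb devices` is easy to parse. This small helper is a generic sketch, not part of the repo:

```python
import subprocess

def list_adb_devices(raw=None):
    """Return serials of devices in the 'device' state from `adb devices` output.

    Pass `raw` to parse captured output; omit it to invoke adb directly.
    """
    if raw is None:
        raw = subprocess.run(["adb", "devices"], capture_output=True,
                             text=True, check=True).stdout
    devices = []
    for line in raw.splitlines()[1:]:           # skip "List of devices attached"
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            devices.append(parts[0])            # "unauthorized"/"offline" excluded
    return devices
```

Devices stuck in the "unauthorized" state usually mean the USB-debugging prompt on the phone has not been accepted yet.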

Using third-party model service (no local GPU needed):

# Zhipu BigModel (apply for API key at Zhipu platform)
python main.py --base-url https://open.bigmodel.cn/api/paas/v4 --model "autoglm-phone" --apikey "your-key" "Open Meituan and search for nearby hot pot restaurants"

# ModelScope (apply for API key at ModelScope)
python main.py --base-url https://api-inference.modelscope.cn/v1 --model "ZhipuAI/AutoGLM-Phone-9B" --apikey "your-key" "Open Meituan and search for nearby hot pot restaurants"

Using self-hosted model service: Deploy AutoGLM-Phone-9B with vLLM or SGLang (see README startup parameters) to get an OpenAI-compatible API (e.g., http://localhost:8000/v1), then point --base-url and --model to that service.

Python API example:

from phone_agent import PhoneAgent
from phone_agent.model import ModelConfig

model_config = ModelConfig(
    base_url="http://localhost:8000/v1",
    model_name="autoglm-phone-9b",
)
agent = PhoneAgent(model_config=model_config)
result = agent.run("Open Taobao and search for wireless earphones")
print(result)

Core Features

  1. Multimodal screen understanding: AutoGLM-Phone is optimized for mobile interfaces — understands the current page from screenshots and outputs the next action
  2. Android + HarmonyOS: Android uses ADB, HarmonyOS uses HDC — switch with --device-type adb/hdc in the same Agent
  3. 50+ Android apps / 60+ HarmonyOS apps: Social communication, e-commerce, food delivery, transportation, video, music, lifestyle, content communities, etc. — see supported app list with python main.py --list-apps and --device-type hdc --list-apps
  4. Rich operations: Launch, Tap, Type, Swipe, Back, Home, Long Press, Double Tap, Wait, Take_over (human takeover)
  5. Remote debugging: Supports adb connect IP:5555 / hdc tconn IP:5555 for WiFi-based device control
  6. Sensitive operations and human takeover: Configure confirmation_callback and takeover_callback to intervene during payment, login, CAPTCHA, and similar scenarios
  7. Chinese and English prompts: --lang cn (default) and --lang en, corresponding to phone_agent/config/prompts_zh.py and prompts_en.py — customizable
  8. OpenAI-compatible API: Any model service exposing an OpenAI-format interface works — easy to connect to Zhipu, ModelScope, or self-hosted vLLM/SGLang
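Feature 6 mentions `confirmation_callback` and `takeover_callback` by name; a minimal wiring might look like the sketch below. The exact `PhoneAgent` keyword arguments are an assumption here — check the repo for the real signature:

```python
def confirm(action_description: str, ask=input) -> bool:
    """Ask the operator before a sensitive step (e.g., payment) proceeds."""
    return ask(f"Allow '{action_description}'? [y/N] ").strip().lower() == "y"

def takeover(reason: str, ask=input) -> None:
    """Block until the human finishes login/CAPTCHA on the device."""
    ask(f"Take over the phone ({reason}), then press Enter to continue...")

if __name__ == "__main__":
    # ASSUMPTION: PhoneAgent accepts these callback keyword arguments;
    # verify against the repo before relying on this wiring.
    from phone_agent import PhoneAgent
    from phone_agent.model import ModelConfig

    agent = PhoneAgent(
        model_config=ModelConfig(base_url="http://localhost:8000/v1",
                                 model_name="autoglm-phone-9b"),
        confirmation_callback=confirm,   # sensitive operations
        takeover_callback=takeover,      # login / CAPTCHA
    )
    print(agent.run("Open Taobao and search for wireless earphones"))
```

Injecting `ask` as a parameter keeps the callbacks testable without a real terminal prompt.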

Project Advantages

| Dimension | Open-AutoGLM | Manual UI automation scripts | Cloud-only "phone assistant" products |
| --- | --- | --- | --- |
| Input method | Natural language | Coordinates/selectors, code | Natural language, but closed-source/uncontrollable |
| Devices and OS | Android + HarmonyOS, local/remote | Depends on script and tools | Depends on product |
| Model and deployment | Open-source model + self-hosted or third-party API | No model | Usually cloud-only |
| Extensibility | Modify prompts, add apps, integrate self-built services | High, but requires coding | Low |
| Research and reproduction | Paper + code + downloadable models | Depends on whether scripts are open-source | Difficult to reproduce |

Why choose Open-AutoGLM?

  • End-to-end open source: From Agent logic to models (including Chinese/multilingual versions) — all accessible for learning and secondary development
  • Ready-to-use and self-hostable: Use Zhipu/ModelScope for quick experience, or build with vLLM/SGLang for private or customized deployment
  • Dual platform: Supports both Android and HarmonyOS, with Midscene.js and ecosystem integration for many extension scenarios

Detailed Project Analysis

Architecture Overview

  • This repo (Agent side): Runs on the computer, responsible for calling model APIs, parsing model-output actions, and sending screenshot and operation commands via ADB/HDC.
  • Model service: A separate process or remote API, receives "screenshot + conversation/task" input, returns structured actions (e.g., do(action="Launch", app="Meituan")). Can connect to local vLLM/SGLang or Zhipu, ModelScope, etc.
  • Device side: Phone/tablet with developer mode and USB debugging enabled (plus ADB Keyboard, etc.), communicating with ADB/HDC on the computer via USB or WiFi.
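The structured action format shown above, do(action="Launch", app="Meituan"), is simple enough to parse with a regex. This standalone parser sketches the idea and is not the repo's actions/handler.py:

```python
import re

# Matches key="value" pairs inside a do(...) call.
ACTION_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def parse_do_call(text: str) -> dict:
    """Turn do(action="Tap", element="search box") into a plain dict."""
    if not text.strip().startswith("do("):
        raise ValueError(f"not a do(...) call: {text!r}")
    return dict(ACTION_RE.findall(text))
```

A real handler would then dispatch on the "action" key to the corresponding ADB/HDC command.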

Project Structure (phone_agent)

  • agent.py: PhoneAgent main class, orchestrates screenshots, model calls, action parsing, operation execution, and callbacks
  • adb/: ADB connection, screenshots, text input (ADB Keyboard), device control (tap, swipe, etc.)
  • actions/handler.py: Translates model-output actions into ADB/HDC commands and executes them
  • config/: Supported app mapping (apps.py), Chinese/English system prompts (prompts_zh.py, prompts_en.py)
  • model/client.py: OpenAI-compatible model client for requesting the visual model API

HarmonyOS is implemented via HDC-related modules (e.g., hdc/) in the repo; specify with --device-type hdc on the command line.
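Since model/client.py speaks the OpenAI-compatible format, the request it sends is roughly a standard vision chat payload: one screenshot plus the task text. The helper below builds such a payload; the message layout follows the generic OpenAI vision convention and is not copied from the repo:

```python
import base64

def build_vision_messages(png_bytes: bytes, task: str) -> list:
    """Build an OpenAI-style chat payload carrying one screenshot plus the task."""
    b64 = base64.b64encode(png_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": task},
        ],
    }]

# Usage against any OpenAI-compatible endpoint (vLLM/SGLang/Zhipu), e.g.:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(model="autoglm-phone-9b",
#                                         messages=build_vision_messages(png, task))
```

Because the payload is plain JSON, the same function works against Zhipu, ModelScope, or a self-hosted service — only base_url and model change.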

Models and Deployment

  • AutoGLM-Phone-9B: Optimized for Chinese mobile app scenarios, Hugging Face, ModelScope
  • AutoGLM-Phone-9B-Multilingual: Multilingual, suitable for Chinese-English mixed or English interfaces
  • Deployment: Recommended to use inference services with strong Structured Output capability (e.g., OpenAI, Gemini); when self-hosting, follow README's vLLM/SGLang startup parameters (e.g., --max-model-len 25480, --limit-mm-per-prompt, etc.) — otherwise output format errors or garbled text may occur. Model is ~20GB, requires GPU (recommend 24GB+ VRAM for local deployment).

Important Notes

  • Compliance and use: Project states it is for research and learning only; strictly prohibited for illegally obtaining information, interfering with systems, or any illegal use; read the terms of service and privacy policy in the repo before using
  • Permissions and security: Requires enabling developer mode and USB debugging on the device; sensitive operations should use callback confirmation; payment/banking pages may screenshot as black screens and trigger takeover
  • Chinese input: Android requires ADB Keyboard — not installed or enabled will cause input issues; HarmonyOS uses the system IME
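The ADB Keyboard requirement exists because plain `adb shell input text` cannot type non-ASCII characters; ADB Keyboard instead accepts text via a broadcast intent, with a base64 variant for Unicode. The helper below builds that command as a sketch — verify the exact intent names against the ADB Keyboard README for your version:

```python
import base64

def adb_type_command(text: str) -> list:
    """Build the adb command that sends text through ADB Keyboard.

    Uses the base64 broadcast variant so Chinese and other non-ASCII text
    survives shell quoting. Intent name per the ADB Keyboard convention;
    confirm it matches your installed version.
    """
    b64 = base64.b64encode(text.encode("utf-8")).decode()
    return ["adb", "shell", "am", "broadcast",
            "-a", "ADB_INPUT_B64", "--es", "msg", b64]
```

Run it with, e.g., `subprocess.run(adb_type_command("火锅"))` while ADB Keyboard is the active input method.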


Who Should Use This

  • Phone Agent / GUI Agent researchers: Need a reproducible framework and model
  • Automated testing and demos: Drive Android/HarmonyOS device flows with natural language
  • Product integrators: Want to incorporate "natural language phone control" (e.g., combined with Midscene.js)
  • Zhipu/multimodal ecosystem users: Already using Zhipu API or GLM series, want to extend to phone control scenarios

Feel free to visit my personal homepage for more useful knowledge and interesting products.
