DEV Community

WonderLab

Open Source Project of the Day (Part 29): Open-AutoGLM - A Phone Agent Framework for Controlling Phones with Natural Language

Introduction

"'Open Meituan and search for nearby hot pot restaurants.' 'Send a message to File Transfer Assistant: deployment successful.' — speak the words, and the phone carries them out automatically."

This is Part 29 of the "Open Source Project of the Day" series. Today we explore Open-AutoGLM (GitHub), open-sourced by zai-org (Zhipu AI ecosystem).

You want to control your phone with natural language: open apps, search, tap, type text — without doing it step by step yourself. Open-AutoGLM delivers two things: a Phone Agent framework (Python code running on your computer) that controls devices via ADB (Android) or HDC (HarmonyOS) in a loop of "screenshot → visual model understands the interface → outputs action (launch app, tap coordinates, type, etc.) → execute"; and the AutoGLM-Phone series of vision-language models (9B parameters) optimized for mobile interfaces, callable via Zhipu BigModel, ModelScope APIs, or your own vLLM/SGLang service. Users simply say something like "open Xiaohongshu and search for food," and the Agent automatically completes the entire flow — with support for sensitive operation confirmation and human takeover during login/CAPTCHA situations. The project supports Android 7.0+ and HarmonyOS NEXT, covering 50+ Android apps and 60+ HarmonyOS apps, and can be integrated with Midscene.js and other UI automation tools.

What You'll Learn

  • Open-AutoGLM's positioning: Phone Agent framework + AutoGLM-Phone model for "natural language → phone operations"
  • The working pipeline: screenshot, visual model, action parsing, ADB/HDC execution, remote debugging, and human takeover
  • Environment setup: Python, ADB/HDC, developer options, ADB Keyboard (Android)
  • Model acquisition and deployment: Zhipu/ModelScope API and self-hosted vLLM/SGLang
  • Supported apps, available actions, and secondary development structure

Prerequisites

  • Familiar with Python 3.10+, pip, and virtual environments
  • Understanding of basic ADB or HDC concepts (connecting devices, executing commands)
  • If self-hosting the model service, basic experience with GPU and vLLM/SGLang; using cloud API only requires applying for a key

Project Background

Project Introduction

Open-AutoGLM comprises an open-source Phone Agent framework and the AutoGLM-Phone vision-language model, targeting "control your phone with natural language": users enter commands on a computer, and the Agent controls the phone via ADB (Android) or HDC (HarmonyOS), combining multimodal screen understanding with planning to automatically complete operations such as opening apps, tapping, typing, and swiping. The framework has built-in sensitive-operation confirmation and human takeover (e.g., for login or CAPTCHA), and supports remote debugging over WiFi, so the phone does not need to stay plugged in throughout.

The model line includes AutoGLM-Phone-9B (Chinese-optimized) and AutoGLM-Phone-9B-Multilingual, downloadable from Hugging Face or ModelScope, or callable through the already-deployed Zhipu BigModel or ModelScope APIs. The project states it is for research and learning only and prohibits illegal use; read the terms of service in the repo before using it.

Core problems the project solves:

  • Want to operate a phone "in natural language," not by memorizing steps or writing scripts
  • Need a working Agent codebase + a visual model optimized for mobile interfaces for reproduction and secondary development
  • Need simultaneous support for Android and HarmonyOS, local and remote devices, and Chinese and English commands

Target user groups:

  • Developers and teams researching or practicing "Phone Agent" and "GUI Agent"
  • Users needing to automate Android/HarmonyOS device operations for testing, demos, or assistance tasks
  • Integrators wanting to incorporate "natural language phone control" into their own products (e.g., Midscene.js integration)

Author/Team Introduction

  • Organization: zai-org (GitHub), related to the Zhipu AI ecosystem; README mentions Zhipu AI Input Method, GLM Coding Plan, AutoGLM Practitioners activities, etc.
  • Papers: AutoGLM (arXiv:2411.00820), MobileRL (arXiv:2509.18119); model architecture same as GLM-4.1V-9B-Thinking — see GLM-V deployment instructions for reference
  • Blog and product: autoglm.z.ai/blog

Project Stats

  • ⭐ GitHub Stars: 23.5k+
  • 🍴 Forks: 3.7k+
  • 📦 Version: main branch as trunk; models on Hugging Face / ModelScope
  • 📄 License: Apache-2.0
  • 🌐 Documentation and blog: autoglm.z.ai/blog, repo README and README_en.md, iOS setup guide
  • 💬 Community: WeChat community, Zhipu AI Input Method X account, GitHub Issues

Main Features

Core Purpose

Open-AutoGLM's core purpose is to translate natural language instructions into actual phone operations:

  1. Receive instruction: User inputs something like "open Meituan and search for nearby hot pot restaurants" or "open WeChat and send to File Transfer Assistant: deployment successful"
  2. Screenshot and understand: Agent takes a screenshot of the current screen via ADB/HDC, calls the AutoGLM-Phone visual model to understand the interface content and user goal
  3. Plan and output action: Model outputs structured actions (e.g., Launch, Tap, Type, Swipe, etc.), Agent parses them and sends to device for execution
  4. Loop until complete: After execution, take another screenshot, understand, plan — until the task is done, the maximum steps are reached, or human takeover is triggered
  5. Safety and takeover: Sensitive operations can be configured with a confirmation callback; login, CAPTCHA, and similar scenarios can trigger a human takeover callback, then continue after completion
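The five steps above can be sketched as a single control loop. Note this is an illustrative outline, not the repo's actual API — the `model` and `device` interfaces here are hypothetical stand-ins for the real ADB/HDC and model-client plumbing:

```python
# Minimal sketch of the Agent loop: screenshot -> model -> action -> execute.
# All helper interfaces (device.screenshot, model.next_action, ...) are
# illustrative placeholders, not the repo's real classes.

SENSITIVE = {"Pay", "Transfer"}  # hypothetical example of a sensitive-action set

def run_task(instruction, model, device, max_steps=25, confirm=lambda a: True):
    """Loop until the task finishes, is cancelled, or hits the step limit."""
    history = []
    for _ in range(max_steps):
        screenshot = device.screenshot()                      # via ADB/HDC
        action = model.next_action(instruction, screenshot, history)
        if action["action"] == "Finish":
            return "done"
        if action["action"] == "Take_over":
            device.wait_for_human()                           # login / CAPTCHA
            continue
        if action["action"] in SENSITIVE and not confirm(action):
            return "cancelled"                                # user declined
        device.execute(action)                                # Tap / Type / Swipe ...
        history.append(action)
    return "max_steps_reached"
```

The confirmation hook mirrors step 5: a sensitive action only executes if the `confirm` callback approves it.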

Use Cases

  1. Automated testing and demos

    • Drive app flows with natural language, reducing the need for hand-written UI automation scripts
  2. Personal assistant-style operations

    • "Open Taobao and search for wireless earphones," "open Xiaohongshu and search for food guides" — Agent automatically completes multi-step operations
  3. Remote device control and debugging

    • Connect via WiFi ADB/HDC, control the phone without a USB connection, convenient for remote demos or development
  4. Integration with Midscene.js and similar tools

    • Midscene.js has adapted AutoGLM — automate iOS/Android using YAML or JavaScript workflows paired with AutoGLM
  5. Research and secondary development

    • Extend new apps, new actions, or new prompts based on the phone_agent package, or integrate with a self-hosted model service

Quick Start

Environment requirements:

  • Python 3.10+
  • Android: ADB, developer mode, USB debugging (some models also need "USB debugging (security settings)"), and ADB Keyboard installed and enabled
  • HarmonyOS: HDC and developer options
  • iOS: see docs/ios_setup

Install the Agent (this repo):

git clone https://github.com/zai-org/Open-AutoGLM.git
cd Open-AutoGLM
pip install -r requirements.txt
pip install -e .

Connect device: After connecting the phone over USB, run adb devices (Android) or hdc list targets (HarmonyOS) and confirm the device is listed.
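If you want your own scripts to gate on device availability, the output of `adb devices` is easy to parse. This small helper is a generic sketch, not part of the repo:

```python
import subprocess

def list_adb_devices(raw=None):
    """Return serials of devices in the 'device' state from `adb devices` output.

    Pass `raw` to parse captured output; omit it to invoke adb directly.
    """
    if raw is None:
        raw = subprocess.run(["adb", "devices"], capture_output=True,
                             text=True, check=True).stdout
    devices = []
    for line in raw.splitlines()[1:]:           # skip "List of devices attached"
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            devices.append(parts[0])            # "unauthorized"/"offline" excluded
    return devices
```

Devices stuck in the "unauthorized" state usually mean the USB-debugging prompt on the phone has not been accepted yet.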

Using third-party model service (no local GPU needed):

# Zhipu BigModel (apply for API key at Zhipu platform)
python main.py --base-url https://open.bigmodel.cn/api/paas/v4 --model "autoglm-phone" --apikey "your-key" "Open Meituan and search for nearby hot pot restaurants"

# ModelScope (apply for API key at ModelScope)
python main.py --base-url https://api-inference.modelscope.cn/v1 --model "ZhipuAI/AutoGLM-Phone-9B" --apikey "your-key" "Open Meituan and search for nearby hot pot restaurants"

Using self-hosted model service: Deploy AutoGLM-Phone-9B with vLLM or SGLang (see README startup parameters) to get an OpenAI-compatible API (e.g., http://localhost:8000/v1), then point --base-url and --model to that service.

Python API example:

from phone_agent import PhoneAgent
from phone_agent.model import ModelConfig

model_config = ModelConfig(
    base_url="http://localhost:8000/v1",
    model_name="autoglm-phone-9b",
)
agent = PhoneAgent(model_config=model_config)
result = agent.run("Open Taobao and search for wireless earphones")
print(result)

Core Features

  1. Multimodal screen understanding: AutoGLM-Phone is optimized for mobile interfaces — understands the current page from screenshots and outputs the next action
  2. Android + HarmonyOS: Android uses ADB, HarmonyOS uses HDC — switch with --device-type adb/hdc in the same Agent
  3. 50+ Android apps / 60+ HarmonyOS apps: Social communication, e-commerce, food delivery, transportation, video, music, lifestyle, content communities, etc. — see supported app list with python main.py --list-apps and --device-type hdc --list-apps
  4. Rich operations: Launch, Tap, Type, Swipe, Back, Home, Long Press, Double Tap, Wait, Take_over (human takeover)
  5. Remote debugging: Supports adb connect IP:5555 / hdc tconn IP:5555 for WiFi-based device control
  6. Sensitive operations and human takeover: Configure confirmation_callback and takeover_callback to intervene during payment, login, CAPTCHA, and similar scenarios
  7. Chinese and English prompts: --lang cn (default) and --lang en, corresponding to phone_agent/config/prompts_zh.py and prompts_en.py — customizable
  8. OpenAI-compatible API: Any model service exposing an OpenAI-format interface works — easy to connect to Zhipu, ModelScope, or self-hosted vLLM/SGLang
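Feature 6 mentions `confirmation_callback` and `takeover_callback` by name; a minimal wiring might look like the sketch below. The exact `PhoneAgent` keyword arguments are an assumption here — check the repo for the real signature:

```python
def confirm(action_description: str, ask=input) -> bool:
    """Ask the operator before a sensitive step (e.g., payment) proceeds."""
    return ask(f"Allow '{action_description}'? [y/N] ").strip().lower() == "y"

def takeover(reason: str, ask=input) -> None:
    """Block until the human finishes login/CAPTCHA on the device."""
    ask(f"Take over the phone ({reason}), then press Enter to continue...")

if __name__ == "__main__":
    # ASSUMPTION: PhoneAgent accepts these callback keyword arguments;
    # verify against the repo before relying on this wiring.
    from phone_agent import PhoneAgent
    from phone_agent.model import ModelConfig

    agent = PhoneAgent(
        model_config=ModelConfig(base_url="http://localhost:8000/v1",
                                 model_name="autoglm-phone-9b"),
        confirmation_callback=confirm,   # sensitive operations
        takeover_callback=takeover,      # login / CAPTCHA
    )
    print(agent.run("Open Taobao and search for wireless earphones"))
```

Injecting `ask` as a parameter keeps the callbacks testable without a real terminal prompt.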

Project Advantages

| Dimension | Open-AutoGLM | Manual UI automation scripts | Cloud-only "phone assistant" products |
| --- | --- | --- | --- |
| Input method | Natural language | Coordinates/selectors, code | Natural language, but closed-source/uncontrollable |
| Devices and OS | Android + HarmonyOS, local/remote | Depends on script and tools | Depends on product |
| Model and deployment | Open-source model + self-hosted or third-party API | No model | Usually cloud-only |
| Extensibility | Modify prompts, add apps, integrate self-built services | High, but requires coding | Low |
| Research and reproduction | Paper + code + downloadable models | Depends on whether scripts are open-source | Difficult to reproduce |

Why choose Open-AutoGLM?

  • End-to-end open source: From Agent logic to models (including Chinese/multilingual versions) — all accessible for learning and secondary development
  • Ready-to-use and self-hostable: Use Zhipu/ModelScope for quick experience, or build with vLLM/SGLang for private or customized deployment
  • Dual platform: Supports both Android and HarmonyOS, with Midscene.js and ecosystem integration for many extension scenarios

Detailed Project Analysis

Architecture Overview

  • This repo (Agent side): Runs on the computer, responsible for calling model APIs, parsing model-output actions, and sending screenshot and operation commands via ADB/HDC.
  • Model service: A separate process or remote API, receives "screenshot + conversation/task" input, returns structured actions (e.g., do(action="Launch", app="Meituan")). Can connect to local vLLM/SGLang or Zhipu, ModelScope, etc.
  • Device side: Phone/tablet with developer mode and USB debugging enabled (plus ADB Keyboard, etc.), communicating with ADB/HDC on the computer via USB or WiFi.
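The structured action format shown above, do(action="Launch", app="Meituan"), is simple enough to parse with a regex. This standalone parser sketches the idea and is not the repo's actions/handler.py:

```python
import re

# Matches key="value" pairs inside a do(...) call.
ACTION_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def parse_do_call(text: str) -> dict:
    """Turn do(action="Tap", element="search box") into a plain dict."""
    if not text.strip().startswith("do("):
        raise ValueError(f"not a do(...) call: {text!r}")
    return dict(ACTION_RE.findall(text))
```

A real handler would then dispatch on the "action" key to the corresponding ADB/HDC command.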

Project Structure (phone_agent)

  • agent.py: PhoneAgent main class, orchestrates screenshots, model calls, action parsing, operation execution, and callbacks
  • adb/: ADB connection, screenshots, text input (ADB Keyboard), device control (tap, swipe, etc.)
  • actions/handler.py: Translates model-output actions into ADB/HDC commands and executes them
  • config/: Supported app mapping (apps.py), Chinese/English system prompts (prompts_zh.py, prompts_en.py)
  • model/client.py: OpenAI-compatible model client for requesting the visual model API

HarmonyOS is implemented via HDC-related modules (e.g., hdc/) in the repo; specify with --device-type hdc on the command line.
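Since model/client.py speaks the OpenAI-compatible format, the request it sends is roughly a standard vision chat payload: one screenshot plus the task text. The helper below builds such a payload; the message layout follows the generic OpenAI vision convention and is not copied from the repo:

```python
import base64

def build_vision_messages(png_bytes: bytes, task: str) -> list:
    """Build an OpenAI-style chat payload carrying one screenshot plus the task."""
    b64 = base64.b64encode(png_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
            {"type": "text", "text": task},
        ],
    }]

# Usage against any OpenAI-compatible endpoint (vLLM/SGLang/Zhipu), e.g.:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(model="autoglm-phone-9b",
#                                         messages=build_vision_messages(png, task))
```

Because the payload is plain JSON, the same function works against Zhipu, ModelScope, or a self-hosted service — only base_url and model change.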

Models and Deployment

  • AutoGLM-Phone-9B: Optimized for Chinese mobile app scenarios, Hugging Face, ModelScope
  • AutoGLM-Phone-9B-Multilingual: Multilingual, suitable for Chinese-English mixed or English interfaces
  • Deployment: Recommended to use inference services with strong Structured Output capability (e.g., OpenAI, Gemini); when self-hosting, follow README's vLLM/SGLang startup parameters (e.g., --max-model-len 25480, --limit-mm-per-prompt, etc.) — otherwise output format errors or garbled text may occur. Model is ~20GB, requires GPU (recommend 24GB+ VRAM for local deployment).

Important Notes

  • Compliance and use: Project states it is for research and learning only; strictly prohibited for illegally obtaining information, interfering with systems, or any illegal use; read the terms of service and privacy policy in the repo before using
  • Permissions and security: Requires enabling developer mode and USB debugging on the device; sensitive operations should use callback confirmation; payment/banking pages may screenshot as black screens and trigger takeover
  • Chinese input: Android requires ADB Keyboard — not installed or enabled will cause input issues; HarmonyOS uses the system IME
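The ADB Keyboard requirement exists because plain `adb shell input text` cannot type non-ASCII characters; ADB Keyboard instead accepts text via a broadcast intent, with a base64 variant for Unicode. The helper below builds that command as a sketch — verify the exact intent names against the ADB Keyboard README for your version:

```python
import base64

def adb_type_command(text: str) -> list:
    """Build the adb command that sends text through ADB Keyboard.

    Uses the base64 broadcast variant so Chinese and other non-ASCII text
    survives shell quoting. Intent name per the ADB Keyboard convention;
    confirm it matches your installed version.
    """
    b64 = base64.b64encode(text.encode("utf-8")).decode()
    return ["adb", "shell", "am", "broadcast",
            "-a", "ADB_INPUT_B64", "--es", "msg", b64]
```

Run it with, e.g., `subprocess.run(adb_type_command("火锅"))` while ADB Keyboard is the active input method.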


Who Should Use This

  • Phone Agent / GUI Agent researchers: Need a reproducible framework and model
  • Automated testing and demos: Drive Android/HarmonyOS device flows with natural language
  • Product integrators: Want to incorporate "natural language phone control" (e.g., combined with Midscene.js)
  • Zhipu/multimodal ecosystem users: Already using Zhipu API or GLM series, want to extend to phone control scenarios

Feel free to visit my personal homepage for more useful knowledge and interesting products.
