Yesterday I announced I'm building an autonomous AI agent that controls a phone. Today, it has a home on GitHub.
The Repo
github.com/Dexter2344/phone-wgent
Right now it's a single Python script. But that script already does three things:
- Talks to Gemma 4 via Ollama's local API. No cloud.
- Executes ADB commands: opens apps, taps, types text.
- Parses natural language commands into structured steps.
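Here is a minimal sketch of the first and third items for readers who want to try the Ollama side. It is not the code in agent.py. It assumes Ollama is serving on its default port (11434) and that the model is available under the tag `gemma4`, which is a guess; check `ollama list` for the real name. It also assumes a hypothetical step schema with three actions: `open_app`, `tap`, and `type`. The request goes to Ollama's `/api/generate` endpoint with `"stream": false` and `"format": "json"`, so the reply comes back as a single JSON document.

```python
import json
import urllib.request

# Assumptions: Ollama runs locally on its default port, and the model tag
# "gemma4" is hypothetical (use whatever `ollama list` reports).
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "gemma4"

# Hypothetical prompt and step schema, not the repo's actual format.
# Doubled braces survive str.format() as literal { and }.
PLANNER_PROMPT = """You control an Android phone. Break the user's command into steps.
Reply with JSON only, in the form:
{{"steps": [{{"action": "open_app", "package": "..."}},
            {{"action": "tap", "x": 0, "y": 0}},
            {{"action": "type", "text": "..."}}]}}
Command: {command}"""

ALLOWED_ACTIONS = {"open_app", "tap", "type"}


def ask_ollama(command: str) -> str:
    """Send one non-streaming request to Ollama and return the raw model text."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": PLANNER_PROMPT.format(command=command),
        "stream": False,    # one JSON object instead of a token stream
        "format": "json",   # ask Ollama to constrain the output to valid JSON
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]


def parse_steps(raw: str) -> list[dict]:
    """Turn the model's JSON reply into a list of steps, rejecting unknown actions."""
    steps = json.loads(raw).get("steps", [])
    for step in steps:
        if step.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {step!r}")
    return steps
```

With a model running, `parse_steps(ask_ollama("Open WhatsApp"))` should return something like `[{"action": "open_app", "package": "com.whatsapp"}]`, though the exact plan depends on the model. Validating the action names before anything touches the phone is a cheap guard against a model that invents steps.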
Today's Progress
| Task | Status |
|---|---|
| Created the repo | ✅ Done |
| Wrote the core `agent.py` script | ✅ Done |
| Added README with full project overview | ✅ Done |
| Tested Ollama connection | ✅ Working |
| Tested ADB connection | ✅ Working |
| Parsing a test command into JSON steps | 🔧 In progress |
What the Script Does Right Now
You give it a command like "Open WhatsApp." It sends that to Gemma 4, which breaks it into a step-by-step plan. Then the script executes those steps via ADB.
It's basic. It's fragile. But the foundation is there.
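The execution half can be sketched like this. It assumes the planner returns hypothetical step dicts with an `action` of `open_app`, `tap`, or `type`, and it is an illustration rather than the repo's implementation. The adb invocations themselves are standard: `monkey` with the launcher category starts an app by package name, and `input tap` and `input text` inject touches and keystrokes.

```python
import subprocess


def to_adb_args(step: dict) -> list[str]:
    """Translate one step dict (hypothetical schema) into an adb argument list."""
    action = step["action"]
    if action == "open_app":
        # monkey with the LAUNCHER category starts the app's main activity.
        return ["adb", "shell", "monkey", "-p", step["package"],
                "-c", "android.intent.category.LAUNCHER", "1"]
    if action == "tap":
        return ["adb", "shell", "input", "tap", str(step["x"]), str(step["y"])]
    if action == "type":
        # `input text` does not take literal spaces; it decodes %s as a space.
        # Other shell metacharacters are NOT escaped in this sketch.
        return ["adb", "shell", "input", "text", step["text"].replace(" ", "%s")]
    raise ValueError(f"unsupported action: {action}")


def run_steps(steps: list[dict]) -> None:
    """Execute steps in order on the connected device."""
    for step in steps:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so the run stops at the first failing command.
        subprocess.run(to_adb_args(step), check=True, capture_output=True, text=True)
```

Keeping the translation in a pure function means it can be tested without a device attached; only `run_steps` actually needs adb.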
What's Next (Day 3)
- Add screen text detection using Tesseract OCR
- Write the verification layer that checks if each step succeeded
- Test a full 3-step task: open app → find element → tap
Why This Matters
Most AI agents live in the cloud. This one lives on a phone. No internet. No API keys. No data leaving the device.
I'm documenting every step publicly. If you're curious about building AI agents, building from a phone, or just watching someone figure it out in real time, follow along.