<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dimensional</title>
    <description>The latest articles on DEV Community by Dimensional (@dimensional).</description>
    <link>https://dev.to/dimensional</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13328%2F219eff29-aa5b-40ab-a8f5-da13d0145ad6.png</url>
      <title>DEV Community: Dimensional</title>
      <link>https://dev.to/dimensional</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimensional"/>
    <language>en</language>
    <item>
      <title>We gave actual claws to Openclaw agent and it flies a drone now</title>
      <dc:creator>Swastika Yadav</dc:creator>
      <pubDate>Mon, 11 May 2026 18:32:32 +0000</pubDate>
      <link>https://dev.to/dimensional/we-gave-actual-claws-to-openclaw-agent-and-it-flies-a-drone-now-bf1</link>
      <guid>https://dev.to/dimensional/we-gave-actual-claws-to-openclaw-agent-and-it-flies-a-drone-now-bf1</guid>
      <description>&lt;p&gt;A few weeks back, we posted a short demo of a drone following a car in peak SF traffic, controlled entirely by an Openclaw agent through a single natural language prompt. It pulled 617K views and 305 developers showed up in the replies asking for early access. Half the quote tweets were calling it the most exciting thing they'd seen in robotics all year, the other half were genuinely concerned about the surveillance implications. Both reactions told us the same thing, this hit a nerve because people could immediately picture what they'd build with it.&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-2028645216505549168-887" src="https://platform.twitter.com/embed/Tweet.html?id=2028645216505549168"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;So we wrote a detailed breakdown of how Openclaw goes from a sentence to a drone tracking a car in real time.&lt;/p&gt;

&lt;p&gt;TL;DR: Openclaw agents can now control drones via Mavlink on Dimensional. One natural language query and the agent handles perception, tracking, and drone flight control autonomously. The same agent that ran on a humanoid yesterday flew a drone today. Fully &lt;a href="http://github.com/dimensionalOS/dimos" rel="noopener noreferrer"&gt;open source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We gave our Openclaw agent access to a drone through Dimensional, our open-source agentic OS for physical space. A 40-line runfile. No flight controller code. No Mavlink scripting. Just one sentence:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Follow the next white car that comes through the intersection."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The drone took off, the agent started reading the camera feed, waited for a white car to appear, and followed it autonomously. Every decision about when to launch, what to track, and how to pursue was made by the agent. We just typed a sentence and ran a short runfile we call a &lt;code&gt;blueprint&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz12rhjigwk62s08lk8sf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz12rhjigwk62s08lk8sf.png" alt="Agent controlling a drone on Dimensionalos.com" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How Openclaw actually flies a drone&lt;/h2&gt;

&lt;p&gt;The key design decision is that the Openclaw agent doesn't know it's flying a drone. It sees typed streams: &lt;code&gt;Out[Image]&lt;/code&gt; for camera data, &lt;code&gt;In[Twist]&lt;/code&gt; for velocity commands, &lt;code&gt;Out[PoseStamped]&lt;/code&gt; for position. It reasons about what it sees and issues commands against whatever interface Dimensional gives it, regardless of whether those streams are coming from rotors or legs or wheels.&lt;/p&gt;
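
&lt;p&gt;To make the typed-stream idea concrete, here's a rough sketch of the shape of that interface in plain Python type hints. The &lt;code&gt;In&lt;/code&gt;/&lt;code&gt;Out&lt;/code&gt; wrappers and message classes below are simplified stand-ins for illustration, not the real stream types in the repo.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Simplified illustration of typed streams; not the real stream types in dimos.
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class Image:
    width: int
    height: int
    pixels: bytes

@dataclass
class Twist:
    vx: float = 0.0        # forward velocity, m/s
    vy: float = 0.0        # lateral velocity, m/s
    vz: float = 0.0        # vertical velocity, m/s
    yaw_rate: float = 0.0  # rad/s

class Out(Generic[T]):
    """A stream the hardware publishes, e.g. camera frames."""

class In(Generic[T]):
    """A stream the hardware consumes, e.g. velocity commands."""

class DroneLikeInterface:
    # The agent reasons against these declared stream types only.
    # Whether rotors, legs, or wheels sit behind them is invisible to it.
    camera: Out[Image]
    cmd_vel: In[Twist]
&lt;/code&gt;&lt;/pre&gt;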

&lt;p&gt;So when you type &lt;em&gt;&lt;strong&gt;"follow the next white car,"&lt;/strong&gt;&lt;/em&gt; the agent starts running YOLO detection on the live camera feed, classifying vehicles frame by frame. Nothing happens yet. It's watching the intersection, waiting. The moment a white car enters the frame, the agent recognizes it and hands off to the &lt;code&gt;drone_visual_servoing_controller&lt;/code&gt;, which computes the pixel offset between the car's bounding box and the camera center, converts that into velocity commands, and feeds them to the flight controller. From there the &lt;code&gt;MavlinkConnection&lt;/code&gt; takes over and translates everything into protocol commands over UDP port 14550.&lt;/p&gt;
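
&lt;p&gt;The servoing step is easy to sketch. What follows is a deliberately simplified proportional controller on the pixel offset, with made-up gains and sign conventions, just to show the shape of the computation; the real &lt;code&gt;drone_visual_servoing_controller&lt;/code&gt; is more involved than this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Simplified proportional visual-servoing sketch, not the real
# drone_visual_servoing_controller. Gains and axis conventions are made up.
from dataclasses import dataclass

@dataclass
class Twist:
    vx: float = 0.0        # forward/backward, m/s
    vz: float = 0.0        # up/down, m/s
    yaw_rate: float = 0.0  # rad/s

def servo_to_bbox(bbox, image_w, image_h, k_yaw=0.002, k_climb=0.004, cruise=3.0):
    """Turn a detection bounding box (x, y, w, h in pixels) into a velocity command."""
    x, y, w, h = bbox

    # Pixel offset between the bounding-box center and the camera center.
    err_x = (x + w / 2.0) - image_w / 2.0   # positive when the target sits right of center
    err_y = (y + h / 2.0) - image_h / 2.0   # positive when the target sits below center

    # Proportional control: yaw toward the target, adjust altitude,
    # keep a constant forward speed while the target stays in frame.
    return Twist(
        vx=cruise,
        vz=-k_climb * err_y,
        yaw_rate=-k_yaw * err_x,
    )

# Example: 1280x720 frame, car detected right of center
cmd = servo_to_bbox((900, 400, 120, 80), 1280, 720)
&lt;/code&gt;&lt;/pre&gt;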

&lt;p&gt;Three clean layers doing three different jobs: the agent decides what to follow, the &lt;code&gt;drone_tracking_module&lt;/code&gt; figures out how to follow it, and Mavlink handles the actual flying.&lt;/p&gt;

&lt;p&gt;The Mavlink layer underneath handles everything about actual flight, including rotor speeds, altitude holds, and GPS waypoints. That's 920 lines of guardrailed, low-level control logic sitting between the agent and the hardware, and the agent never touches any of it. What the agent does is call skills like &lt;code&gt;person_follow&lt;/code&gt; or &lt;code&gt;gps_nav_skill&lt;/code&gt; through the &lt;code&gt;@skill&lt;/code&gt; decorator, which exposes regular Python methods as tools the LLM can discover and invoke while it's reasoning. The agent decides what needs to happen, the flight controller figures out how to make it happen, and those two systems run at completely different speeds without stepping on each other.&lt;/p&gt;
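
&lt;p&gt;The mechanism behind that is simple to illustrate: a decorator records a function's name, docstring, and parameters so the LLM can discover it as a tool. The sketch below captures the idea only; it is not the actual &lt;code&gt;@skill&lt;/code&gt; implementation, and the skill bodies are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Conceptual sketch of a skill registry; not the actual dimos @skill decorator.
import inspect

SKILLS = {}

def skill(fn):
    """Register a plain Python function as an LLM-callable tool."""
    SKILLS[fn.__name__] = {
        "description": inspect.getdoc(fn),
        "parameters": list(inspect.signature(fn).parameters),
        "callable": fn,
    }
    return fn

@skill
def gps_nav_skill(lat, lon, altitude_m):
    """Fly to a GPS coordinate at the given altitude."""
    print(f"navigating to {lat}, {lon} at {altitude_m} m")

@skill
def person_follow(target_id):
    """Track and follow a previously detected target."""
    print(f"following target {target_id}")

# The agent picks a registered skill by name and invokes it with
# arguments produced during reasoning.
SKILLS["gps_nav_skill"]["callable"](37.7749, -122.4194, 30)
&lt;/code&gt;&lt;/pre&gt;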

&lt;p&gt;This is also how we solve the latency question that kept coming up in the replies. Agent reasoning at LLM speed and real-time flight control don't need to run at the same frequency; they just need a clean interface between them. Dimensional is that interface. The agent isn't waiting on the flight controller, and the flight controller isn't waiting on the agent.&lt;/p&gt;
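
&lt;p&gt;A minimal way to picture that separation: the agent drops its latest decision into a one-slot queue whenever it finishes reasoning, and the control loop polls that slot at a fixed rate without ever blocking on the LLM. The sketch below is illustrative; the loop rates and command strings are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative sketch of decoupled loop rates; timings and commands are made up.
import queue
import threading
import time

commands = queue.Queue(maxsize=1)  # only the latest agent decision matters

def agent_loop():
    # Runs at "LLM speed": seconds per decision.
    for decision in ["takeoff", "hold position", "follow white car"]:
        time.sleep(2.0)               # stand-in for LLM reasoning latency
        try:
            commands.get_nowait()     # drop any stale command
        except queue.Empty:
            pass
        commands.put(decision)

def control_loop(hz=50, duration_s=8.0):
    # Runs at flight-control speed and never waits on the agent.
    current = "hold position"
    for _ in range(int(duration_s * hz)):
        try:
            current = commands.get_nowait()
        except queue.Empty:
            pass                      # keep executing the last command
        # ... translate `current` into velocity setpoints here ...
        time.sleep(1.0 / hz)

threading.Thread(target=agent_loop, daemon=True).start()
control_loop()
&lt;/code&gt;&lt;/pre&gt;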

&lt;p&gt;You also don't need a physical drone to start building with any of this. The repo ships with &lt;code&gt;FakeMavlinkConnection&lt;/code&gt; and a full replay system that feeds recorded flight telemetry back to your agent with real timing preserved. Build and test the entire workflow on your laptop, then plug in hardware when you're ready to fly.&lt;/p&gt;

&lt;p&gt;You can run the full agentic drone workflow against recorded flight data with one command:&lt;br&gt;
&lt;code&gt;dimos --replay run drone-agentic&lt;/code&gt;&lt;/p&gt;
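
&lt;p&gt;If you're curious what "with real timing preserved" means, replay is conceptually just this: walk the recorded log and hand each message to your pipeline at its original offset from the start. The snippet below is an illustration of the concept, not the replay code in the repo.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rough illustration of telemetry replay with timing preserved;
# not the actual dimos replay implementation.
import time

# (timestamp_seconds, message) pairs, e.g. loaded from a recorded flight log
recording = [
    (0.00, {"alt_m": 0.0, "mode": "GUIDED"}),
    (0.50, {"alt_m": 2.1, "mode": "GUIDED"}),
    (1.25, {"alt_m": 5.0, "mode": "GUIDED"}),
]

def replay(log, handler):
    start = time.monotonic()
    for stamp, msg in log:
        # Sleep until this message's original offset from the start of the log.
        delay = stamp - (time.monotonic() - start)
        time.sleep(max(delay, 0.0))
        handler(msg)

replay(recording, lambda msg: print(msg))
&lt;/code&gt;&lt;/pre&gt;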

&lt;h2&gt;Same agent, different robots&lt;/h2&gt;

&lt;p&gt;The day before we flew the drone, this same Openclaw agent was running on a Unitree G1 humanoid. The day after, a quadruped. Three platforms in three days. Zero rewrites.&lt;/p&gt;

&lt;p&gt;This works because the same typed streams the agent was reading from the drone work identically on a humanoid or quadruped. Build the agent workflow once, swap the hardware connection module, and everything carries over. The agent doesn't know or care what body it's in.&lt;/p&gt;

&lt;p&gt;We spent months building a custom, pip-installable transport layer to make this possible. That transport layer is how we already cover 80% of robots, including Unitree, DeepRobotics, Agibot, Galaxea, AgileX, and most drone platforms. It's the reason we could ship on a humanoid one day and a drone the next, and the reason any developer building on Dimensional gets the same iteration speed.&lt;/p&gt;
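
&lt;p&gt;In code terms, "swap the hardware connection module" works because the agent workflow only ever depends on a narrow connection interface. Here's the shape of that idea using a Python &lt;code&gt;Protocol&lt;/code&gt;; the class and method names are illustrative, not the actual connection modules in the repo.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: the real dimos connection modules expose richer interfaces.
from typing import Protocol

class RobotConnection(Protocol):
    def send_velocity(self, vx: float, vy: float, yaw_rate: float): ...
    def read_camera_frame(self): ...

class MavlinkDrone:
    def send_velocity(self, vx, vy, yaw_rate):
        print(f"MAVLink setpoint vx={vx} vy={vy} yaw={yaw_rate}")
    def read_camera_frame(self):
        return b""  # stand-in for a camera frame

class QuadrupedBase:
    def send_velocity(self, vx, vy, yaw_rate):
        print(f"gait controller cmd vx={vx} vy={vy} yaw={yaw_rate}")
    def read_camera_frame(self):
        return b""

def follow_target(robot: RobotConnection):
    # The agent workflow only talks to RobotConnection, so the body
    # underneath can change without touching this code.
    frame = robot.read_camera_frame()  # perception would run on this frame
    robot.send_velocity(1.0, 0.0, 0.1)

follow_target(MavlinkDrone())
follow_target(QuadrupedBase())
&lt;/code&gt;&lt;/pre&gt;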

&lt;p&gt;&lt;code&gt;curl -fsSL https://raw.githubusercontent.com/dimensionalOS/dimos/main/scripts/install.sh | bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One install. Pick your hardware. Write your prompt.&lt;/p&gt;

&lt;h2&gt;The agent remembers what it sees with Spatial Memory&lt;/h2&gt;

&lt;p&gt;Following a car is one instruction, one flight. For agents to actually be deployed in physical space they need to remember what they've seen over hours and days, not just react frame by frame.&lt;/p&gt;

&lt;p&gt;We're building Spatial Memory to solve this: it gives agents a persistent world model they can query across space and time. The drone runs on a monocular camera, but robots on Dimensional with depth sensors or lidar can plug into the same memory system, and agents can query it across seven dimensions: object, room, semantic, geometric, time, image, and point cloud. This is what turns a single-instruction drone into a persistent system that understands your space.&lt;/p&gt;
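
&lt;p&gt;As a toy illustration of what a query across those dimensions can look like, here's a tiny in-memory version that filters labeled observations by semantic label, room, and time. The schema and query style are made up for this example and are not the Spatial Memory API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy in-memory spatial store; schema and query style are illustrative only.
from dataclasses import dataclass

@dataclass
class Observation:
    label: str    # semantic dimension, e.g. "white car"
    room: str     # room dimension
    xyz: tuple    # geometric dimension, meters in the shared world frame
    stamp: float  # time dimension, unix seconds

memory = [
    Observation("white car", "intersection", (12.0, -3.5, 0.0), 1760000000.0),
    Observation("person", "lobby", (1.2, 0.4, 0.0), 1760000300.0),
]

def query(store, label=None, room=None, since=0.0):
    """Filter by semantic label, room, and earliest timestamp, newest first."""
    hits = []
    for obs in store:
        if label is not None and obs.label != label:
            continue
        if room is not None and obs.room != room:
            continue
        if obs.stamp != max(obs.stamp, since):
            continue  # observation is older than the requested time window
        hits.append(obs)
    return sorted(hits, key=lambda o: o.stamp, reverse=True)

print(query(memory, label="white car", since=1759999999.0))
&lt;/code&gt;&lt;/pre&gt;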

&lt;h2&gt;Developers are already building for the real world&lt;/h2&gt;

&lt;p&gt;We also set this up at our own office as a proof of concept. If outdoor cameras detect someone loitering, an Openclaw agent deploys a drone to investigate. All cameras, drones, and robots operate in one shared world frame, building spatial memory together as a fleet. We're working on making it production-ready.&lt;/p&gt;

&lt;p&gt;The depth perception comes from monocular depth estimation, so you don’t need expensive sensor arrays to get started. A standard camera feed is enough to build a working spatial model. Openclaw can now understand physical space and time, and it integrates with any lidar, stereo, or RGB camera you throw at it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuomtskfqmp4x60yyhn2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuomtskfqmp4x60yyhn2.webp" alt="Dimensional" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we &lt;a href="//github.com/dimensionalOS/dimos"&gt;open-sourced the full stack&lt;/a&gt;, Dimensional hit #3 on GitHub trending within 72 hours. A developer built a Telegram bot controlling a Unitree Go2 in 180 lines of Python. Hackathon teams built ROSClaw, bridging every ROS robot to Openclaw agents. Companies are already shipping inspection drones and warehouse automation on the same stack.&lt;/p&gt;

&lt;p&gt;Join our &lt;a href="http://discord.gg/dimos" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, share what you build and hang out with fellow builders!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>robotics</category>
      <category>opensource</category>
      <category>openclaw</category>
    </item>
  </channel>
</rss>
