
Cian


I Built an AI That Can See Your Arduino and Write the Code For It

There is a specific frustration anyone who has worked with Arduino knows well.

You have a breadboard in front of you. Components are wired up. You open a chat window, describe your setup in text — "I have an LED on pin 8 with a 220 ohm resistor" — copy the code the AI gives you, paste it into the Arduino IDE, hit upload, and watch the LED do nothing. You go back to the chat window. You describe what happened. You get a revised version. You copy it again.

You do this five times before realizing the AI gave you code for pin 9 instead of the pin 8 you described, with a one-line comment saying "change this to match your wiring" that you missed.

Every AI coding assistant has this problem: it is blind to your physical setup.

ArduinoVision is my attempt to fix that.


The Idea

The concept is simple enough to state in one sentence: an AI agent that can see your breadboard through a camera, write the correct Arduino code based on what it actually observes, and upload it directly to your board.

No copy-paste. No IDE switching. No describing your wiring in text. You connect the components. The AI handles everything else.

I built this for the Vision Possible: Agent Protocol hackathon by WeMakeDevs, and the core of it runs on the VisionAgents SDK by Stream.


What VisionAgents Makes Possible

Before I get into the build, I want to explain why this project needed VisionAgents specifically — because that is not an obvious answer.

The challenge with building a hardware coding agent is that it needs four things happening simultaneously and tightly integrated: seeing video (your camera), hearing audio (your voice), reasoning about both together (the LLM), and taking external actions (compile, upload). Wiring all of that together manually — WebRTC for the camera feed, a separate STT service, a separate LLM call, a separate TTS for the response — is a significant amount of infrastructure before you write a single line of the actual agent logic.

VisionAgents collapses all of that into a few lines of Python.

The relevant part of the agent setup looks like this:

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, openai

llm = openai.Realtime(model="gpt-realtime", voice="cedar", fps=1)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="ArduinoVision", id="arduino-vision-agent"),
    instructions=SYSTEM_PROMPT,
    llm=llm,
)

That is the entire transport and LLM setup. getstream.Edge() handles the WebRTC infrastructure — video/audio in and out, connection management, reconnection logic. openai.Realtime() handles speech-to-speech natively — no separate STT or TTS services, no intermediate text conversion, just audio in and audio out with video frames attached. Stream's edge network keeps the latency under 30ms, which matters when someone is physically holding a component in front of the camera.

The fps=1 setting deserves a note. I initially had it at fps=3 and the audio quality was noticeably degraded — cutting out, pitch shifts mid-sentence. Dropping to one frame per second freed up the audio pipeline entirely. For identifying breadboard wiring, one frame per second is more than sufficient.


Registering Arduino Tools

The agent's practical capability comes from tool registration. VisionAgents uses @llm.register_function() to make Python functions callable by the model during conversation:

@llm.register_function(
    description="List all connected Arduino boards. Returns port, board name, and FQBN. ALWAYS call this first to find the port needed for upload."
)
async def list_boards() -> dict:
    boards = list_arduino_boards()
    if boards:
        return {
            "found": True,
            "boards": boards,
            "message": f"Found {len(boards)} board(s). Use the 'port' for upload operations."
        }
    return {"found": False, "boards": [], "message": "No Arduino boards detected."}

I registered six tools in total: list_boards, write_code, compile_code, upload_code, serial_monitor, and deploy_code (which chains the previous three). Each one wraps a call to arduino-cli on the system.
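For context, here is a hedged sketch of what a helper like list_arduino_boards() might look like. This is not the project's actual code: the JSON layout below follows recent arduino-cli releases (the detected_ports shape), older versions emit a different structure, and the newest releases spell the flag --json instead of --format json, so treat the parsing as illustrative.

```python
# Hypothetical sketch of the arduino-cli wrapper behind list_boards.
# JSON shape assumed from recent arduino-cli releases; verify against
# your installed version's output before relying on it.
import json
import subprocess

def parse_board_list(raw: str) -> list[dict]:
    """Extract port/name/FQBN triples from `arduino-cli board list` JSON."""
    data = json.loads(raw)
    boards = []
    for entry in data.get("detected_ports", []):
        port = entry.get("port", {}).get("address")
        for match in entry.get("matching_boards", []):
            boards.append({
                "port": port,
                "name": match.get("name"),
                "fqbn": match.get("fqbn"),
            })
    return boards

def list_arduino_boards() -> list[dict]:
    """Shell out to arduino-cli; return [] on any failure."""
    try:
        result = subprocess.run(
            ["arduino-cli", "board", "list", "--format", "json"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_board_list(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the tool testable without a board plugged in.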

What makes this work well in practice is that the model chains these calls naturally based on the conversation. The user says "make the LED blink." The model calls list_boards to find the port, calls write_code to save the sketch, then deploy_code to compile and upload. The user did not ask it to do those steps in that order — the model inferred the sequence from context and tool descriptions.
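To make that chain concrete, here is a hedged sketch of what write_code and deploy_code could wrap. The helper names, the sketches/ directory, and the default FQBN are my assumptions, not the project's actual code; the one real constraint worth knowing is that arduino-cli expects the .ino file to be named after its parent directory, e.g. Blink/Blink.ino.

```python
# Hypothetical sketches of the helpers behind write_code and deploy_code.
import subprocess
from pathlib import Path

SKETCH_ROOT = Path("sketches")  # assumed location for generated sketches

def write_sketch(name: str, code: str, root: Path = SKETCH_ROOT) -> Path:
    """Save code as <root>/<name>/<name>.ino, the layout arduino-cli requires."""
    sketch_dir = root / name
    sketch_dir.mkdir(parents=True, exist_ok=True)
    ino = sketch_dir / f"{name}.ino"
    ino.write_text(code)
    return ino

def deploy_sketch(name: str, port: str, fqbn: str = "arduino:avr:uno",
                  root: Path = SKETCH_ROOT) -> bool:
    """Compile, then upload; stop early if either step fails."""
    sketch_dir = str(root / name)
    steps = [
        ["arduino-cli", "compile", "--fqbn", fqbn, sketch_dir],
        ["arduino-cli", "upload", "-p", port, "--fqbn", fqbn, sketch_dir],
    ]
    for cmd in steps:
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return False
    return True
```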


The Event System

One thing I found genuinely useful during development was the event subscription API. Every tool call emits a ToolStartEvent and ToolEndEvent:

@agent.events.subscribe
async def on_tool_start(event: ToolStartEvent):
    logger.info(f"TOOL START: {event.tool_name}")
    logger.info(f"Args: {json.dumps(event.arguments, indent=2)}")

@agent.events.subscribe
async def on_tool_end(event: ToolEndEvent):
    if event.success:
        logger.info(f"TOOL END: {event.tool_name} ({event.execution_time_ms:.0f}ms)")
    else:
        logger.error(f"TOOL FAILED: {event.tool_name} - {event.error}")

When you are building a hardware-in-the-loop agent where failures are physical (the LED does not blink, the board does not respond), having a structured log of every tool call with arguments and timing is essential. It is also how I caught that the model was calling deploy_code before the port permissions were set correctly — the error surfaced in the log immediately.
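As an aside, this is the kind of preflight check that could catch a permissions problem before the model ever calls deploy_code. The function is hypothetical, not part of the project, and the dialout hint applies to Debian-family Linux; other platforms differ.

```python
# Hypothetical preflight check for serial port access.
import os

def check_port_access(port: str) -> tuple[bool, str]:
    """Return (ok, message) for a serial device path like /dev/ttyACM0."""
    if not os.path.exists(port):
        return False, f"{port} does not exist - is the board plugged in?"
    if not os.access(port, os.R_OK | os.W_OK):
        return False, (f"No read/write permission on {port}. "
                       "On Linux, try: sudo usermod -aG dialout $USER")
    return True, f"{port} is accessible."
```

Surfacing this as its own tool result gives the model a clear reason to tell the user "fix your permissions" instead of retrying a doomed upload.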


What I Learned Building This

Real-time video AI has a different failure mode than text AI. With text AI, wrong output is obvious — you read it and fix the prompt. With video AI, wrong output means the board does not respond and you are staring at a stationary LED trying to figure out if the model misidentified the pin, or the code is wrong, or the upload failed, or the LED is wired backwards. Good observability (the event system) is not optional.

Tool descriptions are more important than I expected. The model's behaviour changed significantly based on how I phrased the tool descriptions. "Detect connected Arduino boards" caused the model to call it inconsistently. "List all connected Arduino boards. ALWAYS call this first to find the port needed for upload." made it call the tool reliably every time, in the right order.

Hardware-in-the-loop iteration is slow. Software agents can iterate in milliseconds. Hardware agents have a four-second compile-upload cycle. This changes how you design the system — you want the model to be confident before it acts, not to try-and-retry. Good visual grounding (making sure the agent can clearly see the wiring before generating code) matters more than in pure software contexts.


What It Is Not (Yet)

ArduinoVision is a hackathon prototype. Its scope right now is: AVR boards (Uno, Nano), basic GPIO (digital pins, LEDs, buttons), one board connected at a time. It does not handle I2C sensors, servo control, ESP32/ESP8266, or multi-board setups. These are natural extensions but they are not in this version.

The interface also relies on the VisionAgents demo UI at demo.visionagents.ai rather than a custom frontend. For a prototype this is fine — building a custom WebRTC client is significant work that would have added nothing to the core idea.


The Bigger Picture

The thing that strikes me about this project is how little code it took to get something genuinely useful working. The Arduino tooling (list boards, write, compile, upload) is maybe 300 lines of Python. The agent setup is another 150. The entire relevant surface area is small.

What VisionAgents provides is the hard part: real-time video transport, speech-to-speech latency that feels natural, and a clean function calling interface that the model uses reliably. Without that infrastructure being pre-built, this project would have been two weeks of WebRTC work before a single Arduino command got called.

There is a real category of applications that becomes possible when AI agents can see physical environments and take actions based on what they observe. Hardware debugging is one. Lab automation is another. Physical quality control. Teaching environments where a student shows their circuit and gets immediate, accurate feedback.

ArduinoVision is a small example of what that category looks like when the infrastructure is available.


Try It

The code is on GitHub: github.com/mutaician/arduino-vision

You need a Stream account (free tier works), an OpenAI API key, Python 3.12, and arduino-cli. The README has full setup instructions. If you are on Windows, there are notes on forwarding the USB serial port to WSL.


Built for the Vision Possible: Agent Protocol hackathon by WeMakeDevs. Powered by VisionAgents SDK by Stream.
