Gunnar Grosch for AWS

DEV Track Spotlight: Control Humanoid Robots and Drones with Voice and Agentic AI (DEV313)

Voice control in robotics has traditionally required intricate coordination between speech recognition, natural language processing, and response generation. But what if you could simplify this complexity while achieving real-time, natural interactions with humanoid robots, robot dogs, and drones?

In this session from AWS re:Invent 2025, Haowen Huang (Senior Developer Advocate at AWS) and Cyrus Wong (Senior Lecturer at Hong Kong Institute of Information Technology and AWS AI Hero) demonstrated a groundbreaking project that combines Agentic AI with speech-to-speech streaming to revolutionize voice control for robotics. Through live demonstrations and practical AWS-based architectures, they showed how to minimize latency while maintaining precision across diverse robotic platforms.

Watch the Full Session:

The Evolution of Agentic AI

Haowen opened the session by setting the context for where we are on the agentic AI journey. According to predictions from Sequoia Capital, 2025 is still the very beginning of the agentic AI era. The predictions envision AI systems that go beyond today's tool-like attributes to become intelligent agents capable of reasoning, planning, collaboration, and highly autonomous operation.

The most striking prediction? By the 2030s, we will see an agent economy operating like a global neural network composed of numerous interconnected agents. Sequoia Capital even predicts the emergence of the first one-person unicorn, a company created and operated by a single individual with AI agents, valued at $1 billion.

This requires a fundamental shift in mindset. As Haowen emphasized, "We should adjust our mindset, not only in the technology but our mindset as well." Adopting this stochastic mindset means adapting to probabilistic outcomes and managing AI systems that operate with some degree of uncertainty.

The Challenge: Making Robots Respond Like Humans

Traditional voice control systems follow a request-response pattern: speech-to-text, then LLM processing, then text-to-speech. This creates a turn-based, walkie-talkie style of interaction with significant limitations. As Cyrus explained, "The main issue is you can't interrupt. You can't suddenly say 'stop!' It will not respond."

The breakthrough came with Amazon Nova Sonic, which enables true stream-in, stream-out processing. This means continuous speech input with real-time response generation, allowing for natural interruptions and immediate feedback, just like human conversation.
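
To make the contrast concrete, here is a minimal sketch of why stream-in, stream-out allows interruptions where a turn-based pipeline cannot. The `session`, `mic`, and `speaker` objects are hypothetical placeholders, not a real SDK surface:

```python
import asyncio

# Hypothetical placeholders: `mic` yields audio chunks, `session` is a
# bidirectional model stream, and `speaker` plays audio back.
async def stream_conversation(session, mic, speaker):
    async def send():
        async for chunk in mic:                # continuous audio in
            await session.send_audio(chunk)

    async def receive():
        async for event in session.events():   # audio and tool events out
            if event.type == "audio":
                await speaker.play(event.audio)

    # Both directions run concurrently, so a user shouting "stop!" is
    # heard and acted on even while a reply is still streaming out.
    await asyncio.gather(send(), receive())
```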

Real Production Examples from Hong Kong

Cyrus and his students at the Hong Kong Institute of Information Technology built a complete robotics control system that demonstrates the power of this approach. The project controls three types of hardware:

Humanoid robots - Full-body robots capable of complex movements and gestures

Robot dogs - Four-legged robots with dynamic movement capabilities

Drones - Flying robots requiring precise command processing

All of these robots are controlled through natural voice commands, with Nova integrated directly into the humanoid robots for conversational AI capabilities. The system even supports multiple robots operating simultaneously, with each robot having its own personality and background story stored in Amazon DynamoDB.
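
As a rough illustration of that per-robot context, here is a hedged sketch using boto3; the table name and attribute shapes are assumptions, not details from the session:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("RobotProfiles")  # hypothetical table name

# Store a personality and backstory for one robot.
table.put_item(Item={
    "robot_id": "dog-01",
    "robot_type": "robot_dog",
    "personality": "Playful and energetic; replies in short sentences.",
    "backstory": "Adopted by the student team as their demo mascot.",
})

# Fetch it at session start to seed the agent's system prompt.
profile = table.get_item(Key={"robot_id": "dog-01"})["Item"]
```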

Architecture: Serverless and Scalable

The entire system is built on serverless AWS architecture, making it both cost-effective and scalable. The key components include:

AWS IoT Core - Provides long-term connection channels for remote robot control with minimal memory footprint

AWS Lambda - Hosts the MCP (Model Context Protocol) server that translates high-level commands into robot actions

Amazon Bedrock - Powers the AI models, including Nova Sonic for voice interaction and Nova Pro for text-based control

AWS App Runner - Handles WebSocket connections for streaming voice interactions

Amazon DynamoDB - Stores robot personalities, backgrounds, and context

Amazon Cognito - Manages authentication and access control

The architecture supports two interaction modes: a voice channel using Nova Sonic for real-time speech-to-speech interaction, and a text channel using Nova Pro for the digital human interface with Cantonese language support.

The Magic of MCP Servers in Lambda

One of the most practical contributions from this session was showing how to implement an MCP server directly in AWS Lambda. Cyrus demonstrated that the implementation is surprisingly straightforward using the AWS Lambda MCP library. You simply install the library via pip, create a handler with MCP tools registration, and wrap your functions as MCP tools.
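
A minimal sketch of that pattern is below. It assumes awslabs' Lambda MCP handler package; the exact package and import names may differ, so treat this as illustrative rather than verbatim from the session:

```python
# pip install awslabs-mcp-lambda-handler  (assumed package name)
from awslabs.mcp_lambda_handler import MCPLambdaHandler

mcp = MCPLambdaHandler(name="robot-mcp", version="1.0.0")

@mcp.tool()
def go_forward(robot: str, meters: float) -> str:
    """Move the named robot forward by the given distance."""
    # In the real system this would publish a command to the robot
    # via AWS IoT Core (see the sketch further down).
    return f"{robot} moving forward {meters}m"

def lambda_handler(event, context):
    # The library translates the Lambda invocation into MCP handling.
    return mcp.handle_request(event, context)
```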

The key insight is using Enum types for parameters like robot names. This dramatically improves accuracy when processing voice commands. As Cyrus noted, "It's very difficult to use your voice to tell the robot name accurately. You want it to become more accurate."

The team's approach involves registering high-level robot control functions (like go forward, turn right, sit down, jump, take off) as MCP tools. Each tool corresponds to a specific robot action, and the Lambda function sends these commands through AWS IoT Core to the physical robots.
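
Here is a hedged sketch of one such tool body, combining the Enum trick with an IoT Core publish from Lambda; the robot names and topic scheme are illustrative assumptions:

```python
import json
from enum import Enum

import boto3

iot = boto3.client("iot-data")

# Constraining the parameter to an Enum means a fuzzy voice
# transcription can only ever resolve to a valid robot name.
class RobotName(str, Enum):
    HUMANOID = "humanoid-01"
    DOG = "dog-01"
    DRONE = "drone-01"

def send_action(robot: RobotName, action: str) -> None:
    # One MQTT topic per robot; each robot subscribes and executes.
    iot.publish(
        topic=f"robots/{robot.value}/commands",
        qos=1,
        payload=json.dumps({"action": action}),
    )
```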

Handling Multiple Tool Calls

A critical challenge in robotics control is handling sequences of actions. When you say "do five push-ups," the system needs to call the push-up action five times, not once with a parameter of five.

Initially, the team tried low-level API implementations with comma-separated action lists, but this proved unstable. The breakthrough came from using Strands Agents SDK, which provides a built-in agent loop for tool selection, reasoning, and tool calling.

As Cyrus explained, "Behind the scenes is the agent. The built-in agent has a tool selection and then reasoning and tool calling loop. This loop is always running. So what is the meaning? You can have infinite Lambda tools call."

This means you can give high-level instructions like "show me some kung fu moves" and the agent will autonomously generate and execute a sequence of appropriate actions.
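
A minimal sketch of that loop with the Strands Agents SDK might look like this; the tool body is a stub and the prompt wording is illustrative:

```python
from strands import Agent, tool

@tool
def push_up(robot: str) -> str:
    """Perform one push-up on the named robot."""
    # Stub -- the real tool would publish a command via AWS IoT Core.
    return "done"

# The SDK's built-in agent loop reasons, selects a tool, calls it,
# observes the result, and repeats -- so "do five push-ups" naturally
# becomes five separate tool calls.
agent = Agent(
    tools=[push_up],
    system_prompt="You control physical robots. Execute actions immediately.",
)
agent("Have the humanoid do five push-ups.")
```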

Production Lessons and Best Practices

The team learned several critical lessons through building this production system:

Use Frameworks, Don't Be Lazy

Cyrus was emphatic about this point: "At the beginning I just think about, I don't want to make something I don't use. I don't use a framework. I just very simple, copy the standard source code, oh, it worked. But when you debug and then face the next problem, this is no solution. Just lazy, don't lazy. Use frameworks, okay?"

Simple frameworks already include solutions to common problems. Using Strands Agents SDK saved countless hours of debugging and provided robust handling of complex interaction patterns.

Enforce Tool Calling with "Any"

One surprising discovery was the toolChoice: "any" parameter in Nova Sonic. Without this setting, the model might say it performed an action without actually calling the tool. This magic parameter ensures the model always attempts to use available tools when appropriate.
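
In Nova Sonic's event-based API, that setting lives in the tool configuration sent when the prompt starts. The event shape below is abridged and illustrative; consult the Nova Sonic documentation for the full schema:

```python
# Abridged promptStart event: toolChoice "any" tells the model to
# always attempt a tool call rather than merely claiming it did.
prompt_start = {
    "promptStart": {
        "promptName": "robot-control",
        "toolConfiguration": {
            "toolChoice": {"any": {}},
            "tools": [
                # ... tool specs elided ...
            ],
        },
    }
}
```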

List Your Tools in the Prompt

While low-code approaches suggest you don't need to list tools explicitly, the team found that listing the tools in the system prompt significantly improved accuracy. Don't try to save tokens; list the tools clearly.

Handle the "Lying" Problem

LLMs will sometimes claim they performed an action when they didn't. The solution is explicit instructions: "If you can't do it, just don't do it." This prevents the model from providing false confirmations.

Use "Immediate" for Action Commands

The team discovered that without the keyword "immediate" in prompts, the agent might wait for more information before executing commands. Adding this keyword ensures responsive action execution.
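
Taken together, the last three lessons might translate into a system prompt along these lines; the wording is illustrative, not taken from the session:

```python
SYSTEM_PROMPT = """\
You control physical robots by calling tools. Available tools:
go_forward, turn_right, sit_down, jump, take_off, push_up.
Execute action commands immediately; do not wait for more details.
Only confirm an action after you have actually called its tool.
If you can't do it, just don't do it.
"""
```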

Parallel Execution Matters

Tool calling in LLMs is sequential by default. When controlling multiple robots, this creates noticeable delays. The solution is implementing an "all" API that uses parallel execution to send commands to all robots simultaneously.
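
A hedged sketch of that fan-out, reusing the hypothetical `send_action` helper and `RobotName` Enum from earlier, can be as simple as a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

ALL_ROBOTS = [RobotName.HUMANOID, RobotName.DOG, RobotName.DRONE]

def send_action_all(action: str) -> None:
    # Publish the same command to every robot at once instead of
    # waiting for each sequential tool call to finish.
    with ThreadPoolExecutor(max_workers=len(ALL_ROBOTS)) as pool:
        list(pool.map(lambda robot: send_action(robot, action), ALL_ROBOTS))
```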

The Power of Kiro CLI for Code Generation

Throughout the project, the team leveraged Kiro CLI extensively for code generation and migration. Cyrus shared several impressive use cases:

Converting request-response APIs to streaming - Kiro CLI accurately converted their entire API from synchronous to streaming patterns

Implementing third-party APIs - By providing Chinese documentation HTML files, Kiro CLI successfully implemented the complete API for their digital human service

Generating 3D visualization - Despite having no 3D programming experience, Cyrus used Kiro CLI to generate a complete 3D robot simulator connected to AWS IoT

Iterative refinement - Using command-line fixes, Kiro CLI continuously updated the codebase, resulting in clean, maintainable code

As Cyrus noted, "This is my experience on using Q CLI [the predecessor to Kiro CLI]. And also, you have the technical depth. Maybe you see a wrong name, a lot of improvement. You just in the command line fix it. They will continue to update your source code, fix it, update your source code, fix it, and the code base becomes very clean and look very nice."

Security with IAM Authentication

The system implements AWS IAM authentication throughout, including for MCP server calls between Lambda functions. While the MCP client doesn't natively support IAM authentication, the team implemented a custom solution that injects STS tokens into HTTP headers, ensuring secure service-to-service communication.
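
One way to approach that in Python is to SigV4-sign each MCP HTTP request with botocore before sending it. This is a sketch of the general technique rather than the team's exact code, and the endpoint URL is a placeholder:

```python
import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

def call_mcp(url: str, body: bytes) -> requests.Response:
    """POST to an IAM-protected MCP endpoint with a SigV4 signature."""
    session = boto3.Session()
    creds = session.get_credentials().get_frozen_credentials()
    region = session.region_name or "us-east-1"

    req = AWSRequest(method="POST", url=url, data=body,
                     headers={"Content-Type": "application/json"})
    # "lambda" is the signing service name for IAM-authenticated
    # Lambda function URLs.
    SigV4Auth(creds, "lambda", region).add_auth(req)

    # The signed Authorization / X-Amz-* headers ride along as plain
    # HTTP headers on the outgoing request.
    return requests.post(url, data=body, headers=dict(req.headers))
```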

Key Takeaways

πŸ€– Stream-in, stream-out is essential - Nova Sonic's streaming capabilities enable natural, interruptible conversations with robots

πŸ—οΈ Serverless scales - The entire system runs on serverless AWS services, making it cost-effective and scalable

πŸ”§ MCP servers simplify integration - Implementing MCP servers in Lambda provides a clean abstraction for robot control

🀝 Frameworks solve hard problems - Strands Agents SDK handles complex multi-step reasoning and tool calling automatically

πŸ“Š Prompt engineering matters - Explicit tool listings, "immediate" keywords, and clear instructions dramatically improve accuracy

πŸ›‘οΈ IAM everywhere - Implementing IAM authentication across all service calls ensures production-grade security

⚑ Parallel execution is critical - Sequential tool calling creates unacceptable delays when controlling multiple robots

🎯 Kiro CLI accelerates development - Kiro CLI can handle complex code generation, API implementation, and iterative refinement

Cyrus's advice resonates beyond robotics: "Use frameworks, okay? Whatever framework you use, use frameworks. Don't be lazy. Simple framework will already include the very common problem and solution for you."


About This Series

This post is part of DEV Track Spotlight, a series highlighting the incredible sessions from the AWS re:Invent 2025 Developer Community (DEV) track.

The DEV track featured 60 unique sessions delivered by 93 speakers from the AWS Community - including AWS Heroes, AWS Community Builders, and AWS User Group Leaders - alongside speakers from AWS and Amazon. These sessions covered cutting-edge topics including:

  • πŸ€– GenAI & Agentic AI - Multi-agent systems, Strands Agents SDK, Amazon Bedrock
  • πŸ› οΈ Developer Tools - Kiro, Kiro CLI, Amazon Q Developer, AI-driven development
  • πŸ”’ Security - AI agent security, container security, automated remediation
  • πŸ—οΈ Infrastructure - Serverless, containers, edge computing, observability
  • ⚑ Modernization - Legacy app transformation, CI/CD, feature flags
  • πŸ“Š Data - Amazon Aurora DSQL, real-time processing, vector databases

Each post in this series dives deep into one session, sharing key insights, practical takeaways, and links to the full recordings. Whether you attended re:Invent or are catching up remotely, these sessions represent the best of our developer community sharing real code, real demos, and real learnings.

Follow along as we spotlight these amazing sessions and celebrate the speakers who made the DEV track what it was!
