
Elie


Embodied AI in Action: From Seeing to Imagining to Doing

TAI AAI #13

Last Tuesday, I had the chance to attend TAI AAI #13 - Embodied AI: From Seeing to Imagining to Doing, held at KERNEL HONGO, Tokyo. This event, hosted by Tokyo AI, brought together engineers, researchers, and AI enthusiasts to explore the cutting edge of embodied AI—the intersection of perception, reasoning, and action in intelligent systems.

The event revolved around a simple but powerful idea: how modern robots and AI systems connect what they perceive, what they imagine, and how they act in the real world. It featured three fascinating talks, each highlighting a different aspect of this pipeline: Seeing, Imagining, and Doing.

Seeing: Language as an Interface for Embodied AI

Speaker: Roland Meertens, ML Engineer at Wayve

Roland Meertens kicked off the event by exploring how Vision-Language Models (VLMs) can bridge the gap between human understanding and machine perception. He illustrated this with the example of Wayve’s self-driving car, deployed on the Nissan Ariya, showing how language can help humans interpret the car’s decisions.

The key takeaway: it’s not enough for AI systems to act autonomously—they also need to be explainable. By translating complex perceptions and actions into natural language, we can better understand, predict, and even influence autonomous systems.
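To make that idea concrete, here is a minimal sketch (not Wayve's actual system) of what "translating perception and action into language" can look like: structured detections and a planned maneuver are rendered as a human-readable explanation. The `Detection`, `PlannedAction`, and `explain` names are hypothetical, and a simple template stands in where a real stack would query a vision-language model.

```python
from dataclasses import dataclass

# Hypothetical, simplified structures: a real driving stack exposes far richer state.
@dataclass
class Detection:
    label: str         # e.g. "pedestrian", "cyclist"
    distance_m: float  # distance ahead of the vehicle

@dataclass
class PlannedAction:
    maneuver: str          # e.g. "slow_down", "lane_change_left"
    target_speed_kmh: float

def explain(detections: list[Detection], action: PlannedAction) -> str:
    """Render perception + planned action as a natural-language explanation.

    A template stands in here; in a real system the same structured context
    would be passed to a vision-language model to generate the commentary.
    """
    scene = ", ".join(f"{d.label} {d.distance_m:.0f} m ahead" for d in detections) or "a clear road"
    return (f"I can see {scene}. "
            f"I am going to {action.maneuver.replace('_', ' ')} "
            f"to {action.target_speed_kmh:.0f} km/h.")

if __name__ == "__main__":
    print(explain([Detection("pedestrian", 18.0)], PlannedAction("slow_down", 15.0)))
```

Even this toy version shows why language is a useful interface: the same internal state that drives the car can also be surfaced to a human in terms they can question and verify.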

Imagining: Predictive World Models

Speaker: Alisher Abdulkhaev, Co-Founder & CTO, Kanaria Tech

Next, Alisher Abdulkhaev presented the concept of world models—predictive representations of the physical world that allow AI systems to imagine future states and plan ahead. Unlike reactive systems that respond only to the immediate environment, world model-driven AI can anticipate outcomes and make informed decisions.
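As a rough illustration of the planning idea (my own toy example, not the speaker's model): a hand-written dynamics function stands in for a learned world model, and the planner "imagines" random action sequences under it, keeping the first action of the best-scoring rollout. The `world_model` and `plan` functions and all parameters here are assumptions made for the sketch.

```python
import numpy as np

def world_model(state: np.ndarray, action: float) -> np.ndarray:
    """Toy stand-in for a learned dynamics model: predicts the next [position, velocity]."""
    pos, vel = state
    vel = vel + 0.1 * action   # the action accelerates the agent
    pos = pos + 0.1 * vel
    return np.array([pos, vel])

def plan(state: np.ndarray, goal: float, horizon: int = 10, n_candidates: int = 256) -> float:
    """Imagine rollouts of random action sequences and return the best first action."""
    rng = np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_action, best_cost = 0.0, np.inf
    for seq in candidates:
        s = state.copy()
        for a in seq:                # "imagine" the future under the model
            s = world_model(s, a)
        cost = abs(s[0] - goal)      # how far the imagined rollout ends from the goal
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action

state = np.array([0.0, 0.0])
print("first action toward goal:", plan(state, goal=1.0))
```

The reactive alternative would simply map the current state to an action; here the agent evaluates imagined futures before committing, which is the essence of the world-model approach.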

Alisher shared his work on the Kanaria Robotic Model (KRM), a foundation model for social navigation in autonomous robots. His approach integrates perception, reasoning, and goal-directed action, enabling robots to navigate complex, dynamic environments naturally.

The insight here is profound: imagining the world enables AI to move beyond reaction into proactive, intelligent behavior.

Doing: Vision-Language-Action Models

Speaker: Motonari Kambara, JSPS Research Fellow, Keio University

The final talk by Motonari Kambara covered Vision-Language-Action (VLA) models, which connect perception and reasoning directly to action. By combining visual and linguistic inputs, these systems can generate purposeful, explainable behavior in robots.
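Here is a structural sketch of that idea, with randomly initialized matrices standing in for trained encoders and a policy head. It is meant only to show the shape of the perception-plus-language-to-action mapping, not any particular VLA model; all names (`W_vision`, `W_text`, `W_action`, `vla_step`) and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned components; a real VLA uses pretrained vision and
# language encoders plus a policy trained on robot demonstrations.
W_vision = rng.normal(size=(64, 2048))  # projects image features into a shared space
W_text = rng.normal(size=(64, 300))     # projects instruction embeddings likewise
W_action = rng.normal(size=(7, 64))     # decodes a 7-DoF continuous action

def vla_step(image_features: np.ndarray, instruction_embedding: np.ndarray) -> np.ndarray:
    """Map (what the robot sees, what it was told) to a low-level action.

    Structural sketch only: fuse visual and linguistic inputs into a shared
    representation, then decode an action from it.
    """
    fused = np.tanh(W_vision @ image_features + W_text @ instruction_embedding)
    return W_action @ fused  # e.g. end-effector deltas plus a gripper command

action = vla_step(rng.normal(size=2048), rng.normal(size=300))
print("predicted action vector:", action.round(2))
```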

Motonari emphasized grounded understanding: robots don’t just act—they understand the environment and the task in human-comprehensible terms. This opens exciting avenues for transparent, interpretable AI that can collaborate safely and effectively with humans.

Key Takeaways from TAI AAI #13

Explainability is critical: Whether in autonomous cars or mobile robots, humans need to understand AI behavior.

World models enable planning: Predictive representations let AI systems anticipate and act intelligently.

Embodied AI integrates perception, reasoning, and action: The next generation of intelligent systems will not only see and imagine but also act purposefully in complex environments.

Attending this event was an inspiring reminder of how far embodied AI has come, and how rapidly the field is evolving. From self-driving cars that can explain their decisions to robots that plan social navigation in real-world spaces, the frontier of AI is becoming more human-understandable and intelligent by the day.

Top comments (1)

Alex Chen

Thanks for sharing this - the Tokyo AI community keeps producing great events. The explainability piece from Wayve really resonates. I've been working on AI systems where users need to trust the output (mental health tracking), and it's wild how much the "why" matters as much as the "what."

The world models concept is fascinating too. Most of my work has been with reactive systems that just respond to immediate input, but the idea of AI anticipating and planning... that's where things get interesting. And honestly a bit scary when you think about deployment at scale.

Question - did they discuss any practical applications of VLA models beyond robotics? Wondering if that perception-reasoning-action pipeline could apply to other domains where you need explainable AI behavior.