Shawn

Posted on Jun 16

FutureX · Physical AI Daily — Issue 30 (06/17)

#ai #robotics #machinelearning #research

Today's Highlights

· Alibaba launched Qwen-Robot, its first embodied large model family under the Qwen line, releasing three components simultaneously: RobotManip (manipulation), RobotNav (navigation), and RobotWorld (world model). The launch marks Alibaba's formal entry into physical AI.

· Galaxea AI (Chinese embodied AI startup) unveiled a full embodied AI suite the same day: open-sourcing next-generation VLA backbone G0.5, announcing world model Fast-WAM and a whole-body control backbone, and debuting its self-developed biped humanoid Kengo.

· Genesis AI, backed by Eric Schmidt and others, launched its first general-purpose robot Eno, betting on a "non-humanoid" wheeled form factor with delivery planned by year-end, and announced an enterprise deployment partnership with LG CNS.

· Mobileye announced plans to launch a vertically integrated, self-operated Robotaxi service in the U.S. in 2027, pivoting from supplier to operator; MBLY shares rose in pre-market trading.

· On the capital side: Simple AI (Chinese embodied AI startup) raised hundreds of millions of RMB in a Pre-A round led by Didi; Unitree Robotics (Chinese quadruped and humanoid robot maker) passed its STAR Market IPO review, targeting approximately RMB 4.202 billion in proceeds.

I. Research Papers

GAM: Repurposing a Geometric Foundation Model as a Shared Backbone for Robot Policies · manipulation

Most current VLA and world-action models still operate in 2D image space or derived latent spaces, lacking the 3D geometry that contact-rich manipulation truly requires. GAM directly reuses a pretrained geometric foundation model to inject 3D structure into policies — the highest-trending paper (HF↑80) that day on the "giving VLAs 3D" track.

Jisang Han et al. · arXiv 2606.17046 source

The method uses a pretrained Geometric Foundation Model (GFM) as a shared backbone for perception, temporal prediction, and action decoding: the GFM is split at an intermediate layer, with shallow layers serving as an observation encoder and deeper layers handling temporal prediction and action output. This gives a language-conditioned manipulation policy native 3D geometry rather than relying on 2D frames or 2D-derived representations. The paper argues that "reusing geometric priors" this way is more efficient than stacking additional point-cloud or depth branches, and better supports generalization across viewpoints and contact tasks.
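The split described above can be pictured with a toy layer stack. This is a minimal sketch under stated assumptions, not GAM's implementation: the helper `split_backbone`, the split index, and the names `encode` and `decode` are hypothetical, and plain Python callables stand in for the pretrained model's blocks.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def split_backbone(layers: Sequence[Layer], split_at: int) -> Tuple[Layer, Layer]:
    """Split one pretrained layer stack into a shallow and a deep half.

    Hypothetical helper: `split_at` is an assumed hyperparameter, not a
    value taken from the paper.
    """
    if not 0 < split_at < len(layers):
        raise ValueError("split_at must leave at least one layer on each side")
    shallow, deep = list(layers[:split_at]), list(layers[split_at:])

    def run(stack: List[Layer]) -> Layer:
        def forward(x: List[float]) -> List[float]:
            for layer in stack:
                x = layer(x)
            return x
        return forward

    return run(shallow), run(deep)


# Stand-in "layers": each one is a simple elementwise transform.
toy_layers: List[Layer] = [
    lambda x: [v + 1.0 for v in x],
    lambda x: [v * 2.0 for v in x],
    lambda x: [v - 3.0 for v in x],
    lambda x: [v * 0.5 for v in x],
]

encode, decode = split_backbone(toy_layers, split_at=2)

observation = [1.0, 2.0]
features = encode(observation)  # shallow layers act as the observation encoder
action = decode(features)       # deep layers produce the prediction / action output
```

Both halves reuse the same pretrained stack, so no extra point-cloud or depth branch is added; only the role assigned to each half changes.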

DreamX-World 1.0: A General Interactive World Model with Long-Horizon and "Revisit" Support · world-model

Few general interactive world models combine revisiting previously observed regions, promptable events, and long-horizon generation in a single model that also spans realistic, gaming, and stylized domains. Community attention stood at HF↑70.

DreamX Team · arXiv 2606.16993 source

This is a general text/image-to-video interactive world model supporting camera navigation, revisiting previously observed regions, and promptable events across realistic, gaming, and stylized scenes. Its data engine combines camera-accurate Unreal Engine rendering, action-rich game recordings, and real-world video with recovered camera geometry. The model uses a lightweight projected positional encoding (E-PRoPE) to preserve projected camera geometry, and converts a bidirectional video generator into a few-step autoregressive world model via causal forcing, DMD distillation, and long-rollout training.

Qwen-RobotWorld: An Embodied Video World Model with Language as the Unified Action Interface · world-model

The "brain" component of Alibaba's Qwen-Robot embodied suite uses natural language as a unified action interface, bringing manipulation, driving, indoor navigation, and human-to-robot transfer into a single model that predicts future visual states.

Jie Zhang et al. (Alibaba Qwen) · arXiv 2606.17030 source

The model uses natural language as a unified action interface, predicting physically plausible future visual trajectories from current observations across robot manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. It offers three application directions: synthetic data augmentation for policy training, a scalable virtual environment for policy evaluation, and language-guided planning signals. Architecturally, it uses a Double-Stream MMDiT that couples frozen Qwen2.5-VL semantics with video-VAE latents, using an MLLM for action encoding.

Kairos: A "Native" World Model Stack for Physical AI · world-model

Kairos pushes world models from "passive video generators" toward operational infrastructure capable of maintaining state over long horizons and pretraining across embodiments; the Kairos stack is reported to have topped embodied benchmarks including RoboTwin 2.0 and LIBERO-Plus.

Kairos Team et al. · arXiv 2606.16533 source · Commentary: Quantum Bit source (WeChat, CN)

Kairos proposes a native world model stack: a "native pretraining paradigm" combined with a "cross-embodiment data curriculum" that organizes open-world video, human behavioral data, and robot interaction data into a progressive development path. Within a unified native architecture, it handles world understanding, generation, and prediction together, using hybrid linear temporal attention to maintain persistent state over long horizons, with an emphasis on efficient execution under real deployment constraints. This stack is the world-model approach recently showcased publicly by SenseTime-affiliated Daxiao (大晓).

How Should World Models Be Evaluated? A Decision-Centric Position Paper · world-model

In line with recent discussions about the confused definition of "world model," this paper directly targets the most critical flaw in current evaluation: claims about "what a model can be used for" routinely exceed what the evaluations actually support.

Yang Yu et al. (Nanjing University) · arXiv 2606.15032 source

The authors survey the various objects now referred to as "world models" (action-conditioned environment models, latent imagination models, future video predictors, interactive neural simulators, latent predictive representations, synthetic data engines, etc.) and argue that evaluation has generalized along with the terminology, causing recurring "claim/evidence mismatches." The paper advocates recentering world model evaluation around downstream decision-making — executable metrics such as planning success rate and policy improvement.
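To make "executable metrics" concrete, here is a minimal sketch of the two metrics named above. The function names, signatures, and toy numbers are assumptions for illustration and are not definitions taken from the paper.

```python
from statistics import mean
from typing import Sequence


def planning_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of planning episodes that reached the goal."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def policy_improvement(returns_with_model: Sequence[float],
                       returns_baseline: Sequence[float]) -> float:
    """Mean return gained by a policy that used the world model,
    relative to a baseline policy evaluated in the same environment."""
    return mean(returns_with_model) - mean(returns_baseline)


# Toy numbers, made up purely for illustration.
episodes = [True, True, False, True]
rate = planning_success_rate(episodes)
gain = policy_improvement([10.0, 12.0, 14.0], [9.0, 9.0, 9.0])
```

Both numbers are measured on the downstream decision task rather than on video quality, which is the shift in emphasis the paper argues for.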

ViTaL: Integrating Touch into Inference-Time Policy Guidance · manipulation

Inference-time guidance (selecting actions at deployment without retraining the policy) has previously relied almost entirely on vision, but the success of contact-rich tasks often hinges on local interactions such as contact forces. ViTaL incorporates touch into a two-level guidance scheme: high-level mode selection via vision, low-level action refinement via touch.

Yilin Wu et al. (CMU) · arXiv 2606.14981 source

The method frames multimodal guidance as a bilevel optimization: the high level uses visual sample-and-verify for long-horizon mode selection, deciding what behavior the robot should execute; the low level uses tactile-guided diffusion editing to refine the selected action sequence over a shorter horizon to satisfy local constraints such as contact forces. Designed for contact-rich manipulation where vision alone is insufficient to judge success.
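The two-level structure can be sketched with toy stand-ins. This is an illustration under loud assumptions, not ViTaL's method: the visual verifier is a hard-coded score table, the force model is a made-up linear function, and simple iterative shrinking replaces the paper's tactile-guided diffusion editing. All names here are hypothetical.

```python
from typing import Callable, Dict, List


def select_mode(candidates: Dict[str, List[float]],
                visual_score: Callable[[str], float]) -> str:
    """High level: keep the sampled candidate that the verifier scores highest."""
    return max(candidates, key=visual_score)


def refine_actions(actions: List[float],
                   predicted_force: Callable[[float], float],
                   max_force: float,
                   step: float = 0.5,
                   iters: int = 50) -> List[float]:
    """Low level: shrink each action until its predicted contact force is within the limit."""
    refined = []
    for a in actions:
        for _ in range(iters):
            if abs(predicted_force(a)) <= max_force:
                break
            a *= step
        refined.append(a)
    return refined


# Hypothetical sampled behaviors and verifier scores.
candidates = {"push": [4.0, 1.0], "grasp": [8.0, 2.0, 0.5]}
scores = {"push": 0.3, "grasp": 0.9}

mode = select_mode(candidates, scores.get)
refined = refine_actions(candidates[mode],
                         predicted_force=lambda a: 2.0 * a,  # toy force model
                         max_force=5.0)
```

The point of the split is that the slow, long-horizon choice (which behavior) and the fast, local correction (how hard to press) use different sensing modalities.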

ROVE: Post-Training Humanoid VLAs with "Imperfect Human Intervention" · vla

Human intervention is an important correction signal for VLA post-training, but humanoid whole-body kinematics and dexterous hands often make intervention trajectories hesitant, inefficient, or erroneous; learning from them directly as expert demonstrations would bake in bad habits.

Wei Xiao et al. · arXiv 2606.17011 source · Signal HF↑6

ROVE presents a reinforcement learning framework: first, a human-in-the-loop pipeline collects deployment and intervention data for humanoid manipulation; then an optimism-based method performs humanoid VLA post-training on data containing imperfect interventions, avoiding the absorption of hesitant, inefficient, or erroneous intervention behavior into the policy as supervision.

Other papers today: MotionVLA (VLA for humanoid locomotion, frequency-domain analysis for custom multi-codebook, arXiv 2606.15142 source); Metis (general world-action model for autonomous driving/urban navigation, decoupling video generation and action prediction, arXiv 2606.15869); TruDi (trust-region diffusion policy supporting large-scale parallel on-policy RL, arXiv 2606.15260); AVA-VLA (VLA with latent-variable reasoning and early-exit mechanism, arXiv 2606.15099); CausalDrive (real-time causal driving world model, operating from a single frame + ego trajectory + text prompt, arXiv 2606.15341); Retrieve, Don't Retrain (using retrieval instead of per-task fine-tuning to scale VLAs to new tasks at test time, arXiv 2606.15631).

Open Source · Tools · Benchmarks

· ATOM-Bench: A real-world manipulation benchmark that decomposes tabletop manipulation into "action atoms × instruction atoms" (30 atomic tasks + 24 compositional generalization tasks, single-arm/dual-arm dual track), releasing 3,000 human demonstrations and evaluation rollout data, designed to diagnose the "can do individual skills but can't recombine them" failure mode in general manipulation policies (arXiv 2606.16826 source).

· Junpu Intelligent × Boden × Shanghai Jiao Tong University: Jointly released a large-scale dataset for real-robot reinforcement learning, targeting the persistent data scarcity in physical-robot RL source.

· AgiBot GO-2 (AgiBot, Chinese humanoid robotics company): Open-sourced as the official baseline model for the AGIBOT WORLD CHALLENGE, available for secondary development by developers worldwide source (WeChat, CN).
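The "action atoms × instruction atoms" idea behind ATOM-Bench can be illustrated with a toy grid. The atom names and the train/test split below are hypothetical and far smaller than the benchmark's real task set; the sketch only shows how a compositional test set holds out combinations whose individual atoms were each seen in training.

```python
from itertools import product

action_atoms = ["pick", "place", "push"]         # hypothetical atom names
instruction_atoms = ["by_color", "by_position"]  # hypothetical atom names

# Every pairing of one action atom with one instruction atom.
all_tasks = list(product(action_atoms, instruction_atoms))

# Hypothetical training subset that still covers every individual atom.
train_tasks = [("pick", "by_color"), ("place", "by_position"),
               ("push", "by_color"), ("pick", "by_position")]

seen_actions = {a for a, _ in train_tasks}
seen_instructions = {i for _, i in train_tasks}

# Held-out combinations whose parts were each seen, but never together.
compositional_tests = [
    (a, i) for (a, i) in all_tasks
    if (a, i) not in train_tasks
    and a in seen_actions and i in seen_instructions
]
```

A policy that succeeds on `train_tasks` but fails on `compositional_tests` shows the "can do individual skills but can't recombine them" failure mode.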

II. Funding & Deals

Simple AI (深朴智能) ｜ Pre-A ｜ Hundreds of Millions of RMB · embodied

Led by Didi, with participation from Meihua Ventures and Keli Sensing and continued investment from existing backers Creation Partners, Linear Capital, and Puhua Capital. Simple AI focuses on general embodied intelligence robots; the company has reportedly closed four rounds within a year, and this was the most-watched funding round in China's embodied AI space that day. Source: Zhidongxi source

Unitree Robotics (宇树科技) ｜ STAR Market IPO Approved ｜ ~RMB 4.202 Billion Targeted · humanoid

Unitree Robotics passed its STAR Market IPO review, targeting approximately RMB 4.202 billion in proceeds, a further step toward capital markets following mass production of its quadruped robot dogs and humanoid robots. Multiple outlets described the review timeline as "73 days to approval." Source: Sina Finance source

Limitless Labs ｜ Series A ｜ $20 Million · industrial

Limitless Labs builds physical AI foundation models and factory-focused AI agents for precision manufacturing; this round will be used to expand the company's embodied foundation model. Targeting manufacturing verticals with an embodied AI backbone is another example of the B2B route. Source: SiliconANGLE source

Pegasus Tech Ventures × CYBERDYNE ｜ Corporate Venture Fund ｜ ¥10 Billion · adjacent

The two parties established a ¥10 billion CVC fund dedicated to "HCPS Cybernics × Physical AI." CYBERDYNE is known for its HAL exoskeleton; the fund aims to accelerate startups in human-cyber-physical systems. Source: Business Wire source

NOITOM Robotics (诺亦腾, Chinese motion-capture and robotics company) ｜ Pre-A++ · adjacent

Investors include the Beijing Artificial Intelligence Industry Investment Fund, the Shanghai AI Industry Series Fund, Shenzhen Capital Group, Jianfa Emerging, CICC Capital, Kunlun Capital, and Yuanhe Puhua, with existing shareholders adding to their positions. NOITOM began in motion capture and is expanding into embodied data collection. Source: Shicheng Capital source (WeChat, CN)

Qiankong Embodied Intelligence (潜空间具身智能) ｜ Seed Round ｜ Tens of Millions of RMB · embodied

Led by Hanhui Capital alongside industry partners; the company focuses on "software-defined robot motion" — enabling robots to move via a no-code platform. Proceeds will fund platform R&D and team expansion. Source: PKU Youth CEO Club source (WeChat, CN)

Hot Money Returns to Collaborative Robots: FAIR-Innovation, Elite Robot, Realman, Flexiv, and Tianjee Intelligence Close Rounds in Quick Succession · industrial

Since the start of 2026, FAIR-Innovation Robotics, Elite Robot, Realman Robotics, and Flexiv have each announced new funding rounds, with Tianjee Intelligence (Chinese collaborative robot maker) standing out with a RMB 1 billion Series B. Collaborative robots' relatively contained costs and well-defined deployment scenarios have made them a renewed focus for capital. Source: Gaogong Robot source (WeChat, CN)

III. Commercial Deployment

AgiBot Cumulative Shipments Surpass 10,000 Units; ~RMB 5 Billion National Special-Purpose Robot Base Enters Construction Sprint · humanoid

AgiBot (Chinese humanoid robotics company) announced cumulative shipments exceeding 10,000 units (approximately 4,900 added in the first half of the year — a near-doubling in six months), making it the first player in the industry to reach "10,000-unit delivery," while targeting RMB 10 billion in revenue by 2027. Its approximately RMB 5 billion national special-purpose robot base is entering its construction sprint, with capacity being built out across Lingang, Chengdu Pidu, Zhengzhou, and other locations. (Earlier demonstrations of the Yuanzheng A3 autonomously playing table tennis have been covered previously and are not repeated here.) Source: Yuandong Robot source (WeChat, CN)
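The "near-doubling" claim can be checked with quick arithmetic. The sketch below treats the reported figures as exact, which is an assumption: the source gives "exceeding 10,000" and "approximately 4,900."

```python
cumulative_units = 10_000  # "exceeding 10,000", treated here as a floor
added_in_h1 = 4_900        # "approximately 4,900"

# Installed base at the start of the half-year, implied by the two figures.
units_before_h1 = cumulative_units - added_in_h1
growth = added_in_h1 / units_before_h1

print(f"Installed base before H1: ~{units_before_h1:,}")
print(f"Growth over six months:   ~{growth:.0%}")
```

Under these assumptions the installed base grew by about 96% in six months, which is consistent with the "near-doubling" description.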

Chengdu Humanoid Robot Innovation Center Signs Order for 5,000 Units · humanoid

The Chengdu Humanoid Robot Innovation Center announced it has signed a supply order for 5,000 units, signaling a move from showcase to volume procurement for humanoid robots. ⚠️ Single-party claim ("largest single supplier order in China's embodied AI sector" is a self-reported figure). Source: Sohu source

BMW Tests Humanoid Robots at Leipzig Factory · industrial

BMW is testing humanoid robots at its Leipzig factory in Germany, framing them as "tireless colleagues," joining other automakers trialing humanoids on production lines. The deployment remains at the factory pilot stage, not yet scaled. Source: VISION mobility source

Humanoid-Operated Convenience Store to Launch in Hong Kong · humanoid

According to foreign media reports, a convenience store operated by humanoid robots is set to launch in Hong Kong, extending humanoids from factory floors and guided tours into retail front-of-house. Actual operational performance and human-robot task division remain to be observed. Source: People.com source

Amazon to Build £500 Million Cross-Dock Hub in the UK, Processing ~20 Million Parcels per Week · industrial

Amazon plans to build a cross-dock center in the UK with an investment of approximately £500 million, processing roughly 20 million parcels per week, continuing the expansion of warehouse and logistics automation toward ever-higher throughput. Source: Trans.INFO source

IV. Industry Developments

Alibaba Launches First Embodied Large Model Series Qwen-Robot: "Hands, Feet, and Brain" Released Simultaneously · world-model

The Qwen family's first complete embodied intelligence model series consists of three components — Qwen-RobotManip (manipulation, VLA), Qwen-RobotNav (navigation), and Qwen-RobotWorld (world model) — corresponding to the robot's "hands, feet, and brain," marking Alibaba's formal move from conversational AI in the digital world into the physical world. According to public analyses, the VLA component is built on a Qwen3.5-4B vision-language backbone paired with an approximately 1.15B DiT flow-matching action decoder, unifying manipulation, navigation, and trajectory prediction into a single action-trajectory prediction framework conditioned on "embodied perception prompts," and claimed to be reusable across tasks, environments, and embodiments. Alibaba positions it as a "standardized control backbone" for robotics. The announcement was covered by more than seven English-language outlets the same day; however, Alibaba's Hong Kong-listed shares briefly dipped in pre-market trading, reflecting market uncertainty about the commercial monetization of such a "backbone." Source: Robot Outlook source (WeChat, CN)
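For readers unfamiliar with flow-matching action decoders, the sampling step can be sketched in a few lines. This is a generic toy illustration, not Qwen-RobotManip's decoder: the names `euler_sample` and `toy_velocity` are hypothetical, and a closed-form straight-line velocity field stands in for the learned DiT, which in a real system would be conditioned on observations and language rather than given the target.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Closed-form velocity for the straight-line path toward `target`.

    Stand-in for a learned network; it only exists to make the sketch runnable.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]  # start from Gaussian noise
target_action = [0.2, -0.1, 0.05]                   # made-up action vector
action = euler_sample(toy_velocity(target_action), noise, num_steps=10)
```

Starting from noise, the integrator follows the velocity field until it lands on the action; with a learned field, the same loop turns a noise sample into an action trajectory in a handful of steps.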

Galaxea AI Unveils Full Embodied AI Suite: Open-Sources G0.5, Announces Fast-WAM, Debuts Biped Humanoid Kengo · humanoid

At an embodied intelligence event on June 16, Galaxea AI simultaneously released and open-sourced its next-generation VLA foundation model G0.5, announced its world model Fast-WAM and a whole-body control foundation model, and gave the first public showcase of its self-developed biped humanoid Kengo, while declaring the completion of a "full-stack hardware + intelligence" strategic loop. The founder outlined a three-stage framework: "instinctive intelligence — operational intelligence — evolutionary intelligence," in which instinctive intelligence acts directly on the body (balance, walking, running), with operational and evolutionary intelligence layered on progressively, emphasizing there are no shortcuts. Simultaneously showcasing an open-source backbone model, a world model, and a self-developed humanoid is how Galaxea AI differentiates itself from players that focus solely on "the brain" or solely on "the body." Source: Robot Outlook source (WeChat, CN)

Genesis AI Launches First General-Purpose Robot Eno, Betting on a "Non-Humanoid" Wheeled Approach · embodied

Genesis AI, backed by Eric Schmidt and others, launched its first general-purpose robot Eno, using a wheeled rather than bipedal design, with an integrated hardware body and an AI brain called GENE. The company plans to deliver to customers by year-end. Genesis AI publicly questions the current humanoid hype, arguing that a more pragmatic wheeled mobile manipulation approach is better suited for enterprise scenarios. The same day, it announced what it called an "industry-first" strategic partnership with LG CNS for scaled enterprise robot deployment. AFP, Reuters, and Forbes followed closely; Forbes went as far as asking whether this is "the iPhone moment for humanoid robots" — a media framing more than a validated verdict; Eno's actual capabilities and delivery remain to be proven. Source: The Robot Report source

Current Robotics Releases Whole-Body Dexterous Manipulation Model Curr-0: Unified Weights for Mobile and Fine Manipulation · embodied

Curr-0 uses a single end-to-end policy to couple walking, whole-body posture coordination, and fine hand manipulation, so the robot does not need to "walk into position and then operate" but instead coordinates the whole body and hands in real time while moving. The model was trained on 21,000 hours of real human behavioral data (including 2,800 hours of whole-body teleoperation), collected via the company's self-developed HumanEx whole-body exoskeleton system. Because humans can complete tasks naturally in real environments while wearing the exoskeleton, data growth shifts from "robot deployment hours" to "human task hours." The official demo shows the robot completing tasks such as tearing open a tea bag, lighting incense, stamping a seal, and squatting to place a stuffed toy in a basket; these are capability demonstrations that remain far from scaled deployment. Source: Quantum Bit source (WeChat, CN)

Mobileye Announces Self-Operated, Vertically Integrated U.S. Robotaxi Service for 2027 · autonomy

Autonomous driving technology supplier Mobileye (MBLY) announced plans to launch its own vertically integrated Robotaxi service in a U.S. city in 2027, extending its role from supplier to operator — news that drove its U.S. shares higher in pre-market trading. This represents a significant departure from the company's long-standing position as an ADAS and autonomous driving technology supplier, and is being read as adding another integrated player alongside Waymo and Tesla; the 2027 timeline is a stated target. Source: CNBC source

Li Auto Launches Self-Developed Chip Mach M100; CEO Li Xiang Defines "Embodied Intelligent Vehicle" · autonomy

Li Auto (Chinese EV maker) released its self-developed chip Mach M100 and defined the "embodied intelligent vehicle" as an agent combining four capabilities: an electric vehicle, a professional driver, an AI computer, and a life assistant. CEO Li Xiang stated that smart cars are "not yet smart enough" and set a target for the Mach VLA to match Tesla FSD V14 by year-end. This is another example of an automaker extending the "embodied intelligence" narrative from robot bodies back to the car itself; the year-end FSD comparison is a stated target requiring verification against actual software versions. Source: Sina Finance source

JD.com Establishes Embodied Intelligence Research Institute in Suqian, Continuing to "Anchor to the Physical World" · embodied

JD.com announced the establishment of an Embodied Intelligence Research Institute in Suqian, extending its stated strategy of "AI value reconstruction anchored to the physical world," as the e-commerce and logistics giant continues directing resources toward embodied AI. Source: Gasgoo source

ABB Robotics Partners with PSYONIC to Close the Dexterity Gap Using Human-Generated Data · embodied

ABB Robotics has partnered with PSYONIC, a bionic prosthetic hand company, to leverage the human manipulation data accumulated through PSYONIC's bionic prosthetic hands to improve robot fine manipulation and tactile dexterity — echoing the day's broader industry discussion that "dexterous hands are a bottleneck for embodied AI." Source: The Robot Report source

Hardware · Supply Chain

· Dexterous Hand Cost and Lifespan: The per-unit cost of dexterous hands in the industry still exceeds $6,000, with a usable lifespan of approximately six weeks — identified as one of the key bottlenecks to humanoid mass production source.

· Falling Robot Prices: Consumer-grade humanoid prices have dropped to around RMB 9,998; the main driver of cost reduction is Chinese manufacturers leveraging automotive-grade motors, reducers, and cameras from existing supply chains source.

· Cost Baseline: McKinsey estimates the current per-unit cost of humanoid prototypes at approximately $150,000–$500,000 and identifies supply-chain cost reduction as the decisive factor in bridging the commercialization gap source (WeChat, CN).

· Automotive Technology Reuse: Industry estimates suggest approximately 70% of the technology in building cars and humanoid robots (high-performance drive motors, reducers, sensors) is shared — a key reason automakers are entering the embodied AI space in force source.
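To see why the dexterous-hand figures above are flagged as a bottleneck, it helps to annualize them. The sketch below is back-of-the-envelope arithmetic under loud assumptions: continuous year-round use, full replacement at end of life, two hands per robot, and the reported "$6,000" and "six weeks" taken as exact.

```python
unit_cost_usd = 6_000  # "exceeds $6,000", treated here as a floor
lifespan_weeks = 6     # "approximately six weeks"
weeks_per_year = 52

# How many hands one slot consumes per year if each lasts six weeks.
replacements_per_year = weeks_per_year / lifespan_weeks
annual_cost_per_hand = unit_cost_usd * replacements_per_year
annual_cost_per_robot = 2 * annual_cost_per_hand  # assumes two hands per humanoid

print(f"Replacements per year: {replacements_per_year:.1f}")
print(f"Annual cost per hand:  ${annual_cost_per_hand:,.0f}")
print(f"Annual cost per robot: ${annual_cost_per_robot:,.0f}")
```

Under these assumptions, hand replacement alone would run to roughly $52,000 per hand per year, which is large relative to the consumer-grade robot prices quoted in the same list.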