Shawn

Posted on Jun 20

FutureX · Physical AI Daily — Issue 34 (06/21)

#ai #robotics #machinelearning #research

Today's Highlights

· Hyundai Motor completes full acquisition of Boston Dynamics (SoftBank exits at ~$325 million); production-version Atlas is scheduled to enter Hyundai's U.S. factory in 2028, with planned annual capacity of 30,000 units.

· DaxAI Robotics (Chinese embodied-AI startup) closes four funding rounds in its first year, with Jinshajiang Ventures (Chinese early-stage VC) leading the Pre-A; the company reports half-year orders exceeding 300 million yuan and is pushing deployment of thousands of units, with Zhu Xiaohu (prominent Chinese VC) among the backers betting on "productivity" embodied AI.

· HKU's World Engine brings autonomous driving into the "post-training" era: reinforcement learning inside a reconstructed closed-loop simulation world reduces collision rate by ~45.5% and achieves 200 km without takeover (simulation metric).

· Video world models address gaps in "spatial" and "memory" capabilities: CameraSquad (SIGGRAPH 2026) generates multi-view consistent video in a single parallel pass; Tencent × Tsinghua release open-source memory evaluation benchmark MBench.

· Baidu Apollo Go obtains Swiss L4 commercial operating permit (~80 km² covering St. Gallen and two Appenzell cantons), marking the first large-scale commercial operating approval in Europe for a Chinese autonomous driving company; public service is scheduled for 2027.

I. Research Papers

CameraSquad: Camera-Controllable Video Generation with Multi-View Consistency · world-model

By decoupling "what the world looks like" from "where you're looking from" at the attention level, multiple camera trajectories can exchange information in a single parallel inference pass — directly addressing one of the most awkward gaps on the road from video world models to spatial intelligence: changing viewpoint causes subject "face-swapping," and serial inference across different trajectories produces misaligned results.

Team led by Gao Lin (University of Chinese Academy of Sciences / Cardiff University / HKUST / Kuaishou Kling) · Accepted at SIGGRAPH 2026 · Coverage: Jiqizhixin (WeChat, CN)

The method is built on the Wan2.2 video diffusion model, decomposing the original 3D self-attention into a Content-Attention module (handling content reference) and a Camera-Attention module (encoding camera intrinsics and extrinsics via PRoPE), then using "dual-mode cross-view attention" (CVA-α for appearance consistency, CVA-β for geometric alignment) to make tokens from different viewpoints at the same timestep mutually visible. Multi-view consistent outputs are then passed through DA3 for depth estimation and back-projected into dynamic 3D point clouds, providing supervision for 4D reconstruction. On WebVid / HumanVid, the approach achieves leading camera control accuracy with rotation error as low as 1.42°, while image quality metrics (FID, CLIP-V) show no degradation from the added spatial control.
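The back-projection step mentioned above can be illustrated with a standard pinhole camera model. This is a minimal sketch under stated assumptions, not CameraSquad's code: the function name `backproject`, the camera-to-world convention for `R` and `t`, and the toy depth map are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]],
    t: Sequence[float],
) -> List[Point]:
    """Lift a depth map to world-space 3D points (pinhole model).

    Assumptions (hypothetical, for illustration only):
    - depth[v][u] is the distance along the camera z-axis; values <= 0 are invalid
    - R (3x3) and t (length 3) map camera coordinates to world coordinates
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels with no valid depth
            # Invert the projection u = fx * x / z + cx, v = fy * y / z + cy
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Apply the camera-to-world rigid transform: R @ cam + t
            world = tuple(
                sum(R[i][j] * cam[j] for j in range(3)) + t[i] for i in range(3)
            )
            points.append(world)
    return points


# Toy 2x2 depth map with one invalid pixel, identity rotation, 1 m shift along z
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_depth = [[2.0, 0.0], [2.0, 2.0]]
cloud = backproject(toy_depth, 1.0, 1.0, 0.0, 0.0, identity, [0.0, 0.0, 1.0])
```

Running this per frame and per view, then stacking the results over time, would give the kind of dynamic point cloud the paragraph describes; how the actual pipeline fuses views is not covered by the source.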

World Engine: A Closed-Loop "Post-Training" Framework for Autonomous Driving · autonomy

End-to-end driving models are now pretrained on massive datasets, but the long-tail events that truly define safety boundaries are naturally rare in real-world driving logs. This work chains together "find failures → reconstruct an interactive world → constrained reinforcement learning" into a closed loop — mirroring the paradigm shift from pretraining to post-training in large language models.

Team led by Li Hongyang (University of Hong Kong / Huawei / Shanghai Artificial Intelligence Laboratory / Tsinghua University, with Li Shengbo) · Coverage: Quantum Bit (WeChat, CN)

The system has three components: SimEngine reconstructs re-renderable, interactive simulation worlds from multi-pass logs using 3D Gaussian Splatting (with depth and normal supervision, LiDAR exposure alignment, and per-camera color correction); Behaviour World Model uses a diffusion model to generalize from known failure cases into a set of similar hard scenarios (e.g., lead vehicle decelerates → lane-change overtake); the post-training stage applies behaviour-regularized reinforcement learning with rewards covering collision avoidance, drivable area, traffic efficiency, and ride comfort, plus KL regularization against the pretrained policy to prevent catastrophic forgetting. In closed-loop simulation evaluation, the collision rate drops by ~45.5% and the policy drives 200 km without a takeover.
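To make the post-training objective concrete, here is a rough sketch of a reward that sums several weighted terms and subtracts a KL penalty against a reference (pretrained) policy. Everything specific is an assumption for illustration: the term names, the weights, `beta`, and the use of small discrete action distributions. The source does not give World Engine's actual formulation.

```python
import math
from typing import Dict, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def shaped_reward(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-step reward terms (names are hypothetical)."""
    return sum(weights[name] * terms[name] for name in weights)


def regularized_objective(
    terms: Dict[str, float],
    weights: Dict[str, float],
    policy: Sequence[float],
    reference: Sequence[float],
    beta: float,
) -> float:
    """Task reward minus beta * KL(policy || reference)."""
    return shaped_reward(terms, weights) - beta * kl_divergence(policy, reference)


# Hypothetical numbers: a collision step, slightly efficient, slightly uncomfortable
weights = {"collision": 10.0, "drivable_area": 5.0, "efficiency": 1.0, "comfort": 0.5}
terms = {"collision": -1.0, "drivable_area": 0.0, "efficiency": 0.4, "comfort": -0.2}
pretrained = [0.7, 0.2, 0.1]   # reference action distribution
finetuned = [0.5, 0.3, 0.2]    # current policy's action distribution
score = regularized_objective(terms, weights, finetuned, pretrained, beta=0.1)
```

The KL term grows as the fine-tuned policy drifts from the pretrained one, which is the mechanism the paragraph credits with limiting catastrophic forgetting; a larger `beta` keeps the policy closer to its starting point at the cost of slower adaptation.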

ERVLA: Embodied Chain-of-Thought Must Be "Accurate," Not "Verbose" · vla

Vision-Language-Action (VLA) models inherit large-scale visual-semantic priors from VLMs, but stronger perception and broader semantic coverage do not automatically translate into better action generation — Tsinghua and Xiaomi identify the mismatch between "symbolic reasoning" and "continuous action" as the core problem.

Tsinghua University / Xiaomi · Coverage: Embodied Intelligence Observer (WeChat, CN)

The work reconstructs the reasoning-to-action mapping in VLAs, emphasizing that chain-of-thought should serve action precision rather than verbose description — alleviating the disconnect between VLM-trained symbolic outputs and the continuous control demands of real hardware.

ThinkingVLA: "Imagining" the Next Frame While Acting · vla

Traditional VLAs behave like reflex-driven apprentices — given an instruction, they output an action directly, with no anticipation or post-hoc verification. This breaks down quickly on multi-step tasks requiring spatial reasoning.

Fudan University · Coverage: Humanoid Lab (WeChat, CN)

The method has the model predict the next visual frame as an intermediate representation during execution, embedding "imagined future observations" into action generation — improving success rates on long-horizon tasks.

Motion-Focused Latent Actions: Cross-Embodiment VLA Trained on 50 Trajectories · vla

High-quality robot manipulation data is extremely expensive to collect, while first-person human video is virtually unlimited. Extracting action priors from human video and aligning them with minimal real-robot data is one of the leading approaches to reducing data costs.

Coverage: Embodied Algorithm (WeChat, CN)

The framework extracts general "motion-focused latent action" priors from large-scale unannotated first-person human video, then uses only ~50 robot demonstration trajectories to bring the model to near state-of-the-art performance on a new robot platform.

RT-VLA: Dual-Branch Decoupling + Multi-Level Distillation, 44× Speedup for End-to-End Driving · autonomy

Deploying large driving models on-vehicle makes inference latency an unavoidable bottleneck. This work uses distillation to buy speed, aiming to retain the teacher model's capabilities while eliminating the coupling overhead of the reasoning-explanation module.

CMU · Coverage: Shenlv AI (WeChat, CN)

RT-VLA distills a lightweight student model from the frozen large teacher model SimLingo, with a dual-branch runtime architecture and hierarchical distillation training scheme, claiming ~44× speedup in end-to-end inference.
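For readers unfamiliar with teacher-student distillation, the generic building block is a loss that pulls the student's output distribution toward the frozen teacher's. The sketch below shows only that textbook single-level logit loss with temperature softening; it is not RT-VLA's hierarchical scheme, which the source does not detail, and the function names and toy logits are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # frozen teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy logits: the loss is zero when the student matches the teacher
teacher = [1.0, 2.0, 3.0]
matched = distillation_loss([1.0, 2.0, 3.0], teacher)
mismatched = distillation_loss([0.0, 0.0, 0.0], teacher)
```

Only the student's parameters would receive gradients from this loss; the teacher stays frozen, consistent with the description above.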

Other papers today: TopoRetarget, topology-preserving dexterous hand motion retargeting on the Wuji Hand for teleoperation and demonstration data collection (coverage: WeChat, CN).

Open Source · Tools · Benchmarks

· MBench (Tencent × Tsinghua): The first benchmark specifically evaluating "long-term memory" in video/world models; 1,040 cases split across entity, environment, and causal dimensions with 12 sub-dimensions. Introduces Trigger-Conditioned Scoring to penalize "cheating via static content generation." Evaluation of 14 SOTA models shows spatial geometry and causal evolution are universal bottlenecks, and visual fidelity ≠ memory stability. Dataset, evaluation code, live leaderboard, and technical report are fully open-sourced (source: WeChat, CN).

· NVIDIA JetPack 7.2: Robotics edge development stack update that adds NemoClaw support, Yocto Project support, and AGX Orin 32GB Super Mode.

· Sharpa Wave (Wuji Hand): An open dexterous hand benchmarking platform offering interactive in-browser URDF visualization, with real-world test data and side-by-side spec comparisons for 16+ mainstream dexterous hands (source: WeChat, CN).

II. Funding & Deals

DaxAI Robotics (大咖机器人) ｜ Pre-A ｜ Hundreds of millions of yuan · embodied ⚠️ Order figures are company-disclosed

Led by Jinshajiang Ventures (Chinese early-stage VC), with participation from Yunshi Capital, Shengshi Investment, Lingxin Qiaoshou / Lingcheng Future, and CVC arms of listed companies. Founded in May 2025, the team is composed of former core members of JD.com's autonomous vehicle unit and PhD graduates from the University of Science and Technology of China's Special Class for the Gifted Young. The company pursues a dual-engine approach combining the DaxBrain-WM embodied world model with a general-purpose hardware platform, targeting pan-retail, pan-logistics, and eldercare. Its bimanual dexterous humanoid starts at 69,800 yuan, and the company has also unveiled a ton-class heavy-load robot horse with a 1,000 kg payload capacity. The company says it completed four rounds in its first year, with half-year orders exceeding 300 million yuan and thousands of units in deployment. According to IT Juzi data, China's embodied AI / robotics sector attracted ~46 billion yuan across 288 deals in the first half of the year (disclosed amounts, not revenue); capital is shifting from broad-based bets toward the few teams with real orders. Source: Science and Tech Innovation Board Daily (WeChat, CN)

Daimon Robotics (戴盟机器人) ｜ Series A ｜ 100+ million yuan · embodied

The investors are Inovance Industrial Investment, the industrial fund of Inovance Technology (Chinese industrial automation leader), and China Telecom. Daimon focuses on visuotactile sensing and dexterous hands, aiming to give robots an "accurate touch" tactile capability, which is widely seen as a hardware bottleneck for deploying embodied manipulation. Source: Greater Bay Area Common Home (WeChat, CN)

ZuzuZoos ｜ Pre-A ｜ Tens of millions of yuan · adjacent

An AI-native consumer tech toy brand betting on emotionally interactive companion robots that can talk and keep users company; the team comes from DJI, Pop Mart, and Zhipu AI (Chinese AI lab). Companion and emotional robotics has seen a string of small funding rounds recently, as capital opens a consumer-facing emotional-companionship track alongside large models and industrial robots. Source: blog.csdn.net

Luming Robotics (鹿明机器人) ｜ A1 + A2 ｜ ~1 billion yuan cumulative · humanoid

Founder Yu Chao was previously head of embodied robotics at Dreame Technology (Chinese consumer appliance maker) and has a Tsinghua background. The company has closed two rounds (A1 and A2), each in the hundreds of millions of yuan, for cumulative funding of close to 1 billion yuan, making it another "Dreame alumni" embodied team to enter the top funding tier. Source: University Learning Hub (WeChat, CN)

III. Commercialization & Deployment

Baidu Apollo Go Receives Swiss L4 Commercial Operating Permit · autonomy

The permit covers ~80 km² across St. Gallen, Appenzell Ausserrhoden, and Appenzell Innerrhoden in eastern Switzerland, and the service will operate under the AmiGo brand in partnership with Swiss Post subsidiary PostBus. This marks the first time a Chinese autonomous driving company has obtained a large-scale commercial operating permit in Europe. The project began safety-driver road testing on June 1; full public commercial service is expected to launch in 2027. Following WeRide's Zurich deployment in partnership with Uber, Europe now has a second entry channel for Chinese robotaxi operators. Source: MSN

Zhiyuan Robotics (智元, Chinese humanoid startup) Announces "Pioneer Deployments" Across Seven Scenarios · embodied ⚠️ Revenue figure is a stated target

Zhiyuan says it has achieved pioneer deployments across seven scenarios, including precision part loading/unloading, industrial handling, logistics sorting, in-store navigation, and food-service guidance, and calls 2026 its "Year of Deployment." The company has also announced a target of 10 billion yuan in revenue by 2027; this figure is a strategic aspiration and should be read separately from the scale of actual deployments to date. Source: Dongjing (WeChat, CN)

Unitree Robotics (宇树科技, Chinese humanoid maker) Launches Consumer Trial and Rental Program · humanoid

Unitree Robotics has launched a robot trial program via Alipay's Sesame Credit, offering day-rental to consumers (~3,299 yuan/day). This is a consumer touchpoint for trials rather than scaled delivery: a step by a leading hardware maker to test consumer-market awareness following its IPO approval. Source: Suiyu Diandian (WeChat, CN)

IV. Industry News

Hyundai Motor Completes Full Acquisition of Boston Dynamics · humanoid

SoftBank exits for ~$325 million, giving Hyundai Motor Group 100% control of Boston Dynamics. The accompanying product roadmap moves the production-version Atlas from its CES 2026 debut into deployment: it is scheduled to enter Hyundai's Metaplant factory in Georgia, USA in 2028, starting with parts sequencing, sorting, and line-side material handling, then expanding to more complex tasks by 2030. Hyundai plans to build a robot production line with annual capacity of ~30,000 units, backed by the company's $26 billion U.S. investment commitment. This marks Atlas's formal transition from research showcase to mass-production tool on Hyundai's own factory floor. Source: Autohome

Boston Dynamics to Test Google AI Models on Spot · embodied

Concurrent with the full-acquisition announcement, Boston Dynamics said it will test Google's AI models on the Spot quadruped robot, exploring how hardware platforms can be connected to stronger general-purpose reasoning and language capabilities. Source: MSN

Zhiyuan Robotics Launches D1 Series Quadruped Robots · embodied

Zhiyuan has introduced three quadruped/wheeled-leg robots (D1 Ultra, D1 Max, and D1 MaxPro) targeting industrial emergency response, security, inspection, education, and entertainment. The D1 Max weighs ~41 kg, features 12+4 degrees of freedom, IP67 protection, and hot-swappable dual batteries; the MaxPro adds a front-mounted 96-line LiDAR. Amid the fanfare around humanoid platforms, quadrupeds remain the more immediately deployable platform for specialized field operations. Source: Robot Moment (WeChat, CN)

Physical Intelligence Adopts NVIDIA Cosmos 3 for Policy Evaluation · world-model

Physical Intelligence (PI) is using NVIDIA's Cosmos world model for policy evaluation, replacing some real-robot testing with world-model generation and prediction as a new pathway to reducing VLA validation costs. The move also reinforces Cosmos's positioning as a "world model as evaluation engine," with endorsement from another top-tier user. Source: Embodied Era (WeChat, CN)

Honor Debuts Its First Humanoid Robot at MWC 2026 · humanoid

Honor (Chinese smartphone maker) showcased its first embodied-intelligence humanoid robot at MWC 2026, becoming the first smartphone company to enter the consumer humanoid space; the unit features an Orbbec (Chinese 3D vision sensor maker) Gemini 330 series binocular 3D camera. Smartphone companies are entering the market with supply chain and distribution advantages, broadening the competitive landscape for consumer humanoids, though the current stage remains an expo debut. Source: Shenzhen Artificial Intelligence Industry Association (WeChat, CN)

XPENG Refutes Rumor of Mandatory LiDAR Requirement for L3/L4 National Standard · autonomy

In response to a circulating claim that China's autonomous driving national standard would mandate LiDAR installation, XPENG officially stated that the claim is inaccurate and urged the public to refer to current regulations and industry realities. The sensor-stack debate (LiDAR vs. camera-only) continues to generate noise, but no mandatory regulatory requirement of the kind rumored has been issued. Source: PConline News

Continuing storylines: world models remain this week's dominant theme for funding and buzz (Jijia Vision's 3.5 billion yuan in three months, Aether AI's multiple conversations with Huang Biwei — both previously reported; no new milestones today). NVIDIA's "let AI research robotics itself" (ENPIRE) was previously reported; today saw follow-on tool releases including SpatialClaw and NemoClaw.

Hardware · Supply Chain

· Mingzhi Electric (鸣志电器, Chinese motor manufacturer): Robot motor sales up more than 200% year-on-year; dexterous hand degrees of freedom are increasing from 11 to 22, directly driving both demand for and consistency requirements on small motors, drivers, and feedback controllers (source: WeChat, CN).

· Xingdong Era (星动纪元, Chinese robotics startup): Launches a new dexterous hand, targeting the underappreciated hardware capability gap created by the industry's "heavy model, light hardware" bias (source: WeChat, CN).

· Hesai Technology (禾赛科技, Chinese LiDAR maker): Confirms it will supply LiDAR for Mercedes-Benz L3-level vehicles and will manufacture in Thailand, extending automotive-grade LiDAR production capacity internationally.

· Moore Threads (摩尔线程, Chinese GPU maker): Discloses full-featured GPU roadmap; next-generation "Huagang" architecture chip emerges, targeting compute for physical AI and robotics.