Shawn

FutureX · Physical AI Daily — Issue 28 (06/15)

Today's Highlights

· XPENG issued an internal letter with He Xiaopeng personally taking over as CEO of the robotics division, announcing the company's transformation from an intelligent vehicle company into a "Physical AI company." The IRON humanoid robot is set for mass production by end of 2026, entering XPENG showrooms as a sales guide in Q1 2027, with full-stack in-house R&D maintained from chips to articulated dexterous hands.

· "Nurturing the brain" becomes capital's new focus: Hillhouse Ventures exclusively backs MoYa (Hongfire Intelligence / SoulX) — a soft companion robot entering through the sleep scenario — at the angel round; Jianzhou Robotics secures hundreds of millions of RMB led by Ant Group, DiDi, and Delian Capital, setting a new funding record for the "embodied data without robot hardware" segment.

· Hong Kong IPO pipeline expands simultaneously: EngineAI reportedly filed confidentially with HKEX at a valuation exceeding CNY 10 billion; highway-logistics autonomous driving firm Zhuganxian Technology re-filed; harmonic reducer manufacturer LAIFUAL Drive passed its listing hearing.

· University of Maryland's HumanEgo used only 30 minutes of human first-person video and zero robot data to achieve a 92.5% success rate across four real-world tasks on a bimanual robot, with zero-shot transfer to different embodiments, cameras, and scenes.

· Deployment milestone: Junpu Intelligent rolled off the first batch of G2 robots, equipped with the GO-1 embodied foundation model, completing 2,283 tasks over 8 consecutive hours in a 3C factory with zero errors; a thousand-unit order has been locked in.

I. Research Papers

HumanEgo: 30 Minutes of Human Egocentric Video, Zero-Shot Teaching of Bimanual Robots · manipulation

This work moves the robot's "data interface" from lab teleoperation to a pair of smart glasses — relying on no robot data, no robot post-training, and no internet-scale pretraining, learning deployable policies directly from minutes of human video. It is the most radical step yet along the route of learning manipulation from human video.

Zhi (Leo) Wang et al. (University of Maryland) · arXiv 2605.24934 source · Commentary: Jiqizhixin source

The core idea is "grounding representations in interaction rather than in the body": Meta Aria glasses capture egocentric video with 6-DoF head trajectories and 3D hand keypoints; for each hand and each object in the scene, a 29-dimensional Interaction-Centric Token (ICT) is computed, encoding the entity's 6D pose in a reference frame along with hand-relative-to-object pose and grasp state. The human hand is retargeted and abstracted as a "virtual two-finger gripper," and visual processing that masks out the human arm and renders a virtual gripper eliminates the appearance and kinematic gap between human hands and robot end-effectors; during occlusion, kinematic locking maintains object pose continuity. The policy uses flow matching with three dense auxiliary objectives — object motion prediction, 2D trajectory regression, and latent consistency — enabling efficient learning from as few as ~60 trajectories. Across four task categories — pick-and-place, long-horizon stacking, contact-rich bimanual coordination, and continuous rotation — the method achieves a 92.5% success rate, outperforming an equal-duration teleoperation baseline by 41 percentage points.
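
As a rough illustration of the flow-matching objective mentioned above, here is a minimal sketch of the common straight-path formulation in plain Python. It is not the HumanEgo implementation: the paper's conditioning on ICT tokens and its three auxiliary objectives are omitted, and every name below (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) is a hypothetical helper chosen for this example.

```python
def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(velocity_fn, noise, action, t):
    """Mean squared error between predicted and target velocity at time t.

    velocity_fn(x_t, t) stands in for the learned policy network
    (an assumption; the real model also receives observations).
    """
    x_t = interpolate(noise, action, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At training time the loss is averaged over random noise samples and random t; at inference time `euler_sample` turns a noise vector into an action by following the learned velocity field.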

Multilingual Instructions Expose "Stepped" Fatal Weaknesses in VLA Models · vla

Mainstream VLA multimodal backbones are theoretically cross-lingual, but this has never been systematically verified. This paper reframes "language robustness" from a static model capability into a dynamic temporal control problem during execution — and proposes a repair method that intervenes only at inference time without retraining.

Harbin Institute of Technology · Accepted to ACL 2026 Main Conference · Commentary: Roushen Algorithm source

The team translated LIBERO instructions into 10 languages including Chinese, Japanese, and Arabic to build the LIBERO-Multilingual benchmark. OpenVLA-OFT achieves an average success rate of 97.1% in English, plummeting to 50.8%–65.3% in non-English languages; on Goal tasks, Arabic drops to just 6.4%, more than 91 percentage points below English. Representation bias and text-image gradient ratio analysis revealed that language influence is not uniformly distributed but concentrated at a few "critical nodes" — 53% of multilingual failures cluster in the "navigation" phase, which requires language to localize targets. The authors propose a step-wise inference-time intervention: offline identification of the gradient-sensitive top-50% of steps, followed by online alignment of representations toward the English reference direction at those steps. OpenVLA-OFT non-English average improves by 9.5pp; pi0.5 recovers by a substantial 24.2pp (Chinese: 56.7%→80.3%); applying a uniform mean shift or selecting steps randomly is nearly ineffective or even counterproductive.
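
The two-stage idea described above (pick the most sensitive steps offline, then nudge representations toward an English reference online) can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the per-step `sensitivity` scores stand in for the text-image gradient ratio, the alignment is a simple linear interpolation toward a per-step reference vector, and all function names are hypothetical.

```python
def select_critical_steps(sensitivity, fraction=0.5):
    """Offline: return indices of the top `fraction` most sensitive steps."""
    k = max(1, int(len(sensitivity) * fraction))
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    return set(ranked[:k])


def align_toward_reference(hidden, reference, strength):
    """Move a hidden vector toward the reference by `strength` in [0, 1]."""
    return [h + strength * (r - h) for h, r in zip(hidden, reference)]


def intervene(hidden_per_step, reference_per_step, critical_steps, strength=0.5):
    """Online: shift representations only at the selected steps."""
    out = []
    for step, (h, r) in enumerate(zip(hidden_per_step, reference_per_step)):
        if step in critical_steps:
            out.append(align_toward_reference(h, r, strength))
        else:
            out.append(list(h))  # untouched steps pass through unchanged
    return out
```

The design point the paper reports is the selectivity: shifting every step uniformly, or shifting randomly chosen steps, was nearly ineffective or counterproductive, so `critical_steps` is the part that matters.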

GaussianDWM: Unifying Scene Understanding and Multimodal Generation for Autonomous Driving with 3D Gaussians · world-model

Most driving world models focus on "predicting/generating future frames" but cannot answer what objects are in the scene or where they are. This paper feeds the same 3D Gaussian representation into an LLM for understanding and uses it as a condition to drive generation, aiming to supply the explicit 3D structure that world models have been missing.

Tianchen Deng et al. (SJTU / Tsinghua / Megvii / Mach Drive) · CVPR 2026 · Code: github.com/dtc111111/GaussianDWM · Commentary: Jiqizhixin source

The method comprises three parts — World Tokenizer, scene understanding, and multimodal generation — all organized around the same set of 3D Gaussians: language features derived from CLIP and inheriting SAM's hierarchical semantics are layered onto the Gaussian primitives (a scene-level autoencoder compresses 512 dimensions to 3), then projected into the LLM embedding space via a Gaussian Projector with task-aware sampling (4,096 Gaussian tokens in the main experiment). On the generation side, a dual-conditioning design uses low-level RGB/depth to constrain texture and geometry while high-level world knowledge from the LLM supplies semantic spatial priors. On NuInteract, the average metric reaches 59.23 (vs. DriveMonkey's 52.12), with 2D/3D visual grounding mAP improving from 19.47/34.53 to 34.95/52.78.

SANA-WM: An Efficient World Model Deployable in Minutes on a Single GPU · world-model

Long-video world models typically require large models, large datasets, and multi-GPU inference — cost is the key obstacle to deploying them in embodied simulation. This work compresses "60-second 720p, camera-controllable" generation onto a single GPU, offering a cost-reduction path for scalable world model research.

Zhu Haoyi et al. (NVIDIA / University of Science and Technology of China) · arXiv 2605.15178 source · Code: NVlabs/Sana · Commentary: Lumina Embodied Intelligence source

The model generates 60-second, 720p, camera-motion-controllable video worlds from a first frame image, text, and a 6-DoF camera trajectory. The architecture uses a Hybrid Linear DiT combining Gated DeltaNet with softmax attention, maintaining long-context modeling while reducing compute and memory; a dual-branch camera controller with UCPE and Plücker ray conditioning improves trajectory-following accuracy. The team constructed a video dataset of approximately 213,000 clips with metric-scale camera pose annotations and applied a two-stage generation process with a long-video refiner to improve visual quality and temporal consistency.
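
For readers unfamiliar with Plücker ray conditioning, the sketch below shows the standard 6D Plücker parameterization of a camera ray: the unit direction d followed by the moment o × d, where o is the camera center. How SANA-WM builds its per-pixel rays and what UCPE adds are not described here, so this covers only the generic geometry; the function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return tuple(x / norm for x in v)


def plucker_ray(origin, direction):
    """Return (d, o x d): unit direction followed by the moment vector."""
    d = normalize(direction)
    m = cross(origin, d)
    return d + m
```

A useful property is that the result does not change when the origin slides along the ray, so the embedding describes the ray itself rather than a particular point on it.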

LabVLA: Enabling Robots to "Run Experiments" in the Lab · vla

Scientific AI has long suffered a "brain-hand disconnect" — capable of reading literature and designing protocols, but unable to perform pipetting, centrifugation, and other physical tasks that consume 60% of a researcher's time. This is a full-stack solution combining a data engine, training recipe, and evaluation benchmark tailored to lab scenarios, not just a model architecture change.

Zhejiang University / Shanghai AI Lab et al. · Commentary: Roushen Algorithm source

Addressing the unique challenges of high-precision lab instruments, transparent liquids, and zero-tolerance protocol workflows, the team uses the simulation data engine RoboGenesis to overcome real-data scarcity: atomic skills are defined and composed into workflows, filtered through physics validation (checking for liquid spills and protocol compliance), then structured for export across multiple robot embodiments. The model undergoes two-stage training on Qwen3-VL-4B — first pretraining with FAST action tokens to familiarize the backbone with actions, then post-training with flow matching and a DiT action expert for continuous control output, with "knowledge isolation" freezing backbone weights to preserve existing visual-language reasoning. The LabUtopia benchmark for lab tasks is released alongside.

Other papers today: the asset-conversion pathway from "test data → world model training data" pioneered by Qingyan Precision and others has drawn wide discussion (W65); this week saw a concentrated output of 8 notable works in the humanoid Loco-Manip direction, covering whole-body control and mobile manipulation (W15).

Open-source · Tools · Benchmarks: HumanEgo open-sourced its code and accumulated 230+ stars within days (humanego-ai.github.io); SANA-WM released alongside NVlabs/Sana has gained 2.5k+ net new stars since launch; GaussianDWM, LabVLA's LabUtopia benchmark, and the RoboGenesis data engine were all released simultaneously; Zhiyuan opened the AGIBOT WORLD dataset and Genie Sim 3.0 simulation platform at the BAAI Conference.

II. Funding & Transactions

Jianzhou Robotics | Multiple Consecutive Rounds | Cumulative Hundreds of Millions RMB · adjacent

This round was co-led by Ant Group, DiDi, and Delian Capital, with returning investors Shunwei Capital, BV Baidu Ventures, and Jiushi Intelligence participating as follow-on investors. It marks the first joint embodied-AI investment by Ant and DiDi and is the largest single funding to date in the "embodied data without robot hardware" segment. The company was founded in May 2025 by former Momenta senior algorithm director Chen Jianxing and senior intelligent driving product expert Zhu Yanming, who argue that "data will achieve scale-up earlier than models." The company developed the Gen DAS passive wearable data-collection device in-house and launched Gen EgoData, a full-modality dataset for embodied world models encompassing vision, force, action trajectories, physical interaction outcomes, and chain-of-thought. It counts more than 30 AI companies as partners. Alongside JD's self-built collection centers and four robot makers co-investing in Zhiyu Cornerstone, this signals that the segment is shifting from "building bodies" to a data arms race for "nurturing brains." Source: Data Walker X source

Hongfire Intelligence (SoulX / MoYa) | Angel Round | Hillhouse Ventures Sole Investor · adjacent

This is the company's first external funding round, earmarked for R&D, mass-production delivery, and supply chain buildout of the MoYa soft family-care robot. MoYa enters through the sleep scenario: it resembles a plush toy designed to be hugged while sleeping, integrating soft structure, pneumatic actuation (air bladders inflating and deflating to simulate an embrace), breathing rhythms, and emotional companionship, deliberately scoped to "hugging, breathing, and gentle patting" with no vision module for now, to protect privacy. Founder Zheng Qian holds a robotics undergraduate degree from HIT and a PhD from Zhejiang University, with early-stage experience at an exoskeleton company. In contrast to the mainstream "general humanoid + industrial" narrative, MoYa pursues a differentiated niche-scenario approach and plans to launch in September this year. Source: Hillhouse Ventures source

Poke Robotics | Angel Round | Tens of Millions USD · embodied

Poke Robotics was founded by Xu Huazhe after he left Xinghaitu, itself valued at over CNY 20 billion, and has now completed this round. Capital continues to favor the robotics segment; the question raised by Caixin is whether this reflects validated business models or companies stockpiling "ammunition" ahead of intensifying competition. Source: Caixin source

EngineAI | Proposed HK IPO | Valuation Exceeds CNY 10 Billion · humanoid ⚠️ Reported

The company reportedly filed confidentially with HKEX, working with CICC and CITIC Securities. Combined with Unitree's push toward the STAR Market, integrated humanoid manufacturers are collectively moving toward public markets. Source: Sohu source

Zhuganxian Technology | Re-filed with HKEX · autonomy

Ranked fourth among China's commercial vehicle autonomous driving solution providers, focused on heavy-truck driverless operation in highway logistics and port scenarios. A prior filing did not proceed; this is the company's renewed attempt. Source: ifeng.com source

LAIFUAL Drive | Passed HKEX Listing Hearing · hardware

Harmonic reducers are a core component of humanoid robot joints. The company posts annual revenue of CNY 260 million while still losing CNY 170 million, with Lenovo and China Development Bank Fund as shareholders. It is pushing into the capital markets on expectations of domestic substitution in the humanoid supply chain, though profitability remains a question mark. Source: Sohu source

ULTIROBOTICS | RMB Tens of Millions · industrial

A warehouse embodied AI company, jointly invested by Changshu ETDZ Juyuan and Shenzhen Institute of Science and Technology Innovation, with Houlang Capital serving as strategic financial adviser. Its in-house Ulti-Brain model uses a hierarchical architecture integrating a world model, performing long-horizon continuous spatial perception from RGBD streams, with a focus on generalization that does not depend on customer-specific scene data. Source: Gaogong Humanoid Robots source

III. Commercialization & Deployment

Junpu Intelligent Rolls Off First G2 Robots, Putting Them to Work in 3C Factories · industrial

The G2 is equipped with the GO-1 embodied foundation model and completed 2,283 tasks over 8 consecutive hours in a 3C factory with zero errors; a thousand-unit order has been locked in. Four production lines at the Wuxi base roll off approximately one unit per hour, with a monthly capacity target of 300–400 units in August and plans to cut manufacturing costs by 20% within two years. Rather than taking the "general humanoid" approach, Junpu entered first through deterministic scenarios: 3C electronics assembly, logistics sorting, and industrial handling. With the industry still largely at the demonstration stage, continuous zero-error data from a real production line is more compelling than lab footage. Source: Zhidian Chaijie source
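
To put the figures above in perspective, the back-of-the-envelope calculation below converts them into per-task and per-day rates. It assumes that "one unit per hour" is the combined output of the four lines and that a month has 30 production days; neither is stated in the source, and the helper name is hypothetical.

```python
TASKS = 2283       # tasks completed in the factory run (from the report)
SHIFT_HOURS = 8    # length of the run in hours (from the report)

tasks_per_hour = TASKS / SHIFT_HOURS            # about 285 tasks per hour
seconds_per_task = SHIFT_HOURS * 3600 / TASKS   # about 12.6 seconds per task


def hours_per_day(monthly_units, units_per_hour=1.0, days=30):
    """Daily production hours needed to hit a monthly unit target.

    units_per_hour=1.0 and days=30 are assumptions, not reported figures.
    """
    return monthly_units / units_per_hour / days


low = hours_per_day(300)    # 10 hours per day for 300 units
high = hours_per_day(400)   # about 13.3 hours per day for 400 units
```

Under these assumptions the 300–400 unit target corresponds to roughly 10 to 13 production hours per day at the stated roll-off rate.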

Pony.ai Seventh-Generation Robotaxi Debuts at Chongqing Auto Show, Cost Falls Below CNY 230,000 · autonomy

The seventh-generation vehicle brings total vehicle cost below CNY 230,000, a critical step toward a viable unit economics model for scaled operations, as the company simultaneously signals accelerated overseas expansion. The cost curve is widely regarded as the decisive factor in the Robotaxi race, and this reduction aims to narrow the per-unit economic gap with the single-vehicle intelligence approach. Source: D1EV source

Waymo Acquires Apple's Former Arizona Autonomous Driving Test Site for $220 Million · autonomy

Waymo acquired Apple's former autonomous driving test facility in Arizona for approximately $220 million. Against the backdrop of its expansion into new cities such as Nashville and the development of its sixth-generation system, building proprietary testing and operational infrastructure has become a necessary companion to fleet scale-up. Source: MSN source

Domestic Robots Enter BMW's Shenyang Production Line for Validation · industrial ⚠️ Validation Stage

BMW's production line in the old industrial base of northeast China has introduced domestic humanoid robots for production floor trials. The reporting also candidly notes that repeated validation on a live line is still required — whether robots can reliably identify parts and pick up tools, and whether they can maintain safe distances in human-robot collaboration while sustaining prolonged high-intensity operation. This is an automotive plant pilot, not yet a mass-production milestone. Source: Sina Finance source

UBTECH U1 Ultra-Bionic Companion Robot Approaches 4,000 Pre-orders · humanoid

Pre-orders on JD.com are approaching 4,000 units after 10 days (up from a previously reported 2,700 in 6 days and still rising), with a CNY 3,000 deposit and adult consumers as the target. The robot features a "companion-raising" emotional AI model and is positioned as a family companion rather than a productivity tool. Note that this demand is primarily consumer novelty-seeking and is separate from industrial mass-production capability. Source: Pandaily source

Amazon Expands Warehouse Automation in India · industrial

Amazon announced further expansion of warehouse robotics and automation deployment in India, continuing the robotization of its global fulfillment network and signaling an acceleration of warehouse automation in emerging markets. Source: Tech in Asia source

IV. Industry Developments

XPENG: He Xiaopeng Takes Personal Command of Robotics, Announces Transformation into "Physical AI Company" · humanoid

The June 10 internal letter carries significant weight, stating that "XPENG Robotics officially enters the eve of mass production and commercialization" and that "this marks XPENG's transformation from an intelligent vehicle company to a Physical AI company." He Xiaopeng takes on the role of CEO of the robotics division, pulling in group resources to fully replicate the automotive business's supply chain, manufacturing, and quality systems in robotics. The timeline calls for the IRON humanoid robot to enter mass production by end of 2026 and appear in XPENG showrooms as a sales guide in Q1 2027, with full-stack in-house R&D maintained from chips and OS to joints and dexterous hands, analogous to BYD's use of vertical integration in battery manufacturing to compress costs. A risk factor: the core robotics team went through departures in late May, making a 200-day sprint to mass production more challenging. XPENG is another automaker, following BYD's move into humanoids, making a large-scale transfer of in-house vehicle capabilities to the embodied AI segment. Source: Zhidian Chaijie source

Zhiyuan Releases Embodied Foundation Model GO-2, Leading with "Action Chain-of-Thought" · world-model ⚠️ Vendor Claims

At this week's BAAI Conference, Zhiyuan unveiled GO-2, the next-generation embodied foundation model. Its core is ACoT-VLA (Action Chain-of-Thought): conventional VLAs must "observe scene → generate language description → map to action," with the intermediate language-translation step introducing information loss; GO-2 enables reasoning to occur directly in action space, outputting structured, kinematically feasible coarse-grained action intent sequences via two action reasoners (explicit and implicit), with asynchronous coarse-fine execution for real-time correction and a full-lifecycle closed loop for continuous self-optimization after deployment. Zhiyuan claims a 25% improvement in long-horizon task success rate over the baseline (vendor-stated figure, awaiting third-party replication). Also unveiled: the Elf G2 industrial robot, Genie Sim 3.0 simulation platform, AGIBOT WORLD open-source dataset, an ICRA million-dollar competition, and the "Yuansheng" ecosystem plan with a five-year CNY 2 billion commitment — presenting a combined foundation model + simulation + open-source data + developer ecosystem strategy. Source: Guijiyizhi source

Kunlun Tech Unveils World Model Matrix-Game 3.5 · world-model

At the BAAI Conference, Kunlun Tech's Tiangong team unveiled the latest progress on world model Matrix-Game 3.5. World models have become one of the most discussed topics at this year's conference, with researchers from embodied AI, robot control, game engines, and physical AI infrastructure each presenting their technical approaches; the debate over competing paradigms continues. Source: Kunlun Tech Group source

GM CEO: Prioritizing Consumer Vehicle Autonomy Now, Laying Groundwork for Ride-Hailing Later · autonomy ⚠️ Statement

Following the shutdown of Cruise, General Motors is shifting focus to embedding autonomous driving as a consumer vehicle feature, then using that as a foundation for future mobility services — effectively staying in the autonomous driving space via "vehicle-side autonomous capability" rather than an independent Robotaxi fleet. This contrasts with the approach of most Chinese players who are deploying Robotaxi fleets directly. Source: DoNews source

Humanoid Robots to Get "ID Cards": Full Lifecycle Management Standard Issued · humanoid

According to the Humanoid Robot and Embodied Intelligence Standardization Technical Committee under the Ministry of Industry and Information Technology, the newly issued "Humanoid Robot Full Lifecycle Management Specification" assigns every humanoid robot an identity code (its "ID card"); approximately 28,000 units have already received codes. This is a foundational infrastructure step as the industry moves from "building them" to "managing them traceably." Source: Electronic and Electrical Metrology and Testing source

Hardware · Supply Chain: Linkerbot claims to be the world's only company achieving mass production of high-DoF dexterous hands at the ten-thousand-unit scale, holding approximately 80% of the global high-DoF dexterous hand market, with monthly capacity ramping from over 4,000 units toward the ten-thousand-unit level (W92/W95, ⚠️ vendor claims); Hamm Electronics' 8mm micro-motors are now shipping in volume with capacity scaling up, addressing incremental demand for dexterous hands as "consumables" in data collection and teleoperation (W93); Unitree is pursuing a QDD quasi-direct-drive approach (large motor + small gearbox) to replace harmonic reducers and cut costs, with joint actuators accounting for approximately 50%–70% of the robot BOM (W41); engineering plastics — which can reduce density by 50%–70% compared to metal — are accelerating adoption in humanoid structural components (W42).
