Shawn

Posted on Jun 18

FutureX · Physical AI Daily — Issue 32 (06/19)

#ai #robotics #machinelearning #research

Today's Highlights

· Waymo has filed a recall with NHTSA covering 3,871 robotaxis after its fifth-generation autonomous driving software was found capable of entering closed highway construction zones; the company has suspended highway operations and rerouted vehicles to surface roads.

· Capital continues to pour into world models and embodied AI "brains": Manifold AI (Chinese world-model startup), founded one year ago, has reached unicorn status with cumulative Pre-A funding of nearly ¥1 billion; Yingsu and U.S.-based Odyssey ($310 million) also closed rounds the same day.

· More embodied AI brain fundraising: Noematrix (Chinese embodied AI startup) secured a new round of hundreds of millions of yuan led by Wuxi Data Group; Aether AI, the causal world-model company founded by UCSD's Prof. Bi-wei Huang, raised a $20 million seed round led by Matrix Partners China.

· Nvidia's ENPIRE framework lets 8 AI coding agents autonomously control a robot fleet — writing code and running training simultaneously — achieving up to 99% success on high-precision tasks such as GPU installation.

· On the deployment side: the Chengdu Humanoid Robot Innovation Center signed a 5,000-unit procurement order with central state-owned enterprises; UBTECH's consumer humanoid U1 approached 5,000 pre-orders within 17 days.

I. Research Papers

Guava: A General "Tool-Calling" Framework for Embodied Manipulation · vla

An alternative to end-to-end VLAs: treat a powerful multimodal large language model as a "commander" and attach external perception, planning, and control modules — systematically addressing what scaffolding is needed to make general reasoning models truly capable of manipulation. The day's most-discussed paper in the community.

Haowen Liu et al. · arXiv 2606.18363 source

The authors systematically search the design space across three dimensions — agent workflow, action space, and observation space — and distill three key ingredients for effective embodied agents (iterative reasoning, appropriate action abstraction, structured observations), demonstrating that this harness can unlock manipulation capabilities across a range of reasoning models without per-embodiment end-to-end policy training. Community traction: HF↑22.
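The three ingredients named above can be pictured as a small control loop. The sketch below is not Guava's code: `ToyWorld`, `Observation`, `scripted_commander`, and `run_agent` are hypothetical names, and a hand-written rule stands in for the multimodal LLM. It only illustrates iterative reasoning (re-observe after every call), action abstraction (named tools rather than joint commands), and structured observations (a typed record rather than raw pixels).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """Structured observation: a typed record instead of raw pixels."""
    gripper_x: int
    object_x: int
    holding: bool


class ToyWorld:
    """Hypothetical 1-D tabletop used only for this illustration."""

    def __init__(self, gripper_x: int = 0, object_x: int = 3) -> None:
        self.gripper_x = gripper_x
        self.object_x = object_x
        self.holding = False

    def observe(self) -> Observation:
        return Observation(self.gripper_x, self.object_x, self.holding)

    # Abstract actions ("tools") exposed to the commander.
    def move_to(self, x: int) -> None:
        self.gripper_x = x

    def grasp(self) -> None:
        # The grasp only succeeds when the gripper is over the object.
        if self.gripper_x == self.object_x:
            self.holding = True


ToolCall = Tuple[str, Dict[str, int]]


def scripted_commander(obs: Observation) -> ToolCall:
    """Stand-in for the reasoning model: maps an observation to one tool call."""
    if obs.holding:
        return ("done", {})
    if obs.gripper_x != obs.object_x:
        return ("move_to", {"x": obs.object_x})
    return ("grasp", {})


def run_agent(world: ToyWorld,
              commander: Callable[[Observation], ToolCall],
              max_steps: int = 10) -> List[str]:
    """Iterative loop: observe, pick a tool, execute it, then observe again."""
    tools = {"move_to": world.move_to, "grasp": world.grasp}
    trace: List[str] = []
    for _ in range(max_steps):
        name, args = commander(world.observe())
        trace.append(name)
        if name == "done":
            break
        tools[name](**args)
    return trace


trace = run_agent(ToyWorld(), scripted_commander)
print(trace)  # ['move_to', 'grasp', 'done']
```

Swapping `scripted_commander` for a call to a real reasoning model would leave the loop unchanged, which is the point of the harness framing: the embodiment-specific parts live in the tools and the observation record.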

Do as I Do: Turning Everyday Human Video into Dexterous Hand Manipulation Data · manipulation

Scarce dexterous-hand data is a bottleneck for scaled manipulation. This paper proposes leveraging large quantities of monocular RGB human video directly, overcoming two hurdles: hand-object interaction estimation and the morphological gap between human hands and robot end-effectors. Co-authors include Pieter Abbeel and Mahi Shafiullah — a route worth watching.

Bhawna Paliwal et al. (UC Berkeley / NYU et al.) · arXiv 2606.19333 source · Commentary: 超智前夜 source (WeChat, CN)

DO AS I DO reconstructs hand-object interactions from first- and third-person in-the-wild video, then retargets those interaction estimates onto multi-fingered robot dexterous hands to generate executable manipulation trajectories — converting internet-scale human manipulation video into trainable data and reducing the cost of real-robot data collection.

HALOMI: Learning Active Perception for Humanoid Loco-Manipulation from Human Demonstrations · locomotion

Human demonstrations are easy to collect at scale and naturally encode hand-eye coordination, but direct transfer to humanoids requires fragile world-frame tracking controllers. This paper unifies "where to look, where to walk, and where to grasp" through active perception, targeting a scalable data source for whole-body mobile manipulation.

Zehui Zhao et al. · arXiv 2606.18772 source

HALOMI augments the Universal Manipulation Interface (UMI) with first-person perception, collecting ego-view and wrist-view observations along with head-hand trajectories at scale, and proposes a manifold-constrained approach to mitigate the distribution shift between human and humanoid ego observations and action execution — improving robustness on out-of-distribution targets.

PAIWorld: A 3D-Consistent World Foundation Model for Manipulation · world-model

Most existing world foundation models are single-view and lack the multi-view 3D consistency that robot manipulation requires; naively concatenating view tokens produces cross-view drift, depth inconsistency, and texture misalignment. This paper attributes the problem to the absence of explicit cross-view communication and 3D geometric priors, and addresses both simultaneously.

Yuhang Huang et al. · arXiv 2606.18375 source

PAIWorld introduces explicit inter-view information exchange and 3D geometric priors into a diffusion Transformer world model, targeting the egocentric, eye-to-hand, and wrist-camera multi-view setups common in robotics to recover cross-view object consistency and depth alignment. Community traction: HF↑3.

DREAM-Chunk: Adding "Reactivity" to Action Chunks with a Latent World Model · vla

Action chunking is now a standard VLA interface, but once a chunk is committed, open-loop execution is brittle under stochastic dynamics, hardware error, and partial observability. This paper trades test-time compute for robustness without requiring policy fine-tuning.

Wenxi Chen et al. · arXiv 2606.18589 source

DREAM-Chunk attaches a lightweight latent world model to a chunk-based policy: at test time it samples multiple candidate chunks, rolls each out in latent space to predict the future, then selects the chunk whose predicted state best matches the actual rollout — using additional inference compute to cover a range of possible futures.
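A minimal sketch of the sample, imagine, and select pattern described above, under stated assumptions. It is not the authors' implementation: the additive toy dynamics, the goal-distance scoring rule, and every name here (`rollout_latent`, `select_chunk`, `sample_chunks`, `toy_step`, `toy_score`, `GOAL`) are hypothetical, and the paper's actual selection criterion may differ.

```python
import random
from typing import Callable, List

Latent = List[float]
Action = List[float]
Chunk = List[Action]


def rollout_latent(z0: Latent, chunk: Chunk,
                   step: Callable[[Latent, Action], Latent]) -> Latent:
    """Imagine executing a whole chunk open-loop inside the latent model."""
    z = list(z0)
    for action in chunk:
        z = step(z, action)
    return z


def select_chunk(z0: Latent, candidates: List[Chunk],
                 step: Callable[[Latent, Action], Latent],
                 score: Callable[[Latent], float]) -> int:
    """Return the index of the candidate whose imagined outcome scores highest."""
    scores = [score(rollout_latent(z0, chunk, step)) for chunk in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy stand-ins (assumptions): additive dynamics and a goal-distance score.
def toy_step(z: Latent, a: Action) -> Latent:
    return [zi + ai for zi, ai in zip(z, a)]


GOAL = [1.0, 1.0]


def toy_score(z: Latent) -> float:
    return -sum((zi - gi) ** 2 for zi, gi in zip(z, GOAL))


def sample_chunks(rng: random.Random, num_chunks: int, horizon: int) -> List[Chunk]:
    """Stand-in for drawing candidate chunks from a chunk-based policy."""
    return [[[rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
             for _ in range(horizon)]
            for _ in range(num_chunks)]


rng = random.Random(0)
candidates = sample_chunks(rng, num_chunks=8, horizon=4)
best = select_chunk([0.0, 0.0], candidates, toy_step, toy_score)
print("selected candidate index:", best)
```

The extra cost is one latent rollout per candidate, so inference compute grows linearly with the number of sampled chunks while the policy weights stay untouched.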

MolmoMotion: Language-Conditioned 3D Point Trajectory Prediction · world-model

Formalizing "how objects move" as 3D point trajectory prediction in world coordinates — category-agnostic, view-stable, compact, and directly useful for downstream planning — with a million-scale data corpus released alongside.

Jianing Zhang et al. · arXiv 2606.18558 source

Given a short visual history, a set of 3D query points on an object, and a natural-language goal description, the model predicts each point's future 3D trajectory. The authors release MolmoMotion-1M — a large-scale corpus of 3D point trajectories annotated from 1.16 million unconstrained videos, with action descriptions and object anchors — forming a complete technical stack for this task. Community traction: HF↑5.

Act2Answer: How Much Commonsense and World Knowledge Survives VLA Fine-Tuning? · benchmark

VLAs are typically fine-tuned from strong VLMs on robot data, but how much commonsense and factual knowledge survives adaptation has remained unclear — knowledge loss and poor low-level control generalization tend to confound each other. This paper provides a decoupled measurement protocol.

Nikita Kachaev et al. · arXiv 2606.19297 source

Act2Answer reformats VLM knowledge benchmarks into "answer by acting": each question becomes a tabletop episode in which the agent selects among candidate answers via a single object-placement action, yielding a control-confound-reduced, action-grounded success rate — used to evaluate VLAs across a range of commonsense and world-knowledge scenarios.
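The scoring idea can be sketched in a few lines. This is an illustrative reading of "answer by acting", not the benchmark's code: the `Episode` record, its field names, and `action_grounded_success` are hypothetical, and the sample episodes are invented for the example rather than drawn from the benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """One question rendered as a tabletop episode (hypothetical schema)."""
    question: str
    candidates: List[str]          # one placement zone per candidate answer
    correct_index: int             # zone that encodes the right answer
    placed_index: Optional[int]    # zone the agent placed the object in, if any


def action_grounded_success(episodes: List[Episode]) -> float:
    """Fraction of episodes where the single placement lands on the correct answer."""
    if not episodes:
        return 0.0
    hits = sum(1 for e in episodes if e.placed_index == e.correct_index)
    return hits / len(episodes)


# Invented examples, for illustration only.
episodes = [
    Episode("Which one is a fruit?", ["apple", "hammer"], 0, 0),
    Episode("Which one floats on water?", ["stone", "cork"], 1, 0),
    Episode("Which one is used for writing?", ["pen", "spoon"], 0, None),
    Episode("Which one is colder?", ["ice", "steam"], 0, 0),
]
print(action_grounded_success(episodes))  # 0.5
```

Because every answer is expressed through the same simple placement action, a drop in this rate is easier to attribute to lost knowledge than to weak low-level control.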

Other papers today: VEGA (training a navigation VLA from in-the-wild first-person navigation video with geometric trajectory supervision); Motion-Focused Latent Action (decoupling motion from background for cross-embodiment VLA pre-training from human ego video); Mem-World (memory-augmented action-conditioned world model addressing "forgetting/hallucination" caused by occlusion and drift in manipulation); DCGWM (identifying "goal-interference collapse" in JEPA world models under dual physical and social signals, using partitioned latent spaces to structurally prevent it); Object-Centric Residual RL (VLA residual augmentation using object poses for zero-shot sim-to-real transfer).

Open Source · Tools · Benchmarks

· HT-Bench: A large-scale multi-task benchmark for dexterous whole-hand tactile sensing, comprising 10 million RGB frames and 7.8 million tactile frames across 226 tasks, evaluating tactile representations on contact geometry encoding, visuo-tactile alignment, and generalization to unseen tasks.

· ROBOSHACKLES: A safety dataset for preventing bodily injury by embodied foundation models — since real robot-caused injury data cannot be legally collected, the authors start from DROID observations and use scene understanding, hazard-aware image editing, and temporal prompting to synthesize realistic dangerous rollouts via video models for safety alignment.

· SC3-Eval: Repurposes pre-trained video foundation models as robot policy evaluators using self-consistent video generation, suppressing error accumulation in autoregressive rollouts through constraints such as forward/inverse dynamics consistency and multi-view consistency (co-authors include Allen Z. Ren and Lucy X. Shi).

· Physics-IQ Verified: A systematic audit of the Physics-IQ benchmark for measuring physical understanding in video generation models, identifying its shortcomings and proposing three improvements for more accurate measurement (co-authors include Yuki M. Asano and Stefan Bauer).

II. Funding & Deals

Manifold AI (Chinese world-model startup) ｜ Pre-A ｜ ~¥1 Billion (cumulative) ｜ Unicorn in One Year · world-model

This round was backed by Guoxin Fund (under China Reform Holdings), Yifeng Capital (a Temasek affiliate), BAIC Industrial Investment, and Xinneng Ventures, with all four existing shareholders adding to their positions. The company was founded in late May 2025 and has completed six rounds within one year, bringing cumulative Pre-A funding to nearly ¥1 billion and putting it in unicorn territory for world models. Its in-house WorldScape and WorldScape Policy models claim top rankings on WorldScore, WorldArena, and RoboTwin, and the company positions world models as an embodied pre-training foundation covering outdoor, indoor, and aerial domains. Founder and CEO Wu Wei is a former SenseTime executive and two-time consecutive champion of the Waymo SimAgents Challenge. Sources: 机器人前瞻 source (WeChat, CN), 复星锐正 source (WeChat, CN)

Noematrix (Chinese embodied AI startup) ｜ New Round ｜ Hundreds of Millions of Yuan · embodied

Led by Wuxi Data Group, with participation from SJTU AI Future Fund, a wholly-owned subsidiary of Shanghai Institute for Advanced Study, and Yicun Capital. The company was founded in late 2023 by the team of Prof. Cewu Lu from Shanghai Jiao Tong University; its "Noematrix Embodied Brain" product pursues a self-developed general embodied large-model approach that closes the decision loop from instruction understanding to execution feedback via force-position hybrid post-training. The company had previously received backing from Sequoia China, Alibaba, Prosperity7, and Sea, completing three rounds within one year. Source: 硬氪 source (WeChat, CN)

Aether AI ｜ Seed Round ｜ $20 Million (~¥135 Million) · world-model

Led by Matrix Partners China, with participation from 势能资本, SWC Global, and 九合创投. The company was founded by Prof. Bi-wei Huang, assistant professor at the University of California San Diego (UCSD), and focuses on "causal world models" that move robots from understanding "what" to understanding "why." Its three-part approach of causal feature representation, causal structure discovery, and causal dynamics modeling aims to build a "causal brain" for robots, with embodied intelligence as the first application. Sources: 甲子光年 source (WeChat, CN), 机器人前瞻 source (WeChat, CN)

Yingsu (Chinese 3D world-model startup) ｜ Tens of Millions of USD · world-model

The company advocates that "the physical world is 3D, so world models should be too," betting on a three-dimensional representation approach and joining the dense wave of world-model startups closing rounds today. ⚠️ Single-source claim. Source: 暗涌Waves source (WeChat, CN)

Odyssey (U.S.) ｜ Series B ｜ $310 Million ｜ Valuation ~$1.45 Billion · world-model

Backed jointly by Amazon, Nvidia, and AMD, with CIA-affiliated IQT and Google Chief Scientist Jeff Dean also participating. The company builds 3D world models that simulate the physical world; it has designated AWS as its preferred cloud and runs on Amazon Trainium chips, a slight realignment following Nvidia Ventures' prior Series A investment. Founders Oliver Cameron and Jeff Hawke come from autonomous driving. Source: Rubin智造社 source (WeChat, CN)

III. Commercial Deployment

Chengdu Humanoid Robot Innovation Center Secures 5,000-Unit Order from Central State-Owned Enterprises · industrial

On June 16, the Chengdu Humanoid Robot Innovation Center signed a strategic procurement agreement for 5,000 embodied intelligent robots with multiple central state-owned enterprises, described as the largest single-supplier order to date in China's embodied intelligence sector. If delivered on schedule, it would directly drive volume growth for core components such as harmonic reducers and joint modules. ⚠️ Single-source claim. Source: 金亿谈价值 source (WeChat, CN)

UBTECH's Consumer U1 Approaches 5,000 Pre-Orders in 17 Days · humanoid

Following early deposit orders of over 2,700 units in six days, cumulative pre-orders approached 5,000 units over 17 days, with pricing starting at ¥128,000 and targeting adult consumers only. This is another volume signal for China's humanoid market testing consumer demand, though pre-order and deposit figures remain distinct from actual deliveries. ⚠️ Pre-order figures. Sources: 财闻 source (WeChat, CN), Sina finance source
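A quick back-of-envelope check on the pace implied by these figures. It treats "over 2,700" as exactly 2,700 and "approached 5,000" as exactly 5,000, so the early rate is a floor and the later rate a ceiling. The notional value assumes every unit at the ¥128,000 starting price and says nothing about deliveries. All variable names are illustrative.

```python
# Figures quoted in the item above (rounded bounds, not exact counts).
EARLY_ORDERS = 2_700       # "over 2,700" in the first six days
EARLY_DAYS = 6
TOTAL_ORDERS = 5_000       # "approached 5,000" over 17 days
TOTAL_DAYS = 17
START_PRICE_CNY = 128_000  # starting price per unit

# Average daily pace in the first six days (a lower bound).
early_rate = EARLY_ORDERS / EARLY_DAYS

# Average daily pace over the remaining eleven days (an upper bound).
later_rate = (TOTAL_ORDERS - EARLY_ORDERS) / (TOTAL_DAYS - EARLY_DAYS)

# Notional order value if every pre-order converted at the starting price.
notional_cny = TOTAL_ORDERS * START_PRICE_CNY

print(f"first {EARLY_DAYS} days: at least {early_rate:.0f} units/day")
print(f"next {TOTAL_DAYS - EARLY_DAYS} days: at most {later_rate:.0f} units/day")
print(f"notional value at starting price: ¥{notional_cny:,}")
```

On these bounds the daily pace after the first six days was less than half the launch pace, which is consistent with the caution about reading pre-orders as demand.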

Digital Huaxia's "Xingxing Xia P2" Deployed in Elderly Care Facility · embodied

The Xingxing Xia P2 humanoid robot from Digital Huaxia (a company based in Shenzhen's Nanshan District) has been deployed at a Shenzhen elderly care and nursing facility, handling companion conversation, performance activities, and health monitoring, and completing real-world operational validation. Compared with factory settings, elderly care demands higher standards for safety and long-duration stability, making this a meaningful real-world entry point for service-sector deployment. Source: 蛇口消息报 source (WeChat, CN)

Unitree Robotics' Robot Dogs Deployed in Jingjiang Flood Response · industrial

During the "Emergency Mission 2026" exercise, the latest quadruped robots from Unitree Robotics (Yushu Technology, Chinese quadruped robot maker) participated in river flood response through human-robot collaboration, demonstrating tasks including levee inspection and hazard detection. Emergency rescue is one of the few real-world scenarios where quadruped robots already face genuine paying demand. Sources: Sohu source, source

First 10,000-Unit-Scale Embodied AI Super Factory in Beijing-Tianjin-Hebei Region Begins Operations · industrial

Lingyi iTech's (Chinese electronics manufacturer) Beijing embodied AI super factory has announced scaled operations, positioning itself as the first 10,000-unit-scale humanoid robot production base in the Beijing-Tianjin-Hebei region. Combined with Agibot's 10,000-unit production line in Shanghai, China's embodied robot industry is transitioning from prototype lines to 10,000-unit-scale capacity, and mass-production supply chains are the decisive factor. ⚠️ Single-source claim. Source: 雷克智能 source (WeChat, CN)

Amazon's Next Warehouse Efficiency Push: Moving People, Not Just Goods · industrial

Multiple U.S. media outlets report that Amazon is testing a new warehouse efficiency approach that could save millions of labor hours, shifting the focus from adding automated transport to restructuring how workers and robots collaborate on the floor. The specifics come from media reporting; the quantified claims await official confirmation and subsequent disclosures. ⚠️ Media-reported figures. Sources: Benzinga source, Business Insider source

IV. Industry Developments

Nvidia ENPIRE: Letting AI Coding Agents Train Robots by Themselves · world-model

Nvidia, in collaboration with CMU and UC Berkeley, has introduced the ENPIRE framework, which hands the entire robot training pipeline to 8 Codex-style coding agents. Each agent is given a fleet of robots, a GPU budget, and a token allocation, and runs full training loops directly on real hardware, much as it would test its own code. Nvidia claims the eight-robot collaborative system achieves up to 99% success on high-precision tasks such as installing GPUs onto motherboards, as disclosed by Jim Fan on June 16. Pushing "automated research" down to physical hardware is a rare development along this line, though the 99% figure is a demonstration result under specific experimental settings, and its robustness and scalability have yet to be independently reproduced. Sources: DeepTech深科技 source (WeChat, CN), Decrypt source

Daxiao Robotics Claims Four Benchmark Top Rankings for Kairos World Model · world-model

Daxiao Robotics (Chinese robotics company), chaired by SenseTime co-founder Xiaogang Wang, has released its "natively integrated" Kairos world model, featuring a unified backbone for multimodal understanding, generation, and prediction. The company claims top rankings on four world-model and embodied AI benchmarks (RoboTwin 2.0, LIBERO-Plus, WorldModelBench Robot, and DreamGen) and plans to open-source the model for industry use, with existing compatibility for Chinese AI chips including MetaX (Muxi) and Biren. Benchmark results are self-reported; methodological differences across benchmarks are significant, and third-party replication is advisable. ⚠️ Vendor/benchmark self-reported figures. Sources: 机器之心 source (WeChat, CN), 深蓝AI source (WeChat, CN)

Caocao Mobility Announces Full AI Transformation and RoboX Strategy · autonomy

Caocao Mobility (Chinese ride-hailing platform) has announced its pivot to a "Physical AI mobility technology platform," launching the RoboX strategy along dual tracks of Robotaxi and Robovan, targeting deployment of 100,000 Robotaxis by 2030 and proprietary vehicle production beginning in 2027. This is another example of a ride-hailing platform extending into autonomous vehicle operations; the 100,000-unit target is a long-range planning goal. ⚠️ Strategic announcement. Sources: 四川在线 source, 每日科普资讯 source (WeChat, CN)

Star Robotics Releases XHAND 1 PRO Research-Grade Dexterous Hand · embodied

On June 17, Star Robotics (Chinese humanoid robotics startup) unveiled the next-generation research-grade XHAND 1 PRO dexterous hand, featuring fully direct-drive actuation, 21 degrees of freedom, and whole-hand tactile sensing. As data increasingly defines competitive advantage in embodied AI, high-DoF plus tactile dexterous hands are being viewed by more and more teams as the "gateway" hardware for real-world manipulation data collection. Source: 具身元 source (WeChat, CN)

NIO Pushes New World Model Version to Over 700,000 Users · autonomy

NIO has announced an over-the-air update to a new version of its intelligent driving world model, reaching approximately 700,000 users simultaneously; the architecture upgrades from "world model + closed-loop reinforcement learning" to "world model + supervised fine-tuning + reinforcement learning," with vehicles purchased as far back as four years ago eligible for the update. The world-model approach is moving from papers and press conferences toward vehicle-scale OTA deployment. Source: 智能网联汽车年鉴 source (WeChat, CN)

Li Auto Enters Embodied Intelligence with "Chip + Brain + New Paradigm" Approach · autonomy

Multiple reports outline Li Auto's embodied intelligence push: in-house chips as the foundation, a unified embodied "brain" model, with an emphasis on leveraging data and compute accumulated from vehicle operations to extend into robotics. Following Xpeng's announced pivot to a physical AI company, another leading Chinese automaker is positioning its automotive capabilities as a lever for entering embodied intelligence. Source: 新浪网 source

Hardware · Supply Chain

· Intel RealSense D585 Pro: Unveiled at Automate 2026 in North America, this AI-native depth camera and accompanying Perception Studio software target robotic perception applications — the day's most notable new upstream sensor component.

· Harmonic Drive FLA Micro Actuator: Harmonic Drive has introduced the FLA series of miniaturized integrated actuators, emphasizing low noise and high power density for compact robot joint spaces.

· Artery (ARTERY MCU) New BGA100 Package: Targeting dexterous hand and robot joint control, the new package improves integration density and control capability in a smaller footprint.

· GigaDevice (Chinese MCU maker) Robot MCU: Claims approximately 50% cost reduction and approximately 60% power consumption reduction for neural network control in robotics, with a focus on high-temperature tolerance and reliability within miniature joints.

V. Weekly Observations

Waymo Recalls 3,871 Robotaxis Over Construction Zone Recognition Flaw, Suspends Highway Operations · autonomy

According to NHTSA, Waymo has recalled 3,871 robotaxis after its fifth-generation autonomous driving software was found to potentially continue driving in closed highway construction zones without recognizing ramp closure signs. More than ten related incidents have occurred in California and Arizona since April; on May 18, seven vehicles in the San Francisco Bay Area entered an active construction lane, attributed to software "prioritizing avoidance of other highway hazards without recognizing the construction zone." Waymo has temporarily restricted highway driving (vehicles remain in service on surface roads) and will issue a free OTA update to address the issue. This is the most significant autonomous driving regulatory event of the week and raises questions about the maturity of "highway-scenario L4." Sources: NHTSA, TechCrunch source

China Submits First Mandatory National Standards for L3/L4 Autonomous Driving · autonomy

China's Ministry of Industry and Information Technology (MIIT) has published for public review the country's first mandatory national standards for L3/L4 autonomous driving, requiring autonomous driving systems to meet the competence level of "a qualified human driver," mandating injury mitigation when a collision is unavoidable, and establishing a basis for apportioning accident liability. Industry observers interpret this as the end of an era in which automakers could compete on vague marketing claims, with autonomous driving capabilities and responsibilities now subject to hard regulatory benchmarks. Sources: 每日经济新闻, 驱动之家 source

Concentrated Policy Activity: Shenzhen Action Plan Revised, Real-Scene Training Initiative Advances · humanoid

Shenzhen has issued a revised "Embodied Intelligent Robotics Technology Innovation and Industry Development Action Plan (2025–2027)." Separately, a 2026 humanoid robot and embodied intelligence real-scene training initiative jointly launched by MIIT and the State-owned Assets Supervision and Administration Commission (SASAC) entered implementation this week, accompanied by an "Industry High-Quality Dataset Construction Action Implementation Plan" requiring real-training scenarios to be converted into standardized training data. Policy is shifting from "building an industry" to "building the data pipeline," in line with the industry view that data is the decisive differentiator. Sources: 国策新知 source (WeChat, CN), 上海市宝山区政务服务中心 source (WeChat, CN)

Barclays: Humanoids Could Support a $200 Billion Market Before 2035 · humanoid

A Barclays humanoid sector report proposes an "Automation 3.0" framework (1.0: industrial robots; 2.0: digital AI; 3.0: first-ever penetration of hard-to-automate service-sector work), arguing that simultaneous breakthroughs in "brain/body/battery" could support a trillion-dollar physical AI ecosystem. Market size estimates vary considerably: roughly $2–3 billion today, $10–25 billion by 2030, and by 2035 approximately $40 billion in the base case or $200 billion in the optimistic scenario, the latter being a bull-case assumption, not consensus. Sources: Barclays, 财联社 source
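To see how far apart the two 2035 scenarios are, the implied compound annual growth rate can be worked out directly. This is a rough sketch, not a figure from the report: it assumes "today" means 2026 (nine years to 2035) and takes the $2.5 billion midpoint of the $2–3 billion range as the starting point. The `cagr` helper and constant names are illustrative.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Assumptions for illustration: base year 2026, midpoint of the $2-3B range.
START_USD_BN = 2.5
YEARS_TO_2035 = 2035 - 2026

base_case = cagr(START_USD_BN, 40.0, YEARS_TO_2035)
bull_case = cagr(START_USD_BN, 200.0, YEARS_TO_2035)

print(f"base case ($40B by 2035): {base_case:.1%} per year")
print(f"bull case ($200B by 2035): {bull_case:.1%} per year")
```

Under these assumptions the base case already implies growth of roughly 36% a year sustained for nine years, and the bull case roughly 63% a year, which is why the $200 billion figure is best read as an upper scenario.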

Sell-Side Consensus Shifts to Supply Chain: "Repricing" Is Underestimated; Joints and Actuators Favored · hardware

Morgan Stanley argues that what is genuinely undervalued in the humanoid market is not demand but the repricing of the supply chain; Daiwa judges that body joints and dexterous-hand actuators are the most actionable investment within the supply chain, naming Tuopu Group (Chinese auto parts maker), Minth Group, and Sanhua Intelligent Controls (Chinese HVAC and refrigeration component maker) as top picks. Multiple institutions have collectively shifted focus from "whole-robot narratives" to "who captures component volume growth," a shift corroborated by this week's developments in reducers and micro motors. Sources: KC桌面 source (WeChat, CN), 中金在线

Shipment and Export Data: China Leading, Shanghai Port Accounts for ~40% of National Robot Exports · humanoid

Multiple sources estimate global humanoid robot shipments in 2025 at approximately 13,000–18,000 units (Omdia, IDC), while China's Robot Industry Alliance (HRAA) puts China's domestic market alone near 20,000 units; on the export side, robot exports through Shanghai in the first five months reached approximately ¥8.36 billion, representing roughly 40% of China's national total. Absolute shipment volumes remain modest, but China's lead on both production capacity and export value is now visible in the data. Sources: human five source (WeChat, CN), 上观新闻

Weekly Supply Chain Roundup: Chinese Harmonic Reducer Substitution Accelerates, Micro Motors Emerge as New Bottleneck · hardware

Harmonic reducers are this week's clearest supply chain trend. A Chinese duopoly is taking shape (Likai Harmonic at approximately 27.5% and Laifual Drive at approximately 21.4% market share, per Xinzhi Research), and Laifual Drive has cleared a Hong Kong Exchange listing hearing, positioning it to become the Hong Kong market's "first harmonic reducer listing" as China's second-largest supplier. Frost & Sullivan estimates China's robot harmonic reducer market at approximately ¥6.8 billion in 2025, with the top five suppliers controlling 75.8% and this component accounting for over 40% of humanoid robot hardware costs; Goldman Sachs notes the sector is transitioning from "commercial validation" to "volume production revenue."

At the other end of the drivetrain, miniature servo motors are emerging as a new weak point: localization of miniature dexterous-hand motors in China is reported to have exceeded 25%, with per-finger drive costs falling to approximately ¥120. Linker Hand's LinkHand 2.0 reportedly reduced per-unit costs by approximately ¥1,800 (about 22%) by switching 60% of motors to Zhaowei Electromechanical (Chinese micro-motor maker), with plans to deliver 50,000–100,000 dexterous hands in 2026. Overall, Chinese substitution in harmonic reducers carries the highest certainty, while miniature servo motors and high-end flexspline materials remain the two bottlenecks constraining humanoid volume ramp. Sources: 新质界 source (WeChat, CN), 机甲智士 source (WeChat, CN), 千机局 source (WeChat, CN), 鼎盛资本 source (WeChat, CN)
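A few of the quoted numbers can be cross-checked with simple arithmetic. The sketch below is illustrative only. It combines the Xinzhi Research share figures with the Frost & Sullivan top-five figure as if both described the same market, which is an assumption, and it backs out the LinkHand unit cost from the rounded "¥1,800, about 22%" claim, so the results are approximate.

```python
# Figures quoted in the item above.
LEADER_SHARE_PCT = 27.5      # Likai Harmonic, per Xinzhi Research
SECOND_SHARE_PCT = 21.4      # Laifual Drive, per Xinzhi Research
TOP_FIVE_SHARE_PCT = 75.8    # per Frost & Sullivan (assumed comparable)
SAVING_CNY = 1_800           # reported per-unit saving for LinkHand 2.0
SAVING_FRACTION = 0.22       # "about 22%"

# Combined share of the two largest suppliers.
duopoly_share = LEADER_SHARE_PCT + SECOND_SHARE_PCT

# Share left for suppliers three to five, if the two sources are comparable.
rest_of_top_five = TOP_FIVE_SHARE_PCT - duopoly_share

# Unit cost implied by "a ¥1,800 saving is about 22% of the old cost".
implied_old_cost = SAVING_CNY / SAVING_FRACTION
implied_new_cost = implied_old_cost - SAVING_CNY

print(f"top-two share: {duopoly_share:.1f}%")
print(f"suppliers 3 to 5 combined: {rest_of_top_five:.1f}%")
print(f"implied unit cost before: about ¥{implied_old_cost:,.0f}")
print(f"implied unit cost after: about ¥{implied_new_cost:,.0f}")
```

On these inputs the two leaders hold just under half the market, and the implied dexterous-hand unit cost falls from roughly ¥8,200 to roughly ¥6,400.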