Introduction
Humanoid robots like Tesla's Optimus and Figure AI's humanoids are generating massive hype, but the critical question isn't just whether they need data; it's how much, and what kind.
The narrative suggests humanoids require endless datasets, creating a boom market for data startups. But 2024-2025 research points to a different trajectory: humanoids will need substantial data initially, then demand will plateau and shift toward specialized services like curation and safety validation rather than raw collection. The business model around data changes from collection to intelligent processing.
This analysis examines four core doubts about the "endless data appetite" narrative, then weighs counterarguments that suggest certain demands persist.
Part 1: Arguments for Plateauing Demands
Doubt 1: Do Scaling Laws Show Diminishing Returns?
Microsoft's "Scaling Laws for Pre-training Agents and World Models" (2024) reveals that embodied AI systems follow power-law relationships, not linear growth. Optimal data scales with compute as D ∝ C^0.68, meaning data requirements grow much slower than computational capacity. Crucially, losses plateau at large datasets (1.63 billion pairs) without significant overfitting.
For humanoids, this means early data (first 100 trajectories) drives massive capability gains. The 10,000th trajectory? Marginal improvements. By 100,000 trajectories, you're fighting diminishing returns.
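To make the power-law intuition concrete, here is a small sketch. The D ∝ C^0.68 relation is the one cited above; the loss curve L(D) = L0 · D^(-alpha) and its constants are illustrative assumptions, used only to show how the marginal gain per trajectory shrinks.

```python
# Power-law scaling sketch. The 0.68 exponent is the compute-to-data relation
# cited above; l0 and alpha in the loss curve are illustrative assumptions,
# not values from the Microsoft paper.

def optimal_data(compute: float, exponent: float = 0.68) -> float:
    """Data budget implied by D ∝ C^0.68 (arbitrary units)."""
    return compute ** exponent

def loss(num_trajectories: float, l0: float = 1.0, alpha: float = 0.3) -> float:
    """Illustrative power-law loss curve: L(D) = l0 * D^(-alpha)."""
    return l0 * num_trajectories ** (-alpha)

# Doubling compute raises the optimal data budget by only ~1.6x, not 2x.
print(optimal_data(2.0) / optimal_data(1.0))         # ~1.60

# The marginal benefit of one more trajectory collapses as the dataset grows.
for n in (100, 10_000, 100_000):
    gain = loss(n) - loss(n + 1)
    print(f"trajectory {n:>7}: marginal loss reduction ~ {gain:.1e}")
```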
NVIDIA's "DreamGen" (2025) demonstrates this principle in practice. A generative world model trained on one teleop task generated 22 novel behaviors without collecting additional real-world data. Recent work on "Learning Hierarchical World Models with Adaptive Temporal Abstractions" (Gumbsch et al., ICLR 2024) shows hierarchical approaches like THICK achieve efficiency improvements through multi-timescale reasoning with far less data than flat world models.
Implication: Foundational training peaks 2026-2028. Afterward, demand likely drops 50-70% as efficiency gains mature.
Doubt 2: Can Few-Shot Learning Replace Massive Datasets?
"Real-World Humanoid Locomotion with Reinforcement Learning" (2024) shows Agility Robotics' Digit humanoid adapting to diverse terrains in fewer than 100 real-world trials with 90% zero-shot success on new environments.
Honda Research Institute's "VisuoTactile Pretraining" (2025) demonstrates that contact-rich manipulation (USB insertion, card swiping, key insertion) achieves 90%+ success with only 32 demonstrations plus 45 minutes of reinforcement learning. Combining visual and tactile feedback replaces the need for massive labeled datasets.
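The recipe behind numbers like "32 demonstrations plus 45 minutes of reinforcement learning" is worth making explicit: clone the demonstrations first, then let a short RL phase fine-tune the policy on the real robot. The sketch below shows only the cloning stage; the network sizes and the (obs, expert_action) demo format are hypothetical placeholders, not Honda's implementation.

```python
# Sketch of the demos-then-RL recipe described above, not Honda's actual code.
# Network sizes and the (obs, expert_action) demo format are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Maps fused visual + tactile features to a robot action."""
    def __init__(self, obs_dim: int = 64, act_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def behavior_clone(policy: Policy, demos, epochs: int = 50, lr: float = 1e-3):
    """Stage 1: supervised pretraining on a handful of teleop demonstrations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, expert_action in demos:          # e.g. ~32 demonstrations
            loss = F.mse_loss(policy(obs), expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 2 (omitted): a short on-robot RL phase fine-tunes the cloned policy
# against task reward -- that is where the reported "45 minutes" goes.
```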
The theoretical foundation appears in "Stop Regressing: Training Value Functions via Classification" (2024). Classification-based value functions (Q-transformers) outperform regression in manipulation, achieving state-of-the-art results with dramatically fewer trajectories.
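The core idea is simple to sketch: instead of regressing a scalar value with MSE, discretize the return range into bins and train the value head with cross-entropy against a "two-hot" target. This is a minimal illustration of that family of losses, not the paper's exact formulation (related variants add Gaussian label smoothing); the bin count and value range below are assumptions.

```python
# Minimal sketch of a classification-style value target ("two-hot" encoding
# trained with cross-entropy) versus plain MSE regression. Bin count and
# value range are assumptions, and this is not the paper's exact loss.
import torch
import torch.nn.functional as F

NUM_BINS, V_MIN, V_MAX = 51, -10.0, 10.0            # assumed value support
bin_centers = torch.linspace(V_MIN, V_MAX, NUM_BINS)

def two_hot(returns: torch.Tensor) -> torch.Tensor:
    """Encode scalar returns (shape [B]) as a distribution over adjacent bins."""
    returns = returns.clamp(V_MIN, V_MAX)
    pos = (returns - V_MIN) / (V_MAX - V_MIN) * (NUM_BINS - 1)
    lower = pos.floor().long().clamp(max=NUM_BINS - 2)
    frac = (pos - lower.float()).unsqueeze(-1)
    dist = torch.zeros(returns.shape[0], NUM_BINS)
    dist.scatter_(-1, lower.unsqueeze(-1), 1.0 - frac)
    dist.scatter_(-1, (lower + 1).unsqueeze(-1), frac)
    return dist

def regression_loss(value_pred, returns):
    """Baseline: regress the scalar value with MSE."""
    return F.mse_loss(value_pred.squeeze(-1), returns)

def classification_loss(value_logits, returns):
    """Cross-entropy between predicted bin logits and two-hot return targets."""
    log_probs = F.log_softmax(value_logits, dim=-1)
    return -(two_hot(returns) * log_probs).sum(-1).mean()

def expected_value(value_logits):
    """Recover a scalar estimate, so the rest of the RL pipeline is unchanged."""
    return (F.softmax(value_logits, dim=-1) * bin_centers).sum(-1)
```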
Implication: RL fine-tuning layered on top of pretrained policies is proving far more sample-efficient than pure supervised imitation for robotics. By 2032, few-shot learning likely cuts data requirements 80-90% compared to today's supervised approaches.
Doubt 3: Will Synthetic Data Make Real Data Obsolete?
"Video2Robot" (Aim Intelligence, 2025) converts human videos into physics-grounded humanoid trajectories, scaling behaviors like climbing without real robot captures.
"X-Humanoid" (2025) converts Ego-Exo4D videos (60 hours = 3.6 million frames) into Optimus-like action sequences for cooking and biking, training both policies and world models.
The "Humanoid Everyday" dataset (260 real-world robotic tasks) is currently the largest multimodal humanoid dataset, yet authors acknowledge that synthetic data enables generalization beyond real data's domain.
Citi's "The Rise of AI Robots" (2024) forecasts 1.3 billion robots by 2035, primarily trained via simulation. This scales via GPU rendering, not manual collection.
Implication: Synthetic data dominates by 2028-2030. Real data demand drops 80-90%. Real data becomes specialized (edge cases, safety validation, domain-specific fine-tuning).
Doubt 4: Does Internal Fleet Learning Hide External Demand?
Tesla, Figure, and Boston Dynamics don't buy data from startups. They collect internally. A former Tesla Autopilot engineer noted: "Data generation isn't the bottleneck. They collect terabytes per hour. The hard part is finding the right clips for training. That's curation."
This shifts the market entirely. Collection becomes free; curation becomes valuable. A startup identifying the 1% of fleet data most valuable for improvement is worth billions. A startup selling raw teleoperation data? Increasingly irrelevant.
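One concrete way such a curation layer can work (a generic sketch, not a description of any particular company's pipeline): score each fleet clip by how badly the current policy predicts the logged actions, then keep only the highest-error slice for retraining.

```python
# Generic curation sketch: rank logged clips by how badly the current policy
# predicts the logged actions, keep the top slice for retraining. `policy` and
# the clip format are hypothetical; this is not any specific company's pipeline.
import numpy as np

def clip_score(policy, clip) -> float:
    """Mean action-prediction error over one clip; higher = more informative."""
    errors = [np.linalg.norm(policy(obs) - logged_action)
              for obs, logged_action in clip]
    return float(np.mean(errors))

def curate(policy, clips, keep_fraction: float = 0.01):
    """Return roughly the 1% of clips the current policy handles worst."""
    ranked = sorted(clips, key=lambda c: clip_score(policy, c), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```

In practice the scoring signal might be ensemble disagreement, rare-event detection, or human review, but the economics are the same: the value sits in the ranking, not in the raw terabytes.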
Figure AI's $675M Series B (February 2024) went to in-house development, not external data purchases. "DreamGen" explicitly demonstrates autonomous data generation via learned world models.
NVIDIA researcher Jim Fan noted in an Office Chai interview (April 9, 2025): "Unlike LLMs, robotics doesn't yet have clear scaling laws. Compute and data are both bottlenecks, but physical data collection remains expensive."
Implication: External data demand stays low from 2026 onward, approaching zero by 2036 as fleets mature.
Part 2: Counterarguments
The Sim-to-Real Gap Persists
Simulation handles gravity, friction, and inertia. It doesn't capture material properties, sensor noise, wear, or degradation over time. A robot trained in perfect simulation may fail after 100 real-world episodes due to unmodeled dynamics.
"Humanoid Locomotion as Next Token Prediction" (2024) shows sim-trained policies require substantial real-world adaptation even with domain randomization.
Fine-Tuning Requires Real Data
Google's "MT-Opt" line of work (2021) showed that scaling multi-task robot policies still required large volumes of real robot experience. As humanoids move to messy real-world settings, environment-specific adaptation demands increase, not decrease.
Robot Vision Gaps
Embodied AI benchmarks reveal persistent gaps, particularly in temporal reasoning. Robots often treat frames independently while humans process continuous streams with temporal context. Understanding that someone "is about to" reach for an object requires temporal reasoning that current vision systems lack.
Safety Validation Is Extensive
ISO 13482 mandates comprehensive testing across failure modes. Real-world edge cases emerge unpredictably. Boston Dynamics' Atlas experienced numerous falls during development, each requiring data collection and analysis. Safety-critical applications demand orders of magnitude more validation data than general robotics.
Human Interaction Is Complex
Humanoids working alongside people must interpret subtle social cues: body language, eye contact, contextual intent, theory of mind. Recent work on "human-AI interaction" (2024) shows this capability remains elusive, requiring extensive multimodal training data.
Real-World Complexity Dominates
History shows robotics underestimates real complexity. Tesla's Autopilot discovered thousands of edge cases post-deployment that simulation missed. Long-tail distributions mean rare but critical scenarios dominate failure cases. As humanoids enter homes, factories, and public spaces, new failure modes will emerge requiring continuous data collection.
Long-Horizon Planning Remains Difficult
Human tasks span minutes to hours with complex interdependencies. Reinforcement learning struggles with long-horizon credit assignment. Recent "transformer-based planning work" (2024) shows hierarchical reasoning requires extensive trajectory data for reliable long-term decision-making.
The Intuitive Physics Gap
AI systems still lack robust understanding of object properties, stability, and physical interactions. Each novel environment or material type may require specific training data for reliable interaction.
Part 3: Synthesis
The doubts suggest demand peaks 2026-2028 then declines sharply. The counterarguments suggest certain demands persist. The reality is bifurcated.
Data Types That Peak and Plateau
Foundational locomotion datasets (walking, balance, navigation) peak 2026-2028, then plateau as core policies mature. Generic manipulation demos (grasping, lifting, placing) peak 2026-2029, then plateau. Teleoperation services for bootstrapping peak 2026-2028, then drop 80-90%.
Data Types That Persist or Grow
Safety validation data is collected continuously; each new environment, interaction, or edge case requires data. Domain-specific fine-tuning data persists: healthcare robots need healthcare data, surgical robots need surgical data. Temporal and social interaction data grows as robots interact more with humans. Edge case and failure data accumulates continuously. Fine-tuning data for hardware variations will still be needed.
Market Trajectory
Raw data collection captures 20-30% of the robotics value chain in 2026-2030. By 2031-2036, collection captures only 2-5%, while curation, processing, and domain adaptation capture 15-25%.
Market size forecasts diverge significantly: Grand View Research projects $4.04B by 2030 (a 17.5% CAGR from $1.55B in 2024), while BCC Research projects $11B by 2030 (a 42.8% CAGR from $1.9B in 2025). Grand View is likely conservative; BCC likely includes speculative demand scenarios. MarketsandMarkets forecasts $13.25B by 2029 at a 45.5% CAGR.
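The divergence is in the growth assumptions, not the arithmetic: the quoted CAGRs and base years roughly reproduce the headline figures. A quick compound-growth check, assuming simple annual compounding:

```python
# Quick sanity check of the compound-growth arithmetic behind the forecasts,
# assuming simple annual compounding from the stated base years.

def project(base_billions: float, cagr: float, years: int) -> float:
    """Compound a base market size forward at a constant annual growth rate."""
    return base_billions * (1.0 + cagr) ** years

print(round(project(1.55, 0.175, 6), 2))   # Grand View: 2024 -> 2030, ~4.08 (~$4.04B)
print(round(project(1.90, 0.428, 5), 2))   # BCC: 2025 -> 2030, ~11.28 (~$11B)
```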
The Critical Distinction
Raw data becomes commoditized by 2028. The bottleneck shifts from collection to curation. Identifying the valuable signal within terabytes of fleet data matters far more than raw collection volume.
Critical Context
Domain Variations Matter
Consumer humanoids see highest efficiency gains; data demand drops 80-90% by 2035. Healthcare and surgical robots require conservative deployment with high safety validation; data demand remains substantial. Industrial robots in hazardous environments use extensive simulation with moderate efficiency gains.
Hardware-Software Coupling
Better sensors (force feedback, advanced cameras) reduce data requirements. Lower-cost sensors increase requirements. Conclusions assume current hardware. Significant hardware shifts change data strategies.
Regional Differences
Data privacy laws (GDPR in EU), labor costs, and safety standards vary by region, affecting data collection ROI and humanoid adoption willingness.
Conclusion
Humanoids will need substantial data, but the trajectory is "peak and persist," not endless escalation. Foundational training peaks 2026-2028, driven by scaling law efficiency and synthetic data gains. Raw data demand then drops 50-90%.
However, specialized data needs persist: sim-to-real fine-tuning, safety validation, social interaction learning, and edge case handling. The market story isn't about data volume declining; it's about value migrating from collection to curation.
Pure data collection becomes trivial by 2028. The competitive advantage lies with companies solving intelligent curation, safety validation, and domain-specific adaptation. Integrated hardware-AI companies (Tesla, Boston Dynamics, Figure) internalize these capabilities, creating structural moats.
Data infrastructure startups face headwinds unless they pivot from collection to specialization. The humanoid market grows to $4-13B by 2030, but raw data's share of that value shrinks from 20-30% to 2-5% as the field matures.
This represents a fundamental shift: data becomes abundant; intelligence (curation, adaptation, validation) becomes scarce.