<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ankit Khandelwal</title>
    <description>The latest articles on DEV Community by Ankit Khandelwal (@ankk98).</description>
    <link>https://dev.to/ankk98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F37202%2Fcffd594b-2290-43e7-9e13-b61cdf6c6b8e.jpeg</url>
      <title>DEV Community: Ankit Khandelwal</title>
      <link>https://dev.to/ankk98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ankk98"/>
    <language>en</language>
    <item>
      <title>Kriya-Egocentric-100K: Action100M-style Annotations for Real-World Labor Videos</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Tue, 17 Mar 2026 05:22:36 +0000</pubDate>
      <link>https://dev.to/ankk98/kriya-egocentric-100k-action100m-style-annotations-for-real-world-labor-videos-42jd</link>
      <guid>https://dev.to/ankk98/kriya-egocentric-100k-action100m-style-annotations-for-real-world-labor-videos-42jd</guid>
      <description>&lt;p&gt;Just pushed a new preview dataset to Hugging Face: &lt;strong&gt;&lt;a href="https://huggingface.co/datasets/ankk98/kriya-egocentric-100k" rel="noopener noreferrer"&gt;Kriya-Egocentric-100K&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It contains &lt;strong&gt;Action100M-compatible hierarchical action annotations&lt;/strong&gt; for a small 5-video subset of &lt;a href="https://huggingface.co/datasets/builddotai/Egocentric-100K" rel="noopener noreferrer"&gt;Build AI’s Egocentric-100K&lt;/a&gt; — real first-person footage captured with a monocular head-mounted fisheye camera during manual labor tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi3kngo6f532vm339o9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi3kngo6f532vm339o9v.png" alt="Kriya Viz Screenshot" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What’s inside?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;One JSON file per video (e.g. &lt;code&gt;f001-w001-0001.json&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Full Action100M-style tree: root → sub-segments with precise start/end timestamps&lt;/li&gt;
&lt;li&gt;LLM-generated natural language captions + structured GPT outputs (brief/detailed summaries, action labels, actors)&lt;/li&gt;
&lt;li&gt;Everything generated 100% automatically via the &lt;strong&gt;&lt;a href="https://mindandmotionlabs.com/api-docs.html" rel="noopener noreferrer"&gt;Kriya Full Automated Action Annotation API&lt;/a&gt;&lt;/strong&gt; (early preview)&lt;/li&gt;
&lt;/ul&gt;
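If you want to poke at the annotations programmatically, a recursive tree walk is all you need. The field names used here (caption, start, end, children) are illustrative; check an actual JSON file from the dataset for the exact schema.

```python
# Hypothetical Action100M-style action tree; the real field names in the
# Kriya JSON files may differ, so inspect one file before relying on this.
sample = {
    "caption": "assemble bracket",
    "start": 0.0,
    "end": 42.5,
    "children": [
        {"caption": "pick up screwdriver", "start": 0.0, "end": 6.1, "children": []},
        {"caption": "drive screws", "start": 6.1, "end": 42.5, "children": []},
    ],
}

def walk(node, depth=0):
    """Yield (depth, start, end, caption) for every node in the tree."""
    yield depth, node["start"], node["end"], node["caption"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)

for depth, start, end, caption in walk(sample):
    print("  " * depth + f"[{start:.1f}-{end:.1f}s] {caption}")
```

The same generator works at any nesting depth, so it should apply unchanged to deeper sub-segment hierarchies.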

&lt;p&gt;The videos themselves are &lt;strong&gt;not&lt;/strong&gt; hosted here (you’ll need to pull them directly from Build AI under their license), but the annotations are MIT and drop-in compatible with the &lt;strong&gt;&lt;a href="https://ankk98.github.io/kriya-viz/" rel="noopener noreferrer"&gt;Kriya Visualizer&lt;/a&gt;&lt;/strong&gt; — just load the video + matching JSON and explore the timeline instantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why this matters
&lt;/h4&gt;

&lt;p&gt;After the EPIC-KITCHENS preview, this is the next step toward scaling automatic annotation to more diverse egocentric domains. Manual labor footage brings new challenges (occlusions, tool use, unstructured environments) — and the results already look strong for downstream tasks like video world models, VLMs, VLA policies, and embodied robotics.&lt;/p&gt;

&lt;p&gt;Visualizer demo, full pipeline details, and the previous Kriya-EPIC-KITCHENS release are all in the &lt;strong&gt;&lt;a href="https://dev.to/ankk98/kriya-tools-for-exploring-and-generating-action100m-style-video-annotations-46ee"&gt;original Kriya tools blog post&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is still an early preview — feedback and collaboration super welcome! Drop a comment or DM if you want to try the API on your own footage or discuss scaling plans.&lt;/p&gt;

&lt;p&gt;Excited to keep pushing the boundary of automatic video understanding.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dataset</category>
      <category>computervision</category>
      <category>egocentric</category>
    </item>
    <item>
      <title>Kriya: Tools for Exploring and Generating Action100M-style Video Annotations</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sat, 14 Mar 2026 06:29:49 +0000</pubDate>
      <link>https://dev.to/ankk98/kriya-tools-for-exploring-and-generating-action100m-style-video-annotations-46ee</link>
      <guid>https://dev.to/ankk98/kriya-tools-for-exploring-and-generating-action100m-style-video-annotations-46ee</guid>
      <description>&lt;p&gt;After reading the excellent &lt;a href="https://arxiv.org/abs/2601.10592" rel="noopener noreferrer"&gt;Action100M paper&lt;/a&gt;, I became very excited about the potential of &lt;strong&gt;fully automated, large-scale video action annotation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;High-quality temporal action hierarchies open doors for training stronger video world models, video-language models (VLMs), vision-language-action models (VLAs), humanoid control policies, and physical reasoning systems.&lt;/p&gt;

&lt;p&gt;But two practical problems quickly appeared:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There was no convenient way to &lt;strong&gt;visualize&lt;/strong&gt; these rich, hierarchical annotations together with the video.&lt;/li&gt;
&lt;li&gt;Generating such annotations at scale for new/custom video datasets still felt out of reach for many researchers and engineers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So I built two tools to help move things forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Kriya Visualizer – See Action100M-style Annotations Come Alive
&lt;/h2&gt;

&lt;p&gt;I created a lightweight, static web-based visualizer specifically designed for Action100M-style temporal action trees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features (current version):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video player synced with the annotation timeline&lt;/li&gt;
&lt;li&gt;Hierarchical timeline (one row per level in the action tree)&lt;/li&gt;
&lt;li&gt;Nodes highlight at the current timestamp&lt;/li&gt;
&lt;li&gt;Side panel with metadata, full transcript, and raw JSON view&lt;/li&gt;
&lt;li&gt;Clean, single-screen layout (no installation needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwxn9kxf3iclk3x8ggt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwxn9kxf3iclk3x8ggt7.png" alt="Kriya Viz Screenshot" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's open source under the MIT license → feel free to fork, improve, or use it in your projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access Here:&lt;/strong&gt; &lt;a href="https://ankk98.github.io/kriya-viz/" rel="noopener noreferrer"&gt;https://ankk98.github.io/kriya-viz/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub repo:&lt;/strong&gt; &lt;a href="https://github.com/Ankk98/kriya-viz" rel="noopener noreferrer"&gt;https://github.com/Ankk98/kriya-viz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're working with Action100M data (or any similar dense temporal action hierarchy), give it a try and let me know what features would make it more useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Kriya-EPIC-KITCHENS – Automatic Annotations on Egocentric Videos
&lt;/h2&gt;

&lt;p&gt;Next, I wanted to test how well fully automatic annotation works on real, challenging egocentric data.&lt;/p&gt;

&lt;p&gt;I ran the &lt;strong&gt;Kriya Full Automated Action Annotation API&lt;/strong&gt; (early preview) on a small subset of videos from the popular &lt;a href="https://epic-kitchens.github.io/2026" rel="noopener noreferrer"&gt;EPIC-KITCHENS-100&lt;/a&gt; dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; A preview Hugging Face dataset with ~6 videos fully annotated in Action100M style, no human labeling involved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporal segments with hierarchical actions&lt;/li&gt;
&lt;li&gt;Natural language captions/descriptions per segment&lt;/li&gt;
&lt;li&gt;Ready to download and use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dataset link:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/ankk98/kriya-epic-kitchens" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/ankk98/kriya-epic-kitchens&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Early results on kitchen egocentric videos look very promising. I'm excited to see if/how these annotations can feed downstream tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video world models&lt;/li&gt;
&lt;li&gt;VLM / VLA fine-tuning&lt;/li&gt;
&lt;li&gt;Robotic manipulation from egocentric views&lt;/li&gt;
&lt;li&gt;Physical AI reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current API version deliberately follows the Action100M pipeline closely. An improved version that addresses some limitations is already in the works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API docs (early preview):&lt;/strong&gt; &lt;a href="https://mindandmotionlabs.com/api-docs.html" rel="noopener noreferrer"&gt;https://mindandmotionlabs.com/api-docs.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;
(You send videos → get back structured temporal action hierarchies)&lt;/p&gt;
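As a rough sketch of what a client call might look like, here is a request built with the standard library. The endpoint path, auth header, and payload fields below are my own illustrative guesses, not the real contract; consult the API docs for the actual interface.

```python
import json
import urllib.request

# Hypothetical request shape: endpoint path, auth scheme, and payload
# fields are illustrative placeholders, not the documented Kriya API.
payload = json.dumps({
    "video_url": "https://example.com/clip.mp4",
    "options": {"hierarchy": True, "captions": True},
}).encode("utf-8")

req = urllib.request.Request(
    url="https://mindandmotionlabs.com/api/v1/annotate",  # illustrative path
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)

# urllib.request.urlopen(req) would submit the job; the response is
# expected to contain an Action100M-style temporal action hierarchy.
```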

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Manual video annotation at scale is expensive and slow. If high-quality automatic annotation becomes reliable, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train on orders-of-magnitude more grounded video data&lt;/li&gt;
&lt;li&gt;Build more general-purpose video understanding and action generation models&lt;/li&gt;
&lt;li&gt;Accelerate progress toward capable robotic and embodied AI systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two small releases are just early steps: Kriya Visualizer for inspection and debugging, and Kriya-EPIC-KITCHENS as a proof-of-concept dataset.&lt;/p&gt;

&lt;p&gt;Feedback, feature requests, collaboration ideas, or even just "I tried it and here's what broke" are very welcome!&lt;/p&gt;

&lt;p&gt;What are you building with video action data right now? Drop a comment below 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>computervision</category>
      <category>robotics</category>
      <category>dataset</category>
    </item>
    <item>
      <title>From Perception to Embodied Intelligence: Evolution, Architectures, and the Humanoid Gap</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sat, 14 Feb 2026 13:56:12 +0000</pubDate>
      <link>https://dev.to/ankk98/from-perception-to-embodied-intelligence-evolution-architectures-and-the-humanoid-gap-3dhi</link>
      <guid>https://dev.to/ankk98/from-perception-to-embodied-intelligence-evolution-architectures-and-the-humanoid-gap-3dhi</guid>
      <description>&lt;p&gt;Vision-Language-Action (VLA) models represent a paradigm shift from passive multimodal understanding to active embodied control. This brief maps the lineage from foundational Vision-Language Models (VLMs) like CLIP and BLIP to current state-of-the-art VLA systems, revealing critical architectural transitions, data strategies, and failure modes that define the frontier of humanoid manipulation.&lt;/p&gt;

&lt;p&gt;The analysis identifies three core evolutionary phases:&lt;/p&gt;

&lt;p&gt;(1) VLM pre-training for semantic understanding&lt;br&gt;
(2) action tokenization enabling end-to-end control&lt;br&gt;
(3) hybrid architectures balancing reasoning with real-time execution&lt;/p&gt;

&lt;p&gt;For humanoid robotics, fundamental gaps remain in proprioceptive reasoning, long-horizon planning, and physics-aware action generation: challenges that current open-source models address only partially.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Evolutionary Timeline: From VLMs to VLAs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Foundation (2021–2022) – VLMs as Semantic Engines
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CLIP (2021)&lt;/strong&gt; and &lt;strong&gt;BLIP (2022)&lt;/strong&gt; established contrastive learning as the dominant paradigm for aligning vision and language modalities. These models excelled at matching images to text descriptions but lacked any mechanism for action generation. Their legacy persists in modern VLAs: OpenVLA inherits SigLIP's vision encoder, while Pi0 leverages PaliGemma's VLM backbone. &lt;a href="https://hankyukim.com/openvla/" rel="noopener noreferrer"&gt;hankyukim&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Limitation&lt;/strong&gt;: VLMs were fundamentally passive, optimized for retrieval and classification, not sequential decision-making. Early attempts like &lt;strong&gt;CLIPort&lt;/strong&gt; (2022) demonstrated that grafting CLIP representations onto robotic policies via imitation learning could achieve task-specific success but failed to generalize across embodiments or semantic concepts beyond the training distribution. &lt;a href="https://arxiv.org/html/2505.04769v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Tokenization Breakthrough (2023) – RT-2 and the Birth of VLAs
&lt;/h3&gt;

&lt;p&gt;Google DeepMind's &lt;strong&gt;RT-2 (July 2023)&lt;/strong&gt; catalyzed the field by reconceptualizing robot actions as text tokens. The architecture quantized continuous actions into discrete bins (typically 256 per dimension) and appended them to the vocabulary of a PaLM-E or PaLI-X VLM. This enabled training with standard next-token prediction objectives, unifying web-scale vision-language pre-training with robotic demonstrations. &lt;a href="https://madison-proceedings.com/index.php/aetr/article/view/4359" rel="noopener noreferrer"&gt;madison-proceedings&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Leap&lt;/strong&gt;: RT-2 achieved 3× improvement in generalization over RT-1, demonstrating emergent capabilities like reasoning about object categories and improvising tools. The model could interpret novel commands ("place the apple on the 3") despite never observing such combinations in robot data. &lt;a href="https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action/" rel="noopener noreferrer"&gt;deepmind&lt;/a&gt;&lt;/p&gt;
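The tokenization scheme itself is simple enough to sketch in a few lines. This is a minimal illustration of uniform 256-bin discretization in the spirit of RT-2; actual implementations derive per-dimension bounds from the training data rather than hard-coding them.

```python
# Minimal sketch of RT-2 / OpenVLA-style action discretization: each
# continuous action dimension is mapped to one of 256 uniform bins so
# actions can be emitted as ordinary vocabulary tokens.
N_BINS = 256

def tokenize(value, low, high):
    """Map a continuous value in [low, high] to a bin index in [0, 255]."""
    clipped = max(low, min(high, value))
    frac = (clipped - low) / (high - low)
    return min(N_BINS - 1, int(frac * N_BINS))

def detokenize(token, low, high):
    """Map a bin index back to the bin-center continuous value."""
    return low + (token + 0.5) / N_BINS * (high - low)

# Example: a gripper x-displacement in metres.
token = tokenize(0.03, low=-0.1, high=0.1)
recovered = detokenize(token, low=-0.1, high=0.1)
```

The round trip loses at most half a bin width per dimension, which is exactly the quantization error that flow-matching action heads (discussed below) are designed to avoid.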

&lt;h3&gt;
  
  
  Phase 3: Scaling and Open-Source (2024–2025) – OpenVLA, SmolVLA, and Pi0
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA (2024)&lt;/strong&gt; democratized access with a 7B-parameter model trained on 970k demonstrations from the Open X-Embodiment dataset. Built on Llama 2 + DINOv2 + SigLIP, it outperformed closed models like RT-2-X (55B parameters) with 7× fewer parameters by leveraging more diverse training data and 27 training epochs (vs. typical 1-2 epochs for VLMs). &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLA (2025)&lt;/strong&gt; pioneered efficiency, achieving OpenVLA-level performance with &amp;lt;0.5B parameters by employing a compact VLM backbone, flow matching action expert, and asynchronous inference stack. Its key insight: action generation quality depends more on architectural efficiency than raw parameter count. &lt;a href="https://www.youtube.com/watch?v=T1PhkCQDCcc" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi0 Series (Physical Intelligence, 2024–2025)&lt;/strong&gt; introduced hybrid architectures combining autoregressive action tokens with continuous flow matching. Pi0.5 added temporal awareness through timestep conditioning, while Pi0.6 scaled to 5B parameters and incorporated knowledge insulation, training the VLM backbone on FAST tokens while isolating the action expert's gradients. &lt;a href="https://arxiv.org/html/2410.24164v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Thematic Deep Dives: What Worked vs. What Failed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Key Ideas That Worked
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Action Tokenization as Sequence Prediction&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Treating actions as discrete tokens enabled direct transfer of LLM training infrastructure to robotics. RT-2's 256-bin quantization scheme remains the default in OpenVLA, providing a simple bridge between continuous control and autoregressive generation. This approach inherits powerful properties from language modeling: in-context learning, few-shot adaptation, and chain-of-thought reasoning. &lt;a href="https://arxiv.org/abs/2307.15818" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence&lt;/strong&gt;: OpenVLA achieves 95% action token accuracy after 27 training epochs, with performance correlating strongly to robot success rates. The discrete representation also simplifies multi-task training across heterogeneous robot embodiments. &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Flow Matching for Continuous Control&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Diffusion-based action heads address the continuity problem inherent in tokenization. Pi0 and SmolVLA use flow matching to predict action chunks as continuous trajectories, avoiding quantization errors. This enables smoother, more precise control, which is critical for contact-rich manipulation. &lt;a href="https://www.youtube.com/watch?v=T1PhkCQDCcc" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Impact&lt;/strong&gt;: Pi0 outperforms tokenized baselines on action chunking tasks (e.g., folding laundry) where precise force modulation matters. Flow matching also supports variable horizon predictions, unlike fixed-length token sequences. &lt;a href="https://arxiv.org/html/2410.24164v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;
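For intuition, the flow-matching training target reduces to a few lines. This is a conceptual sketch only (a linear noise-to-action path with a constant velocity target), not the exact Pi0 or SmolVLA recipe.

```python
import numpy as np

# Conceptual flow-matching target for an action chunk: the network is
# trained to predict the velocity that carries a noise sample to the
# demonstrated actions along a straight-line path.
rng = np.random.default_rng(0)

action_chunk = rng.normal(size=(8, 7))   # 8 timesteps x 7-DoF actions
noise = rng.normal(size=action_chunk.shape)
t = 0.3                                  # interpolation time in [0, 1]

# Point on the straight-line path from noise to the target actions...
x_t = (1.0 - t) * noise + t * action_chunk
# ...and the constant velocity field the model should regress to.
target_velocity = action_chunk - noise
```

Because the target is a continuous trajectory rather than a token sequence, there is no 1/256-bin quantization floor on precision.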

&lt;h4&gt;
  
  
  &lt;strong&gt;Knowledge Insulation and Modularity&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;VLA-Adapter and Pi0.6 demonstrate that decoupling VLM reasoning from action generation improves training efficiency. By freezing the VLM backbone and training only a lightweight action expert, these models avoid catastrophic forgetting of web-scale knowledge while specializing for robot control. &lt;a href="https://arxiv.org/abs/2509.09372" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficiency Gains&lt;/strong&gt;: VLA-Adapter trains a powerful VLA in 8 hours on a single consumer GPU, while Pi0.6's insulated gradients prevent performance degradation on vision-language benchmarks. &lt;a href="https://website.pi-asset.com/pi06star/PI06_model_card.pdf" rel="noopener noreferrer"&gt;website.pi-asset&lt;/a&gt;&lt;/p&gt;
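The insulation idea itself is just selective updating. Here is a toy sketch with made-up parameter names; in a real PyTorch model you would instead set requires_grad to False on the backbone parameters.

```python
# Toy illustration of knowledge insulation: gradient updates are applied
# only to the action expert, leaving the VLM backbone untouched so its
# web-scale knowledge is not overwritten during robot fine-tuning.
params = {
    "vlm.attn.weight": 1.0,
    "vlm.mlp.weight": 2.0,
    "action_expert.head.weight": 0.5,
}
grads = {name: 0.1 for name in params}  # pretend gradients
FROZEN_PREFIX = "vlm."
LR = 0.01

for name, grad in grads.items():
    if name.startswith(FROZEN_PREFIX):
        continue  # insulated: the backbone receives no update
    params[name] = params[name] - LR * grad
```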

&lt;h3&gt;
  
  
  2.2 Key Ideas That Failed
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Naive Proprioception Integration&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Feeding raw robot state (joint angles, end-effector poses) directly as additional tokens creates shortcut learning. Policies overfit to state-action memorization rather than visual reasoning, degrading spatial generalization. In testing, models trained with proprioception fail when object positions deviate slightly from training trajectories. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode&lt;/strong&gt;: A study on visuomotor policies found that proprioceptive states cause "shortcuts where the policy directly associates absolute configurations with actions," leading to 40-60% success rate drops under spatial perturbations. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Monolithic Scaling Without Architectural Innovation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Simply increasing VLM backbone size (e.g., RT-2-X's 55B parameters) yields diminishing returns for robot control. The computational overhead (15GB of GPU memory for inference at 6Hz) makes real-time deployment impractical. Larger models also struggle with action token accuracy, as the vast parameter space prioritizes language modeling over control precision. &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empirical Evidence&lt;/strong&gt;: OpenVLA's 7B model matches RT-2-X's performance despite 7× fewer parameters, suggesting data diversity and training recipe matter more than scale. &lt;a href="http://arxiv.org/pdf/2406.09246.pdf" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Single-Modality Action Generation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Pure autoregressive or pure diffusion approaches each have blind spots. Autoregressive models struggle with continuous precision (quantization error), while diffusion models lack the reasoning depth of VLMs for long-horizon planning. HybridVLA attempted to combine both but introduced training interference between the two generation paradigms, requiring complex collaborative ensemble mechanisms that increased inference latency. &lt;a href="https://arxiv.org/abs/2503.10631" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Open Source Model Comparison: OpenVLA vs. SmolVLA vs. Pi0
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OpenVLA (7B)&lt;/th&gt;
&lt;th&gt;SmolVLA (&amp;lt;0.5B)&lt;/th&gt;
&lt;th&gt;Pi0.6 (5B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backbone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Llama 2 + DINOv2 + SigLIP&lt;/td&gt;
&lt;td&gt;Qwen 2.5 0.5B + custom ViT&lt;/td&gt;
&lt;td&gt;Gemma3 4B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action Head&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Autoregressive tokens (256 bins)&lt;/td&gt;
&lt;td&gt;Flow matching (continuous)&lt;/td&gt;
&lt;td&gt;Hybrid: FAST tokens + flow matching &lt;a href="https://website.pi-asset.com/pi06star/PI06_model_card.pdf" rel="noopener noreferrer"&gt;website.pi-asset&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;970k demos (OpenX dataset)&lt;/td&gt;
&lt;td&gt;Public community datasets&lt;/td&gt;
&lt;td&gt;Proprietary large-scale corpus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 Hz on RTX 4090 &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;12.5 Hz on L40s (2.5× faster than OpenVLA) &lt;a href="https://ai.stanford.edu/blog/minivla/" rel="noopener noreferrer"&gt;ai.stanford&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;5-10 Hz (denoising steps dependent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key Innovation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-embodiment generalization&lt;/td&gt;
&lt;td&gt;Asynchronous inference stack&lt;/td&gt;
&lt;td&gt;Knowledge insulation + RL fine-tuning &lt;a href="https://www.pi.website/blog/pistar06" rel="noopener noreferrer"&gt;pi&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simulation Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62% on LIBERO-90 &lt;a href="https://ai.stanford.edu/blog/minivla/" rel="noopener noreferrer"&gt;ai.stanford&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;77% on LIBERO-90 (w/ action chunks) &lt;a href="https://ai.stanford.edu/blog/minivla/" rel="noopener noreferrer"&gt;ai.stanford&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;State-of-the-art on LIBERO-5 (96.5%) &lt;a href="https://arxiv.org/abs/2508.19236" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-World Strength&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generalization across robots&lt;/td&gt;
&lt;td&gt;Deployment on consumer GPUs&lt;/td&gt;
&lt;td&gt;Long-horizon tasks (coffee making, laundry) &lt;a href="https://www.pi.website/blog/pistar06" rel="noopener noreferrer"&gt;pi&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critical Weakness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slow inference, quantization error&lt;/td&gt;
&lt;td&gt;Limited long-horizon reasoning&lt;/td&gt;
&lt;td&gt;Proprietary, computationally intensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architectural Deep Dive&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA&lt;/strong&gt; follows the RT-2 blueprint faithfully: discretize actions, append to vocabulary, train with cross-entropy loss. Its strength lies in the curated OpenX dataset diversity, enabling zero-shot control of unseen robots. However, the autoregressive generation bottleneck limits real-time performance: 15GB of GPU memory and 6Hz inference constrain deployment to high-end hardware. &lt;a href="http://arxiv.org/pdf/2406.09246.pdf" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLA&lt;/strong&gt; challenges the "bigger is better" orthodoxy. By using a compact VLM and flow matching action expert, it achieves comparable performance with 14× fewer parameters. The asynchronous inference stack decouples perception from action generation, allowing new chunks to be predicted while the robot executes previous commands. This is particularly impactful for dynamic environments where reaction time matters. &lt;a href="https://huggingface.co/blog/smolvla" rel="noopener noreferrer"&gt;huggingface&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi0.6&lt;/strong&gt; represents the hybrid extreme: it trains the VLM backbone on FAST discrete tokens while the action expert predicts continuous flows. Knowledge insulation prevents gradient interference, and offline RL pre-training (Recap) doubles throughput on complex tasks. The model's hierarchical design supports heterogeneous prompts, enabling high-level task conditioning. The trade-off is accessibility: Pi0.6's training requires proprietary data and substantial compute, limiting reproducibility. &lt;a href="https://www.pi.website/blog/pistar06" rel="noopener noreferrer"&gt;pi&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Humanoid Gap Report: Missing Capabilities for Hand Manipulation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Proprioception and Tactile Integration
&lt;/h3&gt;

&lt;p&gt;Current VLAs treat proprioception as auxiliary inputs, leading to shortcut learning and poor spatial generalization. Humanoid hands require fine-grained force feedback and slip detection, capabilities absent in standard VLA pipelines. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: No open-source VLA integrates tactile sensing end-to-end. ForceVLA and AnyTouch explore Mixture-of-Experts for contact-rich tasks, but these remain research prototypes. The lack of large-scale tactile datasets mirrors the early scarcity of robot demonstrations. &lt;a href="https://www.themoonlight.io/en/review/survey-of-vision-language-action-models-for-embodied-manipulation" rel="noopener noreferrer"&gt;themoonlight&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Develop a "Tactile VLA" that fuses vision, language, and distributed pressure sensor arrays. The architecture should use tactile tokens analogous to image patches, enabling the VLM backbone to reason about contact forces and friction constraints.&lt;/p&gt;
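To make "tactile tokens analogous to image patches" concrete, here is a minimal sketch: a pressure-sensor grid split into flattened patches, ViT-style. The grid and patch sizes are arbitrary illustrative choices, not a proposal for a real sensor layout.

```python
import numpy as np

# Sketch of tactile tokenization: a fingertip pressure grid is split
# into non-overlapping patches and flattened, mirroring how ViTs turn
# image patches into tokens for the transformer backbone.
pressure = np.random.default_rng(0).random((16, 16))  # simulated sensor grid
PATCH = 4

patches = (
    pressure.reshape(4, PATCH, 4, PATCH)  # (row_block, row_in, col_block, col_in)
    .transpose(0, 2, 1, 3)                # group the two block axes together
    .reshape(16, PATCH * PATCH)           # 16 tokens of 16 pressure values each
)
# Each row of `patches` could now pass through a linear embedding layer
# and join the vision/language token stream.
```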

&lt;h3&gt;
  
  
  4.2 Long-Horizon Planning and Memory
&lt;/h3&gt;

&lt;p&gt;Humanoid manipulation tasks (e.g., assembling furniture) span 5–20 minutes and require remembering partial progress. Standard VLAs operate with Markovian assumptions and fixed context windows, causing failure when intermediate steps are ambiguous. &lt;a href="https://arxiv.org/html/2410.24164v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: MemoryVLA demonstrates perceptual-cognitive memory banks for manipulation, but its evaluation is limited to tabletop tasks. Humanoid whole-body control introduces additional complexity: locomotion plans must be retained while hands execute fine manipulations. &lt;a href="https://arxiv.org/abs/2508.19236" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Implement a hierarchical memory system with (1) working memory for immediate action chunks and (2) episodic memory for task-level progress. The hippocampal-inspired consolidation mechanism from MemoryVLA could scale to humanoid tasks by encoding proprioceptive trajectories alongside visual observations. &lt;a href="https://arxiv.org/abs/2508.19236" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;
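A minimal sketch of that two-level split, a bounded working memory plus an episodic log, follows below; the class and method names are hypothetical, not taken from MemoryVLA.

```python
from collections import deque

# Hypothetical two-level memory for a long-horizon policy: a bounded
# working memory holds only recent action chunks, while an episodic log
# of completed subtasks survives the context-window limit.
class HierarchicalMemory:
    def __init__(self, working_size=4):
        self.working = deque(maxlen=working_size)  # recent chunks only
        self.episodic = []                         # task-level progress

    def observe_chunk(self, chunk):
        self.working.append(chunk)  # oldest chunk is evicted automatically

    def complete_subtask(self, name):
        self.episodic.append(name)

    def context(self):
        """What the policy conditions on at the next step."""
        return {"recent": list(self.working), "done": list(self.episodic)}

mem = HierarchicalMemory(working_size=2)
for step in ["reach", "grasp", "lift"]:
    mem.observe_chunk(step)
mem.complete_subtask("pick up leg A")
```

The point of the split: "reach" has already been evicted from working memory, but the completed subtask remains queryable for the rest of the episode.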

&lt;h3&gt;
  
  
  4.3 Physics-Aware Action Generation
&lt;/h3&gt;

&lt;p&gt;VLAs hallucinate physically implausible actions, predicting grasps that violate kinematic constraints or object trajectories that ignore gravity. This stems from the VLM backbone's pixel-space reasoning lacking 3D physical grounding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: GeoVLA and 3D-VLA integrate point clouds and depth maps, but these are add-ons rather than core architectural features. The models still prioritize semantic alignment over physical feasibility. &lt;a href="https://arxiv.org/abs/2508.09071" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Embed a differentiable physics simulator within the VLA training loop. Actions could be penalized for violating Newtonian mechanics, similar to how RL uses physics-based rewards. The "visual foresight" approach in F1-VLA shows promise: predicting next visual states correlates with action reliability, suggesting that generative world models could enforce physical consistency. &lt;a href="https://arxiv.org/html/2509.06951v2" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Sim-to-Real for Humanoid Morphology
&lt;/h3&gt;

&lt;p&gt;Humanoid robots exhibit high-dimensional action spaces (30+ DOF) and complex contact dynamics. Current sim-to-real methods rely on domain randomization, which fails to capture the nuance of bipedal balance and bimanual coordination. &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12292580/" rel="noopener noreferrer"&gt;pmc.ncbi.nlm.nih&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: HumanVLA demonstrates vision-language directed object rearrangement but requires privileged state information and hand-crafted finite state machines. The sim-to-real gap remains significant: a 17% failure rate in real-world experiments, primarily due to depth sensing errors and contact estimation delays. &lt;a href="https://arxiv.org/html/2406.19972v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Leverage human video data as an intermediate domain. EgoVLA extracts wrist and hand actions from egocentric videos, using inverse kinematics to retarget to robot hands. This "human-to-robot" transfer could bootstrap humanoid VLA training without expensive real robot data collection. &lt;a href="https://rchalyang.github.io/EgoVLA/" rel="noopener noreferrer"&gt;rchalyang.github&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Critical Disagreements and Uncertainties
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Disagreement 1: Proprioception's Role&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proponents&lt;/strong&gt;: Proprioception provides compact, accurate state information essential for precise servo control. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critics&lt;/strong&gt;: End-to-end visuomotor policies without explicit state inputs achieve better spatial generalization, as they cannot memorize trajectories. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt;: The consensus is shifting toward &lt;em&gt;conditioned&lt;/em&gt; proprioception, using state inputs only for low-level control while keeping high-level reasoning vision-driven, as seen in Helix's dual-system architecture. &lt;a href="https://www.iotworldtoday.com/robotics/humanoid-robots-learn-to-work-together-natural-language-control" rel="noopener noreferrer"&gt;iotworldtoday&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disagreement 2: Action Representation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization Camp&lt;/strong&gt;: Discrete tokens enable direct VLM transfer and chain-of-thought reasoning (OpenVLA, RT-2). &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diffusion Camp&lt;/strong&gt;: Continuous flow matching captures action continuity and supports variable horizons (Pi0, SmolVLA). &lt;a href="https://www.youtube.com/watch?v=T1PhkCQDCcc" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt;: Hybrid approaches (Pi0.6, HybridVLA) are emerging as the synthesis, but training interference remains an open problem. &lt;a href="https://arxiv.org/abs/2503.10631" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Uncertainty&lt;/strong&gt;: The optimal data mixture ratio for humanoid VLAs is unknown. RT-2 used 10% robotics data, while OpenVLA uses 100%. For humanoids, where robot data is scarcer, more aggressive web-scale pre-training may be needed, but this risks physics misalignment.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;VLA models have evolved from passive VLMs to active embodied agents, but the leap to reliable humanoid manipulation remains incomplete. The open-source ecosystem (OpenVLA, SmolVLA) has democratized access, yet critical gaps persist in proprioceptive reasoning, long-horizon memory, and physics-aware generation.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>vla</category>
      <category>ai</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Teleoperation Data Quality for Imitation Learning: What Actually Breaks the Model</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:53:43 +0000</pubDate>
      <link>https://dev.to/ankk98/teleoperation-data-quality-for-imitation-learning-what-actually-breaks-the-model-1abc</link>
      <guid>https://dev.to/ankk98/teleoperation-data-quality-for-imitation-learning-what-actually-breaks-the-model-1abc</guid>
      <description>&lt;p&gt;&lt;em&gt;Practical rubric design and failure modes from auditing robot teleop datasets (e.g. &lt;a href="https://github.com/huggingface/lerobot" rel="noopener noreferrer"&gt;LeRobot&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this post
&lt;/h2&gt;

&lt;p&gt;We audited teleoperation episodes for an imitation-learning pipeline. Removing poor-quality episodes (about 20–40% in our case) led to clearly better learning; the literature often reports ~10–15% policy improvement from similar filtering. This post covers &lt;strong&gt;rubric mistakes that cause inconsistent scores&lt;/strong&gt; and &lt;strong&gt;failure modes&lt;/strong&gt; we kept seeing.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Rubric mistakes and how to fix them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Metrics that sound clear but aren’t.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Example: “Mistake-to-Recovery-Ratio.” People disagree: Is it (total mistakes)/(total recoveries) or (total mistakes)/(total recovery &lt;em&gt;attempts&lt;/em&gt;)? If a pick fails, then fails again, then succeeds, is that one recovery or two attempts?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it should be:&lt;/strong&gt; Define one ratio per episode. Count each &lt;em&gt;distinct&lt;/em&gt; mistake once (each new failure event). Count a &lt;em&gt;recovery&lt;/em&gt; only when the operator successfully got back on track; failed attempts in between don’t add extra recoveries. Write this in the rubric: “Count a recovery only when intended behavior has resumed; don’t count failed attempts as new mistakes unless it’s a new failure (e.g. new drop).” If you also want to penalize messy recoveries, add a separate “recovery attempts per mistake” number.&lt;/p&gt;
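&lt;p&gt;As a sketch, the counting rule above can be written down directly (the event format here is hypothetical; a real audit tool would read annotated episode logs):&lt;/p&gt;

```python
# Counting rule sketch: each distinct failure event counts as one mistake;
# a recovery counts only when intended behavior has resumed. Failed
# attempts in between add neither mistakes nor recoveries.
def mistake_recovery_ratio(events):
    """Return (mistakes, recoveries, ratio) for one episode's event list."""
    mistakes = sum(1 for e in events if e == "mistake")
    recoveries = sum(1 for e in events if e == "recovery")
    ratio = mistakes / recoveries if recoveries else float("inf")
    return mistakes, recoveries, ratio

# Pick fails, fails again mid-recovery (not a new failure event), then
# succeeds: one mistake, one recovery, ratio 1.0.
episode = ["mistake", "attempt", "attempt", "recovery"]
print(mistake_recovery_ratio(episode))  # (1, 1, 1.0)
```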

&lt;p&gt;&lt;strong&gt;Mistake 2: No rule for overall quality.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Scorers give an overall High when most dimensions are High even though one is Low. The "High" label then stops being strict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it should be:&lt;/strong&gt; Overall = &lt;strong&gt;Low&lt;/strong&gt; if any dimension is Low; &lt;strong&gt;High&lt;/strong&gt; only if all dimensions are High. One bad dimension pulls the episode down.&lt;/p&gt;
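&lt;p&gt;In code, the aggregation rule is one line (dimension names are made up for illustration; scores are assumed binary High/Low):&lt;/p&gt;

```python
# Overall = Low if any dimension is Low; High only if all are High.
def overall_quality(dimension_scores):
    return "Low" if "Low" in dimension_scores.values() else "High"

print(overall_quality({"sync": "High", "visibility": "Low"}))   # Low
print(overall_quality({"sync": "High", "visibility": "High"}))  # High
```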




&lt;h2&gt;
  
  
  2. Failure modes we kept seeing
&lt;/h2&gt;

&lt;p&gt;Each failure mode below: a short name (formal term) plus a plain-language meaning, one line each; add a screenshot or GIF per item when you publish.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Post-task idle / run-on footage&lt;/strong&gt; (extra 10–15 s of video after the task is done). Dilutes the signal; policy can learn to linger.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal misalignment&lt;/strong&gt; (sync issues between cameras or sensors). Bad for multi-view or fusion; causes inconsistent state.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-collision / kinematic clash&lt;/strong&gt; (arm hits itself or the body). Unsafe; don’t let the policy imitate it.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low contrast / poor observability&lt;/strong&gt; (white background, same-color object, or bad lighting). Object hard to see; weak visual signal.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rubric incompleteness&lt;/strong&gt; (scorers disagree or don’t know how to score). Add explicit rules and examples; flag “undefined” cases and fix the rubric before locking scores.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeated failures before success&lt;/strong&gt; (e.g. 3–5 pick attempts before one works). Noisy trajectory; can teach hesitation.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over-ideal / low-complexity conditions&lt;/strong&gt; (too easy, no obstacles). Can bias the dataset; score complexity separately or down-weight.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Impact
&lt;/h2&gt;

&lt;p&gt;After fixing the rubric and removing Low-quality episodes (20–40%), retraining gave noticeably better results. Studies on filtering teleop data often report ~10–15% (or more) policy gain. &lt;strong&gt;Define metrics and overall quality clearly, then audit before scaling data.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rubric:&lt;/strong&gt; Define “mistake” and “recovery” in writing; one ratio per episode. Overall quality = Low if any dimension is Low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure modes:&lt;/strong&gt; Post-task idle, sensor sync, arm clashes, poor visibility, rubric gaps, repeated failed attempts, over-ideal setup. Name them, add examples (screenshots/GIFs), score consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt; a chunk of bad episodes is high leverage; do it before collecting more.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>lerobot</category>
      <category>vla</category>
      <category>ai</category>
    </item>
    <item>
      <title>Ghibli moment for 3D Printing</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Thu, 05 Feb 2026 12:11:19 +0000</pubDate>
      <link>https://dev.to/ankk98/ghibli-moment-for-3d-printing-1lh1</link>
      <guid>https://dev.to/ankk98/ghibli-moment-for-3d-printing-1lh1</guid>
      <description>&lt;p&gt;I bought my first 3D printer this week to make parts for the robot I'm building.&lt;br&gt;
Even though I've seen 3D prints online for years, watching it work on my desk feels completely different.&lt;/p&gt;

&lt;p&gt;The print head moves slowly, laying down each thin line of plastic.&lt;br&gt;
At the start it looks like nothing, just squiggles.&lt;br&gt;
But layer by layer, an actual object appears, as if the room is quietly drawing in 3D.&lt;/p&gt;

&lt;p&gt;It is strangely calming to watch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35paqsgbig9d9dq90s1h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35paqsgbig9d9dq90s1h.jpg" alt="3D printer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I keep thinking about all the little things I’ve wanted over the years like headphone stands, cable holders, desk gadgets.&lt;br&gt;
Earlier, they were just “nice to have” ideas that I would forget about.&lt;br&gt;
Now I feel like I have this small superpower to do &lt;em&gt;&lt;strong&gt;shaka laka boom boom&lt;/strong&gt;&lt;/em&gt; and make them real.&lt;/p&gt;

&lt;p&gt;Friends who visit are equally fascinated.&lt;br&gt;
Everyone has one object they’ve always wanted: a custom mount, a tiny figurine, some organizer for their setup.&lt;br&gt;
The printer is already “booked” for many days to come with all these requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3qcmzybsdjo320lfa8d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3qcmzybsdjo320lfa8d.jpg" alt="Benchy Boat" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What still surprises me is how affordable this has become.&lt;br&gt;
The printer itself cost around 15k INR, which is not that far from what people pay for a regular home printer.&lt;br&gt;
It feels like we quietly crossed a line where this stopped being a futuristic toy and became just another tool.&lt;/p&gt;

&lt;p&gt;Before buying it, I had reached out to more than 20 printing vendors to get my robot parts made.&lt;br&gt;
Most of them took 3-4 days just to reply.&lt;br&gt;
Then they needed another 10 days or so for the actual printing.&lt;br&gt;
The quotes I got were between 70k and 120k INR, and this was before GST and delivery.&lt;/p&gt;

&lt;p&gt;In the end, I bought the printer for about 15k, spent around 5k on filament, another 10k on a few big parts I still outsourced, and finished everything for under 30k.&lt;br&gt;
The cost difference alone almost forced the decision.&lt;/p&gt;

&lt;p&gt;Now I keep noticing new machines that can even turn 2D photos into 3D models.&lt;br&gt;
The ecosystem already feels quite mature and surprisingly accessible.&lt;br&gt;
It seems like we’re just one Studio Ghibli style moment away from this becoming completely mainstream.&lt;/p&gt;

&lt;p&gt;For now, though, it still feels like a niche hobby.&lt;br&gt;
Most people I know have heard of 3D printing, but have never actually used it.&lt;br&gt;
Someone just needs to make the whole experience a bit simpler, tell the right story, and this will explode.&lt;/p&gt;

</description>
      <category>3dprinting</category>
    </item>
    <item>
      <title>The Hardest Part of Physical AI isn't the Brain</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Thu, 22 Jan 2026 14:00:17 +0000</pubDate>
      <link>https://dev.to/ankk98/the-hardest-part-of-physical-ai-isnt-the-brain-1d1j</link>
      <guid>https://dev.to/ankk98/the-hardest-part-of-physical-ai-isnt-the-brain-1d1j</guid>
      <description>&lt;p&gt;Software engineers entering robotics often make a fundamental category error: they treat humanoids like servers with legs. In the cloud, "move fast and break things" is a mantra. In the physical world, breaking things costs $50,000 and sets your timeline back by quarters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/ankk98/status/2014331393103552608" rel="noopener noreferrer"&gt;The physical constraints dictate the solution space more than the algorithm ever will.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider the battle between &lt;strong&gt;Tesla and Waymo&lt;/strong&gt;. Tesla won the early race for scale because they optimized aggressively around hardware constraints. They built their AI stack to run on compute designed specifically for their cars, leveraging the existing fleet. Waymo, while technically brilliant, relied on expensive, complex sensor suites that were harder to mass-produce. Tesla understood that to win, you don't just add software to a car; you design the car &lt;em&gt;for&lt;/em&gt; the software.&lt;/p&gt;

&lt;p&gt;The same principle applies to &lt;strong&gt;mobile phones&lt;/strong&gt;. Every OS feature is strictly bounded by battery life and thermal throttling. The hardware shapes the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humanoids, however, will be 10x harder.&lt;/strong&gt; Unlike a car (wheels) or a phone (static), a humanoid has dozens of moving parts—joints, actuators, and fingers—all requiring high torque and low latency. The complexity of maintaining physical reliability scales exponentially with every degree of freedom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ola Electric&lt;/strong&gt; offers a cautionary tale. They applied a "software iteration" speed to hardware manufacturing. The result? Thermal issues, panel gaps, and recalls. They learned the hard way that you cannot "refactor" a battery or "hot-patch" a motor. A software bug is a quick fix; a hardware bug is a logistical nightmare.&lt;/p&gt;

&lt;p&gt;This is why the recent partnership between &lt;strong&gt;Google and Boston Dynamics&lt;/strong&gt; is so significant. Google historically struggles with the physical friction of hardware (see Nest/Stadia), while Boston Dynamics has mastered the "Body"—the durability, balance, and actuation. By combining Google’s "Brain" (AI/Cloud) with BD’s physical capability, they create a force multiplier. They acknowledge that physical engineering is a distinct discipline from data science.&lt;/p&gt;

&lt;p&gt;To succeed in Physical AI, we must prioritize reliability over intelligence. Before optimizing the LLM, we must optimize the cooling, the battery density, and the sensor durability. If you can’t keep the body alive, the code doesn't matter.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>humanoid</category>
      <category>ai</category>
    </item>
    <item>
      <title>Can a Humanoid Robot Recognize and Remember My Face?</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Mon, 19 Jan 2026 21:45:17 +0000</pubDate>
      <link>https://dev.to/ankk98/can-a-humanoid-robot-recognize-and-remember-my-face-23ek</link>
      <guid>https://dev.to/ankk98/can-a-humanoid-robot-recognize-and-remember-my-face-23ek</guid>
      <description>&lt;p&gt;&lt;em&gt;A student walks into a robotics lab with a simple question. The expert smiles and begins unraveling the mystery.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Can a humanoid robot recognize my face?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, &lt;em&gt;right now&lt;/em&gt;. Face recognition (FaceNet, InsightFace) is ~99% accurate in controlled settings.[19][21] But come back in 5 minutes? The robot has completely forgotten you exist.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why does it forget me?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because its brain (Vision-Language-Action models, or VLAs) only sees 1-2 seconds of reality at a time - just 2-4 video frames.[3][15] Imagine having amnesia every second.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why can't it just look at more frames?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because transformer attention - the math that makes VLAs work - is O(T²) where T = frames. Doubling frames costs 4× more computation. 30 frames needs 100× the power of 3 frames (30²/3² = 900/9).[3][4] The robot would need a nuclear reactor to think.&lt;/p&gt;
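&lt;p&gt;The arithmetic is easy to check (a toy cost model, ignoring everything except the quadratic attention term):&lt;/p&gt;

```python
# Relative attention cost of T frames versus a 3-frame baseline: (T/3)^2.
def relative_attention_cost(frames, baseline=3):
    return (frames ** 2) / (baseline ** 2)

print(relative_attention_cost(6))   # 4.0  (double the frames, 4x the cost)
print(relative_attention_cost(30))  # 100.0  (900 / 9)
```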




&lt;p&gt;&lt;strong&gt;"So the real problem is compute?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly. But here's the plot twist: you don't &lt;em&gt;need&lt;/em&gt; all frames. You only need the &lt;em&gt;important&lt;/em&gt; ones. And you don't store pixels - just compact features. That's 100-1000× compression without losing recognition ability.[2][6][26]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Wait... is there actually a way to solve this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Researchers have already solved the individual pieces (smart frame selection, compression, efficient attention). But nobody has stitched them together into a working robot. That's the frontier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Face Recognition 101
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Okay, so how does face recognition actually work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot converts your face into an "embedding" - a number vector where similar faces have similar coordinates. FaceNet uses 128 dimensions; InsightFace uses 512. Your face in sunlight and your face at night live in nearby neighborhoods of this abstract space.[19][21]&lt;/p&gt;
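&lt;p&gt;A minimal sketch of what "nearby neighborhoods" means, using cosine similarity on toy 3-dim vectors (real FaceNet/InsightFace embeddings are 128- or 512-dim):&lt;/p&gt;

```python
import math

# Cosine similarity: 1.0 means identical direction, near 0 or below
# means unrelated. Same-person embeddings should score close to 1.0.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

you_daylight = [0.9, 0.1, 0.4]
you_night    = [0.85, 0.15, 0.42]  # same person, slightly shifted
stranger     = [-0.3, 0.9, -0.1]

print(cosine_similarity(you_daylight, you_night))  # close to 1.0
print(cosine_similarity(you_daylight, stranger))   # much lower
```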




&lt;p&gt;&lt;strong&gt;"That's... beautiful? But how did it learn this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trained on millions of face pairs with a technique called "triplet loss": push embeddings of the same person together, push embeddings of different people far apart. After seeing enough examples, patterns emerge.[19][21]&lt;/p&gt;
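&lt;p&gt;Triplet loss itself is a short formula. A sketch with toy 2-dim embeddings (real training runs over batches with hard-negative mining):&lt;/p&gt;

```python
# Triplet loss: penalize when the anchor is not closer to the positive
# (same person) than to the negative (different person) by at least a margin.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

anchor   = [0.0, 1.0]
positive = [0.1, 0.9]   # same person
negative = [1.0, 0.0]   # different person

print(triplet_loss(anchor, positive, negative))  # 0.0 - already well separated
```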




&lt;p&gt;&lt;strong&gt;"How accurate is it, really?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a lab with good lighting: 99%. In the real world with varying lighting, makeup, sunglasses: 85-92%. After 1 month, accuracy remains high (&amp;gt;90%) for adults with stable appearance; degradation is minimal over short intervals.[5][14] Studies show 98%+ accuracy even after 6 months for adults, with larger drops occurring over years.[30]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What trips it up?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lighting changes, occlusion (masks, sunglasses), makeup, aging, and crowded scenes where extracting faces is messy. Basically, anything that changes how the pixels look.[5][14] But some changes hit harder: growing a beard can drop accuracy 10-25× for mismatched facial hair styles.[33] Sunglasses (upper-face occlusion) can drop accuracy from ~93% to ~37%.[34] Growing children face even bigger challenges - infants under 1 year show only ~30% accuracy over 6-month gaps.[35]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Can we make it more robust?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sort of. Ensemble methods (run multiple models, vote on the answer) help. Confidence thresholds work. Training with diverse appearances (beards, glasses, different ages) improves robustness.[33][34] For children, systems need age-invariant features or regular re-enrollment every 6-12 months.[35] But the honest answer is to ask the human when you're uncertain: "Are you Alice? You look similar to someone I know."[19][21]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about growing beards, glasses, or children?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beard changes: Adding or removing facial hair can cause 10-25× increase in false non-match rates, especially mustaches.[33] Glasses: Upper-face occlusion (sunglasses) drops accuracy from ~93% to ~37% - worse than masks.[34] Growing children: Infants (0-1 year) show only ~30% accuracy over 6 months; toddlers (2-3 years) improve to ~65%.[35] For children, systems need frequent re-enrollment or age-invariant modeling.[35]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The VLA Bottleneck
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What exactly is a VLA?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vision-Language-Action model. A neural network that takes three inputs: camera frames, language instructions ("pick up the red cup"), and outputs robot commands (move arm, open gripper).[15][18]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Examples?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RT-2 (DeepMind, closed). OpenVLA (Carnegie Mellon, open-source 7B). Qwen-VL (Alibaba). VideoVLA (2025, understands motion). OpenVLA is the best starting point for building your own system.[11][15][28]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Wait - can VLAs recognize faces?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. VLAs (OpenVLA, SmolVLA, Pi 0.6) are trained for manipulation tasks, not person identification. They understand objects and scenes, not individual faces. You need a separate face recognition module (InsightFace, FaceNet) that extracts face embeddings, then integrate those into the robot's memory system. The VLA handles actions; face recognition handles identity.[11][15]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why do they only process 2-4 frames?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Control loops run at 50 Hz (20ms per cycle). Optimized VLAs on high-end GPUs achieve 20-40ms inference; typical systems take 50-150ms.[31] That leaves little time for deep video analysis when processing many frames.[24][26][28]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What if we optimize VLA inference?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with optimization: KV cache tricks (reuse computation), sparse attention (skip unimportant tokens), quantization (use 4-bit math instead of 32-bit): 30 frames still takes 100+ ms. Too slow.[1][4][29]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"So we can never extend context?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wrong assumption. CronusVLA (2025) uses a clever trick: extract &lt;em&gt;motion features&lt;/em&gt; instead of processing raw pixels, caching past features to avoid recomputing the vision backbone.[26] This enables multi-frame context with minimal overhead compared to naive frame stacking.[26]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Extending Context
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"How do we extend context efficiently?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three independent tricks that stack: (1) Select only important frames (not all frames). (2) Compress frames to features (not pixels). (3) Use efficient attention patterns (not full attention).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Trick 1: Which frames matter?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Motion-based selection: keep frames with high optical flow (stuff is changing), skip static frames. 15-20× compression with minimal accuracy loss. Or use learned importance (VLM scores which frames matter for your task).[2][5][12]&lt;/p&gt;
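&lt;p&gt;A minimal sketch of motion-based selection, using mean absolute pixel difference as a cheap stand-in for optical flow magnitude (a real pipeline would compute flow, e.g. OpenCV's Farneback method):&lt;/p&gt;

```python
# Keep the first frame, then keep any frame that differs enough from the
# last kept frame; runs of static frames are skipped.
def select_frames(frames, threshold=10.0):
    kept = [0]
    for i in range(1, len(frames)):
        prev = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], prev)) / len(prev)
        if diff > threshold:
            kept.append(i)
    return kept

# Toy "frames" of four brightness values each.
static = [100, 100, 100, 100]
moved  = [160, 40, 180, 20]
video = [static, static, static, moved, moved, static]
print(select_frames(video))  # [0, 3, 5] - repeats are dropped
```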




&lt;p&gt;&lt;strong&gt;"Any other selection methods?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-armed bandit for constrained budgets (2025 research). Or hierarchical: keep recent frames densely, older frames sparsely. Or genetic algorithms (academic, not practical). Motion-based works well in practice.[2][12][14]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Trick 2: Compress frames?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't store 6 MB per frame (RGB pixels). Store pooled features (50 KB, 120× smaller) using max-pooling. Motion features from optical flow can compress temporal information, but face recognition typically requires appearance features combined with motion for best results.[10][13][15]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"How does max-pooling work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take every 2×2 grid of pixels, keep the strongest signal, discard the rest. Repeat four or five times: 1080p → ~64×64 → ~32×32. You lose spatial detail but preserve what matters for recognition. At 64×64, expect a 5-15% accuracy drop; at 32×32, a 20-40% drop depending on conditions.[10][13][32]&lt;/p&gt;
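&lt;p&gt;One pooling step, sketched on a toy 4×4 "image" of brightness values:&lt;/p&gt;

```python
# 2x2 max-pooling: each output cell is the strongest signal in its
# 2x2 input block, so width and height both halve.
def max_pool_2x2(img):
    h, w = len(img), len(img[0])
    return [
        [max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

img = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 9, 8],
    [1, 2, 7, 6],
]
print(max_pool_2x2(img))  # [[4, 2], [2, 9]]
```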




&lt;p&gt;&lt;strong&gt;"What about temporal compression?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TempMe (2025 paper): cluster similar consecutive frames, keep 1 representative per cluster. Result: 95% token reduction in video. Faster inference. Sometimes even better accuracy (less noise).[6]&lt;/p&gt;
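&lt;p&gt;The core idea in miniature (hedged: TempMe merges tokens inside the transformer; this sketch just collapses runs of near-identical frame features into one representative each):&lt;/p&gt;

```python
# Cluster similar consecutive frame features: keep a feature only if it
# differs from the last kept representative by more than a tolerance.
def merge_consecutive(features, tol=0.1):
    reps = [features[0]]
    for f in features[1:]:
        last = reps[-1]
        if max(abs(a - b) for a, b in zip(f, last)) > tol:
            reps.append(f)
    return reps

features = [[0.5, 0.5], [0.52, 0.49], [0.51, 0.5], [0.9, 0.1], [0.91, 0.12]]
print(len(merge_consecutive(features)))  # 2 - five frames, two representatives
```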




&lt;p&gt;&lt;strong&gt;"Trick 3: Efficient attention?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard: query attends to every past token (O(T²) cost). Efficient: (a) KV cache - reuse computation from previous steps. (b) Grouped Query Attention - multiple query heads share one KV head (4× smaller cache). (c) Sparse attention - only attend to important positions.[1][4][29]&lt;/p&gt;
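&lt;p&gt;The GQA saving is simple to size (illustrative numbers, not any specific model's config):&lt;/p&gt;

```python
# Back-of-envelope KV-cache sizing: the cache scales with the number of
# KV heads, so sharing one KV head among several query heads shrinks it.
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, bytes_per_val=2):
    # keys + values, per layer, fp16
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_val

full_mha = kv_cache_bytes(seq_len=4096, n_kv_heads=32, head_dim=128)
gqa      = kv_cache_bytes(seq_len=4096, n_kv_heads=8,  head_dim=128)
print(full_mha // gqa)  # 4 - 32 query heads sharing 8 KV heads: 4x smaller cache
```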




&lt;p&gt;&lt;strong&gt;"Combining all three?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Motion frame selection (15×) + temporal token merging (95%) + GQA + sparse = 100-1000× compression. Optimized systems can achieve 20-40ms latency on high-end GPUs.[31] Accuracy loss varies by compression level and task.[2][6][1]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: The Memory Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Okay, frames are compressed. Where do we &lt;em&gt;store&lt;/em&gt; them?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the hard part: limited RAM on the robot (8-16 GB shared with OS). Can't query disk fast enough for real-time. Need &lt;em&gt;multiple&lt;/em&gt; storage tiers, each optimized for different timescales.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Layers?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 0&lt;/strong&gt; (2 sec): Current frames in RAM. Real-time VLA inference. &amp;lt;1ms access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1&lt;/strong&gt; (60 sec): Compressed motion features on fast SSD. &amp;lt;20ms access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2&lt;/strong&gt; (1 hour): Face embeddings in vector database (Milvus). Similarity search in &amp;lt;100ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3&lt;/strong&gt; (months): Person identities in PostgreSQL. SQL queries in &amp;lt;10ms.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why separate tiers?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each tier optimizes for its job. Tier 0 is tiny and fast. Tier 3 is huge but doesn't need real-time speed. Together they cover seconds to months without exceeding your latency budget.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"How much storage?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tier 0: 0 MB (flushed continuously). Tier 1: ~100 MB. Tier 2: ~500 MB for the active window, pruned as embeddings get consolidated into Tier 3. Tier 3: ~1 MB per 1,000 people. Total for 1 month of operation: ~600 MB. Fits on a USB stick.[18][20]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about privacy? Is storing face data ethical?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, with consent and transparency. Users should opt-in, know what's stored, and be able to delete their data. Best practice: store embeddings (not raw images), encrypt at rest, allow deletion. Some jurisdictions (EU GDPR, some US states) require explicit consent for biometric data. Build privacy-by-design: minimal data, local-first storage, user control.[18]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: Real-Time Recognition Challenge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"So here's the hard part: when the robot sees someone, it needs to know &lt;em&gt;instantly&lt;/em&gt; who they are."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right. At 30 FPS, sometimes with more than one face per frame, you're getting dozens of faces per second. You can't run a vector-database query for each one - that's dozens of round-trips to disk every second. Game over.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What do we do?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Smart caching. The robot's most-used people (family, frequent visitors) stay hot in memory. Tier 0 gets an LRU cache of embeddings it's seen recently. Tier 1 tracks faces from the past hour.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Can you walk through this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Robot sees someone:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract face embedding (lightweight, ~5ms, can happen on spare GPU cycles)&lt;/li&gt;
&lt;li&gt;Check local cache (Tier 0): "Have I seen this embedding in the last 60 seconds?" If yes: instant match&lt;/li&gt;
&lt;li&gt;Cache miss? Check Tier 1 (motion features, faces from past hour): "Any motion features correlate with this face?" If yes: probably the same person&lt;/li&gt;
&lt;li&gt;Still no match? Query vector DB (Tier 2) &lt;em&gt;asynchronously&lt;/em&gt;. Don't block action loop.&lt;/li&gt;
&lt;li&gt;Query result arrives 50-100ms later. Robot incorporates into next decision.&lt;/li&gt;
&lt;/ol&gt;
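&lt;p&gt;Steps 1-4 above can be sketched as a tiered lookup (class and method names are hypothetical; the deeper tiers are stubbed out):&lt;/p&gt;

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class TieredMemory:
    def __init__(self):
        self.tier0 = {}  # name -> embedding, seen in the last ~60 s

    def recognize(self, emb, threshold=0.9):
        # Step 2: check the local cache first.
        best_name, best_sim = None, 0.0
        for name, known in self.tier0.items():
            sim = cos_sim(emb, known)
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_sim >= threshold:
            return best_name           # instant cache hit
        # Steps 3-4: deeper tiers, fired asynchronously in a real system.
        self.query_deeper_tiers(emb)
        return None                    # act conservatively meanwhile

    def query_deeper_tiers(self, emb):
        pass  # stub: Tier 1 motion features, Tier 2 vector DB, Tier 3 SQL

mem = TieredMemory()
mem.tier0["alice"] = [0.9, 0.1, 0.4]
print(mem.recognize([0.88, 0.12, 0.41]))  # 'alice' (cache hit)
print(mem.recognize([-0.3, 0.9, -0.1]))   # None -> "Hello! What's your name?"
```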




&lt;p&gt;&lt;strong&gt;"But what if the person hasn't been seen in 3 months?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly the query you're worried about. Robot can't afford synchronous queries. Solution: (a) Query Tier 3 in background thread. (b) Meanwhile, robot acts conservatively ("Hello! What's your name?"). (c) When query completes, update memory: "Oh! That was Alice!"&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"So the robot makes a guess while waiting for the database?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Correct. It's a reasonable tradeoff. Perfect accuracy takes 100ms. Approximate accuracy takes 20ms. For most tasks, approximate is fine, and you can refine later.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about false positives?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confidence thresholds + fallback. If embedding similarity is &amp;gt;0.9: "Welcome back, Alice!" If similarity is 0.75-0.9: "Are you Alice?" If &amp;lt;0.75: "Hello, new person!"&lt;/p&gt;
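&lt;p&gt;As a decision function (the cutoffs are the example values above, not calibrated constants):&lt;/p&gt;

```python
# Confidence-banded fallback: greet, ask, or treat as new.
def greeting(similarity):
    if similarity > 0.9:
        return "Welcome back, Alice!"
    if similarity >= 0.75:
        return "Are you Alice?"
    return "Hello, new person!"

print(greeting(0.95))  # Welcome back, Alice!
print(greeting(0.8))   # Are you Alice?
print(greeting(0.5))   # Hello, new person!
```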




&lt;p&gt;&lt;strong&gt;"How do we avoid querying vector DB 50 times per second?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batch queries&lt;/strong&gt;: Accumulate 10 faces, query once (amortizes latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bloom filters&lt;/strong&gt;: Quick "definitely not in database" check before expensive query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locality&lt;/strong&gt;: Faces in same location likely same person (temporal coherence)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;: Group embeddings into ~100 clusters, query cluster representative, not individual&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hottest 1000 people&lt;/strong&gt;: 99% of queries hit cache (pareto principle)&lt;/li&gt;
&lt;/ol&gt;
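&lt;p&gt;Item 2 is worth a sketch: a tiny Bloom filter gives a fast "definitely not in the database" answer (false positives are possible, false negatives are not):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    """Bit array + k hash positions per key; no stored keys, tiny memory."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False here means "definitely absent" - safe to skip the DB query.
        return all(self.bits[p] for p in self._positions(key))

known = BloomFilter()
known.add("alice")
print(known.might_contain("alice"))     # True
print(known.might_contain("stranger"))  # almost certainly False
```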




&lt;p&gt;&lt;strong&gt;"Which works best?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combination. Always check local cache first (0.1ms). Batch queries when cache misses (10ms per 10 faces). Cluster embeddings in vector DB (10× fewer distance calculations). Query Tier 3 asynchronously.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What's the latency real-time impact?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tier 0 cache hit: &amp;lt;1ms (recognition instant). Tier 1 batch query: ~15ms (30 FPS, can handle). Tier 2/3 async: 50-100ms (doesn't block control).&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7: Memory Updates and Consolidation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"After 3 months, the database is full of duplicate faces. Alice has been seen 500 times. How do we consolidate?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Periodic background job (runs every 30 minutes): cluster faces by similarity (embedding distance), compute centroid of each cluster, update Tier 3 with centroid + metadata.&lt;/p&gt;
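&lt;p&gt;The centroid step is just an average (a sketch, assuming the clustering step has already grouped embeddings per person):&lt;/p&gt;

```python
# One centroid per person replaces all of that person's recent embeddings.
def centroid(embeddings):
    n = len(embeddings)
    return [sum(vals) / n for vals in zip(*embeddings)]

alice_sightings = [
    [0.9, 0.1, 0.4],
    [0.88, 0.12, 0.42],
    [0.92, 0.08, 0.38],
]
print(centroid(alice_sightings))  # one 3-dim vector stands in for 3 rows
```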




&lt;p&gt;&lt;strong&gt;"What metadata gets updated?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;person_id, name, face_embedding_centroid (average of recent embeddings), last_seen, interaction_count, behavior_summary (LLM-generated), context_tags (where/when usually seen).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why centroid instead of keeping all 500 embeddings?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Storage: 500 embeddings × 512 dims × 4 bytes = 1 MB per person. Scaling to 10k people: 10 GB. But centroid: 512 dims × 4 bytes = 2 KB. 10k people: 20 MB. Also faster queries.&lt;/p&gt;
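
&lt;p&gt;The arithmetic is easy to sanity-check:&lt;/p&gt;

```python
DIMS, BYTES_PER_FLOAT = 512, 4           # float32 embeddings
per_embedding = DIMS * BYTES_PER_FLOAT   # 2048 bytes, roughly 2 KB

all_sightings  = 500 * per_embedding     # ~1 MB for one person's 500 sightings
fleet_full     = 10_000 * all_sightings  # ~10 GB if you keep everything
fleet_centroid = 10_000 * per_embedding  # ~20 MB if you keep one centroid each

print(per_embedding, all_sightings, fleet_full, fleet_centroid)
```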




&lt;p&gt;&lt;strong&gt;"What about people you haven't seen in a year?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Archive them. Move centroid to cold storage (cloud). Keep recent 1000 people in hot database. When someone reappears after 1 year: warm up their embeddings, integrate into Tier 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 8: The Technical Stack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What libraries should I actually use?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face detection/embedding: &lt;strong&gt;InsightFace&lt;/strong&gt; (accurate, fast, open-source, 512-dim vectors).&lt;br&gt;
Vector DB: &lt;strong&gt;Milvus&lt;/strong&gt; or &lt;strong&gt;Qdrant&lt;/strong&gt; (HNSW indexing, fast search, Python API).&lt;br&gt;
Person DB: &lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt; (SQL + vector similarity, scales to millions).&lt;br&gt;
VLA inference: &lt;strong&gt;HuggingFace Transformers&lt;/strong&gt; (OpenVLA-7B).&lt;br&gt;
Video I/O: &lt;strong&gt;OpenCV&lt;/strong&gt; (standard, efficient).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why InsightFace?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;20-50ms per face (fast). 95%+ detection accuracy. Open-source. Produces 512-dimensional embeddings proven for recognition. Easy to fine-tune.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why Milvus over other vector DBs?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Supports HNSW (hierarchical approximate search), in-memory + SSD persistence, Python API, easy deployment on Jetson. Qdrant is also good (Rust-based, slightly faster). Pick either.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why PostgreSQL + pgvector?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL for complex queries (names, timestamps, context). Vector similarity search in same database. Scales to millions of records. pgvector is mature (stable since 2023).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Wait - why both Milvus and PostgreSQL? Can't I use just one?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can! &lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt; can handle both: vector similarity search (like Milvus) AND SQL queries with metadata. Many systems use just PostgreSQL. The two-DB setup separates concerns: Milvus (Tier 2) optimized for fast vector search on recent faces, PostgreSQL (Tier 3) for long-term storage with rich metadata. But if you want simplicity, use PostgreSQL + pgvector for everything - it's mature and handles both workloads well.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about the VLA model?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA-7B&lt;/strong&gt; is your best bet. Open-source, fine-tuneable with LoRA, good community. RT-2 (DeepMind) is better but closed-source. VideoVLA (2025) supports multi-frame but less mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 9: Practical Constraints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What hardware do I actually need?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Minimum: &lt;strong&gt;Jetson Orin Nano Super&lt;/strong&gt; ($249, 8 GB RAM, 67 TFLOPS GPU). Processes ~5 FPS with constraints. Can run lightweight models (smolVLA 450M at 8-12 Hz) but struggles with larger 7B models (~0.3 Hz).[39]&lt;/p&gt;

&lt;p&gt;Recommended: 16 GB RAM, 256 GB NVMe SSD, 100+ TFLOPS GPU. For production-quality multi-model stacks, consider Jetson AGX Orin (32-64 GB) or newer architectures that can handle VLA + perception models simultaneously at real-time rates.[39]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Is 5-15 FPS enough?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For humanoid robots? Yes. You don't need 30 FPS every second. The key is an asynchronous architecture: memory queries happen in the background and don't block the control loop.&lt;/p&gt;
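
&lt;p&gt;A minimal sketch of that asynchronous pattern, with a background thread standing in for the memory tier (the lambda is a placeholder for a real vector-DB call):&lt;/p&gt;

```python
import queue, threading, time

def memory_worker(requests, results, query_fn):
    """Background thread: drains memory queries so the control loop never waits."""
    while True:
        item = requests.get()
        if item is None:          # shutdown sentinel
            break
        results.put(query_fn(item))

requests, results = queue.Queue(), queue.Queue()
worker = threading.Thread(
    target=memory_worker,
    args=(requests, results, lambda face: f"id-for-{face}"),  # stand-in query
    daemon=True,
)
worker.start()

# Control loop: enqueue the query, keep ticking, act on the answer when it lands.
requests.put("face-42")
identity = None
for _ in range(100):              # each tick stands in for one control-loop iteration
    try:
        identity = results.get_nowait()
        break
    except queue.Empty:
        time.sleep(0.001)         # the loop keeps running instead of blocking

requests.put(None)
worker.join()
```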




&lt;p&gt;&lt;strong&gt;"What's the latency budget?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Frame capture: 1-2ms. Optimized VLA inference (3 frames): 20-40ms on high-end GPUs; typical systems 50-150ms.[31] Action generation: 2-3ms. Memory cache lookups: &amp;lt;1ms. Async queries (don't block): 50-100ms. Total real-time path: 25-50ms for optimized systems. Meets 20-30 Hz control requirement.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about on low-power devices like Jetson Orin Nano?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unoptimized CPU-only: 150-300ms per frame. With GPU + TensorRT INT8 quantization + tracking: 25-40ms per frame for 1-5 faces. Memory is the bottleneck - 8 GB shared RAM limits model size and batch processing.[36][37]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What if I need to run multiple models simultaneously?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A full humanoid stack (VLA, object detection, SLAM, depth, speech) competing for shared 8 GB RAM makes real-time performance challenging. Jetson Orin Nano Super is not yet sufficient for production-quality multi-model deployments.[38]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What recognition accuracy should I expect?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face detection: 95-98%. Recognition same day: 92-95%. After 1 week: 90-93%. After 1 month: 90-95% for adults with stable appearance (minimal degradation over short intervals).[30] Accuracy remains high (&amp;gt;90%) for months; larger drops occur over years. But appearance changes matter: beard growth can increase error rates 10-25×; with sunglasses, accuracy drops to ~37%; for children enrolled under age 1, it falls to ~30% over 6 months.[33][34][35] Accuracy improves with recency-weighted averaging, ensemble models, and diverse training data.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What if I need higher accuracy?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use confidence thresholds (only match if &amp;gt;0.85 instead of 0.75). Ask for confirmation on borderline cases. Use an ensemble (run 2-3 face recognition models and vote). Accuracy improves, but at a latency cost.&lt;/p&gt;
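
&lt;p&gt;A minimal voting sketch (the model names and two-vote quorum are illustrative):&lt;/p&gt;

```python
from collections import Counter

def ensemble_match(predictions, min_votes=2):
    """Majority vote across face-recognition models.
    `predictions` maps a model name to (person_id, similarity); an ID
    is accepted only if at least `min_votes` models agree on it."""
    votes = Counter(pid for pid, _ in predictions.values())
    person, count = votes.most_common(1)[0]
    if count >= min_votes:
        return person
    return None   # no consensus: fall back to an "are you ...?" confirmation
```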




&lt;h2&gt;
  
  
  Part 10: Current Research (2025)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What actually broke through this year?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CronusVLA&lt;/strong&gt;: Multi-frame VLA using motion features with cached past frames, avoiding recomputation of the vision backbone.[26] Achieves 12.7% improvement on LIBERO benchmark with efficient multi-frame processing.[26]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VideoVLA&lt;/strong&gt;: Diffusion-based approach. Predicts future frames AND continuous actions. Better generalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context LLMs&lt;/strong&gt;: Claude 200k tokens. Enables semantic memory integration directly.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What's still unsolved?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Uncertainty calibration (robot knowing when it's uncertain). Privacy-preserving embeddings (encrypted vector search). Continual learning without forgetting old skills. Cross-modal grounding (explaining what it knows). And making all this work on a low-powered device in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 11: The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Why does robot memory actually matter?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For care robots: remember patient health status, preferences, medication. For home robots: understand family dynamics, relationships. For workplace: coordinate with individuals, learn workflows. Memory = personalization = trust.&lt;/p&gt;

&lt;p&gt;Imagine a Jarvis that can't recognize Tony Stark.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Who actually needs this? What's the market?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three segments: (1) &lt;strong&gt;Healthcare&lt;/strong&gt;: care robots in hospitals/nursing homes ($2B+ market, growing 25% annually). (2) &lt;strong&gt;Consumer&lt;/strong&gt;: home assistant robots ($5B+ by 2030). (3) &lt;strong&gt;Enterprise&lt;/strong&gt;: warehouse/logistics robots ($15B+). Early adopters are healthcare (regulatory compliance, patient safety) and high-end consumer (personalization premium). The "remember me" feature becomes a differentiator when robots are commodity.[18][20]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Is this going to be solved?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Partially, yes. In 1-3 years, robots will recognize and remember faces across months. In 3-7 years, they'll have super-human memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"So what's the summary?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face recognition works. VLAs are bottlenecked. Compression techniques exist, but nobody has integrated them into a working robot yet. The four-tier memory system solves the storage problem - each tier optimized for its job. Caching prevents query explosion (LRU cache + batch queries + async). Most robots don't have this capability yet, and humanoids are incomplete without it. In 3 years, this will likely be standard.&lt;/p&gt;




&lt;p&gt;Are you building in the robotics-AI space? How are you tackling these challenges? Do you wish someone would build the memory layer for robots? Should I take up the project &lt;code&gt;yaadeinDB&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts or feedback in the comments section.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href="https://developer.nvidia.com/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/" rel="noopener noreferrer"&gt;Optimizing Inference for Long Context with NVFP4 KV Cache&lt;/a&gt; - NVIDIA Developer Blog, Dec 2025&lt;br&gt;
[2] &lt;a href="https://openaccess.thecvf.com/content/CVPR2025/html/Hu_M-LLM_Based_Video_Frame_Selection_for_Efficient_Video_Understanding_CVPR_2025_paper.html" rel="noopener noreferrer"&gt;M-LLM Based Video Frame Selection for Efficient Video Understanding&lt;/a&gt; - CVPR 2025&lt;br&gt;
[3] &lt;a href="https://arxiv.org/abs/2412.19442" rel="noopener noreferrer"&gt;A Survey on Large Language Model Acceleration based on KV Cache&lt;/a&gt; - ArXiv 2024&lt;br&gt;
[4] &lt;a href="https://sebastianraschka.com/blog/2025/coding-the-kv-cache-in-llms.html" rel="noopener noreferrer"&gt;Understanding and Coding the KV Cache in LLMs&lt;/a&gt; - Sebastian Raschka's Magazine, Jun 2025&lt;br&gt;
[5] &lt;a href="https://openaccess.thecvf.com/content_cvpr_2018/html/Huang_What_Makes_a_CVPR_2018_paper.html" rel="noopener noreferrer"&gt;Analyzing Temporal Information in Video Understanding&lt;/a&gt; - CVPR 2018&lt;br&gt;
[6] &lt;a href="https://arxiv.org/abs/2409.01156" rel="noopener noreferrer"&gt;TempMe: Video Temporal Token Merging for Efficient Video Understanding&lt;/a&gt; - ICLR 2025&lt;br&gt;
[10] &lt;a href="https://www.giskard.ai/glossary/pooling-layers-in-cnn" rel="noopener noreferrer"&gt;Pooling Layers in CNN&lt;/a&gt; - Giskard AI Glossary, 2025&lt;br&gt;
[11] &lt;a href="https://antonwohlgemuth.com/p/foundation-models-in-robotics-unlocking-new-frontiers-7cc1" rel="noopener noreferrer"&gt;Foundation Models for Robotics: Vision-Language-Action&lt;/a&gt; - Blog Post, Dec 2024&lt;br&gt;
[12] &lt;a href="https://arxiv.org/abs/2510.27280" rel="noopener noreferrer"&gt;FOCUS: Efficient Keyframe Selection for Long Videos&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[13] &lt;a href="https://blog.milvus.io/ai-quick-reference/what-is-the-role-of-pooling-layers-in-cnns" rel="noopener noreferrer"&gt;Role of Pooling Layers in CNNs&lt;/a&gt; - Milvus.io Blog, 2025 (Note: URL redirects but page is accessible)&lt;br&gt;
[14] &lt;a href="https://arxiv.org/abs/2509.16635" rel="noopener noreferrer"&gt;A Review of Recent Techniques for Person Re-Identification&lt;/a&gt; - ArXiv, Sep 2025&lt;br&gt;
[15] &lt;a href="https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action" rel="noopener noreferrer"&gt;RT-2: New model translates vision and language into action&lt;/a&gt; - DeepMind Blog, Jul 2023&lt;br&gt;
[18] &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC6452248/" rel="noopener noreferrer"&gt;Memory and mental time travel in humans and social robots&lt;/a&gt; - PMC, Mar 2019&lt;br&gt;
[19] &lt;a href="https://pyimagesearch.com/2023/01/09/face-recognition-with-siamese-networks-keras-and-tensorflow/" rel="noopener noreferrer"&gt;Understanding Face Recognition: FaceNet vs Siamese Networks&lt;/a&gt; - Blog Post, 2024&lt;br&gt;
[20] &lt;a href="https://openreview.net/forum?id=BBgDA4y0B9" rel="noopener noreferrer"&gt;Episodic Memory Banks for Lifelong Robot Learning&lt;/a&gt; - OpenReview&lt;br&gt;
[21] &lt;a href="https://pyimagesearch.com/2023/01/09/face-recognition-with-siamese-networks-keras-and-tensorflow/" rel="noopener noreferrer"&gt;Face Recognition with Siamese Networks, Keras, and TensorFlow&lt;/a&gt; - PyImageSearch, Jan 2023&lt;br&gt;
[24] &lt;a href="https://arxiv.org/abs/2506.07339" rel="noopener noreferrer"&gt;Real-Time Execution of Action Chunking Flow Policies&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[26] &lt;a href="https://arxiv.org/abs/2506.19816" rel="noopener noreferrer"&gt;CronusVLA: Towards Efficient and Robust Manipulation via Transferring Latent Motion Across Time&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[28] &lt;a href="https://huggingface.co/papers/2511.05936" rel="noopener noreferrer"&gt;Vision-Language-Action Models: Concepts, Progress&lt;/a&gt; - Blog/Docs, 2025&lt;br&gt;
[29] &lt;a href="https://www.emergentmind.com/topics/kv-cache-optimization" rel="noopener noreferrer"&gt;KV Cache Optimization in Transformers&lt;/a&gt; - Emergent Mind, Nov 2025&lt;br&gt;
[30] &lt;a href="https://arxiv.org/abs/2204.01760" rel="noopener noreferrer"&gt;Face Recognition in Children: A Longitudinal Study&lt;/a&gt; - ArXiv 2022; &lt;a href="https://pubmed.ncbi.nlm.nih.gov/28114700/" rel="noopener noreferrer"&gt;Longitudinal Analysis of Mugshots&lt;/a&gt; - PubMed 2017&lt;br&gt;
[31] &lt;a href="https://www.emergentmind.com/topics/kv-cache-optimization" rel="noopener noreferrer"&gt;Running VLAs at Real-Time Speed&lt;/a&gt; - Emergent Mind 2025; &lt;a href="https://arxiv.org/abs/2512.20276" rel="noopener noreferrer"&gt;ActionFlow: Real-Time Vision-Language-Action&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[32] &lt;a href="https://arxiv.org/abs/2107.03769" rel="noopener noreferrer"&gt;Susceptibility to Image Resolution in Face Recognition&lt;/a&gt; - ArXiv 2021; Low-resolution face recognition studies - Multiple sources&lt;br&gt;
[33] &lt;a href="https://openaccess.thecvf.com/content/WACV2024W/DVPBA/html/Wu_Facial_Hair_Area_in_Face_Recognition_Across_Demographics_Small_Size_WACVW_2024_paper.html" rel="noopener noreferrer"&gt;Facial Hair Area in Face Recognition Across Demographics&lt;/a&gt; - ArXiv 2024; Effects of Facial Hair on Face Recognition - IEEE 2025&lt;br&gt;
[34] &lt;a href="https://arxiv.org/abs/2311.11512" rel="noopener noreferrer"&gt;Impact of Partial Occlusion on Face Recognition&lt;/a&gt; - ArXiv 2023; &lt;a href="https://pubmed.ncbi.nlm.nih.gov/36922579/" rel="noopener noreferrer"&gt;Glasses and Sunglasses Effects&lt;/a&gt; - PubMed 2023&lt;br&gt;
[35] &lt;a href="https://arxiv.org/abs/2204.01760" rel="noopener noreferrer"&gt;Face Recognition in Children: A Longitudinal Study&lt;/a&gt; - ArXiv 2022; Young Face Aging Dataset Studies - ArXiv 2022&lt;br&gt;
[36] &lt;a href="https://forums.developer.nvidia.com/t/face-detection-post-processing-not-working-in-deepstream-6-2-on-jetson-orin-nano/337401" rel="noopener noreferrer"&gt;Face Recognition on Jetson Orin Nano&lt;/a&gt; - NVIDIA Developer Forums 2024; &lt;a href="https://www.ijert.org/robust-multi-sensor-facial-recognition-in-real-time-using-nvidia-deepstream" rel="noopener noreferrer"&gt;Robust Multi-Sensor Facial Recognition in Real-Time using NVIDIA DeepStream&lt;/a&gt; - IJERT&lt;br&gt;
[37] &lt;a href="https://forums.developer.nvidia.com/t/jetson-orin-nanos-ram-keeps-getting-full-the-board-crashes/321270" rel="noopener noreferrer"&gt;Jetson Orin Nano RAM Issues and Memory Optimization&lt;/a&gt; - NVIDIA Developer Forums 2024; &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Developer Kit Specifications&lt;/a&gt; - NVIDIA.com&lt;br&gt;
[38] &lt;a href="https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i"&gt;Multi-Model AI Resource Allocation for Humanoid Robots: A Survey on Jetson Orin Nano Super&lt;/a&gt; - DEV Community, ankk98, 2025&lt;br&gt;
[39] &lt;a href="https://dev.to/ankk98/humanoid-compute-price-vs-performance-842"&gt;Humanoid Compute: Price vs. Performance&lt;/a&gt; - DEV Community, ankk98, 2025&lt;/p&gt;

</description>
      <category>humanoid</category>
      <category>ai</category>
      <category>robotics</category>
      <category>vla</category>
    </item>
    <item>
      <title>Multi-Model AI Resource Allocation for Humanoid Robots: A Survey on Jetson Orin Nano Super</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Mon, 19 Jan 2026 13:06:12 +0000</pubDate>
      <link>https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i</link>
      <guid>https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i</guid>
      <description>&lt;p&gt;&lt;em&gt;Building efficient multi-model AI pipelines for humanoid robotics on resource-constrained edge hardware, with a focus on Jetson Orin Nano Super.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Status disclaimer&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Everything in this article is &lt;strong&gt;mostly theoretical today&lt;/strong&gt;. A Jetson Orin Nano Super–class board (8 GB LPDDR5, ~102 GB/s memory bandwidth, ~67 INT8 TOPS &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kit&lt;/a&gt;) is &lt;strong&gt;underpowered for running a full Vision-Language-Action (VLA) model plus several heavy vision models concurrently&lt;/strong&gt; in production. Making this truly viable will require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware&lt;/strong&gt;: more memory bandwidth, more VRAM, and higher sustained TOPS within a tight power envelope
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: lighter, edge-optimized VLA / YOLO26 variants (pruned, quantized, distilled)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software stack&lt;/strong&gt;: better kernel-level scheduling, more mature CUDA Green Contexts, and more predictable multi-tenant GPU runtimes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architectures and strategies below are what you should &lt;strong&gt;aim for&lt;/strong&gt;, but today they remain a mix of research prototypes and partial production deployments.&lt;/p&gt;

&lt;p&gt;I have ordered the device, so I will do some testing once it arrives. Stay tuned for empirical results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Suppose you want to run multiple AI models simultaneously on edge hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;Vision-Language-Action (VLA)&lt;/strong&gt; model like &lt;strong&gt;&lt;a href="https://huggingface.co/blog/smolvla" rel="noopener noreferrer"&gt;SmolVLA&lt;/a&gt;&lt;/strong&gt; for robot control,&lt;/li&gt;
&lt;li&gt;a recent &lt;strong&gt;YOLO26&lt;/strong&gt; model for comprehensive perception (object detection, instance segmentation, pose estimation, oriented detection, and image classification) (&lt;a href="https://www.ultralytics.com/news/ultralytics-redefines-state-of-the-art-vision-ai-with-yolo26" rel="noopener noreferrer"&gt;Ultralytics YOLO26 announcement&lt;/a&gt;, &lt;a href="https://blog.roboflow.com/yolo26-in-roboflow/" rel="noopener noreferrer"&gt;Roboflow YOLO26 support&lt;/a&gt;),&lt;/li&gt;
&lt;li&gt;plus other specialized models (e.g., SLAM, depth, speech).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these must share limited GPU memory and compute resources on an embedded platform like &lt;strong&gt;Jetson Orin Nano Super&lt;/strong&gt; (8 GB LPDDR5 @ ~102 GB/s, 6-core Arm CPU, Ampere GPU with 1,024 CUDA cores and 32 Tensor Cores &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kit&lt;/a&gt;, &lt;a href="https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html" rel="noopener noreferrer"&gt;Jetson Orin Nano/NX/AGX power modes&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We’ll survey &lt;strong&gt;three major resource allocation strategies&lt;/strong&gt; for running multiple AI models on edge devices: hardware partitioning, priority-based scheduling, and offloading. Then we'll focus on the &lt;strong&gt;event-driven architecture&lt;/strong&gt; that production robotics systems actually use for reliable, real-time multi-model execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil5vzddt3s8prl7wupvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil5vzddt3s8prl7wupvm.png" alt="NVIDIA Jetson Orin Nano Super Developer Kit Specs" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Criteria for Multi-Model Edge AI Systems
&lt;/h2&gt;

&lt;p&gt;Before diving into specific strategies, it's crucial to understand the fundamental design criteria that shape resource allocation decisions for multi-model AI on edge devices. These criteria directly influence which approach will work for your specific use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Performance Requirements
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Latency budgets&lt;/strong&gt;: Critical models (VLA for robot control) typically target a &lt;strong&gt;desired frequency of 24 Hz&lt;/strong&gt; for end-to-end control loops (sensor → action), while perception models (e.g., YOLO26 detection/segmentation) can tolerate lower frequencies (~5 Hz). Missing deadlines can cause instability or safety issues in mobile robots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jitter tolerance&lt;/strong&gt;: Real-time systems need &lt;strong&gt;predictable&lt;/strong&gt; latency. User reports show &lt;strong&gt;10–40% latency increases&lt;/strong&gt; under CUDA MPS even with per-client limits, and sometimes much worse when misconfigured (&lt;a href="https://docs.nvidia.com/deploy/mps/" rel="noopener noreferrer"&gt;NVIDIA MPS docs&lt;/a&gt;, &lt;a href="https://forums.developer.nvidia.com/t/mps-interference-problem/312930" rel="noopener noreferrer"&gt;MPS interference report&lt;/a&gt;, &lt;a href="https://forums.developer.nvidia.com/t/mps-vs-no-mps-drastic-increase-in-kernel-latency/336175" rel="noopener noreferrer"&gt;MPS latency outlier report&lt;/a&gt;). That makes naive multi-process sharing a bad fit for tight 24 Hz+ control loops unless carefully profiled and constrained.&lt;/p&gt;
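
&lt;p&gt;Before committing to a shared-GPU setup, profile tail latency rather than the mean, since jitter lives in the tail. A minimal sketch (the workload lambda is a placeholder for your inference call):&lt;/p&gt;

```python
import statistics, time

def profile(fn, iters=200):
    """Time per-call latency and report mean vs. p99: jitter shows up
    in the tail long before it shows in the average."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    p99 = samples[int(0.99 * (len(samples) - 1))]
    return statistics.mean(samples), p99

mean_ms, p99_ms = profile(lambda: sum(range(10_000)))  # dummy workload
budget_ms = 1000.0 / 24   # ~41.7 ms per tick at 24 Hz
print(f"mean={mean_ms:.3f}ms p99={p99_ms:.3f}ms budget={budget_ms:.1f}ms")
```

&lt;p&gt;If p99 blows the 24 Hz budget while the mean looks fine, the deployment will still miss deadlines.&lt;/p&gt;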

&lt;p&gt;&lt;strong&gt;Throughput vs. latency trade-offs&lt;/strong&gt;: Background models can use batching for efficiency, but critical models prioritize low-latency single-inference execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Power envelope&lt;/strong&gt;: On Jetson Orin Nano Super, low-power modes operate around &lt;strong&gt;7–8 W&lt;/strong&gt;, with higher modes up to ~25 W in &lt;code&gt;MAXN_SUPER&lt;/code&gt; (&lt;a href="https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html" rel="noopener noreferrer"&gt;Jetson power/performance modes&lt;/a&gt;). Multi-model execution must stay within these thermal budgets or the device will downclock aggressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory hierarchy&lt;/strong&gt;: The Orin Nano Super’s &lt;strong&gt;8 GB LPDDR5&lt;/strong&gt; is a &lt;strong&gt;unified memory pool&lt;/strong&gt; for CPU and GPU. Models compete for both GPU and system memory, and memory pressure can cause allocator fragmentation, cache thrashing, and even swapping if you’re not careful with container limits and tensor lifetimes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute asymmetry&lt;/strong&gt;: GPU cores excel at parallel inference, CPU cores handle preprocessing/serialization. Resource allocation must balance both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability and Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Graceful degradation&lt;/strong&gt;: Non-critical models should drop frames or reduce frequency under resource pressure, not crash the entire system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model priority levels&lt;/strong&gt;: Critical control (VLA) &amp;gt; essential perception (YOLO detection) &amp;gt; background tasks (pose estimation, classification).&lt;/p&gt;
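
&lt;p&gt;A minimal sketch of dispatching by those priority levels (the tier numbering and job names are illustrative):&lt;/p&gt;

```python
import heapq, itertools

class PriorityDispatcher:
    """Drains inference jobs strictly by priority tier:
    0 = critical control (VLA), 1 = essential perception (YOLO),
    2 = background tasks. Under pressure, background jobs are shed
    rather than crashing the pipeline."""

    def __init__(self):
        self._q = []
        self._seq = itertools.count()   # keeps FIFO order within a tier

    def submit(self, priority, job):
        heapq.heappush(self._q, (priority, next(self._seq), job))

    def drain(self, shed_background=False):
        order = []
        while self._q:
            prio, _, job = heapq.heappop(self._q)
            if shed_background and prio >= 2:
                continue                # graceful degradation: drop, don't crash
            order.append(job)
        return order
```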

&lt;p&gt;&lt;strong&gt;Failure isolation&lt;/strong&gt;: A single model's crash shouldn't bring down the entire pipeline. Containerization and process isolation are essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  System-Level Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Communication overhead&lt;/strong&gt;: Inter-model data sharing (JSON serialization, queue management) adds latency that must be budgeted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring requirements&lt;/strong&gt;: Real-time metrics collection for latency, utilization, and thermal state enables adaptive resource allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability needs&lt;/strong&gt;: Will you add more models later? Choose architectures that support horizontal scaling without complete rearchitecting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment constraints&lt;/strong&gt;: Edge devices often run in remote locations with limited network access, requiring self-contained solutions.&lt;/p&gt;

&lt;p&gt;These design criteria explain why simple partitioning approaches fail on edge devices: the fundamental constraints (thermal limits, unified memory, power budgets) make static allocation inefficient. Production systems instead use adaptive, priority-aware resource sharing with explicit failure modes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 1: Partitioning – Static Slices of Compute and Memory
&lt;/h2&gt;

&lt;p&gt;Partitioning tries to make multi-model systems predictable by &lt;strong&gt;reserving fixed resources per model&lt;/strong&gt;. On edge hardware, this usually means partitioning GPU SMs, constraining CPU cores, or pinning memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 GPU Resource Partitioning (NVIDIA Green Contexts)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Hardware-level SM (Streaming Multiprocessor) allocation. You split the GPU’s SMs into subsets and bind different workloads to different subsets using &lt;strong&gt;CUDA Green Contexts&lt;/strong&gt; (&lt;a href="https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html" rel="noopener noreferrer"&gt;CUDA Green Contexts driver API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;On Jetson Orin Nano Super (Ampere, compute capability 8.7), the GPU exposes &lt;strong&gt;8 SMs&lt;/strong&gt; with a total of &lt;strong&gt;1,024 CUDA cores&lt;/strong&gt; (see &lt;a href="https://www.techpowerup.com/gpu-specs/jetson-orin-nano-8-gb.c4082" rel="noopener noreferrer"&gt;Jetson Orin Nano GPU spec&lt;/a&gt;). Green Contexts enforce &lt;strong&gt;minimum SM counts and alignment constraints&lt;/strong&gt; per context (e.g., minimum 4 SMs, counts in multiples of 2 for 8.x architectures).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardware-enforced &lt;strong&gt;SM isolation&lt;/strong&gt; (clean separation at the compute level)&lt;/li&gt;
&lt;li&gt;Official NVIDIA support on Orin (compute capability 8.7)&lt;/li&gt;
&lt;li&gt;Streams and kernels under different Green Contexts are scheduled from separate queues, which can improve isolation in some workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt; (critical on Orin Nano–class devices):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frequency is still global&lt;/strong&gt;: GPU clock is governed by the Jetson power mode and thermal headroom, &lt;strong&gt;not&lt;/strong&gt; by Green Contexts. All contexts share the same global GPU frequency (&lt;a href="https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html" rel="noopener noreferrer"&gt;Jetson power modes&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No memory isolation&lt;/strong&gt;: Contexts share L2 cache, memory controllers, and the same 8 GB LPDDR5 DRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thermal throttling&lt;/strong&gt;: In 7–8 W modes, sustained heavy use across contexts still causes downclocking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited partition granularity&lt;/strong&gt;: With 8 SMs and a 4-SM minimum per context on cc 8.7, you can have at most &lt;strong&gt;two partitions of 4 SMs&lt;/strong&gt; each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observed behavior can be surprising&lt;/strong&gt;: Users have reported &lt;strong&gt;little to no runtime change&lt;/strong&gt; when varying SM allocations via Green Contexts on Jetson Orin, suggesting that other bottlenecks (memory, front-end, scheduling) may dominate (&lt;a href="https://forums.developer.nvidia.com/t/green-context-sm-allocation-not-affecting-kernel-runtime-in-jetson-orina/332343" rel="noopener noreferrer"&gt;NVIDIA forum: Green Contexts on Orin&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world latency impact (today)&lt;/strong&gt;: You may get some improved isolation in synthetic benchmarks, but on Orin Nano–class devices the main constraints are &lt;strong&gt;power mode, memory bandwidth, and thermal limits&lt;/strong&gt;, which Green Contexts do &lt;strong&gt;not&lt;/strong&gt; solve. For most embedded robotics use cases, the complexity is hard to justify unless you have a very specific multi-tenant requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: On Orin Nano–class devices, use Green Contexts only when you absolutely need &lt;strong&gt;hard SM isolation&lt;/strong&gt; between tenants and can afford the engineering complexity. For single-robot stacks, it’s usually better to rely on &lt;strong&gt;priority-based scheduling and event-driven architectures&lt;/strong&gt; instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Software Partitioning: CUDA MPS (Multi-Process Service)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: A software layer that allows &lt;strong&gt;multiple processes to share a single GPU context&lt;/strong&gt;, time-multiplexing kernels from different processes through the &lt;strong&gt;CUDA MPS server&lt;/strong&gt; (&lt;a href="https://docs.nvidia.com/deploy/mps/" rel="noopener noreferrer"&gt;CUDA MPS guide&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works on all Jetson platforms today (no driver updates needed)&lt;/li&gt;
&lt;li&gt;Per-process thread budget and pinned memory limits&lt;/li&gt;
&lt;li&gt;Simple to enable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared L2 cache and bandwidth&lt;/strong&gt;: Models can still thrash each other’s L2 lines and DRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel serialization and interference&lt;/strong&gt;: Under contention, one client’s kernel launches can delay another’s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpredictable latency without careful tuning&lt;/strong&gt;: Reports show latency increases of &lt;strong&gt;10–40%&lt;/strong&gt; under moderate contention even with 50/50 SM splits, and in misconfigured scenarios, giant outliers (e.g., a kernel going from ~65 µs to ~100 ms) (&lt;a href="https://forums.developer.nvidia.com/t/mps-interference-problem/312930" rel="noopener noreferrer"&gt;MPS interference report&lt;/a&gt;, &lt;a href="https://forums.developer.nvidia.com/t/mps-vs-no-mps-drastic-increase-in-kernel-latency/336175" rel="noopener noreferrer"&gt;MPS latency outlier report&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory accounting is per-process, not global&lt;/strong&gt;: Per-process limits don’t give you a global “cap”; two 1 GB limits still allow 2 GB total in use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world issue&lt;/strong&gt;: For multi-model pipelines (VLA + YOLO26 detection/segmentation/pose) targeting &lt;strong&gt;24 Hz control loops&lt;/strong&gt;, this kind of latency variability is unacceptable unless you design around it very conservatively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Reasonable for batch or non-real-time workloads; a poor fit for tight control loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 OS-Level Partitioning: Linux cgroups + CPU Affinity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Kernel-level control over CPU time and system RAM. You pin CPU cores, set CPU shares, and enforce memory limits per cgroup or container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to implement&lt;/strong&gt;: Create CPU and memory control groups, pinning specific cores to each workload. Use Docker's &lt;code&gt;cpuset_cpus&lt;/code&gt; and &lt;code&gt;mem_limit&lt;/code&gt; for containerized isolation.&lt;/p&gt;
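&lt;p&gt;A minimal Docker Compose sketch of this pattern (service names, image tags, and core/memory splits are illustrative placeholders, not recommendations):&lt;/p&gt;

```yaml
# Hypothetical compose fragment: pin the critical control stack to dedicated
# cores and cap system RAM per service (compose v2-style cpuset/mem_limit).
services:
  vla:
    image: my-vla:latest        # placeholder image
    runtime: nvidia
    cpuset: "0-3"               # dedicated cores for the 24 Hz control loop
    mem_limit: 4g
  yolo:
    image: my-yolo:latest       # placeholder image
    runtime: nvidia
    cpuset: "4-5"               # background perception gets the remaining cores
    mem_limit: 2g
```

&lt;p&gt;Keep in mind that this bounds CPU and system RAM only; both services still share the GPU.&lt;/p&gt;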

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean OS-level isolation (CPU and system RAM)&lt;/li&gt;
&lt;li&gt;Prevents CPU contention between processes&lt;/li&gt;
&lt;li&gt;Works on all platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Doesn’t isolate the GPU&lt;/strong&gt;: Both processes still compete for GPU memory bandwidth (~102 GB/s on Orin Nano Super, shared across all clients; see the &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kit&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incomplete solution&lt;/strong&gt;: If the VLA runs on the GPU while YOLO's CPU thread is blocked by its cgroup limits, end-to-end latency still spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory overhead&lt;/strong&gt;: Tight system RAM limits can trigger early swapping, undermining your "fixed" allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world issue&lt;/strong&gt;: Critical model deadlines (24 Hz for VLA, real-time pose estimation) might still be missed if system RAM swaps to disk or GPU bandwidth is saturated by multiple concurrent models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Useful as a supporting tool (especially with containers), but not sufficient alone for real-time multi-model GPU workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Where Partitioning Fits
&lt;/h3&gt;

&lt;p&gt;Partitioning is attractive when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need strong isolation&lt;/strong&gt; (multi-tenant scenarios, safety domains)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You care more about fairness than minimum latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can afford reduced peak performance&lt;/strong&gt; due to thermal limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But on small edge devices with unified memory and tight power envelopes, &lt;strong&gt;hard partitions tend to underutilize the hardware&lt;/strong&gt; and amplify thermal problems. That’s why most modern robotics stacks use partitioning only as a &lt;strong&gt;supporting tool&lt;/strong&gt;, not the primary strategy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 2: Prioritization and Event-Driven Scheduling – Shared Resources, Explicit Priorities
&lt;/h2&gt;

&lt;p&gt;Prioritization assumes all models share the same GPU/CPU pool, but &lt;strong&gt;who runs when&lt;/strong&gt; is controlled carefully using priorities, async queues, and backpressure. This is the pattern used by OM1, LeRobot, Reachy 2, and most modern robotics systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Why Prioritization Wins on Edge Devices
&lt;/h3&gt;

&lt;p&gt;The fundamental limitation of edge devices: &lt;strong&gt;Unified memory architectures and thermal constraints make static resource partitioning inefficient.&lt;/strong&gt; Production robotics systems avoid strict partitions and instead use event-driven patterns that dynamically allocate resources based on priority and system state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Reliable multi-model execution comes from &lt;strong&gt;adaptive resource sharing&lt;/strong&gt; and &lt;strong&gt;graceful degradation&lt;/strong&gt;, not rigid slicing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Core Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shared compute with explicit priorities&lt;/strong&gt;: Multiple models share GPU/CPU resources, but execution priority is clearly defined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA streams for kernel scheduling&lt;/strong&gt;: High-priority streams for critical models, normal priority for background tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async event communication&lt;/strong&gt;: Message queues decouple model timing and enable graceful degradation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System state awareness&lt;/strong&gt;: Monitor thermal/power limits and adapt resource allocation dynamically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deadline-aware scheduling&lt;/strong&gt;: Soft deadlines for non-critical models, hard deadlines for essential perception.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2.3 Architecture: Prioritized CUDA Streams + Async Event Bus
&lt;/h3&gt;

&lt;p&gt;One concrete template looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────┐
│   Critical Model Thread (e.g., VLA @ 24Hz)         │
│   Priority: HIGH                                   │
│   Target frequency: 24 Hz                          │
└────────────────────────────────────────────────────┘
         ↓ (sensor inputs)
┌────────────────────────────────────────────────────┐
│   CUDA High-Priority Stream (GPU)                  │
│   Critical inference, never preempted              │
└────────────────────────────────────────────────────┘
         ↓ (outputs → Action/Event queues)
┌────────────────────────────────────────────────────┐
│   Event Bus (Redis/Zenoh/ROS2)                     │
│   Async communication between models               │
└────────────────────────────────────────────────────┘
         ↓ (decoupled messaging)
┌────────────────────────────────────────────────────┐
│   Background Models (YOLO, segmentation, etc.)     │
│   Priority: NORMAL/BACKGROUND                      │
│   Graceful degradation under load                  │
│   Runs in normal-priority CUDA streams             │
└────────────────────────────────────────────────────┘
         ↓ (context updates → Decision fusion)
┌────────────────────────────────────────────────────┐
│   Decision Fusion &amp;amp; Action Execution               │
│   Combines all model outputs                       │
└────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.4 Implementation Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker / Docker Compose + ROS 2 / Zenoh (containerized event-driven architecture)&lt;/strong&gt;&lt;br&gt;
Each AI model (or subsystem) runs in its own container, communicating over async message buses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containerize each model service with NVIDIA runtime.&lt;/li&gt;
&lt;li&gt;Use async message queues (ZMQ/ROS2/Zenoh) for inter-service communication.&lt;/li&gt;
&lt;li&gt;Prioritize the VLA at 24 Hz with strict deadlines while YOLO runs at 5 Hz with graceful degradation.&lt;/li&gt;
&lt;/ul&gt;
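&lt;p&gt;As a minimal sketch of the queue-based pattern, with pure Python &lt;code&gt;asyncio&lt;/code&gt; standing in for ZMQ/ROS2/Zenoh (topic names and rates are illustrative):&lt;/p&gt;

```python
import asyncio
import time

class EventBus:
    """Tiny in-process pub/sub with bounded queues and drop-oldest backpressure."""
    def __init__(self, maxsize=4):
        self.topics, self.maxsize = {}, maxsize

    def queue(self, topic):
        return self.topics.setdefault(topic, asyncio.Queue(maxsize=self.maxsize))

    def publish(self, topic, msg):
        q = self.queue(topic)
        if q.full():
            q.get_nowait()  # drop the stalest message so producers never block
        q.put_nowait(msg)

async def model_loop(bus, topic, rate_hz, infer, duration_s, counts):
    """Run `infer` at a target rate and publish each result to the bus."""
    period = 1.0 / rate_hz
    deadline = time.monotonic() + duration_s
    while True:
        start = time.monotonic()
        if start > deadline:
            break
        bus.publish(topic, infer())
        counts[topic] = counts.get(topic, 0) + 1
        # Sleep only the remaining slack; a missed deadline simply skips the sleep.
        await asyncio.sleep(max(0.0, period - (time.monotonic() - start)))

async def run_models(duration_s=0.5):
    bus, counts = EventBus(), {}
    await asyncio.gather(
        model_loop(bus, "vla/actions", 24, lambda: "action", duration_s, counts),
        model_loop(bus, "yolo/detections", 5, lambda: "boxes", duration_s, counts),
    )
    return counts
```

&lt;p&gt;The bounded queues give you graceful degradation for free: when a background consumer falls behind, it sees fewer, fresher messages instead of an unbounded backlog.&lt;/p&gt;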

&lt;p&gt;&lt;strong&gt;Tools &amp;amp; libraries&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ROS 2&lt;/strong&gt;: Native deadline/lifespan QoS policies atop DDS (&lt;a href="https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html" rel="noopener noreferrer"&gt;ROS 2 QoS design&lt;/a&gt;). Used heavily in &lt;strong&gt;Reachy 2&lt;/strong&gt;’s core ROS 2 workspace (&lt;a href="https://github.com/pollen-robotics/reachy2_core" rel="noopener noreferrer"&gt;&lt;code&gt;reachy2_core&lt;/code&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zenoh (OM1’s choice)&lt;/strong&gt;: Low-latency pub/sub and key/value messaging, lighter than full ROS 2 middleware. OM1 integrates Zenoh for cross-component data exchange (&lt;a href="https://github.com/OpenMind/OM1" rel="noopener noreferrer"&gt;OM1 repo&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis + Lua&lt;/strong&gt;: Simple pub/sub and atomic operations for single-host deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick start template&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create prioritized CUDA streams for each model based on real-time requirements (&lt;a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities" rel="noopener noreferrer"&gt;CUDA stream priorities&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Use Python &lt;code&gt;asyncio&lt;/code&gt; (&lt;a href="https://docs.python.org/3/library/asyncio.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;) or ROS 2 callbacks for concurrent execution and queue-based communication.&lt;/li&gt;
&lt;li&gt;Start with critical models at high priority (e.g., 24 Hz), background models at normal priority (e.g., 5 Hz).&lt;/li&gt;
&lt;li&gt;Add Prometheus/Grafana or equivalent monitoring for latency, queue depths, and thermal throttling.&lt;/li&gt;
&lt;/ul&gt;
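&lt;p&gt;A hedged sketch of the stream-priority step using PyTorch's CUDA stream API (assumes PyTorch is available; it degrades to direct CPU execution otherwise so the surrounding control code stays portable):&lt;/p&gt;

```python
# Assumes PyTorch; everything falls back to plain CPU execution when CUDA
# (or torch itself) is unavailable.
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    torch, HAS_CUDA = None, False

def make_priority_streams():
    """Return (high, normal) CUDA streams, or (None, None) off-GPU.

    CUDA stream priorities: lower numbers run with higher priority;
    -1 is the highest priority PyTorch exposes by default.
    """
    if not HAS_CUDA:
        return None, None
    high = torch.cuda.Stream(priority=-1)   # critical model (e.g. the 24 Hz VLA)
    normal = torch.cuda.Stream(priority=0)  # background models (YOLO variants)
    return high, normal

def run_in_stream(stream, fn):
    """Launch fn's GPU work in the given stream; run directly when off-GPU."""
    if stream is None:
        return fn()
    with torch.cuda.stream(stream):
        return fn()
```

&lt;p&gt;Note that stream priority only reorders kernel scheduling on the GPU; it does not partition SMs or bandwidth, which is why production stacks combine it with queue-based backpressure and monitoring.&lt;/p&gt;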
&lt;h3&gt;
  
  
  2.5 Real-World Example: OM1 (OpenMind)
&lt;/h3&gt;

&lt;p&gt;OM1 (“OpenMind Modular AI Runtime for Robots”) demonstrates mode-based multi-model execution in a &lt;strong&gt;single Dockerized runtime&lt;/strong&gt;, orchestrating LLMs, VLMs, and robotics stacks together (&lt;a href="https://github.com/OpenMind/OM1" rel="noopener noreferrer"&gt;OM1 repo&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single Docker Container (OM1 Runtime)
  ├─ Multiple operational modes (welcome, slam, navigation, etc.)
  ├─ Concurrent LLM execution (Fast Action + Core + Mentor LLMs)
  ├─ Zenoh pub/sub for inter-component communication
  ├─ Background processes (SLAM, navigation, face recognition)
  └─ Input orchestrators (VLM, ASR, sensors)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No GPU partitioning.&lt;/strong&gt; Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple LLMs run concurrently with different roles and priorities (e.g., fast-reactive vs. deliberative).&lt;/li&gt;
&lt;li&gt;Vision models (VLM variants) provide continuous perception.&lt;/li&gt;
&lt;li&gt;SLAM and navigation models run in background with graceful degradation.&lt;/li&gt;
&lt;li&gt;All components communicate via &lt;strong&gt;Zenoh pub/sub&lt;/strong&gt; messaging and ROS 2 where appropriate.&lt;/li&gt;
&lt;li&gt;Dynamic mode transitions reallocate resources based on context and intent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: OM1 shows &lt;strong&gt;production-grade multi-model AI orchestration&lt;/strong&gt; (LLMs + VLMs + SLAM + navigation) using event-driven, priority-based scheduling rather than hard GPU partitioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 Prioritization: Pros, Cons, When to Use
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-proven patterns (LeRobot async inference, OM1 runtime, Reachy 2 ROS 2 workspace).&lt;/li&gt;
&lt;li&gt;Graceful degradation (non-critical models adapt to resource constraints).&lt;/li&gt;
&lt;li&gt;Easy to debug (message introspection, queue monitoring, logging).&lt;/li&gt;
&lt;li&gt;Scales horizontally (add models without rearchitecting core systems).&lt;/li&gt;
&lt;li&gt;Platform-agnostic (works with NVIDIA, ROCm, CPU-only).&lt;/li&gt;
&lt;li&gt;Adaptive resource allocation (responds to thermal/power limits).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared GPU bandwidth contention (models can still interfere).&lt;/li&gt;
&lt;li&gt;Message serialization overhead (~1–2 ms per inter-model communication).&lt;/li&gt;
&lt;li&gt;Requires understanding async patterns and queue management.&lt;/li&gt;
&lt;li&gt;Not suitable for strict multi-tenant isolation guarantees.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Need guaranteed low-latency for critical model": Docker + ROS 2 + prioritized CUDA streams.&lt;/li&gt;
&lt;li&gt;"Running multiple YOLO26 variants (detect/segment/pose)": Event-driven architecture with async queues.&lt;/li&gt;
&lt;li&gt;"Building production robotics system": Docker Compose + Zenoh + mode-based execution.&lt;/li&gt;
&lt;li&gt;"Rapid prototyping on single device": Python &lt;code&gt;asyncio&lt;/code&gt; + CUDA streams.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Approach 3: Offloading – Pushing Work Off the Edge Device
&lt;/h2&gt;

&lt;p&gt;Offloading moves some or all model computation off the edge device to &lt;strong&gt;separate GPU servers or cloud infrastructure&lt;/strong&gt;. This eliminates local contention at the cost of network latency and extra infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Remote Inference Offloading (LeRobot-Style Pattern)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Run &lt;strong&gt;policy inference or heavy model inference&lt;/strong&gt; on a separate GPU server, while the robot (edge device) handles sensors and low-level control. Communication happens over &lt;strong&gt;gRPC streaming&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the pattern used in &lt;strong&gt;LeRobot’s async inference stack&lt;/strong&gt;, where a &lt;code&gt;PolicyServer&lt;/code&gt; runs on a workstation GPU and a &lt;code&gt;RobotClient&lt;/code&gt; runs on the robot, exchanging observations and actions via gRPC (&lt;a href="https://github.com/huggingface/lerobot" rel="noopener noreferrer"&gt;LeRobot repo&lt;/a&gt;, see &lt;code&gt;lerobot/async_inference/policy_server.py&lt;/code&gt; and &lt;code&gt;robot_client.py&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to implement&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy policies and heavy models on a dedicated inference server with a larger GPU.&lt;/li&gt;
&lt;li&gt;Use gRPC streaming for low-latency communication between the robot and the inference server (&lt;a href="https://grpc.io/docs/languages/python/" rel="noopener noreferrer"&gt;gRPC Python docs&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
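&lt;p&gt;The request/response loop can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt; streams standing in for gRPC (the message fields, placeholder policy, and JSON wire format are illustrative only; LeRobot's real stack uses protobuf-defined messages over gRPC):&lt;/p&gt;

```python
import asyncio
import json

async def policy_server(reader, writer):
    """Server side: read an observation, run 'inference', stream back an action."""
    while True:
        line = await reader.readline()
        if not line:
            break
        obs = json.loads(line)
        action = {"joint_deltas": [0.0] * obs["dof"]}  # placeholder policy
        writer.write((json.dumps(action) + "\n").encode())
        await writer.drain()
    writer.close()

async def robot_client(host, port, n_steps):
    """Client side: send observations at the control rate, collect actions."""
    reader, writer = await asyncio.open_connection(host, port)
    actions = []
    for step in range(n_steps):
        writer.write((json.dumps({"step": step, "dof": 6}) + "\n").encode())
        await writer.drain()
        actions.append(json.loads(await reader.readline()))
    writer.close()
    return actions

async def demo(n_steps=3):
    server = await asyncio.start_server(policy_server, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    result = await robot_client("127.0.0.1", port, n_steps)
    server.close()
    return result
```

&lt;p&gt;In a real deployment the client would also timestamp observations and discard stale actions, since network jitter enters the control loop directly.&lt;/p&gt;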

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero GPU contention on the edge&lt;/strong&gt;: Edge resources are freed for additional models or real-time control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable inference&lt;/strong&gt;: Upgrade server GPUs independently of edge hardware constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable latency&lt;/strong&gt;: Often more predictable network latency vs. highly variable local multi-model GPU sharing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete isolation&lt;/strong&gt;: Models run on separate hardware, eliminating interference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network dependency&lt;/strong&gt;: Requires reliable low-latency network connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth overhead&lt;/strong&gt;: Camera frames must be compressed and transmitted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional infrastructure&lt;/strong&gt;: Need dedicated inference servers and monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher complexity&lt;/strong&gt;: Distributed system management, failure handling, and observability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world use&lt;/strong&gt;: LeRobot uses this client–server architecture for &lt;strong&gt;RL policy inference&lt;/strong&gt; and async action streaming. The same pattern generalizes to VLA + YOLO26 pipelines, but for those, you must account for much higher bandwidth (video frames) and tighter latency budgets.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Orchestrated Offloading with Triton and Microservices
&lt;/h3&gt;

&lt;p&gt;NVIDIA Triton Inference Server provides &lt;strong&gt;process-level isolation and scheduling&lt;/strong&gt; for multi-model deployments, often on a central server:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Multi-model serving platform with built-in queuing, batching, and per-model scheduling policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to implement&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure separate model repositories with dedicated GPU instances and per-model batching policies with different latency deadlines.&lt;/li&gt;
&lt;li&gt;Expose models over gRPC/HTTP to edge clients.&lt;/li&gt;
&lt;/ul&gt;
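&lt;p&gt;A hedged example of what a per-model &lt;code&gt;config.pbtxt&lt;/code&gt; might look like (the model name, batch size, and queue delay are illustrative, not tuned values):&lt;/p&gt;

```protobuf
# Hypothetical Triton model configuration for a detection model.
name: "yolo_detect"
platform: "tensorrt_plan"
max_batch_size: 8
instance_group [ { count: 1, kind: KIND_GPU } ]
dynamic_batching {
  max_queue_delay_microseconds: 2000  # trade up to 2 ms of queueing for batching
}
```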

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-grade scheduling and queuing.&lt;/li&gt;
&lt;li&gt;Per-model deadlines and batching policies.&lt;/li&gt;
&lt;li&gt;High resource efficiency on server GPUs.&lt;/li&gt;
&lt;li&gt;Can mix NVIDIA-stack and non-NVIDIA models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning curve (gRPC, model configs).&lt;/li&gt;
&lt;li&gt;Overhead from HTTP/gRPC serialization (5–10 ms per request).&lt;/li&gt;
&lt;li&gt;Still subject to GPU bandwidth contention on the server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Distributed edge deployment with network": Remote offloading + gRPC streaming.&lt;/li&gt;
&lt;li&gt;"Enterprise ML pipeline with model versioning": Triton Inference Server + model ensembles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 When Offloading Makes Sense
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You cannot meet latency or throughput targets within the edge device's power/thermal envelope.&lt;/li&gt;
&lt;li&gt;You need to run many heavy models simultaneously, but only a subset of them require strict real-time guarantees on the robot.&lt;/li&gt;
&lt;li&gt;Your deployment environment has reliable wired or high-quality wireless connectivity.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Putting It Together: Comparing the Three Approaches
&lt;/h2&gt;

&lt;p&gt;In practice, &lt;strong&gt;production systems mix these&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;cgroups and containers&lt;/strong&gt; for basic isolation.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;prioritized CUDA streams and event buses&lt;/strong&gt; for real-time behavior.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;offloading&lt;/strong&gt; for heavyweight or non-real-time models that don't fit on the edge box.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Given today’s hardware, &lt;strong&gt;a single Jetson Orin Nano Super is not yet a comfortable platform for running a large VLA plus multiple heavy YOLO26 variants and other models concurrently&lt;/strong&gt; at strict real-time rates. You can prototype pieces of this stack, but for production you will almost certainly need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More capable edge hardware&lt;/strong&gt; (Orin NX/AGX, Thor, or similar), or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Significant offloading&lt;/strong&gt; to nearby GPU servers, and/or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressively optimized models&lt;/strong&gt; (distillation, pruning, quantization, ONNX/TensorRT deployment).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, the architectural lessons are already clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: For multi-model AI on edge devices, avoid static hardware partitioning as your primary tool. Favor &lt;strong&gt;event-driven architectures&lt;/strong&gt; with prioritized CUDA streams and async messaging, and treat partitioning and offloading as &lt;strong&gt;supporting levers&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Have questions or suggestions? Drop them in the comments below.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OM1 Architecture&lt;/strong&gt; (event-driven multimodal runtime): &lt;code&gt;https://github.com/OpenMind/OM1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LeRobot&lt;/strong&gt; (RL + async inference + gRPC): &lt;code&gt;https://github.com/huggingface/lerobot&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reachy 2 Core (ROS 2 workspace)&lt;/strong&gt;: &lt;code&gt;https://github.com/pollen-robotics/reachy2_core&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reachy 2 Python SDK&lt;/strong&gt;: &lt;code&gt;https://github.com/pollen-robotics/reachy2-sdk&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Jetson Deployment with Triton&lt;/strong&gt;: &lt;code&gt;https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/jetson.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA Green Contexts&lt;/strong&gt;: &lt;code&gt;https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA MPS Guide&lt;/strong&gt;: &lt;code&gt;https://docs.nvidia.com/deploy/mps/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPS Interference Discussion&lt;/strong&gt;: &lt;code&gt;https://forums.developer.nvidia.com/t/mps-interference-problem/312930&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPS Latency Outlier Discussion&lt;/strong&gt;: &lt;code&gt;https://forums.developer.nvidia.com/t/mps-vs-no-mps-drastic-increase-in-kernel-latency/336175&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROS 2 Real-Time QoS&lt;/strong&gt;: &lt;code&gt;https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python asyncio&lt;/strong&gt;: &lt;code&gt;https://docs.python.org/3/library/asyncio.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose for robotics&lt;/strong&gt;: &lt;code&gt;https://fenilsonani.com/articles/docker-compose-multi-container-orchestration&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YOLO 26 Family&lt;/strong&gt;: &lt;code&gt;https://www.ultralytics.com/news/ultralytics-redefines-state-of-the-art-vision-ai-with-yolo26&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>robotics</category>
      <category>vla</category>
      <category>ai</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Insights from Sergey Levine’s appearance on the Dwarkesh Patel podcast</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Mon, 19 Jan 2026 09:10:46 +0000</pubDate>
      <link>https://dev.to/ankk98/insights-from-sergey-levines-appearance-on-the-dwarkesh-patel-podcast-36bi</link>
      <guid>https://dev.to/ankk98/insights-from-sergey-levines-appearance-on-the-dwarkesh-patel-podcast-36bi</guid>
      <description>&lt;p&gt;Just finished an incredible deep dive into the future of robotics with Sergey Levine of Physical Intelligence. The "Robotics Flywheel" is much closer than people realize.&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://youtu.be/48pxVdmkMIE?si=UamP4IMBoI0jOyMB" rel="noopener noreferrer"&gt;https://youtu.be/48pxVdmkMIE?si=UamP4IMBoI0jOyMB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are my top takeaways on the path to general-purpose robots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 5-Year Horizon:&lt;/strong&gt; The median estimate for robots performing complex, autonomous home tasks and blue-collar work is just &lt;strong&gt;five years&lt;/strong&gt;. It’s a "single-digit" year problem, not a multi-decade one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Representation Problem:&lt;/strong&gt; Video is harder than text because text is already abstracted into meaning, while video is just "compressed pixels". To scale, robots need to ignore "noise" (like moving clouds) and focus only on goal-relevant changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardware vs. Software:&lt;/strong&gt; Smarter AI actually makes hardware &lt;strong&gt;cheaper&lt;/strong&gt;. High-quality visual feedback allows robots to use "cheap," less precise parts because the AI can sense and correct mechanical errors in real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Inference Trilemma:&lt;/strong&gt; There is a constant trade-off between &lt;strong&gt;Inference Speed (Hz)&lt;/strong&gt;, &lt;strong&gt;Model Size (Parameters)&lt;/strong&gt;, and &lt;strong&gt;Context Length (Memory)&lt;/strong&gt;. The goal is to move toward the human brain's "extreme parallelism," where perception and planning run at different rates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Imitation Before RL:&lt;/strong&gt; You can’t start with Reinforcement Learning (RL) from scratch; it takes too long. You must use supervised learning (imitation) first to provide the "prior knowledge" and common sense the robot needs to eventually learn on the job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emergent Compositionality:&lt;/strong&gt; Robots are starting to show "emergent" skills. Levine noted a robot that learned to clear an obstacle before folding laundry without being specifically trained for that sequence; that’s "compositional generalization".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Moravec’s Paradox:&lt;/strong&gt; This is the core of robotics: the things humans find easy (folding a T-shirt) are the hardest for AI, while the things we find hard (calculus) are easy. Physical proficiency is a massive computational challenge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Externalized Brain:&lt;/strong&gt; For robots to be affordable, we might see "off-board inference". A robot might be in a "dumber" reactive mode if offline but become significantly smarter when connected to a high-speed data center.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous Embodiments:&lt;/strong&gt; The goal isn't just to build "mechanical people"; it's to build heterogeneous systems that can be 100 feet tall or tiny, all powered by the same foundational intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 24Hz Benchmark:&lt;/strong&gt; The human mind processes visual information and reacts at roughly &lt;strong&gt;24 frames per second (24 Hz)&lt;/strong&gt;. To achieve human-level proficiency, robots must match this high-frequency inference while simultaneously managing the "trilemma" of increasing model size and memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 1-Second Context Paradox:&lt;/strong&gt; Current state-of-the-art VLA models often operate with only a &lt;strong&gt;one-second context window&lt;/strong&gt;. It is "shocking" that they can execute minute-long tasks by only observing the immediate past, but true autonomy will require scaling this to the minutes, hours, or even &lt;strong&gt;"decades of context"&lt;/strong&gt; that humans use to inform their plans.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emergent Meta-Learning:&lt;/strong&gt; Meta-learning, the ability for a model to "learn how to learn" is an &lt;strong&gt;emergent property&lt;/strong&gt; seen in large foundation models. A sufficiently smart model can evaluate its own performance and figure out how to leverage auxiliary data, like &lt;strong&gt;simulations or synthetic experience&lt;/strong&gt;, to improve its success on real-world objectives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mastering Counterfactuals:&lt;/strong&gt; The "key" to optimal decision-making is the ability to answer &lt;strong&gt;counterfactuals&lt;/strong&gt;: "If I did this instead of that, would it be better?". Whether a robot uses a learned simulator, a reward model, or a value function, the core of intelligence is having a mechanism to &lt;strong&gt;evaluate these alternative futures&lt;/strong&gt; and pick the best one.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>robotics</category>
      <category>ai</category>
      <category>humanoids</category>
    </item>
    <item>
      <title>Is Humanoids' Data Appetite Really Endless?</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sat, 17 Jan 2026 08:17:53 +0000</pubDate>
      <link>https://dev.to/ankk98/is-humanoids-data-appetite-really-endless-39lj</link>
      <guid>https://dev.to/ankk98/is-humanoids-data-appetite-really-endless-39lj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Humanoid robots like Tesla's Optimus and Figure AI's machines are generating massive hype, but the critical question isn't just whether they need data; it's how much, and what kind.&lt;/p&gt;

&lt;p&gt;The narrative suggests humanoids require endless datasets, creating a boom market for data startups. But 2024–2025 research suggests a different trajectory: humanoids will need substantial data initially, then demand will plateau and shift toward specialized services like curation and safety validation rather than raw collection. The business model around data changes from collection to intelligent processing.&lt;/p&gt;

&lt;p&gt;This analysis examines four core doubts about the "endless data appetite" narrative, then weighs counterarguments that suggest certain demands persist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Arguments for Plateauing Demands
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Doubt 1: Do Scaling Laws Show Diminishing Returns?
&lt;/h3&gt;

&lt;p&gt;Microsoft's &lt;a href="https://arxiv.org/abs/2411.04434" rel="noopener noreferrer"&gt;"Scaling Laws for Pre-training Agents and World Models"&lt;/a&gt; (2024) reveals that embodied AI systems follow power-law relationships, not linear growth. Optimal data scales with compute as D ∝ C^0.68, meaning data requirements grow much slower than computational capacity. Crucially, losses plateau at large datasets (1.63 billion pairs) without significant overfitting.&lt;/p&gt;
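&lt;p&gt;To make the exponent concrete, here is the back-of-envelope arithmetic (the 0.68 exponent comes from the paper above; the multipliers follow directly from the power law):&lt;/p&gt;

```python
# With compute-optimal data D proportional to C**0.68, scaling compute by a
# factor k scales the data requirement by only k**0.68.
def data_growth(compute_factor, exponent=0.68):
    """Multiplier on compute-optimal dataset size for a given compute multiplier."""
    return compute_factor ** exponent

print(round(data_growth(2), 2))   # doubling compute needs only ~1.6x the data
print(round(data_growth(10), 2))  # 10x compute needs under 5x the data
```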

&lt;p&gt;For humanoids, this means early data (first 100 trajectories) drives massive capability gains. The 10,000th trajectory? Marginal improvements. By 100,000 trajectories, you're fighting diminishing returns.&lt;/p&gt;

&lt;p&gt;NVIDIA's &lt;a href="https://arxiv.org/abs/2505.12705" rel="noopener noreferrer"&gt;"DreamGen"&lt;/a&gt; (2025) demonstrates this principle in practice. A generative world model trained on one teleop task generated 22 novel behaviors without collecting additional real-world data. Recent work on &lt;a href="https://openreview.net/forum?id=TjCDNssXKU" rel="noopener noreferrer"&gt;"Learning Hierarchical World Models with Adaptive Temporal Abstractions"&lt;/a&gt; (Gumbsch et al., ICLR 2024) shows hierarchical approaches like THICK achieve efficiency improvements through multi-timescale reasoning with far less data than flat world models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Foundational training peaks in 2026–2028. Afterward, demand likely drops 50–70% as efficiency gains mature.&lt;/p&gt;




&lt;h3&gt;
  
  
  Doubt 2: Can Few-Shot Learning Replace Massive Datasets?
&lt;/h3&gt;

&lt;p&gt;UC Berkeley's &lt;a href="https://www.science.org/doi/10.1126/scirobotics.adi9579" rel="noopener noreferrer"&gt;"Real-World Humanoid Locomotion with Reinforcement Learning"&lt;/a&gt; (2024) shows Agility Robotics' Digit humanoid adapting to diverse terrains in fewer than 100 real-world trials, with 90% zero-shot success on new environments.&lt;/p&gt;

&lt;p&gt;Honda Research Institute's &lt;a href="https://arxiv.org/abs/2506.13762" rel="noopener noreferrer"&gt;"VisuoTactile Pretraining"&lt;/a&gt; (2025) demonstrates that contact-rich manipulation (USB insertion, card swiping, key insertion) achieves 90%+ success with only 32 demonstrations plus 45 minutes of reinforcement learning. Combining visual and tactile feedback replaces the need for massive labeled datasets.&lt;/p&gt;

&lt;p&gt;The theoretical foundation appears in &lt;a href="https://arxiv.org/abs/2403.03950" rel="noopener noreferrer"&gt;"Stop Regressing: Training Value Functions via Classification"&lt;/a&gt; (2024). Classification-based value functions (Q-transformers) outperform regression in manipulation, achieving state-of-the-art results with dramatically fewer trajectories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Deep RL is more sample-efficient than supervised learning for robotics. By 2032, few-shot learning likely cuts requirements 80-90% compared to supervised approaches.&lt;/p&gt;




&lt;h3&gt;
  
  
  Doubt 3: Will Synthetic Data Make Real Data Obsolete?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cm.asiae.co.kr/en/article/2025122608365327507" rel="noopener noreferrer"&gt;"Video2Robot"&lt;/a&gt; (Aim Intelligence, 2025) converts human videos into physics-grounded humanoid trajectories, scaling behaviors like climbing without real robot captures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2512.04537" rel="noopener noreferrer"&gt;"X-Humanoid"&lt;/a&gt; (2025) converts Ego-Exo4D videos (60 hours = 3.6 million frames) into Optimus-like action sequences for cooking and biking, training both policies and world models.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2510.08807" rel="noopener noreferrer"&gt;"Humanoid Everyday"&lt;/a&gt; dataset (260 real-world robotic tasks) is currently the largest multimodal humanoid dataset, yet its authors acknowledge that synthetic data enables generalization beyond real data's domain.&lt;/p&gt;

&lt;p&gt;Citi's &lt;a href="https://www.citigroup.com/global/insights/the-rise-of-ai-robots" rel="noopener noreferrer"&gt;"The Rise of AI Robots"&lt;/a&gt; (2024) forecasts 1.3 billion robots by 2035, primarily trained via simulation. This scales via GPU rendering, not manual collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Synthetic data dominates by 2028-2030. Real data demand drops 80-90%. Real data becomes specialized (edge cases, safety validation, domain-specific fine-tuning).&lt;/p&gt;




&lt;h3&gt;
  
  
  Doubt 4: Does Internal Fleet Learning Hide External Demand?
&lt;/h3&gt;

&lt;p&gt;Tesla, Figure, and Boston Dynamics don't buy data from startups. They collect internally. A former Tesla Autopilot engineer noted: "Data generation isn't the bottleneck. They collect terabytes per hour. The hard part is finding the right clips for training. That's curation."&lt;/p&gt;

&lt;p&gt;This shifts the market entirely. Collection becomes free; curation becomes valuable. A startup identifying the 1% of fleet data most valuable for improvement is worth billions. A startup selling raw teleoperation data? Increasingly irrelevant.&lt;/p&gt;

&lt;p&gt;Figure AI's &lt;a href="https://www.prnewswire.com/news-releases/figure-raises-675m-at-2-6b-valuation-and-signs-collaboration-agreement-with-openai-302074897.html" rel="noopener noreferrer"&gt;"$675M Series B funding"&lt;/a&gt; (February 2024) went to in-house development, not external data purchases. &lt;a href="https://arxiv.org/abs/2505.12705" rel="noopener noreferrer"&gt;"DreamGen"&lt;/a&gt; explicitly demonstrates autonomous data generation via learned world models.&lt;/p&gt;

&lt;p&gt;NVIDIA researcher Jim Fan noted in an &lt;a href="https://officechai.com/ai/unlike-with-llms-itll-take-2-5-years-to-figure-out-robotics-scaling-law-nvidias-jim-fan/" rel="noopener noreferrer"&gt;"April 9, 2025 Office Chai interview"&lt;/a&gt;: "Unlike LLMs, robotics doesn't yet have clear scaling laws. Compute and data are both bottlenecks, but physical data collection remains expensive."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: External data demand stays low from 2026 onward and approaches zero by 2036 as fleets mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Counterarguments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Sim-to-Real Gap Persists
&lt;/h3&gt;

&lt;p&gt;Simulation handles gravity, friction, and inertia. It doesn't capture material properties, sensor noise, wear, or degradation over time. A robot trained in perfect simulation may fail after 100 real-world episodes due to unmodeled dynamics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2402.19469" rel="noopener noreferrer"&gt;"Humanoid Locomotion as Next Token Prediction"&lt;/a&gt; (2024) shows sim-trained policies require substantial real-world adaptation even with domain randomization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-Tuning Requires Real Data
&lt;/h3&gt;

&lt;p&gt;Google's &lt;a href="https://arxiv.org/abs/2403.02914" rel="noopener noreferrer"&gt;"MT-Opt"&lt;/a&gt; (2024) demonstrates that sim-trained policies need significant real robot data for fine-tuning across diverse tasks. As humanoids move to messy real-world settings, environment-specific adaptation demands increase, not decrease.&lt;/p&gt;

&lt;h3&gt;
  
  
  Robot Vision Gaps
&lt;/h3&gt;

&lt;p&gt;Embodied AI benchmarks reveal persistent gaps, particularly in temporal reasoning. Robots often treat frames independently while humans process continuous streams with temporal context. Understanding that someone "is about to" reach for an object requires temporal reasoning that current vision systems lack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety Validation Is Extensive
&lt;/h3&gt;

&lt;p&gt;ISO 13482 mandates comprehensive testing across failure modes. Real-world edge cases emerge unpredictably. Boston Dynamics' Atlas experienced numerous falls during development, each requiring data collection and analysis. Safety-critical applications demand orders of magnitude more validation data than general robotics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human Interaction Is Complex
&lt;/h3&gt;

&lt;p&gt;Humanoids working alongside people must interpret subtle social cues: body language, eye contact, contextual intent, theory of mind. Recent work on &lt;a href="https://arxiv.org/abs/2407.21626" rel="noopener noreferrer"&gt;"human-AI interaction"&lt;/a&gt; (2024) shows this capability remains elusive, requiring extensive multimodal training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Complexity Dominates
&lt;/h3&gt;

&lt;p&gt;History shows robotics underestimates real complexity. Tesla's Autopilot discovered thousands of edge cases post-deployment that simulation missed. Long-tail distributions mean rare but critical scenarios dominate failure cases. As humanoids enter homes, factories, and public spaces, new failure modes will emerge requiring continuous data collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Horizon Planning Remains Difficult
&lt;/h3&gt;

&lt;p&gt;Human tasks span minutes to hours with complex interdependencies. Reinforcement learning struggles with long-horizon credit assignment. Recent &lt;a href="https://arxiv.org/abs/2409.13373" rel="noopener noreferrer"&gt;"transformer-based planning work"&lt;/a&gt; (2024) shows hierarchical reasoning requires extensive trajectory data for reliable long-term decision-making.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intuitive Physics Capabilities Gap
&lt;/h3&gt;

&lt;p&gt;AI systems still lack robust understanding of object properties, stability, and physical interactions. Each novel environment or material type may require specific training data for reliable interaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Synthesis
&lt;/h2&gt;

&lt;p&gt;The doubts suggest demand peaks 2026-2028 then declines sharply. The counterarguments suggest certain demands persist. The reality is bifurcated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Types That Peak and Plateau
&lt;/h3&gt;

&lt;p&gt;Foundational locomotion datasets (walking, balance, navigation) peak 2026-2028, then plateau as core policies mature. Generic manipulation demos (grasping, lifting, placing) peak 2026-2029, then plateau. Teleoperation services for bootstrapping peak 2026-2028, then drop 80-90%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Types That Persist or Grow
&lt;/h3&gt;

&lt;p&gt;Safety validation data collection runs continuously: each new environment, interaction, or edge case requires data. &lt;strong&gt;Domain-specific fine-tuning data persists&lt;/strong&gt;. Healthcare robots need healthcare data; surgical robots need surgical data. Temporal and social interaction data grows as robots interact more with humans. Edge-case and failure data accumulates continuously. &lt;strong&gt;Fine-tuning data for hardware variations will also still be needed.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Market Trajectory
&lt;/h3&gt;

&lt;p&gt;Raw data collection captures 20-30% of robotics value chain 2026-2030. By 2031-2036, collection captures 2-5% while curation, processing, and domain adaptation capture 15-25%.&lt;/p&gt;

&lt;p&gt;Market size forecasts diverge significantly: &lt;a href="https://www.grandviewresearch.com/industry-analysis/humanoid-robot-market-report" rel="noopener noreferrer"&gt;"Grand View Research projects $4.04B by 2030"&lt;/a&gt; (17.5% CAGR from $1.55B in 2024) while &lt;a href="https://www.bccresearch.com/market-research/instrumentation-and-sensors/humanoid-robot-market.html" rel="noopener noreferrer"&gt;"BCC Research projects $11B by 2030"&lt;/a&gt; (42.8% CAGR from $1.9B in 2025). Grand View is likely conservative; BCC likely includes speculative demand scenarios. MarketsandMarkets forecasts $13.25B by 2029 at 45.5% CAGR.&lt;/p&gt;
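&lt;p&gt;A quick compound-growth check confirms the forecasts are internally consistent with their stated CAGRs:&lt;/p&gt;

```python
# Compound-growth check of the cited forecasts. Base values and CAGRs are
# taken from the reports quoted above; the arithmetic is the only thing added.

def project(base_billions: float, cagr: float, years: int) -> float:
    """Project a market size forward at a constant compound annual growth rate."""
    return base_billions * (1 + cagr) ** years

# Grand View: $1.55B in 2024 at 17.5% CAGR over 6 years
print(f"Grand View 2030: ${project(1.55, 0.175, 6):.2f}B")  # prints $4.08B
# BCC: $1.9B in 2025 at 42.8% CAGR over 5 years
print(f"BCC 2030: ${project(1.9, 0.428, 5):.2f}B")          # prints $11.28B
```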

&lt;h3&gt;
  
  
  The Critical Distinction
&lt;/h3&gt;

&lt;p&gt;Raw data becomes commoditized by 2028. The bottleneck shifts from collection to curation. Identifying valuable signal within terabytes of fleet data matters far more than raw collection volume.&lt;/p&gt;




&lt;h2&gt;
  
  
  Critical Context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Domain Variations Matter
&lt;/h3&gt;

&lt;p&gt;Consumer humanoids see highest efficiency gains; data demand drops 80-90% by 2035. Healthcare and surgical robots require conservative deployment with high safety validation; data demand remains substantial. Industrial robots in hazardous environments use extensive simulation with moderate efficiency gains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware-Software Coupling
&lt;/h3&gt;

&lt;p&gt;Better sensors (force feedback, advanced cameras) reduce data requirements. Lower-cost sensors increase requirements. Conclusions assume current hardware. Significant hardware shifts change data strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regional Differences
&lt;/h3&gt;

&lt;p&gt;Data privacy laws (GDPR in EU), labor costs, and safety standards vary by region, affecting data collection ROI and humanoid adoption willingness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Humanoids will need substantial data, but the trajectory is "peak and persist," not endless escalation. Foundational training peaks 2026-2028, driven by scaling law efficiency and synthetic data gains. Raw data demand then drops 50-90%.&lt;/p&gt;

&lt;p&gt;However, specialized data needs persist: sim-to-real fine-tuning, safety validation, social interaction learning, and edge case handling. The market story isn't about data volume declining; it's about value migrating from collection to curation.&lt;/p&gt;

&lt;p&gt;Pure data collection becomes trivial by 2028. The competitive advantage lies with companies solving intelligent curation, safety validation, and domain-specific adaptation. Integrated hardware-AI companies (Tesla, Boston Dynamics, Figure) internalize these capabilities, creating structural moats.&lt;/p&gt;

&lt;p&gt;Data infrastructure startups face headwinds unless they pivot from collection to specialization. The humanoid market grows to $4-13B by 2030, but raw data's share of that value shrinks from 20-30% to 2-5% as the field matures.&lt;/p&gt;

&lt;p&gt;This represents a fundamental shift: data becomes abundant; intelligence (curation, adaptation, validation) becomes scarce.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>data</category>
      <category>humanoid</category>
      <category>ai</category>
    </item>
    <item>
      <title>The 5 Levels of Humanoid Autonomy</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Fri, 16 Jan 2026 18:49:59 +0000</pubDate>
      <link>https://dev.to/ankk98/the-5-levels-of-humanoid-autonomy-1n54</link>
      <guid>https://dev.to/ankk98/the-5-levels-of-humanoid-autonomy-1n54</guid>
      <description>&lt;p&gt;If you scroll through X (Twitter) today, you’d think General Purpose Humanoids (GPH) are months away from folding our laundry and cooking 5-course meals. The reality is more nuanced and, for developers and founders, much more interesting.&lt;/p&gt;

&lt;p&gt;I’ve been digging into the "Self-Driving Levels" equivalent for robotics. We need a mental model to separate the hype (Level 5 sci-fi) from the commercial opportunities available &lt;em&gt;right now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Based on frameworks from &lt;strong&gt;SemiAnalysis&lt;/strong&gt; and insights from roboticist &lt;strong&gt;Rodney Brooks&lt;/strong&gt;, here is the definitive ladder of Humanoid Autonomy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Framework: Agency vs. Dexterity
&lt;/h2&gt;

&lt;p&gt;Unlike self-driving cars, which just need to &lt;em&gt;move&lt;/em&gt; safely, humanoids must &lt;em&gt;move&lt;/em&gt; (Agency) and &lt;em&gt;manipulate&lt;/em&gt; (Dexterity).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agency:&lt;/strong&gt; Perception, planning, and navigation in unstructured environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dexterity:&lt;/strong&gt; Grasping, force control, and fine manipulation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current commercial viability lies in balancing these two.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 0: Scripted Motion (The Industrial Past)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Mature (1980s–Present)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are the blind giants. They execute pre-programmed trajectories with sub-millimeter precision but have zero understanding of their environment. If you move the part by 1cm, the robot fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automotive Welding:&lt;/strong&gt; The backbone of Tesla/Toyota factories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Painting:&lt;/strong&gt; Uniform spraying of car bodies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy Palletizing:&lt;/strong&gt; Moving heavy boxes in completely caged, fixed zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCB Assembly:&lt;/strong&gt; Pick-and-place machines (high speed, zero intelligence).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNC Tending:&lt;/strong&gt; Loading raw metal into machines (requires precise fixturing).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Mature.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; FANUC M-2000, KUKA KR QUANTEC.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 1: Intelligent Pick &amp;amp; Place (The Visual Awakening)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Commercial Scale (2023–Present)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Robots gained eyes. Using computer vision and deep learning, these systems can identify objects in a cluttered bin and pick them up. They don't "understand" the object's function, but they know where it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parcel Sorting:&lt;/strong&gt; Identifying and grabbing random Amazon packages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agricultural Sorting:&lt;/strong&gt; Picking good apples vs. bad apples on a conveyor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debris Recycling:&lt;/strong&gt; Sorting plastic from glass in waste plants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kit Assembly:&lt;/strong&gt; Grabbing 3 different items to put in a subscription box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Control:&lt;/strong&gt; Visually inspecting parts and removing defects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Standard in logistics by 2026.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; RightHand Robotics, Covariant (software), FANUC with iRVision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 2: Autonomous Mobility (The Explorer)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Early Production (2024–2026)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Robots gained &lt;strong&gt;Agency&lt;/strong&gt;. They can map a new environment, navigate around obstacles, and decide &lt;em&gt;how&lt;/em&gt; to get from A to B. This is where Boston Dynamics’ Spot shines. Note: They can move, but they can't do much with their hands yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Industrial Inspection:&lt;/strong&gt; Reading analog gauges in oil refineries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Construction Patrol:&lt;/strong&gt; Scanning progress on building sites (BIM verification).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Autonomous patrolling of data centers or malls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hazard Mapping:&lt;/strong&gt; Entering gas-leak zones to measure toxicity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Last-Mile Delivery:&lt;/strong&gt; Sidewalk robots (Starship) navigating crowds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Commercially viable now for inspection; scaling fast.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; Boston Dynamics Spot, ANYbotics ANYmal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 3: Low-Skill Mobile Manipulation (The Founder's Sweet Spot)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Pilots -&amp;gt; Scale (2026–2029)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the biggest opportunity for startups right now.&lt;/strong&gt;&lt;br&gt;
These robots combine Level 2 mobility with Level 1 vision to perform &lt;em&gt;loose&lt;/em&gt; manipulation tasks. They can pick up a box, move it across a room, and put it down.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Crucial Insight:&lt;/em&gt; They struggle with &lt;strong&gt;force control&lt;/strong&gt;. They can't thread a needle or peel a potato perfectly because they lack tactile feeling. But they &lt;em&gt;can&lt;/em&gt; fry a basket of fries.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Cooking (The "Fry Cook"):&lt;/strong&gt; Dumping baskets of fries, flipping burgers (requires timing, not fine touch).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Restocking:&lt;/strong&gt; Taking a tote from a pallet and sliding it onto a shelf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Laundry Loading:&lt;/strong&gt; Picking up dirty clothes and shoving them into a washer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hospital Logistics:&lt;/strong&gt; Delivering lab samples or food trays to nurse stations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trash Collection:&lt;/strong&gt; Navigating an office to empty bins into a main cart.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Pilots 2025; Scale 2027-2028.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; Figure 01 (BMW pilot), Tesla Optimus (Factory transport), &lt;strong&gt;Chef Robotics&lt;/strong&gt; (Modular arms).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You don't need legs for this! A wheeled robot with an arm is 80% cheaper and 100% more stable for a kitchen.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Level 4: Force-Dependent Dexterity (The "Rodney Brooks" Wall)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Research Lab (2028+)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the barrier. To be a "General Purpose" humanoid, a robot needs &lt;strong&gt;tactile sensing&lt;/strong&gt; (touch). It needs to feel if a screw is cross-threaded, or if a tomato is too soft to slice.&lt;/p&gt;

&lt;p&gt;Rodney Brooks (co-founder of iRobot) argues this is the "hard part" the industry is underestimating. We have great vision (VLAs), but terrible touch.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full-Service Chef:&lt;/strong&gt; Slicing veggies, seasoning to taste, plating delicate herbs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elder Care:&lt;/strong&gt; Helping someone stand up (requires sensing their balance/frailty).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skilled Trades:&lt;/strong&gt; Installing electrical outlets or plumbing fixtures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Textile Work:&lt;/strong&gt; Buttoning a shirt or tying shoelaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Assembly:&lt;/strong&gt; Inserting flexible rubber gaskets into car doors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Research prototypes 2029; Commercial 2032+.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; None commercially yet. Lab prototypes from MIT/Stanford.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 5: Fully General Autonomy
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Sci-Fi (2032?)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A robot that can walk into a strange house, look around, and cook a specific family recipe using tools it has never seen before, without internet access.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "ADAS vs. FSD" Split: Why One Size Won't Fit All
&lt;/h2&gt;

&lt;p&gt;We often talk about humanoids as a monolith—one robot to rule them all. But look at the automotive industry. We didn't jump straight to Level 5 Robotaxis. Instead, we have a split market: 99% of cars have &lt;strong&gt;ADAS&lt;/strong&gt; (Lane Keep, Cruise Control) and &amp;lt;1% attempt &lt;strong&gt;FSD&lt;/strong&gt; (Full Self-Driving).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robotics will follow this exact same bifurcation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We aren't going to see a single "iPhone of Robots." Instead, &lt;strong&gt;Economics, Battery Life, Safety, and Compute&lt;/strong&gt; will force the market into two distinct categories:&lt;/p&gt;

&lt;h3&gt;
  
  
  Category 1: The "ADAS" Class (High Utility, Low Risk)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Build:&lt;/strong&gt; Wheeled bases, specialized grippers, constrained compute (e.g., Jetson Orin Nano).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Battery &amp;amp; Economics:&lt;/strong&gt; Wheels are 10x more energy-efficient than legs. Without the need to run a massive VLA model for every movement, these bots can run for 8-10 hours on a charge and cost &amp;lt;$10k.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption Vector:&lt;/strong&gt; These will dominate &lt;strong&gt;critical safety areas&lt;/strong&gt; first. Think radioactive waste handling, chemical spill cleanup, or repetitive high-heat industrial cooking. The ROI is immediate because the task is defined.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Category 2: The "FSD" Class (High Agency, High Cost)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Build:&lt;/strong&gt; Bipedal, humanoid hands, massive onboard inference compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Battery &amp;amp; Economics:&lt;/strong&gt; Balancing on two legs consumes massive power. Running a "Common Sense" brain drains the rest. These will cost $50k+ and last 2-4 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption Vector:&lt;/strong&gt; Research labs, luxury home help (eventually), and unstructured environments where wheels physically cannot go.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What’s Your Bet?
&lt;/h2&gt;

&lt;p&gt;The robotics industry is currently split between two philosophies: the "iPhone moment" where one hardware platform does everything (Level 4/5 Humanoids), and the "App Store" reality where specialized tools solve specific problems today (Level 3 Mobile Manipulators).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I’d love to hear your take:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you think I’m underestimating how fast VLA (Vision-Language-Action) models will solve the "dexterity gap"?&lt;/li&gt;
&lt;li&gt;Are you currently working on a Level 2 or Level 3 project?&lt;/li&gt;
&lt;li&gt;What’s the one "boring" chore you’d pay a Level 3 robot to do right now?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your predictions in the comments below!&lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[1] &lt;a href="https://www.mckinsey.com/industries/industrials/our-insights/humanoid-robots-crossing-the-chasm-from-concept-to-commercial-reality" rel="noopener noreferrer"&gt;McKinsey: Humanoid Robots Crossing the Chasm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[2] &lt;a href="https://eu.36kr.com/en/p/3487244922412161" rel="noopener noreferrer"&gt;36kr: Rodney Brooks Technical Critiques&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[3] &lt;a href="https://rodneybrooks.com/why-todays-humanoids-wont-learn-dexterity/" rel="noopener noreferrer"&gt;Rodney Brooks: Why Today's Humanoids Won't Learn Dexterity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[4] &lt;a href="https://newsletter.semianalysis.com/p/robotics-levels-of-autonomy" rel="noopener noreferrer"&gt;SemiAnalysis: Robotics Levels of Autonomy&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>humanoids</category>
      <category>ai</category>
      <category>vla</category>
      <category>robotics</category>
    </item>
    <item>
      <title>Humanoid Compute: Price vs. Performance</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Thu, 15 Jan 2026 10:45:51 +0000</pubDate>
      <link>https://dev.to/ankk98/humanoid-compute-price-vs-performance-842</link>
      <guid>https://dev.to/ankk98/humanoid-compute-price-vs-performance-842</guid>
      <description>&lt;p&gt;&lt;em&gt;Exploring emerging humanoid hardware options, their compute capabilities, and what models you can actually run in Jan 2026.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Hardware Confusion in Robotics
&lt;/h2&gt;

&lt;p&gt;Over the last few months, I've been deep in the robotics rabbit hole—exploring datasets, VLA models, open-source projects, and trying to make sense of which hardware actually works for humanoids. The landscape is confusing.&lt;/p&gt;

&lt;p&gt;NVIDIA Jetson? AMD Strix Halo? Raspberry Pi? Hailo accelerators? Tesla Optimus uses NVIDIA silicon, but what about Chinese robots? And critically: &lt;strong&gt;what VLA model can my hardware actually run in real time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article is my attempt to create clarity. I'm organizing emerging humanoid robots by price tier (in USD), showing the best compute choices, their actual performance with VLA models, realistic use cases, and—honestly—where you'll hit a wall and need to wait for the next generation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The humanoid robotics market is projected to reach &lt;strong&gt;$30-50 billion by 2035&lt;/strong&gt; with 2 million units deployed in workplaces. But today, most humanoids cost $20k-$150k. As costs drop toward $5-10k by 2030, &lt;strong&gt;the compute choice becomes critical&lt;/strong&gt;—it defines whether your robot thinks in real time or needs to defer to the cloud.&lt;/p&gt;

&lt;p&gt;According to recent analysis, &lt;strong&gt;compute represents 15-35% of a humanoid's total BOM&lt;/strong&gt;. Choose wrong, and you either overpay or end up with a silent, slow robot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 1: Under $1,200 — The DIY &amp;amp; Educational Tier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: Raspberry Pi 5 + &lt;a href="https://hailo.ai/products/ai-accelerators/hailo-8l-ai-accelerator-for-ai-light-applications/" rel="noopener noreferrer"&gt;Hailo-8L AI Accelerator&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broadcom BCM2712 (&lt;strong&gt;Quad-core&lt;/strong&gt; Arm Cortex-A76, 2.4 GHz)&lt;/td&gt;
&lt;td&gt;~$70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Accelerator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hailo-8L (13 TOPS)&lt;/td&gt;
&lt;td&gt;~$70-90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8GB LPDDR4X&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5-2.5W peak AI inference&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Compute System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;~$180-200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.raspberrypi.com/products/raspberry-pi-5/" rel="noopener noreferrer"&gt;See Raspberry Pi 5 specs&lt;/a&gt; | &lt;a href="https://docs.hailo.ai/" rel="noopener noreferrer"&gt;Hailo-8L documentation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Capabilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it can run:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YOLO v4/v5 Tiny&lt;/strong&gt;: 35+ FPS real-time object detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MobileNet V3&lt;/strong&gt;: Fast edge classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SmolVLM 500M&lt;/strong&gt;: Lightweight vision-language understanding (~1-2 Hz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM inference&lt;/strong&gt;: Qwen 3B with 4-bit quantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight visual servoing&lt;/strong&gt;: Sub-100ms latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it CANNOT run:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenVLA 7B (too large, too slow)&lt;/li&gt;
&lt;li&gt;Multi-model pipelines in parallel&lt;/li&gt;
&lt;li&gt;Real-time complex manipulation policies&lt;/li&gt;
&lt;li&gt;Continuous cloud-free learning&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Viable in 2026:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Educational robot arms (3D-printed chassis, &amp;lt;$500 mechanical)&lt;/li&gt;
&lt;li&gt;Warehouse shelf scanning &amp;amp; item detection&lt;/li&gt;
&lt;li&gt;Mobile base navigation with obstacle avoidance&lt;/li&gt;
&lt;li&gt;Simple teleoperation with human guidance&lt;/li&gt;
&lt;li&gt;Data collection and annotation platforms (collect data, train on cloud)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/liyiteng/AlohaMini" rel="noopener noreferrer"&gt;AlohaMini&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Next Generation" Problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Today:&lt;/strong&gt; This tier cannot run VLA models at robot-viable speeds (need &amp;gt;5 Hz for smooth control).&lt;/p&gt;
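&lt;p&gt;A back-of-envelope control-rate calculation makes this concrete. The per-inference latencies below are assumptions for illustration, not measured benchmarks:&lt;/p&gt;

```python
# Back-of-envelope control-rate check against the >5 Hz target mentioned above.
# The per-model latencies are illustrative assumptions, not measurements.

def control_rate_hz(inference_ms: float, overhead_ms: float = 10.0) -> float:
    """Achievable control-loop rate given model inference time plus a fixed
    per-cycle overhead (camera capture, pre/post-processing)."""
    return 1000.0 / (inference_ms + overhead_ms)

assumed_latencies_ms = {
    "SmolVLM 500M (assumed ~600 ms/step on this tier)": 600.0,
    "OpenVLA 7B (assumed ~5000 ms/step, if it fit at all)": 5000.0,
}
for name, ms in assumed_latencies_ms.items():
    hz = control_rate_hz(ms)
    verdict = "OK" if hz >= 5.0 else "below 5 Hz target"
    print(f"{name}: {hz:.2f} Hz ({verdict})")
```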

&lt;p&gt;&lt;strong&gt;Short-term workarounds (6-12 months):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid inference&lt;/strong&gt;: Run lightweight model locally, stream only complex decisions to remote&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Long-term (2027-2028):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hailo-8L successor (50+ TOPS at 3W) launches → enables real-time SmolVLA inference&lt;/li&gt;
&lt;li&gt;RPi 6 with better memory bandwidth → support for lightweight 1B VLAs&lt;/li&gt;
&lt;li&gt;Open-source distilled VLAs (&amp;lt;200M params) mature → native performance improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; This tier is for &lt;strong&gt;learning, prototyping, and collecting data&lt;/strong&gt;, not for autonomous manipulation. Use it to build datasets, then train bigger models on Jetson-class hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 2: $1,200-$2,400 — The Researcher's Playground
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: &lt;a href="https://www.nvidia.com/en-in/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;Jetson Orin Nano Super Developer Kit&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs (Jetson Orin Nano Super)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1024-core NVIDIA Ampere (32 Tensor Cores)&lt;/td&gt;
&lt;td&gt;~$249 (Dev Kit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67 TOPS (Sparse INT8)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6-core&lt;/strong&gt; Arm Cortex-A78AE v8.2&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7-25W (configurable)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cooling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active required&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Model Capabilities — The Reality Check
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw inference speed:&lt;/strong&gt; 0.3 Hz (3-4 seconds per action)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not viable for real-time control&lt;/strong&gt; (need &amp;gt;5 Hz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Viable if:&lt;/strong&gt; Slow manipulation (&amp;lt;1 action/sec), scripted sequences, or cloud-assisted planning&lt;/li&gt;
&lt;/ul&gt;
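&lt;p&gt;The arithmetic behind those numbers is simple: control rate is the reciprocal of per-action latency, so 3-4 seconds per action lands around 0.3 Hz, while a 5 Hz target leaves only a 200 ms budget per action.&lt;/p&gt;

```python
# Control-rate arithmetic: Hz is the reciprocal of seconds per action.
def rate_hz(seconds_per_action):
    return 1.0 / seconds_per_action

print(round(rate_hz(3.3), 2))  # ~0.3 Hz: OpenVLA 7B on this tier
print(rate_hz(0.2))            # 5.0 Hz target implies a 200 ms budget
```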

&lt;p&gt;&lt;strong&gt;SmolVLA 450M Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed:&lt;/strong&gt; 8-12 Hz with fp16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Viable for:&lt;/strong&gt; Real-time manipulation, visual servoing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory:&lt;/strong&gt; 2-3GB, leaves room for concurrent models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MiniVLA 1B Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed:&lt;/strong&gt; 3-5 Hz &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-model pipelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can run &lt;strong&gt;language model (3B) + vision model (450M) + low-level controller&lt;/strong&gt; simultaneously&lt;/li&gt;
&lt;li&gt;Use this for hierarchical control: "pick up the red block" → (LLM) → "grasp at position X" → (vision) → motor commands&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Recommended Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudo-code architecture for Jetson Orin Nano
&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt; &lt;span class="nc"&gt;Model &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="n"&gt;quantized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="n"&gt;decomposition&lt;/span&gt;
    &lt;span class="err"&gt;↓&lt;/span&gt;
&lt;span class="n"&gt;Vision&lt;/span&gt; &lt;span class="nc"&gt;Model &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Spatial&lt;/span&gt; &lt;span class="n"&gt;understanding&lt;/span&gt;
    &lt;span class="err"&gt;↓&lt;/span&gt;
&lt;span class="n"&gt;Action&lt;/span&gt; &lt;span class="nc"&gt;Policy &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SmolVLA&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;control&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;University robotics labs&lt;/li&gt;
&lt;li&gt;Early-stage startup prototyping&lt;/li&gt;
&lt;li&gt;Open-source humanoid development&lt;/li&gt;
&lt;li&gt;VLA model training &amp;amp; fine-tuning&lt;/li&gt;
&lt;li&gt;Research on embodied AI&lt;/li&gt;
&lt;li&gt;Manipulation tasks (pick &amp;amp; place, assembly with &amp;gt;1 sec cycle time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not suitable for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-speed assembly lines&lt;/li&gt;
&lt;li&gt;Time-critical dexterity (surgery, precision electronics)&lt;/li&gt;
&lt;li&gt;Multi-robot swarm coordination (requires cloud offloading)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The "Next Generation" Outlook
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026-2027 Improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Jetson Orin Nano successor&lt;/strong&gt; (2x memory to 16GB, 100+ TOPS) will enable real-time OpenVLA 7B inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization standardization&lt;/strong&gt;: INT4 quantization tools will mature → expect 2-3x speedups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA fine-tuning&lt;/strong&gt;: Parameter-efficient adaptation becomes standard → train custom models in &amp;lt;1 day on this hardware&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline to viability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Today:&lt;/strong&gt; Good for research &amp;amp; slow tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2027:&lt;/strong&gt; Will handle most manipulation tasks in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2028:&lt;/strong&gt; Budget-class humanoids will use this as primary compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Smart strategy:&lt;/strong&gt; Start with Orin Nano for algorithm development. Once models mature, migrate to Jetson AGX Orin for deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 3: $2,400-$6,000 — The "Real Robot" Tier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: &lt;a href="https://developer.nvidia.com/embedded/learn/get-started-jetson-agx-orin-devkit" rel="noopener noreferrer"&gt;NVIDIA Jetson AGX Orin 32GB or 64GB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategic Alternative:&lt;/strong&gt; &lt;a href="https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-max-plus-395.html" rel="noopener noreferrer"&gt;AMD Ryzen Strix Halo&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs (Jetson AGX Orin 64GB)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2048-core NVIDIA Ampere, 64GB LPDDR5X unified memory&lt;/td&gt;
&lt;td&gt;$2,200-2,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tensor Cores&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 Tensor Cores, 275 TOPS (Sparse INT8)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12-core Arm Cortex-A78AE&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15-60W (configurable via jetson_clocks)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;204.8 GB/s (critical for LLM inference)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total System Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Module + cooling + power: $2,500-3,000&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-orin/" rel="noopener noreferrer"&gt;Jetson AGX Orin Specs&lt;/a&gt; | &lt;a href="https://github.com/NVIDIA/TensorRT-LLM" rel="noopener noreferrer"&gt;TensorRT-LLM Benchmarks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Performance — The Goldilocks Hardware
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B in fp16:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;2 Hz inference&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Full model in memory, no quantization tricks needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B quantized (INT4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4-5 Hz inference&lt;/strong&gt; (real-time for slower tasks)&lt;/li&gt;
&lt;li&gt;Achieves 92-95% accuracy retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SmolVLA 450M:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15-20 Hz&lt;/strong&gt; (truly real-time)&lt;/li&gt;
&lt;li&gt;Comfortable headroom for safety checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-model stacking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run 7B reasoning LLM + 7B VLA + trajectory optimizer &lt;strong&gt;simultaneously&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Example: "Navigate kitchen while avoiding obstacles" = LLM (planning) + VLA (perception) + controller (low-level)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-time SLAM + AI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run ORB-SLAM on CPU cores while VLA runs on GPU&lt;/li&gt;
&lt;li&gt;Full 3D environment understanding + action selection in parallel&lt;/li&gt;
&lt;/ul&gt;
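&lt;p&gt;That CPU/GPU split can be sketched with two threads; the workers below are stand-ins, not real ORB-SLAM or VLA calls.&lt;/p&gt;

```python
# Sketch of the parallel split: SLAM-style tracking in one thread while
# the policy loop runs independently (on the GPU in the real system).
import threading
import queue

pose_queue = queue.Queue()
actions = []

def slam_worker(frames):
    # Placeholder: ORB-SLAM-style tracking would run here on CPU cores
    for i, _frame in enumerate(frames):
        pose_queue.put(("pose", i))

def policy_worker(n_steps):
    # Placeholder: VLA inference would run here on the GPU
    for _ in range(n_steps):
        actions.append("action")

t1 = threading.Thread(target=slam_worker, args=([0, 1, 2],))
t2 = threading.Thread(target=policy_worker, args=(3,))
t1.start(); t2.start()
t1.join(); t2.join()
print(pose_queue.qsize(), len(actions))  # both pipelines produced output
```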

&lt;h3&gt;
  
  
  Compute Cost in Humanoid BOM
&lt;/h3&gt;

&lt;p&gt;For a &lt;strong&gt;$3,500-4,500 complete humanoid:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jetson AGX Orin: $2,500 (~56-71% of total cost)&lt;/li&gt;
&lt;li&gt;Actuators: $900 (25%)&lt;/li&gt;
&lt;li&gt;Sensors/cameras: $300 (8%)&lt;/li&gt;
&lt;li&gt;Misc: $100 (3%)&lt;/li&gt;
&lt;/ul&gt;
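&lt;p&gt;A quick sanity check on that split, using the itemized costs above:&lt;/p&gt;

```python
# BOM sanity check: the itemized costs sum to ~$3,800, and compute
# (the Jetson AGX Orin) takes roughly two thirds of it.
bom = {"Jetson AGX Orin": 2500, "Actuators": 900,
       "Sensors/cameras": 300, "Misc": 100}
total = sum(bom.values())
compute_share = bom["Jetson AGX Orin"] / total
print(total)                       # 3800
print(round(compute_share * 100))  # 66 (% of the build that is compute)
```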

&lt;p&gt;&lt;strong&gt;The hard truth:&lt;/strong&gt; At this price tier, compute dominates cost. The robot is mostly brain, not body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Excellent for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Research institutions building dexterous systems&lt;/strong&gt; (manipulation labs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startups with Series A funding&lt;/strong&gt; (can justify $3K per unit compute cost)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Industrial pilots&lt;/strong&gt; (flexible assembly lines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal reasoning tasks&lt;/strong&gt; (navigation + manipulation + language understanding)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-robot learning&lt;/strong&gt; (collect data, fine-tune models locally)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-robot coordination&lt;/strong&gt; (compute models for fleet behavior)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3-5 Year Forecast
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026:&lt;/strong&gt; Jetson AGX Orin becomes the development standard for all serious humanoid research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2027:&lt;/strong&gt; Successor (likely 500+ TOPS) emerges with 2x efficiency → enables smaller robots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2028-2030:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost drops 30-40% through competition (AMD, Intel catch up)&lt;/li&gt;
&lt;li&gt;Memory standardizes at 128GB unified&lt;/li&gt;
&lt;li&gt;Real-time OpenVLA becomes baseline expectation&lt;/li&gt;
&lt;li&gt;On-robot learning (collect data → train → deploy in hours) becomes standard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is where the magic happens. This tier enables &lt;strong&gt;embodied AI systems&lt;/strong&gt; that truly think locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 4: $6,000-$12,000 — The Industrial Deployment Class
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: &lt;a href="https://nvidianews.nvidia.com/news/nvidia-blackwell-powered-jetson-thor-now-available-accelerating-the-age-of-general-robotics" rel="noopener noreferrer"&gt;NVIDIA Jetson AGX Thor&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blackwell architecture (NVIDIA's latest)&lt;/td&gt;
&lt;td&gt;Developer Kit: $3,499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,070 TFLOPS (FP4) / 1,035 TFLOPS (FP8)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128GB unified LPDDR5X&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;273 GB/s&lt;/strong&gt; (~1.3x Orin)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14-core Arm Neoverse V3AE&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40-130W configurable&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Module Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$2,500-2,800 (estimate)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Game Changer
&lt;/h3&gt;

&lt;p&gt;This is the inflection point. Thor entered production in August 2025 and is already adopted by Amazon Robotics, Boston Dynamics, and Figure AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; While the memory bandwidth (273 GB/s) is only a moderate step up from Orin, the real paradigm shift is the &lt;strong&gt;Blackwell GPU with native FP4 support&lt;/strong&gt;. This allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Double the effective model size:&lt;/strong&gt; Run larger models in 4-bit precision (FP4) with hardware acceleration, effectively doubling the usable memory capacity compared to FP8/INT8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer Engine:&lt;/strong&gt; Dynamically adjusts precision per layer to maintain accuracy while maximizing throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run multi-modal agents:&lt;/strong&gt; Run a 7B VLA and a 13B reasoning LLM simultaneously on a single module, thanks to the 2,070 TFLOPS of compute density.&lt;/li&gt;
&lt;/ul&gt;
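&lt;p&gt;The memory math behind the FP4 claim: halving the bits per weight halves the bytes needed to hold the parameters (weights only; KV cache and activations are extra).&lt;/p&gt;

```python
# Rough footprint of model weights at different precisions.
def model_gb(params_billion, bits_per_weight):
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, weights only

print(model_gb(7, 16))  # 14.0 GB at FP16
print(model_gb(7, 8))   # 7.0 GB at FP8
print(model_gb(7, 4))   # 3.5 GB at FP4
```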

&lt;h3&gt;
  
  
  Model Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B in full precision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5+ Hz consistently&lt;/strong&gt; (fast enough for dexterous tasks)&lt;/li&gt;
&lt;li&gt;No quantization hacks required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running multiple models simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30B reasoning model + 7B VLA + trajectory optimizer&lt;/li&gt;
&lt;li&gt;Example: "Assemble electronics" = LLM (step planning) + VLA (visual perception) + controller (motor commands)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-time multi-modal reasoning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision + language + proprioception all processing in parallel&lt;/li&gt;
&lt;li&gt;First time this is truly practical at the edge&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases — Industrial Reality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factory assembly lines&lt;/strong&gt; (complex dexterity, multi-object scenes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative manufacturing&lt;/strong&gt; (safety-critical, real-time adaptation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical robotics&lt;/strong&gt; (strict latency requirements, real-time feedback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced manipulation&lt;/strong&gt; (24+ DOF robots with tactile sensing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research that won't be outdated in 2 years&lt;/strong&gt; (future-proof choice)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Breakdown for $7,500 Industrial Humanoid
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jetson Thor module + integration&lt;/td&gt;
&lt;td&gt;$2,800&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dexterous actuators (24 DOF)&lt;/td&gt;
&lt;td&gt;$2,800&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensors + cameras + tactile&lt;/td&gt;
&lt;td&gt;$800&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power system (dual batteries)&lt;/td&gt;
&lt;td&gt;$500&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration + testing&lt;/td&gt;
&lt;td&gt;$600&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; At this tier, compute finally stops dominating BOM. Actuator cost rivals compute cost—a healthy balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  5-Year Outlook
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thor becomes standard for enterprise robotics R&amp;amp;D&lt;/li&gt;
&lt;li&gt;Competitors (AMD, Qualcomm) announce equivalents but won't ship for 12+ months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2027-2028:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jetson Thor successor (4,000+ TOPS) launches&lt;/li&gt;
&lt;li&gt;Manufacturing costs drop 30-40%&lt;/li&gt;
&lt;li&gt;First commercial humanoid deployments using Thor-class compute go mainstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2029-2030:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost drops to ~$1,500-2,000 per unit&lt;/li&gt;
&lt;li&gt;Becomes viable for mass-market humanoids ($15-20k retail)&lt;/li&gt;
&lt;li&gt;Full multimodal reasoning (vision + language + touch) becomes standard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Category 5: $12,000+ — The Frontier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Case-Specific Choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;General-purpose humanoid:&lt;/strong&gt; Custom NVIDIA silicon (Tesla Optimus path) or dual Jetson Thor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical robotics:&lt;/strong&gt; Medical-certified compute stack (higher latency tolerance but reliability critical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swarm robotics:&lt;/strong&gt; Jetson Thor + cloud-connected training infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Reality
&lt;/h3&gt;

&lt;p&gt;This is where &lt;strong&gt;the robot becomes secondary to the compute infrastructure&lt;/strong&gt;. You're not just buying a processor; you're buying into a &lt;strong&gt;training pipeline, simulation environment, and model zoo&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Companies in this tier (Tesla, Boston Dynamics, Figure AI) build:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Simulation infrastructure&lt;/strong&gt; (digital twins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed training pipelines&lt;/strong&gt; (thousands of episodes → models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom silicon&lt;/strong&gt; optimizations (learned through production experience)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Hardware Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;&amp;lt;$1.2K&lt;/th&gt;
&lt;th&gt;$1.2-2.4K&lt;/th&gt;
&lt;th&gt;$2.4-6K&lt;/th&gt;
&lt;th&gt;$6-12K&lt;/th&gt;
&lt;th&gt;&amp;gt;$12K&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time VLA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-model pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-device training&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Industrial deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hobby projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Research labs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Market Reality: Why NVIDIA Will Dominate Through 2030
&lt;/h2&gt;

&lt;p&gt;I've searched extensively through GitHub, Reddit, research papers, and industry discussions. Here's what I found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform adoption:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LeRobot (Hugging Face):&lt;/strong&gt; Officially optimizes for Jetson&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AlohaMini community:&lt;/strong&gt; Standardizes on Jetson Orin Nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chinese manufacturers (Unitree, Agility):&lt;/strong&gt; Moving toward Jetson for AI perception layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic robotics labs:&lt;/strong&gt; 80%+ use NVIDIA (CUDA ecosystem, TensorRT maturity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why AMD/Intel don't win:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem lag:&lt;/strong&gt; No robotics-optimized compilers or middleware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer inertia:&lt;/strong&gt; 2+ million engineers trained on CUDA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model optimization:&lt;/strong&gt; VLA models optimized first for NVIDIA, then backported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain:&lt;/strong&gt; NVIDIA has proven availability; competitors still ramping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-hand experience:&lt;/strong&gt; I have personally tried running various AI tools on my Strix Halo device on Linux, and it is a nightmare: ROCm still does not have stable support for Strix Halo.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The alternative:&lt;/strong&gt; Hailo accelerators win in the &lt;em&gt;power-constrained, single-task&lt;/em&gt; market (warehouse scanning, edge object detection). But for general-purpose humanoids with VLA reasoning? Jetson is uncontested.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Next 12-24 Months: Watch These Developments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;2027:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First meaningful cost reduction in humanoid robotics hits ($20-30K robots become viable for specific tasks)&lt;/li&gt;
&lt;li&gt;Open-source VLA model zoo matures → SmolVLA derivatives enable sub-$5K robots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2028-2030:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute cost drops 60-70% from 2025 levels&lt;/li&gt;
&lt;li&gt;Robotics software becomes the moat, not hardware&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts: Building the Right Mental Model
&lt;/h2&gt;

&lt;p&gt;I started this research trying to answer: "Which hardware will dominate humanoid robotics?"&lt;/p&gt;

&lt;p&gt;After diving deep, the answer isn't satisfying but it's clear: &lt;strong&gt;NVIDIA Jetson variants will dominate 60-70% of the market through 2030&lt;/strong&gt;, with niches for AMD (cost optimization), Hailo (power efficiency), and custom silicon (post-Series B).&lt;/p&gt;

&lt;p&gt;But more importantly: &lt;strong&gt;the era of "compute is the bottleneck" is ending&lt;/strong&gt;. By 2028-2030, compute becomes a commodity. The real moats are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data:&lt;/strong&gt; Collected robot experience (proprietary datasets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models:&lt;/strong&gt; Fine-tuned VLAs for specific tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing:&lt;/strong&gt; Can you make 1,000 units reliably?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; Getting the product to market first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product:&lt;/strong&gt; Taste and an understanding of humans&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This article captures my learning at a specific moment (January 2026). The field is moving fast.&lt;/p&gt;

&lt;p&gt;I am actively looking for ways in which I can contribute to open source software in this domain.&lt;/p&gt;

&lt;p&gt;If you're working on humanoid robotics, I'd love to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which compute platform are you using? Why?&lt;/li&gt;
&lt;li&gt;What VLA model is actually viable on your hardware?&lt;/li&gt;
&lt;li&gt;Where are you hitting walls?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Last updated: January 15, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>humanoid</category>
      <category>ai</category>
      <category>robotics</category>
    </item>
  </channel>
</rss>
