<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ankit Khandelwal</title>
    <description>The latest articles on DEV Community by Ankit Khandelwal (@ankk98).</description>
    <link>https://dev.to/ankk98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F37202%2Fcffd594b-2290-43e7-9e13-b61cdf6c6b8e.jpeg</url>
      <title>DEV Community: Ankit Khandelwal</title>
      <link>https://dev.to/ankk98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ankk98"/>
    <language>en</language>
    <item>
      <title>Kriya-Egocentric-100K: Action100M-style Annotations for Real-World Labor Videos</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Tue, 17 Mar 2026 05:22:36 +0000</pubDate>
      <link>https://dev.to/ankk98/kriya-egocentric-100k-action100m-style-annotations-for-real-world-labor-videos-42jd</link>
      <guid>https://dev.to/ankk98/kriya-egocentric-100k-action100m-style-annotations-for-real-world-labor-videos-42jd</guid>
      <description>&lt;p&gt;Just pushed a new preview dataset to Hugging Face: &lt;strong&gt;&lt;a href="https://huggingface.co/datasets/ankk98/kriya-egocentric-100k" rel="noopener noreferrer"&gt;Kriya-Egocentric-100K&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It contains &lt;strong&gt;Action100M-compatible hierarchical action annotations&lt;/strong&gt; for a small 5-video subset of &lt;a href="https://huggingface.co/datasets/builddotai/Egocentric-100K" rel="noopener noreferrer"&gt;Build AI’s Egocentric-100K&lt;/a&gt; — real first-person footage captured with a monocular head-mounted fisheye camera during manual labor tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi3kngo6f532vm339o9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi3kngo6f532vm339o9v.png" alt="Kriya Viz Screenshot" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What’s inside?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;One JSON file per video (e.g. &lt;code&gt;f001-w001-0001.json&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Full Action100M-style tree: root → sub-segments with precise start/end timestamps&lt;/li&gt;
&lt;li&gt;LLM-generated natural language captions + structured GPT outputs (brief/detailed summaries, action labels, actors)&lt;/li&gt;
&lt;li&gt;Everything generated 100% automatically via the &lt;strong&gt;&lt;a href="https://mindandmotionlabs.com/api-docs.html" rel="noopener noreferrer"&gt;Kriya Full Automated Action Annotation API&lt;/a&gt;&lt;/strong&gt; (early preview)&lt;/li&gt;
&lt;/ul&gt;
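If you want to poke at the annotations programmatically, a recursive tree walk is all you need. The field names used here (caption, start, end, children) are illustrative; check an actual JSON file from the dataset for the exact schema.

```python
# Hypothetical Action100M-style action tree; the real field names in the
# Kriya JSON files may differ, so inspect one file before relying on this.
sample = {
    "caption": "assemble bracket",
    "start": 0.0,
    "end": 42.5,
    "children": [
        {"caption": "pick up screwdriver", "start": 0.0, "end": 6.1, "children": []},
        {"caption": "drive screws", "start": 6.1, "end": 42.5, "children": []},
    ],
}

def walk(node, depth=0):
    """Yield (depth, start, end, caption) for every node in the tree."""
    yield depth, node["start"], node["end"], node["caption"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)

for depth, start, end, caption in walk(sample):
    print("  " * depth + f"[{start:.1f}-{end:.1f}s] {caption}")
```

The same generator works at any nesting depth, so it should apply unchanged to deeper sub-segment hierarchies.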

&lt;p&gt;The videos themselves are &lt;strong&gt;not&lt;/strong&gt; hosted here (you’ll need to pull them directly from Build AI under their license), but the annotations are MIT and drop-in compatible with the &lt;strong&gt;&lt;a href="https://ankk98.github.io/kriya-viz/" rel="noopener noreferrer"&gt;Kriya Visualizer&lt;/a&gt;&lt;/strong&gt; — just load the video + matching JSON and explore the timeline instantly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why this matters
&lt;/h4&gt;

&lt;p&gt;After the EPIC-KITCHENS preview, this is the next step toward scaling automatic annotation to more diverse egocentric domains. Manual labor footage brings new challenges (occlusions, tool use, unstructured environments) — and the results already look strong for downstream tasks like video world models, VLMs, VLA policies, and embodied robotics.&lt;/p&gt;

&lt;p&gt;Visualizer demo, full pipeline details, and the previous Kriya-EPIC-KITCHENS release are all in the &lt;strong&gt;&lt;a href="https://dev.to/ankk98/kriya-tools-for-exploring-and-generating-action100m-style-video-annotations-46ee"&gt;original Kriya tools blog post&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is still an early preview — feedback and collaboration super welcome! Drop a comment or DM if you want to try the API on your own footage or discuss scaling plans.&lt;/p&gt;

&lt;p&gt;Excited to keep pushing the boundary of automatic video understanding.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dataset</category>
      <category>computervision</category>
      <category>egocentric</category>
    </item>
    <item>
      <title>Kriya: Tools for Exploring and Generating Action100M-style Video Annotations</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sat, 14 Mar 2026 06:29:49 +0000</pubDate>
      <link>https://dev.to/ankk98/kriya-tools-for-exploring-and-generating-action100m-style-video-annotations-46ee</link>
      <guid>https://dev.to/ankk98/kriya-tools-for-exploring-and-generating-action100m-style-video-annotations-46ee</guid>
      <description>&lt;p&gt;After reading the excellent &lt;a href="https://arxiv.org/abs/2601.10592" rel="noopener noreferrer"&gt;Action100M paper&lt;/a&gt;, I became very excited about the potential of &lt;strong&gt;fully automated, large-scale video action annotation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;High-quality temporal action hierarchies open doors for training stronger video world models, video-language models (VLMs), vision-language-action models (VLAs), humanoid control policies, and physical reasoning systems.&lt;/p&gt;

&lt;p&gt;But two practical problems quickly appeared:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There was no convenient way to &lt;strong&gt;visualize&lt;/strong&gt; these rich, hierarchical annotations together with the video.&lt;/li&gt;
&lt;li&gt;Generating such annotations at scale for new/custom video datasets still felt out of reach for many researchers and engineers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So I built two tools to help move things forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Kriya Visualizer – See Action100M-style Annotations Come Alive
&lt;/h2&gt;

&lt;p&gt;I created a lightweight, static web-based visualizer specifically designed for Action100M-style temporal action trees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features (current version):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video player synced with the annotation timeline&lt;/li&gt;
&lt;li&gt;Hierarchical timeline (one row per level in the action tree)&lt;/li&gt;
&lt;li&gt;Nodes highlight at the current timestamp&lt;/li&gt;
&lt;li&gt;Side panel with metadata, full transcript, and raw JSON view&lt;/li&gt;
&lt;li&gt;Clean, single-screen layout (no installation needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwxn9kxf3iclk3x8ggt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwxn9kxf3iclk3x8ggt7.png" alt="Kriya Viz Screenshot" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's open source under the MIT license → feel free to fork, improve, or use it in your projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access Here:&lt;/strong&gt; &lt;a href="https://ankk98.github.io/kriya-viz/" rel="noopener noreferrer"&gt;https://ankk98.github.io/kriya-viz/&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;GitHub repo:&lt;/strong&gt; &lt;a href="https://github.com/Ankk98/kriya-viz" rel="noopener noreferrer"&gt;https://github.com/Ankk98/kriya-viz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're working with Action100M data (or any similar dense temporal action hierarchy), give it a try and let me know what features would make it more useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Kriya-EPIC-KITCHENS – Automatic Annotations on Egocentric Videos
&lt;/h2&gt;

&lt;p&gt;Next, I wanted to test how well fully automatic annotation works on real, challenging egocentric data.&lt;/p&gt;

&lt;p&gt;I ran the &lt;strong&gt;Kriya Full Automated Action Annotation API&lt;/strong&gt; (early preview) on a small subset of videos from the popular &lt;a href="https://epic-kitchens.github.io/2026" rel="noopener noreferrer"&gt;EPIC-KITCHENS-100&lt;/a&gt; dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; A preview Hugging Face dataset with ~6 videos fully annotated in Action100M style, no human labeling involved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporal segments with hierarchical actions&lt;/li&gt;
&lt;li&gt;Natural language captions/descriptions per segment&lt;/li&gt;
&lt;li&gt;Ready to download and use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dataset link:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/ankk98/kriya-epic-kitchens" rel="noopener noreferrer"&gt;https://huggingface.co/datasets/ankk98/kriya-epic-kitchens&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Early results on kitchen egocentric videos look very promising. I'm excited to see if/how these annotations can feed downstream tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video world models&lt;/li&gt;
&lt;li&gt;VLM / VLA fine-tuning&lt;/li&gt;
&lt;li&gt;Robotic manipulation from egocentric views&lt;/li&gt;
&lt;li&gt;Physical AI reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current API version deliberately follows the Action100M pipeline closely. An improved version that addresses some limitations is already in the works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API docs (early preview):&lt;/strong&gt; &lt;a href="https://mindandmotionlabs.com/api-docs.html" rel="noopener noreferrer"&gt;https://mindandmotionlabs.com/api-docs.html&lt;/a&gt;&lt;br&gt;&lt;br&gt;
(You send videos → get back structured temporal action hierarchies)&lt;/p&gt;
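As a rough sketch of what a client call might look like, here is a request built with the standard library. The endpoint path, auth header, and payload fields below are my own illustrative guesses, not the real contract; consult the API docs for the actual interface.

```python
import json
import urllib.request

# Hypothetical request shape: endpoint path, auth scheme, and payload
# fields are illustrative placeholders, not the documented Kriya API.
payload = json.dumps({
    "video_url": "https://example.com/clip.mp4",
    "options": {"hierarchy": True, "captions": True},
}).encode("utf-8")

req = urllib.request.Request(
    url="https://mindandmotionlabs.com/api/v1/annotate",  # illustrative path
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    method="POST",
)

# urllib.request.urlopen(req) would submit the job; the response is
# expected to contain an Action100M-style temporal action hierarchy.
```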

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Manual video annotation at scale is expensive and slow. If high-quality automatic annotation becomes reliable, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train on orders-of-magnitude more grounded video data&lt;/li&gt;
&lt;li&gt;Build more general-purpose video understanding and action generation models&lt;/li&gt;
&lt;li&gt;Accelerate progress toward capable robotic and embodied AI systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two small releases are just early steps: Kriya Visualizer for inspection and debugging, and Kriya-EPIC-KITCHENS as a proof-of-concept dataset.&lt;/p&gt;

&lt;p&gt;Feedback, feature requests, collaboration ideas, or even just "I tried it and here's what broke" are very welcome!&lt;/p&gt;

&lt;p&gt;What are you building with video action data right now? Drop a comment below 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>computervision</category>
      <category>robotics</category>
      <category>dataset</category>
    </item>
    <item>
      <title>From Perception to Embodied Intelligence: Evolution, Architectures, and the Humanoid Gap</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sat, 14 Feb 2026 13:56:12 +0000</pubDate>
      <link>https://dev.to/ankk98/from-perception-to-embodied-intelligence-evolution-architectures-and-the-humanoid-gap-3dhi</link>
      <guid>https://dev.to/ankk98/from-perception-to-embodied-intelligence-evolution-architectures-and-the-humanoid-gap-3dhi</guid>
      <description>&lt;p&gt;Vision-Language-Action (VLA) models represent a paradigm shift from passive multimodal understanding to active embodied control. This brief maps the lineage from foundational Vision-Language Models (VLMs) like CLIP and BLIP to current state-of-the-art VLA systems, revealing critical architectural transitions, data strategies, and failure modes that define the frontier of humanoid manipulation.&lt;/p&gt;

&lt;p&gt;The analysis identifies three core evolutionary phases:&lt;/p&gt;

&lt;p&gt;(1) VLM pre-training for semantic understanding&lt;br&gt;
(2) action tokenization enabling end-to-end control&lt;br&gt;
(3) hybrid architectures balancing reasoning with real-time execution&lt;/p&gt;

&lt;p&gt;For humanoid robotics, fundamental gaps remain in proprioceptive reasoning, long-horizon planning, and physics-aware action generation: challenges that current open-source models address only partially.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Evolutionary Timeline: From VLMs to VLAs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Foundation (2021–2022) – VLMs as Semantic Engines
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CLIP (2021)&lt;/strong&gt; and &lt;strong&gt;BLIP (2022)&lt;/strong&gt; established contrastive learning as the dominant paradigm for aligning vision and language modalities. These models excelled at matching images to text descriptions but lacked any mechanism for action generation. Their legacy persists in modern VLAs: OpenVLA inherits SigLIP's vision encoder, while Pi0 leverages PaliGemma's VLM backbone. &lt;a href="https://hankyukim.com/openvla/" rel="noopener noreferrer"&gt;hankyukim&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Limitation&lt;/strong&gt;: VLMs were fundamentally passive, optimized for retrieval and classification, not sequential decision-making. Early attempts like &lt;strong&gt;CLIPort&lt;/strong&gt; (2022) demonstrated that grafting CLIP representations onto robotic policies via imitation learning could achieve task-specific success but failed to generalize across embodiments or semantic concepts beyond the training distribution. &lt;a href="https://arxiv.org/html/2505.04769v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Tokenization Breakthrough (2023) – RT-2 and the Birth of VLAs
&lt;/h3&gt;

&lt;p&gt;Google DeepMind's &lt;strong&gt;RT-2 (July 2023)&lt;/strong&gt; catalyzed the field by reconceptualizing robot actions as text tokens. The architecture quantized continuous actions into discrete bins (typically 256 per dimension) and appended them to the vocabulary of a PaLM-E or PaLI-X VLM. This enabled training with standard next-token prediction objectives, unifying web-scale vision-language pre-training with robotic demonstrations. &lt;a href="https://madison-proceedings.com/index.php/aetr/article/view/4359" rel="noopener noreferrer"&gt;madison-proceedings&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Leap&lt;/strong&gt;: RT-2 achieved 3× improvement in generalization over RT-1, demonstrating emergent capabilities like reasoning about object categories and improvising tools. The model could interpret novel commands ("place the apple on the 3") despite never observing such combinations in robot data. &lt;a href="https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action/" rel="noopener noreferrer"&gt;deepmind&lt;/a&gt;&lt;/p&gt;
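The tokenization scheme itself is simple enough to sketch in a few lines. This is a minimal illustration of uniform 256-bin discretization in the spirit of RT-2; actual implementations derive per-dimension bounds from the training data rather than hard-coding them.

```python
# Minimal sketch of RT-2 / OpenVLA-style action discretization: each
# continuous action dimension is mapped to one of 256 uniform bins so
# actions can be emitted as ordinary vocabulary tokens.
N_BINS = 256

def tokenize(value, low, high):
    """Map a continuous value in [low, high] to a bin index in [0, 255]."""
    clipped = max(low, min(high, value))
    frac = (clipped - low) / (high - low)
    return min(N_BINS - 1, int(frac * N_BINS))

def detokenize(token, low, high):
    """Map a bin index back to the bin-center continuous value."""
    return low + (token + 0.5) / N_BINS * (high - low)

# Example: a gripper x-displacement in metres.
token = tokenize(0.03, low=-0.1, high=0.1)
recovered = detokenize(token, low=-0.1, high=0.1)
```

The round trip loses at most half a bin width per dimension, which is exactly the quantization error that flow-matching action heads (discussed below) are designed to avoid.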

&lt;h3&gt;
  
  
  Phase 3: Scaling and Open-Source (2024–2025) – OpenVLA, SmolVLA, and Pi0
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA (2024)&lt;/strong&gt; democratized access with a 7B-parameter model trained on 970k demonstrations from the Open X-Embodiment dataset. Built on Llama 2 + DINOv2 + SigLIP, it outperformed closed models like RT-2-X (55B parameters) with 7× fewer parameters by leveraging more diverse training data and 27 training epochs (vs. typical 1-2 epochs for VLMs). &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLA (2025)&lt;/strong&gt; pioneered efficiency, achieving OpenVLA-level performance with &amp;lt;0.5B parameters by employing a compact VLM backbone, flow matching action expert, and asynchronous inference stack. Its key insight: action generation quality depends more on architectural efficiency than raw parameter count. &lt;a href="https://www.youtube.com/watch?v=T1PhkCQDCcc" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi0 Series (Physical Intelligence, 2024–2025)&lt;/strong&gt; introduced hybrid architectures combining autoregressive action tokens with continuous flow matching. Pi0.5 added temporal awareness through timestep conditioning, while Pi0.6 scaled to 5B parameters and incorporated knowledge insulation, training the VLM backbone on FAST tokens while isolating the action expert's gradients. &lt;a href="https://arxiv.org/html/2410.24164v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Thematic Deep Dives: What Worked vs. What Failed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Key Ideas That Worked
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Action Tokenization as Sequence Prediction&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Treating actions as discrete tokens enabled direct transfer of LLM training infrastructure to robotics. RT-2's 256-bin quantization scheme remains the default in OpenVLA, providing a simple bridge between continuous control and autoregressive generation. This approach inherits powerful properties from language modeling: in-context learning, few-shot adaptation, and chain-of-thought reasoning. &lt;a href="https://arxiv.org/abs/2307.15818" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence&lt;/strong&gt;: OpenVLA achieves 95% action token accuracy after 27 training epochs, with performance correlating strongly to robot success rates. The discrete representation also simplifies multi-task training across heterogeneous robot embodiments. &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Flow Matching for Continuous Control&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Diffusion-based action heads address the continuity problem inherent in tokenization. Pi0 and SmolVLA use flow matching to predict action chunks as continuous trajectories, avoiding quantization errors. This enables smoother, more precise control, which is critical for contact-rich manipulation. &lt;a href="https://www.youtube.com/watch?v=T1PhkCQDCcc" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Impact&lt;/strong&gt;: Pi0 outperforms tokenized baselines on action chunking tasks (e.g., folding laundry) where precise force modulation matters. Flow matching also supports variable horizon predictions, unlike fixed-length token sequences. &lt;a href="https://arxiv.org/html/2410.24164v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;
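For intuition, the flow-matching training target reduces to a few lines. This is a conceptual sketch only (a linear noise-to-action path with a constant velocity target), not the exact Pi0 or SmolVLA recipe.

```python
import numpy as np

# Conceptual flow-matching target for an action chunk: the network is
# trained to predict the velocity that carries a noise sample to the
# demonstrated actions along a straight-line path.
rng = np.random.default_rng(0)

action_chunk = rng.normal(size=(8, 7))   # 8 timesteps x 7-DoF actions
noise = rng.normal(size=action_chunk.shape)
t = 0.3                                  # interpolation time in [0, 1]

# Point on the straight-line path from noise to the target actions...
x_t = (1.0 - t) * noise + t * action_chunk
# ...and the constant velocity field the model should regress to.
target_velocity = action_chunk - noise
```

Because the target is a continuous trajectory rather than a token sequence, there is no 1/256-bin quantization floor on precision.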

&lt;h4&gt;
  
  
  &lt;strong&gt;Knowledge Insulation and Modularity&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;VLA-Adapter and Pi0.6 demonstrate that decoupling VLM reasoning from action generation improves training efficiency. By freezing the VLM backbone and training only a lightweight action expert, these models avoid catastrophic forgetting of web-scale knowledge while specializing for robot control. &lt;a href="https://arxiv.org/abs/2509.09372" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficiency Gains&lt;/strong&gt;: VLA-Adapter trains a powerful VLA in 8 hours on a single consumer GPU, while Pi0.6's insulated gradients prevent performance degradation on vision-language benchmarks. &lt;a href="https://website.pi-asset.com/pi06star/PI06_model_card.pdf" rel="noopener noreferrer"&gt;website.pi-asset&lt;/a&gt;&lt;/p&gt;
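The insulation idea itself is just selective updating. Here is a toy sketch with made-up parameter names; in a real PyTorch model you would instead set requires_grad to False on the backbone parameters.

```python
# Toy illustration of knowledge insulation: gradient updates are applied
# only to the action expert, leaving the VLM backbone untouched so its
# web-scale knowledge is not overwritten during robot fine-tuning.
params = {
    "vlm.attn.weight": 1.0,
    "vlm.mlp.weight": 2.0,
    "action_expert.head.weight": 0.5,
}
grads = {name: 0.1 for name in params}  # pretend gradients
FROZEN_PREFIX = "vlm."
LR = 0.01

for name, grad in grads.items():
    if name.startswith(FROZEN_PREFIX):
        continue  # insulated: the backbone receives no update
    params[name] = params[name] - LR * grad
```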

&lt;h3&gt;
  
  
  2.2 Key Ideas That Failed
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Naive Proprioception Integration&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Feeding raw robot state (joint angles, end-effector poses) directly as additional tokens creates shortcut learning. Policies overfit to state-action memorization rather than visual reasoning, degrading spatial generalization. In testing, models trained with proprioception fail when object positions deviate slightly from training trajectories. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Mode&lt;/strong&gt;: A study on visuomotor policies found that proprioceptive states cause "shortcuts where the policy directly associates absolute configurations with actions," leading to 40-60% success rate drops under spatial perturbations. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Monolithic Scaling Without Architectural Innovation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Simply increasing VLM backbone size (e.g., RT-2-X's 55B parameters) yields diminishing returns for robot control. The computational overhead (15GB of GPU memory for inference at 6Hz) makes real-time deployment impractical. Larger models also struggle with action token accuracy, as the vast parameter space prioritizes language modeling over control precision. &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empirical Evidence&lt;/strong&gt;: OpenVLA's 7B model matches RT-2-X's performance despite 7× fewer parameters, suggesting data diversity and training recipe matter more than scale. &lt;a href="http://arxiv.org/pdf/2406.09246.pdf" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Single-Modality Action Generation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Pure autoregressive or pure diffusion approaches each have blind spots. Autoregressive models struggle with continuous precision (quantization error), while diffusion models lack the reasoning depth of VLMs for long-horizon planning. HybridVLA attempted to combine both but introduced training interference between the two generation paradigms, requiring complex collaborative ensemble mechanisms that increased inference latency. &lt;a href="https://arxiv.org/abs/2503.10631" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Open Source Model Comparison: OpenVLA vs. SmolVLA vs. Pi0
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;OpenVLA (7B)&lt;/th&gt;
&lt;th&gt;SmolVLA (&amp;lt;0.5B)&lt;/th&gt;
&lt;th&gt;Pi0.6 (5B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backbone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Llama 2 + DINOv2 + SigLIP&lt;/td&gt;
&lt;td&gt;Qwen 2.5 0.5B + custom ViT&lt;/td&gt;
&lt;td&gt;Gemma3 4B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action Head&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Autoregressive tokens (256 bins)&lt;/td&gt;
&lt;td&gt;Flow matching (continuous)&lt;/td&gt;
&lt;td&gt;Hybrid: FAST tokens + flow matching &lt;a href="https://website.pi-asset.com/pi06star/PI06_model_card.pdf" rel="noopener noreferrer"&gt;website.pi-asset&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Training Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;970k demos (OpenX dataset)&lt;/td&gt;
&lt;td&gt;Public community datasets&lt;/td&gt;
&lt;td&gt;Proprietary large-scale corpus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inference Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 Hz on RTX 4090 &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;12.5 Hz on L40s (2.5× faster than OpenVLA) &lt;a href="https://ai.stanford.edu/blog/minivla/" rel="noopener noreferrer"&gt;ai.stanford&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;5-10 Hz (denoising steps dependent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key Innovation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cross-embodiment generalization&lt;/td&gt;
&lt;td&gt;Asynchronous inference stack&lt;/td&gt;
&lt;td&gt;Knowledge insulation + RL fine-tuning &lt;a href="https://www.pi.website/blog/pistar06" rel="noopener noreferrer"&gt;pi&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Simulation Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62% on LIBERO-90 &lt;a href="https://ai.stanford.edu/blog/minivla/" rel="noopener noreferrer"&gt;ai.stanford&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;77% on LIBERO-90 (w/ action chunks) &lt;a href="https://ai.stanford.edu/blog/minivla/" rel="noopener noreferrer"&gt;ai.stanford&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;State-of-the-art on LIBERO-5 (96.5%) &lt;a href="https://arxiv.org/abs/2508.19236" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-World Strength&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generalization across robots&lt;/td&gt;
&lt;td&gt;Deployment on consumer GPUs&lt;/td&gt;
&lt;td&gt;Long-horizon tasks (coffee making, laundry) &lt;a href="https://www.pi.website/blog/pistar06" rel="noopener noreferrer"&gt;pi&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critical Weakness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slow inference, quantization error&lt;/td&gt;
&lt;td&gt;Limited long-horizon reasoning&lt;/td&gt;
&lt;td&gt;Proprietary, computationally intensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architectural Deep Dive&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA&lt;/strong&gt; follows the RT-2 blueprint faithfully: discretize actions, append to vocabulary, train with cross-entropy loss. Its strength lies in the curated OpenX dataset diversity, enabling zero-shot control of unseen robots. However, the autoregressive generation bottleneck limits real-time performance: 15GB of GPU memory and 6Hz inference constrain deployment to high-end hardware. &lt;a href="http://arxiv.org/pdf/2406.09246.pdf" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLA&lt;/strong&gt; challenges the "bigger is better" orthodoxy. By using a compact VLM and flow matching action expert, it achieves comparable performance with 14× fewer parameters. The asynchronous inference stack decouples perception from action generation, allowing new chunks to be predicted while the robot executes previous commands. This is particularly impactful for dynamic environments where reaction time matters. &lt;a href="https://huggingface.co/blog/smolvla" rel="noopener noreferrer"&gt;huggingface&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pi0.6&lt;/strong&gt; represents the hybrid extreme: it trains the VLM backbone on FAST discrete tokens while the action expert predicts continuous flows. Knowledge insulation prevents gradient interference, and offline RL pre-training (Recap) doubles throughput on complex tasks. The model's hierarchical design supports heterogeneous prompts, enabling high-level task conditioning. The trade-off is accessibility: Pi0.6's training requires proprietary data and substantial compute, limiting reproducibility. &lt;a href="https://www.pi.website/blog/pistar06" rel="noopener noreferrer"&gt;pi&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Humanoid Gap Report: Missing Capabilities for Hand Manipulation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Proprioception and Tactile Integration
&lt;/h3&gt;

&lt;p&gt;Current VLAs treat proprioception as auxiliary inputs, leading to shortcut learning and poor spatial generalization. Humanoid hands require fine-grained force feedback and slip detection, capabilities absent in standard VLA pipelines. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: No open-source VLA integrates tactile sensing end-to-end. ForceVLA and AnyTouch explore Mixture-of-Experts for contact-rich tasks, but these remain research prototypes. The lack of large-scale tactile datasets mirrors the early scarcity of robot demonstrations. &lt;a href="https://www.themoonlight.io/en/review/survey-of-vision-language-action-models-for-embodied-manipulation" rel="noopener noreferrer"&gt;themoonlight&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Develop a "Tactile VLA" that fuses vision, language, and distributed pressure sensor arrays. The architecture should use tactile tokens analogous to image patches, enabling the VLM backbone to reason about contact forces and friction constraints.&lt;/p&gt;
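To make "tactile tokens analogous to image patches" concrete, here is a minimal sketch: a pressure-sensor grid split into flattened patches, ViT-style. The grid and patch sizes are arbitrary illustrative choices, not a proposal for a real sensor layout.

```python
import numpy as np

# Sketch of tactile tokenization: a fingertip pressure grid is split
# into non-overlapping patches and flattened, mirroring how ViTs turn
# image patches into tokens for the transformer backbone.
pressure = np.random.default_rng(0).random((16, 16))  # simulated sensor grid
PATCH = 4

patches = (
    pressure.reshape(4, PATCH, 4, PATCH)  # (row_block, row_in, col_block, col_in)
    .transpose(0, 2, 1, 3)                # group the two block axes together
    .reshape(16, PATCH * PATCH)           # 16 tokens of 16 pressure values each
)
# Each row of `patches` could now pass through a linear embedding layer
# and join the vision/language token stream.
```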

&lt;h3&gt;
  
  
  4.2 Long-Horizon Planning and Memory
&lt;/h3&gt;

&lt;p&gt;Humanoid manipulation tasks (e.g., assembling furniture) span 5–20 minutes and require remembering partial progress. Standard VLAs operate with Markovian assumptions and fixed context windows, causing failure when intermediate steps are ambiguous. &lt;a href="https://arxiv.org/html/2410.24164v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: MemoryVLA demonstrates perceptual-cognitive memory banks for manipulation, but its evaluation is limited to tabletop tasks. Humanoid whole-body control introduces additional complexity: locomotion plans must be retained while hands execute fine manipulations. &lt;a href="https://arxiv.org/abs/2508.19236" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Implement a hierarchical memory system with (1) working memory for immediate action chunks and (2) episodic memory for task-level progress. The hippocampal-inspired consolidation mechanism from MemoryVLA could scale to humanoid tasks by encoding proprioceptive trajectories alongside visual observations. &lt;a href="https://arxiv.org/abs/2508.19236" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;
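A minimal sketch of that two-level split, a bounded working memory plus an episodic log, follows below; the class and method names are hypothetical, not taken from MemoryVLA.

```python
from collections import deque

# Hypothetical two-level memory for a long-horizon policy: a bounded
# working memory holds only recent action chunks, while an episodic log
# of completed subtasks survives the context-window limit.
class HierarchicalMemory:
    def __init__(self, working_size=4):
        self.working = deque(maxlen=working_size)  # recent chunks only
        self.episodic = []                         # task-level progress

    def observe_chunk(self, chunk):
        self.working.append(chunk)  # oldest chunk is evicted automatically

    def complete_subtask(self, name):
        self.episodic.append(name)

    def context(self):
        """What the policy conditions on at the next step."""
        return {"recent": list(self.working), "done": list(self.episodic)}

mem = HierarchicalMemory(working_size=2)
for step in ["reach", "grasp", "lift"]:
    mem.observe_chunk(step)
mem.complete_subtask("pick up leg A")
```

The point of the split: "reach" has already been evicted from working memory, but the completed subtask remains queryable for the rest of the episode.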

&lt;h3&gt;
  
  
  4.3 Physics-Aware Action Generation
&lt;/h3&gt;

&lt;p&gt;VLAs hallucinate physically implausible actions, predicting grasps that violate kinematic constraints or object trajectories that ignore gravity. This stems from the VLM backbone's pixel-space reasoning lacking 3D physical grounding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: GeoVLA and 3D-VLA integrate point clouds and depth maps, but these are add-ons rather than core architectural features. The models still prioritize semantic alignment over physical feasibility. &lt;a href="https://arxiv.org/abs/2508.09071" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Embed a differentiable physics simulator within the VLA training loop. Actions could be penalized for violating Newtonian mechanics, similar to how RL uses physics-based rewards. The "visual foresight" approach in F1-VLA shows promise: predicting next visual states correlates with action reliability, suggesting that generative world models could enforce physical consistency. &lt;a href="https://arxiv.org/html/2509.06951v2" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Sim-to-Real for Humanoid Morphology
&lt;/h3&gt;

&lt;p&gt;Humanoid robots exhibit high-dimensional action spaces (30+ DOF) and complex contact dynamics. Current sim-to-real methods rely on domain randomization, which fails to capture the nuance of bipedal balance and bimanual coordination. &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12292580/" rel="noopener noreferrer"&gt;pmc.ncbi.nlm.nih&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gap&lt;/strong&gt;: HumanVLA demonstrates vision-language directed object rearrangement but requires privileged state information and hand-crafted finite state machines. The sim-to-real gap remains significant: a 17% failure rate in real-world experiments, primarily due to depth sensing errors and contact estimation delays. &lt;a href="https://arxiv.org/html/2406.19972v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity&lt;/strong&gt;: Leverage human video data as an intermediate domain. EgoVLA extracts wrist and hand actions from egocentric videos, using inverse kinematics to retarget to robot hands. This "human-to-robot" transfer could bootstrap humanoid VLA training without expensive real robot data collection. &lt;a href="https://rchalyang.github.io/EgoVLA/" rel="noopener noreferrer"&gt;rchalyang.github&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Critical Disagreements and Uncertainties
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Disagreement 1: Proprioception's Role&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proponents&lt;/strong&gt;: Proprioception provides compact, accurate state information essential for precise servo control. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critics&lt;/strong&gt;: End-to-end visuomotor policies without explicit state inputs achieve better spatial generalization, as they cannot memorize trajectories. &lt;a href="https://arxiv.org/html/2509.18644v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt;: The consensus is shifting toward &lt;em&gt;conditioned&lt;/em&gt; proprioception, using state inputs only for low-level control while keeping high-level reasoning vision-driven, as seen in Helix's dual-system architecture. &lt;a href="https://www.iotworldtoday.com/robotics/humanoid-robots-learn-to-work-together-natural-language-control" rel="noopener noreferrer"&gt;iotworldtoday&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disagreement 2: Action Representation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization Camp&lt;/strong&gt;: Discrete tokens enable direct VLM transfer and chain-of-thought reasoning (OpenVLA, RT-2). &lt;a href="https://arxiv.org/html/2406.09246v1" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diffusion Camp&lt;/strong&gt;: Continuous flow matching captures action continuity and supports variable horizons (Pi0, SmolVLA). &lt;a href="https://www.youtube.com/watch?v=T1PhkCQDCcc" rel="noopener noreferrer"&gt;youtube&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolution&lt;/strong&gt;: Hybrid approaches (Pi0.6, HybridVLA) are emerging as the synthesis, but training interference remains an open problem. &lt;a href="https://arxiv.org/abs/2503.10631" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Uncertainty&lt;/strong&gt;: The optimal data mixture ratio for humanoid VLAs is unknown. RT-2 used 10% robotics data, while OpenVLA uses 100%. For humanoids, where robot data is scarcer, more aggressive web-scale pre-training may be needed, but this risks physics misalignment.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;VLA models have evolved from passive VLMs to active embodied agents, but the leap to reliable humanoid manipulation remains incomplete. The open-source ecosystem (OpenVLA, SmolVLA) has democratized access, yet critical gaps persist in proprioceptive reasoning, long-horizon memory, and physics-aware generation.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>vla</category>
      <category>ai</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Teleoperation Data Quality for Imitation Learning: What Actually Breaks the Model</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:53:43 +0000</pubDate>
      <link>https://dev.to/ankk98/teleoperation-data-quality-for-imitation-learning-what-actually-breaks-the-model-1abc</link>
      <guid>https://dev.to/ankk98/teleoperation-data-quality-for-imitation-learning-what-actually-breaks-the-model-1abc</guid>
      <description>&lt;p&gt;&lt;em&gt;Practical rubric design and failure modes from auditing robot teleop datasets (e.g. &lt;a href="https://github.com/huggingface/lerobot" rel="noopener noreferrer"&gt;LeRobot&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this post
&lt;/h2&gt;

&lt;p&gt;We audited teleoperation episodes for an imitation-learning pipeline. Removing poor-quality episodes (about 20–40% in our case) led to clearly better learning; the literature often reports ~10–15% policy improvement from similar filtering. This post covers &lt;strong&gt;rubric mistakes that cause inconsistent scores&lt;/strong&gt; and &lt;strong&gt;failure modes&lt;/strong&gt; we kept seeing.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Rubric mistakes and how to fix them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Metrics that sound clear but aren’t.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Example: “Mistake-to-Recovery-Ratio.” People disagree: Is it (total mistakes)/(total recoveries) or (total mistakes)/(total recovery &lt;em&gt;attempts&lt;/em&gt;)? If a pick fails, then fails again, then succeeds, is that one recovery or two attempts?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it should be:&lt;/strong&gt; Define one ratio per episode. Count each &lt;em&gt;distinct&lt;/em&gt; mistake once (each new failure event). Count a &lt;em&gt;recovery&lt;/em&gt; only when the operator successfully got back on track; failed attempts in between don’t add extra recoveries. Write this in the rubric: “Count a recovery only when intended behavior has resumed; don’t count failed attempts as new mistakes unless it’s a new failure (e.g. new drop).” If you also want to penalize messy recoveries, add a separate “recovery attempts per mistake” number.&lt;/p&gt;
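&lt;p&gt;As a sketch, the counting rule above can be written down directly (the event format here is hypothetical; a real audit tool would read annotated episode logs):&lt;/p&gt;

```python
# Counting rule sketch: each distinct failure event counts as one mistake;
# a recovery counts only when intended behavior has resumed. Failed
# attempts in between add neither mistakes nor recoveries.
def mistake_recovery_ratio(events):
    """Return (mistakes, recoveries, ratio) for one episode's event list."""
    mistakes = sum(1 for e in events if e == "mistake")
    recoveries = sum(1 for e in events if e == "recovery")
    ratio = mistakes / recoveries if recoveries else float("inf")
    return mistakes, recoveries, ratio

# Pick fails, fails again mid-recovery (not a new failure event), then
# succeeds: one mistake, one recovery, ratio 1.0.
episode = ["mistake", "attempt", "attempt", "recovery"]
print(mistake_recovery_ratio(episode))  # (1, 1, 1.0)
```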

&lt;p&gt;&lt;strong&gt;Mistake 2: No rule for overall quality.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Scorers give an overall High when most dimensions are High even though one is Low. The "High" label then stops being strict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it should be:&lt;/strong&gt; Overall = &lt;strong&gt;Low&lt;/strong&gt; if any dimension is Low; &lt;strong&gt;High&lt;/strong&gt; only if all dimensions are High. One bad dimension pulls the episode down.&lt;/p&gt;
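&lt;p&gt;In code, the aggregation rule is one line (dimension names are made up for illustration; scores are assumed binary High/Low):&lt;/p&gt;

```python
# Overall = Low if any dimension is Low; High only if all are High.
def overall_quality(dimension_scores):
    return "Low" if "Low" in dimension_scores.values() else "High"

print(overall_quality({"sync": "High", "visibility": "Low"}))   # Low
print(overall_quality({"sync": "High", "visibility": "High"}))  # High
```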




&lt;h2&gt;
  
  
  2. Failure modes we kept seeing
&lt;/h2&gt;

&lt;p&gt;Each failure mode below: a short name (formal term) plus a plain-language meaning, one line each; add a screenshot or GIF per item when you publish.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Post-task idle / run-on footage&lt;/strong&gt; (extra 10–15 s of video after the task is done). Dilutes the signal; policy can learn to linger.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Temporal misalignment&lt;/strong&gt; (sync issues between cameras or sensors). Bad for multi-view or fusion; causes inconsistent state.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-collision / kinematic clash&lt;/strong&gt; (arm hits itself or the body). Unsafe; don’t let the policy imitate it.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low contrast / poor observability&lt;/strong&gt; (white background, same-color object, or bad lighting). Object hard to see; weak visual signal.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rubric incompleteness&lt;/strong&gt; (scorers disagree or don’t know how to score). Add explicit rules and examples; flag “undefined” cases and fix the rubric before locking scores.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeated failures before success&lt;/strong&gt; (e.g. 3–5 pick attempts before one works). Noisy trajectory; can teach hesitation.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over-ideal / low-complexity conditions&lt;/strong&gt; (too easy, no obstacles). Can bias the dataset; score complexity separately or down-weight.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Impact
&lt;/h2&gt;

&lt;p&gt;After fixing the rubric and removing Low-quality episodes (20–40%), retraining gave noticeably better results. Studies on filtering teleop data often report ~10–15% (or more) policy gain. &lt;strong&gt;Define metrics and overall quality clearly, then audit before scaling data.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rubric:&lt;/strong&gt; Define “mistake” and “recovery” in writing; one ratio per episode. Overall quality = Low if any dimension is Low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure modes:&lt;/strong&gt; Post-task idle, sensor sync, arm clashes, poor visibility, rubric gaps, repeated failed attempts, over-ideal setup. Name them, add examples (screenshots/GIFs), score consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt; a chunk of bad episodes is high leverage; do it before collecting more.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>lerobot</category>
      <category>vla</category>
      <category>ai</category>
    </item>
    <item>
      <title>Ghibli moment for 3D Printing</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Thu, 05 Feb 2026 12:11:19 +0000</pubDate>
      <link>https://dev.to/ankk98/ghibli-moment-for-3d-printing-1lh1</link>
      <guid>https://dev.to/ankk98/ghibli-moment-for-3d-printing-1lh1</guid>
      <description>&lt;p&gt;I bought my first 3D printer this week to make parts for the robot I'm building.&lt;br&gt;
Even though I've seen 3D prints online for years, watching it work on my desk feels completely different.&lt;/p&gt;

&lt;p&gt;The print head moves slowly, laying down each thin line of plastic.&lt;br&gt;
At the start it looks like nothing, just squiggles.&lt;br&gt;
But layer by layer, an actual object appears, as if the room is quietly drawing in 3D.&lt;/p&gt;

&lt;p&gt;It is strangely calming to watch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35paqsgbig9d9dq90s1h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35paqsgbig9d9dq90s1h.jpg" alt="3D printer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I keep thinking about all the little things I’ve wanted over the years like headphone stands, cable holders, desk gadgets.&lt;br&gt;
Earlier, they were just “nice to have” ideas that I would forget about.&lt;br&gt;
Now I feel like I have this small superpower to do &lt;em&gt;&lt;strong&gt;shaka laka boom boom&lt;/strong&gt;&lt;/em&gt; and make them real.&lt;/p&gt;

&lt;p&gt;Friends who visit are equally fascinated.&lt;br&gt;
Everyone has one object they’ve always wanted: a custom mount, a tiny figurine, some organizer for their setup.&lt;br&gt;
The printer is already “booked” for many days to come with all these requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3qcmzybsdjo320lfa8d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3qcmzybsdjo320lfa8d.jpg" alt="Benchy Boat" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What still surprises me is how affordable this has become.&lt;br&gt;
The printer itself cost around 15k INR, which is not that far from what people pay for a regular home printer.&lt;br&gt;
It feels like we quietly crossed a line where this stopped being a futuristic toy and became just another tool.&lt;/p&gt;

&lt;p&gt;Before buying it, I had reached out to more than 20 printing vendors to get my robot parts made.&lt;br&gt;
Most of them took 3-4 days just to reply.&lt;br&gt;
Then they needed another 10 days or so for the actual printing.&lt;br&gt;
The quotes I got were between 70k and 120k INR, and this was before GST and delivery.&lt;/p&gt;

&lt;p&gt;In the end, I bought the printer for about 15k, spent around 5k on filament, another 10k on a few big parts I still outsourced, and finished everything for under 30k.&lt;br&gt;
The cost difference alone almost forced the decision.&lt;/p&gt;

&lt;p&gt;Now I keep noticing new machines that can even turn 2D photos into 3D models.&lt;br&gt;
The ecosystem already feels quite mature and surprisingly accessible.&lt;br&gt;
It seems like we’re just one Studio Ghibli style moment away from this becoming completely mainstream.&lt;/p&gt;

&lt;p&gt;For now, though, it still feels like a niche hobby.&lt;br&gt;
Most people I know have heard of 3D printing, but have never actually used it.&lt;br&gt;
Someone just needs to make the whole experience a bit simpler, tell the right story, and this will explode.&lt;/p&gt;

</description>
      <category>3dprinting</category>
    </item>
    <item>
      <title>The Hardest Part of Physical AI isn't the Brain</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Thu, 22 Jan 2026 14:00:17 +0000</pubDate>
      <link>https://dev.to/ankk98/the-hardest-part-of-physical-ai-isnt-the-brain-1d1j</link>
      <guid>https://dev.to/ankk98/the-hardest-part-of-physical-ai-isnt-the-brain-1d1j</guid>
      <description>&lt;p&gt;Software engineers entering robotics often make a fundamental category error: they treat humanoids like servers with legs. In the cloud, "move fast and break things" is a mantra. In the physical world, breaking things costs $50,000 and sets your timeline back by quarters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/ankk98/status/2014331393103552608" rel="noopener noreferrer"&gt;The physical constraints dictate the solution space more than the algorithm ever will.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider the battle between &lt;strong&gt;Tesla and Waymo&lt;/strong&gt;. Tesla won the early race for scale because they optimized aggressively around hardware constraints. They built their AI stack to run on compute designed specifically for their cars, leveraging the existing fleet. Waymo, while technically brilliant, relied on expensive, complex sensor suites that were harder to mass-produce. Tesla understood that to win, you don't just add software to a car; you design the car &lt;em&gt;for&lt;/em&gt; the software.&lt;/p&gt;

&lt;p&gt;The same principle applies to &lt;strong&gt;mobile phones&lt;/strong&gt;. Every OS feature is strictly bounded by battery life and thermal throttling. The hardware shapes the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humanoids, however, will be 10x harder.&lt;/strong&gt; Unlike a car (wheels) or a phone (static), a humanoid has dozens of moving parts—joints, actuators, and fingers—all requiring high torque and low latency. The complexity of maintaining physical reliability scales exponentially with every degree of freedom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ola Electric&lt;/strong&gt; offers a cautionary tale. They applied a "software iteration" speed to hardware manufacturing. The result? Thermal issues, panel gaps, and recalls. They learned the hard way that you cannot "refactor" a battery or "hot-patch" a motor. A software bug is a quick fix; a hardware bug is a logistical nightmare.&lt;/p&gt;

&lt;p&gt;This is why the recent partnership between &lt;strong&gt;Google and Boston Dynamics&lt;/strong&gt; is so significant. Google historically struggles with the physical friction of hardware (see Nest/Stadia), while Boston Dynamics has mastered the "Body"—the durability, balance, and actuation. By combining Google’s "Brain" (AI/Cloud) with BD’s physical capability, they create a force multiplier. They acknowledge that physical engineering is a distinct discipline from data science.&lt;/p&gt;

&lt;p&gt;To succeed in Physical AI, we must prioritize reliability over intelligence. Before optimizing the LLM, we must optimize the cooling, the battery density, and the sensor durability. If you can’t keep the body alive, the code doesn't matter.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>humanoid</category>
      <category>ai</category>
    </item>
    <item>
      <title>Can a Humanoid Robot Recognize and Remember My Face?</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Mon, 19 Jan 2026 21:45:17 +0000</pubDate>
      <link>https://dev.to/ankk98/can-a-humanoid-robot-recognize-and-remember-my-face-23ek</link>
      <guid>https://dev.to/ankk98/can-a-humanoid-robot-recognize-and-remember-my-face-23ek</guid>
      <description>&lt;p&gt;&lt;em&gt;A student walks into a robotics lab with a simple question. The expert smiles and begins unraveling the mystery.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Question
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Can a humanoid robot recognize my face?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, &lt;em&gt;right now&lt;/em&gt;. Face recognition (FaceNet, InsightFace) is ~99% accurate in controlled settings.[19][21] But come back in 5 minutes? The robot has completely forgotten you exist.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why does it forget me?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because its brain (Vision-Language-Action models, or VLAs) only sees 1-2 seconds of reality at a time - just 2-4 video frames.[3][15] Imagine having amnesia every second.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why can't it just look at more frames?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because transformer attention - the math that makes VLAs work - is O(T²) where T = frames. Doubling frames costs 4× more computation. 30 frames needs 100× the power of 3 frames (30²/3² = 900/9).[3][4] The robot would need a nuclear reactor to think.&lt;/p&gt;
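&lt;p&gt;The arithmetic is easy to check (a toy cost model, ignoring everything except the quadratic attention term):&lt;/p&gt;

```python
# Relative attention cost of T frames versus a 3-frame baseline: (T/3)^2.
def relative_attention_cost(frames, baseline=3):
    return (frames ** 2) / (baseline ** 2)

print(relative_attention_cost(6))   # 4.0  (double the frames, 4x the cost)
print(relative_attention_cost(30))  # 100.0  (900 / 9)
```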




&lt;p&gt;&lt;strong&gt;"So the real problem is compute?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly. But here's the plot twist: you don't &lt;em&gt;need&lt;/em&gt; all frames. You only need the &lt;em&gt;important&lt;/em&gt; ones. And you don't store pixels - just compact features. That's 100-1000× compression without losing recognition ability.[2][6][26]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Wait... is there actually a way to solve this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Researchers have already solved the individual pieces (smart frame selection, compression, efficient attention). But nobody has stitched them together into a working robot. That's the frontier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Face Recognition 101
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Okay, so how does face recognition actually work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The robot converts your face into an "embedding" - a number vector where similar faces have similar coordinates. FaceNet uses 128 dimensions; InsightFace uses 512. Your face in sunlight and your face at night live in nearby neighborhoods of this abstract space.[19][21]&lt;/p&gt;
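&lt;p&gt;A minimal sketch of what "nearby neighborhoods" means, using cosine similarity on toy 3-dim vectors (real FaceNet/InsightFace embeddings are 128- or 512-dim):&lt;/p&gt;

```python
import math

# Cosine similarity: 1.0 means identical direction, near 0 or below
# means unrelated. Same-person embeddings should score close to 1.0.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

you_daylight = [0.9, 0.1, 0.4]
you_night    = [0.85, 0.15, 0.42]  # same person, slightly shifted
stranger     = [-0.3, 0.9, -0.1]

print(cosine_similarity(you_daylight, you_night))  # close to 1.0
print(cosine_similarity(you_daylight, stranger))   # much lower
```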




&lt;p&gt;&lt;strong&gt;"That's... beautiful? But how did it learn this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trained on millions of face pairs with a technique called "triplet loss": push embeddings of the same person together, push embeddings of different people far apart. After seeing enough examples, patterns emerge.[19][21]&lt;/p&gt;
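&lt;p&gt;Triplet loss itself is a short formula. A sketch with toy 2-dim embeddings (real training runs over batches with hard-negative mining):&lt;/p&gt;

```python
# Triplet loss: penalize when the anchor is not closer to the positive
# (same person) than to the negative (different person) by at least a margin.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

anchor   = [0.0, 1.0]
positive = [0.1, 0.9]   # same person
negative = [1.0, 0.0]   # different person

print(triplet_loss(anchor, positive, negative))  # 0.0 - already well separated
```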




&lt;p&gt;&lt;strong&gt;"How accurate is it, really?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a lab with good lighting: 99%. In the real world with varying lighting, makeup, sunglasses: 85-92%. After 1 month, accuracy remains high (&amp;gt;90%) for adults with stable appearance; degradation is minimal over short intervals.[5][14] Studies show 98%+ accuracy even after 6 months for adults, with larger drops occurring over years.[30]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What trips it up?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lighting changes, occlusion (masks, sunglasses), makeup, aging, and crowded scenes where extracting faces is messy. Basically, anything that changes how the pixels look.[5][14] But some changes hit harder: growing a beard can drop accuracy 10-25× for mismatched facial hair styles.[33] Sunglasses (upper-face occlusion) can drop accuracy from ~93% to ~37%.[34] Growing children face even bigger challenges - infants under 1 year show only ~30% accuracy over 6-month gaps.[35]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Can we make it more robust?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sort of. Ensemble methods (run multiple models, vote on the answer) help. Confidence thresholds work. Training with diverse appearances (beards, glasses, different ages) improves robustness.[33][34] For children, systems need age-invariant features or regular re-enrollment every 6-12 months.[35] But the honest answer is to ask the human when you're uncertain: "Are you Alice? You look similar to someone I know."[19][21]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about growing beards, glasses, or children?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beard changes: Adding or removing facial hair can cause 10-25× increase in false non-match rates, especially mustaches.[33] Glasses: Upper-face occlusion (sunglasses) drops accuracy from ~93% to ~37% - worse than masks.[34] Growing children: Infants (0-1 year) show only ~30% accuracy over 6 months; toddlers (2-3 years) improve to ~65%.[35] For children, systems need frequent re-enrollment or age-invariant modeling.[35]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: The VLA Bottleneck
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What exactly is a VLA?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vision-Language-Action model. A neural network that takes three inputs: camera frames, language instructions ("pick up the red cup"), and outputs robot commands (move arm, open gripper).[15][18]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Examples?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RT-2 (DeepMind, closed). OpenVLA (Carnegie Mellon, open-source 7B). Qwen-VL (Alibaba). VideoVLA (2025, understands motion). OpenVLA is the best starting point for building your own system.[11][15][28]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Wait - can VLAs recognize faces?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. VLAs (OpenVLA, SmolVLA, Pi 0.6) are trained for manipulation tasks, not person identification. They understand objects and scenes, not individual faces. You need a separate face recognition module (InsightFace, FaceNet) that extracts face embeddings, then integrate those into the robot's memory system. The VLA handles actions; face recognition handles identity.[11][15]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why do they only process 2-4 frames?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Control loops run at 50 Hz (20ms per cycle). Optimized VLAs on high-end GPUs achieve 20-40ms inference; typical systems take 50-150ms.[31] That leaves little time for deep video analysis when processing many frames.[24][26][28]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What if we optimize VLA inference?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with optimization: KV cache tricks (reuse computation), sparse attention (skip unimportant tokens), quantization (use 4-bit math instead of 32-bit): 30 frames still takes 100+ ms. Too slow.[1][4][29]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"So we can never extend context?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wrong assumption. CronusVLA (2025) uses a clever trick: extract &lt;em&gt;motion features&lt;/em&gt; instead of processing raw pixels, caching past features to avoid recomputing the vision backbone.[26] This enables multi-frame context with minimal overhead compared to naive frame stacking.[26]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Extending Context
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"How do we extend context efficiently?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three independent tricks that stack: (1) Select only important frames (not all frames). (2) Compress frames to features (not pixels). (3) Use efficient attention patterns (not full attention).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Trick 1: Which frames matter?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Motion-based selection: keep frames with high optical flow (stuff is changing), skip static frames. 15-20× compression with minimal accuracy loss. Or use learned importance (VLM scores which frames matter for your task).[2][5][12]&lt;/p&gt;
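&lt;p&gt;A minimal sketch of motion-based selection, using mean absolute pixel difference as a cheap stand-in for optical flow magnitude (a real pipeline would compute flow, e.g. OpenCV's Farneback method):&lt;/p&gt;

```python
# Keep the first frame, then keep any frame that differs enough from the
# last kept frame; runs of static frames are skipped.
def select_frames(frames, threshold=10.0):
    kept = [0]
    for i in range(1, len(frames)):
        prev = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], prev)) / len(prev)
        if diff > threshold:
            kept.append(i)
    return kept

# Toy "frames" of four brightness values each.
static = [100, 100, 100, 100]
moved  = [160, 40, 180, 20]
video = [static, static, static, moved, moved, static]
print(select_frames(video))  # [0, 3, 5] - repeats are dropped
```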




&lt;p&gt;&lt;strong&gt;"Any other selection methods?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-armed bandit for constrained budgets (2025 research). Or hierarchical: keep recent frames densely, older frames sparsely. Or genetic algorithms (academic, not practical). Motion-based works well in practice.[2][12][14]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Trick 2: Compress frames?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't store 6 MB per frame (RGB pixels). Store pooled features (50 KB, 120× smaller) using max-pooling. Motion features from optical flow can compress temporal information, but face recognition typically requires appearance features combined with motion for best results.[10][13][15]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"How does max-pooling work?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take every 2×2 grid of pixels, keep the strongest signal, discard the rest. Repeat four or five times: 1080p → ~64×64 → ~32×32. You lose spatial detail but preserve what matters for recognition. At 64×64, expect a 5-15% accuracy drop; at 32×32, a 20-40% drop depending on conditions.[10][13][32]&lt;/p&gt;
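&lt;p&gt;One pooling step, sketched on a toy 4×4 "image" of brightness values:&lt;/p&gt;

```python
# 2x2 max-pooling: each output cell is the strongest signal in its
# 2x2 input block, so width and height both halve.
def max_pool_2x2(img):
    h, w = len(img), len(img[0])
    return [
        [max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

img = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 9, 8],
    [1, 2, 7, 6],
]
print(max_pool_2x2(img))  # [[4, 2], [2, 9]]
```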




&lt;p&gt;&lt;strong&gt;"What about temporal compression?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TempMe (2025 paper): cluster similar consecutive frames, keep 1 representative per cluster. Result: 95% token reduction in video. Faster inference. Sometimes even better accuracy (less noise).[6]&lt;/p&gt;
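&lt;p&gt;The core idea in miniature (hedged: TempMe merges tokens inside the transformer; this sketch just collapses runs of near-identical frame features into one representative each):&lt;/p&gt;

```python
# Cluster similar consecutive frame features: keep a feature only if it
# differs from the last kept representative by more than a tolerance.
def merge_consecutive(features, tol=0.1):
    reps = [features[0]]
    for f in features[1:]:
        last = reps[-1]
        if max(abs(a - b) for a, b in zip(f, last)) > tol:
            reps.append(f)
    return reps

features = [[0.5, 0.5], [0.52, 0.49], [0.51, 0.5], [0.9, 0.1], [0.91, 0.12]]
print(len(merge_consecutive(features)))  # 2 - five frames, two representatives
```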




&lt;p&gt;&lt;strong&gt;"Trick 3: Efficient attention?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard: query attends to every past token (O(T²) cost). Efficient: (a) KV cache - reuse computation from previous steps. (b) Grouped Query Attention - multiple query heads share one KV head (4× smaller cache). (c) Sparse attention - only attend to important positions.[1][4][29]&lt;/p&gt;
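&lt;p&gt;The GQA saving is simple to size (illustrative numbers, not any specific model's config):&lt;/p&gt;

```python
# Back-of-envelope KV-cache sizing: the cache scales with the number of
# KV heads, so sharing one KV head among several query heads shrinks it.
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, bytes_per_val=2):
    # keys + values, per layer, fp16
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_val

full_mha = kv_cache_bytes(seq_len=4096, n_kv_heads=32, head_dim=128)
gqa      = kv_cache_bytes(seq_len=4096, n_kv_heads=8,  head_dim=128)
print(full_mha // gqa)  # 4 - 32 query heads sharing 8 KV heads: 4x smaller cache
```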




&lt;p&gt;&lt;strong&gt;"Combining all three?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Motion frame selection (15×) + temporal token merging (95%) + GQA + sparse = 100-1000× compression. Optimized systems can achieve 20-40ms latency on high-end GPUs.[31] Accuracy loss varies by compression level and task.[2][6][1]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: The Memory Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Okay, frames are compressed. Where do we &lt;em&gt;store&lt;/em&gt; them?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the hard part: limited RAM on the robot (8-16 GB shared with OS). Can't query disk fast enough for real-time. Need &lt;em&gt;multiple&lt;/em&gt; storage tiers, each optimized for different timescales.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Layers?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 0&lt;/strong&gt; (2 sec): Current frames in RAM. Real-time VLA inference. &amp;lt;1ms access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1&lt;/strong&gt; (60 sec): Compressed motion features on fast SSD. &amp;lt;20ms access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2&lt;/strong&gt; (1 hour): Face embeddings in vector database (Milvus). Similarity search in &amp;lt;100ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3&lt;/strong&gt; (months): Person identities in PostgreSQL. SQL queries in &amp;lt;10ms.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why separate tiers?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each tier optimizes for its job. Tier 0 is tiny and fast. Tier 3 is huge but doesn't need real-time speed. Together they cover seconds to months without exceeding your latency budget.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"How much storage?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tier 0: 0 MB (flushed continuously). Tier 1: ~100 MB. Tier 2: ~500 MB for the active window, pruned as embeddings get consolidated into Tier 3. Tier 3: ~1 MB per 1,000 people. Total for 1 month of operation: ~600 MB. Fits on a USB stick.[18][20]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about privacy? Is storing face data ethical?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, with consent and transparency. Users should opt-in, know what's stored, and be able to delete their data. Best practice: store embeddings (not raw images), encrypt at rest, allow deletion. Some jurisdictions (EU GDPR, some US states) require explicit consent for biometric data. Build privacy-by-design: minimal data, local-first storage, user control.[18]&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: Real-Time Recognition Challenge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"So here's the hard part: when the robot sees someone, it needs to know &lt;em&gt;instantly&lt;/em&gt; who they are."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right. At 30 FPS, sometimes with more than one face per frame, you're getting dozens of faces per second. You can't run a vector-database query for each one - that's dozens of round-trips to disk every second. Game over.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What do we do?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Smart caching. The robot's most-used people (family, frequent visitors) stay hot in memory. Tier 0 gets an LRU cache of embeddings it's seen recently. Tier 1 tracks faces from the past hour.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Can you walk through this?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Robot sees someone:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract face embedding (lightweight, ~5ms, can happen on spare GPU cycles)&lt;/li&gt;
&lt;li&gt;Check local cache (Tier 0): "Have I seen this embedding in the last 60 seconds?" If yes: instant match&lt;/li&gt;
&lt;li&gt;Cache miss? Check Tier 1 (motion features, faces from past hour): "Any motion features correlate with this face?" If yes: probably the same person&lt;/li&gt;
&lt;li&gt;Still no match? Query vector DB (Tier 2) &lt;em&gt;asynchronously&lt;/em&gt;. Don't block action loop.&lt;/li&gt;
&lt;li&gt;Query result arrives 50-100ms later. Robot incorporates into next decision.&lt;/li&gt;
&lt;/ol&gt;
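&lt;p&gt;Steps 1-4 above can be sketched as a tiered lookup (class and method names are hypothetical; the deeper tiers are stubbed out):&lt;/p&gt;

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class TieredMemory:
    def __init__(self):
        self.tier0 = {}  # name -> embedding, seen in the last ~60 s

    def recognize(self, emb, threshold=0.9):
        # Step 2: check the local cache first.
        best_name, best_sim = None, 0.0
        for name, known in self.tier0.items():
            sim = cos_sim(emb, known)
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_sim >= threshold:
            return best_name           # instant cache hit
        # Steps 3-4: deeper tiers, fired asynchronously in a real system.
        self.query_deeper_tiers(emb)
        return None                    # act conservatively meanwhile

    def query_deeper_tiers(self, emb):
        pass  # stub: Tier 1 motion features, Tier 2 vector DB, Tier 3 SQL

mem = TieredMemory()
mem.tier0["alice"] = [0.9, 0.1, 0.4]
print(mem.recognize([0.88, 0.12, 0.41]))  # 'alice' (cache hit)
print(mem.recognize([-0.3, 0.9, -0.1]))   # None -> "Hello! What's your name?"
```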




&lt;p&gt;&lt;strong&gt;"But what if the person hasn't been seen in 3 months?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly the query you're worried about. Robot can't afford synchronous queries. Solution: (a) Query Tier 3 in background thread. (b) Meanwhile, robot acts conservatively ("Hello! What's your name?"). (c) When query completes, update memory: "Oh! That was Alice!"&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"So the robot makes a guess while waiting for the database?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Correct. It's a reasonable tradeoff. Perfect accuracy takes 100ms. Approximate accuracy takes 20ms. For most tasks, approximate is fine, and you can refine later.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about false positives?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confidence thresholds + fallback. If embedding similarity is &amp;gt;0.9: "Welcome back, Alice!" If similarity is 0.75-0.9: "Are you Alice?" If &amp;lt;0.75: "Hello, new person!"&lt;/p&gt;
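&lt;p&gt;As a decision function (the cutoffs are the example values above, not calibrated constants):&lt;/p&gt;

```python
# Confidence-banded fallback: greet, ask, or treat as new.
def greeting(similarity):
    if similarity > 0.9:
        return "Welcome back, Alice!"
    if similarity >= 0.75:
        return "Are you Alice?"
    return "Hello, new person!"

print(greeting(0.95))  # Welcome back, Alice!
print(greeting(0.8))   # Are you Alice?
print(greeting(0.5))   # Hello, new person!
```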




&lt;p&gt;&lt;strong&gt;"How do we avoid querying vector DB 50 times per second?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batch queries&lt;/strong&gt;: Accumulate 10 faces, query once (amortizes latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bloom filters&lt;/strong&gt;: Quick "definitely not in database" check before expensive query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locality&lt;/strong&gt;: Faces in same location likely same person (temporal coherence)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;: Group embeddings into ~100 clusters, query cluster representative, not individual&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hottest 1000 people&lt;/strong&gt;: 99% of queries hit cache (pareto principle)&lt;/li&gt;
&lt;/ol&gt;
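&lt;p&gt;Item 2 is worth a sketch: a tiny Bloom filter gives a fast "definitely not in the database" answer (false positives are possible, false negatives are not):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    """Bit array + k hash positions per key; no stored keys, tiny memory."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False here means "definitely absent" - safe to skip the DB query.
        return all(self.bits[p] for p in self._positions(key))

known = BloomFilter()
known.add("alice")
print(known.might_contain("alice"))     # True
print(known.might_contain("stranger"))  # almost certainly False
```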




&lt;p&gt;&lt;strong&gt;"Which works best?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combination. Always check local cache first (0.1ms). Batch queries when cache misses (10ms per 10 faces). Cluster embeddings in vector DB (10× fewer distance calculations). Query Tier 3 asynchronously.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What's the latency real-time impact?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tier 0 cache hit: &amp;lt;1ms (recognition instant). Tier 1 batch query: ~15ms (30 FPS, can handle). Tier 2/3 async: 50-100ms (doesn't block control).&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7: Memory Updates and Consolidation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"After 3 months, the database is full of duplicate faces. Alice has been seen 500 times. How do we consolidate?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Periodic background job (runs every 30 minutes): cluster faces by similarity (embedding distance), compute centroid of each cluster, update Tier 3 with centroid + metadata.&lt;/p&gt;
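&lt;p&gt;The centroid step is just an average (a sketch, assuming the clustering step has already grouped embeddings per person):&lt;/p&gt;

```python
# One centroid per person replaces all of that person's recent embeddings.
def centroid(embeddings):
    n = len(embeddings)
    return [sum(vals) / n for vals in zip(*embeddings)]

alice_sightings = [
    [0.9, 0.1, 0.4],
    [0.88, 0.12, 0.42],
    [0.92, 0.08, 0.38],
]
print(centroid(alice_sightings))  # one 3-dim vector stands in for 3 rows
```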




&lt;p&gt;&lt;strong&gt;"What metadata gets updated?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;person_id, name, face_embedding_centroid (average of recent embeddings), last_seen, interaction_count, behavior_summary (LLM-generated), context_tags (where/when usually seen).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why centroid instead of keeping all 500 embeddings?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Storage: 500 embeddings × 512 dims × 4 bytes = 1 MB per person. Scaling to 10k people: 10 GB. But centroid: 512 dims × 4 bytes = 2 KB. 10k people: 20 MB. Also faster queries.&lt;/p&gt;
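
&lt;p&gt;The arithmetic is easy to sanity-check:&lt;/p&gt;

```python
DIMS, BYTES_PER_FLOAT = 512, 4           # float32 embeddings
per_embedding = DIMS * BYTES_PER_FLOAT   # 2048 bytes, roughly 2 KB

all_sightings  = 500 * per_embedding     # ~1 MB for one person's 500 sightings
fleet_full     = 10_000 * all_sightings  # ~10 GB if you keep everything
fleet_centroid = 10_000 * per_embedding  # ~20 MB if you keep one centroid each

print(per_embedding, all_sightings, fleet_full, fleet_centroid)
```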




&lt;p&gt;&lt;strong&gt;"What about people you haven't seen in a year?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Archive them. Move centroid to cold storage (cloud). Keep recent 1000 people in hot database. When someone reappears after 1 year: warm up their embeddings, integrate into Tier 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 8: The Technical Stack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What libraries should I actually use?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face detection/embedding: &lt;strong&gt;InsightFace&lt;/strong&gt; (accurate, fast, open-source, 512-dim vectors).&lt;br&gt;
Vector DB: &lt;strong&gt;Milvus&lt;/strong&gt; or &lt;strong&gt;Qdrant&lt;/strong&gt; (HNSW indexing, fast search, Python API).&lt;br&gt;
Person DB: &lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt; (SQL + vector similarity, scales to millions).&lt;br&gt;
VLA inference: &lt;strong&gt;HuggingFace Transformers&lt;/strong&gt; (OpenVLA-7B).&lt;br&gt;
Video I/O: &lt;strong&gt;OpenCV&lt;/strong&gt; (standard, efficient).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why InsightFace?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;20-50ms per face (fast). 95%+ detection accuracy. Open-source. Produces 512-dimensional embeddings proven for recognition. Easy to fine-tune.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why Milvus over other vector DBs?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Supports HNSW (hierarchical approximate search), in-memory + SSD persistence, Python API, easy deployment on Jetson. Qdrant is also good (Rust-based, slightly faster). Pick either.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Why PostgreSQL + pgvector?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL for complex queries (names, timestamps, context). Vector similarity search in same database. Scales to millions of records. pgvector is mature (stable since 2023).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Wait - why both Milvus and PostgreSQL? Can't I use just one?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can! &lt;strong&gt;PostgreSQL + pgvector&lt;/strong&gt; can handle both: vector similarity search (like Milvus) AND SQL queries with metadata. Many systems use just PostgreSQL. The two-DB setup separates concerns: Milvus (Tier 2) optimized for fast vector search on recent faces, PostgreSQL (Tier 3) for long-term storage with rich metadata. But if you want simplicity, use PostgreSQL + pgvector for everything - it's mature and handles both workloads well.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about the VLA model?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA-7B&lt;/strong&gt; is your best bet. Open-source, fine-tuneable with LoRA, good community. RT-2 (DeepMind) is better but closed-source. VideoVLA (2025) supports multi-frame but less mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 9: Practical Constraints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What hardware do I actually need?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Minimum: &lt;strong&gt;Jetson Orin Nano Super&lt;/strong&gt; ($249, 8 GB RAM, 67 TFLOPS GPU). Processes ~5 FPS with constraints. Can run lightweight models (smolVLA 450M at 8-12 Hz) but struggles with larger 7B models (~0.3 Hz).[39]&lt;/p&gt;

&lt;p&gt;Recommended: 16 GB RAM, 256 GB NVMe SSD, 100+ TFLOPS GPU. For production-quality multi-model stacks, consider Jetson AGX Orin (32-64 GB) or newer architectures that can handle VLA + perception models simultaneously at real-time rates.[39]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Is 5-15 FPS enough?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For humanoid robots? Yes. You don't need 30 FPS every second. The key is an asynchronous architecture: memory queries happen in the background and don't block the control loop.&lt;/p&gt;
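
&lt;p&gt;A minimal sketch of that asynchronous pattern, with a background thread standing in for the memory tier (the lambda is a placeholder for a real vector-DB call):&lt;/p&gt;

```python
import queue, threading, time

def memory_worker(requests, results, query_fn):
    """Background thread: drains memory queries so the control loop never waits."""
    while True:
        item = requests.get()
        if item is None:          # shutdown sentinel
            break
        results.put(query_fn(item))

requests, results = queue.Queue(), queue.Queue()
worker = threading.Thread(
    target=memory_worker,
    args=(requests, results, lambda face: f"id-for-{face}"),  # stand-in query
    daemon=True,
)
worker.start()

# Control loop: enqueue the query, keep ticking, act on the answer when it lands.
requests.put("face-42")
identity = None
for _ in range(100):              # each tick stands in for one control-loop iteration
    try:
        identity = results.get_nowait()
        break
    except queue.Empty:
        time.sleep(0.001)         # the loop keeps running instead of blocking

requests.put(None)
worker.join()
```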




&lt;p&gt;&lt;strong&gt;"What's the latency budget?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Frame capture: 1-2ms. Optimized VLA inference (3 frames): 20-40ms on high-end GPUs; typical systems 50-150ms.[31] Action generation: 2-3ms. Memory cache lookups: &amp;lt;1ms. Async queries (don't block): 50-100ms. Total real-time path: 25-50ms for optimized systems. Meets 20-30 Hz control requirement.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What about on low-power devices like Jetson Orin Nano?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unoptimized CPU-only: 150-300ms per frame. With GPU + TensorRT INT8 quantization + tracking: 25-40ms per frame for 1-5 faces. Memory is the bottleneck - 8 GB shared RAM limits model size and batch processing.[36][37]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What if I need to run multiple models simultaneously?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A full humanoid stack (VLA, object detection, SLAM, depth, speech) competing for shared 8 GB RAM makes real-time performance challenging. Jetson Orin Nano Super is not yet sufficient for production-quality multi-model deployments.[38]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What recognition accuracy should I expect?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face detection: 95-98%. Recognition same day: 92-95%. After 1 week: 90-93%. After 1 month: 90-95% for adults with stable appearance (minimal degradation over short intervals).[30] Accuracy remains high (&amp;gt;90%) for months; larger drops occur over years. But appearance changes matter: beard growth can increase error rates 10-25×; with sunglasses, accuracy drops to ~37%; for children enrolled under age 1, it falls to ~30% over 6 months.[33][34][35] Accuracy improves with recency-weighted averaging, ensemble models, and diverse training data.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What if I need higher accuracy?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use confidence thresholds (only match if &amp;gt;0.85 instead of 0.75). Ask for confirmation on borderline cases. Use an ensemble (run 2-3 face recognition models and vote). Accuracy improves, but at a latency cost.&lt;/p&gt;
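
&lt;p&gt;A minimal voting sketch (the model names and two-vote quorum are illustrative):&lt;/p&gt;

```python
from collections import Counter

def ensemble_match(predictions, min_votes=2):
    """Majority vote across face-recognition models.
    `predictions` maps a model name to (person_id, similarity); an ID
    is accepted only if at least `min_votes` models agree on it."""
    votes = Counter(pid for pid, _ in predictions.values())
    person, count = votes.most_common(1)[0]
    if count >= min_votes:
        return person
    return None   # no consensus: fall back to an "are you ...?" confirmation
```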




&lt;h2&gt;
  
  
  Part 10: Current Research (2025)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"What actually broke through this year?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CronusVLA&lt;/strong&gt;: Multi-frame VLA using motion features with cached past frames, avoiding recomputation of the vision backbone.[26] Achieves 12.7% improvement on LIBERO benchmark with efficient multi-frame processing.[26]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VideoVLA&lt;/strong&gt;: Diffusion-based approach. Predicts future frames AND continuous actions. Better generalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context LLMs&lt;/strong&gt;: Claude 200k tokens. Enables semantic memory integration directly.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"What's still unsolved?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Uncertainty calibration (robot knowing when it's uncertain). Privacy-preserving embeddings (encrypted vector search). Continual learning without forgetting old skills. Cross-modal grounding (explaining what it knows). And making all this work on a low-powered device in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 11: The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Why does robot memory actually matter?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For care robots: remember patient health status, preferences, medication. For home robots: understand family dynamics, relationships. For workplace: coordinate with individuals, learn workflows. Memory = personalization = trust.&lt;/p&gt;

&lt;p&gt;Imagine a Jarvis that can't recognize Tony Stark.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Who actually needs this? What's the market?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three segments: (1) &lt;strong&gt;Healthcare&lt;/strong&gt;: care robots in hospitals/nursing homes ($2B+ market, growing 25% annually). (2) &lt;strong&gt;Consumer&lt;/strong&gt;: home assistant robots ($5B+ by 2030). (3) &lt;strong&gt;Enterprise&lt;/strong&gt;: warehouse/logistics robots ($15B+). Early adopters are healthcare (regulatory compliance, patient safety) and high-end consumer (personalization premium). The "remember me" feature becomes a differentiator when robots are commodity.[18][20]&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;"Is this going to be solved?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Partially, yes. In 1-3 years, robots will recognize and remember faces across months. In 3-7 years, they'll have super-human memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"So what's the summary?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Face recognition works. VLAs are bottlenecked. Compression techniques exist, but nobody has integrated them into a working robot yet. The four-tier memory system solves the storage problem - each tier optimized for its job. Caching prevents query explosion (LRU cache + batch queries + async). Most robots don't have this capability yet, and humanoids are incomplete without it. In 3 years, this will likely be standard.&lt;/p&gt;




&lt;p&gt;Are you building in the robotics-AI space? How are you tackling these challenges? Do you wish someone would build the memory layer for robots? Should I take up the project &lt;code&gt;yaadeinDB&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts or feedback in the comments section.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href="https://developer.nvidia.com/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/" rel="noopener noreferrer"&gt;Optimizing Inference for Long Context with NVFP4 KV Cache&lt;/a&gt; - NVIDIA Developer Blog, Dec 2025&lt;br&gt;
[2] &lt;a href="https://openaccess.thecvf.com/content/CVPR2025/html/Hu_M-LLM_Based_Video_Frame_Selection_for_Efficient_Video_Understanding_CVPR_2025_paper.html" rel="noopener noreferrer"&gt;M-LLM Based Video Frame Selection for Efficient Video Understanding&lt;/a&gt; - CVPR 2025&lt;br&gt;
[3] &lt;a href="https://arxiv.org/abs/2412.19442" rel="noopener noreferrer"&gt;A Survey on Large Language Model Acceleration based on KV Cache&lt;/a&gt; - ArXiv 2024&lt;br&gt;
[4] &lt;a href="https://sebastianraschka.com/blog/2025/coding-the-kv-cache-in-llms.html" rel="noopener noreferrer"&gt;Understanding and Coding the KV Cache in LLMs&lt;/a&gt; - Sebastian Raschka's Magazine, Jun 2025&lt;br&gt;
[5] &lt;a href="https://openaccess.thecvf.com/content_cvpr_2018/html/Huang_What_Makes_a_CVPR_2018_paper.html" rel="noopener noreferrer"&gt;Analyzing Temporal Information in Video Understanding&lt;/a&gt; - CVPR 2018&lt;br&gt;
[6] &lt;a href="https://arxiv.org/abs/2409.01156" rel="noopener noreferrer"&gt;TempMe: Video Temporal Token Merging for Efficient Video Understanding&lt;/a&gt; - ICLR 2025&lt;br&gt;
[10] &lt;a href="https://www.giskard.ai/glossary/pooling-layers-in-cnn" rel="noopener noreferrer"&gt;Pooling Layers in CNN&lt;/a&gt; - Giskard AI Glossary, 2025&lt;br&gt;
[11] &lt;a href="https://antonwohlgemuth.com/p/foundation-models-in-robotics-unlocking-new-frontiers-7cc1" rel="noopener noreferrer"&gt;Foundation Models for Robotics: Vision-Language-Action&lt;/a&gt; - Blog Post, Dec 2024&lt;br&gt;
[12] &lt;a href="https://arxiv.org/abs/2510.27280" rel="noopener noreferrer"&gt;FOCUS: Efficient Keyframe Selection for Long Videos&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[13] &lt;a href="https://blog.milvus.io/ai-quick-reference/what-is-the-role-of-pooling-layers-in-cnns" rel="noopener noreferrer"&gt;Role of Pooling Layers in CNNs&lt;/a&gt; - Milvus.io Blog, 2025 (Note: URL redirects but page is accessible)&lt;br&gt;
[14] &lt;a href="https://arxiv.org/abs/2509.16635" rel="noopener noreferrer"&gt;A Review of Recent Techniques for Person Re-Identification&lt;/a&gt; - ArXiv, Sep 2025&lt;br&gt;
[15] &lt;a href="https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action" rel="noopener noreferrer"&gt;RT-2: New model translates vision and language into action&lt;/a&gt; - DeepMind Blog, Jul 2023&lt;br&gt;
[18] &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC6452248/" rel="noopener noreferrer"&gt;Memory and mental time travel in humans and social robots&lt;/a&gt; - PMC, Mar 2019&lt;br&gt;
[19] &lt;a href="https://pyimagesearch.com/2023/01/09/face-recognition-with-siamese-networks-keras-and-tensorflow/" rel="noopener noreferrer"&gt;Understanding Face Recognition: FaceNet vs Siamese Networks&lt;/a&gt; - Blog Post, 2024&lt;br&gt;
[20] &lt;a href="https://openreview.net/forum?id=BBgDA4y0B9" rel="noopener noreferrer"&gt;Episodic Memory Banks for Lifelong Robot Learning&lt;/a&gt; - OpenReview&lt;br&gt;
[21] &lt;a href="https://pyimagesearch.com/2023/01/09/face-recognition-with-siamese-networks-keras-and-tensorflow/" rel="noopener noreferrer"&gt;Face Recognition with Siamese Networks, Keras, and TensorFlow&lt;/a&gt; - PyImageSearch, Jan 2023&lt;br&gt;
[24] &lt;a href="https://arxiv.org/abs/2506.07339" rel="noopener noreferrer"&gt;Real-Time Execution of Action Chunking Flow Policies&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[26] &lt;a href="https://arxiv.org/abs/2506.19816" rel="noopener noreferrer"&gt;CronusVLA: Towards Efficient and Robust Manipulation via Transferring Latent Motion Across Time&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[28] &lt;a href="https://huggingface.co/papers/2511.05936" rel="noopener noreferrer"&gt;Vision-Language-Action Models: Concepts, Progress&lt;/a&gt; - Blog/Docs, 2025&lt;br&gt;
[29] &lt;a href="https://www.emergentmind.com/topics/kv-cache-optimization" rel="noopener noreferrer"&gt;KV Cache Optimization in Transformers&lt;/a&gt; - Emergent Mind, Nov 2025&lt;br&gt;
[30] &lt;a href="https://arxiv.org/abs/2204.01760" rel="noopener noreferrer"&gt;Face Recognition in Children: A Longitudinal Study&lt;/a&gt; - ArXiv 2022; &lt;a href="https://pubmed.ncbi.nlm.nih.gov/28114700/" rel="noopener noreferrer"&gt;Longitudinal Analysis of Mugshots&lt;/a&gt; - PubMed 2017&lt;br&gt;
[31] &lt;a href="https://www.emergentmind.com/topics/kv-cache-optimization" rel="noopener noreferrer"&gt;Running VLAs at Real-Time Speed&lt;/a&gt; - Emergent Mind 2025; &lt;a href="https://arxiv.org/abs/2512.20276" rel="noopener noreferrer"&gt;ActionFlow: Real-Time Vision-Language-Action&lt;/a&gt; - ArXiv 2025&lt;br&gt;
[32] &lt;a href="https://arxiv.org/abs/2107.03769" rel="noopener noreferrer"&gt;Susceptibility to Image Resolution in Face Recognition&lt;/a&gt; - ArXiv 2021; Low-resolution face recognition studies - Multiple sources&lt;br&gt;
[33] &lt;a href="https://openaccess.thecvf.com/content/WACV2024W/DVPBA/html/Wu_Facial_Hair_Area_in_Face_Recognition_Across_Demographics_Small_Size_WACVW_2024_paper.html" rel="noopener noreferrer"&gt;Facial Hair Area in Face Recognition Across Demographics&lt;/a&gt; - ArXiv 2024; Effects of Facial Hair on Face Recognition - IEEE 2025&lt;br&gt;
[34] &lt;a href="https://arxiv.org/abs/2311.11512" rel="noopener noreferrer"&gt;Impact of Partial Occlusion on Face Recognition&lt;/a&gt; - ArXiv 2023; &lt;a href="https://pubmed.ncbi.nlm.nih.gov/36922579/" rel="noopener noreferrer"&gt;Glasses and Sunglasses Effects&lt;/a&gt; - PubMed 2023&lt;br&gt;
[35] &lt;a href="https://arxiv.org/abs/2204.01760" rel="noopener noreferrer"&gt;Face Recognition in Children: A Longitudinal Study&lt;/a&gt; - ArXiv 2022; Young Face Aging Dataset Studies - ArXiv 2022&lt;br&gt;
[36] &lt;a href="https://forums.developer.nvidia.com/t/face-detection-post-processing-not-working-in-deepstream-6-2-on-jetson-orin-nano/337401" rel="noopener noreferrer"&gt;Face Recognition on Jetson Orin Nano&lt;/a&gt; - NVIDIA Developer Forums 2024; &lt;a href="https://www.ijert.org/robust-multi-sensor-facial-recognition-in-real-time-using-nvidia-deepstream" rel="noopener noreferrer"&gt;Robust Multi-Sensor Facial Recognition in Real-Time using NVIDIA DeepStream&lt;/a&gt; - IJERT&lt;br&gt;
[37] &lt;a href="https://forums.developer.nvidia.com/t/jetson-orin-nanos-ram-keeps-getting-full-the-board-crashes/321270" rel="noopener noreferrer"&gt;Jetson Orin Nano RAM Issues and Memory Optimization&lt;/a&gt; - NVIDIA Developer Forums 2024; &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Developer Kit Specifications&lt;/a&gt; - NVIDIA.com&lt;br&gt;
[38] &lt;a href="https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i"&gt;Multi-Model AI Resource Allocation for Humanoid Robots: A Survey on Jetson Orin Nano Super&lt;/a&gt; - DEV Community, ankk98, 2025&lt;br&gt;
[39] &lt;a href="https://dev.to/ankk98/humanoid-compute-price-vs-performance-842"&gt;Humanoid Compute: Price vs. Performance&lt;/a&gt; - DEV Community, ankk98, 2025&lt;/p&gt;

</description>
      <category>humanoid</category>
      <category>ai</category>
      <category>robotics</category>
      <category>vla</category>
    </item>
    <item>
      <title>Multi-Model AI Resource Allocation for Humanoid Robots: A Survey on Jetson Orin Nano Super</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Mon, 19 Jan 2026 13:06:12 +0000</pubDate>
      <link>https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i</link>
      <guid>https://dev.to/ankk98/multi-model-ai-resource-allocation-for-humanoid-robots-a-survey-on-jetson-orin-nano-super-310i</guid>
      <description>&lt;p&gt;&lt;em&gt;Building efficient multi-model AI pipelines for humanoid robotics on resource-constrained edge hardware, with a focus on Jetson Orin Nano Super.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Status disclaimer&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Everything in this article is &lt;strong&gt;mostly theoretical today&lt;/strong&gt;. A Jetson Orin Nano Super–class board (8 GB LPDDR5, ~102 GB/s memory bandwidth, ~67 INT8 TOPS &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kit&lt;/a&gt;) is &lt;strong&gt;underpowered for running a full Vision-Language-Action (VLA) model plus several heavy vision models concurrently&lt;/strong&gt; in production. Making this truly viable will require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hardware&lt;/strong&gt;: more memory bandwidth, more VRAM, and higher sustained TOPS within a tight power envelope
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt;: lighter, edge-optimized VLA / YOLO26 variants (pruned, quantized, distilled)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software stack&lt;/strong&gt;: better kernel-level scheduling, more mature CUDA Green Contexts, and more predictable multi-tenant GPU runtimes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architectures and strategies below are what you should &lt;strong&gt;aim for&lt;/strong&gt;, but today they remain a mix of research prototypes and partial production deployments.&lt;/p&gt;

&lt;p&gt;I have ordered the device, so I will do some testing once it arrives. Stay tuned for empirical results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Suppose you want to run multiple AI models simultaneously on edge hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;Vision-Language-Action (VLA)&lt;/strong&gt; model like &lt;strong&gt;&lt;a href="https://huggingface.co/blog/smolvla" rel="noopener noreferrer"&gt;SmolVLA&lt;/a&gt;&lt;/strong&gt; for robot control,&lt;/li&gt;
&lt;li&gt;a recent &lt;strong&gt;YOLO26&lt;/strong&gt; model for comprehensive perception (object detection, instance segmentation, pose estimation, oriented detection, and image classification) (&lt;a href="https://www.ultralytics.com/news/ultralytics-redefines-state-of-the-art-vision-ai-with-yolo26" rel="noopener noreferrer"&gt;Ultralytics YOLO26 announcement&lt;/a&gt;, &lt;a href="https://blog.roboflow.com/yolo26-in-roboflow/" rel="noopener noreferrer"&gt;Roboflow YOLO26 support&lt;/a&gt;),&lt;/li&gt;
&lt;li&gt;plus other specialized models (e.g., SLAM, depth, speech).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these must share limited GPU memory and compute resources on an embedded platform like &lt;strong&gt;Jetson Orin Nano Super&lt;/strong&gt; (8 GB LPDDR5 @ ~102 GB/s, 6-core Arm CPU, Ampere GPU with 1,024 CUDA cores and 32 Tensor Cores &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kit&lt;/a&gt;, &lt;a href="https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html" rel="noopener noreferrer"&gt;Jetson Orin Nano/NX/AGX power modes&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We’ll survey &lt;strong&gt;three major resource allocation strategies&lt;/strong&gt; for running multiple AI models on edge devices: hardware partitioning, priority-based scheduling, and offloading. Then we'll focus on the &lt;strong&gt;event-driven architecture&lt;/strong&gt; that production robotics systems actually use for reliable, real-time multi-model execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil5vzddt3s8prl7wupvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fil5vzddt3s8prl7wupvm.png" alt="NVIDIA Jetson Orin Nano Super Developer Kit Specs" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Criteria for Multi-Model Edge AI Systems
&lt;/h2&gt;

&lt;p&gt;Before diving into specific strategies, it's crucial to understand the fundamental design criteria that shape resource allocation decisions for multi-model AI on edge devices. These criteria directly influence which approach will work for your specific use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Performance Requirements
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Latency budgets&lt;/strong&gt;: Critical models (VLA for robot control) typically target a &lt;strong&gt;desired frequency of 24 Hz&lt;/strong&gt; for end-to-end control loops (sensor → action), while perception models (e.g., YOLO26 detection/segmentation) can tolerate lower frequencies (~5 Hz). Missing deadlines can cause instability or safety issues in mobile robots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jitter tolerance&lt;/strong&gt;: Real-time systems need &lt;strong&gt;predictable&lt;/strong&gt; latency. User reports show &lt;strong&gt;10–40% latency increases&lt;/strong&gt; under CUDA MPS even with per-client limits, and sometimes much worse when misconfigured (&lt;a href="https://docs.nvidia.com/deploy/mps/" rel="noopener noreferrer"&gt;NVIDIA MPS docs&lt;/a&gt;, &lt;a href="https://forums.developer.nvidia.com/t/mps-interference-problem/312930" rel="noopener noreferrer"&gt;MPS interference report&lt;/a&gt;, &lt;a href="https://forums.developer.nvidia.com/t/mps-vs-no-mps-drastic-increase-in-kernel-latency/336175" rel="noopener noreferrer"&gt;MPS latency outlier report&lt;/a&gt;). That makes naive multi-process sharing a bad fit for tight 24 Hz+ control loops unless carefully profiled and constrained.&lt;/p&gt;
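
&lt;p&gt;Before committing to a shared-GPU setup, profile tail latency rather than the mean, since jitter lives in the tail. A minimal sketch (the workload lambda is a placeholder for your inference call):&lt;/p&gt;

```python
import statistics, time

def profile(fn, iters=200):
    """Time per-call latency and report mean vs. p99: jitter shows up
    in the tail long before it shows in the average."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    p99 = samples[int(0.99 * (len(samples) - 1))]
    return statistics.mean(samples), p99

mean_ms, p99_ms = profile(lambda: sum(range(10_000)))  # dummy workload
budget_ms = 1000.0 / 24   # ~41.7 ms per tick at 24 Hz
print(f"mean={mean_ms:.3f}ms p99={p99_ms:.3f}ms budget={budget_ms:.1f}ms")
```

&lt;p&gt;If p99 blows the 24 Hz budget while the mean looks fine, the deployment will still miss deadlines.&lt;/p&gt;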

&lt;p&gt;&lt;strong&gt;Throughput vs. latency trade-offs&lt;/strong&gt;: Background models can use batching for efficiency, but critical models prioritize low-latency single-inference execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Power envelope&lt;/strong&gt;: On Jetson Orin Nano Super, low-power modes operate around &lt;strong&gt;7–8 W&lt;/strong&gt;, with higher modes up to ~25 W in &lt;code&gt;MAXN_SUPER&lt;/code&gt; (&lt;a href="https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html" rel="noopener noreferrer"&gt;Jetson power/performance modes&lt;/a&gt;). Multi-model execution must stay within these thermal budgets or the device will downclock aggressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory hierarchy&lt;/strong&gt;: The Orin Nano Super’s &lt;strong&gt;8 GB LPDDR5&lt;/strong&gt; is a &lt;strong&gt;unified memory pool&lt;/strong&gt; for CPU and GPU. Models compete for both GPU and system memory, and memory pressure can cause allocator fragmentation, cache thrashing, and even swapping if you’re not careful with container limits and tensor lifetimes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute asymmetry&lt;/strong&gt;: GPU cores excel at parallel inference, CPU cores handle preprocessing/serialization. Resource allocation must balance both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability and Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Graceful degradation&lt;/strong&gt;: Non-critical models should drop frames or reduce frequency under resource pressure, not crash the entire system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model priority levels&lt;/strong&gt;: Critical control (VLA) &amp;gt; essential perception (YOLO detection) &amp;gt; background tasks (pose estimation, classification).&lt;/p&gt;
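
&lt;p&gt;A minimal sketch of dispatching by those priority levels (the tier numbering and job names are illustrative):&lt;/p&gt;

```python
import heapq, itertools

class PriorityDispatcher:
    """Drains inference jobs strictly by priority tier:
    0 = critical control (VLA), 1 = essential perception (YOLO),
    2 = background tasks. Under pressure, background jobs are shed
    rather than crashing the pipeline."""

    def __init__(self):
        self._q = []
        self._seq = itertools.count()   # keeps FIFO order within a tier

    def submit(self, priority, job):
        heapq.heappush(self._q, (priority, next(self._seq), job))

    def drain(self, shed_background=False):
        order = []
        while self._q:
            prio, _, job = heapq.heappop(self._q)
            if shed_background and prio >= 2:
                continue                # graceful degradation: drop, don't crash
            order.append(job)
        return order
```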

&lt;p&gt;&lt;strong&gt;Failure isolation&lt;/strong&gt;: A single model's crash shouldn't bring down the entire pipeline. Containerization and process isolation are essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  System-Level Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Communication overhead&lt;/strong&gt;: Inter-model data sharing (JSON serialization, queue management) adds latency that must be budgeted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring requirements&lt;/strong&gt;: Real-time metrics collection for latency, utilization, and thermal state enables adaptive resource allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability needs&lt;/strong&gt;: Will you add more models later? Choose architectures that support horizontal scaling without complete rearchitecting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment constraints&lt;/strong&gt;: Edge devices often run in remote locations with limited network access, requiring self-contained solutions.&lt;/p&gt;

&lt;p&gt;These design criteria explain why simple partitioning approaches fail on edge devices: the fundamental constraints (thermal limits, unified memory, power budgets) make static allocation inefficient. Production systems instead use adaptive, priority-aware resource sharing with explicit failure modes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 1: Partitioning – Static Slices of Compute and Memory
&lt;/h2&gt;

&lt;p&gt;Partitioning tries to make multi-model systems predictable by &lt;strong&gt;reserving fixed resources per model&lt;/strong&gt;. On edge hardware, this usually means partitioning GPU SMs, constraining CPU cores, or pinning memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 GPU Resource Partitioning (NVIDIA Green Contexts)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Hardware-level SM (Streaming Multiprocessor) allocation. You split the GPU’s SMs into subsets and bind different workloads to different subsets using &lt;strong&gt;CUDA Green Contexts&lt;/strong&gt; (&lt;a href="https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html" rel="noopener noreferrer"&gt;CUDA Green Contexts driver API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;On Jetson Orin Nano Super (Ampere, compute capability 8.7), the GPU exposes &lt;strong&gt;8 SMs&lt;/strong&gt; with a total of &lt;strong&gt;1,024 CUDA cores&lt;/strong&gt; (see &lt;a href="https://www.techpowerup.com/gpu-specs/jetson-orin-nano-8-gb.c4082" rel="noopener noreferrer"&gt;Jetson Orin Nano GPU spec&lt;/a&gt;). Green Contexts enforce &lt;strong&gt;minimum SM counts and alignment constraints&lt;/strong&gt; per context (e.g., minimum 4 SMs, counts in multiples of 2 for 8.x architectures).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardware-enforced &lt;strong&gt;SM isolation&lt;/strong&gt; (clean separation at the compute level)&lt;/li&gt;
&lt;li&gt;Official NVIDIA support on Orin (compute capability 8.7)&lt;/li&gt;
&lt;li&gt;Streams and kernels under different Green Contexts are scheduled from separate queues, which can improve isolation in some workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt; (critical on Orin Nano–class devices):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frequency is still global&lt;/strong&gt;: GPU clock is governed by the Jetson power mode and thermal headroom, &lt;strong&gt;not&lt;/strong&gt; by Green Contexts. All contexts share the same global GPU frequency (&lt;a href="https://docs.nvidia.com/jetson/archives/r36.4.4/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html" rel="noopener noreferrer"&gt;Jetson power modes&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No memory isolation&lt;/strong&gt;: Contexts share L2 cache, memory controllers, and the same 8 GB LPDDR5 DRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thermal throttling&lt;/strong&gt;: In 7–8 W modes, sustained heavy use across contexts still causes downclocking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited partition granularity&lt;/strong&gt;: With 8 SMs and a 4-SM minimum per context on cc 8.7, you can have at most &lt;strong&gt;two partitions of 4 SMs&lt;/strong&gt; each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observed behavior can be surprising&lt;/strong&gt;: Users have reported &lt;strong&gt;little to no runtime change&lt;/strong&gt; when varying SM allocations via Green Contexts on Jetson Orin, suggesting that other bottlenecks (memory, front-end, scheduling) may dominate (&lt;a href="https://forums.developer.nvidia.com/t/green-context-sm-allocation-not-affecting-kernel-runtime-in-jetson-orina/332343" rel="noopener noreferrer"&gt;NVIDIA forum: Green Contexts on Orin&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world latency impact (today)&lt;/strong&gt;: You may get some improved isolation in synthetic benchmarks, but on Orin Nano–class devices the main constraints are &lt;strong&gt;power mode, memory bandwidth, and thermal limits&lt;/strong&gt;, which Green Contexts do &lt;strong&gt;not&lt;/strong&gt; solve. For most embedded robotics use cases, the complexity is hard to justify unless you have a very specific multi-tenant requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: On Orin Nano–class devices, use Green Contexts only when you absolutely need &lt;strong&gt;hard SM isolation&lt;/strong&gt; between tenants and can afford the engineering complexity. For single-robot stacks, it’s usually better to rely on &lt;strong&gt;priority-based scheduling and event-driven architectures&lt;/strong&gt; instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Software Partitioning: CUDA MPS (Multi-Process Service)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: A software layer that allows &lt;strong&gt;multiple processes to share a single GPU context&lt;/strong&gt;, time-multiplexing kernels from different processes through the &lt;strong&gt;CUDA MPS server&lt;/strong&gt; (&lt;a href="https://docs.nvidia.com/deploy/mps/" rel="noopener noreferrer"&gt;CUDA MPS guide&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works on all Jetson platforms today (no driver updates needed)&lt;/li&gt;
&lt;li&gt;Per-process thread budget and pinned memory limits&lt;/li&gt;
&lt;li&gt;Simple to enable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shared L2 cache and bandwidth&lt;/strong&gt;: Models can still thrash each other’s L2 lines and DRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel serialization and interference&lt;/strong&gt;: Under contention, one client’s kernel launches can delay another’s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpredictable latency without careful tuning&lt;/strong&gt;: Reports show latency increases of &lt;strong&gt;10–40%&lt;/strong&gt; under moderate contention even with 50/50 SM splits, and in misconfigured scenarios, giant outliers (e.g., a kernel going from ~65 µs to ~100 ms) (&lt;a href="https://forums.developer.nvidia.com/t/mps-interference-problem/312930" rel="noopener noreferrer"&gt;MPS interference report&lt;/a&gt;, &lt;a href="https://forums.developer.nvidia.com/t/mps-vs-no-mps-drastic-increase-in-kernel-latency/336175" rel="noopener noreferrer"&gt;MPS latency outlier report&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory accounting is per-process, not global&lt;/strong&gt;: Per-process limits don’t give you a global “cap”; two 1 GB limits still allow 2 GB total in use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world issue&lt;/strong&gt;: For multi-model pipelines (VLA + YOLO26 detection/segmentation/pose) targeting &lt;strong&gt;24 Hz control loops&lt;/strong&gt;, this kind of latency variability is unacceptable unless you design around it very conservatively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Reasonable for batch or non-real-time workloads; a poor fit for tight control loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 OS-Level Partitioning: Linux cgroups + CPU Affinity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Kernel-level control over CPU time and system RAM. You pin CPU cores, set CPU shares, and enforce memory limits per cgroup or container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to implement&lt;/strong&gt;: Create CPU and memory control groups, pinning specific cores to each workload. Use Docker's &lt;code&gt;cpuset_cpus&lt;/code&gt; and &lt;code&gt;mem_limit&lt;/code&gt; for containerized isolation.&lt;/p&gt;
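&lt;p&gt;A minimal Docker Compose sketch of this pattern (service names, image tags, and core/memory splits are illustrative placeholders, not recommendations):&lt;/p&gt;

```yaml
# Hypothetical compose fragment: pin the critical control stack to dedicated
# cores and cap system RAM per service (compose v2-style cpuset/mem_limit).
services:
  vla:
    image: my-vla:latest        # placeholder image
    runtime: nvidia
    cpuset: "0-3"               # dedicated cores for the 24 Hz control loop
    mem_limit: 4g
  yolo:
    image: my-yolo:latest       # placeholder image
    runtime: nvidia
    cpuset: "4-5"               # background perception gets the remaining cores
    mem_limit: 2g
```

&lt;p&gt;Keep in mind that this bounds CPU and system RAM only; both services still share the GPU.&lt;/p&gt;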

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean OS-level isolation (CPU and system RAM)&lt;/li&gt;
&lt;li&gt;Prevents CPU contention between processes&lt;/li&gt;
&lt;li&gt;Works on all platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Doesn’t isolate the GPU&lt;/strong&gt;: Both processes still compete for GPU memory bandwidth (~102 GB/s on Orin Nano Super, shared across all clients; see the &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;NVIDIA Jetson Orin Nano Super Developer Kit&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incomplete solution&lt;/strong&gt;: If the VLA runs on the GPU while YOLO's CPU thread is blocked by its cgroup limits, end-to-end latency still spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory overhead&lt;/strong&gt;: Tight system RAM limits can trigger early swapping, undermining your "fixed" allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world issue&lt;/strong&gt;: Critical model deadlines (24 Hz for VLA, real-time pose estimation) might still be missed if system RAM swaps to disk or GPU bandwidth is saturated by multiple concurrent models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Useful as a supporting tool (especially with containers), but not sufficient alone for real-time multi-model GPU workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Where Partitioning Fits
&lt;/h3&gt;

&lt;p&gt;Partitioning is attractive when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need strong isolation&lt;/strong&gt; (multi-tenant scenarios, safety domains)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You care more about fairness than minimum latency&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can afford reduced peak performance&lt;/strong&gt; due to thermal limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But on small edge devices with unified memory and tight power envelopes, &lt;strong&gt;hard partitions tend to underutilize the hardware&lt;/strong&gt; and amplify thermal problems. That’s why most modern robotics stacks use partitioning only as a &lt;strong&gt;supporting tool&lt;/strong&gt;, not the primary strategy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 2: Prioritization and Event-Driven Scheduling – Shared Resources, Explicit Priorities
&lt;/h2&gt;

&lt;p&gt;Prioritization assumes all models share the same GPU/CPU pool, but &lt;strong&gt;who runs when&lt;/strong&gt; is controlled carefully using priorities, async queues, and backpressure. This is the pattern used by OM1, LeRobot, Reachy 2, and most modern robotics systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Why Prioritization Wins on Edge Devices
&lt;/h3&gt;

&lt;p&gt;The fundamental limitation of edge devices: &lt;strong&gt;Unified memory architectures and thermal constraints make static resource partitioning inefficient.&lt;/strong&gt; Production robotics systems avoid strict partitions and instead use event-driven patterns that dynamically allocate resources based on priority and system state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Reliable multi-model execution comes from &lt;strong&gt;adaptive resource sharing&lt;/strong&gt; and &lt;strong&gt;graceful degradation&lt;/strong&gt;, not rigid slicing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Core Principles
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shared compute with explicit priorities&lt;/strong&gt;: Multiple models share GPU/CPU resources, but execution priority is clearly defined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA streams for kernel scheduling&lt;/strong&gt;: High-priority streams for critical models, normal priority for background tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async event communication&lt;/strong&gt;: Message queues decouple model timing and enable graceful degradation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System state awareness&lt;/strong&gt;: Monitor thermal/power limits and adapt resource allocation dynamically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deadline-aware scheduling&lt;/strong&gt;: Soft deadlines for non-critical models, hard deadlines for essential perception.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2.3 Architecture: Prioritized CUDA Streams + Async Event Bus
&lt;/h3&gt;

&lt;p&gt;One concrete template looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────┐
│   Critical Model Thread (e.g., VLA @ 24Hz)         │
│   Priority: HIGH                                   │
│   Target frequency: 24 Hz                          │
└────────────────────────────────────────────────────┘
         ↓ (sensor inputs)
┌────────────────────────────────────────────────────┐
│   CUDA High-Priority Stream (GPU)                  │
│   Critical inference, never preempted              │
└────────────────────────────────────────────────────┘
         ↓ (outputs → Action/Event queues)
┌────────────────────────────────────────────────────┐
│   Event Bus (Redis/Zenoh/ROS2)                     │
│   Async communication between models               │
└────────────────────────────────────────────────────┘
         ↓ (decoupled messaging)
┌────────────────────────────────────────────────────┐
│   Background Models (YOLO, segmentation, etc.)     │
│   Priority: NORMAL/BACKGROUND                      │
│   Graceful degradation under load                  │
│   Runs in normal-priority CUDA streams             │
└────────────────────────────────────────────────────┘
         ↓ (context updates → Decision fusion)
┌────────────────────────────────────────────────────┐
│   Decision Fusion &amp;amp; Action Execution               │
│   Combines all model outputs                       │
└────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.4 Implementation Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker / Docker Compose + ROS 2 / Zenoh (containerized event-driven architecture)&lt;/strong&gt;&lt;br&gt;
Each AI model (or subsystem) runs in its own container, communicating over async message buses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containerize each model service with NVIDIA runtime.&lt;/li&gt;
&lt;li&gt;Use async message queues (ZMQ/ROS2/Zenoh) for inter-service communication.&lt;/li&gt;
&lt;li&gt;Prioritize the VLA at 24 Hz with strict deadlines while YOLO runs at 5 Hz with graceful degradation.&lt;/li&gt;
&lt;/ul&gt;
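&lt;p&gt;As a minimal sketch of the queue-based pattern, with pure Python &lt;code&gt;asyncio&lt;/code&gt; standing in for ZMQ/ROS2/Zenoh (topic names and rates are illustrative):&lt;/p&gt;

```python
import asyncio
import time

class EventBus:
    """Tiny in-process pub/sub with bounded queues and drop-oldest backpressure."""
    def __init__(self, maxsize=4):
        self.topics, self.maxsize = {}, maxsize

    def queue(self, topic):
        return self.topics.setdefault(topic, asyncio.Queue(maxsize=self.maxsize))

    def publish(self, topic, msg):
        q = self.queue(topic)
        if q.full():
            q.get_nowait()  # drop the stalest message so producers never block
        q.put_nowait(msg)

async def model_loop(bus, topic, rate_hz, infer, duration_s, counts):
    """Run `infer` at a target rate and publish each result to the bus."""
    period = 1.0 / rate_hz
    deadline = time.monotonic() + duration_s
    while True:
        start = time.monotonic()
        if start > deadline:
            break
        bus.publish(topic, infer())
        counts[topic] = counts.get(topic, 0) + 1
        # Sleep only the remaining slack; a missed deadline simply skips the sleep.
        await asyncio.sleep(max(0.0, period - (time.monotonic() - start)))

async def run_models(duration_s=0.5):
    bus, counts = EventBus(), {}
    await asyncio.gather(
        model_loop(bus, "vla/actions", 24, lambda: "action", duration_s, counts),
        model_loop(bus, "yolo/detections", 5, lambda: "boxes", duration_s, counts),
    )
    return counts
```

&lt;p&gt;The bounded queues give you graceful degradation for free: when a background consumer falls behind, it sees fewer, fresher messages instead of an unbounded backlog.&lt;/p&gt;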

&lt;p&gt;&lt;strong&gt;Tools &amp;amp; libraries&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ROS 2&lt;/strong&gt;: Native deadline/lifespan QoS policies atop DDS (&lt;a href="https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html" rel="noopener noreferrer"&gt;ROS 2 QoS design&lt;/a&gt;). Used heavily in &lt;strong&gt;Reachy 2&lt;/strong&gt;’s core ROS 2 workspace (&lt;a href="https://github.com/pollen-robotics/reachy2_core" rel="noopener noreferrer"&gt;&lt;code&gt;reachy2_core&lt;/code&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zenoh (OM1’s choice)&lt;/strong&gt;: Low-latency pub/sub and key/value messaging, lighter than full ROS 2 middleware. OM1 integrates Zenoh for cross-component data exchange (&lt;a href="https://github.com/OpenMind/OM1" rel="noopener noreferrer"&gt;OM1 repo&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis + Lua&lt;/strong&gt;: Simple pub/sub and atomic operations for single-host deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick start template&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create prioritized CUDA streams for each model based on real-time requirements (&lt;a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities" rel="noopener noreferrer"&gt;CUDA stream priorities&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Use Python &lt;code&gt;asyncio&lt;/code&gt; (&lt;a href="https://docs.python.org/3/library/asyncio.html" rel="noopener noreferrer"&gt;docs&lt;/a&gt;) or ROS 2 callbacks for concurrent execution and queue-based communication.&lt;/li&gt;
&lt;li&gt;Start with critical models at high priority (e.g., 24 Hz), background models at normal priority (e.g., 5 Hz).&lt;/li&gt;
&lt;li&gt;Add Prometheus/Grafana or equivalent monitoring for latency, queue depths, and thermal throttling.&lt;/li&gt;
&lt;/ul&gt;
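&lt;p&gt;A hedged sketch of the stream-priority step using PyTorch's CUDA stream API (assumes PyTorch is available; it degrades to direct CPU execution otherwise so the surrounding control code stays portable):&lt;/p&gt;

```python
# Assumes PyTorch; everything falls back to plain CPU execution when CUDA
# (or torch itself) is unavailable.
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    torch, HAS_CUDA = None, False

def make_priority_streams():
    """Return (high, normal) CUDA streams, or (None, None) off-GPU.

    CUDA stream priorities: lower numbers run with higher priority;
    -1 is the highest priority PyTorch exposes by default.
    """
    if not HAS_CUDA:
        return None, None
    high = torch.cuda.Stream(priority=-1)   # critical model (e.g. the 24 Hz VLA)
    normal = torch.cuda.Stream(priority=0)  # background models (YOLO variants)
    return high, normal

def run_in_stream(stream, fn):
    """Launch fn's GPU work in the given stream; run directly when off-GPU."""
    if stream is None:
        return fn()
    with torch.cuda.stream(stream):
        return fn()
```

&lt;p&gt;Note that stream priority only reorders kernel scheduling on the GPU; it does not partition SMs or bandwidth, which is why production stacks combine it with queue-based backpressure and monitoring.&lt;/p&gt;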
&lt;h3&gt;
  
  
  2.5 Real-World Example: OM1 (OpenMind)
&lt;/h3&gt;

&lt;p&gt;OM1 (“OpenMind Modular AI Runtime for Robots”) demonstrates mode-based multi-model execution in a &lt;strong&gt;single Dockerized runtime&lt;/strong&gt;, orchestrating LLMs, VLMs, and robotics stacks together (&lt;a href="https://github.com/OpenMind/OM1" rel="noopener noreferrer"&gt;OM1 repo&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single Docker Container (OM1 Runtime)
  ├─ Multiple operational modes (welcome, slam, navigation, etc.)
  ├─ Concurrent LLM execution (Fast Action + Core + Mentor LLMs)
  ├─ Zenoh pub/sub for inter-component communication
  ├─ Background processes (SLAM, navigation, face recognition)
  └─ Input orchestrators (VLM, ASR, sensors)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;No GPU partitioning.&lt;/strong&gt; Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple LLMs run concurrently with different roles and priorities (e.g., fast-reactive vs. deliberative).&lt;/li&gt;
&lt;li&gt;Vision models (VLM variants) provide continuous perception.&lt;/li&gt;
&lt;li&gt;SLAM and navigation models run in background with graceful degradation.&lt;/li&gt;
&lt;li&gt;All components communicate via &lt;strong&gt;Zenoh pub/sub&lt;/strong&gt; messaging and ROS 2 where appropriate.&lt;/li&gt;
&lt;li&gt;Dynamic mode transitions reallocate resources based on context and intent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: OM1 shows &lt;strong&gt;production-grade multi-model AI orchestration&lt;/strong&gt; (LLMs + VLMs + SLAM + navigation) using event-driven, priority-based scheduling rather than hard GPU partitioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 Prioritization: Pros, Cons, When to Use
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-proven patterns (LeRobot async inference, OM1 runtime, Reachy 2 ROS 2 workspace).&lt;/li&gt;
&lt;li&gt;Graceful degradation (non-critical models adapt to resource constraints).&lt;/li&gt;
&lt;li&gt;Easy to debug (message introspection, queue monitoring, logging).&lt;/li&gt;
&lt;li&gt;Scales horizontally (add models without rearchitecting core systems).&lt;/li&gt;
&lt;li&gt;Platform-agnostic (works with NVIDIA, ROCm, CPU-only).&lt;/li&gt;
&lt;li&gt;Adaptive resource allocation (responds to thermal/power limits).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared GPU bandwidth contention (models can still interfere).&lt;/li&gt;
&lt;li&gt;Message serialization overhead (~1–2 ms per inter-model communication).&lt;/li&gt;
&lt;li&gt;Requires understanding async patterns and queue management.&lt;/li&gt;
&lt;li&gt;Not suitable for strict multi-tenant isolation guarantees.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Need guaranteed low-latency for critical model": Docker + ROS 2 + prioritized CUDA streams.&lt;/li&gt;
&lt;li&gt;"Running multiple YOLO26 variants (detect/segment/pose)": Event-driven architecture with async queues.&lt;/li&gt;
&lt;li&gt;"Building production robotics system": Docker Compose + Zenoh + mode-based execution.&lt;/li&gt;
&lt;li&gt;"Rapid prototyping on single device": Python &lt;code&gt;asyncio&lt;/code&gt; + CUDA streams.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Approach 3: Offloading – Pushing Work Off the Edge Device
&lt;/h2&gt;

&lt;p&gt;Offloading moves some or all model computation off the edge device to &lt;strong&gt;separate GPU servers or cloud infrastructure&lt;/strong&gt;. This eliminates local contention at the cost of network latency and extra infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Remote Inference Offloading (LeRobot-Style Pattern)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Run &lt;strong&gt;policy inference or heavy model inference&lt;/strong&gt; on a separate GPU server, while the robot (edge device) handles sensors and low-level control. Communication happens over &lt;strong&gt;gRPC streaming&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the pattern used in &lt;strong&gt;LeRobot’s async inference stack&lt;/strong&gt;, where a &lt;code&gt;PolicyServer&lt;/code&gt; runs on a workstation GPU and a &lt;code&gt;RobotClient&lt;/code&gt; runs on the robot, exchanging observations and actions via gRPC (&lt;a href="https://github.com/huggingface/lerobot" rel="noopener noreferrer"&gt;LeRobot repo&lt;/a&gt;, see &lt;code&gt;lerobot/async_inference/policy_server.py&lt;/code&gt; and &lt;code&gt;robot_client.py&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to implement&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy policies and heavy models on a dedicated inference server with a larger GPU.&lt;/li&gt;
&lt;li&gt;Use gRPC streaming for low-latency communication between the robot and the inference server (&lt;a href="https://grpc.io/docs/languages/python/" rel="noopener noreferrer"&gt;gRPC Python docs&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
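&lt;p&gt;The request/response loop can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt; streams standing in for gRPC (the message fields, placeholder policy, and JSON wire format are illustrative only; LeRobot's real stack uses protobuf-defined messages over gRPC):&lt;/p&gt;

```python
import asyncio
import json

async def policy_server(reader, writer):
    """Server side: read an observation, run 'inference', stream back an action."""
    while True:
        line = await reader.readline()
        if not line:
            break
        obs = json.loads(line)
        action = {"joint_deltas": [0.0] * obs["dof"]}  # placeholder policy
        writer.write((json.dumps(action) + "\n").encode())
        await writer.drain()
    writer.close()

async def robot_client(host, port, n_steps):
    """Client side: send observations at the control rate, collect actions."""
    reader, writer = await asyncio.open_connection(host, port)
    actions = []
    for step in range(n_steps):
        writer.write((json.dumps({"step": step, "dof": 6}) + "\n").encode())
        await writer.drain()
        actions.append(json.loads(await reader.readline()))
    writer.close()
    return actions

async def demo(n_steps=3):
    server = await asyncio.start_server(policy_server, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    result = await robot_client("127.0.0.1", port, n_steps)
    server.close()
    return result
```

&lt;p&gt;In a real deployment the client would also timestamp observations and discard stale actions, since network jitter enters the control loop directly.&lt;/p&gt;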

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero GPU contention on the edge&lt;/strong&gt;: Edge resources are freed for additional models or real-time control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable inference&lt;/strong&gt;: Upgrade server GPUs independently of edge hardware constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable latency&lt;/strong&gt;: Often more predictable network latency vs. highly variable local multi-model GPU sharing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete isolation&lt;/strong&gt;: Models run on separate hardware, eliminating interference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network dependency&lt;/strong&gt;: Requires reliable low-latency network connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth overhead&lt;/strong&gt;: Camera frames must be compressed and transmitted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional infrastructure&lt;/strong&gt;: Need dedicated inference servers and monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher complexity&lt;/strong&gt;: Distributed system management, failure handling, and observability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world use&lt;/strong&gt;: LeRobot uses this client–server architecture for &lt;strong&gt;RL policy inference&lt;/strong&gt; and async action streaming. The same pattern generalizes to VLA + YOLO26 pipelines, but for those, you must account for much higher bandwidth (video frames) and tighter latency budgets.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Orchestrated Offloading with Triton and Microservices
&lt;/h3&gt;

&lt;p&gt;NVIDIA Triton Inference Server provides &lt;strong&gt;process-level isolation and scheduling&lt;/strong&gt; for multi-model deployments, often on a central server:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;: Multi-model serving platform with built-in queuing, batching, and per-model scheduling policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to implement&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure separate model repositories with dedicated GPU instances and per-model batching policies with different latency deadlines.&lt;/li&gt;
&lt;li&gt;Expose models over gRPC/HTTP to edge clients.&lt;/li&gt;
&lt;/ul&gt;
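&lt;p&gt;A hedged example of what a per-model &lt;code&gt;config.pbtxt&lt;/code&gt; might look like (the model name, batch size, and queue delay are illustrative, not tuned values):&lt;/p&gt;

```protobuf
# Hypothetical Triton model configuration for a detection model.
name: "yolo_detect"
platform: "tensorrt_plan"
max_batch_size: 8
instance_group [ { count: 1, kind: KIND_GPU } ]
dynamic_batching {
  max_queue_delay_microseconds: 2000  # trade up to 2 ms of queueing for batching
}
```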

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-grade scheduling and queuing.&lt;/li&gt;
&lt;li&gt;Per-model deadlines and batching policies.&lt;/li&gt;
&lt;li&gt;High resource efficiency on server GPUs.&lt;/li&gt;
&lt;li&gt;Can mix NVIDIA-stack and non-NVIDIA models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning curve (gRPC, model configs).&lt;/li&gt;
&lt;li&gt;Overhead from HTTP/gRPC serialization (5–10 ms per request).&lt;/li&gt;
&lt;li&gt;Still subject to GPU bandwidth contention on the server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Distributed edge deployment with network": Remote offloading + gRPC streaming.&lt;/li&gt;
&lt;li&gt;"Enterprise ML pipeline with model versioning": Triton Inference Server + model ensembles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 When Offloading Makes Sense
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You cannot meet latency or throughput targets within the edge device's power/thermal envelope.&lt;/li&gt;
&lt;li&gt;You need to run many heavy models simultaneously, but only a subset of them require strict real-time guarantees on the robot.&lt;/li&gt;
&lt;li&gt;Your deployment environment has reliable wired or high-quality wireless connectivity.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Putting It Together: Comparing the Three Approaches
&lt;/h2&gt;

&lt;p&gt;In practice, &lt;strong&gt;production systems mix these&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;cgroups and containers&lt;/strong&gt; for basic isolation.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;prioritized CUDA streams and event buses&lt;/strong&gt; for real-time behavior.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;offloading&lt;/strong&gt; for heavyweight or non-real-time models that don't fit on the edge box.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Given today’s hardware, &lt;strong&gt;a single Jetson Orin Nano Super is not yet a comfortable platform for running a large VLA plus multiple heavy YOLO26 variants and other models concurrently&lt;/strong&gt; at strict real-time rates. You can prototype pieces of this stack, but for production you will almost certainly need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More capable edge hardware&lt;/strong&gt; (Orin NX/AGX, Thor, or similar), or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Significant offloading&lt;/strong&gt; to nearby GPU servers, and/or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggressively optimized models&lt;/strong&gt; (distillation, pruning, quantization, ONNX/TensorRT deployment).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, the architectural lessons are already clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: For multi-model AI on edge devices, avoid static hardware partitioning as your primary tool. Favor &lt;strong&gt;event-driven architectures&lt;/strong&gt; with prioritized CUDA streams and async messaging, and treat partitioning and offloading as &lt;strong&gt;supporting levers&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Have questions or suggestions? Drop them in the comments below.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OM1 Architecture&lt;/strong&gt; (event-driven multimodal runtime): &lt;code&gt;https://github.com/OpenMind/OM1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LeRobot&lt;/strong&gt; (RL + async inference + gRPC): &lt;code&gt;https://github.com/huggingface/lerobot&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reachy 2 Core (ROS 2 workspace)&lt;/strong&gt;: &lt;code&gt;https://github.com/pollen-robotics/reachy2_core&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reachy 2 Python SDK&lt;/strong&gt;: &lt;code&gt;https://github.com/pollen-robotics/reachy2-sdk&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Jetson Deployment with Triton&lt;/strong&gt;: &lt;code&gt;https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/jetson.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA Green Contexts&lt;/strong&gt;: &lt;code&gt;https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CUDA MPS Guide&lt;/strong&gt;: &lt;code&gt;https://docs.nvidia.com/deploy/mps/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPS Interference Discussion&lt;/strong&gt;: &lt;code&gt;https://forums.developer.nvidia.com/t/mps-interference-problem/312930&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPS Latency Outlier Discussion&lt;/strong&gt;: &lt;code&gt;https://forums.developer.nvidia.com/t/mps-vs-no-mps-drastic-increase-in-kernel-latency/336175&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROS 2 Real-Time QoS&lt;/strong&gt;: &lt;code&gt;https://design.ros2.org/articles/qos_deadline_liveliness_lifespan.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python asyncio&lt;/strong&gt;: &lt;code&gt;https://docs.python.org/3/library/asyncio.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker Compose for robotics&lt;/strong&gt;: &lt;code&gt;https://fenilsonani.com/articles/docker-compose-multi-container-orchestration&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YOLO 26 Family&lt;/strong&gt;: &lt;code&gt;https://www.ultralytics.com/news/ultralytics-redefines-state-of-the-art-vision-ai-with-yolo26&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>robotics</category>
      <category>vla</category>
      <category>ai</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Insights from Sergey Levine’s appearance on the Dwarkesh Patel podcast</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Mon, 19 Jan 2026 09:10:46 +0000</pubDate>
      <link>https://dev.to/ankk98/insights-from-sergey-levines-appearance-on-the-dwarkesh-patel-podcast-36bi</link>
      <guid>https://dev.to/ankk98/insights-from-sergey-levines-appearance-on-the-dwarkesh-patel-podcast-36bi</guid>
      <description>&lt;p&gt;Just finished an incredible deep dive into the future of robotics with Sergey Levine of Physical Intelligence. The "Robotics Flywheel" is much closer than people realize.&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://youtu.be/48pxVdmkMIE?si=UamP4IMBoI0jOyMB" rel="noopener noreferrer"&gt;https://youtu.be/48pxVdmkMIE?si=UamP4IMBoI0jOyMB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are my top takeaways on the path to general-purpose robots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 5-Year Horizon:&lt;/strong&gt; The median estimate for robots performing complex, autonomous home tasks and blue-collar work is just &lt;strong&gt;five years&lt;/strong&gt;. It’s a "single-digit" year problem, not a multi-decade one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Representation Problem:&lt;/strong&gt; Video is harder than text because text is already abstracted into meaning, while video is just "compressed pixels". To scale, robots need to ignore "noise" (like moving clouds) and focus only on goal-relevant changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardware vs. Software:&lt;/strong&gt; Smarter AI actually makes hardware &lt;strong&gt;cheaper&lt;/strong&gt;. High-quality visual feedback allows robots to use "cheap," less precise parts because the AI can sense and correct mechanical errors in real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Inference Trilemma:&lt;/strong&gt; There is a constant trade-off between &lt;strong&gt;Inference Speed (Hz)&lt;/strong&gt;, &lt;strong&gt;Model Size (Parameters)&lt;/strong&gt;, and &lt;strong&gt;Context Length (Memory)&lt;/strong&gt;. The goal is to move toward the human brain's "extreme parallelism," where perception and planning run at different rates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Imitation Before RL:&lt;/strong&gt; You can’t start with Reinforcement Learning (RL) from scratch; it takes too long. You must use supervised learning (imitation) first to provide the "prior knowledge" and common sense the robot needs to eventually learn on the job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emergent Compositionality:&lt;/strong&gt; Robots are starting to show "emergent" skills. Levine noted a robot that learned to clear an obstacle before folding laundry without being specifically trained for that sequence; that’s "compositional generalization".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Moravec’s Paradox:&lt;/strong&gt; This is the core of robotics: the things humans find easy (folding a T-shirt) are the hardest for AI, while the things we find hard (calculus) are easy. Physical proficiency is a massive computational challenge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Externalized Brain:&lt;/strong&gt; For robots to be affordable, we might see "off-board inference". A robot might be in a "dumber" reactive mode if offline but become significantly smarter when connected to a high-speed data center.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous Embodiments:&lt;/strong&gt; The goal isn't just to build "mechanical people"; it's to build heterogeneous systems that can be 100 feet tall or tiny, all powered by the same foundational intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 24Hz Benchmark:&lt;/strong&gt; The human mind processes visual information and reacts at roughly &lt;strong&gt;24 frames per second (24 Hz)&lt;/strong&gt;. To achieve human-level proficiency, robots must match this high-frequency inference while simultaneously managing the "trilemma" of increasing model size and memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The 1-Second Context Paradox:&lt;/strong&gt; Current state-of-the-art VLA models often operate with only a &lt;strong&gt;one-second context window&lt;/strong&gt;. It is "shocking" that they can execute minute-long tasks by only observing the immediate past, but true autonomy will require scaling this to the minutes, hours, or even &lt;strong&gt;"decades of context"&lt;/strong&gt; that humans use to inform their plans.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Emergent Meta-Learning:&lt;/strong&gt; Meta-learning, the ability for a model to "learn how to learn" is an &lt;strong&gt;emergent property&lt;/strong&gt; seen in large foundation models. A sufficiently smart model can evaluate its own performance and figure out how to leverage auxiliary data, like &lt;strong&gt;simulations or synthetic experience&lt;/strong&gt;, to improve its success on real-world objectives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mastering Counterfactuals:&lt;/strong&gt; The "key" to optimal decision-making is the ability to answer &lt;strong&gt;counterfactuals&lt;/strong&gt;: "If I did this instead of that, would it be better?". Whether a robot uses a learned simulator, a reward model, or a value function, the core of intelligence is having a mechanism to &lt;strong&gt;evaluate these alternative futures&lt;/strong&gt; and pick the best one.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>robotics</category>
      <category>ai</category>
      <category>humanoids</category>
    </item>
    <item>
      <title>Is Humanoids' Data Appetite Really Endless?</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Sat, 17 Jan 2026 08:17:53 +0000</pubDate>
      <link>https://dev.to/ankk98/is-humanoids-data-appetite-really-endless-39lj</link>
      <guid>https://dev.to/ankk98/is-humanoids-data-appetite-really-endless-39lj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Humanoid robots like Tesla's Optimus and Figure AI's machines are generating massive hype, but the critical question isn't just whether they need data; it's how much, and what kind.&lt;/p&gt;

&lt;p&gt;The narrative suggests humanoids require endless datasets, creating a boom market for data startups. But 2024–2025 research suggests a different trajectory: humanoids will need substantial data initially, then demand will plateau and shift toward specialized services like curation and safety validation rather than raw collection. The business model around data changes from collection to intelligent processing.&lt;/p&gt;

&lt;p&gt;This analysis examines four core doubts about the "endless data appetite" narrative, then weighs counterarguments that suggest certain demands persist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Arguments for Plateauing Demands
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Doubt 1: Do Scaling Laws Show Diminishing Returns?
&lt;/h3&gt;

&lt;p&gt;Microsoft's &lt;a href="https://arxiv.org/abs/2411.04434" rel="noopener noreferrer"&gt;"Scaling Laws for Pre-training Agents and World Models"&lt;/a&gt; (2024) reveals that embodied AI systems follow power-law relationships, not linear growth. Optimal data scales with compute as D ∝ C^0.68, meaning data requirements grow much slower than computational capacity. Crucially, losses plateau at large datasets (1.63 billion pairs) without significant overfitting.&lt;/p&gt;
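&lt;p&gt;To make the exponent concrete, here is the back-of-envelope arithmetic (the 0.68 exponent comes from the paper above; the multipliers follow directly from the power law):&lt;/p&gt;

```python
# With compute-optimal data D proportional to C**0.68, scaling compute by a
# factor k scales the data requirement by only k**0.68.
def data_growth(compute_factor, exponent=0.68):
    """Multiplier on compute-optimal dataset size for a given compute multiplier."""
    return compute_factor ** exponent

print(round(data_growth(2), 2))   # doubling compute needs only ~1.6x the data
print(round(data_growth(10), 2))  # 10x compute needs under 5x the data
```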

&lt;p&gt;For humanoids, this means early data (first 100 trajectories) drives massive capability gains. The 10,000th trajectory? Marginal improvements. By 100,000 trajectories, you're fighting diminishing returns.&lt;/p&gt;

&lt;p&gt;NVIDIA's &lt;a href="https://arxiv.org/abs/2505.12705" rel="noopener noreferrer"&gt;"DreamGen"&lt;/a&gt; (2025) demonstrates this principle in practice. A generative world model trained on one teleop task generated 22 novel behaviors without collecting additional real-world data. Recent work on &lt;a href="https://openreview.net/forum?id=TjCDNssXKU" rel="noopener noreferrer"&gt;"Learning Hierarchical World Models with Adaptive Temporal Abstractions"&lt;/a&gt; (Gumbsch et al., ICLR 2024) shows hierarchical approaches like THICK achieve efficiency improvements through multi-timescale reasoning with far less data than flat world models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Foundational training peaks in 2026–2028. Afterward, demand likely drops 50–70% as efficiency gains mature.&lt;/p&gt;




&lt;h3&gt;
  
  
  Doubt 2: Can Few-Shot Learning Replace Massive Datasets?
&lt;/h3&gt;

&lt;p&gt;UC Berkeley's &lt;a href="https://www.science.org/doi/10.1126/scirobotics.adi9579" rel="noopener noreferrer"&gt;"Real-World Humanoid Locomotion with Reinforcement Learning"&lt;/a&gt; (2024) shows Agility Robotics' Digit humanoid adapting to diverse terrains in fewer than 100 real-world trials, with 90% zero-shot success on new environments.&lt;/p&gt;

&lt;p&gt;Honda Research Institute's &lt;a href="https://arxiv.org/abs/2506.13762" rel="noopener noreferrer"&gt;"VisuoTactile Pretraining"&lt;/a&gt; (2025) demonstrates that contact-rich manipulation (USB insertion, card swiping, key insertion) achieves 90%+ success with only 32 demonstrations plus 45 minutes of reinforcement learning. Combining visual and tactile feedback replaces the need for massive labeled datasets.&lt;/p&gt;

&lt;p&gt;The theoretical foundation appears in &lt;a href="https://arxiv.org/abs/2403.03950" rel="noopener noreferrer"&gt;"Stop Regressing: Training Value Functions via Classification"&lt;/a&gt; (2024). Classification-based value functions (Q-transformers) outperform regression in manipulation, achieving state-of-the-art results with dramatically fewer trajectories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Deep RL is more sample-efficient than supervised learning for robotics. By 2032, few-shot learning likely cuts requirements 80-90% compared to supervised approaches.&lt;/p&gt;




&lt;h3&gt;
  
  
  Doubt 3: Will Synthetic Data Make Real Data Obsolete?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cm.asiae.co.kr/en/article/2025122608365327507" rel="noopener noreferrer"&gt;"Video2Robot"&lt;/a&gt; (Aim Intelligence, 2025) converts human videos into physics-grounded humanoid trajectories, scaling behaviors like climbing without real robot captures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2512.04537" rel="noopener noreferrer"&gt;"X-Humanoid"&lt;/a&gt; (2025) converts Ego-Exo4D videos (60 hours = 3.6 million frames) into Optimus-like action sequences for cooking and biking, training both policies and world models.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2510.08807" rel="noopener noreferrer"&gt;"Humanoid Everyday"&lt;/a&gt; dataset (260 real-world robotic tasks) is currently the largest multimodal humanoid dataset, yet its authors acknowledge that synthetic data enables generalization beyond real data's domain.&lt;/p&gt;

&lt;p&gt;Citi's &lt;a href="https://www.citigroup.com/global/insights/the-rise-of-ai-robots" rel="noopener noreferrer"&gt;"The Rise of AI Robots"&lt;/a&gt; (2024) forecasts 1.3 billion robots by 2035, primarily trained via simulation. This scales via GPU rendering, not manual collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: Synthetic data dominates by 2028-2030. Real data demand drops 80-90%. Real data becomes specialized (edge cases, safety validation, domain-specific fine-tuning).&lt;/p&gt;




&lt;h3&gt;
  
  
  Doubt 4: Does Internal Fleet Learning Hide External Demand?
&lt;/h3&gt;

&lt;p&gt;Tesla, Figure, and Boston Dynamics don't buy data from startups. They collect internally. A former Tesla Autopilot engineer noted: "Data generation isn't the bottleneck. They collect terabytes per hour. The hard part is finding the right clips for training. That's curation."&lt;/p&gt;

&lt;p&gt;This shifts the market entirely. Collection becomes free; curation becomes valuable. A startup identifying the 1% of fleet data most valuable for improvement is worth billions. A startup selling raw teleoperation data? Increasingly irrelevant.&lt;/p&gt;

&lt;p&gt;Figure AI's &lt;a href="https://www.prnewswire.com/news-releases/figure-raises-675m-at-2-6b-valuation-and-signs-collaboration-agreement-with-openai-302074897.html" rel="noopener noreferrer"&gt;"$675M Series B funding"&lt;/a&gt; (February 2024) went to in-house development, not external data purchases. &lt;a href="https://arxiv.org/abs/2505.12705" rel="noopener noreferrer"&gt;"DreamGen"&lt;/a&gt; explicitly demonstrates autonomous data generation via learned world models.&lt;/p&gt;

&lt;p&gt;NVIDIA researcher Jim Fan noted in an &lt;a href="https://officechai.com/ai/unlike-with-llms-itll-take-2-5-years-to-figure-out-robotics-scaling-law-nvidias-jim-fan/" rel="noopener noreferrer"&gt;"April 9, 2025 Office Chai interview"&lt;/a&gt;: "Unlike LLMs, robotics doesn't yet have clear scaling laws. Compute and data are both bottlenecks, but physical data collection remains expensive."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: External data demand stays low from 2026 onward and approaches zero by 2036 as fleets mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Counterarguments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Sim-to-Real Gap Persists
&lt;/h3&gt;

&lt;p&gt;Simulation handles gravity, friction, and inertia. It doesn't capture material properties, sensor noise, wear, or degradation over time. A robot trained in perfect simulation may fail after 100 real-world episodes due to unmodeled dynamics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2402.19469" rel="noopener noreferrer"&gt;"Humanoid Locomotion as Next Token Prediction"&lt;/a&gt; (2024) shows sim-trained policies require substantial real-world adaptation even with domain randomization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-Tuning Requires Real Data
&lt;/h3&gt;

&lt;p&gt;Google's &lt;a href="https://arxiv.org/abs/2403.02914" rel="noopener noreferrer"&gt;"MT-Opt"&lt;/a&gt; (2024) demonstrates that sim-trained policies need significant real robot data for fine-tuning across diverse tasks. As humanoids move to messy real-world settings, environment-specific adaptation demands increase, not decrease.&lt;/p&gt;

&lt;h3&gt;
  
  
  Robot Vision Gaps
&lt;/h3&gt;

&lt;p&gt;Embodied AI benchmarks reveal persistent gaps, particularly in temporal reasoning. Robots often treat frames independently while humans process continuous streams with temporal context. Understanding that someone "is about to" reach for an object requires temporal reasoning that current vision systems lack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety Validation Is Extensive
&lt;/h3&gt;

&lt;p&gt;ISO 13482 mandates comprehensive testing across failure modes. Real-world edge cases emerge unpredictably. Boston Dynamics' Atlas experienced numerous falls during development, each requiring data collection and analysis. Safety-critical applications demand orders of magnitude more validation data than general robotics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human Interaction Is Complex
&lt;/h3&gt;

&lt;p&gt;Humanoids working alongside people must interpret subtle social cues: body language, eye contact, contextual intent, theory of mind. Recent work on &lt;a href="https://arxiv.org/abs/2407.21626" rel="noopener noreferrer"&gt;"human-AI interaction"&lt;/a&gt; (2024) shows this capability remains elusive, requiring extensive multimodal training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Complexity Dominates
&lt;/h3&gt;

&lt;p&gt;History shows robotics underestimates real complexity. Tesla's Autopilot discovered thousands of edge cases post-deployment that simulation missed. Long-tail distributions mean rare but critical scenarios dominate failure cases. As humanoids enter homes, factories, and public spaces, new failure modes will emerge requiring continuous data collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Horizon Planning Remains Difficult
&lt;/h3&gt;

&lt;p&gt;Human tasks span minutes to hours with complex interdependencies. Reinforcement learning struggles with long-horizon credit assignment. Recent &lt;a href="https://arxiv.org/abs/2409.13373" rel="noopener noreferrer"&gt;"transformer-based planning work"&lt;/a&gt; (2024) shows hierarchical reasoning requires extensive trajectory data for reliable long-term decision-making.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intuitive Physics Capabilities Gap
&lt;/h3&gt;

&lt;p&gt;AI systems still lack robust understanding of object properties, stability, and physical interactions. Each novel environment or material type may require specific training data for reliable interaction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: Synthesis
&lt;/h2&gt;

&lt;p&gt;The doubts suggest demand peaks 2026-2028 then declines sharply. The counterarguments suggest certain demands persist. The reality is bifurcated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Types That Peak and Plateau
&lt;/h3&gt;

&lt;p&gt;Foundational locomotion datasets (walking, balance, navigation) peak 2026-2028, then plateau as core policies mature. Generic manipulation demos (grasping, lifting, placing) peak 2026-2029, then plateau. Teleoperation services for bootstrapping peak 2026-2028, then drop 80-90%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Types That Persist or Grow
&lt;/h3&gt;

&lt;p&gt;Safety validation data collection runs continuously: each new environment, interaction, or edge case requires data. &lt;strong&gt;Domain-specific fine-tuning data persists&lt;/strong&gt;. Healthcare robots need healthcare data; surgical robots need surgical data. Temporal and social interaction data grows as robots interact more with humans. Edge-case and failure data accumulates continuously. &lt;strong&gt;Fine-tuning data for hardware variations will also still be needed.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Market Trajectory
&lt;/h3&gt;

&lt;p&gt;Raw data collection captures 20-30% of robotics value chain 2026-2030. By 2031-2036, collection captures 2-5% while curation, processing, and domain adaptation capture 15-25%.&lt;/p&gt;

&lt;p&gt;Market size forecasts diverge significantly: &lt;a href="https://www.grandviewresearch.com/industry-analysis/humanoid-robot-market-report" rel="noopener noreferrer"&gt;"Grand View Research projects $4.04B by 2030"&lt;/a&gt; (17.5% CAGR from $1.55B in 2024) while &lt;a href="https://www.bccresearch.com/market-research/instrumentation-and-sensors/humanoid-robot-market.html" rel="noopener noreferrer"&gt;"BCC Research projects $11B by 2030"&lt;/a&gt; (42.8% CAGR from $1.9B in 2025). Grand View is likely conservative; BCC likely includes speculative demand scenarios. MarketsandMarkets forecasts $13.25B by 2029 at 45.5% CAGR.&lt;/p&gt;
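&lt;p&gt;A quick compound-growth check confirms the forecasts are internally consistent with their stated CAGRs:&lt;/p&gt;

```python
# Compound-growth check of the cited forecasts. Base values and CAGRs are
# taken from the reports quoted above; the arithmetic is the only thing added.

def project(base_billions: float, cagr: float, years: int) -> float:
    """Project a market size forward at a constant compound annual growth rate."""
    return base_billions * (1 + cagr) ** years

# Grand View: $1.55B in 2024 at 17.5% CAGR over 6 years
print(f"Grand View 2030: ${project(1.55, 0.175, 6):.2f}B")  # prints $4.08B
# BCC: $1.9B in 2025 at 42.8% CAGR over 5 years
print(f"BCC 2030: ${project(1.9, 0.428, 5):.2f}B")          # prints $11.28B
```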

&lt;h3&gt;
  
  
  The Critical Distinction
&lt;/h3&gt;

&lt;p&gt;Raw data becomes commoditized by 2028. The bottleneck shifts from collection to curation. Identifying valuable signal within terabytes of fleet data matters far more than raw collection volume.&lt;/p&gt;




&lt;h2&gt;
  
  
  Critical Context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Domain Variations Matter
&lt;/h3&gt;

&lt;p&gt;Consumer humanoids see highest efficiency gains; data demand drops 80-90% by 2035. Healthcare and surgical robots require conservative deployment with high safety validation; data demand remains substantial. Industrial robots in hazardous environments use extensive simulation with moderate efficiency gains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware-Software Coupling
&lt;/h3&gt;

&lt;p&gt;Better sensors (force feedback, advanced cameras) reduce data requirements. Lower-cost sensors increase requirements. Conclusions assume current hardware. Significant hardware shifts change data strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regional Differences
&lt;/h3&gt;

&lt;p&gt;Data privacy laws (GDPR in EU), labor costs, and safety standards vary by region, affecting data collection ROI and humanoid adoption willingness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Humanoids will need substantial data, but the trajectory is "peak and persist," not endless escalation. Foundational training peaks 2026-2028, driven by scaling law efficiency and synthetic data gains. Raw data demand then drops 50-90%.&lt;/p&gt;

&lt;p&gt;However, specialized data needs persist: sim-to-real fine-tuning, safety validation, social interaction learning, and edge case handling. The market story isn't about data volume declining; it's about value migrating from collection to curation.&lt;/p&gt;

&lt;p&gt;Pure data collection becomes trivial by 2028. The competitive advantage lies with companies solving intelligent curation, safety validation, and domain-specific adaptation. Integrated hardware-AI companies (Tesla, Boston Dynamics, Figure) internalize these capabilities, creating structural moats.&lt;/p&gt;

&lt;p&gt;Data infrastructure startups face headwinds unless they pivot from collection to specialization. The humanoid market grows to $4-13B by 2030, but raw data's share of that value shrinks from 20-30% to 2-5% as the field matures.&lt;/p&gt;

&lt;p&gt;This represents a fundamental shift: data becomes abundant; intelligence (curation, adaptation, validation) becomes scarce.&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>data</category>
      <category>humanoid</category>
      <category>ai</category>
    </item>
    <item>
      <title>The 5 Levels of Humanoid Autonomy</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Fri, 16 Jan 2026 18:49:59 +0000</pubDate>
      <link>https://dev.to/ankk98/the-5-levels-of-humanoid-autonomy-1n54</link>
      <guid>https://dev.to/ankk98/the-5-levels-of-humanoid-autonomy-1n54</guid>
      <description>&lt;p&gt;If you scroll through X (Twitter) today, you’d think General Purpose Humanoids (GPH) are months away from folding our laundry and cooking 5-course meals. The reality is more nuanced and, for developers and founders, much more interesting.&lt;/p&gt;

&lt;p&gt;I’ve been digging into the "Self-Driving Levels" equivalent for robotics. We need a mental model to separate the hype (Level 5 sci-fi) from the commercial opportunities available &lt;em&gt;right now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Based on frameworks from &lt;strong&gt;SemiAnalysis&lt;/strong&gt; and insights from roboticist &lt;strong&gt;Rodney Brooks&lt;/strong&gt;, here is the definitive ladder of Humanoid Autonomy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Framework: Agency vs. Dexterity
&lt;/h2&gt;

&lt;p&gt;Unlike self-driving cars, which just need to &lt;em&gt;move&lt;/em&gt; safely, humanoids must &lt;em&gt;move&lt;/em&gt; (Agency) and &lt;em&gt;manipulate&lt;/em&gt; (Dexterity).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agency:&lt;/strong&gt; Perception, planning, and navigation in unstructured environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dexterity:&lt;/strong&gt; Grasping, force control, and fine manipulation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current commercial viability lies in balancing these two.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 0: Scripted Motion (The Industrial Past)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Mature (1980s–Present)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are the blind giants. They execute pre-programmed trajectories with sub-millimeter precision but have zero understanding of their environment. If you move the part by 1cm, the robot fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automotive Welding:&lt;/strong&gt; The backbone of Tesla/Toyota factories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Painting:&lt;/strong&gt; Uniform spraying of car bodies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy Palletizing:&lt;/strong&gt; Moving heavy boxes in completely caged, fixed zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCB Assembly:&lt;/strong&gt; Pick-and-place machines (high speed, zero intelligence).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNC Tending:&lt;/strong&gt; Loading raw metal into machines (requires precise fixturing).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Mature.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; FANUC M-2000, KUKA KR QUANTEC.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 1: Intelligent Pick &amp;amp; Place (The Visual Awakening)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Commercial Scale (2023–Present)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Robots gained eyes. Using computer vision and deep learning, these systems can identify objects in a cluttered bin and pick them up. They don't "understand" the object's function, but they know where it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parcel Sorting:&lt;/strong&gt; Identifying and grabbing random Amazon packages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agricultural Sorting:&lt;/strong&gt; Picking good apples vs. bad apples on a conveyor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debris Recycling:&lt;/strong&gt; Sorting plastic from glass in waste plants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kit Assembly:&lt;/strong&gt; Grabbing 3 different items to put in a subscription box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Control:&lt;/strong&gt; Visually inspecting parts and removing defects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Standard in logistics by 2026.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; RightHand Robotics, Covariant (software), FANUC with iRVision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 2: Autonomous Mobility (The Explorer)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Early Production (2024–2026)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Robots gained &lt;strong&gt;Agency&lt;/strong&gt;. They can map a new environment, navigate around obstacles, and decide &lt;em&gt;how&lt;/em&gt; to get from A to B. This is where Boston Dynamics’ Spot shines. Note: They can move, but they can't do much with their hands yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Industrial Inspection:&lt;/strong&gt; Reading analog gauges in oil refineries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Construction Patrol:&lt;/strong&gt; Scanning progress on building sites (BIM verification).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Autonomous patrolling of data centers or malls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hazard Mapping:&lt;/strong&gt; Entering gas-leak zones to measure toxicity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Last-Mile Delivery:&lt;/strong&gt; Sidewalk robots (Starship) navigating crowds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Commercially viable now for inspection; scaling fast.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; Boston Dynamics Spot, ANYbotics ANYmal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 3: Low-Skill Mobile Manipulation (The Founder's Sweet Spot)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Pilots -&amp;gt; Scale (2026–2029)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the biggest opportunity for startups right now.&lt;/strong&gt;&lt;br&gt;
These robots combine Level 2 mobility with Level 1 vision to perform &lt;em&gt;loose&lt;/em&gt; manipulation tasks. They can pick up a box, move it across a room, and put it down.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Crucial Insight:&lt;/em&gt; They struggle with &lt;strong&gt;force control&lt;/strong&gt;. They can't thread a needle or peel a potato perfectly because they lack tactile feeling. But they &lt;em&gt;can&lt;/em&gt; fry a basket of fries.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Cooking (The "Fry Cook"):&lt;/strong&gt; Dumping baskets of fries, flipping burgers (requires timing, not fine touch).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Restocking:&lt;/strong&gt; Taking a tote from a pallet and sliding it onto a shelf.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Laundry Loading:&lt;/strong&gt; Picking up dirty clothes and shoving them into a washer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hospital Logistics:&lt;/strong&gt; Delivering lab samples or food trays to nurse stations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trash Collection:&lt;/strong&gt; Navigating an office to empty bins into a main cart.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Pilots 2025; Scale 2027-2028.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; Figure 01 (BMW pilot), Tesla Optimus (Factory transport), &lt;strong&gt;Chef Robotics&lt;/strong&gt; (Modular arms).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You don't need legs for this! A wheeled robot with an arm is 80% cheaper and 100% more stable for a kitchen.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Level 4: Force-Dependent Dexterity (The "Rodney Brooks" Wall)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Research Lab (2028+)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the barrier. To be a "General Purpose" humanoid, a robot needs &lt;strong&gt;tactile sensing&lt;/strong&gt; (touch). It needs to feel if a screw is cross-threaded, or if a tomato is too soft to slice.&lt;/p&gt;

&lt;p&gt;Rodney Brooks (co-founder of iRobot) argues this is the "hard part" the industry is underestimating. We have great vision (VLAs), but terrible touch.&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Use Cases:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full-Service Chef:&lt;/strong&gt; Slicing veggies, seasoning to taste, plating delicate herbs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elder Care:&lt;/strong&gt; Helping someone stand up (requires sensing their balance/frailty).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skilled Trades:&lt;/strong&gt; Installing electrical outlets or plumbing fixtures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Textile Work:&lt;/strong&gt; Buttoning a shirt or tying shoelaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex Assembly:&lt;/strong&gt; Inserting flexible rubber gaskets into car doors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; Research prototypes 2029; Commercial 2032+.&lt;br&gt;
&lt;strong&gt;Famous Bots:&lt;/strong&gt; None commercially yet. Lab prototypes from MIT/Stanford.&lt;/p&gt;




&lt;h2&gt;
  
  
  Level 5: Fully General Autonomy
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Status: Sci-Fi (2032?)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A robot that can walk into a strange house, look around, and cook a specific family recipe using tools it has never seen before, without internet access.&lt;/p&gt;




&lt;h2&gt;
  
  
  The "ADAS vs. FSD" Split: Why One Size Won't Fit All
&lt;/h2&gt;

&lt;p&gt;We often talk about humanoids as a monolith—one robot to rule them all. But look at the automotive industry. We didn't jump straight to Level 5 Robotaxis. Instead, we have a split market: 99% of cars have &lt;strong&gt;ADAS&lt;/strong&gt; (Lane Keep, Cruise Control) and &amp;lt;1% attempt &lt;strong&gt;FSD&lt;/strong&gt; (Full Self-Driving).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robotics will follow this exact same bifurcation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We aren't going to see a single "iPhone of Robots." Instead, &lt;strong&gt;Economics, Battery Life, Safety, and Compute&lt;/strong&gt; will force the market into two distinct categories:&lt;/p&gt;

&lt;h3&gt;
  
  
  Category 1: The "ADAS" Class (High Utility, Low Risk)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Build:&lt;/strong&gt; Wheeled bases, specialized grippers, constrained compute (e.g., Jetson Orin Nano).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Battery &amp;amp; Economics:&lt;/strong&gt; Wheels are 10x more energy-efficient than legs. Without the need to run a massive VLA model for every movement, these bots can run for 8-10 hours on a charge and cost &amp;lt;$10k.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption Vector:&lt;/strong&gt; These will dominate &lt;strong&gt;critical safety areas&lt;/strong&gt; first. Think radioactive waste handling, chemical spill cleanup, or repetitive high-heat industrial cooking. The ROI is immediate because the task is defined.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Category 2: The "FSD" Class (High Agency, High Cost)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Build:&lt;/strong&gt; Bipedal, humanoid hands, massive onboard inference compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Battery &amp;amp; Economics:&lt;/strong&gt; Balancing on two legs consumes massive power. Running a "Common Sense" brain drains the rest. These will cost $50k+ and last 2-4 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adoption Vector:&lt;/strong&gt; Research labs, luxury home help (eventually), and unstructured environments where wheels physically cannot go.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What’s Your Bet?
&lt;/h2&gt;

&lt;p&gt;The robotics industry is currently split between two philosophies: the "iPhone moment" where one hardware platform does everything (Level 4/5 Humanoids), and the "App Store" reality where specialized tools solve specific problems today (Level 3 Mobile Manipulators).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I’d love to hear your take:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you think I’m underestimating how fast VLA (Vision-Language-Action) models will solve the "dexterity gap"?&lt;/li&gt;
&lt;li&gt;Are you currently working on a Level 2 or Level 3 project?&lt;/li&gt;
&lt;li&gt;What’s the one "boring" chore you’d pay a Level 3 robot to do right now?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your predictions in the comments below!&lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[1] &lt;a href="https://www.mckinsey.com/industries/industrials/our-insights/humanoid-robots-crossing-the-chasm-from-concept-to-commercial-reality" rel="noopener noreferrer"&gt;McKinsey: Humanoid Robots Crossing the Chasm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[2] &lt;a href="https://eu.36kr.com/en/p/3487244922412161" rel="noopener noreferrer"&gt;36kr: Rodney Brooks Technical Critiques&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[3] &lt;a href="https://rodneybrooks.com/why-todays-humanoids-wont-learn-dexterity/" rel="noopener noreferrer"&gt;Rodney Brooks: Why Today's Humanoids Won't Learn Dexterity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;[4] &lt;a href="https://newsletter.semianalysis.com/p/robotics-levels-of-autonomy" rel="noopener noreferrer"&gt;SemiAnalysis: Robotics Levels of Autonomy&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>humanoids</category>
      <category>ai</category>
      <category>vla</category>
      <category>robotics</category>
    </item>
    <item>
      <title>Humanoid Compute: Price vs. Performance</title>
      <dc:creator>Ankit Khandelwal</dc:creator>
      <pubDate>Thu, 15 Jan 2026 10:45:51 +0000</pubDate>
      <link>https://dev.to/ankk98/humanoid-compute-price-vs-performance-842</link>
      <guid>https://dev.to/ankk98/humanoid-compute-price-vs-performance-842</guid>
      <description>&lt;p&gt;&lt;em&gt;Exploring emerging humanoid hardware options, their compute capabilities, and what models you can actually run in Jan 2026.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Hardware Confusion in Robotics
&lt;/h2&gt;

&lt;p&gt;Over the last few months, I've been deep in the robotics rabbit hole—exploring datasets, VLA models, open-source projects, and trying to make sense of which hardware actually works for humanoids. The landscape is confusing.&lt;/p&gt;

&lt;p&gt;NVIDIA Jetson? AMD Strix Halo? Raspberry Pi? Hailo accelerators? Tesla Optimus uses NVIDIA silicon, but what about Chinese robots? And critically: &lt;strong&gt;what VLA model can my hardware actually run in real time?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This article is my attempt to create clarity. I'm organizing emerging humanoid robots by price tier (in USD), showing the best compute choices, their actual performance with VLA models, realistic use cases, and—honestly—where you'll hit a wall and need to wait for the next generation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The humanoid robotics market is projected to reach &lt;strong&gt;$30-50 billion by 2035&lt;/strong&gt; with 2 million units deployed in workplaces. But today, most humanoids cost $20k-$150k. As costs drop toward $5-10k by 2030, &lt;strong&gt;the compute choice becomes critical&lt;/strong&gt;—it defines whether your robot thinks in real time or needs to defer to the cloud.&lt;/p&gt;

&lt;p&gt;According to recent analysis, &lt;strong&gt;compute represents 15-35% of a humanoid's total BOM&lt;/strong&gt;. Choose wrong, and you either overpay or end up with a silent, slow robot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 1: Under $1,200 — The DIY &amp;amp; Educational Tier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: Raspberry Pi 5 + &lt;a href="https://hailo.ai/products/ai-accelerators/hailo-8l-ai-accelerator-for-ai-light-applications/" rel="noopener noreferrer"&gt;Hailo-8L AI Accelerator&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broadcom BCM2712 (&lt;strong&gt;Quad-core&lt;/strong&gt; Arm Cortex-A76, 2.4 GHz)&lt;/td&gt;
&lt;td&gt;~$70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Accelerator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hailo-8L (13 TOPS)&lt;/td&gt;
&lt;td&gt;~$70-90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8GB LPDDR4X&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5-2.5W peak AI inference&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Compute System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;~$180-200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.raspberrypi.com/products/raspberry-pi-5/" rel="noopener noreferrer"&gt;See Raspberry Pi 5 specs&lt;/a&gt; | &lt;a href="https://docs.hailo.ai/" rel="noopener noreferrer"&gt;Hailo-8L documentation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Capabilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it can run:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YOLO v4/v5 Tiny&lt;/strong&gt;: 35+ FPS real-time object detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MobileNet V3&lt;/strong&gt;: Fast edge classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SmolVLM 500M&lt;/strong&gt;: Lightweight vision-language understanding (~1-2 Hz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM inference&lt;/strong&gt;: Qwen 3B with 4-bit quantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight visual servoing&lt;/strong&gt;: Sub-100ms latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it CANNOT run:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenVLA 7B (too large, too slow)&lt;/li&gt;
&lt;li&gt;Multi-model pipelines in parallel&lt;/li&gt;
&lt;li&gt;Real-time complex manipulation policies&lt;/li&gt;
&lt;li&gt;Continuous cloud-free learning&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Viable in 2026:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Educational robot arms (3D-printed chassis, &amp;lt;$500 mechanical)&lt;/li&gt;
&lt;li&gt;Warehouse shelf scanning &amp;amp; item detection&lt;/li&gt;
&lt;li&gt;Mobile base navigation with obstacle avoidance&lt;/li&gt;
&lt;li&gt;Simple teleoperation with human guidance&lt;/li&gt;
&lt;li&gt;Data collection and annotation platforms (collect data, train on cloud)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/liyiteng/AlohaMini" rel="noopener noreferrer"&gt;AlohaMini&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Next Generation" Problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Today:&lt;/strong&gt; This tier cannot run VLA models at robot-viable speeds (need &amp;gt;5 Hz for smooth control).&lt;/p&gt;
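&lt;p&gt;A back-of-envelope control-rate calculation makes this concrete. The per-inference latencies below are assumptions for illustration, not measured benchmarks:&lt;/p&gt;

```python
# Back-of-envelope control-rate check against the >5 Hz target mentioned above.
# The per-model latencies are illustrative assumptions, not measurements.

def control_rate_hz(inference_ms: float, overhead_ms: float = 10.0) -> float:
    """Achievable control-loop rate given model inference time plus a fixed
    per-cycle overhead (camera capture, pre/post-processing)."""
    return 1000.0 / (inference_ms + overhead_ms)

assumed_latencies_ms = {
    "SmolVLM 500M (assumed ~600 ms/step on this tier)": 600.0,
    "OpenVLA 7B (assumed ~5000 ms/step, if it fit at all)": 5000.0,
}
for name, ms in assumed_latencies_ms.items():
    hz = control_rate_hz(ms)
    verdict = "OK" if hz >= 5.0 else "below 5 Hz target"
    print(f"{name}: {hz:.2f} Hz ({verdict})")
```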

&lt;p&gt;&lt;strong&gt;Short-term workarounds (6-12 months):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid inference&lt;/strong&gt;: Run lightweight model locally, stream only complex decisions to remote&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Long-term (2027-2028):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hailo-8L successor (50+ TOPS at 3W) launches → enables real-time SmolVLA inference&lt;/li&gt;
&lt;li&gt;RPi 6 with better memory bandwidth → support for lightweight 1B VLAs&lt;/li&gt;
&lt;li&gt;Open-source distilled VLAs (&amp;lt;200M params) mature → native performance improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; This tier is for &lt;strong&gt;learning, prototyping, and collecting data&lt;/strong&gt;, not for autonomous manipulation. Use it to build datasets, then train bigger models on Jetson-class hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 2: $1,200-$2,400 — The Researcher's Playground
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: &lt;a href="https://www.nvidia.com/en-in/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/" rel="noopener noreferrer"&gt;Jetson Orin Nano Super Developer Kit&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs (Jetson Orin Nano Super)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1024-core NVIDIA Ampere (32 Tensor Cores)&lt;/td&gt;
&lt;td&gt;~$249 (Dev Kit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67 TOPS (Sparse INT8)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6-core&lt;/strong&gt; Arm Cortex-A78AE v8.2&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7-25W (configurable)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cooling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Active required&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Model Capabilities — The Reality Check
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw inference speed:&lt;/strong&gt; 0.3 Hz (3-4 seconds per action)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not viable for real-time control&lt;/strong&gt; (need &amp;gt;5 Hz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Viable if:&lt;/strong&gt; Slow manipulation (&amp;lt;1 action/sec), scripted sequences, or cloud-assisted planning&lt;/li&gt;
&lt;/ul&gt;
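&lt;p&gt;The arithmetic behind those numbers is simple: control rate is the reciprocal of per-action latency, so 3-4 seconds per action lands around 0.3 Hz, while a 5 Hz target leaves only a 200 ms budget per action.&lt;/p&gt;

```python
# Control-rate arithmetic: Hz is the reciprocal of seconds per action.
def rate_hz(seconds_per_action):
    return 1.0 / seconds_per_action

print(round(rate_hz(3.3), 2))  # ~0.3 Hz: OpenVLA 7B on this tier
print(rate_hz(0.2))            # 5.0 Hz target implies a 200 ms budget
```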

&lt;p&gt;&lt;strong&gt;SmolVLA 450M Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed:&lt;/strong&gt; 8-12 Hz with fp16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Viable for:&lt;/strong&gt; Real-time manipulation, visual servoing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory:&lt;/strong&gt; 2-3GB, leaves room for concurrent models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MiniVLA 1B Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference speed:&lt;/strong&gt; 3-5 Hz &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-model pipelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can run &lt;strong&gt;language model (3B) + vision model (450M) + low-level controller&lt;/strong&gt; simultaneously&lt;/li&gt;
&lt;li&gt;Use this for hierarchical control: "pick up the red block" → (LLM) → "grasp at position X" → (vision) → motor commands&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Recommended Stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudo-code architecture for Jetson Orin Nano
&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt; &lt;span class="nc"&gt;Model &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="n"&gt;quantized&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="n"&gt;decomposition&lt;/span&gt;
    &lt;span class="err"&gt;↓&lt;/span&gt;
&lt;span class="n"&gt;Vision&lt;/span&gt; &lt;span class="nc"&gt;Model &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Spatial&lt;/span&gt; &lt;span class="n"&gt;understanding&lt;/span&gt;
    &lt;span class="err"&gt;↓&lt;/span&gt;
&lt;span class="n"&gt;Action&lt;/span&gt; &lt;span class="nc"&gt;Policy &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SmolVLA&lt;/span&gt; &lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;control&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;University robotics labs&lt;/li&gt;
&lt;li&gt;Early-stage startup prototyping&lt;/li&gt;
&lt;li&gt;Open-source humanoid development&lt;/li&gt;
&lt;li&gt;VLA model training &amp;amp; fine-tuning&lt;/li&gt;
&lt;li&gt;Research on embodied AI&lt;/li&gt;
&lt;li&gt;Manipulation tasks (pick &amp;amp; place, assembly with &amp;gt;1 sec cycle time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not suitable for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-speed assembly lines&lt;/li&gt;
&lt;li&gt;Time-critical dexterity (surgery, precision electronics)&lt;/li&gt;
&lt;li&gt;Multi-robot swarm coordination (requires cloud offloading)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The "Next Generation" Outlook
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026-2027 Improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Jetson Orin Nano successor&lt;/strong&gt; (2x memory to 16GB, 100+ TOPS) will enable real-time OpenVLA 7B inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantization standardization&lt;/strong&gt;: INT4 quantization tools will mature → expect 2-3x speedups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA fine-tuning&lt;/strong&gt;: Parameter-efficient adaptation becomes standard → train custom models in &amp;lt;1 day on this hardware&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Timeline to viability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Today:&lt;/strong&gt; Good for research &amp;amp; slow tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2027:&lt;/strong&gt; Will handle most manipulation tasks in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2028:&lt;/strong&gt; Budget-class humanoids will use this as primary compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Smart strategy:&lt;/strong&gt; Start with Orin Nano for algorithm development. Once models mature, migrate to Jetson AGX Orin for deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 3: $2,400-$6,000 — The "Real Robot" Tier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: &lt;a href="https://developer.nvidia.com/embedded/learn/get-started-jetson-agx-orin-devkit" rel="noopener noreferrer"&gt;NVIDIA Jetson AGX Orin 32GB or 64GB&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategic Alternative:&lt;/strong&gt; &lt;a href="https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-max-plus-395.html" rel="noopener noreferrer"&gt;AMD Ryzen Strix Halo&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs (Jetson AGX Orin 64GB)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2048-core NVIDIA Ampere, 64GB LPDDR5X unified memory&lt;/td&gt;
&lt;td&gt;$2,200-2,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tensor Cores&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 Tensor Cores, 275 TOPS (Sparse INT8)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12-core Arm Cortex-A78AE&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15-60W (configurable via jetson_clocks)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;204.8 GB/s (critical for LLM inference)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total System Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Module + cooling + power: $2,500-3,000&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-orin/" rel="noopener noreferrer"&gt;Jetson AGX Orin Specs&lt;/a&gt; | &lt;a href="https://github.com/NVIDIA/TensorRT-LLM" rel="noopener noreferrer"&gt;TensorRT-LLM Benchmarks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Performance — The Goldilocks Hardware
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B in fp16:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;2 Hz inference&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Full model in memory, no quantization tricks needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B quantized (INT4):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4-5 Hz inference&lt;/strong&gt; (real-time for slower tasks)&lt;/li&gt;
&lt;li&gt;Achieves 92-95% accuracy retention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SmolVLA 450M:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15-20 Hz&lt;/strong&gt; (truly real-time)&lt;/li&gt;
&lt;li&gt;Comfortable headroom for safety checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multi-model stacking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run 7B reasoning LLM + 7B VLA + trajectory optimizer &lt;strong&gt;simultaneously&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Example: "Navigate kitchen while avoiding obstacles" = LLM (planning) + VLA (perception) + controller (low-level)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-time SLAM + AI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run ORB-SLAM on CPU cores while VLA runs on GPU&lt;/li&gt;
&lt;li&gt;Full 3D environment understanding + action selection in parallel&lt;/li&gt;
&lt;/ul&gt;
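&lt;p&gt;That CPU/GPU split can be sketched with two threads; the workers below are stand-ins, not real ORB-SLAM or VLA calls.&lt;/p&gt;

```python
# Sketch of the parallel split: SLAM-style tracking in one thread while
# the policy loop runs independently (on the GPU in the real system).
import threading
import queue

pose_queue = queue.Queue()
actions = []

def slam_worker(frames):
    # Placeholder: ORB-SLAM-style tracking would run here on CPU cores
    for i, _frame in enumerate(frames):
        pose_queue.put(("pose", i))

def policy_worker(n_steps):
    # Placeholder: VLA inference would run here on the GPU
    for _ in range(n_steps):
        actions.append("action")

t1 = threading.Thread(target=slam_worker, args=([0, 1, 2],))
t2 = threading.Thread(target=policy_worker, args=(3,))
t1.start(); t2.start()
t1.join(); t2.join()
print(pose_queue.qsize(), len(actions))  # both pipelines produced output
```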

&lt;h3&gt;
  
  
  Compute Cost in Humanoid BOM
&lt;/h3&gt;

&lt;p&gt;For a &lt;strong&gt;$3,500-4,500 complete humanoid:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jetson AGX Orin: $2,500 (~56-71% of total cost)&lt;/li&gt;
&lt;li&gt;Actuators: $900 (25%)&lt;/li&gt;
&lt;li&gt;Sensors/cameras: $300 (8%)&lt;/li&gt;
&lt;li&gt;Misc: $100 (3%)&lt;/li&gt;
&lt;/ul&gt;
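&lt;p&gt;A quick sanity check on that split, using the itemized costs above:&lt;/p&gt;

```python
# BOM sanity check: the itemized costs sum to ~$3,800, and compute
# (the Jetson AGX Orin) takes roughly two thirds of it.
bom = {"Jetson AGX Orin": 2500, "Actuators": 900,
       "Sensors/cameras": 300, "Misc": 100}
total = sum(bom.values())
compute_share = bom["Jetson AGX Orin"] / total
print(total)                       # 3800
print(round(compute_share * 100))  # 66 (% of the build that is compute)
```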

&lt;p&gt;&lt;strong&gt;The hard truth:&lt;/strong&gt; At this price tier, compute dominates cost. The robot is mostly brain, not body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Excellent for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Research institutions building dexterous systems&lt;/strong&gt; (manipulation labs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startups with Series A funding&lt;/strong&gt; (can justify $3K per unit compute cost)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Industrial pilots&lt;/strong&gt; (flexible assembly lines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal reasoning tasks&lt;/strong&gt; (navigation + manipulation + language understanding)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-robot learning&lt;/strong&gt; (collect data, fine-tune models locally)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-robot coordination&lt;/strong&gt; (compute models for fleet behavior)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3-5 Year Forecast
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026:&lt;/strong&gt; Jetson AGX Orin becomes the development standard for all serious humanoid research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2027:&lt;/strong&gt; Successor (likely 500+ TOPS) emerges with 2x efficiency → enables smaller robots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2028-2030:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost drops 30-40% through competition (AMD, Intel catch up)&lt;/li&gt;
&lt;li&gt;Memory standardizes at 128GB unified&lt;/li&gt;
&lt;li&gt;Real-time OpenVLA becomes baseline expectation&lt;/li&gt;
&lt;li&gt;On-robot learning (collect data → train → deploy in hours) becomes standard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is where the magic happens. This tier enables &lt;strong&gt;embodied AI systems&lt;/strong&gt; that truly think locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category 4: $6,000-$12,000 — The Industrial Deployment Class
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best Choice: &lt;a href="https://nvidianews.nvidia.com/news/nvidia-blackwell-powered-jetson-thor-now-available-accelerating-the-age-of-general-robotics" rel="noopener noreferrer"&gt;NVIDIA Jetson AGX Thor&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Specs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blackwell architecture (NVIDIA's latest)&lt;/td&gt;
&lt;td&gt;Developer Kit: $3,499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,070 TFLOPS (FP4) / 1,035 TFLOPS (FP8)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128GB unified LPDDR5X&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;273 GB/s&lt;/strong&gt; (~1.3x Orin)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14-core Arm Neoverse V3AE&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40-130W configurable&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Module Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$2,500-2,800 (estimate)&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Game Changer
&lt;/h3&gt;

&lt;p&gt;This is the inflection point. Thor entered production in August 2025 and is already adopted by Amazon Robotics, Boston Dynamics, and Figure AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; While the memory bandwidth (273 GB/s) is only a moderate step up from Orin, the real paradigm shift is the &lt;strong&gt;Blackwell GPU with native FP4 support&lt;/strong&gt;. This allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Double the effective model size:&lt;/strong&gt; Run larger models in 4-bit precision (FP4) with hardware acceleration, effectively doubling the usable memory capacity compared to FP8/INT8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer Engine:&lt;/strong&gt; Dynamically adjusts precision per layer to maintain accuracy while maximizing throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run multi-modal agents:&lt;/strong&gt; Run a 7B VLA and a 13B reasoning LLM simultaneously on a single module, thanks to the 2,070 TFLOPS of compute density.&lt;/li&gt;
&lt;/ul&gt;
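&lt;p&gt;The memory math behind the FP4 claim: halving the bits per weight halves the bytes needed to hold the parameters (weights only; KV cache and activations are extra).&lt;/p&gt;

```python
# Rough footprint of model weights at different precisions.
def model_gb(params_billion, bits_per_weight):
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, weights only

print(model_gb(7, 16))  # 14.0 GB at FP16
print(model_gb(7, 8))   # 7.0 GB at FP8
print(model_gb(7, 4))   # 3.5 GB at FP4
```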

&lt;h3&gt;
  
  
  Model Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OpenVLA 7B in full precision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5+ Hz consistently&lt;/strong&gt; (fast enough for dexterous tasks)&lt;/li&gt;
&lt;li&gt;No quantization hacks required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running multiple models simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30B reasoning model + 7B VLA + trajectory optimizer&lt;/li&gt;
&lt;li&gt;Example: "Assemble electronics" = LLM (step planning) + VLA (visual perception) + controller (motor commands)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-time multi-modal reasoning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision + language + proprioception all processing in parallel&lt;/li&gt;
&lt;li&gt;First time this is truly practical at the edge&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases — Industrial Reality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Factory assembly lines&lt;/strong&gt; (complex dexterity, multi-object scenes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative manufacturing&lt;/strong&gt; (safety-critical, real-time adaptation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical robotics&lt;/strong&gt; (strict latency requirements, real-time feedback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced manipulation&lt;/strong&gt; (24+ DOF robots with tactile sensing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research that won't be outdated in 2 years&lt;/strong&gt; (future-proof choice)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Breakdown for $7,500 Industrial Humanoid
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jetson Thor module + integration&lt;/td&gt;
&lt;td&gt;$2,800&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dexterous actuators (24 DOF)&lt;/td&gt;
&lt;td&gt;$2,800&lt;/td&gt;
&lt;td&gt;37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensors + cameras + tactile&lt;/td&gt;
&lt;td&gt;$800&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power system (dual batteries)&lt;/td&gt;
&lt;td&gt;$500&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration + testing&lt;/td&gt;
&lt;td&gt;$600&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; At this tier, compute finally stops dominating BOM. Actuator cost rivals compute cost—a healthy balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  5-Year Outlook
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;2026:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thor becomes standard for enterprise robotics R&amp;amp;D&lt;/li&gt;
&lt;li&gt;Competitors (AMD, Qualcomm) announce equivalents but won't ship for 12+ months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2027-2028:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jetson Thor successor (4,000+ TOPS) launches&lt;/li&gt;
&lt;li&gt;Manufacturing costs drop 30-40%&lt;/li&gt;
&lt;li&gt;First commercial humanoid deployments using Thor-class compute go mainstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2029-2030:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost drops to ~$1,500-2,000 per unit&lt;/li&gt;
&lt;li&gt;Becomes viable for mass-market humanoids ($15-20k retail)&lt;/li&gt;
&lt;li&gt;Full multimodal reasoning (vision + language + touch) becomes standard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Category 5: $12,000+ — The Frontier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Case-Specific Choices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;General-purpose humanoid:&lt;/strong&gt; Custom NVIDIA silicon (Tesla Optimus path) or dual Jetson Thor&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surgical robotics:&lt;/strong&gt; Medical-certified compute stack (higher latency tolerance but reliability critical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swarm robotics:&lt;/strong&gt; Jetson Thor + cloud-connected training infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Reality
&lt;/h3&gt;

&lt;p&gt;This is where &lt;strong&gt;the robot becomes secondary to the compute infrastructure&lt;/strong&gt;. You're not just buying a processor; you're buying into a &lt;strong&gt;training pipeline, simulation environment, and model zoo&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Companies in this tier (Tesla, Boston Dynamics, Figure AI) build:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Simulation infrastructure&lt;/strong&gt; (digital twins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed training pipelines&lt;/strong&gt; (thousands of episodes → models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom silicon&lt;/strong&gt; optimizations (learned through production experience)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Hardware Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;&amp;lt;$1.2K&lt;/th&gt;
&lt;th&gt;$1.2-2.4K&lt;/th&gt;
&lt;th&gt;$2.4-6K&lt;/th&gt;
&lt;th&gt;$6-12K&lt;/th&gt;
&lt;th&gt;&amp;gt;$12K&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time VLA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-model pipelines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-device training&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Industrial deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hobby projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Research labs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Market Reality: Why NVIDIA Will Dominate Through 2030
&lt;/h2&gt;

&lt;p&gt;I've searched extensively through GitHub, Reddit, research papers, and industry discussions. Here's what I found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform adoption:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LeRobot (Hugging Face):&lt;/strong&gt; Officially optimizes for Jetson&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AlohaMini community:&lt;/strong&gt; Standardizes on Jetson Orin Nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chinese manufacturers (Unitree, Agility):&lt;/strong&gt; Moving toward Jetson for AI perception layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic robotics labs:&lt;/strong&gt; 80%+ use NVIDIA (CUDA ecosystem, TensorRT maturity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why AMD/Intel don't win:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem lag:&lt;/strong&gt; No robotics-optimized compilers or middleware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer inertia:&lt;/strong&gt; 2+ million engineers trained on CUDA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model optimization:&lt;/strong&gt; VLA models optimized first for NVIDIA, then backported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain:&lt;/strong&gt; NVIDIA has proven availability; competitors still ramping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-hand experience:&lt;/strong&gt; I have personally tried running various AI tools on my Strix Halo device on Linux, and it is a nightmare: ROCm still does not have stable support for Strix Halo.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The alternative:&lt;/strong&gt; Hailo accelerators win in the &lt;em&gt;power-constrained, single-task&lt;/em&gt; market (warehouse scanning, edge object detection). But for general-purpose humanoids with VLA reasoning? Jetson is uncontested.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Next 12-24 Months: Watch These Developments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;2027:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First meaningful cost reduction in humanoid robotics hits ($20-30K robots become viable for specific tasks)&lt;/li&gt;
&lt;li&gt;Open-source VLA model zoo matures → SmolVLA derivatives enable sub-$5K robots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2028-2030:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute cost drops 60-70% from 2025 levels&lt;/li&gt;
&lt;li&gt;Robotics software becomes the moat, not hardware&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts: Building the Right Mental Model
&lt;/h2&gt;

&lt;p&gt;I started this research trying to answer: "Which hardware will dominate humanoid robotics?"&lt;/p&gt;

&lt;p&gt;After diving deep, the answer isn't satisfying but it's clear: &lt;strong&gt;NVIDIA Jetson variants will dominate 60-70% of the market through 2030&lt;/strong&gt;, with niches for AMD (cost optimization), Hailo (power efficiency), and custom silicon (post-Series B).&lt;/p&gt;

&lt;p&gt;But more importantly: &lt;strong&gt;the era of "compute is the bottleneck" is ending&lt;/strong&gt;. By 2028-2030, compute becomes a commodity. The real moats are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data:&lt;/strong&gt; Collected robot experience (proprietary datasets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models:&lt;/strong&gt; Fine-tuned VLAs for specific tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing:&lt;/strong&gt; Can you make 1,000 units reliably?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; Getting the product to market first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product:&lt;/strong&gt; Taste and an understanding of humans&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This article captures my learning at a specific moment (January 2026). The field is moving fast.&lt;/p&gt;

&lt;p&gt;I am actively looking for ways in which I can contribute to open source software in this domain.&lt;/p&gt;

&lt;p&gt;If you're working on humanoid robotics, I'd love to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which compute platform are you using? Why?&lt;/li&gt;
&lt;li&gt;What VLA model is actually viable on your hardware?&lt;/li&gt;
&lt;li&gt;Where are you hitting walls?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Last updated: January 15, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>humanoid</category>
      <category>ai</category>
      <category>robotics</category>
    </item>
  </channel>
</rss>
