A student walks into a robotics lab with a simple question. The expert smiles and begins unraveling the mystery.
Part 1: The Question
"Can a humanoid robot recognize my face?"
Yes, right now. Face recognition (FaceNet, InsightFace) is ~99% accurate in controlled settings.[19][21] But come back in 5 minutes? The robot has completely forgotten you exist.
"Why does it forget me?"
Because its brain (Vision-Language-Action models, or VLAs) only sees 1-2 seconds of reality at a time - just 2-4 video frames.[3][15] Imagine having amnesia every second.
"Why can't it just look at more frames?"
Because transformer attention - the math that makes VLAs work - is O(T²) where T = frames. Doubling frames costs 4× more computation. 30 frames needs 100× the power of 3 frames (30²/3² = 900/9).[3][4] The robot would need a nuclear reactor to think.
"So the real problem is compute?"
Exactly. But here's the plot twist: you don't need all frames. You only need the important ones. And you don't store pixels - just compact features. That's 100-1000× compression without losing recognition ability.[2][6][26]
"Wait... is there actually a way to solve this?"
Yes. Researchers have already solved the individual pieces (smart frame selection, compression, efficient attention). But nobody has stitched them together into a working robot. That's the frontier.
Part 2: Face Recognition 101
"Okay, so how does face recognition actually work?"
The robot converts your face into an "embedding" - a number vector where similar faces have similar coordinates. FaceNet uses 128 dimensions; InsightFace uses 512. Your face in sunlight and your face at night live in nearby neighborhoods of this abstract space.[19][21]
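To make that concrete, here's a toy comparison in Python. The vectors are random stand-ins, not real face embeddings; 512 dimensions just mirrors InsightFace:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction (same face), ~0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; a real system would get these from a face model.
alice_day = np.random.randn(512)
alice_night = alice_day + 0.1 * np.random.randn(512)  # same person, small variation
bob = np.random.randn(512)                            # different person

print(cosine_similarity(alice_day, alice_night))  # high (~0.99): nearby in the space
print(cosine_similarity(alice_day, bob))          # near 0: far apart
```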
"That's... beautiful? But how did it learn this?"
Trained on millions of face pairs with a technique called "triplet loss": push embeddings of the same person together, push embeddings of different people far apart. After seeing enough examples, patterns emerge.[19][21]
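A toy version of that objective (real FaceNet training adds careful triplet mining and large batches; this is just the core formula):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Same-person distance must beat different-person distance by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor vs. same person
    d_neg = np.linalg.norm(anchor - negative)  # anchor vs. different person
    return max(d_pos - d_neg + margin, 0.0)    # zero once the gap is wide enough
```

Minimizing this over millions of triplets is what carves the embedding space into per-person neighborhoods.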
"How accurate is it, really?"
In a lab with good lighting: 99%. In the real world with varying lighting, makeup, sunglasses: 85-92%. After 1 month, accuracy remains high (>90%) for adults with stable appearance; degradation is minimal over short intervals.[5][14] Studies show 98%+ accuracy even after 6 months for adults, with larger drops occurring over years.[30]
"What trips it up?"
Lighting changes, occlusion (masks, sunglasses), makeup, aging, and crowded scenes where extracting faces is messy. Basically, anything that changes how the pixels look.[5][14] But some changes hit harder: adding or removing facial hair can raise false non-match rates 10-25×, mustaches especially.[33] Sunglasses (upper-face occlusion) can drop accuracy from ~93% to ~37% - worse than masks.[34] Growing children are the hardest case: infants under 1 year show only ~30% accuracy over 6-month gaps, and toddlers (2-3 years) only ~65%.[35]
"Can we make it more robust?"
Sort of. Ensemble methods (run multiple models, vote on the answer) help. Confidence thresholds work. Training with diverse appearances (beards, glasses, different ages) improves robustness.[33][34] For children, systems need age-invariant features or regular re-enrollment every 6-12 months.[35] But the honest answer is to ask the human when you're uncertain: "Are you Alice? You look similar to someone I know."[19][21]
"What about growing beards, glasses, or children?"
Beard changes: Adding or removing facial hair can cause 10-25× increase in false non-match rates, especially mustaches.[33] Glasses: Upper-face occlusion (sunglasses) drops accuracy from ~93% to ~37% - worse than masks.[34] Growing children: Infants (0-1 year) show only ~30% accuracy over 6 months; toddlers (2-3 years) improve to ~65%.[35] For children, systems need frequent re-enrollment or age-invariant modeling.[35]
Part 3: The VLA Bottleneck
"What exactly is a VLA?"
Vision-Language-Action model. A neural network that takes camera frames and language instructions ("pick up the red cup") as input, and outputs robot commands (move arm, open gripper).[15][18]
"Examples?"
RT-2 (DeepMind, closed). OpenVLA (Stanford-led, open-source 7B). Qwen-VL (Alibaba). VideoVLA (2025, understands motion). OpenVLA is the best starting point for building your own system.[11][15][28]
"Wait - can VLAs recognize faces?"
No. VLAs (OpenVLA, SmolVLA, Pi 0.6) are trained for manipulation tasks, not person identification. They understand objects and scenes, not individual faces. You need a separate face recognition module (InsightFace, FaceNet) that extracts face embeddings, then integrate those into the robot's memory system. The VLA handles actions; face recognition handles identity.[11][15]
"Why do they only process 2-4 frames?"
Control loops run at 50 Hz (20ms per cycle). Optimized VLAs on high-end GPUs achieve 20-40ms inference; typical systems take 50-150ms.[31] That leaves little time for deep video analysis when processing many frames.[24][26][28]
"What if we optimize VLA inference?"
Even with optimizations - KV cache tricks (reuse computation), sparse attention (skip unimportant tokens), quantization (4-bit math instead of 32-bit) - 30 frames still takes 100+ ms. Too slow.[1][4][29]
"So we can never extend context?"
Wrong assumption. CronusVLA (2025) uses a clever trick: extract motion features instead of processing raw pixels, caching past features to avoid recomputing the vision backbone.[26] This enables multi-frame context with minimal overhead compared to naive frame stacking.[26]
Part 4: Extending Context
"How do we extend context efficiently?"
Three independent tricks that stack: (1) Select only important frames (not all frames). (2) Compress frames to features (not pixels). (3) Use efficient attention patterns (not full attention).
"Trick 1: Which frames matter?"
Motion-based selection: keep frames with high optical flow (stuff is changing), skip static frames. 15-20× compression with minimal accuracy loss. Or use learned importance (VLM scores which frames matter for your task).[2][5][12]
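A rough sketch with OpenCV's Farneback optical flow. The keep-top-5% policy is an illustrative choice (roughly the 20× compression mentioned above), not the exact method from the cited papers:

```python
import cv2
import numpy as np

def select_keyframes(frames, keep_ratio=0.05):
    """Score each frame by mean optical-flow magnitude; keep the most dynamic ones."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    scores = [0.0]  # first frame has no predecessor
    for prev, curr in zip(grays, grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        scores.append(float(np.linalg.norm(flow, axis=2).mean()))
    k = max(1, int(len(frames) * keep_ratio))
    keep = sorted(np.argsort(scores)[-k:])  # top-k indices, back in time order
    return [frames[i] for i in keep]
```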
"Any other selection methods?"
Multi-armed bandit for constrained budgets (2025 research). Or hierarchical: keep recent frames densely, older frames sparsely. Or genetic algorithms (academic, not practical). Motion-based works well in practice.[2][12][14]
"Trick 2: Compress frames?"
Don't store 6 MB per frame (RGB pixels). Store pooled features (50 KB, 120× smaller) using max-pooling. Motion features from optical flow can compress temporal information, but face recognition typically requires appearance features combined with motion for best results.[10][13][15]
"How does max-pooling work?"
Take every 2×2 grid of pixels, keep the strongest signal, discard the rest. Repeat 4-5 times and a 1080-pixel dimension shrinks to ~64, then ~32. You lose spatial detail but preserve what matters for recognition. At 64×64, expect a 5-15% accuracy drop; at 32×32, expect 20-40%, depending on conditions.[10][13][32]
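A minimal 2×2 max-pool in NumPy. (Real systems pool CNN feature maps rather than raw pixels, but the operation is identical.)

```python
import numpy as np

def max_pool_2x2(img: np.ndarray) -> np.ndarray:
    """Halve each spatial dimension, keeping the strongest value per 2x2 block."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # trim odd edges
    blocks = img[:h, :w].reshape(h // 2, 2, w // 2, 2, -1)
    return blocks.max(axis=(1, 3))

x = np.random.rand(1080, 1920, 3)
for _ in range(4):   # four halvings: 1080p -> roughly 64 pixels tall
    x = max_pool_2x2(x)
print(x.shape)       # (67, 120, 3)
```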
"What about temporal compression?"
TempMe (2025 paper): cluster similar consecutive frames, keep 1 representative per cluster. Result: 95% token reduction in video. Faster inference. Sometimes even better accuracy (less noise).[6]
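TempMe itself merges tokens inside the transformer; here is only a toy of the high-level idea - collapse runs of near-duplicate consecutive frame features into one representative (the 0.98 threshold is arbitrary):

```python
import numpy as np

def merge_static_runs(features, threshold=0.98):
    """Keep one feature vector per run of near-identical consecutive frames."""
    kept = [features[0]]
    for f in features[1:]:
        prev = kept[-1]
        sim = np.dot(prev, f) / (np.linalg.norm(prev) * np.linalg.norm(f))
        if sim < threshold:   # scene changed enough: start a new representative
            kept.append(f)
    return kept
```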
"Trick 3: Efficient attention?"
Standard: query attends to every past token (O(T²) cost). Efficient: (a) KV cache - reuse computation from previous steps. (b) Grouped Query Attention - multiple query heads share one KV head (4× smaller cache). (c) Sparse attention - only attend to important positions.[1][4][29]
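A bare-bones single-head decode loop showing what the KV cache buys: per step, only the new frame's keys and values are computed, and everything past is reused (NumPy sketch, not production attention):

```python
import numpy as np

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []          # grows by one entry per step

def attend(x):
    """One decode step: project only the new token; reuse cached K/V."""
    q = x @ Wq
    K_cache.append(x @ Wk)         # O(1) new projection work per step...
    V_cache.append(x @ Wv)         # ...instead of re-projecting all past frames
    K, V = np.stack(K_cache), np.stack(V_cache)
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()   # softmax over all cached positions
    return w @ V

for _ in range(30):                # 30 "frames" of context
    out = attend(np.random.randn(d))
```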
"Combining all three?"
Motion frame selection (15×) + temporal token merging (95%) + GQA + sparse = 100-1000× compression. Optimized systems can achieve 20-40ms latency on high-end GPUs.[31] Accuracy loss varies by compression level and task.[2][6][1]
Part 5: The Memory Problem
"Okay, frames are compressed. Where do we store them?"
Here's the hard part: limited RAM on the robot (8-16 GB shared with OS). Can't query disk fast enough for real-time. Need multiple storage tiers, each optimized for different timescales.
"Layers?"
Tier 0 (2 sec): Current frames in RAM. Real-time VLA inference. <1ms access.
Tier 1 (60 sec): Compressed motion features on fast SSD. <20ms access.
Tier 2 (1 hour): Face embeddings in vector database (Milvus). Similarity search in <100ms.
Tier 3 (months): Person identities in PostgreSQL. SQL queries in <10ms.
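As code, the tier table might be declared like this - the names and numbers just mirror the list above; nothing here is a standard API:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    window: str          # how far back this tier covers
    store: str           # where the data lives
    max_latency_ms: int  # access budget

TIERS = [
    Tier("working",  "2 sec",  "RAM (current frames)",              1),
    Tier("short",    "60 sec", "SSD (compressed motion features)",  20),
    Tier("episodic", "1 hour", "Milvus (face embeddings)",          100),
    Tier("semantic", "months", "PostgreSQL (person identities)",    10),
]
```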
"Why separate tiers?"
Each tier optimizes for its job. Tier 0 is tiny and fast. Tier 3 is huge but doesn't need real-time speed. Together they cover seconds to months without exceeding your latency budget.
"How much storage?"
Tier 0: 0 MB (RAM only, flushed). Tier 1: ~100 MB. Tier 2: ~500 MB (rolling one-hour window). Tier 3: ~1 MB per 1000 people. Steady-state total for 1 month of operation: ~600 MB. Fits on a USB stick.[18][20]
"What about privacy? Is storing face data ethical?"
Yes, with consent and transparency. Users should opt-in, know what's stored, and be able to delete their data. Best practice: store embeddings (not raw images), encrypt at rest, allow deletion. Some jurisdictions (EU GDPR, some US states) require explicit consent for biometric data. Build privacy-by-design: minimal data, local-first storage, user control.[18]
Part 6: Real-Time Recognition Challenge
"So here's the hard part: when the robot sees someone, it needs to know instantly who they are."
Right. At 30 FPS, you're getting up to 30 faces per second. You can't query the vector database 30 times per second - that's 30 round-trips to disk. Game over.
"What do we do?"
Smart caching. The robot's most-used people (family, frequent visitors) stay hot in memory. Tier 0 gets an LRU cache of embeddings it's seen recently. Tier 1 tracks faces from the past hour.
"Can you walk through this?"
Robot sees someone (a code sketch follows this list):
- Extract face embedding (lightweight, ~5ms, can happen on spare GPU cycles)
- Check local cache (Tier 0): "Have I seen this embedding in the last 60 seconds?" If yes: instant match
- Cache miss? Check Tier 1 (motion features, faces from past hour): "Any motion features correlate with this face?" If yes: probably the same person
- Still no match? Query vector DB (Tier 2) asynchronously. Don't block action loop.
- Query result arrives 50-100ms later. Robot incorporates into next decision.
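A skeleton of that flow. The `vector_db` handle and the way faces are keyed (here, by a hypothetical tracker ID) are simplifications; the point is that only the cache lookup sits on the hot path:

```python
import threading
from collections import OrderedDict

class LRUCache:
    """Tier 0: identities seen recently, least-recently-used evicted."""
    def __init__(self, capacity=256):
        self.capacity, self.items = capacity, OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the oldest entry

tier0 = LRUCache()

def on_face(track_id, embedding, vector_db):
    person = tier0.get(track_id)             # <1ms: the only work on the hot path
    if person is not None:
        return person
    def resolve():                           # Tier 2/3 lookup off the control loop
        match = vector_db.search(embedding)  # hypothetical 50-100ms query
        if match is not None:
            tier0.put(track_id, match)       # next sighting becomes a cache hit
    threading.Thread(target=resolve, daemon=True).start()
    return None                              # act conservatively until it resolves
```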
"But what if the person hasn't been seen in 3 months?"
Exactly the query you're worried about. Robot can't afford synchronous queries. Solution: (a) Query Tier 3 in background thread. (b) Meanwhile, robot acts conservatively ("Hello! What's your name?"). (c) When query completes, update memory: "Oh! That was Alice!"
"So the robot makes a guess while waiting for the database?"
Correct. It's a reasonable tradeoff. Perfect accuracy takes 100ms. Approximate accuracy takes 20ms. For most tasks, approximate is fine, and you can refine later.
"What about false positives?"
Confidence thresholds + fallback. If embedding similarity is >0.9: "Welcome back, Alice!" If similarity is 0.75-0.9: "Are you Alice?" If <0.75: "Hello, new person!"
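Those thresholds as straight-line code (0.9 and 0.75 come from above; tune them for your embedding model):

```python
def greet(similarity: float, name: str) -> str:
    if similarity > 0.9:
        return f"Welcome back, {name}!"               # confident match
    if similarity >= 0.75:
        return f"Are you {name}? You look familiar."  # ask, don't assume
    return "Hello! I don't think we've met."          # treat as a new person
```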
"How do we avoid querying vector DB 50 times per second?"
Several strategies (a batching sketch follows the list):
- Batch queries: Accumulate 10 faces, query once (amortizes latency)
- Bloom filters: Quick "definitely not in database" check before expensive query
- Locality: Faces in same location likely same person (temporal coherence)
- Clustering: Group embeddings into ~100 clusters, query cluster representative, not individual
- Cache hottest 1000 people: 99% of queries hit cache (Pareto principle)
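A minimal batcher for the first strategy - `batch_search` stands in for the batched query APIs that Milvus and Qdrant expose:

```python
class QueryBatcher:
    """Accumulate face embeddings and hit the vector DB once per batch."""
    def __init__(self, vector_db, batch_size=10):
        self.db, self.batch_size, self.pending = vector_db, batch_size, []

    def add(self, embedding):
        self.pending.append(embedding)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return None

    def flush(self):
        if not self.pending:
            return []
        results = self.db.batch_search(self.pending)  # one round-trip, amortized
        self.pending = []
        return results
```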
"Which works best?"
Combination. Always check local cache first (0.1ms). Batch queries when cache misses (10ms per 10 faces). Cluster embeddings in vector DB (10× fewer distance calculations). Query Tier 3 asynchronously.
"What's the latency real-time impact?"
Tier 0 cache hit: <1ms (recognition instant). Tier 1 batch query: ~15ms (30 FPS, can handle). Tier 2/3 async: 50-100ms (doesn't block control).
Part 7: Memory Updates and Consolidation
"After 3 months, the database is full of duplicate faces. Alice has been seen 500 times. How do we consolidate?"
Periodic background job (runs every 30 minutes): cluster faces by similarity (embedding distance), compute centroid of each cluster, update Tier 3 with centroid + metadata.
"What metadata gets updated?"
person_id, name, face_embedding_centroid (average of recent embeddings), last_seen, interaction_count, behavior_summary (LLM-generated), context_tags (where/when usually seen).
"Why centroid instead of keeping all 500 embeddings?"
Storage: 500 embeddings × 512 dims × 4 bytes = 1 MB per person. Scaling to 10k people: 10 GB. But centroid: 512 dims × 4 bytes = 2 KB. 10k people: 20 MB. Also faster queries.
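The consolidation math in NumPy. A real job would first cluster sightings by embedding distance (e.g. with DBSCAN); this assumes the cluster is already formed:

```python
import numpy as np

def consolidate(embeddings: np.ndarray) -> np.ndarray:
    """Collapse N sightings (N x 512) into one ~2 KB centroid."""
    centroid = embeddings.mean(axis=0)
    return centroid / np.linalg.norm(centroid)  # re-normalize for cosine search

sightings = np.random.randn(500, 512)  # Alice, seen 500 times: ~1 MB
centroid = consolidate(sightings)      # 512 floats: ~2 KB
```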
"What about people you haven't seen in a year?"
Archive them. Move centroid to cold storage (cloud). Keep recent 1000 people in hot database. When someone reappears after 1 year: warm up their embeddings, integrate into Tier 3.
Part 8: The Technical Stack
"What libraries should I actually use?"
Face detection/embedding: InsightFace (accurate, fast, open-source, 512-dim vectors).
Vector DB: Milvus or Qdrant (HNSW indexing, fast search, Python API).
Person DB: PostgreSQL + pgvector (SQL + vector similarity, scales to millions).
VLA inference: HuggingFace Transformers (OpenVLA-7B).
Video I/O: OpenCV (standard, efficient).
"Why InsightFace?"
20-50ms per face (fast). 95%+ detection accuracy. Open-source. Produces 512-dimensional embeddings proven for recognition. Easy to fine-tune.
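Typical usage looks roughly like this - model pack names and detector sizes vary by version, so treat it as a sketch:

```python
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # common pretrained model pack
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0: first GPU; -1 for CPU

img = cv2.imread("visitor.jpg")             # BGR image, as OpenCV loads it
for face in app.get(img):
    embedding = face.normed_embedding       # 512-dim, unit-norm: ready for cosine search
```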
"Why Milvus over other vector DBs?"
Supports HNSW (hierarchical approximate search), in-memory + SSD persistence, Python API, easy deployment on Jetson. Qdrant is also good (Rust-based, slightly faster). Pick either.
"Why PostgreSQL + pgvector?"
SQL for complex queries (names, timestamps, context). Vector similarity search in same database. Scales to millions of records. pgvector is mature (stable since 2023).
"Wait - why both Milvus and PostgreSQL? Can't I use just one?"
You can! PostgreSQL + pgvector can handle both: vector similarity search (like Milvus) AND SQL queries with metadata. Many systems use just PostgreSQL. The two-DB setup separates concerns: Milvus (Tier 2) optimized for fast vector search on recent faces, PostgreSQL (Tier 3) for long-term storage with rich metadata. But if you want simplicity, use PostgreSQL + pgvector for everything - it's mature and handles both workloads well.
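With pgvector, the nearest-neighbor lookup is one SQL statement. The `people` schema here is hypothetical; `<=>` is pgvector's cosine-distance operator:

```python
import psycopg2

conn = psycopg2.connect("dbname=robot_memory")  # hypothetical database name

def identify(embedding):
    """Find the stored person closest to a 512-dim face embedding."""
    vec = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"  # pgvector literal
    with conn.cursor() as cur:
        # Assumed schema: people(name text, centroid vector(512), last_seen timestamptz)
        cur.execute(
            """SELECT name, centroid <=> %s::vector AS dist
               FROM people ORDER BY dist LIMIT 1""",
            (vec,),
        )
        return cur.fetchone()  # (name, cosine distance), or None if table is empty
```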
"What about the VLA model?"
OpenVLA-7B is your best bet. Open-source, fine-tuneable with LoRA, good community. RT-2 (DeepMind) is better but closed-source. VideoVLA (2025) supports multi-frame but less mature.
Part 9: Practical Constraints
"What hardware do I actually need?"
Minimum: Jetson Orin Nano Super ($249, 8 GB RAM, 67 TOPS). Processes ~5 FPS with constraints. Can run lightweight models (SmolVLA 450M at 8-12 Hz) but struggles with larger 7B models (~0.3 Hz).[39]
Recommended: 16 GB RAM, 256 GB NVMe SSD, 100+ TOPS. For production-quality multi-model stacks, consider Jetson AGX Orin (32-64 GB) or newer architectures that can handle VLA + perception models simultaneously at real-time rates.[39]
"Is 5-15 FPS enough?"
For humanoid robots? Yes. You don't need 30 FPS every second. Key is asynchronous architecture: memory queries happen in background, don't block the control loop.
"What's the latency budget?"
Frame capture: 1-2ms. Optimized VLA inference (3 frames): 20-40ms on high-end GPUs; typical systems 50-150ms.[31] Action generation: 2-3ms. Memory cache lookups: <1ms. Async queries (don't block): 50-100ms. Total real-time path: 25-50ms for optimized systems. Meets 20-30 Hz control requirement.
"What about on low-power devices like Jetson Orin Nano?"
Unoptimized CPU-only: 150-300ms per frame. With GPU + TensorRT INT8 quantization + tracking: 25-40ms per frame for 1-5 faces. Memory is the bottleneck - 8 GB shared RAM limits model size and batch processing.[36][37]
"What if I need to run multiple models simultaneously?"
A full humanoid stack (VLA, object detection, SLAM, depth, speech) competing for shared 8 GB RAM makes real-time performance challenging. Jetson Orin Nano Super is not yet sufficient for production-quality multi-model deployments.[38]
"What recognition accuracy should I expect?"
Face detection: 95-98%. Recognition same day: 92-95%. After 1 week: 90-93%. After 1 month: 90-95% for adults with stable appearance (minimal degradation over short intervals).[30] Accuracy remains high (>90%) for months; larger drops occur over years. But appearance changes matter: beard changes can raise false non-match rates 10-25×, sunglasses can cut accuracy to ~37%, and children under 1 year show only ~30% over 6-month gaps.[33][34][35] Accuracy improves with recency-weighted averaging, ensemble models, and diverse training data.
"What if I need higher accuracy?"
Use confidence thresholds (only match if >0.85 instead of 0.75). Ask for confirmation on borderline cases. Use ensemble (run 2-3 face recognition models, vote). Improves but costs latency.
Part 10: Current Research (2025)
"What actually broke through this year?"
CronusVLA: Multi-frame VLA using motion features with cached past frames, avoiding recomputation of the vision backbone.[26] Achieves 12.7% improvement on LIBERO benchmark with efficient multi-frame processing.[26]
VideoVLA: Diffusion-based approach. Predicts future frames AND continuous actions. Better generalization.
Long-context LLMs: Claude handles 200k-token contexts, enabling direct semantic memory integration.
"What's still unsolved?"
Uncertainty calibration (the robot knowing when it's uncertain). Privacy-preserving embeddings (encrypted vector search). Continual learning without forgetting old skills. Cross-modal grounding (explaining what it knows). And making all of this run on a low-power device in real time.
Part 11: The Bigger Picture
"Why does robot memory actually matter?"
For care robots: remember patient health status, preferences, medication. For home robots: understand family dynamics, relationships. For workplace: coordinate with individuals, learn workflows. Memory = personalization = trust.
Imagine a Jarvis that can't recognize Tony Stark.
"Who actually needs this? What's the market?"
Three segments: (1) Healthcare: care robots in hospitals/nursing homes ($2B+ market, growing 25% annually). (2) Consumer: home assistant robots ($5B+ by 2030). (3) Enterprise: warehouse/logistics robots ($15B+). Early adopters are healthcare (regulatory compliance, patient safety) and high-end consumer (personalization premium). The "remember me" feature becomes a differentiator when robots are commodity.[18][20]
"Is this going to be solved?"
Partially, yes. In 1-3 years, robots will recognize and remember faces across months. In 3-7 years, they'll have super-human memory.
Conclusion
"So what's the summary?"
Face recognition works. VLAs are bottlenecked. Compression techniques exist, but nobody has integrated them into a working robot yet. The four-tier memory system solves the storage problem - each tier optimized for its job. Caching prevents query explosion (LRU cache + batch queries + async). Most robots don't have this capability yet, and humanoids are incomplete without it. In 3 years, this will likely be standard.
Are you building in the robotics-AI space? How are you tackling these challenges? Do you wish someone would build the memory layer for robots? Should I take up the yaadeinDB project?
Feel free to share your thoughts or feedback in the comments section.
References
[1] Optimizing Inference for Long Context with NVFP4 KV Cache - NVIDIA Developer Blog, Dec 2025
[2] M-LLM Based Video Frame Selection for Efficient Video Understanding - CVPR 2025
[3] A Survey on Large Language Model Acceleration based on KV Cache - ArXiv 2024
[4] Understanding and Coding the KV Cache in LLMs - Sebastian Raschka's Magazine, Jun 2025
[5] Analyzing Temporal Information in Video Understanding - CVPR 2018
[6] TempMe: Video Temporal Token Merging for Efficient Video Understanding - ICLR 2025
[10] Pooling Layers in CNN - Giskard AI Glossary, 2025
[11] Foundation Models for Robotics: Vision-Language-Action - Blog Post, Dec 2024
[12] FOCUS: Efficient Keyframe Selection for Long Videos - ArXiv 2025
[13] Role of Pooling Layers in CNNs - Milvus.io Blog, 2025
[14] A Review of Recent Techniques for Person Re-Identification - ArXiv, Sep 2025
[15] RT-2: New model translates vision and language into action - DeepMind Blog, Jul 2023
[18] Memory and mental time travel in humans and social robots - PMC, Mar 2019
[19] Understanding Face Recognition: FaceNet vs Siamese Networks - Blog Post, 2024
[20] Episodic Memory Banks for Lifelong Robot Learning - OpenReview
[21] Face Recognition with Siamese Networks, Keras, and TensorFlow - PyImageSearch, Jan 2023
[24] Real-Time Execution of Action Chunking Flow Policies - ArXiv 2025
[26] CronusVLA: Towards Efficient and Robust Manipulation via Transferring Latent Motion Across Time - ArXiv 2025
[28] Vision-Language-Action Models: Concepts, Progress - Blog/Docs, 2025
[29] KV Cache Optimization in Transformers - Emergent Mind, Nov 2025
[30] Face Recognition in Children: A Longitudinal Study - ArXiv 2022; Longitudinal Analysis of Mugshots - PubMed 2017
[31] Running VLAs at Real-Time Speed - Emergent Mind 2025; ActionFlow: Real-Time Vision-Language-Action - ArXiv 2025
[32] Susceptibility to Image Resolution in Face Recognition - ArXiv 2021; Low-resolution face recognition studies - Multiple sources
[33] Facial Hair Area in Face Recognition Across Demographics - ArXiv 2024; Effects of Facial Hair on Face Recognition - IEEE 2025
[34] Impact of Partial Occlusion on Face Recognition - ArXiv 2023; Glasses and Sunglasses Effects - PubMed 2023
[35] Face Recognition in Children: A Longitudinal Study - ArXiv 2022; Young Face Aging Dataset Studies - ArXiv 2022
[36] Face Recognition on Jetson Orin Nano - NVIDIA Developer Forums 2024; Robust Multi-Sensor Facial Recognition in Real-Time using NVIDIA DeepStream - IJERT
[37] Jetson Orin Nano RAM Issues and Memory Optimization - NVIDIA Developer Forums 2024; NVIDIA Jetson Orin Nano Developer Kit Specifications - NVIDIA.com
[38] Multi-Model AI Resource Allocation for Humanoid Robots: A Survey on Jetson Orin Nano Super - DEV Community, ankk98, 2025
[39] Humanoid Compute: Price vs. Performance - DEV Community, ankk98, 2025