<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: susanayi</title>
    <description>The latest articles on DEV Community by susanayi (@susanayi).</description>
    <link>https://dev.to/susanayi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3686760%2Fbe7ab871-fe95-4237-8ac8-ce91fc7d9c21.jpg</url>
      <title>DEV Community: susanayi</title>
      <link>https://dev.to/susanayi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/susanayi"/>
    <language>en</language>
    <item>
      <title>memcpy</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 06 Apr 2026 04:19:43 +0000</pubDate>
      <link>https://dev.to/susanayi/memcpy-3cc8</link>
      <guid>https://dev.to/susanayi/memcpy-3cc8</guid>
      <description>&lt;h1&gt;
  
  
  Q: Why is memcpy safer than pointer casting for type punning?
&lt;/h1&gt;

&lt;h3&gt;
  
  
  💡 Concept in a Nutshell
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;memcpy&lt;/code&gt; is the &lt;strong&gt;"Official Copy Machine"&lt;/strong&gt; of C: It doesn't care if your data is a math book or a cookbook; it only sees "paper" (Bytes) and duplicates them from point A to point B without violating language laws.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Life Analogy (The Librarian vs. The Xerox)
&lt;/h3&gt;

&lt;p&gt;Imagine you have a &lt;strong&gt;Math Book&lt;/strong&gt; but you want to read it as a &lt;strong&gt;Cookbook&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pointer Casting (&lt;code&gt;*(int*)&amp;amp;f&lt;/code&gt;)&lt;/strong&gt;: This is like forcing a Librarian to read a Math book as a Cookbook. The Librarian will get confused because it violates the "Library Classification Rules" (&lt;strong&gt;Strict Aliasing&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;memcpy&lt;/code&gt;&lt;/strong&gt;: This is like putting the Math book into a &lt;strong&gt;Xerox machine&lt;/strong&gt;. The machine doesn't read the words; it just copies the ink onto new paper. Now you have "Cookbook-shaped" paper with "Math-ink" on it. It’s perfectly legal because the Xerox machine is allowed to touch any paper!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Code Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// "Photocopy" the 4 bytes of f into i&lt;/span&gt;
    &lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Memory content of float 3.14f = 0x%X&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Output: 0x4048F5C3 (on IEEE 754 systems)&lt;/span&gt;

    &lt;span class="c1"&gt;// It works backwards too!&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1078523331&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Int 1078523331 as float = %f&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Output: 3.140000&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Standard: C99 §6.5, Paragraph 7
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;char*&lt;/code&gt; Exception&lt;/strong&gt;: The strict aliasing rule lives in C99 §6.5 ¶7, and it carves out an explicit exception: &lt;code&gt;char*&lt;/code&gt; and &lt;code&gt;unsigned char*&lt;/code&gt; may alias any object type. Because &lt;code&gt;memcpy&lt;/code&gt; is specified to copy the object representation byte-by-byte, as if through &lt;code&gt;unsigned char*&lt;/code&gt;, it never violates the &lt;strong&gt;Strict Aliasing Rule&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Key Techniques (Why it Works)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;memcpy&lt;/code&gt;&lt;/strong&gt;: The most robust way to copy bits between types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler Optimization&lt;/strong&gt;: Modern compilers (GCC/Clang) recognize the fixed-size &lt;code&gt;memcpy&lt;/code&gt; idiom for type punning. On Arm64 or x86_64, they typically compile it down to a single move instruction (&lt;code&gt;fmov&lt;/code&gt; or &lt;code&gt;movd&lt;/code&gt;), meaning &lt;strong&gt;zero function call overhead&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Warning &amp;amp; Pro-Tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Overlap Trap&lt;/strong&gt;: &lt;code&gt;memcpy&lt;/code&gt; requires that the source and destination do NOT overlap; copying between overlapping regions is undefined behavior. If they might overlap, always use &lt;code&gt;memmove&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size Matters&lt;/strong&gt;: Ensure &lt;code&gt;sizeof(dest) &amp;gt;= sizeof(src)&lt;/code&gt; to avoid buffer overflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler Flag&lt;/strong&gt;: In large projects (like the Linux Kernel), you might see &lt;code&gt;-fno-strict-aliasing&lt;/code&gt; used to relax these rules globally.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>c</category>
      <category>programming</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How I Designed a Camera Scoring System for VLM-Based Activity Recognition — and Why It Looks Different in the Real World</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Tue, 31 Mar 2026 03:08:41 +0000</pubDate>
      <link>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-5fj6</link>
      <guid>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-5fj6</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of the "Training-Free Home Robot" series. Part 1 covered why fixed ceiling-mounted nodes ended up as the perception foundation. This post goes deep on one specific algorithm: how the system decides which camera angle to use for each behavioral episode, and what that decision looks like when you leave the simulation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Once I'd worked through &lt;em&gt;why&lt;/em&gt; fixed global cameras made sense — a conclusion I reached the hard way, starting from genuine skepticism about the requirement — the next problem was entirely mine: given twelve candidate viewpoints, which one do you actually use?&lt;/p&gt;

&lt;p&gt;My advisor specified the input modality. The selection algorithm, the scoring weights, the hard FOV gate, the fallback logic — none of that was given to me. This post is that design work: where each decision came from, what tradeoffs it makes, and what changes when you move from a Unity simulation to a real room.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;My system recognizes what a user is doing — drinking, reading, typing — by sending a camera image to a Vision-Language Model (VLM). The VLM is zero-shot: no training data, no fine-tuning. It just sees an image and describes what's happening.&lt;/p&gt;

&lt;p&gt;This creates a hard dependency: &lt;strong&gt;VLM accuracy is directly tied to image quality, and image quality is directly tied to viewpoint selection.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A trained activity recognition model can partially compensate for bad viewpoints — it has seen thousands of occluded or off-angle examples during training. A zero-shot VLM cannot. If the user is at the edge of the frame, or partially behind furniture, the VLM produces unreliable output: "a person standing near a wall" instead of "a person drinking from a bottle."&lt;/p&gt;

&lt;p&gt;So before any AI inference happens, the system needs to answer: &lt;strong&gt;which of the twelve available camera nodes will produce the most useful image right now?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Pick the Closest Node?
&lt;/h2&gt;

&lt;p&gt;The naive approach is distance-only: pick the node closest to the user. But distance alone misses two critical failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Occlusion.&lt;/strong&gt; A node 1.5m away from the user, directly behind a sofa, produces a completely blocked image. A node 4m away with a clear line of sight is far more useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off-axis angle.&lt;/strong&gt; A node positioned to the side of a user who is facing a desk will capture a profile view at best, and the back of the user's head at worst. VLMs strongly prefer frontal or near-frontal views for activity recognition — they're trained on internet images where people face the camera.&lt;/p&gt;

&lt;p&gt;Distance matters, but it's one factor among three.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scoring Formula
&lt;/h2&gt;

&lt;p&gt;I ended up with a weighted combination of three geometric factors, plus a hard gate that runs before any of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 0 — Hard FOV Gate
&lt;/h3&gt;

&lt;p&gt;Before computing any score, I check whether the user even falls within the node's field of view cone. If not, the node is excluded immediately — score = 0, no further calculation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if θ_i &amp;gt; FOV_i / 2  →  s_i = 0   (hard gate, skip remaining calculation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where θ_i is the angle between the node's forward direction and the vector pointing toward the user's chest. The aim point is set at chest height: &lt;code&gt;aim = user.position + (0, 1.2, 0)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This gate matters more than it might seem. Without it, the weighted formula can assign a non-zero score to a node that physically cannot see the user — it just happens to be close or have good visibility in a different direction. Hard gating eliminates this entire class of bad selections before the arithmetic starts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Visibility Factor v_i
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v_i = 1   if linecast(node → user chest) is unobstructed
      0   otherwise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A physics linecast from the node position to the user's chest. If it hits furniture or a wall, v_i = 0. This is binary — either there's a clear path or there isn't.&lt;/p&gt;

&lt;p&gt;Weight: &lt;strong&gt;0.5&lt;/strong&gt; — the highest weight, because an occluded node is nearly useless regardless of its other properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — Angle Factor α_i
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;α_i = max(0,  1 - θ_i / (FOV_i / 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This maps the user's angular position within the FOV cone to a continuous score: 1.0 at dead center, 0.0 at the FOV boundary. A node where the user appears near the edge of frame gets a low angle score even if the linecast is clear.&lt;/p&gt;

&lt;p&gt;Weight: &lt;strong&gt;0.3&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Distance Factor d_i
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;d_i = max(0,  1 - dist(node, user_chest) / 10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linear decay from 1.0 at 0m to 0.0 at 10m. I chose the 10m cutoff after observing that the largest room in my simulation is about 6m across — so 10m means a node in the opposite corner of the largest room still gets a non-zero distance score, but it's clearly penalized.&lt;/p&gt;

&lt;p&gt;Weight: &lt;strong&gt;0.2&lt;/strong&gt; — lowest weight, because distance matters less than occlusion or angle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Score
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;m_i ∈ [0.5, 1.0] is a per-node multiplier set in the Unity Inspector, allowing me to manually downweight nodes with known limitations (e.g., a node that points toward a window and produces glare in afternoon light).&lt;/p&gt;

&lt;p&gt;Nodes with s_i ≥ 0.50 are admitted to the candidate list, sorted descending. The top-2 are captured.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pseudocode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ScoreCamerasRanked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cameras&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;s_min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;aimPos&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;qualified&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cameras&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nx"&gt;θ&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nf"&gt;angle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aimPos&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FOV&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;                    &lt;span class="c1"&gt;// hard FOV gate&lt;/span&gt;

        &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nc"&gt;Linecast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;clear&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="nx"&gt;α&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FOV&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;α&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;multiplier&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="err"&gt;≥&lt;/span&gt; &lt;span class="nx"&gt;s_min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;qualified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;qualified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Design — The Honest Answer
&lt;/h2&gt;

&lt;p&gt;I want to be direct about something: &lt;strong&gt;this scoring formula exists largely because of hardware constraints, not because it's the theoretically optimal solution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My simulation runs on a single workstation. I have one physical camera in the Unity scene that teleports to each selected node position, renders a frame, and moves on. I could not run twelve simultaneous cameras without multiplying rendering cost by twelve. Even in simulation, I needed a fast, lightweight way to rank nodes without actually rendering from all of them first.&lt;/p&gt;

&lt;p&gt;The weighted formula with three geometric factors fits that constraint perfectly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's &lt;strong&gt;O(N)&lt;/strong&gt; where N = number of nodes — trivially fast even for N = 100&lt;/li&gt;
&lt;li&gt;It uses only &lt;strong&gt;spatial coordinates and angles&lt;/strong&gt; — no image rendering required&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;interpretable&lt;/strong&gt; — when a node scores poorly, I can immediately see why (was it the occlusion? the angle? the distance?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A more sophisticated approach would render a low-resolution thumbnail from each candidate node and run a quick quality assessment model on it before selecting. This would catch cases the geometric formula misses — a node with a clear linecast but the user facing directly away from it, for instance. But that requires N renders per selection decision, which was not feasible on my hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical tradeoff:&lt;/strong&gt; the geometric formula is fast and correct in the common case. It fails primarily when the user's facing direction is not aligned with the node's line of sight — a limitation I document explicitly in the thesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Simulation vs. Reality Gap
&lt;/h2&gt;

&lt;p&gt;Everything above runs in Unity. Translating this to a physical room with real IP cameras introduces three gaps that simulation completely sidesteps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 1: You Don't Know Where the User Is
&lt;/h3&gt;

&lt;p&gt;In Unity, &lt;code&gt;user.position&lt;/code&gt; is available as a ground-truth Vector3 — the exact world coordinates of the character, updated every frame.&lt;/p&gt;

&lt;p&gt;In a real room, you don't have this. You need to estimate the user's position from the cameras themselves (using person detection + depth estimation or triangulation), from wearables, or from floor sensors. Each of these introduces estimation error that flows directly into the scoring formula.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridging approach:&lt;/strong&gt; Use the fixed-node cameras to run a lightweight person detector (e.g., YOLOv8-nano) and estimate 2D floor position via homography. This gives approximate (x, z) coordinates sufficient for the scoring formula, even without depth sensors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 2: &lt;code&gt;node.forward&lt;/code&gt; Requires Extrinsic Calibration
&lt;/h3&gt;

&lt;p&gt;In Unity, every node's position and forward direction is set in the editor — exact, zero-error, always current. In a real room, you need to physically calibrate each camera's extrinsic parameters (position and orientation relative to a shared world coordinate frame).&lt;/p&gt;

&lt;p&gt;Calibration drift is real: a camera that shifts 2cm from vibration or accidental contact changes its linecast origin enough to affect visibility calculations, particularly for borderline cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridging approach:&lt;/strong&gt; ArUco marker-based calibration at installation time, with periodic re-verification. Store calibration parameters in a config file that feeds into the scoring formula at runtime. Flag nodes whose calibration is older than a threshold for re-calibration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 3: Linecast ≠ Real-World Occlusion
&lt;/h3&gt;

&lt;p&gt;Unity's linecast is a perfect, instantaneous ray through a static collision mesh. In a real room, occlusion is dynamic (people, pets, moved furniture), partially transparent (glass tables, thin curtains), and probabilistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridging approach:&lt;/strong&gt; Replace the binary linecast with a &lt;strong&gt;visibility probability&lt;/strong&gt; estimated from the camera's own feed. If the selected node's image shows the user partially occluded in the previous frame, reduce its score for the current selection. This creates a feedback loop: actual image quality informs future node selection.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Scoring Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;In the Unity simulation, I visualize node scores using Gizmos in the Scene View:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green sphere&lt;/strong&gt; — score ≥ 0.50, admitted to candidate list&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yellow sphere&lt;/strong&gt; — score between 0.35 and 0.50, near threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red sphere&lt;/strong&gt; — score &amp;gt; 0 but below 0.35&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gray sphere&lt;/strong&gt; — FOV-gated, score = 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During experiment setup, I use this visualization to verify that at least two nodes per room reliably score green for each behavioral spot (the sofa, the desk, the kitchen counter). If a room has only one reliably green node for a given spot, I reposition nodes before running experiments.&lt;/p&gt;

&lt;p&gt;This debugging workflow — spatial visualization of scores before running inference — turned out to be as important as the formula itself. The formula is only as good as the node placement it operates on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Design Decision&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;th&gt;Real-World Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hard FOV gate before weighted sum&lt;/td&gt;
&lt;td&gt;Prevents scoring nodes that can't see the user&lt;/td&gt;
&lt;td&gt;Same gate applies; requires accurate extrinsic calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linecast for visibility&lt;/td&gt;
&lt;td&gt;Fast, exact in simulation&lt;/td&gt;
&lt;td&gt;Replace with visibility probability from live feed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chest height aim point (1.2m)&lt;/td&gt;
&lt;td&gt;Captures torso, most informative for activity recognition&lt;/td&gt;
&lt;td&gt;Same; depth camera or pose estimator needed for accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-2 node capture&lt;/td&gt;
&lt;td&gt;Handles single-node occlusion failures&lt;/td&gt;
&lt;td&gt;Same strategy; second node is insurance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-node multiplier m_i&lt;/td&gt;
&lt;td&gt;Manual override for known problem nodes&lt;/td&gt;
&lt;td&gt;Useful for flagging nodes with fixed environmental issues (glare, permanent obstruction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The scoring formula is a pragmatic solution built around a specific hardware constraint: one rendering camera, twelve virtual viewpoints, a need for selection to be fast and interpretable. It works well in simulation, and the geometric logic transfers cleanly to a real deployment — but the inputs to the formula (user position, node orientation, occlusion) all need real-world measurement pipelines that simulation provides for free.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: how the captured images feed into a zero-shot VLM pipeline, and how SBERT semantic normalization maps free-form VLM descriptions to canonical behavior labels without any training data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Full thesis: "Personalized Proactive Service in Smart Home Robots: A Training-Free Visual Perception Framework Integrating VLM-Based Scene Grounding, RAG Memory, and Manifold Learning" — NCKU, 2025.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>algorithms</category>
      <category>unity3d</category>
      <category>ai</category>
    </item>
    <item>
      <title>Embodied AI: Why I Gave My Home Robot an "Eye in the Sky"</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Tue, 31 Mar 2026 03:08:41 +0000</pubDate>
      <link>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-3gam</link>
      <guid>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-3gam</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of a series on building a training-free home service robot using VLMs, RAG memory, and manifold learning. This post covers the camera architecture — specifically, why fixed ceiling-mounted nodes ended up as the foundation of the whole perception system.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Honestly, my first instinct was: why not just use the robot's onboard camera?&lt;/p&gt;

&lt;p&gt;It's the obvious answer. The robot is already in the room. It already has a camera. Adding twelve fixed ceiling nodes sounds like unnecessary complexity — more hardware, more calibration, more failure points, for a system that was already complicated enough.&lt;/p&gt;

&lt;p&gt;My advisor's requirement was firm: the AI pipeline must take its visual input from fixed global cameras, not from the robot itself. No negotiation on that point.&lt;/p&gt;

&lt;p&gt;So I spent a while sitting with that constraint, trying to understand it rather than just comply with it. This post is what I figured out. It starts with the genuine question I had — &lt;em&gt;why global cameras at all?&lt;/em&gt; — and ends with the engineering decisions I made once I accepted the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with a Robot's Point of View
&lt;/h2&gt;

&lt;p&gt;The case for onboard vision is intuitive: the robot is mobile, so its camera goes wherever the action is. But "wherever the action is" turns out to be exactly the problem.&lt;/p&gt;

&lt;p&gt;A robot-mounted camera, positioned anywhere from 30cm to 100cm off the ground, has two fundamental problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Narrow field of view and constant occlusion.&lt;/strong&gt; The robot sees the world from a low, mobile, first-person perspective. A sofa blocks the view of the person sitting behind it. The kitchen wall hides what's happening at the dining table. From the robot's perspective, the home is a maze of partial information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Motion blur during navigation.&lt;/strong&gt; When the robot is moving — which is most of the time — its onboard camera is not producing reliable still frames. Activity recognition from blurry, unstabilized video is significantly harder than recognition from a fixed viewpoint.&lt;/p&gt;

&lt;p&gt;These aren't engineering failures. They're intrinsic to the geometry of a mobile, ground-level camera. No amount of better hardware changes the fundamental constraint.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Fixed Nodes + a Single Moving Camera
&lt;/h2&gt;

&lt;p&gt;My solution was to separate &lt;em&gt;global perception&lt;/em&gt; from &lt;em&gt;local action&lt;/em&gt;. The robot handles physical interaction. A set of fixed-position camera nodes handles scene understanding.&lt;/p&gt;

&lt;p&gt;In my system, twelve &lt;code&gt;CameraNode&lt;/code&gt; objects are distributed across three rooms (four per room) at ceiling height — approximately 2.3m — simulating the kind of fixed IP camera array you might mount in a real home. These nodes don't move, don't occlude each other, and always have a stable, overhead view of the space.&lt;/p&gt;

&lt;p&gt;But here's the key engineering constraint: &lt;strong&gt;I only have one physical camera in the scene.&lt;/strong&gt; Rather than instantiating twelve separate cameras (expensive and redundant), I use a single camera that teleports to each selected node position, renders a frame, and moves on. The &lt;code&gt;VirtualCameraBrain&lt;/code&gt; component manages this process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each selected node (top-N by score):
    camera.transform ← node.position + node.rotation
    wait 2 frames                    // GPU render flush
    capture 512×512 PNG → Base64

POST all images to /predict in one request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives me multi-viewpoint coverage with minimal rendering overhead.&lt;/p&gt;
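&lt;p&gt;On the Unity side this loop ends in a single HTTP request. Here is a minimal Python sketch of the equivalent packaging step, purely to make the payload concrete. The &lt;code&gt;/predict&lt;/code&gt; endpoint comes from the pseudocode above; the field names are my working assumptions, not a published contract:&lt;/p&gt;

```python
import base64
import json

def build_predict_payload(frames):
    """Bundle raw PNG bytes from the selected nodes into one request body.

    frames: list of (node_id, png_bytes) tuples captured by the roaming camera.
    """
    images = []
    for node_id, png_bytes in frames:
        images.append({
            "node_id": node_id,
            "image_b64": base64.b64encode(png_bytes).decode("ascii"),
        })
    # One POST per event: all selected viewpoints travel together.
    return json.dumps({"images": images}).encode("utf-8")

# Sending it is one stdlib call (the URL is illustrative):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:8000/predict",
#                                data=build_predict_payload(frames),
#                                headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```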




&lt;h2&gt;
  
  
  The Harder Problem: Which Node Do You Pick?
&lt;/h2&gt;

&lt;p&gt;Having twelve nodes is useless if you pick the wrong one. A node behind the user, or one with the user at the edge of its field of view, produces an image the VLM can't interpret reliably.&lt;/p&gt;

&lt;p&gt;I needed a scoring function. Here's what I settled on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v_i&lt;/strong&gt; — Visibility: does a linecast from the node to the user's chest reach without hitting furniture? (0 or 1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;α_i&lt;/strong&gt; — Angle factor: how centered is the user in the node's field of view? (1 at dead center, 0 at the FOV edge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;d_i&lt;/strong&gt; — Distance factor: linear decay from 0m to 10m&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;m_i&lt;/strong&gt; — A per-node priority multiplier, set in the Inspector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important rule comes before the weighted sum: &lt;strong&gt;hard FOV gating&lt;/strong&gt;. If the user falls outside the node's field of view cone, that node gets score = 0 immediately, no matter how good its distance or linecast result. There's no point in a weighted calculation for a camera that can't even see the target.&lt;/p&gt;

&lt;p&gt;After scoring, I sort candidates descending and capture from the top-2 nodes (configurable). Two viewpoints handle the cases where one node is slightly occluded.&lt;/p&gt;
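&lt;p&gt;The whole selection pass is compact enough to sketch. The production logic lives in C#, so treat this Python version as a readable model of the math rather than the actual code; the dictionary keys are invented for the sketch:&lt;/p&gt;

```python
# Weights from the formula above: visibility 0.5, angle 0.3, distance 0.2.
W_VIS, W_ANG, W_DIST = 0.5, 0.3, 0.2
MAX_RANGE_M = 10.0   # distance factor decays linearly to zero at 10 m

def clamp01(x):
    return max(0.0, min(1.0, x))

def score_node(visible, angle_deg, half_fov_deg, dist_m, multiplier=1.0):
    """Score one node; hard FOV gating zeroes everything outside the view cone."""
    alpha = clamp01(1.0 - angle_deg / half_fov_deg)  # 1 at dead center, 0 at edge
    gate = 0.0 if alpha == 0.0 else 1.0              # outside the cone: score is 0
    d = clamp01(1.0 - dist_m / MAX_RANGE_M)
    v = 1.0 if visible else 0.0                      # linecast result
    return gate * (v * W_VIS + alpha * W_ANG + d * W_DIST) * multiplier

def pick_top_nodes(candidates, top_n=2):
    """Rank candidate nodes descending and keep the best top_n viewpoints."""
    ranked = sorted(candidates,
                    key=lambda c: score_node(c["visible"], c["angle_deg"],
                                             c["half_fov_deg"], c["dist_m"],
                                             c.get("multiplier", 1.0)),
                    reverse=True)
    return ranked[:top_n]
```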




&lt;h2&gt;
  
  
  Why This Matters for the VLM
&lt;/h2&gt;

&lt;p&gt;The whole reason I care about viewpoint quality is downstream accuracy. My system uses &lt;code&gt;llava-phi3&lt;/code&gt; (via Ollama) to recognize what the user is doing — drinking, sitting, reading, typing — without any task-specific training.&lt;/p&gt;

&lt;p&gt;VLMs are sensitive to image quality in ways that trained classifiers aren't. A trained activity recognition model can learn to compensate for partial occlusion if it sees enough occluded examples during training. A zero-shot VLM cannot — it has to interpret what it sees without that learned correction.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;camera selection directly controls recognition accuracy&lt;/strong&gt;. In early testing, episodes where the scoring system chose a poor viewpoint (user at the edge of frame, or partially behind furniture) produced VLM outputs like "a person standing near a wall" instead of "a person drinking from a bottle." The SBERT normalization layer handled some of this, but the better fix was improving the viewpoint selection upstream.&lt;/p&gt;
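&lt;p&gt;To make that normalization step concrete: the idea is to embed both the canonical labels and the free-form VLM sentence, then snap to the nearest label by cosine similarity. The sketch below fakes the embeddings with tiny hand-made vectors; the real system uses SBERT embeddings, and both the vectors and the label set here are toy stand-ins:&lt;/p&gt;

```python
import math

# Toy 3-d "embeddings". In the real pipeline an SBERT model produces these;
# the numbers below are fabricated purely to show the mechanics.
CANONICAL = {
    "Drinking": [0.9, 0.1, 0.0],
    "Reading":  [0.1, 0.9, 0.0],
    "Typing":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize_label(description_embedding):
    """Snap an embedded free-form description to the closest canonical label."""
    return max(CANONICAL,
               key=lambda lbl: cosine(CANONICAL[lbl], description_embedding))
```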




&lt;h2&gt;
  
  
  The Gap Between Simulation and Reality
&lt;/h2&gt;

&lt;p&gt;I want to be honest about where my current implementation sits. Everything described above runs in a Unity 3D simulation. The "nodes" are virtual GameObjects. The "camera" is Unity's rendering engine. The coordinate streams come from &lt;code&gt;DynamicSyncManager.cs&lt;/code&gt;, not from depth sensors or object detection.&lt;/p&gt;

&lt;p&gt;This is intentional — I'm using simulation to validate the framework before committing to physical hardware. But it means two real-world problems remain unsolved:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extrinsic calibration.&lt;/strong&gt; In a real deployment, every pixel (u, v) from each fixed node must be mapped to a shared 3D coordinate system. This requires physical calibration of each camera's position and orientation relative to the room — a process that takes significant setup time and re-calibration whenever a camera is moved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency compensation.&lt;/strong&gt; Network transmission from a wall-mounted IP camera to the processing backend introduces roughly 50–150ms of latency. For a moving user, this means the position data you receive corresponds to where they &lt;em&gt;were&lt;/em&gt;, not where they &lt;em&gt;are&lt;/em&gt;. You need prediction — either simple linear extrapolation or a Kalman filter — to compensate.&lt;/p&gt;
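&lt;p&gt;The linear-extrapolation version of that compensation is only a few lines. A sketch, assuming two timestamped position samples and a rough latency estimate (a Kalman filter would replace this wholesale):&lt;/p&gt;

```python
def extrapolate_position(p_prev, p_curr, dt_between, latency_s):
    """Predict where a moving user is now from two stale (x, y) samples.

    dt_between: seconds between the two samples.
    latency_s:  estimated network plus processing delay, e.g. 0.05 to 0.15 s.
    """
    vx = (p_curr[0] - p_prev[0]) / dt_between
    vy = (p_curr[1] - p_prev[1]) / dt_between
    # Push the last known position forward by the estimated staleness.
    return (p_curr[0] + vx * latency_s, p_curr[1] + vy * latency_s)

# A user walking 1 m/s along x, observed with 100 ms of latency, is
# predicted about 10 cm ahead of the last reported position.
```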

&lt;p&gt;My simulation sidesteps both of these by giving me ground-truth coordinates directly from the Unity scene. That's a real gap, and I'm documenting it explicitly in the thesis as a limitation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cooperative Offloading: What This Enables for the Robot
&lt;/h2&gt;

&lt;p&gt;The architectural payoff of this design is that the robot itself doesn't need to run heavy vision inference. The fixed-node perception pipeline handles scene understanding and transmits lightweight metadata to the robot's decision layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User_Mom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Reading"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pos"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;-0.17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;8.62&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"room"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"LivingRoom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Drink"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.74&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The robot receives a pre-processed summary — who is where, what they're doing, and what they're likely to want next — rather than raw pixels. This is the core of the "ambient intelligence offloads to embodied intelligence" architecture.&lt;/p&gt;

&lt;p&gt;For a battery-powered physical robot, this matters a lot. Running a VLM inference pipeline continuously on an embedded GPU drains a battery in under an hour. Running it on a wall-powered backend and sending metadata over Wi-Fi costs almost nothing on the robot side.&lt;/p&gt;




&lt;h2&gt;
  
  
  Perception Mode Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Onboard Camera&lt;/th&gt;
&lt;th&gt;Fixed Node Array&lt;/th&gt;
&lt;th&gt;What My System Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Field of view&lt;/td&gt;
&lt;td&gt;Local, low, easily occluded&lt;/td&gt;
&lt;td&gt;Global, overhead, stable&lt;/td&gt;
&lt;td&gt;Nodes handle scene understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference load&lt;/td&gt;
&lt;td&gt;Runs on robot battery&lt;/td&gt;
&lt;td&gt;Runs on wall-powered backend&lt;/td&gt;
&lt;td&gt;VLM runs on backend only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coordinate source&lt;/td&gt;
&lt;td&gt;Estimated from robot odometry&lt;/td&gt;
&lt;td&gt;Direct from scene/sensors&lt;/td&gt;
&lt;td&gt;Unity scene (simulation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calibration&lt;/td&gt;
&lt;td&gt;Built into robot&lt;/td&gt;
&lt;td&gt;Requires room-level setup&lt;/td&gt;
&lt;td&gt;Skipped in simulation; needed in real deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Occlusion, motion blur&lt;/td&gt;
&lt;td&gt;Network latency, fixed FOV gaps&lt;/td&gt;
&lt;td&gt;Fallback: retry with next-best node&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The next post in this series covers how I use these captured images as input to a zero-shot VLM pipeline, and how SBERT semantic normalization maps the free-form VLM descriptions to canonical behavior labels — without any training data.&lt;/p&gt;

&lt;p&gt;If you're building something similar, or have dealt with the extrinsic calibration problem in a real deployment, I'd love to hear how you approached it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the "Training-Free Home Robot" series. The full system integrates VLM perception, a Behavioral Scene Graph memory layer (FAISS + MongoDB), and UMAP manifold learning for proactive intent prediction.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>robotics</category>
      <category>python</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Robotic Brain for Elder Care 3</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 09:15:56 +0000</pubDate>
      <link>https://dev.to/susanayi/robotic-brain-for-elder-care-3-5g2p</link>
      <guid>https://dev.to/susanayi/robotic-brain-for-elder-care-3-5g2p</guid>
      <description>&lt;h1&gt;
  
  
  Part 3: The Scoring Engine — How a Robot Selects the Perfect Viewpoint
&lt;/h1&gt;

&lt;p&gt;In the previous post, we discussed the "Single Camera + 12 Virtual Nodes" strategy to overcome simulation lag. But with 4 potential nodes in a single room, how does the system "decide" which one provides the best data for our AI backend?&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;StaticCameraManager&lt;/strong&gt; comes in. Instead of random selection, we use a &lt;strong&gt;Heuristic Scoring Algorithm&lt;/strong&gt; to rank viewpoints based on three physical constraints: &lt;strong&gt;Visibility, Angle, and Distance.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scoring Formula
&lt;/h2&gt;

&lt;p&gt;To quantify the quality of each viewpoint, the system evaluates all registered nodes in the room using a weighted heuristic:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;FinalScore = (Visibility × 0.5) + (AngleFactor × 0.3) + (DistanceFactor × 0.2)&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By assigning the highest weight (50%) to &lt;strong&gt;Visibility&lt;/strong&gt;, we ensure the robot never prioritizes a "perfect" angle if the person is obscured by furniture or walls.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Visibility: The Raycast Test (50%)
&lt;/h2&gt;

&lt;p&gt;The most fundamental requirement is a clear line of sight. We use Unity’s &lt;code&gt;Physics.Linecast&lt;/code&gt; to check for obstacles between the camera node and the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 2：Visibility (Linecast Occlusion)&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;vis&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Physics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Linecast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodePos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="n"&gt;RaycastHit&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Check if the hit object is the user or a part of the user&lt;/span&gt;
    &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;hitUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsChildOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;hitUser&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;vis&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Blocked by furniture or walls&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the raycast is blocked, the visibility score drops to &lt;strong&gt;0&lt;/strong&gt;, effectively disqualifying the node regardless of other factors.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Angle Factor: Semantic Clarity (30%)
&lt;/h2&gt;

&lt;p&gt;For action recognition, front or side views are more informative than back views, and a user centered in the frame is easier to read than one at the edge. The implemented factor measures how centered the user is: we normalize the angle between the node's forward vector and the user against half the field of view, giving 1 at dead center and 0 at the FOV edge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 3：Angle Factor (Normalized FOV center)&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;angleFactor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Mathf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clamp01&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1f&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;angle&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="n"&gt;halfFov&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Case Study: Drinking Behavior
&lt;/h3&gt;

&lt;p&gt;While multiple nodes might have visibility, our algorithm selects the one that best captures the drinking gesture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rb8rqtmojuajyf5nf40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rb8rqtmojuajyf5nf40.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate A (Side-Back) - The hand-to-mouth action is partially obscured by the user's shoulder.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc696zvr2o2p1ez15a4x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc696zvr2o2p1ez15a4x9.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate B (Side-Front) - Higher Angle Score. The interaction with the bottle is clearly visible for the VLM.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Distance Factor: The Golden Range (20%)
&lt;/h2&gt;

&lt;p&gt;A camera too far away loses pixel density. We want the user inside the "Golden Range" of roughly 2 to 5 meters, which we approximate with a simple linear decay: the factor is 1 at 0 m and falls to 0 at 10 m, so nodes in the golden range still score comfortably high.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 4：Distance Factor (10m Linear Decay)&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vector3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodePos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;distFactor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Mathf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clamp01&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1f&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
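&lt;p&gt;Combining the three snippets, the ranking collapses to one weighted sum per node. A Python sketch with illustrative numbers (the production code is the C# shown above):&lt;/p&gt;

```python
def final_score(vis, angle_factor, dist_factor):
    # Weights from the formula: 50% visibility, 30% angle, 20% distance.
    return vis * 0.5 + angle_factor * 0.3 + dist_factor * 0.2

# A visible node with the user slightly off-center (0.8) at 4 m (factor 0.6)
# scores about 0.86, beating an occluded node with perfect angle and distance
# at 0.50, because visibility carries half the total weight.
```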



&lt;h3&gt;
  
  
  Case Study: Typing Interaction
&lt;/h3&gt;

&lt;p&gt;At the desk, the distance and angle combined determine the best viewpoint to capture hand-to-keyboard interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34bir9y6rtf2mk0igal7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34bir9y6rtf2mk0igal7.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate C - Although the angle is okay, the distance reduces the semantic detail of the typing action.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72xw452brogm43hxq3kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72xw452brogm43hxq3kk.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate D - Optimal Distance &amp;amp; Angle. The high-angle perspective provides a clear view of the hands on the keyboard.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Visualizing the Logic: Debugging with Gizmos
&lt;/h2&gt;

&lt;p&gt;As an engineer, I need to verify the math in real-time. I implemented a custom &lt;code&gt;OnDrawGizmos&lt;/code&gt; system that color-codes nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: High Score (&amp;gt; 0.5) — Ready for capture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red/Grey&lt;/strong&gt;: Low Score or Out of FOV — Disqualified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This visual feedback allowed us to fine-tune our thresholds, ensuring the &lt;strong&gt;VirtualCameraBrain&lt;/strong&gt; only teleports to locations that provide high-quality data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Now that we have selected the "Best Viewpoint," the final step is execution. In the next post, we will look at the &lt;strong&gt;VirtualCameraBrain&lt;/strong&gt; implementation: &lt;strong&gt;Base64 encoding&lt;/strong&gt; and &lt;strong&gt;REST API transmission&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Robotic Brain for Elder Care 2</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:49:08 +0000</pubDate>
      <link>https://dev.to/susanayi/virtual-nodes-and-the-single-camera-strategy-3h02</link>
      <guid>https://dev.to/susanayi/virtual-nodes-and-the-single-camera-strategy-3h02</guid>
      <description>&lt;h2&gt;
  
  
  Part 2: Virtual Nodes and the Single-Camera Strategy — Overcoming Simulation Lag
&lt;/h2&gt;

&lt;p&gt;In building an indoor perception system for elder care, the standard intuition is to deploy multiple live cameras to monitor daily routines. During our early development stage using NVIDIA Isaac Sim, we followed this path, experimenting with high-bandwidth sensor data like &lt;strong&gt;depth images and point clouds&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The Performance Trap of Multi-Camera Rendering
&lt;/h2&gt;

&lt;p&gt;However, we quickly encountered a critical performance wall. Simultaneously rendering and publishing data from multiple active cameras in any simulation engine (Unity or Isaac Sim) is a recipe for performance disaster. It consumes massive GPU memory (VRAM) and creates significant lag.&lt;/p&gt;

&lt;p&gt;In our tests, images would queue for an unacceptably long time before ever entering the AI pipeline. For an elder-care system aiming at real-time interaction, this lag made subsequent VLM reasoning and intent prediction impossible.&lt;/p&gt;

&lt;p&gt;To prioritize practicality and focus on Robotics VLM and semantic research, we made a strategic decision: we bypassed the overhead of ROS 2 and transitioned to a custom, lightweight Unity-to-Python pipeline based on Event-Triggered RGB Transmission.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: A 12-Node Virtual Network
&lt;/h2&gt;

&lt;p&gt;Our architecture rests on a deliberate separation between &lt;strong&gt;spatial metadata&lt;/strong&gt; (Where can we see the user?) and &lt;strong&gt;rendering overhead&lt;/strong&gt; (When do we actually draw the pixels?). &lt;/p&gt;

&lt;p&gt;We deployed a network of 12 Virtual Camera Nodes across the simulated home. Instead of active cameras, these are lightweight "Empty Objects" that serve as potential observation posts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn18n65j6vam6ut9jf7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn18n65j6vam6ut9jf7y.png" alt="A top-down 3D view of the Unity-based home simulation. The layout is divided into three main experimental zones: Dad's Room (Blue), Living Room (Green), and Kitchen (Red). Each zone contains 4 virtual camera nodes, totaling 12 nodes across the environment." width="800" height="463"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Our experimental test bed featuring 12 virtual nodes. Each room (Kitchen, Living Room, and Dad's Room) is equipped with 4 specific viewpoints.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As shown in Figure 1, these 12 nodes incur &lt;strong&gt;zero rendering cost&lt;/strong&gt; while idle. In Unity, they are merely &lt;code&gt;Transform&lt;/code&gt; components (coordinates and forward vectors). This allows us to maintain a high simulation frame rate (FPS) while having 12 different perspectives available at any moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Comparison: Why This is "Reasonable"
&lt;/h3&gt;

&lt;p&gt;The table below illustrates the evolution from "Brute-force Rendering" to "Smart Orchestration":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Legacy Strategy (Isaac Sim Experience)&lt;/th&gt;
&lt;th&gt;Optimized Strategy (Current Unity Architecture)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Camera Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple Active Real Cameras&lt;/td&gt;
&lt;td&gt;Single Real Camera + 12 Virtual Nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rendering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuous Parallel Rendering&lt;/td&gt;
&lt;td&gt;Event-Triggered "Teleport &amp;amp; Capture"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High VRAM &amp;amp; Draw Calls&lt;/td&gt;
&lt;td&gt;Zero Idle Cost for Nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long Image Queue (Lag)&lt;/td&gt;
&lt;td&gt;Real-time Sync (Stable 60 FPS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depth &amp;amp; Point Cloud (Heavy)&lt;/td&gt;
&lt;td&gt;Streamlined RGB (VLM Optimized)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ROS 2 Middleware&lt;/td&gt;
&lt;td&gt;Custom High-speed REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Mechanism: The "Smart Eye" Teleportation
&lt;/h2&gt;

&lt;p&gt;We then employ a &lt;strong&gt;Single Rendering Camera&lt;/strong&gt;—our "Smart Eye." The logic is orchestration rather than brute-force rendering. &lt;/p&gt;

&lt;p&gt;When our system detects a significant state change (e.g., transitioning from 'Standing' to 'Drinking'), the high-level "brain" is invoked. But instead of processing 12 simultaneous streams, we use a &lt;strong&gt;Teleport-and-Capture&lt;/strong&gt; strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Event Trigger: The system detects a meaningful action from &lt;code&gt;UserEntity&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Optimal Node Selection: Our heuristic scoring algorithm analyzes the 4 nodes within the current room to find the best angle, considering distance and occlusions.&lt;/li&gt;
&lt;li&gt; Instant Capture: The single physical camera &lt;strong&gt;"teleports"&lt;/strong&gt; to the selected optimal node, captures the frame, and sends it to the Python backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design is not just a workaround for simulation lag; it is a pragmatic reflection of real-world constraints. In a real smart home, streaming 24/7 high-resolution video from 12 cameras would overwhelm most residential networks. By mirroring the behavior of smart surveillance systems—where resources are allocated only when an event occurs—we ensure maximum data integrity and system reliability.&lt;/p&gt;
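&lt;p&gt;On the Python side, the receiving end of "Teleport &amp;amp; Capture" only needs to unpack the JSON bundle and decode the frames before they enter the AI pipeline. A framework-agnostic sketch (the JSON field names are my assumption, and the Flask/FastAPI glue is omitted on purpose):&lt;/p&gt;

```python
import base64
import json

def handle_predict(body_bytes):
    """Unpack an event-triggered frame bundle sent by the single "Smart Eye" camera.

    body_bytes: raw JSON request body. Returns PNG bytes keyed by node id,
    ready to hand to the VLM stage.
    """
    doc = json.loads(body_bytes.decode("utf-8"))
    frames = {}
    for item in doc["images"]:
        frames[item["node_id"]] = base64.b64decode(item["image_b64"])
    return frames
```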




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;We have 12 nodes, but how does the robot "know" which one offers the best view? &lt;/p&gt;

&lt;p&gt;In the next post, we will deep dive into the C# implementation of the &lt;strong&gt;StaticCameraManager&lt;/strong&gt; and deconstruct the heuristic scoring algorithm that handles Occlusion, Angle, and Distance.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>unity3d</category>
      <category>ai</category>
      <category>performance</category>
    </item>
    <item>
      <title>Robotic Brain for Elder Care 1</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 07:52:10 +0000</pubDate>
      <link>https://dev.to/susanayi/robotic-brain-for-elder-care-1-1d1d</link>
      <guid>https://dev.to/susanayi/robotic-brain-for-elder-care-1-1d1d</guid>
      <description>&lt;h2&gt;
  
  
  The vision
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The long-term goal of this system is to support those who truly need assistance—individuals who are paralyzed, bedridden, or require 24/7 care. &lt;/p&gt;

&lt;p&gt;However, to build a rapid MVP and a scalable architecture, I am starting with healthy users in ideal scenarios as my baseline. This choice simplifies the problem space and enables faster iteration during the early development stages. &lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  System Architecture: The Bridge Between the Virtual World &amp;amp; the Logic Layer
&lt;/h2&gt;

&lt;p&gt;To maintain a clean separation of concerns, I designed a decoupled architecture where &lt;strong&gt;Unity&lt;/strong&gt; handles the physical simulation and &lt;strong&gt;Python&lt;/strong&gt; acts as the high-level brain. &lt;/p&gt;

&lt;p&gt;Here is the data flow from user behavior in Unity to the AI decision-making process in the backend:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirvvjdqn6e410ufz8cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirvvjdqn6e410ufz8cq.png" alt="unity arch" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simulation Environment
&lt;/h2&gt;

&lt;p&gt;Currently, the entire system is being verified within a simulated home environment built in &lt;strong&gt;Unity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xap8hwnl4pzcbfzkis9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xap8hwnl4pzcbfzkis9.png" alt="A top-down 3D view of the Unity home simulation" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have defined three core experimental zones—Living Room, Workspace, and Kitchen—equipped with a dense network of virtual cameras. This setup allows me to test the robot's perception and spatial grounding in a controlled yet complex environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;In the next post, I will walk you through the Unity Experiment Setup and Development Environment in more detail. &lt;/p&gt;

&lt;p&gt;We will explore how the 3D environment is designed to simulate daily routines and how the Unity-to-Python bridge handles real-time data streaming. Before diving into the complex AI "brain," it's essential to understand the "world" our robot lives in.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>ai</category>
      <category>unity3d</category>
      <category>python</category>
    </item>
  </channel>
</rss>
