<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: susanayi</title>
    <description>The latest articles on DEV Community by susanayi (@susanayi).</description>
    <link>https://dev.to/susanayi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3686760%2Fbe7ab871-fe95-4237-8ac8-ce91fc7d9c21.jpg</url>
      <title>DEV Community: susanayi</title>
      <link>https://dev.to/susanayi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/susanayi"/>
    <language>en</language>
    <item>
      <title>memcpy</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 06 Apr 2026 04:19:43 +0000</pubDate>
      <link>https://dev.to/susanayi/memcpy-3cc8</link>
      <guid>https://dev.to/susanayi/memcpy-3cc8</guid>
      <description>&lt;h1&gt;
  
  
  Q: Why is memcpy safer than pointer casting for type punning?
&lt;/h1&gt;

&lt;h3&gt;
  
  
  💡 Concept in a Nutshell
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;memcpy&lt;/code&gt; is the &lt;strong&gt;"Official Copy Machine"&lt;/strong&gt; of C: It doesn't care if your data is a math book or a cookbook; it only sees "paper" (Bytes) and duplicates them from point A to point B without violating language laws.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Life Analogy (The Librarian vs. The Xerox)
&lt;/h3&gt;

&lt;p&gt;Imagine you have a &lt;strong&gt;Math Book&lt;/strong&gt; but you want to read it as a &lt;strong&gt;Cookbook&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pointer Casting (&lt;code&gt;*(int*)&amp;amp;f&lt;/code&gt;)&lt;/strong&gt;: This is like forcing a Librarian to read a Math book as a Cookbook. The Librarian will get confused because it violates the "Library Classification Rules" (&lt;strong&gt;Strict Aliasing&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;memcpy&lt;/code&gt;&lt;/strong&gt;: This is like putting the Math book into a &lt;strong&gt;Xerox machine&lt;/strong&gt;. The machine doesn't read the words; it just copies the ink onto new paper. Now you have "Cookbook-shaped" paper with "Math-ink" on it. It’s perfectly legal because the Xerox machine is allowed to touch any paper!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Code Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// "Photocopy" the 4 bytes of f into i&lt;/span&gt;
    &lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Memory content of float 3.14f = 0x%X&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Output: 0x4048F5C3 (on IEEE 754 systems)&lt;/span&gt;

    &lt;span class="c1"&gt;// It works backwards too!&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1078523331&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;memcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Int 1078523331 as float = %f&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Output: 3.140000&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The Standard: C99 §6.5, Paragraph 7
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;char*&lt;/code&gt; Exception&lt;/strong&gt;: The strict aliasing rule lives in C99 §6.5 ¶7, and it carves out an explicit exception: &lt;code&gt;char*&lt;/code&gt; and &lt;code&gt;unsigned char*&lt;/code&gt; may alias any object type. Because &lt;code&gt;memcpy&lt;/code&gt; is specified to copy the object representation byte-by-byte, as if through &lt;code&gt;unsigned char*&lt;/code&gt;, it never violates the &lt;strong&gt;Strict Aliasing Rule&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Key Techniques (Why it Works)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;memcpy&lt;/code&gt;&lt;/strong&gt;: The most robust way to copy bits between types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler Optimization&lt;/strong&gt;: Modern compilers (GCC/Clang) recognize the fixed-size &lt;code&gt;memcpy&lt;/code&gt; idiom for type punning. On Arm64 or x86_64, they typically compile it down to a single move instruction (&lt;code&gt;fmov&lt;/code&gt; or &lt;code&gt;movd&lt;/code&gt;), meaning &lt;strong&gt;zero function call overhead&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Warning &amp;amp; Pro-Tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Overlap Trap&lt;/strong&gt;: &lt;code&gt;memcpy&lt;/code&gt; requires that the source and destination do NOT overlap; copying between overlapping regions is undefined behavior. If they might overlap, always use &lt;code&gt;memmove&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Size Matters&lt;/strong&gt;: Ensure &lt;code&gt;sizeof(dest) &amp;gt;= sizeof(src)&lt;/code&gt; to avoid buffer overflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compiler Flag&lt;/strong&gt;: In large projects (like the Linux Kernel), you might see &lt;code&gt;-fno-strict-aliasing&lt;/code&gt; used to relax these rules globally.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>c</category>
      <category>programming</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How I Designed a Camera Scoring System for VLM-Based Activity Recognition — and Why It Looks Different in the Real World</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Tue, 31 Mar 2026 03:08:41 +0000</pubDate>
      <link>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-5fj6</link>
      <guid>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-5fj6</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of the "Training-Free Home Robot" series. Part 1 covered why fixed ceiling-mounted nodes ended up as the perception foundation. This post goes deep on one specific algorithm: how the system decides which camera angle to use for each behavioral episode, and what that decision looks like when you leave the simulation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Once I'd worked through &lt;em&gt;why&lt;/em&gt; fixed global cameras made sense — a conclusion I reached the hard way, starting from genuine skepticism about the requirement — the next problem was entirely mine: given twelve candidate viewpoints, which one do you actually use?&lt;/p&gt;

&lt;p&gt;My advisor specified the input modality. The selection algorithm, the scoring weights, the hard FOV gate, the fallback logic — none of that was given to me. This post is that design work: where each decision came from, what tradeoffs it makes, and what changes when you move from a Unity simulation to a real room.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;My system recognizes what a user is doing — drinking, reading, typing — by sending a camera image to a Vision-Language Model (VLM). The VLM is zero-shot: no training data, no fine-tuning. It just sees an image and describes what's happening.&lt;/p&gt;

&lt;p&gt;This creates a hard dependency: &lt;strong&gt;VLM accuracy is directly tied to image quality, and image quality is directly tied to viewpoint selection.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A trained activity recognition model can partially compensate for bad viewpoints — it has seen thousands of occluded or off-angle examples during training. A zero-shot VLM cannot. If the user is at the edge of the frame, or partially behind furniture, the VLM produces unreliable output: "a person standing near a wall" instead of "a person drinking from a bottle."&lt;/p&gt;

&lt;p&gt;So before any AI inference happens, the system needs to answer: &lt;strong&gt;which of the twelve available camera nodes will produce the most useful image right now?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Pick the Closest Node?
&lt;/h2&gt;

&lt;p&gt;The naive approach is distance-only: pick the node closest to the user. But distance alone misses two critical failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Occlusion.&lt;/strong&gt; A node 1.5m away from the user, directly behind a sofa, produces a completely blocked image. A node 4m away with a clear line of sight is far more useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off-axis angle.&lt;/strong&gt; A node positioned to the side of a user who is facing a desk will capture a profile view at best, and the back of the user's head at worst. VLMs strongly prefer frontal or near-frontal views for activity recognition — they're trained on internet images where people face the camera.&lt;/p&gt;

&lt;p&gt;Distance matters, but it's one factor among three.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scoring Formula
&lt;/h2&gt;

&lt;p&gt;I ended up with a weighted combination of three geometric factors, plus a hard gate that runs before any of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 0 — Hard FOV Gate
&lt;/h3&gt;

&lt;p&gt;Before computing any score, I check whether the user even falls within the node's field of view cone. If not, the node is excluded immediately — score = 0, no further calculation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if θ_i &amp;gt; FOV_i / 2  →  s_i = 0   (hard gate, skip remaining calculation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where θ_i is the angle between the node's forward direction and the vector pointing toward the user's chest. The aim point is set at chest height: &lt;code&gt;aim = user.position + (0, 1.2, 0)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This gate matters more than it might seem. Without it, the weighted formula can assign a non-zero score to a node that physically cannot see the user — it just happens to be close or have good visibility in a different direction. Hard gating eliminates this entire class of bad selections before the arithmetic starts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Visibility Factor v_i
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v_i = 1   if linecast(node → user chest) is unobstructed
      0   otherwise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A physics linecast from the node position to the user's chest. If it hits furniture or a wall, v_i = 0. This is binary — either there's a clear path or there isn't.&lt;/p&gt;

&lt;p&gt;Weight: &lt;strong&gt;0.5&lt;/strong&gt; — the highest weight, because an occluded node is nearly useless regardless of its other properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — Angle Factor α_i
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;α_i = max(0,  1 - θ_i / (FOV_i / 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This maps the user's angular position within the FOV cone to a continuous score: 1.0 at dead center, 0.0 at the FOV boundary. A node where the user appears near the edge of frame gets a low angle score even if the linecast is clear.&lt;/p&gt;

&lt;p&gt;Weight: &lt;strong&gt;0.3&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Distance Factor d_i
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;d_i = max(0,  1 - dist(node, user_chest) / 10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linear decay from 1.0 at 0m to 0.0 at 10m. I chose the 10m cutoff after observing that the largest room in my simulation is about 6m across — so 10m means a node in the opposite corner of the largest room still gets a non-zero distance score, but it's clearly penalized.&lt;/p&gt;

&lt;p&gt;Weight: &lt;strong&gt;0.2&lt;/strong&gt; — lowest weight, because distance matters less than occlusion or angle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Score
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;m_i ∈ [0.5, 1.0] is a per-node multiplier set in the Unity Inspector, allowing me to manually downweight nodes with known limitations (e.g., a node that points toward a window and produces glare in afternoon light).&lt;/p&gt;

&lt;p&gt;Nodes with s_i ≥ 0.50 are admitted to the candidate list, sorted descending. The top-2 are captured.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pseudocode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ScoreCamerasRanked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cameras&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;s_min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nx"&gt;aimPos&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;qualified&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;cameras&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nx"&gt;θ&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nf"&gt;angle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aimPos&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FOV&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;                    &lt;span class="c1"&gt;// hard FOV gate&lt;/span&gt;

        &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nc"&gt;Linecast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;clear&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="nx"&gt;α&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;θ&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FOV&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;α&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;multiplier&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="err"&gt;≥&lt;/span&gt; &lt;span class="nx"&gt;s_min&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nx"&gt;qualified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;qualified&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;descending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nx"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Design — The Honest Answer
&lt;/h2&gt;

&lt;p&gt;I want to be direct about something: &lt;strong&gt;this scoring formula exists largely because of hardware constraints, not because it's the theoretically optimal solution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My simulation runs on a single workstation. I have one physical camera in the Unity scene that teleports to each selected node position, renders a frame, and moves on. I could not run twelve simultaneous cameras without multiplying rendering cost by twelve. Even in simulation, I needed a fast, lightweight way to rank nodes without actually rendering from all of them first.&lt;/p&gt;

&lt;p&gt;The weighted formula with three geometric factors fits that constraint perfectly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's &lt;strong&gt;O(N)&lt;/strong&gt; where N = number of nodes — trivially fast even for N = 100&lt;/li&gt;
&lt;li&gt;It uses only &lt;strong&gt;spatial coordinates and angles&lt;/strong&gt; — no image rendering required&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;interpretable&lt;/strong&gt; — when a node scores poorly, I can immediately see why (was it the occlusion? the angle? the distance?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A more sophisticated approach would render a low-resolution thumbnail from each candidate node and run a quick quality assessment model on it before selecting. This would catch cases the geometric formula misses — a node with a clear linecast but the user facing directly away from it, for instance. But that requires N renders per selection decision, which was not feasible on my hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical tradeoff:&lt;/strong&gt; the geometric formula is fast and correct in the common case. It fails primarily when the user's facing direction is not aligned with the node's line of sight — a limitation I document explicitly in the thesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Simulation vs. Reality Gap
&lt;/h2&gt;

&lt;p&gt;Everything above runs in Unity. Translating this to a physical room with real IP cameras introduces three gaps that simulation completely sidesteps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 1: You Don't Know Where the User Is
&lt;/h3&gt;

&lt;p&gt;In Unity, &lt;code&gt;user.position&lt;/code&gt; is available as a ground-truth Vector3 — the exact world coordinates of the character, updated every frame.&lt;/p&gt;

&lt;p&gt;In a real room, you don't have this. You need to estimate the user's position from the cameras themselves (using person detection + depth estimation or triangulation), from wearables, or from floor sensors. Each of these introduces estimation error that flows directly into the scoring formula.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridging approach:&lt;/strong&gt; Use the fixed-node cameras to run a lightweight person detector (e.g., YOLOv8-nano) and estimate 2D floor position via homography. This gives approximate (x, z) coordinates sufficient for the scoring formula, even without depth sensors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 2: &lt;code&gt;node.forward&lt;/code&gt; Requires Extrinsic Calibration
&lt;/h3&gt;

&lt;p&gt;In Unity, every node's position and forward direction is set in the editor — exact, zero-error, always current. In a real room, you need to physically calibrate each camera's extrinsic parameters (position and orientation relative to a shared world coordinate frame).&lt;/p&gt;

&lt;p&gt;Calibration drift is real: a camera that shifts 2cm from vibration or accidental contact changes its linecast origin enough to affect visibility calculations, particularly for borderline cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridging approach:&lt;/strong&gt; ArUco marker-based calibration at installation time, with periodic re-verification. Store calibration parameters in a config file that feeds into the scoring formula at runtime. Flag nodes whose calibration is older than a threshold for re-calibration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gap 3: Linecast ≠ Real-World Occlusion
&lt;/h3&gt;

&lt;p&gt;Unity's linecast is a perfect, instantaneous ray through a static collision mesh. In a real room, occlusion is dynamic (people, pets, moved furniture), partially transparent (glass tables, thin curtains), and probabilistic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridging approach:&lt;/strong&gt; Replace the binary linecast with a &lt;strong&gt;visibility probability&lt;/strong&gt; estimated from the camera's own feed. If the selected node's image shows the user partially occluded in the previous frame, reduce its score for the current selection. This creates a feedback loop: actual image quality informs future node selection.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Scoring Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;In the Unity simulation, I visualize node scores using Gizmos in the Scene View:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green sphere&lt;/strong&gt; — score ≥ 0.50, admitted to candidate list&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yellow sphere&lt;/strong&gt; — score between 0.35 and 0.50, near threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red sphere&lt;/strong&gt; — score &amp;gt; 0 but below 0.35&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gray sphere&lt;/strong&gt; — FOV-gated, score = 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During experiment setup, I use this visualization to verify that at least two nodes per room reliably score green for each behavioral spot (the sofa, the desk, the kitchen counter). If a room has only one reliably green node for a given spot, I reposition nodes before running experiments.&lt;/p&gt;

&lt;p&gt;This debugging workflow — spatial visualization of scores before running inference — turned out to be as important as the formula itself. The formula is only as good as the node placement it operates on.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Design Decision&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;th&gt;Real-World Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hard FOV gate before weighted sum&lt;/td&gt;
&lt;td&gt;Prevents scoring nodes that can't see the user&lt;/td&gt;
&lt;td&gt;Same gate applies; requires accurate extrinsic calibration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linecast for visibility&lt;/td&gt;
&lt;td&gt;Fast, exact in simulation&lt;/td&gt;
&lt;td&gt;Replace with visibility probability from live feed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chest height aim point (1.2m)&lt;/td&gt;
&lt;td&gt;Captures torso, most informative for activity recognition&lt;/td&gt;
&lt;td&gt;Same; depth camera or pose estimator needed for accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-2 node capture&lt;/td&gt;
&lt;td&gt;Handles single-node occlusion failures&lt;/td&gt;
&lt;td&gt;Same strategy; second node is insurance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-node multiplier m_i&lt;/td&gt;
&lt;td&gt;Manual override for known problem nodes&lt;/td&gt;
&lt;td&gt;Useful for flagging nodes with fixed environmental issues (glare, permanent obstruction)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The scoring formula is a pragmatic solution built around a specific hardware constraint: one rendering camera, twelve virtual viewpoints, a need for selection to be fast and interpretable. It works well in simulation, and the geometric logic transfers cleanly to a real deployment — but the inputs to the formula (user position, node orientation, occlusion) all need real-world measurement pipelines that simulation provides for free.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: how the captured images feed into a zero-shot VLM pipeline, and how SBERT semantic normalization maps free-form VLM descriptions to canonical behavior labels without any training data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Full thesis: "Personalized Proactive Service in Smart Home Robots: A Training-Free Visual Perception Framework Integrating VLM-Based Scene Grounding, RAG Memory, and Manifold Learning" — NCKU, 2025.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>algorithms</category>
      <category>unity3d</category>
      <category>ai</category>
    </item>
    <item>
      <title>Embodied AI: Why I Gave My Home Robot an "Eye in the Sky"</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Tue, 31 Mar 2026 03:08:41 +0000</pubDate>
      <link>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-3gam</link>
      <guid>https://dev.to/susanayi/embodied-ai-why-i-gave-my-home-robot-an-eye-in-the-sky-3gam</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part of a series on building a training-free home service robot using VLMs, RAG memory, and manifold learning. This post covers the camera architecture — specifically, why fixed ceiling-mounted nodes ended up as the foundation of the whole perception system.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Honestly, my first instinct was: why not just use the robot's onboard camera?&lt;/p&gt;

&lt;p&gt;It's the obvious answer. The robot is already in the room. It already has a camera. Adding twelve fixed ceiling nodes sounds like unnecessary complexity — more hardware, more calibration, more failure points, for a system that was already complicated enough.&lt;/p&gt;

&lt;p&gt;My advisor's requirement was firm: the AI pipeline must take its visual input from fixed global cameras, not from the robot itself. No negotiation on that point.&lt;/p&gt;

&lt;p&gt;So I spent a while sitting with that constraint, trying to understand it rather than just comply with it. This post is what I figured out. It starts with the genuine question I had — &lt;em&gt;why global cameras at all?&lt;/em&gt; — and ends with the engineering decisions I made once I accepted the answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with a Robot's Point of View
&lt;/h2&gt;

&lt;p&gt;The case for onboard vision is intuitive: the robot is mobile, so its camera goes wherever the action is. But "wherever the action is" turns out to be exactly the problem.&lt;/p&gt;

&lt;p&gt;A robot-mounted camera, positioned anywhere from 30cm to 100cm off the ground, has two fundamental problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Narrow field of view and constant occlusion.&lt;/strong&gt; The robot sees the world from a low, mobile, first-person perspective. A sofa blocks the view of the person sitting behind it. The kitchen wall hides what's happening at the dining table. From the robot's perspective, the home is a maze of partial information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Motion blur during navigation.&lt;/strong&gt; When the robot is moving — which is most of the time — its onboard camera is not producing reliable still frames. Activity recognition from blurry, unstabilized video is significantly harder than recognition from a fixed viewpoint.&lt;/p&gt;

&lt;p&gt;These aren't engineering failures. They're intrinsic to the geometry of a mobile, ground-level camera. No amount of better hardware changes the fundamental constraint.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Fixed Nodes + a Single Moving Camera
&lt;/h2&gt;

&lt;p&gt;My solution was to separate &lt;em&gt;global perception&lt;/em&gt; from &lt;em&gt;local action&lt;/em&gt;. The robot handles physical interaction. A set of fixed-position camera nodes handles scene understanding.&lt;/p&gt;

&lt;p&gt;In my system, twelve &lt;code&gt;CameraNode&lt;/code&gt; objects are distributed across three rooms (four per room) at ceiling height — approximately 2.3m — simulating the kind of fixed IP camera array you might mount in a real home. These nodes don't move, don't occlude each other, and always have a stable, overhead view of the space.&lt;/p&gt;

&lt;p&gt;But here's the key engineering constraint: &lt;strong&gt;I only have one physical camera in the scene.&lt;/strong&gt; Rather than instantiating twelve separate cameras (expensive and redundant), I use a single camera that teleports to each selected node position, renders a frame, and moves on. The &lt;code&gt;VirtualCameraBrain&lt;/code&gt; component manages this process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each selected node (top-N by score):
    camera.transform ← node.position + node.rotation
    wait 2 frames                    // GPU render flush
    capture 512×512 PNG → Base64

POST all images to /predict in one request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives me multi-viewpoint coverage with minimal rendering overhead.&lt;/p&gt;
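&lt;p&gt;On the Unity side this loop ends in a single HTTP request. Here is a minimal Python sketch of the equivalent packaging step, purely to make the payload concrete. The &lt;code&gt;/predict&lt;/code&gt; endpoint comes from the pseudocode above; the field names are my working assumptions, not a published contract:&lt;/p&gt;

```python
import base64
import json

def build_predict_payload(frames):
    """Bundle raw PNG bytes from the selected nodes into one request body.

    frames: list of (node_id, png_bytes) tuples captured by the roaming camera.
    """
    images = []
    for node_id, png_bytes in frames:
        images.append({
            "node_id": node_id,
            "image_b64": base64.b64encode(png_bytes).decode("ascii"),
        })
    # One POST per event: all selected viewpoints travel together.
    return json.dumps({"images": images}).encode("utf-8")

# Sending it is one stdlib call (the URL is illustrative):
#   import urllib.request
#   req = urllib.request.Request("http://localhost:8000/predict",
#                                data=build_predict_payload(frames),
#                                headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```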




&lt;h2&gt;
  
  
  The Harder Problem: Which Node Do You Pick?
&lt;/h2&gt;

&lt;p&gt;Having twelve nodes is useless if you pick the wrong one. A node behind the user, or one with the user at the edge of its field of view, produces an image the VLM can't interpret reliably.&lt;/p&gt;

&lt;p&gt;I needed a scoring function. Here's what I settled on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s_i = (v_i × 0.5 + α_i × 0.3 + d_i × 0.2) × m_i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v_i&lt;/strong&gt; — Visibility: does a linecast from the node to the user's chest reach without hitting furniture? (0 or 1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;α_i&lt;/strong&gt; — Angle factor: how centered is the user in the node's field of view? (1 at dead center, 0 at the FOV edge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;d_i&lt;/strong&gt; — Distance factor: linear decay from 0m to 10m&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;m_i&lt;/strong&gt; — A per-node priority multiplier, set in the Inspector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important rule comes before the weighted sum: &lt;strong&gt;hard FOV gating&lt;/strong&gt;. If the user falls outside the node's field of view cone, that node gets score = 0 immediately, no matter how good its distance or linecast result. There's no point in a weighted calculation for a camera that can't even see the target.&lt;/p&gt;

&lt;p&gt;After scoring, I sort candidates descending and capture from the top-2 nodes (configurable). Two viewpoints handle the cases where one node is slightly occluded.&lt;/p&gt;
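&lt;p&gt;The whole selection pass is compact enough to sketch. The production logic lives in C#, so treat this Python version as a readable model of the math rather than the actual code; the dictionary keys are invented for the sketch:&lt;/p&gt;

```python
# Weights from the formula above: visibility 0.5, angle 0.3, distance 0.2.
W_VIS, W_ANG, W_DIST = 0.5, 0.3, 0.2
MAX_RANGE_M = 10.0   # distance factor decays linearly to zero at 10 m

def clamp01(x):
    return max(0.0, min(1.0, x))

def score_node(visible, angle_deg, half_fov_deg, dist_m, multiplier=1.0):
    """Score one node; hard FOV gating zeroes everything outside the view cone."""
    alpha = clamp01(1.0 - angle_deg / half_fov_deg)  # 1 at dead center, 0 at edge
    gate = 0.0 if alpha == 0.0 else 1.0              # outside the cone: score is 0
    d = clamp01(1.0 - dist_m / MAX_RANGE_M)
    v = 1.0 if visible else 0.0                      # linecast result
    return gate * (v * W_VIS + alpha * W_ANG + d * W_DIST) * multiplier

def pick_top_nodes(candidates, top_n=2):
    """Rank candidate nodes descending and keep the best top_n viewpoints."""
    ranked = sorted(candidates,
                    key=lambda c: score_node(c["visible"], c["angle_deg"],
                                             c["half_fov_deg"], c["dist_m"],
                                             c.get("multiplier", 1.0)),
                    reverse=True)
    return ranked[:top_n]
```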




&lt;h2&gt;
  
  
  Why This Matters for the VLM
&lt;/h2&gt;

&lt;p&gt;The whole reason I care about viewpoint quality is downstream accuracy. My system uses &lt;code&gt;llava-phi3&lt;/code&gt; (via Ollama) to recognize what the user is doing — drinking, sitting, reading, typing — without any task-specific training.&lt;/p&gt;

&lt;p&gt;VLMs are sensitive to image quality in ways that trained classifiers aren't. A trained activity recognition model can learn to compensate for partial occlusion if it sees enough occluded examples during training. A zero-shot VLM cannot — it has to interpret what it sees without that learned correction.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;camera selection directly controls recognition accuracy&lt;/strong&gt;. In early testing, episodes where the scoring system chose a poor viewpoint (user at the edge of frame, or partially behind furniture) produced VLM outputs like "a person standing near a wall" instead of "a person drinking from a bottle." The SBERT normalization layer handled some of this, but the better fix was improving the viewpoint selection upstream.&lt;/p&gt;
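&lt;p&gt;To make that normalization step concrete: the idea is to embed both the canonical labels and the free-form VLM sentence, then snap to the nearest label by cosine similarity. The sketch below fakes the embeddings with tiny hand-made vectors; the real system uses SBERT embeddings, and both the vectors and the label set here are toy stand-ins:&lt;/p&gt;

```python
import math

# Toy 3-d "embeddings". In the real pipeline an SBERT model produces these;
# the numbers below are fabricated purely to show the mechanics.
CANONICAL = {
    "Drinking": [0.9, 0.1, 0.0],
    "Reading":  [0.1, 0.9, 0.0],
    "Typing":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize_label(description_embedding):
    """Snap an embedded free-form description to the closest canonical label."""
    return max(CANONICAL,
               key=lambda lbl: cosine(CANONICAL[lbl], description_embedding))
```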




&lt;h2&gt;
  
  
  The Gap Between Simulation and Reality
&lt;/h2&gt;

&lt;p&gt;I want to be honest about where my current implementation sits. Everything described above runs in a Unity 3D simulation. The "nodes" are virtual GameObjects. The "camera" is Unity's rendering engine. The coordinate streams come from &lt;code&gt;DynamicSyncManager.cs&lt;/code&gt;, not from depth sensors or object detection.&lt;/p&gt;

&lt;p&gt;This is intentional — I'm using simulation to validate the framework before committing to physical hardware. But it means two real-world problems remain unsolved:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extrinsic calibration.&lt;/strong&gt; In a real deployment, every pixel (u, v) from each fixed node must be mapped to a shared 3D coordinate system. This requires physical calibration of each camera's position and orientation relative to the room — a process that takes significant setup time and re-calibration whenever a camera is moved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency compensation.&lt;/strong&gt; Network transmission from a wall-mounted IP camera to the processing backend introduces roughly 50–150ms of latency. For a moving user, this means the position data you receive corresponds to where they &lt;em&gt;were&lt;/em&gt;, not where they &lt;em&gt;are&lt;/em&gt;. You need prediction — either simple linear extrapolation or a Kalman filter — to compensate.&lt;/p&gt;
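&lt;p&gt;The linear-extrapolation version of that compensation is only a few lines. A sketch, assuming two timestamped position samples and a rough latency estimate (a Kalman filter would replace this wholesale):&lt;/p&gt;

```python
def extrapolate_position(p_prev, p_curr, dt_between, latency_s):
    """Predict where a moving user is now from two stale (x, y) samples.

    dt_between: seconds between the two samples.
    latency_s:  estimated network plus processing delay, e.g. 0.05 to 0.15 s.
    """
    vx = (p_curr[0] - p_prev[0]) / dt_between
    vy = (p_curr[1] - p_prev[1]) / dt_between
    # Push the last known position forward by the estimated staleness.
    return (p_curr[0] + vx * latency_s, p_curr[1] + vy * latency_s)

# A user walking 1 m/s along x, observed with 100 ms of latency, is
# predicted about 10 cm ahead of the last reported position.
```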

&lt;p&gt;My simulation sidesteps both of these by giving me ground-truth coordinates directly from the Unity scene. That's a real gap, and I'm documenting it explicitly in the thesis as a limitation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cooperative Offloading: What This Enables for the Robot
&lt;/h2&gt;

&lt;p&gt;The architectural payoff of this design is that the robot itself doesn't need to run heavy vision inference. The fixed-node perception pipeline handles scene understanding and transmits lightweight metadata to the robot's decision layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User_Mom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Reading"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pos"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;-0.17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;8.62&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"room"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"LivingRoom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Drink"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.74&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The robot receives a pre-processed summary — who is where, what they're doing, and what they're likely to want next — rather than raw pixels. This is the core of the "ambient intelligence offloads to embodied intelligence" architecture.&lt;/p&gt;

&lt;p&gt;For a battery-powered physical robot, this matters a lot. Running a VLM inference pipeline continuously on an embedded GPU drains a battery in under an hour. Running it on a wall-powered backend and sending metadata over Wi-Fi costs almost nothing on the robot side.&lt;/p&gt;




&lt;h2&gt;
  
  
  Perception Mode Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Onboard Camera&lt;/th&gt;
&lt;th&gt;Fixed Node Array&lt;/th&gt;
&lt;th&gt;What My System Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Field of view&lt;/td&gt;
&lt;td&gt;Local, low, easily occluded&lt;/td&gt;
&lt;td&gt;Global, overhead, stable&lt;/td&gt;
&lt;td&gt;Nodes handle scene understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference load&lt;/td&gt;
&lt;td&gt;Runs on robot battery&lt;/td&gt;
&lt;td&gt;Runs on wall-powered backend&lt;/td&gt;
&lt;td&gt;VLM runs on backend only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coordinate source&lt;/td&gt;
&lt;td&gt;Estimated from robot odometry&lt;/td&gt;
&lt;td&gt;Direct from scene/sensors&lt;/td&gt;
&lt;td&gt;Unity scene (simulation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calibration&lt;/td&gt;
&lt;td&gt;Built into robot&lt;/td&gt;
&lt;td&gt;Requires room-level setup&lt;/td&gt;
&lt;td&gt;Skipped in simulation; needed in real deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Occlusion, motion blur&lt;/td&gt;
&lt;td&gt;Network latency, fixed FOV gaps&lt;/td&gt;
&lt;td&gt;Fallback: retry with next-best node&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The next post in this series covers how I use these captured images as input to a zero-shot VLM pipeline, and how SBERT semantic normalization maps the free-form VLM descriptions to canonical behavior labels — without any training data.&lt;/p&gt;

&lt;p&gt;If you're building something similar, or have dealt with the extrinsic calibration problem in a real deployment, I'd love to hear how you approached it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the "Training-Free Home Robot" series. The full system integrates VLM perception, a Behavioral Scene Graph memory layer (FAISS + MongoDB), and UMAP manifold learning for proactive intent prediction.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>robotics</category>
      <category>python</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Robotic Brain for Elder Care 3</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 09:15:56 +0000</pubDate>
      <link>https://dev.to/susanayi/robotic-brain-for-elder-care-3-5g2p</link>
      <guid>https://dev.to/susanayi/robotic-brain-for-elder-care-3-5g2p</guid>
      <description>&lt;h1&gt;
  
  
  Part 3: The Scoring Engine — How a Robot Selects the Perfect Viewpoint
&lt;/h1&gt;

&lt;p&gt;In the previous post, we discussed the "Single Camera + 12 Virtual Nodes" strategy to overcome simulation lag. But with 4 potential nodes in a single room, how does the system "decide" which one provides the best data for our AI backend?&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;StaticCameraManager&lt;/strong&gt; comes in. Instead of random selection, we use a &lt;strong&gt;Heuristic Scoring Algorithm&lt;/strong&gt; to rank viewpoints based on three physical constraints: &lt;strong&gt;Visibility, Angle, and Distance.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scoring Formula
&lt;/h2&gt;

&lt;p&gt;To quantify the quality of each viewpoint, the system evaluates all registered nodes in the room using a weighted heuristic:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;FinalScore = (Visibility × 0.5) + (AngleFactor × 0.3) + (DistanceFactor × 0.2)&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By assigning the highest weight (50%) to &lt;strong&gt;Visibility&lt;/strong&gt;, we ensure the robot never prioritizes a "perfect" angle if the person is obscured by furniture or walls.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Visibility: The Raycast Test (50%)
&lt;/h2&gt;

&lt;p&gt;The most fundamental requirement is a clear line of sight. We use Unity’s &lt;code&gt;Physics.Linecast&lt;/code&gt; to check for obstacles between the camera node and the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 2：Visibility (Linecast Occlusion)&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;vis&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Physics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Linecast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodePos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;out&lt;/span&gt; &lt;span class="n"&gt;RaycastHit&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Check if the hit object is the user or a part of the user&lt;/span&gt;
    &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;hitUser&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsChildOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;hitUser&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;vis&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Blocked by furniture or walls&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the raycast is blocked, the visibility score drops to &lt;strong&gt;0&lt;/strong&gt;, effectively disqualifying the node regardless of other factors.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Angle Factor: Semantic Clarity (30%)
&lt;/h2&gt;

&lt;p&gt;For action recognition, front or side views are more informative than back views, and a user centered in the frame is easier to read than one at the edge. The implemented factor measures how centered the user is: we normalize the angle between the node's forward vector and the user against half the field of view, giving 1 at dead center and 0 at the FOV edge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 3：Angle Factor (Normalized FOV center)&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;angleFactor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Mathf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clamp01&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1f&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;angle&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="n"&gt;halfFov&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Case Study: Drinking Behavior
&lt;/h3&gt;

&lt;p&gt;While multiple nodes might have visibility, our algorithm selects the one that best captures the drinking gesture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rb8rqtmojuajyf5nf40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rb8rqtmojuajyf5nf40.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate A (Side-Back) - The hand-to-mouth action is partially obscured by the user's shoulder.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc696zvr2o2p1ez15a4x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc696zvr2o2p1ez15a4x9.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate B (Side-Front) - Higher Angle Score. The interaction with the bottle is clearly visible for the VLM.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Distance Factor: The Golden Range (20%)
&lt;/h2&gt;

&lt;p&gt;A camera too far away loses pixel density. We want the user inside the "Golden Range" of roughly 2 to 5 meters, which we approximate with a simple linear decay: the factor is 1 at 0 m and falls to 0 at 10 m, so nodes in the golden range still score comfortably high.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Step 4：Distance Factor (10m Linear Decay)&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Vector3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodePos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aimPos&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;distFactor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Mathf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clamp01&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1f&lt;/span&gt; &lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
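&lt;p&gt;Combining the three snippets, the ranking collapses to one weighted sum per node. A Python sketch with illustrative numbers (the production code is the C# shown above):&lt;/p&gt;

```python
def final_score(vis, angle_factor, dist_factor):
    # Weights from the formula: 50% visibility, 30% angle, 20% distance.
    return vis * 0.5 + angle_factor * 0.3 + dist_factor * 0.2

# A visible node with the user slightly off-center (0.8) at 4 m (factor 0.6)
# scores about 0.86, beating an occluded node with perfect angle and distance
# at 0.50, because visibility carries half the total weight.
```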



&lt;h3&gt;
  
  
  Case Study: Typing Interaction
&lt;/h3&gt;

&lt;p&gt;At the desk, the distance and angle combined determine the best viewpoint to capture hand-to-keyboard interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34bir9y6rtf2mk0igal7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34bir9y6rtf2mk0igal7.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate C - Although the angle is okay, the distance reduces the semantic detail of the typing action.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72xw452brogm43hxq3kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72xw452brogm43hxq3kk.png" alt=" " width="512" height="512"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Note: Candidate D - Optimal Distance &amp;amp; Angle. The high-angle perspective provides a clear view of the hands on the keyboard.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Visualizing the Logic: Debugging with Gizmos
&lt;/h2&gt;

&lt;p&gt;As an engineer, I need to verify the math in real-time. I implemented a custom &lt;code&gt;OnDrawGizmos&lt;/code&gt; system that color-codes nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: High Score (&amp;gt; 0.5) — Ready for capture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red/Grey&lt;/strong&gt;: Low Score or Out of FOV — Disqualified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This visual feedback allowed us to fine-tune our thresholds, ensuring the &lt;strong&gt;VirtualCameraBrain&lt;/strong&gt; only teleports to locations that provide high-quality data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Now that we have selected the "Best Viewpoint," the final step is execution. In the next post, we will look at the &lt;strong&gt;VirtualCameraBrain&lt;/strong&gt; implementation: &lt;strong&gt;Base64 encoding&lt;/strong&gt; and &lt;strong&gt;REST API transmission&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>algorithms</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Robotic Brain for Elder Care 2</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 08:49:08 +0000</pubDate>
      <link>https://dev.to/susanayi/virtual-nodes-and-the-single-camera-strategy-3h02</link>
      <guid>https://dev.to/susanayi/virtual-nodes-and-the-single-camera-strategy-3h02</guid>
      <description>&lt;h2&gt;
  
  
  Part 2: Virtual Nodes and the Single-Camera Strategy — Overcoming Simulation Lag
&lt;/h2&gt;

&lt;p&gt;In building an indoor perception system for elder care, the standard intuition is to deploy multiple live cameras to monitor daily routines. During our early development stage using NVIDIA Isaac Sim, we followed this path, experimenting with high-bandwidth sensor data like &lt;strong&gt;depth images and point clouds&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The Performance Trap of Multi-Camera Rendering
&lt;/h2&gt;

&lt;p&gt;However, we quickly encountered a critical performance wall. Simultaneously rendering and publishing data from multiple active cameras in any simulation engine (Unity or Isaac Sim) is a recipe for performance disaster. It consumes massive GPU memory (VRAM) and creates significant lag.&lt;/p&gt;

&lt;p&gt;In our tests, images would queue for an unacceptably long time before ever entering the AI pipeline. For an elder-care system aiming at real-time interaction, this lag made subsequent VLM reasoning and intent prediction impossible.&lt;/p&gt;

&lt;p&gt;To prioritize practicality and focus on Robotics VLM and semantic research, we made a strategic decision: we bypassed the overhead of ROS 2 and transitioned to a custom, lightweight Unity-to-Python pipeline based on Event-Triggered RGB Transmission.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: A 12-Node Virtual Network
&lt;/h2&gt;

&lt;p&gt;Our architecture rests on a deliberate separation between &lt;strong&gt;spatial metadata&lt;/strong&gt; (Where can we see the user?) and &lt;strong&gt;rendering overhead&lt;/strong&gt; (When do we actually draw the pixels?). &lt;/p&gt;

&lt;p&gt;We deployed a network of 12 Virtual Camera Nodes across the simulated home. Instead of active cameras, these are lightweight "Empty Objects" that serve as potential observation posts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn18n65j6vam6ut9jf7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn18n65j6vam6ut9jf7y.png" alt="A top-down 3D view of the Unity-based home simulation. The layout is divided into three main experimental zones: Dad's Room (Blue), Living Room (Green), and Kitchen (Red). Each zone contains 4 virtual camera nodes, totaling 12 nodes across the environment." width="800" height="463"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Our experimental test bed featuring 12 virtual nodes. Each room (Kitchen, Living Room, and Dad's Room) is equipped with 4 specific viewpoints.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As shown in Figure 1, these 12 nodes incur &lt;strong&gt;zero rendering cost&lt;/strong&gt; while idle. In Unity, they are merely &lt;code&gt;Transform&lt;/code&gt; components (coordinates and forward vectors). This allows us to maintain a high simulation frame rate (FPS) while having 12 different perspectives available at any moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Comparison: Why This is "Reasonable"
&lt;/h3&gt;

&lt;p&gt;The table below illustrates the evolution from "Brute-force Rendering" to "Smart Orchestration":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Legacy Strategy (Isaac Sim Experience)&lt;/th&gt;
&lt;th&gt;Optimized Strategy (Current Unity Architecture)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Camera Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple Active Real Cameras&lt;/td&gt;
&lt;td&gt;Single Real Camera + 12 Virtual Nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rendering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuous Parallel Rendering&lt;/td&gt;
&lt;td&gt;Event-Triggered "Teleport &amp;amp; Capture"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU Load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High VRAM &amp;amp; Draw Calls&lt;/td&gt;
&lt;td&gt;Zero Idle Cost for Nodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long Image Queue (Lag)&lt;/td&gt;
&lt;td&gt;Real-time Sync (Stable 60 FPS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depth &amp;amp; Point Cloud (Heavy)&lt;/td&gt;
&lt;td&gt;Streamlined RGB (VLM Optimized)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ROS 2 Middleware&lt;/td&gt;
&lt;td&gt;Custom High-speed REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Mechanism: The "Smart Eye" Teleportation
&lt;/h2&gt;

&lt;p&gt;We then employ a &lt;strong&gt;Single Rendering Camera&lt;/strong&gt;—our "Smart Eye." The logic is orchestration rather than brute-force rendering. &lt;/p&gt;

&lt;p&gt;When our system detects a significant state change (e.g., transitioning from 'Standing' to 'Drinking'), the high-level "brain" is invoked. But instead of processing 12 simultaneous streams, we use a &lt;strong&gt;Teleport-and-Capture&lt;/strong&gt; strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Event Trigger: The system detects a meaningful action from &lt;code&gt;UserEntity&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Optimal Node Selection: Our heuristic scoring algorithm analyzes the 4 nodes within the current room to find the best angle, considering distance and occlusions.&lt;/li&gt;
&lt;li&gt; Instant Capture: The single physical camera &lt;strong&gt;"teleports"&lt;/strong&gt; to the selected optimal node, captures the frame, and sends it to the Python backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design is not just a workaround for simulation lag; it is a pragmatic reflection of real-world constraints. In a real smart home, streaming 24/7 high-resolution video from 12 cameras would overwhelm most residential networks. By mirroring the behavior of smart surveillance systems—where resources are allocated only when an event occurs—we ensure maximum data integrity and system reliability.&lt;/p&gt;
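&lt;p&gt;On the Python side, the receiving end of "Teleport &amp;amp; Capture" only needs to unpack the JSON bundle and decode the frames before they enter the AI pipeline. A framework-agnostic sketch (the JSON field names are my assumption, and the Flask/FastAPI glue is omitted on purpose):&lt;/p&gt;

```python
import base64
import json

def handle_predict(body_bytes):
    """Unpack an event-triggered frame bundle sent by the single "Smart Eye" camera.

    body_bytes: raw JSON request body. Returns PNG bytes keyed by node id,
    ready to hand to the VLM stage.
    """
    doc = json.loads(body_bytes.decode("utf-8"))
    frames = {}
    for item in doc["images"]:
        frames[item["node_id"]] = base64.b64decode(item["image_b64"])
    return frames
```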




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;We have 12 nodes, but how does the robot "know" which one offers the best view? &lt;/p&gt;

&lt;p&gt;In the next post, we will deep dive into the C# implementation of the &lt;strong&gt;StaticCameraManager&lt;/strong&gt; and deconstruct the heuristic scoring algorithm that handles Occlusion, Angle, and Distance.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>unity3d</category>
      <category>ai</category>
      <category>performance</category>
    </item>
    <item>
      <title>Robotic Brain for Elder Care 1</title>
      <dc:creator>susanayi</dc:creator>
      <pubDate>Mon, 30 Mar 2026 07:52:10 +0000</pubDate>
      <link>https://dev.to/susanayi/robotic-brain-for-elder-care-1-1d1d</link>
      <guid>https://dev.to/susanayi/robotic-brain-for-elder-care-1-1d1d</guid>
      <description>&lt;h2&gt;
  
  
  The vision
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The long-term goal of this system is to support those who truly need assistance—individuals who are paralyzed, bedridden, or require 24/7 care. &lt;/p&gt;

&lt;p&gt;However, to build a rapid MVP and a scalable architecture, I am starting with healthy users in ideal scenarios as my baseline. This choice simplifies the problem space and enables faster iteration during the early development stages. &lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  System Architecture: The Bridge Between the Virtual World &amp;amp; the Logic Layer
&lt;/h2&gt;

&lt;p&gt;To maintain a clean separation of concerns, I designed a decoupled architecture where &lt;strong&gt;Unity&lt;/strong&gt; handles the physical simulation and &lt;strong&gt;Python&lt;/strong&gt; acts as the high-level brain. &lt;/p&gt;

&lt;p&gt;Here is the data flow from user behavior in Unity to the AI decision-making process in the backend:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirvvjdqn6e410ufz8cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhirvvjdqn6e410ufz8cq.png" alt="unity arch" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simulation Environment
&lt;/h2&gt;

&lt;p&gt;Currently, the entire system is being verified within a simulated home environment built in &lt;strong&gt;Unity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xap8hwnl4pzcbfzkis9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xap8hwnl4pzcbfzkis9.png" alt="A top-down 3D view of the Unity home simulation" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have defined three core experimental zones—Living Room, Workspace, and Kitchen—equipped with a dense network of virtual cameras. This setup allows me to test the robot's perception and spatial grounding in a controlled yet complex environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;In the next post, I will walk you through the Unity Experiment Setup and Development Environment in more detail. &lt;/p&gt;

&lt;p&gt;We will explore how the 3D environment is designed to simulate daily routines and how the Unity-to-Python bridge handles real-time data streaming. Before diving into the complex AI "brain," it's essential to understand the "world" our robot lives in.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>ai</category>
      <category>unity3d</category>
      <category>python</category>
    </item>
  </channel>
</rss>
