<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samaresh Kumar Singh</title>
    <description>The latest articles on DEV Community by Samaresh Kumar Singh (@samaresh_singh_1acf4838c1).</description>
    <link>https://dev.to/samaresh_singh_1acf4838c1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3392874%2F70832442-f0c6-4aba-ba77-1ed254f7b1a7.png</url>
      <title>DEV Community: Samaresh Kumar Singh</title>
      <link>https://dev.to/samaresh_singh_1acf4838c1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samaresh_singh_1acf4838c1"/>
    <language>en</language>
    <item>
      <title>ARM System-on-Chip (SoC) Deep Dive: Edge AI and Coherency Fabric</title>
      <dc:creator>Samaresh Kumar Singh</dc:creator>
      <pubDate>Wed, 01 Oct 2025 22:17:24 +0000</pubDate>
      <link>https://dev.to/samaresh_singh_1acf4838c1/arm-system-on-chip-soc-deep-dive-edge-ai-and-coherency-fabric-52en</link>
      <guid>https://dev.to/samaresh_singh_1acf4838c1/arm-system-on-chip-soc-deep-dive-edge-ai-and-coherency-fabric-52en</guid>
      <description>&lt;p&gt;LLC partitioning and QoS work together. QoS controls priority in the&lt;br&gt;
interconnect. LLC partitioning controls which agent's working set stays in&lt;br&gt;
cache. Together they form a contract: the display and NPU will get their&lt;br&gt;
bandwidth, and their data will be warm in cache. Everything else adapts.&lt;/p&gt;

&lt;p&gt;You validate this contract with PMU counters watching 95th and 99th percentile&lt;br&gt;
latencies, not averages. Average latency is a vanity metric on contended&lt;br&gt;
memory buses.&lt;/p&gt;


&lt;h2&gt;
  
  
  The SMMU: Not Just Security, Actually Correctness
&lt;/h2&gt;

&lt;p&gt;The System Memory Management Unit tends to get framed as a security feature,&lt;br&gt;
which it is, but treating it purely as a security control misses why it is&lt;br&gt;
essential for correct system operation.&lt;/p&gt;

&lt;p&gt;The SMMU sits between I/O masters (the ISP, the NIC, PCIe devices, USB&lt;br&gt;
controllers) and the physical address space. Each device gets a StreamID. The&lt;br&gt;
SMMU uses that StreamID to look up the appropriate page table, translating the&lt;br&gt;
device's virtual address to a physical address before allowing the DMA to&lt;br&gt;
proceed. A device can only access memory that its stage-1 (OS-controlled) and&lt;br&gt;
stage-2 (hypervisor-controlled) page tables permit.&lt;/p&gt;
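
&lt;p&gt;In Linux this mechanism is surfaced through the IOMMU API. Here is a minimal sketch, assuming a driver that wants exactly one 1 MB DMA window for its device; the device pointer, addresses, and error handling are placeholders, and exact signatures vary by kernel version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Sketch: bind a device (via its StreamID) to a fresh set of SMMU page
 * tables, then map a single 1 MB window. Any DMA outside it faults. */
static int isp_attach_dma_window(struct device *isp_dev,
                                 dma_addr_t buf_iova, phys_addr_t buf_pa)
{
    struct iommu_domain *dom = iommu_domain_alloc(isp_dev-&amp;gt;bus);
    int ret;

    if (!dom)                      /* empty page tables: default deny */
        return -ENOMEM;

    ret = iommu_attach_device(dom, isp_dev);   /* StreamID now resolves to dom */
    if (ret)
        return ret;

    /* The only window the device can reach; everything else faults. */
    return iommu_map(dom, buf_iova, buf_pa, SZ_1M, IOMMU_READ | IOMMU_WRITE);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most drivers get this indirectly through the DMA API rather than calling the IOMMU API themselves, but the underlying mechanism is the same StreamID-to-page-table lookup.&lt;/p&gt;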

&lt;p&gt;The security value is obvious: a compromised camera driver cannot DMA into&lt;br&gt;
kernel memory or another process's address space. But the correctness value is&lt;br&gt;
less discussed. Without the SMMU, DMA addresses are physical, meaning a driver&lt;br&gt;
bug that generates the wrong address can write anywhere in DRAM. These bugs&lt;br&gt;
tend to manifest as subtle corruption, often far from the code that caused them,&lt;br&gt;
often in a completely different process's memory. Debugging this without an&lt;br&gt;
SMMU is miserable. With an SMMU, the bad access generates a fault with a precise&lt;br&gt;
fault address and StreamID. You know immediately which device caused it and what&lt;br&gt;
address it tried to access.&lt;/p&gt;
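
&lt;p&gt;That debugging story can be wired up explicitly. A sketch using the kernel's per-domain fault callback; the handler name and log text are illustrative, and the domain is the one from the previous sketch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Called by the IOMMU core when the device's DMA misses its mappings.
 * dev and iova pinpoint the offending master and address immediately. */
static int isp_dma_fault(struct iommu_domain *dom, struct device *dev,
                         unsigned long iova, int flags, void *token)
{
    dev_err(dev, "SMMU fault: iova=0x%lx flags=0x%x\n", iova, flags);
    return -ENOSYS;   /* not handled here: let the core report it as well */
}

iommu_set_fault_handler(dom, isp_dma_fault, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;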

&lt;p&gt;For multi-tenant systems, stage-2 translation provides per-VM isolation at the&lt;br&gt;
hardware level. The guest OS sets up stage-1 translations for its devices. The&lt;br&gt;
hypervisor controls stage-2, ensuring a guest cannot map device DMA into another&lt;br&gt;
guest's physical memory ranges.&lt;/p&gt;

&lt;p&gt;The practical rule is to configure the SMMU before anything else at boot, with&lt;br&gt;
a default-deny policy, then open up specific address windows per device.&lt;br&gt;
Configuring it as an afterthought means debugging memory corruption in&lt;br&gt;
production.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Real-World Reference: Edge AI Camera at 5W
&lt;/h2&gt;

&lt;p&gt;The architecture described here is not hypothetical. Something close to it runs&lt;br&gt;
in every flagship smartphone and most modern smart camera SoCs. Here is how&lt;br&gt;
all the pieces interact in a concrete scenario.&lt;/p&gt;

&lt;p&gt;The target: a security camera platform running 4K60 video capture with&lt;br&gt;
continuous object detection at under 10ms latency, 60 FPS on the display&lt;br&gt;
output, all within a 5W thermal budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
  SENSOR["Camera Sensor + ISP\n4K 60fps"]
  DRAM["DRAM\nFrame buffers + model weights"]
  NPU["NPU\nObject detection &amp;amp;lt;10ms\nQoS priority 12"]
  CPU["Cortex-A Cluster\nOrchestration + OS"]
  DISP["Display Controller\n4K 60fps out\nQoS priority 15"]

  subgraph LLC["LLC Partitions (7 MB total)"]
    P1["NPU region\n3 MB — model hot"]
    P2["Display region\n2 MB — frame bufs"]
    P3["CPU region\n2 MB — OS/stack"]
  end

  SENSOR --&amp;gt;|"AXI non-coherent"| DRAM
  DRAM --&amp;gt;|"ACE-Lite coherent"| NPU
  DRAM --&amp;gt;|"AXI non-coherent"| DISP
  NPU --&amp;gt;|"IRQ via GIC"| CPU
  CPU --&amp;gt;|"Configure / control"| NPU
  LLC -.-&amp;gt;|"Partitioned ways"| DRAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What each path decision buys you:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ISP writes frames to DRAM via the non-coherent AXI path. At 4K60, frame&lt;br&gt;
writing is the highest-bandwidth operation on the platform. Putting this on the&lt;br&gt;
coherent fabric would generate snoop traffic proportional to the frame rate,&lt;br&gt;
burning 300 to 400 mW just in coherency protocol overhead.&lt;/p&gt;
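
&lt;p&gt;A quick back-of-the-envelope check of the scale involved, assuming NV12 output at 1.5 bytes per pixel (the exact figure depends on the ISP's pixel format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// 4K NV12 frame: 3840 x 2160 pixels at 1.5 bytes per pixel.
uint64_t bytes_per_frame = 3840ULL * 2160 * 3 / 2;   // ~12.4 MB per frame
uint64_t isp_write_bw    = bytes_per_frame * 60;      // ~746 MB/s of frame writes alone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Snoop traffic scales with that write stream, which is why keeping it off the coherent fabric pays for itself.&lt;/p&gt;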

&lt;p&gt;The NPU pulls model weights and frame data from the LLC via the coherent&lt;br&gt;
ACE-Lite path. The LLC partition for the NPU keeps the detection model resident.&lt;br&gt;
On a first inference after system boot, the model loads from DRAM into the NPU's&lt;br&gt;
LLC partition. For every subsequent inference, the weights are already warm. The&lt;br&gt;
DRAM penalty is paid once.&lt;/p&gt;

&lt;p&gt;The CPU receives an interrupt from the GIC when the NPU finishes. Because the&lt;br&gt;
NPU used coherent DMA for its output, the CPU can immediately read the detection&lt;br&gt;
results without invalidating anything. The frame timestamp and bounding box&lt;br&gt;
coordinates are in cache, coherent, ready.&lt;/p&gt;
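
&lt;p&gt;At the driver level the two paths look roughly like this, sketched with the standard Linux DMA API; the device pointers and buffer sizes are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Coherent path (NPU output): allocate once; after the completion IRQ the
 * CPU reads the results directly, with no cache maintenance. */
void *results = dma_alloc_coherent(npu_dev, RESULT_BYTES, &amp;amp;results_bus, GFP_KERNEL);

/* Streaming path (ISP frames): ownership is handed back explicitly, and the
 * sync is where the cache-invalidate cost of the non-coherent path lives. */
dma_addr_t frame_bus = dma_map_single(isp_dev, frame, FRAME_BYTES, DMA_FROM_DEVICE);
/* ... ISP DMA writes the frame ... */
dma_sync_single_for_cpu(isp_dev, frame_bus, FRAME_BYTES, DMA_FROM_DEVICE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a hardware-coherent (ACE-Lite) path that sync collapses to nearly nothing; on the non-coherent path it is a real invalidate over the whole buffer.&lt;/p&gt;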

&lt;p&gt;At steady state, the power breakdown is roughly 2W compute (CPU, NPU, ISP&lt;br&gt;
running continuously), 1.5W DRAM at medium utilization, and 1.5W display and&lt;br&gt;
peripheral. The LLC partitioning and QoS configuration account for most of the&lt;br&gt;
DRAM efficiency. Without them, the same workload at comparable latencies&lt;br&gt;
requires about 6W because of unnecessary DRAM spill and coherency overhead.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Knobs That Actually Matter
&lt;/h2&gt;

&lt;p&gt;Real production tuning comes down to a short list. Knowing which levers exist&lt;br&gt;
is the starting point. Knowing which ones move the needle on your specific&lt;br&gt;
workload is the actual skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLC partition sizing&lt;/strong&gt; needs to be empirically derived. Start with PMU&lt;br&gt;
counters measuring LLC hit rate per agent, then adjust partition sizes until&lt;br&gt;
the hit rate for time-critical agents stabilizes above 90%. Below that, you&lt;br&gt;
will see tail latency climb. Above it, you are probably over-allocating to that&lt;br&gt;
agent at the expense of others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// PMU quick-check: L2 miss rate during inference&lt;/span&gt;
&lt;span class="n"&gt;pmu_configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PMU_L2_CACHE_MISS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;run_ai_inference&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;l2_misses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pmu_read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"L2 misses during inference: %llu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;l2_misses&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
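
&lt;p&gt;To turn that raw count into the hit-rate figure the sizing loop actually needs, pair it with an access counter. The access event name is an assumption; the pmu_* helpers are the same ones used above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Second pass with the access counter (or program both events at once
// if the PMU wrapper supports it).
pmu_configure(PMU_L2_CACHE_ACCESS);
run_ai_inference();
uint64_t l2_accesses = pmu_read();

// Resize the partition until this stays above ~0.90 for the critical agent.
double hit_rate = 1.0 - (double)l2_misses / (double)l2_accesses;
printf("L2 hit rate: %.1f%%\n", hit_rate * 100.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;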



&lt;p&gt;&lt;strong&gt;Huge page adoption&lt;/strong&gt; for video and tensor buffers is close to a free lunch.&lt;br&gt;
The TLB miss rate drops dramatically. The main friction is that huge pages need&lt;br&gt;
physically contiguous memory, which requires CMA (Contiguous Memory Allocator)&lt;br&gt;
reservation at boot. This is a one-line kernel parameter. Most teams skip it and&lt;br&gt;
then wonder why their video pipeline has periodic latency spikes.&lt;/p&gt;
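
&lt;p&gt;A sketch of the two halves, with illustrative sizes: the cma= reservation covers contiguous kernel DMA buffers, while huge pages reserved at boot back the user-mapped tensor and frame buffers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Kernel command line (boot-time reservation, sizes illustrative):
 *   cma=256M hugepagesz=2M hugepages=512
 */

/* User space: a 64 MB tensor/frame buffer backed by 2 MB huge pages. */
void *buf = mmap(NULL, 64UL * 1024 * 1024, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (buf == MAP_FAILED)
    perror("mmap(MAP_HUGETLB)");   /* usually means the hugepage pool is empty */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;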

&lt;p&gt;&lt;strong&gt;DMA mode selection&lt;/strong&gt; should be documented per data path, not chosen once&lt;br&gt;
globally. Write it down in the driver architecture document, with the reasoning.&lt;br&gt;
Six months after initial bringup, someone will add a new accelerator and make&lt;br&gt;
the wrong choice because the rationale was never written down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QoS settings&lt;/strong&gt; should be measured under maximum contention, not idle&lt;br&gt;
conditions. Set up a stress test that runs all compute engines simultaneously,&lt;br&gt;
then verify that display and NPU latency stay within SLO bounds. If they do not,&lt;br&gt;
your priorities and bandwidth reservations need adjustment.&lt;/p&gt;
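
&lt;p&gt;A minimal sketch of that check. Here start_background_stress(), read_timer_ns(), cmp_u64, and SLO_INFERENCE_NS are hypothetical stand-ins for whatever loads your other engines, your timer source (CNTVCT_EL0 works well), a plain uint64 comparator, and your latency budget.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Time N inferences while every other engine is busy, then judge the
// tail, not the mean, against the SLO.
start_background_stress();                    /* GPU, ISP, codec all loaded */

enum { N = 10000 };
uint64_t lat[N];
for (int i = 0; i &amp;lt; N; i++) {
    uint64_t t0 = read_timer_ns();
    run_ai_inference();
    lat[i] = read_timer_ns() - t0;
}
qsort(lat, N, sizeof(lat[0]), cmp_u64);       /* ascending */

uint64_t p99 = lat[(N * 99) / 100];           /* ~99th percentile */
if (p99 &amp;gt; SLO_INFERENCE_NS)
    printf("QoS/partitioning not holding: p99 = %llu ns\n",
           (unsigned long long)p99);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;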




&lt;h2&gt;
  
  
  A Note on Tail Latency
&lt;/h2&gt;

&lt;p&gt;One trap that experienced engineers still fall into is optimizing for&lt;br&gt;
average-case latency while ignoring tail latency. Averages on contended memory&lt;br&gt;
buses look fine until they do not. A system that averages 6ms inference latency&lt;br&gt;
but hits 18ms at the 99th percentile will fail its real-time requirements in&lt;br&gt;
production, because production workloads are not averages.&lt;/p&gt;

&lt;p&gt;PMU-based profiling needs to capture percentile distributions, not means. The&lt;br&gt;
95th and 99th percentile latencies tell you whether your LLC partitions and QoS&lt;br&gt;
settings are holding up under contention. An average that looks good while the&lt;br&gt;
99th percentile drifts upward is a sign that something is occasionally evicting&lt;br&gt;
a critical working set, or that bandwidth guarantees are holding on average but&lt;br&gt;
not under peak scenarios.&lt;/p&gt;

&lt;p&gt;The correlation between LLC hit-rate stability and inference tail latency is&lt;br&gt;
often direct and observable. When a partition eviction event happens, the next&lt;br&gt;
inference cold-loads weights from DRAM and the latency spike shows up&lt;br&gt;
immediately in the distribution. Tracking these together makes root cause&lt;br&gt;
analysis tractable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;Before declaring a platform production-ready, each of these should have a&lt;br&gt;
verified answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] SLOs defined (inference latency, display FPS, jitter) per workload&lt;/li&gt;
&lt;li&gt;[ ] LLC partitions allocated for NPU, Display, CPU — sizes validated with PMU&lt;/li&gt;
&lt;li&gt;[ ] DMA mode chosen and documented per data path, with rationale&lt;/li&gt;
&lt;li&gt;[ ] Huge pages mapped for all large hot buffers (frame buffers, tensors)&lt;/li&gt;
&lt;li&gt;[ ] QoS priorities and bandwidth guarantees set and tested under full load&lt;/li&gt;
&lt;li&gt;[ ] PMU instrumentation capturing 95p/99p distributions, not just averages&lt;/li&gt;
&lt;li&gt;[ ] SMMU default-deny policy locked in before any driver bring-up&lt;/li&gt;
&lt;li&gt;[ ] Interrupt routing for GIC verified across EL2/EL3 paths&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;The ARM SoC architecture described here is what makes modern edge computing&lt;br&gt;
possible at the power envelope and cost point that edge devices demand. A CPU&lt;br&gt;
cluster alone could not do it. A bare NPU with no cache hierarchy management&lt;br&gt;
would be unreliable. What makes it work is the combination: dedicated compute&lt;br&gt;
engines with defined roles, a shared but well-managed memory system, a&lt;br&gt;
coherency fabric that handles the hard synchronization problems, and a QoS layer&lt;br&gt;
that enforces the real-time contracts that users care about.&lt;/p&gt;

&lt;p&gt;The engineers who get this right share a common trait. They do not think about&lt;br&gt;
these components in isolation. They think about data flow, contention scenarios,&lt;br&gt;
and worst-case latency under load. They have LLC hit rates and 99th percentile&lt;br&gt;
latency numbers at their fingertips. And they configure SMMU policies before any&lt;br&gt;
other driver goes in, not after.&lt;/p&gt;

&lt;p&gt;The architecture is not magic. The properties it provides are a direct&lt;br&gt;
consequence of deliberate design decisions, most of which can be reversed by&lt;br&gt;
careless driver work or misconfigured firmware. Understanding the mechanism is&lt;br&gt;
what allows you to keep those properties intact from first bringup through&lt;br&gt;
production.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>ai</category>
      <category>architecture</category>
      <category>iot</category>
    </item>
  </channel>
</rss>
