<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samaresh Kumar Singh</title>
    <description>The latest articles on DEV Community by Samaresh Kumar Singh (@samaresh_singh_1acf4838c1).</description>
    <link>https://dev.to/samaresh_singh_1acf4838c1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3392874%2F70832442-f0c6-4aba-ba77-1ed254f7b1a7.png</url>
      <title>DEV Community: Samaresh Kumar Singh</title>
      <link>https://dev.to/samaresh_singh_1acf4838c1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samaresh_singh_1acf4838c1"/>
    <language>en</language>
    <item>
      <title>ARM System-on-Chip (SoC) Deep Dive: Edge AI and Coherency Fabric</title>
      <dc:creator>Samaresh Kumar Singh</dc:creator>
      <pubDate>Wed, 01 Oct 2025 22:17:24 +0000</pubDate>
      <link>https://dev.to/samaresh_singh_1acf4838c1/arm-system-on-chip-soc-deep-dive-edge-ai-and-coherency-fabric-52en</link>
      <guid>https://dev.to/samaresh_singh_1acf4838c1/arm-system-on-chip-soc-deep-dive-edge-ai-and-coherency-fabric-52en</guid>
      <description>&lt;p&gt;🏗️ &lt;strong&gt;ARM System-on-Chip (SoC) Architecture Explained&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The diagram linked at &lt;a href="https://tinyurl.com/zjyue6h4" rel="noopener noreferrer"&gt;ARM SoC Architecture Diagram&lt;/a&gt; outlines a complete, modern ARM-based System-on-Chip design optimized for edge AI inference and heterogeneous computing. This architecture is the foundation of modern smartphones, high-end security cameras, and autonomous systems.&lt;/p&gt;

&lt;p&gt;Let’s break down the design layer by layer:&lt;/p&gt;
&lt;h2&gt;
  
  
  1) Architecture at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Five layers:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane&lt;/strong&gt; (Cortex-A clusters, GIC, PMU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Plane&lt;/strong&gt; (NPU, GPU, ISP, VPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Subsystem&lt;/strong&gt; (L1/L2/LLC, TLB, MC + DRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interconnect &amp;amp; Coherency Fabric&lt;/strong&gt; (CHI/ACE, snoop filters, QoS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O Translation Fabric&lt;/strong&gt; (SMMU, StreamIDs, PCIe/USB/Display/Network)&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Mental model: &lt;strong&gt;CPU = orchestrator&lt;/strong&gt; (control), &lt;strong&gt;accelerators = muscle&lt;/strong&gt; (data). CPU sets up work; accelerators crunch it; the fabric keeps everything coherent and on time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  2) Control Plane (Top, Light Blue)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What lives here&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ARM Cortex-A CPU clusters (e.g., 4 cores)&lt;/strong&gt; with private &lt;strong&gt;L1I/L1D&lt;/strong&gt; and shared &lt;strong&gt;L2&lt;/strong&gt; per cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GIC&lt;/strong&gt; (Generic Interrupt Controller) for routing/priority of device interrupts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PMU&lt;/strong&gt; (Performance Monitoring Unit) for on-silicon counters (cache misses, TLB misses, branches, cycles).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs OS/supervisor and orchestrates accelerators.&lt;/li&gt;
&lt;li&gt;Handles exceptions/IRQs; configures QoS, cache partitions, SMMU mappings.&lt;/li&gt;
&lt;li&gt;PMU is your truth serum for tuning.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Role:&lt;/strong&gt; Configure the world. Don’t do heavy lifting.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  3) Data Plane (Left Middle, Light Green)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Compute engines&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NPU (AI accelerator)&lt;/strong&gt;: GEMM/convolution engines for local inference (e.g., object detection).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: Graphics + GPGPU for UI, 3D, shaders, post-processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ISP&lt;/strong&gt;: Image pipeline (demosaic, denoise, tone map, auto-exposure/focus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPU&lt;/strong&gt;: HW encode/decode (H.264/H.265/AV1) for capture/streaming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key design principle&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control/Data split&lt;/strong&gt;: CPU provides the model and buffers; NPU/GPU/ISP/VPU execute at high throughput and signal completion.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  4) Memory Subsystem (Right Middle, Light Yellow)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cache hierarchy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L1&lt;/strong&gt; (per-core, ~32–64 KB, ~4 cycles): fastest, tiny.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2&lt;/strong&gt; (per-cluster, ~256 KB–2 MB, ~12 cycles).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLC/L3&lt;/strong&gt; (shared, ~4–16 MB, ~40 cycles).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DRAM&lt;/strong&gt; (GBs, ~100–200+ cycles).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Partitioned LLC&lt;/strong&gt; (recommended)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Reserve LLC ways/regions per role to avoid “noisy neighbors.”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ex: &lt;strong&gt;CPU&lt;/strong&gt; 2 MB, &lt;strong&gt;NPU&lt;/strong&gt; 3 MB (model), &lt;strong&gt;Display&lt;/strong&gt; 2 MB (frame cadence).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prevents the GPU from evicting the NPU’s model or the display’s frame buffers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TLB &amp;amp; huge pages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The TLB caches VA→PA translations. A miss triggers a multi-level page-table walk (~100 cycles).&lt;/li&gt;
&lt;li&gt;For large datasets, use &lt;strong&gt;2 MB huge pages&lt;/strong&gt; to cut TLB pressure drastically.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Access path&lt;/strong&gt;: CPU → TLB → L1 → L2 → LLC → DRAM (short-circuit at first hit).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  5) Interconnect &amp;amp; Coherency Fabric (Center, Light Coral)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CHI/ACE fabric&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-speed, coherent fabric that routes transactions and enforces cache coherency via &lt;strong&gt;MESI/MOESI&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coherency 101&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Keep all cores/agents in agreement about shared cache lines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;M/O/E/S/I&lt;/strong&gt; states; snoops on peer caches; ownership transitions on read/write.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Snoop filter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks line locations so you only snoop relevant caches (reduces traffic and power).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;QoS manager&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Prioritize time-critical clients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ex: &lt;strong&gt;Display&lt;/strong&gt; (15), &lt;strong&gt;NPU&lt;/strong&gt; (12), &lt;strong&gt;CPU&lt;/strong&gt; (8), &lt;strong&gt;GPU&lt;/strong&gt; (4).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guarantee bandwidth slices under contention.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AXI (non-coherent)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For DMA/I/O that don’t need HW coherency (e.g., streaming camera frames).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  6) I/O Translation Fabric (Bottom, Lavender)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SMMU (IOMMU)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-device virtual memory + isolation with &lt;strong&gt;Stage-1&lt;/strong&gt; (OS) and &lt;strong&gt;Stage-2&lt;/strong&gt; (Hypervisor) translations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StreamID&lt;/strong&gt; selects which translation/permissions to apply; prevents rogue DMA into kernel memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Peripherals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PCIe&lt;/strong&gt; (NVMe, NICs), &lt;strong&gt;USB&lt;/strong&gt;, &lt;strong&gt;Display (HDMI/DP)&lt;/strong&gt;, &lt;strong&gt;Network&lt;/strong&gt; (Ethernet/Wi-Fi).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  7) Data-Flow Walkthroughs
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Flow A: CPU memory access (coherent)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;CPU issues load (&lt;code&gt;x = array[i]&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLB&lt;/strong&gt; translates VA→PA (page walk on miss).&lt;/li&gt;
&lt;li&gt;Probe &lt;strong&gt;L1 → L2 → LLC&lt;/strong&gt;; return on first hit; else &lt;strong&gt;DRAM&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Data bubbles back to the registers through the cache hierarchy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; PMU counters validate latency and hit rates across levels.&lt;/p&gt;
&lt;h3&gt;
  
  
  Flow B: AI inference with &lt;strong&gt;coherent DMA&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;CPU loads model (prefetch to &lt;strong&gt;LLC partition for NPU&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;CPU programs NPU: &lt;em&gt;input at 0x1000; run model X&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;NPU issues &lt;strong&gt;ACE-Lite (coherent) DMA&lt;/strong&gt; reads.&lt;/li&gt;
&lt;li&gt;Fabric snoops CPU caches, probes LLC; fetches from DRAM if needed.&lt;/li&gt;
&lt;li&gt;NPU computes (e.g., &lt;strong&gt;8 ms&lt;/strong&gt; target).&lt;/li&gt;
&lt;li&gt;NPU writes results coherently; coherency invalidates stale CPU lines.&lt;/li&gt;
&lt;li&gt;NPU IRQ via &lt;strong&gt;GIC&lt;/strong&gt; → CPU reads fresh results immediately.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Benefit:&lt;/strong&gt; No software cache flush; lower latency to consume results.&lt;/p&gt;
&lt;h3&gt;
  
  
  Flow C: Camera capture with &lt;strong&gt;non-coherent DMA&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;ISP processes a frame → starts DMA to &lt;strong&gt;VA 0x5000&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMMU&lt;/strong&gt; checks &lt;strong&gt;StreamID&lt;/strong&gt;, applies Stage-1/2 translations → PA 0xABCD.&lt;/li&gt;
&lt;li&gt;QoS grants top priority to camera.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AXI&lt;/strong&gt; write bypasses CPU caches → &lt;strong&gt;DRAM&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;IRQ → CPU invalidates corresponding cache region (&lt;strong&gt;manual sync&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;CPU (or NPU) consumes the frame.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Non-coherent = &lt;em&gt;faster/less power&lt;/em&gt; but &lt;strong&gt;requires&lt;/strong&gt; SW sync; coherent = &lt;em&gt;simpler&lt;/em&gt; but adds snoop latency and power.&lt;/p&gt;
&lt;h3&gt;
  
  
  Flow D: QoS under contention
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Workloads: &lt;strong&gt;Display (4K60)&lt;/strong&gt;, &lt;strong&gt;NPU inference&lt;/strong&gt;, &lt;strong&gt;CPU compile&lt;/strong&gt;, &lt;strong&gt;GPU render&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Assign priorities and &lt;strong&gt;bandwidth guarantees&lt;/strong&gt; (e.g., Display 25 GB/s, NPU 20 GB/s).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLC partitioning&lt;/strong&gt; preserves working sets for Display/NPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Meet display and inference SLOs even if GPU throughput dips.&lt;/p&gt;
&lt;h2&gt;
  
  
  8) Design Patterns &amp;amp; Tuning Knobs
&lt;/h2&gt;
&lt;h3&gt;
  
  
  8.1 LLC partitioning (must-have for SLOs)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pin critical footprints (AI model, frame buffers) to keep tail latencies flat.&lt;/li&gt;
&lt;li&gt;Track with PMU: LLC hits/misses by partition; watch 95p/99p latency.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  8.2 Coherent vs. non-coherent DMA (selection matrix)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Coherent (ACE-Lite)&lt;/th&gt;
&lt;th&gt;Non-coherent (AXI)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cache sync&lt;/td&gt;
&lt;td&gt;Automatic (HW)&lt;/td&gt;
&lt;td&gt;Manual (SW flush/invalidate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Higher (snoop)&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power&lt;/td&gt;
&lt;td&gt;Higher (snoop traffic)&lt;/td&gt;
&lt;td&gt;~20% lower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Tight CPU/accelerator loops&lt;/td&gt;
&lt;td&gt;Streaming I/O (camera, video, NIC)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  8.3 Reduce TLB pressure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;2 MB huge pages&lt;/strong&gt; for large buffers (video, tensors).&lt;/li&gt;
&lt;li&gt;Result: Fewer TLB misses, fewer page walks, measurable perf + power win.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  8.4 Throughput knobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Burst length&lt;/strong&gt; (favor 128–256 B when legal).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment&lt;/strong&gt; (avoid crossing cache line boundaries).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefetchers&lt;/strong&gt; (streaming/stride detection for model and frame access).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  9) PPA (Performance, Power, Area) Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You don’t get all three.&lt;/strong&gt; Use SLOs to pick the right point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example decision:&lt;/strong&gt; Target &lt;strong&gt;&amp;lt;10 ms&lt;/strong&gt; inference&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8 MB LLC&lt;/strong&gt; → ~95% hit-rate → ~7 ms ✔; &lt;strong&gt;higher power/area&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 MB LLC&lt;/strong&gt; → ~85% hit-rate → ~9 ms ✔; &lt;strong&gt;−33% power, −50% area&lt;/strong&gt;.
&lt;strong&gt;Pick 4 MB&lt;/strong&gt; if it meets SLO and saves battery/BoM.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  10) Measurement &amp;amp; Validation (what to log first)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PMU quick-start (pseudo-C)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Count L2 misses during inference&lt;/span&gt;
&lt;span class="n"&gt;pmu_configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PMU_L2_CACHE_MISS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;run_ai_inference&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;l2m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pmu_read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"L2 misses: %llu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;l2m&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Microbenchmarks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency ladders&lt;/strong&gt;: L1/L2/LLC/DRAM load/store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLB miss latency&lt;/strong&gt;: use small pages and sparse strides to force misses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sustained bandwidth&lt;/strong&gt;: long memcpy/stream triads, mixed R/W.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coherency stress&lt;/strong&gt;: ping-pong cache lines across cores/agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QoS drills&lt;/strong&gt;: over-subscribe fabric; verify guarantees + tail latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What “good” looks like&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLO compliance at 95p/99p latencies.&lt;/li&gt;
&lt;li&gt;Stable LLC hit-rates for reserved partitions.&lt;/li&gt;
&lt;li&gt;TLB miss rate stays low under real traffic.&lt;/li&gt;
&lt;li&gt;Display never drops frames even under stress.&lt;/li&gt;
&lt;li&gt;Power aligns with budget at steady state and peaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  11) Security &amp;amp; Isolation (don’t bolt it on later)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SMMU&lt;/strong&gt; with &lt;strong&gt;per-StreamID&lt;/strong&gt; domains; least-privilege address windows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage-2&lt;/strong&gt; translations controlled by hypervisor for multi-tenant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure interrupt routing&lt;/strong&gt;: verify GIC settings for EL2/EL3 paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firmware update chain-of-trust&lt;/strong&gt; (ROM → BL → secure OS).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  12) Putting It Together: Edge AI Camera Example
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NPU&lt;/strong&gt; achieves &lt;strong&gt;&amp;lt;10 ms&lt;/strong&gt; inference on the detection model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ISP&lt;/strong&gt; captures 4K frames; &lt;strong&gt;Display&lt;/strong&gt; holds 60 FPS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLC partitioning&lt;/strong&gt; protects model + frame buffers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QoS&lt;/strong&gt; guarantees bandwidth to Display/NPU; GPU is best-effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMMU&lt;/strong&gt; isolates camera/NIC DMA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power&lt;/strong&gt; stays under &lt;strong&gt;~5 W&lt;/strong&gt; with optimal cache sizes and non-coherent streaming.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  13) Checklist for Your Implementation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Define SLOs (latency, FPS, jitter) per workload.&lt;/li&gt;
&lt;li&gt;[ ] Allocate &lt;strong&gt;LLC partitions&lt;/strong&gt; for critical agents.&lt;/li&gt;
&lt;li&gt;[ ] Choose &lt;strong&gt;coherent&lt;/strong&gt; vs &lt;strong&gt;non-coherent&lt;/strong&gt; DMA per path.&lt;/li&gt;
&lt;li&gt;[ ] Map &lt;strong&gt;huge pages&lt;/strong&gt; for large, hot buffers.&lt;/li&gt;
&lt;li&gt;[ ] Set &lt;strong&gt;QoS priorities&lt;/strong&gt; and bandwidth guarantees; measure at load.&lt;/li&gt;
&lt;li&gt;[ ] Instrument &lt;strong&gt;PMU&lt;/strong&gt;; gate changes on 95p/99p wins, not averages.&lt;/li&gt;
&lt;li&gt;[ ] Validate with microbenchmarks + application mixes, not just one workload.&lt;/li&gt;
&lt;li&gt;[ ] Lock down &lt;strong&gt;SMMU&lt;/strong&gt; policies and interrupt routing early.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This ARM SoC pattern—&lt;strong&gt;CPU orchestrates, accelerators execute, fabric keeps it coherent and fair&lt;/strong&gt;—is what powers modern edge devices. If you lock SLOs, partition caches, choose DMA modes wisely, and validate with PMU-driven loops, you’ll ship systems that are fast, power-efficient, and robust under real-world contention.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>ai</category>
      <category>architecture</category>
      <category>iot</category>
    </item>
  </channel>
</rss>
