<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samaresh Kumar Singh</title>
    <description>The latest articles on DEV Community by Samaresh Kumar Singh (@samaresh_singh_1acf4838c1).</description>
    <link>https://dev.to/samaresh_singh_1acf4838c1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3392874%2F70832442-f0c6-4aba-ba77-1ed254f7b1a7.png</url>
      <title>DEV Community: Samaresh Kumar Singh</title>
      <link>https://dev.to/samaresh_singh_1acf4838c1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samaresh_singh_1acf4838c1"/>
    <language>en</language>
    <item>
      <title>ARM System-on-Chip (SoC) Deep Dive: Edge AI and Coherency Fabric</title>
      <dc:creator>Samaresh Kumar Singh</dc:creator>
      <pubDate>Wed, 01 Oct 2025 22:17:24 +0000</pubDate>
      <link>https://dev.to/samaresh_singh_1acf4838c1/arm-system-on-chip-soc-deep-dive-edge-ai-and-coherency-fabric-52en</link>
      <guid>https://dev.to/samaresh_singh_1acf4838c1/arm-system-on-chip-soc-deep-dive-edge-ai-and-coherency-fabric-52en</guid>
      <description>&lt;p&gt;LLC partitioning and QoS work together. QoS controls priority in the&lt;br&gt;
interconnect. LLC partitioning controls which agent's working set stays in&lt;br&gt;
cache. Together they form a contract: the display and NPU will get their&lt;br&gt;
bandwidth, and their data will be warm in cache. Everything else adapts.&lt;/p&gt;

&lt;p&gt;You validate this contract with PMU counters watching 95th and 99th percentile&lt;br&gt;
latencies, not averages. Average latency is a vanity metric on contended&lt;br&gt;
memory buses.&lt;/p&gt;


&lt;h2&gt;
  
  
  The SMMU: Not Just Security, Actually Correctness
&lt;/h2&gt;

&lt;p&gt;The System Memory Management Unit tends to get framed as a security feature,&lt;br&gt;
which it is, but treating it purely as a security control misses why it is&lt;br&gt;
essential for correct system operation.&lt;/p&gt;

&lt;p&gt;The SMMU sits between I/O masters (the ISP, the NIC, PCIe devices, USB&lt;br&gt;
controllers) and the physical address space. Each device gets a StreamID. The&lt;br&gt;
SMMU uses that StreamID to look up the appropriate page table, translating the&lt;br&gt;
device's virtual address to a physical address before allowing the DMA to&lt;br&gt;
proceed. A device can only access memory that its stage-1 (OS-controlled) and&lt;br&gt;
stage-2 (hypervisor-controlled) page tables permit.&lt;/p&gt;
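
&lt;p&gt;In Linux this mechanism is surfaced through the IOMMU API. Here is a minimal sketch, assuming a driver that wants exactly one 1 MB DMA window for its device; the device pointer, addresses, and error handling are placeholders, and exact signatures vary by kernel version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Sketch: bind a device (via its StreamID) to a fresh set of SMMU page
 * tables, then map a single 1 MB window. Any DMA outside it faults. */
static int isp_attach_dma_window(struct device *isp_dev,
                                 dma_addr_t buf_iova, phys_addr_t buf_pa)
{
    struct iommu_domain *dom = iommu_domain_alloc(isp_dev-&amp;gt;bus);
    int ret;

    if (!dom)                      /* empty page tables: default deny */
        return -ENOMEM;

    ret = iommu_attach_device(dom, isp_dev);   /* StreamID now resolves to dom */
    if (ret)
        return ret;

    /* The only window the device can reach; everything else faults. */
    return iommu_map(dom, buf_iova, buf_pa, SZ_1M, IOMMU_READ | IOMMU_WRITE);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most drivers get this indirectly through the DMA API rather than calling the IOMMU API themselves, but the underlying mechanism is the same StreamID-to-page-table lookup.&lt;/p&gt;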

&lt;p&gt;The security value is obvious: a compromised camera driver cannot DMA into&lt;br&gt;
kernel memory or another process's address space. But the correctness value is&lt;br&gt;
less discussed. Without the SMMU, DMA addresses are physical, meaning a driver&lt;br&gt;
bug that generates the wrong address can write anywhere in DRAM. These bugs&lt;br&gt;
tend to manifest as subtle corruption, often far from the code that caused them,&lt;br&gt;
often in a completely different process's memory. Debugging this without an&lt;br&gt;
SMMU is miserable. With an SMMU, the bad access generates a fault with a precise&lt;br&gt;
fault address and StreamID. You know immediately which device caused it and what&lt;br&gt;
address it tried to access.&lt;/p&gt;
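
&lt;p&gt;That debugging story can be wired up explicitly. A sketch using the kernel's per-domain fault callback; the handler name and log text are illustrative, and the domain is the one from the previous sketch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Called by the IOMMU core when the device's DMA misses its mappings.
 * dev and iova pinpoint the offending master and address immediately. */
static int isp_dma_fault(struct iommu_domain *dom, struct device *dev,
                         unsigned long iova, int flags, void *token)
{
    dev_err(dev, "SMMU fault: iova=0x%lx flags=0x%x\n", iova, flags);
    return -ENOSYS;   /* not handled here: let the core report it as well */
}

iommu_set_fault_handler(dom, isp_dma_fault, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;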

&lt;p&gt;For multi-tenant systems, stage-2 translation provides per-VM isolation at the&lt;br&gt;
hardware level. The guest OS sets up stage-1 translations for its devices. The&lt;br&gt;
hypervisor controls stage-2, ensuring a guest cannot map device DMA into another&lt;br&gt;
guest's physical memory ranges.&lt;/p&gt;

&lt;p&gt;The practical rule is to configure the SMMU before anything else at boot, with&lt;br&gt;
a default-deny policy, then open up specific address windows per device.&lt;br&gt;
Configuring it as an afterthought means debugging memory corruption in&lt;br&gt;
production.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Real-World Reference: Edge AI Camera at 5W
&lt;/h2&gt;

&lt;p&gt;The architecture described here is not hypothetical. Something close to it runs&lt;br&gt;
in every flagship smartphone and most modern smart camera SoCs. Here is how&lt;br&gt;
all the pieces interact in a concrete scenario.&lt;/p&gt;

&lt;p&gt;The target: a security camera platform running 4K60 video capture with&lt;br&gt;
continuous object detection at under 10ms latency, 60 FPS on the display&lt;br&gt;
output, all within a 5W thermal budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
  SENSOR["Camera Sensor + ISP\n4K 60fps"]
  DRAM["DRAM\nFrame buffers + model weights"]
  NPU["NPU\nObject detection &amp;amp;lt;10ms\nQoS priority 12"]
  CPU["Cortex-A Cluster\nOrchestration + OS"]
  DISP["Display Controller\n4K 60fps out\nQoS priority 15"]

  subgraph LLC["LLC Partitions (7 MB total)"]
    P1["NPU region\n3 MB — model hot"]
    P2["Display region\n2 MB — frame bufs"]
    P3["CPU region\n2 MB — OS/stack"]
  end

  SENSOR --&amp;gt;|"AXI non-coherent"| DRAM
  DRAM --&amp;gt;|"ACE-Lite coherent"| NPU
  DRAM --&amp;gt;|"AXI non-coherent"| DISP
  NPU --&amp;gt;|"IRQ via GIC"| CPU
  CPU --&amp;gt;|"Configure / control"| NPU
  LLC -.-&amp;gt;|"Partitioned ways"| DRAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What each path decision buys you:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ISP writes frames to DRAM via the non-coherent AXI path. At 4K60, frame&lt;br&gt;
writing is the highest-bandwidth operation on the platform. Putting this on the&lt;br&gt;
coherent fabric would generate snoop traffic proportional to the frame rate,&lt;br&gt;
burning 300 to 400 mW just in coherency protocol overhead.&lt;/p&gt;
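
&lt;p&gt;A quick back-of-the-envelope check of the scale involved, assuming NV12 output at 1.5 bytes per pixel (the exact figure depends on the ISP's pixel format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// 4K NV12 frame: 3840 x 2160 pixels at 1.5 bytes per pixel.
uint64_t bytes_per_frame = 3840ULL * 2160 * 3 / 2;   // ~12.4 MB per frame
uint64_t isp_write_bw    = bytes_per_frame * 60;      // ~746 MB/s of frame writes alone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Snoop traffic scales with that write stream, which is why keeping it off the coherent fabric pays for itself.&lt;/p&gt;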

&lt;p&gt;The NPU pulls model weights and frame data from the LLC via the coherent&lt;br&gt;
ACE-Lite path. The LLC partition for the NPU keeps the detection model resident.&lt;br&gt;
On a first inference after system boot, the model loads from DRAM into the NPU's&lt;br&gt;
LLC partition. For every subsequent inference, the weights are already warm. The&lt;br&gt;
DRAM penalty is paid once.&lt;/p&gt;

&lt;p&gt;The CPU receives an interrupt from the GIC when the NPU finishes. Because the&lt;br&gt;
NPU used coherent DMA for its output, the CPU can immediately read the detection&lt;br&gt;
results without invalidating anything. The frame timestamp and bounding box&lt;br&gt;
coordinates are in cache, coherent, ready.&lt;/p&gt;
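
&lt;p&gt;At the driver level the two paths look roughly like this, sketched with the standard Linux DMA API; the device pointers and buffer sizes are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Coherent path (NPU output): allocate once; after the completion IRQ the
 * CPU reads the results directly, with no cache maintenance. */
void *results = dma_alloc_coherent(npu_dev, RESULT_BYTES, &amp;amp;results_bus, GFP_KERNEL);

/* Streaming path (ISP frames): ownership is handed back explicitly, and the
 * sync is where the cache-invalidate cost of the non-coherent path lives. */
dma_addr_t frame_bus = dma_map_single(isp_dev, frame, FRAME_BYTES, DMA_FROM_DEVICE);
/* ... ISP DMA writes the frame ... */
dma_sync_single_for_cpu(isp_dev, frame_bus, FRAME_BYTES, DMA_FROM_DEVICE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a hardware-coherent (ACE-Lite) path that sync collapses to nearly nothing; on the non-coherent path it is a real invalidate over the whole buffer.&lt;/p&gt;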

&lt;p&gt;At steady state, the power breakdown is roughly 2W compute (CPU, NPU, ISP&lt;br&gt;
running continuously), 1.5W DRAM at medium utilization, and 1.5W display and&lt;br&gt;
peripheral. The LLC partitioning and QoS configuration account for most of the&lt;br&gt;
DRAM efficiency. Without them, the same workload at comparable latencies&lt;br&gt;
requires about 6W because of unnecessary DRAM spill and coherency overhead.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Knobs That Actually Matter
&lt;/h2&gt;

&lt;p&gt;Real production tuning comes down to a short list. Knowing which levers exist&lt;br&gt;
is the starting point. Knowing which ones move the needle on your specific&lt;br&gt;
workload is the actual skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLC partition sizing&lt;/strong&gt; needs to be empirically derived. Start with PMU&lt;br&gt;
counters measuring LLC hit rate per agent, then adjust partition sizes until&lt;br&gt;
the hit rate for time-critical agents stabilizes above 90%. Below that, you&lt;br&gt;
will see tail latency climb. Above it, you are probably over-allocating to that&lt;br&gt;
agent at the expense of others.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// PMU quick-check: L2 miss rate during inference&lt;/span&gt;
&lt;span class="n"&gt;pmu_configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PMU_L2_CACHE_MISS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;run_ai_inference&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;l2_misses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pmu_read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"L2 misses during inference: %llu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;l2_misses&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
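
&lt;p&gt;To turn that raw count into the hit-rate figure the sizing loop actually needs, pair it with an access counter. The access event name is an assumption; the pmu_* helpers are the same ones used above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Second pass with the access counter (or program both events at once
// if the PMU wrapper supports it).
pmu_configure(PMU_L2_CACHE_ACCESS);
run_ai_inference();
uint64_t l2_accesses = pmu_read();

// Resize the partition until this stays above ~0.90 for the critical agent.
double hit_rate = 1.0 - (double)l2_misses / (double)l2_accesses;
printf("L2 hit rate: %.1f%%\n", hit_rate * 100.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;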



&lt;p&gt;&lt;strong&gt;Huge page adoption&lt;/strong&gt; for video and tensor buffers is close to a free lunch.&lt;br&gt;
The TLB miss rate drops dramatically. The main friction is that huge pages need&lt;br&gt;
physically contiguous memory, which requires CMA (Contiguous Memory Allocator)&lt;br&gt;
reservation at boot. This is a one-line kernel parameter. Most teams skip it and&lt;br&gt;
then wonder why their video pipeline has periodic latency spikes.&lt;/p&gt;
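
&lt;p&gt;A sketch of the two halves, with illustrative sizes: the cma= reservation covers contiguous kernel DMA buffers, while huge pages reserved at boot back the user-mapped tensor and frame buffers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* Kernel command line (boot-time reservation, sizes illustrative):
 *   cma=256M hugepagesz=2M hugepages=512
 */

/* User space: a 64 MB tensor/frame buffer backed by 2 MB huge pages. */
void *buf = mmap(NULL, 64UL * 1024 * 1024, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (buf == MAP_FAILED)
    perror("mmap(MAP_HUGETLB)");   /* usually means the hugepage pool is empty */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;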

&lt;p&gt;&lt;strong&gt;DMA mode selection&lt;/strong&gt; should be documented per data path, not chosen once&lt;br&gt;
globally. Write it down in the driver architecture document, with the reasoning.&lt;br&gt;
Six months after initial bringup, someone will add a new accelerator and make&lt;br&gt;
the wrong choice because the rationale was never written down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QoS settings&lt;/strong&gt; should be measured under maximum contention, not idle&lt;br&gt;
conditions. Set up a stress test that runs all compute engines simultaneously,&lt;br&gt;
then verify that display and NPU latency stay within SLO bounds. If they do not,&lt;br&gt;
your priorities and bandwidth reservations need adjustment.&lt;/p&gt;
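
&lt;p&gt;A minimal sketch of that check. Here start_background_stress(), read_timer_ns(), cmp_u64, and SLO_INFERENCE_NS are hypothetical stand-ins for whatever loads your other engines, your timer source (CNTVCT_EL0 works well), a plain uint64 comparator, and your latency budget.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Time N inferences while every other engine is busy, then judge the
// tail, not the mean, against the SLO.
start_background_stress();                    /* GPU, ISP, codec all loaded */

enum { N = 10000 };
uint64_t lat[N];
for (int i = 0; i &amp;lt; N; i++) {
    uint64_t t0 = read_timer_ns();
    run_ai_inference();
    lat[i] = read_timer_ns() - t0;
}
qsort(lat, N, sizeof(lat[0]), cmp_u64);       /* ascending */

uint64_t p99 = lat[(N * 99) / 100];           /* ~99th percentile */
if (p99 &amp;gt; SLO_INFERENCE_NS)
    printf("QoS/partitioning not holding: p99 = %llu ns\n",
           (unsigned long long)p99);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;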




&lt;h2&gt;
  
  
  A Note on Tail Latency
&lt;/h2&gt;

&lt;p&gt;One trap that experienced engineers still fall into is optimizing for&lt;br&gt;
average-case latency while ignoring tail latency. Averages on contended memory&lt;br&gt;
buses look fine until they do not. A system that averages 6ms inference latency&lt;br&gt;
but hits 18ms at the 99th percentile will fail its real-time requirements in&lt;br&gt;
production, because production workloads are not averages.&lt;/p&gt;

&lt;p&gt;PMU-based profiling needs to capture percentile distributions, not means. The&lt;br&gt;
95th and 99th percentile latencies tell you whether your LLC partitions and QoS&lt;br&gt;
settings are holding up under contention. An average that looks good while the&lt;br&gt;
99th percentile drifts upward is a sign that something is occasionally evicting&lt;br&gt;
a critical working set, or that bandwidth guarantees are holding on average but&lt;br&gt;
not under peak scenarios.&lt;/p&gt;

&lt;p&gt;The correlation between LLC hit-rate stability and inference tail latency is&lt;br&gt;
often direct and observable. When a partition eviction event happens, the next&lt;br&gt;
inference cold-loads weights from DRAM and the latency spike shows up&lt;br&gt;
immediately in the distribution. Tracking these together makes root cause&lt;br&gt;
analysis tractable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;p&gt;Before declaring a platform production-ready, each of these should have a&lt;br&gt;
verified answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] SLOs defined (inference latency, display FPS, jitter) per workload&lt;/li&gt;
&lt;li&gt;[ ] LLC partitions allocated for NPU, Display, CPU — sizes validated with PMU&lt;/li&gt;
&lt;li&gt;[ ] DMA mode chosen and documented per data path, with rationale&lt;/li&gt;
&lt;li&gt;[ ] Huge pages mapped for all large hot buffers (frame buffers, tensors)&lt;/li&gt;
&lt;li&gt;[ ] QoS priorities and bandwidth guarantees set and tested under full load&lt;/li&gt;
&lt;li&gt;[ ] PMU instrumentation capturing 95p/99p distributions, not just averages&lt;/li&gt;
&lt;li&gt;[ ] SMMU default-deny policy locked in before any driver bring-up&lt;/li&gt;
&lt;li&gt;[ ] Interrupt routing for GIC verified across EL2/EL3 paths&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;The ARM SoC architecture described here is what makes modern edge computing&lt;br&gt;
possible at the power envelope and cost point that edge devices demand. A CPU&lt;br&gt;
cluster alone could not do it. A bare NPU with no cache hierarchy management&lt;br&gt;
would be unreliable. What makes it work is the combination: dedicated compute&lt;br&gt;
engines with defined roles, a shared but well-managed memory system, a&lt;br&gt;
coherency fabric that handles the hard synchronization problems, and a QoS layer&lt;br&gt;
that enforces the real-time contracts that users care about.&lt;/p&gt;

&lt;p&gt;The engineers who get this right share a common trait. They do not think about&lt;br&gt;
these components in isolation. They think about data flow, contention scenarios,&lt;br&gt;
and worst-case latency under load. They have LLC hit rates and 99th percentile&lt;br&gt;
latency numbers at their fingertips. And they configure SMMU policies before any&lt;br&gt;
other driver goes in, not after.&lt;/p&gt;

&lt;p&gt;The architecture is not magic. The properties it provides are a direct&lt;br&gt;
consequence of deliberate design decisions, most of which can be reversed by&lt;br&gt;
careless driver work or misconfigured firmware. Understanding the mechanism is&lt;br&gt;
what allows you to keep those properties intact from first bringup through&lt;br&gt;
production.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>ai</category>
      <category>architecture</category>
      <category>iot</category>
    </item>
  </channel>
</rss>
