Samaresh Kumar Singh

Posted on Oct 1

ARM System-on-Chip (SoC) Deep Dive: Edge AI and Coherency Fabric

#systemdesign #ai #architecture #iot

🏗️ ARM System-on-Chip (SoC) Architecture Explained

The following diagram referencing the original architecture at ARM SoC Architecture Diagram outlines a complete modern ARM-based System-on-Chip design optimized for edge AI inference and heterogeneous computing. This architecture is the foundation for modern smartphones, high-end security cameras, and autonomous systems.

Let’s break down the design layer by layer:

1) Architecture at a Glance

Five layers:

Control Plane (Cortex-A clusters, GIC, PMU)
Data Plane (NPU, GPU, ISP, VPU)
Memory Subsystem (L1/L2/LLC, TLB, MC + DRAM)
Interconnect & Coherency Fabric (CHI/ACE, snoop filters, QoS)
I/O Translation Fabric (SMMU, StreamIDs, PCIe/USB/Display/Network)

Mental model: CPU = orchestrator (control), accelerators = muscle (data). CPU sets up work; accelerators crunch it; the fabric keeps everything coherent and on time.

2) Control Plane (Top, Light Blue)

What lives here

ARM Cortex-A CPU clusters (e.g., 4 cores) with private L1I/L1D and shared L2 per cluster.
GIC (Generic Interrupt Controller) for routing/priority of device interrupts.
PMU (Performance Monitoring Unit) for on-silicon counters (cache misses, TLB misses, branches, cycles).

Why it matters

Runs OS/supervisor and orchestrates accelerators.
Handles exceptions/IRQs; configures QoS, cache partitions, SMMU mappings.
PMU is your truth serum for tuning.

Role: Configure the world. Don’t do heavy lifting.

3) Data Plane (Left Middle, Light Green)

Compute engines

NPU (AI accelerator): GEMM/convolution engines for local inference (e.g., object detection).
GPU: Graphics + GPGPU for UI, 3D, shaders, post-processing.
ISP: Image pipeline (demosaic, denoise, tone map, auto-exposure/focus).
VPU: HW encode/decode (H.264/H.265/AV1) for capture/streaming.

Key design principle

Control/Data split: CPU provides the model and buffers; NPU/GPU/ISP/VPU execute at high throughput and signal completion.

4) Memory Subsystem (Right Middle, Light Yellow)

Cache hierarchy

L1 (per-core, ~32–64 KB, ~4 cycles): fastest, tiny.
L2 (per-cluster, ~256 KB–2 MB, ~12 cycles).
LLC/L3 (shared, ~4–16 MB, ~40 cycles).
DRAM (GBs, ~100–200+ cycles).

Partitioned LLC (recommended)

Reserve LLC ways/regions per role to avoid “noisy neighbors.”
- Ex: CPU 2 MB, NPU 3 MB (model), Display 2 MB (frame cadence).
Prevents the GPU from evicting the NPU’s model or the display’s frame buffers.

TLB & huge pages

TLB caches VA→PA. Misses trigger page-walks (~100 cycles).
For large datasets, use 2 MB huge pages to cut TLB pressure drastically.

Access path: CPU → TLB → L1 → L2 → LLC → DRAM (short-circuit at first hit).

5) Interconnect & Coherency Fabric (Center, Light Coral)

CHI/ACE fabric

High-speed, coherent fabric that routes transactions and enforces cache coherency via MESI/MOESI.

Coherency 101

Keep all cores/agents in agreement about shared cache lines:
- M/O/E/S/I states; snoops on peer caches; ownership transitions on read/write.

Snoop filter

Tracks line locations so you only snoop relevant caches (reduces traffic and power).

QoS manager

Prioritize time-critical clients:
- Ex: Display (15), NPU (12), CPU (8), GPU (4).
Guarantee bandwidth slices under contention.

AXI (non-coherent)

For DMA/I/O that don’t need HW coherency (e.g., streaming camera frames).

6) I/O Translation Fabric (Bottom, Lavender)

SMMU (IOMMU)

Per-device virtual memory + isolation with Stage-1 (OS) and Stage-2 (Hypervisor) translations.
StreamID selects which translation/permissions to apply; prevents rogue DMA into kernel memory.

Peripherals

PCIe (NVMe, NICs), USB, Display (HDMI/DP), Network (Ethernet/Wi-Fi).

7) Data-Flow Walkthroughs

Flow A: CPU memory access (coherent)

CPU issues load (x = array[i]).
TLB translates VA→PA (page walk on miss).
Probe L1 → L2 → LLC; return on first hit; else DRAM.
Data bubbles back to registers via cache hierarchy.

Tip: PMU counters validate latency and hit rates across levels.

Flow B: AI inference with coherent DMA

CPU loads model (prefetch to LLC partition for NPU).
CPU programs NPU: input at 0x1000; run model X.
NPU issues ACE-Lite (coherent) DMA reads.
Fabric snoops CPU caches, probes LLC; fetches from DRAM if needed.
NPU computes (e.g., 8 ms target).
NPU writes results coherently; coherency invalidates stale CPU lines.
NPU IRQ via GIC → CPU reads fresh results immediately.

Benefit: No software cache flush; lower latency to consume results.

Flow C: Camera capture with non-coherent DMA

ISP processes a frame → starts DMA to VA 0x5000.
SMMU checks StreamID, applies Stage-1/2 translations → PA 0xABCD.
QoS grants top priority to camera.
AXI write bypasses CPU caches → DRAM.
IRQ → CPU invalidates corresponding cache region (manual sync).
CPU (or NPU) consumes the frame.

Trade-off: Non-coherent = faster/less power but requires SW sync; coherent = simpler but adds snoop latency and power.

Flow D: QoS under contention

Workloads: Display (4K60), NPU inference, CPU compile, GPU render.
Assign priorities and bandwidth guarantees (e.g., Display 25 GB/s, NPU 20 GB/s).
LLC partitioning preserves working sets for Display/NPU.

Goal: Meet display and inference SLOs even if GPU throughput dips.

8) Design Patterns & Tuning Knobs

8.1 LLC partitioning (must-have for SLOs)

Pin critical footprints (AI model, frame buffers) to keep tail latencies flat.
Track with PMU: LLC hits/misses by partition; watch 95p/99p latency.

8.2 Coherent vs. non-coherent DMA (selection matrix)

Aspect	Coherent (ACE-Lite)	Non-coherent (AXI)
Cache sync	Automatic (HW)	Manual (SW flush/invalidate)
Latency	Higher (snoop)	Lower
Power	Higher (snoop traffic)	~20% lower
Best for	Tight CPU/accelerator loops	Streaming I/O (camera, video, NIC)

8.3 Reduce TLB pressure

Use 2 MB huge pages for large buffers (video, tensors).
Result: Fewer TLB misses, fewer page walks, measurable perf + power win.

8.4 Throughput knobs

Burst length (favor 128–256 B when legal).
Alignment (avoid crossing cache line boundaries).
Prefetchers (streaming/stride detection for model and frame access).

9) PPA (Performance, Power, Area) Trade-offs

You don’t get all three. Use SLOs to pick the right point.

Example decision: Target <10 ms inference

8 MB LLC → ~95% hit-rate → ~7 ms ✔; higher power/area.
4 MB LLC → ~85% hit-rate → ~9 ms ✔; −33% power, −50% area. Pick 4 MB if it meets SLO and saves battery/BoM.

10) Measurement & Validation (what to log first)

PMU quick-start (pseudo-C):

// Count L2 misses during inference
pmu_configure(PMU_L2_CACHE_MISS);
run_ai_inference();
uint64_t l2m = pmu_read();
printf("L2 misses: %llu\n", (unsigned long long)l2m);

Microbenchmarks

Latency ladders: L1/L2/LLC/DRAM load/store.
TLB miss latency: throttle page size to force misses.
Sustained bandwidth: long memcpy/stream triads, mixed R/W.
Coherency stress: ping-pong cache lines across cores/agents.
QoS drills: over-subscribe fabric; verify guarantees + tail latency.

What “good” looks like

SLO compliance at 95p/99p latencies.
Stable LLC hit-rates for reserved partitions.
TLB miss rate stays low under real traffic.
Display never drops frames even under stress.
Power aligns with budget at steady state and peaks.

11) Security & Isolation (don’t bolt it on later)

SMMU with per-StreamID domains; least-privilege address windows.
Stage-2 translations controlled by hypervisor for multi-tenant.
Secure interrupt routing: verify GIC settings for EL2/EL3 paths.
Firmware update chain-of-trust (ROM → BL → secure OS).

12) Putting It Together: Edge AI Camera Example

NPU achieves <10 ms inference on the detection model.
ISP captures 4K frames; Display holds 60 FPS.
LLC partitioning protects model + frame buffers.
QoS guarantees bandwidth to Display/NPU; GPU is best-effort.
SMMU isolates camera/NIC DMA.
Power stays under ~5 W with optimal cache sizes and non-coherent streaming.

13) Checklist for Your Implementation

[ ] Define SLOs (latency, FPS, jitter) per workload.
[ ] Allocate LLC partitions for critical agents.
[ ] Choose coherent vs non-coherent DMA per path.
[ ] Map huge pages for large, hot buffers.
[ ] Set QoS priorities and bandwidth guarantees; measure at load.
[ ] Instrument PMU; gate changes on 95p/99p wins, not averages.
[ ] Validate with microbenchmarks + application mixes, not just one workload.
[ ] Lock down SMMU policies and interrupt routing early.

Closing Thoughts

This ARM SoC pattern—CPU orchestrates, accelerators execute, fabric keeps it coherent and fair—is what powers modern edge devices. If you lock SLOs, partition caches, choose DMA modes wisely, and validate with PMU-driven loops, you’ll ship systems that are fast, power-efficient, and robust under real-world contention.

DEV Community