The Hidden Cost of 'Good Enough' Performance Profiling on Raspberry Pi 5

#programming #devops #docker #performance

The graph in your terminal shows 2.3% CPU overhead. You spent about $60 on a Raspberry Pi 5, configured Docker Compose to run your workload, and fired up Linux Perf. The numbers look clean. The coffee is cold.

But you're profiling inside a container. And containers lie.

I found this pattern buried in a Japanese developer's write-up on Qiita — the kind of practical, get-it-done resource that rarely crosses into English-language discourse. The author (oichan00) documented their setup for running Linux Perf inside Docker Compose on a Raspberry Pi 5. It's the kind of thing that makes you nod along: "Yeah, that makes sense. Perf for containerized workloads, cheap hardware, portable setup."

It does make sense. Until it doesn't.

The Skeleton Measurement Pattern

Here's what I keep seeing in performance engineering communities, both Western and Eastern: developers who treat measurement infrastructure as an afterthought. They grab whatever hardware is available, wrap it in Docker, and start collecting metrics. The tooling works. The numbers look plausible. The dashboards fill with data.

But there's a structural problem hiding in this workflow that nobody talks about.

I call it Skeleton Measurement: infrastructure that produces the visual output of performance analysis (graphs, percentages, flame charts) without capturing the system-level behavior that actually matters. You get the skeleton of performance data without the meat of what caused it.

Linux Perf inside a container gives you timestamps and CPU cycles. What it doesn't give you is the host kernel scheduler state, NUMA node locality, cache eviction patterns from adjacent processes, or thermal throttling events on your ARM SoC. These aren't edge cases. On a Raspberry Pi 5 with its shared memory architecture and thermal constraints, these are the factors that determine whether your "2.3% overhead" measurement is real or a flattering fiction.
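A minimal sketch of how two of those missing signals, SoC temperature and current CPU clock, could be sampled on the host next to a Perf run. It assumes the standard Linux sysfs files /sys/class/thermal/thermal_zone0/temp (millidegrees Celsius) and /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq (kHz) exist on your board; the function names are mine, not from the original write-up.

```python
from pathlib import Path

# Assumed sysfs locations; verify they exist on your board before relying on them.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs thermal reading (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def parse_khz(raw: str) -> float:
    """Convert a cpufreq reading (kHz) to GHz."""
    return int(raw.strip()) / 1_000_000.0


def snapshot(temp_path: Path = TEMP_PATH, freq_path: Path = FREQ_PATH) -> dict:
    """Read one temperature/clock sample from sysfs."""
    return {
        "temp_c": parse_millidegrees(temp_path.read_text()),
        "clock_ghz": parse_khz(freq_path.read_text()),
    }
```

Calling snapshot() once a second while the workload runs, and logging the result with a timestamp, gives you something to line up against the Perf timeline afterwards.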

The author did this right by the book's standards:

Raspberry Pi 5 as the target (accessible, reproducible)
Docker Compose for workload orchestration
Linux Perf as the measurement tool
Documentation of the setup process

But the book doesn't warn you about container isolation semantics, because most performance guides assume you're running on bare metal or a properly privileged VM.

What Containers Do to Your Metrics

When you run Linux Perf inside a container without --privileged or at least --cap-add SYS_ADMIN, you're measuring a partial view of the system. The CPU's performance monitoring unit (PMU) is only reachable through the kernel's perf_event_open interface, and that interface sits behind a privilege boundary. Your container sees:

User-space CPU cycles (mostly accurate)
Software events like context switches (partially accurate)
Hardware events like cache misses, branch mispredictions (frequently inaccurate due to sampling limitations)
Scheduler decisions, NUMA topology, thermal events (largely invisible)
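Before trusting any of those numbers, it helps to know which side of the privilege boundary the container is on. The sketch below is one way to check, assuming you can read /proc/self/status and /proc/sys/kernel/perf_event_paranoid from inside the container. The capability bit numbers come from linux/capability.h (CAP_PERFMON exists from Linux 5.8 onward); the function names are hypothetical.

```python
CAP_SYS_ADMIN = 21  # bit index from linux/capability.h
CAP_PERFMON = 38    # added in Linux 5.8


def effective_caps(status_text: str) -> int:
    """Extract the effective capability bitmask (CapEff) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def perf_access_report(status_text: str, paranoid_text: str) -> dict:
    """Summarize the settings that gate perf_event_open for this process."""
    caps = effective_caps(status_text)
    return {
        "cap_sys_admin": bool((caps >> CAP_SYS_ADMIN) & 1),
        "cap_perfmon": bool((caps >> CAP_PERFMON) & 1),
        # Higher values of this sysctl restrict unprivileged perf use more tightly.
        "perf_event_paranoid": int(paranoid_text.strip()),
    }
```

Feed it the contents of the two files as read inside the container. If both capability flags come back False and perf_event_paranoid is 2 or higher, expect kernel-side and system-wide events to be restricted or refused.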

On a laptop or server, this partial view might be "good enough." On a Raspberry Pi 5 — with its 4-core ARM Cortex-A76 processor, shared GPU memory, and aggressive thermal management — you're not capturing the factors that actually determine your workload's performance envelope.

I ran a similar experiment in January 2026. I had a Python data pipeline that was "performing well" inside Docker on a RasPi 5 cluster. The Perf data showed consistent 15% CPU utilization. The reality was thermal throttling that kicked in at 45°C, dropping the effective clock speed from 2.4GHz to 1.8GHz. My "15% utilization" was real in the container's view. The 25% throughput degradation was real everywhere else.
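One way to catch this kind of throttling is the Raspberry Pi firmware tool vcgencmd, whose get_throttled subcommand prints a bitmask such as throttled=0x50000. Here is a small decoder sketch; the bit meanings follow the Raspberry Pi documentation, while the function name and the flag wording are my own.

```python
# Bit meanings as documented for `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Turn a line like 'throttled=0x50000' into a list of readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [
        name
        for bit, name in sorted(THROTTLE_FLAGS.items())
        if (value >> bit) & 1
    ]
```

Pass it the stdout of vcgencmd get_throttled, captured on the host before and after a Perf run. Bits 16 to 19 are sticky since boot, so a flag that appears only in the second reading tells you the run itself was affected.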

The Ratio of Regret

The author optimized for setup simplicity and hardware accessibility. That's a legitimate goal — not every team has budget for a dedicated perf server, and reproducibility matters.

But the trade-off is measurement fidelity. For every hour saved on initial setup, you risk hours of debugging phantom performance issues that exist in your measurement infrastructure, not your actual code.

My rule of thumb: containerized Perf on resource-constrained hardware carries a 2-3x multiplier on interpretation time. You'll spend 2-3x longer validating whether your measurements reflect reality, because you'll constantly be asking "is this real, or is this a container artifact?"

For a hobby project, that's fine. For production infrastructure decisions based on this data, that's a tax you didn't budget for.

The Japan-Specific Signal

The Japanese developer write-ups I read tend to show a pragmatic streak when it comes to hardware. The attitude is "make it work with what you have, optimize later." This creates brilliant, resourceful engineering, and occasionally it creates measurement debt that compounds silently.

The mirror image for Western developers: we're increasingly building performance testing infrastructure that matches our CI/CD pipelines (containerized, ephemeral, reproducible) without asking whether containerized measurement gives us the data we actually need. We're optimizing for the observability of our observability stack rather than the fidelity of our measurements.

This isn't a Japan problem. This is a "distributed systems engineers forgot that measurement is also a distributed systems problem" problem.

The Fix That Doesn't Scale

The correct answer — running Perf on bare metal or with full host access — reintroduces the complexity that the containerized approach was trying to avoid. Now you need:

Bare metal or VM with direct hardware access
Separate provisioning for your workload and your measurement tools
Network configuration for distributed workloads
Coordination between your "real" environment and your "measurement" environment

This is the eternal trade-off in performance engineering: measurement fidelity versus measurement overhead. The RasPi 5 + Docker + Perf approach is a valid point on this spectrum. It just isn't at the high-fidelity end.

What I'd Add to This Setup

If you're running this pattern seriously, add at least three things the tutorial doesn't cover:

Host-level reference measurements: run the same workload on bare metal before containerizing. Capture the delta. If your container overhead is consistent and understood, your containerized Perf data becomes interpretable.
Thermal monitoring correlation: on RasPi 5, correlate Perf data with vcgencmd measure_temp and vcgencmd get_throttled. Thermal throttling events can explain more variance in ARM SoC performance than any amount of CPU profiling will.
Hardware event validation: run perf stat -e cycles,instructions,branches,branch-misses both inside the container and on the host for identical workloads. Quantify the delta. Now you know your "container tax" on measurement accuracy.
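The host-versus-container comparison can be scripted. The sketch below assumes you ran perf stat with -x, (CSV output, where the first field is the count and the third is the event name) and saved each run with -o to a file. The function names and any sample numbers you feed it are illustrative, not measurements from a real Pi.

```python
from typing import Dict, Optional


def parse_perf_csv(text: str) -> Dict[str, Optional[int]]:
    """Parse `perf stat -x,` output into {event: count}.

    Events perf could not count (e.g. '<not supported>') map to None.
    """
    counts: Dict[str, Optional[int]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split(",")
        if len(fields) < 3:
            continue  # not a counter line
        value, event = fields[0], fields[2]
        counts[event] = int(value) if value.isdigit() else None
    return counts


def relative_delta(
    host: Dict[str, Optional[int]], container: Dict[str, Optional[int]]
) -> Dict[str, Optional[float]]:
    """Per-event (container - host) / host; None when either side is missing or zero."""
    deltas: Dict[str, Optional[float]] = {}
    for event, host_count in host.items():
        cont_count = container.get(event)
        if not host_count or cont_count is None:
            deltas[event] = None
        else:
            deltas[event] = (cont_count - host_count) / host_count
    return deltas
```

A None in the result is as informative as a number: it marks an event the container could not count at all. Repeat each run several times before reading anything into small deltas, since run-to-run noise on a thermally constrained board can be large.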

The author gave you the recipe. I'm telling you to taste the soup before serving it.

What’s your take?

Have you caught yourself trusting containerized performance measurements that turned out to be flattering? What's the most misleading Perf result you've ever acted on? Drop a comment below. I respond to every one.

Based on Qiita article by oichan00 on Linux Perf measurement setup with Docker Compose on Raspberry Pi 5
