
Decreasing latency noise and maximizing performance during end-to-end benchmarking

While benchmarking some tweaks to graphql-engine we noticed some confusing and misleading/inconsistent results which led us down a bit of a rabbit hole. This is a summary of what we've learned so far, what is left to try, and what we've tried which didn't seem to have much effect.

We're starting to explore two interrelated things:

  • understanding how to maximize the performance of a running graphql-engine in general; this reduces noise for benchmarking purposes, and also feeds into best practices we can share with users or use to help them measure performance
  • understanding how to run graphql-engine such that latency is consistent (even if much slower), so that we can be confident about whether a change we've made was actually beneficial

Some of the results here probably apply to production deployments of graphql-engine (and other services) as well; but we don't have concrete recommendations there yet. If you have success with any of these techniques in production, let us know!

Things that were very effective

Background: modern Intel processors have extremely sophisticated power management that modifies the clock frequency and powers subsystems up and down dynamically and constantly (many times per second). Modern Linux exposes knobs for some amount of control over how all this behaves: the intel_idle driver allows some control over C-states (processor idle states), while intel_pstate deals with P-states. See the references at the bottom for more.
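For a quick look at how these two drivers are configured on a given machine, the cpupower tool can report both the frequency-scaling side and the idle-state side (output varies by CPU and kernel):

$ cpupower frequency-info   # active scaling driver (e.g. intel_pstate), available governors, current policy
$ cpupower idle-info        # C-states the idle driver exposes and whether each is currently enabled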

I was taken by surprise to find that benchmarking graphql-engine at 100 RPS was such a light load that power management was skewing my measurements dramatically. Worse, it misled me: having the browser open (i.e. more load) actually caused a dramatic improvement in latency, making me think I'd made a performance breakthrough!

We can use turbostat (from the cpupower package) to look at the machine's power-management states while it runs some load, with:

$ turbostat --interval 0.1 sleep 120

For my laptop (with Core i7-3667U), without tweaks, running the 100 RPS benchmark:

    Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
    -    -   110     9.44  1167    2494    0   0   14.44  4.30   0.01   71.81  57      57     0.00    0      4.58    1.66    3.58    35.50   4.14    1.16    0.31
    0    0   113     9.80  1161    2494    0   0   13.99  3.98   0.00   72.23  56      57     0.00    0      4.58    1.66    3.58    35.50   4.14    1.16    0.31
    0    1   103     8.92  1160    2494    0   0   14.87
    1    2   113     9.85  1151    2494    0   0   14.12  4.62   0.02   71.40  57
    1    3   110     9.18  1196    2494    0   0   14.79

    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     2.44ms  833.69us  13.82ms   78.94%
      Req/Sec     26.36     67.81   400.00    85.22%
    6004 requests in 1.00m, 2.35MB read

...we can see the processor was busy (in C-state 0) less than 10% of the time, while it spent over 70% of the time in the deep C-7 state, which is slow to wake and do useful work.

"performance" p-state governor

intel_pstate offers two "governors" or modes: the default "powersave" and "performance". Switching it to "performance", with...

$ sudo cpupower frequency-set -g performance


...significantly reduces latency as the CPU is (waves hands) more ready to ramp up and do work when it comes:

    Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
    -    -   99      4.49  2206    2494    0   0   13.02  5.00   0.01   77.47  55      55     0.00    0      5.35    2.06    3.97    48.43   4.40    1.67    0.13
    0    0   104     5.01  2073    2494    0   0   12.72  5.05   0.02   77.20  54      55     0.00    0      5.35    2.06    3.97    48.43   4.40    1.67    0.13
    0    1   93      3.96  2361    2494    0   0   13.77
    1    2   103     4.99  2071    2494    0   0   12.31  4.95   0.00   77.74  55
    1    3   95      4.00  2386    2494    0   0   13.30

    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     1.52ms    1.14ms  33.86ms   97.87%
      Req/Sec     26.48     67.00   333.00    85.14%
    6004 requests in 1.00m, 2.35MB read

Notice Bzy_MHz is close to the advertised clock speed here.
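To double-check that the governor change actually took effect on every logical CPU, you can read it back from sysfs (path assumes the cpufreq/intel_pstate drivers are in use):

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # should print "performance" once per logical CPU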

Prohibit deep sleep states

Finally, we can keep the processor from entering deep sleep states, by disabling every idle state with a wakeup latency greater than 10 microseconds:

$ sudo cpupower idle-set -D10  


(Note: you can re-enable all idle states with: sudo cpupower idle-set -E)

    Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
    -    -   68      2.28  3000    2494    0   0   97.72  0.00   0.00   0.00   64      64     0.00    0      0.00    0.00    0.00    0.00    8.74    5.86    0.11
    0    0   71      2.36  3000    2494    0   0   97.64  0.00   0.00   0.00   60      64     0.00    0      0.00    0.00    0.00    0.00    8.74    5.86    0.11
    0    1   67      2.25  3000    2494    0   0   97.75
    1    2   73      2.43  3000    2494    0   0   97.57  0.00   0.00   0.00   64
    1    3   63      2.09  3000    2494    0   0   97.91

    Thread Stats   Avg       Stdev    Max   +/- Stdev
      Latency     1.25ms  404.96us   6.43ms   68.94%
      Req/Sec     25.66     63.97   333.00    84.95%
    6004 requests in 1.00m, 2.35MB read

Here we see a smaller but still significant reduction in latency. The processor now spends nearly all of its idle time in the shallow C1 state, from which it can wake up quickly.
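If you're curious exactly which idle states -D10 disabled on your machine, each state's name, wakeup latency, and disable flag are exposed in sysfs; a small loop over cpu0's entries shows them:

$ for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do \
    echo "$(cat $s/name): latency $(cat $s/latency)us disabled=$(cat $s/disable)"; \
  done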

We can graph the latency improvements from these two changes. Here we're measuring from an instrumented graphql-engine:

[Plot: latency measured from instrumented graphql-engine, before and after the two power-management changes]

FYI, here are all the AWS instance types under $3/hr that allow controlling both C-states and P-states:

  Price/hr   Instance      vCPUs   ECUs    RAM
  $1.591     c4.8xlarge    36      132     60 GiB
  $1.872     h1.8xlarge    32      99      128 GiB
  $2.00      m4.10xlarge   40      124.5   160 GiB
  $2.128     r4.8xlarge    32      99      244 GiB
  $2.496     i3.8xlarge    32      99      244 GiB

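On these instance types, the usual route to this control (per AWS's processor state control documentation) is a kernel boot parameter rather than cpupower. As a rough sketch, limiting the deepest C-state to C1 on a GRUB-based image looks something like the following; the exact files and the config-regeneration command vary by distro:

# in /etc/default/grub, append to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="... intel_idle.max_cstate=1"
# then regenerate the GRUB config (e.g. sudo update-grub on Debian/Ubuntu) and reboot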

Things that seemed to have little or dubious effect

...but might have an effect when e.g. graphql-engine is carefully isolated on its own CPUs:

  • disabling ASLR: echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
  • +RTS -qa -qm and/or pinning the parent Haskell process with taskset -cp $(ps -F <pid> | awk '{print $7}' | tail -n1) <pid> (a pinned-launch sketch follows this list). I'm still trying to understand what's happening here; launching a new OS thread (at ~5 us latency) is unlikely to be a bottleneck, but maybe it's associated with some other type of blocking.
  • echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled: this resulted in a 15% regression for small return payloads and a 15% improvement for large ones. Worth revisiting.
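As a concrete sketch of the pinning experiment above (assuming the graphql-engine binary was built with rtsopts so it accepts +RTS flags, and that CPUs 2 and 3 are free for it):

# restrict the process to logical CPUs 2-3, run 2 capabilities,
# pin OS threads to cores (-qa) and disable automatic thread migration (-qm)
$ taskset -c 2,3 graphql-engine serve +RTS -N2 -qa -qm -RTS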

Things that are likely to be effective but not explored

Carefully isolating graphql-engine on particular CPUs; see also the link dump below.

Copied from LLVM benchmarking docs:

Use https://github.com/lpechacek/cpuset to reserve cpus for just the program you are benchmarking. If using perf, leave at least 2 cores so that perf runs in one and your program in another:

cset shield -c N1,N2 -k on

This will move all threads out of N1 and N2. The -k on means that even kernel threads are moved out.

Disable the SMT pair of the cpus you will use for the benchmark. The pair of cpu N can be found in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list and disabled with:

cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | \
  awk -F, '{print $2}' | \
  sort -n | \
  uniq | \
  ( while read X ; do echo $X ; echo 0 | sudo tee /sys/devices/system/cpu/cpu$X/online ; done )


Run the program with:

cset shield --exec -- perf stat -r 10 <cmd>


This will run the command after -- in the isolated cpus. The particular perf command runs the <cmd> 10 times and reports statistics.
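When the run is finished, the shield and the offlined SMT siblings can be undone with something like:

$ cset shield --reset                                      # tear down the cpuset shield
$ echo 1 | sudo tee /sys/devices/system/cpu/cpu$X/online   # for each sibling CPU taken offline above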

...and some things that are not practical for actual deployment, but have potential for more stable (and slower) latencies:

  • disable Turbo Boost: echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • consider using tmpfs for our benchmarking postgres instance so we never touch disk (see the sketch below)
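A rough sketch of the tmpfs idea (mount point, size, and port are placeholders): mount a tmpfs, initialize a throwaway cluster on it, and point the benchmark's Postgres connection at that:

# everything below lives in RAM and disappears on unmount/reboot
$ sudo mount -t tmpfs -o size=2G tmpfs /mnt/pg-bench
$ sudo chown $USER /mnt/pg-bench
$ initdb -D /mnt/pg-bench/data
$ pg_ctl -D /mnt/pg-bench/data -o "-p 5433" start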

Things that seemed to be a waste of time

Fiddling with /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq and .../intel_pstate/min_perf_pct. It's not clear this did anything; it certainly didn't produce a fixed CPU frequency. From https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt:

For contemporary Intel processors, the frequency is controlled by the processor itself and the P-State exposed to software is related to performance levels. The idea that frequency can be set to a single frequency is fictional for Intel Core processors. Even if the scaling driver selects a single P-State, the actual frequency the processor will run at is selected by the processor itself.

References dump
