DEV Community

loading...
Cover image for Datadog Killed Our Vault and Consul Performance

Datadog Killed Our Vault and Consul Performance

codyja profile image Cody Jarrett ・8 min read

It can often be a learning experience or at least entertaining to hear about other people's experiences in managing distributed systems at scale. Recently we experienced a HashiCorp Vault outage and I thought it would be fun to share this experience plus a few lessons learned and tips. Looking back, it's quite interesting how different things can align so perfectly to result in such an obscure issue as you'll see.

Environment Overview

My company uses the full suite of HashiCorp tools, and quite honestly we're really big fans. Their tools have allowed us to transition to modern infrastructure practices in on-prem data center and cloud environments. This has led to improved application deployments and an overall better hosting experience. At the heart of our environment we use Nomad for container and non-containerized application orchestration, Vault for secrets storage and dynamic PKI, and Consul for service discovery and for Vault's storage backend. The total environment is fairly large with about 8400 active jobs consisting of over 60,000 containers and IIS app pools running on the Nomad clusters. Consul clusters have about 10,000 services in the catalog and about 40,000 unique services being health checked.

Issue

One evening a few weeks back, we began upgrading the Nomad agent on several pools of Nomad clients (node classes) in one of our data centers. Soon after, we noticed Vault handle time metrics (vault.core.handle_request) were beginning to elevate. This has been somewhat of a reoccurring pattern in this specific environment for the past year, especially during maintenance activities that result in extra load.

Latencies continued to get worse as time passed however. Soon after we realized that Vault's mean handle time transitioned from ms units to minutes(!) which was something we definitely hadn't seen before.
image

Nomad's logging rate began increasing, full of entries like these snipped errors:

failed to create an alloc vault token: Post "/v1/auth/token/create/nomad-cluster": context deadline exceeded
failed to create an alloc vault token: Post "/v1/auth/token/create/nomad-cluster": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Enter fullscreen mode Exit fullscreen mode

At this point, whatever is occurring isn't super impactful yet since all jobs have multiple tasks running. We decided to engage HashiCorp Support since we're enterprise customers and kept digging and collecting logs.

The challenge we had was trying to identify which system was at fault here, Consul or Vault. Today in our environment, both systems are coupled very closely since we're using the Consul storage backend for Vault. This means that Vault's persistence storage is sitting (encrypted) on top of Consul's KV store. We've planned to move away from this architecture so that Vault has it's own dedicated cluster (or now raft integrated storage), but it hasn't happened yet.

We continued pouring through VM level metrics trying to identify a cause while ruling out symptoms and red herrings. There was essentially zero IO wait, CPU usage for Vault was 30% or less and Consul was 60% or less. We focused on disk performance since generally the limiting factor with Consul is disk write performance used by writing entries to the raft log. Disk IOPs on the Consul cluster were fairly low, even lower than our other data centers.
image

Nothing too wild being reported for disk latencies as well:
image

By now, we're even more stumped. HashiCorp Support has Development involved and they're worried the most about Consul commit and Vault handle times being so elevated. Our concern was that there's no way for the increasing thundering herd of Nomad token renewals/creates to ever get serviced, just compounding the issue even more.

At this point, we're suspecting an internal issue with Consul or Vault even more. Something was bottlenecking operations somewhere and it didn't seem like hardware or VMs anymore. Support was concerned about the time spent writing to the raft log still, specifically consul.raft.fsm.apply and consul.raft.commitTime.

The next theory is things are hanging because of the sheer number of Vault revocations (token and pki certs) occurring from the earlier Nomad maintenance. Support brought up a new feature in Vault that puts back pressure on the expiration manager. In Vault 1.6.x, this can be enabled with an environment variable (VAULT_16_REVOKE_PERMITPOOL=200) and 1.7.x includes this feature by default.

After learning about this new revocation back pressure mechanism, we elected to try an emergency Vault upgrade from 1.5.4 to 1.6.3. We also turned on the new environment variable in the systemd unit file for Vault. Once the dust settled, we saw a slight improvement, but still..... no great recovery yet.

Discovery

Periodically throughout the event we collected consul debug and vault debug archives, which are really convenient ways to package up logs and profiling dumps. Support reviewed those dumps multiple times and tried to identify what's causing the slowness.

After Development reviewed the pprofs more, someone brought up a potential cause: metrics, specifically there seemed to be a high amount of time being spent in a datadog metrics library that's built into Consul.

Wait... What? We weren't using datadog for monitoring so what's going on?

To get realtime metrics from our Nomad, Consul, and Vault clusters we leverage Prometheus to scrape a Telegraf agent on every VM. That's been a convenient way to collect VM level metrics (cpu, memory, disk/io, etc) plus any other service metrics on the VM. Having Consul spew statsd style metrics to the local Telegraf agent has been working great.

Consul Development began piecing things together and found there was a known issue in the datadog metrics library that was causing blocking for the Consul process. It's an unpublished issue and was patched in Consul 1.7.6 by bumping the go-metrics/Datadog version. In the Consul configuration, we had the telemetry stanza configured with the dogstatsd_addr pointing to the localhost Telegraf port. We couldn't remember exactly why we were using the dogstatsd output versus straight statsd (or even Prometheus). There was a vauge memory of adding additional tags to each metric, but clearly wasn't really needed since Telegraf supports [global_tags] in its configuration.

Resolution

To see if this truly could be causing our issues we updated the telemetry stanza on the Consul servers one at at time to include just statsd output.

Previous telemetry configuration:

"telemetry": {
    "disable_hostname": true,
    "dogstatsd_addr": "localhost:8125",
    "dogstatsd_tags": [
      "env_name:production"
    ]
  },
Enter fullscreen mode Exit fullscreen mode

Updated telemetry configuration change:

"telemetry": {
    "disable_hostname": true,
    "statsd_address": "localhost:8125"
  },
Enter fullscreen mode Exit fullscreen mode

The Consul leader was restarted last and once completed, we had near instant recovery! Thank goodness!

consul.raft.commitTime mean 120-200ms down to 7-9ms
image

Looking through Consul code, github.com/armon/go-metrics/datadog imports the official github.com/DataDog/datadog-go. I found a PR created back in October 2019(!) that fixed a blocking issue in instrumented applications (like Consul).
image

This fix ultimately ended up in Consul 1.6.7 released in July 2020.

Lessons Learned

I wanted to include a few lessons learned on various topics that hopefully can save you some time and frustrations in the future.

Important Metrics

Here's a couple interesting metrics to keep an eye on:

  • consul.raft.fsm.apply - This metric tipped HashiCorp off that there was something backing up or bottlenecking when writing raft logs. Before the fix, we were in the 20-40 range across all servers, but it fell down to 1-2 after the fix. This is how many logs were committed since last interval.
  • consul.raft.commitTime- HashiCorp says this shouldn't be more than a few milliseconds. In our environments and storage solution, I think 10-25ms is more realistic. Consider throwing alerts because something is wrong above say 30 or 50ms.
  • consul.catalog.register - This tells you how long it takes to register a service into the catalog and should be very similar to Consul commit time. A goal for mean time could be 10-20ms.
  • vault.core.handle_request - Mean time to handle requests in a health cluster look to be .5-10ms.
  • vault.runtime.num_goroutines - When Consul was slow/blocking, the vault go routines continued climbing. It looks like on a healthy these should be under 2000 generally. image

Quick Response

When troubleshooting issues like this, centralized metrics and logs are invaluable. It's often necessary to bundle up debug archives or logs to get in Support's hands. Here's a quick ansible playbook to gather consul logs from one or multiple servers at once. Logs end up in a single archive.

- name: get consul logs
  hosts: all
  vars:
    date: "{{ ansible_date_time.date }}"

  tasks:
  - name: remote - grep consul logs on hosts
    shell: "egrep 'consul:' /var/log/messages > /var/tmp/consul.log.{{ ansible_hostname }}"

  - name: remote - change log file permissions
    file:
      path: "/var/tmp/consul.log.{{ ansible_hostname }}"
      mode: 0775

  - name: remote - pull consul logs locally
    fetch:
        src: "/var/tmp/consul.log.{{ ansible_hostname }}"
        dest: "/var/tmp/consul.log.{{ ansible_hostname }}"
        flat: yes
    become: no   # fetch needs become=no if pulling large files, otherwise MemoryError

  - name: local - create tarball
    archive:
      path:
        - /var/tmp/consul.log.*
      dest: /var/tmp/consul_logs-{{ date }}.tgz
    delegate_to: 127.0.0.1
    run_once: true
    become: no

  - name: local - remove leftover local files
    file:
      path: "/var/tmp/consul.log.{{ ansible_hostname }}"
      state: absent
    delegate_to: 127.0.0.1
    become: no

  - name: remote - remove leftover remote files
    file:
      path: "/var/tmp/consul.log.{{ ansible_hostname }}"
      state: absent

  - debug:
      msg: "Logs are ready at /var/tmp/consul_logs-{{ date }}.tgz"
Enter fullscreen mode Exit fullscreen mode

Improve Your Profiling Skillset

If hardware has been ruled out, it may be time to transition to profiling and look at the application internals. It's important to be able to answer "What is my application spending time doing?" In GO based apps, pprof is the tool of choice, but it can be confusing if you don't use it often. I'm sure HashiCorp Development was using pprof along with metrics to understand where the slowness was coming from and I wish that's something we had done ourselves sooner.

Here's the Consul flame graph before the change. The purple is the regex for metrics|datadog|statsd. Note how wide (how much time is spent) the lines for go-metrics and datadog stacks are!

image

Here's the Consul profile after the remediation. Note how much less purple there is!

image

The above pprof profiles were viewed by running the following command:

go tool pprof -http=":8000" consul profile.prof
Enter fullscreen mode Exit fullscreen mode

In the above command, consul is the exact same version binary for Consul in my current directory. profile.prof came from the consul debug output.

Unsolved Issues

The last bit of advice is to not live with unsolved issues. Looking back, we should have better understood those Consul commit times and really learned what was considered high and what normal should really look like. Phantom Latency can often come back to haunt you, like we experienced! Always try to understand metrics and logs better, clean up reoccurring errors, mock up tests if needed and prove out theories. You and your team will be much better off in the long run.


Views and opinions are my own.

Discussion (0)

pic
Editor guide