<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nikolay Sivko</title>
    <description>The latest articles on DEV Community by Nikolay Sivko (@def).</description>
    <link>https://dev.to/def</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F812051%2F40157b16-49e9-4c13-a345-c7bd3504d2d0.png</url>
      <title>DEV Community: Nikolay Sivko</title>
      <link>https://dev.to/def</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/def"/>
    <language>en</language>
    <item>
      <title>Delay accounting: an underrated feature of the Linux kernel</title>
      <dc:creator>Nikolay Sivko</dc:creator>
      <pubDate>Tue, 15 Mar 2022 16:02:59 +0000</pubDate>
      <link>https://dev.to/def/delay-accounting-an-underrated-feature-of-the-linux-kernel-2hjf</link>
      <guid>https://dev.to/def/delay-accounting-an-underrated-feature-of-the-linux-kernel-2hjf</guid>
      <description>&lt;p&gt;Nowadays, in the era of microservices, infrastructures have become super-complex: dynamic nodes provisioning, autoscaling, dozens or even hundreds of containers working side by side. In order to maintain control over such infrastructure, we need to be able to know what has happened to each application at any given time.&lt;/p&gt;

&lt;p&gt;First, let's look at computing resources. Usually, when engineers talk about resources, they are actually referring to utilization, which is not always the right lens. For example, high CPU utilization by a container is not an issue in itself. The real problem is that it can cause another container on the same machine to run slower due to a lack of CPU time.&lt;/p&gt;

&lt;p&gt;Few people know that the Linux kernel tracks exactly how long each task has been waiting for kernel resources to become available. For instance, a task can wait for CPU time or for synchronous block I/O to complete. Such delays usually directly affect application latency, so measuring them is quite reasonable.&lt;/p&gt;

&lt;p&gt;When researching how we could detect CPU-related issues, we came to the conclusion that the CPU delay of a container is a perfect starting point for investigation. The kernel provides per-PID or per-TGID &lt;a href="https://www.kernel.org/doc/html/latest/accounting/delay-accounting.html"&gt;statistics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The per-PID (Process ID) statistics correspond only to the main thread of a process.&lt;/li&gt;
&lt;li&gt;The per-TGID (Task Group ID) statistics are the sum over all threads of a process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/coroot/coroot-node-agent"&gt;Node-agent&lt;/a&gt; gathers per-TGID statistics from the kernel through the &lt;a href="https://man7.org/linux/man-pages/man7/netlink.7.html"&gt;Netlink&lt;/a&gt; protocol and aggregates it to the per-container metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ItM0b8oj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/delay_accounting_aggregation.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ItM0b8oj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/delay_accounting_aggregation.svg" alt="aggregation" width="880" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see how one of the resulting metrics can help in detecting various CPU-related issues. I'll use failure scenarios from Failurepedia, our library of recorded failure scenarios, to illustrate this.&lt;/p&gt;

&lt;h2&gt;A noisy neighbor&lt;/h2&gt;

&lt;p&gt;In this scenario, a CPU-intensive container (&lt;code&gt;stats-aggregator&lt;/code&gt;) runs on the same node as a latency-sensitive app (&lt;code&gt;reservations&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--816yzH6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hzngfw3sd2orvgf3t89n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--816yzH6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hzngfw3sd2orvgf3t89n.png" alt="slo1" width="880" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the chart above, we can see that the &lt;code&gt;reservations&lt;/code&gt; app is not meeting its SLOs: requests are performing slowly (the red area) and failing (the black area). Now let's look at the CPU metrics related to this application:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2-4SEVuw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cn4aam22u8vyypae1v17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2-4SEVuw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cn4aam22u8vyypae1v17.png" alt="cpu1" width="880" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing these metrics, we can easily conclude that &lt;code&gt;stats-aggregator&lt;/code&gt; caused the issue. But could we have done such an analysis without the &lt;a href="https://coroot.com/docs/metrics/node-agent#container_resources_cpu_delay_seconds_total"&gt;container_resources_cpu_delay_seconds_total&lt;/a&gt; metric?&lt;/p&gt;

&lt;p&gt;By contrast, here are the same metrics for a case where &lt;code&gt;stats-aggregator-(be)&lt;/code&gt; was started with the lowest CPU priority (the &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-besteffort"&gt;Best-Effort&lt;/a&gt; QoS class was assigned).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---0IMdBI_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g0bn0lnm3ixdm40bw3m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---0IMdBI_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g0bn0lnm3ixdm40bw3m9.png" alt="slo2" width="880" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the CPU utilization of the node is 100%, but this is not significantly affecting the &lt;code&gt;reservations&lt;/code&gt;' SLIs.&lt;/p&gt;

&lt;p&gt;At Coroot, we came up with the following algorithm to detect CPU-related issues of a particular application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the app is meeting its SLOs, there is no need to check anything else.&lt;/li&gt;
&lt;li&gt;If the total CPU delay of all the app's containers correlates with its affected SLIs, the app is experiencing a lack of CPU time.&lt;/li&gt;
&lt;/ul&gt;
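
&lt;p&gt;The correlation check can be sketched as a plain Pearson correlation over two aligned time series. This is a minimal illustration; Coroot's actual preprocessing and thresholds are more involved.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient of two
// equally sized, time-aligned series.
func pearson(x, y []float64) float64 {
	n := float64(len(x))
	var sx, sy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
	}
	mx, my := sx/n, sy/n
	var cov, vx, vy float64
	for i := range x {
		cov += (x[i] - mx) * (y[i] - my)
		vx += (x[i] - mx) * (x[i] - mx)
		vy += (y[i] - my) * (y[i] - my)
	}
	return cov / math.Sqrt(vx*vy)
}

func main() {
	// Illustrative samples: total CPU delay vs. the latency SLI.
	cpuDelay := []float64{0.1, 0.2, 0.9, 1.1, 1.0}
	latencySLI := []float64{0.05, 0.07, 0.42, 0.55, 0.50}
	fmt.Printf("correlation: %.2f\n", pearson(cpuDelay, latencySLI))
}
```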

&lt;p&gt;I should clarify why checking correlation can be tricky. As seen in the first scenario above, several SLIs (latency and success rate) were affected at once. This is why Coroot also checks the correlation between the metric and combinations of the relevant SLIs.&lt;/p&gt;

&lt;p&gt;In the first stage, we had an issue with the application response time, so there was a strong correlation between this SLI and the total CPU delay.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hoyRbQEc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk27rwp0cmdsuwxozu8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hoyRbQEc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kk27rwp0cmdsuwxozu8q.png" alt="correlation1" width="880" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the errors appeared, the correlation between cpu_delay and the response time became weak. However, the correlation between cpu_delay and the sum of errors and slow requests was strong enough to confirm that the container did indeed experience a lack of CPU time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JXOj3iAO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhkjxc4q4t8w8hul06ks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JXOj3iAO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhkjxc4q4t8w8hul06ks.png" alt="correlation2" width="880" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lack of CPU time can occur for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A container competes for CPU time against the other containers running on the same node.&lt;/li&gt;
&lt;li&gt;The Linux CPU scheduler (CFS) &lt;a href="https://tempesta-tech.com/blog/nginx-tail-latency#the-linux-scheduler"&gt;preempts&lt;/a&gt; the processes of a container in favor of the other containers on the same node.&lt;/li&gt;
&lt;li&gt;A container consumes all available CPU time on its own.&lt;/li&gt;
&lt;li&gt;A container has reached its CPU limit and has been throttled by the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;CPU throttling&lt;/h2&gt;

&lt;p&gt;If a container is limited in CPU time (throttled), the &lt;code&gt;cpu_delay&lt;/code&gt; metric is bound to increase. However, in this case there is a separate metric that measures how long each container has been throttled — &lt;a href="https://coroot.com/docs/metrics/node-agent#container_resources_cpu_throttled_seconds_total"&gt;container_resources_cpu_throttled_seconds_total&lt;/a&gt;. Coroot uses it in addition to &lt;code&gt;cpu_delay&lt;/code&gt; to determine that a container has reached its CPU limit and that this is what caused the lack of CPU time.&lt;/p&gt;
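
&lt;p&gt;Since this metric is a counter, a detector works on its increase over a scrape interval. Here is an illustrative sketch (the function name and the sample values are mine, not Coroot's):&lt;/p&gt;

```go
package main

import "fmt"

// throttledShare converts two samples of the
// container_resources_cpu_throttled_seconds_total counter into the
// fraction of wall-clock time the container spent throttled.
func throttledShare(prev, cur, intervalSeconds float64) float64 {
	if intervalSeconds == 0 {
		return 0
	}
	if prev > cur {
		return 0 // the counter was reset, e.g. the container restarted
	}
	return (cur - prev) / intervalSeconds
}

func main() {
	// 12s of throttling accumulated over a 60s scrape interval.
	fmt.Printf("throttled %.0f%% of the time\n", throttledShare(100, 112, 60)*100)
}
```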

&lt;p&gt;Here is a failure scenario where the number of requests to the &lt;code&gt;image-resizer&lt;/code&gt; app gradually increases until its containers reach their CPU limits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TX2t2nHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/756i2l9rqj3l54c64ccw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TX2t2nHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/756i2l9rqj3l54c64ccw.png" alt="slo3" width="880" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detection works in the same way: by checking the correlation between the app's affected SLIs and throttled_time. If the correlation is strong, we can conclude that the lack of CPU time is caused by throttling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_XjHmjOm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3bv3etbnprchae07brh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_XjHmjOm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3bv3etbnprchae07brh9.png" alt="correlation4" width="880" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Useful links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/coroot/coroot-node-agent"&gt;node-agent&lt;/a&gt; (Apache-2.0 License)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://coroot.com/docs/inspections/cpu"&gt;CPU Inspection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://coroot.com/auth/signup"&gt;Try Coroot for free&lt;/a&gt; (14-day trial)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>monitoring</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>linux</category>
    </item>
    <item>
      <title>How ping measures network round-trip time accurately using SO_TIMESTAMPING</title>
      <dc:creator>Nikolay Sivko</dc:creator>
      <pubDate>Mon, 28 Feb 2022 15:13:07 +0000</pubDate>
      <link>https://dev.to/def/how-ping-measures-network-round-trip-time-accurately-using-sotimestamping-1b5g</link>
      <guid>https://dev.to/def/how-ping-measures-network-round-trip-time-accurately-using-sotimestamping-1b5g</guid>
      <description>&lt;p&gt;While working on &lt;a href="https://github.com/coroot/coroot-node-agent"&gt;node-agent&lt;/a&gt;, we set out to measure network latency between containers and the services they communicate with. Since the agent has already discovered the endpoints that each container communicates with, we just need to measure network latency. We embedded "pinger" directly into the agent to measure end-to-end latency because the ICMP Echo requests should be sent from within the network namespace of each container.&lt;/p&gt;

&lt;p&gt;I've looked at the most popular pinger implementations in Go, and unfortunately, all of them use userspace-generated timestamps to calculate RTTs (round-trip times). This can lead to significant measurement errors, especially when the node lacks CPU time due to throttling or high utilization.&lt;/p&gt;

&lt;p&gt;Let's look at how ping actually measures RTT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2fcKkxxT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping_rtt.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2fcKkxxT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping_rtt.svg" alt="ping" width="880" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Though it looks simple enough, the devil is in the details. First, let's see how a packet is sent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ff5kh4__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping_sending.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ff5kh4__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping_sending.svg" alt="ping sending" width="880" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main problem here is that a considerable amount of time can elapse between requesting the current timestamp and actually sending the packet. The same error occurs at the receiving stage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3djLW_Om--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping_receiving.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3djLW_Om--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping_receiving.svg" alt="ping receiving" width="880" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To illustrate a case where the error is quite noticeable, let's run ping in a container that is limited in CPU time. The &lt;code&gt;-U&lt;/code&gt; flag tells ping to use userspace-generated timestamps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-ti&lt;/span&gt; &lt;span class="nt"&gt;--cpu-period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10000 &lt;span class="nt"&gt;--cpu-quota&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000 &lt;span class="nt"&gt;--name&lt;/span&gt; pinger ping 8.8.8.8 &lt;span class="nt"&gt;-U&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I'll start a CPU-consuming app in the same container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-ti&lt;/span&gt; pinger stress &lt;span class="nt"&gt;--cpu&lt;/span&gt; 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2TxaQ3nB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2TxaQ3nB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/ping.png" alt="ping affected" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the measurement results were severely affected. If we run ping without &lt;code&gt;-U&lt;/code&gt;, stress does not affect the RTT at all. So how does ping generate timestamps that exclude the extra time spent before the packet is actually sent or after it is received?&lt;/p&gt;

&lt;p&gt;There is an interface called &lt;a href="https://www.kernel.org/doc/Documentation/networking/timestamping.txt"&gt;SO_TIMESTAMPING&lt;/a&gt; that allows a userspace program to request the transmission/reception timestamp of a particular packet from the kernel. &lt;code&gt;SO_TIMESTAMPING&lt;/code&gt; supports multiple timestamp sources, but we will only use those that do not require special hardware or device driver support.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SOF_TIMESTAMPING_TX_SCHED&lt;/code&gt;: request TX timestamps prior to entering the packet scheduler.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SOF_TIMESTAMPING_RX_SOFTWARE&lt;/code&gt;: these timestamps are generated just after a device driver hands a packet to the kernel receive stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SO_TIMESTAMPING&lt;/code&gt; can be enabled for a particular packet (using &lt;a href="https://man7.org/linux/man-pages/man3/cmsg.3.html"&gt;control messages&lt;/a&gt;) or for every packet passing through the socket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOF_TIMESTAMPING_SOFTWARE&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOF_TIMESTAMPING_RX_SOFTWARE&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOF_TIMESTAMPING_TX_SCHED&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
         &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOF_TIMESTAMPING_OPT_CMSG&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOF_TIMESTAMPING_OPT_TSONLY&lt;/span&gt;

&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetsockoptInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socketFd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOL_SOCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SO_TIMESTAMPING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Immediately after sending a packet to the socket, we can ask the kernel for the timestamp of the packet transmission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pkt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="n"&gt;oob&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// a buffer for Out-Of-Band data where the kernel will write the timestamp&lt;/span&gt;
&lt;span class="c"&gt;// MSG_ERRQUEUE indicates that we want to receive a message from the socket's error queue&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oobn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Recvmsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socketFd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pktBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MSG_ERRQUEUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sentAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;getTimestampFromOutOfBandData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oobn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we parse the received message to extract the timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;getTimestampFromOutOfBandData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oob&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oobn&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;cms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseSocketControlMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;oobn&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;cms&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SOL_SOCKET&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SO_TIMESTAMPING&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;unix&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ScmTimestamping&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LittleEndian&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unix&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"no timestamp found"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the receive path, the ancillary data arrives along with the packet data, so there is no need to read anything from the socket's error queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oobn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ra&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadMsgIP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pktBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;//...&lt;/span&gt;
&lt;span class="n"&gt;receivedAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;getTimestampFromOutOfBandData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oobn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, we can calculate the RTT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;rtt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;receivedAt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another thing worth noting is that the timestamp of when the packet has been sent must be stored in the app's memory, not in the packet's payload. Otherwise, it can lead to &lt;a href="https://twitter.com/m_ou_se/status/1480184732562374656"&gt;funny consequences&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The resulting round-trip time reflects the real network latency much more accurately and is not subject to the errors described above. The agent uses this method to collect the &lt;a href="https://coroot.com/docs/metrics/node-agent#container_net_latency_seconds"&gt;container_net_latency_seconds&lt;/a&gt; metric for each container. Since this metric is broken down by &lt;code&gt;destination_ip&lt;/code&gt;, you can always find out the network latency between a container and each of the services it communicates with. The limitation of this method is that it doesn't work when ICMP traffic is blocked, as in the case of Amazon RDS instances.&lt;/p&gt;

&lt;h2&gt;Useful links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/coroot/coroot-node-agent"&gt;node-agent&lt;/a&gt; (Apache-2.0 License)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://coroot.com/docs/inspections/upstream-service#network"&gt;How Coroot checks the correlation between network RTT and the app's SLIs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://coroot.com/auth/signup"&gt;Try Coroot for free&lt;/a&gt; (14-day trial)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>monitoring</category>
      <category>linux</category>
    </item>
    <item>
      <title>Building a service map using eBPF</title>
      <dc:creator>Nikolay Sivko</dc:creator>
      <pubDate>Fri, 25 Feb 2022 13:12:24 +0000</pubDate>
      <link>https://dev.to/def/building-a-service-map-using-ebpf-2bfl</link>
      <guid>https://dev.to/def/building-a-service-map-using-ebpf-2bfl</guid>
<description>&lt;p&gt;Distributed request tracing is a popular method for monitoring distributed systems because it allows you to see the specific execution stages of any request, such as calls to other services and databases. However, the cost of integration can be quite high, since it requires changing the code of every component. Additionally, it's practically impossible to achieve 100% coverage, because many third-party components don't support such instrumentation.&lt;/p&gt;

&lt;p&gt;To address these disadvantages, we implemented eBPF-based container tracing as part of our open-source Prometheus exporter &lt;a href="https://github.com/coroot/coroot-node-agent"&gt;node-agent&lt;/a&gt;. It passively monitors all TCP connections on a node, associates every connection with the related container, and exports metrics in Prometheus format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# HELP container_net_tcp_successful_connects_total Total number of successful TCP connects&lt;/span&gt;
&lt;span class="c"&gt;# TYPE container_net_tcp_successful_connects_total counter&lt;/span&gt;
container_net_tcp_successful_connects_total&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;actual_destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.128.0.43:443"&lt;/span&gt;,container_id&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/k8s/default/prometheus-0/prometheus-server"&lt;/span&gt;,destination&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.52.0.1:443"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 2

&lt;span class="c"&gt;# HELP container_net_tcp_active_connections Number of active outbound connections used by the container&lt;/span&gt;
&lt;span class="c"&gt;# TYPE container_net_tcp_active_connections gauge&lt;/span&gt;
container_net_tcp_active_connections&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;actual_destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.128.0.43:443"&lt;/span&gt;,container_id&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/k8s/default/prometheus-0/prometheus-server"&lt;/span&gt;,destination&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.52.0.1:443"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 1

&lt;span class="c"&gt;# HELP container_net_tcp_failed_connects_total Total number of failed TCP connects&lt;/span&gt;
&lt;span class="c"&gt;# TYPE container_net_tcp_failed_connects_total counter&lt;/span&gt;
container_net_tcp_failed_connects_total&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/k8s/kube-system/konnectivity-agent-56cdbd78f-f7r7j/konnectivity-agent"&lt;/span&gt;,destination&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.48.2.2:10250"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 20

&lt;span class="c"&gt;# HELP container_net_tcp_listen_info Listen address of the container&lt;/span&gt;
&lt;span class="c"&gt;# TYPE container_net_tcp_listen_info gauge&lt;/span&gt;
container_net_tcp_listen_info&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/k8s/default/paymentservice-5849646947-b744v/server"&lt;/span&gt;,listen_addr&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.48.0.4:50051"&lt;/span&gt;,proxy&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These metrics are obtained using the &lt;code&gt;sock:inet_sock_set_state&lt;/code&gt; kernel tracepoint. As the name implies, this tracepoint is called whenever a TCP connection changes its state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection establishing
&lt;/h2&gt;

&lt;p&gt;First, let's recall the TCP state transitions while establishing an outbound connection:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kUSL84NR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/tcp_states_connect.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kUSL84NR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/tcp_states_connect.svg" alt="tcp connect" width="880" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen in the diagram, a connection can be considered successfully established if the &lt;code&gt;SYN_SENT&lt;/code&gt; -&amp;gt; &lt;code&gt;ESTABLISHED&lt;/code&gt; transition has occurred. Conversely, &lt;code&gt;SYN_SENT&lt;/code&gt; -&amp;gt; &lt;code&gt;CLOSED&lt;/code&gt; means a failed connection attempt. The tricky part is that an eBPF program can only get the &lt;code&gt;PID&lt;/code&gt; of the connection initiator during the &lt;code&gt;CLOSED&lt;/code&gt; -&amp;gt; &lt;code&gt;SYN_SENT&lt;/code&gt; transition. The solution is to save the initiator's PID in a kernel-space map, keyed by the socket's pointer, so it can be looked up on subsequent transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// saving the connection initiator's PID&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oldstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_CLOSE&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_SYN_SENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;sk_info&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_get_current_pid_tgid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;bpf_map_update_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sk_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skaddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BPF_ANY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;//getting the PID&lt;/span&gt;
&lt;span class="n"&gt;__u32&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;sk_info&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bpf_map_lookup_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sk_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skaddr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;bpf_map_delete_elem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sk_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skaddr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full source code of the &lt;code&gt;inet_sock_set_state&lt;/code&gt; handler can be found &lt;a href="https://github.com/coroot/coroot-node-agent/blob/92ef04b486daf0703591edd7526b34fdb4d9cf54/ebpftracer/ebpf/tcp_state.c#L55"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection closing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--65OFHhOD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/tcp_states_close.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--65OFHhOD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/tcp_states_close.svg" alt="tcp close" width="880" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on which peer closes the connection, either an &lt;code&gt;ESTABLISHED&lt;/code&gt; -&amp;gt; &lt;code&gt;FIN_WAIT1&lt;/code&gt; (active close) or an &lt;code&gt;ESTABLISHED&lt;/code&gt; -&amp;gt; &lt;code&gt;CLOSE_WAIT&lt;/code&gt; (passive close) transition will occur. The only thing worth noting here is that we cannot get a &lt;code&gt;PID&lt;/code&gt; in the passive-close case, so we decided to resolve this in the userspace code. Since the handler does not need to differentiate between active and passive closes, it triggers on both transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oldstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_ESTABLISHED&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_FIN_WAIT1&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_CLOSE_WAIT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EVENT_TYPE_CONNECTION_CLOSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  TCP LISTEN
&lt;/h2&gt;

&lt;p&gt;In addition to outgoing connection tracing, we need to discover all listening sockets and containers associated with them. This can also be done by handling the &lt;code&gt;CLOSED&lt;/code&gt; -&amp;gt; &lt;code&gt;LISTEN&lt;/code&gt; and &lt;code&gt;LISTEN&lt;/code&gt; -&amp;gt; &lt;code&gt;CLOSED&lt;/code&gt; transitions. The initiator's &lt;code&gt;PID&lt;/code&gt; is available in both cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oldstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_CLOSE&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_LISTEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EVENT_TYPE_LISTEN_OPEN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oldstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_LISTEN&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newstate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;BPF_TCP_CLOSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EVENT_TYPE_LISTEN_CLOSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the IP address of a listen socket is unspecified (&lt;code&gt;0.0.0.0&lt;/code&gt;), the agent replaces it with the IP addresses assigned to the corresponding network namespace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Address Translation (NAT)
&lt;/h2&gt;

&lt;p&gt;Each event sent by the eBPF program to the agent contains the source and destination &lt;code&gt;IP:PORT&lt;/code&gt; pairs. However, the destination address can be virtual, as in the case of Kubernetes services, where the service's &lt;code&gt;ClusterIP&lt;/code&gt; is replaced with the IP of a particular Pod by &lt;code&gt;iptables&lt;/code&gt; or &lt;code&gt;IPVS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---1Xb-FBu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/k8s_service_nat.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---1Xb-FBu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://coroot.com/static/img/blog/k8s_service_nat.svg" alt="tcp close" width="880" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we need to know both the services each container interacted with and the specific containers involved in that interaction, the agent resolves the actual destination of each connection by querying the &lt;code&gt;conntrack&lt;/code&gt; table over the Netlink protocol. As a result, each metric contains both &lt;code&gt;destination&lt;/code&gt; and &lt;code&gt;actual_destination&lt;/code&gt; labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"10.96.0.1:80"&lt;/span&gt;, &lt;span class="nv"&gt;actual_destination&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"192.168.1.11:80"&lt;/span&gt;, &lt;span class="nv"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/k8s/default/client-pod/server"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Initialization
&lt;/h2&gt;

&lt;p&gt;In addition to being notified of every new connection, the agent needs to detect all connections that were established before it started. To do this, on startup it reads the established connections from &lt;code&gt;/proc/net/tcp&lt;/code&gt; and &lt;code&gt;/proc/net/tcp6&lt;/code&gt; in each network namespace. The socket inode of each connection from these files is then matched against &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/fd&lt;/code&gt; entries to find the corresponding container.&lt;/p&gt;
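&lt;p&gt;As an illustration (a hedged sketch, not the agent's actual implementation), here is how the hex-encoded address fields in those files can be decoded. On little-endian hosts, &lt;code&gt;/proc/net/tcp&lt;/code&gt; stores the IPv4 address bytes reversed, while the port is plain big-endian hex:&lt;/p&gt;

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// parseHexAddr decodes an IPv4 "ADDR:PORT" field from /proc/net/tcp,
// where the address is a little-endian hex number and the port is
// big-endian hex, e.g. "0100007F:0050" -> "127.0.0.1:80".
func parseHexAddr(s string) (string, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return "", fmt.Errorf("malformed address: %q", s)
	}
	raw, err := hex.DecodeString(parts[0])
	if err != nil || len(raw) != 4 {
		return "", fmt.Errorf("bad IPv4 part: %q", parts[0])
	}
	// on little-endian hosts the address bytes appear reversed
	ip := net.IPv4(raw[3], raw[2], raw[1], raw[0])
	port, err := strconv.ParseUint(parts[1], 16, 16)
	if err != nil {
		return "", fmt.Errorf("bad port part: %q", parts[1])
	}
	return fmt.Sprintf("%s:%d", ip, port), nil
}

func main() {
	addr, err := parseHexAddr("0100007F:0050")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // 127.0.0.1:80
}
```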

&lt;h2&gt;
  
  
  How we use these metrics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;container_net_tcp_successful_connects_total&lt;/code&gt; and &lt;code&gt;container_net_tcp_active_connections&lt;/code&gt; metrics show which &lt;code&gt;IP:PORT&lt;/code&gt; each container is communicating with.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;container_net_tcp_listen_info&lt;/code&gt; metric shows on which &lt;code&gt;IP:PORT&lt;/code&gt; each container accepts inbound TCP connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Joining these metrics by &lt;code&gt;IP:PORT&lt;/code&gt; allows us to build a map of container-to-container communications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GeWJPhRe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2zwdj553x0zdj9fx4mp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GeWJPhRe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2zwdj553x0zdj9fx4mp2.png" alt="instance-to-instance" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To build a map of service-to-service communications, Coroot aggregates individual containers into applications, such as &lt;code&gt;Deployments&lt;/code&gt; or &lt;code&gt;StatefulSets&lt;/code&gt;, using metrics from the &lt;code&gt;kube-state-metrics&lt;/code&gt; exporter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--REIfGPjs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/55lpeyan47ljz7n4qvkj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--REIfGPjs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/55lpeyan47ljz7n4qvkj.png" alt="app-to-app" width="880" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A service map like this can give you an overview of the distributed system architecture.&lt;/p&gt;

&lt;p&gt;Here I've only described a portion of the network metrics needed to build a service map. Next time we will talk about metrics that are extremely useful for troubleshooting network-related issues. Stay tuned!&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/coroot/coroot-node-agent"&gt;node-agent&lt;/a&gt;  (Apache-2.0 License)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://coroot.com/docs/inspections/upstream-service#network"&gt;Upstream Service Inspection&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://coroot.com/auth/signup"&gt;Try Coroot for free&lt;/a&gt; (14-day trial)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ebpf</category>
      <category>sre</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Mining metrics from unstructured logs</title>
      <dc:creator>Nikolay Sivko</dc:creator>
      <pubDate>Thu, 24 Feb 2022 11:27:39 +0000</pubDate>
      <link>https://dev.to/def/mining-metrics-from-unstructured-logs-a8h</link>
      <guid>https://dev.to/def/mining-metrics-from-unstructured-logs-a8h</guid>
      <description>&lt;p&gt;Looking through endless postmortem reports and talking to other SREs, I feel that about 80% of outages are caused by similar factors: infrastructure failures, network errors/delays, lack of computing resources, etc. However, the cause of the remaining 20% can vary significantly since not all applications are built the same way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logs
&lt;/h2&gt;

&lt;p&gt;All well-designed applications log their errors, so using a log management system can help you navigate through the logs quite efficiently. Yet, the cost of such systems can be unreasonably high. In fact, engineers don't even need all the logs when investigating an incident, because it's impossible to read every single message in a short period of time. Instead, they try to extract some sort of summary from the logs related to a specific timespan. With console utilities, this would look something like the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting the stream of the related log:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt; cat | tail | journalctl | kubectl logs &amp;gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Identifying the format to filter messages by severity:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... | grep ERROR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If needed, doing something with multi-line messages:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... | grep -C&amp;lt;N&amp;gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Clustering (grouping) messages to identify the most frequent errors:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... |  &amp;lt; some awk | sed | sort magic &amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting summary is usually a list of message groups with a sample and the number of occurrences.&lt;/p&gt;

&lt;p&gt;It seems like this particular type of analysis can be done without a centralized log management system. Moreover, with some automation, we can greatly speed up this type of investigation.&lt;/p&gt;

&lt;p&gt;At Coroot, we implemented automated log parsing in our open-source Prometheus exporter &lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;node-agent&lt;/a&gt;. To explain how it works, let's follow the same steps I mentioned above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container logs discovery
&lt;/h2&gt;

&lt;p&gt;The agent can discover the logs of each container running on a node by using different data sources depending on the container type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A process might log directly to a file in &lt;code&gt;/var/log&lt;/code&gt;. In this case, the agent detects this by reading the list of open files of each process from &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/fd&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Services managed by &lt;code&gt;systemd&lt;/code&gt; usually use &lt;code&gt;journald&lt;/code&gt; as a logging backend, so the agent can read a container's log from &lt;code&gt;journald&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To find logs of Kubernetes pods or standalone containers, the agent uses meta-information from &lt;code&gt;dockerd&lt;/code&gt; or &lt;code&gt;containerd&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Handling multi-line messages
&lt;/h2&gt;

&lt;p&gt;To extract multi-line messages from a log stream, we use a few dead-simple heuristics. The general one: if a line contains a field that looks like a datetime and the following lines don't, those lines are probably the continuation of a multi-line message. In addition, the agent uses special &lt;a href="https://github.com/coroot/logparser/blob/d998966a74541e3fde1a69fde7801bc2e58317c8/multiline.go#L101" rel="noopener noreferrer"&gt;parsing rules&lt;/a&gt; for Java stack traces and Python tracebacks.&lt;/p&gt;

&lt;p&gt;According to our tests, this simple approach works quite accurately and can even handle custom multi-line messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Severity level detection
&lt;/h2&gt;

&lt;p&gt;I would say it's not actually detection; it's more like guessing. For most log formats, the agent simply looks for a mention of a known level at the beginning of each message. In addition, we implemented special parsers for the &lt;a href="https://github.com/kubernetes/klog" rel="noopener noreferrer"&gt;glog/klog&lt;/a&gt; and &lt;a href="https://github.com/coroot/logparser/blob/d998966a74541e3fde1a69fde7801bc2e58317c8/level.go#L112" rel="noopener noreferrer"&gt;redis&lt;/a&gt; log formats. If the severity level cannot be detected, the &lt;code&gt;UNKNOWN&lt;/code&gt; level is used.&lt;/p&gt;
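&lt;p&gt;A simplified sketch of this guessing (not the agent's actual parser, and the 64-character window is an illustrative assumption) could look like this:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

var levels = []string{"DEBUG", "INFO", "WARNING", "ERROR", "FATAL"}

// guessSeverity looks for the earliest mention of a known level near
// the beginning of the message; if none is found, it returns UNKNOWN.
func guessSeverity(msg string) string {
	head := strings.ToUpper(msg)
	if len(head) > 64 {
		head = head[:64] // only inspect the beginning of the message
	}
	best, bestIdx := "UNKNOWN", len(head)
	for _, l := range levels {
		if i := strings.Index(head, l); i >= 0 && i < bestIdx {
			best, bestIdx = l, i
		}
	}
	return best
}

func main() {
	fmt.Println(guessSeverity("2022-03-15 16:02:59 error: disk full")) // ERROR
	fmt.Println(guessSeverity("something happened"))                   // UNKNOWN
}
```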

&lt;h2&gt;
  
  
  Messages clustering
&lt;/h2&gt;

&lt;p&gt;A great compilation of publications on automated log parsing is available &lt;a href="https://github.com/logpai/logparser#log-parsers-currently-available" rel="noopener noreferrer"&gt;here&lt;/a&gt;. After reviewing these papers, we realized that our problem is much simpler than the ones most researchers have tried to solve. We just want to cluster messages into groups identified by some fingerprint; we don't need to recognize the message format itself.&lt;/p&gt;

&lt;p&gt;Our approach is entirely empirical and based on the observation that a log entry is actually a combination of a template and variables, such as the date, time, severity level, component, and other user-defined values. We discovered that after removing everything that looks like a variable from a message, the remaining part can itself serve as the message's pattern.&lt;/p&gt;

&lt;p&gt;In the first step, the agent removes quoted parts, parts in brackets, HEX numbers, UUIDs, and numeric values. The remaining words are then used to calculate the message group's fingerprint. We extracted the log-parsing code into a separate &lt;a href="https://github.com/coroot/logparser" rel="noopener noreferrer"&gt;repository&lt;/a&gt;, along with a command-line tool that can parse a log stream from &lt;code&gt;stdin&lt;/code&gt;.&lt;/p&gt;
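&lt;p&gt;A rough sketch of this fingerprinting idea, where the regular expressions and the hash choice are illustrative assumptions rather than the exact rules used by &lt;code&gt;logparser&lt;/code&gt;:&lt;/p&gt;

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"regexp"
	"strings"
)

// variable-looking fragments that get stripped before fingerprinting:
// quoted parts, parts in brackets, hex numbers / UUIDs, and plain numbers
var varPatterns = []*regexp.Regexp{
	regexp.MustCompile(`"[^"]*"|'[^']*'`),      // quoted parts
	regexp.MustCompile(`\([^)]*\)|\[[^\]]*\]`), // parts in brackets
	regexp.MustCompile(`\b[0-9a-fA-F-]{8,}\b`), // hex numbers and UUIDs
	regexp.MustCompile(`\b\d+\b`),              // numeric values
}

// fingerprint removes everything that looks like a variable and hashes
// the remaining words, yielding a stable ID for the message pattern.
func fingerprint(msg string) string {
	for _, re := range varPatterns {
		msg = re.ReplaceAllString(msg, " ")
	}
	words := strings.Fields(msg)
	h := sha1.Sum([]byte(strings.Join(words, " ")))
	return fmt.Sprintf("%x", h[:4])
}

func main() {
	a := fingerprint(`connection to "10.0.0.1:5432" failed after 3 retries`)
	b := fingerprint(`connection to "10.0.0.5:5432" failed after 7 retries`)
	fmt.Println(a == b) // true: same pattern, different variables
}
```

&lt;p&gt;Two messages that differ only in their variable parts end up with the same fingerprint and thus land in the same group.&lt;/p&gt;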

&lt;p&gt;Here is a summary for a log from the &lt;a href="https://github.com/logpai/loghub" rel="noopener noreferrer"&gt;logpai/loghub&lt;/a&gt; dataset (kudos to the Logpai team for sharing this dataset):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat Zookeeper_2k.log | docker run -i --rm ghcr.io/coroot/logparser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcduhowe5bi66zya4anf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcduhowe5bi66zya4anf5.png" alt="logs summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;Node-agent parses the logs of every container running on a node and exports the &lt;a href="https://coroot.com/docs/metrics/node-agent#container_log_messages_total" rel="noopener noreferrer"&gt;container_log_messages_total&lt;/a&gt; metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;container_log_messages_total&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;ID of the container writing this log&amp;gt;"&lt;/span&gt;,
    &lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;log path or the stream source&amp;gt;"&lt;/span&gt;,
    &lt;span class="nv"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;DEBUG|INFO|WARNING|ERROR|FATAL|UNKNOWN&amp;gt;"&lt;/span&gt;,
    &lt;span class="nv"&gt;pattern_hash&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;hash of the recognized pattern&amp;gt;"&lt;/span&gt;,
    &lt;span class="nv"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;a sample message that is matched to this pattern&amp;gt;"&lt;/span&gt;, &lt;span class="c"&gt;# this can be replaced with Prometheus examplars in the future&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/coroot/coroot-aws-agent" rel="noopener noreferrer"&gt;AWS-agent&lt;/a&gt; exports the similar metrics &lt;a href="https://coroot.com/docs/metrics/aws-exporter#aws_rds_log_messages_total" rel="noopener noreferrer"&gt;aws_rds_log_messages_total&lt;/a&gt; related to every discovered RDS instance.&lt;/p&gt;

&lt;p&gt;After Prometheus has collected the metrics from the node-agents, you can query them using PromQL expressions like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sum &lt;/span&gt;by&lt;span class="o"&gt;(&lt;/span&gt;level&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;rate&lt;span class="o"&gt;(&lt;/span&gt;container_log_messages_total&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/docker/prometheus"&lt;/span&gt;&lt;span class="o"&gt;}[&lt;/span&gt;1m]&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How we use these metrics
&lt;/h2&gt;

&lt;p&gt;Coroot's Logs Inspection (part of our commercial product) uses these log-based metrics to highlight errors that correlate with the application's SLIs. Here is a sample report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv77gkppk4gpj5lgz9lj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv77gkppk4gpj5lgz9lj2.png" alt="logs inspection"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each pattern, Coroot provides a full-text sample and a timeline of events broken down by the app's instances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vr8cpy3grnbua80pb0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vr8cpy3grnbua80pb0y.png" alt="full log sample"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/coroot/coroot-node-agent" rel="noopener noreferrer"&gt;node-agent&lt;/a&gt;  (Apache-2.0 License)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/coroot/logparser" rel="noopener noreferrer"&gt;logparser&lt;/a&gt;  (Apache-2.0 License)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://coroot.com/docs/inspections/logs" rel="noopener noreferrer"&gt;Logs Inspection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://coroot.com/auth/signup" rel="noopener noreferrer"&gt;Try Coroot for free&lt;/a&gt; (14-day trial)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>monitoring</category>
      <category>sre</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
