<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yury Bushmelev</title>
    <description>The latest articles on DEV Community by Yury Bushmelev (@jay7x).</description>
    <link>https://dev.to/jay7x</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F994987%2F04528b59-10af-455c-a9c0-ad8423f152e7.png</url>
      <title>DEV Community: Yury Bushmelev</title>
      <link>https://dev.to/jay7x</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jay7x"/>
    <language>en</language>
    <item>
      <title>Random thoughts about logs, delivery pipelines, and everything</title>
      <dc:creator>Yury Bushmelev</dc:creator>
      <pubDate>Fri, 16 Feb 2024 17:51:27 +0000</pubDate>
      <link>https://dev.to/jay7x/random-thoughts-about-logs-delivery-pipelines-and-everything-2h6e</link>
      <guid>https://dev.to/jay7x/random-thoughts-about-logs-delivery-pipelines-and-everything-2h6e</guid>
      <description>&lt;p&gt;While writing my &lt;a href="https://dev.to/jay7x/an-overview-of-logging-pipelines-lazada-2017-vs-cloudflare-2023-5bld"&gt;previous article about logs&lt;/a&gt;, I found a few random notes with my ideas about the subject. Here, I'm going to share the most useful of these. Interesting (but expected) enough, most of these ideas are about logs reduction.&lt;/p&gt;

&lt;p&gt;Let me start with the most thought-provoking idea first. Usually, people consider log delivery to be a forwarding-only plane. For example, a group of Fluent Bit instances picks up the logs, these logs are delivered to a Logstash cluster, and then they end up in Elasticsearch. This is typically the expected data flow. And that works as long as you have enough resources (hardware and money) to process the volume of logs.&lt;/p&gt;

&lt;p&gt;Imagine you have thousands of servers with hundreds of microservices producing terabytes of logs per day. Do you still consider those logs useful?&lt;/p&gt;

&lt;p&gt;Imagine now that one microservice was deployed with a bug that caused it to log five error messages per request. Let's say we have 10 other microservices depending on this one that are somehow affected by this bug. Those are emitting two error messages per request because of the issue. Then we have 100 instances of each affected microservice deployed in our infrastructure. So in the end, we have this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1&lt;/code&gt; broken service × &lt;code&gt;5&lt;/code&gt; error messages + &lt;code&gt;10&lt;/code&gt; dependent services × &lt;code&gt;2&lt;/code&gt; error messages = &lt;code&gt;25&lt;/code&gt; error messages per request&lt;/p&gt;

&lt;p&gt;&lt;code&gt;25&lt;/code&gt; × &lt;code&gt;100&lt;/code&gt; instances = &lt;code&gt;2500&lt;/code&gt; error messages per request.&lt;/p&gt;

&lt;p&gt;Impressive, I'd say. In real life, it's not that simple, because not every request is processed by all of those 11 services, and the dependencies are not that simple either. But you get the idea. How useful are hundreds of copies of the same log message per request?&lt;/p&gt;

&lt;p&gt;And here comes the main idea…&lt;/p&gt;

&lt;h2&gt;Log delivery control plane&lt;/h2&gt;

&lt;p&gt;It's becoming obvious to me that the ability to manage log delivery based on metrics and/or events is a really useful feature! This is where we might introduce a log delivery control plane.&lt;/p&gt;

&lt;p&gt;I see at least two reasons to have such an entity in the infrastructure. Firstly, it should allow for a reduction in the overall volume of logs. Secondly, it should give us some operational abilities to control the log flow.&lt;/p&gt;

&lt;p&gt;Let's start with the operational abilities first.&lt;/p&gt;

&lt;h2&gt;Manage the log flow&lt;/h2&gt;

&lt;p&gt;Below are some cases we might want to handle during day-to-day operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enable or disable logs globally or by an attribute.&lt;/li&gt;
&lt;li&gt;Rate-limit logs globally or by an attribute.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Any of the following data may serve as an attribute:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a service name&lt;/li&gt;
&lt;li&gt;the service's environment (dev/staging/prod)&lt;/li&gt;
&lt;li&gt;the service's location (country, city, or datacenter)&lt;/li&gt;
&lt;li&gt;a log message's attribute (severity, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if a service goes mad, it's possible to quickly disable its log collection across the whole fleet, or maybe just on staging. We may also want to drop log messages with a severity lower than warning by default, and then quickly enable debug logs when we need them.&lt;/p&gt;

&lt;p&gt;Speaking of rate limiting, I have a strong opinion that it should be enabled everywhere by default. Picking the right limits is usually hard, though.&lt;/p&gt;
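
&lt;p&gt;To make this a bit more concrete, below is a minimal Python sketch of what such attribute-based control could look like on the collector side. The rule structure, field names, and actions are all made-up assumptions for illustration, not an existing tool's API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

# Hypothetical rules pushed by the control plane to every log collector.
RULES = [
    {"match": {"severity": "debug"}, "action": "drop"},  # drop debug logs globally
    {"match": {"service": "checkout", "environment": "staging"}, "action": "drop"},
    {"match": {"service": "search"}, "action": "rate_limit", "per_second": 100},
]

_counters = {}  # rule index mapped to [current_second, messages_seen]

def allowed(msg):
    """Return True if the log message should be forwarded to the pipeline."""
    for idx, rule in enumerate(RULES):
        if not all(msg.get(k) == v for k, v in rule["match"].items()):
            continue
        if rule["action"] == "drop":
            return False
        if rule["action"] == "rate_limit":
            now = int(time.time())
            window, seen = _counters.get(idx, [now, 0])
            if window != now:
                window, seen = now, 0
            _counters[idx] = [window, seen + 1]
            if seen &gt;= rule["per_second"]:
                return False
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The filtering itself is nothing special (rsyslog, Fluent Bit, and friends can do it already). The point is that the rule list is owned by the control plane and can be changed fleet-wide within minutes.&lt;/p&gt;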

&lt;h2&gt;Reduce the log amount&lt;/h2&gt;

&lt;p&gt;Do you remember those 2500 messages coming from 100 instances and telling you about the same bug? Let's see what we can do to reduce the number of logs without reducing observability significantly.&lt;/p&gt;

&lt;p&gt;Kind of DISCLAIMER:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The ideas below are not a general solution suitable for everyone. Consider your requirements (e.g., compliance and/or security) before applying anything described below.&lt;/li&gt;
&lt;li&gt;It becomes really important to provide your developers with an easy-to-use UI to enable or disable logging. For example, they should be able to disable the automation if there is an ongoing incident.&lt;/li&gt;
&lt;li&gt;In what follows, it's assumed that metrics collection is already implemented in the infrastructure.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Sampling&lt;/h3&gt;

&lt;p&gt;I don't like logs that are collected when everything is good. Nobody is going to read those logs, I believe. But they still consume your storage and waste your CPU cycles.&lt;/p&gt;

&lt;p&gt;The simplest solution that comes to mind is to stop logging everything 24×7. Since we have metrics, we can tell whether a deployment is good or bad based on them.&lt;/p&gt;

&lt;p&gt;As long as things are good, we can keep log collection disabled by default. Then we can enable logging for 10–15 minutes at some random point in time to get a sample for analysis. How long you can afford to skip logging depends heavily on the nature of your logs. Sampling at least once per hour might be a good start.&lt;/p&gt;

&lt;p&gt;Moreover, we should enable logging for 10–15 minutes after a deployment, because that's when an error is most likely to show up. Maybe we don't need to enable logging for everything everywhere, but just for the affected services (the deployed service and its dependencies, at least).&lt;/p&gt;

&lt;p&gt;Also, we should enable logging if we see something wrong in the metrics. Here, I assume that on a 100-instance fleet, you'll see the same issue very soon.&lt;/p&gt;
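
&lt;p&gt;Here is a rough sketch of such a sampling decision in Python. The function name, the window sizes, and the idea of seeding the random window with the current hour are my assumptions for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

SAMPLE_WINDOW = 900        # collect logs for 15 minutes per sampling window
POST_DEPLOY_WINDOW = 900   # always collect for 15 minutes after a deployment

def should_collect(now, last_deploy_ts, metrics_look_bad):
    """Decide whether log collection should be enabled right now.

    `now` and `last_deploy_ts` are Unix timestamps in whole seconds.
    """
    if metrics_look_bad:                           # something is wrong: collect everything
        return True
    if last_deploy_ts + POST_DEPLOY_WINDOW &gt; now:  # right after a deployment: collect
        return True
    # Otherwise collect only during one pseudo-random 15-minute window per hour.
    rng = random.Random(now // 3600)               # same window across the fleet this hour
    offset = rng.randrange(3600 - SAMPLE_WINDOW)
    return (now % 3600) in range(offset, offset + SAMPLE_WINDOW)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The control plane would evaluate something like this (fed by deployment events and metric alerts) and toggle collection on the affected hosts accordingly.&lt;/p&gt;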

&lt;h3&gt;Service dependencies&lt;/h3&gt;

&lt;p&gt;Now let's think about service dependencies. In the example above, we had 10 dependent services, each emitting two extra error messages per request. Imagine if we could check that a service depends on another service with a known ongoing issue. Then we could skip collecting logs from the dependent services until the incident is resolved. I.e., we'd have 500 error messages per request instead of 2500.&lt;/p&gt;

&lt;p&gt;One may argue that an issue in a dependent service can be hidden this way. I'd say you'll see it immediately after the original incident is resolved. That may prevent you from fixing all visible issues at once, though. It's up to you to decide what is more important in your case.&lt;/p&gt;
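
&lt;p&gt;A tiny sketch of the idea, assuming the control plane distributes a dependency map and a list of services with known ongoing issues (both structures are invented for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical data pushed by the control plane.
DEPENDS_ON = {
    "cart": ["pricing"],
    "checkout": ["cart", "pricing"],
    "search": [],
}
KNOWN_BROKEN = {"pricing"}  # e.g. populated while an incident is open

def collect_logs_for(service):
    """Keep logs of the broken service itself; skip its (transitive) dependents."""
    if service in KNOWN_BROKEN:
        return True  # this is the service we actually want to look at
    stack = list(DEPENDS_ON.get(service, []))
    seen = set()
    while stack:
        dep = stack.pop()
        if dep in KNOWN_BROKEN:
            return False  # an upstream dependency is already known to be broken
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPENDS_ON.get(dep, []))
    return True
&lt;/code&gt;&lt;/pre&gt;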

&lt;h3&gt;"Pre-shooting" buffer&lt;/h3&gt;

&lt;p&gt;There is a feature in modern photo cameras: you shoot some pictures, then go back in time and pick a photo that was taken a few moments BEFORE you pressed the button. What if we apply the same idea to log delivery?&lt;/p&gt;

&lt;p&gt;Imagine our log collector has a circular buffer that stores incoming messages. As long as there is no error message, we don't deliver anything, and older messages are silently dropped. Well, not really silently: the log collector counts everything and exposes the counters as metrics.&lt;/p&gt;

&lt;p&gt;Boom! An error message is received! All buffered messages are delivered immediately, and delivery continues until no errors have been seen for a certain period of time.&lt;/p&gt;

&lt;p&gt;It's really that simple. The biggest issue here is any long-term log storage requirement you may have. In that case, this method can still be applied to the short-term storage delivery route only.&lt;/p&gt;

&lt;p&gt;It'd be useful to be able to switch the log collector's operation mode quickly. I.e., it can buffer by default, but be switched to the normal delivery mode for 15 minutes after a deployment.&lt;/p&gt;
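
&lt;p&gt;For illustration, here is a minimal Python sketch of such a collector. The class name, severities, and defaults are assumptions; a real implementation would live inside your log collector of choice:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import collections
import time

class PreShootingBuffer:
    """Hold recent messages; deliver them only around errors."""

    def __init__(self, deliver, capacity=10000, quiet_period=300):
        self.deliver = deliver              # callable that ships a message downstream
        self.buffer = collections.deque(maxlen=capacity)
        self.quiet_period = quiet_period    # keep delivering this long after the last error
        self.last_error_ts = 0.0
        self.dropped = 0                    # exposed as a metric in a real collector

    def ingest(self, msg):
        now = time.time()
        if msg.get("severity") in ("error", "critical"):
            self.last_error_ts = now
            # Flush everything we kept from "before the button was pressed".
            while self.buffer:
                self.deliver(self.buffer.popleft())
        if self.last_error_ts + self.quiet_period &gt; now:
            self.deliver(msg)               # still inside the error window: deliver directly
        else:
            if len(self.buffer) == self.buffer.maxlen:
                self.dropped += 1           # the oldest message is about to be evicted
            self.buffer.append(msg)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Switching to the normal delivery mode after a deployment is then just a matter of bumping &lt;code&gt;last_error_ts&lt;/code&gt; (or the quiet period) from the control plane.&lt;/p&gt;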

&lt;h2&gt;Reduce the message size&lt;/h2&gt;

&lt;p&gt;This part is not really related to the control plane idea, but I decided to include it here as well to round out the log reduction topic. Consider it a bonus for those who have read this far.&lt;/p&gt;

&lt;p&gt;Imagine we have 100k JSON log lines per second from our infrastructure. In every message, we have an “is_complete” Boolean field, which may be &lt;code&gt;false&lt;/code&gt; in about 75% of cases. Let's do some calculations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;len('"is_complete": false')&lt;/code&gt; × &lt;code&gt;100 000&lt;/code&gt; msg/s × &lt;code&gt;75&lt;/code&gt;% = &lt;code&gt;20&lt;/code&gt; × &lt;code&gt;75 000&lt;/code&gt; = &lt;code&gt;1 500 000&lt;/code&gt; bytes/s&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1 500 000&lt;/code&gt; bytes/s = &lt;code&gt;129 600 000 000&lt;/code&gt; bytes/day = &lt;code&gt;120&lt;/code&gt; GiB/day&lt;/p&gt;

&lt;p&gt;So, just having this field costs us 120 GiB of traffic and storage daily (without compression, replicas, or indices). As this is a Boolean field, we can stop adding it when it's false and save those 120 GiB per day. Moreover, there's usually a comma before or after the field too. Add another few GiB per day!&lt;/p&gt;

&lt;p&gt;Out of curiosity, let's calculate how much data a single char generates:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;1&lt;/code&gt; × &lt;code&gt;100 000&lt;/code&gt; msg/s = &lt;code&gt;100 000&lt;/code&gt; bytes/s = &lt;code&gt;8 640 000 000&lt;/code&gt; bytes/day = &lt;code&gt;8&lt;/code&gt; GiB/day&lt;/p&gt;

&lt;p&gt;Remember, every single character you have in your logs costs you some traffic and storage!&lt;/p&gt;
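
&lt;p&gt;The back-of-the-envelope numbers above are easy to reproduce in a few lines of Python (the field name and rates are the ones from the example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;field = '"is_complete": false'
msgs_per_sec = 100_000
false_ratio = 0.75
seconds_per_day = 86_400

per_day = len(field) * msgs_per_sec * false_ratio * seconds_per_day
print(len(field))                     # 20 bytes per occurrence
print(per_day)                        # 129600000000.0 bytes per day
print(round(per_day / 2**30, 1))      # roughly 120.7 GiB per day

single_char = msgs_per_sec * seconds_per_day
print(round(single_char / 2**30, 2))  # about 8.05 GiB per day for one character
&lt;/code&gt;&lt;/pre&gt;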

&lt;p&gt;If you think 100k msg/s is a lot, it's not. I saw 250k msg/s and even more during peak load at Lazada. I know a few companies where it's even higher.&lt;/p&gt;

&lt;p&gt;Impressed? Take the numbers with a grain of salt. Traffic and storage are typically compressed, so the numbers are 2–10 times lower. On the other hand, storage is often replicated and has indices to speed up the search, so the numbers are increased by some factor.&lt;/p&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;So you feel enthusiastic and would like to introduce such a log delivery control plane into your infrastructure. Unfortunately, I'm not aware of any ready-to-use solution that can do most of the things above. I believe FAANG-level companies have something similar developed in-house and tied specifically to their infrastructure and their needs. So you can do this as well. Check the features of your orchestration and configuration management engine; maybe they're good enough to start with. If you asked me to implement such tooling, I'd go with rsyslog, Puppet, and Choria, mostly because I know this software very well and it's flexible enough.&lt;/p&gt;

&lt;p&gt;We're engineers here, right? ;)&lt;/p&gt;

</description>
      <category>logs</category>
      <category>rsyslog</category>
      <category>devops</category>
    </item>
    <item>
      <title>An overview of logging pipelines: Lazada 2017 vs Cloudflare 2023</title>
      <dc:creator>Yury Bushmelev</dc:creator>
      <pubDate>Mon, 29 Jan 2024 15:43:14 +0000</pubDate>
      <link>https://dev.to/jay7x/an-overview-of-logging-pipelines-lazada-2017-vs-cloudflare-2023-5bld</link>
      <guid>https://dev.to/jay7x/an-overview-of-logging-pipelines-lazada-2017-vs-cloudflare-2023-5bld</guid>
      <description>&lt;p&gt;Recently, I finished reading &lt;a href="https://blog.cloudflare.com/an-overview-of-cloudflares-logging-pipeline" rel="noopener noreferrer"&gt;the Cloudflare blog post about their logging pipelines&lt;/a&gt;. Guess what? I found quite a lot of similarities with what we did in Lazada in 2017! Thanks to the layoff, I got some free time to write this long-enough-read about logging pipelines. I’m going to compare what we did in Lazada with what Cloudflare just published recently. Let’s dive into the history together!&lt;/p&gt;

&lt;p&gt;Just for your information, Lazada is an e-commerce platform that was quite popular in Southeast Asia (and still is, I believe). Though, Lazada in 2017 and Lazada now are completely different. Alibaba replaced everything with their own in-house implementations in 2018. So what I’m going to describe below no longer exists.&lt;/p&gt;

&lt;p&gt;Below is a bird’s-eye view of the logging pipeline we’ve implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0pqp3ergvxigwjlsluh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0pqp3ergvxigwjlsluh.png" alt="Lazada 2017 logging pipeline overview" width="363" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Local delivery pipeline&lt;/h2&gt;

&lt;p&gt;On every host, we have an rsyslog instance listening on a Unix datagram socket in a directory. This directory is mapped into every container running on the host, and every microservice running in a container writes its logs to the socket.&lt;/p&gt;

&lt;p&gt;Why do we do this? Well… Let me elaborate a bit here. The famous “Twelve-Factor App” methodology says that “each running process writes its event stream, unbuffered, to stdout”. That sounds nice, and there are reasons behind it. But at scale, things are not that simple. Who is reading that stdout? What if your infra engineers need the reader process to be restarted for some reason? What if the reader process is stuck on disk I/O, for example? In the best case, your writer will receive a signal. In the worst case, it’ll block in &lt;code&gt;write()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What can we do about this? Either we accept the risk, or we implement a wrapper that writes log messages in a separate thread and doesn’t block on &lt;code&gt;write()&lt;/code&gt;. Then we have to push this wrapper to every microservice in the product.&lt;/p&gt;

&lt;p&gt;So now we have a library that can write to stdout without blocking or crashing. Why the Unix socket, then? The answers are speed and reliability. There are more moving parts involved in the stdout logging pipeline: you have to read the logs from the pipe and then deliver them somewhere. At the time, I saw no software that could do this fast and reliably. Also, we were trying to avoid local disk writes in the local delivery pipeline as much as possible. That’s why we decided to use rsyslog with a Unix socket instead.&lt;/p&gt;

&lt;p&gt;Honestly speaking, our library was configurable. By default, it logs to stdout to simplify the developer experience. At the same time, it has an option to log to a Unix socket, which was enabled in our staging and production configurations.&lt;/p&gt;
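
&lt;p&gt;To give an idea of what such a library boils down to, here is a simplified Python sketch: log records go into an in-process queue, and a background thread pushes them to the Unix datagram socket, so the request-handling code never blocks in &lt;code&gt;write()&lt;/code&gt;. The socket path and class name are made up:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import queue
import socket
import threading

class SocketLogger:
    """Fire-and-forget JSON logging to a Unix datagram socket (sketch only)."""

    def __init__(self, path="/run/logger/log.sock", maxsize=10000):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        self.sock.setblocking(False)
        self.path = path
        self.q = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # exposed as a metric so drops stay visible
        threading.Thread(target=self._pump, daemon=True).start()

    def log(self, record):
        try:
            self.q.put_nowait(record)  # never block the caller
        except queue.Full:
            self.dropped += 1          # drop and count instead of blocking

    def _pump(self):
        while True:
            record = self.q.get()
            try:
                self.sock.sendto(json.dumps(record).encode(), self.path)
            except OSError:
                self.dropped += 1      # the collector is away or its buffer is full
&lt;/code&gt;&lt;/pre&gt;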

&lt;p&gt;Now let’s see what Cloudflare did:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This means that for our heavier workloads, where every millisecond of delay counts, we provide a more direct path into our pipeline. In particular, we offer a Unix Domain Socket that syslog-ng listens on. This socket operates as a separate source of logs into the same pipeline that the journald logs follow, but allows greater throughput by eschewing the need for a global ordering that journald enforces. Logging in this manner is a bit more involved than outputting logs to the stdout streams, as services have to have a pipe created for them and then manually open that socket to write to. As such, this is generally reserved for services that need it, and don’t mind the management overhead it requires.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They have a different setup, but the overall idea is mostly the same. In our case, almost every microservice was critical enough to adopt the library.&lt;/p&gt;

&lt;p&gt;Thanks to rsyslog lookup tables and Puppet+MCO, we were able to enable/disable log collection per microservice/environment/DC. For example, it was possible to stop collecting logs from a noisy microservice on staging in a few minutes. Also, it was possible to enable/disable writing a local log file in case we really needed that.&lt;/p&gt;

&lt;h2&gt;Global logs delivery pipeline&lt;/h2&gt;

&lt;p&gt;As you can see in the picture above, logs from every server in a datacenter are sent to a pair of the datacenter’s log collectors via syslog-over-TCP. The collector (conditionally) forwards each message to the DC’s long-term file storage and to the Main DC’s log collector. It has an internal queue to hold messages for about a day in case the Main DC is inaccessible for some reason. The stream between a DC and the Main DC uses the compressed syslog-over-TCP protocol. That introduces some delay but saves bandwidth dramatically. Though, unlike Cloudflare, we were using just a single DC as the Main DC. Also, the log collectors (per-DC and Main DC) were active-passive with manual failover. So there was definitely room for improvement.&lt;/p&gt;

&lt;h2&gt;Message routing&lt;/h2&gt;

&lt;p&gt;I omitted some details in the description above, though. One of those details is message routing. A log collector needs some information to decide what to do with a message. Some of that information is in the syslog message fields (severity, facility, hostname). But most of the required information is JSON-encoded, including the microservice name, the venture (the country the microservice is serving), the environment, etc.&lt;/p&gt;

&lt;p&gt;The easy way here is to parse the JSON in the message and use its fields to make the decision. But JSON parsing is quite an expensive operation. Instead, we add a syslog tag to the message in the following format: &lt;code&gt;$microservice/$venture/$environment[/$optional…]&lt;/code&gt;. This way, we can parse a short, slash-separated string instead of a full JSON message. Combined with the usual syslog message fields, this gives us enough information to route the message any way we may need.&lt;/p&gt;
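
&lt;p&gt;To illustrate how cheap the routing decision becomes, here is a Python sketch of tag parsing with a made-up routing policy on top (the destinations are invented; only the tag format comes from the text above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def parse_tag(tag):
    """Split 'microservice/venture/environment[/optional...]' into routing fields."""
    service, venture, environment, *extra = tag.split("/")
    return {"service": service, "venture": venture, "environment": environment, "extra": extra}

def route(tag):
    """Pick destinations without touching the JSON payload."""
    meta = parse_tag(tag)
    targets = ["dc_file_storage"]      # per-DC long-term files, always
    if meta["environment"] == "prod":
        targets.append("main_dc")      # forward production logs to the Main DC
    return targets

# route("cart/sg/prod") == ["dc_file_storage", "main_dc"]
&lt;/code&gt;&lt;/pre&gt;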

&lt;h2&gt;Kafka&lt;/h2&gt;

&lt;p&gt;When a message arrives in the Main DC, its syslog tag is stripped and the JSON is injected into a Kafka topic according to its source environment. There are two big reasons to use Kafka here. Firstly, it allows other teams (BI, Security, Compliance, etc.) to access the logs. Secondly, it gives us some flexibility in infrastructure operations.&lt;/p&gt;
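
&lt;p&gt;The injection step itself is simple. As a sketch (using the kafka-python client; the per-environment topic naming is my assumption), it amounts to something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["kafka-1:9092", "kafka-2:9092"])

def publish(environment, raw_json_bytes):
    """Ship the already-serialized JSON message into the per-environment topic."""
    producer.send("logs." + environment, value=raw_json_bytes)
&lt;/code&gt;&lt;/pre&gt;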

&lt;h2&gt;Onwards to storage&lt;/h2&gt;

&lt;p&gt;Developers were able to access logs in two ways. The first (and most popular) was a shell box on the per-DC file storage. Log files were laid out in a directory tree, which allowed developers to use the good ol’ Unix tools (&lt;code&gt;tail&lt;/code&gt;/&lt;code&gt;grep&lt;/code&gt;/&lt;code&gt;sed&lt;/code&gt;/&lt;code&gt;awk&lt;/code&gt;/&lt;code&gt;sort&lt;/code&gt;/etc.) to deal with them. The second was the Graylog UI, backed by Elasticsearch v5.5.&lt;/p&gt;

&lt;p&gt;We had 16 powerful servers for the ES cluster. That allowed us to implement a hot/warm architecture with 16 hot (SSD) and 32 warm (HDD) instances. Also, we created three dedicated masters to coordinate the cluster. These days, I understand that it would have been a good idea to dedicate a subset of the data nodes to serving only the HTTP traffic (as Cloudflare did).&lt;/p&gt;

&lt;h2&gt;Pipeline metrics&lt;/h2&gt;

&lt;p&gt;So we had a log delivery pipeline working in some way. The next question was how to measure the quality of log delivery. To answer it, we implemented a few things.&lt;/p&gt;

&lt;p&gt;On the microservice side, our logging library had some metrics exposed as a subset of the microservice metrics. That includes the count of log messages written and the count of log messages dropped.&lt;/p&gt;
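
&lt;p&gt;On the library side, this can be as simple as two counters exported along with the rest of the microservice’s metrics. Here is a sketch with &lt;code&gt;prometheus_client&lt;/code&gt; (the metric names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from prometheus_client import Counter

LOG_MESSAGES_WRITTEN = Counter(
    "log_messages_written_total", "Log messages handed to the local delivery pipeline")
LOG_MESSAGES_DROPPED = Counter(
    "log_messages_dropped_total", "Log messages dropped before leaving the process")

# The logging library calls these on every write attempt.
def on_write_ok():
    LOG_MESSAGES_WRITTEN.inc()

def on_drop():
    LOG_MESSAGES_DROPPED.inc()
&lt;/code&gt;&lt;/pre&gt;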

&lt;p&gt;On the rsyslog side, I implemented a simple Prometheus exporter, which exposes rsyslog’s own metrics (&lt;code&gt;impstats&lt;/code&gt;) and our custom counters (&lt;code&gt;dyn_stats&lt;/code&gt;). This way, we could monitor its queues and count how many messages were received from each microservice.&lt;/p&gt;

&lt;p&gt;With all that implemented, we were able to monitor the log delivery pipeline and its quality of service. A few alerts were configured to react, for example, when inter-DC log delivery got stuck.&lt;/p&gt;

&lt;h2&gt;What was next?&lt;/h2&gt;

&lt;p&gt;With all the above, we found that the Graylog cluster was able to process almost 200k msg/s, which I personally consider a bit on the low side for the hardware we had underneath. Graylog was definitely the bottleneck there. So my next idea to try in 2018 was to set up rsyslog to write to ES directly, then drop Graylog and use Kibana instead.&lt;/p&gt;

&lt;p&gt;On the other hand, I had the impression that Elasticsearch was the wrong tool here. Nobody actually needs full-text search in logs. So I added ClickHouse to my to-do list to explore, though I was concerned about the missing UI. I filed a feature request for the &lt;code&gt;omclickhouse&lt;/code&gt; module in the rsyslog GitHub repo, and it was implemented a year later.&lt;/p&gt;

&lt;p&gt;Other items on my to-do list for 2018 were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;to work on High Availability issues&lt;/li&gt;
&lt;li&gt;to configure rate limits on every microservice and per-DC log collectors&lt;/li&gt;
&lt;li&gt;to define the log delivery SLA with our developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But… Reality was different. Company management and the tech stack were replaced completely. I left the company in June 2018 and haven’t had a chance to work on logs at scale since.&lt;/p&gt;

&lt;p&gt;P.S. You might want to read this article also: &lt;a href="https://dev.to/jay7x/random-thoughts-about-logs-delivery-pipelines-and-everything-2h6e"&gt;https://dev.to/jay7x/random-thoughts-about-logs-delivery-pipelines-and-everything-2h6e&lt;/a&gt;&lt;/p&gt;

</description>
      <category>logs</category>
      <category>rsyslog</category>
      <category>cloudflare</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
