Grafana Loki: Optimising log based metrics
Siddharth Jain


Log-based metrics provide crucial KPIs for any application and are widely used.

Loki is a popular, cost-effective open-source tool for log aggregation, commonly paired with Grafana to query logs and build visualisation dashboards.

However, these dashboards are often very slow, or even fail to load data for longer ranges such as 7 or 30 days.

There are multiple layers where Loki's performance can be improved and fine-tuned. From optimising the query, to channeling it efficiently for processing, to allocating the right computational resources, we will cover the following parameters that significantly improve performance.

  1. Label filter
  2. Data parse
  3. Query split
  4. Parallelism/Concurrency & Queue
  5. Index
  6. Cache
  7. Resource allocation

Label Filter

This is a high-impact parameter. While building the query in Grafana, use as many labels as possible to filter the data that you want to use for metrics. Labels like cluster, namespace, and severity will fetch only the data that is relevant to you.
This works because Loki creates data chunks per stream, so labels narrow down the set of chunks that Loki needs to fetch and process.
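As a sketch, a metric query that filters on labels up front might look like this (the label names and values here are hypothetical):

```logql
# Stream selector with three labels: Loki only fetches chunks
# for streams matching cluster, namespace, and severity.
sum by (namespace) (
  rate({cluster="prod", namespace="payments", severity="error"} |= "timeout" [5m])
)
```

Compare this with `{severity="error"}` alone, which would force Loki to scan matching streams from every cluster and namespace.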

Data parse

A simple rule of thumb applies here: the pattern parser is the fastest, JSON and logfmt are simpler but slower, and regex is the most expensive.
The JSON parser can also take parameters to extract only certain labels from the payload.

Pattern > JSON > Regex

Here is the link to the official docs: https://grafana.com/docs/loki/latest/query/log_queries/#parser-expression
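For illustration, here is how a status code could be extracted with each parser (the stream selectors, log layout, and field names are hypothetical):

```logql
# pattern: fastest, matches a fixed log layout
{app="nginx"} | pattern `<ip> - - <_> "<method> <path> <_>" <status> <size>`

# json: simple, and can extract only the labels you need
{app="api"} | json status="response.status"

# regexp: most expensive, use only when the others cannot match
{app="api"} | regexp `status=(?P<status>\d+)`
```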

Query Split

This is also a high-impact parameter. To understand this and the next few parameters, we need to understand how Loki works and what happens to the query after we hit "Run".


Our query is first handled by the query front-end component and forwarded to the querier, which fetches and processes the relevant chunks from storage.

The query front-end splits the query, either by sharding or by time range, into multiple small queries. For example, if your query time range is 1h and your split config is 20m, then that query is split into 3 small queries before being sent to the querier.

This split config can be set via CLI flags or the deployment manifest (https://grafana.com/docs/loki/latest/configure/query-frontend/#parallelization).
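In the Loki config manifest, this would look roughly like the following (the 15m value is only an example; tune it for your own query ranges):

```yaml
# Equivalent CLI flag: -querier.split-queries-by-interval
limits_config:
  split_queries_by_interval: 15m
```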

There is no universal best value for this parameter; it will vary depending on your use case.

Apart from Loki, a split can also be configured in recent versions of Grafana. For example, if you request 7 days of data and configure Grafana to split by 1 day, then Grafana first requests only 1 day of data from Loki and, only after a successful response, requests the previous day, and so on. This is an extremely helpful feature since it reduces the load on Loki and gives it enough time to scale up if required.

Parallelism/Concurrency & Queue

This parameter does not directly improve performance, but it is very important for channeling and processing our split queries.

We can configure how many queries the query front-end can split in parallel, and how many the querier can process concurrently.
The remaining queries are pushed to the queue.

Both the queue size and parallelism in query front-end and concurrency in querier can be configured via CLI flags or deployment manifest.

Depending on your deployment type, distributed or simple, these components run either as separate workloads or as a single one.
Remember to keep the querier's concurrency higher than the query front-end's parallelism.
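A minimal sketch of these settings in the Loki config (the values are illustrative, not recommendations):

```yaml
frontend:
  # Per-tenant queue size in the query front-end
  max_outstanding_per_tenant: 2048

querier:
  # Sub-queries a single querier processes in parallel;
  # keep this higher than the front-end's parallelism
  max_concurrent: 8

frontend_worker:
  # Let querier workers match max_concurrent automatically
  match_max_concurrent: true
```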

Index

This is also a high-impact configuration, and it is straightforward in Loki. There are two types of storage: chunk and index. Index storage indexes all the labels and links them to the respective chunks.

So when this is enabled, instead of scanning chunk storage directly, the querier first looks up the index to find only the relevant chunks, which is much faster.
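With recent Loki versions this is the TSDB index, enabled in the schema config; a sketch (the date, object store, and prefix are placeholders for your own values):

```yaml
schema_config:
  configs:
    - from: 2024-01-01    # placeholder start date
      store: tsdb         # TSDB index: labels -> chunks lookup
      object_store: s3    # wherever your chunks live
      schema: v13
      index:
        prefix: index_
        period: 24h
```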

Cache

Loki supports caching for the index, chunks, and query results. Memcached can be configured for chunks and query results, while the index is cached in memory by default. This helps fetch results faster.
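A sketch of enabling chunk and query-result caches backed by Memcached (the hostname and service name are hypothetical):

```yaml
chunk_store_config:
  chunk_cache_config:
    memcached_client:
      host: memcached.loki.svc.cluster.local   # hypothetical service
      service: memcached-client

query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: memcached.loki.svc.cluster.local
        service: memcached-client
```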

Resource allocation

This is also very important for performance and cost optimisation. Loki's CPU and memory consumption can spike to very high values during heavy queries and drop to insignificant values during low load.

To handle this, depending on the parallelism and split configuration above, we need either many small pods with an aggressive HPA or a few replicas of very large pods.

I would suggest allocating 2 CPU and 4 GB of memory per pod with an HPA of 3 minimum and 40 maximum pods, plus a dedicated node pool so it can scale in times of need while keeping the cost in check.
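As a Kubernetes sketch of that suggestion (the deployment name and utilisation target are placeholders; adjust for your cluster):

```yaml
# Per-pod resources for the querier deployment
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
---
# HPA: scale between 3 and 40 pods
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loki-querier        # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: loki-querier
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```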

The parameters above configure Loki from top to bottom for significantly improved performance.
