<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sematext</title>
    <description>The latest articles on DEV Community by Sematext (@sematext).</description>
    <link>https://dev.to/sematext</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F551%2Ffd29451f-4d87-4cda-88d3-81be40dc6ad2.jpg</url>
      <title>DEV Community: Sematext</title>
      <link>https://dev.to/sematext</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sematext"/>
    <language>en</language>
    <item>
      <title>Java Logging Best Practices: 10+ Tips You Should Know to Get the Most Out of Your Logs</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Thu, 10 Sep 2020 17:40:07 +0000</pubDate>
      <link>https://dev.to/sematext/java-logging-best-practices-10-tips-you-should-know-to-get-the-most-out-of-your-logs-593c</link>
      <guid>https://dev.to/sematext/java-logging-best-practices-10-tips-you-should-know-to-get-the-most-out-of-your-logs-593c</guid>
      <description>&lt;p&gt;Having visibility into your Java application is crucial for understanding how it works right now, how it worked some time in the past and increasing your understanding of how it might work in the future. More often than not, &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analyzing logs&lt;/a&gt; is the fastest way to detect what went wrong, thus making logging in Java critical to ensuring the performance and health of your app, as well as minimizing and reducing any downtime. Having a &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;centralized logging and monitoring solution&lt;/a&gt; helps reduce the Mean Time To Repair by improving the effectiveness of your Ops or &lt;a href="https://sematext.com/blog/devops-roles/" rel="noopener noreferrer"&gt;DevOps team&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By following &lt;a href="https://sematext.com/blog/best-practices-for-efficient-log-management-and-monitoring/" rel="noopener noreferrer"&gt;logging best practices&lt;/a&gt; you will get more value out of your logs and make it easier to use them. You will be able to more easily pinpoint the root cause of errors and poor performance and solve problems before they impact end-users. So today, let me share some of the &lt;strong&gt;best practices you should follow when working with Java applications&lt;/strong&gt;. Let’s dig in.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use a Standard Logging Library
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://sematext.com/blog/java-logging/" rel="noopener noreferrer"&gt;Logging in Java&lt;/a&gt; can be done in a few different ways. You can use a dedicated logging library, a common API, or even just write logs to a file or directly to a dedicated logging system. However, when choosing the logging library for your system, think ahead. Things to consider and evaluate are performance, flexibility, appenders for new &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log centralization solutions&lt;/a&gt;, and so on. If you tie yourself directly to a single framework, switching to a newer library can take a substantial amount of work and time. Keep that in mind and go for an API that gives you the flexibility to swap logging libraries in the future. Just like with the switch from Log4j to &lt;a href="http://logback.qos.ch/" rel="noopener noreferrer"&gt;Logback&lt;/a&gt; and then to &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt;: when using the &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; API, the only thing you need to change is the dependency, not the code.&lt;/p&gt;
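&lt;p&gt;As a sketch of what that decoupling looks like in practice, here is a hypothetical Gradle dependency block (artifact versions are illustrative): the application code depends only on the SLF4J API, and swapping the logging backend is a one-line dependency change.&lt;/p&gt;

```groovy
dependencies {
    // Application code is written against the SLF4J API only.
    implementation 'org.slf4j:slf4j-api:1.7.30'

    // The backend is a runtime-only choice and can be swapped freely:
    runtimeOnly 'ch.qos.logback:logback-classic:1.2.3'
    // ...or, to switch to Log4j 2, replace the line above with:
    // runtimeOnly 'org.apache.logging.log4j:log4j-slf4j-impl:2.13.3'
}
```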

&lt;h3&gt;
  
  
  2. Select Your Appenders Wisely
&lt;/h3&gt;

&lt;p&gt;Appenders define where your log events will be delivered. The most common ones are the Console and File appenders. While useful and widely known, they may not fulfill your requirements. For example, you may want to write your logs asynchronously, or ship them over the network, for instance with the Syslog appender:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;Appenders&amp;gt;
    &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
        &amp;lt;PatternLayout pattern="%d %level [%t] %c - %m%n"/&amp;gt;
    &amp;lt;/Console&amp;gt;
    &amp;lt;Syslog name="Syslog" host="logsene-syslog-receiver.sematext.com"
            port="514" protocol="TCP" format="RFC5424"
            appName="11111111-2222-3333-4444-555555555555"
            facility="LOCAL0" mdcId="mdc" newLine="true"/&amp;gt;
&amp;lt;/Appenders&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, keep in mind that using appenders like the one shown above makes your logging pipeline susceptible to network errors and communication disruptions. That may result in logs not being shipped to their destination, which may not be acceptable. You also want to prevent logging from slowing down your application, which can happen if the appender is designed in a blocking way. To learn more, check our &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;Logging libraries vs Log shippers&lt;/a&gt; blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Use Meaningful Messages
&lt;/h3&gt;

&lt;p&gt;One of the crucial, yet not so easy, aspects of creating logs is using meaningful messages. Your log events should include messages that are unique to the given situation, clearly describe it, and inform the person reading them. Imagine a communication error occurred in your application. You might log it like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.warn("Communication error");


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But you could also create a message like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.warn("Error while sending documents to events Elasticsearch server, response code {}, response message {}. The message sending will be retried.", responseCode, responseMessage);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can easily see that the first message only tells the person looking at the logs that some communication issue occurred. That person will probably have the name of the logger and the line number where the warning happened, but that is all. To get more context, they would have to look at the code and know which version of the code the error is related to. This is not fun and often not easy, and certainly not something one wants to be doing while trying to troubleshoot a production issue as quickly as possible.&lt;/p&gt;

&lt;p&gt;The second message is better. It provides exact information about what kind of communication error happened, what the application was doing at the time, what error code it got, and what the response from the remote server was. Finally, it also informs that sending the message will be retried. Working with such messages is definitely easier and more pleasant.&lt;/p&gt;

&lt;p&gt;Finally, think about the size and verbosity of the message. Don’t log information that is too verbose. This data needs to be stored somewhere in order to be useful. One very long message will not be a problem, but if that line repeats hundreds of times per minute and you have lots of verbose logs, keeping longer retention of such data may be problematic and, at the end of the day, will also cost more.&lt;/p&gt;
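&lt;p&gt;One related note: SLF4J and Log4j 2 support parameterized messages with {} placeholders, which are substituted only when the event is actually logged, so suppressed levels cost almost nothing. The toy class below is a simplified sketch of that substitution (the real org.slf4j.helpers.MessageFormatter also handles escaping and arrays):&lt;/p&gt;

```java
public class PlaceholderDemo {
    // Simplified sketch of SLF4J-style "{}" substitution; the real
    // org.slf4j.helpers.MessageFormatter also handles escaping and arrays.
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIndex = 0;
        int i = 0;
        while (true) {
            int j = pattern.indexOf("{}", i);
            if (j == -1 || argIndex == args.length) {
                // No more placeholders (or arguments): copy the remainder.
                sb.append(pattern, i, pattern.length());
                break;
            }
            sb.append(pattern, i, j).append(args[argIndex]);
            argIndex = argIndex + 1;
            i = j + 2; // skip past the "{}" we just replaced
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(
            "Response code {}, response message {}. The message sending will be retried.",
            503, "Service Unavailable"));
    }
}
```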

&lt;h3&gt;
  
  
  4. Logging Java Stack Traces
&lt;/h3&gt;

&lt;p&gt;One of the very important parts of Java logging is the Java stack trace. Have a look at the following code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.IOException;

public class Log4JExceptionNoThrowable {
    private static final Logger LOGGER = LogManager.getLogger(Log4JExceptionNoThrowable.class);

    public static void main(String[] args) {
        try {
            throw new IOException("This is an I/O error");
        } catch (IOException ioe) {
            LOGGER.error("Error while executing main thread");
        }
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above code throws and catches an exception, and with our default configuration the log message printed to the console will look as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

11:42:18.952 ERROR - Error while executing main thread


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, there is not a lot of information here. We know that a problem occurred, but we don’t know where it happened or what the problem was. Not very informative.&lt;/p&gt;

&lt;p&gt;Now, look at the same code with a slightly modified logging statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.IOException;

public class Log4JException {
    private static final Logger LOGGER = LogManager.getLogger(Log4JException.class);

    public static void main(String[] args) {
        try {
            throw new IOException("This is an I/O error");
        } catch (IOException ioe) {
            LOGGER.error("Error while executing main thread", ioe);
        }
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, this time we’ve included the exception object itself in our log message:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.error("Error while executing main thread", ioe);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That would result in the following error log in the console with our default configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

11:30:17.527 ERROR - Error while executing main thread
java.io.IOException: This is an I/O error
    at com.sematext.blog.logging.Log4JException.main(Log4JException.java:13) [main/:?]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It contains relevant information – i.e. the name of the class, the method where the problem occurred, and finally the line number where the problem happened. Of course, in real-life situations the stack traces will be longer, but you should include them so that you have enough information for proper debugging.&lt;/p&gt;

&lt;p&gt;To learn more about how to handle Java stack traces with Logstash see &lt;a href="https://sematext.com/blog/handling-stack-traces-with-logstash/" rel="noopener noreferrer"&gt;Handling Multiline Stack Traces with Logstash&lt;/a&gt; or look at &lt;a href="https://sematext.com/docs/logagent/parser/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt; which can do that for you out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Logging Java Exceptions
&lt;/h3&gt;

&lt;p&gt;When dealing with Java exceptions and stack traces you shouldn’t only think about the whole stack trace, the lines where the problem appeared, and so on. You should also think about how not to deal with exceptions.&lt;/p&gt;

&lt;p&gt;Avoid silently ignoring exceptions. You don’t want to ignore something important. For example, do not do this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

try {
     throw new IOException("This is an I/O error");
} catch (IOException ioe) {
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Also, don’t just log an exception and throw it further. Doing so pushes the problem up the execution stack and usually causes the same error to be logged multiple times as it propagates. Either handle the exception or rethrow it and let the caller log it. Avoid things like this as well:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

try {
    throw new IOException("This is an I/O error");
} catch (IOException ioe) {
    LOGGER.error("I/O error occurred during request processing", ioe);
    throw ioe;
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
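&lt;p&gt;A common alternative to both anti-patterns is to wrap the low-level exception with context and log it once, at the boundary where it is finally handled. The sketch below illustrates the idea in plain Java (the DocumentProcessingException class and processDocument method are made up for this example):&lt;/p&gt;

```java
import java.io.IOException;

// Sketch: wrap low-level exceptions with context and log them once at the
// top-level boundary, instead of logging at every layer or swallowing them.
public class ExceptionHandlingDemo {
    // Hypothetical domain exception carrying context for the caller.
    static class DocumentProcessingException extends Exception {
        DocumentProcessingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static void processDocument(String id) throws DocumentProcessingException {
        try {
            throw new IOException("This is an I/O error");
        } catch (IOException ioe) {
            // Don't log here; add context and let the boundary log once.
            throw new DocumentProcessingException(
                "Failed to process document " + id, ioe);
        }
    }

    public static void main(String[] args) {
        try {
            processDocument("doc-42");
        } catch (DocumentProcessingException e) {
            // Single log call at the boundary; in a real application this
            // would be LOGGER.error("...", e) so the stack trace is kept.
            System.err.println(e.getMessage() + " caused by " + e.getCause());
        }
    }
}
```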
&lt;h3&gt;
  
  
  6. Use Appropriate Log Level
&lt;/h3&gt;

&lt;p&gt;When writing your application code, think twice about a given log message. Not every bit of information is equally important, and not every unexpected situation is an error or a critical message. Also, use the logging levels consistently – information of a similar type should be on a similar severity level.&lt;/p&gt;

&lt;p&gt;Both the &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; facade and each Java logging framework that you will be using provide methods for logging at the proper level. For example:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.error("I/O error occurred during request processing", ioe);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
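&lt;p&gt;As a rough rule of thumb: use DEBUG for detailed diagnostic state, INFO for normal business events, WARN for unexpected but recoverable situations, and ERROR for failures that need attention. The sketch below illustrates this mapping with java.util.logging so it runs without extra dependencies; SLF4J and Log4j 2 expose the analogous trace, debug, info, warn, and error methods.&lt;/p&gt;

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Rough mapping of situations to levels, shown with java.util.logging so the
// example runs without extra dependencies; SLF4J/Log4j 2 offer the analogous
// trace/debug/info/warn/error methods.
public class LogLevelDemo {
    private static final Logger LOGGER = Logger.getLogger(LogLevelDemo.class.getName());

    public static void main(String[] args) {
        LOGGER.log(Level.FINE, "Detailed diagnostic state (DEBUG equivalent, usually disabled)");
        LOGGER.log(Level.INFO, "A normal business event, e.g. order 1234 accepted");
        LOGGER.log(Level.WARNING, "Unexpected but recoverable, e.g. retrying a request");
        LOGGER.log(Level.SEVERE, "An operation failed and needs attention (ERROR equivalent)");
    }
}
```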
&lt;h3&gt;
  
  
  7. Log in JSON
&lt;/h3&gt;

&lt;p&gt;If we plan to look at the logs manually, in a file or on the standard output, plain-text logging will be more than fine – it is more user friendly and we are used to it. But that is only viable for very small applications, and even then it is a good idea to use something that allows you to &lt;a href="https://sematext.com/metrics-and-logs/" rel="noopener noreferrer"&gt;correlate the metrics data with the logs&lt;/a&gt;. Doing such operations in a terminal window isn’t fun and sometimes it is simply not possible. If you want to store logs in a &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;log management&lt;/a&gt; and centralization system, you should log in JSON. That’s because parsing doesn’t come for free – it usually means using regular expressions. Of course, you can pay that price in the log shipper, but why do that if you can easily log in JSON? Logging in JSON also means easy handling of stack traces, which is yet another advantage. Well, you could also just log to a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog&lt;/a&gt;-compatible destination, but that is a different story.&lt;/p&gt;

&lt;p&gt;In most cases, to enable logging in JSON in your Java logging framework it is enough to include the proper configuration. For example, let’s assume that we have the following log message included in our code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.info("This is a log message that will be logged in JSON!");


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To configure &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; to write log messages in JSON we would include the following configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;JSONLayout compact="true" eventEol="true"&amp;gt;
            &amp;lt;/JSONLayout&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The result would look as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

{"instant":{"epochSecond":1596030628,"nanoOfSecond":695758000},"thread":"main","level":"INFO","loggerName":"com.sematext.blog.logging.Log4J2JSON","message":"This is a log message that will be logged in JSON!","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  8. Keep the Log Structure Consistent
&lt;/h3&gt;

&lt;p&gt;The structure of your log events should be consistent. This is not only true within a single application or set of microservices, but should be applied across your whole application stack. With similarly structured log events it will be easier to look into them, compare them, correlate them, or simply store them in a dedicated data store. It is easier to look into data coming from your systems when you know that they have common fields like severity and hostname, so you can easily slice and dice the data based on that information. For inspiration, have a look at &lt;a href="https://sematext.com/docs/tags/common-schema/" rel="noopener noreferrer"&gt;Sematext Common Schema&lt;/a&gt; even if you are not a Sematext user.&lt;/p&gt;

&lt;p&gt;Of course, keeping the structure consistent is not always possible, because your full stack consists of externally developed servers, databases, search engines, queues, etc., each of which has its own set of logs and log formats. However, to keep your and your team’s sanity, minimize the number of different log message structures that you can control.&lt;/p&gt;

&lt;p&gt;One way of keeping a common structure is to use the same pattern for your logs, at least the ones that are using the same logging framework. For example, if your applications and microservices use &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; you could use a pattern like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;PatternLayout&amp;gt;
    &amp;lt;Pattern&amp;gt;%d %p [%t] %c{35}:%L - %m%n&amp;lt;/Pattern&amp;gt;
&amp;lt;/PatternLayout&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By using a single or a very limited set of patterns you can be sure that the number of log formats will remain small and manageable.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Add Context to Your Logs
&lt;/h3&gt;

&lt;p&gt;Context is important, and for us developers and DevOps engineers a log message is information. Look at the following log entry:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[2020-06-29 16:25:34] [ERROR ] An error occurred!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We know that an error appeared somewhere in the application. We don’t know where it happened, we don’t know what kind of error it was, we only know when it happened. Now look at a message with slightly more contextual information:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[2020-06-29 16:25:34] [main] [ERROR ] com.sematext.blog.logging.ParsingErrorExample - A parsing error occurred for user with id 1234!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same log record, but with a lot more contextual information. We know the thread in which it happened and the class in which the error was generated. We modified the message as well to include the user that the error happened for, so we can get back to the user if needed. We could also include additional information such as diagnostic contexts. Think about what you need and include it.&lt;/p&gt;
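&lt;p&gt;The diagnostic contexts mentioned above can be pictured as a map of key/value pairs attached to every log event. The class below is a toy stand-in for what org.slf4j.MDC or Log4j 2’s ThreadContext do for real (the userId and requestId keys are just illustrative, and real implementations keep the map per thread):&lt;/p&gt;

```java
import java.util.Properties;

public class MdcSketch {
    // Real MDC implementations (org.slf4j.MDC, Log4j 2's ThreadContext) keep
    // these key/value pairs in a per-thread map; a plain Properties object is
    // used here only to keep the sketch short and fully self-contained.
    private static final Properties CONTEXT = new Properties();

    static void put(String key, String value) {
        CONTEXT.setProperty(key, value);
    }

    // A pattern layout would normally render these values via %X{userId} etc.
    static String decorate(String message) {
        return "[userId=" + CONTEXT.getProperty("userId")
                + " requestId=" + CONTEXT.getProperty("requestId") + "] " + message;
    }

    public static void main(String[] args) {
        put("userId", "1234");
        put("requestId", "req-77");
        System.out.println(decorate("A parsing error occurred!"));
        // prints: [userId=1234 requestId=req-77] A parsing error occurred!
    }
}
```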

&lt;p&gt;To include context information you don’t have to do much when it comes to the code that is responsible for generating the log message. For example, the &lt;a href="https://logging.apache.org/log4j/2.x/manual/layouts.html%23PatternLayout" rel="noopener noreferrer"&gt;PatternLayout&lt;/a&gt; in &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; gives you all that you need to include the context information. You can go with a very simple pattern like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %-5level - %msg%n"/&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That will result in a log message similar to the following one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

17:13:08.059 INFO - This is the first INFO level log message!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But you can also include a pattern that will include way more information:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %c %l %-5level - %msg%n"/&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That will result in a log message like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

17:24:01.710 com.sematext.blog.logging.Log4j2 com.sematext.blog.logging.Log4j2.main(Log4j2.java:12) INFO - This is the first INFO level log message!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  10. Java Logging in Containers
&lt;/h3&gt;

&lt;p&gt;Think about the environment your application is going to be running in. The logging configuration differs when you run your Java code in a VM or on a bare-metal machine, when you run it in a containerized environment, and, of course, when you run your Java or Kotlin code on an &lt;a href="https://github.com/sematext/sematext-logsene-android/" rel="noopener noreferrer"&gt;Android device&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To set up &lt;a href="https://sematext.com/blog/docker-logs-location/" rel="noopener noreferrer"&gt;logging in a containerized environment&lt;/a&gt; you need to choose the approach you want to take. You can use one of the provided &lt;a href="https://sematext.com/blog/docker-log-driver-alternatives/" rel="noopener noreferrer"&gt;logging drivers&lt;/a&gt; – like &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;journald&lt;/a&gt;, &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;logagent&lt;/a&gt;, Syslog, or JSON file. Whichever you choose, remember that your application shouldn’t write the log file to the container’s ephemeral storage, but to the standard output. That can easily be done by configuring your logging framework to write logs to the console. For example, with &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; you would just use the following appender configuration:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;Appenders&amp;gt;
    &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
        &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} - %m %n"/&amp;gt;
    &amp;lt;/Console&amp;gt;
&amp;lt;/Appenders&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also completely omit the logging drivers and ship logs directly to your centralized logs solution like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;Appenders&amp;gt;
    &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
        &amp;lt;PatternLayout pattern="%d %level [%t] %c - %m%n"/&amp;gt;
    &amp;lt;/Console&amp;gt;
    &amp;lt;Syslog name="Syslog" host="logsene-syslog-receiver.sematext.com"
            port="514" protocol="TCP" format="RFC5424"
            appName="11111111-2222-3333-4444-555555555555"
            facility="LOCAL0" mdcId="mdc" newLine="true"/&amp;gt;
&amp;lt;/Appenders&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  11. Don’t Log Too Much or Too Little
&lt;/h3&gt;

&lt;p&gt;As developers we tend to think that everything might be important – we tend to mark each step of our algorithm or business code as important. On the other hand, we sometimes do the opposite – we don’t add logging where we should, or we log only the FATAL and ERROR levels. Neither approach works well. When writing your code and adding logging, think about what will be important for seeing whether the application is working properly, and what will be important for diagnosing and fixing a wrong application state. Use this as your guiding light to decide what and where to log. Keep in mind that too many logs lead to information fatigue, while too few leave you unable to troubleshoot.&lt;/p&gt;
&lt;h3&gt;
  
  
  12. Keep the Audience in Mind
&lt;/h3&gt;

&lt;p&gt;In most cases, you will not be the only person looking at the logs. Always remember that. There are multiple actors that may be looking at the logs.&lt;/p&gt;

&lt;p&gt;The developer may be looking at the logs while troubleshooting or during debugging sessions. For such people, logs can be detailed, technical, and include very deep information related to how the system is running. You can assume that such a person has access to the code, or even knows it well.&lt;/p&gt;

&lt;p&gt;Then there are DevOps engineers. For them, log events will be needed for troubleshooting and should include information helpful in diagnostics. You can assume knowledge of the system, its architecture, its components, and the configuration of those components, but you should not assume knowledge of the platform’s code.&lt;/p&gt;

&lt;p&gt;Finally, your application logs may be read by your users themselves. In such a case, the logs should be descriptive enough to help fix the issue, if that is even possible, or give enough information to the support team helping the user. For example, using Sematext for monitoring involves installing and running a monitoring agent. If you are behind a very restrictive firewall and the agent cannot ship metrics to Sematext, it logs errors that Sematext users themselves can look at, too.&lt;/p&gt;

&lt;p&gt;We could go further and identify even more actors who might be looking into logs, but this shortlist should give you a glimpse into what you should think about when writing your log messages.&lt;/p&gt;
&lt;h3&gt;
  
  
  13. Avoid Logging Sensitive Information
&lt;/h3&gt;

&lt;p&gt;Sensitive information shouldn’t be present in logs, or it should be masked. Passwords, credit card numbers, social security numbers, access tokens, and so on – all of these may be dangerous if leaked or accessed by those who shouldn’t see them. There are two things you ought to consider.&lt;/p&gt;

&lt;p&gt;First, consider whether sensitive information is truly essential for troubleshooting. Maybe instead of a credit card number it is enough to keep the transaction identifier and the date of the transaction? Maybe it is not necessary to keep the social security number in the logs when you can easily store the user identifier instead? Think about such situations, think about the data that you store, and only log sensitive data when it is really necessary.&lt;/p&gt;

&lt;p&gt;The second thing is shipping logs with sensitive information to a hosted logs service. With very few exceptions, the following advice should be followed: if your logs have and need to have sensitive information in them, mask or remove it before sending the logs to your centralized logs store. Most popular log shippers, like our own &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt;, include functionality that allows &lt;a href="https://sematext.com/docs/logagent/output-filter-removefields/" rel="noopener noreferrer"&gt;removal or masking of sensitive data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, the masking of sensitive information can be done in the logging framework itself. Let’s look at how it can be done by extending &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt;. Our code that produces log events looks as follows (the full example can be found on &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/log4jmasking" rel="noopener noreferrer"&gt;Sematext’s GitHub&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

public class Log4J2Masking {
    private static final Logger LOGGER = LoggerFactory.getLogger(Log4J2Masking.class);
    private static final Marker SENSITIVE_DATA_MARKER = MarkerFactory.getMarker("SENSITIVE_DATA_MARKER");

    public static void main(String[] args) {
        LOGGER.info("This is a log message without sensitive data");
        LOGGER.info(SENSITIVE_DATA_MARKER, "This is a log message with credit card number 1234-4444-3333-1111 in it");
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you were to run the whole example from GitHub, the output would be as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

21:20:42.099 - This is a log message without sensitive data
21:20:42.101 - This is a log message with credit card number ****-****-****-**** in it


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see that the credit card number was masked. This happened because we added a custom &lt;a href="https://logging.apache.org/log4j/2.x/manual/extending.html%23PatternConverters" rel="noopener noreferrer"&gt;Converter&lt;/a&gt; that checks whether the given Marker is attached to the log event and, if so, tries to replace a defined pattern. The implementation of such a Converter looks as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

@Plugin(name = "sample_logging_mask", category = "Converter")
@ConverterKeys("sc")
public class LoggingConverter extends LogEventPatternConverter {
    private static Pattern PATTERN = Pattern.compile("\\b([0-9]{4})-([0-9]{4})-([0-9]{4})-([0-9]{4})\\b");

    public LoggingConverter(String[] options) {
        super("sc", "sc");
    }

    public static LoggingConverter newInstance(final String[] options) {
        return new LoggingConverter(options);
    }

    @Override
    public void format(LogEvent event, StringBuilder toAppendTo) {
        String message = event.getMessage().getFormattedMessage();
        String maskedMessage = message;

        if (event.getMarker() != null &amp;amp;&amp;amp; "SENSITIVE_DATA_MARKER".compareToIgnoreCase(event.getMarker().getName()) == 0) {
            Matcher matcher = PATTERN.matcher(message);
            if (matcher.find()) {
                maskedMessage = matcher.replaceAll("****-****-****-****");
            }
        }

        toAppendTo.append(maskedMessage);
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is very simple, could be written in a more optimized way, and should also handle all possible credit card number formats, but it is enough for this purpose.&lt;/p&gt;

&lt;p&gt;Before jumping into the code explanation I would also like to show you the &lt;strong&gt;log4j2.xml&lt;/strong&gt; configuration file for this example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN" packages="com.sematext.blog.logging"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} - %sc %n"/&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, we’ve added the packages attribute to our Configuration to tell the framework where to look for our converter. We then used the &lt;strong&gt;%sc&lt;/strong&gt; pattern to emit the log message, because we can’t overwrite the default &lt;strong&gt;%m&lt;/strong&gt; pattern. Once Log4j2 encounters our &lt;strong&gt;%sc&lt;/strong&gt; pattern it will use our converter, which takes the formatted message of the log event, applies a simple regex, and replaces the data if it was found. As simple as that.&lt;/p&gt;

&lt;p&gt;One thing to notice here is that we are using the Marker functionality. Regex matching is expensive and we don’t want to do that for every log message. That’s why we mark the log events that should be processed with the created Marker, so only the marked ones are checked.&lt;/p&gt;
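&lt;p&gt;To see the masking logic in isolation, here is a minimal, self-contained sketch of what the Converter does once a marked event reaches it (the &lt;strong&gt;MaskingDemo&lt;/strong&gt; class name is ours, not part of the original example). It also illustrates the limitation mentioned earlier: a card number written without dashes does not match the pattern and passes through unmasked.&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaskingDemo {
    // The same pattern the Converter uses: four dash-separated groups of four digits
    private static final Pattern PATTERN =
            Pattern.compile("\\b([0-9]{4})-([0-9]{4})-([0-9]{4})-([0-9]{4})\\b");

    // Mirrors the body of the Converter's format() method without the Log4j2 plumbing
    static String mask(String message) {
        Matcher matcher = PATTERN.matcher(message);
        return matcher.find() ? matcher.replaceAll("****-****-****-****") : message;
    }

    public static void main(String[] args) {
        // Dash-separated card numbers are masked
        System.out.println(mask("Payment with card 1234-5678-9012-3456 accepted"));
        // No dashes, no match: the number is left as-is
        System.out.println(mask("Payment with card 1234567890123456 accepted"));
    }
}
```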

&lt;h3&gt;
  
  
  14. Use a Log Management Solution to Centralize &amp;amp; Monitor Java Logs
&lt;/h3&gt;

&lt;p&gt;As the complexity of your applications grows, the volume of your logs will grow, too. You may get away with logging to a file and only using logs when troubleshooting is needed, but as the amount of data grows it quickly becomes difficult and slow to troubleshoot this way. When this happens, consider using &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;a log management solution&lt;/a&gt; to centralize and monitor your logs. You can either go for an in-house solution based on open-source software, like the &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;Elastic Stack&lt;/a&gt;, or use one of the &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management tools&lt;/a&gt; available on the market, like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; or &lt;a href="https://sematext.com/enterprise/" rel="noopener noreferrer"&gt;Sematext Enterprise&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flua10oey54etpdeoquv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flua10oey54etpdeoquv0.png" alt="Sematext Cloud Logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A fully &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;managed log centralization&lt;/a&gt; solution will give you the freedom of not needing to manage yet another, usually quite complex, part of your infrastructure. Instead, you will be able to focus on your application and will only need to set up log shipping. You may want to include logs like JVM &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;garbage collection logs&lt;/a&gt; in your managed log solution. After &lt;a href="https://sematext.com/blog/java-garbage-collection/" rel="noopener noreferrer"&gt;turning them on&lt;/a&gt; for your applications and systems running on the JVM, you will want to have them in a single place for correlation and analysis, and to help you &lt;a href="https://sematext.com/blog/java-garbage-collection-tuning/" rel="noopener noreferrer"&gt;tune the garbage collection&lt;/a&gt; in your JVM instances. Alert on logs, &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;aggregate the data&lt;/a&gt;, save and re-run the queries, hook up your favorite incident management software. Correlating &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;log&lt;/a&gt; data with &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; coming from &lt;a href="https://sematext.com/guides/java-monitoring/" rel="noopener noreferrer"&gt;JVM applications&lt;/a&gt;, systems and &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;infrastructure&lt;/a&gt;, &lt;a href="https://sematext.com/experience/" rel="noopener noreferrer"&gt;real users&lt;/a&gt;, and &lt;a href="https://sematext.com/synthetic-monitoring/" rel="noopener noreferrer"&gt;API endpoints&lt;/a&gt; is something that platforms like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; are capable of.
And of course, remember that application logs are not everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Incorporating each and every good practice may not be easy, especially for applications that are already live and working in production. But if you take the time and roll the suggestions out one after another, you will start seeing an increase in the usefulness of your logs. Also, remember that at Sematext we do help organizations with their logging setups by offering &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;logging consulting&lt;/a&gt;, so reach out if you are having trouble and we will be happy to help.&lt;/p&gt;

</description>
      <category>java</category>
      <category>logging</category>
      <category>logs</category>
    </item>
    <item>
      <title>Node.js Monitoring in Production - Revised eBook</title>
      <dc:creator>Adnan Rahić</dc:creator>
      <pubDate>Fri, 28 Aug 2020 10:51:11 +0000</pubDate>
      <link>https://dev.to/sematext/node-js-monitoring-in-production-revised-ebook-4b9n</link>
      <guid>https://dev.to/sematext/node-js-monitoring-in-production-revised-ebook-4b9n</guid>
      <description>&lt;p&gt;Hey!&lt;/p&gt;

&lt;p&gt;Not so long ago I wrote my first ever eBook called &lt;a href="https://hubs.ly/H0hZ8rp0" rel="noopener noreferrer"&gt;Node.js Monitoring in Production&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A few months ago I wrote another chapter and added it to the eBook. Now I want to share the revised version with all of you!&lt;/p&gt;

&lt;p&gt;Lots of love peeps! 🥰&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hubs.ly/H0hZ8rp0" rel="noopener noreferrer"&gt;Here's the official download link!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope you like it! Feel free to let me know your thoughts in the comments below. Happy coding. :)&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>devops</category>
      <category>showdev</category>
    </item>
    <item>
      <title>15+ Best Cloud Monitoring Tools of 2020: Pros &amp; Cons Comparison</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Thu, 06 Aug 2020 10:28:01 +0000</pubDate>
      <link>https://dev.to/sematext/15-best-cloud-monitoring-tools-of-2020-pros-cons-comparison-35k</link>
      <guid>https://dev.to/sematext/15-best-cloud-monitoring-tools-of-2020-pros-cons-comparison-35k</guid>
      <description>&lt;p&gt;When providing services to your customers you need to keep an eye on everything that could impact your success with that – from low-level performance metrics to high-level business key performance indicators. From server-side logs to stack traces giving you full visibility into business and software processes that underpin your product. That’s where cloud monitoring tools and services come into play. They help you achieve full readiness of your infrastructure, applications, and make sure that your users and customers can use your platform to its full potential.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is Cloud Monitoring?
&lt;/h1&gt;

&lt;p&gt;Cloud monitoring is the process of gaining observability into your cloud-based infrastructure, services, applications, and user experience. It allows you to observe the environment, review, and predict the performance and availability of the whole infrastructure, or drill into each piece of it on its own. Cloud monitoring works by collecting observability data – such as metrics, logs, and traces – from your whole IT infrastructure, analyzing it, and presenting it in a format understood by humans, like charts, graphs, and alerts, as well as by machines via APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Best Cloud Monitoring Tools
&lt;/h1&gt;

&lt;p&gt;There are many types of tools that can help you gain full observability into your infrastructure, services, applications, and website performance and health. Some help you with just one aspect of monitoring, while others give you full visibility into all of the key performance indicators, metrics, logs, traces, etc. Some you can set up easily and without talking to sales, while others are more complex and involve a more traditional trial and sales process. Each solution has its pros and cons – sometimes the flexibility of a solution comes with higher setup complexity, while easy setup and use come with a limited feature set. As users, we need to choose the solution that’s the best fit for our needs and budget. In this post, we are going to explore the cloud monitoring tools that you should be aware of and that will let you know if your business and its IT operations are healthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Sematext Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1AY0POEO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lofaexo2b1o4f5bhrrho.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1AY0POEO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lofaexo2b1o4f5bhrrho.jpg" alt="Sematext Cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/cloud/"&gt;Sematext Cloud&lt;/a&gt; and its on-premise version – &lt;a href="https://sematext.com/enterprise/"&gt;Sematext Enterprise&lt;/a&gt; – is a full observability solution that is easy to set up and that gives you in-depth visibility into your &lt;a href="https://sematext.com/spm/"&gt;IT infrastructure&lt;/a&gt;. Dashboards with key application and infrastructure (e.g., &lt;a href="https://sematext.com/database-monitoring"&gt;common databases&lt;/a&gt; and NoSQL stores, &lt;a href="https://sematext.com/server-monitoring/"&gt;servers&lt;/a&gt;, containers, etc.) come out of the box and can be customized. There is powerful &lt;a href="https://sematext.com/alerts/"&gt;alerting&lt;/a&gt; with anomaly detection and scheduling. Sematext Cloud is the solution that gives you both reactive and predictive monitoring with easy analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Auto-discovery of services enables hands-off auto-monitoring.&lt;/li&gt;
&lt;li&gt;Full-blown &lt;a href="https://sematext.com/logsene/"&gt;log management&lt;/a&gt; solution with filtering, full-text search, alerting, scheduled reporting, AWS S3, IBM Cloud, and Minio archiving integrations, Elasticsearch-compatible API and Syslog support.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sematext.com/experience/"&gt;Real user&lt;/a&gt; and &lt;a href="https://sematext.com/synthetic-monitoring/"&gt;synthetic monitoring&lt;/a&gt; for full visibility of how your users experience your frontend and how fast and healthy your APIs are.&lt;/li&gt;
&lt;li&gt;Comprehensive support for microservices and containerized environments – support for &lt;a href="https://sematext.com/kubernetes/"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://sematext.com/docker/"&gt;Docker&lt;/a&gt;, and Docker Swarm with ability to observe applications running in them, too; collection of their metrics, logs, and events.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sematext.com/network-monitoring/"&gt;Network&lt;/a&gt;, &lt;a href="https://sematext.com/database-monitoring/"&gt;database&lt;/a&gt;, &lt;a href="https://sematext.com/process-monitoring/"&gt;processes&lt;/a&gt;, and &lt;a href="https://sematext.com/inventory-monitoring/"&gt;inventory monitoring&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Alerting with anomaly detection and support for external notification services like PagerDuty, OpsGenie, VictorOps, WebHooks, etc.&lt;/li&gt;
&lt;li&gt;Powerful dashboarding capabilities for graphing virtually any data shipped to Sematext.&lt;/li&gt;
&lt;li&gt;Scheduled reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lots of out of the box &lt;a href="https://sematext.com/integrations"&gt;integrations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Lightweight, open-sourced and pluggable agents. Quick setup.&lt;/li&gt;
&lt;li&gt;Powerful Machine Learning-based alerting and notifications system to quickly inform you about issues and potential problems with your environment.&lt;/li&gt;
&lt;li&gt;Elasticsearch and InfluxDB APIs allow for the integration of any tools that work with those, like &lt;a href="https://sematext.com/blog/getting-started-with-logstash/"&gt;Logstash&lt;/a&gt;, Filebeat, Fluentd, Logagent, Vector, etc.&lt;/li&gt;
&lt;li&gt;Easy correlation of performance metrics, logs, and various events.&lt;/li&gt;
&lt;li&gt;Collection of IT inventory – installed packages and their versions, detailed server info, container image inventory, etc.&lt;/li&gt;
&lt;li&gt;Straightforward pricing with free plans available and a generous 30-day trial.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited support for transaction tracing.&lt;/li&gt;
&lt;li&gt;Lack of full-featured profiler.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://sematext.com/pricing/"&gt;pricing&lt;/a&gt; for each solution is straightforward. Each solution lets you choose a plan. As a matter of fact, pricing is super flexible for the cost-conscious – you have the flexibility of picking a different plan for each of your &lt;a href="https://sematext.com/docs/guide/app-guide/"&gt;Apps&lt;/a&gt;. For Logs there is a per-GB volume discount as your log volume or data retention goes up. Performance monitoring is metered by the hour, which makes it suitable for dynamic environments that scale up and down. Real user monitoring allows downsampling that can minimize your cost without sacrificing value. Synthetic monitoring has a cheap pay-as-you-go option.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. AppDynamics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tue_WWfD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuzoxhihpnschtd9fqjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tue_WWfD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuzoxhihpnschtd9fqjw.jpg" alt="AppDynamics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both software-as-a-service and on-premise models, &lt;a href="https://www.appdynamics.com/"&gt;AppDynamics&lt;/a&gt; is more focused on large enterprises, providing the ability to connect application performance metrics with infrastructure data, alerting, and &lt;a href="https://www.appdynamics.com/product/business-iq"&gt;business-level metrics&lt;/a&gt;. A combination of these allows you to monitor the whole stack that runs your services and gives you insights into your environment – from top-level transactions understood by business executives to the code-level information useful for DevOps teams and developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;End-user monitoring with mobile and browser real user, synthetic, and internet of things monitoring.&lt;/li&gt;
&lt;li&gt;Infrastructure monitoring with network components, databases, and servers visibility providing information about status, utilization, and flow between each element.&lt;/li&gt;
&lt;li&gt;Business-focused dashboards and features provide visualizations and analysis of the connections between performance and &lt;a href="https://www.appdynamics.com/product/business-iq"&gt;business-oriented metrics&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Machine Learning supported anomaly detection and root cause analysis features.&lt;/li&gt;
&lt;li&gt;Alerting with email templating and period digest capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Very detailed information about the environment, including, for example, JVM application startup parameters, JVM version, etc.&lt;/li&gt;
&lt;li&gt;Provides advanced features for various languages – for example, automatic leak detection and object instance tracking for the JVM-based stack.&lt;/li&gt;
&lt;li&gt;Visibility into connections between the system components, environment elements, endpoint response times, and business transactions.&lt;/li&gt;
&lt;li&gt;Visibility into server and application metrics with up to code-level visibility and automated diagnostics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pricing: very expensive, complex, and non-transparent. Focused on more traditional high-touch sales model and selling to large enterprises.&lt;/li&gt;
&lt;li&gt;Installation of the agent requires manual downloading and starting of the agent – no one-line installation and setup command.&lt;/li&gt;
&lt;li&gt;Some of the basic metrics like system CPU, memory, and network utilization are not available in the lowest, paid plan tier.&lt;/li&gt;
&lt;li&gt;Slicing and dicing through the data is not as easy compared to some of the other tools mentioned in this summary that support rich dashboarding capabilities like Sematext, Datadog, or New Relic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Agent and feature-based &lt;a href="https://www.appdynamics.com/pricing"&gt;pricing&lt;/a&gt; is used which makes the pricing not transparent. The amount of money you will pay for the solution depends on the language your applications are written in and what functionalities you need and want to use from the platform. For example, visibility into the CPU, memory, and disk metrics requires the APM Advanced plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--reHeRBT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5paap5n0dy0v1j186vg2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--reHeRBT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5paap5n0dy0v1j186vg2.jpg" alt="Datadog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Datadog is a full observability solution providing an extended set of features needed to monitor your infrastructure, applications, containers, network, logs, or even serverless functions such as AWS Lambdas. With the flexibility and functionality comes a price, though – the configuration-based agent installation may be time-consuming to set up (e.g. process monitoring requires editing the agent config and restarting the agent) and quite some time may pass before you start seeing all the metrics, logs, and traces in one place for the full visibility into your application stack that you are after.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Application performance monitoring with a large number of integrations available and distributed tracing support.&lt;/li&gt;
&lt;li&gt;Logs centralization and analysis.&lt;/li&gt;
&lt;li&gt;Real user and synthetics monitoring.&lt;/li&gt;
&lt;li&gt;Network and host monitoring.&lt;/li&gt;
&lt;li&gt;Dashboard framework that allows building virtually anything out of the provided metrics and logs, and sharing it.&lt;/li&gt;
&lt;li&gt;Alerting with machine learning capabilities.&lt;/li&gt;
&lt;li&gt;Collaboration tools for team-based discussions.&lt;/li&gt;
&lt;li&gt;API for working with the data, tags, and dashboards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full observability solution – metric, logs, security, real user, and synthetics all in one.&lt;/li&gt;
&lt;li&gt;Infrastructure monitoring including hosts, containers, processes, networks, and serverless capabilities.&lt;/li&gt;
&lt;li&gt;Rich logs integration including applications, containers, cloud providers, clients, and common log shippers.&lt;/li&gt;
&lt;li&gt;Powerful and very flexible data analysis features with alerts and custom dashboards.&lt;/li&gt;
&lt;li&gt;Provides API allowing interaction with the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Overwhelming for newcomers with all the installation steps needed for anything beyond basic metrics.&lt;/li&gt;
&lt;li&gt;Not a lot of pre-built dashboards compared to others. New users have to invest quite a bit of time to understand metrics and build dashboards before being able to make full use of the solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Feature, host, and volume-based &lt;a href="https://www.datadoghq.com/pricing/"&gt;pricing&lt;/a&gt; combined – each part of the solution is priced differently and can be billed annually or on-demand. The on-demand billing makes the solution about 17–20% more expensive than the annual pricing at the time of this writing. Pay close attention to your bill. We’ve seen a number of reports where people were surprised by bill items or amounts.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. New Relic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gEDA8GbT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rhlypzncix9a52ps9vb8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gEDA8GbT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rhlypzncix9a52ps9vb8.jpg" alt="New Relic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New Relic is a full-stack observability solution available in a software-as-a-service model. Its monitoring capabilities include application performance monitoring with rich dashboarding support, distributed tracing, and logs, along with real user and synthetics monitoring for top-to-bottom visibility. Even though the agents require manual steps to download and install, they are robust and reliable, with support for a wide range of common programming languages – a big advantage of New Relic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://newrelic.com/products/application-monitoring" rel="noopener noreferrer"&gt;Application Performance Monitoring&lt;/a&gt; with dashboarding and support for commonly used languages including C++.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newrelic.com/products/logs" rel="noopener noreferrer"&gt;Log centralization&lt;/a&gt; and analysis.&lt;/li&gt;
&lt;li&gt;Integrated alerting with anomaly detection.&lt;/li&gt;
&lt;li&gt;Rich and powerful query language – NRQL.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newrelic.com/products/browser-monitoring"&gt;Real user&lt;/a&gt; and &lt;a href="https://newrelic.com/products/synthetics"&gt;synthetics&lt;/a&gt; monitoring.&lt;/li&gt;
&lt;li&gt;Distributed tracing allowing you to understand what is happening from top to bottom.&lt;/li&gt;
&lt;li&gt;Integration with most known cloud providers such as AWS, Azure, and Google Cloud Platform.&lt;/li&gt;
&lt;li&gt;Business level metrics support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Visibility into the whole system, not only when using physical servers or virtual machines, but also when dealing with containers and microservices.&lt;/li&gt;
&lt;li&gt;Ability to connect business-level metrics together with performance to correlate them together.&lt;/li&gt;
&lt;li&gt;Error analytics tool for quick and efficient issues analysis, like site errors or downtime.&lt;/li&gt;
&lt;li&gt;Rich visualization support allowing you to graph metrics, logs, and NRQL queries.&lt;/li&gt;
&lt;li&gt;Ability to define the correlation between alerts and defined logic to reduce alert noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The platform itself doesn’t provide agent management functionality, which leads to additional work related to installation and configuration, especially on a larger scale.&lt;/li&gt;
&lt;li&gt;Inconsistent UI: some parts of the product use the legacy interface, while others are already part of New Relic One.&lt;/li&gt;
&lt;li&gt;The log management part of the solution is still young.&lt;/li&gt;
&lt;li&gt;Lack of a single pricing page for all features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Pricing is annual or monthly, based on compute units or hosts, and depends on the features (for example: &lt;a href="https://newrelic.com/products/application-monitoring/pricing"&gt;APM pricing&lt;/a&gt;, &lt;a href="https://newrelic.com/products/infrastructure/pricing"&gt;infrastructure pricing&lt;/a&gt;, &lt;a href="https://newrelic.com/products/synthetics/pricing"&gt;synthetic pricing&lt;/a&gt;). For small services, compute units may be the best option; they are calculated from the total number of CPUs and the amount of RAM your system has, multiplied by the number of running hours. For example, the infrastructure part of New Relic uses only compute unit pricing, while APM can be charged on either host-based or compute unit-based pricing. This may be confusing and requires additional calculations if you want to control your costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Dynatrace
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y9Hd4RbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nt8moghw5ku766ubpnsp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y9Hd4RbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nt8moghw5ku766ubpnsp.jpg" alt="Dynatrace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dynatrace is a full-stack observability solution that introduces a user-friendly approach to monitoring your applications, infrastructure, and logs. It uses a single agent that, once installed, can be controlled via the Dynatrace UI, making monitoring easy and pleasant to work with. Available in both software-as-a-service and on-premise models, it will fulfill most of your monitoring needs when it comes to application performance monitoring, real user monitoring, logs, and infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dynatrace.com/platform/application-performance-management/"&gt;Application performance monitoring&lt;/a&gt; with dashboarding and rich integrations for commonly used tools and code-level tracing.&lt;/li&gt;
&lt;li&gt;First-class &lt;a href="https://www.dynatrace.com/platform/log-monitoring/"&gt;Log analysis&lt;/a&gt; support with automatic detection of the common system and application log types.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dynatrace.com/platform/real-user-monitoring/"&gt;Real user&lt;/a&gt; and &lt;a href="https://www.dynatrace.com/platform/synthetic-monitoring/"&gt;synthetic&lt;/a&gt; monitoring.&lt;/li&gt;
&lt;li&gt;Diagnostic tools that allow taking memory dumps, analyzing exceptions and CPU usage, and inspecting top database and web requests.&lt;/li&gt;
&lt;li&gt;Docker, Kubernetes, and OpenShift integrations.&lt;/li&gt;
&lt;li&gt;Support for common cloud providers like Amazon Web Services, Microsoft Azure, and Google Cloud Platform.&lt;/li&gt;
&lt;li&gt;A virtual assistant can make your life easier when dealing with common questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple and intuitive agent installation, with UI guidance for new users and demo data to help you get to know the product faster.&lt;/li&gt;
&lt;li&gt;Ease of integration to gain visibility into the logs of your systems and applications – almost everything is doable from the UI.&lt;/li&gt;
&lt;li&gt;Easy to navigate and powerful top to bottom view of the whole stack – from the mobile/web application through the middle tier up to the database level.&lt;/li&gt;
&lt;li&gt;Dedicated problem-solving functionalities to help in quick and efficient problem finding.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lots of options can be overwhelming to start with, but the solution tries to do its best to help new users.&lt;/li&gt;
&lt;li&gt;Business metrics analysis is still limited compared to AppDynamics and Datadog, for example.&lt;/li&gt;
&lt;li&gt;Serverless offering is limited when compared to other solutions on the market, like Datadog, New Relic, and AppDynamics.&lt;/li&gt;
&lt;li&gt;Pricing information is only available once you sign up.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Pricing is organized around features. The application performance monitoring pricing is tied to hosts and the amount of memory available on a host. Each 16GB of memory counts as a host unit, and the price is calculated based on the number of host units used per hour. The real user monitoring price is calculated based on the number of sessions, while the synthetic monitoring pricing is based on the number of actions. Finally, the logs part of the solution is priced based on volume, similar to other vendors covered in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Sumo Logic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FBNJb6YF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n92fs3cjcdyvzzrtz9lz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FBNJb6YF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n92fs3cjcdyvzzrtz9lz.jpg" alt="Sumo Logic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sumo Logic is an observability solution with a strong focus on working with logs, and it handles them very well. With tools like LogReduce and LogCompare you can not only view the logs from a given time period, but also reduce the volume of data you need to analyze, or compare periods to find interesting discrepancies and anomalies. Combined with metrics and security features, this makes it a great tool that will fulfill the observability needs of your environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Log analysis with the &lt;a href="https://www.sumologic.com/blog/what-is-logreduce/"&gt;LogReduce&lt;/a&gt; algorithm allows clustering of similar messages and &lt;a href="https://help.sumologic.com/05Search/LogCompare"&gt;LogCompare&lt;/a&gt; lets you compare data from two time periods.&lt;/li&gt;
&lt;li&gt;Field extraction enables rule-based data extraction from unstructured data.&lt;/li&gt;
&lt;li&gt;Application performance monitoring with real-time alerting and dashboarding.&lt;/li&gt;
&lt;li&gt;Scheduled views for running your queries periodically.&lt;/li&gt;
&lt;li&gt;Cloud security features for common cloud providers and SaaS solutions with PCI compliance and integrated threat intelligence.&lt;/li&gt;
&lt;/ul&gt;
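&lt;p&gt;Conceptually, the field extraction rules mentioned above are named-group regular expressions applied at ingest time. A minimal sketch of the idea in Python – the log format, rule, and field names below are invented for illustration and are not the vendor’s actual rule syntax:&lt;/p&gt;

```python
import re

# One "parse rule": named groups become searchable fields.
# The log format below is an invented Apache-style example.
RULE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) - - \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def extract_fields(line: str) -> dict:
    """Return the extracted fields, or an empty dict when the rule does not match."""
    m = RULE.search(line)
    return m.groupdict() if m else {}

line = '10.0.0.7 - - [10/Sep/2020:17:40:07 +0000] "GET /health HTTP/1.1" 200'
print(extract_fields(line))
# {'ip': '10.0.0.7', 'ts': '10/Sep/2020:17:40:07 +0000',
#  'method': 'GET', 'path': '/health', 'status': '200'}
```

&lt;p&gt;Once fields like &lt;code&gt;status&lt;/code&gt; or &lt;code&gt;method&lt;/code&gt; exist, queries and dashboards can filter and aggregate on them instead of full-text searching raw lines.&lt;/p&gt;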

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;User-friendly interface that doesn’t overwhelm novice users and is still usable for experienced ones.&lt;/li&gt;
&lt;li&gt;Ability to reduce the number of similar logs at read-time and compare time periods, which helps you spot differences and anomalies and track down problems quickly.&lt;/li&gt;
&lt;li&gt;Field extraction from unstructured data lets you drop the processing component from your local pipeline and move it to the vendor side.&lt;/li&gt;
&lt;li&gt;Limited free tier available that may be enough for very small companies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pricing may be confusing and hard to pre-calculate when using Cloud Flex credits, especially in larger environments.&lt;/li&gt;
&lt;li&gt;A limited number of out of the box charts compared to the competition.&lt;/li&gt;
&lt;li&gt;The primary focus on logs puts it at a disadvantage if you are looking for a full-stack observability solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Credit and feature-based &lt;a href="https://www.sumologic.com/pricing/us/"&gt;pricing&lt;/a&gt; with a limited free tier is available. A credit is a unit of utilization for ingested data – logs and metrics. The needed features dictate the price of each credit unit – the more features of the platform you need and will use, the more expensive the credit will be. Please keep in mind that the price also depends on the location you want to use. For example, at the time of this writing, the Ireland location was more expensive compared to North America.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. CA Unified Infrastructure Monitoring (UIM)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mqYwcxAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eus3hs0h84xfegv1j0hi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mqYwcxAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eus3hs0h84xfegv1j0hi.jpg" alt="CA Unified Infrastructure Monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both SaaS and on-premise models and targeted at enterprise customers, DX Infrastructure Manager, formerly called CA Unified Infrastructure Monitoring, is a unified tool that gives you observability into your hybrid cloud, services, applications, and infrastructure elements like switches, routers, and storage devices. With actionable log analytics, out-of-the-box dashboards, and alerting backed by anomaly detection algorithms, the solution gives you both retrospective and proactive views of your IT environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring with various integrations supporting common infrastructure providers and services, including packaged applications such as Office 365 and tools like Salesforce Service Cloud.&lt;/li&gt;
&lt;li&gt;Log analytics with actionable, out of the box dashboards and rich visualization support.&lt;/li&gt;
&lt;li&gt;Alerting with anomaly detection and dynamic thresholds.&lt;/li&gt;
&lt;li&gt;Reporting with business-level metrics support and scheduling capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy deployment and configuration with configurable automatic service discovery.&lt;/li&gt;
&lt;li&gt;Templates support which allows you to build templates per environment, devices, and more.&lt;/li&gt;
&lt;li&gt;Advanced correlations for hybrid infrastructures.&lt;/li&gt;
&lt;li&gt;In-depth monitoring of the whole infrastructure with the help of various integrations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Non-transparent pricing – the pricing is not available on the website.&lt;/li&gt;
&lt;li&gt;A limited number of alert notification destinations compared to other competitors.&lt;/li&gt;
&lt;li&gt;May be considered complicated for novice users.&lt;/li&gt;
&lt;li&gt;Targeted for enterprise customers.&lt;/li&gt;
&lt;li&gt;Dated UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;At the time of this writing the pricing was not publicly available on the vendor’s site.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Site 24×7
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZjA5Iitu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/reunjkx1ioa7v6pejbms.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZjA5Iitu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/reunjkx1ioa7v6pejbms.jpg" alt="Site 24x7"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Site 24×7 is an observability solution providing everything needed for full visibility into your website’s health, application performance, infrastructure, and network gear, covering both metrics and logs. You can set up alerts based on advanced rules to reduce alert fatigue and get insights from your mobile applications, and monitor servers along with over 50 common technologies running inside your environment, including widely used software like Apache and MySQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/web-site-performance.html"&gt;Website monitoring&lt;/a&gt; with the support for monitoring HTTP services, DNS and FTP servers, SMTP and POP servers, URLs, and REST APIs available both publicly and in private networks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/server-monitoring.html"&gt;Server monitoring&lt;/a&gt; with support for Microsoft Windows and Linux and over 50 common technologies plugins, like MySQL or Apache.&lt;/li&gt;
&lt;li&gt;Full featured &lt;a href="https://www.site24x7.com/network-monitoring.html"&gt;network monitoring&lt;/a&gt; with routers, switches, firewalls, load balancers, UPS, and storage support.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/application-performance-monitoring.html"&gt;Application performance&lt;/a&gt; monitoring and log management with support for server, desktop, and mobile applications and alerting capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/cloud-monitoring.html"&gt;Cloud monitoring&lt;/a&gt; with support for hybrid cloud infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Quick and easy agent installation.&lt;/li&gt;
&lt;li&gt;Monitoring for various technologies with alerting support based on complex rules.&lt;/li&gt;
&lt;li&gt;Full observability with visibility from your website performance and health up to network-level devices like switches and routers.&lt;/li&gt;
&lt;li&gt;Custom dashboarding support lets you build your own views into servers, applications, websites, and cloud environments.&lt;/li&gt;
&lt;li&gt;Pluggable server monitoring allows you to write your own plugins where needed.&lt;/li&gt;
&lt;li&gt;Free, limited uptime and server monitoring which might be enough for personal needs or small companies.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The number of features can be overwhelming for novice users.&lt;/li&gt;
&lt;li&gt;Setup can be time-consuming in larger environments because of the lack of autodiscovery.&lt;/li&gt;
&lt;li&gt;A limited number of technologies when it comes to server monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.site24x7.com/site24x7-pricing.html"&gt;pricing&lt;/a&gt; depends on the parts of the product that you will use with the free uptime monitoring for a small number of websites and servers available. The infrastructure monitoring starts with the 9 euro per month when billed annually for up to 10 servers, 500MB of logs, and 100K page views for a single site. You can buy additional add-ons for a monthly fee. You can also go for pure website monitoring or application performance monitoring or so-called “All-in-one” plan, which covers all the features of the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Zabbix
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GCYGFgmM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p5070hv3h1o5e17nf053.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GCYGFgmM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p5070hv3h1o5e17nf053.jpg" alt="Zabbix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zabbix is an open-source monitoring tool capable of real-time monitoring for both large-scale enterprises and small companies. If you are looking for a free, well-supported solution with a large community, Zabbix is worth a look. Its multi-system, small-footprint agents let you gather key performance indicators across your environment and use them as a source for your dashboards and alerts. With template-based setup and auto-discovery, you can speed up even the largest deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-system, small-footprint agent that gathers crucial &lt;a href="https://www.zabbix.com/features#metric_collection"&gt;metrics&lt;/a&gt;, with support for SNMP and IPMI.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zabbix.com/features#problem_detection"&gt;Problem detection&lt;/a&gt; and prediction mechanism with flexible thresholds and severity levels defining their importance.&lt;/li&gt;
&lt;li&gt;Multi-lingual, multi-tenant, flexible UI with &lt;a href="https://www.zabbix.com/features#visualization"&gt;dashboarding&lt;/a&gt; capabilities and geolocation support for large organizations with data centers spread around the world.&lt;/li&gt;
&lt;li&gt;Support for adjustable &lt;a href="https://www.zabbix.com/features#notification"&gt;notifications&lt;/a&gt; with out of the box support for email, SMS, Slack, Hipchat, and XMPP, plus an escalation workflow.&lt;/li&gt;
&lt;li&gt;Template-based host management and &lt;a href="https://www.zabbix.com/features#auto_discovery"&gt;auto-discovery&lt;/a&gt; for monitoring large environments.&lt;/li&gt;
&lt;/ul&gt;
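&lt;p&gt;Extending the agent for a technology it does not cover out of the box is typically a one-line change: a &lt;code&gt;UserParameter&lt;/code&gt; entry in &lt;code&gt;zabbix_agentd.conf&lt;/code&gt; maps a custom item key to a shell command whose output becomes the metric value. A minimal sketch – the key name and command are illustrative and assume &lt;code&gt;mysqladmin&lt;/code&gt; is installed on the monitored host:&lt;/p&gt;

```ini
# zabbix_agentd.conf – expose a custom item key to the Zabbix server.
# The server polls "mysql.ping" and stores the command's output (0 or 1).
UserParameter=mysql.ping,mysqladmin -uroot ping | grep -c alive
```

&lt;p&gt;After restarting the agent, the new key can be referenced from items and templates like any built-in metric.&lt;/p&gt;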

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Well known, open-source, and free, with a large community and commercial support available.&lt;/li&gt;
&lt;li&gt;Wide functionality that lets you monitor virtually everything.&lt;/li&gt;
&lt;li&gt;It can be easily integrated with other visualization tools like Grafana.&lt;/li&gt;
&lt;li&gt;Easily extensible to support technologies and infrastructure elements not covered out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;As an open-source, completely free solution, Zabbix must be hosted and maintained by you, which means paying for the team that installs and manages it.&lt;/li&gt;
&lt;li&gt;Initial setup can be tedious and not obvious; it requires knowledge not only of the platform but also of the applications, servers, and infrastructure elements you plan to monitor, making the learning curve quite steep.&lt;/li&gt;
&lt;li&gt;No dedicated functionality for real user monitoring or synthetic monitoring, and no transaction tracing support.&lt;/li&gt;
&lt;li&gt;If you are looking for a software-as-a-service solution, Zabbix Cloud is coming, but as of this writing it is still in beta.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Zabbix is open source and free. You can, however, subscribe to support, consultancy, and training if you would like to quickly and efficiently extend your knowledge of the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Stackify Retrace
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9tLudxCt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/98bsuu0fhke8vpobiyzy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9tLudxCt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/98bsuu0fhke8vpobiyzy.jpg" alt="Stackify Retrace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stackify Retrace is a developer-centric solution providing full visibility into applications and infrastructure elements. With application performance monitoring, centralized logging, error reporting, and transaction tracing available in one place, it is easy for a developer to connect the pieces of information when troubleshooting. The platform glues automated transaction traces to the relevant logs and error data, and provides an integrated profiler for top-to-bottom insight into a business transaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Centralized &lt;a href="https://stackify.com/retrace-log-management/"&gt;logging&lt;/a&gt; combined with &lt;a href="https://stackify.com/retrace-error-monitoring/"&gt;error&lt;/a&gt; reporting.&lt;/li&gt;
&lt;li&gt;Transaction tracing and &lt;a href="https://stackify.com/what-is-code-profiling/"&gt;code profiling&lt;/a&gt; with automatic instrumentation for databases like MySQL, PostgreSQL, Oracle, SQL Server, and common NoSQL solutions like MongoDB and Elasticsearch.&lt;/li&gt;
&lt;li&gt;Key &lt;a href="https://stackify.com/retrace-application-performance-management/"&gt;performance metrics monitoring&lt;/a&gt; for your &lt;a href="https://stackify.com/retrace-app-monitoring/"&gt;applications&lt;/a&gt; with alerting and notifications support.&lt;/li&gt;
&lt;li&gt;Server monitoring gives you insight into the most useful metrics like uptime, CPU &amp;amp; memory utilization, disk space usage, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Top to bottom view starting with the web requests and ending at the relevant log message connected together with the transaction trace.&lt;/li&gt;
&lt;li&gt;Integrated profiler with out of the box instrumentation for common system elements like databases or NoSQL stores.&lt;/li&gt;
&lt;li&gt;In-line log and error data inclusion in tracing information makes it super easy to connect information together for fast troubleshooting.&lt;/li&gt;
&lt;li&gt;Support for custom dashboards and reports.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No native support for Google Cloud at the time of writing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stackify.com/retrace-real-user-monitoring/"&gt;Real user monitoring&lt;/a&gt; “coming soon” at the time of writing.&lt;/li&gt;
&lt;li&gt;UI reminiscent of Windows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://stackify.com/pricing/"&gt;pricing&lt;/a&gt; is based on data volume and is provided in three tiers – Essentials, Standard, and Enterprise. The Essentials package starts at $79/month allowing for 7 days of logs and traces retention, with up to 500k traces and 2m logs and up to 8 days of summary data retention with all the standard features provided. The Standard plan starts from $199 with additional features available for an appropriate higher price..&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Zenoss
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--StlS_Ae2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ku1o4y9w74tmk8inu0gh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--StlS_Ae2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ku1o4y9w74tmk8inu0gh.jpg" alt="Zenoss"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zenoss offers multi-vendor infrastructure monitoring with support for end-to-end troubleshooting and real-time dependency mapping. With server monitoring covering common metrics and health, plus excellent network monitoring, the Zenoss platform gives you visibility into your infrastructure, whether it is a private, hybrid, or public cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.zenoss.com/product/converged-infrastructure-monitoring"&gt;Infrastructure monitoring&lt;/a&gt; with the support for public, private, and hybrid &lt;a href="https://www.zenoss.com/product/cloud-monitoring"&gt;clouds&lt;/a&gt; and real-time dependency mapping.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenoss.com/product/server-monitoring"&gt;Server monitoring&lt;/a&gt; with support for common metrics, health, physical sensors like temperature sensors, file systems, processes, network interfaces, and routes monitoring.&lt;/li&gt;
&lt;li&gt;Application performance monitoring available via ZenPacks with support for incident root cause analysis and metrics importance voting along with containers and microservices support.&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://www.zenoss.com/solutions/log-analytics"&gt;logs&lt;/a&gt;, including log format unification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-vendor support for a wide variety of hardware and software infrastructure elements.&lt;/li&gt;
&lt;li&gt;Automatic discovery for dynamic environments like &lt;a href="https://www.zenoss.com/product/container-monitoring-microservices"&gt;containers and microservices&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Extensibility via ZenPacks – available both as driven by the community and commercial extensions with SDK allowing you to develop new extensions easier.&lt;/li&gt;
&lt;li&gt;The self-managed, limited &lt;a href="https://www.zenoss.com/get-started"&gt;community version&lt;/a&gt; of the platform available as a solution with basic functionality and minimum scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Application performance monitoring is available only via ZenPacks extensions or integrations with third-party services.&lt;/li&gt;
&lt;li&gt;Available only in the on-premise model with no free trial available which makes it hard to test the platform.&lt;/li&gt;
&lt;li&gt;No features like real user monitoring, synthetic monitoring or transaction tracing.&lt;/li&gt;
&lt;li&gt;Focused on medium and large customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;At the time of writing the pricing was not publicly available on the vendor’s site, but it is worth noting that a community version exists, allowing you to install a limited, self-managed version of the platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When using Amazon Web Services, Google Cloud Platform, or Microsoft Azure you can rely on the tools provided by those platforms. These cloud provider dedicated solutions may not be as powerful as the platforms discussed above, but they provide insight into metrics, logs, and infrastructure data. They give you not only visibility into the metrics but also proactive capabilities like alerts and health checks that you can use to configure basic monitoring. If you are using a cloud solution from Amazon, Microsoft, or Google and would like to use the monitoring provided by those companies, have a look at what they offer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  12. Amazon CloudWatch
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7yZ0urb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tx5co9tm3ftvjbbky640.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7yZ0urb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tx5co9tm3ftvjbbky640.jpg" alt="Amazon CloudWatch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/cloudwatch/"&gt;Amazon CloudWatch&lt;/a&gt; is primarily aimed at customers using Amazon Web Services, but can also read metrics from statsd and collectd providing a way to ship custom metrics to the platform. By default, it provides an out of the box monitoring for your AWS infrastructure, services, and applications. With the integrated logs support and synthetics monitoring, it allows the users to set up basic monitoring quickly to give insights into the whole environment that is living in the Amazon ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;View metrics and logs of your infrastructure, services, and applications.&lt;/li&gt;
&lt;li&gt;Insights into events coming from your AWS environment.&lt;/li&gt;
&lt;li&gt;Service map and tracing support via AWS X-Ray.&lt;/li&gt;
&lt;li&gt;Synthetic service for web application monitoring.&lt;/li&gt;
&lt;li&gt;Alerting with anomaly detection on metrics and logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Available out of the box for Amazon Web Services Users.&lt;/li&gt;
&lt;li&gt;Support for custom metrics, so if you would like to stick to CloudWatch you can easily keep all your metrics there.&lt;/li&gt;
&lt;li&gt;Possibility to graph billing-related information and have that under control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited dashboarding and visualization capabilities.&lt;/li&gt;
&lt;li&gt;A limited number of dashboards can be created in the free tier – each dashboard beyond the first three costs $3.00 per month.&lt;/li&gt;
&lt;li&gt;Limited metrics granularity even when going for the paid service.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Volume-based &lt;a href="https://aws.amazon.com/cloudwatch/pricing/"&gt;pricing&lt;/a&gt; – you pay for what you want visibility into and for how detailed it is. The free tier enables monitoring of your AWS services with 5-minute metric granularity and also covers services like EBS volumes, RDS DB instances, and Elastic Load Balancers. It includes up to ten metrics and ten alarms per month, as well as up to 5GB of logs per month, 3 dashboards, and 100 runs of synthetic monitors per month. The paid tier price is based on usage. For example, one-minute granularity metrics start at $0.30 per metric per month for the first 10,000 metrics and go as low as $0.02 per metric per month when sending over one million metrics. With logs the situation is similar – the more you send, the less you pay per gigabyte of data.&lt;/p&gt;
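&lt;p&gt;Because the tiers compose, the monthly metric bill is a sum over tier brackets rather than a single rate. A small worked example in Python – the tier boundaries below reflect the published price points at the time of writing and should be treated as illustrative, not as current prices:&lt;/p&gt;

```python
# Tiered, CloudWatch-style metric pricing. Tier boundaries and prices
# reflect published rates at the time of writing -- treat them as
# illustrative and check the current pricing page before relying on them.
TIERS = [
    (10_000, 0.30),        # first 10,000 metrics
    (240_000, 0.10),       # next 240,000
    (750_000, 0.05),       # next 750,000
    (float("inf"), 0.02),  # everything above 1,000,000
]

def monthly_metric_cost(metric_count: int) -> float:
    cost, remaining = 0.0, metric_count
    for tier_size, price_per_metric in TIERS:
        in_tier = min(remaining, tier_size)
        cost += in_tier * price_per_metric
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost

print(monthly_metric_cost(5_000))      # 1500.0
print(monthly_metric_cost(1_200_000))  # 68500.0
```

&lt;p&gt;For instance, 1.2 million metrics bill the first 10,000 at $0.30 and only the portion above one million at $0.02.&lt;/p&gt;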

&lt;h2&gt;
  
  
  13. Azure Monitor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uBikTsjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3v0c3en4nd42t5skrvfj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBikTsjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3v0c3en4nd42t5skrvfj.jpg" alt="Azure Monitor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://azure.microsoft.com/en-us/services/monitor/#partners"&gt;Azure Monitor&lt;/a&gt; a solution primarily focused on monitoring the services located in the Microsoft Azure cloud services, but support custom metrics for resources outside of the cloud. It provides a full-featured observability solution giving you deep insights into your infrastructure, services, applications, and Azure resources with powerful dashboards, BI support, and alerting that will automatically notify you when needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring for your &lt;a href="https://azure.microsoft.com/en-us/"&gt;Microsoft Azure&lt;/a&gt; resources, services, first-party solutions, and custom metrics sent by your applications.&lt;/li&gt;
&lt;li&gt;Detailed infrastructure monitoring for deep insight into the metrics.&lt;/li&gt;
&lt;li&gt;Network activity, layout, and services layout visualization and monitoring.&lt;/li&gt;
&lt;li&gt;Support for alerts and autoscaling based on the metrics and logs.&lt;/li&gt;
&lt;li&gt;Powerful dashboarding capabilities with workbooks and BI support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Available out of the box for &lt;a href="https://azure.microsoft.com/en-us/"&gt;Microsoft Azure&lt;/a&gt; users.&lt;/li&gt;
&lt;li&gt;Azure resources, services, and first-party solutions expose their metrics in the free tier and other signals like logs and alerts have a free tier available.&lt;/li&gt;
&lt;li&gt;Support for workbooks and BI allows you to connect business-level metrics with the signals coming from the services and infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It may be complicated and overwhelming for users who have just started with Azure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The Azure Monitor &lt;a href="https://azure.microsoft.com/en-us/pricing/details/monitor/"&gt;pricing&lt;/a&gt; is based on the volume of ingested data or on reserved capacity. Selected metrics from Azure resources, services, and first-party solutions are free. Custom metrics are paid once you pass 150MB per month. Similar to other cloud vendors, the more data you send, the less you pay per unit of data. For logs there is a pay-as-you-go option that gives you up to 5GB of logs per billing account per month for free, and then costs $2.76 per GB of data. You can also opt for reserved capacity – for example, 100GB of data per day will cost you $219.52 daily. Other monitoring elements are priced in a similar way, with a small free tier or none at all.&lt;/p&gt;
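&lt;p&gt;Using the numbers quoted above, the break-even between pay-as-you-go and reserved capacity is easy to check with a few lines of Python (a 30-day month is assumed):&lt;/p&gt;

```python
# Worked example using the numbers quoted above (assume a 30-day month).
FREE_GB_PER_MONTH = 5            # free pay-as-you-go allowance
PAYG_PRICE_PER_GB = 2.76         # pay-as-you-go price per GB ingested
RESERVED_100GB_PER_DAY = 219.52  # daily price for 100GB/day reserved capacity

def payg_monthly(gb_per_day: float, days: int = 30) -> float:
    billable_gb = max(gb_per_day * days - FREE_GB_PER_MONTH, 0)
    return billable_gb * PAYG_PRICE_PER_GB

def reserved_monthly(days: int = 30) -> float:
    return RESERVED_100GB_PER_DAY * days

print(round(payg_monthly(100), 2))   # pay as you go, 100GB/day
print(round(reserved_monthly(), 2))  # reserved capacity
```

&lt;p&gt;At 100GB per day, pay-as-you-go comes to roughly $8,266 per month versus about $6,586 for reserved capacity, so reserving makes sense once ingestion is steady.&lt;/p&gt;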

&lt;h2&gt;
  
  
  14. Google Stackdriver
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_usE4o-b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kzj1ypgj843wi2boifxb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_usE4o-b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kzj1ypgj843wi2boifxb.jpg" alt="Google Stackdriver"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Formerly known as &lt;a href="https://cloud.google.com/products/operations"&gt;Stackdriver&lt;/a&gt;, the Google Cloud operations suite is primarily focused on giving Google Cloud Platform users insight into infrastructure and application performance, but it also supports custom metrics and other cloud providers like AWS. The platform provides metrics, logs, and traces support, along with visibility into Google Cloud Platform audit logs, giving you the full picture of what is happening inside your GCP account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Metrics and dashboards allowing visibility into the performance of your services with alerting.&lt;/li&gt;
&lt;li&gt;Health check monitoring for web applications and applications that can be accessed from the internet with uptime monitoring.&lt;/li&gt;
&lt;li&gt;Support for logs and logs routing with error reporting and alerting.&lt;/li&gt;
&lt;li&gt;Per-URL statistics based on distributed tracing for App Engine.&lt;/li&gt;
&lt;li&gt;Audit logs for visibility into security-related events in your Google Cloud account.&lt;/li&gt;
&lt;li&gt;Production debugging and profiling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rich visualization support out of the box for Google Cloud platform users.&lt;/li&gt;
&lt;li&gt;Free tier available.&lt;/li&gt;
&lt;li&gt;Support for sending data to third-party providers if they provide an integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires a manual cloud monitoring agent install before you get visibility into the metrics, unlike Amazon CloudWatch where this is not needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Similar to Amazon CloudWatch and Microsoft Azure, the &lt;a href="https://cloud.google.com/stackdriver/pricing"&gt;pricing&lt;/a&gt; is based on the amount of data your services and applications generate and send to the platform. The free tier includes 150MB of metrics per billing account, 50GB of logs per project, 1 million API calls per project, 2.5 million ingested spans per project, and 25 million scanned spans per project. Everything above that falls into the paid tier.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most of the tools that we’ve discussed &lt;strong&gt;provide a form of alerting and reporting&lt;/strong&gt;. Those are usually limited to a number of methods, like e-mail or text messages to your mobile, and sometimes other common destinations. We usually don’t see scheduling, automation, and workflow control in the monitoring tools themselves. Because of that, &lt;strong&gt;observability solutions provide &lt;a href="https://sematext.com/integrations/"&gt;integrations&lt;/a&gt; with third-party incident alerting and reporting tools that fill the communication gap&lt;/strong&gt; and add features like event automation and triage, noise suppression, alert and notification centralization, and a long list of destinations where the information can be sent. Let’s see which tools provide such functionality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  15. PagerDuty
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TW2tKkBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qfvwxf0vgdzy3vm0xawr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TW2tKkBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qfvwxf0vgdzy3vm0xawr.jpg" alt="PagerDuty"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.pagerduty.com/platform/"&gt;PagerDuty&lt;/a&gt; is an all-in-one alert and notification management and centralization solution. It provides a place where you can centralize notifications coming from various sources, then organize, assign, and automate them and send them to virtually any destination you can think of. It not only provides a simple way of viewing and forwarding the data, but also automates incident response, schedules on-call rotations, and escalates incidents.&lt;/p&gt;
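&lt;p&gt;Programmatically, events usually enter the platform as JSON posted to an ingestion endpoint. The sketch below builds a trigger event in the shape of PagerDuty’s Events API v2 – the routing key is a placeholder you would copy from your service’s integration settings, and the alert details are invented:&lt;/p&gt;

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key: str, summary: str,
                        source: str, severity: str = "error") -> dict:
    """Build a minimal Events API v2 "trigger" payload."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }

def send_event(event: dict) -> None:
    """POST the event to the ingestion endpoint (raises on non-2xx)."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

event = build_trigger_event("YOUR_ROUTING_KEY",
                            "Disk usage above 90% on db-01", "db-01")
print(event["event_action"])  # trigger
# send_event(event)  # uncomment with a real routing key to actually page
```

&lt;p&gt;The same &lt;code&gt;routing_key&lt;/code&gt; combined with a &lt;code&gt;dedup_key&lt;/code&gt; can later be used with &lt;code&gt;acknowledge&lt;/code&gt; or &lt;code&gt;resolve&lt;/code&gt; actions to drive the incident lifecycle from code.&lt;/p&gt;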

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;On-call &lt;a href="https://www.pagerduty.com/platform/on-call-management/"&gt;management&lt;/a&gt; with flexible schedules, &lt;a href="https://www.pagerduty.com/platform/modern-incident-response/"&gt;incident&lt;/a&gt; escalation, and alerting.&lt;/li&gt;
&lt;li&gt;Context filtering for alert reduction.&lt;/li&gt;
&lt;li&gt;Automated responses with status updates.&lt;/li&gt;
&lt;li&gt;Event &lt;a href="https://www.pagerduty.com/platform/event-intelligence-and-automation"&gt;automation&lt;/a&gt; with triage, alert grouping, and noise suppression.&lt;/li&gt;
&lt;li&gt;Dashboards for a variety of alert related information like operations, service health, responders, and incidents with customization capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A large number of integrations available out of the box, which gives you the possibility to receive notifications on virtually any destination.&lt;/li&gt;
&lt;li&gt;Scheduling and notifications escalation.&lt;/li&gt;
&lt;li&gt;Service prioritization for controlling what is most important.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.pagerduty.com/pricing/"&gt;pricing&lt;/a&gt; is organized around the features and the number of users that will be using PagerDuty with no free tier available. The most basic plan starts from $10 for up to 6 users per month with an additional $15 per user after that and goes up to $47 per user per month depending on the features of the platform you want to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  16. VictorOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f0KbTZGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g61ay1wktgxz2t2kz1j8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f0KbTZGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g61ay1wktgxz2t2kz1j8.jpg" alt="VictorOps"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;VictorOps is a tool that can quickly become your central place for alerts and notifications. It makes it possible to take action on alerts and to schedule who is on call and should react to a given incident. With rules-based incident response, it is easy to automate responses to certain alerts, reducing the noise and fatigue generated by notifications coming from the various systems hooked up through the rich set of available integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://victorops.com/product/#on-call"&gt;On-call&lt;/a&gt; scheduling and management with incident escalation and hands-off.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://victorops.com/product/#alerts"&gt;Alerts&lt;/a&gt; and notification centralization.&lt;/li&gt;
&lt;li&gt;Incident &lt;a href="https://victorops.com/product/#automation"&gt;automation&lt;/a&gt; with alert rules, automatic response, and noise suppression.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://victorops.com/product/#reports"&gt;Reports&lt;/a&gt; and post-incident reviews.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A large number of integrations available out of the box for centralizing the alerts and notifications in a single place.&lt;/li&gt;
&lt;li&gt;Dedicated tools for teams.&lt;/li&gt;
&lt;li&gt;Scheduling and incident escalation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://victorops.com/pricing"&gt;pricing&lt;/a&gt; is based one features and the number of users. The basic plan starts from $8 per user per month when paid monthly and goes up to $33 per user per month for the Enterprise plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  17. OpsGenie
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4JeBrA6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hdh4sok3c3bembz4ie9f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4JeBrA6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hdh4sok3c3bembz4ie9f.jpg" alt="OpsGenie"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the creators of JIRA and Confluence comes OpsGenie, a central place for your alerts and notifications. It lets you manage alerts, plan on-call schedules, and react automatically based on user-defined rules. With a rich set of integrations, heartbeat monitoring, and alert deduplication, the platform can be used as a tool for centralizing all of your alerts and notifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;On-call &lt;a href="http://atlassian.com/software/opsgenie/on-call-management-and-escalations#on-call-schedule-management"&gt;scheduling&lt;/a&gt; and &lt;a href="http://atlassian.com/software/opsgenie/on-call-management-and-escalations#on-call-schedule-management"&gt;management&lt;/a&gt; with incident escalation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/software/opsgenie/it-alerting"&gt;Alerts&lt;/a&gt; and notification centralization with rule-based routing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/software/opsgenie/advanced-reporting-and-analytics"&gt;Advanced reporting&lt;/a&gt; with post-incident analysis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/software/opsgenie/communication-and-collaboration#chatops"&gt;ChatOps&lt;/a&gt; and &lt;a href="https://www.atlassian.com/software/opsgenie/communication-and-collaboration#stakeholder-communications"&gt;stakeholder&lt;/a&gt; communications with a web conference bridge.&lt;/li&gt;
&lt;li&gt;Incident command center.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rich set of integrations available out of the box for centralizing the notifications and alerts in a single place.&lt;/li&gt;
&lt;li&gt;Team-centric tools for multi-team integrations.&lt;/li&gt;
&lt;li&gt;Heartbeat monitoring and alert deduplication.&lt;/li&gt;
&lt;li&gt;Free tier available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.atlassian.com/software/opsgenie/pricing"&gt;pricing&lt;/a&gt; is based on features and the number of users. It starts with the limited free tier for up to 5 users with basic alerting and on-call management aimed for small teams. The first non-free tier starts with $11 per user per month when billed monthly and goes up to $35 per user per month with monthly billing. The price depends on the set of features of the platform that you will use. For instance, if you are OK with up to 25 international SMS notifications per user per month you will be fine with the basic, non-free plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  18. xMatters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gakz9HbG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuybdb0jaq4djw48dopn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gakz9HbG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuybdb0jaq4djw48dopn.jpg" alt="xMatters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.xmatters.com/product/"&gt;xMatters&lt;/a&gt; is a user-friendly central place for all your alerts and notifications. It allows managing and reacting on incidents from a single place with on-call schedules, incident escalation, and rule-based responses and resolutions. With the incident timeline, you can see how the reaction on the incident was performed and how well the team reacted to the situation giving your organization a tool helping you in improving alerts handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/on-call-management/"&gt;On-call&lt;/a&gt; scheduling and &lt;a href="https://www.xmatters.com/features/on-call-management/"&gt;management&lt;/a&gt; with incident escalation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/workflow-process-automation"&gt;Automatic, rule-based&lt;/a&gt; responses and resolutions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/notifications/"&gt;Stakeholder&lt;/a&gt; communication.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/analytics"&gt;Incident timeline&lt;/a&gt; with team performance calculations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Over 100 integrations are available at the time of writing.&lt;/li&gt;
&lt;li&gt;Easy to learn and user-friendly.&lt;/li&gt;
&lt;li&gt;Free tier available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.xmatters.com/pricing"&gt;pricing&lt;/a&gt;, similar to the rest of the competitors like OpsGenie and PagerDuty is organized around features and the number of users. The pricing plans start with a free tier that is available for up to 10 users without any kind of SMS and voice notifications. The first paid plan starts at $16 per user per month and goes up to $59 per user per month making it the most expensive of the tools. Of course, the price depends on the features of the platform you choose to use. For example, if you are OK with up to 50 SMS notifications per user per month you will be fine with the basic, non-free plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tools Will You Use?
&lt;/h2&gt;

&lt;p&gt;Cloud computing, in public, hybrid, and private cloud environments, has opened up a world of opportunities. The flexibility, on-demand scaling, ready-to-use services, and ease of use that come with it allow the next generation of platforms to be built on top of them. However, to leverage all these opportunities you need to deal with a set of challenges, and those require good tools, so you can understand the state of your environment along with all the key performance indicators it provides. The available cloud monitoring tools all help you gather observability data, but they take different approaches, provide different functionality, and come at different costs. With such a wide range of options available, make sure to try a few and choose the one that best fits your needs. Learn how to choose the best monitoring system for your use case from our &lt;a href="https://sematext.com/blog/monitoring-alerting/"&gt;Guide to monitoring and alerting&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>logs</category>
      <category>monitoring</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Linux Logging Tutorial: What Are Linux Logs, How to View, Search and Centralize Them</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Mon, 27 Jul 2020 12:17:37 +0000</pubDate>
      <link>https://dev.to/sematext/linux-logging-tutorial-what-are-linux-logs-how-to-view-search-and-centralize-them-2bi5</link>
      <guid>https://dev.to/sematext/linux-logging-tutorial-what-are-linux-logs-how-to-view-search-and-centralize-them-2bi5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR note&lt;/strong&gt;: if you want the &lt;code&gt;bzip2 -9&lt;/code&gt; version of this post, scroll down to the very last section for some quick pointers. If you want to learn a bit about Linux system logs, please continue, as we'll talk about all these and more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;What are Linux logs&lt;/strong&gt; and who generates them&lt;/li&gt;
&lt;li&gt;  Important &lt;strong&gt;types of Linux logs&lt;/strong&gt; and their typical location&lt;/li&gt;
&lt;li&gt;  How to &lt;strong&gt;read and search logs&lt;/strong&gt;, whether they're written by journald or syslog&lt;/li&gt;
&lt;li&gt;  How to &lt;strong&gt;centralize logs&lt;/strong&gt; of many servers in one location. Spoiler alert: the easiest way is to send all system logs to Sematext Cloud in &lt;strong&gt;three commands&lt;/strong&gt;, so you can build actionable dashboards:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Recap: What Are Linux Logs?
&lt;/h2&gt;

&lt;p&gt;Linux logs are pieces of data that Linux writes, related to what the server, kernel, services, and applications running on it are doing, each with an associated timestamp. They often come with other structured data, such as a hostname, making them a valuable &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analysis&lt;/a&gt; and troubleshooting tool for admins when they encounter performance issues. You can read more about logs and why you should monitor them in our &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;complete guide to log management&lt;/a&gt;. Here's an example of an SSH log from the &lt;code&gt;/var/log/auth.log&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 5 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice how the log contains a few fields, like the timestamp, the hostname, the process writing the log and its PID, before the message itself. In Linux, logs come from different sources, mainly:&lt;/p&gt;
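&lt;p&gt;You can pull those fields apart with standard text tools. A quick sketch (the field positions assume the traditional syslog layout shown above):&lt;/p&gt;

```shell
# The SSH log line from above, in the traditional syslog layout
line='May 5 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)'

# Fields 1-3 are the timestamp, field 4 the hostname, field 5 the process[PID]
timestamp=$(echo "$line" | awk '{print $1, $2, $3}')
host=$(echo "$line" | awk '{print $4}')
process=$(echo "$line" | awk '{print $5}' | tr -d ':')

echo "$timestamp"   # May 5 08:57:27
echo "$host"        # ubuntu-bionic
echo "$process"     # sshd[5544]
```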

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/journald-logging-tutorial/" rel="noopener noreferrer"&gt;Systemd journal&lt;/a&gt;. Most Linux distros have &lt;a href="https://systemd.io/" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; to manage services (like SSH above). Systemd catches the output of these services (i.e., logs like the one above) and writes them to the journal. The journal is written in a binary format, so you'll use &lt;a href="https://sematext.com/blog/journald-logging-tutorial#toc-journald-commands-via-journalctl-5" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; to explore it, like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    $ journalctl
    ...
    May 05 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog&lt;/a&gt;. When there's no systemd, processes like SSH can write to a UNIX socket (e.g., &lt;code&gt;/dev/log&lt;/code&gt;) in the &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-message-formats-2" rel="noopener noreferrer"&gt;syslog message format&lt;/a&gt;. A &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-daemons-0" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt; (e.g., &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt;) then picks the message, parses it and writes it to various destinations. By default, it writes to files in &lt;code&gt;/var/log&lt;/code&gt;, which is how we got the earlier message from /var/log/auth.log.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Linux kernel&lt;/strong&gt; writes its own logs to a ring buffer. Systemd or the syslog daemon can read logs from this buffer, then write to the journal or flat files (typically &lt;code&gt;/var/log/kern.log&lt;/code&gt;). You can also see kernel logs directly via &lt;code&gt;dmesg&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dmesg -T
...
[Tue May 5 08:41:31 2020] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/" rel="noopener noreferrer"&gt;Audit logs&lt;/a&gt;. These are a special case of kernel messages designed for auditing actions such as file access. You'd typically have a service to listen for such security logs, like auditd. By default, &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/#toc-audit-logs-in-linux-a-quick-tutorial-on-using-auditd-1" rel="noopener noreferrer"&gt;auditd&lt;/a&gt; writes audit messages to &lt;code&gt;/var/log/audit/audit.log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Application logs&lt;/strong&gt;. Non-system applications tend to write to /var/log as well. Here are some popular examples:

&lt;ul&gt;
&lt;li&gt;  Apache HTTPD logs are typically written to &lt;code&gt;/var/log/httpd&lt;/code&gt; or &lt;code&gt;/var/log/apache2&lt;/code&gt;. HTTP access logs would be in &lt;code&gt;/var/log/httpd/access.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  MySQL logs typically go to &lt;code&gt;/var/log/mysql.log&lt;/code&gt; or &lt;code&gt;/var/log/mysqld.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  Older Linux versions would record boot logs via &lt;a href="https://manpages.debian.org/buster/bootlogd/bootlogd.8.en.html" rel="noopener noreferrer"&gt;bootlogd&lt;/a&gt; to &lt;code&gt;/var/log/boot&lt;/code&gt; or &lt;code&gt;/var/log/boot.log&lt;/code&gt;. Systemd now takes care of this: you can view boot-related logs via &lt;code&gt;journalctl -b&lt;/code&gt;. Distros without systemd have a syslog daemon reading from the kernel ring buffer, which normally has all the boot messages, so you can find your boot/reboot logs in &lt;code&gt;/var/log/messages&lt;/code&gt; or &lt;code&gt;/var/log/syslog&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  Last but not least, you may have your own apps using a &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;logging library&lt;/a&gt; to write to a specific file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
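&lt;p&gt;To tie a few of these sources together, here are some common &lt;code&gt;journalctl&lt;/code&gt; queries (a sketch that skips itself on systems without systemd; the &lt;code&gt;ssh&lt;/code&gt; unit name is an assumption and may be &lt;code&gt;sshd&lt;/code&gt; on some distros):&lt;/p&gt;

```shell
# Common journalctl queries (falls back gracefully without systemd)
if command -v journalctl | grep -q .; then
  journalctl -b --no-pager | tail -n 5      # messages from the current boot
  journalctl -k --no-pager | tail -n 5      # kernel messages (same source as dmesg)
  journalctl -u ssh --no-pager | tail -n 5  # messages from a single unit
else
  echo "no systemd journal on this system"
fi
```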

&lt;p&gt;These sources can interact with each other: journald can forward all its messages to syslog. Applications can write to syslog or the journal. It's Linux, where everything is configurable. But for now, we'll focus on the defaults: where can you &lt;strong&gt;typically&lt;/strong&gt; find different types of logs in most modern distributions?&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Files Location: Where Are They Stored?
&lt;/h2&gt;

&lt;p&gt;Typically, you'll find Linux server logs in the &lt;code&gt;/var/log&lt;/code&gt; directory. This is where syslog daemons are normally configured to write. It's also where most applications (e.g., Apache HTTPD) write by default. For Systemd journal, the default location is &lt;code&gt;/var/log/journal&lt;/code&gt;, but you can't view the files directly because they're binary. So how &lt;strong&gt;do&lt;/strong&gt; you view them?&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Check Linux Logs
&lt;/h2&gt;

&lt;p&gt;If your Linux distro uses Systemd (and most modern distros do), then all your system logs are in the journal. You can view them with &lt;code&gt;journalctl&lt;/code&gt;, and you can find the most important &lt;a href="https://sematext.com/blog/journald-logging-tutorial/#toc-journald-commands-via-journalctl-5" rel="noopener noreferrer"&gt;journalctl commands here&lt;/a&gt;. If your distribution writes to local files via syslog, you can view them with standard text processing tools, such as &lt;a href="https://linux.die.net/man/1/cat" rel="noopener noreferrer"&gt;cat&lt;/a&gt;, &lt;a href="https://linux.die.net/man/1/less" rel="noopener noreferrer"&gt;less&lt;/a&gt; or &lt;a href="https://linux.die.net/man/1/grep" rel="noopener noreferrer"&gt;grep&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# grep "error" /var/log/syslog | tail
Mar 31 09:48:02 ubuntu-bionic rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2002.0 try https://www.rsyslog.com/e/2078 ]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using &lt;a href="https://linux.die.net/man/8/auditd" rel="noopener noreferrer"&gt;auditd&lt;/a&gt; to manage audit logs, you can check them in &lt;code&gt;/var/log/audit/audit.log&lt;/code&gt; by default, but you can also search them with &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/#toc-searching-and-analyzing-audit-logs-with-ausearch-and-aureport-5" rel="noopener noreferrer"&gt;ausearch&lt;/a&gt;. That said, you're better off shipping these security logs to a central location, especially if you have multiple servers. For this task, a tool like &lt;a href="https://www.elastic.co/beats/auditbeat" rel="noopener noreferrer"&gt;Auditbeat&lt;/a&gt; might work better than auditd. We wrote a separate &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/" rel="noopener noreferrer"&gt;tutorial on centralizing audit logs with Auditbeat&lt;/a&gt;, but in the next section we'll focus on centralizing Linux system logs in general.&lt;/p&gt;

&lt;h2&gt;
  
  
  Centralizing Linux Logs
&lt;/h2&gt;

&lt;p&gt;System logs can be in two places: systemd's journal or plain text files written by a syslog daemon. Some distributions (e.g., Ubuntu) have both: journald is set up to forward to syslog. This is done by setting &lt;code&gt;ForwardToSyslog=Yes&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;.&lt;/p&gt;
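&lt;p&gt;For reference, that setting lives in the &lt;code&gt;[Journal]&lt;/code&gt; section. A minimal fragment of &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; might look like this:&lt;/p&gt;

```ini
[Journal]
# Forward every journal entry to the local syslog socket as well
ForwardToSyslog=Yes
```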

&lt;h3&gt;
  
  
  Centralizing Logs via Journald
&lt;/h3&gt;

&lt;p&gt;Our recommendation is to use &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;journal-upload&lt;/a&gt; to &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;centralize logs&lt;/a&gt; if the distribution has systemd. You can check this by running &lt;code&gt;journalctl&lt;/code&gt;: if the command isn't found, you don't have the journal. As promised earlier, you can &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;centralize your system logs to Sematext Cloud with three commands&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Install journal-upload&lt;/strong&gt;. On Ubuntu, this works via &lt;code&gt;sudo apt-get install systemd-journal-remote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Configure journal-upload&lt;/strong&gt;. In &lt;code&gt;/etc/systemd/journal-upload.conf&lt;/code&gt;, set &lt;code&gt;URL=&lt;/code&gt;&lt;code&gt;http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Start journal-upload&lt;/strong&gt; now and on every boot: &lt;code&gt;systemctl enable systemd-journal-upload &amp;amp;&amp;amp; systemctl start systemd-journal-upload&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Alternatively, you can use &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;Logagent's journal-upload input&lt;/a&gt; to gather journal entries from one or more machines before shipping them to a central location. That central location can be &lt;a href="https://sematext.com/logsene" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, a local &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; or something else:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about journald and journalctl, as well as the options you have around centralizing the journal, have a look at our &lt;a href="https://sematext.com/blog/journald-logging-tutorial" rel="noopener noreferrer"&gt;complete guide to journald&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralizing Logs via syslog
&lt;/h3&gt;

&lt;p&gt;There are a few scenarios in which centralizing Linux logs with syslog might make sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Your Linux distribution doesn't have journald. This means system logs go directly to your &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-daemons-0" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  You want to &lt;strong&gt;use your syslog daemon to collect and parse application logs&lt;/strong&gt; as well. An example is described in our &lt;a href="https://sematext.com/blog/recipe-apache-logs-rsyslog-parsing-elasticsearch/" rel="noopener noreferrer"&gt;tutorial for Apache logs with rsyslog and Elasticsearch&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  You want to &lt;strong&gt;forward journal entries to syslog&lt;/strong&gt; (i.e., by setting &lt;code&gt;ForwardToSyslog=Yes&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;), so you can use a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-protocols-6" rel="noopener noreferrer"&gt;syslog protocol&lt;/a&gt; as a transport. However, this approach will lose some of journald's structured data: journald only forwards &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-message-formats-2" rel="noopener noreferrer"&gt;syslog-specific fields&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Similar to the above, except that you'd &lt;strong&gt;configure the syslog daemon to read from the journal&lt;/strong&gt; (like &lt;code&gt;journalctl&lt;/code&gt; does). This approach doesn't lose structured data, but is more error prone (e.g., in case of journal corruption) and adds more overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all the situations listed above, data will go through your syslog daemon. From there, you can send it to any of the supported destinations. Most Linux distributions come with &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt; installed. To forward data to another syslog server via TCP (that's what the &lt;code&gt;@@&lt;/code&gt; prefix means; a single &lt;code&gt;@&lt;/code&gt; would use UDP), you can add this line to your &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.* @@logsene-syslog-receiver.sematext.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This particular line will forward data to &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;Sematext Cloud's syslog endpoint&lt;/a&gt;, but you can replace &lt;code&gt;logsene-syslog-receiver.sematext.com&lt;/code&gt; with the host name of your own syslog server. Some syslog daemons can output data to Elasticsearch via HTTP/HTTPS. &lt;a href="https://rsyslog.readthedocs.io/en/latest/configuration/modules/omelasticsearch.html" rel="noopener noreferrer"&gt;rsyslog is one of them&lt;/a&gt; and &lt;a href="https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.21/administration-guide/32#TOPIC-1197819" rel="noopener noreferrer"&gt;so is syslog-ng&lt;/a&gt;. For example, if you use rsyslog on Ubuntu, you'll install the Elasticsearch output module first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install rsyslog-elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, in the configuration file, you need two elements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A template that formats your syslog messages as JSON&lt;/strong&gt;, for Elasticsearch to consume
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;template(name="LogseneFormat" type="list" option.json="on") {
 constant(value="{")
 constant(value="\\"@timestamp\\":\\"")
 property(name="timereported" dateFormat="rfc3339")
 constant(value="\\",\\"message\\":\\"")
 property(name="msg")
 constant(value="\\",\\"host\\":\\"")
 property(name="hostname")
 constant(value="\\",\\"severity\\":\\"")
 property(name="syslogseverity-text")
 constant(value="\\",\\"facility\\":\\"")
 property(name="syslogfacility-text")
 constant(value="\\",\\"syslog-tag\\":\\"")
 property(name="syslogtag")
 constant(value="\\",\\"source\\":\\"")
 property(name="programname")
 constant(value="\\"}")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
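&lt;p&gt;Rendered with this template, the earlier SSH message would come out as a single JSON document per log line, roughly like the hand-built sample below (the severity and facility values here are illustrative, not captured rsyslog output):&lt;/p&gt;

```shell
# A hand-built sample of what the LogseneFormat template emits
msg='{"@timestamp":"2020-05-05T08:57:27+00:00","message":"session opened for user vagrant","host":"ubuntu-bionic","severity":"info","facility":"auth","syslog-tag":"sshd[5544]:","source":"sshd"}'

# Sanity-check that the line parses as JSON, as Elasticsearch will require
echo "$msg" | python3 -m json.tool
```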



&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;An action that forwards data to Elasticsearch&lt;/strong&gt;, using the template specified above
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module(load="omelasticsearch")
action(type="omelasticsearch"
 template="LogseneFormat" # the template that you defined earlier
 searchIndex="LOGSENE_APP_TOKEN_GOES_HERE"
 server="logsene-receiver.sematext.com"
 serverport="443"
 usehttps="on"
 bulkmode="on"
 queue.dequeuebatchsize="100" # how many messages to send at once
 action.resumeretrycount="-1") # buffer messages if connection fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above example shows how to send messages to &lt;a href="https://sematext.com/docs/logs/index-events-via-elasticsearch-api/" rel="noopener noreferrer"&gt;Sematext Cloud's Elasticsearch API&lt;/a&gt;, but you can adjust the action element to point it to your local Elasticsearch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;searchIndex&lt;/code&gt; would be your own &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/indices-rollover-index.html" rel="noopener noreferrer"&gt;rolling index alias&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;server&lt;/code&gt; would be the hostname of an Elasticsearch node&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;serverport&lt;/code&gt; can be 9200 or a custom port Elasticsearch listens to&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;usehttps="off"&lt;/code&gt; would send data over plain HTTP&lt;/li&gt;
&lt;/ul&gt;
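&lt;p&gt;Putting those adjustments together, a local-Elasticsearch version of the action might look like the sketch below (the index alias and host name are placeholders):&lt;/p&gt;

```conf
module(load="omelasticsearch")
action(type="omelasticsearch"
 template="LogseneFormat"           # the JSON template defined earlier
 searchIndex="syslog-write-alias"   # your own rolling index alias
 server="localhost"                 # the hostname of an Elasticsearch node
 serverport="9200"
 usehttps="off"                     # plain HTTP
 bulkmode="on"
 queue.dequeuebatchsize="100"
 action.resumeretrycount="-1")
```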

&lt;p&gt;Whether you use a syslog protocol, the Elasticsearch API, or something else, it's better to &lt;strong&gt;forward syslog directly&lt;/strong&gt; from the syslog daemon than to &lt;strong&gt;tail individual files from&lt;/strong&gt; /var/log using a &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;different log shipper&lt;/a&gt;. Tailing files adds overhead and misses some of the metadata, such as facility or severity. That's not to say that files in /var/log are useless. You'll need them in two scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Logs of applications that write directly to /var/log, for example HTTP logs, FTP logs, MySQL logs, and so on. You can tail such files with a log shipper. We have tutorials on &lt;a href="https://sematext.com/blog/recipe-apache-logs-rsyslog-parsing-elasticsearch/" rel="noopener noreferrer"&gt;parsing Apache logs with rsyslog&lt;/a&gt; and &lt;a href="https://sematext.com/blog/getting-started-with-logstash/" rel="noopener noreferrer"&gt;with Logstash&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Processing system logs with UNIX text tools like grep&lt;/strong&gt;. Here, different log files contain different kinds of data. We'll look at the typical configuration in the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Are the Most Important Log Files You Should Monitor?
&lt;/h2&gt;

&lt;p&gt;By default, some distributions write system logs to syslog (either directly or from the journal). The syslog daemon writes these logs to files under &lt;code&gt;/var/log&lt;/code&gt;. Typically that syslog daemon is rsyslog, though syslog-ng works in a similar fashion. In this section, we'll look at the important log files and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  what kind of information you'll find in them&lt;/li&gt;
&lt;li&gt;  how rsyslog is configured to write there (in case you want to change the configuration)&lt;/li&gt;
&lt;li&gt;  how to view the same information with &lt;code&gt;journalctl&lt;/code&gt;, in case it doesn't forward to syslog&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  /var/log/syslog or /var/log/messages
&lt;/h3&gt;

&lt;p&gt;This is the “catch-all” of syslog. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# logger "this is a test"
# tail -1 /var/log/syslog
May 7 15:33:11 ubuntu-bionic test-user: this is a test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typically, you'll find all messages here (error logs, informational messages, and every other &lt;a href="https://en.wikipedia.org/wiki/Syslog#Severity_level" rel="noopener noreferrer"&gt;severity&lt;/a&gt;), as this line from &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt; suggests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.* /var/log/syslog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only exception is the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/actions.html#discard-stop" rel="noopener noreferrer"&gt;stop action&lt;/a&gt;. For example, you may find something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:msg,contains,"[UFW " /var/log/ufw.log
&amp;amp; stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In plain English, this block says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/properties.html#message-properties" rel="noopener noreferrer"&gt;msg property&lt;/a&gt; of this message contains "[UFW "&lt;/li&gt;
&lt;li&gt;  Then write to /var/log/ufw.log (the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html" rel="noopener noreferrer"&gt;file output module&lt;/a&gt; is implied)&lt;/li&gt;
&lt;li&gt;  For the same messages (that's what &amp;amp; means), don't process them further (stop)&lt;/li&gt;
&lt;/ul&gt;
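
&lt;p&gt;The snippet above uses rsyslog's legacy syntax. In the newer RainerScript syntax, the same logic would look roughly like this (a sketch, not a tested drop-in replacement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if $msg contains "[UFW " then {
  action(type="omfile" file="/var/log/ufw.log")
  stop
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;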

&lt;p&gt;So if the &lt;code&gt;/var/log/syslog&lt;/code&gt; action comes later, it won't write UFW messages there. If there's nothing in &lt;code&gt;/var/log/syslog&lt;/code&gt; or &lt;code&gt;/var/log/messages&lt;/code&gt;, you probably have journald set up not to forward to syslog. The same data (and more) can be viewed via &lt;code&gt;journalctl&lt;/code&gt; with no parameters. By default, &lt;code&gt;journalctl&lt;/code&gt; pages data through &lt;code&gt;less&lt;/code&gt;, but if you want to filter through &lt;code&gt;grep&lt;/code&gt; you'll need to disable paging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl --no-pager | grep "this is a test"
May 07 15:33:11 ubuntu-bionic test-user[7526]: this is a test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
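
&lt;p&gt;Conversely, if you want journald to populate these files, you can enable forwarding in &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; and then restart (or, on newer versions, reload) systemd-journald:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Journal]
ForwardToSyslog=yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;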



&lt;h3&gt;
  
  
  /var/log/kern.log or /var/log/dmesg
&lt;/h3&gt;

&lt;p&gt;This is where kernel messages go by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 17 16:47:28 ubuntu-bionic kernel: [ 0.004000] console [tty1] enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It really comes down to filtering syslog messages by the &lt;code&gt;kern&lt;/code&gt; facility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kern.* /var/log/kern.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't have syslog (or the file is missing) and you have journald, you can show kernel messages in &lt;code&gt;journalctl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl -k
...
Apr 17 16:47:28 ubuntu-bionic kernel: console [tty1] enabled
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/auth.log or /var/log/secure
&lt;/h3&gt;

&lt;p&gt;This is where you find authentication messages, generated by services like sshd:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 7 15:03:09 ubuntu-bionic sshd[1202]: pam_unix(sshd:session): session closed for user vagrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is another filter by facility, this time by two values (&lt;code&gt;auth&lt;/code&gt; and &lt;code&gt;authpriv&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auth,authpriv.* /var/log/auth.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do such filters in &lt;code&gt;journalctl&lt;/code&gt; as well, except that you have to provide &lt;a href="https://en.wikipedia.org/wiki/Syslog#Facility" rel="noopener noreferrer"&gt;numeric facility levels&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=4 SYSLOG_FACILITY=10
...
May 7 15:03:09 ubuntu-bionic sshd[1202]: pam_unix(sshd:session): session closed for user vagrant
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/cron.log
&lt;/h3&gt;

&lt;p&gt;This is where your &lt;a href="http://man7.org/linux/man-pages/man8/cron.8.html" rel="noopener noreferrer"&gt;cron&lt;/a&gt; messages go (i.e., jobs that run regularly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 06 08:19:01 localhost.localdomain anacron[1142]: Job `cron.daily' started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yet another facility filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cron.* /var/log/cron
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With journalctl, you'd do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/mail.log or /var/log/maillog
&lt;/h3&gt;

&lt;p&gt;Email daemons such as Postfix typically log to syslog in the &lt;code&gt;mail&lt;/code&gt; facility, just like &lt;code&gt;cron&lt;/code&gt; logs to the &lt;code&gt;cron&lt;/code&gt; facility. Then, rsyslog puts these logs in a different file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mail.* /var/log/mail.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using journald, you can still view mail logs with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because journald exposes the &lt;a href="https://linux.die.net/man/3/syslog" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt;, everything that normally goes to syslog ends up in the journal.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR Takeaways
&lt;/h2&gt;

&lt;p&gt;Let's summarize the actionables here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The location and format of your Linux system logs &lt;strong&gt;depends on how your distro is configured&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Most distros have systemd&lt;/strong&gt;. It means all your system &lt;strong&gt;logs live in the journal&lt;/strong&gt;. To view and search it, &lt;strong&gt;use journalctl&lt;/strong&gt;. Use the &lt;a href="https://sematext.com/blog/journald-logging-tutorial" rel="noopener noreferrer"&gt;complete guide to journald&lt;/a&gt; for reference.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Some distros get system logs to syslog&lt;/strong&gt;. Either directly or through the journal. In this case you likely have logs written to various files in &lt;code&gt;/var/log&lt;/code&gt;. Have a look at the section above for details on each important file.&lt;/li&gt;
&lt;li&gt;  Either way, if you manage multiple servers, you'll want to centralize system logs with &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management software&lt;/a&gt; such as &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. Sematext makes this very easy, as it has both &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;journald integration&lt;/a&gt; and &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;syslog integration&lt;/a&gt;. Though you can use your own &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; if you prefer &lt;a href="https://sematext.com/elastic-stack-alternative/" rel="noopener noreferrer"&gt;to build rather than buy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  If you need help with your own ELK stack, please reach out, as we provide &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;ELK stack consulting&lt;/a&gt;, &lt;a href="https://sematext.com/support/elasticsearch-production-support/" rel="noopener noreferrer"&gt;Elasticsearch production support&lt;/a&gt; and &lt;a href="https://sematext.com/training/elasticsearch/" rel="noopener noreferrer"&gt;Elasticsearch and ELK stack training classes&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>logging</category>
    </item>
    <item>
      <title>Tutorial: Logging with journald</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Tue, 09 Jun 2020 07:37:39 +0000</pubDate>
      <link>https://dev.to/sematext/tutorial-logging-with-journald-50l8</link>
      <guid>https://dev.to/sematext/tutorial-logging-with-journald-50l8</guid>
      <description>&lt;p&gt;If you're using Linux, I'm sure you bumped into &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html" rel="noopener noreferrer"&gt;journald&lt;/a&gt;: it's what most distros use by default for system logging. Most applications running as a service will also log to the journal. So how do you make use of these logs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  find the error or debug message that you're looking for?&lt;/li&gt;
&lt;li&gt;  make sure logs don't fill your disk?&lt;/li&gt;
&lt;li&gt;  centralize journals so you don't have to ssh to each box?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we'll answer all the above and more. We will dive into the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;what is journald&lt;/strong&gt;, how it came to be and what are its benefits&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;main configuration options&lt;/strong&gt;, like when to remove old logs so you don't run out of disk&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald and containers&lt;/strong&gt;: can/should containers log to the journal?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald vs syslog&lt;/strong&gt;: advantages and disadvantages of both, how they integrate&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ways to centralize journals&lt;/strong&gt;. Advantages and disadvantages of each method, and which is &lt;a href="https://sematext.com/product-updates/%23/2020/we-have-a-new-logs-integration-for-journald" rel="noopener noreferrer"&gt;the best&lt;/a&gt;. Spoiler alert: you can &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;configure journald to send logs directly to Sematext Cloud&lt;/a&gt;; or you can &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;use the open-source Logagent as a journald aggregator&lt;/a&gt;. Either way, you'll have one place to search and analyze your journal events:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are lots of other options to centralize journal entries, and lots of tools to help. We'll explore them in detail, but before that, let's zoom in to journald itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is journald?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;journald&lt;/strong&gt; is the part of &lt;a href="https://systemd.io/" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; that deals with logging. &lt;strong&gt;systemd&lt;/strong&gt;, at its core, is in charge of managing services: it starts them up and keeps them alive.&lt;/p&gt;

&lt;p&gt;All services and systemd itself need to log: “ssh started” or “user root logged in”, they might say. That's where journald comes in: to capture these logs, record them, make them easy to find, and remove them when they pass a certain age.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use journald?
&lt;/h2&gt;

&lt;p&gt;In short, because syslog sucks :) Jokes aside, the &lt;a href="https://docs.google.com/document/pub?id%3D1IC9yOXj7j6cdLLxWEBAGRL6wl97tFxgjLUEHIX3MSTs%26pli%3D1" rel="noopener noreferrer"&gt;paper announcing journald&lt;/a&gt; explained that systemd needed functionality that was hard to get through &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;existing syslog implementations&lt;/a&gt;. Examples include structured logging, indexing logs for fast search, access control and signed messages.&lt;/p&gt;

&lt;p&gt;As you might expect, &lt;a href="https://rainer.gerhards.net/2013/05/rsyslog-vs-systemd-journal.html" rel="noopener noreferrer"&gt;not everyone agrees with these statements&lt;/a&gt; or the general approach systemd took with journald. But by now, systemd is adopted by most Linux distributions, and it includes journald as well. journald happily coexists with syslog daemons, as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  some syslog daemons can both read from and write to the journal&lt;/li&gt;
&lt;li&gt;  journald exposes the &lt;a href="https://linux.die.net/man/3/syslog" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  journald benefits
&lt;/h3&gt;

&lt;p&gt;Think of journald as your mini-command-line-&lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; that lives on virtually every Linux box. It provides lots of features, most importantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Indexing&lt;/strong&gt;. journald uses a binary storage for logs, where data is indexed. Lookups are much faster than with plain text files&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured logging&lt;/strong&gt;. Though &lt;a href="https://sematext.com/blog/structured-logging-with-rsyslog-and-elasticsearch/" rel="noopener noreferrer"&gt;it's possible with syslog, too&lt;/a&gt;, it's enforced here. Combined with indexing, it means you can easily filter specific logs (e.g. with a set priority, in a set timeframe)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Access control&lt;/strong&gt;. By default, storage files are split by user, with different permissions to each. As a regular user, you won't see everything root sees, but you'll see your own logs&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Automatic log rotation&lt;/strong&gt;. You can configure journald (see below) to keep logs only up to a space limit, or based on free space&lt;/li&gt;
&lt;/ul&gt;
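
&lt;p&gt;You can see the structured, indexed storage in action by writing a tagged entry and pulling it back out by tag (the tag &lt;code&gt;demo&lt;/code&gt; here is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# systemd-cat -t demo -p warning echo "disk almost full"
# journalctl -t demo -n 1 -o json-pretty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;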

&lt;h2&gt;
  
  
  Configuring journald
&lt;/h2&gt;

&lt;p&gt;To tweak how journald behaves, you'll edit &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; and then reload the journal service like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl reload systemd-journald.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though &lt;a href="https://github.com/systemd/systemd/issues/2236" rel="noopener noreferrer"&gt;earlier versions of journald need to be restarted&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl restart systemd-journald.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The most important settings revolve around storage: whether the journal should be kept in memory or on disk, when to remove old logs and how much to rate limit. We'll focus on some of those next, but you can see all the configuration options in &lt;a href="https://www.freedesktop.org/software/systemd/man/journald.conf.html%23" rel="noopener noreferrer"&gt;journald.conf's man page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald storage
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;Storage&lt;/code&gt;&lt;/strong&gt; option controls whether the journal is stored in memory (under &lt;code&gt;/run/log/journal&lt;/code&gt;) or on disk (under &lt;code&gt;/var/log/journal&lt;/code&gt;). Setting &lt;strong&gt;&lt;code&gt;Storage=volatile&lt;/code&gt;&lt;/strong&gt; will store the journal in memory, while &lt;strong&gt;&lt;code&gt;Storage=persistent&lt;/code&gt;&lt;/strong&gt; will store it on disk. Most distributions have it set to &lt;code&gt;auto&lt;/code&gt;, which means the journal goes to disk if &lt;code&gt;/var/log/journal&lt;/code&gt; exists and to memory otherwise.&lt;/p&gt;

&lt;p&gt;Once you've decided where to store the journal, you may want to set some limits. For example, &lt;strong&gt;&lt;code&gt;SystemMaxUse=4G&lt;/code&gt;&lt;/strong&gt; will limit &lt;code&gt;/var/log/journal&lt;/code&gt; to about 4GB. Similarly, &lt;strong&gt;&lt;code&gt;SystemKeepFree=10G&lt;/code&gt;&lt;/strong&gt; will try to keep 10GB of disk space free. If you choose to keep the journal in memory, the equivalent options are &lt;strong&gt;&lt;code&gt;RuntimeMaxUse&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;RuntimeKeepFree&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;
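
&lt;p&gt;Putting these options together, a persistent setup capped at roughly 4GB might look like this in &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; (the values are just examples; make sure &lt;code&gt;/var/log/journal&lt;/code&gt; exists for persistent storage):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Journal]
Storage=persistent
SystemMaxUse=4G
SystemKeepFree=10G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;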

&lt;p&gt;You can check the current disk usage of the journal with &lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; via &lt;strong&gt;&lt;code&gt;journalctl --disk-usage&lt;/code&gt;&lt;/strong&gt;. If you need to, you can clean it up on demand via &lt;strong&gt;&lt;code&gt;journalctl --vacuum-size=4GB&lt;/code&gt;&lt;/strong&gt; (i.e. to reduce it to 4GB).&lt;/p&gt;

&lt;p&gt;Compression is enabled by default on log entries larger than 512 bytes. If you want to change this threshold to, say, 1KB, you'd add &lt;strong&gt;&lt;code&gt;Compress=1K&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also by default, journald will drop all log messages from a service if it exceeds certain limits. These limits can be configured via &lt;strong&gt;&lt;code&gt;RateLimitBurst&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;RateLimitIntervalSec&lt;/code&gt;&lt;/strong&gt;, which default to &lt;strong&gt;&lt;code&gt;10000&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;30s&lt;/code&gt;&lt;/strong&gt; respectively. The actual values depend on available free space. For example, if you have more than 64GB of free disk space, the multiplier will be 6, meaning journald will drop logs from a service after 60K messages sent in 30 seconds.&lt;/p&gt;

&lt;p&gt;The rate limit defaults are sensible, unless you have a specific service that's generating lots of logs (e.g. a web server). In that case, it might be better to set &lt;strong&gt;&lt;code&gt;LogRateLimitBurst&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;LogRateLimitIntervalSec&lt;/code&gt;&lt;/strong&gt; in that application's &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html" rel="noopener noreferrer"&gt;service definition&lt;/a&gt;.&lt;/p&gt;
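
&lt;p&gt;For example, a per-service drop-in could raise the limits on recent systemd versions (the unit name &lt;code&gt;nginx&lt;/code&gt; is just an illustration; run &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and restart the service afterwards):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/systemd/system/nginx.service.d/override.conf (hypothetical path)
[Service]
LogRateLimitBurst=50000
LogRateLimitIntervalSec=30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;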

&lt;h2&gt;
  
  
  journald commands via journalctl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; is your main tool for interacting with the journal. If you just run it, you'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  all entries, from oldest to newest&lt;/li&gt;
&lt;li&gt;  paged by &lt;a href="https://linux.die.net/man/1/less" rel="noopener noreferrer"&gt;less&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  lines go past the edge of your screen if they have to (use left and right arrow keys to navigate)&lt;/li&gt;
&lt;li&gt;  format is similar to the syslog output, as it is configured in most Linux distributions: &lt;strong&gt;syslog timestamp + hostname + program and its PID + message&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example snippet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 09 10:22:49 localhost.localdomain su[866]: pam_unix(su-l:session): session opened for user solr by (uid=0)&amp;lt;
Apr 09 10:22:49 localhost.localdomain systemd[1]: Started Session c1 of user solr.&amp;lt;
Apr 09 10:22:49 localhost.localdomain systemd[1]: Created slice User Slice of solr.&amp;lt;
Apr 09 10:22:49 localhost.localdomain su[866]: (to solr) root on none
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is rarely what you want. More common scenarios are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;last N lines&lt;/strong&gt; (equivalent of tail -n 20 - if N=20): &lt;code&gt;journalctl -n 20&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;follow&lt;/strong&gt; (tail -f equivalent): &lt;code&gt;journalctl -f&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  page &lt;strong&gt;from newest to oldest&lt;/strong&gt;: &lt;code&gt;journalctl --reverse&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;skip paging and just grep&lt;/strong&gt; for something (e.g. “solr”): &lt;code&gt;journalctl --no-pager | grep solr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you often find yourself using &lt;code&gt;--no-pager&lt;/code&gt;, you can change the default pager through the &lt;code&gt;SYSTEMD_PAGER&lt;/code&gt; variable. &lt;code&gt;export SYSTEMD_PAGER=cat&lt;/code&gt; &lt;strong&gt;will disable paging&lt;/strong&gt;. That said, you might want to look into journalctl's own options for displaying and filtering - described below - before using text processing tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  journalctl display settings
&lt;/h3&gt;

&lt;p&gt;The main option here is &lt;code&gt;--output&lt;/code&gt;, which can &lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;take many values&lt;/a&gt;. As an &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;ELK consultant&lt;/a&gt;, I want my timestamps &lt;a href="https://en.wikipedia.org/wiki/ISO_8601" rel="noopener noreferrer"&gt;ISO 8601&lt;/a&gt;, and &lt;strong&gt;&lt;code&gt;--output=short-iso&lt;/code&gt;&lt;/strong&gt; will do just that. Now this is more like it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2020-04-09T10:23:01+0000 localhost.localdomain solr[860]: Started Solr server on port 8983 (pid=999). Happy searching!
2020-04-09T10:23:01+0000 localhost.localdomain su[866]: pam_unix(su-l:session): session closed for user solr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;journald keeps more information than what the &lt;strong&gt;short/short-iso&lt;/strong&gt; output shows. With &lt;strong&gt;&lt;code&gt;--output=json-pretty&lt;/code&gt;&lt;/strong&gt; (or just &lt;strong&gt;json&lt;/strong&gt; if you want it compact), a single event can look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
 "__CURSOR" : "s=83694dffb084461ea30a168e6cef1e6c;i=103f;b=f0bbba1703cb43229559a8fcb4cb08b9;m=c2c9508c;t=5a2d9c22f07ed;x=c5fe854a514cef39",
 "__REALTIME_TIMESTAMP" : "1586431033018349",
 "__MONOTONIC_TIMESTAMP" : "3267973260",
 "_BOOT_ID" : "f0bbba1703cb43229559a8fcb4cb08b9",
 "PRIORITY" : "6",
 "_UID" : "0",
 "_GID" : "0",
 "_MACHINE_ID" : "13e3a06d01d54447a683822d7e0b4dc9",
 "_HOSTNAME" : "localhost.localdomain",
 "SYSLOG_FACILITY" : "3",
 "SYSLOG_IDENTIFIER" : "systemd",
 "_TRANSPORT" : "journal",
 "_PID" : "1",
 "_COMM" : "systemd",
 "_EXE" : "/usr/lib/systemd/systemd",
 "_CAP_EFFECTIVE" : "1fffffffff",
 "_SYSTEMD_CGROUP" : "/",
 "CODE_FILE" : "src/core/job.c",
 "CODE_FUNCTION" : "job_log_status_message",
 "RESULT" : "done",
 "MESSAGE_ID" : "9d1aaa27d60140bd96365438aad20286",
 "_SELINUX_CONTEXT" : "system_u:system_r:init_t:s0",
 "UNIT" : "user-0.slice",
 "MESSAGE" : "Removed slice User Slice of root.",
 "CODE_LINE" : "781",
 "_CMDLINE" : "/usr/lib/systemd/systemd --switched-root --system --deserialize 22",
 "_SOURCE_REALTIME_TIMESTAMP" : "1586431033018103"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is where you can use structured logging to filter events. Next up, we'll look closer at the most important options for filtering.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald log filtering
&lt;/h3&gt;

&lt;p&gt;You can filter by any field (see the JSON output above) by specifying &lt;strong&gt;&lt;em&gt;key=value arguments&lt;/em&gt;&lt;/strong&gt;, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl _SYSTEMD_UNIT=sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are shortcuts, for example the &lt;strong&gt;&lt;code&gt;_SYSTEMD_UNIT&lt;/code&gt;&lt;/strong&gt; filter above can be expressed as &lt;strong&gt;&lt;code&gt;-u&lt;/code&gt;&lt;/strong&gt;. The above command is the equivalent of:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -u sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Other useful shortcuts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;severity&lt;/strong&gt; (here called &lt;strong&gt;priority&lt;/strong&gt;). &lt;strong&gt;&lt;code&gt;journalctl -p warning&lt;/code&gt;&lt;/strong&gt; will show logs with at least a severity of &lt;strong&gt;&lt;code&gt;warning&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  show only kernel messages: &lt;strong&gt;&lt;code&gt;journalctl --dmesg&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also filter by time, of course. Here, you have multiple options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;--since/--until&lt;/code&gt;&lt;/strong&gt; as a &lt;strong&gt;full timestamp&lt;/strong&gt;. For example: &lt;strong&gt;&lt;code&gt;journalctl --since="2020-04-09 11:30:00"&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;date only&lt;/strong&gt; (00:00:00 is assumed as the time): &lt;strong&gt;&lt;code&gt;journalctl --since=2020-04-09&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;abbreviations&lt;/strong&gt;: &lt;code&gt;journalctl --since=yesterday --until=now&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
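
&lt;p&gt;These filters combine, so you can narrow things down quickly. For instance, to see only errors from sshd since yesterday (the exact unit name may differ between distros):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl -u sshd.service -p err --since=yesterday
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;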

&lt;p&gt;In general, you have to specify the exact value you're looking for. The exception is &lt;code&gt;_SYSTEMD_UNIT&lt;/code&gt;, where patterns also work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -u sshd*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Newer versions of systemd also allow a &lt;strong&gt;&lt;code&gt;--grep&lt;/code&gt;&lt;/strong&gt; flag, which allows you to filter the &lt;code&gt;MESSAGE&lt;/code&gt; field by regex. But you can always pipe the journalctl output through grep itself.&lt;/p&gt;
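
&lt;p&gt;For instance, assuming a systemd new enough to support it (237+, built with PCRE2), the two approaches would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl -u sshd.service --grep "Failed password"
# journalctl -u sshd.service --no-pager | grep "Failed password"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;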

&lt;h3&gt;
  
  
  journald and boots
&lt;/h3&gt;

&lt;p&gt;Besides messages logged by applications, journald remembers significant events, such as system reboots. Here's an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22."
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 08 11:53:23 localhost.localdomain sshd[822]: Server listening on 0.0.0.0 port 22.
Apr 08 13:23:42 localhost.localdomain sshd[7425]: Server listening on 0.0.0.0 port 22.
-- Reboot --
Apr 09 10:22:49 localhost.localdomain sshd[857]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can suppress these special messages via &lt;strong&gt;-q&lt;/strong&gt;. Use &lt;strong&gt;-b&lt;/strong&gt; to show only messages after a certain boot. For example, to show messages since the last boot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22." -b
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 09 10:22:49 localhost.localdomain sshd[857]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can specify a boot as an offset to the current one (e.g. &lt;strong&gt;-b -1&lt;/strong&gt; is the boot before the last). You can also specify a boot ID, but first you need to know which boot IDs are available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl --list-boots
-1 d26652f008ef4020b15a3d510bbcb381 Wed 2020-04-08 11:53:18 UTC—Wed 2020-04-08 14:31:16 UTC
 0 f0bbba1703cb43229559a8fcb4cb08b9 Thu 2020-04-09 10:22:43 UTC—Thu 2020-04-09 12:01:01 UTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22." -b d26652f008ef4020b15a3d510bbcb381
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 08 11:53:23 localhost.localdomain sshd[822]: Server listening on 0.0.0.0 port 22.
Apr 08 13:23:42 localhost.localdomain sshd[7425]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is all useful if you configure journald for persistent storage (see the configuration section above).&lt;/p&gt;

&lt;h2&gt;
  
  
  journald centralized logging
&lt;/h2&gt;

&lt;p&gt;As you probably noticed, journald is quite host-centric. In practice, you'll want to access these logs in a central location, without having to SSH into each machine.&lt;/p&gt;

&lt;p&gt;There are multiple ways of centralizing journald logs, and we'll detail each below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;systemd-journal-upload&lt;/a&gt; uploads journal entries&lt;/strong&gt;. Either &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;directly to Sematext Cloud&lt;/a&gt; or to a log shipper that can read its output, such as the &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;open-source Logagent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt; as a “centralizer”&lt;/strong&gt;. The idea is to have all journals on one host, so you can use journalctl to search (see above). This can work in “pull” or “push” mode&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt; or &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;another log shipper&lt;/a&gt; reads from the local journal&lt;/strong&gt;. Then, it forwards logs to a central store like &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; or &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald forwards entries to a local syslog socket&lt;/strong&gt;. Then, a log shipper (typically a syslog daemon) picks messages up and forwards them to the central store&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  systemd-journal-upload to ELK or Sematext Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;systemd-journal-upload&lt;/a&gt; is a service that pushes new journal entries over HTTP/HTTPS. That destination can be the &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;Sematext Cloud Journald Receiver&lt;/a&gt; - the easiest way to centralize journald logs. And probably the best, as we'll discuss below.&lt;/p&gt;

&lt;p&gt;Although it's part of journald/systemd, &lt;code&gt;systemd-journal-upload&lt;/code&gt; isn't installed by default on most distros. So you have to add it via something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get install systemd-journal-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, uploading journal entries is as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemd-journal-upload --url=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though most likely you'll want to configure it as a service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /etc/systemd/journal-upload.conf
[Upload]
URL=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you need more control, or if you want to send journal entries to your local Elasticsearch, you can use the &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;open-source Logagent&lt;/a&gt; with its &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;journald input plugin&lt;/a&gt; as a journald centralizer: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image1.png"&gt;&lt;/a&gt; Here's the relevant part of &lt;code&gt;logagent.conf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input:
  journal-upload:
    module: input-journald-upload
    port: 9090
    worker: 0
    systemdUnitFilter:
      include: !!js/regexp /.*/i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Using Logagent and Elasticsearch or Sematext Cloud&lt;/strong&gt; (i.e. we host Logagent and Elasticsearch for you) is probably &lt;strong&gt;the best option to centralize journald logs&lt;/strong&gt;. That's because you get all journald's structured data over a reliable protocol (HTTP/HTTPS) with minimal overhead. The catch? Initial import is tricky, because it can generate a massive HTTP payload. For this, you might want to do the initial import by streaming journalctl output through &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt;, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl --output=json --no-page | logagent --index SEMATEXT-LOGS-TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
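&lt;p&gt;Since &lt;code&gt;journalctl --output=json&lt;/code&gt; emits one JSON object per line, you can pre-filter a large backlog with plain shell tools before it ever reaches the shipper. A toy sketch (the two journal entries below are made up, and only mirror journald's field naming convention):&lt;/p&gt;

```shell
# journalctl --output=json prints one JSON object per line; simulate
# two such entries here (made-up sample data, not real journal output)
sample='{"_SYSTEMD_UNIT":"sshd.service","MESSAGE":"accepted connection"}
{"_SYSTEMD_UNIT":"cron.service","MESSAGE":"job started"}'

# keep only sshd entries before shipping, e.g.:
#   journalctl --output=json | grep '"_SYSTEMD_UNIT":"sshd.service"' | logagent --index TOKEN
echo "$sample" | grep '"_SYSTEMD_UNIT":"sshd.service"'
```

&lt;p&gt;In practice you'd usually let journalctl do the filtering itself (e.g. &lt;code&gt;-u sshd&lt;/code&gt;, or &lt;code&gt;--since&lt;/code&gt;/&lt;code&gt;--until&lt;/code&gt; to slice the initial import into smaller chunks), but a quick grep like this is handy for ad-hoc filters.&lt;/p&gt;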
&lt;h3&gt;
  
  
  systemd-journal-remote
&lt;/h3&gt;

&lt;p&gt;Journald comes with its own “log centralizer”: &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt;. You don't get anywhere near the flexibility of ELK/Sematext Cloud, but it's already there and it might be enough for small environments.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;systemd-journal-remote&lt;/code&gt; can either pull journals from remote systems or listen for journal entries on HTTP/HTTPS. The push model - where &lt;code&gt;systemd-journal-upload&lt;/code&gt; is in charge of pushing logs - is typically better because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  it can continuously tail the journal and remembers where it left off (i.e. maintains a cursor)&lt;/li&gt;
&lt;li&gt;  you don't need to open access to the journal of every system&lt;/li&gt;
&lt;/ul&gt;
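&lt;p&gt;The "remembers where it left off" part is what makes push mode robust. Conceptually, the uploader keeps a cursor in a state file and only ships what was appended since the last run. Here's a toy shell sketch of that idea, using byte offsets on a plain file (not the real journal cursor format):&lt;/p&gt;

```shell
# Toy illustration of push-mode "cursor" semantics, NOT the real
# systemd-journal-upload implementation: remember a byte offset in a
# state file and only ship bytes appended since the last run.
LOG=$(mktemp); STATE=$(mktemp)
printf 'first entry\n' >> "$LOG"
echo 0 > "$STATE"

ship_new_entries() {
  offset=$(cat "$STATE")
  # send only the bytes appended after the saved offset
  tail -c +"$((offset + 1))" "$LOG"
  wc -c < "$LOG" > "$STATE"   # advance the cursor
}

ship_new_entries                      # prints "first entry"
printf 'second entry\n' >> "$LOG"
ship_new_entries                      # prints only "second entry"
```

&lt;p&gt;If the shipper crashes between runs, it re-reads from the saved offset instead of losing or duplicating the whole log, which is exactly why a stateful pusher beats a stateless puller.&lt;/p&gt;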

&lt;p&gt;&lt;code&gt;systemd-journal-remote&lt;/code&gt; typically comes in the same package as &lt;code&gt;systemd-journal-upload&lt;/code&gt;. Once it's installed, you can make it listen to HTTP/HTTPS traffic:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host2# systemd-journal-remote --listen-http=0.0.0.0:19352 --output=/var/log/journal/remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you can push the journal of a remote host like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host1# systemd-journal-upload --url=http://host2:19352
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  systemd-journal-remote and systemd-journal-gatewayd
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;s&lt;/a&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;ystemd-journal-remote&lt;/a&gt; can also pull journal entries from remote hosts. These hosts would normally serve their journal via &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-gatewayd.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-gatewayd&lt;/a&gt; (which is often provided by the same package). Once you have systemd-journal-gatewayd, you can start it via:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host1# systemctl start systemd-journal-gatewayd.socket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can verify if it works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl host1:19531/entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, from the “central” host, you can use systemd-journal-remote to fetch journal entries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host2# systemd-journal-remote --url [http://](http://host1:19531)[host1](http://host1:19531)[:19531](http://host1:19531)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By default, systemd-journal-remote will write the imported journal to &lt;code&gt;/var/log/journal/remote/&lt;/code&gt; (you might have to create it first!), so you can search it via &lt;code&gt;journalctl&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -D /var/log/journal/remote/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Tools that read directly from the journal
&lt;/h3&gt;

&lt;p&gt;Another approach for centralizing journald logs is to &lt;strong&gt;have a &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;log shipper&lt;/a&gt; read from the journal&lt;/strong&gt;, much like journalctl does. Then, it can process logs and send them to destinations like Elasticsearch or Sematext Cloud (which exposes the &lt;a href="https://sematext.com/docs/logs/index-events-via-elasticsearch-api/" rel="noopener noreferrer"&gt;Elasticsearch API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For this approach, there's a PoC &lt;a href="https://github.com/logstash-plugins/logstash-input-journald" rel="noopener noreferrer"&gt;journald input plugin for Logstash&lt;/a&gt;. As you probably know, &lt;a href="https://sematext.com/blog/getting-started-with-logstash/" rel="noopener noreferrer"&gt;Logstash is easy to use&lt;/a&gt;, so reading from the journal is as easy as:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  journald {
  # you may add other options here, but of course the defaults are sensible :)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/beats/journalbeat/master/journalbeat-overview.html" rel="noopener noreferrer"&gt;Journalbeat&lt;/a&gt; is also available. It's as easy to install and use as &lt;a href="https://sematext.com/blog/using-filebeat-to-send-elasticsearch-logs-to-logsene/" rel="noopener noreferrer"&gt;Filebeat&lt;/a&gt;, except that it reads from the journal. But it's marked as experimental.&lt;/p&gt;

&lt;p&gt;Why PoC and experimental? Because of potential journal corruption which might lead to nasty results. Check the comments in &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html" rel="noopener noreferrer"&gt;rsyslog's journal input documentation&lt;/a&gt; for details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog daemons&lt;/a&gt; are also log shippers. Some of them can also read from the journal, or even write to it. There's a lot to say about syslog and the journal, so we'll dissect the topic in a section of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  journald vs syslog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Journald provides a good out-of-the-box logging experience&lt;/strong&gt; for systemd. The trade-off is, journald is &lt;strong&gt;a bit of a monolith&lt;/strong&gt;, having everything from log storage and rotation, to log transport and search. Some would argue that &lt;strong&gt;syslog is more UNIX-y&lt;/strong&gt;: more lenient and easier to integrate with other tools. That monolithic design was journald's main criticism to begin with.&lt;/p&gt;

&lt;p&gt;Flame wars aside, there's good integration between the two. Journald provides a &lt;a href="https://manpages.debian.org/jessie/manpages-dev/syslog.3.en.html" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt; and can forward to syslog (see below). On the other hand, syslog daemons have journal integrations. For example, &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt; provides plugins to both &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html" rel="noopener noreferrer"&gt;read from journald&lt;/a&gt; and &lt;a href="https://rsyslog.readthedocs.io/en/latest/configuration/modules/omjournal.html" rel="noopener noreferrer"&gt;write to journald&lt;/a&gt;. In fact, they recommend two architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A small setup (e.g. N embedded devices and one server) could work by centralizing journald logs (see above). If embedded devices don't have systemd/journald but have syslog, they can centralize via syslog to the server and finally write to the server's journal. This journal will act like a mini-ELK&lt;/li&gt;
&lt;li&gt;  A larger setup can work by aggregating journal entries through a syslog daemon. We'll concentrate on this scenario in the rest of this section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two ways of centralizing journal entries via syslog:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;syslog daemon acts as a journald client&lt;/strong&gt; (like journalctl or Logstash or Journalbeat)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;journald forwards messages to syslog&lt;/strong&gt; (via socket)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 1) is slower - reading from the journal is slower than reading from the socket - but captures all the fields from the journal. Option 2) is safer (e.g. no issues with journal corruption), but the journal will only forward traditional syslog fields (like severity, hostname, message, and so on). Typically, you'd go for 2) unless you need the structured info. Here's an example configuration for implementing 1) with rsyslog, and writing all messages to Elasticsearch or &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# module that reads from journal
module(load="imjournal"
 StateFile="/var/run/journal.state" # we write here where we left off
 PersistStateInterval="100" # update the state file every 100 messages
)
# journal entries are read as JSON, we'll need this to parse them
module(load="mmjsonparse")
# Elasticsearch or Sematext Cloud HTTP output
module(load="omelasticsearch")

# this is done on every message (i.e. parses the JSON)
action(type="mmjsonparse")

# output template that simply writes the parsed JSON
template(name="all-json" type="list"){
 property(name="$!all-json")
}

action(type="omelasticsearch"
 template="all-json" # use the template defined earlier
 searchIndex="SEMATEXT-LOGS-APP-TOKEN-GOES-HERE"
 server="logsene-receiver.sematext.com"
 serverport="80"
 bulkmode="on" # use the bulk API
 action.resumeretrycount="-1" # retry indefinitely if Logsene/Elasticsearch is unreachable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For option 2), we'll need to configure journald to forward to a socket. It's as easy as adding this to /etc/systemd/journald.conf:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ForwardToSyslog=yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And it will write messages, in syslog format, to /run/systemd/journal/syslog. On the rsyslog side, you'll have to configure its &lt;a href="https://rsyslog-doc.readthedocs.io/en/latest/configuration/modules/imuxsock.html" rel="noopener noreferrer"&gt;socket input module&lt;/a&gt; to listen to that socket. Here's a similar example of sending logs to Elasticsearch or Sematext Cloud:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module(load="imuxsock"
 SysSock.Name="/run/systemd/journal/syslog")

# template to write traditional syslog fields as JSON
template(name="plain-syslog"
 type="list") {
 constant(value="{")
 constant(value="\"timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
 constant(value="\",\"host\":\"") property(name="hostname")
 constant(value="\",\"severity\":\"") property(name="syslogseverity-text")
 constant(value="\",\"facility\":\"") property(name="syslogfacility-text")
 constant(value="\",\"tag\":\"") property(name="syslogtag" format="json")
 constant(value="\",\"message\":\"") property(name="msg" format="json")
 constant(value="\"}")
}

action(type="omelasticsearch"
 template="plain-syslog" # use the template defined earlier
 searchIndex="SEMATEXT-LOGS-APP-TOKEN-GOES-HERE"
 server="logsene-receiver.sematext.com"
 serverport="80"
 bulkmode="on" # use the bulk API
 action.resumeretrycount="-1" # retry indefinitely if Logsene/Elasticsearch is unreachable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
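&lt;p&gt;To make the template above more concrete, here's roughly the JSON document it renders for one message, approximated by hand in shell (every value below is invented for illustration):&lt;/p&gt;

```shell
# Approximate what the "plain-syslog" rsyslog template renders for one
# sample message; all values here are made up for illustration
timestamp='2020-04-09T13:03:28+00:00'; host='web1'
severity='info'; facility='daemon'; tag='myapp[1234]:'; msg='hello journal'

json=$(printf '{"timestamp":"%s","host":"%s","severity":"%s","facility":"%s","tag":"%s","message":"%s"}' \
  "$timestamp" "$host" "$severity" "$facility" "$tag" "$msg")
echo "$json"

# sanity-check that the rendered document is valid JSON
echo "$json" | python3 -m json.tool > /dev/null && echo "valid JSON"
```

&lt;p&gt;This also shows why the template applies &lt;code&gt;format="json"&lt;/code&gt; to the tag and message properties: those fields can contain quotes or backslashes that must be escaped for the output to stay valid JSON.&lt;/p&gt;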

&lt;p&gt;Whether you read the journal through syslog, systemd-journal-upload or through a log shipper, all the above methods assume that you're dealing with Linux running on bare metal or VMs. But what if you're using containers? Let's explore your options in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  journald and containers
&lt;/h2&gt;

&lt;p&gt;In this context, I think it's worth making a distinction between Docker containers and systemd containers. Let's take them one at a time.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and Docker
&lt;/h3&gt;

&lt;p&gt;Typically, a Docker container won't have systemd, because it would make it too “heavy”. As a consequence, it won't have journald, either. That said, you probably have journald on the host, if the host is running Linux. This means you can use the &lt;a href="https://docs.docker.com/config/containers/logging/journald/" rel="noopener noreferrer"&gt;journald logging driver&lt;/a&gt; to send all the logs of a host's containers to that host's journal. It's as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run my_container --log-driver=journald
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that container's logs will be in the journal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl CONTAINER_NAME=my_container --all
Apr 09 13:03:28 localhost.localdomain dockerd-current[25558]: hello journal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to use journald by default, you can make the change in daemon.json and restart Docker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cat /etc/docker/daemon.json
{
 "log-driver": "journald"
}
systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
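&lt;p&gt;One caveat worth a quick check: a syntax error in &lt;code&gt;daemon.json&lt;/code&gt; will prevent the Docker daemon from starting, so it's worth validating the file before the restart. A minimal sketch, using a throwaway copy under &lt;code&gt;/tmp&lt;/code&gt; (in real life you'd check &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt;):&lt;/p&gt;

```shell
# write a sample daemon.json to a throwaway path and validate it before
# restarting Docker (a broken file keeps the daemon from starting)
cat > /tmp/daemon.json <<'EOF'
{
 "log-driver": "journald"
}
EOF

python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json OK"
```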

&lt;p&gt;If you have more than one host, you're back to the centralizing problem that we explored in the previous section: getting all journals in one place. This makes journald an intermediate step that may not be necessary.&lt;/p&gt;

&lt;p&gt;A better approach is to &lt;a href="https://sematext.com/docs/logs/sending-docker-logs/" rel="noopener noreferrer"&gt;centralize container logs&lt;/a&gt; via Logagent, which can run as a container. Here, Logagent picks up logs and forwards them to a central place, like Elasticsearch or Sematext Cloud. But it's not the only way. In fact, we explore different approaches, with their pros and cons, in our &lt;a href="https://sematext.com/guides/docker-logs/" rel="noopener noreferrer"&gt;Complete Guide to Docker logging&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and systemd containers
&lt;/h3&gt;

&lt;p&gt;systemd provides containers as well (called &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-machined.service.html%23" rel="noopener noreferrer"&gt;machines&lt;/a&gt;) via &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html%23" rel="noopener noreferrer"&gt;systemd-nspawn&lt;/a&gt;. Unlike Docker containers, systemd-nspawn machines can log to the journal directly. You can read the logs of a specific machine like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl --machine $MACHINE_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;$MACHINE_NAME&lt;/code&gt; is one of the running machines. You'd use &lt;code&gt;machinectl list&lt;/code&gt; to see all of them.&lt;/p&gt;

&lt;p&gt;As with Docker's journald logging driver, this setup might be challenging when you have multiple hosts. You'll either want to centralize your journals, as described in the previous section, or send logs from your systemd containers directly to the central location, via a &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;log shipper or a logging library&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Did you read all the way to the end? You're a hero! And you probably figured that journald is good for structured logging, quick local searches, and tight integration with systemd. Its design shows its weaknesses when it comes to centralizing log events. Here we have many options, but none is perfect. That said, &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;Logagent's journald input&lt;/a&gt; and &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;Sematext Cloud's journald receiver&lt;/a&gt; (the hosted equivalent) come pretty close.&lt;/p&gt;

</description>
      <category>journald</category>
      <category>journalctl</category>
      <category>syslog</category>
      <category>elk</category>
    </item>
    <item>
      <title>Working with Solr Plugins System</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 08 Jun 2020 14:03:36 +0000</pubDate>
      <link>https://dev.to/sematext/working-with-solr-plugins-system-4c0d</link>
      <guid>https://dev.to/sematext/working-with-solr-plugins-system-4c0d</guid>
      <description>&lt;p&gt;&lt;a href="https://sematext.com/guides/solr/" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; was always ready to be extended. What was only needed is a binary with the code and the modification of the Solr configuration file, the &lt;strong&gt;solrconfig.xml&lt;/strong&gt; and we were ready. It was even simpler with the Solr APIs that allowed us to create various configuration elements – for example, request handlers. What’s more, the default Solr distribution came with a few plugins already – for example, the Data Import Handler or Learning to Rank.&lt;/p&gt;

&lt;p&gt;As consultants working with clients across different industries, dealing with a wide variety of use cases with Solr clusters monitored by &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, the next thing we saw a need for was plugins. Installing those plugins was not hard – put a jar file in a defined place, modify the configuration, reload the core/collection or restart Solr, and you’re ready. Well, not so fast. What if you had hundreds of Solr nodes and needed to install or upgrade a plugin? That’s where things can get nasty and call for automation. Solr offered little support here until one of its recent releases: everyone who wanted to extend Solr was doing the same manual jar loading. We did the same with our own plugins – like the &lt;a href="https://github.com/sematext/solr-researcher" rel="noopener noreferrer"&gt;Researcher&lt;/a&gt; or &lt;a href="https://github.com/sematext/query-segmenter" rel="noopener noreferrer"&gt;Query Segmenter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With the release of Solr 8.4.0, we’ve got a new functionality that helps us with extending Solr – the plugin management. It allows installing plugins from remote locations and it makes it very easy to do so for us as users. Today I wanted to show you not only how to install Solr plugins using this new feature, but also how to prepare your own plugin repository. Let’s get started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0dzjsfdir0i7dh6t6uqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0dzjsfdir0i7dh6t6uqr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solr Plugin Management
&lt;/h2&gt;

&lt;p&gt;With Solr 8.4.0, we didn’t only get the script itself but also the whole set of changes under the hood. Those changes include things like package management APIs and scripts, class loader isolation, artifact read and write API and more.&lt;/p&gt;

&lt;p&gt;Let’s start from the beginning though. By default, Solr comes with package loading turned off. One of the reasons for that decision is security: users could potentially force Solr to download malicious content, so you need to be sure your environment is secure and understand the downsides and risks of using this feature. If we are sure we want to run Solr with the plugin management mechanism turned on, we need to add the &lt;strong&gt;enable.packages&lt;/strong&gt; property to Solr’s startup parameters and set it to &lt;strong&gt;true&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/solr start &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-Denable&lt;/span&gt;.packages&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can start playing around with the packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package Management Basics
&lt;/h2&gt;

&lt;p&gt;Let’s try using the bin/solr script and see what it allows us to do when it comes to package management. The simplest way to check that is just by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bin/
&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result, we will get the following response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Found 1 Solr nodes:

Solr process 20949 running on port 8983
Package Manager

./solr package add-repo
Add a repository to Solr.

./solr package &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;:]
Install a package into Solr. This copies over the artifacts from the repository into Solr&lt;span class="s1"&gt;'s internal package store and sets up classloader for this package to be used.

./solr package deploy [:] [-y] [--update] -collections &amp;lt;package-name&amp;gt;[:] [-y] [--update] -collections  [-p param1=value1 -p param2=value2 …
Bootstraps a previously installed package into the specified collections. It the package accepts parameters for its setup commands, they can be specified (as per package documentation).

./solr package list-installed
Print a list of packages installed in Solr.

./solr package list-available
Print a list of packages available in the repositories.

./solr package list-deployed -c
Print a list of packages deployed on a given collection.

./solr package list-deployed
Print a list of collections on which a given package has been deployed.

./solr package undeploy  -collections
Undeploys a package from specified collection(s)

Note: (a) Please add '&lt;/span&gt;&lt;span class="nt"&gt;-solrUrl&lt;/span&gt; http://host:port&lt;span class="s1"&gt;' parameter if needed (usually on Windows).
      (b) Please make sure that all Solr nodes are started with '&lt;/span&gt;&lt;span class="nt"&gt;-Denable&lt;/span&gt;.packages&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="s1"&gt;' parameter.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It seems we get everything that is needed. We can add repositories, we can list installed packages, we can install packages, we can deploy packages, list deployed ones and of course undeploy the ones that we no longer need.&lt;/p&gt;

&lt;p&gt;At the time of writing this blog post, there were no Solr plugin repositories publicly available. But for us this is not bad – we can use that to learn even more. We just need to start by preparing our own plugin repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the Package Repository
&lt;/h2&gt;

&lt;p&gt;If you are using an already created repository where the plugins are available, you can skip this part of the blog post. But if you would like to learn how to set up a Solr plugin repository on your own, I’ll try to guide you through the process.&lt;/p&gt;

&lt;p&gt;So there are a few steps that need to be taken:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to create a private key that will be used to sign your binaries&lt;/li&gt;
&lt;li&gt;You need to create a public key that Solr will use to verify the signed packages&lt;/li&gt;
&lt;li&gt;You need to create a repository description file that Solr will read when requesting packages from the repository&lt;/li&gt;
&lt;li&gt;And of course, you need to have the binaries that you would like to expose as plugins. We will not be discussing this step though and I will assume you already have that. We created a very naive and simple code at &lt;a href="https://github.com/sematext/example-solr-module" rel="noopener noreferrer"&gt;https://github.com/sematext/example-solr-module&lt;/a&gt;. Have a look if you want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkeiz7jp9hm1k128nf8n2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkeiz7jp9hm1k128nf8n2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Private and a Public Key
&lt;/h3&gt;

&lt;p&gt;We will start by creating a &lt;strong&gt;private key&lt;/strong&gt;. This key will be used to generate a signature of the binaries that we will be exposing as plugins. For that we will use &lt;strong&gt;openssl&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl genrsa &lt;span class="nt"&gt;-out&lt;/span&gt; sematext_example.pem 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates a 512-bit RSA key called &lt;strong&gt;sematext_example.pem&lt;/strong&gt;. With that generated, we can now create a public key based on it.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;public key&lt;/strong&gt; will be created from the private one and Solr will use it to verify the signatures of the files. The idea is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The package maintainer creates a signature of the package file using the &lt;strong&gt;private key&lt;/strong&gt; and writes the signature in the repository description file,&lt;/li&gt;
&lt;li&gt;During package deployment, the package signature is verified by Solr using the &lt;strong&gt;public key&lt;/strong&gt;. If the signature doesn’t match – the package will not be deployed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To create a public key we will again use the &lt;strong&gt;openssl&lt;/strong&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl rsa &lt;span class="nt"&gt;-in&lt;/span&gt; sematext_example.pem &lt;span class="nt"&gt;-pubout&lt;/span&gt; &lt;span class="nt"&gt;-outform&lt;/span&gt; DER &lt;span class="nt"&gt;-out&lt;/span&gt; publickey.der 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above command is a &lt;strong&gt;publickey.der&lt;/strong&gt; file that we will upload to our repository location along with the binary file and the repository description file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating Package Signature
&lt;/h3&gt;

&lt;p&gt;The last step is generating the signature of the file. We will once again use the &lt;strong&gt;openssl&lt;/strong&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl dgst &lt;span class="nt"&gt;-sha1&lt;/span&gt; &lt;span class="nt"&gt;-sign&lt;/span&gt; sematext.pem solr-example-module-1.0.jar | openssl enc &lt;span class="nt"&gt;-base64&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;n | &lt;span class="nb"&gt;sed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result we will have the signature, which in our case looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iXyDDhYkYZgBrYCTxawAdeIJFYR+KHglK4m6uLSR1lo9pFm67dKfIzTmXPHasFVgLwVRbYvGMJG5p69TowMPAg==
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note it down somewhere, as we will need it soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Description
&lt;/h2&gt;

&lt;p&gt;Now that we have our binary file and the private and public keys, we can create the repository description file that Solr will look for inside the repository. This file has to be called &lt;strong&gt;repository.json&lt;/strong&gt; and needs to include a list of the plugins available in our repository. Each plugin is defined by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A name,&lt;/li&gt;
&lt;li&gt;A description,&lt;/li&gt;
&lt;li&gt;An array of versions, each of which includes:
&lt;ul&gt;
&lt;li&gt;The version itself,&lt;/li&gt;
&lt;li&gt;The release date of the given plugin version,&lt;/li&gt;
&lt;li&gt;An array of artifacts for the version – the URL of the file and the signature that we generated earlier,&lt;/li&gt;
&lt;li&gt;The manifest, which includes supported Solr versions, default parameters, and the setup, uninstall and verification commands.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;repository.json&lt;/strong&gt; file that we are using for the purpose of this blog post looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"sematext-example"&lt;/span&gt;, &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Example plugin created for blog post"&lt;/span&gt;, &lt;span class="s2"&gt;"versions"&lt;/span&gt;: &lt;span class="o"&gt;[{&lt;/span&gt;
    &lt;span class="s2"&gt;"date"&lt;/span&gt;: &lt;span class="s2"&gt;"2020-04-16"&lt;/span&gt;, &lt;span class="s2"&gt;"artifacts"&lt;/span&gt;: &lt;span class="o"&gt;[{&lt;/span&gt;
            &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"solr-example-module-1.0.jar"&lt;/span&gt;,
            &lt;span class="s2"&gt;"sig"&lt;/span&gt;: &lt;span class="s2"&gt;"iXyDDhYkYZgBrYCTxawAdeIJFYR+KHglK4m6uLSR1lo9pFm67dKfIzTmXPHasFVgLwVRbYvGMJG5p69TowMPAg=="&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;]&lt;/span&gt;,
        &lt;span class="s2"&gt;"manifest"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"version-constraint"&lt;/span&gt;: &lt;span class="s2"&gt;"8 - 9"&lt;/span&gt;,
          &lt;span class="s2"&gt;"plugins"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
            &lt;span class="o"&gt;{&lt;/span&gt;
              &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"request-handler"&lt;/span&gt;,
              &lt;span class="s2"&gt;"setup-command"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"/api/collections/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;collection&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/config"&lt;/span&gt;,
                &lt;span class="s2"&gt;"payload"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"add-requesthandler"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;, &lt;span class="s2"&gt;"class"&lt;/span&gt;:
        &lt;span class="s2"&gt;"sematext-example:com.sematext.blog.solr.ExampleRequestHandler"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;,
                &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="s2"&gt;"POST"&lt;/span&gt;
              &lt;span class="o"&gt;}&lt;/span&gt;,
              &lt;span class="s2"&gt;"uninstall-command"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"/api/collections/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;collection&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/config"&lt;/span&gt;,
                &lt;span class="s2"&gt;"payload"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"delete-requesthandler"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
                &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="s2"&gt;"POST"&lt;/span&gt;
              &lt;span class="o"&gt;}&lt;/span&gt;,
              &lt;span class="s2"&gt;"verify-command"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"/api/collections/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;collection&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/config/requestHandler?componentName=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;meta=true"&lt;/span&gt;,
                &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="s2"&gt;"GET"&lt;/span&gt;,
                &lt;span class="s2"&gt;"condition"&lt;/span&gt;:
        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$[&lt;/span&gt;&lt;span class="s2"&gt;'config'].['requestHandler'].['&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'].['_packageinfo_'].['version']"&lt;/span&gt;,
                &lt;span class="s2"&gt;"expected"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;package&lt;/span&gt;&lt;span class="p"&gt;-version&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
              &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
          &lt;span class="o"&gt;]&lt;/span&gt;,
          &lt;span class="s2"&gt;"parameter-defaults"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"RH-HANDLER-PATH"&lt;/span&gt;: &lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While most of the properties are self-descriptive, you should pay attention to one thing – the class of the request handler in the setup-command definition. Because of the out-of-the-box class loader isolation, we need to prefix the class name with the name of the plugin to be able to create the request handler. If we don’t, Solr will fail to create the request handler, because the class that implements it will not be visible. Keep that in mind when creating the repository description file for your own plugins.&lt;/p&gt;

&lt;p&gt;With all that in place, we can upload the files to a remote location, like we did with &lt;a href="http://pub-repo.sematext.com/training/solr/blog/repo/" rel="noopener noreferrer"&gt;http://pub-repo.sematext.com/training/solr/blog/repo/&lt;/a&gt;, and start using the repository.&lt;/p&gt;
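&lt;p&gt;Since Solr parses this description file whenever it talks to the repository, it’s worth sanity-checking that the JSON is well-formed before uploading. A minimal check – &lt;strong&gt;python3&lt;/strong&gt; is used here purely as a convenient validator, and the file content is a trimmed-down stand-in for the real one:&lt;/p&gt;

```shell
# Create a trimmed-down repository.json just for the validation demo
cat > repository.json <<'EOF'
[
  {
    "name": "sematext-example",
    "description": "Example plugin created for blog post",
    "versions": []
  }
]
EOF

# Fail loudly on malformed JSON before the file ever reaches the repository
python3 -m json.tool repository.json > /dev/null && echo "repository.json is valid JSON"
```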

&lt;h2&gt;
  
  
  Adding a New Package Repository
&lt;/h2&gt;

&lt;p&gt;Once we are done setting up our own repository, or we have a repository that we would like to install plugins from, we can add it to Solr. Just remember: to be added successfully, the repository needs to provide the &lt;strong&gt;repository.json&lt;/strong&gt; file. The second thing is security – you should avoid adding repositories that don’t use SSL. Adding a repository that doesn’t use a secure connection exposes you and your Solr to &lt;strong&gt;man-in-the-middle&lt;/strong&gt; attacks, during which the package can be replaced with a malicious version while it is being downloaded. Keeping your Solr secure is as important as keeping an eye on the Solr metrics by using one of the Solr monitoring tools like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that we know about the potential security issue let’s use a secure location of the example Solr repository. We do that by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package add-repo sematext https://pub-repo.sematext.com/training/solr/blog/repo/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using the new &lt;strong&gt;package&lt;/strong&gt; functionality of the &lt;strong&gt;bin/solr&lt;/strong&gt; script with the &lt;strong&gt;add-repo&lt;/strong&gt; option, which requires us to provide a name and a location. The name in our case is &lt;strong&gt;sematext&lt;/strong&gt; and the location is the last provided parameter.&lt;/p&gt;

&lt;p&gt;If the operation was successful, Solr will give us information about the number of nodes found in the cluster, the process identifier and the port on which the instance is running, followed by a confirmation that the repository was added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Found 1 Solr nodes:

Solr process 65854 running on port 8983
Added repository: sematext
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a side note – I’ll omit the information about the number of nodes, the process identifier and the Solr port from the following examples, to make it easier to see the crucial information returned by Solr.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Removing Solr Packages
&lt;/h2&gt;

&lt;p&gt;Once the repository is added we can start using it. The first thing you would usually do is list the available packages and look for something to install to extend your Solr. To list all the available packages, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/solr package list-available
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response to the above command should be similar to the following one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Available packages:
&lt;span class="nt"&gt;-----&lt;/span&gt;
sematext-example    Example plugin created &lt;span class="k"&gt;for &lt;/span&gt;blog post
  Version: 1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the response, we get a list of packages – each described with a name, a description and a version, just as they were defined in the &lt;strong&gt;repository.json&lt;/strong&gt; file. We are very close to being ready for installation, but there is one more thing – the public key that Solr will use to verify the package signature. Where to look for such a key? It will either be provided to you, or you can download it from the repository itself under the &lt;strong&gt;publickey.der&lt;/strong&gt; name. I’ll do the latter and download the key using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; publickey.der https://pub-repo.sematext.com/training/solr/blog/repo/publickey.der
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have the key, we can add it to Solr using the &lt;strong&gt;package&lt;/strong&gt; functionality of the &lt;strong&gt;bin/solr&lt;/strong&gt; script and the &lt;strong&gt;add-key&lt;/strong&gt; action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package add-key publickey.der
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After all those steps we can finally start installing packages. For example, let’s install the one package available in our sample repository by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package &lt;span class="nb"&gt;install &lt;/span&gt;sematext-example:1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response that I got from Solr was as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Posting manifest...
Posting artifacts...
Executing Package API to register this package...
Response: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"responseHeader"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"status"&lt;/span&gt;:0,
    &lt;span class="s2"&gt;"QTime"&lt;/span&gt;:68&lt;span class="o"&gt;}}&lt;/span&gt;
sematext-example installed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that our package is now ready to be used. Let’s create a collection where we can use the package by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr create_collection &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By now we should have the package installed and a sample collection created, which means we are finally ready to use the plugin. To do that, we need to deploy it – either to a single collection or to multiple ones at the same time. For the purpose of this blog post I will use our &lt;strong&gt;test&lt;/strong&gt; collection and deploy the plugin using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package deploy sematext-example:1.0.0 &lt;span class="nt"&gt;-collections&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the name of the collection or collections that our plugin should be deployed to, we need to provide the name of the plugin and its version. The response was as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Executing &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"add-requesthandler"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;,&lt;span class="s2"&gt;"class"&lt;/span&gt;:&lt;span class="s2"&gt;"sematext-example:com.sematext.blog.solr.ExampleRequestHandler"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;path:/api/collections/test/config
Execute this &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;y/n&lt;span class="o"&gt;)&lt;/span&gt;:
y
Executing http://localhost:8983/api/collections/test/config/requestHandler?componentName&lt;span class="o"&gt;=&lt;/span&gt;/sematextexample&amp;amp;meta&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;collection:test
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"responseHeader"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"status"&lt;/span&gt;:0,
    &lt;span class="s2"&gt;"QTime"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"config"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"requestHandler"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;,
        &lt;span class="s2"&gt;"class"&lt;/span&gt;:&lt;span class="s2"&gt;"sematext-example:com.sematext.blog.solr.ExampleRequestHandler"&lt;/span&gt;,
        &lt;span class="s2"&gt;"_packageinfo_"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"package"&lt;/span&gt;:&lt;span class="s2"&gt;"sematext-example"&lt;/span&gt;,
          &lt;span class="s2"&gt;"version"&lt;/span&gt;:&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;,
          &lt;span class="s2"&gt;"files"&lt;/span&gt;:[&lt;span class="s2"&gt;"/package/sematext-example/1.0.0/solr-example-module-1.0.jar"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
          &lt;span class="s2"&gt;"manifest"&lt;/span&gt;:&lt;span class="s2"&gt;"/package/sematext-example/1.0.0/manifest.json"&lt;/span&gt;,
          &lt;span class="s2"&gt;"manifestSHA512"&lt;/span&gt;:&lt;span class="s2"&gt;"da463cdad3efbe4c9159b29156bbaf26f4aa35a083a8b74fd57e1dfa1f79ee7eaadfd3863f5d88fa2550281c027e82b516ebc64a7fa4159089f32c565813c574"&lt;/span&gt;&lt;span class="o"&gt;}}}}}&lt;/span&gt;

Actual: 1.0.0, expected: 1.0.0
Deployed on &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; and verified package: sematext-example, version: 1.0.0
Deployment successful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During the execution of the above command, the &lt;strong&gt;bin/solr&lt;/strong&gt; script will ask whether you are certain that you would like to deploy the chosen package. If you agree, Solr will deploy the package and try to verify it using the verification command provided in the &lt;strong&gt;repository.json&lt;/strong&gt; description file. If that went well – the plugin is ready and we can use it.&lt;/p&gt;

&lt;p&gt;When we no longer need a package we can remove it by running the &lt;strong&gt;undeploy&lt;/strong&gt; command. For example, if we would like to remove the previously deployed package we just need to run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/solr package undeploy sematext-example &lt;span class="nt"&gt;-collections&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the response says that everything went well and we will no longer be using the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Executing &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"delete-requesthandler"&lt;/span&gt;:&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;path:/api/collections/test/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Solr Packages Work Under the Hood
&lt;/h2&gt;

&lt;p&gt;The heart of the plugin mechanism is the isolation between the class loaders of the plugins and the core Solr classes. The mechanism assumes that any change to the files on the Solr &lt;strong&gt;classpath&lt;/strong&gt; requires a restart, while the remaining files can be loaded dynamically and are bound to the configuration stored in ZooKeeper.&lt;/p&gt;

&lt;p&gt;The basis of the mechanism is the so-called &lt;strong&gt;Package Store&lt;/strong&gt;. It is a distributed file system that keeps its data on each Solr node in the &lt;strong&gt;$SOLR_HOME/filestore&lt;/strong&gt; directory, with each file described by metadata written in a JSON file. Each file also stores a checksum in its metadata for verification purposes. That way, replacing the binary itself is not enough to load a malicious version of the plugin – the signature is still there and would need to be adjusted as well. That gives us a certain degree of security.&lt;/p&gt;
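&lt;p&gt;The &lt;strong&gt;manifestSHA512&lt;/strong&gt; value we saw in the deploy output is a plain SHA-512 digest, so a file in the package store can be re-checked by hand. A small sketch, with a made-up manifest file standing in for the real one:&lt;/p&gt;

```shell
# Recompute a SHA-512 checksum like the manifestSHA512 seen in the deploy
# output, to confirm a file hasn't been tampered with (contents are made up)
printf '{"version": "1.0.0"}\n' > manifest.json
checksum=$(sha512sum manifest.json | awk '{print $1}')
echo "$checksum"    # 128 hex characters
```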

&lt;p&gt;On top of all of that, we have an API allowing us not only to manage the whole package repository but also single files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solr Package API
&lt;/h2&gt;

&lt;p&gt;Of course, the &lt;strong&gt;bin/solr&lt;/strong&gt; tool and installing packages with it is not everything that Solr gives us. In addition, we get an API that allows us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add files using the PUT HTTP method and the &lt;strong&gt;/api/cluster/files/{file_path}&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;li&gt;retrieve files using the GET HTTP method and the &lt;strong&gt;/api/cluster/files/{file_path}&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;li&gt;retrieve file metadata using the GET HTTP method and the &lt;strong&gt;/api/cluster/files/{file_path}?meta=true&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;li&gt;list the files available at a given path using the GET HTTP method and the &lt;strong&gt;/api/cluster/files/{directory_path}&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember that adding a file to Solr is not only about sending it. You also need to sign it using a key that is available to Solr – we saw that already.&lt;/p&gt;

&lt;p&gt;Similar to manipulating files in the package repository, we also have the option to manage packages – adding, removing and listing them and their versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GET on &lt;strong&gt;/api/cluster/package&lt;/strong&gt; to download the list of packages&lt;/li&gt;
&lt;li&gt;PUT on &lt;strong&gt;/api/cluster/package&lt;/strong&gt; to add a package&lt;/li&gt;
&lt;li&gt;DELETE on &lt;strong&gt;/api/cluster/package&lt;/strong&gt; to remove a package&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example to add a package to Solr we could use a command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-XPUT&lt;/span&gt; &lt;span class="s1"&gt;'http://localhost:8983/api/cluster/package'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-type:application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;  &lt;span class="s1"&gt;'{
 "add": {
  "package" : "sematext-example",
  "version" : "1.0.0",
  "files" : [
   "/test/sematext/1.0.0/sematext-example.jar"
  ]
 }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;The package management functionality brings a new way of extending Solr. However, you should remember that flexibility doesn’t come for free. Having the option to &lt;strong&gt;hot-deploy&lt;/strong&gt; Solr extensions on the fly, without bringing the whole cluster down, carries limitations and security threats. Because of that, do not add package repositories that you don’t know – they can serve malicious code that ends up downloaded and installed. The second thing to remember is that you shouldn’t add repositories that are not using SSL. Doing so exposes you to a &lt;strong&gt;man-in-the-middle&lt;/strong&gt; attack, during which the files can be replaced on the fly, leading to the installation of malicious code. That can compromise your cluster, which may lead to data leaks or the whole environment being compromised. Keep your Solr secure no matter if you use package management or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The ability to install Solr extensions without manually downloading them to each node, restarting the nodes and so on is very nice and tempting, especially to those of us who use such extensions. However, please remember the security implications and the limitations of the mechanism. If we are cautious, we gain a flexible way of extending Solr.&lt;/p&gt;

&lt;p&gt;Also keep in mind to &lt;a href="https://sematext.com/guides/solr/%23monitoring-solr-with-sematext" rel="noopener noreferrer"&gt;monitor your Solr&lt;/a&gt;, for example with software like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; that can help you identify the bottlenecks and find the root cause of the problems with your instance or the whole cluster. Keep that in mind – you can’t fix what you can’t measure 🙂&lt;/p&gt;

</description>
      <category>solr</category>
      <category>plugin</category>
      <category>plugins</category>
    </item>
    <item>
      <title>Where Are Docker Logs Stored?</title>
      <dc:creator>Adnan Rahić</dc:creator>
      <pubDate>Tue, 05 May 2020 14:57:28 +0000</pubDate>
      <link>https://dev.to/sematext/where-are-docker-logs-stored-349d</link>
      <guid>https://dev.to/sematext/where-are-docker-logs-stored-349d</guid>
      <description>&lt;p&gt;There’s a short answer, and a long answer. The short answer, that will satisfy your needs in the vast majority of cases, is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/docker/containers/&amp;lt;container_id&amp;gt;/&amp;lt;container_id&amp;gt;-json.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here you need to &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;ship logs&lt;/a&gt; to a &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;central location&lt;/a&gt;, and &lt;a href="https://www.freecodecamp.org/news/how-to-setup-log-rotation-for-a-docker-container-a508093912b2/" rel="noopener noreferrer"&gt;enable log rotation&lt;/a&gt; for your Docker containers. Let me elaborate on &lt;em&gt;why&lt;/em&gt; with the long answer below. &lt;/p&gt;

&lt;h2&gt;
  
  
  Where Are Docker Container Logs Stored by Default?
&lt;/h2&gt;

&lt;p&gt;You see, by default, Docker containers emit logs to the &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt; output streams. &lt;strong&gt;Containers are stateless&lt;/strong&gt;, and the &lt;strong&gt;logs are stored on the Docker host in JSON files by default&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why? &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;default logging driver is &lt;a href="https://docs.docker.com/config/containers/logging/json-file/" rel="noopener noreferrer"&gt;json-file&lt;/a&gt;&lt;/strong&gt;. What’s a logging driver? &lt;/p&gt;

&lt;p&gt;A logging driver is a mechanism for getting info from your running containers. Here’s a more elaborate explanation from the &lt;a href="https://docs.docker.com/config/containers/logging/configure/" rel="noopener noreferrer"&gt;Docker docs&lt;/a&gt;. There are several different log drivers you can use besides the default json-file, like syslog, &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;journald&lt;/a&gt;, fluentd, or &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;logagent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These logs are emitted from output streams, annotated with the log origin, either &lt;code&gt;stdout&lt;/code&gt; or &lt;code&gt;stderr&lt;/code&gt;, and a timestamp. Each log file contains information about only one container and is in JSON format. Remember, &lt;strong&gt;one log file per container&lt;/strong&gt;.&lt;/p&gt;
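&lt;p&gt;To make that concrete, a single entry in such a log file looks roughly like the line below – one JSON object per line, carrying the message, the stream it came from, and a timestamp (the values here are made up for illustration):&lt;/p&gt;

```shell
# One json-file log entry (made-up values): message, origin stream, timestamp
line='{"log":"Listening on port 3000\n","stream":"stdout","time":"2020-05-05T14:57:28.000000000Z"}'

# Pull out the stream annotation to show where the message came from
echo "$line" | grep -o '"stream":"[a-z]*"'    # prints "stream":"stdout"
```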

&lt;p&gt;You find these JSON log files in the &lt;code&gt;/var/lib/docker/containers/&lt;/code&gt; directory on a Linux Docker host. The &lt;code&gt;&amp;lt;container_id&amp;gt;&lt;/code&gt; here is the id of the running container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/docker/containers/&amp;lt;container_id&amp;gt;/&amp;lt;container_id&amp;gt;-json.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re not sure which id is related to which container, you can run the &lt;code&gt;docker ps&lt;/code&gt; command to list all running containers. The container id is located in the first column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps

&lt;span class="o"&gt;[&lt;/span&gt;Output]
CONTAINERID  IMAGE     COMMAND       CREATED    STATUS    PORTS    NAMES
cf74b6fce535 foo_image &lt;span class="s2"&gt;"node app.js"&lt;/span&gt; X min ago  Up X min  3000/tcp foo_app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you know where the container logs are stored, and you can continue to troubleshoot and debug any issues that come up.&lt;/p&gt;

&lt;p&gt;That’s where &lt;a href="https://sematext.com/guides/docker-logs/" rel="noopener noreferrer"&gt;logging&lt;/a&gt; comes into play. You collect the logs with a log aggregator and store them in a place where they’ll be available forever. It’s dangerous to keep logs on the Docker host because they can build up over time and eat into your disk space. That’s why you should use a &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;central location&lt;/a&gt; for your logs and &lt;a href="https://www.freecodecamp.org/news/how-to-setup-log-rotation-for-a-docker-container-a508093912b2/" rel="noopener noreferrer"&gt;enable log rotation&lt;/a&gt; for your Docker containers.&lt;/p&gt;
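&lt;p&gt;With the default json-file driver, rotation can be enabled globally through the Docker daemon configuration. Below is a hedged sketch of what &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt; might contain – it is written to a local file here so nothing on a real host is touched, and the &lt;code&gt;max-size&lt;/code&gt; and &lt;code&gt;max-file&lt;/code&gt; values are illustrative:&lt;/p&gt;

```shell
# Sketch of a daemon.json that keeps at most three 10 MB json-file logs per
# container; written locally for illustration instead of /etc/docker/daemon.json
# (the Docker daemon must be restarted after editing the real file)
cat > daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

# Confirm the config is valid JSON before handing it to the daemon
python3 -m json.tool daemon.json > /dev/null && echo "valid daemon.json"
```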

&lt;h2&gt;
  
  
  Debugging Docker Issues with Container Logs
&lt;/h2&gt;

&lt;p&gt;Docker has a &lt;a href="https://docs.docker.com/config/containers/logging/" rel="noopener noreferrer"&gt;dedicated API&lt;/a&gt; for working with logs. But, keep in mind, it will only work if you use the json-file log driver. I strongly recommend not changing the log driver! Let’s start debugging.&lt;/p&gt;

&lt;p&gt;First of all, to list all running containers, use the &lt;code&gt;docker ps&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, with the &lt;code&gt;docker logs&lt;/code&gt; command you can list the logs for a particular container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of the time you’ll end up tailing these logs in real time, or checking the last few log lines.&lt;/p&gt;

&lt;p&gt;Using the &lt;code&gt;--follow&lt;/code&gt; or &lt;code&gt;-f&lt;/code&gt; flag will &lt;code&gt;tail -f&lt;/code&gt; (follow) the Docker container logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt; &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--tail&lt;/code&gt; flag will show only the number of most recent log lines you specify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt; &lt;span class="nt"&gt;--tail&lt;/span&gt; N
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-t&lt;/code&gt; or &lt;code&gt;--timestamps&lt;/code&gt; flag will show the timestamps of the log lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt; &lt;span class="nt"&gt;-t&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--details&lt;/code&gt; flag will show extra details about the log lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt; &lt;span class="nt"&gt;--details&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what if you only want to see specific logs? Luckily, grep works with Docker logs as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will only show log lines that contain the word “error”, regardless of case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container_id&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
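&lt;p&gt;If you are post-processing logs in code rather than in the shell, the same case-insensitive filter is easy to express in JavaScript. Here is a small illustrative helper (not part of the Docker CLI):&lt;/p&gt;

```javascript
// Illustrative helper: keep only log lines mentioning "error",
// case-insensitively, the same effect as `grep -i error`.
function filterErrorLines(lines) {
  return lines.filter((line) => /error/i.test(line));
}
```

&lt;p&gt;For example, &lt;code&gt;filterErrorLines(['ok', 'ERROR: boom'])&lt;/code&gt; keeps only the second line.&lt;/p&gt;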



&lt;p&gt;As your application starts growing, you’ll probably start using Docker Compose. Don’t worry, it has a logs command as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will display the logs from all services in the application defined in the Docker Compose configuration file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storing Docker Container Logs in a Central Location Using a Log Shipper
&lt;/h2&gt;

&lt;p&gt;With your infrastructure growing, you can no longer rely on just using the Docker API to troubleshoot logs. You need to store all logs in a secure place, so you can analyze and troubleshoot any issues after the fact. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ngo2j3famco38irdrif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ngo2j3famco38irdrif.png" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need a steady influx of logs so you can get actionable insight into what is happening to your Docker containers. Setting up log rotation is just step one.&lt;/p&gt;

&lt;p&gt;By storing logs in one place you can also set up alerts that notify you if anything breaks, or whenever you’re experiencing unexpected behavior.&lt;/p&gt;

&lt;p&gt;Container logs can be a mix of plain text messages from start scripts and structured logs from applications, which makes it difficult for you to tell which log event belongs to what container and application.&lt;/p&gt;

&lt;p&gt;Although Docker log drivers can ship logs to &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management tools&lt;/a&gt;, most of them don’t allow you to parse container logs. You need a separate tool called a log shipper, such as &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/#toc-getting-started-with-logagent-1" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt;, &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;Logstash or rsyslog&lt;/a&gt;, to structure and enrich the logs before shipping them.&lt;/p&gt;

&lt;p&gt;The solution is to have a container dedicated solely to logging and collecting logs. You deploy the dedicated logging container within your Docker environment. It will automatically &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;aggregate logs&lt;/a&gt; from all containers, as well as monitor, &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analyze,&lt;/a&gt; and store or forward them to a central location. &lt;/p&gt;

&lt;p&gt;This makes it easier to move containers between hosts and easily scale your infrastructure. It also lets you collect logs through various streams, including log events, Docker API data, stats, etc. &lt;/p&gt;

&lt;p&gt;This is the setup I’d suggest. By far the most reliable and convenient way of collecting logs is to use the json-file driver and set up a log shipper to ship them. You always have a local copy of logs on your server and you get the &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;advantage of centralized log management&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;If you were to use &lt;a href="https://sematext.com/docs/logagent/installation-docker/" rel="noopener noreferrer"&gt;Sematext Logagent&lt;/a&gt; there are a &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/#toc-getting-started-with-logagent-1" rel="noopener noreferrer"&gt;few simple steps&lt;/a&gt; to follow in order to start sending logs to Sematext. After&lt;a href="https://sematext.com/docs/logs/quick-start/#creating-a-logs-app" rel="noopener noreferrer"&gt; creating a Logs App&lt;/a&gt;, run these commands in a terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull sematext/logagent

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always &lt;span class="nt"&gt;--name&lt;/span&gt; st-logagent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;LOGS_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_LOGS_TOKEN &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;LOGS_RECEIVER_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://logsene-receiver.sematext.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /var/run/docker.sock:/var/run/docker.sock &lt;span class="se"&gt;\&lt;/span&gt;
  sematext/logagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start sending all container logs to Sematext. &lt;/p&gt;

&lt;p&gt;You can read more about how Logagent works and how to use it for monitoring logs in our post on &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/#toc-container-logs-0" rel="noopener noreferrer"&gt;Docker Container Monitoring with Sematext&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There we go, both a short and a long answer to where Docker container logs are stored. By default, Docker uses the json-file log driver, which stores logs in dedicated directories on the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/docker/containers/&amp;lt;container_id&amp;gt;/&amp;lt;container_id&amp;gt;-json.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The long answer, and what I’d suggest you do, is to set up a dedicated logging container that will structure and enrich your container logs, then send them to a central location. This makes troubleshooting and searching through logs much easier. But you also get &lt;a href="https://sematext.com/alerts/" rel="noopener noreferrer"&gt;alerting&lt;/a&gt;, which is the main point. You want to know what breaks before your users do.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hope you guys and girls enjoyed reading this as much as I enjoyed writing it. If you liked it, feel free to hit the share button so more people will see this tutorial. Until next time, be curious and have fun.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>logs</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Performance Best Practices: Running and Monitoring Express.js in Production</title>
      <dc:creator>Adnan Rahić</dc:creator>
      <pubDate>Tue, 28 Apr 2020 10:50:34 +0000</pubDate>
      <link>https://dev.to/sematext/performance-best-practices-running-and-monitoring-express-js-in-production-3e82</link>
      <guid>https://dev.to/sematext/performance-best-practices-running-and-monitoring-express-js-in-production-3e82</guid>
      <description>&lt;p&gt;What is the most important feature an Express.js application can have? Maybe using sockets for real-time chats or GraphQL instead of REST APIs? Come on, tell me. What’s the most amazing, sexy, and hyped feature you have in your Express.js application?&lt;/p&gt;

&lt;p&gt;Want to guess what mine is? &lt;strong&gt;Optimal performance with minimal downtime&lt;/strong&gt;. If your users can't use your application, what's the point of fancy features?&lt;/p&gt;

&lt;p&gt;In the past four years, I've learned that performant Express.js applications need to do four things well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensure minimal downtime&lt;/li&gt;
&lt;li&gt;Have predictable resource usage&lt;/li&gt;
&lt;li&gt;Scale effectively based on load&lt;/li&gt;
&lt;li&gt;Increase developer productivity by minimizing time spent on troubleshooting and debugging&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the past, I've talked a lot about how to improve&lt;a href="https://sematext.com/blog/top-nodejs-metrics-to-watch/" rel="noopener noreferrer"&gt; Node.js performance and related key metrics&lt;/a&gt; you have to monitor. I've also covered the bad practices in Node.js you should avoid, such as blocking the thread and creating memory leaks, as well as how to boost the performance of your application with the&lt;a href="https://sematext.com/docs/integration/express.js/#use-the-cluster-module-to-run-nodejs" rel="noopener noreferrer"&gt; cluster module&lt;/a&gt;,&lt;a href="https://sematext.com/docs/integration/express.js/#use-pm2-to-run-nodejs" rel="noopener noreferrer"&gt; PM2&lt;/a&gt;, &lt;a href="https://www.nginx.com/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; and&lt;a href="https://redis.io/" rel="noopener noreferrer"&gt; Redis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first step is to go back to basics and build up knowledge about the tool you are using. In our case, the tool is JavaScript. Lastly, I'll cover how to add&lt;a href="https://sematext.com/docs/integration/express.js/#collected-expressjs-logs" rel="noopener noreferrer"&gt; structured logging&lt;/a&gt; and use&lt;a href="https://sematext.com/docs/integration/express.js/#collected-expressjs-metrics" rel="noopener noreferrer"&gt; metrics to pinpoint performance issues in Express.js&lt;/a&gt; applications, like memory leaks.&lt;/p&gt;

&lt;p&gt;In&lt;a href="https://sematext.com/blog/nodejs-open-source-monitoring-tools/" rel="noopener noreferrer"&gt; a previous article&lt;/a&gt;, I explained how to monitor Node.js applications with five different open-source tools. They may not have full-blown features like the &lt;a href="https://sematext.com/docs/integration/express.js/" rel="noopener noreferrer"&gt;Sematext Express.js monitoring integration&lt;/a&gt;, Datadog, or New Relic, but keep in mind they’re open-source products and can hold their own just fine.&lt;/p&gt;

&lt;p&gt;In this article, I want to cover my experience from the last four years, mainly the best practices you should stick to, but also the bad things you should throw out right away. After reading this article you'll learn what you need to do to make sure you have a performant Express.js application with minimal downtime.&lt;/p&gt;

&lt;p&gt;In short, you'll learn about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Creating an intuitive structure for an &lt;a href="https://expressjs.com/" rel="noopener noreferrer"&gt;Express.js&lt;/a&gt; application&lt;/li&gt;
&lt;li&gt;  Hints for improving Express.js application performance&lt;/li&gt;
&lt;li&gt;  Using&lt;a href="https://en.wikipedia.org/wiki/Test-driven_development" rel="noopener noreferrer"&gt; test-driven development&lt;/a&gt; and&lt;a href="https://www.freecodecamp.org/news/functional-programming-principles-in-javascript-1b8fc6c3563f/" rel="noopener noreferrer"&gt; functional programming paradigms in JavaScript&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Handling exceptions and errors gracefully&lt;/li&gt;
&lt;li&gt;  Using&lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt; Sematext Logs&lt;/a&gt; for logging and error handling&lt;/li&gt;
&lt;li&gt;  Using&lt;a href="https://www.npmjs.com/package/dotenv" rel="noopener noreferrer"&gt; dotenv&lt;/a&gt; to handle environment variables and configurations&lt;/li&gt;
&lt;li&gt;  Using&lt;a href="https://en.wikipedia.org/wiki/Systemd" rel="noopener noreferrer"&gt; Systemd&lt;/a&gt; for running Node.js scripts as a system process&lt;/li&gt;
&lt;li&gt;  Using the&lt;a href="https://nodejs.org/api/cluster.html" rel="noopener noreferrer"&gt; cluster module&lt;/a&gt; or&lt;a href="https://www.npmjs.com/package/pm2" rel="noopener noreferrer"&gt; PM2&lt;/a&gt; to enable cluster-mode load balancing&lt;/li&gt;
&lt;li&gt;  Using&lt;a href="https://www.nginx.com/" rel="noopener noreferrer"&gt; Nginx&lt;/a&gt; as a reverse proxy and load balancer&lt;/li&gt;
&lt;li&gt;  Using Nginx and&lt;a href="https://redis.io/" rel="noopener noreferrer"&gt; Redis&lt;/a&gt; to cache API request results&lt;/li&gt;
&lt;li&gt;  Using&lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt; Sematext Monitoring&lt;/a&gt; for performance monitoring and troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My goal is for you to use this article to embrace Express.js best practices and a DevOps mindset. You want to have the best possible performance with minimal downtime and ensure high developer productivity. The goal is to solve issues quickly if they occur, and trust me, they always do.&lt;/p&gt;

&lt;p&gt;Let's go back to basics, and talk a bit about Express.js.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Structure Express.js Applications
&lt;/h2&gt;

&lt;p&gt;Having an intuitive file structure will play a huge role in making your life easier. You will have an easier time adding new features as well as refactoring technical debt.&lt;/p&gt;

&lt;p&gt;The approach I stick to looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
  config/
    - configuration files
  controllers/
    - routes with provider functions as callback functions
  providers/
    - business logic for controller routes
  services/
    - common business logic used in the provider functions
  models/
    - database models
  routes.js
    - load all routes
  db.js
    - load all models
  app.js
    - load all of the above
test/
  unit/
    - unit tests
  integration/
    - integration tests
server.js
  - load the app.js file and listen on a port
(cluster.js)
  - load the app.js file and create a cluster that listens on a port
test.js
  - main test file that will run all test cases under the test/ directory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup you can limit the file size to around 100 lines, making code reviews and troubleshooting much less of a nightmare. Have you ever had to review a pull request where every file has more than 500 lines of code? Guess what, it's not fun.&lt;/p&gt;

&lt;p&gt;There's a little thing I like to call separation of concerns. You don't want to create clusterfucks of logic in a single file. Separate concerns into their dedicated files. That way you can limit the context switching that happens when reading a single file. It's also very useful when merging to master often because it's much less prone to cause merge conflicts.&lt;/p&gt;

&lt;p&gt;To enforce rules like this across your team you can also set up a linter to tell you when you go over a set limit of lines in a file, as well as if a single line is above 100 characters long. One of my favorite settings, by the way.&lt;/p&gt;
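&lt;p&gt;ESLint has core rules for exactly this. Here's a minimal config sketch; the rule names are real ESLint rules, while the thresholds are just the ones I like:&lt;/p&gt;

```javascript
// A sketch of an .eslintrc.js enforcing the limits mentioned above.
const lintConfig = {
  rules: {
    // warn once a file passes ~100 lines of actual code
    'max-lines': ['warn', { max: 100, skipBlankLines: true, skipComments: true }],
    // error when a single line is longer than 100 characters
    'max-len': ['error', { code: 100 }],
  },
};

// in a real .eslintrc.js you would export it: module.exports = lintConfig
```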

&lt;h2&gt;
  
  
  How to Improve Express.js Performance and Reliability
&lt;/h2&gt;

&lt;p&gt;Express.js has a few well-known best practices you should adhere to. Below are a few I think are the most important.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set NODE_ENV=production
&lt;/h4&gt;

&lt;p&gt;Here's a quick hint to improve performance. Would you believe that merely setting the NODE_ENV environment variable to production can make your Express.js application&lt;a href="https://www.dynatrace.com/news/blog/the-drastic-effects-of-omitting-node-env-in-your-express-js-applications/" rel="noopener noreferrer"&gt; three times faster&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;In the terminal you can set it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;NODE_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, when running your server.js file, you can set it inline like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;NODE_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production node server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
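&lt;p&gt;Your own code will often branch on the same variable. Here's a tiny illustrative helper, not an Express API:&lt;/p&gt;

```javascript
// Express reads NODE_ENV itself, but your own code often needs the same
// check, e.g. to silence verbose logging in production.
function isProduction(env = process.env) {
  return (env.NODE_ENV || 'development') === 'production';
}
```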



&lt;h4&gt;
  
  
  Enable Gzip Compression
&lt;/h4&gt;

&lt;p&gt;Moving on, another important setting is to enable Gzip compression. First, install the compression npm package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add this snippet below to your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;compression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;compression&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;compression&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using a reverse proxy with Nginx, you can enable it at that level instead. That's covered in the &lt;strong&gt;Enabling Gzip Compression with Nginx&lt;/strong&gt; section a bit further down.&lt;/p&gt;

&lt;h4&gt;
  
  
  Always Use Asynchronous Functions
&lt;/h4&gt;

&lt;p&gt;The last thing you want to do is to block the thread of execution. Never use synchronous functions! Like, seriously, don't. I mean it.&lt;/p&gt;

&lt;p&gt;What you should do instead is use&lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise" rel="noopener noreferrer"&gt; Promises&lt;/a&gt; or Async/Await functions. If you only have access to sync functions, you can wrap them in an async function; just keep in mind that this defers the work to a later point in the event loop, but it does not move it off the main thread.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;foo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;some&lt;/span&gt; &lt;span class="nx"&gt;sync&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;asyncWrapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;syncFun&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;syncFun&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// the value will be returned outside of the main thread of execution&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;asyncWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you really can't avoid using synchronous functions, you can run them on a separate thread. To avoid blocking the main thread and bogging down your CPU, you can create child processes or forks to handle CPU-intensive tasks.&lt;/p&gt;

&lt;p&gt;For example, say you have a web server that handles incoming requests. To avoid blocking its thread, you can spawn a&lt;a href="https://zaiste.net/nodejs-child-process-spawn-exec-fork-async-await/" rel="noopener noreferrer"&gt; child process&lt;/a&gt; to handle a CPU-intensive task. Pretty cool. I explained this in more detail&lt;a href="https://sematext.com/blog/top-nodejs-metrics-to-watch/#toc-cpu-usage-metrics-for-nodejs-0" rel="noopener noreferrer"&gt; here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Make Sure To Do Logging Correctly
&lt;/h4&gt;

&lt;p&gt;To unify logs across your Express.js application, instead of using console.log(), you should use a logging agent to structure and &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;collect logs in a central location&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can use any &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;SaaS log management tool&lt;/a&gt; as the central location, like Sematext, Logz.io, Datadog, and many more. Think of it like a bucket where you keep logs so you can search and filter them later, but also get alerted about error logs and exceptions.&lt;/p&gt;

&lt;p&gt;I'm part of the integrations team here at&lt;a href="https://sematext.com/" rel="noopener noreferrer"&gt; Sematext&lt;/a&gt;, building&lt;a href="https://github.com/sematext?q=&amp;amp;type=source&amp;amp;language=javascript" rel="noopener noreferrer"&gt; open-source agents for Node.js&lt;/a&gt;. I put together this&lt;a href="https://github.com/sematext/sematext-agent-express" rel="noopener noreferrer"&gt; tiny open-source Express.js agent to collect logs&lt;/a&gt;. It can also collect metrics, but about that a bit further down. The agent is based on&lt;a href="https://www.npmjs.com/package/winston" rel="noopener noreferrer"&gt; Winston&lt;/a&gt; and&lt;a href="https://www.npmjs.com/package/morgan" rel="noopener noreferrer"&gt; Morgan&lt;/a&gt;. It tracks API request traffic with a middleware. This will give you per-route logs and data right away, which is crucial to track performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: Express.js &lt;em&gt;middleware&lt;/em&gt; functions are functions that have access to the&lt;a href="https://expressjs.com/en/4x/api.html#req" rel="noopener noreferrer"&gt; request object&lt;/a&gt; (req), the&lt;a href="https://expressjs.com/en/4x/api.html#res" rel="noopener noreferrer"&gt; response object&lt;/a&gt; (res), and the next middleware function in the application’s request-response cycle. The next middleware function is commonly denoted by a variable named next.&lt;/strong&gt;  - &lt;em&gt;from &lt;a href="https://expressjs.com/en/guide/using-middleware.html" rel="noopener noreferrer"&gt;Using middleware&lt;/a&gt;, expressjs.com&lt;/em&gt;&lt;/p&gt;
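&lt;p&gt;To make that signature concrete, here is a minimal custom middleware. It's illustrative, not part of the Sematext agent:&lt;/p&gt;

```javascript
// Logs the method and URL of each request, then hands control
// to the next middleware in the chain via next().
function requestLogger(req, res, next) {
  console.log(`${req.method} ${req.url}`);
  next();
}

// registered like any other middleware: app.use(requestLogger)
```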

&lt;p&gt;Here's how to add the logger and the middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stHttpLoggerMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sematext-agent-express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// At the top of your routes add the stHttpLoggerMiddleware to send API logs to Sematext&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stHttpLoggerMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Use the stLogger to send all types of logs directly to Sematext&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;An info log.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;A debug log.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;A warning log.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;An error log.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


 &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello World.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prior to requiring this agent you need to configure&lt;a href="https://github.com/sematext/sematext-agent-express#configure-environment" rel="noopener noreferrer"&gt; Sematext tokens as environment variables&lt;/a&gt;. In the dotenv section below, you will read more about configuring environment variables.&lt;/p&gt;

&lt;p&gt;Here's a quick preview of what you can get.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5x5m6esm8ze0m53tim46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5x5m6esm8ze0m53tim46.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Handle Errors and Exceptions Properly
&lt;/h4&gt;

&lt;p&gt;When using Async/Await in your code, it's a best practice to rely on try-catch statements to handle errors and exceptions. Use the unified Express logger to send the error log to a central location, so you can troubleshoot the issue with a stack trace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;baz&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Function &lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;bar&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt; threw an exception.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's also a best practice to configure a catch-all error middleware at the bottom of your routes.js file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;errorHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Catch-All error handler.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;errorHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will catch any error that gets thrown in your controllers. One last step you can take is to add listeners on the process itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;uncaughtException&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Uncaught exception&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unhandledRejection&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unhandled rejection&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these tiny snippets you'll cover all the needed precautions for handling Express errors and log collection. You now have a solid base where you don't have to worry about losing track of errors and logs. From here you can&lt;a href="https://sematext.com/docs/alerts/creating-logs-alerts/" rel="noopener noreferrer"&gt; set up alerts in the Sematext Logs UI&lt;/a&gt; and get&lt;a href="https://sematext.com/docs/integration/alerts-slack-integration/#slack-api-integration" rel="noopener noreferrer"&gt; notified through Slack&lt;/a&gt; or E-mail, which is configured by default. Don't let your customers tell you your application is broken, know before they do.&lt;/p&gt;

&lt;h4&gt;
  
  
  Watch Out For Memory Leaks
&lt;/h4&gt;

&lt;p&gt;You can't catch every problem before it happens, though. Some issues aren't rooted in exceptions that break your application. They're silent, and memory leaks in particular creep up on you when you least expect them. I explained how to avoid memory leaks in one of my&lt;a href="https://sematext.com/blog/top-nodejs-metrics-to-watch/#toc-memory-usage-and-leaks-metrics-for-nodejs-1" rel="noopener noreferrer"&gt; previous tutorials&lt;/a&gt;. What it all boils down to is preempting any possibility of a memory leak.&lt;/p&gt;

&lt;p&gt;Noticing memory leaks is easier than you might think. If your process memory keeps growing steadily while never being periodically reduced by garbage collection, you most likely have a memory leak. Ideally, you'd focus on preventing memory leaks rather than troubleshooting and debugging them, because once a leak is in your application, tracking down its root cause is horribly difficult.&lt;/p&gt;

&lt;p&gt;This is why you need to look into metrics about process and heap memory.&lt;/p&gt;
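&lt;p&gt;Before wiring up a full metrics collector, you can get a rough signal from Node.js itself. The standalone sketch below (the &lt;code&gt;sampleMemory&lt;/code&gt; helper is illustrative, not part of any library) samples &lt;code&gt;process.memoryUsage()&lt;/code&gt;; a &lt;code&gt;heapUsed&lt;/code&gt; value that only ever grows across samples hints at a leak:&lt;/p&gt;

```javascript
// Illustrative helper: snapshot the current process memory in megabytes.
function sampleMemory () {
  const { rss, heapTotal, heapUsed } = process.memoryUsage()
  return {
    rssMb: Math.round(rss / 1024 / 1024),
    heapTotalMb: Math.round(heapTotal / 1024 / 1024),
    heapUsedMb: Math.round(heapUsed / 1024 / 1024)
  }
}

// e.g. log a sample every 30 seconds and watch the heapUsedMb trend:
// setInterval(() =&gt; console.log(sampleMemory()), 30000)
console.log(sampleMemory())
```

&lt;p&gt;This is only a crude signal; a proper collector stores the samples centrally so you can correlate them with request rates.&lt;/p&gt;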

&lt;p&gt;Add a metrics collector to your Express.js application that gathers and stores all key metrics in a central location. You can later slice and dice that data to pinpoint when a memory leak happened and, most importantly, why it happened.&lt;/p&gt;

&lt;p&gt;By importing a&lt;a href="https://github.com/sematext/sematext-agent-express#configure-agent" rel="noopener noreferrer"&gt; monitoring agent from the Sematext Agent Express&lt;/a&gt; module I mentioned above, you can enable the metric collector to&lt;a href="https://github.com/sematext/sematext-agent-express#metrics" rel="noopener noreferrer"&gt; store and visualize all the data&lt;/a&gt; in the Sematext Monitoring UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9drua6bqw9xtk7vwg23w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9drua6bqw9xtk7vwg23w.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the kicker, it's only one line of code. Add this snippet in your app.js file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stMonitor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stHttpLoggerMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sematext-agent-express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;stMonitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// run the .start method on the stMonitor&lt;/span&gt;

&lt;span class="c1"&gt;// At the top of your routes add the stHttpLoggerMiddleware to send API logs to Sematext&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stHttpLoggerMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this you'll get access to several dashboards giving you key insight into everything going on with your Express.js application. You can filter and group the data to visualize processes, memory, CPU usage and HTTP requests and responses. But, what you should do right away is configure alerts to notify you when the process memory starts growing steadily without any increase in the request rate.&lt;/p&gt;

&lt;p&gt;Moving on from Express.js-specific hints and best practices, let's talk a bit about JavaScript and how to use the language itself in a more optimized and solid way.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Set Up Your JavaScript Environment
&lt;/h2&gt;

&lt;p&gt;JavaScript is neither object-oriented nor functional. Rather, it's a bit of both. I'm quite biased towards using as many functional paradigms in my code as possible. However, one surpasses all others: using pure functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pure Functions
&lt;/h3&gt;

&lt;p&gt;As the name suggests, pure functions are functions that do not mutate the outer state. They take parameters, do something with them, and return a value.&lt;/p&gt;

&lt;p&gt;Every single time you run them with the same parameters, they behave the same way and return the same value. This concept of throwing away state mutations and relying only on pure functions has simplified my life to an enormous extent.&lt;/p&gt;

&lt;p&gt;Instead of using var or let only use const, and rely on pure functions to create new objects instead of mutating existing objects. This ties into using&lt;a href="https://www.youtube.com/watch?v=BMUiFMZr7vk" rel="noopener noreferrer"&gt; higher-order functions in JavaScript&lt;/a&gt;, like .map(), .reduce(), .filter(), and many more.&lt;/p&gt;

&lt;p&gt;How to practice writing functional code? Throw out every variable declaration except for const. Now try writing a controller. &lt;/p&gt;
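&lt;p&gt;To make the contrast concrete, here's a small, hypothetical example of the same operation written impurely and purely:&lt;/p&gt;

```javascript
// Impure: mutates the array it was given, so callers see a side effect.
function addItemImpure (items, item) {
  items.push(item)
  return items
}

// Pure: builds and returns a new array, leaving the input untouched.
const addItemPure = (items, item) => [...items, item]

const cart = ['apple']
const newCart = addItemPure(cart, 'pear')
console.log(cart)    // [ 'apple' ] - the original is unchanged
console.log(newCart) // [ 'apple', 'pear' ]
```

&lt;p&gt;The pure version is trivially testable: given the same inputs, the output is always the same, and no other part of your program can be affected.&lt;/p&gt;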

&lt;h3&gt;
  
  
  Object Parameters
&lt;/h3&gt;

&lt;p&gt;JavaScript is a weakly typed language, and this can rear its ugly head when dealing with function arguments. A function call can be passed no parameters, one, or as many as you want, even though the function declaration defines a fixed number of arguments. What's even worse is that the order of the parameters is fixed, and there is no way to enforce their names, so you can't tell what is getting passed along.&lt;/p&gt;

&lt;p&gt;It's absolute lunacy! All of it, freaking crazy! Why is there no way to enforce this? But, you can solve it somewhat by using objects as function parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;foo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;param1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;param2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;param3&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;param1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;param2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;param3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Invalid parameters in function: foo.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;param1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;param2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;param3&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;param1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;98&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;param2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;81&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="c1"&gt;// &amp;lt;== the same&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of these function calls will work identically. You can enforce the names of the parameters and you're not bound by order, making it much easier to manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Freaking write tests, seriously!
&lt;/h3&gt;

&lt;p&gt;Do you know what's the best way to document your code, keep track of features and dependencies, increase community awareness, gain contributors, increase performance, increase developer productivity, have a nicer life, attract investors, raise a seed round, make millions selling your startup!?.... wait that got out of hand. &lt;/p&gt;

&lt;p&gt;Yes, you guessed it, writing tests is the answer.&lt;/p&gt;

&lt;p&gt;Let's get back on track. Write tests based on the features you want to build. Then write the feature. You will have a clear picture of what you want to build. During this process you will automatically start thinking about all the edge cases you would usually never consider.&lt;/p&gt;

&lt;p&gt;Trust me, TDD works.&lt;/p&gt;

&lt;p&gt;How to get started? Use something simple like&lt;a href="https://mochajs.org/" rel="noopener noreferrer"&gt; Mocha&lt;/a&gt; and&lt;a href="https://www.chaijs.com/" rel="noopener noreferrer"&gt; Chai&lt;/a&gt;. Mocha is a testing framework, while Chai is an assertion library.&lt;/p&gt;

&lt;p&gt;Install the npm packages with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i mocha chai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's test the foo function from above. In your main test.js file add this snippet of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;foo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/foo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;foo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should be a function&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;be&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should take one parameter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;param1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;98&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should throw error if the parameter is missing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})).&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should throw error if the parameter does not have 3 values&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;param1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})).&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;throw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should return the sum of three values&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;param1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;param3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;})).&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add this to your scripts section in the package.json:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"scripts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mocha"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run the tests by running a single command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output will be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; test-mocha@1.0.0 &lt;span class="nb"&gt;test&lt;/span&gt; /path/to/your/expressjs/project
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; mocha

foo
  ✓ should be a &lt;span class="k"&gt;function&lt;/span&gt;
  ✓ should take one parameter
  ✓ should throw error &lt;span class="k"&gt;if &lt;/span&gt;the parameter is missing
  ✓ should throw error &lt;span class="k"&gt;if &lt;/span&gt;the parameter does not have 3 values
  ✓ should &lt;span class="k"&gt;return &lt;/span&gt;the &lt;span class="nb"&gt;sum &lt;/span&gt;of three values

 5 passing &lt;span class="o"&gt;(&lt;/span&gt;6ms&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing tests gives you a feeling of clarity. And it feels freaking awesome! I feel better already.&lt;/p&gt;

&lt;p&gt;With this out of my system I'm ready for DevOps topics. Let's move on to some automation and configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use DevOps Tools To Make Running Express.js in Production Easier
&lt;/h2&gt;

&lt;p&gt;Apart from the things you can do in code, as you saw above, some things need to be configured in your environment and server setup. Starting from the basics, you need an easy way to manage environment variables, and you need to make sure your Express.js application restarts automatically in case it crashes.&lt;/p&gt;

&lt;p&gt;You also want to configure a reverse proxy and load balancer to expose your application, cache requests, and load balance traffic across multiple worker processes. The most important step in maintaining high performance is to add a metrics collector so you can visualize data across time and troubleshoot issues whenever they occur.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Environment Variables in Node.js with dotenv
&lt;/h3&gt;

&lt;p&gt;Dotenv is an npm module that lets you easily load environment variables into any Node.js application from a file.&lt;/p&gt;

&lt;p&gt;In the root of your project create a .env file. Here you'll add any environment variables you need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NODE_ENV=production
DEBUG=false
LOGS_TOKEN=xxx-yyy-zzz
MONITORING_TOKEN=xxx-yyy-zzz
INFRA_TOKEN=xxx-yyy-zzz
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loading this file is super simple. In your app.js file require dotenv at the top before anything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// dotenv at the top&lt;/span&gt;
&lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dotenv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// require any agents&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stLogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stHttpLoggerMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sematext-agent-express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// require express and instantiate the app&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stHttpLoggerMiddleware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dotenv will load a file named .env by default. If you want to have multiple dotenv files,&lt;a href="https://github.com/motdotla/dotenv#path" rel="noopener noreferrer"&gt; here's&lt;/a&gt; how you can configure them.&lt;/p&gt;
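&lt;p&gt;Under the hood, all dotenv really does is parse &lt;code&gt;KEY=VALUE&lt;/code&gt; lines and copy them onto &lt;code&gt;process.env&lt;/code&gt;. Here's a minimal, illustrative sketch of that behavior (the &lt;code&gt;parseEnv&lt;/code&gt; helper is hypothetical, not dotenv's actual API):&lt;/p&gt;

```javascript
// Illustrative sketch of what dotenv does with a .env file's contents.
function parseEnv (src) {
  const out = {}
  for (const line of src.split('\n')) {
    const trimmed = line.trim()
    // skip blank lines and comments
    if (!trimmed || trimmed.startsWith('#')) continue
    const idx = trimmed.indexOf('=')
    if (idx === -1) continue
    out[trimmed.slice(0, idx)] = trimmed.slice(idx + 1)
  }
  return out
}

const parsed = parseEnv('NODE_ENV=production\n# comment\nDEBUG=false')
Object.assign(process.env, parsed)
console.log(process.env.NODE_ENV) // prints "production"
```

&lt;p&gt;The real module handles more edge cases (quoting, multiline values), which is why you should require it rather than roll your own.&lt;/p&gt;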

&lt;h3&gt;
  
  
  Make Sure the Application Restarts Automatically With Systemd or PM2
&lt;/h3&gt;

&lt;p&gt;JavaScript is a scripting language; obviously, the name says so. What does this mean? When you start your server.js file by running node server.js, it runs the script as a process. However, if it fails, the process exits and there's nothing telling it to restart.&lt;/p&gt;

&lt;p&gt;Here's where using Systemd or PM2 comes into play. Either one works fine, but the Node.js maintainers urge us to use Systemd.&lt;/p&gt;
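&lt;p&gt;If you'd rather use PM2, the setup is roughly the following (the &lt;code&gt;fooapp&lt;/code&gt; name is just an example):&lt;/p&gt;

```shell
# install pm2 globally
npm install -g pm2

# start and daemonize the app; pm2 restarts it if it crashes
pm2 start server.js --name fooapp

# generate an init script so pm2 itself comes back after a reboot,
# then save the current process list
pm2 startup
pm2 save
```

&lt;p&gt;This is a sketch of the common workflow; check the PM2 docs for the exact &lt;code&gt;startup&lt;/code&gt; invocation your init system needs.&lt;/p&gt;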

&lt;h4&gt;
  
  
  Ensure Application Restarts with Systemd
&lt;/h4&gt;

&lt;p&gt;In short, Systemd is part of the building blocks of Linux operating systems. It runs and manages system processes. What you want is to run your Node.js process as a system service so it can recover from crashes.&lt;/p&gt;

&lt;p&gt;Here's how you do it. On your VM or server, create a new file under &lt;code&gt;/lib/systemd/system/&lt;/code&gt; called &lt;code&gt;fooapp.service&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /lib/systemd/system/fooapp.service&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;Unit]
&lt;span class="nv"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Node.js as a system service.
&lt;span class="nv"&gt;Documentation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://example.com
&lt;span class="nv"&gt;After&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;network.target
&lt;span class="o"&gt;[&lt;/span&gt;Service]
&lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;simple
&lt;span class="nv"&gt;User&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ubuntu
&lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/bin/node /path/to/your/express/project/server.js
&lt;span class="nv"&gt;Restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on-failure
&lt;span class="o"&gt;[&lt;/span&gt;Install]
&lt;span class="nv"&gt;WantedBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two important lines in this file are &lt;code&gt;ExecStart&lt;/code&gt; and &lt;code&gt;Restart&lt;/code&gt;. The &lt;code&gt;ExecStart&lt;/code&gt; directive tells Systemd to start your &lt;code&gt;server.js&lt;/code&gt; file with the &lt;code&gt;/usr/bin/node&lt;/code&gt; binary. Make sure to use an absolute path to your &lt;code&gt;server.js&lt;/code&gt; file. &lt;code&gt;Restart=on-failure&lt;/code&gt; makes sure the application restarts if it crashes. Exactly what you're looking for.&lt;/p&gt;

&lt;p&gt;Once you save the &lt;code&gt;fooapp.service&lt;/code&gt; file, reload your daemon and start the script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl daemon-reload
systemctl start fooapp
systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;fooapp
systemctl status fooapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The status command will show you the application is running as a system service. The enable command makes sure it starts on boot. That was easier than you thought, am I right?&lt;/p&gt;

&lt;h4&gt;
  
  
  Ensure Application Restarts with PM2
&lt;/h4&gt;

&lt;p&gt;PM2 has been around for a few years. It uses a custom-built script that manages and runs your &lt;code&gt;server.js&lt;/code&gt; file. It's simpler to set up, but comes with the overhead of an extra Node.js process that acts as a master process, a manager of sorts, for your Express.js application processes.&lt;/p&gt;

&lt;p&gt;First you need to install PM2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; pm2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you start your application by running this command in the root directory of your Express.js project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pm2 start server.js &lt;span class="nt"&gt;-i&lt;/span&gt; max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-i max&lt;/code&gt; flag starts the application in cluster mode, spawning as many worker processes as there are CPU cores on the server.&lt;/p&gt;

&lt;p&gt;Mentioning cluster mode is the perfect segue into the next section about load balancing, reverse proxies, and caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable Load Balancing and Reverse Proxies
&lt;/h3&gt;

&lt;p&gt;Load balancing can be done with both the Node.js cluster module or with Nginx. I'll show you my preferred setup, which is also what the peeps over at Node.js think is the right way to go.&lt;/p&gt;

&lt;h4&gt;
  
  
  Load Balancing with the Cluster Module
&lt;/h4&gt;

&lt;p&gt;The built-in cluster module in Node.js lets you spawn worker processes that will serve your application. It's based on the &lt;a href="https://nodejs.org/api/child_process.html" rel="noopener noreferrer"&gt;child_process&lt;/a&gt; implementation and, luckily for us, is very easy to set up if you have a basic Express.js application.&lt;/p&gt;

&lt;p&gt;You only really need to add one more file. Create a file called &lt;code&gt;cluster.js&lt;/code&gt; and paste this snippet of code into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cluster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;numCPUs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;os&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cpus&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./src/app&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;masterProcess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;numCPUs&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;childProcess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isMaster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nf"&gt;masterProcess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nf"&gt;childProcess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down what's happening here. When you start the &lt;code&gt;cluster.js&lt;/code&gt; file with &lt;code&gt;node cluster.js&lt;/code&gt;, the cluster module detects that it is running as the master process and invokes the &lt;code&gt;masterProcess()&lt;/code&gt; function. The &lt;code&gt;masterProcess()&lt;/code&gt; function counts how many CPU cores the server has and invokes the &lt;code&gt;cluster.fork()&lt;/code&gt; function that many times. In each forked process, the cluster module detects it is running as a child process and invokes the &lt;code&gt;childProcess()&lt;/code&gt; function, which tells the Express.js server to &lt;code&gt;.listen()&lt;/code&gt; on a port. All the workers can share the same port because the master process accepts incoming connections and distributes them to the workers over an IPC channel. Read more about how it works &lt;a href="https://nodejs.org/api/cluster.html#cluster_how_it_works" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;cluster.on('exit')&lt;/code&gt; event listener will restart a worker process if it fails.&lt;/p&gt;

&lt;p&gt;With this setup you can now edit the &lt;code&gt;ExecStart&lt;/code&gt; field in the &lt;code&gt;fooapp.service&lt;/code&gt; Systemd service file to run the &lt;code&gt;cluster.js&lt;/code&gt; file instead.&lt;/p&gt;

&lt;p&gt;Replace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/bin/node /path/to/your/express/project/server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/bin/node /path/to/your/express/project/cluster.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reload the Systemd daemon and restart the &lt;code&gt;fooapp.service&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl daemon-reload
systemctl restart fooapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There you have it. You've added load balancing to your Express.js application. Now it will scale across all the CPUs on your server.&lt;/p&gt;

&lt;p&gt;However, this will only work for a single-server setup. If you want to have multiple servers, you need Nginx.&lt;/p&gt;

&lt;h4&gt;
  
  
  Adding a Reverse Proxy with Nginx
&lt;/h4&gt;

&lt;p&gt;One of the cardinal rules of running Node.js applications is to never expose them directly on port 80 or 443. You should always use a reverse proxy to direct traffic to your application. Nginx is the most common tool used with Node.js to achieve this. It's a web server that can act as both a reverse proxy and a load balancer.&lt;/p&gt;

&lt;p&gt;Installing Nginx is rather straightforward. On Ubuntu it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update
apt &lt;span class="nb"&gt;install &lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to check the Nginx installation instructions if you're using another operating system.&lt;/p&gt;

&lt;p&gt;Nginx should start right away, but just in case, make sure to check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl status nginx

&lt;span class="o"&gt;[&lt;/span&gt;Output]
nginx.service - A high performance web server and a reverse proxy server
  Loaded: loaded &lt;span class="o"&gt;(&lt;/span&gt;/lib/systemd/system/nginx.service&lt;span class="p"&gt;;&lt;/span&gt; enabled&lt;span class="p"&gt;;&lt;/span&gt; vendor preset: enabled&lt;span class="o"&gt;)&lt;/span&gt;
  Active: active &lt;span class="o"&gt;(&lt;/span&gt;running&lt;span class="o"&gt;)&lt;/span&gt; since Fri 2018-04-20 16:08:19 UTC&lt;span class="p"&gt;;&lt;/span&gt; 3 days ago
    Docs: man:nginx&lt;span class="o"&gt;(&lt;/span&gt;8&lt;span class="o"&gt;)&lt;/span&gt;
Main PID: 2369 &lt;span class="o"&gt;(&lt;/span&gt;nginx&lt;span class="o"&gt;)&lt;/span&gt;
  Tasks: 2 &lt;span class="o"&gt;(&lt;/span&gt;limit: 1153&lt;span class="o"&gt;)&lt;/span&gt;
  CGroup: /system.slice/nginx.service
          ├─2369 nginx: master process /usr/sbin/nginx &lt;span class="nt"&gt;-g&lt;/span&gt; daemon on&lt;span class="p"&gt;;&lt;/span&gt; master_process on&lt;span class="p"&gt;;&lt;/span&gt;
          └─2380 nginx: worker process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it is not started, go ahead and run this command to start it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have Nginx running, you need to edit the configuration to enable a reverse proxy. You can find the Nginx configuration files in the &lt;code&gt;/etc/nginx/&lt;/code&gt; directory. The main configuration file is called &lt;code&gt;nginx.conf&lt;/code&gt;, while additional snippets live in the &lt;code&gt;/etc/nginx/sites-available/&lt;/code&gt; directory. The default server configuration is found there, in a file named &lt;code&gt;default&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To just enable a reverse proxy, open up the &lt;code&gt;default&lt;/code&gt; configuration file and edit it so it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;# change the port if needed&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file and restart the Nginx service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration will route all traffic hitting port 80 to your Express.js application.&lt;/p&gt;

&lt;h4&gt;
  
  
  Load Balancing with Nginx
&lt;/h4&gt;

&lt;p&gt;If you want to take it a step further and enable load balancing, here's how to do it.&lt;/p&gt;

&lt;p&gt;Now, edit the main &lt;code&gt;nginx.conf&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kn"&gt;upstream&lt;/span&gt; &lt;span class="s"&gt;fooapp&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="nf"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="s"&gt;domain2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="s"&gt;domain3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="kn"&gt;...&lt;/span&gt;
  &lt;span class="err"&gt;}&lt;/span&gt;
  &lt;span class="s"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding this &lt;code&gt;upstream&lt;/code&gt; section will create a server group that will load balance traffic across all the servers you specify.&lt;/p&gt;
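&lt;p&gt;By default, Nginx balances an &lt;code&gt;upstream&lt;/code&gt; group round-robin style. Conceptually it's nothing more than cycling through the server list, which you can sketch in a few lines of JavaScript (a toy illustration, not how Nginx is actually implemented):&lt;/p&gt;

```javascript
// Toy round-robin selection over an upstream-style server list.
function roundRobin (servers) {
  let i = 0
  return () => servers[i++ % servers.length]
}

const pick = roundRobin(['localhost:3000', 'domain2', 'domain3'])

console.log(pick()) // → localhost:3000
console.log(pick()) // → domain2
console.log(pick()) // → domain3
console.log(pick()) // → localhost:3000 (wrapped around)
```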

&lt;p&gt;You also need to edit the &lt;code&gt;default&lt;/code&gt; configuration file to point the reverse proxy to this &lt;code&gt;upstream&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://fooapp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the files and restart the Nginx service once again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Enabling Caching with Nginx
&lt;/h4&gt;

&lt;p&gt;Caching is important for reducing response times for API endpoints and resources that don't change very often.&lt;/p&gt;
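&lt;p&gt;Conceptually, a response cache is just a key-value store with an expiry, which is what &lt;code&gt;proxy_cache&lt;/code&gt; gives you at the Nginx layer. A toy in-memory version makes the idea concrete (an illustration only, with an injectable clock so expiry is deterministic):&lt;/p&gt;

```javascript
// A toy response cache: values expire after ttlMs milliseconds.
function createCache (ttlMs, now = Date.now) {
  const store = new Map()
  return {
    set (key, value) { store.set(key, { value, expires: now() + ttlMs }) },
    get (key) {
      const entry = store.get(key)
      if (!entry) return undefined
      if (now() > entry.expires) { store.delete(key); return undefined }
      return entry.value
    }
  }
}

// Inject a fake clock so the expiry behaviour is easy to observe.
let t = 0
const cache = createCache(1000, () => t)
cache.set('/data/42', '{"hello":"world"}')

console.log(cache.get('/data/42')) // → {"hello":"world"} (fresh, served from cache)
t = 2000
console.log(cache.get('/data/42')) // → undefined (expired, would hit the upstream)
```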

&lt;p&gt;Once again, edit the &lt;code&gt;nginx.conf&lt;/code&gt; file and add this directive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kn"&gt;proxy_cache_path&lt;/span&gt; &lt;span class="n"&gt;/data/nginx/cache&lt;/span&gt; &lt;span class="s"&gt;levels=1:2&lt;/span&gt;   &lt;span class="s"&gt;keys_zone=STATIC:10m&lt;/span&gt;
  &lt;span class="s"&gt;inactive=24h&lt;/span&gt; &lt;span class="s"&gt;max_size=1g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kn"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open up the &lt;code&gt;default&lt;/code&gt; configuration file again. Add these lines of code as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_pass&lt;/span&gt;             &lt;span class="s"&gt;http://fooapp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt;       &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_buffering&lt;/span&gt;       &lt;span class="no"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_cache&lt;/span&gt;           &lt;span class="s"&gt;STATIC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_cache_valid&lt;/span&gt;      &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="s"&gt;1d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
       &lt;span class="kn"&gt;proxy_cache_use_stale&lt;/span&gt;  &lt;span class="s"&gt;error&lt;/span&gt; &lt;span class="s"&gt;timeout&lt;/span&gt; &lt;span class="s"&gt;invalid_header&lt;/span&gt; &lt;span class="s"&gt;updating&lt;/span&gt;
            &lt;span class="s"&gt;http_500&lt;/span&gt; &lt;span class="s"&gt;http_502&lt;/span&gt; &lt;span class="s"&gt;http_503&lt;/span&gt; &lt;span class="s"&gt;http_504&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save both files and restart the Nginx service once again.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enabling Gzip Compression with Nginx
&lt;/h4&gt;

&lt;p&gt;To improve performance even more, go ahead and enable Gzip. In the server block of your Nginx configuration file, add these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kn"&gt;gzip&lt;/span&gt; &lt;span class="no"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kn"&gt;gzip_types&lt;/span&gt;     &lt;span class="nc"&gt;text/plain&lt;/span&gt; &lt;span class="nc"&gt;application/xml&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kn"&gt;gzip_proxied&lt;/span&gt;    &lt;span class="s"&gt;no-cache&lt;/span&gt; &lt;span class="s"&gt;no-store&lt;/span&gt; &lt;span class="s"&gt;private&lt;/span&gt; &lt;span class="s"&gt;expired&lt;/span&gt; &lt;span class="s"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kn"&gt;gzip_min_length&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kn"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to check out more configuration options about Gzip compression in Nginx,&lt;a href="https://docs.nginx.com/nginx/admin-guide/web-server/compression/" rel="noopener noreferrer"&gt; check this out&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enabling Caching with Redis
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; in an in-memory data store, which is often used as a cache.&lt;/p&gt;

&lt;p&gt;Installing it on Ubuntu is rather simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt update
apt &lt;span class="nb"&gt;install &lt;/span&gt;redis-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download and install Redis and its dependencies. There is one important configuration change to make in the Redis configuration file that was generated during the installation.&lt;/p&gt;

&lt;p&gt;Open up the &lt;code&gt;/etc/redis/redis.conf&lt;/code&gt; file. You have to change one line from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;supervised no
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;supervised systemd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the only change you need to make to the Redis configuration file at this point, so save and close it when you are finished. Then, restart the Redis service to reflect the changes you made to the configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart redis
systemctl status redis

&lt;span class="o"&gt;[&lt;/span&gt;Output]
● redis-server.service - Advanced key-value store
  Loaded: loaded &lt;span class="o"&gt;(&lt;/span&gt;/lib/systemd/system/redis-server.service&lt;span class="p"&gt;;&lt;/span&gt; enabled&lt;span class="p"&gt;;&lt;/span&gt; vendor preset: enabled&lt;span class="o"&gt;)&lt;/span&gt;
  Active: active &lt;span class="o"&gt;(&lt;/span&gt;running&lt;span class="o"&gt;)&lt;/span&gt; since Wed 2018-06-27 18:48:52 UTC&lt;span class="p"&gt;;&lt;/span&gt; 12s ago
    Docs: http://redis.io/documentation,
          man:redis-server&lt;span class="o"&gt;(&lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt;
Process: 2421 &lt;span class="nv"&gt;ExecStop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/bin/kill &lt;span class="nt"&gt;-s&lt;/span&gt; TERM &lt;span class="nv"&gt;$MAINPID&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;exited, &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0/SUCCESS&lt;span class="o"&gt;)&lt;/span&gt;
Process: 2424 &lt;span class="nv"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/bin/redis-server /etc/redis/redis.conf &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;exited, &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0/SUCCESS&lt;span class="o"&gt;)&lt;/span&gt;
Main PID: 2445 &lt;span class="o"&gt;(&lt;/span&gt;redis-server&lt;span class="o"&gt;)&lt;/span&gt;
  Tasks: 4 &lt;span class="o"&gt;(&lt;/span&gt;limit: 4704&lt;span class="o"&gt;)&lt;/span&gt;
  CGroup: /system.slice/redis-server.service
          └─2445 /usr/bin/redis-server 127.0.0.1:6379
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, you install the &lt;a href="https://www.npmjs.com/package/redis" rel="noopener noreferrer"&gt;redis npm module&lt;/a&gt; to access Redis from your application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i redis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can require it in your application and start caching request responses. Let me show you an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;redis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redisClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getSomethingFromDatabase&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;// Set data to Redis&lt;/span&gt;
    &lt;span class="nx"&gt;redisClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;cache&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;

  &lt;span class="nx"&gt;redisClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


    &lt;span class="c1"&gt;// If data exists return the cached value&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="c1"&gt;// If data does not exist, proceed to the getSomethingFromDatabase function&lt;/span&gt;
   &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/data/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;getSomethingFromDatabase&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Server running on Port &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This piece of code will cache the response from the database as a JSON string in the Redis cache for 3600 seconds. You can change this based on your own needs.&lt;/p&gt;
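&lt;p&gt;For reference, the write side of that cache could look something like this. This is a minimal sketch, not the article's exact code; the &lt;code&gt;cacheAndSend&lt;/code&gt; helper name is made up, and &lt;code&gt;client&lt;/code&gt; stands in for the connected Redis client used in the middleware above:&lt;/p&gt;

```javascript
// Hypothetical helper: store a database result in Redis with SETEX so it
// expires after ttlSeconds (3600 by default, matching the article), then
// send the same payload back to the HTTP client.
function cacheAndSend (client, res, id, data, ttlSeconds = 3600) {
  const payload = JSON.stringify(data)
  client.setex(id, ttlSeconds, payload) // set value and expiry in one command
  res.status(200).send(payload)
}
```

&lt;p&gt;The next request for the same &lt;code&gt;id&lt;/code&gt; within the TTL is then served by the &lt;code&gt;cache&lt;/code&gt; middleware straight from Redis, without touching the database.&lt;/p&gt;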

&lt;p&gt;With this, you've configured key settings to improve performance. But you've also introduced additional potential points of failure. What if Nginx crashes, or Redis eats up all your memory or disk space? How do you troubleshoot that?&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable VM/Server-Wide Monitoring and Logging
&lt;/h3&gt;

&lt;p&gt;Ideally, you'd configure &lt;a href="https://sematext.com/docs/monitoring/infrastructure/" rel="noopener noreferrer"&gt;an Infrastructure Agent on your VM or server&lt;/a&gt; to gather metrics and logs and send them to a central location. That way you can keep track of all infrastructure metrics like CPU, memory, disk usage, processes, etc.&lt;/p&gt;

&lt;p&gt;This is especially valuable when you run your application in cluster mode, since you can keep an eye on each separate process as well as the host itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6stle4bkm43iwvtemaez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6stle4bkm43iwvtemaez.png" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But first, we need to know what's going on with Nginx itself. You can enable the &lt;code&gt;stub_status&lt;/code&gt; module to expose basic Nginx metrics, but that alone doesn't give you much actionable insight. Instead, you can install an &lt;a href="https://sematext.com/docs/integration/nginx/" rel="noopener noreferrer"&gt;Nginx Integration&lt;/a&gt; and get insight into Nginx metrics alongside your &lt;a href="https://sematext.com/docs/integration/express.js/" rel="noopener noreferrer"&gt;Express.js Integration&lt;/a&gt; in Sematext Cloud.&lt;/p&gt;
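&lt;p&gt;For context, enabling &lt;code&gt;stub_status&lt;/code&gt; only takes a few lines of Nginx configuration. This is a sketch; the port and allow/deny rules are assumptions you should adapt to your own setup:&lt;/p&gt;

```nginx
server {
    listen 8080;

    location /nginx_status {
        stub_status;       # exposes active connections, accepts, handled, requests
        allow 127.0.0.1;   # only allow local scrapes
        deny all;
    }
}
```

&lt;p&gt;A quick curl against &lt;code&gt;/nginx_status&lt;/code&gt; returns a handful of raw counters, which is exactly why an integration that graphs them over time is more useful than the raw endpoint.&lt;/p&gt;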

&lt;p&gt;Why is monitoring Nginx important? Nginx is the entry point to your application. If it fails, your whole application fails. Your Node.js instance can be fine, but Nginx stops responding and your website goes down. You'll have no clue it's down because the Express.js application is still running without any issues.&lt;/p&gt;

&lt;p&gt;You have to keep an eye on all the points of failure in your system. That's why having proper alerting in place is so crucial. If you want to learn more about alerting you can&lt;a href="https://sematext.com/docs/guide/alerts-guide/" rel="noopener noreferrer"&gt; read this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The same goes for Redis. To keep an eye on it, check out ways to monitor Redis &lt;a href="https://haydenjames.io/using-redis-stat-for-redis-statistics-tracking/" rel="noopener noreferrer"&gt;here&lt;/a&gt; or&lt;a href="https://sematext.com/docs/integration/redis/" rel="noopener noreferrer"&gt; here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That wraps up the DevOps tools and best practices you should stick to. What a ride that was! If you want to delve deeper into learning about DevOps and tooling, check out this guide my co-worker wrote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;It took me the better part of four years to start using proper tooling and adhering to best practices. In the end, I just want to point out that the most important thing is for your application to be available and performant. Otherwise, you won't see any users stick around. If they can't use your application, what's the point?&lt;/p&gt;

&lt;p&gt;The idea behind this article was to cover best practices you should stick to, but also the bad practices to stay away from.&lt;/p&gt;

&lt;p&gt;You've learned many new things in this Express.js tutorial. From optimizing Express.js itself, creating an intuitive project structure and optimizing for performance to learning about JavaScript best practices and test-driven development. You've also learned about error handling, logging and monitoring.&lt;/p&gt;

&lt;p&gt;After all this, you can say with certainty that you've had an introduction to DevOps culture. What does that mean? Well, making sure to write reliable and performant software with test coverage, while maintaining the best possible developer productivity. That's how we as engineers continue loving our job. Otherwise, it's all mayhem.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hope you all enjoyed reading this as much as I enjoyed writing it. If you liked it, feel free to hit the share button so more people will see this tutorial. Until next time, be curious and have fun.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Running and Deploying Elasticsearch Operator on Kubernetes</title>
      <dc:creator>Adnan Rahić</dc:creator>
      <pubDate>Mon, 09 Mar 2020 18:36:33 +0000</pubDate>
      <link>https://dev.to/sematext/running-and-deploying-elasticsearch-operator-on-kubernetes-2509</link>
      <guid>https://dev.to/sematext/running-and-deploying-elasticsearch-operator-on-kubernetes-2509</guid>
      <description>&lt;p&gt;Have you ever grown tired of running the same &lt;code&gt;kubectl&lt;/code&gt; commands again and again? Well, the good folks over at the Kubernetes team understand you. With the addition of custom resources and the operator pattern, you can now make use of extensions, or addons as I like to call them, to the Kubernetes API that help you manage applications and components.&lt;/p&gt;

&lt;p&gt;Operators follow Kubernetes principles including the&lt;a href="https://kubernetes.io/docs/concepts/#kubernetes-control-plane" rel="noopener noreferrer"&gt; control loop&lt;/a&gt;. The Operator Pattern is set out to help DevOps teams manage a service or set of services by automating repeatable tasks.&lt;/p&gt;

&lt;p&gt;This article will show you the pros and cons of using the Operator Pattern versus StatefulSets, as I explained in our previous &lt;a href="https://sematext.com/blog/kubernetes-elasticsearch/" rel="noopener noreferrer"&gt;tutorial about Running and Deploying Elasticsearch on Kubernetes&lt;/a&gt;. It will also guide you through installing and running the Elasticsearch Operator on a Kubernetes cluster. I will also explain how to quickly set up basic monitoring with the &lt;a href="https://sematext.com/docs/integration/elasticsearch/" rel="noopener noreferrer"&gt;Sematext Elasticsearch monitoring integration&lt;/a&gt;.  You can also peek at &lt;a href="https://sematext.com/kubernetes/" rel="noopener noreferrer"&gt;Kubernetes monitoring&lt;/a&gt; integration on your own.&lt;/p&gt;

&lt;p&gt;Keep in mind, there are no silver bullets. Both solutions are valid, but are useful for different scenarios. At &lt;a href="https://sematext.com/" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt; we're using the StatefulSet approach, and it's working great for us.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://sematext.com/guides/elasticsearch/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt; Operator I'll be using in this tutorial is the official Operator from Elastic. It automates the deployment, provisioning, management, and orchestration of &lt;a href="https://sematext.com/blog/kubernetes-elasticsearch/" rel="noopener noreferrer"&gt;Elasticsearch on Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With that out of the way, let's jump into the tutorial!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5ghpw5gciceegih8huf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5ghpw5gciceegih8huf.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Kubernetes Operators?
&lt;/h2&gt;

&lt;p&gt;Operators are extensions to Kubernetes that use custom resources to manage applications. By using the&lt;a href="https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/" rel="noopener noreferrer"&gt; CustomResourceDefinition&lt;/a&gt; (CRD) API resource, you can define custom resources. In this tutorial you'll learn how to create a custom resource in a separate namespace.&lt;/p&gt;

&lt;p&gt;When you define a CRD object, it creates a new custom resource with a name and schema that you specify. What's so cool about this? Well, you don't have to write a custom configuration to handle the custom resource. The Kubernetes API does it all for you. It serves and handles the storage of your custom resource.&lt;/p&gt;
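&lt;p&gt;As a minimal illustration (this is a generic sketch in the style of the Kubernetes docs' CronTab example, not the Elasticsearch CRD itself), a CustomResourceDefinition looks like this:&lt;/p&gt;

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # the name must combine the plural and group fields below
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string
```

&lt;p&gt;Once a CRD like this is applied, &lt;code&gt;kubectl get crontabs&lt;/code&gt; works just like any built-in resource. The same mechanism is what gives you &lt;code&gt;kubectl get elasticsearch&lt;/code&gt; later in this tutorial.&lt;/p&gt;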

&lt;p&gt;The point of using the Operator Pattern is to help you, the DevOps engineer, automate repeatable tasks. It captures how you can write code to automate a task beyond what Kubernetes itself provides.&lt;/p&gt;

&lt;p&gt;You deploy an Operator by adding the Custom Resource Definition and Controller to your cluster. The Controller will normally run outside of the&lt;a href="https://kubernetes.io/docs/reference/glossary/?all=true#term-control-plane" rel="noopener noreferrer"&gt; control plane&lt;/a&gt;, much as you would run any containerized application. More about that a bit further down. Let me explain what the Elasticsearch Operator is first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Elasticsearch Operator?
&lt;/h2&gt;

&lt;p&gt;The Elasticsearch Operator automates the process of managing Elasticsearch on Kubernetes.&lt;/p&gt;

&lt;p&gt;There are a few different&lt;a href="https://github.com/upmc-enterprises/elasticsearch-operator" rel="noopener noreferrer"&gt; Elasticsearch Operators&lt;/a&gt; you can choose from. Some of them are made by active open-source contributors; however, only one is written and maintained by Elastic.&lt;/p&gt;

&lt;p&gt;However, I won't go into details about any of them except for the official&lt;a href="https://github.com/elastic/cloud-on-k8s" rel="noopener noreferrer"&gt; ECK Operator built by Elastic&lt;/a&gt;. For the rest of this tutorial, I'll demo how to manage and run this particular Elasticsearch Operator.&lt;/p&gt;

&lt;p&gt;ECK simplifies deploying the whole Elastic stack on Kubernetes, giving you tools to automate and streamline critical operations. You can add, remove, and update resources with ease. Like playing with Lego bricks, changing things around is incredibly simple. It also makes it much easier to handle operational and cluster administration tasks. What is streamlined?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Managing multiple clusters&lt;/li&gt;
&lt;li&gt;  Upgrading versions&lt;/li&gt;
&lt;li&gt;  Scaling cluster capacity&lt;/li&gt;
&lt;li&gt;  Changing cluster configuration&lt;/li&gt;
&lt;li&gt;  Dynamically scaling storage&lt;/li&gt;
&lt;li&gt;  Scheduling backups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Use the Elasticsearch Operator: Pros and Cons?
&lt;/h2&gt;

&lt;p&gt;When I first learned about the Operator Pattern, I had an overwhelming feeling of hype. I wanted it to be better than the "old" way. I was hoping the added automation would make managing and deploying applications on Kubernetes much easier. I was literally hoping it would be the same breakthrough as Helm.&lt;/p&gt;

&lt;p&gt;In the end, it's not. Well, at least not yet. If you compare the stars of the most popular&lt;a href="https://github.com/bitnami/charts/tree/master/bitnami/elasticsearch" rel="noopener noreferrer"&gt; Helm charts&lt;/a&gt; that configure Elasticsearch StatefulSets versus the&lt;a href="https://github.com/elastic/cloud-on-k8s" rel="noopener noreferrer"&gt; official Elasticsearch Operator&lt;/a&gt;, they're neck-and-neck. We still seem to be a bit conflicted about what to use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pld99877duivfezthni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pld99877duivfezthni.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Elasticsearch Operator vs. StatefulSet
&lt;/h3&gt;

&lt;p&gt;The Elasticsearch Operator essentially creates an additional namespace that houses tools to automate the process of creating Elasticsearch resources in your default namespace. It's literally an addon you add to your Kubernetes system to handle Elasticsearch-specific resources.&lt;/p&gt;

&lt;p&gt;This gives you more automation but also abstracts away things you might need more fine-tuned control over. Configuring your own StatefulSets can often be the better approach because this is the way the community is used to configuring Elasticsearch clusters. It also gives you more control.&lt;/p&gt;

&lt;p&gt;However, the Operator can do things that are not available with the StatefulSets. It uses Kubernetes resources in the background to automate your work with some additional features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  S3 snapshots of indexes&lt;/li&gt;
&lt;li&gt;  Automatic TLS - the operator automatically generates secrets&lt;/li&gt;
&lt;li&gt;  Spread loads across zones&lt;/li&gt;
&lt;li&gt;  Support for Kibana and Cerebro&lt;/li&gt;
&lt;li&gt;  Instrumentation with statsd&lt;/li&gt;
&lt;li&gt;  Secure by default, with encryption enabled and password protected&lt;/li&gt;
&lt;li&gt;  Official Operator maintained by Elastic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Use the Elasticsearch Operator?
&lt;/h3&gt;

&lt;p&gt;If you want to get up and running quickly, choose the Operator. You'll get all of this out of the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Elasticsearch, Kibana and APM Server deployments&lt;/li&gt;
&lt;li&gt;  TLS certificates management&lt;/li&gt;
&lt;li&gt;  Safe Elasticsearch cluster configuration &amp;amp; topology changes&lt;/li&gt;
&lt;li&gt;  Persistent volumes usage&lt;/li&gt;
&lt;li&gt;  Custom node configuration and attributes&lt;/li&gt;
&lt;li&gt;  Secure settings keystore updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, keep in mind there are downsides.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Stay Away From the Elasticsearch Operator?
&lt;/h3&gt;

&lt;p&gt;Like with any new and exciting tool, there are a few issues. The biggest one being that it's a totally new tool you need to learn. Here are my reasons for staying away from the Operator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  An additional tool to learn&lt;/li&gt;
&lt;li&gt;  Additional Kubernetes resources in a separate namespace to worry about&lt;/li&gt;
&lt;li&gt;  Additional resources create overhead&lt;/li&gt;
&lt;li&gt;  Less fine-tuned control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of what the Elasticsearch Operator offers is already available with prebuilt Helm charts.&lt;/p&gt;

&lt;p&gt;With that out of the way, let's start by building something!&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Run and Deploy the Elasticsearch Operator on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Installing the Elasticsearch Operator is as simple as running one command. Don't believe me? Follow along and find out for yourself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftch9hsh3ep1lyktth80l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftch9hsh3ep1lyktth80l.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;To follow along with this tutorial you’ll need a few things first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A Kubernetes cluster with role-based access control (RBAC) enabled.

&lt;ul&gt;
&lt;li&gt;  Ensure your cluster has enough resources available, and if not scale your cluster by adding more Kubernetes Nodes. You’ll deploy a 3-Pod Elasticsearch cluster. I’d suggest you have 3 Kubernetes Nodes with at least 4GB of RAM and 10GB of storage.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  The &lt;code&gt;kubectl&lt;/code&gt; command-line tool installed on your local machine, configured to connect to your cluster. You can read more about how to install kubectl&lt;a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/" rel="noopener noreferrer"&gt; in the official documentation&lt;/a&gt;.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installing the Elasticsearch Operator
&lt;/h3&gt;

&lt;p&gt;This command will install&lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/" rel="noopener noreferrer"&gt; custom resource definitions&lt;/a&gt; and the Operator with RBAC rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://download.elastic.co/downloads/eck/1.0.0/all-in-one.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you've installed the Operator, you can check the resources by running this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; elastic-system get all
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                     READY   STATUS   RESTARTS   AGE
pod/elastic-operator-0   1/1     Running   0         18s
​
NAME                             TYPE       CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;   AGE
service/elastic-webhook-server   ClusterIP   10.96.52.149   &amp;lt;none&amp;gt;        443/TCP   19s
​
NAME                               READY   AGE
statefulset.apps/elastic-operator   1/1     19s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the Operator lives in the &lt;code&gt;elastic-system&lt;/code&gt; namespace. You can monitor the logs of the Operator's StatefulSet with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; elastic-system logs &lt;span class="nt"&gt;-f&lt;/span&gt; statefulset.apps/elastic-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better way of monitoring logs on a cluster-level is to add the &lt;a href="https://sematext.com/docs/agents/sematext-agent/kubernetes/installation/#sematext-operator" rel="noopener noreferrer"&gt;Sematext Operator&lt;/a&gt; to collect these logs and send them to a central location, alongside performance metrics about your Elasticsearch cluster. It’s pretty straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/sematext/sematext-operator/master/bundle.yaml

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: sematext.com/v1alpha1
kind: SematextAgent
metadata:
  name: sematext-agent
spec:
  region: &amp;lt;"US" or "EU"&amp;gt;
  containerToken: YOUR_CONTAINER_TOKEN
  logsToken: YOUR_LOGS_TOKEN
  infraToken: YOUR_INFRA_TOKEN
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All you need are these two commands above, and you’re set to go. Next up, let's take a look at the CRDs that were created as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                                           CREATED AT
apmservers.apm.k8s.elastic.co                  2020-02-05T15:46:33Z
elasticsearches.elasticsearch.k8s.elastic.co   2020-02-05T15:46:33Z
kibanas.kibana.k8s.elastic.co                  2020-02-05T15:46:33Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are the APIs you'll have access to, in order to streamline the process of creating and managing Elasticsearch resources in your Kubernetes cluster. Next up, let's deploy an Elasticsearch cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying the Elasticsearch Cluster
&lt;/h3&gt;

&lt;p&gt;Once the Operator is installed, you'll get access to the &lt;code&gt;elasticsearch.k8s.elastic.co/v1&lt;/code&gt; API, and you can spin up an Elasticsearch cluster in no time. Run this command to create an Elasticsearch cluster with a single node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.5.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give it a minute to start. You can check the cluster health during the creation process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get elasticsearch
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME           HEALTH   NODES   VERSION   PHASE   AGE
elasticsearch   green    1       7.5.2     Ready   61s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have a running Elasticsearch Pod, which is tied to a StatefulSet in the default namespace. Alongside this, you also have two Services you can expose to access the Pod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                             READY   STATUS   RESTARTS   AGE
pod/elasticsearch-es-default-0   1/1     Running   0         2m18s
NAME                               TYPE       CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;   AGE
service/elasticsearch-es-default   ClusterIP   None           &amp;lt;none&amp;gt;       &amp;lt;none&amp;gt;     2m18s
service/elasticsearch-es-http     ClusterIP   10.96.192.180   &amp;lt;none&amp;gt;        9200/TCP   2m19s
service/kubernetes                 ClusterIP   10.96.0.1       &amp;lt;none&amp;gt;        443/TCP   2d2h
NAME                                       READY   AGE
statefulset.apps/elasticsearch-es-default   1/1     2m18s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make sure your Pod is working, check its logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs elasticsearch-es-default-0
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see logs streaming in, you know it's working. The &lt;code&gt;elasticsearch-es-http&lt;/code&gt; Service has a ClusterIP you can reach the cluster through, and credentials are generated for you automatically.&lt;/p&gt;

&lt;p&gt;First, open up another terminal window and expose the &lt;code&gt;elasticsearch-es-http&lt;/code&gt; service, so you can access it from your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward service/elasticsearch-es-http 9200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A default user named elastic is automatically created with the password stored in a Kubernetes secret. Back in your initial terminal window, run this command to retrieve the password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get secret elasticsearch-es-elastic-user &lt;span class="nt"&gt;-o&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.elastic}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;--decode&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use curl to test the endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"elastic:&lt;/span&gt;&lt;span class="nv"&gt;$PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="s2"&gt;"https://localhost:9200"&lt;/span&gt;
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
&lt;span class="o"&gt;{&lt;/span&gt;
   &lt;span class="s2"&gt;"name"&lt;/span&gt; : &lt;span class="s2"&gt;"elasticsearch-es-default-0"&lt;/span&gt;,
   &lt;span class="s2"&gt;"cluster_name"&lt;/span&gt; : &lt;span class="s2"&gt;"elasticsearch"&lt;/span&gt;,
   &lt;span class="s2"&gt;"cluster_uuid"&lt;/span&gt; : &lt;span class="s2"&gt;"7auDvcXLTwqLmXfBcAXIqg"&lt;/span&gt;,
   &lt;span class="s2"&gt;"version"&lt;/span&gt; : &lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="s2"&gt;"number"&lt;/span&gt; : &lt;span class="s2"&gt;"7.5.2"&lt;/span&gt;,
       &lt;span class="s2"&gt;"build_flavor"&lt;/span&gt; : &lt;span class="s2"&gt;"default"&lt;/span&gt;,
       &lt;span class="s2"&gt;"build_type"&lt;/span&gt; : &lt;span class="s2"&gt;"docker"&lt;/span&gt;,
       &lt;span class="s2"&gt;"build_hash"&lt;/span&gt; : &lt;span class="s2"&gt;"8bec50e1e0ad29dad5653712cf3bb580cd1afcdf"&lt;/span&gt;,
       &lt;span class="s2"&gt;"build_date"&lt;/span&gt; : &lt;span class="s2"&gt;"2020-01-15T12:11:52.313576Z"&lt;/span&gt;,
       &lt;span class="s2"&gt;"build_snapshot"&lt;/span&gt; : &lt;span class="nb"&gt;false&lt;/span&gt;,
       &lt;span class="s2"&gt;"lucene_version"&lt;/span&gt; : &lt;span class="s2"&gt;"8.3.0"&lt;/span&gt;,
       &lt;span class="s2"&gt;"minimum_wire_compatibility_version"&lt;/span&gt; : &lt;span class="s2"&gt;"6.8.0"&lt;/span&gt;,
       &lt;span class="s2"&gt;"minimum_index_compatibility_version"&lt;/span&gt; : &lt;span class="s2"&gt;"6.0.0-beta1"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
   &lt;span class="s2"&gt;"tagline"&lt;/span&gt; : &lt;span class="s2"&gt;"You Know, for Search"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hey presto! It works. This might be good for starters, but the cluster only has one Pod. Let's spice things up a bit and add a few more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfvojycl0ugxebvbcx22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfvojycl0ugxebvbcx22.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Upgrade and Configure the Elasticsearch Cluster
&lt;/h2&gt;

&lt;p&gt;Any edits you make to the configuration will automatically upgrade the cluster. The Operator will try to apply every configuration change you request, except for resizing existing volume claims, which is not supported. Make sure your Kubernetes cluster has enough resources to handle any scaling you do.&lt;/p&gt;

&lt;p&gt;If you want to have 3 Pods, run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.5.2
  nodeSets:
  - name: default
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will bump up the Pod count. Check out&lt;a href="https://github.com/elastic/cloud-on-k8s/blob/master/config/samples/elasticsearch/elasticsearch.yaml" rel="noopener noreferrer"&gt; this sample&lt;/a&gt; to see all the configuration options. Let's check if our Pods have updated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get all
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                             READY   STATUS   RESTARTS   AGE
pod/elasticsearch-es-default-0   1/1     Running   0         25m
pod/elasticsearch-es-default-1   1/1     Running   0         3m8s
pod/elasticsearch-es-default-2   1/1     Running   0         2m46s
​
NAME                               TYPE       CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;   AGE
service/elasticsearch-es-default   ClusterIP   None           &amp;lt;none&amp;gt;       &amp;lt;none&amp;gt;     25m
service/elasticsearch-es-http     ClusterIP   10.96.192.180   &amp;lt;none&amp;gt;        9200/TCP   25m
service/kubernetes                 ClusterIP   10.96.0.1       &amp;lt;none&amp;gt;        443/TCP   2d2h
​
NAME                                       READY   AGE
statefulset.apps/elasticsearch-es-default   3/3     25m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Awesome! Our cluster is starting to take shape. Note that, by default, the cluster you deployed only allocates a persistent volume of 1 GB for storage, using the default&lt;a href="https://kubernetes.io/docs/concepts/storage/storage-classes/" rel="noopener noreferrer"&gt; storage class&lt;/a&gt; defined for the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Here's a sample of what adding more storage looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.5.2
  nodeSets:
  - name: default
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Gi
        storageClassName: standard
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll most likely want to have more control over this for production workloads. Check out the&lt;a href="https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-volume-claim-templates.html" rel="noopener noreferrer"&gt; Volume claim templates&lt;/a&gt; for more information.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Run and Deploy Kibana with the Elasticsearch Operator
&lt;/h2&gt;

&lt;p&gt;This Operator is called ECK (Elastic Cloud on Kubernetes) for a reason: it comes packaged with Kibana. In one of the sections above we ran this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get crd

&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                                           CREATED AT
apmservers.apm.k8s.elastic.co                  2020-02-05T15:46:33Z
elasticsearches.elasticsearch.k8s.elastic.co   2020-02-05T15:46:33Z
kibanas.kibana.k8s.elastic.co                  2020-02-05T15:46:33Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check it out. You have a &lt;code&gt;kibana.k8s.elastic.co/v1&lt;/code&gt; API as well. This is what you'll use to create your Kibana instance.&lt;/p&gt;

&lt;p&gt;Go ahead and specify a Kibana instance and reference your Elasticsearch cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; | kubectl apply -f -
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: elasticsearch
spec:
  version: 7.5.2
  count: 1
  elasticsearchRef:
    name: elasticsearch
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give it a second to spin up the Pod. Similar to Elasticsearch, you can retrieve details about Kibana instances with this simple command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get kibana
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME           HEALTH   NODES   VERSION   AGE
elasticsearch   green    1       7.5.2     2m31s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait until the health is green, then check the Pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod &lt;span class="nt"&gt;--selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'kibana.k8s.elastic.co/name=elasticsearch'&lt;/span&gt;
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                               READY   STATUS   RESTARTS   AGE
elasticsearch-kb-5f568dcdb6-xd55w   1/1     Running   0         3m19s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the Kibana Pod is up and running as well, you can set up access to it. A ClusterIP Service is automatically created for Kibana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get service elasticsearch-kb-http
​
&lt;span class="o"&gt;[&lt;/span&gt;Output]
NAME                   TYPE       CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;   AGE
elasticsearch-kb-http   ClusterIP   10.96.199.44   &amp;lt;none&amp;gt;        5601/TCP   4m24s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, open up another terminal window, and use &lt;code&gt;kubectl port-forward&lt;/code&gt; to access Kibana from your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward service/elasticsearch-kb-http 5601
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;https://localhost:5601&lt;/code&gt; in your browser. Log in as the elastic user. Get the password with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret elasticsearch-es-elastic-user &lt;span class="nt"&gt;-o&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.data.elastic}'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;--decode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you're signed in, you'll see the Kibana quickstart screen.&lt;/p&gt;

&lt;p&gt;There you have it. You've added a Kibana instance to your Kubernetes cluster.&lt;/p&gt;
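&lt;p&gt;As a quick sanity check – assuming the &lt;code&gt;kubectl port-forward&lt;/code&gt; from the previous step is still running – you can also query Kibana's status API from the command line. This reuses the password lookup shown above; the &lt;code&gt;-k&lt;/code&gt; flag skips TLS verification because the Operator issues a self-signed certificate by default:&lt;/p&gt;

```shell
# Fetch the elastic user's password from the Secret created by the Operator
PASSWORD=$(kubectl get secret elasticsearch-es-elastic-user -o=jsonpath='{.data.elastic}' | base64 --decode)

# Query Kibana's status endpoint through the port-forward
curl -k -u "elastic:$PASSWORD" "https://localhost:5601/api/status"
```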

&lt;h2&gt;
  
  
  Cleaning Up and Deleting the Elasticsearch Operator
&lt;/h2&gt;

&lt;p&gt;With all resources installed and working, you should see this when running &lt;code&gt;kubectl get all&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;NAME                                   READY   STATUS   RESTARTS   AGE
pod/elasticsearch-es-default-0          1/1     Running   0         13m
pod/elasticsearch-kb-5f568dcdb6-xd55w   1/1     Running   0         11m
​
NAME                               TYPE       CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;   AGE
service/elasticsearch-es-default   ClusterIP   None           &amp;lt;none&amp;gt;       &amp;lt;none&amp;gt;     13m
service/elasticsearch-es-http     ClusterIP   10.96.168.225   &amp;lt;none&amp;gt;        9200/TCP   13m
service/elasticsearch-kb-http     ClusterIP   10.96.199.44   &amp;lt;none&amp;gt;        5601/TCP   11m
service/kubernetes                 ClusterIP   10.96.0.1       &amp;lt;none&amp;gt;        443/TCP   2d3h
​
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/elasticsearch-kb   1/1     1            1           11m
​
NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/elasticsearch-kb-5f568dcdb6   1         1         1       11m
​
NAME                                       READY   AGE
statefulset.apps/elasticsearch-es-default   1/1     13m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Way to go, you've configured an Elasticsearch cluster with Kibana using the Elasticsearch Operator! But, what if you need to delete resources? Easy. Run two commands and you're done.&lt;/p&gt;

&lt;p&gt;First, delete all Elastic resources from all namespaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete elastic &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, delete the Operator itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; https://download.elastic.co/downloads/eck/1.0.0/all-in-one.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, all clean!&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts About the Elasticsearch Operator
&lt;/h2&gt;

&lt;p&gt;In this tutorial you've learned about the Kubernetes Operator pattern, and how to run and deploy the Elasticsearch Operator on a Kubernetes cluster. You've also scaled up the number of Elasticsearch Pods on the cluster, and installed Kibana.&lt;/p&gt;

&lt;p&gt;With this knowledge on top of what you learned in part 1 of this series, you can make a decision whether to use a Helm chart with StatefulSets or the Elasticsearch Operator.&lt;/p&gt;

&lt;p&gt;Why bother learning Operators?&lt;/p&gt;

&lt;p&gt;In the last year we've witnessed a huge increase in popularity for the Operator Pattern. Right now, the official Elasticsearch Operator has the same number of stars on GitHub as the most popular Elasticsearch Helm chart, and this popularity seems set to keep growing.&lt;/p&gt;

&lt;p&gt;What can you do now? Contribute! Learn even more about&lt;a href="https://sematext.com/kubernetes/" rel="noopener noreferrer"&gt; Kubernetes&lt;/a&gt;, and give back to the community. These projects are open-source for a reason. Help them grow!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hope you guys and girls enjoyed reading this as much as I enjoyed writing it. If you liked it, feel free to hit the share button so more people will see this tutorial. Until next time, be curious and have fun.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>elasticsearch</category>
      <category>showdev</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Java Garbage Collection Tuning</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 27 Jan 2020 11:19:14 +0000</pubDate>
      <link>https://dev.to/sematext/a-step-by-step-guide-to-java-garbage-collection-tuning-2m1g</link>
      <guid>https://dev.to/sematext/a-step-by-step-guide-to-java-garbage-collection-tuning-2m1g</guid>
      <description>&lt;p&gt;Working with Java applications has a lot of benefits. Especially when compared to languages like C/C++. In the majority of cases, you get interoperability between operating systems and various environments. You can move your applications from server to server, from operating system to operating system, without major effort or in rare cases with minor changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8hdcejqb83zru5a0s6cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8hdcejqb83zru5a0s6cr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most interesting benefits of running a JVM based application is automatic memory handling. When you create an object in your code, it is allocated on the heap and stays there as long as it is referenced from the code. When it is no longer needed, it has to be removed from memory to make room for new objects. In programming languages like C or C++, cleaning up memory is done by us, programmers, manually in the code. In languages like Java or Kotlin, we don’t need to take care of that – it is done automatically by the JVM, by its garbage collector.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is Garbage Collection Tuning?
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Garbage Collection (GC) tuning&lt;/strong&gt; is the process of adjusting the startup parameters of your JVM-based application to match the desired results. Nothing more and nothing less. It can be as simple as adjusting the heap size – the &lt;em&gt;-Xmx&lt;/em&gt; and &lt;em&gt;-Xms&lt;/em&gt; parameters – which is, by the way, what you should start with. Or it can be as complicated as tuning all the advanced parameters to adjust the different heap regions. Everything depends on the situation and your needs.&lt;/p&gt;
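&lt;p&gt;As a minimal sketch of what such a starting point looks like (the jar name below is just a placeholder for your own application):&lt;/p&gt;

```shell
# Fixed 1GB heap: minimum (-Xms) and maximum (-Xmx) set to the same value
java -Xms1g -Xmx1g -jar my-service.jar
```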

&lt;h1&gt;
  
  
  Why Is Garbage Collection Tuning Important?
&lt;/h1&gt;

&lt;p&gt;Cleaning our applications’ JVM process heap memory is not free. There are resources that need to be designated for the garbage collector so it can do its work. You can imagine that instead of handling the business logic of our application the CPU can be busy handling the removal of unused data from the heap.&lt;/p&gt;

&lt;p&gt;This is why it’s crucial for the garbage collector to work as efficiently as possible. The GC process can be heavy. During our work as developers and consultants, we’ve seen situations where the garbage collector was working for 20 seconds during a 60-second window of time. Meaning that 33% of the time the application was not doing its job — it was doing the housekeeping instead.&lt;/p&gt;

&lt;p&gt;We can expect threads to be stopped for very short periods of time. It happens constantly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What’s dangerous, however, is a complete stop of the application threads for a very long period of time – like seconds or in extreme cases even minutes. This can lead to your users not being able to properly use your application at all. Your distributed systems can collapse because of elements not responding in a timely manner.&lt;/p&gt;

&lt;p&gt;To avoid that, we need to ensure that the garbage collector running for our JVM applications is well configured and is doing its job as well as it can.&lt;/p&gt;

&lt;h1&gt;
  
  
  When to Do Garbage Collection Tuning?
&lt;/h1&gt;

&lt;p&gt;The first thing that you should know is that tuning the garbage collection should be one of the last operations you do. Unless you are absolutely sure that the problem lies in the garbage collection, don’t start with &lt;a href="https://sematext.com/blog/jvm-performance-tuning/" rel="noopener noreferrer"&gt;changing JVM options&lt;/a&gt;. To be blunt, there are numerous situations where the way the garbage collector works only highlights a bigger problem.&lt;/p&gt;

&lt;p&gt;If your JVM memory utilization looks good and your garbage collector works without causing trouble, you shouldn’t spend time tuning your garbage collection. You will most likely be more effective refactoring the code to be more efficient.&lt;/p&gt;

&lt;p&gt;So how do we tell whether the garbage collector is doing a good job? We can look into our monitoring, like our own &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. It will provide you with information regarding your JVM memory utilization, the garbage collector's work and, of course, the overall performance of your application. For example, have a look at the following chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fem36no6hhprzorlmjcmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fem36no6hhprzorlmjcmo.png" alt="JVM Pool Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this chart, you can see something called a “shark tooth” pattern. Usually, it is a sign of a healthy JVM heap. The largest portion of the memory, called the old generation, gets filled up and then is cleared by the garbage collector. If we correlate that with the garbage collection timings, we see the whole picture. Knowing all of that, we can judge whether we are satisfied with how the garbage collection is working or whether tuning is needed.&lt;/p&gt;

&lt;p&gt;Another thing you can look into is garbage collection logs that we discussed in the &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;Understanding Java GC Logs&lt;/a&gt; blog post. You can also use tools like jstat or any &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;profiler&lt;/a&gt;. They will give you detailed information regarding what’s happening inside your JVM, especially when it comes to heap memory and garbage collection.&lt;/p&gt;

&lt;p&gt;There is also one more thing that you should consider when thinking about garbage collection performance tuning. The default Java garbage collection settings may not be perfect for your application, so to speak. Meaning, instead of going for more hardware or for more beefy machines you may want to look into how your memory is managed. Sometimes tuning can decrease the operation cost lowering your expenses and allowing for growth without growing the environment.&lt;/p&gt;

&lt;p&gt;Once you are sure that the garbage collector is to blame and you want to optimize its behavior, you can start working on the JVM startup parameters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Garbage Collection Tuning Procedure: How to Tune Java GC
&lt;/h1&gt;

&lt;p&gt;When talking about the procedure you should follow when tuning the garbage collector, you have to remember that there are several garbage collectors available in the JVM world. When dealing with smaller heaps and older JVM versions, like version 7, 8 or 9, you will probably use the good old Concurrent Mark Sweep garbage collector for your old generation heap. With a newer version of the JVM, like 11, you are probably using G1GC. If you like experimenting, you are probably using the newest JVM version along with ZGC. You have to remember that each garbage collector works differently. Hence, the tuning procedure for them will be different.&lt;/p&gt;

&lt;p&gt;Running a JVM based application with different garbage collectors is one thing, doing experiments is another. Java garbage collection tuning requires a lot of experimentation. It’s normal that you won’t achieve the desired results on your first try. You will want to introduce changes one by one and observe how your application and the garbage collector behave after each change.&lt;/p&gt;

&lt;p&gt;Whatever your motivation for GC tuning is, I would like to make one thing clear. To be able to tune the garbage collector, you need to be able to see how it works. This means that you need visibility into GC metrics or GC logs – ideally both.&lt;/p&gt;
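&lt;p&gt;If you don't have GC logs yet, they are cheap to enable at startup. A sketch for both logging styles, with &lt;code&gt;my-service.jar&lt;/code&gt; standing in for your own application:&lt;/p&gt;

```shell
# JDK 9+ unified logging: all GC events, with time decorations, to a rolling log file
java -Xlog:gc*:file=gc.log:time,uptime:filecount=5,filesize=10m -jar my-service.jar

# JDK 8 and earlier: the legacy GC logging flags
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -jar my-service.jar
```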

&lt;h2&gt;
  
  
  Starting GC Tuning
&lt;/h2&gt;

&lt;p&gt;Start by looking at how your application behaves, what events fill up the memory space, and what space is filled. Remember that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects are allocated in the Eden space; those that survive a minor collection are moved to a Survivor space&lt;/li&gt;
&lt;li&gt;Objects in the Survivor space are moved to the Tenured generation once their age counter is high enough; otherwise, the counter is incremented.&lt;/li&gt;
&lt;li&gt;Objects in the Tenured generation are ignored by minor collections and are only cleaned up during major (full) garbage collection events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need to be sure you understand what is happening inside your application’s heap, and keep in mind what causes the garbage collection events. That will help you understand your application’s memory needs and how to improve garbage collection.&lt;/p&gt;

&lt;p&gt;Let’s start tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Size
&lt;/h2&gt;

&lt;p&gt;You would be surprised how often setting the correct heap size is overlooked. As consultants, we’ve seen a few of those, believe us. Start by checking if your heap size is really well set up.&lt;/p&gt;

&lt;p&gt;What should you consider when setting up the heap for your application? It depends on many factors of course. There are systems like Apache Solr or Elasticsearch which are heavily I/O dependent and can share the operating system file system cache. In such cases, you should leave as much memory as you can for the operating system, especially if your data is large. If your application processes a lot of data or does a lot of parsing, larger heaps may be needed.&lt;/p&gt;

&lt;p&gt;Anyway, you should remember that up to &lt;strong&gt;32GB&lt;/strong&gt; of heap size you benefit from so-called &lt;strong&gt;compressed ordinary object pointers&lt;/strong&gt;. &lt;strong&gt;Ordinary object pointers&lt;/strong&gt;, or OOPs, are 64-bit pointers to memory. They allow the JVM to reference objects on the heap. At least this is how it works without getting deep into the internals.&lt;/p&gt;

&lt;p&gt;Up to 32GB of the heap size, JVM can compress those OOPs and thus save memory. This is how you can imagine the compressed ordinary object pointer in the JVM world:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj8k29g42gwor0i8v65ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj8k29g42gwor0i8v65ax.png" alt="Compressed OOPs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first 32 bits are used for the actual memory reference and are stored on the heap. 32 bits is enough to address every object on heaps up to 32GB. How do we calculate that? A 32-bit pointer can address 2&lt;sup&gt;32&lt;/sup&gt; locations. Because of the three zeros in the tail of our pointer – objects are aligned to 8-byte boundaries, so the lowest three bits of every address are always zero – we effectively have 2&lt;sup&gt;32+3&lt;/sup&gt;, which gives us 2&lt;sup&gt;35&lt;/sup&gt; bytes, so 32GB of memory space that can be addressed. That’s the maximum heap size we can use with compressed ordinary object pointers.&lt;/p&gt;

&lt;p&gt;Going above &lt;strong&gt;32GB&lt;/strong&gt; of heap will result in the JVM using &lt;strong&gt;64-bit pointers&lt;/strong&gt;. In some cases, going from a 32GB heap to a 35GB heap leaves you with more or less the same amount of usable space, because the larger pointers themselves take up memory. That depends on your application's memory usage, but you need to take it into consideration and probably go above 35GB to see a difference.&lt;/p&gt;

&lt;p&gt;Finally, how do I choose the proper heap size? Well, monitor your usage and see how your heap behaves. You can use your monitoring for that, like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; and its &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;JVM monitoring&lt;/a&gt; capabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv0mptr310gykd9wnl8ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv0mptr310gykd9wnl8ez.png" alt="JVM Pool Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fol2as4p63ttw17c5hcfx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fol2as4p63ttw17c5hcfx.png" alt="GC Collectors Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the JVM pool size and the GC summary charts. As you can see, the JVM heap size exhibits the shark-teeth pattern – a healthy sign. Based on the first chart, we can see that we need at least 500 – 600MB of memory for this application. The point where the memory is evacuated is around 1.2GB of the total heap size, for the G1 garbage collector in this case. In this scenario, the garbage collector runs for about 2 seconds in a 60-second time period, which means that the JVM spends around 3% of its time in garbage collection. This is good and healthy.&lt;/p&gt;

&lt;p&gt;We can also look at the average garbage collection time along with the 99th and 90th percentile:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuuj6mv6kc120s04kkzqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuuj6mv6kc120s04kkzqo.png" alt="GC Collectors Time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on that information we can see that we don’t need a higher heap. Garbage collection is fast and efficiently clears the data.&lt;/p&gt;

&lt;p&gt;On the other hand, if we know that our application is actively used and processing data, its heap stays above 70 – 80% of the maximum heap we set, and we see the GC struggling, we know that we are in trouble. For example, look at this application’s memory pools:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi8neeztdvtfpdtqgmgiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi8neeztdvtfpdtqgmgiu.png" alt="JVM Pool Size and Utilization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that something started happening and that memory usage is constantly above 80% in the &lt;strong&gt;old generation&lt;/strong&gt; space. Correlate that with garbage collector work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fssifd6mcdv4f55rz5z5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fssifd6mcdv4f55rz5z5p.png" alt="GC Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can clearly see signs of high memory utilization. The garbage collector started doing more work while memory is not being cleared. That means that even though JVM is trying to clear the data – it can’t. This is a sign of trouble coming – we just don’t have enough space on the heap for new objects. But keep in mind this may also be a sign of memory leaks in the application. If you see memory growth over time and garbage collection not being able to free the memory you may be hitting an issue with the application itself. Something worth checking.&lt;/p&gt;

&lt;p&gt;So how do we set the heap size? By setting its minimum and maximum size. The minimum size is set using the &lt;em&gt;-Xms&lt;/em&gt; JVM parameter and the maximum size is set using the &lt;em&gt;-Xmx&lt;/em&gt; parameter. For example, to set the heap size of our application to &lt;strong&gt;2GB&lt;/strong&gt;, we would add &lt;strong&gt;-Xms2g -Xmx2g&lt;/strong&gt; to our application startup parameters. In most cases, I would set them to the same value to avoid heap resizing, and in addition I would add the &lt;strong&gt;-XX:+AlwaysPreTouch&lt;/strong&gt; flag to touch all heap memory pages during JVM startup, so they are committed before the application starts doing its work.&lt;/p&gt;

&lt;p&gt;We can also control the size of the young generation heap space by using the &lt;em&gt;-Xmn&lt;/em&gt; property, just like the &lt;em&gt;-Xms&lt;/em&gt; and &lt;em&gt;-Xmx&lt;/em&gt;. This allows us to explicitly define the size of the young generation heap space when needed.&lt;/p&gt;
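&lt;p&gt;Putting the flags from this section together, a sketch of a startup line might look like this (the jar name and sizes are placeholders to adapt to your own application):&lt;/p&gt;

```shell
# Fixed 2GB heap, 512MB young generation, pages pre-touched at JVM startup
java -Xms2g -Xmx2g -Xmn512m -XX:+AlwaysPreTouch -jar my-service.jar
```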

&lt;h2&gt;
  
  
  Serial Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Serial Garbage Collector is the simplest, single-threaded garbage collector. You can &lt;strong&gt;turn on&lt;/strong&gt; the &lt;strong&gt;Serial&lt;/strong&gt; garbage collector by adding the &lt;strong&gt;-XX:+UseSerialGC&lt;/strong&gt; flag to your JVM application startup parameters. We won’t focus on tuning the serial garbage collector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Parallel garbage collector is similar in its roots to the Serial garbage collector, but uses multiple threads to perform garbage collection on your application heap. You can turn on the Parallel garbage collector by adding the &lt;strong&gt;-XX:+UseParallelGC&lt;/strong&gt; flag to your JVM application startup parameters. To disable it entirely, use the &lt;strong&gt;-XX:-UseParallelGC&lt;/strong&gt; flag.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning the Parallel Garbage Collector
&lt;/h3&gt;

&lt;p&gt;As we’ve mentioned, the Parallel garbage collector &lt;strong&gt;uses multiple threads&lt;/strong&gt; to perform its cleaning duties. The &lt;strong&gt;number of threads&lt;/strong&gt; the garbage collector can use is set with the &lt;strong&gt;-XX:ParallelGCThreads&lt;/strong&gt; flag added to our application startup parameters.&lt;/p&gt;

&lt;p&gt;For example, if we would like 4 threads to do the garbage collection, we would add the following flag to our application parameters: &lt;strong&gt;-XX:ParallelGCThreads=4&lt;/strong&gt;. Keep in mind that the more threads you dedicate to cleaning duties, the faster garbage collection can get. But there is also a downside to having more garbage collection threads: each GC thread involved in a minor garbage collection event reserves a portion of the tenured generation heap for promotions. This divides the space and causes fragmentation – the more threads, the higher the fragmentation. Reducing the number of Parallel garbage collection threads and increasing the size of the old generation will help with fragmentation if it becomes an issue.&lt;/p&gt;

&lt;p&gt;The second option that can be used is &lt;strong&gt;-XX:MaxGCPauseMillis&lt;/strong&gt;. It specifies the &lt;strong&gt;maximum pause time goal&lt;/strong&gt;: the longest pause we are willing to accept for a single garbage collection event, defined in milliseconds. For example, with the flag &lt;strong&gt;-XX:MaxGCPauseMillis=100&lt;/strong&gt; we tell the Parallel garbage collector that we would like garbage collection pauses of at most 100 milliseconds. Keep in mind that this is a goal, not a guarantee: to try to meet it, the collector adjusts the heap sizes and may collect more often. If the value is too small, the application will spend the majority of its time in garbage collection instead of executing business logic.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;maximum throughput target&lt;/strong&gt; can be set by using the &lt;strong&gt;-XX:GCTimeRatio&lt;/strong&gt; flag. It defines the &lt;strong&gt;ratio&lt;/strong&gt; between the &lt;strong&gt;time spent in GC&lt;/strong&gt; and the &lt;strong&gt;time spent outside of GC&lt;/strong&gt;. The fraction of time spent in garbage collection is &lt;em&gt;1/(1 + GC_TIME_RATIO_VALUE)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For example, setting &lt;strong&gt;-XX:GCTimeRatio=9&lt;/strong&gt; means that 10% of the application’s working time may be spent in the garbage collection. This means that the application should get 9 times more working time compared to the time given to garbage collection.&lt;/p&gt;

&lt;p&gt;By default, the &lt;strong&gt;-XX:GCTimeRatio&lt;/strong&gt; flag is set to 99 by the JVM, which means that the application will get 99 times more working time than the garbage collection – a good trade-off for server-side applications.&lt;/p&gt;
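&lt;p&gt;To illustrate the arithmetic, here are a few sample values of the flag and the resulting share of time spent in garbage collection, computed as 1/(1 + GCTimeRatio):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
-XX:GCTimeRatio=9    -&gt;  1/(1+9)  = 10% of time in GC
-XX:GCTimeRatio=19   -&gt;  1/(1+19) =  5% of time in GC
-XX:GCTimeRatio=99   -&gt;  1/(1+99) =  1% of time in GC (default)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;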

&lt;p&gt;You can also control the adjustment of the generations of the Parallel garbage collector. The goals for the Parallel garbage collector are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;meet the maximum pause time goal&lt;/li&gt;
&lt;li&gt;meet the throughput goal, only if the pause time goal is met&lt;/li&gt;
&lt;li&gt;minimize the footprint, only if the first two goals are met&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Parallel garbage collector grows and shrinks the generations to achieve the goals above. Growing and shrinking the generations is done in increments at a fixed percentage. By default, a generation grows in increments of 20% and shrinks in increments of 5%. Each generation is configured on its own: the growth increment of the young generation is controlled by the &lt;strong&gt;-XX:YoungGenerationSizeIncrement&lt;/strong&gt; flag, and the growth increment of the old generation by the &lt;strong&gt;-XX:TenuredGenerationSizeIncrement&lt;/strong&gt; flag.&lt;/p&gt;

&lt;p&gt;The shrinking part can be controlled by the &lt;strong&gt;-XX:AdaptiveSizeDecrementScaleFactor&lt;/strong&gt; flag. For example, the percentage of the shrinking increment for the young generation is set by dividing the value of the &lt;strong&gt;-XX:YoungGenerationSizeIncrement&lt;/strong&gt; flag by the value of &lt;strong&gt;-XX:AdaptiveSizeDecrementScaleFactor&lt;/strong&gt; – with the default growth increment of 20% and a scale factor of 4, the shrink increment is 20 / 4 = 5%.&lt;/p&gt;

&lt;p&gt;If the pause time goal is not achieved, the generations will be shrunk one at a time. If the pause time of both generations is above the goal, the generation that caused threads to stop for a longer period of time will be shrunk first. If the throughput goal is not met, both the young and the old generation will be grown.&lt;/p&gt;

&lt;p&gt;The Parallel garbage collector can throw an &lt;strong&gt;OutOfMemoryError&lt;/strong&gt; if too much time is spent in garbage collection. By default, if more than 98% of the time is spent in garbage collection and less than 2% of the heap is recovered, such an error will be thrown. If we want to disable that behavior, we can add the &lt;strong&gt;-XX:-UseGCOverheadLimit&lt;/strong&gt; flag. But please be aware that a garbage collector working for an extensive amount of time and clearing very little or close to no memory at all usually means that your heap size is too low or your application suffers from memory leaks.&lt;/p&gt;
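&lt;p&gt;Putting the options from this section together, a hypothetical Parallel GC startup line could look like this (the flag values and jar name are illustrative, not recommendations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 \
     -XX:MaxGCPauseMillis=100 -XX:GCTimeRatio=19 \
     -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;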

&lt;p&gt;Once you know all of this, you can start looking at &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;garbage collector logs&lt;/a&gt;. They will tell you about the events that your Parallel garbage collector performs and should give you a basic idea of where to start tuning and which part of the heap is not healthy or could use some improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrent Mark Sweep Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Concurrent Mark Sweep garbage collector is a mostly concurrent implementation that performs most of its work concurrently with the application, sharing processor resources with the application threads. You can turn it on by adding the &lt;strong&gt;-XX:+UseConcMarkSweepGC&lt;/strong&gt; flag to your JVM application startup parameters. Note that CMS was deprecated in JDK 9 and removed in JDK 14, so it is only available on older Java versions.&lt;/p&gt;
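&lt;p&gt;For example (the jar name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseConcMarkSweepGC -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;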

&lt;h3&gt;
  
  
  Tuning the Concurrent Mark Sweep Garbage Collector
&lt;/h3&gt;

&lt;p&gt;Similar to other collectors available in the JVM world, the CMS garbage collector is generational, which means that you can expect two types of events to happen – minor and major collections. The idea here is that most of the work is done in parallel with the application threads, to prevent the tenured generation from getting full. During normal operation, most of the garbage collection is done without stopping application threads; during a major collection, CMS stops the threads only for two very short periods, at the beginning and in the middle of the collection. Minor collections are done in a very similar way to the Parallel garbage collector – all application threads are stopped during GC.&lt;/p&gt;

&lt;p&gt;One of the signals that your CMS garbage collector needs tuning is &lt;strong&gt;concurrent mode failures&lt;/strong&gt;. Such a failure indicates that the Concurrent Mark Sweep garbage collector was not able to reclaim unreachable objects before the old generation filled up, or that the free space in the fragmented tenured generation was not contiguous enough to promote objects into.&lt;/p&gt;

&lt;p&gt;But what about the concurrency we’ve mentioned? Let’s get back to the pauses for a while. During the concurrent phase, the CMS garbage collector pauses the application twice. The first pause is called the &lt;strong&gt;initial mark pause&lt;/strong&gt;. It is used to mark the live objects that are directly reachable from the roots and from any other place in the heap. The second pause, called the &lt;strong&gt;remark pause&lt;/strong&gt;, is done at the end of the concurrent tracing phase. It finds objects that were missed during the initial mark, mainly because they were updated in the meantime. The concurrent tracing phase runs between those two pauses, during which one or more garbage collector threads may be working to clear the garbage. After the whole cycle ends, the Concurrent Mark Sweep garbage collector waits until the next cycle while consuming close to no resources. However, be aware that during the concurrent phase your application may experience performance degradation.&lt;/p&gt;

&lt;p&gt;The collection of &lt;strong&gt;tenured generation space&lt;/strong&gt; must be &lt;strong&gt;timed&lt;/strong&gt; well when using the CMS garbage collector. Because &lt;strong&gt;concurrent mode failures&lt;/strong&gt; can be &lt;strong&gt;expensive&lt;/strong&gt;, we need to properly &lt;strong&gt;adjust&lt;/strong&gt; when the &lt;strong&gt;old generation&lt;/strong&gt; heap &lt;strong&gt;cleaning&lt;/strong&gt; &lt;strong&gt;starts&lt;/strong&gt; so that we don’t hit such events. We can do that by using the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; flag. It sets the &lt;strong&gt;percentage&lt;/strong&gt; of &lt;strong&gt;old generation&lt;/strong&gt; heap utilization at which the CMS should &lt;strong&gt;start clearing&lt;/strong&gt; it. For example, to start at 75% occupancy we would set the mentioned flag to &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction=75&lt;/strong&gt;. Of course, this is only an informative value and the garbage collector will still use heuristics to determine the best possible moment for starting its old generation cleaning job. To avoid using heuristics we can use the &lt;strong&gt;-XX:+UseCMSInitiatingOccupancyOnly&lt;/strong&gt; flag. That way the collector will stick to the percentage from the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; setting.&lt;/p&gt;

&lt;p&gt;So when setting the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; flag to a higher value, you delay the cleaning of the old generation space on the heap. This means that your application will run longer without CMS kicking in to clear the tenured space, but when the process starts it may be more expensive because there is more work to do. On the other hand, setting the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; flag to a lower value will make the CMS tenured generation cleaning happen more often, but each run may be faster. Which one to choose depends on your application and needs to be adjusted per use case.&lt;/p&gt;
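&lt;p&gt;As an illustrative example, a CMS startup line that starts old generation collection strictly at 75% occupancy could look like this (the jar name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;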

&lt;p&gt;We can also tell our garbage collector to collect the young generation heap before the remark pause or before doing a Full GC. The first is done by adding the &lt;strong&gt;-XX:+CMSScavengeBeforeRemark&lt;/strong&gt; flag to our startup parameters, the second by adding the &lt;strong&gt;-XX:+ScavengeBeforeFullGC&lt;/strong&gt; flag. Collecting the young generation first can improve garbage collection performance, as the following collection will not need to check references between the young and old generation heap spaces.&lt;/p&gt;

&lt;p&gt;The remark phase of the Concurrent Mark Sweep garbage collector can potentially be sped up. By default it is single-threaded and, as you recall, it stops all the application threads. By adding the &lt;strong&gt;-XX:+CMSParallelRemarkEnabled&lt;/strong&gt; flag to our application startup parameters, we can force the remark phase to use multiple threads. However, because of certain implementation details, the parallel version of the remark phase is not always faster than the single-threaded version. That’s something you have to check and test in your environment.&lt;/p&gt;

&lt;p&gt;Similar to the Parallel garbage collector, the Concurrent Mark Sweep garbage collector can throw an &lt;strong&gt;OutOfMemoryError&lt;/strong&gt; if &lt;strong&gt;too much time&lt;/strong&gt; is spent in &lt;strong&gt;garbage collection&lt;/strong&gt;. By default, if more than 98% of the time is spent in garbage collection and less than 2% of the heap is recovered, such an error will be thrown. If we want to disable that behavior, we can add the &lt;strong&gt;-XX:-UseGCOverheadLimit&lt;/strong&gt; flag. The difference compared to the Parallel garbage collector is that the &lt;strong&gt;time&lt;/strong&gt; that &lt;strong&gt;counts towards the 98%&lt;/strong&gt; is only the time during which the &lt;strong&gt;application threads&lt;/strong&gt; are &lt;strong&gt;stopped&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  G1 Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The G1 garbage collector is the default garbage collector since Java 9 and is targeted at latency-sensitive applications. You can turn it on by adding the &lt;strong&gt;-XX:+UseG1GC&lt;/strong&gt; flag to your JVM application startup parameters.&lt;/p&gt;
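&lt;p&gt;For example (the jar name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseG1GC -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;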

&lt;h3&gt;
  
  
  Tuning G1 Garbage Collector
&lt;/h3&gt;

&lt;p&gt;There are two things worth mentioning here. The G1 garbage collector tries to perform the longer-running operations concurrently, without stopping the application threads, while the short operations are performed in stop-the-world pauses. So it’s yet another implementation of a &lt;strong&gt;mostly concurrent&lt;/strong&gt; garbage collection algorithm.&lt;/p&gt;

&lt;p&gt;The G1 garbage collector cleans memory mostly in an &lt;strong&gt;evacuating fashion&lt;/strong&gt;: live objects from one memory area are copied to a new area and compacted along the way. After the process is done, the area the objects were copied from is again available for object allocation.&lt;/p&gt;

&lt;p&gt;On a very high level, the G1GC alternates between two phases. The first phase is called &lt;strong&gt;young-only&lt;/strong&gt; and focuses on the young generation space; during that phase, objects are gradually promoted from the young generation to the old generation space. The second phase is called &lt;strong&gt;space reclamation&lt;/strong&gt; and incrementally reclaims space in the old generation while also taking care of the young generation at the same time. Let’s look closer at those phases, as there are some properties we can tune.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;young-only phase&lt;/strong&gt; starts with a few young-generation collections that promote objects to the tenured generation. That phase is active until the old generation space reaches a certain occupancy threshold – by default 45%, controlled by the &lt;strong&gt;-XX:InitiatingHeapOccupancyPercent&lt;/strong&gt; flag. Once that threshold is hit, G1 starts a different kind of young generation collection, called &lt;strong&gt;concurrent start&lt;/strong&gt; (also known as the &lt;strong&gt;Initial Mark&lt;/strong&gt; collection). The value of &lt;strong&gt;-XX:InitiatingHeapOccupancyPercent&lt;/strong&gt; is only the initial threshold; the garbage collector adjusts it adaptively over time. To turn off those adjustments, add the &lt;strong&gt;-XX:-G1UseAdaptiveIHOP&lt;/strong&gt; flag to your JVM startup parameters.&lt;/p&gt;
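&lt;p&gt;As a hypothetical example, a startup line that fixes the marking threshold at 45% and disables the adaptive adjustments could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseG1GC \
     -XX:InitiatingHeapOccupancyPercent=45 \
     -XX:-G1UseAdaptiveIHOP \
     -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;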

&lt;p&gt;The &lt;strong&gt;concurrent start&lt;/strong&gt; collection, in addition to the normal young generation collection, starts the object marking process. It determines all live, reachable objects in the old generation space that need to be kept for the following space-reclamation phase. To finish the marking process, two additional steps are needed – remark and cleanup – and both of them pause the application threads. The remark step performs global processing of references, unloads classes, completely reclaims empty regions and cleans up internal data structures. The cleanup step determines whether the space-reclamation phase is needed. If it is, the young-only phase ends with a Prepare Mixed young collection and the space-reclamation phase is launched.&lt;/p&gt;

&lt;p&gt;The space-reclamation phase consists of multiple Mixed garbage collections that work on both young and old generation regions of the G1GC heap. The space-reclamation phase ends when G1GC sees that evacuating more old generation regions wouldn’t give enough free space to make the effort of reclaiming the space worthwhile. That threshold can be set using the &lt;strong&gt;-XX:G1HeapWastePercent&lt;/strong&gt; flag.&lt;/p&gt;

&lt;p&gt;We can also control, at least to some degree, whether periodic garbage collection will run. By using the &lt;strong&gt;-XX:G1PeriodicGCSystemLoadThreshold&lt;/strong&gt; flag we can set the system load average above which the periodic garbage collection will not be run. For example, if our system load for the last minute is 10 and we set the &lt;strong&gt;-XX:G1PeriodicGCSystemLoadThreshold=10&lt;/strong&gt; flag, the periodic garbage collection will not be executed.&lt;/p&gt;

&lt;p&gt;The G1 garbage collector, apart from the &lt;em&gt;-Xmx&lt;/em&gt; and &lt;em&gt;-Xms&lt;/em&gt; flags, allows us to use a set of flags to size the heap and its regions. We can use the &lt;strong&gt;-XX:MinHeapFreeRatio&lt;/strong&gt; flag to tell the garbage collector the minimum ratio of free memory that should be maintained on the heap, and the &lt;strong&gt;-XX:MaxHeapFreeRatio&lt;/strong&gt; flag to set the desired maximum ratio of free memory. We also know that G1GC tries to keep the young generation size between the values of &lt;strong&gt;-XX:G1NewSizePercent&lt;/strong&gt; and &lt;strong&gt;-XX:G1MaxNewSizePercent&lt;/strong&gt;, which also affects pause times: decreasing the young generation size shortens garbage collection pauses at the cost of each collection doing less work. We can also set a strict young generation size by using the &lt;strong&gt;-XX:NewSize&lt;/strong&gt; and &lt;strong&gt;-XX:MaxNewSize&lt;/strong&gt; flags.&lt;/p&gt;

&lt;p&gt;The documentation on tuning the G1 garbage collector says that, in general, we shouldn’t touch it – at most, we should modify the desired pause times for different heap sizes. Fair enough. But it’s also good to know what we can tune, how, and how those properties affect the G1 garbage collector’s behavior.&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;tuning for&lt;/strong&gt; garbage collector &lt;strong&gt;latency&lt;/strong&gt; we should keep the pause time to a minimum. This means that in most cases the &lt;em&gt;-Xmx&lt;/em&gt; and &lt;em&gt;-Xms&lt;/em&gt; values should be set to the same value, and that we should also pre-load the memory pages during application start by using the &lt;strong&gt;-XX:+AlwaysPreTouch&lt;/strong&gt; flag.&lt;/p&gt;

&lt;p&gt;A few rules of thumb when tuning for latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the young-only phase takes too long, decreasing the &lt;strong&gt;-XX:G1NewSizePercent&lt;/strong&gt; (defaults to 5) value is a good idea. In some cases decreasing &lt;strong&gt;-XX:G1MaxNewSizePercent&lt;/strong&gt; (defaults to 60) can also help.&lt;/li&gt;
&lt;li&gt;If the Mixed collections take too long, increase the value of the &lt;strong&gt;-XX:G1MixedGCCountTarget&lt;/strong&gt; flag to spread the tenured generation collection across more Mixed collections, and increase &lt;strong&gt;-XX:G1HeapWastePercent&lt;/strong&gt; to stop the old generation garbage collection earlier.&lt;/li&gt;
&lt;li&gt;You can also change the &lt;strong&gt;-XX:G1MixedGCLiveThresholdPercent&lt;/strong&gt; flag – it controls the occupancy threshold above which an old generation region is excluded from Mixed collections. Regions with a lot of live objects take longer to collect, so lowering this value keeps such regions out of the set of garbage collection candidates.&lt;/li&gt;
&lt;li&gt;If you’re seeing high remembered set (RS) update and scan times, decreasing the &lt;strong&gt;-XX:G1RSetUpdatingPauseTimePercent&lt;/strong&gt; flag value, including the &lt;strong&gt;-XX:-ReduceInitialCardMarks&lt;/strong&gt; flag, and increasing the &lt;strong&gt;-XX:G1RSetRegionEntries&lt;/strong&gt; flag may help.&lt;/li&gt;
&lt;li&gt;Finally, the &lt;strong&gt;-XX:MaxGCPauseMillis&lt;/strong&gt; flag (defaults to 200) defines the maximum desired pause time. If you would like to reduce the pause time, lowering this value may help as well.&lt;/li&gt;
&lt;/ul&gt;
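&lt;p&gt;As an illustration of latency-oriented tuning, a hypothetical startup line could look like this (the heap size, pause goal and young generation settings are made-up values, not recommendations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseG1GC -Xms4g -Xmx4g -XX:+AlwaysPreTouch \
     -XX:MaxGCPauseMillis=100 -XX:G1NewSizePercent=3 \
     -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;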

&lt;p&gt;When &lt;strong&gt;tuning for throughput&lt;/strong&gt; we want the garbage collector to clean as much garbage as possible – mostly in systems that process and hold a lot of data. The first thing that you should go for is increasing the &lt;strong&gt;-XX:MaxGCPauseMillis&lt;/strong&gt; value. By doing that we relax the garbage collector’s pause time goal, which allows it to work longer and process more objects on the heap. However, that may not be enough. In such cases increasing the &lt;strong&gt;-XX:G1NewSizePercent&lt;/strong&gt; flag value should help. In some cases the throughput may be limited by the maximum size of the young generation – in such cases increasing the &lt;strong&gt;-XX:G1MaxNewSizePercent&lt;/strong&gt; flag value should help.&lt;/p&gt;

&lt;p&gt;We can also decrease the amount of concurrent work, which takes CPU time away from the application. Increasing the &lt;strong&gt;-XX:G1RSetUpdatingPauseTimePercent&lt;/strong&gt; flag value allows more of the remembered set update work to happen while the application threads are paused, decreasing the time spent in the concurrent parts of the phase. Also, similar to latency tuning, you may want to set the -Xmx and -Xms flags to the same value to avoid heap resizing, and pre-load the memory pages by using the &lt;strong&gt;-XX:+AlwaysPreTouch&lt;/strong&gt; flag and the &lt;strong&gt;-XX:+UseLargePages&lt;/strong&gt; flag. But please remember to apply the changes one by one and compare the results so that you understand what is happening.&lt;/p&gt;
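&lt;p&gt;An illustrative throughput-oriented startup line, again with made-up values, could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UseG1GC -Xms4g -Xmx4g \
     -XX:MaxGCPauseMillis=500 -XX:G1MaxNewSizePercent=75 \
     -XX:+AlwaysPreTouch -XX:+UseLargePages \
     -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;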

&lt;p&gt;Finally, we can &lt;strong&gt;tune&lt;/strong&gt; for &lt;strong&gt;heap size&lt;/strong&gt;. There is a single option to think about here: the &lt;strong&gt;-XX:GCTimeRatio&lt;/strong&gt; flag (defaults to 12). It determines the ratio of time spent in garbage collection compared to application threads doing their work, the garbage collection share being calculated as &lt;em&gt;1/(1 + GCTimeRatio)&lt;/em&gt;. The default value results in about 8% of the application’s running time being spent in garbage collection, which is more than with the Parallel GC. More time in garbage collection allows clearing more space on the heap, but this is highly application dependent and it is hard to give general advice. Experiment to find the value that suits your needs.&lt;/p&gt;

&lt;p&gt;There are also general tunable parameters for the G1 garbage collector. We can control the degree of parallelization of reference processing by including the &lt;strong&gt;-XX:+ParallelRefProcEnabled&lt;/strong&gt; flag and changing the &lt;strong&gt;-XX:ReferencesPerThread&lt;/strong&gt; flag value. For every N references, where N is defined by the &lt;strong&gt;-XX:ReferencesPerThread&lt;/strong&gt; flag, a single thread will be used. Setting this value to 0 tells the G1 garbage collector to always use the number of threads specified by the &lt;strong&gt;-XX:ParallelGCThreads&lt;/strong&gt; flag value. For more parallelization, decrease the &lt;strong&gt;-XX:ReferencesPerThread&lt;/strong&gt; flag value. This should speed up the parallel parts of the garbage collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Z Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Z garbage collector is a very scalable, low-latency implementation that is still considered experimental. If you would like to experiment with it, you must use JDK 11 or newer and add the &lt;strong&gt;-XX:+UseZGC&lt;/strong&gt; flag to your application startup parameters, along with the &lt;strong&gt;-XX:+UnlockExperimentalVMOptions&lt;/strong&gt; flag.&lt;/p&gt;
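&lt;p&gt;For example (the heap size and jar name are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx8g -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;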

&lt;h3&gt;
  
  
  Tuning the Z Garbage Collector
&lt;/h3&gt;

&lt;p&gt;There aren’t many parameters that we can play around with when it comes to the Z garbage collector. As the documentation states, the most important option here is the maximum heap size, so the -Xmx flag. Because the Z garbage collector is a concurrent collector, the heap size must be adjusted so that it can hold the live set of objects of your application and leave enough headroom for allocations while the garbage collector is running. This means that the heap size may need to be higher compared to other garbage collectors, and the more memory you assign to the heap, the better results you may expect from the garbage collector.&lt;/p&gt;

&lt;p&gt;The second option is, of course, the number of threads that the Z garbage collector will use. After all, it is a concurrent collector, so it can utilize more than a single thread. We can set the number of threads by using the &lt;strong&gt;-XX:ConcGCThreads&lt;/strong&gt; flag. The collector itself uses heuristics to choose the proper number of threads, but as usual, this is highly dependent on the application, and in some cases setting that number to a static value may bring better results. However, that needs to be tested as it is very use-case dependent. There are two things to remember, though. If you assign too many threads to the garbage collector, your application may not have enough computing power to do its job. If you set the number of garbage collector threads too low, the garbage may not be collected fast enough. Take that into consideration when tuning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Other JVM Options
&lt;/h1&gt;

&lt;p&gt;We’ve covered quite a lot when it comes to garbage collection parameters and how they affect garbage collection, but not everything – there is way more to it than that. Of course, we won’t talk about every single parameter; that just wouldn’t make sense. However, there are a few more things that you should know about.&lt;/p&gt;

&lt;h2&gt;
  
  
  JVM Statistics Causing Long Garbage Collection Pauses
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.evanjones.ca/jvm-mmap-pause.html" rel="noopener noreferrer"&gt;Some people&lt;/a&gt; reported that on Linux systems, during high I/O utilization the garbage collection can pause threads for a long period of time. This is probably caused by the JVM using a memory-mapped file called hsperfdata. That file is written in the /tmp directory and is used for keeping the statistics and safepoints. The mentioned file is updated during GC. On Linux, modifying a memory-mapped file can be blocked until I/O completes. As you can imagine such an operation can take a longer period of time, presumably hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;How to spot such an issue in your environment? You need to look into the timings of your garbage collection. If you see in the garbage collection logs that the real time spent by the JVM on garbage collection is way longer than the user and system metrics combined, you have a potential candidate. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[Times: user=0.13 sys=0.11, real=5.45 secs]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If your system is heavily I/O bound and you see the mentioned behavior, you can move your GC log files to a tmpfs mount or a fast SSD drive. With recent JDK versions the temporary directory that Java uses for the hsperfdata file is hardcoded, so we can’t use &lt;strong&gt;-Djava.io.tmpdir&lt;/strong&gt; to change it. You can also include the &lt;strong&gt;-XX:+PerfDisableSharedMem&lt;/strong&gt; flag in your JVM application parameters. You need to be aware that including that option will break tools that read statistics from the hsperfdata file – for example, jstat will not work.&lt;/p&gt;
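<p>For example, a startup line with the statistics file disabled could look like this (the jar name is illustrative; remember that jstat and similar tools will stop working):</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+PerfDisableSharedMem -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;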

&lt;p&gt;You can read more on that issue in the blog post from the &lt;a href="https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic" rel="noopener noreferrer"&gt;Linkedin engineering&lt;/a&gt; team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Dump on Out Of Memory Exception
&lt;/h2&gt;

&lt;p&gt;One thing that can be very useful when dealing with OutOfMemoryError, diagnosing its cause and looking into problems like memory leaks, is a heap dump. A heap dump is basically a file with the contents of the heap written to disk. We can generate heap dumps on demand, but it takes time and can freeze the application or, in the best-case scenario, make it slow. And if our application crashes, we can’t grab the heap dump anymore – it’s already gone.&lt;/p&gt;

&lt;p&gt;To avoid losing information that can help us in diagnosing problems, we can instruct the JVM to create a heap dump when an OutOfMemoryError happens. We do that by including the &lt;strong&gt;-XX:+HeapDumpOnOutOfMemoryError&lt;/strong&gt; flag. We can also specify where the heap dump should be stored by using the &lt;strong&gt;-XX:HeapDumpPath&lt;/strong&gt; flag and setting its value to the location we want to write the heap dump to. For example: &lt;strong&gt;-XX:HeapDumpPath=/tmp/heapdump.hprof&lt;/strong&gt;.&lt;/p&gt;
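&lt;p&gt;A hypothetical startup line combining both flags (the dump path and jar name are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/heapdump.hprof \
     -jar my-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;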

&lt;p&gt;Keep in mind that the heap dump file may be very big – as large as your heap size. So you need to account for that when setting the path where the file should be written. We’ve seen situations where the JVM was not able to write a 64GB heap dump file to the target file system.&lt;/p&gt;

&lt;p&gt;For analysis of the file, there are tools that you can use. There are open-source tools like &lt;a href="https://www.eclipse.org/mat/" rel="noopener noreferrer"&gt;MAT&lt;/a&gt; and proprietary tools like &lt;a href="https://www.yourkit.com/" rel="noopener noreferrer"&gt;YourKit Java Profiler&lt;/a&gt; or &lt;a href="https://www.ej-technologies.com/products/jprofiler/overview.html" rel="noopener noreferrer"&gt;JProfiler&lt;/a&gt;. There are also services like &lt;a href="https://heaphero.io/" rel="noopener noreferrer"&gt;heaphero.io&lt;/a&gt; that can help you with the analysis, while older versions of the Oracle JDK distribution come with &lt;a href="https://docs.oracle.com/javase/7/docs/technotes/tools/share/jhat.html" rel="noopener noreferrer"&gt;jhat – the Java Heap Analysis Tool&lt;/a&gt;. Choose the one that you like and that fits your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using -XX:+AggressiveOpts
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;-XX:+AggressiveOpts&lt;/strong&gt; flag turns on additional flags that have been shown to increase performance in a set of benchmarks. Those flags can change from version to version and include options like a larger autoboxing cache and aggressive autoboxing elimination. It also disables the biased locking startup delay. Should you use this flag? That depends on your use case and your production system. As usual, test in your environment, compare instances with and without the flag and see how large of a difference it makes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Tuning garbage collection is not an easy task. It requires knowledge and understanding. You need to know the garbage collector that you are working with and you need to understand your application’s memory needs. Every application is different and has different memory usage patterns, thus requires different garbage collection strategies. It’s also not a quick task. It will take time and resources to make improvements in iterations that will show you if you are going in the right direction with each and every change.&lt;/p&gt;

&lt;p&gt;Remember that we only touched the tip of the iceberg when it comes to tuning garbage collectors in the JVM world. We’ve only mentioned a limited number of available flags that you can turn on/off and adjust. For additional context and learning, I suggest going to &lt;a href="https://docs.oracle.com/en/java/javase/13/gctuning/" rel="noopener noreferrer"&gt;Oracle HotSpot VM Garbage Collection Tuning Guide&lt;/a&gt; and reading the parts that you think may be of interest to you. Look at your garbage collection logs, analyze them, try to understand them. It will help you in understanding your environment and what’s happening inside the JVM when garbage is collected. In addition to that, experiment a lot! Experiment in your test environment, on your developer machines, experiment in some of the production or pre-production instances and observe the difference in behavior.&lt;/p&gt;

&lt;p&gt;Hopefully, this article will help you on your journey to a healthy garbage collection in your JVM based applications. Good luck!&lt;/p&gt;

</description>
      <category>java</category>
      <category>gc</category>
      <category>tuning</category>
      <category>performance</category>
    </item>
    <item>
      <title>A Quick Start on Java Garbage Collection: What it is, and How it works</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 27 Jan 2020 10:43:26 +0000</pubDate>
      <link>https://dev.to/sematext/a-quick-start-on-java-garbage-collection-what-it-is-and-how-it-works-14d9</link>
      <guid>https://dev.to/sematext/a-quick-start-on-java-garbage-collection-what-it-is-and-how-it-works-14d9</guid>
      <description>&lt;p&gt;In this tutorial, we will talk about how different Java Garbage Collectors work and what you can expect from them. This will give us the necessary background to start tuning the garbage collection algorithm of your choice.&lt;/p&gt;

&lt;p&gt;Before going into Java Garbage Collection tuning we need to understand two things. First of all, how garbage collection works in theory and how it works in the system we are going to tune. Our system’s garbage collector work is described by garbage collector logs and metrics from observability tools like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud for JVM&lt;/a&gt;. We talked about how to read and understand &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;Java Garbage Collection logs&lt;/a&gt; in a previous blog post.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Garbage Collection in Java: A Definition
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Java Garbage Collection&lt;/strong&gt; is an automatic process during which the Java Virtual Machine inspects the objects on the heap, checks whether they are still referenced, and releases the memory used by those objects that are no longer needed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Object Eligibility: When Does Java Perform Garbage Collection
&lt;/h1&gt;

&lt;p&gt;Let’s take a quick look at when an object becomes ready to be collected by the garbage collector and how to actually request the Java Virtual Machine to start garbage collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Make an Object Eligible for GC?
&lt;/h2&gt;

&lt;p&gt;To put it straight – you don’t have to do anything explicitly to make an object eligible for garbage collection. When an object is no longer used in your application code, the heap space used by it can be reclaimed. Look at the following Java code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;variableOne&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;variableTwo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;variableOne&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;variableTwo&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;em&gt;run()&lt;/em&gt; method we explicitly create two variables. They are first allocated on the heap, in the young generation. Once the method finishes its execution they are no longer needed and they become eligible for garbage collection. When a young generation garbage collection happens, the memory used by those variables may be reclaimed, and the previously occupied space will be visible as free.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Request the JVM to Run GC?
&lt;/h2&gt;

&lt;p&gt;The best thing about Java garbage collection is that it is automatic. Until the time comes when you want or need to control and tune it, you don’t have to do anything. When the Java Virtual Machine decides it’s time to reclaim space on the heap and throw away unused objects, it simply starts the garbage collection process.&lt;/p&gt;

&lt;p&gt;If you want to request garbage collection you can use the System object from the java.lang package and its &lt;em&gt;gc()&lt;/em&gt; method, or the equivalent &lt;em&gt;Runtime.getRuntime().gc()&lt;/em&gt; call. As the &lt;a href="https://docs.oracle.com/javase/9/docs/api/java/lang/System.html#gc--" rel="noopener noreferrer"&gt;documentation states&lt;/a&gt; – the Java Virtual Machine will make a best effort to reclaim the space, which means the garbage collection may not actually happen; that depends on the JVM. If the garbage collection does happen, it will be a Major collection, which means we can expect a &lt;em&gt;stop-the-world&lt;/em&gt; event. In general, using &lt;em&gt;System.gc()&lt;/em&gt; is considered a bad practice and we should tune the work of the garbage collector instead of calling it explicitly.&lt;/p&gt;
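To make the hint-like nature of these calls concrete, here is a minimal sketch (the class name and the allocation size are illustrative). Whether memory usage actually drops after the call depends entirely on the JVM:

```java
public class GcHintDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedBefore = rt.totalMemory() - rt.freeMemory();

        // Allocate some garbage, then drop the only reference to it.
        byte[] garbage = new byte[10 * 1024 * 1024];
        garbage = null;

        // Both calls below are only *hints* -- the JVM may ignore them,
        // and if honored they typically trigger a stop-the-world Full GC.
        System.gc();
        // Runtime.getRuntime().gc(); // equivalent call

        long usedAfter = rt.totalMemory() - rt.freeMemory();
        System.out.println("Used before hint: " + usedBefore + " bytes");
        System.out.println("Used after hint:  " + usedAfter + " bytes");
    }
}
```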

&lt;h1&gt;
  
  
  How Does Java Garbage Collection Work?
&lt;/h1&gt;

&lt;p&gt;No matter what implementation of the garbage collector we use, a short pause needs to happen in order to clean up the memory. Those pauses are also called stop-the-world events, or STW for short. You can envision your JVM-based application’s working cycles in the following way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fggg9ftbcticb4bh5zbv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fggg9ftbcticb4bh5zbv5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cycle starts when your application threads are started and your business code is running. At a certain point in time, an event happens that triggers garbage collection. To clear the memory, the application threads have to be stopped; this is where the work of your application pauses and the next steps start. The garbage collector marks objects that are no longer used and reclaims their memory. Finally, an optional heap resizing step may happen if possible. Then the cycle starts again and the application threads are resumed. The full cycle of the garbage collection is called the &lt;strong&gt;epoch&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The key when running JVM applications and tuning the garbage collector is to keep the application threads running for as long as possible. That means that the pauses caused by the garbage collector should be minimal.&lt;/p&gt;

&lt;p&gt;The second thing that we need to talk about is generations. Java garbage collectors are generational, which means that they work under certain principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Young data will not survive long&lt;/li&gt;
&lt;li&gt;Data that is old will continue to persist in memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why JVM heap memory is divided into generations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Young generation&lt;/strong&gt; which is divided into two sections called &lt;strong&gt;Eden space&lt;/strong&gt; and &lt;strong&gt;Survivor space&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old generation&lt;/strong&gt;, or &lt;strong&gt;Tenured space&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fp3gez06rhvaclxfc70or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fp3gez06rhvaclxfc70or.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simplified flow of objects between spaces and generations can be illustrated with the following example. When an object is created it is first put into the &lt;strong&gt;Eden space&lt;/strong&gt; of the &lt;strong&gt;young generation&lt;/strong&gt;. Once a young garbage collection happens the object is copied into &lt;strong&gt;Survivor space 0&lt;/strong&gt; and, on the next collection, into &lt;strong&gt;Survivor space 1&lt;/strong&gt;. If the object is still used at that point, a later garbage collection cycle will promote it to the &lt;strong&gt;Tenured space&lt;/strong&gt;, which means it is moved to the &lt;strong&gt;old generation&lt;/strong&gt;. You can imagine it as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwnaeh997aybve47dryvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwnaeh997aybve47dryvd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the &lt;strong&gt;Eden&lt;/strong&gt; space contains newly created objects and is empty at the beginning of the &lt;strong&gt;epoch&lt;/strong&gt;. During the epoch the &lt;strong&gt;Eden&lt;/strong&gt; space fills up, eventually triggering a &lt;strong&gt;Minor GC&lt;/strong&gt; event. The &lt;strong&gt;Survivor&lt;/strong&gt; spaces contain objects that survived at least a single &lt;strong&gt;epoch&lt;/strong&gt;. Objects that survive through many &lt;strong&gt;epochs&lt;/strong&gt; will eventually be promoted to the &lt;strong&gt;Tenured&lt;/strong&gt; generation.&lt;/p&gt;

&lt;p&gt;Before Java 8 there was one additional memory space called the &lt;strong&gt;PermGen&lt;/strong&gt;. &lt;strong&gt;PermGen&lt;/strong&gt; or otherwise &lt;strong&gt;Permanent Generation&lt;/strong&gt; was a special space on the heap separated from its other parts – the young and the tenured generation. It was used to store metadata such as classes and methods.&lt;/p&gt;

&lt;p&gt;Starting from Java 8, the &lt;strong&gt;Metaspace&lt;/strong&gt; is the memory space that replaces the removed PermGen space. The implementation differs from PermGen: this region is allocated from native memory and is automatically resized, limiting the risk of out-of-memory errors in this part of memory. The Metaspace can be garbage collected, and classes that are no longer used can be cleaned when the Metaspace reaches its maximum size.&lt;/p&gt;

&lt;p&gt;There are a few flags that can be used to control the size and the behavior of the Metaspace memory space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MetaspaceSize&lt;/strong&gt; – initial size of the Metaspace memory region,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MaxMetaspaceSize&lt;/strong&gt; – maximum size of the Metaspace memory region,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MinMetaspaceFreeRatio&lt;/strong&gt; – minimum percentage of class metadata capacity that should be free after garbage collection,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MaxMetaspaceFreeRatio&lt;/strong&gt; – maximum percentage of class metadata capacity that should be free after garbage collection.&lt;/li&gt;
&lt;/ul&gt;
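Beyond setting flags, you can observe Metaspace usage at runtime through the standard java.lang.management API. A small sketch (the class name is illustrative; on HotSpot the relevant pools are named "Metaspace" and "Compressed Class Space"):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceUsage {
    public static void main(String[] args) {
        // The JVM exposes one MemoryPoolMXBean per memory pool;
        // on HotSpot, one of them is named "Metaspace".
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                System.out.println(pool.getName() + ": "
                        + pool.getUsage().getUsed() / 1024 + " KB used");
            }
        }
    }
}
```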

&lt;p&gt;You can now imagine why some garbage collectors may need a considerable amount of time to clear the old generation space: it’s done in a single step. The tenured generation is one big space of the heap, and to clear it the application threads have to be stopped.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Heap Structure of G1 Garbage Collector
&lt;/h2&gt;

&lt;p&gt;What we wrote above is true for all garbage collectors including Serial, Parallel and Concurrent Mark Sweep. We will discuss them a bit later. However, the G1 garbage collector goes a step further and divides the heap into &lt;strong&gt;regions&lt;/strong&gt;. A &lt;strong&gt;region&lt;/strong&gt; is a small, independent chunk of the heap that can be dynamically assigned the Eden, Survivor or Tenured role:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F15vbwyqcj9rselwvis6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F15vbwyqcj9rselwvis6b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the three mentioned types, we also have free memory, the white cells on the image.&lt;/p&gt;

&lt;p&gt;Such architecture allows for different operations. First of all, because the tenured generation is divided into regions, it can be collected in portions, which reduces latency and makes old generation collections faster. Such a heap can be easily defragmented and dynamically resized. No cons, right? Not quite: the cost of maintaining such a heap architecture is higher compared to the traditional heap architecture – it requires more CPU and memory.&lt;/p&gt;

&lt;p&gt;The region size when using G1GC can be controlled. When the heap size is set to less than 4GB the region size will automatically be set to 1MB. For heaps between 4 and 8GB the region size will be set to 2MB, and so on, up to a 32MB region size for heaps of 64GB or larger. In general, the region size must be a power of two between 1 and 32MB. By default, the JVM will aim for around two thousand regions during application start. We can control the region size using the -XX:G1HeapRegionSize=N JVM parameter.&lt;/p&gt;
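The sizing rules above can be sketched as a small helper. This is a simplified approximation of HotSpot’s heuristic, not its exact code (the real implementation also factors in the initial heap size):

```java
public class G1RegionSize {
    // Simplified sketch: aim for ~2048 regions, clamp the result to
    // [1 MB, 32 MB], and round down to a power of two.
    static long regionSizeFor(long maxHeapBytes) {
        long size = maxHeapBytes / 2048;
        size = Math.max(1L << 20, Math.min(32L << 20, size));
        return Long.highestOneBit(size); // round down to a power of two
    }

    public static void main(String[] args) {
        System.out.println(regionSizeFor(4L << 30) >> 20);  // 4 GB heap  -> 2 (MB)
        System.out.println(regionSizeFor(64L << 30) >> 20); // 64 GB heap -> 32 (MB)
    }
}
```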

&lt;p&gt;The clearing of the heap in the case of G1GC is done by copying live data out of an existing region into an empty region and discarding the old region altogether. After that, the old region is considered free and objects can be allocated to it. Freeing multiple regions at the same time allows for defragmentation and assignment of &lt;strong&gt;humongous&lt;/strong&gt; objects – ones that are larger than 50% of a heap region.&lt;/p&gt;

&lt;p&gt;You may now wonder what triggers garbage collection, and that would be a great question. Common triggers are the Eden space filling up, not enough free space to allocate a new object, and external requests such as a &lt;em&gt;System.gc()&lt;/em&gt; call or tools like jmap.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Triggers Java Garbage Collection
&lt;/h1&gt;

&lt;p&gt;To make things even more complicated, there are several types of garbage collection events. In a very simplified way, you can divide them as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minor&lt;/strong&gt; event – happens when the Eden space is full and live data is moved to a Survivor space; a Minor event stays within the young generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed&lt;/strong&gt; event – a Minor event plus a reclaim of part of the Tenured generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full GC&lt;/strong&gt; event – a clearing of the young and old generation spaces together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even by looking at the names of the events you can see that the key in most cases will be lowering the pause times of the Mixed and Full GC events. Let’s stop discussing the garbage collection events for now. There is more to it and we could get deeper and deeper. But for now, we should be good.&lt;/p&gt;

&lt;p&gt;The next thing that I would like to mention is the &lt;strong&gt;humongous&lt;/strong&gt; object. Remember? When dealing with the G1 garbage collector (G1GC), any object larger than 50% of the region size is considered humongous. Those objects are not allocated in the young generation space; instead, they are put directly into the Tenured generation. Such objects can increase the pause time of the garbage collector and can increase the risk of triggering a Full GC because of running out of contiguous free space.&lt;/p&gt;

&lt;h1&gt;
  
  
  Java Garbage Collectors Types
&lt;/h1&gt;

&lt;p&gt;We now understand the basics, and it’s time to learn what kinds of garbage collectors are available and how each of them behaves in our application. Keep in mind that different Java versions ship with different garbage collectors. For example, Java 9 has both the Concurrent Mark Sweep and G1 garbage collectors, while older updates of Java 7 do not have the G1 garbage collector at all.&lt;/p&gt;

&lt;p&gt;That said, there are five types of garbage collectors in Java:&lt;/p&gt;

&lt;h2&gt;
  
  
  Serial Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Serial garbage collector&lt;/strong&gt; is &lt;strong&gt;designed&lt;/strong&gt; to be used in &lt;strong&gt;single-threaded environments&lt;/strong&gt;. Before doing garbage collection this garbage collector &lt;strong&gt;freezes&lt;/strong&gt; all the &lt;strong&gt;application threads&lt;/strong&gt; and uses a single thread for the collection itself. Because of that, it is not suited for multi-threaded environments like server-side applications. However, it is perfectly suited for single-threaded applications that don’t require low pause times, such as batch jobs.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.oracle.com/en/java/javase/13/gctuning/available-collectors.html#GUID-45794DA6-AB96-4856-A96D-FDE5F7DEE498" rel="noopener noreferrer"&gt;documentation on Java garbage collectors&lt;/a&gt; also mentions that this garbage collector may be useful on multiprocessor machines for applications with a data set up to approximately 100MB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Parallel garbage collector&lt;/strong&gt;, also known as the &lt;strong&gt;throughput collector&lt;/strong&gt;, is very similar to the Serial garbage collector. It also needs to freeze the application threads when doing garbage collection, but it was designed for &lt;strong&gt;multiprocessor environments&lt;/strong&gt; and multi-threaded applications with medium to large data sets. The idea is that using &lt;strong&gt;multiple threads&lt;/strong&gt; will &lt;strong&gt;speed up garbage collection&lt;/strong&gt; for such use cases.&lt;/p&gt;

&lt;p&gt;If your application’s priority is peak performance and the thread pause time of one second or even longer is not a problem for it then the Parallel garbage collector may be a good idea. It will run from time to time freezing application threads and performing GC using multiple threads speeding it up compared to the Serial garbage collector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrent Mark Sweep Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Concurrent Mark Sweep (CMS) garbage collector&lt;/strong&gt; is one of the implementations that are called &lt;strong&gt;mostly concurrent&lt;/strong&gt;. It performs &lt;strong&gt;expensive operations&lt;/strong&gt; using &lt;strong&gt;multiple threads&lt;/strong&gt; and runs most of its &lt;strong&gt;garbage collection&lt;/strong&gt; work concurrently with the &lt;strong&gt;application&lt;/strong&gt;, sharing CPU with the application threads. That sharing is where the overhead of this type of garbage collection comes from.&lt;/p&gt;

&lt;p&gt;The CMS GC is designed for applications that prefer short pauses. It delivers lower throughput compared to the Parallel or Serial garbage collectors, but it only stops the application threads for short periods instead of pausing them for the whole collection.&lt;/p&gt;

&lt;p&gt;This garbage collector should be chosen if your application prefers short pauses and can afford to share CPU with the garbage collector threads. Keep in mind though that the Concurrent Mark Sweep garbage collector is going to be removed in Java 14, so you should look at the G1 garbage collector if you are not using it yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  G1 Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;G1 garbage collector&lt;/strong&gt; is the garbage collection algorithm that was introduced in Java 7 update 4 and has been improved since. G1GC was designed for low latency, but that comes at a price – more frequent work, which means more CPU cycles spent in garbage collection. It partitions the heap into smaller regions, allowing for easier garbage collection and &lt;strong&gt;evacuation-style memory clearing&lt;/strong&gt;: objects are moved out of the region being cleared and copied to another region. Most of the garbage collection is done in the young generation, where it is most efficient to do so.&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://docs.oracle.com/en/java/javase/13/gctuning/available-collectors.html#GUID-13943556-F521-4287-AAAA-AE5DE68777CD" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; states, G1GC was designed for server-style applications running in a multiprocessor environment with a large amount of memory available. It tries to meet garbage collection pause-time goals with high probability while also achieving high throughput – all of that without the need for complicated configuration, at least in theory.&lt;/p&gt;

&lt;p&gt;Think about it this way – if you have services that are &lt;strong&gt;latency-sensitive&lt;/strong&gt;, the G1 garbage collector may be a &lt;strong&gt;very good choice&lt;/strong&gt;. Having low latency means that those services will not suffer from long stop-the-world events, of course at the &lt;strong&gt;cost of higher CPU usage&lt;/strong&gt;. The G1 garbage collector was also designed to work with larger heap sizes – if you have a heap larger than 32GB, G1 is usually a good choice. The G1 garbage collector is a replacement for the CMS garbage collector and is the default garbage collector in the most recent Java versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Z Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Z garbage collector&lt;/strong&gt; is an experimental garbage collection implementation that is not yet available on all platforms – for example, Windows and macOS. It is designed to be a very scalable, low-latency implementation that performs expensive garbage collection work concurrently, without stopping the application threads for long.&lt;/p&gt;

&lt;p&gt;The ZGC is expected to work well with applications requiring pauses of 10ms or less and ones that use very large heaps.&lt;/p&gt;
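Whichever collector your JVM ends up with (selected via flags such as -XX:+UseSerialGC, -XX:+UseParallelGC or -XX:+UseG1GC), you can check at runtime which collectors are active through the standard management API. A small sketch; the bean names printed vary by collector and JVM version:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ActiveCollectors {
    public static void main(String[] args) {
        // One bean per collector, e.g. "G1 Young Generation" and
        // "G1 Old Generation" when running with G1GC.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " - collections: " + gc.getCollectionCount()
                    + ", accumulated collection time: " + gc.getCollectionTime() + " ms");
        }
    }
}
```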

&lt;h1&gt;
  
  
  Java Garbage Collection Benefits
&lt;/h1&gt;

&lt;p&gt;There are &lt;strong&gt;multiple benefits&lt;/strong&gt; of garbage collection in Java. The major one, which you may not think of at first, is &lt;strong&gt;simplified code&lt;/strong&gt;. We don’t have to worry about proper memory assignment and release cycles; we just stop using an object and the memory it occupies will be &lt;strong&gt;automatically reclaimed&lt;/strong&gt; at some point. The memory reclamation process is automatic and is the job of the internal algorithm inside the JVM – we only control what kind of algorithm we want to use, if we want to control it at all. Of course, we can still hit memory leaks if we keep references to objects forever, but that is a different story.&lt;/p&gt;
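To illustrate the memory-leak caveat, here is a hypothetical example of the classic pattern (class and method names are illustrative): a static collection that only ever grows, so everything added to it stays reachable and is never eligible for garbage collection:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakExample {
    // A static, ever-growing collection is a classic accidental leak:
    // everything added here stays reachable for the life of the class.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest() {
        CACHE.add(new byte[1024]); // never removed, so never collectible
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            handleRequest();
        }
        // Heap usage grows with every call; the GC cannot reclaim any of it.
        System.out.println("Entries retained: " + CACHE.size()); // 1000
    }
}
```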

&lt;p&gt;We have to remember though that those benefits come at a price – &lt;strong&gt;performance&lt;/strong&gt;. Depending on the situation and the garbage collection algorithm, we pay for the ease and automation of memory management with CPU cycles spent on garbage collection. In extreme cases, when we have issues with memory or garbage collection, the whole application can stall until the space reclamation process ends.&lt;/p&gt;

&lt;h1&gt;
  
  
  Java Garbage Collection Best Practices
&lt;/h1&gt;

&lt;p&gt;We will cover the process of tuning garbage collection in the next post in the series, but before that, we wanted to share some good and bad practices around garbage collection. First of all – you should avoid calling the System.gc() method to ask for explicit garbage collection. As we’ve mentioned it is considered a bad practice and should be avoided.&lt;/p&gt;

&lt;p&gt;The second thing I wanted to mention is the right amount of heap memory. If you don’t have enough memory for your application to work, you will experience slowdowns, long garbage collection pauses, stop-the-world events and eventually out-of-memory errors. All of that can indicate that your heap is too small, but it can also mean that you have a memory leak in your application. Look at the &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;JVM monitoring&lt;/a&gt; of your choice to see if heap usage grows indefinitely – if it does, it may mean you have a bug in your application code. We will talk more about heap size in the next post in the series.&lt;/p&gt;

&lt;p&gt;Finally, if you are running a small, standalone application, you will probably not need any kind of garbage collection tuning. Just go with the defaults and you should be more than fine.&lt;/p&gt;

&lt;p&gt;The next step after that would be to choose the right garbage collector implementation – one that matches the needs and requirements of our business. How to do that, and what the options are for tuning different garbage collection algorithms, is something we will cover in the next blog post, A Step-by-Step Guide to Java Garbage Collection Tuning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;At this point, we know what the Java garbage collection process looks like, how each garbage collector works and what behavior we can expect from each of them. In addition, in the &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt; we discussed how to turn on and understand the logs produced by each garbage collector. This means that we are ready for the final part of the series – tuning our garbage collector.&lt;/p&gt;

</description>
      <category>java</category>
      <category>gc</category>
      <category>observability</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Garbage Collection Logs &amp; How to Analyze Them</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Thu, 19 Dec 2019 14:47:25 +0000</pubDate>
      <link>https://dev.to/sematext/java-garbage-collection-logs-how-to-analyze-them-4hgb</link>
      <guid>https://dev.to/sematext/java-garbage-collection-logs-how-to-analyze-them-4hgb</guid>
      <description>&lt;p&gt;When working with Java or any other JVM-based programming language we get certain functionalities for free. One of those functionalities is clearing the memory. If you’ve ever used languages like C/C++ you probably remember functions like &lt;em&gt;malloc&lt;/em&gt;, &lt;em&gt;calloc&lt;/em&gt;, &lt;em&gt;realloc&lt;/em&gt; and &lt;em&gt;free&lt;/em&gt;. We needed to take care of the assignment of each byte in memory and take care of releasing the assigned memory when it was no longer needed. Without that, we were soon running into a shortage of memory leading to instability and crashes.&lt;/p&gt;

&lt;p&gt;With Java, we don’t have to worry about releasing the memory that was assigned to an object. We only need to stop using the object. It’s as simple as that. Once the object is no longer referenced from inside our code the memory can be released and re-used again.&lt;/p&gt;

&lt;p&gt;Freeing memory is done by a specialized part of the JVM called the Garbage Collector.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Does the Java Garbage Collector Work
&lt;/h1&gt;

&lt;p&gt;The Java Virtual Machine runs the Garbage Collector in the background to find objects that are no longer referenced. The memory used by such objects can be freed and re-used. You can already see the difference compared to languages like C/C++: you don’t have to mark an object for deletion, it is enough to stop using it.&lt;/p&gt;

&lt;p&gt;The heap memory is divided into different regions, and each is treated differently by the garbage collector. There are a few implementations of the garbage collector, and each JVM vendor can provide its own implementation with different performance characteristics, as long as it meets the specification.&lt;/p&gt;

&lt;p&gt;The simplified view over the three main regions of the JVM Heap can be visualized as follows:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnyrt3tcopye1wj8snxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnyrt3tcopye1wj8snxe.png" alt="JVM Heap Space"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having a healthy garbage collection process is crucial to achieving optimal performance of your JVM based applications. Because of that, we need to ensure that we &lt;a href="https://sematext.com/guides/java-monitoring/" rel="noopener noreferrer"&gt;monitor JVM and its Garbage Collector&lt;/a&gt;. By using logs we can understand what the JVM tells us about the garbage collectors’ work.&lt;/p&gt;
&lt;h1&gt;
  
  
  What Are Garbage Collection (GC) Logs
&lt;/h1&gt;

&lt;p&gt;The &lt;strong&gt;garbage collector log&lt;/strong&gt; is a text file produced by the Java Virtual Machine that describes the work of the garbage collector. It contains all the information you need to see how the memory cleaning process works, how the garbage collector behaves and how many resources it uses. Though we can monitor our application using an APM provider or an in-house monitoring tool, the garbage collector log is invaluable for quickly identifying potential issues and bottlenecks in heap memory utilization.&lt;/p&gt;

&lt;p&gt;An example of what you can expect to find in the garbage collection log looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-10-29T10:00:28.693-0100: 0.302: [GC (Allocation Failure) 2019-10-29T10:00:28.693-0100: 0.302: [ParNew
Desired survivor size 1114112 bytes, new threshold 1 (max 6)
- age   1:    2184256 bytes,    2184256 total
: 17472K-&amp;gt;2175K(19648K), 0.0011358 secs] 17472K-&amp;gt;2382K(63360K), 0.0012071 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-10-29T10:00:28.694-0100: 0.303: Total time for which application threads were stopped: 0.0012996 seconds, Stopping threads took: 0.0000088 seconds
2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds
2019-10-29T10:00:28.897-0100: 0.506: Total time for which application threads were stopped: 0.0000981 seconds, Stopping threads took: 0.0000076 seconds
2019-10-29T10:00:28.910-0100: 0.519: Total time for which application threads were stopped: 0.0000896 seconds, Stopping threads took: 0.0000062 seconds
2019-10-29T10:00:28.923-0100: 0.531: Total time for which application threads were stopped: 0.0000975 seconds, Stopping threads took: 0.0000069 seconds
2019-10-29T10:00:28.976-0100: 0.585: Total time for which application threads were stopped: 0.0001414 seconds, Stopping threads took: 0.0000091 seconds
2019-10-29T10:00:28.982-0100: 0.590: [GC (Allocation Failure) 2019-10-29T10:00:28.982-0100: 0.590: [ParNew
Desired survivor size 1114112 bytes, new threshold 1 (max 6)
- age   1:    1669448 bytes,    1669448 total
: 19647K-&amp;gt;2176K(19648K), 0.0032520 secs] 19854K-&amp;gt;5036K(63360K), 0.0033060 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even a very short period of time can provide a lot of information. You can see allocation failures, young generation collections, threads being stopped, memory before and after garbage collection, and events leading to the promotion of objects inside the heap memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl4uga6ervru4gr95shmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl4uga6ervru4gr95shmn.png" alt="Promotion of object on the JVM Heap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Are Garbage Collection Logs Important
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqas26ub9gupdeikwhnrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqas26ub9gupdeikwhnrz.png" alt="Garbage Collector Metrics Visualized"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dealing with application performance tuning can be a long and unpleasant experience. We need to properly prepare the environment and observe the application. Check this out to learn more about &lt;a href="https://sematext.com/blog/jvm-performance-tuning/" rel="noopener noreferrer"&gt;JVM performance tuning&lt;/a&gt;. With the right observability tool, like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, you get insights into crucial &lt;a href="https://sematext.com/docs/integration/jvm/" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; related to the application, the JVM and the operating system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/docs/integration/jvm/" rel="noopener noreferrer"&gt;Metrics&lt;/a&gt; are not everything, though. Even the best APM tools will not give you the full picture. &lt;a href="https://sematext.com/docs/integration/jvm/" rel="noopener noreferrer"&gt;Metrics&lt;/a&gt; can show you patterns and historical data that will help you identify potential issues, but to see everything you will need to dig deeper. For a Java-based application, that deeper level is the garbage collection log. Even though GC logs are very verbose, they provide information that’s not available in other sources, such as stop-the-world events and how long they took, how long the application threads were stopped, memory pool utilization, and much more.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to Enable GC Logging
&lt;/h1&gt;

&lt;p&gt;Before talking about how to enable garbage collector logging we should ask ourselves one thing. &lt;strong&gt;Should I turn on the logs by default, or should I only turn them on when issues start appearing?&lt;/strong&gt; On modern hardware, you shouldn’t worry about performance when enabling the garbage collector logs. Of course, you will see slightly more writes to your persistent storage, simply because the logs have to be written somewhere. Apart from that, the logs shouldn’t put any additional load on the system.&lt;/p&gt;

&lt;p&gt;You should always have the Java garbage collection logs turned on. In fact, a lot of open-source systems already follow that practice. For example, search systems like &lt;a href="https://sematext.com/resources/solr-monitoring-ebook/" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; or &lt;a href="https://sematext.com/resources/elasticsearch-monitoring-ebook/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt; ship with JVM flags that turn on the logs. These files include crucial information about Java Virtual Machine operations, which is exactly why they should be turned on.&lt;/p&gt;

&lt;p&gt;There is a difference in terms of how you activate garbage collection logging for Java 8 and earlier and for the newer Java versions.&lt;/p&gt;

&lt;p&gt;For Java 8 and earlier you should add the following flags to your JVM based application startup parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-XX&lt;/span&gt;:+PrintGCDetails &lt;span class="nt"&gt;-Xloggc&lt;/span&gt;:&amp;lt;PATH_TO_GC_LOG_FILE&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where the &lt;strong&gt;PATH_TO_GC_LOG_FILE&lt;/strong&gt; is the location of the garbage collector log file. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-XX&lt;/span&gt;:+PrintGCDetails &lt;span class="nt"&gt;-Xloggc&lt;/span&gt;:/var/log/myapp/gc.log &lt;span class="nt"&gt;-jar&lt;/span&gt; my_awesome_app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In some cases, you may also see the &lt;strong&gt;-XX:+PrintGCTimeStamps&lt;/strong&gt; flag included. However, it is redundant here, as &lt;strong&gt;-Xloggc&lt;/strong&gt; already produces timestamped output.&lt;/p&gt;

&lt;p&gt;For Java 9 and newer you can simplify the command above and add the following flag to the application startup parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xlog&lt;/span&gt;:gc&lt;span class="k"&gt;*&lt;/span&gt;:file&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;PATH_TO_GC_LOG_FILE&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-Xlog&lt;/span&gt;:gc&lt;span class="k"&gt;*&lt;/span&gt;:file&lt;span class="o"&gt;=&lt;/span&gt;/var/log/myapp/gc.log &lt;span class="nt"&gt;-jar&lt;/span&gt; my_awesome_app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you enable the logs, it’s important to remember about GC log rotation. When using an older JVM version, like JDK 8, you may want to rotate your GC logs. To do that, there are three flags we can add to our JVM application startup parameters. The first, &lt;strong&gt;-XX:+UseGCLogFileRotation&lt;/strong&gt;, enables GC log rotation. The second, &lt;strong&gt;-XX:NumberOfGCLogFiles&lt;/strong&gt;, tells the JVM how many GC log files should be kept; for example, &lt;strong&gt;-XX:NumberOfGCLogFiles=10&lt;/strong&gt; will keep up to 10 GC log files. Finally, &lt;strong&gt;-XX:GCLogFileSize&lt;/strong&gt; controls how large a single GC log file can be; for example, &lt;strong&gt;-XX:GCLogFileSize=10m&lt;/strong&gt; will rotate the GC log file when it reaches 10 megabytes.&lt;/p&gt;
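
&lt;p&gt;Put together, a JDK 8 startup command with GC log rotation enabled might look like the following sketch; the log path and jar name are illustrative:&lt;/p&gt;

```shell
# JDK 8: detailed GC logging with rotation,
# keeping up to 10 files of 10 MB each
java -XX:+PrintGCDetails \
     -Xloggc:/var/log/myapp/gc.log \
     -XX:+UseGCLogFileRotation \
     -XX:NumberOfGCLogFiles=10 \
     -XX:GCLogFileSize=10m \
     -jar my_awesome_app.jar
```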

&lt;p&gt;When using JDK 11, where G1GC is the default garbage collector, log rotation is controlled via the &lt;strong&gt;-Xlog&lt;/strong&gt; options themselves, for example: &lt;strong&gt;java -Xlog:gc*:file=gc.log,filecount=10,filesize=10m&lt;/strong&gt;. This results in exactly the same behavior: up to 10 GC log files, each up to 10 megabytes in size.&lt;/p&gt;

&lt;p&gt;Now that we know how important the JVM garbage collector logs are, and we’ve turned them on by default, we can start analyzing them.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to Analyze GC Logs
&lt;/h1&gt;

&lt;p&gt;Understanding garbage collection logs is not easy. It requires an understanding of how the Java virtual machine works and of the application’s memory usage. In this blog post, we will skip the analysis of the application itself, as it differs from application to application and requires knowledge of the code. What we will discuss is how to read and analyze the garbage collection logs that we can get out of the JVM.&lt;/p&gt;

&lt;p&gt;It is also very important to remember that there are various JVM versions and multiple garbage collector implementations. You can still encounter Java 7, 8, 9 and so on, and some companies still use Java 6 for various reasons. Each version may be running a different garbage collector — Serial, Parallel, Concurrent Mark Sweep, G1, or even Shenandoah or ZGC. You can expect different Java versions and different garbage collector implementations to output slightly different log formats, and of course we will not be discussing all of them. In fact, we will show you only a small portion of the logs, but one that should help you understand all the other garbage collector logs as well.&lt;/p&gt;

&lt;p&gt;The garbage collection logs will be able to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When was the young generation garbage collector used?&lt;/li&gt;
&lt;li&gt;When was the old generation garbage collector used?&lt;/li&gt;
&lt;li&gt;How many garbage collections were run?&lt;/li&gt;
&lt;li&gt;For how long were the garbage collectors running?&lt;/li&gt;
&lt;li&gt;What was the memory utilization before and after garbage collection?&lt;/li&gt;
&lt;/ul&gt;
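
&lt;p&gt;As a rough sketch of how some of these questions can be answered mechanically, the classic JDK 8 log format can simply be counted with &lt;strong&gt;grep&lt;/strong&gt;. The sample log lines below are made up for illustration:&lt;/p&gt;

```shell
# Two sample lines in the classic JDK 8 -XX:+PrintGCDetails format
cat > gc.log <<'EOF'
2019-10-30T11:13:00.920-0100: 6.399: [Full GC (Allocation Failure) 63359K->48737K(63360K), 0.1418689 secs]
2019-10-30T11:13:01.120-0100: 6.599: [GC (Allocation Failure) 26000K->2000K(63360K), 0.0021740 secs]
EOF

# "[GC (" marks minor collections, "[Full GC (" marks full collections
minor=$(grep -c '\[GC (' gc.log)
full=$(grep -c '\[Full GC (' gc.log)
echo "minor=$minor full=$full"
```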

&lt;p&gt;Let’s now look at an example taken out of a JVM garbage collector log and analyze each fragment highlighting the crucial parts behind it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Parallel and Concurrent Mark Sweep Garbage Collectors
&lt;/h1&gt;

&lt;p&gt;Let’s start by looking at Java 8 and the Parallel collector for the young generation space and the Concurrent Mark Sweep garbage collector for the old generation. A single line coming from our JVM garbage collector can look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-10-30T11:13:00.920-0100: 6.399: [Full GC (Allocation Failure) 2019-10-30T11:13:00.920-0100: 6.399: [CMS: 43711K-&amp;gt;43711K(43712K), 0.1417937 secs] 63359K-&amp;gt;48737K(63360K), [Metaspace: 47130K-&amp;gt;47130K(1093632K)], 0.1418689 secs] [Times: user=0.14 sys=0.00, real=0.14 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First of all, you can see the date and time of the event, which in our case is &lt;strong&gt;2019-10-30T11:13:00.920-0100&lt;/strong&gt;. This timestamp tells you exactly when the event happened.&lt;/p&gt;

&lt;p&gt;The next thing we can see in the logline above is the type of garbage collection. In our case, it is Full GC and you can also expect GC as a value here. There are three types of garbage collector events that can happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minor garbage collection&lt;/li&gt;
&lt;li&gt;Major garbage collection&lt;/li&gt;
&lt;li&gt;Full garbage collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Minor garbage collection&lt;/strong&gt; means that a &lt;strong&gt;young generation&lt;/strong&gt; space clearing event was performed by the JVM. The minor garbage collector is triggered when there is not enough memory to allocate a new object on the heap, i.e. when the Eden space is full or getting close to full. If your application creates new objects very often, you can expect the minor garbage collector to run often. What you should remember is that during minor garbage collection, the live data in the Eden and survivor spaces is copied in its entirety, which means that no memory fragmentation occurs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Major garbage collection&lt;/strong&gt; means that a &lt;strong&gt;tenured generation&lt;/strong&gt; clearing event was performed. The tenured generation is also widely called the old generation space. Depending on the garbage collector and its settings, the tenured generation cleaning may happen more or less often. Which is better? The right answer depends on the use case, and we will not be covering that in this blog post.&lt;/p&gt;

&lt;p&gt;Java &lt;strong&gt;Full GC&lt;/strong&gt; means that a full garbage collection event happened, i.e. &lt;strong&gt;both&lt;/strong&gt; the &lt;strong&gt;young&lt;/strong&gt; and &lt;strong&gt;old generation&lt;/strong&gt; were cleared. The garbage collector tried to clear them, and the log tells us what the outcome of that procedure was. Tenured generation cleaning requires mark, sweep and compact phases to avoid high memory fragmentation. If a garbage collector didn’t care about memory fragmentation, you could end up in a situation where you have enough total memory, but it is so fragmented that an object can’t be allocated. We can illustrate this situation with the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsyd0sgmtot178d1k1c2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsyd0sgmtot178d1k1c2p.png" alt="Memory Fragmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is also one part that we didn’t discuss — the Allocation Failure. The Allocation Failure part of the garbage collector logline explains why the garbage collection cycle started. It usually means that there was no space left for new object allocation in the Eden space of heap memory and the garbage collector tried to free some memory for new objects. The Allocation Failure can also be generated by the remark phase of the Concurrent Mark Sweep garbage collector.&lt;/p&gt;
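
&lt;p&gt;If you want a quick overview of which causes triggered collections, the parenthesized cause text can be tallied with standard shell tools. The sketch below assumes the JDK 8 log format; both the sample lines and the list of cause names are illustrative, not exhaustive:&lt;/p&gt;

```shell
# Sample JDK 8 GC log lines (illustrative)
cat > gc.log <<'EOF'
6.399: [GC (Allocation Failure) 26000K->2000K(63360K), 0.0021740 secs]
6.599: [GC (Allocation Failure) 26500K->2100K(63360K), 0.0020310 secs]
7.100: [GC (Metadata GC Threshold) 15000K->2200K(63360K), 0.0030120 secs]
EOF

# Tally the parenthesized GC causes, most frequent first
grep -oE '\((Allocation Failure|Metadata GC Threshold|G1 Evacuation Pause)\)' gc.log \
  | sort | uniq -c | sort -rn
```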

&lt;p&gt;The next important thing in the logline is the information about the memory occupation before and after the garbage collection process. Let’s look into the line once again in greater detail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejtwtmbobukj6wg5zib3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejtwtmbobukj6wg5zib3.png" alt="GC Log Line Analysis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the line contains a lot of useful information. In addition to what we already discussed, we have the memory utilization both before and after the collection, the time the garbage collection took, and the CPU resources used during the process, which is everything needed to judge how fast or slow the collection was.&lt;/p&gt;

&lt;p&gt;One very important piece of information that the JVM garbage collector gives us is the total time for which the application threads were stopped. You can expect the threads to be stopped very often, but for very short periods of time. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the threads were stopped for 0.0001006 seconds and that stopping them took 0.0000065 seconds. This is not a long time for threads to be stopped, and you will see information like this over and over again in your garbage collector logs. What should raise a red flag is a long thread stop time, also called a stop-the-world event, which essentially halts your application. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-11-02T17:11:54.259-0100: 7.438: Total time for which application threads were stopped: 11.2305001 seconds, Stopping threads took: 0.5230011 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the logline above, we can see that the application threads were stopped for more than 11 seconds. What does that mean? Basically, your application was not responding for more than 11 seconds: it wasn’t responding to any requests, it wasn’t processing data, and the JVM was only doing garbage collection. You want to avoid situations like this at all costs, because it is a sign of a serious memory problem. Either your heap is too small for your application to properly do its job, or you have a memory leak that gradually fills up your heap space, leading first to long garbage collections and eventually to running out of memory. At that point your application will no longer be able to create new objects and will stop working.&lt;/p&gt;
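
&lt;p&gt;To hunt for such long pauses without reading the log line by line, a simple filter helps. The sketch below assumes the JDK 8 thread-stop line format shown above; the sample lines are illustrative:&lt;/p&gt;

```shell
# Sample "application threads were stopped" lines (illustrative)
cat > gc.log <<'EOF'
2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds
2019-11-02T17:11:54.259-0100: 7.438: Total time for which application threads were stopped: 11.2305001 seconds, Stopping threads took: 0.5230011 seconds
EOF

# Show only the pauses whose integer part is at least 1 second
grep -E 'stopped: [1-9][0-9]*\.[0-9]+ seconds' gc.log
```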

&lt;h1&gt;
  
  
  G1 Garbage Collector
&lt;/h1&gt;

&lt;p&gt;Let’s now look at what the G1 garbage collector looks like. We will disable the previously used CMS garbage collector and turn on G1GC by using the following application options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;-XX:+UseG1GC&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-XX:-UseConcMarkSweepGC&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-XX:-UseCMSInitiatingOccupancyOnly&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we turn on the G1 garbage collector and disable the Concurrent Mark Sweep collector.&lt;/p&gt;

&lt;p&gt;A standard G1 garbage collector log entry looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-11-03T21:26:21.827-0100: 2.069: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 2097152 bytes, new threshold 15 (max 15)
- age   1:     341608 bytes,     341608 total
, 0.0021740 secs]
   [Parallel Time: 0.9 ms, GC Workers: 10]
      [GC Worker Start (ms): Min: 2069.4, Avg: 2069.5, Max: 2069.6, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 0.1, Avg: 0.2, Max: 0.4, Diff: 0.3, Sum: 1.5]
      [Update RS (ms): Min: 0.1, Avg: 0.2, Max: 0.3, Diff: 0.2, Sum: 2.3]
         [Processed Buffers: Min: 1, Avg: 1.4, Max: 4, Diff: 3, Sum: 14]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.2, Avg: 0.3, Max: 0.3, Diff: 0.1, Sum: 3.0]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 10]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 0.6, Avg: 0.7, Max: 0.8, Diff: 0.1, Sum: 7.0]
      [GC Worker End (ms): Min: 2070.2, Avg: 2070.2, Max: 2070.2, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 1.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.8 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 26.0M(26.0M)-&amp;gt;0.0B(30.0M) Survivors: 5120.0K-&amp;gt;3072.0K Heap: 51.4M(64.0M)-&amp;gt;22.6M(64.0M)]
 [Times: user=0.01 sys=0.00, real=0.01 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the log above you can see that we had a young generation garbage collection event, &lt;strong&gt;[GC pause (G1 Evacuation Pause) (young)&lt;/strong&gt;, which resulted in certain regions of memory being cleared: &lt;strong&gt;[Eden: 26.0M(26.0M)-&amp;gt;0.0B(30.0M) Survivors: 5120.0K-&amp;gt;3072.0K Heap: 51.4M(64.0M)-&amp;gt;22.6M(64.0M)]&lt;/strong&gt;. We also have the timing information and CPU usage: &lt;strong&gt;[Times: user=0.01 sys=0.00, real=0.01 secs]&lt;/strong&gt;. The timings have exactly the same meaning as in the previous garbage collector discussion: user- and system-scope CPU usage during the garbage collection, along with the real time it took.&lt;/p&gt;

&lt;p&gt;The memory information summary is detailed and gives us an overview of what happened. We can see that the Eden space was fully cleared: &lt;strong&gt;26.0M(26.0M)-&amp;gt;0.0B(30.0M)&lt;/strong&gt;. The garbage collection started when Eden held 26M of data out of a 26M capacity, and after the collection Eden was completely empty, with its capacity resized to 30M. The garbage collection started with the survivor space holding 5120K of data and ended with 3072K in it. Finally, the whole heap started at 51.4M occupied out of a total size of 64M and ended at 22.6M.&lt;/p&gt;

&lt;p&gt;In addition to that, you also see more detailed information about the internals of the parallel garbage collector workers and the phases of their work, such as worker start times, root scanning and object copying.&lt;/p&gt;

&lt;p&gt;You can also see additional log entries related to G1 garbage collector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-11-03T21:26:23.704-0100: 2019-11-03T21:26:23.704-0100: 3.946: 3.946: [GC concurrent-root-region-scan-start]
Total time for which application threads were stopped: 0.0035771 seconds, Stopping threads took: 0.0000111 seconds
2019-11-03T21:26:23.706-0100: 3.948: [GC concurrent-root-region-scan-end, 0.0017994 secs]
2019-11-03T21:26:23.706-0100: 3.948: [GC concurrent-mark-start]
2019-11-03T21:26:23.737-0100: 3.979: [GC concurrent-mark-end, 0.0315921 secs]
2019-11-03T21:26:23.737-0100: 3.979: [GC remark 2019-11-03T21:26:23.737-0100: 3.979: [Finalize Marking, 0.0002017 secs] 2019-11-03T21:26:23.738-0100: 3.980: [GC ref-proc, 0.0004151 secs] 2019-11-03T21:26:23.738-0100: 3.980: [Unloading, 0.0025065 secs], 0.0033738 secs]
 [Times: user=0.04 sys=0.01, real=0.01 secs]
2019-11-03T21:26:23.741-0100: 3.983: Total time for which application threads were stopped: 0.0034705 seconds, Stopping threads took: 0.0000308 seconds
2019-11-03T21:26:23.741-0100: 3.983: [GC cleanup 54M-&amp;gt;54M(64M), 0.0004419 secs]
 [Times: user=0.00 sys=0.00, real=0.00 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, the above log lines are different, but the principles still stand. The log gives us information about the total time for which the application threads were stopped, the result of the cleanup done by the garbage collector and the resources used.&lt;/p&gt;

&lt;h1&gt;
  
  
  GC Logging Options in Java 9 and Newer
&lt;/h1&gt;

&lt;p&gt;We can go even deeper with garbage collection logging and turn on the debug level. Let’s take Java 10 as an example and add &lt;strong&gt;-Xlog:gc*,gc+phases=debug&lt;/strong&gt; to the startup parameters of the JVM. This turns on debug-level logging for the garbage collection phases of the default G1 garbage collector on Java 10, enabling verbose GC logging that gives you extensive information about the garbage collector’s work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0.006s][info][gc,heap] Heap region size: 1M
[0.012s][info][gc     ] Using G1
[0.013s][info][gc,heap,coops] Heap address: 0x00000006c0000000, size: 4096 MB, Compressed Oops mode: Zero based, Oop shift amount: 3
[0.428s][info][gc,start     ] GC(0) Pause Young (G1 Evacuation Pause)
[0.428s][info][gc,task      ] GC(0) Using 2 workers of 2 for evacuation
[0.432s][info][gc,phases    ] GC(0)   Pre Evacuate Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Prepare TLABs: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Choose Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Humongous Register: 0.0ms
[0.433s][info ][gc,phases    ] GC(0)   Evacuate Collection Set: 3.8ms
[0.433s][debug][gc,phases    ] GC(0)     Ext Root Scanning (ms):   Min:  0.6, Avg:  0.7, Max:  0.8, Diff:  0.2, Sum:  1.4, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Update RS (ms):           Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Processed Buffers:        Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Scanned Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Skipped Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Scan RS (ms):             Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Scanned Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Claimed Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Skipped Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Code Root Scanning (ms):  Min:  0.0, Avg:  0.1, Max:  0.1, Diff:  0.1, Sum:  0.1, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     AOT Root Scanning (ms):   skipped
[0.433s][debug][gc,phases    ] GC(0)     Object Copy (ms):         Min:  2.8, Avg:  2.9, Max:  3.0, Diff:  0.2, Sum:  5.7, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Termination (ms):         Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Termination Attempts:     Min: 1, Avg:  1.0, Max: 1, Diff: 0, Sum: 2, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     GC Worker Other (ms):     Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     GC Worker Total (ms):     Min:  3.6, Avg:  3.6, Max:  3.7, Diff:  0.1, Sum:  7.3, Workers: 2
[0.433s][info ][gc,phases    ] GC(0)   Post Evacuate Collection Set: 0.1ms
[0.433s][debug][gc,phases    ] GC(0)     Code Roots Fixup: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Preserve CM Refs: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Reference Processing: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Clear Card Table: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Reference Enqueuing: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Merge Per-Thread State: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Code Roots Purge: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Redirty Cards: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     DerivedPointerTable Update: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Free Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Humongous Reclaim: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Start New Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Resize TLABs: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Expand Heap After Collection: 0.0ms
[0.433s][info ][gc,phases    ] GC(0)   Other: 0.2ms
[0.433s][info ][gc,heap      ] GC(0) Eden regions: 7-&amp;gt;0(72)
[0.433s][info ][gc,heap      ] GC(0) Survivor regions: 0-&amp;gt;1(1)
[0.433s][info ][gc,heap      ] GC(0) Old regions: 0-&amp;gt;1
[0.433s][info ][gc,heap      ] GC(0) Humongous regions: 6-&amp;gt;3
[0.433s][info ][gc,metaspace ] GC(0) Metaspace: 9281K-&amp;gt;9281K(1058816K)
[0.433s][info ][gc           ] GC(0) Pause Young (G1 Evacuation Pause) 13M-&amp;gt;4M(122M) 4.752ms
[0.433s][info ][gc,cpu       ] GC(0) User=0.00s Sys=0.01s Real=0.00s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the exact phase timings in the debug-level lines of the log above. They were not present in the G1 garbage collector log that we discussed earlier. Of course, phases are not the only option that you can turn on. These options became available with Java 9 and correspond to flags that were removed or deprecated. Here are some of the options available in earlier Java Virtual Machine versions and what they translate to in Java 9 and newer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintHeapAtGC&lt;/strong&gt; can now be expressed as &lt;strong&gt;-Xlog:gc+heap=debug&lt;/strong&gt; option&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintParallelOldGCPhasesTimes&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+phases*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintGCApplicationConcurrentTime&lt;/strong&gt; and &lt;strong&gt;-XX:+PrintGCApplicationStoppedTime&lt;/strong&gt; can now be expressed as &lt;strong&gt;-Xlog:safepoint&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+G1PrintHeapRegions&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+region*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+SummarizeConcMark&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+marking*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+SummarizeRSetStats&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+remset*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintJNIGCStalls&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+jni*=debug&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintTaskqueue&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+task+stats*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+TraceDynamicGCThreads&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+task*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintAdaptiveSizePolicy&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+ergo*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintTenuringDistribution&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+age*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can combine multiple options, or enable all of them by adding the &lt;strong&gt;-Xlog:all=trace&lt;/strong&gt; flag to your JVM application startup parameters. Be aware, though, that this can result in quite a lot of information in the garbage collector log files. To avoid the flood of information you can set it to debug using &lt;strong&gt;-Xlog:all=debug&lt;/strong&gt;, which reduces the amount of information while still giving you far more than the standard garbage collector log.&lt;/p&gt;
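
&lt;p&gt;To illustrate combining selectors, the unified logging syntax accepts several comma-separated tag selections in a single flag. The following sketch, with an illustrative log path and jar name, logs GC details together with safepoint information:&lt;/p&gt;

```shell
# Java 9+: GC details plus safepoint logging, written to one file
java -Xlog:gc*,safepoint:file=/var/log/myapp/gc.log -jar my_awesome_app.jar
```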

&lt;h1&gt;
  
  
  Java Garbage Collection Logging: Log Analysis Tools you Need to Know About
&lt;/h1&gt;

&lt;p&gt;There are &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;log analysis tools&lt;/a&gt; that can help you analyze the garbage collector logs, though none are available out of the box in the standard Java Virtual Machine distribution.&lt;/p&gt;

&lt;h1&gt;
  
  
  APM &amp;amp; Observability Tools
&lt;/h1&gt;

&lt;p&gt;When it comes to observing a high-level overview of the performance of the Java garbage collector, you can use one of the observability tools providing Java application-level monitoring, for example our own &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;Sematext JVM Monitoring&lt;/a&gt;, provided by &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A tool like this should give you summary information about how the garbage collector works: collection times and counts, the maximum collection time, and the average collection size. In most cases, this is more than enough to spot issues with garbage collection without going deep into the logs and analyzing them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fme47hcznusa1ddp23l32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fme47hcznusa1ddp23l32.png" alt="JVM GC Metrics Visualized"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when troubleshooting you may need to have a more fine-grained view over what was happening inside the garbage collector in the JVM. If you don’t want to analyze the data manually there are tools that can help you.&lt;/p&gt;

&lt;h1&gt;
  
  
  GCViewer
&lt;/h1&gt;

&lt;p&gt;For example, one of the tools that can help you visualize GC logs is &lt;a href="http://www.tagtraum.com/gcviewer.html" rel="noopener noreferrer"&gt;GCViewer&lt;/a&gt;, a tool that allows you to analyze garbage collector logs up to Java 1.5, and its &lt;a href="https://github.com/chewiebug/GCViewer" rel="noopener noreferrer"&gt;continuation&lt;/a&gt;, which aims to support newer Java versions and the G1 garbage collector.&lt;/p&gt;

&lt;p&gt;GCViewer aims to provide comprehensive information about memory utilization and the garbage collection process overall. It is open source and completely free for personal and commercial use, supporting logs up to and including Java 8 as well as the unified logging format of OpenJDK 9 and 10.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc69d9kgl730auvest9cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc69d9kgl730auvest9cq.png" alt="GC Viewer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  GCEasy
&lt;/h1&gt;

&lt;p&gt;There are also proprietary, commercial tools. One of them is &lt;a href="https://gceasy.io/" rel="noopener noreferrer"&gt;GCEasy&lt;/a&gt;, an online GC log analyzer where you can upload a garbage collection log and get the results in the form of an easy-to-read log analysis report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flbq5gb53igm1io4xo9i2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flbq5gb53igm1io4xo9i2.png" alt="GC Easy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report includes information like generation size and maximum size, key performance indicators like average and maximum pause time, pause statistics, memory leak information, and interactive graphs showing each heap memory space. All of this information is calculated from the log file you provide.&lt;/p&gt;

&lt;p&gt;Even though GCEasy has a free plan, it is limited. At the time of writing, a single user could upload 5 GC log files a month, with up to 50 MB per file. There are additional plans available if you are interested in using the tool.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping Up
&lt;/h1&gt;

&lt;p&gt;Understanding garbage collector logs is not easy. The large number of possible formats, different Java Virtual Machine versions, and different garbage collector implementations don’t make it any simpler. Even though there are a lot of options to remember, certain parts are common. Each garbage collector will tell you the size of the heap and the before-and-after occupancy of the region of the heap that was cleared. Finally, you will also see the time and resources used to perform the operation. Start from that and continue the journey of understanding the JVM garbage collection process and the memory usage of your application. Happy analysis :)&lt;/p&gt;

</description>
      <category>java</category>
      <category>gc</category>
      <category>logs</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
