<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rafał Kuć</title>
    <description>The latest articles on DEV Community by Rafał Kuć (@gr0).</description>
    <link>https://dev.to/gr0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F145535%2F0f6b7fdd-c4b3-40ac-8e2d-44128c0ab8f6.jpeg</url>
      <title>DEV Community: Rafał Kuć</title>
      <link>https://dev.to/gr0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gr0"/>
    <language>en</language>
    <item>
      <title>Getting Started with Sematext Browser SDK for Front-end Performance Monitoring</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Tue, 08 Dec 2020 07:12:52 +0000</pubDate>
      <link>https://dev.to/gr0/getting-started-with-sematext-browser-sdk-for-front-end-performance-monitoring-jh9</link>
      <guid>https://dev.to/gr0/getting-started-with-sematext-browser-sdk-for-front-end-performance-monitoring-jh9</guid>
      <description>&lt;p&gt;Open-sourcing a code base for the world to see after working on it for a long time is a great experience. You should care about what your users want. You want your users to have a great experience using your product. Everything has to fall into place. Performance, responsiveness, user experience, etc. all have to be exceptional. That’s why I think front-end performance metrics are crucial.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Metrics Should You Monitor to Improve Front-end Performance?
&lt;/h3&gt;

&lt;p&gt;You want to know your &lt;a href="https://sematext.com/blog/website-performance-metrics#toc-4-page-speed-and-load-time-5"&gt;page load times&lt;/a&gt; and how fast (or slow!) your &lt;strong&gt;HTTP requests&lt;/strong&gt; are. You want to measure the Apdex score for DOM elements. Among user-centric performance metrics, &lt;strong&gt;largest contentful paint&lt;/strong&gt; measures page load performance, &lt;strong&gt;first input delay&lt;/strong&gt; measures interactivity, and &lt;strong&gt;cumulative layout shift&lt;/strong&gt; tells you about the visual stability of your web application. You can read more about these metrics in &lt;a href="https://sematext.com/blog/improve-website-performance#toc-how-to-measure-page-load-time-2"&gt;Tips and tricks about how to optimize website performance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But those are only basic examples.&lt;/p&gt;

&lt;p&gt;You can use the &lt;a href="https://github.com/sematext/browser-sdk/"&gt;Sematext Browser SDK&lt;/a&gt; to implement more complex monitoring solutions and collect the exact metrics you want from real users in real browsers in real time. Woah, so many uses of "real" in one sentence! Want another one? &lt;a href="https://sematext.com/blog/what-is-real-user-monitoring/"&gt;What is Real User Monitoring&lt;/a&gt;? Another? &lt;a href="https://sematext.com/blog/5-best-practices-for-getting-the-most-out-of-rum/"&gt;5 Best Practices for Real User Monitoring&lt;/a&gt;. I could go on like this all night.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EXSEAk15--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0png37i7lqxgev18zwc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EXSEAk15--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0png37i7lqxgev18zwc4.png" alt="Sematext Experience"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Front-end Performance Monitoring with Sematext Browser SDK
&lt;/h3&gt;

&lt;p&gt;While working on open-source projects at &lt;a href="https://www.sematext.com/"&gt;Sematext&lt;/a&gt; we’ve learned how crucial it is to be able to see the codebase. It allows you to understand the solution better and adapt it to meet your own needs. It’s also valuable for the project, as you get contributors and bug fixes from the community using the product.&lt;/p&gt;

&lt;p&gt;That’s why we open-sourced the &lt;a href="https://sematext.com/docs/agents/browser/"&gt;Sematext Browser SDK&lt;/a&gt;. It’s the library powering our &lt;a href="https://sematext.com/experience/"&gt;Sematext Experience&lt;/a&gt; data collection. It is now available on &lt;a href="https://github.com/sematext/browser-sdk/"&gt;GitHub&lt;/a&gt; under the Apache 2.0 license. This means you can get insight into how the SDK collects metrics, how it ships data to &lt;a href="https://sematext.com/cloud/"&gt;Sematext Cloud&lt;/a&gt;, our &lt;a href="https://sematext.com/blog/cloud-monitoring-tools/"&gt;cloud monitoring solution&lt;/a&gt;, and even modify the script itself if you wish to do so.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;To get started with the Sematext Browser SDK head over to &lt;a href="https://github.com/sematext/browser-sdk/"&gt;GitHub&lt;/a&gt; and clone it. The project is developed using &lt;a href="https://en.wikipedia.org/wiki/ECMAScript"&gt;ECMAScript 2015&lt;/a&gt; and uses various tools, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm&lt;/strong&gt; package manager&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;yarn&lt;/strong&gt; package dependency manager&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flow&lt;/strong&gt; static type checker&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cypress.io&lt;/strong&gt; integration tests framework&lt;/li&gt;
&lt;/ul&gt;
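
&lt;p&gt;Assuming the usual yarn-based flow, getting a local copy ready could look like this (these commands are a sketch – check the repository README for the authoritative steps):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/sematext/browser-sdk.git
$ cd browser-sdk
$ yarn install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;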

&lt;h3&gt;
  
  
  SDK Customization Examples
&lt;/h3&gt;

&lt;p&gt;Open-sourcing the Sematext Browser SDK brings new possibilities. You can not only see what the script is doing with your website or web application but also modify the Experience script’s behavior.&lt;/p&gt;

&lt;p&gt;Some of the example modifications that are possible include, but are not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;include page loads and HTTP requests happening in the background&lt;/li&gt;
&lt;li&gt;record every page load – even the ones that were caused by the refresh issued by the user when using Ctrl+F5&lt;/li&gt;
&lt;li&gt;avoid sending part of the data – you don’t want to collect info about element timing, web vitals, or memory usage? You can do that!&lt;/li&gt;
&lt;li&gt;change how the URL is parsed and omit parts of it – for example, the query part of the URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just a few examples of what you can do with the Sematext Browser SDK now available as open-source.&lt;/p&gt;

&lt;p&gt;Let’s look at one of these customizations: how to include page loads and HTTP requests happening in the background and send that data to &lt;a href="https://sematext.com/experience/"&gt;Sematext Experience&lt;/a&gt;, our &lt;a href="https://sematext.com/blog/real-user-monitoring-tools/"&gt;real user monitoring tool&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Include Page Loads and HTTP Requests Happening in the Background
&lt;/h3&gt;

&lt;p&gt;By default, the Sematext Browser SDK stops listening to page loads, HTTP requests, and element timing metrics when the tab of your web browser is in the background. But you may want to include that data. Here’s how to do it.&lt;/p&gt;

&lt;p&gt;Start with the &lt;a href="https://github.com/sematext/browser-sdk/blob/master/src/index.js"&gt;index.js&lt;/a&gt; file. During startup the script creates a so-called &lt;strong&gt;DocumentVisibilityObserver&lt;/strong&gt;, which is responsible for informing the metrics collection mechanisms whether the content is visible or hidden. The code for this looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const visibilityObserver = new DocumentVisibilityObserver();
const pageLoadDispatcher = new PageLoadDispatcher();
const ajaxDispatcher = new AjaxDispatcher();
const elementTimingDispatcher = new ElementTimingDispatcher();

visibilityObserver.addListener(pageLoadDispatcher);
visibilityObserver.addListener(ajaxDispatcher);
visibilityObserver.addListener(elementTimingDispatcher);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PageLoadDispatcher&lt;/strong&gt;, &lt;strong&gt;AjaxDispatcher&lt;/strong&gt;, and &lt;strong&gt;ElementTimingDispatcher&lt;/strong&gt; are responsible for creating commands and sending metrics. Once you create their instances you add them into the created &lt;strong&gt;DocumentVisibilityObserver&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;visibilityObserver.addListener(pageLoadDispatcher);
visibilityObserver.addListener(ajaxDispatcher);
visibilityObserver.addListener(elementTimingDispatcher);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to keep measuring page loads, HTTP requests, and element timing metrics when the page is in the background you need to remove the code above. That’s all. You could of course clean up the code and remove the &lt;strong&gt;DocumentVisibilityObserver&lt;/strong&gt; completely, but let’s not complicate our lives and make one change at a time.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Build the Experience Script
&lt;/h3&gt;

&lt;p&gt;Once you’re happy with the changes you need to do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if the code passes lint by running &lt;strong&gt;yarn lint&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Check if the code passes static type checking by running &lt;strong&gt;yarn flow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Check if the integration tests are passing by running &lt;strong&gt;yarn e2e&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything is OK you can just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ yarn build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should be similar to the following one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ browser-sdk git:(master) ✗ yarn build
yarn run v1.22.5
$ webpack --mode production
Hash: 801dbe739470081f7282
Version: webpack 4.44.1
Time: 1885ms
Built at: 10/13/2020 5:05:23 PM
 Asset Size Chunks Chunk Names
 experience.js 135 KiB 0 [emitted] main
experience.js.LICENSE.txt 1.6 KiB [emitted]
Entrypoint main = experience.js
 [0] ./src/common.js 9.38 KiB {0} [built]
 [2] ./src/CommandExecutor.js 2.07 KiB {0} [built]
 [6] ./src/element/utils.js 4.41 KiB {0} [built]
 [8] ./src/constants.js 1.03 KiB {0} [built]
 [34] ./src/index.js 3.6 KiB {0} [built]
 [88] ./src/dispatchers/PageLoadDispatcher.js 5.02 KiB {0} [built]
 [93] ./src/RumUploader.js 9.09 KiB {0} [built]
 [95] ./src/bootstrap.js 3.37 KiB {0} [built]
 [96] ./src/dispatchers/AjaxDispatcher.js 3.6 KiB {0} [built]
 [97] ./src/ajax/xhr.js 1.52 KiB {0} [built]
 [98] ./src/ajax/fetch.js 1.81 KiB {0} [built]
 [99] ./src/DocumentVisibilityObserver.js 2.32 KiB {0} [built]
[100] ./src/dispatchers/ElementTimingDispatcher.js 3.04 KiB {0} [built]
[101] ./src/dispatchers/WebVitalsDispatcher.js 2.43 KiB {0} [built]
[103] ./src/dispatchers/MemoryUsageDispatcher.js 2.91 KiB {0} [built]
 + 90 hidden modules
✨ Done in 3.21s.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means that the Experience script is now ready to be used and you can find the &lt;strong&gt;experience.js&lt;/strong&gt; file in the &lt;strong&gt;dist&lt;/strong&gt; directory of the project.&lt;/p&gt;

&lt;p&gt;Because you have now modified the Sematext Browser SDK behavior and built your own version, you need to host your custom &lt;strong&gt;experience.js&lt;/strong&gt; file somewhere. You can host it along with your application or web page, or push it to a CDN of some kind – for example, Cloudflare.&lt;/p&gt;

&lt;p&gt;Once that is done, instead of adding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(window,document,"script","//cdn.sematext.com/experience.js","strum");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to the &lt;strong&gt;head&lt;/strong&gt; section in your application or a web page you just need to specify the location of the modified file. For example, if you host the modified &lt;strong&gt;experience.js&lt;/strong&gt; file at &lt;strong&gt;awesome.website.com&lt;/strong&gt; it would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(window,document,"script","https://awesome.website.com/experience.js","strum");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that if you are shipping front-end metrics to Sematext Experience with your customized script, you may want to pull the changes from the Sematext Browser SDK repo from time to time and merge them with your cloned repo. That way you will benefit from any improvements, bug fixes, or additional data Sematext started collecting and exposing in Experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;You can see that modifying the Experience script by working with the Sematext Browser SDK is not complicated and you can adjust it to your needs. We encourage you to try and experiment with the code.&lt;/p&gt;

&lt;p&gt;If you seek more detailed information regarding the Experience script we encourage you to head over to the &lt;a href="https://sematext.com/docs/agents/browser/"&gt;dedicated documentation&lt;/a&gt; in the official &lt;a href="https://sematext.com/docs/agents/browser/"&gt;Sematext docs&lt;/a&gt; pages. On GitHub, you will find the list of supported commands, global constants, and the installation instructions for the &lt;a href="https://github.com/sematext/browser-sdk/"&gt;Browser SDK&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are open to questions, suggestions, enhancements, and potential fixes. Feel free to open a GitHub issue with questions or a pull request with improvements. After all, we made it for you 🙂&lt;/p&gt;

&lt;h3&gt;
  
  
  Get Started with Experience
&lt;/h3&gt;

&lt;p&gt;You can make your front-end better by using the Sematext Browser SDK and sending performance metrics to Sematext. You get insight into all key front-end metrics and how to improve them. Think of it as Lighthouse on steroids.&lt;/p&gt;

&lt;p&gt;To get started head over to Sematext Cloud, &lt;a href="https://apps.sematext.com/ui/registration"&gt;sign up&lt;/a&gt;, and follow the getting started instructions for &lt;a href="https://sematext.com/experience"&gt;Experience&lt;/a&gt;. You’ll see front-end performance metrics show up in Sematext in no time.&lt;/p&gt;

</description>
      <category>browser</category>
      <category>monitoring</category>
      <category>sdk</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Logging Best Practices: 10+ Tips You Should Know to Get the Most Out of Your Logs</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Thu, 10 Sep 2020 17:40:07 +0000</pubDate>
      <link>https://dev.to/sematext/java-logging-best-practices-10-tips-you-should-know-to-get-the-most-out-of-your-logs-593c</link>
      <guid>https://dev.to/sematext/java-logging-best-practices-10-tips-you-should-know-to-get-the-most-out-of-your-logs-593c</guid>
      <description>&lt;p&gt;Having visibility into your Java application is crucial for understanding how it works right now, how it worked some time in the past and increasing your understanding of how it might work in the future. More often than not, &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analyzing logs&lt;/a&gt; is the fastest way to detect what went wrong, thus making logging in Java critical to ensuring the performance and health of your app, as well as minimizing and reducing any downtime. Having a &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;centralized logging and monitoring solution&lt;/a&gt; helps reduce the Mean Time To Repair by improving the effectiveness of your Ops or &lt;a href="https://sematext.com/blog/devops-roles/" rel="noopener noreferrer"&gt;DevOps team&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By following &lt;a href="https://sematext.com/blog/best-practices-for-efficient-log-management-and-monitoring/" rel="noopener noreferrer"&gt;logging best practices&lt;/a&gt; you will get more value out of your logs and make it easier to use them. You will be able to more easily pinpoint the root cause of errors and poor performance and solve problems before they impact end-users. So today, let me share some of the &lt;strong&gt;best practices you should follow when working with Java applications&lt;/strong&gt;. Let’s dig in.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use a Standard Logging Library
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://sematext.com/blog/java-logging/" rel="noopener noreferrer"&gt;Logging in Java&lt;/a&gt; can be done a few different ways. You can use a dedicated logging library, a common API, or even just write logs to file or directly to a dedicated logging system. However, when choosing the logging library for your system think ahead. Things to consider and evaluate are performance, flexibility, appenders for new &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log centralization solutions&lt;/a&gt;, and so on. If you tie yourself directly to a single framework the switch to a newer library can take a substantial amount of work and time. Keep that in mind and go for the API that will give you the flexibility to swap logging libraries in the future. Just like with the switch from Log4j to &lt;a href="http://logback.qos.ch/" rel="noopener noreferrer"&gt;Logback&lt;/a&gt; and to &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt;, when using the &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; API the only thing you need to do is change the dependency, not the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Select Your Appenders Wisely
&lt;/h3&gt;

&lt;p&gt;Appenders define where your log events will be delivered. The most common appenders are the Console and File appenders. While useful and widely known, they may not fulfill your requirements. For example, you may want to write your logs in an asynchronous way, or you may want to ship your logs over the network using an appender like the Syslog one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;Appenders&amp;gt;
    &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
        &amp;lt;PatternLayout pattern="%d %level [%t] %c - %m%n"/&amp;gt;
    &amp;lt;/Console&amp;gt;
    &amp;lt;Syslog name="Syslog" host="logsene-syslog-receiver.sematext.com"
            port="514" protocol="TCP" format="RFC5424"
            appName="11111111-2222-3333-4444-555555555555"
            facility="LOCAL0" mdcId="mdc" newLine="true"/&amp;gt;
&amp;lt;/Appenders&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, keep in mind that using appenders like the one shown above makes your logging pipeline susceptible to network errors and communication disruptions. That may result in logs not being shipped to their destination which may not be acceptable. You also want to avoid logging affecting your system if the appender is designed in a blocking way. To learn more check our &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;Logging libraries vs Log shippers&lt;/a&gt; blog post.&lt;/p&gt;
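
&lt;p&gt;For the asynchronous case mentioned above, Log4j 2 offers an Async appender that wraps another appender. A minimal sketch could look like this (the file name and pattern are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&amp;lt;Appenders&amp;gt;
    &amp;lt;File name="File" fileName="app.log"&amp;gt;
        &amp;lt;PatternLayout pattern="%d %level [%t] %c - %m%n"/&amp;gt;
    &amp;lt;/File&amp;gt;
    &amp;lt;Async name="Async"&amp;gt;
        &amp;lt;AppenderRef ref="File"/&amp;gt;
    &amp;lt;/Async&amp;gt;
&amp;lt;/Appenders&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;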

&lt;h3&gt;
  
  
  3. Use Meaningful Messages
&lt;/h3&gt;

&lt;p&gt;One of the crucial, yet not so easy, things when it comes to creating logs is using meaningful messages. Your log events should include messages that are unique to the given situation, clearly describe it, and inform the person reading them. Imagine a communication error occurred in your application. You might log it like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.warn("Communication error");


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But you could also create a message like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.warn("Error while sending documents to the events Elasticsearch server, response code {}, response message {}. The message sending will be retried.", responseCode, responseMessage);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can easily see that the first message will inform the person looking at the logs about some communication issues. That person will probably have the context, the name of the logger, and the line number where the warning happened, but that is all. To get more context that person would have to look at the code, know which version of the code the error is related to, and so on. This is not fun and often not easy, and certainly not something one wants to be doing while trying to troubleshoot a production issue as quickly as possible.&lt;/p&gt;

&lt;p&gt;The second message is better. It provides exact information about what kind of communication error happened, what the application was doing at the time, what error code it got, and what the response from the remote server was. Finally, it also informs that sending the message will be retried. Working with such messages is definitely easier and more pleasant.&lt;/p&gt;
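
&lt;p&gt;A side note on building such messages: Log4j 2 and SLF4J substitute parameterized placeholders into the message only when the event is actually logged, which is cheaper than eager string concatenation. As a rough, self-contained sketch of the same substitution idea using only the JDK (class name and message contents are illustrative, not from the original post):&lt;/p&gt;

```java
import java.text.MessageFormat;

public class MeaningfulMessage {
    public static void main(String[] args) {
        int responseCode = 503;
        String responseMessage = "Service Unavailable";
        // Substitute the dynamic parts into the message template, the same
        // idea as the parameterized placeholders in SLF4J and Log4j 2.
        String message = MessageFormat.format(
                "Error while sending documents to the Elasticsearch server, "
                + "response code {0}, response message {1}. The send will be retried.",
                responseCode, responseMessage);
        System.out.println(message);
    }
}
```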

&lt;p&gt;Finally, think about the size and verbosity of the message. Don’t log information that is too verbose. This data needs to be stored somewhere in order to be useful. One very long message will not be a problem, but if that line is repeating hundreds of times in a minute and you have lots of verbose logs, keeping longer retention of such data may be problematic and, at the end of the day, will also cost more.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Logging Java Stack Traces
&lt;/h3&gt;

&lt;p&gt;Java stack traces are one of the very important parts of Java logging. Have a look at the following code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.IOException;

public class Log4JExceptionNoThrowable {
    private static final Logger LOGGER = LogManager.getLogger(Log4JExceptionNoThrowable.class);

    public static void main(String[] args) {
        try {
            throw new IOException("This is an I/O error");
        } catch (IOException ioe) {
            LOGGER.error("Error while executing main thread");
        }
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above code will result in an exception being thrown, and with our default configuration the log message printed to the console will look as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

11:42:18.952 ERROR - Error while executing main thread


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see there is not a lot of information there. We only know that the problem occurred, but we don’t know where it happened, or what the problem was, etc. Not very informative.&lt;/p&gt;

&lt;p&gt;Now, look at the same code with a slightly modified logging statement:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.IOException;

public class Log4JException {
    private static final Logger LOGGER = LogManager.getLogger(Log4JException.class);

    public static void main(String[] args) {
        try {
            throw new IOException("This is an I/O error");
        } catch (IOException ioe) {
            LOGGER.error("Error while executing main thread", ioe);
        }
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, this time we’ve included the exception object itself in our log message:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.error("Error while executing main thread", ioe);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That would result in the following error log in the console with our default configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

11:30:17.527 ERROR - Error while executing main thread
java.io.IOException: This is an I/O error
    at com.sematext.blog.logging.Log4JException.main(Log4JException.java:13) [main/:?]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It contains the relevant information – i.e. the name of the class, the method where the problem occurred, and finally the line number where the problem happened. Of course, in real-life situations the stack traces will be longer, but you should include them because they give you enough information for proper debugging.&lt;/p&gt;

&lt;p&gt;To learn more about how to handle Java stack traces with Logstash see &lt;a href="https://sematext.com/blog/handling-stack-traces-with-logstash/" rel="noopener noreferrer"&gt;Handling Multiline Stack Traces with Logstash&lt;/a&gt; or look at &lt;a href="https://sematext.com/docs/logagent/parser/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt; which can do that for you out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Logging Java Exceptions
&lt;/h3&gt;

&lt;p&gt;When dealing with Java exceptions and stack traces you shouldn’t only think about the whole stack trace, the lines where the problem appeared, and so on. You should also think about how not to deal with exceptions.&lt;/p&gt;

&lt;p&gt;Avoid silently ignoring exceptions. You don’t want to ignore something important. For example, do not do this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

try {
     throw new IOException("This is an I/O error");
} catch (IOException ioe) {
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Also, don’t just log an exception and throw it further – that only pushes the problem up the execution stack and usually leads to the same error being logged more than once. Avoid things like this as well:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

try {
    throw new IOException("This is an I/O error");
} catch (IOException ioe) {
    LOGGER.error("I/O error occurred during request processing", ioe);
    throw ioe;
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  6. Use Appropriate Log Level
&lt;/h3&gt;

&lt;p&gt;When writing your application code, think twice about a given log message. Not every bit of information is equally important and not every unexpected situation is an error or a critical message. Also, use the logging levels consistently – information of a similar type should be logged at a similar severity level.&lt;/p&gt;

&lt;p&gt;Both the &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; facade and each Java logging framework that you will be using provide methods for logging at the proper level. For example:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.error("I/O error occurred during request processing", ioe);


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
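
&lt;p&gt;To make the level mapping concrete, here is a minimal, self-contained sketch using the JDK’s built-in java.util.logging (chosen only because it ships with the JDK – the level names differ slightly from SLF4J and Log4j 2, and the messages are illustrative):&lt;/p&gt;

```java
import java.util.logging.Logger;

public class LogLevelExample {
    private static final Logger LOGGER = Logger.getLogger(LogLevelExample.class.getName());

    public static void main(String[] args) {
        // FINE for detailed diagnostics, INFO for normal operations,
        // WARNING for unexpected but recoverable situations, SEVERE for errors.
        LOGGER.fine("Cache miss for session 1234, loading it from the database");
        LOGGER.info("Request processing started");
        LOGGER.warning("Request timed out, retry 2 of 3 scheduled");
        LOGGER.severe("Request failed after 3 retries, giving up");
    }
}
```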
&lt;h3&gt;
  
  
  7. Log in JSON
&lt;/h3&gt;

&lt;p&gt;If we plan to look at the logs manually in a file or on the standard output, then plain-text logging will be more than fine. It is more user friendly – we are used to it. But that is only viable for very small applications, and even then it is a good idea to use something that will allow you to &lt;a href="https://sematext.com/metrics-and-logs/" rel="noopener noreferrer"&gt;correlate the metrics data with the logs&lt;/a&gt;. Doing such operations in a terminal window ain’t fun and sometimes it is simply not possible. If you want to store logs in a &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;log management&lt;/a&gt; and centralization system, you should log in JSON. That’s because parsing doesn’t come for free – it usually means using regular expressions. Of course, you can pay that price in the log shipper, but why do that if you can easily log in JSON? Logging in JSON also means easy handling of stack traces, so that’s yet another advantage. Well, you can also just log to a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog&lt;/a&gt;-compatible destination, but that is a different story.&lt;/p&gt;

&lt;p&gt;In most cases, to enable logging in JSON in your Java logging framework it is enough to include the proper configuration. For example, let’s assume that we have the following log message included in our code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

LOGGER.info("This is a log message that will be logged in JSON!");


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To configure &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt; to write log messages in JSON we would use the following configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;JSONLayout compact="true" eventEol="true"&amp;gt;
            &amp;lt;/JSONLayout&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The result would look as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

{"instant":{"epochSecond":1596030628,"nanoOfSecond":695758000},"thread":"main","level":"INFO","loggerName":"com.sematext.blog.logging.Log4J2JSON","message":"This is a log message that will be logged in JSON!","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":1,"threadPriority":5}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  8. Keep the Log Structure Consistent
&lt;/h3&gt;

&lt;p&gt;The structure of your log events should be consistent. This is not only true within a single application or set of microservices, but should be applied across your whole application stack. With similarly structured log events it will be easier to look into them, compare them, correlate them, or simply store them in a dedicated data store. It is easier to look into data coming from your systems when you know that they have common fields like severity and hostname, so you can easily slice and dice the data based on that information. For inspiration, have a look at &lt;a href="https://sematext.com/docs/tags/common-schema/" rel="noopener noreferrer"&gt;Sematext Common Schema&lt;/a&gt; even if you are not a Sematext user.&lt;/p&gt;

&lt;p&gt;Of course, keeping the structure is not always possible, because your full stack consists of externally developed servers, databases, search engines, queues, etc., each of which has its own set of logs and log formats. However, to keep your and your team’s sanity, minimize the number of different log message structures that you can control.&lt;/p&gt;

&lt;p&gt;One way of keeping a common structure is to use the same pattern for your logs, at least the ones that are using the same logging framework. For example, if your applications and microservices use &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; you could use a pattern like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;PatternLayout&amp;gt;
    &amp;lt;Pattern&amp;gt;%d %p [%t] %c{35}:%L - %m%n&amp;lt;/Pattern&amp;gt;
&amp;lt;/PatternLayout&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By using a single or a very limited set of patterns you can be sure that the number of log formats will remain small and manageable.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Add Context to Your Logs
&lt;/h3&gt;

&lt;p&gt;Context is what turns information into something useful, and for us developers and DevOps a log message is information. Look at the following log entry:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[2020-06-29 16:25:34] [ERROR ] An error occurred!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We know that an error occurred somewhere in the application, but we don’t know where it happened or what kind of error it was – we only know when it happened. Now look at a message with slightly more contextual information:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[2020-06-29 16:25:34] [main] [ERROR ] com.sematext.blog.logging.ParsingErrorExample - A parsing error occurred for user with id 1234!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same log record, but with a lot more contextual information. We know the thread in which the error happened and the class in which it was generated. We also modified the message to include the user the error happened for, so we can get back to that user if needed. We could further include additional information, like diagnostic contexts. Think about what you need and include it.&lt;/p&gt;

&lt;p&gt;To include context information you don’t have to change much in the code that generates the log message. For example, the &lt;a href="https://logging.apache.org/log4j/2.x/manual/layouts.html%23PatternLayout" rel="noopener noreferrer"&gt;PatternLayout&lt;/a&gt; in &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; gives you everything you need. You can go with a very simple pattern like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %-5level - %msg%n"/&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That will result in a log message similar to the following one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

17:13:08.059 INFO - This is the first INFO level log message!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But you can also use a pattern that includes far more information:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %c %l %-5level - %msg%n"/&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That will result in a log message like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

17:24:01.710 com.sematext.blog.logging.Log4j2 com.sematext.blog.logging.Log4j2.main(Log4j2.java:12) INFO - This is the first INFO level log message!


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  10. Java Logging in Containers
&lt;/h3&gt;

&lt;p&gt;Think about the environment your application is going to run in. Logging configuration differs when you run your Java code in a VM or on a bare-metal machine, when you run it in a containerized environment, and, of course, when you run your Java or Kotlin code on an &lt;a href="https://github.com/sematext/sematext-logsene-android/" rel="noopener noreferrer"&gt;Android device&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To set up &lt;a href="https://sematext.com/blog/docker-logs-location/" rel="noopener noreferrer"&gt;logging in a containerized environment&lt;/a&gt; you need to choose the approach you want to take. You can use one of the provided &lt;a href="https://sematext.com/blog/docker-log-driver-alternatives/" rel="noopener noreferrer"&gt;logging drivers&lt;/a&gt; – like &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;journald&lt;/a&gt;, &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;logagent&lt;/a&gt;, Syslog, or the JSON file driver. If you do, remember that your application shouldn’t write log files to the container’s ephemeral storage, but log to the standard output instead. That is easily done by configuring your logging framework to write to the console. For example, with &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4J 2&lt;/a&gt; you would use the following appender configuration:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;Appenders&amp;gt;
    &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
        &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} - %m %n"/&amp;gt;
    &amp;lt;/Console&amp;gt;
&amp;lt;/Appenders&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also bypass the logging drivers entirely and ship logs directly to your centralized logs solution, like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;Appenders&amp;gt;
    &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
        &amp;lt;PatternLayout pattern="%d %level [%t] %c - %m%n"/&amp;gt;
    &amp;lt;/Console&amp;gt;
    &amp;lt;Syslog name="Syslog" host="logsene-syslog-receiver.sematext.com"
            port="514" protocol="TCP" format="RFC5424"
            appName="11111111-2222-3333-4444-555555555555"
            facility="LOCAL0" mdcId="mdc" newLine="true"/&amp;gt;
&amp;lt;/Appenders&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  11. Don’t Log Too Much or Too Little
&lt;/h3&gt;

&lt;p&gt;As developers we tend to think that everything might be important, so we mark each step of our algorithms and business code as important. Sometimes we do the opposite: we don’t add logging where we should, or we log only FATAL and ERROR levels. Neither approach works well. When writing your code and adding logging, think about what you will need to see whether the application is working properly, and what you will need to diagnose and fix a wrong application state. Use this as your guiding light to decide what and where to log. Keep in mind that logging too much ends in information fatigue, while logging too little makes troubleshooting impossible.&lt;/p&gt;
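&lt;p&gt;One practical way to keep everyday output lean without losing detail is to put verbose information at the DEBUG or TRACE level and to guard any expensive message construction behind a level check. The sketch below illustrates the idea with the built-in java.util.logging package; the class name and the expensiveDump helper are made up for this example, and the same idea applies to SLF4J via isDebugEnabled or parameterized messages:&lt;/p&gt;

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLoggingDemo {
    // Counts how many times the costly dump was actually built.
    static int expensiveCalls = 0;

    // Stands in for building a large diagnostic string.
    static String expensiveDump() {
        expensiveCalls++;
        return "full internal state ...";
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("guarded");
        logger.setLevel(Level.INFO); // everyday threshold

        // Verbose detail goes to FINE (roughly DEBUG) and is guarded,
        // so its cost is only paid when that level is enabled.
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("state: " + expensiveDump());
        }
        // The everyday signal stays at INFO.
        logger.info("checkout completed for order 1234");

        System.out.println("expensive dumps built: " + expensiveCalls);
    }
}
```

&lt;p&gt;With the threshold at INFO, the FINE message is skipped and the dump is never built, while the INFO event still goes out.&lt;/p&gt;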
&lt;h3&gt;
  
  
  12. Keep the Audience in Mind
&lt;/h3&gt;

&lt;p&gt;In most cases, you will not be the only person looking at the logs – always remember that. There are multiple actors who may be reading them.&lt;/p&gt;

&lt;p&gt;A developer may look at the logs while troubleshooting or during debugging sessions. For developers, logs can be detailed, technical, and include very deep information about how the system is running. You can assume that such a person has access to the code, or even knows it well.&lt;/p&gt;

&lt;p&gt;Then there are DevOps engineers. They need log events for troubleshooting, so logs should include information helpful in diagnostics. You can assume knowledge of the system, its architecture, its components, and their configuration, but you should not assume knowledge of the platform’s code.&lt;/p&gt;

&lt;p&gt;Finally, your application logs may be read by your users themselves. In that case, the logs should be descriptive enough to help fix the issue, if that is even possible, or to give enough information to the support team helping the user. For example, using Sematext for monitoring involves installing and running a monitoring agent. If you are behind a very restrictive firewall and the agent cannot ship metrics to Sematext, it logs errors that Sematext users themselves can look at, too.&lt;/p&gt;

&lt;p&gt;We could go further and identify even more actors who might be looking into logs, but this shortlist should give you a glimpse into what you should think about when writing your log messages.&lt;/p&gt;
&lt;h3&gt;
  
  
  13. Avoid Logging Sensitive Information
&lt;/h3&gt;

&lt;p&gt;Sensitive information shouldn’t be present in logs, or should at least be masked. Passwords, credit card numbers, social security numbers, access tokens, and so on – all of these may be dangerous if leaked or accessed by those who shouldn’t see them. There are two things you ought to consider.&lt;/p&gt;

&lt;p&gt;First, think about whether sensitive information is truly essential for troubleshooting. Maybe instead of a credit card number it is enough to keep the transaction identifier and the date of the transaction? Maybe it is not necessary to keep the social security number in the logs when you can store the user identifier instead. Think about such situations and about the data you store, and only write sensitive data when it is really necessary.&lt;/p&gt;
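&lt;p&gt;To make the first point concrete, here is a tiny, hypothetical helper that builds a log-safe reference for a payment event: it keeps the transaction identifier and reduces the card number to its last four digits, so the log line stays useful without exposing the full number (the method and the identifier format are invented for this sketch):&lt;/p&gt;

```java
public class SafeReference {
    // Builds a log-safe reference: keep the transaction id,
    // keep only the last four digits of the card number.
    static String logSafe(String transactionId, String cardNumber) {
        String lastFour = cardNumber.substring(cardNumber.length() - 4);
        return "transaction=" + transactionId + " card=****" + lastFour;
    }

    public static void main(String[] args) {
        System.out.println(logSafe("tx-20200629-0042", "1234-4444-3333-1111"));
    }
}
```

&lt;p&gt;A line like this still lets support trace the transaction, while the card number itself never reaches the logs.&lt;/p&gt;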

&lt;p&gt;The second thing is shipping logs with sensitive information to a hosted logs service. With very few exceptions, if your logs contain sensitive information that must be kept, mask or remove it before sending the logs to your centralized logs store. Most popular log shippers, like our own &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt;, include functionality that allows &lt;a href="https://sematext.com/docs/logagent/output-filter-removefields/" rel="noopener noreferrer"&gt;removal or masking of sensitive data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, the masking of sensitive information can be done in the logging framework itself. Let’s look at how it can be done by extending &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt;. Our code that produces log events looks as follows (the full example can be found on &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/log4jmasking" rel="noopener noreferrer"&gt;Sematext’s GitHub&lt;/a&gt;):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

public class Log4J2Masking {
    private static Logger LOGGER = LoggerFactory.getLogger(Log4J2Masking.class);
    private static final Marker SENSITIVE_DATA_MARKER = MarkerFactory.getMarker("SENSITIVE_DATA_MARKER");

    public static void main(String[] args) {
        LOGGER.info("This is a log message without sensitive data");
        LOGGER.info(SENSITIVE_DATA_MARKER, "This is a log message with credit card number 1234-4444-3333-1111 in it");
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you were to run the whole example from GitHub, the output would be as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

21:20:42.099 - This is a log message without sensitive data
21:20:42.101 - This is a log message with credit card number ****-****-****-**** in it


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see that the credit card number was masked. This works because we added a custom &lt;a href="https://logging.apache.org/log4j/2.x/manual/extending.html%23PatternConverters" rel="noopener noreferrer"&gt;Converter&lt;/a&gt; that checks whether the given Marker is attached to the log event and, if so, tries to replace a defined pattern. The implementation of such a Converter looks as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

@Plugin(name = "sample_logging_mask", category = "Converter")
@ConverterKeys("sc")
public class LoggingConverter extends LogEventPatternConverter {
    private static Pattern PATTERN = Pattern.compile("\\b([0-9]{4})-([0-9]{4})-([0-9]{4})-([0-9]{4})\\b");

    public LoggingConverter(String[] options) {
        super("sc", "sc");
    }

    public static LoggingConverter newInstance(final String[] options) {
        return new LoggingConverter(options);
    }

    @Override
    public void format(LogEvent event, StringBuilder toAppendTo) {
        String message = event.getMessage().getFormattedMessage();
        String maskedMessage = message;

        if (event.getMarker() != null &amp;amp;&amp;amp; "SENSITIVE_DATA_MARKER".compareToIgnoreCase(event.getMarker().getName()) == 0) {
            Matcher matcher = PATTERN.matcher(message);
            if (matcher.find()) {
                maskedMessage = matcher.replaceAll("****-****-****-****");
            }
        }

        toAppendTo.append(maskedMessage);
    }
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is very simple: it could be written in a more optimized way and should handle all possible credit card number formats, but it is enough for this purpose.&lt;/p&gt;

&lt;p&gt;Before jumping into the code explanation I would also like to show you the &lt;strong&gt;log4j2.xml&lt;/strong&gt; configuration file for this example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN" packages="com.sematext.blog.logging"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} - %sc %n"/&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, we’ve added the packages attribute to our Configuration to tell the framework where to look for our converter. We then used the &lt;strong&gt;%sc&lt;/strong&gt; pattern to produce the log message, because we can’t override the default &lt;strong&gt;%m&lt;/strong&gt; pattern. Once Log4j2 encounters our &lt;strong&gt;%sc&lt;/strong&gt; pattern it invokes our converter, which takes the formatted message of the log event and applies a simple regex, replacing the data if it is found. As simple as that.&lt;/p&gt;

&lt;p&gt;One thing to notice here is that we are using the Marker functionality. Regex matching is expensive, and we don’t want to pay that cost for every log message. That’s why we mark the log events that should be processed with the created Marker, so only the marked ones are checked.&lt;/p&gt;

&lt;h3&gt;
  
  
  14. Use a Log Management Solution to Centralize &amp;amp; Monitor Java Logs
&lt;/h3&gt;

&lt;p&gt;As the complexity of your applications grows, so will the volume of your logs. You may get away with logging to a file and only using logs when troubleshooting is needed, but when the amount of data grows, it quickly becomes difficult and slow to troubleshoot this way. When this happens, consider using &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;a log&lt;/a&gt; management solution to centralize and monitor your logs. You can either go for an in-house solution based on open-source software, like the &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;Elastic Stack&lt;/a&gt;, or use one of the &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management tools&lt;/a&gt; available on the market, like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; or &lt;a href="https://sematext.com/enterprise/" rel="noopener noreferrer"&gt;Sematext Enterprise&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flua10oey54etpdeoquv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flua10oey54etpdeoquv0.png" alt="Sematext Cloud Logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A fully &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;managed log centralization&lt;/a&gt; solution will give you the freedom of not needing to manage yet another, usually quite complex, part of your infrastructure. Instead, you will be able to focus on your application and will only need to set up log shipping. You may want to include logs like JVM &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;garbage collection logs&lt;/a&gt; in your managed log solution. After &lt;a href="https://sematext.com/blog/java-garbage-collection/" rel="noopener noreferrer"&gt;turning them on&lt;/a&gt; for your applications and systems running on the JVM, you will want to have them in a single place for correlation and analysis, and to help you &lt;a href="https://sematext.com/blog/java-garbage-collection-tuning/" rel="noopener noreferrer"&gt;tune the garbage collection&lt;/a&gt; in your JVM instances. Alert on logs, &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;aggregate the data&lt;/a&gt;, save and re-run the queries, hook up your favorite incident management software. Correlating &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;log&lt;/a&gt; data with &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; coming from &lt;a href="https://sematext.com/guides/java-monitoring/" rel="noopener noreferrer"&gt;JVM applications&lt;/a&gt;, system and &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;infrastructure&lt;/a&gt;, &lt;a href="https://sematext.com/experience/" rel="noopener noreferrer"&gt;real user&lt;/a&gt;, and &lt;a href="https://sematext.com/synthetic-monitoring/" rel="noopener noreferrer"&gt;API endpoints&lt;/a&gt; is something that platforms like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; are capable of. And of course, remember that application logs are not everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Incorporating each and every good practice may not be easy right away, especially for applications that are already live in production. But if you take the time and roll the suggestions out one after another, you will start seeing an increase in the usefulness of your logs. Also, remember that at Sematext we help organizations with their logging setups by offering &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;logging consulting&lt;/a&gt;, so reach out if you are having trouble and we will be happy to help.&lt;/p&gt;

</description>
      <category>java</category>
      <category>logging</category>
      <category>logs</category>
    </item>
    <item>
      <title>Java Logging Tutorial: Basic Concepts to Help You Get Started</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Fri, 21 Aug 2020 11:41:07 +0000</pubDate>
      <link>https://dev.to/gr0/java-logging-tutorial-basic-concepts-to-help-you-get-started-3mkh</link>
      <guid>https://dev.to/gr0/java-logging-tutorial-basic-concepts-to-help-you-get-started-3mkh</guid>
      <description>&lt;p&gt;When it comes to troubleshooting application performance, metrics are no longer enough. To fully understand the environment you need logs and traces. Today, we’re going to focus on your Java applications.&lt;/p&gt;

&lt;p&gt;Logging in Java could be done by simply writing data to a file; however, that is neither the simplest nor the most convenient way of logging. There are frameworks for Java that provide the objects, methods, and unified configuration that help you set up logging, store logs, and in some cases even ship them to a &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;log centralization solution&lt;/a&gt;. On top of that, there are abstraction layers that enable easy switching of the underlying framework without changing the implementation. This can come in handy if you ever need to replace your logging library for whatever reason – for example performance, configuration, or even simplicity.&lt;/p&gt;

&lt;p&gt;Sounds complicated? It doesn’t have to be.&lt;/p&gt;

&lt;p&gt;In this blog post we will focus on how to properly set up logging for your code and avoid the mistakes that we already made ourselves. We will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging abstraction layers for Java&lt;/li&gt;
&lt;li&gt;Out of the box Java logging capabilities&lt;/li&gt;
&lt;li&gt;Java logging libraries, their configuration, and usage&lt;/li&gt;
&lt;li&gt;Logging the important information&lt;/li&gt;
&lt;li&gt;Log centralization solutions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging Frameworks: Choosing the Logging Solution for Your Java Application
&lt;/h2&gt;

&lt;p&gt;The best logging solution for your Java application is… well, it’s complicated. You have to look at your current environment and your organization’s needs. If you already have a framework of choice that the majority of your applications use – go for that. It is very likely that you already have an established format for your logs. That also means you may already have an easy way to ship your logs to a &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log centralization solution&lt;/a&gt; of your choice, and you just need to follow the existing pattern.&lt;/p&gt;

&lt;p&gt;However, if your organization does not have a common logging framework, look at what kind of logging the applications you run use. Are you using Elasticsearch? It uses &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt;. Do the majority of your third-party applications also use Log4j 2? If so, consider using it as well. Why? Because you will probably want to have your logs in a single place, and it will be easier to work with a single configuration or pattern. There is a pitfall here, though: don’t go that route if it means using an outdated technology when a newer, mature alternative already exists.&lt;/p&gt;

&lt;p&gt;Finally, if you are selecting a new logging framework, I would suggest using an abstraction layer on top of a logging framework. Such an approach gives you the flexibility to switch to a different logging framework when needed and, most importantly, without changing the code. You will only have to update the dependencies and configuration; the rest stays the same.&lt;/p&gt;

&lt;p&gt;In most cases, &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; with bindings to the logging framework of your choice will be a good idea. &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt; is the go-to framework for many projects out there, both open and closed source.&lt;/p&gt;
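&lt;p&gt;To give you an idea of how such a setup is wired together, here is a sketch of the Maven dependencies for SLF4J backed by Log4j 2 – the artifact names are the standard Log4j 2 ones, but double-check the versions against the current releases:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.apache.logging.log4j&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;log4j-slf4j-impl&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;2.13.3&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&amp;lt;dependency&amp;gt;
    &amp;lt;groupId&amp;gt;org.apache.logging.log4j&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;log4j-core&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;2.13.3&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these in place, your code only ever talks to the SLF4J API, and Log4j 2 does the actual work behind the binding.&lt;/p&gt;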

&lt;h2&gt;
  
  
  The Abstraction Layers
&lt;/h2&gt;

&lt;p&gt;Of course, your application can use the basic logging API provided out of the box by Java via the &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; package. There is nothing wrong with that, but note that it limits what you can do with your logs. Dedicated libraries provide easy-to-configure formatters, out-of-the-box industry-standard destinations, and top-notch performance. If you wish to use one of those frameworks and would also like to be able to switch frameworks in the future, you should look at an abstraction layer on top of the logging APIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; – The Simple Logging Facade for Java is one such abstraction layer. It provides bindings for common logging frameworks such as &lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j&lt;/a&gt;, &lt;a href="http://logback.qos.ch/" rel="noopener noreferrer"&gt;Logback&lt;/a&gt;, and the out of the box &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; package. You can imagine the process of writing the log message in the following, simplified way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2znlzcov478kfzzjdvjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2znlzcov478kfzzjdvjl.png" alt="Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how would that look from the code perspective? Well, that’s a very good question. Let’s start by looking at the out of the box &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; code. For example, if we would like to just start our application and print something to the log it would look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;
import java.util.logging.Level;
import java.util.logging.Logger;

public class JavaUtilsLogging {
  private static Logger LOGGER = Logger.getLogger(JavaUtilsLogging.class.getName());

  public static void main(String[] args) {
    LOGGER.log(Level.INFO, "Hello world!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we initialize the static Logger class by using the class name. That way we can clearly identify where the log message comes from and we can reuse a single Logger for all the log messages that are generated by a given class.&lt;/p&gt;

&lt;p&gt;The output of the execution of the above code will be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Jun 26, 2020 2:58:40 PM com.sematext.blog.logging.JavaUtilsLogging main
INFO: Hello world!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s do the same using the SLF4J abstraction layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JavaSLF4JLogging {
  private static Logger LOGGER = LoggerFactory.getLogger(JavaSLF4JLogging.class);

  public static void main(String[] args) {
    LOGGER.info("Hello world!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the output is slightly different and looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[main] INFO com.sematext.blog.logging.JavaSLF4JLogging - Hello world!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things are a bit different here. We use the LoggerFactory class to retrieve the logger for our class, and we use the dedicated info method of the Logger object to write the log message at the INFO level.&lt;/p&gt;

&lt;p&gt;Starting with &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; 2.0, the &lt;a href="http://www.slf4j.org/manual.html#fluent" rel="noopener noreferrer"&gt;fluent logging API&lt;/a&gt; was introduced, but at the time of writing it is still in alpha and considered experimental, so we will skip it for now as the API may change.&lt;/p&gt;

&lt;p&gt;So, to finish up with the abstraction layer – it gives us a few major advantages compared to using &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A common API for all your application logging&lt;/li&gt;
&lt;li&gt;An easy way to use the desired logging framework&lt;/li&gt;
&lt;li&gt;An easy way to exchange the logging framework without having to go through the whole code base when switching&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Are Log Levels?
&lt;/h2&gt;

&lt;p&gt;Before continuing with the Java logging API, we should talk about log levels and how to use them to properly categorize our log messages. A Java log level is a way to specify how important a given log message is.&lt;/p&gt;

&lt;p&gt;The following log levels are sorted from least to most important, with some explanation of how I see them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TRACE&lt;/strong&gt; – Very fine-grained information, used only in the rare cases where you need full visibility into what is happening. Expect the TRACE level to be very verbose, but also to tell you a lot about the application. Use it for annotating steps in an algorithm that are not relevant in everyday use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DEBUG&lt;/strong&gt; – Less granular than TRACE, but still more granular than you need in normal, everyday use. The DEBUG level should be used for information that is useful for troubleshooting but not needed when looking at the everyday application state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INFO&lt;/strong&gt; – The standard level of log information indicating normal application actions. For example, “Created a user {} with id {}” is an INFO-level log message about a process that finished successfully. In most cases, unless you are looking into how your application performs, you can ignore most, if not all, of the INFO-level logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WARN&lt;/strong&gt; – A log level that usually indicates an application state that might be problematic, or an unusual execution that was detected. Something may be wrong, but it doesn’t mean the application failed. For example, a message was not parsed correctly because it was malformed. Code execution continues, but we log the event at the WARN level to inform ourselves and others that a potential problem is occurring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ERROR&lt;/strong&gt; – A log level indicating an issue that prevents certain functionality from working. For example, if you provide login via social media as one way of logging into your system, the failure of that module is certainly an ERROR-level event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FATAL&lt;/strong&gt; – The log level indicating that your application encountered an event that prevents it from working or a crucial part of it from working. A FATAL log level is for example an inability to connect to a database that your system relies on or to an external payment system that is needed to check out the basket in your e-commerce system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hopefully, that sheds some light on what the log levels are and how you can use them in your application.&lt;/p&gt;
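&lt;p&gt;To make these thresholds concrete, here is a small sketch using java.util.logging, which ships with the JDK. Note that its level names differ slightly: FINEST and FINE roughly correspond to TRACE and DEBUG, WARNING to WARN, and SEVERE to ERROR/FATAL. The class and logger names below are mine. Setting the logger level to WARNING drops everything below it:&lt;/p&gt;

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LevelDemo {
  // Counts the records that survive the logger's level filter
  static int published = 0;

  public static void main(String[] args) {
    Logger logger = Logger.getLogger("level-demo");
    logger.setUseParentHandlers(false);  // keep the console quiet
    logger.setLevel(Level.WARNING);      // roughly the WARN threshold from the list above
    logger.addHandler(new Handler() {
      @Override public void publish(LogRecord record) { published++; }
      @Override public void flush() {}
      @Override public void close() {}
    });

    logger.finest("TRACE-style algorithm detail");   // dropped
    logger.fine("DEBUG-style troubleshooting info"); // dropped
    logger.info("Created a user with id 42");        // dropped, below WARNING
    logger.warning("Message could not be parsed");   // kept
    logger.severe("Social media login module down"); // kept

    System.out.println(published); // prints 2
  }
}
```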

&lt;h2&gt;
  
  
  Java Logging API
&lt;/h2&gt;

&lt;p&gt;The Java logging APIs come with a standard set of key elements that we should know about before going beyond basic logging. Those elements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logger&lt;/strong&gt; – The logger is the main entity that an application uses to make logging calls. The Logger object is usually used for a single class or a single component to provide context-bound to a specific use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LogRecord&lt;/strong&gt; – The entity used to pass the logging requests between the framework that is used for logging and the handlers that are responsible for log shipping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handler&lt;/strong&gt; – The handler is used to export the LogRecord entity to a given destination. Destinations can be memory, the console, files, and remote locations via sockets and various APIs. Examples of such handlers are the handler for &lt;a href="https://en.wikipedia.org/wiki/Syslog" rel="noopener noreferrer"&gt;Syslog&lt;/a&gt; or a handler for Elasticsearch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level&lt;/strong&gt; – Defines the set of standard severities of the logging message such as INFO, ERROR, etc. Tells how important the log record is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt; – Gives control over what gets logged and what gets dropped. It gives the application the ability to attach functionality controlling the logging output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formatter&lt;/strong&gt; – Supports the formatting of the LogRecord objects. By default, two formatters are available – the &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/SimpleFormatter.html" rel="noopener noreferrer"&gt;SimpleFormatter&lt;/a&gt; and the &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/XMLFormatter.html" rel="noopener noreferrer"&gt;XMLFormatter&lt;/a&gt;. The first one prints the LogRecord object in a human-readable form using one or two lines. The second one writes the messages in the standard XML format.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is everything we need to know about these APIs, so we can start looking into the configuration of the different frameworks and abstraction layers of Java logging.&lt;/p&gt;
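&lt;p&gt;A quick sketch of how these elements fit together in java.util.logging: the Logger creates LogRecords, the Handler delivers them (here to an in-memory stream rather than the console, so the result is easy to inspect), the Filter drops some of them, and the Formatter lays them out. All names below are illustrative:&lt;/p&gt;

```java
import java.io.ByteArrayOutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

public class ApiDemo {
  // In-memory destination so we can look at what the Handler wrote
  static final ByteArrayOutputStream out = new ByteArrayOutputStream();

  public static void main(String[] args) {
    Logger logger = Logger.getLogger("api-demo");
    logger.setUseParentHandlers(false);

    // Handler: exports LogRecords to a destination; Formatter: controls the layout
    StreamHandler handler = new StreamHandler(out, new SimpleFormatter());
    // Filter: decides what gets logged and what gets dropped
    handler.setFilter(record -> !record.getMessage().contains("heartbeat"));
    // Level: the minimum severity this handler accepts
    handler.setLevel(Level.INFO);
    logger.addHandler(handler);

    logger.info("heartbeat ping");  // dropped by the Filter
    logger.info("user created");    // becomes a LogRecord, formatted and written
    handler.flush();
  }
}
```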

&lt;h2&gt;
  
  
  Configuration: How Do You Enable Logging in Java?
&lt;/h2&gt;

&lt;p&gt;Now that we know the basics we are ready to include logging in our Java application. I assume that we haven’t yet chosen the framework we would like to go with, so I’ll discuss each of the common solutions mentioned earlier and show how they can be integrated.&lt;/p&gt;

&lt;p&gt;Adding logging to our Java application is usually about configuring the library of choice and including the Logger. That allows us to add logging into the parts of our application that we want to know about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Logger
&lt;/h2&gt;

&lt;p&gt;The Java Logger is the main entity that our application uses to create a LogRecord – in other words, to log what we want to output as the log message. Before discussing each logging framework in greater detail, let’s see how to obtain the Logger instance and how to log a single log event in each of the four discussed frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Logger
&lt;/h3&gt;

&lt;p&gt;How do we create a Logger? Well, that actually depends on the logging framework of our choice. But it is usually very simple. For example, for the SLF4J APIs you would just do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... Logger LOGGER = LoggerFactory.getLogger(SomeFancyClassName.class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using java.util.logging as the logging framework of choice we would do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... Logger LOGGER = Logger.getLogger(SomeFancyClassName.class.getName())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While using Log4j 2 would require the following Java code call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... Logger LOGGER = LogManager.getLogger(SomeFancyClassName.class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Logback we would call the same code as for SLF4J:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... Logger LOGGER = LoggerFactory.getLogger(SomeFancyClassName.class)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logging Events
&lt;/h3&gt;

&lt;p&gt;Similar to obtaining the Logger, logging the events depends on the framework. Let’s look at the various ways of logging an INFO level log message.&lt;/p&gt;

&lt;p&gt;For the SLF4J APIs you would just do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOGGER.info("This is an info level log message!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using java.util.logging as the logging framework of choice we would do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOGGER.log(Level.INFO, "This is an info level log message!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While using Log4j 2 we would do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOGGER.info("This is an info level log message!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, if using Logback, we would have to do the same call as for SLF4J:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOGGER.info("This is an info level log message!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SLF4J: Using The Logging Facade
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; or the Simple Logging Facade for Java serves as an abstraction layer for various Java logging frameworks, like Log4J or Logback. This allows for plugging different logging frameworks at deployment time without the need for code changes.&lt;/p&gt;

&lt;p&gt;To enable &lt;a href="http://www.slf4j.org/" rel="noopener noreferrer"&gt;SLF4J&lt;/a&gt; in your project you need to include the slf4j-api library and the logging framework of your choice. For the purpose of this blog post we will also include the slf4j-simple library, so that we can show how the SLF4J API looks without complicating things with additional logging framework configuration. The slf4j-simple binding makes the SLF4J facade print all log messages with level INFO or above to System.err. Because of that, our Gradle build.gradle file looks as follows (you can find the whole project on the &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/slf4j" rel="noopener noreferrer"&gt;Sematext Github&lt;/a&gt; account):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dependencies {
    implementation 'org.slf4j:slf4j-api:1.7.30'
    implementation 'org.slf4j:slf4j-simple:1.7.30'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code that will generate our SLF4J log message looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SLF4J {
  private static Logger LOGGER = LoggerFactory.getLogger(SLF4J.class);

  public static void main(String[] args) {
    LOGGER.info("This is an info level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve already seen that – we create the Logger instance by using the LoggerFactory class and its getLogger method, providing the name of the class. That way we bind the logger to the class, which gives our log messages context. Once we have it, we can use the Logger methods to create a LogRecord at a given level.&lt;/p&gt;

&lt;p&gt;The above code execution results in the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[main] INFO com.sematext.blog.logging.SLF4J - This is an info level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SLF4J API is well designed and supports more than simple messages. The typical usage pattern requires parameterized log messages, and of course SLF4J allows for that. Have a look at the following class (I omitted the import section for simplicity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class SLF4JParametrized {
  private static Logger LOGGER = LoggerFactory.getLogger(SLF4JParametrized.class);

  public static void main(String[] args) {
    int currentValue = 36;
    LOGGER.info("The parameter value in the log message is {}", currentValue);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above code execution looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[main] INFO com.sematext.blog.logging.SLF4JParametrized - The parameter value in the log message is 36
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the parameter placeholder was properly replaced and the log message contains the value. This allows for efficient log message building without String concatenation or the use of buffers and writers.&lt;/p&gt;
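&lt;p&gt;SLF4J builds the final message only when the level is enabled. java.util.logging offers a similar efficiency trick with its Supplier-based methods, which is a handy way to see the effect with no extra dependencies (class and method names below are mine):&lt;/p&gt;

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyDemo {
  // Counts how many times the log message was actually built
  static int buildCount = 0;

  static String expensiveDescription() {
    buildCount++;
    return "state: " + System.nanoTime();
  }

  public static void main(String[] args) {
    Logger logger = Logger.getLogger("lazy-demo");
    logger.setUseParentHandlers(false);
    logger.setLevel(Level.INFO); // FINE is disabled

    logger.fine(LazyDemo::expensiveDescription); // supplier never invoked, level disabled
    logger.info(LazyDemo::expensiveDescription); // supplier invoked once

    System.out.println(buildCount); // prints 1
  }
}
```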

&lt;h3&gt;
  
  
  Binding SLF4J With Logging Framework at the Deploy Time
&lt;/h3&gt;

&lt;p&gt;When using SLF4J in a real application you probably won’t be using the slf4j-simple library; you’ll want to pair SLF4J with a dedicated logging framework so that your logs can be sent to the destination of your choice, or even multiple destinations.&lt;/p&gt;

&lt;p&gt;For SLF4J to be able to work with a logging framework of your choice in addition to the slf4j-api library you need one of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;slf4j-jdk14-SLF4J_VERSION.jar&lt;/strong&gt; – the bindings for java.util.logging, the framework bundled with the JDK by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;slf4j-log4j12-SLF4J_VERSION.jar&lt;/strong&gt; – the bindings for the Log4j version 1.2 and of course requires the log4j library to be present in the classpath.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;slf4j-jcl-SLF4J_VERSION.jar&lt;/strong&gt; – the bindings for the Jakarta Commons Logging also referred to as Commons Logging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;slf4j-nop-SLF4J_VERSION.jar&lt;/strong&gt; – the binding that silently discards all the log messages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;logback-classic-1.2.3.jar&lt;/strong&gt; – the bindings for the Logback logging framework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;log4j-slf4j-impl-SLF4J_VERSION.jar&lt;/strong&gt; – the bindings for Log4J 2 and the SLF4J up to version 1.8.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;log4j-slf4j18-impl-SLF4J_VERSION.jar&lt;/strong&gt; – the bindings for Log4J 2 and the SLF4J 1.8 and up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting with version 1.6 of the SLF4J library, if no binding is found on the classpath, the SLF4J API silently discards all log messages. So please keep that in mind.&lt;/p&gt;

&lt;p&gt;It is also a &lt;a href="https://sematext.com/blog/java-logging-best-practices/" rel="noopener noreferrer"&gt;good Java logging practice&lt;/a&gt; for the libraries and embedded software to only include dependency to the slf4j-api library and nothing else. That way the binding will be chosen by the developer of the application that uses the library.&lt;/p&gt;

&lt;p&gt;SLF4J supports the Mapped Diagnostic Context which is a map where the application code provides key-value pairs which can be inserted by the logging framework in the log messages. If the underlying logging framework supports the MDC then SLF4J facade will pass the maintained key-value pairs to the used logging framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  java.util.logging
&lt;/h2&gt;

&lt;p&gt;When using the standard &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; package we don’t need any kind of external dependencies. Everything that we need is already present in the JDK distribution so we can just jump on it and start including logging to our awesome application.&lt;/p&gt;

&lt;p&gt;There are two ways we can include and configure logging using the &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; package – by using a configuration file or programmatically. For the demo purposes let’s assume that we want to see our log messages in a single line starting with a date and time, severity of the log message and of course the log message itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring java.util.logging via Configuration File
&lt;/h3&gt;

&lt;p&gt;To show you how to use the &lt;a href="https://docs.oracle.com/en/java/javase/14/docs/api/java.logging/java/util/logging/package-summary.html" rel="noopener noreferrer"&gt;java.util.logging&lt;/a&gt; package we’ve created a simple Java project and shared it in our &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/configjul" rel="noopener noreferrer"&gt;Github&lt;/a&gt; repository. The code that generates the log and sets up our logging looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.loggging;

import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class UtilLoggingConfiguration {
  private static Logger LOGGER = Logger.getLogger(UtilLoggingConfiguration.class.getName());

  static {
    try {
      InputStream stream = UtilLoggingConfiguration.class.getClassLoader()
          .getResourceAsStream("logging.properties");
      LogManager.getLogManager().readConfiguration(stream);
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }

  public static void main(String[] args) throws Exception {
    LOGGER.log(Level.INFO, "An INFO level log!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start by initializing the Logger for our UtilLoggingConfiguration class using the name of the class. We also need to provide our own logging configuration. The default configuration lives in the &lt;strong&gt;JAVA_HOME/jre/lib/logging.properties&lt;/strong&gt; file, and we don’t want to adjust it. Because of that, I created a new file called logging.properties, and in the static block we read it and hand it to the LogManager class by using its readConfiguration method. That is enough to initialize the logging configuration and results in the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2020-06-29 15:34:49] [INFO   ] An INFO level log!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, you would only do the initialization once and not in each and every class. Keep that in mind.&lt;/p&gt;

&lt;p&gt;That output is different from what we’ve seen earlier in the blog post, and the only thing that changed is the configuration that we included. Let’s now look at our logging.properties file contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=[%1$tF %1$tT] [%4$-7s] %5$s %n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we’ve set up the handler which is used to set the log records destination. In our case, this is the &lt;strong&gt;java.util.logging.ConsoleHandler&lt;/strong&gt; that prints the logs to the console.&lt;/p&gt;

&lt;p&gt;Next, we specified the formatter for our handler – the &lt;strong&gt;java.util.logging.SimpleFormatter&lt;/strong&gt;, which allows us to provide the format of the log record.&lt;/p&gt;

&lt;p&gt;Finally, by using the format property, we set the format to &lt;strong&gt;[%1$tF %1$tT] [%4$-7s] %5$s %n&lt;/strong&gt;. That is a very handy pattern, but it can be confusing if you see something like this for the first time, so let’s discuss it. The SimpleFormatter uses the String.format method with the following list of arguments: (format, date, source, logger, level, message, thrown).&lt;/p&gt;

&lt;p&gt;So the &lt;strong&gt;[%1$tF %1$tT]&lt;/strong&gt; part tells the formatter to take the first argument, the date, and print its date and time parts. Then the &lt;strong&gt;[%4$-7s]&lt;/strong&gt; part takes the log level and pads it to 7 characters, so for the INFO level it adds 3 trailing spaces. The &lt;strong&gt;%5$s&lt;/strong&gt; tells the formatter to print the message of the log record as a string and finally, the &lt;strong&gt;%n&lt;/strong&gt; prints a new line. Now that should be clearer.&lt;/p&gt;
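&lt;p&gt;You can reproduce what the SimpleFormatter does by calling String.format with the same argument list yourself. A small sketch with made-up values (the class and helper method are mine):&lt;/p&gt;

```java
import java.util.Date;

public class FormatDemo {
  // Mirrors the SimpleFormatter argument order: date, source, logger, level, message, thrown
  static String render(Date date, String level, String message) {
    String format = "[%1$tF %1$tT] [%4$-7s] %5$s %n";
    return String.format(format, date, "source", "logger", level, message, "");
  }

  public static void main(String[] args) {
    // Prints something like: [2020-06-29 15:34:49] [INFO   ] An INFO level log!
    System.out.print(render(new Date(), "INFO", "An INFO level log!"));
  }
}
```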

&lt;h3&gt;
  
  
  Configuring java.util.logging Programmatically
&lt;/h3&gt;

&lt;p&gt;Let’s now look at how to do similar formatting by configuring the formatter from the Java code level. We won’t be using any kind of properties file this time, but we will set up our Logger using Java.&lt;/p&gt;

&lt;p&gt;The code that does that looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.loggging;

import java.util.Date;
import java.util.logging.*;

public class UtilLoggingConfigurationProgramatically {
  private static Logger LOGGER = Logger.getLogger(UtilLoggingConfigurationProgramatically.class.getName());

  static {
    ConsoleHandler handler = new ConsoleHandler();
    handler.setFormatter(new SimpleFormatter() {
      private static final String format = "[%1$tF %1$tT] [%2$-7s] %3$s %n";
      @Override
      public String format(LogRecord record) {
        return String.format(format,
            new Date(record.getMillis()),
            record.getLevel().getLocalizedName(),
            record.getMessage()
        );
      }
    });
    LOGGER.setUseParentHandlers(false);
    LOGGER.addHandler(handler);
  }

  public static void main(String[] args) throws Exception {
    LOGGER.log(Level.INFO, "An INFO level log!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference between this method and the one using a configuration file is the static block. Instead of providing the location of the configuration file, we create a new instance of the &lt;strong&gt;ConsoleHandler&lt;/strong&gt; and override the message formatting method, which takes the LogRecord object as its argument. We provide the format, but because we are not passing the same number of arguments to the String.format method, we modified the format string as well. We also disabled the parent handlers, which means that only our own handler will be used, and finally we add our handler by calling the &lt;strong&gt;LOGGER.addHandler&lt;/strong&gt; method. And that’s it – we are done. The output of the above code looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2020-06-29 16:25:34] [INFO   ] An INFO level log!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wanted to show that setting up logging in Java can also be done programmatically, not only via configuration files. However, in most cases, you will end up with a file that configures the logging part of your application. It is just more convenient to use, simpler to adjust, modify, and work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log4j
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://logging.apache.org/log4j/2.x/" rel="noopener noreferrer"&gt;Log4j 2&lt;/a&gt; and its predecessor the &lt;a href="http://logging.apache.org/log4j/1.2/" rel="noopener noreferrer"&gt;Log4j&lt;/a&gt; are the most common and widely known logging framework for Java. It promises to improve the first version of the Log4j library and fix some of the issues identified in the Logback framework. It also supports asynchronous logging. For the purpose of this tutorial I created a simple Java project that uses Log4j 2, you can find it in our &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/log4j2" rel="noopener noreferrer"&gt;Github&lt;/a&gt; account.&lt;/p&gt;

&lt;p&gt;Let’s start with the code that we will be using for the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class Log4JDefaultConfig {
  private static final Logger LOGGER = LogManager.getLogger(Log4JDefaultConfig.class);

  public static void main(String[] args) {
    LOGGER.info("This is an INFO level log message!");
    LOGGER.error("This is an ERROR level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the code to compile we need two dependencies – the Log4j 2 API and the Core. Our build.gradle includes the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dependencies {
    implementation 'org.apache.logging.log4j:log4j-api:2.13.3'
    implementation 'org.apache.logging.log4j:log4j-core:2.13.3'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Things are quite simple here – we get the Logger instance by using the Log4J 2 LogManager class and its getLogger method. Once we have that we use the info and error methods of the Logger to write the log records. We could of course use SLF4J and we will in one of the examples. For now, though, let’s stick to the pure Log4J 2 API.&lt;/p&gt;

&lt;p&gt;In comparison to java.util.logging we don’t have to do any setup. By default, Log4j will use the &lt;a href="https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/appender/ConsoleAppender.html" rel="noopener noreferrer"&gt;ConsoleAppender&lt;/a&gt; to write the log message to the console. The log record will be printed using the &lt;a href="https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/layout/PatternLayout.html" rel="noopener noreferrer"&gt;PatternLayout&lt;/a&gt; attached to the mentioned ConsoleAppender, with the pattern defined as follows: &lt;strong&gt;%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} – %msg%n&lt;/strong&gt;. However, the root logger is defined at the ERROR level. That means that, by default, only ERROR level logs and above will be visible; log messages with the INFO, DEBUG, and TRACE levels will not.&lt;/p&gt;

&lt;p&gt;There are multiple conditions under which Log4j2 will use the default configuration, but in general you can expect it to kick in when no &lt;strong&gt;log4j.configurationFile&lt;/strong&gt; property is present in the application startup parameters or when this property doesn’t point to a valid configuration file, and when no &lt;strong&gt;log4j2-test.[properties|xml|yaml|jsn]&lt;/strong&gt; or &lt;strong&gt;log4j2.[properties|xml|yaml|jsn]&lt;/strong&gt; files are present in the classpath. Yes, that means that you can easily have different logging configurations for tests, a different default configuration for your codebase, and then use the &lt;strong&gt;log4j.configurationFile&lt;/strong&gt; property to point to a configuration for a given environment.&lt;/p&gt;

&lt;p&gt;A simple test – running the above code gives the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;09:27:10.312 [main] ERROR com.sematext.blog.logging.Log4JDefaultConfig - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, such default behavior will not be enough for almost anything apart from a simple code, so let’s see what we can do about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring Log4J 2
&lt;/h3&gt;

&lt;p&gt;Log4J 2 can be configured in one of the two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By using the configuration file. By default, Log4J 2 understands configuration written in Java properties files and XML files, but you can also include additional dependencies to work with JSON or YAML. In this blog post, we will use this method.&lt;/li&gt;
&lt;li&gt;Programmatically by creating the ConfigurationFactory and Configuration implementations, or by using the exposed APIs in the Configuration interface or by calling internal Logger methods.&lt;/li&gt;
&lt;/ul&gt;
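&lt;p&gt;As a taste of the programmatic option, here is a minimal sketch using the Log4J 2 ConfigurationBuilder API that builds the equivalent of a console appender plus an INFO-level root logger; the class name is mine:&lt;/p&gt;

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.core.config.Configurator;
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFactory;

public class ProgrammaticLog4j2 {
  public static void main(String[] args) {
    var builder = ConfigurationBuilderFactory.newConfigurationBuilder();
    // Console appender with a simple time/level/message pattern
    builder.add(builder.newAppender("Console", "CONSOLE")
        .add(builder.newLayout("PatternLayout")
            .addAttribute("pattern", "%d{HH:mm:ss.SSS} %-5level - %msg%n")));
    // Root logger at INFO, wired to the appender above
    builder.add(builder.newRootLogger(Level.INFO)
        .add(builder.newAppenderRef("Console")));
    Configurator.initialize(builder.build());

    Logger logger = LogManager.getLogger(ProgrammaticLog4j2.class);
    logger.info("Configured programmatically!");
  }
}
```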

&lt;p&gt;Let’s create a simple, old fashioned XML configuration file first, just to print the time, level and message associated with the log record. Log4J 2 requires us to call that file log4j2.xml. We put it in the resources folder of &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/log4j2" rel="noopener noreferrer"&gt;our example project&lt;/a&gt;. The configuration file looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %-5level - %msg%n"/&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s stay here for a minute or two and discuss what we did. We started with the Configuration definition. In this element, we defined that Log4J’s own internal status messages will be logged at the WARN level – status=”WARN”.&lt;/p&gt;

&lt;p&gt;Next, we defined the appender. The Appender is responsible for delivering the LogEvent to its destination. You can have multiple of those. In our case the Appenders section contains a single appender of the Console type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
    &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %-5level - %msg%n"/&amp;gt;
&amp;lt;/Console&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We set its name, which we will use a bit later, and we say that the target is the standard system output – SYSTEM_OUT. We also provided the pattern for the PatternLayout, which defines how our LogEvent will be formatted by the Console appender.&lt;/p&gt;

&lt;p&gt;The last section defines the Loggers. Here we configure the different loggers that we use in our code or in the libraries we depend on. In our case this section only contains the Root logger definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;Root level="info"&amp;gt;
    &amp;lt;AppenderRef ref="Console"/&amp;gt;
&amp;lt;/Root&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Root logger definition tells Log4J to use that configuration when a dedicated configuration for a logger is not found. In our Root logger definition, we say that the default log level should be set to INFO and the log events should be sent to the appender with the name Console.&lt;/p&gt;

&lt;p&gt;If we run the example code mentioned above with our new XML configuration the output will be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;09:29:58.735 INFO  - This is an INFO level log message!
09:29:58.737 ERROR - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just as we expected we have both log messages present in the console.&lt;/p&gt;

&lt;p&gt;If we would like to use a properties file instead of the XML one, we could just create a log4j2.properties file and include it in the classpath, or use the log4j.configurationFile property and point it to the properties file of our choice. To achieve results similar to the XML-based configuration we would use the following properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{HH:mm:ss.SSS} %-5level - %msg%n
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = STDOUT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is less verbose and works, but is also less flexible if you want to use slightly more complicated functionalities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log4J Appenders
&lt;/h3&gt;

&lt;p&gt;Log4J appender is responsible for delivering LogEvents to their destination. Some appenders only send LogEvents while others wrap other appenders to provide additional functionality. Let’s look at some of the appenders available in Log4J. Keep in mind that this is not a full list of appenders and to see all of them look at &lt;a href="https://logging.apache.org/log4j/2.x/manual/appenders.html" rel="noopener noreferrer"&gt;Log4J appenders documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The not-so-full list of appenders in Log4J 2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Console&lt;/strong&gt; – writes the data to System.out or System.err, with the default being the first one (a good practice when logging in containers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File&lt;/strong&gt; – appender that uses the FileManager to write the data to a defined file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling File&lt;/strong&gt; – appender that writes data to a defined file and rolls over the file according to a defined policy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Mapped File&lt;/strong&gt; – added in version 2.1, uses memory-mapped files and relies on the operating system virtual memory manager to synchronize the changes in the file with the storage device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flume&lt;/strong&gt; – appender that writes data to Apache Flume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cassandra&lt;/strong&gt; – appender that writes data to Apache Cassandra&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JDBC&lt;/strong&gt; – writes data to a database using standard JDBC driver&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP&lt;/strong&gt; – writes data to a defined HTTP endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt; – writes data to Apache Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syslog&lt;/strong&gt; – writes data to a Syslog compatible destination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZeroMQ&lt;/strong&gt; – writes data to ZeroMQ&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async&lt;/strong&gt; – encapsulates another appender and uses a different thread to write data, which results in asynchronous logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, we won’t be discussing each and every appender mentioned above, but let’s look at the File appender to see how to log messages to the console and a file at the same time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Multiple Appenders
&lt;/h3&gt;

&lt;p&gt;Let’s assume we have the following use case – we would like to set up our application to log everything to a file so we can use it for shipping data to an external &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;log management and analysis solution&lt;/a&gt; such as &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, and we would also like to have the logs from com.sematext.blog loggers printed to the console.&lt;/p&gt;

&lt;p&gt;I’ll use the same example application that looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class Log4j2 {
  private static final Logger LOGGER = LogManager.getLogger(Log4j2.class);

  public static void main(String[] args) {
    LOGGER.info("This is an INFO level log message!");
    LOGGER.error("This is an ERROR level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log4j2.xml will look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %-5level - %msg%n"/&amp;gt;
        &amp;lt;/Console&amp;gt;
        &amp;lt;File name="File" fileName="/tmp/log4j2.log" append="true"&amp;gt;
            &amp;lt;PatternLayout&amp;gt;
                &amp;lt;Pattern&amp;gt;%d{HH:mm:ss.SSS} [%t] %-5level - %msg%n&amp;lt;/Pattern&amp;gt;
            &amp;lt;/PatternLayout&amp;gt;
        &amp;lt;/File&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Logger name="com.sematext.blog" level="info" additivity="true"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Logger&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="File"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that I have two appenders defined – one called Console that appends the data to System.out or System.err. The second one is called File. I provided a fileName configuration property to tell the appender where the data should be written, and I set append to true so that data is appended to the end of the file. Keep in mind that those two appenders have slightly different patterns – the File one also logs the thread, while the Console one doesn’t.&lt;/p&gt;

&lt;p&gt;We also have two Loggers defined. The Root logger appends the data to our File appender for all log events with the INFO level and higher. We also have a second Logger defined with the name com.sematext.blog, which appends the data using the Console appender, also at the INFO level. Because its additivity is set to true, events from that logger also propagate to the Root logger and end up in the file.&lt;/p&gt;

&lt;p&gt;After running the above code we will see the following output on the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15:19:54.730 INFO  - This is an INFO level log message!
15:19:54.732 ERROR - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the following output in the /tmp/log4j2.log file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;15:19:54.730 [main] INFO  - This is an INFO level log message!
15:19:54.732 [main] ERROR - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the timestamps are exactly the same, but the lines are formatted differently, just as our patterns are.&lt;/p&gt;

&lt;p&gt;In a normal production environment, you would probably use the Rolling File appender to roll over the files daily or when they reach a certain size.&lt;/p&gt;
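&lt;p&gt;As a sketch of what that could look like – the file names and size limit here are illustrative, not taken from the example project – a RollingFile appender that rolls over daily or at 10 MB could be configured like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;RollingFile name="RollingFile" fileName="/tmp/log4j2.log"
             filePattern="/tmp/log4j2-%d{yyyy-MM-dd}-%i.log.gz"&amp;gt;
    &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level - %msg%n"/&amp;gt;
    &amp;lt;Policies&amp;gt;
        &amp;lt;TimeBasedTriggeringPolicy/&amp;gt;
        &amp;lt;SizeBasedTriggeringPolicy size="10 MB"/&amp;gt;
    &amp;lt;/Policies&amp;gt;
&amp;lt;/RollingFile&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;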

&lt;h3&gt;
  
  
  Log4J Layouts
&lt;/h3&gt;

&lt;p&gt;A layout is used by an appender to format a LogEvent into the desired form.&lt;/p&gt;

&lt;p&gt;By default, there are a few layouts available in Log4j 2 (some of them require additional runtime dependencies):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern&lt;/strong&gt; – uses a string pattern to format log events (&lt;a href="https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout" rel="noopener noreferrer"&gt;learn more about the Pattern layout&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSV&lt;/strong&gt; – the layout for writing data in the CSV format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GELF&lt;/strong&gt; – layout for writing events in the Graylog Extended Log Format 1.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTML&lt;/strong&gt; – layout for writing data in the HTML format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON&lt;/strong&gt; – layout for writing data in JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RFC5424&lt;/strong&gt; – writes the data in accordance with &lt;a href="https://tools.ietf.org/html/rfc5424" rel="noopener noreferrer"&gt;RFC 5424&lt;/a&gt; – the extended Syslog format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serialized&lt;/strong&gt; – serializes log events into a byte array using Java serialization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syslog&lt;/strong&gt; – formats log events into Syslog compatible format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XML&lt;/strong&gt; – layout for writing data in the XML format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YAML&lt;/strong&gt; – layout for writing data in the YAML format&lt;/li&gt;
&lt;/ul&gt;
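&lt;p&gt;To take one of these, the JSON layout formats each log event as a JSON document. Note that JsonLayout needs the Jackson libraries on the runtime classpath – an extra dependency, which is an assumption here, as the example project may not include it. A minimal sketch of a Console appender using it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
    &amp;lt;JsonLayout compact="true" eventEol="true"/&amp;gt;
&amp;lt;/Console&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;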

&lt;p&gt;For example, if we would like to have our logs be formatted in HTML we could use the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;HTMLLayout&amp;gt;&amp;lt;/HTMLLayout&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of running the example code with the above configuration is quite verbose and looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;meta charset="UTF-8"/&amp;gt;
    &amp;lt;title&amp;gt;Log4j Log Messages&amp;lt;/title&amp;gt;
    &amp;lt;style type="text/css"&amp;gt;
    &amp;lt;!--
    body, table {font-family:arial,sans-serif; font-size: medium
    ;}th {background: #336699; color: #FFFFFF; text-align: left;}
    --&amp;gt;
    &amp;lt;/style&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body bgcolor="#FFFFFF" topmargin="6" leftmargin="6"&amp;gt;
&amp;lt;hr size="1" noshade="noshade"&amp;gt;
Log session start time Wed Jul 01 15:45:08 CEST 2020&amp;lt;br&amp;gt;
&amp;lt;br&amp;gt;
    &amp;lt;table cellspacing="0" cellpadding="4" border="1" bordercolor="#224466" width="100%"&amp;gt;
    &amp;lt;tr&amp;gt;
        &amp;lt;th&amp;gt;Time&amp;lt;/th&amp;gt;
        &amp;lt;th&amp;gt;Thread&amp;lt;/th&amp;gt;
        &amp;lt;th&amp;gt;Level&amp;lt;/th&amp;gt;
        &amp;lt;th&amp;gt;Logger&amp;lt;/th&amp;gt;
        &amp;lt;th&amp;gt;Message&amp;lt;/th&amp;gt;
    &amp;lt;/tr&amp;gt;
    &amp;lt;tr&amp;gt;
        &amp;lt;td&amp;gt;652&amp;lt;/td&amp;gt;
        &amp;lt;td title="main thread"&amp;gt;main&amp;lt;/td&amp;gt;
        &amp;lt;td title="Level"&amp;gt;INFO&amp;lt;/td&amp;gt;
        &amp;lt;td title="com.sematext.blog.logging.Log4j2 logger"&amp;gt;com.sematext.blog.logging.Log4j2&amp;lt;/td&amp;gt;
        &amp;lt;td title="Message"&amp;gt;This is an INFO level log message!&amp;lt;/td&amp;gt;
    &amp;lt;/tr&amp;gt;
    &amp;lt;tr&amp;gt;
        &amp;lt;td&amp;gt;653&amp;lt;/td&amp;gt;
        &amp;lt;td title="main thread"&amp;gt;main&amp;lt;/td&amp;gt;
        &amp;lt;td title="Level"&amp;gt;&amp;lt;font color="#993300"&amp;gt;&amp;lt;strong&amp;gt;ERROR&amp;lt;/strong&amp;gt;&amp;lt;/font&amp;gt;&amp;lt;/td&amp;gt;
        &amp;lt;td title="com.sematext.blog.logging.Log4j2 logger"&amp;gt;com.sematext.blog.logging.Log4j2&amp;lt;/td&amp;gt;
        &amp;lt;td title="Message"&amp;gt;This is an ERROR level log message!&amp;lt;/td&amp;gt;
    &amp;lt;/tr&amp;gt;
&amp;lt;/table&amp;gt;
&amp;lt;br&amp;gt;
&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Log4J Filters
&lt;/h3&gt;

&lt;p&gt;The Filter allows log events to be checked to determine if or how they should be published. A filter execution can end with one of three values – ACCEPT, DENY or NEUTRAL.&lt;/p&gt;

&lt;p&gt;Filters can be configured in one of the following locations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Directly in the Configuration for context-wide filters&lt;/li&gt;
&lt;li&gt;In the Logger for logger-specific filtering&lt;/li&gt;
&lt;li&gt;In the Appender for appender-specific filtering&lt;/li&gt;
&lt;li&gt;In the Appender reference to determine if the log event should reach a given appender&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few out-of-the-box filters, and we can also develop our own. The ones that are available out of the box are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Burst&lt;/strong&gt; – provides a mechanism to control the rate at which log events are processed and discards them silently if a limit is hit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite&lt;/strong&gt; – provides a mechanism to combine multiple filters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold&lt;/strong&gt; – allows filtering on the log level of the log event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Threshold&lt;/strong&gt; – similar to the Threshold filter, but allows including additional attributes, for example the user&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Map&lt;/strong&gt; – allows filtering of data that is in a MapMessage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marker&lt;/strong&gt; – compares the Marker defined in the filter with the Marker in the log event and filters on the basis of that&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Marker&lt;/strong&gt; – allows checking for Marker existence in the log event and filter on the basis of that information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex&lt;/strong&gt; – filters on the basis of the defined regular expression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Script&lt;/strong&gt; – executes a script that should return a boolean result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured Data&lt;/strong&gt; – allows filtering on the id, type, and message of the log event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread Context Map&lt;/strong&gt; – allows filtering against data in the current thread context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time&lt;/strong&gt; – allows restricting log events to a certain time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, if we would like to include all logs with level WARN or higher we could use the ThresholdFilter with the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} %-5level - %msg%n"/&amp;gt;
            &amp;lt;ThresholdFilter level="WARN" onMatch="ACCEPT" onMismatch="DENY"/&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running our example code with the above Log4j 2 configuration would result in the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;16:34:30.797 ERROR - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Log4J Garbage Free Logging
&lt;/h3&gt;

&lt;p&gt;With Log4J 2.6, partially garbage-free logging was introduced. The framework reuses objects stored in ThreadLocal fields and tries to reuse buffers when converting text to bytes.&lt;/p&gt;

&lt;p&gt;In version 2.6 of the framework, two properties control the garbage-free logging capabilities of Log4J 2: &lt;strong&gt;log4j2.enableThreadlocals&lt;/strong&gt;, which defaults to true for non-web applications, and &lt;strong&gt;log4j2.enableDirectEncoders&lt;/strong&gt;, which also defaults to true. These are the properties that enable the optimizations in Log4J 2.&lt;/p&gt;

&lt;p&gt;Version 2.7 of the framework added a third property – log4j2.garbagefreeThreadContextMap. If set to true, the ThreadContext map will also use a garbage-free approach. By default, this property is set to false.&lt;/p&gt;
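&lt;p&gt;These flags are usually set as system properties or in a log4j2.component.properties file on the classpath. A sketch enabling all three optimizations could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log4j2.enableThreadlocals=true
log4j2.enableDirectEncoders=true
log4j2.garbagefreeThreadContextMap=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;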

&lt;p&gt;There is a set of limitations when it comes to garbage-free logging in Log4J 2 – not all filters, appenders, and layouts support it. If you decide to use it, check the &lt;a href="https://logging.apache.org/log4j/2.x/manual/garbagefree.html" rel="noopener noreferrer"&gt;Log4J 2 documentation&lt;/a&gt; on that topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log4J Asynchronous Logging
&lt;/h3&gt;

&lt;p&gt;Asynchronous logging is a newer addition to Log4J 2. The aim is to return control to the application from the Logger.log method call as soon as possible by executing the logging operation in a separate thread.&lt;/p&gt;

&lt;p&gt;There are several benefits and drawbacks of using asynchronous logging. The benefits are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher peak throughput&lt;/li&gt;
&lt;li&gt;Lower logging response time latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complicated error handling&lt;/li&gt;
&lt;li&gt;Will not give higher performance on machines with low CPU count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you log more than the appenders can process, logging speed will be dictated by the slowest appender as the queue fills up.&lt;/p&gt;

&lt;p&gt;If at this point you still want to turn on asynchronous logging, be aware that additional runtime dependencies are needed. Log4J 2 uses the &lt;a href="https://lmax-exchange.github.io/disruptor/" rel="noopener noreferrer"&gt;Disruptor&lt;/a&gt; library and requires it as a runtime dependency. You also need to set the &lt;strong&gt;log4j2.contextSelector&lt;/strong&gt; system property to &lt;strong&gt;org.apache.logging.log4j.core.async.AsyncLoggerContextSelector&lt;/strong&gt;.&lt;/p&gt;
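&lt;p&gt;To illustrate, assuming the Disruptor jar is already available – the jar names and paths below are illustrative – enabling all-asynchronous loggers boils down to setting that system property when starting the application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector \
     -cp app.jar:disruptor-3.4.2.jar:log4j-api-2.13.3.jar:log4j-core-2.13.3.jar \
     com.sematext.blog.logging.Log4j2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;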
&lt;h3&gt;
  
  
  Log4J Thread Context
&lt;/h3&gt;

&lt;p&gt;Log4J combines the Mapped Diagnostic Context and the Nested Diagnostic Context in the Thread Context. But let’s discuss each of those separately for a moment.&lt;/p&gt;

&lt;p&gt;The Mapped Diagnostic Context, or MDC, is basically a map that can be used to store context data of the particular thread where the code is running. For example, we can store the user identifier or the step of the algorithm. In Log4J 2, instead of the MDC, we use the Thread Context map, which associates a limited set of contextual information with log events.&lt;/p&gt;

&lt;p&gt;The Nested Diagnostic Context, or NDC, is similar to the MDC, but can be used to distinguish interleaved log output from different sources, for example when a server or application handles multiple clients at once. In Log4J 2, instead of the NDC, we use the Thread Context stack.&lt;/p&gt;

&lt;p&gt;Let’s see how we can work with it. What we would like to do is add additional information to our log messages – the user name and the step of the algorithm. This could be done by using the Thread Context in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.ThreadContext;

public class Log4j2ThreadContext {
  private static final Logger LOGGER = LogManager.getLogger(Log4j2ThreadContext.class);

  public static void main(String[] args) {
    ThreadContext.put("user", "rafal.kuc@sematext.com");
    LOGGER.info("This is the first INFO level log message!");
    ThreadContext.put("executionStep", "one");
    LOGGER.info("This is the second INFO level log message!");
    ThreadContext.put("executionStep", "two");
    LOGGER.info("This is the third INFO level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our additional information from the Thread Context to be displayed in the log messages, we need a different configuration. The configuration we used to achieve this looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Configuration status="WARN"&amp;gt;
    &amp;lt;Appenders&amp;gt;
        &amp;lt;Console name="Console" target="SYSTEM_OUT"&amp;gt;
            &amp;lt;PatternLayout pattern="%d{HH:mm:ss.SSS} [%X{user}] [%X{executionStep}] %-5level - %msg%n"/&amp;gt;
        &amp;lt;/Console&amp;gt;
    &amp;lt;/Appenders&amp;gt;
    &amp;lt;Loggers&amp;gt;
        &amp;lt;Root level="info"&amp;gt;
            &amp;lt;AppenderRef ref="Console"/&amp;gt;
        &amp;lt;/Root&amp;gt;
    &amp;lt;/Loggers&amp;gt;
&amp;lt;/Configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we’ve added the [%X{user}] [%X{executionStep}] part to our pattern. This means that we will take the user and executionStep property values from the Thread Context and include them in the log message. If we wanted to include all the properties that are present, we would just use %X.&lt;/p&gt;

&lt;p&gt;The output of running the above code with the above configuration would be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;18:05:40.181 [rafal.kuc@sematext.com] [] INFO  - This is the first INFO level log message!
18:05:40.183 [rafal.kuc@sematext.com] [one] INFO  - This is the second INFO level log message!
18:05:40.183 [rafal.kuc@sematext.com] [two] INFO  - This is the third INFO level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that when the context information is available it is written along with the log messages. It will be present until it is changed or cleared.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is a Marker?
&lt;/h3&gt;

&lt;p&gt;Markers are named objects that are used to enrich the data. While the Thread Context is used to provide additional information for log events, Markers can be used to mark a single log statement. You can imagine marking a log event with an IMPORTANT marker, meaning that the appender should, for example, store the event in a separate log file.&lt;/p&gt;

&lt;p&gt;How to do that? Look at the following code example that uses filtering to achieve that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

public class Log4J2MarkerFiltering {
  private static Logger LOGGER = LoggerFactory.getLogger(Log4J2MarkerFiltering.class);
  private static final Marker IMPORTANT = MarkerFactory.getMarker("IMPORTANT");

  public static void main(String[] args) {
    LOGGER.info("This is a log message that is not important!");
    LOGGER.info(IMPORTANT, "This is a very important log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above example uses SLF4J as the API of choice and Log4J 2 as the logging framework. That means that our dependencies section of the Gradle build file looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dependencies {
    implementation 'org.slf4j:slf4j-api:1.7.30'
    implementation 'org.apache.logging.log4j:log4j-api:2.13.3'
    implementation 'org.apache.logging.log4j:log4j-core:2.13.3'
    implementation 'org.apache.logging.log4j:log4j-slf4j-impl:2.13.3'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two important things in the above code example. First of all, we are creating a static Marker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private static final Marker IMPORTANT = MarkerFactory.getMarker("IMPORTANT");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the MarkerFactory class to do that by calling its getMarker method and providing the name of the marker. Usually, you would set up those markers in a separate, common class that your whole code can access.&lt;/p&gt;

&lt;p&gt;Finally, to provide the Marker to the Logger we use the appropriate logging method of the Logger class and provide the marker as the first argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOGGER.info(IMPORTANT, "This is a very important log message!");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the code the console will show the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:51:25.905 INFO  - This is a log message that is not important!
12:51:25.907 INFO  - This is a very important log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the file that is created by the File appender will contain the following log message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:51:25.907 [main] INFO  - This is a very important log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exactly what we wanted!&lt;/p&gt;
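&lt;p&gt;The configuration used for this example isn’t shown above, but one way to achieve this behavior – a sketch, not necessarily the exact configuration used – is to put a MarkerFilter on the File appender so that only events carrying the IMPORTANT marker reach it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;File name="File" fileName="/tmp/log4j2.log" append="true"&amp;gt;
    &amp;lt;MarkerFilter marker="IMPORTANT" onMatch="ACCEPT" onMismatch="DENY"/&amp;gt;
    &amp;lt;PatternLayout&amp;gt;
        &amp;lt;Pattern&amp;gt;%d{HH:mm:ss.SSS} [%t] %-5level - %msg%n&amp;lt;/Pattern&amp;gt;
    &amp;lt;/PatternLayout&amp;gt;
&amp;lt;/File&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;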

&lt;p&gt;Keep in mind that Markers are not only available in Log4J. As you can see in the above example, they come from SLF4J, so SLF4J-compatible frameworks should support them. In fact, the next logging framework that we will discuss, Logback, also supports Markers, and we will look into that while discussing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logback
&lt;/h2&gt;

&lt;p&gt;&lt;a href="http://logback.qos.ch/" rel="noopener noreferrer"&gt;Logback&lt;/a&gt; starts where the first version of Log4J ends and promises to provide improvements to that. For the purpose of this blog post, I created a simple project and shared it in our &lt;a href="https://github.com/sematext/blog-java_logging/tree/master/logback" rel="noopener noreferrer"&gt;Github&lt;/a&gt; account.&lt;/p&gt;

&lt;p&gt;Logback uses SLF4J as the logging facade. In order to use it we need to include it as the project dependency. In addition to that we need the logback-core and logback-classic libraries. Our Gradle build file dependency section looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dependencies {
    implementation 'ch.qos.logback:logback-core:1.2.3'
    implementation 'ch.qos.logback:logback-classic:1.2.3'
    implementation 'org.slf4j:slf4j-api:1.7.30'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also created a simple class for the purpose of showing how the Logback logging framework works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Logback {
  private static final Logger LOGGER = LoggerFactory.getLogger(Logback.class);

  public static void main(String[] args) {
    LOGGER.info("This is an INFO level log message!");
    LOGGER.error("This is an ERROR level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we generate two log messages. But before that, we start by creating the Logger using the SLF4J LoggerFactory class and its getLogger method. Once we have that, we can start generating log messages using the appropriate Logger methods. We already discussed that when we were looking at SLF4J earlier, so I’ll skip the more detailed discussion.&lt;/p&gt;

&lt;p&gt;By default, when no configuration is provided the output generated by the execution of the above code will look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:44:39.162 [main] INFO com.sematext.blog.logging.Logback - This is an INFO level log message!
12:44:39.165 [main] ERROR com.sematext.blog.logging.Logback - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default behavior is activated when none of the logback-test.xml, logback.groovy, or logback.xml files is found in the classpath. The default configuration uses the ConsoleAppender, which prints the log messages to the standard output, and the PatternLayoutEncoder with the log message format defined as &lt;strong&gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} – %msg%n&lt;/strong&gt;. The root logger is assigned the DEBUG log level by default, so it can be quite verbose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring Logback
&lt;/h3&gt;

&lt;p&gt;To configure Logback we can use either standard XML files or Groovy ones. We’ll use the XML method. Let’s create the logback.xml file in the resources folder of our project with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="STDOUT" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can already see some similarities with Log4j 2. We have the root configuration element, which is responsible for holding the whole configuration. Inside, we see two sections. The first one is the appender section. We created a single appender named STDOUT that prints the output to the console and uses an Encoder with a given pattern. The Encoder is responsible for formatting the log event. Finally, we defined the root element to use the INFO level as the default log level and to send all messages to the appender named STDOUT by default.&lt;/p&gt;

&lt;p&gt;The output of the execution of the above code with the created logback.xml file present in the classpath results in the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:49:47.730 INFO  com.sematext.blog.logging.Logback - This is an INFO level log message!
12:49:47.732 ERROR com.sematext.blog.logging.Logback - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logback Appenders
&lt;/h3&gt;

&lt;p&gt;A Logback appender is the component that Logback uses to write log events. Appenders have a name and a single method that processes the event.&lt;/p&gt;

&lt;p&gt;The logback-core library lays the foundation for the Logback appenders and provides a few classes that are ready to be used. Those are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ConsoleAppender&lt;/strong&gt; – appends the log events to the System.out or System.err&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OutputStreamAppender&lt;/strong&gt; – appends the log events to a java.io.OutputStream, providing the basic services for other appenders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FileAppender&lt;/strong&gt; – appends the log events to a file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RollingFileAppender&lt;/strong&gt; – appends the log events to a file with the option of automatic file rollover&lt;/li&gt;
&lt;/ul&gt;
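&lt;p&gt;As a sketch of the rollover case – the file names and retention here are illustrative – a RollingFileAppender with a time-based policy could be configured like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;appender name="rolling" class="ch.qos.logback.core.rolling.RollingFileAppender"&amp;gt;
    &amp;lt;file&amp;gt;/tmp/logback.log&amp;lt;/file&amp;gt;
    &amp;lt;rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy"&amp;gt;
        &amp;lt;fileNamePattern&amp;gt;/tmp/logback.%d{yyyy-MM-dd}.log&amp;lt;/fileNamePattern&amp;gt;
        &amp;lt;maxHistory&amp;gt;7&amp;lt;/maxHistory&amp;gt;
    &amp;lt;/rollingPolicy&amp;gt;
    &amp;lt;encoder&amp;gt;
        &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
    &amp;lt;/encoder&amp;gt;
&amp;lt;/appender&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;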

&lt;p&gt;The logback-classic library that we included in our example project extends the list of available appenders and provides appenders that are able to send data to external systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SocketAppender&lt;/strong&gt; – appends the log events to a socket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSLSocketAppender&lt;/strong&gt; – appends the log events to a socket using secure connection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMTPAppender&lt;/strong&gt; – accumulates data in batches and sends the content of a batch to a user-defined email address after a user-specified event occurs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBAppender&lt;/strong&gt; – appends data to database tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SyslogAppender&lt;/strong&gt; – appends data to a Syslog-compatible destination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SiftingAppender&lt;/strong&gt; – appender that is able to separate logging according to a given runtime attribute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AsyncAppender&lt;/strong&gt; – appends log events asynchronously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more appenders available in Logback extensions, but we won’t mention them all or discuss each one. Instead, let’s look at one of the common use cases – writing logs both to the console and to a file.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using Multiple Appenders
&lt;/h4&gt;

&lt;p&gt;Let’s assume the following use case: we want our application to log everything to a file, so that we can ship the data to an external &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;log management and analysis solution&lt;/a&gt; such as &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, and we also want the logs from the com.sematext.blog loggers printed to the console.&lt;/p&gt;

&lt;p&gt;I’ll use the same example application that looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Logback {
  private static final Logger LOGGER = LoggerFactory.getLogger(Logback.class);

  public static void main(String[] args) {
    LOGGER.info("This is an INFO level log message!");
    LOGGER.error("This is an ERROR level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this time, the logback.xml will look slightly different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;appender name="file" class="ch.qos.logback.core.FileAppender"&amp;gt;
        &amp;lt;file&amp;gt;/tmp/logback.log&amp;lt;/file&amp;gt;
        &amp;lt;append&amp;gt;true&amp;lt;/append&amp;gt;
        &amp;lt;immediateFlush&amp;gt;true&amp;lt;/immediateFlush&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;logger name="com.sematext.blog"&amp;gt;
        &amp;lt;appender-ref ref="console"/&amp;gt;
    &amp;lt;/logger&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="file" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that I have two appenders defined – one called console that appends the data to System.out or System.err. The second one, called file, uses the &lt;strong&gt;ch.qos.logback.core.FileAppender&lt;/strong&gt; to write the log events to a file. I provided the file configuration property to tell the appender where the data should be written, specified that I want to append to the file, and enabled immediate flush so that log events are written out right away – note that this improves reliability at the cost of some logging throughput. I also added an encoder to format my data and included the thread name in the log line.&lt;/p&gt;

&lt;p&gt;The difference from the earlier examples is also in the logger definitions. We have a root logger that appends the data to our file appender. It does that for all log events with the INFO level and higher. We also have a second logger, named com.sematext.blog, which appends data using the console appender.&lt;/p&gt;

&lt;p&gt;After running the above code we will see the following output on the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:54:09.077 INFO  com.sematext.blog.logging.Logback - This is an INFO level log message!
12:54:09.078 ERROR com.sematext.blog.logging.Logback - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the following output in the /tmp/logback.log file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;12:54:09.077 [main] INFO  com.sematext.blog.logging.Logback - This is an INFO level log message!
12:54:09.078 [main] ERROR com.sematext.blog.logging.Logback - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the timestamps are exactly the same, but the lines differ, just as our encoder patterns do – the file output additionally includes the thread name.&lt;/p&gt;

&lt;p&gt;In the normal production environment, you would probably use the &lt;strong&gt;RollingFileAppender&lt;/strong&gt; to roll over the files daily or when they reach a certain size. Keep that in mind when setting up your own production logging.&lt;/p&gt;
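
&lt;p&gt;As a sketch of what that could look like, the following fragment uses a time-based rolling policy that rolls the file over daily, compresses the rolled files, and keeps a week of history – treat the file names and retention values as illustrative, not as recommendations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;appender name="rollingFile" class="ch.qos.logback.core.rolling.RollingFileAppender"&amp;gt;
    &amp;lt;file&amp;gt;/tmp/logback.log&amp;lt;/file&amp;gt;
    &amp;lt;rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy"&amp;gt;
        &amp;lt;!-- roll over daily; the .gz suffix compresses rolled files --&amp;gt;
        &amp;lt;fileNamePattern&amp;gt;/tmp/logback.%d{yyyy-MM-dd}.log.gz&amp;lt;/fileNamePattern&amp;gt;
        &amp;lt;!-- keep at most 7 days of history --&amp;gt;
        &amp;lt;maxHistory&amp;gt;7&amp;lt;/maxHistory&amp;gt;
    &amp;lt;/rollingPolicy&amp;gt;
    &amp;lt;encoder&amp;gt;
        &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
    &amp;lt;/encoder&amp;gt;
&amp;lt;/appender&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;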

&lt;h3&gt;
  
  
  Logback Encoders
&lt;/h3&gt;

&lt;p&gt;A Logback encoder is responsible for transforming a log event into a byte array and writing that byte array to an OutputStream.&lt;/p&gt;

&lt;p&gt;Right now there are two encoders available in Logback:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PatternLayoutEncoder&lt;/strong&gt; – encoder that takes a pattern and encodes the log event based on that pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LayoutWrappingEncoder&lt;/strong&gt; – an encoder that bridges the gap between the current Logback versions and versions prior to 0.9.19, which used Layout instances instead of patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In most cases you will be using a pattern, for example as in the following appender definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
            &amp;lt;outputPatternAsHeader&amp;gt;true&amp;lt;/outputPatternAsHeader&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="console" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The execution of such configuration along with our example project would result in the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;13:16:30.724 INFO  com.sematext.blog.logging.Logback - This is an INFO level log message!
13:16:30.726 ERROR com.sematext.blog.logging.Logback - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logback Layouts
&lt;/h3&gt;

&lt;p&gt;A Logback layout is a component that is responsible for transforming an incoming event into a String. You can write your own layout and include it in your appender using the ch.qos.logback.core.encoder.LayoutWrappingEncoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
    &amp;lt;encoder class="ch.qos.logback.code.encoder.LayoutWrappingEncoder"&amp;gt;
        &amp;lt;layout class="com.sematext.blog.logging.LayoutThatDoesntExist" /&amp;gt;
    &amp;lt;/encoder&amp;gt;
&amp;lt;/appender&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can use the PatternLayout, which is flexible and provides us with numerous ways of formatting our log events. In fact, we already used the PatternLayoutEncoder in our Logback examples! If you are interested in all the formatting options available check out the &lt;a href="http://logback.qos.ch/manual/layouts.html" rel="noopener noreferrer"&gt;Logback documentation on layouts&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logback Filters
&lt;/h3&gt;

&lt;p&gt;A Logback filter is a mechanism for accepting or rejecting a log event based on the criteria defined by the filter itself.&lt;/p&gt;

&lt;p&gt;A very simple filter implementation could look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging.logback;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;

public class SampleFilter extends Filter&amp;lt;ILoggingEvent&amp;gt; {
  @Override
  public FilterReply decide(ILoggingEvent event) {
    if (event.getMessage().contains("ERROR")) {
      return FilterReply.ACCEPT;
    }
    return FilterReply.DENY;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that we extend the &lt;strong&gt;Filter&lt;/strong&gt; class and provide an implementation for the decide(ILoggingEvent event) method. In our case, we just check if the message contains a given String value, and if it does, we accept the log event by returning &lt;strong&gt;FilterReply.ACCEPT&lt;/strong&gt;. Otherwise, we reject the log event by returning &lt;strong&gt;FilterReply.DENY&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We can include the filter in our appender by adding the filter tag and providing the class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
        &amp;lt;filter class="com.sematext.blog.logging.logback.SampleFilter" /&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="console" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we were to now execute our example code with the above Logback configuration the output would be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;13:52:12.451 ERROR com.sematext.blog.logging.Logback - This is an ERROR level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are also out-of-the-box filters. The &lt;strong&gt;ch.qos.logback.classic.filter.LevelFilter&lt;/strong&gt; allows us to filter events based on exact log level matching. For example, to reject all INFO level logs we could use the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
        &amp;lt;filter class="ch.qos.logback.classic.filter.LevelFilter"&amp;gt;
            &amp;lt;level&amp;gt;INFO&amp;lt;/level&amp;gt;
            &amp;lt;onMatch&amp;gt;DENY&amp;lt;/onMatch&amp;gt;
            &amp;lt;onMismatch&amp;gt;ACCEPT&amp;lt;/onMismatch&amp;gt;
        &amp;lt;/filter&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="console" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ch.qos.logback.classic.filter.ThresholdFilter allows us to filter log events below a specified threshold. For example, to discard all log events with a level lower than WARN, that is INFO and below, we could use the following filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
        &amp;lt;filter class="ch.qos.logback.classic.filter.ThresholdFilter"&amp;gt;
            &amp;lt;level&amp;gt;WARN&amp;lt;/level&amp;gt;
        &amp;lt;/filter&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="console" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are also other filter implementations, as well as one additional type of filter called TurboFilters – we suggest looking into the &lt;a href="http://logback.qos.ch/manual/filters.html" rel="noopener noreferrer"&gt;Logback documentation regarding filters&lt;/a&gt; to learn about them.&lt;/p&gt;
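
&lt;p&gt;To give a taste of TurboFilters: unlike regular filters, they are attached to the logger context as a whole rather than to a single appender, and they are consulted before the log event is even created. For example, the &lt;strong&gt;ch.qos.logback.classic.turbo.MarkerFilter&lt;/strong&gt; that ships with Logback accepts or denies events based on a marker – the marker name below is just an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;!-- accept every event carrying the IMPORTANT marker, regardless of level --&amp;gt;
    &amp;lt;turboFilter class="ch.qos.logback.classic.turbo.MarkerFilter"&amp;gt;
        &amp;lt;Marker&amp;gt;IMPORTANT&amp;lt;/Marker&amp;gt;
        &amp;lt;OnMatch&amp;gt;ACCEPT&amp;lt;/OnMatch&amp;gt;
    &amp;lt;/turboFilter&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="console" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;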

&lt;h3&gt;
  
  
  Mapped Diagnostic Contexts in Logback
&lt;/h3&gt;

&lt;p&gt;As we saw when discussing Log4J 2, MDC, or Mapped Diagnostic Context, is a way for developers to provide additional context information that will be included along with the log events if we wish. MDC can be used to distinguish log output from different sources – for example, in highly concurrent environments. MDC is managed on a per-thread basis.&lt;/p&gt;
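
&lt;p&gt;To build some intuition for the "per-thread" part, here is a simplified, self-contained sketch of the idea behind MDC – a ThreadLocal map of context values, so each thread sees only its own entries. This is only an illustration of the mechanism, not Logback’s or SLF4J’s actual implementation:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// A simplified illustration of the idea behind MDC: a ThreadLocal map
// of context values, so every thread sees only its own entries. This is
// NOT how Logback/SLF4J actually implement MDC, just a sketch of the
// per-thread mechanism.
public class MdcSketch {
  private static final ThreadLocal<Map<String, String>> CONTEXT =
      ThreadLocal.withInitial(HashMap::new);

  public static void put(String key, String value) {
    CONTEXT.get().put(key, value);
  }

  public static String get(String key) {
    return CONTEXT.get().get(key);
  }

  public static void main(String[] args) throws InterruptedException {
    put("user", "alice");
    Thread other = new Thread(() -> {
      // this thread starts with its own, empty context
      put("user", "bob");
      System.out.println("other thread user: " + get("user"));
    });
    other.start();
    other.join();
    // the main thread's value was not affected by the other thread
    System.out.println("main thread user: " + get("user"));
  }
}
```

&lt;p&gt;Running it shows that the value set by the second thread never leaks into the main thread’s context.&lt;/p&gt;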

&lt;p&gt;Logback exposes the MDC through the SLF4J API. For example, let’s look at the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class LogbackMDC {
  private static final Logger LOGGER = LoggerFactory.getLogger(LogbackMDC.class);

  public static void main(String[] args) {
    MDC.put("user", "rafal.kuc@sematext.com");
    LOGGER.info("This is the first INFO level log message!");
    MDC.put("executionStep", "one");
    LOGGER.info("This is the second INFO level log message!");
    MDC.put("executionStep", "two");
    LOGGER.info("This is the third INFO level log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve used the &lt;strong&gt;org.slf4j.MDC&lt;/strong&gt; class and its put method to provide additional context information. In our case, there are two properties that we provide – user and executionStep.&lt;/p&gt;

&lt;p&gt;To be able to display the added context information we need to modify our pattern, for example like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} [%X{user}] [%X{executionStep}] %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="STDOUT" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I added two new elements: &lt;strong&gt;[%X{user}] [%X{executionStep}]&lt;/strong&gt;. By using &lt;strong&gt;%X{name_of_the_mdc_property}&lt;/strong&gt; we can easily include our additional context information.&lt;/p&gt;

&lt;p&gt;After executing our code with the above Logback configuration we would get the following output on the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14:14:10.687 [rafal.kuc@sematext.com] [] INFO  com.sematext.blog.logging.LogbackMDC - This is the first INFO level log message!
14:14:10.688 [rafal.kuc@sematext.com] [one] INFO  com.sematext.blog.logging.LogbackMDC - This is the second INFO level log message!
14:14:10.688 [rafal.kuc@sematext.com] [two] INFO  com.sematext.blog.logging.LogbackMDC - This is the third INFO level log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that when the context information is available it is written along with the log messages. It will be present until it is changed or cleared.&lt;/p&gt;

&lt;p&gt;Of course, this is just a simple example, and Mapped Diagnostic Contexts can be used in advanced scenarios such as distributed client-server architectures. If you are interested in learning more, have a look at the &lt;a href="http://logback.qos.ch/manual/mdc.html" rel="noopener noreferrer"&gt;Logback documentation dedicated to MDC&lt;/a&gt;. You can also use MDC as a discriminator value for the Sifting appender and route your logs based on that.&lt;/p&gt;
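
&lt;p&gt;Sketching that last idea: the SiftingAppender’s default discriminator is the MDC-based one (ch.qos.logback.classic.sift.MDCBasedDiscriminator), so routing logs by an MDC key could look as follows – the user key and file locations are purely illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;appender name="sift" class="ch.qos.logback.classic.sift.SiftingAppender"&amp;gt;
    &amp;lt;!-- without an explicit class, the MDCBasedDiscriminator is used --&amp;gt;
    &amp;lt;discriminator&amp;gt;
        &amp;lt;key&amp;gt;user&amp;lt;/key&amp;gt;
        &amp;lt;defaultValue&amp;gt;unknown&amp;lt;/defaultValue&amp;gt;
    &amp;lt;/discriminator&amp;gt;
    &amp;lt;sift&amp;gt;
        &amp;lt;!-- one file appender is created per distinct MDC "user" value --&amp;gt;
        &amp;lt;appender name="file-${user}" class="ch.qos.logback.core.FileAppender"&amp;gt;
            &amp;lt;file&amp;gt;/tmp/logback-${user}.log&amp;lt;/file&amp;gt;
            &amp;lt;encoder&amp;gt;
                &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
            &amp;lt;/encoder&amp;gt;
        &amp;lt;/appender&amp;gt;
    &amp;lt;/sift&amp;gt;
&amp;lt;/appender&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;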

&lt;h3&gt;
  
  
  Markers in Logback
&lt;/h3&gt;

&lt;p&gt;When we were discussing Log4J 2 we saw an example of using Markers for filtering. I also promised to get back to this topic and show you a different use case. This time we will route log messages to different files depending on whether a log message carries an important marker.&lt;/p&gt;

&lt;p&gt;First, we need to have some code for that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

public class LogbackMarker {
  private static Logger LOGGER = LoggerFactory.getLogger(LogbackMarker.class);
  private static final Marker IMPORTANT = MarkerFactory.getMarker("IMPORTANT");

  public static void main(String[] args) {
    LOGGER.info("This is a log message that is not important!");
    LOGGER.info(IMPORTANT, "This is a very important log message!");
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve already seen this example. The code is the same, but this time we use SLF4J as the API of our choice and Logback as the logging framework.&lt;/p&gt;

&lt;p&gt;Our logback.xml file will look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;appender name="console" class="ch.qos.logback.core.ConsoleAppender"&amp;gt;
        &amp;lt;encoder&amp;gt;
            &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
        &amp;lt;/encoder&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;appender name="markersift" class="ch.qos.logback.classic.sift.SiftingAppender"&amp;gt;
        &amp;lt;discriminator class="com.sematext.blog.logging.logback.MarkerDiscriminator"&amp;gt;
            &amp;lt;key&amp;gt;importance&amp;lt;/key&amp;gt;
            &amp;lt;defaultValue&amp;gt;not_important&amp;lt;/defaultValue&amp;gt;
        &amp;lt;/discriminator&amp;gt;
        &amp;lt;sift&amp;gt;
            &amp;lt;appender name="file-${importance}" class="ch.qos.logback.core.FileAppender"&amp;gt;
                &amp;lt;file&amp;gt;/tmp/logback-${importance}.log&amp;lt;/file&amp;gt;
                &amp;lt;append&amp;gt;false&amp;lt;/append&amp;gt;
                &amp;lt;immediateFlush&amp;gt;true&amp;lt;/immediateFlush&amp;gt;
                &amp;lt;encoder&amp;gt;
                    &amp;lt;pattern&amp;gt;%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n&amp;lt;/pattern&amp;gt;
                &amp;lt;/encoder&amp;gt;
            &amp;lt;/appender&amp;gt;
        &amp;lt;/sift&amp;gt;
    &amp;lt;/appender&amp;gt;
    &amp;lt;logger name="com.sematext.blog"&amp;gt;
        &amp;lt;appender-ref ref="console" /&amp;gt;
    &amp;lt;/logger&amp;gt;
    &amp;lt;root level="info"&amp;gt;
        &amp;lt;appender-ref ref="markersift" /&amp;gt;
    &amp;lt;/root&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the configuration is a bit more complicated. We’ve used the SiftingAppender to be able to create two files, depending on the value returned by our MarkerDiscriminator. Our discriminator implementation is simple and returns the name of the marker if the IMPORTANT marker is present in the log message and the defaultValue if it is not present. The code looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package com.sematext.blog.logging.logback;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.sift.AbstractDiscriminator;
import org.slf4j.Marker;

public class MarkerDiscriminator extends AbstractDiscriminator&amp;lt;ILoggingEvent&amp;gt; {
  private String key;
  private String defaultValue;

  @Override
  public String getDiscriminatingValue(ILoggingEvent iLoggingEvent) {
    Marker marker = iLoggingEvent.getMarker();
    if (marker != null &amp;amp;&amp;amp; marker.contains("IMPORTANT")) {
      return marker.getName();
    }
    return defaultValue;
  }

  public String getKey() { return key; }
  public void setKey(String key) { this.key = key; }
  public String getDefaultValue() { return defaultValue; }
  public void setDefaultValue(String defaultValue) { this.defaultValue = defaultValue; }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s get back to our Logback configuration though. The appender nested inside the sift element references the ${importance} variable, which matches the key defined in our discriminator. That means that a separate file appender will be created for each distinct value returned by the discriminator. In our case, there will be two files: &lt;strong&gt;/tmp/logback-IMPORTANT.log&lt;/strong&gt; and &lt;strong&gt;/tmp/logback-not_important.log&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After the code execution the &lt;strong&gt;/tmp/logback-IMPORTANT.log&lt;/strong&gt; file will contain the following log message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11:08:47.543 [main] INFO  c.s.blog.logging.LogbackMarker - This is a very important log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While the second file, the /tmp/logback-not_important.log will contain the following log message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11:08:47.537 [main] INFO  c.s.blog.logging.LogbackMarker - This is a log message that is not important!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output on the console will be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11:08:47.537 INFO  c.s.blog.logging.LogbackMarker - This is a log message that is not important!
11:08:47.543 INFO  c.s.blog.logging.LogbackMarker - This is a very important log message!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see in this simple example, everything works as we wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Java Logging &amp;amp; Monitoring with Log Management Solutions
&lt;/h2&gt;

&lt;p&gt;Now you know the basics of how to turn on &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;logging&lt;/a&gt; in your Java application, but as applications grow in complexity, the volume of their logs grows too. You may get away with logging to a file and only using the logs when troubleshooting is needed, but working with huge amounts of data quickly becomes unmanageable, and you may end up using a &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management solution&lt;/a&gt; to centralize and monitor your logs. You can either go for an in-house solution based on open-source software or use one of the products available on the market, like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; or &lt;a href="https://sematext.com/enterprise/" rel="noopener noreferrer"&gt;Sematext Enterprise&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxuwx196y0tmtbg1cpsuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxuwx196y0tmtbg1cpsuy.png" alt="Sematext Cloud Logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A fully managed log centralization solution will give you the freedom of not having to manage yet another, usually quite complex, part of your infrastructure. It will allow you to manage a plethora of sources for your logs. You may want to include logs like JVM &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;garbage collection logs&lt;/a&gt; in your managed log solution. After &lt;a href="https://sematext.com/blog/java-garbage-collection/" rel="noopener noreferrer"&gt;turning them on&lt;/a&gt; for your applications and systems running on the JVM, you will want to have them in a single place for correlation and &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analysis&lt;/a&gt;, and to help you &lt;a href="https://sematext.com/blog/java-garbage-collection-tuning/" rel="noopener noreferrer"&gt;tune the garbage collection&lt;/a&gt; in your JVM instances. You can alert on logs, &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;aggregate&lt;/a&gt; the data, save and re-run queries, and hook up your favorite incident management software. Correlating &lt;a href="https://sematext.com/logsene/" rel="noopener noreferrer"&gt;log&lt;/a&gt; data with &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; coming from &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;JVM applications&lt;/a&gt;, the system and &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;infrastructure&lt;/a&gt;, &lt;a href="https://sematext.com/experience/" rel="noopener noreferrer"&gt;real users&lt;/a&gt; and &lt;a href="https://sematext.com/synthetic-monitoring/" rel="noopener noreferrer"&gt;API endpoints&lt;/a&gt; is something that platforms like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; are capable of. And of course, remember that application logs are not everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Logging is invaluable when troubleshooting Java applications. In fact, logging is invaluable in troubleshooting in general, no matter whether it’s a Java application or a hardware switch or firewall. Both software and hardware give us insight into how they are working in the form of parametrized logs enriched with contextual information.&lt;/p&gt;

&lt;p&gt;In this article, we started with the basics of logging in your Java applications. We’ve learned what the options are when it comes to Java logging, how to add logging to your application, and how to configure it. We’ve also seen some of the advanced features of logging frameworks such as Log4j and Logback. We learned about SLF4J and how it helps future-proof the application against changes in the logging framework.&lt;/p&gt;

&lt;p&gt;I hope this article gave you an idea on how to deal with Java application logs and why you should start working with them right away if you haven’t already. Good luck!&lt;/p&gt;

</description>
      <category>logging</category>
      <category>java</category>
      <category>logs</category>
      <category>log</category>
    </item>
    <item>
      <title>15+ Best Cloud Monitoring Tools of 2020: Pros &amp; Cons Comparison</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Thu, 06 Aug 2020 10:28:01 +0000</pubDate>
      <link>https://dev.to/sematext/15-best-cloud-monitoring-tools-of-2020-pros-cons-comparison-35k</link>
      <guid>https://dev.to/sematext/15-best-cloud-monitoring-tools-of-2020-pros-cons-comparison-35k</guid>
      <description>&lt;p&gt;When providing services to your customers, you need to keep an eye on everything that could impact your success – from low-level performance metrics to high-level business key performance indicators, and from server-side logs to stack traces that give you full visibility into the business and software processes that underpin your product. That’s where cloud monitoring tools and services come into play. They help you achieve full readiness of your infrastructure and applications, and make sure that your users and customers can use your platform to its full potential.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is Cloud Monitoring?
&lt;/h1&gt;

&lt;p&gt;Cloud monitoring is the process of gaining observability into your cloud-based infrastructure, services, applications, and user experience. It allows you to observe the environment, review and predict the performance and availability of the whole infrastructure, or drill into each piece of it on its own. Cloud monitoring works by collecting observability data, such as metrics, logs, and traces, from your whole IT infrastructure, analyzing it, and presenting it in a format understood by humans, like charts, graphs, and alerts, as well as by machines via APIs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Best Cloud Monitoring Tools
&lt;/h1&gt;

&lt;p&gt;There are many types of tools that can help you gain full observability into your infrastructure, services, applications, website performance, and health. Some help you with just one aspect of monitoring, while others give you full visibility into all of the key performance indicators, metrics, logs, traces, etc. Some you can set up easily and without talking to sales, while others involve a more traditional trial and sales process. Each solution has its pros and cons – sometimes the flexibility of a solution comes at the cost of a more complicated setup, while easy setup and use come with a limited set of features. As users, we need to choose the solution that’s the best fit for our needs and budget. In this post, we are going to explore the cloud monitoring tools that you should be aware of and that will let you know if your business and its IT operations are healthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Sematext Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1AY0POEO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lofaexo2b1o4f5bhrrho.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1AY0POEO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lofaexo2b1o4f5bhrrho.jpg" alt="Sematext Cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/cloud/"&gt;Sematext Cloud&lt;/a&gt; and its on-premise version – &lt;a href="https://sematext.com/enterprise/"&gt;Sematext Enterprise&lt;/a&gt; – is a full observability solution that is easy to set up and that gives you in-depth visibility into your &lt;a href="https://sematext.com/spm/"&gt;IT infrastructure&lt;/a&gt;. Dashboards with key application and infrastructure (e.g., &lt;a href="https://sematext.com/database-monitoring"&gt;common databases&lt;/a&gt; and NoSQL stores, &lt;a href="https://sematext.com/server-monitoring/"&gt;servers&lt;/a&gt;, containers, etc.) come out of the box and can be customized. There is powerful &lt;a href="https://sematext.com/alerts/"&gt;alerting&lt;/a&gt; with anomaly detection and scheduling. Sematext Cloud is the solution that gives you both reactive and predictive monitoring with easy analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Auto-discovery of services enables hands-off auto-monitoring.&lt;/li&gt;
&lt;li&gt;Full-blown &lt;a href="https://sematext.com/logsene/"&gt;log management&lt;/a&gt; solution with filtering, full-text search, alerting, scheduled reporting, AWS S3, IBM Cloud, and Minio archiving integrations, Elasticsearch-compatible API and Syslog support.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sematext.com/experience/"&gt;Real user&lt;/a&gt; and &lt;a href="https://sematext.com/synthetic-monitoring/"&gt;synthetic monitoring&lt;/a&gt; for full visibility of how your users experience your frontend and how fast and healthy your APIs are.&lt;/li&gt;
&lt;li&gt;Comprehensive support for microservices and containerized environments – support for &lt;a href="https://sematext.com/kubernetes/"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://sematext.com/docker/"&gt;Docker&lt;/a&gt;, and Docker Swarm with the ability to observe applications running in them, too; collection of their metrics, logs, and events.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sematext.com/network-monitoring/"&gt;Network&lt;/a&gt;, &lt;a href="https://sematext.com/database-monitoring/"&gt;database&lt;/a&gt;, &lt;a href="https://sematext.com/process-monitoring/"&gt;processes&lt;/a&gt;, and &lt;a href="https://sematext.com/inventory-monitoring/"&gt;inventory monitoring&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Alerting with anomaly detection and support for external notification services like PagerDuty, OpsGenie, VictorOps, WebHooks, etc.&lt;/li&gt;
&lt;li&gt;Powerful dashboarding capabilities for graphing virtually any data shipped to Sematext.&lt;/li&gt;
&lt;li&gt;Scheduled reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lots of out of the box &lt;a href="https://sematext.com/integrations"&gt;integrations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Lightweight, open-source, and pluggable agents. Quick setup.&lt;/li&gt;
&lt;li&gt;Powerful Machine Learning-based alerting and notifications system to quickly inform you about issues and potential problems with your environment.&lt;/li&gt;
&lt;li&gt;Elasticsearch and InfluxDB APIs allow for the integration of any tools that work with those, like &lt;a href="https://sematext.com/blog/getting-started-with-logstash/"&gt;Logstash&lt;/a&gt;, Filebeat, Fluentd, Logagent, Vector, etc.&lt;/li&gt;
&lt;li&gt;Easy correlation of performance metrics, logs, and various events.&lt;/li&gt;
&lt;li&gt;Collection of IT inventory – installed packages and their versions, detailed server info, container image inventory, etc.&lt;/li&gt;
&lt;li&gt;Straightforward pricing with free plans available and a generous 30-day trial.&lt;/li&gt;
&lt;/ul&gt;
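
&lt;p&gt;Because of the Elasticsearch-compatible API mentioned above, any Elasticsearch-speaking tool can ship logs. As a minimal sketch (the index name below is a placeholder for your own Logs App token), building a bulk-format payload looks like this:&lt;/p&gt;

```python
import json

def build_bulk_payload(events, index):
    # Elasticsearch bulk format: an action line followed by the document,
    # one JSON object per line, newline-terminated.
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"

payload = build_bulk_payload(
    [{"message": "login failed", "severity": "warn"}],
    "logs-app-token",  # placeholder for the token your Logs App is addressed by
)
# POST this payload to the service's /_bulk endpoint with any HTTP client
# or log shipper that speaks the Elasticsearch API.
```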

&lt;h3&gt;
  
  
  Cons
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited support for transaction tracing.&lt;/li&gt;
&lt;li&gt;Lack of full-featured profiler.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://sematext.com/pricing/"&gt;pricing&lt;/a&gt; for each solution is straight forward. Each solution lets you choose a plan. As a matter of fact, pricing is super flexible for the cost-conscious — you have the flexibility of picking a different plan for each of your &lt;a href="https://sematext.com/docs/guide/app-guide/"&gt;Apps&lt;/a&gt;. For Logs there is a per-GB volume discount as your log volume or data retention goes up. Performance monitoring is metered by the hour, which makes it suitable for dynamic environments that scale up and down. Real user monitoring allows downsampling that can minimize your cost without sacrificing the value. Synthetic monitoring has a cheap pay-as-you-go option.&lt;/p&gt;
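
&lt;p&gt;To see why hourly metering suits elastic environments, here is a back-of-the-envelope calculation; the rate below is invented for illustration and is not Sematext’s actual price list:&lt;/p&gt;

```python
# Hypothetical illustration of per-hour metering; the rate below is made up
# and is not Sematext's actual price list.
RATE_PER_HOST_HOUR = 0.005  # assumed USD per monitored host per hour

def monthly_cost(host_hours):
    return sum(host_hours) * RATE_PER_HOST_HOUR

# An autoscaled service running 10 hosts during the day and 4 at night
# pays only for the host-hours actually used:
one_day = [10] * 12 + [4] * 12   # host count for each hour of the day
cost = monthly_cost(one_day * 30)
```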

&lt;h2&gt;
  
  
2. AppDynamics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tue_WWfD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuzoxhihpnschtd9fqjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tue_WWfD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuzoxhihpnschtd9fqjw.jpg" alt="AppDynamics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both software-as-a-service and on-premise models, &lt;a href="https://www.appdynamics.com/"&gt;AppDynamics&lt;/a&gt; is focused on large enterprises, providing the ability to connect application performance metrics with infrastructure data, alerting, and &lt;a href="https://www.appdynamics.com/product/business-iq"&gt;business-level metrics&lt;/a&gt;. A combination of these allows you to monitor the whole stack that runs your services and gives you insight into your environment – from the top-level transactions understood by business executives down to the code-level information useful for DevOps engineers and developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;End-user monitoring with mobile and browser real user, synthetic, and internet of things monitoring.&lt;/li&gt;
&lt;li&gt;Infrastructure monitoring with network components, databases, and servers visibility providing information about status, utilization, and flow between each element.&lt;/li&gt;
&lt;li&gt;Business-focused dashboards and features provide visualizations and analysis of the connections between performance and &lt;a href="https://www.appdynamics.com/product/business-iq"&gt;business-oriented metrics&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Machine Learning supported anomaly detection and root cause analysis features.&lt;/li&gt;
&lt;li&gt;Alerting with email templating and period digest capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Very detailed information about the environment, including, for example, JVM application startup parameters, JVM version, etc.&lt;/li&gt;
&lt;li&gt;Provides advanced features for various languages – for example, automatic leak detection and object instance tracking for the JVM-based stack.&lt;/li&gt;
&lt;li&gt;Visibility into connections between the system components, environment elements, endpoint response times, and business transactions.&lt;/li&gt;
&lt;li&gt;Visibility into server and application metrics with up to code-level visibility and automated diagnostics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pricing: very expensive, complex, and non-transparent. Focused on more traditional high-touch sales model and selling to large enterprises.&lt;/li&gt;
&lt;li&gt;Installation of the agent requires manual downloading and starting of the agent – no one-line installation and setup command.&lt;/li&gt;
&lt;li&gt;Some of the basic metrics like system CPU, memory, and network utilization are not available in the lowest paid plan tier.&lt;/li&gt;
&lt;li&gt;Slicing and dicing through the data is not as easy compared to some of the other tools mentioned in this summary that support rich dashboarding capabilities like Sematext, Datadog, or New Relic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Agent- and feature-based &lt;a href="https://www.appdynamics.com/pricing"&gt;pricing&lt;/a&gt; is used, which makes the pricing non-transparent. How much you will pay for the solution depends on the language your applications are written in and which functionalities you need and want to use from the platform. For example, visibility into the CPU, memory, and disk metrics requires the APM Advanced plan.&lt;/p&gt;

&lt;h2&gt;
  
  
3. Datadog
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--reHeRBT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5paap5n0dy0v1j186vg2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--reHeRBT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5paap5n0dy0v1j186vg2.jpg" alt="Datadog"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Datadog is a full observability solution providing an extended set of features to monitor your infrastructure, applications, containers, network, logs, and even serverless functions such as AWS Lambdas. That flexibility and functionality come at a price, though – the configuration-based agent installation may be time-consuming to set up (e.g., process monitoring requires editing the agent config and restarting the agent), and quite some time may pass before you see all the metrics, logs, and traces in one place for the full visibility into your application stack that you are after.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Application performance monitoring with a large number of integrations available and distributed tracing support.&lt;/li&gt;
&lt;li&gt;Logs centralization and analysis.&lt;/li&gt;
&lt;li&gt;Real user and synthetics monitoring.&lt;/li&gt;
&lt;li&gt;Network and host monitoring.&lt;/li&gt;
&lt;li&gt;The dashboard framework allows building virtually anything out of the provided metrics and logs, and sharing it.&lt;/li&gt;
&lt;li&gt;Alerting with machine learning capabilities.&lt;/li&gt;
&lt;li&gt;Collaboration tools for team-based discussions.&lt;/li&gt;
&lt;li&gt;API for working with data, tags, and dashboards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Full observability solution – metric, logs, security, real user, and synthetics all in one.&lt;/li&gt;
&lt;li&gt;Infrastructure monitoring including hosts, containers, processes, networks, and serverless capabilities.&lt;/li&gt;
&lt;li&gt;Rich logs integration including applications, containers, cloud providers, clients, and common log shippers.&lt;/li&gt;
&lt;li&gt;Powerful and very flexible data analysis features with alerts and custom dashboards.&lt;/li&gt;
&lt;li&gt;Provides API allowing interaction with the data.&lt;/li&gt;
&lt;/ul&gt;
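
&lt;p&gt;As a sketch of working with that API, the snippet below builds a metrics query against Datadog’s v1 query endpoint; the metric name, host tag, and credentials are placeholders for your own:&lt;/p&gt;

```python
import time
from urllib.parse import urlencode

# A minimal sketch of a Datadog v1 metrics query; metric, tag, and
# credentials are placeholders for your own values.
now = int(time.time())
params = urlencode({
    "from": now - 3600,   # last hour
    "to": now,
    "query": "avg:system.cpu.user{host:web-01}",
})
url = "https://api.datadoghq.com/api/v1/query?" + params
headers = {
    "DD-API-KEY": "YOUR_API_KEY",
    "DD-APPLICATION-KEY": "YOUR_APP_KEY",
}
# A GET on url with these headers returns a JSON "series" you can graph
# or alert on.
```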

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Overwhelming for newcomers with all the installation steps needed for anything beyond basic metrics.&lt;/li&gt;
&lt;li&gt;Not a lot of pre-built dashboards compared to others. New users have to invest quite a bit of time to understand metrics and build dashboards before being able to make full use of the solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Feature-, host-, and volume-based &lt;a href="https://www.datadoghq.com/pricing/"&gt;pricing&lt;/a&gt; combined together – each part of the solution is priced separately and can be billed annually or on-demand. The on-demand billing makes the solution about 17–20% more expensive than the annual pricing at the time of this writing. Pay close attention to your bill – we’ve seen a number of reports where people were surprised by bill items or amounts.&lt;/p&gt;
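
&lt;p&gt;Worked through with an assumed per-host rate (illustrative, not Datadog’s official price list), that on-demand premium adds up quickly:&lt;/p&gt;

```python
# Working through the on-demand premium mentioned above with an assumed
# (illustrative, not official) annual rate per host.
annual_per_host = 15.00                   # assumed USD per host per month
on_demand_high = annual_per_host * 1.20   # the ~20% on-demand premium
# Yearly difference for a 50-host fleet at the high end of the range:
extra_per_year = (on_demand_high - annual_per_host) * 50 * 12
```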

&lt;h2&gt;
  
  
4. New Relic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gEDA8GbT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rhlypzncix9a52ps9vb8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gEDA8GbT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rhlypzncix9a52ps9vb8.jpg" alt="New Relic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New Relic is a full-stack observability solution available in the software-as-a-service model. Its monitoring capabilities include application performance monitoring with rich dashboarding support, distributed tracing, and logs, along with real user and synthetic monitoring for top-to-bottom visibility. Even though the agents require manual steps to download and install, they are robust and reliable, and their support for a wide range of common programming languages is a big advantage of New Relic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://newrelic.com/products/application-monitoring%20rel="&gt;Application Performance Monitoring&lt;/a&gt; with dashboarding and support for commonly used languages including C++.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newrelic.com/products/logs%20rel="&gt;Log centralization&lt;/a&gt; and analysis.&lt;/li&gt;
&lt;li&gt;Integrated alerting with anomaly detection.&lt;/li&gt;
&lt;li&gt;Rich and powerful query language – NRQL.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newrelic.com/products/browser-monitoring"&gt;Real user&lt;/a&gt; and &lt;a href="https://newrelic.com/products/synthetics"&gt;synthetics&lt;/a&gt; monitoring.&lt;/li&gt;
&lt;li&gt;Distributed tracing allowing you to understand what is happening from top to bottom.&lt;/li&gt;
&lt;li&gt;Integration with major cloud providers such as AWS, Azure, and Google Cloud Platform.&lt;/li&gt;
&lt;li&gt;Business level metrics support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Visibility into the whole system, not only when using physical servers or virtual machines, but also when dealing with containers and microservices.&lt;/li&gt;
&lt;li&gt;Ability to connect business-level metrics with performance metrics and correlate them.&lt;/li&gt;
&lt;li&gt;Error analytics tool for quick and efficient analysis of issues like site errors or downtime.&lt;/li&gt;
&lt;li&gt;Rich visualization support allowing you to graph metrics, logs, and NRQL query results.&lt;/li&gt;
&lt;li&gt;Ability to define correlations between alerts and custom logic to reduce alert noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The platform itself doesn’t provide agent management functionality, which leads to additional work related to installation and configuration, especially on a larger scale.&lt;/li&gt;
&lt;li&gt;Inconsistent UI: some parts of the product use the legacy interface, while others are already a part of New Relic One.&lt;/li&gt;
&lt;li&gt;The log management part of the solution is still young.&lt;/li&gt;
&lt;li&gt;Lack of a single pricing page for all features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Annual and monthly compute-unit or host-based pricing that depends on the features (for example: &lt;a href="https://newrelic.com/products/application-monitoring/pricing"&gt;APM pricing&lt;/a&gt;, &lt;a href="https://newrelic.com/products/infrastructure/pricing"&gt;infrastructure pricing&lt;/a&gt;, &lt;a href="https://newrelic.com/products/synthetics/pricing"&gt;synthetics pricing&lt;/a&gt;). For small services, compute units may be the best option; they are calculated as the total number of CPUs combined with the amount of RAM your system has, multiplied by the number of running hours. For example, the infrastructure part of New Relic uses only compute-unit pricing, while APM can be charged on either host-based or compute-unit-based pricing. This may be confusing and requires additional calculations if you want to control your costs.&lt;/p&gt;
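
&lt;p&gt;The compute-unit math described above can be sketched as follows; treat the formula as this article’s description, not New Relic’s authoritative pricing rule:&lt;/p&gt;

```python
# Compute units as described in this article: (CPUs + GB of RAM)
# multiplied by running hours. Check New Relic's pricing pages for the
# authoritative formula.
def compute_units(cpus, ram_gb, hours):
    return (cpus + ram_gb) * hours

# A 2-CPU, 4GB host running for a full 30-day month:
cu = compute_units(2, 4, 24 * 30)
```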

&lt;h2&gt;
  
  
  5. Dynatrace
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y9Hd4RbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nt8moghw5ku766ubpnsp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y9Hd4RbU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nt8moghw5ku766ubpnsp.jpg" alt="Dynatrace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dynatrace is a full-stack observability solution that introduces a user-friendly approach to monitoring your applications, infrastructure, and logs. It uses a single running agent that, once installed, can be controlled via the Dynatrace UI, making monitoring easy and pleasant to work with. Available in both software-as-a-service and on-premise models, it will fulfill most of your monitoring needs when it comes to application performance monitoring, real user monitoring, logs, and infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dynatrace.com/platform/application-performance-management/"&gt;Application performance monitoring&lt;/a&gt; with dashboarding and rich integrations for commonly used tools and code-level tracing.&lt;/li&gt;
&lt;li&gt;First-class &lt;a href="https://www.dynatrace.com/platform/log-monitoring/"&gt;Log analysis&lt;/a&gt; support with automatic detection of the common system and application log types.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dynatrace.com/platform/real-user-monitoring/"&gt;Real user&lt;/a&gt; and &lt;a href="https://www.dynatrace.com/platform/synthetic-monitoring/"&gt;synthetic&lt;/a&gt; monitoring.&lt;/li&gt;
&lt;li&gt;Diagnostic tools that allow taking memory dumps, analyzing exceptions and CPU usage, and inspecting top database and web requests.&lt;/li&gt;
&lt;li&gt;Docker, Kubernetes, and OpenShift integrations.&lt;/li&gt;
&lt;li&gt;Support for common cloud providers like Amazon Web Services, Microsoft Azure, and Google Cloud Platform.&lt;/li&gt;
&lt;li&gt;A virtual assistant can make your life easier when dealing with common questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple and intuitive agent installation, with UI guidance and demo data that help new users get to know the product faster.&lt;/li&gt;
&lt;li&gt;Ease of integration to gain visibility into the logs of your systems and applications – almost everything is doable from the UI.&lt;/li&gt;
&lt;li&gt;Easy to navigate and powerful top to bottom view of the whole stack – from the mobile/web application through the middle tier up to the database level.&lt;/li&gt;
&lt;li&gt;Dedicated problem-solving functionalities to help in quick and efficient problem finding.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lots of options can be overwhelming to start with, but the solution tries to do its best to help new users.&lt;/li&gt;
&lt;li&gt;Business metrics analysis is still limited compared to AppDynamics and Datadog, for example.&lt;/li&gt;
&lt;li&gt;Serverless offering is limited when compared to other solutions on the market, like Datadog, New Relic, and AppDynamics.&lt;/li&gt;
&lt;li&gt;Pricing information is only available once you sign up.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Pricing is organized around features. The application performance monitoring pricing is tied to hosts and the amount of memory available on a host: each 16GB is a host unit, and the price is calculated based on the number of host units used per hour. The real user monitoring price is calculated based on the number of sessions, while the synthetic monitoring pricing is based on the number of actions. Finally, the logs part of the solution is priced based on volume, similar to other vendors covered in this article.&lt;/p&gt;
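
&lt;p&gt;The host-unit math works out as sketched below; the rounding behavior for partial units is an assumption here, so verify it against Dynatrace’s own pricing documentation:&lt;/p&gt;

```python
import math

# Host units as described above: one unit per 16GB of host memory, billed
# per host-unit hour. The ceiling-based rounding is an assumption.
def host_units(ram_gb):
    return max(1, math.ceil(ram_gb / 16))

units = host_units(64)         # a 64GB server equals 4 host units
unit_hours = units * 24 * 30   # billed host-unit hours for a 30-day month
```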

&lt;h2&gt;
  
  
  6. Sumo Logic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FBNJb6YF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n92fs3cjcdyvzzrtz9lz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FBNJb6YF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n92fs3cjcdyvzzrtz9lz.jpg" alt="Sumo Logic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sumo Logic is an observability solution with a strong focus on working with logs, and it does that very well. With tools like LogReduce and LogCompare you can not only view the logs from a given time period, but also reduce the volume of data you need to analyze, or compare time periods to find interesting discrepancies and anomalies. Combined with metrics and security features, it is a great tool that will fulfill the observability needs of your environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Log analysis with the &lt;a href="https://www.sumologic.com/blog/what-is-logreduce/"&gt;LogReduce&lt;/a&gt; algorithm allows clustering of similar messages and &lt;a href="https://help.sumologic.com/05Search/LogCompare"&gt;LogCompare&lt;/a&gt; lets you compare data from two time periods.&lt;/li&gt;
&lt;li&gt;Field extraction enables rule-based data extraction from unstructured data.&lt;/li&gt;
&lt;li&gt;Application performance monitoring with real-time alerting and dashboarding.&lt;/li&gt;
&lt;li&gt;Scheduled views for running your queries periodically.&lt;/li&gt;
&lt;li&gt;Cloud security features for common cloud providers and SaaS solutions with PCI compliance and integrated threat intelligence.&lt;/li&gt;
&lt;/ul&gt;
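
&lt;p&gt;To get an intuition for what LogReduce-style clustering does, here is a deliberately toy sketch; Sumo Logic’s actual algorithm is far more sophisticated:&lt;/p&gt;

```python
import re
from collections import Counter

# A toy sketch of the idea behind LogReduce: collapse log lines into
# signatures by masking the variable parts, then count each cluster.
def signature(line):
    return re.sub(r"\d+", "#", line)   # mask numbers (ids, IPs, sizes)

logs = [
    "user 1041 logged in from 10.0.0.12",
    "user 2203 logged in from 10.0.0.47",
    "disk sda1 is 91% full",
]
clusters = Counter(signature(line) for line in logs)
# The two login lines collapse into one signature; the disk line stays separate.
```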

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;User-friendly interface that doesn’t overwhelm novice users and is still usable for experienced ones.&lt;/li&gt;
&lt;li&gt;Ability to reduce the number of similar logs at read-time and compare time periods, which helps you spot differences and anomalies and track down problems quickly.&lt;/li&gt;
&lt;li&gt;The ability to extract fields from unstructured data lets you drop the processing component from your local pipeline and move it to the vendor side.&lt;/li&gt;
&lt;li&gt;Limited free tier available that may be enough for very small companies.&lt;/li&gt;
&lt;/ul&gt;
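
&lt;p&gt;Field extraction of the kind mentioned above is essentially rule-based parsing. Sketched with a plain regular expression over a web-server access line (in Sumo Logic you would define a Field Extraction Rule with similar logic):&lt;/p&gt;

```python
import re

# Rule-based field extraction from an unstructured access-log line,
# sketched with a plain regex; groups are ip, user, status, bytes.
LINE = '127.0.0.1 - frank [10/Oct/2000:13:55:36] "GET /index.html" 200 2326'
RULE = re.compile(r'(\S+) \S+ (\S+) .*" (\d+) (\d+)')
ip, user, status, size = RULE.search(LINE).groups()
```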

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pricing may be confusing and may be hard to pre-calculate when using Cloud Flex credits and larger environments.&lt;/li&gt;
&lt;li&gt;A limited number of out of the box charts compared to the competition.&lt;/li&gt;
&lt;li&gt;The primary focus on logs puts it at a disadvantage if you are looking for a full-stack observability solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Credit and feature-based &lt;a href="https://www.sumologic.com/pricing/us/"&gt;pricing&lt;/a&gt; with a limited free tier is available. A credit is a unit of utilization for ingested data – logs and metrics. The features you need dictate the price of each credit unit – the more features of the platform you use, the more expensive the credit will be. Keep in mind that the price also depends on the deployment location you choose. For example, at the time of this writing, the Ireland location was more expensive than North America.&lt;/p&gt;
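
&lt;p&gt;The credit model can be sketched like this; every number below is a made-up placeholder for illustration, not Sumo Logic’s price list:&lt;/p&gt;

```python
# Credit-based pricing sketch; all numbers are made-up placeholders.
# Credits meter ingested data, and the feature tier you pick sets the
# effective price of each credit.
TIER_MULTIPLIER = {"essentials": 1.0, "enterprise": 1.6}   # assumed

def monthly_bill(gb_ingested, credit_price, tier):
    credits = gb_ingested   # assume 1 credit per ingested GB for illustration
    return credits * credit_price * TIER_MULTIPLIER[tier]

bill = monthly_bill(300, 0.5, "enterprise")
```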

&lt;h2&gt;
  
  
  7. CA Unified Infrastructure Monitoring (UIM)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mqYwcxAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eus3hs0h84xfegv1j0hi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mqYwcxAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eus3hs0h84xfegv1j0hi.jpg" alt="CA Unified Infrastructure Monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both SaaS and on-premise models and targeted at enterprise customers, DX Infrastructure Manager (formerly called CA Unified Infrastructure Monitoring) is a unified tool that gives you observability into your hybrid cloud, services, applications, and infrastructure elements like switches, routers, and storage devices. With actionable log analytics, out of the box dashboards, and alerting backed by anomaly detection algorithms, the solution gives you both retrospective and proactive views of your IT environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring with various integrations supporting common infrastructure providers and services, including packaged applications such as Office 365 and tools like Salesforce Service Cloud.&lt;/li&gt;
&lt;li&gt;Log analytics with actionable, out of the box dashboards and rich visualization support.&lt;/li&gt;
&lt;li&gt;Alerting with anomaly detection and dynamic thresholds.&lt;/li&gt;
&lt;li&gt;Reporting with business-level metrics support and scheduling capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy deployment and configuration with configurable automatic service discovery.&lt;/li&gt;
&lt;li&gt;Template support, which allows you to build templates per environment, device type, and more.&lt;/li&gt;
&lt;li&gt;Advanced correlations for hybrid infrastructures.&lt;/li&gt;
&lt;li&gt;In-depth monitoring of the whole infrastructure with the help of various integrations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Non-transparent pricing — the pricing is not available on the website.&lt;/li&gt;
&lt;li&gt;A limited number of alert notification destinations compared to other competitors.&lt;/li&gt;
&lt;li&gt;May be considered complicated for novice users.&lt;/li&gt;
&lt;li&gt;Targeted for enterprise customers.&lt;/li&gt;
&lt;li&gt;Dated UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;At the time of this writing the pricing was not publicly available on the vendor’s site.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Site 24×7
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZjA5Iitu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/reunjkx1ioa7v6pejbms.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZjA5Iitu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/reunjkx1ioa7v6pejbms.jpg" alt="Site 24x7"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Site 24×7 is an observability solution providing everything needed to get full visibility into your website’s health, application performance, infrastructure, and network gear, covering both metrics and logs. Set up alerts based on advanced rules to limit alert fatigue and get insights from your mobile applications. Monitor servers and over 50 common technologies running inside your environment, including widely used ones like Apache and MySQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/web-site-performance.html"&gt;Website monitoring&lt;/a&gt; with the support for monitoring HTTP services, DNS and FTP servers, SMTP and POP servers, URLs, and REST APIs available both publicly and in private networks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/server-monitoring.html"&gt;Server monitoring&lt;/a&gt; with support for Microsoft Windows and Linux and over 50 common technologies plugins, like MySQL or Apache.&lt;/li&gt;
&lt;li&gt;Full featured &lt;a href="https://www.site24x7.com/network-monitoring.html"&gt;network monitoring&lt;/a&gt; with routers, switches, firewalls, load balancers, UPS, and storage support.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/application-performance-monitoring.html"&gt;Application performance&lt;/a&gt; monitoring and log management with support for server, desktop, and mobile applications and alerting capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.site24x7.com/cloud-monitoring.html"&gt;Cloud monitoring&lt;/a&gt; with support for hybrid cloud infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Quick and easy agent installation.&lt;/li&gt;
&lt;li&gt;Monitoring for various technologies with alerting support based on complex rules.&lt;/li&gt;
&lt;li&gt;Full observability with visibility from your website performance and health up to network-level devices like switches and routers.&lt;/li&gt;
&lt;li&gt;Custom dashboarding support lets you build your own views into servers, applications, websites, and cloud environments.&lt;/li&gt;
&lt;li&gt;Pluggable server monitoring allows you to write your own plugins where needed.&lt;/li&gt;
&lt;li&gt;Free, limited uptime and server monitoring which might be enough for personal needs or small companies.&lt;/li&gt;
&lt;/ul&gt;
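
&lt;p&gt;A custom server-monitoring plugin of the pluggable kind mentioned above is, at its core, a script that emits metrics in a structured format. The sketch below is illustrative only; the key names and JSON schema are assumptions, so consult the vendor’s plugin documentation for the real contract:&lt;/p&gt;

```python
import json

# A sketch of a custom server-monitoring plugin emitting metrics as JSON.
# The keys below are hypothetical, not Site 24x7's documented schema.
def collect_metrics():
    return {
        "queue_depth": 42,    # hypothetical application metrics
        "workers_busy": 7,
    }

output = {"version": 1, "data": collect_metrics()}
print(json.dumps(output))
```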

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The number of features can be overwhelming for novice users.&lt;/li&gt;
&lt;li&gt;Setup can be time-consuming in larger environments because of the lack of auto-discovery.&lt;/li&gt;
&lt;li&gt;A limited number of technologies when it comes to server monitoring.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.site24x7.com/site24x7-pricing.html"&gt;pricing&lt;/a&gt; depends on the parts of the product that you will use with the free uptime monitoring for a small number of websites and servers available. The infrastructure monitoring starts with the 9 euro per month when billed annually for up to 10 servers, 500MB of logs, and 100K page views for a single site. You can buy additional add-ons for a monthly fee. You can also go for pure website monitoring or application performance monitoring or so-called “All-in-one” plan, which covers all the features of the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Zabbix
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GCYGFgmM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p5070hv3h1o5e17nf053.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GCYGFgmM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p5070hv3h1o5e17nf053.jpg" alt="Zabbix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Zabbix is an open-source monitoring tool capable of real-time monitoring of both large-scale enterprises and small companies. If you are looking for a free, well-supported solution with a large community, you should look at Zabbix. Its multi-system, small-footprint agents allow you to gather key performance indicators across your environment and use them as a source for your dashboards and alerts. With the template-based setup and auto-discovery, you can speed up even the largest setups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-system, small-footprint agent for gathering crucial &lt;a href="https://www.zabbix.com/features#metric_collection"&gt;metrics&lt;/a&gt;, with support for SNMP and IPMI.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zabbix.com/features#problem_detection"&gt;Problem detection&lt;/a&gt; and prediction mechanism with flexible thresholds and severity levels defining their importance.&lt;/li&gt;
&lt;li&gt;Multi-lingual, multi-tenant, flexible UI with &lt;a href="https://www.zabbix.com/features#visualization"&gt;dashboarding&lt;/a&gt; capabilities and geolocation support for large organizations with data centers spread around the world.&lt;/li&gt;
&lt;li&gt;Support for adjustable &lt;a href="https://www.zabbix.com/features#notification"&gt;notifications&lt;/a&gt; with out of the box support for email, SMS, Slack, Hipchat and XMPP and escalation workflow.&lt;/li&gt;
&lt;li&gt;Template-based host management and &lt;a href="https://www.zabbix.com/features#auto_discovery"&gt;auto-discovery&lt;/a&gt; for monitoring large environments.&lt;/li&gt;
&lt;/ul&gt;
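
&lt;p&gt;Zabbix is also scriptable through its JSON-RPC 2.0 API (&lt;code&gt;api_jsonrpc.php&lt;/code&gt;). As a sketch, the snippet below only builds the request body for listing hosts; the auth token is a placeholder you would obtain from a prior &lt;code&gt;user.login&lt;/code&gt; call:&lt;/p&gt;

```python
import json

# Builds a Zabbix JSON-RPC 2.0 request body for the host.get method.
# "YOUR_AUTH_TOKEN" is a placeholder from a prior user.login call.
request = {
    "jsonrpc": "2.0",
    "method": "host.get",
    "params": {"output": ["hostid", "host"]},
    "auth": "YOUR_AUTH_TOKEN",
    "id": 1,
}
body = json.dumps(request)
# POST body to https://your-zabbix/api_jsonrpc.php with the
# Content-Type: application/json-rpc header to get the host list back.
```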

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Well-known, open-source, and free, with a large community and commercial support.&lt;/li&gt;
&lt;li&gt;Wide functionality allowing you to monitor virtually everything.&lt;/li&gt;
&lt;li&gt;It can be easily integrated with other visualization tools like Grafana.&lt;/li&gt;
&lt;li&gt;Easily extensible to support technologies and infrastructure elements not covered out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;As an open-source and completely free solution, you need to host and maintain it yourself, which means paying for the team that will install and manage it.&lt;/li&gt;
&lt;li&gt;Initial setup can be tedious and not so obvious, and it requires knowledge not only of the platform but also of the applications, servers, and infrastructure elements that you plan on monitoring, making the initial learning curve quite steep.&lt;/li&gt;
&lt;li&gt;Lack of dedicated user experience and synthetic monitoring functionality, and no transaction tracing support.&lt;/li&gt;
&lt;li&gt;If you are looking for a software-as-a-service solution, Zabbix Cloud is coming, but as of this writing it is still in beta.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Zabbix is open-sourced and free. You can, however, subscribe to support, consultancy, and training if you would like to quickly and efficiently extend your knowledge of the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Stackify Retrace
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9tLudxCt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/98bsuu0fhke8vpobiyzy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9tLudxCt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/98bsuu0fhke8vpobiyzy.jpg" alt="Stackify Retrace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stackify Retrace is a developer-centric solution providing users full visibility into their applications and infrastructure elements. With application performance monitoring, centralized logging, error reporting, and transaction tracing all available, it is easy for a developer to connect pieces of information when troubleshooting. The platform glues automated transaction tracing together with the relevant logs and error data, and provides an integrated profiler that gives top-to-bottom insight into each business transaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Centralized &lt;a href="https://stackify.com/retrace-log-management/"&gt;logging&lt;/a&gt; combined with &lt;a href="https://stackify.com/retrace-error-monitoring/"&gt;error&lt;/a&gt; reporting.&lt;/li&gt;
&lt;li&gt;Transaction tracing and &lt;a href="https://stackify.com/what-is-code-profiling/"&gt;code profiling&lt;/a&gt; with automatic instrumentation for databases like MySQL, PostgreSQL, Oracle, SQL Server, and common NoSQL solutions like MongoDB and Elasticsearch.&lt;/li&gt;
&lt;li&gt;Key &lt;a href="https://stackify.com/retrace-application-performance-management/"&gt;performance metrics monitoring&lt;/a&gt; for your &lt;a href="https://stackify.com/retrace-app-monitoring/"&gt;applications&lt;/a&gt; with alerting and notifications support.&lt;/li&gt;
&lt;li&gt;Server monitoring gives you insight into the most useful metrics like uptime, CPU &amp;amp; memory utilization, disk space usage, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Top-to-bottom view, starting with the web request and ending at the relevant log message, all connected by the transaction trace.&lt;/li&gt;
&lt;li&gt;Integrated profiler with out-of-the-box instrumentation for common system elements like databases or NoSQL stores.&lt;/li&gt;
&lt;li&gt;In-line inclusion of log and error data in tracing information makes it easy to connect the dots for fast troubleshooting.&lt;/li&gt;
&lt;li&gt;Support for custom dashboards and reports.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No native support for Google Cloud at the time of writing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stackify.com/retrace-real-user-monitoring/"&gt;Real user monitoring&lt;/a&gt; “coming soon” at the time of writing.&lt;/li&gt;
&lt;li&gt;UI reminiscent of Windows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://stackify.com/pricing/"&gt;pricing&lt;/a&gt; is based on data volume and is provided in three tiers – Essentials, Standard, and Enterprise. The Essentials package starts at $79/month allowing for 7 days of logs and traces retention, with up to 500k traces and 2m logs and up to 8 days of summary data retention with all the standard features provided. The Standard plan starts from $199 with additional features available for an appropriate higher price..&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Zenoss
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--StlS_Ae2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ku1o4y9w74tmk8inu0gh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--StlS_Ae2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ku1o4y9w74tmk8inu0gh.jpg" alt="Zenoss"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multi-vendor infrastructure monitoring with support for end-to-end troubleshooting and real-time dependency mapping. With server monitoring that covers common metrics and health, and excellent network monitoring, the Zenoss platform gives you visibility into your infrastructure, whether it is a private, hybrid, or public cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.zenoss.com/product/converged-infrastructure-monitoring"&gt;Infrastructure monitoring&lt;/a&gt; with the support for public, private, and hybrid &lt;a href="https://www.zenoss.com/product/cloud-monitoring"&gt;clouds&lt;/a&gt; and real-time dependency mapping.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenoss.com/product/server-monitoring"&gt;Server monitoring&lt;/a&gt; with support for common metrics, health, physical sensors like temperature sensors, file systems, processes, network interfaces, and routes monitoring.&lt;/li&gt;
&lt;li&gt;Application performance monitoring available via ZenPacks with support for incident root cause analysis and metrics importance voting along with containers and microservices support.&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://www.zenoss.com/solutions/log-analytics"&gt;logs&lt;/a&gt;, including log format unification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-vendor support for a wide variety of hardware and software infrastructure elements.&lt;/li&gt;
&lt;li&gt;Automatic discovery for dynamic environments like &lt;a href="https://www.zenoss.com/product/container-monitoring-microservices"&gt;containers and microservices&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Extensibility via ZenPacks – available both as community-driven and commercial extensions, with an SDK that makes developing new extensions easier.&lt;/li&gt;
&lt;li&gt;A self-managed, limited &lt;a href="https://www.zenoss.com/get-started"&gt;community version&lt;/a&gt; of the platform is available, offering basic functionality at minimal scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Application performance monitoring available via ZenPacks extension or integration with third-party services.&lt;/li&gt;
&lt;li&gt;Available only on-premises, with no free trial, which makes the platform hard to evaluate.&lt;/li&gt;
&lt;li&gt;No real user monitoring, synthetic monitoring, or transaction tracing.&lt;/li&gt;
&lt;li&gt;Focused on medium and large customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;At the time of writing the pricing was not publicly available on the vendor’s site, but one thing worth noting is the availability of the community version of the solution allowing you to install a limited, self-managed version of the platform.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When using Amazon Web Services, Google Cloud Platform, or Microsoft Azure you can rely on the tools provided by those platforms. These cloud-provider solutions may not be as powerful as the platforms discussed above, but they provide insight into metrics, logs, and infrastructure data. They give you not only visibility into the metrics but also proactive monitoring features like alerts and health checks that you can use to set up basic monitoring. If you are using a cloud solution from Amazon, Microsoft, or Google and would like to use the monitoring provided by those companies, have a look at what they offer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  12. Amazon CloudWatch
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7yZ0urb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tx5co9tm3ftvjbbky640.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7yZ0urb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tx5co9tm3ftvjbbky640.jpg" alt="Amazon CloudWatch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/cloudwatch/"&gt;Amazon CloudWatch&lt;/a&gt; is primarily aimed at customers using Amazon Web Services, but can also read metrics from statsd and collectd providing a way to ship custom metrics to the platform. By default, it provides an out of the box monitoring for your AWS infrastructure, services, and applications. With the integrated logs support and synthetics monitoring, it allows the users to set up basic monitoring quickly to give insights into the whole environment that is living in the Amazon ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;View metrics and logs of your infrastructure, services, and applications.&lt;/li&gt;
&lt;li&gt;Insights into events coming from your AWS environment.&lt;/li&gt;
&lt;li&gt;Service map and tracing support via AWS X-Ray.&lt;/li&gt;
&lt;li&gt;Synthetic service for web application monitoring.&lt;/li&gt;
&lt;li&gt;Alerting with anomaly detection on metrics and logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Available out of the box for Amazon Web Services Users.&lt;/li&gt;
&lt;li&gt;Support for custom metrics, so if you would like to stick to CloudWatch you can easily keep all your metrics there.&lt;/li&gt;
&lt;li&gt;Possibility to graph billing-related information and have that under control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Limited dashboarding and visualization capabilities.&lt;/li&gt;
&lt;li&gt;A limited number of dashboards in the free tier – beyond the first three, each dashboard costs $3.00 per month.&lt;/li&gt;
&lt;li&gt;Limited metrics granularity even when going for the paid service.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Volume-based &lt;a href="https://aws.amazon.com/cloudwatch/pricing/"&gt;pricing&lt;/a&gt; – you pay for what you want to have visibility into and how detailed it is. The free tier enables monitoring of your AWS services with 5-minute metric granularity and also applies to services like EBS volumes, RDS DB instances, and Elastic Load Balancers. It covers up to ten metrics and ten alarms per month. In addition, the free tier includes up to 5GB of logs per month, 3 dashboards, and 100 runs of synthetic monitors per month. The paid tier is priced by usage. For example, one-minute granularity metrics start at $0.30 per metric per month for the first 10,000 metrics and go as low as $0.02 per metric per month when you send over one million metrics. With logs the situation is similar – the more you send, the less you pay per gigabyte of data.&lt;/p&gt;
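&lt;p&gt;The tiered rates can be turned into a small worked example. Only the first tier ($0.30 per metric per month, capped at 10,000 metrics) is taken from the numbers quoted here; the tier boundaries between 10,000 and one million metrics are not given, so this sketch computes the first-tier portion only:&lt;/p&gt;

```python
def first_tier_metric_cost(num_metrics, rate=0.30, tier_cap=10_000):
    # Monthly cost of the metrics that fall within the first pricing tier;
    # metrics beyond the cap are billed at lower rates not modeled here.
    billable = min(num_metrics, tier_cap)
    return round(billable * rate, 2)

print(first_tier_metric_cost(2_500))   # 2,500 metrics, all in the first tier
print(first_tier_metric_cost(50_000))  # only the first 10,000 are counted here
```

The point of the cap is that per-metric cost stops growing linearly once you cross a tier boundary, which is why high-volume senders end up closer to the $0.02 floor rate.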

&lt;h2&gt;
  
  
  13. Azure Monitor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uBikTsjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3v0c3en4nd42t5skrvfj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBikTsjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3v0c3en4nd42t5skrvfj.jpg" alt="Azure Monitor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://azure.microsoft.com/en-us/services/monitor/#partners"&gt;Azure Monitor&lt;/a&gt; a solution primarily focused on monitoring the services located in the Microsoft Azure cloud services, but support custom metrics for resources outside of the cloud. It provides a full-featured observability solution giving you deep insights into your infrastructure, services, applications, and Azure resources with powerful dashboards, BI support, and alerting that will automatically notify you when needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring for your &lt;a href="https://azure.microsoft.com/en-us/"&gt;Microsoft Azure&lt;/a&gt; resources, services, first-party solutions, and custom metrics sent by your applications.&lt;/li&gt;
&lt;li&gt;Detailed infrastructure monitoring for deep insight into the metrics.&lt;/li&gt;
&lt;li&gt;Network activity, layout, and services layout visualization and monitoring.&lt;/li&gt;
&lt;li&gt;Support for alerts and autoscaling based on the metrics and logs.&lt;/li&gt;
&lt;li&gt;Powerful dashboarding capabilities with workbooks and BI support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Available out of the box for &lt;a href="https://azure.microsoft.com/en-us/"&gt;Microsoft Azure&lt;/a&gt; users.&lt;/li&gt;
&lt;li&gt;Azure resources, services, and first-party solutions expose their metrics in the free tier and other signals like logs and alerts have a free tier available.&lt;/li&gt;
&lt;li&gt;Support for workbooks and BI allows you to connect business-level metrics with the signals coming from the services and infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It may be complicated and overwhelming for users who have just started with Azure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The Azure Monitor &lt;a href="https://azure.microsoft.com/en-us/pricing/details/monitor/"&gt;pricing&lt;/a&gt; is based on the volume of ingested data or on reserved capacity. Selected metrics from Azure resources, services, and first-party solutions are free. Custom metrics are paid once you pass 150MB per month. As with other cloud vendors, the more data you send, the less you pay per unit. For logs, there is a pay-as-you-go option that gives you up to 5GB of logs per billing account per month for free and then charges $2.76 per GB of data. You can also opt for reserved capacity – for example, 100GB of data per day will cost you $219.52 per day. Other monitoring elements are priced in a similar way, with a small free tier or none at all.&lt;/p&gt;
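&lt;p&gt;The pay-as-you-go log pricing lends itself to a quick worked example using only the two figures quoted (a 5GB monthly free allowance and $2.76 per GB beyond it); the ingestion volume is a made-up input:&lt;/p&gt;

```python
def azure_logs_monthly_cost(gb_ingested, free_gb=5.0, rate_per_gb=2.76):
    # Only the gigabytes above the monthly free allowance are billed
    billable = max(0.0, gb_ingested - free_gb)
    return round(billable * rate_per_gb, 2)

print(azure_logs_monthly_cost(20))  # 15 billable GB
print(azure_logs_monthly_cost(3))   # under the free allowance, nothing billed
```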

&lt;h2&gt;
  
  
  14. Google Stackdriver
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_usE4o-b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kzj1ypgj843wi2boifxb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_usE4o-b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kzj1ypgj843wi2boifxb.jpg" alt="Google Stackdriver"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Formerly known as Stackdriver, the &lt;a href="https://cloud.google.com/products/operations"&gt;Google Cloud operations suite&lt;/a&gt; is primarily focused on giving Google Cloud Platform users insight into infrastructure and application performance, but it also supports custom metrics and other cloud providers like AWS. The platform provides metrics, logs, and traces support, along with visibility into Google Cloud Platform audit logs, giving you full visibility into what is happening inside your GCP account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Metrics and dashboards allowing visibility into the performance of your services with alerting.&lt;/li&gt;
&lt;li&gt;Health checks and uptime monitoring for web applications and other applications accessible from the internet.&lt;/li&gt;
&lt;li&gt;Support for logs and logs routing with error reporting and alerting.&lt;/li&gt;
&lt;li&gt;Per-URL statistics based on distributed tracing for App Engine.&lt;/li&gt;
&lt;li&gt;Audit logs for visibility into security-related events in your Google Cloud account.&lt;/li&gt;
&lt;li&gt;Production debugging and profiling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rich visualization support out of the box for Google Cloud platform users.&lt;/li&gt;
&lt;li&gt;Free tier available.&lt;/li&gt;
&lt;li&gt;Support for sending data to third-party providers if they provide an integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires a manual monitoring agent install before you get visibility into the metrics, unlike AWS CloudWatch, where this is not needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Similar to Amazon CloudWatch and Microsoft Azure the &lt;a href="https://cloud.google.com/stackdriver/pricing"&gt;pricing&lt;/a&gt; is based on the amount of data your services and applications are generating and sending to the platform. The free tier includes 150MB metrics per billing account, 50GB of logs per project, 1 million API calls per project, 2.5 million spans ingested per project and 25 million spans scanned per project. Everything above that falls into the paid tier.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Most of the tools that we’ve discussed &lt;strong&gt;provide a form of alerting and reporting&lt;/strong&gt;. These are usually limited to a handful of methods, like e-mail or text messages to your mobile, and sometimes other common destinations. We usually don’t see scheduling, automation, and workflow control in the monitoring tools themselves. Because of that, &lt;strong&gt;observability solutions provide &lt;a href="https://sematext.com/integrations/"&gt;integrations&lt;/a&gt; with third-party incident alerting and reporting tools that fill the communication gap&lt;/strong&gt; and provide additional features like event automation and triage, noise suppression, alert and notification centralization, and a wide range of destinations where the information can be sent. Let’s see which tools provide such functionality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  15. PagerDuty
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TW2tKkBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qfvwxf0vgdzy3vm0xawr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TW2tKkBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qfvwxf0vgdzy3vm0xawr.jpg" alt="PagerDuty"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An all-in-one alert and notification management and centralization solution. &lt;a href="https://www.pagerduty.com/platform/"&gt;PagerDuty&lt;/a&gt; provides a central place where you can gather notifications coming from various sources, organize them, assign them, automate responses, and send them to virtually any destination you can think of. It not only provides a simple way of viewing and forwarding the data but also automates incident response, schedules on-call rotations, and escalates incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;On-call &lt;a href="https://www.pagerduty.com/platform/on-call-management/"&gt;management&lt;/a&gt; with flexible schedules, &lt;a href="https://www.pagerduty.com/platform/modern-incident-response/"&gt;incident&lt;/a&gt; escalation, and alerting.&lt;/li&gt;
&lt;li&gt;Context filtering for alert reduction.&lt;/li&gt;
&lt;li&gt;Automated responses with status updates.&lt;/li&gt;
&lt;li&gt;Event &lt;a href="https://www.pagerduty.com/platform/event-intelligence-and-automation"&gt;automation&lt;/a&gt; with triage, alert grouping, and noise suppression.&lt;/li&gt;
&lt;li&gt;Dashboards for a variety of alert related information like operations, service health, responders, and incidents with customization capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A large number of integrations available out of the box, letting you receive notifications at virtually any destination.&lt;/li&gt;
&lt;li&gt;Scheduling and notifications escalation.&lt;/li&gt;
&lt;li&gt;Services prioritization for controlling what is more important.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.pagerduty.com/pricing/"&gt;pricing&lt;/a&gt; is organized around the features and the number of users that will be using PagerDuty with no free tier available. The most basic plan starts from $10 for up to 6 users per month with an additional $15 per user after that and goes up to $47 per user per month depending on the features of the platform you want to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  16. VictorOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f0KbTZGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g61ay1wktgxz2t2kz1j8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f0KbTZGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/g61ay1wktgxz2t2kz1j8.jpg" alt="VictorOps"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;VictorOps is a tool that can quickly become your central place for alerts and notifications. It makes it possible to take action on alerts and to schedule who is on call and should react to a given incident. With rules-based incident response, it is easy to automate responses to certain alerts, reducing the noise and fatigue generated by notifications from the various systems hooked up through the rich set of available integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://victorops.com/product/#on-call"&gt;On-call&lt;/a&gt; scheduling and management with incident escalation and hands-off.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://victorops.com/product/#alerts"&gt;Alerts&lt;/a&gt; and notification centralization.&lt;/li&gt;
&lt;li&gt;Incident &lt;a href="https://victorops.com/product/#automation"&gt;automation&lt;/a&gt; with alert rules, automatic response, and noise suppression.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://victorops.com/product/#reports"&gt;Reports&lt;/a&gt; and post-incident reviews.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A large number of integrations available out of the box for centralizing the alerts and notifications in a single place.&lt;/li&gt;
&lt;li&gt;Dedicated tools for teams.&lt;/li&gt;
&lt;li&gt;Scheduling and incident escalation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://victorops.com/pricing"&gt;pricing&lt;/a&gt; is based one features and the number of users. The basic plan starts from $8 per user per month when paid monthly and goes up to $33 per user per month for the Enterprise plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  17. OpsGenie
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4JeBrA6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hdh4sok3c3bembz4ie9f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4JeBrA6o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hdh4sok3c3bembz4ie9f.jpg" alt="OpsGenie"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the creators of JIRA and Confluence comes OpsGenie, a central place for your alerts and notifications. It allows you to manage alerts, plan on-call schedules, and react automatically based on user-defined rules. With a rich set of integrations, heartbeat monitoring, and alert deduplication, the platform can serve as the tool that centralizes all of your alerts and notifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;On-call &lt;a href="http://atlassian.com/software/opsgenie/on-call-management-and-escalations#on-call-schedule-management"&gt;scheduling&lt;/a&gt; and &lt;a href="http://atlassian.com/software/opsgenie/on-call-management-and-escalations#on-call-schedule-management"&gt;management&lt;/a&gt; with incident escalation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/software/opsgenie/it-alerting"&gt;Alerts&lt;/a&gt; and notification centralization with rule-based routing.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/software/opsgenie/advanced-reporting-and-analytics"&gt;Advanced reporting&lt;/a&gt; with post-incident analysis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/software/opsgenie/communication-and-collaboration#chatops"&gt;ChatOps&lt;/a&gt; and &lt;a href="https://www.atlassian.com/software/opsgenie/communication-and-collaboration#stakeholder-communications"&gt;stakeholder&lt;/a&gt; communications with a web conference bridge.&lt;/li&gt;
&lt;li&gt;Incident command center.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rich set of integrations available out of the box for centralizing the notifications and alerts in a single place.&lt;/li&gt;
&lt;li&gt;Team-centric tools for multi-team integrations.&lt;/li&gt;
&lt;li&gt;Heartbeat monitoring and alerts deduplication.&lt;/li&gt;
&lt;li&gt;Free tier available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.atlassian.com/software/opsgenie/pricing"&gt;pricing&lt;/a&gt; is based on features and the number of users. It starts with the limited free tier for up to 5 users with basic alerting and on-call management aimed for small teams. The first non-free tier starts with $11 per user per month when billed monthly and goes up to $35 per user per month with monthly billing. The price depends on the set of features of the platform that you will use. For instance, if you are OK with up to 25 international SMS notifications per user per month you will be fine with the basic, non-free plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  18. xMatters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gakz9HbG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuybdb0jaq4djw48dopn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gakz9HbG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tuybdb0jaq4djw48dopn.jpg" alt="xMatters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.xmatters.com/product/"&gt;xMatters&lt;/a&gt; is a user-friendly central place for all your alerts and notifications. It allows managing and reacting on incidents from a single place with on-call schedules, incident escalation, and rule-based responses and resolutions. With the incident timeline, you can see how the reaction on the incident was performed and how well the team reacted to the situation giving your organization a tool helping you in improving alerts handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/on-call-management/"&gt;On-call&lt;/a&gt; scheduling and &lt;a href="https://www.xmatters.com/features/on-call-management/"&gt;management&lt;/a&gt; with incident escalation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/workflow-process-automation"&gt;Automatic, rule-based&lt;/a&gt; responses and resolutions.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/notifications/"&gt;Stakeholder&lt;/a&gt; communication.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.xmatters.com/features/analytics"&gt;Incident timeline&lt;/a&gt; with team performance calculations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Over 100 integrations are available at the time of writing.&lt;/li&gt;
&lt;li&gt;Easy to learn and user-friendly.&lt;/li&gt;
&lt;li&gt;Free tier available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.xmatters.com/pricing"&gt;pricing&lt;/a&gt;, similar to the rest of the competitors like OpsGenie and PagerDuty is organized around features and the number of users. The pricing plans start with a free tier that is available for up to 10 users without any kind of SMS and voice notifications. The first paid plan starts at $16 per user per month and goes up to $59 per user per month making it the most expensive of the tools. Of course, the price depends on the features of the platform you choose to use. For example, if you are OK with up to 50 SMS notifications per user per month you will be fine with the basic, non-free plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Tools Will You Use?
&lt;/h2&gt;

&lt;p&gt;Cloud computing and the public, hybrid, and private cloud environments have opened up a world of opportunities. Flexibility, on-demand scaling, ready-to-use services, and the ease of use that comes with them allow the next generation of platforms to be built on top of them. However, to leverage all these opportunities, you need to deal with a set of challenges. Those require good tools so you can understand the state of the environment along with all the key performance indicators it provides. The available cloud monitoring tools all help you gather observability data, but they take different approaches, provide different functionalities, and come with different costs. With the wide range of options available, make sure to try a few and choose the one that best fits your needs. Learn how to choose the best monitoring system for your use case from our &lt;a href="https://sematext.com/blog/monitoring-alerting/"&gt;Guide to monitoring and alerting&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>logs</category>
      <category>monitoring</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Working with Solr Plugins System</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 08 Jun 2020 14:03:36 +0000</pubDate>
      <link>https://dev.to/sematext/working-with-solr-plugins-system-4c0d</link>
      <guid>https://dev.to/sematext/working-with-solr-plugins-system-4c0d</guid>
      <description>&lt;p&gt;&lt;a href="https://sematext.com/guides/solr/" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; was always ready to be extended. What was only needed is a binary with the code and the modification of the Solr configuration file, the &lt;strong&gt;solrconfig.xml&lt;/strong&gt; and we were ready. It was even simpler with the Solr APIs that allowed us to create various configuration elements – for example, request handlers. What’s more, the default Solr distribution came with a few plugins already – for example, the Data Import Handler or Learning to Rank.&lt;/p&gt;

&lt;p&gt;As consultants working with clients across different industries, dealing with a wide variety of use cases and Solr clusters monitored by &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, the next thing we saw a need for was plugins. Installing those plugins was not hard – put a jar file in a defined place, modify the configuration, reload the core/collection or restart Solr, and you are ready. Well, not so fast. What if you had hundreds of Solr nodes and needed to install or upgrade a plugin? Yes, that’s where things can get nasty and require automation. Solr was not very supportive in this area until one of its recent releases. All users who wanted to extend Solr were doing the same thing – manual jar loading. We did the same with our plugins – like the &lt;a href="https://github.com/sematext/solr-researcher" rel="noopener noreferrer"&gt;Researcher&lt;/a&gt; or the &lt;a href="https://github.com/sematext/query-segmenter" rel="noopener noreferrer"&gt;Query Segmenter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With the release of Solr 8.4.0, we got new functionality that helps us extend Solr – plugin management. It allows installing plugins from remote locations and makes it very easy for us as users to do so. Today I want to show you not only how to install Solr plugins using this new feature, but also how to prepare your own plugin repository. Let’s get started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0dzjsfdir0i7dh6t6uqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0dzjsfdir0i7dh6t6uqr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solr Plugin Management
&lt;/h2&gt;

&lt;p&gt;With Solr 8.4.0, we didn’t get only the script itself but also a whole set of changes under the hood. Those changes include things like package management APIs and scripts, classloader isolation, an artifact read and write API, and more.&lt;/p&gt;

&lt;p&gt;Let’s start from the beginning though. By default, Solr comes with package loading turned off. One of the reasons for this decision is security. Users could potentially force Solr to download malicious content, so you need to be sure that your environment is secure and you need to know the potential downsides and risks of using this feature. But if we are sure that we want to run Solr with the plugin management mechanism turned on, we need to add the &lt;strong&gt;enable.packages&lt;/strong&gt; property to the Solr startup parameters and set it to &lt;strong&gt;true&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/solr start &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-Denable&lt;/span&gt;.packages&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can start playing around with the packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package Management Basics
&lt;/h2&gt;

&lt;p&gt;Let’s try using the bin/solr script and see what it allows us to do when it comes to package management. The simplest way to check that is just by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bin/
&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result, we will get the following response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Found 1 Solr nodes:

Solr process 20949 running on port 8983
Package Manager

./solr package add-repo
Add a repository to Solr.

./solr package &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;:]
Install a package into Solr. This copies over the artifacts from the repository into Solr&lt;span class="s1"&gt;'s internal package store and sets up classloader for this package to be used.

./solr package deploy [:] [-y] [--update] -collections &amp;lt;package-name&amp;gt;[:] [-y] [--update] -collections  [-p param1=value1 -p param2=value2 …
Bootstraps a previously installed package into the specified collections. It the package accepts parameters for its setup commands, they can be specified (as per package documentation).

./solr package list-installed
Print a list of packages installed in Solr.

./solr package list-available
Print a list of packages available in the repositories.

./solr package list-deployed -c
Print a list of packages deployed on a given collection.

./solr package list-deployed
Print a list of collections on which a given package has been deployed.

./solr package undeploy  -collections
Undeploys a package from specified collection(s)

Note: (a) Please add '&lt;/span&gt;&lt;span class="nt"&gt;-solrUrl&lt;/span&gt; http://host:port&lt;span class="s1"&gt;' parameter if needed (usually on Windows).
      (b) Please make sure that all Solr nodes are started with '&lt;/span&gt;&lt;span class="nt"&gt;-Denable&lt;/span&gt;.packages&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="s1"&gt;' parameter.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It seems we get everything that is needed. We can add repositories, list installed and available packages, install and deploy packages, list the deployed ones, and of course undeploy the ones that we no longer need.&lt;/p&gt;

&lt;p&gt;At the time of writing this blog post, there were no publicly available Solr plugin repositories. But that’s not bad for us – it’s an opportunity to learn even more. We just need to start by preparing our own plugin repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the Package Repository
&lt;/h2&gt;

&lt;p&gt;If you are using an already created repository where the plugins are available, you can skip this part of the blog post. But if you would like to learn how to set up a Solr plugin repository on your own, I’ll try to guide you through the process.&lt;/p&gt;

&lt;p&gt;So there are a few steps that need to be taken:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to create a private key that will be used to sign your binaries&lt;/li&gt;
&lt;li&gt;You need to create a public key that Solr will use to verify the signed packages&lt;/li&gt;
&lt;li&gt;You need to create a repository description file that Solr will read when requesting packages from the repository&lt;/li&gt;
&lt;li&gt;And of course, you need the binaries that you would like to expose as plugins. We will not be discussing this step, though – I will assume you already have them. We created a very naive and simple example at &lt;a href="https://github.com/sematext/example-solr-module" rel="noopener noreferrer"&gt;https://github.com/sematext/example-solr-module&lt;/a&gt;. Have a look if you want.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkeiz7jp9hm1k128nf8n2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkeiz7jp9hm1k128nf8n2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Private and a Public Key
&lt;/h3&gt;

&lt;p&gt;We will start by creating a &lt;strong&gt;private key&lt;/strong&gt;. This key will be used to generate a signature of the binaries that we will be exposing as plugins. For that we will use &lt;strong&gt;openssl&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl genrsa &lt;span class="nt"&gt;-out&lt;/span&gt; sematext_example.pem 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates a 512-bit RSA private key called &lt;strong&gt;sematext_example.pem&lt;/strong&gt;. Keep in mind that 512 bits is fine for a demonstration, but far too weak for production use. With that generated, we can now create a public key based on it.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;public key&lt;/strong&gt; will be created from the private one and Solr will use it to verify the signatures of the files. The idea is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The package maintainer creates a signature of the package file using the &lt;strong&gt;private key&lt;/strong&gt; and writes the signature in the repository description file,&lt;/li&gt;
&lt;li&gt;During package deployment, the package signature is verified by Solr using the &lt;strong&gt;public key&lt;/strong&gt;. If the signature doesn’t match – the package will not be deployed.&lt;/li&gt;
&lt;/ul&gt;
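The two bullets above can be sketched end to end with openssl alone. This is a local demonstration, not Solr itself: the key names, the 2048-bit key size and the dummy jar contents are made up for the demo, and the final verify step is what Solr performs internally during deployment.

```shell
# A stand-in for the plugin jar (contents are made up for the demo)
printf 'pretend this is a plugin jar' > plugin.jar

# Maintainer side: create a private key and the base64 signature that would
# go into the "sig" field of repository.json
openssl genrsa -out demo_private.pem 2048 2>/dev/null
openssl dgst -sha1 -sign demo_private.pem plugin.jar | openssl enc -base64 | tr -d '\n' > plugin.sig.b64

# Consumer side (what Solr does during deployment): derive the public key,
# decode the single-line base64 signature (-A) and verify it against the artifact
openssl rsa -in demo_private.pem -pubout -out demo_public.pem 2>/dev/null
openssl enc -base64 -d -A -in plugin.sig.b64 > plugin.sig
openssl dgst -sha1 -verify demo_public.pem -signature plugin.sig plugin.jar
# prints: Verified OK
```

If the jar is modified after signing, the last command reports a verification failure instead, which is exactly why Solr refuses to deploy a package whose signature doesn’t match.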

&lt;p&gt;To create a public key we will again use the &lt;strong&gt;openssl&lt;/strong&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl rsa &lt;span class="nt"&gt;-in&lt;/span&gt; sematext_example.pem &lt;span class="nt"&gt;-pubout&lt;/span&gt; &lt;span class="nt"&gt;-outform&lt;/span&gt; DER &lt;span class="nt"&gt;-out&lt;/span&gt; publickey.der 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above command is a &lt;strong&gt;publickey.der&lt;/strong&gt; file that we will upload to our repository location along with the binary file and the repository description file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating Package Signature
&lt;/h3&gt;

&lt;p&gt;The last step is generating the signature of the file. We will once again use the &lt;strong&gt;openssl&lt;/strong&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;openssl dgst &lt;span class="nt"&gt;-sha1&lt;/span&gt; &lt;span class="nt"&gt;-sign&lt;/span&gt; sematext.pem solr-example-module-1.0.jar | openssl enc &lt;span class="nt"&gt;-base64&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\\&lt;/span&gt;n | &lt;span class="nb"&gt;sed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result we will have the signature, which in our case looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iXyDDhYkYZgBrYCTxawAdeIJFYR+KHglK4m6uLSR1lo9pFm67dKfIzTmXPHasFVgLwVRbYvGMJG5p69TowMPAg==
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write it down somewhere, as we will need it soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Description
&lt;/h2&gt;

&lt;p&gt;Now that we have our binary file and the private and public keys, we can create the repository description file that Solr will look for inside the repository. This file has to be called &lt;strong&gt;repository.json&lt;/strong&gt; and needs to include a list of the plugins available in our repository. Each plugin is defined by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A name,&lt;/li&gt;
&lt;li&gt;A description,&lt;/li&gt;
&lt;li&gt;An array of versions, each of which includes:
&lt;ul&gt;
&lt;li&gt;The version itself,&lt;/li&gt;
&lt;li&gt;The release date of the given plugin version,&lt;/li&gt;
&lt;li&gt;An array of artifacts for the version – the URL of the file and the signature that we generated earlier,&lt;/li&gt;
&lt;li&gt;The manifest, which includes supported Solr versions, default parameters, and the setup, uninstall and verification commands.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;repository.json&lt;/strong&gt; file that we are using for the purpose of this blog post looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"sematext-example"&lt;/span&gt;, &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Example plugin created for blog post"&lt;/span&gt;, &lt;span class="s2"&gt;"versions"&lt;/span&gt;: &lt;span class="o"&gt;[{&lt;/span&gt;
    &lt;span class="s2"&gt;"date"&lt;/span&gt;: &lt;span class="s2"&gt;"2020-04-16"&lt;/span&gt;, &lt;span class="s2"&gt;"artifacts"&lt;/span&gt;: &lt;span class="o"&gt;[{&lt;/span&gt;
            &lt;span class="s2"&gt;"url"&lt;/span&gt;: &lt;span class="s2"&gt;"solr-example-module-1.0.jar"&lt;/span&gt;,
            &lt;span class="s2"&gt;"sig"&lt;/span&gt;: &lt;span class="s2"&gt;"iXyDDhYkYZgBrYCTxawAdeIJFYR+KHglK4m6uLSR1lo9pFm67dKfIzTmXPHasFVgLwVRbYvGMJG5p69TowMPAg=="&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;]&lt;/span&gt;,
        &lt;span class="s2"&gt;"manifest"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"version-constraint"&lt;/span&gt;: &lt;span class="s2"&gt;"8 - 9"&lt;/span&gt;,
          &lt;span class="s2"&gt;"plugins"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
            &lt;span class="o"&gt;{&lt;/span&gt;
              &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"request-handler"&lt;/span&gt;,
              &lt;span class="s2"&gt;"setup-command"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"/api/collections/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;collection&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/config"&lt;/span&gt;,
                &lt;span class="s2"&gt;"payload"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"add-requesthandler"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;, &lt;span class="s2"&gt;"class"&lt;/span&gt;:
        &lt;span class="s2"&gt;"sematext-example:com.sematext.blog.solr.ExampleRequestHandler"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;,
                &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="s2"&gt;"POST"&lt;/span&gt;
              &lt;span class="o"&gt;}&lt;/span&gt;,
              &lt;span class="s2"&gt;"uninstall-command"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"/api/collections/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;collection&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/config"&lt;/span&gt;,
                &lt;span class="s2"&gt;"payload"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"delete-requesthandler"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
                &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="s2"&gt;"POST"&lt;/span&gt;
              &lt;span class="o"&gt;}&lt;/span&gt;,
              &lt;span class="s2"&gt;"verify-command"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"path"&lt;/span&gt;: &lt;span class="s2"&gt;"/api/collections/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;collection&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/config/requestHandler?componentName=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;meta=true"&lt;/span&gt;,
                &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="s2"&gt;"GET"&lt;/span&gt;,
                &lt;span class="s2"&gt;"condition"&lt;/span&gt;:
        &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$[&lt;/span&gt;&lt;span class="s2"&gt;'config'].['requestHandler'].['&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RH&lt;/span&gt;&lt;span class="p"&gt;-HANDLER-PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'].['_packageinfo_'].['version']"&lt;/span&gt;,
                &lt;span class="s2"&gt;"expected"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;package&lt;/span&gt;&lt;span class="p"&gt;-version&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
              &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
          &lt;span class="o"&gt;]&lt;/span&gt;,
          &lt;span class="s2"&gt;"parameter-defaults"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"RH-HANDLER-PATH"&lt;/span&gt;: &lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While most of the properties are self-descriptive, you should pay attention to one thing – the class of the request handler in the setup-command definition. Because of the out-of-the-box classloader isolation, we need to prefix the class name with the name of the package to be able to create the request handler. If we don’t do that, Solr will fail to create the request handler, because the class that implements it will not be visible. Keep that in mind when creating the repository description file for your own plugins.&lt;/p&gt;
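To make the classloader requirement concrete, here is a hypothetical before/after of the class value in the setup-command (the package name sematext-example is the one from the repository.json above):

```shell
# Without the package prefix, Solr's core classloader cannot see the class:
#   "class": "com.sematext.blog.solr.ExampleRequestHandler"      # fails to load

# With the <package-name>: prefix, Solr resolves the class through the
# package's own isolated classloader:
#   "class": "sematext-example:com.sematext.blog.solr.ExampleRequestHandler"
```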

&lt;p&gt;With all that in place, we can upload the files to some remote location, like we did with &lt;a href="http://pub-repo.sematext.com/training/solr/blog/repo/" rel="noopener noreferrer"&gt;http://pub-repo.sematext.com/training/solr/blog/repo/&lt;/a&gt;, and start using the repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding a New Package Repository
&lt;/h2&gt;

&lt;p&gt;Once we are done setting up our own repository, or we already have a repository that we would like to install plugins from, we can add that repository to Solr. Just remember: to successfully add the repository, it needs to provide the &lt;strong&gt;repository.json&lt;/strong&gt; file. The second thing is security – you should avoid adding repositories that don’t use SSL. Adding a repository that doesn’t use a secure connection exposes you and your Solr to &lt;strong&gt;man-in-the-middle&lt;/strong&gt; attacks, during which the package can be replaced with a malicious version while it is being downloaded. Keeping your Solr secure is as important as keeping an eye on the Solr metrics by using one of the Solr monitoring tools like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that we know about the potential security issue let’s use a secure location of the example Solr repository. We do that by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package add-repo sematext https://pub-repo.sematext.com/training/solr/blog/repo/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using the new &lt;strong&gt;package&lt;/strong&gt; functionality of the &lt;strong&gt;bin/solr&lt;/strong&gt; script with the &lt;strong&gt;add-repo&lt;/strong&gt; option, which requires us to provide a name and a location. The name in our case is &lt;strong&gt;sematext&lt;/strong&gt; and the location is the last provided parameter.&lt;/p&gt;

&lt;p&gt;If the operation was successful, Solr gives us information about the number of nodes found in the cluster, the process identifier, the port on which the instance is running and, finally, a confirmation that the repository was added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Found 1 Solr nodes:

Solr process 65854 running on port 8983
Added repository: sematext
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a side note – I’ll omit the information about the number of nodes, the process identifier and the Solr port from the other examples, so that the crucial information returned by Solr is easier to see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Removing Solr Packages
&lt;/h2&gt;

&lt;p&gt;Once the repository is added, we can start using it. The first thing you would usually do is list the available packages and look for something to install to extend your Solr. To list all the available packages, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/solr package list-available
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response to the above command should be similar to the following one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Available packages:
&lt;span class="nt"&gt;-----&lt;/span&gt;
sematext-example    Example plugin created &lt;span class="k"&gt;for &lt;/span&gt;blog post
  Version: 1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the response, we have a list of packages – each described with a name, description and version, just as they were defined in the &lt;strong&gt;repository.json&lt;/strong&gt; file. We are very close to being ready for installation, but there is one more thing – the public key that Solr will use to verify the package signature. Where do you look for such a key? It will either be provided to you or you can download it from the repository itself under the &lt;strong&gt;publickey.der&lt;/strong&gt; name. I’ll do the latter and download the key with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; publickey.der &lt;span class="nt"&gt;-LO&lt;/span&gt; http://pub-repo.sematext.com/training/solr/blog/repo/publickey.der
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have the key, we can add it to Solr using the &lt;strong&gt;add-key&lt;/strong&gt; action of the &lt;strong&gt;bin/solr&lt;/strong&gt; script’s &lt;strong&gt;package&lt;/strong&gt; functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package add-key publickey.der
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After all those steps we can finally start installing the packages. For example, let’s install the one package that we have available in our sample repository. We do that by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package &lt;span class="nb"&gt;install &lt;/span&gt;sematext-example:1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response that I got from Solr was as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Posting manifest...
Posting artifacts...
Executing Package API to register this package...
Response: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"responseHeader"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"status"&lt;/span&gt;:0,
    &lt;span class="s2"&gt;"QTime"&lt;/span&gt;:68&lt;span class="o"&gt;}}&lt;/span&gt;
sematext-example installed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that our package is now ready to be used. Let’s create a collection where we can use the package by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr create_collection &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By now we should have the package installed and a sample collection created, which means that we are finally ready to use the plugin. To do that, we need to deploy it – to a single collection or to multiple ones at the same time. For the purpose of this blog post I will use our &lt;strong&gt;test&lt;/strong&gt; collection and deploy the plugin with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./solr package deploy sematext-example:1.0.0 &lt;span class="nt"&gt;-collections&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the name of the collection or collections that our plugin should be installed to, we need to provide the name of the plugin and its version. The response was as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Executing &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"add-requesthandler"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;,&lt;span class="s2"&gt;"class"&lt;/span&gt;:&lt;span class="s2"&gt;"sematext-example:com.sematext.blog.solr.ExampleRequestHandler"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;path:/api/collections/test/config
Execute this &lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;y/n&lt;span class="o"&gt;)&lt;/span&gt;:
y
Executing http://localhost:8983/api/collections/test/config/requestHandler?componentName&lt;span class="o"&gt;=&lt;/span&gt;/sematextexample&amp;amp;meta&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;collection:test
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"responseHeader"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"status"&lt;/span&gt;:0,
    &lt;span class="s2"&gt;"QTime"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"config"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"requestHandler"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"name"&lt;/span&gt;:&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;,
        &lt;span class="s2"&gt;"class"&lt;/span&gt;:&lt;span class="s2"&gt;"sematext-example:com.sematext.blog.solr.ExampleRequestHandler"&lt;/span&gt;,
        &lt;span class="s2"&gt;"_packageinfo_"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"package"&lt;/span&gt;:&lt;span class="s2"&gt;"sematext-example"&lt;/span&gt;,
          &lt;span class="s2"&gt;"version"&lt;/span&gt;:&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;,
          &lt;span class="s2"&gt;"files"&lt;/span&gt;:[&lt;span class="s2"&gt;"/package/sematext-example/1.0.0/solr-example-module-1.0.jar"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
          &lt;span class="s2"&gt;"manifest"&lt;/span&gt;:&lt;span class="s2"&gt;"/package/sematext-example/1.0.0/manifest.json"&lt;/span&gt;,
          &lt;span class="s2"&gt;"manifestSHA512"&lt;/span&gt;:&lt;span class="s2"&gt;"da463cdad3efbe4c9159b29156bbaf26f4aa35a083a8b74fd57e1dfa1f79ee7eaadfd3863f5d88fa2550281c027e82b516ebc64a7fa4159089f32c565813c574"&lt;/span&gt;&lt;span class="o"&gt;}}}}}&lt;/span&gt;

Actual: 1.0.0, expected: 1.0.0
Deployed on &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; and verified package: sematext-example, version: 1.0.0
Deployment successful
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During the execution of the above command, the &lt;strong&gt;bin/solr&lt;/strong&gt; script will ask whether you are certain that you would like to deploy the chosen package. If you agree, Solr will deploy the package and try to verify it using the verification command provided in the &lt;strong&gt;repository.json&lt;/strong&gt; description file. If that went well, the plugin is ready and we can use it.&lt;/p&gt;

&lt;p&gt;When we no longer need a package, we can remove it by running the &lt;strong&gt;undeploy&lt;/strong&gt; command. For example, to remove the previously deployed package we just need to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/solr package undeploy sematext-example &lt;span class="nt"&gt;-collections&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, the response says that everything went well and we will no longer be using the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Executing &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"delete-requesthandler"&lt;/span&gt;:&lt;span class="s2"&gt;"/sematextexample"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;path:/api/collections/test/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How Solr Packages Work Under the Hood
&lt;/h2&gt;

&lt;p&gt;The heart of the plugin mechanism implementation is the isolation of the plugins’ classloaders from the core Solr classes. The plugin mechanism assumes that any change to the files on the Solr &lt;strong&gt;classpath&lt;/strong&gt; requires a restart. The rest of the files can be loaded dynamically and are bound to the configuration stored in ZooKeeper.&lt;/p&gt;

&lt;p&gt;The basis of the mechanism is the so-called &lt;strong&gt;Package Store&lt;/strong&gt;. It is a distributed file system that keeps its data on each Solr node in the &lt;strong&gt;$SOLR_HOME/filestore&lt;/strong&gt; directory, with each file described by metadata written in a JSON file. Each file also stores a checksum in its metadata for verification purposes. That way, replacing the binary itself is not enough to load a malicious version of the plugin – the signature is still there and would need to be adjusted as well. That gives us a certain degree of security.&lt;/p&gt;

&lt;p&gt;On top of all of that, we have an API allowing us not only to manage the whole package repository but also single files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solr Package API
&lt;/h2&gt;

&lt;p&gt;Of course, the &lt;strong&gt;bin/solr&lt;/strong&gt; tool and installing packages with it is not everything that Solr gives us. In addition, we got an API that allows us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add files using the PUT HTTP method and the &lt;strong&gt;/api/cluster/files/{file_path}&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;li&gt;retrieve files using the GET HTTP method and the &lt;strong&gt;/api/cluster/files/{file_path}&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;li&gt;retrieve file metadata using the GET HTTP method and the &lt;strong&gt;/api/cluster/files/{file_path}?meta=true&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;li&gt;retrieve files available at a given path using the GET HTTP method using the &lt;strong&gt;/api/cluster/files/{directory_path}&lt;/strong&gt; endpoint&lt;/li&gt;
&lt;/ul&gt;
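As a hedged sketch of what an upload through the files API could look like: the file name, key name and package path below are hypothetical, and the PUT request is only printed rather than executed, since running it requires a live Solr started with -Denable.packages=true and a matching public key already registered via add-key.

```shell
# A hypothetical artifact and signing key for the sketch
printf 'dummy jar contents' > myplugin.jar
openssl genrsa -out upload_key.pem 2048 2>/dev/null

# Sign the file, since the files API only accepts signed artifacts
SIG=$(openssl dgst -sha1 -sign upload_key.pem myplugin.jar | openssl enc -base64 | tr -d '\n')

# Print the PUT request that would push the signed file into the package store.
# Note: the base64 signature may need URL-encoding when used as a query parameter.
echo "curl -X PUT --data-binary @myplugin.jar 'http://localhost:8983/api/cluster/files/mypkg/1.0/myplugin.jar?sig=${SIG}'"
```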

&lt;p&gt;You should remember that adding a file to Solr is not only about sending it. You also need to sign it using a key that is available to Solr – we saw that already.&lt;/p&gt;

&lt;p&gt;Similar to manipulating files in the package repository, we can also manage packages themselves – add, remove and list them along with their versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GET on &lt;strong&gt;/api/cluster/package&lt;/strong&gt; to download the list of packages&lt;/li&gt;
&lt;li&gt;PUT on &lt;strong&gt;/api/cluster/package&lt;/strong&gt; to add a package&lt;/li&gt;
&lt;li&gt;DELETE on &lt;strong&gt;/api/cluster/package&lt;/strong&gt; to remove a package&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, to add a package to Solr we could use a command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-XPUT&lt;/span&gt; &lt;span class="s1"&gt;'http://localhost:8983/api/cluster/package'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-type:application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;  &lt;span class="s1"&gt;'{
 "add": {
  "package" : "sematext-example",
  "version" : "1.0.0",
  "files" : [
   "/test/sematext/1.0.0/sematext-example.jar"
  ]
 }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;p&gt;The package management functionality brings a new way of extending Solr. However, you should remember that this flexibility doesn’t come for free. Having the option to &lt;strong&gt;hot-deploy&lt;/strong&gt; Solr extensions on the fly, without bringing the whole cluster down, carries limitations and security threats. First, do not add package repositories that you don’t know and trust – such repositories can serve malicious code that you would end up downloading and installing. Second, do not add repositories that are not using SSL. A repository without SSL exposes you to a &lt;strong&gt;man-in-the-middle&lt;/strong&gt; attack, during which files can be replaced on the fly, again leading to malicious code being installed. Either case can compromise your cluster and lead to data leaks or the whole environment being taken over. Keep your Solr secure whether you use package management or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The ability to install Solr extensions without manually downloading them to each node, restarting the nodes and so on is very convenient and tempting, especially for those of us who use such extensions. However, please remember the security implications and the limitations of the mechanism. If we are cautious, we get a flexible way of extending Solr.&lt;/p&gt;

&lt;p&gt;Also remember to &lt;a href="https://sematext.com/guides/solr/%23monitoring-solr-with-sematext" rel="noopener noreferrer"&gt;monitor your Solr&lt;/a&gt;, for example with software like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, which can help you identify bottlenecks and find the root cause of problems with a single instance or the whole cluster. Keep that in mind – you can’t fix what you can’t measure 🙂&lt;/p&gt;

</description>
      <category>solr</category>
      <category>plugin</category>
      <category>plugins</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Java Garbage Collection Tuning</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 27 Jan 2020 11:19:14 +0000</pubDate>
      <link>https://dev.to/sematext/a-step-by-step-guide-to-java-garbage-collection-tuning-2m1g</link>
      <guid>https://dev.to/sematext/a-step-by-step-guide-to-java-garbage-collection-tuning-2m1g</guid>
<description>&lt;p&gt;Working with Java applications has a lot of benefits, especially when compared to languages like C/C++. In the majority of cases, you get interoperability between operating systems and various environments. You can move your applications from server to server and from operating system to operating system without major effort, or in rare cases with minor changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8hdcejqb83zru5a0s6cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8hdcejqb83zru5a0s6cr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most interesting benefits of running a JVM based application is automatic memory handling. When you create an object in your code it is allocated on the heap and stays there as long as it is referenced from the code. When it is no longer needed it has to be removed from memory to make room for new objects. In programming languages like C or C++, cleaning the memory is done by us, programmers, manually in the code. In languages like Java or Kotlin, we don’t need to take care of that – it is done automatically by the JVM, by its garbage collector.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is Garbage Collection Tuning?
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Garbage Collection (GC) tuning&lt;/strong&gt; is the process of adjusting the startup parameters of your JVM-based application to match the desired results. Nothing more and nothing less. It can be as simple as adjusting the heap size – the &lt;em&gt;-Xmx&lt;/em&gt; and &lt;em&gt;-Xms&lt;/em&gt; parameters – which is, by the way, what you should start with. Or it can be as complicated as tuning all the advanced parameters that adjust the different heap regions. Everything depends on the situation and your needs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Is Garbage Collection Tuning Important?
&lt;/h1&gt;

&lt;p&gt;Cleaning the heap memory of our applications’ JVM process is not free. Resources need to be dedicated to the garbage collector so it can do its work. You can imagine that instead of handling the business logic of our application, the CPU can be busy handling the removal of unused data from the heap.&lt;/p&gt;

&lt;p&gt;This is why it’s crucial for the garbage collector to work as efficiently as possible. The GC process can be heavy. During our work as developers and consultants, we’ve seen situations where the garbage collector was working for 20 seconds during a 60-second window of time. Meaning that 33% of the time the application was not doing its job — it was doing the housekeeping instead.&lt;/p&gt;

&lt;p&gt;We can expect threads to be stopped for very short periods of time. It happens constantly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What’s dangerous, however, is a complete stop of the application threads for a very long period of time – like seconds or in extreme cases even minutes. This can lead to your users not being able to properly use your application at all. Your distributed systems can collapse because of elements not responding in a timely manner.&lt;/p&gt;

&lt;p&gt;To avoid that we need to ensure that the garbage collector running for our JVM applications is well configured and doing its job as well as it can.&lt;/p&gt;

&lt;h1&gt;
  
  
  When to Do Garbage Collection Tuning?
&lt;/h1&gt;

&lt;p&gt;The first thing that you should know is that tuning the garbage collection should be one of the last operations you do. Unless you are absolutely sure that the problem lies in the garbage collection, don’t start with &lt;a href="https://sematext.com/blog/jvm-performance-tuning/" rel="noopener noreferrer"&gt;changing JVM options&lt;/a&gt;. To be blunt, there are numerous situations where the way the garbage collector works only highlights a bigger problem.&lt;/p&gt;

&lt;p&gt;If your JVM memory utilization looks good and your garbage collector works without causing trouble, you shouldn’t spend time tuning your garbage collection. You will most likely be more effective refactoring the code to be more efficient.&lt;/p&gt;

&lt;p&gt;So how do we tell whether the garbage collector is doing a good job? We can look into our monitoring, like our own &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. It will provide you with information regarding your JVM memory utilization, the garbage collector’s work and of course the overall performance of your application. For example, have a look at the following chart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fem36no6hhprzorlmjcmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fem36no6hhprzorlmjcmo.png" alt="JVM Pool Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this chart, you can see something called a “shark tooth” pattern. Usually, it is a sign of a healthy JVM heap. The largest portion of the memory, called the old generation, gets filled up and then is cleared by the garbage collector. If we correlate that with the garbage collector timings we see the whole picture. Knowing all of that we can judge whether we are satisfied with how the garbage collection is working or whether tuning is needed.&lt;/p&gt;

&lt;p&gt;Another thing you can look into is garbage collection logs that we discussed in the &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;Understanding Java GC Logs&lt;/a&gt; blog post. You can also use tools like jstat or any &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;profiler&lt;/a&gt;. They will give you detailed information regarding what’s happening inside your JVM, especially when it comes to heap memory and garbage collection.&lt;/p&gt;

&lt;p&gt;There is also one more thing that you should consider when thinking about garbage collection performance tuning. The default Java garbage collection settings may not be a good fit for your application. Instead of going for more hardware or beefier machines, you may want to look into how your memory is managed. Sometimes tuning can decrease operating costs, lowering your expenses and allowing for growth without growing the environment.&lt;/p&gt;

&lt;p&gt;Once you are sure that the garbage collector is to blame and you want to start optimizing its parameters we can start working on the JVM startup parameters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Garbage Collection Tuning Procedure: How to Tune Java GC
&lt;/h1&gt;

&lt;p&gt;When talking about the procedure you should follow when tuning the garbage collector, you have to remember that there are multiple garbage collectors available in the JVM world. When dealing with smaller heaps and older JVM versions, like version 7, 8 or 9, you will probably use the good, old Concurrent Mark Sweep garbage collector for your old generation heap. With a newer version of the JVM, like 11, you are probably using G1GC. If you like experimenting, you may be using the newest JVM version along with ZGC. Each garbage collector works differently, hence the tuning procedure for each of them will be different.&lt;/p&gt;
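
&lt;p&gt;As a quick reference, turning a particular collector on is just a matter of a startup flag. These are standard JVM flags, shown here with a hypothetical &lt;strong&gt;app.jar&lt;/strong&gt;; note that ZGC required unlocking experimental options before JDK 15:&lt;/p&gt;

```shell
# Concurrent Mark Sweep, common on JDK 7/8 (removed in JDK 14)
java -XX:+UseConcMarkSweepGC -jar app.jar

# G1, the default collector since JDK 9
java -XX:+UseG1GC -jar app.jar

# ZGC, experimental before JDK 15
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -jar app.jar
```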

&lt;p&gt;Running a JVM based application with different garbage collectors is one thing, doing experiments is another. Java garbage collection tuning requires a lot of experimentation. It’s normal not to achieve the desired results on your first try. You will want to introduce changes one at a time and observe how your application and the garbage collector behave after each change.&lt;/p&gt;

&lt;p&gt;Whatever your motivation for GC tuning is, I would like to make one thing clear. To be able to tune the garbage collector, you need to be able to see how it works. This means you need visibility into GC metrics or GC logs, or best of all, both.&lt;/p&gt;
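
&lt;p&gt;Getting the GC logs, for example, is a matter of a few startup flags. The log file name here is just an example; the unified &lt;strong&gt;-Xlog&lt;/strong&gt; syntax applies to JDK 9 and newer, while the older flags apply to JDK 8:&lt;/p&gt;

```shell
# JDK 8 and earlier
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log -jar app.jar

# JDK 9 and newer, unified logging
java -Xlog:gc*:file=gc.log -jar app.jar
```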

&lt;h2&gt;
  
  
  Starting GC Tuning
&lt;/h2&gt;

&lt;p&gt;Start by looking at how your application behaves, what events fill up the memory space, and what space is filled. Remember that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New objects are allocated in the Eden space; those that survive a minor collection are moved to the Survivor space&lt;/li&gt;
&lt;li&gt;Objects in the Survivor space have their age counter increased on each minor collection and are promoted to the Tenured generation once the counter is high enough&lt;/li&gt;
&lt;li&gt;Objects in the Tenured generation are ignored by minor collections and are only cleaned during major collections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need to be sure you understand what is happening inside your application’s heap, and keep in mind what causes the garbage collection events. That will help you understand your application’s memory needs and how to improve garbage collection.&lt;/p&gt;

&lt;p&gt;Let’s start tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Size
&lt;/h2&gt;

&lt;p&gt;You would be surprised how often setting the correct heap size is overlooked. As consultants, we’ve seen a few of those, believe us. Start by checking if your heap size is really well set up.&lt;/p&gt;

&lt;p&gt;What should you consider when setting up the heap for your application? It depends on many factors of course. There are systems like Apache Solr or Elasticsearch which are heavily I/O dependent and can share the operating system file system cache. In such cases, you should leave as much memory as you can for the operating system, especially if your data is large. If your application processes a lot of data or does a lot of parsing, larger heaps may be needed.&lt;/p&gt;

&lt;p&gt;Anyway, you should remember that up to &lt;strong&gt;32GB&lt;/strong&gt; of heap size you benefit from so-called &lt;strong&gt;compressed ordinary object pointers&lt;/strong&gt;. &lt;strong&gt;Ordinary object pointers&lt;/strong&gt;, or OOPs, are 64-bit pointers to memory. They allow the JVM to reference objects on the heap. At least this is how it works without getting deep into the internals.&lt;/p&gt;

&lt;p&gt;Up to 32GB of the heap size, JVM can compress those OOPs and thus save memory. This is how you can imagine the compressed ordinary object pointer in the JVM world:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj8k29g42gwor0i8v65ax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fj8k29g42gwor0i8v65ax.png" alt="Compressed OOPs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first 32 bits are used for the actual memory reference and are stored on the heap. 32 bits is enough to address every object on heaps of up to 32GB. How do we calculate that? We have 2&lt;sup&gt;32&lt;/sup&gt; – the space that can be addressed by a 32-bit pointer. Because of the three zero bits in the tail of our pointer we have 2&lt;sup&gt;32+3&lt;/sup&gt;, which gives us 2&lt;sup&gt;35&lt;/sup&gt;, so 32GB of memory space that can be addressed. That’s the maximum heap size we can use with compressed ordinary object pointers.&lt;/p&gt;
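
&lt;p&gt;We can check that arithmetic quickly in the shell – a 32-bit pointer shifted by the three spare bits addresses 2&lt;sup&gt;35&lt;/sup&gt; bytes, which is exactly 32GB:&lt;/p&gt;

```shell
# 2^(32+3) bytes expressed in gigabytes (1GB = 2^30 bytes)
addressable_bytes=$((2**35))
echo $((addressable_bytes / 2**30))   # prints 32
```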

&lt;p&gt;Going above &lt;strong&gt;32GB&lt;/strong&gt; of heap will result in the JVM using &lt;strong&gt;64-bit pointers&lt;/strong&gt;. Going from a 32GB to a 35GB heap, you are likely to end up with more or less the same amount of usable space. That depends on your application’s memory usage, but you need to take this into consideration and probably go well above 35GB to see a difference.&lt;/p&gt;

&lt;p&gt;Finally, how do you choose the proper heap size? Well, monitor your usage and see how your heap behaves. You can use your monitoring for that, like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; and its &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;JVM monitoring&lt;/a&gt; capabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv0mptr310gykd9wnl8ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv0mptr310gykd9wnl8ez.png" alt="JVM Pool Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fol2as4p63ttw17c5hcfx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fol2as4p63ttw17c5hcfx.png" alt="GC Collectors Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the JVM pool size and the GC summary charts. As you can see, the JVM heap size shows the shark tooth shape – a healthy pattern. Based on the first chart we can see that we need at least 500 – 600MB of memory for this application. The point where the memory is evacuated is around 1.2GB of the total heap size, for the G1 garbage collector in this case. In this scenario, the garbage collector runs for about 2 seconds in a 60 second time period, which means that the JVM spends around 3% of the time in garbage collection. This is good and healthy.&lt;/p&gt;

&lt;p&gt;We can also look at the average garbage collection time along with the 99th and 90th percentile:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuuj6mv6kc120s04kkzqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuuj6mv6kc120s04kkzqo.png" alt="GC Collectors Time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on that information we can see that we don’t need a higher heap. Garbage collection is fast and efficiently clears the data.&lt;/p&gt;

&lt;p&gt;On the other hand, if we know that our application is actively used and processing data, its heap is above 70 – 80% of the maximum that we set, and we see the GC struggling, then we know that we are in trouble. For example, look at this application’s memory pools:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi8neeztdvtfpdtqgmgiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi8neeztdvtfpdtqgmgiu.png" alt="JVM Pool Size and Utilization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that something started happening and that memory usage is constantly above 80% in the &lt;strong&gt;old generation&lt;/strong&gt; space. Correlate that with garbage collector work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fssifd6mcdv4f55rz5z5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fssifd6mcdv4f55rz5z5p.png" alt="GC Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And you can clearly see signs of high memory utilization. The garbage collector started doing more work while memory is not being cleared. That means that even though JVM is trying to clear the data – it can’t. This is a sign of trouble coming – we just don’t have enough space on the heap for new objects. But keep in mind this may also be a sign of memory leaks in the application. If you see memory growth over time and garbage collection not being able to free the memory you may be hitting an issue with the application itself. Something worth checking.&lt;/p&gt;

&lt;p&gt;So how do we set the heap size? By setting its minimum and maximum size. The minimum size is set using the &lt;em&gt;-Xms&lt;/em&gt; JVM parameter and the maximum size using the &lt;em&gt;-Xmx&lt;/em&gt; parameter. For example, to set the heap size of our application to &lt;strong&gt;2GB&lt;/strong&gt; we would add &lt;strong&gt;-Xms2g -Xmx2g&lt;/strong&gt; to our application startup parameters. In most cases, I would set them to the same value to avoid heap resizing, and in addition I would add the &lt;strong&gt;-XX:+AlwaysPreTouch&lt;/strong&gt; flag to load the memory pages at the start of the application.&lt;/p&gt;

&lt;p&gt;We can also control the size of the young generation heap space by using the &lt;em&gt;-Xmn&lt;/em&gt; property, just like the &lt;em&gt;-Xms&lt;/em&gt; and &lt;em&gt;-Xmx&lt;/em&gt;. This allows us to explicitly define the size of the young generation heap space when needed.&lt;/p&gt;
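
&lt;p&gt;Putting the heap flags together, a startup line could look like this – the jar name and the sizes are of course just an example:&lt;/p&gt;

```shell
# fixed 2GB heap, 512MB young generation, pages touched at startup
java -Xms2g -Xmx2g -Xmn512m -XX:+AlwaysPreTouch -jar app.jar
```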

&lt;h2&gt;
  
  
  Serial Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Serial Garbage Collector is the simplest, single-threaded garbage collector. You can &lt;strong&gt;turn on&lt;/strong&gt; the &lt;strong&gt;Serial&lt;/strong&gt; garbage collector by adding the &lt;strong&gt;-XX:+UseSerialGC&lt;/strong&gt; flag to your JVM application startup parameters. We won’t focus on tuning the serial garbage collector.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Parallel garbage collector is similar in its roots to the Serial garbage collector, but uses multiple threads to perform garbage collection on your application heap. You can turn on the Parallel garbage collector by adding the &lt;strong&gt;-XX:+UseParallelGC&lt;/strong&gt; flag to your JVM application startup parameters. To disable it entirely, use the &lt;strong&gt;-XX:-UseParallelGC&lt;/strong&gt; flag.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning the Parallel Garbage Collector
&lt;/h3&gt;

&lt;p&gt;As we’ve mentioned, the Parallel garbage collector &lt;strong&gt;uses multiple threads&lt;/strong&gt; to perform its cleaning duties. The &lt;strong&gt;number of threads&lt;/strong&gt; that the garbage collector can use is set using the &lt;strong&gt;-XX:ParallelGCThreads&lt;/strong&gt; flag added to our application startup parameters.&lt;/p&gt;

&lt;p&gt;For example, if we would like 4 threads to do the garbage collection, we would add the following flag to our application parameters: &lt;strong&gt;-XX:ParallelGCThreads=4&lt;/strong&gt;. Keep in mind that the more threads you dedicate to cleaning duties, the faster collections can get. But there is also a downside to having more garbage collection threads. Each GC thread involved in a minor garbage collection event reserves a portion of the tenured generation heap for promotions. This divides the space and causes fragmentation – the more threads, the higher the fragmentation. Reducing the number of Parallel garbage collection threads and increasing the size of the old generation will help with fragmentation if it becomes an issue.&lt;/p&gt;

&lt;p&gt;The second option that can be used is &lt;strong&gt;-XX:MaxGCPauseMillis&lt;/strong&gt;. It specifies the &lt;strong&gt;maximum pause time goal&lt;/strong&gt; – the longest we want a single garbage collection pause to take, defined in milliseconds. For example, with the flag &lt;strong&gt;-XX:MaxGCPauseMillis=100&lt;/strong&gt; we tell the Parallel garbage collector that we would like individual pauses to stay below 100 milliseconds. To meet a low pause goal the collector may shrink the generations, which makes collections happen more often. If the value is too small, the application can end up spending the majority of its time in garbage collection instead of executing business logic.&lt;/p&gt;
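
&lt;p&gt;Combining the Parallel collector options discussed so far could look like the following sketch; the values and the jar name are illustrative, not recommendations:&lt;/p&gt;

```shell
# Parallel GC with 4 GC threads and a 100ms pause time goal
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:MaxGCPauseMillis=100 -jar app.jar
```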

&lt;p&gt;The &lt;strong&gt;maximum throughput target&lt;/strong&gt; can be set using the &lt;strong&gt;-XX:GCTimeRatio&lt;/strong&gt; flag. It defines the &lt;strong&gt;ratio&lt;/strong&gt; between the &lt;strong&gt;time spent in GC&lt;/strong&gt; and the &lt;strong&gt;time spent outside of GC&lt;/strong&gt;. The fraction of time spent in garbage collection is &lt;em&gt;1/(1 + GC_TIME_RATIO_VALUE)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For example, setting &lt;strong&gt;-XX:GCTimeRatio=9&lt;/strong&gt; means that 10% of the application’s working time may be spent in the garbage collection. This means that the application should get 9 times more working time compared to the time given to garbage collection.&lt;/p&gt;

&lt;p&gt;By default, the value of &lt;strong&gt;-XX:GCTimeRatio&lt;/strong&gt; flag is set to 99 by the JVM, which means that the application will get 99 times more working time compared to the garbage collection which is a good trade-off for the server-side applications.&lt;/p&gt;
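
&lt;p&gt;The formula is easy to check in the shell – a ratio of 9 allows 10% of the time in GC, while the default of 99 allows 1%:&lt;/p&gt;

```shell
# percentage of time allowed in GC = 100 / (1 + GCTimeRatio)
for ratio in 9 99; do
  echo "GCTimeRatio=$ratio allows $((100 / (1 + ratio)))% of time in GC"
done
```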

&lt;p&gt;You can also control the adjustment of the generations of the Parallel garbage collector. The goals for the Parallel garbage collector are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;meet the maximum pause time goal&lt;/li&gt;
&lt;li&gt;meet the throughput goal, but only if the pause time goal is met&lt;/li&gt;
&lt;li&gt;minimize the footprint, but only if the first two goals are met&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Parallel garbage collector grows and shrinks the generations to achieve the goals above. Growing and shrinking the generations is done in increments at a fixed percentage. By default, the generation grows in increments of 20% and shrinks in increments of 5%. Each generation is configured on its own. The percentage of the growth of a generation is controlled by the &lt;strong&gt;-XX:YoungGenerationSizeIncrement&lt;/strong&gt; flag. The growth of the old generation is controlled by the &lt;strong&gt;-XX:TenuredGenerationSizeIncrement&lt;/strong&gt; flag.&lt;/p&gt;

&lt;p&gt;The shrinking part can be controlled by the &lt;strong&gt;-XX:AdaptiveSizeDecrementScaleFactor&lt;/strong&gt; flag. For example, the percentage of the shrinking increment for the young generation is set by dividing the value of &lt;strong&gt;-XX:YoungGenerationSizeIncrement&lt;/strong&gt; flag by the value of the &lt;strong&gt;-XX:AdaptiveSizeDecrementScaleFactor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the pause time goal is not achieved the generations will be shrunk one at a time. If the pause time of both generations is above the goal, the generation that caused threads to stop for the longer period of time will be shrunk first. If the throughput goal is not met, both the young and old generations will be grown.&lt;/p&gt;

&lt;p&gt;The Parallel garbage collector can throw an &lt;strong&gt;OutOfMemoryError&lt;/strong&gt; if too much time is spent in garbage collection. By default, if more than 98% of the time is spent in garbage collection and less than 2% of the heap is recovered, such an error will be thrown. If we want to disable that behavior we can add the &lt;strong&gt;-XX:-UseGCOverheadLimit&lt;/strong&gt; flag. But please be aware that a garbage collector working for an extensive amount of time and clearing very little or close to no memory at all usually means that your heap size is too low or your application suffers from memory leaks.&lt;/p&gt;

&lt;p&gt;Once you know all of this we can start looking at &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;garbage collector logs&lt;/a&gt;. They will tell us about the events that our Parallel garbage collector performs. That should give us the basic idea of where to start the tuning and which part of the heap is not healthy or could use some improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrent Mark Sweep Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Concurrent Mark Sweep garbage collector is a mostly concurrent implementation that does the majority of its work concurrently with the application, sharing processor resources with the application threads. You can turn it on by adding the &lt;strong&gt;-XX:+UseConcMarkSweepGC&lt;/strong&gt; flag to your JVM application startup parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning the Concurrent Mark Sweep Garbage Collector
&lt;/h3&gt;

&lt;p&gt;Similar to the other collectors available in the JVM world, the CMS garbage collector is generational, which means that you can expect two types of events to happen – minor and major collections. The idea here is that most work is done in parallel with the application threads to prevent the tenured generation from getting full. During normal work, most of the garbage collection happens without stopping application threads. During a major collection, CMS only stops the threads for very short periods at the beginning and in the middle of the collection. Minor collections are done in a very similar way to the Parallel garbage collector – all application threads are stopped during GC.&lt;/p&gt;

&lt;p&gt;One of the signals that your CMS garbage collector needs tuning is &lt;strong&gt;concurrent mode failures&lt;/strong&gt;. These indicate that the Concurrent Mark Sweep garbage collector was not able to reclaim all unreachable objects before the old generation filled up, or that the tenured generation was too fragmented to provide enough contiguous space to promote objects.&lt;/p&gt;

&lt;p&gt;But what about the concurrency we’ve mentioned? Let’s get back to the pauses for a while. During a collection cycle, the CMS garbage collector pauses the application twice. The first pause is called the &lt;strong&gt;initial mark pause&lt;/strong&gt;. It is used to mark the live objects that are directly reachable from the roots and from other places in the heap. The second pause, called the &lt;strong&gt;remark pause&lt;/strong&gt;, happens at the end of the concurrent tracing phase. It finds objects that were missed during the initial mark pause, mainly because they were updated in the meantime. The concurrent tracing phase runs between those two pauses. During this phase, one or more garbage collector threads may be working to clear the garbage. After the whole cycle ends, the Concurrent Mark Sweep garbage collector waits for the next cycle while consuming close to no resources. However, be aware that during the concurrent phase your application may experience performance degradation.&lt;/p&gt;

&lt;p&gt;The collection of the &lt;strong&gt;tenured generation space&lt;/strong&gt; must be &lt;strong&gt;timed&lt;/strong&gt; well when using the CMS garbage collector. Because &lt;strong&gt;concurrent mode failures&lt;/strong&gt; can be &lt;strong&gt;expensive&lt;/strong&gt;, we need to properly &lt;strong&gt;adjust&lt;/strong&gt; the &lt;strong&gt;start&lt;/strong&gt; of the &lt;strong&gt;old generation&lt;/strong&gt; heap &lt;strong&gt;cleaning&lt;/strong&gt; so that we don’t hit such events. We can do that using the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; flag. It sets the &lt;strong&gt;percentage&lt;/strong&gt; of &lt;strong&gt;old generation&lt;/strong&gt; heap utilization at which the CMS should &lt;strong&gt;start clearing&lt;/strong&gt; it. For example, to start at 75% we would set the mentioned flag to &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction=75&lt;/strong&gt;. This is only an informative value, though, and the garbage collector will still use heuristics to try to determine the best possible moment for starting its old generation cleaning job. To avoid the heuristics we can use the &lt;strong&gt;-XX:+UseCMSInitiatingOccupancyOnly&lt;/strong&gt; flag. That way the JVM will stick strictly to the percentage from the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; setting.&lt;/p&gt;
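
&lt;p&gt;Put together, a CMS configuration with a fixed initiating occupancy could look like the following sketch; the 75% threshold is just the example value from above and the jar name is hypothetical:&lt;/p&gt;

```shell
# CMS starting old generation cleaning at 75% occupancy, heuristics disabled
java -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -jar app.jar
```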

&lt;p&gt;So when setting the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; flag to a higher value you delay the cleaning of the old generation space on the heap. This means that your application will run longer before CMS kicks in to clear the tenured space, but when the collection starts it may be more expensive because there is more work to do. On the other hand, setting the &lt;strong&gt;-XX:CMSInitiatingOccupancyFraction&lt;/strong&gt; flag to a lower value will make the CMS tenured generation cleaning run more often, but each cycle may be faster. Which one to choose depends on your application and needs to be adjusted per use case.&lt;/p&gt;

&lt;p&gt;We can also tell our garbage collector to collect the young generation heap during the remark pause or before doing the Full GC. The first is done by adding the &lt;strong&gt;-XX:+CMSScavengeBeforeRemark&lt;/strong&gt; flag to our startup parameters. The second is done by adding the &lt;strong&gt;-XX:+ScavengeBeforeFullGC&lt;/strong&gt; flag to our application startup parameters. As a result, it can improve garbage collection performance as it will not need to check for references between the young and old generation heap spaces.&lt;/p&gt;

&lt;p&gt;The remark phase of the Concurrent Mark Sweep garbage collector can potentially be sped up. By default it is single-threaded and, as you recall, it stops all the application threads. By including the &lt;strong&gt;-XX:+CMSParallelRemarkEnabled&lt;/strong&gt; flag in our application startup parameters, we can force the remark phase to use multiple threads. However, because of certain implementation details, the multi-threaded version of the remark phase is not always faster than the single-threaded version. That’s something you have to check and test in your environment.&lt;/p&gt;

&lt;p&gt;Similar to the Parallel garbage collector, the Concurrent Mark Sweep garbage collector can throw &lt;strong&gt;OutOfMemory&lt;/strong&gt; exceptions if &lt;strong&gt;too much time&lt;/strong&gt; is spent in &lt;strong&gt;garbage collection&lt;/strong&gt;. By default, if more than 98% of the time is spent in garbage collection and less than 2% of the heap is recovered such an exception will be thrown. If we want to disable that behavior we can add the &lt;strong&gt;-XX:-UseGCOverheadLimit&lt;/strong&gt; flag. The difference compared to the Parallel garbage collector is that the &lt;strong&gt;time&lt;/strong&gt; that &lt;strong&gt;counts towards the 98%&lt;/strong&gt; is only counted when the &lt;strong&gt;application threads&lt;/strong&gt; are &lt;strong&gt;stopped&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  G1 Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The G1 garbage collector is the default garbage collector in recent Java versions and is targeted at latency-sensitive applications. You can turn it on by adding the &lt;strong&gt;-XX:+UseG1GC&lt;/strong&gt; flag to your JVM application startup parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning G1 Garbage Collector
&lt;/h3&gt;

&lt;p&gt;There are two things worth mentioning. The G1 garbage collector tries to perform the longer operations concurrently, without stopping the application threads, while the quick operations are performed during short pauses when application threads are stopped. So it’s yet another implementation of a &lt;strong&gt;mostly concurrent&lt;/strong&gt; garbage collection algorithm.&lt;/p&gt;

&lt;p&gt;The G1 garbage collector cleans memory mostly in &lt;strong&gt;evacuation fashion&lt;/strong&gt;. Live objects from one memory area are copied to a new area and compacted along the way. After the process is done, the memory area from which the object was copied is again available for object allocation.&lt;/p&gt;

&lt;p&gt;On a very high level, the G1GC alternates between two phases. The first phase is called &lt;strong&gt;young-only&lt;/strong&gt; and focuses on the young generation space. During that phase, the objects are moved gradually from the young generation to the old generation space. The second phase is called &lt;strong&gt;space reclamation&lt;/strong&gt; and incrementally reclaims the space in the old generation while also taking care of the young generation at the same time. Let’s look closer at those phases as there are some properties we can tune there.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;young-only phase&lt;/strong&gt; starts with a few young-generation collections that promote objects to the tenured generation. That phase is active until the old generation space reaches a certain threshold. By default, it’s 45% utilization and we can control that by setting the &lt;strong&gt;-XX:InitiatingHeapOccupancyPercent&lt;/strong&gt; flag and its value. Once that threshold is hit, G1 starts a different young generation collection, one called &lt;strong&gt;concurrent start&lt;/strong&gt;. Note that the value of the &lt;strong&gt;-XX:InitiatingHeapOccupancyPercent&lt;/strong&gt; flag is only the initial threshold, which is further adjusted adaptively by the garbage collector. To turn off the adjustments add the &lt;strong&gt;-XX:-G1UseAdaptiveIHOP&lt;/strong&gt; flag to your JVM startup parameters.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;concurrent start&lt;/strong&gt;, in addition to the normal young generation collection, starts the object marking process. It determines all live, reachable objects in the old generation space that need to be kept for the following space reclamation phase. To finish the marking process, two additional steps are introduced – remark and cleanup. Both of them pause the application threads. The remark step performs global processing of references, unloads classes, completely reclaims empty regions and cleans up internal data structures. The cleanup step determines whether the space-reclamation phase is needed. If it is, the young-only phase ends with a Prepare Mixed young collection and the space-reclamation phase is launched.&lt;/p&gt;

&lt;p&gt;The space-reclamation phase contains multiple Mixed garbage collections that work on both young and old generation regions of the G1GC heap space. The space-reclamation phase ends when the G1GC sees that evacuating more old generation regions wouldn’t give enough free space to make the effort of reclaiming the space worthwhile. That threshold can be set using the &lt;strong&gt;-XX:G1HeapWastePercent&lt;/strong&gt; flag.&lt;/p&gt;

&lt;p&gt;We can also control, at least to some degree, whether the periodic garbage collection will run. By using the &lt;strong&gt;-XX:G1PeriodicGCSystemLoadThreshold&lt;/strong&gt; flag we can set the average system load above which the periodic garbage collection will not be run. For example, if our system’s load average for the last minute is 10 and we set the &lt;strong&gt;-XX:G1PeriodicGCSystemLoadThreshold=10&lt;/strong&gt; flag, the periodic garbage collection will not be executed.&lt;/p&gt;
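&lt;p&gt;The rule above can be sketched as a tiny predicate (the names are illustrative, not a JVM API; this only models the documented behavior):&lt;/p&gt;

```java
public class PeriodicGcGate {
    // Illustrative model of the rule: periodic GC is skipped when the recent
    // system load reaches the -XX:G1PeriodicGCSystemLoadThreshold value.
    static boolean periodicGcAllowed(double oneMinuteLoad, double threshold) {
        return oneMinuteLoad < threshold;
    }

    public static void main(String[] args) {
        // Load average of 10 with a threshold of 10: periodic GC does not run.
        System.out.println(periodicGcAllowed(10.0, 10.0)); // prints false
    }
}
```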

&lt;p&gt;The G1 garbage collector, apart from the &lt;em&gt;-Xmx&lt;/em&gt; and &lt;em&gt;-Xms&lt;/em&gt; flags, allows us to use a set of flags to size the heap and its regions. We can use the &lt;strong&gt;-XX:MinHeapFreeRatio&lt;/strong&gt; flag to tell the garbage collector the minimum ratio of free memory that should be maintained and the &lt;strong&gt;-XX:MaxHeapFreeRatio&lt;/strong&gt; flag to set the desired maximum ratio of free memory on the heap. We also know that G1GC tries to keep the young generation size between the values of &lt;strong&gt;-XX:G1NewSizePercent&lt;/strong&gt; and &lt;strong&gt;-XX:G1MaxNewSizePercent&lt;/strong&gt;. That also determines the pause times. Decreasing the young generation size may shorten individual pauses, at the cost of more frequent collections. We can also set a strict size for the young generation by using the &lt;strong&gt;-XX:NewSize&lt;/strong&gt; and &lt;strong&gt;-XX:MaxNewSize&lt;/strong&gt; flags.&lt;/p&gt;

&lt;p&gt;The documentation on tuning the G1 garbage collector says that in general we shouldn’t touch it. If anything, we should only modify the desired pause times for different heap sizes. Fair enough. But it’s also good to know what we can tune, how, and how those properties affect the G1 garbage collector behavior.&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;tuning for&lt;/strong&gt; garbage collector &lt;strong&gt;latency&lt;/strong&gt; we should keep the pause time to a minimum. That means that in most cases the &lt;em&gt;-Xmx&lt;/em&gt; and &lt;em&gt;-Xms&lt;/em&gt; values should be set to the same value, and we should also pre-load the memory pages during application start by using the &lt;strong&gt;-XX:+AlwaysPreTouch&lt;/strong&gt; flag.&lt;/p&gt;

&lt;p&gt;If your young-only phase takes too long, decreasing the &lt;strong&gt;-XX:G1NewSizePercent&lt;/strong&gt; (defaults to 5) value is a good idea. In some cases decreasing the &lt;strong&gt;-XX:G1MaxNewSizePercent&lt;/strong&gt; (defaults to 60) value can also help. If the Mixed collections take too long, we are advised to increase the value of the &lt;strong&gt;-XX:G1MixedGCCountTarget&lt;/strong&gt; flag to spread the tenured generation GC across more collections, and to increase &lt;strong&gt;-XX:G1HeapWastePercent&lt;/strong&gt; to stop the old generation garbage collection earlier. You can also change the &lt;strong&gt;-XX:G1MixedGCLiveThresholdPercent&lt;/strong&gt; flag – it controls the live-object occupancy threshold above which an old generation region will be excluded from the mixed collection. Regions that hold a lot of live objects take a longer time to collect garbage from, so lowering this value tells the garbage collector to skip more of the heavily occupied old generation regions when doing the mixed collection. If you’re seeing high Update RS and Scan RS times, decreasing the &lt;strong&gt;-XX:G1RSetUpdatingPauseTimePercent&lt;/strong&gt; flag value, including the &lt;strong&gt;-XX:-ReduceInitialCardMarks&lt;/strong&gt; flag, and increasing the &lt;strong&gt;-XX:G1RSetRegionEntries&lt;/strong&gt; flag value may help. There is also one additional flag, &lt;strong&gt;-XX:MaxGCPauseMillis&lt;/strong&gt; (defaults to 200), which defines the maximum desired pause time. If you would like to reduce the pause time, lowering this value may help as well.&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;tuning for throughput&lt;/strong&gt; we want the garbage collector to clean as much garbage as possible, mostly in the case of systems that process and hold a lot of data. The first thing that you should go for is increasing the &lt;strong&gt;-XX:MaxGCPauseMillis&lt;/strong&gt; value. By doing that we relax the garbage collector, allowing it to work longer and process more objects on the heap. However, that may not be enough. In such cases increasing the &lt;strong&gt;-XX:G1NewSizePercent&lt;/strong&gt; flag value should help. In some cases the throughput may be limited by the size of the young generation regions – there, increasing the &lt;strong&gt;-XX:G1MaxNewSizePercent&lt;/strong&gt; flag value should help.&lt;/p&gt;

&lt;p&gt;We can also decrease the amount of concurrent work, which requires a lot of CPU. Increasing the &lt;strong&gt;-XX:G1RSetUpdatingPauseTimePercent&lt;/strong&gt; flag value moves more of the remembered set update work into the pauses, when the application threads are stopped, and decreases the time spent in the concurrent parts of the cycle. Also, similar to latency tuning, you may want to set the -Xmx and -Xms flags to the same value to avoid heap resizing, and load the memory pages up front by using the &lt;strong&gt;-XX:+AlwaysPreTouch&lt;/strong&gt; and &lt;strong&gt;-XX:+UseLargePages&lt;/strong&gt; flags. But please remember to apply the changes one by one and compare the results so that you understand what is happening.&lt;/p&gt;

&lt;p&gt;Finally, we can &lt;strong&gt;tune&lt;/strong&gt; for &lt;strong&gt;heap size&lt;/strong&gt;. There is a single option that we can think about here, the &lt;strong&gt;-XX:GCTimeRatio&lt;/strong&gt; (defaults to 12). It determines the ratio of time spent in garbage collection compared to application threads doing their work and is calculated as &lt;em&gt;1/(1 + GCTimeRatio)&lt;/em&gt;. The default value will result in about 8% of the application working time to be spent in garbage collection, which is more than the Parallel GC. More time in garbage collection will allow clearing more space on the heap, but this is highly dependent on the application and it is hard to give general advice. Experiment to find the value that suits your needs.&lt;/p&gt;
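&lt;p&gt;The arithmetic behind the "about 8%" figure can be checked with a one-liner (class and method names are ours, for illustration only):&lt;/p&gt;

```java
public class GcTimeRatio {
    // Fraction of total time the collector may use: 1 / (1 + GCTimeRatio).
    static double gcTimeShare(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }

    public static void main(String[] args) {
        // The G1 default of 12 allows roughly 1/13, i.e. about 7.7% of the
        // total time to be spent in garbage collection.
        System.out.printf("%.3f%n", gcTimeShare(12));
    }
}
```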

&lt;p&gt;There are also general tunable parameters for the G1 garbage collector. We can control the degree of parallelization when using this garbage collector by including the &lt;strong&gt;-XX:+ParallelRefProcEnabled&lt;/strong&gt; flag and changing the &lt;strong&gt;-XX:ReferencesPerThread&lt;/strong&gt; flag value. For every N references, where N is defined by the &lt;strong&gt;-XX:ReferencesPerThread&lt;/strong&gt; flag, a single thread will be used. Setting this value to 0 will tell the G1 garbage collector to always use the number of threads specified by the &lt;strong&gt;-XX:ParallelGCThreads&lt;/strong&gt; flag value. For more parallelization, decrease the &lt;strong&gt;-XX:ReferencesPerThread&lt;/strong&gt; flag value. This should speed up the parallel parts of the garbage collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Z Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The Z garbage collector is a still experimental, very scalable, low latency implementation. If you would like to experiment with it you must use JDK 11 or newer and add the &lt;strong&gt;-XX:+UseZGC&lt;/strong&gt; flag to your application startup parameters, along with the &lt;strong&gt;-XX:+UnlockExperimentalVMOptions&lt;/strong&gt; flag, as the Z garbage collector is still considered experimental.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tuning the Z Garbage Collector
&lt;/h3&gt;

&lt;p&gt;There aren’t many parameters that we can play around with when it comes to the Z garbage collector. As the documentation states, the most important option here is the maximum heap size, so the -Xmx flag. Because the Z garbage collector is a concurrent collector, the heap size must be adjusted so that it can hold the live set of objects of your application and leaves enough headroom for allocations while the garbage collector is running. This means that the heap size may need to be higher compared to other garbage collectors, and the more memory you assign to the heap the better results you may expect from the garbage collector.&lt;/p&gt;

&lt;p&gt;The second option is, of course, the number of threads that the Z garbage collector will use. After all, it is a concurrent collector, so it can utilize more than a single thread. We can set the number of threads with the &lt;strong&gt;-XX:ConcGCThreads&lt;/strong&gt; flag. The collector itself uses heuristics to choose the proper number of threads, but as usual this is highly dependent on the application, and in some cases setting that number to a static value may bring better results. However, that needs to be tested as it is very use-case dependent. There are two things to remember, though: if you assign too many threads to the garbage collector, your application may not have enough computing power to do its job; if you set the number of garbage collector threads too low, garbage may not be collected fast enough. Take that into consideration when tuning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Other JVM Options
&lt;/h1&gt;

&lt;p&gt;We’ve covered quite a lot when it comes to garbage collection parameters and how they affect garbage collection. But, not everything. There is way more to it than that. Of course, we won’t talk about every single parameter, it just doesn’t make sense. However, there are a few more things that you should know about.&lt;/p&gt;

&lt;h2&gt;
  
  
  JVM Statistics Causing Long Garbage Collection Pauses
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.evanjones.ca/jvm-mmap-pause.html" rel="noopener noreferrer"&gt;Some people&lt;/a&gt; reported that on Linux systems, during high I/O utilization the garbage collection can pause threads for a long period of time. This is probably caused by the JVM using a memory-mapped file called hsperfdata. That file is written in the /tmp directory and is used for keeping the statistics and safepoints. The mentioned file is updated during GC. On Linux, modifying a memory-mapped file can be blocked until I/O completes. As you can imagine such an operation can take a longer period of time, presumably hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;How do you spot such an issue in your environment? You need to look into the timings of your garbage collection. If you see in the garbage collection logs that the real time spent by the JVM on garbage collection is way longer than the user and sys times combined, you have a potential candidate. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[Times: user=0.13 sys=0.11, real=5.45 secs]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
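&lt;p&gt;A quick way to scan your logs for this pattern is to compare the real time against the user and sys times. Here is a sketch (the regex, the class name, and the 2x threshold are our own choices for illustration):&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcTimesCheck {
    private static final Pattern TIMES = Pattern.compile(
            "user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+)");

    // Returns true when wall-clock time greatly exceeds CPU time spent,
    // a hint that GC threads were blocked on something like I/O.
    static boolean suspicious(String logLine) {
        Matcher m = TIMES.matcher(logLine);
        if (!m.find()) return false;
        double user = Double.parseDouble(m.group(1));
        double sys = Double.parseDouble(m.group(2));
        double real = Double.parseDouble(m.group(3));
        return real > 2 * (user + sys); // threshold chosen for illustration
    }

    public static void main(String[] args) {
        // The log line from the example above: real (5.45s) dwarfs user+sys (0.24s).
        System.out.println(suspicious("[Times: user=0.13 sys=0.11, real=5.45 secs]"));
    }
}
```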

&lt;p&gt;If your system is heavily I/O bound and you see the mentioned behavior, you can move your GC logs and the /tmp directory to a fast SSD drive. With recent JDK versions the temporary directory that the JVM uses for this file is hardcoded, so we can’t use &lt;strong&gt;-Djava.io.tmpdir&lt;/strong&gt; to change it. You can also add the &lt;strong&gt;-XX:+PerfDisableSharedMem&lt;/strong&gt; flag to your JVM application parameters. Be aware, though, that including that option will break tools that use the statistics from the hsperfdata file – for example, jstat will not work.&lt;/p&gt;

&lt;p&gt;You can read more on that issue in the blog post from the &lt;a href="https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic" rel="noopener noreferrer"&gt;Linkedin engineering&lt;/a&gt; team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Dump on Out Of Memory Exception
&lt;/h2&gt;

&lt;p&gt;One thing that can be very useful when dealing with Out Of Memory errors, diagnosing their cause, and looking into problems like memory leaks is a heap dump. A heap dump is basically a file with the contents of the heap written to disk. We can generate heap dumps on demand, but it takes time and can freeze the application or, in the best-case scenario, make it slow. And if our application crashes we can’t grab the heap dump – it’s already gone.&lt;/p&gt;

&lt;p&gt;To avoid losing information that can help us diagnose problems, we can instruct the JVM to create a heap dump when the OutOfMemory error happens. We do that by including the &lt;strong&gt;-XX:+HeapDumpOnOutOfMemoryError&lt;/strong&gt; flag. We can also specify where the heap dump should be stored by using the &lt;strong&gt;-XX:HeapDumpPath&lt;/strong&gt; flag and setting its value to the location we want to write the heap dump to. For example: &lt;strong&gt;-XX:HeapDumpPath=/tmp/heapdump.hprof&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Keep in mind that the heap dump file may be very big – as large as your heap size. So you need to account for that when setting the path where the file should be written. We’ve seen situations where the JVM was not able to write the 64GB heap dump file on the target file system.&lt;/p&gt;
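&lt;p&gt;Besides the on-crash flags, a heap dump can also be triggered programmatically. The sketch below uses the HotSpot-specific &lt;em&gt;HotSpotDiagnosticMXBean&lt;/em&gt;, so it is not guaranteed to work on every JVM; the class and method names around it are ours:&lt;/p&gt;

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class HeapDumpDemo {
    // Writes a heap dump to a temporary file and returns its size in bytes.
    // Relies on the HotSpot-specific diagnostic bean, so this is a sketch
    // for HotSpot-based JVMs rather than a portable API.
    static long dumpToTempFile() throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.getPlatformMXBean(
                HotSpotDiagnosticMXBean.class);
        File dump = File.createTempFile("heapdump", ".hprof");
        dump.delete(); // dumpHeap refuses to overwrite an existing file
        bean.dumpHeap(dump.getAbsolutePath(), true); // true = live objects only
        long size = dump.length();
        dump.delete(); // clean up the temporary dump
        return size;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dumpToTempFile() > 0);
    }
}
```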

&lt;p&gt;For analysis of the file, there are tools that you can use. There are open-source tools like &lt;a href="https://www.eclipse.org/mat/" rel="noopener noreferrer"&gt;MAT&lt;/a&gt; and proprietary tools like &lt;a href="https://www.yourkit.com/" rel="noopener noreferrer"&gt;YourKit Java Profiler&lt;/a&gt; or &lt;a href="https://www.ej-technologies.com/products/jprofiler/overview.html" rel="noopener noreferrer"&gt;JProfiler&lt;/a&gt;. There are also services like &lt;a href="https://heaphero.io/" rel="noopener noreferrer"&gt;heaphero.io&lt;/a&gt; that can help you with the analysis, while older versions of the Oracle JDK distribution come with &lt;a href="https://docs.oracle.com/javase/7/docs/technotes/tools/share/jhat.html" rel="noopener noreferrer"&gt;jhat – the Java Heap Analysis Tool&lt;/a&gt;. Choose the one that you like and that fits your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using -XX:+AggressiveOpts
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;-XX:+AggressiveOpts&lt;/strong&gt; flag turns on additional flags that have proven to increase performance in a set of benchmarks. Those flags can change from version to version and include options like a larger autoboxing cache and aggressive elimination of autoboxing. The flag also disables the biased locking delay. Should you use it? That depends on your use case and your production system. As usual, test in your environment, compare instances with and without the flag and see how large of a difference it makes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Tuning garbage collection is not an easy task. It requires knowledge and understanding. You need to know the garbage collector that you are working with and you need to understand your application’s memory needs. Every application is different and has different memory usage patterns, thus requires different garbage collection strategies. It’s also not a quick task. It will take time and resources to make improvements in iterations that will show you if you are going in the right direction with each and every change.&lt;/p&gt;

&lt;p&gt;Remember that we only touched the tip of the iceberg when it comes to tuning garbage collectors in the JVM world. We’ve only mentioned a limited number of available flags that you can turn on/off and adjust. For additional context and learning, I suggest going to &lt;a href="https://docs.oracle.com/en/java/javase/13/gctuning/" rel="noopener noreferrer"&gt;Oracle HotSpot VM Garbage Collection Tuning Guide&lt;/a&gt; and reading the parts that you think may be of interest to you. Look at your garbage collection logs, analyze them, try to understand them. It will help you in understanding your environment and what’s happening inside the JVM when garbage is collected. In addition to that, experiment a lot! Experiment in your test environment, on your developer machines, experiment in some of the production or pre-production instances and observe the difference in behavior.&lt;/p&gt;

&lt;p&gt;Hopefully, this article will help you on your journey to a healthy garbage collection in your JVM based applications. Good luck!&lt;/p&gt;

</description>
      <category>java</category>
      <category>gc</category>
      <category>tuning</category>
      <category>performance</category>
    </item>
    <item>
      <title>A Quick Start on Java Garbage Collection: What it is, and How it works</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 27 Jan 2020 10:43:26 +0000</pubDate>
      <link>https://dev.to/sematext/a-quick-start-on-java-garbage-collection-what-it-is-and-how-it-works-14d9</link>
      <guid>https://dev.to/sematext/a-quick-start-on-java-garbage-collection-what-it-is-and-how-it-works-14d9</guid>
      <description>&lt;p&gt;In this tutorial, we will talk about how different Java Garbage Collectors work and what you can expect from them. This will give us the necessary background to start tuning the garbage collection algorithm of your choice.&lt;/p&gt;

&lt;p&gt;Before going into Java Garbage Collection tuning we need to understand two things. First of all, how garbage collection works in theory and how it works in the system we are going to tune. Our system’s garbage collector work is described by garbage collector logs and metrics from observability tools like &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud for JVM&lt;/a&gt;. We talked about how to read and understand &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;Java Garbage Collection logs&lt;/a&gt; in a previous blog post.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Garbage Collection in Java: A Definition
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Java Garbage Collection&lt;/strong&gt; is an automatic process during which the Java Virtual Machine inspects the objects on the heap, checks whether they are still referenced, and releases the memory used by those objects that are no longer needed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Object Eligibility: When Does Java Perform Garbage Collection
&lt;/h1&gt;

&lt;p&gt;Let’s take a quick look at when an object is ready to be collected by the garbage collector and how to actually request that the Java Virtual Machine start garbage collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Make an Object Eligible for GC?
&lt;/h2&gt;

&lt;p&gt;To put it straight – you don’t have to do anything explicitly to make an object eligible for garbage collection. When an object is no longer used in your application code, the heap space used by it can be reclaimed. Look at the following Java code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;variableOne&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;variableTwo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;variableOne&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;variableTwo&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;em&gt;run()&lt;/em&gt; method we explicitly create two variables. They are first put on the heap, in the young generation. Once the method finishes its execution they are no longer needed and they become eligible for garbage collection. When a young generation garbage collection happens, the memory used by those variables may be reclaimed. If that happens, the previously occupied memory will be visible as free.&lt;/p&gt;
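&lt;p&gt;One way to actually observe eligibility is with a &lt;em&gt;WeakReference&lt;/em&gt;, which does not keep its referent alive. This is a sketch (class and method names are ours); garbage collection is nondeterministic, so the loop retries the hint a number of times:&lt;/p&gt;

```java
import java.lang.ref.WeakReference;

public class Eligibility {
    // Demonstrates eligibility: once the only strong reference is dropped,
    // the object may be reclaimed and the weak reference is cleared.
    static boolean observedCollection() throws InterruptedException {
        Object data = new byte[1024];
        WeakReference<Object> ref = new WeakReference<>(data);
        data = null; // no strong references remain -> eligible for GC
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc(); // only a hint, hence the retry loop
            Thread.sleep(10);
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observedCollection()); // usually prints true on HotSpot
    }
}
```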

&lt;h2&gt;
  
  
  How to Request the JVM to Run GC?
&lt;/h2&gt;

&lt;p&gt;The best thing about Java garbage collection is that it is automatic. Until the time comes when you want and need to control and tune it, you don’t have to do anything. When the Java Virtual Machine decides it’s time to start reclaiming space on the heap and throwing away unused objects, it will just start the garbage collection process.&lt;/p&gt;

&lt;p&gt;If you want to force garbage collection you can use the System class from the java.lang package and its &lt;em&gt;gc()&lt;/em&gt; method, or the &lt;em&gt;Runtime.getRuntime().gc()&lt;/em&gt; call. As the &lt;a href="https://docs.oracle.com/javase/9/docs/api/java/lang/System.html#gc--" rel="noopener noreferrer"&gt;documentation states&lt;/a&gt; – the Java Virtual Machine will make its best effort to reclaim the space. This means that the garbage collection may actually not happen; it depends on the JVM. If the garbage collection does happen, it will be a Major collection, which means that we can expect a &lt;em&gt;stop-the-world&lt;/em&gt; event. In general, using &lt;em&gt;System.gc()&lt;/em&gt; is considered a bad practice and we should tune the work of the garbage collector instead of calling it explicitly.&lt;/p&gt;
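&lt;p&gt;The "best effort" nature of &lt;em&gt;System.gc()&lt;/em&gt; can be observed through the standard &lt;em&gt;GarbageCollectorMXBean&lt;/em&gt; collection counters. A sketch (the helper name is ours; with &lt;strong&gt;-XX:+DisableExplicitGC&lt;/strong&gt; set, the call becomes a no-op and the counters would not move):&lt;/p&gt;

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ExplicitGc {
    // Sums the collection counts reported by all garbage collector MXBeans.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count; // -1 means the count is undefined
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        System.gc(); // a request, not a command; may trigger a stop-the-world major GC
        long after = totalCollections();
        // Counters never go backwards; with default settings they usually increase.
        System.out.println(after >= before);
    }
}
```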

&lt;h1&gt;
  
  
  How Does Java Garbage Collection Work?
&lt;/h1&gt;

&lt;p&gt;No matter what implementation of the garbage collector we use, to clean up the memory, a short pause needs to happen. Those pauses are also called stop-the-world events or STW in short. You can envision your JVM-based application’s working cycles in the following way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fggg9ftbcticb4bh5zbv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fggg9ftbcticb4bh5zbv5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first step of the cycle starts when your application threads are started and your business code is working. This is where your application code is running. At a certain point in time, an event happens that triggers garbage collection. To clear the memory, application threads have to be stopped. This is where the work of your application stops and the next steps start. The garbage collector marks objects that are no longer used and reclaims the memory. Finally, an optional step of heap resizing may happen if possible. Then the cycle starts again and the application threads are resumed. The full cycle of the garbage collection is called the &lt;strong&gt;epoch&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The key when running JVM applications and tuning the garbage collector is to keep the application threads running for as long as possible. That means that the pauses caused by the garbage collector should be minimal.&lt;/p&gt;

&lt;p&gt;The second thing that we need to talk about is generations. Java garbage collectors are generational, which means that they work under certain principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Young data will not survive long&lt;/li&gt;
&lt;li&gt;Data that is old will continue to persist in memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s why JVM heap memory is divided into generations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Young generation&lt;/strong&gt; which is divided into two sections called &lt;strong&gt;Eden space&lt;/strong&gt; and &lt;strong&gt;Survivor space&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old generation&lt;/strong&gt;, or &lt;strong&gt;Tenured space&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fp3gez06rhvaclxfc70or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fp3gez06rhvaclxfc70or.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simplified promotion of objects between spaces and generations can be illustrated with the following example. When an object is created it is first put into the &lt;strong&gt;young generation&lt;/strong&gt; space into the &lt;strong&gt;Eden space&lt;/strong&gt;. Once the young garbage collection happens the object is promoted into the &lt;strong&gt;Survivor space 0&lt;/strong&gt; and next into the &lt;strong&gt;Survivor space 1&lt;/strong&gt;. If the object is still used at this point the next garbage collection cycle will move it to the &lt;strong&gt;Tenured space&lt;/strong&gt; which means that it is moved to the &lt;strong&gt;old generation&lt;/strong&gt;. You can imagine it as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwnaeh997aybve47dryvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwnaeh997aybve47dryvd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the &lt;strong&gt;Eden&lt;/strong&gt; space contains newly created objects and is empty at the beginning of an &lt;strong&gt;epoch&lt;/strong&gt;. During the epoch, the &lt;strong&gt;Eden&lt;/strong&gt; space fills up, eventually triggering a &lt;strong&gt;Minor GC&lt;/strong&gt; event. The &lt;strong&gt;Survivor&lt;/strong&gt; spaces contain objects that survived at least a single &lt;strong&gt;epoch&lt;/strong&gt;. Objects that survive through many &lt;strong&gt;epochs&lt;/strong&gt; will eventually be promoted to the &lt;strong&gt;Tenured&lt;/strong&gt; generation.&lt;/p&gt;
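
&lt;p&gt;The number of epochs an object must survive before being promoted is called the tenuring threshold. If you want to experiment with it, there is a flag for that – the value below is only an example:&lt;/p&gt;

```shell
# Allow promotion to the Tenured space after at most 10 survived
# young collections; 15 is the maximum value the JVM accepts.
java -XX:MaxTenuringThreshold=10 -jar my_awesome_app.jar
```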

&lt;p&gt;Before Java 8 there was one additional memory space called the &lt;strong&gt;PermGen&lt;/strong&gt;. &lt;strong&gt;PermGen&lt;/strong&gt;, or &lt;strong&gt;Permanent Generation&lt;/strong&gt;, was a special space separated from the other parts of the heap – the young and the tenured generation. It was used to store metadata such as classes and methods.&lt;/p&gt;

&lt;p&gt;Starting from Java 8, the &lt;strong&gt;Metaspace&lt;/strong&gt; is the memory space that replaces the removed PermGen space. The implementation differs from the PermGen: the Metaspace is allocated out of native memory rather than the heap, and it is automatically resized, limiting the problems of running out of memory in this region. The Metaspace can be garbage collected, and classes that are no longer used can be cleaned when the Metaspace reaches its maximum size.&lt;/p&gt;

&lt;p&gt;There are a few flags that can be used to control the size and the behavior of the Metaspace memory space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MetaspaceSize&lt;/strong&gt; – initial size of the Metaspace memory region,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MaxMetaspaceSize&lt;/strong&gt; – maximum size of the Metaspace memory region,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MinMetaspaceFreeRatio&lt;/strong&gt; – minimum percentage of class metadata capacity that should be free after garbage collection,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MaxMetaspaceFreeRatio&lt;/strong&gt; – maximum percentage of class metadata capacity that should be free after garbage collection.&lt;/li&gt;
&lt;/ul&gt;
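
&lt;p&gt;Putting the first two flags together, a startup line limiting the Metaspace could look like this (the sizes are placeholders):&lt;/p&gt;

```shell
# Start the Metaspace at 128 MB and never let it grow past 256 MB.
java -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=256m -jar my_awesome_app.jar
```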

&lt;p&gt;You can now imagine why some garbage collectors may need a considerable amount of time to clear the old generation space – it's done in a single step. The tenured generation is one big space of the heap, and to clear it, the application threads have to be stopped.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Heap Structure of G1 Garbage Collector
&lt;/h2&gt;

&lt;p&gt;What we wrote above is true for all garbage collectors, including the Serial, Parallel and Concurrent Mark Sweep ones. We will discuss them a bit later. However, the G1 garbage collector goes a step further and divides the heap into something called &lt;strong&gt;regions&lt;/strong&gt;. A &lt;strong&gt;region&lt;/strong&gt; is a small, independent heap that can be dynamically set to be of the Eden, Survivor or Tenured type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F15vbwyqcj9rselwvis6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F15vbwyqcj9rselwvis6b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the three mentioned types, we also have free memory – the white cells in the image.&lt;/p&gt;

&lt;p&gt;Such architecture allows for different operations. First of all, because the tenured generation is divided into regions, it can be collected in portions, which lowers latency and makes old generation collection faster. Such a heap can be easily defragmented and dynamically resized. No cons, right? Well, that's actually not true. The cost of maintaining such a heap architecture is higher compared to the traditional one: it requires more CPU and memory.&lt;/p&gt;

&lt;p&gt;The region size when using G1GC can be controlled. When the heap size is set lower than 4GB, the region size will automatically be set to 1MB. For heaps between 4 and 8GB the region size will be set to 2MB, and so on, up to a 32MB region size for heaps 64GB in size or larger. In general, the region size must be a power of two between 1 and 32MB. By default, the JVM will aim for around 2048 regions during the application start. We can set the region size explicitly by using the -XX:G1HeapRegionSize=N JVM parameter.&lt;/p&gt;
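
&lt;p&gt;The default sizing heuristic described above can be sketched as a quick back-of-the-envelope calculation. Note that this is a simplification of what the JVM really does, and the function name is ours:&lt;/p&gt;

```shell
# Approximate the default G1 region size for a heap given in megabytes:
# aim for roughly 2048 regions, round the result down to a power of two,
# and clamp it to the 1 MB - 32 MB range.
g1_region_size() {
  heap_mb=$1
  target=$(( heap_mb / 2048 ))
  size=1
  while [ $(( size * 2 )) -le "$target" ]; do
    size=$(( size * 2 ))
  done
  if [ "$size" -gt 32 ]; then
    size=32
  fi
  echo "${size}M"
}

g1_region_size 6144   # a 6 GB heap lands in the 2 MB bucket
```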

&lt;p&gt;The clearing of the heap in the case of G1GC is done by copying live data out of an existing region into an empty region and discarding the old region altogether. After that, the old region is considered free and objects can be allocated to it. Freeing multiple regions at the same time allows for defragmentation and assignment of &lt;strong&gt;humongous&lt;/strong&gt; objects – ones that are larger than 50% of a heap region.&lt;/p&gt;

&lt;p&gt;You may now wonder what triggers garbage collection, and that is a great question. Common triggers are the Eden space filling up, not enough free space to allocate a new object, and explicit requests such as a &lt;em&gt;System.gc()&lt;/em&gt; call or tools like jmap.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Triggers Java Garbage Collection
&lt;/h1&gt;

&lt;p&gt;To keep things even more complicated there are several types of garbage collection events. You can divide them in a very simplified way, as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minor&lt;/strong&gt; event – happens when the Eden space is full and moves surviving data to the Survivor space. A Minor event happens within the young generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed&lt;/strong&gt; event – a Minor event plus a reclaim of part of the Tenured generation; specific to the G1 collector&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full GC&lt;/strong&gt; event – a young and old generation space clearing together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even by looking at the names of the events, you can see that in most cases the key will be lowering the pause times of the Mixed and Full GC events. Let's stop discussing garbage collection events for now – there is more to them and we could go deeper and deeper, but for now, we should be good.&lt;/p&gt;

&lt;p&gt;The next thing that I would like to mention is the &lt;strong&gt;humongous&lt;/strong&gt; object. Remember? When dealing with the G1 garbage collector (G1GC), any object larger than 50% of the region size is considered humongous. Those objects are not allocated in the young generation space; instead, they are put directly into the Tenured generation. Such objects can increase the pause time of the garbage collector and can increase the risk of triggering a Full GC because of running out of contiguous free space.&lt;/p&gt;

&lt;h1&gt;
  
  
  Java Garbage Collectors Types
&lt;/h1&gt;

&lt;p&gt;We now understand the basics and it's time to learn what kind of garbage collectors we have available and how each of them behaves in our application. Keep in mind that different Java versions have different garbage collectors available. For example, Java 9 has both the Concurrent Mark Sweep and G1 garbage collectors, while older updates of Java 7 do not have the G1 garbage collector available at all.&lt;/p&gt;

&lt;p&gt;That said, there are five types of garbage collectors in Java:&lt;/p&gt;

&lt;h2&gt;
  
  
  Serial Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Serial garbage collector&lt;/strong&gt; is &lt;strong&gt;designed&lt;/strong&gt; to be used in &lt;strong&gt;single-threaded environments&lt;/strong&gt;. Before doing garbage collection, this garbage collector &lt;strong&gt;freezes&lt;/strong&gt; all the &lt;strong&gt;application threads&lt;/strong&gt;. Because of that, it is not suited for multi-threaded environments like server-side applications. However, it is perfectly suited for single-threaded applications that don't require low pause times, for example, batch jobs.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.oracle.com/en/java/javase/13/gctuning/available-collectors.html#GUID-45794DA6-AB96-4856-A96D-FDE5F7DEE498" rel="noopener noreferrer"&gt;documentation on Java garbage collectors&lt;/a&gt; also mentions that this garbage collector may be useful on multiprocessor machines for applications with a data set up to approximately 100MB.&lt;/p&gt;
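
&lt;p&gt;If you want to select this collector explicitly, there is a dedicated flag:&lt;/p&gt;

```shell
# Run the application with the single-threaded Serial collector.
java -XX:+UseSerialGC -jar my_awesome_app.jar
```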

&lt;h2&gt;
  
  
  Parallel Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Parallel garbage collector&lt;/strong&gt;, also known as the &lt;strong&gt;throughput collector&lt;/strong&gt;, is very similar to the Serial garbage collector. It also needs to freeze the application threads when doing garbage collection. But it was designed to work in &lt;strong&gt;multiprocessor environments&lt;/strong&gt; and in multi-threaded applications with medium and large-sized data. The idea is that using &lt;strong&gt;multiple threads&lt;/strong&gt; will &lt;strong&gt;speed up garbage collection&lt;/strong&gt;, making it faster for such use cases.&lt;/p&gt;

&lt;p&gt;If your application's priority is peak performance, and a thread pause time of one second or even longer is not a problem for it, then the Parallel garbage collector may be a good idea. It will run from time to time, freezing application threads and performing GC using multiple threads, which speeds it up compared to the Serial garbage collector.&lt;/p&gt;
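
&lt;p&gt;Selecting the Parallel collector is a single flag as well; the GC thread count below is just an example value:&lt;/p&gt;

```shell
# Use the throughput collector with four GC worker threads.
java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -jar my_awesome_app.jar
```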

&lt;h2&gt;
  
  
  Concurrent Mark Sweep Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Concurrent Mark Sweep (CMS) garbage collector&lt;/strong&gt; is one of the implementations that are called &lt;strong&gt;mostly concurrent&lt;/strong&gt;. It performs &lt;strong&gt;expensive operations&lt;/strong&gt; using &lt;strong&gt;multiple threads&lt;/strong&gt; that run &lt;strong&gt;concurrently&lt;/strong&gt; with the &lt;strong&gt;application&lt;/strong&gt; threads. The overhead for this type of garbage collection comes from the fact that the collection runs alongside the application, competing with it for processor resources.&lt;/p&gt;

&lt;p&gt;The CMS GC is designed for applications that prefer short pauses. Basically, it delivers lower throughput compared to the Parallel or Serial garbage collectors, but it doesn't have to stop the application threads for most of its work – only for short phases such as the initial mark and remark.&lt;/p&gt;

&lt;p&gt;This garbage collector should be chosen if your application prefers short pauses and can afford to share processor resources with the garbage collector. Keep in mind though that the Concurrent Mark Sweep garbage collector is going to be removed in Java 14, so you should look at the G1 garbage collector if you are not using it yet.&lt;/p&gt;
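
&lt;p&gt;If you still need to enable CMS on a JVM version that ships it, the flag is:&lt;/p&gt;

```shell
# Use the Concurrent Mark Sweep collector for the old generation
# (deprecated since Java 9, removed in Java 14).
java -XX:+UseConcMarkSweepGC -jar my_awesome_app.jar
```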

&lt;h2&gt;
  
  
  G1 Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;G1 garbage collector&lt;/strong&gt; is the garbage collection algorithm that was introduced in Java 7 update 4 and has been improved ever since. G1GC was designed to be low latency, but that comes at a price – more frequent work, which means more CPU cycles spent on garbage collection. It partitions the heap into smaller regions, allowing for easier garbage collection and &lt;strong&gt;evacuation-style memory clearing&lt;/strong&gt;: objects are moved out of the cleared region and copied to another region. Most of the garbage collection is done in the young generation, where it's most efficient to do so.&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://docs.oracle.com/en/java/javase/13/gctuning/available-collectors.html#GUID-13943556-F521-4287-AAAA-AE5DE68777CD" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; states, the G1GC was designed for server-style applications running in a multiprocessor environment with a large amount of memory available. It tries to meet garbage collector pause goals with high probability. While doing that it also tries to achieve high throughput. All of that without the needs of complicated configuration, at least in theory.&lt;/p&gt;

&lt;p&gt;Think about it this way – if you have services that are &lt;strong&gt;latency-sensitive&lt;/strong&gt;, the G1 garbage collector may be a &lt;strong&gt;very good choice&lt;/strong&gt;. Having low latencies means that those services will not suffer from long stop-the-world events, of course at the &lt;strong&gt;cost of higher CPU usage&lt;/strong&gt;. Also, the G1 garbage collector was designed to work with larger heap sizes – if you have a heap larger than 32GB, G1 is usually a good choice. The G1 garbage collector is the replacement for the CMS garbage collector and is also the default garbage collector in the most recent Java versions.&lt;/p&gt;
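
&lt;p&gt;On versions where G1 is not the default, you can enable it and set a pause-time goal explicitly; 200 ms happens to be the default goal as well:&lt;/p&gt;

```shell
# Use G1 and aim for GC pauses no longer than 200 milliseconds.
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar my_awesome_app.jar
```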

&lt;h2&gt;
  
  
  Z Garbage Collector
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Z garbage collector&lt;/strong&gt; is an experimental garbage collection implementation still not available on all platforms, like Windows and macOS. It is designed to be a very scalable low latency implementation. It performs expensive garbage collection work concurrently without the need for stopping the application threads.&lt;/p&gt;

&lt;p&gt;The ZGC is expected to work well with applications requiring pauses of 10ms or less and ones that use very large heaps.&lt;/p&gt;
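
&lt;p&gt;Because it is experimental, ZGC has to be unlocked before it can be selected, at least on JDK 11 through 13:&lt;/p&gt;

```shell
# Unlock experimental options and switch to the Z garbage collector.
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -jar my_awesome_app.jar
```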

&lt;h1&gt;
  
  
  Java Garbage Collection Benefits
&lt;/h1&gt;

&lt;p&gt;There are &lt;strong&gt;multiple benefits&lt;/strong&gt; of garbage collection in Java. The major one, which you may not think about at first, is &lt;strong&gt;simplified code&lt;/strong&gt;. We don't have to worry about proper memory allocation and release cycles. In our code, we just stop using an object, and the memory it occupies will be &lt;strong&gt;automatically reclaimed&lt;/strong&gt; at some point. The reclaim process is automatic and is the job of the algorithm inside the JVM; we only control what kind of algorithm we want to use – if we want to control it at all. Of course, we can still hit memory leaks if we keep references to objects forever, but that is a different story.&lt;/p&gt;

&lt;p&gt;We have to remember though that those benefits come at a price – &lt;strong&gt;performance&lt;/strong&gt;. Depending on the situation and the garbage collection algorithm, we pay for the ease and automation of memory management with the CPU cycles spent on garbage collection. In extreme cases, when we have issues with memory or garbage collection, we can even experience a complete stop of the whole application until the space reclamation process ends.&lt;/p&gt;

&lt;h1&gt;
  
  
  Java Garbage Collection Best Practices
&lt;/h1&gt;

&lt;p&gt;We will cover the process of tuning garbage collection in the next post in the series, but before that, we wanted to share some good and bad practices around garbage collection. First of all, avoid calling the System.gc() method to ask for explicit garbage collection – as we've mentioned, it is considered a bad practice.&lt;/p&gt;
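
&lt;p&gt;If a library you depend on calls System.gc() and you can't change its code, you can turn such explicit requests into no-ops with a single flag:&lt;/p&gt;

```shell
# Ignore System.gc() calls made from application or library code.
java -XX:+DisableExplicitGC -jar my_awesome_app.jar
```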

&lt;p&gt;The second thing I wanted to mention is having the right amount of heap memory. If you don't have enough memory for your application to work, you will experience slowdowns, long garbage collections, stop-the-world events and eventually out-of-memory errors. All of that can indicate that your heap is too small, but it can also mean that you have a memory leak in your application. Look at the &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;JVM monitoring&lt;/a&gt; of your choice to see if the heap usage grows indefinitely – if it does, it may mean you have a bug in your application code. We will talk more about the heap size in the next post in the series.&lt;/p&gt;
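
&lt;p&gt;Heap sizing itself is done with the well-known -Xms and -Xmx flags; the 2 GB value below is a placeholder – size the heap based on measurements, not guesses:&lt;/p&gt;

```shell
# Fix the heap at 2 GB; equal -Xms and -Xmx avoids heap resizing.
java -Xms2g -Xmx2g -jar my_awesome_app.jar
```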

&lt;p&gt;Finally, if you are running a small, standalone application, you will probably not need any kind of garbage collection tuning. Just go with the defaults and you should be more than fine.&lt;/p&gt;

&lt;p&gt;The next step after that would be to choose the right garbage collector implementation – the one that matches the needs and requirements of our business. How to do that, and what the options are for tuning different garbage collection algorithms, is something that we will cover in the next blog post, A Step-by-Step Guide to Java Garbage Collection Tuning.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;At this point, we know what the Java garbage collection process looks like, how each garbage collector works and what behavior we can expect from each of them. In addition to that, in the &lt;a href="https://sematext.com/blog/java-garbage-collection-logs/" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;, we also discussed how to turn on and understand the logs produced by each garbage collector. This means that we are ready for the final part of the series – tuning our garbage collector.&lt;/p&gt;

</description>
      <category>java</category>
      <category>gc</category>
      <category>observability</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Garbage Collection Logs &amp; How to Analyze Them</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Thu, 19 Dec 2019 14:47:25 +0000</pubDate>
      <link>https://dev.to/sematext/java-garbage-collection-logs-how-to-analyze-them-4hgb</link>
      <guid>https://dev.to/sematext/java-garbage-collection-logs-how-to-analyze-them-4hgb</guid>
      <description>&lt;p&gt;When working with Java or any other JVM-based programming language we get certain functionalities for free. One of those functionalities is clearing the memory. If you’ve ever used languages like C/C++ you probably remember functions like &lt;em&gt;malloc&lt;/em&gt;, &lt;em&gt;calloc&lt;/em&gt;, &lt;em&gt;realloc&lt;/em&gt; and &lt;em&gt;free&lt;/em&gt;. We needed to take care of the assignment of each byte in memory and take care of releasing the assigned memory when it was no longer needed. Without that, we were soon running into a shortage of memory leading to instability and crashes.&lt;/p&gt;

&lt;p&gt;With Java, we don’t have to worry about releasing the memory that was assigned to an object. We only need to stop using the object. It’s as simple as that. Once the object is no longer referenced from inside our code the memory can be released and re-used again.&lt;/p&gt;

&lt;p&gt;Freeing memory is done by a specialized part of the JVM called Garbage Collector.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Does the Java Garbage Collector Work
&lt;/h1&gt;

&lt;p&gt;The Java Virtual Machine runs the Garbage Collector in the background to find objects that are no longer referenced. Memory used by such objects can be freed and re-used. You can already see the difference compared to languages like C/C++: you don't have to mark an object for deletion, it is enough to stop using it.&lt;/p&gt;

&lt;p&gt;The heap memory is also divided into different regions, and each can be cleared by a different type of garbage collector. There are a few implementations of the garbage collector – each JVM vendor can provide its own as long as it meets the specification, and, in theory and practice, each implementation can deliver different performance.&lt;/p&gt;

&lt;p&gt;The simplified view over the three main regions of the JVM Heap can be visualized as follows:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnyrt3tcopye1wj8snxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjnyrt3tcopye1wj8snxe.png" alt="JVM Heap Space"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having a healthy garbage collection process is crucial to achieving optimal performance of your JVM based applications. Because of that, we need to ensure that we &lt;a href="https://sematext.com/guides/java-monitoring/" rel="noopener noreferrer"&gt;monitor JVM and its Garbage Collector&lt;/a&gt;. By using logs we can understand what the JVM tells us about the garbage collectors’ work.&lt;/p&gt;
&lt;h1&gt;
  
  
  What Are Garbage Collection (GC) Logs
&lt;/h1&gt;

&lt;p&gt;The &lt;strong&gt;garbage collector log&lt;/strong&gt; is a text file produced by the Java Virtual Machine that describes the work of the garbage collector. It contains all the information you could need to see how the memory cleaning process works. It also shows how the garbage collector behaves and how much resources it uses. Though we can monitor our application using an APM provider or in-house built monitoring tool, the garbage collector log will be invaluable to quickly identify any potential issues and bottlenecks when it comes to heap memory utilization.&lt;/p&gt;

&lt;p&gt;An example of what you can expect to find in the garbage collection log looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-10-29T10:00:28.693-0100: 0.302: [GC (Allocation Failure) 2019-10-29T10:00:28.693-0100: 0.302: [ParNew
Desired survivor size 1114112 bytes, new threshold 1 (max 6)
- age   1:    2184256 bytes,    2184256 total
: 17472K-&amp;gt;2175K(19648K), 0.0011358 secs] 17472K-&amp;gt;2382K(63360K), 0.0012071 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-10-29T10:00:28.694-0100: 0.303: Total time for which application threads were stopped: 0.0012996 seconds, Stopping threads took: 0.0000088 seconds
2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds
2019-10-29T10:00:28.897-0100: 0.506: Total time for which application threads were stopped: 0.0000981 seconds, Stopping threads took: 0.0000076 seconds
2019-10-29T10:00:28.910-0100: 0.519: Total time for which application threads were stopped: 0.0000896 seconds, Stopping threads took: 0.0000062 seconds
2019-10-29T10:00:28.923-0100: 0.531: Total time for which application threads were stopped: 0.0000975 seconds, Stopping threads took: 0.0000069 seconds
2019-10-29T10:00:28.976-0100: 0.585: Total time for which application threads were stopped: 0.0001414 seconds, Stopping threads took: 0.0000091 seconds
2019-10-29T10:00:28.982-0100: 0.590: [GC (Allocation Failure) 2019-10-29T10:00:28.982-0100: 0.590: [ParNew
Desired survivor size 1114112 bytes, new threshold 1 (max 6)
- age   1:    1669448 bytes,    1669448 total
: 19647K-&amp;gt;2176K(19648K), 0.0032520 secs] 19854K-&amp;gt;5036K(63360K), 0.0033060 secs] [Times: user=0.03 sys=0.00, real=0.00 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even a very short period of time can provide a lot of information. You see allocation failures, young generation collections, threads being stopped, memory before and after garbage collection, and each event leading to the promotion of objects inside the heap memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl4uga6ervru4gr95shmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl4uga6ervru4gr95shmn.png" alt="Promotion of object on the JVM Heap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Are Garbage Collection Logs Important
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqas26ub9gupdeikwhnrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fqas26ub9gupdeikwhnrz.png" alt="Garbage Collector Metrics Visualized"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dealing with application performance tuning can be a long and unpleasant experience. We need to properly prepare the environment and observe the application. Check this out to learn more about &lt;a href="https://sematext.com/blog/jvm-performance-tuning/" rel="noopener noreferrer"&gt;JVM performance tuning&lt;/a&gt;. With the right observability tool, like our &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, you get insights into crucial &lt;a href="https://sematext.com/docs/integration/jvm/" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; related to the application, the JVM and the operating system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/docs/integration/jvm/" rel="noopener noreferrer"&gt;Metrics&lt;/a&gt; are not everything though. Even the best APM tools will not give you everything. &lt;a href="https://sematext.com/docs/integration/jvm/" rel="noopener noreferrer"&gt;Metrics&lt;/a&gt; can show you patterns and historical data that will help you identify potential issues, but to be able to see everything you will need to dig deeper. That deeper level in terms of a Java-based application is the garbage collection log. Even though GC logs are very verbose, they provide information that’s not available in other sources, like stop the world events and how long they took, how long the application threads were stopped, memory pool utilization and many, many more.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to Enable GC Logging
&lt;/h1&gt;

&lt;p&gt;Before talking about how to enable garbage collector logging, we should ask ourselves one thing. &lt;strong&gt;Should I turn on the logs by default, or should I only turn them on when issues start appearing?&lt;/strong&gt; On modern devices, you shouldn't worry about performance when enabling the garbage collector logs. Of course, you will experience a bit more writing to your persistent storage, simply because the logs have to be written somewhere. Apart from that, the logs shouldn't produce any additional load on the system.&lt;/p&gt;

&lt;p&gt;You should always have the Java garbage collection logs turned on. In fact, a lot of open-source systems already follow that practice. For example, search systems like &lt;a href="https://sematext.com/resources/solr-monitoring-ebook/" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt; or &lt;a href="https://sematext.com/resources/elasticsearch-monitoring-ebook/" rel="noopener noreferrer"&gt;Elasticsearch&lt;/a&gt; already include JVM flags that turn on the logs. We know that those files include crucial information about the Java Virtual Machine operations, which is why we should have them turned on.&lt;/p&gt;

&lt;p&gt;There is a difference in terms of how you activate garbage collection logging for Java 8 and earlier and for the newer Java versions.&lt;/p&gt;

&lt;p&gt;For Java 8 and earlier you should add the following flags to your JVM based application startup parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-XX&lt;/span&gt;:+PrintGCDetails &lt;span class="nt"&gt;-Xloggc&lt;/span&gt;:&amp;lt;PATH_TO_GC_LOG_FILE&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where the &lt;strong&gt;PATH_TO_GC_LOG_FILE&lt;/strong&gt; is the location of the garbage collector log file. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-XX&lt;/span&gt;:+PrintGCDetails &lt;span class="nt"&gt;-Xloggc&lt;/span&gt;:/var/log/myapp/gc.log &lt;span class="nt"&gt;-jar&lt;/span&gt; my_awesome_app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In some cases, you may also see the &lt;strong&gt;-XX:+PrintGCTimeStamps&lt;/strong&gt; flag included. However, it is redundant here – &lt;strong&gt;-Xloggc&lt;/strong&gt; already turns timestamps on.&lt;/p&gt;

&lt;p&gt;For Java 9 and newer you can simplify the command above and add the following flag to the application startup parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-Xlog&lt;/span&gt;:gc&lt;span class="k"&gt;*&lt;/span&gt;:file&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;PATH_TO_GC_LOG_FILE&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-Xlog&lt;/span&gt;:gc&lt;span class="k"&gt;*&lt;/span&gt;:file&lt;span class="o"&gt;=&lt;/span&gt;/var/log/myapp/gc.log &lt;span class="nt"&gt;-jar&lt;/span&gt; my_awesome_app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you enable the logs, it's important to remember about GC log rotation. When using an older JVM version, like JDK 8, you may want to rotate your GC logs. To do that, we have three flags that we can add to our JVM application startup parameters. The first one enables GC log rotation: &lt;strong&gt;-XX:+UseGCLogFileRotation&lt;/strong&gt;. The second, &lt;strong&gt;-XX:NumberOfGCLogFiles&lt;/strong&gt;, tells the JVM how many GC log files should be kept. For example, including &lt;strong&gt;-XX:NumberOfGCLogFiles=10&lt;/strong&gt; will keep up to 10 GC log files. Finally, &lt;strong&gt;-XX:GCLogFileSize&lt;/strong&gt; tells the JVM how large a single GC log file can be. For example, &lt;strong&gt;-XX:GCLogFileSize=10m&lt;/strong&gt; will rotate the GC log file when it reaches 10 megabytes.&lt;/p&gt;
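
&lt;p&gt;Combining all three rotation flags with the logging flags from before, a complete JDK 8 startup line could look as follows:&lt;/p&gt;

```shell
# JDK 8: write GC logs and keep up to 10 files of 10 MB each.
java -XX:+PrintGCDetails -Xloggc:/var/log/myapp/gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \
  -XX:GCLogFileSize=10m -jar my_awesome_app.jar
```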

&lt;p&gt;When using JDK 11 and the G1GC garbage collector, to control your GC logs you will want to include a property like this: &lt;strong&gt;java -Xlog:gc*:file=gc.log,filecount=10,filesize=10m&lt;/strong&gt;. This will result in exactly the same behavior: we will have up to 10 GC log files, each up to 10 megabytes in size.&lt;/p&gt;

&lt;p&gt;Now that we know how important the JVM garbage collector logs are, and we've turned them on by default, we can start analyzing them.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to Analyze GC Logs
&lt;/h1&gt;

&lt;p&gt;Understanding garbage collection logs is not easy. It requires an understanding of how the Java Virtual Machine works and of the memory usage of the application. In this blog post, we will skip the analysis of the application itself, as it differs from application to application and requires knowledge of the code. What we will discuss, though, is how to read and analyze the garbage collection logs that we can get out of the JVM.&lt;/p&gt;

&lt;p&gt;What is also very important is that there are various JVM versions and multiple garbage collector implementations. You can still encounter Java 7, 8, 9 and so on. Some companies still use Java 6 because of various reasons. Each version may be running different garbage collectors — Serial, Parallel, Concurrent Mark Sweep, G1 or even Shenandoah or Z. You can expect different Java versions and different garbage collector implementations to output a slightly different log format and of course we will not be discussing all of them. In fact, we will show you only a small portion of the logs, but such that should help you in understanding all other garbage collector logs as well.&lt;/p&gt;

&lt;p&gt;The garbage collection logs will be able to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When was the young generation garbage collector used?&lt;/li&gt;
&lt;li&gt;When was the old generation garbage collector used?&lt;/li&gt;
&lt;li&gt;How many garbage collections were run?&lt;/li&gt;
&lt;li&gt;For how long were the garbage collectors running?&lt;/li&gt;
&lt;li&gt;What was the memory utilization before and after garbage collection?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s now look at an example taken out of a JVM garbage collector log and analyze each fragment highlighting the crucial parts behind it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Parallel and Concurrent Mark Sweep Garbage Collectors
&lt;/h1&gt;

&lt;p&gt;Let’s start by looking at Java 8 and the Parallel collector for the young generation space and the Concurrent Mark Sweep garbage collector for the old generation. A single line coming from our JVM garbage collector can look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-10-30T11:13:00.920-0100: 6.399: [Full GC (Allocation Failure) 2019-10-30T11:13:00.920-0100: 6.399: [CMS: 43711K-&amp;gt;43711K(43712K), 0.1417937 secs] 63359K-&amp;gt;48737K(63360K), [Metaspace: 47130K-&amp;gt;47130K(1093632K)], 0.1418689 secs] [Times: user=0.14 sys=0.00, real=0.14 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First of all, you can see the date and time of the event, which in our case is &lt;strong&gt;2019-10-30T11:13:00.920-0100&lt;/strong&gt;. This tells you exactly when the garbage collection event happened.&lt;/p&gt;

&lt;p&gt;The next thing we can see in the logline above is the type of garbage collection. In our case, it is Full GC and you can also expect GC as a value here. There are three types of garbage collector events that can happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minor garbage collection&lt;/li&gt;
&lt;li&gt;Major garbage collection&lt;/li&gt;
&lt;li&gt;Full garbage collection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Minor garbage collection&lt;/strong&gt; means that the &lt;strong&gt;young generation&lt;/strong&gt; space clearing event was performed by the JVM. The minor garbage collector is triggered when there is not enough memory to allocate a new object on the heap, i.e. when the Eden space is full or getting close to full. If your application creates new objects very often, you can expect the minor garbage collector to run often. What you should remember is that during minor garbage collection the surviving data in the Eden and survivor spaces is copied in its entirety, which means that no memory fragmentation happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Major garbage&lt;/strong&gt; collection means that the &lt;strong&gt;tenured generation&lt;/strong&gt; clearing event was performed. The tenured generation is also widely called the old generation space. Depending on the garbage collector and its settings the tenured generation cleaning may happen less or more often. Which is better? The right answer depends on the use-case and we will not be covering that in this blog post.&lt;/p&gt;

&lt;p&gt;Java &lt;strong&gt;Full GC&lt;/strong&gt; means that a full garbage collection event happened, i.e. that &lt;strong&gt;both&lt;/strong&gt; the &lt;strong&gt;young&lt;/strong&gt; and &lt;strong&gt;old generation&lt;/strong&gt; were cleared. The garbage collector tried to clear them, and the log tells us what the outcome of that procedure was. Tenured generation cleaning requires mark, sweep and compact phases to avoid high memory fragmentation. If the garbage collector didn’t care about memory fragmentation, you could end up in a situation where you have enough total free memory, but it is so fragmented that a new object can’t be allocated. We can illustrate this situation with the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsyd0sgmtot178d1k1c2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsyd0sgmtot178d1k1c2p.png" alt="Memory Fragmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is also one part that we didn’t discuss — the Allocation Failure. The Allocation Failure part of the garbage collector logline explains why the garbage collection cycle started. It usually means that there was no space left for new object allocation in the Eden space of heap memory and the garbage collector tried to free some memory for new objects. The Allocation Failure can also be generated by the remark phase of the Concurrent Mark Sweep garbage collector.&lt;/p&gt;

&lt;p&gt;The next important thing in the logline is the information about the memory occupation before and after the garbage collection process. Let’s look into the line once again in greater detail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejtwtmbobukj6wg5zib3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejtwtmbobukj6wg5zib3.png" alt="GC Log Line Analysis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the line contains a lot of useful information. In addition to what we already discussed, we have the memory occupancy both before and after the collection, the time the garbage collection took, and the CPU resources used during the garbage collection process. All of this allows us to see how fast or slow the process is.&lt;/p&gt;
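&lt;p&gt;If you ever need to pull these numbers out of the logs yourself, a small regular expression is enough. The sketch below is our own illustration (the class and method names are made up) and assumes the JDK 8 CMS log format shown above; it extracts every &lt;strong&gt;before-&amp;gt;after(capacity)&lt;/strong&gt; triple from a log line:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLineParser {

    // Matches occupancy snippets such as "63359K->48737K(63360K)".
    static final Pattern OCCUPANCY = Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\)");

    // Returns {before, after, capacity} triples (in KB) in the order they
    // appear; for the CMS full GC line above that order is: old generation,
    // whole heap, and then Metaspace.
    static List occupancies(String line) {
        List out = new ArrayList();
        Matcher m = OCCUPANCY.matcher(line);
        while (m.find()) {
            out.add(new long[] {
                Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3))
            });
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "2019-10-30T11:13:00.920-0100: 6.399: [Full GC (Allocation Failure) "
            + "[CMS: 43711K->43711K(43712K), 0.1417937 secs] 63359K->48737K(63360K), "
            + "[Metaspace: 47130K->47130K(1093632K)], 0.1418689 secs]";
        long[] heap = (long[]) occupancies(line).get(1); // second triple = whole heap
        System.out.println("heap: " + heap[0] + "K -> " + heap[1] + "K of " + heap[2] + "K");
    }
}
```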

&lt;p&gt;One very important piece of information that the JVM garbage collector gives us is the total time for which the application threads were stopped. You can expect the threads to be stopped very often, but for very short periods of time. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the threads were stopped for 0.0001006 seconds and that stopping them took 0.0000065 seconds. This is not a long time for threads to be stopped, and you will see entries like this over and over again in your garbage collector logs. What should raise a red flag is a long thread stop time, also called a stop-the-world event, which will basically stop your application. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-11-02T17:11:54.259-0100: 7.438: Total time for which application threads were stopped: 11.2305001 seconds, Stopping threads took: 0.5230011 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the logline above, we can see that the application threads were stopped for more than 11 seconds. What does that mean? Basically, your application was not responding for more than 11 seconds: it wasn’t responding to any requests, it wasn’t processing data, and the JVM was only doing garbage collection. You want to avoid situations like this at all costs. They are a sign of a serious memory problem: either your heap is too small for your application to properly do its job, or you have a memory leak that fills up your heap space, eventually leading to long garbage collections and finally to running out of memory. At that point your application will not be able to create new objects and will stop working.&lt;/p&gt;
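&lt;p&gt;When you have thousands of such safepoint lines, it helps to scan them automatically. The sketch below is again just an illustration; the one-second threshold is an arbitrary value that you would tune per application. It extracts the pause length from each line and flags the long ones:&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PauseScanner {

    // Matches the safepoint summary lines shown in the log excerpts above.
    static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds");

    // Returns the pause length in seconds, or -1 if the line is not a safepoint summary.
    static double pauseSeconds(String line) {
        Matcher m = STOPPED.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }

    public static void main(String[] args) {
        String[] log = {
            "2019-10-29T10:00:28.879-0100: 0.488: Total time for which application threads "
                + "were stopped: 0.0001006 seconds, Stopping threads took: 0.0000065 seconds",
            "2019-11-02T17:11:54.259-0100: 7.438: Total time for which application threads "
                + "were stopped: 11.2305001 seconds, Stopping threads took: 0.5230011 seconds"
        };
        double threshold = 1.0; // arbitrary red-flag threshold, tune per application
        for (String line : log) {
            double pause = pauseSeconds(line);
            if (pause > threshold) {
                System.out.println("RED FLAG: " + pause + "s stop-the-world pause");
            }
        }
    }
}
```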

&lt;h1&gt;
  
  
  G1 Garbage Collector
&lt;/h1&gt;

&lt;p&gt;Let’s now look at what the G1 garbage collector logs look like. We will disable the previously used CMS garbage collector and turn on G1GC using the following application options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;-XX:+UseG1GC&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-XX:-UseConcMarkSweepGC&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;-XX:-UseCMSInitiatingOccupancyOnly&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we turn on the G1 garbage collector and disable the Concurrent Mark Sweep collector.&lt;/p&gt;

&lt;p&gt;A standard G1 garbage collector log entry looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-11-03T21:26:21.827-0100: 2.069: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 2097152 bytes, new threshold 15 (max 15)
- age   1:     341608 bytes,     341608 total
, 0.0021740 secs]
   [Parallel Time: 0.9 ms, GC Workers: 10]
      [GC Worker Start (ms): Min: 2069.4, Avg: 2069.5, Max: 2069.6, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 0.1, Avg: 0.2, Max: 0.4, Diff: 0.3, Sum: 1.5]
      [Update RS (ms): Min: 0.1, Avg: 0.2, Max: 0.3, Diff: 0.2, Sum: 2.3]
         [Processed Buffers: Min: 1, Avg: 1.4, Max: 4, Diff: 3, Sum: 14]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.2, Avg: 0.3, Max: 0.3, Diff: 0.1, Sum: 3.0]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 10]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 0.6, Avg: 0.7, Max: 0.8, Diff: 0.1, Sum: 7.0]
      [GC Worker End (ms): Min: 2070.2, Avg: 2070.2, Max: 2070.2, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.2 ms]
   [Other: 1.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.8 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 26.0M(26.0M)-&amp;gt;0.0B(30.0M) Survivors: 5120.0K-&amp;gt;3072.0K Heap: 51.4M(64.0M)-&amp;gt;22.6M(64.0M)]
 [Times: user=0.01 sys=0.00, real=0.01 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the logline above you can see that we had a young generation garbage collection event &lt;strong&gt;[GC pause (G1 Evacuation Pause) (young)&lt;/strong&gt;, which resulted in certain regions of memory being cleared: &lt;strong&gt;[Eden: 26.0M(26.0M)-&amp;gt;0.0B(30.0M) Survivors: 5120.0K-&amp;gt;3072.0K Heap: 51.4M(64.0M)-&amp;gt;22.6M(64.0M)]&lt;/strong&gt;. We also have the timing information and CPU usage: &lt;strong&gt;[Times: user=0.01 sys=0.00, real=0.01 secs]&lt;/strong&gt;. The timings have exactly the same meaning as in the previous garbage collector discussion: the user and system scope CPU usage during the garbage collection process, plus the real time it took.&lt;/p&gt;

&lt;p&gt;The memory information summary is detailed and gives us an overview of what happened. We can see that the Eden space was fully cleared: &lt;strong&gt;26.0M(26.0M)-&amp;gt;0.0B(30.0M)&lt;/strong&gt;. The garbage collection started when Eden was occupied by 26M of data and ended with a completely empty Eden space, whose size after the collection was 30M. The survivor space started with 5120K of data and ended with 3072K. Finally, the whole heap started with 51.4M occupied out of the total size of 64M and ended with 22.6M occupied.&lt;/p&gt;
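&lt;p&gt;Note that G1 mixes units in these summaries (&lt;strong&gt;M&lt;/strong&gt;, &lt;strong&gt;K&lt;/strong&gt; and even &lt;strong&gt;B&lt;/strong&gt;), so if you want to do arithmetic on them, such as computing how much memory a collection reclaimed, it is worth normalizing everything to bytes first. A small sketch, illustrative only and assuming just the unit suffixes shown in the log above:&lt;/p&gt;

```java
public class SizeParser {

    // Converts a G1 log size token such as "26.0M", "5120.0K" or "0.0B" to bytes.
    static long toBytes(String s) {
        double value = Double.parseDouble(s.substring(0, s.length() - 1));
        char unit = s.charAt(s.length() - 1);
        switch (unit) {
            case 'B': return (long) value;
            case 'K': return (long) (value * 1024);
            case 'M': return (long) (value * 1024 * 1024);
            case 'G': return (long) (value * 1024L * 1024 * 1024);
            default:  throw new IllegalArgumentException("unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        // The heap went from 51.4M occupied to 22.6M occupied in the log above.
        long freed = toBytes("51.4M") - toBytes("22.6M");
        System.out.println("collection reclaimed roughly " + freed / (1024 * 1024) + " MB");
    }
}
```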

&lt;p&gt;In addition to that, you also see more detailed information about the internals of the parallel garbage collector workers and the phases of their work — like the start, scanning and working.&lt;/p&gt;

&lt;p&gt;You can also see additional log entries related to G1 garbage collector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2019-11-03T21:26:23.704-0100: 2019-11-03T21:26:23.704-0100: 3.946: 3.946: [GC concurrent-root-region-scan-start]
Total time for which application threads were stopped: 0.0035771 seconds, Stopping threads took: 0.0000111 seconds
2019-11-03T21:26:23.706-0100: 3.948: [GC concurrent-root-region-scan-end, 0.0017994 secs]
2019-11-03T21:26:23.706-0100: 3.948: [GC concurrent-mark-start]
2019-11-03T21:26:23.737-0100: 3.979: [GC concurrent-mark-end, 0.0315921 secs]
2019-11-03T21:26:23.737-0100: 3.979: [GC remark 2019-11-03T21:26:23.737-0100: 3.979: [Finalize Marking, 0.0002017 secs] 2019-11-03T21:26:23.738-0100: 3.980: [GC ref-proc, 0.0004151 secs] 2019-11-03T21:26:23.738-0100: 3.980: [Unloading, 0.0025065 secs], 0.0033738 secs]
 [Times: user=0.04 sys=0.01, real=0.01 secs]
2019-11-03T21:26:23.741-0100: 3.983: Total time for which application threads were stopped: 0.0034705 seconds, Stopping threads took: 0.0000308 seconds
2019-11-03T21:26:23.741-0100: 3.983: [GC cleanup 54M-&amp;gt;54M(64M), 0.0004419 secs]
 [Times: user=0.00 sys=0.00, real=0.00 secs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, the above log lines are different, but the principles still stand. The log gives us information about the total time for which the application threads were stopped, the result of the cleanup done by the garbage collector and the resources used.&lt;/p&gt;

&lt;h1&gt;
  
  
  GC Logging Options in Java 9 and Newer
&lt;/h1&gt;

&lt;p&gt;We can go even deeper and turn on debug-level garbage collection logging. Let’s take Java 10 as an example and add &lt;strong&gt;-Xlog:gc*,gc+phases=debug&lt;/strong&gt; to the startup parameters of the JVM. This turns on debug-level logging for the garbage collection phases of the default G1 garbage collector, giving you extensive information about the garbage collector’s work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0.006s][info][gc,heap] Heap region size: 1M
[0.012s][info][gc     ] Using G1
[0.013s][info][gc,heap,coops] Heap address: 0x00000006c0000000, size: 4096 MB, Compressed Oops mode: Zero based, Oop shift amount: 3
[0.428s][info][gc,start     ] GC(0) Pause Young (G1 Evacuation Pause)
[0.428s][info][gc,task      ] GC(0) Using 2 workers of 2 for evacuation
[0.432s][info][gc,phases    ] GC(0)   Pre Evacuate Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Prepare TLABs: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Choose Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Humongous Register: 0.0ms
[0.433s][info ][gc,phases    ] GC(0)   Evacuate Collection Set: 3.8ms
[0.433s][debug][gc,phases    ] GC(0)     Ext Root Scanning (ms):   Min:  0.6, Avg:  0.7, Max:  0.8, Diff:  0.2, Sum:  1.4, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Update RS (ms):           Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Processed Buffers:        Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Scanned Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Skipped Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Scan RS (ms):             Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Scanned Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Claimed Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Skipped Cards:            Min: 0, Avg:  0.0, Max: 0, Diff: 0, Sum: 0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Code Root Scanning (ms):  Min:  0.0, Avg:  0.1, Max:  0.1, Diff:  0.1, Sum:  0.1, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     AOT Root Scanning (ms):   skipped
[0.433s][debug][gc,phases    ] GC(0)     Object Copy (ms):         Min:  2.8, Avg:  2.9, Max:  3.0, Diff:  0.2, Sum:  5.7, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     Termination (ms):         Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)       Termination Attempts:     Min: 1, Avg:  1.0, Max: 1, Diff: 0, Sum: 2, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     GC Worker Other (ms):     Min:  0.0, Avg:  0.0, Max:  0.0, Diff:  0.0, Sum:  0.0, Workers: 2
[0.433s][debug][gc,phases    ] GC(0)     GC Worker Total (ms):     Min:  3.6, Avg:  3.6, Max:  3.7, Diff:  0.1, Sum:  7.3, Workers: 2
[0.433s][info ][gc,phases    ] GC(0)   Post Evacuate Collection Set: 0.1ms
[0.433s][debug][gc,phases    ] GC(0)     Code Roots Fixup: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Preserve CM Refs: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Reference Processing: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Clear Card Table: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Reference Enqueuing: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Merge Per-Thread State: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Code Roots Purge: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Redirty Cards: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     DerivedPointerTable Update: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Free Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Humongous Reclaim: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Start New Collection Set: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Resize TLABs: 0.0ms
[0.433s][debug][gc,phases    ] GC(0)     Expand Heap After Collection: 0.0ms
[0.433s][info ][gc,phases    ] GC(0)   Other: 0.2ms
[0.433s][info ][gc,heap      ] GC(0) Eden regions: 7-&amp;gt;0(72)
[0.433s][info ][gc,heap      ] GC(0) Survivor regions: 0-&amp;gt;1(1)
[0.433s][info ][gc,heap      ] GC(0) Old regions: 0-&amp;gt;1
[0.433s][info ][gc,heap      ] GC(0) Humongous regions: 6-&amp;gt;3
[0.433s][info ][gc,metaspace ] GC(0) Metaspace: 9281K-&amp;gt;9281K(1058816K)
[0.433s][info ][gc           ] GC(0) Pause Young (G1 Evacuation Pause) 13M-&amp;gt;4M(122M) 4.752ms
[0.433s][info ][gc,cpu       ] GC(0) User=0.00s Sys=0.01s Real=0.00s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the exact phase timings in the log above; they were not present in the G1 garbage collector log that we discussed earlier. Of course, phases are not the only option that you can turn on. These unified logging options became available with Java 9 and replace flags that were removed or deprecated. Here are some of the options available in earlier Java Virtual Machine versions and the options that they translate to in Java 9 and newer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintHeapAtGC&lt;/strong&gt; can now be expressed as &lt;strong&gt;-Xlog:gc+heap=debug&lt;/strong&gt; option&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintParallelOldGCPhasesTimes&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+phases*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintGCApplicationConcurrentTime&lt;/strong&gt; and &lt;strong&gt;-XX:+PrintGCApplicationStoppedTime&lt;/strong&gt; can now be expressed as &lt;strong&gt;-Xlog:safepoint&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+G1PrintHeapRegions&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+region*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+SummarizeConcMark&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+marking*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+SummarizeRSetStats&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+remset*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintJNIGCStalls&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+jni*=debug&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintTaskqueue&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+task+stats*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+TraceDynamicGCThreads&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+task*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintAdaptiveSizePolicy&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+ergo*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:+PrintTenuringDistribution&lt;/strong&gt; can be expressed as &lt;strong&gt;-Xlog:gc+age*=trace&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
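&lt;p&gt;All unified logging output shares the same bracketed prefix: JVM uptime, log level and tags. If you ever need to filter or group these lines yourself, the prefix is easy to parse. Here is a small sketch; the class name and the regular expression are our own illustration:&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnifiedLogLine {

    // Matches the unified logging prefix: [uptime][level][tags],
    // e.g. "[0.433s][info ][gc,cpu       ]".
    static final Pattern PREFIX = Pattern.compile(
        "\\[([0-9.]+)s\\]\\[(\\w+)\\s*\\]\\[([\\w,+]+)\\s*\\]");

    public static void main(String[] args) {
        String line = "[0.433s][info ][gc,cpu       ] GC(0) User=0.00s Sys=0.01s Real=0.00s";
        Matcher m = PREFIX.matcher(line);
        if (m.find()) {
            System.out.println("uptime=" + m.group(1) + "s level=" + m.group(2)
                + " tags=" + m.group(3));
        }
    }
}
```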

&lt;p&gt;You can combine multiple options, or enable all of them by adding the &lt;strong&gt;-Xlog:all=trace&lt;/strong&gt; flag to your JVM application startup parameters. Be aware, though, that this can result in a very large amount of information in the garbage collector log files. To avoid the flood, you can set the level to debug using &lt;strong&gt;-Xlog:all=debug&lt;/strong&gt;, which reduces the amount of information while still giving you much more than the standard garbage collector log.&lt;/p&gt;

&lt;h1&gt;
  
  
  Java Garbage Collection Logging: Log Analysis Tools you Need to Know About
&lt;/h1&gt;

&lt;p&gt;There are &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;log analysis tools&lt;/a&gt; that can help you analyze garbage collector logs, though nothing is available out of the box in the standard Java Virtual Machine distribution.&lt;/p&gt;

&lt;h1&gt;
  
  
  APM &amp;amp; Observability Tools
&lt;/h1&gt;

&lt;p&gt;When it comes to getting a high-level overview of the performance of the Java garbage collector, you can use one of the observability tools providing Java application-level monitoring. For example, our own &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;Sematext JVM Monitoring&lt;/a&gt; provided by &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A tool like this should give you summary information about how the garbage collector works: the collection times, collection count, maximum collection time and average collection size. In most cases, this is more than enough to spot issues with garbage collection without going deep into the logs and analyzing them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fme47hcznusa1ddp23l32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fme47hcznusa1ddp23l32.png" alt="JVM GC Metrics Visualized"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when troubleshooting you may need to have a more fine-grained view over what was happening inside the garbage collector in the JVM. If you don’t want to analyze the data manually there are tools that can help you.&lt;/p&gt;

&lt;h1&gt;
  
  
  GCViewer
&lt;/h1&gt;

&lt;p&gt;For example, one of the tools that can help you visualize GC logs is &lt;a href="http://www.tagtraum.com/gcviewer.html" rel="noopener noreferrer"&gt;GCViewer&lt;/a&gt;, a tool that supports garbage collector logs up to Java 1.5, and its &lt;a href="https://github.com/chewiebug/GCViewer" rel="noopener noreferrer"&gt;continuation&lt;/a&gt; aiming to support newer Java versions and the G1 garbage collector.&lt;/p&gt;

&lt;p&gt;GCViewer aims to provide comprehensive information about memory utilization and the garbage collection process overall. It is open source and completely free for personal and commercial use, with support up to and including Java 8 and for the unified logging of OpenJDK 9 and 10.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc69d9kgl730auvest9cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc69d9kgl730auvest9cq.png" alt="GC Viewer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  GCEasy
&lt;/h1&gt;

&lt;p&gt;There are also tools that are proprietary and commercial. One of them is &lt;a href="https://gceasy.io/" rel="noopener noreferrer"&gt;GCEasy&lt;/a&gt;, an online GC log analyzer where you upload a garbage collection log and get the results in the form of an easy-to-read analysis report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flbq5gb53igm1io4xo9i2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Flbq5gb53igm1io4xo9i2.png" alt="GC Easy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report includes information like generation size and maximum size, key performance indicators like average and maximum pause time, pause statistics, memory leak information, and interactive graphs showing each heap memory space. All of that information is calculated from the log file that you provide.&lt;/p&gt;

&lt;p&gt;Even though GCEasy has a free plan, it is limited. At the time of writing, a single user could upload 5 GC log files a month, with up to 50 MB per file. There are additional plans available if you are interested in using the tool.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping Up
&lt;/h1&gt;

&lt;p&gt;Understanding garbage collector logs is not easy. A large number of possible formats, different Java Virtual Machine versions, and different garbage collector implementations don’t make it simpler. Even though there are a lot of options you have to remember, certain parts are common. Each garbage collector will tell you the size of the heap, the before and after occupancy of the region of the heap that was cleared. Finally, you will also see the time and resources used to perform the operation. Start from that and continue the journey of understanding the JVM garbage collection process and the memory usage of your application. Happy analysis :)&lt;/p&gt;

</description>
      <category>java</category>
      <category>gc</category>
      <category>logs</category>
      <category>observability</category>
    </item>
    <item>
      <title>Solr Monitoring Made Easy with Sematext</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 20 May 2019 17:31:08 +0000</pubDate>
      <link>https://dev.to/sematext/solr-monitoring-made-easy-with-sematext-45b0</link>
      <guid>https://dev.to/sematext/solr-monitoring-made-easy-with-sematext-45b0</guid>
      <description>&lt;p&gt;As shown in Part 1 &lt;a href="https://sematext.com/blog/solr-key-metrics-to-monitor/"&gt;Solr Key Metrics to Monitor&lt;/a&gt;, the setup, tuning, and operations of Solr require deep insights into the performance metrics such as request rate and latency, JVM memory utilization, garbage collector work time and count and many more. Sematext provides an excellent alternative to other Solr monitoring tools.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Sematext Saves you Time, Work and Costs
&lt;/h1&gt;

&lt;p&gt;Here are a few &lt;em&gt;things you will NOT have to do when using Sematext for Solr&lt;/em&gt; monitoring:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;figure out which metrics to collect and which ones to ignore&lt;/li&gt;
&lt;li&gt;give metrics meaningful labels&lt;/li&gt;
&lt;li&gt;hunt for metric descriptions in the docs so that you know what each one actually shows&lt;/li&gt;
&lt;li&gt;build charts to group metrics that you really want on the same charts, not several separate charts&lt;/li&gt;
&lt;li&gt;figure out which aggregation to use for each set of metrics (min? max? avg? something else?)&lt;/li&gt;
&lt;li&gt;set up basic alert rules&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And that is not even the complete story. Do you want to collect Solr logs? How about structuring them? Sematext does all this for you automatically!&lt;/p&gt;

&lt;p&gt;In this post, we will look at how Sematext provides more comprehensive – and easy to set up – &lt;a href="https://sematext.com/integrations/solr-monitoring/"&gt;monitoring for Solr&lt;/a&gt; and &lt;a href="https://sematext.com/integrations/"&gt;other technologies&lt;/a&gt; in your infrastructure. By combining events, logs, and metrics together in one integrated &lt;a href="https://sematext.com/cloud"&gt;full stack observability platform&lt;/a&gt; and using the Sematext open-source monitoring agent and its integrations, which are also open-source, you can monitor your whole infrastructure and apps, not just Solr. You can also get deeper visibility into your entire software stack by &lt;a href="https://sematext.com/logsene/"&gt;collecting, processing, and analyzing your logs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Solr Monitoring
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Collecting Solr Metrics
&lt;/h2&gt;

&lt;p&gt;Sematext Solr integration collects &lt;a href="https://sematext.com/docs/integration/solrcloud/"&gt;over 30 different Solr metrics&lt;/a&gt; for various caches, requests, latency and much more. &lt;a href="https://sematext.com/java-monitoring"&gt;JVM monitoring&lt;/a&gt; is included, too. Sematext maintains and supports &lt;a href="https://github.com/sematext/sematext-agent-integrations/tree/master/solr"&gt;official Solr monitoring integration&lt;/a&gt;. Moreover, the Sematext Solr integration is customizable and open source.&lt;/p&gt;

&lt;p&gt;Bottom line: you don’t need to deal with configuring the agent for metrics collection, which is the first huge time saver!&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Monitoring Agent
&lt;/h2&gt;

&lt;p&gt;Setting up the monitoring agent takes less than 5 minutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1. Create a Solr App&lt;/strong&gt; in the &lt;a href="https://apps.sematext.com/ui/monitoring-create"&gt;Integrations / Overview&lt;/a&gt; (or &lt;a href="https://apps.eu.sematext.com/ui/monitoring-create"&gt;Sematext Cloud Europe&lt;/a&gt;). This will let you install the agent and control access to your monitoring and logs data. The short &lt;a href="https://www.youtube.com/watch?v=tr_qxdr8dvk&amp;amp;index=14&amp;amp;list=plt_fd32ofypflbfzz_hiafnqjdltth1ns"&gt;What is an App in Sematext Cloud&lt;/a&gt; video has more details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2. Name your Solr monitoring App&lt;/strong&gt; and, if you want to collect Solr logs as well, create a Logs App along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3. Install the Sematext Agent&lt;/strong&gt; according to the &lt;a href="https://apps.sematext.com/ui/howto/solr/overview"&gt;setup&lt;/a&gt; &lt;a href="https://apps.sematext.com/ui/howto/solr/overview"&gt;instructions&lt;/a&gt; displayed in the UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EB148iB8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/0fq15ov6isetzdmtt6of.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EB148iB8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/0fq15ov6isetzdmtt6of.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, on Ubuntu, add Sematext Linux packages with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb http://pub-repo.sematext.com/ubuntu sematext main"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt;
/etc/apt/sources.list.d/sematext.list &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
wget &lt;span class="nt"&gt;-O&lt;/span&gt; - https://pub-repo.sematext.com/ubuntu/sematext.gpg.key | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;spm-client
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then set up Solr monitoring by providing the Solr server connection details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;bash /opt/spm/bin/setup-sematext  &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--monitoring-token&lt;/span&gt; &amp;lt;your-monitoring-token-goes-here&amp;gt;   &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--app-type&lt;/span&gt; solr  &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--agent-type&lt;/span&gt; javaagent  &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--infra-token&lt;/span&gt; &amp;lt;your-infra-token-goes-here&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Finally, adjust your &lt;em&gt;solr.in.sh&lt;/em&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SOLR_OPTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SOLR_OPTS&lt;/span&gt;&lt;span class="s2"&gt; -Dcom.sun.management.jmxremote
-javaagent:/opt/spm/spm-monitor/lib/spm-monitor-generic.jar=::default"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
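&lt;p&gt;Before restarting Solr you can sanity-check that the javaagent entry made it into the file. The snippet below is only a sketch using a scratch copy of the file; in practice point the check at your real &lt;em&gt;solr.in.sh&lt;/em&gt;:&lt;/p&gt;

```shell
# Sketch: write the expected SOLR_OPTS line to a scratch copy of solr.in.sh
# (use your real solr.in.sh path in practice), then verify the agent entry.
SOLR_IN=/tmp/solr.in.sh
printf '%s\n' 'SOLR_OPTS="$SOLR_OPTS -Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-generic.jar=::default"' | tee "$SOLR_IN"
if grep -q 'spm-monitor-generic.jar' "$SOLR_IN"; then
  echo "javaagent configured"
fi
```

&lt;p&gt;Once the entry is in place, restart Solr (for example with &lt;em&gt;bin/solr restart&lt;/em&gt;) so the agent is loaded.&lt;/p&gt;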



&lt;p&gt;&lt;strong&gt;Step 4. Go grab a drink… but hurry&lt;/strong&gt; – Solr metrics will start appearing in your charts in less than a minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solr Monitoring Dashboard
&lt;/h2&gt;

&lt;p&gt;When you open the Solr App you’ll find a predefined set of dashboards that organize more than 60 Solr metrics, along with general &lt;a href="https://sematext.com/server-monitoring/"&gt;server monitoring&lt;/a&gt; metrics, into an intuitively grouped set of charts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Overview with charts for all key Solr metrics&lt;/li&gt;
&lt;li&gt;Operating System metrics such as CPU, memory, network, disk usage, etc.&lt;/li&gt;
&lt;li&gt;Solr metrics 

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request Rate &amp;amp; Latency&lt;/strong&gt;: requests per second, requests latencies, including percentiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Size&lt;/strong&gt;: Solr data size, file system statistics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt;: Added documents, delete by identifier, delete by queries, commit events, other events such as update errors or expunge deletes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache&lt;/strong&gt;: Query result cache, document cache, filter cache, and per segment filter cache metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warmup&lt;/strong&gt;: Searcher and caches warmup times&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z6f3CmGJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mbk7v1p26famvh9szboa.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z6f3CmGJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mbk7v1p26famvh9szboa.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup Solr Alerts
&lt;/h2&gt;

&lt;p&gt;To save you time, Sematext automatically creates a set of default alert rules, such as an alert for low disk space. You can create additional alerts on any metric. Watch &lt;a href="https://www.youtube.com/watch?v=we9xauud28o"&gt;Alerts in Sematext Cloud&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting on Solr Metrics
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://sematext.com/docs/alerts/"&gt;3 types of alerts&lt;/a&gt; in Sematext:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat alerts&lt;/strong&gt;, which notify you when a Solr server is down&lt;/li&gt;
&lt;li&gt;Classic &lt;strong&gt;threshold-based alerts&lt;/strong&gt; that notify you when a metric value crosses a pre-defined threshold&lt;/li&gt;
&lt;li&gt;Alerts based on statistical &lt;a href="https://sematext.com/blog/introducing-algolerts-anomaly-detection-alerts/"&gt;anomaly detection&lt;/a&gt; that notify you when metric values suddenly change and deviate from the baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see how to actually create alert rules for the &lt;strong&gt;Request Rate&lt;/strong&gt; metric in the animation below. The &lt;strong&gt;Request Rate&lt;/strong&gt; chart shows a drop in the number of requests. We normally see more than three requests per second, but here the rate drops to zero, and we would like to be notified about such situations. To create an alert rule on a metric, go to the pull-down menu in the top right corner of a chart and choose “Create alert”. The alert rule applies the filters from the current view, and you can choose various notification options such as email or configured &lt;a href="https://sematext.com/docs/alerts/#alert-integrations"&gt;notification hooks&lt;/a&gt; (&lt;a href="https://sematext.com/docs/integration/alerts-pagerduty-integration/"&gt;PagerDuty&lt;/a&gt;, &lt;a href="https://sematext.com/docs/integration/alerts-slack-integration/"&gt;Slack&lt;/a&gt;, &lt;a href="https://sematext.com/docs/integration/alerts-victorops-integration/"&gt;VictorOps&lt;/a&gt;, &lt;a href="https://sematext.com/docs/integration/alerts-bigpanda-integration/"&gt;BigPanda&lt;/a&gt;, &lt;a href="https://sematext.com/docs/integration/alerts-opsgenie-integration/"&gt;OpsGenie&lt;/a&gt;, &lt;a href="https://sematext.com/docs/integration/alerts-pushover-integration/"&gt;Pushover&lt;/a&gt;, &lt;a href="https://sematext.com/docs/integration/alerts-webhooks-integration/"&gt;generic webhooks&lt;/a&gt;, etc.). Alerts are triggered either by anomaly detection, which watches metric changes in a given time window, or by classic threshold-based rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eiy6E9Mz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3nx7ktso5uiuuc1r9p5l.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eiy6E9Mz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3nx7ktso5uiuuc1r9p5l.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Monitoring Solr Logs
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Shipping Solr Logs
&lt;/h2&gt;

&lt;p&gt;Since having &lt;a href="https://sematext.com/metrics-and-logs/"&gt;logs and metrics in one platform&lt;/a&gt; makes troubleshooting simpler and faster, let’s ship Solr logs, too. You can use &lt;a href="https://sematext.com/docs/integration/#logging"&gt;many log shippers&lt;/a&gt;, but we’ll use &lt;a href="https://sematext.com/logagent/"&gt;Logagent&lt;/a&gt; because it’s lightweight, easy to set up, and able to parse and structure logs out of the box. The log parser extracts the timestamp, severity, and message. For query traces, it also extracts the unique query ID to group logs related to query execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; &lt;a href="https://apps.sematext.com/ui/integrations"&gt;Create a Logs App&lt;/a&gt; to obtain an App token&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Install Logagent npm package&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @sematext/logagent
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you don’t have Node, you can install it easily. For example, on Debian/Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sL&lt;/span&gt; https://deb.nodesource.com/setup_8.x | &lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; bash -
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
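&lt;p&gt;You can confirm that Node.js and npm are available before installing Logagent:&lt;/p&gt;

```shell
# Verify the Node.js and npm versions on the current machine
node --version
npm --version
```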



&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Install the Logagent service by specifying the logs token and the path to the Solr log files.&lt;/p&gt;

&lt;p&gt;You can use &lt;em&gt;-g '/var/solr/logs/*.log'&lt;/em&gt; to ship only logs from the Solr server. If you run other services, such as ZooKeeper or MySQL, on the same server, consider shipping all logs. The default settings ship all logs matching &lt;em&gt;/var/log/**/*.log&lt;/em&gt; when the &lt;em&gt;-g&lt;/em&gt; parameter is not specified.&lt;/p&gt;
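&lt;p&gt;To see what the two glob patterns match, here is a small illustration using a throwaway directory tree (the paths are hypothetical examples):&lt;/p&gt;

```shell
# Create a hypothetical log layout to illustrate the -g glob patterns
mkdir -p /tmp/glob-demo/var/solr/logs /tmp/glob-demo/var/log/mysql
touch /tmp/glob-demo/var/solr/logs/solr.log /tmp/glob-demo/var/log/mysql/error.log

# Solr-only pattern: matches just the Solr server logs
ls /tmp/glob-demo/var/solr/logs/*.log

# Default pattern shape: matches logs from every service under var/log
ls /tmp/glob-demo/var/log/*/*.log
```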

&lt;p&gt;Logagent detects the init system and installs Systemd or Upstart service scripts. On Mac OS X it creates a launchd service. Simply run:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;logagent-setup &lt;span class="nt"&gt;-i&lt;/span&gt; YOUR_LOGS_TOKEN &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="s1"&gt;'/var/solr/logs/*.log'&lt;/span&gt;

&lt;span class="c"&gt;# for EU region:&lt;/span&gt;
&lt;span class="c"&gt;# sudo logagent-setup -i LOGS_TOKEN \&lt;/span&gt;
&lt;span class="c"&gt;#     -u logsene-receiver.eu.sematext.com \&lt;/span&gt;
&lt;span class="c"&gt;#     -g '/var/log/**/*.log'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The setup script generates the configuration file in &lt;em&gt;/etc/sematext/logagent.conf&lt;/em&gt; and starts Logagent as a system service.&lt;/p&gt;

&lt;p&gt;Note: if you &lt;a href="https://sematext.com/blog/docker-solr/"&gt;run Solr in containers&lt;/a&gt;, &lt;a href="https://sematext.com/docs/logagent/installation-docker/"&gt;set up Logagent for container logs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Search and Dashboards
&lt;/h2&gt;

&lt;p&gt;Once you have logs in Sematext you can search them when troubleshooting, save queries you run frequently, or &lt;strong&gt;create your own logs dashboards&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q04BvvM_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/xjv9rj93wdhed7tctdn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q04BvvM_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/xjv9rj93wdhed7tctdn5.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Log Search Syntax
&lt;/h3&gt;

&lt;p&gt;If you know how to search with Google, you’ll know how to search your logs in Sematext Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AND, OR, NOT operators – e.g. &lt;em&gt;(error OR warn) NOT exception&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Group your AND, OR, NOT clauses – e.g. &lt;em&gt;message:(exception OR error OR timeout) AND severity:(error OR warn)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Don’t like Booleans? Use + and - to include and exclude – e.g. &lt;em&gt;+message:error -message:timeout -host:db1.example.com&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Use explicit field references – e.g. &lt;em&gt;message:timeout&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Need a phrase search? Use quotation marks – e.g. &lt;em&gt;message:"fatal error"&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When digging through logs you might find yourself running the same searches again and again. To solve this annoyance, Sematext lets you save queries so you can re-execute them quickly without retyping them. Please watch how using logs for troubleshooting simplifies your work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting on Solr Logs
&lt;/h2&gt;

&lt;p&gt;To create an alert on logs we start by running a query that matches exactly those log events that we want to be alerted about. To create an alert just click on the “Create Saved query / Alert rule” icon next to the search box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WXPJ8hUI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/07ojqtgwkz74vtrsphhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WXPJ8hUI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/07ojqtgwkz74vtrsphhp.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to the setup of metric alert rules, we can define threshold-based or anomaly detection alerts based on the number of matching log events the alert query returns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Wf0QNU6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/k9cuyrvyfcvpvyxapgvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Wf0QNU6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/k9cuyrvyfcvpvyxapgvj.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please watch &lt;a href="https://www.youtube.com/watch?v=we9xauud28o"&gt;Alerts in Sematext Cloud&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding Common Solr Problems from Solr Logs
&lt;/h2&gt;

&lt;p&gt;There are common issues that you may want to watch for when running Solr / SolrCloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ZooKeeper disconnects, which you can find with e.g. &lt;em&gt;+ZooKeeperConnection +watcher +disconnected&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yyCQdYxJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/b60vejxhaxwzdja2nfjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yyCQdYxJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/b60vejxhaxwzdja2nfjx.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issues with auto commit operations: &lt;em&gt;+auto +commit +error&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Issues with caches warming up: &lt;em&gt;+error +during +auto +warming&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Issues with commit operations: &lt;em&gt;+unable +distribute +commit&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;SolrCloud leader election issues: &lt;em&gt;+met +exception +give +up +leadership&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Issues with shards: &lt;em&gt;+no +servers +hosting +shard&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might, of course, want to save some of these as alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solr Metrics and Log Correlation
&lt;/h2&gt;

&lt;p&gt;A typical troubleshooting workflow starts with detecting a spike in the metrics, then digging into logs to find the root cause of the problem. Sematext makes this really simple and fast. Your metrics and logs live under the same roof. Logs are centralized, the search is fast, and the powerful &lt;a href="https://sematext.com/docs/logs/search-syntax/"&gt;log search syntax&lt;/a&gt; is simple to use. Correlation of metrics and logs is literally one click away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Fx-Gmcz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/y5w4mcdwjvmd2ahgm8w5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Fx-Gmcz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/y5w4mcdwjvmd2ahgm8w5.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Full Stack Observability for Solr &amp;amp; Friends
&lt;/h1&gt;

&lt;p&gt;Solr’s best friend, especially when running in SolrCloud mode, is &lt;a href="https://zookeeper.apache.org/"&gt;Apache ZooKeeper&lt;/a&gt;. SolrCloud requires ZooKeeper to operate, handle partitions and perform leader elections. &lt;a href="https://github.com/sematext/sematext-agent-integrations/tree/master/zookeeper"&gt;Monitoring ZooKeeper&lt;/a&gt; and SolrCloud together is crucial for correlating metrics from both and having full observability of your SolrCloud operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Tistx6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9cgm2vzzwornzyaovzo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Tistx6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9cgm2vzzwornzyaovzo3.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Monitor Solr with Sematext
&lt;/h1&gt;

&lt;p&gt;Comprehensive monitoring for Solr involves identifying key Solr metrics, collecting metrics and logs, and connecting everything in a meaningful way. In this post, we’ve shown you how to monitor Solr metrics and logs in one place. We used out-of-the-box and customized dashboards, metrics correlation, log correlation, anomaly detection, and alerts. And with other &lt;a href="https://sematext.com/blog/now-open-source-sematext-monitoring-agent/"&gt;open-source integrations&lt;/a&gt;, like &lt;a href="https://github.com/sematext/sematext-agent-integrations/tree/master/mysql"&gt;MySQL&lt;/a&gt; or &lt;a href="https://github.com/sematext/sematext-agent-integrations/tree/master/kafka"&gt;Kafka&lt;/a&gt;, you can easily start monitoring Solr alongside metrics, logs, and distributed request traces from all of the other technologies in your infrastructure. Get deeper visibility into Solr today with a &lt;a href="https://apps.sematext.com/ui/registration"&gt;free Sematext trial&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>solr</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Solr Open Source Monitoring Tools</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 20 May 2019 13:24:23 +0000</pubDate>
      <link>https://dev.to/sematext/solr-open-source-monitoring-tools-39cn</link>
      <guid>https://dev.to/sematext/solr-open-source-monitoring-tools-39cn</guid>
<description>&lt;p&gt;Open source software adoption continues to grow. Tools like &lt;a href="https://sematext.com/blog/monitoring-kafka-with-sematext/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; and &lt;a href="http://lucene.apache.org/solr/" rel="noopener noreferrer"&gt;Solr&lt;/a&gt; are widely used both in small startups that adopt cloud-ready tools from the start and in large enterprises that modernize legacy software by incorporating new tools. In this second part of our Solr monitoring series (see the first part discussing &lt;a href="https://sematext.com/blog/solr-key-metrics-to-monitor/" rel="noopener noreferrer"&gt;Solr metrics to monitor&lt;/a&gt;), we will explore some of the open source tools available to monitor Solr nodes and clusters. We'll take the opportunity to look into what it takes to install, configure and use each tool in a meaningful way.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Importance of Solr Monitoring
&lt;/h1&gt;

&lt;p&gt;Operating, managing and maintaining distributed systems is not easy. As we explored in the first part of our Solr monitoring series, there are more than forty metrics we need to track to have full visibility into our Solr instances and the cluster as a whole. Without a monitoring tool, it is close to impossible to keep an eye on all the pieces needed to be sure that the cluster is healthy, or to react properly when things are not going the right way.&lt;/p&gt;

&lt;p&gt;When searching for an open source tool to help you track &lt;a href="https://sematext.com/solr-api-metrics-cheat-sheet/" rel="noopener noreferrer"&gt;Solr metrics&lt;/a&gt;, look at the following qualities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The ability to monitor and manage multiple clusters&lt;/li&gt;
&lt;li&gt;An easy, at-a-glance overview of the whole cluster and its state&lt;/li&gt;
&lt;li&gt;Clear information about the crucial performance metrics&lt;/li&gt;
&lt;li&gt;Ability to provide historical metrics for post mortem analysis&lt;/li&gt;
&lt;li&gt;A combination of low-level OS metrics, JVM metrics, and Solr-specific metrics&lt;/li&gt;
&lt;li&gt;Ability to set up alerts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's now explore some of the available options.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prometheus with Solr Exporter
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is an open-source monitoring and alerting system that was originally developed at SoundCloud. Right now it is a standalone open source project and it is maintained independently from the company that created it initially. &lt;a href="https://github.com/prometheus" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; project, in 2016, joined the &lt;a href="https://www.cncf.io/" rel="noopener noreferrer"&gt;Cloud Native Computing Foundation&lt;/a&gt; as the second hosted project, right after Kubernetes.&lt;/p&gt;

&lt;p&gt;Out of the box, Prometheus supports a flexible query language on top of a multi-dimensional data model based on a &lt;a href="http://opentsdb.net/" rel="noopener noreferrer"&gt;TSDB&lt;/a&gt;, where data is pulled using an HTTP-based protocol:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyt8woatwekmrkqob43t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyt8woatwekmrkqob43t.png" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://prometheus.io/docs/introduction/overview/" rel="noopener noreferrer"&gt;https://prometheus.io/docs/introduction/overview/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Solr to be able to ship metrics to Prometheus we will use a tool called the Solr Exporter. It takes the &lt;a href="https://sematext.com/solr-api-metrics-cheat-sheet/" rel="noopener noreferrer"&gt;metrics from Solr&lt;/a&gt; and translates them into a format that Prometheus understands. The Solr Exporter not only ships metrics to Prometheus, but can also expose responses to requests such as &lt;a href="https://lucene.apache.org/solr/guide/7_7/collections-api.html" rel="noopener noreferrer"&gt;Collections API&lt;/a&gt; commands, &lt;a href="https://lucene.apache.org/solr/guide/7_7/ping.html" rel="noopener noreferrer"&gt;ping&lt;/a&gt; requests and &lt;a href="https://lucene.apache.org/solr/guide/7_7/json-facet-api.html" rel="noopener noreferrer"&gt;facets&lt;/a&gt; gathered from search results.&lt;/p&gt;

&lt;p&gt;The Prometheus Solr Exporter is shipped with Solr as a contrib module located in the &lt;em&gt;contrib/prometheus-exporter&lt;/em&gt; directory. To start working with it we use the &lt;em&gt;solr-exporter-config.xml&lt;/em&gt; file located in the &lt;em&gt;contrib/prometheus-exporter/conf&lt;/em&gt; directory. It is already pre-configured to work with Solr, and we will not modify it here. However, if you are interested in additional metrics, want to ship additional facet results, or want to send less data to Prometheus, you should look at and modify that file.&lt;/p&gt;

&lt;p&gt;Once we have the exporter configured we need to start it, which is very simple. Just go to the &lt;em&gt;contrib/prometheus-exporter&lt;/em&gt; directory (or the one where you copied it on your production system) and run the appropriate command, depending on the Solr architecture you are running.&lt;/p&gt;

&lt;p&gt;For Solr master-slave deployments you would run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/solr-exporter &lt;span class="nt"&gt;-p&lt;/span&gt; 9854 &lt;span class="nt"&gt;-b&lt;/span&gt; http://localhost:8983/solr &lt;span class="nt"&gt;-f&lt;/span&gt; 
./conf/solr-exporter-config.xml &lt;span class="nt"&gt;-n&lt;/span&gt; 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SolrCloud you would run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/solr-exporter &lt;span class="nt"&gt;-p&lt;/span&gt; 9854 &lt;span class="nt"&gt;-z&lt;/span&gt; localhost:2181/solr &lt;span class="nt"&gt;-f&lt;/span&gt; 
./conf/solr-exporter-config.xml &lt;span class="nt"&gt;-n&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above commands run the Solr Exporter on port &lt;em&gt;9854&lt;/em&gt; with 8 threads for Solr master-slave and 16 for SolrCloud. In the case of SolrCloud we also point the exporter to the ZooKeeper ensemble accessible on port &lt;em&gt;2181&lt;/em&gt; on &lt;em&gt;localhost&lt;/em&gt;. Of course, you should adjust the commands to match your environment.&lt;/p&gt;

&lt;p&gt;After the command has run successfully you should see output like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO  - 2019-04-29 16:36:21.476; 
org.apache.solr.prometheus.exporter.SolrExporter; Start server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Solr master-slave or SolrCloud running and the Solr Exporter started, we are ready to take the next step: configuring our Prometheus instance to fetch data from the exporter. To do that we need to adjust the &lt;em&gt;prometheus.yml&lt;/em&gt; file and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;solr'&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9854'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, in a production system Prometheus will run on a different host than Solr and the Solr Exporter; we can even run multiple exporters. That means we need to adjust the &lt;em&gt;targets&lt;/em&gt; property to match our environment.&lt;/p&gt;
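&lt;p&gt;For example, a production &lt;em&gt;prometheus.yml&lt;/em&gt; scraping two exporter hosts might contain (the hostnames below are placeholders, replace them with your exporter addresses):&lt;/p&gt;

```yaml
scrape_configs:
  - job_name: 'solr'
    static_configs:
      # placeholder hostnames - replace with your exporter addresses
      - targets: ['solr-exporter-1.example.com:9854', 'solr-exporter-2.example.com:9854']
```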

&lt;p&gt;After all the preparations we can finally look into what Prometheus gives us. We can start with the main Prometheus UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbu2c4nlxd8hwkame9qse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbu2c4nlxd8hwkame9qse.png" alt="Prometheus UI" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It allows you to choose the metrics you are interested in, graph them, alert on them, and so on. The beautiful thing about it is that the UI supports the &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" rel="noopener noreferrer"&gt;full Prometheus Query Language&lt;/a&gt;, allowing the use of operators, functions, subqueries and much more.&lt;/p&gt;

&lt;p&gt;When using the visualization functionality of Prometheus we get a full view of the available metrics via a simple dropdown menu, so we don't need to remember each and every metric that Solr exports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far1wwimfy6ocxbwyo9j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far1wwimfy6ocxbwyo9j2.png" alt="Prometheus Metrics" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfdq7bqpbpiqig9lnjj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfdq7bqpbpiqig9lnjj1.png" alt="Prometheus Graph" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The nice thing about Prometheus is that we are not limited to the default UI; we can also use &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; for dashboarding, alerting and team management. Defining a new Prometheus data source is very simple:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxhuek4o79ufbzs7gvtf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxhuek4o79ufbzs7gvtf.png" alt="Prometheus Data Sources" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once that is done we can start visualizing the data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1z8r1cf2kqb3f119bjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1z8r1cf2kqb3f119bjs.png" alt="Prometheus Grafana Graph" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, all of that requires us to build rich dashboards ourselves. Luckily, Solr comes with an example &lt;a href="https://lucene.apache.org/solr/guide/7_7/monitoring-solr-with-prometheus-and-grafana.html#sample-grafana-dashboard" rel="noopener noreferrer"&gt;pre-built Grafana dashboard&lt;/a&gt; that can be used along with the metrics scraped into Prometheus. The example dashboard definition is stored in the &lt;em&gt;contrib/prometheus-exporter/conf/grafana-solr-dashboard.json&lt;/em&gt; file and can be loaded into Grafana, giving a basic view of our Solr cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly25g9cnvqq026zr61ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly25g9cnvqq026zr61ey.png" alt="Example Grafana Dashboard using Prometheus metrics" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dashboards with metrics are not everything that Grafana is capable of. We can set up teams and users, assign roles to them, configure alerts on the metrics, and include multiple data sources within a single Grafana installation. This allows us to keep everything in one place - metrics from multiple sources, logs, signals, traces, and whatever else we need.&lt;/p&gt;

&lt;h1&gt;
  
  
  Graphite &amp;amp; Graphite with Grafana
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://graphiteapp.org/" rel="noopener noreferrer"&gt;Graphite&lt;/a&gt; is a free, open-source monitoring tool that can monitor and graph numeric time-series data. It can collect, store and display data in real time, allowing for fine-grained metrics monitoring. It is composed of three main parts: &lt;a href="https://github.com/graphite-project/carbon" rel="noopener noreferrer"&gt;Carbon&lt;/a&gt;, the daemon listening for time-series data; &lt;a href="https://github.com/graphite-project/whisper" rel="noopener noreferrer"&gt;Whisper&lt;/a&gt;, a database for storing time-series data; and the &lt;a href="https://github.com/graphite-project/graphite-web" rel="noopener noreferrer"&gt;Graphite web app&lt;/a&gt; used for on-demand metrics rendering.&lt;/p&gt;

&lt;p&gt;To start monitoring Solr with Graphite as the platform of choice we assume that you already have Graphite up and running. If you don't, you can start with the provided Docker container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; \
  &lt;span class="nt"&gt;--name&lt;/span&gt; graphite \
  &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always \
  &lt;span class="nt"&gt;-p&lt;/span&gt; 80:80 \
  &lt;span class="nt"&gt;-p&lt;/span&gt; 2003-2004:2003-2004 \
  &lt;span class="nt"&gt;-p&lt;/span&gt; 2023-2024:2023-2024 \
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8125:8125/udp \
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8126:8126 \
  graphiteapp/graphite-statsd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
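&lt;p&gt;Once the container is up, we can sanity-check ingestion by hand. Carbon accepts the plaintext protocol on port &lt;em&gt;2003&lt;/em&gt;: one &lt;code&gt;metric.path value unix-timestamp&lt;/code&gt; line per data point. A minimal sketch - the metric name &lt;code&gt;solr.test.heartbeat&lt;/code&gt; is just an illustration:&lt;/p&gt;

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"

line = graphite_line("solr.test.heartbeat", 1, timestamp=1588000000)
print(line, end="")  # solr.test.heartbeat 1 1588000000

# To actually send it to the container started above, uncomment
# (needs: import socket):
# with socket.create_connection(("localhost", 2003)) as sock:
#     sock.sendall(line.encode("ascii"))
```

&lt;p&gt;If the value shows up under the chosen path in the Graphite UI, the pipeline works and Solr's own reporter can be configured next.&lt;/p&gt;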



&lt;p&gt;To be able to get the data from Solr we will use &lt;a href="https://lucene.apache.org/solr/guide/7_7/metrics-reporting.html#metric-registries" rel="noopener noreferrer"&gt;Solr metrics registry&lt;/a&gt; along with the &lt;a href="https://lucene.apache.org/solr/guide/7_7/metrics-reporting.html#graphite-reporter" rel="noopener noreferrer"&gt;Graphite reporter&lt;/a&gt;. To configure that we need to adjust the &lt;em&gt;solr.xml&lt;/em&gt; file and add the metrics part to it. For example, to monitor information about the JVM and the Solr node the metrics section would look as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;metrics&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;reporter&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"graphite"&lt;/span&gt; &lt;span class="na"&gt;group=&lt;/span&gt;&lt;span class="s"&gt;"node, jvm"&lt;/span&gt;
 &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"org.apache.solr.metrics.reporters.SolrGraphiteReporter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;str&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"host"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;localhost&lt;span class="nt"&gt;&amp;lt;/str&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;int&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"port"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;2003&lt;span class="nt"&gt;&amp;lt;/int&amp;gt;&lt;/span&gt;
     &lt;span class="nt"&gt;&amp;lt;int&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"period"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;60&lt;span class="nt"&gt;&amp;lt;/int&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/reporter&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/metrics&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we pointed Solr to the Graphite server running on &lt;em&gt;localhost&lt;/em&gt; on port &lt;em&gt;2003&lt;/em&gt; and set the reporting period to 60, which means that Solr will push the JVM and Solr node metrics once every 60 seconds.&lt;/p&gt;

&lt;p&gt;Keep in mind that by default Solr writes using the plain-text protocol, which is less efficient than the pickled protocol. If you are configuring Solr and Graphite for production, we suggest setting the pickled property to true in the reporter configuration and using the pickled protocol port, which in the case of our Docker container is 2004.&lt;/p&gt;
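&lt;p&gt;As a sketch, the production-oriented reporter from the example above would then look as follows - verify the property name against the metrics reporting documentation for your Solr version:&lt;/p&gt;

```xml
&lt;metrics&gt;
  &lt;reporter name="graphite" group="node, jvm"
            class="org.apache.solr.metrics.reporters.SolrGraphiteReporter"&gt;
    &lt;str name="host"&gt;localhost&lt;/str&gt;
    &lt;!-- 2004 is the pickled-protocol port of the Docker container above --&gt;
    &lt;int name="port"&gt;2004&lt;/int&gt;
    &lt;int name="period"&gt;60&lt;/int&gt;
    &lt;bool name="pickled"&gt;true&lt;/bool&gt;
  &lt;/reporter&gt;
&lt;/metrics&gt;
```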

&lt;p&gt;We can now easily navigate to our Graphite server, available at &lt;em&gt;127.0.0.1&lt;/em&gt; on port &lt;em&gt;80&lt;/em&gt; with our container, and graph our data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1ki60d4k2g39xoprark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1ki60d4k2g39xoprark.png" alt="Graphite graph" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the metrics are sorted out and easily accessible in the left menu allowing for rich dashboarding capabilities.&lt;/p&gt;

&lt;p&gt;If you are using &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; it is easy to set up Graphite as yet another data source and use Grafana's graphing and dashboarding capabilities to correlate multiple metrics together, even ones coming from different data sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wl54svso3d3pa86e1h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wl54svso3d3pa86e1h9.png" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we need to configure Graphite as the data source. It is as easy as providing the proper Graphite URL and setting the version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xihwfxg3kbv4cthfars.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xihwfxg3kbv4cthfars.png" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we are ready to create our visualizations and dashboards, which is both easy and powerful. With autocomplete available for metrics we don't need to recall exact metric names - Grafana will suggest them for us. An example single-metric dashboard can look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcq5chp2835ayku23wcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcq5chp2835ayku23wcd.png" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Ganglia
&lt;/h1&gt;

&lt;p&gt;&lt;a href="http://ganglia.info/" rel="noopener noreferrer"&gt;Ganglia&lt;/a&gt; is a scalable distributed monitoring system. It is based on a hierarchical design targeted at large numbers of clusters and nodes. It uses XML for data representation, &lt;a href="https://tools.ietf.org/html/rfc4506" rel="noopener noreferrer"&gt;XDR&lt;/a&gt; for data transport and &lt;a href="http://www.rrdtool.org/" rel="noopener noreferrer"&gt;RRD&lt;/a&gt; for data storage and visualization. It has been used to connect clusters across university campuses and has proven able to handle clusters with 2000 nodes.&lt;/p&gt;

&lt;p&gt;To start monitoring Solr master-slave or SolrCloud clusters with Ganglia we will start with setting up the metrics reporter in the &lt;em&gt;solr.xml&lt;/em&gt; configuration file. To do that we add the following section to the mentioned file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;metrics&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;reporter&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"ganglia"&lt;/span&gt; &lt;span class="na"&gt;group=&lt;/span&gt;&lt;span class="s"&gt;"node, jvm"&lt;/span&gt;
&lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"org.apache.solr.metrics.reporters.SolrGangliaReporter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;str&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"host"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;localhost&lt;span class="nt"&gt;&amp;lt;/str&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;int&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"port"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;8649&lt;span class="nt"&gt;&amp;lt;/int&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/reporter&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/metrics&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next thing we need to do is allow Solr to speak the XDR protocol used for data transport. We need to download the &lt;a href="http://www.rrdtool.org/" rel="noopener noreferrer"&gt;oncrpc-1.0.7.jar&lt;/a&gt; jar file and place it either in your Solr classpath or include the path to it in your &lt;em&gt;solrconfig.xml&lt;/em&gt; file using the lib directive.&lt;/p&gt;

&lt;p&gt;Once all of the above is done, and assuming our Ganglia is running on &lt;em&gt;localhost&lt;/em&gt; on port &lt;em&gt;8649&lt;/em&gt;, that is everything we need to start shipping Solr node and JVM metrics.&lt;/p&gt;

&lt;p&gt;By visiting Ganglia and choosing the Solr node we can start looking into the metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakihvordxh8x0iay4uu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakihvordxh8x0iay4uu1.png" alt="Ganglia Metrics" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can jump to the graphs right away, choose which group of metrics we are interested in, and see most of the relevant data at a glance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wlej6fmrl04p88gmm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61wlej6fmrl04p88gmm4.png" alt="Ganglia graph" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ganglia provides us with all the visibility for our metrics but, out of the box, it doesn't support one of the crucial features that we are looking for - alerting. There is a project called &lt;a href="http://www.rrdtool.org/" rel="noopener noreferrer"&gt;ganglia-alert&lt;/a&gt;, a user-contributed extension to Ganglia.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As you can see, there is a wide variety of tools that help you monitor Solr. What you have to keep in mind is that each requires setup, configuration and manual dashboard building in order to produce meaningful information. All of that may require deep knowledge of the whole ecosystem.&lt;/p&gt;

&lt;p&gt;If you are looking for a Solr monitoring tool that you can set up in minutes and have pre-built dashboards with all necessary information, alerting and team management take a look at the third part of the Solr monitoring series to learn more about &lt;a href="https://sematext.com/blog/solr-monitoring-made-easy-with-sematext/" rel="noopener noreferrer"&gt;production ready Solr monitoring with Sematext&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you need full-stack observability for your software stack, check out &lt;a href="https://sematext.com/" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;. We’re pushing to &lt;a href="https://github.com/sematext" rel="noopener noreferrer"&gt;open source our products&lt;/a&gt; and make an impact.&lt;/p&gt;

</description>
      <category>solr</category>
      <category>prometheus</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Solr Key Metrics to Monitor</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 20 May 2019 12:07:09 +0000</pubDate>
      <link>https://dev.to/sematext/solr-key-metrics-to-monitor-71i</link>
      <guid>https://dev.to/sematext/solr-key-metrics-to-monitor-71i</guid>
      <description>&lt;p&gt;As the first part of the three-part series on monitoring &lt;a href="http://lucene.apache.org/solr/" rel="noopener noreferrer"&gt;Apache Solr&lt;/a&gt;, this article explores which Solr metrics are important to monitor and why. The second part of the series covers &lt;a href="https://sematext.com/blog/solr-open-source-monitoring-tools/" rel="noopener noreferrer"&gt;Solr open source monitoring tools&lt;/a&gt; and identify the tools and techniques you need to help you monitor and administer Solr and SolrCloud in production.&lt;/p&gt;

&lt;h1&gt;
  
  
  Two Architectures
&lt;/h1&gt;

&lt;p&gt;When first thinking about installing Solr you usually ask yourself a question - should I go with the &lt;a href="https://lucene.apache.org/solr/guide/7_7/legacy-scaling-and-distribution.html" rel="noopener noreferrer"&gt;master-slave environment&lt;/a&gt; or should I commit to &lt;a href="https://lucene.apache.org/solr/guide/7_7/solrcloud.html" rel="noopener noreferrer"&gt;SolrCloud&lt;/a&gt;? This question will remain unanswered in this blog post, but it is important to know which architecture you will be monitoring. When dealing with SolrCloud clusters you want to monitor not only per-node metrics but also cluster-wide information and the &lt;a href="https://sematext.com/docs/integration/zookeeper/" rel="noopener noreferrer"&gt;metrics related to Zookeeper&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Monitor Solr?
&lt;/h1&gt;

&lt;p&gt;When running Solr it is usually a crucial part of the system. It is used as a search and analysis engine for your data - part of it or all of it. Such a critical part of the whole architecture needs to be both fault tolerant and highly available. Solr approaches that in two ways. The legacy architecture, also called master-slave, is based on a clear distinction between the master server, which is responsible for indexing the data, and the slave servers, which are responsible for delivering the search and analysis results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovdu3s446dhpiu5tapn3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovdu3s446dhpiu5tapn3.jpg" alt="Solr master-slave architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the data is pushed to the master it is transformed into a so-called inverted index based on the configuration that we provide. The disk-based inverted index is divided into smaller, immutable pieces called segments, which are then used for searching. The segments can also be combined together into larger segments in a process called &lt;a href="https://sematext.com/blog/solr-optimize-is-not-bad-for-you-lucene-solr-revolution/" rel="noopener noreferrer"&gt;segment merging&lt;/a&gt; for performance reasons - the more segments you have, the slower your searches can be and vice versa.&lt;/p&gt;

&lt;p&gt;Once the data has been written in the form of segments on the master’s disk, it can be replicated to the slave servers. This is done in a pull model. The slave servers use the HTTP protocol to copy the binary data from the master node. Each node does that on its own, separately copying the changed data over the network. We already see a dozen places that we should monitor and know about.&lt;/p&gt;

&lt;p&gt;Having a single master node is not something that we would call fault tolerant, because it is a single point of failure. Because of that, the second type of architecture was introduced with the Solr 4.0 release - &lt;a href="https://sematext.com/blog/scaling-solr-with-solrcloud/" rel="noopener noreferrer"&gt;SolrCloud&lt;/a&gt;. It is based on the assumption that the data is distributed among a virtually unlimited number of nodes and each node can perform both indexing and search processing roles. Physical copies of the data, placed in so-called &lt;a href="https://sematext.com/blog/handling-shards-in-solrcloud/" rel="noopener noreferrer"&gt;shards&lt;/a&gt;, can be created on demand in the form of physical replicas and replicated between nodes in a near-real-time manner, allowing for true fault tolerance and high availability. However, for that to happen we need an additional piece of software - an &lt;a href="https://zookeeper.apache.org/" rel="noopener noreferrer"&gt;Apache Zookeeper&lt;/a&gt; cluster to help Solr manage and configure its nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvwg3lmypzu4jpnkbgbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvwg3lmypzu4jpnkbgbi.png" alt="SolrCloud architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the data is pushed to any of the Solr nodes that are part of the cluster, the first thing that is done is forwarding the data to a leader shard. The leader stores the data in the write-ahead log called the &lt;a href="https://lucene.apache.org/solr/guide/7_7/updatehandlers-in-solrconfig.html#transaction-log" rel="noopener noreferrer"&gt;transaction log&lt;/a&gt; and, depending on the &lt;a href="https://sematext.com/blog/solr-7-new-replica-types/" rel="noopener noreferrer"&gt;replica type&lt;/a&gt;, sends the data to the replicas for processing. The data is then indexed and written to disk in the inverted index format. This can cause additional I/O requirements - the data indexing may also trigger &lt;a href="https://sematext.com/blog/solr-optimize-is-not-bad-for-you-lucene-solr-revolution/" rel="noopener noreferrer"&gt;segment merging&lt;/a&gt;, and finally the index needs to be refreshed in order to be visible for searching, which requires yet another I/O operation.&lt;/p&gt;

&lt;p&gt;When you send a search query to a SolrCloud cluster, the node that is hit by the query initially &lt;a href="https://sematext.com/blog/solrcloud-large-tenants-and-routing/" rel="noopener noreferrer"&gt;propagates the query to the shards that need to be queried&lt;/a&gt; in order to provide full data visibility. Each distributed search is done in two phases - scatter and gather. The scatter phase finds which shards have the matching documents, the identifiers of those documents and their scores. The gather phase renders the search results by retrieving the needed documents from the shards that have them indexed. Each search phase requires I/O to read the data from disk, memory to store the results and intermediate steps, CPU cycles to calculate everything, and network to transport the data.&lt;/p&gt;

&lt;p&gt;Let's now look at how we can monitor all those metrics that are crucial to our indexing and searching.&lt;/p&gt;

&lt;h1&gt;
  
  
  Monitoring Solr Metrics via JMX
&lt;/h1&gt;

&lt;p&gt;One of the tools that comes out of the box with the Java Development Kit and can come in handy when you need quick, ad-hoc monitoring is jconsole - a GUI tool that allows one to get basic metrics about your &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;JVM&lt;/a&gt;, like &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;memory usage&lt;/a&gt;, CPU utilization, &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;JVM threads&lt;/a&gt; and loaded classes. In addition, it allows us to read metrics exposed by Solr itself in the form of JMX MBeans. Whatever metrics Solr's creators exposed in that form can be read using jconsole. Things like the average query response time for a given search handler, the number of indexing requests or the number of errors - all can be read via the &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;JMX MBeans&lt;/a&gt;. The problem with jconsole is that it doesn't show us the history of the measurements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1ucp9vejgg97skmx09a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1ucp9vejgg97skmx09a.png" alt="JMX monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jconsole&lt;/em&gt; is not the only way of reading the JMX MBean values - there are other tools that can do that, like the open sourced &lt;a href="https://github.com/sematext/jmxc" rel="noopener noreferrer"&gt;JMXC&lt;/a&gt; or our open-source &lt;a href="https://github.com/sematext/sematext-agent-java" rel="noopener noreferrer"&gt;Sematext Java Agent&lt;/a&gt;. Unlike the tools that export the data in text format, our Sematext Java Agent can ship the data to &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; - a full stack observability solution which can help you get detailed insight into your Solr metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiy1mpvaxcjxpz8vfi6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiy1mpvaxcjxpz8vfi6k.png" alt="Solr metrics overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Monitoring Solr Metrics via Metrics API
&lt;/h1&gt;

&lt;p&gt;The second option for gathering Solr metrics is an API introduced in Solr 6.4 - the &lt;a href="https://sematext.com/blog/solr-new-metrics-api-solr-64/" rel="noopener noreferrer"&gt;Solr Metrics API&lt;/a&gt;. It supports on-demand metrics retrieval using an HTTP-based API for cores, collections, nodes and the JVM. However, the flexibility of the API doesn't come from it being available on demand, but from its ability to report the data to various destinations by using Reporters. Right now, out of the box, the data can be exported with a little configuration to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;JMX - JMX MBeans, something we already discussed&lt;/li&gt;
&lt;li&gt;SLF4J - logs or any destination that SLF4J supports&lt;/li&gt;
&lt;li&gt;&lt;a href="https://graphiteapp.org/" rel="noopener noreferrer"&gt;Graphite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ganglia.sourceforge.net/" rel="noopener noreferrer"&gt;Ganglia&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
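&lt;p&gt;The on-demand side of the API is a plain HTTP endpoint at &lt;code&gt;/solr/admin/metrics&lt;/code&gt;, filterable with parameters such as &lt;code&gt;group&lt;/code&gt; and &lt;code&gt;prefix&lt;/code&gt;. A hedged sketch that only builds the request URL - it assumes a node on &lt;code&gt;localhost:8983&lt;/code&gt;, and the actual fetch is commented out so nothing runs without a live server:&lt;/p&gt;

```python
import urllib.parse

def metrics_url(host="localhost", port=8983, **params):
    """Build a Solr Metrics API URL from keyword filters such as group and prefix."""
    query = urllib.parse.urlencode({"wt": "json", **params})
    return f"http://{host}:{port}/solr/admin/metrics?{query}"

url = metrics_url(group="jvm", prefix="memory.heap")
print(url)

# Against a live node, uncomment (needs: import json, urllib.request):
# with urllib.request.urlopen(url) as resp:
#     print(json.load(resp))
```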

&lt;p&gt;We will cover the open source Solr monitoring solutions in greater detail in the second part of this three-part monitoring series.&lt;/p&gt;

&lt;h1&gt;
  
  
  Key Solr Metrics to Monitor
&lt;/h1&gt;

&lt;p&gt;Now that we know how we can monitor Solr, let's look into the top Solr metrics that we should keep an eye on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Request Rate
&lt;/h2&gt;

&lt;p&gt;Each handler in Solr provides information about the rate of the requests that are sent to it. Knowing how many requests in total and per handler are handled by your Solr node or cluster may be crucial to diagnosing whether operations are going properly. A sudden drop or spike in the request rate may point to a failure in one of the components of your system. Cross-referencing the request rate with the request latency can tell you how fast your requests are at a given rate, or reveal issues coming your way when the number of requests stays the same but the latency starts to grow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09u5g7evju3huatzpwts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09u5g7evju3huatzpwts.png" alt="Request rate"&gt;&lt;/a&gt;&lt;/p&gt;
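&lt;p&gt;Handler request counters exposed by Solr are cumulative, so a rate is typically derived from two samples taken a known interval apart. A small illustrative sketch - the sample numbers are made up:&lt;/p&gt;

```python
def request_rate(count_t0, count_t1, interval_seconds):
    """Requests per second between two samples of a cumulative counter."""
    if interval_seconds > 0:
        return (count_t1 - count_t0) / interval_seconds
    raise ValueError("interval must be positive")

# Two samples of a /select handler counter taken 60 seconds apart:
rate = request_rate(120_000, 126_000, 60)
print(f"{rate:.1f} req/s")  # 100.0 req/s
```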

&lt;h2&gt;
  
  
  Request Latency
&lt;/h2&gt;

&lt;p&gt;A measurement of how fast your requests are, similar to the request rate, is available for each of the handlers separately. It means that we are able to easily see the latency of our queries and update requests. If you have various search handlers dedicated to different search needs, e.g. one for product search and one for article search, you can easily measure the latency of each handler and see how fast the results for a given type of data are returned. Cross-referencing the latency of the request with metrics like garbage collector work, JVM memory utilization, I/O utilization, and CPU utilization allows for easy diagnosis of performance problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep3wtjicrjzdwdrdhfww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep3wtjicrjzdwdrdhfww.png" alt="Request latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit Events
&lt;/h2&gt;

&lt;p&gt;Commit events in Solr come in various flavors. There are manual commits, sent with or without the indexing requests. There are &lt;a href="https://lucene.apache.org/solr/guide/7_7/updatehandlers-in-solrconfig.html#autocommit" rel="noopener noreferrer"&gt;automatic commits&lt;/a&gt; - ones that are fired after certain criteria are met, either because time has passed or because the number of documents exceeded a threshold. Why are they important? They are responsible for data persistence and data visibility. The hard commit flushes the data to the disk and clears the transaction log. The soft commit reopens the searcher object, allowing Solr to see new segments and thus serve new data to your users. However, it is crucial to balance the number of commits and the time between them - they are not free, and we need to be sure that our commits are neither too frequent nor too far apart.&lt;/p&gt;
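&lt;p&gt;The automatic variants are configured in &lt;em&gt;solrconfig.xml&lt;/em&gt; inside the update handler. A sketch with illustrative values - tune the intervals to your own persistence and visibility needs:&lt;/p&gt;

```xml
&lt;updateHandler class="solr.DirectUpdateHandler2"&gt;
  &lt;!-- hard commit: persist segments and truncate the transaction log --&gt;
  &lt;autoCommit&gt;
    &lt;maxTime&gt;60000&lt;/maxTime&gt;        &lt;!-- at most every 60 s --&gt;
    &lt;openSearcher&gt;false&lt;/openSearcher&gt;
  &lt;/autoCommit&gt;
  &lt;!-- soft commit: reopen the searcher so new documents become visible --&gt;
  &lt;autoSoftCommit&gt;
    &lt;maxTime&gt;5000&lt;/maxTime&gt;         &lt;!-- visibility within ~5 s --&gt;
  &lt;/autoSoftCommit&gt;
&lt;/updateHandler&gt;
```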

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck28cft5hl42gfezcj5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck28cft5hl42gfezcj5u.png" alt="Commit events"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Caches Utilization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://lucene.apache.org/solr/guide/7_7/query-settings-in-solrconfig.html#caches" rel="noopener noreferrer"&gt;Caches&lt;/a&gt; play a crucial role in Solr performance, especially when it comes to the Solr master-slave architecture. The data that is cached can be accessed without the need for expensive disk operations. Caches are not free, though - they require memory, and the more information you would like to cache, the more memory it will require. That's why it is important to monitor the size and hit rate of the caches. If your caches are too small, the hit rate will be low and you will see lots of evictions - removals of data from the caches - causing CPU usage and garbage collection, effectively reducing your node performance. Caches that are too large, on the other hand, will increase the amount of data on the JVM heap, pushing the garbage collector even harder and again lowering the effective performance of your nodes. You can also cross-reference the utilization of the cache with commit events - remember, each commit event discards the entries inside the cache, causing a refresh and warm-up which uses resources such as CPU and I/O.&lt;/p&gt;
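&lt;p&gt;Solr's cache statistics expose cumulative lookup, hit and eviction counts, from which the hit ratio falls out directly. A sketch with made-up numbers for a filterCache:&lt;/p&gt;

```python
def hit_ratio(hits, lookups):
    """Fraction of cache lookups served from the cache."""
    return hits / lookups if lookups else 0.0

# Sample cumulative stats (illustrative values):
stats = {"lookups": 50_000, "hits": 46_500, "evictions": 1_200}
ratio = hit_ratio(stats["hits"], stats["lookups"])
print(f"hit ratio: {ratio:.1%}")  # hit ratio: 93.0%

# A low ratio together with many evictions usually means the cache is too small.
too_small = stats["evictions"] and 0.8 > ratio
print("consider a larger cache" if too_small else "cache size looks OK")
```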

&lt;p&gt;The above are the key Solr metrics to pay attention to, although there are &lt;a href="https://sematext.com/docs/integration/solr" rel="noopener noreferrer"&gt;other useful Solr metrics&lt;/a&gt;, too.&lt;/p&gt;

&lt;h1&gt;
  
  
  Key OS &amp;amp; JVM Metrics to Monitor
&lt;/h1&gt;

&lt;p&gt;Apache Solr is Java software and as such depends greatly on the performance of the whole Java Virtual Machine and its parts, such as the &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;garbage collector&lt;/a&gt;. The JVM itself doesn’t work in isolation and depends on the operating system: the available &lt;a href="https://sematext.com/server-monitoring/" rel="noopener noreferrer"&gt;physical memory&lt;/a&gt;, the number of CPU cores and their speed, and the &lt;a href="https://sematext.com/server-monitoring/" rel="noopener noreferrer"&gt;speed of the I/O&lt;/a&gt; subsystem. Let's look into the crucial metrics that we should be aware of.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU Utilization
&lt;/h2&gt;

&lt;p&gt;The majority of the operations performed by Solr depend to some degree on the processing power of the CPU. When you index data, it needs to be processed before it is written to the disk - the more complicated the analysis configuration, the more &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;CPU&lt;/a&gt; cycles will be needed for each document. Query-time analytics, such as facets, need to process a vast number of documents in subsecond time for Solr to be able to return query results in a timely manner. The Java virtual machine also requires CPU processing power for operations such as garbage collection. Correlating the CPU utilization with other metrics, i.e. request rate or request latency, may reveal potential bottlenecks or point us to potential improvements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd1if5v8r2uks89cf7gf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd1if5v8r2uks89cf7gf.png" alt="CPU details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory, JVM Memory &amp;amp; Swap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;Free memory&lt;/a&gt; and &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;swap space&lt;/a&gt; are very important when you care about performance. The swap space is used by the operating system when there is not enough physical memory available and there is a need for assigning some more memory for applications. In such case memory pages may be swapped, which means that those will be taken out of the physical memory and written to the dedicated swap partition on the hard drive. When the data from those swapped memory pages are needed the operating system loads it from the swap space back to the physical memory. You can imagine that such an operation takes time, even the fastest solid-state drives are magnitude slower compared to RAM memory. Being aware of the implications of swapping the memory we can now easily say that the JVM applications don't like to be swapped - it kills performance. Because of that, you want to avoid you &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;Solr JVM heap memory&lt;/a&gt; to be swapped. You should closely monitor your memory usage and swap usage and correlate that with your Solr performance or completely disable swapping.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjwep0scy8tlsi0ov3rp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjwep0scy8tlsi0ov3rp.png" alt="Swap details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to monitoring the system memory, you should also pay close attention to JVM memory and the utilization of its various pools. Having the &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;JVM memory pools&lt;/a&gt; fully utilized, especially the old generation space, will result in extensive garbage collection and leave your Solr completely unusable.&lt;/p&gt;
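&lt;p&gt;As a sketch, old generation pressure can be derived from a pool's used and max values. The snapshot shape below is modeled on the &lt;code&gt;memory.pools&lt;/code&gt; entries of Solr's JVM metrics group (an assumption - adjust the names to what your Solr version reports):&lt;/p&gt;

```python
def pool_utilization_pct(used_bytes, max_bytes):
    """Percentage of a JVM memory pool that is currently in use."""
    return 100.0 * used_bytes / max_bytes

# Illustrative snapshot, shaped after memory.pools.* entries in
# /solr/admin/metrics?group=jvm (names and numbers are assumptions).
old_gen = {"used": 7_500_000_000, "max": 8_000_000_000}

util = pool_utilization_pct(old_gen["used"], old_gen["max"])
if util > 90.0:
    print(f"Old generation at {util:.1f}% -- expect heavy GC activity")
```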

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9w3zvev54wuoyb39lt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9w3zvev54wuoyb39lt2.png" alt="JVM memory pools utilization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Disk Utilization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;Apache Solr is a very I/O based application&lt;/a&gt; - it writes the data when indexing and reads the data when the search is performed. The more data you have the higher the utilization of your I/O subsystem will be and of course the performance of the I/O subsystem will have a direct connection to your Solr performance. Correlating the I/O read and write metrics to request latency and CPU usage can highlight potential bottlenecks in your system allowing you to scale the whole deployment better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx65e48lbj4mq4xhpw3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx65e48lbj4mq4xhpw3b.png" alt="IO Utilization"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Garbage Collector Statistics
&lt;/h2&gt;

&lt;p&gt;When data is used inside a JVM-based application, it is put onto the heap - first into the smaller young generation space, later moved to the usually larger old generation heap space. Assigning an object to the appropriate heap space is one of the garbage collector's responsibilities. Its major responsibility, and the one we are most interested in, is cleaning up objects that are no longer used. When an object in the Java code is no longer in use, it can be taken off the heap in the process of garbage collection. That process runs from time to time - a few times a second for the young generation and every now and then for the old generation heap. We need to know how fast garbage collections are, how often they happen, and whether everything is healthy.&lt;/p&gt;

&lt;p&gt;If your garbage collection process is not stopping the whole application and the old generation garbage collection is not running constantly, that is good - it usually means you have a healthy environment. Keep in mind that correlating garbage collector metrics with memory utilization and performance measurements like request latency may reveal memory issues.&lt;/p&gt;
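&lt;p&gt;One useful health indicator is GC overhead - the share of wall-clock time spent in collections over a window. A sketch with illustrative numbers (in practice the deltas would come from GC logs or JMX counters):&lt;/p&gt;

```python
def gc_overhead_pct(gc_time_ms_delta, interval_ms):
    """Percentage of the interval spent in garbage collection."""
    return 100.0 * gc_time_ms_delta / interval_ms

# Illustrative deltas over a 60-second window: time spent in young
# and old generation collections.
young_ms, old_ms = 450, 1200
overhead = gc_overhead_pct(young_ms + old_ms, 60_000)
print(f"GC overhead: {overhead:.2f}%")
```

&lt;p&gt;A few percent of overhead is typically fine; a double-digit figure means the collector is eating into the time Solr should be spending on indexing and queries.&lt;/p&gt;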

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe8gu8su5s8cvrf1wo6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe8gu8su5s8cvrf1wo6c.png" alt="GC statistics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SolrCloud and Zookeeper
&lt;/h1&gt;

&lt;p&gt;Having a healthy ZooKeeper ensemble is crucial when running a SolrCloud cluster. It is responsible for keeping collection configurations and the cluster state required for the SolrCloud cluster to work, for helping with leader election, and so on. When ZooKeeper is in trouble, your SolrCloud cluster will not be able to accept new data, move shards around, or accept new nodes joining the cluster - the only thing that may still work is queries, and only to some extent.&lt;/p&gt;

&lt;p&gt;Because a healthy ZooKeeper ensemble is a required piece of every SolrCloud cluster, it is crucial to have &lt;a href="https://sematext.com/java-monitoring/" rel="noopener noreferrer"&gt;full observability of the ZooKeeper ensemble&lt;/a&gt;. You should keep an eye on metrics like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Statistics of connections established with ZooKeeper&lt;/li&gt;
&lt;li&gt;Request latency&lt;/li&gt;
&lt;li&gt;Memory and JVM memory utilization&lt;/li&gt;
&lt;li&gt;Garbage collector time and count&lt;/li&gt;
&lt;li&gt;CPU utilization&lt;/li&gt;
&lt;li&gt;Quorum status&lt;/li&gt;
&lt;/ol&gt;
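&lt;p&gt;Many of these metrics are exposed by ZooKeeper's &lt;code&gt;mntr&lt;/code&gt; four-letter command, which prints tab-separated key/value pairs. The sketch below parses a sample of that output (the values are illustrative):&lt;/p&gt;

```python
def parse_mntr(output):
    """Parse the tab-separated key/value output of ZooKeeper's `mntr` command."""
    stats = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats

# Illustrative `mntr` output.
sample = (
    "zk_version\t3.6.2\n"
    "zk_avg_latency\t1\n"
    "zk_num_alive_connections\t12\n"
    "zk_server_state\tleader\n"
)
stats = parse_mntr(sample)
print(f"state={stats['zk_server_state']}, "
      f"connections={stats['zk_num_alive_connections']}")
```

&lt;p&gt;Note that on recent ZooKeeper versions the command has to be allowed via the &lt;code&gt;4lw.commands.whitelist&lt;/code&gt; setting before it will respond.&lt;/p&gt;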

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rjbp198s89lh407h8mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rjbp198s89lh407h8mm.png" alt="Zookeeper monitoring dashboard example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See a &lt;a href="https://sematext.com/docs/integration/zookeeper" rel="noopener noreferrer"&gt;more complete list of ZooKeeper metrics&lt;/a&gt; in Sematext &lt;a href="https://sematext.com/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Solr is an awesome search engine and analytics platform, allowing for blazingly fast data indexing and retrieval. Keeping all the relevant &lt;a href="https://sematext.com/integrations/solr-monitoring/" rel="noopener noreferrer"&gt;Solr&lt;/a&gt; and &lt;a href="https://sematext.com/server-monitoring/" rel="noopener noreferrer"&gt;OS&lt;/a&gt; metrics under observation is way easier with the right monitoring tool. That's why in the second part of the monitoring Solr series, we take a look at the possible options when it comes to &lt;a href="https://sematext.com/blog/solr-open-source-monitoring-tools/" rel="noopener noreferrer"&gt;monitoring Solr using open source tools&lt;/a&gt;. The last part of the series will cover &lt;a href="https://sematext.com/blog/solr-monitoring-made-easy-with-sematext/" rel="noopener noreferrer"&gt;production-ready Solr monitoring with Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>solr</category>
      <category>observability</category>
      <category>metrics</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kafka Monitoring in Production - eBook</title>
      <dc:creator>Rafał Kuć</dc:creator>
      <pubDate>Mon, 20 May 2019 09:44:33 +0000</pubDate>
      <link>https://dev.to/sematext/kafka-monitoring-in-production-ebook-pee</link>
      <guid>https://dev.to/sematext/kafka-monitoring-in-production-ebook-pee</guid>
      <description>&lt;p&gt;Hello!&lt;/p&gt;

&lt;p&gt;I love data and I love working with it. As a developer and consultant, I spend my time working with data - on internal products that you can see in the form of &lt;a href="https://sematext.com/cloud/"&gt;Sematext Cloud&lt;/a&gt;, as well as for our clients. &lt;/p&gt;

&lt;p&gt;We decided to share the knowledge we gained while working with Apache Kafka in the form of a blog post series, starting with &lt;a href="https://sematext.com/blog/kafka-metrics-to-monitor/"&gt;Key Kafka Metrics to Monitor&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We also decided that it would be good to release that content as an eBook that you can download and have with you whenever you need it. Here it is - the official download link for &lt;a href="https://hubs.ly/H0hZ8pw0"&gt;Kafka Monitoring - The Complete Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feel free to reach out in the comments or through various social media channels if you have any questions, suggestions or comments!&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
