<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yousef Zook</title>
    <description>The latest articles on DEV Community by Yousef Zook (@yousef_zook).</description>
    <link>https://dev.to/yousef_zook</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F731480%2Fe630ec9c-06c6-4d7d-88d5-92b08dfb77cc.jpeg</url>
      <title>DEV Community: Yousef Zook</title>
      <link>https://dev.to/yousef_zook</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yousef_zook"/>
    <language>en</language>
    <item>
      <title>Java Performance - 5 - An introduction to Garbage Collection</title>
      <dc:creator>Yousef Zook</dc:creator>
      <pubDate>Fri, 03 Dec 2021 13:04:21 +0000</pubDate>
      <link>https://dev.to/yousef_zook/java-performance-5-an-introduction-to-garbage-collection-1d87</link>
      <guid>https://dev.to/yousef_zook/java-performance-5-an-introduction-to-garbage-collection-1d87</guid>
      <description>&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;Hello hello :3, welcome back. This is part 6 of the series &lt;code&gt;Java Performance&lt;/code&gt;, which summarizes the &lt;strong&gt;Java Performance book&lt;/strong&gt; by &lt;strong&gt;Scott Oaks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the previous chapter we talked about the JIT compilers in Java and introduced the new VM &lt;code&gt;GraalVM&lt;/code&gt;. We also discussed some important tuning flags for the JIT and tiered compilation.&lt;/p&gt;

&lt;p&gt;In this chapter we are going to talk about &lt;strong&gt;Garbage Collectors&lt;/strong&gt; in Java. We will briefly cover how they work and how they differ in performance.&lt;/p&gt;

&lt;p&gt;So, let's start the fifth chapter...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cpv4aq98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i.pinimg.com/originals/43/3d/83/433d83f7e481f35245f8c6bb7c7591d8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cpv4aq98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i.pinimg.com/originals/43/3d/83/433d83f7e481f35245f8c6bb7c7591d8.gif" alt="Intro" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Chapter Title:
&lt;/h2&gt;

&lt;p&gt;An Introduction to Garbage Collection&lt;/p&gt;

&lt;p&gt;Because the performance of Java applications depends heavily on garbage collection technology, it is not surprising that quite a few collectors are available. The OpenJDK has three collectors suitable for production, another that is deprecated in JDK 11 but still quite popular in JDK 8, and some experimental collectors that will (ideally) be production-ready in future releases. Other Java implementations such as OpenJ9 or the Azul JVM have their own collectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  1) Garbage Collection Overview
&lt;/h2&gt;

&lt;p&gt;At a basic level, GC consists of finding objects that are in use and freeing the memory associated with the remaining objects (those that are not in use).&lt;/p&gt;

&lt;p&gt;Since references cannot be tracked dynamically via a count, the JVM must periodically search the heap for unused objects.&lt;br&gt;
&lt;strong&gt;Why can't references be tracked by a count?&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Answer:&lt;/em&gt; Consider this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given a linked list of objects, each object in the list (except the head) will be pointed to by another object in the list—but if nothing refers to the head of the list, the entire list is not in use and can be freed. And if the list is circular (e.g., the tail of the list points to the head), every object in the list has a reference to it—even though no object in the list can actually be used, since no objects reference the list itself.&lt;/li&gt;
&lt;/ul&gt;
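&lt;p&gt;The bullet above can be sketched in code. The following is a minimal illustration (the class and method names are made up for this example, and the comments describe what a hypothetical reference-counting collector would see, not what the JVM actually does):&lt;/p&gt;

```java
// A minimal sketch (not the JVM's actual bookkeeping): a circular list whose
// nodes all keep a nonzero reference count, yet none of them are reachable.
public class CircularListDemo {
    static class Node {
        Node next;
    }

    // Build a two-node circular list: each node is referenced by the other,
    // so a naive per-object reference count would never drop to zero.
    static Node buildCycle() {
        Node head = new Node();
        Node tail = new Node();
        head.next = tail;
        tail.next = head;
        return head;
    }

    public static void main(String[] args) {
        Node head = buildCycle();
        head = null;
        // At this point no GC root refers to the list. A tracing collector,
        // which starts from roots (locals, statics) and follows references,
        // sees both nodes as garbage despite their mutual references.
        System.out.println("cycle is now unreachable");
    }
}
```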

&lt;p&gt;GC main steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Freeing Objects&lt;/li&gt;
&lt;li&gt;Compaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wjXhBq7f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dvnvj5c6a8rpre6fhlbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wjXhBq7f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dvnvj5c6a8rpre6fhlbu.png" alt="Image description" width="880" height="518"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  A- Generational Garbage Collectors
&lt;/h3&gt;

&lt;p&gt;Though the details differ somewhat, most garbage collectors work by splitting the heap into generations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The old generation (or tenured)&lt;/li&gt;
&lt;li&gt;The young generation, this contains

&lt;ul&gt;
&lt;li&gt;eden&lt;/li&gt;
&lt;li&gt;survivor spaces&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Objects are first allocated in the &lt;code&gt;young generation&lt;/code&gt;, which is a subset of the entire heap. When the young generation fills up, the garbage collector will stop all the application threads and empty out the young generation. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects that are no longer in use are discarded, &lt;/li&gt;
&lt;li&gt;and objects that are still in use are moved elsewhere (to a survivor space, or to the old generation if there is no room left in the survivor spaces). This operation is called a minor GC or a young GC.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design has two performance advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Because the young generation is only a portion of the entire heap, processing it is faster than processing the entire heap. The application threads are stopped for a much shorter period of time than if the entire heap were processed at once.&lt;/li&gt;
&lt;li&gt;The second advantage arises from the way objects are allocated in the young generation. Objects are allocated in eden (which encompasses the vast majority of the young generation). When the young generation is cleared during a collection, all objects in eden are either moved or discarded: objects that are not in use can be discarded, and objects in use are moved either to one of the survivor spaces or to the old generation. Since all surviving objects are moved, the young generation is automatically compacted when it is collected: at the end of the collection, eden and one of the survivor spaces are empty, and the objects that remain in the young generation are compacted within the other survivor space.&lt;/li&gt;
&lt;/ul&gt;
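&lt;p&gt;You can see these generational spaces on your own JVM through the standard &lt;code&gt;java.lang.management&lt;/code&gt; API. This is just a quick inspection sketch; the pool names mentioned in the comments depend on which collector your JVM happens to use:&lt;/p&gt;

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class ShowGenerations {
    public static void main(String[] args) {
        // Each heap pool maps onto one of the generational spaces. The exact
        // names depend on the collector: e.g. "G1 Eden Space" under G1, or
        // "PS Eden Space" / "PS Old Gen" under the throughput collector.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                System.out.println(pool.getName() + " used=" + pool.getUsage().getUsed());
            }
        }
    }
}
```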
&lt;h3&gt;
  
  
  B- GC Algorithms
&lt;/h3&gt;

&lt;p&gt;The following table lists the algorithms and their status in OpenJDK and Oracle Java releases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QGmpSh5u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nli8r2dhu8k34eliw9qz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QGmpSh5u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nli8r2dhu8k34eliw9qz.png" alt="Image description" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1- The serial garbage collector&lt;/strong&gt;&lt;br&gt;
The serial collector uses a single thread to process the heap. It will stop all application threads as the heap is processed (for either a minor or full GC). During a full GC, it will fully compact the old generation.&lt;br&gt;
The serial collector is enabled by using the &lt;code&gt;-XX:+UseSerialGC&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2- The throughput collector&lt;/strong&gt;&lt;br&gt;
In JDK 8, the throughput collector is the default collector for any 64-bit machine with two or more CPUs. The throughput collector uses multiple threads to collect the young generation, which makes minor GCs much faster than when the serial collector is used. It uses multiple threads to process the old generation as well. Because it uses multiple threads, the throughput collector is often called the parallel collector.&lt;br&gt;
The throughput collector stops all application threads during both minor and full GCs, and it fully compacts the old generation during a full GC. Since it is the default in most situations where it would be used, it needn’t be explicitly enabled. To enable it where necessary, use the flag &lt;code&gt;-XX:+UseParallelGC&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3- The G1 GC collector&lt;/strong&gt;&lt;br&gt;
The G1 GC (or garbage first garbage collector) uses a concurrent collection strategy to collect the heap with minimal pauses. It is the default collector in JDK 11 and later for 64-bit JVMs on machines with two or more CPUs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;G1 GC divides the heap into regions, but it still considers the heap to have two generations. Some of those regions make up the young generation, and the young generation is still collected by stopping all application threads and moving all objects that are alive into the old generation or the survivor spaces. (This occurs using multiple threads.)&lt;/li&gt;
&lt;li&gt;In G1 GC, the old generation is processed by background threads that don’t need to stop the application threads to perform most of their work. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;G1 GC is enabled by specifying the flag &lt;code&gt;-XX:+UseG1GC&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4- The CMS collector&lt;/strong&gt;&lt;br&gt;
The CMS collector was the first concurrent collector. Like other algorithms, CMS stops all application threads during a minor GC, which it performs with multiple threads.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CMS is officially deprecated in JDK 11 and beyond, and its use in JDK 8 is discouraged.&lt;/li&gt;
&lt;li&gt;The major flaw in CMS is that it has no way to compact the heap during its background processing. If the heap becomes fragmented (which is likely to happen at some point), CMS must stop all application threads and compact the heap, which defeats the purpose of a concurrent collector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CMS is enabled by specifying the flag &lt;code&gt;-XX:+UseConcMarkSweepGC&lt;/code&gt;, which is false by default. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5- Experimental collectors&lt;/strong&gt;&lt;br&gt;
Garbage collection continues to be fertile ground for JVM engineers, and the latest versions of Java come with the three experimental algorithms mentioned earlier. I’ll have more to say about those in the next chapter; for now, let’s continue with a look at choosing among the three collectors supported in production environments.&lt;/p&gt;
&lt;h3&gt;
  
  
  C- Choosing a GC Algorithm
&lt;/h3&gt;

&lt;p&gt;The choice of a GC algorithm depends &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in part on the hardware available.
&lt;/li&gt;
&lt;li&gt;in part on what the application looks like.
&lt;/li&gt;
&lt;li&gt;and in part on the performance goals for the application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use (and not use) the serial collector&lt;/strong&gt;&lt;br&gt;
On a machine with a single CPU, the JVM defaults to using the serial collector. This includes virtual machines with one CPU, and Docker containers that are limited to one CPU.&lt;/p&gt;

&lt;p&gt;In these environments, the serial collector is usually a good choice, but at times G1 GC will give better results. This example is also a good starting point for understanding the general trade-offs involved in choosing a GC algorithm.&lt;/p&gt;

&lt;p&gt;Let's start with a CPU-intensive batch job:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8GYP7Ike--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2miwttxtc6t0n6apb4pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8GYP7Ike--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2miwttxtc6t0n6apb4pk.png" alt="Image description" width="544" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The serial collector wins because it spends much less time paused for garbage collection.&lt;/p&gt;

&lt;p&gt;Let's take another example, the following table shows the response time for a web server that is handling roughly 11 requests per second on its single CPU, which takes roughly 50% of the available CPU cycles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JiU7Z6rm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cdejt2qkew5dx1zkjm92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JiU7Z6rm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cdejt2qkew5dx1zkjm92.png" alt="Image description" width="880" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default (serial) algorithm still has the best average time, by 30%. Again, that’s because the collections of the young generation by the serial collector are generally faster than those of the other algorithms, so an average request is delayed less by the serial collector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use the throughput collector&lt;/strong&gt;&lt;br&gt;
When a machine has multiple CPUs available, more-complex interactions can occur between GC algorithms, but at a basic level, the trade-offs between G1 GC and the throughput collector are the same as we’ve just seen. For example, the following table shows how our sample application works when running either two or four application threads on a machine with four cores (where the cores are not hyper-threaded).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a8StxEnh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ghe0sr0k2ywnmeid5yu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a8StxEnh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ghe0sr0k2ywnmeid5yu4.png" alt="Image description" width="680" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the elapsed time of an application is key, the throughput collector will be advantageous when it spends less time pausing the application threads than G1 GC does. That happens when one or more of these things occur:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are no (or few) full GCs. Full GC pauses can easily dominate the pause times of an application, but if they don’t occur in the first place, the throughput collector is no longer at a disadvantage.&lt;/li&gt;
&lt;li&gt;The old generation is generally full, causing the background G1 GC threads to work more.&lt;/li&gt;
&lt;li&gt;The G1 GC threads are starved for CPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's take another test. This test is the same code we used before for batch jobs with long calculations, though it has a few modifications: multiple application threads are doing calculations (two, in this case), the old generation is seeded with objects to keep it 65% full, and almost all objects can be collected directly from the young generation. This test is run on a system with four CPUs (not hyper-threaded) so that there is sufficient CPU for the G1 GC background threads to run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BCvG9zuV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8tw4odu6ujvuiy3t084l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BCvG9zuV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8tw4odu6ujvuiy3t084l.png" alt="Image description" width="520" height="258"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  2) Basic GC Tuning
&lt;/h2&gt;

&lt;p&gt;Although GC algorithms differ in the way they process the heap, they share basic configuration parameters. In many cases, these basic configurations are all that is needed to run an application.&lt;br&gt;
There are four basic areas that can be tuned for better GC in Java:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sizing the heap&lt;/li&gt;
&lt;li&gt;Sizing the generations &lt;/li&gt;
&lt;li&gt;Sizing Metaspace &lt;/li&gt;
&lt;li&gt;Controlling Parallelism&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  A- Sizing the Heap
&lt;/h3&gt;

&lt;p&gt;Like most performance issues, choosing a heap size is a matter of balance. If the heap is too small, the program will spend too much time performing GC and not enough time performing application logic. But simply specifying a very large heap isn’t necessarily the answer either. The time spent in GC pauses is dependent on the size of the heap, so as the size of the heap increases, the duration of those pauses also increases. The pauses will occur less frequently, but their duration will make the overall performance lag.&lt;/p&gt;

&lt;p&gt;The first rule in sizing a heap is never to specify a heap that is larger than the amount of physical memory on the machine; if multiple JVMs are running, that rule applies to the sum of all their heaps.&lt;/p&gt;

&lt;p&gt;The size of the heap is controlled by two values: an initial value (specified with &lt;code&gt;-XmsN&lt;/code&gt;) and a maximum value (&lt;code&gt;-XmxN&lt;/code&gt;). The defaults vary depending on the operating system, the amount of system RAM, and the JVM in use. The defaults can be affected by other flags on the command line as well; heap sizing is one of the JVM’s core ergonomic tunings.&lt;br&gt;
The following table shows the default heap sizes for different operating systems:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZblXevTx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jl3qqc03q0zx9ac8mo7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZblXevTx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jl3qqc03q0zx9ac8mo7v.png" alt="Image description" width="880" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On a machine with less than 192 MB of physical memory, the maximum heap size will be half of the physical memory (96 MB or less).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  B- Sizing the Generations
&lt;/h3&gt;

&lt;p&gt;Once the heap size has been determined, the JVM must decide how much of the heap to allocate to the young generation and how much to allocate to the old generation.&lt;/p&gt;

&lt;p&gt;The command-line flags to tune the generation sizes all adjust the size of the young generation; the old generation gets everything that is left over. A variety of flags can be used to size the young generation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;-XX:NewRatio=N&lt;/strong&gt;
Set the ratio of the young generation to the old generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:NewSize=N&lt;/strong&gt;
Set the initial size of the young generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XX:MaxNewSize=N&lt;/strong&gt;
Set the maximum size of the young generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;-XmnN&lt;/strong&gt;
Shorthand for setting both NewSize and MaxNewSize to the same value.&lt;/li&gt;
&lt;/ul&gt;
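&lt;p&gt;One way to check how these sizing flags ended up being applied is to ask the running JVM itself via the standard &lt;code&gt;Runtime&lt;/code&gt; and &lt;code&gt;java.lang.management&lt;/code&gt; APIs. A small sketch (the class name is an arbitrary choice for this example):&lt;/p&gt;

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapSizes {
    public static void main(String[] args) {
        // maxMemory() roughly reflects -Xmx (it may exclude one survivor space);
        // the committed size at startup tracks the initial heap size (-Xms).
        long maxBytes = Runtime.getRuntime().maxMemory();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("max heap (MB)  : " + maxBytes / (1024 * 1024));
        System.out.println("committed (MB) : " + heap.getCommitted() / (1024 * 1024));
    }
}
```

&lt;p&gt;Running it with, say, &lt;code&gt;java -Xms512m -Xmx512m HeapSizes&lt;/code&gt; should report values close to 512 MB (the reported maximum may exclude one survivor space).&lt;/p&gt;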
&lt;h3&gt;
  
  
  C- Sizing Metaspace
&lt;/h3&gt;

&lt;p&gt;When the JVM loads classes, it must keep track of certain metadata about those classes. This occupies a separate heap space called the &lt;em&gt;metaspace&lt;/em&gt;. In older JVMs, this was handled by a different implementation called &lt;em&gt;permgen&lt;/em&gt;.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ebnD_rTD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a5gljb6x3gmnw0oto1xl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ebnD_rTD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a5gljb6x3gmnw0oto1xl.png" alt="Image description" width="754" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The metaspace behaves similarly to a separate instance of the regular heap. It is sized dynamically based on an initial size (&lt;code&gt;-XX:MetaspaceSize=N&lt;/code&gt;) and will increase as needed to a maximum size (&lt;code&gt;-XX:MaxMetaspaceSize=N&lt;/code&gt;). &lt;/p&gt;
&lt;h3&gt;
  
  
  D- Controlling Parallelism
&lt;/h3&gt;

&lt;p&gt;All GC algorithms except the serial collector use multiple threads. The number of these threads is controlled by the &lt;code&gt;-XX:ParallelGCThreads=N&lt;/code&gt; flag. The value of this flag affects the number of threads used for the following operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collection of the young generation when using &lt;code&gt;-XX:+UseParallelGC&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Collection of the old generation when using &lt;code&gt;-XX:+UseParallelGC&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Collection of the young generation when using &lt;code&gt;-XX:+UseG1GC&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stop-the-world phases of G1 GC (though not full GCs)&lt;/li&gt;
&lt;/ul&gt;
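&lt;p&gt;To see which collector pair your JVM is actually running with, and how often each has run, you can query the GC beans. A small sketch (the names in the comments are typical HotSpot values, not guaranteed):&lt;/p&gt;

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ShowCollectors {
    public static void main(String[] args) {
        // Collectors usually come as a young/old pair, e.g. "G1 Young Generation"
        // and "G1 Old Generation" under G1, or "PS Scavenge" and "PS MarkSweep"
        // under the throughput collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```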


&lt;h2&gt;
  
  
  3) GC Tools
&lt;/h2&gt;

&lt;p&gt;Since GC is central to the performance of Java, many tools monitor its performance. The best way to see the effect that GC has on the performance of an application is to become familiar with the GC log, which is a record of every GC operation during the program’s execution.&lt;/p&gt;
&lt;h3&gt;
  
  
  A- Enabling GC Logging in JDK 8
&lt;/h3&gt;

&lt;p&gt;JDK 8 provides multiple ways to enable the GC log. Specifying either of the flags &lt;code&gt;-verbose:gc&lt;/code&gt; or &lt;code&gt;-XX:+PrintGC&lt;/code&gt; will create a simple GC log (the flags are aliases for each other, and by default the log is disabled). The &lt;code&gt;-XX:+PrintGCDetails&lt;/code&gt; flag will create a log with much more information. &lt;strong&gt;This flag is recommended&lt;/strong&gt; (it is also false by default); it is often too difficult to diagnose what is happening with GC using only the simple log.&lt;/p&gt;
&lt;h3&gt;
  
  
  B- Enabling GC Logging in JDK 11
&lt;/h3&gt;

&lt;p&gt;JDK 11 and later versions use Java’s new unified logging feature. This means that all logging—GC related or not—is enabled via the flag &lt;code&gt;-Xlog&lt;/code&gt;.&lt;br&gt;
Then you append various options to that flag that control how the logging should be performed. To specify logging similar to the long example from JDK 8, you would use this flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-Xlog:gc*:file=gc.log:time:filecount=7,filesize=8M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The colons divide the command into four sections. You can run &lt;code&gt;java -Xlog:help&lt;/code&gt; to get more information on the available options, but here’s how they map for this string:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gc*&lt;/code&gt;: the tag selector; log all messages whose tags start with gc&lt;/li&gt;
&lt;li&gt;&lt;code&gt;file=gc.log&lt;/code&gt;: the output destination&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time&lt;/code&gt;: a decorator; each log line is prefixed with the time at which it was written&lt;/li&gt;
&lt;li&gt;&lt;code&gt;filecount=7,filesize=8M&lt;/code&gt;: output options controlling log rotation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One thing to note&lt;/strong&gt;: log rotation is handled slightly differently between JDK 8 and JDK 11. Say that we have specified a log name of &lt;code&gt;gc.log&lt;/code&gt; and that three files should be retained. &lt;br&gt;
&lt;strong&gt;In JDK 8&lt;/strong&gt;, the logs will be written this way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start logging to gc.log.0.current.&lt;/li&gt;
&lt;li&gt;When full, rename that to gc.log.0 and start logging to gc.log.1.current.&lt;/li&gt;
&lt;li&gt;When full, rename that to gc.log.1 and start logging to gc.log.2.current.&lt;/li&gt;
&lt;li&gt;When full, rename that to gc.log.2, remove gc.log.0, and start logging to a new gc.log.0.current.&lt;/li&gt;
&lt;li&gt;Repeat this cycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;In JDK 11&lt;/strong&gt;, the logs will be written this way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start logging to gc.log.&lt;/li&gt;
&lt;li&gt;When that is full, rename it to gc.log.0 and start a new gc.log.&lt;/li&gt;
&lt;li&gt;When that is full, rename it to gc.log.1 and start a new gc.log.&lt;/li&gt;
&lt;li&gt;When that is full, rename it to gc.log.2 and start a new gc.log.&lt;/li&gt;
&lt;li&gt;When that is full, rename it to gc.log.0, removing the old gc.log.0, and start a new gc.log.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For real-time monitoring of the heap, use &lt;code&gt;jvisualvm&lt;/code&gt; or &lt;code&gt;jconsole&lt;/code&gt;. The Memory panel of jconsole displays a real-time graph of the heap:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wt5C_n1p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bx8cc6cljfktnd03k4uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wt5C_n1p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bx8cc6cljfktnd03k4uq.png" alt="Image description" width="880" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Aaaaaand that's it for today :D
&lt;/h2&gt;

&lt;h4&gt;
  
  
  🏃 See you in chapter 6 ...
&lt;/h4&gt;




&lt;h2&gt;
  
  
  🐒 Take a tip
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Keep your machine clean and tidy. :house:&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xpM3gUX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media4.giphy.com/media/NV4cSrRYXXwfUcYnua/200.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xpM3gUX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media4.giphy.com/media/NV4cSrRYXXwfUcYnua/200.gif" alt="Clean" width="356" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>java</category>
      <category>programming</category>
      <category>books</category>
    </item>
    <item>
      <title>Java Performance - 4 - Working with the JIT Compiler</title>
      <dc:creator>Yousef Zook</dc:creator>
      <pubDate>Sat, 13 Nov 2021 13:51:20 +0000</pubDate>
      <link>https://dev.to/yousef_zook/java-performance-4-working-with-the-jit-compiler-1ak4</link>
      <guid>https://dev.to/yousef_zook/java-performance-4-working-with-the-jit-compiler-1ak4</guid>
      <description>&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;This article is part 5 of the series &lt;code&gt;Java Performance&lt;/code&gt;, which summarizes the &lt;strong&gt;Java Performance book&lt;/strong&gt; by &lt;strong&gt;Scott Oaks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the previous chapter we have discussed performance toolbox in java. We have mentioned JVM commands to monitor cpu, network and disk usage. We have also talked about the JFR (Java Flight Recorder).&lt;/p&gt;

&lt;p&gt;In this chapter we are going to talk about how Java code is run on a computer and how it's converted into machine code. We are also going to describe the difference between &lt;strong&gt;JIT&lt;/strong&gt; and &lt;strong&gt;AOT&lt;/strong&gt; compilers and some details about &lt;strong&gt;GraalVM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Great, let's start the fourth chapter...&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F43%2F3d%2F83%2F433d83f7e481f35245f8c6bb7c7591d8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F43%2F3d%2F83%2F433d83f7e481f35245f8c6bb7c7591d8.gif" alt="Intro"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Chapter Title:
&lt;/h2&gt;

&lt;p&gt;Working with the JIT Compiler&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;just-in-time (JIT) compiler&lt;/em&gt; is the heart of the Java Virtual Machine; nothing controls the performance of your application more than the JIT compiler.&lt;/p&gt;
&lt;h2&gt;
  
  
  1) Just-in-Time Compilers: An Overview
&lt;/h2&gt;

&lt;p&gt;Computer CPUs can execute only a relatively small set of specific instructions, called &lt;em&gt;machine code&lt;/em&gt;.&lt;br&gt;
There are two types of programming languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compiled Languages&lt;/strong&gt; like C++ and Fortran. Their programs are delivered as binary (machine) code, ready to run on the CPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpreted Languages&lt;/strong&gt; like PHP and Perl. They are interpreted, which means the same program source code can be run on any CPU as long as the machine has the correct interpreter (that is, the program called php or perl). The interpreter translates each line of the program into binary code as that line is executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each system has advantages and disadvantages. Programs written in interpreted languages are portable, but they run more slowly than compiled ones.&lt;/p&gt;
&lt;h3&gt;
  
  
  A- HotSpot Compilation
&lt;/h3&gt;

&lt;p&gt;As discussed in Chapter 1, the Java implementation discussed in this book is Oracle’s HotSpot JVM. This name (HotSpot) comes from the approach it takes toward compiling the code. &lt;/p&gt;

&lt;p&gt;It compiles the Java code to &lt;strong&gt;java bytecodes&lt;/strong&gt;, then at runtime starts by interpreting those bytecodes.&lt;/p&gt;

&lt;p&gt;When the JVM executes code, it does not begin compiling the code immediately. There are two basic reasons for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First:&lt;/strong&gt; if the code is going to be executed only once, then compiling it is essentially a wasted effort; it will be faster to interpret the Java bytecodes than to compile them and execute (only once) the compiled code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second:&lt;/strong&gt; is the optimization reason, the more times that the JVM executes a particular method or loop, the more information it has about that code. This allows the JVM to make numerous optimizations when it compiles the code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;example:&lt;/strong&gt; consider the &lt;code&gt;equals()&lt;/code&gt; method. This method exists in every Java object (because it is inherited from the Object class) and is often overridden. When the interpreter encounters the statement &lt;code&gt;b = obj1.equals(obj2)&lt;/code&gt;, it must look up the type (class) of &lt;code&gt;obj1&lt;/code&gt; in order to know which &lt;code&gt;equals()&lt;/code&gt; method to execute. This dynamic lookup can be somewhat time-consuming.&lt;br&gt;&lt;br&gt;
Over time, say the JVM notices that each time this statement is executed, &lt;code&gt;obj1&lt;/code&gt; is of type &lt;code&gt;java.lang.String&lt;/code&gt;. Then the JVM can produce compiled code that directly calls the &lt;code&gt;String.equals()&lt;/code&gt; method. Now the code is faster not only because it is compiled but also because it can skip the lookup of which method to call.&lt;/p&gt;
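&lt;p&gt;To watch this warm-up behavior yourself, you can run a deliberately hot method under &lt;code&gt;-XX:+PrintCompilation&lt;/code&gt;. The sketch below is an illustrative example (the class name and iteration counts are arbitrary choices, not values from the book):&lt;/p&gt;

```java
public class HotLoop {
    // A method that becomes "hot" once it has been called many times. Run with
    // java -XX:+PrintCompilation HotLoop to watch the JIT compile it (the
    // compilation threshold and output format are HotSpot implementation details).
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 1; i != n + 1; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int iter = 0; iter != 20000; iter++) {  // enough calls to trigger compilation
            total += sumOfSquares(100);
        }
        System.out.println(total);
    }
}
```

&lt;p&gt;With &lt;code&gt;-XX:+PrintCompilation&lt;/code&gt;, a line should appear for &lt;code&gt;sumOfSquares&lt;/code&gt; once it has been invoked enough times for the JVM to consider it worth compiling.&lt;/p&gt;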


&lt;h2&gt;
  
  
  2) Tiered Compilation
&lt;/h2&gt;

&lt;p&gt;Once upon a time, the JIT compiler came in two flavors, and you had to install different versions of the JDK depending on which compiler you wanted to use. These compilers are known as the &lt;strong&gt;client&lt;/strong&gt; (now called &lt;strong&gt;C1&lt;/strong&gt;) and &lt;strong&gt;server&lt;/strong&gt; (now called &lt;strong&gt;C2&lt;/strong&gt;) compilers. Today, all shipping JVMs include both compilers (though in common usage, they are usually referred to as server JVMs).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C1&lt;/strong&gt;: begins compiling sooner; it is less aggressive about optimization but compiles faster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C2&lt;/strong&gt;: begins compiling later, after collecting profiling information while the code runs, and optimizes more aggressively&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That technique is known as &lt;strong&gt;tiered compilation&lt;/strong&gt;, and it is the technique all JVMs now use. It can be explicitly disabled with the &lt;code&gt;-XX:-TieredCompilation&lt;/code&gt; flag (the default value of which is true).&lt;/p&gt;
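&lt;p&gt;For example, tiered compilation can be toggled on the command line (the jar name below is hypothetical):&lt;/p&gt;

```shell
# Disable tiered compilation: the JVM interprets until the C2 thresholds
# are reached, then compiles directly with C2.
java -XX:-TieredCompilation -jar myapp.jar
```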


&lt;h2&gt;
  
  
  3) Common Compiler Flags
&lt;/h2&gt;

&lt;p&gt;Two commonly used flags affect the JIT compiler; we’ll look at them in this section.&lt;br&gt;
1- Code cache&lt;br&gt;
2- Inspection flag&lt;/p&gt;
&lt;h3&gt;
  
  
  A- Tuning the code cache
&lt;/h3&gt;

&lt;p&gt;When the JVM compiles code, it holds the set of assembly-language instructions in &lt;strong&gt;the code cache&lt;/strong&gt;. The code cache has a fixed size, and once it has filled up, the JVM is not able to compile any additional code.&lt;br&gt;
When the code cache fills up, the JVM spits out this warning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
    Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To solve the problem, a typical option is to simply double or quadruple the default.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The maximum size of the code cache is set via the &lt;code&gt;-XX:ReservedCodeCacheSize=N&lt;/code&gt; flag.&lt;/li&gt;
&lt;li&gt;There is also an initial size (specified by &lt;code&gt;-XX:InitialCodeCacheSize=N&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The initial size of the code cache is 2,496 KB, and the default maximum size is 240 MB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Resizing the cache happens in the background and doesn’t really affect performance, so setting the &lt;code&gt;ReservedCodeCacheSize&lt;/code&gt; size (i.e., setting the maximum code cache size) is all that is generally needed.&lt;/p&gt;
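&lt;p&gt;As a sketch, doubling the maximum code cache and verifying the result might look like this (the jar name is hypothetical):&lt;/p&gt;

```shell
# Hypothetical example: double the default 240 MB maximum code cache.
java -XX:ReservedCodeCacheSize=480m -jar myapp.jar

# Verify the resulting sizes by printing the final flag values:
java -XX:+PrintFlagsFinal -version | grep CodeCacheSize
```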

&lt;p&gt;In Java 11, the code cache is segmented into three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nonmethod code&lt;/li&gt;
&lt;li&gt;Profiled code&lt;/li&gt;
&lt;li&gt;Nonprofiled code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By default, the code cache is sized the same way (up to 240 MB), and you can still adjust the total size of the code cache by using the ReservedCodeCacheSize flag.&lt;/p&gt;

&lt;p&gt;You’ll rarely need to tune these segments individually, but if so, the flags are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-XX:NonNMethodCodeHeapSize=N&lt;/code&gt; for the nonmethod code &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:ProfiledCodeHeapSize=N&lt;/code&gt; for the profiled code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-XX:NonProfiledCodeHeapSize=N&lt;/code&gt; for the nonprofiled code&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  B- Inspecting the Compilation Process
&lt;/h3&gt;

&lt;p&gt;The second flag isn’t a tuning per se: it will not improve the performance of an application. Rather, the &lt;code&gt;-XX:+PrintCompilation&lt;/code&gt; flag (which by default is false) gives us visibility into the workings of the compiler (though we’ll also look at tools that provide similar information).&lt;/p&gt;

&lt;p&gt;If PrintCompilation is enabled, every time a method (or loop) is compiled, the JVM prints out a line with information about what has just been compiled, with the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;timestamp  compilation_id  attributes  (tiered_level)  method_name  size  deopt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt; here is the time after the compilation has finished (relative to 0, which is when the JVM started).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;compilation_id&lt;/code&gt; is an internal task ID. Sometimes you may see an out-of-order compilation ID; this happens most frequently when there are multiple compilation threads.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;attributes&lt;/code&gt;: a series of five characters that indicates the state of the code being compiled

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;%&lt;/code&gt; the compilation is OSR (on-stack replacement): JIT compilation is an asynchronous process: when the JVM decides that a certain method should be compiled, that method is placed in a queue. Rather than wait for the compilation, the JVM then continues interpreting the method, and the next time the method is called, the JVM will execute the compiled version of the method (assuming the compilation has finished, of course). With OSR, the compiled version of a long-running loop can replace the interpreted version while the loop is still executing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s&lt;/code&gt; The method is synchronized&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;!&lt;/code&gt; The method has an exception handler.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; Compilation occurred in blocking mode: will never be printed by default in current versions of Java; it indicates that compilation did not occur in the background.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n&lt;/code&gt; Compilation occurred for a wrapper to a native method: indicates that the JVM generated compiled code to facilitate the call into a native method.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;tiered_level&lt;/code&gt; indicates which compiler produced the code (C1 levels vs. C2 levels). If tiered compilation has been disabled, this field will be blank; otherwise, it will be a number indicating which tier has completed compilation.&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;method_name&lt;/code&gt; the name of the compiled method&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;size&lt;/code&gt;  the size (in bytes) of the code being compiled.&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;deopt&lt;/code&gt; in some cases appears and indicates that some sort of deoptimization has occurred.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The compilation log may also include a line that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    timestamp compile_id COMPILE SKIPPED: reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Code cache filled&lt;/em&gt;: The size of the code cache needs to be increased using the ReservedCodeCacheSize flag.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Concurrent classloading&lt;/em&gt;: The class was modified as it was being compiled. The JVM will compile it again later; you should expect to see the method recompiled later in the log.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are a few lines of output from enabling PrintCompilation on the stock REST application:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F927g7kvwdrj548686gvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F927g7kvwdrj548686gvx.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The server took about 2 seconds to start; the remaining 26 seconds before anything else was compiled were essentially idle as the application server waited for requests.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;process()&lt;/code&gt; method is synchronized, so the attributes include an s.&lt;/li&gt;
&lt;li&gt;Inner classes are compiled just like any other class and appear in the output with the usual Java nomenclature: &lt;code&gt;outer-classname$inner-classname&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;processRequest()&lt;/code&gt; method shows up with the exception handler as expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C- Tiered Compilation Levels
&lt;/h3&gt;

&lt;p&gt;The compilation log for a program using tiered compilation prints the tier level at which each method is compiled.&lt;br&gt;&lt;br&gt;
So the levels of compilation are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt;: Interpreted code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1&lt;/strong&gt;: Simple C1 compiled code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt;: Limited C1 compiled code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3&lt;/strong&gt;: Full C1 compiled code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4&lt;/strong&gt;: C2 compiled code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A typical compilation log shows that most methods are first compiled at level 3: full C1 compilation. (All methods start at level 0, of course, but that doesn’t appear in the log.) If a method runs often enough, it will get compiled at level 4 (and the level 3 code will be made not entrant). &lt;strong&gt;This is the most frequent path&lt;/strong&gt;: the JVM waits to compile a method with C2 until it has profiling information about how the code is used, which it can leverage to perform optimizations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If the C2 compiler queue is full, methods will be pulled from the C2 queue and compiled at level 2, which is the level at which the C1 compiler uses the invocation and back-edge counters (but doesn’t require profile feedback).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the other hand, if the C1 compiler queue is full, a method that is scheduled for compilation at level 3 may become eligible for level 4 compilation while still waiting to be compiled at level 3. In that case, it is quickly compiled to level 2 and then transitioned to level 4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trivial methods may start in either level 2 or 3 but then go to level 1 because of their trivial nature. If the C2 compiler for some reason cannot compile the code, it will also go to level 1. And, of course, when code is deoptimized, it goes to level 0.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  D- Deoptimization
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Deoptimization&lt;/em&gt; means that the compiler has to “undo” a previous compilation. The effect is that the performance of the application will be reduced, at least until the compiler can recompile the code in question.&lt;/p&gt;

&lt;p&gt;Deoptimization occurs in two cases: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when code is made not entrant: Two things cause code to be made not entrant:

&lt;ul&gt;
&lt;li&gt;One is due to the way classes and interfaces work. For example, suppose an interface has two implementations and the compiler observes that only one of them is ever called at a particular call site, so it inlines that implementation's code as an optimization. If the other implementation is later called there, the compiler's assumption no longer holds, and the optimized code must be deoptimized.&lt;/li&gt;
&lt;li&gt;The other is an implementation detail of tiered compilation: when code is compiled by the C2 compiler, the JVM must replace the code already compiled by the C1 compiler.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;when code is made zombie: When the compilation log reports that it has made zombie code, it is saying that it has reclaimed previous code that was made not entrant. For performance, this is a good thing. Recall that the compiled code is held in a fixed-size code cache; when zombie methods are identified, the code in question can be removed from the code cache, making room for other classes to be compiled (or limiting the amount of memory the JVM will need to allocate later).&lt;/li&gt;
&lt;/ul&gt;
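&lt;p&gt;The first not-entrant case above can be sketched in code. This is an illustrative example (the &lt;code&gt;Shape&lt;/code&gt; classes are hypothetical, not from the book); running it with &lt;code&gt;-XX:+PrintCompilation&lt;/code&gt; may show "made not entrant" after the call site sees its second implementation:&lt;/p&gt;

```java
interface Shape { double area(); }

class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

class Square implements Shape {
    private final double s;
    Square(double s) { this.s = s; }
    public double area() { return s * s; }
}

public class DeoptDemo {
    // While only Circle instances reach this call site, the JIT may
    // speculatively inline Circle.area(). The first Square that arrives
    // invalidates that assumption: the optimized code is made not entrant
    // and the method is eventually recompiled without the speculation.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();
        return sum;
    }

    public static void main(String[] args) {
        Shape[] warmup = new Shape[100_000];
        for (int i = 0; i != warmup.length; i++) warmup[i] = new Circle(1);
        totalArea(warmup);  // monomorphic warm-up: Circle only
        // Now a Square reaches the same call site:
        System.out.println(totalArea(new Shape[] { new Circle(1), new Square(2) }));
    }
}
```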


&lt;h2&gt;
  
  
  4) Advanced Compiler Flags
&lt;/h2&gt;
&lt;h3&gt;
  
  
  A- Compilation Thresholds
&lt;/h3&gt;

&lt;p&gt;This chapter has been somewhat vague in defining just what triggers the compilation of code. The major factor is how often the code is executed; once it is executed a certain number of times, its compilation threshold is reached, and the compiler deems that it has enough information to compile the code.&lt;/p&gt;

&lt;p&gt;Compilation is based on two counters in the JVM: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number of times the method has been called&lt;/li&gt;
&lt;li&gt;the number of times any loops in the method have branched back.
&lt;strong&gt;Branching back&lt;/strong&gt; can effectively be thought of as the number of times a loop has completed execution, either because it reached the end of the loop itself or because it executed a branching statement like &lt;code&gt;continue&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the JVM executes a Java method, it checks the sum of those two counters and decides whether the method is eligible for compilation. &lt;/p&gt;
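&lt;p&gt;The two counters can be illustrated with a small sketch (the class and method names below are hypothetical):&lt;/p&gt;

```java
public class HotLoop {
    // Each call to this method bumps its invocation counter; each loop
    // iteration that branches back to the top bumps the back-edge counter.
    // When the sum of the two crosses the compilation threshold, the method
    // (or the loop itself, via on-stack replacement) is queued for compilation.
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i != n; i++) {  // each iteration is one back-branch
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A single call with a large bound can trigger OSR compilation
        // (the '%' attribute in the PrintCompilation log), because the
        // back-edge counter alone crosses the threshold mid-execution.
        System.out.println(sumTo(1_000_000));
    }
}
```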

&lt;p&gt;Tunings affect these thresholds. When tiered compilation is disabled, standard compilation is triggered by the value of the &lt;code&gt;-XX:CompileThreshold=N&lt;/code&gt; flag. The default value of N is &lt;code&gt;10,000&lt;/code&gt;. Changing the value of the CompileThreshold flag will cause the compiler to choose to compile the code sooner (or later) than it normally would have. Note, however, that although there is one flag here, the threshold is calculated by adding the sum of the back-edge loop counter plus the method entry counter.    &lt;/p&gt;

&lt;p&gt;When tiered compilation is enabled, you can change the flag &lt;code&gt;-XX:Tier3InvocationThreshold=N&lt;/code&gt; (default 200) to get C1 to compile a method more quickly, and &lt;code&gt;-XX:Tier4InvocationThreshold=N&lt;/code&gt; (default 5000) to get C2 to compile a method more quickly. Similar flags are available for the back-edge threshold.&lt;/p&gt;
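&lt;p&gt;A hedged sketch of such tunings on the command line (the jar name is hypothetical, and lowering thresholds is rarely a win in practice):&lt;/p&gt;

```shell
# Hypothetical tuning: make both tiers compile sooner than their defaults
# (Tier3 default 200, Tier4 default 5000).
java -XX:Tier3InvocationThreshold=100 \
     -XX:Tier4InvocationThreshold=1000 \
     -jar myapp.jar

# With tiered compilation disabled, a single threshold applies instead:
java -XX:-TieredCompilation -XX:CompileThreshold=8000 -jar myapp.jar
```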
&lt;h3&gt;
  
  
  B- Compilation Threads
&lt;/h3&gt;

&lt;p&gt;When a method (or loop) becomes eligible for compilation, it is queued for compilation. That queue is processed by one or more background threads.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These queues are not strictly first in, first out; methods whose invocation counters are higher have priority. &lt;/li&gt;
&lt;li&gt;The C1 and C2 compilers have different queues, each of which is processed by (potentially multiple) different threads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following table shows default number of C1 and C2 compiler threads for tiered compilation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhm6kipxgwx8yyn37qbdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhm6kipxgwx8yyn37qbdi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If tiered compilation is disabled, only the given number of C2 compiler threads are started.&lt;/p&gt;
&lt;h3&gt;
  
  
  C- Inlining
&lt;/h3&gt;

&lt;p&gt;One of the most important optimizations the compiler makes is to inline methods. Code that follows good object-oriented design often contains attributes that are accessed via getters (and perhaps setters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;getX&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;setX&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overhead for invoking a method call like this is quite high, especially relative to the amount of code in the method. &lt;/p&gt;

&lt;p&gt;Fortunately, JVMs now routinely perform code inlining for these kinds of methods. Hence, you can write this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getPoint&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setX&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getX&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiled code will essentially execute this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getPoint&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inlining is enabled by default. It can be disabled using the &lt;code&gt;-XX:-Inline&lt;/code&gt; flag.&lt;/p&gt;
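&lt;p&gt;Inlining decisions can also be observed at runtime (the jar name below is hypothetical):&lt;/p&gt;

```shell
# Print inlining decisions; PrintInlining is a diagnostic flag, so it
# must be unlocked first.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -jar myapp.jar

# Disable inlining entirely (rarely a good idea; useful only for testing):
java -XX:-Inline -jar myapp.jar
```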

&lt;h3&gt;
  
  
  D- Escape Analysis
&lt;/h3&gt;

&lt;p&gt;The C2 compiler performs aggressive optimizations if escape analysis is enabled (&lt;code&gt;-XX:+DoEscapeAnalysis&lt;/code&gt;, which is true by default). For example, consider this class to work with factorials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Factorial&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;BigInteger&lt;/span&gt; &lt;span class="n"&gt;factorial&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;Factorial&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;synchronized&lt;/span&gt; &lt;span class="nc"&gt;BigInteger&lt;/span&gt; &lt;span class="nf"&gt;getFactorial&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;factorial&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;factorial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...;&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;factorial&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To store the first 100 factorial values in an array, this code would be used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;BigInteger&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;BigInteger&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;();&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inti&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
    &lt;span class="nc"&gt;Factorial&lt;/span&gt; &lt;span class="n"&gt;factorial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Factorial&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;factorial&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFactorial&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The factorial object is referenced only inside that loop; no other code can ever access that object. Hence, the JVM is free to perform optimizations on that object:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It needn’t get a &lt;code&gt;synchronization&lt;/code&gt; lock when calling the &lt;code&gt;getFactorial()&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;It needn’t store the field n in memory; it can keep that value in a register. Similarly, it can store the factorial object reference in a register.&lt;/li&gt;
&lt;li&gt;In fact, it needn’t allocate an actual factorial object at all; it can just keep track of the individual fields of the object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of optimization is sophisticated: it is simple enough in this example, but these optimizations are possible even with more-complex code. &lt;/p&gt;
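&lt;p&gt;Another common shape that escape analysis handles is a short-lived helper object. This sketch (class and method names are hypothetical) shows an object that never escapes its method, so the JIT may skip the allocation entirely:&lt;/p&gt;

```java
public class EscapeDemo {
    static class Pair {
        final int a, b;
        Pair(int a, int b) { this.a = a; this.b = b; }
    }

    // 'p' never escapes this method, so with -XX:+DoEscapeAnalysis the JIT
    // can eliminate the heap allocation and keep a and b in registers
    // (scalar replacement).
    static int sumPair(int a, int b) {
        Pair p = new Pair(a, b);
        return p.a + p.b;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i != 1_000_000; i++) {
            total += sumPair(i, i + 1);  // no Pair survives this loop body
        }
        System.out.println(total);
    }
}
```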




&lt;h2&gt;
  
  
  5) Tiered Compilation Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Question&lt;/strong&gt; Given the performance advantages it provides, is there ever a reason to turn tiered compilation off?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One such reason might be when running in a memory-constrained environment.&lt;/li&gt;
&lt;li&gt;For example, you may be running in a Docker container with a small memory limit or in a cloud virtual machine that just doesn’t have quite enough memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table below shows the effect of tiered compilation on the code cache.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1eto5l6ald9a4kt0yqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1eto5l6ald9a4kt0yqs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The C1 compiler compiled about four times as many classes and predictably required about four times as much memory for the code cache. &lt;/p&gt;




&lt;h2&gt;
  
  
  6) The GraalVM
&lt;/h2&gt;

&lt;p&gt;The GraalVM is a new virtual machine. It provides a means to run Java code, of course, but also code from many other languages. This universal virtual machine can also run JavaScript, Python, Ruby, R, and traditional JVM bytecodes from Java and other languages that compile to JVM bytecodes (e.g., Scala, Kotlin, etc.). Graal comes in two editions: a full open source Community Edition (CE) and a commercial Enterprise Edition (EE). Each edition has binaries that support either Java 8 or Java 11.&lt;/p&gt;

&lt;p&gt;The GraalVM has two important contributions to JVM performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, an add-on technology allows the GraalVM to produce fully native binaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Second, the GraalVM can run in a mode as a regular JVM, but it contains a new implementation of the C2 compiler. This compiler is written in Java (as opposed to the traditional C2 compiler, which is written in C++).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within the JVM, using the GraalVM compiler is considered experimental, so to enable it, you need to supply these flags: &lt;code&gt;-XX:+UnlockExperimentalVMOptions&lt;/code&gt;, &lt;code&gt;-XX:+EnableJVMCI&lt;/code&gt;, and &lt;code&gt;-XX:+UseJVMCICompiler&lt;/code&gt;. The default for all those flags is false.&lt;/p&gt;
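&lt;p&gt;Put together on the command line (the application jar name is hypothetical):&lt;/p&gt;

```shell
# Enable the experimental Graal JIT inside a regular JDK 11 JVM:
java -XX:+UnlockExperimentalVMOptions \
     -XX:+EnableJVMCI \
     -XX:+UseJVMCICompiler \
     -jar myapp.jar
```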

&lt;p&gt;The following table shows the performance of the Graal compiler:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nt3t8dx5zrz6m04ww54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nt3t8dx5zrz6m04ww54.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7) Precompilation
&lt;/h2&gt;

&lt;p&gt;We began this chapter by discussing the philosophy behind a just-in-time compiler. Although it has its advantages, code is still subject to a warm-up period before it executes. What if in our environment a traditional compiled model would work better: an embedded system without the extra memory the JIT requires, or a program that completes before having a chance to warm up?&lt;br&gt;
In this section, we’ll look at two experimental features that address that scenario. Ahead-of-time compilation is an experimental feature of the standard JDK 11, and the ability to produce a fully native binary is a feature of the GraalVM.&lt;/p&gt;

&lt;h3&gt;
  
  
  A- Ahead-of-Time Compilation (AOT)
&lt;/h3&gt;

&lt;p&gt;Ahead-of-time (AOT) compilation was first available in JDK 9 for Linux only, but in JDK 11 it is available on all platforms. From a performance standpoint, it is still a work in progress, but this section will give you a sneak peek at it.&lt;/p&gt;

&lt;p&gt;AOT compilation allows you to compile some (or all) of your application in advance of running it. This compiled code becomes a shared library that the JVM uses when starting the application. In theory, this means the JIT needn’t be involved, at least in the startup of your application: your code should initially run at least as well as the C1 compiled code without having to wait for that code to be compiled.&lt;/p&gt;

&lt;p&gt;In practice, it’s a little different: the startup time of the application is greatly affected by the size of the shared library (and hence the time to load that shared library into the JVM). That means a simple application like a “Hello, world” application won’t run any faster when you use AOT compilation (in fact, it may run slower depending on the choices made to precompile the shared library).&lt;/p&gt;
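&lt;p&gt;A rough sketch of the JDK 11 AOT workflow with the experimental &lt;code&gt;jaotc&lt;/code&gt; tool (the class and library names below are hypothetical):&lt;/p&gt;

```shell
# Compile a class ahead of time into a shared library:
jaotc --output libHelloWorld.so HelloWorld.class

# Start the JVM with the precompiled library, so those methods need not
# be interpreted or JIT-compiled at startup:
java -XX:AOTLibrary=./libHelloWorld.so HelloWorld
```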

&lt;h3&gt;
  
  
  B- GraalVM Native Compilation
&lt;/h3&gt;

&lt;p&gt;AOT compilation was beneficial for relatively large programs but didn’t help (and could hinder) small, quick-running programs. That is because it’s still an experimental feature and because its architecture has the JVM load the shared library.&lt;/p&gt;

&lt;p&gt;The GraalVM, on the other hand, can produce full native executables that run without the JVM. These executables are ideal for short-lived programs. If you ran the examples, you may have noticed references in some things (like ignored errors) to GraalVM classes: AOT compilation uses GraalVM as its foundation. This is an Early Adopter feature of the GraalVM; it can be used in production with the appropriate license but is not subject to warranty.&lt;/p&gt;

&lt;p&gt;Limitations also exist on which Java features can be used in a program compiled into native code. These limitations include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic class loading (e.g., by calling Class.forName()).&lt;/li&gt;
&lt;li&gt;Finalizers.&lt;/li&gt;
&lt;li&gt;The Java Security Manager.&lt;/li&gt;
&lt;li&gt;JMX and JVMTI (including JVMTI profiling).&lt;/li&gt;
&lt;li&gt;Use of reflection often requires special coding or configuration.&lt;/li&gt;
&lt;li&gt;Use of dynamic proxies often requires special configuration.&lt;/li&gt;
&lt;li&gt;Use of JNI requires special coding or configuration.&lt;/li&gt;
&lt;/ul&gt;
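&lt;p&gt;For reference, building such a binary is a one-step sketch with GraalVM's &lt;code&gt;native-image&lt;/code&gt; tool (the class name is hypothetical; the tool is installed separately, e.g. via &lt;code&gt;gu install native-image&lt;/code&gt;):&lt;/p&gt;

```shell
# Compile a class into a standalone native executable:
native-image HelloWorld

# The resulting binary runs without a JVM, with near-instant startup:
./helloworld
```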




&lt;h4&gt;
  
  
  🏃 See you in chapter 5 ...
&lt;/h4&gt;




&lt;h2&gt;
  
  
  🐒 take a tip
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Embrace your beliefs, but make yourself open to changes. 🌔&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthumbs.gfycat.com%2FThisOrderlyArchaeopteryx-size_restricted.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthumbs.gfycat.com%2FThisOrderlyArchaeopteryx-size_restricted.gif" alt="Embrace changes"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>programming</category>
      <category>books</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Performance - 3 - A java Performance Toolbox</title>
      <dc:creator>Yousef Zook</dc:creator>
      <pubDate>Fri, 05 Nov 2021 20:30:35 +0000</pubDate>
      <link>https://dev.to/yousef_zook/java-performance-3-a-java-performance-toolbox-iad</link>
      <guid>https://dev.to/yousef_zook/java-performance-3-a-java-performance-toolbox-iad</guid>
      <description>&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;This article is part 4 of the series &lt;code&gt;Java Performance&lt;/code&gt;, which summarizes the &lt;strong&gt;Java Performance&lt;/strong&gt; book by &lt;strong&gt;Scott Oaks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the previous chapter we discussed performance testing methods. We covered the difference between &lt;em&gt;Microbenchmarks&lt;/em&gt;, &lt;em&gt;Macrobenchmarks&lt;/em&gt;, and &lt;em&gt;Mesobenchmarks&lt;/em&gt;. We also talked about &lt;em&gt;response time&lt;/em&gt;, &lt;em&gt;throughput&lt;/em&gt;, and &lt;em&gt;variability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In this chapter we are going to discuss some interesting measurement tools for CPU, network, and disk. We will look at the different profilers available for Java and talk a bit about &lt;code&gt;Java Flight Recorder&lt;/code&gt; (JFR).&lt;/p&gt;

&lt;p&gt;Great, let's start the third chapter...&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F43%2F3d%2F83%2F433d83f7e481f35245f8c6bb7c7591d8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F43%2F3d%2F83%2F433d83f7e481f35245f8c6bb7c7591d8.gif" alt="Intro"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Chapter Title:
&lt;/h2&gt;

&lt;p&gt;A Java Performance Toolbox&lt;/p&gt;

&lt;p&gt;Performance analysis is all about visibility—knowing what is going on inside an application and in the application’s environment. Visibility is all about tools. And so performance tuning is all about tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  1) Operating System Tools and Analysis
&lt;/h2&gt;

&lt;p&gt;The starting point for program analysis is not Java-specific at all: it is the basic set of monitoring tools that come with the operating system.&lt;br&gt;
We will take a quick look at the operating system tools for monitoring the usage of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU --&amp;gt; &lt;strong&gt;vmstat&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Disk --&amp;gt; &lt;strong&gt;iostat&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Network --&amp;gt; &lt;strong&gt;nicstat&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  A- CPU Usage
&lt;/h3&gt;

&lt;p&gt;CPU usage is typically divided into two categories: user time and system time (Windows refers to this as privileged time). &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User time&lt;/strong&gt; is the percentage of time the CPU is executing application code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Time&lt;/strong&gt; is the percentage of time the CPU is executing kernel code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;goal&lt;/strong&gt; is to maximize CPU utilization.&lt;/p&gt;

&lt;p&gt;If you run &lt;em&gt;vmstat 1&lt;/em&gt; on your Linux desktop, you will get a series of lines (one every second) that look like this:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3n9k6l8qqc8luyaz13c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3n9k6l8qqc8luyaz13c.png" alt="vmstat"&gt;&lt;/a&gt;&lt;br&gt;
As you can find in the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each second has approximately &lt;code&gt;system time = 3%&lt;/code&gt; and &lt;code&gt;user time = 42%&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The CPU total time [aka utilization] is 45%, which means that the CPU is idle for 55% of the time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CPU can be idle for multiple reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application might be blocked on a synchronization primitive and unable to execute until that lock is released.&lt;/li&gt;
&lt;li&gt;The application might be waiting for something, such as a response to come back from a call to the database.&lt;/li&gt;
&lt;li&gt;The application might have nothing to do.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These first two situations are always indicative of something that can be addressed. If contention on the lock can be reduced or the database can be tuned so that it sends the answer back more quickly, then the program will run faster, and the average CPU use of the application will go up (assuming, of course, that there isn’t another such issue that will continue to block the application).&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Java and a single CPU:&lt;/strong&gt;&lt;br&gt;
If the code is a batch-style application, the CPU should not be idle, because it always has work to do [if one job is blocked on I/O or similar, another batch job can use the CPU, etc.]&lt;br&gt;
 ...&lt;br&gt;
&lt;strong&gt;Java and multiple CPUs:&lt;/strong&gt;&lt;br&gt;
The general idea is the same as with a single CPU; however, making sure individual threads are not blocked will drive CPU usage higher.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;h5&gt;
  
  
  CPU Run Queue
&lt;/h5&gt;

&lt;p&gt;You can monitor the number of threads that are able to run [i.e., not blocked]. Those threads are said to be in the &lt;code&gt;CPU Run Queue&lt;/code&gt;. You can find the length of the run queue in the previous image, in the first column &lt;code&gt;procs r&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9dzihocu8lnyzerf6tj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9dzihocu8lnyzerf6tj.png" alt="vmstat queue length"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Linux: the number counts both the currently running threads [those using the processors] and the threads waiting for a processor.&lt;/li&gt;
&lt;li&gt;In Windows: the number does &lt;strong&gt;NOT&lt;/strong&gt; count the currently running threads.
So the goal in Linux is to keep this queue length equal to the number of machine processors, and in Windows to keep it at 0.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  B- Disk Usage
&lt;/h3&gt;

&lt;p&gt;Monitoring disk usage has two important goals. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first pertains to the application itself: if the application is doing a lot of disk I/O, that I/O can easily become a bottleneck.&lt;/li&gt;
&lt;li&gt;The second applies even if the application is not expected to perform a significant amount of I/O: monitoring disk usage helps detect whether the system is swapping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can use the &lt;strong&gt;iostat&lt;/strong&gt; command to monitor the disk. Let's see an example:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj79gysh4a1o1to61o3ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj79gysh4a1o1to61o3ji.png" alt="iostat"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This application is writing data to disk &lt;strong&gt;sda&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;w_await&lt;/strong&gt;: the time to service each I/O write &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;util&lt;/strong&gt;: the disk utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Applications that write to disk can be bottlenecked both because they are writing data inefficiently (too little throughput) or because they are writing too much data (too much throughput).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  C- Network Usage
&lt;/h3&gt;

&lt;p&gt;If you are running an application that uses the network—for example, a REST server—you must monitor the network traffic as well.&lt;br&gt;
You can use &lt;strong&gt;nicstat&lt;/strong&gt; to monitor the network; it does not ship with the system by default, but it is open source and has more features than the default tools.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg2ep7ex592fkv3vt9ht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftg2ep7ex592fkv3vt9ht.png" alt="nicstat"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Applications that write to the network can be bottlenecked because they are writing data inefficiently (too little throughput) or because they are writing too much data (too much throughput).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  2) Java Monitoring Tools
&lt;/h2&gt;

&lt;p&gt;To gain insight into the JVM itself, Java monitoring tools are required. These tools come with the JDK:&lt;/p&gt;
&lt;h3&gt;
  
  
  A- JVM Commands
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;jcmd&lt;/strong&gt;: Prints basic class, thread, and JVM information for a Java process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jconsole&lt;/strong&gt;: Provides a graphical view of JVM activities, including thread usage, class usage, and GC activities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jmap&lt;/strong&gt;: Provides heap dumps and other information about JVM memory usage. Suitable for scripting, though the heap dumps must be used in a postprocessing tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jinfo&lt;/strong&gt;: Provides visibility into the system properties of the JVM, and allows some system properties to be set dynamically. Suitable for scripting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jstack&lt;/strong&gt;: Dumps the stacks of a Java process. Suitable for scripting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jstat&lt;/strong&gt;: Provides information about GC and class-loading activities. Suitable for scripting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jvisualvm&lt;/strong&gt;: A GUI tool to monitor a JVM, profile a running application, and analyze JVM heap dumps (which is a postprocessing activity, though jvisualvm can also take the heap dump from a live program).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are using Docker, you can run all of these tools via &lt;code&gt;docker exec&lt;/code&gt;, except the GUI tools &lt;code&gt;jconsole&lt;/code&gt; and &lt;code&gt;jvisualvm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These tools fit into these broad areas:&lt;br&gt;
• Basic VM information&lt;br&gt;
• Thread information&lt;br&gt;
• Class information&lt;br&gt;
• Live GC analysis&lt;br&gt;
• Heap dump postprocessing&lt;br&gt;
• Profiling a JVM&lt;/p&gt;
&lt;h3&gt;
  
  
  B- Basic VM Information
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uptime
The length of time the JVM has been up can be found via this command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;% jcmd process_id VM.uptime&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System properties
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;% jcmd process_id VM.system_properties&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;% jinfo -sysprops process_id&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM version
The version of the JVM is obtained like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;% jcmd process_id VM.version&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM tuning flags
The tuning flags in effect for an application can be obtained like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;% jcmd process_id VM.flags [-all]&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; you can change tuning flags dynamically at runtime using &lt;strong&gt;jinfo&lt;/strong&gt; command, example:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;% jinfo -flag -PrintGCDetails process_id # turns off PrintGCDetails &lt;br&gt;
% jinfo -flag PrintGCDetails process_id # prints the current value of the flag&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
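&lt;p&gt;The same basic VM information is also available from inside the process via the &lt;code&gt;java.lang.management&lt;/code&gt; API, which is handy when an application should report on itself instead of being inspected with an external tool. A minimal sketch (the class name &lt;code&gt;VmInfo&lt;/code&gt; is our own illustration, not from the book):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

public class VmInfo {
    public static void main(String[] args) {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        // Uptime in milliseconds, the programmatic analog of: jcmd process_id VM.uptime
        System.out.println("Uptime (ms): " + runtime.getUptime());
        // JVM name and version, like: jcmd process_id VM.version
        System.out.println("VM: " + runtime.getVmName() + " " + runtime.getVmVersion());
        // The command-line arguments, including any -XX tuning flags that were passed
        System.out.println("Args: " + runtime.getInputArguments());
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;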




&lt;h2&gt;
  
  
  3) Profiling Tools
&lt;/h2&gt;

&lt;p&gt;Profilers are the most important tool in a performance analyst’s toolbox. Many profilers are available for Java, each with its own advantages and disadvantages. &lt;/p&gt;

&lt;p&gt;Many common Java profiling tools are themselves written in Java and work by “attaching” themselves to the application to be profiled. This attachment is via a socket or via a native Java interface called the &lt;strong&gt;JVM Tool Interface&lt;/strong&gt; (JVMTI). &lt;br&gt;
This means you must pay attention to tuning the profiling tool just as you would tune any other Java application. In particular, if the application being profiled is large, it can transfer quite a lot of data to the profiling tool, so the profiling tool must have a sufficiently large heap to handle the data.&lt;/p&gt;

&lt;p&gt;Profiling happens in one of two modes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sampling mode &lt;/li&gt;
&lt;li&gt;instrumented mode&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  A- Sampling Profilers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Sampling is the basic mode of profiling and carries the least amount of overhead. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: However, sampling profilers can be subject to all sorts of errors; for example, the most common sampling error is shown in the figure below:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyriakcv0dnlfr21tg0y3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyriakcv0dnlfr21tg0y3.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
The thread here is alternating between executing methodA (shown in the shaded bars) and methodB (shown in the clear bars). If the timer fires only when the thread happens to be in methodB, the profile will report that the thread spent all its time executing methodB; in reality, more time was actually spent in methodA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;: this is due to &lt;code&gt;safepoint bias&lt;/code&gt;: the profiler can get the stack trace of a thread only when the thread is at a safepoint, which happens when threads are:&lt;br&gt;
• Blocked on a synchronized lock&lt;br&gt;
• Blocked waiting for I/O&lt;br&gt;
• Blocked waiting for a monitor&lt;br&gt;
• Parked&lt;br&gt;
• Executing Java Native Interface (JNI) code (unless they perform a GC locking function)&lt;/p&gt;
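&lt;p&gt;The mechanics can be illustrated with a toy sampler in plain Java (the class &lt;code&gt;MiniSampler&lt;/code&gt; is our own sketch, not a real profiler, which would use JVMTI or similar). Note that &lt;code&gt;Thread.getStackTrace&lt;/code&gt; itself can only observe threads at safepoints, so this sketch exhibits exactly the bias described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;public class MiniSampler {
    public static void main(String[] args) throws Exception {
        // A busy worker thread that the "profiler" below will sample
        Thread worker = new Thread(MiniSampler::spin, "worker");
        worker.setDaemon(true);
        worker.start();

        int hits = 0;
        for (int i = 0; i &amp;lt; 50; i++) {
            Thread.sleep(10);                      // sampling interval
            StackTraceElement[] stack = worker.getStackTrace();
            // Count samples whose top frame is spin(); like any sampling
            // profiler, we only see where the thread happens to be when
            // the sample is taken
            if (stack.length != 0) {
                if (stack[0].getMethodName().equals("spin")) {
                    hits++;
                }
            }
        }
        System.out.println("samples in spin(): " + hits + "/50");
    }

    static void spin() {
        double d = 0;
        while (true) {
            d += Math.sqrt(d + 1);                 // pure busy work
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;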
&lt;h3&gt;
  
  
  B- Instrumented Profilers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Instrumented profilers can give much more detailed information about what’s happening inside a program than sampling profilers can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: They are much more likely to introduce performance differences into the application than are sampling profilers.&lt;/p&gt;

&lt;p&gt;Instrumented profilers work by altering the bytecode sequence of classes as they are loaded (inserting code to count the invocations, and so on).&lt;/p&gt;
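&lt;p&gt;Conceptually, the inserted bytecode behaves like the hand-written counter below (the class &lt;code&gt;Counted&lt;/code&gt; is our own illustration, not from the book). It shows why invocation counts from an instrumented profiler are exact, where a sampling profiler can only estimate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;public class Counted {
    static long fibCalls = 0;   // counter "inserted" by the instrumentation

    static long fib(int n) {
        fibCalls++;             // bumped on every entry: an exact invocation count
        if (n &amp;lt;= 1) return n;
        return fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        long result = fib(20);
        // prints: fib(20) = 6765, calls = 21891
        System.out.println("fib(20) = " + result + ", calls = " + fibCalls);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The counter updates are also where the overhead comes from: every call now does extra work, which is why instrumented profilers perturb the application more than sampling profilers do.&lt;/p&gt;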

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Is this a better profile than the sampled version? It depends; there is no way to know in a given situation which is the more accurate profile. The invocation count of an instrumented profile is certainly accurate, and that additional information is often helpful in determining where the code is spending more time and which things are more fruitful to optimize.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  C- Native Profilers
&lt;/h3&gt;

&lt;p&gt;Tools like async-profiler and Oracle Developer Studio have the capability to profile native code in addition to Java code. This has two advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significant operations occur in native code, including within native libraries and native memory allocation.&lt;/li&gt;
&lt;li&gt;We typically profile to find bottlenecks in application code, but sometimes native code unexpectedly dominates performance. (For GC specifically, we would prefer to discover that the application spends too much time in GC by examining the GC logs rather than a profile.)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  4) Java Flight Recorder &lt;code&gt;JFR&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Java Flight Recorder (JFR) is a feature of the JVM that performs lightweight performance analysis of applications while they are running. As its name suggests, JFR data is a history of events in the JVM that can be used to diagnose the past performance and operations of the JVM.&lt;/p&gt;

&lt;p&gt;The basic operation of JFR is that a set of events is enabled (for example, one event is that a thread is blocked waiting for a lock), and each time a selected event occurs, data about that event is saved (either in memory or to a file). &lt;/p&gt;

&lt;p&gt;The more events that are enabled, the greater the performance impact of JFR.&lt;/p&gt;
&lt;h3&gt;
  
  
  A- Java Mission Control
&lt;/h3&gt;

&lt;p&gt;The usual tool to examine JFR recordings is &lt;code&gt;Java Mission Control&lt;/code&gt; (jmc), though other tools exist, and you can use toolkits to write your own analysis tools.&lt;br&gt;
The Java Mission Control program (jmc) starts a window that displays the JVM processes on the machine and lets you select one or more processes to monitor. Figure 3-9 shows the Java Management Extensions (JMX) console of Java Mission Control monitoring our example REST server.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8xq8jb44g1cjj42mj3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8xq8jb44g1cjj42mj3k.png" alt="JMC"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  B- JFR features
&lt;/h3&gt;

&lt;p&gt;The following table shows, for each event type, what other tools can collect and what JFR adds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Other tools&lt;/th&gt;
&lt;th&gt;JFR&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classloading&lt;/td&gt;
&lt;td&gt;Number of classes loaded and unloaded&lt;/td&gt;
&lt;td&gt;Which classloader loaded the class; time required to load an individual class&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thread statistics&lt;/td&gt;
&lt;td&gt;Number of threads created and destroyed; thread dumps&lt;/td&gt;
&lt;td&gt;Which threads are blocked on locks (and the specific lock they are blocked on)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throwables&lt;/td&gt;
&lt;td&gt;Throwable classes used by the application&lt;/td&gt;
&lt;td&gt;Number of exceptions and errors thrown and the stack trace of their creation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLAB allocation&lt;/td&gt;
&lt;td&gt;Number of allocations in the heap and size of thread-local allocation buffers (TLABs)&lt;/td&gt;
&lt;td&gt;Specific objects allocated in the heap and the stack trace where they are allocated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File and socket I/O&lt;/td&gt;
&lt;td&gt;Time spent performing I/O&lt;/td&gt;
&lt;td&gt;Time spent per read/write call, the specific file or socket taking a long time to read or write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor blocked&lt;/td&gt;
&lt;td&gt;Threads waiting for a monitor&lt;/td&gt;
&lt;td&gt;Specific threads blocked on specific monitors and the length of time they are blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code cache&lt;/td&gt;
&lt;td&gt;Size of code cache and how much it contains&lt;/td&gt;
&lt;td&gt;Methods removed from the code cache; code cache configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code compilation&lt;/td&gt;
&lt;td&gt;Which methods are compiled, on-stack replacement (OSR) compilation, and length of time to compile&lt;/td&gt;
&lt;td&gt;Nothing specific to JFR, but unifies information from several sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Garbage collection&lt;/td&gt;
&lt;td&gt;Times for GC, including individual phases; sizes of generations&lt;/td&gt;
&lt;td&gt;Nothing specific to JFR, but unifies the information from several tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Profiling&lt;/td&gt;
&lt;td&gt;Instrumenting and sampling profiles&lt;/td&gt;
&lt;td&gt;Not as much as you’d get from a true profiler, but the JFR profile provides a good high-order overview&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  C- Enabling JFR
&lt;/h3&gt;

&lt;p&gt;JFR is initially disabled. To enable it, add the flag &lt;br&gt;
&lt;code&gt;-XX:+FlightRecorder&lt;/code&gt; to the command line of the application. This enables JFR as a feature, but no recordings will be made until the recording process itself is enabled. That can occur either through a GUI or via the command line.&lt;/p&gt;

&lt;p&gt;In Oracle’s JDK 8, you must also specify this flag (prior to the FlightRecorder flag): &lt;code&gt;-XX:+UnlockCommercialFeatures&lt;/code&gt; (default: false).&lt;br&gt;
If you forget to include these flags, remember that you can use &lt;code&gt;jinfo&lt;/code&gt; to change their values and enable JFR. If you use jmc to start a recording, it will automatically change these values in the target JVM if necessary.&lt;/p&gt;

&lt;p&gt;To control the default recording from the command line, use:&lt;br&gt;
&lt;code&gt;-XX:FlightRecorderOptions=string&lt;/code&gt;&lt;br&gt;
The string in that parameter is a list of comma-separated name-value pairs taken from these options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name=name
--&amp;gt;The name used to identify the recording.
defaultrecording=&amp;lt;true|false&amp;gt;
--&amp;gt;Whether to start the recording initially. The default value is false; for reactive analysis, this should be set to true.
settings=path
--&amp;gt;Name of the file containing the JFR settings (see the next section).
delay=time
--&amp;gt;The amount of time (e.g., 30s, 1h) before the recording should start.
duration=time
--&amp;gt;The amount of time to make the recording.
filename=path
--&amp;gt;Name of the file to write the recording to.
compress=&amp;lt;true|false&amp;gt;
--&amp;gt;Whether to compress (with gzip) the recording; the default is false.
maxage=time
--&amp;gt;Maximum time to keep recorded data in the circular buffer.
maxsize=size
--&amp;gt;Maximum size (e.g., 1024K, 1M) of the recording’s circular buffer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
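&lt;p&gt;Putting the pieces together, a command line that starts a JVM with a default recording enabled might look like this (a sketch assuming Oracle JDK 8 syntax; &lt;code&gt;app.jar&lt;/code&gt; and the file name are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:FlightRecorderOptions=defaultrecording=true,duration=60s,filename=app.jfr \
     -jar app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;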






&lt;h4&gt;
  
  
  🏃 See you in chapter 4 ...
&lt;/h4&gt;




&lt;h2&gt;
  
  
  🐒take a tip
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Never trust your code. 👮&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fc.tenor.com%2F4WZm1vgS_-8AAAAM%2Fthe-simpsons-homer-simpson.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fc.tenor.com%2F4WZm1vgS_-8AAAAM%2Fthe-simpsons-homer-simpson.gif" alt="Suspect your code"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>books</category>
      <category>java</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Performance - 2 -  An Approach to Performance Testing</title>
      <dc:creator>Yousef Zook</dc:creator>
      <pubDate>Sat, 30 Oct 2021 18:06:12 +0000</pubDate>
      <link>https://dev.to/yousef_zook/java-performance-chapter-2-1m10</link>
      <guid>https://dev.to/yousef_zook/java-performance-chapter-2-1m10</guid>
      <description>&lt;h1&gt;
  
  
  Recap
&lt;/h1&gt;

&lt;p&gt;This article is part 3 of the series &lt;code&gt;Java Performance&lt;/code&gt;, which summarizes the &lt;strong&gt;Java Performance book&lt;/strong&gt; by &lt;strong&gt;Scott Oaks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the previous chapter we gave a brief outline of the series, covered the platforms (hardware and software), and discussed performance hints in &lt;code&gt;The Complete Performance Story&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this chapter we are going to discuss some interesting performance concepts. We will understand the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microbenchmarks&lt;/li&gt;
&lt;li&gt;Macrobenchmarks&lt;/li&gt;
&lt;li&gt;Mesobenchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are also going to talk about response time, batching, and throughput; understand variability; and see some interesting code examples.&lt;/p&gt;

&lt;p&gt;Great, let's start the second chapter...&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F43%2F3d%2F83%2F433d83f7e481f35245f8c6bb7c7591d8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.pinimg.com%2Foriginals%2F43%2F3d%2F83%2F433d83f7e481f35245f8c6bb7c7591d8.gif" alt="Intro"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Chapter Title:
&lt;/h1&gt;

&lt;p&gt;An Approach to Performance Testing&lt;/p&gt;

&lt;h1&gt;
  
  
  1) Test a Real Application
&lt;/h1&gt;

&lt;p&gt;The first principle is that testing should occur on the actual product in the way the product will be used. There are, roughly speaking, three categories of code that can be used for performance testing, each with its own advantages and disadvantages. The category that includes the actual application will provide the best results.&lt;/p&gt;

&lt;h3&gt;
  
  
  A- Microbenchmarks
&lt;/h3&gt;

&lt;p&gt;The first of these categories is the microbenchmark. A microbenchmark is a test designed to measure a very small unit of performance, for example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time to call a synchronized method versus a nonsynchronized method&lt;/li&gt;
&lt;li&gt;The overhead in creating a thread versus using a thread pool&lt;/li&gt;
&lt;li&gt;The time to execute one arithmetic algorithm versus an alternate implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Points to take care of&lt;/strong&gt;&lt;br&gt;
Consider the following code that measures the performance of different implementations of a method to compute the 50th Fibonacci number:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;doTest&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="c1"&gt;// Main Loop&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inti&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt; 
       &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
    &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Elapsed time: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Must be &amp;gt; 0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fibImpl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isInfinite&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArithmeticException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Overflow"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; 
&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The previous code has the following issues:&lt;br&gt;
1- &lt;strong&gt;Microbenchmarks must use their results&lt;/strong&gt;: The biggest problem with this code is that it never actually changes any program state. Because the result of the Fibonacci calculation is never used, the compiler is free to discard that calculation. A smart compiler will end up executing the equivalent of the following code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Elapsed time: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;: because the &lt;code&gt;l&lt;/code&gt; variable is never used anywhere in the code, the compiler can treat the calculation as redundant and remove it before execution.&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: consume the &lt;code&gt;l&lt;/code&gt; variable in any way you like, for example by passing it to a method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
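&lt;p&gt;A minimal sketch of that idea (the class name and the &lt;code&gt;square&lt;/code&gt; workload here are illustrative assumptions, not from the book): writing the result into a &lt;code&gt;volatile&lt;/code&gt; field is a side effect the compiler cannot eliminate, which is the same trick the full listing later in this section uses.&lt;/p&gt;

```java
public class ConsumeExample {
    // Writing into a volatile field is a side effect the compiler
    // cannot optimize away, so the computation must actually run.
    static volatile double sink;

    static double square(double x) {
        return x * x;
    }

    public static void main(String[] args) {
        long then = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000; i++) {
            sink = square(i);   // "consume" the result of each iteration
        }
        long now = System.currentTimeMillis();
        System.out.println("Elapsed time: " + (now - then));
    }
}
```

&lt;p&gt;Any other side effect works too (writing to a file, accumulating a checksum); the point is only that the result must escape the loop.&lt;/p&gt;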

&lt;p&gt;2- &lt;strong&gt;Microbenchmarks must not include extraneous operations&lt;/strong&gt;: This code performs only one operation: calculating the 50th Fibonacci number. A very smart compiler can figure that out and execute the loop only once, or at least discard some of the iterations of the loop, since those operations are redundant.&lt;br&gt;
Additionally, the performance of fibImpl(1000) is likely to be very different from the performance of fibImpl(1); if the goal is to compare the performance of different implementations, then a range of input values must be considered.&lt;br&gt;
The easy way to feed in a range of inputs is to precompute them with a random number generator and process the loop as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inti&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inti&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IllegalArgumentException&lt;/span&gt; &lt;span class="n"&gt;iae&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt; 
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3- &lt;strong&gt;Microbenchmarks must measure the correct input&lt;/strong&gt;: The third pitfall here is the input range of the test: selecting arbitrary random values isn’t necessarily representative of how the code will be used. In this case, an exception will be immediately thrown on half of the calls to the method under test (anything with a negative value). An exception will also be thrown anytime the input parameter is greater than 1476, since that is the largest Fibonacci number that can be represented in a double.&lt;br&gt;
Consider this alternate implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;fibImplSlow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Must be &amp;gt; 0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; 
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1476&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArithmeticException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Must be &amp;lt; 1476"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;verySlowImpl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even though this implementation is very slow (slower than the first one in point #2), it will appear to give better performance results in the benchmark, because the input-range checks on its first two lines reject most inputs before any real work is done.&lt;/p&gt;
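&lt;p&gt;The 1476 limit can be verified directly: an iterative &lt;code&gt;double&lt;/code&gt;-based Fibonacci (a hypothetical helper written for this check, not the book's code) stays finite at n = 1476 and overflows to infinity at n = 1477:&lt;/p&gt;

```java
public class FibLimit {
    // Iterative Fibonacci in double arithmetic; once the result
    // exceeds Double.MAX_VALUE it overflows to Infinity.
    static double fib(int n) {
        double a = 0d, b = 1d;
        for (int i = 0; i < n; i++) {
            double next = a + b;
            a = b;
            b = next;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Double.isInfinite(fib(1476)));  // false
        System.out.println(Double.isInfinite(fib(1477)));  // true
    }
}
```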

&lt;p&gt;4- &lt;strong&gt;No warmup period&lt;/strong&gt;: The previous implementation doesn't include a &lt;em&gt;warm-up period&lt;/em&gt;, which matters because one of the performance characteristics of Java is that code performs better the more it is executed, a topic covered in &lt;code&gt;Chapter 4&lt;/code&gt; and related to the JIT compilers in Java.&lt;br&gt;
So the final version of the performance test should be as the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="kn"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;net.sdo&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Random&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FibonacciTest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;FibonacciTest&lt;/span&gt; &lt;span class="n"&gt;ft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;                 
        &lt;span class="nc"&gt;FibonacciTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parseInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]));&lt;/span&gt; 
        &lt;span class="n"&gt;ft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;doTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;ft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;doTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;FibonacciTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
        &lt;span class="n"&gt;nLoops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;];&lt;/span&gt;
        &lt;span class="nc"&gt;Random&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
        &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inti&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
            &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;doTest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;isWarmup&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
        &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
        &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inti&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nLoops&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
            &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;]);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;isWarmup&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
            &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Elapsed time: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; 
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; 
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;IllegalArgumentException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Must be &amp;gt; 0"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fibImpl1&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fibImpl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isInfinite&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArithmeticException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Overflow"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; 
&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
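&lt;p&gt;One side note on the timing itself: &lt;code&gt;System.currentTimeMillis()&lt;/code&gt; only resolves to milliseconds. For short-running loops, &lt;code&gt;System.nanoTime()&lt;/code&gt; is the usual finer-grained, monotonic alternative. This sketch is an assumption layered on the listing above, which uses milliseconds:&lt;/p&gt;

```java
public class NanoTiming {
    static volatile double l;

    // Time the loop with System.nanoTime(), which is monotonic and
    // has much finer granularity than currentTimeMillis().
    static long timeLoop(int iterations) {
        long then = System.nanoTime();
        double sum = 0d;
        for (int i = 0; i < iterations; i++) {
            sum += Math.sqrt(i);   // stand-in for the operation under test
        }
        l = sum;                   // consume the result (see pitfall #1)
        return System.nanoTime() - then;
    }

    public static void main(String[] args) {
        long elapsedNanos = timeLoop(10_000);
        System.out.println("Elapsed time: " + elapsedNanos / 1_000_000.0 + " ms");
    }
}
```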
&lt;h3&gt;
  
  
  B- Macrobenchmarks
&lt;/h3&gt;

&lt;p&gt;The best thing to use to measure performance of an application is the application itself, in conjunction with any external resources it uses. Testing the whole application with all the external resources is called &lt;strong&gt;Macrobenchmark&lt;/strong&gt;.&lt;br&gt;
Complex systems are more than the sum of their parts; they will behave quite differently when those parts are assembled. Mocking out database calls, for example, may mean that you no longer have to worry about the database performance; and hey, you’re a Java person; why should you have to deal with someone else’s performance problem?&lt;br&gt;
&lt;strong&gt;The other reason to test the full application&lt;/strong&gt; is one of resource allocation. In a perfect world, there would be enough time to optimize every line of code in the application. In the real world, deadlines loom, and optimizing only one part of a complex environment may not yield immediate benefits.&lt;/p&gt;
&lt;h3&gt;
  
  
  C- Mesobenchmarks
&lt;/h3&gt;

&lt;p&gt;Java EE engineers tend to use this term for benchmarks that measure one aspect of performance, but that still execute a lot of code.&lt;br&gt;
&lt;strong&gt;An example&lt;/strong&gt; in the Java EE world might be something that measures how quickly the response from a simple JSP can be returned from an application server. The code involved in such a request is substantial compared to a traditional microbenchmark: there is a lot of socket-management code, code to read the request, code to find (and possibly compile) the JSP, code to write the answer, and so on. From a traditional standpoint, this is not microbenchmarking.&lt;br&gt;
This kind of test is not a macrobenchmark either: there is no security (e.g., the user does not log in to the application), no session management, and no use of a host of other Java EE features.&lt;br&gt;
So this is called a &lt;strong&gt;mesobenchmark&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Common Code Examples
&lt;/h3&gt;

&lt;p&gt;Many of the examples throughout the book are based on a sample application that calculates the “historical” high and low price of a stock over a range of dates, as well as the standard deviation during that time. Historical is in quotes here because in the application, all the data is fictional; the prices and the stock symbols are randomly generated.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The basic object within the application is a StockPrice object&lt;/strong&gt; that represents the price range of a stock on a given day:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
public interface StockPrice {
    String getSymbol();
    Date getDate();
    BigDecimal getClosingPrice();
    BigDecimal getHigh();
    BigDecimal getLow();
    BigDecimal getOpeningPrice();
    boolean isYearHigh();
    boolean isYearLow();
    Collection&amp;lt;? extends StockOptionPrice&amp;gt; getOptions();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sample applications typically deal with a collection of these prices, representing the history of the stock over a period of time (e.g., 1 year or 25 years, depending on the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
public interface StockPriceHistory {
    StockPrice getPrice(Date d);
    Collection&amp;lt;StockPrice&amp;gt; getPrices(Date startDate, Date endDate);
    Map&amp;lt;Date, StockPrice&amp;gt; getAllEntries();
    Map&amp;lt;BigDecimal,ArrayList&amp;lt;Date&amp;gt;&amp;gt; getHistogram();
    BigDecimal getAveragePrice();
    Date getFirstDate();
    BigDecimal getHighPrice();
    Date getLastDate();
    BigDecimal getLowPrice();
    BigDecimal getStdDev();
    String getSymbol();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The basic implementation of this class loads a set of prices from the database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StockPriceHistoryImpl&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;StockPriceHistory&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;StockPriceHistoryImpl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;endDate&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;EntityManager&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;curDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTime&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt; 
        &lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;curDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;after&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endDate&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;StockPriceImpl&lt;/span&gt; &lt;span class="n"&gt;sp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;em&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;find&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StockPriceImpl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StockPricePK&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;curDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clone&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt; 
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sp&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;Date&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;curDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clone&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;firstDate&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;firstDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sp&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
                &lt;span class="n"&gt;lastDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;curDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTime&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTime&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;msPerDay&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt; 
&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The architecture of the samples is designed to be loaded from a database, and that functionality will be used in the examples in Chapter 11. However, to facilitate running the examples, most of the time they will use a mock entity manager that generates random data for the series.&lt;/p&gt;
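&lt;p&gt;Such a mock data source might be sketched as follows; the class name and the 0 to 100 price range are illustrative assumptions (the book's actual mock entity manager is not shown here):&lt;/p&gt;

```java
import java.math.BigDecimal;
import java.util.Random;

public class MockStockPriceGenerator {
    private final Random random = new Random();

    // Fabricate a random price between 0.00 and 99.99, in the same
    // spirit as generating random data instead of reading a database.
    public BigDecimal randomPrice() {
        return new BigDecimal(random.nextInt(10_000)).movePointLeft(2);
    }

    public static void main(String[] args) {
        MockStockPriceGenerator gen = new MockStockPriceGenerator();
        System.out.println("Random closing price: " + gen.randomPrice());
    }
}
```

&lt;p&gt;Because the data is random, runs are not bit-for-bit repeatable; seeding the &lt;code&gt;Random&lt;/code&gt; instance makes a benchmark reproducible.&lt;/p&gt;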




&lt;h1&gt;
  
  
  2) Understand Throughput, Batching, and Response Time
&lt;/h1&gt;

&lt;p&gt;The second principle in performance testing involves the various ways to look at the application’s performance. Which one to measure depends on which factors are most important to your application.&lt;/p&gt;

&lt;h3&gt;
  
  
  A- Elapsed Time (Batch) Measurements:
&lt;/h3&gt;

&lt;p&gt;The simplest way of measuring performance is to see &lt;strong&gt;how long it takes to accomplish a certain task&lt;/strong&gt;, example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve the history of 10,000 stocks for a 25-year period and calculate the standard deviation of those prices.&lt;/li&gt;
&lt;li&gt;Produce a report of the payroll benefits for the 50,000 employees of a corporation.&lt;/li&gt;
&lt;li&gt;Execute a loop 1,000,000 times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the non-Java world, this testing is straightforward: the application is written, and the time of its execution is measured. In the Java world, there is one wrinkle to this: &lt;code&gt;just-in-time compilation&lt;/code&gt;, which means the program needs some time to be fully optimized (that's why we needed the warmup in the previous section's code).&lt;/p&gt;

&lt;h3&gt;
  
  
  B- Throughput Measurements:
&lt;/h3&gt;

&lt;p&gt;A throughput measurement is based on &lt;strong&gt;the amount of work that can be accomplished in a certain period of time&lt;/strong&gt;. &lt;br&gt;
Notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a client-server test, a throughput measurement means that clients have no think time. If there is a single client, that client sends a request to the server. When it receives a response, it immediately sends a new request.&lt;/li&gt;
&lt;li&gt;This measurement is frequently referred to as transactions per second (&lt;strong&gt;TPS&lt;/strong&gt;), requests per second (&lt;strong&gt;RPS&lt;/strong&gt;), or operations per second (&lt;strong&gt;OPS&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;All client-server tests run the risk that the client cannot send data quickly enough to the server. This may occur because there aren’t enough CPU cycles on the client machine to run the desired number of client threads, or because the client has to spend a lot of time processing the request before it can send a new request. In those cases, the test is effectively measuring the client performance rather than the server performance, which is usually not the goal.&lt;/li&gt;
&lt;li&gt;It is common for tests that measure throughput to also report the average response time of their requests.&lt;/li&gt;
&lt;/ul&gt;
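&lt;p&gt;A zero-think-time throughput loop can be sketched like this; the &lt;code&gt;executeOperation&lt;/code&gt; body is a placeholder assumption standing in for a real client request:&lt;/p&gt;

```java
public class ThroughputTest {
    // Placeholder for the real client request; here just some CPU work.
    static double executeOperation() {
        double d = 0d;
        for (int i = 1; i < 100; i++) d += Math.log(i);
        return d;
    }

    // Count how many operations complete in the given period; the
    // count divided by the period (in seconds) is the throughput (OPS).
    static long measureThroughput(long periodMillis) {
        long count = 0;
        long end = System.currentTimeMillis() + periodMillis;
        while (System.currentTimeMillis() < end) {
            executeOperation();   // no think time: fire the next request immediately
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("OPS: " + measureThroughput(1000));
    }
}
```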

&lt;h3&gt;
  
  
  C- Response Time Tests
&lt;/h3&gt;

&lt;p&gt;The last common test is one that measures response time: &lt;strong&gt;the amount of time that elapses between the sending of a request from a client and the receipt of the response&lt;/strong&gt;.&lt;br&gt;
The difference between a response time test and a throughput test is that the client threads in a response time test sleep for some period of time between operations. This is referred to as &lt;em&gt;think time&lt;/em&gt;.&lt;br&gt;
When think time is introduced into a test, throughput becomes fixed: a given number of clients executing requests with a given think time will always yield the same TPS.&lt;br&gt;
&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
The simplest way is for clients to sleep for a period of time between requests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executeOperation&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;                 
    &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this case, the throughput is somewhat dependent on the response time. &lt;br&gt;
Another alternative is known as &lt;em&gt;cycle time&lt;/em&gt; (rather than &lt;em&gt;think time&lt;/em&gt;). Cycle time sets the total time between requests to 30 seconds, so the time the client sleeps depends on the response time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executeOperation&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 
    &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;currentThread&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This alternative yields &lt;strong&gt;a fixed throughput of 0.033 OPS&lt;/strong&gt; per client regardless of the response time (assuming the response time is always less than 30 seconds in this example).&lt;/p&gt;
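&lt;p&gt;The fixed-throughput arithmetic is easy to verify: with cycle time, each client contributes 1 / cycleTime operations per second, so one client with a 30-second cycle time yields 1/30 ≈ 0.033 OPS. A tiny sketch of my own:&lt;/p&gt;

```java
public class CycleTimeThroughput {
    // OPS for a given number of clients and a fixed cycle time in seconds,
    // assuming every response fits inside the cycle time.
    static double throughput(int clients, double cycleTimeSeconds) {
        return clients / cycleTimeSeconds;
    }

    public static void main(String[] args) {
        // One client, 30-second cycle time: 1/30 ≈ 0.033 OPS, as in the text.
        System.out.printf("%.3f OPS%n", throughput(1, 30.0));
    }
}
```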

&lt;p&gt;There are &lt;strong&gt;two ways of measuring response time&lt;/strong&gt;. Response time can be reported as: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average&lt;/strong&gt;: the individual times are added together and divided by the number of requests. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Percentile request&lt;/strong&gt;: for example, the 90th-percentile response time. If 90% of responses are less than 1.5 seconds and 10% of responses are greater than 1.5 seconds, then 1.5 seconds is the 90th-percentile response time.
The following two graphs show why the percentile is important:
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6f7qo06l1xbx7z2rwgo.png" alt="Normal"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdogr1g7c2q957da8bpzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdogr1g7c2q957da8bpzh.png" alt="Outliers"&gt;&lt;/a&gt;&lt;/p&gt;
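&lt;p&gt;To see why the percentile matters, here is a small sketch of my own (not from the book) that computes both statistics for a sample containing one slow outlier: the outlier drags the average well above most observed responses, while the 90th-percentile value stays representative:&lt;/p&gt;

```java
import java.util.Arrays;

public class ResponseTimeStats {
    // Average: individual times added together and divided by the request count.
    static double average(double[] times) {
        double sum = 0;
        for (double t : times) sum += t;
        return sum / times.length;
    }

    // p-th percentile (e.g. p = 0.90) using the nearest-rank method:
    // the value below which a fraction p of the samples fall.
    static double percentile(double[] times, double p) {
        double[] sorted = times.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        // Nine ordinary responses and one 8-second outlier (seconds).
        double[] times = {0.5, 0.6, 0.7, 0.5, 0.6, 0.8, 0.7, 0.6, 1.4, 8.0};
        System.out.println("average = " + average(times));          // pulled up by the outlier
        System.out.println("90th%   = " + percentile(times, 0.90)); // unaffected by it
    }
}
```

&lt;p&gt;Here the average is 1.44 seconds even though nine of the ten responses finished in 1.4 seconds or less, which is exactly the distortion the percentile view avoids.&lt;/p&gt;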




&lt;h1&gt;
  
  
  3) Understand Variability
&lt;/h1&gt;

&lt;p&gt;The third principle involves understanding &lt;strong&gt;how test results vary over time&lt;/strong&gt;. Programs that process exactly the same set of data will produce a different answer each time they are run. Why? Because of factors such as the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Background processes on the machine will affect the application&lt;/li&gt;
&lt;li&gt;The network will be more or less congested when the program is run&lt;/li&gt;
&lt;li&gt;... etc.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You must apply statistics when measuring the performance of an application: run each test many times to be more confident in your results.&lt;/p&gt;
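&lt;p&gt;The t-test analysis recommended in the automation guidelines below is one way to do that. As a minimal sketch of my own, Welch's t statistic compares two sets of run times and indicates whether the observed difference is larger than the run-to-run noise:&lt;/p&gt;

```java
public class VariabilityCheck {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Unbiased sample variance.
    static double variance(double[] xs) {
        double m = mean(xs), sum = 0;
        for (double x : xs) sum += (x - m) * (x - m);
        return sum / (xs.length - 1);
    }

    // Welch's t statistic for two independent samples of run times.
    static double tStatistic(double[] a, double[] b) {
        return (mean(a) - mean(b))
                / Math.sqrt(variance(a) / a.length + variance(b) / b.length);
    }

    public static void main(String[] args) {
        // Hypothetical run times (seconds) of a baseline and a patched build.
        double[] baseline = {1.00, 1.10, 0.90, 1.05, 0.95};
        double[] patched  = {1.20, 1.15, 1.25, 1.30, 1.10};
        // A large |t| suggests a real regression rather than noise; a full
        // analysis would compare it against the t-distribution for a p-value.
        System.out.println("t = " + tStatistic(baseline, patched));
    }
}
```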




&lt;h1&gt;
  
  
  4) Test Early, Test Often
&lt;/h1&gt;

&lt;p&gt;Fourth and finally, performance geeks like to recommend that performance testing be an integral part of the development cycle.&lt;br&gt;&lt;br&gt;
The typical development cycle does not make things any easier. A project schedule often establishes a feature-freeze date: all feature changes to code must be checked into the repository at some early point in the release cycle, and the remainder of the cycle is devoted to shaking out any bugs (including performance issues) in the new release. This causes the following problems for early testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers are under time constraints to get code checked in to meet the schedule; they will balk at having to spend time fixing a performance issue when the schedule has time for that after all the initial code is checked in. The developer who checks in code causing a 1% regression early in the cycle will face pressure to fix that issue; the developer who waits until the evening of the feature freeze can check in code that causes a 20% regression and deal with it later.&lt;/li&gt;
&lt;li&gt;Performance characteristics of code will change as the code changes. This is the same principle that argued for testing the full application (in addition to any module-level tests that may occur): heap usage will change, code compilation will change, and so on.&lt;/li&gt;
&lt;li&gt;A developer who introduces code causing a 5% regression may have a plan to address that regression as development proceeds: maybe her code depends on some as-yet-to-be integrated feature, and when that feature is available, a small tweak will allow the regression to go away. That’s a reasonable position, even though it means that performance tests will have to live with that 5% regression for a few weeks (and the unfortunate but unavoidable issue that said regression is masking other regressions).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early, frequent testing is most useful if the following guidelines are followed:&lt;/p&gt;

&lt;h3&gt;
  
  
  A- Automate everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All performance testing should be scripted&lt;/li&gt;
&lt;li&gt;The scripts must be able to run the test multiple times&lt;/li&gt;
&lt;li&gt;Perform t-test analysis on the results&lt;/li&gt;
&lt;li&gt;Produce a report showing the confidence level that the results are the same&lt;/li&gt;
&lt;li&gt;The automation must make sure that the machine is in a known state before tests are run&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  B- Measure everything
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The automation must gather every conceivable piece of data that will be useful for later analysis&lt;/li&gt;
&lt;li&gt;System information sampled throughout the run: CPU usage, disk usage, network usage, memory usage, and so on.&lt;/li&gt;
&lt;li&gt;The monitoring information must also include data from other parts of the system, if applicable: for example, if the program uses a database, then include the system statistics from the database machine as well as any diagnostic output from the database&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C- Run on the target system
&lt;/h3&gt;

&lt;p&gt;A test that is run on a single-core laptop will behave very differently than a test run on a machine with a 256-thread SPARC CPU. That should be clear in terms of threading effects: the larger machine is going to run more threads at the same time, reducing contention among application threads for access to the CPU. At the same time, the large system will show synchronization bottlenecks that would be unnoticed on the small laptop.&lt;br&gt;
Hence, the performance of a particular production environment can never be fully known without testing the expected load on the expected hardware. Approximations and extrapolations can be made from running smaller tests on smaller hardware, but in the real world, duplicating a production environment for testing can be quite difficult or expensive.&lt;/p&gt;




&lt;h4&gt;
  
  
  🏃 See you in chapter 3 ...
&lt;/h4&gt;




&lt;h1&gt;
  
  
  🐒take a tip
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Play a sport to work your brain effectively, Exercise your body. 🏊&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia1.giphy.com%2Fmedia%2FxT5LMUQRajYE0HNab6%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia1.giphy.com%2Fmedia%2FxT5LMUQRajYE0HNab6%2Fgiphy.gif" alt="Sport"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>books</category>
      <category>java</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Performance - 1 - Introduction</title>
      <dc:creator>Yousef Zook</dc:creator>
      <pubDate>Sat, 23 Oct 2021 20:37:37 +0000</pubDate>
      <link>https://dev.to/yousef_zook/java-performance-chapter-1-1ocn</link>
      <guid>https://dev.to/yousef_zook/java-performance-chapter-1-1ocn</guid>
      <description>&lt;h1&gt;
  
  
  Recap
&lt;/h1&gt;

&lt;p&gt;In the previous article, we demonstrated that we are going to discuss the great book &lt;strong&gt;Java Performance 2&lt;/strong&gt;&lt;sup&gt;&lt;strong&gt;nd&lt;/strong&gt;&lt;/sup&gt; &lt;strong&gt;Edition&lt;/strong&gt; by Scott Oaks.&lt;br&gt;
In this article we are going to summarize the first book chapter, &lt;em&gt;Chapter 1: Introduction&lt;/em&gt;.&lt;br&gt;
Please note that many sentences have been quoted from the book itself.&lt;br&gt;
While you are reading this article, please keep in mind that it is a summarized version of the chapter and that only the chapter's key points are included.&lt;/p&gt;

&lt;p&gt;Great, let's start the first chapter...&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cpv4aq98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i.pinimg.com/originals/43/3d/83/433d83f7e481f35245f8c6bb7c7591d8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cpv4aq98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://i.pinimg.com/originals/43/3d/83/433d83f7e481f35245f8c6bb7c7591d8.gif" alt="Intro" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Chapter Title:
&lt;/h1&gt;

&lt;p&gt;Introduction&lt;/p&gt;
&lt;h1&gt;
  
  
  For who?
&lt;/h1&gt;

&lt;p&gt;This book is designed for performance engineers and developers who are looking to understand how various aspects of the JVM and the Java APIs impact performance.&lt;br&gt;
If it is late Sunday night, your site is going live Monday morning, and you’re looking for a quick fix for performance issues, this is not the book for you.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;This is a book about the art and science of Java performance.&lt;/code&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Java Platforms
&lt;/h1&gt;

&lt;p&gt;The book covers the performance of the &lt;em&gt;Oracle HotSpot JVM&lt;/em&gt; and the &lt;em&gt;Java Development Kit&lt;/em&gt; (&lt;code&gt;JDK&lt;/code&gt;) versions 8 and 11; this is also known as Java, Standard Edition (SE). &lt;br&gt;
These versions of the JDK were selected because they carry long-term support (LTS) from Oracle. &lt;/p&gt;
&lt;h3&gt;
  
  
  JVM tuning flags
&lt;/h3&gt;

&lt;p&gt;JVM flags are crucial parameters that you can pass to the Java Virtual Machine to tune the performance of your application.&lt;br&gt;
With a few exceptions, the JVM accepts two kinds of flags: boolean flags, and flags that require a parameter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boolean flags&lt;/strong&gt; use this syntax: &lt;code&gt;-XX:+FlagName&lt;/code&gt; enables the flag, and &lt;code&gt;-XX:-FlagName&lt;/code&gt; disables the flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameterized flags&lt;/strong&gt; that require a parameter use this syntax: &lt;code&gt;-XX:FlagName=something&lt;/code&gt;, meaning to set the value of &lt;code&gt;FlagName&lt;/code&gt; to &lt;code&gt;something&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
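&lt;p&gt;As a small sketch (my own, not from the book), a running JVM can report the arguments it was started with via the standard &lt;code&gt;java.lang.management&lt;/code&gt; API, which is handy for confirming that your tuning flags actually reached the process:&lt;/p&gt;

```java
import java.lang.management.ManagementFactory;

public class ShowFlags {
    public static void main(String[] args) {
        // Prints each argument (e.g. -XX:+UseG1GC) that this JVM was started with.
        // For example, when launched as: java -XX:+UseG1GC ShowFlags
        for (Object arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            System.out.println(arg);
        }
    }
}
```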

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Jkg02I_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jchpxqc171ej83y2hph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Jkg02I_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2jchpxqc171ej83y2hph.png" alt="Tuning Flags" width="880" height="522"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;from jrebel.com&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Hardware Platforms
&lt;/h1&gt;

&lt;p&gt;From a performance perspective, the important thing about a machine is its number of cores. Let’s take a basic four-core machine: each core can (for the most part) process independently of the others, so a machine with four cores can achieve four times the throughput of a machine with a single core. (This depends on other factors about the software, of course.)&lt;br&gt;
However, adding more threads to this machine will not necessarily increase throughput or make the program finish more quickly; we will see more examples in the following chapters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Software Containers
&lt;/h3&gt;

&lt;p&gt;The biggest change in Java deployments in recent years is that they are now frequently deployed within a software container. That change is not limited to Java, of course; it’s an industry trend hastened by the move to cloud computing.&lt;br&gt;
Two containers are important here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Virtual machines: a virtual machine sets up a completely isolated copy of the operating system on a subset of the hardware on which the virtual machine is running. From Java’s perspective (and the perspective of other applications), that virtual machine is indistinguishable from a regular machine with &lt;code&gt;x cores&lt;/code&gt; and &lt;code&gt;n GB&lt;/code&gt; of memory.&lt;/li&gt;
&lt;li&gt;Docker containers: a Java process running inside a Docker container doesn’t necessarily know it is in such a container. The Docker container is just a process (potentially with resource constraints) within a running OS. The way Java handles those constraints differs between early versions of Java 8 (up until update 192) and later versions of Java 8 (and all versions of Java 11).&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  The Complete Performance Story
&lt;/h1&gt;

&lt;p&gt;This book is focused on how to best use the JVM and Java platform APIs so that programs run faster, but many outside influences affect performance. &lt;br&gt;
Here are some notes on those outside influences...&lt;/p&gt;
&lt;h3&gt;
  
  
  1.Write Better Algorithms:
&lt;/h3&gt;

&lt;p&gt;Ultimately, the performance of an application is based on how well it is written. If the program loops through all elements in an array, the JVM will optimize the way it performs bounds checking of the array so that the loop runs faster, and it may unroll the loop operations to provide an additional speedup. But if the purpose of the loop is to find a specific item, no optimization in the world is going to make the array-based code as fast as a different version that uses a hash map.&lt;/p&gt;
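&lt;p&gt;A minimal sketch of that point (my own example, not from the book): looking up a key by scanning an array is O(n), while a &lt;code&gt;HashMap&lt;/code&gt; lookup is O(1) on average, and no JVM optimization closes that algorithmic gap:&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

public class LookupComparison {
    // O(n): scan the array until the key is found, however well the JVM
    // optimizes the bounds checks inside the loop.
    static String findInArray(String[][] pairs, String key) {
        for (String[] pair : pairs) {
            if (pair[0].equals(key)) return pair[1];
        }
        return null;
    }

    // O(1) average: a single hash lookup, regardless of map size.
    static String findInMap(Map<String, String> map, String key) {
        return map.get(key);
    }

    public static void main(String[] args) {
        String[][] pairs = {{"a", "1"}, {"b", "2"}, {"c", "3"}};
        Map<String, String> map = new HashMap<>();
        for (String[] p : pairs) map.put(p[0], p[1]);
        System.out.println(findInArray(pairs, "c"));  // scans up to 3 entries
        System.out.println(findInMap(map, "c"));      // one hash lookup
    }
}
```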
&lt;h3&gt;
  
  
  2.Write Less Code
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;A small well-written program will run faster than a large well-written program.   &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So even though your code should be neat, extensible, and easy to read, you should also consider performance while writing it.&lt;br&gt;
One part of the team may argue that the small piece of code they are adding will not affect performance; then another part claims the same, and after enough repetition you may find that overall performance has regressed by 10%.&lt;br&gt;
Over time, as the small regressions creep in, it will be harder and harder to fix them.&lt;/p&gt;
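&lt;p&gt;The compounding effect is easy to check with a little arithmetic: ten independent 1% regressions leave about 0.99&lt;sup&gt;10&lt;/sup&gt; ≈ 0.904 of the original speed, i.e. roughly a 10% slowdown. A tiny sketch of my own:&lt;/p&gt;

```java
public class Compounding {
    // Fraction of the original speed remaining after n independent
    // regressions of fraction r each.
    static double remainingSpeed(int n, double r) {
        return Math.pow(1.0 - r, n);
    }

    public static void main(String[] args) {
        // Ten 1% regressions: 0.99^10 ≈ 0.904, i.e. ~10% slower overall.
        System.out.printf("%.3f%n", remainingSpeed(10, 0.01));
    }
}
```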
&lt;h3&gt;
  
  
  3.Oh, Go Ahead, Prematurely Optimize
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;premature optimization&lt;/em&gt;: A term often used by developers to claim that the performance of their code doesn’t matter, and if it does matter, we won’t know that until the code is run.&lt;br&gt;
&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If changing the code to get better performance would complicate it and make it harder to maintain, it is better to defer that change until profiling shows that the optimization is actually needed.&lt;/li&gt;
&lt;li&gt;However, if you have two choices that are both straightforward and easy to implement, and one of them performs better, choose the faster one. 
Let's take the following example:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;log&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FINE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"I am here, and the value of X is "&lt;/span&gt;
            &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;calcX&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" and Y is "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;calcY&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This code does string concatenation and calls the functions &lt;code&gt;calcX()&lt;/code&gt; and &lt;code&gt;calcY()&lt;/code&gt; even when &lt;code&gt;FINE&lt;/code&gt; logging is disabled. Since this level of logging is often turned off, the message will never be printed; in that case it is better to check whether the level is loggable first, to save the cost of the string concatenation and the function calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isLoggable&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FINE&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; 
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;log&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Level&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FINE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"I am here, and the value of X is {} and Y is {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="o"&gt;[]{&lt;/span&gt;&lt;span class="n"&gt;calcX&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;calcY&lt;/span&gt;&lt;span class="o"&gt;()});&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids the string concatenation altogether (the message format isn’t necessarily more efficient, but it is cleaner), and there are no method calls or allocation of the object array unless logging has been enabled.&lt;/p&gt;
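&lt;p&gt;Since Java 8, &lt;code&gt;java.util.logging&lt;/code&gt; also offers a &lt;code&gt;Supplier&lt;/code&gt;-based overload of &lt;code&gt;log()&lt;/code&gt; that achieves the same laziness without an explicit &lt;code&gt;isLoggable&lt;/code&gt; check. A runnable sketch (the &lt;code&gt;calcX&lt;/code&gt;/&lt;code&gt;calcY&lt;/code&gt; stand-ins are my own):&lt;/p&gt;

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogging {
    static final Logger log = Logger.getLogger(LazyLogging.class.getName());
    static int expensiveCalls = 0;

    // Stand-ins for the expensive calls in the example above.
    static int calcX() { expensiveCalls++; return 42; }
    static int calcY() { expensiveCalls++; return 7; }

    public static void main(String[] args) {
        log.setLevel(Level.INFO);  // FINE is disabled at this level
        // The lambda is evaluated only if FINE logging is enabled,
        // so calcX()/calcY() are never invoked here.
        log.log(Level.FINE, () -> "X is " + calcX() + " and Y is " + calcY());
        System.out.println("expensive calls made: " + expensiveCalls);
    }
}
```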

&lt;h3&gt;
  
  
  4.Look Elsewhere: The Database Is Always the Bottleneck
&lt;/h3&gt;

&lt;p&gt;If you are developing standalone Java applications that use no external resources, the performance of that application is (mostly) all that matters. Once an external resource (a database, for example) is added, the performance of both programs is important. And in a distributed environment—say with a Java REST server, a load balancer, a database, and a backend enterprise information system—the performance of the Java server may be the least of the performance issues.&lt;br&gt;
If the database is the bottleneck, tuning the Java application accessing the database won’t help overall performance at all. In fact, it might be counterproductive. &lt;strong&gt;As a general rule&lt;/strong&gt;, &lt;code&gt;when load is increased into a system that is overburdened, performance of that system gets worse.&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5.Optimize for the Common Case
&lt;/h3&gt;

&lt;p&gt;We should focus on the common use case scenarios. This principle manifests itself in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimize code by profiling it and focusing on the operations in the profile taking the most time. &lt;/li&gt;
&lt;li&gt;Apply &lt;a href="https://en.wikipedia.org/wiki/Occam's_razor"&gt;Occam’s razor&lt;/a&gt; to diagnosing performance problems. The simplest explanation for a performance issue is the most likely cause: a performance bug in new code is more likely than a configuration issue on a machine, which in turn is more likely than a bug in the JVM or operating system.&lt;/li&gt;
&lt;li&gt;Write simple algorithms for the most common operations in an application.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  🏃 See you in chapter 2 ...
&lt;/h4&gt;




&lt;h1&gt;
  
  
  🐒take a tip
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Pareto principle: 80% of consequences come from 20% of causes.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ARfChr3a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://streetfins.com/wp-content/uploads/2021/03/7c-Pareto-Principle-1150x647-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ARfChr3a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://streetfins.com/wp-content/uploads/2021/03/7c-Pareto-Principle-1150x647-1.png" alt="Rule" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>books</category>
      <category>java</category>
      <category>performance</category>
    </item>
    <item>
      <title>Java Performance - Overview</title>
      <dc:creator>Yousef Zook</dc:creator>
      <pubDate>Wed, 20 Oct 2021 21:03:31 +0000</pubDate>
      <link>https://dev.to/yousef_zook/java-performance-summary-45d3</link>
      <guid>https://dev.to/yousef_zook/java-performance-summary-45d3</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia3.giphy.com%2Fmedia%2FZO8EB8aWJjXH6fRHxq%2Fgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia3.giphy.com%2Fmedia%2FZO8EB8aWJjXH6fRHxq%2Fgiphy.gif" alt="Performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What is this?
&lt;/h1&gt;

&lt;p&gt;In this article series, I am going to summarize the main points of the incredible book "Java Performance" by Scott Oaks. &lt;br&gt;
This is a great book if you are interested in optimizing your Java application and/or want to dig deeper into how Java works and how to benchmark and profile your Java application.&lt;br&gt;
We are going to discuss each chapter in a separate part of the series. &lt;br&gt;
This series is for people who want a quick summary of the book and/or want quick guidance on how to think about Java performance without needing to read the whole book.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zrojajcq9flpnkr2zzz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zrojajcq9flpnkr2zzz.jpeg" alt="Book cover"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h1&gt;
  
  
  About Scott, the author
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Scott Oaks&lt;/strong&gt; is a Java Technologist at Sun Microsystems, where he has worked since 1987. While at Sun, he has specialized in many disparate technologies, from the SunOS kernel to network programming and RPCs. Since 1995, he's focused primarily on Java and bringing Java technology to end-users. Scott also authored O'Reilly's &lt;code&gt;Java Security&lt;/code&gt;, &lt;code&gt;Java Threads&lt;/code&gt; and &lt;code&gt;Jini in a Nutshell&lt;/code&gt; titles.&lt;br&gt;
&lt;em&gt;&lt;a href="https://www.oreilly.com/pub/au/193#:~:text=Scott%20Oaks%20is%20a%20Java,Java%20technology%20to%20end%2Dusers." rel="noopener noreferrer"&gt;Oreilly&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h1&gt;
  
  
  About the book
&lt;/h1&gt;

&lt;p&gt;We are going to discuss the 2nd edition of the book. This edition was published in February 2020. The book index is as follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;

&lt;p&gt;Let's see the index of the book highlights:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Introduction&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;A Brief Outline&lt;/li&gt;
&lt;li&gt;Platforms and Conventions&lt;/li&gt;
&lt;li&gt;The Complete Performance Story&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An approach to performance testing&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Test a Real Application&lt;/li&gt;
&lt;li&gt;Understand Throughput, Batching, and Response Time&lt;/li&gt;
&lt;li&gt;Understand Variability&lt;/li&gt;
&lt;li&gt;Test Early, Test Often&lt;/li&gt;
&lt;li&gt;Benchmark Examples&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Java Performance Toolbox&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Operating System Tools and Analysis&lt;/li&gt;
&lt;li&gt;Java Monitoring tools&lt;/li&gt;
&lt;li&gt;Profiling Tools&lt;/li&gt;
&lt;li&gt;Java Flight Recorder&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working with the JIT Compiler&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;JIT Overview&lt;/li&gt;
&lt;li&gt;Tiered Compilation&lt;/li&gt;
&lt;li&gt;Common Compiler Flags&lt;/li&gt;
&lt;li&gt;Advanced Compiler Flags&lt;/li&gt;
&lt;li&gt;The Compilation Trade-offs&lt;/li&gt;
&lt;li&gt;The GraalVM&lt;/li&gt;
&lt;li&gt;Precompilation&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An Introduction to Garbage Collection&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Garbage Collection Overview&lt;/li&gt;
&lt;li&gt;Basic GC Tuning&lt;/li&gt;
&lt;li&gt;GC Tools&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Garbage Collection Algorithms&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Understanding the Throughput Collector&lt;/li&gt;
&lt;li&gt;Understanding the G1 Garbage Collector&lt;/li&gt;
&lt;li&gt;Understanding the CMS Collector&lt;/li&gt;
&lt;li&gt;Advanced Tuning&lt;/li&gt;
&lt;li&gt;Experimental GC Algorithms&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap Memory Best Practices&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Heap Analysis&lt;/li&gt;
&lt;li&gt;Using Less Memory&lt;/li&gt;
&lt;li&gt;Object Life-Cycle Management&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Memory Best Practices&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Footprint&lt;/li&gt;
&lt;li&gt;JVM Tunings for the Operating System&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threading and Synchronization Performance&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Threading and Hardware&lt;/li&gt;
&lt;li&gt;Thread Pools and ThreadPoolExecutors&lt;/li&gt;
&lt;li&gt;The ForkJoinPool&lt;/li&gt;
&lt;li&gt;Thread Synchronization&lt;/li&gt;
&lt;li&gt;JVM Thread Tunings&lt;/li&gt;
&lt;li&gt;Monitoring Threads and Locks&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java Servers&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Java NIO Overview&lt;/li&gt;
&lt;li&gt;Server Containers&lt;/li&gt;
&lt;li&gt;Asynchronous Outbound Calls&lt;/li&gt;
&lt;li&gt;JSON Processing&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Performance Best Practices&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Sample Database&lt;/li&gt;
&lt;li&gt;JDBC&lt;/li&gt;
&lt;li&gt;JPA&lt;/li&gt;
&lt;li&gt;Spring Data&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Java SE API Tips&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Strings&lt;/li&gt;
&lt;li&gt;Buffered I/O&lt;/li&gt;
&lt;li&gt;Classloading&lt;/li&gt;
&lt;li&gt;Random Numbers&lt;/li&gt;
&lt;li&gt;Java Native Interface&lt;/li&gt;
&lt;li&gt;Exceptions&lt;/li&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Java Collections API&lt;/li&gt;
&lt;li&gt;Lambdas and Anonymous Classes&lt;/li&gt;
&lt;li&gt;Stream and Filter Performance&lt;/li&gt;
&lt;li&gt;Object Serialization&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h1&gt;
  
  
  ✋Further notes
&lt;/h1&gt;

&lt;p&gt;I want to mention that I am adding each chapter after I finish reading it, so it may take some time to cover all 12 chapters.&lt;br&gt;
If you have any suggestions and/or notes, please mention them so I can improve the article.&lt;/p&gt;

&lt;p&gt;As soon as I write up a chapter, I will update this article to include it as a new part.&lt;br&gt;
See you there...&lt;/p&gt;




&lt;h1&gt;
  
  
  🐒take a tip
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Work smart &lt;strong&gt;AND&lt;/strong&gt; hard&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fc.tenor.com%2F2PnUyS0_ad0AAAAM%2Fsmart-thinking.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fc.tenor.com%2F2PnUyS0_ad0AAAAM%2Fsmart-thinking.gif" alt="working-smart"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fc.tenor.com%2FRlo-1wvJ_BgAAAAM%2Fcat-work.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fc.tenor.com%2FRlo-1wvJ_BgAAAAM%2Fcat-work.gif" alt="working-hard"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>performance</category>
      <category>java</category>
      <category>programming</category>
      <category>books</category>
    </item>
  </channel>
</rss>
