DEV Community: Tom den Braber

Finding memory issues in PHP programs — Part 2

Tom den Braber — Thu, 21 Feb 2019 23:00:00 +0000

The cause of a memory issue is often hard to find. In the previous post, we looked at two methods for finding the culprits of a memory issue. In case you missed it, check it out!

In this post, we will look at another tool for finding memory leaks: php-memprof, created by Arnaud Le Blanc.

Another profiler… why?

Each profiler has its own characteristics. In the previous post, we saw that the Xdebug profiler generates a ‘cumulative memory profile’: it shows the sum of all the allocations made by a certain function over the lifetime of the program. This sometimes makes the generated profile hard to interpret. With the profiles generated by Xdebug, you cannot see whether the memory has already been released by the time the profile is generated. Yet memory that a function allocates and never releases is exactly what could indicate a memory leak. php-memprof is made to fill this gap. It provides information that can be used to find out whether a function ‘leaks’ memory.

Setting up php-memprof

The installation of php-memprof can be done using PECL (pecl install memprof), or manually. After downloading the source, make sure you have libJudy installed and run the following commands in the source directory.

phpize
./configure 
make
make install

Make sure to load the extension (set extension=memprof.so in your php.ini or run your script with php -dextension=memprof.so my_script.php). For my own analyses, I use a Docker image that has the extension already installed and loaded by default.

Note that running Xdebug and php-memprof side by side will not work. To be able to run php-memprof, Xdebug has to be disabled.
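Before profiling, it can help to verify the setup from within PHP. The following is a minimal sketch, assuming the extensions register under the names memprof and xdebug; the helper name memprof_setup_problems is made up for this example.

```php
<?php
// Hypothetical helper: returns a list of problems with the profiling setup.
function memprof_setup_problems(): array {
    $problems = [];
    if (!extension_loaded('memprof')) {
        $problems[] = 'memprof is not loaded (check extension=memprof.so)';
    }
    if (extension_loaded('xdebug')) {
        $problems[] = 'Xdebug is loaded and has to be disabled first';
    }
    return $problems;
}

$problems = memprof_setup_problems();
echo $problems === [] ? "Ready to profile\n" : implode("\n", $problems) . "\n";
```

An empty list means that both conditions from this section are met.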

Running example

As our running example, we will use a simple array-based cache implementation. The store function is an excellent example of (valid) code that keeps memory allocated, even after its termination.

<?php
class ArrayBasedCache {
  private $cache = [];

  public function store(string $key, $data) {
    $this->cache[$key] = $data;
  }

  public function has(string $key) {
    return isset($this->cache[$key]);
  }

  public function get(string $key) {
    if ($this->has($key) === false) {
      throw new LogicException("Tried to get '$key' from cache, but it does not exist. has(\$key) should be checked first!");
    }
    return $this->cache[$key];
  }

  public function clear() {
    $this->cache = [];
  }
}

Using php-memprof

To signal php-memprof that it should start profiling, the memprof_enable() function has to be called. After the call to this function, php-memprof starts tracking memory allocations. This implies that memory allocated before memprof_enable() was called will not be taken into account.

The information that php-memprof gathered can be dumped using one of the following functions:

memprof_dump_array() dumps the information in an array format
memprof_dump_callgrind(resource $stream) dumps the information to the given $stream in the callgrind format, which can be analysed using KCachegrind or qCachegrind (as shown in the previous post)
memprof_dump_pprof(resource $stream) dumps the information to the given $stream in the pprof format

Let’s run the following code:

<?php 
$cache = new ArrayBasedCache;

memprof_enable();
printf("A:\n%s\n", json_encode(memprof_dump_array(), JSON_PRETTY_PRINT));

$cache->store("a", 123);

printf("B:\n%s\n", json_encode(memprof_dump_array(), JSON_PRETTY_PRINT));
memprof_disable();

The JSON-encoded output of this script is as follows:

{
    "memory_size": 0,
    "blocks_count": 0,
    "memory_size_inclusive": 0,
    "blocks_count_inclusive": 0,
    "calls": 1,
    "called_functions": []
}

{
    "memory_size": 103,
    "blocks_count": 3,
    "memory_size_inclusive": 479,
    "blocks_count_inclusive": 5,
    "calls": 1,
    "called_functions": {
        "ArrayBasedCache::store": {
            "memory_size": 376,
            "blocks_count": 2,
            "memory_size_inclusive": 376,
            "blocks_count_inclusive": 2,
            "calls": 1,
            "called_functions": []
        }
    }
}

The first thing to note is that the dumped structure is recursive. The top-level fields indicate the data from the main program, that is the code that is executed outside any function.

In the called_functions field, we can see the callees from the overarching context. For each of these called functions, the same fields are present again.

Let's go over all the fields that we see in the output to find out what they mean:

memory_size: the number of bytes allocated from the current context;
blocks_count: the number of memory blocks allocated from the current context;
memory_size_inclusive: the number of bytes allocated by the function itself plus all the memory that was allocated by any of the callees of the current context;
blocks_count_inclusive: same as for memory_size_inclusive, but expressed in blocks;
calls: the number of times this function was called from the overarching context;
called_functions: a dictionary whose keys are function names and whose values contain all the fields in this list. This field represents the call tree below the current context.

We can see that the A-dump does not contain information about the creation of the ArrayBasedCache itself: this call happened before the enabling of php-memprof and is therefore not taken into consideration. As the A-dump is generated immediately after enabling php-memprof, we see that the dump has no information about any allocated memory.

The B-dump shows that ArrayBasedCache::store was called once from the 'main' program. We can also see that ArrayBasedCache::store has a memory_size of 376 bytes, even though its execution had already finished by the time the B-dump was generated. These bytes are still in memory because the store method added data to the private $cache field of the object, and did not remove it.
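Because the structure is recursive, it is easy to walk with a small recursive function. The sketch below flattens a dump into one row per call path and sorts the rows by memory_size. The function name flatten_profile and the '{main}' label for the top level are my own choices for this illustration; the data is the B-dump from above, written out as a PHP array.

```php
<?php
// Flattens a memprof_dump_array() result into one row per call path.
function flatten_profile(array $node, string $path = '{main}'): array {
    $rows = [[
        'path' => $path,
        'memory_size' => $node['memory_size'],
        'calls' => $node['calls'],
    ]];
    foreach ($node['called_functions'] as $callee => $child) {
        $rows = array_merge($rows, flatten_profile($child, $path . ' -> ' . $callee));
    }
    return $rows;
}

// The B-dump, reduced to the fields this sketch uses.
$dump = [
    'memory_size' => 103,
    'calls' => 1,
    'called_functions' => [
        'ArrayBasedCache::store' => [
            'memory_size' => 376,
            'calls' => 1,
            'called_functions' => [],
        ],
    ],
];

$rows = flatten_profile($dump);
// Largest non-inclusive allocation first.
usort($rows, function (array $a, array $b): int {
    return $b['memory_size'] <=> $a['memory_size'];
});
foreach ($rows as $row) {
    printf("%-40s %5d bytes, %d call(s)\n", $row['path'], $row['memory_size'], $row['calls']);
}
```

Sorting on memory_size rather than memory_size_inclusive puts the functions that hold on to memory themselves at the top.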

Let’s see another example:

<?php 
$cache = new ArrayBasedCache;

memprof_enable();

$cache->store("a", 123);
$cache->clear();

echo json_encode(memprof_dump_array(), JSON_PRETTY_PRINT), "\n";
memprof_disable();

and its output:

{
    "memory_size": 103,
    "blocks_count": 3,
    "memory_size_inclusive": 103,
    "blocks_count_inclusive": 3,
    "calls": 1,
    "called_functions": {
        "ArrayBasedCache::store": {
            "memory_size": 0,
            "blocks_count": 0,
            "memory_size_inclusive": 0,
            "blocks_count_inclusive": 0,
            "calls": 1,
            "called_functions": []
        },
        "ArrayBasedCache::clear": {
            "memory_size": 0,
            "blocks_count": 0,
            "memory_size_inclusive": 0,
            "blocks_count_inclusive": 0,
            "calls": 1,
            "called_functions": []
        }
    }
}

After calling $cache->clear(), all items in the cache are removed. The allocation that was done by $cache->store(...) is thereby deleted, and we can see in the dump that there is indeed no memory allocated by ArrayBasedCache::store at the time the dump is made.
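The same effect can be cross-checked without php-memprof, using only memory_get_usage. This is a minimal sketch with a trimmed-down copy of the cache class and a deliberately large payload (about 1 MB, an arbitrary choice) so that the difference is easy to see.

```php
<?php
// Trimmed-down copy of the running example, so this block runs on its own.
class ArrayBasedCache {
    private $cache = [];

    public function store(string $key, $data) {
        $this->cache[$key] = $data;
    }

    public function clear() {
        $this->cache = [];
    }
}

$cache = new ArrayBasedCache;

$before = memory_get_usage();
$cache->store("a", str_repeat("x", 1000000)); // roughly 1 MB of payload
$after_store = memory_get_usage();
$cache->clear();
$after_clear = memory_get_usage();

printf("retained by store(): %d bytes\n", $after_store - $before);
printf("released by clear(): %d bytes\n", $after_store - $after_clear);
```

This only shows that memory was retained and released; unlike php-memprof, it does not tell you which function is responsible.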

As you can imagine, the array output will become very large for larger programs. For small, isolated scripts, memprof_dump_array works fine. For larger programs, I find it way more useful to look at a visualisation of the output of the memprof_dump_callgrind function.
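Writing a callgrind dump to a file could look like the sketch below. The helper name dump_callgrind_profile and the output path are assumptions made for this example; the only php-memprof function used is memprof_dump_callgrind, and the helper expects that profiling has already been enabled.

```php
<?php
// Writes the current memprof data to $path in callgrind format.
// Returns false when memprof is unavailable or the file cannot be opened.
function dump_callgrind_profile(string $path): bool {
    if (!function_exists('memprof_dump_callgrind')) {
        return false;
    }
    $stream = fopen($path, 'w');
    if ($stream === false) {
        return false;
    }
    memprof_dump_callgrind($stream);
    fclose($stream);
    return true;
}

// Hypothetical output location; the "callgrind.out." prefix is only a naming convention.
$path = sys_get_temp_dir() . '/callgrind.out.memprof';
$written = dump_callgrind_profile($path);
echo $written ? "Profile written to $path\n" : "No profile written\n";
```

The resulting file can then be opened in KCachegrind or qCachegrind.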

Wrapping up

In this post, we looked at another profiler, which approaches memory profiling from a different angle as compared to the Xdebug profiler we looked at last time. This two-part series is meant to help you find the causes of memory issues faster and with a more structured process. Several techniques, like logging memory usage, using profilers, and profile analysis tools have been discussed. Do you miss an important technique that should have been discussed in this series? Let us know!

Photo by Daan Mooij on Unsplash

Originally published at www.moxio.com on February 21, 2019.

Finding memory issues in PHP programs

Tom den Braber — Thu, 11 Oct 2018 12:25:12 +0000

"Fatal error: Allowed memory size of 2097152 bytes exhausted (tried to allocate 528384 bytes)." If this error sounds familiar, this post is for you. The problem with this message is that it does not tell you a lot: it does not tell you where all the memory was allocated. Locating the places where a lot of memory is consumed in large and complex systems is not easy. Luckily, there are some tools available which can help finding the problematic code. In this post, we will cover two methods for finding places in your program where a lot of memory is allocated.

Running example

We will be using the following code as our running example. The purpose of the code is finding Nemo. There are two functions: one that reads a file in which Nemo could be located, and the other which tries to find a line of which the content is equal to 'nemo'. The problem with this snippet is that it sometimes consumes too much memory. Not always, but with certain files, the program crashes.

<?php
function fetch_data_from_file(string $file_path) : iterable {
    $resource = fopen($file_path, 'r');
    $lines = [];
    while (($line = fgets($resource)) !== false) {
        $lines[] = trim($line);
    }
    return $lines;
}

function finding_nemo(string $filepath) : int {
    $lines = fetch_data_from_file($filepath);
    foreach ($lines as $line_number => $line) {
        if ($line === "nemo") {
            return $line_number;
        }
    }
    return -1; //nemo not found
}

None of the techniques and tools described here are strictly needed to solve this issue (can you already spot the problem?), but the example enables us to see how those tools and techniques work in practice.

memory_get_usage()

PHP has two functions which can tell you something about the memory usage of your program: memory_get_usage and memory_get_peak_usage.
memory_get_usage only gives insight in how much memory is in use at the moment of the function call. memory_get_peak_usage() returns the maximum number of bytes allocated by the program until the function call. Both of these functions take one boolean argument: $real_usage. If $real_usage is set to true, memory_get_usage returns the total amount of memory that is actually allocated from the operating system, but some of it might not (yet) be in use by your program. If it is set to false, it returns the number of bytes which PHP has requested (and received) from the operating system, and which is actually in use by the program. The following statement always holds: memory_get_usage(true) >= memory_get_usage(false). Memory is requested in blocks, which are not fully used all of the time.
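The difference between these values is easiest to see by printing them side by side. A minimal sketch; the size of the array is arbitrary and only there so that there is something to measure.

```php
<?php
$data = range(1, 100000); // allocate something so there is memory to report on

$in_use = memory_get_usage(false);    // bytes actually in use by the program
$reserved = memory_get_usage(true);   // bytes obtained from the operating system
$peak = memory_get_peak_usage(false); // highest "in use" value so far

printf("in use:   %d bytes\n", $in_use);
printf("reserved: %d bytes\n", $reserved);
printf("peak:     %d bytes\n", $peak);
```

The reserved value only changes in large steps, which is why the non-real value is usually the more useful one for spotting where usage grows.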

An advantage of using these functions is that they are really easy to use. One of the possible ways of finding your memory leak is scattering calls to memory_get_usage all over your code, and logging its output. You can then try to find a pattern: where does the memory usage increase?
A drawback of these functions is that their use is limited, as they do not provide insight into which functions or classes are using all that memory.
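To keep the scattered calls manageable, the logging can be wrapped in a small helper that records where it was called from. This is a sketch under my own naming: log_memory is a hypothetical helper, and it uses debug_backtrace to find the file and line of the caller.

```php
<?php
// Hypothetical helper: logs the current memory usage together with the caller's location.
function log_memory(string $marker): string {
    // Frame 0 of the backtrace holds the file and line where log_memory() was called.
    $caller = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 1)[0];
    $message = sprintf(
        '%s (%s:%d): %d bytes used',
        $marker,
        basename($caller['file']),
        $caller['line'],
        memory_get_usage()
    );
    error_log($message);
    return $message;
}

$message = log_memory('A');
```

Returning the message as well as logging it makes the helper easier to use in tests.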

Let's use these functions to get an idea of where our current problem might reside. In the example below, I use marker characters like 'A', 'B', etc. to be able to track a log entry back to a location in the code. Another option is to include 'magic constants' like __FILE__ and __LINE__ in your log output.

<?php

function fetch_data_from_file(string $file_path) : iterable {
    error_log(sprintf("A: %d bytes used\n", memory_get_usage()));
    /** original code... **/
    error_log(sprintf("B: %d bytes used\n", memory_get_usage()));
    return $lines;
}

function finding_nemo(string $filepath) : int {
    error_log(sprintf("C: %d bytes used\n", memory_get_usage()));
    /** original code... **/
    error_log(sprintf("D: %d bytes used\n", memory_get_usage()));
    return -1; //nemo not found
}

finding_nemo("the_sea.txt");

When we run our example now, we have the following log output:

C: 406912 bytes used
A: 406912 bytes used
B: 3599952 bytes used
D: 3591384 bytes used

That's interesting: up to marker A, there is no problem. Between markers A and B, the memory usage suddenly increases. These markers correspond to the start and end of the fetch_data_from_file function, so our hypothesis is that this function is the culprit. Let's try to confirm this hypothesis using another technique.

Xdebug profiler

As a PHP programmer, you probably have heard of (and used) Xdebug. If you haven't, check it out and make sure to install it. What you might not know, is that it also comes with a profiler: a tool which provides insight in the run time behaviour of a program. This profiler is much more sophisticated than the PHP functions mentioned earlier: instead of giving you just information about how much memory is used, it also provides insight in which functions are actually allocating memory. This is an advantage over the previous technique, because if you don't really have a clue where to look for your memory problem, you will have to scatter a huge amount of calls to memory_get_usage all over your codebase. Before being able to use the profiler, there are some things that need to be configured in your php.ini. Note that most of these options cannot be set at run time using ini_set.
First, you have to enable the profiler. This can be done in two ways: either by using xdebug.profiler_enable = 1 or by using xdebug.profiler_enable_trigger = 1. When using the first option, a profile is generated for every run of your program. The second option only creates a profile of your running program if there is a GET/POST variable or cookie set with the name XDEBUG_PROFILE. You also have to tell Xdebug where it has to store the generated files, using xdebug.profiler_output_dir.
There are more things to configure, but with these settings you are already good to go.
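Put together, a minimal php.ini fragment could look like the sketch below. It uses the Xdebug 2.x setting names and the trigger-based variant (a request variable or cookie named XDEBUG_PROFILE); the output directory is a made-up path. As far as I know, Xdebug 3 renamed these settings (for example to xdebug.mode and xdebug.output_dir), so check the documentation of the version you have installed.

```ini
; Only profile requests that carry the XDEBUG_PROFILE trigger (Xdebug 2.x name)
xdebug.profiler_enable_trigger = 1

; Hypothetical directory; it has to exist and be writable by PHP
xdebug.profiler_output_dir = /tmp/xdebug-profiles
```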

Now, run the script again with the Xdebug profiler enabled. If we look into our configured output directory, we can find the generated profile there. However, before we can open it, we need another tool: qCachegrind for Windows or kCachegrind for Linux. I will use qCachegrind for now.

When opening the profile with qCachegrind you will see something like the picture below.

Make sure you select 'Memory' in the dropdown menu at the top of the window, as opposed to 'Time' (this option can be useful if performance is an issue).
When looking at the 'callee map' of the {main} entry in the function list, you can see by the size of the blocks how the called functions have allocated memory. The larger blocks are the most interesting: these are the functions that allocate the most memory. Each called function is located inside the caller in the callee map.
In the 'Flat Profile' section on the left, you can see a list of functions. For each function, there is an 'Incl.' and a 'Self' column. 'Incl' indicates the amount of memory allocated by this function, including all the memory which is allocated by the callees of that function. 'Self' shows the memory which is allocated by the function itself.
The functions that are most interesting to look at, are those functions that have a relatively high value in the 'Self' column.
As we can see, there are two functions which take up a lot of memory themselves: php:fgets and php:trim. But wait... trim() only trims one line at a time, and fgets only reads one line at a time, right? Why are these functions using so much memory? Here we get to one of the drawbacks of the Xdebug profiler: it generates a 'cumulative memory profile', i.e. when a function is called multiple times, it shows the sum of all the memory that was used over the different times it was called.

Although the Xdebug profile has its drawbacks, it enables us to see where (potentially) a lot of memory is allocated. We can confirm our hypothesis, namely that the fetch_data_from_file seems to have a problem, as this function calls two PHP functions which allocate a lot of memory.

Fixing the script

Note that a profiler, or logging memory usage, will almost never give you an exact answer of what or where your memory problem is located. Manual analysis will always be part of your debugging process. However, the tools do help you to build an idea of where the problem might be. At this point, we know which function likely has a problem. Upon closer analysis of the fetch_data_from_file function, we can see that it uses an array to buffer the complete file. If the file is large, the program will run out of memory. Now we do have enough information to fix it.
Let's work with the assumption that fetch_data_from_file is also used elsewhere, and that its behaviour should not change. Luckily, there is a solution for this problem: we do not actually have to load the complete file.

A relatively simple way to work around this problem is to use a Generator. This excellent post describes the concept in more detail.

In short, a Generator enables you to write a basic iterator, where you have control over what information is needed in memory. When looping over an iterator, the loop is in control of when it fetches the next item from the iterator. As the iterator knows how to fetch the next item, it does not necessarily need to have all items in memory. In this example, this means that there will only be one line of the file in memory at a time.
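The fact that the loop decides when the next item is produced can be made visible with a tiny generator that records every item it creates. The function name numbers and the $log parameter are invented for this illustration.

```php
<?php
// Yields 1..$limit and records each produced item in $log.
function numbers(int $limit, array &$log): Generator {
    for ($i = 1; $i <= $limit; $i++) {
        $log[] = $i;
        yield $i;
    }
}

$log = [];
foreach (numbers(1000000, $log) as $number) {
    if ($number === 3) {
        break; // the generator is never asked for item 4
    }
}

printf("items produced: %d\n", count($log));
```

Although the generator could produce a million items, only the ones the loop actually asked for were ever created.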

Let's look at the example code from above, with a Generator:

<?php

function fetch_data_from_file(string $file_path) : iterable {
    $resource = fopen($file_path, 'r');
    while (($line = fgets($resource)) !== false) {
        yield trim($line);
    }
}

function finding_nemo(string $filepath) : int {
    $lines = fetch_data_from_file($filepath);
    foreach ($lines as $line_number => $line) {
        if ($line === "nemo") {
            return $line_number;
        }
    }
    return -1; //nemo not found
}

Interestingly, the finding_nemo function did not have to change: foreach loops have no problem with Generators. The fetch_data_from_file function did change: it now contains the yield statement.

When we log the memory usage for this piece of code, we can see that the usage stays low. However, because Xdebug generates a cumulative memory profile, the Xdebug profile will look more or less the same. This happens because in total, the fetch_data_from_file function indeed allocates the same amount of memory. However, the function now frees its allocated memory sooner, leading to a memory usage that is overall much lower than in the previous version. This is one of the drawbacks of using the Xdebug profiler. In a follow-up post, I'll show how to use php-memory-profiler, which generates a different type of memory profile.
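The difference between the two versions can be measured directly. The sketch below is an illustration under my own assumptions: the function names, the generated temporary file and its size are all made up, and memory usage is sampled with memory_get_usage while iterating.

```php
<?php
// Eager version: buffers the complete file in an array.
function read_all_lines(string $file_path): array {
    $resource = fopen($file_path, 'r');
    $lines = [];
    while (($line = fgets($resource)) !== false) {
        $lines[] = trim($line);
    }
    fclose($resource);
    return $lines;
}

// Lazy version: yields one line at a time.
function read_lines_lazily(string $file_path): Generator {
    $resource = fopen($file_path, 'r');
    try {
        while (($line = fgets($resource)) !== false) {
            yield trim($line);
        }
    } finally {
        fclose($resource);
    }
}

// Returns the highest memory usage observed while looping over $lines.
function highest_usage_during(iterable $lines): int {
    $highest = 0;
    foreach ($lines as $line) {
        $highest = max($highest, memory_get_usage());
    }
    return $highest;
}

// Generate a throwaway test file with 100,000 lines and nemo at the end.
$file_path = tempnam(sys_get_temp_dir(), 'sea');
file_put_contents($file_path, str_repeat("just some fish\n", 100000) . "nemo\n");

$baseline = memory_get_usage();
$lazy_extra = highest_usage_during(read_lines_lazily($file_path)) - $baseline;
$eager_extra = highest_usage_during(read_all_lines($file_path)) - $baseline;
unlink($file_path);

printf("generator: %d extra bytes\narray:     %d extra bytes\n", $lazy_extra, $eager_extra);
```

The exact numbers depend on the PHP version and platform, but the array version should need several megabytes more than the generator version.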

Conclusion

In this post, we saw two methods of locating places in your PHP program where a lot of memory is allocated: first, by using PHP's memory_get_usage function, thereafter by generating a memory profile using Xdebug and analyzing it with qCachegrind. One thing to keep in mind is that there is no tool or technique available which will definitively point to the problem. As such, your debugging process will always at least partly consist of manual analysis. In the next post, I'll show how php-memory-profiler can help you find memory leaks in your program.

Originally posted at the Moxio company blog.

The What, Why and How of Type Inference

Tom den Braber — Sun, 17 Dec 2017 23:00:00 +0000

In my previous post concerning the exceptional flow of PHP programs, I presented a global overview of my project. In this post, I am going to cover one of the building blocks of the algorithm in more depth: type inference. We will discuss what type inference is, why it is needed and how it can be done.

Wait, what?

The words ‘type inference’ might sound scary, but the principle behind it is quite simple. Consider the following code:

<?php 
class Elf { 
    public function __construct(string $name) {
        /* constructor code */
    }

    public function getName() : string { 
        /*... an implementation ...*/ 
    } 

    public function say(string $what_to_say) : void { 
        /*... also an implementation ... */ 
    } 
}

$an_elf = new Elf("Legolas");

if ($an_elf->getName() === "Legolas") { 
    $an_elf->say("They're taking the hobbits to Isengard!"); 
}

Looking at this code, we can say a few things. For example, we can say that the expression "Legolas" is of type string, because it is text enclosed by quotes. We also know that any occurrence of the $an_elf variable is of type Elf, because $an_elf is only assigned a value once, namely the expression new Elf("Legolas") which is of type Elf. I don't know whether you noticed or not, but... we already inferred a few types. Type inference is nothing else than deducing the types of expressions in a program at compile time, i.e. without actually running the program.

But why?

So why would you want to do this? Well, it is likely that you are already using it, maybe even without knowing it. If you are using an IDE, it probably informs you about the types of expressions and variables. It probably supports ‘click to go to definition’: you click on an expression with type Elf and your cursor automagically moves to the definition of the Elf class. You already guessed it: these kinds of interactions are possible because of type inference.

Type inference can also help to spot errors before even running your code. Try to spot the error in the snippet below:

<?php
function tell_me(string $what_to_tell) { /* some implementation */}
tell_me(["where is Gandalf, for I much desire to speak to him"]);

When your IDE has type inference, it could warn you that you are calling the function tell_me with an argument of type array, whereas the definition of the function clearly states that it requires a string. In this case, it is quite easy to see. But what if these statements were located in different files? When types of expressions are known at compile time, IDEs and static analysis tools can warn programmers that they are making a mistake. Type inference can be a great help while developing. Even better: you are probably already using tools that are powered by type inference algorithms, maybe without even knowing it.

Awesome! How can we do this?

When we look back at the process we used to infer types in the first code snippet, we can see that we used two main methods.

We looked at ‘stand-alone’ expressions, like "Legolas" or new Elf(...) and inferred the types of those expressions.
We propagated types of expressions to other expressions. E.g. each occurrence of $an_elf has type Elf because it is assigned the expression new Elf(...) of type Elf at the start of the program.

The interesting thing is that these methods go hand in hand. For example, if we want to know the type of $an_elf->getName(), we have to know the type of $an_elf: this type needs to be propagated from a definition earlier in the program. The complete expression can be resolved when we know that $an_elf is of type Elf, as we can then look at the definition of Elf::getName(), which says that this function returns a string. The type of $an_elf->getName() is thus string.
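The interplay of both methods can be sketched in a few lines of PHP. This is a toy illustration, not how a real tool works: the expression trees are hand-written arrays instead of a parsed AST, and the names infer_type and METHOD_RETURN_TYPES are invented for this example.

```php
<?php
// Declared return types of known methods, as read from the class definitions.
const METHOD_RETURN_TYPES = [
    'Elf::getName' => 'string',
    'Elf::say' => 'void',
];

// Infers the type of a tiny, hand-written expression tree.
// $env maps variable names to the types that were propagated so far.
function infer_type(array $expr, array $env): string {
    switch ($expr['kind']) {
        case 'string_literal': // stand-alone expression
            return 'string';
        case 'new':            // stand-alone expression
            return $expr['class'];
        case 'variable':       // propagated from an earlier assignment
            return $env[$expr['name']] ?? 'unknown';
        case 'method_call':    // needs the type of the object first
            $object_type = infer_type($expr['object'], $env);
            return METHOD_RETURN_TYPES[$object_type . '::' . $expr['method']] ?? 'unknown';
        default:
            return 'unknown';
    }
}

// $an_elf = new Elf("Legolas");
$env = ['an_elf' => infer_type(['kind' => 'new', 'class' => 'Elf'], [])];

// $an_elf->getName()
$call = [
    'kind' => 'method_call',
    'object' => ['kind' => 'variable', 'name' => 'an_elf'],
    'method' => 'getName',
];
$type = infer_type($call, $env);

printf("\$an_elf is of type %s, \$an_elf->getName() is of type %s\n", $env['an_elf'], $type);
```

The $env array plays the role of the propagation step: without the entry for an_elf, the method call could not be resolved.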

Implementing a type inference algorithm which only infers types on ‘stand-alone’ expressions is not too hard. Combining a simple traversal of the Abstract Syntax Tree (AST) with enough knowledge of the language of a program will do the job. It gets interesting when you also want to propagate the types of expressions to other expressions. In order to do that, we need to know which expressions are dependent on other expressions. We can discern two types of dependences here: control dependence and data dependence. Control dependence says something about the order of execution, whereas data dependences describe which variables influence other variables.

Control flow data is often represented as a Control Flow Graph (CFG). The nodes in the graph are sets of instructions which are executed linearly, i.e. one after another. The edges represent the ‘jumps’ that might occur in the program. Consider the following snippet.

<?php 
$a = random_int(0, 5);
if ($a > 3) { 
    $a = "a string"; 
} else { 
    $a = false; 
} 
some_function($a);

The CFG of this snippet looks as follows:

From the CFG, we can clearly see that either the if-branch or the else-branch will be executed, but certainly not both in the same run. In the snippet, the variable $a is used. But what is the type of $a in this snippet? The correct answer would be that $a can be a string, an integer or a boolean. However, this is not really useful: although it is true, it does not say a lot. As $a changes multiple times in the program, it is hard to determine what is meant by the variable $a.
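One possible way to write such a CFG down is as a set of basic blocks plus a list of edges. The sketch below is my own encoding of the snippet above (the block names entry, if, else and exit are made up); it then enumerates every path from the first block to the last one.

```php
<?php
// Basic blocks: instructions that are executed linearly.
$blocks = [
    'entry' => 'assign a random integer to $a, then test $a > 3',
    'if'    => 'assign a string to $a',
    'else'  => 'assign false to $a',
    'exit'  => 'call some_function($a)',
];

// Edges: the jumps that might occur between blocks.
$edges = [
    'entry' => ['if', 'else'],
    'if'    => ['exit'],
    'else'  => ['exit'],
    'exit'  => [],
];

// Lists all paths from $from to $to. Only suitable for graphs without loops.
function paths(array $edges, string $from, string $to): array {
    if ($from === $to) {
        return [[$to]];
    }
    $paths = [];
    foreach ($edges[$from] as $next) {
        foreach (paths($edges, $next, $to) as $rest) {
            $paths[] = array_merge([$from], $rest);
        }
    }
    return $paths;
}

$all_paths = paths($edges, 'entry', 'exit');
foreach ($all_paths as $path) {
    echo implode(' -> ', $path), "\n";
}
```

There are exactly two paths, and none of them contains both the if-block and the else-block.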

But what if we slightly transform the program? We could say, for example, that each variable may be assigned only once. In this way, we can be sure what the type of the variable is at all times, as it is only initialised but never changed. This form is called the Static Single Assignment (SSA) form. The snippet above would look as follows in SSA:

<?php
$a_1 = random_int(0, 5);
if ($a_1 > 3) { 
    $a_2 = "a string"; 
} else { 
    $a_3 = false; 
} 
some_function(φ($a_2, $a_3));

This snippet has exactly the same behaviour as the earlier snippet, but each variable is assigned only once. This makes the process of type inference somewhat easier, as a variable can now have only one type during its entire life. By keeping a link from the variables in SSA form to the variables in the original program, we can transfer the inferred types back to the original program in non-SSA form. However, translating the program to SSA form only gets you so far. You might have noticed the strange φ function in the last line of the snippet. The φ actually tells us that we cannot really know which variable will be used here. The value of φ($a_2, $a_3) will be either $a_2 or $a_3: we know for sure that one of those variables will be used as parameter for some_function, but we cannot know which one at compile time. The type of the φ-function is the union of all the types of its parameters. In this example, the type is either a string (originating from $a_2) or boolean (originating from $a_3).
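Computing the type of a φ-function is a plain set union. A minimal sketch, in which types are represented as lists of type names and the function name phi_type is invented for this example:

```php
<?php
// Each argument is the list of possible types of one φ parameter.
// The result is the sorted union of all those types.
function phi_type(array ...$parameter_types): array {
    $union = array_unique(array_merge(...$parameter_types));
    sort($union);
    return $union;
}

// φ($a_2, $a_3): $a_2 is a string, $a_3 is a boolean.
$types = phi_type(['string'], ['bool']);
echo implode('|', $types), "\n";
```

Types that occur in more than one parameter end up in the union only once.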

Wrapping up

We covered a lot of ground in this post. To summarise, type inference…

is the process of deducing types of expressions at compile time;
enables all kinds of tools which can be used to help you, the software developer;
can be done by combining deep knowledge of the programming language and flow information.

Stay tuned for the next post concerning call graph construction!

Tools

There are some tools around for creating CFGs and performing type inference. For example:

PHP-CFG — a tool for generating CFGs of PHP programs
PHP-Types — a tool for performing type inference on CFGs.

Originally published at www.moxio.com on December 17, 2017.

Understanding Exceptional Flow

Tom den Braber — Wed, 15 Mar 2017 07:00:00 +0000

Maybe you recognise the following situation. You are implementing a new feature, and you know that you can use a certain method, as it already covers some of the functionality you need. You briefly look at it, and you don’t see any exception handling constructs. The method documentation does not contain information about what exceptions can be thrown, for example via a @throws declaration. You conclude that there is no reason to think any exceptions are thrown or propagated by the method.

But can you be sure? To come to the conclusion that no exceptions can be thrown or propagated, you would need to trace every method call that could be made by the method you are looking at, and repeat this process for all methods you encounter. You would also need to look up all definitions of internal PHP functions or methods that are used, to see if they throw any exceptions. This requires an enormous amount of work, and it is tedious to do. It would be very helpful to have this process automated!

That’s exactly why I am currently working on a tool which models the Exceptional Flow as part of my MSc Thesis Project. In this introductory post, the building blocks of this tool will be discussed briefly. In the posts to follow in this series, each of these building blocks will be covered in more depth.

The series will conclude with an overview of the results of using the tool on a number of open source projects.

Building blocks

The exception flow model consists of a few building blocks. This is visualised in the picture below.

The system takes a complete PHP program as an input. The code is parsed, which results in an Abstract Syntax Tree (AST). This AST serves as the basis for the complete analysis. First, the types will be inferred and mapped back to the AST. Thereafter, the call graph of the program will be created. Using the AST with types and the call graph together, the exception flow can be deduced.

Type inference

Because PHP is dynamically typed, the AST does not contain information about the types of expressions. Because these types are needed in order to construct the call graph and to detect which exceptions are thrown, the types of the expressions in the AST need to be inferred. However, before we can do type inference, we need to have a Control Flow Graph (CFG), as the paths that can be taken through the code during program execution decide what types a variable can have. Note that a separate CFG is created for each function or method and that these CFGs are not connected by resolving the method and function calls.

When the CFGs are created, the types can be inferred. These types are mapped back to the AST. At this point, we have an AST which includes type information of expressions.

Call graph construction

Because we want to know how exceptions can travel between functions and methods, we want to know for each method which method calls it can make. Because we have done type inference, we can now decide for most expressions what type they have. If we encounter a statement like $a->m(), and we know the type of $a, we can limit the number of possible methods this expression resolves to. Polymorphism plays an important role here.
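The effect of polymorphism on call resolution can be illustrated with reflection. Note that this sketch inspects classes at run time, whereas the tool described here works statically on the AST; the classes and the function possible_targets are made up for the example.

```php
<?php
interface Creature {
    public function speak(): string;
}

class Wizard implements Creature {
    public function speak(): string { return 'You shall not pass!'; }
}

class Hobbit implements Creature {
    public function speak(): string { return 'Second breakfast?'; }
}

class Took extends Hobbit {} // inherits Hobbit::speak

// Lists the method implementations that $object->$method() could resolve to,
// given that $object is known to be of type $static_type.
function possible_targets(string $static_type, string $method): array {
    $targets = [];
    foreach (get_declared_classes() as $class) {
        if (!is_a($class, $static_type, true)) {
            continue; // not a subtype of the inferred type
        }
        $reflection = new ReflectionClass($class);
        if ($reflection->isAbstract() || !$reflection->hasMethod($method)) {
            continue;
        }
        $declaring_class = $reflection->getMethod($method)->getDeclaringClass()->getName();
        $targets[] = $declaring_class . '::' . $method;
    }
    $targets = array_values(array_unique($targets));
    sort($targets);
    return $targets;
}

$for_creature = possible_targets('Creature', 'speak');
$for_hobbit = possible_targets('Hobbit', 'speak');
echo implode(', ', $for_creature), "\n";
echo implode(', ', $for_hobbit), "\n";
```

The more precise the inferred type, the fewer edges end up in the call graph: knowing that the object is a Hobbit rules out Wizard::speak.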

Inferring the exceptional flow

Now that we have the call graph and the AST with types, we can start inferring the exception flow. The analysis uses the notion of ‘scopes’ and ‘guarded scopes’ [1]. A scope in this context is a method or function, whereas a guarded scope is a try/catch/finally block. A guarded scope can be nested in another (guarded) scope.

An exception that is encountered within a (guarded) scope can originate from four different sources. To start with, the exception can be explicitly thrown using the throw statement. Secondly, the exception can be generated by a statement. This happens when the code causes an exception to occur, without explicitly throwing it. For example, if you call a function which specifies that it returns an int, but actually returns a string, calling this function would result in a TypeError. The third origin of an exception could be a call to a method or function that encounters an exception. The encountered exception is then propagated into the scope of the caller. Finally, an exception can be encountered in a scope, because it was not caught in a nested guarded scope.
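The four sources can each be shown with a few lines of PHP. The function names below are invented for this illustration; every function is a scope in which an exception is encountered in one of the four ways described above.

```php
<?php
// Source 1: explicitly thrown with the throw statement.
function explicit_throw(): void {
    throw new RuntimeException('thrown explicitly');
}

// Source 2: generated by a statement. The declared return type is int,
// but a non-numeric string is returned, which results in a TypeError.
function generated_by_statement(): int {
    $value = 'not a number';
    return $value;
}

// Source 3: propagated from a called function.
function propagated_from_callee(): void {
    explicit_throw();
}

// Source 4: not caught by a nested guarded scope.
function escapes_guarded_scope(): void {
    try {
        throw new RuntimeException('not caught here');
    } catch (LogicException $e) {
        // A RuntimeException is not a LogicException, so it leaves this guarded scope.
    }
}

$encountered = [];
$scopes = ['explicit_throw', 'generated_by_statement', 'propagated_from_callee', 'escapes_guarded_scope'];
foreach ($scopes as $function) {
    try {
        $function();
    } catch (Throwable $e) {
        $encountered[$function] = get_class($e);
    }
}
print_r($encountered);
```

None of these functions documents its exceptions, which is exactly the situation in which an automated analysis is helpful.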

Using these sources, we can model the exceptional flow in a system. The exact algorithm will be covered in a later post.

Wrapping up

In this post, all ingredients for building a tool that can model the exceptional flow were briefly discussed.

Stay tuned for the next post in this series, in which the subject of type inference will be covered in more depth.

References

[1] Robillard, M. P., & Murphy, G. C. (2003). Static analysis to support the evolution of exception structure in object-oriented systems. ACM Transactions on Software Engineering and Methodology (TOSEM), 12(2), 191–221.

Originally published at www.moxio.com.

Query optimization: from a few weeks to 24 hours

Tom den Braber — Tue, 17 Jan 2017 08:00:00 +0000

Everyone who writes SQL queries encounters them once in a while: those queries that just take too long. Recently, we ran into such an issue with one of our systems. In this blog post, I will first describe the system, then show how the problem could arise and lastly, how we solved it. Spoiler: the process now takes 24 hours… instead of a few weeks.

Detecting mutations

One of our applications provides our clients with a different view on their own data. Every few weeks, we receive a dataset from the client, which we import into our application. However, the representation that we use in our application differs from the representation of the received data dump. To be able to process the differences between the received dump and the data that is in our system, we use a mutation detector. The mutation detector uses SQL queries to find batches of differences between the data in our application and the received dump. Each batch is processed before the next batch of changes is fetched. An example of a mutation detection query can be found below. It detects all entries that occur in table A but not in table B, under the assumption that an entry which occurs in both A and B satisfies A.id = B.id.

SELECT A.* 
FROM A 
LEFT JOIN B ON 
    A.id = B.id 
WHERE 
    B.id IS NULL 
LIMIT 400

Where to find them

The mutation detector executes the same query over and over again, until no more changes are found. However, MySQL does not keep information about the last query it executed, i.e. MySQL does not know where it has already looked for changes. Every execution starts from the beginning, so MySQL re-examines records that no longer differ from the data in our system because they were already handled in a previous batch. The result is that the query becomes slower over time: the first changes are detected within milliseconds, but as more changes have been processed, the search for new mutations takes longer and longer.

Lending MySQL a hand

The solution to this problem is straightforward: we need to give MySQL more information about where it has already looked for changes, so that it does not look at the same entry multiple times. To find out where to start, we run EXPLAIN to see how the JOINs are resolved and which table MySQL reads first (we will call this table base from now on). This is the topmost entry in the output of the EXPLAIN statement. Knowing where MySQL starts searching, we can introduce a "cheap ORDER BY". Let pk be the primary key of base; we can add ORDER BY base.pk without introducing extra cost. Now that we have told MySQL in what order it should detect mutations, we can also keep track of where it detected the last one. Besides the mutation data itself, we also select base.pk. In the mutation detector, we save the largest value of base.pk that was encountered, and add the following condition to the query: base.pk > [largest base.pk encountered]. Because of the ORDER BY and the condition on base.pk, we are sure that MySQL does not cover the same entry multiple times.

We can now incorporate these techniques into the mutation detection query given above. Because we do a LEFT JOIN from A to B, we know that A will be read first and thus corresponds to the base table we talked about earlier. A.id is the primary key of A and is already included in A.*, so in this case there is no need to select A.id explicitly. The resulting query is as follows:

SELECT A.* 
FROM A 
LEFT JOIN B ON 
    A.id = B.id 
WHERE 
    B.id IS NULL AND 
    A.id > [largest encountered A.id in previous queries] 
ORDER BY A.id 
LIMIT 400
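The loop around this query could look roughly like the sketch below. It is not our actual mutation detector: the function detectMutations(), the callback and the demo data are hypothetical, an in-memory SQLite database (via the pdo_sqlite extension) stands in for MySQL so that the example is self-contained, and primary keys are assumed to be positive integers.

```php
<?php
// Hypothetical sketch of the batching loop around the optimized query.

function detectMutations(PDO $db, int $batchSize, callable $processBatch): void {
    $lastSeenId = 0; // assumption: ids are positive integers
    $statement = $db->prepare(
        'SELECT A.*
         FROM A
         LEFT JOIN B ON A.id = B.id
         WHERE B.id IS NULL AND A.id > :last_seen_id
         ORDER BY A.id
         LIMIT ' . $batchSize
    );

    while (true) {
        $statement->bindValue(':last_seen_id', $lastSeenId, PDO::PARAM_INT);
        $statement->execute();
        $batch = $statement->fetchAll(PDO::FETCH_ASSOC);
        if ($batch === []) {
            return; // no more mutations
        }
        $processBatch($batch);
        // Rows are ordered by A.id, so the last row holds the largest id.
        $lastSeenId = (int) end($batch)['id'];
    }
}

// Demo data: A holds ids 1..10, B only the even ids.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT)');
$db->exec('CREATE TABLE B (id INTEGER PRIMARY KEY, name TEXT)');
for ($id = 1; $id <= 10; $id++) {
    $db->exec("INSERT INTO A (id, name) VALUES ($id, 'row $id')");
    if ($id % 2 === 0) {
        $db->exec("INSERT INTO B (id, name) VALUES ($id, 'row $id')");
    }
}

$seenIds = [];
$batchCount = 0;
detectMutations($db, 2, function (array $batch) use (&$seenIds, &$batchCount) {
    $batchCount++;
    foreach ($batch as $row) {
        $seenIds[] = (int) $row['id'];
    }
});
```

With a batch size of 2, the demo collects the odd ids in three batches. Each query starts right after the largest id of the previous batch, so no entry is examined twice, even though the callback in this sketch does not modify table B.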

Concluding remarks

This mechanism (including some other small optimizations) reduced the time it took to import a certain dataset from weeks or even months* to 24 hours. The mechanism described above applies to a context where you want to detect and process batches, instead of the complete dataset at once. The key lesson is that adding extra information to a query can gain you a huge speedup. Another thing that each reader should take to heart: use EXPLAIN to analyze your queries. It will deepen your understanding of how a database handles your query and you will learn how to deal with the database's query strategies.

*don’t worry, we did not actually wait for weeks: we optimized the query before the process finished.

Originally published at www.moxio.com.