(Or: How someone who doesn’t really know what they’re doing manages to improve performance of lengthy queries.)
I’m working on a system that has more than 200 tables and about 2 GB of data. The system is built using Django and relies heavily upon the ORM. We don’t write SQL – all our queries are built through Django querysets. Some of the queries are fairly complex, involving 10 or more tables.
About a month ago, my co-worker discovered that at least 5 of the more complex queries were taking longer to run on our test server than on his development laptop. That seemed odd to me, so I tried to recreate the situation – and I did. Those queries ran faster (generally about half the time) on each of our laptops than on the server available to us.
Table 1 – baseline results

| Query   | Server | Laptop |
| ------- | ------:| ------:|
| Query 1 | 31s    | 17s    |
| Query 2 | 44s    | 24s    |
| Query 3 | 1m 47s | 46s    |
| Query 4 | 28s    | 17s    |
And so the hunt began… Knowing that the amount of data was only going to continue to increase, we had to get the run times for the longer queries down into the neighborhood of our development environments.
First step was to get copies of the actual queries being run, to completely isolate the time needed to run the queries from the overhead introduced by Python and the Django ORM. These are some fairly big queries – the generated SQL for the two largest queries is about 16,000 characters each. (At a full 100 characters per line, that would be a 160-line query – and that's with every line full. In reality, formatting these queries produces 200+ lines.)
Then it was time to check the basics. We’re working with identical copies of source data. We’re building the database using the same scripts. After the database is built, we’re manually running a vacuum, reindex, analyze. We’ve got enough memory on these systems to set the buffers such that they’re larger than the data being accessed.
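For reference, the relevant memory settings live in postgresql.conf. The values below are illustrative of the sizing described above, not our exact configuration:

```
# postgresql.conf – illustrative sizing for a ~2 GB dataset (assumed values)
shared_buffers = 4GB          # larger than the data being accessed
effective_cache_size = 12GB   # planner hint: OS file cache + shared_buffers
work_mem = 64MB               # per-sort/hash memory for the large joins
maintenance_work_mem = 1GB    # speeds up the manual VACUUM/REINDEX
```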
Time to look at the differences between these systems.
Admittedly, my knowledge and understanding of PostgreSQL internals is minimal at best. I'm a lot more comfortable with my knowledge of operating systems and file systems, so I wanted to start with what I know.
Now, I'll admit to being a bit of a Linux bigot. I refused to believe a Windows laptop running PostgreSQL could be more than twice as fast as a Linux server on significantly better hardware, so at this point I discarded the OS difference as a significant factor.
The server runs PostgreSQL in a Docker container with the data stored in a Docker volume. I changed to mapping the data directory to a directory on the host. No change. I changed to a PostgreSQL instance running directly on the host. Still no difference.
This server is actually a virtual machine running under VMware ESXi. The server image is Ubuntu 20.04 with the disk mapped as an LVM volume. Could the multiple layers of disk mapping be causing the problem? What about the choice of disk interface (ATA vs. SCSI)? So we allocated two new "physical" drives to that image, one with each of the available interfaces. I didn't bother partitioning either drive; I created file systems directly on the devices. Created tablespaces on those file systems, created databases… and still no difference.
The next identified difference was the version of PostgreSQL being used. My laptop is running PostgreSQL 12, my co-worker's laptop is running PostgreSQL 10, and the test server is running PostgreSQL 13. Ahhh… is there a problem with 13? Rolled the test server back to version 12 – no difference. Installed and tested with version 10 – difference! A significant difference (~20%), but it doesn't come close to accounting for the entire gap – and backing down from 13 to 10 is not an acceptable answer. But at least I had some direction for continuing my search.
The next stage of my quest lasted for the better part of a week. I knew I had to do a lot of reading to try and understand what might be happening. I started out by looking at all the release notes and other available docs to see what might be causing this. One of the changes I saw made reference to the query planner and how it uses various settings to improve the execution plan.
So my next step was to run EXPLAIN ANALYZE on my queries across the different versions. Bingo! The slow queries on the server were running sequential scans on tables where the faster systems were using indexes. I also saw that JIT compilation was taking more than 30 seconds of the total run time.
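With 200+ line plans, eyeballing every node gets tedious; a quick way to compare plans across versions is to count the access methods in the EXPLAIN text. This is a minimal sketch – the plan fragment below is made up for illustration, not one of our actual plans:

```python
import re

def scan_summary(plan_text: str) -> dict:
    """Count Seq Scan vs Index Scan nodes in EXPLAIN (ANALYZE) text output."""
    return {
        "seq_scans": len(re.findall(r"Seq Scan on", plan_text)),
        "index_scans": len(re.findall(r"Index(?: Only)? Scan using", plan_text)),
    }

# A made-up fragment in the shape EXPLAIN ANALYZE prints (table/index names are invented):
plan = """
Nested Loop  (cost=0.43..812.34 rows=120 width=48) (actual time=0.051..3.210 rows=98 loops=1)
  ->  Seq Scan on orders  (cost=0.00..431.00 rows=120 width=16) (actual time=0.020..1.900 rows=98 loops=1)
  ->  Index Scan using customers_pkey on customers  (cost=0.43..3.17 rows=1 width=32) (actual time=0.010..0.011 rows=1 loops=98)
"""

print(scan_summary(plan))  # how many of each access method the plan used
```

Diffing these counts between the server and the laptops makes the index-vs-sequential-scan divergence jump out immediately.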
More research, more ideas. There’s a PostgreSQL mailing list just for performance issues and there’s a wealth of information in the archives. Between the mailing list and the docs, I identified some settings that may have been relevant.
I started out by making some trivial changes (turning JIT off was one of them) and immediately saw some improvement. It still wasn't an answer, but I was starting to feel confident that I was heading in the right direction.
For the record, here is the list of settings I decided to focus on. (Note: this is after making all the appropriate changes to the various buffer and cache settings.)
- seq_page_cost
- random_page_cost
- cpu_tuple_cost
- cpu_index_tuple_cost
- cpu_operator_cost
- parallel_tuple_cost
- parallel_setup_cost
At this point I was getting kinda lost. My first experiments showed some improvement, but I wasn't seeing any direct relationship between the changes I was making and the degree (or lack) of improvement. I know that's due to my limited understanding of exactly what these settings represent, but it made it difficult to know exactly what changes to look for.
So without knowing what else to try, I decided to rely upon some brute-force evaluation. I put together a script that would change settings, restart PostgreSQL, run the queries, save the results, and repeat the process. From there it was a matter of looking through all the results.
One thing I realized rather quickly is that the number of combinations of these settings multiplies out of control. Testing just 3 values for each of those 7 variables independently produces 3^7 = 2,187 combinations. If each test takes 1 minute, the complete set of tests would last more than 36 hours. With 5 values for each of the 7 variables, it would take 54 days to complete.
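The combinatorics, and the kind of command generation the sweep script did, can be sketched in a few lines of Python. The setting names are the real planner GUCs; the candidate grids and the generated ALTER SYSTEM statements are illustrative, not our actual test matrix:

```python
import itertools

# Candidate values per planner setting (illustrative grids, not our final ones)
grid = {
    "seq_page_cost":        [0.01, 0.1, 1.0],
    "random_page_cost":     [0.01, 0.1, 4.0],
    "cpu_tuple_cost":       [0.01, 0.05, 0.1],
    "cpu_index_tuple_cost": [0.005, 0.05, 0.5],
    "cpu_operator_cost":    [0.0025, 0.025, 0.25],
    "parallel_tuple_cost":  [0.1, 1.0, 10.0],
    "parallel_setup_cost":  [100, 1000, 10000],
}

combos = list(itertools.product(*grid.values()))
print(len(combos))               # 3**7 = 2187 combinations
print(len(combos) / 60)          # ~36.5 hours at one minute per test
print(5 ** len(grid) / 60 / 24)  # ~54 days with 5 values per setting

def alter_statements(values):
    """ALTER SYSTEM statements for one combination (applied before each restart)."""
    return [f"ALTER SYSTEM SET {name} = {val};"
            for name, val in zip(grid, values)]

print(alter_statements(combos[0])[0])
```

The actual script fed these statements to psql, restarted the server, and timed the queries – that part is elided here since it depends on the local setup.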
I needed a different approach, so I ran exhaustive tests over small groups of settings while holding the rest constant. For example, I ran 25 tests covering combinations of seq_page_cost and random_page_cost while keeping all other values the same. I then took the best combination from that, and ran 125 tests covering 5 values each for cpu_tuple_cost, cpu_index_tuple_cost, and cpu_operator_cost. Take the best result of that, then go back and rerun the 25 earlier tests. Repeat this process until the best values stabilize.
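This staged search amounts to coordinate descent over groups of settings. A toy version, with a synthetic cost function standing in for "run the queries and measure the time" (the target values and grids here are made up for the demonstration):

```python
import itertools

# Synthetic stand-in for the measured query time, minimized at an invented optimum.
def run_time(settings):
    target = {"seq_page_cost": 0.01, "random_page_cost": 0.01, "cpu_tuple_cost": 0.5}
    return sum((settings[k] - target[k]) ** 2 for k in target)

grids = {
    "seq_page_cost":    [0.01, 0.1, 1.0, 2.0, 4.0],
    "random_page_cost": [0.01, 0.1, 1.0, 2.0, 4.0],
    "cpu_tuple_cost":   [0.01, 0.1, 0.5, 1.0, 2.0],
}

def sweep(current, group):
    """Exhaustively test one group of settings, holding the others fixed."""
    best = dict(current)
    for values in itertools.product(*(grids[k] for k in group)):
        trial = dict(current, **dict(zip(group, values)))
        if run_time(trial) < run_time(best):
            best = trial
    return best

settings = {k: v[len(v) // 2] for k, v in grids.items()}  # start mid-grid
previous = None
while settings != previous:  # repeat until the best values stabilize
    previous = dict(settings)
    settings = sweep(settings, ["seq_page_cost", "random_page_cost"])  # 25 tests
    settings = sweep(settings, ["cpu_tuple_cost"])                     # 5 tests

print(settings)  # converges to the synthetic optimum above
```

The real version is the same loop with `run_time` replaced by "apply the settings, restart PostgreSQL, time the queries" – which is why keeping each stage to 25–125 tests matters so much.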
Many groups of these changes produced very minor overall effects – I'd get large groups of test results all within a tenth of a second on a 30-second query. However, the results clustered around certain timings, which I was happy to see: the big gaps occur when PostgreSQL switches a table access from an index scan to a sequential scan.
And so we get to the punchline. After all this, the values we’ve currently settled on are:
- seq_page_cost = 0.01
- random_page_cost = 0.01
- cpu_tuple_cost = 0.01
- cpu_index_tuple_cost = 0.5
- cpu_operator_cost = 0.25
- parallel_tuple_cost = 1
- parallel_setup_cost = 100
Note: My understanding was that these values are all relative to each other and could be rescaled together. However, that doesn't appear to be the case. I tried multiplying all the values by 100 to make them integers, but the test run times were far worse. I'm guessing there's another setting I've missed that would also need to be adjusted.
Table 2 – final results

| Query   | Server before | Server after | Laptop |
| ------- | -------------:| ------------:| ------:|
| Query 1 | 31s           | 4s           | 17s    |
| Query 2 | 44s           | 5s           | 24s    |
| Query 3 | 1m 47s        | 33s          | 46s    |
| Query 4 | 28s           | 4s           | 17s    |
So not only is it better on the server now, it’s significantly better!
If there are any lessons I learned from this, they would be:
- I am extremely grateful to the PostgreSQL team for their documentation and their mailing lists. There's no way I could have gotten this far without those resources.
- There's a lot of information available to help resolve performance issues - it may just require some digging and reading.
- There are hardware configurations where the PostgreSQL defaults are not optimal for complex data relationships.