Understanding how things work is always important because it helps identify and resolve potential issues. In my previous post, I ran a small query that scanned a growing table, observing when the throughput slowed down to get an idea of where the reads come from. The results showed that the PostgreSQL shared buffers are allocated 16GB, with an additional 32GB of filesystem cache available. However, this memory is not really available in the Linux kernel sense: it is reserved for this specific purpose. On this 64GB RAM instance, that leaves only about 20% of the memory available for any other use.
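As a quick check, the shared buffer allocation is visible from any session (the prompt and output format below are illustrative; the 16GB value is what I observed on this instance, and the filesystem cache is sized by the platform rather than by a PostgreSQL parameter):
postgres=> show shared_buffers;
 shared_buffers
----------------
 16GB
(1 row)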
At first glance, this appears favorable. I appreciate the consistent performance and prefer a dedicated cache over one that can fluctuate based on other activities.
Once you understand how the system works, several corner cases come to mind. For instance, what about temporary tables? They are expected to be faster because they don't need to be shared and don't need to survive failures. Temporary tables in SQL need to be AC (Atomicity and Consistency) but not fully ACID. So, where do you think they should be stored in the system I described?
You would probably want to keep that temporary data local, since it doesn't need to go to the distributed storage that is protected from failures and replicated across failure zones.
Temporary tables in PostgreSQL indeed don't go through the shared buffers. Instead, they are cached in the session's private memory, sized by work_mem, and rely on the filesystem cache for good performance beyond that. However, if they are stored in a different storage pool, they cannot use the same buffer cache.
So, how much RAM is actually available? Less than 20% on my 64GB system, where 80% is already used. That leaves me less than 12GB for all local files, including log files and temporary files. And when new connections arrive and allocate memory, this available memory shrinks even further.
I ran the same script as before, but with a temporary table, replacing create table with create temporary table.
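The full script is in the previous post; here is a minimal sketch of the idea, with illustrative table, column, and size values (the original grows the table in steps and re-runs the measurement every two minutes):
-- create the temporary table with a filler column so that it grows quickly
create temporary table large as
  select n as id, lpad('x', 500, 'x') as filler
  from generate_series(1, 100000) n;

-- grow the table by doubling it (repeated between measurements in the original script)
insert into large select * from large;

-- show the current size, then scan it and show where the buffers come from
select 'size', pg_table_size('large'), pg_size_pretty(pg_table_size('large')), now();
explain (analyze, buffers, costs off) select * from large;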
The first iteration on the 100MB table is fast, 6231 pages per millisecond:
Mon 27 Nov 2023 02:27:50 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 104914944 | 100 MB | 2023-11-27 14:27:50.465103+00
(1 row)
Mon 27 Nov 2023 02:27:50 PM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on large (actual time=0.003..2.054 rows=12800 loops=1)
Buffers: local hit=12800
Planning:
Buffers: shared hit=7
Planning Time: 0.040 ms
Execution Time: 2.556 ms
(6 rows)
You see Buffers: local hit because PostgreSQL caches the temporary table in the private memory of the PostgreSQL backend. It is faster than the shared buffers because it doesn't have to manage concurrent access. The size of this local cache is defined by work_mem:
postgres=> show work_mem;
work_mem
----------
256MB
(1 row)
As a result, even when the table size reaches 200MB, it is still read entirely from this local memory:
Mon 27 Nov 2023 02:29:50 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 209797120 | 200 MB | 2023-11-27 14:29:50.463191+00
(1 row)
Mon 27 Nov 2023 02:29:50 PM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on large (actual time=0.005..4.437 rows=25600 loops=1)
Buffers: local hit=25600
Planning Time: 0.018 ms
Execution Time: 5.455 ms
(4 rows)
The calculation shows that the throughput is equivalent: 25600 / 4.437 = 5770 pages per millisecond.
However, when the size of the table is 300MB, greater than the value of work_mem, the throughput slows down by a factor of 10:
Mon 27 Nov 2023 02:31:50 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 314679296 | 300 MB | 2023-11-27 14:31:50.461167+00
(1 row)
QUERY PLAN
------------------------------------------------------------------
Seq Scan on large (actual time=0.005..71.099 rows=38400 loops=1)
Buffers: local read=38400 written=1
Planning Time: 0.027 ms
Execution Time: 73.193 ms
(4 rows)
Here, I've read 38400 / 71.099 = 540 pages per millisecond, the same rate as in the previous post when reading a permanent table from the filesystem cache. Let's see the largest size that can still be read at this rate:
Mon 27 Nov 2023 05:43:50 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 10383482880 | 9902 MB | 2023-11-27 17:43:50.258762+00
(1 row)
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on large (actual time=0.007..2200.163 rows=1267200 loops=1)
Buffers: local read=1267200
Planning Time: 0.027 ms
Execution Time: 2266.705 ms
(4 rows)
With a table of 9.9GB, I still read at the same throughput: 1267200 / 2200.163 = 576 pages per millisecond.
However, with an additional 100MB, the response time is 10x slower:
Mon 27 Nov 2023 05:45:50 PM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 10488365056 | 10002 MB | 2023-11-27 17:45:50.256662+00
(1 row)
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on large (actual time=0.745..23856.933 rows=1280000 loops=1)
Buffers: local read=1280000
Planning Time: 0.035 ms
Execution Time: 23943.280 ms
(4 rows)
This scan has read 1280000 / 23856.933 = 53 pages per millisecond because those reads all came from the local disk.
The metrics don't show anything about those reads.
While I used a simple example of scanning a temporary table, the same issue can arise with hashing or sorting, especially when using JOIN, GROUP BY, or ORDER BY. When 80% of the RAM is allocated to the PostgreSQL shared buffers and the distributed filesystem cache, there may not be enough memory left for the sessions. As a result, when a work area spills to disk, it cannot take advantage of the filesystem cache. This may not be a problem if you can increase work_mem to keep the data in memory instead of writing it to a file. However, due to the lack of available memory, you may not be able to allocate a large work_mem.
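To illustrate the same effect with a sort, here is a hedged sketch (not run on the instance above; the row count and settings are illustrative): when the sort exceeds work_mem, EXPLAIN (ANALYZE, BUFFERS) reports an external merge with temporary file I/O, and those temp reads and writes go to local disk rather than to the shared buffers or the allocated filesystem cache.
-- illustrative: force a sort larger than work_mem so it spills to temporary files
set work_mem = '4MB';
explain (analyze, buffers, costs off)
select * from generate_series(1, 1000000) n order by n desc;
-- the plan then reports something like:
--   Sort Method: external merge  Disk: ...kB
--   Buffers: temp read=... written=...
-- with a larger work_mem, the same sort stays in memory (Sort Method: quicksort)
reset work_mem;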
If your workload involves more analytics than OLTP, it may be beneficial to decrease shared_buffers and allocate a larger work_mem. However, this may not be possible here because the memory usage will still be set at 80%: reducing shared_buffers only provides more space to the filesystem cache for permanent files, and that space is not available for work_mem or for the temporary file cache.
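On a vanilla PostgreSQL setup, the usual way around this is to raise work_mem only for the analytic session, transaction, or role rather than globally; here is a sketch (the role name is hypothetical, and whether such values are reachable depends on the memory actually left free on the instance):
-- raise work_mem for the current session only
set work_mem = '512MB';

-- or for a single transaction
begin;
set local work_mem = '1GB';
-- ... run the analytic query here ...
commit;

-- or for every connection of a dedicated reporting role (hypothetical name)
alter role reporting set work_mem = '512MB';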
In addition, you might consider creating a read replica for this analytics workload. Unfortunately, that doesn't help here, since the read replicas have the same configuration as the primary database.