TL;DR:
Understanding the underlying mechanics is crucial for optimization. PostgreSQL stores data in 8 kB pages, but DELETE and UPDATE operations don't actually remove data—they create dead tuples that waste space and slow queries. While VACUUM helps clean up dead tuples, it keeps pages allocated. VACUUM FULL can reclaim space but locks tables during execution.
Simply adding more storage leads to unnecessary costs and performance issues, regardless of your hosting solution.
Part 2 will cover practical strategies for database optimization. You'll learn when (and when not) to use VACUUM FULL, and other optimization tricks I've learned the hard way.
Your phone buzzes in the middle of the night. You pick it up. A monitor went off at work—your PostgreSQL database is slowly but steadily reaching its maximum storage space. You are the engineer in charge. What should you do?
Okay, if it comes down to that situation, you should remedy it ASAP by adding more storage. But you’re going to need a better long-term strategy to optimize your PostgreSQL storage use, or you’ll keep paying more and more money.
Does your PostgreSQL database really need to be that large? Is there something you can do to optimize your storage use?
This article breaks down the essential PostgreSQL storage concepts and optimization strategies every developer needs to understand to build and maintain efficient databases.
Why Is PostgreSQL Storage Optimization Important?
Perhaps you’re thinking:
“Storage is cheap these days, and optimizing a PostgreSQL database takes time and effort. I’ll just keep adding more storage.”
Or perhaps:
“My PostgreSQL provider is actually usage-based (like Timescale), and I don’t have the problem of being locked into a large disk.”
Indeed, resigning yourself to simply using more storage is the most straightforward way to tackle an ever-growing PostgreSQL database. Are you running servers on-prem? Slap another hard drive on that bad boy. Are you running PostgreSQL in RDS? Raise the storage limits. But this approach comes with problems.
The first and most obvious problem is the cost. For example, if you’re running PostgreSQL on EBS volumes in AWS or in RDS, you’re charged on an allocation basis. This model assumes you’ll predetermine how much disk space you’ll need in the future and then pay for it, whether you end up using it or not, and with no option to scale back down.
In other PostgreSQL providers, when you run out of storage space, you must upgrade and pay for the next available plan or storage tier, meaning you’ll see a considerably higher bill overnight.
In a way, these issues are mitigated by usage-based models. Timescale charges by the amount of storage you use: you don't need to worry about allocating storage or managing storage plans, which really simplifies things—and the less storage you use, the less it costs.
Usage-based models are a great incentive to actually optimize your PostgreSQL database size as much as possible since you’ll see immediate reductions in your bill. But yes, this also works the opposite way: if you ignore managing your storage, your storage bill will go up.
The second problem with not optimizing your PostgreSQL storage usage is that it degrades performance: queries run slower, and your I/O operations increase. This is often overlooked, but keeping storage usage in check is paramount to keeping large PostgreSQL tables fast.
This last point deserves a deeper dive into how data is actually stored in PostgreSQL and what is causing the problem, so let’s briefly cover some essential PostgreSQL storage concepts.
Essential PostgreSQL Storage Concepts
How does PostgreSQL store data?
At a high level, there are two terms you need to understand: tuples and pages.
A tuple is the physical representation of an entry in a table. You'll generally see the terms tuple and row used interchangeably. Each element in a tuple corresponds to a specific column in that table, containing the actual data value for that column.
A page is the unit of storage in PostgreSQL, typically 8 kB in size, that holds one or more tuples. PostgreSQL reads and writes data in page units.
Each page in PostgreSQL consists of a page header (which contains metadata about the page, such as page layout versions, page flags, and so on) and actual data (including tuples). There’s also a special area called the Line Pointer Array, which provides the offsets where each tuple begins.
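If you want to look at this structure directly, PostgreSQL’s pageinspect contrib extension exposes page headers and line pointers. Here’s a minimal sketch, assuming a hypothetical customers table and superuser privileges:
-- Ships with PostgreSQL's contrib modules; requires superuser
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Page header metadata for the table's first page (page 0)
SELECT * FROM page_header(get_raw_page('customers', 0));
-- The line pointer array and per-tuple headers on that page
SELECT lp, lp_off, lp_len, t_xmin, t_xmax
FROM heap_page_items(get_raw_page('customers', 0));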
What happens when querying data?
When querying data, PostgreSQL utilizes the metadata to quickly navigate to the relevant page and tuple. The PostgreSQL query planner examines the metadata to decide the optimal path for retrieving data, for example, estimating the cost of different query paths based on the metadata information about the tables, indexes, and data distribution.
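You can watch the planner do this with EXPLAIN, which prints the chosen plan and its estimated costs. A quick sketch against the hypothetical customers table:
-- Estimated plan and costs derived from table statistics
EXPLAIN SELECT * FROM customers WHERE id = 42;
-- Also runs the query and reports actual row counts and timings
EXPLAIN ANALYZE SELECT * FROM customers WHERE id = 42;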
What happens when we INSERT/DELETE/UPDATE a row in PostgreSQL?
When a new tuple is inserted into a PostgreSQL table, it gets added to a page with enough free space to accommodate the tuple. Each tuple within a page is identified and accessed using the offset provided in the Line Pointer Array.
If an inserted tuple is too big for the available space in a page, PostgreSQL doesn't split it across two 8 kB pages. Instead, it employs TOAST (The Oversized-Attribute Storage Technique) to compress and/or break the large values into smaller pieces. These pieces are then stored in a separate TOAST table, while the original tuple retains a pointer to this externally stored data.
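You can check whether a table has TOAST storage and roughly how much space it takes. A sketch, again assuming a customers table:
-- The TOAST table (if any) backing the main table
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'customers';
-- pg_table_size includes TOAST data (plus the free space and visibility maps);
-- pg_relation_size is the main heap only
SELECT pg_size_pretty(pg_relation_size('customers')) AS heap_size,
       pg_size_pretty(pg_table_size('customers') - pg_relation_size('customers')) AS toast_and_maps;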
What is a dead tuple?
A key aspect to understand (and this will influence our PostgreSQL database size, as we’ll see shortly) is that when you delete data in PostgreSQL via DELETE FROM, you’re not actually deleting it but marking the rows as unavailable. These unavailable rows are usually referred to as “dead tuples.”
When you run UPDATE, the old version of the row is also marked as a dead tuple, and PostgreSQL inserts a new tuple with the updated values.
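You can watch dead tuples accumulate through PostgreSQL’s statistics views. A minimal sketch, assuming a customers table you’ve just deleted from or updated:
-- Live vs. dead tuple counts tracked by the statistics collector
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'customers';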
You might be wondering why PostgreSQL does this. Dead tuples are a compromise: by keeping old row versions around, PostgreSQL avoids excessive locking when many connections operate on the same tables concurrently, and it simplifies transaction handling.
Imagine a transaction failing halfway through its execution; it is much easier to revert the change when the old data is still available than to try to rewind each action one by one.
Furthermore, this mechanism supports the easy and efficient implementation of rollbacks, ensuring data consistency and integrity during transactions.
The trade-off, however, is the increased database size due to the accumulation of dead tuples, necessitating regular maintenance to reclaim space and maintain performance… which brings us to table bloat.
What is table bloat?
When a tuple is deleted or updated, its old instance is considered a dead tuple. The issue with dead tuples is that they’re effectively still a tuple on disk, taking up storage space—yes, that storage page that is costing you money every month.
Table bloat refers to this excess space that dead tuples occupy in your PostgreSQL database, which not only leads to an inflated table size but also to increased I/O and slower queries. Since PostgreSQL runs under the MVCC system, it doesn't immediately purge these dead tuples from the disk. Instead, they linger until a vacuum process reclaims their space.
Table bloat also occurs when a table contains unused pages, which can accumulate as a result of operations such as mass deletes.
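To measure bloat directly, the pgstattuple contrib extension reports how much of a table is live data versus dead tuples and free space. A sketch, assuming the extension is available and the same hypothetical customers table:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- Returns table_len, dead_tuple_count, dead_tuple_percent, free_space, and more
SELECT * FROM pgstattuple('customers');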
What is VACUUM?
Dead tuples get cleaned and deleted from storage when the VACUUM command runs:
VACUUM customers;
Vacuum has a lot of roles, but the relevant point for this article is that vacuum removes dead tuples once no open transaction can still see them. VACUUM by itself will not delete pages, though. Any pages already allocated to a table stay allocated, although the space within them becomes reusable after running vacuum.
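If you’re curious what vacuum actually did, the VERBOSE option prints per-table details, and adding ANALYZE refreshes the planner statistics while you’re at it:
-- Reports how many dead tuples were removed and how many pages remain
VACUUM (VERBOSE, ANALYZE) customers;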
What is autovacuum?
Postgres conveniently includes a daemon to automatically run vacuum on tables that get heavy insert, update, and delete traffic. It operates in the background, monitoring the database to identify tables with accumulating dead tuples and then initiating the vacuum process autonomously.
Autovacuum comes enabled by default, although the default thresholds PostgreSQL uses to trigger it are quite conservative.
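You can confirm autovacuum is on and see when it last visited each table. A sketch using the built-in settings and statistics views:
-- Is the autovacuum daemon enabled?
SHOW autovacuum;
-- Last manual and automatic vacuum per table, plus current dead tuple counts
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;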
What is VACUUM FULL?
Autovacuum helps with dead tuples, but what about unused pages?
The VACUUM FULL command is a more aggressive version of VACUUM that locks the table, rewrites it into a new, compact file without dead tuples or empty pages, and returns the reclaimed space to the operating system. VACUUM FULL can be resource-intensive and requires an exclusive lock on the table for the duration. But we’ll revisit this in part 2, where we will dive into strategies to reduce your database size.
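As a quick preview, you can compare a table’s on-disk size before and after; just remember the table is locked the whole time. A sketch with the hypothetical customers table:
SELECT pg_size_pretty(pg_total_relation_size('customers')); -- before
-- Rewrites the table; holds an ACCESS EXCLUSIVE lock until it finishes
VACUUM FULL customers;
SELECT pg_size_pretty(pg_total_relation_size('customers')); -- after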
Click save and subscribe to the comments to get notified for part 2.