Intro
In this post I'll show you how to speed up data loading in Postgres. Using a worked example, we'll start at a runtime of half an hour and end up with a version that finishes in about half a minute. Step by step, we'll make it ~60x faster.
If you're looking for a TL;DR, here's a checklist with all the refinements we'll do:
- UUID v7 instead of v4
- Numeric ID instead of UUID
- Increase shared buffers
- Disable full page writes
- Remove constraints
- Remove indexes
- Unlogged tables
- COPY
In the sections below we'll apply each of the above steps in order, as well as understand why they speed up the data loading.
I'm using Postgres on my laptop (Apple MacBook Pro M1 and 32GB RAM).
Tables
If you've checked out my other posts, you'll recognise these tables: they're the same ones I used in this post. We have meters and their readings, stored in their respective tables. Here's what they look like:
create table meters
(
id uuid primary key
);
create table readings
(
id uuid primary key,
meter_id uuid,
reading double precision,
date date,
constraint fk__readings_meters foreign key (meter_id) references meters (id)
);
We'll be inserting 15000 meters with one reading every day for each one, for 5 years.
insert into meters
select uuidv4() from generate_series(1, 15000) seq;
insert into readings(id, meter_id, date, reading)
select uuidv4(), m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;
I'm letting this run, and after a (rather long) while, it's done (in 2098.31s).
It's just our starting point, we'll get it much faster. Buckle up!
UUID v7
Let's start by focusing on the primary key. UUID v4 is not the best when it comes to insert performance. I elaborated on why that is in another post, but the gist is that its randomness causes a lot of page modifications: the database has to do a lot of work to keep the B-tree balanced after every inserted tuple.
Recently, Postgres got support for UUID v7! It will be available in version 18, but we can already use it if we build from the source code directly. These are time-sortable identifiers, which means insertions only "affect" an isolated, specific part of the B-tree instead of the whole tree. That's much less work for the database. Let's give this a try.
insert into meters
select uuidv7() from generate_series(1, 15000) seq;
insert into readings(id, meter_id, date, reading)
select uuidv7(), m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;
Check this out - with this, we've cut the time by more than half! It finished in 821.50s. So far so good.
Numeric IDs
Let's try something else. The UUIDs themselves have to be generated before insertion (the call to uuidv7()). Let's replace the UUID primary key with a numeric one, which does not have to be generated: it is simply read from a sequence. In addition, the corresponding data type (bigint) is half the size of a UUID, 8 bytes instead of 16. Sounds good, let's see where this brings us.
create table meters
(
id bigint primary key
);
create table readings
(
id bigint primary key generated always as identity,
meter_id bigint,
reading double precision,
date date,
constraint fk__readings_meters foreign key (meter_id) references meters (id)
);
Here's the updated script:
insert into meters
select seq from generate_series(1, 15000) seq;
insert into readings(meter_id, date, reading)
select m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;
This gets us a little further indeed! We're at 646.73s, so a bit over 10 minutes. This is great, but we've still got work to do - remember, we're eventually going to get it about 20 times faster than this.
Let's move on to some configuration tuning.
Shared buffers
Postgres uses its shared memory buffers to make reads and writes more efficient. They are a set of 8kB pages kept in memory, used to avoid going to the slower disk for every operation.
If we don't size the shared buffers correctly, we can expect a lot of evictions during our data loading, slowing it down. The default setting of 128 MB is low compared to how much data we're inserting (~2GB), so I'll increase the shared buffers accordingly.
alter system set shared_buffers='2GB';
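Note that shared_buffers is only read at server start, so the new value takes effect after a restart. A quick way to confirm the running value afterwards:
-- verify the active setting after restarting Postgres
show shared_buffers;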
Time to run it again. As expected, this brought us closer to our goal. We're now at 595.33s.
Full page writes
As mentioned above, Postgres works with 8kB pages, but the OS and the disk do not (in Linux, for example, a page is 4kB and a disk sector is 512 bytes). In the event of a power failure, this can lead to pages being only partially written. That would prevent Postgres from doing its crash recovery, because the recovery protocol relies on pages not being corrupted in any way. The solution is that after every checkpoint, the first modification of a page writes the whole page to the WAL instead of only the changes, as is the common case.
For the sake of experimentation, let's turn it off. I do not recommend doing this in production, except temporarily and strictly for the duration of the data loading.
alter system set full_page_writes=off;
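Unlike shared_buffers, this setting does not need a restart; reloading the configuration is enough:
-- apply the change without restarting, then check it
select pg_reload_conf();
show full_page_writes;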
Hmm, well, it got us to 590.01s. It's not that much, but we'll take it!
Constraints
Next up, we'll remove the table constraints. From the script above, I'll drop the fk__readings_meters foreign key constraint. The database has less work to do because it no longer has to check the constraint for every row inserted.
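Since the tables already exist, one way to do that is to drop the constraint in place (alternatively, just recreate the tables without it):
alter table readings drop constraint fk__readings_meters;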
This made quite a difference. We're now at 150.71s. This is the biggest gain so far.
Indexes
We're onto something. I'll now remove the indexes as well, which means no more updating an index after every inserted tuple. By the way, we're dropping constraints and indexes only temporarily: you can always recreate them once the data loading has finished successfully (see the sketch after the table definitions below).
These are my tables now.
create table meters
(
id bigint
);
create table readings
(
id bigint,
meter_id bigint,
reading double precision,
date date
);
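For reference, recreating the primary keys and the foreign key after the load could look something like this (the cost of building them is then paid once, at the end):
alter table meters add primary key (id);
alter table readings add primary key (id);
alter table readings add constraint fk__readings_meters foreign key (meter_id) references meters (id);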
I've run the same import script as above, and now we're at 109.51s. Great stuff! Can we get it under 100s?
Unlogged tables
Sure thing! But we'll have to make some more concessions. For the rest of the experiment I'll be using unlogged tables. Again, this is not a setting to keep in production beyond the data loading procedure itself: the database no longer guarantees durability for these tables, because they bypass the write-ahead log.
create unlogged table meters
(
id bigint
);
create unlogged table readings
(
id bigint,
meter_id bigint,
reading double precision,
date date
);
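Once the load has finished, the tables can be switched back to regular, logged ones; note this is not free, since their contents then have to be made crash-safe:
alter table meters set logged;
alter table readings set logged;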
I'm now at 40.65s. Believe it or not, we're not done yet.
Copy
The COPY command is the native Postgres method for bulk data loading. Let's give it a go.
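I'm assuming the readings were exported to a CSV file beforehand; one way to produce such a file (just a sketch, reusing the earlier generator query and the same placeholder path) is:
\copy (select row_number() over (), m.id, random(), seq::date from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m) to '<path_to_file>/readings.csv' delimiter ','
With the file in place, loading it is a single command: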
\copy readings from '<path_to_file>/readings.csv' delimiter ',';
This finishes in 35.41s. Amazing!
Here are all our results in one view:
- Baseline (UUID v4): 2098.31s
- UUID v7: 821.50s
- Numeric IDs: 646.73s
- Increased shared buffers: 595.33s
- Full page writes disabled: 590.01s
- Constraints removed: 150.71s
- Indexes removed: 109.51s
- Unlogged tables: 40.65s
- COPY: 35.41s
That's quite a difference from when we started out. As expected, using COPY led to the shortest time. But what was interesting to see is how big a difference dropping the constraints made compared with the other changes.
I want to add that I've experimented with checkpoint tuning as well. It didn't yield any notable improvements for this experiment. Nonetheless, you might want to keep it in mind as it can affect performance if misconfigured.
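For reference, checkpoint tuning for a bulk load usually means spacing checkpoints further apart; the values below are purely illustrative, not settings I ended up keeping:
-- illustrative values only; tune to your own workload and hardware
alter system set max_wal_size = '8GB';
alter system set checkpoint_timeout = '30min';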
Thanks for reading!