Mircea Cadariu

High speed data loading in Postgres

Intro

In this post I'll show you how to speed up data loading in Postgres. Using a worked example, we'll start with a runtime of half an hour and end up with a version that finishes in half a minute. Step by step, we'll make it ~60x faster.

If you're looking for a TL;DR, here's a checklist with all the refinements we'll do:

  • UUID v7 instead of v4
  • Numeric ID instead of UUID
  • Increase shared buffers
  • Disable full page writes
  • Remove constraints
  • Remove indexes
  • Unlogged tables
  • COPY

In the sections below we'll apply each of the above steps in order, as well as understand why they speed up the data loading.

I'm using Postgres on my laptop (Apple MacBook Pro M1 and 32GB RAM).

Tables

If you've checked out my other posts, you'll recognise these tables; they're the same ones I used for this post. We have meters and their readings stored in their respective tables. Here's what they look like:

create table meters
(
    id  uuid primary key
);

create table readings
(

    id           uuid primary key,
    meter_id     uuid,
    reading      double precision,
    date         date,

    constraint fk__readings_meters foreign key (meter_id) references meters (id)
);

We'll be inserting 15000 meters, each with one reading per day for 5 years. That's about 27 million rows in the readings table.

insert into meters 
    select uuidv4() from generate_series(1, 15000) seq;

insert into readings(id, meter_id, date, reading)
    select uuidv4(), m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;

I'm letting this run, and after a (rather long) while, it's done: 2098.31s, or roughly 35 minutes.

It's just our starting point, we'll get it much faster. Buckle up!

UUID v7

Let's start by focusing on the primary key. UUID v4 is not the best when it comes to insert performance. I elaborated on why that is in another post, but the gist is that its randomness causes a lot of page modifications: the database has to do a lot of work to keep the B-tree balanced after every tuple inserted.

Recently, Postgres got support for UUID v7! It will be available in version 18, but we can already use it if we build Postgres from source. These are time-sortable identifiers, which means insertions will "affect" only an isolated, specific part of the B-tree instead of pages all over the index. This means much less work for the database to do.
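You can see the time-sortable property for yourself. Here's a minimal sketch, assuming the PG18 uuidv7() along with uuid_extract_timestamp() and uuid_extract_version(), which are available since Postgres 17:

-- uuidv7 embeds a millisecond timestamp in its leading bits,
-- so sorting by the UUID is (almost) sorting by creation time
select u                         as id,
       uuid_extract_timestamp(u) as created_at,
       uuid_extract_version(u)   as version -- 7
from (select uuidv7() from generate_series(1, 3)) t(u)
order by u;

Let's give it a try in our load: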

insert into meters 
    select uuidv7() from generate_series(1, 15000) seq;

insert into readings(id, meter_id, date, reading)
    select uuidv7(), m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;

Check this out - with this, we've reduced the time by more than half! It finished in 821.50s. So far so good.

Numeric IDs

Let's try something else. The UUIDs themselves have to be generated before insertion (the call to uuidv7()). Let's replace the UUID primary key with a numeric one, which doesn't have to be generated; it's simply read from a sequence. In addition, the corresponding data type (bigint, 8 bytes) is half the size of a UUID (16 bytes). Sounds good, let's see where this takes us.

create table meters
(
    id bigint primary key 
);

create table readings
(

    id           bigint primary key generated always as identity,
    meter_id     bigint,
    reading      double precision,
    date         date,

    constraint fk__readings_meters foreign key (meter_id) references meters (id)
);

Here's the updated script:

insert into meters 
    select seq from generate_series(1, 15000) seq;

insert into readings(meter_id, date, reading)
    select m.id, seq, random() from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m;

This gets us a little further indeed! We're at 646.73s, so a bit over 10 minutes. This is great, but we've still got work to do - remember, we're eventually going to get it about 20 times faster than this.

Let's move on to doing some configuration tuning.

Shared buffers

Postgres uses its shared memory buffers to make reads and writes more efficient. This is a set of 8 kB pages kept in memory so that the database can avoid doing slower disk operations all the time.


If we don't size the shared buffers correctly, we can expect a lot of evictions during our data loading, slowing it down. The default setting of 128 MB is low compared to how much data we're inserting (~2GB), so I'll increase the shared buffers accordingly.

alter system set shared_buffers='2GB';

Time to run it again. Note that changing shared_buffers requires a server restart to take effect. As expected, this brought us closer to our goal: we're now at 595.33s.
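If you want to verify how much of a table actually sits in the cache, the pg_buffercache extension (shipped with Postgres) can tell you. A minimal sketch:

create extension if not exists pg_buffercache;

-- count the 8 kB buffers of the readings table currently in shared buffers
select count(*)                        as buffers,
       pg_size_pretty(count(*) * 8192) as cached
from pg_buffercache
where relfilenode = pg_relation_filenode('readings');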

Full page writes

As mentioned above, Postgres works with 8 kB pages, however the OS and the disk do not (e.g. in Linux a memory page is 4 kB, and a sector on disk is 512 bytes). In the event of a power failure, this mismatch can lead to pages being only partially written. Such torn pages would prevent Postgres from doing its data recovery, because the recovery protocol relies on pages not being corrupted in any way. The solution: after every checkpoint, at the first update of a page, the full page is written to the write-ahead log instead of only the changes, as is the common case.
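You can observe how much of your WAL volume consists of these full-page images. A sketch, assuming Postgres 14 or newer for the pg_stat_wal view:

-- wal_fpi counts full-page images written since the last stats reset
select wal_records,
       wal_fpi,
       pg_size_pretty(wal_bytes::numeric) as wal_size
from pg_stat_wal;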


For the sake of experimentation let's shut it off. I do not recommend doing this in production, except temporarily and strictly for the duration of the data load.

alter system set full_page_writes=off;
Enter fullscreen mode Exit fullscreen mode

Hmm, well, it got us to 590.01s. It's not that much, but we'll take it!

Constraints

Next up, we'll remove the table constraints. From the schema above, I'll remove the fk__readings_meters foreign key constraint. The database then has to do less work, because it no longer checks the constraint for every row inserted.
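The drop itself is a one-liner, and the constraint can be put back once the load has finished successfully:

alter table readings drop constraint fk__readings_meters;

-- after the load:
alter table readings
    add constraint fk__readings_meters foreign key (meter_id) references meters (id);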

Quite a difference this made! We're now at 150.71s. This is the biggest gain so far.

Indexes

We're onto something. I'll now remove the indexes as well; in our schema these are the ones backing the two primary keys. This means no more updating the index after every tuple inserted. By the way, we're dropping constraints and indexes only temporarily. You can always recreate them after the data loading has finished successfully, as shown below.
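If you'd rather not recreate the tables from scratch, you can drop the primary keys directly. A sketch, assuming the default Postgres constraint names:

alter table meters   drop constraint meters_pkey;
alter table readings drop constraint readings_pkey;

-- recreate after the load: building an index in one pass is much
-- cheaper than maintaining it for every inserted tuple
alter table meters   add primary key (id);
alter table readings add primary key (id);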

These are my tables now.

create table meters
(
    id bigint 
);

create table readings
(

    id           bigint,
    meter_id     bigint,
    reading      double precision,
    date         date

);

I ran the same import script as above, and now we're at 109.51s. Great stuff! Can we get it under 100s?

Unlogged tables

Sure thing! But we'll have to make some more concessions. For the rest of the experiment I'll be using unlogged tables. Again, not a setting to keep for production beyond strictly the data loading procedure: unlogged tables skip the write-ahead log entirely, so the database no longer ensures their durability (after a crash, they come back empty).

create unlogged table meters
(
    id bigint 
);

create unlogged table readings
(

    id           bigint,
    meter_id     bigint,
    reading      double precision,
    date         date

);

I'm now at 40.65s. Believe it or not, we're not done yet here.
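Once the load is done, you can switch the tables back. A sketch; note that this rewrites each table and writes it to the WAL, so it takes a while on large tables:

alter table meters   set logged;
alter table readings set logged;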

Copy

The copy command is the Postgres-native method for bulk data loading. Let's give it a go.
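We first need the data in a file. One way to produce readings.csv is to export the output of the same generator query. A sketch; psql's \copy takes the whole command on a single line, and the column order must match the readings table (id, meter_id, reading, date):

\copy (select row_number() over (), m.id, random(), seq::date from generate_series('2019-02-01'::date, '2024-02-01'::date, '1 day'::interval) seq, meters m) to 'readings.csv' with (format csv)

With the file in place, loading it back is a single command: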

\copy readings from '<path_to_file>/readings.csv' delimiter ',';

This finishes in 35.41s. Amazing!

Here are all our results in one view:

| Step | Time (s) |
|------|----------|
| UUID v4 (baseline) | 2098.31 |
| UUID v7 | 821.50 |
| Numeric IDs | 646.73 |
| Increased shared buffers | 595.33 |
| Full page writes off | 590.01 |
| Constraints removed | 150.71 |
| Indexes removed | 109.51 |
| Unlogged tables | 40.65 |
| COPY | 35.41 |

That's quite a difference from when we started out. As expected, using COPY led to the shortest time. But what was interesting to see is how big a difference dropping the constraints made, compared with the other changes.

I want to add that I've experimented with checkpoint tuning as well. It didn't yield any notable improvements for this experiment. Nonetheless, you might want to keep it in mind as it can affect performance if misconfigured.
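For reference, these are the kinds of knobs involved in checkpoint tuning. A sketch; the values are illustrative, not recommendations:

-- fewer, more spread-out checkpoints during the load
alter system set max_wal_size = '8GB';
alter system set checkpoint_timeout = '30min';
select pg_reload_conf();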

Thanks for reading!

Cover Photo by Florian Steciuk on Unsplash
