DEV Community

丁久

Posted on • Originally published at dingjiu1989-hue.github.io

Batch Operations: Bulk Insert, COPY, and Batch Size Tuning

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Loading or updating large volumes of data row-by-row is prohibitively slow. Batch operations reduce overhead by orders of magnitude. This article covers bulk insert techniques, PostgreSQL's COPY command, batch updates, and the art of choosing the right batch size.

Row-by-Row is Slow

Inserting one row at a time incurs overhead for each statement:

  • Parse SQL

  • Plan query

  • Execute plan

  • Commit (if auto-commit)

  • Network round trip

# SLOW: one round trip per row
for row in dataset:
    cursor.execute("INSERT INTO users (email, name) VALUES (%s, %s)", row)

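For 100,000 rows, that is 100,000 round trips. Batching cuts this to one round trip per batch (100 round trips at 1,000 rows per batch), and COPY streams the whole load in a single command.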
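To see why round trips dominate, here is a rough back-of-the-envelope model. The 0.5 ms round-trip latency is an assumed figure for illustration, not a measurement, and the model deliberately ignores server-side parse, plan, and commit costs.

```python
import math


def round_trips(total_rows: int, batch_size: int) -> int:
    """Number of statements (and network round trips) needed to send all rows."""
    return math.ceil(total_rows / batch_size)


def network_seconds(total_rows: int, batch_size: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network, ignoring server-side work."""
    return round_trips(total_rows, batch_size) * rtt_ms / 1000.0


# Assumed round-trip latency of 0.5 ms (hypothetical figure).
for size in (1, 100, 1000):
    trips = round_trips(100_000, size)
    waited = network_seconds(100_000, size, rtt_ms=0.5)
    print(f"batch_size={size}: {trips} round trips, {waited:.2f} s of network wait")
```

Under these assumptions, the network wait alone shrinks in direct proportion to the batch size, before any savings from parsing and planning are counted.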

Bulk Insert with Multi-Row VALUES

The simplest batch insert sends multiple rows in a single statement:

INSERT INTO users (email, name) VALUES

('alice@example.com', 'Alice'),

('bob@example.com', 'Bob'),

('carol@example.com', 'Carol');
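When a driver offers no bulk helper, the same statement can be built with placeholders instead of literal values. The following is a minimal sketch; `build_multirow_insert` is a hypothetical helper name, and it assumes a DB-API driver that uses `%s` placeholders.

```python
from typing import List, Sequence, Tuple


def build_multirow_insert(
    table: str, columns: Sequence[str], rows: Sequence[Sequence]
) -> Tuple[str, List]:
    """Return (sql, params) for one parameterized multi-row INSERT.

    `table` and `columns` are interpolated directly into the SQL text, so they
    must come from trusted code, never from user input. Row values always go
    through placeholders.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values_clause = ", ".join([row_placeholder] * len(rows))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values_clause}"
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_multirow_insert(
    "users",
    ["email", "name"],
    [("alice@example.com", "Alice"), ("bob@example.com", "Bob")],
)
# cursor.execute(sql, params)  # cursor is assumed to be an open DB-API cursor
```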

Using psycopg2 execute_values

from psycopg2.extras import execute_values

data = [
    ("alice@example.com", "Alice"),
    ("bob@example.com", "Bob"),
    # ... 1000 rows
]

execute_values(
    cursor,
    "INSERT INTO users (email, name) VALUES %s",
    data,
    template="(%s, %s)",
    page_size=1000,
)
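If the dataset is too large to hold in one list, it can be fed to the loader in slices. This is a sketch of one way to do that; `chunked` and `read_rows_from_somewhere` are hypothetical names, not part of psycopg2.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` rows from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage with the execute_values call shown above:
# for batch in chunked(read_rows_from_somewhere(), 1000):
#     execute_values(cursor, "INSERT INTO users (email, name) VALUES %s", batch)
#     connection.commit()
```

Committing per slice keeps each transaction small, at the cost of losing all-or-nothing behavior for the whole load.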

Using asyncpg

import asyncpg

# executemany with prepared statement reuse
# (conn is an open asyncpg connection; run this inside an async function)
await conn.executemany(
    "INSERT INTO users (email, name) VALUES ($1, $2)",
    [("alice@example.com", "Alice"), ("bob@example.com", "Bob")],
)
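asyncpg also exposes the COPY protocol (covered in the next section) through `Connection.copy_records_to_table`, which may be a better fit than `executemany` for large loads. A minimal sketch, assuming `conn` is an open asyncpg connection; the function name `load_users` is hypothetical.

```python
from typing import Iterable, Tuple


async def load_users(conn, rows: Iterable[Tuple[str, str]]) -> None:
    """Bulk-load (email, name) tuples through asyncpg's COPY-based helper.

    `conn` is assumed to be an open asyncpg.Connection.
    """
    await conn.copy_records_to_table(
        "users",
        records=list(rows),
        columns=["email", "name"],
    )
```

It would be awaited from an async function, for example `await load_users(conn, [("alice@example.com", "Alice")])`.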

The COPY Command

COPY is PostgreSQL's most efficient data loading mechanism. It streams rows in text, CSV, or binary format directly into a table, avoiding the per-statement parse and plan overhead of individual INSERTs:

-- From a file (the path is read by the database server, not the client)
COPY users (email, name) FROM '/path/to/users.csv' WITH (FORMAT CSV, HEADER true);

-- From standard input (via driver)
COPY users (email, name) FROM STDIN WITH (FORMAT CSV);

Python with COPY

import io
import csv

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows([
    ("alice@example.com", "Alice"),
    ("bob@example.com", "Bob"),
])
buffer.seek(0)

cursor.copy_expert(
    "COPY users (email, name) FROM STDIN WITH CSV",
    buffer,
)
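The buffer-building step can be wrapped in a small helper. This sketch uses the hypothetical name `rows_to_csv_buffer`; it relies on `csv.writer` to quote commas and double quotes inside values, so such rows survive the trip. Note that `csv.writer` emits `None` as an empty field, which COPY's CSV format reads as NULL by default.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_buffer(rows: Iterable[Sequence]) -> io.StringIO:
    """Serialize rows into an in-memory CSV file, rewound and ready to read."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


# Values containing commas and quotes are escaped by csv.writer.
tricky = [("bob, jr@example.com", 'Bob "Bobby" Smith')]
buffer = rows_to_csv_buffer(tricky)
# cursor.copy_expert("COPY users (email, name) FROM STDIN WITH CSV", buffer)
```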

Performance Comparison

| Method | Time for 1M rows | Network Round Trips |
|--------|------------------|---------------------|
| Row-by-row INSERT | ~120 seconds | 1,000,000 |
| Batch INSERT (1000 rows) | ~8 seconds | 1,000 |
| COPY (binary) | ~1.5 seconds | 1 |
| COPY (CSV) | ~2 seconds | 1 |
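Converting these figures to throughput and relative speedup makes the gap easier to compare. The timings below are the approximate numbers from the table above, not new measurements; `speedups` is a hypothetical helper name.

```python
from typing import Dict

TOTAL_ROWS = 1_000_000

# Approximate timings copied from the comparison table (seconds per 1M rows).
SECONDS = {
    "Row-by-row INSERT": 120.0,
    "Batch INSERT (1000 rows)": 8.0,
    "COPY (binary)": 1.5,
    "COPY (CSV)": 2.0,
}


def speedups(seconds: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return how many times faster each method is than the baseline method."""
    base = seconds[baseline]
    return {method: base / elapsed for method, elapsed in seconds.items()}


ratios = speedups(SECONDS, "Row-by-row INSERT")
for method, elapsed in SECONDS.items():
    rows_per_second = TOTAL_ROWS / elapsed
    print(f"{method}: {rows_per_second:,.0f} rows/s, {ratios[method]:.0f}x")
```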

Batch Updates

Updating rows in bulk follows a different pattern: send all the new values in one statement and join them against the target table, using either UNNEST or a temporary table.

Using UNNEST

UPDATE users SET email = data.email

FROM (SELECT UNNEST(%s) AS id, UNNEST(%s) AS email) AS data

WHERE users.id = data.id;

user_ids = [1, 2, 3, 4, 5]
emails = ["alice@new.com", "bob@new.com", "carol@new.com", "dave@new.com", "eve@new.com"]

cursor.execute("""
    UPDATE users SET email = data.email
    FROM (SELECT UNNEST(%s::int[]) AS id, UNNEST(%s::text[]) AS email) AS data
    WHERE users.id = data.id
""", (user_ids, emails))

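The UNNEST approach needs the ids and emails as two parallel arrays of equal length and matching order. If the updates arrive as pairs, they can be split with a helper like this sketch; `to_parallel_arrays` is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def to_parallel_arrays(
    updates: Iterable[Tuple[int, str]]
) -> Tuple[List[int], List[str]]:
    """Split (id, email) pairs into two equal-length lists for UNNEST."""
    ids: List[int] = []
    emails: List[str] = []
    for user_id, email in updates:
        ids.append(user_id)
        emails.append(email)
    return ids, emails


user_ids, emails = to_parallel_arrays([(1, "alice@new.com"), (2, "bob@new.com")])
# Both lists can then be passed to the UNNEST query above as (user_ids, emails).
```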
Using a Temporary Table

import io
import csv

# Create temp table. ON COMMIT DROP removes it when the transaction commits,
# so autocommit must be off (psycopg2's default).
cursor.execute("""
    CREATE TEMP TABLE tmp_updates (
        id INTEGER PRIMARY KEY,
        email TEXT,
        name TEXT
    ) ON COMMIT DROP
""")

# COPY into temp table; update_data is a list of (id, email, name) tuples
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(update_data)
buffer.seek(0)
cursor.copy_expert("COPY tmp_updates FROM STDIN WITH CSV", buffer)

# Join update
cursor.execute("""
    UPDATE users u
    SET email = t.email, name = t.name
    FROM tmp_updates t
    WHERE u.id = t.id
""")

Batch Size Tuning

The optimal batch size depends on row width, network latency, and available memory.
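Because these factors interact, measuring is usually more reliable than guessing. The sketch below times a loader at several candidate batch sizes; `time_batch_sizes` is a hypothetical helper, and the stand-in loader only records batch lengths. A real run would insert into a scratch table and clear it between sizes.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_batch_sizes(
    load_batch: Callable[[Sequence], None],
    rows: Sequence,
    sizes: Sequence[int],
) -> Dict[int, float]:
    """Return elapsed seconds for loading all `rows` at each candidate size."""
    results: Dict[int, float] = {}
    for size in sizes:
        start = time.perf_counter()
        for offset in range(0, len(rows), size):
            load_batch(rows[offset:offset + size])
        results[size] = time.perf_counter() - start
    return results


# Stand-in loader for demonstration; replace it with a real batch insert.
sent: List[int] = []
timings = time_batch_sizes(
    lambda batch: sent.append(len(batch)), list(range(2500)), [100, 1000]
)
best = min(timings, key=timings.get)  # batch size with the lowest elapsed time
```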

General Guidelines

| Row Width | Recommended Batch Size |
|-----------|------------------------|
| Narrow

