This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.
Batch Operations: Bulk Insert, COPY, and Batch Size Tuning
Loading or updating large volumes of data row-by-row is prohibitively slow. Batch operations reduce overhead by orders of magnitude. This article covers bulk insert techniques, PostgreSQL's COPY command, batch updates, and the art of choosing the right batch size.
Row-by-Row is Slow
Inserting one row at a time incurs overhead for each statement:
Parse SQL
Plan query
Execute plan
Commit (if auto-commit)
Network round trip
# SLOW: one round trip per row
for row in dataset:
    cursor.execute("INSERT INTO users (email, name) VALUES (%s, %s)", row)
For 100,000 rows, that is 100,000 round trips. Batching cuts this to one round trip per batch, and COPY streams every row through a single command.
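To make the arithmetic concrete, here is a minimal sketch that counts the statements needed for a given batch size. The `per_statement_ms` figure is a hypothetical fixed cost per statement (parse, plan, commit, network), not a measured value; real costs depend on your network and server.

```python
import math


def round_trips(total_rows: int, batch_size: int) -> int:
    """Number of statements (one network round trip each) to send total_rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(total_rows / batch_size)


def estimated_overhead_seconds(total_rows: int, batch_size: int,
                               per_statement_ms: float) -> float:
    """Fixed overhead only; per_statement_ms is an assumed, illustrative cost."""
    return round_trips(total_rows, batch_size) * per_statement_ms / 1000


print(round_trips(100_000, 1))      # one statement per row
print(round_trips(100_000, 1000))   # one statement per 1000-row batch
```

The model ignores the work that scales with row count (writing tuples, updating indexes), so it only shows how much of the fixed cost batching removes.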
Bulk Insert with Multi-Row VALUES
The simplest batch insert sends multiple rows in a single statement:
INSERT INTO users (email, name) VALUES
('alice@example.com', 'Alice'),
('bob@example.com', 'Bob'),
('carol@example.com', 'Carol');
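If you need to build such a statement by hand, one possible approach is to generate the placeholder groups and flatten the row values into a single parameter list. The helper name `build_multirow_insert` is made up for this sketch, and it assumes the table and column names come from trusted code, because they are interpolated directly into the SQL text.

```python
def build_multirow_insert(table, columns, rows):
    """Return (sql, params) for one INSERT that carries every row in `rows`.

    `table` and `columns` are placed into the SQL string as-is, so they must
    never come from user input. Row values travel as bound parameters.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    for row in rows:
        if len(row) != width:
            raise ValueError("every row needs exactly one value per column")

    row_placeholder = "(" + ", ".join(["%s"] * width) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    # Flatten [(a, b), (c, d)] into [a, b, c, d] to match the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_multirow_insert(
    "users",
    ["email", "name"],
    [("alice@example.com", "Alice"), ("bob@example.com", "Bob")],
)
```

With a DB-API cursor that uses `%s` placeholders, the result would be passed as `cursor.execute(sql, params)`.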
Using psycopg2 execute_values
from psycopg2.extras import execute_values
data = [
    ("alice@example.com", "Alice"),
    ("bob@example.com", "Bob"),
    # ... 1000 rows
]
execute_values(
    cursor,
    "INSERT INTO users (email, name) VALUES %s",
    data,
    template="(%s, %s)",
    page_size=1000,
)
Using asyncpg
import asyncpg
# executemany with prepared statement reuse
await conn.executemany(
    "INSERT INTO users (email, name) VALUES ($1, $2)",
    [("alice@example.com", "Alice"), ("bob@example.com", "Bob")],
)
The COPY Command
COPY is PostgreSQL's most efficient data loading mechanism. It streams rows in text, CSV, or binary format directly into a table, avoiding the per-statement parse, plan, and round-trip overhead of individual INSERTs:
-- From file (the path is read by the server process, not the client)
COPY users (email, name) FROM '/path/to/users.csv' WITH (FORMAT CSV, HEADER true);
-- From standard input (via driver)
COPY users (email, name) FROM STDIN WITH (FORMAT CSV);
Python with COPY
import io
import csv
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows([
    ("alice@example.com", "Alice"),
    ("bob@example.com", "Bob"),
])
buffer.seek(0)
cursor.copy_expert(
    "COPY users (email, name) FROM STDIN WITH CSV",
    buffer,
)
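The buffer-building step is worth isolating, because the `csv` module takes care of quoting values that contain commas or double quotes. The sketch below wraps it in a hypothetical helper, `rows_to_csv_buffer`. One caveat to keep in mind: in CSV format COPY treats an unquoted empty field as NULL by default, and `csv.writer` emits both `None` and `""` as an empty field, so the two cannot be told apart after loading.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize rows into an in-memory CSV buffer, rewound to the start."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)  # COPY reads from the current position
    return buffer


buffer = rows_to_csv_buffer([
    ("alice@example.com", "Alice"),
    ("pat,jones@example.com", 'Pat "PJ" Jones'),  # comma and quotes get escaped
    ("carol@example.com", None),                  # written as an empty field
])
```

The returned buffer can be handed to `cursor.copy_expert(...)` exactly as in the example above.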
Performance Comparison
| Method | Time for 1M rows | Network round trips |
|--------|------------------|---------------------|
| Row-by-row INSERT | ~120 seconds | 1,000,000 |
| Batch INSERT (1000 rows) | ~8 seconds | 1,000 |
| COPY (binary) | ~1.5 seconds | 1 |
| COPY (CSV) | ~2 seconds | 1 |
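To read the table as throughput and speedup rather than raw seconds, the short sketch below converts the approximate timings quoted above. The numbers are the article's rough figures, not new measurements, so treat the output as equally approximate.

```python
ROWS = 1_000_000

# Approximate timings from the comparison table, in seconds.
timings_seconds = {
    "Row-by-row INSERT": 120.0,
    "Batch INSERT (1000 rows)": 8.0,
    "COPY (binary)": 1.5,
    "COPY (CSV)": 2.0,
}

baseline = timings_seconds["Row-by-row INSERT"]

# method -> (rows per second, speedup relative to row-by-row INSERT)
summary = {
    method: (round(ROWS / seconds), baseline / seconds)
    for method, seconds in timings_seconds.items()
}

for method, (rows_per_second, speedup) in summary.items():
    print(f"{method:26s} {rows_per_second:>9,d} rows/s  {speedup:5.1f}x")
```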
Batch Updates
Updating rows in bulk follows a different pattern. Use a temporary table or unnest:
Using UNNEST
UPDATE users SET email = data.email
FROM (SELECT UNNEST(%s) AS id, UNNEST(%s) AS email) AS data
WHERE users.id = data.id;
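The two array parameters have to line up position by position: the first id pairs with the first email, and so on. If the arrays differ in length, recent PostgreSQL versions pad the shorter one with NULLs rather than raising an error, as far as I understand the behavior, so it is safer to build both arrays from a single list of pairs. A minimal sketch, using a hypothetical helper named `split_updates`:

```python
def split_updates(updates):
    """Turn [(id, email), ...] into two equal-length lists for UNNEST."""
    if not updates:
        return [], []
    for pair in updates:
        if len(pair) != 2:
            raise ValueError("each update must be an (id, email) pair")
    ids, emails = zip(*updates)  # transpose rows into columns
    return list(ids), list(emails)


user_ids, emails = split_updates([
    (1, "alice@new.com"),
    (2, "bob@new.com"),
    (3, "carol@new.com"),
])
```

Because both lists come from the same source, they cannot drift out of step before being passed as the query parameters.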
user_ids = [1, 2, 3, 4, 5]
emails = ["alice@new.com", "bob@new.com", "carol@new.com", "dave@new.com", "eve@new.com"]
cursor.execute("""
    UPDATE users SET email = data.email
    FROM (SELECT UNNEST(%s::int[]) AS id, UNNEST(%s::text[]) AS email) AS data
    WHERE users.id = data.id
""", (user_ids, emails))
Using a Temporary Table
import csv
import io

# Create temp table (dropped automatically when the transaction commits)
cursor.execute("""
    CREATE TEMP TABLE tmp_updates (
        id INTEGER PRIMARY KEY,
        email TEXT,
        name TEXT
    ) ON COMMIT DROP
""")

# COPY into temp table; update_data holds (id, email, name) tuples
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(update_data)
buffer.seek(0)
cursor.copy_expert("COPY tmp_updates FROM STDIN WITH CSV", buffer)

# Join update
cursor.execute("""
    UPDATE users u
    SET email = t.email, name = t.name
    FROM tmp_updates t
    WHERE u.id = t.id
""")
Batch Size Tuning
The optimal batch size depends on row width, network latency, and available memory.
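Whatever size you settle on, you need a way to cut the input into batches, and it helps to know the hard ceiling. The sketch below shows a generic chunking generator plus a small calculation based on the limit of 65,535 bound parameters per statement, which applies when the driver binds parameters server-side (asyncpg does; psycopg2 interpolates values client-side, so the limit does not apply to it in the same way). Both helper names are illustrative.

```python
from itertools import islice

# The wire protocol stores the parameter count in 16 bits.
BIND_PARAMETER_LIMIT = 65535


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def max_rows_per_statement(column_count, limit=BIND_PARAMETER_LIMIT):
    """Largest row count whose bound parameters still fit in one statement."""
    if column_count < 1:
        raise ValueError("column_count must be at least 1")
    return limit // column_count


batches = list(batched(range(5), 2))
```

A reasonable way to tune is to time the same load with a few candidate sizes, for example 100, 1,000, and 10,000, by feeding each batch from `batched(data, size)` to your insert call and keeping the size where throughput stops improving.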
General Guidelines
| Row Width | Recommended Batch Size |
|-----------|------------------------|
| Narrow