Introduction
If you’ve ever watched your backend performance graph nosedive as production traffic rises, you already know the culprit often lurks below the API layer — in the database.
A well-designed backend can still stall if every request waits on slow queries, exhausted connections, or bloated indexes. While scaling compute resources might buy time, true backend optimization starts at the data layer.
This post focuses entirely on database-centric performance techniques — the stuff that really moves the needle when your system is in production. We’ll cover practical approaches ranging from connection pooling to sharding, with comparisons across PostgreSQL, MySQL, and MongoDB where relevant.
1. The Hidden Cost of Connections: Why Pooling Matters
Every database query starts with a handshake — establishing a TCP socket, authentication, and resource allocation. Opening and closing these connections per request is like starting your car engine every time you need to move a few meters.
The Fix: Connection Pooling
A connection pool keeps a set of ready-to-use connections in memory.
Instead of creating a new one for every query, your backend borrows an idle connection, executes, and returns it to the pool.
Example in Python (using psycopg2.pool)
import psycopg2
from psycopg2 import pool
# Initialize the connection pool
db_pool = pool.SimpleConnectionPool(
minconn=5,
maxconn=20,
user='db_user',
password='secret',
host='localhost',
port=5432,
database='production_db'
)
# Borrow a connection
conn = db_pool.getconn()
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM orders;")
print(cursor.fetchone())
# Return connection
cursor.close()
db_pool.putconn(conn)
Production-level pooling
For PostgreSQL, PgBouncer is the gold standard.
It acts as a lightweight proxy that manages thousands of connections efficiently.
Compared to the app-level pool, PgBouncer reduces memory overhead on the database and improves stability under spikes.
For MySQL, ProxySQL plays a similar role.
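A minimal pgbouncer.ini sketch — the values here are illustrative, not prescriptive, so tune them to your workload:
[databases]
production_db = host=localhost port=5432 dbname=production_db

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling suits short-lived queries (see section 2)
pool_mode = transaction
default_pool_size = 20
max_client_conn = 1000
Your application then connects to port 6432 instead of 5432, and PgBouncer multiplexes those client connections onto a much smaller number of real server connections.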
2. Tuning Your Pool: The Balancing Act
A poorly tuned connection pool can be worse than no pool at all. Too few connections cause queuing delays; too many exhaust the database server’s memory.
Key Parameters
Parameter | Description | Example Value |
---|---|---|
max_connections | Max number of active connections | 20–50 for most web apps |
min_connections | Keep a baseline ready | 5–10 |
connection_lifetime | Recycle connections periodically | 30–60 minutes |
idle_timeout | Close idle connections | 300 seconds |
HikariCP-style fine-tuning logic
If you’re using SQLAlchemy, connection pooling is built in. Here’s a tuned setup:
from sqlalchemy import create_engine
engine = create_engine(
"postgresql+psycopg2://db_user:secret@localhost/production_db",
pool_size=20,
max_overflow=5,
pool_timeout=30,
pool_recycle=1800
)
Why this matters
- PostgreSQL defaults (max_connections=100) can become a bottleneck with large pools from multiple app servers.
- Monitor metrics like “active vs idle” connections and pool wait times (a quick check follows below).
- Use PgBouncer’s transaction pooling mode for short-lived queries; session mode for transactions.
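A quick way to eyeball the active-vs-idle split on PostgreSQL is the built-in pg_stat_activity view — a simple sketch:
-- Connection states for the application database
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'production_db'
GROUP BY state;
A pile of "idle in transaction" sessions usually points at application code holding transactions open longer than it should.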
3. Indexing: Your Database’s Shortcut System
Indexes are your database’s version of a book’s index — you don’t read every page to find a word.
When to Index
- Columns used in WHERE, JOIN, ORDER BY, or GROUP BY.
- Foreign keys and high-selectivity columns (many unique values).
PostgreSQL Example
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
Composite Index
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at);
In Python (with SQLAlchemy)
from sqlalchemy import Column, Integer, Index, DateTime, ForeignKey
from sqlalchemy.orm import declarative_base
Base = declarative_base()
class Order(Base):
__tablename__ = 'orders'
id = Column(Integer, primary_key=True)
customer_id = Column(Integer, ForeignKey('customers.id'))
created_at = Column(DateTime)
__table_args__ = (
Index('idx_customer_date', 'customer_id', 'created_at'),
)
Be Careful
Every index speeds up reads but slows down writes (INSERT, UPDATE, DELETE).
- Drop unused or redundant indexes.
- Periodically run ANALYZE to update statistics.
- Avoid indexing columns with low selectivity (like boolean flags).
MongoDB Comparison
MongoDB also uses B-tree indexes, but supports compound and text indexes with more flexibility. Use .explain("executionStats") to analyze query performance similarly to EXPLAIN ANALYZE in SQL.
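For completeness, here is a rough Python equivalent — a sketch assuming the pymongo driver and an orders collection (names are illustrative):
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client["production_db"]["orders"]

# Compound index, analogous to the composite SQL index above
orders.create_index([("customer_id", ASCENDING), ("created_at", DESCENDING)])

# Ask the server how it would execute the query
plan = orders.find({"customer_id": 42}).explain()
print(plan.get("executionStats", plan))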
4. ORM Queries: Convenience vs. Performance
ORMs are fantastic productivity boosters — until they silently generate 100 queries where one would do.
Example: The N+1 Query Problem
# Bad
orders = session.query(Order).all()
for order in orders:
print(order.customer.name) # Triggers a query per order
Fix: Eager Loading
from sqlalchemy.orm import joinedload
orders = session.query(Order).options(joinedload(Order.customer)).all()
Lazy Loading vs. Eager Loading
Mode | Description | When to Use |
---|---|---|
Lazy | Load related data only when accessed | For rarely accessed relationships |
Eager | Fetch related data in one go (via JOIN) | When relations are always needed |
Batch / Subquery Loading | Fetch related data in batches | When you have many children per parent |
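For the batch/subquery row above, SQLAlchemy's selectinload is a common choice. A sketch, reusing the session from earlier and assuming an Order.customer relationship is defined on the model:
from sqlalchemy.orm import selectinload

# One query for orders, plus one "SELECT ... WHERE id IN (...)" for the customers
orders = (
    session.query(Order)
    .options(selectinload(Order.customer))
    .all()
)
This avoids both the N+1 pattern and a single very wide JOIN when each parent has many children.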
MongoDB Analogy
MongoDB avoids joins, but you pay the cost in client-side aggregation. Embedding documents can mimic eager loading, while references behave more like lazy loading.
5. Pagination Done Right
The classic OFFSET + LIMIT approach works fine until you reach tens of millions of rows.
Why OFFSET Fails
OFFSET 1000000 makes the database scan and discard a million rows before returning the next 20.
Better: Keyset Pagination
SELECT * FROM orders
WHERE id > 1050
ORDER BY id
LIMIT 20;
Python Example
from sqlalchemy import text

def fetch_next_page(last_id):
    # Keyset pagination: resume strictly after the last id the client has seen
    query = text("SELECT * FROM orders WHERE id > :last_id ORDER BY id LIMIT 20;")
    with engine.connect() as conn:
        result = conn.execute(query, {"last_id": last_id})
        return result.fetchall()
When to Use OFFSET
Use only for small datasets or admin dashboards. For user-facing pagination, keyset pagination ensures consistent performance.
6. Stop Using SELECT * (Your Future Self Will Thank You)
🧑‍💻 “SELECT * is like ordering the entire menu when all you wanted was a sandwich.”
It’s easy, it works… until it doesn’t.
Fetching all columns might seem harmless during development, but in production it’s a silent performance killer:
- It transfers more data than needed, increasing network latency.
- It prevents the query planner from using covering indexes efficiently.
- It wastes CPU and I/O, especially if the table contains large JSON, TEXT, or BLOB columns.
But here’s the kicker 👇
When your schema changes — say a new column is added, renamed, or reordered — SELECT *
can break backend logic that depends on column order or specific field mappings.
ORMs and serializers that expect certain columns in a fixed order might suddenly throw runtime errors or return unexpected data.
Bad
SELECT * FROM customers;
Good
SELECT id, name, email FROM customers;
In Python
from sqlalchemy import text

result = conn.execute(text("SELECT id, name, email FROM customers;"))
for row in result:
print(row.id, row.name, row.email)
By explicitly selecting columns, you:
- Reduce query time and network overhead.
- Make your backend resilient to schema changes.
- Keep your APIs stable and predictable even as your database evolves.
So next time you reach for SELECT *, remember: it’s fine for a quick debug — but deadly in production.
7. When to Denormalize
Normalization reduces redundancy — great for integrity, but not always for speed.
In read-heavy workloads, excessive JOINs can kill performance.
Example: Before
SELECT o.id, c.name, p.title
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;
After Denormalization
Add redundant columns (e.g., customer_name, product_title) to orders.
ALTER TABLE orders ADD COLUMN customer_name TEXT;
Then populate and maintain them with triggers or background jobs.
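A sketch of the trigger approach on PostgreSQL 11+, using the column and table names from the example above:
-- Keep orders.customer_name in sync whenever an order is inserted
-- or its customer_id changes
CREATE OR REPLACE FUNCTION sync_customer_name() RETURNS trigger AS $$
BEGIN
  SELECT name INTO NEW.customer_name
  FROM customers
  WHERE id = NEW.customer_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_orders_customer_name
BEFORE INSERT OR UPDATE OF customer_id ON orders
FOR EACH ROW
EXECUTE FUNCTION sync_customer_name();
Note that a rename on the customers side won't propagate retroactively, so pair the trigger with a backfill or background job as mentioned above.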
MongoDB Comparison
MongoDB inherently embraces denormalization — embedding related documents (e.g., a customer object inside each order) for faster reads at the cost of redundancy.
8. JOINs: Necessary Evil
JOINs are powerful but must be used with care.
Optimizing JOINs
- Always index join keys.
- Keep joins between tables of similar size when possible.
- Avoid unnecessary LEFT JOINs (use INNER JOIN if possible).
- For multi-table joins, check the join order via EXPLAIN.
Example
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
PostgreSQL shows whether it used a Nested Loop, Hash Join, or Merge Join — each has different performance trade-offs.
9. Database Maintenance: The Forgotten Hero
Even perfectly tuned databases degrade over time.
Old rows, dead tuples, and stale stats accumulate — slowing everything down.
Routine Tasks
Task | Description |
---|---|
VACUUM | Reclaims space from deleted/updated rows |
ANALYZE | Updates query planner stats |
REINDEX | Rebuilds bloated indexes |
Partitioning | Splits large tables (e.g., by date) |
Archiving | Moves old data out of hot tables |
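Partitioning from the table above, as a minimal sketch using PostgreSQL declarative partitioning (table and column names are illustrative):
CREATE TABLE orders_by_month (
    id BIGINT,
    customer_id INT,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01 PARTITION OF orders_by_month
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
Queries that filter on created_at only touch the relevant partitions, and old partitions can be detached or dropped cheaply instead of running huge DELETEs.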
Example
VACUUM (VERBOSE, ANALYZE);
Or automate:
psql -d production_db -c "VACUUM ANALYZE;"
Python Maintenance Job
import schedule
import time
from sqlalchemy import text
def vacuum_analyze():
    # VACUUM cannot run inside a transaction block, so use an autocommit connection
    with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
        conn.execute(text("VACUUM ANALYZE;"))
    print("Maintenance complete.")
schedule.every().day.at("03:00").do(vacuum_analyze)
while True:
schedule.run_pending()
time.sleep(60)
10. Log and Analyze Slow Queries
You can’t fix what you don’t measure.
Enable Slow Query Logging (PostgreSQL)
In postgresql.conf:
log_min_duration_statement = 500 # log queries slower than 500ms
View in logs or monitoring systems like pgBadger, Datadog, or New Relic.
Example Query Analysis
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days';
You’ll see actual vs estimated cost, number of rows, and index usage.
MySQL has the slow_query_log; MongoDB exposes db.setProfilingLevel(1) for similar profiling.
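For MySQL, the equivalent knobs live in my.cnf — a sketch, with the path and threshold as illustrative values:
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
# Log anything slower than 500ms
long_query_time = 0.5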
11. Replication: Scaling Reads and Ensuring Redundancy
Replication copies data from a primary (write) node to one or more replicas (read-only).
Benefits
- Read scaling — distribute queries across replicas.
- High availability — failover in case of node failure.
- Offload backups and analytics — run them on replicas.
PostgreSQL Setup (simplified)
# On primary
wal_level = replica
max_wal_senders = 10
# On standby
primary_conninfo = 'host=primary-db user=replicator password=secret'
In Python (Routing Reads)
import random
import psycopg2
READ_REPLICAS = ["replica1.db", "replica2.db"]
def get_read_connection():
host = random.choice(READ_REPLICAS)
return psycopg2.connect(host=host, user='app', password='secret', dbname='production_db')
Be aware of replication lag — reads from replicas may be slightly stale.
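On PostgreSQL you can sanity-check that lag directly on a replica, for example:
-- Run on the replica: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
Keep in mind this value also grows when the primary is simply idle, not only when the replica falls behind.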
MongoDB Comparison
MongoDB’s replica sets natively handle this with automatic election and read preference modes (primaryPreferred, secondaryPreferred).
12. Sharding: When One Database Isn’t Enough
If replication scales reads, sharding scales both reads and writes by splitting data horizontally.
Imagine dividing your customers by region or ID range, each on a different shard.
Approach
- Choose a shard key (e.g., user_id, region).
- Route queries based on shard key.
- Avoid cross-shard joins and transactions.
Illustration (textual)
+-----------+ +------------+
| Shard 1 | | Shard 2 |
| User 1-50 | | User 51-100|
+-----------+ +------------+
\ /
\ /
+---- App ----+
Python Example
def get_shard_connection(user_id):
shard_id = user_id % 2
db_host = f"shard{shard_id}.db"
return psycopg2.connect(host=db_host, dbname="production_db", user="app", password="secret")
Sharding requires routing logic and consistent hashing strategies — consider libraries like Citus (for PostgreSQL) or managed sharded solutions (e.g., MongoDB Atlas, CockroachDB).
13. Profiling and Observability
PostgreSQL provides deep introspection tools:
- pg_stat_statements – aggregate query stats.
- EXPLAIN ANALYZE – visualize query execution.
- auto_explain – automatically logs plans for slow queries.
- System views like pg_stat_activity show active sessions.
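Note that pg_stat_statements must be preloaded at server start, and auto_explain is typically loaded the same way — a postgresql.conf sketch (restart required; the threshold is illustrative):
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '500ms'
Then run CREATE EXTENSION pg_stat_statements; once in each database you want to inspect.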
Python Integration
from sqlalchemy import text
with engine.connect() as conn:
    # Note: on PostgreSQL 13+ the column is named total_exec_time instead of total_time
    result = conn.execute(text("SELECT query, total_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"))
for row in result:
print(row)
Pair this with Grafana + Prometheus to visualize query times, buffer usage, and cache hit ratios.
Conclusion: Building a Performance-First Mindset
Backend performance is not about one magic tweak — it’s about discipline and visibility.
Start with fundamentals:
✅ Use connection pooling
✅ Index smartly
✅ Monitor ORM queries
✅ Denormalize carefully
✅ Maintain your DB regularly
✅ Scale with replication and sharding
Over time, these practices turn your backend from a fragile system into a resilient, high-performing service.
Database optimization isn’t just about making queries faster — it’s about designing systems that stay fast as they grow.