Somewhere right now, a developer is staring at a terminal, realizing the DELETE FROM users they just ran did not have a WHERE clause. Their staging tab was not the active tab. The table had 2.4 million rows. It has zero now.
GitLab famously lost six hours of production data in 2017 when an engineer ran rm -rf on the wrong database directory during a late-night incident. This is not rare. It happens constantly.
Production database safety is not about being careful. Careful people make mistakes at 2 AM after sixteen hours of debugging. Safety comes from systems, defaults, and friction.
Rule 1: Production Is Red, Always
The most common cause of accidental production writes is tab confusion. Four database connections open. Staging and production look identical.
The fix is visual. Production should look unmistakably different from every other environment:
- Production: red background, red title bar
- Staging: orange or yellow
- Development: green
- Local: default / neutral
This is not cosmetic. This is the cheapest safety measure with the highest return. Every team that adopts environment coloring reports fewer "wrong environment" incidents.
Rule 2: Read-Only by Default
How often do you actually write to production? For most developers, 95% of production access is read access. So why does your connection default to full read-write?
-- What you run 95% of the time (safe in read-only mode):
SELECT o.id, o.status, o.total
FROM orders o
WHERE o.user_id = 48291
ORDER BY o.created_at DESC
LIMIT 20;
-- PostgreSQL: set read-only at the session level
SET default_transaction_read_only = ON;
-- When you genuinely need to write:
SET default_transaction_read_only = OFF;
When you need to write, you explicitly switch. One extra step that prevents more accidental UPDATE statements than any code review.
Rule 3: Back Up Before ALTER
Schema changes are irreversible in a way data changes are not. You can roll back a bad UPDATE from a backup. But ALTER TABLE DROP COLUMN destroys data immediately.
-- Before dropping a column, back it up:
CREATE TABLE orders_backup_20260524 AS
SELECT id, legacy_metadata FROM orders;
-- Now you have a safety net:
ALTER TABLE orders DROP COLUMN legacy_metadata;
For ALTER TABLE operations, always estimate lock duration. In PostgreSQL, many ALTER TABLE operations acquire ACCESS EXCLUSIVE locks, blocking all reads and writes:
-- This locks the entire table until it finishes:
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
-- This is instant in PostgreSQL 11+:
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ DEFAULT NOW();
-- Adding a column with a non-volatile default is instant since PG 11.
Plan schema changes like surgery: with imaging, preparation, and a way to abort.
Rule 4: Never DELETE Without Verifying First
-- The pattern that kills databases:
DELETE FROM orders;
-- Missing WHERE clause. Everything gone.
-- The safe pattern:
-- Step 1: Verify scope
SELECT COUNT(*) FROM orders
WHERE status = 'cancelled' AND created_at < '2026-01-01';
-- Returns: 847
-- Step 2: Inspect a sample
SELECT id, user_id, status FROM orders
WHERE status = 'cancelled' AND created_at < '2026-01-01'
LIMIT 10;
-- Step 3: Delete with the exact same WHERE clause
DELETE FROM orders
WHERE status = 'cancelled' AND created_at < '2026-01-01';
-- 847 rows deleted. Matches the count.
SELECT first, then DELETE. The oldest database safety trick, still the most effective.
Rule 5: Wrap Mutations in Transactions
Every manual mutation on production should be wrapped in a transaction.
-- DANGEROUS: runs immediately, no undo
UPDATE users SET plan = 'enterprise' WHERE company_id = 4012;
-- SAFE: wrapped in a transaction
BEGIN;
UPDATE users SET plan = 'enterprise' WHERE company_id = 4012;
-- Check what you just did:
SELECT id, email, plan FROM users WHERE company_id = 4012;
-- Does this look right?
COMMIT; -- If yes
ROLLBACK; -- If no
The transaction gives you a window between execution and commitment. You can inspect results before they become permanent.
UPDATE is the most dangerous DML operation. A bad INSERT adds rows you can delete. A bad DELETE removes rows you can restore. But a bad UPDATE overwrites data in place.
Rule 6: Gate Production Access Behind Biometrics
Passwords are shared. SSH keys sit in ~/.ssh forever. API tokens get committed to .env files.
Biometric authentication (Touch ID on Mac, Windows Hello) adds a gate that cannot be shared and creates a moment of intentional friction.
Before your database client opens a production connection, it prompts for your fingerprint. Every time. That half-second pause is a moment to ask: "Do I actually need to be in production right now?"
Rule 7: Enforce Least Privilege Access
Most incidents are caused by developers who have more permissions than they need.
-- Role for developers (read-only debugging):
CREATE ROLE dev_readonly;
GRANT CONNECT ON DATABASE production TO dev_readonly;
GRANT USAGE ON SCHEMA public TO dev_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_readonly;
-- Role for the application backend:
CREATE ROLE app_service;
GRANT CONNECT ON DATABASE production TO app_service;
GRANT USAGE ON SCHEMA public TO app_service;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO app_service;
-- Note: no DELETE, no DROP, no ALTER
-- Role for DBAs (used rarely):
CREATE ROLE dba_admin;
GRANT ALL PRIVILEGES ON DATABASE production TO dba_admin;
Developers should not routinely use accounts with write access to production.
Rule 8: Log Everything, Audit Regularly
-- postgresql.conf:
-- log_connections = on
-- log_disconnections = on
-- log_statement = 'mod' -- Logs INSERT, UPDATE, DELETE, DDL
-- log_line_prefix = '%t [%p] %u@%d '
-- For granular auditing:
CREATE EXTENSION pgaudit;
SET pgaudit.log = 'write, ddl';
Review production access patterns weekly. Look for:
- Users who accessed production but shouldn't have
- Write operations outside deployment windows
- Connections from unexpected IP addresses
The goal is not catching bad actors. It's catching mistakes before they compound.
Rule 9: Separate Credentials Per Environment
# BAD: same credentials, different hosts
DB_HOST=staging-db.internal
DB_USER=app
DB_PASS=s3cret123
# GOOD: completely separate credential sets
STAGING_DB_HOST=staging-db.internal
STAGING_DB_USER=app_staging
STAGING_DB_PASS=staging_pass_abc
PRODUCTION_DB_HOST=prod-db.internal
PRODUCTION_DB_USER=app_production
PRODUCTION_DB_PASS=prod_pass_xyz
When credentials are separate, a leaked staging password is not a leaked production password.
Additional safeguards:
- Rotate production credentials quarterly minimum
- Use short-lived credentials where possible (IAM database auth on AWS/GCP)
- Never put production credentials in Slack, docs, or emails
- If a credential might have been exposed, rotate immediately
Rule 10: Test on Staging First, Every Time
"I'll just run this quick query on production" is the most dangerous sentence in software engineering.
-- "Just adding a column, it'll be instant"
ALTER TABLE orders ADD COLUMN tracking_url TEXT;
-- On 50M rows in PostgreSQL < 11: rewrites entire table.
-- Production down for 12 minutes.
Your staging-to-production workflow:
- Write the change
- Run it on staging
- Measure: execution time, lock duration, replication impact
- Compare results to expectations
- If everything matches, run on production
- If anything is surprising, investigate first
The Reliability Hierarchy
- Impossible (best): Tool prevents the dangerous action (read-only mode blocks writes)
- Difficult: Tool adds friction (biometrics before production connections)
- Visible: Tool makes risk obvious (red = production)
- Documented (worst): A wiki page says "be careful." Nobody reads it.
Most teams operate at level 4. The goal is levels 1 through 3.
I'm building QueryDeck, a native macOS database client with color-coded connections, read-only defaults, and Touch ID gating for production. Currently in early access.
Top comments (0)