Shubham Kansal

Posted on Jun 10 • Originally published at shubhamkansal.com

DevOps Best Practices From the Trenches: What Actually Moves the Needle

#devops #aws #postgres #webdev

TL;DR: I've run production infrastructure for 14 years, founded DietGhar and NyayX, and cut a client's cloud bill by 40% in one sprint. This is not a tool checklist. These are the specific practices, with the numbers, that made a measurable difference.

Most DevOps content is a rehashed checklist. Use CI/CD. Write tests. Monitor things. Helpful in the same way "eat less, move more" is helpful for weight loss — technically correct and completely insufficient.

I've been building and running production infrastructure for 14 years. I've founded DietGhar (healthtech, live at dietghar.com) and NyayX (legaltech, live at nyayx.com), and I've consulted for teams running everything from two-person startups to platforms handling 2M-product catalogues under Black Friday load. What follows are the practices that actually moved needles, with specific numbers.

CI/CD is only as good as your slowest feedback loop

A pipeline that takes 40 minutes to run is a pipeline nobody trusts. Developers will start merging without waiting. For DietGhar I use GitHub Actions with a staged pipeline:

# .github/workflows/ci.yml
jobs:
  lint-typecheck:
    runs-on: ubuntu-latest
    timeout-minutes: 2
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint && npm run typecheck

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-typecheck
    timeout-minutes: 5
    strategy:
      matrix:
        shard: [1, 2, 3, 4]   # run in parallel
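    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }   # pin the same Node version in every job
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4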

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    timeout-minutes: 10
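    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_PASSWORD: test }
        ports: ['5432:5432']   # publish to the runner so tests can reach localhost:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports: ['6379:6379']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run test:integration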

Total: 12 minutes from push to green. That speed is intentional. If I let it creep to 30 minutes, the pipeline becomes ceremonial.
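A quick way to keep that number honest is to compute the worst case the timeouts allow. The sketch below is illustrative only: the job names, timeouts and `needs` edges are copied from the workflow above, while the helper names `worstCaseMinutes` and `pipelineBudget` are hypothetical. It ignores runner queue time.

```javascript
// Timeouts (minutes) and dependencies copied from the workflow above.
const jobs = {
  'lint-typecheck': { timeout: 2, needs: [] },
  'unit-tests': { timeout: 5, needs: ['lint-typecheck'] },
  'integration-tests': { timeout: 10, needs: ['unit-tests'] },
};

// Longest chain of timeouts ending at `name`. Matrix shards run in
// parallel, so a sharded job counts once.
function worstCaseMinutes(jobs, name) {
  const job = jobs[name];
  const upstream = job.needs.map((dep) => worstCaseMinutes(jobs, dep));
  return job.timeout + Math.max(0, ...upstream);
}

// Worst-case wall-clock time for the whole pipeline.
function pipelineBudget(jobs) {
  return Math.max(...Object.keys(jobs).map((name) => worstCaseMinutes(jobs, name)));
}

console.log(pipelineBudget(jobs)); // 17
```

The timeouts cap the pipeline at 17 minutes, so a typical 12-minute run leaves some slack before a job is killed. If the budget drifts upward in review, that is the moment to shard or split a stage.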

The second CI/CD failure mode is treating deployment as a one-way door. Every pipeline I build includes an automated rollback trigger: if the health check endpoint returns non-200 for 60 seconds post-deploy, the previous ECS task definition is reactivated automatically:

# In the deploy step: simplified ECS blue/green with health-check rollback

# Remember what is running now so there is something to roll back to
PREVIOUS_TASK_DEF_ARN=$(aws ecs describe-services \
  --cluster production --services api \
  --query 'services[0].taskDefinition' --output text)

NEW_TASK_DEF_ARN=$(aws ecs register-task-definition \
  --cli-input-json file://task-def.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

aws ecs update-service \
  --cluster production \
  --service api \
  --task-definition "$NEW_TASK_DEF_ARN"

# Wait and verify; roll back if the service does not stabilise.
# The waiter polls every 15 seconds and gives up after 40 checks.
if ! aws ecs wait services-stable --cluster production --services api; then
  echo "Deploy failed health check, rolling back..."
  aws ecs update-service \
    --cluster production \
    --service api \
    --task-definition "$PREVIOUS_TASK_DEF_ARN"
  exit 1
fi

A bad release is live only for the window between the deploy and the failed health check.
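The "non-200 for 60 seconds" rule itself is easy to get subtly wrong, so here is a minimal sketch of just the decision logic, separate from any AWS call. The function name `shouldRollBack` and the sample shape are assumptions for illustration. In a real pipeline the samples would come from polling the health endpoint after the deploy.

```javascript
// samples: [{ atMs, status }] ordered by time, collected after the deploy.
// Returns true once the endpoint has been non-200 continuously for windowMs.
function shouldRollBack(samples, windowMs = 60_000) {
  let failingSince = null;
  for (const { atMs, status } of samples) {
    if (status === 200) {
      failingSince = null; // any healthy response resets the clock
    } else {
      if (failingSince === null) failingSince = atMs;
      if (atMs - failingSince >= windowMs) return true;
    }
  }
  return false;
}

// Five failing probes, 15 seconds apart: the window is reached at 60s.
const failing = [0, 15, 30, 45, 60].map((s) => ({ atMs: s * 1000, status: 503 }));
console.log(shouldRollBack(failing)); // true
```

Requiring a continuous failure window, rather than a single bad probe, avoids rolling back on one slow response while a new task is still warming up.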

Infrastructure as code is your documentation

When I inherited a client's AWS account for a cost review, I found: 23 EC2 instances, 11 RDS instances, 6 load balancers. No Terraform, no CDK, no CloudFormation. Nobody knew what half of them did. Three were serving zero traffic. One had been running since 2019.

That engagement paid for itself in the first sprint. I wrote Terraform to model what was actually in use, deleted the orphaned resources, rightsized the rest against CloudWatch metrics, and moved batch workloads to Spot. The bill dropped 40% before I touched a single line of application code.

My rule: if I cannot recreate the entire environment from the repo in under 30 minutes, the infrastructure is undocumented. A minimal Terraform layout for a typical service:

infra/
  modules/
    ecs-service/    # reusable ECS service + ALB target group + autoscaling
    rds-postgres/   # RDS + PgBouncer sidecar + parameter group
    sqs-lambda/     # SQS queue + Lambda + IAM role + DLQ
  envs/
    staging/        # tfvars + state config
    production/     # tfvars + state config

One terraform apply per environment. Infrastructure you can't recreate is infrastructure you can't recover.

Observability before you ship, not after something breaks

With NyayX I instrumented the application before I wrote the first feature. Error rates, p95 latency, conversion funnel drop-off — all dashboarded before a single real user hit the system.

The payoff came in week two of beta. Onboarding had a 30% drop-off rate at step three. Without instrumentation I would have assumed users weren't interested. With it, I could see the drop-off happened on a specific form where validation errors weren't surfacing. A two-line fix.

For the Black Friday platform: three weeks before the event, we ran EXPLAIN ANALYZE on every query over 50ms. Found three queries doing full sequential scans on 2M+ row tables. Added composite indexes. Query times: 800ms to 12ms. No new servers. Observability data led to the right fix before it mattered.

Key metrics to track from day one:

Application layer:
  - p50 / p95 / p99 response time per route
  - Error rate (4xx vs 5xx, separated)
  - Apdex score

Infrastructure layer:
  - Postgres connection count (alert at 80% of max_connections)
  - Cache hit rate (alert if Redis hit rate drops below 90%)
  - Lambda error rate + duration p95

Business layer:
  - Conversion funnel drop-off per step
  - API error rate per endpoint (for externally-facing APIs)
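For readers who have not computed the application-layer numbers by hand, the following sketch shows one common definition of each. It assumes the nearest-rank method for percentiles and the standard Apdex formula. The latency samples and the 100 ms threshold are made up for illustration, and a metrics backend would normally do this for you.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of
// samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Apdex = (satisfied + tolerating / 2) / total,
// where satisfied <= T and tolerating <= 4T.
function apdex(latenciesMs, thresholdMs) {
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= thresholdMs) satisfied += 1;
    else if (ms <= 4 * thresholdMs) tolerating += 1;
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}

// Hypothetical response times for one route, in milliseconds.
const latencies = [12, 15, 20, 22, 30, 45, 80, 120, 400, 800];
console.log(percentile(latencies, 50)); // 30
console.log(percentile(latencies, 95)); // 800
console.log(apdex(latencies, 100));     // 0.8
```

The gap between p50 and p95 in the sample is the reason to track both: the median looks healthy while one request in twenty is painfully slow.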

Database connections are infrastructure, not an afterthought

At 12,000 concurrent users, your application will attempt thousands of simultaneous database connections. PostgreSQL gives each connection its own backend process, and in my experience throughput degrades sharply above roughly 200 of them. The solution: PgBouncer in transaction pooling mode, deployed as a sidecar or separate service between your app tier and Postgres.

# pgbouncer.ini — the config that saved Black Friday
[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 150

Before: 8,000 app-side connections → Postgres under severe load.
After: 8,000 app-side connections → 150 Postgres connections → 4x throughput.

This is a provisioning concern, not a code concern. The right moment to configure connection pooling is during infrastructure setup — it should be in your Terraform module alongside your RDS resource, not discovered during an incident.
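The arithmetic behind that config is worth checking at provisioning time. The sketch below is a hypothetical helper (`poolPlan` is not part of PgBouncer). It assumes a Postgres `max_connections` of 200 purely for illustration and reuses the 80% alert threshold from the metrics list. Because `default_pool_size` applies per user/database pair, the `pools` parameter multiplies it.

```javascript
// Sanity-check a PgBouncer pool size against Postgres limits.
function poolPlan({ clientConns, poolSize, pools = 1, maxConnections, alertFraction = 0.8 }) {
  const serverConns = poolSize * pools; // real Postgres connections
  const alertAt = Math.floor(maxConnections * alertFraction);
  return {
    serverConns,
    clientsPerServerConn: clientConns / serverConns,
    alertAt,
    headroomOk: serverConns < alertAt,
  };
}

// 8,000 app-side connections and a pool of 150, as in the config above.
// maxConnections: 200 is an assumed value, not taken from the article.
const plan = poolPlan({ clientConns: 8000, poolSize: 150, maxConnections: 200 });
console.log(plan.serverConns, plan.alertAt, plan.headroomOk); // 150 160 true
console.log(Math.round(plan.clientsPerServerConn));           // 53
```

A second user/database pair with the same pool size would double `serverConns` and fail the headroom check, which is exactly the kind of surprise better found in a plan than in an incident.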

Cloud cost is a product decision

Teams that treat cloud cost as a finance problem consistently overspend. Every architecture decision has a cost consequence:

Synchronous fan-out to 10 services vs an event queue? You're paying for the over-provisioning required to absorb spikes.
Polling an API every minute vs webhooks? You're paying for compute that generates mostly empty responses.

For the SP-API platform: the inventory sync originally polled the Catalog Items API for 50,000 SKUs on a 24-hour schedule. The right architecture was subscribing to ITEM_INVENTORY_UPDATE events via the Notifications API, processing changes as they arrived via SQS + Lambda. Sync latency: 24 hours → under 15 minutes. API call volume: down 90%. Compute cost dropped proportionally.

Review your cloud bill monthly. Flag any service growing faster than your user base for architectural review. That growth-rate discrepancy is almost always a design problem.
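The monthly check can be as simple as comparing two growth rates. This is a minimal sketch with invented service names and figures. The function names `growth` and `flagForReview` are hypothetical, and the input would realistically come from a cost export rather than literals.

```javascript
// Month-over-month growth as a fraction: 0.25 means +25%.
function growth(previous, current) {
  return (current - previous) / previous;
}

// Names of services whose spend grew faster than the user base.
function flagForReview(services, users) {
  const userGrowth = growth(users.previous, users.current);
  return services
    .filter((s) => growth(s.previous, s.current) > userGrowth)
    .map((s) => s.name);
}

const flagged = flagForReview(
  [
    { name: 'lambda', previous: 400, current: 440 },      // +10%
    { name: 'nat-gateway', previous: 200, current: 300 }, // +50%
  ],
  { previous: 10000, current: 12000 },                    // users: +20%
);
console.log(flagged); // [ 'nat-gateway' ]
```

A flagged service is a prompt for an architecture conversation, not proof of waste. A one-off migration or a launch can legitimately outpace user growth for a month.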

Environment parity prevents bugs that only appear in production

For NyayX, every environment — local, staging, production — runs the same Docker image built from the same Dockerfile:

# docker-compose.yml (local development)
services:
  api:
    build: .               # same Dockerfile as production
    image: nyayx-api:dev
    environment:
      - NODE_ENV=development
      - DATABASE_URL=postgres://postgres:postgres@postgres:5432/nyayx
    depends_on: [postgres, redis]

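  postgres:
    image: postgres:16
    environment:
      - POSTGRES_PASSWORD=postgres   # the image will not start without a password
      - POSTGRES_DB=nyayx            # matches the database name in DATABASE_URL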

  redis:
    image: redis:7-alpine

# ECS task definition (production) — same image, different env vars via Secrets Manager
container_definitions = jsonencode([{
  name  = "api"
  image = "${aws_ecr_repository.api.repository_url}:${var.image_tag}"
  environment = [
    { name = "NODE_ENV", value = "production" }
  ]
  secrets = [
    { name = "DATABASE_URL", valueFrom = aws_secretsmanager_secret.db_url.arn }
  ]
}])

The application code does not know which environment it is in beyond what environment variables tell it. This sounds obvious until you see a team discover that Node.js 16 in production and Node.js 20 in development produce different behaviour — found during a customer demo.
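A shared image is the real fix, but a cheap startup guard catches the case where someone runs the code outside the image. The sketch below is one possible guard. The function name `assertNodeMajor` and the choice of Node.js 20 as the expected version are assumptions for illustration.

```javascript
// Fail fast at startup if the runtime's major version is not the one
// the image was built for.
function assertNodeMajor(expectedMajor, actualVersion = process.versions.node) {
  const actualMajor = Number(actualVersion.split('.')[0]);
  if (actualMajor !== expectedMajor) {
    throw new Error(`Expected Node.js ${expectedMajor}.x, running ${actualVersion}`);
  }
  return actualMajor;
}

// Explicit versions shown here; omit the second argument to check the
// running process instead.
console.log(assertNodeMajor(20, '20.11.1')); // 20
```

Called once at process start, this turns a version mismatch into a crash on boot instead of a behavioural difference discovered in front of a customer.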

Security is not a phase at the end

NyayX is a legaltech platform handling sensitive legal documents. Security controls that are in production today were in the first pull request that touched each relevant subsystem:

-- Row-level security: tenants cannot query each other's data
-- regardless of application logic
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Audit log: append-only — no UPDATE or DELETE permission
REVOKE UPDATE, DELETE ON audit_log FROM api_role;

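// S3 objects private by default; pre-signed URLs for time-limited access
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3Client = new S3Client({});
const url = await getSignedUrl(s3Client, new GetObjectCommand({
  Bucket: process.env.DOCUMENTS_BUCKET,
  Key: documentKey,
}), { expiresIn: 900 }); // 15 minutes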

In my experience, retrofitting security into an existing system costs roughly 10x what it costs to design it in. This is the practice most frequently deferred and most frequently regretted.
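The row-level security policy only works if `app.tenant_id` is set correctly for every query. Behind PgBouncer in transaction pooling mode, a session-level `SET` is not safe, because the next transaction may run on a different server connection. One way to handle this, sketched below, is to set the value with `set_config(..., true)` inside the same transaction as the work. The helper name `withTenant` is hypothetical, and `client` is assumed to be any object with an async `query(text, params)` method, such as a checked-out node-postgres client. The article does not say which driver NyayX uses.

```javascript
// Run `work` inside a transaction with the tenant id visible to RLS policies.
async function withTenant(client, tenantId, work) {
  await client.query('BEGIN');
  try {
    // is_local = true scopes the setting to this transaction only
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

A request handler would then wrap its queries, for example `withTenant(client, tenantId, (c) => c.query('SELECT id FROM documents'))`, so that no code path can reach the table without a tenant in scope.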

Backups are a promise you haven't tested

Everyone has backups. Almost nobody has restores. For NyayX, which stores legal documents clients are legally required to retain, an untested backup is not a backup — it is a liability with good intentions.

I automate the restore, not just the backup:

#!/bin/bash
# Runs weekly via EventBridge + Lambda
set -euo pipefail
# RESTORE_URL and ALERT_TOPIC are expected in the environment

RESTORE_ID=nyayx-restore-test

# Tear down the throwaway DB on every exit path, including failed checks
cleanup() {
  aws rds delete-db-instance \
    --db-instance-identifier "$RESTORE_ID" \
    --skip-final-snapshot >/dev/null || true
}
trap cleanup EXIT

# Restore the newest available snapshot into a throwaway DB
# (sorted explicitly, since the API does not promise chronological order)
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier nyayx-production \
  --query 'sort_by(DBSnapshots[?Status==`available`], &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$RESTORE_ID" \
  --db-snapshot-identifier "$SNAPSHOT_ID"

aws rds wait db-instance-available \
  --db-instance-identifier "$RESTORE_ID"

# Run integrity checks (-A strips padding so the count compares as an integer)
ROW_COUNT=$(psql "$RESTORE_URL" -tA -c \
  "SELECT COUNT(*) FROM documents WHERE created_at > NOW() - INTERVAL '7 days'")

if [ "$ROW_COUNT" -lt 100 ]; then
  # Page me: something is wrong
  aws sns publish --topic-arn "$ALERT_TOPIC" \
    --message "Restore check FAILED: only $ROW_COUNT recent documents"
  exit 1
fi

A backup job succeeding tells me almost nothing. A restore succeeding tells me everything. I learned this the hard way: nightly backups had been succeeding for months while silently writing zero-byte files, because a rotated credential had quietly broken the export. The dashboard stayed green the entire time.

If you do one thing after reading this article, schedule a restore test before you schedule another backup.

Incident response needs a script before the incident

For every production system I run, there is a written runbook for the five most likely failure modes:

Database connection exhaustion
High error rate on the main API
Deployment rollback
Third-party API degradation
Certificate expiry

Each runbook has a decision tree, the commands to run, and the escalation path. When the SP-API platform went down during a peak sales window because Amazon's Notifications API was delayed, the runbook said: check SQS queue depth → check Lambda error rate → check SP-API status page → switch to polling fallback if SQS depth exceeds threshold. Resolution time: 11 minutes. Without the runbook, I estimate 45.
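A decision tree like that one is small enough to encode and test, which also makes it usable from an alert handler. The sketch below follows the order of checks described above. The function name `triage`, the signal names and the thresholds are illustrative placeholders, not values from the real runbook.

```javascript
// Thresholds are illustrative placeholders, not the real runbook's values.
const limits = { queueDepth: 1000, lambdaErrorRate: 0.05 };

// Walk the runbook checks in order and recommend an action.
function triage(signals, limits) {
  const findings = [];
  if (signals.queueDepth > limits.queueDepth) findings.push('sqs-backlog');             // step 1
  if (signals.lambdaErrorRate > limits.lambdaErrorRate) findings.push('lambda-errors'); // step 2
  if (signals.spApiDegraded) findings.push('sp-api-degraded');                          // step 3
  const action = signals.queueDepth > limits.queueDepth
    ? 'switch-to-polling-fallback'
    : 'keep-monitoring';
  return { findings, action };
}

const result = triage({ queueDepth: 4200, lambdaErrorRate: 0.01, spApiDegraded: true }, limits);
console.log(result.action); // switch-to-polling-fallback
```

Keeping the findings separate from the action means the same function can feed both the pager message and the post-incident note.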

After every incident, write a short post-incident note: what broke, the signal, detection time, resolution time, and the single change that would have prevented it. That SP-API incident produced one action item: alert on SQS queue depth. That alert has fired twice since, both times early enough that nobody downstream noticed.

An incident you learn nothing from is just downtime. An incident that produces one concrete prevention is cheap insurance.

The compounding return

Each of these practices compounds:

CI/CD with automated rollbacks → ship confidently at any time
Infrastructure as code → rebuild from disaster in 30 minutes
Observability before launch → find problems before customers do
Connection pooling + cost discipline → scale without re-architecture
Security by design → sell to enterprise without a security audit sprint
Tested restores → a bad day is an inconvenience, not a closure
Runbooks → incidents are learning exercises, not emergencies

I run DietGhar and NyayX as a solo founder-engineer. That is viable only because, with each of these practices in place, the infrastructure runs itself 95% of the time. DevOps best practices are not a tax on your time. They are how you get your time back.

Originally published at https://shubhamkansal.com/blog/devops-best-practices. I'm Shubham Kansal, a freelance Full Stack & DevOps engineer — more at https://shubhamkansal.com.
