Divine Ikhuoria

Scaling Django on Railway + S3 + Cloudflare for 1k+ concurrent users

Goal: Provide a practical, beginner-friendly, step-by-step guide to prepare a Django app (deployed on Railway) with S3 for static/media and Cloudflare as the CDN/edge, so it can reliably handle sustained peaks of ~1000 concurrent users.


Table of Contents

  1. TL;DR — One-page checklist
  2. Clarifications & assumptions
  3. Measure before you optimize
  4. Static files & media — S3 + Cloudflare (how & why)
  5. Database: Postgres, connection pooling & query tuning
  6. Caching strategy (Redis + patterns)
  7. Background jobs & throttling (Celery etc.)
  8. Web concurrency & process sizing (Gunicorn / Uvicorn)
  9. Cloudflare configuration & edge logic
  10. Observability & debugging essentials
  11. Cost control tips
  12. Rollout plan: 10 → 1000 concurrent users (step-by-step)
  13. Common pitfalls & how to avoid them
  14. Recommended resources & docs
  15. Appendix: Useful config snippets

TL;DR — One-page checklist

  • Measure realistic flows and determine p95/p99 latencies.
  • Offload static and media to S3 and cache them at Cloudflare edge.
  • Add a Redis instance and use django-redis for caches; use a separate Redis for Celery broker if needed.
  • Use Postgres + PgBouncer (or your provider's pooling) to avoid connection storms.
  • Move slow work to background workers (Celery/RQ/Huey) and scale workers independently.
  • Tune Django process concurrency to available CPU/memory.
  • Configure Cloudflare rules to bypass API/authenticated routes and cache public assets aggressively.
  • Add observability (Sentry + metrics + logs) and run iterative load tests.

Clarifications & assumptions

  • Concurrent users vs daily active users: "1000 concurrent users" means approximately 1000 simultaneous active connections or users at peak. This requires more capacity and different trade-offs than 1000 daily active users. Verify which you mean before provisioning.

  • Assumptions for this guide:

    • Django app hosted on Railway (or similar PaaS).
    • Postgres database (Railway or external).
    • S3-compatible object storage (AWS S3 recommended).
    • Cloudflare as DNS + CDN in front of Railway.

Measure before you optimize

Why: Optimization without measurements is guesswork. Know your bottlenecks.

Concrete steps:

  1. Identify a realistic user flow and script it (e.g., browse catalog -> open product -> add to cart -> checkout).
  2. Use load-test tools such as k6, Locust, or hey to simulate realistic traffic, including pacing (think time between actions).
  3. Collect these metrics:
     • Request rate (RPS)
     • p50/p95/p99 latencies for endpoints
     • Error rates
     • DB queries/sec and top slow queries
     • DB CPU and active connections
     • Redis hit/miss rate and eviction counts
     • Celery queue length and task durations
  4. Define SLOs: for example, p95 < 300ms and error rate < 0.5%.

Tip: Start small, run tests, then increase RPS until you find the bottleneck and address it.
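
For example, here is a minimal Locust sketch of such a flow. The paths, task weights, and think times are placeholders to adapt to your own routes:

# locustfile.py -- minimal load-test sketch; paths are illustrative
from locust import HttpUser, task, between

class Shopper(HttpUser):
    wait_time = between(1, 3)  # think time between actions (seconds)

    @task(3)
    def browse_catalog(self):
        self.client.get("/catalog/")

    @task(1)
    def view_product(self):
        # a fixed ID keeps the sketch simple; randomize in real tests
        self.client.get("/products/1/")

Run it with locust -f locustfile.py --host https://staging.example.com, start at a low user count, and ramp up until one of the metrics above degrades.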


Static files & media — S3 + Cloudflare (how & why)

Why this matters

  • Serving static or uploaded files from your web dyno consumes CPU, memory, and bandwidth and increases origin response times. S3 + Cloudflare moves those costs to object storage and the CDN.

Best-practices

  • Compile/build assets with hashed filenames (cache-busting by content hash).
  • Use Cache-Control: public, max-age=31536000, immutable for hashed build assets.
  • Use presigned uploads so clients upload directly to S3 (web dyno never handles large file bodies).
  • Use S3 lifecycle policies to move infrequently accessed objects to cheaper classes (Infrequent Access, Glacier).
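
A sketch of the first two bullets with django-storages (assuming django-storages and boto3 are installed; the ASSETS_DOMAIN variable is illustrative, and on Django 4.2+ the STORAGES setting replaces STATICFILES_STORAGE):

# settings.py (static assets excerpt)
import os

AWS_STORAGE_BUCKET_NAME = os.environ["AWS_STORAGE_BUCKET_NAME"]
# Point asset URLs at a Cloudflare-fronted domain if you have one
AWS_S3_CUSTOM_DOMAIN = os.environ.get(
    "ASSETS_DOMAIN", f"{AWS_STORAGE_BUCKET_NAME}.s3.amazonaws.com"
)
# Manifest storage rewrites template references to content-hashed filenames
STATICFILES_STORAGE = "storages.backends.s3boto3.S3ManifestStaticStorage"
# Safe to cache forever because hashed filenames change with content
AWS_S3_OBJECT_PARAMETERS = {"CacheControl": "public, max-age=31536000, immutable"}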

Presigned POST example (Django DRF view)

# uploads/views.py
import uuid
from django.conf import settings
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAuthenticated
from rest_framework.response import Response
import boto3

@api_view(['GET'])
@permission_classes([IsAuthenticated])
def presign_upload(request):
    s3 = boto3.client('s3', region_name=settings.AWS_REGION)
    # Namespace keys per user to avoid collisions
    key = f"uploads/{request.user.id}/{uuid.uuid4()}.jpg"
    presigned = s3.generate_presigned_post(
        Bucket=settings.AWS_STORAGE_BUCKET_NAME,
        Key=key,
        Fields={"Content-Type": "image/jpeg"},
        Conditions=[
            {"Content-Type": "image/jpeg"},
            # Enforce a size limit (here 10 MB) at the S3 side
            ["content-length-range", 1, 10 * 1024 * 1024],
        ],
        ExpiresIn=60,  # presigned POST is valid for 60 seconds
    )
    return Response({'url': presigned['url'], 'fields': presigned['fields'], 'key': key})

Security notes

  • Validate file types and sizes with S3 conditions and client-side checks.
  • Use server-side virus scanning or processing for untrusted uploads if required in your domain.

Database — Postgres + connection pooling + query tuning

Problems to avoid

  • Large numbers of concurrent DB connections cause memory and CPU strain on the DB server.

Connection pooling

  • Use PgBouncer (transaction pooling) in front of Postgres to multiplex many client connections onto fewer DB backends. If your code relies on session-level state (e.g., SET commands), use session pooling instead and size accordingly.
  • Set CONN_MAX_AGE (e.g., 60 seconds) inside the DATABASES config so Django reuses TCP connections instead of reconnecting on every request. With transaction pooling, also set DISABLE_SERVER_SIDE_CURSORS = True, since server-side cursors do not survive across pooled transactions, as sketched below.
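
A minimal sketch of those settings, assuming dj-database-url is installed and DATABASE_URL points at PgBouncer rather than at Postgres directly:

# settings.py (database excerpt)
import dj_database_url

DATABASES = {
    "default": {
        **dj_database_url.config(),
        "CONN_MAX_AGE": 60,  # keep connections open for reuse (seconds)
        # Needed with PgBouncer transaction pooling: server-side cursors
        # do not survive across pooled transactions.
        "DISABLE_SERVER_SIDE_CURSORS": True,
    }
}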

Query & index hygiene

  • Add indexes on frequently-filtered/sorted fields (e.g., vendor_id, status, created_at).
  • Use select_related() for foreign-key single-object joins and prefetch_related() for M2M or reverse relationships (see the sketch after this list).
  • Use pg_stat_statements or your host’s slow-query logging to find top offenders.
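
As an illustration of the indexing and join advice on a hypothetical Order model (field names are made up):

# models.py -- illustrative only
from django.db import models

class Order(models.Model):
    vendor = models.ForeignKey("Vendor", on_delete=models.CASCADE)
    status = models.CharField(max_length=20)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        # Composite index matching a common filter/sort pattern
        indexes = [models.Index(fields=["vendor", "status", "created_at"])]

# One query for orders + vendors, one batched query for items
# (a hypothetical related name), instead of N+1 in the template loop:
orders = (
    Order.objects.filter(status="paid")
    .select_related("vendor")
    .prefetch_related("items")
)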

Sizing guidance

  • Estimate pool sizes from (max web concurrency) × (avg DB connections per request). For example, 4 instances × 3 workers × 2 threads ≈ 24 concurrent requests, which PgBouncer can multiplex onto a much smaller pool of server connections (say 10-15).
  • Let PgBouncer reduce backend connections, but still monitor pg_stat_activity to detect connection pressure.

Caching strategy (Redis + patterns)

Redis usages

  • Django cache backend (django-redis) for page fragments and small objects.
  • Rate-limiting counters and session caching (if you choose not to use DB-backed sessions).
  • Celery broker (or dedicated broker) — for high-volume systems, consider a separate Redis instance for broker vs cache to avoid contention.

Example CACHES config

# settings.py (cache excerpt)
import os

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": os.environ.get("REDIS_URL"),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # Recent redis-py versions pick the faster hiredis parser
            # automatically when the hiredis package is installed, so no
            # PARSER_CLASS setting is needed.
        },
    }
}

Cache key design

  • Use versioned keys for easy invalidation: e.g., menu:vendor:{id}:v{version} (sketched after this list).
  • Cache only what you can invalidate or can tolerate stale values for. For critical, near-real-time content (order status), prefer short TTLs or event-driven invalidation.
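
A small sketch of the versioned-key pattern for a hypothetical vendor menu:

# cache_keys.py -- versioned keys; the menu naming is hypothetical
from django.core.cache import cache

def menu_key(vendor_id):
    version = cache.get(f"menu:vendor:{vendor_id}:version", 1)
    return f"menu:vendor:{vendor_id}:v{version}"

def invalidate_menu(vendor_id):
    # add() seeds the counter if missing; incr() then bumps it atomically,
    # which orphans every key built from the old version (old entries
    # simply age out via their TTLs).
    cache.add(f"menu:vendor:{vendor_id}:version", 1)
    cache.incr(f"menu:vendor:{vendor_id}:version")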

Background jobs & throttling (Celery, RQ, etc.)

What to push to background workers

  • Emails, PDF generation, image processing, payouts, and long webhooks.

Minimal Celery example

# myproject/celery.py
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
app = Celery('myproject')
# Read CELERY_*-prefixed settings from Django's settings.py
app.config_from_object('django.conf:settings', namespace='CELERY')
# Discover tasks.py modules across installed apps
app.autodiscover_tasks()

# Also import the app in myproject/__init__.py so it starts with Django:
#   from .celery import app as celery_app

Worker tuning tips

  • Set task_acks_late = True for safer retries; tasks must then be idempotent (see the sketch after this list).
  • Start with conservative concurrency (2–4) and scale horizontally by running more worker processes as needed.
  • Protect heavy endpoints with rate-limiting using DRF throttling, Cloudflare rate limits, or an API gateway.
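
A sketch of an idempotent task with late acknowledgement (the Order model and send_receipt_email helper are hypothetical):

# tasks.py -- illustrative retry + idempotency pattern
from celery import shared_task
from myapp.models import Order  # hypothetical model

@shared_task(
    acks_late=True,              # ack only after the task body completes
    autoretry_for=(Exception,),  # retry on any failure...
    retry_backoff=True,          # ...with exponential backoff
    max_retries=3,
)
def send_receipt(order_id):
    order = Order.objects.get(pk=order_id)
    if order.receipt_sent:       # idempotency guard: safe to re-run
        return
    send_receipt_email(order)    # hypothetical helper
    order.receipt_sent = True
    order.save(update_fields=["receipt_sent"])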

Web concurrency & process sizing (Gunicorn / Uvicorn)

Basic rules

  • For synchronous Gunicorn workers, a common starting heuristic is workers = 2 * CPU + 1 — then tune downward/upward based on memory and throughput.
  • For async needs (WebSockets or many concurrent idle connections), prefer ASGI (uvicorn) with fewer workers and async event loop (uvloop).

Examples

Gunicorn (synchronous + threads):

gunicorn myproject.wsgi:application \
  --workers 3 \
  --threads 2 \
  --timeout 30 \
  --bind 0.0.0.0:$PORT

Uvicorn (ASGI — for WebSockets/async):

uvicorn myproject.asgi:application --host 0.0.0.0 --port $PORT --workers 1 --loop uvloop --ws websockets

Memory considerations

  • Each worker is a process and consumes memory. On small Railway dynos (e.g., 512MB), prefer 1–2 workers and rely on horizontal scaling instead of packing many workers into a single small instance.

Cloudflare configuration & edge logic

Caching rules

  • Use Cloudflare Cache Rules to:

    • Bypass cache for paths like /api/*, /admin/*, and requests containing Authorization headers or session cookies.
    • Cache public assets aggressively using long TTLs and content-hashed filenames (a matching origin-side sketch follows this list).
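
On the Django side, make public endpoints explicitly cacheable so the edge has headers to respect. A minimal sketch, assuming a hypothetical public vendor_menu view:

# views.py -- edge-cache hints from the origin
from django.http import JsonResponse
from django.views.decorators.cache import cache_control

# s_maxage becomes s-maxage, which shared caches like Cloudflare honour
@cache_control(public=True, max_age=300, s_maxage=3600)
def vendor_menu(request, vendor_id):
    # hypothetical public endpoint; the data layer is elided
    return JsonResponse({"vendor": vendor_id, "items": []})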

Edge features

  • Cloudflare Workers let you run light logic at the edge (A/B tests, redirects, personalization) without hitting the origin.
  • Use Tiered Cache to reduce origin requests for global traffic.
  • Use Cloudflare rate-limiting and WAF rules to mitigate abusive traffic.

Observability & debugging essentials

Minimum stack

  • Error tracking: Sentry for exceptions and performance traces.
  • Logs: central log aggregator (Papertrail, LogDNA, or your hosted solution).
  • Metrics: track RPS, latency, DB connections, Redis hit rate, worker queue lengths. Use Prometheus/Grafana or a hosted metrics provider.
  • Uptime: health endpoints and external monitors (UptimeRobot, Pingdom).

Health check endpoint

Provide a lightweight /health that checks DB and Redis connectivity and returns a simple JSON 200/500 so load balancers and monitors can check service health.
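
A minimal sketch of such an endpoint:

# health/views.py -- lightweight dependency check
from django.core.cache import cache
from django.db import connection
from django.http import JsonResponse

def health(request):
    status, code = {"db": "ok", "cache": "ok"}, 200
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")  # cheap DB round-trip
    except Exception:
        status["db"], code = "error", 500
    try:
        cache.set("healthcheck", "ok", timeout=5)  # cheap Redis round-trip
    except Exception:
        status["cache"], code = "error", 500
    return JsonResponse(status, status=code)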


Cost control tips

  • Use Cloudflare free tier for DNS and basic CDN functionality — it reduces bandwidth to the origin.
  • Use S3 lifecycle policies to archive infrequently accessed media to cheaper storage tiers.
  • Right-size Railway services; scale workers only when the queue length or CPU indicates the need.
  • Avoid proxying uploads through web dynos and avoid storing blobs in your DB.

Rollout plan: 10 → 1000 concurrent users (step-by-step)

  1. Baseline: integrate Sentry, basic metrics, and logging. Add the /health endpoint.
  2. Move build assets & static files to S3. Put Cloudflare in front and confirm cache hit rates.
  3. Add Redis and implement fragment/full-page caching for public content.
  4. Switch to presigned uploads for user media.
  5. Enable PgBouncer or provider-managed pooling. Tune CONN_MAX_AGE.
  6. Move slow tasks to Celery/RQ and set worker scaling rules.
  7. Run iterative load tests (k6/Locust). Monitor DB, Redis, and worker metrics.
  8. Harden Cloudflare (cache bypass for API routes, rate limits, firewall rules).
  9. Optimize the top slow queries discovered during tests.
  10. Repeat the measurement and optimization cycle.

Common pitfalls & how to avoid them

  • Too many DB connections: Use PgBouncer; monitor pg_stat_activity.
  • Caching authenticated content accidentally: Use cache-bypass rules for private routes and Authorization headers.
  • Proxying uploads through the web dyno: Use presigned uploads so large file bodies never tie up dyno memory or request workers.
  • Ignoring observability: Add Sentry and metrics early. You cannot optimize what you do not measure.
  • Underestimating memory per worker: Test and monitor memory footprints before scaling worker counts.

Recommended resources & docs

  • Railway documentation for deploying Django and managing services.
  • PostgreSQL and PgBouncer docs for connection pooling best practices.
  • Cloudflare docs: Cache Rules, Workers, Tiered Cache, and rate-limiting.
  • Boto3 docs for generate_presigned_post.
  • Django ASGI / Uvicorn documentation for async features and websockets.

Appendix: Useful config snippets

Django settings (snippets)

# settings.py (excerpt)
import os

DATABASES = {
    "default": {
        # ... engine and credentials ...
        "CONN_MAX_AGE": 60,  # per-database option, not a top-level setting
    }
}

CACHES = { ... }  # django-redis example (see Caching strategy)

# Deprecated in Django 4.2 in favour of the STORAGES setting
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_S3_REGION_NAME = os.environ.get("AWS_REGION")
AWS_STORAGE_BUCKET_NAME = os.environ.get("AWS_STORAGE_BUCKET_NAME")

Procfile (Railway style)

web: gunicorn myproject.wsgi:application --bind 0.0.0.0:$PORT --workers 3 --threads 2 --timeout 30
worker: celery -A myproject worker --loglevel=info --concurrency=${CELERY_CONCURRENCY:-2}

Celery minimal config

# celery.py
from celery import Celery
import os

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
app = Celery('myproject')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

Gunicorn start example

gunicorn myproject.wsgi:application --workers 3 --threads 2 --timeout 30

Conclusion

Scaling to 1k+ concurrent users is iterative: measure realistic flows, remove origin bottlenecks, offload work to caches and background processes, and protect the origin with Cloudflare. With Railway + S3 + Cloudflare, you can build an efficient, cost-effective stack — but success depends on careful DB pooling, cache boundaries, worker sizing, and continuous measurement.
