Divine Ikhuoria

Scaling Django on Railway + S3 + Cloudflare for 1k+ concurrent users

Goal: Provide a practical, beginner-friendly, step-by-step guide to prepare a Django app (deployed on Railway) with S3 for static/media and Cloudflare as the CDN/edge, so it can reliably handle sustained peaks of ~1000 concurrent users.


Table of Contents

  1. TL;DR — One-page checklist
  2. Clarifications & assumptions
  3. Measure before you optimize
  4. Static files & media — S3 + Cloudflare (how & why)
  5. Database: Postgres, connection pooling & query tuning
  6. Caching strategy (Redis + patterns)
  7. Background jobs & throttling (Celery etc.)
  8. Web concurrency & process sizing (Gunicorn / Uvicorn)
  9. Cloudflare configuration & edge logic
  10. Observability & debugging essentials
  11. Cost control tips
  12. Rollout plan: 10 → 1000 concurrent users (step-by-step)
  13. Common pitfalls & how to avoid them
  14. Recommended resources & docs
  15. Appendix: Useful config snippets

TL;DR — One-page checklist

  • Measure realistic flows and determine p95/p99 latencies.
  • Offload static and media to S3 and cache them at Cloudflare edge.
  • Add a Redis instance and use django-redis for caches; use a separate Redis for Celery broker if needed.
  • Use Postgres + PgBouncer (or your provider's pooling) to avoid connection storms.
  • Move slow work to background workers (Celery/RQ/Huey) and scale workers independently.
  • Tune Django process concurrency to available CPU/memory.
  • Configure Cloudflare rules to bypass API/authenticated routes and cache public assets aggressively.
  • Add observability (Sentry + metrics + logs) and run iterative load tests.

Clarifications & assumptions

  • Concurrent users vs daily active users: "1000 concurrent users" means approximately 1000 simultaneous active connections or users at peak. This requires more capacity and different trade-offs than 1000 daily active users. Verify which you mean before provisioning.

  • Assumptions for this guide:

    • Django app hosted on Railway (or similar PaaS).
    • Postgres database (Railway or external).
    • S3-compatible object storage (AWS S3 recommended).
    • Cloudflare as DNS + CDN in front of Railway.

Measure before you optimize

Why: Optimization without measurements is guesswork. Know your bottlenecks.

Concrete steps:

  1. Identify a realistic user flow and script it (e.g., browse catalog -> open product -> add to cart -> checkout).
  2. Use load-test tools such as k6, Locust, or hey to simulate realistic traffic, including pacing (think time between actions).
  3. Collect these metrics:
     • Request rate (RPS)
     • p50/p95/p99 latencies for endpoints
     • Error rates
     • DB queries/sec and top slow queries
     • DB CPU and active connections
     • Redis hit/miss rate and eviction counts
     • Celery queue length and task durations
  4. Define SLOs: for example, p95 < 300ms and error rate < 0.5%.

Tip: Start small, run tests, then increase RPS until you find the bottleneck and address it.
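
For example, here is a minimal Locust sketch of such a flow. The paths, task weights, and think times are placeholders to adapt to your own routes:

# locustfile.py -- minimal load-test sketch; paths are illustrative
from locust import HttpUser, task, between

class Shopper(HttpUser):
    wait_time = between(1, 3)  # think time between actions (seconds)

    @task(3)
    def browse_catalog(self):
        self.client.get("/catalog/")

    @task(1)
    def view_product(self):
        # a fixed ID keeps the sketch simple; randomize in real tests
        self.client.get("/products/1/")

Run it with locust -f locustfile.py --host https://staging.example.com, start at a low user count, and ramp up until one of the metrics above degrades.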


Static files & media — S3 + Cloudflare (how & why)

Why this matters

  • Serving static or uploaded files from your web dyno consumes CPU, memory, and bandwidth and increases origin response times. S3 + Cloudflare moves those costs to object storage and the CDN.

Best-practices

  • Compile/build assets with hashed filenames (cache-busting by content hash).
  • Use Cache-Control: public, max-age=31536000, immutable for hashed build assets.
  • Use presigned uploads so clients upload directly to S3 (web dyno never handles large file bodies).
  • Use S3 lifecycle policies to move infrequently accessed objects to cheaper classes (Infrequent Access, Glacier).
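
A sketch of the first two bullets with django-storages (assuming django-storages and boto3 are installed; the ASSETS_DOMAIN variable is illustrative, and on Django 4.2+ the STORAGES setting replaces STATICFILES_STORAGE):

# settings.py (static assets excerpt)
import os

AWS_STORAGE_BUCKET_NAME = os.environ["AWS_STORAGE_BUCKET_NAME"]
# Point asset URLs at a Cloudflare-fronted domain if you have one
AWS_S3_CUSTOM_DOMAIN = os.environ.get(
    "ASSETS_DOMAIN", f"{AWS_STORAGE_BUCKET_NAME}.s3.amazonaws.com"
)
# Manifest storage rewrites template references to content-hashed filenames
STATICFILES_STORAGE = "storages.backends.s3boto3.S3ManifestStaticStorage"
# Safe to cache forever because hashed filenames change with content
AWS_S3_OBJECT_PARAMETERS = {"CacheControl": "public, max-age=31536000, immutable"}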

Presigned POST example (Django DRF view)

# uploads/views.py
import uuid
from django.conf import settings
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAuthenticated
from rest_framework.response import Response
import boto3

@api_view(['GET'])
@permission_classes([IsAuthenticated])
def presign_upload(request):
    s3 = boto3.client('s3', region_name=settings.AWS_REGION)
    # Namespace keys per user to avoid collisions
    key = f"uploads/{request.user.id}/{uuid.uuid4()}.jpg"
    presigned = s3.generate_presigned_post(
        Bucket=settings.AWS_STORAGE_BUCKET_NAME,
        Key=key,
        Fields={"Content-Type": "image/jpeg"},
        Conditions=[
            {"Content-Type": "image/jpeg"},
            # Enforce a size limit (here 10 MB) at the S3 side
            ["content-length-range", 1, 10 * 1024 * 1024],
        ],
        ExpiresIn=60,  # presigned POST is valid for 60 seconds
    )
    return Response({'url': presigned['url'], 'fields': presigned['fields'], 'key': key})

Security notes

  • Validate file types and sizes with S3 conditions and client-side checks.
  • Use server-side virus scanning or processing for untrusted uploads if required in your domain.

Database — Postgres + connection pooling + query tuning

Problems to avoid

  • Large numbers of concurrent DB connections cause memory and CPU strain on the DB server.

Connection pooling

  • Use PgBouncer (transaction pooling) in front of Postgres to multiplex many client connections onto fewer DB backends. If your code relies on session-level state (e.g., SET commands), use session pooling instead and size accordingly.
  • Set CONN_MAX_AGE (e.g., 60 seconds) inside the DATABASES config so Django reuses TCP connections instead of reconnecting on every request. With transaction pooling, also set DISABLE_SERVER_SIDE_CURSORS = True, since server-side cursors do not survive across pooled transactions, as sketched below.
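
A minimal sketch of those settings, assuming dj-database-url is installed and DATABASE_URL points at PgBouncer rather than at Postgres directly:

# settings.py (database excerpt)
import dj_database_url

DATABASES = {
    "default": {
        **dj_database_url.config(),
        "CONN_MAX_AGE": 60,  # keep connections open for reuse (seconds)
        # Needed with PgBouncer transaction pooling: server-side cursors
        # do not survive across pooled transactions.
        "DISABLE_SERVER_SIDE_CURSORS": True,
    }
}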

Query & index hygiene

  • Add indexes on frequently-filtered/sorted fields (e.g., vendor_id, status, created_at).
  • Use select_related() for foreign-key single-object joins and prefetch_related() for M2M or reverse relationships (see the sketch after this list).
  • Use pg_stat_statements or your host’s slow-query logging to find top offenders.
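
As an illustration of the indexing and join advice on a hypothetical Order model (field names are made up):

# models.py -- illustrative only
from django.db import models

class Order(models.Model):
    vendor = models.ForeignKey("Vendor", on_delete=models.CASCADE)
    status = models.CharField(max_length=20)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        # Composite index matching a common filter/sort pattern
        indexes = [models.Index(fields=["vendor", "status", "created_at"])]

# One query for orders + vendors, one batched query for items
# (a hypothetical related name), instead of N+1 in the template loop:
orders = (
    Order.objects.filter(status="paid")
    .select_related("vendor")
    .prefetch_related("items")
)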

Sizing guidance

  • Estimate pool sizes from (max web concurrency) × (avg DB connections per request). For example, 4 instances × 3 workers × 2 threads ≈ 24 concurrent requests, which PgBouncer can multiplex onto a much smaller pool of server connections (say 10-15).
  • Let PgBouncer reduce backend connections, but still monitor pg_stat_activity to detect connection pressure.

Caching strategy (Redis + patterns)

Redis usages

  • Django cache backend (django-redis) for page fragments and small objects.
  • Rate-limiting counters and session caching (if you choose not to use DB-backed sessions).
  • Celery broker (or dedicated broker) — for high-volume systems, consider a separate Redis instance for broker vs cache to avoid contention.

Example CACHES config

# settings.py (cache excerpt)
import os

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": os.environ.get("REDIS_URL"),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # Recent redis-py versions pick the faster hiredis parser
            # automatically when the hiredis package is installed, so no
            # PARSER_CLASS setting is needed.
        },
    }
}

Cache key design

  • Use versioned keys for easy invalidation: e.g., menu:vendor:{id}:v{version} (sketched after this list).
  • Cache only what you can invalidate or can tolerate stale values for. For critical, near-real-time content (order status), prefer short TTLs or event-driven invalidation.
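
A small sketch of the versioned-key pattern for a hypothetical vendor menu:

# cache_keys.py -- versioned keys; the menu naming is hypothetical
from django.core.cache import cache

def menu_key(vendor_id):
    version = cache.get(f"menu:vendor:{vendor_id}:version", 1)
    return f"menu:vendor:{vendor_id}:v{version}"

def invalidate_menu(vendor_id):
    # add() seeds the counter if missing; incr() then bumps it atomically,
    # which orphans every key built from the old version (old entries
    # simply age out via their TTLs).
    cache.add(f"menu:vendor:{vendor_id}:version", 1)
    cache.incr(f"menu:vendor:{vendor_id}:version")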

Background jobs & throttling (Celery, RQ, etc.)

What to push to background workers

  • Emails, PDF generation, image processing, payouts, and long webhooks.

Minimal Celery example

# myproject/celery.py
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
app = Celery('myproject')
# Read CELERY_*-prefixed settings from Django's settings.py
app.config_from_object('django.conf:settings', namespace='CELERY')
# Discover tasks.py modules across installed apps
app.autodiscover_tasks()

# Also import the app in myproject/__init__.py so it starts with Django:
#   from .celery import app as celery_app

Worker tuning tips

  • Set task_acks_late = True for safer retries; tasks must then be idempotent (see the sketch after this list).
  • Start with conservative concurrency (2–4) and scale horizontally by running more worker processes as needed.
  • Protect heavy endpoints with rate-limiting using DRF throttling, Cloudflare rate limits, or an API gateway.
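
A sketch of an idempotent task with late acknowledgement (the Order model and send_receipt_email helper are hypothetical):

# tasks.py -- illustrative retry + idempotency pattern
from celery import shared_task
from myapp.models import Order  # hypothetical model

@shared_task(
    acks_late=True,              # ack only after the task body completes
    autoretry_for=(Exception,),  # retry on any failure...
    retry_backoff=True,          # ...with exponential backoff
    max_retries=3,
)
def send_receipt(order_id):
    order = Order.objects.get(pk=order_id)
    if order.receipt_sent:       # idempotency guard: safe to re-run
        return
    send_receipt_email(order)    # hypothetical helper
    order.receipt_sent = True
    order.save(update_fields=["receipt_sent"])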

Web concurrency & process sizing (Gunicorn / Uvicorn)

Basic rules

  • For synchronous Gunicorn workers, a common starting heuristic is workers = 2 * CPU + 1 — then tune downward/upward based on memory and throughput.
  • For async needs (WebSockets or many concurrent idle connections), prefer ASGI (uvicorn) with fewer workers and async event loop (uvloop).

Examples

Gunicorn (synchronous + threads):

gunicorn myproject.wsgi:application \
  --workers 3 \
  --threads 2 \
  --timeout 30 \
  --bind 0.0.0.0:$PORT

Uvicorn (ASGI — for WebSockets/async):

uvicorn myproject.asgi:application --host 0.0.0.0 --port $PORT --workers 1 --loop uvloop --ws websockets

Memory considerations

  • Each worker is a process and consumes memory. On small Railway dynos (e.g., 512MB), prefer 1–2 workers and rely on horizontal scaling instead of packing many workers into a single small instance.

Cloudflare configuration & edge logic

Caching rules

  • Use Cloudflare Cache Rules to:

    • Bypass cache for paths like /api/*, /admin/*, and requests containing Authorization headers or session cookies.
    • Cache public assets aggressively using long TTLs and content-hashed filenames (a matching origin-side sketch follows this list).
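
On the Django side, make public endpoints explicitly cacheable so the edge has headers to respect. A minimal sketch, assuming a hypothetical public vendor_menu view:

# views.py -- edge-cache hints from the origin
from django.http import JsonResponse
from django.views.decorators.cache import cache_control

# s_maxage becomes s-maxage, which shared caches like Cloudflare honour
@cache_control(public=True, max_age=300, s_maxage=3600)
def vendor_menu(request, vendor_id):
    # hypothetical public endpoint; the data layer is elided
    return JsonResponse({"vendor": vendor_id, "items": []})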

Edge features

  • Cloudflare Workers let you run light logic at the edge (A/B tests, redirects, personalization) without hitting the origin.
  • Use Tiered Cache to reduce origin requests for global traffic.
  • Use Cloudflare rate-limiting and WAF rules to mitigate abusive traffic.

Observability & debugging essentials

Minimum stack

  • Error tracking: Sentry for exceptions and performance traces.
  • Logs: central log aggregator (Papertrail, LogDNA, or your hosted solution).
  • Metrics: track RPS, latency, DB connections, Redis hit rate, worker queue lengths. Use Prometheus/Grafana or a hosted metrics provider.
  • Uptime: health endpoints and external monitors (UptimeRobot, Pingdom).

Health check endpoint

Provide a lightweight /health that checks DB and Redis connectivity and returns a simple JSON 200/500 so load balancers and monitors can check service health.
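
A minimal sketch of such an endpoint:

# health/views.py -- lightweight dependency check
from django.core.cache import cache
from django.db import connection
from django.http import JsonResponse

def health(request):
    status, code = {"db": "ok", "cache": "ok"}, 200
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")  # cheap DB round-trip
    except Exception:
        status["db"], code = "error", 500
    try:
        cache.set("healthcheck", "ok", timeout=5)  # cheap Redis round-trip
    except Exception:
        status["cache"], code = "error", 500
    return JsonResponse(status, status=code)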


Cost control tips

  • Use Cloudflare free tier for DNS and basic CDN functionality — it reduces bandwidth to the origin.
  • Use S3 lifecycle policies to archive infrequently accessed media to cheaper storage tiers.
  • Right-size Railway services; scale workers only when the queue length or CPU indicates the need.
  • Avoid proxying uploads through web dynos and avoid storing blobs in your DB.

Rollout plan: 10 → 1000 concurrent users (step-by-step)

  1. Baseline: integrate Sentry, basic metrics, and logging. Add the /health endpoint.
  2. Move build assets & static files to S3. Put Cloudflare in front and confirm cache hit rates.
  3. Add Redis and implement fragment/full-page caching for public content.
  4. Switch to presigned uploads for user media.
  5. Enable PgBouncer or provider-managed pooling. Tune CONN_MAX_AGE.
  6. Move slow tasks to Celery/RQ and set worker scaling rules.
  7. Run iterative load tests (k6/Locust). Monitor DB, Redis, and worker metrics.
  8. Harden Cloudflare (cache bypass for API routes, rate limits, firewall rules).
  9. Optimize the top slow queries discovered during tests.
  10. Repeat the measurement and optimization cycle.

Common pitfalls & how to avoid them

  • Too many DB connections: Use PgBouncer; monitor pg_stat_activity.
  • Caching authenticated content accidentally: Use cache-bypass rules for private routes and Authorization headers.
  • Proxying uploads through the web dyno: Use presigned uploads so large file bodies never tie up dyno memory or request workers.
  • Ignoring observability: Add Sentry and metrics early. You cannot optimize what you do not measure.
  • Underestimating memory per worker: Test and monitor memory footprints before scaling worker counts.

Recommended resources & docs

  • Railway documentation for deploying Django and managing services.
  • PostgreSQL and PgBouncer docs for connection pooling best practices.
  • Cloudflare docs: Cache Rules, Workers, Tiered Cache, and rate-limiting.
  • Boto3 docs for generate_presigned_post.
  • Django ASGI / Uvicorn documentation for async features and websockets.

Appendix: Useful config snippets

Django settings (snippets)

# settings.py (excerpt)
import os

DATABASES = {
    "default": {
        # ... engine and credentials ...
        "CONN_MAX_AGE": 60,  # per-database option, not a top-level setting
    }
}

CACHES = { ... }  # django-redis example (see Caching strategy)

# Deprecated in Django 4.2 in favour of the STORAGES setting
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_S3_REGION_NAME = os.environ.get("AWS_REGION")
AWS_STORAGE_BUCKET_NAME = os.environ.get("AWS_STORAGE_BUCKET_NAME")

Procfile (Railway style)

web: gunicorn myproject.wsgi:application --bind 0.0.0.0:$PORT --workers 3 --threads 2 --timeout 30
worker: celery -A myproject worker --loglevel=info --concurrency=${CELERY_CONCURRENCY:-2}

Celery minimal config

# celery.py
from celery import Celery
import os

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
app = Celery('myproject')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

Gunicorn start example

gunicorn myproject.wsgi:application --workers 3 --threads 2 --timeout 30

Conclusion

Scaling to 1k+ concurrent users is iterative: measure realistic flows, remove origin bottlenecks, offload work to caches and background processes, and protect the origin with Cloudflare. With Railway + S3 + Cloudflare, you can build an efficient, cost-effective stack — but success depends on careful DB pooling, cache boundaries, worker sizing, and continuous measurement.
