Goal: Provide a practical, beginner-friendly, step-by-step guide to prepare a Django app (deployed on Railway) with S3 for static/media and Cloudflare as the CDN/edge, so it can reliably handle sustained peaks of ~1000 concurrent users.
Table of Contents
- TL;DR — One-page checklist
- Clarifications & assumptions
- Measure before you optimize
- Static files & media — S3 + Cloudflare (how & why)
- Database: Postgres, connection pooling & query tuning
- Caching strategy (Redis + patterns)
- Background jobs & throttling (Celery etc.)
- Web concurrency & process sizing (Gunicorn / Uvicorn)
- Cloudflare configuration & edge logic
- Observability & debugging essentials
- Cost control tips
- Rollout plan: 10 → 1000 concurrent users (step-by-step)
- Common pitfalls & how to avoid them
- Recommended resources & docs
- Appendix: Useful config snippets
TL;DR — One-page checklist
- Measure realistic flows and determine p95/p99 latencies.
- Offload static and media to S3 and cache them at Cloudflare edge.
- Add a Redis instance and use django-redis for caching; use a separate Redis for the Celery broker if needed.
- Use Postgres + PgBouncer (or your provider's pooling) to avoid connection storms.
- Move slow work to background workers (Celery/RQ/Huey) and scale workers independently.
- Tune Django process concurrency to available CPU/memory.
- Configure Cloudflare rules to bypass API/authenticated routes and cache public assets aggressively.
- Add observability (Sentry + metrics + logs) and run iterative load tests.
Clarifications & assumptions
Concurrent users vs daily active users: "1000 concurrent users" means approximately 1000 simultaneous active connections or users at peak. This requires more capacity and different trade-offs than 1000 daily active users. Verify which you mean before provisioning.
Assumptions for this guide:
- Django app hosted on Railway (or similar PaaS).
- Postgres database (Railway or external).
- S3-compatible object storage (AWS S3 recommended).
- Cloudflare as DNS + CDN in front of Railway.
Measure before you optimize
Why: Optimization without measurements is guesswork. Know your bottlenecks.
Concrete steps:
- Identify a realistic user flow and script it (e.g., browse catalog -> open product -> add to cart -> checkout).
- Use load-test tools: k6, Locust, or hey to simulate realistic traffic, and include pacing (think time between actions).
- Collect these metrics:
- Request rate (RPS)
- p50/p95/p99 latencies for endpoints
- Error rates
- DB queries/sec and top slow queries
- DB CPU and active connections
- Redis hit/miss rate and eviction counts
- Celery queue length and task durations
- Define SLOs: for example p95 < 300ms, error rate < 0.5%.
Tip: Start small, run tests, then increase RPS until you find the bottleneck and address it.
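Little's Law (average concurrency = arrival rate x average time in system) helps translate these measurements into capacity numbers. A quick sketch — the think-time and latency figures below are illustrative, not measurements from this guide:

```python
def inflight_requests(rps: float, avg_latency_s: float) -> float:
    """Little's Law: average concurrent in-flight requests = RPS * latency."""
    return rps * avg_latency_s

# 1000 concurrent *users* pacing themselves with ~10s of think time
# generate roughly 1000 / 10 = 100 RPS of actual requests.
users, think_time_s = 1000, 10.0
rps = users / think_time_s

# At the example SLO of p95 < 300ms, that keeps only ~30 requests
# in flight at the origin at any moment -- far fewer than 1000.
print(inflight_requests(rps, 0.3))
```

This is why "1000 concurrent users" is usually survivable on modest hardware: what matters is in-flight requests, which think time and fast responses keep low.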
Static files & media — S3 + Cloudflare (how & why)
Why this matters
- Serving static or uploaded files from your web dyno consumes CPU, memory, and bandwidth and increases origin response times. S3 + Cloudflare moves those costs to object storage and the CDN.
Best-practices
- Compile/build assets with hashed filenames (cache-busting by content hash).
- Use Cache-Control: public, max-age=31536000, immutable for hashed build assets.
- Use presigned uploads so clients upload directly to S3 (the web dyno never handles large file bodies).
- Use S3 lifecycle policies to move infrequently accessed objects to cheaper classes (Infrequent Access, Glacier).
Presigned POST example (Django DRF view)
# uploads/views.py
import uuid

import boto3
from django.conf import settings
from rest_framework.decorators import api_view, permission_classes
from rest_framework.permissions import IsAuthenticated
from rest_framework.response import Response


@api_view(['GET'])
@permission_classes([IsAuthenticated])
def presign_upload(request):
    """Return short-lived credentials so the client uploads straight to S3."""
    s3 = boto3.client('s3', region_name=settings.AWS_REGION)
    key = f"uploads/{request.user.id}/{uuid.uuid4()}.jpg"
    presigned = s3.generate_presigned_post(
        Bucket=settings.AWS_STORAGE_BUCKET_NAME,
        Key=key,
        Fields={"Content-Type": "image/jpeg"},
        Conditions=[
            {"Content-Type": "image/jpeg"},  # reject other MIME types
            ["content-length-range", 1, 10 * 1024 * 1024],  # cap uploads at 10 MB
        ],
        ExpiresIn=60,  # the presigned POST is only valid for one minute
    )
    return Response({'url': presigned['url'], 'fields': presigned['fields'], 'key': key})
Security notes
- Validate file types and sizes with S3 conditions and client-side checks.
- Use server-side virus scanning or processing for untrusted uploads if required in your domain.
Database — Postgres + connection pooling + query tuning
Problems to avoid
- Large numbers of concurrent DB connections cause memory and CPU strain on the DB server.
Connection pooling
- Use PgBouncer (transaction pooling) in front of Postgres to multiplex many client connections onto fewer DB backends. If your code relies on session-level state (e.g., SET commands), use session pooling and size accordingly.
- Configure Django with CONN_MAX_AGE = 60 (or a tuned value) to reuse TCP connections.
Query & index hygiene
- Add indexes on frequently filtered/sorted fields (e.g., vendor_id, status, created_at).
- Use select_related() for foreign-key single-object joins and prefetch_related() for M2M or reverse relationships.
- Use pg_stat_statements or your host's slow-query logging to find the top offenders.
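To see what an index buys you, it is easy to experiment locally. This sketch uses SQLite's EXPLAIN QUERY PLAN purely for illustration (on Postgres you would use EXPLAIN ANALYZE, and the orders table with vendor_id/status columns is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, vendor_id INT, status TEXT)")

query = "SELECT * FROM orders WHERE vendor_id = ? AND status = ?"

# Without an index the planner has to scan the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, (1, "paid")).fetchall()
print(plan_before[0][3])  # e.g. "SCAN orders"

# A composite index on the filtered columns turns the scan into an index search.
conn.execute("CREATE INDEX idx_orders_vendor_status ON orders (vendor_id, status)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, (1, "paid")).fetchall()
print(plan_after[0][3])  # e.g. "SEARCH orders USING ... INDEX idx_orders_vendor_status"
```

On Postgres, run the real query with EXPLAIN ANALYZE before and after adding the index and compare actual row counts and timings, not just the plan.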
Sizing guidance
- Estimate: (max web concurrency) * (avg DB connections per request) to plan pool sizes.
- Let PgBouncer reduce backend connections, but still monitor pg_stat_activity to detect connection pressure.
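The estimate above can be turned into a quick back-of-envelope calculation. A sketch — the instance and worker counts are hypothetical examples, not recommendations:

```python
def max_db_connections(instances: int, workers: int, threads: int,
                       conns_per_request: int = 1) -> int:
    """Worst case: every thread of every worker holds its connection(s) at once."""
    return instances * workers * threads * conns_per_request

# e.g. 4 Railway instances, each running gunicorn with 3 workers x 2 threads
print(max_db_connections(instances=4, workers=3, threads=2))  # -> 24
```

Keep this number below PgBouncer's client limit; PgBouncer then multiplexes those clients onto a much smaller pool of actual Postgres backends.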
Caching strategy (Redis + patterns)
Redis usages
- Django cache backend (django-redis) for page fragments and small objects.
- Rate-limiting counters and session caching (if you choose not to use DB-backed sessions).
- Celery broker (or dedicated broker) — for high-volume systems, consider a separate Redis instance for broker vs cache to avoid contention.
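As an illustration of the rate-limiting use case, here is a fixed-window counter sketch. A plain dict stands in for Redis; in production you would use atomic INCR plus EXPIRE so the count is shared across processes (the window and limit values are illustrative):

```python
import time
from collections import defaultdict

WINDOW_S = 60   # length of each counting window
LIMIT = 100     # max requests per user per window

_counters = defaultdict(int)  # stand-in for Redis

def allow_request(user_id, now=None):
    """Fixed-window limiter: at most LIMIT requests per user per WINDOW_S."""
    now = time.time() if now is None else now
    window = int(now // WINDOW_S)
    key = (user_id, window)   # in Redis: f"rl:{user_id}:{window}" with an EXPIRE
    _counters[key] += 1       # in Redis: INCR, which is atomic across processes
    return _counters[key] <= LIMIT
```

Fixed windows allow short bursts at window boundaries; if that matters, look at sliding-window or token-bucket variants, or lean on Cloudflare's built-in rate limiting instead.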
Example CACHES config
# settings.py (excerpt)
import os

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": os.environ.get("REDIS_URL"),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # Recent redis-py versions use the faster hiredis parser
            # automatically when the hiredis package is installed.
        },
    }
}
Cache key design
- Use versioned keys for easy invalidation, e.g., menu:vendor:{id}:v{version}.
- Cache only what you can invalidate or can tolerate stale values for. For critical, near-real-time content (order status), prefer short TTLs or event-driven invalidation.
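The versioned-key pattern can be sketched without Django. In production the dict below would be your Redis cache via django-redis, and menu:vendor mirrors the example key format above (the vendor ids and menu data are made up):

```python
_cache = {}  # stand-in for the Redis-backed Django cache

def menu_key(vendor_id, version):
    return f"menu:vendor:{vendor_id}:v{version}"

def get_menu(vendor_id, version, compute):
    """Read-through cache: compute on miss, store under the versioned key."""
    key = menu_key(vendor_id, version)
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]

# Invalidation = bump the version. Old keys simply stop being read
# and age out via TTL -- no explicit DELETE needed.
v = 1
menu = get_menu(42, v, lambda: ["pizza"])          # computes and caches
v += 1                                             # menu changed: new version
menu = get_menu(42, v, lambda: ["pizza", "pasta"]) # fresh compute under new key
```

Storing the current version itself in Redis (one small key per vendor) keeps all web processes in agreement about which version is live.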
Background jobs & throttling (Celery, RQ, etc.)
What to push to background workers
- Emails, PDF generation, image processing, payouts, and long webhooks.
Minimal Celery example
# celery.py
import os

from celery import Celery

# Make Django settings importable before the Celery app is created.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')

app = Celery('myproject')
# Read all CELERY_* settings from Django's settings module.
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
Worker tuning tips
- Set task_acks_late = True for safer retries (ensure tasks are idempotent).
- Start with conservative concurrency (2–4) and scale horizontally by running more worker processes as needed.
- Protect heavy endpoints with rate limiting using DRF throttling, Cloudflare rate limits, or an API gateway.
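For the DRF throttling option, a settings fragment like this caps request rates per authenticated user and per anonymous IP. The class paths are DRF's built-in throttles; the rates are illustrative and should be tuned to your traffic:

```python
# settings.py (excerpt) -- DRF built-in request throttling
REST_FRAMEWORK = {
    "DEFAULT_THROTTLE_CLASSES": [
        "rest_framework.throttling.AnonRateThrottle",
        "rest_framework.throttling.UserRateThrottle",
    ],
    "DEFAULT_THROTTLE_RATES": {
        "anon": "60/min",   # unauthenticated clients, keyed by IP
        "user": "300/min",  # authenticated users, keyed by user id
    },
}
```

Note that DRF's default throttles store counters in the Django cache, so they work across processes only when that cache is the shared Redis backend, not per-process local memory.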
Web concurrency & process sizing (Gunicorn / Uvicorn)
Basic rules
- For synchronous Gunicorn workers, a common starting heuristic is workers = 2 * CPU + 1; then tune downward or upward based on memory and throughput.
- For async needs (WebSockets or many concurrent idle connections), prefer ASGI (uvicorn) with fewer workers and an async event loop (uvloop).
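The 2 * CPU + 1 heuristic is easy to compute at startup, and capping it by memory matters most on small instances. A sketch — the 512 MB instance size and ~90 MB-per-worker footprint are illustrative assumptions, so measure your own app's memory first:

```python
import os

def gunicorn_workers(cpu_count=None, mem_mb=512, mem_per_worker_mb=90):
    """Start from 2*CPU+1, then cap by what memory can actually hold."""
    cpus = cpu_count or os.cpu_count() or 1
    by_cpu = 2 * cpus + 1
    by_mem = max(1, mem_mb // mem_per_worker_mb)
    return min(by_cpu, by_mem)

# A 1-vCPU / 512 MB instance: CPU suggests 3, memory allows 5 -> 3 workers.
print(gunicorn_workers(cpu_count=1))  # -> 3
```

On a larger CPU count with the same 512 MB, memory becomes the binding constraint, which is exactly when horizontal scaling beats packing more workers into one instance.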
Examples
Gunicorn (synchronous + threads):
gunicorn myproject.wsgi:application \
--workers 3 \
--threads 2 \
--timeout 30 \
--bind 0.0.0.0:$PORT
Uvicorn (ASGI — for WebSockets/async):
uvicorn myproject.asgi:application --host 0.0.0.0 --port $PORT --workers 1 --loop uvloop --ws websockets
Memory considerations
- Each worker is a process and consumes memory. On small Railway dynos (e.g., 512MB), prefer 1–2 workers and rely on horizontal scaling instead of packing many workers into a single small instance.
Cloudflare configuration & edge logic
Caching rules
Use Cloudflare Cache Rules to:
- Bypass cache for paths like /api/* and /admin/*, and for requests containing Authorization headers or session cookies.
- Cache public assets aggressively using long TTLs and content-hashed filenames.
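The bypass logic the rules should express can be sketched as a plain predicate, which makes it easy to reason about and test before encoding it in Cloudflare's rule builder (the path prefixes are from this guide; the cookie names are Django's defaults):

```python
BYPASS_PREFIXES = ("/api/", "/admin/")
SESSION_COOKIES = ("sessionid", "csrftoken")  # Django's default cookie names

def should_bypass_cache(path, headers):
    """True if this request must go to the origin uncached."""
    if path.startswith(BYPASS_PREFIXES):
        return True
    # Any Authorization header implies a per-user response.
    if "authorization" in {h.lower() for h in headers}:
        return True
    # A session cookie means the response may be personalized.
    cookie = headers.get("Cookie", headers.get("cookie", ""))
    return any(name + "=" in cookie for name in SESSION_COOKIES)
```

Getting this boundary wrong in the cache-everything direction is the classic failure mode: one user's authenticated page served from the edge to everyone.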
Edge features
- Cloudflare Workers let you run light logic at the edge (A/B tests, redirects, personalization) without hitting the origin.
- Use Tiered Cache to reduce origin requests for global traffic.
- Use Cloudflare rate-limiting and WAF rules to mitigate abusive traffic.
Observability & debugging essentials
Minimum stack
- Error tracking: Sentry for exceptions and performance traces.
- Logs: central log aggregator (Papertrail, LogDNA, or your hosted solution).
- Metrics: track RPS, latency, DB connections, Redis hit rate, worker queue lengths. Use Prometheus/Grafana or a hosted metrics provider.
- Uptime: health endpoints and external monitors (UptimeRobot, Pingdom).
Health check endpoint
Provide a lightweight /health that checks DB and Redis connectivity and returns a simple JSON 200/500 so load balancers and monitors can check service health.
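A framework-agnostic sketch of that endpoint: in Django the checks would be things like the DB connection's ensure_connection() and a Redis PING, stubbed here as callables (the check names and wiring are illustrative):

```python
import json

def health(checks):
    """Run each named check; any exception marks the service unhealthy."""
    results, healthy = {}, True
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
            healthy = False
    status = 200 if healthy else 500
    return status, json.dumps(results)

# Wire in real probes, e.g. {"db": connection.ensure_connection, "redis": client.ping}
status, body = health({"db": lambda: None, "redis": lambda: None})
print(status, body)  # -> 200 {"db": "ok", "redis": "ok"}
```

Keep the checks cheap and fast: this endpoint gets hit constantly by monitors and load balancers, so it should never itself become a source of load.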
Cost control tips
- Use Cloudflare free tier for DNS and basic CDN functionality — it reduces bandwidth to the origin.
- Use S3 lifecycle policies to archive infrequently accessed media to cheaper storage tiers.
- Right-size Railway services; scale workers only when the queue length or CPU indicates the need.
- Avoid proxying uploads through web dynos and avoid storing blobs in your DB.
Rollout plan: 10 → 1000 concurrent users (step-by-step)
| Step | Action |
|---|---|
| 1 | Baseline: Integrate Sentry, basic metrics, and logging. Add a /health endpoint. |
| 2 | Move build assets & static files to S3. Put Cloudflare in front and confirm cache hit rates. |
| 3 | Add Redis and implement fragment/full-page caching for public content. |
| 4 | Switch to presigned uploads for user media. |
| 5 | Enable PgBouncer or provider-managed pooling. Tune CONN_MAX_AGE. |
| 6 | Move slow tasks to Celery/RQ and set worker scaling rules. |
| 7 | Run iterative load tests (k6/Locust). Monitor DB, Redis, and worker metrics. |
| 8 | Harden Cloudflare (cache bypass for API, rate limits, firewall rules). |
| 9 | Optimize top slow queries discovered during tests. |
| 10 | Repeat measurement and optimization cycles. |
Common pitfalls & how to avoid them
- Too many DB connections: Use PgBouncer; monitor pg_stat_activity.
- Caching authenticated content accidentally: Use cache-bypass rules for private routes and Authorization headers.
- Proxying uploads through web dyno: Use presigned uploads to protect dyno memory and requests.
- Ignoring observability: Add Sentry and metrics early. You cannot optimize what you do not measure.
- Underestimating memory per worker: Test and monitor memory footprints before scaling worker counts.
Recommended resources & docs
- Railway documentation for deploying Django and managing services.
- PostgreSQL and PgBouncer docs for connection pooling best practices.
- Cloudflare docs: Cache Rules, Workers, Tiered Cache, and rate-limiting.
- Boto3 docs for generate_presigned_post.
- Django ASGI / Uvicorn documentation for async features and WebSockets.
Appendix: Useful config snippets
Django settings (snippets)
# settings.py (excerpt)
CONN_MAX_AGE = 60
CACHES = { ... } # django-redis example
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_S3_REGION_NAME = os.environ.get("AWS_REGION")
AWS_STORAGE_BUCKET_NAME = os.environ.get("AWS_STORAGE_BUCKET_NAME")
Procfile (Railway style)
web: gunicorn myproject.wsgi:application --bind 0.0.0.0:$PORT --workers 3 --threads 2 --timeout 30
worker: celery -A myproject worker --loglevel=info --concurrency=${CELERY_CONCURRENCY:-2}
Celery minimal config
# celery.py
from celery import Celery
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
app = Celery('myproject')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
Gunicorn start example
gunicorn myproject.wsgi:application --workers 3 --threads 2 --timeout 30
Conclusion
Scaling to 1k+ concurrent users is iterative: measure realistic flows, remove origin bottlenecks, offload work to caches and background processes, and protect the origin with Cloudflare. With Railway + S3 + Cloudflare, you can build an efficient, cost-effective stack — but success depends on careful DB pooling, cache boundaries, worker sizing, and continuous measurement.