Most teams wait until their infrastructure is on fire. We made a different bet.
At 500 users, we migrated from Flask SSR to FastAPI + Next.js + Supabase. Not because the app was broken, but because the architecture couldn't scale with our product roadmap.
Six months later: 1.2s page loads, real-time features we couldn't build before, and a stack that supports the product we're growing into.
Here's the decision-making process, including the trade-offs we're still living with.
The Forcing Function
SendSage started as a Flask monolith with Jinja templates. Classic server-side rendering. It worked fine for CRUD operations and report generation.
Then our customers started asking for real-time analytics dashboards.
We prototyped with D3.js visualizations in the Flask templates. Power users were hitting 10+ second render times. We were server-side rendering entire pages with 2-3MB JSON payloads embedded directly in the HTML.
The immediate problem was performance. The strategic problem was architectural.
Every feature request followed the same pattern:
- "Can we show progress on long-running imports?" → Would need WebSockets bolted onto Flask
- "Can multiple users collaborate on campaigns?" → Would need event streaming
- "Can we make this dashboard feel responsive?" → Would need client-side state management
We could patch Flask. Add Redis caching. Split API calls. Integrate Socket.io.
But our growth projections showed 5,000-10,000 users in 12-18 months. The product roadmap had real-time collaboration features our customers were asking for.
The decision: Rebuild the infrastructure before we had the scale problem, while we still had breathing room to do it right.
The Architecture Bet
We split into three layers with clear boundaries:
FastAPI for the API layer
- Async by default (non-blocking I/O for ML pipeline jobs)
- Python type hints caught integration bugs at development time
- Worker pattern: FastAPI polls a Postgres jobs table, processes queued tasks, updates status (sketched after this list)
- Same PostgreSQL database, but with the schema redesigned for proper normalization
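A minimal sketch of that worker loop, assuming a hypothetical `jobs` table; column names like `status`, `payload`, and `created_at` are illustrative, not our exact schema:

```python
# worker.py - polling worker, as a rough sketch. Table and column names are
# illustrative placeholders, not the real SendSage schema.
import asyncio
import json

import asyncpg

CLAIM_JOB = """
    UPDATE jobs
       SET status = 'processing', started_at = now()
     WHERE id = (
            SELECT id
              FROM jobs
             WHERE status = 'queued'
             ORDER BY created_at
             LIMIT 1
               FOR UPDATE SKIP LOCKED
           )
 RETURNING id, payload;
"""

async def process_job(job_id: int, payload: dict) -> None:
    """Placeholder for the actual work (ML pipeline step, report build, etc.)."""
    ...

async def run_worker(dsn: str, poll_interval: float = 2.0) -> None:
    conn = await asyncpg.connect(dsn)
    try:
        while True:
            row = await conn.fetchrow(CLAIM_JOB)
            if row is None:
                await asyncio.sleep(poll_interval)  # queue empty, back off
                continue
            try:
                # asyncpg returns jsonb as text unless a codec is registered
                await process_job(row["id"], json.loads(row["payload"]))
                await conn.execute(
                    "UPDATE jobs SET status = 'done', finished_at = now() WHERE id = $1",
                    row["id"],
                )
            except Exception:
                await conn.execute(
                    "UPDATE jobs SET status = 'failed' WHERE id = $1",
                    row["id"],
                )
    finally:
        await conn.close()

if __name__ == "__main__":
    asyncio.run(run_worker("postgresql://localhost/sendsage"))
```

The `FOR UPDATE SKIP LOCKED` claim query is one way to let several worker processes poll the same table without picking up the same job twice.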
Next.js for the frontend
- Client-side data fetching (initial HTML ~50KB, lazy-load visualization data)
- React components with incremental static regeneration for marketing pages
- Independent deployment from the API (frontend changes don't require backend deploys)
Supabase for auth + real-time + database hosting
- Row-level security policies in Postgres (authorization logic in one place, not scattered across API routes)
- Realtime subscriptions via Postgres CDC (change data capture), so no separate WebSocket infrastructure to run
- Generated TypeScript types from database schema → imported directly into Next.js
The type safety story (this was the hidden win):
Supabase generates TypeScript definitions from the Postgres schema. We built a custom script that extracts those types and generates Pydantic models for FastAPI.
One source of truth: the database schema. Changes propagate automatically to both frontend and backend. Catch breaking changes at compile time, not in production.
This script was relatively straightforward to build. But it's saved us probably 40+ hours of debugging mismatched field types over the last six months.
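Supabase's generated file nests row types under a `Database` type, so the real script is more involved, but here is a heavily simplified sketch of the core idea, operating on flat TypeScript interfaces with made-up names:

```python
# ts_to_pydantic.py - heavily simplified sketch of a TS-to-Pydantic converter.
# The real generated file nests row types under Database["public"]["Tables"]
# and includes enums and Insert/Update variants; this only shows the idea.
import re

TS_TO_PY = {"string": "str", "number": "float", "boolean": "bool", "Json": "dict"}

def convert(ts_source: str) -> str:
    models = []
    for name, body in re.findall(r"interface\s+(\w+)\s*\{(.*?)\}", ts_source, re.S):
        fields = []
        for field, ts_type in re.findall(r"(\w+)\??:\s*([\w\s|]+);", body):
            nullable = "| null" in ts_type
            py_type = TS_TO_PY.get(ts_type.replace("| null", "").strip(), "Any")
            if nullable:
                py_type = f"Optional[{py_type}] = None"
            fields.append(f"    {field}: {py_type}")
        models.append(f"class {name}(BaseModel):\n" + "\n".join(fields or ["    pass"]))
    header = "from typing import Any, Optional\nfrom pydantic import BaseModel\n\n\n"
    return header + "\n\n\n".join(models) + "\n"

if __name__ == "__main__":
    sample = "interface CampaignRow { id: number; name: string; notes: string | null; }"
    print(convert(sample))
```

Regenerating both sides on every schema change is what turns mismatched field types into compile-time errors instead of production bugs.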
The Migration Strategy (And Its Risks)
We had two options:
Incremental migration:
- Run both stacks in parallel
- Migrate routes one at a time
- Feature flags to control rollout
- Gradual user migration over 2-3 months
Parallel build + big-bang cutover:
- Build entire new stack at staging domain
- Test thoroughly with subset of real users
- Schedule maintenance window
- Cut over all at once
For 500 users with clear traffic patterns, we chose the big-bang approach.
Why? Incremental migration has hidden costs:
- Maintaining two codebases simultaneously for months
- Dual-write systems to keep databases in sync
- Session compatibility between Flask and Supabase auth
- Feature development essentially pauses during migration
At our scale, the "safer" incremental approach would have taken 3-4 months of split focus. The big-bang approach took 6 weeks of dedicated work plus one high-risk cutover night.
The trade-off we made: Accept one night of concentrated risk in exchange for faster return to normal development velocity.
What we built before the cutover:
- Full staging environment at a separate domain
- Auth migration pipeline - Custom code to transfer bcrypt hashes from Flask's user table to Supabase auth.users (we got lucky here - both systems used bcrypt)
- Test cohort validation - Invited 20 users to test the staging environment for two weeks, watched PostHog session replays of their login flows and key workflows
- Data integrity checks - Scripts to compare relational data between old and new databases (foreign key relationships, aggregate counts, data type consistency); a rough sketch follows this list
- Rollback plan - Old infrastructure stayed running, DNS redirect was reversible
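The integrity checks were, roughly, scripts of this shape; the connection strings, table names, and foreign-key pairs below are placeholders:

```python
# integrity_check.py - compare old and new databases before cutover (sketch;
# connection strings, table names, and FK pairs are placeholders).
import psycopg2

TABLES = ["users", "campaigns", "reports"]
FK_CHECKS = [
    # (child table, fk column, parent table) - flag rows whose parent is missing
    ("campaigns", "user_id", "users"),
]

def scalar(conn, query: str):
    with conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchone()[0]

def main() -> None:
    old = psycopg2.connect("postgresql://localhost/sendsage_old")
    new = psycopg2.connect("postgresql://localhost/sendsage_new")

    for table in TABLES:
        q = f"SELECT count(*) FROM {table}"
        old_count, new_count = scalar(old, q), scalar(new, q)
        status = "OK" if old_count == new_count else "MISMATCH"
        print(f"{table}: old={old_count} new={new_count} {status}")

    for child, fk, parent in FK_CHECKS:
        q = (f"SELECT count(*) FROM {child} c "
             f"LEFT JOIN {parent} p ON c.{fk} = p.id WHERE p.id IS NULL")
        print(f"{child}.{fk} -> {parent}: {scalar(new, q)} orphaned rows in new DB")

    old.close()
    new.close()

if __name__ == "__main__":
    main()
```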
The cutover (3am on a Tuesday):
- Set database tables to read-only (15-minute window during lowest traffic period; one way to do this is sketched after this list)
- `pg_dump` from old database → Supabase tables
- Ran auth migration pipeline (bcrypt hashes transferred directly)
- Validated user count, data relationships, test logins
- DNS redirect from old domain to new stack
- Unlocked database tables
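For concreteness: Postgres has no single per-table "read-only" switch, so one way to implement the lock and unlock steps is to revoke write privileges from the application's database role for the window and grant them back afterwards. A sketch, with `sendsage_app` as a placeholder role name:

```python
# freeze_writes.py - sketch of the lock/unlock steps. "sendsage_app" is a
# placeholder for whatever role the old Flask app connected as.
import sys

import psycopg2

FREEZE = "REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM sendsage_app;"
THAW = "GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO sendsage_app;"

def run(statement: str) -> None:
    conn = psycopg2.connect("postgresql://localhost/sendsage_old")
    try:
        with conn, conn.cursor() as cur:  # the `with conn` block commits on exit
            cur.execute(statement)
    finally:
        conn.close()

if __name__ == "__main__":
    run(THAW if "--unlock" in sys.argv else FREEZE)
```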
The risk we accepted: 15 minutes of downtime. For 500 users, this was the pragmatic choice. At 50,000 users, we would have needed logical replication, blue-green deployment, and true zero-downtime techniques. Those would have added 4-6 weeks to the timeline.
The app was offline for 14 minutes. We notified users 48 hours in advance. If anything broke, we'd reverse the DNS and investigate.
Nothing broke. But let's be honest - it could have. And at larger scale, "nothing broke" isn't acceptable.
What We Kept (The Underrated Decision)
PostgreSQL - Every database migration multiplies risk. We kept Postgres, redesigned the schema, but avoided the NoSQL/NewSQL temptation.
Core business logic - Our ETL validation rules that cleaned messy client data? Those worked. We wrapped them in FastAPI endpoints but kept the algorithms unchanged (a simplified sketch follows below).
Domain knowledge - The Flask codebase handled 47 edge cases we'd discovered over 18 months. Rewriting from scratch would have meant rediscovering them in production.
The temptation in a rewrite is to "do it right this time." We resisted. Migration is about architecture, not logic.
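A simplified example of what that wrapping looks like; the route, model, and function names here are placeholders, and the inline validator is a stand-in for the real legacy code:

```python
# api/imports.py - wrapping legacy validation in a FastAPI route (sketch).
from fastapi import FastAPI
from pydantic import BaseModel

# Stand-in for the unchanged Flask-era validation function; in a real codebase
# this would be imported from the legacy module, not rewritten.
def validate_client_rows(rows: list[dict]) -> tuple[list[dict], list[str]]:
    errors = [f"row {i}: missing 'email'" for i, row in enumerate(rows) if "email" not in row]
    return [row for row in rows if "email" in row], errors

app = FastAPI()

class ValidationRequest(BaseModel):
    rows: list[dict]

class ValidationResult(BaseModel):
    clean_rows: list[dict]
    errors: list[str]

@app.post("/imports/validate", response_model=ValidationResult)
async def validate_import(req: ValidationRequest) -> ValidationResult:
    clean_rows, errors = validate_client_rows(req.rows)  # same algorithm, new transport
    return ValidationResult(clean_rows=clean_rows, errors=errors)
```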
The Results (And Hidden Costs)
Performance wins:
- Page loads: 40s → 1.2s average
- D3 visualizations: 10s → instant (data is fetched asynchronously in the background)
- Real-time features we couldn't build before: trivial now
Development velocity:
- New features: 2 weeks → 3 days average
- Zero merge conflicts between marketing site, app, and blog (independent deployments)
- Type safety across the entire stack
Costs we're still paying:
Custom type generation script requires maintenance - When Supabase updates their type generation format, we update our Pydantic converter. This is maybe 2 hours per quarter, but it's tech debt.
More complex local development setup - New machines need Supabase CLI, Next.js dev server, FastAPI server, and Postgres running locally. We've mitigated this significantly with Docker and Turborepo for the Next.js apps, so it's still just a few commands to get up and running. But the initial development cost to build these dev tools was real, and there's ongoing maintenance when dependencies change.
Three deployment pipelines instead of one - Frontend, API, and database migrations deploy independently. This is more flexible but also more coordination overhead.
Supabase lock-in - We're using Realtime, Auth, and their Postgres hosting. Migrating away would be painful. This is an acceptable trade-off, but it's a trade-off.
Worker pattern complexity - Our FastAPI worker polls the jobs table, which works but requires careful handling of concurrent job processing, retry logic, and failure scenarios. It's simpler than a full queue system, but it's still more mental overhead than synchronous request handling.
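To give a sense of that overhead: the failure branch in the worker sketch earlier is really closer to this, where a job is re-queued until it exhausts its attempts and only then parked as failed (the `attempts`, `max_attempts`, and `last_error` columns are illustrative):

```python
# Failure handling for the polling worker (sketch; column names are illustrative).
MARK_FAILED_OR_RETRY = """
    UPDATE jobs
       SET attempts   = attempts + 1,
           status     = CASE WHEN attempts + 1 >= max_attempts
                             THEN 'failed'
                             ELSE 'queued'
                        END,
           last_error = $2
     WHERE id = $1;
"""

async def handle_failure(conn, job_id: int, exc: Exception) -> None:
    # Re-queue for another attempt, or park as 'failed' for manual inspection.
    await conn.execute(MARK_FAILED_OR_RETRY, job_id, str(exc))
```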
What I'd Do Differently
We should have profiled earlier. We discovered the SSR + 2MB JSON payload problem pretty late. If we'd caught it sooner, we might have solved it with an API-first architecture in Flask. But then we also would have missed out on the type generation, real-time infrastructure, and deployment flexibility we gained from the rewrite.
The maintenance window was the right call for our scale, but it's not a strategy that scales. At 500 users, 15 minutes of scheduled downtime was acceptable. At 5,000+ users, we'd need logical replication, dual-write systems, and proper zero-downtime techniques. Those would have added 4-6 weeks to our timeline and weren't worth it at our current scale.
We underestimated the Docker/Turborepo investment. We eventually built good local dev tooling, but we probably should have done this from day one rather than retrofitting it after the migration. Done up front, the three weeks we spent dockerizing everything and setting up the monorepo build system would have paid for themselves during the migration itself.
The Real Lesson
Your migration strategy should optimize for your growth trajectory, not your current pain.
At 500 users, we didn't need this architecture. We built it for the 5,000-user product we expect to be running in 12 months.
The risk: over-engineering for scale you never reach. We mitigated this by:
- Choosing boring technology where possible (Postgres, REST APIs, React)
- Building only what we needed for the next 18 months, not the next 5 years
- Keeping business logic intact (migration was architectural, not algorithmic)
The payoff: when growth happens, we're shipping features, not fighting infrastructure.
The framework question is usually the wrong question. "Flask vs FastAPI" or "Rails vs Node" misses the point. The question is: Does your architecture support the product you're building?
Server-side rendering works great for content sites. It doesn't work for real-time collaboration tools. We needed client-side state management, async APIs, and real-time subscriptions. The framework choice followed from that.
For Engineering Leaders
If you're considering a similar migration, here are the questions I'd ask:
What's your growth projection? If you're expecting 10x users in 12 months, build for that now. If you're uncertain, optimize for flexibility over scale.
What's your product roadmap? We needed real-time features. If you're building a content site or CRUD app, SSR might be perfectly fine.
What's your runway? We had 6 weeks of dedicated focus as a team of two. If you're resource-constrained or managing a larger team, incremental migration might distribute the work better.
What's your actual bottleneck? Profile first. We thought it was D3.js. It was actually SSR + massive JSON payloads. The fix might be simpler than a rewrite.
Can you accept downtime? At 500 users, yes. At 50,000 users, probably not. This determines your entire migration strategy.
The most expensive migration is the one you do twice. We made a bet on doing it right once, accepting short-term risk for long-term velocity.
Six months in, it's working. Ask me again in 18 months when we're at 10,000 users.
For the engineers: What framework have you migrated from? I'm curious what the common patterns are - I suspect a lot of teams hit this same SSR → SPA transition around 500-1000 users.