DEV Community

JE Ramos


My Imposter Syndrome at 30M MAU

The first time I check the Wallet Service dashboard in production, CloudWatch shows 11,400 requests per minute.

I close the laptop. I open it again. The number hasn't changed. Eleven thousand four hundred gem transactions every sixty seconds, flowing through the service I designed, the one I wrote the first commit for, the one I never imagined would breathe at this rate.

It's 8 PM. My apartment is quiet. The dashboard isn't.


I need to tell you where I was before this. Because the gap matters.

Before I built this, I was working on a mobile app with maybe 200,000 users. A respectable number. The kind of number where a production bug means you get a Slack thread with four messages and someone says "I'll look at it after lunch." The kind of number where your on-call rotation is a polite fiction. Nobody actually gets woken up.

Then I got the contract to build the social platform. Not join. Build. The whole ecosystem. Social media, advertising, food delivery, live streaming, gaming, e-commerce. What would eventually become ninety-five repositories. What would eventually reach thirty million monthly active users.

But when I wrote the first line of code, 30M was a fantasy. I was thinking about hundreds of users. Then thousands. The architecture decisions I made early on, the ones baked into the foundation, were made by a version of me who had never operated at scale. And now those decisions run a small economy.

That's the part nobody warns you about. You don't get to go back and ask the person who made the critical design choices if they really thought it through. Because that person was you, two years ago, with less context than you have now.


I'm tracing the virtual currency advertising flow. One I designed. A merchant creates a campaign, let's say 50,000 GEMs to boost a restaurant post. Those GEMs go into escrow in the Wallet Service. Users scroll their feed, see the boosted post, engage with it. Each engagement triggers a flow: the Ad Engine processes the engagement, calls the Wallet Service to release GEMs from the merchant's escrow into the user's wallet. Five services, one business transaction.
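At its core, each release is one atomic transfer from escrow to wallet. A minimal sketch of a single release, with illustrative table names, column names, and amounts (not the real schema, and the real path is batched):

```sql
-- One engagement reward: move 5 GEMs from campaign escrow to a user
-- wallet. All names and amounts here are illustrative.
BEGIN;

UPDATE campaign_escrow
   SET balance = balance - 5
 WHERE campaign_id = 'camp_123'
   AND balance >= 5;              -- guard: never overdraw the escrow

UPDATE wallets
   SET balance = balance + 5
 WHERE user_id = 'user_456'
   AND currency_type = 'GEM';

INSERT INTO wallet_transactions (user_id, campaign_id, amount, reason)
VALUES ('user_456', 'camp_123', 5, 'engagement_reward');

COMMIT;
```

The batch path collapses thousands of these into set-based statements; the point of the sketch is only that the escrow debit, the wallet credit, and the ledger row have to commit or fail together.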

I'm reading the campaign disbursement code in the Ad Engine. Function processEngagement(). Line 147. There's a comment I wrote months ago:

```
// Batch disbursement runs every 30 seconds
// Avg batch size: 8,000-12,000 engagements
```

Eight to twelve thousand gem transfers, batched into a single call to the Wallet Service. Every thirty seconds. When I wrote that comment, the average batch size was 400. I designed the batch system for maybe 2,000 at peak. Now it's processing six times that, and it's holding. But every time I look at it, I think: I didn't design this for what it became.

I scroll to the error handling section. There's a retry queue for failed disbursements. A dead letter queue for retries that fail three times. A reconciliation job that runs nightly to catch discrepancies between the campaign escrow ledger and actual wallet balances. There's monitoring for cases where the disbursement rate drops below a threshold, because that means users aren't getting paid for their engagement, and users who don't get paid stop engaging.

Each piece is rational. Each piece makes sense in isolation. I built each piece. But I'm holding the whole thing in my head, the escrow, the batch processing, the retry logic, the reconciliation, the monitoring thresholds, and the thought I can't shake is: I built this. But did I build it well enough for what it's become?


I don't say this to anyone. Obviously. You don't lead a platform's architecture and say "Hey, I'm terrified that the foundation I laid might crack under thirty million users." You say "The system is performing within expected parameters" and you keep your internal monologue to yourself.

The internal monologue goes like this:

I've been writing code professionally for a decade. I've shipped features. I've built things. I built this thing. But the version of me who made the core design decisions had never operated a system where a mistake in a WHERE clause affects 30,000 users. Had never worked at a scale where 0.1% is a city.

What if there's a flaw in the wallet architecture that only shows up at this load? What if an optimization I ship introduces a race condition that drains gem balances? What if the thing I'm confident about, the thing I designed, is the thing that brings down the escrow system?

And the team asks me what can break. They look to me for guidance. Because I built it. Because I'm supposed to know. And most of the time I do know. But sometimes the honest answer is: I'm not sure anymore. It's bigger than what I designed for.

This is the flavor of imposter syndrome nobody writes Medium posts about. It's not "I don't deserve to be here." It's "I built the thing everyone depends on, real people use this for real transactions, and the consequences of my past decisions are 150 times larger than anything I anticipated when I made them."


Week two after the platform crosses 30M. I'm looking at the balance read endpoint in the Wallet Service. The query is simple. SELECT balance FROM wallets WHERE user_id = $1 AND currency_type = $2. There's a composite index on (user_id, currency_type). Average response time is 4ms. P99 is 23ms.

At 200K users, nobody would notice. At 30M users, that P99 tail means roughly 114 requests per minute are taking 23ms or more, against the 4ms average. Those 114 slow requests back up the connection pool during peak hours. The connection pool backup cascades into the campaign disbursement batch, which shares the same database. The disbursement batch slows down. Users wait longer for their GEMs. Engagement metrics dip. The product team sees the dip. Somebody flags it.

Nineteen milliseconds of tail latency spawned a chain of cause and effect that I can trace across three services and two dashboards. And the database schema those milliseconds live in? I designed it.

I run EXPLAIN ANALYZE on the query. The index scan is fine. My first suspect is TOAST decompression on a metadata JSONB column. Wait, that column isn't even in the SELECT. But the table has wide rows, and the heap access after the index scan, on cache-cold pages, is what's pushing P99.

I add a covering index: CREATE INDEX idx_wallet_balance_cover ON wallets (user_id, currency_type) INCLUDE (balance). Index-only scan. No heap access. P99 drops to 6ms.
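Worth noting: Postgres can also build the same index without blocking writes at all, at the cost of a longer build (the INCLUDE clause requires Postgres 11+):

```sql
-- CONCURRENTLY keeps writes flowing during the build. It cannot run
-- inside a transaction block, and a failed build leaves an INVALID
-- index behind that has to be dropped manually.
CREATE INDEX CONCURRENTLY idx_wallet_balance_cover
    ON wallets (user_id, currency_type) INCLUDE (balance);
```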

I stare at the PR for forty-five minutes before clicking "Create Pull Request." I've run the EXPLAIN ANALYZE twenty times. I've tested with production-mirrored data. I've checked how much WAL (Write-Ahead Log) the index build will generate. I've verified the index creation won't block reads (it will block writes, but only during the early-morning maintenance window, the platform's lowest-traffic hours).

At 200K users, I would have shipped this in ten minutes.

At 30M users, I fact-check my fact-checking. Because this is my database schema, my architecture, and if something goes wrong, it's not just code that breaks. People use these GEMs to pay for food delivery. Merchants run ad campaigns with real money converted to GEMs. A bad deployment doesn't just turn a dashboard red. It means someone's order doesn't go through. Someone's campaign stops disbursing. Real impact.


The PR gets approved. The index is created during the maintenance window. P99 drops to 6ms. The disbursement batch recovers its headroom. Nobody notices. That's the point. Nobody should notice.

And I feel... nothing. Not pride. Not relief. Just the absence of disaster. That's what operating at scale teaches you. Success doesn't feel like success. Success feels like the dashboard not turning red.

I call my friend that weekend. Another engineer. He asks how things are going. I tell him about the index. He says, "That's it? One index?"

I try to explain that the index touches a table serving 11,000 reads per minute across a platform with 30 million users and six business domains and a GEM economy that functions like a small country's financial system. I try to explain that getting it wrong could cascade through the advertising disbursement pipeline and affect merchant payouts and user engagement rates. I try to explain that people buy food with this.

He says, "So you added an index."

Yeah. I added an index.


Month two is when the imposter syndrome shape-shifts. It goes from "Did I build this right?" to "Am I overthinking this?"

Because here's the thing about operating at scale: you develop a paranoia. Every change, every migration, every deployment, you run the scenario. What if this fails? What if this is slow? What if this interacts with something unexpected in one of the other 94 repositories? What if there's a design decision from eighteen months ago that I've forgotten about, and this change collides with it?

The paranoia is useful. It makes you write better code. It makes you test more carefully. It makes you think about failure modes before they happen.

But it also makes you slow. Cautious. You start second-guessing decisions that should take five minutes. You rewrite a PR description four times because you want to make sure the reviewer understands every possible implication. You add monitoring for edge cases that will never happen. You over-engineer the rollback plan.

I'm sitting in my apartment at midnight, writing a runbook for a feature flag rollout, and I think: Is this what senior engineers at scale actually do? Or am I just scared?

The answer, I eventually learn, is both. The paranoia and the expertise are the same thing. The engineer who doesn't feel the weight of 30 million users is the engineer who ships the WHERE clause that deletes 30,000 wallet rows. The appropriate level of fear at this scale is nonzero.

And when you're the one who built it, the fear has an extra dimension. The team comes to you. "Will this migration break anything?" "Can we safely deprecate this endpoint?" "What's the blast radius if this queue backs up?" They ask because you're the person who's supposed to know. And you do know, mostly. But "mostly" at 30 million users leaves a gap that could fit a city.


Month three. I'm looking at the architecture I designed. Six domains. Ninety-five repos. The Wallet Service at the center. Auth through OAuth. Logistics shared between Mall and Food. I know this system. I should. I built it. But knowing it and trusting it at this scale are different things.

I'm reviewing a junior engineer's PR. They've added a new field to the wallet transaction log. A useful field. Clean implementation. But the migration adds a column to a table with 1.2 billion rows. At 200K users, that's a five-second migration. At 30M users with two years of transaction history, that's a table lock that could take minutes.

I leave a comment: "This needs to be a metadata-only ADD COLUMN or a batched background migration. The transactions table has ~1.2B rows. An ALTER TABLE ADD COLUMN that rewrites the table will hold an ACCESS EXCLUSIVE lock for the duration."

The junior engineer replies: "How did you know the row count?"

I know because I created that table. I know because I've been watching it grow from zero to 1.2 billion rows. I know because I've stared at the wallet dashboard enough times that the numbers are tattooed on the inside of my eyelids. 1.2 billion transactions. 32 TPS sustained. 10x peaks during holidays when everyone is gifting GEMs.

But the real answer is: I know because I was scared enough to keep watching.
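The pattern that review comment asks for, sketched on Postgres with illustrative names (on Postgres 11+, an ADD COLUMN with no default, or a constant default, is a metadata-only change):

```sql
-- Fail fast instead of queueing: if the brief ACCESS EXCLUSIVE lock
-- can't be acquired in 2s, abort and retry off-peak.
SET lock_timeout = '2s';

-- Metadata-only on Postgres 11+ (no table rewrite, no long lock).
ALTER TABLE wallet_transactions ADD COLUMN source_ref text;

-- If a backfill is needed, do it in bounded batches so no single
-- statement holds locks for long. The id range here is illustrative.
UPDATE wallet_transactions
   SET source_ref = 'legacy'
 WHERE id BETWEEN 1 AND 100000
   AND source_ref IS NULL;
-- ...repeat, advancing the id window, until the table is covered.
```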


I'm writing this at 10 PM. The imposter syndrome hasn't gone away. It's just changed shape again.

It used to whisper: You're not good enough for what you built.

Now it says: Don't get comfortable. You know enough to be dangerous. Stay scared. People depend on this.

And I think that's the right voice to listen to.

Here's what I've learned about imposter syndrome at scale. It's not a bug in your psychology. It's a feature. The fear is information. It's telling you that the blast radius of your decisions has expanded beyond your intuition, and your intuition needs to catch up.

At 200K users, your intuition is probably fine. At 30M, your intuition is a liability until it's been recalibrated by enough production incidents, enough late-night debugging sessions, enough moments where you stare at a dashboard and feel the weight of a number that used to be a goal on a whiteboard.

Thirty million monthly active users. More people than Australia. Connected by a gem economy that I can trace from merchant escrow to user wallet to food delivery payment in my sleep. An economy I designed. An economy that now runs whether I'm watching or not.

The team asks me what can break. I tell them. Sometimes I'm guessing. Sometimes the honest answer is that at this scale, the failure modes outnumber what any single person can hold in their head, even the person who built it.

And every time I open a PR, there's still a voice that says: Are you sure?

I hope it never stops asking.
