I Thought the Hard Part Was the Code. Turns Out Production Is Where Security Assumptions Go to Die.

Ravi Gupta on April 13, 2026

This is Part 4 of a 4-part series on building AuthShield - a production-ready standalone authentication microservice. This post covers rate limiting...
BridgeXAPI

This part about everything working locally but breaking assumptions in production is very real.

Ran into something similar on the messaging side. Requests were valid, responses were 200, logs looked clean, but delivery behavior still varied in ways we couldn’t explain at first.

What made it tricky is that once the request leaves your system, you’re depending on a whole chain you don’t control. Routing decisions, carrier handling, timing, even filtering can change the outcome without anything in your code changing.

So from the app perspective everything is “correct”, but the execution path isn’t stable.

Feels like a lot of these problems only show up when you treat the system as more than just your code and start looking at what happens after the boundary.

Curious if you ended up adding more visibility around those external layers, or if you just accepted some level of unpredictability there.

Ravi Gupta

Exactly this. The boundary between your system and the external layer is where the interesting failures live, and they are the hardest to debug because everything on your side looks correct.

On visibility around those external layers - partially. The startup SMTP check and health endpoint catch configuration failures immediately. Richer error logging on failures helps narrow down where in the chain things broke. But once the email leaves your SMTP server you are largely dependent on delivery receipts and bounce handling, which I have not fully wired up yet.

The honest answer is that some level of unpredictability is just the reality of depending on external systems. The best you can do is fail loudly at the boundary, log enough context to reconstruct what happened, and know quickly when something breaks rather than finding out two days later from a user complaint.

BridgeXAPI

Yeah that’s fair.

I used to think of it as unpredictability too, but over time it started looking more like hidden variation rather than randomness.

Same request leaves your system, but depending on how it gets handled downstream, you end up with different timing, paths or even filtering decisions.

From the outside it feels unpredictable, but there’s actually structure there, just not exposed.

That’s the part I’ve been finding hardest to reason about.

Ravi Gupta

"Hidden variation rather than randomness" is a much better mental model. There is structure in how those downstream systems behave, it is just not exposed to you. That is what makes it hard to reason about - you are trying to debug a system you can only observe at the boundary, not inspect from inside.

BridgeXAPI

Exactly that.

At some point the problem isn’t that things fail, it’s that you don’t have a model of how the system behaves anymore.

You can observe inputs and outputs, but without visibility into the execution path, you can’t really reason about why outcomes differ.

That’s where most debugging just turns into trial and error.

Ravi Gupta

Trial and error is exactly where you end up without a model of the execution path. At that point you are not debugging, you are guessing with extra steps. The visibility problem is what makes external system failures so expensive - you cannot reason about what you cannot observe.

Mykola Kondratiuk

the operational layer is always the gap - auth code gets reviewed, rate limits and smtp config get skimmed. spent two days debugging a prod email issue that never surfaced in staging.

Ravi Gupta

That's exactly what happened during my own deployment: registration returned 201, the user was created, but the verification email was silently dropped due to an env var mismatch (SMTP_USER vs SMTP_USERNAME). The code worked perfectly; the operational layer failed silently.
Two things I've added since:

1. Startup SMTP check - the app tests the SMTP connection at boot and logs a loud warning if it fails (see the sketch below). Catches misconfigured credentials in the first 10 seconds of a deploy rather than when a user complains.
2. Richer failure logs - email errors now log the host, port, username, and whether the password env var was even set. Turns a two-day debug into two minutes.
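
Roughly, the startup check looks like this - a minimal sketch rather than the exact AuthShield code, using smtplib with STARTTLS; apart from SMTP_USER, the env var names here are illustrative:

```python
import os
import logging
import smtplib

logger = logging.getLogger("authshield.email")

def check_smtp_on_startup() -> bool:
    """Fail loudly at boot if SMTP connectivity or credentials are broken."""
    host = os.getenv("SMTP_HOST", "")
    port = int(os.getenv("SMTP_PORT", "587"))
    user = os.getenv("SMTP_USER", "")
    password = os.getenv("SMTP_PASSWORD")  # only logged as present/missing, never the value

    try:
        with smtplib.SMTP(host, port, timeout=10) as smtp:
            smtp.starttls()
            smtp.login(user, password or "")
        logger.info("SMTP startup check passed (host=%s port=%s)", host, port)
        return True
    except Exception as exc:
        # Enough context to spot a config mismatch in seconds instead of days.
        logger.error(
            "SMTP startup check FAILED (host=%s port=%s user=%s password_env_set=%s): %s",
            host, port, user, password is not None, exc,
        )
        return False
```

It runs once at boot and only warns, so a briefly flaky provider doesn't block the deploy - the point is the loud log line, not a crash.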

The broader lesson I learned is that Mailtrap in staging creates false confidence because the code path is identical but the operational layer is completely different. The only real test of production email is production email.

Mykola Kondratiuk

The SMTP_USER vs SMTP_USERNAME gap is exactly what I call config drift - code and infra docs evolving separately. I now treat env var audits as a mandatory pre-launch gate, not a debug step after the fact. A startup smoke test that actually hits the email path catches this before users see it.

Ravi Gupta

Config drift is exactly the right term for it. Env var audits as a pre-launch gate rather than a post-incident debug step - that is a habit I am adopting going forward. Making the startup smoke test a hard check rather than a hope is the right call.
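
For anyone following along, the audit itself can be blunt - a sketch of a hard gate that runs before anything else; the variable list is illustrative, not AuthShield's actual config:

```python
import os
import sys

# Illustrative list - whatever the service actually reads from the environment.
REQUIRED_ENV_VARS = [
    "DATABASE_URL",
    "REDIS_URL",
    "JWT_SECRET",
    "SMTP_HOST",
    "SMTP_PORT",
    "SMTP_USER",      # not SMTP_USERNAME - this check is what catches the drift
    "SMTP_PASSWORD",
]

def audit_env() -> None:
    """Hard gate: refuse to start if any required env var is missing or empty."""
    missing = [name for name in REQUIRED_ENV_VARS if not os.getenv(name)]
    if missing:
        print(f"FATAL: missing required env vars: {', '.join(missing)}", file=sys.stderr)
        sys.exit(1)

audit_env()  # call before the app imports settings or opens connections
```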

Mykola Kondratiuk

Startup smoke test as a hard gate is so underrated. Most teams treat env var validation as a post-incident lesson rather than a pre-flight checklist. Decoupling the audit from the incident timeline is exactly what separates teams that scale from teams that perpetually firefight.

Ravi Gupta

"Separates teams that scale from teams that perpetually firefight" - that framing is going in my notes. Pre-flight checklist is the right mental model. Catching it before the incident rather than learning from it after is the whole point.

Miloslav Homer

Great effort!

Auth is one of the most heavily targeted functionalities in prod. Even low rate limits are problematic for the registration endpoint (I'd recommend a captcha).

How did you set up logging and monitoring, please? This is invaluable info when investigating incidents.

Ravi Gupta

Thank you!

You are absolutely right. Per-IP rate limiting does not stop distributed attacks. CAPTCHA on registration is the proper fix, hCaptcha or Cloudflare Turnstile specifically. It is on the roadmap. The current defence is Nginx rate limiting, a Redis sliding window, and the bcrypt cost factor making each attempt slow even if it gets through.
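
For reference, the sliding window is just a sorted set keyed per IP - a rough sketch with redis-py, not the exact AuthShield code (key name and limits are illustrative):

```python
import time
import uuid
import redis

r = redis.Redis()  # assumes a local Redis; point it at REDIS_URL in practice

def allowed(ip: str, limit: int = 5, window_seconds: int = 60) -> bool:
    """Allow at most `limit` attempts per `window_seconds` per IP."""
    now = time.time()
    key = f"ratelimit:login:{ip}"

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop attempts outside the window
    pipe.zadd(key, {uuid.uuid4().hex: now})              # record this attempt, scored by time
    pipe.zcard(key)                                      # count attempts left in the window
    pipe.expire(key, window_seconds)                     # let idle keys clean themselves up
    _, _, count, _ = pipe.execute()

    return count <= limit
```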

On logging, AuthShield uses structlog for structured JSON logs on every auth event. The two alerts worth setting up first are a spike in AUTH_INVALID_CREDENTIALS (brute force signal) and any AUTH_REFRESH_TOKEN_REUSED event (token theft signal, every single occurrence deserves investigation). These would ship to Better Stack or Datadog in production.
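
As a rough illustration of what those events look like (the event names are the real ones from above; the bound fields are examples):

```python
import structlog

# JSON output so the log pipeline (Better Stack, Datadog, ...) can index fields.
structlog.configure(processors=[
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])
log = structlog.get_logger()

# Spikes in this event are the brute-force signal:
log.warning("AUTH_INVALID_CREDENTIALS", email="user@example.com", ip="203.0.113.7")

# Any single occurrence of this one is the token-theft signal:
log.error("AUTH_REFRESH_TOKEN_REUSED", user_id="1234", token_family="abc123")
```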

Good suggestions, appreciate the feedback!

Miloslav Homer

I'd also recommend logging missing username/email and logging successes. And that's before we get into all that GeoIP business.

I've also noticed that the DBs are included in the Docker Compose file - I'd be careful around that, one wrong push and you're overwriting your identity DB.

Good luck, this is tough, very tough to get right.

Ravi Gupta

Really appreciate this, clearly production-grade thinking that is hard to learn from tutorials.

On granular logging: you are right, failure reasons are too coarse right now. Distinguishing email_not_found, wrong_password, account_disabled makes the difference between "something failed" and "we are being credential stuffed." Adding this.

On GeoIP: on the roadmap. MaxMind GeoLite2 is the free path, an offline database so there is no API call per request. A country code on every auth event is enough to start spotting unusual patterns before getting into impossible travel detection.
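
Concretely, both of those end up as extra fields on the same structured event - a sketch assuming structlog plus the geoip2 reader; the database path and field names are illustrative:

```python
import structlog
import geoip2.database
import geoip2.errors

log = structlog.get_logger()
geo = geoip2.database.Reader("/data/GeoLite2-Country.mmdb")  # offline MaxMind GeoLite2 DB

def log_login_failure(email: str, ip: str, reason: str) -> None:
    """reason: email_not_found, wrong_password, or account_disabled.
    Internal logs only - the API response stays generic to avoid user enumeration."""
    try:
        country = geo.country(ip).country.iso_code  # e.g. "DE"
    except geoip2.errors.AddressNotFoundError:
        country = None
    log.warning("AUTH_INVALID_CREDENTIALS", email=email, ip=ip,
                reason=reason, country=country)
```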

On Docker Compose and the DB: valid warning, and you're right to flag it - more people should hear this. Our production database is on Neon, completely independent of any Docker Compose operation, so that specific risk does not apply here. I'm adding a prominent warning comment to docker-compose.yml to make that explicit for anyone who forks the repo.

Genuinely useful feedback - comments like yours teach more than any tutorial. Thank you!

Henry A

This matches what I've seen across dozens of AWS accounts. The gap between "it works in dev" and "it's secure in production" is almost always the same list: no CloudTrail in all regions, default VPC still exists, S3 buckets with overly permissive policies, IAM users with long-lived access keys and no MFA enforcement, and no GuardDuty or Config rules to catch drift.

The frustrating part is that most of these are 15-minute fixes with the right CloudFormation or Terraform — but teams don't know what to check. CIS AWS Foundations Benchmark is a solid starting point. Even just implementing the Level 1 controls (password policy, CloudTrail, S3 Block Public Access, root account monitoring) closes 80% of the attack surface that trips people up in production.