Ever wondered why teams ditch third-party auth giants like Google or Firebase for a homegrown system? It's not just about control — it's about tailoring security to real-world chaos. Here's how we evolved our auth strategy from basic sessions to a robust, scalable JWT hybrid, solving revocation headaches at scale. (Spoiler: It's inspired by PayPal and SuperTokens, but customized for API-first, mobile/SPA apps.)
Why Build Custom Auth Over Third-Party Services?
Third-party tools shine for simple logins, but fall short on complexity. We opted for our own system to handle:
- Granular RBAC: Fine-tuned roles and permissions integrated with internal business logic.
- Custom Workflows: Complex approvals, document validation, and verification not supported out-of-the-box.
- Enhanced Control: Better security, scalability, and audit logging for operational platforms.
- Hybrid Approach: Kept Google login for user simplicity, while owning the rest for vendors, admins, and partners.
This gives us the edge in flexibility without compromising basics.
Stateless vs. Stateful: Why JWT?
JWTs are signed, tamper-evident tokens that verify users without constant DB hits — perfect for stateless auth. Here's a quick JWT breakdown:
How JWT Works (Simple Implementation):
- Header: Algorithm (e.g., HS256) and type.
- Payload: Claims like user_id, roles, and jti (unique ID).
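- Signature: An HMAC of the encoded header and payload, computed with a secret key. The token is signed, not encrypted, so anyone holding it can read the payload.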
Example Payload:
{
  "jti": "uuid-here",
  "roles": ["admin", "user"],
  "user_id": "abc123"
}
On request: Server verifies signature; no DB needed for validation.
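To make the header/payload/signature split concrete, here is a minimal HS256 sketch using only the Python standard library. It is an illustration, not our production code: the function names (sign_jwt, verify_jwt) and the demo secret are assumptions, and it skips expiry checks, which a real deployment (or an established JWT library) would add.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url_encode(data: bytes) -> str:
    """Base64url without padding, as used by JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _hs256(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_jwt(payload: dict, secret: bytes) -> str:
    """Build header.payload.signature (hypothetical helper name)."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(segments)
    return signing_input + "." + b64url_encode(_hs256(signing_input, secret))


def verify_jwt(token: str, secret: bytes) -> Optional[dict]:
    """Return the payload if the signature checks out, else None. No DB involved."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = _hs256(header_b64 + "." + payload_b64, secret)
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
            return None
        if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
            return None
        return json.loads(b64url_decode(payload_b64))
    except ValueError:  # wrong segment count, bad base64, or bad JSON
        return None
```

Changing a single character of the payload segment changes the expected HMAC, so verify_jwt returns None for a tampered token without touching any datastore.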
We chose JWTs over traditional session-based auth for these reasons:
- API-First Design: Storing roles in JWT avoids DB calls per request (vs. session_id lookups).
- Future-Proof Scaling: Sessions struggle with horizontal scaling in microservices; JWTs distribute easily.
- Modern Apps: Ideal for mobile and SPA setups, unlike legacy PHP/Rails session cookies.
Challenges with Stateless JWTs
Stateless is efficient, but not flawless. We hit these roadblocks:
- Hard Revocation: Can't invalidate leaked tokens without state.
- Multi-Device Logout Issues: Logging out everywhere is tricky.
- Leak Risks: Exposed tokens grant access until expiry.
- Token Size: Every claim travels with every request, so bulky payloads add bandwidth and parsing overhead.
Our initial single-token setup amplified problems:
- Short expiry (e.g., 1 day) forced frequent re-logins.
- Long expiry (e.g., 7 days) created security gaps — no revocation if leaked.
- No clean rotation: Couldn't separate short-term access from long sessions without state.
- Scaling Bottleneck: Checking token validity against the database on every request hampers horizontal growth.
Dilemma: Short-lived tokens are secure but hurt UX; long-lived tokens are convenient but risky.
Research and Evolution: From Single Token to Hybrid
We dove into articles on token management (shoutout to SuperTokens' JWT revocation guide) and dissected PayPal's granular auth. The breakthrough? A hybrid fusion:
- Access Token: Short-lived (15 mins) for per-request authentication to APIs.
- Refresh Token: Long-lived (7 days) to issue new access tokens seamlessly.
This splits concerns: access tokens for security, refresh tokens for UX. The catch: every refresh (roughly every 15 minutes per active user) needed a DB lookup to check revocation, which is inefficient at scale.
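A rough sketch of issuing the two tokens' claims with the lifetimes above. The claim names "typ", "iat", and "exp", the choice to keep roles out of the refresh token, and the function name issue_token_pair are assumptions made for illustration; signing is left out to keep the example short.

```python
import time
import uuid
from typing import List, Optional, Tuple

ACCESS_TTL_SECONDS = 15 * 60            # 15 minutes
REFRESH_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


def issue_token_pair(
    user_id: str, roles: List[str], now: Optional[int] = None
) -> Tuple[dict, dict]:
    """Return (access_claims, refresh_claims), each with its own jti."""
    issued_at = int(time.time()) if now is None else now
    access = {
        "jti": uuid.uuid4().hex,  # 32 hex characters
        "typ": "access",          # hypothetical claim to tell the tokens apart
        "user_id": user_id,
        "roles": roles,           # roles ride in the short-lived token only
        "iat": issued_at,
        "exp": issued_at + ACCESS_TTL_SECONDS,
    }
    refresh = {
        "jti": uuid.uuid4().hex,
        "typ": "refresh",
        "user_id": user_id,
        "iat": issued_at,
        "exp": issued_at + REFRESH_TTL_SECONDS,
    }
    return access, refresh
```

Keeping roles out of the refresh token is one possible design choice: a role change then takes effect at the next refresh, at most 15 minutes later.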
Optimizing with Redis and UUIDs
To cut DB load, we added Redis for caching:
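Issue: Storing full 700–800 byte tokens for 1M users takes 700MB+ of memory.
Fix: Store only the jti, a UUID written as 32 hex characters (32 bytes as a string), which cuts this to roughly 32MB for 1M users, not counting key names and Redis's per-key overhead.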
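The arithmetic behind those figures, as a quick check. It counts stored values only and assumes 750 bytes as the midpoint of the 700–800 byte range; key names and Redis's own per-key bookkeeping come on top.

```python
USERS = 1_000_000
FULL_TOKEN_BYTES = 750  # assumed midpoint of the 700-800 byte range
JTI_BYTES = 32          # UUID as 32 hex characters


def value_memory_mb(bytes_per_user: int, users: int = USERS) -> float:
    """Memory for the stored values alone, in decimal megabytes."""
    return bytes_per_user * users / 1_000_000


full_mb = value_memory_mb(FULL_TOKEN_BYTES)
jti_mb = value_memory_mb(JTI_BYTES)
print(f"full tokens: {full_mb:.0f} MB, jti only: {jti_mb:.0f} MB")
```

Storing both an access and a refresh entry per user doubles each figure, but the ratio between the two approaches stays the same.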
Storage:
SET session:<user_id>:access_jti <jti> EX 900        (15 min)
SET session:<user_id>:refresh_jti <jti> EX 604800    (7 days)
Validation:
- Check jti in Redis on requests
- Revoke by deleting the Redis key
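The check-and-revoke flow can be sketched without a Redis server. TtlStore below is an in-memory stand-in (an assumption for illustration, not a Redis client) that mimics SET with EX, GET, and DEL; the helper names are hypothetical too.

```python
import time
from typing import Dict, Optional, Tuple


class TtlStore:
    """In-memory stand-in for Redis SET ... EX / GET / DEL."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: int,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl_seconds)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:  # expired entries behave like missing keys
            del self._data[key]
            return None
        return value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def session_key(user_id: str, kind: str) -> str:
    return f"session:{user_id}:{kind}_jti"


def is_active(store: TtlStore, user_id: str, kind: str, jti: str,
              now: Optional[float] = None) -> bool:
    """A token is accepted only if its jti matches the stored one."""
    return store.get(session_key(user_id, kind), now) == jti


def revoke(store: TtlStore, user_id: str, kind: str) -> None:
    """Revocation is just deleting the key."""
    store.delete(session_key(user_id, kind))
```

One consequence of this key scheme worth noting: with a single key per user and token kind, a new login overwrites the previous jti, so supporting several devices at once would need a per-device or per-jti key.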
Results: Each validity check is a single in-memory key lookup, the stored state stays small, and we get stateful revocation while keeping compact, self-describing access tokens.
Handling Redis Failures: No Single Point of Breakdown
Redis is fast, but on its own it is a single point of failure (SPOF). Our multi-layer backup strategy, inspired by industry pros:
- Primary: On-Prem Redis for speed.
- Hot Backup: Embedded Cache (e.g., Caffeine in Java or node-cache in JS).
- Fallback: Postgres as Truth with partial indexes on jti/user_id for fast queries, plus PgBouncer for pooling.
- Async Resilience: Write-through to Postgres synchronously; async workers sync Redis on recovery (à la LinkedIn).
If Redis Fails: Query Postgres directly — optimized for spikes, like GitHub's OAuth scaling.
Slack-Style Hybrid: Short TTLs (e.g., 30s) in DB for revocation, avoiding constant calls via async writes.
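One way the layered lookup could be wired together, as a sketch. The three getters are passed in as plain callables so the example stays self-contained; StoreUnavailable, resolve_jti, and the specific policy (Redis's answer is final when it responds, the local cache is trusted only on a hit, Postgres decides otherwise) are assumptions for illustration.

```python
from typing import Callable, Optional

Getter = Callable[[str], Optional[str]]


class StoreUnavailable(Exception):
    """Raised by a getter when its backing store cannot be reached."""


def resolve_jti(key: str, redis_get: Getter, local_get: Getter,
                postgres_get: Getter) -> Optional[str]:
    """Look up the stored jti for a key, degrading layer by layer."""
    try:
        # Normal path: Redis answers, and a miss means revoked or expired.
        return redis_get(key)
    except StoreUnavailable:
        pass
    # Redis is down: accept a hit from the embedded cache...
    cached = local_get(key)
    if cached is not None:
        return cached
    # ...otherwise ask Postgres, the source of truth.
    return postgres_get(key)
```

The embedded cache can still hold a jti that was revoked after it was cached, which is one reason to keep its TTL short (seconds rather than minutes).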
What If Queue Is Delayed (e.g. 20 min)?
Problem statement:
If refresh_token_jti is written to Redis and only queued for async DB persistence, a long queue delay (e.g., 20+ min) creates a risk. If Redis evicts or loses the jti before the queued write reaches the DB, and the access token expires in that window, the user can't refresh their session: neither store knows the jti, so the system treats a correctly issued token as invalid. The result is unexpected 401s caused by a race between caching and persistence.
Our fix in this case:
- Write refresh_token_jti to Redis (TTL 7d)
- Write basic refresh_token_jti to DB immediately (sync) with minimal info
- Queue full enrichment write (device info, IP, etc.) for async processing
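The three steps above can be sketched as follows. Plain dicts stand in for Redis and Postgres, and queue.Queue stands in for the message queue; all names here are hypothetical. The point is only the ordering: the minimal DB row exists before the function returns, so a delayed queue can no longer strand a valid refresh token.

```python
import queue
from typing import Dict


def persist_refresh_jti(user_id: str, jti: str, details: dict,
                        cache: Dict[str, str], db: Dict[str, dict],
                        enrichment_queue: "queue.Queue") -> None:
    # 1. Cache write (stand-in for: SET session:<user_id>:refresh_jti <jti> EX 604800)
    cache[f"session:{user_id}:refresh_jti"] = jti
    # 2. Synchronous minimal row: enough to validate a refresh later.
    db[jti] = {"user_id": user_id}
    # 3. Everything else (device info, IP, etc.) can wait in the queue.
    enrichment_queue.put((jti, details))


def drain_enrichment(db: Dict[str, dict],
                     enrichment_queue: "queue.Queue") -> None:
    """Worker side: merge queued details into rows that still exist."""
    while True:
        try:
            jti, details = enrichment_queue.get_nowait()
        except queue.Empty:
            return
        if jti in db:  # skip rows revoked while the job was queued
            db[jti].update(details)
```

Even if the worker runs 20 minutes late, a refresh request arriving in between finds the jti in the DB; only the device and IP metadata are temporarily missing.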
Key Wins and Takeaways
- Security Boost: Quick revocation, reduced leak windows.
- Scalability: Designed for millions of users, with per-request checks served from cache instead of the DB.
- UX Magic: Seamless sessions with minimal re-logins.
- Adaptable: Generic enough for any API-first, RBAC-heavy app — tweak expiries, payloads, or caches to fit your stack.
Final Thoughts
Whether you're scaling a new product or revamping an existing one, smart token design can balance security, UX, and performance. This hybrid architecture saved us from many fires and gave us deeper control.
What's your go-to strategy for JWT revocation at scale? Have you battled similar auth dilemmas? Drop a comment — let's geek out!
This article was originally published on Medium. I’m sharing it here for the Dev.to community!
Top comments (2)
Thank you for the informative article.
What are your architecture and design implementation for Granular RBAC and Custom Workflows?
Glad you found it helpful! 🙂 For granular RBAC, I’ve implemented a role–permission model with hierarchical roles and resource-level access, so it’s easy to extend or restrict access at user, vendor, or admin scope. For custom workflows, I’ve used a rule-based/state-machine approach, where workflows are defined in a configurable way (like handling booking extensions, vendor approvals, or penalties). This gives flexibility to adapt workflows without changing core logic.