Designing scalable systems isn’t just interview prep. It’s what separates systems that survive at 100k users from those that survive at 10M.
Here are 11 real-world challenges you’ll face (or already are facing) in backend and frontend architecture, along with solutions battle-tested by engineers in production. 👇
1. How do you design a database schema for millions of users without performance issues?
Designing a database for millions of users isn’t just about tables—it’s about thinking ahead for scale. Here’s how I approach it:
1️⃣ Start with indexing –
Use proper primary keys and indexes for columns you frequently search or filter.
No index = slow queries as your data grows.
2️⃣ Normalize first, denormalize when needed –
• Normalize to avoid duplicate data and keep storage lean.
• Denormalize (duplicate some data) if joins become too slow at scale.
3️⃣ Partition or shard data –
Split huge tables by region, user ID ranges, or time so queries touch smaller chunks.
4️⃣ Use caching smartly –
• Redis/Memcached for frequently accessed data
• Avoid hitting the DB for every read
5️⃣ Archive old data –
• Keep the active database lean
• Move old logs or inactive users to cold storage
💡 Pro tip:
Always design for 10x growth. If your schema struggles at 100k users, it will collapse at 1M.
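To make point 1 concrete, here is a minimal sketch using SQLite from Python’s standard library. The table and index names are made up for illustration; the idea transfers directly to MySQL/PostgreSQL. `EXPLAIN QUERY PLAN` shows whether a query hits the index or falls back to a full table scan.

```python
import sqlite3

# In-memory DB as a stand-in for a real database; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,       -- primary key is indexed automatically
        email TEXT NOT NULL,
        country TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
""")
# Index the columns you actually filter on, e.g. login by email:
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
# Composite index for a common query pattern (country + recency):
conn.execute("CREATE INDEX idx_users_country_created ON users (country, created_at)")

conn.execute(
    "INSERT INTO users (email, country, created_at) VALUES (?, ?, ?)",
    ("a@example.com", "IN", "2024-01-01"),
)

# EXPLAIN QUERY PLAN reveals whether the index is used instead of a scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?", ("a@example.com",)
).fetchall()
print(plan)  # the plan should mention idx_users_email, not a full scan
```

Running the same `EXPLAIN` before and after adding an index is the quickest way to verify your schema will hold up as the table grows.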
2. How do you secure connections between multiple microservices?
To secure communication between multiple microservices, you don’t just rely on plain HTTP calls. You need authentication + encryption + trust at every step.
🔐 Common ways to do it:
• mTLS (Mutual TLS): Both client and server verify each other’s identity before exchanging data.
• Service Mesh (Istio, Linkerd): Handles encryption, service-to-service authentication, and policies automatically.
• API Gateway: Acts as a secure entry point, handling auth tokens and rate-limiting before traffic reaches services.
• JWT or OAuth Tokens: Each request is verified so only trusted services can talk to each other.
⚠️ Without this, a single compromised service can become an open door to your entire system.
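To illustrate the trust idea without pulling in a JWT or mTLS library, here is a stdlib-only sketch of a signed service-to-service token with replay protection. The secret and service names are made up; in production you would use mTLS or standard signed JWTs rather than rolling your own scheme.

```python
import hashlib
import hmac
import time

# Shared secret between two services; a stand-in for real key management.
SECRET = b"demo-shared-secret"

def sign_request(service_name: str, timestamp: int) -> str:
    """Caller attaches this signature so the receiver can verify its identity."""
    msg = f"{service_name}:{timestamp}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_request(service_name: str, timestamp: int, signature: str,
                   max_age: int = 30) -> bool:
    """Receiver checks the signature and rejects stale requests (replay protection)."""
    if time.time() - timestamp > max_age:
        return False
    expected = sign_request(service_name, timestamp)
    return hmac.compare_digest(expected, signature)  # constant-time compare

now = int(time.time())
sig = sign_request("billing-service", now)
print(verify_request("billing-service", now, sig))   # True
print(verify_request("intruder-service", now, sig))  # False
```

Note the use of `hmac.compare_digest` rather than `==`: a plain string comparison leaks timing information an attacker can exploit.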
3. How do you decide between vertical scaling and horizontal scaling for your backend?
Choosing between vertical scaling and horizontal scaling depends on your system’s growth, budget, and performance needs.
✅Vertical Scaling (Scaling Up)
• Add more CPU, RAM, or storage to your existing server.
• Simple to implement, no app changes required.
• But has a hard limit—you can’t scale forever.
• Risk: single point of failure (if that server goes down, everything is down).
✅Horizontal Scaling (Scaling Out)
• Add more servers/nodes and distribute the load.
• Improves reliability (one server down ≠ system down).
• Works great for high traffic apps.
• But it’s complex — you need load balancers, data replication, and often changes in your app architecture.
⚖️ Rule of Thumb:
• Start with vertical scaling — it’s cheaper & simpler.
• Move to horizontal scaling when traffic grows beyond what a single machine can handle.
• Modern systems usually mix both.
4. How do you ensure cache consistency when data changes frequently?
Keeping cache consistency is one of the trickiest challenges when your data changes often — because stale data can break user trust faster than downtime.
Here’s how to handle it like a pro:
1️⃣ Short TTL (Time-to-Live) — Keep cached items for just enough time to benefit from speed, but refresh before they go stale. Example: 30–60 seconds for highly dynamic data.
2️⃣ Cache Invalidation Strategies:
• Write-through → Write to cache + database together, so they’re always in sync.
• Write-behind → Write to cache first, update DB asynchronously (faster, but risk of brief inconsistency).
• Explicit purge → Delete or update cache whenever the source data changes.
3️⃣ Event-driven cache updates — Use a message broker (Kafka, RabbitMQ, Redis Pub/Sub) so that when data changes, all caches listening to that topic get updated instantly.
4️⃣ Versioning / ETags — Store a version number or hash for each cached object. Before serving, check if the version matches the latest in the DB.
5️⃣ Read-through caching — If data is missing or stale, fetch fresh from DB, store in cache, then serve — keeps cache self-healing.
Golden rule: Never blindly trust your cache — have a freshness check if the cost of stale data is high.
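The write-through and versioning strategies above can be sketched in a few lines. Plain dicts stand in for the real database and for Redis/Memcached, and the key names are invented for the example; the point is that every write updates both stores, and a version counter lets reads detect and purge stale entries.

```python
# Toy write-through cache with version-based freshness checks.
database: dict = {}  # stand-in for the source of truth
cache: dict = {}     # stand-in for Redis/Memcached

def write_through(key: str, value):
    version = database.get(key, (None, 0))[1] + 1
    database[key] = (value, version)   # write the source of truth first
    cache[key] = (value, version)      # then keep the cache in sync

def read(key: str):
    if key in cache:
        value, version = cache[key]
        # Freshness check (cheap here; in practice compare ETags/versions):
        if database.get(key, (None, 0))[1] == version:
            return value
        del cache[key]                 # stale entry: purge and fall through
    value, version = database[key]     # read-through: repopulate on a miss
    cache[key] = (value, version)
    return value

write_through("user:1", {"name": "Asha"})
print(read("user:1"))                  # served consistently from the cache
```

If another writer updates the database out of band, the version check catches it on the next read and the cache heals itself, which is exactly the read-through behavior described in point 5.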
5. How do you cache data effectively using Redis and Memcached?
To cache data effectively with Redis or Memcached, focus on 3 key points:
1️⃣ Cache the right data → Things that are read often but don’t change constantly (user sessions, product listings, dashboards).
2️⃣ Set a TTL (time-to-live) → Prevents serving stale data forever and keeps your cache healthy.
3️⃣ Pick the right tool →
• Use Memcached for simple key-value lookups with ultra-fast performance.
• Use Redis if you need more advanced features like pub/sub, data structures (lists, sets), persistence, or distributed locks.
⚡ Pro tip: Always design your app to gracefully fall back to the database if the cache is unavailable (cache-aside pattern is a classic).
6. When do you split a monolithic frontend into microfrontends?
You should split a monolithic frontend into microfrontends when your application:
🔸 Has grown so large that it’s slowing down your teams
🔸 Needs independent development, testing, and deployment per module
🔸 Involves multiple teams working on different features (like dashboard, payments, admin)
🔸 Is facing merge conflicts or long CI/CD cycles due to one shared codebase
🔸 Wants to experiment or migrate to different tech stacks (like React + Vue in the same app)
🎯 The goal? Decouple teams and scale faster without stepping on each other’s code.
But beware — microfrontends add operational complexity, so don’t choose them just to follow trends. Choose them when the pain of monolith maintenance exceeds the cost of splitting.
7. How do you prevent a single slow service from causing cascading failures in microservices?
To prevent a single slow service from causing cascading failures in a microservices setup:
1. ⏱️ Timeouts – Always set timeouts. Never wait forever for a response.
2. 🔁 Retry with Backoff – Retry smartly, not blindly. Avoid overload.
3. 🚫 Circuit Breakers – Stop calls to failing services temporarily.
4. 🧱 Bulkheads – Isolate services to contain failure.
5. 📊 Monitoring – Detect slowness before it becomes disaster.
Design with failure in mind. Every service should assume others can go down.
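A circuit breaker (point 3) is simple enough to sketch directly. This is a minimal illustration, not a production implementation; thresholds and cooldowns are made-up numbers, and libraries like Resilience4j add half-open probing, metrics, and thread safety on top of the same idea.

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls immediately for
    `cooldown` seconds instead of piling load onto a struggling service."""

    def __init__(self, threshold: int = 3, cooldown: float = 5.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cooldown over: allow a trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0              # any success resets the count
        return result

def flaky():
    raise TimeoutError("downstream too slow")

breaker = CircuitBreaker(threshold=2, cooldown=60)
for _ in range(2):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass
try:
    breaker.call(flaky)                # breaker is now open
except RuntimeError as e:
    print(e)                           # circuit open: failing fast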
8. How do you decide between SQL and NoSQL in a new project?
Choosing between SQL vs NoSQL can make or break your project’s scalability and performance.
Here’s how I decide 👇
⸻
1️⃣ Choose SQL (Relational DBs like MySQL / PostgreSQL) if:
• Your data is structured and relational (e.g., users ↔ orders)
• You need ACID transactions (banking, payments)
• Schema rarely changes and you want strong consistency
💡 Example: E-commerce, financial apps, inventory management
⸻
2️⃣ Choose NoSQL (MongoDB, DynamoDB, Redis) if:
• Your data is unstructured or semi-structured (JSON, logs, events)
• You need flexibility in schema or fast iteration
• High read/write scale matters more than strict consistency
💡 Example: Real-time analytics, chat apps, IoT, content feeds
⸻
3️⃣ Hybrid Approach
• Many real-world systems use both → SQL for critical relational data + NoSQL for high-speed, flexible workloads.
⸻
Pro Tip:
Start with SQL if unsure → migrate specific modules to NoSQL later for scale.
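The users ↔ orders case from point 1 fits in a tiny SQLite sketch: a foreign key enforces the relationship, and the `with conn:` block gives you an ACID transaction where both inserts commit or roll back together. Table names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce the users <-> orders link
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        amount REAL NOT NULL
    );
""")

with conn:  # ACID: both inserts commit together, or neither does
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Asha')")
    conn.execute("INSERT INTO orders (user_id, amount) VALUES (1, 49.99)")

rows = conn.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
""").fetchall()
print(rows)
```

The join and the transaction are exactly what document stores make awkward, and what relational databases give you for free.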
9. How do you handle WebSocket reconnection in a real-time dashboard?
WebSockets are great for real-time dashboards, but they can disconnect due to network drops, server restarts, or user switching tabs.
Here’s how I handle smooth reconnections 👇
1️⃣ Auto-Reconnect Logic
• Implement a retry mechanism with exponential backoff.
• Example: Try reconnecting after 1s → 2s → 4s → max 30s.
2️⃣ Heartbeat / Ping-Pong
• Send periodic ping messages to detect dead connections.
• If no pong received, trigger reconnect.
3️⃣ Queue Messages During Disconnect
• If dashboard needs to send data, queue it until WebSocket is back.
4️⃣ Handle UI Gracefully
• Show a “Reconnecting…” toast or indicator.
• Avoid sudden blank states to keep UX smooth.
5️⃣ Server Support
• Ensure server accepts resumed sessions or re-authenticates tokens to avoid full reload.
💡 Pro Tip:
Combine WebSocket + fallback to polling for mission-critical dashboards to never miss data.
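The backoff schedule from step 1 (1s → 2s → 4s → capped at 30s) and the reconnect loop can be sketched like this. `connect` is a stand-in for a real WebSocket client; the sleep is shown as a comment so the sketch stays testable.

```python
import itertools

def backoff_delays(base: float = 1.0, cap: float = 30.0):
    """Yield exponentially growing delays: 1, 2, 4, 8, ... capped at `cap`."""
    for attempt in itertools.count():
        yield min(base * (2 ** attempt), cap)

def reconnect(connect, max_attempts: int = 6) -> bool:
    """Retry `connect` with exponential backoff until it succeeds or we give up."""
    for attempt, delay in enumerate(backoff_delays()):
        if attempt >= max_attempts:
            return False               # give up; surface the error to the UI
        if connect():
            return True                # connected: reset backoff on next drop
        # In real code: sleep(delay) here, and show a "Reconnecting..." toast.
        print(f"retrying in {delay:.0f}s")
    return False

attempts = iter([False, False, True])      # simulate: fail twice, then succeed
print(reconnect(lambda: next(attempts)))   # True after two retries
```

The cap matters: without it, a dashboard left open overnight would back off into multi-hour waits instead of recovering promptly when the network returns.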
10. Your webapp loads fast on WiFi but crawls on 4G. What’s secretly killing your performance?
Your webapp isn’t really “slow” on 4G… it’s just choking on hidden performance killers. 🚦
Here’s what’s secretly hurting you:
1️⃣ Huge bundle sizes –
Your JS, CSS, and images are probably too heavy. On WiFi, it feels okay… but 4G exposes every KB.
2️⃣ Too many network requests –
Each API call, font, or image makes the phone wait for multiple round trips. On slower networks, this adds seconds.
3️⃣ No proper caching –
If the browser can’t reuse files from previous visits, it downloads everything again.
4️⃣ Render-blocking scripts –
That one JS library you imported for a tiny feature might be stopping the entire page from showing up.
5️⃣ Images not optimized –
Serving a 5MB banner image to a mobile user is like sending a truckload of files over a narrow street.
💡 Pro tip:
• Compress & lazy-load images 🖼
• Split your JS into smaller chunks ⚡
• Cache static assets aggressively 🔄
• Use Lighthouse/Pagespeed to see what’s really slowing you down
If your app feels fast on WiFi but slow on 4G… it’s not the internet’s fault—it’s your frontend diet. 🥲
11. One microservice is very slow due to external API calls. How will you optimize it?
⚡ Optimizing a Slow Microservice (Due to External API Calls) ⚡
Use a combination of caching, async processing, and resilience patterns:
🗄️ 1. Caching Responses
• Cache frequent API responses using Redis or in-memory cache
• Set proper TTL ⏳ to avoid stale data
⚡ 2. Asynchronous / Parallel Calls
• Call external APIs in parallel instead of sequential
• Use message queues (Kafka / RabbitMQ) for non-critical async calls
🛡️ 3. Retry & Circuit Breaker Pattern
• Implement circuit breakers (Resilience4j / Hystrix) to prevent cascading failures
• Use retry with exponential backoff for temporary API issues
🔄 4. Fallbacks & Graceful Degradation
• Provide cached / fallback data if the API is slow or down
• Return partial responses to keep the system responsive
🧱 5. Bulkhead & Timeout
• Set timeouts for all external API calls ⏱️
• Use bulkheads to isolate slow APIs from impacting other services
📦 6. Data Replication / Pre-Fetching
• Pre-fetch or replicate frequent data in your own DB for faster access
✅ Smart combo of caching + async + resilience
can bring 2x–10x performance improvement in real-world microservices 🚀
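Point 2 (parallel instead of sequential calls) is often the cheapest win. Here is an `asyncio` sketch where `fetch` is a made-up stand-in for a real HTTP call and the sleeps simulate latency: three sequential 100ms calls cost ~300ms, while `asyncio.gather` runs them concurrently for ~100ms total.

```python
import asyncio

async def fetch(name: str, latency: float) -> str:
    await asyncio.sleep(latency)       # simulate an external API round trip
    return f"{name}-result"

async def sequential() -> list:
    # Each call waits for the previous one: total time is the SUM of latencies.
    return [await fetch("a", 0.1), await fetch("b", 0.1), await fetch("c", 0.1)]

async def parallel() -> list:
    # All calls run concurrently: total time is roughly the SLOWEST call.
    # In real code, wrap each call in asyncio.wait_for() to enforce timeouts.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))

results = asyncio.run(parallel())
print(results)  # ['a-result', 'b-result', 'c-result']
```

`gather` preserves input order in its results, so downstream code doesn’t need to care which call finished first.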
Final Takeaway
👉 Scalable system design isn’t about buzzwords. It’s about anticipating growth, designing for failure, and always having a fallback.
If you’re leading engineering teams or building for scale, keep these 11 battle-tested strategies in your toolkit.
Happy Coding!