From 0 to Production in 48 Hours: Architecting SmartPager

#architecture #java #microservices #springboot

There’s a famous saying in engineering: "Fast, Cheap, or Reliable. Pick two." But when you enter a 48-hour hackathon to build a distributed incident management system, you are forced to pick all three.

This is the story of how my team and I built SmartPager from scratch, moving from a blank IDE to a production-grade alerting system in a single weekend.

The Problem with Incident Management

Incident management isn't just about sending an email when a server goes down. It's about handling concurrent events, escalating in real time, and making sure alerts fire in under a second. An alerting system that fails is useless, so we needed one that handles failure scenarios gracefully.

Why Microservices?

The easiest route in a hackathon is a monolith. But we wanted to build something that mirrored real-world production environments. We chose a microservices architecture using Spring Boot, sitting behind an Nginx reverse proxy, backed by PostgreSQL, with a React frontend.

Incident Service: Ingests and processes incoming simulated incidents.
Notification Service: Handles the real-time routing and escalation of alerts.
Auth & Gateway: Handles security and load distribution.
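
To make "processes incoming incidents" a little more concrete, here is a minimal sketch of how an incident's lifecycle could be modeled inside the Incident Service. It is not taken from the SmartPager code: the `Incident` record, the `IncidentStatus` values, and the allowed transitions are all assumptions made for illustration.

```java
import java.time.Instant;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical lifecycle states; the real SmartPager schema may differ.
enum IncidentStatus { OPEN, ESCALATED, ACKNOWLEDGED, RESOLVED }

// Immutable snapshot of an incident as an ingestion service might store it.
record Incident(String id, String source, IncidentStatus status, Instant updatedAt) {

    // Allowed transitions (assumed); anything not listed here is rejected.
    private static final Map<IncidentStatus, Set<IncidentStatus>> ALLOWED = Map.of(
        IncidentStatus.OPEN,
            EnumSet.of(IncidentStatus.ESCALATED, IncidentStatus.ACKNOWLEDGED, IncidentStatus.RESOLVED),
        IncidentStatus.ESCALATED,
            EnumSet.of(IncidentStatus.ACKNOWLEDGED, IncidentStatus.RESOLVED),
        IncidentStatus.ACKNOWLEDGED,
            EnumSet.of(IncidentStatus.RESOLVED),
        IncidentStatus.RESOLVED,
            EnumSet.noneOf(IncidentStatus.class));

    // Every newly ingested incident starts in OPEN.
    static Incident open(String id, String source) {
        return new Incident(id, source, IncidentStatus.OPEN, Instant.now());
    }

    boolean canMoveTo(IncidentStatus next) {
        return ALLOWED.get(status).contains(next);
    }

    // Returns a new snapshot in the next state, or fails on an illegal transition.
    Incident moveTo(IncidentStatus next) {
        if (!canMoveTo(next)) {
            throw new IllegalStateException(status + " -> " + next + " is not allowed");
        }
        return new Incident(id, source, next, Instant.now());
    }
}
```

Keeping the transition table in one place means an out-of-order update (for example, reopening a resolved incident) fails loudly instead of silently corrupting state.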

Engineering Under Pressure: The Trade-offs

Senior engineering is about understanding trade-offs. With only 48 hours, we didn't have the luxury of spinning up an entire Kafka cluster for event streaming.

Instead, we engineered an event-driven escalation system using lightweight Spring Boot event listeners and indexed PostgreSQL tables to track incident state. We prioritized low-latency alerting over strict consistency guarantees, ensuring that when a simulated incident fired, the on-call engineer was notified in milliseconds.
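
The escalation idea can be sketched without any framework. The example below is a plain-JDK approximation, not the SmartPager implementation: `EventBus` stands in for Spring's event publishing, and `IncidentRaised`, `EscalationService`, the on-call chain, and the acknowledgement timeout are hypothetical names and parameters chosen for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical event published when an incident is ingested.
record IncidentRaised(String incidentId) {}

// Tiny synchronous in-process bus; a stand-in for a framework event publisher.
final class EventBus {
    private final List<Consumer<IncidentRaised>> listeners = new CopyOnWriteArrayList<>();
    void subscribe(Consumer<IncidentRaised> listener) { listeners.add(listener); }
    void publish(IncidentRaised event) { listeners.forEach(l -> l.accept(event)); }
}

// Notifies the first responder immediately, then walks the chain until someone acknowledges.
final class EscalationService implements AutoCloseable {
    private final List<String> chain;            // ordered on-call chain (assumed)
    private final long ackTimeoutMillis;         // wait before escalating (assumed)
    private final Consumer<String> notifier;     // receives "responder:incidentId"
    private final Map<String, ScheduledFuture<?>> pending = new HashMap<>();
    private final Set<String> acknowledged = new HashSet<>();
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "escalation-timer");
            t.setDaemon(true); // never keeps the JVM alive on its own
            return t;
        });

    EscalationService(List<String> chain, long ackTimeoutMillis, Consumer<String> notifier) {
        this.chain = List.copyOf(chain);
        this.ackTimeoutMillis = ackTimeoutMillis;
        this.notifier = notifier;
    }

    // Event listener entry point: the first alert goes out on the publishing thread.
    void onIncidentRaised(IncidentRaised event) { notifyLevel(event.incidentId(), 0); }

    private synchronized void notifyLevel(String incidentId, int level) {
        if (acknowledged.contains(incidentId) || level >= chain.size()) {
            pending.remove(incidentId); // acknowledged or chain exhausted: stop
            return;
        }
        notifier.accept(chain.get(level) + ":" + incidentId);
        // Schedule the next responder; acknowledge() cancels this if it arrives in time.
        pending.put(incidentId, timer.schedule(
            () -> notifyLevel(incidentId, level + 1), ackTimeoutMillis, TimeUnit.MILLISECONDS));
    }

    synchronized void acknowledge(String incidentId) {
        acknowledged.add(incidentId);
        ScheduledFuture<?> next = pending.remove(incidentId);
        if (next != null) next.cancel(false);
    }

    synchronized int pendingEscalations() { return pending.size(); }

    @Override public void close() { timer.shutdownNow(); }
}
```

In a Spring Boot service, `ApplicationEventPublisher.publishEvent` and a method annotated with `@EventListener` would typically take the place of `EventBus`. Note that the pending timers in this sketch live only in memory, so a real service would also need to persist escalation state to survive a restart.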

The Outcome

When the judging phase arrived, we didn't just show them a PowerPoint. We bombarded the system with 100+ concurrent simulated incidents.

SmartPager didn't flinch. The distributed nodes handled the ingestion, the event-driven escalation triggered perfectly, and we achieved sub-second alert latency.
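
A burst like this can be driven by a very small harness. The sketch below is an assumption about how such a test could look, not the script we ran: `LoadProbe` and `LoadResult` are hypothetical names, and the `handler` is an in-memory stand-in for whatever call submits an incident to the real service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Summary of one burst: how many submissions finished and the slowest one.
record LoadResult(int completed, long maxLatencyNanos) {}

final class LoadProbe {
    // Submits `incidents` simulated incidents across `threads` workers and
    // times each handler call. Queueing time in the pool is not included.
    static LoadResult run(int incidents, int threads, Consumer<String> handler) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < incidents; i++) {
                String id = "inc-" + i;
                tasks.add(() -> {
                    long start = System.nanoTime();
                    handler.accept(id); // stand-in for the real submission call
                    return System.nanoTime() - start;
                });
            }
            long max = 0;
            int completed = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Long> f : pool.invokeAll(tasks)) {
                max = Math.max(max, f.get());
                completed++;
            }
            return new LoadResult(completed, max);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("handler failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Setting `threads` equal to `incidents` makes every submission start at roughly the same time; a smaller pool caps how many are in flight at once.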

Conclusion

Building SmartPager taught me that system resilience isn't something you add at the end of a project; it's a feature you have to architect from minute one. You don't need infinite time to build distributed systems. You just need a solid architecture and the discipline to stick to it.

You can check out the source code for SmartPager here: https://github.com/mohamedmabrouk09/incident-microservices