Khiman Louer

How I built Upple: A modern uptime monitor with Go and React

I built Upple because I wanted to understand monitoring from the ground up—and because I had specific opinions about how health checks, incidents, and workflows should fit together. It's an uptime monitor and incident management platform that runs health checks, triggers automated workflows, and lets teams respond to incidents in real-time.

The stack is quite straightforward: Go backend, React/TypeScript frontend, PostgreSQL for persistence, and Redis for the event bus.

What Upple Does

Quick overview of the main features:

  • Health checks: HTTP, TCP, DNS, SSL certificate expiry, and ICMP (ping). Consecutive failure thresholds are configurable to avoid false positives; a rough sketch of that idea follows this list.

  • Incident management: Incidents can be created manually or auto-detected when monitors fail. The status flow is, again, pretty standard (Investigating → Identified → Monitoring → Resolved), with timeline updates and impact levels.

  • Workflow builder: A visual drag-and-drop builder similar to n8n (though far from matching n8n's extensive feature set). You compose blocks into complex monitoring scenarios: log in to your app, add an item to the cart, check out, verify the confirmation page. These workflows run as monitors, catching issues that simple HTTP pings would miss.

  • Alert rules: Configurable rules for what happens when things go wrong—create incidents, notify Slack or Discord, page someone via PagerDuty, escalate after a timeout. Similar to PagerDuty's escalation policies.

  • Real-time collaboration: When an incident is active, team members see live updates, presence indicators, and can add timeline comments. All powered by Server-Sent Events.

  • Public status pages: Customer-facing dashboards showing uptime history and current status.

  • Integrations: Slack and Discord for notifications and incident channel sync, PagerDuty for on-call escalation.

  • Maintenance windows: Schedule downtime windows so monitors don't trigger false incidents during planned deployments.
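
To make the "consecutive failure threshold" idea from the health checks bullet concrete, here's a rough sketch of how a checker might debounce failures before declaring a monitor down. This is illustrative only, not Upple's actual code, and the names are made up.

type MonitorState struct {
    ConsecutiveFailures int
    Threshold           int // e.g. 3: only react after 3 failed checks in a row
    Down                bool
}

// RecordResult updates the failure counter and reports whether the monitor
// just transitioned to down (i.e. an incident should be opened).
func (m *MonitorState) RecordResult(success bool) (wentDown bool) {
    if success {
        m.ConsecutiveFailures = 0
        m.Down = false
        return false
    }
    m.ConsecutiveFailures++
    if !m.Down && m.ConsecutiveFailures >= m.Threshold {
        m.Down = true
        return true
    }
    return false
}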

Nothing revolutionary on the feature list. The interesting parts are under the hood.

Thinking about scale before you need it

One advantage of building an indie project with no deadline: you can afford to think about architecture upfront. I spent a fair amount of time on scalability and performance decisions early on, because they're hard problems to retrofit later, and challenging myself with them has been genuinely fun so far. Two patterns ended up shaping most of the backend.

Event delivery

When I started thinking about running multiple pods in Kubernetes, I realized I had conflicting requirements for events.

Take a health check result coming in. On one hand, exactly one pod should process it: store the result, update the monitor status, maybe create an incident. If every pod processed it, I'd get duplicate writes and race conditions.

On the other hand, that same result needs to reach every pod so they can push updates to their connected SSE clients. If only one pod received it, users connected to other pods wouldn't see real-time updates.

Same event, two different delivery patterns.

I'm using Watermill for the event bus with Redis Streams as the backend. Redis Streams has this concept of consumer groups; consumers in the same group split messages between them, while different groups each receive all messages.
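
To make the consumer-group behaviour concrete, here's roughly what the two delivery modes look like against Redis directly, using github.com/redis/go-redis/v9. This only illustrates the underlying semantics, not Upple's code; stream and group names are made up and error handling is omitted.

ctx := context.Background()
rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

// Work-queue semantics: every pod reads from the SAME group, so Redis hands
// each entry on the stream to exactly one consumer in that group.
rdb.XGroupCreateMkStream(ctx, "events", "workers", "$")
rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
    Group:    "workers",
    Consumer: "pod-a", // another pod would pass "pod-b" but join the same group
    Streams:  []string{"events", ">"}, // ">" = entries not yet delivered to this group
})

// Broadcast semantics: each pod creates its OWN group, so every group (and
// therefore every pod) sees every entry on the stream.
rdb.XGroupCreateMkStream(ctx, "events", "broadcast-pod-a", "$")
rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
    Group:    "broadcast-pod-a",
    Consumer: "pod-a",
    Streams:  []string{"events", ">"},
})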

So I created two subscribers:

// Work-queue subscriber with shared consumer group
// Only ONE consumer in the group receives each message
sub, err := redisstream.NewSubscriber(
    redisstream.SubscriberConfig{
        Client:        client,
        ConsumerGroup: "upple-workers",
        Consumer:      consumerID,
    },
)

// Broadcast subscriber with per-pod consumer group
// EVERY pod receives EVERY message
broadcastSub, err := redisstream.NewSubscriber(
    redisstream.SubscriberConfig{
        Client:        client,
        ConsumerGroup: fmt.Sprintf("upple-broadcast-%s", consumerID), // Unique per pod
        Consumer:      consumerID,
    },
)

The consumerID is a UUID generated at startup. For work-queue handlers, all pods join upple-workers, so Redis distributes messages across them. For broadcast handlers, each pod creates its own group (upple-broadcast-abc123), so each pod receives every message.

The event bus then exposes two subscription methods:

// Work-queue: one pod processes
eventBus.Subscribe("check.result", resultHandler.Handle)

// Broadcast: all pods receive (for SSE fan-out)
eventBus.SubscribeBroadcast("check.result", sseBroadcaster.OnCheckResult)

Same topic, different delivery guarantees. The ResultHandler stores the check in the database (only runs once), while the SSE broadcaster pushes updates to all connected clients (runs on every pod).

This pattern also degrades gracefully for local development—when using the in-memory event bus instead of Redis, both subscribers get the same channel, which works fine with a single instance.
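
For reference, the abstraction behind this boils down to one interface with two subscribe methods. The sketch below is simplified and not Upple's actual code (the real handlers receive Watermill messages), but it captures the shape: handlers never know which backend they run on, and a single-process in-memory implementation can serve both methods through the same dispatch path.

// Simplified sketch of the event bus abstraction (illustrative, not the real code).
type EventBus interface {
    Publish(topic string, payload []byte) error
    // Work-queue semantics: exactly one pod handles each event.
    Subscribe(topic string, handler func(payload []byte) error)
    // Broadcast semantics: every pod handles each event.
    SubscribeBroadcast(topic string, handler func(payload []byte) error)
}

// In-memory variant for local development: with a single instance, the two
// semantics are indistinguishable, so both methods share one registry.
type inMemoryBus struct {
    handlers map[string][]func([]byte) error
}

func (b *inMemoryBus) Subscribe(topic string, h func([]byte) error) {
    b.handlers[topic] = append(b.handlers[topic], h)
}

func (b *inMemoryBus) SubscribeBroadcast(topic string, h func([]byte) error) {
    b.Subscribe(topic, h) // same path locally
}

func (b *inMemoryBus) Publish(topic string, payload []byte) error {
    for _, h := range b.handlers[topic] {
        if err := h(payload); err != nil {
            return err
        }
    }
    return nil
}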

Managing complex UI state in the workflow builder

The workflow builder was the most challenging part of the frontend. It's a drag-and-drop interface using React Flow where users create nodes (HTTP requests, conditions, delays, loops), connect them, and watch execution status update in real-time.

I went through the usual progression: started with React's built-in state, moved to Context when prop drilling got painful, tried Zustand when Context re-renders became a problem. Each time I'd add a feature—undo/redo, real-time execution visualization, dirty state tracking—and the state interactions would get tangled.

Eventually I tried XState, and it stuck. Here's what I needed:

  • Undo/redo that works across node positions, connections, and data changes
  • Real-time execution visualization via SSE events
  • Dirty state tracking for "You have unsaved changes" prompts
  • Sidebar panel that stays in sync with canvas selection
  • Navigation blocking when dirty

What finally worked: XState with multiple child machines.

Instead of one massive state machine, I split concerns into five specialized machines:

invoke: [
  { id: "canvas", src: "canvasMachine" },       // React Flow state
  { id: "execution", src: "executionMachine" }, // Workflow execution lifecycle
  { id: "sse", src: "sseConnectionMachine" },   // SSE connection management
  { id: "sidebar", src: "sidebarMachine" },     // Panel state
  { id: "navigation", src: "navigationMachine" }, // Dirty tracking + route guards
],

Each machine handles its own complexity. The parent machine just coordinates—when a workflow loads, it tells the canvas machine to initialize and the SSE machine to connect. When an SSE event arrives, it forwards to the execution machine. When the user edits something, the navigation machine tracks dirty state.

Machines communicate using XState's sendTo:

markNavigationDirty: sendTo("navigation", { type: "MARK_DIRTY" }),

This means the parent doesn't need to know how navigation tracking works, just that it needs to be notified.

One subtle problem I ran into: when you drag a node in React Flow, it fires hundreds of position change events. If each one pushed to the undo stack, pressing Ctrl+Z would step back one pixel at a time.

The fix was to batch drag operations into a single history entry:

const onNodesChange: OnNodesChange = useCallback((changes) => {
  // The node list after applying this batch of changes (React Flow's applyNodeChanges)
  const newNodes = applyNodeChanges(changes, history.present.nodes);

  const isDragStart = changes.some(c => c.type === "position" && c.dragging === true);
  const isDragEnd = changes.some(c => c.type === "position" && c.dragging === false);

  if (isDragStart && !isDraggingRef.current) {
    // Starting drag - snapshot current state
    isDraggingRef.current = true;
    preDragStateRef.current = history.present;
  }

  if (isDragEnd && isDraggingRef.current) {
    // Drag finished - push the pre-drag snapshot as ONE history entry
    isDraggingRef.current = false;
    setHistory(prev => ({
      past: [...prev.past, preDragStateRef.current!].slice(-MAX_HISTORY_SIZE),
      present: { ...prev.present, nodes: newNodes },
      future: [],
    }));
  }
}, [history.present]);

React Flow tells us when dragging starts (dragging: true) and ends (dragging: false). We capture the state before the drag starts and only commit to history when the drag ends. Dragging 5 nodes across the canvas becomes 1 undo operation.

Looking Back

Sticking to the parts of domain-driven design (DDD) that felt relevant to me was worth the upfront investment. Separating monitors, incidents, workflows, and alerts into distinct aggregates with clear boundaries made the codebase navigable. When I need to change how incidents work, I know exactly where to look.

XState has a STEEP learning curve, but I feel like it pays off. The first few weeks were slow—thinking in states and events felt unnatural. But once I internalized it, features that would have been messy with useState became straightforward. The dev tools that let you visualize the state machine are genuinely super useful for debugging.

SSE is simpler than WebSockets for server-push. I don't need bidirectional communication; the client only needs to receive events, not send them. SSE handles reconnection automatically, works with standard HTTP middleware, and doesn't require a separate protocol.
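
For anyone who hasn't written an SSE endpoint in Go: it's a long-lived HTTP response with a specific content type and a flush after each event, standard library only. Here's a minimal sketch, not Upple's actual broadcaster; subscribeToEvents is a hypothetical channel fed by the broadcast subscriber described earlier.

func sseHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    events := subscribeToEvents(r.Context()) // hypothetical <-chan string

    for {
        select {
        case <-r.Context().Done():
            return // client went away; the browser's EventSource reconnects on its own
        case data := <-events:
            // Each SSE frame is just "data: ...\n\n" written to the open response
            fmt.Fprintf(w, "data: %s\n\n", data)
            flusher.Flush()
        }
    }
}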

Abstracting the event bus paid off. I started with an in-memory event bus for local development, knowing I'd switch to Redis for production. When the time came, the migration took less than an hour because the abstraction was clean—handlers didn't care whether events came from memory or Redis Streams.


Upple is live at upple.io. Most of its features are available on a very generous free tier; a paid tier isn't even available yet. Don't hesitate to try it, and any feedback would be very welcome. If you have questions about the architecture or want to discuss the patterns I described, drop a comment; I'm curious how others have solved similar problems!
