Build Bulletproof Web Apps: 7 Essential Monitoring Patterns That Prevent Downtime

Let’s talk about keeping your web application healthy. You’ve built something complex, with many moving parts. When it works, it’s great. When it doesn’t, you need to know why—fast. That’s the goal here. It’s about moving from guessing to knowing, from reacting to understanding. These are methods to give you that clarity.

I think of it as giving your application a voice. Instead of it sitting silently, hoping you notice when something is wrong, it constantly tells you about its state, its performance, and its problems. You just need to know how to listen. These patterns are the language you teach it.

The first pattern is about following a journey. In a modern application, a single user action, like clicking "Purchase," can ripple through a dozen different services. A billing service talks to inventory, which talks to a notification service. If that purchase is slow or fails, which link in the chain is the problem? Distributed tracing answers that.

It works by attaching a unique identifier to the initial request. As that request travels from service to service, that identifier goes with it. Each service adds its own chapter to the story: when it started work, when it finished, and any details about what it did. At the end, you can see the entire timeline. You can spot the service that took three seconds when everything else took milliseconds.

Here’s a basic way you might start a trace for an order process. The key is that everything happening inside this processOrder function becomes part of the same story.

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service-tracer');

async function processOrder(request) {
  // Start a new span, which is a segment of work.
  return tracer.startActiveSpan('processOrder', async (span) => {
    // Add details to this chapter of the trace.
    span.setAttribute('order.id', request.orderId);
    span.setAttribute('user.tier', request.userTier);

    try {
      // Your business logic happens here.
      const result = await chargeCreditCard(request);
      await updateInventory(request.items);
      await sendConfirmationEmail(request.userEmail);

      // Use the named constants rather than magic numbers.
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Record the failure in the trace.
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // Always end the span to send the data.
      span.end();
    }
  });
}
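
The identifier that travels with the request is usually carried in an HTTP header. Here is a minimal sketch of that idea, assuming the W3C Trace Context traceparent format. In practice OpenTelemetry's propagators handle this for you, and the helper names below are hypothetical, chosen only for illustration.

```javascript
const crypto = require('node:crypto');

// Build a traceparent value: version-traceId-spanId-flags.
// A new trace ID is generated unless one is passed in.
function createTraceparent(traceId = crypto.randomBytes(16).toString('hex')) {
  const spanId = crypto.randomBytes(8).toString('hex');
  return `00-${traceId}-${spanId}-01`; // 01 = sampled
}

// Parse an incoming header; returns null if it is missing or malformed.
function parseTraceparent(header) {
  const pattern = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
  const match = pattern.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return { version, traceId, parentSpanId, flags };
}

// A downstream service keeps the trace ID but creates its own span ID.
function continueTrace(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  return parent ? createTraceparent(parent.traceId) : createTraceparent();
}

const fromBilling = createTraceparent();
const toInventory = continueTrace(fromBilling);
console.log(fromBilling);
console.log(toInventory); // same trace ID, different span ID
```

Because every service reuses the trace ID and only the span ID changes, a tracing backend can stitch the individual spans back into one timeline.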

The second pattern changes how you write down events. For years, we wrote logs as lines of text. "User logged in." That tells a human something, but it’s tedious for a machine to analyze. Which user? How long did it take? Structured logging uses a consistent format, like JSON, so every log entry is a bundle of labeled data.

This makes searching and alerting powerful. You can easily find all logs for a specific user ID, or all login events that took longer than two seconds. Tools can automatically parse these fields without complex rules.

const pino = require('pino');
// Create a logger that outputs JSON by default.
const logger = pino({
  level: 'info',
  // Add baseline context to every log from this service.
  base: { service: 'auth-service', pod: process.env.POD_NAME }
});

async function handleLogin(username, password) {
  const startTime = Date.now();

  logger.info({ event: 'login_attempt', username }, 'Login flow initiated');

  let user;
  let isValid;

  try {
    user = await db.findUser(username);
    // Treat an unknown user the same as a wrong password instead of crashing on user.hash.
    isValid = user ? await verifyPassword(password, user.hash) : false;
  } catch (error) {
    // Only unexpected failures (database down, hashing error) land here.
    // pino's default serializer for the `err` key records type, message, and stack.
    logger.error({
      event: 'login_error',
      username: username,
      err: error
    }, 'Unexpected error during login');
    throw error;
  }

  const duration = Date.now() - startTime;

  if (!isValid) {
    // An expected failure: a warning, not an error.
    logger.warn({
      event: 'login_failed',
      username: username,
      durationMs: duration,
      reason: 'invalid_credentials'
    }, 'Authentication failed');
    throw new Error('Invalid credentials');
  }

  // A rich, structured info log.
  logger.info({
    event: 'login_success',
    userId: user.id,
    durationMs: duration,
    authMethod: 'password'
  }, 'User authenticated successfully');
  return user;
}
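
To make the searching claim concrete, here is a small sketch of querying newline-delimited JSON logs with plain JavaScript. The sample lines and field values are made up for illustration; a real setup would run the same kind of filter in a log aggregation tool rather than in application code.

```javascript
// Sample newline-delimited JSON log output (values are invented for this sketch).
const rawLogs = [
  '{"level":30,"event":"login_success","userId":"u-1","durationMs":180}',
  '{"level":30,"event":"login_success","userId":"u-2","durationMs":2400}',
  '{"level":40,"event":"login_failed","username":"sam","durationMs":2100}',
  'plain text line that is not JSON',
  '{"level":30,"event":"login_success","userId":"u-1","durationMs":3100}'
].join('\n');

// Parse each line, skipping anything that is not valid JSON.
function parseLogLines(text) {
  const entries = [];
  for (const line of text.split('\n')) {
    try {
      entries.push(JSON.parse(line));
    } catch {
      // Ignore non-JSON lines, such as output from older unstructured logging.
    }
  }
  return entries;
}

const entries = parseLogLines(rawLogs);

// All login events that took longer than two seconds.
const slowLogins = entries.filter(
  (e) => e.event?.startsWith('login_') && e.durationMs > 2000
);

// All logs for one specific user ID.
const logsForUser = entries.filter((e) => e.userId === 'u-1');

console.log(slowLogins.length, logsForUser.length);
```

Neither query needs a regular expression or any knowledge of how the message text is worded, which is the practical payoff of labeled fields.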

The third pattern is about measuring. While logs tell you about discrete events, metrics give you numbers over time. How many requests per second is the API handling? What’s the current error rate? How much memory is the service using? These numbers help you see trends, set up alerts for thresholds, and plan for capacity.

I like to think of metrics as the vital signs for your application. A doctor checks heart rate and blood pressure; you check request latency and CPU usage.

const Prometheus = require('prom-client');
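// Use prom-client's default global registry; metrics created below register with it automatically.
const register = Prometheus.register;
// Also collect default process metrics (memory, CPU, event loop lag).
Prometheus.collectDefaultMetrics();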

// Create a counter for total HTTP requests.
const httpRequestsTotal = new Prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Create a histogram to track request duration.
const httpRequestDurationMs = new Prometheus.Histogram({
  name: 'http_request_duration_milliseconds',
  help: 'HTTP request duration in milliseconds',
  labelNames: ['method', 'route'],
  // Define buckets for duration (in ms): 10, 50, 100, 200, 500, 1000, 2000
  buckets: [10, 50, 100, 200, 500, 1000, 2000]
});

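// Express.js middleware to instrument every request.
// Assumes an Express app; skip these two lines if your file already creates one.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const start = Date.now();

  // Record the counter and the duration when the response finishes.
  res.on('finish', () => {
    const duration = Date.now() - start;
    // Label with the matched route pattern (e.g. /users/:id), not the raw path,
    // so unmatched URLs cannot create an unbounded number of label values.
    const route = req.route ? req.baseUrl + req.route.path : 'unmatched';

    httpRequestsTotal.inc({ method: req.method, route, status_code: res.statusCode });

    httpRequestDurationMs.observe({ method: req.method, route }, duration);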
  });

  next();
});

// Expose the metrics on a standard endpoint for Prometheus to scrape.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
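
The bucket list in the histogram above is easier to reason about once you see how observations land in it. The following is a toy sketch, not prom-client's implementation: it assumes Prometheus-style cumulative buckets, where each bucket counts every observation less than or equal to its upper bound, plus a final +Inf bucket that counts everything.

```javascript
// Toy cumulative histogram that mimics how Prometheus-style buckets count observations.
function createHistogram(upperBounds) {
  const bounds = [...upperBounds].sort((a, b) => a - b);
  const counts = new Array(bounds.length + 1).fill(0); // last slot is +Inf
  let sum = 0;

  return {
    observe(value) {
      sum += value;
      // Increment every bucket whose upper bound is >= the observed value.
      bounds.forEach((bound, i) => {
        if (value <= bound) counts[i] += 1;
      });
      counts[bounds.length] += 1; // +Inf counts every observation
    },
    snapshot() {
      const buckets = {};
      bounds.forEach((bound, i) => {
        buckets[String(bound)] = counts[i];
      });
      buckets['+Inf'] = counts[bounds.length];
      return { buckets, sum, count: counts[bounds.length] };
    }
  };
}

const histogram = createHistogram([10, 50, 100, 200, 500, 1000, 2000]);
[8, 42, 42, 150, 900, 3500].forEach((ms) => histogram.observe(ms));
console.log(histogram.snapshot());
```

Note that the 3500 ms request only shows up in the +Inf bucket. If many of your requests exceed the largest bound, percentile estimates become unreliable, so pick buckets that bracket the latencies you actually care about.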

The fourth pattern shifts the focus entirely to the person using your site. All the internal metrics can look perfect, but if a user on a slow mobile network is waiting ten seconds for your page to load, you have a problem. Real User Monitoring tools capture performance data directly from the browser.

They measure things like how long it takes for the main content to load, how quickly the page responds to a click, and how much the layout shifts unexpectedly while loading—which is frustrating if you’re trying to click a button.

// Using the web-vitals library to measure Core Web Vitals.
// INP replaced FID as a Core Web Vital in March 2024, and onFID is deprecated
// in newer releases of the library, so it is not used here.
import {onLCP, onCLS, onINP} from 'web-vitals';

// A function to send metrics to your analytics endpoint.
function sendMetric({ name, value, id }) {
  const body = JSON.stringify({
    name: name, // e.g., 'LCP', 'INP', 'CLS'
    value: value,
    id: id, // A unique ID for this page load
    path: window.location.pathname,
    // Add connectivity information (not available in every browser).
    effectiveType: navigator.connection?.effectiveType,
    // Add the User-Agent string for device/browser context.
    userAgent: navigator.userAgent
  });

  // Use sendBeacon for reliable delivery, even if page is closing.
  navigator.sendBeacon('/api/web-vitals', body);

  // Also log to console for local debugging.
  console.debug(`[Web Vitals] ${name}: ${Math.round(value)}`);
}

// Attach the listeners for each vital metric.
onLCP(sendMetric);     // Largest Contentful Paint (loading performance)
onCLS(sendMetric);     // Cumulative Layout Shift (visual stability)
onINP(sendMetric);     // Interaction to Next Paint (responsiveness)

// You can also instrument custom, business-specific user actions.
function monitorCheckoutButton() {
  const button = document.getElementById('checkout-btn');
  if (!button) return; // Nothing to monitor on pages without the button.

  button.addEventListener('click', () => {
    const clickTime = performance.now();
    // Measure from the click until the browser is about to render the next frame,
    // which approximates how long the page took to react.
    requestAnimationFrame(() => {
      sendMetric({
        name: 'CUSTOM_BUTTON_REACTION',
        value: performance.now() - clickTime,
        id: 'checkout_button'
      });
    });
  });
}
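
Once these measurements arrive at your endpoint, you need a way to summarize them. Here is a minimal sketch, assuming the published Core Web Vitals thresholds and the common practice of judging a page by its 75th percentile. The function names and the sample values are hypothetical.

```javascript
// Published "good" and "poor" boundaries for Core Web Vitals.
// LCP and INP are in milliseconds; CLS is a unitless score.
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 },
  INP: { good: 200, poor: 500 },
  CLS: { good: 0.1, poor: 0.25 }
};

// Classify a single value for a known metric.
function rateMetric(name, value) {
  const limits = THRESHOLDS[name];
  if (!limits) return 'unknown';
  if (value <= limits.good) return 'good';
  if (value <= limits.poor) return 'needs-improvement';
  return 'poor';
}

// 75th percentile using the nearest-rank method.
function percentile75(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(0.75 * sorted.length);
  return sorted[rank - 1];
}

// Invented LCP samples (ms) from eight page loads.
const lcpSamples = [1200, 1800, 2100, 2600, 5200, 1500, 3900, 2300];
const lcpP75 = percentile75(lcpSamples);
console.log(lcpP75, rateMetric('LCP', lcpP75));
```

Summarizing by percentile rather than by average keeps a handful of very slow page loads visible instead of letting fast loads hide them.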

The fifth pattern is about being proactive. Instead of waiting for a user to report that the checkout is broken, you can have a script—a synthetic transaction—run through that flow every five minutes from different locations around the world. It’s like a constant, automated health check for your key user journeys.

I use these to catch problems caused by a new deployment or a third-party API change before they impact real revenue.

const { firefox } = require('playwright'); // or chromium, webkit

async function runSyntheticCheckoutTest() {
  let browser;
  const testResults = {
    name: 'guest_checkout_flow',
    startTime: new Date().toISOString(),
    steps: [],
    success: false
  };

  try {
    browser = await firefox.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();

    // Step 1: Load homepage.
    const navStart = Date.now();
    const response = await page.goto('https://your-store.com', { waitUntil: 'networkidle' });
    testResults.steps.push({ 
      name: 'load_homepage', 
      status: response.status(), 
      duration: Date.now() - navStart 
    });

    if (!response.ok()) {
      throw new Error(`Homepage failed with status: ${response.status()}`);
    }

    // Step 2: Add a test product to cart.
    // Start waiting for the cart request before clicking, so a fast response is not missed.
    const [addToCartResponse] = await Promise.all([
      page.waitForResponse(resp =>
        resp.url().includes('/api/cart') && resp.request().method() === 'POST'
      ),
      page.locator('[data-test-id="product-tile"]').first().click()
    ]);
    testResults.steps.push({ 
      name: 'add_to_cart', 
      status: addToCartResponse.status() 
    });

    // Step 3: Navigate to cart and proceed.
    await page.click('text=View Cart');
    await page.waitForSelector('text=Proceed to Checkout');
    await page.click('text=Proceed to Checkout');

    // Step 4: Fill out guest information.
    await page.fill('#email', 'synthetic-user@test.example.com');
    await page.fill('#shipping-address', '123 Test Street');
    // ... fill more fields

    // Step 5: Submit the order.
    const orderSubmitStart = Date.now();
    await page.click('button[type="submit"]');

    // Wait for a success indicator or a redirect.
    await page.waitForSelector('text=Order Confirmation', { timeout: 10000 });
    testResults.steps.push({ 
      name: 'submit_order', 
      duration: Date.now() - orderSubmitStart 
    });

    testResults.success = true;
    testResults.finalScreenshot = await page.screenshot({ fullPage: true });

  } catch (error) {
    testResults.error = error.message;
    console.error('Synthetic test failed:', error);
    // Here you would trigger an alert to your team (sendAlert is a placeholder for your own integration).
    sendAlert(`Checkout flow broken: ${error.message}`);
  } finally {
    if (browser) {
      await browser.close();
    }
    // Always send the results to your observability system.
    sendTestResultsToLog(testResults);
  }
}

// Run this function on a schedule (e.g., using a cron job or scheduled Lambda).
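
If you do not have a cron job or scheduled function available, a small in-process scheduler can serve as a starting point. This is a sketch under assumptions: createScheduledCheck is a hypothetical helper, and the check function is expected to throw (or reject) on failure.

```javascript
// Wrap a check function so it runs on a fixed interval without overlapping runs.
function createScheduledCheck(
  checkFn,
  { intervalMs = 5 * 60 * 1000, onResult = () => {} } = {}
) {
  let timer = null;
  let running = false;

  async function runOnce() {
    if (running) return { skipped: true }; // previous run is still in progress
    running = true;
    const startedAt = Date.now();
    let result;
    try {
      await checkFn();
      result = { success: true };
    } catch (error) {
      result = { success: false, error: error.message };
    } finally {
      running = false;
    }
    result.durationMs = Date.now() - startedAt;
    onResult(result);
    return result;
  }

  return {
    runOnce,
    start() {
      if (!timer) timer = setInterval(runOnce, intervalMs);
    },
    stop() {
      clearInterval(timer);
      timer = null;
    }
  };
}

// Demo with a stand-in check that always fails.
const demoCheck = createScheduledCheck(
  async () => { throw new Error('checkout button missing'); },
  { onResult: (result) => console.log('check finished:', result) }
);
demoCheck.runOnce();
```

To use this with runSyntheticCheckoutTest, the test function would need to rethrow its error or return its results, since as written it catches failures internally.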

The sixth pattern deals with the inevitable: things break. Errors happen. The goal is to collect them, group similar ones together, and provide enough context so a developer can fix them without needing to ask, "What were you doing when this happened?" A good error tracking system shows you the stack trace, the user, the browser, and the state of the application.

// Frontend global error handler for a React-like app.
function initErrorTracking() {
  // Catch unhandled promise rejections.
  window.addEventListener('unhandledrejection', (event) => {
    captureError({
      type: 'UnhandledPromiseRejection',
      message: event.reason?.message || 'Unknown rejection',
      stack: event.reason?.stack,
      timestamp: new Date().toISOString()
    });
  });

  // Catch synchronous errors.
  window.addEventListener('error', (event) => {
    captureError({
      type: 'WindowError',
      message: event.message,
      filename: event.filename,
      lineno: event.lineno,
      colno: event.colno,
      stack: event.error?.stack,
      timestamp: new Date().toISOString()
    });
    // Prevent the default browser error UI if desired.
    // event.preventDefault();
  });

  // For a Single Page App (SPA), also track route changes so each error report carries the current route.
  if (typeof window.history !== 'undefined') {
    const originalPushState = history.pushState;
    history.pushState = function (...args) {
      const result = originalPushState.apply(this, args);
      window.dispatchEvent(new Event('locationchange'));
      return result;
    };
    window.addEventListener('popstate', () => {
      window.dispatchEvent(new Event('locationchange'));
    });

    // Log all route changes for context.
    window.addEventListener('locationchange', () => {
      setCurrentRoute(window.location.pathname);
    });
  }
}

// Function to capture and send error data.
let currentUser = null;
let currentRoute = '/';

function setCurrentUser(user) { currentUser = user; }
function setCurrentRoute(route) { currentRoute = route; }

function captureError(errorDetails) {
  const enrichedError = {
    ...errorDetails,
    // Add rich context.
    url: window.location.href,
    route: currentRoute,
    userAgent: navigator.userAgent,
    viewport: `${window.innerWidth}x${window.innerHeight}`,
    userId: currentUser?.id,
    userEmail: currentUser?.email,
    // Include recent user actions for context (simplified).
    lastActions: window.userActionBuffer?.slice(-5) || []
  };

  // Send to your backend or error service (e.g., Sentry, custom endpoint).
  fetch('/api/client-errors', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(enrichedError),
    keepalive: true // Ensure request completes even if page closes.
  }).catch(() => {
    // Fallback to console if network fails.
    console.error('Logged error:', enrichedError);
  });
}

// Example of buffering user actions for context.
window.userActionBuffer = [];
function trackUserAction(action) {
  window.userActionBuffer.push({
    action,
    timestamp: Date.now()
  });
  // Keep buffer manageable.
  if (window.userActionBuffer.length > 20) {
    window.userActionBuffer.shift();
  }
}
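
Grouping similar errors usually happens on the receiving side. Here is one possible sketch, assuming V8-style stack traces (frames that begin with "at") and reports shaped like the ones captureError sends. The normalization rules and function names are illustrative, not taken from any particular error tracking product.

```javascript
const crypto = require('node:crypto');

// Replace variable parts of a message so similar errors share one fingerprint.
function normalizeMessage(message) {
  return String(message ?? '')
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<uuid>')
    .replace(/\d+/g, '<n>');
}

// Use the first stack frame, ignoring line and column numbers.
function topFrame(stack) {
  const frame = String(stack ?? '')
    .split('\n')
    .map((line) => line.trim())
    .find((line) => line.startsWith('at '));
  return frame ? frame.replace(/:\d+:\d+\)?$/, '') : '';
}

// Hash the error type, normalized message, and top frame into a short ID.
function fingerprint(error) {
  const key = [error.type, normalizeMessage(error.message), topFrame(error.stack)].join('|');
  return crypto.createHash('sha1').update(key).digest('hex').slice(0, 12);
}

// Count occurrences per fingerprint, most frequent first.
function groupErrors(errors) {
  const groups = new Map();
  for (const error of errors) {
    const id = fingerprint(error);
    const group = groups.get(id) || { fingerprint: id, count: 0, sample: error };
    group.count += 1;
    groups.set(id, group);
  }
  return [...groups.values()].sort((a, b) => b.count - a.count);
}

// Invented reports: the first two differ only by order number.
const reports = [
  { type: 'WindowError', message: 'Order 1041 not found', stack: 'Error: Order 1041 not found\n    at loadOrder (app.js:10:5)' },
  { type: 'WindowError', message: 'Order 2288 not found', stack: 'Error: Order 2288 not found\n    at loadOrder (app.js:10:5)' },
  { type: 'UnhandledPromiseRejection', message: 'Network timeout', stack: 'Error: Network timeout\n    at fetchCart (cart.js:3:1)' }
];
const grouped = groupErrors(reports);
console.log(grouped.map((g) => [g.fingerprint, g.count]));
```

The two "Order not found" reports collapse into one group, so a single bug that fires a thousand times shows up as one item with a count rather than a thousand separate entries.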

The seventh pattern is about bringing it all together. You have traces, logs, metrics, and error reports flowing in. A dashboard is your control panel. It visualizes this data so you can see the state of your system at a glance. Is the error rate spiking? Is the 95th percentile latency for the search API climbing? A well-designed dashboard answers these questions immediately.

The best dashboards are tailored to different viewers. An on-call engineer needs a high-level health view. A product manager might want a dashboard showing user conversion rates alongside frontend performance.

// Example: Defining a dashboard-as-code for a service health view.
// This is a conceptual example, similar to tools like Grafana's JSON model.
const serviceHealthDashboard = {
  dashboardTitle: "Product Service - Production",
  refreshInterval: "30s",
  panels: [
    {
      panelTitle: "Request Rate & Errors",
      // A time-series graph.
      type: "timeseries",
      layout: { x: 0, y: 0, width: 12, height: 8 },
      targets: [
        {
          // Prometheus query for request rate.
          query: 'sum(rate(http_requests_total{service="product-api",env="prod"}[5m])) by (method)',
          legend: "{{method}} - req/s"
        },
        {
          // Query for 5xx error rate.
          query: 'sum(rate(http_requests_total{service="product-api",env="prod",status_code=~"5.."}[5m]))',
          legend: "5xx Errors/s"
        }
      ]
    },
    {
      panelTitle: "P95 Response Latency",
      type: "timeseries",
      layout: { x: 12, y: 0, width: 12, height: 8 },
      targets: [
        {
          // Using a histogram metric to calculate latency percentiles.
          query: 'histogram_quantile(0.95, sum(rate(http_request_duration_milliseconds_bucket{service="product-api"}[5m])) by (le, route))',
          legend: "P95 - {{route}}"
        }
      ],
      thresholds: [
        { value: 1000, color: "yellow", label: "Warning" },
        { value: 2000, color: "red", label: "Critical" }
      ]
    },
    {
      panelTitle: "Top Error Messages (Last Hour)",
      // A log aggregation table.
      type: "table",
      layout: { x: 0, y: 8, width: 24, height: 6 },
      query: {
        datasource: "loki", // A log aggregation system.
        expr: 'topk(10, sum by (message) (count_over_time({service="product-api", env="prod"} |= "ERROR" | logfmt [1h])))'
      },
      columns: ["Error Message", "Count"]
    },
    {
      panelTitle: "Infrastructure",
      type: "stat",
      layout: { x: 0, y: 14, width: 6, height: 4 },
      targets: [
        { query: 'process_resident_memory_bytes{service="product-api"}', format: "bytes", legend: "Memory" },
        { query: 'rate(process_cpu_seconds_total{service="product-api"}[5m]) * 100', format: "percent", legend: "CPU" }
      ]
    },
    {
      panelTitle: "Active Traces Sample",
      // A list showing recent slow or erroneous traces.
      type: "list",
      layout: { x: 6, y: 14, width: 18, height: 4 },
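      query: {
        datasource: "tempo", // A tracing system.
        // TraceQL filter for failed spans from this service; the result limit is set separately.
        expr: '{ resource.service.name = "product-api" && status = error }',
        limit: 5
      },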
      linkToTraceView: true // Makes each item clickable to see the full trace.
    }
  ],
  // Variables that allow dynamic filtering of the dashboard.
  variables: [
    {
      name: "service",
      query: "label_values(http_requests_total, service)",
      defaultValue: "product-api"
    },
    {
      name: "env",
      query: "label_values(http_requests_total, env)",
      defaultValue: "prod"
    }
  ]
};
// A dashboarding tool would consume a configuration like this (serialized to JSON) to render the views.

These patterns work best when they are connected. An alert from a synthetic monitor can prompt you to check the dashboard. A spike in errors on the dashboard should let you click through to see the related logs and traces for that time period. Seeing a slow trace might lead you to the specific line of code in your error tracking system that has been throwing exceptions.
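
One way to make that click-through possible is a shared identifier. Here is a small sketch, assuming your structured logs carry a traceId field (an assumption about your log schema, not something the earlier examples set up). The sample entries are invented.

```javascript
// Hypothetical structured log entries that include the trace ID of the request.
const logs = [
  { time: '2024-05-01T10:00:05Z', level: 'info', traceId: 'a1', msg: 'order received' },
  { time: '2024-05-01T10:02:10Z', level: 'error', traceId: 'b2', msg: 'inventory timeout' },
  { time: '2024-05-01T10:02:40Z', level: 'error', traceId: 'c3', msg: 'inventory timeout' },
  { time: '2024-05-01T10:02:41Z', level: 'info', traceId: 'c3', msg: 'retry scheduled' },
  { time: '2024-05-01T10:09:00Z', level: 'error', traceId: 'd4', msg: 'card declined' }
];

// Step 1: from an error spike on the dashboard, find the traces involved.
function traceIdsForErrorsBetween(entries, from, to) {
  const start = Date.parse(from);
  const end = Date.parse(to);
  const ids = entries
    .filter((e) => e.level === 'error')
    .filter((e) => {
      const t = Date.parse(e.time);
      return t >= start && t <= end;
    })
    .map((e) => e.traceId);
  return [...new Set(ids)]; // de-duplicate while keeping first-seen order
}

// Step 2: pull every log line that belongs to one of those traces.
function logsForTrace(entries, traceId) {
  return entries.filter((e) => e.traceId === traceId);
}

const spikeTraces = traceIdsForErrorsBetween(
  logs,
  '2024-05-01T10:02:00Z',
  '2024-05-01T10:03:00Z'
);
console.log(spikeTraces);
console.log(logsForTrace(logs, 'c3'));
```

The same trace ID can then be used to open the full trace in your tracing tool, which is what turns three separate data sources into one investigation.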

The aim is to close the loop between detection and understanding. You stop asking "Is there a problem?" and start asking "Why is this specific problem happening?" You move from noticing that a page is slow to knowing that it's slow because a specific database query in the recommendation service is unoptimized, and that this primarily affects users in a particular geographic region.

Start small. Add structured logging to one service. Put a simple RUM script on your homepage. The value becomes obvious quickly. You begin to see problems you didn't know existed, and you fix them before your users have to complain. That is the real point of all this: to build better, more reliable software for the people who use it.
