Nithin Bharadwaj

JavaScript Production Monitoring: Performance Tracking, Error Detection, and User Experience Insights

I want to talk about keeping your JavaScript application healthy when real people are using it. It's one thing when your code runs on your machine. It's a different world when thousands of users hit it at once. Things slow down. Errors pop up. Parts fail. You need a way to see what's happening, to measure it, and to fix problems before your users notice.

Think of it like a doctor's check-up, but for your app, happening every second. You're not just waiting for the patient to scream in pain. You're constantly checking the heartbeat, the temperature, the reflexes. That's production performance monitoring and observability. It's about building a window into your running application.

Let me share some ways to build that window.

Gathering the Numbers: Metrics

First, you need to measure. You count things. How many requests? How long do they take? How much memory are we using? These numbers are your metrics. They are the vital signs.

A simple counter or timer is a start, but in production, you need more. You need to know not just the average response time, but the distribution. Is it fast for most, but painfully slow for a few? That's where histograms help. You also need snapshots of current state, like "how many users are online right now?" Those are gauges.

Here's a practical way to start collecting these metrics yourself. It's a basic class that can grow with your needs.

// A simple metric collector
class AppMetrics {
  constructor() {
    this.data = { counts: {}, timings: [], gauges: {} };
    this.startTime = Date.now();
  }

  // Count how many times something happens
  increment(metricName, value = 1) {
    if (!this.data.counts[metricName]) this.data.counts[metricName] = 0;
    this.data.counts[metricName] += value;
  }

  // Measure how long something takes
  time(metricName) {
    const start = performance.now();
    return {
      stop: () => {
        const duration = performance.now() - start;
        this.data.timings.push({ name: metricName, duration });
        return duration;
      }
    };
  }

  // Record a current value, like temperature
  setGauge(metricName, value) {
    this.data.gauges[metricName] = { value, timestamp: Date.now() };
  }

  // Calculate some useful summaries
  summarize() {
    const summary = { counts: this.data.counts, gauges: this.data.gauges };

    // Group timings by name and calculate averages, min, max
    const timingGroups = {};
    this.data.timings.forEach(t => {
      if (!timingGroups[t.name]) timingGroups[t.name] = [];
      timingGroups[t.name].push(t.duration);
    });

    summary.timings = {};
    for (const [name, values] of Object.entries(timingGroups)) {
      const sorted = values.sort((a,b) => a - b);
      summary.timings[name] = {
        avg: sorted.reduce((a,b) => a + b, 0) / sorted.length,
        min: sorted[0],
        max: sorted[sorted.length - 1],
        p95: sorted[Math.floor(sorted.length * 0.95)] // 95th percentile
      };
    }

    summary.uptime = Date.now() - this.startTime;
    return summary;
  }
}

// Using it in your API
const metrics = new AppMetrics();

app.get('/api/data', async (req, res) => {
  const timer = metrics.time('api.data.request');
  metrics.increment('api.data.calls');

  try {
    const data = await fetchSomeData();
    timer.stop();
    res.json(data);
  } catch (error) {
    timer.stop();
    metrics.increment('api.data.errors');
    res.status(500).json({ error: 'Failed' });
  }
});

// Every minute, log or send the summary somewhere
setInterval(() => {
  console.log('Metrics snapshot:', metrics.summarize());
  // sendToMonitoringService(metrics.summarize());
}, 60000);
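The commented-out sendToMonitoringService call above is the piece you'd fill in next. Here's a minimal sketch of what it could look like, assuming a hypothetical collector that accepts JSON over HTTP (Node 18+ ships a global fetch):

// A sketch of sendToMonitoringService, assuming a hypothetical
// collector endpoint. The URL and payload shape are illustrative.
async function sendToMonitoringService(snapshot) {
  try {
    await fetch('https://collector.example.com/ingest/metrics', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ service: 'my-api', snapshot })
    });
  } catch (err) {
    // Monitoring must never take down the app it monitors
    console.error('Failed to ship metrics:', err.message);
  }
}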

This gives you raw numbers. The next step is understanding the story behind a single user's request as it travels through your system.

Following the Journey: Distributed Tracing

Modern apps are rarely one big piece. A single webpage click might call an API, which queries a database, then calls another service, which uses a cache. If it's slow, where was the delay? Distributed tracing answers this.

A "trace" is the full journey of one request. Each step in that journey is a "span". You give the whole journey a unique ID and pass it along everywhere the request goes.

// A basic tracer to understand request flow
class SimpleTracer {
  static startTrace(requestId) {
    const traceId = requestId || `trace_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
    const rootSpan = { id: `span_root`, traceId, start: Date.now(), children: [] };
    const spanStack = [rootSpan];

    return {
      traceId,
      // Start a new step within the journey
      startSpan(name) {
        const parentSpan = spanStack[spanStack.length - 1];
        const newSpan = {
          id: `span_${Date.now()}_${Math.random().toString(36).slice(2, 7)}`,
          name,
          traceId,
          parentId: parentSpan.id,
          start: Date.now(),
          children: []
        };
        parentSpan.children.push(newSpan);
        spanStack.push(newSpan);
        return newSpan.id;
      },
      // Mark a step as finished
      endSpan(spanId, error = null) {
        const spanIndex = spanStack.findIndex(s => s.id === spanId);
        if (spanIndex === -1) return;

        const span = spanStack[spanIndex];
        span.end = Date.now();
        span.duration = span.end - span.start;
        span.error = error ? error.message : null;

        spanStack.splice(spanIndex, 1); // Remove from stack
      },
      // Get the full story of the trace
      getTrace() {
        const calculateTree = (span) => ({
          ...span,
          children: span.children.map(calculateTree)
        });
        return calculateTree(rootSpan);
      }
    };
  }
}

// Using it across service boundaries
// Service A (Frontend or API Gateway)
const tracer = SimpleTracer.startTrace();
const traceId = tracer.traceId;
const span1 = tracer.startSpan('validate_request');

// Pass the traceId to the next service (e.g., in an HTTP header)
const headers = { 'X-Trace-ID': traceId, 'X-Parent-Span-ID': span1 };
const userData = await fetch('/api/user', { headers });

tracer.endSpan(span1);
const span2 = tracer.startSpan('process_data');
// ... do work
tracer.endSpan(span2);

// Now, inside '/api/user' (Service B)
app.get('/api/user', async (req, res) => {
  const incomingTraceId = req.headers['x-trace-id'];
  const parentSpanId = req.headers['x-parent-span-id']; // a production tracer would record this as the new root span's parent

  // Continue the same trace
  const remoteTracer = SimpleTracer.startTrace(incomingTraceId);
  const dbSpanId = remoteTracer.startSpan('database_query');

  const user = await database.findUser(req.query.id); // This is the slow part!

  remoteTracer.endSpan(dbSpanId);
  res.json(user);

  // Now you can see that the 'database_query' span inside the larger trace took a long time.
});
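In a real service you wouldn't call startTrace by hand in every handler. Here's a small sketch of Express middleware, reusing the SimpleTracer class above, that does the wiring automatically (the collector call is a placeholder):

// Middleware that continues an incoming trace (or starts a new one)
// for every request, using the SimpleTracer class from above.
function tracingMiddleware(req, res, next) {
  const tracer = SimpleTracer.startTrace(req.headers['x-trace-id']);
  req.tracer = tracer; // Handlers can call req.tracer.startSpan(...)
  res.setHeader('X-Trace-ID', tracer.traceId);
  res.on('finish', () => {
    // Ship the finished trace to your tracing backend here
    console.log('Trace complete:', JSON.stringify(tracer.getTrace()));
  });
  next();
}

app.use(tracingMiddleware);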

When a trace shows a database_query span taking 2 seconds, you've found a problem. Often, those problems are errors.

Catching and Understanding Errors

Logging "Error: Something went wrong" is useless. You need the whole picture. What was the user doing? What were the inputs? What was the system state?

The goal isn't just to log errors, but to group them, count them, and see if they're getting worse.

class SmartErrorTracker {
  constructor() {
    this.errorBuckets = new Map(); // Groups similar errors
  }

  capture(error, context = {}) {
    // 1. Create a fingerprint for this error type
    const fingerprint = this._getErrorFingerprint(error);

    // 2. Get or create a bucket for this type
    let bucket = this.errorBuckets.get(fingerprint);
    if (!bucket) {
      bucket = {
        count: 0,
        firstSeen: new Date(),
        lastSeen: new Date(),
        latestExample: { error, context },
        contexts: new Set() // To see how many different situations cause it
      };
      this.errorBuckets.set(fingerprint, bucket);
    }

    // 3. Update the bucket
    bucket.count++;
    bucket.lastSeen = new Date();
    bucket.latestExample = { error, context };
    bucket.contexts.add(JSON.stringify(context));

    // 4. Check for alerts (e.g., sudden spike)
    this._checkForAlert(fingerprint, bucket);

    console.error(`Error captured [${fingerprint}]:`, { error: error.message, context });
    return fingerprint;
  }

  _getErrorFingerprint(error) {
    // Use error message and first line of stack trace to group similar errors
    const stackFirstLine = error.stack ? error.stack.split('\n')[1] : '';
    return `${error.name}:${error.message}:${stackFirstLine}`;
  }

  _checkForAlert(fingerprint, bucket) {
    // Simple rule: alert if this error type first appeared within the last minute and already has more than 10 occurrences
    const oneMinuteAgo = new Date(Date.now() - 60000);
    if (bucket.firstSeen > oneMinuteAgo && bucket.count > 10) {
      this._triggerAlert(`Error spike for ${fingerprint}`, bucket);
    }
  }

  _triggerAlert(message, data) {
    // Send to Slack, PagerDuty, etc.
    console.warn(`ALERT: ${message}`, data);
  }

  getSummary() {
    return Array.from(this.errorBuckets.entries()).map(([fingerprint, bucket]) => ({
      fingerprint,
      totalCount: bucket.count,
      firstSeen: bucket.firstSeen,
      lastSeen: bucket.lastSeen,
      uniqueContexts: bucket.contexts.size
    }));
  }
}

// Use it by wrapping risky code
const errorTracker = new SmartErrorTracker();

async function processUserUpload(user, file) {
  try {
    // ... complex processing ...
  } catch (error) {
    // Now you capture rich context
    errorTracker.capture(error, {
      userId: user.id,
      fileSize: file.size,
      fileType: file.type,
      function: 'processUserUpload',
      timestamp: new Date().toISOString()
    });
    throw error; // Re-throw after logging
  }
}

This tells you "The 'File too large' error is happening 50 times per hour, mostly to users uploading PDFs over 10MB." That's actionable.
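To make sure nothing slips past your try/catch blocks, you can also hook the tracker into process-level handlers. A short sketch for the Node.js side, reusing the errorTracker instance above:

// Catch errors that escape every try/catch (Node.js side)
process.on('uncaughtException', (error) => {
  errorTracker.capture(error, { source: 'uncaughtException' });
  process.exit(1); // Process state is unknown now; let a supervisor restart it
});

process.on('unhandledRejection', (reason) => {
  const error = reason instanceof Error ? reason : new Error(String(reason));
  errorTracker.capture(error, { source: 'unhandledRejection' });
});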

So far, we've measured from inside the server. But the user's experience happens in their browser.

Feeling the User's Pain: Real User Monitoring (RUM)

This is about measuring performance as the user actually experiences it. Their internet might be slow. Their phone might be old. We need to measure that.

The browser gives us amazing tools for this through APIs like PerformanceNavigationTiming and PerformanceObserver.

// Capture core web vitals and other user-centric metrics
class UserExperienceMonitor {
  constructor() {
    this.metrics = {};
    this.collectCoreWebVitals();
    this.observeLongTasks();
    this.observeResourceLoad();
  }

  collectCoreWebVitals() {
    // Largest Contentful Paint (LCP): when the main content appears
    const lcpObserver = new PerformanceObserver((entryList) => {
      const entries = entryList.getEntries();
      const lastEntry = entries[entries.length - 1];
      this.metrics.lcp = lastEntry.startTime;
      console.log('LCP:', this.metrics.lcp);
    });
    lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });

    // First Input Delay (FID): how responsive the page is to first click
    const fidObserver = new PerformanceObserver((entryList) => {
      const entries = entryList.getEntries();
      entries.forEach(entry => {
        this.metrics.fid = entry.processingStart - entry.startTime;
        console.log('FID:', this.metrics.fid);
      });
    });
    fidObserver.observe({ type: 'first-input', buffered: true });

    // Cumulative Layout Shift (CLS): visual stability
    let clsValue = 0;
    const clsObserver = new PerformanceObserver((entryList) => {
      for (const entry of entryList.getEntries()) {
        if (!entry.hadRecentInput) {
          clsValue += entry.value;
          this.metrics.cls = clsValue;
          console.log('CLS updated:', this.metrics.cls);
        }
      }
    });
    clsObserver.observe({ type: 'layout-shift', buffered: true });
  }

  observeLongTasks() {
    // Tasks that block the main thread for over 50ms feel janky
    const longTaskObserver = new PerformanceObserver((entryList) => {
      for (const entry of entryList.getEntries()) {
        console.warn(`Long task blocked main thread for ${entry.duration}ms`, entry);
        // Send this to your monitoring backend
        this.sendMetric('long_task', entry.duration, {
          attribution: entry.attribution?.[0]?.name || 'unknown'
        });
      }
    });
    longTaskObserver.observe({ entryTypes: ['longtask'] });
  }

  observeResourceLoad() {
    // What's loading slowly? Images, scripts, third-party widgets?
    const resourceObserver = new PerformanceObserver((entryList) => {
      for (const entry of entryList.getEntries()) {
        if (entry.duration > 1000) { // Slow if over 1 second
          console.log(`Slow resource: ${entry.name} took ${entry.duration}ms`);
          this.sendMetric('slow_resource', entry.duration, {
            type: entry.initiatorType,
            url: entry.name
          });
        }
      }
    });
    resourceObserver.observe({ entryTypes: ['resource'] });
  }

  sendMetric(name, value, tags) {
    // Send to your backend analytics
    navigator.sendBeacon('/api/rum-metrics', JSON.stringify({
      name, value, tags,
      path: window.location.pathname,
      connection: navigator.connection ? navigator.connection.effectiveType : 'unknown'
    }));
  }
}

// Start monitoring when the page loads
window.addEventListener('load', () => {
  const monitor = new UserExperienceMonitor();
  // Optionally expose to see data in console
  window.__userMetrics = monitor.metrics;
});

This script tells you a user on a "3g" connection experienced a 4-second LCP on the product page. That's a specific problem you can work on.
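On the other end of that sendBeacon call, something has to be listening. Here's a sketch of the receiving endpoint, assuming Express; note that sendBeacon with a string body typically arrives as text/plain, so the handler parses it manually:

// Receiving end for the sendBeacon payloads (assuming Express).
app.post('/api/rum-metrics', express.text({ type: '*/*' }), (req, res) => {
  try {
    const metric = JSON.parse(req.body);
    // Reuse the AppMetrics collector from earlier, or forward elsewhere
    metrics.increment(`rum.${metric.name}`);
    console.log('RUM metric:', metric.name, metric.value, metric.tags);
  } catch (err) {
    // Ignore malformed beacons; telemetry should fail silently
  }
  res.status(204).end(); // The browser never reads a beacon response
});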

But what if no users are on a critical checkout path right now? Is it still working? That's where synthetic checks come in.

The Robotic User: Synthetic Monitoring

Synthetic monitoring uses automated scripts that pretend to be users, testing critical paths 24/7. It's like a heartbeat monitor for your key features.

You can write these in Node.js with tools like Puppeteer to control a browser.

// A script that runs every 5 minutes to test the checkout flow
const puppeteer = require('puppeteer');

async function syntheticCheckoutTest() {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  const testStart = Date.now();
  const metrics = { steps: {}, success: false };

  try {
    // Step 1: Load homepage
    await page.goto('https://yourapp.com', { waitUntil: 'networkidle2' });
    metrics.steps.loadHomepage = Date.now() - testStart;

    // Step 2: Navigate to product
    await page.click('nav a[href="/products"]');
    await page.waitForSelector('.product-list');
    metrics.steps.navigateToProducts = Date.now() - testStart;

    // Step 3: Add to cart
    await page.click('.product-list .product:first-child .buy-button');
    await page.waitForSelector('.cart-count'); // Wait for cart update
    metrics.steps.addToCart = Date.now() - testStart;

    // Step 4: Go to checkout
    await page.click('.cart-icon');
    await page.waitForSelector('#checkout-form');
    metrics.steps.loadCheckout = Date.now() - testStart;

    // Fill dummy data (using a test account)
    await page.type('#checkout-email', 'test@example.com');
    // ... fill other fields

    // Step 5: Submit (but maybe stop before actual payment)
    await page.click('#submit-order');
    await page.waitForSelector('.order-confirmation, .error-message', { timeout: 10000 });

    const confirmation = await page.$('.order-confirmation');
    metrics.success = !!confirmation;
    metrics.totalDuration = Date.now() - testStart;

    console.log(`Synthetic test ${metrics.success ? 'PASSED' : 'FAILED'} in ${metrics.totalDuration}ms`, metrics);

    // Send results to monitoring dashboard
    await sendToDashboard('checkout_flow', metrics);

  } catch (error) {
    metrics.error = error.message;
    metrics.success = false;
    console.error('Synthetic test FAILED:', error);
    await sendToDashboard('checkout_flow', metrics);
    // This failure should trigger an alert!
  } finally {
    await browser.close();
  }
}

// Run it on a schedule (e.g., using cron, or an orchestration tool)
setInterval(syntheticCheckoutTest, 5 * 60 * 1000); // Every 5 minutes

Now you know your checkout works from California, but what about from London or Singapore? That's why you run these synthetic checks from multiple locations.
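One lightweight way to get that coverage is to deploy the same script to runners in several regions and tag each result with where it ran. A sketch, assuming each runner sets a REGION environment variable; it also fills in the sendToDashboard placeholder used above:

// Tag each synthetic result with its region, assuming each runner
// sets REGION (e.g. 'us-west', 'eu-london'). The collector URL is illustrative.
const REGION = process.env.REGION || 'unknown';

async function sendToDashboard(checkName, metrics) {
  await fetch('https://collector.example.com/ingest/synthetic', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ check: checkName, region: REGION, ...metrics })
  });
}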

All this data is pointless if no one looks at it. And when things go wrong, people need to know now.

Getting the Right Message to the Right Person: Alerting

Alerting is tricky. Too many alerts, and people ignore them. Too few, and you miss fires. The key is smart, targeted alerts.

class AlertManager {
  constructor() {
    this.alerts = new Map();
    this.cooldownPeriod = 300000; // 5 minutes per alert type
    this.routingRules = [
      { team: 'frontend', matches: (alert) => alert.metric.includes('lcp') || alert.metric.includes('fid') },
      { team: 'backend', matches: (alert) => alert.metric.includes('api.error') || alert.metric.includes('response_time') },
      { team: 'database', matches: (alert) => alert.metric.includes('db.query') || alert.metric.includes('connection_pool') }
    ];
  }

  evaluate(metricName, currentValue, threshold) {
    if (currentValue > threshold.warning && currentValue <= threshold.critical) {
      this._trigger(metricName, currentValue, 'warning', threshold);
    } else if (currentValue > threshold.critical) {
      this._trigger(metricName, currentValue, 'critical', threshold);
    }
  }

  _trigger(metricName, value, severity, threshold) {
    const alertKey = `${metricName}:${severity}`;
    const lastAlert = this.alerts.get(alertKey);

    // Don't re-alert if we just did recently (cooldown)
    if (lastAlert && (Date.now() - lastAlert.timestamp < this.cooldownPeriod)) {
      return;
    }

    const alert = {
      id: `alert_${Date.now()}`,
      metricName,
      value,
      severity,
      threshold,
      timestamp: new Date(),
      routedTo: this._routeAlert(metricName, severity)
    };

    this.alerts.set(alertKey, alert);
    this._notify(alert);
  }

  _routeAlert(metricName, severity) {
    for (const rule of this.routingRules) {
      if (rule.matches({ metric: metricName, severity })) {
        return rule.team;
      }
    }
    return 'platform'; // Default team
  }

  _notify(alert) {
    console.log(`ALERT [${alert.severity.toUpperCase()}] ${alert.metricName}=${alert.value} -> Team: ${alert.routedTo}`);

    // In reality, you would:
    // 1. Send to Slack/Teams channel for that team
    // 2. If 'critical', maybe send an SMS or call via PagerDuty
    // 3. Create an incident ticket
  }
}

// Using it
const alertManager = new AlertManager();

// Imagine this runs every minute, checking metrics from your system
function runAlertChecks() {
  const latestMetrics = getLatestMetricSummary(); // From your monitoring system

  // Define your thresholds
  const thresholds = {
    'api.response_time.p95': { warning: 500, critical: 2000 }, // milliseconds
    'error.rate': { warning: 0.01, critical: 0.05 }, // 1% errors is warning, 5% is critical
    'memory.usage': { warning: 0.8, critical: 0.95 } // 80% memory used
  };

  for (const [metricName, threshold] of Object.entries(thresholds)) {
    if (latestMetrics[metricName] !== undefined) {
      alertManager.evaluate(metricName, latestMetrics[metricName], threshold);
    }
  }
}

This ensures the frontend team gets pinged about slow page loads, and the database team gets woken up for connection pool exhaustion.
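Filling in _notify is mostly glue. Here's a sketch of the Slack leg, assuming an incoming-webhook URL stored in an environment variable (Slack's incoming webhooks accept a simple JSON body with a text field):

// A sketch of the Slack notification, assuming SLACK_WEBHOOK_URL
// holds a Slack incoming-webhook URL.
async function notifySlack(alert) {
  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `[${alert.severity.toUpperCase()}] ${alert.metricName} = ${alert.value} -> team ${alert.routedTo}`
    })
  });
}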

With metrics, traces, errors, RUM, synthetic checks, and alerts, you have a lot of data. You need to see it clearly.

Making Sense of It All: Dashboards

Dashboards turn data into information. A good dashboard tells a story at a glance.

You don't need to build these from scratch. Use tools like Grafana, Datadog, or even a well-built React/Vue app that fetches from your monitoring API. The key is design.

Here’s a conceptual example of the data structure a dashboard might consume.

// An API endpoint that aggregates data for a dashboard
app.get('/api/dashboard/overview', async (req, res) => {
  // Fetch data from various sources (metrics DB, error tracker, etc.)
  const [systemHealth, businessMetrics, recentIncidents] = await Promise.all([
    fetchSystemHealth(),
    fetchBusinessMetrics(),
    fetchRecentIncidents()
  ]);

  // Structure it for the dashboard
  const dashboardData = {
    timeframe: 'last_24_hours',
    summary: {
      uptime: calculateUptimePercentage(systemHealth),
      avgResponseTime: systemHealth.avg_response_time,
      errorRate: systemHealth.error_count / systemHealth.request_count,
      activeUsers: businessMetrics.active_users,
      conversions: businessMetrics.conversion_count
    },
    charts: [
      {
        title: 'Response Time Trend',
        type: 'line',
        data: systemHealth.response_time_by_hour // Array of {time, p50, p95, p99}
      },
      {
        title: 'Error Types',
        type: 'bar',
        data: systemHealth.errors_by_type // Array of {error_type, count}
      }
    ],
    topAlerts: recentIncidents.slice(0, 5),
    slowestEndpoints: systemHealth.slow_endpoints.slice(0, 10)
  };

  res.json(dashboardData);
});

A dashboard might show a big green "99.9% Uptime" badge, a graph of response times slowly creeping up over the last week, and a list of the top 5 erroring endpoints. This tells an operator the system is mostly healthy but might need some attention soon.

Putting It Together: A Culture of Observation

Finally, the most important technique isn't code. It's practice. Monitoring is not a "set it and forget it" tool. It's a living part of your development cycle.

When you write a new feature, ask: "What metrics will tell us if this is working?" Add counters for usage. Add timers for performance. Define what "healthy" looks like for this feature.
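In practice that can be a few lines around the new code path. A sketch, reusing the AppMetrics instance from the start of this article (buildAndSendReport is a hypothetical feature function):

// Instrumenting a hypothetical new feature: one counter for usage,
// one timer for performance, one counter for failures.
async function exportReport(userId) {
  metrics.increment('feature.export_report.used');
  const timer = metrics.time('feature.export_report.duration');
  try {
    return await buildAndSendReport(userId); // the feature's actual logic
  } catch (error) {
    metrics.increment('feature.export_report.errors');
    throw error;
  } finally {
    timer.stop();
  }
}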

When you debug an issue, start with the metrics and traces. Let the data guide you to the problem.

Review your dashboards and alerts regularly. Are you getting alert fatigue? Tune them. Are there metrics no one looks at? Remove them. Is there a question you keep asking that the dashboard doesn't answer? Add it.

Start simple. Add a few key metrics. Get a basic dashboard. Capture errors with context. As you learn what matters to your application and your users, you'll grow your monitoring system naturally. The goal is not to collect all the data, but to collect the right data that helps you understand your system and keep it running smoothly for the people who use it. That’s the heart of observability.
