binadit

Posted on • Originally published at binadit.com

How misleading monitoring nearly cost a SaaS platform €50k in lost subscriptions

When perfect monitoring dashboards hide critical performance problems

Ever had a monitoring dashboard showing all green while your users are screaming about poor performance? A European fintech SaaS company almost learned this lesson the hard way, facing €50k in potential subscription losses despite 99.94% uptime metrics.

Here's how misleading monitoring nearly killed their business and what we did to fix it.

The problem: green dashboards, angry customers

This platform served 15,000 users across EU markets, processing financial data for small businesses. Their managed hosting provider gave them basic monitoring: server uptime, CPU, memory, and simple HTTP health checks. Everything looked perfect on paper.

But customer support was drowning during peak hours (9-11 AM and 2-4 PM CET). Users complained about slow loading and glitchy behavior. Churn was climbing, costing €4,200 a month in lost recurring revenue.

The disconnect was brutal: monitoring said healthy, customers said otherwise.

What we discovered during the audit

Day one of our infrastructure review revealed the core issue. They were monitoring server health, not user experience.

Their HTTP health check hit a lightweight endpoint returning 200 status in under 100ms. Real user workflows involved complex database queries, third-party API calls, and heavy JavaScript execution.

We deployed real user monitoring (RUM) and synthetic transaction monitoring. The actual numbers were shocking:

  • Dashboard loading: 847ms average (health check showed 120ms)
  • Financial report generation: 12.3 seconds at 95th percentile
  • API response times: 2.1 seconds during traffic spikes
  • Time to interactive: 4.7 seconds average

Server logs revealed more hidden issues:

Database connection exhaustion: PostgreSQL connection pool maxed out during peaks, causing 8-second queue times. Server stayed online, so monitoring registered everything as healthy.

Memory allocation problems: Total system memory looked fine, but application garbage collection pauses hit 300-500ms every few minutes, freezing the UI.

CDN misconfiguration: Static assets bypassed cache, hitting origin servers unnecessarily during peak load.
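
The fix on this front usually comes down to cache headers. An illustrative nginx fragment (not the platform's actual configuration) that keeps fingerprinted static assets off the origin:

```nginx
# Serve hashed/fingerprinted assets with long-lived, immutable caching
# so the CDN and browsers stop revalidating against origin on every hit.
location /static/ {
    expires 30d;
    add_header Cache-Control "public, max-age=2592000, immutable";
}
```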

Our solution approach

Instead of adding more monitoring tools, we redefined what mattered. For financial SaaS, user experience equals trust and retention.

Core principle: monitor what users do, not what servers do.

We implemented three monitoring levels:

1. Real user monitoring (RUM)

Lightweight JavaScript agent sampling 25% of sessions:

const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.entryType === 'navigation') {
      sendMetric('tti', entry.domInteractive - entry.fetchStart);
    }
    if (entry.entryType === 'measure' && entry.name === 'report-generation') {
      sendMetric('report_duration', entry.duration);
    }
  }
});

observer.observe({entryTypes: ['navigation', 'measure']});
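
The 25% session sampling and the sendMetric helper used above can be sketched roughly like this; the /rum endpoint and payload shape are assumptions, not the actual agent:

```javascript
// Decide once per page load whether this session is in the 25% sample.
const SAMPLE_RATE = 0.25;
const sampled = Math.random() < SAMPLE_RATE;

function sendMetric(name, value) {
  if (!sampled) return;
  const payload = JSON.stringify({ name, value, ts: Date.now() });
  // sendBeacon survives page unload; keepalive fetch is the fallback.
  if (typeof navigator !== 'undefined' && navigator.sendBeacon) {
    navigator.sendBeacon('/rum', payload);
  } else if (typeof fetch !== 'undefined') {
    fetch('/rum', { method: 'POST', body: payload, keepalive: true });
  }
}
```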

2. Synthetic transaction monitoring

Puppeteer scripts replicating real workflows:

const testReportGeneration = async (page) => {
  const start = Date.now();

  await page.goto(process.env.APP_URL + '/login');
  await page.type('[name="email"]', process.env.TEST_USER_EMAIL);
  await page.type('[name="password"]', process.env.TEST_USER_PASSWORD);
  await page.click('button[type="submit"]');

  await page.waitForSelector('[data-testid="dashboard"]');
  await page.click('[data-testid="generate-report"]');

  await page.waitForSelector('[data-testid="report-complete"]', { timeout: 15000 });

  return { success: true, duration: Date.now() - start };
};
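
Durations from runs like the one above feed percentile figures such as the 12.3s P95 quoted earlier. A minimal nearest-rank percentile helper (our illustration, not the monitoring stack's code):

```javascript
// Nearest-rank percentile over collected synthetic-run durations (ms).
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Example: one slow outlier dominates the tail even when most runs look fine.
const durations = [3200, 4100, 3900, 12300, 3700, 4000, 3600, 3800, 4050, 3950];
console.log(percentile(durations, 95)); // → 12300
console.log(percentile(durations, 50)); // → 3900
```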

3. Infrastructure correlation monitoring

Database connection pool visibility:

pool.on('acquire', (client) => {
  metrics.increment('db.connections.acquired');
});

pool.on('error', (err, client) => {
  metrics.increment('db.connections.error');
  logger.error('Database connection error', { error: err.message });
});
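
node-postgres also exposes pool.totalCount, pool.idleCount and pool.waitingCount; polling those makes queueing visible directly. A small helper to turn the counters into gauges (the gauge names and example numbers are illustrative):

```javascript
// Convert raw pool counters into gauge-friendly numbers. With node-postgres
// you'd pass pool.totalCount, pool.idleCount, pool.waitingCount and the
// configured pool max.
function poolGauges(totalCount, idleCount, waitingCount, maxSize) {
  const inUse = totalCount - idleCount;
  return {
    in_use: inUse,
    waiting: waitingCount,
    saturation: maxSize > 0 ? inUse / maxSize : 0,
  };
}

// 18 of 20 connections busy, 7 queries queued: the "healthy server,
// queueing users" state the default monitoring never showed.
console.log(poolGauges(20, 2, 7, 20)); // → { in_use: 18, waiting: 7, saturation: 0.9 }
```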

User experience-based alerting:

- alert: SlowReportGeneration
  expr: avg_over_time(report_generation_p95[5m]) > 8000
  for: 2m
  annotations:
    summary: "Report generation exceeding 8 seconds"

- alert: HighErrorRate
  expr: rate(user_workflow_errors[5m]) > 0.05
  for: 1m
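
For the report_generation_p95 series used above, we assume durations are exported as a Prometheus histogram; a recording rule along these lines would produce it (report_generation_ms_bucket is an illustrative metric name):

```yaml
groups:
  - name: user_experience
    rules:
      - record: report_generation_p95
        expr: histogram_quantile(0.95, sum(rate(report_generation_ms_bucket[5m])) by (le))
```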

Results that mattered

Within two weeks, we identified and resolved the previously invisible performance issues:

User experience improvements:

  • Dashboard loading: 847ms → 312ms
  • Report generation P95: 12.3s → 4.1s
  • API response times: 2.1s → 680ms
  • Time to interactive: 4.7s → 1.9s

Business impact:

  • Support tickets during peak hours: down 73%
  • Customer satisfaction scores: improved from 6.2 to 8.4
  • Monthly churn reduction: €3,800 recovered revenue
  • Mean time to detection for real issues: 11 minutes → 2 minutes

Key takeaways for developers

  1. Health checks should mirror real user workflows, not just return 200 status codes
  2. Monitor user journeys end-to-end, including third-party dependencies
  3. Alert on user experience degradation, not arbitrary server thresholds
  4. Correlate infrastructure metrics with user impact — one without the other tells half the story
  5. Synthetic monitoring catches issues before users do

Server uptime means nothing if users can't complete their workflows. Build monitoring that reflects what your customers actually experience.

