Wilson Xu

Monitoring Node.js CLI Tools in Production: Error Tracking Beyond the Terminal

When we talk about production monitoring, the conversation usually centers on web applications and APIs. Dashboards full of request latency charts, 5xx error rates, and uptime percentages. But there is an entire class of software that runs in production with almost no visibility: command-line tools.

CLI tools handle critical workflows. They process data pipelines, manage deployments, run scheduled migrations, sync third-party services, and automate infrastructure. When a CLI tool fails at 3 AM inside a cron job on a remote server, nobody sees the stack trace. It vanishes into /dev/null or a log file that nobody checks until the damage is already done.

After building and maintaining over 25 npm packages — many of them CLI tools running in production environments — I have learned that CLI tools deserve the same monitoring rigor we give to web services. This article covers why, and exactly how to build that monitoring in.

Why CLI Tools Need Production Monitoring

A web server crashes and your uptime monitor fires an alert within seconds. A CLI tool crashes and... nothing happens. The cron job silently fails. The next scheduled run might fail too. Hours or days pass before someone notices that the nightly data export has not run since Tuesday.

CLI tools are deceptive. During development, you run them interactively. You see every error. You fix things on the spot. But production is different. CLI tools in production run:

  • Inside cron jobs with no TTY attached
  • In CI/CD pipelines where output is buried in build logs
  • On remote servers via SSH sessions that have long since disconnected
  • Inside Docker containers that get destroyed after each run
  • As systemd services with logs rotating into oblivion

The failure modes also differ from web applications. A web app's failures surface inside request/response cycles that are logged and traced by default. CLI tools fail in ways that are harder to anticipate and harder to reproduce.

Common Failure Modes in Production CLI Tools

Understanding how CLI tools fail is the first step toward monitoring them effectively.

Network Timeouts

CLI tools that call external APIs are vulnerable to network issues that never surface during development. A tool that fetches data from a third-party API might work flawlessly for months, then start timing out when that API introduces rate limiting or when your server's DNS resolver starts acting up.

// This works fine in development but silently hangs in production
const response = await fetch('https://api.example.com/data');

Without a timeout and proper error reporting, this call can hang indefinitely, blocking the entire process.
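One defensive pattern, assuming Node 18+ (where global fetch and AbortSignal.timeout are available), is to give every outbound call an explicit time budget. The helper name and the 10-second default here are illustrative:

```javascript
// Sketch: abort any outbound request that exceeds its time budget,
// so the call fails loudly instead of hanging the process forever.
async function fetchWithTimeout(url, ms = 10000) {
  const response = await fetch(url, { signal: AbortSignal.timeout(ms) });
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with HTTP ${response.status}`);
  }
  return response;
}
```

If the API stalls, the call now rejects with a timeout error your crash reporter can capture, rather than blocking the process indefinitely.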

Invalid or Unexpected Input

Production data is messy. A CLI tool that parses CSV files will eventually encounter a file with a BOM marker, mixed line endings, or an encoding that is not UTF-8. A tool that reads from stdin will eventually receive truncated input because the upstream pipe broke.

// Crashes on malformed JSON with an unhelpful error
const config = JSON.parse(fs.readFileSync(configPath, 'utf8'));

The resulting SyntaxError: Unexpected token tells you nothing about which file failed, which line it failed on, or what the content actually was.

Out-of-Memory Kills

CLI tools that process large files or datasets can hit memory limits without warning. The OOM killer on Linux terminates the process with SIGKILL — no chance to catch the signal, no chance to log anything. The process just disappears.

// Reads entire file into memory — works on 100MB, killed on 10GB
const data = fs.readFileSync('massive-dataset.csv', 'utf8');
const lines = data.split('\n');

Signal Kills and Ungraceful Shutdowns

Production environments send signals. Containers receive SIGTERM during rolling deployments. Systemd sends SIGTERM followed by SIGKILL after a timeout. CI/CD pipelines send SIGINT when builds are cancelled. A CLI tool that does not handle these signals can leave behind corrupted state, partial writes, or locked resources.

Building a Crash Reporter Into Your CLI Tool

The foundation of CLI monitoring is a crash reporter — a wrapper that catches unhandled errors and reports them before the process exits. Here is a practical implementation:

// lib/crash-reporter.js
const os = require('os');
const { version } = require('../package.json');

class CrashReporter {
  constructor(options = {}) {
    this.serviceName = options.serviceName || 'cli-tool';
    this.version = options.version || version;
    this.reporter = options.reporter || console.error;
    this.context = {};

    this.install();
  }

  install() {
    process.on('uncaughtException', (error) => {
      this.report(error, { type: 'uncaughtException' });
      process.exit(1);
    });

    process.on('unhandledRejection', (reason) => {
      const error = reason instanceof Error ? reason : new Error(String(reason));
      this.report(error, { type: 'unhandledRejection' });
      process.exit(1);
    });

    // Capture signal-based terminations
    for (const signal of ['SIGTERM', 'SIGINT', 'SIGHUP']) {
      process.on(signal, () => {
        this.report(new Error(`Process terminated by ${signal}`), {
          type: 'signal',
          signal,
        });
        process.exit(128 + os.constants.signals[signal]);
      });
    }
  }

  setContext(key, value) {
    this.context[key] = value;
  }

  report(error, metadata = {}) {
    const report = {
      timestamp: new Date().toISOString(),
      service: this.serviceName,
      version: this.version,
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
      },
      environment: {
        nodeVersion: process.version,
        platform: process.platform,
        arch: process.arch,
        memory: process.memoryUsage(),
        uptime: process.uptime(),
        cwd: process.cwd(),
        argv: process.argv,
      },
      context: this.context,
      metadata,
    };

    this.reporter(report);
  }
}

module.exports = { CrashReporter };

Usage in your CLI entry point:

#!/usr/bin/env node
const { CrashReporter } = require('./lib/crash-reporter');

const reporter = new CrashReporter({
  serviceName: 'my-data-tool',
});

// Add context as your tool progresses
reporter.setContext('inputFile', process.argv[2]);
reporter.setContext('stage', 'parsing');

// Your actual CLI logic
run().catch((error) => {
  reporter.report(error, { type: 'caught' });
  process.exit(1);
});

The key insight here is the setContext method. As your CLI tool progresses through its stages — parsing config, fetching data, transforming records, writing output — you update the context. When a crash happens, you know exactly where the tool was and what it was working on.

Integrating with Error Tracking Services

A crash reporter that logs to stderr is better than nothing, but it still requires someone to check the logs. Integrating with an error tracking service like Honeybadger turns silent failures into actionable alerts.

// lib/honeybadger-reporter.js
const Honeybadger = require('@honeybadger-io/js');

function createHoneybadgerReporter(apiKey, environment = 'production') {
  Honeybadger.configure({
    apiKey,
    environment,
    reportData: true,
  });

  return {
    report(error, context = {}) {
      // Honeybadger.notify handles both Error objects and plain objects
      Honeybadger.notify(error, {
        context,
        tags: ['cli', context.stage || 'unknown'].join(','),
      });
    },

    setContext(key, value) {
      Honeybadger.setContext({ [key]: value });
    },

    // Ensure errors are flushed before process exits
    async flush() {
      // Give Honeybadger time to send pending reports
      await new Promise((resolve) => setTimeout(resolve, 2000));
    },
  };
}

module.exports = { createHoneybadgerReporter };

Wire it into the crash reporter:

#!/usr/bin/env node
const { CrashReporter } = require('./lib/crash-reporter');
const { createHoneybadgerReporter } = require('./lib/honeybadger-reporter');

const hb = createHoneybadgerReporter(process.env.HONEYBADGER_API_KEY);

const reporter = new CrashReporter({
  serviceName: 'my-data-tool',
  reporter: (report) => {
    hb.report(report.error, {
      ...report.context,
      ...report.metadata,
      environment: report.environment,
    });
  },
});

// Flush pending reports before any intentional exit. Note: process.exit is
// normally synchronous, so this monkey patch only works for call sites that
// can tolerate the process surviving a beat longer — true for the crash
// handlers above, where exit is the last thing they do.
const originalExit = process.exit.bind(process);
process.exit = (code) => {
  hb.flush().then(() => originalExit(code));
};

Now when your CLI tool crashes on a production server at 3 AM, you get an alert with the full stack trace, the exact input that caused the failure, the stage of execution, memory usage at the time of the crash, and the Node.js version and platform. That is the difference between "something is broken" and "I know exactly what happened and how to fix it."

Structured Logging for CLI Tool Debugging

Stack traces tell you where a crash happened. Structured logs tell you everything that happened leading up to the crash. For CLI tools, structured logging is particularly valuable because you often need to reconstruct the exact sequence of operations.

// lib/logger.js
class StructuredLogger {
  constructor(options = {}) {
    this.serviceName = options.serviceName || 'cli-tool';
    this.level = options.level || 'info';
    this.output = options.output || process.stderr;
    this.sessionId = this.generateSessionId();
  }

  generateSessionId() {
    return `${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
  }

  log(level, message, data = {}) {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service: this.serviceName,
      session: this.sessionId,
      message,
      ...data,
    };

    this.output.write(JSON.stringify(entry) + '\n');
  }

  info(message, data) { this.log('info', message, data); }
  warn(message, data) { this.log('warn', message, data); }
  error(message, data) { this.log('error', message, data); }
  debug(message, data) { this.log('debug', message, data); }
}

module.exports = { StructuredLogger };

Three rules make structured logging effective in CLI tools:

Log to stderr, not stdout. CLI tools often pipe their output to other tools. Logging to stdout corrupts the pipeline. Always use stderr for operational logs and reserve stdout for program output.

Include a session ID. When multiple instances of your tool run concurrently (parallel cron jobs, CI/CD pipelines), a session ID lets you isolate the logs for a single run.

Log transitions, not states. Do not log "processing file" on every iteration. Log when you start processing, when you finish, and when something unexpected happens. This keeps log volume manageable while preserving the information you need for debugging.

const logger = new StructuredLogger({ serviceName: 'data-sync' });

logger.info('sync started', { source: 'api', recordCount: 1500 });
logger.info('batch processed', { batchNumber: 1, processed: 500, errors: 2 });
logger.warn('retrying failed records', { count: 2, attempt: 1 });
logger.info('sync completed', { duration: 4521, totalProcessed: 1500, totalErrors: 0 });

When these logs are shipped to a log aggregation service, you can search by session ID to reconstruct the full timeline of any run, trace failures back to specific inputs, and build dashboards showing success rates and processing times over time.

Health Checks for Long-Running CLI Processes

Some CLI tools are not one-shot scripts. They are long-running daemons: file watchers, queue consumers, webhook listeners. These need health checks just like web services do.

A minimal health check for a CLI daemon exposes a tiny HTTP endpoint or writes to a heartbeat file:

// lib/healthcheck.js
const fs = require('fs');
const http = require('http');

class HealthCheck {
  constructor(options = {}) {
    this.mode = options.mode || 'file'; // 'file' or 'http'
    this.interval = options.interval || 30000;
    this.filePath = options.filePath || '/tmp/cli-tool-health';
    this.port = options.port || 9090;
    this.checks = [];
    this.timer = null;
  }

  addCheck(name, fn) {
    this.checks.push({ name, fn });
  }

  async runChecks() {
    const results = {};
    let healthy = true;

    for (const check of this.checks) {
      try {
        results[check.name] = await check.fn();
      } catch (error) {
        results[check.name] = { status: 'unhealthy', error: error.message };
        healthy = false;
      }
    }

    return { healthy, timestamp: new Date().toISOString(), results };
  }

  start() {
    if (this.mode === 'http') {
      this.server = http.createServer(async (req, res) => {
        const health = await this.runChecks();
        res.writeHead(health.healthy ? 200 : 503, {
          'Content-Type': 'application/json',
        });
        res.end(JSON.stringify(health));
      });
      this.server.listen(this.port);
    }

    // File-based heartbeat (maintained in both modes) — works with systemd watchdog, Docker HEALTHCHECK, etc.
    this.timer = setInterval(async () => {
      const health = await this.runChecks();
      fs.writeFileSync(
        this.filePath,
        JSON.stringify(health),
      );
    }, this.interval);
  }

  stop() {
    clearInterval(this.timer);
    if (this.server) this.server.close();
  }
}

module.exports = { HealthCheck };

Usage in a long-running CLI process:

const health = new HealthCheck({ mode: 'http', port: 9091 });

health.addCheck('database', async () => {
  const start = Date.now(); // measure ping latency from here
  await db.ping();
  return { status: 'healthy', latency: Date.now() - start };
});

health.addCheck('queue', async () => {
  const depth = await queue.depth();
  return {
    status: depth < 10000 ? 'healthy' : 'degraded',
    depth,
  };
});

health.start();

Now your container orchestrator or process manager can check /health and restart the tool if it becomes unresponsive. Combine this with Honeybadger check-ins — Honeybadger's cron monitoring feature lets you define expected schedules for recurring jobs and alerts you when a check-in is missed. For CLI tools that should run every hour, a missed check-in means something is silently broken.

Putting It All Together

Here is the complete monitoring setup for a production CLI tool, combining the crash reporter, error tracking, structured logging, and health checks:

#!/usr/bin/env node
const Honeybadger = require('@honeybadger-io/js');
const { CrashReporter } = require('./lib/crash-reporter');
const { StructuredLogger } = require('./lib/logger');

// Initialize monitoring
Honeybadger.configure({
  apiKey: process.env.HONEYBADGER_API_KEY,
  environment: process.env.NODE_ENV || 'production',
});

const logger = new StructuredLogger({ serviceName: 'data-sync' });

const reporter = new CrashReporter({
  serviceName: 'data-sync',
  reporter: (report) => {
    logger.error('crash', report);
    Honeybadger.notify(report.error, { context: report.context });
  },
});

async function run() {
  reporter.setContext('stage', 'init');
  logger.info('starting data sync');

  reporter.setContext('stage', 'fetch');
  const data = await fetchDataWithTimeout('https://api.example.com/records', {
    timeout: 30000,
  });
  logger.info('data fetched', { recordCount: data.length });

  reporter.setContext('stage', 'process');
  const results = await processRecords(data);
  logger.info('processing complete', {
    processed: results.success,
    failed: results.failed,
  });

  // Check in with Honeybadger to confirm the job ran
  await Honeybadger.checkIn('data-sync-hourly');
  logger.info('sync complete');
}

run().catch((error) => {
  reporter.report(error, { type: 'fatal' });
  process.exit(1);
});

This gives you full visibility into a CLI tool that previously operated in the dark. Crashes trigger immediate alerts with full context. Structured logs provide a timeline for debugging. Check-ins confirm the tool is running on schedule. And when something does go wrong, you know exactly where, why, and how to fix it.

Conclusion

CLI tools in production are invisible by default. They run without dashboards, without uptime monitors, without anyone watching. That invisibility is what makes them dangerous — they fail silently, and silent failures compound.

The monitoring patterns covered here — crash reporters with contextual metadata, error tracking integration, structured logging, and health checks — take a CLI tool from "hope it works" to "know it works." The investment is small: a few hundred lines of infrastructure code that you write once and reuse across every tool you build.

Your web APIs have error tracking. Your frontends have error tracking. Your CLI tools deserve it too.
