Shlomo Friman

Nobody Replaced the System. They Just Kept Patching the Callbacks.

There is a particular kind of enterprise codebase that every senior engineer has encountered at least once. It runs. It has been running for a long time. Nobody knows exactly why it works the way it does, but everyone knows that touching it is dangerous. The original authors are gone. The documentation, if it ever existed, is wrong. The tests cover about forty percent of what the code actually does, and the forty percent they do cover are the easy paths.

And somewhere in the middle of all of it, nested six levels deep inside a function that calls another function that calls another function, there is a callback. And next to it, another callback. And inside that one, three more.

This is not a JavaScript problem. This is what a series of deferred architectural decisions looks like after ten years of production pressure.

The Callback as Artifact

When I talk about callbacks in legacy enterprise systems, I am not just talking about JavaScript. The pattern is universal. COBOL programs have their own version of it: chains of paragraph calls where the control flow is implicit, where you have to read the entire program to understand the order of execution, where a change in one place can silently break something three modules away.

The JavaScript callback pyramid is just the version that modern engineers are most likely to encounter. It looks like this:

getUser(userId, function(err, user) {
  if (err) return handleError(err);

  getPermissions(user.id, function(err, permissions) {
    if (err) return handleError(err);

    getAuditLog(user.id, function(err, auditLog) {
      if (err) return handleError(err);

      generateReport(user, permissions, auditLog, function(err, report) {
        if (err) return handleError(err);

        saveReport(report, function(err, result) {
          if (err) return handleError(err);
          res.json(result);
        });
      });
    });
  });
});

This is not bad engineering. When it was written, this was the correct approach. Node.js had no native Promise support. The team was under deadline pressure. They got it working. They shipped it.

Then they got another deadline. And another. And they added more callbacks around it. And more around those.

What you are looking at now is not a callback pyramid. It is a stratigraphic record of every business decision, every urgent fix, and every missed refactoring window that the organization experienced over the past decade.

What Lives Inside the Callbacks

The technical problem with deeply nested callbacks is well understood: they are hard to read, hard to test, and hard to modify without introducing regressions. But the deeper problem is what is encoded inside them.

Consider a real pattern from production systems I have worked with. Inside a callback chain that handles customer record processing, there is a conditional branch that looks roughly like this:

processRecord(record, function(err, result) {
  if (err) return handleError(err);

  if (result.statusCode === 7) {
    // legacy pathway — do not remove
    applyLegacyTransform(result, function(err, transformed) {
      if (err) return handleError(err);
      finalizeRecord(transformed, callback);
    });
  } else {
    finalizeRecord(result, callback);
  }
});

What is statusCode === 7? The comment says "legacy pathway." That is the entire documentation. The person who wrote that comment knew what it meant. They are no longer at the company. The code runs every night in a batch job that processes several hundred thousand records. Nobody has touched it in four years because every time someone tries, they discover they cannot fully explain what it does.

That conditional is not just a code smell. It is a business rule that was never written down anywhere except inside a callback, inside a callback, inside a callback.

This is what IN-COM's work on migrating legacy asynchronous code identifies as the core challenge: callback-based architectures hide dependencies between modules and external APIs, and a small change in one part of the flow can ripple through unrelated processes in ways that are genuinely difficult to predict. The syntax problem is solvable. The knowledge problem is harder.

The Compliance Dimension

Here is where this gets expensive.

Most of the callback-heavy systems I have described are also the systems that organizations are now trying to bring into compliance with frameworks like NIS2, DORA, ISO 27001, and a growing number of sector-specific regulations. And those frameworks have specific technical requirements that callback-era architectures were never designed to meet.

Take incident detection. Most modern regulatory frameworks put a short clock on incident handling. NIS2, for example, requires an early warning within 24 hours of becoming aware of a significant incident. You cannot meet a deadline like that without structured logging. It requires that your system produces events that a SIEM platform can ingest and correlate.

A Node.js application from 2012 with six levels of callback nesting typically does not produce structured logs. It produces text. Sometimes it produces nothing for entire execution paths, because the error handling looks like this:

doSomething(data, function(err, result) {
  if (err) {
    // TODO: handle this properly
    console.log('error:', err);
    return;
  }
  // continues...
});

The error is logged to stdout. The audit trail is gone. The incident happened, and nobody will know for weeks.
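For contrast, here is a minimal sketch of what the same error path could look like if it emitted a structured event and propagated the failure. The logEvent helper, the field names, and the doSomething stand-in are assumptions made for illustration. They are not taken from any particular logging library or SIEM schema.

```javascript
// Hypothetical helper: emits one JSON object per line so a log shipper
// or SIEM can parse fields instead of scraping free text.
function logEvent(level, fields, write = (line) => process.stdout.write(line + '\n')) {
  const event = {
    timestamp: new Date().toISOString(),
    level,
    ...fields,
  };
  write(JSON.stringify(event));
  return event;
}

// Stand-in for the legacy operation; fails on request, for illustration only.
function doSomething(data, callback) {
  setImmediate(() => {
    if (data.fail) return callback(new Error('upstream timeout'));
    callback(null, { ok: true });
  });
}

function runStep(data, callback) {
  doSomething(data, function (err, result) {
    if (err) {
      logEvent('error', {
        operation: 'doSomething',
        message: err.message,
        stack: err.stack,
      });
      // Propagate instead of returning silently, so the caller can react.
      return callback(err);
    }
    callback(null, result);
  });
}
```

The point is not the helper itself. It is that the failure now leaves a parseable record and reaches the caller, instead of ending at a bare return.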

Marka's analysis of legacy compliance puts it directly: the organizations that handle this well start with a gap analysis, and the ones that handle it poorly start with a replacement budget. Full replacement is not always the fastest path, and for organizations under active compliance pressure with a live audit timeline, it is almost never the first move.

This matters because it reframes the callback problem. You are not just carrying technical debt. You are carrying compliance debt. And unlike technical debt, compliance debt has a hard deadline attached to it.

Why "Just Migrate to Async/Await" Is Not a Migration Plan

When engineers see a deeply nested callback structure, the instinct is to reach for async/await. It is the right instinct. But the instinct to reach for it immediately, without preparation, is what causes production incidents.

The surface-level transformation looks clean:

// Before
getUser(userId, function(err, user) {
  if (err) return handleError(err);
  getPermissions(user.id, function(err, permissions) {
    if (err) return handleError(err);
    // ...
  });
});

// After
try {
  const user = await getUser(userId);
  const permissions = await getPermissions(user.id);
  // ...
} catch (err) {
  handleError(err);
}

The async/await version is genuinely better. It is easier to read, easier to test, and the error handling is consolidated. But the migration is not finished when the syntax changes. The migration is finished when you understand what you changed.

Here are three things that break in async/await migrations that were invisible inside callback chains:

1. Implicit execution order

Callbacks encode execution order through nesting. The inner callback runs after the outer one resolves. When you flatten that structure into sequential awaits, the order is preserved. But when you parallelize for performance, as is tempting once you have async/await available, you need to verify that the operations are actually independent.

// These look independent. They may not be.
const [user, permissions, auditLog] = await Promise.all([
  getUser(userId),
  getPermissions(userId),
  getAuditLog(userId)
]);

If getPermissions writes a session record that getAuditLog reads, this will produce inconsistent results under load. The nested callback version always ran these calls one after another, so nobody ever discovered the dependency.
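The hazard is easy to reproduce in a small simulation. The sketch below uses an in-memory Map in place of the session table and invented timings. The function bodies are hypothetical stand-ins, not the real implementations.

```javascript
// Simulated shared state; in a real system this might be a session table.
const sessionStore = new Map();

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical: writes a session record as a side effect.
async function getPermissions(userId) {
  await delay(20);
  sessionStore.set(userId, { checkedAt: Date.now() });
  return ['read'];
}

// Hypothetical: silently depends on the record written above.
async function getAuditLog(userId) {
  await delay(5);
  const session = sessionStore.get(userId);
  return { userId, hasSession: session !== undefined };
}

// Mirrors the nested-callback order: permissions first, then audit log.
async function sequential(userId) {
  sessionStore.clear();
  await getPermissions(userId);
  return getAuditLog(userId);
}

// Starts both at once; the audit log read can win the race.
async function parallel(userId) {
  sessionStore.clear();
  const [, auditLog] = await Promise.all([
    getPermissions(userId),
    getAuditLog(userId),
  ]);
  return auditLog;
}
```

With these timings, sequential sees the session record and parallel does not. In production the timings are not fixed, which is why the failure shows up only under load.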

2. Error swallowing

Callback chains often suppress errors at intermediate steps because the original developer handled the error locally and returned early. When you convert to async/await and add a single top-level try/catch, you lose the granularity. Errors that were handled specifically before now get handled generically.

// Original callback behavior: each error goes to a specific handler
// After migration: every error hits the same catch block
try {
  const user = await getUser(userId);
  const report = await generateReport(user);
  await saveReport(report);
} catch (err) {
  // Is this a user-not-found error? A report generation error?
  // A database write error? You no longer know without extra work.
  handleError(err);
}
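
One way to keep the granularity is to tag each awaited step with its name before the error reaches the shared catch block. This is a sketch under assumptions: StepError and step are hypothetical names, and the three operations are injected so the example stays self-contained.

```javascript
// Hypothetical error type that records which step of the flow failed.
class StepError extends Error {
  constructor(step, cause) {
    super(`${step} failed: ${cause.message}`);
    this.name = 'StepError';
    this.step = step;
    this.cause = cause;
  }
}

// Runs one awaited step and tags any failure with the step name.
async function step(name, fn) {
  try {
    return await fn();
  } catch (err) {
    throw new StepError(name, err);
  }
}

// getUser, generateReport and saveReport are passed in so the sketch
// stays self-contained; they are assumed to return Promises.
async function buildReport(userId, { getUser, generateReport, saveReport }) {
  const user = await step('getUser', () => getUser(userId));
  const report = await step('generateReport', () => generateReport(user));
  return step('saveReport', () => saveReport(report));
}
```

A single top-level catch can then branch on err.step, which restores the per-step handling the callback version had without bringing back the nesting.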

3. Unhandled Promise rejections in hybrid environments

During a migration, you will have code that uses callbacks, code that uses Promises, and code that uses async/await, sometimes calling each other. This creates a category of failure that is particularly hard to trace: a Promise rejection inside a callback-based caller has no handler unless you attach one explicitly. Older Node.js versions only print a warning for such unhandled rejections. From Node.js 15 onward, the default is to terminate the process.

// Dangerous hybrid pattern
legacyCallbackFunction(data, function(err, result) {
  if (err) return callback(err);

  // This Promise rejection is not caught by anyone
  asyncModernFunction(result)
    .then(output => callback(null, output));
    // Missing: .catch(callback)
});

This is not hypothetical. It is one of the most common sources of silent failures during async migrations, and it is the kind of bug that only surfaces under specific timing conditions in production.
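Two hedged ways to close that gap are sketched below. asyncModernFunction here is a stand-in written for the example. util.callbackify is part of Node.js core.

```javascript
const { callbackify } = require('util');

// Stand-in for a modern Promise-returning function (hypothetical).
async function asyncModernFunction(input) {
  if (input === 'bad') throw new Error('validation failed');
  return input.toUpperCase();
}

// Option 1: pass the rejection handler as the second argument to then().
// Unlike .then(...).catch(callback), this cannot invoke callback a second
// time if callback itself throws on the success path.
function bridgeWithThen(input, callback) {
  asyncModernFunction(input).then(
    (output) => callback(null, output),
    (err) => callback(err)
  );
}

// Option 2: let Node.js build the error-first adapter.
const bridgeWithCallbackify = callbackify(asyncModernFunction);
```

Neither option protects you from a callback that throws. With Option 1 that throw becomes an unhandled rejection, and with callbackify it surfaces as an uncaught exception. Both are at least visible, which the original pattern was not.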

The Archaeology Step

Before you migrate a single callback, you need to read the codebase the way a historian reads a primary source. Not as instructions to be executed, but as evidence of decisions made by people who had context you no longer have.

That means building three things:

An execution map. Trace every callback chain from initiation to completion. Document the order of operations explicitly. Note every place where the behavior changes based on data state. The statusCode === 7 example from earlier needs an answer before the migration begins, not after.

A dependency inventory. For every function in the chain, document what it reads and what it writes. Shared state, database tables, external API calls, file system operations. Anything that could create a dependency between operations that look parallel but are not.

An error propagation diagram. For every error handling path in the current code, document where the error goes and what happens to the calling context. This is where you will find the swallowed errors, the console.log tombstones, and the // TODO: handle this comments that have been there since 2014.
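The dependency inventory does not need tooling to be useful. A minimal sketch, assuming you record reads and writes by hand, is enough to flag pairs that must not run in parallel. The resource labels below are illustrative, not real table names.

```javascript
// Hypothetical inventory built by hand during the archaeology step.
// Resource names are illustrative labels, not real table names.
const inventory = {
  getUser:        { reads: ['users'],             writes: [] },
  getPermissions: { reads: ['users', 'roles'],    writes: ['sessions'] },
  getAuditLog:    { reads: ['sessions', 'audit'], writes: [] },
};

// Two operations conflict if either one writes something the other
// reads or writes. Conflicting operations must not be parallelized.
function conflicts(a, b) {
  const touches = (x) => new Set([...x.reads, ...x.writes]);
  const hits = (writer, other) =>
    writer.writes.filter((r) => touches(other).has(r));
  return [...new Set([...hits(a, b), ...hits(b, a)])];
}

// Checks every pair in the inventory and reports the shared resources.
function findConflicts(inv) {
  const names = Object.keys(inv);
  const result = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      const shared = conflicts(inv[names[i]], inv[names[j]]);
      if (shared.length > 0) {
        result.push({ pair: [names[i], names[j]], shared });
      }
    }
  }
  return result;
}
```

For this inventory, findConflicts reports exactly one pair: getPermissions and getAuditLog, sharing sessions. The check is only as good as the inventory, so the manual tracing is still the real work.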

This work is unglamorous. It does not ship features. It does not close compliance gaps directly. But it is the difference between a migration that succeeds and a migration that produces a production incident at 2am on a Tuesday.

Here is what the readiness checklist looks like in practice:

ASYNC MIGRATION READINESS CHECKLIST

[ ] Callback nesting depth mapped for all critical paths
[ ] Shared mutable state identified and documented
[ ] External API calls catalogued with known timing dependencies
[ ] Error handling paths traced to final disposition
[ ] Business rules in conditional branches documented
[ ] statusCode/flag values resolved to business meaning
[ ] Node.js version confirmed as Promise/async-compatible (async/await needs 7.6+, util.promisify needs 8.0+)
[ ] Third-party libraries audited for callback-only interfaces
[ ] Hybrid coexistence wrapper pattern designed
[ ] Rollback plan defined for each migration phase
[ ] Performance baseline established for post-migration comparison
[ ] Compliance logging requirements mapped to new error handling structure

The Migration Pattern That Actually Works

The safest approach to async migration in a production system follows three phases, and the first phase does not touch the async code at all.

Phase 1: Instrument before you refactor

Before changing any callback logic, add structured logging around the existing chains. Every entry point, every exit, every error. This gives you two things: a behavioral baseline to compare against after migration, and the structured event data your compliance framework needs right now.

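// Wrapper that adds structured logging without changing behavior.
// `logger` is assumed to be a structured logger already in scope.
function instrumentedCallback(name, fn) {
  return function(...args) {
    const callback = args[args.length - 1];
    const start = Date.now();

    // Forward every callback argument, not just the first result,
    // so multi-value callbacks behave exactly as before.
    const wrappedCallback = function(err, ...results) {
      const duration = Date.now() - start;

      if (err) {
        logger.error({
          operation: name,
          error: err.message,
          duration,
          timestamp: new Date().toISOString()
        });
      } else {
        logger.info({
          operation: name,
          duration,
          timestamp: new Date().toISOString()
        });
      }

      callback.call(this, err, ...results);
    };

    // Preserve `this` and the return value so wrapped methods keep working.
    return fn.call(this, ...args.slice(0, -1), wrappedCallback);
  };
}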

// Usage: existing code unchanged, observability added
const getUser = instrumentedCallback('getUser', originalGetUser);

This is not a migration. It is preparation for a migration. But it starts closing the structured logging gap immediately, which means you have begun to address the most urgent audit requirement without touching the business logic.

Phase 2: Promisify at the boundary, not at the core

The next step is to wrap callback-based functions in Promise interfaces without rewriting their internals. Node.js provides util.promisify for this. For custom patterns, you write the wrapper yourself.

const { promisify } = require('util');

// Standard Node.js callback pattern: promisify directly
const getUserAsync = promisify(getUser);
const getPermissionsAsync = promisify(getPermissions);

// Custom callback pattern: write the wrapper explicitly
function getAuditLogAsync(userId) {
  return new Promise((resolve, reject) => {
    getAuditLog(userId, function(err, result) {
      if (err) reject(err);
      else resolve(result);
    });
  });
}

This creates a clean interface boundary. The internals of each function are unchanged. The callers can now use async/await. You can migrate the call sites one at a time, validating each one against the behavioral baseline you established in Phase 1.
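Validating a call site against the baseline can start very small. The sketch below runs the legacy and promisified paths with the same input and compares the outcomes. The getUser stand-in and the observe helpers are hypothetical. util.promisify and util.isDeepStrictEqual are Node.js core.

```javascript
const { promisify, isDeepStrictEqual } = require('util');

// Stand-in for a legacy error-first function (hypothetical).
function getUser(userId, callback) {
  setImmediate(() => {
    if (!userId) return callback(new Error('missing userId'));
    callback(null, { id: userId, name: 'example' });
  });
}

const getUserAsync = promisify(getUser);

// Captures the outcome of a callback-style call in a comparable shape.
function observeCallback(fn, ...args) {
  return new Promise((resolve) => {
    fn(...args, (err, value) =>
      resolve(err ? { error: err.message } : { value }));
  });
}

// Captures the outcome of a Promise-style call in the same shape.
async function observePromise(fn, ...args) {
  try {
    return { value: await fn(...args) };
  } catch (err) {
    return { error: err.message };
  }
}

// Returns true when the legacy and promisified paths agree for the input.
async function sameBehavior(input) {
  const [legacy, migrated] = await Promise.all([
    observeCallback(getUser, input),
    observePromise(getUserAsync, input),
  ]);
  return isDeepStrictEqual(legacy, migrated);
}
```

For a direct promisify wrapper the comparison is nearly trivial. It earns its keep later, when the call site behind the Promise interface is rewritten and the same check has to keep passing.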

Phase 3: Migrate call sites incrementally with feature isolation

Now you replace the callback chains at the call sites, starting with the lowest-risk paths and working toward the business-critical ones.

// Before: five-level callback chain
function processCustomerReport(userId, res) {
  getUser(userId, function(err, user) {
    if (err) return res.status(500).json({ error: err.message });

    getPermissions(user.id, function(err, permissions) {
      if (err) return res.status(500).json({ error: err.message });

      getAuditLog(user.id, function(err, auditLog) {
        if (err) return res.status(500).json({ error: err.message });

        generateReport(user, permissions, auditLog, function(err, report) {
          if (err) return res.status(500).json({ error: err.message });

          saveReport(report, function(err, result) {
            if (err) return res.status(500).json({ error: err.message });
            res.json(result);
          });
        });
      });
    });
  });
}

// After: async/await with structured error handling
async function processCustomerReport(userId, res) {
  try {
    const user = await getUserAsync(userId);
    const permissions = await getPermissionsAsync(user.id);

    // Sequential awaits here because getAuditLog reads
    // a session record that getPermissions writes.
    // This dependency was documented in the archaeology phase.
    const auditLog = await getAuditLogAsync(user.id);

    const report = await generateReportAsync(user, permissions, auditLog);
    const result = await saveReportAsync(report);

    res.json(result);
  } catch (err) {
    logger.error({
      operation: 'processCustomerReport',
      userId,
      error: err.message,
      stack: err.stack,
      timestamp: new Date().toISOString()
    });
    res.status(500).json({ error: 'Report generation failed' });
  }
}

Note the comment about getAuditLog. That comment is there because the archaeology phase found the dependency. Without that phase, the migration would have parallelized those calls with Promise.all and introduced a race condition.
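The feature isolation part can be as simple as a flag that chooses between the legacy and migrated handler per call site. This is a sketch under assumptions: the flag name and the environment-variable lookup are hypothetical, and a real system might read flags from a config service instead.

```javascript
// Hypothetical flag source; a real system might use a config service.
function isEnabled(flag, env = process.env) {
  return env[flag] === 'true';
}

// Picks the migrated handler only when its flag is on, so each call site
// can be rolled back by flipping the flag instead of reverting code.
function selectHandler(flag, legacyHandler, migratedHandler, env = process.env) {
  return function handler(...args) {
    const impl = isEnabled(flag, env) ? migratedHandler : legacyHandler;
    return impl.apply(this, args);
  };
}
```

Usage would look something like selectHandler('USE_ASYNC_REPORT', legacyProcessCustomerReport, processCustomerReport), where both handler names are placeholders for the before and after versions shown above.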

The Compensating Control Layer

While the async migration is in progress, you have a compliance clock running. The phased approach described above takes weeks or months on a real system. Auditors do not wait.

This is where compensating controls become the operational bridge. A compensating control addresses the intent of a compliance requirement when the primary technical control cannot yet be implemented directly.

For legacy callback systems, the most common gaps and their compensating controls look like this:

| Compliance Gap | Compensating Control | Implementation |
| --- | --- | --- |
| No structured audit logs | Network-layer log aggregation | Instrument at the proxy/gateway, not in the app |
| No MFA on legacy app | Identity boundary enforcement | Entra ID or similar enforced at network layer |
| No incident detection | Perimeter SIEM ingestion | Microsoft Sentinel, Datadog, or equivalent |
| Unpatched dependencies | Documented vulnerability exception | Isolation + formal risk acceptance record |
| No encryption in transit | TLS termination at load balancer | App unchanged, boundary enforced |

The critical point from Marka's compliance framework is that compensating controls must be technically enforced, not just documented. A policy that says access is restricted is not a compensating control. An access boundary enforced at the network layer that prevents unauthorized connections is. Auditors ask for evidence of enforcement: configuration screenshots, access logs, test results.

This means you can close the compliance gap now, run the async migration on an engineering timeline, and arrive at the end with both problems solved rather than neither. The longer-term architectural work, replacing modules progressively rather than in a single cutover, follows what Martin Fowler described as the Strangler Fig pattern: new components built alongside the legacy system, traffic incrementally shifted, the old core decommissioned in stages.

What the Callback Debt Actually Costs

There is a number that organizations rarely calculate: the total cost of maintaining a callback-heavy codebase while simultaneously running a compliance program, compared to the cost of a structured incremental migration.

The maintenance cost compounds. Every new feature that goes into a callback-heavy system inherits the complexity of the existing chains. Every developer who joins the team spends the first two months just reading code. Every compliance control that cannot be implemented natively requires a compensating workaround at the infrastructure layer, which has its own operational cost.

IBM research has put the proportion of enterprise data that is structurally opaque, meaning values and records whose meaning is only recoverable by reading the code that processes them, at a significant share of total production data volume. That opacity is not just a data quality problem. It is a velocity problem. Every sprint that touches an undocumented callback chain moves slower than it should.

The migration is not free. The incremental approach described in this article takes real engineering time. But that time is bounded, predictable, and produces a system you can explain to an auditor, to a new engineer, and to yourself at 2am when something breaks.

The alternative is not stability. The alternative is the same system in three years, with more callbacks, more undocumented conditionals, and a compliance deadline that is now urgent instead of approaching.

The Practical Starting Point

If you are looking at a callback-heavy codebase right now and trying to figure out where to start, here is the sequence that has worked in practice:

Week 1: Run a static analysis pass across the callback chains. You want nesting depth, shared state references, and error handling coverage. Tools like ESLint with appropriate rules, or a dedicated static analysis platform, can surface the structural shape of the problem without requiring manual reading of every function.

Week 2: Interview the longest-tenured engineer who has touched the system. Not to document everything they know, but to identify the three or four places where the implicit knowledge is most dangerous. The statusCode === 7 equivalents. The "do not remove this" comments. The batch jobs that run in a specific order for reasons nobody has written down.

Week 3: Add instrumented logging around those specific paths. Not everywhere. Just the dangerous ones. Get structured log output flowing to somewhere you can query it.

Week 4: Brief the compliance officer on the gap analysis and the compensating control plan. Show them the logging output as evidence of active monitoring. Present the migration roadmap. The posture shift from "we have a problem" to "we have a plan and we have started" is significant in most audit contexts.

From that point, the migration proceeds in phases, each one validated against the behavioral baseline before the next one begins.

The System Is Still Running

The title of this article is true of almost every large enterprise codebase I have seen. Nobody replaced the system. They kept patching the callbacks. And the system kept running.

That is not a failure. That is a team that has kept something alive under sustained pressure, usually with fewer engineers and less time than the work deserved. The callbacks are not evidence of laziness. They are evidence of survival.

But survival is not the same as sustainability. The compliance requirements arriving now were not anticipated by the engineers who wrote the first callback in 2012. The async/await pattern that would have replaced those callbacks had not been standardized yet. The SIEM platforms that auditors now expect you to feed were not part of the production architecture conversation.

The system that survived to 2026 now has to meet requirements that did not exist when it was designed. That is not anyone's fault. It is just the situation.

The path through it is not a rewrite. It is not a crisis. It is archaeology, instrumentation, incremental migration, and a compliance posture that you can explain to an auditor without having to say "we are working on it" for the fourth year in a row.

Read the code. Map the callbacks. Start with the instrumentation. The rest follows.
