<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shlomo Friman</title>
    <description>The latest articles on DEV Community by Shlomo Friman (@shlomofr).</description>
    <link>https://dev.to/shlomofr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3913778%2F4635353d-57f5-4025-be4a-1ac773339572.jpg</url>
      <title>DEV Community: Shlomo Friman</title>
      <link>https://dev.to/shlomofr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shlomofr"/>
    <language>en</language>
    <item>
      <title>Your AI Search Is Only as Smart as What Your Codebase Forgot to Document</title>
      <dc:creator>Shlomo Friman</dc:creator>
      <pubDate>Fri, 08 May 2026 11:08:13 +0000</pubDate>
      <link>https://dev.to/shlomofr/your-ai-search-is-only-as-smart-as-what-your-codebase-forgot-to-document-584g</link>
      <guid>https://dev.to/shlomofr/your-ai-search-is-only-as-smart-as-what-your-codebase-forgot-to-document-584g</guid>
      <description>&lt;p&gt;Everyone deploying enterprise AI search is running into the same wall, and blaming the wrong thing.&lt;/p&gt;

&lt;p&gt;The model isn't the problem. The retrieval pipeline isn't the problem. The embedding strategy isn't the problem. The problem is what you're asking the AI to search through, and how much of the knowledge that actually runs your organization was never written down anywhere the AI can find it.&lt;/p&gt;

&lt;p&gt;It lives in the code.&lt;/p&gt;

&lt;h2&gt;The Knowledge That Never Made It Into a Document&lt;/h2&gt;

&lt;p&gt;When you deploy a RAG-based search layer over your enterprise systems, the standard assumption is that your knowledge lives somewhere retrievable: wikis, runbooks, Confluence pages, ticket histories, README files. The AI retrieves the relevant chunks, grounds its answers in them, and gives you something useful.&lt;/p&gt;

&lt;p&gt;That assumption holds for maybe 30% of the knowledge that actually matters in an enterprise system.&lt;/p&gt;

&lt;p&gt;The rest is implicit. It's baked into the logic of applications that have been running for a decade or more. It's in the field names that made sense in 2003 when the original team named them. It's in the hardcoded values that represent business rules nobody wrote a ticket for, because at the time there was no need: everyone knew. It's in the conditional branches that encode compliance requirements from a regulatory environment that has since changed, where the code was updated but the documentation wasn't.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gosearch.ai/blog/why-ai-agents-underperform-rag-retrieval-gartner/" rel="noopener noreferrer"&gt;Gartner's 2025 Market Guide&lt;/a&gt; for Enterprise AI Search identified the knowledge layer, not the generation model, as the primary bottleneck in failing deployments. The most common failure mode isn't a bad LLM. It's a retrieval layer that can't surface what's needed because what's needed was never put anywhere retrievable.&lt;/p&gt;

&lt;p&gt;That's not a retrieval problem. It's a documentation problem that predates AI by twenty years.&lt;/p&gt;

&lt;h2&gt;What "Undocumented" Actually Means in a Codebase&lt;/h2&gt;

&lt;p&gt;When developers say a system is undocumented, they usually mean there are no comments, no architecture docs, no wikis. That's part of it.&lt;/p&gt;

&lt;p&gt;But the deeper problem is subtler. Even well-maintained systems have a class of knowledge that is structurally impossible to capture in conventional documentation: the knowledge that was so obvious at the time that nobody thought to write it down.&lt;/p&gt;

&lt;p&gt;Consider a few examples.&lt;/p&gt;

&lt;p&gt;A field called &lt;code&gt;ACCT-TYPE&lt;/code&gt; in a COBOL program has twelve possible values. Seven of them are used in active logic paths. The other five exist in the data but are never referenced by the application, because the business processes they represented were retired in 2009. An AI search tool has no way of knowing that. It sees twelve values. It doesn't know that five of them are artifacts of a world that no longer exists.&lt;/p&gt;
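&lt;p&gt;As a rough illustration of how that gap can be surfaced mechanically (the field name, value codes, and source lines below are invented for the example), one archaeology pass is to diff a field's declared value set against the values the active logic actually tests for:&lt;/p&gt;

```python
import re

# Illustrative sketch (field name and codes invented): compare a field's
# declared value set against the values active logic actually branches on.
DECLARED_ACCT_TYPES = {"01", "02", "03", "04", "05", "06",
                       "07", "08", "09", "10", "11", "12"}

def referenced_codes(source_text, field_name="ACCT-TYPE"):
    """Collect the value literals the source compares the field against."""
    # Simplification: only catches the pattern  FIELD = 'nn'
    pattern = re.compile(field_name + r"\s*=\s*'(\d+)'")
    return set(pattern.findall(source_text))

source = """
    IF ACCT-TYPE = '01' PERFORM OPEN-ACCOUNT.
    IF ACCT-TYPE = '03' PERFORM CLOSE-ACCOUNT.
"""
live = referenced_codes(source)
dead = DECLARED_ACCT_TYPES - live
print(sorted(dead))  # declared codes never touched by active logic
```

&lt;p&gt;The leftover set is not an answer, it is a question list: each surviving code needs a human to decide whether it is a retired artifact or a rule that still matters.&lt;/p&gt;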

&lt;p&gt;A batch job runs nightly. Its processing sequence matters: module B must run after module A because module A writes an intermediate file that module B reads. That dependency is not in any document. It's in a JCL script that nobody has looked at in six years. If someone asks the AI "what does this system do overnight," the answer will be technically incomplete in ways that matter.&lt;/p&gt;
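&lt;p&gt;That kind of implicit ordering can be recovered from the job definitions themselves. A minimal sketch, with invented step and dataset names, infers the constraint from which datasets each step writes and reads:&lt;/p&gt;

```python
# Illustrative sketch: step and dataset names are invented. The ordering
# constraint lives only in which datasets each batch step writes and reads.
steps = {
    "MODULE-A": {"writes": {"WORK.INTERMED.FILE"}, "reads": set()},
    "MODULE-B": {"writes": set(), "reads": {"WORK.INTERMED.FILE"}},
}

def implied_order(steps):
    """Return (earlier, later) pairs where 'later' reads what 'earlier' writes."""
    edges = []
    for a, io_a in steps.items():
        for b, io_b in steps.items():
            if a != b and io_a["writes"].intersection(io_b["reads"]):
                edges.append((a, b))
    return edges

print(implied_order(steps))  # the dependency no document records
```

&lt;p&gt;In a real environment the &lt;code&gt;writes&lt;/code&gt; and &lt;code&gt;reads&lt;/code&gt; sets would come from parsing the JCL, but the principle is the same: the dependency graph exists, it is just encoded in dataset names rather than prose.&lt;/p&gt;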

&lt;p&gt;A customer record field called &lt;code&gt;STATUS&lt;/code&gt; has a value of &lt;code&gt;7&lt;/code&gt;. What does &lt;code&gt;7&lt;/code&gt; mean? The answer is probably in the code that processes it. It might be in a comment. It might be in neither. It might be in the memory of someone who retired in 2017. The AI can retrieve the word "STATUS" from a dozen documents. It cannot tell you what &lt;code&gt;7&lt;/code&gt; means unless something, somewhere, says so.&lt;/p&gt;

&lt;p&gt;This is what &lt;a href="https://www.in-com.com/blog/how-cross-system-data-alignment-improves-data-consistency-across-enterprise-platforms/" rel="noopener noreferrer"&gt;cross-system data alignment&lt;/a&gt; comes down to at its most fundamental level: different parts of the organization operating with divergent interpretations of the same data, because the shared context that would unify those interpretations was never codified.&lt;/p&gt;

&lt;h2&gt;Why This Breaks AI Search Specifically&lt;/h2&gt;

&lt;p&gt;Traditional keyword search fails at this problem gracefully. It doesn't know what it doesn't know, and it doesn't pretend to. You search for "STATUS field values," you get whatever documents mention those words, and you accept that you might need to dig further.&lt;/p&gt;

&lt;p&gt;AI search fails at this problem ungracefully. It generates confident, fluent answers from whatever it retrieves. If what it retrieves is incomplete, the answer is incomplete in a way that sounds complete. The system will tell you what &lt;code&gt;STATUS 7&lt;/code&gt; means if there is anything anywhere that describes it. If there isn't, it may interpolate from context, from similar patterns in other documents, from general knowledge about enterprise systems. The answer will sound plausible. It may be wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@tobias_pfuetze/the-rag-renaissance-that-never-happened-b589199567c6" rel="noopener noreferrer"&gt;Industry surveys from 2025&lt;/a&gt; found that 72% of enterprise RAG implementations either failed outright or significantly underperformed in their first year. The most cited root cause was not model quality. It was data quality and retrieval relevance, which is a polite way of saying: the knowledge needed to answer the questions wasn't in the knowledge base.&lt;/p&gt;

&lt;p&gt;What nobody says clearly enough is that for enterprise systems with history, a large share of that knowledge was never in any document. It was in the people who built the system, and when they left, it went into the code. The code still runs. The explanation of why it works the way it does is gone.&lt;/p&gt;

&lt;h2&gt;The Archaeology Step Nobody Is Doing&lt;/h2&gt;

&lt;p&gt;There is a step that should come before any enterprise AI search deployment. Most organizations skip it entirely, because it is slow, unglamorous, and has no vendor selling it.&lt;/p&gt;

&lt;p&gt;That step is codebase archaeology: systematically reconstructing the implicit knowledge embedded in the applications themselves before building a retrieval layer on top of them.&lt;/p&gt;

&lt;p&gt;What does that look like in practice?&lt;/p&gt;

&lt;p&gt;It means tracing every field from definition through every place it is read, written, and transformed, building a map of what the data actually means in context. It means inventorying every hardcoded value and asking: is this a constant, or is this a business rule that was never parameterized? It means identifying every conditional branch where a missing &lt;code&gt;else&lt;/code&gt; clause represents an implicit assumption. It means mapping the dependencies between modules that exist in execution order but not in documentation.&lt;/p&gt;
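&lt;p&gt;To make one of those passes concrete, here is a deliberately small sketch of the hardcoded-value inventory (the sample lines are invented): it does nothing clever, it just produces the review list a human needs in order to ask, for each literal, "constant or unparameterized business rule?"&lt;/p&gt;

```python
import re

# Rough sketch of one archaeology pass: inventory hardcoded numeric
# literals with their context so each one can be reviewed. Sample lines
# are invented for illustration.
def inventory_literals(lines):
    """Return (line_no, literal, context) for each hardcoded number found."""
    hits = []
    for no, line in enumerate(lines, 1):
        for lit in re.findall(r"\b\d+\b", line):
            hits.append((no, lit, line.strip()))
    return hits

sample = [
    "IF DAYS-OVERDUE GREATER THAN 90",
    "MOVE 7 TO WS-STATUS",
]
for no, lit, ctx in inventory_literals(sample):
    print(no, lit, ctx)
```

&lt;p&gt;The value of the pass is not the script; it is the forced review of every literal the script surfaces.&lt;/p&gt;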

&lt;p&gt;None of this is AI work. It is reading work. It requires treating the codebase as a primary source, the way a historian treats a primary source: not as executable instructions, but as evidence of decisions made by people who had context you no longer have.&lt;/p&gt;

&lt;p&gt;The output of that work is something that can actually go into a knowledge base. Field definitions with business context. Value code glossaries. Dependency maps. Process flows with the implicit steps made explicit. Annotated logic explanations for the branches that would otherwise be opaque.&lt;/p&gt;
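&lt;p&gt;One possible shape for such an artifact (the structure, field names, and placeholder values here are hypothetical, not a standard): a glossary entry that records what is known, where it was established, and, crucially, what remains unresolved rather than guessed at:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Hypothetical shape for one knowledge-base artifact: a field glossary
# entry with business context. All values below are placeholders, and the
# unresolved code is flagged as unknown rather than guessed.
@dataclass
class FieldGlossaryEntry:
    name: str
    business_meaning: str
    value_codes: dict                                 # code -&gt; documented meaning
    source_refs: list = field(default_factory=list)   # where the code establishes it

entry = FieldGlossaryEntry(
    name="STATUS",
    business_meaning="Customer account lifecycle state",
    value_codes={
        "1": "active (placeholder meaning for the example)",
        "7": "UNRESOLVED - no document or comment found; needs a code trace",
    },
    source_refs=["CUSTMNT.cbl"],
)
print(entry.name, sorted(entry.value_codes))
```

&lt;p&gt;An entry that honestly says "UNRESOLVED" is more useful to a retrieval layer than silence, because it tells the AI, and the human reading its answer, where the ground truth ends.&lt;/p&gt;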

&lt;p&gt;That is the knowledge layer an enterprise AI search needs to work. And in most organizations that have been running on the same core systems for ten or twenty years, most of it does not exist yet.&lt;/p&gt;

&lt;h2&gt;The Compounding Problem of System Fragmentation&lt;/h2&gt;

&lt;p&gt;The challenge gets harder as enterprise systems grow more fragmented. Most organizations operate not on a single legacy platform but on a constellation of systems that evolved independently and were later integrated through interfaces, APIs, and batch transfers that nobody fully documented either.&lt;/p&gt;

&lt;p&gt;Each of those systems has its own implicit knowledge layer. Each integration point between them represents an additional layer of context that may never have been written down: why this field maps to that field, why the transformation applies in this direction but not the other, why the timing of the transfer matters.&lt;/p&gt;
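&lt;p&gt;Capturing that integration context does not require exotic tooling; it requires a record that has slots for the questions. A sketch, with every system, field, and rule name invented for illustration:&lt;/p&gt;

```python
# Illustrative: an integration mapping record that captures not just the
# field-to-field mapping but the direction, transform, timing, and
# rationale that usually live only in the transfer code. All names and
# values are hypothetical.
mapping = {
    "source": {"system": "CRM", "field": "CUST_STATUS"},
    "target": {"system": "BILLING", "field": "ACCT-STATE"},
    "direction": "CRM to BILLING only",
    "transform": "status codes 1-3 collapse to 'ACTIVE'",
    "timing": "must run after the nightly CRM export completes",
    "rationale": "billing never writes back; CRM is the system of record",
}
print(sorted(mapping))
```

&lt;p&gt;The point of the record is the &lt;code&gt;rationale&lt;/code&gt; and &lt;code&gt;timing&lt;/code&gt; slots: they force the "why" and "when" out of the transfer code and into something retrievable.&lt;/p&gt;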

&lt;p&gt;&lt;a href="https://www.ibm.com/new/announcements/new-enterprise-data-tools-for-enterprise-ai" rel="noopener noreferrer"&gt;IBM has estimated&lt;/a&gt; that 90% of data generated by enterprises is unstructured, but even that framing understates the problem for organizations with long-running legacy systems. It is not just that the data is unstructured. It is that significant portions of it are context-free: values and records whose meaning is only recoverable by reading the code that processes them.&lt;/p&gt;

&lt;p&gt;An AI search layer deployed over that environment is working with a fractured, context-stripped knowledge base. It can retrieve. It cannot understand, because the understanding was never externalized.&lt;/p&gt;

&lt;h2&gt;What to Do Before You Deploy&lt;/h2&gt;

&lt;p&gt;None of this is an argument against enterprise AI search. The capability is real, and the demand is legitimate. Organizations need better ways to navigate the accumulated knowledge in their systems, and AI-assisted search can provide that.&lt;/p&gt;

&lt;p&gt;The argument is about sequencing.&lt;/p&gt;

&lt;p&gt;Before the retrieval layer, there needs to be a documentation layer. Before the documentation layer, there needs to be an extraction layer: a systematic effort to pull the implicit knowledge out of the codebase and make it explicit. That work requires tools that can parse code structure, &lt;a href="https://www.in-com.com/blog/research-execution-dependency-structure-tracing-data-flow-and-execution-paths/" rel="noopener noreferrer"&gt;trace execution paths&lt;/a&gt;, and surface the dependencies and logic patterns that human readers would miss in a manual pass.&lt;/p&gt;

&lt;p&gt;The organizations that will get durable value from enterprise AI search are the ones that treat this as a data preparation problem first. Not a model selection problem. Not a pipeline architecture problem. A knowledge extraction problem that starts with the oldest, least-documented, most business-critical systems in their portfolio.&lt;/p&gt;

&lt;p&gt;The AI can search what is there. The work is making sure what needs to be there, is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>legacy</category>
      <category>data</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
