Shlomo Friman

Posted on Jun 10

Before You Refactor a Legacy API to Cloud-Native, Read What It Is Actually Doing

#refactoring #dependencyinversion #microservices #modernization

Most API migration guides start with the architecture diagram. The clean boxes, the arrows, the new microservices humming along in Kubernetes. What they skip is the part where you have to get from what you have now to that diagram without your production system going dark at 3pm on a Wednesday.

This is a guide about that part.

Why the Standard Advice Falls Short

The standard advice for legacy API modernization is some version of: use the Strangler Fig pattern, put a proxy in front of the old system, and incrementally replace endpoints. That advice is correct. It is also incomplete in ways that get teams into trouble.

The proxy layer assumes you understand what your legacy API actually does. Dependency mapping assumes you know which clients are calling which endpoints. Canary deployment assumes you have a health signal reliable enough to route on. And idempotency handling assumes you have control over both sides of the transaction during the transition.

Most legacy API systems fail at least two of those assumptions before the migration begins.

The gaps are not architectural. They are operational. And they are the reason migrations that look correct on paper produce incidents in production.

Start with Dependency Mapping, Not Architecture

Before you draw the new architecture, you need an accurate picture of the old one. Not the documented architecture. The actual one.

Legacy APIs accumulate undocumented consumers over time. Internal services that started calling an endpoint informally. Batch jobs that hit an API on a schedule nobody remembers setting up. Third-party integrations configured years ago by someone who has since left. Mobile clients on old versions that cannot be force-updated.

A proxy layer that you place in front of the legacy API will see all of this traffic. But if you did not know it existed before you placed the proxy, you will not know to handle it correctly when you start routing.

The dependency mapping step is not glamorous, but it is the one that determines whether the migration succeeds. In practice it requires three things:

Access log analysis. Pull at least 90 days of access logs from the legacy API. Group by endpoint, by caller IP or service identifier, by time of day and day of week. You are looking for the calls that only happen on the 15th of the month, or at 2am on Sundays, or once a quarter. These are the ones that will break silently if you miss them.

Contract documentation. For each endpoint, document the request and response shapes actually observed in production, not the shapes in the documentation, if documentation exists at all. Legacy APIs drift. What the docs say an endpoint returns and what it actually returns in production are frequently different, and consumers have adapted to the actual behavior.

Client inventory. Cross-reference the access logs with a list of known consumers. Every caller you cannot identify is a risk. Some of them will be internal services you can track down. Some will be external integrations you will need to test against directly. Some will be clients you genuinely cannot contact. Those need a compatibility layer regardless of how clean the new API design is.
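The access log grouping can be sketched in a few lines. This is a minimal illustration, not a full log pipeline: the entry shape ({ timestamp, path, caller }) and the function name summarizeCallers are assumptions, and a real log would need parsing first. It counts requests per caller and endpoint, and flags pairs that were only active on a handful of distinct days, which is where the monthly and quarterly callers show up.

```javascript
// Hypothetical entry shape: { timestamp: ISO 8601 string, path: string, caller: string }
function summarizeCallers(entries, rareDayThreshold = 3) {
  const groups = new Map();

  for (const { timestamp, path, caller } of entries) {
    const key = JSON.stringify([caller, path]);
    let group = groups.get(key);
    if (!group) {
      group = { caller, path, requests: 0, days: new Set() };
      groups.set(key, group);
    }
    group.requests += 1;
    // Track distinct UTC calendar days on which this caller hit this path
    group.days.add(new Date(timestamp).toISOString().slice(0, 10));
  }

  // Least frequently active pairs first: these are the easiest to miss
  return [...groups.values()]
    .map(({ caller, path, requests, days }) => ({
      caller,
      path,
      requests,
      activeDays: days.size,
      rare: days.size <= rareDayThreshold
    }))
    .sort((a, b) => a.activeDays - b.activeDays);
}
```

Sorting by active days rather than request count is deliberate: a batch job that fires 10,000 requests once a quarter looks busy by volume but is active on a single day.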

The Proxy Layer: What It Does and What It Cannot Do

Once you have an accurate dependency map, the proxy layer becomes genuinely useful. It sits between all clients and both the legacy and new API surfaces, and it gives you control over traffic routing without requiring client-side changes.

A minimal proxy configuration for a Node.js API migration looks like this:

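const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');

const app = express();
const USERS_PATH = '/api/v1/users';

// Create each proxy once at startup rather than on every request.
// pathFilter is an http-proxy-middleware v3 option (assumed version);
// it limits the proxy to this path and forwards the full path upstream.
const legacyProxy = createProxyMiddleware({
  target: process.env.LEGACY_API_URL,
  changeOrigin: true,
  pathFilter: USERS_PATH
});
const newProxy = createProxyMiddleware({
  target: process.env.NEW_API_URL,
  changeOrigin: true,
  pathFilter: USERS_PATH
});

async function main() {
  // Route configuration loaded from a feature flag store
  // so routing changes do not require a deployment.
  // loadRoutingConfig() is assumed to be defined elsewhere.
  let routingConfig = await loadRoutingConfig();

  // Re-read the config periodically so flag changes take effect
  // without a restart; keep the last good config if a refresh fails.
  setInterval(() => {
    loadRoutingConfig()
      .then((config) => { routingConfig = config; })
      .catch((err) => console.error('Routing config refresh failed', err));
  }, 30000);

  app.use((req, res, next) => {
    const target = routingConfig.users.target; // 'legacy' or 'new'
    return target === 'new' ? newProxy(req, res, next) : legacyProxy(req, res, next);
  });

  app.listen(3000);
}

main().catch((err) => {
  console.error('Proxy failed to start', err);
  process.exit(1);
});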

This is the skeleton. The production version needs several things this skeleton does not have.

Request logging at the boundary. Every request that enters the proxy should produce a structured log entry with the endpoint, the caller, the routing decision, the upstream response code, and the latency. This is your primary diagnostic tool during the migration and your audit trail after it.

Timeout and circuit breaker configuration. The proxy is now a dependency in every request path. If the new API service hangs, the proxy should fail fast and fall back to the legacy endpoint, not queue requests until they all time out. The circuit breaker pattern handles this: after a threshold of failures, the proxy stops attempting to route to the new service and routes everything to legacy until the new service recovers.

Header passthrough. Legacy APIs sometimes make routing or authorization decisions based on headers that clients set. The proxy must pass these through transparently, or you will see authentication failures on the new service that do not reproduce in isolated testing.
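The circuit breaker behavior described above can be sketched as a small state holder that the proxy consults before each routing decision. This is an illustration under assumptions: the class name, the threshold and cool-down defaults, and the injectable clock are all hypothetical, and a production proxy would more likely use an established library than hand-rolled state.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock, so the cool-down can be tested
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  // Which backend should the next request go to?
  target() {
    if (this.openedAt === null) return 'new';
    // Once the cool-down has elapsed, let requests probe the new service again.
    // A single failed probe re-opens the circuit because the failure count
    // is still at or above the threshold.
    return this.now() - this.openedAt >= this.resetAfterMs ? 'new' : 'legacy';
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

The proxy would call target() when choosing a backend, then recordSuccess() or recordFailure() based on the upstream response or timeout. Timeouts need to count as failures, otherwise a hanging service never trips the breaker.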

API Versioning During the Transition

One of the harder decisions in a live migration is what to do about API versioning. The legacy API probably has no formal versioning, or has versioning that accumulated organically. The new cloud-native service should have explicit versioning from the start.

The temptation is to version the new API starting from where the legacy API left off. If the legacy system responded at /api/users, the new service responds at /api/v2/users. The proxy handles the rewrite.

This works, but it creates a long-term maintenance burden: you are now supporting a version mapping layer in the proxy indefinitely, and clients that hardcode paths will need updates eventually anyway.

A cleaner approach is to treat the proxy itself as the versioning layer during the transition. The proxy accepts legacy paths and internally rewrites them to the versioned new API surface. Clients do not change. The new API is clean from the start. When the migration is complete and legacy clients have been updated or retired, the rewrite rules are removed.

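// Proxy rewrites legacy path to new versioned API.
// Uses http-proxy-middleware v3 options (pathFilter, on). The proxy is
// mounted at the root: when mounted with app.use('/api/users', ...),
// Express strips the mount path before the proxy sees the request,
// so the '^/api/users' rewrite rule would never match.
app.use(createProxyMiddleware({
  target: process.env.NEW_API_URL,
  changeOrigin: true,
  pathFilter: '/api/users',
  pathRewrite: {
    '^/api/users': '/api/v1/users'
  },
  on: {
    proxyReq: (proxyReq, req) => {
      // Tag the request so the new service can log which path was used
      proxyReq.setHeader('X-Api-Source', 'legacy-proxy');
    }
  }
}));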

The X-Api-Source header matters. It lets you track, in the new service's logs, how much traffic is still arriving via the legacy path versus directly to the versioned endpoint. That metric tells you when the transition is complete.

Canary Deployment: The Mechanics

Canary deployment during an API migration means sending a percentage of production traffic to the new service while the majority continues to the legacy endpoint. You observe the new service under real load before committing to a full cutover.

The percentage split is usually managed through the routing config the proxy reads, so you can adjust it without a deployment. The operational pattern looks like this:

function routeRequest(req, routingConfig) {
  const endpoint = getEndpointKey(req.path);
  const config = routingConfig[endpoint];

  if (!config || !config.canaryEnabled) {
    return 'legacy';
  }

  // Deterministic routing based on user ID means the same user
  // always goes to the same backend during the canary window.
  // This avoids state inconsistencies from mixed routing.
  const userId = req.headers['x-user-id'];
  if (userId) {
    const hash = hashUserId(userId) % 100;
    return hash < config.canaryPercentage ? 'new' : 'legacy';
  }

  // For unauthenticated endpoints, use random routing
  return Math.random() * 100 < config.canaryPercentage ? 'new' : 'legacy';
}

The deterministic routing by user ID is important and frequently skipped in example implementations. If a user's requests are split between the legacy and new backend within the same session, you can get state inconsistencies: a write on the new service that the legacy service does not see, or vice versa. Pinning a user to one backend for the duration of the canary window prevents this class of error.
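The routing function above calls hashUserId without defining it. One possible implementation, offered as a sketch rather than the author's own: hash the ID with SHA-256 from Node's built-in crypto module and read the first four bytes as an unsigned integer. The canaryBackend wrapper is a hypothetical name used only to make the example self-contained.

```javascript
const { createHash } = require('node:crypto');

// Map a user ID to a stable unsigned 32-bit integer.
function hashUserId(userId) {
  const digest = createHash('sha256').update(String(userId)).digest();
  return digest.readUInt32BE(0);
}

// Same bucketing rule as the routing function: buckets 0..99,
// users in buckets below the canary percentage go to the new backend.
function canaryBackend(userId, canaryPercentage) {
  return hashUserId(userId) % 100 < canaryPercentage ? 'new' : 'legacy';
}
```

Because each user's bucket is fixed, raising the percentage only ever moves users from legacy to new. A user who was on the new backend at 10% is still on it at 25%, which keeps the pinning guarantee intact across canary steps.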

The canary percentage progression should be tied to observed error rates, not to time. A common progression is 1%, 5%, 10%, 25%, 50%, 100%, with a hold period at each step until error rates on the new service match or beat the legacy baseline. If error rates spike at any step, the routing config rolls back to the previous percentage, and you investigate before proceeding.
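The progression rule can be written down as a small decision function, evaluated at the end of each hold period. This is a sketch under assumptions: the function name, the tolerance parameter, and the choice to drop to 0% when the very first step fails are illustrative, not prescribed by the pattern.

```javascript
const CANARY_STEPS = [1, 5, 10, 25, 50, 100];

// Decide the next canary percentage from observed error rates.
// Error rates are fractions (0.004 means 0.4% of requests failed).
function nextCanaryPercentage(current, newErrorRate, legacyBaseline, tolerance = 0) {
  const index = CANARY_STEPS.indexOf(current);
  if (index === -1) {
    throw new Error(`Unknown canary step: ${current}`);
  }

  if (newErrorRate > legacyBaseline + tolerance) {
    // Worse than baseline: roll back one step.
    // Assumption: rolling back from the first step disables the canary.
    return index === 0 ? 0 : CANARY_STEPS[index - 1];
  }

  // At or below baseline: advance, staying at 100 once reached.
  return CANARY_STEPS[Math.min(index + 1, CANARY_STEPS.length - 1)];
}
```

For example, nextCanaryPercentage(10, 0.002, 0.004) advances to 25, while nextCanaryPercentage(10, 0.02, 0.004) falls back to 5. Nothing in the function looks at elapsed time, which is the point: the hold period decides when to evaluate, and the error rates decide the outcome.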

Idempotency Across the Boundary

Idempotency is the property that performing the same operation multiple times produces the same result as performing it once. For API migrations, it is critical because the transition period creates conditions where requests can be processed by both the legacy and new service.

This happens in several ways. A request can time out at the proxy before a response is received, even though the upstream service processed it. A client can retry on a network error without knowing whether the original request succeeded. A canary routing decision can change between a client's retry attempts.

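The idempotency key pattern handles this. The client generates a unique key for each logical operation and sends it as a header. Both the legacy and new service store processed keys and return the cached response for duplicate requests. For a retry that lands on the other backend to be recognized as a duplicate, the two services must check the same key store, or the proxy must perform the check on their behalf.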

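const Redis = require('ioredis'); // assumed client; setex() is the ioredis method name
const redis = new Redis(process.env.REDIS_URL);

// Idempotency middleware for the new API service
async function idempotencyMiddleware(req, res, next) {
  const idempotencyKey = req.headers['x-idempotency-key'];

  if (!idempotencyKey) {
    return next(); // No key provided; proceed normally
  }

  const cacheKey = `idempotency:${idempotencyKey}`;

  // Note: two concurrent requests with the same key can both pass this
  // check. A production version would reserve the key atomically first.
  let cached;
  try {
    cached = await redis.get(cacheKey);
  } catch (err) {
    return next(err); // Do not process the request if duplicates cannot be detected
  }

  if (cached) {
    const { statusCode, body } = JSON.parse(cached);
    return res.status(statusCode).json(body);
  }

  // Wrap the response to capture it for caching. The wrapper stays
  // synchronous and returns res, so chained calls keep working.
  const originalJson = res.json.bind(res);
  res.json = (body) => {
    const statusCode = res.statusCode;
    if (statusCode >= 500) {
      return originalJson(body); // Server errors are not cached
    }
    // Store the result first, then send it, even if the store fails
    redis
      .setex(cacheKey, 86400, JSON.stringify({ statusCode, body })) // 24 hour TTL
      .catch((err) => console.error('Failed to store idempotency key', err))
      .finally(() => originalJson(body));
    return res;
  };

  next();
}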

The 24-hour TTL is a practical default. It covers client retry windows without accumulating stale entries indefinitely. The conditional on res.statusCode < 500 means server errors are not cached: if the new service fails to process a request, the client should retry and get a fresh attempt, not a cached failure.

The legacy service needs the same middleware, or the proxy needs to strip idempotency keys before forwarding to it if the legacy system does not support them. A proxy-level cache keyed on the idempotency header can substitute for legacy-side support during the transition.
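A proxy-level cache of this kind might look like the following. It is an in-memory sketch for a single proxy process, with hypothetical names throughout; a real deployment with more than one proxy instance would need a shared store. Unlike a plain lookup, it also reserves the key while the first request is in flight, so a retry that arrives before the first response does not get processed a second time.

```javascript
class IdempotencyCache {
  constructor({ ttlMs = 24 * 60 * 60 * 1000, reservationMs = 30000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;                 // how long a stored response is replayed
    this.reservationMs = reservationMs; // how long an unanswered request blocks retries
    this.now = now;
    this.entries = new Map();
  }

  // Returns { state: 'new' } for the first request with this key,
  // { state: 'in_flight' } while that request is still being processed,
  // and { state: 'done', response } once a response has been stored.
  begin(key) {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now()) {
      return entry.response
        ? { state: 'done', response: entry.response }
        : { state: 'in_flight' };
    }
    this.entries.set(key, { response: null, expiresAt: this.now() + this.reservationMs });
    return { state: 'new' };
  }

  complete(key, statusCode, body) {
    if (statusCode >= 500) {
      this.entries.delete(key); // server errors are not cached; the client may retry
      return;
    }
    this.entries.set(key, {
      response: { statusCode, body },
      expiresAt: this.now() + this.ttlMs
    });
  }
}
```

The proxy would call begin() before forwarding: forward on 'new', replay the stored response on 'done', and answer 'in_flight' with a conflict or retry-later status of its choosing. The short reservation window keeps a hung upstream request from blocking retries for the full TTL.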

What the Patterns Miss

The Strangler Fig pattern, canary deployments, and idempotency handling are all well-documented. What is less documented is the category of failure that happens not because the patterns are wrong but because the system you are migrating is not the system you thought you were migrating.

IN-COM's analysis of zero-downtime refactoring identifies the core issue: a successful service-oriented refactor depends on running old and new code paths in parallel, but parallel execution only works if you can validate that both paths produce equivalent outputs. For legacy APIs that encode undocumented business rules in conditional branches, that validation is genuinely difficult before you have done the work to surface what those rules are.

This is the dependency mapping problem from the first section, appearing again at the end. The access log analysis tells you who is calling. The contract documentation tells you what they expect. But neither tells you why the legacy API behaves differently for statusCode === 7 inputs, or why a particular user class gets a different response shape, or why a specific combination of parameters triggers a code path that was never meant to be a public API.

Those are the failures that appear in production two weeks after a migration that passed all tests. They are not proxy failures or routing failures or idempotency failures. They are knowledge failures. The migration was technically correct. The understanding of the system it was migrating was not.

The practical response to this is to instrument the legacy API before you build the new one. Not instrumentation for monitoring, though that matters too. Instrumentation for learning: capturing the full request and response shapes of every endpoint under production load, logging every conditional branch that gets exercised, building a behavioral profile of what the system actually does rather than what anyone believes it does.
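One way to start on the response-shape part of that profile is to reduce each observed body to its structure and count how often each structure appears per endpoint. The sketch below assumes JSON bodies; shapeOf and ShapeProfile are hypothetical names, and capturing conditional branches would need separate instrumentation inside the legacy code itself.

```javascript
// Reduce a JSON value to its structure: objects keep their keys,
// arrays become a list of the distinct item shapes, primitives become type names.
function shapeOf(value) {
  if (value === null) return 'null';
  if (Array.isArray(value)) {
    const distinct = new Set(value.map((item) => JSON.stringify(shapeOf(item))));
    return [...distinct].sort().map((json) => JSON.parse(json));
  }
  if (typeof value === 'object') {
    const shape = {};
    for (const key of Object.keys(value).sort()) {
      shape[key] = shapeOf(value[key]);
    }
    return shape;
  }
  return typeof value;
}

class ShapeProfile {
  constructor() {
    this.byEndpoint = new Map(); // endpoint -> Map(shape JSON -> count)
  }

  record(endpoint, body) {
    const shapes = this.byEndpoint.get(endpoint) || new Map();
    const key = JSON.stringify(shapeOf(body));
    shapes.set(key, (shapes.get(key) || 0) + 1);
    this.byEndpoint.set(endpoint, shapes);
  }

  // Distinct shapes seen for an endpoint, most common first
  shapesFor(endpoint) {
    const shapes = this.byEndpoint.get(endpoint) || new Map();
    return [...shapes.entries()]
      .map(([json, count]) => ({ shape: JSON.parse(json), count }))
      .sort((a, b) => b.count - a.count);
  }
}
```

An endpoint that reports more than one shape is a signal worth chasing: the rare shape is often the user class or parameter combination that nobody documented.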

AWS's prescriptive guidance on the Strangler Fig pattern flags the proxy layer as a potential single point of failure and performance bottleneck. Both are real risks. But the deeper risk is using the proxy as a substitute for understanding. A proxy that routes traffic between two services you understand well is a migration tool. A proxy that routes traffic between a new service and a legacy system nobody fully understands is a liability.

The Cutover Decision

The final phase of an API migration is the decision to cut over completely: route 100% of traffic to the new service and, after a standby period, take the legacy endpoint offline and remove the proxy routing rules.

That decision should be data-driven, not schedule-driven. The signals you are looking for:

Error rate parity. The new service's error rate at 100% traffic should be within the same range as the legacy system's rate at equivalent load. Not necessarily zero errors. The legacy system probably did not have zero errors either.

Latency equivalence. Response time distributions should be similar or better. Pay attention to the 99th percentile, not just the median. Legacy systems often have very consistent median latency but occasional outliers. The new system may improve the median while making outliers worse, or vice versa.

Business metric stability. For APIs that sit in transactional flows, downstream business metrics should be stable through the canary period. Conversion rates, order completion rates, whatever the business measures. A technical migration that degrades a business metric is not a successful migration, even if the API is technically correct.

Legacy traffic at zero. The X-Api-Source: legacy-proxy header introduced earlier tells you when no traffic is arriving via the legacy path. When that metric reaches zero across all relevant time windows, including the off-hours batch jobs and the monthly scheduled operations you found in the dependency mapping step, the cutover window is open.
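The measurable signals can be combined into a single readiness check. The sketch below covers error rate, 99th percentile latency, and legacy-path traffic; business metric stability is left out because it depends on what each business measures. The function names, the input shape, and the tolerance defaults are assumptions chosen for illustration.

```javascript
// Nearest-rank percentile on an array sorted in ascending order.
function percentile(sortedValues, p) {
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// legacy and candidate: { errorRate: fraction, latenciesMs: number[] }
// legacyPathRequests: count of requests tagged X-Api-Source: legacy-proxy
function cutoverReadiness(
  { legacy, candidate, legacyPathRequests },
  { errorTolerance = 0.001, latencyTolerance = 1.1 } = {}
) {
  const p99 = (samples) => percentile([...samples].sort((a, b) => a - b), 99);

  const checks = {
    errorRateParity: candidate.errorRate <= legacy.errorRate + errorTolerance,
    p99LatencyParity: p99(candidate.latenciesMs) <= p99(legacy.latenciesMs) * latencyTolerance,
    legacyTrafficAtZero: legacyPathRequests === 0
  };

  return { ready: Object.values(checks).every(Boolean), checks };
}
```

Returning the individual checks alongside the overall verdict matters in practice: when the answer is "not ready", the next question is always which signal failed.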

The legacy API should remain running but unrouted for a defined period after the cutover. The proxy rules stay in place, just pointing entirely to the new service. The legacy system is the rollback path. Until you are confident the rollback path will not be needed, you want it warm.

Running It in Practice

The sequence that works in production:

Before you touch anything: 90-day access log analysis, contract documentation for all endpoints with significant traffic, client inventory with contact information for external consumers.

Proxy deployment: Introduce the proxy routing 100% to legacy. Verify behavior is unchanged. Establish the baseline metrics you will use to evaluate the canary.

New service development: Build against the documented contracts, with idempotency middleware from day one. Run the new service in shadow mode against production traffic before enabling canary routing: receive real requests, process them, log the results, but return the legacy response to the client. Compare outputs.

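Canary progression: 1% for 48 hours, then 5%, 10%, 25%, 50%, each with a hold of 24 to 48 hours at normal traffic and a full hold through at least one monthly cycle if your dependency mapping found monthly operations. The hold times are minimums: advance only once error rates on the new service match the legacy baseline.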

Cutover and standby: 100% routing to new service, legacy running but unrouted for 30 days, then decommission.
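The "compare outputs" step of shadow mode comes down to a structural diff between the legacy response and the new one. A minimal sketch, assuming JSON bodies and a caller-supplied list of fields that legitimately differ (request IDs, timestamps); the function name and the dotted-path format are illustrative.

```javascript
// Return one entry per field where the two responses disagree.
// ignore: dotted paths to skip, e.g. ['requestId', 'meta.generatedAt']
function diffResponses(legacy, candidate, ignore = [], path = '') {
  if (ignore.includes(path)) return [];

  const isObject = (v) => typeof v === 'object' && v !== null;
  const comparable =
    isObject(legacy) &&
    isObject(candidate) &&
    Array.isArray(legacy) === Array.isArray(candidate);

  if (!comparable) {
    // Primitives, missing fields, or an array on one side and an object on the other
    return legacy === candidate ? [] : [{ path: path || '(root)', legacy, candidate }];
  }

  // Walk the union of keys (array indices for arrays) on both sides
  const keys = new Set([...Object.keys(legacy), ...Object.keys(candidate)]);
  return [...keys].flatMap((key) =>
    diffResponses(legacy[key], candidate[key], ignore, path ? `${path}.${key}` : key)
  );
}
```

Logging the diff entries per endpoint during shadow mode gives a concrete list of behavioral discrepancies to resolve before the first canary step, instead of a pass or fail verdict.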

The timeline from proxy deployment to decommission is typically 8 to 16 weeks for a moderately complex legacy API, assuming the dependency mapping was done first and no major behavioral discrepancies surface during the shadow mode phase. Organizations that skip the dependency mapping or the shadow mode phase typically discover what they skipped somewhere between the 10% and 25% canary step.

The migration is not a project with a completion date. It is a controlled transfer of trust from a system that has earned it through years of operation to a system that is earning it incrementally through validated behavior under real load. That transfer takes the time it takes. Rushing the canary progression does not save time. It moves the risk from the migration into production.

Top comments (1)

arun rajkumar • Jun 10

The shadow-mode point is the one I'd underline. We run open-banking payments, and the migrations that actually bit us were never the routing or the idempotency keys — those are solvable — they were the undocumented branches: the one merchant class that depended on a quirk nobody wrote down. Diffing new vs legacy responses under real traffic before you route a single user is the cheapest insurance you'll buy. One thing I'd add: pin the idempotency-key TTL to your longest legitimate retry window, not a round 24h — settlement retries can outlive a day. How are you handling keys for the batch jobs you can't force-update?