The Eighth Server: How One Missed Deploy Ended Knight Capital, 2012

Tales from the Bare Metal — Episode 06

At 09:30 on 1 August 2012, the New York Stock Exchange opened a new programme for retail-order matching, the Retail Liquidity Program. By 10:15, Knight Capital Group, one of the largest market makers in American equities, had ceased to function as a going concern. The forty-five minutes between those two times cost the firm roughly $440 million in realised pre-tax loss, required emergency capital from a group of investors led by Jefferies within days, and led to a takeover by Getco, agreed within months. The proximate cause was one server out of eight, running code from 2003 that, in a strictly source-control sense, had never gone away.

The Incident

Knight ran a routing system called SMARS, the Smart Market Access Routing System, which decided how to forward orders into the various American equity venues. The production deployment of SMARS sat on eight servers: identical, redundant, all expected to handle a share of the morning's order flow. In late July, in preparation for the NYSE's launch of the Retail Liquidity Program on 1 August, Knight's developers prepared a release that taught SMARS to recognise and route RLP-eligible orders.

The deployment to the eight production servers ran in stages over the days before the launch. By the SEC's later account, the new code reached seven of the eight servers correctly. On the eighth, the deployment did not complete, and nobody knew. The server kept running the previous build.

When the markets opened at 09:30, retail order flow began arriving at SMARS bearing a new piece of metadata: a particular flag in the order-routing protocol, introduced for the RLP, indicating an order eligible for retail matching. On the seven correctly-deployed servers, the routing logic recognised the flag, consulted the new RLP code path, and routed accordingly. On the eighth, the same flag was interpreted through code paths last meaningful in 2003.

What happened next is the part that reads like a horror story even now. The eighth server began generating orders at an exceptional rate, accumulating positions Knight had never intended to take, paying the ask and selling at the bid, again and again, in roughly 150 stocks. By the time the firm halted trading and unwound, the numbers were unprecedented: millions of orders had been sent, producing roughly four million executions; roughly seven billion dollars of unintended positions had been built; and individual share prices had moved by tens of percent in minutes. Knight booked a realised pre-tax loss of approximately $440 million when it unwound the position. The firm raised roughly $400 million of emergency capital from investors led by Jefferies within days, the SEC subsequently fined Knight $12 million for violations of the Market Access Rule, and Getco Holdings agreed to acquire the firm before the year was out, a merger completed in 2013. The Knight name effectively ended that morning.

The Diagnosis

The defective behaviour was produced by a module called Power Peg.

Power Peg had been written in 2003 to manage parent-order execution: an algorithmic test routine for slicing large orders into many smaller ones. It had been used in production briefly, judged unfit for purpose, and disabled after 2005. The disabling, however, was operational rather than structural. The Power Peg code remained in the SMARS source tree, compiled into the binary, dormant. What kept it dormant was a single flag in the order-routing protocol: when that flag was set, the SMARS code activated Power Peg; when not, the module did nothing. After 2005, the systems that produced upstream orders simply stopped setting the flag, and Power Peg slept.

Years passed. Power Peg was not removed in any subsequent refactor. The 2003 code, unchanged, continued to be built into every release of SMARS.

In 2012, in preparation for the RLP launch, the SMARS routing protocol gained a new feature: the ability to recognise RLP-eligible orders. The implementation, plausibly enough, repurposed the bit position that had once carried the Power Peg activation signal. The bit was no longer in use, the engineers reasoned; let it carry the RLP signal now. The new code, on the seven correctly-deployed servers, read the bit and routed by RLP rules. The old code, still resident on the eighth server, read the same bit and read it as "start Power Peg".
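The collision can be sketched in a few lines. This is an illustration only: the bit value, the function names and the three routing outcomes are invented for the example, and nothing here is taken from the SMARS source.

```python
# Hypothetical flag layout: the bit position and all names are illustrative.
REPURPOSED_BIT = 0x04  # once "activate Power Peg", now "RLP-eligible"


def route_old_build(flags: int) -> str:
    """Pre-RLP build: the bit still wakes the dormant module."""
    if flags & REPURPOSED_BIT:
        return "power_peg"
    return "standard"


def route_new_build(flags: int) -> str:
    """RLP build: the same bit now selects the RLP code path."""
    if flags & REPURPOSED_BIT:
        return "rlp"
    return "standard"


# One RLP-eligible order, seen by a fleet of seven new builds and one old one.
order_flags = REPURPOSED_BIT
fleet = [route_new_build] * 7 + [route_old_build]
outcomes = [route(order_flags) for route in fleet]
```

Both builds parse the order without complaint. Nothing in the message says which meaning of the bit the sender had in mind, so the disagreement only appears in what each host does next.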

Each RLP-eligible order, on the eighth server, was therefore an instruction to fire up a nine-year-old algorithm. Power Peg, as originally written, was a test routine. It had no production-grade understanding of fills, of order completion, of when to stop. Its inner loop bought at the ask and sold at the bid; the outer loop fed it more parent orders to slice. The retail order stream of a major venue at market open is a great many orders. Each one became a parent. Each parent became a flurry. Within minutes, Knight was the dominant counterparty in dozens of stocks. Within fifteen minutes, the firm's exposure was visible in the prices themselves.
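Why a slicing loop without fill tracking never stops can be shown with a toy model. The sketch below is not Power Peg. It assumes every child order fills completely, and it adds a `max_children` cap purely so the runaway case terminates in the example.

```python
def slice_parent(parent_qty: int, child_qty: int, track_fills: bool,
                 max_children: int = 1000) -> int:
    """Send child orders until the parent looks complete.

    Returns the number of child orders sent. max_children is a safety cap
    that exists only in this sketch.
    """
    filled = 0
    sent = 0
    while filled < parent_qty and sent < max_children:
        sent += 1                # send one child order (assumed to fill fully)
        if track_fills:
            filled += child_qty  # completion is visible only if fills are counted
    return sent


with_tracking = slice_parent(1000, 100, track_fills=True)      # stops once filled
without_tracking = slice_parent(1000, 100, track_fills=False)  # runs into the cap
```

With tracking, a 1,000-share parent is done after ten children of 100. Without it, the loop's exit condition can never become true, and the only thing that stops it is something outside the loop.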

The Context

Three quiet drifts compounded, and none of them, judged at the time they happened, was foolish.

The first was the source tree. In 2005, the team that decided Power Peg was unfit gated it with a flag rather than deleting it. That decision was defensible at the time: the code might still be wanted, removing it risked breaking adjacent assumptions, the codebase was large and removal was real work for an uncertain payoff. The gating worked perfectly, every day, for seven years. The flag became invisible: still there, still loaded, still compiled, but inert. The engineers who made the 2005 decision were no longer present in 2012. The link between the flag, the gated module and the bit position the flag occupied existed in nobody's head, and was nowhere recorded as a constraint on future use of that bit.

The second was the deployment script. The script that pushed code to the eight SMARS hosts treated "deployment" as a file-copy operation. It reported success when the copy step had run and the SMARS process had restarted on each target. It did not, as part of "success", verify that the running binary on each target was the new one. It did not interrogate each host for a build identifier. It did not require a healthcheck that the new code alone could pass. In a fleet of eight, an old binary on one host looks exactly like a new binary on the others from outside, as long as you only ask "did the deploy command succeed". For the eighth server, the deploy did succeed in the only sense the script checked; the new files had not, in fact, arrived. The script's notion of success was the wrong notion.

The third was the release note. The change documentation for the RLP release recorded that an existing flag bit in the order-routing protocol was now being used to signal RLP eligibility. It did not, and probably could not, record that the same bit had once been the activation signal for an ancient module that was still compiled into the running binary. The reviewer reading "we now use this flag bit for RLP" had no honest way of knowing the bit had a prior life; the prior life had been dormant for seven years and lived in a comment, if it lived anywhere. To call the review careless is to misunderstand what was visible to the reviewer.

All three were ordinary. Gating instead of deleting; treating "files copied" as "deploy succeeded"; documenting the new use of a flag without auditing every prior use across the source tree. Every team does at least one of these. Most teams do all three. Knight's bad luck was to do all three on the morning a stock exchange opened a new programme.

The Principle

Either of two architectural disciplines would very likely have prevented this incident, and both remain useful in every modern stack.

The first is to delete dead code, not to gate it. A flag that disables a module is a switch that an unrelated change can later flip. The module is still present, still loaded, still subject to whatever the runtime decides to do with it; its absence in behaviour rests on a piece of state that was never intended to be a load-bearing safety mechanism. A deleted module, on the other hand, is gone: not in the binary, not in the loaded process, not waiting for an accidental activation. Source control retains the history; the running system retains nothing. If a module is too dangerous to remove, the right response is to make it safe enough to remove, not to leave it gated.

The second is to verify after deploy, on the property of the deployed code itself. The end of a deployment is not "the files copied successfully". It is "every target host reports the new build's identity". A deployment script that hits a /version endpoint on each host after restart, that compares the returned hash to the expected hash, that fails loudly if any host disagrees, is a small piece of plumbing that catches the specific class of failure that ended Knight Capital. In a FreeBSD shop, this is rc.d managing the SMARS-equivalent, plus a deploy script in plain shell that loops over the host list, hits a /version endpoint on each, and refuses to declare success until every host returns the expected build. This is not exotic engineering. It is half a page of shell.
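The same check is sketched here in Python rather than shell, so that the comparison logic can be exercised without a network. It assumes each host serves a hypothetical `/version` endpoint returning JSON of the form `{"build": "<id>"}`. The endpoint shape, the host names and the function names are all assumptions made for the example.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_build_id(host: str, timeout: float = 5.0) -> str:
    """Ask one host which build it is running (hypothetical /version endpoint)."""
    with urllib.request.urlopen(f"http://{host}/version", timeout=timeout) as resp:
        return json.load(resp)["build"]


def verify_fleet(hosts: Iterable[str], expected: str,
                 fetch: Callable[[str], str] = fetch_build_id) -> Dict[str, str]:
    """Return {host: problem} for every host not provably running `expected`."""
    problems: Dict[str, str] = {}
    for host in hosts:
        try:
            actual = fetch(host)
        except Exception as exc:
            # A host that cannot answer is unverified, and unverified means failed.
            problems[host] = f"no answer: {exc!r}"
            continue
        if actual != expected:
            problems[host] = f"running {actual}, expected {expected}"
    return problems
```

A deploy script would call `verify_fleet(hosts, expected_build)` after the restart step and exit non-zero if the returned dictionary is not empty. The design choice that matters is the default: a host that is silent, unreachable or malformed in its answer is counted as a failure, never as a pass.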

A great many of the practices that have become normal in the last decade, particularly the discipline of immutable infrastructure, container image hashes, and deploy verification through Kubernetes' Deployment status, are downstream of incidents like Knight's. They exist precisely because, before them, organisations could not be sure what was running on which host.

Where It Travels

The pattern wears local clothes everywhere.

Kubernetes: a rolling update on a Deployment where one node has the image cached under the same tag from a previous build. With imagePullPolicy: IfNotPresent (the default for any tag other than :latest), the kubelet uses the cached image. The new pod starts, the readiness probe passes (returning 200 from either version), and the Deployment reports rolled out, with one pod silently still on the old code.

Feature-flag libraries (LaunchDarkly, ConfigCat, Unleash, Flagsmith): a flag whose semantic meaning has shifted between releases. The old code path is still in the binary, gated by the same flag, waiting for somebody to wake it.

Cloud auto-scaling: an EC2 launch template that points at an outdated AMI ID. Newly scaled-out instances run the old binary while the freshly-deployed ones run the new. Traffic is balanced across them as if they were identical.

Helm and Kustomize: a Deployment manifest pins the tag 2.4.0, but the tag is mutable, and one cluster node's local image cache still holds an earlier build under it, in effect 2.3.7. The Pod runs from the local cache; the Deployment status reports healthy.

CI/CD job matrices: a deployment job that mass-deploys to a list of targets and reports success when N out of N return zero exit codes, without verifying any target's running version.

The shape is identical in each: a release that completes on N-1 out of N targets, a verification step that asks "did the deploy succeed" rather than "is the new version running everywhere", and a piece of latent surface area (a stale image, a repurposed flag, a cached AMI, an older registry layer) lying in wait on the unverified one.
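For the Kubernetes variant, one way to ask the second question is to make the readiness probe version-aware, so that a pod started from a stale image can never report ready. The sketch below shows only the decision logic. The build string, the function name and the idea of passing the expected build in through a per-release environment variable are assumptions for illustration, not a Kubernetes feature.

```python
from typing import Optional

BUILD_ID = "2.4.0+abc123"  # hypothetical; stamped into the image at build time


def readiness_status(expected: Optional[str], running: str = BUILD_ID) -> int:
    """HTTP status for a readiness probe that knows which build it wants."""
    if not expected:
        # No expectation supplied: refuse to claim readiness rather than guess.
        return 503
    return 200 if running == expected else 503
```

In a real service, `expected` might be read from an environment variable that the manifest sets on every release, and the function's result returned from the probe's HTTP handler. A pod running an old cached image would then receive the new expectation, fail the comparison and stall the rollout where an operator can see it.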

Coda

Knight rebuilt nothing; Knight was absorbed. The architectural lessons of 1 August 2012 are by now in many shops' release checklists, and the SEC's Market Access Rule has acquired a fixed place in financial-services compliance training. The risk has not gone away; it has merely been moved, into stacks that have richer deployment tooling and, often, the same blind spots dressed in newer vocabulary.

The single sentence worth carrying out of this episode is the question that ends it: when our deployment reports success this afternoon, on what property of the deployed code did it confirm that?

A deploy that succeeds on seven of eight is one that failed quietly on one. Production does not give partial credit; the market gives none at all.

By Vivian Voss, System Architect and Software Developer.