Sebastian Lim

Posted on Feb 16

Your Fork Will Outlive Your Patience. A Systems Thinking Post-Mortem.

#devops #opensource #architecture #programming

Every internal fork starts as a one-liner: "we just need to patch this one file." Six months later you're maintaining four parallel repositories, dreading every upstream release, and spending more time keeping your patches alive than building the thing they were supposed to enable.

I know because I did exactly this. I forked four upstream tools to port 973 ROS packages to an unsupported OS. It worked — 61% of the packages compiled, turtlesim ran, my demo was a success. Then the fork ate me alive.

This is not a war story. This is a system dynamics diagnosis of why forking upstream tools creates a structural trap that no amount of discipline can outrun.

The Setup

I was porting ROS 2 Jazzy (the Robot Operating System) to openEuler 24.03 LTS — a Linux distribution that ROS does not officially support. The ROS build toolchain (bloom, rosdep, rospkg, rosdistro) hardcodes its list of supported platforms. openEuler is not on it.

My options were:

1. Contribute upstream: submit PRs to add openEuler support to the official tools. Slow, dependent on maintainer goodwill, but sustainable.
2. Fork everything: clone the four repos, add openEuler support myself, build from source. Fast, self-contained, but now I own the maintenance.

I chose option 2. Of course I did. I had a demo to deliver.

The Fix That Fails (R1)

Here's what my fork looked like as a system:

        (Problem)                      (Relief)
    TOOLCHAIN DOESN'T    ---------->  TOOLCHAIN WORKS
    RECOGNIZE openEuler                   |
         ^                                |
         |         (Short Term)           |
         |          BALANCING             |
         |            LOOP                |
         |                                v
         +------ <FORK UPSTREAM> ---------+
         |         (Intervention)         |
         |                                |
         |                                |
         |  (Long Term Side-Effect)       |
         |    REINFORCING LOOP (R1)       |
         |    "Fixes that Fail"           |
         |                                |
         |                                v
    +-------------+                +-----------------+
    | FORK GETS   | <------------- | UPSTREAM MOVES  |
    | OVERWRITTEN |    (Delay)     | (pip install,   |
    | OR STALE    |                |  new releases)  |
    +-------------+                +-----------------+

Every pip install in my build environment could silently overwrite my forked rosdep with the official version. The official version doesn't know openEuler exists. Suddenly my entire pipeline is dead and I'm grepping through pip logs trying to figure out what happened.

This is textbook "Fixes that Fail", one of the classic system archetypes in the systems thinking literature and a close relative of the "shifting the burden" trap that Donella Meadows describes. The fix (forking) addresses the symptom (the toolchain doesn't recognize my OS), but it creates a side effect (a fragile environment that breaks on any upstream interaction) that makes the original problem recur, and makes it harder to diagnose each time.

The reinforcing loop at the bottom is the killer. The more upstream moves, the more my fork breaks. The more it breaks, the more time I spend re-patching. The more time I spend re-patching, the less time I have to pursue the fundamental solution (contributing upstream). Which means I'm even more dependent on the fork tomorrow than I am today.
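One way to make the silent overwrite loud is a guard that runs before every build and refuses to continue if the installed package is no longer the fork. The sketch below is a minimal illustration, not something from the original pipeline. It assumes the fork is published with a PEP 440 local version label such as `0.25.1+openeuler.1`; the label, the version number, and the function names are all hypothetical.

```python
from importlib import metadata


def is_fork_build(version: str, marker: str = "openeuler") -> bool:
    """Return True if a PEP 440 version string carries the fork's local label."""
    # The local label is everything after "+", e.g. "0.25.1+openeuler.1".
    _, _, local = version.partition("+")
    return marker in local.split(".")


def check_fork_installed(dist_name: str, marker: str = "openeuler") -> str:
    """Raise RuntimeError if dist_name is missing or is not the forked build."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        raise RuntimeError(f"{dist_name} is not installed") from None
    if not is_fork_build(version, marker):
        raise RuntimeError(
            f"{dist_name} {version} has no '+{marker}' label; "
            "the fork was probably replaced by the upstream release"
        )
    return version
```

Calling `check_fork_installed("rosdep")` as the first step of a build script turns a mysterious mid-pipeline failure into an immediate, named one. It does not remove the loop; it only shortens the diagnosis.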

The Data Decay Loop (R2)

R1 didn't run alone. I also forked rosdistro — the central database that maps ROS package names to OS-specific dependencies. My fork contained hand-maintained YAML files mapping ROS dependency keys to openEuler package names.

    Official rosdistro            Forked rosdistro
    (constantly updated)  ------>  (frozen in time)
          ^                              |
          |                              |
          |       REINFORCING            v
          |         LOOP (R2)        METADATA ROTS
          |      "Data Decay"        (Wrong versions,
          |                           missing packages)
          |                              |
          |                              v
          |                        BUILD FAILURES
          |                        INCREASE
          |                              |
          +------------------------------+
              (Need more manual
               patching of YAML)

Every day the official rosdistro receives updates, my fork falls further behind. Every day it falls behind, more builds fail for reasons that have nothing to do with openEuler compatibility — they fail because my metadata is stale.

I wrote a script (auto_generate_openeuler_yaml.py) that reads the official YAML and tries to map each dependency to an openEuler package via dnf list. But this script can only run on an actual openEuler machine. It can't run in CI. It can't run offline. It's a manual process that I have to remember to do, and every time I forget, the data rots a little more.
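Part of the drift can at least be measured without an openEuler machine. The sketch below compares the key sets of two already-parsed rosdep-style mappings (dependency key to per-OS package lists) and reports where the fork has fallen behind. It is not the author's script: the function names are hypothetical, the sample entries are illustrative, and it deliberately does not check whether a package actually exists in the openEuler repositories, which still needs `dnf`.

```python
def missing_os_mappings(official: dict, fork: dict, os_name: str = "openeuler") -> list:
    """Keys that exist upstream but that the fork cannot resolve for os_name."""
    missing = []
    for key in sorted(official):
        rules = fork.get(key) or {}
        if not rules.get(os_name):
            missing.append(key)
    return missing


def removed_upstream(official: dict, fork: dict) -> list:
    """Keys the fork still carries although upstream no longer has them."""
    return sorted(set(fork) - set(official))


# Illustrative data only; real files would be loaded with a YAML parser.
official = {
    "boost": {"fedora": ["boost-devel"], "ubuntu": ["libboost-all-dev"]},
    "eigen": {"fedora": ["eigen3-devel"], "ubuntu": ["libeigen3-dev"]},
}
fork = {
    "boost": {"openeuler": ["boost-devel"]},
    "tinyxml": {"openeuler": ["tinyxml-devel"]},
}

print(missing_os_mappings(official, fork))  # ['eigen']
print(removed_upstream(official, fork))     # ['tinyxml']
```

Because it only needs the two YAML files, a check like this could run in CI on any platform and fail the job when the list of missing keys grows.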

What R1 + R2 Look Like in Practice

Here's the actual data from my system, running on EulerMaker:

Architecture	Success	Dependency gaps	Build failures	Interrupted	Total
aarch64	606	215	152	—	973
x86_64	597	214	151	11	973

61% success rate. Turtlesim runs. That's the good news.

The bad news: those 214 dependency gaps and 151 build failures (the x86_64 row) are the accumulated stock of problems that my two reinforcing loops are feeding. Each gap is a place where my forked metadata is wrong or my forked toolchain did something the real toolchain wouldn't. And every time upstream moves, some of those 597 successes will become new failures, because my fork hasn't kept up.
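The headline percentage can be recomputed from the table. The snippet below uses the numbers exactly as reported; the only assumption is that the blank "Interrupted" cell for aarch64 means zero, which is consistent with that row summing to 973.

```python
TOTAL = 973

results = {
    "aarch64": {"success": 606, "dep_gaps": 215, "failures": 152, "interrupted": 0},
    "x86_64": {"success": 597, "dep_gaps": 214, "failures": 151, "interrupted": 11},
}


def success_rate(counts: dict, total: int = TOTAL) -> float:
    """Percentage of packages that built, rounded to one decimal place."""
    return round(100 * counts["success"] / total, 1)


for arch, counts in results.items():
    # Every package must land in exactly one bucket.
    assert sum(counts.values()) == TOTAL
    print(arch, success_rate(counts))
# aarch64 62.3
# x86_64 61.4
```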

The system is not failing. The system is succeeding at producing failures, because that's what its structure is designed to do.

The Leverage Point I Missed

In systems thinking, there's a concept called leverage points: places where a small change in structure produces a large change in behavior. Meadows ranked "the rules of the system" fifth on her list of twelve leverage points, above feedback loops and information flows.

My fork was operating under one implicit rule: "we maintain our own version of the toolchain." This rule forced every interaction with upstream into an adversarial relationship. Upstream updates weren't improvements — they were threats.

The high-leverage alternative was to change the rule: "we get our patches accepted upstream." Under this rule, every upstream update would be an improvement that includes our platform support. The same force that was destroying my system (upstream momentum) would be sustaining it instead.

I know why I didn't do this. Contributing upstream is slow, political, and uncertain. Forking is fast, controllable, and certain. But "fast and certain" in the short term turned into "expensive and fragile" in the long term. That's the entire point of the Fixes that Fail archetype — the symptomatic solution is always more attractive in the moment.

What I Actually Learned

A fork is a liability, not an asset. The moment you fork, you've created a maintenance obligation that grows with every upstream commit. If you can't get your changes upstream within a bounded timeframe, you are accumulating structural debt that compounds.
Data forks are worse than code forks. Forking code is bad. Forking data (like my rosdistro YAML files) is worse, because data goes stale silently. Code breaks loudly — a function signature changes and you get a compile error. Data rots quietly — a package version is wrong and you get a mysterious runtime failure three weeks later.
The brute-force approach is valuable — as a probe. v1 was not a failure. It was a deliberate brute-force survey that generated an intelligence map: here are the 973 packages, here's which ones work, here's exactly where the gaps are. The failure was in thinking the probe could become the production system. Probes are disposable. Production systems need structural integrity.
Know your band-aids. I have virtualenv bypasses, RHEL-clone registrations, and frozen YAML snapshots in my system. I know each one is a band-aid. Most teams don't track their band-aids. They accumulate silently until someone asks "why does our build take 45 minutes and fail 30% of the time?" and nobody can answer.

The Follow-Up

v1 taught me what a brute-force pipeline looks like when it hits its structural limits. I documented the full system dynamics, including the trap architecture, in the v1 post-mortem repo.

v2 was designed to break the cycle: verify before building, not after. Instead of feeding 973 packages into a pipeline and watching nearly 40% of them fail, v2 probes the OS environment first, identifies gaps before consuming build resources, and operates on a verified dependency graph. Details in the v2 Verification Engine repo.
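The "verify before building" idea can be sketched in a few lines. This is not the v2 engine, only an illustration of the principle under simple assumptions: each package lists its system dependencies and its ROS dependencies, and the set of packages the OS can provide is already known. All package names below are hypothetical.

```python
def preflight(packages: dict, available: set) -> tuple:
    """Split packages into buildable and blocked without running any build.

    packages maps a name to {"system": [...], "ros": [...]}.
    available is the set of system packages the OS can provide.
    """
    blocked = {}
    # Pass 1: packages with system dependencies the OS cannot satisfy.
    for name, deps in packages.items():
        gaps = sorted(d for d in deps["system"] if d not in available)
        if gaps:
            blocked[name] = gaps

    # Pass 2: propagate until stable. A package is also blocked if any of
    # its ROS dependencies is blocked or unknown.
    changed = True
    while changed:
        changed = False
        for name, deps in packages.items():
            if name in blocked:
                continue
            bad = sorted(d for d in deps["ros"] if d in blocked or d not in packages)
            if bad:
                blocked[name] = bad
                changed = True

    buildable = sorted(set(packages) - set(blocked))
    return buildable, blocked


packages = {
    "core_lib": {"system": ["libfoo-devel"], "ros": []},
    "sim_demo": {"system": ["libbar-devel"], "ros": ["core_lib"]},
    "viz_tool": {"system": ["libbaz-devel"], "ros": ["core_lib"]},
    "viz_plugin": {"system": [], "ros": ["viz_tool"]},
}
available = {"libfoo-devel", "libbar-devel"}

buildable, blocked = preflight(packages, available)
print(buildable)  # ['core_lib', 'sim_demo']
print(blocked)    # {'viz_tool': ['libbaz-devel'], 'viz_plugin': ['viz_tool']}
```

The point of the second pass is that `viz_plugin` never reaches a build worker: its failure is known from the graph alone, before any build resources are spent.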

The structural lesson applies far beyond ROS porting:

If you're maintaining an internal fork of an OSS library: you're running R1. Get your patches upstream or plan for the maintenance tax.
If you're patching configuration files that upstream keeps overwriting: you're running R2. Automate the merge or accept the data rot.
If you're using --skip-broken, --force, or || true in your build scripts: you're masking symptoms. Each flag is a band-aid. Count them.
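"Count them" can be taken literally. The sketch below scans the text of a build script for the three patterns named above and tallies them. The comment stripping is deliberately naive (it cuts at the first `#`, even inside a quoted string), and the sample script is invented for illustration.

```python
import re
from collections import Counter

# One pattern per band-aid. The lookarounds keep "--force" from matching
# longer flags such as "--force-reinstall".
BAND_AIDS = {
    "--skip-broken": re.compile(r"(?<![\w-])--skip-broken\b"),
    "--force": re.compile(r"(?<![\w-])--force(?![\w-])"),
    "|| true": re.compile(r"\|\|\s*true\b"),
}


def count_band_aids(script_text: str) -> Counter:
    """Count occurrences of each band-aid pattern outside of comments."""
    counts = Counter()
    for line in script_text.splitlines():
        code = line.split("#", 1)[0]  # naive: drops everything after the first '#'
        for name, pattern in BAND_AIDS.items():
            counts[name] += len(pattern.findall(code))
    return counts


script = """#!/bin/sh
dnf install -y --skip-broken ros-deps
rpm -i --force local.rpm || true
make test || true  # flaky
pip install --force-reinstall ./rosdep
"""

counts = count_band_aids(script)
print(dict(counts))          # {'--skip-broken': 1, '--force': 1, '|| true': 2}
print(sum(counts.values()))  # 4
```

Tracking that total over time is a cheap way to see whether the band-aids are accumulating or being paid down.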

Every fork starts with "just this one patch." Every addiction starts with "just this one hit."

The system doesn't care about your intentions. It cares about its structure.

The v1 post-mortem with system dynamics diagrams: the_brute_force_probe. The v2 verification engine: the_adaptive_verification_engine.
