Sebastian Lim

Posted on Feb 17

"Just Patch the Output" — Why Your Clean Architecture Is More Fragile Than My Band-Aids.

#devops #architecture #cicd #programming

You spent three months designing the "proper" solution. You contributed upstream. You negotiated with maintainers. You got your pull request reviewed, revised, re-reviewed, and finally merged — into a release that won't ship until next quarter. Meanwhile, your deadline was last Friday.

I spent an afternoon writing a script that takes the wrong output and makes it right. It's ugly. It's a band-aid. It has seven workarounds that each begin with a comment explaining what upstream should have done differently. And it shipped on Friday.

Here's the part that will make you uncomfortable: six months later, my band-aid script is still running, still auditable, still does exactly what it says. Your "proper" solution is stuck in a merge conflict with a refactor that a new contributor pushed to the upstream repo you thought was stable.

I know this because I've been on both sides. I spent v1 doing the "proper" thing — forking four upstream tools, adding real platform support, building "real" fixes. It nearly killed my project. v3 is the recovery: stop fixing the tools. Fix what they produce. The band-aid is the strategy.

The Lie That Works

I'm porting ROS 2 Jazzy to openEuler 24.03 LTS. The ROS build toolchain — bloom, rosdep, rospkg, rosdistro — does not recognize openEuler. It will not generate packages for it. If you ask bloom to build an RPM, it asks the OS detector what system you're on, gets "openEuler," and gives up.

v1's answer: fork the OS detector, add an OpenEuler(OsDetector) class, register it as an RHEL clone, maintain four forked repositories. Noble. Principled. And structurally suicidal — any pip install could overwrite my forks with the official versions, silently killing the pipeline. (I wrote an entire article about this.)

v3's answer: one line.

export ROS_OS_OVERRIDE=rhel:9

That's it. I'm telling bloom "you're running on RHEL 9." Bloom believes me. It generates RHEL 9 specs. The specs are wrong for openEuler — wrong install paths, wrong dependency names, wrong macros — but bloom runs. It produces output. And output can be fixed.

Is this a lie? Yes. ROS_OS_OVERRIDE is a documented ROS feature, but it was designed for cross-compilation environments, not for making an unsupported OS masquerade as a supported one. I'm using it for something the ROS maintainers probably didn't intend.

And it works better than the "honest" approach of teaching the toolchain what openEuler actually is.
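When the pipeline is driven from Python rather than a shell, the same override can be scoped to a single child process instead of the whole session. This is a minimal sketch, assuming the colon-separated `name:version` value shown above. The helper names `override_env` and `run_with_override` are hypothetical, and the command to run is left to the caller.

```python
import os
import subprocess


def override_env(os_name, os_version, base_env=None):
    """Return a copy of the environment with ROS_OS_OVERRIDE set.

    The caller's environment is not modified, so the override stays
    visible at the call site and limited to one child process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROS_OS_OVERRIDE"] = f"{os_name}:{os_version}"
    return env


def run_with_override(cmd, os_name="rhel", os_version="9"):
    """Run cmd (a list of arguments) with the override applied."""
    return subprocess.run(cmd, env=override_env(os_name, os_version), check=True)
```

Keeping the override out of the login shell means an unrelated ROS command run by hand still sees the real OS.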

The Fix-the-Output Architecture

Here's the entire system:

  Source Code           Official bloom              My one script
  (ros2.repos)    --->  (unmodified,          --->  (fix_specs.py)    --->  EulerMaker
                         thinks it's RHEL 9)         7 fixes, 1 file         (builds RPMs)

That's three arrows. The bloom box is someone else's maintained software, running unmodified. The only code I own is fix_specs.py.

Compare to v1:

  Source Code     --->  Forked bloom          --->  Forked rosdep     --->  Forked rosdistro
  (973 packages,        Forked rospkg               (YAML data,             (more YAML data,
   scraped from          (Python code,               rots silently)           rots silently)
   sitemap)              breaks on pip)

                        ^--- I own ALL of this. Every upstream update is my problem. ---^

v1 owned the entire middle of the pipeline. v3 owns a single post-processing script. When upstream ships an update, v1 has to re-patch four repositories. v3 runs pip install --upgrade bloom rosdep and then checks if fix_specs.py still produces correct output. Usually it does. The fixes are against output format, which changes slowly, not against internal implementation, which changes whenever a maintainer feels like refactoring.
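The "checks if fix_specs.py still produces correct output" step can itself be a small script. This is a rough sketch, assuming the checks are simple textual invariants on the fixed spec. The function name `check_spec` and the three invariants chosen here are illustrative, not taken from the real fix_specs.py.

```python
import re


def check_spec(spec_text, distro="jazzy"):
    """Return a list of human-readable problems found in a fixed spec."""
    problems = []
    prefix = f"/opt/ros/{distro}"

    # The package name should follow the ros-<distro>-<pkg> convention.
    name = re.search(r"^Name:[ \t]*(\S+)", spec_text, flags=re.MULTILINE)
    if name is None:
        problems.append("missing Name: tag")
    elif not name.group(1).startswith(f"ros-{distro}-"):
        problems.append(f"unexpected package name: {name.group(1)}")

    # Debug packages should be switched off.
    if "%global debug_package %{nil}" not in spec_text:
        problems.append("debug packages are not disabled")

    # The ROS install prefix should appear somewhere in the spec.
    if prefix not in spec_text:
        problems.append(f"install prefix {prefix} never appears")

    return problems
```

Run after every upstream upgrade, an empty list means the output format has not drifted in a way these checks can see. It does not prove the RPM will build.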

The structural difference: v1 is coupled to upstream's internals. v3 is coupled to upstream's outputs. Outputs are a contract. Internals are not.

The Seven Band-Aids

Here they are. Every single one. I'm not going to dress them up:

| # | What It Does | Why |
|---|---|---|
| 1 | Adds `%bcond_without tests` | Tests don't pass on openEuler yet. Declaring the `tests` conditional lets the build turn them off (`--without tests`) to unblock the initial build. |
| 2 | Adds `%global debug_package %{nil}` | Debug packages cause build failures. Not worth debugging until the main build works. |
| 3 | Normalizes `Name:` to `ros-jazzy-<pkg>` | Bloom generates inconsistent package names. EulerMaker rejects them. |
| 4 | Fixes `Source0:` path | Bloom's tarball naming doesn't match openEuler conventions. |
| 5 | Rewrites install prefix to `/opt/ros/jazzy` | Bloom installs to `/usr`. openEuler ROS goes to `/opt/ros/jazzy`. Every CMake flag, every Python path, every `%files` glob needs rewriting. |
| 6 | Injects `PYTHONPATH` into `%build`/`%install`/`%check` | Without this, Python can't find ROS libraries during build. The build environment doesn't set up the path correctly. |
| 7 | Simplifies `%files` to `/opt/ros/jazzy` | Package everything under the ROS prefix. Bloom's generated `%files` section lists paths that don't exist on openEuler. |

Look at that table. It's not impressive. There's no clever architecture. There's no abstraction layer. There's no plugin system. It's seven string replacements applied in sequence, each one documented with "why it exists" and "what upstream change would make it unnecessary."
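To illustrate the shape described above, here is a minimal sketch of three of the seven fixes as plain functions applied in order. The function names, the exact matching rules, and the underscore-to-hyphen normalization are assumptions made for the example. The real fix_specs.py is not shown in this post.

```python
import re

DISTRO = "jazzy"  # assumed, matching the ros-jazzy-<pkg> convention


def add_bcond_without_tests(spec: str) -> str:
    """Fix 1. Why: tests don't pass on openEuler yet."""
    line = "%bcond_without tests"
    return spec if line in spec else f"{line}\n{spec}"


def disable_debug_package(spec: str) -> str:
    """Fix 2. Why: debug packages cause build failures."""
    line = "%global debug_package %{nil}"
    return spec if line in spec else f"{line}\n{spec}"


def normalize_name(spec: str) -> str:
    """Fix 3. Why: EulerMaker rejects inconsistent package names."""
    prefix = f"ros-{DISTRO}-"

    def repl(match):
        # Assumed rule: drop an existing prefix, use hyphens, re-add the prefix.
        pkg = match.group(1).removeprefix(prefix).replace("_", "-")
        return f"Name: {prefix}{pkg}"

    return re.sub(r"^Name:[ \t]*(\S+)", repl, spec, count=1, flags=re.MULTILINE)


# Each fix only sees the previous fix's output; deleting one entry removes one fix.
FIXES = [add_bcond_without_tests, disable_debug_package, normalize_name]


def apply_fixes(spec: str) -> str:
    for fix in FIXES:
        spec = fix(spec)
    return spec
```

Writing each fix so that a second pass changes nothing makes it safe to re-run the script over a directory of specs that is already partly fixed.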

That's the point.

Why Band-Aids Beat Architecture

I'm going to make an argument that will offend anyone who's read Clean Code: a visible band-aid is structurally superior to an invisible "proper" fix.

Here's the reasoning.

1. Band-aids are auditable. Proper fixes hide.

Open fix_specs.py. Read it top to bottom. In under ten minutes, you know every single thing v3 works around. Every platform gap between openEuler and RHEL 9. Every bloom output that needs correction. Every assumption the ROS toolchain makes that doesn't hold.

Now try doing the same audit on v1's forked bloom. Where are the openEuler-specific changes? Scattered across four repositories, mixed into the existing codebase, buried under commit messages like "fix OS detection" and "add openEuler support." You'd need to diff against upstream to find them all, and upstream has moved since the fork, so the diff includes changes from both sides.

A system whose workarounds are invisible is a system whose costs are invisible. And invisible costs compound.

2. Band-aids are removable. Proper fixes entangle.

Each fix in fix_specs.py is independent. If the test suites start passing on openEuler tomorrow, I delete the function that adds %bcond_without tests and the rest of the pipeline doesn't notice. If bloom fixes its Name: generation, I delete another function. No cascading changes, and any regression is confined to the one fix I removed.

Try removing openEuler support from a fork that's had six months of other changes layered on top. The fork has its own git history now. Its own bug fixes. Its own workarounds for things that broke because of the fork. You're not removing a feature — you're performing archaeology.

3. Band-aids fail loudly. Proper fixes fail silently.

When a band-aid stops working, the spec file is visibly wrong and the build fails immediately. The error message points to the spec. The spec points to fix_specs.py. The fix function has a docstring explaining why it exists. Diagnosis time: minutes.
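One way to get that loud failure on purpose is to make a fix refuse to continue when it finds nothing to change. This is a sketch under that assumption. The exception name `FixDidNotApply` and the helper `substitute_or_fail` are hypothetical, not part of bloom or of the real script.

```python
import re


class FixDidNotApply(RuntimeError):
    """Raised when a fix finds nothing to change in the spec."""


def substitute_or_fail(fix_name, pattern, replacement, spec_text):
    """Apply a regex substitution and raise if the pattern never matched."""
    new_text, count = re.subn(pattern, replacement, spec_text, flags=re.MULTILINE)
    if count == 0:
        # A silent no-op here would hide a change in the generator's output.
        raise FixDidNotApply(
            f"{fix_name}: pattern {pattern!r} matched nothing; "
            "the generated spec format may have changed"
        )
    return new_text
```

The trade-off is that a strict fix also raises when it is re-run on output it has already rewritten, unless the replacement still matches the pattern. That suits fixes that must fire exactly once per generated spec.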

When a fork gets silently overwritten by pip install, everything looks fine until a build fails with an unrelated error three hours later. Diagnosis time: half a day, minimum, because you're looking in the wrong place.

Systems that fail loudly are systems that get fixed. Systems that fail silently are systems that rot.

The Uncomfortable Trade-off

I'm not going to pretend v3 has no cost. It does. The cost is this:

    (New platform issue           (Add fix to
     discovered)            --->   fix_specs.py)
          ^                             |
          |                             |
          |     ACCUMULATION            v
          |        LOOP              fix_specs.py
          |                          GROWS
          |                             |
          +-----------------------------+
               (More complexity
                to maintain)

The band-aids accumulate. Each new openEuler-vs-RHEL divergence means a new function in fix_specs.py. The file gets longer. The surface area grows. Eventually, it hits some threshold where the maintenance cost of the post-processing script rivals the maintenance cost of the fork.

I know this. I chose this consciously. Here's why it's still better:

The accumulation is visible. I can open one file, count the fixes, read their rationale, and make an informed decision about whether the script is still manageable or whether it's time to push fixes upstream. With v1's forks, the equivalent accumulation was happening across four repositories, invisibly, and I didn't realize the cost was unsustainable until the pipeline broke at 3 AM.

The accumulation has an exit ramp. Each band-aid includes a note: "What would eliminate this fix." Every one of those notes describes an upstream contribution or an openEuler packaging change. When I eventually submit upstream PRs, I know exactly what to submit, because the band-aids are a precise map of the gap between "what exists" and "what's needed."
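If the rationale and the exit-ramp note live in each fix's docstring, counting the band-aids and listing what would eliminate them can be automated. This is a minimal sketch, assuming a docstring convention with `Why:` and `Eliminated by:` lines. That convention, the two sample functions, and the wording of their notes are made up for illustration.

```python
import inspect


def fix_debug_package(spec):
    """Disable debug packages.

    Why: Debug packages cause build failures.
    Eliminated by: (hypothetical) debug packages building cleanly on openEuler.
    """
    return spec


def fix_without_docs(spec):
    return spec


def audit(fixes):
    """Return (name, why, eliminated_by) for each fix function."""
    report = []
    for fn in fixes:
        fields = {}
        for line in (inspect.getdoc(fn) or "").splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in ("Why", "Eliminated by"):
                fields[key.strip()] = value.strip()
        report.append((
            fn.__name__,
            fields.get("Why", "UNDOCUMENTED"),
            fields.get("Eliminated by", "UNDOCUMENTED"),
        ))
    return report
```

The length of the report is the number of band-aids, and any `UNDOCUMENTED` entry is a fix that has lost its exit ramp.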

v1's forks had no exit ramp. The longer the fork ran, the harder it was to merge back upstream, because the fork and upstream diverged in both directions. That's the difference between v3, where the cost grows by roughly one function per new divergence, and v1, where every upstream change had to be reconciled against every change in the fork.

The ROS_OS_OVERRIDE Principle

The deeper lesson is not about RPM specs or ROS packaging. It's about a design principle that most software engineers are trained to reject:

If you can't change the tool, change its environment, accept its output, and fix the output.

We're taught to fix problems at the source. Don't parse HTML with regex — use a proper parser. Don't post-process compiler output — fix the compiler flags. Don't patch generated code — fix the generator.

These are good defaults. But they assume you can fix the source. When you can't — because the source is upstream, because the maintainers are busy, because the fix requires organizational changes that take months — the "proper" approach becomes a blocking dependency on something outside your control.

ROS_OS_OVERRIDE is the embodiment of this principle. I cannot change how bloom detects operating systems. I cannot add openEuler to rosdistro's supported platform list (not quickly, anyway). But I can set an environment variable that makes bloom produce usable output, and then I can fix that output.

Is this elegant? No. Is this what the ROS developers intended? Probably not. Does it decouple my project's progress from upstream's review cycle? Yes. And that decoupling is worth more than any amount of architectural elegance.

The Real Engineering Skill

The v1 article was about recognizing structural traps. This one is about something harder: recognizing when the "right" solution is the wrong strategy.

Every senior engineer I've met has an instinct for "doing it properly." It's a good instinct. It produces maintainable code. It produces systems that last. But it also produces a specific failure mode: the project that never ships because someone insisted on getting the architecture right before solving the problem.

v3 ships. It ships with seven band-aids, a fake OS identity, and a post-processing script that would make any bloom developer cringe. And it's more maintainable, more auditable, and more resilient than the "proper" version ever was.

The real skill is not knowing how to build the right system. It's knowing when the right system is too expensive, and building the useful system instead — with full awareness of what you're trading away, documented in a file that anyone can read.

My band-aids are in fix_specs.py. Where are yours?

The fork trap that v3 replaced: the_brute_force_probe. The reproducible pipeline: the_reproducible_pipeline. I'm building systems that are honest about their own limitations.
