
우병수

Posted on • Originally published at techdigestor.com

Your Bash CI Scripts Are a Ticking Time Bomb — Here's What to Use Instead

TL;DR: The thing that finally broke our bash-based pipeline wasn't a catastrophic failure — it was a Tuesday afternoon where three engineers spent 45 minutes staring at a deploy script trying to figure out why exit code 0 was returned on a job that clearly hadn't finished correctly.

📖 Reading time: ~35 min

What's in this article

  1. The Moment Bash Stops Being Good Enough
  2. What 'Orchestration' Actually Means (Not the Marketing Version)
  3. Where GitHub Actions Hits Its Limits
  4. Buildkite: The First Orchestrator That Felt Like It Was Built for This
  5. Bazel for Build Orchestration: When Your Problem Is Actually a Dependency Graph Problem
  6. Nx for JavaScript/TypeScript Monorepos: The Practical Middle Ground
  7. The Architecture That Actually Works at Scale
  8. Comparison: Bash Glue vs GitHub Actions vs Buildkite vs Bazel+CI
  9. When to Stick with What You Have
  10. Migration Playbook: Moving Off Bash Without Breaking Your Release

The Moment Bash Stops Being Good Enough

The thing that finally broke our bash-based pipeline wasn't a catastrophic failure — it was a Tuesday afternoon where three engineers spent 45 minutes staring at a deploy script trying to figure out why exit code 0 was returned on a job that clearly hadn't finished correctly. The script had accumulated two years of hotfixes, conditional logic bolted on after incidents, and environment variable assumptions baked in from someone's laptop. Nobody touched it unless they absolutely had to.

At 40+ repos and 12 engineers, "works on my machine" stops being a joke and becomes a genuine operational hazard. Bash scripts fail in ways that are genuinely hard to reproduce: step ordering that depends on filesystem state that only exists in CI, set -e that exits silently without logging why, env vars that get sourced from a file that exists in one environment but not another. I've seen pipelines where a missing export caused a downstream service to deploy with a blank DATABASE_URL — and the build still passed because the check happened before the variable was consumed. That's not a bug you catch in code review.

# This is the kind of thing that kills you at scale
./build.sh && ./test.sh && ./deploy.sh

# What you don't see: test.sh exits 0 even when tests are skipped
# because the test runner wasn't installed and the fallback was
# an echo statement someone added "temporarily" 18 months ago

The real cost isn't the broken builds — broken builds are visible. The cost is the hour-long debugging sessions where someone has to mentally reconstruct what the environment looked like at 2:47 PM on Thursday when the pipeline ran. Bash scripts have no audit trail worth speaking of. You get a wall of stdout if you're lucky, a cryptic error if you're not, and zero structured data you can query later. When you're trying to answer "did step X actually run before step Y on this specific commit" — bash just doesn't give you that.

Bash fails at scale in three concrete ways that no amount of discipline fixes:

  • No retries with backoff: You either wrap everything in a for i in 1 2 3; do ... && break; done pattern that every engineer implements differently, or you accept flakiness as a fact of life (a sketch of that wrapper follows this list). Network blips kill your deploys and someone has to manually re-trigger them.
  • No real parallelism primitives: & and wait exist, but managing dependencies between parallel jobs, capturing exit codes from background processes, and canceling sibling jobs when one fails — that's a full-time research project, not a CI script.
  • No audit trail: Which engineer triggered the run? What was the exact environment? Did the pre-deploy hook actually complete or did it time out silently? Bash logs whatever you explicitly echo, which is never enough when you're postmorteming a 3 AM incident.
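
What the first bullet looks like when teams hand-roll it: a minimal sketch of a retry-with-backoff wrapper (the `with_retries` name is hypothetical), not a drop-in utility:

# Retry a command up to 5 times with exponential backoff (1s, 2s, 4s, 8s between attempts)
with_retries() {
  local attempt=1 max=5 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "FAILED after ${max} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Wrap the flaky network call, not the whole pipeline
with_retries curl -fsSL https://example.com/healthcheck

Every team ends up with a slightly different version of this, which is exactly the problem: retry semantics belong in the orchestrator, not in each script.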

I got a lot of practical perspective on where bash tooling genuinely breaks down from reading through the Best AI Coding Tools in 2026 (thorough Guide) — the broader pattern of developers reaching for purpose-built tools instead of scripting their way through problems applies directly here. The teams that keep gluing bash scripts together past a certain size aren't being pragmatic, they're accumulating operational debt that compounds every time someone new joins and has to reverse-engineer what a 400-line shell script is actually doing.

What 'Orchestration' Actually Means (Not the Marketing Version)

The marketing version of "orchestration" is basically "your pipeline but with a nicer UI." The real definition is more specific: an orchestrator manages state transitions between dependent tasks, not just sequential execution. Your bash script doesn't know if step 4 failed because step 3 produced bad output or because the network hiccupped. An orchestrator tracks that distinction and uses it to decide what happens next. That's the actual gap.

The difference between scripting and orchestration isn't about whether you write YAML or Python or bash. It's about whether your system has an explicit dependency graph and persistent state per execution. A bash pipeline is a linear chain — set -e stops everything on failure, and you get a binary pass/fail with some logs if you're lucky. An orchestrator models your workflow as a DAG, stores the status of every node, and lets you reason about partial failures, retries, and skips without rerunning the whole thing from scratch. When I moved a 40-job pipeline from a bash monolith to a proper DAG-aware system, the first thing I noticed wasn't speed — it was that I could finally tell which downstream jobs were blocked and why, without parsing 3,000 lines of interleaved log output.

At scale, you hit four specific capabilities that no amount of clever bash will give you reliably. Fan-out/fan-in — spinning up 50 parallel test shards and waiting for all of them before proceeding — requires a coordinator that can track 50 independent state machines simultaneously. Artifact caching with content-addressed keys means job B doesn't recompile what job A already compiled, but only if something is managing the cache keys and invalidating them correctly. Conditional branching based on runtime output (not just exit codes) means your deploy job runs only if the diff touches src/ and the staging smoke tests passed. And failure isolation means one flaky integration test doesn't kill your security scan or your artifact upload. None of these are hard to sketch out in bash, but all of them are genuinely hard to maintain correctly at scale.
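
For a sense of why fan-out/fan-in alone is painful, here is roughly what the hand-rolled bash version looks like. It's a sketch with three hypothetical shard scripts, and it still has no sibling cancellation, no retries, and no structured status:

# Fan-out: launch shards in the background, remember their PIDs
mkdir -p logs
shards=("./test-shard-1.sh" "./test-shard-2.sh" "./test-shard-3.sh")
pids=()

for shard in "${shards[@]}"; do
  "$shard" > "logs/$(basename "$shard").log" 2>&1 &
  pids+=($!)
done

# Fan-in: collect every exit code; any failure fails the whole stage
failed=0
for i in "${!pids[@]}"; do
  if ! wait "${pids[$i]}"; then
    echo "Shard ${shards[$i]} failed" >&2
    failed=1
  fi
done
exit "$failed"

Everything an orchestrator would track for you (which shard is still running, which ones to cancel when one fails, what to retry) has to be reinvented by hand here, per pipeline.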

GitHub Actions is the right answer for teams with fewer than maybe 10-15 active pipelines where you can tolerate the constraints. The free tier gives you 2,000 minutes/month on private repos (public repos don't burn minutes at all), runners are managed for you, and the YAML syntax is approachable. I'd tell anyone starting out to use it without hesitation. The cracks start showing when you need cross-repo artifact sharing that isn't a hack, when your matrix builds exhaust concurrency limits, when you need dynamic job generation based on runtime discovery (Actions' matrix is static unless you serialize JSON through outputs in a way that feels distinctly wrong), or when you need a complete audit trail per-step for compliance. Past that threshold, you're fighting the platform instead of using it.

An orchestrator has to get exactly four things right, and if it fumbles any one of them, the whole thing erodes trust:

  • Visibility: You need to see which step is running, which failed, which was skipped, and why — in real time, not by grepping logs after the fact. If your team has to ask "did the deploy finish?" you've already lost.
  • Retries: Not just "retry the whole pipeline" but retry a specific step with the same inputs, with configurable backoff, and with the retry attempt clearly distinguished in the history. Flaky network calls and transient test failures are real; a system that can't handle them gracefully burns engineer time.
  • Parallelism: True parallelism means the scheduler understands the dependency graph and starts every job that's unblocked, not just the ones you manually wired up. The difference shows up when you have 30 independent lint/test/build jobs that could all run simultaneously but your pipeline serializes them because that was easier to write.
  • Reproducibility: Given the same inputs, the same step must produce the same outputs — and the system must enforce this, not just hope for it. This means pinned executor environments, content-addressed caching, and no hidden shared state between runs.

The most underrated one is reproducibility. I've debugged enough "it works on my laptop/it passed last Tuesday" incidents to know that reproducibility is what separates an orchestrator from a glorified cron job. Tools like Buildkite, Temporal, and Prefect all handle this differently — Temporal bakes it into the execution model at the language level, while Buildkite offloads it to your agent configuration — and that architectural choice affects everything downstream, including how you debug production failures six months later.

Where GitHub Actions Hits Its Limits

The queue time problem hits you fast and without warning. I was running a monorepo with around 220 jobs triggered on each merge to main — a mix of unit tests, integration checks, lint, and build steps across 14 services. GitHub Actions' hosted runners operate on a shared pool, and once you're queuing 200+ jobs simultaneously, you're not running a pipeline anymore, you're running a lottery. Jobs that should start in 5 seconds sit for 4–7 minutes. Your total wall-clock time explodes from 12 minutes to 40+ minutes, and the merge queue backs up fast. You can throw money at larger runners, but you're still subject to GitHub's global runner availability — and at $0.008/minute for a 2-core Linux hosted runner, a 40-minute backed-up release costs real money across a team that merges frequently.
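
Back-of-envelope, with every number an assumption (2-core Linux at $0.008/min, ~220 jobs per merge averaging 4 minutes each, ~30 merges a day, 22 working days):

# Rough cost of hosted-runner minutes for a busy monorepo (all inputs are assumptions)
jobs_per_merge=220
avg_minutes_per_job=4
merges_per_day=30
price_per_minute=0.008  # 2-core Linux hosted runner

echo "$jobs_per_merge * $avg_minutes_per_job * $merges_per_day * $price_per_minute * 22" | bc
# prints 4646.400, i.e. roughly $4,600/month in runner minutes before anyone counts queue time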

YAML sprawl is the slower kill. The first 6 months on Actions, everything feels clean. Then you hire more teams, add more services, copy-paste a job for "just this one thing." I audited a .github/workflows/ directory last year that had 31 files and roughly 3,200 lines. At least 40% of it was duplicated steps — the same Docker login block, the same Node.js setup, the same artifact upload pattern repeated across a dozen workflows. Composite actions help, but they're a workaround, not a solution. You're still wiring them together manually, still debugging why uses: ./.github/actions/setup-node behaves differently depending on which workflow calls it. Reusable workflows are better but introduce their own input/output serialization gotchas that will burn you in non-obvious ways.
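
A quick way to measure how far the sprawl has gone is to count repeated step definitions across your workflow files. This is just a grep sketch, not a real audit tool, but it surfaces the composite-action and workflow_call candidates fast:

# Most-repeated action references across all workflows
grep -rh 'uses:' .github/workflows/ | sed 's/^[[:space:]-]*//' | sort | uniq -c | sort -rn | head -15

# Same idea for single-line run: steps (multi-line scripts will undercount)
grep -rh 'run:' .github/workflows/ | sed 's/^[[:space:]-]*//' | sort | uniq -c | sort -rn | head -15

If the same Docker login or Node setup block shows up in a dozen files, that's the 40% duplication figure made visible.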

Dynamic pipeline generation is where Actions shows its real ceiling. If you need to decide at runtime which subset of services to build — based on a changed files analysis, a feature flag, or a dependency graph — you're hacking around matrix strategies that weren't built for this. The cleanest workaround I've seen involves a setup job that outputs a JSON array, then a downstream job consumes it with fromJson():

jobs:
  detect-changes:
    outputs:
      services: ${{ steps.changed.outputs.services }}
    steps:
      - id: changed
        run: |
          # outputs something like '["api","auth","worker"]'
          echo "services=$(./scripts/detect-changed.sh)" >> $GITHUB_OUTPUT

  build:
    needs: detect-changes
    strategy:
      matrix:
        service: ${{ fromJson(needs.detect-changes.outputs.services) }}
    # matrix size is capped at 256 — hit this with a big monorepo

The 256-job matrix cap is documented but you won't think about it until you hit it. And when you do hit it at 2am during a release, you'll also potentially see the second nightmare: "No runner matching the specified labels". This happens when you've defined a custom runner label for a specific job — say, runs-on: [self-hosted, gpu, ubuntu-22.04] — and that runner is offline, over-capacity, or was never registered for that label. GitHub won't queue you gracefully; it just stalls the job indefinitely with that error and no ETA. No retry backoff, no fallback pool. Your release is now blocked and you're SSHing into runner VMs at 2am to figure out which one dropped its registration.
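
When that error hits, the fastest diagnostic is to ask the API which runners are actually registered and online, rather than guessing from the UI. A sketch using the `gh` CLI (you need admin access; swap in your own owner/repo):

# List self-hosted runners for this repo with their status and labels
gh api repos/OWNER/REPO/actions/runners \
  --jq '.runners[] | {name, status, labels: [.labels[].name]}'

# A runner that dropped its registration simply won't be in the list,
# which is usually the whole answer to "why is this job stuck"

Org-level runners live under a separate endpoint (orgs/ORG/actions/runners), so check both if your labels are defined at the org level.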

Caching feels solid until you realize how quietly it fails. The cache key model is powerful in theory — you hash your lockfile, cache the node_modules, skip reinstall on hits. But cache misses are completely silent. If your cache-dependency-path globs don't match exactly, you get a miss, a full reinstall, and a 10-minute build with zero indication of why. I've debugged this three separate times across different teams who swore their cache key was correct. The fix is always adding a cache-hit output check and logging it explicitly:

- uses: actions/cache@v4
  id: node-cache
  with:
    path: node_modules
    key: node-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}

# This step reveals whether you actually got a hit — Actions won't tell you otherwise
- run: echo "Cache hit: ${{ steps.node-cache.outputs.cache-hit }}"

My honest line in the sand: Actions is genuinely good tooling for teams up to roughly 50 engineers and one or two repos. Below that threshold, the UX is excellent — the marketplace is huge, the YAML syntax is learnable, and the GitHub integration is smooth. Past that threshold, you're not building pipelines anymore, you're writing framework code to work around Actions' limitations. You spend engineering hours maintaining workflow abstractions instead of shipping product. That's the signal to start looking at tools designed from the ground up for orchestration at scale — things like Buildkite, Temporal for orchestrating build logic, or custom runners behind a proper job scheduler. The platform stops being an accelerant and starts being the project.

Buildkite: The First Orchestrator That Felt Like It Was Built for This

The thing that actually hooked me on Buildkite wasn't the UI or the docs — it was realizing I could generate my pipeline at runtime based on what actually changed. Every other tool I'd used wanted a static YAML file committed to the repo. Buildkite flips that: you commit a bootstrap pipeline that runs a script, and that script calls buildkite-agent pipeline upload with whatever YAML it just generated. That single capability unlocks monorepo CI that doesn't make you want to quit your job.

Here's what that dynamic pipeline generation actually looks like in practice. Your .buildkite/pipeline.yml is just the entry point:

steps:
  - label: ":pipeline: Generate pipeline"
    command: ".buildkite/generate-pipeline.sh | buildkite-agent pipeline upload"

And the shell script uses git diff to figure out what changed:

#!/bin/bash
# Only run service tests if that service's code actually changed
changed=$(git diff --name-only origin/main...HEAD)

cat <<EOF
steps:
$(echo "$changed" | grep '^services/' | cut -d/ -f2 | sort -u | while read -r svc; do
    # one test step per changed service (directory layout here is illustrative)
    printf '  - label: "Test %s"\n    command: "cd services/%s && make test"\n' "$svc" "$svc"
  done)
EOF

The step-level parallelism model is where Buildkite separates itself from "just a fancier shell script runner." You define a step with `parallelism: 50`, spin up 50 agents, and Buildkite distributes test chunks across all of them automatically. Each agent picks up a slice, runs it, reports back. A `wait` step then blocks until every parallel job returns green before the pipeline moves forward. I've seen this take a 40-minute Django test suite down to under 4 minutes with 12 agents — the math is roughly linear until you hit fixture setup overhead.

steps:
  - label: "RSpec %n"
    command: "bundle exec rspec --format progress"
    parallelism: 20
    # Buildkite sets BUILDKITE_PARALLEL_JOB and BUILDKITE_PARALLEL_JOB_COUNT
    agents:
      queue: "test-runners"

  - wait  # hard gate — all 20 must pass before deploy step runs

  - label: "Deploy to staging"
    command: "./deploy.sh staging"
    branches: "main"

Getting a self-hosted agent running on Ubuntu takes about 90 seconds:

# One-liner install — works on Ubuntu 20.04/22.04/24.04
curl -sL https://raw.githubusercontent.com/buildkite/agent/main/install.sh | bash

# Set your token in the config
echo 'token="YOUR_AGENT_TOKEN"' >> ~/.buildkite/buildkite-agent.cfg
echo 'tags="queue=test-runners,os=linux"' >> ~/.buildkite/buildkite-agent.cfg

# Run it (or wire it to systemd for production)
buildkite-agent start

The gotcha that tripped my team up for two days: where your artifacts actually live is not something you can leave on the defaults. When you call `buildkite-agent artifact upload`, it needs somewhere to put the files. If you don't set `BUILDKITE_ARTIFACT_UPLOAD_DESTINATION`, artifacts silently go to Buildkite's managed storage, which costs extra and has retention limits. Set it on the agent host before you go anywhere near production:

# In your agent environment hook: ~/.buildkite/hooks/environment
export BUILDKITE_ARTIFACT_UPLOAD_DESTINATION="s3://your-ci-artifacts-bucket/buildkite"
export AWS_DEFAULT_REGION="us-east-1"

# Your agent's IAM role needs s3:PutObject on that bucket
# Forgetting this means artifact uploads silently fall back to managed storage

The pricing model is genuinely different from GitHub Actions or CircleCI, and it matters at scale. Buildkite charges per seat per month (check their site for current numbers — around $15-25/user/month depending on tier), not per compute-minute. Your EC2 or GCP Spot Instance costs are entirely yours to manage. This means if you run 20 agents on Spot Instances 8 hours a day, your Buildkite bill is predictable and flat while your infra bill scales with actual usage. Contrast this with GitHub Actions at $0.008/minute for Linux — run a 2-hour test suite on 20 parallel runners every hour and the per-minute model gets painful fast. The per-seat model rewards teams who are good at infra; the per-minute model rewards teams who want zero ops overhead and don't mind paying the premium.

Bazel for Build Orchestration: When Your Problem Is Actually a Dependency Graph Problem

The thing that converted me to Bazel wasn't performance benchmarks — it was watching our CI queue burn 22 minutes rebuilding a Go service that hadn't changed in three weeks because someone touched a shared proto file. That's not a CI problem. That's a dependency graph problem, and no amount of clever bash caching is going to solve it cleanly. Bazel's model is fundamentally different: it tracks the exact inputs to every build target, hashes them, and refuses to rerun anything unless those inputs actually changed. That's the insight that makes it worth the learning curve.
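
That shared-proto scenario is exactly what Bazel's query language makes visible before you burn the CI minutes. A quick sketch (the target labels here are hypothetical; substitute your own):

# Reverse dependencies: what actually needs to rebuild when the proto changes?
bazel query 'rdeps(//..., //proto:shared_api_proto)'

# And why does this service depend on it at all? Show one dependency path
bazel query 'somepath(//services/payments:server, //proto:shared_api_proto)'

Only the targets in that rdeps set need to rebuild; everything else should come straight out of the cache.
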
The incremental build story only works if you're also using remote caching — otherwise you're just getting local caching, which doesn't help your teammates or your CI runners. Here's the `.bazelrc` snippet that actually matters when you're pointing at a shared cache backend (I've used both Google Cloud Storage and a self-hosted Bazel remote cache via [bazel-remote](https://github.com/buchgr/bazel-remote)):

# .bazelrc
build --remote_cache=grpcs://your-cache.internal:9092
build --remote_cache_compression=true

# This is critical — without it, failed actions poison the cache
build --remote_upload_local_results=true
build --remote_timeout=60

# Separate cache key per OS so Linux CI doesn't serve broken artifacts to macOS devs
build --host_platform_remote_properties=//platforms:linux_x86_64

The cache hit rate is what you watch. First week you'll see maybe 40%. After two or three CI runs warming it up, you should be at 80–90% for unchanged targets. If you're not, check your `--workspace_status_command` — injecting git SHAs directly into action inputs is the most common cache-busting mistake I've seen teams make without realizing it. For actual invocation, I run this in CI rather than bare `bazel build`:

bazel build //services/... \
  --build_event_json_file=bep.json \
  --build_event_publish_all_actions \
  --keep_going

# bep.json feeds into observability tools — Buildkite ingests it natively,
# and you can parse it yourself with the BuildBuddy OSS stack

The `--build_event_json_file` flag is underrated. That JSON file contains per-target timing, cache hit/miss status, and test results in a structured format. Pipe it into anything — BigQuery, your own dashboard — and you finally have data to answer "why did this build take longer today?" instead of guessing.

Here's the rough edge nobody puts in the tutorial: **BUILD file maintenance is a real, ongoing cost**. Every time you add a new import, rename a package, or reorganize directories, somebody has to update `BUILD` files. Tools like `gazelle` handle Go and some other languages reasonably well:

# Auto-generate BUILD files for Go packages
bazel run //:gazelle

# After adding a new go dependency
bazel run //:gazelle -- update-repos -from_file=go.mod -to_macro=deps.bzl%go_deps

But for Python, Java, or mixed-language repos, gazelle coverage is spottier. I've seen teams assign a rotating "Bazel tax" responsibility where someone owns fixing broken BUILD files during a sprint. That's not a dealbreaker — it's just a real operational cost you need to budget for, not a one-time migration.

My honest take on when Bazel is actually worth it: if you have fewer than 10 services in a monorepo, the BUILD file overhead probably exceeds what you save. At 10–15+ services with shared libraries, the math flips hard — I've seen CI times drop from 18 minutes to 4 minutes after the migration settled. For polyrepo setups, Bazel is almost certainly overkill. You're solving a monorepo graph problem; if your services don't share a build graph, you don't have the problem Bazel solves.

The common production setup I keep seeing is **Bazel + Buildkite** or **Bazel + GitHub Actions**, and the division of responsibility makes sense once you see it: Bazel owns what to build and in what order (the dependency graph), while the CI platform owns where and when to run it (runner orchestration, secrets, notifications, PR integration). GitHub Actions can't do Bazel's job, and Bazel has no concept of "notify Slack on failure" — don't try to collapse those responsibilities.
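
In practice the CI side of that split stays thin. A sketch of what a single CI step might run, assuming the `.bazelrc` above and a Buildkite agent (the script name is made up):

#!/usr/bin/env bash
# ci/bazel-step.sh: the CI layer runs Bazel and ships the structured results, nothing clever
set -euo pipefail

status=0
bazel test //services/... \
  --build_event_json_file=bep.json \
  --keep_going || status=$?

# Upload the build event log even when tests fail; that's when you need it most
buildkite-agent artifact upload bep.json

exit "$status"
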
Nx for JavaScript/TypeScript Monorepos: The Practical Middle Ground

The thing that surprised me most about Nx wasn't any single feature — it was that I could drop it into an existing monorepo on a Friday afternoon and have `affected` working before EOD. No new build file format, no rewriting `package.json` scripts, no convincing half the team to learn a new DSL. Nx reads your existing `package.json` workspaces, infers the dependency graph from your imports, and starts giving you useful output immediately. Compared to migrating to Bazel, which I've seen derail teams for months, that zero-friction adoption is genuinely significant.

The `affected` command alone is worth the install. The mental model shift is real: instead of running tests on every package in CI, you run tests on what changed relative to your base branch.

# Only tests packages that have changed since origin/main
# --parallel=3 spreads them across 3 processes locally
npx nx affected --target=test --base=origin/main --parallel=3

# See which projects are actually affected before running anything
npx nx affected:graph --base=origin/main

The graph traversal goes beyond direct changes — if you touched `@acme/utils` and five packages import it, all five are marked affected. I caught a regression in a downstream package this way that would have slipped through on a per-package CI run. The catch is that Nx's graph inference depends on actual ES import statements; dynamic `require()` calls and some monorepo setups with unusual path aliasing will confuse it. Worth auditing your dep graph with `npx nx graph` before trusting it completely.

Here's a realistic `nx.json` with caching and `targetDefaults` configured:

{
  "affected": { "defaultBase": "origin/main" },
  "targetDefaults": {
    "build": {
      "dependsOn": ["^build"],
      "cache": true,
      "outputs": ["{projectRoot}/dist"]
    },
    "test": {
      "cache": true,
      "outputs": ["{projectRoot}/coverage"]
    },
    "lint": { "cache": true }
  },
  "tasksRunnerOptions": {
    "default": {
      "runner": "nx/tasks-runners/default",
      "options": {
        "cacheableOperations": ["build", "test", "lint"],
        "parallel": 3
      }
    }
  }
}

The `"dependsOn": ["^build"]` line is easy to miss but critical — the `^` prefix means "build all upstream dependencies first." Without it, you'll get race conditions on cold cache runs where a package builds before its dependency has output.

Distributed Task Execution (DTE) is where Nx punches above its weight for large repos. You point multiple CI agents at the same task graph and Nx Cloud coordinates which agent picks up which task, handles cache sharing between agents, and reassembles the results. You don't write any sharding logic — no manual splitting of test files, no `split-tests` plugins, no figuring out how to balance agent load. In a GitHub Actions matrix setup it looks roughly like this:

# In your CI workflow, each matrix job just runs:
npx nx-cloud start-agent

# And your main job orchestrates:
npx nx-cloud start-ci-run --distribute-on="5 linux-medium-js"
npx nx affected --target=build,test,lint --parallel=3
npx nx-cloud stop-all-agents

Now the honest trade-off: Nx Cloud's free tier gives you a computation credit allowance, and it goes faster than you'd expect on a busy repo. As of mid-2025 the free plan covers a limited monthly compute budget — check [nx.app/pricing](https://nx.app/pricing) directly because they've adjusted it a few times. I've seen teams hit the ceiling around mid-month on repos with 40+ packages running full CI on every PR. Self-hosting the Nx Cloud runner is possible but requires a paid plan.
If you want DTE without paying, you're back to writing the sharding logic yourself, which somewhat defeats the purpose.

The ceiling I keep hitting with Nx is the JS-ecosystem boundary. If your monorepo has Go microservices, a Python ML pipeline, or Rust tooling alongside the TypeScript packages, Nx has no native understanding of those. You can wrap them with a generic `executor` that shells out to `go build` or `cargo test`, and the task graph will treat them as black boxes, but you lose fine-grained input hashing and caching for those targets. In that mixed-language scenario I've ended up using Nx for the TS packages and either Bazel or Earthly for the polyglot parts — which means you're maintaining two orchestration systems. That's the point where Turborepo has the same problem, and you start looking at something like Pants or Bazel that was designed for heterogeneous repos from day one.

The Architecture That Actually Works at Scale

The architecture that actually works isn't one big CI system doing everything — it's two distinct layers with a clean handoff between them. The build graph layer (Bazel, Nx, Turborepo) understands your code: what changed, what depends on what, what can be cached. The pipeline orchestrator layer (Buildkite, GitHub Actions, Argo Workflows) understands your infrastructure: where to run things, how to fan out jobs, what environment secrets to inject. When you collapse these into one layer — usually a fat Jenkinsfile that does everything — you get a system where changing a deploy script invalidates your build cache, and where your CI has no idea that touching `packages/auth` doesn't require rebuilding `packages/payments`.

The concrete wiring looks like this: Nx (or Bazel) computes the affected graph and emits a list of targets. Your pipeline orchestrator reads that list and spawns parallel jobs. Here's a simplified Buildkite dynamic pipeline that does exactly this:

#!/bin/bash
# .buildkite/pipeline.sh
set -euo pipefail

# nx affected outputs newline-separated project names
AFFECTED=$(npx nx show projects --affected --base=origin/main)

# Emit dynamic steps to Buildkite's pipeline upload
{
  echo "steps:"
  for project in $AFFECTED; do
    echo "  - label: \"Build $project\""
    echo "    command: \"npx nx build $project --output-style=stream\""
    echo "    agents:"
    echo "      queue: build"
    # Key lets downstream steps depend on this artifact
    echo "    key: \"build-${project}\""
  done
} | buildkite-agent pipeline upload

Artifact promotion is where most teams get this wrong. The rule is: build once, tag the artifact with a content hash or commit SHA, then carry that same artifact forward through every stage. You're not rebuilding for staging and again for production — you're promoting. In practice this means your build job pushes `myapp:sha-a3f92c1` to your registry, and every subsequent job — integration tests, smoke tests, canary deploy, full prod deploy — pulls that exact tag. The staging deploy job doesn't call `docker build` again. It calls `docker pull myapp:sha-a3f92c1 && helm upgrade --set image.tag=sha-a3f92c1`. This sounds obvious until you find a team where staging passes and prod fails because the image was rebuilt and a dependency resolved differently. Content-addressable artifacts eliminate that entire class of bug.

The queue architecture question — dedicated agents per team vs. shared pool — comes down to one real tradeoff: isolation vs. utilization.
Shared pools get better machine utilization (no idle agents sitting on team A's queue while team B is swamped), but you get noisy neighbor problems fast. One team's 45-minute monorepo build starves everyone else. My current setup uses dedicated queues for anything touching production deploys or requiring specific hardware (GPU nodes, ARM runners), and shared pools for unit tests and lint. In Buildkite this is just an agent tag:

# For critical path steps that need isolation
agents:
  queue: "prod-deploy"
  team: "platform"

# For commodity workloads — shared pool is fine
agents:
  queue: "general"

The gotcha with shared pools: you need queue depth alerting. If the general queue is 40 jobs deep, your developers are waiting 20 minutes for lint feedback and you won't know until someone complains. Set up a CloudWatch metric (or Datadog, whatever you use) on agent queue depth, and page when it exceeds 10 for more than 5 minutes.

On observability: green/red in the CI UI is not a dashboard. You need build duration trends, cache hit rates, flaky test rates by test file, and queue wait time — all in your existing monitoring stack, not buried in a CI web interface that only three people have bookmarked. Buildkite exposes a metrics endpoint; Nx Cloud has an API. For self-managed setups I pipe job lifecycle events to a Postgres table via webhook, then graph it in Grafana. The metric that actually changed how we staffed agents was p95 queue wait time per team, which you'll never see in a default CI UI. For Nx specifically, you can pull structured build data from their run summaries:

# After nx run, output is structured JSON with timing per task
npx nx run-many --target=build --affected \
  --output-style=static \
  --json > /tmp/nx-build-results.json

# Pipe this to your observability pipeline
cat /tmp/nx-build-results.json \
  | jq '.[] | {project: .project, duration: .duration, cacheHit: .cacheStatus}' \
  | curl -s -X POST https://your-ingest-endpoint/build-events \
      -H "Content-Type: application/json" \
      -d @-

Pipeline definitions in the repo sounds like table stakes but most teams I've worked with have at least one critical pipeline defined through a UI somewhere, owned by one person, with no change history. The actual requirement: every pipeline definition lives in `.buildkite/`, `.github/workflows/`, or equivalent. Changes to pipeline files go through the same PR review as application code. This means your pipeline has a git blame, a rollback path, and can't be silently edited at 2am when something's broken. The thing that makes this stick in practice is treating pipeline YAML changes with the same review bar as infra changes — require a second approval from someone on the platform team. One concrete rule I enforce: no hardcoded secrets in pipeline YAML, ever. Secret references only. If I see `env: AWS_SECRET_KEY: "AKIA..."` in a PR, it's an instant rejection and a conversation about secret scanning in your pre-commit hooks.

Comparison: Bash Glue vs GitHub Actions vs Buildkite vs Bazel+CI

The comparison everyone wants is almost never done honestly because the people writing it are usually selling one of these tools. I've run all four in production environments with 50+ engineers, so here's what actually happens after the honeymoon period ends.

Side-by-Side Breakdown

| Tool | Parallelism Model | Dynamic Pipelines | Artifact Caching | Self-Hosted Runners | Cost Model | Maintenance Overhead | What Breaks in Practice |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Bash Scripts** | `xargs -P` at best; no fan-out model | if/else hell nested 4 levels deep within a week | None. You tar and hope. | Runs wherever bash runs — and that's the problem | Free until someone quits | Very high — only the author understands it | Race conditions in parallel jobs, silent exit code swallowing, no retry logic, oncall gets paged when the one person who wrote it leaves |
| **GitHub Actions** | Matrix strategy; managed runners spin up per job | Reusable workflows + `needs` DAG; limited runtime branching | Actions Cache API; 10 GB limit per repo, 7-day eviction | Yes, via `runs-on: self-hosted` labels | $0.008/min Linux, $0.016/min Windows on hosted runners; free tier 2,000 min/month | Low to medium — YAML sprawl hits hard at scale | Cache misses tank build times suddenly; matrix explosion makes costs unpredictable; YAML size limits force ugly workarounds; no good way to fan out dynamically beyond what you defined at write time |
| **Buildkite** | Agent pools; scale to zero or autoscale on AWS/GCP/k8s | **pipeline upload** — generate steps at runtime from code, not YAML | No built-in; delegates to S3/GCS/Artifactory via plugins | Always self-hosted agents; that's the whole model | $15/user/month (or $25 for enterprise tier); compute is 100% your cost | Medium — agent fleet ops + buildkite-agent config | Agent fleet drift if you don't pin agent versions; no managed cache means you're maintaining an S3 bucket and lifecycle rules yourself; audit trail is excellent but the UI search is slow past 90 days of history |
| **Bazel + CI** | Hermetic action graph; Bazel decides parallelism, not the CI | BUILD files define the graph; no runtime pipeline mutation needed | Remote cache (gRPC or HTTP); content-addressed, shared across all engineers and CI | Runs anywhere Bazel runs; pairs with any CI as a thin wrapper | Bazel itself is free (Apache 2); remote execution (RBE) via EngFlow or BuildBuddy starts ~$200/month | Very high initially; medium long-term if you commit | Initial migration takes months; `gazelle` doesn't cover everything; sandbox breakage on macOS; BUILD file maintenance is a discipline unto itself; non-trivial languages (e.g., Rust via `rules_rust`) lag behind upstream toolchains |

The Practical Decision Tree

If your monorepo has fewer than 5 services and your CI runs under 10 minutes, GitHub Actions is the right answer and nothing in this table should change your mind. The matrix strategy covers 90% of parallelism needs and the managed runners mean zero agent ops. The thing that catches teams off guard is the 10 GB cache limit — once you have a fat node_modules and a Docker layer cache fighting for space, you start seeing full cache misses on Monday mornings after the 7-day eviction window hits the weekend.

Buildkite's `pipeline upload` command is genuinely the killer feature that nothing else matches cleanly. You write a script that emits JSON or YAML to stdout, and Buildkite ingests it as live pipeline steps.
That means you can query your actual affected files, hit an API, or read from a config file and generate _exactly_ the steps needed — not a 50-job matrix where 40 jobs exit early:

#!/usr/bin/env bash
# .buildkite/pipeline.sh — runs first, generates the real pipeline
set -euo pipefail

# Only run tests for changed packages in the monorepo
changed=$(git diff --name-only origin/main...HEAD | xargs -I{} dirname {} | sort -u)

echo "steps:"
for pkg in $changed; do
  echo "  - label: \":jest: Test $pkg\""
  echo "    command: \"cd $pkg && yarn test --ci\""
  echo "    agents:"
  echo "      queue: default"
done

That pattern alone saves 40–60% of compute time on a mid-sized monorepo compared to a static matrix, because you stop running tests for packages that didn't change.

Bazel's remote cache is the only option here where **the cache is shared between your laptop and CI by default**. When I first saw a local `bazel build //...` complete in 8 seconds because CI had already built and cached every artifact, it felt like cheating. BuildBuddy's free tier gives you 3 GB of remote cache — enough to evaluate. The honest tradeoff: you're not just adopting a CI tool, you're adopting a build system religion. Every new dependency, every wrapper script, every new language has to go through Bazel. Teams that half-commit get the worst of both worlds.

The hidden cost in the Bash row is always personnel. Nobody prices in the three hours an engineer spends every quarter untangling a CI script they didn't write. Across a team of 15 over two years, that's a real number that absolutely dwarfs the licensing cost of Buildkite or even a paid Bazel RBE tier. The scripts keep working until they don't, and when they break they break silently — `set -e` doesn't save you from a subshell that swallowed its own error.

When to Stick with What You Have

The honest answer most orchestrator vendors won't tell you: if you're under 15 engineers and running fewer than 5 repos, GitHub Actions with properly structured reusable workflows is probably the correct choice. Not "good enough for now" — actually correct. The `workflow_call` trigger lets you centralize logic into callable workflows, and combined with composite actions, you can eliminate most of the duplication people blame on Actions. I've seen teams burn a sprint migrating to Buildkite only to realize their real problem was copy-pasted YAML, not the runner infrastructure.

Build duration is the other honest signal. If your full CI pipeline — lint, test, build, deploy to staging — completes in under 5 minutes end-to-end, orchestration overhead is pure noise. The graph construction, cache coordination, and agent pooling that Nx Cloud or Turborepo handle start paying off when you have hundreds of packages and parallel execution can shave 20 minutes. On a 4-minute monorepo build, you're adding complexity for a 40-second gain. That math doesn't work.

The migration cost is something people consistently underestimate. Moving a mature CI setup to Buildkite — with existing deploy gates, secrets management, environment promotions, and rollback hooks — realistically costs 2 to 3 months of focused engineer time. That's not 2 months of a single person occasionally poking at it. That's someone's primary focus, including the week where a production deploy breaks in a way that traces back to a pipeline timing assumption that was baked into the old setup. Factor that cost before you write the Jira ticket.

That said, there are clear signs you've actually hit the wall and can't keep patching around it:

  • **Builds regularly taking 45+ minutes** — engineers stop treating CI as a fast feedback loop and start merging speculatively
  • **Engineers running builds locally and skipping CI push** — this is the most damning signal; it means the pipeline has become a liability, not an asset
  • **Flaky tests masking real failures** — when retry logic is hiding actual regressions because the team has trained itself to ignore red builds
  • **No isolation between what changed and what runs** — every PR triggers the full suite regardless of which package was touched

The incremental path I'd actually recommend: before switching orchestrators, add `nx affected` filtering to your existing pipelines. This alone — running tests only against packages affected by a changeset — can cut 60–70% of unnecessary work without touching your runner infrastructure. Here's the pattern I use with GitHub Actions:

- name: Get affected projects
  run: npx nx affected --target=test --base=origin/main --head=HEAD --plain
  id: affected

- name: Run affected tests
  run: npx nx affected --target=test --base=origin/main --head=HEAD --parallel=3

Only after you've done this and your builds are _still_ painful should you evaluate a full orchestrator swap. The affected filtering surfaces whether your real problem is "we run too much" (solvable with Nx) versus "we run the right things but they're slow and we need distributed execution across 20 agents" (where Buildkite or Nx Cloud's remote execution actually earns its keep). Skipping that diagnostic step is how teams end up doing a 3-month migration and landing back at the same wall.

Migration Playbook: Moving Off Bash Without Breaking Your Release

The thing that kills most CI migrations isn't picking the wrong orchestrator — it's starting in the wrong place. Every team I've seen attempt this ends up doing the same thing: opening their Jenkinsfile or GitHub Actions YAML on day one, deciding it's a mess, and immediately starting to rewrite. Two weeks later they have a half-migrated pipeline that works on nobody's laptop and breaks production. Don't do that. Before you touch a single config file, draw the dependency graph of your current pipeline on paper or in Excalidraw. Not the happy path — the actual graph, including the `|| true` hacks, the scripts that only run on Tuesdays because a cron job triggers them, and the env vars that get set four jobs upstream and consumed without anyone documenting why.

Once you have the graph, resist the urge to migrate everything. I picked the three scripts that caused the most pain — in our case, the Docker build script that had 400 lines of bash with nested conditionals, the deploy script that silently swallowed exit codes, and the integration test runner that nobody could debug remotely. Replacing those three with structured Dagger pipelines or Temporal workflows cuts 80% of your support burden immediately. The boring glue scripts — the ones that just `cp` artifacts between stages or ping Slack — leave those alone until everything else is stable. Migrating everything in one shot means when something breaks on cutover, you have no idea what caused it.

Run old and new pipelines in parallel for at minimum two weeks. This isn't optional. The new orchestrator should shadow the old one: same triggers, same inputs, but the output of the new pipeline doesn't gate the release yet.
You're looking for three things during this phase — timing differences greater than 20%, flaky tests that only appear in the new runner's environment, and any job that succeeds in old-bash but silently no-ops in the new setup. In GitHub Actions you can do this cleanly with a second workflow file pointing at the same repo:

# .github/workflows/pipeline-shadow.yml
on:
  push:
    branches: [main]

jobs:
  shadow-build:
    runs-on: ubuntu-22.04
    continue-on-error: true  # don't block merges during shadow phase
    steps:
      - uses: actions/checkout@v4
      - name: Run new orchestrator
        run: ./scripts/run-dagger-pipeline.sh
        env:
          SHADOW_MODE: "true"  # tells the pipeline to skip actual deploys

Here's the gotcha that bites everyone without fail: secret and environment variable management. Your bash scripts have probably accumulated env vars over years — some passed explicitly, some inherited from the shell, some written to `/tmp/.env` by a script three jobs ago and sourced later. The new orchestrator won't inherit any of that ambient state. Before you cut over, run this audit explicitly: grep your entire scripts directory for every `$VAR`, `${VAR}`, and `source` call, then cross-reference against your CI secret store. We found six variables that existed in nobody's documented secret store — they'd been set manually on a Jenkins agent two years prior and never written down. Those will cause silent failures in the new system, not loud ones, which makes them especially dangerous.

# Quick audit — run this in your repo root before migration
grep -rh '\$[A-Z_]\{2,\}' ./scripts/ \
  | grep -oP '\$\{?[A-Z_]+\}?' \
  | sort -u \
  | sed 's/[${}]//g' > /tmp/vars-in-scripts.txt

# Then diff against what's actually defined in CI
# (for GitHub Actions, this is your repo secrets + org secrets)
# Any line in vars-in-scripts.txt with no corresponding secret = risk

Define what "done" looks like before you start, not after. Pick three numbers and write them down: your current median build time (get this from your CI provider's API or dashboard — don't guess), your flake rate over the last 30 days, and your mean time to debug a failure from alert to root cause. If you can't measure these before migration, you have no way to know if the new system is actually better. In my experience the biggest win from moving to a real orchestrator is the last one — median build time drops maybe 15-20%, but time-to-debug a failure often drops by 60-70% because you get structured logs with task boundaries instead of one giant interleaved stdout blob.

Last thing: tag your old workflows before you do anything. Create a git tag like `pre-orchestrator-migration-2025-07`, push it, and make sure you can deploy from it in under five minutes. Keep this deployable for 30 days post-cutover minimum. This isn't just psychological comfort — it's your actual rollback plan. If the new orchestrator has a bug that only surfaces during a monthly billing job or a hotfix deploy at 2am, you need to be able to revert the pipeline independently of reverting application code. The tag approach works better than a branch here because it's immutable; nobody accidentally commits to it while the migration is in progress.

* * *

_**Disclaimer:** This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content._


Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.
