The engineering team maintains a Python/TypeScript monorepo with 22+ Django microservices, a React frontend, and shared libraries — all running through GitHub Actions CI. Over time, CI had become the productivity bottleneck. PRs routinely waited 10-15 minutes for green checks, and the slowest pipeline consistently exceeded 10 minutes. Engineers were context-switching while waiting for builds, or worse, stacking PRs on top of unverified ones.
I ran a focused initiative to systematically identify and eliminate CI bottlenecks. This post is a record of how I approached it.
1. Finding the Bottleneck
You can't optimize what you can't see. So my first instinct was to write a script to pull GitHub Actions workflow run data from the API and compute aggregate statistics — average, P50, P75, P90 — for both wall clock time and per-job durations across any date range.
This immediately told me two things:
- The largest service's unit-test job was the critical path — at P50 of ~10 minutes, it single-handedly determined the CI wall clock time.
- Other services hovered around 8-9 minutes — they were already running multiple jobs in parallel, so pinpointing their internal bottleneck needed further investigation.
Having this data let me prioritize ruthlessly. Instead of optimizing everything at once, I focused on the jobs that actually moved the wall clock needle.
And it's always satisfying to compare the before vs. after statistics.
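The core of such a measurement script can be sketched as follows. The workflow name, the JSON fields pulled from the `gh` CLI, and the nearest-rank percentile method are my assumptions, not the exact script:

```python
"""Sketch: aggregate GitHub Actions run durations via the gh CLI."""
import json
import statistics
import subprocess
from datetime import datetime


def percentile(durations: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of durations."""
    ordered = sorted(durations)
    idx = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[idx]


def summarize(durations_s: list[float]) -> dict[str, float]:
    """Average and P50/P75/P90 for a set of run durations in seconds."""
    return {
        "avg": statistics.mean(durations_s),
        "p50": percentile(durations_s, 50),
        "p75": percentile(durations_s, 75),
        "p90": percentile(durations_s, 90),
    }


def fetch_durations(workflow: str, limit: int = 200) -> list[float]:
    """Wall-clock seconds per successful run (updatedAt - startedAt)."""
    out = subprocess.check_output([
        "gh", "run", "list", "--workflow", workflow, "--limit", str(limit),
        "--json", "startedAt,updatedAt,conclusion",
    ])
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return [
        (datetime.strptime(r["updatedAt"], fmt)
         - datetime.strptime(r["startedAt"], fmt)).total_seconds()
        for r in json.loads(out)
        if r["conclusion"] == "success"
    ]


# Example usage: print(summarize(fetch_durations("ci.yml")))
```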
2. Low-Hanging Fruit — Modernizing the Toolchain
My first lever was swapping slow tools for faster alternatives. The Rust-based ecosystem has matured to the point where several tools are genuine drop-in replacements, so this felt like a no-brainer:
- Yarn to Bun (written in Zig rather than Rust, but famous for its speed and maturity)
- Webpack to Rspack
- Storybook's Webpack builder to storybook-rsbuild
- Adopted eslint-plugin-oxlint, disabling the ESLint rules that oxlint now handles
Each swap reduced CI time per job by 30-40%.
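As a concrete illustration of the last swap, wiring eslint-plugin-oxlint into an ESLint flat config looks roughly like this. This is a sketch based on the plugin's documented usage, not our actual config:

```js
// eslint.config.mjs (sketch; assumes ESLint flat config is already in use)
import oxlint from 'eslint-plugin-oxlint';

export default [
  // ...your existing ESLint config objects...

  // Appended last so it takes precedence: turns off every ESLint rule
  // that oxlint already covers, so the remaining ESLint run is lighter.
  ...oxlint.configs['flat/recommended'],
];
```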
3. Low-Hanging Fruit — Docker Build Optimization
Docker builds were another time sink. I attacked from multiple angles:
- Enabled `.dockerignore` — builds weren't using ignore files because the build context was outside their scope
- Optimized `uv sync` in Dockerfiles — eliminated unnecessary dependency installation loops; `uv build --package` uses build isolation and never touches the workspace venv
- Cleaned up stale cross-service dependencies — after migrating to uv workspaces, some services had phantom dependencies on unrelated services, triggering unnecessary rebuilds
This reduced Docker build CI time by about 50%.
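To illustrate the `uv sync` layering, here is a minimal Dockerfile sketch. The package name `api` and the base images are hypothetical; the flags follow uv's documented Docker guidance:

```dockerfile
# Sketch of one workspace member's image ("api" is a made-up name)
FROM python:3.12-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
WORKDIR /app

# Install third-party dependencies first, so this layer stays cached
# across source-code edits.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-install-workspace

# Only now copy source; code changes no longer invalidate the dependency layer.
COPY . .
RUN uv sync --frozen --no-dev --package api
```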
4. Test Duration Enforcement & Slow Test Fixes
After tackling the low-hanging fruit, I turned to the tests themselves. The repo already had a check-test-durations composite action that parses JUnit XML and reports the top 10 slowest tests — but it was only wired up in 2 of 22 services.
Rolling Out Visibility
I added the duration check to all workspace CI workflows with continue-on-error: true — visibility without blocking builds.
This immediately surfaced tests taking 10-30+ seconds across services that had never been profiled.
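The rollout step looked roughly like this. The action path and input names here are illustrative, not the repo's exact ones:

```yaml
# Report-only rollout: surface slow tests without failing the build
- name: Check test durations
  uses: ./.github/actions/check-test-durations  # existing composite action
  continue-on-error: true
  with:
    junit-path: reports/junit.xml
```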
Fixing the Worst Offenders
With data in hand, I targeted the slowest individual tests. Some were making unnecessary network calls, others were creating excessive test data, and a few were simply doing too much in a single test. I fixed tests exceeding 10 seconds across the two slowest services, bringing meaningful improvements before touching any CI infrastructure.
(One fun find: somewhere the application code called time.sleep(10), and the test didn't mock it — so the test took at least 10 seconds for free. Sometimes reducing test duration is that simple.)
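A minimal reproduction of that pattern, with a hypothetical function standing in for the application code:

```python
import time
from unittest.mock import patch


def poll_until_ready() -> str:
    # Hypothetical application code: unconditionally waits before returning.
    time.sleep(10)
    return "ready"


def test_poll_until_ready_fast():
    # Patching time.sleep turns a guaranteed 10-second test into milliseconds,
    # and we can still assert the production code asked for the wait.
    with patch("time.sleep") as mock_sleep:
        assert poll_until_ready() == "ready"
        mock_sleep.assert_called_once_with(10)
```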
Enforcing the Threshold
After the initial round of fixes, I made the duration check a hard enforcement — any individual test exceeding 10 seconds fails the build. This prevents slow tests from creeping back in, turning a one-time fix into a sustainable practice.
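The enforcement itself is simple to sketch: parse the JUnit XML and fail when any case exceeds the budget. The attribute names follow the JUnit XML convention; the threshold constant and function name are mine:

```python
"""Sketch of a hard duration gate over JUnit XML test reports."""
import xml.etree.ElementTree as ET

THRESHOLD_S = 10.0  # any single test over this fails the build


def slow_tests(junit_xml: str, threshold: float = THRESHOLD_S) -> list[tuple[str, float]]:
    """Return (test id, seconds) for every testcase over the threshold."""
    root = ET.fromstring(junit_xml)
    offenders = []
    for case in root.iter("testcase"):
        duration = float(case.get("time", "0"))
        if duration > threshold:
            offenders.append((f"{case.get('classname')}::{case.get('name')}", duration))
    return offenders
```

In CI, the job would exit non-zero whenever `slow_tests(...)` returns a non-empty list.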
5. Caching: Every Layer Counts
GitHub Actions provides a 10 GB cache quota per repository. In a monorepo with 22+ services, that's not much — so every cache needs to earn its space. I identified and enabled caching at four layers of the CI pipeline.
.git Cache
In a monorepo, git clone with fetch-depth: 0 (needed for change detection and visual regression diff baselines) is expensive — the repo has significant history. I set up a scheduled workflow that runs every 6 hours to pre-populate the .git directory cache on master.
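The warm-up workflow is a few lines of YAML. The names, cache key, and cadence below are illustrative:

```yaml
# Sketch: pre-populate the .git cache from master every 6 hours
name: warm-git-cache
on:
  schedule:
    - cron: '0 */6 * * *'
jobs:
  warm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # full history, as change detection needs it
      - uses: actions/cache/save@v4
        with:
          path: .git
          key: git-dir-${{ github.sha }}
```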
uv Cache
I enabled setup-uv's built-in cache across all CI workflow files.
The cache is keyed on the uv.lock hash, so it invalidates exactly when dependencies change. Impact: ~5 seconds saved per job on the "Install uv" step (8s → 3s average). A small win per job, and the cache itself only takes a few hundred KB of quota.
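Enabling it is a two-line change per workflow, using setup-uv's documented inputs (the version pin here is illustrative):

```yaml
- name: Install uv
  uses: astral-sh/setup-uv@v5
  with:
    enable-cache: true
    cache-dependency-glob: "uv.lock"   # invalidate exactly when deps change
```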
node_modules Cache
The frontend workflows cache node_modules after bun install. The cache key was previously overloaded — some callers passed a branch name (${{ github.head_ref }}), others a static date string. Branch-keyed caches created redundant entries for the same lockfile, wasting quota.
I simplified to a single lockfile-based key using bun.lock.
GitHub Actions already scopes cache access by branch (PRs can read from the base branch's cache but not other PRs'), so per-branch keys were pure overhead. This change eliminated redundant cache entries and improved hit rates.
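The simplified step looks roughly like this (step layout illustrative):

```yaml
# One cache entry per lockfile state, not per branch
- uses: actions/cache@v4
  with:
    path: node_modules
    key: node-modules-${{ hashFiles('bun.lock') }}
- run: bun install --frozen-lockfile
```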
ESLint Cache
ESLint is one of the slower steps in frontend validation. I enabled its persistent cache by restoring .eslintcache between runs.
On cache hit, ESLint only re-lints files that changed since the last run, skipping the majority of the codebase.
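A sketch of the wiring. The cache-key strategy shown (write-per-commit with a prefix fallback) is one reasonable choice, not necessarily the exact one used:

```yaml
- uses: actions/cache@v4
  with:
    path: .eslintcache
    key: eslint-cache-${{ github.sha }}
    restore-keys: eslint-cache-     # fall back to the most recent entry
- run: bunx eslint . --cache --cache-location .eslintcache
```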
6. Test Parallelization (with Cost in Mind)
Parallelization is the most effective lever for reducing wall clock time — but in GitHub Actions, more parallel jobs means more billable minutes. Every matrix shard spins up a fresh runner, installs dependencies, and tears down. The setup overhead isn't free. I approached this deliberately, targeting parallelization where the payoff justified the cost.
pytest-xdist: Free Parallelism
The lowest-cost optimization is utilizing all CPUs within a single runner. I enabled pytest-xdist with -n auto in services that were running tests serially.
For one service, this cut pytest execution from ~4m30s to ~2m07s — roughly 2x faster with zero additional runner cost.
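The change itself is tiny. Assuming pytest-xdist is installed, it can live in the pytest config:

```ini
[pytest]
# -n auto: one pytest-xdist worker per available CPU core
addopts = -n auto
```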
Matrix Sharding: Deliberate Trade-off
For the largest service (8,674 tests, ~10 minute P50), single-runner parallelism wasn't enough. I restructured its CI into 7 matrix shards to keep each under 5 minutes.
The cost trade-off: 7 shards means 7x the setup overhead (dependency installation, database creation). Total billable minutes actually increased. But the wall clock time — what engineers wait on — dropped from ~10 minutes to ~5 minutes. For a team making dozens of PRs daily, the productivity gain far outweighs the compute cost.
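The restructuring can be sketched like this. The post doesn't name the splitting mechanism; pytest-split's `--splits`/`--group` flags are shown as one common way to do it:

```yaml
jobs:
  tests:
    strategy:
      fail-fast: false
      matrix:
        group: [1, 2, 3, 4, 5, 6, 7]   # 7 shards, each targeting < 5 minutes
    steps:
      # ...checkout, dependency install, and database setup repeat per shard;
      # that duplication is exactly the billable-minutes trade-off...
      - run: uv run pytest --splits 7 --group ${{ matrix.group }}
```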
Takeaways
- Measure first, always. A simple Python script using the `gh` CLI gave me reproducible P50/P75/P90 data across any date range. Way more reliable than eyeballing the Actions UI — and it makes before/after comparisons trivial.
- Modernize your toolchain — it's free speed. Bun, Rspack, and oxlint are drop-in replacements that delivered 30-40% speedups per job with minimal migration effort. Highest ROI work I did.
- Instrument test durations, then enforce them. A 30-second test hiding in a 6-minute suite is invisible until you look. These are often the easiest wins — and a hard time budget prevents regression.
- In a monorepo, cache quota is a shared resource. GitHub Actions' 10 GB limit is shared across all workflows. A poorly-keyed cache doesn't just waste space — it actively evicts caches that matter.
- Parallelism costs money — spend it wisely. pytest-xdist within a single runner is free performance. Matrix sharding trades billable minutes for wall clock time. Make that trade deliberately, and know the numbers before and after.