We have linters for syntax and scanners for security. It’s time we started linting for "will this wake me up at 3 AM?"
We have all been there.
You submit a Pull Request. The CI passes green. Your colleagues review it and comment "LGTM!" The code is merged and deployed.
Three days later, at 3:17 AM, PagerDuty fires.
The root cause wasn't a syntax error. It wasn't a logic bug that unit tests could catch. It was something subtle: an HTTP client missing a timeout setting that caused a cascading failure when a downstream service hiccuped.
The "Senior Intuition" Gap
Why did the PR review miss that missing timeout?
Because standard code reviews usually focus on code style, logic correctness, and maintainability. Standard static analysis tools catch syntax errors or obvious security flaws.
But nobody is actively grepping for operational maturity.
Catching these latent failure modes usually relies on "senior engineer intuition": the spidey sense a battle-hardened SRE develops after years of being woken up on-call. They glance at code and immediately think: “Where is the retry-with-backoff logic here?” or “This unchecked environment variable is going to break startup one day.”
The problem is that senior intuition doesn't scale. You can't clone your best SRE to review every PR.
The Failure of Checklists
Many organizations try to solve this with the dreaded "Production Readiness Checklist" — a dusty Confluence page with 50 questions like "Have you considered failure domains?"
Let’s be honest: nobody uses these. They are manual, tedious, done at the very end of the development lifecycle (too late to change architecture), and are usually just "checkbox theatre" to appease management.
If it's not automated, it doesn't exist.
Shifting "Operability" Left
We need to treat operational requirements the same way we treat code style. If gofmt fails, you can't merge. If your code contains latent operational risks, you shouldn't be able to merge either.
I wanted a tool that codified that "senior engineer intuition" into something executable. I wanted a scanner that doesn't care about my variable names, but cares deeply about whether my application will survive a partial network outage.
Since I couldn't find one that fit my needs, I built it.
Introducing Production-Readiness
Production-Readiness is an open-source, opinionated scanner designed to detect operational blind spots before they hit production.
It’s not a replacement for Prometheus or Datadog. It’s a pre-flight check. It looks for the patterns that look "correct" in code but cause fires in production.
What does it actually catch?
Unlike a standard linter, this tool is looking for semantic operational patterns. Here are just a few examples of rules that are difficult to catch with standard regex grep:
1. The "Hanging Client" Trap
It's shockingly easy in many languages to instantiate an HTTP client with infinite timeouts by default.
- The Risk: A slow downstream dependency ties up all your threads/goroutines, eventually crashing your service.
- The Fix: The scanner flags clients initialized without explicit timeouts.
2. Missing Graceful Shutdown
- The Risk: When Kubernetes scales down a pod, it sends a SIGTERM. If your app doesn't catch it and finish in-flight requests, you drop user traffic during every deployment or auto-scale event.
- The Fix: The scanner looks for the presence of signal handling wiring.
3. Unvalidated Configuration
- The Risk: Your app needs an API_KEY environment variable to start. You deploy, and it crash-loops because the variable was missing in the new environment.
- The Fix: The scanner checks if critical configuration inputs have validation checks associated with them.
Seeing it in Action
The tool is written in Go and designed to be a single binary you can drop into any CI pipeline.
You run it against your source code directory:
$ pr scan ./my-microservice
And it gives you a prioritized report of operational risks.
Conclusion
The gap between "code that works" and "code that runs reliably in production" is massive. We need to stop relying on hope and manual checklists to bridge that gap.
By automating the detection of operational anti-patterns, we can ship faster and, more importantly, sleep better.
The project is open source and we are just getting started defining the rules of what makes software "production-ready."
If you've ever been burned by a "silly" configuration mistake in prod, give it a spin.
👉 Star the repo on GitHub: https://github.com/chuanjin/production-readiness
