The real cost of flaky CI: a quick community survey

#devops

Flaky tests quietly drain engineering teams. You hit re-run, the test passes, and you move on, but that pattern compounds across the team.

I'm building Culprit (a tool that watches your CI, finds flaky tests, and bisects to the introducing commit automatically), and I'm running a short 5-question survey to put real numbers on the cost before we set pricing.

If your team deals with flaky CI, I'd appreciate 3 minutes. Drop answers in the comments, or email culprit@megaloop.app with subject "CI survey".

The 5 questions

Q1 – Your role:

Engineering Manager / Director
Staff Engineer / Principal Engineer
Senior Software Engineer
VP Engineering / CTO / Head of Engineering
Other

Q2 – How many engineers commit to your CI pipeline each week?

1–3
4–10
11–25
26–50
51+

Q3 – Roughly how many hours per week does your team lose to flaky CI? (reruns, investigations, context-switching)

<1 hour
1–3 hours
4–8 hours
9–15 hours
More than 15 hours

Q4 – Which tools does your team use to detect or manage flaky tests? (select all)

Trunk
Buildkite Test Engine
Datadog CI Visibility
BuildPulse
Homegrown / internal solution
None — we handle it manually
Other

Q5 – Pricing check: Imagine a tool that automatically identifies the exact commit that first introduced a flaky test and posts it as a PR comment — no manual git bisect needed.

At what price per committer/month would it feel:

(a) too cheap to trust the quality?
(b) a fair bargain?
(c) getting expensive, but you'd still consider it?
(d) too expensive — you'd walk away?

Reference anchors for context: $10 / $18 / $30 / $50 per committer/month

I'll share aggregate results with everyone who participates. Takes 3 minutes. Thank you!

When an LLM API call fails in production, most engineers find the same problem: the error in the logs doesn't tell you what to actually do next.

I've been collecting patterns from OpenAI and Anthropic production incidents. Here are five gaps that consistently slow resolution, and what we should be logging and monitoring instead.

1. Log the error type, not just the status code

A 429 from OpenAI can mean at least three different things: the requests-per-minute (RPM) limit was hit, the tokens-per-minute (TPM) limit was hit, or the account's quota is exhausted. Each requires a different fix. Logging only status: 429 is like logging HTTP 500 for every server error.

Log the error.type and error.code fields from the response body. Values such as rate_limit_exceeded and insufficient_quota are machine-readable and separate a rate limit from an exhausted quota, which status: 429 alone cannot do.
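As a rough sketch of what that could look like in Python, the helper below pulls the machine-readable fields out of an error body so they land in the logs next to the status code. The name describe_api_error and the record layout are illustrative assumptions, and it assumes the body is JSON with a top-level "error" object; check the field names against real responses from your provider.

```python
import json


def describe_api_error(status: int, body: str) -> dict:
    """Build a structured log record from an error response body."""
    try:
        error = json.loads(body).get("error")
    except (json.JSONDecodeError, AttributeError):
        error = None  # body was not a JSON object
    if not isinstance(error, dict):
        error = {}  # missing or unexpected "error" shape

    return {
        "status": status,
        "error_type": error.get("type", "unknown"),
        "error_code": error.get("code", "unknown"),
        # Fall back to a truncated raw body so nothing is lost.
        "message": error.get("message", body[:200]),
    }
```

Passing the returned dict to your logger as structured fields (rather than formatting it into a string) keeps error_type and error_code queryable later.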

2. Know the difference between RPM and TPM rate limits

OpenAI has two separate meters. Engineers almost always look at the wrong one.

If you're hitting 429s on low-volume requests with large prompts, check x-ratelimit-remaining-tokens: you've likely hit TPM (fix: shorten prompts, lower the maximum output length, or batch differently).
If you're hitting 429s despite low token counts, check x-ratelimit-remaining-requests: you've likely hit RPM (fix: reduce request frequency).

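Anthropic has the same split between request and token limits, but under its own header names, such as anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining. Log both the request and the token headers on every failed call.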
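The two checks above can be sketched as a small classifier. This is a heuristic under stated assumptions, not provider-documented logic: exhausted_meter is a hypothetical helper, request_tokens is your own estimate of the tokens the failed call needed, and the header names are the OpenAI-style ones mentioned above (swap them for your provider's).

```python
from typing import Mapping, Optional


def _remaining(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Read an integer header, tolerating absence and case differences."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return int(lowered[name])
    except (KeyError, ValueError):
        return None


def exhausted_meter(headers: Mapping[str, str], request_tokens: int) -> str:
    """Guess which meter a 429 hit: 'rpm', 'tpm', or 'unknown'."""
    requests_left = _remaining(headers, "x-ratelimit-remaining-requests")
    tokens_left = _remaining(headers, "x-ratelimit-remaining-tokens")

    if requests_left is not None and requests_left <= 0:
        return "rpm"  # no request budget left in this window
    if tokens_left is not None and tokens_left < request_tokens:
        return "tpm"  # token budget too small for this request
    return "unknown"  # headers missing or inconclusive
```

Logging the result alongside the raw header values means a wrong guess can still be corrected from the logs.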

3. Treat streaming connection drops separately

An SSE stream can stop delivering tokens without closing cleanly. The client receives no exception, no HTTP error — it just stops. If your timeout logic only covers the initial connection, you'll miss mid-stream drops entirely.

You need two separate timeouts: one for the initial connection, and one for idle time between received tokens. Most incident reports I've seen here involve the second one being missing.
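Here is a minimal sketch of the second timeout, assuming an asyncio-based client that exposes the stream as an async iterator of chunks. with_idle_timeout and StreamIdleTimeout are hypothetical names, not part of any SDK, and the connection timeout is assumed to be configured separately on the HTTP client.

```python
import asyncio
from typing import AsyncIterator


class StreamIdleTimeout(Exception):
    """Raised when no chunk arrives within the idle window."""


async def with_idle_timeout(
    stream: AsyncIterator[str], idle_seconds: float
) -> AsyncIterator[str]:
    """Re-yield chunks from stream, failing if the gap between chunks is too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so a long but healthy
            # stream is fine; only silence between chunks trips it.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return  # the stream closed cleanly
        except asyncio.TimeoutError:
            raise StreamIdleTimeout(f"no data for {idle_seconds}s") from None
        yield chunk
```

Wrapping the stream this way turns a silent stall into an exception you can log and retry on, the same as any other failure.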

4. Pin explicit model versions in production

OpenAI's model aliases (gpt-4-turbo-preview, gpt-4-turbo) have silently pointed at different underlying versions over time. If an AI feature's output quality or cost changes without a deployment, the first thing to check is whether a model alias was migrated.

For production workloads, pin to explicit versioned model IDs. Leave alias resolution to dev/staging.
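One way to enforce this is a startup check that refuses to boot production with an alias. The sketch below is a heuristic and an assumption on my part, not a provider rule: it treats a model ID as pinned only if it ends in a date stamp like 2024-04-09 or 20241022. IDs that carry their version elsewhere would need the pattern extended, and check_model_config is a hypothetical name.

```python
import re

# Heuristic: a pinned ID ends in YYYY-MM-DD or YYYYMMDD; an alias does not.
_DATE_STAMP = re.compile(r"(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Return True if the model ID ends in a date stamp."""
    return bool(_DATE_STAMP.search(model_id))


def check_model_config(model_id: str, environment: str) -> None:
    """Raise if production is configured with an unpinned model alias."""
    if environment == "production" and not is_pinned(model_id):
        raise ValueError(
            f"model '{model_id}' looks like an alias; pin a dated version in production"
        )
```

Logging the model ID that the provider reports back in each response is a useful complement, since it shows what an alias actually resolved to.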

5. Track fast-fail error rate separately from latency

Anthropic's 529 (API overloaded) returns in milliseconds, so your overall latency p50/p95 might look fine while every AI feature is degraded. Monitor error_rate_by_type (not just the overall error rate), and alert separately when the rate of fast failures crosses a threshold.
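To make the idea concrete, here is a small in-memory sketch. In practice these counters would live in your metrics system; the CallStats class and the 200 ms cutoff for a "fast" failure are assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CallStats:
    fast_fail_ms: float = 200.0  # assumed cutoff for a "fast" failure
    total: int = 0
    fast_failures: int = 0
    errors_by_type: Counter = field(default_factory=Counter)

    def record(self, duration_ms: float, error_type: Optional[str] = None) -> None:
        """Record one call; error_type is None for a success."""
        self.total += 1
        if error_type is None:
            return
        self.errors_by_type[error_type] += 1
        if duration_ms < self.fast_fail_ms:
            self.fast_failures += 1

    def error_rate_by_type(self) -> Dict[str, float]:
        """Share of all calls that failed, split by error type."""
        if self.total == 0:
            return {}
        return {kind: count / self.total for kind, count in self.errors_by_type.items()}

    def fast_fail_rate(self) -> float:
        """Share of all calls that failed faster than the cutoff."""
        return self.fast_failures / self.total if self.total else 0.0
```

Because fast failures are counted apart from latency percentiles, a burst of millisecond-level errors shows up in fast_fail_rate even when p50 and p95 look healthy.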

Running a research project on this: I'm trying to understand where the actual pain is for engineers debugging LLM provider incidents. If you ship AI features (TypeScript or Python, OpenAI or Anthropic in your production path) and you've personally handled at least two provider-side incidents in the last 90 days — I'd like to hear about it. 15 minutes, no pitch, just trying to understand where existing tooling falls short before building anything.

Drop a comment with your stack and whether you're open to a short chat.