Postmortem: CircleCI 3.5 Resource Class Misconfiguration Caused $5k in Overage Charges
On October 12, 2024, our engineering team discovered $5,200 in unexpected CircleCI overage charges tied to a misconfiguration in our CI pipeline resource class settings. This postmortem details the timeline, root cause, resolution, and prevention measures for this incident.
Incident Timeline
- September 1, 2024: CircleCI 3.5 upgrade completed across all organizational pipelines. No immediate billing anomalies reported.
- September 3, 2024: A junior engineer updating a legacy pipeline’s
.circleci/config.ymlaccidentally sets the resource class for all test jobs to2xlargeinstead of the standardmediumclass, to resolve a transient timeout issue without consulting team leads. - September 3 – October 10, 2024: The misconfigured pipeline runs 1,200+ jobs using the
2xlargeresource class, burning 4x the planned credits per minute. - October 11, 2024: Finance team flags a $5,200 overage charge on the monthly CircleCI invoice.
- October 12, 2024: Engineering team identifies the resource class misconfiguration, reverts the setting, and audits all pipelines.
- October 13, 2024: CircleCI support confirms the overage is tied to the resource class usage, no billing adjustments granted as the usage was valid per platform terms.
Root Cause Analysis
CircleCI 3.5 introduced stricter resource class validation, but our team failed to update internal guardrails after the upgrade. The core issue was a manual configuration error in .circleci/config.yml:
# Incorrect configuration
jobs:
test:
resource_class: 2xlarge # Accidental override, previously medium
docker:
- image: cimg/node:20.18
We later discovered two contributing factors:
- No pre-commit linting rule to validate resource class values against our approved list (
small,medium,medium+). - Missing billing alerts for credit usage exceeding 80% of our monthly plan allocation.
- Legacy pipeline configurations were not included in our post-upgrade audit, as they were marked "low priority" for maintenance.
Impact
- Financial: $5,200 in non-budgeted overage charges, representing 12% of our quarterly CI/CD spend.
- Operational: 14 engineering hours spent investigating billing discrepancies, auditing pipelines, and updating configurations.
- Trust: Temporary friction between engineering and finance teams due to unexpected costs.
Resolution
We resolved the incident in 24 hours with the following steps:
- Reverted the
2xlargeresource class tomediumin the affected pipeline, reducing per-job credit burn by 75%. - Audited all 47 organizational pipelines for invalid resource class values, fixing 3 additional misconfigurations in legacy services.
- Enabled CircleCI’s native billing alerts for 80%, 90%, and 100% credit usage thresholds.
- Submitted a billing adjustment request to CircleCI support, which was denied as the usage complied with platform terms.
Prevention Measures
To prevent similar incidents, we implemented the following changes:
- Added a custom YAML lint rule to our pre-commit hooks that rejects any
resource_classvalue not in our approved list. - Integrated CircleCI credit usage metrics into our internal Grafana dashboard, with automated PagerDuty alerts for usage spikes.
- Created a mandatory post-upgrade checklist for all CI/CD platform changes, including full pipeline configuration audits.
- Conducted a team-wide training session on CircleCI 3.5 resource class billing, including cost per minute for each class.
- Set up a monthly CI/CD cost review meeting between engineering and finance teams.
Lessons Learned
Manual configuration changes to CI pipelines carry outsized financial risk, especially for high-resource classes. Automated guardrails for configuration values and proactive billing monitoring are non-negotiable for teams using usage-based CI/CD platforms. While the $5k charge was painful, it prompted critical improvements to our pipeline governance that will prevent far costlier incidents in the future.
Top comments (0)