ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a CircleCI 3.5 Resource Class Misconfiguration Caused $5k in Overage Charges

On October 12, 2024, our engineering team traced $5,200 in unexpected CircleCI overage charges, flagged by finance the previous day, to a misconfigured resource class in one of our CI pipelines. This postmortem details the timeline, root cause, resolution, and prevention measures for the incident.

Incident Timeline

  • September 1, 2024: CircleCI 3.5 upgrade completed across all organizational pipelines. No immediate billing anomalies reported.
  • September 3, 2024: A junior engineer updating a legacy pipeline’s .circleci/config.yml mistakenly sets the resource class for all test jobs to 2xlarge instead of the standard medium. The change, intended to work around a transient timeout, is made without consulting team leads.
  • September 3 – October 10, 2024: The misconfigured pipeline runs 1,200+ jobs using the 2xlarge resource class, burning 4x the planned credits per minute.
  • October 11, 2024: Finance team flags a $5,200 overage charge on the monthly CircleCI invoice.
  • October 12, 2024: Engineering team identifies the resource class misconfiguration, reverts the setting, and audits all pipelines.
  • October 13, 2024: CircleCI support confirms the overage is tied to the resource class usage. No billing adjustment is granted because the usage was valid under the platform terms.

Root Cause Analysis

CircleCI 3.5 introduced stricter resource class validation, but our team failed to update internal guardrails after the upgrade. The core issue was a manual configuration error in .circleci/config.yml:

# Incorrect configuration
jobs:
  test:
    resource_class: 2xlarge  # Accidental override, previously medium
    docker:
      - image: cimg/node:20.18

We later discovered three contributing factors:

  • No pre-commit linting rule to validate resource class values against our approved list (small, medium, medium+).
  • Missing billing alerts for credit usage exceeding 80% of our monthly plan allocation.
  • Legacy pipeline configurations were not included in our post-upgrade audit, as they were marked "low priority" for maintenance.
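The first gap above, the missing lint rule, can be illustrated with a minimal sketch. It assumes a plain line-by-line scan of the config text rather than a full YAML parse, and the names APPROVED_CLASSES, find_violations, and main are hypothetical, not our actual hook:

```python
import re

# Assumption: the approved list is the one named in this postmortem.
APPROVED_CLASSES = {"small", "medium", "medium+"}

# Matches lines such as "    resource_class: 2xlarge  # comment".
RESOURCE_CLASS_RE = re.compile(r"^\s*resource_class:\s*([^\s#]+)")


def find_violations(config_text):
    """Return (line_number, value) pairs for unapproved resource classes."""
    violations = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = RESOURCE_CLASS_RE.match(line)
        if match is None:
            continue
        # Tolerate quoted values such as resource_class: "medium".
        value = match.group(1).strip("'\"")
        if value not in APPROVED_CLASSES:
            violations.append((number, value))
    return violations


def main(paths):
    """Check each file and return a non-zero exit code on any violation."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, value in find_violations(handle.read()):
                print(f"{path}:{number}: resource_class '{value}' is not approved")
                failed = True
    return 1 if failed else 0
```

A hook script could call main(sys.argv[1:]) and pass the result to sys.exit. Because the scan works on raw text, a parameterized value such as << parameters.size >> would also be flagged, which errs on the side of manual review.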

Impact

  • Financial: $5,200 in unbudgeted overage charges, representing 12% of our quarterly CI/CD spend.
  • Operational: 14 engineering hours spent investigating billing discrepancies, auditing pipelines, and updating configurations.
  • Trust: Temporary friction between engineering and finance teams due to unexpected costs.

Resolution

We resolved the incident in 24 hours with the following steps:

  1. Reverted the 2xlarge resource class to medium in the affected pipeline, reducing per-job credit burn by 75%.
  2. Audited all 47 organizational pipelines for invalid resource class values, fixing 3 additional misconfigurations in legacy services.
  3. Enabled CircleCI’s native billing alerts for 80%, 90%, and 100% credit usage thresholds.
  4. Submitted a billing adjustment request to CircleCI support, which was denied as the usage complied with platform terms.
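The arithmetic behind steps 1 and 3 can be sketched as follows. The 75% figure follows from the 4x burn rate only if job duration stays the same on the smaller class, and the allocation numbers in the usage note are hypothetical, not our real plan:

```python
def burn_reduction(multiplier):
    """Fraction of per-minute credit burn saved by dropping from
    `multiplier` times the baseline rate back to the baseline (1x)."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return 1 - 1 / multiplier


def crossed_thresholds(credits_used, monthly_allocation,
                       thresholds=(0.80, 0.90, 1.00)):
    """Return the alert thresholds that current usage has reached."""
    if monthly_allocation <= 0:
        raise ValueError("monthly_allocation must be positive")
    usage = credits_used / monthly_allocation
    return [level for level in thresholds if usage >= level]
```

For example, burn_reduction(4) gives 0.75, matching the 75% reduction in step 1. With a hypothetical allocation of 100,000 credits, crossed_thresholds(85_000, 100_000) returns [0.8], so only the first alert would have fired at that point.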

Prevention Measures

To prevent similar incidents, we implemented the following changes:

  • Added a custom YAML lint rule to our pre-commit hooks that rejects any resource_class value not in our approved list.
  • Integrated CircleCI credit usage metrics into our internal Grafana dashboard, with automated PagerDuty alerts for usage spikes.
  • Created a mandatory post-upgrade checklist for all CI/CD platform changes, including full pipeline configuration audits.
  • Conducted a team-wide training session on CircleCI 3.5 resource class billing, including cost per minute for each class.
  • Set up a monthly CI/CD cost review meeting between engineering and finance teams.
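As an illustration of the usage-spike alerting mentioned above, here is a minimal sketch of one possible rule: flag the latest day when it exceeds a multiple of the trailing average. The function name, the 7-day window, and the 2x factor are assumptions for this example, not the alert rule configured in our Grafana and PagerDuty setup:

```python
from statistics import mean


def is_usage_spike(daily_credits, window=7, factor=2.0):
    """Flag the latest day if it exceeds `factor` times the mean of the
    preceding `window` days. `daily_credits` is ordered oldest to newest."""
    if len(daily_credits) < window + 1:
        return False  # not enough history to judge
    *history, latest = daily_credits
    baseline = mean(history[-window:])
    return baseline > 0 and latest > factor * baseline
```

A rule like this would likely have fired within days of the September 3 change, since a 4x jump in per-minute burn on a busy pipeline should stand out against the prior week. The window and factor would need tuning against real usage to avoid false alarms.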

Lessons Learned

Manual configuration changes to CI pipelines carry outsized financial risk, especially for high-resource classes. Automated guardrails for configuration values and proactive billing monitoring are non-negotiable for teams using usage-based CI/CD platforms. While the $5k charge was painful, it prompted critical improvements to our pipeline governance that will prevent far costlier incidents in the future.
