DEV Community

Siddhant Khare


Deep dive: optimizing self-hosted GitHub Actions Runners on AWS and GCP for cost efficiency

Running self-hosted GitHub Actions runners on the cloud provides great control, but the costs can spiral if not optimized. In this post, I’ll share how we achieved 30% cost savings on AWS and how you can replicate similar strategies on GCP, with a touch of technical fun and actionable advice. Let's dive in!


Why Self-Hosted Runners?

GitHub-hosted runners are convenient but can be expensive for large workloads, especially for compute-intensive tasks like integration tests or builds. Self-hosted runners, deployed on AWS or GCP, offer:

  • Cost Control: Pay only for what you use.
  • Custom Environments: Tailored to specific workflows.
  • Scalability: Dynamically scale based on workload.

However, running self-hosted runners at scale comes with its own challenges: idle resources, inefficient configurations, and escalating network costs.


Architecture Overview

Here’s a high-level architecture for both AWS and GCP self-hosted runners:

[Figure: high-level architecture]


Challenges in Cost Management

  1. Idle Resources: Pre-provisioned runners waiting for jobs lead to unnecessary costs.
  2. Networking Overheads: High outbound traffic, especially for Docker pulls.
  3. Instance Type Selection: Choosing cost-effective and performant instance types.
  4. Preemption Risks: Spot instances (AWS) or preemptible VMs (GCP) can be reclaimed by the provider mid-job.

Optimization Strategies

1. Dynamic Scaling

Both AWS and GCP allow scaling instances based on demand.

AWS

  • Use Auto Scaling Groups (ASGs) with Lambda functions triggered by workflow_job webhooks.
  • Leverage tools like philips-labs/terraform-aws-github-runner to simplify management.
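As a rough illustration of the webhook-driven approach, here is a minimal sketch of the logic a Lambda function might run when a workflow_job event arrives. The function names and the returned action strings are hypothetical; the signature check follows GitHub's documented X-Hub-Signature-256 scheme (HMAC-SHA256 over the raw body), and the actual call that adjusts the ASG is left out.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information about the match.
    return hmac.compare_digest(expected, signature_header)


def scaling_action(body: bytes) -> str:
    """Map a workflow_job event to a (hypothetical) scaling decision."""
    event = json.loads(body)
    labels = event.get("workflow_job", {}).get("labels", [])
    if "self-hosted" not in labels:
        return "ignore"  # job targets GitHub-hosted runners
    if event.get("action") == "queued":
        return "scale_up"
    if event.get("action") == "completed":
        return "scale_down_candidate"
    return "ignore"
```

A real handler would reject requests that fail verification before parsing the body, then translate "scale_up" into a change of the ASG's desired capacity.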

GCP

  • Use Managed Instance Groups (MIGs) with custom autoscaler policies based on job queue size or CPU load.
  • Cloud Functions or Cloud Run can handle scaling triggers.
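A queue-based policy can be reduced to a small sizing function that a Cloud Function or Cloud Run service evaluates before resizing the MIG. The sketch below is an assumption about how such a policy might look, not the setup described in this post; the parameter names and default limits are illustrative.

```python
import math


def target_size(queued_jobs: int, busy_runners: int,
                jobs_per_runner: int = 1,
                min_size: int = 0, max_size: int = 20) -> int:
    """Return the instance count needed for running plus queued jobs."""
    needed = busy_runners + math.ceil(queued_jobs / jobs_per_runner)
    # Clamp so a burst cannot exceed the budgeted ceiling,
    # and a quiet period cannot drop below the warm floor.
    return max(min_size, min(max_size, needed))
```

Keeping min_size at 0 removes idle cost entirely but adds boot latency to the first job; a small floor trades a little spend for faster pickup.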

[Figure: scaling decision]


2. Spot Instances (AWS) / Preemptible VMs (GCP)

These offer significant cost savings but require careful handling of preemptions.

AWS Spot Instances

  • Mix instance types in Spot Pools for better availability:
    • m5, m6i, m7i (Intel)
    • m5a, m6a (AMD)
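One way to express such a mix is the MixedInstancesPolicy structure that boto3's create_auto_scaling_group accepts. The sketch below only builds that dictionary; the launch template name, the xlarge sizes, and the on-demand base of one instance are assumptions made for illustration.

```python
def mixed_instances_policy(launch_template_name, instance_types, on_demand_base=1):
    """Build a MixedInstancesPolicy dict spanning several Spot pools."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type widens the set of Spot pools.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # A small on-demand base keeps some capacity that cannot be reclaimed.
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


# Hypothetical template name and sizes; families taken from the list above.
policy = mixed_instances_policy(
    "gha-runner-template",
    ["m5.xlarge", "m6i.xlarge", "m7i.xlarge", "m5a.xlarge", "m6a.xlarge"],
)
```

The resulting dictionary would be passed as the MixedInstancesPolicy argument when creating or updating the group.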

GCP Preemptible VMs

  • Use diverse instance types:
    • e2-standard, n2-highmem (Intel), t2d-standard (AMD)
  • Jobs must checkpoint regularly to handle interruptions gracefully.
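Checkpointing can be as simple as recording the index of the next step after each one finishes, so a replacement VM resumes instead of starting over. The following is a minimal sketch under that assumption; the file layout and function names are hypothetical, and stop_after exists only to simulate a preemption.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next step to run (0 if no checkpoint exists)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_steps(steps, path, stop_after=None):
    """Run zero-argument callables in order, resuming from the checkpoint."""
    done = []
    for i in range(load_checkpoint(path), len(steps)):
        if stop_after is not None and len(done) >= stop_after:
            break  # simulated preemption
        steps[i]()
        done.append(i)
        save_checkpoint(path, i + 1)
    return done
```

On a real preemptible VM the checkpoint would need to live on storage that outlives the instance, such as a Cloud Storage bucket or a persistent disk, rather than the local boot disk.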

Pro Tip: Always have fallback capacity with on-demand instances or higher-priority pools for critical workloads.


3. Caching and Artifact Management

Networking Optimization

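  • AWS: Keep build caches in S3, close to the runners. Note that the stock actions/cache action stores data in GitHub's own cache service, so S3-backed caching needs an S3-aware cache action or a custom upload/download step.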
  • GCP: Use Cloud Storage or Artifact Registry for similar functionality.

Docker Pulls

  • Reduce Docker pull costs by:
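    • Setting up a pull-through cache: remote repositories in GCP Artifact Registry or pull through cache rules in AWS ECR.
    • Using VPC endpoints (AWS) or Private Google Access (GCP) to keep registry and storage traffic off NAT gateways and the public internet.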
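To judge whether a pull-through cache is worth setting up, it helps to estimate how much image traffic still reaches the upstream registry at a given cache hit rate. This is a back-of-the-envelope sketch; the function name and the figures in the example are made up for illustration and are not measurements from this post.

```python
def upstream_pull_gb(pulls_per_month: int, image_size_gb: float,
                     cache_hit_rate: float) -> float:
    """Estimate GB per month still fetched from the upstream registry."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Only cache misses leave the private network path.
    return pulls_per_month * image_size_gb * (1.0 - cache_hit_rate)


# Hypothetical workload: 10,000 pulls of a 1.5 GB image per month.
without_cache = upstream_pull_gb(10_000, 1.5, 0.0)
with_cache = upstream_pull_gb(10_000, 1.5, 0.5)
```

Multiplying the difference by your provider's per-GB NAT or egress price gives a rough monthly saving to compare against the cost of running the cache.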

4. Cost Monitoring and Analysis

Both cloud providers offer tools to analyze costs:

  • AWS: Cost Explorer + CloudWatch for EC2 usage.
  • GCP: Billing Reports + Cloud Monitoring (formerly Stackdriver).

Key Metrics to Watch:

  1. Idle instance time
  2. Spot/preemptible interruption rates
  3. Network egress traffic
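The first two metrics are easy to derive once you export per-runner lifecycle data. The sketch below assumes a hypothetical record shape (uptime, busy time, and whether the instance was interrupted); the field names are not from any AWS or GCP API.

```python
from dataclasses import dataclass


@dataclass
class RunnerRecord:
    uptime_s: float     # total seconds the instance was billed
    busy_s: float       # seconds spent executing jobs
    interrupted: bool   # True if reclaimed by the provider


def idle_fraction(records):
    """Share of billed time during which runners executed no job."""
    total = sum(r.uptime_s for r in records)
    if total == 0:
        return 0.0
    busy = sum(r.busy_s for r in records)
    return (total - busy) / total


def interruption_rate(records):
    """Share of runner instances that were reclaimed before shutdown."""
    if not records:
        return 0.0
    return sum(r.interrupted for r in records) / len(records)
```

Tracking both numbers per instance type makes it easier to spot pools that are cheap on paper but interrupted often.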

[Figure: cost breakdown]


Case Study: AWS Optimization Outcomes

  • Idle Runners Reduced: Adjusted runner pools based on org activity.
  • Spot Pools Optimized: Added AMD-based m6a instances, reducing costs by 30%.
  • Networking Costs: Introduced Docker pull-through caching with S3.

Case Study: GCP Adaptation

  • Dynamic Scaling: Managed Instance Groups with preemptible VMs.
  • Networking: Switched to Private Google Access so traffic to Google APIs and services stays on Google's network.
  • Preemptible Instances: n2-highmem provided a balance of cost and performance.

Results

Cost reduction metrics

Cloud Provider | Baseline Cost | Optimized Cost | Savings (%)
AWS            | $10,000       | $7,000         | 30%
GCP            | $9,500        | $6,500         | 31%
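The savings column follows directly from the baseline and optimized figures. A small helper makes the arithmetic explicit; the function name is illustrative.

```python
def savings_percent(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline - optimized) / baseline * 100


aws = round(savings_percent(10_000, 7_000), 1)
gcp = round(savings_percent(9_500, 6_500), 1)
```

With the figures above, the AWS row works out to 30.0% and the GCP row to roughly 31.6%, which the table reports as 31%.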

User Experience Improvements

  • Reduced job interruptions.
  • Faster job execution due to optimized runner configurations.

Future Opportunities

  1. IPv6 and NAT Gateway Optimization:
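    • Both AWS and GCP support IPv6. Sending IPv6 traffic directly (for example through an egress-only internet gateway on AWS) avoids NAT gateway charges for that traffic.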
  2. Machine Learning for Scaling Decisions:
    • Use historical data to predict demand spikes.
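Before reaching for a trained model, a per-hour average of past queue depth is a useful baseline for predicting demand. The sketch below is that simpler stand-in, not a machine learning approach; the data shape, the 20% headroom, and the size cap are assumptions.

```python
import math
from collections import defaultdict


def hourly_baseline(history):
    """Average queued jobs per hour of day from (hour, queued_jobs) pairs."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum, count]
    for hour, jobs in history:
        totals[hour][0] += jobs
        totals[hour][1] += 1
    return {hour: s / n for hour, (s, n) in totals.items()}


def prewarm_count(baseline, hour, headroom=1.2, max_size=20):
    """Number of runners to start ahead of the given hour."""
    expected = baseline.get(hour, 0.0)  # unseen hours default to no demand
    return min(max_size, math.ceil(expected * headroom))
```

If a baseline like this already captures most spikes (for example, the start of the working day), a more complex predictor may not pay for itself.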

Conclusion

Optimizing self-hosted GitHub Actions runners on AWS and GCP can save significant costs while improving performance. By dynamically scaling resources, leveraging spot/preemptible instances, and optimizing network usage, you can achieve a highly efficient setup tailored to your workloads.

Feel free to experiment with these strategies and share your results. Happy optimizing! 🚀
