Chaitanya Krishna Kasaraneni

Posted on May 29 • Edited on Jun 1 • Originally published at ckasaraneni.com

Why I built cloudfit

#opensource #cloud #nextflow #bioinformatics

A few months back I was sitting in front of GCP Compute Engine pricing for the hundredth time, trying to pick a machine type for a Nextflow process that runs maybe four times a month. The workflow had eighteen processes. Most of them were over-provisioned by a healthy margin. A few were under-provisioned and failing intermittently, because someone (me, probably) had picked a size that was fine on a small test dataset and never revisited it.

I wanted a tool that would just tell me: for this declared CPU and RAM, which machine type is the best fit, ranked by cost or performance or availability. Something I could drop into the pipeline DSL and stop thinking about.

That tool, as far as I could tell, did not exist for batch workloads.

What the existing tools actually solve

Every major cloud has a built-in recommender. GCP has Recommender. AWS has Compute Optimizer. Azure has Advisor. They are all genuinely good at one specific thing: looking at the last week or two of telemetry from a running VM, noticing it is using 12% of its CPU on average, and telling you to downsize.

For long-lived services, this is great. For batch, it is useless.

A Nextflow process that runs four times a month never accumulates enough CloudWatch or Cloud Monitoring data to be evaluated. Neither does a new pipeline that has not run yet. Neither does any "we are about to migrate from on-prem and need to pre-size 40 different job types" exercise. The bigger picture is that batch and forward-planning are the two largest blind spots in cloud cost optimization, and the existing free tooling does nothing for either.

There is paid tooling that does, sort of. Densify, Spot.io, Cast.ai, ProsperOps, and a dozen others sell sophisticated FinOps platforms. They are priced for FinOps teams at companies with eight-figure cloud bills. None of them are something you would pull into a personal pipeline or a small lab's workflow.

So I built cloudfit.

What it actually does

You give it a workload profile and a list of candidate machine types. It scores each candidate against the profile and returns them ranked.

from cloudfit import WorkloadProfile, MachineType, rank

profile = WorkloadProfile(
    vcpu=60,
    ram_gb=224,
    archetype="io",            # io | cpu | mem | gpu | burst
    optimize_for="balanced",   # cost | performance | availability | balanced
)

candidates = [...]  # from cloudfit-provider-gcp, or your own list

ranked = rank(profile, candidates)
best = ranked[0]
print(f"{best.instance.id}  ${best.instance.price_hr}/hr  score: {best.score}")

That is the entire surface area of the library. The complexity sits in two places:

Hard floor filters. A candidate that does not meet the declared CPU or RAM is disqualified, not just scored low. Same for region constraints. Same for status (deprecated instances do not show up in results). I wanted the failure mode to be "no result" rather than "a result that silently underdelivers."
Weighted scoring across cost, performance, and availability. You pick what you optimize for. The score is reproducible and you get a per-factor breakdown so you can see why the top pick is the top pick. No black box.
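
To make those two stages concrete, here is a minimal sketch of the idea, not the cloudfit implementation. The `Candidate` record, the `WEIGHTS` presets, the `passes_floors` and `rank_sketch` names, and the "cheapest eligible price divided by this price" cost factor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record; not the cloudfit MachineType schema."""
    id: str
    vcpu: int
    ram_gb: float
    price_hr: float      # assumed > 0
    perf: float          # assumed pre-normalized to 0..1
    availability: float  # assumed pre-normalized to 0..1
    status: str = "active"


# Assumed weight presets, for illustration only.
WEIGHTS = {
    "cost":         {"cost": 0.6, "perf": 0.2, "availability": 0.2},
    "performance":  {"cost": 0.2, "perf": 0.6, "availability": 0.2},
    "availability": {"cost": 0.2, "perf": 0.2, "availability": 0.6},
    "balanced":     {"cost": 1 / 3, "perf": 1 / 3, "availability": 1 / 3},
}


def passes_floors(c, vcpu, ram_gb):
    """Hard floors: disqualify a candidate rather than score it low."""
    return c.status == "active" and c.vcpu >= vcpu and c.ram_gb >= ram_gb


def rank_sketch(candidates, vcpu, ram_gb, optimize_for="balanced"):
    """Return (score, id, per-factor breakdown) tuples, best first."""
    eligible = [c for c in candidates if passes_floors(c, vcpu, ram_gb)]
    if not eligible:
        return []  # "no result" instead of an undersized result
    cheapest = min(c.price_hr for c in eligible)
    weights = WEIGHTS[optimize_for]
    ranked = []
    for c in eligible:
        factors = {
            "cost": cheapest / c.price_hr,  # 1.0 for the cheapest eligible candidate
            "perf": c.perf,
            "availability": c.availability,
        }
        score = sum(weights[name] * factors[name] for name in weights)
        ranked.append((round(score, 4), c.id, factors))
    # Sort by score descending, then by id so ties break reproducibly.
    ranked.sort(key=lambda row: (-row[0], row[1]))
    return ranked
```

Returning the `factors` dictionary next to the score is what makes the ranking explainable: each row shows how much cost, performance, and availability contributed.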

There is a separate provider package, cloudfit-provider-gcp, that fetches the live GCP catalog (machine types plus pricing from the Cloud Billing API), normalizes it into the scoring schema, and exits. That keeps the scoring engine cloud-agnostic and the cloud-specific work isolated.

There is also a FastAPI service, cloudfit-api, that wraps both as an HTTP API with a bundled snapshot of 875 GCP machine types across 5 regions (us-central1, us-east1, us-west1, europe-west4, asia-southeast1) with realistic asymmetric availability. It runs out of the box, no credentials needed. There is a live demo at https://chaitanyakasaraneni-cloudfit-api.hf.space/docs that you can poke at right now. Hit /recommend with a workload profile and you get back a ranked list with per-factor scores in milliseconds.

If you pass region in the request, cloudfit hard-floors anything not actually available there. So a pipeline that runs in asia-southeast1 will not get a c4 recommendation if c4 has not rolled out to that region yet, even if it would otherwise be the best match.
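
The region floor can be pictured as a plain membership check that runs before any scoring. This is a rough sketch under assumptions: the `CATALOG` entries, their `regions` field, and the `filter_by_region` helper are hypothetical, and the availability data is invented for the example rather than taken from the real GCP catalog.

```python
# Illustrative availability only; not taken from the real GCP catalog.
CATALOG = [
    {"id": "c4-standard-16", "regions": {"us-central1", "us-east1"}},
    {"id": "n2-standard-16", "regions": {"us-central1", "asia-southeast1"}},
]


def filter_by_region(candidates, region=None):
    """Drop candidates not offered in `region`; no region means no filter."""
    if region is None:
        return list(candidates)
    return [c for c in candidates if region in c["regions"]]


# With this made-up data, only the n2 shape survives for asia-southeast1.
survivors = [c["id"] for c in filter_by_region(CATALOG, "asia-southeast1")]
```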

There is also a /diff endpoint that compares the top pick for two workloads and returns the cost and spec deltas, which is the conversation you have when scaling a pipeline up or down.
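
As a rough illustration of what such a comparison involves, the sketch below computes spec and cost deltas between two picks. The dictionary keys, the `diff_picks` name, and the prices are hypothetical; the real /diff response shape may differ.

```python
def diff_picks(current, proposed):
    """Spec and cost deltas when moving from `current` to `proposed`."""
    price_delta = proposed["price_hr"] - current["price_hr"]
    return {
        "from": current["id"],
        "to": proposed["id"],
        "vcpu_delta": proposed["vcpu"] - current["vcpu"],
        "ram_gb_delta": proposed["ram_gb"] - current["ram_gb"],
        "price_hr_delta": round(price_delta, 4),
        "price_pct_change": round(100 * price_delta / current["price_hr"], 1),
    }


# Made-up prices, for illustration only.
small = {"id": "n2-standard-16", "vcpu": 16, "ram_gb": 64, "price_hr": 0.8}
large = {"id": "n2-standard-32", "vcpu": 32, "ram_gb": 128, "price_hr": 1.6}
delta = diff_picks(small, large)
```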

For a visual overview with copy-paste curl examples for all four endpoints, the project landing page is at https://cloudfit-io.github.io.

If you would rather not write JSON, there is a one-click UI at https://chaitanyakasaraneni-cloudfit-ui.hf.space. Same scoring engine, same snapshot, just a form on the left and a ranked table on the right. Five example workloads (BWA-MEM2, Cell Ranger, AlphaFold, Nextflow burst, Spark ETL) are one click away.

All three packages are on PyPI under Apache 2.0.

Why Nextflow first

The reason workflow engines keep coming up is that they already speak this language. Every Nextflow process declares its CPU and memory in the DSL. Cromwell does the same. Snakemake does. Argo does. That declaration is exactly the shape cloudfit takes as input.

So the natural next step is a Nextflow plugin. Something that lets you write:

process align {
    machineType cloudfit.recommend(cpu: 16, memory: 64.GB, optimize: 'cost')
    ...
}

instead of hardcoding n2-standard-16 and finding out three months later that t2d-standard-16 would have cost you 30% less, or that c3-standard-22 would have finished 40% faster. That plugin is the next thing I want to build, and it is the reason I am writing this post.
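
Since the plugin does not exist yet, the following is only a sketch of the translation step it would need: turning Nextflow-style `cpus` and `memory` values into the keyword arguments that `WorkloadProfile` takes in the earlier example. The `memory_to_gb` and `profile_from_directives` helpers are hypothetical, and the sketch assumes `ram_gb` can be fed Nextflow's 1024-based units directly.

```python
import re

# Nextflow memory units are 1024-based; everything is converted to GB here.
_UNITS_TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}
_MEMORY_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*\.?\s*(KB|MB|GB|TB)\s*$", re.IGNORECASE
)


def memory_to_gb(text):
    """Parse strings such as '64.GB', '64 GB', or '512.MB' into GB."""
    match = _MEMORY_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized memory value: {text!r}")
    amount, unit = match.groups()
    return float(amount) * _UNITS_TO_GB[unit.upper()]


def profile_from_directives(cpus, memory, optimize_for="balanced"):
    """Build keyword arguments shaped like the WorkloadProfile example."""
    return {
        "vcpu": int(cpus),
        "ram_gb": memory_to_gb(memory),
        "optimize_for": optimize_for,
    }


kwargs = profile_from_directives(16, "64.GB", optimize_for="cost")
```

The resulting dictionary could then be passed along as `WorkloadProfile(**kwargs)`, assuming the constructor accepts exactly those fields.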

What I am not solving yet

Some things cloudfit does not do today, in the interest of full transparency:

The performance scorer does not yet factor in CPU generation, clock speed, memory bandwidth, or network interconnect. A first-gen c2 and a current-gen c4 with the same vCPU count score identically on perf today. (cloudfit-core v0.3 already fixed the prior "rewards 2x headroom" behavior: fit-based scoring is now the default, with the score peaking from an exact match through 1.5x the requested size.)
It does not understand reservations, committed use discounts, or savings plans. If you have CUDs, your effective price is different from the on-demand price the recommender currently sees.
It does not check live regional quota. If you are at your c4 vCPU limit in us-central1, cloudfit will still recommend a c4 and your job will sit in the queue.
AWS and Azure providers are not built yet. GCP is the only live provider. AWS is the next planned milestone, with a public tracker at github.com/cloudfit-io/cloudfit-provider-aws.
The bundled API snapshot is a representative sample across 5 regions, not exhaustive. To run against the full live catalog you would refresh the snapshot yourself with cloudfit-provider-gcp pointed at your project.
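
The fit-based scoring mentioned in the first limitation above can be pictured as a plateau: full marks from an exact match up to 1.5x the request, and a falling score beyond that. The sketch below is one possible shape, not the cloudfit-core formula. The `fit_score` name and the `1.5 / ratio` decay past the plateau are assumptions.

```python
def fit_score(offered, requested, plateau=1.5):
    """Score how closely an offered resource fits the requested amount.

    Returns 0.0 below the request (the hard floor), 1.0 from an exact
    match through `plateau` times the request, and an assumed
    inverse-proportional decay beyond that.
    """
    if requested <= 0:
        raise ValueError("requested must be positive")
    ratio = offered / requested
    if ratio < 1.0:
        return 0.0
    if ratio <= plateau:
        return 1.0
    return plateau / ratio


# 16 vCPUs requested: 16 and 24 both score 1.0, 48 scores 0.5.
scores = [fit_score(n, 16) for n in (8, 16, 24, 48)]
```

Under this shape, a machine with twice the requested size no longer gets rewarded for the extra headroom, which is the behavior change described for v0.3.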

All five of these are on the roadmap, with the full Known Limitations table (including scoring methodology, GPU type discrimination, empirical validation, and more) in the cloudfit-core README. The order I tackle them depends on what people actually need first, which is part of why I am posting this.

What I want to know

Before I write the Nextflow plugin, I want to know if anyone wants it. Specifically:

If your team runs Nextflow, Cromwell, Snakemake, or Argo pipelines on GCP or AWS, do you currently have a way to pick machine types other than guessing once and tuning later?
If cloudfit existed as a plugin in your pipeline DSL, would you use it? Where would you point it first?
What would a credible recommendation actually need to factor in before you trusted it (commitments, reservations, quota, regional availability, GPU-specific constraints, something else)?

If any of these resonate, please reach out. Email is on my site at ckasaraneni.com. The repos live at github.com/cloudfit-io, the landing page is at cloudfit-io.github.io, and the live demo API is linked above.

If it does not resonate, that is also useful. "We already do this with X" or "we do not care because Y" is exactly the signal I am looking for at this stage. I would rather know now than spend two months on a plugin nobody asked for.
