Originally published at aicloudstrategist.com/blog/ai-gpu-cost-audit-india.html. This is a cross-post for the dev.to community.
AI GPU Cost Audit for Indian AI Startups: H100, Inferentia2 & Spot Economics (2026)
By Anushka B, Founder · 2026-04-22 · 10 min read
Every Indian AI startup we audit in 2026 is burning at least 30% of its GPU spend on idle capacity, wrong instance family choice, or under-utilised H100s that should have been L40S or Inferentia2. This is the math, the benchmarks, and the decision tree we use in our GPU audits.
The GPU cost problem is not a pricing problem
Indian AI startups usually assume the issue is list price. It isn't — AWS GPU list prices in Mumbai are broadly in line with us-east-1 (within 3-6% on most GPU SKUs). The issue is utilisation shape and instance-family fit. A p5.48xlarge with 8 H100s at ~$98.32/hour (~₹8,200/hour) costs roughly ₹7.2 crore a year if left on 24x7. If you use it 35% of the time, your real cost-per-useful-hour is ~₹23,400 — nearly triple list.
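The arithmetic behind that claim is worth making explicit. A minimal sketch, using the approximate rates above rather than quoted AWS pricing:

```python
def cost_per_useful_hour(hourly_rate_inr: float, utilisation: float) -> float:
    """Effective cost of one hour of useful GPU work at a given utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_rate_inr / utilisation

annual_24x7_inr = 8200 * 24 * 365               # annual burn if left on all year
effective = cost_per_useful_hour(8200, 0.35)    # ~₹23,400 per useful hour at 35%
```

The same division works for any instance: list price only matters after you divide by the fraction of hours doing real work.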
H100 vs L40S vs Inferentia2: when each one is right
| Instance | GPU | On-Demand (Mumbai approx) | Best for |
|---|---|---|---|
| p5.48xlarge | 8x H100 80GB | ₹8,150/hr | 70B+ training, long-context fine-tune |
| p4d.24xlarge | 8x A100 40GB | ₹2,720/hr | 13-34B training, Stable Diffusion XL |
| g6e.12xlarge | 4x L40S 48GB | ₹1,180/hr | 7-34B inference, fine-tuning smaller models |
| g6.12xlarge | 4x L4 24GB | ₹540/hr | Video encoding, small-model inference |
| inf2.48xlarge | 12x Inferentia2 | ₹1,020/hr | High-throughput LLM inference (Llama 3, Mistral) |
| trn1.32xlarge | 16x Trainium | ₹1,760/hr | Training when you can compile with Neuron SDK |
The single most expensive mistake we see is running production LLM inference on H100s. For a 13B-70B inference workload, Inferentia2 is 40-60% cheaper per 1000 tokens once you've invested two engineer-weeks in Neuron SDK compilation. See our cost per 1000 inferences benchmark.
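The comparison reduces to hourly rate over sustained throughput. The throughput figures below are illustrative placeholders to show the mechanics, not benchmark results:

```python
def cost_per_1k_tokens_inr(hourly_rate_inr: float, tokens_per_second: float) -> float:
    """INR cost per 1,000 generated tokens at a sustained throughput."""
    return hourly_rate_inr / (tokens_per_second * 3600 / 1000)

# Illustrative throughputs for a quantised 13B model (assumptions, not measurements):
p5_cost = cost_per_1k_tokens_inr(8150, 4000)    # 8x H100, fully batched
inf2_cost = cost_per_1k_tokens_inr(1020, 1000)  # inf2.48xlarge after Neuron compile
saving = 1 - inf2_cost / p5_cost                # lands inside the 40-60% range
```

Plug in your own measured tokens/second before drawing any conclusion; the ratio is extremely sensitive to batching quality.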
The three GPU audit questions that matter
1. What is your GPU utilisation P50, P95, and trough? If P50 is under 60%, you're either over-provisioned or your batching is wrong.
2. Do you actually need 80GB HBM, or will a 48GB L40S work? For most Indian AI startups serving models up to 34B with int8/fp8 quantisation, L40S is sufficient and 55-65% cheaper.
3. Is training separable from inference on the same capacity? If yes, training goes to Spot p5/p4d (65-75% off), and inference stays On-Demand or committed on Inf2/g6e.
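The family-fit question can be sketched as a small decision function following the table above. The thresholds are simplifications of the matrix, not a complete sizing rule:

```python
def pick_instance_family(params_b: float, workload: str, quantised: bool = True) -> str:
    """Rough instance-family fit per the decision matrix above (illustrative)."""
    if workload == "training":
        if params_b >= 70:
            return "p5.48xlarge (H100)"
        if params_b >= 13:
            return "p4d.24xlarge (A100)"
        return "g6e.12xlarge (L40S)"
    if workload == "inference":
        if params_b <= 34 and quantised:
            return "inf2.48xlarge (Inferentia2)"  # only if Neuron compile time is budgeted
        if params_b <= 34:
            return "g6e.12xlarge (L40S)"
        return "p5.48xlarge (H100)"
    raise ValueError("workload must be 'training' or 'inference'")
```

A real audit also weighs latency SLA, context length, and KV-cache headroom, which this sketch deliberately omits.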
Spot GPU economics in Mumbai
Spot interruption rates in ap-south-1 for GPU instances (as of April 2026): p5 is 5-10% per month, p4d is 10-15%, g6e is 5-8%, inf2 is under 5%. For training, this is a non-issue with checkpointing — save every 500-1000 steps to S3 or FSx for Lustre, and a 10% interruption rate adds <2% overhead. Spot savings in Mumbai on p4d are typically 68-74% versus On-Demand; on p5 they are 60-68% (H100s are scarcer, so the discount is smaller).
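The "<2% overhead" claim can be sanity-checked with a back-of-envelope model: the steady cost of writing checkpoints, plus the expected work lost per interruption. The figures in the example are assumptions:

```python
def spot_overhead(interrupts_per_month: float, ckpt_interval_min: float,
                  ckpt_write_min: float, restart_min: float,
                  hours_per_month: float = 720) -> float:
    """Fractional training-time overhead from checkpointing on Spot."""
    write_cost = ckpt_write_min / ckpt_interval_min  # paid on every interval
    # On interruption you lose, on average, half an interval plus the restart time.
    lost_min = interrupts_per_month * (ckpt_interval_min / 2 + restart_min)
    return write_cost + lost_min / (hours_per_month * 60)

# 10%/month interruption, checkpoint every 30 min taking 30 s, 20 min to resume:
overhead = spot_overhead(0.10, 30, 0.5, 20)  # under 2%
```

Notice that almost all of the overhead is checkpoint-write time, not interruptions; fast checkpointing to FSx for Lustre matters more than the interruption rate itself.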
Critical detail: Spot for training only works if you've set up capacity-optimized allocation with a multi-AZ pool and at least two instance families. A single-AZ, single-family Spot request for p5.48xlarge in Mumbai will be interrupted weekly.
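In EC2 terms that means an EC2 Fleet request with `capacity-optimized` allocation and overrides spanning at least two AZs and two instance families. A sketch of the request body for `ec2.create_fleet(**fleet_request)` — the launch-template and subnet IDs are placeholders:

```python
# Request body for ec2.create_fleet(**fleet_request); IDs below are placeholders.
fleet_request = {
    "Type": "maintain",
    "SpotOptions": {"AllocationStrategy": "capacity-optimized"},
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 1,
        "DefaultTargetCapacityType": "spot",
    },
    "LaunchTemplateConfigs": [{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-PLACEHOLDER",
            "Version": "$Latest",
        },
        # Two families x two AZs = four pools for the allocator to choose from.
        "Overrides": [
            {"InstanceType": it, "SubnetId": sn}
            for it in ("p5.48xlarge", "p4d.24xlarge")
            for sn in ("subnet-AZ-a-PLACEHOLDER", "subnet-AZ-b-PLACEHOLDER")
        ],
    }],
}
```

With only one pool, every capacity dip in that single AZ becomes an interruption; with four pools the allocator can sidestep most of them.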
The "capacity block" vs Savings Plan decision
AWS launched EC2 Capacity Blocks for ML in late 2023 — you reserve H100 or A100 capacity in 1-14 day blocks up to 8 weeks in advance, at prices that vary with demand. For Indian AI startups running one big training run a quarter, Capacity Blocks are usually cheaper and more predictable than Savings Plans, because you commit only to the run, not to year-round capacity.
Decision rule we use:
1. >80% GPU utilisation 24x7 -> 1-year Savings Plan on p4d/g6e/inf2.
2. Spiky, run-based training -> Capacity Blocks + Spot top-up.
3. Steady-state inference -> 1-year Compute Savings Plan on inf2 or g6e.
4. Research and experimentation -> pure On-Demand with auto-stop Jupyter notebooks.
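The rule set above is small enough to encode directly; the 80% threshold and the option names are taken straight from this article:

```python
def purchase_option(util_24x7: float, workload: str) -> str:
    """Map the decision rules above onto a commitment choice (illustrative)."""
    if util_24x7 >= 0.80:
        return "1-year Savings Plan on p4d/g6e/inf2"
    if workload == "training":
        return "Capacity Blocks + Spot top-up"
    if workload == "inference":
        return "1-year Compute Savings Plan on inf2/g6e"
    return "On-Demand with auto-stop notebooks"
```

The key observation: utilisation dominates the decision. Workload type only matters once you are below the commitment threshold.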
The 7-point AI GPU audit checklist
1. GPU utilisation heat-map by day-of-week x hour-of-day. Idle nights and weekends signal a scheduling problem, not a pricing one.
2. Instance family fit — cross-check model size, batch size, quantisation, and latency SLA against the H100/A100/L40S/Inf2 decision matrix.
3. Storage strategy — FSx for Lustre for training checkpoints ($0.145/GB/mo, ~₹12/GB/mo) vs S3 (~$0.025/GB/mo, ₹2.10/GB/mo) vs EFS ($0.30/GB/mo, ~₹25/GB/mo). The wrong choice can double training job cost.
4. Data egress for training data — pulling S3 data cross-region to a GPU cluster burns ₹0.83/GB. Keep training data in ap-south-1 if the cluster is in ap-south-1.
5. Jupyter and SageMaker Studio idle costs — notebooks left running overnight are the single most common waste we find. Auto-stop after 60 idle minutes.
6. Inference batching — vLLM, TGI, or TensorRT-LLM with continuous batching can deliver 3-8x the throughput on the same GPU. A ₹40 lakh/month inference bill drops to ₹10-15 lakh after proper batching.
7. Quantisation and distillation — int8 quantisation lets Llama 3 70B fit on a single H100 or 2x L40S. fp8 on H100 doubles throughput again. Most Indian startups we audit are running fp16 because "that's what the tutorial used."
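The quantisation point is just bytes-per-parameter arithmetic. A weights-only footprint check (it deliberately excludes KV cache and activations, which is why the "tight fit" caveats matter in practice):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "fp8": 1.0, "int4": 0.5}

def weights_gb(params_b: float, dtype: str) -> float:
    """Approximate weight memory in GB; ignores KV cache and activations."""
    return params_b * BYTES_PER_PARAM[dtype]

weights_gb(70, "fp16")  # 140.0 GB -> needs 2x 80GB H100 for weights alone
weights_gb(70, "int8")  # 70.0 GB  -> weights fit one H100; KV cache stays tight
```

Leave 20-40% headroom above the weights figure for KV cache at production batch sizes before declaring that a model "fits".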
What we actually see in AI audits: 5 recurring patterns
1. A 70B model running in fp16 on 4x H100s when int8 on 2x H100s would serve the same traffic. ~50% instant saving.
2. Training cluster left running after the run finished — because "we'll do the next run Monday." Monday becomes Friday. ₹5-8 lakh burnt per incident.
3. Data scientists spinning up ml.p4d.24xlarge SageMaker training jobs for a 1B-model fine-tune. An ml.g5.12xlarge (₹420/hr) would do the same job.
4. Inference traffic auto-scaled on CPU utilisation. GPU inference latency has nothing to do with CPU load; scale on GPU utilisation or request-queue depth.
5. Multi-region inference deployment for "latency" when 98% of users are in India. Consolidate into Mumbai.
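For the autoscaling fix, target-tracking on a custom CloudWatch metric (GPU utilisation or queue depth) replaces the CPU policy. A sketch of the Application Auto Scaling policy body for a SageMaker endpoint variant — the resource ID, namespace, and metric name are placeholders you would substitute with your own:

```python
# Body for application-autoscaling put_scaling_policy; names are placeholders.
scaling_policy = {
    "PolicyName": "gpu-util-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MY-ENDPOINT/variant/AllTraffic",  # placeholder
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,  # scale to hold ~70% of the tracked metric
        "CustomizedMetricSpecification": {
            "MetricName": "GPUUtilization",   # assumes you publish this metric
            "Namespace": "MyApp/Inference",   # placeholder namespace
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,  # scale in slowly; GPU instances are slow to warm
        "ScaleOutCooldown": 60,
    },
}
```

The asymmetric cooldowns matter: scaling out a GPU container takes minutes (model load), so scale out eagerly and scale in cautiously.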
INR benchmark: what Indian AI startups should pay
| Stage | Typical GPU spend (INR/month) | Post-audit target | Savings |
|---|---|---|---|
| Seed, single fine-tune/month | ₹4-8 lakh | ₹2-4 lakh | 40-50% |
| Series A, one production model | ₹15-30 lakh | ₹8-17 lakh | 40-50% |
| Series B, multiple prod models | ₹60 lakh - 1.2 cr | ₹30-70 lakh | 35-45% |
| Gen-AI infra / model-host company | ₹1.5-4 cr | ₹0.9-2.4 cr | 30-40% |
Frequently asked questions
Q: Can you audit Bedrock / managed model spend too?
Yes. Bedrock pricing is per-token; the audit focuses on prompt/response length control, caching, and model-mix (Claude Haiku vs Sonnet vs Opus). On a ₹8 lakh/month Bedrock bill we typically cut 25-35% via caching and model-routing.
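A rough model of where that 25-35% comes from: prompt caching plus routing a share of traffic to a cheaper model. All inputs are assumptions, and cached tokens are treated as free for simplicity (on real per-token pricing cache reads are discounted, not free):

```python
def bedrock_saving_fraction(cache_hit_rate: float, routed_frac: float,
                            cheap_price_ratio: float) -> float:
    """Fraction of a per-token bill saved by caching + model routing (sketch)."""
    after_cache = 1 - cache_hit_rate  # cached tokens treated as ~free here
    routing_save = after_cache * routed_frac * (1 - cheap_price_ratio)
    return 1 - (after_cache - routing_save)

# 10% cache hits, 30% of remaining traffic routed to a model at 25% of the price:
# bedrock_saving_fraction(0.10, 0.30, 0.25) -> ~0.30, i.e. ~₹2.4 lakh off ₹8 lakh
```

Both levers compound, which is why modest cache-hit and routing rates are enough to reach the 25-35% range.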
Q: Is Inferentia2 production-ready in India?
Yes. Inf2 has been GA in ap-south-1 since 2024. Neuron SDK supports Llama 3, Mistral, Mixtral, Qwen, and most open models. Compilation takes 1-2 engineer weeks per model family.
Q: H100 on-demand is hard to get in Mumbai. What do we do?
Three options: (1) Capacity Blocks for planned runs, (2) p4d (A100) which has better availability at ~67% of p5 cost, (3) Spot on p5 with fallback to p4d — surprisingly, p5 Spot often has capacity when On-Demand is out.
Q: Does the audit recommend us-east-1 for non-regulated training?
Usually no. GPU pricing is near-parity, and moving training data out of India is slow and costly. Unless your dataset is already in us-east-1 or you need a specific instance type unavailable in Mumbai, stay local.
Q: What about GCP or Azure — would they be cheaper?
Sometimes, for specific workloads. GCP A3 High with H100 is occasionally cheaper if you commit via CUD. Azure ND H100 v5 is typically 5-10% more expensive than AWS. Our audit is AWS-specific; we'll flag if a workload is a clear candidate for a multi-cloud move.
Q: Do you benchmark our models, or just audit the bill?
The free 24h audit is bill-based. For a paid engagement we can run actual benchmarks on your models across H100, L40S, and Inf2 to produce a cost-per-1000-inferences table specific to you. See our AI GPU audit service.
Related reading: AI GPU audit service · Cost per 1000 inferences · AI Ops · AWS cost calculator
We run written 24-hour AWS cost audits. Founder-led. Free. No sales call. Send your last bill, get a PDF back. Request yours →