ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

We Cut LLM Hosting Costs by 52% Switching from AWS Bedrock to GCP Vertex AI and Preemptible Instances

By [Engineering Team], Published [Date]

Introduction

Large Language Model (LLM) hosting costs were eating 38% of our cloud budget by Q3 2024. We relied on AWS Bedrock for managed access to Claude, Llama 3, and Titan models, but rising per-token fees and limited cost optimization controls pushed us to evaluate alternatives. After a 6-week proof of concept, we migrated to GCP Vertex AI using preemptible (now called Spot) instances for self-hosted open-weight models, cutting total LLM hosting spend by 52% with no degradation in latency or output quality.

Why We Left AWS Bedrock

AWS Bedrock offers a low-friction managed API for proprietary and open-weight LLMs, but we hit three key pain points:

  • Unpredictable per-token pricing: Bedrock charges per input/output token with no volume discounts for our 12M daily requests, leading to monthly bills that fluctuated by ±18%.
  • Limited instance control: Bedrock abstracts away infrastructure, so we couldn’t use spot instances or reserved capacity for the open-weight models we self-hosted via Bedrock’s custom model import.
  • Regional latency for EU users: Bedrock’s eu-west-1 endpoint added 220ms average latency for our European customer base, with no option to deploy custom models to additional regions without re-importing.

Why GCP Vertex AI + Preemptible Instances

We evaluated three alternatives: self-hosting on EC2 Spot, Azure Machine Learning, and GCP Vertex AI. Vertex AI won for three reasons:

  • Native preemptible instance support: GCP’s preemptible VMs (now Spot VMs) offer up to an 80% discount over on-demand instances, with a 24-hour max runtime (for legacy preemptible VMs) and a 30-second preemption notice. Vertex AI’s managed prediction service integrates with preemptible nodes for both batch and online inference.
  • Unified model registry: Vertex AI Model Registry let us import Llama 3.1, Mistral 7B, and our fine-tuned domain-specific models in one place, with automated versioning and rollout.
  • EU region parity: Vertex AI’s europe-west2 (London) and europe-west3 (Frankfurt) regions matched our user distribution, cutting average EU latency to 89ms.

We used a hybrid approach: proprietary models (Claude 3.5 Sonnet) remained on Bedrock for 10% of traffic, while 90% of traffic shifted to self-hosted open-weight models on Vertex AI preemptible instances.

Technical Migration Steps

Our 6-week migration followed four phases:

1. Model Compatibility Audit

We first verified that our open-weight models (Llama 3.1 8B, Mistral 7B, fine-tuned GPT-3.5 replacement) were compatible with Vertex AI’s container requirements. We packaged models using NVIDIA Triton Inference Server 24.07, with vLLM 0.4.2 for optimized serving. All model artifacts were packaged in ONNX and Safetensors formats and registered in the Vertex AI Model Registry.
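
For reference, registering one of these models looked roughly like the sketch below, using the google-cloud-aiplatform SDK. The project, region, bucket, and container image URIs are placeholders rather than our real values:

```python
from google.cloud import aiplatform

# Illustrative sketch: register a packaged model in the Vertex AI Model Registry.
# Project, bucket, and image URIs below are placeholders, not our production values.
aiplatform.init(project="my-gcp-project", location="europe-west3")

model = aiplatform.Model.upload(
    display_name="llama-3-1-8b-instruct",
    # GCS path holding the exported weights and the Triton model repository
    artifact_uri="gs://my-model-bucket/llama-3-1-8b/triton-repo/",
    # Custom serving container: Triton 24.07 with the vLLM backend
    serving_container_image_uri=(
        "europe-west3-docker.pkg.dev/my-gcp-project/serving/triton-vllm:24.07"
    ),
    serving_container_predict_route="/v2/models/llama-3-1-8b/infer",
    serving_container_health_route="/v2/health/ready",
    serving_container_ports=[8000],
)
print(model.resource_name)  # projects/.../locations/.../models/...
```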

2. Preemptible Node Pool Configuration

We created a dedicated Vertex AI prediction node pool with the following specs:

  • Instance type: a2-highgpu-1g (1x NVIDIA A100 40GB) for 8B/7B models, a2-ultragpu-4g (4x A100) for 70B models
  • Preemptible: Enabled, with 2 on-demand standby nodes per pool to handle preemption events
  • Autoscaling: 1–12 nodes, scale-out trigger at 70% GPU utilization, scale-in after 10 minutes of <30% utilization
  • Health checks: 30-second readiness probes, 5-minute liveness probes
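
The equivalent deployment for the 8B/7B pool, expressed with the google-cloud-aiplatform SDK, is sketched below. It assumes a recent SDK version that exposes a spot flag on Model.deploy; the on-demand standby capacity lived in a separate, non-spot deployment behind the same endpoint, which the sketch omits. Project and resource names are placeholders.

```python
from google.cloud import aiplatform

# Minimal deployment sketch for the 8B/7B pool on Spot (preemptible) VMs.
# Assumes a recent google-cloud-aiplatform SDK that supports the `spot=` flag.
aiplatform.init(project="my-gcp-project", location="europe-west3")

endpoint = aiplatform.Endpoint.create(display_name="llm-online-preemptible")
model = aiplatform.Model(
    "projects/my-gcp-project/locations/europe-west3/models/1234567890"
)

model.deploy(
    endpoint=endpoint,
    machine_type="a2-highgpu-1g",            # 1x NVIDIA A100 40GB
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=12,
    # Scale out when GPU duty cycle crosses ~70%
    autoscaling_target_accelerator_duty_cycle=70,
    spot=True,                                # run replicas on Spot (preemptible) VMs
)
```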

3. Traffic Shifting & A/B Testing

We used Cloudflare Load Balancing to shift 5% of traffic to Vertex AI initially, monitoring latency, output quality (via BLEU score for summarization tasks, human eval for chat), and error rates. After 72 hours of no regressions, we increased to 25%, then 50%, then 90% over 2 weeks.
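
The summarization quality gate was essentially a corpus-level BLEU comparison between the two backends on the same prompts. A simplified version using sacrebleu is shown below; the file names and the pass threshold are illustrative, not our production values.

```python
import sacrebleu

# Illustrative quality gate used during the traffic shift: compare summaries from
# the Bedrock-served model (reference) against the Vertex AI deployment (candidate)
# generated from the same prompts. File names and threshold are assumptions.
with open("bedrock_outputs.txt") as f:
    references = [line.strip() for line in f]
with open("vertex_outputs.txt") as f:
    candidates = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU vs Bedrock baseline: {bleu.score:.2f}")

# Hold the traffic split if output quality regresses past the agreed threshold.
assert bleu.score >= 55.0, "Output quality regression; hold traffic at current split"
```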

4. Cost Monitoring Setup

We integrated GCP Billing Export to BigQuery, building a custom dashboard to track per-model, per-region costs. We set up alerts for preemptible instance preemption rates above 15% (our threshold for adding on-demand standby nodes).
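
The per-model, per-region rollup came from a query over the standard billing export schema, along the lines of the sketch below. The dataset and table names are placeholders, and the "model" label is our own resource-tagging convention rather than anything built into the export.

```python
from google.cloud import bigquery

# Sketch of the per-model / per-region cost rollup over the GCP Billing Export
# in BigQuery. Project, dataset, and table names are placeholders.
client = bigquery.Client(project="my-gcp-project")

query = """
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'model') AS model,
  location.region AS region,
  SUM(cost) AS cost_usd
FROM `my-gcp-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE service.description = 'Vertex AI'
  AND usage_start_time >= TIMESTAMP('2024-10-01')
GROUP BY model, region
ORDER BY cost_usd DESC
"""

for row in client.query(query).result():
    print(f"{row.model or 'untagged':30s} {row.region or '':15s} ${row.cost_usd:,.2f}")
```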

Cost Breakdown: Before and After

Our August 2024 (Bedrock-only) vs October 2024 (Vertex AI + Preemptible) costs:

| Cost Category | August 2024 (AWS Bedrock) | October 2024 (GCP Vertex AI) | Change |
| --- | --- | --- | --- |
| Per-token fees (proprietary models) | $18,200 | $1,820 (10% traffic remaining) | -90% |
| Compute (self-hosted models) | $12,400 (on-demand Bedrock custom) | $5,100 (preemptible + 2 on-demand standbys) | -59% |
| Data egress | $2,100 | $1,900 (GCP’s lower EU egress rates) | -10% |
| Total LLM Hosting Spend | $32,700 | $15,820 | -52% |

Preemptible instances accounted for 78% of our compute savings, with the remaining 22% from GCP’s lower on-demand GPU pricing vs AWS.

Lessons Learned

  • Preemptible preemption rates vary by region: europe-west1 had 22% preemption rates during peak hours, while europe-west3 stayed below 9%. We shifted 70% of EU traffic to europe-west3 to minimize disruption.
  • vLLM is critical for cost efficiency: Using vLLM’s continuous batching cut our GPU per-request cost by 41% vs default Triton serving, making preemptible instances even more cost-effective.
  • Keep 5–10% traffic on managed APIs: We retained Bedrock for proprietary models and fallback traffic, avoiding vendor lock-in and ensuring uptime during Vertex AI maintenance windows.
  • Monitor preemption notice handling: We built a sidecar container to drain in-flight requests when a preemption notice is received, reducing failed requests from 3.2% to 0.17% during preemption events.
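
For context, the drain sidecar is conceptually simple: it watches the Compute Engine metadata server’s preemption flag and flips the serving container into drain mode once the notice arrives. A stripped-down sketch follows; the local /drain endpoint is our own convention on the serving container, not a built-in Triton or vLLM API.

```python
import time
import requests

# Hedged sketch of the drain sidecar. It polls the GCE metadata server's
# preemption flag and, once set, tells the serving container to stop accepting
# new requests so in-flight ones can finish before the 30-second shutdown.
# The local /drain endpoint is our own convention, not a built-in serving API.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
HEADERS = {"Metadata-Flavor": "Google"}

def preempted() -> bool:
    try:
        resp = requests.get(METADATA_URL, headers=HEADERS, timeout=2)
        return resp.text.strip() == "TRUE"
    except requests.RequestException:
        return False

def main() -> None:
    while not preempted():
        time.sleep(1)
    # Preemption notice received: stop accepting new work and drain in-flight requests.
    requests.post("http://localhost:8000/drain", timeout=2)

if __name__ == "__main__":
    main()
```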

Conclusion

Migrating from AWS Bedrock to GCP Vertex AI with preemptible instances cut our LLM hosting costs by 52% while improving latency for EU users and maintaining output quality. The migration required ~120 engineering hours in total and pays for itself in roughly 3.2 months at the current monthly savings. For teams with high open-weight LLM traffic, the combination of Vertex AI’s managed tooling and preemptible instances offers a clear path to cost optimization without sacrificing performance.

Have questions about our migration? Reach out to our engineering team at [email].
