DEV Community

shashank ms

Deploying LLMs on Cloud: A Step-by-Step Guide

Running large language models in production requires more than a GPU and a checkpoint. You need to think through serving frameworks, autoscaling rules, batching strategies, and request routing before your first user hits the endpoint. This guide walks through the full stack, from hardware selection to API design, and shows where a managed inference layer like Oxlo.ai can remove operational overhead without sacrificing control.

Assess Your Workload Requirements

Before you provision a single instance, document your constraints. Expected peak queries per second, median and max context length, and acceptable time-to-first-token (TTFT) all dictate your hardware and software choices. A retrieval-augmented generation (RAG) pipeline with 128K context windows needs a different stack than a low-latency code-completion endpoint.
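
To make the link between context length and hardware concrete, here is a rough sizing sketch for the KV cache. The layer count, KV-head count, and head dimension are assumptions chosen to resemble a 70B-class model with grouped-query attention; read the real values from your model's config before relying on the numbers.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_tokens: int, concurrent_requests: int,
                 bytes_per_token: int) -> float:
    """Worst-case KV cache footprint in GiB if every request fills its context."""
    return context_tokens * concurrent_requests * bytes_per_token / 2**30


# Assumed shape: 80 layers, 8 KV heads, head dimension 128, 2-byte (FP16/BF16) cache.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
short_ctx = kv_cache_gib(8_192, 16, per_token)     # 16 concurrent 8K requests
long_ctx = kv_cache_gib(131_072, 16, per_token)    # 16 concurrent 128K requests
print(f"{per_token} bytes/token, 8K: {short_ctx:.0f} GiB, 128K: {long_ctx:.0f} GiB")
```

Under these assumptions the cache alone grows from 40 GiB to 640 GiB when the context goes from 8K to 128K at the same concurrency, which is why the RAG pipeline and the code-completion endpoint end up on different hardware.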

Map your use case to model categories. If you need multilingual reasoning, look at Qwen 3 or GLM 5. For deep reasoning and complex coding, DeepSeek R1 671B MoE or Kimi K2.6 are better fits. If your product mixes vision and language, you will need a vision-capable model such as Kimi VL A3B or Gemma 3 27B. Knowing this early prevents expensive refactoring later.

Choose Your Deployment Model

There are two primary paths: self-managed infrastructure and managed inference APIs. Self-managed gives you full control over quantization, custom schedulers, and colocation with your data. It also forces your team to manage CUDA drivers, networking, and failover logic. Managed APIs remove that burden but traditionally bill by the token, which can make the cost of long-context workloads hard to predict.

Oxlo.ai offers a middle ground that behaves like a managed API but bills by the request. Because cost does not scale with input length, it can be cheaper than token-based alternatives for long-context and agentic workloads, where most of the tokens are input. You get OpenAI SDK compatibility, no cold starts, and access to more than 45 models without maintaining a Kubernetes cluster.
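
Whether per-request billing wins depends on your prompt sizes, so it is worth computing a break-even point. The sketch below does that arithmetic; every price in it is a hypothetical placeholder, not a quote from Oxlo.ai or any other provider.

```python
def token_billed_cost(input_tokens: int, output_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request under per-token pricing (prices are per million tokens)."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def break_even_input_tokens(usd_per_request: float, output_tokens: int,
                            usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Input length above which a flat per-request price is the cheaper option."""
    output_cost = output_tokens * usd_per_m_output / 1_000_000
    remaining = usd_per_request - output_cost
    return max(0.0, remaining * 1_000_000 / usd_per_m_input)


# Hypothetical prices: $1.00 / $2.00 per million input / output tokens,
# versus a flat $0.01 per request, with 500 output tokens per call.
threshold = break_even_input_tokens(0.01, 500, 1.00, 2.00)
print(f"Flat pricing is cheaper above roughly {threshold:.0f} input tokens")
```

With these made-up numbers the crossover sits at about 9,000 input tokens; plug in real price sheets and your own token distribution before drawing conclusions.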

Provision Infrastructure

For self-hosted deployments, start with GPU-backed instances. NVIDIA A100s and H100s remain the standard for large dense models, while L4s work well for smaller quantized checkpoints. Provision nodes in a single availability zone if you are using tensor or pipeline parallelism, because cross-AZ traffic adds latency and cost.

A typical node pool on a cloud provider looks like this:

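# Example: AWS CLI snippet for a GPU node group
# The role ARN and subnet ID are placeholders; substitute your own.
aws eks create-nodegroup \
  --cluster-name llm-cluster \
  --nodegroup-name gpu-ng \
  --node-role "arn:aws:iam::<ACCOUNT_ID>:role/<NODE_ROLE>" \
  --instance-types p4d.24xlarge \
  --scaling-config minSize=1,maxSize=4,desiredSize=2 \
  --subnets subnet-abc123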

Install the NVIDIA device plugin and ensure your container runtime can access the GPUs. Verify with nvidia-smi inside a test pod before deploying any model server.
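
If you would rather script the check than read nvidia-smi output by eye, a small helper along these lines can run inside the test pod. It is a sketch that assumes nvidia-smi is on the PATH; the function names are illustrative.

```python
import subprocess


def parse_gpu_csv(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name cannot break parsing.
        name, memory = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory)))
    return gpus


def list_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for every visible GPU; raises if the command fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling list_gpus() in the pod and comparing the length of the result with the GPU count you requested catches a missing device plugin or a misconfigured runtime before a model server ever starts.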

Containerize the Model

Most production teams use vLLM or Hugging Face Text Generation Inference (TGI) as the serving layer. Both expose an OpenAI-compatible HTTP interface and handle continuous batching. Below is a minimal vLLM deployment for Llama 3.3 70B on four GPUs:

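# Llama weights are gated on Hugging Face, so pass a token with access to the repo.
# Mounting the cache directory avoids re-downloading weights on every restart.
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-num-seqs 256 \
  --max-model-len 8192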
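
Loading a 70B checkpoint can take several minutes, so a deploy script usually has to wait before sending traffic. The following is a minimal readiness poll, assuming the server is reachable on localhost:8000 and exposes vLLM's /health endpoint; the function names and retry settings are illustrative.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts are all OSError subclasses.
        return False


def wait_until_ready(url: str,
                     probe: Callable[[str], bool] = http_ok,
                     attempts: int = 60,
                     delay_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll url until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A call such as wait_until_ready("http://localhost:8000/health") returns True once the model has finished loading, and False after about five minutes with the defaults shown.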

If you are running a mixture-of-experts model like DeepSeek V4 Flash or Qwen 3 MoE variants, confirm that your serving framework supports expert parallelism or efficient all-to-all communication. Otherwise you will underutilize your GPUs.

Configure Routing and Scaling

A single model replica is not production-ready. You need an ingress controller, load balancing across replicas, and an autoscaler that reacts to GPU memory pressure or request queue depth. Kubernetes HPA can scale on custom metrics exported by the inference server, provided a metrics adapter such as Prometheus Adapter publishes them through the custom metrics API.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: "0.7"
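
To reason about how this autoscaler will behave, it helps to replay the replica calculation by hand. The sketch below follows the formula in the Kubernetes HPA documentation, including the default 10% tolerance, but deliberately ignores unready pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * currentMetric / targetMetric), then clamp."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two replicas averaging 90% KV-cache usage against the 0.7 target above.
print(desired_replicas(2, 0.9, 0.7, min_replicas=2, max_replicas=10))
```

Running a few values through the function before rollout shows how aggressive a 0.7 target is for your traffic, and whether maxReplicas will cap you during a spike.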

Add circuit breakers and rate limits at the ingress layer. LLM serving clusters are vulnerable to thundering-herd problems when many long-context requests arrive simultaneously. Queue-based routing with timeout policies keeps your cluster stable.
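
As an illustration of queue-based routing with a timeout policy, here is a small in-process admission queue. It is a single-threaded sketch with hypothetical names, not a replacement for limits enforced at a real ingress or gateway.

```python
import time
from collections import deque
from typing import Callable, Optional


class AdmissionQueue:
    """Bounded FIFO that rejects work when full and skips requests that waited too long."""

    def __init__(self, max_depth: int, max_wait_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._items: deque = deque()
        self._max_depth = max_depth
        self._max_wait_s = max_wait_s
        self._clock = clock

    def offer(self, request_id: str) -> bool:
        """Enqueue a request; False means the caller should shed load (e.g. HTTP 429)."""
        if len(self._items) >= self._max_depth:
            return False
        self._items.append((request_id, self._clock() + self._max_wait_s))
        return True

    def take(self) -> Optional[str]:
        """Return the oldest request still within its deadline, or None if none remain."""
        while self._items:
            request_id, deadline = self._items.popleft()
            if self._clock() <= deadline:
                return request_id
            # Expired: the client has probably given up, so do not spend GPU time on it.
        return None
```

Rejecting at the door when the queue is full and dropping stale entries on dequeue are the two behaviors that stop a burst of long-context requests from turning into a backlog the cluster never clears.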

When to Skip Self-Hosting and Use Oxlo.ai

Self-hosting makes sense when you have strict data residency requirements, custom model weights, or a platform team already fluent in GPU infrastructure. For everyone else, the operational tax is high. You are on the hook for driver updates, CUDA compatibility, and scaling logic every time usage spikes.

Oxlo.ai removes that tax. It is a developer-first AI inference platform with request-based pricing: one flat cost per API request regardless of prompt length. Unlike token-based providers, your bill does not explode when you pass long documents or multi-turn agent histories. The platform hosts 45+ open-source and proprietary models, including DeepSeek R1 671B MoE, Qwen 3 32B, Kimi K2.6, and Llama 3.3 70B. There are no cold starts on popular models, and the API is fully OpenAI SDK compatible.

Switching comes down to pointing the OpenAI client at a different base URL and API key:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

stream = client.chat.completions.create(
    model="DeepSeek-R1-671B",
    messages=[{"role": "user", "content": "Write a Redis-backed task queue in Python."}],
    stream=True
)

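for chunk in stream:
    # Some streamed chunks can carry an empty choices list, so guard before indexing.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)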

This drop-in replacement works with Python, Node.js, and cURL. You keep your existing retry logic, observability hooks, and JSON mode calls without rewriting your application.

Monitoring and Observability

Whether you self-host or use a managed platform, expose metrics that matter. Track TTFT, time-per-output-token (TPOT), end-to-end latency, error rate by status code, and queue depth. Prometheus and Grafana are the standard stack for self-hosted clusters.
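
TTFT and TPOT can also be measured from the client side, which works the same way for a self-hosted server and a managed API. The helper below is a sketch: it treats each non-empty streamed chunk as one token, which is only an approximation, and open_stream is a hypothetical callable that issues the request and yields text pieces.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(open_stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a streamed response and return TTFT, TPOT, and total latency in seconds."""
    start = clock()                      # taken before the request is issued
    first_at: Optional[float] = None
    last_at: Optional[float] = None
    count = 0
    for text in open_stream():
        if not text:
            continue                     # skip empty keep-alive or role-only chunks
        now = clock()
        if first_at is None:
            first_at = now
        last_at = now
        count += 1
    end = clock()
    ttft = None if first_at is None else first_at - start
    # TPOT averages the gaps between chunks after the first one.
    tpot = (last_at - first_at) / (count - 1) if count > 1 else None
    return {"ttft_s": ttft, "tpot_s": tpot, "total_s": end - start, "chunks": count}
```

In practice open_stream would wrap a streaming chat-completion call and yield each chunk's delta text; the returned dictionary can then be exported as histogram observations to whatever metrics backend you already run.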

If you route traffic through Oxlo.ai, you can still instrument your client with OpenTelemetry or simple request timers. Because the platform handles queuing and scaling internally, your dashboards focus
