shashank ms

Deploying LLM Models: A Step-by-Step Guide

Deploying large language models in production requires navigating a maze of hardware requirements, inference frameworks, and scaling strategies. Whether you are serving a Llama 3.3 70B instance on dedicated GPUs or routing traffic to a managed endpoint, the goal is the same: low latency, high throughput, and predictable costs. This guide walks through the two primary paths, self-hosted infrastructure and managed APIs, with concrete steps and code you can adapt to your own setup.

Choose Your Deployment Model

Production LLM deployments generally follow one of two architectures. Self-hosting gives you full custody of model weights, custom quantization, and private network isolation. Managed APIs abstract away hardware, drivers, and autoscaling in exchange for a usage fee. Your choice should depend on data residency requirements, team size, and whether your workloads are steady or bursty. Many teams prototype on a managed provider, benchmark latency, and then migrate to self-hosted inference only if the economics justify the operational burden.

Self-Hosting: Hardware and Framework Selection

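If you need complete control over the stack, start by matching the model to your GPU memory budget. A 70B parameter model in FP16 requires roughly 140 GB of VRAM for the weights alone (2 bytes per parameter), before counting the KV cache and activations. In practice that means a multi-GPU node, for example two 80 GB GPUs with tensor parallelism, or a single high-memory instance. For quantized deployments, 4-bit AWQ or GPTQ can cut the weight footprint to roughly a quarter of the FP16 size, but may sacrifice reasoning quality on math-heavy tasks.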
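The memory arithmetic above can be sketched in a few lines. This is a rough estimate of weight storage only; the function name is illustrative, and real quantized checkpoints carry some extra overhead for scales and zero points that this sketch ignores.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    Ignores KV cache, activations, and quantization metadata.
    """
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # 70 billion parameters at three common precisions.
    for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit AWQ/GPTQ", 4)]:
        print(f"{label}: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

Whatever the weights leave free on the GPUs is what remains for the KV cache, so a configuration that barely fits the weights will support only short contexts or small batches.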

The most common inference engines are vLLM, TensorRT-LLM, and Hugging Face Text Generation Inference (TGI). vLLM is widely adopted for PagedAttention, which manages the KV cache in fixed-size blocks to reduce memory waste, and for its built-in OpenAI-compatible server mode, which lets you reuse existing client code with minimal changes.

Deploying with vLLM: A Practical Example

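Below is a minimal Docker command to serve Llama 3.3 70B on a node with two 80 GB GPUs. Adjust --tensor-parallel-size to match your GPU count. The model repository is gated on Hugging Face, so the command passes an access token, assumed here to be stored in an HF_TOKEN shell variable. The --ipc=host flag gives the container access to host shared memory, which tensor-parallel workers use to communicate.

docker run --gpus all \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192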
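Loading 70B weights can take several minutes, so it helps to wait for the server before sending traffic. The sketch below polls a health endpoint using only the standard library. It assumes the server exposes GET /health on port 8000 and answers 200 when ready; the function names, timeout, and interval are illustrative choices, not values from vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the endpoint answers with HTTP 200 (URL is an assumption)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool] = http_health_probe,
                     timeout_s: float = 1800.0,
                     interval_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() with no arguments blocks until the local server answers or 30 minutes pass. The probe and sleep parameters are injectable so the loop can be exercised without a running server.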

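Once the container passes health checks, query it with the OpenAI SDK. vLLM does not validate the API key unless the server was started with --api-key, so any placeholder string works here.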

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

This setup works, but you are now responsible for continuous operation: driver updates, queue management, autoscaling, and failover. For many teams, that operational tax exceeds the cost of a managed endpoint, especially when experimenting across multiple model families.

The Managed API Path: Skip the Infrastructure

Managed inference APIs let you call models over HTTPS without touching CUDA drivers or provisioning GPU nodes. The primary tradeoffs are cost structure, model availability, and cold-start latency. Most providers bill by the token, which means long-context prompts, large retrieval documents, and agentic loops with extensive histories can generate unpredictable bills. If your application repeatedly appends conversation history or ingests multi-page documents, input tokens often dwarf output tokens, making per-token pricing difficult to forecast.
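To see why input tokens dominate in multi-turn workloads, consider a conversation that resends its full history on every call. The sketch below sums the input tokens billed across all turns; the function name and the example numbers are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens sent across a conversation that resends full history.

    `tokens_per_turn` is the number of tokens the history grows by each turn
    (the new user message plus the previous assistant reply).
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # history grows before each request
        total += history            # the whole history is sent as input
    return total


if __name__ == "__main__":
    # Hypothetical agent: 1,000-token system prompt, 500 new tokens per turn.
    print(cumulative_input_tokens(turns=20, tokens_per_turn=500, system_tokens=1000))
```

With these example numbers the final request carries 11,000 input tokens, but the conversation as a whole is billed for 125,000, because the total grows quadratically with the number of turns.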

Integrating Oxlo.ai for Production Inference

Oxlo.ai is a developer-first inference platform that uses request-based pricing: one flat cost per API request regardless of prompt length. That model makes Oxlo.ai a strong fit for long-context workloads, agentic pipelines, and any scenario where input size varies widely. Compared with token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, and Anyscale, Oxlo.ai can be significantly cheaper when prompts are large or repetitive.

The platform hosts 45+ open-source and proprietary models across seven categories. Relevant production options include Llama 3.3 70B for general chat, DeepSeek R1 671B MoE for deep reasoning, Qwen 3 32B for multilingual agent workflows, and vision models such as Kimi VL A3B. All endpoints are fully OpenAI SDK compatible and carry no cold starts on popular models.

Switching an existing client to Oxlo.ai requires changing only the base URL, the API key, and the model identifier.

import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

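response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Summarize this 100K token document..."}],
    stream=True
)

# With stream=True the call returns an iterator of chunks; print text as it arrives.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)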

Because Oxlo.ai bills per request, the cost of that 100K context call is identical to a one-sentence greeting. That predictability simplifies budgeting for support bots, code review agents, and multi-turn reasoning workflows that grow their context windows over time.
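Whether flat per-request billing is cheaper depends on how many tokens a typical request carries. The sketch below computes a break-even point under a simplifying assumption of one blended per-token price; both prices in the example are hypothetical placeholders, not quotes from any provider.

```python
def breakeven_tokens(price_per_request: float,
                     price_per_million_tokens: float) -> float:
    """Tokens per request above which a flat per-request price is cheaper.

    Assumes a single blended token price; real providers usually price
    input and output tokens separately.
    """
    if price_per_million_tokens <= 0:
        raise ValueError("price_per_million_tokens must be positive")
    return price_per_request / price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    print(breakeven_tokens(price_per_request=0.002, price_per_million_tokens=0.50))
```

Requests below the break-even size favor token billing and requests above it favor request billing, so the comparison is best run against your own traffic distribution.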

Oxlo.ai offers a Free tier with 60 requests per day and a 7-day full-access trial, making it a practical sandbox for testing models before committing infrastructure to self-hosting. If you later need dedicated GPUs, the Enterprise plan includes unlimited requests and guaranteed 30% off your current provider. See the Oxlo.ai pricing page for plan details.

Monitoring and Optimization

Regardless of your deployment target, instrument four metrics: time-to-first-token, total generation latency, throughput in requests per second, and error rate. For self-hosted clusters, also track GPU memory fragmentation and KV-cache utilization. If you observe memory pressure in vLLM, reduce --max-num-seqs or --max-model-len, or switch to a quantized weight format.
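Time-to-first-token and total latency can both be measured from the client side of a streaming call. The following is a minimal sketch; measure_stream and its parameters are illustrative names. It takes a zero-argument callable that starts the request so that the timer begins before any network work happens.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(start_stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, non_empty_chunk_count).

    `start_stream` must issue the request and return an iterable of text
    pieces. ttft is None if the stream produced no text at all.
    """
    start = clock()
    ttft = None
    count = 0
    for piece in start_stream():
        if not piece:
            continue  # skip empty chunks such as role-only deltas
        if ttft is None:
            ttft = clock() - start
        count += 1
    return ttft, clock() - start, count
```

With the OpenAI SDK, start_stream could be a small function that calls client.chat.completions.create(..., stream=True) and yields chunk.choices[0].delta.content or "" for each chunk that has choices. Note that the chunk count is not a token count, since a chunk may carry more than one token.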

A useful cost-saving pattern is model cascading. Route simple queries to a fast, efficient model such as DeepSeek V4 Flash or Qwen 3 32B, and escalate to DeepSeek R1 671B MoE or GLM 5 only when the task demands advanced reasoning. This tiered approach preserves quality without over-provisioning your most expensive compute.
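A cascade can be sketched as a function that tries the cheaper model first and escalates only when a check on the draft answer fails. Everything here is an assumption for illustration: the models are passed in as plain callables, and the escalation check is a placeholder heuristic that a real system would replace with a classifier, a confidence score, or task-specific rules.

```python
from typing import Callable, Tuple

ModelFn = Callable[[str], str]


def looks_uncertain(prompt: str, draft: str) -> bool:
    """Placeholder heuristic: escalate on empty or self-doubting drafts."""
    return not draft.strip() or "not sure" in draft.lower()


def cascade(prompt: str,
            cheap: ModelFn,
            strong: ModelFn,
            needs_escalation: Callable[[str, str], bool] = looks_uncertain
            ) -> Tuple[str, str]:
    """Return (tier_used, answer), calling `strong` only when needed."""
    draft = cheap(prompt)
    if needs_escalation(prompt, draft):
        return "strong", strong(prompt)
    return "cheap", draft
```

In practice cheap and strong would each wrap a chat completion call against a different model name. Escalated queries pay for both calls, so the pattern only saves money when most traffic is resolved by the cheaper tier.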

Decision Framework

Self-host when you require air-gapped weights, custom CUDA kernels, or strict data residency that cloud APIs cannot satisfy. Choose a managed API when your team values instant scaling, broad model coverage, and freedom from driver maintenance.

If your workloads involve long prompts, unpredictable context sizes, or agentic loops that repeatedly append history, Oxlo.ai's request-based pricing offers a cost structure that token-based alternatives can struggle to match. Start with the Oxlo.ai API to benchmark against your self-hosted baseline, then select the architecture that aligns with your latency, privacy, and budget requirements.
