Tired of paying per token? I set up a self-hosted Llama 3.1 inference endpoint on an AWS GPU instance using llama.cpp. Here's what it actually looks like end to end.
The Setup
- Instance: g4dn.xlarge (NVIDIA Tesla T4, 15 GB VRAM) - $0.53/hour on-demand
- Model: Llama 3.1 8B Instruct, Q4_K_M quantized (4.58 GiB)
- Backend: llama.cpp compiled with '-DGGML_CUDA=ON'
- API: OpenAI-compatible REST endpoint on port 8080
Real benchmark numbers
| Test | Result |
|---|---|
| Prompt processing (pp512) | 1,093 tokens/sec |
| Text generation (tg128) | 34.36 tokens/sec |
| VRAM usage | 5,292 MiB |
A few things I learned
The Deep Learning Base GPU AMI is worth it. CUDA drivers, build tools, cmake, git - all pre-installed. Saves you an hour of setup that nobody wants to document.
The CUDA build takes ~90 minutes on 4 vCPUs. -DGGML_CUDA=ON is the flag that matters. Without it you're running on CPU and inference is ~10x slower. Snapshot your EBS volume after the build so you never wait again.
GPU instance quotas start at 0 on new AWS accounts. Request your quota increase before you start - it takes up to 2 hours and will block you mid-exercise otherwise.
Q4_K_M is the sweet spot for the T4. Full fp16 needs ~16 GB VRAM (too tight for the T4's 15 GB). Q4_K_M fits in 5.2 GB with minimal quality loss.
llama-server exposes an OpenAI-compatible API. Point your existing code at the new endpoint URL - no other changes needed.
The full guide
I wrote up every step with real terminal output and screenshots:
Covers AMI selection, security group setup, CUDA build, model download, server flags, benchmarking, and cost optimization tips.
If you have questions or need help setting this up for your company, reach me via the contact form at https://gizmojack.com/contact-me/
Top comments (1)
The 1,093 tokens/sec prompt-processing number on a T4 is the data point I was hoping to see written down somewhere. Most "I self-hosted Llama on a cheap GPU" posts give you the model size and the
$/hourand skip the actual throughput, which is the whole reason to do it. 34.36 tok/sec generation is also the right ballpark for a Q4_K_M 8B on T4 — anyone seeing meaningfully lower than that should re-check their-DGGML_CUDA=ONbuild.One observation that I'd add to the "quota starts at 0" lesson: AWS GPU quotas are also per-instance-family, not per-account-overall. A new account gets its
g4dnquota, and then if you decide to upgrade tog5.xlarge(A10G, much faster, similar price) you have to request that family separately. Worth doing both requests at the same time if there's any chance you'll iterate up.Two questions:
On the EBS snapshot after the CUDA build — did you find Q4_K_M weights themselves worth snapshotting too, or did you rely on re-downloading from HuggingFace each cold-start? The snapshot space tax vs. the 1-time download time tradeoff isn't obvious until you've done it a few times.
For the OpenAI-compatible API — when you connect an existing client that assumes OpenAI's full feature set (function calling, structured outputs, vision), which parts of the surface degrade silently versus error loud? I'd rather hear about the silent ones since those are the support tickets you don't see coming.
The "snapshot your EBS volume so you never wait again" line is the one most setup guides skip because the author only did the build once. Nice to see it called out as a permanent fix.