DEV Community

Cover image for How I Deployed Llama 3.1 on AWS EC2 (g4dn.xlarge) with llama.cpp — Real Numbers
Aviram Galim
Aviram Galim

Posted on

How I Deployed Llama 3.1 on AWS EC2 (g4dn.xlarge) with llama.cpp — Real Numbers

Tired of paying per token? I set up a self-hosted Llama 3.1 inference endpoint on an AWS GPU instance using llama.cpp. Here's what it actually looks like end to end.

The Setup

  • Instance: g4dn.xlarge (NVIDIA Tesla T4, 15 GB VRAM) - $0.53/hour on-demand
  • Model: Llama 3.1 8B Instruct, Q4_K_M quantized (4.58 GiB)
  • Backend: llama.cpp compiled with '-DGGML_CUDA=ON'
  • API: OpenAI-compatible REST endpoint on port 8080

Real benchmark numbers

Test Result
Prompt processing (pp512) 1,093 tokens/sec
Text generation (tg128) 34.36 tokens/sec
VRAM usage 5,292 MiB

A few things I learned

The Deep Learning Base GPU AMI is worth it. CUDA drivers, build tools, cmake, git - all pre-installed. Saves you an hour of setup that nobody wants to document.

The CUDA build takes ~90 minutes on 4 vCPUs. -DGGML_CUDA=ON is the flag that matters. Without it you're running on CPU and inference is ~10x slower. Snapshot your EBS volume after the build so you never wait again.

GPU instance quotas start at 0 on new AWS accounts. Request your quota increase before you start - it takes up to 2 hours and will block you mid-exercise otherwise.

Q4_K_M is the sweet spot for the T4. Full fp16 needs ~16 GB VRAM (too tight for the T4's 15 GB). Q4_K_M fits in 5.2 GB with minimal quality loss.

llama-server exposes an OpenAI-compatible API. Point your existing code at the new endpoint URL - no other changes needed.

The full guide

I wrote up every step with real terminal output and screenshots:

👉 https://gizmojack.com/how-to-deploy-llama-3-1-on-aws-ec2-g4dn-xlarge-for-under-1-hour-a-complete-guide/

Covers AMI selection, security group setup, CUDA build, model download, server flags, benchmarking, and cost optimization tips.

If you have questions or need help setting this up for your company, reach me via the contact form at https://gizmojack.com/contact-me/

Top comments (1)

Collapse
 
foxck016077 profile image
foxck016077

The 1,093 tokens/sec prompt-processing number on a T4 is the data point I was hoping to see written down somewhere. Most "I self-hosted Llama on a cheap GPU" posts give you the model size and the $/hour and skip the actual throughput, which is the whole reason to do it. 34.36 tok/sec generation is also the right ballpark for a Q4_K_M 8B on T4 — anyone seeing meaningfully lower than that should re-check their -DGGML_CUDA=ON build.

One observation that I'd add to the "quota starts at 0" lesson: AWS GPU quotas are also per-instance-family, not per-account-overall. A new account gets its g4dn quota, and then if you decide to upgrade to g5.xlarge (A10G, much faster, similar price) you have to request that family separately. Worth doing both requests at the same time if there's any chance you'll iterate up.

Two questions:

  1. On the EBS snapshot after the CUDA build — did you find Q4_K_M weights themselves worth snapshotting too, or did you rely on re-downloading from HuggingFace each cold-start? The snapshot space tax vs. the 1-time download time tradeoff isn't obvious until you've done it a few times.

  2. For the OpenAI-compatible API — when you connect an existing client that assumes OpenAI's full feature set (function calling, structured outputs, vision), which parts of the surface degrade silently versus error loud? I'd rather hear about the silent ones since those are the support tickets you don't see coming.

The "snapshot your EBS volume so you never wait again" line is the one most setup guides skip because the author only did the build once. Nice to see it called out as a permanent fix.