How I Deployed Llama 3.1 on AWS EC2 (g4dn.xlarge) with llama.cpp — Real Numbers

Aviram Galim — Mon, 18 May 2026 06:37:43 +0000

Tired of paying per token? I set up a self-hosted Llama 3.1 inference endpoint on an AWS GPU instance using llama.cpp. Here's what it actually looks like end to end.

The Setup

Instance: g4dn.xlarge (NVIDIA Tesla T4, 15 GB VRAM) - $0.53/hour on-demand
Model: Llama 3.1 8B Instruct, Q4_K_M quantized (4.58 GiB)
Backend: llama.cpp compiled with '-DGGML_CUDA=ON'
API: OpenAI-compatible REST endpoint on port 8080

Real benchmark numbers

Test	Result
Prompt processing (pp512)	1,093 tokens/sec
Text generation (tg128)	34.36 tokens/sec
VRAM usage	5,292 MiB

A few things I learned

The Deep Learning Base GPU AMI is worth it. CUDA drivers, build tools, cmake, git - all pre-installed. Saves you an hour of setup that nobody wants to document.

The CUDA build takes ~90 minutes on 4 vCPUs. -DGGML_CUDA=ON is the flag that matters. Without it you're running on CPU and inference is ~10x slower. Snapshot your EBS volume after the build so you never wait again.

GPU instance quotas start at 0 on new AWS accounts. Request your quota increase before you start - it takes up to 2 hours and will block you mid-exercise otherwise.

Q4_K_M is the sweet spot for the T4. Full fp16 needs ~16 GB VRAM (too tight for the T4's 15 GB). Q4_K_M fits in 5.2 GB with minimal quality loss.

llama-server exposes an OpenAI-compatible API. Point your existing code at the new endpoint URL - no other changes needed.

The full guide

I wrote up every step with real terminal output and screenshots:

👉 https://gizmojack.com/how-to-deploy-llama-3-1-on-aws-ec2-g4dn-xlarge-for-under-1-hour-a-complete-guide/

Covers AMI selection, security group setup, CUDA build, model download, server flags, benchmarking, and cost optimization tips.

If you have questions or need help setting this up for your company, reach me via the contact form at https://gizmojack.com/contact-me/

DEV Community: Aviram Galim

How I Deployed Llama 3.1 on AWS EC2 (g4dn.xlarge) with llama.cpp — Real Numbers

The Setup

Real benchmark numbers

A few things I learned

The full guide