<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aviram Galim</title>
    <description>The latest articles on DEV Community by Aviram Galim (@avigalim).</description>
    <link>https://dev.to/avigalim</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3937347%2F449a3967-4b66-4c82-bd1c-5e578df01678.jpg</url>
      <title>DEV Community: Aviram Galim</title>
      <link>https://dev.to/avigalim</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/avigalim"/>
    <language>en</language>
    <item>
      <title>How I Deployed Llama 3.1 on AWS EC2 (g4dn.xlarge) with llama.cpp — Real Numbers</title>
      <dc:creator>Aviram Galim</dc:creator>
      <pubDate>Mon, 18 May 2026 06:37:43 +0000</pubDate>
      <link>https://dev.to/avigalim/how-i-deployed-llama-31-on-aws-ec2-g4dnxlarge-with-llamacpp-real-numbers-5020</link>
      <guid>https://dev.to/avigalim/how-i-deployed-llama-31-on-aws-ec2-g4dnxlarge-with-llamacpp-real-numbers-5020</guid>
      <description>&lt;p&gt;Tired of paying per token? I set up a self-hosted Llama 3.1 inference endpoint on an AWS GPU instance using llama.cpp. Here's what it actually looks like end to end.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Instance: g4dn.xlarge (NVIDIA Tesla T4, 15 GB VRAM) - $0.53/hour on-demand&lt;/li&gt;
&lt;li&gt;Model: Llama 3.1 8B Instruct, Q4_K_M quantized (4.58 GiB)&lt;/li&gt;
&lt;li&gt;Backend: llama.cpp compiled with '-DGGML_CUDA=ON'&lt;/li&gt;
&lt;li&gt;API: OpenAI-compatible REST endpoint on port 8080&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real benchmark numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing (pp512)&lt;/td&gt;
&lt;td&gt;1,093 tokens/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text generation (tg128)&lt;/td&gt;
&lt;td&gt;34.36 tokens/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM usage&lt;/td&gt;
&lt;td&gt;5,292 MiB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  A few things I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Deep Learning Base GPU AMI is worth it.&lt;/strong&gt; CUDA drivers, build tools, cmake, git - all pre-installed. Saves you an hour of setup that nobody wants to document.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The CUDA build takes ~90 minutes on 4 vCPUs.&lt;/strong&gt; &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt; is the flag that matters. Without it you're running on CPU and inference is ~10x slower. Snapshot your EBS volume after the build so you never wait again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU instance quotas start at 0 on new AWS accounts.&lt;/strong&gt; Request your quota increase before you start - it takes up to 2 hours and will block you mid-exercise otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4_K_M is the sweet spot for the T4.&lt;/strong&gt; Full fp16 needs ~16 GB VRAM (too tight for the T4's 15 GB). Q4_K_M fits in 5.2 GB with minimal quality loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llama-server exposes an OpenAI-compatible API.&lt;/strong&gt; Point your existing code at the new endpoint URL - no other changes needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full guide
&lt;/h2&gt;

&lt;p&gt;I wrote up every step with real terminal output and screenshots:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://gizmojack.com/how-to-deploy-llama-3-1-on-aws-ec2-g4dn-xlarge-for-under-1-hour-a-complete-guide/" rel="noopener noreferrer"&gt;https://gizmojack.com/how-to-deploy-llama-3-1-on-aws-ec2-g4dn-xlarge-for-under-1-hour-a-complete-guide/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Covers AMI selection, security group setup, CUDA build, model download, server flags, benchmarking, and cost optimization tips.&lt;/p&gt;

&lt;p&gt;If you have questions or need help setting this up for your company, reach me via the contact form at &lt;a href="https://gizmojack.com/contact-me/" rel="noopener noreferrer"&gt;https://gizmojack.com/contact-me/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>llm</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
