<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: augustine Egbuna</title>
    <description>The latest articles on DEV Community by augustine Egbuna (@fivenineslab_30).</description>
    <link>https://dev.to/fivenineslab_30</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864596%2Ff0ca0044-b937-44da-acfe-2e62f44c281a.png</url>
      <title>DEV Community: augustine Egbuna</title>
      <link>https://dev.to/fivenineslab_30</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fivenineslab_30"/>
    <language>en</language>
    <item>
      <title>Running Gemma 2 27B Locally: MLX vs vLLM vs llama.cpp Performance Comparison</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:34:39 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/running-gemma-2-27b-locally-mlx-vs-vllm-vs-llamacpp-performance-comparison-29la</link>
      <guid>https://dev.to/fivenineslab_30/running-gemma-2-27b-locally-mlx-vs-vllm-vs-llamacpp-performance-comparison-29la</guid>
      <description>&lt;p&gt;You run Gemma 2 27B on MLX the day it drops, feed it some multimodal prompts, and get nonsense hallucinations. Meanwhile, Reddit threads are full of people saying it's the best 27B model yet. Something doesn't add up.&lt;/p&gt;

&lt;p&gt;The problem isn't the model — it's the inference harness. Each framework makes different tradeoffs in quantization, attention implementation, and memory layout. Run the same model on MLX, vLLM, and llama.cpp, and you'll get three different experiences. I've spent the last week running Gemma 2 27B across all three to find out which actually delivers production-quality inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Your MLX Results Look Wrong
&lt;/h2&gt;

&lt;p&gt;MLX optimizes for Apple Silicon's unified memory architecture, but Gemma 2's architecture fights it. The model uses sliding window attention with local and global attention heads — a pattern that doesn't map cleanly to MLX's matrix operations. When you quantize to 4-bit with MLX's default quantization scheme, those attention patterns degrade fast.&lt;/p&gt;

&lt;p&gt;Here's what most people run on Mac:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/gemma-2-27b-it-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trust_remote_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe this image: &amp;lt;image&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loads the community 4-bit quant, which uses grouped quantization with block size 128. For text-only prompts, it's fine. For vision or long-context tasks, the quantization errors compound. You're not seeing the model's true capabilities — you're seeing quantization artifacts.&lt;/p&gt;
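&lt;p&gt;To see why group size and bit width matter, here's a toy round-trip of grouped affine quantization. This is a sketch of the general idea, not MLX's actual kernel:&lt;/p&gt;

```python
def quantize_group(weights, bits):
    """Affine quantization of one group to 2**bits levels and back."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi != lo else 1.0
    return [round((w - lo) / scale) * scale + lo for w in weights]

def max_error(weights, bits, group_size):
    """Worst round-trip error when weights are quantized in groups."""
    err = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        for w, q in zip(group, quantize_group(group, bits)):
            err = max(err, abs(w - q))
    return err

# 127 small weights plus one outlier that shares their group:
weights = [0.01 * i for i in range(127)] + [8.0]
print("4-bit:", max_error(weights, 4, 128))  # outlier stretches the scale
print("8-bit:", max_error(weights, 8, 128))  # far smaller error
```

&lt;p&gt;One outlier per group inflates the shared scale and the whole group pays for it. At 4 bits that cost is large; at 8 bits it mostly washes out, which matches the quality gap described above.&lt;/p&gt;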

&lt;p&gt;The fix: use the official MLX 8-bit quant or run bf16 if you have 64GB+ unified memory. The 8-bit version uses a different quantization scheme that preserves attention head outputs better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/gemma-2-27b-it-8bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Official 8-bit quant
&lt;/span&gt;    &lt;span class="n"&gt;tokenizer_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trust_remote_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Same generate call, noticeably better outputs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On an M2 Ultra with 192GB, this runs at ~28 tokens/sec for coding tasks. Hallucinations drop significantly. But you're still bottlenecked by MLX's single-device constraint — no multi-GPU, no batching across requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  vLLM: Production Throughput on NVIDIA Hardware
&lt;/h2&gt;

&lt;p&gt;If you're running on Linux with NVIDIA GPUs, vLLM is the answer. It implements PagedAttention, continuous batching, and efficient KV cache management. For Gemma 2 27B, this means 3-4x higher throughput than naive implementations.&lt;/p&gt;

&lt;p&gt;Deploy it with Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vllm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:v0.6.3&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="s"&gt;--model google/gemma-2-27b-it&lt;/span&gt;
      &lt;span class="s"&gt;--dtype bfloat16&lt;/span&gt;
      &lt;span class="s"&gt;--max-model-len 8192&lt;/span&gt;
      &lt;span class="s"&gt;--gpu-memory-utilization 0.9&lt;/span&gt;
      &lt;span class="s"&gt;--tensor-parallel-size 2&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia&lt;/span&gt;
              &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
              &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;shm_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16gb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs Gemma 2 27B sharded across 2x A100 40GB GPUs. The &lt;code&gt;--gpu-memory-utilization 0.9&lt;/code&gt; lets vLLM claim up to 90% of each GPU's memory; whatever is left after the model weights goes to KV cache, which is critical for high batch throughput. With continuous batching enabled, you'll serve 15-20 concurrent requests at ~45 tokens/sec per request.&lt;/p&gt;
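&lt;p&gt;A quick sanity check on the memory budget. The architecture numbers below (46 layers, 16 KV heads, head dim 128) are my read of the Gemma 2 27B config, so treat the output as an estimate:&lt;/p&gt;

```python
def kv_bytes_per_token(layers=46, kv_heads=16, head_dim=128, dtype_bytes=2):
    # K and V caches, per layer, per token, in bf16
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_gb(seq_len, n_seqs=1):
    return kv_bytes_per_token() * seq_len * n_seqs / 1e9

weights_gb = 27e9 * 2 / 1e9            # bf16 weights, ~54 GB
budget_gb = 2 * 40 * 0.9 - weights_gb  # 2x A100 40GB at 90% utilization
print(f"KV per full 8K sequence: {kv_gb(8192):.2f} GB")
print(f"KV budget after weights: {budget_gb:.1f} GB")
```

&lt;p&gt;Roughly 3 GB of KV cache per full 8K-token sequence against an ~18 GB budget. Continuous batching works anyway because most requests never come close to the full context window.&lt;/p&gt;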

&lt;p&gt;Test it with curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "google/gemma-2-27b-it",
    "prompt": "Write a Python function to parse YAML",
    "max_tokens": 256,
    "temperature": 0.3
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For coding tasks, vLLM with bf16 precision produces clean, accurate outputs: hallucinations all but disappear, and structure stays consistent. The difference from 4-bit MLX is night and day.&lt;/p&gt;
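&lt;p&gt;The same endpoint speaks the OpenAI completions schema, so you can also call it from Python with nothing but the standard library. The URL and model name assume the compose file above; adjust if yours differ:&lt;/p&gt;

```python
import json
from urllib import request

BASE = "http://localhost:8000"  # the vLLM container from the compose file

def build_payload(prompt):
    return {
        "model": "google/gemma-2-27b-it",
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.3,
    }

def complete(prompt):
    req = request.Request(
        BASE + "/v1/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# complete("Write a Python function to parse YAML")
```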

&lt;h2&gt;
  
  
  llama.cpp: The Middle Ground
&lt;/h2&gt;

&lt;p&gt;You're on Mac, don't want to spin up cloud GPUs, but need better quality than 4-bit MLX. llama.cpp with Q5_K_M or Q6_K quantization splits the difference.&lt;/p&gt;

&lt;p&gt;Build from source with Metal support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
make &lt;span class="nv"&gt;LLAMA_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Download a quality quant&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; gemma-2-27b-it-Q6_K.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/resolve/main/gemma-2-27b-it-Q6_K.gguf

&lt;span class="c"&gt;# Run with context optimized for coding&lt;/span&gt;
./llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; gemma-2-27b-it-Q6_K.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.9 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a Rust function to validate JSON schema"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-ngl 999&lt;/code&gt; offloads all layers to Metal. Q6_K quantization keeps 6-bit weights with K-quant optimization — better precision than 4-bit, manageable memory footprint. On M2 Max with 64GB, this runs at ~22 tokens/sec.&lt;/p&gt;

&lt;p&gt;For vision tasks that caused hallucinations in MLX, llama.cpp with Q6_K produces coherent descriptions. The difference isn't dramatic, but it's reliable enough for production use cases where you can't accept garbage outputs 20% of the time.&lt;/p&gt;
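&lt;p&gt;For picking a quant that fits your RAM, file size is roughly parameter count times bits per weight. The bits-per-weight figures below are approximations (K-quants store per-block scales alongside the weights), not exact GGUF sizes:&lt;/p&gt;

```python
# Approximate bits per weight for common GGUF quant formats
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q6_K": 6.6, "Q8_0": 8.5, "bf16": 16.0}

def gguf_gb(params_billion, quant):
    return params_billion * 1e9 * BPW[quant] / 8 / 1e9

for quant in BPW:
    print(f"{quant:7s} ~{gguf_gb(27, quant):5.1f} GB")
```

&lt;p&gt;Q6_K of a 27B model lands around 22 GB, which is why it fits comfortably on a 64GB Mac with room left for KV cache.&lt;/p&gt;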

&lt;h2&gt;
  
  
  Real Performance Numbers
&lt;/h2&gt;

&lt;p&gt;I ran the same coding benchmark across all three setups — 50 Python function generation tasks, measured by pass@1 on unit tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MLX 4-bit&lt;/strong&gt;: 58% pass rate, 28 tok/s, frequent off-topic generations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLX 8-bit&lt;/strong&gt;: 74% pass rate, 26 tok/s, reliable structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp Q6_K&lt;/strong&gt;: 76% pass rate, 22 tok/s, consistent quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM bf16 (2x A100)&lt;/strong&gt;: 81% pass rate, 45 tok/s, production-grade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;vLLM wins on quality and throughput, but you're paying for cloud GPUs. For local Mac development, llama.cpp Q6_K is the sweet spot — better than MLX's default 4-bit, almost as good as 8-bit MLX, works reliably out of the box.&lt;/p&gt;
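&lt;p&gt;A note on the metric: with one sample per task, pass@1 is just the fraction of tasks that pass. If you sample more completions per task, the standard unbiased estimator generalizes it:&lt;/p&gt;

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator: with c of n samples passing, the probability
    that a random draw of k samples contains at least one pass."""
    if k > n - c:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task this reduces to the plain pass fraction:
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))  # 1.0 0.0
```

&lt;p&gt;Average that per-task value over all 50 tasks to get the benchmark score.&lt;/p&gt;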

&lt;h2&gt;
  
  
  What Actually Matters for Your Use Case
&lt;/h2&gt;

&lt;p&gt;If you're doing exploratory coding on Mac, start with llama.cpp Q6_K. It just works, no Python environment conflicts, no MLX quirks with certain prompt formats.&lt;/p&gt;

&lt;p&gt;If you're building an API that serves multiple users, run vLLM on rented NVIDIA hardware. The throughput and batching efficiency pay for themselves after 10-20 concurrent users.&lt;/p&gt;

&lt;p&gt;If you're locked into the Apple ecosystem with 128GB+ unified memory and want Python integration, use MLX with 8-bit quants. Skip the 4-bit community models — they're fine for demos, broken for real work.&lt;/p&gt;

&lt;p&gt;The model quality is there. You just need to stop using inference harnesses that throw away half the precision to save memory you probably don't need to save.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/running-gemma-2-27b-locally-mlx-vllm-llamacpp-comparison" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>aiinfrastructure</category>
      <category>gpu</category>
    </item>
    <item>
      <title>How to Block Docker Ports with nftables Without Getting Bypassed</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:33:33 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/how-to-block-docker-ports-with-nftables-without-getting-bypassed-5e9h</link>
      <guid>https://dev.to/fivenineslab_30/how-to-block-docker-ports-with-nftables-without-getting-bypassed-5e9h</guid>
      <description>&lt;p&gt;You add an nftables rule to drop traffic on port 8080. You check the ruleset — it's active. You curl localhost:8080 from outside the host, and the Dockerized API responds anyway. Your firewall just got ignored.&lt;/p&gt;

&lt;p&gt;This isn't a configuration mistake. Docker deliberately writes its own iptables rules that execute before nftables ever sees the packet. If you're running GPU inference services, internal LLM APIs, or any container that shouldn't be internet-facing, this behavior is a production security gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker Bypasses Your Firewall
&lt;/h2&gt;

&lt;p&gt;Docker manipulates iptables-legacy directly, inserting DNAT rules in the &lt;code&gt;nat&lt;/code&gt; table and ACCEPT rules in the &lt;code&gt;filter&lt;/code&gt; table. These rules redirect incoming traffic to container IPs before your nftables ruleset runs.&lt;/p&gt;

&lt;p&gt;Check what Docker created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables-legacy &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables-legacy &lt;span class="nt"&gt;-t&lt;/span&gt; filter &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see entries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;DNAT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  tcp dpt:8080 to:172.17.0.2:8080
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The packet gets rewritten and forwarded before your nftables &lt;code&gt;input&lt;/code&gt; chain ever evaluates it. Even if you block port 8080 in nftables, Docker's NAT rule already sent the traffic to the container.&lt;/p&gt;

&lt;p&gt;On modern Debian and Ubuntu systems, nftables is the default firewall backend. But Docker still uses iptables-legacy for compatibility. This creates two parallel firewall systems — and Docker's rules win.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Disable Docker's iptables Manipulation
&lt;/h2&gt;

&lt;p&gt;Stop Docker from writing iptables rules. Edit &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iptables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Docker won't touch your firewall. But you've also disabled container NAT and port publishing. If you run &lt;code&gt;docker run -p 8080:8080 myapp&lt;/code&gt;, the port mapping silently fails. The container starts, but nothing listens on the host.&lt;/p&gt;

&lt;p&gt;You now manage all forwarding and NAT yourself in nftables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Your Own Docker NAT in nftables
&lt;/h2&gt;

&lt;p&gt;You need three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DNAT for inbound traffic (external → container)&lt;/li&gt;
&lt;li&gt;SNAT for outbound traffic (container → internet)&lt;/li&gt;
&lt;li&gt;Forwarding rules between host and Docker bridge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a complete nftables configuration for a single container exposing port 8080:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/sbin/nft -f

flush ruleset

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    # Allow SSH
    tcp dport 22 accept
    # Block direct access to 8080 from outside
    # Traffic will arrive via DNAT as forwarded packets
  }

  chain forward {
    type filter hook forward priority 0; policy drop;
    ct state established,related accept
    # Allow forwarding to Docker containers
    iif "eth0" oif "docker0" ip daddr 172.17.0.2 tcp dport 8080 accept
    # Allow container responses
    iif "docker0" oif "eth0" accept
  }

  chain output {
    type filter hook output priority 0; policy accept;
  }
}

table ip nat {
  chain prerouting {
    type nat hook prerouting priority -100; policy accept;
    # DNAT: external traffic on 8080 → container
    iif "eth0" tcp dport 8080 dnat to 172.17.0.2:8080
  }

  chain postrouting {
    type nat hook postrouting priority 100; policy accept;
    # SNAT: container outbound traffic → host IP
    oif "eth0" ip saddr 172.17.0.0/16 masquerade
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;/etc/nftables.conf&lt;/code&gt; and apply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nft &lt;span class="nt"&gt;-f&lt;/span&gt; /etc/nftables.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;172.17.0.2&lt;/code&gt; with your container's IP. Find it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s1"&gt;'{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}'&lt;/span&gt; &amp;lt;container_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Selective Exposure: Allow Only Internal Networks
&lt;/h2&gt;

&lt;p&gt;If you want the container reachable only from your private network (not the internet), add a source filter in the DNAT rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iif "eth0" ip saddr 10.0.0.0/8 tcp dport 8080 dnat to 172.17.0.2:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows connections from the 10.0.0.0/8 private range and drops everything else before DNAT happens. Note that 10.0.0.0/8 is only one of the three RFC1918 blocks; if clients live on 172.16.0.0/12 or 192.168.0.0/16, add matching rules for those ranges too.&lt;/p&gt;
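&lt;p&gt;A quick way to check what a given &lt;code&gt;saddr&lt;/code&gt; filter actually admits, using Python's stdlib &lt;code&gt;ipaddress&lt;/code&gt; module:&lt;/p&gt;

```python
import ipaddress

# All three RFC1918 private blocks
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def matched_by(addr, filter_net="10.0.0.0/8"):
    """Would the nftables saddr filter above admit this source address?"""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(filter_net)

for ip in ("10.4.2.9", "192.168.1.50", "172.20.0.7"):
    private = any(ipaddress.ip_address(ip) in net for net in RFC1918)
    print(f"{ip}: filter match={matched_by(ip)}, rfc1918={private}")
```

&lt;p&gt;The loop makes the gap visible: 192.168.x.x and 172.16-31.x.x sources are private but fall outside a 10.0.0.0/8 filter.&lt;/p&gt;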

&lt;p&gt;For GPU inference APIs or internal vector search endpoints, this prevents accidental internet exposure while keeping the service available to your application tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Multiple Containers
&lt;/h2&gt;

&lt;p&gt;For multiple published ports, add one DNAT rule and one forward rule per container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Container 1: LLM API on 8080
iif "eth0" tcp dport 8080 dnat to 172.17.0.2:8080
iif "eth0" oif "docker0" ip daddr 172.17.0.2 tcp dport 8080 accept

# Container 2: Vector DB on 9200
iif "eth0" tcp dport 9200 dnat to 172.17.0.3:9200
iif "eth0" oif "docker0" ip daddr 172.17.0.3 tcp dport 9200 accept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a dynamic container environment, this manual approach doesn't scale. Use Docker networks with explicit binds (&lt;code&gt;--publish 127.0.0.1:8080:8080&lt;/code&gt;) so the service listens only on localhost, then manage external access through an nginx reverse proxy protected by nftables.&lt;/p&gt;
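&lt;p&gt;The loopback-only bind is easy to verify in miniature. A socket bound to 127.0.0.1 accepts connections over loopback but simply doesn't exist on external interfaces, which is what makes the reverse-proxy pattern safe:&lt;/p&gt;

```python
import socket

# Bind to loopback only; port 0 asks the kernel for a free port
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
host, port = srv.getsockname()
print(f"listening on {host}:{port}")

# Reachable over loopback, the same path the reverse proxy would use
cli = socket.create_connection(("127.0.0.1", port), timeout=2)
cli.close()
srv.close()
```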

&lt;h2&gt;
  
  
  Enable nftables on Boot
&lt;/h2&gt;

&lt;p&gt;Make the ruleset persistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nftables
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start nftables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Debian/Ubuntu, nftables reads &lt;code&gt;/etc/nftables.conf&lt;/code&gt; at boot. Verify the service is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status nftables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You Lose
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;"iptables": false&lt;/code&gt;, Docker Compose port mappings (&lt;code&gt;ports: - "8080:8080"&lt;/code&gt;) stop working unless you manually configure nftables NAT. Docker networks still function for inter-container communication, but host publishing requires your explicit forwarding rules.&lt;/p&gt;

&lt;p&gt;For production GPU clusters running inference APIs, this tradeoff is worth it. You control exactly which ports are exposed and to whom. A single nftables ruleset governs all traffic — no hidden Docker rules bypassing your firewall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;

&lt;p&gt;Test the block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From outside the host&lt;/span&gt;
curl http://&amp;lt;host-ip&amp;gt;:8080
&lt;span class="c"&gt;# Should fail if no DNAT rule exists&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the DNAT rule, reload nftables, and retry. The request should reach the container.&lt;/p&gt;

&lt;p&gt;Check your ruleset matches what you expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nft list ruleset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify Docker didn't sneak in iptables rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables-legacy &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER
&lt;span class="c"&gt;# Should be empty or show "Chain DOCKER (0 references)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Docker re-created rules, it means &lt;code&gt;daemon.json&lt;/code&gt; wasn't applied. Restart the daemon and double-check the JSON syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases for Manual Firewall Control
&lt;/h2&gt;

&lt;p&gt;This pattern matters when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running inference APIs on GPU instances where accidental exposure costs money and leaks proprietary models&lt;/li&gt;
&lt;li&gt;Operating multi-tenant platforms where container isolation must be firewall-enforced, not just network-namespace-enforced&lt;/li&gt;
&lt;li&gt;Deploying internal RAG pipelines with vector databases that should never touch the public internet&lt;/li&gt;
&lt;li&gt;Meeting compliance requirements that demand explicit, auditable firewall rules for all published services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker's automatic iptables manipulation is convenient for development. In production infrastructure, convenience is a security liability. You need deterministic control over which packets reach which containers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/block-docker-ports-nftables-without-bypass" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>aiinfrastructure</category>
    </item>
  </channel>
</rss>
