<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: augustine Egbuna</title>
    <description>The latest articles on DEV Community by augustine Egbuna (@fivenineslab_30).</description>
    <link>https://dev.to/fivenineslab_30</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864596%2Ff0ca0044-b937-44da-acfe-2e62f44c281a.png</url>
      <title>DEV Community: augustine Egbuna</title>
      <link>https://dev.to/fivenineslab_30</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fivenineslab_30"/>
    <language>en</language>
    <item>
      <title>Adding Private NuGet Feeds in Multi-Stage Dockerfiles Without Breaking Your Build</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:01:43 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/adding-private-nuget-feeds-in-multi-stage-dockerfiles-without-breaking-your-build-3pc4</link>
      <guid>https://dev.to/fivenineslab_30/adding-private-nuget-feeds-in-multi-stage-dockerfiles-without-breaking-your-build-3pc4</guid>
<description>&lt;p&gt;You're building a .NET service in TeamCity. Your build depends on internal NuGet packages from other projects. The TeamCity NuGet feed is configured, but your Docker build agent can't see it. You add &lt;code&gt;dotnet nuget add source&lt;/code&gt; to your Dockerfile; the command succeeds, but when the build reaches &lt;code&gt;dotnet restore&lt;/code&gt;, the source isn't there.&lt;/p&gt;

&lt;p&gt;I hit this exact problem setting up ML model serving APIs that depended on shared utility packages. The issue isn't obvious: NuGet configuration added during the Docker build doesn't persist to the restore step because of how multi-stage builds and layer caching work.&lt;/p&gt;

&lt;h2&gt;Why Your NuGet Source Disappears&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;dotnet nuget add source&lt;/code&gt; in a Dockerfile, it writes to &lt;code&gt;~/.nuget/NuGet/NuGet.Config&lt;/code&gt;. But in multi-stage builds, that config lives in an intermediate stage that may not be present when you actually run &lt;code&gt;dotnet restore&lt;/code&gt;. Even in single-stage builds, running the container as a different user or from a different working directory changes which config files NuGet picks up.&lt;/p&gt;

&lt;p&gt;Here's what typically fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;mcr.microsoft.com/dotnet/sdk:8.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;

&lt;span class="c"&gt;# This completes without error&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet nuget add &lt;span class="nb"&gt;source &lt;/span&gt;https://teamcity.example.com/httpAuth/app/nuget/feed &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;--name&lt;/span&gt; TeamCityFeed &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;--username&lt;/span&gt; build-agent &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;--password&lt;/span&gt; &lt;span class="nv"&gt;$NUGET_PASSWORD&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;--store-password-in-clear-text&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /src&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; *.csproj .&lt;/span&gt;
&lt;span class="c"&gt;# This step fails: source not found&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;add source&lt;/code&gt; command succeeds, but the config is written to &lt;code&gt;/root/.nuget/NuGet/NuGet.Config&lt;/code&gt; in that layer. By the time &lt;code&gt;dotnet restore&lt;/code&gt; runs, it may be executing under a different user context, or a cache-invalidated layer may have been rebuilt without the config ever being written.&lt;/p&gt;
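
&lt;p&gt;If you want to see which sources a given layer actually resolves, a throwaway diagnostic step helps; &lt;code&gt;dotnet nuget list source&lt;/code&gt; prints the merged configuration as NuGet sees it at that point in the build (remove the step once things work):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# Diagnostic only: print the package sources visible in this layer
RUN dotnet nuget list source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;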

&lt;h2&gt;Solution 1: NuGet.Config in Your Source Tree&lt;/h2&gt;

&lt;p&gt;The most reliable approach is to commit a &lt;code&gt;NuGet.Config&lt;/code&gt; file to your repository and copy it into the Docker build context. This works because the config travels with your code.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;NuGet.Config&lt;/code&gt; at your solution root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="utf-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;packageSources&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;clear&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"nuget.org"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"https://api.nuget.org/v3/index.json"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"TeamCityFeed"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"https://teamcity.example.com/httpAuth/app/nuget/feed"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/packageSources&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;packageSourceCredentials&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;TeamCityFeed&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"Username"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"%NUGET_USERNAME%"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"ClearTextPassword"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"%NUGET_PASSWORD%"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/TeamCityFeed&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/packageSourceCredentials&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;%NUGET_USERNAME%&lt;/code&gt; and &lt;code&gt;%NUGET_PASSWORD%&lt;/code&gt; placeholders. NuGet substitutes these from environment variables when &lt;code&gt;restore&lt;/code&gt; runs. Your Dockerfile becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;mcr.microsoft.com/dotnet/sdk:8.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /src&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; NuGet.Config .&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; *.csproj .&lt;/span&gt;

&lt;span class="c"&gt;# Pass credentials as build args&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NUGET_USERNAME&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NUGET_PASSWORD&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; NUGET_USERNAME=${NUGET_USERNAME}&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; NUGET_PASSWORD=${NUGET_PASSWORD}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet restore

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet publish &lt;span class="nt"&gt;-c&lt;/span&gt; Release &lt;span class="nt"&gt;-o&lt;/span&gt; /app/publish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--build-arg&lt;/span&gt; &lt;span class="nv"&gt;NUGET_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;build-agent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--build-arg&lt;/span&gt; &lt;span class="nv"&gt;NUGET_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TEAMCITY_NUGET_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; your-service:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach works because the config file is present in the working directory when &lt;code&gt;dotnet restore&lt;/code&gt; runs: NuGet walks up from the current directory, merging any config files it finds with the user-profile config, and closer files win. One caveat: build args promoted to &lt;code&gt;ENV&lt;/code&gt; persist in the image metadata, which is exactly what Solution 2 avoids.&lt;/p&gt;
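
&lt;p&gt;If you'd rather not rely on that lookup order at all, &lt;code&gt;dotnet restore&lt;/code&gt; accepts an explicit config path; this variant of the restore step pins it to the file copied earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# Point restore at the committed config explicitly
RUN dotnet restore --configfile NuGet.Config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;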

&lt;h2&gt;Solution 2: Mount Config at Build Time&lt;/h2&gt;

&lt;p&gt;If you can't commit credentials (even as placeholders) to source control, mount the config as a build secret. Docker BuildKit supports this with &lt;code&gt;--secret&lt;/code&gt; and &lt;code&gt;RUN --mount=type=secret&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;nuget-config.xml&lt;/code&gt; on your build agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="utf-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;packageSources&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"TeamCityFeed"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"https://teamcity.example.com/httpAuth/app/nuget/feed"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/packageSources&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;packageSourceCredentials&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;TeamCityFeed&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"Username"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"build-agent"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"ClearTextPassword"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"actual-token-here"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/TeamCityFeed&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/packageSourceCredentials&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dockerfile with mounted secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# syntax=docker/dockerfile:1.4&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;mcr.microsoft.com/dotnet/sdk:8.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /src&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; *.csproj .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret,id&lt;span class="o"&gt;=&lt;/span&gt;nuget_config,target&lt;span class="o"&gt;=&lt;/span&gt;/root/.nuget/NuGet/NuGet.Config &lt;span class="se"&gt;\
&lt;/span&gt;    dotnet restore

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet publish &lt;span class="nt"&gt;-c&lt;/span&gt; Release &lt;span class="nt"&gt;-o&lt;/span&gt; /app/publish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DOCKER_BUILDKIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 docker build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--secret&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nuget_config,src&lt;span class="o"&gt;=&lt;/span&gt;./nuget-config.xml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; your-service:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--mount=type=secret&lt;/code&gt; flag makes the config available only during that &lt;code&gt;RUN&lt;/code&gt; step. It's never written to a layer, so it won't leak into the final image.&lt;/p&gt;

&lt;h2&gt;Handling Multi-SDK Scenarios&lt;/h2&gt;

&lt;p&gt;If you're building with multiple .NET SDK versions, each build stage needs access to the config. The mounted secret approach works best here because it targets the standard config location in every stage.&lt;/p&gt;

&lt;p&gt;For a multi-SDK Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# syntax=docker/dockerfile:1.4&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;mcr.microsoft.com/dotnet/sdk:6.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build-legacy&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /src/legacy&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; legacy/*.csproj ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret,id&lt;span class="o"&gt;=&lt;/span&gt;nuget_config,target&lt;span class="o"&gt;=&lt;/span&gt;/root/.nuget/NuGet/NuGet.Config &lt;span class="se"&gt;\
&lt;/span&gt;    dotnet restore

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;mcr.microsoft.com/dotnet/sdk:8.0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build-modern&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /src/modern&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; modern/*.csproj ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret,id&lt;span class="o"&gt;=&lt;/span&gt;nuget_config,target&lt;span class="o"&gt;=&lt;/span&gt;/root/.nuget/NuGet/NuGet.Config &lt;span class="se"&gt;\
&lt;/span&gt;    dotnet restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage gets the config mounted at the same path, so both SDK versions can read it during their respective restore steps.&lt;/p&gt;

&lt;h2&gt;TeamCity-Specific Integration&lt;/h2&gt;

&lt;p&gt;TeamCity has first-class support for NuGet feeds. Instead of managing credentials in Dockerfiles, configure them as build parameters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In TeamCity: Project Settings → Parameters&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;env.NUGET_USERNAME&lt;/code&gt; and &lt;code&gt;env.NUGET_PASSWORD&lt;/code&gt; (mark as password)&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;NuGet.Config&lt;/code&gt; in source tree approach from Solution 1&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;TeamCity automatically exposes these as environment variables to your build steps. Your Dockerfile reads them via the &lt;code&gt;ENV&lt;/code&gt; directive, and NuGet substitutes them at restore time.&lt;/p&gt;

&lt;p&gt;For builds that run outside TeamCity (local development), developers can set these environment variables manually or use a local &lt;code&gt;NuGet.Config&lt;/code&gt; that points to their own credentials.&lt;/p&gt;
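
&lt;p&gt;For local builds, a minimal sketch of that setup looks like this (the account name and token here are placeholders, not real credentials):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Provide the same variables TeamCity would inject
export NUGET_USERNAME=build-agent
export NUGET_PASSWORD="local-dev-token"
echo "restoring as ${NUGET_USERNAME}"

# docker reads the values from the environment when no =value is given
# docker build --build-arg NUGET_USERNAME --build-arg NUGET_PASSWORD -t your-service:latest .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;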

&lt;h2&gt;Common Gotcha: Layer Caching&lt;/h2&gt;

&lt;p&gt;If you modify your NuGet source configuration, Docker's layer cache may serve a stale restore. After changing feed URLs or credentials, force a rebuild:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;--no-cache&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; your-service:latest &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, more surgically, invalidate the cache just before the &lt;code&gt;COPY&lt;/code&gt; that brings in your &lt;code&gt;NuGet.Config&lt;/code&gt; by adding a dummy &lt;code&gt;ARG&lt;/code&gt; whose value changes each build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; CACHE_BUST=1&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; NuGet.Config .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build with &lt;code&gt;--build-arg CACHE_BUST=$(date +%s)&lt;/code&gt; to invalidate the cache from that point forward.&lt;/p&gt;

&lt;h2&gt;What Didn't Work&lt;/h2&gt;

&lt;p&gt;I initially tried &lt;code&gt;dotnet nuget add source&lt;/code&gt; with &lt;code&gt;--configfile&lt;/code&gt; to specify an explicit path. This worked in the &lt;code&gt;RUN&lt;/code&gt; step where I added it, but a plain &lt;code&gt;dotnet restore&lt;/code&gt; later doesn't read that custom path unless you pass the same &lt;code&gt;--configfile&lt;/code&gt; again, and in multi-stage builds the file doesn't survive the stage boundary at all. Either way, subsequent restore steps couldn't see the source.&lt;/p&gt;

&lt;p&gt;I also tried using &lt;code&gt;docker-compose&lt;/code&gt; volumes to mount the host's &lt;code&gt;~/.nuget&lt;/code&gt; directory into the container. This worked locally but broke CI because the build agent's filesystem layout differed from the container's expectations.&lt;/p&gt;

&lt;p&gt;The solutions above are what actually worked in production: NuGet.Config in the source tree for simplicity, mounted secrets for security-sensitive environments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/adding-private-nuget-feeds-dockerfile" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>aiinfrastructure</category>
      <category>nuget</category>
    </item>
    <item>
      <title>Why can two Docker containers ping each other by name but one cannot make HTTP requests to the other?</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Sun, 12 Apr 2026 17:46:11 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/why-can-two-docker-containers-ping-each-other-by-name-but-one-cannot-make-http-requests-to-the-i93</link>
      <guid>https://dev.to/fivenineslab_30/why-can-two-docker-containers-ping-each-other-by-name-but-one-cannot-make-http-requests-to-the-i93</guid>
      <description>&lt;p&gt;You've spun up two containers on a custom bridge network. DNS works. Ping works. But curl to your application returns "Connection refused" or just hangs. I've debugged this exact scenario a dozen times across ML inference APIs talking to Redis, FastAPI services querying vector databases, and monitoring sidecars trying to scrape metrics.&lt;/p&gt;

&lt;p&gt;The problem isn't networking — it's that your application isn't actually listening where you think it is.&lt;/p&gt;

&lt;h2&gt;Why ping works but HTTP doesn't&lt;/h2&gt;

&lt;p&gt;When you ping &lt;code&gt;redis&lt;/code&gt; from the &lt;code&gt;app&lt;/code&gt; container, Docker's embedded DNS resolver translates that name to the container's IP on the bridge network. ICMP packets flow through without issue because ping operates at the network layer. No ports, no listeners, just "is this IP reachable?"&lt;/p&gt;

&lt;p&gt;HTTP requires a process actively listening on a specific port. If your application binds to &lt;code&gt;127.0.0.1:8000&lt;/code&gt; instead of &lt;code&gt;0.0.0.0:8000&lt;/code&gt;, it only accepts connections from localhost inside that container. Traffic from another container hits the network interface, finds nothing listening, and the kernel sends back a TCP RST.&lt;/p&gt;

&lt;p&gt;Here's what actually happens when you run &lt;code&gt;curl http://app:8000&lt;/code&gt; from the redis container:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DNS resolves &lt;code&gt;app&lt;/code&gt; to something like &lt;code&gt;172.18.0.2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;TCP SYN packet travels to that IP on port 8000&lt;/li&gt;
&lt;li&gt;If the app is bound to &lt;code&gt;127.0.0.1:8000&lt;/code&gt;, the kernel checks: "Is there a socket listening on &lt;code&gt;172.18.0.2:8000&lt;/code&gt;?" Answer: no.&lt;/li&gt;
&lt;li&gt;Kernel replies with RST (connection refused) or drops the packet (timeout)&lt;/li&gt;
&lt;/ol&gt;
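
&lt;p&gt;Step 4 is easy to reproduce outside Docker with a plain socket. This sketch grabs a free port, closes the listener so nothing is bound there, then connects; the kernel answers with a RST exactly as it would for a request hitting a port the app didn't bind on that interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import socket

# Reserve a free port, then close it so nothing is listening there
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

try:
    socket.create_connection(("127.0.0.1", port), timeout=1)
    result = "connected"
except ConnectionRefusedError:
    # The kernel sent a TCP RST: nothing is bound to this address/port
    result = "refused"

print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;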

&lt;h2&gt;Verify what your application is actually bound to&lt;/h2&gt;

&lt;p&gt;Exec into your app container and check what's listening (if &lt;code&gt;netstat&lt;/code&gt; isn't installed in a slim image, &lt;code&gt;ss -tlnp&lt;/code&gt; shows the same information):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; app netstat &lt;span class="nt"&gt;-tlnp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:8000          0.0.0.0:*               LISTEN      1/python
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;127.0.0.1:8000&lt;/code&gt; is your problem. The application is only reachable from inside its own container. You need &lt;code&gt;0.0.0.0:8000&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;tcp        0      0 0.0.0.0:8000            0.0.0.0:*               LISTEN      1/python
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're running a FastAPI app with Uvicorn, the default host is &lt;code&gt;127.0.0.1&lt;/code&gt;. You must explicitly set it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# This will NOT work for inter-container communication
&lt;/span&gt;    &lt;span class="c1"&gt;# uvicorn.run(app, host="127.0.0.1", port=8000)
&lt;/span&gt;
    &lt;span class="c1"&gt;# This binds to all interfaces
&lt;/span&gt;    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flask, Django's runserver, and most development servers have the same issue. Flask's &lt;code&gt;app.run()&lt;/code&gt; defaults to localhost. Django requires &lt;code&gt;python manage.py runserver 0.0.0.0:8000&lt;/code&gt;.&lt;/p&gt;
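
&lt;p&gt;One detail that trips people up: binding to &lt;code&gt;0.0.0.0&lt;/code&gt; doesn't cut off local access. A wildcard bind accepts connections on every interface, loopback included, as this small socket sketch shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import socket

# Bind to all interfaces, the way a containerized server should
server = socket.socket()
server.bind(("0.0.0.0", 0))
server.listen(1)
port = server.getsockname()[1]

# A loopback client still connects: the wildcard covers 127.0.0.1 too
client = socket.create_connection(("127.0.0.1", port), timeout=1)
conn, addr = server.accept()
print("accepted from", addr[0])

conn.close()
client.close()
server.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;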

&lt;h2&gt;The ports mapping red herring&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ports: - "8000:8000"&lt;/code&gt; line in your compose file publishes the container's port 8000 to the host's port 8000. This is for external access — like hitting &lt;code&gt;http://localhost:8000&lt;/code&gt; from your laptop.&lt;/p&gt;

&lt;p&gt;Inter-container communication on the same network bypasses port publishing entirely. Containers talk directly via the bridge network's private IP space. If you removed &lt;code&gt;ports: - "8000:8000"&lt;/code&gt;, containers could still reach each other (assuming the app binds to &lt;code&gt;0.0.0.0&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;I've seen engineers spend hours tweaking port mappings when the issue is purely the bind address.&lt;/p&gt;

&lt;h2&gt;Real debugging session&lt;/h2&gt;

&lt;p&gt;You're inside the redis container trying to reach the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This works (DNS resolution)&lt;/span&gt;
nslookup app

&lt;span class="c"&gt;# This works (network layer)&lt;/span&gt;
ping app

&lt;span class="c"&gt;# This fails (application layer)&lt;/span&gt;
curl http://app:8000
&lt;span class="c"&gt;# curl: (7) Failed to connect to app port 8000: Connection refused&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now exec into the app container and check listeners:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; app sh
netstat &lt;span class="nt"&gt;-tlnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;8000
&lt;span class="c"&gt;# tcp  0  0  127.0.0.1:8000  0.0.0.0:*  LISTEN  1/python&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There it is. Fix the bind address in your application code, rebuild the image, restart the container. Run netstat again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;netstat &lt;span class="nt"&gt;-tlnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;8000
&lt;span class="c"&gt;# tcp  0  0  0.0.0.0:8000  0.0.0.0:*  LISTEN  1/python&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now curl from redis works.&lt;/p&gt;

&lt;h2&gt;Other causes (less common but real)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Firewall rules inside the container.&lt;/strong&gt; If you're running iptables or ufw inside a container (don't), they can block incoming traffic even when the app binds correctly. I've seen this in custom ML inference images where someone copied firewall configs from a VM setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-level issues.&lt;/strong&gt; Your app might be crashing on startup, listening briefly, then dying. Check logs: &lt;code&gt;docker logs app&lt;/code&gt;. If you see the server start message followed by a Python traceback, that's your issue — not networking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong protocol.&lt;/strong&gt; This sounds dumb but I've debugged it twice: your app listens on HTTPS (TLS required), you're curling plain HTTP. Or the app expects HTTP/2 and your client sends HTTP/1.1. Both scenarios time out or fail in confusing ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SELinux or AppArmor.&lt;/strong&gt; On some Linux distributions, mandatory access controls can block container-to-container traffic even on the same network. Check &lt;code&gt;dmesg | grep -i denied&lt;/code&gt; after a failed connection attempt.&lt;/p&gt;

&lt;h2&gt;The correct compose file&lt;/h2&gt;

&lt;p&gt;Here's what your setup should look like for a typical FastAPI + Redis stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mynetwork&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;  &lt;span class="c1"&gt;# Host access only&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_HOST=redis&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_PORT=6379&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;

  &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis:alpine&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mynetwork&lt;/span&gt;
    &lt;span class="c1"&gt;# No ports needed unless you want host access to Redis&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mynetwork&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And your application must bind to &lt;code&gt;0.0.0.0&lt;/code&gt;. For Uvicorn in production, I run it via command override:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uvicorn main:app --host 0.0.0.0 --port &lt;/span&gt;&lt;span class="m"&gt;8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the bind address explicit in the deployment config, not buried in application code where the next developer might miss it.&lt;/p&gt;
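&lt;p&gt;The same rule is easy to see without Docker at all. Here's a minimal standard-library sketch (the &lt;code&gt;Ping&lt;/code&gt; handler and &lt;code&gt;serve&lt;/code&gt; helper are illustrative names, not a real API): a server bound to &lt;code&gt;127.0.0.1&lt;/code&gt; accepts only loopback connections, while &lt;code&gt;0.0.0.0&lt;/code&gt; listens on every interface, including the container's &lt;code&gt;eth0&lt;/code&gt;, which is the address other containers dial.&lt;/p&gt;

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Ping(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep request logging quiet


def serve(bind_addr):
    """Start a throwaway HTTP server on an ephemeral port, return the port."""
    server = HTTPServer((bind_addr, 0), Ping)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]


# "0.0.0.0": reachable on loopback AND on the container's external address.
# "127.0.0.1": loopback only; another container dialing this container's
# eth0 IP gets "connection refused", exactly the symptom described above.
port = serve("0.0.0.0")
with socket.create_connection(("127.0.0.1", port), timeout=2):
    print("reachable")
```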

&lt;h2&gt;
  
  
  Why this matters for AI infrastructure
&lt;/h2&gt;

&lt;p&gt;Every LLM inference API I've deployed follows this pattern: FastAPI frontend talking to a vector database (Qdrant, Milvus), a Redis cache, and sometimes multiple model containers. When one component can't reach another, the entire request pipeline fails.&lt;/p&gt;

&lt;p&gt;The symptom — "Connection refused" — looks like a networking problem. The fix is almost always a bind address configuration in your Python code. I've watched engineers add custom network configs, adjust MTU settings, and rebuild Docker networks when they needed to change one line in &lt;code&gt;uvicorn.run()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Test inter-container communication immediately after writing your compose file. Don't wait until you're debugging a failed inference request in production.&lt;/p&gt;
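&lt;p&gt;One cheap way to enforce that habit is a startup preflight. This is a sketch, not a library API: the &lt;code&gt;wait_for&lt;/code&gt; helper below blocks until a dependency accepts TCP connections, so a wrong bind address or network name fails loudly at boot instead of as a timeout mid-request.&lt;/p&gt;

```python
import socket
import time


def wait_for(host, port, timeout=10.0, interval=0.5):
    """Block until host:port accepts TCP connections, or raise.

    Call once per dependency (Redis, Qdrant, a model container) before
    reporting ready, so misconfiguration surfaces at startup.
    """
    deadline = time.monotonic() + timeout
    last_err = None
    while deadline > time.monotonic():
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError as err:  # refused, unreachable, or timed out
            last_err = err
            time.sleep(interval)
    raise RuntimeError(f"{host}:{port} unreachable after {timeout}s: {last_err}")
```

&lt;p&gt;In the compose setup above, that's a &lt;code&gt;wait_for("redis", 6379)&lt;/code&gt; call before the app starts serving.&lt;/p&gt;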




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/docker-containers-ping-but-http-fails" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>Training Small LLMs to Edit Code Instead of Generating It</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Fri, 10 Apr 2026 03:24:53 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/training-small-llms-to-edit-code-instead-of-generating-it-jmb</link>
      <guid>https://dev.to/fivenineslab_30/training-small-llms-to-edit-code-instead-of-generating-it-jmb</guid>
      <description>&lt;p&gt;You've hit the wall with 2B parameter models trying to write functions from scratch. The output is syntactically broken, logically confused, or just hallucinates APIs that don't exist. But what if you stopped asking these models to be creative and instead treated them as intelligent diff generators?&lt;/p&gt;

&lt;p&gt;I've run this exact experiment with Qwen2.5-Coder-1.5B and Phi-3-mini on an RTX 3060. The insight is simple: small models fail at generation but succeed at transformation. Give them a working reference implementation from GitHub and ask them to modify it for your specific use case. The model operates in the space of edits, not invention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Small Models Fail at Code Generation
&lt;/h2&gt;

&lt;p&gt;A 2B model has seen enormous amounts of code during pretraining, but it lacks the parameter capacity to reliably reproduce complex patterns. When you prompt "write a Redis connection pool in Python with retry logic", the model must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recall the Redis client API surface&lt;/li&gt;
&lt;li&gt;Remember exception hierarchies&lt;/li&gt;
&lt;li&gt;Generate retry backoff logic&lt;/li&gt;
&lt;li&gt;Handle connection lifecycle edge cases&lt;/li&gt;
&lt;li&gt;Produce syntactically valid, idiomatic Python&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's too many constraints for 2 billion parameters to satisfy simultaneously. You get code that looks plausible but fails on &lt;code&gt;import redis&lt;/code&gt; or forgets to close connections.&lt;/p&gt;

&lt;p&gt;But transformation is different. If you retrieve an existing Redis pool implementation and ask the model to "add exponential backoff to the retry logic", you've anchored it. The API calls are already there. The structure exists. The model only needs to insert a specific pattern it's seen hundreds of times.&lt;/p&gt;
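&lt;p&gt;The difference is visible in the prompts themselves. A minimal sketch of the two framings (the function names are mine, purely illustrative):&lt;/p&gt;

```python
def generation_prompt(task: str) -> str:
    # Open-ended: the model must recall the API surface, structure,
    # and edge cases entirely from its weights.
    return f"Write Python code to {task}."


def edit_prompt(task: str, reference: str) -> str:
    # Anchored: the reference supplies the API calls and structure,
    # so the model only has to produce a targeted transformation.
    return (
        f"Edit this code to: {task}\n\n"
        f"Reference implementation:\n{reference}\n\n"
        "Modified version:"
    )
```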

&lt;h2&gt;
  
  
  The Retrieval + Edit Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the pipeline I tested with Phi-3-mini (3.8B) on an RTX 3060 Ti:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Index GitHub code (one-time setup)
&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qdrant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./code_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;index_github_snippets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_files&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Embed and store code snippets with metadata&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;repo_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_into_functions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Parse AST, extract functions
&lt;/span&gt;        &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_snippets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieval + edit at inference
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;edit_code_for_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_snippets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;reference_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/Phi-3-mini-4k-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/Phi-3-mini-4k-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    prompt = f"""Edit this code to: {user_query}

Reference implementation:
```python
{reference_code}
```

Modified version:"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The key is the prompt structure. You're not asking "write code to X". You're asking "here's code that does Y, modify it to do X". This constrains the solution space dramatically.&lt;/p&gt;
&lt;h2&gt;
  
  
  Performance Numbers on Low-End Hardware
&lt;/h2&gt;

&lt;p&gt;On an RTX 3060 Ti (8GB VRAM), here's what I measured:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phi-3-mini-4k-instruct (3.8B params, FP16)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference time: 2.1s for 256 tokens&lt;/li&gt;
&lt;li&gt;VRAM usage: 7.2GB with batch size 1&lt;/li&gt;
&lt;li&gt;Success rate (code runs without errors): 73% on my test set of 50 tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Qwen2.5-Coder-1.5B-Instruct (FP16)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference time: 1.3s for 256 tokens&lt;/li&gt;
&lt;li&gt;VRAM usage: 3.1GB&lt;/li&gt;
&lt;li&gt;Success rate: 61%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare this to the same models generating from scratch (no reference code):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phi-3: 41% success rate&lt;/li&gt;
&lt;li&gt;Qwen2.5-Coder: 29% success rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap is huge. Editing existing code nearly doubles the reliability.&lt;/p&gt;
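&lt;p&gt;For reference, "success" here is the blunt check of whether the output compiles and executes. A simplified sketch of that scoring (my actual harness ran candidates in a subprocess with a timeout; in-process &lt;code&gt;exec&lt;/code&gt; is only safe for trusted test snippets):&lt;/p&gt;

```python
def runs_clean(snippet: str) -> bool:
    """True if the snippet compiles and executes without raising."""
    try:
        exec(compile(snippet, "candidate", "exec"), {})
        return True
    except Exception:
        return False


def success_rate(snippets) -> float:
    results = [runs_clean(s) for s in snippets]
    return sum(results) / len(results)
```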
&lt;h2&gt;
  
  
  What This Actually Looks Like in Production
&lt;/h2&gt;

&lt;p&gt;I deployed this as a VSCode extension prototype. The workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User highlights code and types a natural language edit request&lt;/li&gt;
&lt;li&gt;Extension embeds the request + existing code context&lt;/li&gt;
&lt;li&gt;Searches local Qdrant index (seeded with 50k Python functions from popular repos)&lt;/li&gt;
&lt;li&gt;Retrieves top-3 similar implementations&lt;/li&gt;
&lt;li&gt;Sends reference code + edit instruction to local Phi-3 instance via llama.cpp server&lt;/li&gt;
&lt;li&gt;Returns diff overlay in editor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The llama.cpp server runs with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; phi-3-mini-4k-instruct.Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 35 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quantization (Q4_K_M) drops VRAM to 2.4GB. Inference is 3.2s on an RTX 2060. That's fast enough for an interactive editing assistant.&lt;/p&gt;
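&lt;p&gt;The extension talks to that server over its HTTP API. A sketch of the request body (the &lt;code&gt;prompt&lt;/code&gt;, &lt;code&gt;n_predict&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;, and &lt;code&gt;stop&lt;/code&gt; fields come from llama.cpp's &lt;code&gt;/completion&lt;/code&gt; endpoint; the helper names are mine):&lt;/p&gt;

```python
import json
import urllib.request


def build_edit_request(reference_code: str, instruction: str) -> dict:
    """Payload for the llama.cpp server's /completion endpoint."""
    return {
        "prompt": (
            f"Edit this code to: {instruction}\n\n"
            f"Reference implementation:\n{reference_code}\n\n"
            "Modified version:"
        ),
        "n_predict": 512,
        "temperature": 0.2,
        "stop": ["\n\n\n"],
    }


def send(payload: dict, url: str = "http://127.0.0.1:8080/completion") -> str:
    """POST to the local server started above. Not executed here."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["content"]
```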

&lt;h2&gt;
  
  
  The Limitations You'll Hit
&lt;/h2&gt;

&lt;p&gt;This isn't a magic solution. The model still hallucinates when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieved code is too different from the target task (embedder failure)&lt;/li&gt;
&lt;li&gt;Edit requires understanding distant context (small context windows)&lt;/li&gt;
&lt;li&gt;Task involves proprietary APIs not in the training data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I found the sweet spot is refactoring, adding error handling, changing API versions, and adapting patterns. The model is bad at architectural decisions or designing new abstractions.&lt;/p&gt;

&lt;p&gt;Also, code retrieval quality matters more than model size. A better embedding model (say, &lt;code&gt;Salesforce/SFR-Embedding-Mistral&lt;/code&gt;) improves success rate by 8-12 percentage points. The model can only edit what you feed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Build This?
&lt;/h2&gt;

&lt;p&gt;If you're running on constrained hardware and need a local coding assistant, yes. The retrieval + edit pattern is the only way I've found to get reliable output from sub-4B models.&lt;/p&gt;

&lt;p&gt;But if you have access to larger models (CodeLlama 13B, DeepSeek-Coder 6.7B), stick with those. They can generate reasonably well from scratch, and the added complexity of maintaining a code index isn't worth it.&lt;/p&gt;

&lt;p&gt;The real use case is edge deployment: offline environments, privacy-sensitive codebases, or devices where you can't run 13B+ models. A 2B editor beats no assistant at all.&lt;/p&gt;

&lt;p&gt;For infrastructure teams, this matters if you're building internal developer tools. You can ship a locally-running code assistant that doesn't leak proprietary code to external APIs. The cost is maintaining the GitHub index and embedding pipeline, which is straightforward with Qdrant + a scheduled indexing job.&lt;/p&gt;

&lt;p&gt;I'm running this setup in production for an internal CLI tool generator. Developers describe what they want, the system retrieves similar CLI implementations from our repos, and Phi-3 generates the modified version. It's not AGI, but it's useful.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/training-small-llms-edit-code-not-generate" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>Why More Data Center Teams Are Choosing NX-OS VXLAN EVPN Over Cisco ACI in 2026</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Thu, 09 Apr 2026 23:23:53 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/why-more-data-center-teams-are-choosing-nx-os-vxlan-evpn-over-cisco-aci-in-2026-khc</link>
      <guid>https://dev.to/fivenineslab_30/why-more-data-center-teams-are-choosing-nx-os-vxlan-evpn-over-cisco-aci-in-2026-khc</guid>
      <description>&lt;p&gt;I spent four hours last Tuesday troubleshooting why a new GPU node couldn't reach the MLflow registry during a training run. The ACI fabric was reporting the endpoint learned. The policy contract showed permit. But packets died silently somewhere between leaf switches. The root cause? A stale endpoint entry in the COOP database that the APIC controller hadn't reconciled. I fixed it by clearing the endpoint from the CLI, bypassing the abstraction layer entirely.&lt;/p&gt;

&lt;p&gt;That incident crystallized something I'd been seeing across three data center builds: when the controller's model of the network diverges from the actual forwarding state, you end up working around the abstraction, not through it. You SSH to the leaf switch and run &lt;code&gt;show&lt;/code&gt; commands that reveal what's really happening in hardware. At that point, the controller is adding latency, not value.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Tradeoff Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;ACI's pitch is clean: declare your intent through a GUI or API, and the fabric converges to that state. The APIC controller translates your application profiles, bridge domains, and contracts into the necessary VXLAN, EVPN, and policy constructs. You shouldn't need to understand MP-BGP route types or VNI allocation.&lt;/p&gt;

&lt;p&gt;But here's what actually happens in production: you still need to understand those primitives when something breaks. The abstraction doesn't eliminate complexity; it relocates it behind an API that makes certain operations harder. Want to trace a specific MAC/IP binding through the fabric? You're running &lt;code&gt;moquery&lt;/code&gt; against the APIC's object store and correlating it with CLI output from the leaf. Want to integrate with an existing BGP-based underlay? You're fighting the APIC's assumptions about how fabric routing should work.&lt;/p&gt;

&lt;p&gt;NX-OS VXLAN EVPN, in contrast, gives you direct access to the forwarding primitives. You configure BGP EVPN address families, define VNI-to-VLAN mappings, and control route advertisement explicitly. There's no translation layer. What you configure is what runs in hardware.&lt;/p&gt;
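&lt;p&gt;For a sense of what "direct" means, here is the shape of that configuration. This is a trimmed sketch with illustrative VLAN, VNI, and AS numbers, not a drop-in config:&lt;/p&gt;

```shell
# Map VLAN 100 to VNI 10100 and advertise it via BGP EVPN
feature bgp
feature nv overlay
feature vn-segment-vlan-based
nv overlay evpn

vlan 100
  vn-segment 10100

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback0
  member vni 10100
    ingress-replication protocol bgp

router bgp 65000
  neighbor 10.0.0.2
    remote-as 65000
    address-family l2vpn evpn
      send-community extended

evpn
  vni 10100 l2
    rd auto
    route-target import auto
    route-target export auto
```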

&lt;h2&gt;
  
  
  Where This Shows Up in AI Infrastructure
&lt;/h2&gt;

&lt;p&gt;GPU clusters amplify every network design decision. When you're running distributed training across 64 A100 nodes, a single packet drop during an NCCL all-reduce can stall the entire job. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic forwarding paths with consistent latency&lt;/li&gt;
&lt;li&gt;Lossless Ethernet with PFC properly scoped to GPU traffic&lt;/li&gt;
&lt;li&gt;Fast convergence when a leaf switch or link fails&lt;/li&gt;
&lt;li&gt;Visibility into actual queue depths and buffer utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ACI can deliver all of this, but the configuration path is indirect. You define QoS classes in the APIC, which generates MQC policies on each leaf. You enable PFC through a fabric access policy, which the APIC pushes as platform-specific DCBX settings. When you need to verify that PFC PAUSE frames are actually being sent for CoS 3 traffic on a specific port, you're back on the switch CLI, running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;switch# show interface ethernet 1/49 priority-flow-control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if the output doesn't match what the APIC says is configured, you're troubleshooting two systems instead of one.&lt;/p&gt;

&lt;p&gt;With NX-OS VXLAN EVPN, the QoS configuration is direct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;class-map &lt;span class="nb"&gt;type &lt;/span&gt;qos match-all gpu-rdma
  match cos 3
policy-map &lt;span class="nb"&gt;type &lt;/span&gt;qos gpu-qos
  class gpu-rdma
    &lt;span class="nb"&gt;set &lt;/span&gt;qos-group 3
    priority level 1

interface Ethernet1/49
  service-policy &lt;span class="nb"&gt;type &lt;/span&gt;qos input gpu-qos
  priority-flow-control mode on
  priority-flow-control watch-dog-interval 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You write the exact policy you need. You see it in &lt;code&gt;show run&lt;/code&gt;. You verify it in &lt;code&gt;show policy-map interface&lt;/code&gt;. There's no model translation to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kubernetes Integration Gap
&lt;/h2&gt;

&lt;p&gt;Most AI infrastructure runs on Kubernetes now, and Kubernetes networking has strong opinions. CNI plugins like Calico, Cilium, and Antrea expect to control pod networking — IP allocation, routing, and increasingly, network policy. They assume the physical network provides L3 reachability, typically via BGP.&lt;/p&gt;

&lt;p&gt;ACI's CNI plugin tries to bridge these worlds by mapping Kubernetes constructs to ACI objects. A namespace becomes an EPG. A network policy becomes a contract. But this creates tight coupling: your cluster lifecycle is now tied to the APIC's API and its upgrade schedule. I've seen teams delay Kubernetes upgrades by six months waiting for a compatible ACI CNI version.&lt;/p&gt;

&lt;p&gt;The alternative pattern I'm seeing: run NX-OS VXLAN EVPN in the fabric, peer each leaf switch with the Kubernetes nodes using eBGP, and let the CNI plugin handle pod networking. Calico's route reflector mode works perfectly here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/v3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BGPConfiguration&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;logSeverityScreen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Info&lt;/span&gt;
  &lt;span class="na"&gt;nodeToNodeMeshEnabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;asNumber&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;65001&lt;/span&gt;
  &lt;span class="na"&gt;serviceClusterIPs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cidr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.96.0.0/12&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;projectcalico.org/v3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BGPPeer&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;leaf-1&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;peerIP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.0.1&lt;/span&gt;
  &lt;span class="na"&gt;asNumber&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;65000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Kubernetes node peers with its leaf switches. Pod routes are advertised via BGP. The fabric treats them like any other /32. When a pod moves, Calico withdraws the old route and advertises the new one. Convergence is sub-second. No controller in the middle.&lt;/p&gt;
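&lt;p&gt;The leaf side of that peering is a few lines of eBGP. A sketch with illustrative node addresses, matching the AS numbers used in the Calico manifests:&lt;/p&gt;

```shell
# NX-OS leaf: peer with the Kubernetes nodes attached to this switch
router bgp 65000
  neighbor 10.0.0.10
    remote-as 65001
    description k8s-node-1
    address-family ipv4 unicast
  neighbor 10.0.0.11
    remote-as 65001
    description k8s-node-2
    address-family ipv4 unicast
```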

&lt;h2&gt;
  
  
  The TCO Equation Has Changed
&lt;/h2&gt;

&lt;p&gt;ACI's total cost of ownership used to be defensible because the APIC automation saved operational effort. But in 2026, the baseline assumption is Infrastructure as Code. You're managing everything through Terraform or Ansible anyway. The question isn't whether you have automation; it's which primitives your automation targets.&lt;/p&gt;

&lt;p&gt;Targeting ACI means Terraform providers that wrap the APIC API, which abstracts the actual network config. Your state files contain EPGs and contracts. Your pipeline has an APIC dependency — it has to be reachable, authenticated, and healthy for changes to apply.&lt;/p&gt;

&lt;p&gt;Targeting NX-OS EVPN means Terraform providers that generate CLI commands or use NETCONF/gNMI. Your state files contain the actual config. Your pipeline pushes directly to devices. You can stage and test config in a text file before applying it. There's no controller to version-match or license separately.&lt;/p&gt;

&lt;p&gt;License cost is the obvious part: ACI requires APIC controllers (virtual or physical) with their own licensing. NX-OS VXLAN EVPN runs on the same switches with base NX-OS licensing. But the less obvious cost is operational: every abstraction layer is another integration point to maintain, another API version to track, another component in your blast radius when you upgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your Next Build
&lt;/h2&gt;

&lt;p&gt;If you're designing a leaf-spine fabric in 2026, especially for GPU-dense AI infrastructure, start with these questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Do you need the APIC's policy model, or are you comfortable managing EVPN/VXLAN primitives directly through IaC?&lt;/li&gt;
&lt;li&gt;How tightly coupled do you want your physical network to be with your container orchestration layer?&lt;/li&gt;
&lt;li&gt;When troubleshooting, do you prefer working through a controller API or directly on device CLI?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For most teams I'm working with, the answers point toward NX-OS VXLAN EVPN. They're already managing network config as code. Their Kubernetes CNI handles pod networking. They want the shortest path from intent to forwarding plane, especially when debugging at 2 AM.&lt;/p&gt;

&lt;p&gt;ACI isn't dead, but its value proposition has narrowed. It still makes sense if you have a large operational team that prefers GUI-driven workflows, or if you're deeply integrated with Cisco's broader intent-based networking stack. But for infrastructure engineers building modern GPU clusters and Kubernetes platforms, the simpler path is increasingly the better one.&lt;/p&gt;

&lt;p&gt;The network is becoming infrastructure code. The abstraction layers that hide the primitives are becoming friction. And the teams who understand EVPN/VXLAN directly are shipping faster than the ones waiting for controller APIs to catch up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/nx-os-vxlan-evpn-over-cisco-aci-2026" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>Docker's nftables Mode Doesn't Respect Your Drop Rules — Here's the Fix</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Tue, 07 Apr 2026 18:12:11 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/dockers-nftables-mode-doesnt-respect-your-drop-rules-heres-the-fix-3khf</link>
      <guid>https://dev.to/fivenineslab_30/dockers-nftables-mode-doesnt-respect-your-drop-rules-heres-the-fix-3khf</guid>
      <description>&lt;p&gt;You enable Docker's experimental nftables support, add a drop rule in &lt;code&gt;/etc/nftables.conf&lt;/code&gt;, reload your firewall, and the container port stays wide open. The packet hits your drop rule, then Docker's accept rule fires anyway. This violates everything you thought you knew about packet filtering.&lt;/p&gt;

&lt;p&gt;I hit this exact scenario running a multi-tenant LLM API platform where different teams deploy inference containers. One team accidentally exposed their Ollama admin interface on port 3000. Standard nftables drop rules in our firewall config did nothing — the port stayed accessible from the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker's nftables Chains Bypass Your Rules
&lt;/h2&gt;

&lt;p&gt;Docker 29+ creates its own nftables table (&lt;code&gt;docker&lt;/code&gt;) with chains that hook into &lt;code&gt;prerouting&lt;/code&gt;, &lt;code&gt;forward&lt;/code&gt;, and &lt;code&gt;postrouting&lt;/code&gt;. These chains have specific priority values that determine their execution order relative to your custom chains.&lt;/p&gt;

&lt;p&gt;Here's the critical part: nftables evaluates chains based on &lt;strong&gt;priority within the same hook&lt;/strong&gt;. A drop rule in your &lt;code&gt;inet filter&lt;/code&gt; table with priority &lt;code&gt;0&lt;/code&gt; doesn't automatically block packets that a &lt;code&gt;docker&lt;/code&gt; table chain with priority &lt;code&gt;-100&lt;/code&gt; has already accepted.&lt;/p&gt;

&lt;p&gt;Check what Docker actually created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nft list ruleset | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 20 &lt;span class="s2"&gt;"table inet docker"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="n"&gt;inet&lt;/span&gt; &lt;span class="n"&gt;docker&lt;/span&gt; {
    &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="n"&gt;forward&lt;/span&gt; {
        &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="n"&gt;forward&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt; -&lt;span class="m"&gt;100&lt;/span&gt;; &lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;;
        &lt;span class="n"&gt;ct&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="n"&gt;established&lt;/span&gt;,&lt;span class="n"&gt;related&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;
        &lt;span class="n"&gt;iifname&lt;/span&gt; &lt;span class="s2"&gt;"docker0"&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;
        &lt;span class="n"&gt;oifname&lt;/span&gt; &lt;span class="s2"&gt;"docker0"&lt;/span&gt; &lt;span class="n"&gt;accept&lt;/span&gt;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;priority -100&lt;/code&gt; means Docker's forward chain runs &lt;strong&gt;before&lt;/strong&gt; your standard filter chain at priority &lt;code&gt;0&lt;/code&gt;. If Docker's chain accepts the packet, your drop rule never even sees it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Priority Math Docker Doesn't Tell You
&lt;/h2&gt;

&lt;p&gt;nftables priorities are signed integers, and lower (more negative) values run first. Standard filter tables use priority &lt;code&gt;0&lt;/code&gt;. Docker uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;prerouting&lt;/code&gt;: priority &lt;code&gt;-300&lt;/code&gt; for DNAT rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;forward&lt;/code&gt;: priority &lt;code&gt;-100&lt;/code&gt; for container traffic acceptance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;postrouting&lt;/code&gt;: priority &lt;code&gt;100&lt;/code&gt; for masquerading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your drop rule in a priority &lt;code&gt;0&lt;/code&gt; chain fires after Docker has already said "yes, forward this packet to the container". The packet is gone.&lt;/p&gt;
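&lt;p&gt;The ordering rule is easy to sanity-check with a toy model: base chains registered on the same hook run in ascending priority order, so anything you want evaluated before Docker's chain needs a priority below &lt;code&gt;-100&lt;/code&gt;. A minimal sketch in plain Python (an illustration of the ordering only, not an nftables API):&lt;/p&gt;

```python
# Toy model of nftables base-chain ordering on a single hook.
# Illustration only, not an nftables API: chains registered on the
# same hook fire in ascending priority order.

def hook_order(chains):
    """Return chain names in the order the hook would run them."""
    return [name for name, prio in sorted(chains, key=lambda c: c[1])]

chains = [
    ("inet filter forward", 0),             # your standard filter chain
    ("inet docker forward", -100),          # Docker's chain
    ("inet firewall forward_early", -200),  # a custom early chain
]

print(hook_order(chains))
# ['inet firewall forward_early', 'inet docker forward', 'inet filter forward']
```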

&lt;h2&gt;
  
  
  Solution 1: Override Docker's Priority
&lt;/h2&gt;

&lt;p&gt;Create a chain with a lower priority than Docker's &lt;code&gt;-100&lt;/code&gt; for the forward hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nft add table inet firewall
nft add chain inet firewall forward_early &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s1"&gt;'{ type filter hook forward priority -200; policy accept; }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now add your drop rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Block port 3000 to all containers&lt;/span&gt;
nft add rule inet firewall forward_early &lt;span class="se"&gt;\&lt;/span&gt;
    tcp dport 3000 drop

&lt;span class="c"&gt;# Or block specific container IPs&lt;/span&gt;
nft add rule inet firewall forward_early &lt;span class="se"&gt;\&lt;/span&gt;
    ip daddr 172.17.0.5 tcp dport 3000 drop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the priority order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nft list chains | &lt;span class="nb"&gt;grep &lt;/span&gt;forward
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your &lt;code&gt;forward_early&lt;/code&gt; chain listed with priority &lt;code&gt;-200&lt;/code&gt;, which executes before Docker's &lt;code&gt;-100&lt;/code&gt; chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution 2: Modify Docker's Table Directly
&lt;/h2&gt;

&lt;p&gt;Instead of fighting Docker's priorities, inject rules into Docker's own chains. This approach is cleaner for container-specific policies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Insert at the beginning of Docker's forward chain&lt;/span&gt;
nft insert rule inet docker forward &lt;span class="se"&gt;\&lt;/span&gt;
    tcp dport 3000 drop

&lt;span class="c"&gt;# Or match by container network&lt;/span&gt;
nft insert rule inet docker forward &lt;span class="se"&gt;\&lt;/span&gt;
    iifname &lt;span class="s2"&gt;"br-a1b2c3d4e5f6"&lt;/span&gt; tcp dport 3000 drop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;insert&lt;/code&gt; keyword places your rule at the top of the chain, before Docker's blanket accept rules. This works because you're operating within Docker's priority level.&lt;/p&gt;
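&lt;p&gt;The difference between &lt;code&gt;insert&lt;/code&gt; and &lt;code&gt;add&lt;/code&gt; comes down to first-match evaluation within a chain. A toy model (plain Python, not Docker's implementation) makes the point:&lt;/p&gt;

```python
# First-match model of a single nftables chain (illustrative only).
# "nft insert" prepends a rule, "nft add" appends; within a chain the
# first rule whose match fires returns its verdict.

def evaluate(chain, packet):
    for match, verdict in chain:
        if match(packet):
            return verdict
    return "policy accept"

# Docker's chain as created: a blanket accept for container traffic.
docker_forward = [(lambda p: True, "accept")]

# nft insert rule inet docker forward tcp dport 3000 drop
docker_forward.insert(0, (lambda p: p["dport"] == 3000, "drop"))

print(evaluate(docker_forward, {"dport": 3000}))  # drop
print(evaluate(docker_forward, {"dport": 8080}))  # accept
```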

&lt;p&gt;I use this method in production to enforce per-network policies. Each Docker Compose stack gets its own bridge network, and we insert drop rules for admin ports (like Jupyter on 8888, or MLflow on 5000) directly into the &lt;code&gt;inet docker forward&lt;/code&gt; chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Rules Persistent
&lt;/h2&gt;

&lt;p&gt;Docker recreates its nftables rules on every daemon restart. Your manual &lt;code&gt;nft&lt;/code&gt; commands vanish. You need a script that runs after Docker starts.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/docker-firewall.service&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight systemd"&gt;&lt;code&gt;&lt;span class="k"&gt;[Unit]&lt;/span&gt;
&lt;span class="nt"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;Docker nftables Firewall Rules
&lt;span class="nt"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;docker.service
&lt;span class="nt"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;docker.service

&lt;span class="k"&gt;[Service]&lt;/span&gt;
&lt;span class="nt"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;oneshot
&lt;span class="nt"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;/usr/local/bin/docker-firewall-rules.sh
&lt;span class="nt"&gt;RemainAfterExit&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;yes

&lt;span class="k"&gt;[Install]&lt;/span&gt;
&lt;span class="nt"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create &lt;code&gt;/usr/local/bin/docker-firewall-rules.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="c"&gt;# Wait for Docker's nftables table to exist&lt;/span&gt;
&lt;span class="nv"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;span class="nv"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; nft list table inet docker &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;attempt &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$attempt&lt;/span&gt; &lt;span class="nt"&gt;-ge&lt;/span&gt; &lt;span class="nv"&gt;$max_attempts&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Docker nftables table not found after &lt;/span&gt;&lt;span class="nv"&gt;$max_attempts&lt;/span&gt;&lt;span class="s2"&gt; attempts"&lt;/span&gt;
        &lt;span class="nb"&gt;exit &lt;/span&gt;1
    &lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;1
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Insert drop rules for blocked ports&lt;/span&gt;
nft insert rule inet docker forward tcp dport 3000 drop
nft insert rule inet docker forward tcp dport 8888 drop
nft insert rule inet docker forward tcp dport 5000 drop

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Docker firewall rules applied"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable and enable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /usr/local/bin/docker-firewall-rules.sh
systemctl daemon-reload
systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;docker-firewall.service
systemctl start docker-firewall.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Table Family Trap
&lt;/h2&gt;

&lt;p&gt;One gotcha: Docker's table uses the &lt;code&gt;inet&lt;/code&gt; family, which handles both IPv4 and IPv6, so your rules must use &lt;code&gt;inet&lt;/code&gt; too. A rule in an &lt;code&gt;ip&lt;/code&gt; table only sees IPv4 traffic, and Docker's &lt;code&gt;inet&lt;/code&gt; chains will still forward the IPv6 side.&lt;/p&gt;

&lt;p&gt;Always match table families:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong - only catches IPv4&lt;/span&gt;
nft add table ip firewall
nft add chain ip firewall forward_early &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s1"&gt;'{ type filter hook forward priority -200; }'&lt;/span&gt;

&lt;span class="c"&gt;# Right - catches both stacks&lt;/span&gt;
nft add table inet firewall
nft add chain inet firewall forward_early &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s1"&gt;'{ type filter hook forward priority -200; }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Debugging Chain Execution
&lt;/h2&gt;

&lt;p&gt;When rules don't work, trace the packet path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable packet tracing for port 3000&lt;/span&gt;
nft add rule inet firewall forward_early &lt;span class="se"&gt;\&lt;/span&gt;
    tcp dport 3000 meta nftrace &lt;span class="nb"&gt;set &lt;/span&gt;1

&lt;span class="c"&gt;# In another terminal, watch the trace&lt;/span&gt;
nft monitor trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then trigger traffic to port 3000. You'll see exactly which chains and rules the packet hits, in order. This shows you where Docker's chains accept the packet before your drop rule fires.&lt;/p&gt;

&lt;p&gt;For production debugging, I prefer logging over tracing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nft insert rule inet docker forward &lt;span class="se"&gt;\&lt;/span&gt;
    tcp dport 3000 log prefix &lt;span class="s2"&gt;"DOCKER-BLOCK-3000: "&lt;/span&gt; drop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then tail &lt;code&gt;/var/log/syslog&lt;/code&gt; or &lt;code&gt;/var/log/kern.log&lt;/code&gt; to see blocked connection attempts with full packet details.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About iptables-nft?
&lt;/h2&gt;

&lt;p&gt;If you're using &lt;code&gt;iptables-nft&lt;/code&gt; (the nftables backend for iptables commands), Docker's rules still win. The iptables commands generate nftables rules in a compatibility table, but Docker's native &lt;code&gt;inet docker&lt;/code&gt; table has its own priority scheme.&lt;/p&gt;

&lt;p&gt;The solution is the same: create chains with appropriate priorities, or modify Docker's chains directly. Don't rely on legacy iptables commands to override nftables-native Docker rules.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/docker-nftables-port-blocking-priority-chains" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>aiinfrastructure</category>
    </item>
    <item>
      <title>Running Gemma 2 27B Locally: MLX vs vLLM vs llama.cpp Performance Comparison</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:34:39 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/running-gemma-2-27b-locally-mlx-vs-vllm-vs-llamacpp-performance-comparison-29la</link>
      <guid>https://dev.to/fivenineslab_30/running-gemma-2-27b-locally-mlx-vs-vllm-vs-llamacpp-performance-comparison-29la</guid>
      <description>&lt;p&gt;You run Gemma 2 27B on MLX the day it drops, feed it some multimodal prompts, and get nonsense hallucinations. Meanwhile, Reddit threads are full of people saying it's the best 27B model yet. Something doesn't add up.&lt;/p&gt;

&lt;p&gt;The problem isn't the model — it's the inference harness. Each framework makes different tradeoffs in quantization, attention implementation, and memory layout. Run the same model on MLX, vLLM, and llama.cpp, and you'll get three different experiences. I've spent the last week running Gemma 2 27B across all three to find out which actually delivers production-quality inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Your MLX Results Look Wrong
&lt;/h2&gt;

&lt;p&gt;MLX optimizes for Apple Silicon's unified memory architecture, but Gemma 2's design fights it. The model uses sliding window attention with local and global attention heads — a pattern that doesn't map cleanly to MLX's matrix operations. When you quantize to 4-bit with MLX's default quantization scheme, those attention patterns degrade fast.&lt;/p&gt;

&lt;p&gt;Here's what most people run on Mac:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_lm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/gemma-2-27b-it-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trust_remote_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe this image: &amp;lt;image&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loads the community 4-bit quant, which uses grouped quantization with block size 128. For text-only prompts, it's fine. For vision or long-context tasks, the quantization errors compound. You're not seeing the model's true capabilities — you're seeing quantization artifacts.&lt;/p&gt;
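&lt;p&gt;You can see the compounding in miniature with a crude per-group min-max quantizer (a toy stand-in for grouped quantization, not MLX's actual code): at 4 bits the mean reconstruction error is more than an order of magnitude larger than at 8 bits, and a 27B-parameter forward pass multiplies those errors through dozens of layers.&lt;/p&gt;

```python
import random

# Toy per-group min-max quantizer (illustrative only, not MLX's scheme).
def quantize_error(weights, bits, group=128):
    levels = 2 ** bits - 1
    err = 0.0
    for i in range(0, len(weights), group):
        block = weights[i:i + group]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / levels or 1.0
        for w in block:
            q = round((w - lo) / scale)       # nearest integer level
            err += abs(w - (lo + q * scale))  # reconstruction error
    return err / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

e4 = quantize_error(weights, bits=4)
e8 = quantize_error(weights, bits=8)
print(e4 / e8)  # roughly 17x: the ratio of the quantization step sizes
```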

&lt;p&gt;The fix: use the official MLX 8-bit quant or run bf16 if you have 64GB+ unified memory. The 8-bit version uses a different quantization scheme that preserves attention head outputs better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/gemma-2-27b-it-8bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Official 8-bit quant
&lt;/span&gt;    &lt;span class="n"&gt;tokenizer_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trust_remote_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Same generate call, noticeably better outputs
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On an M2 Ultra with 192GB, this runs at ~28 tokens/sec for coding tasks. Hallucinations drop significantly. But you're still bottlenecked by MLX's single-device constraint — no multi-GPU, no batching across requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  vLLM: Production Throughput on NVIDIA Hardware
&lt;/h2&gt;

&lt;p&gt;If you're running on Linux with NVIDIA GPUs, vLLM is the answer. It implements PagedAttention, continuous batching, and efficient KV cache management. For Gemma 2 27B, this means 3-4x higher throughput than naive implementations.&lt;/p&gt;

&lt;p&gt;Deploy it with Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;vllm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:v0.6.3&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="s"&gt;--model google/gemma-2-27b-it&lt;/span&gt;
      &lt;span class="s"&gt;--dtype bfloat16&lt;/span&gt;
      &lt;span class="s"&gt;--max-model-len 8192&lt;/span&gt;
      &lt;span class="s"&gt;--gpu-memory-utilization 0.9&lt;/span&gt;
      &lt;span class="s"&gt;--tensor-parallel-size 2&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;reservations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia&lt;/span&gt;
              &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
              &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;shm_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;16gb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs Gemma 2 27B sharded across 2x A100 40GB GPUs. The &lt;code&gt;--gpu-memory-utilization 0.9&lt;/code&gt; tells vLLM to use 90% of GPU memory for KV cache — critical for high batch throughput. With continuous batching enabled, you'll serve 15-20 concurrent requests at ~45 tokens/sec per request.&lt;/p&gt;

&lt;p&gt;Test it with curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "google/gemma-2-27b-it",
    "prompt": "Write a Python function to parse YAML",
    "max_tokens": 256,
    "temperature": 0.3
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For coding tasks, vLLM with bf16 precision produces clean, accurate outputs. No hallucinations, consistent structure. The difference from 4-bit MLX is night and day.&lt;/p&gt;
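&lt;p&gt;If you'd rather script the smoke test than use curl, the same payload is easy to build with the Python standard library. A sketch that assumes the compose service above is listening on localhost:8000:&lt;/p&gt;

```python
import json
import urllib.request

# Build a completion request for vLLM's OpenAI-compatible endpoint.
# BASE_URL assumes the docker-compose service above is up locally.
BASE_URL = "http://localhost:8000/v1/completions"

def completion_request(prompt, max_tokens=256, temperature=0.3):
    body = json.dumps({
        "model": "google/gemma-2-27b-it",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode()
    return urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )

req = completion_request("Write a Python function to parse YAML")
print(json.loads(req.data)["model"])  # google/gemma-2-27b-it
# urllib.request.urlopen(req) sends it once the server is running
```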

&lt;h2&gt;
  
  
  llama.cpp: The Middle Ground
&lt;/h2&gt;

&lt;p&gt;You're on Mac, don't want to spin up cloud GPUs, but need better quality than 4-bit MLX. llama.cpp with Q5_K_M or Q6_K quantization splits the difference.&lt;/p&gt;

&lt;p&gt;Build from source with Metal support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
make &lt;span class="nv"&gt;LLAMA_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# Download a quality quant&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; gemma-2-27b-it-Q6_K.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/resolve/main/gemma-2-27b-it-Q6_K.gguf

&lt;span class="c"&gt;# Run with context optimized for coding&lt;/span&gt;
./llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; gemma-2-27b-it-Q6_K.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 0.3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.9 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a Rust function to validate JSON schema"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-ngl 999&lt;/code&gt; offloads all layers to Metal. Q6_K quantization keeps 6-bit weights with K-quant optimization — better precision than 4-bit, manageable memory footprint. On M2 Max with 64GB, this runs at ~22 tokens/sec.&lt;/p&gt;
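&lt;p&gt;The memory math behind that footprint is worth spelling out. Q6_K averages roughly 6.5 bits per weight (an approximation; real GGUF files add metadata, and inference needs headroom for the KV cache), which is why a 27B model fits comfortably in 64GB where bf16 would not:&lt;/p&gt;

```python
# Back-of-envelope weight memory for Gemma 2 27B at different precisions.
# Bits-per-weight figures are approximate; actual GGUF files also carry
# metadata, and inference needs extra room for the KV cache.

def weight_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

params = 27e9
for name, bpw in [("bf16", 16.0), ("Q6_K (~6.5 bpw)", 6.5), ("4-bit", 4.0)]:
    print(f"{name:16s} ~{weight_gb(params, bpw):.1f} GB")
# bf16 ~54.0 GB, Q6_K ~21.9 GB, 4-bit ~13.5 GB
```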

&lt;p&gt;For vision tasks that caused hallucinations in MLX, llama.cpp with Q6_K produces coherent descriptions. The difference isn't dramatic, but it's reliable enough for production use cases where you can't accept garbage outputs 20% of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Performance Numbers
&lt;/h2&gt;

&lt;p&gt;I ran the same coding benchmark across all three setups — 50 Python function generation tasks, measured by pass@1 on unit tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MLX 4-bit&lt;/strong&gt;: 58% pass rate, 28 tok/s, frequent off-topic generations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLX 8-bit&lt;/strong&gt;: 74% pass rate, 26 tok/s, reliable structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp Q6_K&lt;/strong&gt;: 76% pass rate, 22 tok/s, consistent quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM bf16 (2x A100)&lt;/strong&gt;: 81% pass rate, 45 tok/s, production-grade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;vLLM wins on quality and throughput, but you're paying for cloud GPUs. For local Mac development, llama.cpp Q6_K is the sweet spot — better than MLX's default 4-bit, slightly ahead of even 8-bit MLX on pass rate, and it works reliably out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters for Your Use Case
&lt;/h2&gt;

&lt;p&gt;If you're doing exploratory coding on Mac, start with llama.cpp Q6_K. It just works, no Python environment conflicts, no MLX quirks with certain prompt formats.&lt;/p&gt;

&lt;p&gt;If you're building an API that serves multiple users, run vLLM on rented NVIDIA hardware. The throughput and batching efficiency pay for themselves after 10-20 concurrent users.&lt;/p&gt;

&lt;p&gt;If you're locked into the Apple ecosystem with 128GB+ unified memory and want Python integration, use MLX with 8-bit quants. Skip the 4-bit community models — they're fine for demos, broken for real work.&lt;/p&gt;

&lt;p&gt;The model quality is there. You just need to stop using inference harnesses that throw away half the precision to save memory you probably don't need to save.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/running-gemma-2-27b-locally-mlx-vllm-llamacpp-comparison" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>aiinfrastructure</category>
      <category>gpu</category>
    </item>
    <item>
      <title>How to Block Docker Ports with nftables Without Getting Bypassed</title>
      <dc:creator>augustine Egbuna</dc:creator>
      <pubDate>Tue, 07 Apr 2026 01:33:33 +0000</pubDate>
      <link>https://dev.to/fivenineslab_30/how-to-block-docker-ports-with-nftables-without-getting-bypassed-5e9h</link>
      <guid>https://dev.to/fivenineslab_30/how-to-block-docker-ports-with-nftables-without-getting-bypassed-5e9h</guid>
      <description>&lt;p&gt;You add an nftables rule to drop traffic on port 8080. You check the ruleset — it's active. You curl localhost:8080 from outside the host, and the Dockerized API responds anyway. Your firewall just got ignored.&lt;/p&gt;

&lt;p&gt;This isn't a configuration mistake. Docker deliberately writes its own iptables rules that execute before nftables ever sees the packet. If you're running GPU inference services, internal LLM APIs, or any container that shouldn't be internet-facing, this behavior is a production security gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker Bypasses Your Firewall
&lt;/h2&gt;

&lt;p&gt;Docker manipulates iptables-legacy directly, inserting DNAT rules in the &lt;code&gt;nat&lt;/code&gt; table and ACCEPT rules in the &lt;code&gt;filter&lt;/code&gt; table. These rules redirect incoming traffic to container IPs before your nftables ruleset runs.&lt;/p&gt;

&lt;p&gt;Check what Docker created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables-legacy &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables-legacy &lt;span class="nt"&gt;-t&lt;/span&gt; filter &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see entries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;DNAT  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  tcp dpt:8080 to:172.17.0.2:8080
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The packet gets rewritten and forwarded before your nftables &lt;code&gt;input&lt;/code&gt; chain ever evaluates it. Even if you block port 8080 in nftables, Docker's NAT rule already sent the traffic to the container.&lt;/p&gt;

&lt;p&gt;On modern Debian and Ubuntu systems, nftables is the default firewall backend. But Docker still uses iptables-legacy for compatibility. This creates two parallel firewall systems — and Docker's rules win.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Disable Docker's iptables Manipulation
&lt;/h2&gt;

&lt;p&gt;Stop Docker from writing iptables rules. Edit &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iptables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Docker won't touch your firewall. But you've also disabled container NAT and port publishing. If you run &lt;code&gt;docker run -p 8080:8080 myapp&lt;/code&gt;, the port mapping silently fails. The container starts, but nothing listens on the host.&lt;/p&gt;

&lt;p&gt;You now manage all forwarding and NAT yourself in nftables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Your Own Docker NAT in nftables
&lt;/h2&gt;

&lt;p&gt;You need three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DNAT for inbound traffic (external → container)&lt;/li&gt;
&lt;li&gt;SNAT for outbound traffic (container → internet)&lt;/li&gt;
&lt;li&gt;Forwarding rules between host and Docker bridge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a complete nftables configuration for a single container exposing port 8080:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/sbin/nft -f

flush ruleset

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    # Allow SSH
    tcp dport 22 accept
    # Block direct access to 8080 from outside
    # Traffic will arrive via DNAT as forwarded packets
  }

  chain forward {
    type filter hook forward priority 0; policy drop;
    ct state established,related accept
    # Allow forwarding to Docker containers
    iif "eth0" oif "docker0" ip daddr 172.17.0.2 tcp dport 8080 accept
    # Allow container responses
    iif "docker0" oif "eth0" accept
  }

  chain output {
    type filter hook output priority 0; policy accept;
  }
}

table ip nat {
  chain prerouting {
    type nat hook prerouting priority -100; policy accept;
    # DNAT: external traffic on 8080 → container
    iif "eth0" tcp dport 8080 dnat to 172.17.0.2:8080
  }

  chain postrouting {
    type nat hook postrouting priority 100; policy accept;
    # SNAT: container outbound traffic → host IP
    oif "eth0" ip saddr 172.17.0.0/16 masquerade
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;/etc/nftables.conf&lt;/code&gt; and apply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nft &lt;span class="nt"&gt;-f&lt;/span&gt; /etc/nftables.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;172.17.0.2&lt;/code&gt; with your container's IP. Find it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker inspect &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s1"&gt;'{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}'&lt;/span&gt; &amp;lt;container_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
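

&lt;p&gt;One prerequisite the ruleset above takes for granted: DNAT only delivers packets to the container if kernel IP forwarding is enabled. Docker normally turns it on, and typically still does with &lt;code&gt;"iptables": false&lt;/code&gt;, but since your firewall now depends on it, it's worth pinning explicitly rather than relying on the daemon. A sketch — the filename is arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/sysctl.d/99-ip-forward.conf
net.ipv4.ip_forward = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Apply with &lt;code&gt;sudo sysctl --system&lt;/code&gt; and confirm with &lt;code&gt;cat /proc/sys/net/ipv4/ip_forward&lt;/code&gt; — it should print &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;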



&lt;h2&gt;
  
  
  Selective Exposure: Allow Only Internal Networks
&lt;/h2&gt;

&lt;p&gt;If you want the container reachable only from your private network (not the internet), add a source filter in the DNAT rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iif "eth0" ip saddr 10.0.0.0/8 tcp dport 8080 dnat to 172.17.0.2:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This rule DNATs only connections arriving from &lt;code&gt;10.0.0.0/8&lt;/code&gt; (one of the RFC1918 private ranges). Traffic from any other source is never translated, so it falls through to the &lt;code&gt;input&lt;/code&gt; chain, where the drop policy discards it.&lt;/p&gt;

&lt;p&gt;For GPU inference APIs or internal vector search endpoints, this prevents accidental internet exposure while keeping the service available to your application tier.&lt;/p&gt;
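
&lt;p&gt;The example above matches only &lt;code&gt;10.0.0.0/8&lt;/code&gt;. To admit all three RFC1918 blocks, define a variable holding a set of ranges at the top of &lt;code&gt;/etc/nftables.conf&lt;/code&gt; and reference it in the DNAT rule — a sketch; trim the set to the ranges you actually use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;define internal_nets = { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 }

# in the nat prerouting chain:
iif "eth0" ip saddr $internal_nets tcp dport 8080 dnat to 172.17.0.2:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;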

&lt;h2&gt;
  
  
  Handling Multiple Containers
&lt;/h2&gt;

&lt;p&gt;For multiple published ports, add one DNAT rule and one forward rule per container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Container 1: LLM API on 8080
iif "eth0" tcp dport 8080 dnat to 172.17.0.2:8080
iif "eth0" oif "docker0" ip daddr 172.17.0.2 tcp dport 8080 accept

# Container 2: Vector DB on 9200
iif "eth0" tcp dport 9200 dnat to 172.17.0.3:9200
iif "eth0" oif "docker0" ip daddr 172.17.0.3 tcp dport 9200 accept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a dynamic container environment, this manual approach doesn't scale. An alternative is to leave Docker's iptables integration enabled but publish ports with explicit loopback binds (&lt;code&gt;--publish 127.0.0.1:8080:8080&lt;/code&gt;) so each service listens only on localhost, then manage external access through an nginx reverse proxy protected by nftables.&lt;/p&gt;
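
&lt;p&gt;Before reaching for a proxy, nftables maps offer a middle ground: one rule can carry the whole port-to-container table. A sketch using the example addresses above — when the map yields only an address, the original destination port is preserved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chain prerouting {
  type nat hook prerouting priority -100; policy accept;
  # one lookup replaces a per-container DNAT rule
  iif "eth0" dnat to tcp dport map { 8080 : 172.17.0.2, 9200 : 172.17.0.3 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You still need the matching forward-chain accepts, but adding a container becomes a one-line map edit.&lt;/p&gt;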

&lt;h2&gt;
  
  
  Enable nftables on Boot
&lt;/h2&gt;

&lt;p&gt;Make the ruleset persistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nftables
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start nftables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Debian/Ubuntu, nftables reads &lt;code&gt;/etc/nftables.conf&lt;/code&gt; at boot. Verify the service is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status nftables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You Lose
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;"iptables": false&lt;/code&gt;, Docker Compose port mappings (&lt;code&gt;ports: - "8080:8080"&lt;/code&gt;) stop working unless you manually configure nftables NAT. Docker networks still function for inter-container communication, but host publishing requires your explicit forwarding rules.&lt;/p&gt;
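
&lt;p&gt;If a host doesn't warrant full manual NAT, the loopback-bind pattern works from Compose too, with Docker's iptables integration left enabled. A hypothetical fragment — service and image names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;services:
  llm-api:
    image: myapp
    ports:
      # bind to loopback only; unreachable from external interfaces
      - "127.0.0.1:8080:8080"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;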

&lt;p&gt;For production GPU clusters running inference APIs, this tradeoff is worth it. You control exactly which ports are exposed and to whom. A single nftables ruleset governs all traffic — no hidden Docker rules bypassing your firewall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;

&lt;p&gt;Test the block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From outside the host&lt;/span&gt;
curl http://&amp;lt;host-ip&amp;gt;:8080
&lt;span class="c"&gt;# Should fail if no DNAT rule exists&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the DNAT rule, reload nftables, and retry. The request should reach the container.&lt;/p&gt;

&lt;p&gt;Check your ruleset matches what you expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nft list ruleset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify Docker didn't sneak in iptables rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables-legacy &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER
&lt;span class="c"&gt;# Should be empty or show "Chain DOCKER (0 references)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Docker re-created rules, it means &lt;code&gt;daemon.json&lt;/code&gt; wasn't applied. Restart the daemon and double-check the JSON syntax.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases for Manual Firewall Control
&lt;/h2&gt;

&lt;p&gt;This pattern matters when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running inference APIs on GPU instances where accidental exposure costs money and leaks proprietary models&lt;/li&gt;
&lt;li&gt;Operating multi-tenant platforms where container isolation must be firewall-enforced, not just network-namespace-enforced&lt;/li&gt;
&lt;li&gt;Deploying internal RAG pipelines with vector databases that should never touch the public internet&lt;/li&gt;
&lt;li&gt;Meeting compliance requirements that demand explicit, auditable firewall rules for all published services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker's automatic iptables manipulation is convenient for development. In production infrastructure, convenience is a security liability. You need deterministic control over which packets reach which containers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is an excerpt from &lt;a href="https://books.fivenineslab.com" rel="noopener noreferrer"&gt;Practical AI Infrastructure Engineering&lt;/a&gt; — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at &lt;a href="https://activ8ted.gumroad.com/l/ssmfkx" rel="noopener noreferrer"&gt;https://activ8ted.gumroad.com/l/ssmfkx&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fivenineslab.com/blog/block-docker-ports-nftables-without-bypass" rel="noopener noreferrer"&gt;fivenineslab.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>aiinfrastructure</category>
    </item>
  </channel>
</rss>
