<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parker Voeltz</title>
    <description>The latest articles on DEV Community by Parker Voeltz (@madhacker3712).</description>
    <link>https://dev.to/madhacker3712</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940943%2Fd16a238c-0aaa-4d54-82f0-5c88c5f7b643.png</url>
      <title>DEV Community: Parker Voeltz</title>
      <link>https://dev.to/madhacker3712</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/madhacker3712"/>
    <language>en</language>
    <item>
      <title>How I Containerized an LLM: A Practical MLOps Guide</title>
      <dc:creator>Parker Voeltz</dc:creator>
      <pubDate>Tue, 19 May 2026 19:28:16 +0000</pubDate>
      <link>https://dev.to/madhacker3712/-how-i-containerized-an-llm-a-practical-mlops-guide-4225</link>
      <guid>https://dev.to/madhacker3712/-how-i-containerized-an-llm-a-practical-mlops-guide-4225</guid>
      <description>&lt;p&gt;&lt;em&gt;By MadHacker3712 | May 2026&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most AI tutorials end at "run this script locally." That's not production. In production, your model needs to run the same way on your laptop, a teammate's machine, and a cloud server—without anyone asking you "wait, which Python version?" That's what Docker solves.&lt;/p&gt;

&lt;p&gt;This is how I containerized a fine-tuned customer support chatbot end to end.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A FastAPI server that wraps a fine-tuned DialoGPT-small model, quantized to int8 for faster inference. The API accepts a customer question and returns an agent-style response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"message": "I need to cancel my order"}'&lt;/span&gt;

&lt;span class="c"&gt;# Response:&lt;/span&gt;
&lt;span class="c"&gt;# {"response": "I understand you'd like to cancel your order...", "latency_ms": 312.4}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. Works anywhere Docker runs.&lt;/p&gt;
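&lt;p&gt;For anyone calling the API from code instead of curl, here is a minimal client sketch using only the Python standard library. The endpoint path and JSON field come from the curl example above; the function names &lt;code&gt;build_chat_request&lt;/code&gt; and &lt;code&gt;ask&lt;/code&gt; and the 30-second timeout are my own illustrative choices, not part of the project.&lt;/p&gt;

```python
import json
import urllib.request


def build_chat_request(message: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) the POST request for the /chat endpoint."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message: str, base_url: str = "http://localhost:8000", timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON reply. Needs the container to be running."""
    request = build_chat_request(message, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

&lt;p&gt;With the container running, &lt;code&gt;ask("I need to cancel my order")&lt;/code&gt; should return a dict with the same &lt;code&gt;response&lt;/code&gt; and &lt;code&gt;latency_ms&lt;/code&gt; keys shown in the curl output.&lt;/p&gt;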




&lt;h2&gt;
  
  
  Why Docker for LLMs?
&lt;/h2&gt;

&lt;p&gt;When you ship an LLM-backed service without Docker, you're shipping a list of instructions. When you ship it &lt;em&gt;with&lt;/em&gt; Docker, you're shipping the environment itself.&lt;/p&gt;

&lt;p&gt;Three specific problems Docker solves for LLM work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dependency hell&lt;/strong&gt;: PyTorch, transformers, tokenizers—these all have version-sensitive interactions. Docker freezes the exact versions that worked.&lt;/li&gt;
&lt;li&gt;
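&lt;strong&gt;Reproducibility&lt;/strong&gt;: Your benchmark numbers only mean something if someone else can reproduce them. A container pins the software environment, so anyone can rerun the same code against the same library versions (the hardware underneath can still differ).&lt;/li&gt;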
&lt;li&gt;
&lt;strong&gt;Deployment parity&lt;/strong&gt;: The same image runs locally and in the cloud. No surprises in production.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;Here's the full Dockerfile I wrote, with every decision explained:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Copy requirements first — Docker caches this layer.&lt;/span&gt;
&lt;span class="c"&gt;# If only app.py changes, pip install does NOT re-run.&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.3.0 &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cpu
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Copy fine-tuned model&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; models/chatbot-finetuned ./models/chatbot-finetuned&lt;/span&gt;

&lt;span class="c"&gt;# Copy application code&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; MODEL_DIR=./models/chatbot-finetuned&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; MAX_NEW_TOKENS=80&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;python:3.11-slim&lt;/code&gt;&lt;/strong&gt;: not the full Python image. The slim variant ships only the minimal Debian packages needed to run Python, which keeps the image smaller.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirements before code&lt;/strong&gt;: this is the most important layer-caching trick. Docker builds layers in order and caches each one. If &lt;code&gt;requirements.txt&lt;/code&gt; hasn't changed, that entire 2-minute pip install gets skipped on rebuild. Only the changed code layer re-runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU-only torch&lt;/strong&gt;: the &lt;code&gt;--index-url&lt;/code&gt; flag points pip at PyTorch's CPU wheel index instead of the default Linux build, which bundles CUDA libraries. This cuts roughly 1.5 GB from the image. For CPU inference, you don't need CUDA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ENV for config&lt;/strong&gt;: &lt;code&gt;MODEL_DIR&lt;/code&gt; and &lt;code&gt;MAX_NEW_TOKENS&lt;/code&gt; are set as environment variables, so you can override them at runtime (for example with &lt;code&gt;docker run -e&lt;/code&gt;) without rebuilding the image.&lt;/li&gt;
&lt;/ul&gt;
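&lt;p&gt;To make the ENV decision concrete, here is a rough sketch of how the application side could read those two variables. The post does not show &lt;code&gt;app.py&lt;/code&gt;, so the &lt;code&gt;Settings&lt;/code&gt; class and &lt;code&gt;load_settings&lt;/code&gt; function are hypothetical; only the variable names and default values come from the Dockerfile.&lt;/p&gt;

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_dir: str
    max_new_tokens: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read config from the environment, falling back to the Dockerfile defaults."""
    max_new_tokens = int(env.get("MAX_NEW_TOKENS", "80"))
    if not max_new_tokens > 0:
        raise ValueError("MAX_NEW_TOKENS must be a positive integer")
    return Settings(
        model_dir=env.get("MODEL_DIR", "./models/chatbot-finetuned"),
        max_new_tokens=max_new_tokens,
    )
```

&lt;p&gt;Validating the value once at startup means a bad override fails immediately instead of surfacing on the first request.&lt;/p&gt;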




&lt;h2&gt;
  
  
  What I Learned About Layer Caching
&lt;/h2&gt;

&lt;p&gt;Docker images are built in layers. Instructions that change the filesystem (&lt;code&gt;RUN&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;ADD&lt;/code&gt;) each add a layer on top of the base image named in &lt;code&gt;FROM&lt;/code&gt;. Docker caches the result of each step and, once one step's inputs change, rebuilds that step and every step after it.&lt;/p&gt;

&lt;p&gt;This is why order matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# WRONG — changes to app.py re-run pip install&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# RIGHT — changes to app.py only re-copy app.py&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; app.py .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The right approach cut my rebuild time from ~8 minutes to ~10 seconds when I only changed &lt;code&gt;app.py&lt;/code&gt;.&lt;/p&gt;
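&lt;p&gt;The invalidation rule is easier to see in a toy model. The sketch below is not how Docker computes its cache keys internally; it only assumes that each step's key depends on the step before it, which is enough to reproduce the behavior described above. The fingerprints such as &lt;code&gt;reqs-v1&lt;/code&gt; are made-up stand-ins for file checksums.&lt;/p&gt;

```python
import hashlib


def layer_keys(steps):
    """Chain a key through the steps so each key depends on everything before it."""
    keys, parent = [], ""
    for instruction, fingerprint in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{fingerprint}".encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuild_plan(previous, current):
    """Label each step of the current build as CACHED or REBUILD."""
    cached = set(layer_keys(previous))
    return [
        (instruction, "CACHED" if key in cached else "REBUILD")
        for (instruction, _), key in zip(current, layer_keys(current))
    ]


# Each step is (instruction, fingerprint of the files it reads).
wrong_v1 = [("COPY . .", "app-v1+reqs-v1"), ("RUN pip install -r requirements.txt", "")]
wrong_v2 = [("COPY . .", "app-v2+reqs-v1"), ("RUN pip install -r requirements.txt", "")]

right_v1 = [
    ("COPY requirements.txt .", "reqs-v1"),
    ("RUN pip install -r requirements.txt", ""),
    ("COPY app.py .", "app-v1"),
]
right_v2 = right_v1[:2] + [("COPY app.py .", "app-v2")]

for label, plan in (("wrong", rebuild_plan(wrong_v1, wrong_v2)), ("right", rebuild_plan(right_v1, right_v2))):
    print(label, [status for _, status in plan])
```

&lt;p&gt;After editing only &lt;code&gt;app.py&lt;/code&gt;, the wrong ordering rebuilds both steps, while the right ordering reports the first two steps as cached and rebuilds only the final copy.&lt;/p&gt;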




&lt;h2&gt;
  
  
  The Quantization Inside the Container
&lt;/h2&gt;

&lt;p&gt;The model loads as fp32, then gets dynamically quantized to int8 at startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quantize_dynamic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qint8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens every time the container starts. It takes about 2 seconds and means I don't have to store a separate quantized model file—the container always starts from the clean fine-tuned weights and quantizes on the fly.&lt;/p&gt;
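&lt;p&gt;If "quantized to int8" feels abstract, the core arithmetic fits in a few lines of plain Python. This is a simplified symmetric scheme with a single shared scale, written for illustration only; PyTorch's actual implementation differs in its details, and the helper names here are hypothetical.&lt;/p&gt;

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.003, 1.0]
q_values, scale = quantize_int8(weights)
restored = dequantize(q_values, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_values, worst_error)
```

&lt;p&gt;The four weights map to 50, -127, 0 and 100. The small weight 0.003 collapses to zero, which shows the precision given up in exchange for storing one byte per weight instead of four.&lt;/p&gt;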

&lt;p&gt;&lt;strong&gt;Benchmark results on CPU:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Peak Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (fp32)&lt;/td&gt;
&lt;td&gt;474.70 MB&lt;/td&gt;
&lt;td&gt;4543.86 ms&lt;/td&gt;
&lt;td&gt;794.2 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantized (int8)&lt;/td&gt;
&lt;td&gt;474.70 MB&lt;/td&gt;
&lt;td&gt;4477.93 ms&lt;/td&gt;
&lt;td&gt;1334.5 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

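&lt;p&gt;That is only a 1.45% latency improvement, and peak memory went up instead of down. I documented this honestly. Two things likely explain it. First, DialoGPT is a GPT-2 model, and the Hugging Face GPT-2 implementation uses its own &lt;code&gt;Conv1D&lt;/code&gt; layer for the attention and MLP projections instead of &lt;code&gt;torch.nn.Linear&lt;/code&gt;, so a call restricted to &lt;code&gt;{torch.nn.Linear}&lt;/code&gt; leaves the transformer blocks themselves in fp32. Second, the weights on disk are still fp32 (hence the identical size), and quantizing at startup means the fp32 model is loaded first and the int8 copies are created alongside it, which is consistent with the higher peak memory. Also note that PyTorch's dynamic quantization runs on CPU; large models served on GPU use other quantization methods, so this result should not be extrapolated to them. That's the trade-off analysis that matters in production: not just "did it get faster" but "why, and when does it not."&lt;/p&gt;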
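&lt;p&gt;For anyone who wants to check the arithmetic, the percentages follow directly from the table. The helper name &lt;code&gt;percent_change&lt;/code&gt; is just for this example.&lt;/p&gt;

```python
def percent_change(before: float, after: float) -> float:
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100


latency = percent_change(4543.86, 4477.93)  # fp32 vs int8 average latency in ms
memory = percent_change(794.2, 1334.5)      # fp32 vs int8 peak memory in MB
print(f"latency: {latency:+.2f}%  peak memory: {memory:+.2f}%")
```

&lt;p&gt;This prints a latency change of -1.45% and a peak memory change of +68.03%, so the small speedup came with a noticeably larger memory footprint.&lt;/p&gt;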




&lt;h2&gt;
  
  
  Build, Run, Push
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; topdeveloper123/customer-support-chatbot:v1 &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Test locally&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 topdeveloper123/customer-support-chatbot:v1

&lt;span class="c"&gt;# Push to Docker Hub (anyone can now pull and run your model)&lt;/span&gt;
docker push topdeveloper123/customer-support-chatbot:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyone with Docker installed can now run your model with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 topdeveloper123/customer-support-chatbot:v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Python setup. No pip install. No version conflicts. That's the point.&lt;/p&gt;
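&lt;p&gt;Because the Dockerfile sets its defaults with &lt;code&gt;ENV&lt;/code&gt;, the same image can be started with different settings by passing &lt;code&gt;-e&lt;/code&gt; flags to &lt;code&gt;docker run&lt;/code&gt;. The sketch below only assembles the command line; the function name &lt;code&gt;docker_run_command&lt;/code&gt; is hypothetical and nothing is executed.&lt;/p&gt;

```python
import shlex


def docker_run_command(image, port=8000, env=None):
    """Return the argv list for docker run, with optional -e overrides."""
    argv = ["docker", "run", "-p", f"{port}:{port}"]
    for name, value in (env or {}).items():
        argv += ["-e", f"{name}={value}"]
    argv.append(image)
    return argv


command = docker_run_command(
    "topdeveloper123/customer-support-chatbot:v1",
    env={"MAX_NEW_TOKENS": 40},
)
print(shlex.join(command))
```

&lt;p&gt;The printed command is &lt;code&gt;docker run -p 8000:8000 -e MAX_NEW_TOKENS=40 topdeveloper123/customer-support-chatbot:v1&lt;/code&gt;. Passing the list to &lt;code&gt;subprocess.run&lt;/code&gt; would start the container with the shorter generation limit.&lt;/p&gt;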




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is Project 1 of my MLOps learning path. Next I'll be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparing inference setups: standard pipeline vs quantized vs vLLM&lt;/li&gt;
&lt;li&gt;Building a RAG backend with vector search optimization&lt;/li&gt;
&lt;li&gt;Adding production monitoring and latency alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/MadHacker3712/customer-support-chatbot" rel="noopener noreferrer"&gt;github.com/MadHacker3712/customer-support-chatbot&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Learning MLOps in public. Building real systems, not tutorials.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
