<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community:  kumar</title>
    <description>The latest articles on DEV Community by  kumar (@kccab5b1).</description>
    <link>https://dev.to/kccab5b1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875232%2F0066f43e-90e8-4904-ac02-5f63003222bc.png</url>
      <title>DEV Community:  kumar</title>
      <link>https://dev.to/kccab5b1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kccab5b1"/>
    <language>en</language>
    <item>
      <title>Building a Self-Healing Backend with AI + Docker</title>
      <dc:creator> kumar</dc:creator>
      <pubDate>Sun, 12 Apr 2026 17:37:45 +0000</pubDate>
      <link>https://dev.to/kccab5b1/building-a-self-healing-backend-with-ai-docker-4pm4</link>
      <guid>https://dev.to/kccab5b1/building-a-self-healing-backend-with-ai-docker-4pm4</guid>
      <description>&lt;p&gt;I had this idea that kept bugging me: what if your backend could fix itself?&lt;/p&gt;

&lt;p&gt;Not in some hand-wavy "AI will handle it" way. I mean a backend that actually tails its own logs, spots a real error, figures out what's wrong in the code, patches it, rebuilds the container, and moves on. While you sleep.&lt;/p&gt;

&lt;p&gt;So I built it. Five Docker containers. One of them watches the others, and when something breaks, it calls an LLM to generate a code fix, applies it, restarts the broken service, and verifies the fix worked. No human in the loop.&lt;/p&gt;

&lt;p&gt;This post is the full breakdown of how it works, what surprised me, and where it falls apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The demo is a small data pipeline. Two FastAPI services and a MongoDB instance, all running in Docker.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service A&lt;/strong&gt; holds raw records. It doesn't validate them — just stores whatever it gets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service B&lt;/strong&gt; is the strict one. It receives records from Service A, validates every field against business rules, and writes the good ones to a separate database. Bad records get rejected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The AI container&lt;/strong&gt; sits alongside them. It has access to the Docker socket, can SSH into the other containers, and tails their logs in real time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also an init container that seeds the database on startup: 1000 records, most of them clean, but a handful intentionally malformed. Different kinds of malformed: numbers wrapped in weird JSON objects, mixed-case field names, nested data serialized as strings. The kind of stuff that happens when real systems talk to each other.&lt;/p&gt;
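&lt;p&gt;If you want to reproduce the seeding step, it's a few lines of Python. A sketch (field names and counts here are illustrative, matching the payload examples later in this post, not the project's exact schema):&lt;/p&gt;

```python
import json
import random

def make_seed_records(total=1000, seed=7):
    """Mostly-clean records plus a handful of deliberately malformed ones."""
    records = [
        {"long_value": i, "object_values": {"category": "ALPHA", "quality": "HIGH"}}
        for i in range(total - 4)
    ]
    # The same kinds of damage described above, applied on purpose:
    records.append({"long_value": {"$numberLong": "900000000000000001"},
                    "object_values": {"category": "ALPHA"}})                    # wrapped number
    records.append({"long_value": 1,
                    "object_values": {"Category": "ALPHA"}})                    # mixed-case key
    records.append({"long_value": 2,
                    "object_values": json.dumps({"category": "BETA"})})         # dict as a string
    records.append({"long_value": 3,
                    "object_values": [{"key": "category", "value": "GAMMA"}]})  # key/value list
    random.Random(seed).shuffle(records)
    return records
```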

&lt;p&gt;When Service A tries to transfer everything to Service B, those malformed records get rejected. Service A logs the rejections as errors. And that's when the AI container wakes up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                +-----------+        +-----------+
                | Service A |-------&amp;gt;| Service B |
                | (no       |  HTTP  | (strict   |
                |  validation)       |  validation)
                +-----+-----+        +-----+-----+
                      |                     |
                      | logs errors         | rejects bad data
                      v                     v
              +----------------------------------+
              |        AI Orchestrator           |
              |                                  |
              |  1. tail logs from all services  |
              |  2. regex match on error pattern |
              |  3. build prompt with context    |
              |  4. call LLM for a code fix      |
              |  5. apply patch, rebuild, verify |
              +----------------------------------+
                      |
                      v
                +-----------+
                |  MongoDB  |
                +-----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Log Watcher
&lt;/h2&gt;

&lt;p&gt;The core of the system is a shell script that runs inside the AI container. It streams logs from the other containers using &lt;code&gt;docker compose logs -f&lt;/code&gt; and matches every line against a configurable regex pattern.&lt;/p&gt;

&lt;p&gt;The idea is simple. Most log lines are noise — request timings, debug output, health checks. But when a line matches the error pattern (in my case, something like &lt;code&gt;transfer_remote_rejections&lt;/code&gt;), the system wakes up.&lt;/p&gt;

&lt;p&gt;Here's the stripped-down logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stream logs from all monitored services&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;--tail&lt;/span&gt; 0 &lt;span class="nt"&gt;-f&lt;/span&gt; service_a service_b mongodb | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; line&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;

    &lt;span class="c"&gt;# Append to a rolling buffer (keeps last N lines for context)&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BUFFER_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="c"&gt;# Check if this line matches our error pattern&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-Eq&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ERROR_REGEX&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;

        &lt;span class="c"&gt;# Hash the line to avoid retriggering on the same error&lt;/span&gt;
        &lt;span class="nv"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sha256sum&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;should_trigger &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$signature&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Detected error. Triggering AI fix."&lt;/span&gt;
            run_ai_fix &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$signature&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;fi
    fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things I want to highlight because they matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rolling buffer, not just the matched line.&lt;/strong&gt; When the AI needs to fix something, it doesn't just get the error — it gets the last 30 lines of logs for context. A rejection error alone doesn't tell you much. But 30 lines of context? Now you can see the actual payload that failed, the validation error message, the traceback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signature-based deduplication.&lt;/strong&gt; Without this, the same error triggers the fix loop over and over. Each matched line gets hashed, and if we've already triggered on that hash within a cooldown window (say, 3 minutes), we skip it.&lt;/p&gt;
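&lt;p&gt;In Python terms, the buffer and the dedup check together look something like this (the cooldown value and function names are mine, not the project's):&lt;/p&gt;

```python
import hashlib
import time
from collections import deque

BUFFER = deque(maxlen=30)   # rolling context window, like the shell buffer file
COOLDOWN = 180              # seconds before the same signature may retrigger
_last_fired = {}            # signature -> timestamp of the last trigger

def observe(line, pattern, now=None):
    """Buffer every line; fire only on fresh matches of the error pattern."""
    now = time.time() if now is None else now
    BUFFER.append(line)
    if pattern not in line:
        return False
    signature = hashlib.sha256(line.encode()).hexdigest()
    last = _last_fired.get(signature)
    if last is not None and now - last < COOLDOWN:
        return False        # same error, still inside the cooldown window
    _last_fired[signature] = now
    return True
```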

&lt;p&gt;&lt;strong&gt;Reconnection.&lt;/strong&gt; Log streams can drop. The outer &lt;code&gt;while true&lt;/code&gt; loop reconnects automatically with a short delay.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix Prompt
&lt;/h2&gt;

&lt;p&gt;When the watcher triggers, it builds a prompt and sends it to an LLM. This is the part that took the most iteration to get right.&lt;/p&gt;

&lt;p&gt;The naive version — "here's an error, fix it" — doesn't work. The model needs structure. It needs to know what the system is, what the error means, and exactly what files to look at.&lt;/p&gt;

&lt;p&gt;Here's roughly what the prompt looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are debugging a running backend service inside Docker.

Detected error pattern: transfer_remote_rejections
Matched log line: [truncated to ~1400 chars]

Recent log context (last 30 lines):
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;[...actual log output...]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Task:
- The receiving service rejects records with unexpected payload shapes.
- Fix the validation/normalization code to handle these variants:
  - Numbers wrapped as {"$numberInt": "42"} or {"$numberLong": "999"}
  - Object keys with inconsistent casing (e.g., "Category" vs "category")
  - Nested objects serialized as JSON strings instead of dicts
  - Nested objects sent as a list of {key, value} pairs
- Only modify the receiving service's code. Preserve the API contract.
- Rebuild the container, run the transfer again, verify counts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key decisions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tell the model exactly which file to modify.&lt;/strong&gt; Don't let it go exploring the whole repo. In my case, the fix always lives in the receiving service's main application file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;List the variant shapes explicitly.&lt;/strong&gt; The model can't guess what "malformed" means in your context. Be specific about what the data looks like and what it should be normalized into.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Include the verification step.&lt;/strong&gt; The prompt doesn't just say "fix the code" — it says "fix the code, rebuild, re-run the transfer, check the counts." The AI needs to know when it's done.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
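&lt;p&gt;Assembling that prompt is plain string work. A sketch of the shape (the wording mirrors the template above; the 1400-character cap is the truncation mentioned earlier):&lt;/p&gt;

```python
def build_fix_prompt(error_line, context_lines, variants, max_line_chars=1400):
    """Build the structured fix prompt from the match and its log context."""
    parts = [
        "You are debugging a running backend service inside Docker.",
        "",
        "Matched log line: " + error_line[:max_line_chars],
        "",
        "Recent log context (last %d lines):" % len(context_lines),
        *context_lines,
        "",
        "Task:",
        "- Fix the validation/normalization code to handle these variants:",
    ]
    parts.extend("  - " + v for v in variants)
    parts.append("- Only modify the receiving service's code. Preserve the API contract.")
    parts.append("- Rebuild the container, run the transfer again, verify counts.")
    return "\n".join(parts)
```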




&lt;h2&gt;
  
  
  Retries and Cooldowns
&lt;/h2&gt;

&lt;p&gt;The fix doesn't always work on the first try. Sometimes the model gets it 80% right — handles three out of four variants, misses one. That's fine, because the system is built for retries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MAX_RETRIES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3
&lt;span class="nv"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$attempt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-le&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MAX_RETRIES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fix attempt &lt;/span&gt;&lt;span class="nv"&gt;$attempt&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$MAX_RETRIES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="c"&gt;# Run the LLM with a timeout&lt;/span&gt;
    &lt;span class="nb"&gt;timeout &lt;/span&gt;900 run_llm_fix &amp;lt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nv"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$?&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$exit_code&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Fix succeeded on attempt &lt;/span&gt;&lt;span class="nv"&gt;$attempt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="nb"&gt;break
    &lt;/span&gt;&lt;span class="k"&gt;fi

    &lt;/span&gt;&lt;span class="nv"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;attempt &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;sleep &lt;/span&gt;2
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 900-second timeout (15 minutes) is generous on purpose. The model doesn't just edit a file — it also rebuilds the container, waits for it to become healthy, triggers the transfer, and checks the results. That whole cycle takes time.&lt;/p&gt;

&lt;p&gt;And the cooldown between error signatures prevents the system from going into a spin loop when something truly can't be fixed. Three strikes and it stops, leaving the error for a human to look at.&lt;/p&gt;
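&lt;p&gt;The three-strikes bookkeeping is tiny. Sketched in Python (the real project keeps this state inside the shell script; these names are mine):&lt;/p&gt;

```python
MAX_STRIKES = 3
_strikes = {}     # signature -> number of failed fix attempts
_parked = set()   # signatures handed off to a human

def record_failure(signature):
    """Count a failed fix; after three strikes, park the error for a human."""
    _strikes[signature] = _strikes.get(signature, 0) + 1
    if _strikes[signature] >= MAX_STRIKES:
        _parked.add(signature)

def eligible(signature):
    """Parked errors never retrigger the fix loop automatically."""
    return signature not in _parked
```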




&lt;h2&gt;
  
  
  The Part Nobody Talks About: SSH Into Running Containers
&lt;/h2&gt;

&lt;p&gt;Here's something I didn't see coming when I started this project. The AI container needs to actually &lt;em&gt;do things&lt;/em&gt; inside the other containers — read files, apply patches, restart processes. You can't just &lt;code&gt;docker exec&lt;/code&gt; for everything.&lt;/p&gt;

&lt;p&gt;The solution I landed on: the AI container generates an SSH keypair on startup, shares the public key through a Docker volume, and all service containers configure their SSH daemons to accept it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml (simplified)&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ai_orchestrator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ssh_keys:/shared-keys&lt;/span&gt;          &lt;span class="c1"&gt;# writes the keypair here&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock:/var/run/docker.sock&lt;/span&gt;

  &lt;span class="na"&gt;service_a&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ssh_keys:/shared-keys:ro&lt;/span&gt;       &lt;span class="c1"&gt;# reads the public key&lt;/span&gt;

  &lt;span class="na"&gt;service_b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ssh_keys:/shared-keys:ro&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ssh_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service container runs a small entrypoint script that waits for the public key to appear, copies it into &lt;code&gt;authorized_keys&lt;/code&gt;, starts the SSH daemon, and then launches the actual application.&lt;/p&gt;
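&lt;p&gt;The "wait for the public key" step is just a poll loop. The real entrypoint does it in shell; the equivalent idea in Python:&lt;/p&gt;

```python
import os
import time

def wait_for_file(path, timeout=60.0, interval=0.5):
    """Block until a file appears, or give up after the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False
```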

&lt;p&gt;This means the AI container can do things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh service_a &lt;span class="s2"&gt;"tail -n 50 /var/log/app/service.log"&lt;/span&gt;
ssh service_b &lt;span class="s2"&gt;"cat /app/main.py"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It felt overengineered at first, but it turned out to be the cleanest way to give the AI full access without mounting every source directory as a shared volume.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Entrypoint Trick: Logs That Go Two Places
&lt;/h2&gt;

&lt;p&gt;One challenge with Docker is that you want logs to go to stdout (so &lt;code&gt;docker logs&lt;/code&gt; works) but you also want them in a file (so the AI can read them via SSH or tail them).&lt;/p&gt;

&lt;p&gt;The solution is a thin entrypoint wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/sh&lt;/span&gt;
&lt;span class="nv"&gt;LOG_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/var/log/app/service.log"&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Run the actual command, redirect all output to the log file&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;span class="nv"&gt;MAIN_PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$!&lt;/span&gt;

&lt;span class="c"&gt;# Tail the log file to stdout (so docker logs still works)&lt;/span&gt;
&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; +1 &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;amp;
&lt;span class="nv"&gt;TAIL_PID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$!&lt;/span&gt;

&lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MAIN_PID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TAIL_PID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every container uses this as its entrypoint. The actual service command gets passed as arguments. Output goes to a file &lt;em&gt;and&lt;/em&gt; to stdout. Everybody's happy.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Fix Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;For the curious — what does the AI actually change?&lt;/p&gt;

&lt;p&gt;In my demo, the receiving service has strict Pydantic validation. It expects fields like &lt;code&gt;long_value&lt;/code&gt; to be integers, &lt;code&gt;object_values&lt;/code&gt; to be a dict with specific keys, etc. But the malformed records come in with stuff like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"long_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"$numberLong"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"900000000000000001"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"object_values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;category&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;ALPHA&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;quality&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;HIGH&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;multiplier&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 2}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI adds a normalization layer before validation — something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Unwrap MongoDB extended JSON and normalize shapes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle {"$numberLong": "..."} and {"$numberInt": "..."} wrappers
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;long_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$numberLong&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$numberInt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle object_values as a JSON string
&lt;/span&gt;    &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;

    &lt;span class="c1"&gt;# Handle object_values as [{key, value}, ...] list
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Normalize mixed-case keys
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. A normalization function that handles four different "dirty" shapes. The AI writes this, plugs it into the ingest endpoint, rebuilds the container, and re-runs the transfer. All 1000 records pass.&lt;/p&gt;
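&lt;p&gt;The important part of the patch is the ordering: normalize first, validate second. A dependency-free sketch of that wiring (the checks here stand in for the real Pydantic model):&lt;/p&gt;

```python
import json

def normalize_payload(raw):
    """Reduced version of the normalization above, covering two fields."""
    val = raw.get("long_value")
    if isinstance(val, dict):                       # {"$numberLong": "..."} wrapper
        raw["long_value"] = int(val.get("$numberLong") or val.get("$numberInt", 0))
    obj = raw.get("object_values")
    if isinstance(obj, str):                        # dict serialized as a JSON string
        obj = json.loads(obj)
    if isinstance(obj, list):                       # [{key, value}, ...] pairs
        obj = {item["key"]: item["value"] for item in obj}
    if isinstance(obj, dict):                       # normalize mixed-case keys
        raw["object_values"] = {k.lower(): v for k, v in obj.items()}
    return raw

def ingest(record):
    """What the patched endpoint does: normalize, then validate strictly."""
    record = normalize_payload(dict(record))
    if not isinstance(record.get("long_value"), int):
        raise ValueError("long_value must be an integer")
    if not isinstance(record.get("object_values"), dict):
        raise ValueError("object_values must be a dict")
    return record
```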




&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Before the AI fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source_total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="m"&gt;1000&lt;/span&gt;
&lt;span class="na"&gt;transferred&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="m"&gt;996&lt;/span&gt;
&lt;span class="na"&gt;rejected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                  &lt;span class="m"&gt;4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the AI fix runs (automatically, no human):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source_total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="m"&gt;1000&lt;/span&gt;
&lt;span class="na"&gt;transferred&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;            &lt;span class="m"&gt;1000&lt;/span&gt;
&lt;span class="na"&gt;rejected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                  &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole cycle — error detection, prompt construction, LLM call, code patch, container rebuild, verification — takes about 2-3 minutes depending on the model speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The prompt is the product.&lt;/strong&gt; I spent way more time tuning the prompt than writing the orchestration logic. If your prompt is vague, the model will make creative decisions you don't want. Be specific about what to change, what not to touch, and how to verify success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shell scripts are actually fine for this.&lt;/strong&gt; I started rewriting the orchestrator in Python, then stopped. The core logic is "tail logs, grep for patterns, run a command." Shell does this natively. Don't overcomplicate the glue code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH access was the right call.&lt;/strong&gt; I tried volume mounts first (share source code between containers). It works but gets messy fast with permissions and file locking. SSH gives you a clean interface — "read this file, write this file, run this command" — without coupling container filesystems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The circuit breaker matters more than the AI.&lt;/strong&gt; The cooldown, the retry limit, the signature dedup — that's what prevents the system from doing something stupid in a loop. The AI fix is the flashy part, but the guardrails are what make it safe to actually run.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Would You Use This For Real?
&lt;/h2&gt;

&lt;p&gt;Honestly? Not in production. Not yet. But here's where it makes sense today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Staging environments&lt;/strong&gt; where you want fast iteration on integration bugs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demo environments&lt;/strong&gt; that need to self-recover when data gets messy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data pipelines&lt;/strong&gt; where upstream systems send unpredictable payloads and you need the receiving end to adapt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools&lt;/strong&gt; where the cost of an hour of downtime is higher than the risk of an automated fix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting thing is that the &lt;em&gt;pattern&lt;/em&gt; — tail logs, detect errors, call an LLM, apply a fix, verify — doesn't require Docker at all. You could do the same thing with systemd services, Kubernetes pods, or Lambda functions. Docker just makes it easy to prototype.&lt;/p&gt;
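&lt;p&gt;Stripped of Docker, the whole pattern fits in a dozen lines. A sketch with hypothetical hooks (read_lines, try_fix, and verify are placeholders for whatever your platform provides):&lt;/p&gt;

```python
import re

def self_heal_loop(read_lines, try_fix, verify, error_pattern, max_attempts=3):
    """Detect an error, attempt fixes, verify; give up after max_attempts."""
    pattern = re.compile(error_pattern)
    for line in read_lines():
        if not pattern.search(line):
            continue                   # most lines are noise
        for _ in range(max_attempts):
            try_fix(line)
            if verify():
                break                  # healed; go back to watching
        else:
            return line                # unfixable; hand it to a human
    return None
```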




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The whole thing is five containers and a &lt;code&gt;docker-compose.yml&lt;/code&gt;. The stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x FastAPI services (Python, one stores data, one validates it)&lt;/li&gt;
&lt;li&gt;1x MongoDB&lt;/li&gt;
&lt;li&gt;1x Init container (seeds test data with intentional malformed records)&lt;/li&gt;
&lt;li&gt;1x AI orchestrator (tails logs, calls LLM, applies fixes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need an LLM API key and Docker. That's it.&lt;/p&gt;

&lt;p&gt;The orchestrator shell script is under 300 lines. The FastAPI services are under 400 lines each. There's no framework, no agent library, no orchestration platform. Just containers, logs, regex, a prompt, and an API call.&lt;/p&gt;




&lt;p&gt;If you've built something similar — or think this is a terrible idea — I'd genuinely like to hear about it. Drop a comment or ping me. The best feedback I've gotten on this project has been from people who tried to poke holes in it.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>ai</category>
      <category>python</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
