
The Human-in-the-Loop Factor: Partnering With Amazon Q During a Production Incident

Abstract

A production outage hit our Debezium cluster that streams MongoDB change events into BigQuery. The first time it happened, we only noticed hours later, which meant data loss for everything that did not make it to BigQuery. When it happened again, I used Amazon Q to guide the investigation, propose a proven remediation, and help me implement and validate an automated fix that shrinks the data loss window from minutes to seconds. More importantly, this was not about asking the AI to "solve the incident" in one shot. I broke the problem into structured steps: letting Q investigate the cluster, reviewing remediation options, simulating the failure in staging, validating deployments, and finally summarizing the process for the team.

The outcome was threefold: a practical safeguard that saved real data, a hands-on way to use AI as a Kubernetes CLI companion while learning commands myself, and a strong reminder that human judgment is still needed in the AI feedback loop. With the right thinking cues, the AI delivered context-aware, reliable help that went the extra mile during a critical production incident.


What Debezium is, and our case study

Debezium is a change data capture platform. In our setup, Debezium Server tails MongoDB Atlas change streams and writes events to Google BigQuery. The value is simple: low-latency ingestion of document changes into an analytics warehouse where downstream models and dashboards expect near-real-time data.

  • Source: MongoDB Atlas change streams
  • Transport and processing: Debezium Server running on GKE as a StatefulSet
  • Sink: BigQuery
  • Goal: keep the pipeline hot so analytics stays fresh
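For readers who have not run Debezium Server before, the whole pipeline is wired up through a single properties file. The sketch below is purely illustrative and is not our actual configuration: the source settings follow the standard Debezium MongoDB connector properties, the connection string and database name are placeholders, and the BigQuery sink settings depend on the sink implementation in use, so they are only hinted at.

# Illustrative application.properties for Debezium Server (not our real config)
# Source side: standard Debezium MongoDB connector properties
debezium.source.connector.class=io.debezium.connector.mongodb.MongoDbConnector
debezium.source.mongodb.connection.string=mongodb+srv://<atlas-cluster>/
debezium.source.topic.prefix=cdc
debezium.source.database.include.list=app_db
# File-based offsets; this is the offsets.dat file that matters later in this story
debezium.source.offset.storage.file.filename=/debezium/data/offsets.dat
# Sink side: BigQuery sink properties depend on the sink implementation in use
# debezium.sink.type=...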

An unfamiliar cluster, and an unavailable owner

I had just transitioned internally to join the data team's infrastructure efforts, and I was not yet familiar with all of their systems. The teammate who normally owns this cluster was unavailable after a bike crash, so I was missing both context and access. Amazon Q bridged that gap quickly by helping me authenticate to GCP, fetch GKE credentials, and verify connectivity.

Me: I want to connect to my gcp account and add to my local kube config the gke cluster debezium-server-stg in us-east1

Amazon Q: I can help you connect to your GCP account and add the GKE cluster to your local kubeconfig. Here are the steps:
Amazon Q: I will run the following shell command: gcloud auth login
Amazon Q: gcloud container clusters get-credentials debezium-server-stg --region us-east1
Amazon Q: Your current kubectl context is now set to: ***
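Once the credentials are in place, a couple of read-only sanity checks confirm you are pointed at the right cluster. These are standard commands shown for completeness, not part of the chat excerpt above:

# Confirm kubectl is pointed at the expected GKE cluster
kubectl config current-context

# Quick read-only connectivity check
kubectl get nodes
kubectl get namespaces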

Note: Some of the chat excerpts with Amazon Q have been significantly shortened and summarized. The article highlights only the essential parts to convey the ideas clearly and keep the reading experience smooth, without overwhelming the reader with repetitive technical details.


Background to the incident and why we needed automation

The first occurrence went unnoticed for a few hours, and everything produced between the initial error and the manual remediation never reached BigQuery. When it fired again, this time with better alerting, we decided to automate the remediation so the next failure would auto-heal in seconds, not minutes.


Step 1. Asking Amazon Q what happened

I asked Q to take the lead on understanding the failure, even though I already suspected the cause. The idea was to let him explore the pod state, events, and logs on his own, so his remediation suggestions would be grounded in context.

Me: I need you to look at debezium-server-prod cluster, specifically what happened around 2am-3am Israel time. the Debezium stopped functioning with Atlas Mongo CDC integration, I need to understand the root cause

Amazon Q: I'll help you investigate what happened to the Debezium pod around 2am–3am Israel time today. Let me start by examining the cluster and finding the Debezium pod.

kubectl get pods -A | grep -i debezium
kubectl describe pod debezium-gke-prod-0 -n debezium-prod
kubectl logs debezium-gke-prod-0 -n debezium-prod --tail=100

Evidence from logs:

... Resume of change stream was not possible ...
... with error 286 (ChangeStreamHistoryLost) ...
... the resume point may no longer be in the oplog ...
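For completeness, a few more read-only commands are handy in this kind of investigation. These are generic kubectl checks, not taken from the chat:

# Recent cluster events around the pod, newest last
kubectl get events -n debezium-prod --sort-by=.lastTimestamp | grep -i debezium

# Logs from the previous container instance, in case the pod already restarted
kubectl logs debezium-gke-prod-0 -n debezium-prod --previous --tail=200

# Restart count of the Debezium container
kubectl get pod debezium-gke-prod-0 -n debezium-prod -o jsonpath='{.status.containerStatuses[0].restartCount}'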

Step 2. Asking Amazon Q for possible solutions

Once Q had the picture, I asked for potential remediations. He quickly proposed a few options, and we went with his recommendation: modify the liveness probe to detect ChangeStreamHistoryLost errors and reset the offsets file so Debezium can restart cleanly.

Me: my friend removed the offset file to solve it from inside the pod, that's what solved the issue. I would like to have an auto remediation for such issue if it happens again. what do you suggest?

Amazon Q: Great insight. Your friend's solution makes perfect sense. Removing the offsets file forces Debezium to start fresh instead of trying to resume from an invalid position. Let me suggest several auto-remediation approaches for this specific issue.

Amazon Q: Option 1: Smart liveness probe (recommended). Detect specific MongoDB resume failures such as ChangeStreamHistoryLost and self heal by removing the problematic offsets file, then exit with non zero to trigger a restart.

Amazon Q: Option 2: Sidecar container. Run a lightweight helper container in the same pod that tails the Debezium logs and triggers cleanup actions if the error is observed. This separates remediation logic from Debezium itself.
Amazon Q: Option 3: Kubernetes CronJob or Operator. Periodically check Debezium health at the cluster level, and when unhealthy, reset the offsets and restart the StatefulSet. This decouples remediation from pod lifecycle.

Amazon Q: Option 4: External monitoring system integration. Alerting via Prometheus or Cloud Monitoring can trigger a remediation workflow (for example, a Cloud Function that restarts the StatefulSet).

I considered the other options, but the liveness probe approach was both the simplest and the fastest to implement while still giving us the observability and safety we needed. It required no extra infrastructure, no sidecars, and no external controllers - just Kubernetes doing what it already knows best.

Step 3. Asking Amazon Q to implement the solution

After seeing Q’s initial suggestion, I asked him to help me actually build the probe inside our Debezium StatefulSet. At first, Q suggested a very simple log check that would grep for ChangeStreamHistoryLost and remove the offsets file. That worked in principle, but it was too naive for production.

I directed Q to think deeper:

  • handle startup when the log might not exist yet,
  • make sure we back up the log with a timestamp for debugging,
  • prune old backups to avoid PVC bloat,
  • truncate the log in place after remediation,
  • and catch generic errors on the last line without panicking on old log noise.

Here is how the probe evolved through our conversation:

Me: Let's add a liveness probe to automatically detect the error and remove the offset file if it happens.

Amazon Q: Sure. We can create a simple probe that tails the log and looks for 'ChangeStreamHistoryLost'. If detected, remove offsets.dat and exit 1.

Me: That's a start, but we need something more comprehensive. It should handle missing logs at startup, back up the log with timestamps, clean up old backups, and truncate the file. Also, let’s still fail if the last line shows a fresh ERROR.

And here is the final version we built together:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      LOG="/debezium/data/logs/debezium.log"

      # Healthy if log isn't there yet (startup/warmup)
      if [ ! -f "$LOG" ]; then
        echo "Log file not found; assuming healthy."
        exit 0
      fi

      # Detect the resume issue only in the current log's tail
      if tail -n 80 "$LOG" | grep -E -q 'ChangeStreamHistoryLost|Resume of change stream was not possible'; then
        echo "AUTO-REMEDIATION: Detected MongoDB oplog resume issue"

        # Timestamped backup, UTC for portability
        TS="$(date -u +%Y%m%dT%H%M%SZ)"
        BAK="/debezium/data/logs/debezium.log.backup-$TS"

        # Best-effort backup
        cp "$LOG" "$BAK" 2>/dev/null || cat "$LOG" > "$BAK" || true
        echo "Backed up current log to: $BAK"

        # Clear offsets and truncate the active log in place
        rm -f /debezium/data/offsets.dat || true
        : > "$LOG" || true

        # Optional: prune old backups (>7 days) to avoid PVC bloat
        find /debezium/data/logs -maxdepth 1 -type f -name 'debezium.log.backup-*' -mtime +7 -print -delete 2>/dev/null || true

        # Trigger immediate restart
        exit 1
      fi

      # Optional: fail on a fresh last-line ERROR
      if tail -n 1 "$LOG" | grep -q 'ERROR'; then
        echo "ERROR found on last line; failing liveness."
        exit 1
      fi

      echo "Healthy."
      exit 0
  initialDelaySeconds: 60
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 1

Besides the script itself, the probe's timing configuration had to change too, to shorten Kubernetes' reaction time so remediation kicks in as soon as possible.

Originally, the probe ran every 10 seconds (periodSeconds: 10) and allowed 5 failures (failureThreshold: 5) before marking the pod unhealthy. That meant the pod could sit in a broken state for up to ~50 seconds before Kubernetes restarted it. In a streaming pipeline like Debezium, 50 seconds of downtime means thousands of lost events.

By tightening it to:

initialDelaySeconds: 60
periodSeconds: 5
timeoutSeconds: 5
failureThreshold: 1

I achieved two key improvements:

  • Faster detection: the probe now samples every 5 seconds, so it sees the error quickly.

  • Immediate remediation: failing on the first detection (failureThreshold: 1) means Kubernetes restarts Debezium right away, instead of waiting for repeated confirmation.

This combination shortens the outage window from potentially close to a minute down to just a few seconds, which is critical for minimizing data loss in CDC pipelines.

By pushing Q step by step, I ended up with a much more resilient solution than the first draft. Instead of a hacky quick fix, the probe became a well-rounded, production-ready safeguard that not only healed the cluster automatically but also preserved valuable debugging context for the team.


Step 4. Asking Amazon Q to explain the existing CD for the Debezium repository

Before rolling out the change, I asked Q to explain how our CD pipeline deploys manifests. He parsed the repo, confirmed the StatefulSet and config structure, and reassured me that applying this change through the CD path would be consistent and safe.

Here is the exact guidance Q gave me after reviewing the GitHub Actions in the CDCStreaming project. He explained how the release trigger works and how staging deployments are normally done:

Me: check the github actions in CDCStreaming project, how should I deploy to staging cluster?

Amazon Q: Perfect! Now I understand the deployment process. Based on the GitHub Actions workflow, here's how you should deploy to the staging cluster:

Deployment Process for Staging:
Option 1: Proper Git-based Deployment (Recommended)

Create a branch from main (not main itself, since manual triggers from main are blocked)

Commit your changes to the StatefulSet

Create a release targeting that branch (not main)

The GitHub Action will:
• Set environment variables for staging (deployment_env=stg, cpu=500, memory=2)
• Replace template variables in the YAML files
• Deploy to the debezium-server-stg cluster
• Restart the pod

Option 2: Manual Deployment (Quick Fix)

Since you want to test this now, let me help you deploy manually using the same process the GitHub Action uses.
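For reference, the manual path from Option 2 boils down to rendering the manifest the same way the workflow does and applying it yourself. This is a hypothetical sketch: the template path, StatefulSet name, and namespace are placeholders, and the real values live in the repository and the GitHub Actions workflow.

# Render the template roughly the way the workflow does, then apply it
# (k8s/debezium-statefulset.yaml.tpl, debezium-gke-stg, and debezium-stg are placeholders)
export deployment_env=stg cpu=500 memory=2
envsubst < k8s/debezium-statefulset.yaml.tpl | kubectl apply -f -

# Restart the pod so the new liveness probe takes effect, then wait for it
kubectl rollout restart statefulset debezium-gke-stg -n debezium-stg
kubectl rollout status statefulset debezium-gke-stg -n debezium-stg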

Step 5. After I shipped to staging, asking Amazon Q to verify

Once the change was deployed to staging via CD, I asked Q to double-check everything. He validated the probe configuration, pod health, and events to confirm that staging was stable.

Me: I applied it with CD. please confirm that the health checks look ok  

Amazon Q: Great! I see the pod restarted successfully on the staging cluster. The enhanced liveness probe is in place with:  
- Correct log path `/debezium/data/logs/debezium.log`  
- Detection of `ChangeStreamHistoryLost` errors  
- Auto-remediation with backup, offset removal, and restart trigger  
- Cleanup of old backups to prevent PVC bloat  

Debezium is streaming normally, no probe failures or unexpected restarts. The auto-remediation is active and will trigger within 10 seconds if the error reappears.
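The checks behind that answer are ordinary kubectl reads; something like the following is enough to verify the probe is wired in and the pod is quiet (the staging pod and namespace names here are placeholders):

# Confirm the probe made it into the running pod spec
kubectl describe pod debezium-gke-stg-0 -n debezium-stg | grep -A 8 'Liveness:'

# Look for probe failures or unexpected restarts
kubectl get events -n debezium-stg --sort-by=.lastTimestamp | tail -n 20
kubectl get pod debezium-gke-stg-0 -n debezium-stg -o jsonpath='{.status.containerStatuses[0].restartCount}'

# Make sure Debezium is streaming again
kubectl logs debezium-gke-stg-0 -n debezium-stg --tail=50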

Step 6. Asking Amazon Q to simulate the remediation on staging

Instead of trusting the probe blindly, I asked Q to simulate the failure by injecting the ChangeStreamHistoryLost error into the logs. This way we could watch the remediation in action.

Me: I want to simulate it. Please inject the error and let’s see that it works as expected  

Amazon Q: Great idea! I’ll inject a `ChangeStreamHistoryLost` error into the Debezium log and monitor the probe.  

→ Injected the error at 19:07:03Z  
→ Liveness probe detected it immediately and triggered remediation  
→ Pod restarted (restart count went from 0 → 1)  
→ Offset file removed, backup created (`debezium.log.backup-20250812T190703Z`)  
→ Debezium started cleanly: “No previous offsets found”  
→ Streaming resumed successfully  

🎉 **Auto-remediation test passed**: error detected, pod restarted within ~10s, offsets cleaned, and streaming resumed without manual intervention.
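If you want to reproduce the simulation yourself, the core of it is just appending a line matching the probe's pattern to the watched log file and letting Kubernetes react (the pod and namespace names below are placeholders):

# Inject a line that matches the probe's pattern into the watched log file
kubectl exec debezium-gke-stg-0 -n debezium-stg -- sh -c \
  'echo "ERROR ChangeStreamHistoryLost: Resume of change stream was not possible" >> /debezium/data/logs/debezium.log'

# Watch the pod restart once the probe fires (periodSeconds: 5, failureThreshold: 1)
kubectl get pods -n debezium-stg -w

# Afterwards, confirm the log backup exists and Debezium started from scratch
kubectl exec debezium-gke-stg-0 -n debezium-stg -- ls /debezium/data/logs/
kubectl logs debezium-gke-stg-0 -n debezium-stg | grep -i 'No previous offsets'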

Step 7. After I shipped to production, asking Amazon Q to verify again

With confidence from staging, I deployed to production. Q helped me validate the production pod with kubectl describe and logs, confirming the probe was present and the pod was healthy.
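Those production checks look roughly like this, using the pod and namespace names from the investigation earlier:

# Confirm the updated liveness probe is present on the production pod
kubectl describe pod debezium-gke-prod-0 -n debezium-prod | grep -A 8 'Liveness:'

# Confirm the pod is healthy and has not been restarted by the new probe
kubectl get pod debezium-gke-prod-0 -n debezium-prod -o jsonpath='{.status.containerStatuses[0].restartCount}'
kubectl logs debezium-gke-prod-0 -n debezium-prod --tail=50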


Step 8. Asking Amazon Q to summarize the process for the team

Finally, I asked Q to produce a summary I could share internally as a knowledge base entry. The summary captured the purpose, steps, and indicators of success concisely. I asked for the output in Confluence format, so I just copy-pasted it and made a few adjustments to match the team's tone.


What is still in progress

The long-term solution is still under investigation. This issue touches MongoDB change stream history, Debezium offsets, and operational safeguards. The automated remediation has already saved a large number of events and dramatically reduced business impact, while also saving many human hours during incidents.


Lessons learned

A small safeguard makes a big difference

A small, targeted probe saved us a lot of data and hours. Incident response is a team sport, and AI can be a productive teammate when you use it intentionally. We kept the remediation simple and observable, tested it by simulating the real failure in staging, and delivered it through the standard CD path. That discipline gave us confidence.

How to work with AI in an incident

One important takeaway is that you cannot just tell an AI assistant "solve the incident for me." You need to structure the thinking and let it walk with you step by step. For this incident, I first let Q investigate the cluster alone, even though I already knew what had happened. That gave him context and allowed his remediation suggestions to be sharper, because he had "felt" the problem himself. Then I asked him for solutions, reviewed them, and deliberately chose the one that fit best. I pushed further by asking him to simulate the failure in staging before touching production, and I asked him to verify every deployment along the way. This orchestration of investigation, proposal, simulation, and verification turned the interaction into a structured feedback loop rather than a one-shot prompt. With the right thinking cues, AI goes the extra mile and becomes a true partner when dealing with production-critical issues.

Treating Amazon Q as a Kubernetes CLI learning companion

One thing that stood out to me in this incident is how natural it felt to use Amazon Q, or really any AI CLI tool, as an interface to Kubernetes. When you are dropped into a new cluster you are not familiar with, it can be hard to know what command to run or even how to authenticate. Q bridged that gap: it helped me install the GCP CLI, log in to the right project, and fetch kubeconfig for the correct GKE cluster. From there, it navigated the cluster to find the exact issue without me having to recall every kubectl nuance.

A small but important detail is that Q always pauses before running a command. By default, you must approve each one, unless you explicitly type t to grant temporary trust for the session. I personally do not give it blanket approval, for two reasons: first, I want to learn the commands myself and understand the thinking process; second, I want to be sure nothing breaks. Even though I told Q upfront "only read, don't change anything," I treat this as advisory rather than guaranteed enforcement. Keeping explicit control allowed me to learn new Kubernetes command practices while staying safe.

My perspective on relying on AI

As long as complex systems still require human intervention for edge cases, and their developers do not make a clear claim that the system can self-heal under failure, we will continue to play an active role in the feedback loop with AI tools.

For example, Debezium’s integration with MongoDB Atlas CDC - the case I described - faced an unusual failure. It was likely caused by a third party, outside the Kubernetes cluster. But the key point is this: the way I guided the AI on how to assist me was essential. Without that structured interaction, the problem would not have been solved in practice.

I highly recommend watching this great talk, in which Andrej Karpathy, a founding member of OpenAI, explains how software is changing (again) and how important the "Human-in-the-Loop" factor is.

Until the next big leap arrives, which may not be far away, we must keep involving our own thinking in AI-driven processes, alternating between human judgment and AI automation.
