Saurabh Mishra for Google Developer Experts

Posted on Jun 29

Closing the Trust Gap: Automating GKE Incident Response with Antigravity 2.0, GKE MCP, and Artifacts

#antigravity #googlecloud #kubernetes #cloud

Anatomy of the Trust Gap

Before we can talk about the solution, we need to talk honestly about how the trust gap forms. It isn't a technology failure; it's an epistemological one. When an automated system takes an action or makes a recommendation, on-call engineers need to answer three questions almost simultaneously:

Is the diagnosis correct? Does the system understand what's actually wrong, or is it pattern-matching superficially?

Is the proposed action safe? Will following this recommendation make things better, worse, or sidestep the real issue entirely?

Can I explain this decision later? If I follow the automation and it goes wrong, will I be able to reconstruct why — for a postmortem, for my team, for myself?

Legacy runbook automation systems answer none of these questions well. They tell you what to do, not why. They surface alerts, not reasoning. And when they're wrong — which they are, reliably, in the tail cases that matter most — engineers stop trusting them for everything, including the cases where they'd be right.

Engineers are rightfully terrified of "runaway automation"—brittle bash scripts or over-eager webhooks that misinterpret a symptom, delete the wrong stateful pod, or trigger an accidental cascading failure across a cluster. Because of this, we default to waking up exhausted humans at 3:00 AM to manually sift through kubectl logs.

With the emergence of agentic AI ecosystems, we finally have a way to close this gap. By pairing Google Antigravity 2.0 (Google's standalone agent orchestration platform) with GKE's native infrastructure and Artifacts, teams can build an automated, transparent, and strictly governed incident response pipeline.

The Tech Stack: GKE, MCP, and Antigravity

To automate incident resolution safely, an AI agent cannot treat a cluster like a black box. It needs deep, contextual access to the environment without compromising security.

This workflow relies on three core components:

Google Kubernetes Engine (GKE): The underlying managed environment running containerized workloads.

GKE Model Context Protocol (MCP) Server: Introduced to standardize how AI agents interact with Kubernetes, the MCP server exposes standardized capabilities for monitoring, analyzing, and modifying cluster resources.

Google Antigravity 2.0: Operating via the Gemini Enterprise Agent Platform, Antigravity functions as the central orchestrator. It connects to the GKE MCP server using enterprise-grade IAM credentials and Workload Identity, executing automated reasoning loops to triage and fix issues.

Bridging the Gap with Artifact-Driven SRE

The secret to trust is transparency. Google Antigravity does not blindly run destructive scripts in the background. Instead, its core design centers on Artifacts—structured, immutable deliverables created by the agent to communicate its thinking, progress, and verification milestones to human users. When applied to GKE Site Reliability Engineering (SRE), Antigravity uses an Artifact-Driven Remediation framework:

Implementation Plans: Before modifying any cluster state, the agent generates a rich Markdown specification detailing the exact API changes it intends to make (e.g., cordoning a node, scaling down a corrupted deployment).

Task Lists: A structured checklist showing the step-by-step diagnostic operations the agent is executing in real time.

Walkthroughs: Once a fix is applied, the agent generates an interactive post-mortem artifact summarizing the changes and verifying cluster health against real log data.
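To make the idea of an immutable Implementation Plan concrete, here is a minimal Python sketch. The class and field names (`PlannedChange`, `ImplementationPlan`, `to_markdown`) are assumptions made for illustration and are not Antigravity's actual Artifact schema; the sketch only shows one way a plan could be frozen once created and rendered as Markdown for human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedChange:
    """One intended cluster mutation (hypothetical shape, for illustration)."""
    action: str   # e.g. "cordon" or "scale"
    target: str   # e.g. "node/gke-pool-1-abcd"
    reason: str   # why the agent wants to do this


@dataclass(frozen=True)
class ImplementationPlan:
    """Artifact produced before any cluster state is modified."""
    incident: str
    changes: tuple[PlannedChange, ...]

    def to_markdown(self) -> str:
        """Render the plan as a numbered Markdown list for reviewers."""
        lines = [f"# Implementation Plan: {self.incident}", ""]
        for i, change in enumerate(self.changes, start=1):
            lines.append(
                f"{i}. `{change.action}` `{change.target}`: {change.reason}"
            )
        return "\n".join(lines)
```

Freezing the dataclasses means a plan cannot be edited after the reviewer has seen it; a changed plan has to be a new Artifact.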

Step-by-Step: The Automated Incident Loop

Let's look at how Antigravity handles a common, painful production issue: a microservice experiencing a memory leak that triggers an Out-Of-Memory (OOM) killer loop, choking out co-located pods on a GKE node.

GKE-Specific Diagnostic Patterns: What We've Learned

Twelve months of running Antigravity in production across seventeen GKE clusters has generated a substantial library of incident patterns. The following are the most common root cause categories our diagnostic engine has learned to identify with high confidence, along with the signal signatures that distinguish them:

Node pool autoscaler contention
Symptoms: pods stuck in Pending despite headroom in existing nodes; cluster autoscaler logs showing scale-up events followed by immediate scale-down; kube_node_status_condition flipping. Common in environments where both HPA and VPA are enabled without coordination, creating competing scaling pressure. Antigravity's diagnostic rule for this pattern has 0.89 average confidence based on the last 6 months of production data.
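One cheap precondition check for this pattern is to look for workloads that are targeted by both an HPA and a VPA. The sketch below is an illustration under assumptions, not Antigravity's diagnostic rule: it presumes you have already extracted `(namespace, kind, name)` tuples from each HPA's `spec.scaleTargetRef` and each VPA's `spec.targetRef`, and the function name is hypothetical.

```python
def find_scaling_conflicts(hpa_targets, vpa_targets):
    """Return workloads targeted by both an HPA and a VPA.

    Each argument is an iterable of (namespace, kind, name) tuples taken
    from the autoscalers' target references. The result is sorted so the
    output is stable across runs.
    """
    return sorted(set(hpa_targets) & set(vpa_targets))


if __name__ == "__main__":
    hpas = [("shop", "Deployment", "cart"), ("shop", "Deployment", "search")]
    vpas = [("shop", "Deployment", "cart"), ("ops", "Deployment", "exporter")]
    for namespace, kind, name in find_scaling_conflicts(hpas, vpas):
        print(f"{namespace}/{kind}/{name} is scaled by both HPA and VPA")
```

An overlap is a reason to look closer rather than proof of contention, since the two autoscalers may be configured to act on different metrics.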

Workload Identity credential expiry
Symptoms: application pods returning 403s to GCP APIs; token-refresh-timeout errors in container logs; incident opened by a latency or error-rate alert rather than an infrastructure alert. Tricky to diagnose because the failure appears in the application layer but the root cause is in the identity infrastructure. Correlating signals across Kubernetes events and Cloud Logging is what makes this diagnosable.
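The correlation step can be pictured as a time-window join between two signal streams. The following is a rough sketch under assumptions: both inputs are plain lists of `datetime` values that you have already pulled from your logging and event sources, the five-minute window is an arbitrary illustrative default, and `correlate_signals` is a hypothetical name.

```python
from datetime import datetime, timedelta


def correlate_signals(app_errors, identity_events, window=timedelta(minutes=5)):
    """Pair each application error with identity events shortly before it.

    Returns (identity_event, app_error) tuples where the identity event
    happened at or before the error and no more than `window` earlier.
    """
    pairs = []
    for error_time in app_errors:
        for event_time in identity_events:
            if timedelta(0) <= error_time - event_time <= window:
                pairs.append((event_time, error_time))
    return pairs


if __name__ == "__main__":
    errors = [datetime(2025, 1, 1, 3, 4)]
    events = [datetime(2025, 1, 1, 3, 1), datetime(2025, 1, 1, 1, 0)]
    print(correlate_signals(errors, events))
```

A nested loop is fine for the handful of events in a single incident; a larger stream would call for sorting both lists and sweeping once.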

Resource quota saturation at namespace level
Symptoms: new pod creation failing with exceeded quota despite ample node resources; affects all deployments in a namespace simultaneously. Engineers frequently misdiagnose this as a node shortage because node-level metrics look healthy. Antigravity's namespace quota check is the first hypothesis evaluated for any pod-creation failure, because it can be ruled in or out in under a second.
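A namespace quota check of this kind is simple arithmetic over a ResourceQuota's `status.hard` and `status.used` maps. The sketch below assumes those values have already been parsed into plain integers (millicores, bytes, or object counts); the function name and the example figures are illustrative only.

```python
def exceeded_quota(hard, used, requested):
    """Return the resources a new pod would push over the namespace quota.

    hard and used mirror a ResourceQuota's status.hard and status.used;
    requested holds the new pod's requests. All values are assumed to be
    pre-parsed integers in the same unit per resource. The result maps
    each over-quota resource to the amount by which it would be exceeded.
    """
    over = {}
    for resource, limit in hard.items():
        total = used.get(resource, 0) + requested.get(resource, 0)
        if total > limit:
            over[resource] = total - limit
    return over


if __name__ == "__main__":
    hard = {"requests.cpu": 4000, "pods": 10}
    used = {"requests.cpu": 3800, "pods": 4}
    new_pod = {"requests.cpu": 500, "pods": 1}
    print(exceeded_quota(hard, used, new_pod))
```

An empty result rules the hypothesis out, which is why a check like this is cheap enough to run first.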

Affinity/anti-affinity scheduling deadlocks
Symptoms: 0/N nodes are available: N node(s) didn't match pod anti-affinity in scheduler events; happens after cluster topology changes (node pool resize, zonal failures). Difficult to reason about in the moment because the conflict is between pod specs that were each valid when written. The Artifact for these incidents includes a specific note explaining which pods are in conflict and why.
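Scheduler messages of this shape can be broken down mechanically before any reasoning happens. The sketch below assumes the FailedScheduling message follows the "A/N nodes are available: K node(s) reason, ..." layout quoted above; exact wording varies between Kubernetes versions, so the regular expressions and the function names are assumptions to adapt rather than a stable parser.

```python
import re

_SUMMARY = re.compile(r"(\d+)/(\d+) nodes are available")
_REASON = re.compile(r"(\d+) node\(s\) ([^,.]+)")


def parse_failed_scheduling(message):
    """Split a FailedScheduling message into counts per rejection reason.

    Returns None when the message does not contain the expected summary.
    """
    summary = _SUMMARY.search(message)
    if summary is None:
        return None
    reasons = {
        text.strip(): int(count) for count, text in _REASON.findall(message)
    }
    return {
        "available": int(summary.group(1)),
        "total": int(summary.group(2)),
        "reasons": reasons,
    }


def anti_affinity_blocked(parsed):
    """Count nodes rejected for anti-affinity in a parsed message."""
    return sum(
        count
        for reason, count in parsed["reasons"].items()
        if "anti-affinity" in reason
    )
```

If every node in the cluster is rejected for anti-affinity, the incident is a candidate for this pattern; identifying which pod specs conflict still requires inspecting the specs themselves.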

The Antigravity Pipeline: From Signal to Artifact

What You'll Need to Build This

Antigravity is an internal platform, but the architectural pattern is reproducible. If you're building toward something similar, here's an honest assessment of what's required:

Observability foundations that are actually good
Antigravity is only as smart as its inputs. If your Prometheus metrics are inconsistently labeled, your GKE event retention is too short, or your structured logging is incomplete, the diagnostic engine will produce low-confidence outputs that engineers learn not to trust — and you're back to square one. Invest in observability before investing in automation.

A runbook of failure modes, not just runbooks
The diagnostic patterns that power Antigravity's hypothesis engine came from three months of retrofitting existing incident postmortems into structured, parameterized failure signatures. This work is not glamorous. It also cannot be skipped. LLMs like Claude are remarkably good at synthesizing structured context into legible narrative — they are not (yet) good at doing root cause analysis from raw, unstructured signal streams.

A hard commitment to the human gate
The temptation to auto-approve "low risk" actions will be constant and will come from leadership as well as from engineers who get tired of approving the same PDB patches. Resist it. The trust in Antigravity was built precisely because nothing executes without human approval: engineers know that if they make a mistake, they made it, and they can learn from it. Eroding the gate erodes the trust model.
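In code, a hard gate means the execution path has no branch that skips approval, regardless of how low the risk looks. Here is a minimal sketch of that idea; `ApprovalRequired`, `execute_with_gate`, and the `apply` callback are hypothetical names, and a real system would verify the approver's identity rather than accept a string.

```python
class ApprovalRequired(Exception):
    """Raised when a plan is submitted for execution without an approver."""


def execute_with_gate(plan, approved_by, apply):
    """Run apply(plan) only if a named human approved it.

    There is deliberately no risk-level parameter: every plan takes the
    same path, so there is nothing to auto-approve.
    """
    if not approved_by or not approved_by.strip():
        raise ApprovalRequired("no human approver recorded for this plan")
    result = apply(plan)
    # Return an audit record tying the outcome to the approver.
    return {"plan": plan, "approved_by": approved_by.strip(), "result": result}
```

Recording the approver alongside the result keeps ownership of the decision explicit, which is the property the gate exists to protect.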

Genuine uncertainty representation
Build the uncertainty_notes requirement into your Artifact schema as a non-nullable, non-empty field. Prompt your LLM to fill it honestly. Review generated Artifacts in postmortems not just for cases where the system was wrong, but for cases where it was right but overconfident. Calibration matters as much as accuracy.
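One way to enforce a non-nullable, non-empty `uncertainty_notes` field is to reject the Artifact at construction time. The sketch below uses a plain Python dataclass; apart from `uncertainty_notes`, the class and field names are assumptions for illustration, not a published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosisArtifact:
    """Hypothetical Artifact shape with a mandatory uncertainty field."""
    hypothesis: str
    confidence: float        # expected to lie in [0, 1]
    uncertainty_notes: str   # must say what the system is unsure about

    def __post_init__(self):
        notes = self.uncertainty_notes
        if not isinstance(notes, str) or not notes.strip():
            raise ValueError("uncertainty_notes must be a non-empty string")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```

Validation like this only guarantees that the field is present; whether the notes are honest and well calibrated still has to be checked in postmortem review.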

Restoring Peace of Mind to On-Call Teams

When an engineer opens their laptop after a resolved incident, they aren't looking at a black box or a string of cryptic logs. They are greeted by structured, historical evidence.

Through the Antigravity 2.0 Desktop Sidebar or CLI, the engineering team has an asynchronous paper trail of the entire event. The trust gap disappears because the system behaves predictably, logs its intentions transparently before acting, and provides concrete receipts of success.

By pairing the declarative, rock-solid infrastructure of GKE with the precise, artifact-backed reasoning of Google Antigravity, organizations can safely transition from reactive fire-fighting to autonomous, self-healing infrastructure.
