The 5 AM Problem
It's 3 AM. Your payment gateway is down. 73,000 customers can't transact. Your engineers are in a war room, screaming service names, error codes, and blast radius numbers at each other over a Zoom call.
You fix it by 4 AM. Good.
Now comes the other clock. Under EU DORA Article 11.1(a), you have 4 hours from incident classification to notify your competent authority. That means someone — usually the most senior compliance person — has to reconstruct everything that happened from memory, channel logs, and a half-dozen Grafana screenshots, and turn it into a structured regulatory report. By 5 AM. While still exhausted.
That report currently takes 4+ hours.
I built ARIA (Automated Regulatory Incident Analyst) to do it in under 8 minutes, live, while the incident is still happening.
What ARIA Does
ARIA joins the incident call as a silent agent. It:
- Listens to every spoken word via the Gemini Live API — hearing service names (`payment-gateway-v2`), error codes (`503`, `EXHAUSTED`), and impact numbers (73,000 users, 7.3% failure rate)
- Watches engineers' screens every 5 seconds — reading Grafana dashboards, kubectl output, alert panels
- Builds the DORA Article 11 report in real time, section by section, as evidence comes in
- Switches persona based on who's speaking — gives technical commands to engineers, cites exact regulatory clauses to compliance officers, and speaks plain business language to executives
- Triggers a 4-hour countdown clock the moment the DORA threshold is crossed (>5% transaction failure rate)
By the time the incident is resolved, the compliance report is already written.
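The countdown trigger in the list above is simple to model. Here is a minimal sketch of that logic — the function name and shape are illustrative, not ARIA's actual source; only the 5% threshold and 4-hour window come from the description:

```javascript
// Hypothetical sketch of the DORA clock trigger. The threshold (>5%
// transaction failure rate) and the 4-hour notification window come
// from the article; everything else is illustrative.
const DORA_FAILURE_THRESHOLD = 0.05
const NOTIFICATION_WINDOW_MS = 4 * 60 * 60 * 1000 // 4 hours

function checkDoraThreshold(failureRate, now = Date.now()) {
  if (failureRate <= DORA_FAILURE_THRESHOLD) return null
  // Threshold crossed: the 4-hour notification window starts now
  return { classifiedAt: now, notifyBy: now + NOTIFICATION_WINDOW_MS }
}
```

A 7.3% failure rate crosses the threshold and starts the clock; anything at or below 5% leaves it idle.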
The Hard Part: Gemini Live API on AI Studio
This challenge had one brutal technical constraint: the Gemini Live API (bidiGenerateContent) is only available on native-audio models on AI Studio keys. Specifically: gemini-2.5-flash-native-audio-latest.
These models support real-time bidirectional audio streaming and produce inputTranscription of what participants say — but they cannot emit structured JSON text output directly. They're designed for voice-to-voice applications, not voice-to-JSON pipelines.
It took seven architecture versions to get a session that stayed open:
| Version | Model | Config | Result |
|---|---|---|---|
| v1 | `gemini-2.0-flash` | `startChat()` | No Live API in old SDK |
| v2 | `gemini-2.0-flash-live-001` | correct config | 1008 — not found on AI Studio |
| v3–v4 | various flash models | `bidiGenerateContent` | 1008 — model not available |
| v5 | native-audio | TEXT modality | 1007 — "Cannot extract voices" |
| v6 | native-audio | AUDIO+TEXT + systemInstruction | 1007 — "Invalid argument" |
| v7 | native-audio | AUDIO only, no systemInstruction | ✅ Session stays open |
The breakthrough was understanding that `gemini-2.5-flash-native-audio-latest` is a voice-to-voice model. It rejects the TEXT modality and rejects `systemInstruction` in the live config. You must give it `responseModalities: ['AUDIO']`, with no system prompt and no text output.
The Hybrid Architecture
```
Gemini Live (gemini-2.5-flash-native-audio-latest)
  responseModalities: ['AUDIO']              ← stays open
  → inputTranscription fires per speech turn
  → on turnComplete → transcript string

generateContent (gemini-2.5-flash)
  systemInstruction: ARIA_ANALYST_PROMPT
  responseMimeType: 'application/json'
  contents: [{ role: 'user', parts: [{ text: transcript }] }]
  → structured IncidentEvent JSON
  → Zod validation → Pub/Sub → 3 ADK agents → SSE → browser
```
Two models working together: the Live session handles the real-time audio stream and transcription, and a separate generateContent call handles the structured reasoning. Each does what it's actually good at.
```js
// listenerAgent.js — the key insight
session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-latest',
  config: {
    responseModalities: ['AUDIO'],
    inputAudioTranscription: {}, // ask the server to transcribe input speech
    // No systemInstruction here — live model = transcription only
  },
  callbacks: {
    onmessage: (message) => {
      const sc = message.serverContent
      // Partial transcription fragments accumulate across messages
      if (sc?.inputTranscription?.text) {
        transcriptBuffer += sc.inputTranscription.text
      }
      // End of a speech turn: hand the full utterance to the reasoning model
      if (sc?.turnComplete && transcriptBuffer.trim()) {
        generateIncidentEvent(transcriptBuffer.trim(), incidentId)
        transcriptBuffer = ''
      }
    },
  },
})
```
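The other half of the hybrid is the reasoning call that the listener hands each completed turn to. A sketch under assumptions: the request shape follows the `@google/genai` `generateContent` API, but the prompt constant is stubbed and the validator is a dependency-free stand-in for the repo's Zod schema (the event field names are my guesses, not ARIA's actual schema):

```javascript
// Stand-in validator for the repo's Zod IncidentEvent schema.
// Field names here are illustrative assumptions.
function validateIncidentEvent(obj) {
  if (typeof obj !== 'object' || obj === null) return null
  const ok =
    typeof obj.service === 'string' &&
    typeof obj.errorCode === 'string' &&
    typeof obj.usersAffected === 'number' &&
    typeof obj.failureRate === 'number'
  return ok ? obj : null
}

// Stubbed prompt — the real ARIA_ANALYST_PROMPT lives in the repo.
const ARIA_ANALYST_PROMPT =
  'Extract a structured IncidentEvent JSON object from this incident-call transcript.'

async function generateIncidentEvent(ai, transcript, incidentId) {
  // The text model does the structured reasoning the audio model cannot
  const response = await ai.models.generateContent({
    model: 'gemini-2.5-flash',
    contents: [{ role: 'user', parts: [{ text: transcript }] }],
    config: {
      systemInstruction: ARIA_ANALYST_PROMPT,
      responseMimeType: 'application/json',
    },
  })
  const event = validateIncidentEvent(JSON.parse(response.text))
  // In ARIA the validated event is published to the incident-events topic;
  // this sketch just returns it
  return event ? { incidentId, ...event } : null
}
```

Because the live session never closes, this function can be invoked once per speech turn without interrupting the audio stream.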
The ADK Agent Pipeline
Once a structured IncidentEvent JSON lands in Pub/Sub, three Google ADK agents process it sequentially:
```
Pub/Sub: incident-events
        │
        ▼
[Analyst Agent]    ← root cause, blast radius, severity classification
        │
        ▼  Pub/Sub: incident-analysis
        │
[Compliance Agent] ← DORA Art. 11.1(a/b/c), SOX 404, notification deadlines
        │
        ▼  Pub/Sub: compliance-mappings
        │
[Reporter Agent]   ← generates 6 report sections, writes to Firestore, broadcasts via SSE
```
The Reporter Agent generates each section with a targeted prompt — Timeline, Blast Radius, Root Cause, Regulatory Obligations, Remediation, and Executive Summary — and broadcasts them live via SSE. Each section appears in the browser as it's generated, creating the "report building before your eyes" effect.
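Server-Sent Events have a simple wire format, so the "building before your eyes" effect needs very little plumbing. A sketch — the `report-section` event name and payload fields are my assumptions, not the repo's actual ones:

```javascript
// Serialise one report section as an SSE frame.
// Event name and payload shape are illustrative assumptions.
function formatSseEvent(section) {
  return `event: report-section\ndata: ${JSON.stringify(section)}\n\n`
}

// Push a newly generated section to every connected client.
// Each client is an http.ServerResponse that already sent SSE headers
// (Content-Type: text/event-stream, Cache-Control: no-cache).
function broadcast(clients, section) {
  const frame = formatSseEvent(section)
  for (const res of clients) res.write(frame)
}
```

In the browser, an `EventSource` subscriber would pick these up with `es.addEventListener('report-section', (e) => render(JSON.parse(e.data)))`.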
The Infrastructure
Everything is Terraform-provisioned on GCP:
- Cloud Run — containerised Node.js 20, min-instances=1, CPU always-on (critical for WebSocket longevity)
- Cloud Pub/Sub — 4 topics + 4 DLQ topics for the agent chain
- Firestore — incident state and report sections
- Cloud Build — CI/CD: `git push main` → build → deploy → new revision
- Artifact Registry — Docker image store
- Secret Manager — `GEMINI_API_KEY` never in plaintext env vars
One command to provision everything:
```bash
cd terraform
terraform apply -var="project_id=YOUR_PROJECT" -var="gemini_api_key=YOUR_KEY"
```
What a Real Incident Looks Like
You open ARIA, type the incident title, and click Start Incident. Then click Start Listening — the browser requests microphone permission and immediately begins streaming PCM audio at 16kHz to the server via WebSocket.
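The conversion from Web Audio's Float32 samples to the 16-bit PCM the Live API expects is a pure function, so it can be sketched in isolation; the surrounding capture wiring (shown in comments) is the standard `getUserMedia` + `AudioContext` path, and is an assumption about ARIA's client code rather than a copy of it:

```javascript
// Convert Web Audio Float32 samples in [-1, 1] to 16-bit signed PCM,
// clamping out-of-range values to avoid integer overflow.
function floatTo16BitPCM(float32) {
  const pcm = new Int16Array(float32.length)
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]))
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return pcm
}

// Browser-side wiring (sketch, not runnable in Node):
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
//   const ctx = new AudioContext({ sampleRate: 16000 })
//   an AudioWorklet (or ScriptProcessor) pulls Float32 frames, then:
//   ws.send(floatTo16BitPCM(frame).buffer)
```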
You say into your microphone:
"payment-gateway-v2 is throwing 503 errors, postgres connection pool is exhausted, 73,000 users affected, 7.3 percent transaction failure rate"
Seconds later:
- The transcript card appears in the Live Transcript panel — raw quote + ARIA's spoken response
- The DORA clock switches to orange and starts counting down from 4:00:00
- The DORA Article 11 Report panel begins building — Timeline... Blast Radius... Root Cause... Regulatory Obligations (citing exact DORA Article 11.1(a) clause + notification deadline)... Remediation... Executive Summary
- The persona badge in the top-right switches based on vocabulary — Engineer → Compliance → Executive
When a compliance officer asks "which DORA clause does this trigger?", ARIA responds with exact clause citations. When the CEO asks "what do I tell the board?", ARIA gives a business-impact summary with no jargon.
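Vocabulary-based persona switching can be as simple as keyword scoring. A hypothetical sketch — the keyword lists and default are my assumptions, not ARIA's actual classifier:

```javascript
// Score the utterance against per-persona keyword lists and pick the
// best match. Keyword lists are illustrative assumptions.
const PERSONA_KEYWORDS = {
  engineer: ['kubectl', 'pod', '503', 'latency', 'rollback', 'connection pool'],
  compliance: ['dora', 'article', 'clause', 'notification', 'regulator', 'sox'],
  executive: ['board', 'revenue', 'customers', 'reputation', 'headline'],
}

function detectPersona(utterance) {
  const text = utterance.toLowerCase()
  let best = 'engineer' // default: most incident-call speech is technical
  let bestScore = 0
  for (const [persona, words] of Object.entries(PERSONA_KEYWORDS)) {
    const score = words.filter((w) => text.includes(w)).length
    if (score > bestScore) { best = persona; bestScore = score }
  }
  return best
}
```

A production classifier would likely lean on the model itself rather than keywords, but this shows the shape of the decision.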
The Numbers
| Metric | Before ARIA | With ARIA |
|---|---|---|
| Compliance report completion | 4+ hours post-mortem | Under 8 minutes live |
| DORA notification prep | Manual, from memory | Automated, from real-time evidence |
| Multi-stakeholder communication | One message for all | Persona-adapted per audience |
| Audit trail | Channel logs + memory | Timestamped, structured, Firestore-persisted |
Try It
Live demo: https://regguardian-908307939543.us-central1.run.app
Source: https://github.com/manojmallick/regguardian
The repo includes full Terraform IaC, a multi-stage Dockerfile, Zod schemas, and the complete ADK agent pipeline. The README has step-by-step deployment instructions.
What I Learned
The biggest lesson: the Gemini Live API is genuinely different from a text model with audio input. It's a voice-to-voice model designed for conversational agents. Trying to use it like a text model (adding TEXT modality, systemInstruction, structured output) breaks the session within milliseconds with a 1007.
The hybrid architecture — using the Live model purely for transcription and a separate generateContent call for reasoning — is the correct pattern for building agentic pipelines on top of the Live API. The Live session stays permanently open for as long as the incident lasts. The reasoning model gets invoked once per speech turn.
Built for the Gemini Live Agent Challenge 2026.
#GeminiLiveAgentChallenge #GoogleCloud #GeminiAPI #DORA #IncidentResponse #AI