The 5 AM Problem
It's 3 AM. Your payment gateway is down. 73,000 customers can't transact. Your engineers are in a war room, screaming service names, error codes, and blast radius numbers at each other over a Zoom call.
You fix it by 4 AM. Good.
Now comes the other clock. Under EU DORA Article 11.1(a), you have 4 hours from incident classification to notify your competent authority. That means someone — usually the most senior compliance person — has to reconstruct everything that happened from memory, channel logs, and a half-dozen Grafana screenshots, and turn it into a structured regulatory report. By 5 AM. While still exhausted.
That report currently takes 4+ hours.
I built ARIA (Automated Regulatory Incident Analyst) to do it in under 8 minutes, live, while the incident is still happening.
What ARIA Does
ARIA joins the incident call as a silent agent. It:
- Listens to every spoken word via the Gemini Live API — hearing service names (`payment-gateway-v2`), error codes (`503`, `EXHAUSTED`), and impact numbers (73,000 users, 7.3% failure rate)
- Watches engineers' screens every 5 seconds — reading Grafana dashboards, kubectl output, alert panels
- Builds the DORA Article 11 report in real time, section by section, as evidence comes in
- Switches persona based on who's speaking — gives technical commands to engineers, cites exact regulatory clauses to compliance officers, and speaks plain business language to executives
- Triggers a 4-hour countdown clock the moment the DORA threshold is crossed (>5% transaction failure rate)
By the time the incident is resolved, the compliance report is already written.
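The countdown trigger in the list above is simple to model. Here is a minimal sketch of that logic — the function name and shape are illustrative, not ARIA's actual source; only the 5% threshold and 4-hour window come from the description:

```javascript
// Hypothetical sketch of the DORA clock trigger. The threshold (>5%
// transaction failure rate) and the 4-hour notification window come
// from the article; everything else is illustrative.
const DORA_FAILURE_THRESHOLD = 0.05
const NOTIFICATION_WINDOW_MS = 4 * 60 * 60 * 1000 // 4 hours

function checkDoraThreshold(failureRate, now = Date.now()) {
  if (failureRate <= DORA_FAILURE_THRESHOLD) return null
  // Threshold crossed: the 4-hour notification window starts now
  return { classifiedAt: now, notifyBy: now + NOTIFICATION_WINDOW_MS }
}
```

A 7.3% failure rate crosses the threshold and starts the clock; anything at or below 5% leaves it idle.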
The Hard Part: Gemini Live API on AI Studio
This challenge had one brutal technical constraint: the Gemini Live API (bidiGenerateContent) is only available on native-audio models on AI Studio keys. Specifically: gemini-2.5-flash-native-audio-latest.
These models support real-time bidirectional audio streaming and produce inputTranscription of what participants say — but they cannot emit structured JSON text output directly. They're designed for voice-to-voice applications, not voice-to-JSON pipelines.
It took seven architecture versions to get a session that stayed open:
| Version | Model | Config | Result |
|---|---|---|---|
| v1 | `gemini-2.0-flash` | `startChat()` | No Live API in old SDK |
| v2 | `gemini-2.0-flash-live-001` | correct config | 1008 — not found on AI Studio |
| v3–v4 | various flash models | `bidiGenerateContent` | 1008 — model not available |
| v5 | native-audio | TEXT modality | 1007 — "Cannot extract voices" |
| v6 | native-audio | AUDIO+TEXT + systemInstruction | 1007 — "Invalid argument" |
| v7 | native-audio | AUDIO only, no systemInstruction | ✅ Session stays open |
The breakthrough was understanding that `gemini-2.5-flash-native-audio-latest` is a voice-to-voice model. It rejects the TEXT modality and rejects `systemInstruction` in the live config. You must give it `responseModalities: ['AUDIO']`, with no system prompt and no text output.
The Hybrid Architecture
```
Gemini Live (gemini-2.5-flash-native-audio-latest)
  responseModalities: ['AUDIO']              ← stays open
  → inputTranscription fires per speech turn
  → on turnComplete → transcript string

generateContent (gemini-2.5-flash)
  systemInstruction: ARIA_ANALYST_PROMPT
  responseMimeType: 'application/json'
  contents: [{ role: 'user', parts: [{ text: transcript }] }]
  → structured IncidentEvent JSON
  → Zod validation → Pub/Sub → 3 ADK agents → SSE → browser
```
Two models working together: the Live session handles the real-time audio stream and transcription, and a separate generateContent call handles the structured reasoning. Each does what it's actually good at.
```js
// listenerAgent.js — the key insight
session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-latest',
  config: {
    responseModalities: ['AUDIO'],
    inputAudioTranscription: {}, // ask the server to transcribe input speech
    // No systemInstruction here — live model = transcription only
  },
  callbacks: {
    onmessage: (message) => {
      const sc = message.serverContent
      // Partial transcription fragments accumulate across messages
      if (sc?.inputTranscription?.text) {
        transcriptBuffer += sc.inputTranscription.text
      }
      // End of a speech turn: hand the full utterance to the reasoning model
      if (sc?.turnComplete && transcriptBuffer.trim()) {
        generateIncidentEvent(transcriptBuffer.trim(), incidentId)
        transcriptBuffer = ''
      }
    },
  },
})
```
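The other half of the hybrid is the reasoning call that the listener hands each completed turn to. A sketch under assumptions: the request shape follows the `@google/genai` `generateContent` API, but the prompt constant is stubbed and the validator is a dependency-free stand-in for the repo's Zod schema (the event field names are my guesses, not ARIA's actual schema):

```javascript
// Stand-in validator for the repo's Zod IncidentEvent schema.
// Field names here are illustrative assumptions.
function validateIncidentEvent(obj) {
  if (typeof obj !== 'object' || obj === null) return null
  const ok =
    typeof obj.service === 'string' &&
    typeof obj.errorCode === 'string' &&
    typeof obj.usersAffected === 'number' &&
    typeof obj.failureRate === 'number'
  return ok ? obj : null
}

// Stubbed prompt — the real ARIA_ANALYST_PROMPT lives in the repo.
const ARIA_ANALYST_PROMPT =
  'Extract a structured IncidentEvent JSON object from this incident-call transcript.'

async function generateIncidentEvent(ai, transcript, incidentId) {
  // The text model does the structured reasoning the audio model cannot
  const response = await ai.models.generateContent({
    model: 'gemini-2.5-flash',
    contents: [{ role: 'user', parts: [{ text: transcript }] }],
    config: {
      systemInstruction: ARIA_ANALYST_PROMPT,
      responseMimeType: 'application/json',
    },
  })
  const event = validateIncidentEvent(JSON.parse(response.text))
  // In ARIA the validated event is published to the incident-events topic;
  // this sketch just returns it
  return event ? { incidentId, ...event } : null
}
```

Because the live session never closes, this function can be invoked once per speech turn without interrupting the audio stream.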
The ADK Agent Pipeline
Once a structured IncidentEvent JSON lands in Pub/Sub, three Google ADK agents process it sequentially:
```
Pub/Sub: incident-events
        │
        ▼
[Analyst Agent]    ← root cause, blast radius, severity classification
        │
        ▼  Pub/Sub: incident-analysis
        │
[Compliance Agent] ← DORA Art. 11.1(a/b/c), SOX 404, notification deadlines
        │
        ▼  Pub/Sub: compliance-mappings
        │
[Reporter Agent]   ← generates 6 report sections, writes to Firestore, broadcasts via SSE
```
The Reporter Agent generates each section with a targeted prompt — Timeline, Blast Radius, Root Cause, Regulatory Obligations, Remediation, and Executive Summary — and broadcasts them live via SSE. Each section appears in the browser as it's generated, creating the "report building before your eyes" effect.
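Server-Sent Events have a simple wire format, so the "building before your eyes" effect needs very little plumbing. A sketch — the `report-section` event name and payload fields are my assumptions, not the repo's actual ones:

```javascript
// Serialise one report section as an SSE frame.
// Event name and payload shape are illustrative assumptions.
function formatSseEvent(section) {
  return `event: report-section\ndata: ${JSON.stringify(section)}\n\n`
}

// Push a newly generated section to every connected client.
// Each client is an http.ServerResponse that already sent SSE headers
// (Content-Type: text/event-stream, Cache-Control: no-cache).
function broadcast(clients, section) {
  const frame = formatSseEvent(section)
  for (const res of clients) res.write(frame)
}
```

In the browser, an `EventSource` subscriber would pick these up with `es.addEventListener('report-section', (e) => render(JSON.parse(e.data)))`.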
The Infrastructure
Everything is Terraform-provisioned on GCP:
- Cloud Run — containerised Node.js 20, min-instances=1, CPU always-on (critical for WebSocket longevity)
- Cloud Pub/Sub — 4 topics + 4 DLQ topics for the agent chain
- Firestore — incident state and report sections
- Cloud Build — CI/CD: `git push main` → build → deploy → new revision
- Artifact Registry — Docker image store
- Secret Manager — `GEMINI_API_KEY` never in plaintext env vars
One command to provision everything:
```bash
cd terraform
terraform apply -var="project_id=YOUR_PROJECT" -var="gemini_api_key=YOUR_KEY"
```
What a Real Incident Looks Like
You open ARIA, type the incident title, and click Start Incident. Then click Start Listening — the browser requests microphone permission and immediately begins streaming PCM audio at 16kHz to the server via WebSocket.
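The conversion from Web Audio's Float32 samples to the 16-bit PCM the Live API expects is a pure function, so it can be sketched in isolation; the surrounding capture wiring (shown in comments) is the standard `getUserMedia` + `AudioContext` path, and is an assumption about ARIA's client code rather than a copy of it:

```javascript
// Convert Web Audio Float32 samples in [-1, 1] to 16-bit signed PCM,
// clamping out-of-range values to avoid integer overflow.
function floatTo16BitPCM(float32) {
  const pcm = new Int16Array(float32.length)
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]))
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff
  }
  return pcm
}

// Browser-side wiring (sketch, not runnable in Node):
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
//   const ctx = new AudioContext({ sampleRate: 16000 })
//   an AudioWorklet (or ScriptProcessor) pulls Float32 frames, then:
//   ws.send(floatTo16BitPCM(frame).buffer)
```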
You say into your microphone:
"payment-gateway-v2 is throwing 503 errors, postgres connection pool is exhausted, 73,000 users affected, 7.3 percent transaction failure rate"
Seconds later:
- The transcript card appears in the Live Transcript panel — raw quote + ARIA's spoken response
- The DORA clock switches to orange and starts counting down from 4:00:00
- The DORA Article 11 Report panel begins building — Timeline... Blast Radius... Root Cause... Regulatory Obligations (citing exact DORA Article 11.1(a) clause + notification deadline)... Remediation... Executive Summary
- The persona badge in the top-right switches based on vocabulary — Engineer → Compliance → Executive
When a compliance officer asks "which DORA clause does this trigger?", ARIA responds with exact clause citations. When the CEO asks "what do I tell the board?", ARIA gives a business-impact summary with no jargon.
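Vocabulary-based persona switching can be as simple as keyword scoring. A hypothetical sketch — the keyword lists and default are my assumptions, not ARIA's actual classifier:

```javascript
// Score the utterance against per-persona keyword lists and pick the
// best match. Keyword lists are illustrative assumptions.
const PERSONA_KEYWORDS = {
  engineer: ['kubectl', 'pod', '503', 'latency', 'rollback', 'connection pool'],
  compliance: ['dora', 'article', 'clause', 'notification', 'regulator', 'sox'],
  executive: ['board', 'revenue', 'customers', 'reputation', 'headline'],
}

function detectPersona(utterance) {
  const text = utterance.toLowerCase()
  let best = 'engineer' // default: most incident-call speech is technical
  let bestScore = 0
  for (const [persona, words] of Object.entries(PERSONA_KEYWORDS)) {
    const score = words.filter((w) => text.includes(w)).length
    if (score > bestScore) { best = persona; bestScore = score }
  }
  return best
}
```

A production classifier would likely lean on the model itself rather than keywords, but this shows the shape of the decision.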
The Numbers
| Metric | Before ARIA | With ARIA |
|---|---|---|
| Compliance report completion | 4+ hours post-mortem | Under 8 minutes live |
| DORA notification prep | Manual, from memory | Automated, from real-time evidence |
| Multi-stakeholder communication | One message for all | Persona-adapted per audience |
| Audit trail | Channel logs + memory | Timestamped, structured, Firestore-persisted |
Try It
Live demo: https://regguardian-908307939543.us-central1.run.app
Source: https://github.com/manojmallick/regguardian
The repo includes full Terraform IaC, a multi-stage Dockerfile, Zod schemas, and the complete ADK agent pipeline. The README has step-by-step deployment instructions.
What I Learned
The biggest lesson: the Gemini Live API is genuinely different from a text model with audio input. It's a voice-to-voice model designed for conversational agents. Trying to use it like a text model (adding TEXT modality, systemInstruction, structured output) breaks the session within milliseconds with a 1007.
The hybrid architecture — using the Live model purely for transcription and a separate generateContent call for reasoning — is the correct pattern for building agentic pipelines on top of the Live API. The Live session stays permanently open for as long as the incident lasts. The reasoning model gets invoked once per speech turn.
Built for the Gemini Live Agent Challenge 2026.
#GeminiLiveAgentChallenge #GoogleCloud #GeminiAPI #DORA #IncidentResponse #AI