DEV Community

Vadim Spiro


Building a Safer AI Co-Pilot: 3 Architecture Patterns from our ICU Hackathon Project

When building an AI co-pilot for ICU nurses—where mistakes can literally cost lives—you can't just throw a LangChain wrapper around a PDF and hope for the best.

Our project, Panacea, acts as a voice-enabled real-time assistant for operating complex medical machines at 3 AM. It’s built on Google’s Agent Development Kit (ADK) and the Gemini Live API. Because we were building for healthcare, we had to reject a lot of standard "hackathon" patterns in favor of a strictly deterministic architecture.

If you are building an AI agent that needs to be fast and safe, here are three architectural patterns we used to build Panacea.

1. The Clean Two-Agent Topology

The current trend is to build massive, 20-agent orchestrators where agents debate each other to reach a conclusion. That’s cool for coding assistants, but in a hospital, extra latency and non-deterministic routing are dangerous.

We kept our topology dead simple: One agent to talk, one agent to read.

  • The Voice Agent (gemini-live-2.5-flash-native-audio): This ADK session handles the websocket, the bidirectional audio stream, and the frontend UI tools. It has a 128K context window.
  • The Knowledge Agent (gemini-3.1-pro-preview): This is a strictly prompted backend model. It holds the entire 400-page manufacturer manual in its 1M context window and returns structured JSON citations.

Why this matters: The Voice Agent never tries to act as a doctor. When the nurse asks a question, the Voice Agent pauses, uses a query_manual tool to hit the Knowledge Agent, and then reads the exact, cited answer out loud. Clean separation of concerns prevents the conversational agent from hallucinating medical facts.
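To make the hand-off concrete, here is a minimal sketch of what a query_manual tool could look like. It is plain Python rather than ADK code, the Knowledge Agent call is stubbed with a canned reply so it runs offline, and the citation fields (answer, section, page) are an assumed schema, not the one Panacea actually uses.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CitedAnswer:
    # Hypothetical citation schema, chosen for illustration only.
    answer: str
    section: str  # manual section the answer was taken from
    page: int


def knowledge_agent(question: str) -> str:
    """Stand-in for the Knowledge Agent call; returns a JSON string.

    A real implementation would send the manual plus the question to the
    backend model. Here the reply is canned so the sketch runs offline.
    """
    reply = CitedAnswer(
        answer="Placeholder answer text from the manual.",
        section="Placeholder section",
        page=0,
    )
    return json.dumps(asdict(reply))


def query_manual(question: str) -> dict:
    """Tool exposed to the Voice Agent: delegate, validate, return."""
    data = json.loads(knowledge_agent(question))
    missing = {"answer", "section", "page"} - set(data)
    if missing:
        # Refuse to pass along an answer that cannot be cited.
        return {"status": "error", "reason": f"missing fields: {sorted(missing)}"}
    return {"status": "ok", **data}


def spoken_reply(tool_result: dict) -> str:
    """What the Voice Agent reads aloud: the cited answer, nothing added."""
    if tool_result["status"] != "ok":
        return "I could not find a cited answer in the manual."
    return (
        f'{tool_result["answer"]} '
        f'(Manual: {tool_result["section"]}, page {tool_result["page"]})'
    )
```

The point of the shape is that the Voice Agent only formats and reads what comes back; if the citation is missing, it says so instead of improvising.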

2. Rejecting RAG for "Full-Context Grounding"

Almost every project that chats with PDFs uses Retrieval-Augmented Generation (RAG): chunk the PDF, embed it, do a vector search, and pass the top 5 chunks into the prompt.

We didn't use RAG. Why? Because RAG relies on search quality. If the vector search misses a critical warning buried in a different chapter, the LLM gives an incomplete (and dangerous) answer.
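A toy illustration of that failure mode, under loud assumptions: the two "manual" chunks are invented for this example, and word overlap stands in for a real embedding search. It only shows the shape of the problem, not how any particular vector store behaves.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Rank chunks by word overlap with the query and keep the best k."""
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]


# Invented text, not taken from any real manual.
chunks = [
    "To silence the pressure alarm press the mute key.",
    "Warning: never mute alarms during active therapy "
    "without checking the patient line.",
]

query = "How do I silence the pressure alarm?"
retrieved = top_k(query, chunks, k=1)

# The how-to chunk wins on overlap; the warning shares almost no words
# with the question, so it never reaches the prompt.
warning_retrieved = any(c.startswith("Warning") for c in retrieved)

# Passing everything avoids the ranking step entirely.
full_context = "\n".join(chunks)
```

With k=1 the warning is left out even though it is exactly what the nurse needs to hear alongside the instruction.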

Instead, we extract the entire 400-page manual into structured Markdown and load all of it (up to 1M tokens) into the Knowledge Agent's context window for every single query.

How is that affordable or fast?
Google’s implicit caching in Vertex AI. Because the manual text is placed as an identical prefix at the start of every prompt, Gemini can cache that context automatically. For subsequent queries in the same session that hit the cache, input token costs for the cached prefix drop by ~90%, and the Time-To-First-Token (TTFT) drops to near-instant. The LLM gets the entire manual every time, so a safety warning can never be dropped by a retrieval step.
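A rough sketch of the prefix layout and the cost arithmetic, assuming the ~90% discount quoted above applies to every cached prefix token. The separator, the placeholder manual, and the 400,000-token prefix size are illustrative assumptions; no API call is made here.

```python
# Placeholder standing in for the full manual extracted to Markdown.
MANUAL_MARKDOWN = "# Placeholder manual\n\n(full manual text goes here)\n"


def build_prompt(manual: str, question: str) -> str:
    """Manual first and byte-for-byte identical on every call; question last.

    Anything that varies per request (question, timestamps, session ids)
    must come after the manual, or the shared prefix is lost.
    """
    return f"{manual}\n---\nQuestion: {question}\n"


def relative_input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    cached_discount: float = 0.90,  # assumed from the ~90% figure above
) -> float:
    """Input cost of a cache-hit request relative to an uncached one."""
    uncached = prefix_tokens + suffix_tokens
    cached = prefix_tokens * (1 - cached_discount) + suffix_tokens
    return cached / uncached


p1 = build_prompt(MANUAL_MARKDOWN, "How do I prime the line?")
p2 = build_prompt(MANUAL_MARKDOWN, "What does alarm 12 mean?")

# Illustrative sizes only: a 400,000-token manual and a 50-token question.
ratio = relative_input_cost(prefix_tokens=400_000, suffix_tokens=50)
```

Because the question is tiny next to the manual, the ratio lands just above 0.10: a cache-hit query pays roughly a tenth of the uncached input cost.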

3. Tool Safety Gates (Interrupting the Let-It-Run Loop)

When your Voice Agent has tools that can control the UI or escalate alerts to the Head Nurse, you can't just let the LLM execute them blindly.

We built Tool Safety Gates into our ADK implementation. Before a critical tool is executed, the agent is forced to ask for explicit confirmation.

flowchart LR
    VA[Voice Agent] -- "wants to escalate" --> CONFIRM[request_escalation_confirmation]
    CONFIRM -- "nurse says yes" --> ESC[escalate_to_admin]

To implement this natively in the ADK, we put each critical tool behind its own confirmation tool. Suppose the nurse says, "I need help with this."

  1. The model decides it needs to escalate.
  2. It is restricted from calling escalate_to_admin directly. Instead, it must call request_escalation_confirmation().
  3. The Voice Agent speaks: "Shall I escalate this to the head nurse?"
  4. Only if the nurse confirms with "Yes" does the system unlock the actual escalation tool.

By hardcoding critical actions behind confirmation tools, we ensured the AI couldn't accidentally trigger a hospital-wide alert just because it misunderstood an ambient conversation in the ICU.
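The gate itself boils down to a small state machine. The sketch below is plain Python, not the ADK integration: the tool names come from the flow above, while the EscalationGate class, record_nurse_reply, and the simple "yes" matching are hypothetical stand-ins for however the confirmation is actually captured.

```python
class EscalationGate:
    """Keeps escalate_to_admin locked until the nurse explicitly confirms."""

    def __init__(self) -> None:
        self._pending = False   # a confirmation question has been asked
        self._unlocked = False  # the nurse answered yes to that question

    def request_escalation_confirmation(self) -> str:
        """Step 2: the only escalation-related tool the model may call first."""
        self._pending = True
        self._unlocked = False
        return "Shall I escalate this to the head nurse?"

    def record_nurse_reply(self, reply: str) -> None:
        """Step 4: unlock only on an explicit yes to a pending question."""
        normalized = reply.strip().lower().rstrip(".!")
        if self._pending and normalized == "yes":
            self._unlocked = True
        self._pending = False

    def escalate_to_admin(self) -> str:
        """The critical tool: refuses to run unless the gate is unlocked."""
        if not self._unlocked:
            raise PermissionError("escalation requires explicit confirmation")
        self._unlocked = False  # one confirmation allows one escalation
        return "escalated"
```

Two details matter here: a "yes" overheard when no question is pending does nothing, and each confirmation is consumed by a single escalation, so a stale approval cannot be reused later in the session.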

The Takeaway

Building AI for critical environments requires a shift in mindset. You stop optimizing for "creativity" and start optimizing for determinism, speed, and safety boundaries.

By bypassing RAG for full-context caching, heavily gating tools, and keeping our agent topology simple, we built an assistant that actually respects the complexity of the ICU.


Try it out:
