By Okafor Ogbonna Pascal
Losing a dear person is never easy. But losing someone to a condition that could have been caught early — that stays with you differently.
Someone close to me passed away from undetected kidney failure. For years, they had been living with hypertension that went unnoticed. Not because they were careless, but because access to basic primary healthcare in Nigeria is not a guarantee; it is a privilege. By the time the tests were run and the diagnosis was clear, it was too late.
They didn't have to die.
If they had access to early detection — even something as simple as someone who could explain the symptoms, flag the warning signs, and say "please see a doctor urgently" — they might still be here today.
That loss never left me. And when I discovered the Gemini Live Agent Challenge, I knew exactly what I had to build.
Introducing MediSight
MediSight is a real-time AI medical visual assistant that anyone can talk to — naturally, instantly, and for free. No appointments. No waiting rooms. No medical jargon. Just a warm, knowledgeable voice that helps you understand your health.
You speak to MediSight as you would speak to a doctor friend. Show it your medication bottle, and it tells you what the drug does, the dosage, and the side effects. Show it a lab result, and it explains what the numbers mean in plain English. Describe your symptoms, and it tells you whether you need to see a doctor urgently or if you can rest at home.
For the 4.5 billion people worldwide who lack access to adequate healthcare — and for the millions in Nigeria and across Africa who self-medicate without guidance — MediSight is the medical friend they never had.
Live Demo: https://medisight-563984701112.us-central1.run.app
GitHub: https://github.com/okaforpascal400/medisight
What MediSight Can Do
- 💊 Medication Analysis — Point your camera at a pill bottle and get instant information about dosage, side effects, and warnings
- 📋 Prescription Reading — Show a doctor's handwritten prescription and get a plain English explanation
- 🩺 Symptom Assessment — Describe or show your symptoms and understand what they might mean
- 🏥 Medical Document Translation — Show lab results or discharge papers and understand what the numbers mean
- 🥗 Nutrition Label Analysis — Point at food packaging and get health advice for your specific condition
- 🚨 Emergency Detection — The AI immediately recognises emergencies and tells you to call for help
The Technology Behind MediSight
MediSight is built on Google's Gemini Live API using the gemini-2.5-flash-native-audio-latest model — one of the most capable real-time multimodal models available today. Here's how the technology comes together:
Architecture Overview
User Browser (Chrome)
↕ WSS WebSocket (Audio + Video)
Google Cloud Run (FastAPI Backend)
↕ Gemini Live API v1alpha (Bidirectional Stream)
Gemini 2.5 Flash Native Audio
The user's browser captures microphone audio at 16kHz PCM and camera frames as JPEG images. These are streamed in real time over a secure WebSocket connection to a FastAPI backend running on Google Cloud Run. The backend relays everything to the Gemini Live API, which processes audio and video simultaneously and streams back a natural voice response at 24kHz.
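The backend's job in this pipeline is essentially a bidirectional relay: one task pumps browser chunks up to Gemini while another pumps responses back down. Here is a minimal sketch of that pattern in plain asyncio — `FakeGeminiSession` is a stand-in I invented for illustration, not the real Live API session object, and the real relay runs until the WebSocket closes rather than for a fixed chunk count:

```python
import asyncio


class FakeGeminiSession:
    """Stand-in for the real Gemini Live session: just echoes audio back."""

    def __init__(self):
        self._out = asyncio.Queue()

    async def send(self, chunk: bytes):
        # A real session would stream this chunk to Gemini; we echo it.
        await self._out.put(b"response:" + chunk)

    async def receive(self) -> bytes:
        return await self._out.get()


async def relay(browser_in: asyncio.Queue, browser_out: asyncio.Queue,
                session, n_chunks: int):
    """Pump chunks browser -> Gemini and responses Gemini -> browser.

    n_chunks bounds the loops so the sketch terminates; a production
    relay would loop until the connection closes.
    """
    async def upstream():
        for _ in range(n_chunks):
            await session.send(await browser_in.get())

    async def downstream():
        for _ in range(n_chunks):
            await browser_out.put(await session.receive())

    # Run both directions concurrently, like the Cloud Run backend does.
    await asyncio.gather(upstream(), downstream())
```

The two directions never block each other, which is what lets Gemini start speaking before the user's full turn has even finished uploading.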
Key Technical Features
1. Real-Time Bidirectional Audio Streaming
The frontend uses the Web Audio API to capture microphone input and stream PCM audio chunks directly to Gemini. Gemini's response comes back as raw PCM audio chunks — 29 chunks for a typical response — which are scheduled gaplessly using the Web Audio scheduler for smooth, natural-sounding playback.
2. Professional Interruption Handling
This was the hardest feature to get right — and one of the most important for a real-time agent. When a user starts speaking while MediSight is responding, the AI must stop instantly.
The solution uses Gemini's native Voice Activity Detection (VAD). When Gemini detects the user speaking, it sends a server_content.interrupted signal. The backend immediately forwards this to the frontend, which:
- Stops every active AudioBufferSourceNode instantly
- Clears the audio queue
- Suspends the AudioContext
- Recreates the audio pipeline fresh for the next response
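One subtlety behind those steps: chunks from the interrupted response may still be in flight over the WebSocket, and they must not play after the interrupt. A common gate for this (sketched here with illustrative names, not MediSight's actual code) is a generation counter — bump it when `server_content.interrupted` arrives, and drop any chunk tagged with a stale generation:

```python
class PlaybackGate:
    """Drops audio chunks that belong to an interrupted response."""

    def __init__(self):
        self.generation = 0  # current response "turn"
        self.queue = []      # chunks waiting to be scheduled

    def on_interrupted(self):
        # server_content.interrupted arrived: invalidate the old response.
        self.generation += 1
        self.queue.clear()

    def enqueue(self, chunk: bytes, generation: int) -> bool:
        # Late-arriving chunks from a stale turn are silently dropped.
        if generation != self.generation:
            return False
        self.queue.append(chunk)
        return True
```

Without a gate like this, a buffered chunk arriving a few milliseconds after the interrupt would restart the old answer mid-sentence — exactly the "replayed audio" failure mode a real-time agent must avoid.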
3. The VAD Bug That Almost Broke Everything
One of the most memorable challenges in this build was the Voice Activity Detection sensitivity problem.
Early in development, MediSight would stop talking the moment there was any background noise — a fan, a car passing outside, even room echo. This was because the VAD threshold was set too low (0.008 RMS), making it hyper-sensitive to any sound.
The root cause ran deeper than just the threshold. We were using ScriptProcessorNode for audio capture — a deprecated Web Audio API that Chrome throttles and can silence when the main thread is busy. This made VAD unreliable and caused choppy audio on subsequent responses.
The fix was to replace ScriptProcessorNode entirely with AudioWorkletNode — Chrome's modern audio processing standard. The AudioWorklet runs on a dedicated OS audio thread, completely independent of the main JavaScript thread. This gives us:
- VAD readings 125 times per second — one per 128-sample worklet block at 16kHz — vs roughly 10 times with the old approach
- Reliable audio capture regardless of main thread load
- Consistent, smooth audio playback across all turns
After switching to AudioWorklet and tuning the VAD threshold to 0.25 for real-world environments, MediSight became stable and professional.
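For reference, the RMS figure these thresholds refer to is just the root-mean-square energy of one audio frame, with samples normalised to [-1, 1]. A minimal energy-based VAD check looks like this (a simplified sketch of the idea, not MediSight's exact code):

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: list[float], threshold: float = 0.25) -> bool:
    """Crude energy-based VAD: fires only when the frame is loud enough."""
    return rms(frame) >= threshold
```

A fan or room echo might produce a frame with RMS around 0.01 — well above the original 0.008 threshold (so the AI would cut itself off) but far below 0.25, which is why raising the threshold made the agent stable in noisy rooms.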
4. Auto-Reconnect Session Management
The Gemini Live API has session timeouts. Rather than showing an error to the user, MediSight automatically reconnects with exponential backoff — seamlessly continuing the conversation without the user noticing.
An input queue decouples browser input from the Gemini session lifecycle, meaning no messages are lost during reconnection. Silent PCM frames are sent periodically to keep the connection alive between turns.
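The reconnection policy above can be sketched as a capped exponential backoff plus a buffer that holds browser input until the new session is live. The delay values and class names here are illustrative — the article doesn't specify MediSight's exact parameters:

```python
import itertools


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 16.0):
    """Yield reconnect delays: base, base*2, base*4, ... capped at `cap`."""
    for attempt in itertools.count():
        yield min(cap, base * factor ** attempt)


class InputBuffer:
    """Decouples browser input from the Gemini session lifecycle."""

    def __init__(self):
        self.pending = []

    def push(self, msg: bytes):
        # Browser keeps sending during a reconnect; nothing is dropped.
        self.pending.append(msg)

    def drain(self) -> list[bytes]:
        # Called once the new session is up: replay everything in order.
        out, self.pending = self.pending, []
        return out
```

Capping the delay matters: without it, a few consecutive timeouts would leave the user staring at a silent app for minutes instead of seconds.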
Deployment on Google Cloud
MediSight is deployed on Google Cloud Run — a fully managed serverless platform that scales automatically based on demand.
gcloud run deploy medisight \
--source ./backend \
--region us-central1 \
--allow-unauthenticated \
--memory 2Gi \
--port 8080
Cloud Run was the perfect choice for this project because:
- Zero infrastructure management — no servers to maintain
- Automatic HTTPS — WebSocket Secure (WSS) works out of the box
- Global scalability — handles thousands of concurrent users automatically
- Cost-efficient — free tier covers hackathon usage comfortably
Lessons Learned
Building MediSight in a hackathon sprint taught me several powerful lessons:
1. Real-time audio is hard. Streaming, scheduling, and synchronising audio chunks without gaps or crackles requires a deep understanding of the Web Audio API. The difference between ScriptProcessorNode and AudioWorkletNode is the difference between a broken product and a professional one.
2. Interruption is a feature, not an afterthought. The Gemini Live API has native interruption detection built in. Using it properly — with the right frontend gates to prevent buffered audio from replaying — is what separates a demo from a real product.
3. The real world is noisy. VAD threshold tuning is not a minor detail. A threshold that works in a quiet room breaks in a normal home. Testing in real environments is essential.
4. Persistence beats perfection. There were moments when errors seemed impossible to debug — wrong API versions, WebSocket timeouts, choppy audio, and sessions dropping. Each one had a solution. Keep going.
The Bigger Picture
Google described a real-time AI medical visual assistant as a future use case for the Gemini Live API in their own documentation. MediSight makes that vision real today — accessible to anyone with a browser and a microphone, completely free.
This story is not unique. Across Nigeria, across Africa, across the developing world, millions of people are losing loved ones to conditions that could have been caught early — if only they had someone knowledgeable to talk to about their health.
MediSight cannot replace a doctor. But it can be the voice that says: "These symptoms are serious. Please see a doctor today."
That voice could save lives.
Try MediSight
🌐 Live App: https://medisight-563984701112.us-central1.run.app
💻 GitHub: https://github.com/okaforpascal400/medisight
How to use:
- Open in Chrome
- Click Start Session
- Allow microphone access
- Talk naturally — show your camera for visual analysis
- Interrupt at any time — MediSight stops and listens
I created this content for the purposes of entering the Gemini Live Agent Challenge hackathon.
#GeminiLiveAgentChallenge #GeminiAPI #GoogleCloud #AIForGood #HealthTech #MediSight
Okafor Ogbonna Pascal is a developer and builder passionate about using AI to solve real human problems. Built with Gemini Live API and Google Cloud Run.