<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kapil Shastri</title>
    <description>The latest articles on DEV Community by Kapil Shastri (@kapil_shastri_e07e03e1dcb).</description>
    <link>https://dev.to/kapil_shastri_e07e03e1dcb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797181%2F1fe55c6c-94c3-48a9-89e4-302788f7b800.jpg</url>
      <title>DEV Community: Kapil Shastri</title>
      <link>https://dev.to/kapil_shastri_e07e03e1dcb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kapil_shastri_e07e03e1dcb"/>
    <language>en</language>
    <item>
      <title>Authorized to Act | Review</title>
      <dc:creator>Kapil Shastri</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:05:43 +0000</pubDate>
      <link>https://dev.to/kapil_shastri_e07e03e1dcb/authorized-to-act-review-663</link>
      <guid>https://dev.to/kapil_shastri_e07e03e1dcb/authorized-to-act-review-663</guid>
      <description>&lt;p&gt;━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;br&gt;
📝 Personal Comments&lt;br&gt;
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;/p&gt;

&lt;p&gt;Mission Authorization: Why Your AI Agent Shouldn't Have Standing Permissions&lt;/p&gt;

&lt;p&gt;Here is an uncomfortable truth about AI agents: they are all over-permissioned, and nobody is checking.&lt;/p&gt;

&lt;p&gt;When you connect an AI agent to GitHub, you grant it read/write access to repos, issues, pull requests, and sometimes even delete permissions — because the OAuth consent screen makes it easier to click "Allow All" than to think carefully about what the agent actually needs. Then you forget about it. The agent keeps those permissions forever. This is what I call the standing permissions problem, and it is the biggest unaddressed security gap in the agentic AI boom.&lt;/p&gt;

&lt;p&gt;Standing permissions mean an agent has access granted once, broadly, and indefinitely. Mission authorization means an agent earns access for a specific task, scoped to exactly what that task requires, for exactly as long as the task runs.&lt;/p&gt;

&lt;p&gt;The difference sounds philosophical. It is not. A standing-permission agent that gets prompt-injected, jailbroken, or compromised hands an attacker everything it was ever granted. A mission-authorized agent hands an attacker only a narrow, short-lived token, one that has typically already expired and been revoked by the time anyone notices.&lt;/p&gt;

&lt;p&gt;This is the problem Auth0 Token Vault was built to solve — and building Sanctum taught me exactly why.&lt;/p&gt;

&lt;p&gt;Before Token Vault, if your AI agent needed to call the GitHub API, you had two bad options: hardcode a personal access token in your .env file (a security disaster waiting to happen), or implement a full OAuth flow yourself (weeks of work, endless edge cases). Neither option gave you runtime control. Neither let you say "this agent can read issues but cannot delete repos" and have that enforced at the identity layer.&lt;/p&gt;

&lt;p&gt;Token Vault changes the equation entirely. Instead of your application holding credentials, Auth0 holds them. Instead of your agent having a permanent token, it requests a scoped, short-lived one at runtime. The custody of credentials moves out of application code — where it can be leaked, logged, or stolen — and into the identity layer, where it belongs.&lt;/p&gt;

&lt;p&gt;Building Sanctum on top of Token Vault revealed something important: the real power is not just secure storage. It is the ability to change what an agent is allowed to do after the fact, without redeployment, without touching code. When Sanctum's AI engine determines that an agent only needed 3 of the 12 scopes it was granted, Token Vault lets us re-provision a tighter token immediately. The old broad token does not get rotated on a schedule — it gets replaced right now, with exactly the minimum the agent needs.&lt;/p&gt;
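To make that narrowing step concrete, here is a minimal sketch of mapping observed API calls back to the scopes a mission actually exercised. The call-to-scope table and the function are illustrative assumptions of mine, not Sanctum's engine or Auth0's API:

```python
# Hypothetical mapping from observed API calls to the scopes they require.
# Real scope names depend on the provider (GitHub, Google, etc.).
CALL_SCOPES = {
    "GET /issues": {"issues:read"},
    "POST /issues": {"issues:write"},
    "GET /pulls": {"pulls:read"},
    "DELETE /repos": {"repo:delete"},
}

def minimal_scopes(observed_calls):
    """Compute the smallest scope set covering what the agent actually did."""
    needed = set()
    for call in observed_calls:
        needed |= CALL_SCOPES.get(call, set())
    return needed

# An agent granted a dozen broad scopes that only ever read issues
# and pull requests would be re-provisioned with just these two:
print(sorted(minimal_scopes(["GET /issues", "GET /pulls"])))
# ['issues:read', 'pulls:read']
```

The re-provisioned token then carries only that set, and the broad one is revoked immediately rather than on a rotation schedule.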

&lt;p&gt;That is mission authorization made real. Not a policy document. Not a comment in a README. An actual enforcement mechanism that lives in the identity layer and cannot be bypassed by application logic.&lt;/p&gt;

&lt;p&gt;The teams that solve the AI agent security problem will not be the ones who write the most careful prompts. They will be the ones who treat token custody as an identity problem, not an application problem. Auth0 Token Vault is the infrastructure that makes that possible. Sanctum is proof that it works.&lt;br&gt;
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;/p&gt;

</description>
      <category>devblog</category>
    </item>
    <item>
      <title>VISION-AI HACKATHON!</title>
      <dc:creator>Kapil Shastri</dc:creator>
      <pubDate>Fri, 27 Feb 2026 22:22:54 +0000</pubDate>
      <link>https://dev.to/kapil_shastri_e07e03e1dcb/vision-ai-hackathon-23ed</link>
      <guid>https://dev.to/kapil_shastri_e07e03e1dcb/vision-ai-hackathon-23ed</guid>
      <description>&lt;p&gt;I Built a Real-Time AI Surveillance System with Voice Alerts in One Night — Here's How&lt;br&gt;
Using Stream Vision Agents SDK, Moondream, and Gemini Realtime to build something I genuinely didn't think was possible in a hackathon.&lt;/p&gt;

&lt;p&gt;The Idea&lt;br&gt;
It started with a simple question: what if a security camera could actually talk to you?&lt;br&gt;
Not just send a notification. Not just log an event. But literally speak to you the moment it sees something dangerous — "ALERT! Unattended bag detected!" — in real time, like having an AI guard watching your feed 24/7.&lt;br&gt;
That became SentinelAI: a real-time AI surveillance system that watches a live camera feed, detects threats using plain English descriptions (no training required), and announces them via voice automatically.&lt;br&gt;
I built it in one night for the Stream Vision Agents Hackathon, and honestly... it works. Here's the full story.&lt;/p&gt;

&lt;p&gt;What SentinelAI Does&lt;br&gt;
Before diving into the how, here's what the finished system does:&lt;/p&gt;

&lt;p&gt;📷 Watches a live camera feed via Stream WebRTC at 4 FPS&lt;br&gt;
🎯 Detects threats in zero-shot — you describe what to look for in plain English: "knife or blade weapon", "person fallen on ground", "fire or flames", "gun or pistol", "unattended bag or backpack"&lt;br&gt;
🔊 Speaks voice alerts the instant a threat is detected — Gemini Realtime literally says "ALERT! Unattended bag detected!" out loud into the call&lt;br&gt;
🗣️ Two-way voice conversation — you can talk to SentinelAI and ask it questions&lt;br&gt;
🖥️ Live dashboard with dual video tiles, bounding box overlays, confidence scores, and a real-time alert log&lt;/p&gt;

&lt;p&gt;The Tech Stack&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;Technology&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;Vision Agent Framework&lt;/td&gt;&lt;td&gt;Stream Vision Agents SDK&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Zero-Shot Detection&lt;/td&gt;&lt;td&gt;Moondream Cloud API&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Voice + LLM&lt;/td&gt;&lt;td&gt;Gemini 2.0 Realtime&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Video Infrastructure&lt;/td&gt;&lt;td&gt;Stream WebRTC&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Backend&lt;/td&gt;&lt;td&gt;FastAPI + Python&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Frontend&lt;/td&gt;&lt;td&gt;React + Vite&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The most interesting choice here is Moondream for zero-shot detection. Traditional CV models like YOLO need thousands of labelled images to detect custom objects. Moondream lets you describe what you want to detect in plain English and it just... works. No training. No dataset. That's the magic.&lt;/p&gt;

&lt;p&gt;Architecture&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (React)
    │
    ├── Camera feed via Stream WebRTC
    │
    ▼
FastAPI Backend (Port 8000)
    │ Creates Stream sessions
    │ Manages agent lifecycle
    ▼
Vision Agent (Port 8001)
    │
    ├── Moondream CloudDetectionProcessor (4 FPS)
    │     └── Scans every frame for threats
    │     └── Queues detections when confidence &amp;gt; 0.65
    │
    └── Gemini 2.0 Realtime
          └── Receives alert queue
          └── Speaks alerts aloud via WebRTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three services, one surveillance system.&lt;/p&gt;
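The handoff in the middle of that diagram, from detector to voice, is just an asyncio.Queue shared between two tasks. Here is a stripped-down sketch of the pattern, with the Moondream and Gemini sides stubbed out as plain coroutines:

```python
import asyncio

async def detector(queue: asyncio.Queue) -> None:
    # Stand-in for the Moondream processor: it queues a label
    # whenever a frame scores above the confidence threshold.
    await queue.put("unattended bag or backpack")

async def announcer(queue: asyncio.Queue, spoken: list) -> None:
    # Stand-in for the Gemini Realtime side: it drains the queue
    # and "speaks" each alert.
    label = await queue.get()
    spoken.append(f"ALERT! {label} detected!")

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    spoken: list = []
    # Run both sides concurrently; the announcer blocks on the
    # queue until the detector produces something.
    await asyncio.gather(detector(queue), announcer(queue, spoken))
    return spoken

print(asyncio.run(main()))
# ['ALERT! unattended bag or backpack detected!']
```

The real system is exactly this shape, just with a camera frame loop on the producer side and a WebRTC voice session on the consumer side.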

&lt;p&gt;Building It: The Vision Agents Core&lt;br&gt;
The heart of SentinelAI is a custom ThreatAlertProcessor, a subclass of Vision Agents' moondream.CloudDetectionProcessor. This is where the zero-shot magic happens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
from typing import Dict
import moondream

THREATS = [
    "kitchen knife or blade weapon",
    "person fallen on ground",
    "fire or flames",
    "gun or pistol",
    "unattended bag or backpack"
]

_detection_queue: asyncio.Queue = asyncio.Queue()

class ThreatAlertProcessor(moondream.CloudDetectionProcessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_alerted: Dict[str, float] = {}
        self._alert_cooldown = 15.0  # seconds between same-threat alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def _process_and_add_frame(self, frame):
    await super()._process_and_add_frame(frame)
    detections = self._last_results.get("detections", [])
    now = asyncio.get_event_loop().time()

    for det in detections:
        label = det.get("label", "")
        conf = det.get("confidence", 0)

        if conf &amp;gt;= 0.65:  # High confidence threshold
            last = self._last_alerted.get(label, 0)
            if now - last &amp;gt; self._alert_cooldown:
                self._last_alerted[label] = now
                await _detection_queue.put(label)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The _alert_cooldown is important — without it, once a threat is detected, Gemini would shout about it every 250ms. The 15-second cooldown makes it feel like a real alert system.&lt;/p&gt;
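The same cooldown idea, pulled out as a self-contained helper with the clock injected so it can be exercised without waiting 15 real seconds (a sketch of the pattern, not the exact SentinelAI code):

```python
class AlertCooldown:
    """Suppress repeat alerts for the same label within a cooldown window."""

    def __init__(self, cooldown: float = 15.0):
        self.cooldown = cooldown
        self._last_alerted: dict[str, float] = {}

    def should_alert(self, label: str, now: float) -> bool:
        # Alert only if this label hasn't fired within the window.
        last = self._last_alerted.get(label, float("-inf"))
        if now - last > self.cooldown:
            self._last_alerted[label] = now
            return True
        return False

cd = AlertCooldown(cooldown=15.0)
print(cd.should_alert("fire or flames", now=0.0))   # True  (first sighting)
print(cd.should_alert("fire or flames", now=5.0))   # False (within cooldown)
print(cd.should_alert("fire or flames", now=20.0))  # True  (cooldown elapsed)
```

Injecting `now` instead of reading the event loop clock directly is what makes the debouncing behavior unit-testable.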

&lt;p&gt;The Voice Alert Magic&lt;br&gt;
This was the trickiest part. Making Gemini actually speak the alerts required using agent.llm.session.send() — sending a direct instruction to the Gemini Realtime session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def join_vision_call(agent: Agent, call_type: str, call_id: str, **kwargs) -&amp;gt; None:
    try:
        call = agent.edge.client.video.call(call_type, call_id)
        async with agent.join(call):
            logger.info(f"[SentinelAI] 👁 Moondream active. Watching: {THREATS}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        while True:
            try:
                # Wait for a threat detection (timeout after 30s for keep-alive)
                label = await asyncio.wait_for(
                    _detection_queue.get(), 
                    timeout=30.0
                )

                # Trigger Gemini to speak the alert
                await agent.llm.session.send(
                    input=f"Say out loud urgently: ALERT! {label} detected!",
                    end_of_turn=True
                )
                logger.info(f"[SentinelAI] 🔊 Alert sent: {label}")

            except asyncio.TimeoutError:
                # Keep-alive ping every 30s to prevent idle timeout
                await agent.llm.session.send(
                    input="Continue monitoring silently. Do not speak.",
                    end_of_turn=True
                )

except Exception as e:
    logger.warning(f"[SentinelAI] Session dropped ({e}), will retry on next call.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The 30-second keep-alive was critical — without it, Gemini's Realtime session would silently die during quiet periods with no detections.&lt;/p&gt;

&lt;p&gt;Creating the Agent&lt;br&gt;
Here's how we wire everything together using Vision Agents SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from vision_agents import Agent, AgentOptions
from vision_agents.plugins import gemini, moondream
from vision_agents.plugins.getstream import StreamVideoPlugin

def create_agent():
    return Agent(
        agent_id="sentinel-agent",
        options=AgentOptions(
            llm=gemini.Realtime(
                model="gemini-2.0-flash-exp",
                fps=2,
                instructions="""You are SentinelAI, an elite AI security guardian.
                You monitor live camera feeds for threats. When you first join,
                introduce yourself. When threats are detected, respond with urgency.
                Always be professional and calm."""
            ),
            plugins=[
                StreamVideoPlugin(
                    join_call=join_vision_call,
                    processors=[
                        ThreatAlertProcessor(
                            detect_objects=THREATS,
                            conf_threshold=0.65,
                            fps=4,
                        )
                    ]
                )
            ]
        )
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Vision Agents SDK handles all the WebRTC complexity, audio/video track management, and plugin lifecycle. We just define what we want.&lt;/p&gt;

&lt;p&gt;The Bugs That Almost Broke Me&lt;br&gt;
No honest hackathon post is complete without the war stories.&lt;/p&gt;

&lt;p&gt;Bug 1: Windows + asyncio + aiodns = disaster&lt;br&gt;
Running on Windows, the Vision Agents serve mode crashed immediately with:&lt;br&gt;
NotImplementedError: aiodns needs a SelectorEventLoop on Windows&lt;br&gt;
The fix: set the event loop policy before anything else runs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import sys

if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One line. One hour of debugging to find it.&lt;/p&gt;

&lt;p&gt;Bug 2: Stream API signature verification&lt;br&gt;
The FastAPI backend was returning 401 errors on every session creation. It turns out Stream's SDK requires server-side token generation with proper HMAC signatures; the frontend can't just pass an API key directly.&lt;/p&gt;

&lt;p&gt;Bug 3: Gemini session drops&lt;br&gt;
"'GeminiRealtime' object has no attribute 'session'" crashed everything when we tried to send the intro message too early. Gemini's Realtime session isn't available until after the WebRTC connection fully establishes.&lt;br&gt;
The fix: move the intro to the agent's instructions instead of sending it manually. Gemini introduces itself naturally when it's ready.&lt;/p&gt;

&lt;p&gt;Bug 4: My phone kept triggering "knife" alerts&lt;br&gt;
Zero-shot detection is powerful but imprecise. A phone's rectangular shape was matching "knife" at low confidence. Two fixes: raise the confidence threshold from 0.35 to 0.65, and use more descriptive threat names ("kitchen knife or blade weapon" instead of just "knife").&lt;/p&gt;
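That Bug 4 fix boils down to a post-filter over the detection dicts. A minimal sketch (the detection shape follows the processor code earlier in the post; the function name is my own, not part of the SDK):

```python
def filter_detections(detections, threshold=0.65):
    """Keep only detections at or above the confidence threshold.

    Raising the threshold from 0.35 to 0.65 is what stopped the
    phone-shaped "knife" false positives.
    """
    return [d for d in detections if d.get("confidence", 0) >= threshold]

frame_detections = [
    {"label": "kitchen knife or blade weapon", "confidence": 0.41},  # the phone
    {"label": "unattended bag or backpack", "confidence": 0.88},
]
print(filter_detections(frame_detections))
# [{'label': 'unattended bag or backpack', 'confidence': 0.88}]
```

The more descriptive labels attack the same problem from the other side: a richer text prompt gives the zero-shot model less room to match on shape alone.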

&lt;p&gt;The Moment It Worked&lt;br&gt;
At around 2 AM, after fixing the session keep-alive bug, I held an empty bag up to my camera.&lt;br&gt;
Moondream detected it. Queued the alert. Gemini received it.&lt;br&gt;
And then — through my speakers — I heard:&lt;br&gt;
"ALERT! Unattended bag detected!"&lt;br&gt;
I genuinely jumped. It works. It actually works.&lt;/p&gt;

&lt;p&gt;The React Dashboard&lt;br&gt;
The frontend is a custom dark surveillance UI with:&lt;/p&gt;

&lt;p&gt;Dual video tiles — your camera feed + the SentinelAI agent feed side by side&lt;br&gt;
Live bounding boxes — green overlays with confidence scores on detected threats&lt;br&gt;
Dynamic watchlist editor — add/remove threat categories in real-time&lt;br&gt;
Alert log — timestamped color-coded log of every detection&lt;/p&gt;

&lt;p&gt;Built with React + Vite, connecting to Stream's JavaScript SDK for WebRTC video.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Zero-shot detection is a game changer. Not needing to train a model, collect data, or label images means you can prototype surveillance for any use case in minutes. Just describe what you want to detect.&lt;/li&gt;
&lt;li&gt;Vision Agents SDK abstracts the hard stuff beautifully. WebRTC negotiation, ICE candidates, audio/video track management, plugin lifecycle — all handled. I focused on the AI logic, not the infrastructure.&lt;/li&gt;
&lt;li&gt;Gemini Realtime + WebRTC is genuinely impressive. Having a two-way voice conversation with an AI that's watching your camera feed feels like science fiction. It's not — it's a few hundred lines of Python.&lt;/li&gt;
&lt;li&gt;Keep-alives matter. Any long-running AI session needs a heartbeat. Without the 30-second ping, Gemini silently times out during quiet monitoring periods.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's Next&lt;/p&gt;

&lt;p&gt;YOLO local inference on my GTX 1650 for 30+ FPS detection (vs 4 FPS cloud)&lt;br&gt;
Incident memory — SentinelAI remembers and summarizes the last hour of detections&lt;br&gt;
Mobile support via Stream's React Native SDK&lt;br&gt;
Multi-camera — monitor multiple feeds simultaneously&lt;/p&gt;

&lt;p&gt;Try It Yourself&lt;br&gt;
🐙 GitHub: github.com/kapilshastriwork-maker/sentinelai&lt;br&gt;
🎬 Demo Video: Watch SentinelAI detect threats and speak voice alerts live on YouTube - &lt;a href="https://youtu.be/RX4EW9DvaFQ" rel="noopener noreferrer"&gt;https://youtu.be/RX4EW9DvaFQ&lt;/a&gt;&lt;br&gt;
To run it locally:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terminal 1 — FastAPI backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 — Vision Agent
uv run main.py serve --port 8001

# Terminal 3 — React frontend
cd frontend &amp;amp;&amp;amp; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add your .env with STREAM_API_KEY, STREAM_API_SECRET, MOONDREAM_API_KEY, and GEMINI_API_KEY, and you're watching.&lt;/p&gt;
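A small preflight check turns a confusing mid-startup crash into a clear error when one of those keys is missing. This helper is a hypothetical addition of mine, not part of the SentinelAI repo:

```python
import os

# The four keys the .env setup above requires.
REQUIRED_KEYS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "MOONDREAM_API_KEY",
    "GEMINI_API_KEY",
]

def missing_env(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: an environment with only the Stream keys populated
partial = {"STREAM_API_KEY": "sk_test", "STREAM_API_SECRET": "secret"}
print(missing_env(partial))
# ['MOONDREAM_API_KEY', 'GEMINI_API_KEY']
```

Calling this at the top of the FastAPI backend and the agent process means each of the three terminals fails fast with a readable message instead of a 401 or a silent session drop.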

&lt;p&gt;Resources&lt;/p&gt;

&lt;p&gt;Stream Vision Agents SDK&lt;br&gt;
Moondream Cloud API&lt;br&gt;
Gemini 2.0 Realtime&lt;br&gt;
Stream WebRTC&lt;/p&gt;

&lt;p&gt;Built for the Stream Vision Agents Hackathon 2026. If you're building something cool with Vision Agents, I'd love to see it — drop a comment below!&lt;/p&gt;

&lt;p&gt;Tags: python ai machinelearning webdev hackathon&lt;/p&gt;

</description>
      <category>visionai</category>
    </item>
  </channel>
</rss>
