<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kapil Shastri</title>
    <description>The latest articles on DEV Community by Kapil Shastri (@kapil_shastri_e07e03e1dcb).</description>
    <link>https://dev.to/kapil_shastri_e07e03e1dcb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797181%2F1fe55c6c-94c3-48a9-89e4-302788f7b800.jpg</url>
      <title>DEV Community: Kapil Shastri</title>
      <link>https://dev.to/kapil_shastri_e07e03e1dcb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kapil_shastri_e07e03e1dcb"/>
    <language>en</language>
    <item>
      <title>Authorized to Act | Review</title>
      <dc:creator>Kapil Shastri</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:05:43 +0000</pubDate>
      <link>https://dev.to/kapil_shastri_e07e03e1dcb/authorized-to-act-review-663</link>
      <guid>https://dev.to/kapil_shastri_e07e03e1dcb/authorized-to-act-review-663</guid>
      <description>&lt;p&gt;━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;br&gt;
📝 Personal Comments&lt;br&gt;
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;/p&gt;

&lt;p&gt;Mission Authorization: Why Your AI Agent Shouldn't Have Standing Permissions&lt;/p&gt;

&lt;p&gt;Here is an uncomfortable truth about AI agents: they are all over-permissioned, and nobody is checking.&lt;/p&gt;

&lt;p&gt;When you connect an AI agent to GitHub, you grant it read/write access to repos, issues, pull requests, and sometimes even delete permissions — because the OAuth consent screen makes it easier to click "Allow All" than to think carefully about what the agent actually needs. Then you forget about it. The agent keeps those permissions forever. This is what I call the standing permissions problem, and it is the biggest unaddressed security gap in the agentic AI boom.&lt;/p&gt;

&lt;p&gt;Standing permissions mean an agent has access granted once, broadly, and indefinitely. Mission authorization means an agent earns access for a specific task, scoped to exactly what that task requires, for exactly as long as the task runs.&lt;/p&gt;

&lt;p&gt;The difference sounds philosophical. It is not. A standing-permission agent that gets prompt-injected, jailbroken, or compromised hands an attacker everything it was ever granted. A mission-authorized agent hands an attacker only a narrow, short-lived token, one that has typically already expired and been revoked by the time anyone notices.&lt;/p&gt;

&lt;p&gt;This is the problem Auth0 Token Vault was built to solve — and building Sanctum taught me exactly why.&lt;/p&gt;

&lt;p&gt;Before Token Vault, if your AI agent needed to call the GitHub API, you had two bad options: hardcode a personal access token in your .env file (a security disaster waiting to happen), or implement a full OAuth flow yourself (weeks of work, endless edge cases). Neither option gave you runtime control. Neither let you say "this agent can read issues but cannot delete repos" and have that enforced at the identity layer.&lt;/p&gt;

&lt;p&gt;Token Vault changes the equation entirely. Instead of your application holding credentials, Auth0 holds them. Instead of your agent having a permanent token, it requests a scoped, short-lived one at runtime. The custody of credentials moves out of application code — where it can be leaked, logged, or stolen — and into the identity layer, where it belongs.&lt;/p&gt;

&lt;p&gt;Building Sanctum on top of Token Vault revealed something important: the real power is not just secure storage. It is the ability to change what an agent is allowed to do after the fact, without redeployment, without touching code. When Sanctum's AI engine determines that an agent only needed 3 of the 12 scopes it was granted, Token Vault lets us re-provision a tighter token immediately. The old broad token does not get rotated on a schedule — it gets replaced right now, with exactly the minimum the agent needs.&lt;/p&gt;
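To make that narrowing step concrete, here is a minimal sketch of mapping observed API calls back to the scopes a mission actually exercised. The call-to-scope table and the function are illustrative assumptions of mine, not Sanctum's engine or Auth0's API:

```python
# Hypothetical mapping from observed API calls to the scopes they require.
# Real scope names depend on the provider (GitHub, Google, etc.).
CALL_SCOPES = {
    "GET /issues": {"issues:read"},
    "POST /issues": {"issues:write"},
    "GET /pulls": {"pulls:read"},
    "DELETE /repos": {"repo:delete"},
}

def minimal_scopes(observed_calls):
    """Compute the smallest scope set covering what the agent actually did."""
    needed = set()
    for call in observed_calls:
        needed |= CALL_SCOPES.get(call, set())
    return needed

# An agent granted a dozen broad scopes that only ever read issues
# and pull requests would be re-provisioned with just these two:
print(sorted(minimal_scopes(["GET /issues", "GET /pulls"])))
# ['issues:read', 'pulls:read']
```

The re-provisioned token then carries only that set, and the broad one is revoked immediately rather than on a rotation schedule.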

&lt;p&gt;That is mission authorization made real. Not a policy document. Not a comment in a README. An actual enforcement mechanism that lives in the identity layer and cannot be bypassed by application logic.&lt;/p&gt;

&lt;p&gt;The teams that solve the AI agent security problem will not be the ones who write the most careful prompts. They will be the ones who treat token custody as an identity problem, not an application problem. Auth0 Token Vault is the infrastructure that makes that possible. Sanctum is proof that it works.&lt;br&gt;
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;/p&gt;

</description>
      <category>devblog</category>
    </item>
    <item>
      <title>VISION-AI HACKATHON!</title>
      <dc:creator>Kapil Shastri</dc:creator>
      <pubDate>Fri, 27 Feb 2026 22:22:54 +0000</pubDate>
      <link>https://dev.to/kapil_shastri_e07e03e1dcb/vision-ai-hackathon-23ed</link>
      <guid>https://dev.to/kapil_shastri_e07e03e1dcb/vision-ai-hackathon-23ed</guid>
      <description>&lt;p&gt;I Built a Real-Time AI Surveillance System with Voice Alerts in One Night — Here's How&lt;br&gt;
Using Stream Vision Agents SDK, Moondream, and Gemini Realtime to build something I genuinely didn't think was possible in a hackathon.&lt;/p&gt;

&lt;p&gt;The Idea&lt;br&gt;
It started with a simple question: what if a security camera could actually talk to you?&lt;br&gt;
Not just send a notification. Not just log an event. But literally speak to you the moment it sees something dangerous — "ALERT! Unattended bag detected!" — in real time, like having an AI guard watching your feed 24/7.&lt;br&gt;
That became SentinelAI: a real-time AI surveillance system that watches a live camera feed, detects threats using plain English descriptions (no training required), and announces them via voice automatically.&lt;br&gt;
I built it in one night for the Stream Vision Agents Hackathon, and honestly... it works. Here's the full story.&lt;/p&gt;

&lt;p&gt;What SentinelAI Does&lt;br&gt;
Before diving into the how, here's what the finished system does:&lt;/p&gt;

&lt;p&gt;📷 Watches a live camera feed via Stream WebRTC at 4 FPS&lt;br&gt;
🎯 Detects threats in zero-shot — you describe what to look for in plain English: "knife or blade weapon", "person fallen on ground", "fire or flames", "gun or pistol", "unattended bag or backpack"&lt;br&gt;
🔊 Speaks voice alerts the instant a threat is detected — Gemini Realtime literally says "ALERT! Unattended bag detected!" out loud into the call&lt;br&gt;
🗣️ Two-way voice conversation — you can talk to SentinelAI and ask it questions&lt;br&gt;
🖥️ Live dashboard with dual video tiles, bounding box overlays, confidence scores, and a real-time alert log&lt;/p&gt;

&lt;p&gt;The Tech Stack&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;Technology&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;Vision Agent Framework&lt;/td&gt;&lt;td&gt;Stream Vision Agents SDK&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Zero-Shot Detection&lt;/td&gt;&lt;td&gt;Moondream Cloud API&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Voice + LLM&lt;/td&gt;&lt;td&gt;Gemini 2.0 Realtime&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Video Infrastructure&lt;/td&gt;&lt;td&gt;Stream WebRTC&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Backend&lt;/td&gt;&lt;td&gt;FastAPI + Python&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Frontend&lt;/td&gt;&lt;td&gt;React + Vite&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The most interesting choice here is Moondream for zero-shot detection. Traditional CV models like YOLO need thousands of labelled images to detect custom objects. Moondream lets you describe what you want to detect in plain English and it just... works. No training. No dataset. That's the magic.&lt;/p&gt;

&lt;p&gt;Architecture&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (React)
    │
    ├── Camera feed via Stream WebRTC
    │
    ▼
FastAPI Backend (Port 8000)
    │ Creates Stream sessions
    │ Manages agent lifecycle
    ▼
Vision Agent (Port 8001)
    │
    ├── Moondream CloudDetectionProcessor (4 FPS)
    │     └── Scans every frame for threats
    │     └── Queues detections when confidence &amp;gt; 0.65
    │
    └── Gemini 2.0 Realtime
          └── Receives alert queue
          └── Speaks alerts aloud via WebRTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three services, one surveillance system.&lt;/p&gt;
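The handoff in the middle of that diagram, from detector to voice, is just an asyncio.Queue shared between two tasks. Here is a stripped-down sketch of the pattern, with the Moondream and Gemini sides stubbed out as plain coroutines:

```python
import asyncio

async def detector(queue: asyncio.Queue) -> None:
    # Stand-in for the Moondream processor: it queues a label
    # whenever a frame scores above the confidence threshold.
    await queue.put("unattended bag or backpack")

async def announcer(queue: asyncio.Queue, spoken: list) -> None:
    # Stand-in for the Gemini Realtime side: it drains the queue
    # and "speaks" each alert.
    label = await queue.get()
    spoken.append(f"ALERT! {label} detected!")

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    spoken: list = []
    # Run both sides concurrently; the announcer blocks on the
    # queue until the detector produces something.
    await asyncio.gather(detector(queue), announcer(queue, spoken))
    return spoken

print(asyncio.run(main()))
# ['ALERT! unattended bag or backpack detected!']
```

The real system is exactly this shape, just with a camera frame loop on the producer side and a WebRTC voice session on the consumer side.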

&lt;p&gt;Building It: The Vision Agents Core&lt;br&gt;
The heart of SentinelAI is a custom ThreatAlertProcessor, a subclass of Vision Agents' moondream.CloudDetectionProcessor. This is where the zero-shot magic happens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
from typing import Dict
import moondream

THREATS = [
    "kitchen knife or blade weapon",
    "person fallen on ground",
    "fire or flames",
    "gun or pistol",
    "unattended bag or backpack"
]

_detection_queue: asyncio.Queue = asyncio.Queue()

class ThreatAlertProcessor(moondream.CloudDetectionProcessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_alerted: Dict[str, float] = {}
        self._alert_cooldown = 15.0  # seconds between same-threat alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def _process_and_add_frame(self, frame):
    await super()._process_and_add_frame(frame)
    detections = self._last_results.get("detections", [])
    now = asyncio.get_event_loop().time()

    for det in detections:
        label = det.get("label", "")
        conf = det.get("confidence", 0)

        if conf &amp;gt;= 0.65:  # High confidence threshold
            last = self._last_alerted.get(label, 0)
            if now - last &amp;gt; self._alert_cooldown:
                self._last_alerted[label] = now
                await _detection_queue.put(label)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The _alert_cooldown is important — without it, once a threat is detected, Gemini would shout about it every 250ms. The 15-second cooldown makes it feel like a real alert system.&lt;/p&gt;
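The same cooldown idea, pulled out as a self-contained helper with the clock injected so it can be exercised without waiting 15 real seconds (a sketch of the pattern, not the exact SentinelAI code):

```python
class AlertCooldown:
    """Suppress repeat alerts for the same label within a cooldown window."""

    def __init__(self, cooldown: float = 15.0):
        self.cooldown = cooldown
        self._last_alerted: dict[str, float] = {}

    def should_alert(self, label: str, now: float) -> bool:
        # Alert only if this label hasn't fired within the window.
        last = self._last_alerted.get(label, float("-inf"))
        if now - last > self.cooldown:
            self._last_alerted[label] = now
            return True
        return False

cd = AlertCooldown(cooldown=15.0)
print(cd.should_alert("fire or flames", now=0.0))   # True  (first sighting)
print(cd.should_alert("fire or flames", now=5.0))   # False (within cooldown)
print(cd.should_alert("fire or flames", now=20.0))  # True  (cooldown elapsed)
```

Injecting `now` instead of reading the event loop clock directly is what makes the debouncing behavior unit-testable.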

&lt;p&gt;The Voice Alert Magic&lt;br&gt;
This was the trickiest part. Making Gemini actually speak the alerts required using agent.llm.session.send() — sending a direct instruction to the Gemini Realtime session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def join_vision_call(agent: Agent, call_type: str, call_id: str, **kwargs) -&amp;gt; None:
    try:
        call = agent.edge.client.video.call(call_type, call_id)
        async with agent.join(call):
            logger.info(f"[SentinelAI] 👁 Moondream active. Watching: {THREATS}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        while True:
            try:
                # Wait for a threat detection (timeout after 30s for keep-alive)
                label = await asyncio.wait_for(
                    _detection_queue.get(), 
                    timeout=30.0
                )

                # Trigger Gemini to speak the alert
                await agent.llm.session.send(
                    input=f"Say out loud urgently: ALERT! {label} detected!",
                    end_of_turn=True
                )
                logger.info(f"[SentinelAI] 🔊 Alert sent: {label}")

            except asyncio.TimeoutError:
                # Keep-alive ping every 30s to prevent idle timeout
                await agent.llm.session.send(
                    input="Continue monitoring silently. Do not speak.",
                    end_of_turn=True
                )

except Exception as e:
    logger.warning(f"[SentinelAI] Session dropped ({e}), will retry on next call.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The 30-second keep-alive was critical — without it, Gemini's Realtime session would silently die during quiet periods with no detections.&lt;/p&gt;

&lt;p&gt;Creating the Agent&lt;br&gt;
Here's how we wire everything together using Vision Agents SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from vision_agents import Agent, AgentOptions
from vision_agents.plugins import gemini, moondream
from vision_agents.plugins.getstream import StreamVideoPlugin

def create_agent():
    return Agent(
        agent_id="sentinel-agent",
        options=AgentOptions(
            llm=gemini.Realtime(
                model="gemini-2.0-flash-exp",
                fps=2,
                instructions="""You are SentinelAI, an elite AI security guardian.
                You monitor live camera feeds for threats. When you first join,
                introduce yourself. When threats are detected, respond with urgency.
                Always be professional and calm."""
            ),
            plugins=[
                StreamVideoPlugin(
                    join_call=join_vision_call,
                    processors=[
                        ThreatAlertProcessor(
                            detect_objects=THREATS,
                            conf_threshold=0.65,
                            fps=4,
                        )
                    ]
                )
            ]
        )
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Vision Agents SDK handles all the WebRTC complexity, audio/video track management, and plugin lifecycle. We just define what we want.&lt;/p&gt;

&lt;p&gt;The Bugs That Almost Broke Me&lt;br&gt;
No honest hackathon post is complete without the war stories.&lt;/p&gt;

&lt;p&gt;Bug 1: Windows + asyncio + aiodns = disaster&lt;br&gt;
Running on Windows, the Vision Agents serve mode crashed immediately with:&lt;br&gt;
NotImplementedError: aiodns needs a SelectorEventLoop on Windows&lt;br&gt;
The fix: set the event loop policy before anything else runs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import sys

if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One line. One hour of debugging to find it.&lt;/p&gt;

&lt;p&gt;Bug 2: Stream API signature verification&lt;br&gt;
The FastAPI backend was returning 401 errors on every session creation. It turns out Stream's SDK requires server-side token generation with proper HMAC signatures; the frontend can't just pass an API key directly.&lt;/p&gt;

&lt;p&gt;Bug 3: Gemini session drops&lt;br&gt;
"'GeminiRealtime' object has no attribute 'session'" crashed everything when we tried to send the intro message too early. Gemini's Realtime session isn't available until after the WebRTC connection fully establishes.&lt;br&gt;
The fix: move the intro to the agent's instructions instead of sending it manually. Gemini introduces itself naturally when it's ready.&lt;/p&gt;

&lt;p&gt;Bug 4: My phone kept triggering "knife" alerts&lt;br&gt;
Zero-shot detection is powerful but imprecise. A phone's rectangular shape was matching "knife" at low confidence. Two fixes: raise the confidence threshold from 0.35 to 0.65, and use more descriptive threat names ("kitchen knife or blade weapon" instead of just "knife").&lt;/p&gt;
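That Bug 4 fix boils down to a post-filter over the detection dicts. A minimal sketch (the detection shape follows the processor code earlier in the post; the function name is my own, not part of the SDK):

```python
def filter_detections(detections, threshold=0.65):
    """Keep only detections at or above the confidence threshold.

    Raising the threshold from 0.35 to 0.65 is what stopped the
    phone-shaped "knife" false positives.
    """
    return [d for d in detections if d.get("confidence", 0) >= threshold]

frame_detections = [
    {"label": "kitchen knife or blade weapon", "confidence": 0.41},  # the phone
    {"label": "unattended bag or backpack", "confidence": 0.88},
]
print(filter_detections(frame_detections))
# [{'label': 'unattended bag or backpack', 'confidence': 0.88}]
```

The more descriptive labels attack the same problem from the other side: a richer text prompt gives the zero-shot model less room to match on shape alone.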

&lt;p&gt;The Moment It Worked&lt;br&gt;
At around 2 AM, after fixing the session keep-alive bug, I held an empty bag up to my camera.&lt;br&gt;
Moondream detected it. Queued the alert. Gemini received it.&lt;br&gt;
And then — through my speakers — I heard:&lt;br&gt;
"ALERT! Unattended bag detected!"&lt;br&gt;
I genuinely jumped. It works. It actually works.&lt;/p&gt;

&lt;p&gt;The React Dashboard&lt;br&gt;
The frontend is a custom dark surveillance UI with:&lt;/p&gt;

&lt;p&gt;Dual video tiles — your camera feed + the SentinelAI agent feed side by side&lt;br&gt;
Live bounding boxes — green overlays with confidence scores on detected threats&lt;br&gt;
Dynamic watchlist editor — add/remove threat categories in real-time&lt;br&gt;
Alert log — timestamped color-coded log of every detection&lt;/p&gt;

&lt;p&gt;Built with React + Vite, connecting to Stream's JavaScript SDK for WebRTC video.&lt;/p&gt;

&lt;p&gt;What I Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Zero-shot detection is a game changer. Not needing to train a model, collect data, or label images means you can prototype surveillance for any use case in minutes. Just describe what you want to detect.&lt;/li&gt;
&lt;li&gt;Vision Agents SDK abstracts the hard stuff beautifully. WebRTC negotiation, ICE candidates, audio/video track management, plugin lifecycle — all handled. I focused on the AI logic, not the infrastructure.&lt;/li&gt;
&lt;li&gt;Gemini Realtime + WebRTC is genuinely impressive. Having a two-way voice conversation with an AI that's watching your camera feed feels like science fiction. It's not — it's a few hundred lines of Python.&lt;/li&gt;
&lt;li&gt;Keep-alives matter. Any long-running AI session needs a heartbeat. Without the 30-second ping, Gemini silently times out during quiet monitoring periods.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What's Next&lt;/p&gt;

&lt;p&gt;YOLO local inference on my GTX 1650 for 30+ FPS detection (vs 4 FPS cloud)&lt;br&gt;
Incident memory — SentinelAI remembers and summarizes the last hour of detections&lt;br&gt;
Mobile support via Stream's React Native SDK&lt;br&gt;
Multi-camera — monitor multiple feeds simultaneously&lt;/p&gt;

&lt;p&gt;Try It Yourself&lt;br&gt;
🐙 GitHub: github.com/kapilshastriwork-maker/sentinelai&lt;br&gt;
🎬 Demo Video: Watch SentinelAI detect threats and speak voice alerts live on YouTube - &lt;a href="https://youtu.be/RX4EW9DvaFQ" rel="noopener noreferrer"&gt;https://youtu.be/RX4EW9DvaFQ&lt;/a&gt;&lt;br&gt;
To run it locally:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Terminal 1 — FastAPI backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Terminal 2 — Vision Agent
uv run main.py serve --port 8001

# Terminal 3 — React frontend
cd frontend &amp;amp;&amp;amp; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add your .env with STREAM_API_KEY, STREAM_API_SECRET, MOONDREAM_API_KEY, and GEMINI_API_KEY, and you're watching.&lt;/p&gt;
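A small preflight check turns a confusing mid-startup crash into a clear error when one of those keys is missing. This helper is a hypothetical addition of mine, not part of the SentinelAI repo:

```python
import os

# The four keys the .env setup above requires.
REQUIRED_KEYS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "MOONDREAM_API_KEY",
    "GEMINI_API_KEY",
]

def missing_env(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: an environment with only the Stream keys populated
partial = {"STREAM_API_KEY": "sk_test", "STREAM_API_SECRET": "secret"}
print(missing_env(partial))
# ['MOONDREAM_API_KEY', 'GEMINI_API_KEY']
```

Calling this at the top of the FastAPI backend and the agent process means each of the three terminals fails fast with a readable message instead of a 401 or a silent session drop.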

&lt;p&gt;Resources&lt;/p&gt;

&lt;p&gt;Stream Vision Agents SDK&lt;br&gt;
Moondream Cloud API&lt;br&gt;
Gemini 2.0 Realtime&lt;br&gt;
Stream WebRTC&lt;/p&gt;

&lt;p&gt;Built for the Stream Vision Agents Hackathon 2026. If you're building something cool with Vision Agents, I'd love to see it — drop a comment below!&lt;/p&gt;

&lt;p&gt;Tags: python ai machinelearning webdev hackathon&lt;/p&gt;

</description>
      <category>visionai</category>
    </item>
  </channel>
</rss>
