<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sherin Joseph Roy</title>
    <description>The latest articles on DEV Community by Sherin Joseph Roy (@sherinjosephroy).</description>
    <link>https://dev.to/sherinjosephroy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3339374%2F0000eca7-9c52-452e-8d4d-14521643b941.jpeg</url>
      <title>DEV Community: Sherin Joseph Roy</title>
      <link>https://dev.to/sherinjosephroy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sherinjosephroy"/>
    <language>en</language>
    <item>
      <title>I Turned My Old Android Phone Into an L2 Autonomous Driving System (Flutter + C++)</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:19:32 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/i-turned-my-old-android-phone-into-an-l2-autonomous-driving-system-flutter-c-12fd</link>
      <guid>https://dev.to/sherinjosephroy/i-turned-my-old-android-phone-into-an-l2-autonomous-driving-system-flutter-c-12fd</guid>
      <description>&lt;p&gt;Modern cars ship with expensive L2 driver assistance systems. Most of the heavy lifting for these systems is just computer vision running on a small chip behind the dashboard. Guess what else has a camera, a GPU, and a decent SoC? Your phone.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/pcfjbeNWuYw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I decided to build a shadow mode ADAS that runs entirely in my pocket. I call it Zyra ADAS. It requires absolutely no cloud connectivity. There is no network latency. It just watches the road and predicts what a real autonomous system would do in real time. &lt;/p&gt;

&lt;p&gt;Here is how I built a highly optimized, lock-free perception engine using Flutter, C++, and Vulkan.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Cloud AI for Safety
&lt;/h3&gt;

&lt;p&gt;Sending video frames to a cloud server for processing is fine for basic image recognition. It is completely useless when you are moving at 80 km/h and need to know if a car is braking ahead of you. You need on-device processing. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture: Bypassing the Framework
&lt;/h3&gt;

&lt;p&gt;The app is built with Flutter, but you cannot afford Flutter's MethodChannel serialization costs when processing video at high frame rates. Bouncing JSON strings across threads adds way too much overhead.&lt;/p&gt;

&lt;p&gt;I bypassed it completely using &lt;code&gt;dart:ffi&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The Flutter UI only handles the camera stream and drawing the overlays. The actual hot path lives in a single C++ shared object (&lt;code&gt;libzyra_perception.so&lt;/code&gt;). We pass the raw YUV camera frames directly from the hardware into the native engine with zero copy memory pointers. &lt;/p&gt;

&lt;p&gt;C++ owns the entire heavy lifting process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converting YUV to RGB and letterboxing&lt;/li&gt;
&lt;li&gt;Running YOLOv8n inference for object detection&lt;/li&gt;
&lt;li&gt;Per class Non-Maximum Suppression (NMS)&lt;/li&gt;
&lt;li&gt;Canny and HoughLinesP for classical lane tracking&lt;/li&gt;
&lt;/ul&gt;
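
&lt;p&gt;The per-class NMS step is simple enough to sketch. Here is an illustrative Python version of the idea (the production code lives in C++ inside the engine; the function names here are mine, not from the repo):&lt;/p&gt;

```python
# Illustrative per-class NMS. Detections are grouped by class first,
# so boxes of different classes never suppress each other.
# Each box is (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_per_class(dets, iou_thresh=0.45):
    # dets: list of (class_id, score, box)
    keep = []
    by_class = {}
    for d in dets:
        by_class.setdefault(d[0], []).append(d)
    for cls, group in by_class.items():
        group.sort(key=lambda d: d[1], reverse=True)
        while group:
            best = group.pop(0)
            keep.append(best)
            # Keep only boxes whose overlap with the winner is below threshold.
            group = [d for d in group if iou_thresh > iou(best[2], d[2])]
    return keep
```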

&lt;p&gt;Dart just reads the final struct of bounding boxes and lane coordinates. &lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Performance on Mobile Silicon
&lt;/h3&gt;

&lt;p&gt;To make YOLOv8n run smoothly on a phone, I used NCNN with Vulkan compute. NCNN is incredible for mobile deployment. It uses FP16 packed storage and Winograd convolutions to squeeze every drop of performance out of the mobile GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4i6b1fr2rcf8b02yy9v.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4i6b1fr2rcf8b02yy9v.gif" alt=" " width="720" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results speak for themselves. On my daily driver Realme smartphone with a Snapdragon 662 (a mid range chip from 2020), I am hitting about &lt;strong&gt;105ms end to end inference time&lt;/strong&gt;. That is roughly 10 FPS on older hardware. On modern flagship chips like the Snapdragon 8 Gen 2, it easily hits a sustained 30 FPS. &lt;/p&gt;

&lt;p&gt;I designed the engine with a bounded queue. If the inference falls behind, the engine explicitly drops the older frame. There is no silent buffering and no latency creep. It always shows you what is happening exactly right now.&lt;/p&gt;
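
&lt;p&gt;The drop-oldest policy is the whole trick. A minimal Python sketch of the same idea (the real engine implements this in C++ around its inference thread; the class name is mine, not from the repo):&lt;/p&gt;

```python
from collections import deque
from threading import Condition

class LatestFrameQueue:
    # Bounded to one slot: a new frame silently evicts the stale one,
    # so the consumer always dequeues the most recent capture and
    # latency cannot accumulate behind a slow inference step.
    def __init__(self):
        self._buf = deque(maxlen=1)
        self._cv = Condition()

    def push(self, frame):
        with self._cv:
            self._buf.append(frame)   # maxlen=1 drops the older frame
            self._cv.notify()

    def pop(self):
        with self._cv:
            while not self._buf:
                self._cv.wait()
            return self._buf.popleft()
```

The point of the design is that backpressure is resolved at the producer, not by buffering: a dropped frame costs nothing, a buffered frame costs latency on every frame after it.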

&lt;h3&gt;
  
  
  What is Next?
&lt;/h3&gt;

&lt;p&gt;I am currently prepping this software to mount inside a Tata Tigor EV for a massive data collection rig. The next step is fusing the phone IMU and GPS data into the perception pipeline to build out proper vehicle dynamics and forward collision warnings.&lt;/p&gt;

&lt;p&gt;You can dig into the C++ engine and the FFI bridge here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Sherin-SEF-AI" rel="noopener noreferrer"&gt;
        Sherin-SEF-AI
      &lt;/a&gt; / &lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS" rel="noopener noreferrer"&gt;
        Zyra-ADAS
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Android L2 ADAS shadow-mode system. On-device YOLOv8n + classical lane tracking with Vulkan-accelerated NCNN inference. Flutter UI + C++ NDK engine.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Zyra ADAS&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Your phone is now an L2 ADAS shadow system.&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Real-time object detection + lane tracking on Android, powered by on-device NCNN inference with Vulkan acceleration. No cloud, no latency, no compromise.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.android.com" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f650957970fea395a2fcda7c2ac7389a9fa72811779dde877ed0cb1499e15982/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f416e64726f69642d33302532422d3344444338343f7374796c653d666f722d7468652d6261646765266c6f676f3d616e64726f6964266c6f676f436f6c6f723d7768697465" alt="Android"&gt;&lt;/a&gt;
&lt;a href="https://flutter.dev" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/5da023d4da567db7bf16d94f037e9a0069875473e56e50f35ca00618d0b8812a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f466c75747465722d332e34312d3032353639423f7374796c653d666f722d7468652d6261646765266c6f676f3d666c7574746572266c6f676f436f6c6f723d7768697465" alt="Flutter"&gt;&lt;/a&gt;
&lt;a href="https://isocpp.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/014fd07253568ce9d4cf0b28a396937d46d8612fe44bbe04a28cc6731390156e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f432532422532422d31372d3030353939433f7374796c653d666f722d7468652d6261646765266c6f676f3d63706c7573706c7573266c6f676f436f6c6f723d7768697465" alt="C++"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Tencent/ncnn" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1c2350cc2bbe38149956956fa45c291daf6edf15fdfab5eb103484412ddb3214/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e434e4e2d56756c6b616e2d4646364233353f7374796c653d666f722d7468652d6261646765" alt="NCNN"&gt;&lt;/a&gt;
&lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/91d800b82b62c338a63e111e726aed1b94807eb76612270e92bd27f2c5e29ef3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d3445434443343f7374796c653d666f722d7468652d6261646765" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS#what-it-does" rel="noopener noreferrer"&gt; What it does &lt;/a&gt;&lt;/strong&gt;  •  &lt;strong&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS#architecture" rel="noopener noreferrer"&gt; Architecture &lt;/a&gt;&lt;/strong&gt;  •  &lt;strong&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS#performance" rel="noopener noreferrer"&gt; Performance &lt;/a&gt;&lt;/strong&gt;  •  &lt;strong&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS#quick-start" rel="noopener noreferrer"&gt; Quick start &lt;/a&gt;&lt;/strong&gt;  •  &lt;strong&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Zyra-ADAS#roadmap" rel="noopener noreferrer"&gt; Roadmap &lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why Zyra ADAS&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Modern cars ship L2 driver assistance that costs thousands of dollars. Most of the hard work is computer vision running on a small SoC behind the dashboard. Your phone has that same SoC. It has a camera, a GPU, GPS, accelerometers, and a screen bright enough to see in sunlight.&lt;/p&gt;
&lt;p&gt;Zyra turns it into a &lt;strong&gt;shadow-mode ADAS&lt;/strong&gt;: it watches the road and predicts what a real L2 system would do, side by side with what you actually do. No vehicle control, no liability, just perception that runs in your pocket.&lt;/p&gt;
&lt;p&gt;Built for riders, fleet operators, researchers, and anyone who wants to…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Sherin-SEF-AI/Zyra-ADAS" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Have you ever tried bridging heavy C++ computer vision pipelines directly to Flutter? Let me know your architecture choices in the comments.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>flutter</category>
      <category>computervision</category>
      <category>android</category>
    </item>
    <item>
      <title>200GB of Raw Rover Data and No Pipeline to Process It. So I Wrote One.</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:00:42 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/200gb-of-raw-rover-data-and-no-pipeline-to-process-it-so-i-wrote-one-5abb</link>
      <guid>https://dev.to/sherinjosephroy/200gb-of-raw-rover-data-and-no-pipeline-to-process-it-so-i-wrote-one-5abb</guid>
      <description>&lt;p&gt;&lt;strong&gt;Building a 33-module Python pipeline that takes raw GoPro/Insta360 recordings and turns them into SLAM-ready datasets with auto-labeling, depth estimation, and edge deployment. The unglamorous engineering that nobody writes about.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqcl6nyajj3w89q8pg0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujqcl6nyajj3w89q8pg0.gif" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three field sessions in Kerala with a GoPro Hero 11 on a rover chassis, an Insta360 X4 for 360-degree coverage, and an Android phone running Sensor Logger.&lt;/p&gt;

&lt;p&gt;Result: 200GB of raw video, a growing spreadsheet tracking which files had GPS lock, which files had HyperSmooth accidentally enabled (destroying IMU correlation), and one SD card that corrupted mid-write.&lt;/p&gt;

&lt;p&gt;I tried stitching together scripts. FFmpeg for frames, a GPMF parser for telemetry, a separate tool for calibration, manual CSV wrangling for synchronization. After the third time I forgot which script ran in which order, I stopped and asked myself what I actually needed.&lt;/p&gt;

&lt;p&gt;The answer was not a better script. It was a system.&lt;/p&gt;

&lt;p&gt;That system became &lt;strong&gt;Orvex&lt;/strong&gt;, and over six months it grew into 33 core modules, a 28-panel PyQt6 desktop app, a FastAPI backend with 96+ endpoints, and a React frontend. All sharing identical business logic. Zero code duplication.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap That Started Everything
&lt;/h2&gt;

&lt;p&gt;The autonomous driving community has world-class open source perception tools. ORB-SLAM3 for visual-inertial SLAM. VINS-Mono for state estimation. Depth Anything for monocular depth. SegFormer for semantic segmentation. YOLOv8 for detection.&lt;/p&gt;

&lt;p&gt;But all of these tools expect their input in a specific format with specific conventions. Getting your own raw field recordings into that format is where weeks disappear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujd0xaaw7h6x0nx2ugsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujd0xaaw7h6x0nx2ugsq.png" alt=" " width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I needed, concretely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audit 50+ GoPro files and flag stabilization artifacts automatically&lt;/li&gt;
&lt;li&gt;Extract IMU data at 200Hz and GPS from binary GPMF streams inside MP4 containers&lt;/li&gt;
&lt;li&gt;Synchronize three devices that have independent clocks&lt;/li&gt;
&lt;li&gt;Run a guided camera calibration workflow with validation gates&lt;/li&gt;
&lt;li&gt;Auto-label 10,000 frames with classes specific to Indian roads (autorickshaws, potholes, unmarked speed bumps)&lt;/li&gt;
&lt;li&gt;Train a detector, export to ONNX, convert to TensorRT for Jetson deployment&lt;/li&gt;
&lt;li&gt;Version datasets as they evolve over collection campaigns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No single tool covered this. Every tool covered one step. The pipeline between steps was my problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8eu5eq47rj8c18hxlspe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8eu5eq47rj8c18hxlspe.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The One Rule That Saved the Project
&lt;/h2&gt;

&lt;p&gt;Early on I made one architectural decision that paid off more than any other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every line of business logic lives in &lt;code&gt;core/&lt;/code&gt;. No UI imports allowed.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;core/              33 Python modules. No PyQt6. No FastAPI.
  |
  +-- desktop/     PyQt6 app (28 stacked widget panels)
  +-- web/
       +-- backend/   FastAPI (96+ endpoints)
       +-- frontend/  React + Vite (25 pages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The desktop app and web app are both thin wrappers that call the same &lt;code&gt;core/&lt;/code&gt; functions. I fixed a GPS timestamp parsing bug once in &lt;code&gt;core/extractor_gopro.py&lt;/code&gt; and both interfaces got the fix immediately.&lt;/p&gt;

&lt;p&gt;When a colleague wanted to run extraction on a headless server, they imported &lt;code&gt;core.extractor_gopro&lt;/code&gt; directly. No Qt dependency chain. No display server required.&lt;/p&gt;

&lt;p&gt;The enforcement was simple: if the string &lt;code&gt;import PyQt6&lt;/code&gt; or &lt;code&gt;import fastapi&lt;/code&gt; appears in any file under &lt;code&gt;core/&lt;/code&gt;, the code does not merge. No exceptions.&lt;/p&gt;

&lt;p&gt;This also meant every &lt;code&gt;core/&lt;/code&gt; module was independently testable with pytest. No GUI mocking required.&lt;/p&gt;
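
&lt;p&gt;That merge gate can itself be a tiny test. A sketch of how such a check might look (the directory layout and function names here are illustrative, not taken from the Orvex repo):&lt;/p&gt;

```python
import pathlib

FORBIDDEN = ("import PyQt6", "import fastapi")

def find_ui_imports(core_dir):
    # Return (path, line_number, line) for every UI import under core_dir.
    hits = []
    for path in pathlib.Path(core_dir).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if any(tok in line for tok in FORBIDDEN):
                hits.append((str(path), lineno, line.strip()))
    return hits

def test_core_has_no_ui_imports():
    # Fails the suite (and the merge) on the first violation.
    assert find_ui_imports("core") == []
```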


&lt;h2&gt;
  
  
  The Hardest Lesson: Never Store Timestamps as Floats
&lt;/h2&gt;

&lt;p&gt;If you remember one thing from this post, let it be this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store all timestamps as 64-bit integer nanoseconds. Not floats. Not milliseconds. Not ISO strings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I lost three days to a synchronization bug before learning this. Here is the problem:&lt;/p&gt;

&lt;p&gt;GoPro GPMF gives timestamps as microseconds. Android Sensor Logger gives &lt;code&gt;seconds_elapsed&lt;/code&gt; as a float counting from app start. Insta360 encodes timestamps in a proprietary binary format that exiftool can extract.&lt;/p&gt;

&lt;p&gt;When you cross-correlate 200Hz accelerometer signals between two devices to compute their clock offset, a few milliseconds of float rounding error shifts your alignment by an entire IMU sample: at 200Hz, one sample is only 5 milliseconds. That error propagates through your entire SLAM pipeline.&lt;/p&gt;

&lt;p&gt;The fix is absolute:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Correct. Every timestamp in the pipeline.
&lt;/span&gt;&lt;span class="n"&gt;timestamp_ns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;  &lt;span class="c1"&gt;# nanoseconds since Unix epoch
&lt;/span&gt;
&lt;span class="c1"&gt;# Wrong. Precision loss at high sample rates.
&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This constraint is enforced at the Pydantic model level. The &lt;code&gt;IMUSample&lt;/code&gt; model accepts &lt;code&gt;timestamp_ns: int&lt;/code&gt; and nothing else. If you pass a float, validation fails immediately rather than producing subtle drift errors 3 modules downstream.&lt;/p&gt;
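
&lt;p&gt;The real models use Pydantic, but the guard itself needs no dependency. A self-contained sketch of the same constraint with a plain dataclass (field names beyond &lt;code&gt;timestamp_ns&lt;/code&gt; are illustrative):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IMUSample:
    timestamp_ns: int   # nanoseconds since Unix epoch, and nothing else
    ax: float
    ay: float
    az: float

    def __post_init__(self):
        # Reject bool too: bool is a subclass of int in Python.
        if not isinstance(self.timestamp_ns, int) or isinstance(self.timestamp_ns, bool):
            raise TypeError("timestamp_ns must be an int (nanoseconds since epoch)")
```

A float sneaking in fails at construction time, at the boundary where the bad value entered, instead of surfacing as drift several modules later.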


&lt;h2&gt;
  
  
  Synchronizing Devices That Don't Share a Clock
&lt;/h2&gt;

&lt;p&gt;This was the most interesting engineering challenge in the project.&lt;/p&gt;

&lt;p&gt;Three devices. Three independent clocks. Three different timestamp formats. The goal: compute a nanosecond offset for each device so all data aligns to a common timeline.&lt;/p&gt;
&lt;h3&gt;
  
  
  Method 1: GPS Time Anchor
&lt;/h3&gt;

&lt;p&gt;GoPro Hero 11 records GPS timestamps in UTC. If the Android phone also has GPS lock, you can compute the offset directly by comparing GPS time readings from the same moment. This is the most accurate method, typically within a few milliseconds.&lt;/p&gt;

&lt;p&gt;The catch: both devices need solid GPS fix. Indoor recordings, dense urban canyons, and cloudy conditions can leave you without GPS on one or both devices.&lt;/p&gt;
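
&lt;p&gt;With GPS lock on both devices, the offset computation is a subtraction plus one robustness step. An illustrative sketch (the function and field names are mine, not from the Orvex source):&lt;/p&gt;

```python
def gps_clock_offset_ns(gopro_fixes, phone_fixes):
    # Each list holds (device_clock_ns, gps_utc_ns) pairs sampled at GPS fixes.
    # Per-fix offset = device clock minus true (GPS) time; the median
    # discards outlier fixes from multipath or a momentary lock loss.
    def offsets(fixes):
        return sorted(dev - utc for dev, utc in fixes)

    def median(xs):
        n = len(xs)
        mid = n // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) // 2

    # Offset between the two device clocks, in integer nanoseconds.
    return median(offsets(gopro_fixes)) - median(offsets(phone_fixes))
```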
&lt;h3&gt;
  
  
  Method 2: IMU Cross-Correlation
&lt;/h3&gt;

&lt;p&gt;When GPS is unavailable, you can exploit the fact that all devices attached to the same rover experienced the same physical motion. The acceleration magnitude signal should be nearly identical across devices, just shifted in time.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Resample both signals to 200Hz, compute acceleration magnitude
&lt;/span&gt;&lt;span class="n"&gt;mag_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax_a&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ay_a&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;az_a&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mag_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax_b&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ay_b&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;az_b&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Normalized cross-correlation to find the lag
&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;correlate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mag_a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mag_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;mag_b&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mag_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lag_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mag_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;offset_ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lag_samples&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;200.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the computed lag exceeds 2 seconds, Orvex raises a warning. A lag that large usually means one device started recording before the other, and you need to trim the non-overlapping segment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Method 3: Manual
&lt;/h3&gt;

&lt;p&gt;Sometimes you just know. You clapped in front of all cameras at the start of the recording, and you can identify the spike in the accelerometer data manually. Orvex accepts manual offsets in milliseconds.&lt;/p&gt;
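
&lt;p&gt;Finding the clap in each accelerometer trace is a peak search on the magnitude signal. A minimal sketch, assuming a 200Hz trace (names and the windowing choice are illustrative):&lt;/p&gt;

```python
def clap_sample_index(accel_mag, window=5):
    # The clap shows up as the largest jump in acceleration magnitude.
    # Scanning for the biggest swing over a short window favors a sharp
    # spike over a slow, sustained bump.
    best_i, best_jump = 0, 0.0
    for i in range(len(accel_mag) - window):
        jump = max(accel_mag[i:i + window]) - min(accel_mag[i:i + window])
        if jump > best_jump:
            best_i, best_jump = i, jump
    return best_i

def manual_offset_ns(index_a, index_b, rate_hz=200.0):
    # Offset of device B relative to device A, from their clap indices.
    return int((index_a - index_b) / rate_hz * 1e9)
```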


&lt;h2&gt;
  
  
  The 4GB Trap
&lt;/h2&gt;

&lt;p&gt;GoPro cameras split continuous recordings into chapter files at approximately 4GB boundaries. A 20-minute drive produces &lt;code&gt;GH010001.MP4&lt;/code&gt;, &lt;code&gt;GH020001.MP4&lt;/code&gt;, &lt;code&gt;GH030001.MP4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The GPMF telemetry stream is split across these chapters. If you process each chapter independently, you get timestamp discontinuities at every boundary. Your SLAM system sees a time jump of several hundred milliseconds and either loses tracking or inserts a phantom loop closure.&lt;/p&gt;

&lt;p&gt;Orvex detects chapter sequences automatically:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_GOPRO_CHAPTER_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;G[HX](\d{2})(\d{4})\.MP4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Files sharing the same last 4 digits (recording ID) are grouped and their GPMF streams concatenated with corrected timestamps. The user sees a single continuous recording. The chapter boundaries become invisible.&lt;/p&gt;

&lt;p&gt;This is not an edge case. Any GoPro recording longer than about 12 minutes at 4K hits the 4GB split. If your pipeline does not handle this, it does not work on real data.&lt;/p&gt;
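
&lt;p&gt;Grouping chapters from that regex is straightforward. A simplified sketch of the idea (the concatenation and timestamp correction are omitted here):&lt;/p&gt;

```python
import re

_GOPRO_CHAPTER_RE = re.compile(r"G[HX](\d{2})(\d{4})\.MP4", re.IGNORECASE)

def group_chapters(filenames):
    # Bucket files by recording ID (the last 4 digits), then order each
    # bucket by chapter number so the GPMF streams concatenate in order.
    groups = {}
    for name in filenames:
        m = _GOPRO_CHAPTER_RE.fullmatch(name)
        if m:
            chapter, rec_id = int(m.group(1)), m.group(2)
            groups.setdefault(rec_id, []).append((chapter, name))
    return {rec: [n for _, n in sorted(chs)] for rec, chs in groups.items()}
```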


&lt;h2&gt;
  
  
  Auto-Labeling for Roads That COCO Never Saw
&lt;/h2&gt;

&lt;p&gt;Stock YOLOv8 trained on MS COCO gives you 80 classes. Indian roads have objects that COCO never included:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;In COCO?&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;autorickshaw&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Most common vehicle in Kerala&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pothole_region&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Critical for path planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;speed_bump&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Often unmarked, no paint, no signs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cow&lt;/td&gt;
&lt;td&gt;Yes (class 19)&lt;/td&gt;
&lt;td&gt;Regularly standing in traffic lanes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;person&lt;/td&gt;
&lt;td&gt;Yes (class 0)&lt;/td&gt;
&lt;td&gt;Far higher density than Western datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Orvex defines 13 rover-relevant classes and maps them to COCO class IDs where a mapping exists. For the three India-specific classes (autorickshaw, pothole, speed bump), a stock COCO model simply produces no detections. Those require a fine-tuned model.&lt;/p&gt;
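
&lt;p&gt;That mapping fits in a single table: classes that exist in COCO get an ID, the India-specific ones get &lt;code&gt;None&lt;/code&gt; and wait for the fine-tuned model. A sketch of the shape (only the &lt;code&gt;cow&lt;/code&gt; and &lt;code&gt;person&lt;/code&gt; IDs come from the table above; the remaining classes of the 13 are not listed in this post, so this is a partial illustration):&lt;/p&gt;

```python
# COCO class IDs for classes the stock model can auto-label;
# None means the class needs the fine-tuned detector.
COCO_MAP = {
    "person": 0,
    "cow": 19,
    "autorickshaw": None,
    "pothole_region": None,
    "speed_bump": None,
}

def auto_labelable(classes):
    # Classes a stock COCO model can produce detections for.
    # Note the "is not None" check: class ID 0 (person) is valid.
    return [c for c in classes if COCO_MAP.get(c) is not None]
```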

&lt;p&gt;The practical workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;yolov8n.pt&lt;/code&gt; at 0.25 confidence on all frames (fast, catches roughly 70% of objects)&lt;/li&gt;
&lt;li&gt;Export to CVAT XML for human review&lt;/li&gt;
&lt;li&gt;Reviewers correct false positives and add the missing India-specific classes&lt;/li&gt;
&lt;li&gt;Use corrected annotations to fine-tune&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ym7jbhkgw7aqg8vcpq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ym7jbhkgw7aqg8vcpq4.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One reviewer can process 500 auto-labeled frames in the time it takes to manually label 50 from scratch. The model does not need to be perfect. It needs to be faster than starting from zero.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9lgjx1cwthl1niz4acs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9lgjx1cwthl1niz4acs.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Camera Calibration: Four Steps, Three Weeks of Debugging
&lt;/h2&gt;

&lt;p&gt;Camera-IMU calibration is a guided 4-step workflow in Orvex. Each step has requirements that are not obvious until you get them wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: IMU Static Noise Characterization.&lt;/strong&gt; Place the GoPro flat on a stable table and record for at least 4 hours. Allan deviation analysis extracts accelerometer noise density, accelerometer random walk, gyroscope noise density, and gyroscope random walk. A 30-minute recording gives you noise density but not random walk. VINS-Mono needs both parameters. There is no shortcut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Camera Intrinsics.&lt;/strong&gt; Standard chessboard calibration using OpenCV. The requirement that trips people up: you need a minimum of 15 detected poses, and they must cover the corners and edges of the frame, not just the center. Reprojection error must be below 0.5 pixels. Orvex rejects calibrations that fail this threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Camera-IMU Extrinsics.&lt;/strong&gt; This invokes OpenImuCameraCalibrator as a subprocess. It runs for 20 to 40 minutes. Orvex streams the subprocess stdout to the log panel in real time so you can watch it converge. The output is a 4x4 transformation matrix relating the camera frame to the IMU frame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Validation.&lt;/strong&gt; Automated checks verify that reprojection error is below 0.5 pixels, translation magnitude is below 10 centimeters, and rotation magnitude is below 0.5 degrees.&lt;/p&gt;

&lt;p&gt;Each step saves results to a JSON file. If Step 2 results already exist when you start, Orvex skips directly to Step 3. You calibrate once per physical camera mount, not once per session.&lt;/p&gt;
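
&lt;p&gt;The skip-if-exists behavior is just a check for each step's JSON artifact. A sketch of the resume logic (the file names here are illustrative, not the actual Orvex artifact names):&lt;/p&gt;

```python
import json
import pathlib

STEP_OUTPUTS = {
    1: "imu_noise.json",
    2: "camera_intrinsics.json",
    3: "camera_imu_extrinsics.json",
    4: "validation_report.json",
}

def first_pending_step(calib_dir):
    # Resume at the first step whose result file is missing.
    for step in sorted(STEP_OUTPUTS):
        if not pathlib.Path(calib_dir, STEP_OUTPUTS[step]).exists():
            return step
    return None  # fully calibrated

def save_step(calib_dir, step, payload):
    pathlib.Path(calib_dir, STEP_OUTPUTS[step]).write_text(json.dumps(payload))
```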


&lt;h2&gt;
  
  
  ByteTrack in 839 Lines
&lt;/h2&gt;

&lt;p&gt;I implemented multi-object tracking from scratch rather than pulling in a tracking library. The reasons were practical:&lt;/p&gt;

&lt;p&gt;Every tracking library I evaluated either required detections in a specific format that did not match my pipeline, pulled in dozens of transitive dependencies, or could not export results in MOT Challenge CSV format for evaluation.&lt;/p&gt;

&lt;p&gt;The from-scratch implementation uses three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;Kalman filter&lt;/strong&gt; with 8 states (position x/y, width/height, and their velocities) that predicts where each tracked object will appear in the next frame&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-stage IoU matching&lt;/strong&gt;: high-confidence new detections are matched to active tracks first, then low-confidence detections are matched to recently lost tracks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hungarian assignment&lt;/strong&gt; via &lt;code&gt;scipy.optimize.linear_sum_assignment&lt;/code&gt; for globally optimal matching&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is 839 lines of Python with no dependencies beyond NumPy and SciPy. It outputs track statistics, MOT Challenge formatted CSVs, and heatmap visualizations of track density.&lt;/p&gt;
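&lt;p&gt;The association core of components 2 and 3 can be sketched in a few lines: pairwise IoU between predicted track boxes and new detections, negated into a cost matrix for Hungarian assignment. Boxes are &lt;code&gt;(x1, y1, x2, y2)&lt;/code&gt;; the 0.3 IoU gate here is illustrative, not the tracker's actual threshold.&lt;/p&gt;

```python
# IoU cost matrix + globally optimal assignment, as used in the matching stage.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    t = np.asarray(tracks, float)[:, None, :]   # (T, 1, 4)
    d = np.asarray(dets, float)[None, :, :]     # (1, D, 4)
    x1 = np.maximum(t[..., 0], d[..., 0])
    y1 = np.maximum(t[..., 1], d[..., 1])
    x2 = np.minimum(t[..., 2], d[..., 2])
    y2 = np.minimum(t[..., 3], d[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / (area_t + area_d - inter)

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]     # Kalman-predicted boxes
dets   = [[21, 21, 31, 31], [1, 1, 11, 11]]     # new detections
iou = iou_matrix(tracks, dets)
rows, cols = linear_sum_assignment(-iou)        # negate: maximize total IoU
matches = [(r, c) for r, c in zip(rows, cols) if iou[r, c] > 0.3]
print(matches)                                  # [(0, 1), (1, 0)]
```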


&lt;h2&gt;
  
  
  28 Panels, Zero Frozen Frames
&lt;/h2&gt;

&lt;p&gt;Every long-running operation in the desktop app runs in a &lt;code&gt;QThread&lt;/code&gt; worker. The base pattern:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseWorker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;QThread&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;progress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pyqtSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# 0 to 100
&lt;/span&gt;    &lt;span class="n"&gt;status&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pyqtSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# "Processing frame 45/200..."
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pyqtSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# Final payload on success
&lt;/span&gt;    &lt;span class="n"&gt;error&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pyqtSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# Actionable error on failure
&lt;/span&gt;    &lt;span class="n"&gt;timing&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pyqtSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# (elapsed_seconds, eta_seconds)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every worker inherits from &lt;code&gt;BaseWorker&lt;/code&gt;. The UI connects signals to update progress bars, status labels, and the collapsible log panel. The main thread never blocks. Model downloads, inference passes, subprocess calls to COLMAP or ORBSLAM3: all of it runs in workers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt55n3w6dx0wy6djaqio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt55n3w6dx0wy6djaqio.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcf19mv2qkjz9bkvtxel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcf19mv2qkjz9bkvtxel.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sidebar is organized as a sequential workflow. Not by technical category, but by what a beginner does first, then second, then third:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00cpr96ule0pkgqmheqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00cpr96ule0pkgqmheqj.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SETUP &amp;amp; IMPORT       (1-4)    Create session, audit files, import data
PROCESS &amp;amp; EXTRACT    (5-9)    Extract frames, calibrate, view telemetry
ANALYZE &amp;amp; ANNOTATE   (10-19)  Auto-label, segment, depth, track, review
TRAIN &amp;amp; DEPLOY       (20-28)  Augment, train, export, version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A beginner follows top to bottom. An experienced user jumps directly to whatever they need.&lt;/p&gt;


&lt;h2&gt;
  
  
  Same Logic, Two Interfaces
&lt;/h2&gt;

&lt;p&gt;The FastAPI backend wraps the same &lt;code&gt;core/&lt;/code&gt; functions in 23 route modules. Long operations return a task ID immediately:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/api/v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;/autolabel/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The client opens a WebSocket to &lt;code&gt;/ws/tasks/abc-123&lt;/code&gt; and receives progress updates:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"running"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"progress"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Annotating frames 41-48 / 200"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The React frontend has 25 pages that mirror every desktop panel. The web version enables collaborative annotation workflows where multiple people can work on the same project.&lt;/p&gt;

&lt;p&gt;No business logic was duplicated. The web routes call the same functions the desktop workers call.&lt;/p&gt;
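&lt;p&gt;The pattern behind those routes is framework-agnostic and worth seeing on its own: the POST handler registers a task and returns its ID immediately, a background worker mutates the shared record, and the WebSocket handler streams that record to the client. The sketch below uses plain threads and made-up names, not Orvex's actual route code.&lt;/p&gt;

```python
# Minimal in-memory task registry behind a "return task_id now, stream
# progress later" API. A web framework would wrap start_task in a POST
# handler and push tasks[task_id] over a WebSocket.
import threading
import time
import uuid

tasks = {}                                   # task_id -> mutable status record
lock = threading.Lock()

def start_task(work, total):
    task_id = str(uuid.uuid4())
    with lock:
        tasks[task_id] = {"status": "running", "progress": 0, "total": total}
    def runner():
        for i in range(total):
            work(i)
            with lock:
                tasks[task_id]["progress"] = i + 1
        with lock:
            tasks[task_id]["status"] = "done"
    threading.Thread(target=runner).start()
    return {"data": {"task_id": task_id}, "error": None}   # immediate response

resp = start_task(lambda i: None, total=200)
task_id = resp["data"]["task_id"]
time.sleep(0.5)                              # stand-in for the client polling
print(tasks[task_id]["status"], tasks[task_id]["progress"])
```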


&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the data model.&lt;/strong&gt; I wrote &lt;code&gt;core/models.py&lt;/code&gt; before anything else. Every module knows exactly what shape of data it receives and what it returns. Pydantic v2 catches type mismatches at the function boundary, not three calls deep in a traceback. This was the single best decision.&lt;/p&gt;
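&lt;p&gt;What "catches type mismatches at the function boundary" means in practice, sketched with Pydantic v2. &lt;code&gt;FrameRecord&lt;/code&gt; is a made-up stand-in for the kinds of shapes &lt;code&gt;core/models.py&lt;/code&gt; would define, not an actual Orvex model.&lt;/p&gt;

```python
# A bad payload fails here, at construction, not three calls deep.
from pydantic import BaseModel, ValidationError

class FrameRecord(BaseModel):
    index: int
    timestamp_s: float
    path: str

ok = FrameRecord(index=3, timestamp_s=0.015, path="frames/000003.png")
print(ok.timestamp_s)

try:
    FrameRecord(index="not-a-number", timestamp_s=0.015, path="x.png")
except ValidationError as e:
    print("rejected at the boundary:", len(e.errors()), "error(s)")
```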

&lt;p&gt;&lt;strong&gt;Resist premature abstraction.&lt;/strong&gt; The first version of the synchronizer tried to handle N arbitrary devices with a plugin system. The second version handles exactly three concrete methods with explicit code paths. Half the lines. Twice as debuggable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test with real recordings, not synthetic data.&lt;/strong&gt; A CSV with perfectly uniform timestamps at exactly 200.000Hz will never expose the bug where real GoPro GPMF timestamps jitter by plus or minus 50 microseconds between samples. Every core module was tested against actual GoPro Hero 11 and Insta360 X4 files before being marked complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make errors actionable.&lt;/strong&gt; "Error occurred" is not an error message. "HyperSmooth is enabled in GH010001.MP4. Re-record with HyperSmooth OFF. Path: Settings &amp;gt; Stabilization &amp;gt; Off" is an error message. Every exception in Orvex tells the user what went wrong and what to do about it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Core business logic modules&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop UI panels&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web API endpoints&lt;/td&gt;
&lt;td&gt;96+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;React frontend pages&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of Python and JavaScript&lt;/td&gt;
&lt;td&gt;~60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI models integrated&lt;/td&gt;
&lt;td&gt;7 (YOLOv8, SegFormer, DepthAnything, ByteTrack, COLMAP, ORBSLAM3, UFLD)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export formats&lt;/td&gt;
&lt;td&gt;EuRoC, ROS bag, HDF5, CVAT XML, YOLO txt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test files&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Sherin-SEF-AI/Orvex.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Orvex
python3.11 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nv"&gt;PYTHONPATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; python &lt;span class="nt"&gt;-m&lt;/span&gt; desktop.main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requirements: Python 3.11+, FFmpeg, exiftool. GPU recommended for AI features but not required.&lt;/p&gt;

&lt;p&gt;Orvex is MIT licensed. If you are collecting multi-sensor data for robotics, autonomous navigation, road condition monitoring, or dataset research, particularly in environments where Western datasets and assumptions do not apply, take a look.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Sherin-SEF-AI" rel="noopener noreferrer"&gt;
        Sherin-SEF-AI
      &lt;/a&gt; / &lt;a href="https://github.com/Sherin-SEF-AI/Orvex" rel="noopener noreferrer"&gt;
        Orvex
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Data collection and processing platform for autonomous vehicle development
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Orvex&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Production-grade data pipeline for autonomous rover dataset collection, processing, and perception.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Orvex handles the full lifecycle of multi-sensor data — from raw GoPro/Insta360/Android recordings through telemetry extraction, calibration, synchronization, dataset assembly, auto-labeling, training, 3D reconstruction, SLAM validation, and edge deployment. Built for Indian road conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Sherin Joseph Roy
&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/Sherin-SEF-AI/Orvex.git" rel="noopener noreferrer"&gt;https://github.com/Sherin-SEF-AI/Orvex.git&lt;/a&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT
&lt;strong&gt;Python:&lt;/strong&gt; 3.11+&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Demos&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;IMU Telemetry Graph&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Sherin-SEF-AI/Orvex/media/imu-graph.gif"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FSherin-SEF-AI%2FOrvex%2Fmedia%2Fimu-graph.gif" alt="IMU Telemetry Graph"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Depth Estimation&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Sherin-SEF-AI/Orvex/media/depthestimation.gif"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FSherin-SEF-AI%2FOrvex%2Fmedia%2Fdepthestimation.gif" alt="Depth Estimation"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Frame Extraction&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Sherin-SEF-AI/Orvex/media/frame-extraction.gif"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FSherin-SEF-AI%2FOrvex%2Fmedia%2Fframe-extraction.gif" alt="Frame Extraction"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Table of Contents&lt;/h2&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#demos" rel="noopener noreferrer"&gt;Demos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#supported-devices" rel="noopener noreferrer"&gt;Supported Devices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#installation" rel="noopener noreferrer"&gt;Installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#desktop-application" rel="noopener noreferrer"&gt;Desktop Application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#web-application" rel="noopener noreferrer"&gt;Web Application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#complete-feature-reference" rel="noopener noreferrer"&gt;Complete Feature Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#a-to-z-usage-guide" rel="noopener noreferrer"&gt;A-to-Z Usage Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#calibration-workflow" rel="noopener noreferrer"&gt;Calibration Workflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#ai-models" rel="noopener noreferrer"&gt;AI Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#export-formats" rel="noopener noreferrer"&gt;Export Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#api-reference" rel="noopener noreferrer"&gt;API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#testing" rel="noopener noreferrer"&gt;Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#project-structure" rel="noopener noreferrer"&gt;Project Structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/Orvex#system-requirements" rel="noopener noreferrer"&gt;System Requirements&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Orvex is a dual-interface application: a PyQt6 desktop app and a FastAPI + React web app. Both share identical business logic through the &lt;code&gt;core/&lt;/code&gt; module — zero code duplication.&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;                  +------------------+
                  |     core/        |   33 pure-Python modules
                  |  (business logic)|   No UI imports
                  +--------+---------+
                           |
              +------------+------------+
              |                         |
    +---------+----------+   +----------+---------+
    |   desktop/         |&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Sherin-SEF-AI/Orvex" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;&lt;em&gt;I am Sherin Joseph Roy. I build tools for autonomous systems operating on Indian roads: unstructured surfaces, mixed traffic, limited GPS coverage, and no lane markings. If you are working on similar problems, I would like to hear from you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>robotics</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Thought I Understood the Autonomous Vehicle Problem. Indian Roads Corrected Me.</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Sat, 21 Mar 2026 10:55:16 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/i-built-a-real-time-perception-system-and-tested-it-on-indian-roads-the-results-were-humbling-1an5</link>
      <guid>https://dev.to/sherinjosephroy/i-built-a-real-time-perception-system-and-tested-it-on-indian-roads-the-results-were-humbling-1an5</guid>
      <description>&lt;p&gt;Let me describe a specific moment from one of my test drives.&lt;/p&gt;

&lt;p&gt;The system had 33 simultaneous object IDs active in frame. Trucks, motorcycles, pedestrians, autos, cars. All moving, all tracked, all getting individual TTC calculations in real time. The collision warning was firing. Processing latency was sitting at 16ms. The road looked like controlled chaos from the outside.&lt;/p&gt;

&lt;p&gt;From inside the car, it just looked like a normal Bangalore afternoon.&lt;/p&gt;

&lt;p&gt;That was the moment I realized how broken most ADAS benchmarks are for this part of the world.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Orvex Actually Is
&lt;/h2&gt;

&lt;p&gt;Orvex is a multi-camera real-time perception system I built specifically for Indian urban driving conditions. Not adapted from something Western. Built from scratch with Indian roads as the primary design constraint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1ocgzad7d4f0peca4ap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1ocgzad7d4f0peca4ap.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system runs 4 simultaneous camera feeds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary forward-facing perception channel&lt;/li&gt;
&lt;li&gt;Dedicated pedestrian and license plate detection channel&lt;/li&gt;
&lt;li&gt;Optical flow motion analysis channel&lt;/li&gt;
&lt;li&gt;Wide-angle coverage feed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The perception dashboard tracks in real time: multi-class object detection across cars, trucks, buses, motorcycles, and pedestrians. Persistent ID-based tracking that survives occlusion. Per-object distance in meters and Time to Collision in seconds. Lane status and lateral offset. Risk classification from INFO to CRITICAL. Scene metadata including road type, traffic density, and visibility score.&lt;/p&gt;

&lt;p&gt;Processing latency runs between 15 and 17ms. Everything runs locally. No cloud dependency, no edge server offload. Just a laptop mounted in the car.&lt;/p&gt;

&lt;p&gt;Here is actual footage from real road tests:&lt;/p&gt;

&lt;h2&gt;
  
  
    &lt;iframe src="https://www.youtube.com/embed/xc3Cs2sERjI"&gt;
  &lt;/iframe&gt;

&lt;/h2&gt;

&lt;h2&gt;
  
  
  Three Things That Failed Immediately
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mokc4srdwvi7l72rmah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mokc4srdwvi7l72rmah.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I expected the system to struggle. I did not expect it to fail in the specific ways it did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The behavior scorer scored everyone zero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built a 0 to 100 driver safety scoring module. Every single test run on Indian roads returned a score of 0 out of 100 with the label AGGRESSIVE. Not because the driving was reckless. Because the scorer was calibrated on assumptions that do not apply here.&lt;/p&gt;

&lt;p&gt;Hard braking, tight gap acceptance, rapid directional changes: these are aggression markers in Western driving norms. In Bangalore or Kochi, they are baseline competence. You cannot navigate a city intersection without doing all three simultaneously. The model was correct by its own logic. It was just the wrong logic entirely.&lt;/p&gt;

&lt;p&gt;I had to disable the scorer and rethink it from the ground up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTC becomes useless when trajectories are nonlinear.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time to Collision assumes some continuity of movement. Object is at distance X, moving at velocity V, TTC is X divided by V. Clean math.&lt;/p&gt;

&lt;p&gt;Motorcycles in Indian traffic do not follow continuous trajectories. They operate on opportunistic pathing: constantly scanning for gaps, switching lanes without signaling, responding to micro-gaps in traffic that open and close in under a second. A 0.2s TTC reading is not an early warning. It is a post-hoc notification.&lt;/p&gt;
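&lt;p&gt;The clean math and its failure mode, in numbers. The distances and closing speeds below are illustrative; only the 0.2s reading comes from the text.&lt;/p&gt;

```python
# Constant-velocity TTC: distance over closing speed. Correct math, wrong
# model when the object's trajectory is opportunistic rather than continuous.
def ttc_seconds(distance_m, closing_speed_mps):
    if closing_speed_mps <= 0:
        return float("inf")               # opening gap: no predicted collision
    return distance_m / closing_speed_mps

# A car 24 m ahead, closing at 8 m/s: a comfortable 3 s warning.
print(ttc_seconds(24.0, 8.0))
# A motorcycle that cuts into a gap 1.6 m away at the same closing speed:
# the "warning" arrives 0.2 s before contact.
print(ttc_seconds(1.6, 8.0))
```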

&lt;p&gt;This is a fundamental behavioral prediction problem, not a detection problem. Orvex catches the object. It cannot yet predict what the object is about to do. That gap is the real unsolved problem in urban AV for high-density mixed traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracker ID counts exposed the true scale of the problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By mid-session in the second road test, tracker IDs were in the 1100s. That means the system had individually identified and tracked over 1100 distinct objects across the session. In roughly 50 minutes of driving.&lt;/p&gt;

&lt;p&gt;Western AV test datasets do not have this density. nuScenes scenes average around 30 to 40 annotated objects. We were hitting 33 simultaneously active tracks in a single frame in a parking lot. The computational budget assumptions that underpin most published perception architectures are simply not calibrated for this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing Nobody Talks About: Benchmark Hallucination
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf0g9gw0ozbcpjhkevg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf0g9gw0ozbcpjhkevg7.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an uncomfortable truth about AV development.&lt;/p&gt;

&lt;p&gt;A model that hits state of the art on nuScenes, Waymo Open, or KITTI is not a model that works. It is a model that works on those datasets. That is not the same thing.&lt;/p&gt;

&lt;p&gt;The entire industry optimizes for benchmark performance because that is how research gets published, how companies get funded, and how progress gets measured. Benchmarks are a useful proxy. In markets where the real-world distribution diverges heavily from benchmark data, that proxy fails completely.&lt;/p&gt;

&lt;p&gt;Orvex performs worse than several open-source ADAS baselines on standard benchmarks. The FPS fluctuates under density load. The tracker gets stressed at peak object count. The behavior scorer had to be scrapped.&lt;/p&gt;

&lt;p&gt;But it runs on real Indian roads. It catches real collision threats on streets that do not exist in any benchmark dataset. It handles traffic compositions that academic datasets have never seen.&lt;/p&gt;

&lt;p&gt;That gap between benchmark performance and deployment performance is the central problem of applied AV work. The teams that understand it build systems that actually work. The teams that do not build great benchmark numbers and then wonder why their system freezes at a Bangalore intersection.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Optical Flow Channel Was an Accident That Became Essential
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79xdknep7t3g5tnt26b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79xdknep7t3g5tnt26b2.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
The motion analysis feed running optical flow was added almost as an afterthought. It turned out to be one of the most operationally useful parts of the entire system.&lt;/p&gt;

&lt;p&gt;In dense traffic, optical flow captures motion vectors for everything in the scene, not just objects that have cleared the detection confidence threshold. Partially occluded vehicles. Objects near frame boundaries. Fast-moving targets that blur enough to drop below the detector's confidence cutoff.&lt;/p&gt;

&lt;p&gt;In practice it functions as a soft pre-detection layer. The primary pipeline gives you identity, class, and distance. The optical flow gives you motion context for objects that are not yet fully resolved. In the chaos frames (30-plus active objects, overlapping bounding boxes, collision warnings firing), the optical flow channel is the thing that keeps the system from being completely blind to unclassified threats.&lt;/p&gt;

&lt;p&gt;I did not design it that way. The road taught me that was necessary.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next and Why It Changes Everything
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaj4t4qngeleuxihrkqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaj4t4qngeleuxihrkqj.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
Here is something I have not talked about publicly until now.&lt;/p&gt;

&lt;p&gt;I have been in conversations with a friend based in Bangalore. Through his connections, we have access to a 200-vehicle electric fleet in active service across the city. Real routes, real operational data, real urban driving at scale, every single day.&lt;/p&gt;

&lt;p&gt;We are planning to build a full autonomous vehicle system from scratch together.&lt;/p&gt;

&lt;p&gt;Not retrofit. Not integrate a third-party stack. From scratch. Data collection infrastructure, annotation pipelines, perception model training, sensor integration, edge deployment. All of it.&lt;/p&gt;

&lt;p&gt;The fleet is the asset that changes the equation. Most AV startups spend years and tens of millions building access to what we already have: a real operational environment with 200 vehicles worth of driving data across one of the densest urban road networks in the world. Every route, every intersection, every edge case: ours to instrument and learn from.&lt;/p&gt;

&lt;p&gt;The EV platform matters for a less obvious reason. Clean electrical architecture, no combustion powertrain complexity, standardized actuation interfaces. Integrating drive-by-wire controls with a custom AV stack is a significantly cleaner problem on an EV than retrofitting a conventional vehicle. The integration surface is known and controllable.&lt;/p&gt;

&lt;p&gt;The plan in three phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1.&lt;/strong&gt; Data infrastructure. Instrument the fleet, build the collection and annotation pipeline, start generating a proprietary Indian urban driving dataset that does not exist anywhere else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2.&lt;/strong&gt; Rebuild the Orvex perception stack properly. Multi-camera calibration done right. BEV fusion. Behavioral prediction models trained on local data, not transferred from Waymo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3.&lt;/strong&gt; Vehicle integration. Closed-course autonomy first, then expanding the operational domain incrementally with real data informing every decision.&lt;/p&gt;

&lt;p&gt;This is not a short timeline. But the foundation is real. The fleet is real. The perception work from Orvex is real. The distance between where we are and a working prototype is smaller than it looks from outside.&lt;/p&gt;




&lt;h2&gt;
  
  
  The One Thing I Would Tell Anyone Starting in AV
&lt;/h2&gt;

&lt;p&gt;Test on your actual deployment environment from day one. Not when the system is "ready." Day one.&lt;/p&gt;

&lt;p&gt;The failures you discover in your real environment are not setbacks. They are the curriculum. Every broken assumption (the behavior scorer, the TTC model, the density ceiling) became a design requirement that made the system more honest about what it actually needs to do.&lt;/p&gt;

&lt;p&gt;Simulation matters. I have built months of Indian road scenarios in CARLA and the synthetic data work has real value. But simulation is a tool for exploring the design space. It is not a substitute for the road.&lt;/p&gt;

&lt;p&gt;The road has opinions. You should hear them early.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Orvex is part of the broader perception and safety work at PerceptionAV. If you are building in AV, edge perception, or safety intelligence for high-density urban environments, especially outside Western road contexts, I would like to compare notes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>autonomouscar</category>
      <category>ai</category>
      <category>computervision</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Building a Voice-Controlled Browser Agent with Three Gemini Models</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Mon, 16 Mar 2026 04:38:08 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/building-a-voice-controlled-browser-agent-with-three-gemini-models-3j3h</link>
      <guid>https://dev.to/sherinjosephroy/building-a-voice-controlled-browser-agent-with-three-gemini-models-3j3h</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was written as part of my submission to the Gemini Live Agent Challenge hackathon on Devpost. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem That Started This
&lt;/h2&gt;

&lt;p&gt;My grandmother owns a smartphone. She has a broadband connection. She cannot book a train ticket online.&lt;/p&gt;

&lt;p&gt;This is not a technology access problem. She has the hardware and the connectivity. What she cannot do is navigate a website. She does not understand dropdown menus. She cannot read small text on form labels. She does not know what "Enter OTP" means. When something goes wrong, she sees an error message she cannot parse and hands the phone to someone younger.&lt;/p&gt;

&lt;p&gt;She is not alone. According to government data, 85% of India's elderly population cannot independently use digital services. Over 900 million people globally are in the same situation. India moved pensions online, digitized Aadhaar, made train booking web-only, and shifted bill payments to portals. The interfaces got built. The people who need these services the most got left behind.&lt;/p&gt;

&lt;p&gt;I wanted to build something that removes the interface from the equation entirely. Not a simpler interface. Not a tutorial. A system where the user speaks what they need and the computer handles the rest.&lt;/p&gt;

&lt;p&gt;That project became SAHAY.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SAHAY Does
&lt;/h2&gt;

&lt;p&gt;SAHAY listens to the user in their language. Hindi, Malayalam, Tamil, Telugu, English, or any of 24 supported languages. It opens a real Chromium browser, finds the correct website, navigates through the pages, fills forms, clicks buttons, and speaks the results back.&lt;/p&gt;

&lt;p&gt;The user says "Amazon par earbuds dikhao 1000 rupaye se kam" and SAHAY opens Amazon with a price-filtered search, reads the results, and reports the top options with prices. In Hindi. Because that is the language the user spoke.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs89skrgvx8jggicxi44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs89skrgvx8jggicxi44.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7kzjezjaadccvaiugu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7kzjezjaadccvaiugu6.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user says "Download my Aadhaar card" and SAHAY navigates to the UIDAI portal, asks for the Aadhaar number, repeats it back digit by digit for confirmation, enters it, and proceeds through the OTP flow.&lt;/p&gt;

&lt;p&gt;Before any login, payment, or form submission, SAHAY stops and asks for permission. The user confirms by voice or by clicking a button. For passwords and CAPTCHAs, the user can take direct control by clicking on the browser screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Agent Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tcqx466xn5kx5wazt7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tcqx466xn5kx5wazt7w.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SAHAY runs three separate Gemini agents that coordinate to complete each task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent 1: The Planner
&lt;/h3&gt;

&lt;p&gt;The Planner runs on Gemini 2.5 Flash with Google Search grounding through the GenAI SDK. When the user describes what they want, the Planner searches the internet in real time to find the correct website. It does not use hardcoded URLs. It does not rely on a static list of known portals. It searches, reads the results, and identifies the right destination.&lt;/p&gt;

&lt;p&gt;This matters because websites change. The UIDAI download page moved URLs twice in the past year. Government portals restructure their navigation without warning. A hardcoded URL from last month might 404 today. The Planner always researches the current state before creating a plan.&lt;/p&gt;

&lt;p&gt;After finding the target, the Planner creates a structured execution plan with step-by-step instructions, visual descriptions of what to look for on each page, and flags for which steps involve sensitive data.&lt;/p&gt;

&lt;p&gt;The Planner is a separate ADK agent because Google Search grounding cannot be combined with other tools in the same agent. This constraint shaped the architecture. It turned out to be the right design anyway because it cleanly separates research from execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;google_search&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleSearch&lt;/span&gt;&lt;span class="p"&gt;())],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Agent 2: The Browser
&lt;/h3&gt;

&lt;p&gt;The Browser Agent runs on Gemini's Computer Use model (&lt;code&gt;gemini-2.5-computer-use-preview-10-2025&lt;/code&gt;). It receives the plan from the Planner and executes it step by step.&lt;/p&gt;

&lt;p&gt;The execution loop works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a screenshot of the current browser state via Playwright&lt;/li&gt;
&lt;li&gt;Send the screenshot to the Computer Use model along with the current plan step&lt;/li&gt;
&lt;li&gt;The model analyzes the screenshot visually and returns coordinates for where to click or what to type&lt;/li&gt;
&lt;li&gt;Playwright executes the action against the real browser&lt;/li&gt;
&lt;li&gt;Take a new screenshot&lt;/li&gt;
&lt;li&gt;Repeat until the task is complete&lt;/li&gt;
&lt;/ol&gt;
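&lt;p&gt;A minimal sketch of that loop, with stand-in functions (none of these names come from SAHAY's actual code):&lt;/p&gt;

```python
# Simplified look-act loop: observe the screen, let the model decide,
# apply the action, repeat until the model reports completion.
def run_step(page, plan_step, model, max_actions=10):
    for _ in range(max_actions):
        screenshot = page["capture"]()         # 1. screenshot current state
        action = model(screenshot, plan_step)  # 2. model picks an action
        if action["type"] == "done":           # model says the step is finished
            return True
        page["apply"](action)                  # 3. execute, then loop
    return False  # step did not converge within the action budget

# Toy harness: a "model" that clicks once, then reports done.
applied = []
def toy_model(shot, step):
    if applied:
        return {"type": "done"}
    return {"type": "click", "x": 500, "y": 300}

page = {"capture": lambda: b"png-bytes", "apply": applied.append}
finished = run_step(page, "open the search box", toy_model)
```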

&lt;p&gt;The Browser Agent does not read HTML. It does not use CSS selectors to understand page layout. It does not call any website APIs. It looks at the screenshot the same way a person would look at a screen and decides what to do next. This means it works on any website without site-specific configuration.&lt;/p&gt;

&lt;p&gt;The Computer Use model outputs normalized coordinates (0 to 999). SAHAY converts these to actual pixel positions on the 1440x900 viewport:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;actual_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized_x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1440&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;actual_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized_y&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Browser Agent is wrapped in ADK's ComputerUseToolset, which manages the screenshot-action loop and handles the coordinate conversion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent 3: The Voice
&lt;/h3&gt;

&lt;p&gt;The Voice Agent runs on Gemini 2.5 Flash Native Audio through the Live API. It handles all communication with the user through bidirectional audio streaming.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/lu02QIqr-aM"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;The user's microphone audio is captured by the browser using AudioWorklet, converted to PCM 16-bit 16kHz mono, and streamed over a WebSocket to the FastAPI backend. The backend pipes this audio into the Live API session. When Gemini responds with audio, it streams back through the same WebSocket to the browser for playback.&lt;/p&gt;
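&lt;p&gt;The conversion step is the usual float-to-PCM16 mapping. The real implementation lives in the browser's AudioWorklet in JavaScript; a Python equivalent for illustration:&lt;/p&gt;

```python
def float_to_pcm16(samples):
    """Map float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    out = bytearray()
    for s in samples:
        clamped = max(-1.0, min(1.0, s))  # guard against clipping
        out += int(clamped * 32767).to_bytes(2, "little", signed=True)
    return bytes(out)
```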

&lt;p&gt;The Voice Agent automatically detects the user's language and responds in the same language. If the user starts in English and switches to Hindi mid-sentence, the response comes back in Hindi. No language selection menu. No configuration.&lt;/p&gt;

&lt;p&gt;For sensitive inputs like Aadhaar numbers and phone numbers, the Voice Agent repeats back what it heard and waits for explicit confirmation before passing the data to the Browser Agent. This prevents the misheard-digit problem that would otherwise cause the entire flow to fail silently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;voice_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sahay_voice_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-live-2.5-flash-native-audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VOICE_INSTRUCTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;plan_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;browser_action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
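&lt;p&gt;The repeat-back-and-confirm flow for sensitive numbers can be sketched like this (helper names are illustrative, and the set of accepted confirmations is an assumption):&lt;/p&gt;

```python
WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_out(number_string):
    """Read a number back digit by digit, the way the agent would speak it."""
    return ", ".join(WORDS[d] for d in number_string if d in WORDS)

def confirm_digits(heard, user_reply):
    """Only release the number to the Browser Agent after an explicit yes."""
    prompt = f"I heard {spell_out(heard)}. Is that correct?"
    approved = user_reply.strip().lower() in {"yes", "haan"}
    return prompt, (heard if approved else None)
```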



&lt;h2&gt;
  
  
  Google Cloud Services
&lt;/h2&gt;

&lt;p&gt;SAHAY uses three Google Cloud services in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vertex AI&lt;/strong&gt; hosts the Gemini model endpoints. The Voice Agent connects to the Live API through Vertex AI, and the Planner Agent calls Gemini Flash with Google Search grounding through Vertex AI. The one exception is the Computer Use model, whose calls go through the Gemini API directly using an API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Firestore&lt;/strong&gt; stores task logs, session state, and workflow recordings. Every task gets a document with the task description, each step taken, screenshots at key moments, the final outcome, and timestamps. This serves as an audit trail and also powers the workflow replay feature where repeated tasks execute faster by following a previously recorded path.&lt;/p&gt;
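&lt;p&gt;The shape of such a task document might look like this (field names are my assumptions; the post only specifies that each task records its description, steps, screenshots, outcome, and timestamps):&lt;/p&gt;

```python
import time

def make_task_log(description, steps, outcome):
    """Build the dict that would be written as one Firestore document."""
    return {
        "task": description,
        "steps": [
            {"action": step, "screenshot_ref": f"screens/{i}.png"}
            for i, step in enumerate(steps)
        ],
        "outcome": outcome,
        "created_at": time.time(),
    }
```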

&lt;p&gt;&lt;strong&gt;Cloud Run&lt;/strong&gt; hosts the containerized application. The Dockerfile installs Playwright and Chromium inside the container, so the browser automation works in the cloud environment. The deploy script and Terraform configuration automate the entire deployment process.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Made This Hard
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Google CAPTCHA.&lt;/strong&gt; The first version of SAHAY used headless Chromium to search Google directly. After a few searches, Google would show a CAPTCHA and the agent would get stuck on the verification page, clicking randomly and wasting steps. Moving search to the Planner Agent via Google Search grounding API eliminated this problem completely. The browser never touches Google Search anymore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot detection.&lt;/strong&gt; IRCTC, MakeMyTrip, and several banking portals detect Playwright and refuse to load the page. Stealth flags and spoofed user agents helped with some sites but not all. The solution was building a smart browser selection system. SAHAY analyzes the target URL and task description and decides whether to use headless Chromium (fast, works for most sites) or a headed browser window (slower, but bypasses bot detection on protected sites). This decision happens automatically per task.&lt;/p&gt;
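&lt;p&gt;A simplified version of that per-task decision (the hint list is a made-up stand-in for whatever signals SAHAY actually uses):&lt;/p&gt;

```python
# Domains known to block headless automation get a headed window;
# everything else runs headless for speed.
PROTECTED_HINTS = ("irctc", "makemytrip", "bank")

def choose_browser(url):
    host = url.lower()
    if any(hint in host for hint in PROTECTED_HINTS):
        return "headed"
    return "headless"
```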

&lt;p&gt;&lt;strong&gt;The Computer Use model finishing early.&lt;/strong&gt; The ADK runner's &lt;code&gt;run_async()&lt;/code&gt; generator exits when the model returns a text response without a function call. The model would sometimes describe what it sees on screen instead of clicking on it, which would end the task prematurely after two or three steps. The fix was a continuation loop that detects when the model exits without reporting completion, re-prompts it with "You have not finished the task. Take an action.", and resumes execution. This loop runs up to three times before giving up.&lt;/p&gt;
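&lt;p&gt;The continuation loop reduces to something like this (a sketch; &lt;code&gt;run_once&lt;/code&gt; stands in for one pass of the ADK runner):&lt;/p&gt;

```python
NUDGE = "You have not finished the task. Take an action."

def run_with_continuation(run_once, max_retries=3):
    """run_once(prompt) returns 'done' or the model's bare text reply."""
    prompt = None
    for _ in range(max_retries):
        result = run_once(prompt)
        if result == "done":
            return True
        prompt = NUDGE  # model stopped early without finishing; re-prompt
    return False

# Toy runner that describes the screen twice before finishing.
attempts = []
def flaky_runner(prompt):
    attempts.append(prompt)
    return "done" if len(attempts) == 3 else "I can see a login form."

ok = run_with_continuation(flaky_runner)
```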

&lt;p&gt;&lt;strong&gt;Voice number accuracy.&lt;/strong&gt; The Live API voice model occasionally mishears digits. "9895" becomes "9985". For an Aadhaar number, a single wrong digit means the download fails and the user does not understand why. The repeat-back-and-confirm pattern solved this. It adds a few seconds to each interaction but prevents silent failures that would destroy user trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent framework&lt;/td&gt;
&lt;td&gt;Google ADK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice model&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash Native Audio (Live API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser model&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Computer Use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planner model&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash + Google Search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser automation&lt;/td&gt;
&lt;td&gt;Playwright (Chromium)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;FastAPI + WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Vanilla JavaScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;Google Cloud Firestore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosting&lt;/td&gt;
&lt;td&gt;Google Cloud Run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IaC&lt;/td&gt;
&lt;td&gt;Terraform&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;The visual-only approach is the right architectural choice for universality but the wrong choice for speed. Every action requires a full screenshot capture, a round trip to the Gemini API, and coordinate parsing. A hybrid approach that uses visual understanding for navigation decisions but DOM selectors for precise form filling would be significantly faster.&lt;/p&gt;

&lt;p&gt;The three-agent architecture introduces latency at the boundaries. The Planner takes 5 to 15 seconds to research and produce a plan. During this time, the browser sits idle and the user waits in silence. Pre-fetching the target URL while the Planner is still working would cut perceived latency in half.&lt;/p&gt;
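&lt;p&gt;The overlap would be straightforward with asyncio (the sleeps stand in for the real planning and page-load latencies, and the names are illustrative):&lt;/p&gt;

```python
import asyncio

async def plan(task):
    await asyncio.sleep(0.02)  # stands in for 5-15 s of Planner research
    return {"url": "https://example.gov", "steps": ["fill form"]}

async def prefetch(url):
    await asyncio.sleep(0.01)  # stands in for the initial page load
    return f"loaded {url}"

async def run(task, guessed_url):
    # Kick off planning, then load the likely URL while it runs.
    planning = asyncio.create_task(plan(task))
    page = await prefetch(guessed_url)
    return await planning, page

result = asyncio.run(run("download aadhaar", "https://example.gov"))
```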

&lt;p&gt;The continuation loop is a workaround for a fundamental issue with how the Computer Use model signals task completion. A better approach would be refining the prompt so the model always ends with either a function call or an explicit completion message, never a bare text description.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;The full source code is available at:&lt;br&gt;
&lt;a href="https://github.com/Sherin-SEF-AI/Sahay-Voice-First-Digital-Navigator" rel="noopener noreferrer"&gt;github.com/Sherin-SEF-AI/Sahay-Voice-First-Digital-Navigator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built by &lt;a href="https://github.com/Sherin-SEF-AI" rel="noopener noreferrer"&gt;Sherin Joseph Roy&lt;/a&gt;, Head of Products at DeepMost AI.&lt;/p&gt;

&lt;p&gt;This project was built for the &lt;a href="https://devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon, UI Navigator track.&lt;/p&gt;

&lt;h1&gt;
  
  
  GeminiLiveAgentChallenge
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>google</category>
      <category>geminiliveagentchallenge</category>
    </item>
    <item>
      <title>Why my AI crash reconstruction MVP isn't ready for production (and why I'm rebuilding it)</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Sun, 08 Mar 2026 15:24:59 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/why-my-ai-crash-reconstruction-mvp-isnt-ready-for-production-and-why-im-rebuilding-it-5e0k</link>
      <guid>https://dev.to/sherinjosephroy/why-my-ai-crash-reconstruction-mvp-isnt-ready-for-production-and-why-im-rebuilding-it-5e0k</guid>
      <description>&lt;p&gt;We all love the demo phase. You hook up an API, the UI updates, and for a second, the software feels like absolute magic. &lt;/p&gt;

&lt;p&gt;I recently hit that phase with a project called &lt;strong&gt;Incident Lens AI&lt;/strong&gt;. It is a forensic video analysis suite I have been building to automate crash reconstruction for insurance and legal teams. The goal is to take raw dashcam or CCTV footage and turn it into a defensible liability report.&lt;/p&gt;

&lt;p&gt;To validate the idea quickly, I built a frontend-first proof of concept using React, Vite, and the Gemini 3 Pro SDK. I piped the video frames and audio directly from the browser to the LLM and asked it to act as a forensic expert.&lt;/p&gt;

&lt;p&gt;And honestly, it makes for an incredible demo. &lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/QUVeahUrCTg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You drop a video in, and the system instantly starts reasoning about the crash. It generates liability timelines, cites traffic laws, and outputs structured JSON that drives interactive charts on the dashboard. Building it this way let me iterate on the UI and prove the multimodal concept without writing a single line of backend infrastructure.&lt;/p&gt;

&lt;p&gt;But as I transition from pitching a vision to building the actual product, I have to face a hard engineering truth. A cool demo is not a defensible legal tool. &lt;/p&gt;

&lt;p&gt;The architecture I used to validate the idea is the exact architecture I now have to dismantle. Here is why.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Security Issue
&lt;/h3&gt;

&lt;p&gt;First, there is the obvious security issue. Hitting a public LLM API directly from a client application is a complete non-starter when you are dealing with sensitive enterprise data and personally identifiable information. No insurance pilot program will ever approve that.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hallucination Trap
&lt;/h3&gt;

&lt;p&gt;But the much bigger issue is the hallucination trap. &lt;/p&gt;

&lt;p&gt;My current documentation states that the AI calculates vehicle speed using photogrammetry and motion mechanics. The reality is that LLMs are not physics engines. If you ask an LLM to estimate the speed of a car from a 2D video without precise camera calibration, it is just guessing. It might sound incredibly confident, but in a courtroom setting, an "AI-estimated" speed calculation would be destroyed by opposing counsel in seconds. &lt;/p&gt;

&lt;p&gt;You cannot build a forensic tool on prompt engineering alone. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Hybrid Architecture Pivot
&lt;/h3&gt;

&lt;p&gt;So, I am moving away from the pure LLM wrapper approach and building a hybrid architecture. &lt;/p&gt;

&lt;p&gt;I am shifting the heavy lifting to a secure Python backend. The new pipeline will rely on deterministic computer vision models like OpenCV to extract hard, mathematical data from the footage, such as pixel velocities and exact collision coordinates. Once I have those concrete numbers, I will feed them into established physics formulas to get the actual speed and force. &lt;/p&gt;
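&lt;p&gt;The physics step is then ordinary unit conversion rather than model inference. A toy version (the calibration constant and frame rate here are invented for illustration):&lt;/p&gt;

```python
def speed_kmh(pixel_displacement, meters_per_pixel, fps):
    """Convert pixels moved between consecutive frames into km/h."""
    meters_per_frame = pixel_displacement * meters_per_pixel
    meters_per_second = meters_per_frame * fps
    return meters_per_second * 3.6

# e.g. a vehicle tracked moving 12 px/frame, with 0.05 m/px calibration, 30 fps
estimate = speed_kmh(12, 0.05, 30)
```

&lt;p&gt;Every input to that formula is measured or calibrated, not inferred, which is what makes the result defensible.&lt;/p&gt;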

&lt;p&gt;Only then does Gemini re-enter the picture. I will pass those verified, deterministic numbers to the LLM so it can do what it actually excels at: cross-referencing case law, synthesizing the timeline, and writing the final human-readable dossier. &lt;/p&gt;

&lt;p&gt;Building in the public safety and forensics space requires an incredibly high bar for trust and accuracy. It is easy to get caught up in the magic of what generative AI can do out of the box. &lt;/p&gt;

&lt;p&gt;I am leaving the current repository up as a proof of concept because it perfectly illustrates the vision of where multimodal AI is heading. But the real engineering work of making it secure, deterministic, and legally defensible starts now. &lt;/p&gt;

&lt;p&gt;If anyone else is navigating the jump from AI prototype to production in a zero-trust industry, I would love to hear how you are handling it. &lt;/p&gt;

&lt;p&gt;You can check out the frontend prototype here: &lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Sherin-SEF-AI" rel="noopener noreferrer"&gt;
        Sherin-SEF-AI
      &lt;/a&gt; / &lt;a href="https://github.com/Sherin-SEF-AI/Incident-Lens-AI" rel="noopener noreferrer"&gt;
        Incident-Lens-AI
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Incident Lens AI is a professional-grade forensic video analysis suite. It transforms raw crash footage into a defensible legal case file in seconds.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Incident Lens AI 🔍⚖️&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Professional Forensic Video Analysis &amp;amp; Accident Reconstruction Platform&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://youtu.be/QUVeahUrCTg?si=0KiQewMFjjllYqv4" rel="nofollow noopener noreferrer"&gt;https://youtu.be/QUVeahUrCTg?si=0KiQewMFjjllYqv4&lt;/a&gt;&lt;/p&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483923-13a1203b-a241-407d-b3a1-ff6dd1a076e6.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTIzLTEzYTEyMDNiLWEyNDEtNDA3ZC1iM2ExLWZmNmRkMWEwNzZlNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01Mjg5MDEzZDg4ZGNkNWRhZWJiZDdkNGI0YzM1YzBlNmI5YjVkNzBkY2FhZmRiZjNjYTliYzA3YWM2MGEzOTlkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.p7kkj1RS5cJDEUxXsldIdmR9Zf3VBczlkgK8uzb44CM"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-23" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483923-13a1203b-a241-407d-b3a1-ff6dd1a076e6.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTIzLTEzYTEyMDNiLWEyNDEtNDA3ZC1iM2ExLWZmNmRkMWEwNzZlNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01Mjg5MDEzZDg4ZGNkNWRhZWJiZDdkNGI0YzM1YzBlNmI5YjVkNzBkY2FhZmRiZjNjYTliYzA3YWM2MGEzOTlkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.p7kkj1RS5cJDEUxXsldIdmR9Zf3VBczlkgK8uzb44CM"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483922-ab0c59ac-782b-4b9b-9178-064198093b14.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTIyLWFiMGM1OWFjLTc4MmItNGI5Yi05MTc4LTA2NDE5ODA5M2IxNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kZWJkMjA3MGY0MDBlOTgxNjJiMDE5ODgwZTQyNjU5MDA4MGQ2MjAwMzQ5MTVjMWFhNGQ0MTA4ODg5NjlkOTZlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.49cQCoWQvXU9B2MATsNMD13Vx9HM-wlvlurKfJpst5Q"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-25" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483922-ab0c59ac-782b-4b9b-9178-064198093b14.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTIyLWFiMGM1OWFjLTc4MmItNGI5Yi05MTc4LTA2NDE5ODA5M2IxNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kZWJkMjA3MGY0MDBlOTgxNjJiMDE5ODgwZTQyNjU5MDA4MGQ2MjAwMzQ5MTVjMWFhNGQ0MTA4ODg5NjlkOTZlJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.49cQCoWQvXU9B2MATsNMD13Vx9HM-wlvlurKfJpst5Q"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483919-bb72ab02-448a-4d92-8acb-1515e2d18f02.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE5LWJiNzJhYjAyLTQ0OGEtNGQ5Mi04YWNiLTE1MTVlMmQxOGYwMi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03MTU0OTRkOGFkOTkxNGUwYTJhMDQ1ZThkZGIyZTQ2YjEyODg4YWMyMTAyMDgxZTM2NDlkZWI1NzcxOGNkMGU5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.ixtSGPSEvW1-QVW5KgjmE4eP4phtvGrVVeeG64gOCz8"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-28" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483919-bb72ab02-448a-4d92-8acb-1515e2d18f02.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE5LWJiNzJhYjAyLTQ0OGEtNGQ5Mi04YWNiLTE1MTVlMmQxOGYwMi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03MTU0OTRkOGFkOTkxNGUwYTJhMDQ1ZThkZGIyZTQ2YjEyODg4YWMyMTAyMDgxZTM2NDlkZWI1NzcxOGNkMGU5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.ixtSGPSEvW1-QVW5KgjmE4eP4phtvGrVVeeG64gOCz8"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483918-aac5cc23-c691-4050-93d9-6645ea4555fc.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE4LWFhYzVjYzIzLWM2OTEtNDA1MC05M2Q5LTY2NDVlYTQ1NTVmYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hNTk2ZjRiNTU5Nzc3OWYxMmI0OTNlNmZjZjI3YzcxYTQxZjllN2MzODAwMTc3OGYzNzcxMjg2OWM4MDM5YzQ5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.DW6L3NKiyWRHyMOl47WNVjP81Wkgz1Rgf722v9QSCfY"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-30" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483918-aac5cc23-c691-4050-93d9-6645ea4555fc.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE4LWFhYzVjYzIzLWM2OTEtNDA1MC05M2Q5LTY2NDVlYTQ1NTVmYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hNTk2ZjRiNTU5Nzc3OWYxMmI0OTNlNmZjZjI3YzcxYTQxZjllN2MzODAwMTc3OGYzNzcxMjg2OWM4MDM5YzQ5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.DW6L3NKiyWRHyMOl47WNVjP81Wkgz1Rgf722v9QSCfY"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483917-f1e5607c-848d-4cbb-b7ce-4fcfc2892ccb.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE3LWYxZTU2MDdjLTg0OGQtNGNiYi1iN2NlLTRmY2ZjMjg5MmNjYi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wMzkwYmM3YWE1Y2VhOWJjYjE4YmM0MDc0NzM0MzM0YzU1MzhlY2FjMzRkZmU3ZTc3ZjAxODFhM2Y1YjY4ZGQyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.zIfr5lYtatw2bbL-AW0BAhaLqrt_KWzL-DmiwH_MCI4"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-34" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483917-f1e5607c-848d-4cbb-b7ce-4fcfc2892ccb.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE3LWYxZTU2MDdjLTg0OGQtNGNiYi1iN2NlLTRmY2ZjMjg5MmNjYi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wMzkwYmM3YWE1Y2VhOWJjYjE4YmM0MDc0NzM0MzM0YzU1MzhlY2FjMzRkZmU3ZTc3ZjAxODFhM2Y1YjY4ZGQyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.zIfr5lYtatw2bbL-AW0BAhaLqrt_KWzL-DmiwH_MCI4"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483914-5131b9a9-cd51-44e0-b082-6789396da3fa.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE0LTUxMzFiOWE5LWNkNTEtNDRlMC1iMDgyLTY3ODkzOTZkYTNmYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03YzQ4NjAxODY4NmQxZDE2ZGVkN2QwZDJlMjU0NGJkZDkyM2YzNTNlODE3NDIwNjVmZmFiZjliODBiNGM2NDFhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.q80AwN3kjLOt8E0IJDTUKBNoZuNXlMbaZ7SJdIKRtRY"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-36" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483914-5131b9a9-cd51-44e0-b082-6789396da3fa.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTE0LTUxMzFiOWE5LWNkNTEtNDRlMC1iMDgyLTY3ODkzOTZkYTNmYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03YzQ4NjAxODY4NmQxZDE2ZGVkN2QwZDJlMjU0NGJkZDkyM2YzNTNlODE3NDIwNjVmZmFiZjliODBiNGM2NDFhJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.q80AwN3kjLOt8E0IJDTUKBNoZuNXlMbaZ7SJdIKRtRY"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483912-1a825449-37da-44ad-835c-eaefa3a5312d.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTEyLTFhODI1NDQ5LTM3ZGEtNDRhZC04MzVjLWVhZWZhM2E1MzEyZC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lNTA4NDEzYWRkNGYyZDU4Mjg2N2IzYmRkOTBmNTgwZmIyMzhkNzE5YTI3MjAxMmU0ZGU3MjZjNDE1OWE0ODg1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.Il8J39QTNTaT-XINQLafEUeb8YxAfLhHL7xBZLcWLns"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-41" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483912-1a825449-37da-44ad-835c-eaefa3a5312d.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTEyLTFhODI1NDQ5LTM3ZGEtNDRhZC04MzVjLWVhZWZhM2E1MzEyZC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lNTA4NDEzYWRkNGYyZDU4Mjg2N2IzYmRkOTBmNTgwZmIyMzhkNzE5YTI3MjAxMmU0ZGU3MjZjNDE1OWE0ODg1JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.Il8J39QTNTaT-XINQLafEUeb8YxAfLhHL7xBZLcWLns"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483911-52e1e3ed-77da-4d56-bc6f-1b1684040339.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTExLTUyZTFlM2VkLTc3ZGEtNGQ1Ni1iYzZmLTFiMTY4NDA0MDMzOS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03YTJiMmNiNzI4MTNjM2RhYzhjMDVhMTA3MDAyNTEwZjAzNjk0MGVlMTgwZmI5OWE3NTVjMzBkNDUyZTI3MmRjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.kfHX7qRMM5kr-hgz-QT7XEvA5VteJa39xFqciMFbVP0"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-45" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483911-52e1e3ed-77da-4d56-bc6f-1b1684040339.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTExLTUyZTFlM2VkLTc3ZGEtNGQ1Ni1iYzZmLTFiMTY4NDA0MDMzOS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03YTJiMmNiNzI4MTNjM2RhYzhjMDVhMTA3MDAyNTEwZjAzNjk0MGVlMTgwZmI5OWE3NTVjMzBkNDUyZTI3MmRjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.kfHX7qRMM5kr-hgz-QT7XEvA5VteJa39xFqciMFbVP0"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483907-d227a2a0-f47e-409b-8588-042fce932518.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTA3LWQyMjdhMmEwLWY0N2UtNDA5Yi04NTg4LTA0MmZjZTkzMjUxOC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04MWI0OWU0NGYxMTFjMzU5OTUwNmE4MDU2ZGI4MjYzOGU2MDlkZWZlMGU5ZDFmMTA2N2M3YTdlMTA1NDIzYTUxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.09q5k4yOSqOkE_kaY_Ju7Yn1XA2vqZyz04fvIE6tS_8"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-48" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483907-d227a2a0-f47e-409b-8588-042fce932518.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTA3LWQyMjdhMmEwLWY0N2UtNDA5Yi04NTg4LTA0MmZjZTkzMjUxOC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04MWI0OWU0NGYxMTFjMzU5OTUwNmE4MDU2ZGI4MjYzOGU2MDlkZWZlMGU5ZDFmMTA2N2M3YTdlMTA1NDIzYTUxJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.09q5k4yOSqOkE_kaY_Ju7Yn1XA2vqZyz04fvIE6tS_8"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483905-cd4722d6-31d0-4648-837a-88762b317743.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTA1LWNkNDcyMmQ2LTMxZDAtNDY0OC04MzdhLTg4NzYyYjMxNzc0My5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kOWJmZDBhMjdkZjkxOTI1MWVmYWY2Mzc2ZGZkYTFiOWViOWY4ZWNhMTFmNzQ4NGIwOGYwYjc5NTAzYjA1NzUwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.MNarYbobKeB6rMxyV0mJaRiOURUmQRKRMfI7qt6Ut7Y"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-38-54" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483905-cd4722d6-31d0-4648-837a-88762b317743.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTA1LWNkNDcyMmQ2LTMxZDAtNDY0OC04MzdhLTg4NzYyYjMxNzc0My5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1kOWJmZDBhMjdkZjkxOTI1MWVmYWY2Mzc2ZGZkYTFiOWViOWY4ZWNhMTFmNzQ4NGIwOGYwYjc5NTAzYjA1NzUwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.MNarYbobKeB6rMxyV0mJaRiOURUmQRKRMfI7qt6Ut7Y"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483902-beeaf5b2-520e-4612-beae-1321630c67f6.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTAyLWJlZWFmNWIyLTUyMGUtNDYxMi1iZWFlLTEzMjE2MzBjNjdmNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01MDlmNDI1YTZmZTI0NTU3NzA1MWRiOWU2NTdhN2UyZTkzYmVmNDZjZmEyMWIwN2I4NzIzYmE2MWZlZDVhZDMzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.7u_SppHFXO5JF1Ume1Q82XcuMzKk91R36muF-_tldsM"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-39-00" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483902-beeaf5b2-520e-4612-beae-1321630c67f6.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTAyLWJlZWFmNWIyLTUyMGUtNDYxMi1iZWFlLTEzMjE2MzBjNjdmNi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01MDlmNDI1YTZmZTI0NTU3NzA1MWRiOWU2NTdhN2UyZTkzYmVmNDZjZmEyMWIwN2I4NzIzYmE2MWZlZDVhZDMzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.7u_SppHFXO5JF1Ume1Q82XcuMzKk91R36muF-_tldsM"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/530483901-3b836eae-5e6e-4384-be49-b8a8d786f36f.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTAxLTNiODM2ZWFlLTVlNmUtNDM4NC1iZTQ5LWI4YThkNzg2ZjM2Zi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hYTRiZDczM2EzZDc4M2U3N2I5ZjdmMWU0ODlmNTFkZDFhNjk5MjYwNWRlMzY3YWZlY2NkODVkM2Y5OWZhMTZjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.MzoWk5AsZhHcp7so66Fmc7Czb5VcP09LoKlY0ef-4MY"&gt;&lt;img width="1920" height="1080" alt="Screenshot from 2025-12-27 19-39-06" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F530483901-3b836eae-5e6e-4384-be49-b8a8d786f36f.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQzNTk1NzMsIm5iZiI6MTc3NDM1OTI3MywicGF0aCI6Ii8xNjk3MDAxMTkvNTMwNDgzOTAxLTNiODM2ZWFlLTVlNmUtNDM4NC1iZTQ5LWI4YThkNzg2ZjM2Zi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyNFQxMzM0MzNaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1hYTRiZDczM2EzZDc4M2U3N2I5ZjdmMWU0ODlmNTFkZDFhNjk5MjYwNWRlMzY3YWZlY2NkODVkM2Y5OWZhMTZjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.MzoWk5AsZhHcp7so66Fmc7Czb5VcP09LoKlY0ef-4MY"&gt;&lt;/a&gt;
&lt;p&gt;Incident Lens AI is a production-grade application designed for insurance carriers, legal defense teams, and fleet safety managers. It leverages the multimodal capabilities of &lt;strong&gt;Google Gemini 3 Pro&lt;/strong&gt; to transform unstructured video evidence (dashcam, CCTV, bodycam) into legally admissible forensic reconstructions.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Unlike standard video players, Incident Lens AI "reasons" about the footage in real-time, calculating vehicle speeds, inferring traffic signal states from indirect visual cues, and citing specific legal statutes for fault determination.&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🚀 Key Features&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;🧠 Autonomous Reconstruction&lt;/h3&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Physics Engine&lt;/strong&gt;: Automatically calculates vehicle speed ($v=d/t$) using photogrammetry and motion blur mechanics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signal Inference&lt;/strong&gt;: Deduces the state of occluded traffic lights by analyzing cross-traffic flow and pedestrian behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debris Field Analysis&lt;/strong&gt;: Reverse-engineers impact vectors based on glass shard trajectories and fluid spray patterns.&lt;/li&gt;
&lt;/ul&gt;
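&lt;p&gt;As a rough sketch of that v = d/t estimate: the meters-per-pixel scale and frame rate below are made-up illustrative numbers, not values from the app (a real pipeline would derive the scale via photogrammetry from reference objects in the frame).&lt;/p&gt;

```python
# Back-of-envelope speed estimate: v = d / t.
# meters_per_pixel and fps here are illustrative assumptions.
def estimate_speed_mps(pixel_displacement, meters_per_pixel, frames_elapsed, fps):
    distance_m = pixel_displacement * meters_per_pixel  # d
    time_s = frames_elapsed / fps                       # t
    return distance_m / time_s                          # v = d / t

# A car moving 120 px across 10 frames of 30 fps video at 0.05 m/px
# works out to 18 m/s (about 65 km/h).
```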
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;⚖️ Legal Admissibility&lt;/h3&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search Grounding&lt;/strong&gt;: Uses the Gemini…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Sherin-SEF-AI/Incident-Lens-AI" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>showdev</category>
      <category>softwareengineering</category>
      <category>startup</category>
    </item>
    <item>
      <title>I Got Tired of Mocking APIs in pytest. So I Built a Cleaner Way.</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Sun, 08 Mar 2026 15:01:33 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/i-got-tired-of-mocking-apis-in-pytest-so-i-built-a-cleaner-way-3nen</link>
      <guid>https://dev.to/sherinjosephroy/i-got-tired-of-mocking-apis-in-pytest-so-i-built-a-cleaner-way-3nen</guid>
      <description>&lt;p&gt;If you’ve written integration tests in Python long enough, you’ve hit this wall.&lt;/p&gt;

&lt;p&gt;Your test calls three external services.&lt;br&gt;&lt;br&gt;
You mock one endpoint.&lt;br&gt;&lt;br&gt;
Then another.&lt;br&gt;&lt;br&gt;
Then another.&lt;/p&gt;

&lt;p&gt;Suddenly your test is 60 lines long and half of it is patching.&lt;/p&gt;

&lt;p&gt;At that point you’re not testing behavior.&lt;br&gt;&lt;br&gt;
You’re maintaining scaffolding.&lt;/p&gt;

&lt;p&gt;I ran into this repeatedly while working on service-to-service flows, especially when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single operation triggered multiple HTTP calls
&lt;/li&gt;
&lt;li&gt;Different tests required different combinations of responses
&lt;/li&gt;
&lt;li&gt;Fixtures started turning into mini frameworks
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tooling ecosystem is strong: &lt;code&gt;responses&lt;/code&gt;, &lt;code&gt;unittest.mock&lt;/code&gt;, and httpx's mocking utilities. But once the endpoint count increases, the ergonomics start to degrade.&lt;/p&gt;

&lt;p&gt;The issue is not capability.&lt;br&gt;&lt;br&gt;
It is readability.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Breaking Point
&lt;/h2&gt;

&lt;p&gt;Here is what multi-endpoint mocking often turns into:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;

&lt;span class="nd"&gt;@responses.activate&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_checkout&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/users/1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;checkout_flow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This works.&lt;/p&gt;

&lt;p&gt;But scale it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different combinations per test&lt;/li&gt;
&lt;li&gt;Conditional responses&lt;/li&gt;
&lt;li&gt;Dynamic payloads&lt;/li&gt;
&lt;li&gt;Partial URL matching&lt;/li&gt;
&lt;li&gt;Multiple external services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now your fixtures grow. Helpers grow. Patching spreads.&lt;/p&gt;

&lt;p&gt;Tests become infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Wanted Instead
&lt;/h2&gt;

&lt;p&gt;I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lives naturally inside pytest&lt;/li&gt;
&lt;li&gt;Keeps mocks close to test logic&lt;/li&gt;
&lt;li&gt;Makes multi-endpoint flows readable&lt;/li&gt;
&lt;li&gt;Avoids spinning up test servers&lt;/li&gt;
&lt;li&gt;Avoids deep patch trees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built a small utility called &lt;strong&gt;api-mocker&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The philosophy was simple: minimal surface area. No heavy DSL. No framework abstraction.&lt;/p&gt;

&lt;p&gt;Just explicit endpoint declarations.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms3cnjfywh0pm5ke6fpp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms3cnjfywh0pm5ke6fpp.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def test_checkout_flow(api_mocker):
    api_mocker.get("/users/1").respond_with(
        status=200,
        json={"id": 1, "name": "Alice"}
    )

    api_mocker.post("/orders").respond_with(
        status=201,
        json={"order_id": 42}
    )

    result = checkout_flow()
    assert result.success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No decorators.&lt;br&gt;
No activation context.&lt;br&gt;
No scattered patch logic.&lt;/p&gt;

&lt;p&gt;The fixture handles lifecycle and cleanup per test.&lt;/p&gt;
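&lt;p&gt;That lifecycle can be sketched independently of the library. &lt;code&gt;MockRegistry&lt;/code&gt; and &lt;code&gt;_Route&lt;/code&gt; below are hypothetical stand-ins for illustration, not api-mocker's actual internals:&lt;/p&gt;

```python
# Illustrative sketch of per-test mock lifecycle; MockRegistry and _Route
# are hypothetical stand-ins, not api-mocker's real classes.
from contextlib import contextmanager

class _Route:
    def __init__(self, registry, method, path):
        self._registry, self._method, self._path = registry, method, path

    def respond_with(self, status, json):
        self._registry.routes[(self._method, self._path)] = (status, json)

class MockRegistry:
    def __init__(self):
        self.routes = {}  # (method, path) -> (status, json)

    def get(self, path):
        return _Route(self, "GET", path)

    def post(self, path):
        return _Route(self, "POST", path)

@contextmanager
def api_mocker():
    registry = MockRegistry()
    try:
        yield registry           # the test body runs here
    finally:
        registry.routes.clear()  # teardown: nothing leaks into the next test
```

&lt;p&gt;A pytest fixture wraps exactly this pattern: set up a fresh registry, yield it to the test, and tear it down unconditionally afterwards.&lt;/p&gt;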




&lt;h2&gt;
  
  
  Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Isolation Per Test
&lt;/h3&gt;

&lt;p&gt;Mocks reset automatically after each test. No shared state leakage.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Explicit Failure
&lt;/h3&gt;

&lt;p&gt;If an expected endpoint is not called, the test fails.&lt;br&gt;
If an unexpected endpoint is called, the test fails.&lt;/p&gt;

&lt;p&gt;Silent success hides integration problems.&lt;/p&gt;
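&lt;p&gt;One way to enforce both failure modes is a small expectation tracker. This is a generic sketch of the idea, not the library's implementation:&lt;/p&gt;

```python
# Hypothetical sketch of strict expectations: unexpected calls fail
# immediately, and verify() fails if an expected endpoint was never hit.
class StrictExpectations:
    def __init__(self):
        self._expected = set()
        self._seen = set()

    def expect(self, method, path):
        self._expected.add((method, path))

    def record(self, method, path):
        key = (method, path)
        if key not in self._expected:
            raise AssertionError(f"unexpected call: {method} {path}")
        self._seen.add(key)

    def verify(self):
        missing = self._expected - self._seen
        if missing:
            raise AssertionError(f"expected but never called: {sorted(missing)}")
```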

&lt;h3&gt;
  
  
  3. Lightweight Interception
&lt;/h3&gt;

&lt;p&gt;No embedded server. No process overhead.&lt;br&gt;
Interception happens at the request layer.&lt;/p&gt;
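&lt;p&gt;To make "request-layer interception" concrete, here is a standalone demonstration using the standard library. This is my own example of the general technique, not api-mocker's mechanism: the code under test is unchanged, but the transport callable is swapped for a canned response, and no server ever runs.&lt;/p&gt;

```python
# Request-layer interception with unittest.mock: urlopen is replaced by a
# mock that yields a canned JSON body, so no network or server is involved.
import io
import json
import urllib.request
from unittest import mock

def fetch_user():
    with urllib.request.urlopen("https://api.example.com/users/1") as resp:
        return json.load(resp)

def demo():
    with mock.patch("urllib.request.urlopen") as fake_open:
        fake_open.return_value.__enter__.return_value = io.BytesIO(
            json.dumps({"id": 1, "name": "Alice"}).encode()
        )
        return fetch_user()  # never touches the network
```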

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd35ise0a8a4utju5k28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd35ise0a8a4utju5k28.png" alt=" " width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Approach Works Best
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Microservice architectures&lt;/li&gt;
&lt;li&gt;Services calling multiple third-party APIs&lt;/li&gt;
&lt;li&gt;Payment or auth flows&lt;/li&gt;
&lt;li&gt;Orchestrator style backends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anywhere a single flow touches two or more HTTP integrations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Questions I’m Exploring
&lt;/h2&gt;

&lt;p&gt;Mocking libraries always face tension between simplicity and flexibility.&lt;/p&gt;

&lt;p&gt;Some areas I’m actively thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async client handling&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;When mocking should give way to contract testing&lt;/li&gt;
&lt;li&gt;Detecting over-mocking in large test suites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tradeoffs that affect long-term test quality.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue9wc1kl2b2ok44ifseg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue9wc1kl2b2ok44ifseg.png" alt=" " width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I’m Sharing This
&lt;/h2&gt;

&lt;p&gt;Mocking strategy has a direct impact on codebase health.&lt;/p&gt;

&lt;p&gt;Readable tests scale.&lt;br&gt;
Fixture jungles do not.&lt;/p&gt;

&lt;p&gt;If you’ve dealt with messy multi-endpoint integration tests in Python, I’d genuinely like to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What worked well&lt;/li&gt;
&lt;li&gt;Where it broke down&lt;/li&gt;
&lt;li&gt;When you moved to contract testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project links if you want to explore the implementation:&lt;/p&gt;

&lt;p&gt;PyPI: &lt;a href="https://pypi.org/project/api-mocker/" rel="noopener noreferrer"&gt;https://pypi.org/project/api-mocker/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Sherin-SEF-AI/api-mocker" rel="noopener noreferrer"&gt;https://github.com/Sherin-SEF-AI/api-mocker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious to hear how others approach this problem.&lt;/p&gt;

</description>
      <category>api</category>
      <category>python</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
    <item>
      <title>OpsLens - I Built an Autonomous Incident Response System That Turns Notion Into a War Room</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Sat, 07 Mar 2026 13:31:04 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/opslens-i-built-an-autonomous-incident-response-system-that-turns-notion-into-a-war-room-88n</link>
      <guid>https://dev.to/sherinjosephroy/opslens-i-built-an-autonomous-incident-response-system-that-turns-notion-into-a-war-room-88n</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/notion-2026-03-04"&gt;Notion MCP Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;OpsLens&lt;/strong&gt;, an autonomous incident response orchestrator that uses Notion MCP as its core data layer.&lt;/p&gt;

&lt;p&gt;Here is the problem I was solving: when a production incident fires at 3 AM, the on-call engineer has to do six things at once. Triage the alert. Search for past incidents. Find the runbook. Check recent deployments. Notify the team. Document everything for the postmortem. Every step is manual, scattered across different tools, and easy to mess up when you are running on two hours of sleep.&lt;/p&gt;

&lt;p&gt;OpsLens takes the alert, runs five AI agents against it, and writes everything back to Notion. The engineer opens their incident page and finds: severity assessment, related past incidents, applicable runbook steps, a draft postmortem, and a list of who to notify. All in one place, all searchable, all generated in seconds.&lt;/p&gt;

&lt;p&gt;But the part I am most proud of is that it is not a one-way pipe. OpsLens &lt;em&gt;watches&lt;/em&gt; for human edits in Notion. If you disagree with the AI triage and change the severity from P2 to P0, the system detects that within 30 seconds and re-runs the relevant agents with the updated context. The AI proposes. The human decides. The system adapts.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it actually does
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Alert Ingestion&lt;/strong&gt;: Accepts real webhook payloads from Prometheus AlertManager, Grafana, PagerDuty, or any custom JSON source. Normalizes them into a canonical format, deduplicates, and groups related alerts into a single incident.&lt;/p&gt;
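&lt;p&gt;The shape of that normalize/dedup/group step looks roughly like this. The field names and grouping key are illustrative assumptions, not OpsLens's actual schema:&lt;/p&gt;

```python
# Illustrative sketch of canonicalizing and de-duplicating alerts;
# field names and the grouping key are assumptions for this example.
from dataclasses import dataclass

@dataclass(frozen=True)
class CanonicalAlert:
    source: str
    service: str
    title: str
    severity: str

def normalize(payload: dict, source: str) -> CanonicalAlert:
    labels = payload.get("labels", {})
    return CanonicalAlert(
        source=source,
        service=labels.get("service", "unknown"),
        title=payload.get("annotations", {}).get("summary", "untitled"),
        severity=labels.get("severity", "P3"),
    )

def dedup_and_group(alerts):
    """Drop exact duplicates, then group remaining alerts by service."""
    groups = {}
    for alert in dict.fromkeys(alerts):  # preserves order, drops dupes
        groups.setdefault(alert.service, []).append(alert)
    return groups
```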

&lt;p&gt;&lt;strong&gt;Five AI Agents&lt;/strong&gt; run in sequence on every new incident:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Triage Agent&lt;/strong&gt; - Validates severity, identifies the affected service, assesses blast radius&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlation Agent&lt;/strong&gt; - Searches past incidents, Slack conversations, Google Drive docs, Jira tickets via Notion MCP's connected tool search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation Agent&lt;/strong&gt; - Finds applicable runbooks, proposes specific commands and rollback steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comms Agent&lt;/strong&gt; - Orchestrates notifications and escalation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem Agent&lt;/strong&gt; - Generates a blameless postmortem when the incident resolves&lt;/li&gt;
&lt;/ol&gt;
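
&lt;p&gt;The "run in sequence" part matters: each agent sees what earlier agents produced, so remediation can build on triage and correlation. A minimal sketch of that pipeline shape (agent names from the list above; the orchestrator code itself is illustrative):&lt;/p&gt;

```python
# Hypothetical sketch of the sequential agent pipeline: each agent is a
# callable that receives the incident plus all earlier findings, and its
# own result is accumulated into the shared context.

AGENTS = ["triage", "correlation", "remediation", "comms", "postmortem"]

def run_pipeline(incident: dict, agents: dict) -> dict:
    context = {"incident": incident, "findings": {}}
    for name in AGENTS:
        result = agents[name](context)      # agent sees prior findings
        context["findings"][name] = result  # accumulate for later agents
    return context

# Tiny stand-in agents for illustration only
demo_agents = {
    "triage": lambda ctx: {"severity": "P1"},
    "correlation": lambda ctx: {"related": []},
    "remediation": lambda ctx: {"runbook": "restart-service"},
    "comms": lambda ctx: {"notified": ["oncall"]},
    "postmortem": lambda ctx: {"draft": "pending"},
}
```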

&lt;p&gt;Every agent writes its analysis as a structured comment on the Notion incident page. This is not dumped into a database somewhere. It lives in Notion, searchable, shareable, and visible to everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident Commander&lt;/strong&gt;: A contextual AI co-pilot embedded in the dashboard. During an active incident, you can ask it questions like "What changed recently in this service?" or "Find the runbook for this." It searches Notion, fetches pages, checks past incidents, and comes back with specific answers and clickable action buttons (search, escalate, transition status, notify someone, run a remediation step).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bi-directional Notion Sync&lt;/strong&gt;: The Notion Watcher polls active incident pages every 30 seconds. It detects when a human changes severity, updates status, adds a root cause, or writes an escalation comment directly in Notion. When it spots a change, it fires the appropriate callback, re-runs agents, and updates the dashboard via WebSocket.&lt;/p&gt;
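
&lt;p&gt;The detection itself can be as simple as diffing snapshots of the watched page properties between polling cycles. A sketch of that idea (field names are assumptions, not the real watcher code):&lt;/p&gt;

```python
# Hypothetical sketch of the watcher's change detection: keep the last
# seen snapshot of each incident page's properties and diff it against
# the freshly fetched copy on every polling cycle.

WATCHED = ("severity", "status", "root_cause")

def diff_properties(previous: dict, current: dict) -> dict:
    """Return {field: (old, new)} for every watched field a human changed."""
    changes = {}
    for field in WATCHED:
        if previous.get(field) != current.get(field):
            changes[field] = (previous.get(field), current.get(field))
    return changes
```

&lt;p&gt;Each non-empty diff then maps to a callback: a severity change re-runs triage, a status change updates the dashboard, and so on.&lt;/p&gt;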

&lt;p&gt;&lt;strong&gt;Real-Time Dashboard&lt;/strong&gt;: React frontend with live updates. Incident list with filters, full timeline view, agent activity feed, audit trail, semantic search across Notion, a webhook playground for testing, and settings page for configuring integrations, all connected via WebSocket for instant updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Integrations&lt;/strong&gt;: Slack war rooms, GitHub deployment correlation, Jira ticket creation, Linear issue tracking, and outbound webhooks with retry logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  The architecture in one picture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Prometheus/Grafana/PagerDuty
        |
        v (webhooks)
+------------------+       JSON-RPC 2.0       +------------------+
|  OpsLens Backend |  &amp;lt;-------------------&amp;gt;   |  Notion MCP      |
|  (FastAPI)       |   Streamable HTTP        |  Server (:3100)  |
|                  |                          |                  |
|  - Incident Mgr  |                          +--------+---------+
|  - 5 AI Agents   |                                   |
|  - Notion Watcher|                                   v
|  - WebSocket Hub |                          +------------------+
+--------+---------+                          |  Notion          |
         |                                    |  - Incidents DB  |
         v                                    |  - Runbooks DB   |
+------------------+                          |  - Services DB   |
|  React Dashboard |                          |  - Postmortems   |
|  - Incident List |                          |  - On-Call DB    |
|  - Commander     |                          +------------------+
|  - Agent Feed    |
|  - Audit Trail   |
+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F620j730q4c793lwrpa8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F620j730q4c793lwrpa8h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Video Demo
&lt;/h2&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/LwMiMLhYXVI"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  Show us the code
&lt;/h2&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Sherin-SEF-AI" rel="noopener noreferrer"&gt;
        Sherin-SEF-AI
      &lt;/a&gt; / &lt;a href="https://github.com/Sherin-SEF-AI/OpsLens" rel="noopener noreferrer"&gt;
        OpsLens
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      OpsLens: The World’s First Autonomous Incident Command Center powered by Notion MCP.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;OpsLens&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Autonomous Incident Response Orchestrator powered by Notion MCP&lt;/strong&gt;
&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/559724755-172bdf6f-fa02-4e3e-8361-49a32248051c.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQ2NDM0NDcsIm5iZiI6MTc3NDY0MzE0NywicGF0aCI6Ii8xNjk3MDAxMTkvNTU5NzI0NzU1LTE3MmJkZjZmLWZhMDItNGUzZS04MzYxLTQ5YTMyMjQ4MDUxYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyN1QyMDI1NDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00NjlhNWE3MDhhYzFkMGEwNTI2YzhhMWY2YmU0NWVlMTExMDE2MTcwMGZiYWU1MmJmYTIzZDA5YTRlMWRiZGUwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.Sz0zSqqmqD_qiHCUh1X_STV51ebkTVX4P4J8RXhjIcI"&gt;&lt;img width="1536" height="1024" alt="OpsLens-CoverImage" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F559724755-172bdf6f-fa02-4e3e-8361-49a32248051c.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQ2NDM0NDcsIm5iZiI6MTc3NDY0MzE0NywicGF0aCI6Ii8xNjk3MDAxMTkvNTU5NzI0NzU1LTE3MmJkZjZmLWZhMDItNGUzZS04MzYxLTQ5YTMyMjQ4MDUxYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyN1QyMDI1NDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00NjlhNWE3MDhhYzFkMGEwNTI2YzhhMWY2YmU0NWVlMTExMDE2MTcwMGZiYWU1MmJmYTIzZDA5YTRlMWRiZGUwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.Sz0zSqqmqD_qiHCUh1X_STV51ebkTVX4P4J8RXhjIcI"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;OpsLens transforms Notion into an AI-powered incident command center. It ingests alerts from monitoring tools, runs a pipeline of specialized AI agents for triage, correlation, remediation, and postmortem generation, and writes every finding back to Notion as structured, searchable knowledge. Engineers interact through a real-time dashboard or directly in Notion. The system watches for human edits and reacts, creating a true human-in-the-loop incident response workflow.&lt;/p&gt;
&lt;p&gt;Built for the &lt;a href="https://dev.to/challenges/notion" rel="nofollow"&gt;Notion MCP Challenge&lt;/a&gt; on DEV.to.&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/169700119/559726031-0582949b-ae8d-4609-870a-39050c6970b2.gif?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQ2NDM0NDcsIm5iZiI6MTc3NDY0MzE0NywicGF0aCI6Ii8xNjk3MDAxMTkvNTU5NzI2MDMxLTA1ODI5NDliLWFlOGQtNDYwOS04NzBhLTM5MDUwYzY5NzBiMi5naWY_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyN1QyMDI1NDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wZWYwNDFkYTczNzFhMGFiNzA3Njg4MGFjNjIxMTZkYzYxNjBkZjA1ODE4ZTk0YzQxZTljNGJiYTRjODdjYzYwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.aFqq1XGnaK-jnbUR-BXWsBbW7IKA3iN1mbLDZcUQSq8"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F169700119%2F559726031-0582949b-ae8d-4609-870a-39050c6970b2.gif%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzQ2NDM0NDcsIm5iZiI6MTc3NDY0MzE0NywicGF0aCI6Ii8xNjk3MDAxMTkvNTU5NzI2MDMxLTA1ODI5NDliLWFlOGQtNDYwOS04NzBhLTM5MDUwYzY5NzBiMi5naWY_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjYwMzI3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI2MDMyN1QyMDI1NDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wZWYwNDFkYTczNzFhMGFiNzA3Njg4MGFjNjIxMTZkYzYxNjBkZjA1ODE4ZTk0YzQxZTljNGJiYTRjODdjYzYwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.aFqq1XGnaK-jnbUR-BXWsBbW7IKA3iN1mbLDZcUQSq8" alt="OpsLens-Low"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Table of Contents&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#problem-statement" rel="noopener noreferrer"&gt;Problem Statement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#how-it-works" rel="noopener noreferrer"&gt;How It Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#architecture" rel="noopener noreferrer"&gt;Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#use-cases" rel="noopener noreferrer"&gt;Use Cases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#features" rel="noopener noreferrer"&gt;Features&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#tech-stack" rel="noopener noreferrer"&gt;Tech Stack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#project-structure" rel="noopener noreferrer"&gt;Project Structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#getting-started" rel="noopener noreferrer"&gt;Getting Started&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#configuration" rel="noopener noreferrer"&gt;Configuration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#api-reference" rel="noopener noreferrer"&gt;API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#webhook-integration" rel="noopener noreferrer"&gt;Webhook Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#slash-commands" rel="noopener noreferrer"&gt;Slash Commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#docker-deployment" rel="noopener noreferrer"&gt;Docker Deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#development" rel="noopener noreferrer"&gt;Development&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#author" rel="noopener noreferrer"&gt;Author&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/OpsLens#license" rel="noopener noreferrer"&gt;License&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Problem Statement&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;When a production incident fires at 3 AM, the on-call engineer faces a wall of context switching: triage the alert, search for past incidents, find the runbook, notify stakeholders, check recent deployments, and document everything for…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Sherin-SEF-AI/OpsLens" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;p&gt;The full source is on GitHub. Key files if you want to dive in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;src/notion_mcp/client.py&lt;/code&gt; - The async JSON-RPC 2.0 client that talks to Notion MCP&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/notion_mcp/tools.py&lt;/code&gt; - Typed wrappers around every MCP tool (search, fetch, create, comment)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/agents/orchestrator.py&lt;/code&gt; - The pipeline that coordinates all five agents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/agents/commander.py&lt;/code&gt; - The Incident Commander with agentic tool-use loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/sync/notion_watcher.py&lt;/code&gt; - Bi-directional sync that detects human edits in Notion&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/incidents/manager.py&lt;/code&gt; - Incident lifecycle, dedup, grouping, and Notion rehydration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/webhooks/normalizer.py&lt;/code&gt; - Converts Prometheus/Grafana/PagerDuty payloads to a canonical format&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Used Notion MCP
&lt;/h2&gt;

&lt;p&gt;Notion MCP is not a side feature in OpsLens. It is the foundation. Every piece of data flows through it. Here is how.&lt;/p&gt;

&lt;h3&gt;
  
  
  The MCP Client
&lt;/h3&gt;

&lt;p&gt;OpsLens communicates with the Notion MCP server over Streamable HTTP using JSON-RPC 2.0. The client at &lt;code&gt;src/notion_mcp/client.py&lt;/code&gt; handles session initialization, request/response parsing, and rate limiting (180 requests/min, 30 searches/min). Every call is async. Every error is retried with exponential backoff via tenacity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified view of how OpsLens talks to Notion MCP
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jsonrpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools/call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Six MCP Tools in Active Use
&lt;/h3&gt;

&lt;p&gt;I use six of the Notion MCP tools throughout the system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;notion-search&lt;/code&gt;&lt;/strong&gt; - This is the workhorse. Every agent starts by searching the workspace for relevant context. The Triage Agent searches for service documentation. The Correlation Agent searches for past incidents with similar patterns. The Remediation Agent searches for runbooks. The Incident Commander uses it interactively when the engineer asks a question.&lt;/p&gt;

&lt;p&gt;The best part: &lt;code&gt;notion-search&lt;/code&gt; searches across connected tools too. When Slack, Google Drive, Jira, or Confluence are connected to the Notion workspace, the search results include matches from all of them. The Correlation Agent does not need separate API integrations for each tool. One MCP search call, and it gets Slack threads about the last time this service broke, the Jira ticket from the previous incident, and the Confluence page with the architecture diagram. That is a massive unlock.&lt;/p&gt;
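
&lt;p&gt;In practice that means a correlation lookup is just one tool call through the JSON-RPC client. A stub sketch of the shape of such a call (the stub client and argument names are assumptions standing in for the real async client):&lt;/p&gt;

```python
# Illustrative sketch: a correlation-style lookup through an MCP
# "notion-search" tool call. StubClient fakes the JSON-RPC round trip
# that the real async client would perform over Streamable HTTP.
import asyncio

class StubClient:
    async def call_tool(self, tool_name, arguments):
        # A real client would POST a JSON-RPC request; we fake one hit.
        return [{"title": "INC-7: payments OOM", "source": "notion"}]

async def find_related(client, query):
    return await client.call_tool("notion-search", {"query": query})

results = asyncio.run(find_related(StubClient(), "payment service memory leak"))
```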

&lt;p&gt;&lt;strong&gt;&lt;code&gt;create-pages&lt;/code&gt;&lt;/strong&gt; - Every new incident gets a structured Notion page. Properties include incident ID, status, severity, alert source, service name, triggered timestamp, and impact description. The content includes a formatted summary with alert details, labels, and linked URLs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;create-comment&lt;/code&gt;&lt;/strong&gt; - Every agent writes its analysis as a comment on the incident page. This was a deliberate design choice. Comments are timestamped, attributed, and visible in the page history. When you open an incident page in Notion, you see the full conversation: the Triage Agent's severity assessment, the Correlation Agent's findings about similar past incidents, the Remediation Agent's suggested fix. It reads like a collaborative investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;list-comments&lt;/code&gt;&lt;/strong&gt; - On startup, OpsLens rehydrates its in-memory state from Notion. It queries the Incidents database, then loads all comments for each incident and parses them back into timeline events. This means you can restart the server and lose nothing. The state lives in Notion.&lt;/p&gt;
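
&lt;p&gt;Rehydration works because agent comments follow a recognizable structure. The sketch below shows one way such parsing could look; the &lt;code&gt;[agent]&lt;/code&gt; prefix convention and field names here are invented for illustration, not the actual comment format:&lt;/p&gt;

```python
# Hypothetical sketch of rehydration: agent comments carry a parseable
# prefix, so on restart they can be turned back into timeline events
# without any local database.

def parse_comment(comment):
    text = comment.get("text", "")
    if text.startswith("[") and "]" in text:
        agent, _, body = text.partition("]")
        return {
            "agent": agent.lstrip("["),
            "summary": body.strip(),
            "at": comment.get("created_time"),
        }
    return None  # a human comment, not a structured agent entry

def rebuild_timeline(comments):
    return [e for e in (parse_comment(c) for c in comments) if e]
```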

&lt;p&gt;&lt;strong&gt;&lt;code&gt;update-page&lt;/code&gt;&lt;/strong&gt; - When agents update severity or status, the incident page properties are updated. The Command Center also uses this to maintain a living dashboard page in Notion with real-time metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;query-data-source&lt;/code&gt;&lt;/strong&gt; - Used during startup rehydration to query the Incidents database and rebuild in-memory state. Also used by the Incident Commander to find past incidents by specific criteria.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notion as the Single Source of Truth
&lt;/h3&gt;

&lt;p&gt;The key insight that shaped the architecture: if every agent writes to Notion, then Notion becomes the knowledge base automatically. When the Correlation Agent searches for "payment service memory leak," it finds not just manually written docs, but also the AI-generated analyses from previous incidents. The system builds institutional memory over time, and it all lives in a place where everyone on the team can see it, search it, and edit it.&lt;/p&gt;

&lt;p&gt;This also enables the bi-directional sync. Because the data is in Notion, humans can interact with it using the Notion UI they already know. Change a property, add a comment, update a status. The Notion Watcher picks it up and the system responds. No special tools needed. Just Notion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workspace Setup
&lt;/h3&gt;

&lt;p&gt;OpsLens creates six databases automatically on first run via &lt;code&gt;workspace_setup.py&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incidents&lt;/strong&gt; - Structured incident tracking with status, severity, service, and timeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runbooks&lt;/strong&gt; - Step-by-step remediation procedures, searchable by service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Services&lt;/strong&gt; - Service registry with criticality tiers and ownership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortems&lt;/strong&gt; - Blameless post-incident reviews linked to their incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-Call&lt;/strong&gt; - Rotation schedules for escalation routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence Tracker&lt;/strong&gt; - Agent confidence scores over time for quality monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The setup is idempotent. Run it twice and it finds the existing databases instead of creating duplicates.&lt;/p&gt;
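
&lt;p&gt;The idempotency boils down to a find-or-create loop: search for each database by title first, create only on a miss. A minimal sketch of that pattern (the injected &lt;code&gt;search&lt;/code&gt;/&lt;code&gt;create&lt;/code&gt; callables stand in for the real MCP calls):&lt;/p&gt;

```python
# Hypothetical sketch of idempotent workspace setup: each required
# database is looked up by title first and only created when missing,
# so running setup twice never produces duplicates.

REQUIRED_DATABASES = ["Incidents", "Runbooks", "Services",
                      "Postmortems", "On-Call", "Confidence Tracker"]

def ensure_databases(search, create):
    """search(title) -> id or None; create(title) -> id. Both injected."""
    ids = {}
    for title in REQUIRED_DATABASES:
        existing = search(title)
        ids[title] = existing if existing is not None else create(title)
    return ids
```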

&lt;h3&gt;
  
  
  What Notion MCP Unlocked
&lt;/h3&gt;

&lt;p&gt;Without Notion MCP, building this would have required maintaining a separate database, building a custom search index, and creating a UI for humans to interact with the data. Notion MCP eliminated all of that.&lt;/p&gt;

&lt;p&gt;The agents write to Notion. Humans read and edit in Notion. The search covers the entire workspace and connected tools. The data is structured, queryable, and shareable. The MCP server handles the API complexity. OpsLens just calls tools and gets results.&lt;/p&gt;

&lt;p&gt;That is what made this project possible in the scope of a challenge. The MCP layer turned what would have been months of infrastructure work into a few hundred lines of client code.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Python 3.11, FastAPI, React 18, Vite, Tailwind CSS, Google Gemini API (primary LLM), Anthropic Claude API (fallback), Notion MCP Server (Streamable HTTP), Docker Compose&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Sherin-SEF-AI/OpsLens" rel="noopener noreferrer"&gt;github.com/Sherin-SEF-AI/OpsLens&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>notionchallenge</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Got Tired of Mocking APIs in pytest. So I Built a Cleaner Way.</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Tue, 03 Mar 2026 08:15:17 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/i-got-tired-of-mocking-apis-in-pytest-so-i-built-a-cleaner-way-43jc</link>
      <guid>https://dev.to/sherinjosephroy/i-got-tired-of-mocking-apis-in-pytest-so-i-built-a-cleaner-way-43jc</guid>
      <description>&lt;p&gt;If you’ve written integration tests in Python long enough, you’ve hit this wall.&lt;/p&gt;

&lt;p&gt;Your test calls three external services.&lt;br&gt;&lt;br&gt;
You mock one endpoint.&lt;br&gt;&lt;br&gt;
Then another.&lt;br&gt;&lt;br&gt;
Then another.&lt;/p&gt;

&lt;p&gt;Suddenly your test is 60 lines long and half of it is patching.&lt;/p&gt;

&lt;p&gt;At that point you’re not testing behavior.&lt;br&gt;&lt;br&gt;
You’re maintaining scaffolding.&lt;/p&gt;

&lt;p&gt;I ran into this repeatedly while working on service-to-service flows, especially when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single operation triggered multiple HTTP calls
&lt;/li&gt;
&lt;li&gt;Different tests required different combinations of responses
&lt;/li&gt;
&lt;li&gt;Fixtures started turning into mini frameworks
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tooling ecosystem is strong: &lt;code&gt;responses&lt;/code&gt;, &lt;code&gt;unittest.mock&lt;/code&gt;, httpx mocking utilities. But as the endpoint count grows, the ergonomics degrade.&lt;/p&gt;

&lt;p&gt;The issue is not capability.&lt;br&gt;&lt;br&gt;
It is readability.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Breaking Point
&lt;/h2&gt;

&lt;p&gt;Here is what multi-endpoint mocking often turns into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;

&lt;span class="nd"&gt;@responses.activate&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_checkout&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/users/1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;checkout_flow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This works.&lt;/p&gt;

&lt;p&gt;But scale it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different combinations per test&lt;/li&gt;
&lt;li&gt;Conditional responses&lt;/li&gt;
&lt;li&gt;Dynamic payloads&lt;/li&gt;
&lt;li&gt;Partial URL matching&lt;/li&gt;
&lt;li&gt;Multiple external services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now your fixtures grow. Helpers grow. Patching spreads.&lt;/p&gt;

&lt;p&gt;Tests become infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Wanted Instead
&lt;/h2&gt;

&lt;p&gt;I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lives naturally inside pytest&lt;/li&gt;
&lt;li&gt;Keeps mocks close to test logic&lt;/li&gt;
&lt;li&gt;Makes multi-endpoint flows readable&lt;/li&gt;
&lt;li&gt;Avoids spinning up test servers&lt;/li&gt;
&lt;li&gt;Avoids deep patch trees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built a small utility called &lt;strong&gt;api-mocker&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The philosophy was simple: minimal surface area. No heavy DSL. No framework abstraction.&lt;/p&gt;

&lt;p&gt;Just explicit endpoint declarations.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms3cnjfywh0pm5ke6fpp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms3cnjfywh0pm5ke6fpp.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def test_checkout_flow(api_mocker):
    api_mocker.get("/users/1").respond_with(
        status=200,
        json={"id": 1, "name": "Alice"}
    )

    api_mocker.post("/orders").respond_with(
        status=201,
        json={"order_id": 42}
    )

    result = checkout_flow()
    assert result.success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No decorators.&lt;br&gt;
No activation context.&lt;br&gt;
No scattered patch logic.&lt;/p&gt;

&lt;p&gt;The fixture handles lifecycle and cleanup per test.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Isolation Per Test
&lt;/h3&gt;

&lt;p&gt;Mocks reset automatically after each test. No shared state leakage.&lt;/p&gt;
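
&lt;p&gt;This is the standard pytest yield-fixture pattern. A sketch of how it can guarantee per-test isolation (&lt;code&gt;MockRegistry&lt;/code&gt; is an illustrative stand-in, not the real api-mocker class):&lt;/p&gt;

```python
# Sketch of per-test isolation via a pytest yield fixture: the registry
# is created fresh for each test and wiped after the yield, so no
# registered endpoint can leak into the next test.
import pytest

class MockRegistry:
    def __init__(self):
        self.routes = {}
    def get(self, path, response):
        self.routes[("GET", path)] = response
    def reset(self):
        self.routes.clear()

@pytest.fixture
def api_mocker():
    registry = MockRegistry()
    yield registry        # the test body runs here
    registry.reset()      # teardown: wipe all routes after every test
```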

&lt;h3&gt;
  
  
  2. Explicit Failure
&lt;/h3&gt;

&lt;p&gt;If an expected endpoint is not called, the test fails.&lt;br&gt;
If an unexpected endpoint is called, the test fails.&lt;/p&gt;

&lt;p&gt;Silent success hides integration problems.&lt;/p&gt;
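
&lt;p&gt;One way to enforce both failure modes is to track registered routes against actually-hit routes and verify at teardown. A sketch of that idea (class and method names are illustrative):&lt;/p&gt;

```python
# Sketch of "explicit failure": unexpected calls raise immediately,
# and expected-but-never-called routes raise at verification time.

class StrictMocker:
    def __init__(self):
        self.expected = set()
        self.called = set()

    def expect(self, method, path):
        self.expected.add((method, path))

    def record(self, method, path):
        if (method, path) not in self.expected:
            raise AssertionError(f"unexpected call: {method} {path}")
        self.called.add((method, path))

    def verify(self):
        missed = self.expected - self.called
        if missed:
            raise AssertionError(f"expected but never called: {sorted(missed)}")
```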

&lt;h3&gt;
  
  
  3. Lightweight Interception
&lt;/h3&gt;

&lt;p&gt;No embedded server. No process overhead.&lt;br&gt;
Interception happens at the request layer.&lt;/p&gt;
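
&lt;p&gt;Concretely, request-layer interception means the function that would open a socket is swapped for an in-memory lookup for the duration of the test, then restored. A self-contained sketch of the mechanism (all names here are illustrative stand-ins; a fixture would automate the swap and restore):&lt;/p&gt;

```python
# Sketch of request-layer interception without any server process:
# the transport function is replaced by a route-table lookup, so no
# socket is ever opened during the test.

def network_send(method, path):          # stands in for the real transport
    raise RuntimeError("network access during tests")

class Client:
    send = staticmethod(network_send)
    def fetch_user(self, user_id):
        return Client.send("GET", f"/users/{user_id}")

routes = {("GET", "/users/1"): {"id": 1, "name": "Alice"}}

def intercept(method, path):
    return routes[(method, path)]        # KeyError = unmocked endpoint

# swap in, exercise, swap back (a fixture would automate this)
Client.send = staticmethod(intercept)
try:
    user = Client().fetch_user(1)
finally:
    Client.send = staticmethod(network_send)
```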

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd35ise0a8a4utju5k28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyd35ise0a8a4utju5k28.png" alt=" " width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Approach Works Best
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Microservice architectures&lt;/li&gt;
&lt;li&gt;Services calling multiple third-party APIs&lt;/li&gt;
&lt;li&gt;Payment or auth flows&lt;/li&gt;
&lt;li&gt;Orchestrator style backends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anywhere a single flow touches two or more HTTP integrations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Questions I’m Exploring
&lt;/h2&gt;

&lt;p&gt;Mocking libraries always face tension between simplicity and flexibility.&lt;/p&gt;

&lt;p&gt;Some areas I’m actively thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async client handling&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;When mocking should give way to contract testing&lt;/li&gt;
&lt;li&gt;Detecting over-mocking in large test suites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tradeoffs that affect long-term test quality.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue9wc1kl2b2ok44ifseg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue9wc1kl2b2ok44ifseg.png" alt=" " width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I’m Sharing This
&lt;/h2&gt;

&lt;p&gt;Mocking strategy has a direct impact on codebase health.&lt;/p&gt;

&lt;p&gt;Readable tests scale.&lt;br&gt;
Fixture jungles do not.&lt;/p&gt;

&lt;p&gt;If you’ve dealt with messy multi-endpoint integration tests in Python, I’d genuinely like to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What worked well&lt;/li&gt;
&lt;li&gt;Where it broke down&lt;/li&gt;
&lt;li&gt;When you moved to contract testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Project links if you want to explore the implementation:&lt;/p&gt;

&lt;p&gt;PyPI: &lt;a href="https://pypi.org/project/api-mocker/" rel="noopener noreferrer"&gt;https://pypi.org/project/api-mocker/&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Sherin-SEF-AI/api-mocker" rel="noopener noreferrer"&gt;https://github.com/Sherin-SEF-AI/api-mocker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious to hear how others approach this problem.&lt;/p&gt;

</description>
      <category>api</category>
      <category>python</category>
      <category>showdev</category>
      <category>testing</category>
    </item>
    <item>
      <title>Security Cameras Don’t Understand. This One Does - Building an Agentic Physical Intelligence Platform with Gemini</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Mon, 02 Mar 2026 17:48:35 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/security-cameras-dont-understand-this-one-does-building-an-agentic-physical-intelligence-d1k</link>
      <guid>https://dev.to/sherinjosephroy/security-cameras-dont-understand-this-one-does-building-an-agentic-physical-intelligence-d1k</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/mlh/built-with-google-gemini-02-25-26"&gt;Built with Google Gemini: Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built with Google Gemini
&lt;/h2&gt;

&lt;p&gt;Most security systems are reactive. They record. They store. When something goes wrong, someone reviews footage after the fact. I've spent years building computer vision systems for autonomous vehicles - where a 200ms latency failure isn't an incident report, it's a crash. That mindset is what drove &lt;strong&gt;Aegis Sentinel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aegis Sentinel&lt;/strong&gt; is a real-time physical security intelligence platform powered by Gemini's multimodal and agentic capabilities. It's not a smarter camera. It's a surveillance system that &lt;em&gt;understands&lt;/em&gt; what it's watching, reasons about it, and acts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perimeter Surveillance &amp;amp; Threat Detection
&lt;/h3&gt;

&lt;p&gt;Aegis Sentinel ingests live video feeds across monitored zones continuously. Using Gemini's vision capabilities, it doesn't just detect motion - it classifies &lt;em&gt;intent&lt;/em&gt;. An authorized maintenance worker moving through a restricted corridor is treated differently from an unrecognized individual at an entry point at 2 AM. The model differentiates between those scenarios through contextual reasoning, not bounding box matching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;Beyond intruders, physical ops security is about operational integrity - equipment left in the wrong zone, a door held open past threshold, personnel deviating from standard procedures. Aegis Sentinel monitors these patterns by treating the spatial environment as a &lt;strong&gt;semantic map&lt;/strong&gt;, not a pixel grid.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Alerting with Natural Language Narration
&lt;/h3&gt;

&lt;p&gt;When a threat is detected, the system doesn't fire a binary alert. It generates a structured incident narrative:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Unidentified individual detected in Zone 4 server corridor at 02:14 AM. No access badge visible. Individual stationary for 3 minutes near rack B7."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That narration lands on the ops dashboard and triggers automated escalation workflows directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic Decision Loop
&lt;/h3&gt;

&lt;p&gt;This is where Gemini earned its architectural position. The system doesn't just detect - it &lt;strong&gt;decides&lt;/strong&gt;. Gemini acts as the reasoning agent, consuming multimodal context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📹 Live video frame&lt;/li&gt;
&lt;li&gt;🗺️ Zone metadata &amp;amp; spatial layout&lt;/li&gt;
&lt;li&gt;🔐 Access control logs&lt;/li&gt;
&lt;li&gt;🕐 Time-of-day context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and resolving what escalation tier is appropriate: log only, notify on-call, lock down zone, or initiate full response. I built the tool-use layer so Gemini calls directly into the incident management API and access control systems as first-class operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/zC4TjNx4xrY"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Gemini's multimodal context window changes the architecture entirely
&lt;/h3&gt;

&lt;p&gt;Coming from traditional CV pipelines - where you chain a detector, tracker, classifier, and rules engine - the first instinct is to slot Gemini into one stage of that chain. &lt;strong&gt;That's the wrong mental model entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini can hold spatial context across multiple frames &lt;em&gt;and&lt;/em&gt; structured metadata simultaneously. That meant I could collapse what would've been a 4-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;detect → track → classify → reason
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...into a single inference call with the right frame context and prompt structure. Fewer moving parts, fewer failure surfaces, dramatically less engineering overhead. One model reasoning end-to-end beat a hand-crafted ensemble in both accuracy and operational simplicity - and in a startup, that simplicity compounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Agentic tool use requires you to think in failure modes first
&lt;/h3&gt;

&lt;p&gt;Giving Gemini the ability to call into access control APIs and incident management systems is powerful on paper. In practice, every tool call is a &lt;strong&gt;trust boundary&lt;/strong&gt;. I spent more engineering time on the &lt;em&gt;guard layer&lt;/em&gt; - validating Gemini's action intent before executing high-stakes calls like zone lockdowns - than on the tool definitions themselves.&lt;/p&gt;

&lt;p&gt;The lesson: agentic AI in physical security isn't a chatbot with side effects. It's a decision system operating in the real world. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit confidence thresholds before any action executes&lt;/li&gt;
&lt;li&gt;Human-in-the-loop escalation for irreversible operations&lt;/li&gt;
&lt;li&gt;Comprehensive audit logging as a foundational constraint, not an afterthought&lt;/li&gt;
&lt;/ul&gt;
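&lt;p&gt;A stripped-down sketch of that guard layer (the tier names and threshold here are illustrative, not the production values):&lt;/p&gt;

```python
import time

# Illustrative action tiers, not the production catalogue.
REVERSIBLE = {"log_only", "notify_on_call"}
IRREVERSIBLE = {"lock_down_zone", "initiate_full_response"}

AUDIT_LOG = []

def guard(action, confidence, threshold=0.85):
    """Validate a model-proposed action before it reaches any real API."""
    # Every proposal is logged, whether or not it executes.
    AUDIT_LOG.append({"ts": time.time(), "action": action, "confidence": confidence})

    if threshold > confidence:
        return "rejected"              # below threshold: never execute
    if action in IRREVERSIBLE:
        return "needs_human_approval"  # human-in-the-loop for irreversible ops
    if action in REVERSIBLE:
        return "execute"
    return "rejected"                  # unknown tool call: fail closed
```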

&lt;h3&gt;
  
  
  3. Streaming latency is your hardest constraint - not model accuracy
&lt;/h3&gt;

&lt;p&gt;Gemini's threat classification accuracy was strong. But in physical security ops, a 4-second response window on a live intrusion event is operationally unacceptable.&lt;/p&gt;

&lt;p&gt;This forced a &lt;strong&gt;two-tier architecture&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Live feed
   │
   ▼
[Tier 1] OpenCV pre-filter
 motion detection + ROI extraction
   │  (only candidate events pass through)
   ▼
[Tier 2] Gemini
 contextual reasoning + agentic decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini API calls became targeted and high-value instead of continuous. Latency dropped, cost per inference dropped, and the system became predictable under load. In ADAS, we call this "compute budget allocation." The same principle transfers directly to physical security.&lt;/p&gt;
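&lt;p&gt;The gating logic of Tier 1 can be sketched in a few lines (a dependency-free stand-in; the real pre-filter uses OpenCV, and frames are flat lists of pixel intensities purely for illustration):&lt;/p&gt;

```python
def frame_delta(prev, curr):
    """Mean absolute pixel difference between two grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)

def tier1_filter(frames, motion_threshold=10.0):
    """Tier 1: cheap local pre-filter.

    Only frames with enough motion become candidate events for the
    expensive Tier 2 (Gemini) call; everything else is dropped locally.
    """
    candidates = []
    prev = frames[0]
    for i, curr in enumerate(frames[1:], start=1):
        if frame_delta(prev, curr) > motion_threshold:
            candidates.append(i)  # would be sent upstream with zone context
        prev = curr
    return candidates
```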

&lt;h3&gt;
  
  
  4. Natural language incident narration is not a polish feature
&lt;/h3&gt;

&lt;p&gt;I expected narrative alert generation to be a nice demo extra. It became the &lt;strong&gt;most operationally significant feature&lt;/strong&gt; across every stakeholder demo.&lt;/p&gt;

&lt;p&gt;Security ops staff are not ML engineers. Replacing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALERT: CLASS=INTRUDER CONF=0.87 ZONE=4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...with a plain-English incident description any operator can act on immediately eliminates the cognitive translation layer between detection and response. That directly reduces mean-time-to-response - which is the metric physical security actually cares about.&lt;/p&gt;




&lt;h2&gt;
  
  
  Google Gemini Feedback
&lt;/h2&gt;

&lt;p&gt;I'll be direct, because that's more useful than praise.&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ What worked exceptionally well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multimodal reasoning&lt;/strong&gt; exceeded my expectations. Passing a video frame, a text description of the spatial zone context, and the last 60 seconds of structured event metadata - and getting back coherent, actionable threat reasoning - is not something I can replicate with any other single API. The cross-modal context integration is the real differentiator, not any individual capability in isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function calling / tool use&lt;/strong&gt; is reliable and structurally predictable. The explicit tool call output is clean enough to build production guard layers around. In a security domain, that predictability matters more than raw capability - false positives trigger real-world responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Where I hit real friction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Latency variability under concurrent load&lt;/strong&gt; was the most significant operational pain point. Single-request latency was workable. Under burst conditions - multiple concurrent feed analysis requests - latency became inconsistent in ways that are hard to tolerate in real-time security contexts. I ended up building adaptive backpressure queuing, which added complexity I hadn't originally scoped. A more predictable latency SLA under concurrent load would make production deployment planning substantially cleaner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native video stream ingestion&lt;/strong&gt; still doesn't exist at the API level. I'm doing frame-level decomposition and passing sampled frames with temporal metadata. A continuous video input endpoint with configurable temporal sampling would eliminate meaningful preprocessing overhead and unlock more natural architectures for continuous monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-tail visual edge cases&lt;/strong&gt; - partially obscured individuals, extreme low-light conditions, unusual camera angles - occasionally produced inconsistent threat classifications. This is the tail-distribution problem I know well from autonomous driving development. It's manageable with a human review queue, but it sets a ceiling on full automation today. Worth being honest about.&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest take
&lt;/h3&gt;

&lt;p&gt;Gemini is the right architectural foundation for physical intelligence applications. Multimodal input, large context windows, and native tool use map almost exactly onto what a physical security reasoning agent needs. The gaps - real-time streaming infrastructure and edge-case visual consistency - are tractable. I expect them to close.&lt;/p&gt;

&lt;p&gt;For anyone building in physical security, critical infrastructure, or industrial operations where you need a system that &lt;em&gt;understands&lt;/em&gt; its environment rather than pattern-matches it - &lt;strong&gt;Gemini is worth the architecture investment.&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>devchallenge</category>
      <category>geminireflections</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Building a 13-Agent AI System for Real-Time Road Safety Monitoring</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Thu, 26 Feb 2026 15:03:12 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/building-a-13-agent-ai-system-for-real-time-road-safety-monitoring-330o</link>
      <guid>https://dev.to/sherinjosephroy/building-a-13-agent-ai-system-for-real-time-road-safety-monitoring-330o</guid>
      <description>&lt;p&gt;Kerala, India has one of the highest road accident rates in the country — over 40,000 accidents annually across its narrow, winding highways. I built &lt;strong&gt;SurakshaNet&lt;/strong&gt;, a multi-agent intelligence platform that monitors 6 high-risk road segments in real time using 13 AI agents, Byzantine fault-tolerant voting, and Bayesian belief fusion.&lt;/p&gt;

&lt;p&gt;This post covers the architecture, the problems I solved, and what I learned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w5mjdcu4dh00fhg2v5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w5mjdcu4dh00fhg2v5y.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Traditional road safety systems rely on single data sources — a camera feed or a weather alert. But accidents are multi-causal. A wet road alone is not dangerous. A wet road at night, near a school zone, with heavy traffic and poor visibility — that is dangerous.&lt;/p&gt;

&lt;p&gt;I needed a system that could fuse multiple independent signals into a single calibrated risk score, attribute causality, and trigger the right response automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs75l84kdgw0kjsbb3be5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs75l84kdgw0kjsbb3be5.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system runs 3 agent clusters in parallel per road segment, every 5 minutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWARM-GUARD (Road Safety):&lt;/strong&gt; Weather friction analysis using IRC SP:73-2015 standards, historical accident pattern matching from NCRB data, traffic anomaly detection via HERE API, and YOLOv8 visual risk assessment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DRISHTI (Disaster Response):&lt;/strong&gt; IMD weather situation awareness, KSDMA resource inventory tracking, census-based population vulnerability scoring, OR-Tools logistics optimization, and Claude-powered counterfactual projection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SENTINEL (Surveillance):&lt;/strong&gt; RT-DETR edge perception, OSNet cross-camera re-identification, Bayesian Beta-Binomial threat escalation, and Section 65B-compliant evidence packaging.&lt;/p&gt;

&lt;p&gt;Each agent returns a standardized output: risk score (0-1), confidence (0-1), and a vote (ESCALATE / HOLD / DISMISS).&lt;/p&gt;

&lt;h2&gt;
  
  
  Consensus: Byzantine Voting + Bayesian Fusion
&lt;/h2&gt;

&lt;p&gt;With 13 agents, some will fail or return unreliable data. The system handles this in two stages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Byzantine Voting.&lt;/strong&gt; Agents vote weighted by confidence. ESCALATE contributes +1 x confidence, DISMISS contributes -1 x confidence, HOLD contributes 0. Agents with zero confidence are excluded. If more than 50% of agents fail, the system enters degraded mode. The weighted sum determines the consensus: above +0.50 triggers ESCALATE, below -0.50 triggers DISMISS.&lt;/p&gt;
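&lt;p&gt;The voting rule is small enough to show directly (a sketch matching the description above, not the production code):&lt;/p&gt;

```python
def byzantine_consensus(votes):
    """votes: list of (vote, confidence) pairs, confidence in [0, 1]."""
    live = [(v, c) for v, c in votes if c > 0]  # zero-confidence agents excluded
    if len(votes) - len(live) > len(votes) / 2:
        return "DEGRADED"  # more than 50% of agents failed

    weight = {"ESCALATE": 1.0, "HOLD": 0.0, "DISMISS": -1.0}
    total = sum(weight[v] * c for v, c in live)

    if total > 0.50:
        return "ESCALATE"
    if total >= -0.50:
        return "HOLD"
    return "DISMISS"
```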

&lt;p&gt;&lt;strong&gt;Stage 2 — Bayesian Belief Fusion.&lt;/strong&gt; Raw scores are fused accounting for inter-agent correlation. Visual risk and traffic anomaly share a correlation coefficient of 0.6 (both degrade in bad weather). Independent agents are fused via weighted average. Correlated agents are down-weighted. The final belief is computed through precision weighting:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fused_belief = (mean_indep * prec_indep + mean_corr * prec_corr) / (prec_indep + prec_corr)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
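&lt;p&gt;A runnable sketch of that fusion step (the precision terms here use an effective-sample-size shrinkage for the correlated group; that particular choice is my illustration, not necessarily the system's exact weighting):&lt;/p&gt;

```python
def fuse(independent, correlated, rho=0.6):
    """Precision-weighted fusion of agent risk scores.

    Correlated agents are down-weighted through rho so that shared
    evidence is not double-counted.
    """
    mean_indep = sum(independent) / len(independent)
    mean_corr = sum(correlated) / len(correlated)

    # Effective sample size: correlation shrinks the correlated group's precision.
    prec_indep = len(independent)
    prec_corr = len(correlated) / (1 + rho * (len(correlated) - 1))

    return (mean_indep * prec_indep + mean_corr * prec_corr) / (prec_indep + prec_corr)
```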

&lt;h2&gt;
  
  
  Causal Attribution
&lt;/h2&gt;

&lt;p&gt;If the system escalates, it does not just say "high risk detected." That would be useless. Instead, it calls Claude (claude-sonnet-4-20250514) with the raw agent scores and asks for a one-paragraph causal attribution explaining &lt;em&gt;why&lt;/em&gt; this segment is dangerous right now.&lt;/p&gt;

&lt;p&gt;If the Claude API fails, the attribution is set to null — never fabricated. A Slack alert is sent with raw scores marked as needing manual review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Python 3.12, FastAPI, async SQLAlchemy, PostgreSQL + PostGIS, Redis, LangGraph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI/ML:&lt;/strong&gt; YOLOv8, RT-DETR, OSNet, Anthropic Claude API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Next.js 14, Leaflet.js, Recharts, Tailwind CSS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; LangGraph StateGraph with 3-stage pipeline (vote, fuse, attribute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure:&lt;/strong&gt; Docker Compose, Alembic migrations, structlog JSON logging&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the Frontend Shows
&lt;/h2&gt;

&lt;p&gt;The platform has 27 pages across multiple dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Command Center&lt;/strong&gt; — Full-screen operations view with live risk map, auto-cycling segment display, and real-time metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics&lt;/strong&gt; — Risk heatmap (7x24 day-hour grid), agent reliability table, agent correlation matrix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KSDMA Dashboard&lt;/strong&gt; — Resource gap visualization, evacuation route planner, monsoon overlay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KSRTC Dashboard&lt;/strong&gt; — Bus route advisories, depot reports, visual driver fatigue assessment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;District Collector Dashboard&lt;/strong&gt; — All 14 Kerala districts with escalation charts and monsoon preparedness scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Health&lt;/strong&gt; — Agent status by cluster, API rate limits with circuit breaker states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other features include crowd-sourced hazard reporting with GPS auto-detection, spatiotemporal forecasting, school zone safety, green corridor management, and RTSP camera integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limit Engineering
&lt;/h2&gt;

&lt;p&gt;Free-tier APIs are the backbone. The monitoring interval is 300 seconds across 6 segments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Daily Usage&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Open-Meteo&lt;/td&gt;
&lt;td&gt;1,728 calls&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HERE Traffic&lt;/td&gt;
&lt;td&gt;144 calls&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenWeatherMap&lt;/td&gt;
&lt;td&gt;Fallback only&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude&lt;/td&gt;
&lt;td&gt;On escalation only&lt;/td&gt;
&lt;td&gt;60 RPM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-API tracking uses Redis token buckets. If a limit is exhausted, the agent returns confidence=0.0 and vote=HOLD — it never gets API keys revoked.&lt;/p&gt;
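&lt;p&gt;The bucket logic itself is simple; here is an in-memory sketch (the real system keeps the counters in Redis so they are shared across workers):&lt;/p&gt;

```python
import time

class TokenBucket:
    """In-memory token bucket; illustrative stand-in for the Redis-backed one."""
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Exhausted: the agent returns confidence=0.0 and vote=HOLD instead of calling out.
        return False
```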

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No random numbers for risk scores.&lt;/strong&gt; Every score is derived from real external data. If data is unavailable, the score is 0.0 with confidence 0.0 and vote HOLD. This was a hard rule from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No blocking calls.&lt;/strong&gt; The entire application is async. &lt;code&gt;asyncio.gather()&lt;/code&gt; for parallel agents, &lt;code&gt;httpx.AsyncClient&lt;/code&gt; for HTTP, &lt;code&gt;create_async_engine&lt;/code&gt; for database access. Using &lt;code&gt;time.sleep()&lt;/code&gt; anywhere would be a production-killing bug.&lt;/p&gt;
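&lt;p&gt;The fan-out pattern looks roughly like this (agent names and payloads are illustrative):&lt;/p&gt;

```python
import asyncio

async def run_agent(name, fail=False):
    """One agent cycle; on failure, return the safe default instead of raising."""
    try:
        if fail:
            raise ConnectionError("upstream API unreachable")
        await asyncio.sleep(0)  # stand-in for an httpx.AsyncClient request
        return {"agent": name, "risk": 0.4, "confidence": 0.9, "vote": "HOLD"}
    except ConnectionError:
        return {"agent": name, "risk": 0.0, "confidence": 0.0, "vote": "HOLD"}

async def run_cycle():
    # All agents run concurrently; a slow or failing agent never blocks the rest.
    return await asyncio.gather(
        run_agent("weather"),
        run_agent("traffic", fail=True),
        run_agent("visual"),
    )

results = asyncio.run(run_cycle())
```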

&lt;p&gt;&lt;strong&gt;Graceful degradation everywhere.&lt;/strong&gt; If the database is unreachable, events buffer in a Redis queue and drain with exponential backoff. If an agent fails, it returns a safe default. If Claude fails, attribution is null with a Slack alert. Nothing fails silently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Multi-agent consensus is harder than single-model inference. The correlation correction between agents was the most impactful improvement — without it, correlated agents would double-count evidence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rate limit management is an engineering problem, not an afterthought. Building it into the agent abstraction from the start saved significant debugging later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Causal attribution changes how operators interact with the system. A number like "0.78" means little. "Wet road surface combined with reduced visibility and 23% speed reduction on NH-66 near a school zone active period" — that is actionable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4jdywcvylrl8vmi448y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4jdywcvylrl8vmi448y.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1lnah0fqioxytb41en5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1lnah0fqioxytb41en5.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnn7um4zrcbim29d555i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnn7um4zrcbim29d555i.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Sherin-SEF-AI/SurakshaNet" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/OQG86I1AqTk?si=kArOb8zJN_z1luLu" rel="noopener noreferrer"&gt;Demo Video&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built as a solo project. Open to feedback and contributions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>nextjs</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>7 Open Source Tools I Discovered in 2025 That Feel Illegal to Know 🤫</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Wed, 26 Nov 2025 16:00:57 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/7-open-source-tools-i-discovered-in-2024-that-feel-illegal-to-know-2lgl</link>
      <guid>https://dev.to/sherinjosephroy/7-open-source-tools-i-discovered-in-2024-that-feel-illegal-to-know-2lgl</guid>
      <description>&lt;p&gt;We all have that "Tools" bookmark folder we never look at.&lt;/p&gt;

&lt;p&gt;Last weekend, I did a deep clean. I audited my workflow to find the utilities that actually saved me hours, not just seconds. I ruthlessly cut the bloat and kept the gems.&lt;/p&gt;

&lt;p&gt;If you want to speed up your dev loop in 2025, here is the stack I'm betting on. 👇&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bruno (The Postman Killer) 🐶&lt;br&gt;
Category: API Client. Repo: usebruno/bruno&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are tired of Postman asking you to log in or sync to the cloud just to test a GET request, you need Bruno.&lt;/p&gt;

&lt;p&gt;The Killer Feature: It stores your API collections directly in your filesystem (folder structure). This means you can git version control your API tests alongside your code. No more "It works on my machine" sync issues.&lt;/p&gt;

&lt;p&gt;Cost: Open Source.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Lazygit 💤&lt;br&gt;
Category: Terminal UI. Repo: jesseduffield/lazygit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I know, I know. "Real devs use raw git commands." But when you need to stage specific lines, squash 4 commits, and resolve a merge conflict in 30 seconds, lazygit is a superpower.&lt;/p&gt;

&lt;p&gt;It turns your terminal into a visual dashboard. I haven't typed &lt;code&gt;git add .&lt;/code&gt; in six months.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Biome ⚡&lt;br&gt;
Category: Formatter &amp;amp; Linter. Repo: biomejs/biome&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I replaced Prettier and ESLint with this. It is written in Rust and it is blazingly fast.&lt;/p&gt;

&lt;p&gt;Why switch? In large monorepos, saving a file used to take 2-3 seconds to format. Biome does it in milliseconds. It handles formatting and linting in a single pass. The speed difference is actually noticeable.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;PocketBase 🗄️&lt;br&gt;
Category: Backend-as-a-Service. Repo: pocketbase/pocketbase&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For your side projects, stop over-engineering Kubernetes clusters. PocketBase is an open-source backend in a single Go file.&lt;/p&gt;

&lt;p&gt;It gives you: Realtime database, Auth, File Storage, and an Admin Dashboard. You can drop the binary on a $5 VPS and serve 10k users easily. It is the spiritual successor to Firebase, but you actually own the data.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Zod 💎&lt;br&gt;
Category: TypeScript Validation. Repo: colinhacks/zod&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are trusting your API responses blindly, you are going to have a bad time. Zod allows you to define a schema and validate data at runtime.&lt;/p&gt;

&lt;p&gt;The "Aha" Moment: It infers the static TypeScript type from the schema. You write the validation once, and you get the types for free. It essentially eliminates "undefined is not a function" errors.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Ray.so 📸&lt;br&gt;
Category: Sharing Code. Link: &lt;a href="https://ray.so"&gt;ray.so&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stop taking screenshots of your VS Code with the messy file tree visible.&lt;/p&gt;

&lt;p&gt;Ray.so creates those beautiful, syntax-highlighted code snippets you see on Twitter/X. If you are writing documentation or sharing a bug fix with a junior dev, presentation matters. It makes your code look readable and professional.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;Excalidraw 🎨&lt;br&gt;
Category: Whiteboarding. Repo: excalidraw/excalidraw&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The best way to document a system isn't a 50-page Google Doc; it's a messy hand-drawn diagram. Excalidraw brings back the "napkin sketch" feel but makes it collaborative.&lt;/p&gt;

&lt;p&gt;Pro Tip: Use the "Libraries" feature to drag and drop AWS/Azure icons instantly.&lt;/p&gt;

&lt;p&gt;The "So What?"&lt;br&gt;
We often get obsessed with learning new frameworks (Next.js vs. Remix vs. Vue), but I've found that the biggest quality-of-life improvements come from optimizing the "boring" stuff—how we commit code, how we test APIs, and how we format files.&lt;/p&gt;

&lt;p&gt;Which "hidden gem" tool are you gatekeeping? Drop it in the comments so I can steal it. 👇&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AutoML Lite: A Simple and Lightweight AutoML Tool</title>
      <dc:creator>Sherin Joseph Roy</dc:creator>
      <pubDate>Thu, 20 Nov 2025 14:19:30 +0000</pubDate>
      <link>https://dev.to/sherinjosephroy/automl-lite-a-simple-and-lightweight-automl-tool-23md</link>
      <guid>https://dev.to/sherinjosephroy/automl-lite-a-simple-and-lightweight-automl-tool-23md</guid>
      <description>&lt;p&gt;Many developers want to try machine learning but feel stuck because most AutoML tools are heavy and slow. AutoML Lite solves this by giving you a clean, fast, and easy way to train and compare models on a normal laptop.&lt;/p&gt;

&lt;p&gt;AutoML Lite trains multiple models, handles preprocessing, and shows performance metrics like accuracy, recall, and F1 score. Supported models include RandomForest, XGBoost, SVM, Logistic Regression, KNN, and Decision Tree. It is built for students, founders, and developers who want quick experiments without complex setup.&lt;/p&gt;

&lt;p&gt;Repo link:&lt;br&gt;
&lt;a href="https://github.com/Sherin-SEF-AI/AutoML-Lite" rel="noopener noreferrer"&gt;https://github.com/Sherin-SEF-AI/AutoML-Lite&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Project Matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works on low-end laptops&lt;/li&gt;
&lt;li&gt;Reduces confusion around model selection&lt;/li&gt;
&lt;li&gt;Helps you learn how different ML models behave&lt;/li&gt;
&lt;li&gt;Easy to extend with new algorithms&lt;/li&gt;
&lt;li&gt;Great for prototyping and quick testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sherin Joseph Roy&lt;br&gt;
Co-Founder and builder at DeepMost AI&lt;br&gt;
Works on AI tools, safety tech, developer tools, and open source projects. Passionate about building lightweight and practical ML systems for real world use.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Sherin-SEF-AI" rel="noopener noreferrer"&gt;https://github.com/Sherin-SEF-AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Website: &lt;a href="https://sherinjosephroy.link" rel="noopener noreferrer"&gt;https://sherinjosephroy.link&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>ai</category>
      <category>development</category>
    </item>
  </channel>
</rss>
