Sherin Joseph Roy
I Thought I Understood the Autonomous Vehicle Problem. Indian Roads Corrected Me.

Let me describe a specific moment from one of my test drives.

The system had 33 simultaneous object IDs active in frame. Trucks, motorcycles, pedestrians, autos, cars. All moving, all tracked, all getting individual TTC calculations in real time. The collision warning was firing. Processing latency was sitting at 16ms. The road looked like controlled chaos from the outside.

From inside the car, it just looked like a normal Bangalore afternoon.

That was the moment I realized how broken most ADAS benchmarks are for this part of the world.


What Orvex Actually Is

Orvex is a multi-camera real-time perception system I built specifically for Indian urban driving conditions. Not adapted from something Western. Built from scratch with Indian roads as the primary design constraint.

The system runs 4 simultaneous camera feeds:

  • Primary forward-facing perception channel
  • Dedicated pedestrian and license plate detection channel
  • Optical flow motion analysis channel
  • Wide-angle coverage feed

The perception dashboard tracks in real time: multi-class object detection across cars, trucks, buses, motorcycles, and pedestrians. Persistent ID-based tracking that survives occlusion. Per-object distance in meters and Time to Collision in seconds. Lane status and lateral offset. Risk classification from INFO to CRITICAL. Scene metadata including road type, traffic density, and visibility score.
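Those per-object fields map naturally onto a single track record. Here is a minimal sketch in Python; the field names and the two intermediate risk levels are illustrative (the post only confirms the INFO and CRITICAL endpoints):

```python
from dataclasses import dataclass
from enum import IntEnum

class RiskLevel(IntEnum):
    INFO = 0
    CAUTION = 1    # illustrative intermediate level
    WARNING = 2    # illustrative intermediate level
    CRITICAL = 3

@dataclass
class Track:
    track_id: int            # persistent ID, survives occlusion
    obj_class: str           # car, truck, bus, motorcycle, pedestrian
    distance_m: float        # estimated distance in meters
    ttc_s: float             # time to collision in seconds
    lateral_offset_m: float  # offset from lane center
    risk: RiskLevel
```

Keeping the risk level as an ordered enum rather than a string makes threshold checks (`track.risk >= RiskLevel.WARNING`) trivial downstream.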

Processing latency runs between 15 and 17ms. Everything runs locally. No cloud dependency, no edge server offload. Just a laptop mounted in the car.

[Video: actual footage from real road tests]

Three Things That Failed Immediately

I expected the system to struggle. I did not expect it to fail in the specific ways it did.

The behavior scorer scored everyone zero.

I built a 0 to 100 driver safety scoring module. Every single test run on Indian roads returned a score of 0 out of 100 with the label AGGRESSIVE. Not because the driving was reckless. Because the scorer was calibrated on assumptions that do not apply here.

Hard braking, tight gap acceptance, rapid directional changes: these are aggression markers in Western driving norms. In Bangalore or Kochi, they are baseline competence. You cannot navigate a city intersection without doing all three simultaneously. The model was correct by its own logic. It was just the wrong logic entirely.
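I am not reproducing the original scorer here, but a toy version with made-up penalty weights shows the failure mode: penalties calibrated for Western event frequencies clamp every Indian urban session to zero.

```python
# Hypothetical penalties, calibrated for roads where these events are rare.
HARD_BRAKE_PENALTY = 15
TIGHT_GAP_PENALTY = 10
RAPID_LANE_CHANGE_PENALTY = 12

def score_session(hard_brakes: int, tight_gaps: int, lane_changes: int) -> int:
    """Start at 100, subtract per 'aggression' event, clamp at 0."""
    score = 100
    score -= hard_brakes * HARD_BRAKE_PENALTY
    score -= tight_gaps * TIGHT_GAP_PENALTY
    score -= lane_changes * RAPID_LANE_CHANGE_PENALTY
    return max(score, 0)

# Event counts that would be alarming on a highway in Europe are a
# routine ten minutes in Bangalore traffic:
print(score_session(hard_brakes=6, tight_gaps=8, lane_changes=5))  # -> 0
```

The arithmetic is correct and the output is always 0; the problem is entirely in what the weights treat as a signal.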

I had to disable the scorer and rethink it from the ground up.

TTC becomes useless when trajectories are nonlinear.

Time to Collision assumes some continuity of movement. Object is at distance X, moving at velocity V, TTC is X divided by V. Clean math.
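That linear model is a one-liner, which is exactly why it is so widely used:

```python
def ttc_seconds(distance_m: float, closing_speed_mps: float):
    """Naive linear TTC: assumes the object holds its current closing
    velocity. Returns None when the gap is opening or static."""
    if closing_speed_mps <= 0:
        return None
    return distance_m / closing_speed_mps

print(ttc_seconds(12.0, 6.0))  # -> 2.0
```

The `None` case matters: a receding object has no meaningful TTC, and treating it as infinite or zero produces opposite failure modes in the warning logic.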

Motorcycles in Indian traffic do not follow continuous trajectories. They operate on opportunistic pathing: constantly scanning for gaps, switching lanes without signaling, responding to micro-gaps in traffic that open and close in under a second. A 0.2s TTC reading is not an early warning. It is a post-hoc notification.

This is a fundamental behavioral prediction problem, not a detection problem. Orvex catches the object. It cannot yet predict what the object is about to do. That gap is the real unsolved problem in urban AV for high-density mixed traffic.

Tracker ID counts exposed the true scale of the problem.

By mid-session in the second road test, tracker IDs were in the 1100s. That means the system had individually identified and tracked over 1100 distinct objects across the session. In roughly 50 minutes of driving.

Western AV test datasets do not have this density. nuScenes scenes average around 30 to 40 annotated objects. We were hitting 33 simultaneously active tracks in a single frame in a parking lot. The computational budget assumptions that underpin most published perception architectures are simply not calibrated for this.
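The two numbers in that paragraph measure different things, and it helps to keep them separate: peak simultaneous tracks bound per-frame compute, while cumulative unique IDs measure how much the scene churns over a session. A toy illustration with made-up track IDs:

```python
# Track IDs active in three consecutive frames (hypothetical values).
frames = [
    {3, 4, 5},
    {4, 5, 6, 7},
    {7, 8},
]

# Stresses the per-frame computational budget:
peak_simultaneous = max(len(ids) for ids in frames)

# Grows with session length and scene churn:
total_unique = len(set().union(*frames))

print(peak_simultaneous, total_unique)  # -> 4 6
```

A benchmark can look dense on one metric and sparse on the other; Indian urban traffic is extreme on both.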


The Thing Nobody Talks About: Benchmark Hallucination

Here is an uncomfortable truth about AV development.

A model that hits state of the art on nuScenes, Waymo Open, or KITTI is not a model that works. It is a model that works on those datasets. That is not the same thing.

The entire industry optimizes for benchmark performance because that is how research gets published, how companies get funded, and how progress gets measured. Benchmarks are a useful proxy. In markets where the real-world distribution diverges heavily from benchmark data, that proxy fails completely.

Orvex performs worse than several open-source ADAS baselines on standard benchmarks. The FPS fluctuates under density load. The tracker gets stressed at peak object count. The behavior scorer had to be scrapped.

But it runs on real Indian roads. It catches real collision threats on streets that do not exist in any benchmark dataset. It handles traffic compositions that academic datasets have never seen.

That gap between benchmark performance and deployment performance is the central problem of applied AV work. The teams that understand it build systems that actually work. The teams that do not build great benchmark numbers and then wonder why their system freezes at a Bangalore intersection.


The Optical Flow Channel Was an Accident That Became Essential


The motion analysis feed running optical flow was added almost as an afterthought. It turned out to be one of the most operationally useful parts of the entire system.

In dense traffic, optical flow captures motion vectors for everything in the scene, not just objects that have cleared the detection confidence threshold. Partially occluded vehicles. Objects near frame boundaries. Fast-moving targets that blur enough to drop below the detector's confidence cutoff.

In practice it functions as a soft pre-detection layer. The primary pipeline gives you identity, class, and distance. The optical flow gives you motion context for objects that are not yet fully resolved. In the chaos frames (30-plus active objects, overlapping bounding boxes, collision warnings firing), the optical flow channel is the thing that keeps the system from being completely blind to unclassified threats.

I did not design it that way. The road taught me that was necessary.


What Comes Next and Why It Changes Everything


Here is something I have not talked about publicly until now.

I have been in conversations with a friend based in Bangalore. Through his connections, we have access to a 200-vehicle electric fleet in active service across the city. Real routes, real operational data, real urban driving at scale, every single day.

We are planning to build a full autonomous vehicle system from scratch together.

Not retrofit. Not integrate a third-party stack. From scratch. Data collection infrastructure, annotation pipelines, perception model training, sensor integration, edge deployment. All of it.

The fleet is the asset that changes the equation. Most AV startups spend years and tens of millions building access to what we already have: a real operational environment with 200 vehicles worth of driving data across one of the densest urban road networks in the world. Every route, every intersection, every edge case: ours to instrument and learn from.

The EV platform matters for a less obvious reason. Clean electrical architecture, no combustion powertrain complexity, standardized actuation interfaces. Integrating drive-by-wire controls with a custom AV stack is a significantly cleaner problem on an EV than retrofitting a conventional vehicle. The integration surface is known and controllable.

The plan in three phases:

Phase 1. Data infrastructure. Instrument the fleet, build the collection and annotation pipeline, start generating a proprietary Indian urban driving dataset that does not exist anywhere else.

Phase 2. Rebuild the Orvex perception stack properly. Multi-camera calibration done right. BEV fusion. Behavioral prediction models trained on local data, not transferred from Waymo.

Phase 3. Vehicle integration. Closed-course autonomy first, then expanding the operational domain incrementally with real data informing every decision.

This is not a short timeline. But the foundation is real. The fleet is real. The perception work from Orvex is real. The distance between where we are and a working prototype is smaller than it looks from outside.


The One Thing I Would Tell Anyone Starting in AV

Test on your actual deployment environment from day one. Not when the system is "ready." Day one.

The failures you discover in your real environment are not setbacks. They are the curriculum. Every broken assumption (the behavior scorer, the TTC model, the density ceiling) became a design requirement that made the system more honest about what it actually needs to do.

Simulation matters. I have built months of Indian road scenarios in CARLA and the synthetic data work has real value. But simulation is a tool for exploring the design space. It is not a substitute for the road.

The road has opinions. You should hear them early.


Orvex is part of the broader perception and safety work at PerceptionAV. If you are building in AV, edge perception, or safety intelligence for high-density urban environments, especially outside Western road contexts, I would like to compare notes.
