I truly believe computer-vision-powered control systems are inefficient.
Listen to me…
Computer vision takes more compute than the task demands: it does a huge amount of work just to get something simple done. Imagine I just want to “click” a button. CV for that sounds futuristic and cool, but engineering is not about “cool”, it’s about efficiency. It’s about the cleanest, most reliable, most deterministic way to get something done.
But CV?
You need a camera, a processor running at a constant frame rate, and then a full CV model sitting on top. And these models are heavy — even the optimized ones take more resources than something like a simple switch, an IR sensor, or a muscle-signal reader. That’s the truth. Research literally shows vision models often burn orders of magnitude more energy than basic sensors, even when compressed or run on microcontrollers.
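Just to make that concrete, here’s a back-of-envelope sketch. Every power number below is my own invented assumption, not a measurement, but the shape of the math holds: a sensor that must run continuously bills its whole duty cycle to each individual click.

```python
# Rough energy cost of registering one "click" when the sensor runs continuously.
# All power figures are illustrative assumptions, not measured values.

CAMERA_PLUS_CV_WATTS = 2.0  # assumed: camera + vision inference, always on
IR_SENSOR_WATTS = 0.005     # assumed: simple IR break-beam sensor, always on

def energy_per_click_joules(power_watts: float, clicks_per_hour: float) -> float:
    """Energy billed to each click when the sensor can never sleep."""
    seconds_per_click = 3600.0 / clicks_per_hour
    return power_watts * seconds_per_click

clicks_per_hour = 60.0  # one click a minute

cv_cost = energy_per_click_joules(CAMERA_PLUS_CV_WATTS, clicks_per_hour)
ir_cost = energy_per_click_joules(IR_SENSOR_WATTS, clicks_per_hour)

print(f"CV stack:  {cv_cost:.1f} J per click")   # 120.0 J
print(f"IR sensor: {ir_cost:.3f} J per click")   # 0.300 J
print(f"ratio:     {cv_cost / ir_cost:.0f}x")    # 400x
```

Change the assumed wattages however you like; the ratio stays enormous because the vision stack pays for every frame, not just the frames where something happens.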
And then we want to make this whole stack the “OS” of the system?
As in… the foundation we’re supposed to trust?
The moment you put CV in a live, uncontrolled environment, you’re asking the model to deal with noise: light changes, shadows, random movement in the background, false positives where it reads your hand as a command you never meant. Even industry papers emphasize how camera-based gesture systems throw false positives in messy environments. And a control layer that triggers actions you didn’t intend is the fastest way to build an unreliable system.
An OS shouldn’t just be right most of the time.
You should feel safe that the base system gives the same output every time under the same input. Deterministic. Repeatable. Zero surprises. That’s how engineers design systems that people can build experiments on top of.
CV doesn’t give you that consistency unless you freeze the environment — fixed lighting, fixed angle, fixed setup — and that’s not how normal people use computers.
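To show what I mean by deterministic, here’s a toy contrast (the voltage threshold and the noise model are both made up for illustration): a switch maps the same input to the same answer every single time, while a confidence-thresholded vision trigger has scene noise baked into its “input”.

```python
import random

def switch_click(voltage: float) -> bool:
    """Deterministic: same voltage in, same answer out, every time."""
    return voltage > 1.5  # assumed logic-high threshold

def cv_click(true_confidence: float, rng: random.Random) -> bool:
    """Vision trigger: the effective input includes noise the user never chose."""
    noise = rng.gauss(0, 0.15)  # assumed lighting/shadow jitter per frame
    return (true_confidence + noise) > 0.5

rng = random.Random(0)

# The same physical situation, ten frames in a row:
switch_answers = {switch_click(3.3) for _ in range(10)}
cv_answers = {cv_click(0.45, rng) for _ in range(10)}

print(switch_answers)  # exactly one answer, always
print(cv_answers)      # near the threshold, this set can contain both answers
```

The switch’s answer set has exactly one element no matter how long you run it. The vision trigger, sitting near its threshold, can flip frame to frame, and that flip is exactly the surprise a control layer can’t afford.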
So for me, CV fits more into monitoring roles, not control roles.
Monitoring = “watch the scene, tell me what’s happening.”
Control = “do something based on my action, immediately and reliably.”
CV excels at monitoring. That’s where it shines: surveillance, object detection, anomaly spotting, robotics feedback loops. Everywhere CV thrives, it’s observing and reporting.
But controlling an OS?
Triggering actions like a click or command?
That’s when you really want low-latency, low-noise, low-power sensors that give clean signals.
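The split looks something like this in code (the names and numbers are invented): a monitor emits reports that something downstream can review, while a controller’s output is the action itself, so every mistake lands immediately.

```python
from dataclasses import dataclass

@dataclass
class Report:
    frame_id: int
    label: str
    confidence: float  # a monitor is allowed to hedge; a reviewer decides

def monitor(frame_scores):
    """Monitoring: observe and report. A wrong frame costs a bad log line."""
    return [Report(i, "person" if s > 0.5 else "empty", abs(s - 0.5) * 2)
            for i, s in enumerate(frame_scores)]

def control(button_pressed: bool) -> str:
    """Control: the output *is* the action. No reviewer in the loop."""
    return "CLICK" if button_pressed else "IDLE"

reports = monitor([0.9, 0.2, 0.7])
print([r.label for r in reports])  # ['person', 'empty', 'person']
print(control(True))               # CLICK
```

Notice the asymmetry: the monitor gets to attach a confidence score and defer judgment; the controller has to commit, which is why it deserves the cleanest signal you can give it.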
You might ask: “What about VR?”
Well, VR actually proves my point — those systems don’t rely on heavy vision models for basic inputs. They use IMUs, IR markers, muscle sensors (EMG), flex sensors, etc. Even the fancy VR gloves combine simple sensors first and sprinkle in CV only for extra precision, because those basic sensors are faster, cleaner, and more reliable for moment-to-moment control.
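That “simple sensors first, CV for correction” pattern is classic sensor fusion. A minimal sketch is a complementary filter (the rates and blend factor below are invented): the fast IMU integrates every tick, and an occasional vision fix pulls accumulated drift back toward truth.

```python
from typing import Optional

def fuse(angle: float, gyro_rate: float, dt: float,
         vision_angle: Optional[float], blend: float = 0.98) -> float:
    """Complementary filter: trust the fast IMU tick-to-tick,
    let an occasional vision measurement correct accumulated drift."""
    predicted = angle + gyro_rate * dt  # cheap, low-latency, runs every tick
    if vision_angle is None:            # most ticks: no CV involved at all
        return predicted
    # Rare CV tick: blend the prediction with the vision fix.
    return blend * predicted + (1 - blend) * vision_angle

angle = 0.0
for tick in range(100):
    vision = 10.0 if tick % 25 == 0 else None  # a vision fix every 25th tick
    angle = fuse(angle, gyro_rate=0.1, dt=0.01, vision_angle=vision)

print(round(angle, 3))
```

The control-critical path (the IMU integration) never waits on a camera; vision only nudges the estimate a few times a second. That’s the architecture VR hardware actually reflects: simple sensors drive, CV assists.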
In simple terms:
Computer vision is amazing… just not for everything.
And trying to make it the primary way we control systems just feels like the wrong engineering choice right now.