Sam King
Teaching Machines to See (Part 1): Why Vision Is Hard

*[Images: a cat, a dog, and a lady]*

As humans, it’s effortless to look at the images above and instantly recognize a cat, a dog, and a lady. This is because our brains perform intelligent visual processing, combining attention (focusing on relevant parts while ignoring others), memory (recognizing patterns from past experience), and context (like noticing the lady is smiling) to interpret scenes efficiently.

Light enters the retina and is converted into signals that travel to the visual cortex at the back of the brain. There, networks of neurons work together to detect patterns and make sense of what we see.

But how easy is this for a computer?

Unfortunately, it’s not easy at all. To a computer, an image is simply a grid of numbers (a matrix) where each value represents the brightness of a pixel (smallest unit of a digital image) at a specific point. These values typically range from 0 (completely black) to 255 (completely white).

With OpenCV, you can visualize the numerical matrix behind an image, exposing how a computer actually interprets it.

*[Code: reading an image and printing its matrix representation]*

*[Image: the matrix values behind the peacock photo]*

The same beautiful peacock you saw earlier is, to a computer, nothing more than structured numerical data.

The challenge is that these matrices have no inherent meaning as they are simply numerical values. They represent a two-dimensional (2D) projection of a three-dimensional (3D) world, which leads to missing information and inherent ambiguities when interpreting the scene.

A computer trying to see and understand the world from this 2D representation faces three major ambiguities:

1. Depth ambiguity

From a single 2D image, you cannot uniquely determine the true 3D structure of a scene.

To properly understand depth ambiguity, it’s important to first understand how a camera forms an image.

Below is a simplified diagram which captures the core idea.

*[Diagram: image formation through a camera lens]*

Light from the real world reflects off objects in all directions. Light rays from the objects travel toward the camera and enter through the lens.

The lens then bends (refracts) and focuses these rays, causing light coming from a single point in the 3D scene to converge onto a specific point on the sensor.

Behind the lens is the image sensor, which is a 2D grid of millions of tiny pixels. Each pixel sits at a fixed position on this grid, defined by its (x, y) coordinates representing width and height.

Each pixel measures the intensity and color of incoming light. In color images, a pixel stores multiple intensity values corresponding to different color channels (typically Blue, Green, and Red). These values combine to form the final perceived color.
In grayscale images, each pixel stores a single intensity value, representing only brightness since there is no color information.

Using OpenCV, you can inspect a pixel's value: three numbers in BGR order for a color image, or a single brightness value for a grayscale image:

*[Code: reading a pixel value]*

*[Output: the pixel value]*

However, while the camera records where the light landed (x, y) along with its color and intensity, it does not know how far that light traveled. This is called depth ambiguity: there is no depth (z-axis) in the captured 2D image.

As a result:

Objects of different sizes in the 3D world can produce images of the same size:

*[Image: different-sized objects appearing the same size]*

or

Objects of the same size in the 3D world can look completely different in an image:

*[Image: same-sized objects appearing different]*
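The first case falls straight out of the math. Under a simplified pinhole-camera model, a 3D point (X, Y, Z) projects to image coordinates (f·X/Z, f·Y/Z) for focal length f, so scaling an object's size and distance together leaves its image unchanged:

```python
# Pinhole-camera projection (a simplified model): with focal length f,
# a 3D point (X, Y, Z) lands at image coordinates (f*X/Z, f*Y/Z).
def project(point, f=1.0):
    x, y, z = point
    return (f * x / z, f * y / z)

near_small = (1.0, 1.0, 2.0)  # smaller object, closer to the camera
far_large  = (2.0, 2.0, 4.0)  # twice as large, twice as far away

# Both project to the exact same point -- the image alone cannot tell them apart
print(project(near_small))  # (0.5, 0.5)
print(project(far_large))   # (0.5, 0.5)
```

Infinitely many 3D scenes project to the same 2D image, which is depth ambiguity in a nutshell.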

There are two major ways to work around this limitation of camera sensors:

Note: I am still actively learning these solutions and will provide a more detailed explanation in a future blog post after building practical implementations.

  • Contextual Information: uses visual cues like perspective, shadows, and object overlap to estimate depth from a single image.

*[Image: contextual depth cues]*

  • Depth Sensors: Depth sensors emit light (often infrared), which reflects off objects back to the sensor. By measuring the time delay or pattern distortion of returning light, distance is computed using light’s constant speed.

*[Image: how a depth sensor works]*
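The time-of-flight calculation behind such a sensor is simple enough to sketch directly (the delay value below is made up for illustration):

```python
# Sketch of the time-of-flight calculation a depth sensor performs:
# distance = (speed of light * round-trip delay) / 2
SPEED_OF_LIGHT = 299_792_458  # meters per second

def distance_meters(round_trip_delay_s: float) -> float:
    # The light pulse travels out to the object and back, hence the halving
    return SPEED_OF_LIGHT * round_trip_delay_s / 2

# A ~13.3-nanosecond round trip corresponds to an object ~2 meters away
print(distance_meters(13.34e-9))
```

The tiny delays involved (nanoseconds for room-scale distances) are why these sensors need very precise timing hardware.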

2. Occlusion ambiguity

*[Image: an occluded scene]*

Occlusion ambiguity happens when parts of the scene are hidden by other objects, so the image contains incomplete information.

A camera only records what’s visible along each ray. If an object is behind something else, those rays never reach the sensor, so that part of the object simply doesn’t appear in the image.

In practice, this is addressed using contextual cues, multiple viewpoints, temporal information from video, depth sensors, or learning-based models that infer missing parts.

Note: At my current stage, I understand these approaches conceptually, but I haven’t yet implemented them. I plan to explore how systems handle occlusion in real-world scenarios and explain in depth in a future blog as I gain more practical experience.

3. Noise Ambiguity

*[Images: a dog in different lighting conditions]*

Noise ambiguity occurs because cameras capture light in varying conditions. In good lighting (e.g., a dog in sunlight), enough light reaches the sensor, producing clear and stable pixel values. However, in low light (e.g., at night), fewer photons are captured, leading to random variations that appear as noise. This makes it difficult to distinguish real details, like the dog’s fur, from random grain, creating uncertainty in the image.

Aside from lighting, noise can also stem from weather conditions, motion, lens imperfections, and so on.

Noise is reduced using techniques like filtering (e.g., Gaussian blur), averaging multiple frames, improved camera hardware, or learning-based methods. These approaches aim to suppress random variations while preserving important image details, though a trade-off between noise reduction and detail loss always exists.

Seeing Isn’t Understanding

This exploration made one thing clear: Vision to a machine is not just about seeing, but about interpreting incomplete and ambiguous data. What seems effortless for humans is fundamentally challenging for machines due to depth, occlusion, and noise ambiguities.

Moving Beyond Theory: Building Real Computer Vision Systems

At this stage, my focus has been on building a strong understanding of how images are formed and represented. The next step is moving beyond theory by implementing and experimenting with methods that address these limitations in practice.

I’ll be documenting that journey as I build and test real computer vision systems. More detailed and practical insights coming next.
