Imagine you're building a system that tracks whether a user is looking
at a screen. Not whether they are focused in a psychological sense, but
something much simpler --- and surprisingly tricky:
Are they even looking at the screen right now?
At first glance, this sounds trivial. But real users don't behave
cleanly. They glance sideways, blink, tilt their heads, or stare forward
while their eyes drift elsewhere. Very quickly, "On-Screen" turns into a
collection of edge cases.
In this post, I describe the practical process I went through while
building a real-time computer vision system that classifies each video
frame as On-Screen or Off-Screen. No mind reading --- just
concrete, explainable decisions.
Why On / Off Screen Detection Is Useful
On/Off Screen detection is rarely the final goal. It is usually a
supporting signal for other systems, such as:
- Estimating concentration or engagement
- Monitoring behavior during online exams
- Measuring effective screen time
- Filtering out irrelevant frames (user leaves the chair)
Without a stable On/Off Screen layer, any downstream metric quickly
becomes noisy and unreliable.
The goal was deliberately narrow:
Classify each frame as On-Screen or Off-Screen in a stable,
explainable, and tunable way.
Early Attempts: Starting with the Eyes
The first approach focused only on the eyes.
I tracked:
- Iris position
- Normalized iris offset within the eye
This worked reasonably well in calm, frontal cases. However, it failed
in many realistic situations:
- Partial face visibility
- Slight head rotation while the eyes appeared centered
- Increased noise when facial landmarks jittered
Eye-only tracking turned out to be too sensitive to small errors and
visual noise.
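
To make "normalized iris offset" concrete, here is a minimal sketch of the idea. The function name and the assumption that the landmark detector provides 2D eye-corner and iris-center coordinates are mine, not part of the original system.

```python
def normalized_iris_offset(iris_center_x, eye_inner_x, eye_outer_x):
    """Horizontal iris position normalized to roughly [-1, 1].

     0 -> iris centered in the eye
    +1 -> iris near the outer corner
    -1 -> iris near the inner corner
    Coordinates are assumed to come from a face-landmark detector
    (e.g. normalized image coordinates); the exact source does not matter.
    """
    eye_width = eye_outer_x - eye_inner_x
    if abs(eye_width) < 1e-6:  # degenerate landmarks, e.g. eye barely visible
        return None
    # Position of the iris inside the eye: 0.0 at the inner corner, 1.0 at the outer
    relative = (iris_center_x - eye_inner_x) / eye_width
    # Re-center so that 0 means "looking roughly straight"
    return 2.0 * relative - 1.0
```

The appeal of this signal is that it is dimensionless, but as noted above it inherits every pixel of landmark jitter.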
Second Attempt: Adding Head Pose
To stabilize the system, I introduced head pose estimation based on
facial landmarks:
- Yaw (left / right rotation)
- Pitch (up / down rotation)
Head pose provides a coarse estimate of where the face is oriented,
which helps correct many of the failures seen with eye-only tracking.
However, it was still not enough:
- The head may face the screen while the eyes clearly look away
At this stage, it became clear that head pose is necessary but not
sufficient.
(A short mathematical intuition behind Pitch and Yaw appears later in
this post.)
Yaw, Pitch, and Roll --- From Aviation
The terms Yaw, Pitch, and Roll originate from aviation and
describe orientation in 3D space:
- Yaw --- rotation left or right around the vertical axis
- Pitch --- rotation up or down around the horizontal axis
- Roll --- tilting sideways around the forward axis
This is exactly how aircraft orientation is described --- relative to a
fixed coordinate system.
The animation below illustrates these three rotations on a rigid
body:

Yaw, Pitch, and Roll visualized on a rigid body --- the same geometry applies to head pose estimation.
For head pose estimation, Yaw and Pitch are the most
informative. Roll mainly reflects head tilt and is less useful for
determining whether the screen is within view.
Mathematical Intuition: Head Pose as Angles Between Vectors
Head pose estimation is fundamentally about 3D orientation, not
position.
I define a 3D coordinate system centered on the head:
- X-axis: left → right
- Y-axis: up → down
- Z-axis: forward → backward (toward the screen)
Using a small set of stable facial landmarks (eyes, nose tip, chin), I
estimate a face-direction vector that represents where the face is
pointing.
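
The post does not specify exactly how this vector is constructed, so the sketch below shows one simple option: treat the face as a plane spanned by the eye line and the eyes-to-chin line, and take its normal. The function and its inputs are illustrative assumptions; PnP-based pose solvers are a common alternative.

```python
import numpy as np

def face_direction(left_eye, right_eye, chin):
    """Estimate a unit vector pointing 'out of the face'.

    left_eye, right_eye, chin: 3D landmark positions as (x, y, z) array-likes.
    The face normal is taken as the cross product of two vectors that span
    the face plane: eye-to-eye and eyes-midpoint-to-chin.
    """
    left_eye, right_eye, chin = map(np.asarray, (left_eye, right_eye, chin))
    across = right_eye - left_eye                # spans the face horizontally
    eyes_mid = (left_eye + right_eye) / 2.0
    down = chin - eyes_mid                       # spans the face vertically
    normal = np.cross(down, across)              # perpendicular to the face plane
    # The sign of the normal depends on the landmark coordinate convention;
    # flip it if it points into the head instead of toward the camera.
    return normal / np.linalg.norm(normal)
```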
Pitch (Up / Down)
To compute Pitch:
- Project the face-direction vector onto the Y-Z plane
- Measure the angle between this projection and the Z-axis
Intuitively:
- Looking up → positive pitch
- Looking down → negative pitch
Large pitch values indicate the screen is unlikely to be in the user's
field of view.
Yaw (Left / Right)
To compute Yaw:
- Project the face-direction vector onto the X-Z plane
- Measure the angle between this projection and the Z-axis
Turning the head left or right increases yaw and gradually moves the
screen out of view.
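
Putting both definitions together, here is a minimal sketch of the two projections as code. The sign convention, and the assumption that Y grows upward, are mine; flip signs if your landmark coordinates grow downward.

```python
import numpy as np

def pitch_yaw_degrees(direction):
    """Pitch and yaw of a face-direction vector, following the projections above.

    direction: (x, y, z) with X = left/right, Y = up/down, Z = toward the screen.
    Pitch: angle between the Y-Z projection of the vector and the Z-axis.
    Yaw:   angle between the X-Z projection of the vector and the Z-axis.
    """
    x, y, z = np.asarray(direction, dtype=float)
    pitch = np.degrees(np.arctan2(y, z))  # signed angle in the Y-Z plane
    yaw = np.degrees(np.arctan2(x, z))    # signed angle in the X-Z plane
    return pitch, yaw
```

Fed with the face-direction sketch above, this yields two angles per frame that can be thresholded directly.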
Why Angles Matter
This formulation focuses on angles between vectors, not absolute
positions.
As a result, it is:
- Scale-invariant
- Robust to camera distance
- Independent of face size
In practice:
- Large angles → strong evidence of Off-Screen
- Small angles → ambiguous and must be combined with gaze
Eye Closure: Prolonged Blinks
Short blinks are natural. Prolonged eye closure, measured using the
Eye Aspect Ratio (EAR), is treated as Off-Screen even if head
orientation appears valid.
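
For reference, here is a sketch of the standard six-landmark EAR formula together with a simple closure tracker. The 0.2 threshold and 15-frame duration are illustrative placeholders, not tuned values from the system described here.

```python
import numpy as np

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """Eye Aspect Ratio for one eye, using the usual six-landmark layout:
    p1 and p4 are the horizontal eye corners; (p2, p6) and (p3, p5) are the
    vertical landmark pairs. EAR drops toward 0 as the eye closes.
    """
    p = [np.asarray(pt, dtype=float) for pt in (p1, p2, p3, p4, p5, p6)]
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = np.linalg.norm(p[0] - p[3])
    return vertical / (2.0 * horizontal)

class ClosureTracker:
    """Flags prolonged eye closure by counting consecutive low-EAR frames."""
    def __init__(self, ear_threshold=0.2, min_closed_frames=15):
        self.ear_threshold = ear_threshold
        self.min_closed_frames = min_closed_frames
        self.closed_frames = 0

    def update(self, ear):
        self.closed_frames = self.closed_frames + 1 if ear < self.ear_threshold else 0
        return self.closed_frames >= self.min_closed_frames  # True -> prolonged closure
```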
Handling Instability
Frame-by-frame classification causes rapid flickering: On → Off → On
within fractions of a second.
To address this, I applied temporal smoothing:
- A sliding window of 5 frames
- Final label determined by majority vote
This introduces a small delay but dramatically improves stability.
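
A minimal sketch of that smoothing step, assuming per-frame labels of "on" / "off":

```python
from collections import Counter, deque

class MajorityVoteSmoother:
    """Sliding-window majority vote over recent frame labels (window of 5,
    as in the text)."""
    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)

    def update(self, label):
        """Add the latest raw label and return the smoothed label."""
        self.window.append(label)
        return Counter(self.window).most_common(1)[0][0]
```

Each raw per-frame decision goes through update(), and the returned label is what downstream logic sees; the worst-case lag is a couple of frames.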
Decision Hierarchy
Not all signals are equally important. Through experimentation, I
arrived at the following hierarchy:
- Strong Off conditions (prolonged eye closure, extreme head pose)
- Moderate Pitch / Yaw deviations
- Subtle gaze deviations
If a strong Off condition is met, the frame is classified as Off-Screen
even if other signals appear valid.
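
The sketch below is one possible reading of this hierarchy in code; every threshold is a placeholder rather than a value taken from the real system.

```python
def classify_frame(prolonged_closure, pitch, yaw, gaze_offset,
                   extreme_angle=40.0, moderate_angle=20.0, gaze_limit=0.5):
    """Classify a single frame as "on" or "off" using the tiered rules above."""
    head_angle = max(abs(pitch), abs(yaw))
    # 1. Strong Off conditions override everything else
    if prolonged_closure or head_angle > extreme_angle:
        return "off"
    # 2. Moderate head-pose deviation is suspicious on its own
    if head_angle > moderate_angle:
        return "off"
    # 3. Small angles are ambiguous: fall back to the gaze signal
    if abs(gaze_offset) > gaze_limit:
        return "off"
    return "on"
```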
From Frames to Metrics
Once each frame is classified:
- On-Screen and Off-Screen frames are counted
- Rates such as On-Screen Percentage or Gaze Aversion Rate are computed
- Session-level summaries can be derived
All higher-level metrics rely on this foundational layer.
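
As an illustration, here is how those counts might be aggregated. The exact definition of Gaze Aversion Rate is not given in the post, so it is interpreted here as on-to-off transitions per 100 frames; that choice is an assumption.

```python
def session_summary(labels):
    """Aggregate smoothed per-frame labels ("on" / "off") into session metrics."""
    total = len(labels)
    on_frames = sum(1 for label in labels if label == "on")
    # Count how often the user switches from looking at the screen to away
    aversions = sum(1 for prev, cur in zip(labels, labels[1:])
                    if prev == "on" and cur == "off")
    return {
        "on_screen_pct": 100.0 * on_frames / total if total else 0.0,
        "gaze_aversions_per_100_frames": 100.0 * aversions / total if total else 0.0,
    }
```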
Lessons Learned
- Start with the simplest signal
- Expect it to fail
- Add complementary signals incrementally
- Stabilize before optimizing accuracy
- Avoid guessing thresholds without real data
Closing Thoughts
Determining whether someone is looking at a screen sounds simple ---
until you actually build it.
By:
- Defining a narrow, operational question
- Combining multiple weak signals
- Adding deliberate stability measures
It is possible to build a reliable foundation for attention-related
systems.
Most importantly, this approach reflects the real process: observe,
fail, refine --- rather than assuming a perfect solution from the
start.