Imagine you're building a system that tracks whether a user is looking
at a screen. Not whether they are focused in a psychological sense, but
something much simpler --- and surprisingly tricky:
Are they even looking at the screen right now?
At first glance, this sounds trivial. But real users don't behave
cleanly. They glance sideways, blink, tilt their heads, or stare forward
while their eyes drift elsewhere. Very quickly, "On-Screen" turns into a
collection of edge cases.
In this post, I describe the practical process I went through while
building a real-time computer vision system that classifies each video
frame as On-Screen or Off-Screen. No mind reading --- just
concrete, explainable decisions.
Why On / Off Screen Detection Is Useful
On/Off Screen detection is rarely the final goal. It is usually a
supporting signal for other systems, such as:
- Estimating concentration or engagement
- Monitoring behavior during online exams
- Measuring effective screen time
- Filtering out irrelevant frames (user leaves the chair)
Without a stable On/Off Screen layer, any downstream metric quickly
becomes noisy and unreliable.
The goal was deliberately narrow:
Classify each frame as On-Screen or Off-Screen in a stable,
explainable, and tunable way.
Early Attempts: Starting with the Eyes
The first approach focused only on the eyes.
I tracked:
- Iris position
- Normalized iris offset within the eye
This worked reasonably well in calm, frontal cases. However, it failed
in many realistic situations:
- Partial face visibility
- Slight head rotation while the eyes appeared centered
- Increased noise when facial landmarks jittered
Eye-only tracking turned out to be too sensitive to small errors and
visual noise.
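
To make "normalized iris offset" concrete, here is a minimal sketch of the idea. The function name and the assumption that the landmark detector provides 2D eye-corner and iris-center coordinates are mine, not part of the original system.

```python
def normalized_iris_offset(iris_center_x, eye_inner_x, eye_outer_x):
    """Horizontal iris position normalized to roughly [-1, 1].

     0 -> iris centered in the eye
    +1 -> iris near the outer corner
    -1 -> iris near the inner corner
    Coordinates are assumed to come from a face-landmark detector
    (e.g. normalized image coordinates); the exact source does not matter.
    """
    eye_width = eye_outer_x - eye_inner_x
    if abs(eye_width) < 1e-6:  # degenerate landmarks, e.g. eye barely visible
        return None
    # Position of the iris inside the eye: 0.0 at the inner corner, 1.0 at the outer
    relative = (iris_center_x - eye_inner_x) / eye_width
    # Re-center so that 0 means "looking roughly straight"
    return 2.0 * relative - 1.0
```

The appeal of this signal is that it is dimensionless, but as noted above it inherits every pixel of landmark jitter.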
Second Attempt: Adding Head Pose
To stabilize the system, I introduced head pose estimation based on
facial landmarks:
- Yaw (left / right rotation)
- Pitch (up / down rotation)
Head pose provides a coarse estimate of where the face is oriented,
which helps correct many of the failures seen with eye-only tracking.
However, it was still not enough:
- The head may face the screen while the eyes clearly look away
At this stage, it became clear that head pose is necessary but not
sufficient.
(A short mathematical intuition behind Pitch and Yaw appears later in
this post.)
Yaw, Pitch, and Roll --- From Aviation
The terms Yaw, Pitch, and Roll originate from aviation and
describe orientation in 3D space:
- Yaw --- rotation left or right around the vertical axis
- Pitch --- rotation up or down around the horizontal axis
- Roll --- tilting sideways around the forward axis
This is exactly how aircraft orientation is described --- relative to a
fixed coordinate system.
The animation below illustrates these three rotations on a rigid
body:

Yaw, Pitch, and Roll visualized on a rigid body --- the same geometry applies to head pose estimation.
For head pose estimation, Yaw and Pitch are the most
informative. Roll mainly reflects head tilt and is less useful for
determining whether the screen is within view.
Mathematical Intuition: Head Pose as Angles Between Vectors
Head pose estimation is fundamentally about 3D orientation, not
position.
I define a 3D coordinate system centered on the head:
- X-axis: left → right
- Y-axis: up → down
- Z-axis: forward → backward (toward the screen)
Using a small set of stable facial landmarks (eyes, nose tip, chin), I
estimate a face-direction vector that represents where the face is
pointing.
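
The post does not specify exactly how this vector is constructed, so the sketch below shows one simple option: treat the face as a plane spanned by the eye line and the eyes-to-chin line, and take its normal. The function and its inputs are illustrative assumptions; PnP-based pose solvers are a common alternative.

```python
import numpy as np

def face_direction(left_eye, right_eye, chin):
    """Estimate a unit vector pointing 'out of the face'.

    left_eye, right_eye, chin: 3D landmark positions as (x, y, z) array-likes.
    The face normal is taken as the cross product of two vectors that span
    the face plane: eye-to-eye and eyes-midpoint-to-chin.
    """
    left_eye, right_eye, chin = map(np.asarray, (left_eye, right_eye, chin))
    across = right_eye - left_eye                # spans the face horizontally
    eyes_mid = (left_eye + right_eye) / 2.0
    down = chin - eyes_mid                       # spans the face vertically
    normal = np.cross(down, across)              # perpendicular to the face plane
    # The sign of the normal depends on the landmark coordinate convention;
    # flip it if it points into the head instead of toward the camera.
    return normal / np.linalg.norm(normal)
```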
Pitch (Up / Down)
To compute Pitch:
- Project the face-direction vector onto the Y-Z plane
- Measure the angle between this projection and the Z-axis
Intuitively:
- Looking up → positive pitch
- Looking down → negative pitch
Large pitch values indicate the screen is unlikely to be in the user's
field of view.
Yaw (Left / Right)
To compute Yaw:
- Project the face-direction vector onto the X-Z plane
- Measure the angle between this projection and the Z-axis
Turning the head left or right increases yaw and gradually moves the
screen out of view.
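
Putting both definitions together, here is a minimal sketch of the two projections as code. The sign convention, and the assumption that Y grows upward, are mine; flip signs if your landmark coordinates grow downward.

```python
import numpy as np

def pitch_yaw_degrees(direction):
    """Pitch and yaw of a face-direction vector, following the projections above.

    direction: (x, y, z) with X = left/right, Y = up/down, Z = toward the screen.
    Pitch: angle between the Y-Z projection of the vector and the Z-axis.
    Yaw:   angle between the X-Z projection of the vector and the Z-axis.
    """
    x, y, z = np.asarray(direction, dtype=float)
    pitch = np.degrees(np.arctan2(y, z))  # signed angle in the Y-Z plane
    yaw = np.degrees(np.arctan2(x, z))    # signed angle in the X-Z plane
    return pitch, yaw
```

Fed with the face-direction sketch above, this yields two angles per frame that can be thresholded directly.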
Why Angles Matter
This formulation focuses on angles between vectors, not absolute
positions.
As a result, it is:
- Scale-invariant
- Robust to camera distance
- Independent of face size
In practice:
- Large angles → strong evidence of Off-Screen
- Small angles → ambiguous and must be combined with gaze
Eye Closure: Prolonged Blinks
Short blinks are natural. Prolonged eye closure, measured using the
Eye Aspect Ratio (EAR), is treated as Off-Screen even if head
orientation appears valid.
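
For reference, here is a sketch of the standard six-landmark EAR formula together with a simple closure tracker. The 0.2 threshold and 15-frame duration are illustrative placeholders, not tuned values from the system described here.

```python
import numpy as np

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """Eye Aspect Ratio for one eye, using the usual six-landmark layout:
    p1 and p4 are the horizontal eye corners; (p2, p6) and (p3, p5) are the
    vertical landmark pairs. EAR drops toward 0 as the eye closes.
    """
    p = [np.asarray(pt, dtype=float) for pt in (p1, p2, p3, p4, p5, p6)]
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = np.linalg.norm(p[0] - p[3])
    return vertical / (2.0 * horizontal)

class ClosureTracker:
    """Flags prolonged eye closure by counting consecutive low-EAR frames."""
    def __init__(self, ear_threshold=0.2, min_closed_frames=15):
        self.ear_threshold = ear_threshold
        self.min_closed_frames = min_closed_frames
        self.closed_frames = 0

    def update(self, ear):
        self.closed_frames = self.closed_frames + 1 if ear < self.ear_threshold else 0
        return self.closed_frames >= self.min_closed_frames  # True -> prolonged closure
```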
Handling Instability
Frame-by-frame classification causes rapid flickering: On → Off → On
within fractions of a second.
To address this, I applied temporal smoothing:
- A sliding window of 5 frames
- Final label determined by majority vote
This introduces a small delay but dramatically improves stability.
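
A minimal sketch of that smoothing step, assuming per-frame labels of "on" / "off":

```python
from collections import Counter, deque

class MajorityVoteSmoother:
    """Sliding-window majority vote over recent frame labels (window of 5,
    as in the text)."""
    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)

    def update(self, label):
        """Add the latest raw label and return the smoothed label."""
        self.window.append(label)
        return Counter(self.window).most_common(1)[0][0]
```

Each raw per-frame decision goes through update(), and the returned label is what downstream logic sees; the worst-case lag is a couple of frames.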
Decision Hierarchy
Not all signals are equally important. Through experimentation, I
arrived at the following hierarchy:
- Strong Off conditions (prolonged eye closure, extreme head pose)
- Moderate Pitch / Yaw deviations
- Subtle gaze deviations
If a strong Off condition is met, the frame is classified as Off-Screen
even if other signals appear valid.
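
The sketch below is one possible reading of this hierarchy in code; every threshold is a placeholder rather than a value taken from the real system.

```python
def classify_frame(prolonged_closure, pitch, yaw, gaze_offset,
                   extreme_angle=40.0, moderate_angle=20.0, gaze_limit=0.5):
    """Classify a single frame as "on" or "off" using the tiered rules above."""
    head_angle = max(abs(pitch), abs(yaw))
    # 1. Strong Off conditions override everything else
    if prolonged_closure or head_angle > extreme_angle:
        return "off"
    # 2. Moderate head-pose deviation is suspicious on its own
    if head_angle > moderate_angle:
        return "off"
    # 3. Small angles are ambiguous: fall back to the gaze signal
    if abs(gaze_offset) > gaze_limit:
        return "off"
    return "on"
```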
From Frames to Metrics
Once each frame is classified:
- On-Screen and Off-Screen frames are counted
- Rates such as On-Screen Percentage or Gaze Aversion Rate are computed
- Session-level summaries can be derived
All higher-level metrics rely on this foundational layer.
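
As an illustration, here is how those counts might be aggregated. The exact definition of Gaze Aversion Rate is not given in the post, so it is interpreted here as on-to-off transitions per 100 frames; that choice is an assumption.

```python
def session_summary(labels):
    """Aggregate smoothed per-frame labels ("on" / "off") into session metrics."""
    total = len(labels)
    on_frames = sum(1 for label in labels if label == "on")
    # Count how often the user switches from looking at the screen to away
    aversions = sum(1 for prev, cur in zip(labels, labels[1:])
                    if prev == "on" and cur == "off")
    return {
        "on_screen_pct": 100.0 * on_frames / total if total else 0.0,
        "gaze_aversions_per_100_frames": 100.0 * aversions / total if total else 0.0,
    }
```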
Lessons Learned
- Start with the simplest signal
- Expect it to fail
- Add complementary signals incrementally
- Stabilize before optimizing accuracy
- Avoid guessing thresholds without real data
Closing Thoughts
Determining whether someone is looking at a screen sounds simple ---
until you actually build it.
By:
- Defining a narrow, operational question
- Combining multiple weak signals
- Adding deliberate stability measures
It is possible to build a reliable foundation for attention-related
systems.
Most importantly, this approach reflects the real process: observe,
fail, refine --- rather than assuming a perfect solution from the
start.