How I built an AI system that lets you control your computer with head movements and voice commands — no mouse, no keyboard
The Vision
What if you could control your computer entirely hands-free?
- Move your mouse with head gestures
- Click with eye blinks
- Right-click by opening your mouth
- Type by speaking
This isn't science fiction. It's possible today using a simple webcam, a microphone, and some clever computer vision.
I decided to build it.
The Problem It Solves
Hands-free computing isn't just a cool party trick. It solves real problems:
- Accessibility — People with motor impairments (paralysis, arthritis, etc.) can use computers independently
- Sterile environments — Surgeons, lab technicians, and medical staff can interact with screens without touching anything
- Ergonomics — Reduces repetitive strain from constant mouse/keyboard use
- Productivity — Some people work faster with eye + voice instead of hunting for keys
I built this as a proof of concept — to prove it's possible with consumer hardware, not expensive specialized equipment.
The Architecture
The system has three main components:
Head tracking pipeline:
Webcam → MediaPipe FaceMesh → Head Tracking Module → Cursor Movement + Click Detection → OS Mouse Control
Voice pipeline:
Microphone → Speech Recognition → Voice Command Module → Command Parsing → Execute Actions (open app, switch window, etc.)
Component 1: Head Tracking (The Eyes)
This is the core. Using MediaPipe FaceMesh, I detect 468 facial landmarks in real-time:
Eye landmarks (24 per eye)
├── Iris position
├── Eyelid opening
└── Pupil location
Mouth landmarks (20)
├── Lip corners
└── Mouth opening
Nose landmarks (1)
└── Tip (used for gaze direction)
The algorithm:
- Capture video from webcam (30 FPS)
- Detect face in frame
- Locate landmarks using MediaPipe
- Calculate gaze direction based on nose tip
- Map to screen coordinates (nose tip X,Y → mouse X,Y)
- Detect blinks (eye closure held for roughly 100-400ms = click)
- Detect mouth open (lip distance > threshold = right-click)
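import cv2
import mediapipe as mp
from pynput.mouse import Button, Controller

# Initialize the landmark model, the OS mouse controller, and the webcam
face_mesh = mp.solutions.face_mesh.FaceMesh()
mouse = Controller()
cap = cv2.VideoCapture(0)

# Calibration: map face coordinates to screen
SCREEN_WIDTH = 1920
SCREEN_HEIGHT = 1080
THRESHOLD = 0.05  # Placeholder mouth-open distance (normalized units); tune per user

# is_eye_open() and calculate_mouth_distance() are helper functions not shown here
was_open = True         # Eye state on the previous frame
mouth_was_open = False  # Mouth state on the previous frame

while True:
    # Capture frame; read() returns (success, image)
    ok, frame = cap.read()
    if not ok:
        break
    # Detect landmarks
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        continue  # No face found in this frame
    landmarks = results.multi_face_landmarks[0].landmark
    # Get nose tip (landmark 1)
    nose = landmarks[1]
    # Map to screen (nose moves left-right, up-down)
    # Landmark x and y are normalized to 0-1 of the frame → scale to 0-1920 and 0-1080
    screen_x = int(nose.x * SCREEN_WIDTH)
    screen_y = int(nose.y * SCREEN_HEIGHT)
    # Move mouse
    mouse.position = (screen_x, screen_y)
    # Detect left blink (eye closure)
    left_eye_open = is_eye_open(landmarks, eye='left')
    if was_open and not left_eye_open:  # Transition from open to closed
        mouse.click(Button.left)  # Single click
    was_open = left_eye_open
    # Detect mouth open (right-click), firing once per opening instead of every frame
    mouth_distance = calculate_mouth_distance(landmarks)
    mouth_open = mouth_distance > THRESHOLD
    if mouth_open and not mouth_was_open:
        mouse.click(Button.right)  # Right-click
    mouth_was_open = mouth_open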
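The loop above calls is_eye_open and calculate_mouth_distance without defining them. Here is a minimal sketch of what such helpers might look like. It assumes commonly used FaceMesh landmark indices (263/362/386/374 and 33/133/159/145 for the eye corners and lids, 13/14 for the inner lips) and a hypothetical openness cutoff of 0.2. Both the indices and the cutoff are illustrative assumptions to verify against your own camera feed, and fake_landmarks exists only so the helpers can be tried without a webcam.

```python
import math
from types import SimpleNamespace

# Assumed FaceMesh indices: (outer corner, inner corner, upper lid, lower lid)
EYE_INDICES = {
    "left": (263, 362, 386, 374),
    "right": (33, 133, 159, 145),
}
UPPER_LIP, LOWER_LIP = 13, 14   # assumed inner-lip indices
EYE_OPEN_RATIO = 0.2            # hypothetical cutoff, tune per user


def _dist(a, b):
    """Euclidean distance between two landmarks in normalized image units."""
    return math.hypot(a.x - b.x, a.y - b.y)


def is_eye_open(landmarks, eye="left"):
    """Return True when lid gap divided by eye width exceeds EYE_OPEN_RATIO."""
    outer, inner, upper, lower = (landmarks[i] for i in EYE_INDICES[eye])
    width = _dist(outer, inner)
    if width == 0:
        return False
    return _dist(upper, lower) / width > EYE_OPEN_RATIO


def calculate_mouth_distance(landmarks):
    """Vertical gap between the inner lips, in normalized image units."""
    return _dist(landmarks[UPPER_LIP], landmarks[LOWER_LIP])


def fake_landmarks(lid_gap, lip_gap):
    """Build a synthetic 468-point landmark list for testing without a camera."""
    points = [SimpleNamespace(x=0.0, y=0.0) for _ in range(468)]
    for outer, inner, upper, lower in EYE_INDICES.values():
        points[outer] = SimpleNamespace(x=0.40, y=0.50)
        points[inner] = SimpleNamespace(x=0.50, y=0.50)
        points[upper] = SimpleNamespace(x=0.45, y=0.50 - lid_gap / 2)
        points[lower] = SimpleNamespace(x=0.45, y=0.50 + lid_gap / 2)
    points[UPPER_LIP] = SimpleNamespace(x=0.5, y=0.70)
    points[LOWER_LIP] = SimpleNamespace(x=0.5, y=0.70 + lip_gap)
    return points
```

Dividing the lid gap by the eye width keeps the eye check roughly independent of how close the face is to the camera. The mouth helper returns a raw distance, which is why its threshold needs tuning per user.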
Challenges:
- Calibration: Every person's face is different. I built a 5-point calibration where the user looks at the corners of the screen
- Cursor jitter: Raw landmarks are noisy. I applied Gaussian smoothing to stabilize the cursor
- Blink detection: The system has to distinguish intentional clicks from natural blinks. I used temporal filtering (the eye closure must last roughly 100-400ms)
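To make the calibration idea concrete, here is a minimal sketch of how recorded nose positions could become a screen mapping. It assumes the calibration pass yields a list of normalized (x, y) nose positions, one per target, and simply stretches their bounding box over the screen. build_mapper, to_screen, and the sample values are hypothetical names and numbers for illustration; a real calibration may fit something more elaborate.

```python
def build_mapper(samples, screen_width=1920, screen_height=1080):
    """Return a function mapping a normalized nose (x, y) to screen pixels.

    samples: nose positions recorded while the user looked at the
    calibration targets (assumed format: list of (x, y) tuples).
    """
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_max == x_min or y_max == y_min:
        raise ValueError("calibration samples must span both axes")

    def to_screen(x, y):
        # Normalize into 0-1 over the calibrated range, then clamp to the screen
        u = min(max((x - x_min) / (x_max - x_min), 0.0), 1.0)
        v = min(max((y - y_min) / (y_max - y_min), 0.0), 1.0)
        return round(u * (screen_width - 1)), round(v * (screen_height - 1))

    return to_screen


# Example samples (assumed layout: four corners plus the center)
calibration = [(0.40, 0.45), (0.60, 0.45), (0.40, 0.55), (0.60, 0.55), (0.50, 0.50)]
to_screen = build_mapper(calibration)
```

With a mapping like this, a small comfortable head movement covers the whole screen, and positions outside the calibrated range are clamped to the screen edge.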
Component 2: Voice Control (The Ears)
import os
import keyboard  # Assumed: the "keyboard" package, which provides write()
import speech_recognition as sr

recognizer = sr.Recognizer()
microphone = sr.Microphone()

while True:
    # Listen for speech; listen() raises WaitTimeoutError if nobody speaks in time
    with microphone as source:
        try:
            audio = recognizer.listen(source, timeout=1)
        except sr.WaitTimeoutError:
            continue
    # Convert to text (recognize_google sends the audio to Google's web API)
    try:
        text = recognizer.recognize_google(audio)
        print(f"Recognized: {text}")
        command = text.lower()
        # Parse command
        if "open" in command and "chrome" in command:
            os.system("google-chrome &")  # Open Chrome without blocking the loop
        elif "close" in command:
            # Close active window
            os.system("wmctrl -c :ACTIVE:")
        elif "switch" in command:
            # Alt+Tab
            os.system("xdotool key alt+Tab")
        else:
            # Treat as dictation - type it
            keyboard.write(text)
    except (sr.UnknownValueError, sr.RequestError) as e:
        print(f"Could not recognize: {e}")
Commands supported:
- "Open [app]" → launches applications
- "Close" → closes current window
- "Next" / "Previous" → switch windows
- "Screenshot" → takes screenshot
- Everything else → treated as dictation (typed into active window)
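The command list above can be sketched as a small parser that is separate from the code that executes actions. This is only an illustration: parse_command and the action labels it returns are hypothetical names, and it matches whole utterances rather than substrings so that dictated sentences containing a word like "close" are not mistaken for commands.

```python
def parse_command(text):
    """Map recognized speech to an (action, argument) pair.

    The action names are hypothetical labels, not part of any library.
    """
    words = text.lower().strip().split()
    if not words:
        return ("ignore", None)
    if words[0] == "open" and len(words) > 1:
        return ("open_app", " ".join(words[1:]))
    if words == ["close"]:
        return ("close_window", None)
    if words == ["next"]:
        return ("next_window", None)
    if words == ["previous"]:
        return ("previous_window", None)
    if words == ["screenshot"]:
        return ("screenshot", None)
    # Anything else is dictation: keep the original casing
    return ("dictate", text)
```

Keeping parsing pure like this makes it easy to test without a microphone. The caller then decides how each action maps to an OS call.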
Component 3: Integration (Flask Backend)
I bundled everything in a Flask app:
from flask import Flask, render_template, jsonify
import threading
from eye_tracking import start_eye_tracking
from voice_module import start_voice_control

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/api/start', methods=['POST'])
def start():
    threading.Thread(target=start_eye_tracking, daemon=True).start()
    threading.Thread(target=start_voice_control, daemon=True).start()
    return jsonify({"status": "Eye tracking and voice control started"})

@app.route('/api/stop', methods=['POST'])
def stop():
    # Signal threads to stop
    return jsonify({"status": "Stopped"})

if __name__ == '__main__':
    app.run(debug=True, port=5000)
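The stop endpoint above leaves the actual signalling as a comment. One common way to do it is a shared threading.Event that each worker loop checks, sketched below. tracking_loop is a stand-in for the real tracking and voice loops, and start_worker and stop_worker are hypothetical helpers that the two Flask routes could call.

```python
import threading
import time

stop_event = threading.Event()


def tracking_loop(stop, interval=0.01):
    """Stand-in for a tracking loop; does placeholder work until stop is set."""
    frames = 0
    while not stop.is_set():
        frames += 1          # a real loop would grab and process a frame here
        time.sleep(interval)
    return frames


def start_worker():
    """Clear the flag and launch the loop in a daemon thread."""
    stop_event.clear()
    worker = threading.Thread(target=tracking_loop, args=(stop_event,), daemon=True)
    worker.start()
    return worker


def stop_worker(worker, timeout=1.0):
    """Ask the loop to exit and wait briefly for it to finish."""
    stop_event.set()
    worker.join(timeout)
    return not worker.is_alive()
```

An Event avoids killing threads from outside. Each loop exits at a safe point on its next iteration.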
The frontend shows:
- Live camera feed with facial landmarks overlay
- Current cursor position
- Last recognized command
- Start/Stop buttons
Challenges & Solutions
🚨 Challenge #1: Face Not Always Visible
The Problem:
If I turned my head too much, MediaPipe lost face detection. The cursor would jump or freeze.
The Solution:
Implement predictive tracking:
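if face_detected:
    current_position = update_landmark_positions()
    velocity = (current_position - last_known_position) / dt
    last_known_position = current_position
else:
    # Extrapolate based on the last measured velocity
    current_position = last_known_position + velocity * dt
    last_known_position = current_position
    # Cursor moves smoothly even if face isn't detected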
Now the cursor keeps moving smoothly even if face detection drops for a frame.
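A runnable sketch of the same idea, under a few assumptions: positions are (x, y) tuples, a missing detection is passed as None, and the cursor stops coasting after a hypothetical max_missed_frames so that it does not drift off screen during a long dropout. PredictiveTracker is an illustrative name.

```python
class PredictiveTracker:
    """Hold the last position and coast on velocity when the face is lost."""

    def __init__(self, max_missed_frames=5):
        self.position = None
        self.velocity = (0.0, 0.0)
        self.missed = 0
        self.max_missed_frames = max_missed_frames  # hypothetical limit

    def update(self, measurement, dt):
        """measurement is an (x, y) tuple, or None when no face was detected."""
        if measurement is not None:
            if self.position is not None and dt > 0:
                self.velocity = (
                    (measurement[0] - self.position[0]) / dt,
                    (measurement[1] - self.position[1]) / dt,
                )
            self.position = measurement
            self.missed = 0
        elif self.position is not None:
            self.missed += 1
            if self.missed <= self.max_missed_frames:
                # Extrapolate along the last measured velocity
                self.position = (
                    self.position[0] + self.velocity[0] * dt,
                    self.position[1] + self.velocity[1] * dt,
                )
            # After too many misses, hold still instead of drifting away
        return self.position
```

In a live loop, dt would be the time since the previous frame, and the returned position would feed the cursor update.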
🚨 Challenge #2: Lighting Conditions Matter A Lot
The Problem:
In dim lighting, MediaPipe couldn't detect faces. In bright sunlight, eye landmarks were inaccurate.
The Solution:
Add adaptive preprocessing:
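# Contrast-limited adaptive histogram equalization (CLAHE) to improve contrast
import cv2
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
# Equalize only the lightness channel so the frame stays a 3-channel color image
lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
l_channel, a_channel, b_channel = cv2.split(lab)
lab = cv2.merge((clahe.apply(l_channel), a_channel, b_channel))
frame = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
# This helps MediaPipe work in varying lighting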
Result: Works in low light, bright light, and everything in between.
🚨 Challenge #3: Cursor Jitter
The Problem:
Raw face landmarks were noisy. Moving the nose landmark by 1% caused the cursor to jump erratically.
Before smoothing:
●▯●▯▯●▯●●▯ (jumpy, unpleasant)
After smoothing:
●●●●●●●●●● (smooth trajectory)
The Solution:
Apply Kalman Filter (used in robotics for sensor smoothing):
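import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=2, dim_z=2)  # State and measurement are both a 2D position
kf.x = np.array([[screen_x], [screen_y]], dtype=float)  # Initial state
kf.F = np.eye(2)          # Motion model: position stays put between frames
kf.H = np.eye(2)          # The position is measured directly
kf.P *= 1000.             # Covariance matrix
kf.R = np.eye(2) * 5      # Measurement noise
kf.Q = np.eye(2) * 0.01   # Process noise
while True:
    # Predict
    kf.predict()
    # Update with measurement (nose_x, nose_y: raw position in screen pixels)
    z = np.array([[nose_x], [nose_y]])
    kf.update(z)
    # Use smoothed position
    smooth_x, smooth_y = kf.x[0, 0], kf.x[1, 0]
    mouse.position = (int(smooth_x), int(smooth_y))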
Result: Buttery smooth cursor movement, even with noisy input.
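To show what the predict and update steps do, here is a dependency-free sketch of the same filter reduced to one dimension per axis. It assumes a "position stays put" motion model and reuses the noise values above (P=1000, Q=0.01, R=5) as defaults. ScalarKalman and smooth are illustrative names, not part of filterpy.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a 'value stays put' motion model."""

    def __init__(self, initial, p=1000.0, q=0.01, r=5.0):
        self.x = initial   # state estimate
        self.p = p         # estimate variance
        self.q = q         # process noise variance
        self.r = r         # measurement noise variance

    def step(self, z):
        # Predict: the state is unchanged, uncertainty grows by q
        self.p += self.q
        # Update: blend prediction and measurement using the Kalman gain
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x


def smooth(points, **kwargs):
    """Filter a list of (x, y) points with one ScalarKalman per axis."""
    fx = ScalarKalman(points[0][0], **kwargs)
    fy = ScalarKalman(points[0][1], **kwargs)
    return [(fx.step(x), fy.step(y)) for x, y in points]
```

A larger R relative to Q makes the filter trust its own estimate more, which gives a steadier but slower cursor. A smaller R makes it follow the raw landmark more closely.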
🚨 Challenge #4: Accidental Blinks Getting Registered as Clicks
The Problem:
Users would naturally blink, and the system would interpret it as a click. Chaos.
The Solution:
Use temporal constraints:
import time
# Only an eye closure lasting roughly 100-400 ms counts as a deliberate blink-click;
# shorter or longer closures are ignored
blink_start_time = None
BLINK_MIN_DURATION = 100  # ms
BLINK_MAX_DURATION = 400  # ms
while True:
    eye_open = is_eye_open(landmarks)
    if not eye_open and blink_start_time is None:
        blink_start_time = time.time()  # Eye just closed
    if eye_open and blink_start_time is not None:
        blink_duration = (time.time() - blink_start_time) * 1000
        if BLINK_MIN_DURATION < blink_duration < BLINK_MAX_DURATION:
            mouse.click(Button.left)  # Intentional blink-click
        blink_start_time = None
Now only "deliberate" blinks (held for 100-400ms) register as clicks. Accidental blinks are ignored.
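The same timing rule can be wrapped in a small class that takes the timestamp as an argument, which makes it testable without a camera or a real clock. This is a sketch under the 100-400ms window described above. BlinkClickDetector is a hypothetical name.

```python
class BlinkClickDetector:
    """Turn eye-open/closed samples into click events by closure duration."""

    def __init__(self, min_ms=100, max_ms=400):
        self.min_ms = min_ms
        self.max_ms = max_ms
        self._closed_at = None   # timestamp (seconds) when the eye closed

    def update(self, eye_open, now):
        """Feed one sample; return True when a deliberate blink just ended."""
        if not eye_open:
            if self._closed_at is None:
                self._closed_at = now
            return False
        if self._closed_at is None:
            return False
        duration_ms = (now - self._closed_at) * 1000
        self._closed_at = None
        return self.min_ms < duration_ms < self.max_ms
```

In the live loop, each frame would call update(is_eye_open(landmarks), time.monotonic()) and click when it returns True.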
🚨 Challenge #5: CPU Usage
The Problem:
Running MediaPipe face detection at 30 FPS maxed out my laptop's CPU. Fan went crazy.
CPU: 95% (fan noise: WHOOOOOOSH)
GPU: 0% (not being used)
The Solution:
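Do less work per second by lowering the capture rate and skipping frames:
# Note: FaceMesh does not take a torch device, so a torch.cuda check has no effect here;
# the savings below come from processing fewer frames
# Reduce FPS (this is a request; not every camera driver honors it)
cap.set(cv2.CAP_PROP_FPS, 15)  # 15 FPS instead of 30
# Process every other frame
frame_count = 0
results = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_count += 1
    if frame_count % 2 == 0:  # Process every 2nd frame
        results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Update cursor
    else:
        # Use cached landmarks (results) from the previous processed frame
        pass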
Result: CPU usage dropped to 30%, fan quiet, battery lasts longer.
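The frame-skipping part can be isolated into a small wrapper, sketched here with a hypothetical FrameSkipper class. Unlike the loop above, it processes the very first frame so the cache is never empty. The process argument stands in for whatever expensive call is being throttled.

```python
class FrameSkipper:
    """Run an expensive function on every Nth frame and reuse the last result."""

    def __init__(self, process, every=2):
        self.process = process   # e.g. a wrapper around face_mesh.process
        self.every = every
        self.count = 0
        self.cached = None

    def __call__(self, frame):
        if self.count % self.every == 0:
            self.cached = self.process(frame)
        self.count += 1
        return self.cached
```

The trade-off is staleness: with every=2 at 15 FPS, the landmarks driving the cursor can be one frame old.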
Technical Decisions
Why MediaPipe, Not TensorFlow?
MediaPipe:
- ✅ Pre-built face landmark detection (468 points)
- ✅ Real-time (30 FPS on CPU)
- ✅ Optimized for edge devices
- ❌ Less flexible
TensorFlow:
- ✅ Highly customizable
- ✅ Can train on custom data
- ❌ Slower (5-10 FPS on CPU)
- ❌ Usually needs a GPU to reach real-time frame rates
For a real-time interactive system, MediaPipe wins. Lower latency is crucial when controlling a cursor.
Why Google Speech Recognition, Not Whisper?
Google Speech Recognition API:
- ✅ Reliable, accurate
- ✅ Fast
- ❌ Needs an internet connection (recognize_google sends the audio to Google's servers)
OpenAI Whisper:
- ✅ Works offline
- ✅ Open source
- ✅ Highly accurate
- ❌ Slower (requires local inference)
- ❌ Larger model size
For a lightweight prototype, Google's API is better. For a production system, I'd use Whisper.
Results
Hands-Free Computer Interaction works surprisingly well:
Tested on:
- Linux (Ubuntu 20.04)
- Webcam: Logitech C920
- CPU: i7-8750H
- RAM: 16GB
Benchmarks:
- Cursor latency: 80ms (from head movement to screen)
- Blink detection accuracy: 94% (correctly detects intentional clicks)
- Speech recognition accuracy: 92% (in English, quiet environment)
- CPU usage: 25-35%
- Works in: Daylight, indoor lighting, low light (with preprocessing)
What works great:
- Cursor control (smooth, responsive)
- Clicking and double-clicking
- Dictation into text editors
- Opening/closing applications by voice
What needs work:
- Mouth gestures for right-click (false positives when smiling)
- Voice command parsing (needs more sophisticated NLP)
- Multi-monitor support
Learnings
1. Computer Vision is Hard
Every assumption breaks in the real world:
- "Face is always visible" → People turn their heads
- "Lighting is constant" → Shadows, sunlight, glare
- "One click is always one blink" → People blink naturally
- "Face is roughly the same size" → People move closer/further
Solutions: sensor fusion (combine multiple signals), temporal filtering (smooth over time), adaptive thresholds (adjust based on conditions).
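As one illustration of an adaptive threshold, the sketch below tracks a slow-moving baseline of a signal (for example an eye-openness value) and flags samples that drop well below it, so the cutoff follows the user as they move closer to or further from the camera. AdaptiveThreshold, the 60% fraction, and the update rate are hypothetical choices, not tuned values.

```python
class AdaptiveThreshold:
    """Track a slow-moving baseline and flag values that fall well below it."""

    def __init__(self, fraction=0.6, alpha=0.05):
        self.fraction = fraction   # hypothetical: flagged if below 60% of baseline
        self.alpha = alpha         # baseline update rate (exponential average)
        self.baseline = None

    def is_below(self, value):
        if self.baseline is None:
            self.baseline = value
            return False
        below = value < self.fraction * self.baseline
        if not below:
            # Only adapt while the signal looks "open", so blinks do not
            # drag the baseline down
            self.baseline += self.alpha * (value - self.baseline)
        return below
```

A fixed cutoff tuned for a face close to the camera would flag every sample from a face further away. Scaling the cutoff by the baseline avoids that.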
2. Latency is Everything for Interactive Systems
If there's more than 200ms delay between head movement and cursor movement, it feels broken. You constantly overcorrect.
This taught me to:
- Profile every function (where's the CPU time going?)
- Use lower-level APIs when needed (skip abstraction layers)
- Batch processing instead of per-frame processing
- Cache expensive computations
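For the profiling step, a lightweight option is to time each pipeline stage with time.perf_counter and compare averages. The sketch below uses only the standard library. The timed context manager, report, and the stage names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)   # seconds spent per stage
stage_counts = defaultdict(int)     # number of timed calls per stage


@contextmanager
def timed(stage):
    """Accumulate wall-clock time for a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start
        stage_counts[stage] += 1


def report():
    """Return average milliseconds per call for each stage, slowest first."""
    averages = {
        stage: 1000 * stage_totals[stage] / stage_counts[stage]
        for stage in stage_totals
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

Wrapping capture, landmark detection, and cursor update in separate timed(...) blocks shows which stage is using up the latency budget.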
3. User Testing Reveals Everything
I thought mouth-open gestures for right-click would work. But when a user smiled or talked, false positives fired constantly.
Solution: Make it optional. Users can choose:
- Mouth-open for right-click (less reliable but cool)
- Double-blink for right-click (more reliable but slower)
This is a UX decision, not a technical one.
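The double-blink option could be sketched as follows, assuming something upstream already reports each completed blink with a timestamp. DoubleBlinkDetector and the 0.5 second window are hypothetical.

```python
class DoubleBlinkDetector:
    """Report a double blink when two blinks end within max_gap seconds."""

    def __init__(self, max_gap=0.5):
        self.max_gap = max_gap      # hypothetical window, tune per user
        self._last_blink = None

    def on_blink(self, now):
        """Call once per completed blink; returns 'double' or 'pending'."""
        if self._last_blink is not None and now - self._last_blink <= self.max_gap:
            self._last_blink = None   # consume both blinks
            return "double"
        self._last_blink = now
        return "pending"
```

The "slower" part of the trade-off shows up here: a single blink can only be confirmed as a left-click once max_gap has passed without a second blink.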
4. Edge Computing Beats Cloud
Even with 50ms network latency, sending video frames to the cloud for processing is unacceptable for interactive systems.
Running everything locally (about 80ms from head movement to cursor in my benchmark) feels instantaneous. Sending to the cloud (~200ms) feels laggy.
Lesson: For interactive systems, keep processing on-device.
What I'd Build Next
- Eye-gaze heatmaps — See where users are looking (useful for UX research, marketing)
- Gesture recognition — Detect more complex hand/face gestures
- Head pose estimation — Tilt-to-scroll, nod-to-confirm actions
- EMG (muscle sensing) — Combine with facial tracking for more nuanced input
- VR/AR integration — Use eye tracking in metaverse applications
Key Takeaways for AI/ML Developers
- Real-time constraints change everything — Academic precision matters less than low latency
- Sensor fusion beats single sensors — Combine multiple weak signals for one strong one
- Temporal filtering is underrated — Smooth over time, not just across space
- Edge computing > Cloud — For interactive systems, process locally
- User testing reveals what math can't — Build a prototype early, watch people use it
Resources
If you want to build eye-tracking systems:
- MediaPipe: https://mediapipe.dev/ (face detection)
- OpenCV: https://opencv.org/ (image processing)
- pynput: https://pynput.readthedocs.io/ (mouse/keyboard control)
- SpeechRecognition: https://github.com/Uberi/speech_recognition (voice input)
- Kalman Filters: https://filterpy.readthedocs.io/ (sensor smoothing)
Have you built a computer vision system? What was your biggest gotcha? Drop a comment!
Happy building 🚀
Hands-Free Computer Interaction source code: https://github.com/smithayenugu/Hands-free-computer-interaction