DEV Community: Mohamed Heni

Building a Character-Level Bigram Language Model from Scratch with PyTorch

Mohamed Heni — Mon, 20 Jul 2026 09:52:54 +0000

TL;DR: I built the simplest possible language model, a character-level bigram, from scratch using PyTorch. No transformers, no attention, no abstractions. Just 79 characters, a counting matrix, and multinomial sampling.

The Problem

Every time I opened a tutorial about LLMs, it started with "Here's how attention works" or "Let's build a mini GPT."

But I wanted to understand the foundation. What happens before attention? How does text even become numbers? How does a model learn to predict the next token without any neural network at all?

So I went all the way down to the simplest possible language model: the bigram.

What's a Bigram Model?

A bigram model predicts the next character based only on the current character. That's it. No context window of thousands of tokens — just "given character X, what's the most likely character to follow?"

It's the Hello World of language modeling.
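To make the idea concrete, here is a minimal sketch of the counting approach in plain Python. It uses only the standard library so it runs without PyTorch; the function names and the toy text are illustrative assumptions, not code from the original notebook. In PyTorch the count table would be a 2-D tensor, and `torch.multinomial` would play the role of `random.choices`.

```python
import random
from collections import defaultdict


def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def next_char_probs(counts, current):
    """Normalize one row of counts into a probability distribution."""
    row = counts[current]
    total = sum(row.values())
    return {ch: n / total for ch, n in row.items()}


def generate(counts, start, length, rng):
    """Sample characters one at a time, each conditioned only on the previous one."""
    out = [start]
    for _ in range(length):
        probs = next_char_probs(counts, out[-1])
        if not probs:  # this character was never seen with a successor
            break
        chars, weights = zip(*probs.items())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)


toy_text = "hello world, hello wizard"  # stand-in for the Oz text
counts = train_bigram(toy_text)
print(next_char_probs(counts, "l"))
print(generate(counts, "h", 20, random.Random(0)))
```

The model has no parameters beyond the counts themselves, which is why the output only looks locally plausible: each character knows about exactly one predecessor.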

The Implementation

Step 1: Raw Text → Character Vocabulary

with open("Oz.txt", 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))
print(len(chars))  # 79 unique characters

79 unique characters: letters (a-z, A-Z), digits, punctuation, spaces, and newlines. A tiny vocabulary compared to GPT-4's ~100K token BPE vocabulary.

Step 2: Character Encoding

import torch

# Lookup tables between characters and integer ids
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

encoder = lambda s: [string_to_int[c] for c in s]
decoder = lambda ids: ''.join([int_to_string[i] for i in ids])

# Encode the whole book as one long tensor of character ids
data = torch.tensor(encoder(text), dtype=torch.long)
print(data.shape)  # torch.Size([231881])

Every character maps to an integer 0-78. The entire text of "The Wonderful Wizard of Oz" becomes a 231,881-element tensor.

Step 3: Train/Val Split

n = int(0.8 * len(data))
train_data = data[:n]
val_data = data[n:]

Standard 80/20 split. Simple and effective.

Step 4: Input-Target Pairs

For next-token prediction, we need (input, target) pairs where the target is the next character after the input sequence:

block_size = 8
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
# when input is tensor([0]) the target: 0
# when input is tensor([0, 0]) the target: 43
# when input is tensor([0, 0, 43]) the target: 67
# ...

This is the core pattern of autoregressive language modeling: given a sequence, predict what comes next. Decoder-only LLMs such as GPT, LLaMA, and Mistral are trained on this same objective. What changes is the model architecture and the scale.

Step 5: GPU vs CPU Benchmark

Before building the actual bigram probability matrix, I benchmarked PyTorch's matrix multiplication on CUDA vs NumPy on CPU:

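device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Two random 4-D tensors (100M floats each), moved to the GPU when one is available
torch_rand1 = torch.rand(100, 100, 100, 100).to(device)
torch_rand2 = torch.rand(100, 100, 100, 100).to(device)

# Measured time for the multiplication (timing code not shown):
# CUDA: 0.017 seconds
# NumPy CPU: 0.134 seconds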

Roughly an 8x speedup on GPU (0.134 s vs 0.017 s), even for a plain multiplication of random tensors. At scale, hardware acceleration is not optional.

What's Next?

A bigram model is just the start. The next step is adding an embedding layer, then a Transformer block, then scaling up. But I wanted to build from first principles — understanding the foundation before adding complexity.

Key Takeaways

Language modeling starts with counting — the bigram probability matrix is just normalized counts
Character-level tokenization is simple but powerful: 79 tokens cover an entire novel
The input-target prediction pattern is universal — from bigrams to GPT-4
GPU acceleration matters even at toy scale (8x speedup on random matrices)
Building from scratch teaches you what frameworks abstract away

Resources

Code: github.com/HENI-MOHAMED
Dataset: The Wonderful Wizard of Oz (public domain)

Built with PyTorch 2.x on a local machine with CUDA support. Full notebook available on request.

What's a fundamental ML concept you've built from scratch recently? I'd love to hear what foundations you explored.

How I Built a Real-Time Face Recognition Security System on a Raspberry Pi

Mohamed Heni — Fri, 05 Jun 2026 08:38:05 +0000

TL;DR: I built a face-based access control system that runs completely offline on a Raspberry Pi. It uses cosine similarity between face vectors to grant or deny folder access. The whole thing is under 300 lines of Python.

The Problem

Folder-level security is usually done with passwords or encryption. Both share a weakness: anyone who has the password gets in. I wanted something that physically identifies the user and runs on cheap hardware with no cloud dependency.

So I set up a Raspberry Pi with a camera module and built a facial authentication loop.

How It Works

The flow is straightforward:

Registration phase: The camera captures the authorized user's face. The system converts it to grayscale, computes a 128-dimensional face encoding using face_recognition, and stores that vector locally.
Verification phase: When someone opens the protected folder, the camera activates, captures the current face, extracts its vector, and compares it to the stored vector using cosine similarity.
Decision: If similarity exceeds a configurable threshold — access granted. If not — the folder closes immediately.

The Tech

Hardware: Raspberry Pi 3B+ + USB Camera
Software: Python 3, OpenCV, face_recognition (dlib), NumPy
Metric: Cosine similarity via scikit-learn

The key decision was using cosine similarity over Euclidean distance. Face vectors are direction-sensitive — two images of the same person under different lighting produce vectors pointing in roughly the same direction but with different magnitudes. Cosine similarity ignores magnitude and compares direction, which makes it much more robust to lighting changes.
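The verification step and the magnitude argument can be sketched together. This is a minimal illustration in plain Python rather than the project's scikit-learn call; the function names, the 0.9 threshold, and the 3-dimensional toy vectors are assumptions for the example (real face encodings have 128 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, or None if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # undefined: would divide by zero
    return dot / (norm_a * norm_b)


def is_authorized(stored_vector, current_vector, threshold=0.9):
    """Grant access only when the similarity is defined and exceeds the threshold."""
    similarity = cosine_similarity(stored_vector, current_vector)
    return similarity is not None and similarity > threshold


stored = [0.2, 0.4, 0.4]
rescaled = [0.3, 0.6, 0.6]  # same direction, 1.5x the magnitude

print(cosine_similarity(stored, rescaled))  # close to 1.0: direction is unchanged
print(math.dist(stored, rescaled))          # about 0.3: the magnitude change shows up
print(is_authorized(stored, rescaled))
```

Returning `None` for a zero-length vector keeps the undefined case explicit, so the caller denies access instead of crashing or comparing against NaN.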

Edge Case That Took the Longest to Solve

The hardest bug wasn't the ML — it was the camera initialization race condition.

When the folder-trigger script launches the camera, there's a brief window where the first frame is black (auto-exposure hasn't settled). If the system captures that black frame as the test face, the feature extraction returns an all-zero vector — and cosine similarity between zero vectors is undefined (division by zero).

Fix: I added a 3-frame warmup that discards the first captures and only processes once pixel variance stabilizes above a threshold. Simple, effective, and took way too long to figure out.

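def warmup_camera(cap, warmup_frames=3):
    """Discard initial dark frames until exposure stabilizes."""
    # Excerpt: only the fixed-count discard is shown here,
    # not the pixel-variance check described above.
    for _ in range(warmup_frames):
        cap.read()  # read and throw away; a failed read is simply skipped
    return cap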
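The variance gate described in the fix could look roughly like the sketch below. The function name, the `min_variance` value, and the `max_attempts` cap are assumptions for illustration; `cap` is expected to behave like `cv2.VideoCapture` (a `read()` method returning `(ok, frame)`), and `frame` like a NumPy array with a `.var()` method.

```python
def wait_for_stable_frame(cap, warmup_frames=3, min_variance=50.0, max_attempts=30):
    """Discard the first frames, then return the first frame that is not near-black.

    Returns None if no usable frame arrives within max_attempts reads.
    """
    # Unconditionally throw away the first few captures.
    for _ in range(warmup_frames):
        cap.read()

    # A black or uniformly dark frame has pixel variance near zero,
    # so keep reading until the variance clears the threshold.
    for _ in range(max_attempts):
        ok, frame = cap.read()
        if ok and frame is not None and frame.var() >= min_variance:
            return frame
    return None
```

Capping the number of attempts means a covered lens or a dead camera results in a clean denial rather than an endless loop.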

Why This Matters

A Raspberry Pi costs $35. A camera module costs $15. For $50, you get biometric folder security that works entirely offline — no cloud API calls, no subscription fees, no internet required.

This approach — edge AI with cheap hardware — is massively underused. Most "AI security" products are cloud-dependent, which means latency, privacy risks, and recurring costs. Running the inference locally on a $35 computer changes the economics.

Key Takeaways

Face recognition on a Raspberry Pi is entirely feasible with Python + OpenCV + dlib
Cosine similarity handles lighting variation better than Euclidean distance for face vectors
Camera initialization race conditions are a real edge case — always warm up
Edge AI deployment doesn't need expensive hardware; it needs the right tradeoffs

What's Next

I'm exploring two improvements:

Adding liveness detection (blink detection) to prevent photo-based spoofing
Converting the model to TensorFlow Lite for faster inference on the Pi

The full source is on GitHub: github.com/HENI-MOHAMED/FaceRegnition

Have you run face recognition on edge hardware? What issues did you run into?

Building an AI-Powered Short Video Automation Platform with Flask, React Native, and Playwright

Mohamed Heni — Mon, 01 Jun 2026 08:30:57 +0000

TL;DR: I built TikTonik, an AI platform that detects trending content, generates scripts, renders videos with FFmpeg, and publishes them to TikTok, all without human intervention. Here's the architecture and what I learned building it.

The Problem

Short-form video is the most effective content format right now, but producing it at scale is brutally manual:

Find a trend → Write a script → Record voiceover → Find B-roll → Edit → Render → Upload
Each video takes 1-3 hours for a decent output
Most creators burn out after maintaining this cadence for a few weeks
"AI video tools" are mostly wrappers that still require manual steps

I wanted to build a system that could go from "trend detected" to "video published" with no human touching the pipeline.

The Architecture

TikTonik is organized as a multi-application workspace with four major components:

└─ LibTikTonk V2.2 (Backend) ── Flask API + Queue System + ThreadPoolExecutor
    ├─ AI Trend Pipeline  (LLM + trend analysis)
    ├─ FFmpeg/MoviePy   (video rendering)
    └─ Playwright Uploader (TikTok auth + post)

└─ frontend-window (Tauri Desktop)  ── Vite + React + Rust
└─ FrontendMobile (React Native/Expo)  ── iOS + Android

1. Backend (LibTikTonkV2.2)

The core is a Flask application with a threaded task queue. When you submit a job:

Trend Detection — LLM scans trends and generates video concepts
Script Generation — AI writes a short-form script
Voice Synthesis — Kokoro TTS converts script to voiceover
Video Rendering — FFmpeg + MoviePy composites the final video
Upload — Playwright automates TikTok upload with session cookies

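import threading

# Worker thread architecture: a background thread drains the task queue
# (queue_worker is defined elsewhere in the backend)
worker_thread = threading.Thread(target=queue_worker, daemon=True)
worker_thread.start()

# Runs inside the ThreadPoolExecutor, one call per task
def process_task(task):
    if task['type'] == 'create_video':
        result = process_create_video_task(task)
    elif task['type'] == 'upload':
        result = process_upload_task(task)
    else:
        raise ValueError(f"Unknown task type: {task['type']}")
    return result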

2. Desktop Dashboard (Tauri + React)

Built with Vite + React + Tauri. Tauri wraps the web app as a native desktop app using Rust. Operators can view the job queue, configure AI limits, and monitor upload history — all from a native window with system tray integration.

3. Mobile Clients (React Native / Expo)

Two mobile apps using React Native (Expo):

TikTonk — Full client for viewing scheduled jobs, monitoring queue state, uploading from mobile
TickTonik 2.0 — Lightweight monitoring for quick status checks

Both connect via REST API to the Flask backend with Appwrite cloud sync.

The Hard Parts

Session-based TikTok Upload

TikTok's API is restricted, so the uploader uses Playwright browser automation with cached session cookies:

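from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=True)
    # Reuse the cached login session instead of logging in again
    context = browser.new_context(storage_state="CookiesDir/tiktok_session.json")
    page = context.new_page()
    page.goto("https://www.tiktok.com/upload/")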

Two challenges here. First, TikTok changes its DOM structure frequently, so the upload selector that worked last week can break this week. Second, logging in on every run is slow and fragile. I addressed the second by maintaining a cookie cache that preserves sessions between uploads, minimizing re-login frequency.

Queue-Based Parallel Processing

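import queue
import threading
from concurrent.futures import ThreadPoolExecutor

task_queue = queue.Queue()
task_status = {}
task_lock = threading.Lock()
executor = ThreadPoolExecutor(max_workers=MAX_WORKERS)

# Inside the worker loop, for each task pulled off task_queue:
with task_lock:
    task_status[task_id] = {'status': 'processing'}

executor.submit(process_task, task)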

The lock prevents status update races. ThreadPoolExecutor handles parallel execution with configurable MAX_WORKERS.
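For readers who want to run the pattern end to end, here is a self-contained sketch of a queue drained by a worker thread that hands tasks to a pool. The `MAX_WORKERS` value, the `set_status` helper, the `None` sentinel, and the placeholder work inside `process_task` are assumptions for the demo, not the backend's actual code.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 2  # assumed value; the real setting is configurable

task_queue = queue.Queue()
task_status = {}
task_lock = threading.Lock()


def set_status(task_id, status):
    """Update the shared status dict under the lock."""
    with task_lock:
        task_status[task_id] = {"status": status}


def process_task(task):
    """Placeholder for the real video and upload handlers."""
    set_status(task["id"], "processing")
    try:
        result = f"{task['type']} done"  # stand-in for actual work
        set_status(task["id"], "completed")
        return result
    except Exception:
        set_status(task["id"], "failed")
        raise


def queue_worker(executor):
    """Pull tasks off the queue and submit them until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        executor.submit(process_task, task)


def run_demo():
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        worker = threading.Thread(target=queue_worker, args=(executor,), daemon=True)
        worker.start()
        for i, kind in enumerate(["create_video", "upload"]):
            task_queue.put({"id": i, "type": kind})
        task_queue.put(None)  # sentinel: no more tasks
        worker.join()
    # Leaving the with-block waits for every submitted task to finish.
    return dict(task_status)


if __name__ == "__main__":
    print(run_demo())
```

Because the queue and status dict live in process memory, everything here is lost on restart, which is the limitation a Redis-backed queue would remove.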

Cross-Platform Startup

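import subprocess
import sys

python_cmd = "python3" if sys.platform.startswith("linux") else sys.executable
npm_cmd = "npm.cmd" if sys.platform == "win32" else "npm"  # npm is a .cmd shim on Windows
backend = subprocess.Popen([python_cmd, "main.py"], cwd=backend_dir)
frontend = subprocess.Popen([npm_cmd, "run", "dev"], cwd=frontend_dir)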

Graceful shutdown with signal handlers ensures both processes clean up properly.
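A shutdown routine along these lines is one way to do it. This is a sketch, not the launcher's actual code: the function names and the 5-second grace period are assumptions.

```python
import signal
import subprocess
import sys


def stop_all(processes, timeout=5):
    """Ask each child to exit, then force-kill any that ignore the request."""
    for proc in processes:
        if proc.poll() is None:  # still running
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def install_shutdown_handlers(processes):
    """Stop the children and exit when the launcher receives Ctrl+C or SIGTERM.

    Must be called from the main thread, as signal.signal requires.
    """
    def handler(signum, frame):
        stop_all(processes)
        sys.exit(0)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

In the launcher this would be called as `install_shutdown_handlers([backend, frontend])` right after both `Popen` calls. Terminating first and killing only after a timeout gives the Flask process a chance to finish in-flight work.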

What I'd Do Differently

Redis queue instead of in-memory for persistence across restarts
Multi-tenancy — currently single-tenant, needs user isolation for SaaS
Better error recovery — cleanup works but task states could be more descriptive
Direct API upload — TikTok Business API would be more reliable than Playwright

Key Takeaways

Flask + ThreadPoolExecutor is capable enough for queue-based workloads. You don't always need Celery.
Playwright + session cookies works for platforms without APIs, but plan for DOM changes
Four-platform deployment (Flask API + Tauri Desktop + React Native iOS + Android) from one codebase is viable with shared API contracts
Video rendering is the bottleneck: FFmpeg encoding is CPU-bound here. GPU acceleration would be a major upgrade

Tech Stack

Backend: Python, Flask, Flask-Limiter, ThreadPoolExecutor
Desktop: Vite, React, Tauri (Rust)
Mobile: React Native (Expo)
AI: LLM script generation, Kokoro TTS, trend analysis
Automation: Playwright
Rendering: FFmpeg, MoviePy
Sync: Appwrite
Deployment: Docker

Comments / Discussion

Have you built an automated content pipeline? What's your approach to handling platform-specific upload APIs versus browser automation? I'd love to hear how others solve the "last mile" problem.

I Built a Desktop AI Assistant That Controls Your Computer — Here's How

Mohamed Heni — Mon, 25 May 2026 09:30:57 +0000

TL;DR: I built Yaldabaoth — a desktop AI assistant that doesn't just answer questions. It reads your screen, runs PowerShell commands, clicks buttons, types text, and automates your entire workflow. No cloud dependency. No API calls for automation. Just Python, Rust, and raw OS control.

The Problem

In 2025, "AI assistants" meant one of two things:

Chatbots — A text box where you type and an LLM talks back
API wrappers — Tools that chain API calls together but have zero ability to touch the actual operating system

Neither of these actually helps you do things on your computer. Want to open an application, take a screenshot, parse a PDF, run a PowerShell script, and compile a report? Good luck chaining that through a chat interface.

I wanted something different. An assistant that sits on your desktop, sees what you see, and acts on your behalf — like having an engineer sitting next to you who can operate any part of the system.

So I built Yaldabaoth.

The Architecture

Yaldabaoth is a four-layer system:

Layer 1: The Shell — Tauri + React

Instead of Electron (which would have added 150+ MB to the binary), I used Tauri — a Rust-based framework that wraps a webview frontend in a native shell. The UI is React with a glassmorphism design.

Why Tauri:

Binary size: ~5 MB vs Electron's ~150 MB
Native performance for intensive operations
Direct Rust system access when needed

Layer 2: The Orchestrator — Python Backend

The Rust shell communicates with a Python backend that handles all the heavy lifting. Communication uses a JSON protocol over stdin/stdout: lightweight, with no HTTP server needed.

The orchestrator manages:

Command routing (voice/text → appropriate handler)
State persistence between commands
Multi-step task chaining
Personality profile switching (professional vs. creative modes)

Layer 3: The Automation Engine — Win32 API + PowerShell

This is where the magic happens. The Python backend has direct access to the Windows OS through:

pywinauto — Native Win32 API control for clicking, typing, window management
PowerShell subprocess — OS-level commands (service control, registry edits, file operations)
WMI — System information queries (processes, hardware, network)

Layer 4: The Perception System — OCR + Screen Parsing

Screen parsing runs in a separate thread to keep the UI responsive. It uses:

OCR-based text extraction — Screenshots → text → action decisions
Multi-threaded processing — One thread for screen capture + OCR, another for command execution, a third for UI responsiveness
Chained automation — Click → wait for UI update → re-scan screen → next action
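The chained-automation loop (act, wait for the UI to update, re-scan, continue) can be sketched as a polling helper. Everything named here is hypothetical: `read_screen_text` stands in for the capture + OCR step, and each `action` for a pywinauto or PowerShell call.

```python
import time


def wait_for_text(read_screen_text, expected, timeout=5.0, interval=0.25,
                  clock=time.monotonic, sleep=time.sleep):
    """Re-scan the screen until `expected` appears or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if expected in read_screen_text():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def run_chain(steps, read_screen_text, **wait_kwargs):
    """Run (action, expected_text) steps in order.

    Stops at the first step whose expected text never shows up and
    returns how many steps completed.
    """
    completed = 0
    for action, expected in steps:
        action()
        if not wait_for_text(read_screen_text, expected, **wait_kwargs):
            break
        completed += 1
    return completed
```

Waiting on observed screen content rather than a fixed `sleep` keeps the chain fast when the UI responds quickly and prevents it from acting on a stale screen when it does not.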

The Hard Parts

Threading Nightmares

Screen parsing is slow. OCR a screenshot, parse the text, decide what to do — you're looking at 500ms to 2 seconds per cycle. If you do this on the main thread, your entire app freezes.

The solution was a producer-consumer architecture:

Thread 1: Screen capture → OCR → queue the parsed text
Thread 2: Command executor — reads from queue, takes action
Thread 3: Main UI thread — stays responsive

┌─────────┐    ┌────────────┐    ┌────────────┐
│ Screen  │───>│ Queue      │───>│ Command    │
│ Capture │    │ (JSON)     │    │ Executor   │
└─────────┘    └────────────┘    └────────────┘
      │                                │
      v                                v
 ┌─────────┐                     ┌────────────┐
 │ OCR     │                     │ Win32 API/ │
 │ Parser  │                     │ PowerShell │
 └─────────┘                     └────────────┘

Python's queue.Queue with daemon=True threads was sufficient; there was no need for multiprocessing or async for this use case.
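A stripped-down version of that producer-consumer layout might look like this. It is a sketch under assumptions: `capture_and_ocr` and `act` are hypothetical callables standing in for the real capture/OCR and Win32 steps, and the `None` sentinel and bounded queue size are choices made for the example.

```python
import queue
import threading


def capture_loop(capture_and_ocr, out_queue, stop_event, cycles):
    """Producer: run capture + OCR and queue the parsed text."""
    for _ in range(cycles):
        if stop_event.is_set():
            break
        out_queue.put(capture_and_ocr())
    out_queue.put(None)  # sentinel: nothing more to process


def executor_loop(in_queue, act, results):
    """Consumer: take parsed text off the queue and act on it."""
    while True:
        text = in_queue.get()
        if text is None:
            break
        results.append(act(text))


def run_pipeline(capture_and_ocr, act, cycles=3):
    parsed = queue.Queue(maxsize=8)  # bounded, so a slow consumer slows the producer
    stop_event = threading.Event()
    results = []
    producer = threading.Thread(
        target=capture_loop,
        args=(capture_and_ocr, parsed, stop_event, cycles),
        daemon=True,
    )
    consumer = threading.Thread(
        target=executor_loop, args=(parsed, act, results), daemon=True
    )
    producer.start()
    consumer.start()
    # The demo joins so it can return results; a UI thread would not block here.
    producer.join()
    consumer.join()
    return results


print(run_pipeline(lambda: "Open Chrome", str.upper))
```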

Voice-First vs. Text-First UX

I wanted a Push-to-Talk interface (F10 key) so you could speak commands naturally. But speech recognition introduces latency and errors. The compromise:

Voice input preferred for simple commands ("Open Chrome", "Check CPU usage")
Text fallback for complex multi-step sequences
The orchestrator normalizes both into the same command pipeline

Rust ↔ Python Bridge

Tauri apps expect Rust backends. Yaldabaoth needs Python. Bridging them without adding a web server was tricky.

The solution: JSON-RPC-style messaging over stdin/stdout. The Rust shell spawns the Python process and exchanges JSON messages with it on stdin/stdout. No sockets, no HTTP, no dependency on a running server. The Python process lives as long as the app is open.

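// Rust side, minimal example (assumes one JSON message per line)
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

let mut python = Command::new("python")
    .arg("backend/main.py")
    .stdin(Stdio::piped())
    .stdout(Stdio::piped())
    .spawn()?;

// Send command; the trailing newline marks the end of the message
let cmd = r#"{"action": "click", "target": "Chrome"}"#;
let mut stdin = python.stdin.take().unwrap();
writeln!(stdin, "{cmd}")?;
stdin.flush()?;

// Read one line of response. read_to_string would block until the
// Python process closes stdout, which it never does while the app is open.
let mut reader = BufReader::new(python.stdout.take().unwrap());
let mut response = String::new();
reader.read_line(&mut response)?;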
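The Python end of the bridge could be sketched as a small read-dispatch-reply loop. This assumes one JSON message per line; the `HANDLERS` table, `handle_click`, and the shape of the reply are hypothetical and only illustrate the pattern.

```python
import json


def handle_click(message):
    # Placeholder: the real handler would drive pywinauto here.
    return {"status": "ok", "clicked": message.get("target")}


HANDLERS = {"click": handle_click}  # hypothetical action table


def serve(in_stream, out_stream, handlers=HANDLERS):
    """Read one JSON message per line, dispatch it, write one JSON reply per line."""
    for line in in_stream:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
            handler = handlers[message["action"]]
            reply = handler(message)
        except (json.JSONDecodeError, KeyError) as exc:
            # Malformed JSON, a missing "action" field, or an unknown action.
            reply = {"status": "error", "detail": repr(exc)}
        out_stream.write(json.dumps(reply) + "\n")
        # Flush after every reply so the parent is not left waiting on a buffered pipe.
        out_stream.flush()
```

In `backend/main.py` this would be started with `serve(sys.stdin, sys.stdout)`. Anything else the backend prints to stdout would corrupt the protocol, so logging belongs on stderr.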

What I Learned

1. Desktop automation is harder than cloud automation. Cloud APIs are designed to be called programmatically. Desktop UIs are designed for humans. Parsing a rendered UI and making decisions from it is fundamentally different from calling an API endpoint.

2. Threading early, threading often. I rebuilt the threading model three times. The first version was single-threaded and froze constantly. The second over-engineered with multiprocessing. The third — simple Queue-based threading — was just right.

3. Computer use was already possible in 2025. Before "computer use" became a buzzword in 2026, a Python script + OCR + Win32 API was all you needed. The novelty isn't the technology — it's wiring it together with a voice-first, responsive UI that feels like an assistant, not a script.

The Tech Stack

Component	Technology
Shell	Tauri (Rust + WebView2)
Frontend	React
Backend	Python
Automation	pywinauto, WMI, PowerShell
Voice	Push-to-Talk (F10)
Screen parsing	OCR + multi-threaded pipeline
Binary size	~5 MB

What's Next

Cross-platform support — Currently Windows-only due to Win32 API dependency. Linux adaptation via X11/Wayland is on the roadmap.
Better screen parsing — Using vision models directly instead of OCR for richer UI understanding.
Plugin system — Let users write custom automation modules.

Repo

The full source code is on GitHub: github.com/HENI-MOHAMED/Yaldabaoth

Built with Tauri, React, Python, Rust, and more coffee than I'd like to admit.

Building a Multi-Agent Tax Audit System with LangGraph and Odoo

Mohamed Heni — Fri, 22 May 2026 00:34:31 +0000

TL;DR: I built a multi-agent system that audits invoices, detects fiscal inconsistencies, and generates compliance reports — integrated with Odoo ERP and standalone databases.

The Problem

Tax auditing for small and medium businesses in Tunisia is manual, error-prone, and slow. Most accounting firms rely on Excel and manual checks. Regulations change frequently, and keeping up is a full-time job.

I wanted to build something that could ingest financial data, run audit rules automatically, and produce compliance reports — without needing a team of accountants.

The Architecture

The system uses a LangGraph-based multi-agent architecture:

Agent 1: Data Ingestion Agent

Connects to Odoo ERP or standalone SQL databases
Extracts invoices, ledgers, and financial statements
Normalizes data into a unified schema

Agent 2: Audit Agent

Applies tax rules against the data
Detects anomalies: missing invoices, misclassified expenses, VAT discrepancies
Flags items for human review
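As an illustration of what one audit rule might look like, here is a sketch of a VAT discrepancy check. The invoice field names, the tolerance, and the 19% default rate are assumptions for the example; the applicable rate depends on the goods or services and on current regulation. Amounts use `Decimal` with three decimal places, since the Tunisian dinar is divided into 1,000 millimes.

```python
from decimal import Decimal


def find_vat_discrepancies(invoices, vat_rate=Decimal("0.19"), tolerance=Decimal("0.010")):
    """Flag invoices whose declared VAT differs from net * rate by more than the tolerance."""
    flagged = []
    for inv in invoices:
        expected = (inv["net"] * vat_rate).quantize(Decimal("0.001"))
        difference = abs(inv["vat"] - expected)
        if difference > tolerance:
            flagged.append({
                "id": inv["id"],
                "expected_vat": expected,
                "declared_vat": inv["vat"],
                "difference": difference,
            })
    return flagged


sample = [
    {"id": "INV-001", "net": Decimal("100.000"), "vat": Decimal("19.000")},
    {"id": "INV-002", "net": Decimal("200.000"), "vat": Decimal("30.000")},
]
for item in find_vat_discrepancies(sample):
    print(item["id"], "expected", item["expected_vat"], "declared", item["declared_vat"])
```

The rule only flags; it does not correct anything, which matches the agent's role of handing items to a human reviewer.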

Agent 3: Reporting Agent

Generates compliance reports
Produces a summary of findings with risk levels
Suggests corrective actions

Orchestrator

LangGraph manages the flow between agents
Handles state, retries, and error recovery
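The control flow the orchestrator provides can be approximated in plain Python. To be clear, this is not LangGraph's API: it is a hand-rolled sketch of three steps passing a shared state dict with simple retries, and the step bodies are placeholder logic.

```python
def run_with_retries(step, state, max_attempts=3):
    """Call one agent step, retrying on failure with the same input state."""
    last_error = None
    for _ in range(max_attempts):
        try:
            # Shallow copy: a failed attempt cannot add or remove top-level keys.
            return step(dict(state))
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"{step.__name__} failed after {max_attempts} attempts") from last_error


def ingest(state):
    state["invoices"] = [{"id": "INV-1", "net": 100.0, "vat": 19.0}]  # placeholder data
    return state


def audit(state):
    state["findings"] = [inv["id"] for inv in state["invoices"] if inv["vat"] <= 0]
    return state


def report(state):
    state["report"] = (
        f"invoices checked: {len(state['invoices'])}, flagged: {len(state['findings'])}"
    )
    return state


def run_audit(initial_state=None):
    state = dict(initial_state or {})
    for step in (ingest, audit, report):
        state = run_with_retries(step, state)
    return state


print(run_audit()["report"])
```

What a graph framework adds on top of a loop like this is branching, persistence of state between steps (checkpointing), and resumption after a crash.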

Technical Challenges

Schema Mismatch: Every Odoo instance is customized differently. The ingestion agent had to handle dynamic schemas — detecting table structures at runtime and mapping them to a canonical audit model.
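One simple way to approach that mapping is an alias table per canonical field, resolved against whatever columns a given source exposes. The sketch below is illustrative: the canonical field names and alias lists are assumptions (`amount_untaxed` and `amount_tax` are standard Odoo invoice fields, the rest are made up for the example).

```python
CANONICAL_ALIASES = {  # hypothetical alias table
    "invoice_id": ("id", "invoice_id", "move_id"),
    "net_amount": ("amount_untaxed", "net_amount", "net"),
    "vat_amount": ("amount_tax", "vat_amount", "tax"),
}


def build_column_map(columns, aliases=CANONICAL_ALIASES):
    """Map each canonical field to the first matching column in this source.

    Returns (mapping, missing) so the caller can decide how to handle gaps.
    """
    available = {c.lower(): c for c in columns}
    mapping, missing = {}, []
    for field, candidates in aliases.items():
        match = next((available[c] for c in candidates if c in available), None)
        if match is None:
            missing.append(field)
        else:
            mapping[field] = match
    return mapping, missing


def normalize_row(row, mapping):
    """Project one source row onto the canonical audit schema."""
    return {field: row[column] for field, column in mapping.items()}


columns = ["id", "amount_untaxed", "amount_tax", "x_custom_field"]
mapping, missing = build_column_map(columns)
print(mapping, missing)
print(normalize_row({"id": 7, "amount_untaxed": 100.0, "amount_tax": 19.0, "x_custom_field": "n/a"}, mapping))
```

Reporting the missing fields instead of guessing lets the pipeline stop early on an instance it does not understand, rather than auditing the wrong column.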

Multi-Agent Coordination: Getting three agents to work together without stepping on each other's state was the hardest part. LangGraph's checkpointing was essential here.

Regional Tax Rules: Tunisian tax law isn't well-documented in English. Building the rules engine meant working directly with Arabic and French regulatory texts.

What's Next

Real-time invoice validation
Multi-company support
A dashboard for non-technical accountants

The repo is at github.com/HENI-MOHAMED/Audit-Agent.

Built with Python, FastAPI, LangGraph, Odoo, and a lot of coffee.