<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ratonpeureu</title>
    <description>The latest articles on DEV Community by Ratonpeureu (@ratonpeureu).</description>
    <link>https://dev.to/ratonpeureu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909700%2Fc740ca2d-0f85-4a70-80fa-e8a6c31c560f.png</url>
      <title>DEV Community: Ratonpeureu</title>
      <link>https://dev.to/ratonpeureu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ratonpeureu"/>
    <language>en</language>
    <item>
      <title>I built a real AI video processing SaaS from Senegal: no GPT wrappers, just HuggingFace + OpenCV + YOLO + Detectron2 + MediaPipe + Celery</title>
      <dc:creator>Ratonpeureu</dc:creator>
      <pubDate>Sun, 03 May 2026 00:25:07 +0000</pubDate>
      <link>https://dev.to/ratonpeureu/i-built-a-real-ai-video-processing-saas-from-senegal-no-gpt-wrappers-just-huggingface-opencv--1l3i</link>
      <guid>https://dev.to/ratonpeureu/i-built-a-real-ai-video-processing-saas-from-senegal-no-gpt-wrappers-just-huggingface-opencv--1l3i</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj60tf8auypualy3n0979.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj60tf8auypualy3n0979.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem I was solving
&lt;/h2&gt;

&lt;p&gt;Every creator I know spends 3-4 hours manually cutting &lt;br&gt;
one video into clips for TikTok and Instagram.&lt;/p&gt;

&lt;p&gt;The algorithm rewards volume — not perfection.&lt;br&gt;
Post 20 clips, maybe 2 go viral.&lt;br&gt;
Post 1 perfectly edited video, maybe 0 do.&lt;/p&gt;

&lt;p&gt;So I built ClipFarmer.&lt;/p&gt;


&lt;h2&gt;
  
  
  Not a GPT wrapper — real computer vision
&lt;/h2&gt;

&lt;p&gt;This is the part I want to be clear about.&lt;/p&gt;

&lt;p&gt;Most "AI tools" people encounter — especially in &lt;br&gt;
West Africa — are scams. Someone charges you to &lt;br&gt;
access ChatGPT through a Telegram bot and calls it &lt;br&gt;
"AI formation."&lt;/p&gt;

&lt;p&gt;ClipFarmer uses actual machine learning models &lt;br&gt;
running in the processing pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whisper (HuggingFace)&lt;/strong&gt; — automatic speech &lt;br&gt;
recognition for subtitle generation. Runs locally &lt;br&gt;
on the worker, no API call, no per-minute billing.&lt;/p&gt;
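&lt;p&gt;The downstream step is plain Python: Whisper hands back timed segments, and those become subtitle files. A minimal sketch of that formatting step (the segment dicts mirror Whisper's output shape; the transcription call itself is omitted):&lt;/p&gt;

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render a list of {start, end, text} segments as SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Whisper-style segments, hardcoded here for the sketch:
segments = [
    {"start": 0.0, "end": 2.4, "text": " Welcome to the channel."},
    {"start": 2.4, "end": 5.1, "text": " Today we build a pipeline."},
]
print(segments_to_srt(segments))
```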

&lt;p&gt;&lt;strong&gt;YOLO + OpenCV (cv2)&lt;/strong&gt; — scene detection and &lt;br&gt;
object tracking. Used to find the best cut points &lt;br&gt;
in a video — not just splitting at fixed intervals &lt;br&gt;
but finding where scenes actually change.&lt;/p&gt;
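&lt;p&gt;The core of "find where scenes change" is simple to sketch: flag a cut when consecutive frames differ sharply. In ClipFarmer this runs on frames decoded with cv2; here the frames are synthetic NumPy arrays so the idea is self-contained, and the threshold is an illustrative value:&lt;/p&gt;

```python
import numpy as np

def scene_changes(frames, threshold=30.0):
    """Return indices where frame i differs sharply from frame i-1
    (mean absolute pixel difference above threshold)."""
    cuts = []
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16))
        if diff.mean() > threshold:
            cuts.append(i)
    return cuts

# Two "scenes": dark frames, then bright frames -- the jump is at index 3.
dark = [np.full((4, 4), 10, dtype=np.uint8)] * 3
bright = [np.full((4, 4), 200, dtype=np.uint8)] * 3
print(scene_changes(dark + bright))  # [3]
```

A fixed-interval splitter would cut mid-scene; this only cuts where the content jumps.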

&lt;p&gt;&lt;strong&gt;Detectron2&lt;/strong&gt; — instance segmentation. Powers &lt;br&gt;
background removal and masking effects directly &lt;br&gt;
on video frames.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MediaPipe&lt;/strong&gt; — pose and face landmark detection. &lt;br&gt;
Used for smart reframing — keeping the subject &lt;br&gt;
centered when converting 16:9 to 9:16 vertical &lt;br&gt;
format for TikTok.&lt;/p&gt;
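&lt;p&gt;Once MediaPipe gives you the subject's position, the reframing itself is arithmetic: slide a 9:16 window over the 16:9 frame, centered on the subject and clamped to the frame edges. A sketch of that math (landmark detection omitted; the function name is mine):&lt;/p&gt;

```python
def vertical_crop(frame_w, frame_h, subject_x):
    """Return (x0, x1) bounds of a 9:16 crop centered on subject_x."""
    crop_w = int(frame_h * 9 / 16)          # 9:16 width for the full frame height
    x0 = int(subject_x - crop_w / 2)
    x0 = max(0, min(x0, frame_w - crop_w))  # clamp so the crop stays inside the frame
    return x0, x0 + crop_w

# 1920x1080 source, subject near the right edge: the window hits the border.
print(vertical_crop(1920, 1080, 1800))  # (1313, 1920)
```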

&lt;p&gt;&lt;strong&gt;OpenCV (cv2)&lt;/strong&gt; — the backbone of all frame-level &lt;br&gt;
processing. Every effect, every transition, every &lt;br&gt;
crop runs through cv2 pipelines.&lt;/p&gt;

&lt;p&gt;These aren't API calls to someone else's model.&lt;br&gt;
They run on our workers.&lt;/p&gt;


&lt;h2&gt;
  
  
  The effects and transitions pipeline
&lt;/h2&gt;

&lt;p&gt;This was the hardest part to build.&lt;/p&gt;

&lt;p&gt;Each effect is a cv2 pipeline that processes frames &lt;br&gt;
individually and reassembles them into a video. &lt;br&gt;
Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Color grading (dark moody, vintage grain, RGB split)&lt;/li&gt;
&lt;li&gt;CRT scanline overlay&lt;/li&gt;
&lt;li&gt;Motion blur&lt;/li&gt;
&lt;li&gt;Skeleton overlay (MediaPipe pose)&lt;/li&gt;
&lt;li&gt;Background removal (Detectron2 masks)&lt;/li&gt;
&lt;/ul&gt;
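&lt;p&gt;To make "a cv2 pipeline that processes frames individually" concrete, here is a sketch of one effect from the list, RGB split, as pure array ops of the kind those pipelines run (channel order assumed RGB here; cv2 decodes as BGR, so the indices flip there):&lt;/p&gt;

```python
import numpy as np

def rgb_split(frame, shift=4):
    """Offset the R and B channels horizontally on an HxWx3 uint8 frame,
    producing the classic chromatic-aberration ghosting."""
    out = frame.copy()
    out[:, :, 0] = np.roll(frame[:, :, 0], -shift, axis=1)  # red shifts left
    out[:, :, 2] = np.roll(frame[:, :, 2], shift, axis=1)   # blue shifts right
    return out

frame = np.zeros((8, 8, 3), dtype=np.uint8)
frame[:, 4, :] = 255                 # a white vertical line
ghosted = rgb_split(frame, shift=2)
print(ghosted[0, 2], ghosted[0, 4], ghosted[0, 6])
```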

&lt;p&gt;Transitions between clips use frame blending and &lt;br&gt;
optical flow — not simple cuts or crossfades.&lt;/p&gt;

&lt;p&gt;The whole thing runs as a Celery chord:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chord&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spliter_clip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;s&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;workflow_tasks_parallel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;s&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;task_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Split first → then effects + subtitles + transitions &lt;br&gt;
run in parallel on the clips → reassemble.&lt;/p&gt;




&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Backend:&lt;/strong&gt; FastAPI + Celery + RabbitMQ + Redis&lt;br&gt;&lt;br&gt;
&lt;strong&gt;AI/CV:&lt;/strong&gt; Whisper + YOLO + Detectron2 + MediaPipe + OpenCV&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Storage:&lt;/strong&gt; MinIO (self-hosted S3-compatible, presigned uploads)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; React + Vite + TailwindCSS&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Database:&lt;/strong&gt; PostgreSQL + SQLAlchemy async&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Docker Compose on a VPS&lt;/p&gt;

&lt;p&gt;Each AI model runs in its own conda environment &lt;br&gt;
inside the worker container — Whisper, Detectron2, &lt;br&gt;
and MediaPipe have conflicting dependencies so &lt;br&gt;
isolating them was non-negotiable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The African creator angle
&lt;/h2&gt;

&lt;p&gt;In Senegal and West Africa:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mobile money (Wave, Orange Money) is how people pay&lt;/li&gt;
&lt;li&gt;Credit cards are rare&lt;/li&gt;
&lt;li&gt;Most AI tools people see are scams or inaccessible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ClipFarmer accepts Wave and Orange Money natively. &lt;br&gt;
And it runs real models — not a chat interface &lt;br&gt;
pretending to be a video tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Conflicting ML dependencies are brutal.&lt;/strong&gt; &lt;br&gt;
Whisper, Detectron2, and MediaPipe cannot share &lt;br&gt;
a Python environment cleanly. The solution was &lt;br&gt;
separate conda envs and subprocess calls between &lt;br&gt;
them from the main worker.&lt;/p&gt;
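&lt;p&gt;What "subprocess calls between them" looks like in practice: the main worker shells out to a script pinned to a dedicated conda env via &lt;code&gt;conda run&lt;/code&gt;, so each model sees only its own dependencies. Env and script names below are illustrative:&lt;/p&gt;

```python
import subprocess  # the worker runs the built command with subprocess.run

def env_command(env_name, script, *args):
    """Build a `conda run` invocation for a script pinned to one conda env."""
    return ["conda", "run", "-n", env_name, "python", script, *args]

cmd = env_command("whisper-env", "transcribe.py", "/tmp/clip.mp4")
print(cmd)
# In the worker this is executed as:
#   subprocess.run(cmd, check=True, capture_output=True)
```

The cost is process-spawn overhead per call; the benefit is that Whisper's and Detectron2's pinned versions never fight over one interpreter.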

&lt;p&gt;&lt;strong&gt;Presigned uploads are mandatory for video.&lt;/strong&gt; &lt;br&gt;
Having the client upload directly to MinIO instead &lt;br&gt;
of streaming through FastAPI was the difference &lt;br&gt;
between a server that crashes on large files and &lt;br&gt;
one that handles them fine.&lt;/p&gt;
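&lt;p&gt;The mechanism behind presigning, boiled down: the API signs the object path plus an expiry with a secret, and storage verifies that signature statelessly, so the video bytes never pass through FastAPI. This is a toy HMAC scheme to show the idea, not MinIO's actual S3 v4 signing (the SDK's &lt;code&gt;presigned_put_object&lt;/code&gt; handles that):&lt;/p&gt;

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # shared between API and storage; illustrative only

def presign(path, expires_in=3600, now=None):
    """Return the query params a client attaches to its direct upload."""
    expiry = int(time.time() if now is None else now) + expires_in
    msg = f"{path}:{expiry}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return {"path": path, "expires": expiry, "signature": sig}

def verify(params, now=None):
    """Storage-side check: signature matches and the link is still fresh."""
    msg = f"{params['path']}:{params['expires']}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    now_val = time.time() if now is None else now
    fresh = params["expires"] >= now_val
    return fresh and hmac.compare_digest(expected, params["signature"])

p = presign("videos/raw/clip.mp4", expires_in=3600, now=1_700_000_000)
print(verify(p, now=1_700_000_100))   # True
print(verify(p, now=1_700_010_000))   # False: expired
```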

&lt;p&gt;&lt;strong&gt;cv2 frame processing is slow without batching.&lt;/strong&gt; &lt;br&gt;
Processing frames one by one destroyed performance. &lt;br&gt;
Batching frame reads and writes cut processing &lt;br&gt;
time significantly.&lt;/p&gt;
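&lt;p&gt;The shape of the batching win, sketched with synthetic frames (with cv2 this pairs with reading N frames before processing; the effect here is a stand-in gain adjustment): stack frames into one (N, H, W, 3) array and apply the op once, instead of once per frame in a Python loop.&lt;/p&gt;

```python
import numpy as np

def brighten_one_by_one(frames, gain=1.2):
    """Per-frame Python loop: one NumPy dispatch per frame."""
    return [np.clip(f * gain, 0, 255).astype(np.uint8) for f in frames]

def brighten_batched(frames, gain=1.2):
    """One vectorized op over a stacked (N, H, W, 3) batch."""
    batch = np.stack(frames).astype(np.float32)
    return list(np.clip(batch * gain, 0, 255).astype(np.uint8))

frames = [np.full((4, 4, 3), 100, dtype=np.uint8) for _ in range(16)]
a = brighten_one_by_one(frames)
b = brighten_batched(frames)
print(all(np.array_equal(x, y) for x, y in zip(a, b)))  # True: same result
```

Same output either way; the batched path amortizes per-call overhead, which is where the loop version loses on long videos.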

&lt;p&gt;&lt;strong&gt;Docker networking will humble you.&lt;/strong&gt; &lt;br&gt;
My Celery worker couldn't reach RabbitMQ because &lt;br&gt;
the FastAPI container was missing &lt;code&gt;RABBITMQ_URL&lt;/code&gt; — &lt;br&gt;
cost me an afternoon of traceback reading.&lt;/p&gt;
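&lt;p&gt;The fix is to declare the broker URL on every service that talks to it, using the compose service name as the hostname. A minimal fragment of what that looks like (service names and credentials here are illustrative, not ClipFarmer's actual config):&lt;/p&gt;

```yaml
services:
  api:
    environment:
      RABBITMQ_URL: amqp://guest:guest@rabbitmq:5672//
  worker:
    environment:
      RABBITMQ_URL: amqp://guest:guest@rabbitmq:5672//
    depends_on:
      - rabbitmq
  rabbitmq:
    image: rabbitmq:3-management
```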

&lt;h2&gt;
  
  
  Where it is now
&lt;/h2&gt;

&lt;p&gt;Live at &lt;a href="https://clipfarmer.site" rel="noopener noreferrer"&gt;clipfarmer.site&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Free credits to try it out. Mobile payment for &lt;br&gt;
West African creators.&lt;/p&gt;

&lt;p&gt;I'm curious — has anyone else built cv2 processing &lt;br&gt;
pipelines at scale? The frame batching and memory &lt;br&gt;
management on long videos is still something I'm &lt;br&gt;
optimizing.&lt;/p&gt;

&lt;p&gt;What would make you switch from manual editing?&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
      <category>python</category>
    </item>
  </channel>
</rss>
