DEV Community: zkaria gamal

What I Learned Building a 97%-Accuracy Tumor Classifier (and Why Augmentation Mattered More Than the Model)

zkaria gamal — Thu, 09 Jul 2026 08:54:57 +0000

I want to walk through the pipeline decisions, including the ones that didn't work, because that's the part that's actually useful to AI engineers.

I recently published my research in the MDPI journal Technologies: a unified deep learning framework for multi-class tumor classification.

I designed and built the entire pipeline:

the data flow
the augmentation strategy
model selection
ablation design

We hit 97% accuracy across 10 classes on independent datasets. Here's how it was actually built, and the architectural choices that moved the needle.

The problem with a single dataset

Medical imaging, whether we're talking about brain MRI or skin lesions, has a massive data problem.
Public datasets are small, heavily imbalanced across classes, and inconsistent in their acquisition conditions.
Train on one dataset, and you risk learning dataset-specific patterns instead of the actual pathology.

To show that this pipeline actually learns the pathology, the study used a two-dataset design covering both skin lesion and brain MRI data.
This was designed specifically to stress-test whether the pipeline generalized or was just overfitting to one source.

This is the first decision worth calling out: a single-dataset 97% is a much weaker claim than a dual-dataset 97%. It’s a cheap thing to add if you design your pipeline for it from day one.

Stage 1: A binary gate before the real problem

Before attempting a 10-class tumor classification, the pipeline runs a U-Net segmentation model as a binary gate: Is there a tumor region present at all, and where is it?

Why bother with this stage instead of feeding raw images straight into the classifier? Two reasons:

It forces the downstream classifier to focus on the relevant region instead of learning shortcuts from background or context pixels.
It's a natural checkpoint to catch garbage input early, before it costs you a wrong 10-way classification.
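A minimal sketch of the gating idea, assuming the segmentation model returns a per-pixel tumor probability map. The `threshold` and `min_pixels` values and the function names are hypothetical, not the ones used in the paper:

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]  # 2D grid: probabilities for the mask, intensities for the image


def gate(prob_map: Grid, threshold: float = 0.5,
         min_pixels: int = 4) -> Optional[Tuple[int, int, int, int]]:
    """Return the crop box (top, left, bottom, right) of the predicted region,
    or None when too few pixels pass the threshold (the input is rejected)."""
    hits = [(r, c)
            for r, row in enumerate(prob_map)
            for c, p in enumerate(row)
            if p >= threshold]
    if len(hits) < min_pixels:
        return None  # nothing tumor-like found: stop before the classifier
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def crop(image: Grid, box: Tuple[int, int, int, int]) -> Grid:
    """Cut the region of interest out of the image for the downstream classifier."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]
```

Only images for which `gate` returns a box would be cropped and passed on to the 10-class classifier.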

This is the exact same instinct behind a lot of production ML pipelines: don't let your hardest model see noisy input it doesn't need to handle.

Stage 2: The ablation that actually mattered (DCGAN vs. Augmentor)

This is the part I think is most useful to share, because it's where intuition and actual experimental results diverged.

The obvious move for a small, imbalanced medical imaging dataset is synthetic data generation. So, I trained a DCGAN to generate additional samples, balancing out the classes to about 7,000 images per class. But I also ran a much simpler baseline: classical augmentation via Augmentor (rotations, flips, elastic distortions, etc.) with no generative model at all.

Here is what actually happened: the Augmentor baseline did most of the heavy lifting. DCGAN helped marginally on the rarest classes, but the extra GAN training cost bought only a small edge over classical augmentation.
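To make the balancing step concrete, here is a rough sketch of how an augmentation budget could be computed for a target of 7,000 images per class. The class names and counts in the usage note are made up for illustration, and the function is not taken from the paper's code:

```python
import math
from typing import Dict


def augmentation_plan(class_counts: Dict[str, int],
                      target_per_class: int = 7000) -> Dict[str, Dict[str, int]]:
    """For each class, compute how many extra samples are needed to reach the
    target, and how many augmented variants each original image must yield."""
    plan = {}
    for name, count in class_counts.items():
        missing = max(0, target_per_class - count)
        # Round up so the class reaches the target even when it does not divide evenly
        per_image = math.ceil(missing / count) if count else 0
        plan[name] = {"to_generate": missing, "variants_per_image": per_image}
    return plan
```

For example, a class with 500 originals would need 6,500 extra samples, or 13 augmented variants per image, while a class already above the target needs none.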

This is a genuinely useful finding for anyone building in this space. Sometimes the "simple" baseline does more work than the sophisticated generative approach, and you only know that if you run the ablation.

Stage 3: The classifier, Fine-tuned Inception V3

The final 10-class classification stage used a fine-tuned Inception V3, which successfully reached 97% accuracy on our held-out test sets.

Rather than training a large architecture from scratch, I used transfer learning: the pretrained backbone stayed frozen while the classification head was trained, which gave the model strong feature extraction without immediately overfitting to our 7k-per-class dataset.

What I'd do differently

This section is usually the one that gets skipped. If I were to rebuild this pipeline from scratch, here is what I would look at more closely:

The U-Net Bottleneck: Did the U-Net gate ever produce false negatives, dropping tumor images before they got the chance to reach the classifier?
The Edge Cases: Where did the dual-dataset generalization struggle the most, and which specific skin/brain classes got confused with each other?

the confusion matrix of the classifier model

the classification report of the classifier model

Dive into the details:
Read the full peer-reviewed paper: Technologies (MDPI), Vol. 13
Try the Live Demo: XCare Platform

I built a local AI multi-agent that controls my Linux machine with a model as small as 2B, and scales up to any size

zkaria gamal — Wed, 01 Jul 2026 12:46:42 +0000

I built zkzkAgent because I was tired of cloud AI tools and day-to-day terminal commands. It’s a fully local AI Linux assistant powered by LangGraph + Ollama that intelligently routes your requests through conversational, direct, or planning paths, executes real system tasks safely, and keeps everything private on your machine.

Why LangGraph

I tried simple ReAct-style agents, but they became messy. Every turn, the LLM had to figure out everything in one go, and I had no clean way to maintain consistent state across steps.

That’s why I chose LangGraph. It gave me a proper graph-based structure with shared state between nodes, which was exactly what I needed.

How I Built the Core Architecture

My first version was just one agent trying to handle everything. It worked at first, but after adding more tools it started making strange decisions. Sometimes it would overcomplicate simple requests, and other times it would skip steps it actually needed.

I fixed that by splitting the system into smaller nodes. The first step simply decides what kind of request came in: a normal conversation, a direct action, or something that needs multiple steps. From there the graph routes the request to the right place.

I also moved all shared information into one state object so every node can access the same context instead of passing values around manually.

That change made a bigger difference than switching models. The behavior became easier to understand and debugging stopped feeling like guessing.
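The routing idea can be sketched without any framework. This is a simplified stand-in, not zkzkAgent's code: the keyword rules replace the LLM classifier, and the state fields and node names are hypothetical.

```python
from typing import Callable, Dict, List, TypedDict


class AgentState(TypedDict):
    """One shared state object that every node reads and updates."""
    user_input: str
    route: str
    log: List[str]


ACTION_VERBS = {"open", "delete", "find", "clean", "deploy", "run"}


def router(state: AgentState) -> AgentState:
    """Decide which path the request takes (keyword rules stand in for the LLM)."""
    text = state["user_input"].lower()
    words = text.split()
    if " and " in text or " then " in text:
        state["route"] = "planning"       # needs several steps
    elif words and words[0] in ACTION_VERBS:
        state["route"] = "direct"         # a single tool call
    else:
        state["route"] = "conversation"   # plain chat
    state["log"].append(f"router -> {state['route']}")
    return state


def conversation(state: AgentState) -> AgentState:
    state["log"].append("conversation: reply directly")
    return state


def direct(state: AgentState) -> AgentState:
    state["log"].append("direct: run one tool")
    return state


def planning(state: AgentState) -> AgentState:
    state["log"].append("planning: build a plan, then execute it")
    return state


NODES: Dict[str, Callable[[AgentState], AgentState]] = {
    "conversation": conversation,
    "direct": direct,
    "planning": planning,
}


def run(user_input: str) -> AgentState:
    state: AgentState = {"user_input": user_input, "route": "", "log": []}
    state = router(state)
    return NODES[state["route"]](state)
```

Because every node receives and returns the same state, the `log` field gives a step-by-step trace of what the graph did, which is what makes debugging feel less like guessing.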

What can my agent do?

- File and folder operations: my agent can read and write files, create and delete folders, and find any

- System and process management: my agent can find running processes, stop them, run deployment scripts, and keep track of background tasks.

- Dangerous operations: for actions like deleting files, removing packages, or cleaning the trash, I added a confirmation step so it never runs risky actions directly. This keeps a human in the loop.

- Network and utility tasks: my agent can check the internet connection, recover from some network issues automatically, and search the web if needed.

- Applications: my agent can open applications directly, such as VS Code and browser windows.

- Voice mode: I also added voice support using Whisper for speech input and TTS for responses.

So why this?

Instead of remembering commands, I wanted things like:

"Find my latest deployment log and summarize it"

"Clean my trash but ask me first"

to work naturally.

And for the planning node:

I added a simple planning step. If a task needs multiple actions, my agent first creates a small plan and then executes each step one by one instead of immediately running everything.
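A rough sketch of plan-then-execute, under the assumption that the planner returns an ordered list of steps. Here a regular expression stands in for the LLM planner, and `run_step` is a hypothetical callback that performs one step:

```python
import re
from typing import Callable, List, Tuple


def make_plan(request: str) -> List[str]:
    """Stand-in planner: split a request into ordered steps on 'and' / 'then'."""
    parts = re.split(r"\s+(?:and then|then|and)\s+", request.strip())
    return [p for p in parts if p]


def execute_plan(steps: List[str],
                 run_step: Callable[[str], Tuple[bool, str]]
                 ) -> List[Tuple[str, bool, str]]:
    """Run the steps one by one and stop at the first failure,
    so later steps never run on top of a broken earlier one."""
    results = []
    for step in steps:
        ok, output = run_step(step)
        results.append((step, ok, output))
        if not ok:
            break
    return results
```

Stopping at the first failed step is one possible policy; retrying or re-planning would be alternatives.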

Seeing it in action

I recorded some real demos while using zkzkAgent:

1- Intelligent image search

Video:
https://github.com/zkzkGamal/zkzkAgent/search_images_demo_vedio.mp4

In this demo the agent searches for images from the web and handles the results automatically.

2- Creating a new project

Video:
https://github.com/zkzkGamal/zkzkAgent/zkzk_agent_create_project.mp4

In this demo I gave a simple request and the agent created the project structure, installed dependencies, and completed the setup process.

Some other requests I use regularly:

"Find my last deployment log and summarize the errors"

"Deploy the frontend and run it in background"

"Clean up my trash"

For dangerous actions like cleaning files or deleting data, the agent asks for confirmation first before executing.

live image test from empty trash cycle

Safety first

One thing I cared about while building zkzkAgent was making sure it does not blindly execute dangerous actions.

Earlier I used this request:

"Clean up my trash"

Here is what actually happened when I ran it.

I typed:

User: "clean up my trash"

The router understood that this was a direct action request and selected the correct tool. But before running it, the agent detected that empty_trash is a dangerous operation.

Instead of executing immediately, it paused and asked:

System: "I'm about to perform 'empty_trash'. This will delete data permanently. Please confirm with 'yes' or 'no'."

Only after I replied:

User: "yes"

did the agent continue and execute the action.

I added the same confirmation flow for other risky operations like deleting files, removing packages, and similar system actions.

I wanted automation, but I also wanted a human to stay in control. The agent can make decisions and prepare actions, but for sensitive operations the final decision still belongs to the user.
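The confirmation flow can be sketched as a small wrapper around tool execution. This is an illustration only: apart from `empty_trash`, the tool names are hypothetical, and `confirm` is whatever function asks the user and returns their reply.

```python
from typing import Callable

# Tools that must never run without an explicit "yes" from the user
DANGEROUS_TOOLS = {"empty_trash", "delete_file", "remove_package"}


def run_tool(name: str, action: Callable[[], str],
             confirm: Callable[[str], str]) -> str:
    """Run `action`, but ask for confirmation first when the tool is dangerous."""
    if name in DANGEROUS_TOOLS:
        prompt = (f"I'm about to perform '{name}'. This will delete data "
                  "permanently. Please confirm with 'yes' or 'no'.")
        if confirm(prompt).strip().lower() != "yes":
            return "cancelled"  # anything other than "yes" is treated as a refusal
    return action()
```

In an interactive session `confirm` could simply be the built-in `input`; treating every reply except "yes" as a refusal keeps the default on the safe side.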

How to install zkzkAgent

I recently added an installation script because I wanted setup to be easier and avoid making users manually install and configure everything.

Clone the project:

git clone https://github.com/zkzkGamal/zkzkAgent.git
cd zkzkAgent

Make the installer executable and run it:

chmod +x install.sh
./install.sh

The installer handles the setup process automatically.

After installation is finished, you can start using zkzkAgent with the CLI:

./cli.sh

or use the normal text mode:

python3 main.py

That's it, setup should be done and ready to use.

Agent CLI

I recently added a dedicated CLI because I wanted something closer to tools like Claude CLI and Codex CLI instead of a simple terminal input loop.

The CLI includes live streamed responses, conversation history, session tracking, and commands like /help, /history and /reset.

I mainly added it because while building the project I was constantly testing requests and reopening terminals. Having a dedicated CLI made it feel more like using a real assistant instead of repeatedly running scripts.

Final thoughts

zkzkAgent started as a small project for myself. I just wanted an easier way to handle daily Linux tasks without relying on cloud services.

I'm still adding small features and improving things as I use it more.

There is a lot I want to add, but I wanted to share it and see what other people think.

GitHub Repository:

zkzkAgent

If you try it, I would love to hear your feedback. And if you like it, a star always makes me happy ⭐

The auth problem nobody talks about when running AI microservices locally

zkaria gamal — Thu, 18 Jun 2026 17:24:23 +0000

Most voice AI tutorials assume you have an API key and a cloud endpoint. Mine had to run on the machine in front of me — 4GB GPU, no cloud, no managed auth layer.

That constraint forced me to solve a problem I hadn't seen written about anywhere: how do you authenticate requests between two local Python processes without a database, a shared secret sitting in your repo, or a session manager?

This is the story of what I built and the auth protocol I had to design from scratch.

What I Built

AI-RTC-Agent is a fully local real-time voice agent. The architecture has four isolated layers:

React client — captures mic audio, streams it over WebRTC using native RTCPeerConnection
Python WebRTC server — receives 48kHz PCM frames, runs VAD, segments utterances
FastMCP server — runs Whisper small for STT, plus email, calendar, and search tools
Agent layer — LLM intent routing with adapters for OpenAI, Gemini, and local Ollama

The data flow looks like this:

Browser mic → WebRTC (48kHz PCM) → VAD segmentation → FastMCP (Whisper STT) → transcript back over WebRTC DataChannel

No HTTP round-trip on the return path. The transcript is pushed directly over a WebRTC DataChannel, which keeps latency tight.

The Audio Pipeline

Before we get to auth, the VAD pipeline is worth explaining because it drives the whole segmentation design.

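The browser streams 48kHz mono PCM. The server runs webrtcvad on 16kHz audio (the library accepts only 8, 16, 32, or 48kHz frames), so every incoming frame gets decimated in lockstep. But here's the thing: that decimated stream is a poor input for Whisper. Whisper's audio loader resamples files to 16kHz on its own, so it is better served by the untouched capture. So the system maintains two separate buffers from the same stream: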

A 16kHz buffer evaluated by webrtcvad on a 300ms sliding window at aggressiveness 3
A raw 48kHz buffer that accumulates the actual speech frames for Whisper

When the VAD detects 2 consecutive seconds of silence, the utterance is considered complete. The raw 48kHz buffer gets wrapped with a WAV header, encoded to base64, and sent to the FastMCP server for transcription.
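The segmentation and packaging steps can be sketched with the standard library alone. This is a simplified illustration, not the project's code: a crude energy threshold stands in for webrtcvad, the decimation is a plain every-third-sample pick with no low-pass filter, the decision is made per 30ms frame rather than over a 300ms window, and all names are hypothetical.

```python
import base64
import io
import struct
import wave
from typing import Callable, List, Optional

FRAME_MS = 30     # duration of one incoming frame
DECIMATION = 3    # 48 kHz -> 16 kHz by keeping every third sample


def energy_vad(frame_16k: List[int], threshold: int = 500) -> bool:
    """Stand-in for webrtcvad: mean absolute amplitude above a threshold."""
    return sum(abs(s) for s in frame_16k) / max(1, len(frame_16k)) > threshold


class Segmenter:
    def __init__(self, is_speech: Callable[[List[int]], bool] = energy_vad,
                 silence_limit_ms: int = 2000):
        self.is_speech = is_speech
        self.silence_limit_ms = silence_limit_ms
        self.raw_48k: List[int] = []   # speech frames kept at the original rate
        self.silence_ms = 0
        self.in_speech = False

    def push(self, frame_48k: List[int]) -> Optional[List[int]]:
        """Feed one 48 kHz frame; return the utterance once it is complete."""
        frame_16k = frame_48k[::DECIMATION]      # VAD sees the decimated copy
        if self.is_speech(frame_16k):
            self.in_speech = True
            self.silence_ms = 0
            self.raw_48k.extend(frame_48k)       # transcription gets the raw copy
            return None
        if not self.in_speech:
            return None                          # leading silence is ignored
        self.silence_ms += FRAME_MS
        if self.silence_ms < self.silence_limit_ms:
            return None
        utterance = self.raw_48k
        self.raw_48k, self.silence_ms, self.in_speech = [], 0, False
        return utterance


def to_wav_base64(samples: List[int], sample_rate: int = 48000) -> str:
    """Wrap 16-bit mono PCM samples in a WAV container and base64-encode it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)              # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

Each frame from the WebRTC track would go through `push`, and whenever it returns an utterance, `to_wav_base64` produces the payload for the transcription request.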

Whisper is preloaded as a module-level singleton at server boot via a LoadModelService — so there's no cold-start penalty on the first utterance.

The Auth Problem

Here's where it gets interesting.

The WebRTC server and the FastMCP server are two separate processes communicating over localhost HTTP. In production you'd put a reverse proxy in front, use mTLS, or drop a secret into a secrets manager. But this is a local developer workspace — no infrastructure, no ops, no database.

The naive solution is a static API key in .env. The problem: static keys sit in config files, get committed to repos, and never rotate. Even locally, it's a bad habit to build into an open-source blueprint.

I needed something that:

Required zero database
Left zero static credentials in the source files
Was stateless — no session syncing between processes
Was time-limited — a captured key shouldn't be reusable

The Solution: Deterministic Timestamp-Based Auth

Both processes independently run the same algorithm:

import datetime
from zoneinfo import ZoneInfo

class api_key_generator:
    def __init__(self, expire_time: int = 5):
        self.expire_time = expire_time  # width of each key window, in seconds

    def create_value(self) -> int:
        # Index of the current window: Unix time floor-divided by the window width
        now_utc = datetime.datetime.now(tz=ZoneInfo("UTC"))
        return int(now_utc.timestamp()) // self.expire_time

    def generate_api_key(self) -> str:
        # generate_suffix / generate_prefix are deterministic helpers (not shown here)
        timestamp = self.create_value()
        suffix = self.generate_suffix(timestamp)
        prefix = self.generate_prefix(timestamp)
        return f"{suffix}_{timestamp}_{prefix}"

The prefix and suffix are derived from deterministic functions over the timestamp — math.sqrt(math.log10(timestamp)) — so both sides can independently compute the expected key for any given 5-second window.

How validation works:

WebRTC server generates a key, appends it as X-API-Key header, sends the audio payload
FastMCP middleware intercepts the request, extracts the header, parses the embedded timestamp
Middleware independently generates the expected key for that timestamp window
If the strings match and the timestamp is within one grace interval (5 seconds), the request is authenticated
If the key is replayed outside the window — rejected

What this gives you:

No database, no credential storage
Keys expire automatically every 5 seconds
A captured key is useless after the window closes
Both processes stay fully stateless
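The validation steps can be sketched as follows. This is not the project's middleware: the way the prefix and suffix are formatted in `derive_key` is a guess built on the `math.sqrt(math.log10(timestamp))` expression, and all function names are hypothetical.

```python
import hmac
import math
import time
from typing import Optional

WINDOW_SECONDS = 5


def window_index(now: Optional[float] = None) -> int:
    """Index of the 5-second window that contains `now` (default: current time)."""
    return int(time.time() if now is None else now) // WINDOW_SECONDS


def derive_key(window: int) -> str:
    """Deterministic key for a window (prefix/suffix formatting is assumed)."""
    root = math.sqrt(math.log10(window))
    suffix = f"{root:.6f}".replace(".", "")
    prefix = f"{root * 1000:.0f}"
    return f"{suffix}_{window}_{prefix}"


def validate(key: str, now: Optional[float] = None, grace_windows: int = 1) -> bool:
    """Accept the key only if it is well-formed, recent, and matches the
    key this side computes independently for the claimed window."""
    parts = key.split("_")
    if len(parts) != 3:
        return False
    try:
        claimed = int(parts[1])
    except ValueError:
        return False
    if claimed < 1:
        return False                      # log10 is undefined for values <= 0
    if abs(window_index(now) - claimed) > grace_windows:
        return False                      # too old (replay) or too far ahead
    expected = derive_key(claimed)
    # Constant-time comparison avoids leaking how many characters matched
    return hmac.compare_digest(key.encode(), expected.encode())
```

One design note: since this derivation depends only on the clock, its protection rests on the function itself staying private. Mixing in a secret read from the environment (for example through `hmac`) would be one way to harden it, at the cost of the zero-credential goal.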

The MCP Layer

The FastMCP server is the heavy-lifting microservice. Beyond Whisper STT it exposes:

Mail tools — SMTP send/reply with thread headers (In-Reply-To, References)
Calendar tools — Google Calendar API with .ics fallback if OAuth isn't configured
Search tools — DuckDuckGo with a token-bucket rate limiter (1.0 req/sec)
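For readers unfamiliar with the token-bucket technique mentioned for the search tool, here is a minimal generic version. It is not the project's limiter; the class name and the injectable `clock` parameter are choices made for this sketch.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate: float = 1.0, capacity: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `rate=1.0` and `capacity=1.0` this allows at most one request per second; a larger capacity would permit short bursts.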

Every tool response goes through a unified response parser: ok(data, message), err(message, code), paginated(items, total) — consistent shape across the whole layer, which makes testing clean.
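A sketch of what such a response envelope could look like. The helper names and argument order follow the description above, but the field names and default values are assumptions, not the project's actual schema:

```python
from typing import Any, Dict, List


def ok(data: Any = None, message: str = "OK") -> Dict[str, Any]:
    """Successful result."""
    return {"success": True, "data": data, "message": message}


def err(message: str, code: int = 400) -> Dict[str, Any]:
    """Failed result with a machine-readable code."""
    return {"success": False, "data": None, "message": message, "code": code}


def paginated(items: List[Any], total: int) -> Dict[str, Any]:
    """Successful result carrying one page of a larger collection."""
    return {"success": True,
            "data": {"items": items, "count": len(items), "total": total},
            "message": "OK"}
```

Because every helper returns the same top-level keys, a test can assert on `success` and `message` without knowing which tool produced the response.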

Testing

Both the VAD server and the MCP tool layer have full pytest suites:

# Test the FastMCP tools
cd mcp && pytest tests/ -v

# Test the WebRTC VAD server
cd server && pytest tests/ -v

The MCP tests cover transcription accuracy, SMTP reply threading, calendar parsing, and rate limiter behavior.

Running It Locally

Requirements: Python 3.10+, Node.js 18+, ffmpeg, and a 4GB GPU (or CPU, slower).

# 1. Install Python deps
pip install -r requirements.txt

# 2. Start FastMCP server
cd mcp && python main.py        # localhost:8005

# 3. Start WebRTC backend
cd server && python main.py     # localhost:8080

# 4. Start React client
cd client && npm install && npm run dev   # localhost:5173

Speak into the mic, pause for 2 seconds, and the transcript appears in the dashboard.

Repo

github.com/zkzkGamal/AI-RTC-Agent

MIT licensed. Issues, PRs, and questions on the auth protocol or VAD pipeline are all welcome — especially curious if anyone has seen a cleaner approach to the zero-database local auth problem.

Building a real-time voice AI assistant with WebRTC and LangGraph

zkaria gamal — Sun, 14 Jun 2026 11:44:19 +0000

I recently finished building AI-RTC-Agent, an open-source real-time voice assistant workspace. It handles low-latency audio streaming, voice activity segmentation, and executes local tools (like search, email, and calendar) while maintaining a steady voice stream.

Here is the GitHub repository if you want to check out the code or run it locally:
https://github.com/zkzkGamal/AI-RTC-Agent

The Architecture

The project is split into four decoupled services to keep CPU-heavy tasks from blocking the audio processing loop:

React Client: A Vite frontend that manages the microphone with the browser's RTCPeerConnection API and handles half-duplex turn control to prevent audio feedback.
WebRTC Audio Processor: An asynchronous Python backend using aiortc and webrtcvad. It downsamples 48kHz audio to 16kHz for voice activity detection and segments user speech.
FastAPI Orchestrator: Powered by LangGraph to manage intent routing and conversation state.
FastMCP Server: Runs a warm-booted Whisper model locally for speech-to-text (STT) and exposes search and Google API tools.

Decoupling the WebRTC connection from the transcription and tool execution was critical. If the thread running the audio ingestion gets blocked by a transcription job or a web search, the audio stream drops frames. Offloading these to the FastMCP instance solves this.

Dynamic Model Switching

You can configure the system to use different LLMs and STT models by modifying the .env file. The orchestrator supports swapping the main language model between Ollama (for running local models like Qwen), OpenAI, or Google Gemini.

Service-to-Service Security

To secure communication between local microservices without the overhead of a database, I implemented a custom time-based authentication middleware. The client and servers each compute a time-locked token from the Unix epoch divided into 5-second windows. The receiving service recomputes the token from its own clock and compares the two, which keeps the auth stateless as long as the system clocks are synchronized.

UI Feedback

To keep the UX responsive while tools are running, the FastAPI agent broadcasts Socket.IO events (like tool_start and tool_finished). The React frontend immediately displays indicators showing what the agent is doing (such as calling the DuckDuckGo search tool) before streaming the voice response back.

Feel free to check out the setup instructions and run start.sh to test it out. I would love to get your feedback on the architecture.

https://github.com/zkzkGamal/AI-RTC-Agent

How to Convert Binary Segmentation Masks to YOLO Bounding Boxes (Python & OpenCV)

zkaria gamal — Thu, 11 Jun 2026 09:35:38 +0000

Have you ever found a perfect dataset for your object detection project, only to realize the ground truth is in the form of binary segmentation masks (black-and-white images) instead of YOLO bounding boxes (.txt files)?

If you are training a YOLO model (v5, v8, v10, v11, etc.), you need coordinates in the format:
<class_id> <x_center> <y_center> <width> <height> (normalized between 0 and 1).

Converting pixel-level masks into YOLO coordinates manually is a nightmare. Luckily, you can automate this using OpenCV and Python in just a few lines of code.

In this tutorial, we will walk through the exact steps and math required to build a robust conversion pipeline.

The Core Concept: Contours to Bounding Boxes

To convert a binary mask into a bounding box, we need to:

Find the boundaries of the white pixels (foreground object).
Compute the minimum enclosing rectangle around those boundaries.
Normalize the pixel coordinates into YOLO's standard format.

Here is what the visual flow looks like:

(Derived from the ISIC skin lesion dataset: left is the mask region, right is the calculated bounding box).

Step 1: Extract the Contours using OpenCV

OpenCV provides a powerful function called cv2.findContours that detects boundaries of binary shapes.

import cv2
import numpy as np

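# Load mask in grayscale
mask = cv2.imread('path_to_mask.png', cv2.IMREAD_GRAYSCALE)

# cv2.imread returns None instead of raising when the file cannot be read
if mask is None:
    raise FileNotFoundError("Could not read 'path_to_mask.png'")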

# Find all external boundaries
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

if not contours:
    print("No objects found in the mask!")
else:
    # Select the largest contour (assuming a single primary object)
    largest_contour = max(contours, key=cv2.contourArea)

Step 2: Calculate Bounding Box Coordinates

Once we have the contour points, we can compute two types of bounding boxes:

Option A: Standard Axis-Aligned Bounding Box

This is the standard box used by most YOLO models.

# x, y are top-left coordinates; w, h are width and height in pixels
x_min, y_min, w_pixel, h_pixel = cv2.boundingRect(largest_contour)

# Calculate center coordinates in pixel space
x_center = x_min + (w_pixel / 2.0)
y_center = y_min + (h_pixel / 2.0)

Option B: Rotated Bounding Box (Minimum Area)

If your object is tilted or elongated and you want to use oriented/rotated bounding boxes:

# Returns ((x_center, y_center), (width, height), angle_of_rotation)
rect = cv2.minAreaRect(largest_contour)
box_points = cv2.boxPoints(rect) # Coordinates of the 4 corners
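If you go the rotated route, the four corners need to be normalized as well. The sketch below assumes the oriented-box label layout used by Ultralytics OBB models (class id followed by four normalized corner points); the function name is hypothetical, and `corners` would be the `box_points` array from above.

```python
from typing import Iterable, Tuple


def obb_to_yolo(class_id: int, corners: Iterable[Tuple[float, float]],
                img_w: int, img_h: int) -> str:
    """Format four corner points as: class x1 y1 x2 y2 x3 y3 x4 y4 (normalized)."""
    coords = []
    for x, y in corners:
        # Clamp to [0, 1]: a rotated box can stick out slightly past the image border
        coords.append(min(max(x / img_w, 0.0), 1.0))
        coords.append(min(max(y / img_h, 0.0), 1.0))
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in coords)
```

Usage would look like `obb_to_yolo(0, box_points, img_w, img_h)`, written as one line of the label file.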

Step 3: Normalize Coordinates to YOLO Format

YOLO labels must be normalized relative to the overall image dimensions so that they scale correctly regardless of image resolution.

$$x_{norm} = \frac{x_{center}}{\text{img\_width}}, \quad y_{norm} = \frac{y_{center}}{\text{img\_height}}$$
$$w_{norm} = \frac{w_{pixel}}{\text{img\_width}}, \quad h_{norm} = \frac{h_{pixel}}{\text{img\_height}}$$

Here is the helper function to calculate and save this:

def normalize_to_yolo(x_center, y_center, w_pixel, h_pixel, img_w, img_h):
    x_norm = x_center / img_w
    y_norm = y_center / img_h
    w_norm = w_pixel / img_w
    h_norm = h_pixel / img_h
    return x_norm, y_norm, w_norm, h_norm

# Normalize using the real dimensions (the mask is assumed to match the image size)
img_h, img_w = mask.shape[:2]
x_n, y_n, w_n, h_n = normalize_to_yolo(x_center, y_center, w_pixel, h_pixel, img_w, img_h)

# Print or write to a YOLO label file (e.g. class 0)
print(f"0 {x_n:.6f} {y_n:.6f} {w_n:.6f} {h_n:.6f}")

The Reverse: YOLO Bounding Boxes back to Masks

What if you want to reconstruct binary masks from your bounding boxes for validation or visualization? You can draw filled rectangles onto a black canvas of the original image dimensions.

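# Dimensions of the original image (here taken from the mask loaded earlier)
img_height, img_width = mask.shape[:2]

# Initialize a black canvas (grayscale)
reconstructed_mask = np.zeros((img_height, img_width), dtype=np.uint8)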

# Denormalize coordinates
x_min = int((x_n - w_n/2) * img_width)
y_min = int((y_n - h_n/2) * img_height)
x_max = int((x_n + w_n/2) * img_width)
y_max = int((y_n + h_n/2) * img_height)

# Draw a filled white rectangle on the canvas
cv2.rectangle(reconstructed_mask, (x_min, y_min), (x_max, y_max), 255, thickness=-1)

# Save mask
cv2.imwrite('reconstructed_mask.png', reconstructed_mask)

Streamlining the Workflow

While writing this manually is great for one-off tasks, managing it across hundreds of images, handling missing metadata, splitting datasets into train/test folders, and generating the data.yaml config file can take hours.

If you want a lightweight package that bundles all of this (including batch conversions, CSV/JSON metadata parsing, and visualizations), I've open-sourced a helper package called segment-toolkit that does it in a few commands.

How to use it:

Install it via pip:

   pip install segment-toolkit

Convert a directory of masks to YOLO labels:

   segment-toolkit mask-to-yolo \
     --image-dir datasets/images/ \
     --mask-dir datasets/masks/ \
     --output-dir datasets/labels/

Split into standard train/test structures for YOLO training:

   segment-toolkit split \
     --images datasets/images/ \
     --labels datasets/labels/ \
     --output final_dataset/ \
     --ratio 0.8

Whether you write your own script using the OpenCV math above or use the open-source toolkit, automating this step saves days of annotation work.

Source Code: GitHub - mask-to-yolo-toolkit
PyPI Package: segment-toolkit

How do you handle dataset conversions in your machine learning workflows? Let me know in the comments below!

Agent vs Multi-Agent Systems: A Practical Guide with LangGraph & LangChain

zkaria gamal — Tue, 09 Jun 2026 19:09:53 +0000

Hey everyone! 👋
I've been deep in Agentic AI Engineering lately, and one of the most common questions I get is:

What's the real difference between a single Agent and a Multi-Agent system? And how do you actually build them in production?

Today I'll break it down clearly and show you exactly how I implemented both in my repo:

zkzkGamal/agentic-ai-engineering

1. Single Agent (including ReAct)

A Single Agent is one LLM-powered entity that can:

Reason
Call tools
Maintain memory
Loop until the task is done

ReAct Agent is a popular implementation pattern:

Reason → Act (tool call) → Observe (tool result) → Repeat
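The loop itself fits in a few lines. In this sketch a scripted `policy` function stands in for the LLM and a toy `add` tool stands in for real tools; none of these names come from the repo.

```python
from typing import Callable, Dict, List


def react(question: str,
          policy: Callable[[str, List[str]], Dict[str, str]],
          tools: Dict[str, Callable[[str], str]],
          max_steps: int = 5) -> str:
    """Reason -> Act -> Observe, repeated until the policy gives a final answer."""
    observations: List[str] = []
    for _ in range(max_steps):
        decision = policy(question, observations)              # Reason
        if decision["type"] == "final":
            return decision["answer"]
        result = tools[decision["tool"]](decision["input"])    # Act
        observations.append(result)                            # Observe
    return "stopped: step limit reached"


def scripted_policy(question: str, observations: List[str]) -> Dict[str, str]:
    """Stand-in for the LLM: call the tool once, then answer with its result."""
    if not observations:
        return {"type": "tool", "tool": "add", "input": "2+3"}
    return {"type": "final", "answer": f"The answer is {observations[-1]}"}


TOOLS = {"add": lambda text: str(sum(int(x) for x in text.split("+")))}
```

The `max_steps` cap matters in practice: without it, a model that never produces a final answer would loop forever.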

Pros: Simple, easy to debug, great for focused tasks.

Cons: One agent has to be good at everything → can become bloated, slow, or less specialized.

In my repo:

Chapter 4 teaches classic ReAct agents with LangGraph.
Chapter 5 uses a ReAct-style tool-calling agent inside the Execute node (via create_tool_calling_agent).

2. Multi-Agent System

A Multi-Agent system = multiple specialized agents (or nodes) working together like a team.

Instead of one giant agent, you break the work into roles:

Router / Supervisor
Researcher
Executor / Tool User
Critic / Reviewer
Summarizer
etc.

Benefits:

Better specialization
More efficient (cheap model for routing, powerful model only when needed)
Easier to maintain and scale
Clear separation of concerns

How My Repo Uses Both (Real Architecture)

The crown jewel is in Chapter 5: Multi-Node LangGraph Agent with MCP Tools.

Core Architecture (Single Graph, Multi-Node = Multi-Agent flavor)

Key Nodes (Specialized "Agents"):

Router Node (agent/nodes/router.py)
- Fast intent classification (Math? Email? Just chat?)
- Acts as a Supervisor
Execute Node (agent/nodes/execute.py)
- Runs a full ReAct tool-calling agent
- Connects to tools via MCP (Model Context Protocol) server (math + email tools)
Summarize Node
- Takes raw tool output (JSON, etc.) and makes it human-friendly
Conversation Node
- Lightweight chat fallback (avoids heavy tool path)

This is not a pure peer-to-peer multi-agent system (like Researcher → Writer → Critic), but a hierarchical multi-node LangGraph graph, which is one of the most practical and production-friendly multi-agent patterns today.

You also see pure multi-agent collaboration examples in Chapter 4 (Researcher + Writer + Critic working together).

Why This Architecture Rocks

Separation of concerns → easier debugging and testing
Efficiency → cheap routing + targeted execution
Security → Tools run in isolated MCP server (not directly in the LLM agent)
Observability → Each node is a clear step you can log/monitor
Extensibility → Want to add a "Research" intent? Just add a new node and update the router.

This is built with LangGraph + LangChain (no heavy LlamaIndex or basic OpenAI-only examples).

Key Takeaways

Aspect	Single Agent (ReAct)	Multi-Node / Multi-Agent
Complexity	Simpler	More structured
Specialization	One agent does everything	Each node/agent has a clear role
Efficiency	Can be wasteful	Optimized routing & execution
Debugging	Easier at first	Better long-term traceability
Best For	Focused tasks	Complex, real-world workflows

Modern reality (2026): In my experience, many production agent systems are multi-node LangGraph setups that run ReAct inside specialized nodes.

Want to see it in action?

→ Clone the repo:

git clone https://github.com/zkzkGamal/agentic-ai-engineering.git

Start with Chapter 4 for fundamentals, then run the full Chapter 5 system (LangGraph assistant + separate MCP tool server).

It includes:

Pytest + GitHub Actions CI
Local Ollama support
Memory, streaming, and more

Would love your feedback or contributions! ⭐

I Got Tired of Rewriting the Same Mask-to-YOLO Script So I Shipped a PyPI Package

zkaria gamal — Mon, 08 Jun 2026 12:09:26 +0000

I got tired of writing the same 50 lines of OpenCV boilerplate every new project.

Every pipeline that uses SAM, U-Net, or any segmentation model hands you binary masks. YOLO training wants bounding box labels. The standard move is a custom script: cv2.findContours, normalize coordinates, handle edge cases, repeat. I couldn't find a reusable package that did this cleanly end-to-end.

So I built one and shipped it to PyPI.

The install

pip install segment-toolkit

What it does

segment-toolkit is a bidirectional pipeline between binary segmentation masks and YOLO bounding box labels.

Forward: binary mask to YOLO label (axis-aligned or rotated minimum-area box via cv2.minAreaRect)

Reverse: YOLO label back to binary mask

Visualizer: overlay bounding boxes on source images

Dataset split: auto train/test split with data.yaml output

Class mapping: batch conversion with CSV or JSON ground truth

CLI usage

Convert a single mask to a YOLO label:

segment-toolkit mask-to-yolo \
--image images/ISIC_0024310.jpg \
--mask mask/ISIC_0024310_segmentation.png \
--output-txt labels/ISIC_0024310.txt \
--class-id 4

Reconstruct the mask back from the label:

segment-toolkit yolo-to-mask \
--label labels/ISIC_0024310.txt \
--output-mask masks_reconstructed/ISIC_0024310_segmentation.png

Visualize the bounding box overlay:

segment-toolkit visualize \
--image images/ISIC_0024310.jpg \
--label labels/ISIC_0024310.txt \
--output visualization.png

Split into train/test with data.yaml:

segment-toolkit split \
--images images/ \
--labels labels/ \
--output dataset/ \
--ratio 0.8 \
--seed 42
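For intuition, a seeded split with a ratio can be sketched like this. It illustrates the general technique only and is not segment-toolkit's implementation; the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(names: Sequence[str], ratio: float = 0.8,
                  seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle file names reproducibly and cut them into train/test lists."""
    ordered = sorted(names)                  # sort first so input order does not matter
    random.Random(seed).shuffle(ordered)     # private RNG: same seed, same split
    cut = int(len(ordered) * ratio)
    return ordered[:cut], ordered[cut:]
```

Splitting on image names and then looking up the matching label file keeps every image together with its annotation.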

Python API

from segment_toolkit import MaskToYoloConverter, YoloToMaskConverter

conv = MaskToYoloConverter(target_size=(640, 640), bbox_type="standard")
conv.convert_single(
    image_path="images/ISIC_0024310.jpg",
    mask_path="mask/ISIC_0024310_segmentation.png",
    output_txt_path="labels/ISIC_0024310.txt",
    class_id=4,
)

Validated on

The ISIC melanoma skin lesion segmentation dataset and the PlantVillage leaf disease dataset. Both were tested end-to-end: mask in, YOLO label out, mask reconstructed back, overlay rendered.

Repo

github.com/zkzkGamal/mask-to-yolo-toolkit

If you hit a bug or want a feature, open an issue.

From Segmentation Masks to YOLO Labels: My Dataset Prep Pipeline

zkaria gamal — Thu, 04 Jun 2026 14:32:56 +0000

I just finished a small but useful pipeline for skin lesion dataset preparation and annotation validation.

The project handles two workflows:
• Converting binary segmentation masks into YOLO labels
• Converting YOLO labels back into masks for validation and visualization

It was built around ISIC-style skin lesion data with 7 classes:
AKIEC, BCC, BKL, DF, MEL, NV, and VASC.

What I learned from this project:
  • Clean annotation pipelines save a lot of debugging time
  • A quick visual validation step catches label issues early
  • Even simple format conversions can reveal bad labels or inconsistent data

This project helped me better understand the full path from segmentation masks to training-ready YOLO annotations.

In the next phase, I plan to turn it into a more reusable Python package with a cleaner structure, better error handling, and a more maintainable workflow so it can be easier to use and adapt for future datasets.

If you work with medical imaging or dataset preparation, I’d love to hear how you validate your labels before training.
project repo

#MachineLearning #ComputerVision #YOLO #DeepLearning #MedicalImaging #DataAnnotation #ISIC #Python #OpenCV

Building Zero-Shared-State Auth Middleware and Real-Time Whisper STT Pipeline for Voice AI

zkaria gamal — Sat, 30 May 2026 09:41:38 +0000

I recently built a production-grade real-time Voice AI workspace from scratch. While the whole system has many moving parts, two components required the most careful engineering: the authentication middleware between services and the Speech-to-Text (STT) pipeline.

Here’s exactly how I approached and solved both.

The Middleware Problem

I needed two local microservices — a WebRTC audio server and a FastMCP server — to communicate securely.

I didn’t want to introduce a database, Redis, or any hardcoded secrets. The solution had to be lightweight, stateless, and still reasonably secure for internal communication.

So I built a dynamic time-locked API key generator.

How it works:

Both services independently calculate the same cryptographic key using the current UTC timestamp.

Take the current timestamp
Divide it by a 5-second epoch window
Generate a deterministic key from that value

If a request arrives outside the valid 5-second window, it is immediately rejected.
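
The three steps can be sketched in a few lines. This is an illustration under assumptions, not the project's exact derivation: the helper names `window_index` and `key_for_window` are hypothetical, and hashing the window index with SHA-256 is just one deterministic choice.

```python
import hashlib

WINDOW_SECONDS = 5  # epoch window size described above


def window_index(timestamp: float) -> int:
    """Map a UTC timestamp to the index of its 5-second window."""
    return int(timestamp) // WINDOW_SECONDS


def key_for_window(index: int) -> str:
    """Derive a deterministic key from a window index (SHA-256 is an assumption)."""
    return hashlib.sha256(str(index).encode()).hexdigest()


# Two timestamps inside the same window agree on the key ...
same_a = key_for_window(window_index(1_000_000_001.2))
same_b = key_for_window(window_index(1_000_000_004.9))
# ... while a timestamp in the next window produces a different one.
later = key_for_window(window_index(1_000_000_005.0))
print(same_a == same_b, same_a == later)  # True False
```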

This approach gives me:

No shared state
No persistent storage
No single point of failure
Automatic key rotation every 5 seconds

The Real-Time STT Pipeline

I wanted low-latency transcription with zero cold starts and no HTTP polling.

Here’s the exact flow I created:

Browser captures audio at 48kHz via WebRTC
Audio is downsampled to 16kHz
Voice Activity Detection (VAD) runs in 30ms lockstep
2.0 seconds of continuous silence = speech boundary
Audio segment is sent to the FastMCP server
Whisper "small" model is preloaded at boot (zero cold starts)
Transcription result is pushed back to the React frontend over WebRTC DataChannel

This gives a true real-time feeling with sub-second end-to-end latency in most cases.
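
As a rough illustration of the 48kHz to 16kHz step, the sketch below decimates 16-bit mono PCM by a factor of 3, averaging each group of three samples as a crude low-pass. The function name is hypothetical, and a real resampler would normally apply a proper anti-aliasing filter, so treat this as a sketch of the frame arithmetic rather than the project's implementation.

```python
from array import array


def downsample_48k_to_16k(pcm_48k: array) -> array:
    """Decimate 16-bit mono PCM by 3, averaging each group of three samples."""
    usable = len(pcm_48k) - len(pcm_48k) % 3  # drop a trailing partial group
    return array(
        "h",
        (
            (pcm_48k[i] + pcm_48k[i + 1] + pcm_48k[i + 2]) // 3
            for i in range(0, usable, 3)
        ),
    )


# One 30 ms frame: 1440 samples at 48 kHz become 480 samples at 16 kHz.
frame_48k = array("h", [300, 600, 900] * 480)
frame_16k = downsample_48k_to_16k(frame_48k)
print(len(frame_48k), len(frame_16k), frame_16k[0])  # 1440 480 600
```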

Architecture Flow

Why This Design Works Well

Completely stateless middleware removes infrastructure complexity.
Preloading Whisper eliminates cold start delays.
Using WebRTC DataChannel for transcription delivery removes polling overhead.
Clear separation of concerns with VAD segmentation and MCP tooling.

The full project is open source and meant to serve as an educational blueprint for developers working with WebRTC, MCP, and real-time AI.

Repository: https://lnkd.in/dFbE44e3

Contributions are welcome — especially on the agent routing and LLM orchestration layer that’s currently in progress.

Let me know in the comments if you’d like me to dive deeper into any specific part (VAD tuning, Whisper post-processing, rate limiting, etc.).

How I Built a Zero-Shared-State Auth Middleware for a Real-Time Voice AI Agent (WebRTC + FastMCP + Whisper)

zkaria gamal — Mon, 25 May 2026 19:17:58 +0000

I've been building an open-source real-time voice AI workspace for the past few weeks and I want to walk through the architecture decisions that were actually hard — not the happy-path stuff you see in tutorials.

The stack: React client → WebRTC Python backend → FastMCP server (Whisper STT, Mail, Calendar) → transcript delivered back over a WebRTC DataChannel. The LLM orchestration layer is still in progress, but the pipeline underneath it is fully live and tested.

Here's what I want to focus on: three engineering decisions that weren't obvious.

The Problem With Securing Local Microservices

When two services run on the same machine — in this case the WebRTC server and the MCP server — the standard advice is to put them behind a shared secret or an API key stored in an environment variable. That works, but it has failure modes: leaked .env files, rotation pain, and the cognitive overhead of managing secrets across services that should be able to trust each other without a database call.

I wanted something stateless and self-expiring.

The solution I landed on is a time-locked hash generator. Both servers independently compute the same key by applying deterministic math to the current UTC timestamp divided by a 5-second epoch window:

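import hashlib
import math
import time

def generate_api_key() -> str:
    # Index of the current 5-second window; both services compute the same value
    epoch_window = int(time.time()) // 5
    # Deterministic transform of the window index, hashed to a hex string
    raw = math.sqrt(math.log10(epoch_window))
    return hashlib.sha256(str(raw).encode()).hexdigest()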

The Starlette middleware on the MCP server recomputes this hash on every incoming request and compares it to the header. If the timestamp window is off by more than one epoch — five seconds — the request is rejected. No database lookup. No token storage. No rotation script. The key rotates itself every five seconds and both sides always agree on what it should be.
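
The check on the receiving side might look like the sketch below. The derivation mirrors `generate_api_key`, but everything else is an assumption: the names `key_for_window` and `is_valid_key` are hypothetical, the one-window tolerance on either side is my reading of "off by more than one epoch", and the constant-time comparison is an addition the post does not mention.

```python
import hashlib
import hmac
import math
import time
from typing import Optional

WINDOW_SECONDS = 5


def key_for_window(epoch_window: int) -> str:
    """Same derivation as generate_api_key, but for an explicit window index."""
    raw = math.sqrt(math.log10(epoch_window))
    return hashlib.sha256(str(raw).encode()).hexdigest()


def is_valid_key(presented: str, now: Optional[float] = None, tolerance: int = 1) -> bool:
    """Accept keys from the current window and `tolerance` windows on either side."""
    current = int(time.time() if now is None else now) // WINDOW_SECONDS
    for window in range(current - tolerance, current + tolerance + 1):
        expected = key_for_window(window)
        # compare_digest avoids leaking how many leading characters matched
        if hmac.compare_digest(presented.encode(), expected.encode()):
            return True
    return False


now = 1_000_000_002  # fixed timestamp so the example is reproducible
current_window = now // WINDOW_SECONDS
print(is_valid_key(key_for_window(current_window), now=now))       # True
print(is_valid_key(key_for_window(current_window - 1), now=now))   # True
print(is_valid_key(key_for_window(current_window - 10), now=now))  # False
```

Allowing the adjacent windows absorbs small clock drift and requests that cross a window boundary in flight, at the cost of a slightly longer validity period per key.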

This is not production-grade for internet-facing services (TOTP with a proper shared seed is better for that), but for securing local inter-service communication during development and staging it is clean, auditable, and has zero ops overhead.

The Dual-Rate Audio Pipeline

WebRTC gives you audio at 48kHz. Whisper is happiest at 16kHz. `webrtcvad` only accepts 8, 16, or 32kHz. Feeding everything through one sample rate loses either fidelity for transcription or compatibility for VAD.

The backend handles both independently in the same 30ms processing loop:

- The full 48kHz PCM buffer accumulates separately for Whisper
- A parallel downsampled 16kHz frame array feeds `webrtcvad` at aggressiveness level 3
- A sliding window tracks the ratio of active to silent frames
- When fewer than 1 in 10 frames in the last 2.0 seconds are active, that marks the speech boundary

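from collections import deque

SILENCE_RATIO_THRESHOLD = 0.1
SILENCE_DURATION_SECONDS = 2.0
FRAME_MS = 30  # one VAD decision per 30 ms frame

# Sliding window of per-frame VAD decisions (True = speech) covering the last 2.0 s
vad_window = deque(maxlen=int(SILENCE_DURATION_SECONDS * 1000 / FRAME_MS))

active_frames = sum(vad_window)
total_frames = len(vad_window)
# Evaluate only once the window is full, which also avoids dividing by zero
if total_frames == vad_window.maxlen and active_frames / total_frames < SILENCE_RATIO_THRESHOLD:
    trigger_pipeline()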

Splitting the buffers means you get high-quality STT input and accurate VAD detection without either compromising the other.
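
Here is a slightly fuller sketch of the boundary rule as a small class you can feed one VAD decision at a time. The class name, the `heard_speech` gate (so that leading silence never fires a boundary), and the reset after each boundary are assumptions of this sketch. In a real loop, each `is_speech` value would come from a call such as `webrtcvad.Vad(3).is_speech(frame_bytes, 16000)`.

```python
from collections import deque

FRAME_MS = 30
SILENCE_RATIO_THRESHOLD = 0.1
SILENCE_DURATION_SECONDS = 2.0


class SilenceBoundaryDetector:
    """Track recent VAD decisions and report when an utterance has ended."""

    def __init__(self):
        frames = int(SILENCE_DURATION_SECONDS * 1000 / FRAME_MS)  # 66 frames
        self.window = deque(maxlen=frames)
        self.heard_speech = False

    def push(self, is_speech: bool) -> bool:
        """Add one 30 ms VAD decision; return True when a boundary is detected."""
        self.window.append(is_speech)
        self.heard_speech = self.heard_speech or is_speech
        # Wait for some speech and a full window before judging silence.
        if not self.heard_speech or len(self.window) < self.window.maxlen:
            return False
        if sum(self.window) / len(self.window) < SILENCE_RATIO_THRESHOLD:
            self.window.clear()
            self.heard_speech = False
            return True
        return False


detector = SilenceBoundaryDetector()
results = [detector.push(True) for _ in range(50)]  # 1.5 s of speech
silence_needed = 0
while True:
    silence_needed += 1
    if detector.push(False):
        break
print(silence_needed)  # 60 frames, i.e. 1.8 s of silence
```

Because the rule is a ratio over the window rather than strict continuous silence, the boundary fires a little before a full 2.0 seconds of silence has elapsed.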

Service Singletons and the Cold Start Problem

Whisper is not fast to load. If you initialize the model on the first request, the first transcription takes 3–6 seconds depending on hardware, so whoever speaks first gets a broken experience.

The fix is a LoadModelService singleton that runs at server startup:

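import whisper

class LoadModelService:
    _model = None  # shared across all callers once loaded

    @classmethod
    def get_model(cls):
        # Load on first call only; later calls reuse the in-memory model
        if cls._model is None:
            cls._model = whisper.load_model("small")
        return cls._model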

This gets called inside the FastMCP lifespan hook, so by the time the first WebSocket connection arrives the model is already in memory. Every subsequent transcription call hits a warm model.
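
The warm-up idea can be sketched without any framework. The post does not show how the FastMCP lifespan hook is wired, so the `lifespan(app)` signature below simply follows the common async context manager convention and is an assumption. The model is replaced by a stand-in object so the example runs without Whisper installed.

```python
import asyncio
from contextlib import asynccontextmanager


class LoadModelService:
    _model = None
    load_count = 0  # only here to show that loading happens once

    @classmethod
    def get_model(cls):
        if cls._model is None:
            cls.load_count += 1
            cls._model = object()  # stand-in for whisper.load_model("small")
        return cls._model


@asynccontextmanager
async def lifespan(app):
    # Runs once before the server accepts traffic: pay the load cost here.
    LoadModelService.get_model()
    yield
    # Shutdown cleanup, if any, would go here.


async def main():
    async with lifespan(app=None):
        first = LoadModelService.get_model()   # already warm
        second = LoadModelService.get_model()
        return first is second, LoadModelService.load_count


print(asyncio.run(main()))  # (True, 1)
```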

The same pattern applies to the mail and calendar services — singletons initialized once, reused across tool calls, with a token-bucket rate limiter (0.5 req/s for Gmail) sitting in front of anything that touches an external API.
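
For reference, a token bucket at 0.5 requests per second can be sketched like this. The class name, the bucket capacity of 1, and the injectable clock (used here to make the example deterministic) are assumptions, not details taken from the project.

```python
import time


class TokenBucket:
    """Allow roughly `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


fake_now = [0.0]  # fake clock so the run is reproducible
bucket = TokenBucket(rate=0.5, clock=lambda: fake_now[0])
results = []
for t in (0.0, 1.0, 2.0, 2.5, 4.0):
    fake_now[0] = t
    results.append(bucket.try_acquire())
print(results)  # [True, False, True, False, True]
```

At 0.5 req/s a token takes two seconds to refill, which is why the calls at 1.0 s and 2.5 s are refused.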

The Pytest Suite

You can't calibrate a VAD pipeline without tests. The suite covers:

- Frame decimation accuracy at different sample rates
- Speech onset boundary detection under various silence patterns
- SMTP integration with a mock SMTP server
- Calendar tool with automatic `.ics` fallback when no calendar service is configured

[ RUN ] test_frame_decimation_48k_to_16k
[ OK ] test_frame_decimation_48k_to_16k
[ RUN ] test_vad_silence_boundary_2s
[ OK ] test_vad_silence_boundary_2s
[ RUN ] test_smtp_send_integration
[ OK ] test_smtp_send_integration

Running `pytest tests/ -v` from the `mcp/` directory gives you live output with real pass/fail visibility — not just a summary at the end.

What's Next

The LLM orchestration and conversation routing layer is actively in development. Once that's in, the full loop closes: speech → STT → LLM agent → tool use → response.

The entire codebase is open source and structured as an educational reference for WebRTC, MCP, and secure microservices. If you're building anything in this space — voice agents, real-time audio pipelines, MCP tool servers — I'd love contributions, issues, or just a look.


![Stt Flow](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dapxxg3ypbj526ci35oh.jpeg)

GitHub: https://github.com/zkzkGamal/AI-RTC-Agent

I built a fully tested Agentic AI system with LangGraph + MCP and open-sourced the whole thing

zkaria gamal — Mon, 18 May 2026 10:24:32 +0000

Most LLM tutorials stop at "here's how to call the OpenAI API."

Mine doesn't.

I just shipped v1.1.0 of Agentic AI Tutorial — a 5-chapter open-source repo that takes you from your first raw API call all the way to a production-style multi-node autonomous agent with a CI pipeline, pytest suite, and MCP server integration.

Here's what's inside and why I built it the way I did.

🏗️ The Architecture (Chapter 5)

The final agent uses a LangGraph StateGraph with 4 decoupled nodes:

Router — classifies user intent with a cheap, fast LLM call
Execute — runs a LangChain ReAct agent bound to a local FastMCP server
Summarize — converts raw tool JSON into natural language
Conversation — handles chitchat directly, skipping tool execution entirely

The MCP server exposes math and email tools over SSE. The agent never touches your credentials directly — it talks to the server, which acts as a secure boundary.
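
To show the control flow between the four nodes, here is a framework-free sketch. It deliberately does not use LangGraph or MCP: the router is a keyword check standing in for the LLM call, `execute` returns a canned JSON string instead of calling a tool server, and all function names are hypothetical.

```python
from typing import Dict

State = Dict[str, str]


def router(state: State) -> State:
    # Stand-in for the cheap LLM intent call: a keyword check, purely illustrative.
    text = state["user_input"].lower()
    needs_tool = any(word in text for word in ("add", "multiply", "email", "send"))
    return {**state, "intent": "tool" if needs_tool else "chat"}


def execute(state: State) -> State:
    # Stand-in for the ReAct agent calling a tool on the MCP server.
    return {**state, "tool_result": '{"result": 5}'}


def summarize(state: State) -> State:
    # Turn the raw tool JSON into a reply for the user.
    return {**state, "reply": f"Tool returned {state['tool_result']}"}


def conversation(state: State) -> State:
    # Chitchat path: no tool execution at all.
    return {**state, "reply": "Happy to chat!"}


def run_graph(user_input: str) -> State:
    state = router({"user_input": user_input})
    if state["intent"] == "tool":
        return summarize(execute(state))
    return conversation(state)


print(run_graph("Please add 2 and 3")["reply"])
print(run_graph("Hello there")["reply"])
```

Each node takes a state and returns a new one, which is what makes the nodes easy to test in isolation.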

🧪 Why I Added Tests to an AI Project

Here's the uncomfortable truth about agentic systems: they don't fail loudly. They drift.

Change one node prompt, and suddenly the router misclassifies 20% of requests. No exception thrown. No stack trace. Just wrong output that you may not catch until a user reports it.

So v1.1.0 ships with:

A pytest suite that validates each node's logic and MCP tool contracts independently — no live API calls needed
A GitHub Actions CI workflow that runs on every push across multiple Python versions
A custom conftest.py reporter that gives real-time output with zero buffering lag

pytest Chapter5/SimpleChatAgent/ -v
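
A node-level test without live API calls might look roughly like this. It is a sketch, not code from the repo: `router_node`, `FakeLLM`, and the "tool"/"chat" labels are hypothetical, and the fake object only mimics an `invoke`-style call.

```python
class FakeLLM:
    """Returns a canned label so the test never touches a real API."""

    def __init__(self, label: str):
        self.label = label
        self.prompts = []

    def invoke(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.label


def router_node(state: dict, llm) -> dict:
    """Hypothetical router node: asks the LLM for an intent and normalizes it."""
    raw = llm.invoke(f"Classify the intent of: {state['user_input']}")
    intent = raw.strip().lower()
    if intent not in {"tool", "chat"}:
        intent = "chat"  # fall back instead of propagating an unknown label
    return {**state, "intent": intent}


def test_router_normalizes_label():
    state = router_node({"user_input": "email Bob"}, FakeLLM("  TOOL\n"))
    assert state["intent"] == "tool"


def test_router_falls_back_on_unknown_label():
    state = router_node({"user_input": "hi"}, FakeLLM("banana"))
    assert state["intent"] == "chat"
```

Tests like these pin down the node's contract, so a prompt change that starts producing unexpected labels shows up as a failing assertion instead of silent drift.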

📚 Full Roadmap (All 5 Chapters)

Chapter	Focus
1	LLM fundamentals — OpenAI, Gemini, Ollama, streaming
2	LangChain, LCEL, chains, tool binding
3	Memory, entity tracking, RAG with Chroma/FAISS
4	LangGraph agents — ReAct, Router, Multi-Agent, Human-in-the-Loop
5	Multi-node agent + FastMCP Server + CI/pytest

🚀 Get Started in 3 Commands

git clone https://github.com/zkzkGamal/Agentic-AI-Tutorial.git
cd Agentic-AI-Tutorial
pip install -r requirements.txt

Each chapter has its own .env.example. Ollama users can run everything 100% locally, no API keys needed.

If this saves you time or teaches you something new, a ⭐ on the repo helps others find it.

👉 github.com/zkzkGamal/Agentic-AI-Tutorial

Happy to answer questions in the comments — what agentic patterns are you building?

Building Strong ML Foundations: Chapter 2 - Classification is Now Live

zkaria gamal — Sun, 10 May 2026 08:58:15 +0000

A few weeks ago I published Chapter 1 of my hands-on AI tutorial series, focused on Regression. Today, I'm excited to share that Chapter 2: Classification is complete.

This series isn't just another collection of notebook tutorials. I'm building it to truly understand how these algorithms work under the hood — implementing them from scratch where it makes sense, comparing them properly, and focusing on concepts that actually matter in interviews and real projects.

What’s in Chapter 2

I implemented and analyzed five core classification algorithms:

Logistic Regression (implemented from scratch with NumPy, plus scikit-learn version)
K-Nearest Neighbors (KNN) Classifier
Random Forest Classifier
XGBoost Classifier
Support Vector Classifier (SVC) with different kernels

Key Focus Areas

This chapter goes deeper than just training models. I spent a lot of time on:

Visualizing decision boundaries for each algorithm
Understanding probability estimates and calibration
Bias-variance tradeoff in classification problems
Precision vs Recall — one of the most important topics for ML interviews. I dedicated a good portion to explaining when to optimize for precision, when to prioritize recall, and how to use F1-score effectively depending on the problem.
Confusion matrices, ROC-AUC, and proper model evaluation
Why ensemble methods (Random Forest and XGBoost) often outperform single models
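
As a quick worked example of the precision/recall trade-off, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The counts are made up for illustration and do not come from the chapter.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

Here the model is right 80% of the time when it predicts the positive class but misses a third of the actual positives, and F1 lands between the two because it is their harmonic mean.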

Everything is implemented cleanly using NumPy, scikit-learn, and XGBoost, with real datasets and detailed explanations.

You can check out the full chapter here:

→ https://github.com/zkzkGamal/hands-on-ai-tutorial/tree/main/ml_fundamentals/chapter2

Chapter 1 (Regression) is available in the same repository.

Why I’m Doing This Publicly

I got tired of only knowing how to call model.fit() without understanding what was happening inside. This project is my way of forcing myself to learn deeply while creating a resource that can help others who want the same.

If you're a developer transitioning into ML, preparing for machine learning interviews, or simply want stronger fundamentals, I believe this series can be useful.

What's Next?

I'm planning Chapter 3 soon. I'm thinking about Dimensionality Reduction (PCA, t-SNE, UMAP) or Advanced Model Evaluation & Hyperparameter Tuning. Let me know in the comments what you'd like to see next.

Feedback is always welcome — whether it's about the code, explanations, or structure.

Happy to connect if you're on a similar learning journey.