DEV Community: Preetha

I Built a Medication Safety Companion for Home Care with Hermes Agent

Preetha — Sat, 30 May 2026 12:33:11 +0000

This is a submission for the Hermes Agent Challenge: Build With Hermes Agent

Across millions of homes, older adults managing multiple prescriptions face the same quiet crisis every day.

"I think I took my blood thinner twice today. Or maybe I forgot it.
I honestly can't remember."

That conversation is where this project started.

What I Built

The Medication Safety Companion is a home-care assistant that runs inside Telegram. Older adults talk to it by voice note or text message — the same way they already use their phone. It records doses, figures out which pill they mean when the description is vague, stops them from taking the same dose twice, and escalates immediately when something dangerous is happening. It also sends proactive reminders when a dose is overdue, with no action needed from the patient.

Everything the patient and caregiver see is delivered through Hermes Agent. The patient never interacts with Python. They talk to Hermes through Telegram, and Hermes handles the full conversation from greeting to confirmation to escalation.

The core design decision: Hermes is in charge of the conversation. A deterministic Python pipeline is in charge of every safety decision. The two never swap roles.

Demo

Here is what an actual Telegram session looks like. All of these were live on the running system.

https://github.com/preetharaj/Rx-Guardian/blob/main/docs/hermes_demo.gif
https://github.com/preetharaj/Rx-Guardian/blob/main/docs/Telegram_demo.gif

Confirming a dose by voice

The patient sends a voice note through Telegram. Hermes receives it via its Telegram gateway, passes it through faster-whisper for local transcription, and routes the transcript to the safety pipeline. The patient sees the transcription echoed back so they can catch any mishearing.

Patient sends voice note: "I took my heart pill"

🎙️ I heard: I took my heart pill

Got it. I've recorded that you took your heart pill at 08:14 UTC.
Let me know if you miss a dose.

When the description is ambiguous

The system never guesses. It asks. Hermes holds the clarification session open across turns using its built-in memory.

Patient: "the white pill"

I found more than one pill that could match.
1) Metoprolol 50mg  — white oval, blue cap bottle
2) Lisinopril 10mg  — white round, orange bottle, morning dose
3) Amlodipine 5mg   — white round, yellow bottle, evening dose

Which one did you take?

Patient: "2"

Got it. I've recorded that you took your pressure pill at 08:17 UTC.

When the dose was already taken

Patient: "I took my heart pill again"

I already have a record that you took your heart pill today at 08:14 UTC.
You don't need to take it again.
If you think there's a mistake, ask your caregiver to check.

When something unsafe is mentioned

Normal flow stops. No dose gets logged. Hermes delivers this message verbatim — the response formatting step is bypassed entirely for escalations.

Patient: "I want to take an old pain pill with my blood thinner"

I need to stop here. That medication may not be safe to take
alongside your current prescriptions. Please do not take it right now.
Contact your doctor or pharmacist before taking anything new.
I've made a note for your caregiver.

When it sounds like an emergency

Patient: "I accidentally took four pills by mistake"

This sounds urgent. Please call 995 right now or ask someone
nearby to call for you. Tell them which medication was taken
and roughly how much. I have notified your caregiver.
Do not take anything else until help arrives.

Proactive reminder — Hermes cron, no patient action needed

Every morning, Hermes's scheduler runs dispatch.py --remind. If a dose is overdue, Hermes sends the reminder to Telegram automatically. The patient does not have to open the app first.

⚠️ Reminder: You haven't confirmed your blood thinner yet today.
This is an important medication. Please take it and reply:
'I took my blood thinner'

The audit log

/logs

📋 Recent events:
🚨 [2026-05-29 01:09] EMERGENCY_ESCALATION
⚠️  [2026-05-29 01:08] DRUG_INTERACTION_ALERT
✅  [2026-05-29 01:06] MED_CONFIRMED — heart pill
✅  [2026-05-29 01:05] MED_CONFIRMED — pressure pill
🔍  [2026-05-29 01:05] MED_AMBIGUOUS
🔒  [2026-05-28 14:22] MED_DUPLICATE_BLOCKED — heart pill

Code

GitHub: preetharaj/Rx-Guardian

My Tech Stack

Layer	What it does
Hermes Agent	The entire user-facing layer — receives messages and voice notes via Telegram gateway, manages conversation and session memory, runs the skill, delivers responses, schedules and sends proactive reminders via cron
Safety pipeline	Pure Python — deterministic rules only, zero AI involvement in any safety decision
Response formatting	OpenRouter free tier (nvidia/nemotron:free) — called by Hermes after the pipeline decides, only to rephrase output into warmer language
Voice transcription	faster-whisper running locally — Hermes passes Telegram voice notes through it, no audio leaves the device
Database	SQLite with WAL mode — 4 tables, immutable audit log
Tests	pytest — 94 tests, all passing without any API key

Two rows in that table share responsibility for what most people would call "the AI part": Hermes Agent and Response formatting. They do entirely different things. Hermes Agent is the patient-facing intelligence — it manages the full conversation, remembers what was said two turns ago, routes voice notes, and delivers all messages through Telegram. Response formatting is a narrow utility step downstream, called only after the safety pipeline has already made its decision, and its only job is to turn a structured result like {outcome: CONFIRMED, message: "..."} into a warmer sentence. A bad API response or a model hallucination in that step cannot change whether a dose is confirmed or an escalation fires — the pipeline already ran.

Core Python modules

lookup.py               — matches "heart pill" to Metoprolol, handles ambiguous descriptions
ambiguity_handler.py    — manages multi-turn clarification sessions with session keys
duplicate_guard.py      — checks audit log, blocks repeat doses within the 6-hour window
confidence_rules.py     — handles low-confidence voice transcriptions
emergency_escalation.py — 80+ trigger phrases across 5 categories, fires before anything else
caregiver_override.py   — explicit two-step correction with full audit trail
safety_router.py        — wires all the above together in a fixed priority order
reminder.py             — checks overdue doses and generates reminder messages
dispatch.py             — single CLI entry point that Hermes calls via terminal, returns JSON
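The duplicate_guard.py rule in the list above could be sketched roughly like this. It is a minimal illustration, assuming the audit log can be read as (medication, event, timestamp) rows. The function name `is_duplicate`, the row shape, and the `BLOCKING_EVENTS` set are hypothetical and not taken from the repository.

```python
from datetime import datetime, timedelta, timezone

# The 6-hour window comes from the module list above.
DUPLICATE_WINDOW = timedelta(hours=6)

# Assumption: these audit events block a repeat dose.
BLOCKING_EVENTS = {"MED_CONFIRMED", "MED_UNCERTAIN"}


def is_duplicate(audit_rows, medication, now):
    """Return True if `medication` has a blocking event inside the window.

    audit_rows: iterable of (medication, event, timestamp) tuples,
    a hypothetical shape rather than the real schema.
    """
    for med, event, ts in audit_rows:
        if med == medication and event in BLOCKING_EVENTS:
            if timedelta(0) <= now - ts < DUPLICATE_WINDOW:
                return True
    return False


log = [("heart pill", "MED_CONFIRMED",
        datetime(2026, 5, 29, 8, 14, tzinfo=timezone.utc))]
now = datetime(2026, 5, 29, 9, 0, tzinfo=timezone.utc)
print(is_duplicate(log, "heart pill", now))     # True
print(is_duplicate(log, "pressure pill", now))  # False
```

Because the check is a pure function of the log and the clock, it can be tested without Telegram, Hermes, or an API key.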

How Hermes Agent Powers This Project

Hermes is not a wrapper around a prompt here. It is the patient-facing layer, the conversation manager, the voice transcription router, the scheduler, and the delivery channel. Without Hermes, this is a Python script that nobody can talk to.

Let me be specific about each piece.

1. The Telegram gateway handles voice and text

When a patient sends a voice note to the Telegram bot, Hermes receives it through its native Telegram gateway integration. The gateway downloads the audio, passes it through faster-whisper for local transcription, and forwards the resulting text to the skill for processing. The patient never had to install anything beyond Telegram. They did not type a command or navigate a menu. They just sent a voice message the same way they would send one to a family member.

This is Hermes's gateway doing real work. The voice-to-text pipeline, the Telegram connection, the message routing — all of it is handled by Hermes before the first line of safety Python runs.

2. The skill is the orchestration contract

The project is packaged as a native Hermes skill stored in hermes-skill/med-safety/SKILL.md. Dropping this file into ~/.hermes/skills/ registers /med-safety as a slash command and loads all the rules Hermes will follow for every interaction.

The SKILL.md contains:

A step-by-step procedure for what Hermes does on each turn
An outcome table mapping every pipeline result code to a specific response behaviour
Hard rules Hermes must never break (ESCALATION → deliver verbatim, nothing else)
The path and argument format for calling dispatch.py

Alongside it is a SOUL.md — Hermes's built-in personality system. This defines the voice of the assistant: calm, short sentences, everyday words, no medical jargon, one clear next step at the end of every message. The older adult on the other end of this conversation does not want to parse clinical language when they are worried about whether they double-dosed a blood thinner. The SOUL.md enforces that consistently across every response without repeating the instruction in every prompt.

This is why a skill is a better fit than a system prompt. A SOUL.md + SKILL.md combination gives the agent a spec it treats like a contract, not a suggestion.

3. The terminal tool separates conversation from safety decisions

Hermes calls dispatch.py via its terminal tool on every turn. The script runs the full safety pipeline and returns a JSON result:

{
  "outcome": "CONFIRMED",
  "message": "Got it. I've recorded that you took your heart pill at 08:14 UTC.",
  "session_key": null,
  "log_id": 7
}

Hermes reads the outcome field and acts according to the rules in SKILL.md. It never has to decide whether a dose was safe, whether a combination is dangerous, or whether something is an emergency. Those decisions came back in the JSON. Hermes just delivers.

This separation is intentional. The terminal tool pattern lets you put safety-critical logic in code you can test, audit, and reason about, while leaving the agent to do what it is actually good at: understanding natural language, managing conversation state, and delivering messages to a human.
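To make the contract concrete, here is a small sketch of how a caller might validate that JSON before acting on it. It assumes dispatch.py prints exactly one JSON object with the four fields shown above. `PipelineResult` and `parse_result` are hypothetical names for illustration, and rejecting incomplete output is my assumption, not documented project behaviour.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The four fields shown in the example result above.
REQUIRED_FIELDS = ("outcome", "message", "session_key", "log_id")


@dataclass(frozen=True)
class PipelineResult:
    outcome: str
    message: str
    session_key: Optional[str]
    log_id: Optional[int]


def parse_result(raw_stdout: str) -> PipelineResult:
    """Parse the pipeline's stdout and refuse anything incomplete."""
    data = json.loads(raw_stdout)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"pipeline output missing fields: {missing}")
    return PipelineResult(**{name: data[name] for name in REQUIRED_FIELDS})


raw = json.dumps({
    "outcome": "CONFIRMED",
    "message": "Got it. I've recorded that you took your heart pill at 08:14 UTC.",
    "session_key": None,
    "log_id": 7,
})
result = parse_result(raw)
print(result.outcome)  # CONFIRMED
```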

4. Session memory tracks multi-turn clarification

When "the white pill" comes back AMBIGUOUS, the pipeline returns a session_key — a unique identifier for the open clarification session. Hermes holds this key in its session memory and passes it automatically on the next turn:

python dispatch.py --session "ambig_20260529_083200" "2"

The patient just said "2". They did not say which session they were answering, what question was asked, or which medications were in the list. Hermes remembered all of that across turns. The session_key mechanism works because Hermes's memory layer exists and I did not have to build a separate state store to use it.
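On the pipeline side, resolving that "2" could look something like the sketch below. It assumes the candidates are stored under the session key. An in-memory dict stands in for whatever store the project really uses, and `open_session` and `resolve_choice` are hypothetical names.

```python
# Hypothetical store: session_key -> ordered list of candidate medications.
SESSIONS = {}


def open_session(session_key, candidates):
    SESSIONS[session_key] = list(candidates)


def resolve_choice(session_key, reply):
    """Map a reply like "2" to a candidate, or return None if it is invalid."""
    candidates = SESSIONS.get(session_key)
    text = reply.strip()
    if candidates is None or not text.isdecimal():
        return None
    index = int(text) - 1  # the patient sees a 1-based list
    if not 0 <= index < len(candidates):
        return None
    # Close the session only once a valid choice has been made.
    del SESSIONS[session_key]
    return candidates[index]


open_session("ambig_20260529_083200",
             ["Metoprolol 50mg", "Lisinopril 10mg", "Amlodipine 5mg"])
choice = resolve_choice("ambig_20260529_083200", "2")
print(choice)  # Lisinopril 10mg
```

An out-of-range or non-numeric reply leaves the session open, so the question can be asked again instead of guessing.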

5. Cron drives the proactive reminder loop

Inside a Hermes session, this instruction registers a scheduled job:

Create a cron job: every day at 08:30
  cd /path/to/project && python dispatch.py --remind
If result is not [SILENT], send the message to Telegram.

dispatch.py --remind checks the reminder engine, compares each medication's scheduled time against the audit log, and returns reminder messages only for doses that are overdue and for which no reminder has been sent yet today. Hermes's cron scheduler runs this check and delivers the result through Telegram without any polling loop, background thread, or separate infrastructure.
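The overdue check can be sketched as a pure function. This is an illustration under assumptions: the schedule is a dict of medication name to time of day, and confirmations and sent reminders arrive as sets. `build_reminders` and these shapes are hypothetical; only the [SILENT] sentinel comes from the cron instruction above.

```python
from datetime import datetime, time, timezone

SILENT = "[SILENT]"  # sentinel used in the cron instruction above


def build_reminders(schedule, confirmed_today, reminded_today, now):
    """Return reminder text, or SILENT when nothing is due.

    schedule: dict of medication name -> scheduled time of day (assumed shape)
    confirmed_today / reminded_today: sets of medication names
    """
    messages = []
    for medication, due in sorted(schedule.items()):
        if now.time() < due:
            continue  # not overdue yet
        if medication in confirmed_today or medication in reminded_today:
            continue  # already taken, or already reminded today
        messages.append(
            f"Reminder: You haven't confirmed your {medication} yet today."
        )
    return "\n".join(messages) if messages else SILENT


schedule = {"blood thinner": time(8, 0), "pressure pill": time(20, 0)}
now = datetime(2026, 5, 29, 8, 30, tzinfo=timezone.utc)
print(build_reminders(schedule, set(), set(), now))
print(build_reminders(schedule, set(), {"blood thinner"}, now))  # [SILENT]
```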

A morning reminder that reaches the patient before they have forgotten is more useful than one they have to ask for. Hermes cron made that easy. I expected to spend a day on this part. It took about an hour.

6. The Telegram bot's own JobQueue as a redundancy layer

The Telegram bot also runs an independent JobQueue check every 15 minutes through python-telegram-bot. If Hermes cron misses a window — restart, connectivity, anything — the bot sends the reminder anyway. A patient's medication reminder should not depend on a single point of failure. Two independent schedules pointing at the same reminder.py logic is the right call.

The architecture in one diagram

Patient voice note or text (Telegram)
    │
    ▼
Hermes Agent — Telegram gateway
    │  receives message, transcribes voice via faster-whisper,
    │  loads med-safety skill, reads SOUL.md personality
    │
    ▼
Hermes calls terminal: python dispatch.py "I took my heart pill"
    │
    ▼
safety_router.py — deterministic Python, no AI
    ├── emergency_escalation.py  — 80+ unsafe keywords, fires first
    ├── confidence_rules.py      — low-confidence STT → ask to repeat
    ├── lookup.py                — 4-pass medication matching
    ├── ambiguity_handler.py     — 2+ matches → clarification session
    └── duplicate_guard.py       — confirmed in last 6h → blocked
    │
    ▼ JSON: {outcome, message, session_key, log_id}
    │
    ▼
If outcome == ESCALATION:
    Hermes delivers message verbatim. Response formatting bypassed.
Else:
    OpenRouter rephrases into SOUL.md-consistent language
    Hermes delivers to Telegram
    │
    ▼
Audit log (SQLite) — immutable, every event written, nothing deleted

Escalation messages never reach OpenRouter. The moment format_response() sees outcome == ESCALATION, it returns the message as-is. The sentence telling someone to call emergency services will never be softened by a model trying to sound less alarming.
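A minimal sketch of that bypass, assuming the result is a dict with outcome and message keys. The `rephrase` callable is a stand-in for the model call, and the fallback on failure is my assumption rather than confirmed project behaviour.

```python
def format_response(result, rephrase):
    """Return the text to deliver.

    result: dict with "outcome" and "message"
    rephrase: callable str -> str standing in for the model call
              (hypothetical; the real step calls OpenRouter)
    """
    if result["outcome"] == "ESCALATION":
        # Safety-critical wording is never handed to the model.
        return result["message"]
    try:
        warmer = rephrase(result["message"])
    except Exception:
        # Assumption: on API failure, fall back to the pipeline's own wording.
        return result["message"]
    return warmer or result["message"]


def demo_rephrase(text):
    # Stand-in for a model call, used only for this example.
    return "Thanks! " + text


print(format_response({"outcome": "ESCALATION", "message": "Please call 995."},
                      demo_rephrase))  # Please call 995.
print(format_response({"outcome": "CONFIRMED", "message": "Dose recorded."},
                      demo_rephrase))  # Thanks! Dose recorded.
```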

Safety rules, tested in code

python -m pytest tests/ -v
# 94 passed in 2.4s

Every safety property has a test. A few that matter:

"the white pill" with three candidates → AMBIGUOUS, nothing logged yet
Same medication confirmed twice within 6 hours → DUPLICATE_BLOCKED
MED_UNCERTAIN in the log → also blocks a second dose attempt
"I accidentally took four pills" → EMERGENCY_ESCALATION, message contains "995"
"I want to take ibuprofen" → DRUG_INTERACTION_ALERT, no dose confirmed
"I took fish oil" → SUPPLEMENT response, never logged as a prescription
"I took it again" → UNKNOWN_MED, not confirmed as insulin

That last one was a real bug. "I took it again" strips down to "it" through the intent extractor, and "it" fuzzy-matched "insulin" because both contain the same token. One silent wrong confirmation of a critical medication. The fix was a stopword list that prevents pronouns from reaching the medication matcher at all.
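The shape of that fix, sketched under assumptions: the stopword list here is illustrative rather than the project's real one, and `medication_terms` is a hypothetical helper that runs before any fuzzy matching. An empty result would be routed to UNKNOWN_MED instead of reaching the matcher.

```python
import re

# Assumption: an illustrative stopword list; the project's real list may differ.
STOPWORDS = {"i", "it", "them", "that", "this", "my", "the", "a",
             "took", "take", "again", "one"}


def medication_terms(utterance):
    """Return only the tokens that are allowed to reach the matcher."""
    tokens = re.findall(r"[a-z0-9]+", utterance.lower())
    return [token for token in tokens if token not in STOPWORDS]


print(medication_terms("I took it again"))       # []
print(medication_terms("I took my heart pill"))  # ['heart', 'pill']
```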

Running it yourself

git clone <your-repo>
cd med-safety-companion
pip install -r requirements.txt
pip install "python-telegram-bot[job-queue]" faster-whisper

cp .env.example .env
# OPENROUTER_API_KEY — free at openrouter.ai, no card required
# TELEGRAM_BOT_TOKEN — free from @BotFather on Telegram

python seed.py
python -m pytest tests/ -v
python cli.py --bot

Send /start to your Telegram bot. Say "I took my heart pill" by voice. Watch it land in the logs.

To run the full Hermes experience with the /med-safety slash command:

cp -r hermes-skill/med-safety ~/.hermes/skills/
hermes -s med-safety
# then type: /med-safety I took my heart pill

Three things I did not expect

Hermes's voice note handling eliminated an entire integration problem. I expected to write a custom webhook, a file download handler, and audio conversion code. The Telegram gateway handled all of that. I only had to write the transcription step and wire it to the pipeline. The hard part of "voice input on Telegram" was not hard at all.

SKILL.md is a more reliable contract than a system prompt. A well-written skill with outcome tables, hard rules, and a separate SOUL.md produces consistent behaviour across hundreds of turns. A long system prompt drifts. The skill acts like a spec the agent is trying to satisfy, not a tone it vaguely remembers.

Cron + Telegram is a complete proactive notification system. I expected to need a separate scheduler service, a notification database, and probably a Redis queue. I ended up with a --remind flag on a Python script and three lines in a Hermes cron definition. The proactive reminder feature — the one that might actually save someone from a missed dose — was the simplest part of the whole build.

What is next

Adding TTS so the system can speak the confirmation back to patients who find reading difficult. A caregiver summary view showing the week's log in plain language. Letting caregivers add or edit medications through the Telegram chat itself. And eventually, refill tracking — a warning when a medication is running low before the patient runs out entirely.

Hermes is the voice, the memory, the scheduler, and the delivery channel. Python is the safety brain. One without the other is incomplete. Together they form something worth deploying in a real home.

Inside an MCP-Native Content Workflow Engine — Here's What Actually Broke (and What Finally Made Sense)

Preetha — Wed, 27 May 2026 12:45:51 +0000

I started this project thinking, "Let me try this MCP thing." I didn't expect to end up rebuilding how I think about workflow automation, AI integration, and what it really means to build infrastructure instead of just tools.

This isn't a tutorial — the repo ContentOps-MCP has a 7-day structured tutorial for that. This is more honest. It's about the decisions I made, the ones that broke things, and the lessons that actually matter for your next project.

Think of it like a detective story: each phase of the build was a clue. Some led to dead ends. Some changed the whole direction. By the end, the architecture finally made sense.

The problem I was actually trying to solve

Content teams don't waste time writing. They waste time on the plumbing around the writing.

Checking if the meta description is there.
Pasting the draft link into Slack.
Remembering to send the newsletter email.
Discovering two hours after publishing that paragraph three has a broken link.

Most teams patch this with Zapier or n8n. Notion → WordPress → Slack. Done. It works — until it doesn't.

Until someone publishes a half-finished draft because the automation didn't care.

Until the SEO title is 90 characters because nothing checked.

Until the brand voice that took six months to establish gets eroded one auto-published post at a time.

The automation wasn't the problem. What it lacked was any sense of quality. It just moved things around.

So I started with one question:

What if the workflow itself could check whether the content was ready before it moved?

What I built

The project is called ContentOps MCP Orchestrator. The short version looks like this:

Notion draft → QA gate → WordPress → Slack → email

But the interesting part isn't the pipeline. It's the QA gate sitting in the middle of it, and the architecture underneath that makes the whole thing composable.

Phase 1: get the pipeline working

I started with a straightforward FastAPI app. Notion polling, a workflow runner that called step functions in order, SQLite for run history, and a basic static UI. Nothing fancy.

The point was to get the pipeline working end-to-end and understand what data needed to flow between steps.

Phase 2: make everything MCP-first

Next, I refactored every integration — WordPress, Slack, Resend, Notion — into its own MCP server. Each one exposed /tools and /use_tool endpoints. The orchestrator became an MCP client, dispatching tool calls instead of importing Python functions directly.

I added a ServerRegistry that tried the remote server first and fell back to local mock responses if it wasn't running.

Suddenly the system felt different. Adding a new integration didn't mean editing the runner. It meant writing a new server.

Phase 3: add the QA gate and registry

Then came the QA gate. It's an 11-agent local scoring engine that runs on every workflow before the publish step. It checks:

SEO title completeness
Meta description
Broken links
Readability
Brand voice consistency (against a rubric you define)

It returns a score, a pass/fail, and specific suggestions. The workflow either pauses for human review or continues automatically depending on the mode you configure.

I also added a curated registry of 14 content-stack MCP servers (Ghost, Beehiiv, Substack, WordPress, Webflow, Resend, Loops, Mailchimp, Linear, Notion, Coda, and more).

The architecture decision that changed everything

Here's the thing about building a workflow tool. The naive approach is to write a runner that knows how to call each integration:

if step.app == "wordpress":
    call_wordpress()
elif step.app == "slack":
    call_slack()

It works. It's also a dead end, because every new integration means touching the runner.

The MCP model inverts this. The runner doesn't know anything about WordPress. The WordPress MCP server knows about WordPress. The runner just knows how to make a tool call and handle the result.

You can add Ghost, Beehiiv, Linear, anything — as long as it speaks the protocol, the runner doesn't care.

This sounds obvious in retrospect. It wasn't obvious when I was writing step 1.

🔍 Example: MCP tool call vs HTTP call

This is the turning point where the architecture finally clicked.

❌ Old way (HTTP / Python function):

if step.app == "wordpress":
    create_draft(step.title, step.content)
elif step.app == "slack":
    post_message(step.text)

Every new integration means touching this file. Every new step type means more branching logic. Hard to test. Hard to extend.

✅ New way (MCP tool call):

steps:
  - server: wordpress-mcp
    tool: create_draft
    input_map:
      title: "{trigger.pages[0].title}"
      content: "{trigger.pages[0].body}"
  - server: slack-mcp
    tool: post_message
    input_map:
      text: "New draft published: {steps[0].url}"

The orchestrator doesn't know what WordPress is. It just makes a tool call:

{
  "server": "wordpress-mcp",
  "tool": "create_draft",
  "params": {
    "title": "How to Build an MCP-Native Content Stack",
    "content": "..."
  }
}

Add Ghost? Add another server. Add Beehiiv? Add another server. The orchestrator stays the same. That's the inversion.
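Here is a rough sketch of what that inversion looks like inside a runner. In-process callables stand in for the MCP servers, so `SERVERS`, `call_tool`, and `run_workflow` are hypothetical names, and the input_map template resolution is left out to keep the example short.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-process stand-ins for MCP servers: server -> tool -> callable.
Tool = Callable[..., Dict[str, Any]]
SERVERS: Dict[str, Dict[str, Tool]] = {
    "wordpress-mcp": {
        "create_draft": lambda title, content: {
            "url": "https://blog.example.com/mock-draft",
            "status": "draft",
        },
    },
    "slack-mcp": {
        "post_message": lambda text: {"ok": True, "text": text},
    },
}


def call_tool(server: str, tool: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch by name; the runner never branches on a specific integration."""
    try:
        handler = SERVERS[server][tool]
    except KeyError:
        raise LookupError(f"unknown tool address: {server}::{tool}") from None
    return handler(**params)


def run_workflow(steps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [call_tool(s["server"], s["tool"], s["params"]) for s in steps]


steps = [
    {"server": "wordpress-mcp", "tool": "create_draft",
     "params": {"title": "How to Build an MCP-Native Content Stack",
                "content": "..."}},
    {"server": "slack-mcp", "tool": "post_message",
     "params": {"text": "New draft published"}},
]
results = run_workflow(steps)
print(results[0]["status"])  # draft
```

Adding another integration means adding an entry to the registry. `run_workflow` does not change.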

Building the QA gate: what I got wrong first

My first attempt at the QA gate was a single function that checked everything. One blob of logic, one result. It was fast to write and immediately painful to extend.

The problem is that content quality isn't one thing. SEO is a different concern from readability. Brand voice is different from link validation. Mixing them into one function meant changing one thing risked breaking another, and the output was a wall of undifferentiated feedback that didn't tell you what to fix first.

The second version split it into 11 specialized agents. Each agent inherits from BaseQAAgent, implements a check() method, and returns a QAResult with a severity level and specific issues. The scoring engine aggregates them with configurable weights.

You can add a new agent without touching any existing one.
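A sketch of that structure with two toy agents. The class and method names BaseQAAgent, check(), and QAResult come from the description above. The fields, the 0-to-1 score scale, the 60-character title limit, the weights, and the 0.7 threshold are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QAResult:
    agent: str
    score: float   # assumed scale: 0.0 (fail) to 1.0 (clean)
    severity: str  # e.g. "info", "warning", "error"
    issues: List[str] = field(default_factory=list)


class BaseQAAgent(ABC):
    name = "base"

    @abstractmethod
    def check(self, draft: Dict[str, str]) -> QAResult:
        ...


class MetaDescriptionAgent(BaseQAAgent):
    name = "meta_description"

    def check(self, draft):
        if draft.get("meta_description", "").strip():
            return QAResult(self.name, 1.0, "info")
        return QAResult(self.name, 0.0, "error", ["Meta description is missing."])


class SeoTitleAgent(BaseQAAgent):
    name = "seo_title"

    def check(self, draft):
        length = len(draft.get("title", ""))
        if 0 < length <= 60:  # assumed limit for this sketch
            return QAResult(self.name, 1.0, "info")
        return QAResult(self.name, 0.0, "warning",
                        [f"Title length is {length}; expected 1 to 60."])


def run_gate(draft, agents, weights, threshold=0.7):
    """Aggregate per-agent results into one weighted score."""
    results = [agent.check(draft) for agent in agents]
    total_weight = sum(weights[r.agent] for r in results)
    score = sum(weights[r.agent] * r.score for r in results) / total_weight
    return {
        "pass": score >= threshold,
        "score": score,
        "suggestions": [issue for r in results for issue in r.issues],
    }


draft = {"title": "My Draft Post", "meta_description": ""}
report = run_gate(draft, [MetaDescriptionAgent(), SeoTitleAgent()],
                  {"meta_description": 2.0, "seo_title": 1.0})
print(report["pass"], report["suggestions"])
```

A twelfth agent would be another subclass plus one weight entry, with nothing else touched.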

This is the open/closed principle, yes. But the deeper lesson was about feedback granularity.

A QA check that says "this draft needs work" is useless.

A check that says "your meta description is missing, your third internal link returns 404, and your sentence length average is 28 words for a technical audience" is actionable.

🔍 Example: QA gate — before vs after

This is where the QA gate actually does something useful.

Imagine this draft comes in from Notion:

title: "My Draft Post"
meta_description: ""
content: "Testing something. Not sure if this will work."

The QA gate checks:

Title: too vague → fails
Meta: missing → fails
Readability: too shallow for technical audience → fails
Brand voice: informal, unclear → fails

Result:

{
  "pass": false,
  "reasoning": "Title is generic and vague, meta description is missing, brand voice is not professional or technical, and readability is too shallow for the target audience.",
  "suggestions": [
    "Choose a more specific, technical title.",
    "Write a clear meta description summarizing the article.",
    "Add concrete examples and structure your content with sections."
  ]
}

The workflow pauses. The editor sees this reasoning and chooses to "retry after edit" instead of publishing a weak draft.

That's the difference between "automation moved things" and "automation checked quality first."

The fallback pattern: underrated engineering

One of the quieter decisions in this project was the ServerRegistry fallback. When the orchestrator needs to call wordpress-mcp::create_draft, it first tries the actual MCP server running on port 8002. If that's not running, it falls back to a local mock that returns a realistic response shape.

This sounds like a convenience feature for demos. It's actually a correctness feature for development.

Without fallback, developing the orchestrator logic requires all four MCP servers to be running simultaneously. With fallback, you can work on the executor, the QA gate, the UI, and the run trace logic independently — the system behaves consistently regardless of which servers are actually up.

The lesson:

Design your system to be testable in parts, not just as a whole.

The fallback pattern is one way to do that. It forces you to define what a "realistic response" looks like for each tool, which in turn forces you to think clearly about your data contracts. The mock becomes a kind of documentation, because it defines the shape of the real response.

🔍 Example: Fallback mock vs real server

This is why the fallback pattern is underrated.

Real server response (if running):

{
  "server": "wordpress-mcp",
  "tool": "create_draft",
  "result": {
    "slug": "mcp-native-content-stack",
    "url": "https://blog.example.com/mcp-native-content-stack",
    "status": "draft"
  }
}

Fallback mock (if server is not running):

{
  "server": "wordpress-mcp",
  "tool": "create_draft",
  "result": {
    "slug": "mcp-native-content-stack-mock",
    "url": "https://blog.example.com/mock-draft",
    "status": "draft"
  }
}

The orchestrator doesn't care which one it gets. The shape is the same. That means:

You can test the QA gate without WordPress running.
You can test the UI without Slack running.
You can test the run trace without any servers running.

The mock becomes documentation because it forces you to define what the real response should look like.
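The fallback itself can be sketched in a few lines. Only the name ServerRegistry comes from the project. The constructor arguments, the `ServerUnavailable` exception, and the `offline_transport` helper are hypothetical, and a real transport would make an HTTP call where this one simply raises.

```python
class ServerUnavailable(Exception):
    """Raised by the transport when an MCP server cannot be reached."""


class ServerRegistry:
    def __init__(self, remote_call, mocks):
        self._remote_call = remote_call  # callable(server, tool, params) -> dict
        self._mocks = mocks              # {(server, tool): callable(params) -> dict}

    def call(self, server, tool, params):
        try:
            result = self._remote_call(server, tool, params)
        except ServerUnavailable:
            # Same response shape as the real server, by construction.
            result = self._mocks[(server, tool)](params)
        return {"server": server, "tool": tool, "result": result}


def offline_transport(server, tool, params):
    # Stand-in transport that behaves as if no server is running.
    raise ServerUnavailable(f"{server} is not running")


MOCKS = {
    ("wordpress-mcp", "create_draft"): lambda params: {
        "slug": "mcp-native-content-stack-mock",
        "url": "https://blog.example.com/mock-draft",
        "status": "draft",
    },
}

registry = ServerRegistry(offline_transport, MOCKS)
response = registry.call("wordpress-mcp", "create_draft",
                         {"title": "Demo", "content": "..."})
print(response["result"]["status"])  # draft
```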

What MCP actually is (and why it matters right now)

If you haven't been following the MCP ecosystem, here's the quick version.

Model Context Protocol (MCP) is an open protocol by Anthropic that standardizes how AI models interact with external tools and data sources. Instead of every LLM integration being a bespoke API adapter, MCP gives you a common interface:

Tools with schemas
Resources with URIs
A client/server model that works across implementations

The practical implication for builders:

You can write a tool once and have any MCP-compatible client call it.

Claude Desktop, your custom orchestrator, someone else's agent: they all speak the same protocol. This is a big deal for infrastructure builders because it means your MCP server has leverage beyond your own project.

I built contentops-mcp before the MCP ecosystem fully matured, which meant making some decisions based on where I thought the protocol was going rather than where it was. That's a risk. But it's also an opportunity.

Being early in an ecosystem means the registry of content-stack MCP servers I built has a real chance of becoming the reference catalog for this niche, simply because almost nothing else exists there yet.

Lessons worth keeping

These are the ones I'd tell myself on day one if I could.

1. Build the trace first.

I added detailed run tracing — per-step status, input/output capture, timestamps — later than I should have. Every bug I hit before that required reading logs and mentally reconstructing what happened. After I had the trace, debugging became obvious. Build observability into the data model from the start, not as an afterthought.

2. Real numbers beat descriptions.

"11-agent quality checks" and "14-server content-stack registry" are specific. "AI-powered quality checks" and "growing registry" are not. Specificity signals that you actually built the thing and counted it. Use real numbers everywhere you can.

3. The mock is a contract.

Every time I wrote a fallback mock response, I was forced to decide what the real response would look like. That decision often uncovered inconsistencies in my data model before I'd written the real implementation. Write your mocks early and treat them as interface definitions.

4. One entry point, one command.

The project went through a phase where running it required five terminal windows: one for each of the four MCP servers and one for the orchestrator. That's fine for production but terrible for a learner's first experience. The ServerRegistry fallback pattern collapsed it to one command. Always think about what the first five minutes look like for someone who just cloned the repo.

5. Separate the learning path from the feature list.

The 7-day tutorial structure forced me to think about which concepts depended on which other concepts, and in what order they should be introduced. That exercise — independently of any tutorial — made the architecture cleaner. If you can't explain the build order to a beginner, your dependencies probably aren't as clean as you think.

6. Don't wrap, invert.

The temptation when building on top of existing integrations is to wrap them. "I'll write a wrapper for the WordPress API." The better question is: who should know about WordPress? The answer is: the WordPress MCP server, and nothing else. Inversion of knowledge — pushing integration-specific logic to the edge — is what makes the core stay clean.

7. Naming is architecture.

qa-gate::run_check as a tool call address tells you the server, the tool, and the action in one string. It's readable in logs. It's greppable. It matches your file structure. Naming your MCP tool calls with server::tool convention costs nothing and saves enormous cognitive load when you are reading a run trace at midnight trying to figure out why step 3 failed.
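Splitting such an address back into its parts is a one-liner, which is part of the appeal. A small sketch, with `parse_tool_address` as a hypothetical helper name:

```python
def parse_tool_address(address):
    """Split "server::tool" into (server, tool), rejecting malformed input."""
    server, sep, tool = address.partition("::")
    if not sep or not server or not tool or "::" in tool:
        raise ValueError(f"expected 'server::tool', got {address!r}")
    return server, tool


print(parse_tool_address("qa-gate::run_check"))  # ('qa-gate', 'run_check')
```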

What's next

The project has a roadmap that goes in a direction I am genuinely excited about.

Phase 4: visual workflow editor and team-based approval gates — the product layer on top of the infrastructure.
Phase 5: hosted registry with premium MCP server adapters for enterprise stacks.

But the thing I am most interested in is the registry itself. There is no curated, tested, versioned catalog of content-stack MCP servers anywhere in the current ecosystem. The one in this project is a start — 14 servers across publishing, email, and ops categories.

Getting that to 50, with real installation paths and verified tool schemas, is the kind of catalog that becomes a reference point for the whole content-ops vertical.

Where to find it

The project is on GitHub at ContentOps-MCP with a zero-to-running quick start, a 7-day tutorial, and full docs for the QA gate and registry.

The one-line pitch:

Zero Infra Cost — MCP-native ContentOps orchestrator for AI publishing workflows.

Notion → QA Gate → WordPress → Slack → Resend.

Built-in 11-agent quality checks, 14-server content-stack registry. Self-hosted.

If you are building in the MCP ecosystem, working on content tooling, or just want a concrete project to learn FastAPI and workflow orchestration from scratch — the tutorial is designed to take you from day one to a working demo in a week.

I didn't want to build another automation tool. I wanted to build a content system that could reason about quality before it published anything.

The case isn't closed. If you are working on anything in the MCP space, leave a breadcrumb in the comments — I want to know what tools people are wiring together right now before the trail goes cold.

Exploring AI Workflow Orchestration: Comparing Weft, Python & Alternative Pipeline Approaches

Preetha — Mon, 25 May 2026 06:12:48 +0000

A few weeks ago I started exploring something that made me rethink how we build AI workflows. Most of us naturally reach for Python when building AI systems. I do too. Python gives flexibility, full control, and honestly it powers almost everything in modern AI stacks. But while experimenting, I came across Weft from Weaver Mind AI. Weft is still in development, but the idea behind it caught my attention.

Instead of thinking purely in terms of code execution, Weft focuses more on orchestration — how AI models, APIs, retrieval systems, humans, and tools coordinate together as a workflow.

Python is amazing at building logic.
Weft feels like it is trying to improve how AI workflows themselves are structured.

That got me curious.

https://github.com/PreethaRaj/TokenWeaver-Lab/releases/download/v1.0.0/Demo.gif

Experiment Setup

So I decided to experiment with it in one of my projects.

I built a local-first research synthesis demo to compare different orchestration styles:

Weft-style orchestration
Traditional Python full-buffer processing
Staged pipeline executor
- Retrieval, summarization, and synthesis handled as isolated sequential steps
- No shared orchestration state
MapReduce-style orchestration
- Distributed map stages for retrieval and synthesis
- Followed by aggregation and reduction steps

The project focuses on:

Token usage
Cost visibility
Orchestration efficiency

Goal

Can changing orchestration patterns alone impact token usage and cost behaviour?

Experiment Design

Some things I experimented with:

Shared retrieval pipelines to keep comparisons fair
Token and cost visibility between orchestration approaches
Deterministic local execution for reproducible experiments
Research synthesis workflows without introducing API costs into evaluation

Measurement Methodology

Token usage was computed by tracking:

Total prompt tokens
Generated completion tokens

across each orchestration path.

Cost estimates were normalized using equivalent model pricing assumptions so orchestration differences could be isolated.
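
A minimal sketch of how that tracking could look. The TokenUsage class, its record method, and the per-1K-token prices passed to estimated_cost are hypothetical placeholders for illustration, not code from the TokenWeaver-Lab repository.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Running token totals for one orchestration path (hypothetical helper)."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt: int, completion: int) -> None:
        # Add the counts from a single model call to the running totals.
        self.prompt_tokens += prompt
        self.completion_tokens += completion

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def estimated_cost(self, prompt_price_per_1k: float,
                       completion_price_per_1k: float) -> float:
        # Use the same assumed prices for every path so that only the
        # orchestration pattern changes the estimate.
        return (self.prompt_tokens / 1000 * prompt_price_per_1k
                + self.completion_tokens / 1000 * completion_price_per_1k)


# Illustrative counts only, not measured results.
usage = TokenUsage()
usage.record(prompt=1200, completion=300)
usage.record(prompt=800, completion=200)
```

Keeping one TokenUsage instance per orchestration path makes the later comparison a matter of reading total_tokens from each one.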

Reduction Formula

((baseline token usage - orchestration token usage) / baseline token usage) × 100

Baseline: traditional Python full-buffer execution, where larger context windows were repeatedly passed between stages.
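
The reduction formula translates directly into a small helper. The function name and the sample numbers are assumptions for illustration and are not results from the experiment.

```python
def token_reduction_pct(baseline_tokens: int, orchestration_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline path."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - orchestration_tokens) / baseline_tokens * 100


# Illustrative numbers only, not measured results.
print(token_reduction_pct(10_000, 7_500))  # 25.0
```

A negative result would mean the orchestration path used more tokens than the baseline.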

Orchestration approaches that reduced repeated context movement showed measurable token efficiency improvements.

Key Learnings

Lesson 1: Orchestration Matters More Than Expected

We usually optimize prompts or models. But sometimes the bigger opportunity sits in workflow design itself.

In these experiments, orchestration decisions influenced:

How often context was reloaded
How retrieval outputs were reused
Whether intermediate results were shared across stages

Small architectural choices created downstream effects on token consumption and execution efficiency.

Lesson 2: Fair Benchmarking Is Harder Than It Looks

Comparing approaches only works when inputs stay consistent. Even small differences can distort conclusions.

To ensure fairness:

Retrieval outputs were fixed across runs
Execution remained deterministic
All orchestration layers used identical source data

Without this control, token reductions may be misleading.

Lesson 3: AI Engineering Is Becoming a Systems Design Problem

Models still matter.

But so do:

Pipelines
Retrieval patterns
Context flow
Orchestration decisions

The challenge is shifting from:

“Pick the best model”

To:

Design efficient information movement systems

Key factors influencing performance:

Context buffering strategies
Retrieval reuse
Execution ordering
State management
Orchestration topology

Disclaimer

This is not about replacing Python.

Python remains the foundation
Weft explores an additional layer: AI workflow orchestration

Acknowledgment

Courtesy: https://github.com/WeaveMindAI/weft

Big thanks to Weaver Mind AI for sharing early ideas in this space.

The project is still evolving — and I’m still learning.

Call for Discussion

Curious if others have experimented with orchestration-first AI development approaches.

Links

GitHub: https://github.com/PreethaRaj/TokenWeaver-Lab
Weavermind: https://weavemind.ai/

I built a self-hosted RAG system for Journalism — What Production Retrieval Taught Me

Preetha — Fri, 22 May 2026 08:49:06 +0000

Over the last few months, I built Atlas — a fully self-hosted retrieval system designed for journalism workflows. No paid APIs. No hosted vector databases or AI infrastructure. Just local models, PostgreSQL, pgvector, Celery, and a retrieval pipeline built to survive production traffic.

I originally thought this would mostly be an infrastructure project. It wasn't. The hardest lessons appeared after deployment — when assumptions broke, retrieval quality drifted, and tiny implementation decisions started affecting reliability.

What does Atlas do?

Atlas ingests live RSS feeds from BBC, Guardian, NYT, NPR, Deutsche Welle and more every 15 minutes, embeds content locally using sentence-transformers, stores vectors in PostgreSQL with pgvector, and answers questions with source-grounded citations.

Beyond search it has:

Grounded Q&A — every answer maps to an exact source passage
Claim-level fact-checking — splits text into claims, scores each against evidence
Story brief generation — key facts, open questions, suggested angles for reporters
Multi-format repurposing — one topic becomes newsletter, social post, audio script, headline
A full story workspace — source notebooks, drafts, editorial review, version diff, publish readiness

https://github.com/PreethaRaj/atlas-editorial-intelligence/releases/download/v1.0.0/SearchAnswer.gif

https://github.com/PreethaRaj/atlas-editorial-intelligence/releases/download/v1.0.0/PartnerMode.gif

The retrieval pipeline

Here is the full pipeline before I get into the lessons:

Query string
    │
    ├── embed(query) → vector cosine > 0.30 → top 20 chunks
    ├── websearch_to_tsquery → PostgreSQL FTS → top 20 chunks
    └── Title FTS boost → top 10 articles
              │
              ▼
         RRF merge (k=60)
              │
         recency blend (85% relevance + 15% freshness)
              │
         post-cosine gate > 0.12
              │
         Policy engine (public / partner / paywall)
              │
         Response + inline citations
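
The recency blend step can be sketched as a weighted sum. The 85/15 split comes from the diagram above. The exponential decay, the 24-hour half-life, and the assumption that relevance is already scaled to [0, 1] are illustrative guesses on my part, not necessarily the formula Atlas uses.

```python
def freshness(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Assumed freshness curve: 1.0 for a brand-new article, halving every half-life."""
    return 0.5 ** (max(age_hours, 0.0) / half_life_hours)


def blended_score(relevance: float, age_hours: float,
                  relevance_weight: float = 0.85) -> float:
    """Blend relevance (assumed to be in [0, 1]) with freshness using the 85/15 split."""
    freshness_weight = 1.0 - relevance_weight
    return relevance_weight * relevance + freshness_weight * freshness(age_hours)
```

With equal relevance, a newer article scores slightly higher, but the 15% weight keeps freshness from overriding a clearly more relevant older piece.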

Lesson 1 — Pure vector search fails for news

This surprised me. I assumed a good embedding model would handle everything. It does not — at least not for current events journalism.

The problem: proper nouns.

Words like Philippines, Kishida, Rafah, Starmer are rare in any model's training data relative to their importance in daily news. The cosine similarity between "Japan missile exports Philippines" and an article titled "Tokyo defence deal with Manila confirmed" was 0.28 — just below my original threshold of 0.30.

The article was clearly relevant. The vector search missed it completely.

Full-text search caught it immediately because Japan, missile, Philippines all appeared in the article text.

The fix was hybrid search. Vector catches semantic similarity. FTS catches proper nouns and exact terminology. Neither is sufficient alone for a news corpus.

# Three search paths merged with RRF
# Path 1: vector cosine similarity
# Path 2: websearch_to_tsquery (handles "Japan Philippines" as two terms)
# Path 3: title-specific FTS (weighted 0.7x to avoid title-only noise)

# RRF merge — no score normalisation needed because it only uses rank position
# final_score = Σ 1 / (60 + rank_i)
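
A minimal sketch of the RRF merge described in the comments above, assuming each path returns an ordered list of document ids. The function name, the sample ids, and the way the 0.7 title weight is applied (as a multiplier on that path's contribution) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, weights=None):
    """Reciprocal Rank Fusion: score(doc) = sum over paths of weight / (k + rank)."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical document ids for the three search paths.
vector_hits = ["a12", "b07", "c33"]
fts_hits = ["b07", "d41", "a12"]
title_hits = ["b07"]
merged = rrf_merge([vector_hits, fts_hits, title_hits], weights=[1.0, 1.0, 0.7])
```

Because only rank positions enter the score, the cosine values and FTS ranks never need to be put on a common scale.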

Lesson 2 — Batch embedding is not a micro-optimisation

I was calling embed() once per article for the first two weeks. Here is what that looks like:

17 feeds × 30 articles × embed(1 article) × 100ms = 51 seconds per ingest cycle

After switching to batch embedding — collect all articles, call embed([t1, t2, ..., tN]) once:

17 feeds × 30 articles = 510 articles
embed(510 articles)    = ~3 seconds total

17× faster. The model inference overhead is almost entirely fixed cost per batch, not per item. This is obvious from the PyTorch documentation but I had not read it carefully enough.

# Before — slow
for article in articles:
    vec = embed(article.content)
    insert_embedding(article.id, vec)

# After — fast
contents = [a.content for a in articles]
vecs     = embed(contents)   # single call, returns (N, 384) array
for article, vec in zip(articles, vecs):
    insert_embedding(article.id, vec)

Lesson 3 — The cosine threshold is your precision-recall dial

Atlas has two thresholds:

COSINE_MIN      = 0.30   # SQL WHERE — pre-filter before leaving DB
POST_COSINE_MIN = 0.12   # post-RRF — sanity gate after merge

What I learned tuning these:

Threshold	Effect
0.45	Missed "Japan missile Philippines" — too restrictive
0.30	Good balance for a news corpus
0.20	Sports results started appearing for political queries

The intuition: news articles about related topics often use completely different vocabulary than the query. A threshold of 0.30 allows the model to bridge that vocabulary gap. A threshold of 0.45 requires the query and article to use nearly identical language — which defeats the purpose of semantic search.

POST_COSINE_MIN = 0.12 exists only to handle FTS-only hits. When an article is found by keyword search but has no semantic overlap with the query (cosine = 0.0), it means the keyword match was probably accidental. The post-filter removes those.
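
The two gates can be sketched in plain Python. In Atlas the first filter runs in SQL, so this is only a stand-in, and the candidate dictionaries and ids are hypothetical.

```python
COSINE_MIN = 0.30       # vector-path pre-filter (a SQL WHERE clause in Atlas)
POST_COSINE_MIN = 0.12  # sanity gate applied after the RRF merge


def vector_prefilter(candidates):
    """Keep vector-path candidates whose cosine exceeds COSINE_MIN."""
    return [c for c in candidates if c["cosine"] > COSINE_MIN]


def post_merge_gate(merged):
    """Drop merged hits with almost no semantic overlap, such as accidental keyword matches."""
    return [c for c in merged if c["cosine"] > POST_COSINE_MIN]


merged = [
    {"id": "tokyo-manila-deal", "cosine": 0.28},  # found by FTS, just under the vector threshold
    {"id": "unrelated-keyword", "cosine": 0.0},   # keyword match with no semantic overlap
]
kept = post_merge_gate(merged)
```

The 0.28 article from Lesson 1 fails the vector pre-filter but survives the post-merge gate, which is the behaviour the hybrid design relies on.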

Lesson 4 — Celery beat scheduling has a startup timing problem

The beat schedule runs ingest_all_feeds every 15 minutes. But there is a subtle issue: on a fresh deploy, the first beat fires at the next :00, :15, :30, or :45 UTC boundary — not 15 minutes from startup.

Deploy at 14:01 → first ingest at 14:15  ✓ fine
Deploy at 14:14 → first ingest at 14:15  ✓ fine
Deploy at 14:00:01 → first ingest at 14:15  ✗ 15 minute corpus gap on first launch

The fix was timedelta(minutes=15) instead of crontab(minute='*/15'), so the interval is measured from beat startup rather than aligned to clock boundaries.
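
The difference between the two schedule styles can be modelled with the standard library alone. This is a sketch of the timing arithmetic, not Celery's scheduler code, and the function names and dates are made up for illustration.

```python
from datetime import datetime, timedelta


def first_run_clock_aligned(start: datetime) -> datetime:
    """Next :00/:15/:30/:45 boundary strictly after start (crontab-style)."""
    base = start.replace(second=0, microsecond=0)
    return base + timedelta(minutes=15 - base.minute % 15)


def first_run_interval(start: datetime) -> datetime:
    """First run a fixed 15 minutes after start (interval-style)."""
    return start + timedelta(minutes=15)


for start in (datetime(2026, 5, 22, 14, 1),
              datetime(2026, 5, 22, 14, 14),
              datetime(2026, 5, 22, 14, 0, 1)):
    print(start.time(),
          first_run_clock_aligned(start).time(),
          first_run_interval(start).time())
```

The clock-aligned wait varies with the deploy time, while the interval-style wait is always the same 15 minutes from startup.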

The startup_ingest task also checks corpus article count before honouring the Redis dedup flag. Empty corpus → ingest regardless. This handles docker-compose down -v (fresh database) correctly.

from datetime import timedelta

beat_schedule = {
    "ingest-every-15-min": {
        "task":     "tasks.ingest_all_feeds",
        "schedule": timedelta(minutes=15),  # from startup, not clock-aligned
    },
    "startup-ingest-once": {
        "task":     "tasks.startup_ingest",
        "schedule": timedelta(hours=24),    # fires once, Redis dedup prevents repeats
    },
}

Lesson 5 — The Docker healthcheck dependency chain matters

This one took me an embarrassing amount of time.

celery-beat:
  depends_on:
    celery-worker:
      condition: service_healthy   # ← this line is critical

Without service_healthy, beat starts immediately and dispatches tasks before any worker is ready to consume them. The tasks sit in the queue, beat fires again in 15 minutes, tasks pile up.

With service_healthy, beat waits until a worker is confirmed ready. Clean startup every time.

The worker healthcheck uses celery inspect ping, which confirms the worker is up and responding to control commands, not just that the container started.

What is next

The infrastructure has a warmup_reranker() stub in main.py for a cross-encoder reranker. That is the highest-impact next upgrade: running cross-encoder/ms-marco-MiniLM-L-6-v2 over the top-20 RRF results before returning to the user. I expect it to add ~100ms latency but meaningfully improve ranking for ambiguous queries.

I am also looking at adding a BM25 path via the pg_bm25 extension (ParadeDB) to replace the PostgreSQL FTS path. BM25 handles document length normalisation better than tsvector for longer articles.

The project

Atlas is built to learn and adapt. The README has a two-week tutorial walking through each layer of the system.

GitHub: https://github.com/PreethaRaj/atlas-editorial-intelligence

Stack: FastAPI · PostgreSQL 16 · pgvector · Celery · Redis · sentence-transformers · Next.js 14 · Docker Compose

Happy to answer questions on the retrieval architecture, the pgvector schema, or the Celery configuration in the comments.