<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dhaivat Jambudia</title>
    <description>The latest articles on DEV Community by Dhaivat Jambudia (@dhaivat_jambudia).</description>
    <link>https://dev.to/dhaivat_jambudia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857020%2Fb134fbe2-5fa0-404d-a53d-1005d73a0d59.png</url>
      <title>DEV Community: Dhaivat Jambudia</title>
      <link>https://dev.to/dhaivat_jambudia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dhaivat_jambudia"/>
    <language>en</language>
    <item>
      <title>Building a Production-Ready ML System for Supply Chain Late Delivery Risk Prediction</title>
      <dc:creator>Dhaivat Jambudia</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:35:05 +0000</pubDate>
      <link>https://dev.to/dhaivat_jambudia/building-a-production-ready-ml-system-for-supply-chain-late-delivery-risk-prediction-1bbd</link>
      <guid>https://dev.to/dhaivat_jambudia/building-a-production-ready-ml-system-for-supply-chain-late-delivery-risk-prediction-1bbd</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: “Will This Order Arrive Late?”
&lt;/h2&gt;

&lt;p&gt;In any supply chain, late deliveries hurt customer trust and increase support costs. At order placement, logistics teams usually have no reliable way to know which orders will fail their delivery promise. Our client was flagging orders manually, based on gut feeling and experience: no formal metric, no repeatability, and a lot of missed late deliveries.&lt;/p&gt;

&lt;p&gt;We were asked to build a system that predicts late delivery risk at the moment an order is placed, enabling proactive interventions: expedited shipping, customer notifications, priority routing. The model had to be production‑ready, reproducible, and monitorable. &lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Formulation – It’s All About the Cost of Missing a Late Order
&lt;/h2&gt;

&lt;p&gt;We framed it as binary classification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Target: Late_delivery_risk (0 = on time, 1 = late)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Primary metric: F2‑score because missing a late delivery (false negative) is twice as costly as a false alarm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guardrail: Recall ≥ 0.80 (catch at least 80% of true late orders)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dataset contained 180,519 orders with 53 columns. Class distribution was near‑balanced (54.8% late, 45.2% on time).&lt;/p&gt;
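The beta=2 weighting is exactly what scikit-learn's `fbeta_score` computes; a toy check (illustrative numbers, not our data) shows how F2 rewards recall over precision:

```python
# Why F2? beta=2 weights recall twice as heavily as precision,
# matching the "missed late order costs 2x a false alarm" framing.
from sklearn.metrics import fbeta_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]  # one missed late order, two false alarms

f2 = fbeta_score(y_true, y_pred, beta=2)
rec = recall_score(y_true, y_pred)
print(round(f2, 3), round(rec, 3))  # 0.714 0.75
```

Because recall exceeds precision here, F2 lands much nearer recall than F1 would.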

&lt;h2&gt;
  
  
  First Step: Remove Leakage
&lt;/h2&gt;

&lt;p&gt;The raw data included columns that would be impossible to know at order placement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Delivery Status – literally the target&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Days for shipping (real) – actual transit days, known only after delivery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;shipping date (DateOrders) – when the order actually left the warehouse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Order Status – may reflect post‑order events (we dropped it to be safe)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also dropped PII (email, names, password), pure IDs, 100% null columns (Product Description), and columns with extreme cardinality (e.g., Order City with 3,597 unique values).&lt;/p&gt;

&lt;p&gt;After cleaning, we kept ~30 features: scheduled shipping days, benefit per order, shipping mode, market, customer segment, order hour, day of week, etc.&lt;/p&gt;
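As a sketch, the leakage screen is a single guarded drop (column names from the dataset above; the real step also logs what it removed):

```python
import pandas as pd

# Columns unknowable at order placement (leakage); PII, IDs, and null
# columns are handled by a similar list in the same step.
LEAKAGE_COLS = ['Delivery Status', 'Days for shipping (real)',
                'shipping date (DateOrders)', 'Order Status']

def drop_leakage(df: pd.DataFrame) -> pd.DataFrame:
    # errors='ignore' keeps the step idempotent if a column is already gone
    return df.drop(columns=LEAKAGE_COLS, errors='ignore')

df = pd.DataFrame({'Delivery Status': ['Late'], 'Order Status': ['COMPLETE'],
                   'Shipping Mode': ['Standard Class']})
clean = drop_leakage(df)
print(list(clean.columns))  # ['Shipping Mode']
```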

&lt;h2&gt;
  
  
  Feature Engineering – Preventing the #1 Silent Bug
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;preprocessor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;numeric&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imputer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scaler&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;NUMERIC_COLS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;onehot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imputer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;constant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;encoder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle_unknown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;ONEHOT_COLS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imputer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;constant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UNKNOWN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;encoder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TargetEncoder&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;   &lt;span class="c1"&gt;# out‑of‑fold encoding
&lt;/span&gt;        &lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;TARGET_ENC_COLS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At serving time, the same frozen artifact is loaded – zero chance of applying a different imputation median or a different one‑hot encoding.&lt;/p&gt;
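A minimal sketch of that freeze-and-reload contract, using `joblib` on a stand-in pipeline (the production artifact goes through the MLflow registry rather than a local path):

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression())])
pipe.fit(X, y)

path = os.path.join(tempfile.gettempdir(), 'late_risk_pipeline.joblib')
joblib.dump(pipe, path)       # training side: freeze once
frozen = joblib.load(path)    # serving side: load, never refit
print(frozen.predict([[2.5]]))
```

Because the scaler's fitted mean and the model's weights travel inside one artifact, serving cannot drift from training.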

&lt;h2&gt;
  
  
  MLOps Architecture: Ten Stages, Five Pipelines
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq1getfyv6pmpboqj3ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq1getfyv6pmpboqj3ah.png" alt="10 stages"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These stages are implemented as five pipelines in ZenML:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Training pipeline (stages 1‑6)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inference pipeline (stage 7)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring pipeline (stage 8)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drift detection pipeline (stage 9)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retraining pipeline (stage 10)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All experiments are tracked in MLflow, integrated with ZenML. Every run logs hyperparameters, metrics, the exact dataset version (ZenML artifact ID), and the git commit hash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training &amp;amp; Evaluation – Why We Chose LightGBM
&lt;/h2&gt;

&lt;p&gt;We trained a series of models, starting from a Dummy Classifier (majority class) as the absolute floor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklabil6o79hpb4yxj5wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklabil6o79hpb4yxj5wr.png" alt="training and eval"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We used stratified 80/10/10 train/validation/test splits. The test set is touched exactly once, at final evaluation.&lt;/p&gt;
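The 80/10/10 stratified split is two chained `train_test_split` calls with pinned seeds (a sketch on dummy data):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 45 + [1] * 55   # roughly our 45/55 class balance

# 80% train, 20% holdout; then split the holdout in half for val/test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```

Stratifying both calls keeps the class ratio stable in all three splits, so validation metrics are comparable to test metrics.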

&lt;p&gt;The success criteria for production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;F2‑score ≥ 0.75 on held‑out test set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recall ≥ 0.80&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pipeline is reproducible – same data + same code = same result&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(After first run, we can revise the F2 threshold based on actual business value.)&lt;/p&gt;
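The gates themselves are a few lines of code run before registration (a sketch; in the real pipeline the thresholds come from config):

```python
def passes_gates(f2: float, recall: float,
                 f2_min: float = 0.75, recall_min: float = 0.80) -> bool:
    """Return True only if the candidate clears both production gates."""
    return f2 >= f2_min and recall >= recall_min

print(passes_gates(f2=0.78, recall=0.83))  # True: clears both gates
print(passes_gates(f2=0.78, recall=0.79))  # False: recall guardrail fails
```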

&lt;h2&gt;
  
  
  Deployment: Batch Inference with Human‑in‑the‑Loop Promotion
&lt;/h2&gt;

&lt;p&gt;We chose batch inference because order volume is moderate (~500 orders/day). Predictions are refreshed hourly. This is enough to trigger expedited shipping or customer notifications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Promotion Workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Training run completes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluation gates pass (F2‑score ≥ 0.75, recall ≥ 0.80).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model is automatically registered in MLflow as “Staging”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A human reviews metrics against the current production model in MLflow UI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Human approves → model promoted to “Production”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Previous production model is moved to “Archived” (retained for 90 days).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
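As a toy simulation of steps 5–6 (plain Python, not MLflow's registry API):

```python
def promote(registry: dict, candidate: str) -> dict:
    """Simulate human-approved promotion: Staging moves to Production,
    the old Production model moves to Archived (kept 90 days)."""
    new = dict(registry)
    if new.get('production'):
        new['archived'] = new['production']   # keep for fast rollback
    new['production'] = candidate
    new['staging'] = None
    return new

registry = {'production': 'v3', 'staging': 'v4', 'archived': None}
registry = promote(registry, 'v4')
print(registry)  # v4 live, v3 archived and rollback-ready
```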

&lt;p&gt;&lt;strong&gt;Rollback &amp;lt; 5 minutes&lt;/strong&gt;. If a bad model slips through, we simply promote the archived version back to Production via MLflow UI. Because predictions are stored with a model_version column, the dashboard can immediately start using the restored model without data deletion.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Design: How New Orders Get Scored
&lt;/h2&gt;

&lt;p&gt;We needed a reliable way to identify unscored orders and write predictions without double‑writing. The solution uses an absence check and an idempotent insert.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prediction Table Schema
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prediction_id&lt;/span&gt;   &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;        &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;late_risk_score&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;late_risk_score&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;predicted_late&lt;/span&gt;  &lt;span class="nb"&gt;SMALLINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_late&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;scored_at&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;model_version&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- MLflow model version tag&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_run_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- ZenML run ID for audit&lt;/span&gt;
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Detecting Unscored Orders
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;model_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;current_version&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Idempotent Write
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;late_risk_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted_late&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline_run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
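The detection query and the idempotent write combine into a scoring loop. A runnable miniature using `sqlite3`, where SQLite's `INSERT OR IGNORE` stands in for Postgres's `ON CONFLICT DO NOTHING` (schema trimmed to the relevant columns):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY)')
con.execute('''CREATE TABLE predictions (
    order_id INTEGER, model_version TEXT, late_risk_score REAL,
    UNIQUE (order_id, model_version))''')
con.executemany('INSERT INTO orders VALUES (?)', [(1,), (2,), (3,)])

def score_unscored(con, version):
    # absence check: orders with no prediction under the current version
    rows = con.execute('''SELECT order_id FROM orders
        WHERE order_id NOT IN
        (SELECT order_id FROM predictions WHERE model_version = ?)''',
        (version,)).fetchall()
    # INSERT OR IGNORE makes a replayed run a no-op, not a duplicate
    con.executemany('INSERT OR IGNORE INTO predictions VALUES (?, ?, 0.5)',
                    [(oid, version) for (oid,) in rows])

score_unscored(con, 'v1')
score_unscored(con, 'v1')  # replay: writes nothing new
n = con.execute('SELECT COUNT(*) FROM predictions').fetchone()[0]
print(n)  # 3, not 6
```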



&lt;h2&gt;
  
  
  Monitoring &amp;amp; Drift Detection – Don’t Wait for Ground Truth
&lt;/h2&gt;

&lt;p&gt;Ground truth (Late_delivery_risk) arrives only after delivery – days after the prediction. That’s too late to notice data distribution changes. We use Evidently to detect drift in input features as soon as new orders come in.&lt;/p&gt;
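Evidently does the heavy lifting for us, but the underlying idea is a per-feature population-stability check. A pure-Python PSI sketch (the binning scheme and the 0.2 alert threshold are illustrative choices, not Evidently internals):

```python
import math

def psi(reference, current, bins=4):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]
    def frac(sample):
        counts = [0] * bins
        for x in sample:
            i = sum(1 for e in edges if x > e)   # bin index for x
            counts[i] += 1
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)
    r, c = frac(reference), frac(current)
    return sum((ci - ri) * math.log(ci / ri) for ri, ci in zip(r, c))

ref = [2, 2, 3, 3, 4, 4, 5, 5]   # scheduled shipping days, last month
cur = [4, 4, 5, 5, 6, 6, 6, 6]   # carrier SLA changed: distribution shifts right
print(round(psi(ref, cur), 2))   # far above a 0.2 alert threshold
```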

&lt;h2&gt;
  
  
  Operational Reality: What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with the simplest deployment that works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At 500 orders/day, a single scheduled pipeline writing to one database table is correct. We did not need Kubernetes, real‑time APIs, or feature stores. Adding complexity too early kills velocity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Train‑serving skew is the silent killer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Packaging the fitted sklearn.Pipeline as a single artifact and reloading it in inference is non‑negotiable. Every imputation median, one‑hot mapping, and target encoding must be frozen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Idempotency saves weekends.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ON CONFLICT DO NOTHING and the absence‑based order detection meant we never had to worry about replaying or cleaning up duplicate predictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Human promotion gates build trust.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first production model was promoted manually after reviewing slice metrics. After three stable cycles, we may automate promotion – but not before. Trust is earned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Monitor input drift, not just output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We caught a carrier SLA change because the Days for shipment (scheduled) distribution shifted. The Slack alert arrived two days before any ground‑truth label was available, so we fixed the feature mapping before any performance degradation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;We deferred a few items that were not needed for MVP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real‑time API serving (batch is fine for now)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fully automated retraining (earn it after evaluation gates are proven)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud infrastructure (local ZenML stack is sufficient for this scale)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When order volume grows beyond ~10,000/day, we will revisit. But for now, the system is stable, reproducible, and delivers an F2‑score above the target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The complete codebase follows the structure described in our internal design docs, with a core/ module containing pure business logic (no framework imports) and steps/ as thin ZenML wrappers. This makes testing fast and framework migration cheap.&lt;/p&gt;

&lt;p&gt;Happy building, and may your deliveries always be on time. 🚚&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>MAC Cosmetics Generated 53,000 Leads in 2 Days and 17.2X ROI with AI, Here's Exactly How They Did It</title>
      <dc:creator>Dhaivat Jambudia</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:52:30 +0000</pubDate>
      <link>https://dev.to/dhaivat_jambudia/mac-cosmetics-generated-53000-leads-in-2-days-and-172x-roi-with-ai-heres-exactly-how-they-did-it-5ec5</link>
      <guid>https://dev.to/dhaivat_jambudia/mac-cosmetics-generated-53000-leads-in-2-days-and-172x-roi-with-ai-heres-exactly-how-they-did-it-5ec5</guid>
      <description>&lt;p&gt;MAC Cosmetics is a 40-year-old beauty brand operating in 120+ countries with a product line that spans lipsticks, foundations, eyeshadows, and everything in between.&lt;/p&gt;

&lt;p&gt;When MAC decided to expand into a new region, they hit a wall every growth engineer knows well: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No local customer database. Anonymous web traffic with no identity attached.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cart abandonment bleeding revenue. Shoppers browsing, adding to cart, disappearing. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mobile engagement tanking. Short attention spans, no immersive experience to hold them. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Personalization running blind. Product recommendations showing irrelevant items because customer data lived in silos. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal wasn't just "run some AI campaigns." It was to build a full-stack personalization engine from anonymous visitor to loyal customer across web, mobile, email, and push in a new market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's how they did it, use case by use case.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case 1: Turn Anonymous Visitors Into Leads — 53,000 in 2 Days
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Standard lead capture (newsletter popups, discount banners) was failing. Conversion was low, bounce rates were high, and the few emails they collected came with low engagement downstream.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution they found was rooted in human psychology, not an ML algorithm.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gamification as a Data Acquisition Engine
&lt;/h2&gt;

&lt;p&gt;MAC implemented a &lt;strong&gt;Wheel of Fortune&lt;/strong&gt; interactive overlay. First-time visitors were invited to spin the wheel for a chance to win a discount coupon — but to receive it, they had to enter their email.&lt;/p&gt;

&lt;p&gt;This works because of a well-understood psychological mechanism: variable reward schedules. Unlike a fixed "get 10% off" banner (predictable, easy to ignore), a spin mechanic introduces uncertainty. The outcome isn't guaranteed, which makes the action feel exciting rather than transactional.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftptsqysk6yi39pylyy1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftptsqysk6yi39pylyy1b.png" alt="spinning wheel image" width="634" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The coupon is single-use and expiry-gated: this prevents abuse while creating purchase urgency. The email capture is the real prize: it's the moment an anonymous visitor becomes a trackable, targetable customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Numbers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric --------------------- Result
------------------------------------------------
New leads generated -------- 53,000
Time period ---------------- 2 days
Click-through rate (overlay → spin) -- 64.95%
VR increase ---------------- 4.43%
------------------------------------------------
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A 64.95% CTR from an overlay is extraordinary. Standard popups run 1–3%. The gamification mechanic was doing significant heavy lifting here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case 2: Fix Irrelevant Recommendations (+2.3% CVR, 20.56% Add-to-Cart Rate)
&lt;/h2&gt;

&lt;p&gt;MAC's ecommerce team knew that upselling and cross-selling were their fastest path to higher AOV. But their existing recommendation system was showing random products, not products relevant to what a shopper had been browsing.&lt;/p&gt;

&lt;p&gt;Promoting irrelevant items doesn't just fail to convert. It actively damages trust and makes the shopping experience feel generic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: AI-Powered Behavioral Recommendations
&lt;/h2&gt;

&lt;p&gt;Before any recommendation logic could run, MAC needed a unified customer profile. They consolidated data from all channels into a Customer Data Platform (CDP), giving each customer a 360° identity that merged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browsing history (product views, category affinity)&lt;/li&gt;
&lt;li&gt;Purchase history (what they've bought, how recently, how often)&lt;/li&gt;
&lt;li&gt;Cart behavior (what they added but didn't buy)&lt;/li&gt;
&lt;li&gt;Channel data (mobile vs desktop, email engagement)&lt;/li&gt;
&lt;/ul&gt;
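As a toy illustration of what the CDP is doing (invented records; a real platform resolves identity across far messier keys than a clean email):

```python
# Toy CDP merge: each silo keyed by email, folded into one 360-degree profile.
silos = {
    'web':   {'ana@example.com': {'viewed': ['Ruby Woo Lipstick']}},
    'cart':  {'ana@example.com': {'abandoned': ['Studio Fix Foundation']}},
    'email': {'ana@example.com': {'open_rate': 0.42}},
}

def unify(silos):
    profiles = {}
    for channel, records in silos.items():
        for email, attrs in records.items():
            profiles.setdefault(email, {})[channel] = attrs
    return profiles

profile = unify(silos)['ana@example.com']
print(sorted(profile))  # all three silos visible on one identity
```

Once every channel writes to the same key, recommendation logic can read browsing, cart, and engagement signals together instead of guessing from one silo.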

&lt;p&gt;&lt;strong&gt;A 2.3% CVR lift sounds modest but compounds heavily at scale. On high-traffic e-commerce, every tenth of a percent in conversion rate is real money.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case 3: Recover Abandoned Carts (16.69% Conversion Rate)
&lt;/h2&gt;

&lt;p&gt;Cart abandonment is the most expensive leak in e-commerce. The industry average abandonment rate sits around 70%. Most brands send a single reminder email and call it recovery.&lt;/p&gt;

&lt;p&gt;MAC needed a smarter approach, one that didn't rely on a single channel or a single send time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Cross-Channel Journey Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using their CDP data, MAC knew exactly which products each user had abandoned. This enabled personalized recovery not "you left something behind," but "you left MAC Ruby Woo Lipstick and Studio Fix Foundation behind, here they are."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A 16.69% cart recovery rate is 2–3x the industry average. The combination of personalized content, multi-channel sequencing, and send-time optimization is doing the work here — no single tactic accounts for it alone.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case 4: Fix Mobile Engagement (+123.5% Mobile CVR)
&lt;/h2&gt;

&lt;p&gt;Mobile visitors have shorter attention spans, smaller screens, and higher friction. MAC's mobile web experience wasn't holding people long enough to drive discovery and purchase.&lt;/p&gt;

&lt;p&gt;The core issue: &lt;strong&gt;product discovery on mobile is broken by default&lt;/strong&gt;. Scrolling through category pages on a 6-inch screen is tedious. Users bounce before they find something they love.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Instagram-Style Immersive Stories
&lt;/h2&gt;

&lt;p&gt;MAC deployed InStory, a fullscreen story overlay on mobile web that mimics the Instagram/TikTok story format.&lt;/p&gt;

&lt;p&gt;The +123.5% mobile CVR is the headline number, but the 5X productivity gain is the one that compounds over time.&lt;/p&gt;

&lt;p&gt;MAC's results aren't magic. They're the output of a coherent architecture where every layer reinforces every other layer and a team willing to move through distinct experiments to find what works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp76f4es1bmzhiv4bar3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp76f4es1bmzhiv4bar3.png" alt="Full workflow" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's the actual playbook.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>marketing</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>I Replaced 12 Kitchen Managers Guessing "How Much Chicken Do We Need" With 3 ML Models. Here's the Entire Architecture.</title>
      <dc:creator>Dhaivat Jambudia</dc:creator>
      <pubDate>Sat, 11 Apr 2026 04:58:28 +0000</pubDate>
      <link>https://dev.to/dhaivat_jambudia/i-replaced-12-kitchen-managers-guessing-how-much-chicken-do-we-need-with-3-ml-models-heres-the-421e</link>
      <guid>https://dev.to/dhaivat_jambudia/i-replaced-12-kitchen-managers-guessing-how-much-chicken-do-we-need-with-3-ml-models-heres-the-421e</guid>
      <description>&lt;p&gt;This is a case study: AI in Supply Chain&lt;br&gt;
Every restaurant chain has the same dirty secret. Nobody actually knows how much food they waste.&lt;/p&gt;

&lt;p&gt;I worked on a system for a 12-location restaurant chain where the entire inventory process was running on vibes. Kitchen manager walks in at 7 AM, looks around, thinks "yeah we're low on chicken", calls procurement, says "send 20 kg". That's it. That's the system. &lt;/p&gt;

&lt;p&gt;No data. No tracking. No feedback loop. Just a human eyeballing a fridge and making a phone call. &lt;/p&gt;
&lt;h2&gt;
  
  
  The mess we started with
&lt;/h2&gt;

&lt;p&gt;Let me walk you through what actually happens every single day across 12 restaurants.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Morning (per restaurant):&lt;br&gt;
Kitchen manager does a visual walkthrough. Decides what's low based on gut feeling and experience. Writes it down on paper. Sometimes doesn't write it down at all, just remembers. Calls the central procurement office and dictates what they need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Procurement office:&lt;br&gt;
Officer enters the order into a spreadsheet. Then calls 3-4 suppliers asking for price quotes. Same suppliers. Same conversation. Every single week. Compares quotes, picks the cheapest, places the order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next day:&lt;br&gt;
Delivery arrives. Sometimes it's complete, sometimes it's not. Kitchen manager checks it against the order. If something's wrong, calls procurement again. More phone calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The gap nobody talks about:&lt;br&gt;
Throughout the day the kitchen uses ingredients. But nobody tracks how much was actually used versus how much was ordered. You ordered 20 kg chicken. The POS system shows you sold dishes that should use about 12 kg. End of day you have 3 kg left. Where did the other 5 kg go? Nobody knows. Nobody is even asking the question.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Month end:&lt;br&gt;
Each of the 12 restaurant managers compiles their ordering costs into a spreadsheet. Emails it to head office. Someone at head office manually merges 12 spreadsheets into one P&amp;amp;L report. Takes about 8 hours. The report is full of errors but nobody has the energy to check.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Head office looks at the consolidated report and tries to spot problems. But they can't. They don't have waste data. They don't have supplier performance data. They don't know which location is over-ordering. They're making decisions blind.&lt;/p&gt;

&lt;p&gt;I mapped 14 steps in this workflow. Here's how they break down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft225obs9htuu29pardwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft225obs9htuu29pardwt.png" alt="Steps in tables"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern is clear. Most steps are mechanical and can be automated with tools. The judgment steps can be replaced with ML models. And the biggest problem (step 11) isn't even a bad process; it's a missing process entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Inventory &amp;amp; Procurement Automation Pipeline&lt;/span&gt;

| Step                          | Component                              | Why |
|-------------------------------|----------------------------------------|-----|
| Visual inventory check        | TOOL (digital scales + POS data) as TRIGGER | Replace eyeballing with actual measurement. POS tells you what was sold. Scales tell you what's left. This data arriving each morning triggers the whole system. |
| Write order list              | DETERMINISTIC ML (Demand Predictor)    | A regression model takes day-of-week, reservations, weather, past sales, current inventory and outputs precise order quantities. This is not a language problem. It's a math problem. |
| Call procurement              | TOOL (automated routing)               | The call is eliminated entirely. System sends computed order to the procurement workflow. |
| Enter into spreadsheet        | TOOL (database)                        | Spreadsheet replaced by structured database. Orders logged automatically. |
| Call suppliers for quotes     | TOOL (supplier API)                    | Suppliers provide price lists via API or weekly upload. System queries automatically. |
| Compare quotes, pick supplier | DETERMINISTIC ML (Supplier Scorer)     | Score by price + on-time delivery rate + quality rejection rate. A weighted scoring model, not an LLM having a think about it. |
| Place order                   | TOOL (supplier API)                    | Auto-execute within guardrails. |
| Delivery arrives              | TOOL (delivery confirmation)           | Kitchen manager confirms receipt in app. Quantities logged against order. |
| Check delivery vs order       | TOOL (automated matching)              | System compares received vs ordered. Flags mismatches. |
| Resolve mismatches            | HUMAN CHECKPOINT                       | Mismatches above 10% escalated to procurement manager with full context. Small variances auto-accepted and logged. |
| Track usage vs ordered        | DETERMINISTIC ML (Waste Detector)      | Ordered 20 kg. POS shows 12 kg used in dishes. Scale shows 3 kg remaining. 5 kg unaccounted = waste. Model flags restaurants above benchmark. |
| Monthly cost compilation      | TOOL (automated report)                | All orders already in database. Monthly totals are a SQL query. |
| Consolidate 12 restaurants    | TOOL (dashboard)                       | One query across all locations. Real-time. |
| Review for anomalies          | LLM (narrative) + HUMAN CHECKPOINT     | LLM generates readable brief. "Restaurant 7 has 23% higher chicken costs than chain average. Waste rate is 18% vs chain average 9%." Human reads it and decides what to do. |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
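&lt;p&gt;The Supplier Scorer row above is just a weighted sum. A minimal sketch in plain Python (the weights, field names, and numbers here are illustrative, not the production values):&lt;/p&gt;

```python
# Toy supplier scorer: weighted sum of normalized price, on-time delivery
# rate, and quality (1 - rejection rate). Weights are illustrative.
def score_supplier(price, best_price, on_time_rate, rejection_rate,
                   w_price=0.5, w_on_time=0.3, w_quality=0.2):
    price_score = best_price / price        # 1.0 for the cheapest quote
    quality_score = 1.0 - rejection_rate
    return (w_price * price_score
            + w_on_time * on_time_rate
            + w_quality * quality_score)

quotes = {
    "Supplier A": dict(price=210.0, on_time_rate=0.97, rejection_rate=0.01),
    "Supplier B": dict(price=195.0, on_time_rate=0.80, rejection_rate=0.10),
}
best_price = min(q["price"] for q in quotes.values())
ranked = sorted(
    quotes,
    key=lambda s: score_supplier(quotes[s]["price"], best_price,
                                 quotes[s]["on_time_rate"],
                                 quotes[s]["rejection_rate"]),
    reverse=True,
)
```

&lt;p&gt;With these made-up numbers the slightly pricier supplier with the far better on-time record wins, which is exactly the behavior the chain wanted instead of "always pick the cheapest".&lt;/p&gt;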



&lt;p&gt;&lt;strong&gt;Notice:&lt;/strong&gt; the LLM shows up exactly once. At the very end. For narration. It does not make a single decision in this entire pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  System architecture
&lt;/h2&gt;

&lt;p&gt;The system has three layers that run in parallel. &lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Restaurant agents (x12, running independently)
&lt;/h2&gt;

&lt;p&gt;Each restaurant runs its own agent. Same logic, different data. They don't wait for each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRIGGER: Daily at 5 AM

→ [TOOL] Pull POS sales data (yesterday)
→ [TOOL] Pull digital scale readings (current inventory)

→ [DETERMINISTIC ML] Demand Predictor
    Inputs: day of week, reservations, weather,
            yesterday's sales, historical patterns,
            current inventory
    Output: order quantity per ingredient
            with confidence interval

→ [DETERMINISTIC ML] Waste Detector
    Inputs: yesterday's opening inventory,
            deliveries received, POS-derived usage,
            closing inventory from scale
    Output: waste amount per ingredient,
            flag if above threshold

→ [TOOL] Query supplier prices via API

→ [DETERMINISTIC ML] Supplier Scorer
    Inputs: current prices, historical on-time rate,
            quality rejection rate
    Output: ranked supplier per ingredient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
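&lt;p&gt;The Waste Detector in that pipeline is deliberately simple arithmetic over four numbers. A toy version (the threshold and names are mine, not the production code):&lt;/p&gt;

```python
# Toy waste detector: waste = stock you had plus deliveries, minus what
# POS says was used and what the scale says is left. Threshold illustrative.
def detect_waste(opening_kg, delivered_kg, pos_usage_kg, closing_kg,
                 threshold=0.15):
    waste_kg = opening_kg + delivered_kg - pos_usage_kg - closing_kg
    available_kg = opening_kg + delivered_kg
    waste_rate = waste_kg / available_kg if available_kg else 0.0
    return waste_kg, waste_rate, waste_rate > threshold

# The chicken example from the article: 20 kg delivered, POS-implied
# usage of 12 kg, 3 kg left on the scale.
waste_kg, waste_rate, flagged = detect_waste(0.0, 20.0, 12.0, 3.0)
```

&lt;p&gt;That run reports 5 kg unaccounted for, a 25% waste rate, and a raised flag: the question nobody was asking, answered automatically every morning.&lt;/p&gt;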



&lt;h2&gt;
  
  
  Layer 2: Decision branching
&lt;/h2&gt;

&lt;p&gt;This is where the system decides what needs a human and what doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF order &amp;lt; ₹50K
   AND all items from preferred suppliers
   AND waste levels normal:
   → AUTO-EXECUTE order via supplier API
   → Log to database

IF order ≥ ₹50K
   OR non-standard supplier
   OR unusual quantity (&amp;gt;2x average):
   → HUMAN CHECKPOINT
   → Procurement manager approves/modifies/rejects
   → Execute approved order

IF waste flag triggered (&amp;gt;15% waste on any ingredient):
   → ALERT restaurant manager + head office
   → [LLM] Generate waste analysis brief
   → HUMAN investigates and logs root cause
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The auto-execute path handles the routine. The human checkpoint catches the unusual. The waste alert handles the unknown. Three paths, clear rules, no ambiguity.&lt;/p&gt;
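&lt;p&gt;Those branching rules are simple enough to write down directly. A sketch, collapsed into one function for brevity (in the real system the waste alert runs alongside order execution; the names are mine):&lt;/p&gt;

```python
# Toy version of the Layer 2 routing rules. Thresholds mirror the article:
# ₹50K order cap, 2x quantity check, 15% waste flag.
def route_order(total_inr, all_preferred_suppliers, max_qty_ratio,
                max_waste_rate):
    if max_waste_rate > 0.15:
        return "WASTE_ALERT"          # alert + LLM brief + human root-cause
    if (total_inr >= 50_000
            or not all_preferred_suppliers
            or max_qty_ratio > 2.0):
        return "HUMAN_CHECKPOINT"     # unusual: approve / modify / reject
    return "AUTO_EXECUTE"             # routine: order via supplier API
```

&lt;p&gt;Three paths, a handful of comparisons, no model in the loop: the routing itself is the cheapest part of the system.&lt;/p&gt;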

&lt;h2&gt;
  
  
  Layer 3: Head office consolidation agent
&lt;/h2&gt;

&lt;p&gt;Runs daily and monthly. Aggregates everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DAILY:
→ [TOOL] Aggregate orders, costs, waste across 12 locations
→ [DETERMINISTIC ML] Anomaly detector
   Flags restaurants above/below cost benchmarks
→ Dashboard updated

MONTHLY:
→ [TOOL] Generate consolidated P&amp;amp;L from database
→ [DETERMINISTIC ML] Trend analysis
→ [LLM] Generate monthly narrative:
   "Chain-wide food cost: 32.4% of revenue (target 30%).
    Top performer: Restaurant 2 at 28.1%.
    Needs attention: Restaurant 9 at 37.2%,
    driven by 22% waste rate on vegetables."
→ HUMAN CHECKPOINT: Head office reviews and decides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The feedback loop
&lt;/h2&gt;

&lt;p&gt;This is what makes the system get smarter over time.&lt;br&gt;
Every order, every delivery, every waste measurement, every demand prediction gets stored. The system tracks prediction accuracy. &lt;/p&gt;

&lt;p&gt;"We predicted 18 kg chicken demand for Restaurant 3 on a Tuesday. Actual was 21 kg. Error: 16%."&lt;/p&gt;

&lt;p&gt;This feeds back into the demand predictor. Over weeks and months, the model learns each restaurant's patterns: that Restaurant 5 always spikes on Fridays, that Restaurant 9 has higher waste on vegetables every monsoon season, that Supplier B's delivery times slip during festivals.&lt;/p&gt;

&lt;p&gt;The ML models don't stay static. They improve because the memory layer captures outcomes, not just decisions.&lt;/p&gt;
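&lt;p&gt;The accuracy tracking behind that Tuesday-chicken example is tiny: log every prediction against its measured outcome and watch the rolling error. A sketch (the window size and names are assumptions):&lt;/p&gt;

```python
from statistics import mean

# Relative error of one demand prediction against the measured outcome.
def prediction_error(predicted_kg, actual_kg):
    return abs(actual_kg - predicted_kg) / predicted_kg

# Rolling mean absolute percentage error over the most recent window;
# a rising value is what triggers wider confidence intervals.
def rolling_mape(history, window=14):
    recent = history[-window:]
    return mean(prediction_error(p, a) for p, a in recent)

err = prediction_error(18.0, 21.0)   # the Tuesday chicken example
```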

&lt;h2&gt;
  
  
  Failure mode analysis
&lt;/h2&gt;

&lt;p&gt;Every component will fail eventually. The question isn't "will it fail" but "what happens when it does."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;POS data feed goes down:&lt;/strong&gt; Kitchen manager enters yesterday's estimated covers manually. System uses last similar day as a proxy. Alert goes to IT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scales malfunction or staff skip weighing:&lt;/strong&gt; System flags missing reading and falls back to calculated inventory: yesterday's stock + deliveries - POS usage. Not perfect but better than nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demand predictor is wildly wrong:&lt;/strong&gt; Confidence intervals widen automatically when recent errors are high. Low-confidence predictions get a 20% buffer and human review flag. Model retrains weekly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supplier API is down:&lt;/strong&gt; Order gets queued. SMS notification sent to procurement officer with order details for manual phone placement. Order logged in app when confirmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Waste detector throws false positives:&lt;/strong&gt; Waste alerts require 2+ consecutive days above threshold before escalating. Single-day spikes get noted but don't trigger alarms. This reduces alert fatigue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-execution bug orders 10x quantity:&lt;/strong&gt; Hard guardrail. No auto-order can exceed 3x the 4-week average for that ingredient at that restaurant. Anything above requires human approval. Daily spend cap per location.&lt;/p&gt;
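&lt;p&gt;That hard guardrail is worth spelling out because it contains no ML at all, just arithmetic over recent order history (numbers and names are illustrative):&lt;/p&gt;

```python
# Toy hard guardrail: no auto-order may exceed 3x the 4-week average for
# that ingredient at that restaurant. None means "needs human approval".
def allowed_auto_qty(requested_kg, four_week_orders_kg, cap_ratio=3.0):
    avg_kg = sum(four_week_orders_kg) / len(four_week_orders_kg)
    if requested_kg > cap_ratio * avg_kg:
        return None
    return requested_kg

history_kg = [18.0, 20.0, 22.0, 20.0]   # 4-week average: 20 kg
```

&lt;p&gt;A buggy 10x request (200 kg against a 20 kg average) comes back as None and stops at a human, no matter what the model upstream asked for.&lt;/p&gt;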

&lt;p&gt;&lt;strong&gt;LLM hallucinates a number in the monthly report:&lt;/strong&gt; LLM never generates numbers. All figures come from the database as structured data. LLM only wraps them in natural language. If LLM is unavailable, the dashboard with raw numbers still works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The design principle:&lt;/strong&gt; If any ML component fails, the system degrades to the current manual process, not to something worse. The worst-case scenario is "things go back to how they are today." That's the floor, not a cliff.&lt;/p&gt;

&lt;p&gt;Conclusion:&lt;br&gt;
ML decides. LLM explains. Humans approve. That's the pattern.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>agents</category>
    </item>
    <item>
      <title>I Replaced a $200/mo AI Stack with OpenClaw + Free Models. Here's the Exact Setup (and Why Security Almost Killed It)</title>
      <dc:creator>Dhaivat Jambudia</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:10:22 +0000</pubDate>
      <link>https://dev.to/dhaivat_jambudia/i-replaced-a-200mo-ai-stack-with-openclaw-free-models-heres-the-exact-setup-and-why-security-24d1</link>
      <guid>https://dev.to/dhaivat_jambudia/i-replaced-a-200mo-ai-stack-with-openclaw-free-models-heres-the-exact-setup-and-why-security-24d1</guid>
      <description>&lt;p&gt;I spent 2 weeks building a setup with OpenClaw, free/cheap models for the boring stuff, and only routing the hard problems to expensive models. The result? Our AI bill went from ~$200/mo to around $10/mo. And honestly, the quality for 90% of tasks is the same.&lt;/p&gt;

&lt;p&gt;But there's a part few people talk about on Twitter: the security side of OpenClaw nearly burned us. I'll get into that too, because if you're a founder or CTO thinking about deploying this, you need to hear the ugly parts.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is OpenClaw, and Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;If you haven't heard of OpenClaw yet... where have you been? It's the fastest-growing open source project in GitHub history, with over 163k stars. The project started as a personal AI assistant by Peter Steinberger (yes, the iOS dev legend) and exploded because it does something no other tool does well: it gives you a persistent AI agent that lives in your messaging apps and actually does things on your behalf.&lt;/p&gt;

&lt;p&gt;WhatsApp, Telegram, Slack, Discord, email: OpenClaw connects to all of them through a single gateway. It's not a chatbot. It's an agent that can run shell commands, browse the web, manage your calendar, read and write files, and more. Think of it like having a junior employee that never sleeps, never complains, and costs pennies per hour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Quick Context
OpenClaw is model-agnostic. You can plug in Claude, GPT, Gemini, or run completely free local models through Ollama. This is the key that makes the cost optimization possible.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For founders and CTOs, here's why it matters: you can automate 80% of your operational busywork without building custom software. No Python scripts, no Zapier chains, no hiring a developer to glue APIs together. You write a SOUL.md config file, connect your channels, and you're live.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea: Not Every Task Needs a $75/M-token Model
&lt;/h2&gt;

&lt;p&gt;This is the thing that took me embarrassingly long to realize. When someone emails us saying "hey, can you send me the latest invoice?", your AI doesn't need Claude Opus to understand that. A 7B-parameter model can handle it. When someone fills out a form and you need to parse it into structured data, that's small-model territory again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;openclaw.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;saved&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;us&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;$$$/mo&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"defaults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"primary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama/qwen3:32b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"fallbacks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"openai/gpt-4.1-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-2.5-flash"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
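&lt;p&gt;The config above is all OpenClaw needs, but the routing idea itself is worth seeing in plain code. A toy cost-aware router (the keyword heuristic and names are mine, not OpenClaw's internals):&lt;/p&gt;

```python
# Toy cost-aware model router: default to the cheap local model, escalate
# to the expensive one only when the task looks hard. Heuristic illustrative.
CHEAP_MODEL = "ollama/qwen3:32b"
EXPENSIVE_MODEL = "anthropic/claude-sonnet-4-20250514"
HARD_TASK_HINTS = ("debug", "race condition", "refactor", "stack trace")

def pick_model(task):
    text = task.lower()
    if any(hint in text for hint in HARD_TASK_HINTS):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL
```

&lt;p&gt;In practice the cheap path handles the 90% of tasks mentioned above; the escalation list is the knob you tune over time.&lt;/p&gt;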



&lt;h2&gt;
  
  
  What We Actually Automate (With Examples)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Email Reply Drafts — Free Model
&lt;/h3&gt;

&lt;p&gt;Our support inbox gets maybe 60-80 emails a day. Most of them are variants of the same 15 questions. We wrote a skill that reads incoming emails, matches them against our FAQ knowledge base, and drafts a reply. The human just reviews and hits send.&lt;/p&gt;

&lt;p&gt;Model used: Qwen3 32B (local, $0). For templated email replies, this model is more than adequate. It follows instructions well, keeps the tone professional, and doesn't hallucinate company policies because we feed it the exact docs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# skills/email-drafter/SKILL.md&lt;/span&gt;
name: email-reply-drafter
trigger: new email in support inbox
model_override: ollama/qwen3:32b

&lt;span class="gh"&gt;# Steps:&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Read incoming email
&lt;span class="p"&gt;2.&lt;/span&gt; Search FAQ knowledge base for matching topics
&lt;span class="p"&gt;3.&lt;/span&gt; Draft reply using company voice guidelines
&lt;span class="p"&gt;4.&lt;/span&gt; Send draft to #email-review Slack channel
&lt;span class="p"&gt;5.&lt;/span&gt; Wait for human approval before sending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Form Processing — Free Model
&lt;/h3&gt;

&lt;p&gt;We have clients who fill out onboarding forms. The data comes in messy: sometimes a PDF, sometimes a Google Form, sometimes literally a photo of a handwritten form (yes, in 2026). The OpenClaw skill extracts the data, structures it, and pushes it to our CRM.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Code Reviews &amp;amp; Bug Fixes — Expensive Model (Sub-Agent)
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. When our agent encounters a coding task — someone reports a bug, or we need to generate a script — it spawns a sub-agent that uses Claude Sonnet 4 or routes to Claude Code.&lt;/p&gt;

&lt;p&gt;Why not use the free model here? Because I tried, and the results were... let's just say, not production-ready. Qwen 32B can write a for loop fine. But ask it to debug a race condition in an async Node.js service and it falls apart. The bigger models just get it in ways that smaller models don't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Sub-agent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;coding&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tasks&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"code-reviewer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_override"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"shell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"browser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file-edit"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sandbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also pipe coding tasks directly to Claude Code or OpenAI Codex as external tools. OpenClaw's tool system lets you call any CLI tool, so you literally just wrap claude-code or codex as a skill and the agent delegates to them when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now the important part: Security
&lt;/h2&gt;

&lt;p&gt;OpenClaw's security track record is... not great. And I say this as someone who genuinely loves the project. The speed of adoption outpaced the security hardening by a huge margin, though most incidents come down to user misconfiguration rather than OpenClaw itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Actually Did to Lock It Down
&lt;/h2&gt;

&lt;p&gt;After reading the Cisco blog and the Microsoft post, I spent a full weekend hardening our setup. Here's the non-negotiable stuff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Docker isolation.&lt;/strong&gt; Our OpenClaw instance runs in a container with no access to the host filesystem except for a single mounted volume for the workspace. The agent physically cannot touch anything outside its sandbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dedicated credentials.&lt;/strong&gt; The agent has its own email account, its own API keys, its own everything. If it gets compromised, the blast radius is limited. It's not sharing my personal Google account or our company's AWS root credentials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. No third-party skills.&lt;/strong&gt; Zero. We write all our skills in-house. I don't care how cool a ClawHub skill looks; it's not going on our machine until the ecosystem has proper code review and signing. Maybe in 6 months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Network segmentation.&lt;/strong&gt; The container runs on an isolated network. It can reach the model APIs and our internal services, but nothing else. No random outbound connections to god-knows-where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Human-in-the-loop for destructive actions.&lt;/strong&gt; Any action that sends an email, modifies a file, or runs a command requires approval in Slack. The agent proposes, the human approves. No autonomous destructive operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line: Is It Worth It?
&lt;/h2&gt;

&lt;p&gt;Absolutely, yes, but you need technical chops. This is not a "click three buttons and you're done" setup. You need to understand Docker, networking, API keys, model capabilities, and security basics. If your team doesn't have someone who can set this up and maintain it, pay for a managed solution instead.&lt;/p&gt;

&lt;p&gt;Want help setting this up for your team?&lt;br&gt;
I've helped people deploy this exact architecture over the past month. If you're burning money on AI API bills and want to cut costs without losing quality, reach out. I do a free 30-minute audit call where we look at your current setup and identify where free models can replace expensive ones.&lt;/p&gt;

&lt;p&gt;DM me on &lt;a href="https://x.com/dhaivat00" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt; or on &lt;a href="https://www.linkedin.com/in/dhaivat-jambudia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
    </item>
    <item>
      <title>I Built a RAG System to Chat With Newton's Entire Wikipedia</title>
      <dc:creator>Dhaivat Jambudia</dc:creator>
      <pubDate>Thu, 02 Apr 2026 16:29:02 +0000</pubDate>
      <link>https://dev.to/dhaivat_jambudia/i-built-a-rag-system-to-chat-with-newtons-entire-wikipedia-1ndg</link>
      <guid>https://dev.to/dhaivat_jambudia/i-built-a-rag-system-to-chat-with-newtons-entire-wikipedia-1ndg</guid>
      <description>&lt;p&gt;Most RAG tutorials just say "chunk your PDF and call OpenAI". I wanted to build something more real — a proper pipeline that actually ingests, cleans, embeds, and serves knowledge from Isaac Newton's Wikipedia page end to end.  &lt;/p&gt;

&lt;p&gt;The result is Newton LLM. You can now ask things like "What are Newton's contributions in Calculus?" and get proper answers with sources instead of made-up stuff. Here's how I actually built it and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Most RAG Demos
&lt;/h2&gt;

&lt;p&gt;Every YouTube RAG tutorial follows the same boring steps: load PDF, split into chunks, put in vector store, done. But nobody talks about the real issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you keep the data fresh when the source changes?&lt;/li&gt;
&lt;li&gt;How do you clean messy web data before embedding?&lt;/li&gt;
&lt;li&gt;How do you separate the ingestion part from the serving part?&lt;/li&gt;
&lt;li&gt;How do you make the whole thing actually deployable?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Newton LLM tries to solve these. It's not just a notebook; it's a small system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system has two main layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Ingestion Layer (the offline part)
&lt;/h3&gt;

&lt;p&gt;Source → Airflow → MongoDB&lt;/p&gt;

&lt;p&gt;I pull data from Wikipedia about Newton: his life, physics, math, optics, and so on. Apache Airflow runs the whole ETL pipeline through a DAG. It fetches, cleans, and transforms the raw content. No random scripts or cron jobs; Airflow handles retries, scheduling, and monitoring. MongoDB stores the cleaned documents. This is my "source of truth" before anything gets embedded.&lt;/p&gt;

&lt;p&gt;Why not embed straight from Wikipedia? Because raw scraped pages are full of garbage: menus, references, bad HTML. You need to clean it first. MongoDB gives me a clean staging area.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG Serving Layer (the online part)
&lt;/h3&gt;

&lt;p&gt;Qdrant ← Batch Embeddings ← MongoDB&lt;/p&gt;

&lt;p&gt;Since Newton's Wikipedia page doesn't change every day, I use batch embedding instead of doing it live. Documents go from MongoDB → embedding model → Qdrant in scheduled batches. It's cheaper and faster.&lt;/p&gt;

&lt;p&gt;When a user asks a question:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question
→ FastAPI gets it
→ Query gets embedded
→ Qdrant finds similar chunks
→ Retrieved docs + question → LLM
→ Answer with sources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The LLM always gets context, which helps a lot with hallucinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Orchestration: Apache Airflow (for DAGs, retries, monitoring)&lt;/li&gt;
&lt;li&gt;Document Store: MongoDB (flexible for messy Wikipedia data)&lt;/li&gt;
&lt;li&gt;Vector Store: Qdrant (fast and open source)&lt;/li&gt;
&lt;li&gt;Backend: FastAPI (quick and clean)&lt;/li&gt;
&lt;li&gt;Frontend: Next.js / Streamlit (Next.js for real use, Streamlit for quick tests)&lt;/li&gt;
&lt;/ul&gt;
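&lt;p&gt;The retrieval step is easier to see with a toy in-memory stand-in for Qdrant: cosine similarity over a few hand-made vectors. The real system uses an embedding model and Qdrant's ANN index, so everything below is illustrative:&lt;/p&gt;

```python
import math

# Cosine similarity between two dense vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Fake "embedded chunks"; in the real pipeline these come from MongoDB
# via the batch embedding job.
chunks = {
    "calculus": [0.9, 0.1, 0.0],
    "optics":   [0.1, 0.9, 0.0],
    "apple":    [0.0, 0.1, 0.9],
}

# Rank stored chunks against the (already embedded) query, return top-k.
def retrieve(query_vec, k=2):
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]),
                    reverse=True)
    return ranked[:k]
```

&lt;p&gt;FastAPI's job is just to embed the incoming question, call this kind of search against Qdrant, and hand the top chunks plus the question to the LLM.&lt;/p&gt;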

&lt;h2&gt;
  
  
  Key Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Batch embedding over real-time embedding.&lt;/strong&gt; Most tutorials embed on the fly. For static data like this, it's wasteful to keep re-embedding the same things. I run batch embedding once, or on a schedule, and save a lot of time and money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Airflow instead of a simple Python script.&lt;/strong&gt; I could have just written one scrape_and_embed.py file. But Airflow gives retries, proper logging, scheduling, and makes each step separate. If Wikipedia is down, it retries automatically. For anything bigger than a toy project, orchestration actually matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separating ingestion from serving.&lt;/strong&gt; The scraping/cleaning part and the answering part are completely separate. Ingestion can break or update without touching the live RAG system. The serving layer just reads from Qdrant.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently Next Time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add a reranker: simple vector search isn't enough, and a reranker would make results much better.&lt;/li&gt;
&lt;li&gt;Build evaluation from the start: without proper eval, you don't know if your changes actually help.&lt;/li&gt;
&lt;li&gt;Add more sources: right now it's only Wikipedia. Academic papers would make it way stronger.&lt;/li&gt;
&lt;li&gt;Try hybrid search: combine vector search with keyword search (BM25).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a simple RAG demo is easy. Building something that actually works properly is much harder. Most of the work is in the boring parts: cleaning data, setting up orchestration, separating concerns, and deciding when to use batch vs real-time.&lt;/p&gt;

&lt;p&gt;Newton LLM showed me that good retrieval matters more than which LLM you use. If your pipeline is solid, even a smaller model gives good answers. If you're building RAG, focus on the data pipeline first, not fancy prompts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiecjz1nego547ylj2qxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiecjz1nego547ylj2qxu.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>python</category>
    </item>
  </channel>
</rss>
