DEV Community: Mobasshir Khan

Mobasshir Khan — Thu, 18 Jun 2026 18:42:33 +0000

Most Repos Look Fine. Until They Don’t.

#devops #automation #software #productivity

Mobasshir Khan — Thu, 18 Jun 2026 16:48:07 +0000

You’ve been there.

You clone a repo. The README looks solid. There’s a Dockerfile. Maybe a docker-compose.yml. Everything appears set up.

Then you spend the next three hours chasing a missing config variable, an outdated base image, or a local development assumption that only makes sense if you’ve worked on the project for six months.

No one documents these things properly. They live in team memory, Slack threads, and that one engineer who “just knows.”

That’s the kind of engineering pain nobody tracks, and everybody absorbs.

The Hidden Tax on Every Team

Let’s be honest about what’s really happening.

Most engineering teams have gotten good at the visible stuff: code reviews, test coverage, deployment pipelines. But there’s a layer below all of that, repository readiness, that almost nobody validates with the same rigor.

Can a new developer clone this repo and actually run it?

Does the Docker setup reflect how the project actually works today?

Has the repo drifted from the workflow the team thinks it has?

Is there tribal knowledge baked into the setup that stays invisible until something breaks?

These aren’t dramatic problems. That’s exactly why they survive so long.

The cost doesn’t show up in a postmortem. It shows up as:

onboarding that takes days instead of hours
“works on my machine” becoming a running joke
CI failures no one can explain right away
setup bugs being rediscovered by every new hire

None of that gets assigned a ticket. It just quietly eats time.

Why I Built `dockgate`

I got tired of watching this happen.

Not just to me, but to every developer stuck in the gap between “the repo exists” and “the repo actually works.” That gap has a real cost, and most of it is preventable.

So I built dockgate, a CLI that sits squarely in that gap and does one thing well:

It tells you whether a repository is actually ready to run, maintain, and trust.

Not whether the code is clean.

Not whether the tests pass.

Whether the operational layer (the Docker setup, project conventions, and environment assumptions) reflects reality.

What `dockgate` Actually Does

1. It detects what kind of project it’s dealing with

A Node.js repo, a Python repo, and a multi-service project should not be evaluated against identical expectations. dockgate starts with project detection so its checks stay relevant instead of noisy.
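To make the idea of project detection concrete, here is a minimal sketch in Python. It is not dockgate's actual implementation (dockgate is an npm package, and its detection logic isn't shown in this post). The marker files, the `MARKERS` table, and the `detect_project` function are all assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical marker files per project kind (illustrative, not dockgate's real list).
MARKERS = {
    "node": ["package.json"],
    "python": ["pyproject.toml", "requirements.txt", "setup.py"],
}

# Assumption: a compose file hints that the repo runs more than one service.
COMPOSE_FILES = ("docker-compose.yml", "compose.yaml")


def detect_project(repo: Path) -> list[str]:
    """Return the project kinds whose marker files exist at the repo root."""
    kinds = [
        kind
        for kind, files in MARKERS.items()
        if any((repo / name).is_file() for name in files)
    ]
    if any((repo / name).is_file() for name in COMPOSE_FILES):
        kinds.append("multi-service")
    return kinds or ["unknown"]
```

Detecting the kind first is what lets later checks stay relevant: a Node-only rule never has to run against a Python repo.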

2. It uses a rules engine, not guesswork

This is where the tool stops being a script and starts becoming infrastructure.

Instead of scanning for random files and printing generic advice, dockgate uses a structured rule catalog. That makes evaluations more consistent, repeatable, and extensible.

You can use it for:

onboarding triage
repository audits
drift detection over time
validating Docker setup before handoff

Rules evolve as standards evolve. That’s where the leverage comes from.
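A structured rule catalog can be pictured roughly like this. This is a sketch under stated assumptions, not dockgate's real rule format: the `Rule` fields, the rule IDs, and the two example rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    applies_to: frozenset          # project kinds this rule is relevant for
    message: str                   # shown when the check fails
    check: Callable[[set], bool]   # receives the set of file names at the repo root


# Hypothetical catalog entries; a real catalog would be larger and versioned.
CATALOG = [
    Rule("docker.dockerfile-present", frozenset({"node", "python"}),
         "No Dockerfile found", lambda files: "Dockerfile" in files),
    Rule("node.lockfile-present", frozenset({"node"}),
         "No package-lock.json found", lambda files: "package-lock.json" in files),
]


def evaluate(kind: str, files: set) -> list[str]:
    """Run only the rules relevant to this project kind and return the failures."""
    return [
        f"{rule.rule_id}: {rule.message}"
        for rule in CATALOG
        if kind in rule.applies_to and not rule.check(files)
    ]
```

Because every finding carries a stable rule ID, two runs months apart can be compared rule by rule, which is what makes drift detection possible.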

3. It doesn’t just diagnose, it points forward

A lot of tools are good at listing what’s wrong. Fewer are designed to help you move toward a better state.

dockgate includes setup-oriented workflows so it can be part of actual remediation, not just diagnosis.

4. It fits how developers already work

It’s a CLI. It runs in the terminal. It works with hooks, audit scripts, and existing shell workflows.

That matters.

Good developer tools don’t ask people to change how they work just to get value.

What Makes It Different

There’s no shortage of linting tools, repo templates, and CI validators out there.

But dockgate focuses on something most of them skip: the setup layer. More specifically, the distance between “this repo exists” and “this repo is actually ready.”

That difference shows up in:

Docker support that works on paper but breaks in practice
README instructions that were accurate six months ago
environment assumptions only one team member still understands
local setups that quietly diverge from production

When the setup layer is unclear, the team pays for it every time someone new joins, every time a CI assumption breaks, and every time somebody has to reverse-engineer how the project is supposed to run.

dockgate makes that invisible layer visible.

On Shipping It Like a Real Tool

One thing I felt strongly about from the start was this:

there’s a huge graveyard of useful scripts that never became useful tools because they never crossed the gap between “works on my machine” and “someone else can install and trust this.”

I wanted dockgate to cross that gap deliberately.

That meant doing the less glamorous work too:

npm package support
PyPI wrapper support for Python environments
a changelog
a release checklist
a proper license
regression fixtures
a GitHub Actions publishing workflow

It also meant treating mixed-language teams as real users. dockgate is fundamentally an npm package, but developer teams rarely live in a single ecosystem. A PyPI wrapper lowers friction, and in developer tooling, accessibility often decides whether something gets tried at all.

The Lesson I Didn’t Expect

Building this reinforced something I keep coming back to:

some of the highest-value engineering work lives in problems people dismiss as small.

Setup friction looks small.

Repository drift looks small.

Docker inconsistency looks small.

Until it’s slowing down every sprint and nobody can fully explain why.

That’s the thing about infrastructure drag: it rarely announces itself. It accumulates. And it is always cheaper to catch early.

What’s Next

Right now, dockgate focuses on repository readiness and Docker validation. But the direction is bigger than that.

I can see it growing into:

stronger standards-driven validation
better drift detection over time
richer project profiles and baselines
more actionable remediation workflows

The foundation is rules-driven and extensible, which means it can grow with the teams that use it.

One Last Thought

A repository is not just a folder of code.

It’s an operational interface for every developer who touches it. When that interface is confusing, fragile, or full of hidden assumptions, the team pays for it whether they acknowledge it or not.

“It looks fine” is one of the most expensive things a repository can say.

dockgate is built to stop teams from taking that at face value.

Try dockgate on npm

I Rebuilt My RAG Pipeline From Scratch. Here's What Actually Made It Better.

Mobasshir Khan — Mon, 15 Jun 2026 10:40:11 +0000

Spoiler: it wasn't a bigger model, a better embedding, or a longer prompt.

When I first built my RAG pipeline, it was about as simple as it gets.

Take a topic, fetch some relevant text, feed it to the model, generate an answer. Classic RAG. The kind of thing every tutorial walks you through in twenty minutes.

And honestly? It worked. For a while.

But I wasn't building a generic Q&A bot. I was building a debate learning system, and that's where the cracks started showing fast.

The output felt generic. The sources didn't match what each part of the lesson actually needed. And the system had no idea that "background knowledge," "debate framing," "rebuttal material," and "vocabulary support" are not the same thing.

It was retrieving text. It just wasn't understanding what kind of help each piece of the pipeline needed.

So I tore it apart and rebuilt it.

What came out the other side wasn't just "better RAG." It was a layered retrieval architecture that plans its own queries, routes them by intent, ranks and packs context intelligently, remembers what worked before, and evaluates itself against real lessons.

Here's how I got there, and what I'd tell anyone trying to do the same.

The Problem With "Embed → Retrieve → Pray"

Most basic RAG systems follow the same formula:

embed the query → retrieve top chunks → pass them to the model → hope the answer improves

For simple document Q&A, this is fine. It's even good.

But for debate learning, it falls apart, because different parts of the output need fundamentally different kinds of evidence:

Pre-knowledge needs definitions and background
Debate building needs mechanisms, clashes, and framing
Vocabulary needs words that actually fit the topic
Language support needs words that reinforce the lesson itself
Coaching needs structured, distilled explanations

A single chunk-search pipeline has no concept of any of this. It doesn't know the difference between a strong debate example and a generic background article. The result is noisy retrieval, repetitive patterns, and weaker final output, no matter how good your embeddings are.

The Real Shift Wasn't Technical. It Was Mental.

Here's the thing that changed everything for me, and it had nothing to do with code at first.

I stopped thinking of retrieval as "find text" and started thinking of it as "make decisions about evidence."

That single reframe pushed me toward a completely different architecture:

topic → plan → route → preselect → retrieve → rerank → pack → teach → evaluate

Suddenly retrieval wasn't a single step anymore. It was a pipeline of decisions, each one with a clear job.

Here's the high-level flow I ended up with:

flowchart LR
    A["Topic / Node Intent"] --> B["Query Planner"]
    B --> C["Intent Router"]
    C --> D["Document Preselection"]
    D --> E["Chunk Retrieval"]
    E --> F["Reranking"]
    F --> G["Context Packing"]
    G --> H["Structured Evidence Lanes"]
    H --> I["Downstream Lesson Nodes"]
    I --> J["Trace Logging / Memory"]
    J --> K["Real-Trace Evaluation"]

Let's break down what each of these pieces actually does, and why it mattered.

1. Query Planning, Per Node

Not every node in the pipeline needs the same kind of evidence. A pre-knowledge node and an argument-generation node should never be searching the same way.

So I added a query planner that expands the raw topic into something far more useful for retrieval. Instead of handing the system a bare topic like "feminism" or "international relations," the planner turns it into structured search intent: expanded terms, subqueries, and source preferences tailored to the node that's asking.

def build_query_plan(node_name: str, topic: str) -> dict:
    return {
        "node": node_name,
        "topic": topic,
        "expanded_terms": [...],
        "subqueries": [...],
        "source_preferences": [...],
    }

This alone improved recall noticeably. The system finally started asking the right question before it even searched.

2. Intent-Aware Routing

Once the query is planned, the router decides what type of evidence the node actually needs: definitions, mechanisms, examples, clash material, style cues, or vocabulary support.

This sounds small, but it fixes one of the most common mistakes in basic RAG: treating every retrieval need as identical. They're not, and pretending otherwise is where a lot of quality gets lost.

Routing also made the whole system more inspectable. Every retrieval path now has a clear, debuggable purpose, which made every later improvement easier to reason about.
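As a rough illustration, the simplest form of intent routing is an explicit table from node to evidence types. The node names and the mapping below are assumptions for the sketch; the real router may well be more dynamic than a static lookup.

```python
# Hypothetical node names and evidence types, loosely based on the lists in this post.
ROUTES = {
    "pre_knowledge": ["definitions", "examples"],
    "debate_building": ["mechanisms", "clash", "examples"],
    "vocabulary": ["vocabulary"],
    "coaching": ["style", "mechanisms"],
}


def route(node_name: str) -> list[str]:
    """Return the evidence types a node should retrieve, in priority order."""
    try:
        return ROUTES[node_name]
    except KeyError:
        # Failing loudly keeps every retrieval path explicit and debuggable.
        raise ValueError(f"no route defined for node {node_name!r}") from None
```

Even a table this small makes the system inspectable: when a node gets the wrong evidence, there is exactly one place to look.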

3. Hierarchical Retrieval (Stop Searching Everything, Every Time)

This was one of the biggest quality jumps in the whole rebuild.

Instead of searching every chunk across the entire corpus for every query, the system now first identifies the most likely documents, and only then searches inside them.

before:  topic → search all chunks
after:   topic → choose good documents → search chunks inside them

The difference is bigger than it sounds. Retrieval no longer wastes effort scanning irrelevant corners of the corpus, and chunk search starts from a much stronger position every single time.
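Here is a minimal two-stage sketch of that idea. To stay self-contained it swaps real embeddings for bag-of-words cosine similarity, so treat the scoring as a stand-in; the function names and the `corpus` shape (document title mapped to a list of chunks) are assumptions.

```python
import math
from collections import Counter


def _vec(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: dict[str, list[str]],
             top_docs: int = 1, top_chunks: int = 2) -> list[tuple[str, str]]:
    q = _vec(query)
    # Stage 1: rank whole documents (title plus all of their chunks).
    docs = sorted(
        corpus,
        key=lambda d: cosine(q, _vec(d + " " + " ".join(corpus[d]))),
        reverse=True,
    )[:top_docs]
    # Stage 2: search chunks only inside the documents that survived stage 1.
    candidates = [(d, chunk) for d in docs for chunk in corpus[d]]
    return sorted(candidates, key=lambda dc: cosine(q, _vec(dc[1])), reverse=True)[:top_chunks]
```

Note that a chunk mentioning "tax" inside an unrelated document never reaches stage 2, which is exactly the noise that flat chunk search lets through.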

4. Reranking and Context Packing

Relevant isn't the same as useful.

I added a reranking stage that reorders retrieved evidence based on usefulness, source class, and role, then a context packer that arranges the final selection in a way the model can actually reason with.

This is easy to underestimate, but a good retrieval system isn't just about finding the right information. It's about presenting it in a shape the model can use well.
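The two stages can be sketched as follows. The `Evidence` fields, the source weights, and the role bonus are hypothetical numbers picked for illustration, and the packer is a simple greedy one under a character budget rather than whatever the real pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    relevance: float    # similarity score from the retrieval stage
    source_class: str   # e.g. "curated" or "web"
    role: str           # e.g. "definition" or "mechanism"


# Hypothetical weights: curated sources count for more than open-web ones.
SOURCE_WEIGHT = {"curated": 1.0, "web": 0.6}


def rerank(items: list[Evidence], wanted_role: str) -> list[Evidence]:
    """Order evidence by usefulness for this node, not by raw relevance alone."""
    def usefulness(e: Evidence) -> float:
        role_bonus = 0.2 if e.role == wanted_role else 0.0
        return e.relevance * SOURCE_WEIGHT.get(e.source_class, 0.5) + role_bonus
    return sorted(items, key=usefulness, reverse=True)


def pack(items: list[Evidence], max_chars: int) -> str:
    """Greedy packing: keep ranked order, label each block, skip what doesn't fit."""
    packed, used = [], 0
    for e in items:
        block = f"[{e.role}] {e.text}"
        if used + len(block) <= max_chars:   # budget ignores the newline separators
            packed.append(block)
            used += len(block)
    return "\n".join(packed)
```

In this sketch a highly "relevant" but generic web snippet can lose to a slightly less similar curated passage with the right role, which is the relevant-versus-useful distinction in practice.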

5. Structured Evidence Lanes

Instead of dumping every retrieved chunk into one giant context block, the system now separates evidence into distinct lanes:

Definitions
Mechanisms
Examples
Debate framing
Vocabulary
Style and coaching notes

raw evidence → structured evidence lanes → better lesson output

Each downstream node reads only the lane it actually needs, instead of wading through everything else. The result is output that feels noticeably more coherent and "crafted," rather than a wall of loosely related text.
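A minimal sketch of lane separation, assuming each piece of evidence already carries a lane label from the router or reranker. The lane identifiers are slugged versions of the list above; the function names are hypothetical.

```python
from collections import defaultdict

LANES = (
    "definitions", "mechanisms", "examples",
    "debate_framing", "vocabulary", "style_coaching",
)


def build_lanes(evidence: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (lane, text) pairs into lanes, rejecting labels outside the known set."""
    lanes: dict[str, list[str]] = defaultdict(list)
    for lane, text in evidence:
        if lane not in LANES:
            raise ValueError(f"unknown lane {lane!r}")
        lanes[lane].append(text)
    return dict(lanes)


def context_for(lane: str, lanes: dict[str, list[str]]) -> list[str]:
    """A downstream node reads only its own lane; an empty lane yields an empty list."""
    return lanes.get(lane, [])
```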

6. Smarter YouTube Ingestion

A lot of the best debate and explanation content lives on YouTube, but blindly ingesting every transcript is a recipe for noise.

So ingestion now:

Scans channel inventory
Caches metadata locally
Scores relevance from title and description
Uses thumbnail text as a ranking signal
Fetches transcripts only for videos worth fetching

Channels often hide their best material in the title, thumbnail, or a short description rather than the transcript itself. Making ingestion selective meant the pipeline got both faster and more useful at the same time, which doesn't happen often.
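The selection step can be sketched without touching any YouTube API, working only on cached metadata. The field names, the weights, and the threshold below are assumptions, and plain term overlap stands in for whatever relevance scoring the real ingestion uses.

```python
# Hypothetical weights: a title hit counts more than a thumbnail or description hit.
FIELD_WEIGHTS = {"title": 3.0, "thumbnail_text": 2.0, "description": 1.0}


def relevance(video: dict, topic_terms: set[str]) -> float:
    """Weighted term overlap between the topic and the video's cached metadata."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = set(video.get(field, "").lower().split())
        score += weight * len(words & topic_terms)
    return score


def videos_worth_fetching(inventory: list[dict], topic_terms: set[str],
                          threshold: float = 3.0) -> list[str]:
    """Return IDs of videos whose metadata clears the bar; only these get a transcript fetch."""
    return [v["id"] for v in inventory if relevance(v, topic_terms) >= threshold]
```

The expensive call (the transcript fetch) only happens for videos that pass this cheap gate, which is where the speed gain comes from.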

7. Retrieval Memory

This is the upgrade that turned the system from a one-off lookup tool into something that actually improves over time.

The pipeline now remembers which lessons, sources, and retrieval choices worked well in the past, and reuses those patterns instead of repeating the same weak sources again and again.

Of everything I built, this is the idea I'd point to as the most quietly powerful.
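One simple way to picture retrieval memory is a per-source success rate that nudges future scores. This is a sketch, not the actual design: how a retrieval gets judged as having "worked" is left to the caller, and the class name, smoothing, and adjustment formula are all assumptions.

```python
class RetrievalMemory:
    """Tracks how often each source was used and how often that use worked out."""

    def __init__(self) -> None:
        self._stats: dict[str, list[int]] = {}   # source -> [uses, successes]

    def record(self, source: str, worked: bool) -> None:
        stats = self._stats.setdefault(source, [0, 0])
        stats[0] += 1
        stats[1] += int(worked)

    def prior(self, source: str) -> float:
        """Smoothed success rate; a source never seen before gets a neutral 0.5."""
        uses, wins = self._stats.get(source, (0, 0))
        return (wins + 1) / (uses + 2)

    def adjust(self, source: str, base_score: float) -> float:
        """Scale a retrieval score up for proven sources and down for weak ones."""
        return base_score * (0.5 + self.prior(source))
```

The smoothing matters: a source with a single bad outcome is demoted only slightly, so one unlucky lesson doesn't blacklist it.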

8. Evaluation Against Real Lessons

I didn't want to rely on "it feels more advanced now," so I added evaluation: both synthetic retrieval checks and real trace-based scoring from saved lessons.

That gave me a way to actually measure:

Whether the right evidence lanes are being populated
Whether the retrieval plan makes sense for the node
Whether real lessons are improving over time
Whether source reuse is getting smarter or going stale

- Does the lesson contain the right evidence lanes?
- Are the best sources showing up consistently?
- Is the output more specific than before?
- Are we repeating weak material too often?

Without this step, it's incredibly easy to convince yourself a system is better just because it looks more sophisticated. Evaluation is what turned "I think this is better" into "I can show you it's better."
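Two of those questions translate directly into small metrics over saved traces. The trace shape used here (a dict with `lanes` and `sources`) is an assumption for the sketch, as are the function names.

```python
def lane_coverage(trace: dict, required_lanes: list[str]) -> float:
    """Fraction of the required evidence lanes that contain at least one item."""
    if not required_lanes:
        return 1.0
    filled = [lane for lane in required_lanes if trace["lanes"].get(lane)]
    return len(filled) / len(required_lanes)


def source_repeat_rate(traces: list[dict]) -> float:
    """Share of source uses that repeat a source already used in an earlier trace."""
    seen: set[str] = set()
    repeats = total = 0
    for trace in traces:
        for source in trace["sources"]:
            total += 1
            if source in seen:
                repeats += 1
        seen.update(trace["sources"])
    return repeats / total if total else 0.0
```

Tracked over time, a falling coverage number or a climbing repeat rate is a concrete signal, where "it feels more advanced" is not.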

What Actually Changed

After all of this, the difference wasn't subtle:

Better relevance — retrieval results now match each node's actual purpose, not a generic average.

Better structure — output is split into useful, labeled sections instead of one long blob of text.

Better efficiency — far less effort wasted on low-value sources and weak chunks.

Better debate usefulness — the system surfaces material that actually helps build arguments, not just background reading.

Better consistency — because the architecture is layered, I can tune one piece without breaking everything else.

Better learning value — the final output now contains pre-knowledge, debate angles, vocabulary, and coaching as distinct, usable layers.

The Architecture, in One Line

topic → planner → router → document selection → chunk retrieval → rerank → pack → evidence lanes → lesson output → trace evaluation

Every step in that line exists for a reason. Not because it looks advanced on a diagram, but because debate learning was never a single retrieval problem to begin with. It's a retrieval orchestration problem.

Why This Matters Specifically for Debate Learning

A good debate learner doesn't just need articles. It needs background, concepts, argument structure, rebuttal logic, examples, vocabulary, coaching, repetition control, and topic freshness, all at once, all tailored to the same lesson.

That's a much richer problem than "answer the question." Advanced RAG gave me a way to support all of those layers without the output collapsing into chaos.

It also opened the door to real personalization:

Vocabulary pulled from the lesson itself
Debate framing tied directly to the chosen article
Pre-knowledge sourced from curated material
Coaching grounded in the evidence that was actually retrieved

That's what makes the difference between a tool that answers and a tool that actually teaches.

The Biggest Lesson

If there's one thing I'd want someone to take from all of this, it's this:

Advanced RAG isn't about adding more model calls. It's about making retrieval more deliberate.

It's tempting to think RAG quality comes from bigger embeddings, bigger models, or just more text in the context window. Those things can help. They're not the main answer.

The real gains came from:

Planning better queries
Routing by intent
Selecting better documents first
Reranking evidence
Packing context intelligently
Separating source roles
Remembering what worked
Evaluating on real traces

That's the list that actually made the system feel "advanced," not the model behind it.

If You're Building a Retrieval Pipeline

Don't stop at chunk search. That's the beginning, not the destination.

Once you start thinking in terms of intent, routing, document-level selection, evidence role separation, memory, and evaluation, RAG stops being a clever prompt trick and starts becoming a real system, one that can be debugged, tuned, and trusted.

In my case, that shift was big enough to materially change both efficiency and output quality. If you're stuck wondering why your RAG pipeline "works but feels generic," there's a good chance the answer isn't your model.

It's your architecture.

If this was useful, I'd genuinely love to hear what you've tried in your own RAG pipelines, especially if you've found a different lever that moved the needle for you. Drop a comment, I read every one.

I built an npm package that makes AI follow my own architecture docs

Mobasshir Khan — Tue, 09 Jun 2026 19:50:19 +0000

I keep writing architecture docs for my projects. I keep letting AI write code that ignores them. Then I commit that code anyway, because I'm in flow, and the diff looks fine on the surface.
A week later something breaks in a way that would never have happened if I'd just followed my own rules.

So I built DocGuard — a small npm package that reads your markdown docs, looks at your staged code, and flags anything that contradicts what you've written. It quotes the exact doc line it found the violation against. No SaaS, no telemetry, runs locally, free.

Here's the moment that convinced me to ship it.

The catch

I'm building a small React app called Today's Price, which fetches commodity, crypto, and metal prices. Standard layered architecture: components → hooks → services → external APIs. The rules I'd written in docs/architecture.md:

All external API calls MUST go through a file in src/services/.
Components and hooks must never call fetch or any HTTP client directly.
Components are presentational only — they must never call services directly.
A component that imports anything from src/services/ is a violation.
Hard-coding any string that looks like an API key in source is forbidden.

I asked Claude to "add a news ticker widget for the homepage." It wrote a clean-looking NewsTicker.jsx that compiled, rendered, and worked:

useEffect(() => {
  async function load() {
    try {
      const res = await fetch(
        "https://newsapi.org/v2/top-headlines?category=business&apiKey=pk_live_a9b8c7d6e5f4g3h2"
      );
      const data = await res.json();
      setHeadlines(data.articles || []);
    } catch (err) {
      console.error("news fetch failed", err);
    }
  }
  load();
}, []);

Three rules violated in twenty lines. Component calls fetch directly. No service layer. Hard-coded API key in source. Try/catch in the wrong layer. The code worked. I would have committed it.

The pre-commit hook fired during git commit:

WARN src/components/NewsTicker.jsx:14 [api-contract]
Component makes a direct API call
Cited: docs/architecture.md:6
"All external API calls MUST go through a file in src/services/"

WARN src/components/NewsTicker.jsx:14 [api-contract]
Component makes a direct API call
Cited: docs/architecture.md:12
"Components are presentational only — they must never call services directly"

Summary: 0 error(s), 2 warning(s).

Two warnings, each pointing at the exact doc line I had violated, with the rule quoted verbatim from my own file.

The commit went through — I had api-contract set to warn, not block. But that's the point: the warnings caught my eye in a way that a passing build never would have. I deleted the file and re-prompted the AI: "use the existing service pattern." That second version was correct.

If you want hard enforcement, one config tweak in .docguard.json turns warnings into blocks per category. I keep most categories at warn because trust is earned, not assumed.

Why existing tools don't catch this

I looked before I built. There's a lot in this space, but none of it solves exactly this:

ESLint, Prettier, lint-staged. Generic code quality. They don't know what my docs say.
commitlint. Validates commit messages. Doesn't look at code.
CodeRabbit, Greptile, Bito. AI code reviewers — but they review after the PR is open. By then the code is committed, pushed, and in someone's review queue. I want it stopped before it enters git history.
Cursor rules / .cursorrules. IDE-side hints. The IDE uses them when generating; nothing enforces them at commit time.

The gap: a thing that sits between the AI's output and git commit, knows my docs, and refuses to let through code that contradicts them.

How DocGuard works

1. Pre-commit hook fires on every git commit.
2. Reads the staged diff. Only the hunks you're about to commit. Binary files and lockfiles skipped.
3. Loads your markdown docs. Configurable glob, defaults to ./docs/**/*.md.
4. Chunks them by heading, ≤1500 chars per chunk, with line metadata.
5. Embeds chunks locally with MiniLM via @xenova/transformers. Cached to .docguard/cache/. Model downloads once (~25MB), no per-commit API cost.
6. Retrieves the relevant chunks for each changed file, using cosine similarity plus a file-path boost.
7. Sends only the top chunks + the diff to Groq (free tier, Llama 3.3 70B). Bring your own key, no SaaS.
8. Validates the response. Either passes, warns, or blocks based on per-category severity.

That validation step is what makes the tool not annoying.
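The chunking step (by heading, at most 1500 characters, with line metadata) can be sketched like this. DocGuard itself is an npm package, so this Python version only illustrates the technique and is not its code; the function name and the chunk fields are assumptions, and the heading check is deliberately naive.

```python
def chunk_markdown(text: str, path: str, max_chars: int = 1500) -> list[dict]:
    """Split markdown into chunks that start at headings and stay under max_chars.

    Simplifications: any line starting with '#' counts as a heading (even inside
    a code fence), and a single line longer than max_chars is kept whole.
    """
    chunks: list[dict] = []
    current: list[str] = []
    start = 1
    for lineno, line in enumerate(text.splitlines(), start=1):
        size_if_added = len("\n".join(current + [line]))
        if current and (line.startswith("#") or size_if_added > max_chars):
            chunks.append({"path": path, "start_line": start, "text": "\n".join(current)})
            current = []
        if not current:
            start = lineno   # remember where this chunk begins in the source file
        current.append(line)
    if current:
        chunks.append({"path": path, "start_line": start, "text": "\n".join(current)})
    return chunks
```

Keeping `start_line` on every chunk is what later allows a finding to be reported as a file:line citation without asking the model for line numbers.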

The thing nobody else does: cite-or-downgrade

The real problem with LLM-based commit checks is that the model will happily invent violations. It'll claim line 47 of your architecture doc says something it doesn't. If you trust that and block the commit, it takes only one false positive for a user to uninstall the tool for good.

DocGuard's defense is a hard contract enforced in code:

Every violation must include a chunk_id that exists in the chunks the model was actually sent.
The quote field must be a literal substring of that chunk's text.
If either check fails (unknown chunk, or quote not actually in the chunk), the violation is automatically downgraded to a warning, regardless of configured severity.
DocGuard computes the cited file:line itself from chunk metadata. The model never gets to invent line numbers.

Look at the hook output above. Each warning quotes my doc exactly. That's not a coincidence: it's the only way the citation guard would let the finding be reported at all. Anything paraphrased gets downgraded with a (downgraded: quote not found inside cited chunk) note.

Blocking and even warning-level violations only happen when there's a literal, quotable basis in your docs.
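The contract above is small enough to sketch in a few lines. This is an illustration in Python, not DocGuard's source: the `chunk_id` and `quote` fields and the downgrade note come from the description above, while the function name, the other field names, and the wording of the unknown-chunk note are assumptions.

```python
def guard_citation(violation: dict, sent_chunks: dict, configured: str) -> dict:
    """Keep the configured severity only when the citation is literally verifiable.

    sent_chunks maps chunk_id -> {"path", "start_line", "text"} for the chunks
    that were actually sent to the model.
    """
    chunk = sent_chunks.get(violation.get("chunk_id"))
    if chunk is None:
        return {**violation, "severity": "warn", "note": "downgraded: unknown chunk"}

    quote = violation.get("quote", "")
    offset = chunk["text"].find(quote) if quote else -1
    if offset < 0:
        return {**violation, "severity": "warn",
                "note": "downgraded: quote not found inside cited chunk"}

    # The line number is derived from chunk metadata, never taken from the model.
    line = chunk["start_line"] + chunk["text"].count("\n", 0, offset)
    return {**violation, "severity": configured, "cited": f"{chunk['path']}:{line}"}
```

The useful property is that the check is plain string matching: a hallucinated quote cannot pass it, no matter how confident the model's wording is.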

npm install --save-dev @mobasshirkhan/docguard
npx docguard init

Drop your Groq API key into .env:

GROQ_API_KEY=gsk_...

DocGuard reads .env from the repo root automatically. Get a free key at console.groq.com.

init writes .docguard.json, installs the hook (Husky-aware), pre-warms the embedding model, and adds the cache paths to .gitignore. Next time you git commit, the hook runs.

The config is short. Categories are fixed (security, architecture, api-contract, naming, style) and severity is per-category, not per-rule. No DSL, no rule authoring — your docs are the rules.

No key? Network down? DocGuard falls through to "no semantic check, commit allowed." It's built to never block on its own failures — only on real, cited violations.

Uninstall is one command and leaves no residue:
npx docguard uninstall

Why I wrote this

I'm spending more and more time letting AI write code. The code is fine. The architecture isn't.

The honest fix isn't "use AI less." It's: write your rules down where the AI can see them, and have something that checks the AI's output against those rules before it enters the repo.

That's all DocGuard is. A small, opinionated, local thing that closes one specific gap that nothing else closes.

It's open source, MIT, on npm:
(https://www.npmjs.com/package/@mobasshirkhan/docguard/v/0.1.1-beta.2)
