Marcin Dudek

Posted on • Originally published at marcindudek.dev

A Local LLM Reads My Prompt Before Claude Does

Originally published on https://marcindudek.dev/blog/claude-code-prompt-gate-local-llm/

TL;DR

  • The weak link in AI coding is usually the half-formed prompt I type at the end of a long day. The model can only work with what I hand it.
  • A hook on UserPromptSubmit fires the instant I hit enter, before Claude reads a single word. It hands short prompts to a small model running on my own machine, which answers one question: is this clear enough to act on?
  • It runs locally on purpose. UserPromptSubmit blocks the session for up to 30 seconds while it works, so a cloud round-trip on every short message would be unbearable. A small on-device model answers in about a second, costs nothing per call, and the prompt never leaves the laptop.
  • The gate flags, it does not block. A vague prompt gets a one-line note stapled alongside it through additionalContext; my words still reach Claude untouched. I would never let a vagueness score erase something I typed.
  • If the local model is slow or down, the hook exits with a non-blocking error and the prompt goes through ungated. I lose the check, never the message.

I sat down to write this post, the one about the hook that reads my prompts, and typed: "ok, time to write post #4 of the series about". Then I trailed off and hit enter anyway. Before Claude saw a single word, a small model running on my own laptop read that prompt and flagged it: vague, missing context, says nothing about which series or what the post should actually cover. It was right. I had run out of sentence at the end of a long day, and the cheapest model in my stack caught it before the expensive one burned a turn guessing what I meant.

This is part 4 of my series on Claude Code hooks. Part 1 was about context rot and the handoff that resets a session before it decays. Part 2 was guardrails that catch a dangerous command before it runs. Part 3 was quality gates that won't let the agent say "done" without proof. Those all sit downstream of the model. This one sits upstream of it, on the one part of the whole pipeline I am personally responsible for: the prompt.

The weak link is the thing I type

It's tempting to blame the model when a session goes sideways. Most of the time the model did exactly what I asked. The problem is that I asked badly. At 11pm I write three terse words, leave out the file, the constraint, and the goal, and then expect the agent to read my mind. It can't, so it guesses, and a confident guess off a thin prompt is how you get forty minutes of work pointed at the wrong target.

I tried fixing this with a line in the system prompt, the same way I fix everything first: "if a request is unclear, ask before you act." It helps a little. It's also the first instruction to evaporate the moment the model gets a whiff of a plausible interpretation. A prompt is a request the model is free to ignore. So I stopped asking the model to police my input and moved the check to a place the model doesn't control: the moment between hitting enter and the model reading anything.

A hook that runs before the model reads a word

The UserPromptSubmit hook fires when I submit a message and before the model sees it. Claude Code hands it a small JSON blob on stdin with the prompt right there in it:

{
  "hook_event_name": "UserPromptSubmit",
  "prompt": "ok, time to write post #4 of the series about",
  "cwd": "/path/to/project"
}

From here the hook has a choice. It can block the prompt outright, which erases it from context entirely, or it can let the prompt through and staple extra context onto it. For my own input I almost always want the second one. I don't want to stop myself from typing. I want the agent to know when my prompt is thin before it commits to it.
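A minimal sketch of reading that payload, assuming the hook is a Python script and the blob arrives as one JSON document on stdin. The function name `extract_prompt` is hypothetical; only the `prompt` field comes from the payload shown above.

```python
import json


def extract_prompt(raw_payload: str) -> str:
    """Return the submitted prompt from the hook payload, or "" if it is absent."""
    payload = json.loads(raw_payload)  # raises json.JSONDecodeError on malformed input
    if not isinstance(payload, dict):
        return ""
    prompt = payload.get("prompt", "")
    # Guard against a non-string value so later length checks are safe.
    return prompt if isinstance(prompt, str) else ""
```

In a real hook the argument would be `sys.stdin.read()`. Returning an empty string for a missing field keeps the later steps simple: nothing to check, so the prompt passes through.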

A small local model reads the prompt first

Here is the part people raise an eyebrow at. For short prompts (under about a hundred characters, which are reliably the half-baked ones), the hook hands my text to a separate, much smaller language model: a quantized Qwen running entirely on the laptop through Apple's MLX. It gets one job: read the prompt and decide whether it's clear enough to act on. If it isn't, it writes a single line saying what's missing.
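The length gate and the single question could look roughly like this. It is a sketch under assumptions: the 100-character cutoff mirrors the "about a hundred characters" above, while the question wording and the CLEAR / VAGUE reply format are invented here for illustration and are not the post's actual prompt.

```python
from typing import Optional

SHORT_PROMPT_LIMIT = 100  # characters; assumed from "about a hundred"


def needs_check(prompt: str, limit: int = SHORT_PROMPT_LIMIT) -> bool:
    """Only short, non-empty prompts are worth sending to the local model."""
    text = prompt.strip()
    return 0 < len(text) < limit


def build_question(prompt: str) -> str:
    """Wrap the user's prompt in the one question the small model answers."""
    return (
        "Is the following request clear enough to act on? "
        "Reply CLEAR, or reply VAGUE: followed by one line saying what is missing.\n\n"
        f"Request: {prompt.strip()}"
    )


def parse_verdict(reply: str) -> Optional[str]:
    """Return the one-line note for a vague prompt, or None if it was judged clear."""
    lines = reply.strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    if first_line.upper().startswith("VAGUE"):
        note = first_line[len("VAGUE"):].lstrip(": ").strip()
        return note or "the prompt looks vague"
    # Anything else, including an empty or odd reply, is treated as clear.
    return None
```

Treating an unexpected reply as "clear" is a deliberate choice in this sketch: a confused small model should cost a missed flag, not a spurious one.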

It has to run locally, and the reason is right there in the docs. UserPromptSubmit blocks model processing while the hook runs, so it ships with a 30-second timeout instead of the usual ten minutes. Now imagine putting a cloud API call in that path. Every short message I send would stall behind a network round-trip and a queue, and I would spend my day watching a spinner. A small model on-device answers in roughly a second, costs nothing per call, works on a plane, and the prompt never leaves my machine. The latency budget alone forces this layer to be local.
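One way to keep the check inside that budget is to time-box the model call. The sketch below assumes the local model is reachable as a command-line program that reads the question on stdin and prints its answer; that wrapper command, the function name `ask_with_deadline`, and the five-second default are all assumptions, not details from the post.

```python
import subprocess
from typing import Optional, Sequence


def ask_with_deadline(command: Sequence[str], question: str, seconds: float = 5.0) -> Optional[str]:
    """Run a local model command with `question` on stdin.

    Returns the command's stdout, or None if it could not start,
    exited with an error, or ran past the deadline.
    """
    try:
        result = subprocess.run(
            list(command),
            input=question,
            capture_output=True,
            text=True,
            timeout=seconds,  # subprocess.run kills the child if it overruns
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

Setting the deadline well under the hook's own timeout leaves room to exit cleanly instead of being cut off mid-call.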

Flag, don't block

The mechanism is the same one I leaned on in part 3: additionalContext. On a clean exit, the hook prints a small JSON object and whatever sits in additionalContext rides into the conversation right alongside my prompt:

{
  "hookSpecificOutput": {
    "hookEventName": "UserPromptSubmit",
    "additionalContext": "Heads up: this prompt is short and light on specifics (no target, no goal). Confirm scope before you run."
  }
}

There is a more aggressive option, and I deliberately don't use it. Exiting with code 2 blocks the prompt and erases it from context completely. That's the right call for a genuinely dangerous prompt. It's the wrong call for a vague one, because a small model's opinion that my prompt "seems unclear" isn't worth deleting something I just typed. A false positive there would be infuriating: you write a message, the gate disagrees with its phrasing, and your words vanish. So the gate never blocks my input. It flags. Claude receives my three sleepy words and the note, reads both, and asks the clarifying question instead of charging off a cliff. Which, for the record, is exactly what happened with this post.
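Producing that JSON object is a few lines. In this sketch the function name `render_hook_output` and the exact wording of the note are illustrative; the key names are the ones from the object shown earlier in this section.

```python
import json
from typing import Optional


def render_hook_output(note: Optional[str]) -> str:
    """Return the text the hook prints on stdout: a JSON flag, or nothing at all."""
    if not note:
        # Clear prompt: print nothing and exit 0, so the prompt passes as typed.
        return ""
    return json.dumps(
        {
            "hookSpecificOutput": {
                "hookEventName": "UserPromptSubmit",
                "additionalContext": f"Heads up: {note} Confirm scope before you run.",
            }
        }
    )
```

The hook would print the returned string and exit 0. Nothing in this path ever touches the prompt itself, which is the point of flagging instead of blocking.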

You submit a short prompt. A local model reads it before Claude does. If the prompt is clear it passes straight to Claude. If it is vague, the hook staples a one-line clarity note onto it as additionalContext, and Claude receives the prompt together with that note.

[Diagram: "You hit enter" (short prompt) → "Local model" (clear enough?). If vague → "Staple a clarity note" (additionalContext), joined to your prompt → "Claude reads it" (your prompt + any note). If clear → straight to "Claude reads it".]

The clarity gate: a vague prompt picks up a one-line note on the way through, a clear one passes straight to Claude. Either way your words reach the model untouched.

What else this prompt layer can do

Once you own that sliver of time between hitting enter and the model reading, flagging vague prompts is just the first thing you can do with it. A few others I run on the same input layer:

  • Keyword to skill. The hook scans the prompt for intent and, when it spots a clear match, pre-loads the right tool or skill so the agent shows up already holding what it needs instead of fumbling for it three messages in.
  • Context that arrives with the prompt. The relevant project facts and past decisions get pulled in and handed to the model along with my message, so the prompt reaches Claude already carrying the context I would otherwise have to retype for the hundredth time.
  • Trimming the empty stall. A small nudge so the agent acts on clear, unambiguous instructions instead of reflexively asking "should I go ahead and do the thing you just asked for?" Real decisions, the ones with a genuine tradeoff, still get a question, because I want to be asked about those. It is only the hollow permission-seeking on an obvious yes that gets trimmed.

A small local model is a cheap, private pre-processor for everything you type. It does not need to be smart. It needs to be fast, free, and right next to the keyboard. Most of its value is in the boring, repetitive judgment you would never bother a frontier model with.
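As a rough illustration of the keyword-to-skill idea from the list above, here is a plain substring matcher that turns hits into a one-line note the hook could pass along as additionalContext. The table contents, skill names, and function names are all hypothetical, and how a skill actually gets pre-loaded in the author's setup is not described in the post.

```python
from typing import Dict, List

# Hypothetical keyword-to-skill table; neither the keywords nor the skill
# names come from the post.
SKILL_HINTS: Dict[str, str] = {
    "migration": "database-migrations",
    "deploy": "release-checklist",
    "blog post": "writing-style",
}


def match_skills(prompt: str, table: Dict[str, str] = SKILL_HINTS) -> List[str]:
    """Return the skills whose keyword appears in the prompt, in table order."""
    lowered = prompt.lower()
    return [skill for keyword, skill in table.items() if keyword in lowered]


def skills_note(prompt: str) -> str:
    """Build a one-line hint for the agent, or "" when nothing matched."""
    skills = match_skills(prompt)
    if not skills:
        return ""
    return "Relevant skills for this request: " + ", ".join(skills) + "."
```

Substring matching is crude, but it fits the "fast, free, and right next to the keyboard" bar: it costs nothing and a wrong hint is only one extra sentence of context.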

Where this stops working

A four-billion-parameter model is no genius, and I don't pretend otherwise. Sometimes it flags a prompt that was perfectly clear, and sometimes it waves through one that was mud. That's survivable precisely because the note is advisory. Claude can read "this looks vague" and reasonably decide it isn't. Nothing was blocked, nothing was erased. A wrong call from the small model costs me a sentence of context. That's the whole price.

The failure mode I care more about is the gate getting in the way of the work. That's what the fail-open behavior is for. If the local model hangs or isn't running, the hook exits with a non-zero code that isn't 2, which Claude Code treats as a non-blocking error: it shows a small notice and processes my prompt normally. The gate going dark never costs me my message. It just quietly stops checking until I notice. For a tool that sits in front of every single thing I type, that asymmetry is the whole design: a missed flag is a shrug, a swallowed prompt would be a real problem.
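The fail-open rule reduces to one decision about the exit code. A small sketch, where the constant names and the `fail_open` wrapper are hypothetical and the meaning of each code follows the description above:

```python
from typing import Callable

EXIT_OK = 0           # prompt proceeds; any JSON on stdout is honored
EXIT_NONBLOCKING = 1  # non-zero but not 2: a notice is shown, prompt proceeds
EXIT_BLOCK = 2        # blocks and erases the prompt; this gate never returns it


def fail_open(gate: Callable[[], None]) -> int:
    """Run the gate and turn any failure into a non-blocking exit code."""
    try:
        gate()
    except Exception:
        # A hung, missing, or crashing model must never cost the message.
        return EXIT_NONBLOCKING
    return EXIT_OK
```

Catching every exception is normally a smell. Here it is the design: whatever goes wrong inside the gate, the only acceptable outcomes are "checked" and "not checked".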

And the honest limit: it can only read what I wrote. It catches the prompt that is vague on its face. It can't catch the prompt that is perfectly clear and perfectly wrong, or the crucial constraint I just forgot to type. Garbage in, garbage out still rules. This just catches the garbage I can see in my own sentence before I make the expensive model eat it.

Build your own

The skeleton is small. Point it at whatever "a good prompt" means for your work:

  1. Hook on UserPromptSubmit. Read the JSON on stdin and pull out the prompt field.
  2. Gate cheaply. Only bother for short or terse prompts, where the payoff is biggest. Let the long, detailed ones sail straight through untouched.
  3. Ask a local model one question: is this clear enough to act on, and if not, what is missing? Keep it on-device so you stay inside the 30-second timeout and your text never leaves the machine.
  4. Flag, don't block. Exit 0 and return the verdict in hookSpecificOutput.additionalContext. Never exit 2 on your own input, because a false positive that erases your prompt is far worse than a prompt that was a little vague.
  5. Fail open. If the model is slow or down, let the prompt through. The gate is a helper standing next to the door. It was never the lock.
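Put together, the five steps fit in one function. This is a sketch, not the author's script: `run_gate` is a hypothetical name, the 100-character cutoff is assumed, and `ask_model` stands in for whatever calls your local model and returns a one-line note for a vague prompt or None for a clear one.

```python
import json
from typing import Callable, Optional, Tuple

SHORT_PROMPT_LIMIT = 100  # assumed cutoff for "short"


def run_gate(raw_payload: str, ask_model: Callable[[str], Optional[str]]) -> Tuple[int, str]:
    """One pass of the prompt gate. Returns (exit_code, text_for_stdout)."""
    try:
        # Step 1: read the JSON payload and pull out the prompt field.
        prompt = str(json.loads(raw_payload).get("prompt", "")).strip()
        # Step 2: gate cheaply; long, detailed prompts pass untouched.
        if not prompt or len(prompt) >= SHORT_PROMPT_LIMIT:
            return 0, ""
        # Step 3: one question to the local model.
        note = ask_model(prompt)
        if not note:
            return 0, ""
        # Step 4: flag, don't block. Exit 0 with the verdict as additionalContext.
        output = {
            "hookSpecificOutput": {
                "hookEventName": "UserPromptSubmit",
                "additionalContext": f"Heads up: this prompt may be unclear ({note}). "
                "Confirm scope before you run.",
            }
        }
        return 0, json.dumps(output)
    except Exception:
        # Step 5: fail open with a non-blocking exit code (non-zero, never 2).
        return 1, ""
```

A script would pass `sys.stdin.read()` as the payload, write the returned text to stdout, and call `sys.exit` with the returned code. Keeping the model behind a plain callable also makes the gate testable without a model running.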

Series: Claude Code hooks (5 parts)

  1. Context rot + handoff - how I stop sessions from rotting
  2. Guardrails - hooks that won't let the AI shoot me in the foot
  3. Quality gates - forcing the agent to verify before it says "done"
  4. The prompt layer - a local LLM that reads my prompt before it hits Claude (you're here)
  5. The ambient layer - the hooks you stop noticing

All five parts are now live.

Takeaway

The model gets all the attention. The prompt gets almost none, and the prompt is the part I actually control. Putting a small, free, local model in front of the expensive one, to read what I typed before Claude does, turned out to be one of the cheapest upgrades I've made to this setup, and most days it earns its keep before I've finished my coffee. The post you just read exists because that gate stopped me from firing three sleepy words into a fresh session and calling it a brief.
