Limp Mode: building a car mechanic that runs offline on a 4B model

Built for the Build Small Hackathon (Hugging Face and Gradio), Backyard AI track. A
fine-tuned 4B model, a 1.3B vision model, a deterministic safety layer, and a 202-case
benchmark with one number that has to stay at zero.

Seven years ago I hit a bump in a Fiat and the engine died. No damage, no warning light,
just dead, on a road with no signal. The cause turned out to be a crash safety switch
that cuts the engine after an impact. The reset was a hidden button near my knee,
documented on a page of a manual that was not in the car. A tow truck and a mechanic
later, I had the lesson this project is built on: the moment you most need information
about your car is exactly the moment you have no internet.

Limp Mode is an offline roadside copilot. You photograph the dashboard light, pick it
from a wall of warning lights drawn the way they look on a real dash, describe the noise
in English or Spanish, or enter an OBD code. It answers with a STOP, CAUTION, or DRIVE
verdict, the hidden cause when there is one, and step-by-step self-rescue, because "drive
carefully to a garage" is useless in a dead zone.

Deterministic skeleton, small-model flesh

The design rule: anything that has to be right is not generated.

layer	mechanism
OBD code to meaning	3,369-code database (SAE J2012)
dashboard symbol to meaning	closed world of 64 telltales; vision proposes, the driver confirms by tapping the glyph
severity floor	hard rules: brakes, oil pressure, overheating, fuel smell, flashing CEL, and flooding can never be downgraded by the model
hidden causes	38 verified entries (inertia switches, EV 12V bricks, shift-lock overrides), retrieved and rendered verbatim
roadside procedures	15 step-by-step guides, rendered verbatim, never paraphrased
free-form triage	Qwen3.5-4B, fine-tuned, strict JSON contract

The model only does what only a model can do: read a messy human description of a noise
and reason about it. Everything else is data.
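As an illustration of what a strict JSON contract can look like, here is a minimal sketch of a validator that rejects anything off-contract. The field names (`verdict`, `flags`, `explanation`) and the flag vocabulary are assumptions made for this example, not the project's actual schema.

```python
import json
from typing import Optional

ALLOWED_VERDICTS = {"STOP", "CAUTION", "DRIVE"}
# Hypothetical flag vocabulary, loosely mirroring the hard-rule hazards listed above.
ALLOWED_FLAGS = {"brakes", "oil_pressure", "overheating",
                 "fuel_smell", "flashing_cel", "flooding"}
REQUIRED_KEYS = {"verdict", "flags", "explanation"}  # assumed field names


def parse_triage(raw: str) -> Optional[dict]:
    """Return the parsed triage object, or None if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Exactly the expected keys: no missing fields, no extras.
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    verdict = data["verdict"]
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return None
    flags = data["flags"]
    if not isinstance(flags, list):
        return None
    if not all(isinstance(f, str) and f in ALLOWED_FLAGS for f in flags):
        return None
    explanation = data["explanation"]
    if not isinstance(explanation, str) or not explanation.strip():
        return None
    return data
```

A caller would treat `None` as a schema failure and retry or fall back, so malformed model output never reaches the deterministic layers.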

A benchmark with a zero in it

Before training anything, we built a 202-case suite (52 stop, 96 caution, 54 drive)
across seven categories. Two headline metrics: verdict accuracy, and dangerous-as-safe
(expected STOP, answered DRIVE), which must be exactly zero. Overcaution is also a
failure, so a quarter of the suite is benign cases designed to punish panic.
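The two headline metrics are simple to compute. The sketch below assumes each benchmark result is an `(expected, predicted)` pair of verdict strings; that shape, the function names, and the separate overcaution count are choices made for this example.

```python
def score(results):
    """results: list of (expected, predicted) verdict pairs."""
    total = len(results)
    correct = sum(1 for expected, got in results if expected == got)
    # The number that has to stay at zero: a STOP case answered DRIVE.
    dangerous_as_safe = sum(
        1 for expected, got in results if expected == "STOP" and got == "DRIVE"
    )
    # Benign cases escalated to CAUTION or STOP.
    overcautious = sum(
        1 for expected, got in results if expected == "DRIVE" and got != "DRIVE"
    )
    return {
        "accuracy": correct / total,
        "dangerous_as_safe": dangerous_as_safe,
        "overcautious": overcautious,
    }


def passes_gate(metrics):
    """A run is only acceptable if no STOP case was answered DRIVE."""
    return metrics["dangerous_as_safe"] == 0


toy = [("STOP", "STOP"), ("STOP", "DRIVE"), ("DRIVE", "CAUTION"), ("CAUTION", "CAUTION")]
metrics = score(toy)
```

On the toy data, accuracy is 0.5 and the gate fails, because one STOP case came back as DRIVE.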

What the measurements caught

Naive RAG made the model worse. Zero-shot, the model alone scored 88.1% accuracy but
surfaced the hidden-cause knowledge only 74% of the time. The first retrieval attempt
pushed knowledge to 100% and crashed accuracy to 59.5%: irrelevant but lexically similar
knowledge-base entries scared the model into overcaution on ordinary cases. The fix was
IDF-weighted retrieval, a prompt contract that treats hits as candidates to ignore unless
they clearly match, and training data that includes noisy retrievals whose correct answer
is to ignore the context.
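To make the retrieval fix concrete, here is a minimal sketch of IDF-weighted overlap scoring with a cutoff. The three knowledge-base strings, the `log(n / df)` weighting, and the `min_score` threshold are all assumptions for illustration; the project's actual scorer may differ.

```python
import math
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_idf(docs):
    """idf = log(n / df): a term that appears in every entry weighs zero."""
    n = len(docs)
    df = {}
    for doc in docs:
        for t in tokens(doc):
            df[t] = df.get(t, 0) + 1
    return {t: math.log(n / count) for t, count in df.items()}


def retrieve(query, docs, idf, min_score=1.0):
    """Return (index, score) pairs at or above min_score, best first."""
    q = tokens(query)
    hits = []
    for i, doc in enumerate(docs):
        shared = q & tokens(doc)
        s = sum(idf.get(t, 0.0) for t in shared)
        if s >= min_score:
            hits.append((i, s))
    return sorted(hits, key=lambda pair: -pair[1])


# Toy knowledge base (hypothetical wording).
KB = [
    "engine cuts out after bump or impact inertia fuel cutoff switch reset button",
    "engine will not crank ev 12v battery dead",
    "engine running but shifter stuck in park shift lock override",
]
IDF = build_idf(KB)
```

With this toy data, "engine died after hitting a bump" retrieves only the inertia-switch entry, while "engine noise when cold" retrieves nothing: the only shared word is "engine", which appears in every entry and so carries no weight.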

Our own safety floor was the second biggest error source. The model is told to
over-flag plausible hazards because the flags feed the floor. The floor treated any flag
as a hard trigger, so "slight pull when braking" got honestly flagged for brakes and
slammed from the correct CAUTION up to STOP. Fourteen of the seventeen failures in one run
came from the floor, not the model. The fix: hard evidence (text keywords, confirmed symbols,
OBD codes) gets the full floor, while a bare model flag can raise at most to CAUTION. The
full pipeline then reached 90.5% on the seed suite, above the bare model, with knowledge
at 100% and dangerous-as-safe still zero.
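The two-tier floor described above can be sketched as follows. The hazard names and the per-hazard floor levels in `FULL_FLOOR` are assumptions for this example; the point is only that a bare model flag is capped at CAUTION while hard evidence gets the full floor.

```python
SEVERITY = {"DRIVE": 0, "CAUTION": 1, "STOP": 2}

# Hypothetical per-hazard floors applied when there is hard evidence.
FULL_FLOOR = {
    "brakes": "STOP",
    "oil_pressure": "STOP",
    "overheating": "STOP",
    "fuel_smell": "STOP",
    "flashing_cel": "STOP",
    "flooding": "STOP",
}


def stricter(a, b):
    """Return the more severe of two verdicts."""
    return a if SEVERITY[a] >= SEVERITY[b] else b


def apply_floor(model_verdict, hard_evidence, model_flags):
    """hard_evidence: hazards backed by text keywords, confirmed symbols, or OBD codes.
    model_flags: hazards raised only by the model."""
    verdict = model_verdict
    for hazard in hard_evidence:
        verdict = stricter(verdict, FULL_FLOOR.get(hazard, "CAUTION"))
    if model_flags:
        # A bare model flag can raise the verdict, but never past CAUTION.
        verdict = stricter(verdict, "CAUTION")
    return verdict
```

Under this sketch, "slight pull when braking" with only a model flag for brakes stays at CAUTION, while the same hazard backed by hard evidence goes to STOP. The floor only ever raises a verdict, so it cannot downgrade a STOP from the model.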

The training-data gate found a bug in the safety system. Every training example passes
deterministic gates plus decontamination against the eval suite. The floor-consistency
gate started rejecting perfectly good inertia-switch examples, because the floor keyword
list included the bare word "fire", which matches "the engine will not fire". That bug
would have hit real users. A verifier built to check the data ended up debugging the
product.
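The "fire" bug is easy to reproduce. The sketch below contrasts a bare substring check with one possible repair: mask the benign "will not fire" idioms first, then match "fire" only as a whole word. Both patterns are assumptions for illustration, not the project's actual keyword list.

```python
import re


def naive_fire_hit(text):
    """The buggy check: any occurrence of the substring 'fire'."""
    return "fire" in text.lower()


# Idioms where "fire" means "start", not "burn" (hypothetical list).
BENIGN_FIRE = re.compile(
    r"\b(?:will not|won't|does not|doesn't|would not|wouldn't) fire\b"
)
# Whole-word match, so "misfire" does not count either.
FIRE_WORD = re.compile(r"\bfire\b")


def fire_hazard(text):
    """Mask benign idioms, then look for 'fire' as a standalone word."""
    lowered = text.lower()
    masked = BENIGN_FIRE.sub(" ", lowered)
    return bool(FIRE_WORD.search(masked))
```

With these patterns, "The engine will not fire" trips the naive check but not the repaired one, and "There is fire under the hood" still triggers the floor.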

We red-teamed the human-written layer too. The 15 roadside procedures were checked
line by line against AA, RAC, NHTSA, CDC, and manufacturer guidance. Thirteen stood; two
had real problems. Our "drive in a truck's spray shadow" tip for dead wipers was the
opposite of the correct advice (truck spray blinds you, so stay back), and our warning
triangle distances matched no actual jurisdiction, while several countries now prohibit
placing triangles on motorways at all. Both are fixed.

Training: Modal, and a week of dependency archaeology

The triage model is a LoRA fine-tune of Qwen3.5-4B (rank 32, completion-only loss, 3
epochs) trained on Modal over 760 quality-gated examples. The honest part of the story is
that the training code worked on the first try and the environment failed eleven times.
The chain: one trainer library pinned an older version of transformers that did not know
Qwen3.5, so we dropped it; the next library did the same, so we dropped that too; and the
GGUF converter's own requirements file silently downgraded both PyTorch (to a CPU build)
and transformers on every rebuild. The fix was to make the CUDA PyTorch and the correct
transformers the final image layers, with a build-time assertion that fails the build if
either was clobbered.
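A build-time assertion of this kind can be a few lines of Python run as the last build step. The sketch below takes the values as arguments so the check itself has no heavy imports; the function names and the minimum transformers version are placeholders, not the project's actual values.

```python
import re

# Placeholder: set this to the first transformers release that supports your model.
MIN_TRANSFORMERS = (5, 0, 0)


def parse_version(version):
    """'4.57.0.dev0' -> (4, 57, 0); keeps leading numeric components only."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def assert_image_intact(torch_cuda, transformers_version, min_transformers):
    """Exit non-zero (failing the build) if a later layer clobbered either package."""
    if torch_cuda is None:
        raise SystemExit("PyTorch is a CPU-only build: the CUDA wheel was overwritten")
    if parse_version(transformers_version) < min_transformers:
        raise SystemExit(
            f"transformers {transformers_version} is older than required {min_transformers}"
        )
```

In the final image layer this could be called as `assert_image_intact(torch.version.cuda, transformers.__version__, MIN_TRANSFORMERS)` after importing both packages. `torch.version.cuda` is `None` on CPU-only builds, which is exactly the silent downgrade the check is meant to catch.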

One more small-model trap worth writing down: the converter declared a multi-token
prediction head in the GGUF metadata but wrote none of its tensors, so the file would not
load until that metadata was set back to zero.

Results

Both rows run through the identical pipeline, so the difference is the fine-tune alone.

stage	accuracy	dangerous-as-safe	schema valid	knowledge
base Qwen3.5-4B	83.2%	0	99.5%	98.9%
fine-tuned	92.6%	0	100%	97.9%

A 9.4 point gain in verdict accuracy, with the dangerous-as-safe count held at zero and
schema validity at 100%. The knowledge score dips one point, from 98.9% to 97.9%. The
fine-tune scores 100% on OBD-code and dashboard-symbol cases and 94.6% on hidden-cause
cases; the soft spots it leaves are benign cases (81%, a little residual overcaution) and
Spanish (84%).

What it costs

The whole stack (both models quantized, all the knowledge bases, and the front end) runs
on a laptop with the network off, answering in roughly 15 to 20 seconds. The deployed
Space runs the same pipeline on free ZeroGPU hardware.

Honest limitations

The hidden-cause knowledge base covers 38 entries; coverage is the ceiling, and a miss
means the model reasons unaided. Vision recall on a full dashboard is partial at 1.3B,
which is exactly why the product never trusts it: every detected light is confirmed by the
driver's tap before it feeds the deterministic layer. The procedures are general; a
procedures database for your specific car would be better. That is the roadmap: the
owner's manual, finally useful, finally offline.

Try it: https://huggingface.co/spaces/build-small-hackathon/limp-mode