
Armorer Labs

Originally published at github.com

Armorer Guard Learning Loop: live local feedback for AI-agent security, without model drift

We just shipped the Armorer Guard Learning Loop: a Rust-native feedback layer for local AI-agent security enforcement.

The short version:

Armorer Guard supports hybrid live learning: feedback adapts local enforcement immediately, while global model improvements go through reviewed, versioned retraining. No scanner network calls. No silent cloud upload. No poisoning-by-default.

Armorer Guard is a local-first Rust scanner for AI-agent boundaries: prompts, retrieved content, model output, tool-call arguments, logs, memory writes, and outbound messages. It detects prompt injection, data exfiltration, sensitive data requests, safety bypasses, destructive commands, system prompt extraction, and credentials.

The new loop adds three CLI modes:

armorer-guard feedback-record
armorer-guard feedback-stats
armorer-guard feedback-export --reviewed-only

Results from inspect and inspect-json now include these fields:

{
  "scan_id": "sha256:...",
  "model_version": "word-sgd-native-v1",
  "learning_version": "local-learning-v1"
}
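These fields make it possible to tie a later feedback entry to the exact scan and versions that produced it. Here is a minimal sketch in Python, assuming the inspect-json result is a JSON object carrying the three keys above. The helper name `scan_provenance` is hypothetical and not part of Armorer Guard, and the `scan_id` value is the elided placeholder from the sample, not a real hash.

```python
import json

# Keys taken from the sample output above.
REQUIRED_KEYS = ("scan_id", "model_version", "learning_version")


def scan_provenance(inspect_json_text: str) -> dict:
    """Extract the scan and version fields from an inspect-json result.

    Hypothetical helper: raises KeyError if any expected field is absent,
    so a caller never logs feedback without knowing which versions made the call.
    """
    result = json.loads(inspect_json_text)
    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise KeyError(f"inspect output is missing: {', '.join(missing)}")
    return {key: result[key] for key in REQUIRED_KEYS}


sample = (
    '{"scan_id": "sha256:...", '
    '"model_version": "word-sgd-native-v1", '
    '"learning_version": "local-learning-v1"}'
)
print(scan_provenance(sample))
```

Logging all three values next to each decision lets you tell whether a changed verdict came from a new model version or from the local learning overlay.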

Why this design?

A lot of "self-learning" security systems quietly drift. That is scary in an agent runtime because a malicious or noisy feedback stream can teach the guard to allow exactly the thing it should block.

So Armorer Guard splits learning into two lanes:

  1. Local learning overlay: immediate deployment-specific allow/block/review corrections, stored locally under ~/.armorer-guard/feedback or ARMORER_GUARD_HOME.
  2. Global model training: reviewed, deduped, provenance-checked, versioned retraining. Unreviewed feedback defaults to can_train=false.
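The second lane can be pictured as a filter over feedback records. The sketch below is illustrative only: `can_train`, `false_positive`, and `allow` come from this post, while the `FeedbackRecord` shape, the `reviewed` field, and the `approve` and `export_reviewed` helpers are assumptions and not Armorer Guard's actual storage format or API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FeedbackRecord:
    """Hypothetical in-memory shape for one feedback entry."""
    scan_id: str
    label: str                # e.g. "false_positive"
    desired_action: str       # "allow", "block", or "review"
    reviewed: bool = False
    can_train: bool = False   # unreviewed feedback is not trainable by default


def approve(record: FeedbackRecord) -> FeedbackRecord:
    """Mark a record as reviewed and eligible for global retraining."""
    return replace(record, reviewed=True, can_train=True)


def export_reviewed(records):
    """Keep reviewed, trainable records, deduplicated in first-seen order."""
    seen = set()
    exported = []
    for record in records:
        if not (record.reviewed and record.can_train):
            continue  # stays in the local lane only
        key = (record.scan_id, record.label, record.desired_action)
        if key in seen:
            continue  # drop exact repeats of the same correction
        seen.add(key)
        exported.append(record)
    return exported


raw = FeedbackRecord("sha256:...", "false_positive", "allow")
print(export_reviewed([raw]))                               # nothing is exported
print(export_reviewed([raw, approve(raw), approve(raw)]))   # one approved record
```

The point of the default is that a noisy or hostile feedback stream can change local behavior at most. It reaches the shared model only after someone explicitly approves it.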

A local allow exemplar can suppress eligible semantic false positives, but it cannot suppress:

detected:credential
policy:credential_disclosure
policy:dangerous_tool_call
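One way to picture that rule is as a filter with a hard-coded protected set. This is a sketch under assumptions, not Armorer Guard's implementation: the three protected reasons and `learning:local_allow_match` come from this post, but the function `apply_local_allow`, the example reason `semantic:prompt_injection`, and the choice to treat every non-protected reason as eligible for suppression are hypothetical.

```python
# Reasons a local allow exemplar can never suppress (from the list above).
PROTECTED_REASONS = frozenset({
    "detected:credential",
    "policy:credential_disclosure",
    "policy:dangerous_tool_call",
})


def apply_local_allow(reasons, has_local_allow_match):
    """Return the reasons that remain after applying a local allow exemplar.

    Assumption for this sketch: every reason outside PROTECTED_REASONS is
    eligible for suppression, and the match marker is appended whenever a
    local allow exemplar matched.
    """
    if not has_local_allow_match:
        return list(reasons)
    kept = [reason for reason in reasons if reason in PROTECTED_REASONS]
    return kept + ["learning:local_allow_match"]


# "semantic:prompt_injection" is a made-up reason name for illustration.
print(apply_local_allow(["semantic:prompt_injection"], True))
print(apply_local_allow(["semantic:prompt_injection", "detected:credential"], True))
```

In the first call only the match marker remains. In the second, the credential finding survives the local allow, so the scan still has a reason to block.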

That gives a practical demo story:

  1. Paste a benign security runbook that gets flagged.
  2. Record false_positive feedback with desired action allow.
  3. Re-run the scan.
  4. Guard returns learning:local_allow_match and suppresses the noisy semantic flag.
  5. Try the same thing with a credential or dangerous tool call; those still stay protected.

Repo: https://github.com/ArmorerLabs/Armorer-Guard
Demo: https://huggingface.co/spaces/armorer-labs/armorer-guard-demo
Model artifact: https://huggingface.co/armorer-labs/armorer-guard-semantic-classifier

I would love feedback from people building agent runtimes, eval harnesses, or security gates. Where would you put this check in your stack: prompt ingress, retrieval ingress, model output, tool-call args, or all of them?
