I wanted an AI agent running on my home network. Not a cloud subscription and not something requiring me to be at the keyboard all day. A thing that wakes up at 7am, pulls from RSS feeds and Reddit, synthesizes the news I actually care about, and emails it to me. Just that. That’s what I started with. Seemed simple. It wasn’t like I was asking much.
The reality was six weeks of debugging hallucinations, silent config failures, broken tool schemas, and a recurring realization that LLMs are, in certain contexts, compulsive liars.
Here’s what I learned the hard way.
Table of Contents
- The Setup: OpenClaw + DeepSeek in Docker
- The Exec Approval Maze
- The Reports That Were Too Good
- Going Around the Agent
- When Tools Become Literal Text
- Ripping Out Slack
- What’s Actually Working
- But Here’s What She’s Actually Good At
- The Report Engine Isn’t a One-Trick Pony
- Email Delivery, Old School On Purpose
- Multi-Model, Not Locked to DeepSeek
- The Track Record, Three Days In
- What I Actually Built
The Setup: OpenClaw + DeepSeek in Docker
OpenClaw is a self-hosted AI agent framework. If you haven’t heard of it, think a local version of an AI assistant with cron jobs, tool calling, Slack/Telegram integration, and memory. Plus, how haven’t you heard of it? You run it in Docker, point it at whatever LLM you want, and theoretically have an autonomous agent working for you.
I named mine Sabrina. She runs DeepSeek V3 (deepseek/deepseek-chat) because the OpenAI and Anthropic APIs bill by the token and Sabrina is a chatty agent who generates daily reports. DeepSeek at pay-as-you-go rates keeps the monthly bill manageable.
The architecture is two containers: openclaw-gateway handles HTTP and the Slack/Telegram socket connections, and openclaw-cli is the shell interface. The whole ~/.openclaw directory mounts into the container at /home/node/.openclaw so configs, cron jobs, and workspace scripts are all live-editable from the host without rebuilding.
On paper, this is elegant. In practice, you will spend a lot of time staring at container logs wondering why your agent is quietly lying to you. Or realizing you can just put Claude Code on the host and have it fix things when they break.
The Exec Approval Maze
Before Sabrina could run scripts, I had to configure exec-approvals.json: a policy file that controls what shell commands the agent is allowed to execute. Fine. Reasonable. I set up allowlists for the workspace scripts and Python interpreter.
Then the cron jobs started silently failing. The daily 7am AI report would produce output, but something felt off. I dug into the exec-approval config and found the first trap:
The documentation (and my own reasoning at the time) suggested "ask": "never" as a way to skip interactive approval prompts for unattended jobs. This is wrong. The schema only accepts "off" | "on-miss" | "always". Using "never" doesn’t throw an error. It gets silently stripped by sanitizeExecApprovalPolicy the next time the app writes the file. Your config looks fine, your intent is gone, and the agent starts timing out on approval requests at 7am with no operator connected.
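To make the failure mode concrete, here’s the pattern in miniature. This is not OpenClaw’s actual sanitizer code, just a sketch of the behavior: unknown enum values get dropped instead of rejected.

# Sketch of the silent-strip behavior; NOT OpenClaw's real sanitizer.
VALID_ASK = {"off", "on-miss", "always"}

def sanitize_policy(policy: dict) -> dict:
    cleaned = dict(policy)
    if cleaned.get("ask") not in VALID_ASK:
        cleaned.pop("ask", None)  # no error, no warning, intent gone
    return cleaned

print(sanitize_policy({"security": "allowlist", "ask": "never"}))
# -> {'security': 'allowlist'}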
The correct pattern:
{
  "defaults": { "security": "allowlist", "ask": "off", "allowlist": ["..."] },
  "agents": {
    "main": { "security": "allowlist", "ask": "off", "allowlist": ["..."] }
  }
}
"ask": "off" makes the allowlist the sole policy.
I fixed this. Or so I thought.
The Reports That Were Too Good
The AI intelligence report looked great. Every morning: a well-formatted digest of the day’s AI news, summaries, source links. Sabrina was crushing it.
Then I noticed the timestamps.
Every log entry in the fabricated reports had timestamps ending in :00 or :30. No real log file looks like that: they’re messy, they have milliseconds, they reflect actual compute time. These were fake. I checked the URLs. Several of them 404’d. The article summaries were plausible but not verifiable. Sabrina had been generating the reports herself, not from RSS feeds but from her training data and imagination, because the exec approval issue wasn’t actually fixed. When the script couldn’t run, the agent fell back on what LLMs do naturally: produce what the output should look like.
This is the thing nobody tells you about giving LLMs agentic tasks: when they fail to do the thing, they don’t say “I failed to do the thing.” They generate a plausible simulation of having done the thing.
The fix I’d been applying, tweaking exec-approvals, only addressed the symptom. The agent could bypass exec approval entirely by deciding to write the content directly. There was no configuration that would stop a sufficiently motivated language model from bullshitting.
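Before I got to the real fix, the stopgap was checking the output itself for the tells I’d already spotted. A rough sketch; the 90% threshold and the five-URL spot check are numbers I made up:

import re
import urllib.request

def looks_fabricated(report_text: str) -> bool:
    # Fabricated logs round to :00 and :30; real ones are messy.
    seconds = re.findall(r"\d{2}:\d{2}:(\d{2})", report_text)
    if seconds and sum(s in ("00", "30") for s in seconds) / len(seconds) > 0.9:
        return True
    # Spot-check a handful of cited URLs; hallucinated links tend to 404.
    for url in re.findall(r"https?://\S+", report_text)[:5]:
        try:
            urllib.request.urlopen(url, timeout=10)
        except Exception:
            return True
    return False

It catches the lazy fabrications. It does not catch a model that hallucinates messy timestamps, which is why the real fix was removing the model entirely.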
Going Around the Agent
The actual fix was nuclear: remove the agent from report generation entirely.
I disabled the OpenClaw cron jobs for both the AI report and the email send, then added host-level cron entries that call docker exec directly:
0 7 * * * docker exec openclaw-openclaw-gateway-1 /usr/bin/python3 /home/node/.openclaw/workspace/ai_report.py --profile ai-intelligence >> /home/eristoddle/.openclaw/workspace/logs/report-host-$(date +\%Y-\%m-\%d).log 2>&1
30 7 * * * docker exec openclaw-openclaw-gateway-1 bash /home/node/.openclaw/workspace/send-ai-intelligence-report-proper.sh >> /home/eristoddle/.openclaw/workspace/logs/email-host-$(date +\%Y-\%m-\%d).log 2>&1
The Python script runs inside the container, where it has access to the right Python packages, but the trigger is the host crontab. No agent involved. No LLM between the script and reality.
This works. The reports now have messy timestamps and real URLs that actually load.
The Obsidian weekly report I left in OpenClaw, because that one needs the agent. It reads my vault, categorizes clips, writes summaries, analyzes git diffs: actual LLM work that benefits from Sabrina’s reasoning. The difference is whether the task is “run a script and report the output” (host cron) or “think about my vault and synthesize something useful” (agent cron). Only one of those should involve an LLM.
When Tools Become Literal Text
OpenClaw gets updates. After updates, things break in interesting ways.
Twice now I’ve run into a scenario where Sabrina starts responding to everything but her tool calls appear as raw text in the chat. Instead of actually reading a file, she’d output read:/home/node/.openclaw/workspace/HEARTBEAT.md as a literal string.
This is a DeepSeek-specific quirk that OpenClaw triggers by accident. The framework converts tool schemas to OpenAI format before sending them to providers. DeepSeek expects its own native format. The conversion breaks its tool call parsing silently. It receives schemas it doesn’t understand and falls back to treating the tool call syntax as plain text.
The fix is a compat flag in the model config in openclaw.json:
"models": [{
"id": "deepseek-chat",
"name": "DeepSeek V3",
"contextWindow": 163840,
"maxTokens": 8192,
"compat": { "anthropicToolSchemaMode": "native" }
}]
anthropicToolSchemaMode: "native" tells OpenClaw to skip the schema conversion and send the native format. Tools work again. I found this via a GitHub issue (#36651) after two sessions of source archaeology that I really didn’t want to be doing.
The lesson: when OpenClaw updates and tools start appearing as text, don’t read source code first. Check GitHub issues and Reddit. The community finds these fixes faster than you will staring at the framework internals.
Ripping Out Slack
OpenClaw supports Slack via socket mode. I had it connected for a while because it was useful for checking in on Sabrina from my phone without VPN or port-forwarding.
Then an update changed the Slack config schema. The gateway crashed on startup with “Config invalid” and wouldn’t come back up until I removed the entire channels.slack block from openclaw.json. This happened twice. After the second time I removed Slack permanently and switched to Telegram, which has been stable.
This is the trade-off with self-hosted software that’s still actively developed: you get the control, you eat the breakage. Updates that ship on Tuesday can invalidate configs you spent a week getting right. Having Claude Code manage the ~/.openclaw config directory directly, rather than asking Sabrina to fix herself through chat, means at least the fixes land correctly the first time.
What’s Actually Working
Six weeks in, here’s the honest status:
- Daily AI intelligence report: Running reliably via host cron. Real data. Real URLs. Emails delivered by 7:30am.
- Weekly Obsidian report: Agent-generated, delivers Fridays. Sabrina does genuine LLM work here — categorizing clips, writing summaries — and it shows.
- Tool calling: Stable with the compat flag. Breaks again when OpenClaw updates, gets fixed in under an hour now that I know where to look.
- The exec-approvals file: Still fragile. I keep a copy of the correct config in my notes.
The thing I underestimated: running an AI agent autonomously is mostly an infrastructure problem, not an AI problem. The interesting parts are the prompts and the LLM reasoning. The annoying parts are Docker networking, cron timing, config schema drift, and an agent that will hallucinate convincingly rather than admit it can’t do something.
Sabrina’s useful. She’s also a liar when she’s backed into a corner. I’ve learned to keep her away from any task where I can’t independently verify the output.
That’s not an OpenClaw problem or a DeepSeek problem. That’s just what LLMs do. But here’s the thing: once I stopped asking her to do the things LLMs are bad at, she got useful in a hurry. Most of what follows happened since last Thursday night.
But Here’s What She’s Actually Good At
OpenClaw’s skill system is pluggable. You drop a skill into the workspace, the agent loads it, and it becomes part of how she thinks. Sabrina didn’t ship with most of her current capabilities. She built them through the same autonomous workflow she runs every day.
A few that earn their slot:
- sm-blog-outline: Started life as a generic blog-outline skill. Now it’s the full pipeline I use for this site — notes → outline → email. Trained on my voice, my content pillars, my snark level. It’s the skill that outlined this post, pulling from both Sabrina’s and Claude Code’s logs as well as a running list of notes I kept on the setup process.
- ct-humanizer: Sequential editing passes that strip AI tells out of nonfiction. Diagnoses patterns first, then kills the AI vocabulary, then breaks up the structural templates LLMs love so much. Not a magic button, more like a brutal copy editor. It cleans up the outline.
- verbalized-sampling: Instead of spitting back a single answer, generates multiple candidates with probability weights. I use it for brainstorming and “show me five angles” tasks. The default LLM answer is usually the median answer; this skill surfaces the weirder, more useful ones. Got the idea here, gave Opus all the documentation, and used the Claude skill-creator skill to create it. It is one of my favorite skills because you never know what you’re going to get.
- vault-tag-search + vault-idea-scorer: Companions to the blog pipeline. One searches my Obsidian vault by tag and body content with deduplication. The other ranks blog post ideas by whether they dovetail with multiple goals: research vs. content vs. portfolio vs. SEO.
- A self-improving skill: Logs corrections and preferences so Sabrina compounds learning between sessions instead of getting the same feedback every week.
The point isn’t any single skill. It’s that the agent grows a custom toolkit shaped by the work I actually do, not whatever generic capabilities the framework shipped with.
The Report Engine Isn’t a One-Trick Pony
That ai_report.py script generating the daily AI digest isn’t hardcoded to AI news. It’s a topic-agnostic engine that takes a profile flag:
python3 ai_report.py --profile ai-intelligence
python3 ai_report.py --profile golang
python3 ai_report.py --profile typescript
Each profile defines its own RSS feeds, Reddit subreddits, and keyword filters. Depth is tunable too: quick briefing vs. deep dive, set per profile. Articles get scored against my interests using CLIP + BM25 indexing before they make the cut, so I don’t end up with a digest full of stuff I don’t care about.
Same engine, different sources, same usefulness. Once the host cron pattern is locked in for one topic, adding another is a profile file and a crontab line.
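For a sense of what a profile carries, here’s the rough shape. The field names are my illustration, not the script’s actual schema:

# Hypothetical profile shape for ai_report.py; field names are
# illustrative, not the script's real schema.
PROFILES = {
    "golang": {
        "rss_feeds": ["https://go.dev/blog/feed.atom"],
        "subreddits": ["golang"],
        "keywords": ["goroutine", "generics", "go 1."],
        "depth": "brief",        # "brief" or "deep", set per profile
        "max_articles": 15,
    },
}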
Email Delivery, Old School On Purpose
Everything Sabrina produces comes to me as email. Gmail SMTP, app password auth…for now. Yes, that’s old-fashioned. That’s the feature.
A dashboard would be one more thing to check. Notifications would be one more app fighting for attention. Email is the universal inbox I already process. I can read it on my iPad without installing anything, forward to Obsidian if it’s worth keeping, drag it to drafts if it’s a blog skeleton, or delete it if Sabrina got it wrong.
The pattern is generic:
send-email.sh "Subject" body-or-file [attachment]
That’s it. Anything in the system that needs to deliver text to a human goes through that script. Reports, blog outlines, and research summaries all use it.
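Under the hood there’s nothing exotic. A minimal Python sketch of the same idea, assuming the Gmail credentials live in environment variables:

import os
import smtplib
from email.message import EmailMessage

def send_email(subject: str, body: str, to_addr: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = os.environ["GMAIL_USER"]
    msg["To"] = to_addr
    msg.set_content(body)
    # Gmail SMTP over SSL, authenticated with an app password,
    # never the real account password.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(os.environ["GMAIL_USER"], os.environ["GMAIL_APP_PASSWORD"])
        server.send_message(msg)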
Multi-Model, Not Locked to DeepSeek
DeepSeek runs the daily cron work because it’s cheap. But Sabrina isn’t married to it. The agent routes through OpenRouter, which means any task can pick its own model:
- qwen/qwen3.6-plus — 1M context window, great for long-form research and generation
- minimax/minimax-m2.5 — strong reasoning, what I reach for on analytical work
- google/gemini-3-flash-preview — also 1M context, fast
- moonshotai/kimi-k2.6 — solid alternative when the others are misbehaving
The job picks the model. Daily AI report? DeepSeek, because it’s cheap and the task isn’t hard. Blog outline that needs to chew through a pile of research notes? Qwen, because the context window swallows the whole input without chunking. Analytical synthesis? Minimax. And again, for now. I’m just getting into these new models after using Claude Code for however long it’s been out. But the success I’ve had with them has me setting up Opencode to use them.
The subagent system lets me parallelize too. While the main session ran on DeepSeek doing one thing, a subagent on Qwen drafted an outline for a different post. Two models, two tasks, one wall clock.
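The routing itself is nothing fancier than picking a model string per task. A sketch against OpenRouter’s OpenAI-compatible endpoint; the task table is my own illustration, not OpenClaw’s internals:

import os
from openai import OpenAI

# OpenRouter speaks the OpenAI wire protocol, so the stock client works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Task-to-model table: cheap model for rote work, big context for synthesis.
MODEL_FOR_TASK = {
    "daily-report": "deepseek/deepseek-chat",
    "blog-outline": "qwen/qwen3.6-plus",
    "analysis": "minimax/minimax-m2.5",
}

def run_task(task: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_FOR_TASK[task],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content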
The Track Record, Three Days In
Concrete deliverables since Thursday night:
- Blog outlines: Two posts — a Kiro AI article and one I’m calling “The AI Psychologist” — both went notes → web research → verbalized sampling for angle selection → outline → email. Full pipeline, no me-in-the-loop until the outline showed up in my inbox.
- Research tasks: Author bios with structured JSON + bibliography, topic deep-dives on AI tools, vibe coding, prompt engineering psychology. Stuff I’d normally burn an afternoon on.
- Brainstorming: Content ideas, project names, productivity workflows, all using verbalized sampling so I get diverse options with probability weights instead of one safe median answer.
- Memory compounding: Daily logs roll up to weekly memory promotion. The self-improving skill captures corrections so the same mistake doesn’t keep showing up. Each week she’s a little less stupid about my preferences.
- Weekly Obsidian reports: Genuinely useful vault digests. What changed. What’s worth re-reading. What’s collecting dust and should be archived or thrown out.
None of this involves Sabrina pretending to run scripts she can’t run. All of it is “think about something and write me a thing,” which is exactly what LLMs are for.
What I Actually Built
Six weeks ago I wanted an autonomous AI agent. What I have now is better and stupider at the same time.
The discovery, after all the silent hallucinations and config schema drift and tool-calls-as-text bullshit: AI agents are great at the thinking parts: research, writing, brainstorming, synthesis. They’re terrible at the doing parts: running scripts reliably, admitting they can’t do something, not making shit up when cornered.
So I built around the doing and leaned into the thinking. Sabrina does real work now. She just doesn’t run the cron jobs herself anymore: the host crontab does. She doesn’t pretend to fetch RSS feeds: a Python script does that and hands her the data. What she does is the part LLMs are actually for: read a pile of stuff, synthesize, make a thing, deliver it to email.
The host cron + agent hybrid is the pattern that actually ships. The agent is the writer, not the operator. The operator is cron and a Python interpreter, both of which have been doing their jobs reliably since long before transformers were a thing.
Six weeks to figure out what should have been obvious from the start: stop using language models for things that aren’t language. At least that’s what I’m going with until the next break-then-fix cycle comes around.
