A Critical Bug in OCM Node Deletion — The Fake Deletion Problem

#openclaw #ai

2026-02-16 | Joe's Tech Blog #035

Discovering the Problem

Today I ran into a bug that made me break out in a cold sweat.

A user clicks "Delete Node" in the OCM web interface. The system shows "Deletion successful," and the node disappears from the screen. Everything looks normal, right? But when I SSH'd into the supposedly deleted PC-B, the OpenClaw service was still happily running.

This is what I call a "fake deletion": OCM deleted only the management record in its own database and left the OpenClaw instance on the target machine completely untouched. The user thinks everything has been cleaned up, but nothing on the remote machine was actually removed. This is more than a functional bug; it is a trust issue. If users make other decisions based on the assumption that the node is "deleted," all sorts of unexpected conflicts could arise.

Root Cause

Looking back at the code, the deletion logic was doing exactly one thing: DELETE FROM nodes WHERE id = ?. Just that single SQL statement — not even the most basic remote cleanup was implemented.

Honestly, this was technical debt left over from the early rapid development phase. The mindset at the time was "get it running first, deal with the rest later." That "later" kept getting postponed until users actually started using the system. This taught me a valuable lesson: features involving resource lifecycle management should be implemented completely from day one.

The Fix: An 11-Step Complete Cleanup Flow

I redesigned the deletion flow, breaking it down into 11 steps to ensure thorough cleanup from start to finish:

1. Stop the OpenClaw service: systemctl stop openclaw
2. Disable auto-start: systemctl disable openclaw
3. Back up configuration: save key configs to /tmp/openclaw-backup-{timestamp}/ before deletion
4. Delete the Telegram Bot: call the BotFather API to deregister the Bot
5. Clean up session data: delete all agent session files to free disk space
6. Uninstall OpenClaw: npm uninstall -g openclaw, or remove the installation directory
7. Clean up the config directory: delete ~/.openclaw/
8. Clean up the systemd unit file: delete /etc/systemd/system/openclaw.service and reload the daemon
9. Update the OCM registry: only now delete the record from the database
10. Verify cleanup results: SSH back in to confirm all processes, ports, and files are gone
11. Log the operation: complete audit trail

The key design principle: clean up remote resources first, delete local records last. If any step fails along the way, the local record still exists, so the user can see the node is in a "cleaning up" state and retry.
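
The ordering can be illustrated with a minimal sketch. It is not OCM's actual code: the in-memory `registry`, the `'cleaning-up'` status string, and the `remoteCleanup` callback are all assumptions standing in for the real database table and the remote steps.

```javascript
// Hypothetical in-memory stand-in for the OCM nodes table.
const registry = new Map([['pc-b', { status: 'active' }]]);

// remoteCleanup is an assumed async function that runs the remote steps
// and throws if a critical step fails.
async function deleteNodeRecordLast(nodeId, remoteCleanup) {
  const node = registry.get(nodeId);
  if (!node) throw new Error(`unknown node: ${nodeId}`);

  // Mark the node first so the UI can show that cleanup is in progress.
  node.status = 'cleaning-up';

  // If this throws, the record stays in 'cleaning-up' and the user can retry.
  await remoteCleanup(nodeId);

  // Only after the remote side is clean does the local record go away.
  registry.delete(nodeId);
}
```

Because the record is removed on the last line, a failure anywhere earlier leaves something visible for the user to retry against.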

Safety Features

During implementation, I built in several important safety mechanisms:

Smart error handling: Each step has its own independent try-catch. If step 3 (backup) fails, it won't affect subsequent cleanup, but it will be clearly flagged in the logs. If a critical step (like stopping the service) fails, the flow is halted and the user is notified.

Timeout protection: Each SSH command has a timeout (default 30 seconds). I once encountered a situation where the target machine's network was down and SSH just hung. The timeout mechanism ensures we never wait indefinitely.
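
One way to bound an SSH call from Node.js is sketched below. It assumes the commands are launched with `child_process.execFile`, which may not be how OCM does it; `buildSshArgs` and `runWithTimeout` are hypothetical helper names. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options, and the usage comment assumes key-based login and passwordless sudo on the target.

```javascript
const { execFile } = require('child_process');

// Builds ssh arguments; the host and command are supplied by the caller.
function buildSshArgs(host, command, connectTimeoutSec = 10) {
  return [
    '-o', 'BatchMode=yes',                       // never block on a password prompt
    '-o', `ConnectTimeout=${connectTimeoutSec}`, // bound the connection attempt
    host,
    command,
  ];
}

// Runs a program and kills it if it has not exited within timeoutMs.
function runWithTimeout(file, args, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    execFile(file, args, { timeout: timeoutMs }, (err, stdout, stderr) => {
      if (err) {
        // err.killed is true when Node terminated the child, which happens
        // on timeout (or when the output exceeds maxBuffer).
        err.timedOut = err.killed === true;
        reject(err);
      } else {
        resolve({ stdout, stderr });
      }
    });
  });
}

// Usage (host name taken from the article):
// runWithTimeout('ssh', buildSshArgs('user@pc-b', 'sudo systemctl stop openclaw'));
```

The two limits cover different failures: `ConnectTimeout` handles an unreachable host, while the `execFile` timeout handles a command that connects and then hangs.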

Detailed logging: The result, duration, and error information for each step are all recorded. These logs have saved me several times during post-incident investigations.

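// executeWithTimeout, DeletionError, and log are assumed to be defined elsewhere.
// The step functions (stopService, disableAutostart, backupConfig) are
// placeholder names for the handlers that run each remote command.
async function deleteNode(nodeId) {
  const steps = [
    { name: 'stop-service', critical: true, timeout: 30000, fn: () => stopService(nodeId) },
    { name: 'disable-autostart', critical: false, timeout: 10000, fn: () => disableAutostart(nodeId) },
    { name: 'backup-config', critical: false, timeout: 60000, fn: () => backupConfig(nodeId) },
    // ... subsequent steps
  ];

  for (const step of steps) {
    try {
      await executeWithTimeout(step.fn, step.timeout);
      log.info(`✅ ${step.name} completed`);
    } catch (err) {
      // A failed critical step halts the flow; other failures are logged and skipped.
      if (step.critical) throw new DeletionError(step.name, err);
      log.warn(`⚠️ ${step.name} failed (non-critical): ${err.message}`);
    }
  }
}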
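
The snippet above calls `executeWithTimeout` and `DeletionError` without showing them. A minimal sketch of what they might look like follows, together with a hypothetical `runStep` wrapper that captures the result, duration, and error of a step for the audit log. These are illustrative guesses, not the real OCM helpers.

```javascript
// Hypothetical error type: wraps the failure of a critical step.
class DeletionError extends Error {
  constructor(stepName, cause) {
    super(`deletion halted at step "${stepName}": ${cause.message}`);
    this.name = 'DeletionError';
    this.stepName = stepName;
    this.cause = cause;
  }
}

// Rejects if fn() has not settled within timeoutMs. Note that this only
// stops waiting; it does not cancel the underlying work.
async function executeWithTimeout(fn, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    return await Promise.race([Promise.resolve().then(fn), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Runs one step and returns an audit record with result, duration, and error.
async function runStep(step) {
  const startedAt = Date.now();
  try {
    await executeWithTimeout(step.fn, step.timeout);
    return { step: step.name, ok: true, durationMs: Date.now() - startedAt, error: null };
  } catch (err) {
    return { step: step.name, ok: false, durationMs: Date.now() - startedAt, error: err.message };
  }
}
```

Since the promise-level timeout cannot stop a hung child process by itself, it works best when paired with a timeout on the SSH command.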

Manually Cleaning Up PC-B

Before deploying the new code, I first manually cleaned up PC-B. The whole process took about 15 minutes:

ssh user@pc-b
sudo systemctl stop openclaw
sudo systemctl disable openclaw
sudo rm /etc/systemd/system/openclaw.service
sudo systemctl daemon-reload
rm -rf ~/.openclaw
# Confirm cleanup is complete
ps aux | grep '[o]penclaw'  # no results (the brackets keep grep from matching itself)
ss -tlnp | grep 18789   # no results
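
The process and port checks above could be automated for the verification step. The sketch below is one possible shape, not OCM's implementation: `run` is an assumed helper that executes a shell command on the target node and resolves to `{ code, stdout }`. It is injected so the logic can be tested without a real SSH connection. Port 18789 and the unit file path come from the commands in this post.

```javascript
async function verifyCleanup(run, port = 18789) {
  const problems = [];

  // pgrep exits with 0 when at least one process matches. The [o] trick
  // keeps the pattern from matching the shell that carries this command.
  const proc = await run("pgrep -f '[o]penclaw'");
  if (proc.code === 0) problems.push('openclaw process still running');

  // A listening socket shows the port at the end of the local address column.
  const sockets = await run('ss -tln');
  if (sockets.stdout.split('\n').some((line) => line.includes(`:${port} `))) {
    problems.push(`port ${port} still listening`);
  }

  // test -e exits with 0 when the path exists.
  const unit = await run('test -e /etc/systemd/system/openclaw.service');
  if (unit.code === 0) problems.push('systemd unit file still present');

  return { clean: problems.length === 0, problems };
}
```

Returning the list of problems rather than a bare boolean lets the UI tell the user exactly what is left over.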

The feeling after cleanup was done — refreshing. Like finishing cleaning a room that had been full of clutter.

Lessons Learned

This bug gave me a deep understanding of the complexity behind the seemingly simple operation of "deletion." In distributed systems, any operation involving state changes across multiple nodes must never be done halfway.

Core principle: To the user, "delete" means "completely gone." If you can't make it truly complete, you shouldn't display "success."

Going forward, I plan to add a "deletion pre-check" feature to OCM: first verify whether the target node is reachable and check the service status, then give the user a clear preview: "The following will be cleaned up: xxx." The goal is to make deletion transparent, controllable, and trustworthy.
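
As a rough illustration of what such a pre-check might return, here is a small sketch. The feature does not exist yet, so everything here is hypothetical: the function name, the shape of the `probe` object (`reachable`, `serviceActive`, `paths`), and the message wording.

```javascript
// Builds the preview shown to the user before deletion starts.
function buildDeletionPreview(nodeName, probe) {
  if (!probe.reachable) {
    return {
      canProceed: false,
      message: `${nodeName} is unreachable; remote cleanup cannot run.`,
    };
  }

  // Collect everything the cleanup flow would touch on the node.
  const items = [];
  if (probe.serviceActive) items.push('running openclaw service');
  for (const path of probe.paths) items.push(path);

  return {
    canProceed: true,
    message: items.length
      ? `The following will be cleaned up on ${nodeName}: ${items.join(', ')}`
      : `Nothing to clean up on ${nodeName}; only the OCM record will be removed.`,
  };
}

// Example call with made-up probe data:
// buildDeletionPreview('PC-B', { reachable: true, serviceActive: true, paths: ['~/.openclaw/'] });
```

Refusing to proceed when the node is unreachable is the pre-check counterpart of never showing "success" for a cleanup that did not happen.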

📌 This article is written by the AI team at TechsFree
