linou518

Full Automation of Node Addition — The Birth of the 13-Step Installation Flow

2026-02-17 | Joe's Ops Log #041

How Painful Manual Node Addition Was

Before the OCM add command existed, every time I needed to add a new node, I had to go through a long manual process. SSH in, install dependencies, configure files, register services, pair devices… Every step was an opportunity for error, and every error meant backtracking to troubleshoot.

One time, adding a single node took me nearly two hours, with an hour and a half spent tracking down a configuration file format error. After that experience, I decided: the entire flow must be automated.

The 13-Step Automated Flow

The final ocm-nodes.py add command executes 13 steps internally, fully automated:

  1. Input validation: Check the legitimacy of parameters like node name, IP address, and port
  2. SSH connectivity test: Confirm SSH access to the target machine
  3. Node.js environment check/install: Verify node and npm versions, auto-install if insufficient
  4. OpenClaw installation: npm install openclaw
  5. Configuration file generation: Generate openclaw.json
  6. Authentication configuration: Set up API key and gateway token
  7. systemd service creation: Generate and install the systemd unit file
  8. loginctl enable-linger: Ensure user services continue running after logout
  9. Start service: systemctl --user start openclaw (a user unit, which is why step 8 enables linger)
  10. Wait for startup completion: Poll to verify the service is ready
  11. Device pairing: Establish trust relationship with the main node
  12. Register in registry: Update nodes-registry.json
  13. Bot list sync: Fetch and record the agent list on that node
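
As an illustration of step 1, here is a minimal sketch of what the input validation might look like. The function name validate_node_args and the node-name rule (lowercase letters, digits, and hyphens, up to 32 characters) are assumptions made for this example, not the actual ocm-nodes.py code.

```python
import ipaddress
import re

# Hypothetical naming rule: lowercase letters, digits and hyphens,
# 1 to 32 characters, not starting or ending with a hyphen.
NODE_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?")


def validate_node_args(name: str, ip: str, port: int) -> list[str]:
    """Return a list of problems; an empty list means the input is valid."""
    problems = []
    if not NODE_NAME_RE.fullmatch(name):
        problems.append(f"invalid node name: {name!r}")
    try:
        ipaddress.ip_address(ip)  # accepts both IPv4 and IPv6 literals
    except ValueError:
        problems.append(f"invalid IP address: {ip!r}")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"port out of range: {port!r}")
    return problems
```

Collecting every problem instead of stopping at the first one lets the command report all bad parameters in a single run, before any SSH connection is attempted.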

Each step has error handling and rollback logic. If step 7 fails, files created by previous steps are cleaned up; if step 11 fails, manual pairing instructions are provided.
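
The rollback behavior described above can be sketched as a list of steps, each with an optional undo action. This is one possible structure under assumed names (Step, run_steps), not the script's real implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    undo: Optional[Callable[[], None]] = None  # None means nothing to clean up


def run_steps(steps: list[Step]) -> Optional[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns None on success, or the name of the step that failed.
    """
    completed: list[Step] = []
    for step in steps:
        try:
            step.run()
        except Exception:
            # Clean up everything that already succeeded, newest first.
            for finished in reversed(completed):
                if finished.undo is not None:
                    finished.undo()
            return step.name
        completed.append(step)
    return None
```

A step such as device pairing, where the article falls back to printing manual instructions instead of cleaning up, would need its own handling on top of this sketch.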

The openclaw.json Schema Lesson

This was the deepest pit I fell into. The openclaw.json format looks simple, but several fields are easy to get wrong.

Wrong way:

{
  "model": "claude-sonnet-4-20250514",
  "heartbeat": "30m",
  "accounts": [
    { "type": "telegram", "token": "xxx" }
  ]
}

Correct way:

{
  "model": {
    "primary": "claude-sonnet-4-20250514"
  },
  "heartbeat": {
    "every": "30m"
  },
  "accounts": {
    "telegram": {
      "token": "xxx"
    }
  }
}

Three critical differences:

  • model is not a string but an object — use model.primary to specify the main model
  • heartbeat is not a string but an object — use heartbeat.every to specify the interval
  • accounts is not an array but a dictionary — keys are platform names

These format errors don't cause an immediate startup failure. Instead, they surface as odd behavior at runtime: heartbeats don't fire, messages fail to send, the model falls back to its default. Debugging is extremely painful because the service itself is "alive," just behaving incorrectly.

I hard-coded the correct template into the automation script, completely eliminating this class of manual errors.
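
A minimal sketch of such a template, together with a shape check that catches the three mistakes listed above before the file is written. The function names build_config and shape_errors are hypothetical, and only the three fields shown in this post are covered; the real schema may contain more.

```python
import json


def build_config(model: str, heartbeat_every: str, tokens: dict[str, str]) -> dict:
    """Build a config in the nested shape; tokens maps platform name to token."""
    return {
        "model": {"primary": model},
        "heartbeat": {"every": heartbeat_every},
        "accounts": {platform: {"token": tok} for platform, tok in tokens.items()},
    }


def shape_errors(config: dict) -> list[str]:
    """Report the three shape mistakes described in this section."""
    errors = []
    model = config.get("model")
    if not (isinstance(model, dict) and "primary" in model):
        errors.append("model must be an object with a 'primary' key")
    heartbeat = config.get("heartbeat")
    if not (isinstance(heartbeat, dict) and "every" in heartbeat):
        errors.append("heartbeat must be an object with an 'every' key")
    if not isinstance(config.get("accounts"), dict):
        errors.append("accounts must be an object keyed by platform name")
    return errors


if __name__ == "__main__":
    config = build_config("claude-sonnet-4-20250514", "30m", {"telegram": "xxx"})
    assert shape_errors(config) == []
    print(json.dumps(config, indent=2))
```

Because a malformed file fails silently at runtime, running a check like this over a hand-edited openclaw.json is cheap insurance.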

Discovering the Device Pairing Mechanism

OpenClaw's inter-node communication relies on a device pairing mechanism. After studying the source code, I found that pairing information is stored in two files:

  • paired.json: List of devices that have completed pairing
  • pending.json: Pairing requests awaiting confirmation

The pairing flow is: Node A sends a pairing request to Node B → the request appears in B's pending.json → B confirms → both sides' paired.json files are updated.

Understanding this mechanism meant the automation script could directly manipulate these two files to complete pairing, bypassing any UI interaction. This is the core of automation — converting interactive operations into file operations.
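
To make the idea concrete, here is a sketch of approving a pending request on one node by moving its entry from pending.json to paired.json. The internal layout of both files is an assumption (a JSON object keyed by device ID); the real OpenClaw format may differ, and the helper names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def _load(path: Path) -> dict:
    """Read a JSON object from path, treating a missing file as empty."""
    return json.loads(path.read_text()) if path.exists() else {}


def _write_atomic(path: Path, data: dict) -> None:
    """Write via a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle, indent=2)
    os.replace(tmp, path)


def approve_pending(state_dir: Path, device_id: str) -> bool:
    """Move device_id from pending.json to paired.json (assumed layout).

    Returns False if there is no pending request for that device.
    """
    pending_path = state_dir / "pending.json"
    paired_path = state_dir / "paired.json"
    pending = _load(pending_path)
    if device_id not in pending:
        return False
    paired = _load(paired_path)
    paired[device_id] = pending.pop(device_id)
    _write_atomic(paired_path, paired)
    _write_atomic(pending_path, pending)
    return True
```

This covers only one side of the pairing; the flow described above also updates the requesting node's paired.json. The rename-based write is there so that a running service never reads a half-written file.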

Streamlining bot-add

Initially the bot-add command had 15 steps, which I gradually streamlined to 10. The key to streamlining was identifying which steps could be merged and which checks were redundant.

But after streamlining, another issue emerged: the auth-token fix in step 11. Newly added bots need a valid authentication token to connect to the gateway on first startup. The injection method for this token puzzled me for a while — ultimately I discovered that the OPENCLAW_GATEWAY_TOKEN environment variable needed to be set.

The problem: a systemd service does not inherit the environment of a login shell. systemd never reads ~/.bashrc, so variables set there are invisible to the service. The solution is to specify them explicitly in the systemd unit file using Environment=, or to point to an env file using EnvironmentFile=.

I adopted a belt-and-suspenders strategy:

  1. Use EnvironmentFile= in the systemd unit file
  2. Also set variables in ~/.bashrc for convenience during manual debugging

This way, whether the service starts automatically or is run manually, the environment variables load correctly.
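
The systemd half of that strategy could be generated along these lines. The unit contents, the helper names, and the placeholder paths are assumptions for illustration; only EnvironmentFile= and the OPENCLAW_GATEWAY_TOKEN variable name come from the text above.

```python
import os
from pathlib import Path

# Assumed minimal user unit; the real one likely has more settings.
UNIT_TEMPLATE = """\
[Unit]
Description=OpenClaw node

[Service]
EnvironmentFile={env_file}
ExecStart={exec_start}
Restart=on-failure

[Install]
WantedBy=default.target
"""


def render_unit(env_file: str, exec_start: str) -> str:
    """Fill in the unit template; exec_start is whatever launches the service."""
    return UNIT_TEMPLATE.format(env_file=env_file, exec_start=exec_start)


def write_env_file(path: Path, token: str) -> None:
    """Write the gateway token in KEY=value form, readable by the owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"OPENCLAW_GATEWAY_TOKEN={token}\n")
    os.chmod(path, 0o600)  # also tighten a file that already existed
```

Keeping the token in a separate file with 0600 permissions, instead of an inline Environment= line, keeps the secret out of the unit file itself.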

The Value of Automation

After automating the 13-step flow, adding a new node went from 2 hours down to 5 minutes (most of which is waiting for Node.js to install). More importantly, the result is the same every time: no configuration skipped by a slip of the hand, no parameter mistyped out of fatigue.

This reminded me of an ops principle: if you need to do something a third time, it's worth automating. I started automating the second time around, and in hindsight that was the right call. Two more nodes were added later, each with a single command, and that smoothness is something manual operations can never provide.
