Full Automation of Node Addition — The Birth of the 13-Step Installation Flow
2026-02-17 | Joe's Ops Log #041
How Painful Manual Node Addition Was
Before the OCM add command existed, every time I needed to add a new node, I had to go through a long manual process. SSH in, install dependencies, configure files, register services, pair devices… Every step was an opportunity for error, and every error meant backtracking to troubleshoot.
One time, adding a single node took me nearly two hours, with an hour and a half spent tracking down a configuration file format error. After that experience, I decided: the entire flow must be automated.
The 13-Step Automated Flow
The final ocm-nodes.py add command executes 13 steps internally, fully automated:
- Input validation: Check the legitimacy of parameters like node name, IP address, and port
- SSH connectivity test: Confirm SSH access to the target machine
- Node.js environment check/install: Verify node and npm versions, auto-install if insufficient
- OpenClaw installation: npm install openclaw
- Configuration file generation: Generate openclaw.json
- Authentication configuration: Set up API key and gateway token
- systemd service creation: Generate and install the systemd unit file
- loginctl enable-linger: Ensure user services continue running after logout
- Start service: systemctl start openclaw
- Wait for startup completion: Poll to verify the service is ready
- Device pairing: Establish trust relationship with the main node
- Register in registry: Update nodes-registry.json
- Bot list sync: Fetch and record the agent list on that node
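As an illustration of step 1, here is a minimal sketch of input validation. The node-name rule (lowercase letters, digits, and hyphens, up to 63 characters) and the function name validate_node_args are assumptions made for this example, not the actual rules in ocm-nodes.py.

```python
import ipaddress
import re

# ASSUMPTION: hostname-style node names; the real script may allow other forms.
NODE_NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def validate_node_args(name: str, ip: str, port: int) -> list[str]:
    """Return a list of problems; an empty list means the input is acceptable."""
    problems = []
    if not NODE_NAME_RE.fullmatch(name):
        problems.append(f"invalid node name: {name!r}")
    try:
        # Accepts both IPv4 and IPv6 literals, raises ValueError otherwise.
        ipaddress.ip_address(ip)
    except ValueError:
        problems.append(f"invalid IP address: {ip!r}")
    if not (1 <= port <= 65535):
        problems.append(f"port out of range: {port}")
    return problems
```

Collecting every problem before returning, rather than stopping at the first one, lets a single run report all bad parameters at once.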
Each step has error handling and rollback logic. If step 7 fails, files created by previous steps are cleaned up; if step 11 fails, manual pairing instructions are provided.
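The rollback idea can be sketched as a small step runner. This is one possible structure, not the actual code in ocm-nodes.py: the name run_steps and the (name, action, undo) tuple layout are hypothetical.

```python
from typing import Callable, Optional

# Each step is (name, action, undo); undo may be None if nothing needs cleanup.
Step = tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_steps(steps: list[Step]) -> None:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    undo_stack: list[tuple[str, Callable[[], None]]] = []
    for name, action, undo in steps:
        try:
            action()
        except Exception as exc:
            print(f"step failed: {name}: {exc}")
            # Undo the most recent successful step first.
            for done_name, done_undo in reversed(undo_stack):
                print(f"rolling back: {done_name}")
                done_undo()
            raise
        if undo is not None:
            undo_stack.append((name, undo))
```

A step such as device pairing, where the desired reaction is to print manual instructions instead of cleaning up, would need its own handler in place of the generic rollback shown here.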
The openclaw.json Schema Lesson
This was the worst trap I fell into. The openclaw.json configuration format looks simple, but several parts of it are easy to get wrong.
Wrong way:
{
  "model": "claude-sonnet-4-20250514",
  "heartbeat": "30m",
  "accounts": [
    { "type": "telegram", "token": "xxx" }
  ]
}
Correct way:
{
  "model": {
    "primary": "claude-sonnet-4-20250514"
  },
  "heartbeat": {
    "every": "30m"
  },
  "accounts": {
    "telegram": {
      "token": "xxx"
    }
  }
}
Three critical differences:
- model is not a string but an object: use model.primary to specify the main model
- heartbeat is not a string but an object: use heartbeat.every to specify the interval
- accounts is not an array but a dictionary: keys are platform names
These format errors don't cause an immediate startup failure. Instead, they produce odd behavior at runtime: heartbeats don't fire, messages fail to send, the model falls back to defaults. Debugging is extremely painful because the service itself is "alive," just behaving incorrectly.
I hard-coded the correct template into the automation script, completely eliminating this class of manual errors.
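A template of this kind might look like the sketch below. It covers only the three fields shown in the examples above; the function names build_config and find_shape_errors are hypothetical, and the checker flags just the three shape mistakes, not the full openclaw.json schema.

```python
import json


def build_config(model: str, heartbeat_every: str, accounts: dict[str, dict]) -> dict:
    """Build a config in the nested shape shown in the 'Correct way' example."""
    return {
        "model": {"primary": model},
        "heartbeat": {"every": heartbeat_every},
        "accounts": accounts,  # dict keyed by platform name, e.g. "telegram"
    }


def find_shape_errors(config: dict) -> list[str]:
    """Flag the three shape mistakes described above; checks nothing else."""
    errors = []
    model = config.get("model")
    if not isinstance(model, dict) or "primary" not in model:
        errors.append("model must be an object with a 'primary' key")
    heartbeat = config.get("heartbeat")
    if not isinstance(heartbeat, dict) or "every" not in heartbeat:
        errors.append("heartbeat must be an object with an 'every' key")
    if not isinstance(config.get("accounts"), dict):
        errors.append("accounts must be a dictionary keyed by platform name")
    return errors


if __name__ == "__main__":
    cfg = build_config("claude-sonnet-4-20250514", "30m", {"telegram": {"token": "xxx"}})
    print(json.dumps(cfg, indent=2))
```

Running a check like find_shape_errors before writing the file turns a silent runtime misbehavior into an immediate, readable error.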
Discovering the Device Pairing Mechanism
OpenClaw's inter-node communication relies on a device pairing mechanism. After studying the source code, I found that pairing information is stored in two files:
- paired.json: List of devices that have completed pairing
- pending.json: Pairing requests awaiting confirmation
The pairing flow is: Node A sends a pairing request to Node B → the request appears in B's pending.json → B confirms → both sides' paired.json files are updated.
Understanding this mechanism meant the automation script could directly manipulate these two files to complete pairing, bypassing any UI interaction. This is the core of automation — converting interactive operations into file operations.
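To make the "file operations instead of UI" idea concrete, here is a rough sketch of approving a pending request on one node. The on-disk format is an assumption: the code treats both files as JSON lists of objects with a "deviceId" key, which may not match OpenClaw's real layout. The function name approve_pending is also hypothetical.

```python
import json
from pathlib import Path


def approve_pending(state_dir: Path, device_id: str) -> bool:
    """Move one request from pending.json to paired.json.

    ASSUMPTION: both files hold a JSON list of objects with a "deviceId" key.
    Inspect the real files before adapting this.
    """
    pending_path = state_dir / "pending.json"
    paired_path = state_dir / "paired.json"
    pending = json.loads(pending_path.read_text()) if pending_path.exists() else []
    paired = json.loads(paired_path.read_text()) if paired_path.exists() else []

    match = next((r for r in pending if r.get("deviceId") == device_id), None)
    if match is None:
        return False  # no such request; leave both files untouched

    pending = [r for r in pending if r.get("deviceId") != device_id]
    if all(r.get("deviceId") != device_id for r in paired):
        paired.append(match)

    # Write paired first so a crash in between never loses the request.
    paired_path.write_text(json.dumps(paired, indent=2))
    pending_path.write_text(json.dumps(pending, indent=2))
    return True
```

This handles only one side. Since both nodes' paired.json files are updated in the real flow, an automation script would need to apply the matching change on the other node as well.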
Streamlining bot-add
Initially the bot-add command had 15 steps, which I gradually streamlined to 10. The key to streamlining was identifying which steps could be merged and which checks were redundant.
But after streamlining, another issue emerged: the auth-token fix in step 11. Newly added bots need a valid authentication token to connect to the gateway on first startup. The injection method for this token puzzled me for a while — ultimately I discovered that the OPENCLAW_GATEWAY_TOKEN environment variable needed to be set.
The problem: environment variables read during systemd service startup differ from those at user login. Even if you set environment variables in ~/.bashrc, systemd can't see them. The solution is to explicitly specify them in the systemd unit file using Environment=, or point to an env file using EnvironmentFile=.
I adopted a belt-and-suspenders strategy:
- Use EnvironmentFile= in the systemd unit file
- Also set variables in ~/.bashrc for convenience during manual debugging
This way, whether the service starts automatically or is run manually, the environment variables load correctly.
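A sketch of how a script could render the unit file and its env file follows. The ExecStart command and the env file path are placeholders supplied by the caller, since the real ones are not shown here; render_unit and render_env_file are hypothetical names. Only EnvironmentFile= and OPENCLAW_GATEWAY_TOKEN come from the setup described above.

```python
def render_unit(exec_start: str, env_file: str) -> str:
    """Render a minimal systemd user unit that loads variables from env_file."""
    return "\n".join([
        "[Unit]",
        "Description=OpenClaw node",
        "",
        "[Service]",
        # systemd reads KEY=value lines from this file at service start.
        f"EnvironmentFile={env_file}",
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        # default.target is the usual install target for user services.
        "WantedBy=default.target",
        "",
    ])


def render_env_file(gateway_token: str) -> str:
    """Render the env file: plain KEY=value lines, no shell 'export' keyword."""
    return f"OPENCLAW_GATEWAY_TOKEN={gateway_token}\n"
```

The env file uses plain KEY=value lines, whereas ~/.bashrc needs export in front of the variable for child processes to see it, so the two copies are written slightly differently even though they carry the same value.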
The Value of Automation
After automating the 13-step flow, adding a new node went from 2 hours down to 5 minutes (most of which is waiting for Node.js to install). More importantly, the results are consistent every time — no accidentally missing a configuration due to a slip of the hand, no getting a parameter wrong due to fatigue.
This reminded me of an ops principle: if you need to do something a third time, it's worth automating. I started automating at the second time, and in hindsight this decision was absolutely correct. Two more nodes were added later, each with a single command, and that smoothness is something manual operations can never provide.