James Lee

Posted on May 17

Building an Ops Job Control System: Script Execution, File Distribution & Configuration Management

#automation #devops #systemdesign #tooling

What Is a "Job"?

A job is an abstract definition of a series of ops operations. Any ops task — whether a release, a change, or an incident response — can be decomposed into ordered steps and target objects.

Ops Work
├── Command       — a single independent operation (start/stop service, run script)
├── File Transfer — distribute a file to a target path on target machines
└── Job           — an ordered combination of commands and file transfers
                    with defined execution targets

The power of this abstraction: once ops work is decomposed into reusable, extensible, independently executable units, automated platform-level scheduling becomes possible.
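
To make the abstraction concrete, here is a minimal sketch of the three units as Python dataclasses. The class and field names (`Command`, `FileTransfer`, `Job`, `run_as`, `dest_path`) are illustrative assumptions, not the schema of any particular platform.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Command:
    """A single independent operation, e.g. restarting a service."""
    script: str
    run_as: str = "root"  # hypothetical field: the execution account


@dataclass
class FileTransfer:
    """Distribute one file to a target path on the target machines."""
    source: str
    dest_path: str


@dataclass
class Job:
    """An ordered combination of commands and file transfers with targets."""
    name: str
    targets: List[str]
    steps: List[Union[Command, FileTransfer]] = field(default_factory=list)

    def describe(self) -> List[str]:
        """Return one human-readable line per step, in execution order."""
        lines = []
        for i, step in enumerate(self.steps, start=1):
            if isinstance(step, Command):
                lines.append(f"{i}. run {step.script!r} as {step.run_as}")
            else:
                lines.append(f"{i}. copy {step.source} -> {step.dest_path}")
        return lines


# Example: a release expressed as one file transfer followed by one command.
release = Job(
    name="release-web",
    targets=["10.0.1.5", "10.0.1.6"],
    steps=[
        FileTransfer("build/app.tar.gz", "/opt/app/app.tar.gz"),
        Command("systemctl restart app"),
    ],
)
```

Because each step is a plain value, the same `Command` or `FileTransfer` can be reused across jobs, which is what makes platform-level scheduling practical.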

Core Feature Matrix

Feature	Capabilities
Script Execution	Shell & Python support; fast batch execution; task result management
File Distribution	Basic file push; P2P large-file transfer; MD5 consistency verification
Scheduled Tasks	Physical host crontab management; supports one-time and recurring schedules; ideal for off-hours execution
Canary Rollout	Step-based canary; server-count-based canary
Custom Jobs	Compose complex tasks from atomic units; control sub-task impact on parent; support sequential/parallel ordering
Batch Execution	Execute across a server group; customizable server grouping; supports 1000+ concurrent tasks

1. Script Execution

Execution Modes

┌─────────────────────────────────────────────────────────────┐
│  Mode 1: SSH Remote Execution                               │
│                                                             │
│  Center Node ──SSH──▶ Remote Server (execute script)        │
│                                                             │
│  Pros: Simple, no agent needed                              │
│  Cons: Each command spawns a process on center node;        │
│        high center node pressure; slower execution          │
├─────────────────────────────────────────────────────────────┤
│  Mode 2: Agent Execution                                    │
│                                                             │
│  Push model:                                                │
│  Center Node ──task──▶ Agent ──execute──▶ result ──▶ Center │
│                                                             │
│  Pull model:                                                │
│  Agent ──subscribe──▶ task queue ──execute──▶ push result   │
│                                                             │
│  Pros: High efficiency; async non-blocking; 1000+ concurrent│
│  Cons: Requires agent development and maintenance           │
└─────────────────────────────────────────────────────────────┘

Implementation options for SSH mode:

ssh command (shell)
Python: paramiko, ansible, fabric

Agent requirements:

Async non-blocking execution model
Self-registration and health reporting
Extensible for config reporting and metrics collection
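
As a rough illustration of the pull model, the sketch below has an agent loop take tasks from a queue and run them as subprocesses without blocking the event loop. The in-process `asyncio.Queue` stands in for a real task queue, and the task format `(task_id, argv)` and the `None` stop sentinel are assumptions made for the example.

```python
import asyncio
import sys


async def run_task(argv):
    """Run one command without blocking the event loop; return (exit_code, output)."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT
    )
    out, _ = await proc.communicate()
    return proc.returncode, out.decode().strip()


async def agent_loop(task_queue, results):
    """Pull model: take tasks from the queue until a None sentinel arrives."""
    while True:
        task = await task_queue.get()
        if task is None:
            break
        task_id, argv = task
        results[task_id] = await run_task(argv)


async def demo():
    # The queue is local here; a real agent would subscribe to a remote queue.
    queue, results = asyncio.Queue(), {}
    await queue.put(("t1", [sys.executable, "-c", "print('hello')"]))
    await queue.put(("t2", [sys.executable, "-c", "raise SystemExit(3)"]))
    await queue.put(None)
    await agent_loop(queue, results)
    return results
```

A single `agent_loop` handles one task at a time; running several loops against the same queue is one way to get concurrency on a host. Self-registration and health reporting are left out of this sketch.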

Supporting Sub-Systems

Sub-System	Purpose
Script Management	Manage user-uploaded scripts and common base scripts
Account Management	Manage execution users (root / business accounts)
Server Grouping	Organize servers into static or dynamic groups

Batch Execution Capacity

Execution Mode	Max Concurrency
SSH	~30–50 tasks simultaneously
Agent (async)	1000+ tasks simultaneously
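
Whatever the transport, a batch runner usually needs an explicit concurrency cap. The following sketch shows one way to enforce it with `asyncio.Semaphore`; `run_on_host` is a hypothetical coroutine standing in for the real SSH or agent call, and `fake_run` exists only to exercise the example.

```python
import asyncio


async def run_batch(hosts, run_on_host, max_concurrency):
    """Run run_on_host(host) for every host, at most max_concurrency at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(host):
        async with gate:
            try:
                return host, await run_on_host(host)
            except Exception as exc:  # one failed host must not abort the batch
                return host, exc

    pairs = await asyncio.gather(*(guarded(h) for h in hosts))
    return dict(pairs)


async def fake_run(host):
    """Stand-in for an SSH or agent call (hypothetical)."""
    await asyncio.sleep(0.01)
    if host.endswith(".13"):
        raise RuntimeError("unreachable")
    return "ok"
```

Calling `asyncio.run(run_batch(hosts, fake_run, 50))` returns a dict that maps each host to its result or to the exception it raised, so per-host failures can be reported without losing the rest of the batch.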

Distributed Architecture for Large Fleets

┌─────────────────────────────────────────────────────┐
│              Central Dispatch Node                  │
└──────────────────────┬──────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
  ┌──────────┐   ┌──────────┐   ┌──────────┐
  │  IDC-A   │   │  IDC-B   │   │  IDC-C   │
  │  Relay   │   │  Relay   │   │  Relay   │
  └────┬─────┘   └────┬─────┘   └────┬─────┘
       │              │              │
  ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
  │Agent×N  │    │Agent×N  │    │Agent×N  │
  └─────────┘    └─────────┘    └─────────┘

One central dispatch node → one relay per datacenter → one agent per server.
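
The routing step on the central node can be sketched as a simple grouping by datacenter. The `host_idc` and `relays` lookup tables and the message shape are assumptions for illustration; in practice they would come from the resource management system.

```python
from collections import defaultdict


def plan_dispatch(task_id, hosts, host_idc, relays):
    """Group target hosts by datacenter and address each group to its relay."""
    by_idc = defaultdict(list)
    for host in hosts:
        by_idc[host_idc[host]].append(host)  # KeyError if a host has no known IDC
    return [
        {"relay": relays[idc], "task_id": task_id, "hosts": members}
        for idc, members in sorted(by_idc.items())
    ]
```

The central node then sends one message per relay instead of one per server, which keeps its connection count proportional to the number of datacenters.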

2. File Distribution

File Types

Type	Examples	Special Handling
Regular files	Code packages, images, software packages	MD5 check before transfer
Template files	Config files (nginx.conf, app.yml)	Variable substitution before transfer

Template file flow:

Template file (with variables)
     │
     ▼
Load variables from Config Management
     │
     ▼
Render → temporary plain file
     │
     ▼
Distribute to target servers

Distribution Methods

Method 1: Management node → target servers

Management Node
     │
     ├── MD5 check (skip if identical)
     │
     ▼
rsync / ansible copy / salt file module
     │
     ▼
Target Servers

MD5 pre-check avoids unnecessary transfers — significant time savings for large files.
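
A minimal sketch of the pre-check, assuming the target's current MD5 is obtained separately (for example, reported by an agent) and passed in as `remote_md5`. The function names are illustrative.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_transfer(local_path, remote_md5):
    """Return True unless the target already reports the same MD5.

    remote_md5 is None when the file does not exist on the target yet.
    """
    return remote_md5 is None or md5_of(local_path) != remote_md5
```

MD5 is used here only to detect identical content, as in the flow above; it is not suitable as a defense against deliberate tampering.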

Recommended tools:

ansible copy module
salt file module
Custom rsync wrapper (supports MD5 check + resume-on-break)

Method 2: P2P transfer between servers (large files)

Source Server A (e.g. 500GB image)
     │
     ├──rsync──▶ Server B
     ├──rsync──▶ Server C
     └──rsync──▶ Server D ...

Agent wraps rsync for inter-server transfers
Config Management auto-selects transfer source
True P2P: no manual source selection required
rsync handles large files and verifies each transferred file with a whole-file checksum; resuming an interrupted transfer requires options such as --partial
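
How the transfer source gets selected is not spelled out above, so the following is only one plausible policy: prefer a host in the same datacenter that already holds the file, then the one with the fewest transfers in flight. The inputs `holders`, `host_idc`, and `active_transfers` are hypothetical.

```python
def pick_source(target, holders, host_idc, active_transfers):
    """Choose a host that already has the file to serve `target`.

    Prefers a holder in the same datacenter, then the holder with the
    fewest transfers in flight. Returns None if no other host has the file.
    """
    candidates = [h for h in holders if h != target]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda h: (
            host_idc[h] != host_idc[target],  # False (same IDC) sorts first
            active_transfers.get(h, 0),       # then the least busy source
            h,                                # stable tie-break by name
        ),
    )
```

As more servers finish receiving the file they can be added to `holders`, so later targets spread across more sources instead of all pulling from Server A.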

3. Custom Jobs (Task Orchestration)

Custom jobs compose atomic operations into complex workflows:

Custom Job
├── Step 1: File Distribution (parallel)
│     ├── push config.yml to Group A
│     └── push config.yml to Group B
├── Step 2: Command Execution (sequential)
│     └── restart service on all targets
└── Step 3: Verification Command
      └── check service health

Key capabilities:

Mix file distribution and command steps freely
Control whether a sub-task failure blocks the parent job
Reorder steps at any time
Sequential and parallel execution modes
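
The capabilities above can be sketched as a small step runner. This is a simplified model, not a full workflow engine: each action is a hypothetical zero-argument callable returning True on success, and the `parallel` and `blocking` flags are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    actions: List[Callable[[], bool]]  # each action returns True on success
    parallel: bool = False             # run this step's actions concurrently?
    blocking: bool = True              # does a failure stop the parent job?


def run_job(steps):
    """Run steps in order; return (completed, [(step_name, step_ok), ...])."""
    report = []
    for step in steps:
        if step.parallel:
            with ThreadPoolExecutor(max_workers=len(step.actions) or 1) as pool:
                outcomes = list(pool.map(lambda action: action(), step.actions))
        else:
            outcomes = [action() for action in step.actions]
        ok = all(outcomes)
        report.append((step.name, ok))
        if not ok and step.blocking:
            return False, report  # a blocking failure aborts the remaining steps
    return True, report
```

A failed step with `blocking=False` is recorded in the report but does not stop the job, which is one way to model a sub-task whose failure should not affect the parent.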

4. Scheduled Tasks

Scheduled Task
├── One-time execution  — run at a specific datetime
├── Recurring           — standard crontab expression
└── Canary + Scheduled  — combine with canary rollout
                          (e.g. roll out to 10% of servers
                           each night at 2 AM)

Ideal for: database backups, log rotation, off-hours deployments, periodic health checks.

5. Canary Rollout

Two rollout dimensions:

Dimension 1: Step-based canary
┌──────────┬──────────┬──────────┬──────────┐
│  Step 1  │  Step 2  │  Step 3  │  Step 4  │
│  verify  │  10 svrs │  50 svrs │  all svrs│
└──────────┴──────────┴──────────┴──────────┘

Dimension 2: Server-count-based canary
┌──────────┬──────────┬──────────┐
│   10%    │   30%    │  100%    │
│  servers │  servers │  servers │
└──────────┴──────────┴──────────┘

Canary rollout integrates with scheduled tasks for automated progressive deployment.
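
For the server-count dimension, splitting a host list into waves is a small calculation. The sketch below assumes the percentages are cumulative (10% means the first 10% of servers, 30% means the first 30%, and so on) and rounds up so the first wave is never empty.

```python
def canary_batches(hosts, cumulative_percents):
    """Split hosts into waves; each percent is the cumulative share covered."""
    batches, done = [], 0
    total = len(hosts)
    for pct in cumulative_percents:
        upto = min(total, (total * pct + 99) // 100)  # integer ceiling
        if upto > done:  # skip waves that would add no new hosts
            batches.append(hosts[done:upto])
            done = upto
    return batches
```

With 20 hosts and waves of 10%, 30%, and 100%, this yields batches of 2, 4, and 14 servers. A scheduler could then release one batch per night, pausing if the previous batch failed verification.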

6. Configuration Management

Two Implementation Philosophies

┌──────────────────────────────────────────────────────────────┐
│  Process-based (Active / Push)                               │
│                                                              │
│  Ops triggers change ──▶ Center pushes to servers            │
│  Tools: Ansible, Fabric, Salt, custom daemon                 │
│  Characteristic: explicit, auditable, immediate              │
├──────────────────────────────────────────────────────────────┤
│  Result-based (Passive / Pull)                               │
│                                                              │
│  Ops defines desired state ──▶ Agent polls & self-corrects   │
│  Tools: Puppet, Chef, custom agent                           │
│  Characteristic: declarative, self-healing, eventual         │
└──────────────────────────────────────────────────────────────┘
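
The pull philosophy boils down to a reconcile step that an agent runs on every poll. The sketch below is a toy version for service state only; the `"running"` and `"stopped"` state names and the action tuples are assumptions for illustration, not the model used by Puppet or Chef.

```python
def reconcile(desired, actual):
    """Compare desired and actual service state; return the corrective actions."""
    actions = []
    for service, want in sorted(desired.items()):
        have = actual.get(service, "stopped")  # unreported services count as stopped
        if want == "running" and have != "running":
            actions.append(("start", service))
        elif want == "stopped" and have == "running":
            actions.append(("stop", service))
    return actions
```

Running `reconcile` again after the actions have been applied returns an empty list, which is the self-healing, eventually consistent behavior described above.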

Configuration Management Sub-Systems

Sub-System	Description
Account Management	root and business user accounts for job execution
File Management	Code versions, scripts, software packages
Script Management	Base scripts + user-uploaded custom scripts
Group Management	Static groups (manual IP selection) and dynamic groups (auto-match by resource attributes)
Variable Management	Multi-level variables used with templates and groups
Template Management	Jinja2 templates + variable binding
Software Management	rpm/yum/npm/pip packages; versioned tarballs with MD5
Service Management	Agent-reported process state; integrates with job system for start/stop/restart

Variable Priority Hierarchy

Variables follow a strict priority order (higher overrides lower):

┌─────────────────────────────────────────────┐
│  Priority (high → low)                      │
│                                             │
│  4. Task Variables    ← runtime input only  │
│  3. Host Variables    ← per-host dynamic    │
│  2. Custom Group      ← user-defined groups │
│     Set Group                               │
│     IDC Group                               │
│  1. Global Group      ← lowest group level  │
│  0. Global Variables  ← affects everything  │
└─────────────────────────────────────────────┘

Variable types:

Type	Scope	Priority	Notes
Global Variables	All templates everywhere	Lowest	One value globally; changes affect all consumers
Group Variables	Specific server group	Medium	Different groups may have conflicting values; group priority resolves conflicts
Host Variables	Single host	High	Dynamic; sourced from resource management system
Task Variables	Single task execution	Highest	Runtime input only; not persisted
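
The priority rule can be sketched as a layered dictionary merge. The ordering inside `group_layers` is an assumption based on the hierarchy above (global group lowest, custom group highest), and the function name is illustrative.

```python
def resolve_variables(global_vars, group_layers, host_vars, task_vars):
    """Merge variable layers from lowest to highest priority.

    group_layers is ordered lowest priority first, for example
    [global group, IDC group, set group, custom group].
    """
    merged = {}
    for layer in [global_vars, *group_layers, host_vars, task_vars]:
        merged.update(layer)  # later (higher-priority) layers overwrite earlier ones
    return merged
```

Note that this is a shallow merge: a nested value such as `db` is replaced as a whole by a higher layer rather than merged key by key.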

Template Rendering Flow

Template file (Jinja2)
┌────────────────────────────────┐
│  server_port: {{ port }}       │
│  db_host: {{ db.host }}        │
│  env: {{ environment }}        │
└────────────────────────────────┘
     │
     ▼
Variable resolution (priority merge)
Global → Group → Host → Task (lowest → highest priority)
     │
     ▼
Rendered config file
┌────────────────────────────────┐
│  server_port: 8080             │
│  db_host: 10.0.1.5             │
│  env: production               │
└────────────────────────────────┘
     │
     ▼
Distribute to target servers

Template engine: Jinja2 (same as Ansible)
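
To show the rendering step without a third-party dependency, the sketch below uses a tiny standard-library substitute that understands only `{{ name }}` and dotted `{{ a.b }}` placeholders. It is not Jinja2 and supports none of its filters, loops, or conditionals; a real system would call the Jinja2 library instead.

```python
import re

# Matches {{ name }} or {{ a.b.c }} with optional surrounding whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, variables):
    """Replace placeholders with values; unknown names raise KeyError."""
    def lookup(match):
        value = variables
        for part in match.group(1).split("."):
            value = value[part]  # walk nested dicts for dotted names
        return str(value)
    return PLACEHOLDER.sub(lookup, template)


template = "server_port: {{ port }}\ndb_host: {{ db.host }}\nenv: {{ environment }}\n"
variables = {"port": 8080, "db": {"host": "10.0.1.5"}, "environment": "production"}
rendered = render(template, variables)
```

Failing loudly on a missing variable is a deliberate choice here: pushing a config file with an unresolved placeholder to production is usually worse than aborting the job.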

Service Management

Each server Agent
     │ periodic heartbeat report
     ▼
Config Management System
(maintains live service state per host)
     │
     ▼
Ops views service status dashboard
     │ triggers job
     ▼
Job System executes: start / stop / restart
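
On the receiving side, the Config Management System needs to store each heartbeat and treat old reports as unreliable. A minimal in-memory sketch follows; the class name, the 90-second staleness window, and the `"unknown"` state are assumptions for illustration.

```python
import time


class ServiceRegistry:
    """Keeps the last reported service state per host (in memory only)."""

    def __init__(self, stale_after=90.0):
        self.stale_after = stale_after  # seconds before a report is distrusted
        self._reports = {}              # host -> (timestamp, {service: state})

    def heartbeat(self, host, services, now=None):
        """Record an agent report; `now` is injectable for testing."""
        self._reports[host] = (time.time() if now is None else now, dict(services))

    def status(self, host, service, now=None):
        """Return the reported state, or 'unknown' if the report is missing or stale."""
        now = time.time() if now is None else now
        report = self._reports.get(host)
        if report is None or now - report[0] > self.stale_after:
            return "unknown"
        return report[1].get(service, "unknown")
```

Reporting `"unknown"` for a stale host keeps the dashboard from showing a service as running when its agent has stopped reporting.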

System Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                    Job Control Platform                      │
│                                                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐ │
│  │  Script    │  │   File     │  │   Custom Job           │ │
│  │ Execution  │  │Distribution│  │   Orchestration        │ │
│  └────────────┘  └────────────┘  └────────────────────────┘ │
│                                                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐ │
│  │ Scheduled  │  │  Canary    │  │  Batch Execution       │ │
│  │   Tasks    │  │  Rollout   │  │  (1000+ concurrent)    │ │
│  └────────────┘  └────────────┘  └────────────────────────┘ │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Configuration Management                │   │
│  │  Accounts │ Files │ Scripts │ Groups │ Variables     │   │
│  │  Templates │ Software │ Services                     │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Task Result Tracking                    │   │
│  │  Execution logs │ Re-run control │ Security audit    │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                        │ SSH or Agent
            ┌───────────┼───────────┐
            ▼           ▼           ▼
        [IDC-A]      [IDC-B]     [IDC-C]
       Agent×N      Agent×N     Agent×N
