
James Lee


Building an Ops Job Control System: Script Execution, File Distribution & Configuration Management

What Is a "Job"?

A job is an abstract definition of a sequence of operations. Any ops task, whether a release, a change, or an incident response, can be decomposed into ordered steps and target objects.

Ops Work
├── Command       — a single independent operation (start/stop service, run script)
├── File Transfer — distribute a file to a target path on target machines
└── Job           — an ordered combination of commands and file transfers
                    with defined execution targets

The power of this abstraction: once ops work is decomposed into reusable, extensible, independently executable units, automated platform-level scheduling becomes possible.
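The three units above can be sketched as plain data types. This is only an illustration of the abstraction: the class and field names (`Command`, `FileTransfer`, `Job`, `run_as`, `dest_path`) are assumptions made for the example, not the schema of any particular platform.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Command:
    """A single independent operation, e.g. restarting a service."""
    script: str
    run_as: str = "root"


@dataclass
class FileTransfer:
    """Distribute one file to a path on the target machines."""
    source: str
    dest_path: str


Step = Union[Command, FileTransfer]


@dataclass
class Job:
    """An ordered list of steps plus the targets they run against."""
    name: str
    targets: List[str]
    steps: List[Step] = field(default_factory=list)

    def describe(self) -> List[str]:
        """Return one human-readable line per step, in execution order."""
        lines = []
        for index, step in enumerate(self.steps, start=1):
            if isinstance(step, Command):
                lines.append(f"{index}. run {step.script!r} as {step.run_as}")
            else:
                lines.append(f"{index}. copy {step.source} -> {step.dest_path}")
        return lines


# Hypothetical release job: push a package, then restart the service.
release = Job(
    name="release-web",
    targets=["10.0.0.1", "10.0.0.2"],
    steps=[
        FileTransfer("app.tar.gz", "/opt/app/app.tar.gz"),
        Command("systemctl restart app"),
    ],
)
```

Because a job is just data, a scheduler can store it, replay it, or run it against a different target list without touching the steps.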


Core Feature Matrix

| Feature | Capabilities |
| --- | --- |
| Script Execution | Shell & Python support; fast batch execution; task result management |
| File Distribution | Basic file push; P2P large-file transfer; MD5 consistency verification |
| Scheduled Tasks | Physical host crontab management; supports one-time and recurring schedules; ideal for off-hours execution |
| Canary Rollout | Step-based canary; server-count-based canary |
| Custom Jobs | Compose complex tasks from atomic units; control sub-task impact on parent; support sequential/parallel ordering |
| Batch Execution | Execute across a server group; customizable server grouping; supports 1000+ concurrent tasks |

1. Script Execution

Execution Modes

┌─────────────────────────────────────────────────────────────┐
│  Mode 1: SSH Remote Execution                               │
│                                                             │
│  Center Node ──SSH──▶ Remote Server (execute script)        │
│                                                             │
│  Pros: Simple, no agent needed                              │
│  Cons: Each command spawns a process on center node;        │
│        high center node pressure; slower execution          │
├─────────────────────────────────────────────────────────────┤
│  Mode 2: Agent Execution                                    │
│                                                             │
│  Push model:                                                │
│  Center Node ──task──▶ Agent ──execute──▶ result ──▶ Center │
│                                                             │
│  Pull model:                                                │
│  Agent ──subscribe──▶ task queue ──execute──▶ push result   │
│                                                             │
│  Pros: High efficiency; async non-blocking; 1000+ concurrent│
│  Cons: Requires agent development and maintenance           │
└─────────────────────────────────────────────────────────────┘

Implementation options for SSH mode:

  • ssh command (shell)
  • Python: paramiko, ansible, fabric

Agent requirements:

  • Async non-blocking execution model
  • Self-registration and health reporting
  • Extensible for config reporting and metrics collection
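The pull model can be sketched with `asyncio`: the agent awaits tasks from a queue, executes them, and pushes results back. This is a minimal sketch under assumptions: the in-process `asyncio.Queue` stands in for a real task queue, and `execute`, `agent_worker`, and the task dictionary layout are hypothetical names chosen for the example.

```python
import asyncio


async def execute(task: dict) -> dict:
    """Stand-in for running a script; a real agent would spawn a subprocess."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return {"task_id": task["id"], "exit_code": 0, "output": f"ran {task['script']}"}


async def agent_worker(tasks: asyncio.Queue, results: asyncio.Queue) -> None:
    """Pull tasks until a None sentinel arrives, pushing each result back."""
    while True:
        task = await tasks.get()
        if task is None:
            break
        results.put_nowait(await execute(task))


async def demo() -> list:
    tasks, results = asyncio.Queue(), asyncio.Queue()
    for i in range(3):
        tasks.put_nowait({"id": i, "script": f"job-{i}.sh"})
    tasks.put_nowait(None)  # tell the worker to stop
    await agent_worker(tasks, results)
    return [results.get_nowait() for _ in range(results.qsize())]


collected = asyncio.run(demo())
```

Waiting on the queue does not block the event loop, so the same agent process could also run heartbeat or metrics coroutines alongside the worker.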

Supporting Sub-Systems

| Sub-System | Purpose |
| --- | --- |
| Script Management | Manage user-uploaded scripts and common base scripts |
| Account Management | Manage execution users (root / business accounts) |
| Server Grouping | Organize servers into static or dynamic groups |

Batch Execution Capacity

| Execution Mode | Max Concurrency |
| --- | --- |
| SSH | ~30–50 tasks simultaneously |
| Agent (async) | 1000+ tasks simultaneously |
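Whatever the transport, a dispatcher usually caps how many tasks are in flight at once. The sketch below shows one common way to do that with `asyncio.Semaphore`. The names `run_batch` and `fake_task` are hypothetical, and the fake task only sleeps, so the numbers say nothing about real SSH or agent throughput.

```python
import asyncio


async def run_batch(hosts, run_one, limit):
    """Run `run_one(host)` for every host with at most `limit` in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(host):
        async with gate:
            return host, await run_one(host)

    pairs = await asyncio.gather(*(guarded(h) for h in hosts))
    return dict(pairs)


stats = {"in_flight": 0, "peak": 0}


async def fake_task(host):
    """Hypothetical task that records how many copies run at once."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return "ok"


hosts = [f"host-{i}" for i in range(200)]
results = asyncio.run(run_batch(hosts, fake_task, limit=50))
```

Raising or lowering `limit` is the only change needed to move between a conservative SSH-style cap and a larger agent-style cap.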

Distributed Architecture for Large Fleets

┌─────────────────────────────────────────────────────┐
│              Central Dispatch Node                  │
└──────────────────────┬──────────────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
  ┌──────────┐   ┌──────────┐   ┌──────────┐
  │  IDC-A   │   │  IDC-B   │   │  IDC-C   │
  │  Relay   │   │  Relay   │   │  Relay   │
  └────┬─────┘   └────┬─────┘   └────┬─────┘
       │              │              │
  ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
  │Agent×N  │    │Agent×N  │    │Agent×N  │
  └─────────┘    └─────────┘    └─────────┘

One central dispatch node → one relay per datacenter → one agent per server.
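In this layout the central node's first step is to split a target list by datacenter so that each relay receives only its own hosts. A minimal sketch of that split follows. The `host_to_idc` inventory mapping and the function name `plan_dispatch` are assumptions for illustration.

```python
from collections import defaultdict


def plan_dispatch(targets, host_to_idc):
    """Group target hosts by datacenter so each relay gets one sub-batch."""
    per_relay = defaultdict(list)
    for host in targets:
        idc = host_to_idc.get(host)
        if idc is None:
            raise KeyError(f"no datacenter recorded for {host}")
        per_relay[idc].append(host)
    return dict(per_relay)


# Hypothetical inventory: which datacenter each host lives in.
inventory = {"10.1.0.1": "IDC-A", "10.1.0.2": "IDC-A", "10.2.0.1": "IDC-B"}
plan = plan_dispatch(["10.1.0.1", "10.2.0.1", "10.1.0.2"], inventory)
```

The central node then sends one message per relay rather than one per server, which keeps cross-datacenter traffic proportional to the number of datacenters.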


2. File Distribution

File Types

| Type | Examples | Special Handling |
| --- | --- | --- |
| Regular files | Code packages, images, software packages | MD5 check before transfer |
| Template files | Config files (nginx.conf, app.yml) | Variable substitution before transfer |

Template file flow:

Template file (with variables)
     │
     ▼
Load variables from Config Management
     │
     ▼
Render → temporary plain file
     │
     ▼
Distribute to target servers

Distribution Methods

Method 1: Management node → target servers

Management Node
     │
     ├── MD5 check (skip if identical)
     │
     ▼
rsync / ansible copy / salt file module
     │
     ▼
Target Servers

The MD5 pre-check skips files that are already identical on the target, which saves significant time for large files.
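The pre-check itself is small. The sketch below hashes the local file in chunks and compares it with the digest reported for the target copy. How the remote digest is obtained (for example, reported by an agent) is left open here, and `file_md5` and `needs_transfer` are names invented for the example.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_transfer(local_path, remote_md5):
    """Return True when the target lacks the file or holds different bytes."""
    return remote_md5 is None or remote_md5 != file_md5(local_path)


# Demo with a throwaway file standing in for a package.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
package = tmp.name
same = needs_transfer(package, file_md5(package))   # target already matches
missing = needs_transfer(package, None)             # target has no copy
changed = needs_transfer(package, "0" * 32)         # target holds other bytes
os.remove(package)
```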

Recommended tools:

  • ansible copy module
  • salt file module
  • Custom rsync wrapper (supports MD5 check + resume-on-break)

Method 2: P2P transfer between servers (large files)

Source Server A (e.g. 500GB image)
     │
     ├──rsync──▶ Server B
     ├──rsync──▶ Server C
     └──rsync──▶ Server D ...
  • Agent wraps rsync for inter-server transfers
  • Config Management auto-selects transfer source
  • True P2P: no manual source selection required
  • rsync copes with large files, verifies each transferred file with a whole-file checksum (MD5 in many versions), and can resume interrupted transfers when run with --partial
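Automatic source selection might look like the sketch below. The selection rule (prefer a server in the same datacenter that already holds the file) is an assumption made for the example, not something the design above specifies, and `pick_source` and `rsync_pull_command` are hypothetical helpers. The command is only built, never executed.

```python
def pick_source(dest, holders, host_to_idc):
    """Prefer a holder in the destination's datacenter, else any holder."""
    if not holders:
        raise ValueError("no server holds the file yet")
    dest_idc = host_to_idc.get(dest)
    local = [h for h in holders if host_to_idc.get(h) == dest_idc]
    return sorted(local or holders)[0]


def rsync_pull_command(source, path):
    """Command the destination agent would run to pull `path` from `source`."""
    # -a preserves permissions and timestamps; --partial keeps partially
    # transferred files so a retry can reuse them instead of starting over.
    return ["rsync", "-a", "--partial", f"{source}:{path}", path]


# Hypothetical fleet: a1 is the original source, b2 already received a copy.
idc_of = {"a1": "IDC-A", "b1": "IDC-B", "b2": "IDC-B"}
source = pick_source("b1", ["a1", "b2"], idc_of)
command = rsync_pull_command(source, "/data/image.qcow2")
```

As more servers finish their transfer they can be added to `holders`, which spreads load away from the original source.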

3. Custom Jobs (Task Orchestration)

Custom jobs compose atomic operations into complex workflows:

Custom Job
├── Step 1: File Distribution (parallel)
│     ├── push config.yml to Group A
│     └── push config.yml to Group B
├── Step 2: Command Execution (sequential)
│     └── restart service on all targets
└── Step 3: Verification Command
      └── check service health

Key capabilities:

  • Mix file distribution and command steps freely
  • Control whether a sub-task failure blocks the parent job
  • Reorder steps at any time
  • Sequential and parallel execution modes
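These capabilities can be sketched as a small runner: steps execute in order, each step fans out across targets either in parallel or one by one, and a per-step flag decides whether a failure stops the parent job. The step dictionary keys (`name`, `parallel`, `blocking`, `action`) are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(steps, targets):
    """Run steps in order; return (names of steps that finished, failures)."""
    finished, failures = [], []
    for step in steps:
        if step["parallel"]:
            with ThreadPoolExecutor(max_workers=8) as pool:
                outcomes = list(pool.map(step["action"], targets))
        else:
            outcomes = [step["action"](t) for t in targets]
        failed = [t for t, ok in zip(targets, outcomes) if not ok]
        if failed:
            failures.append((step["name"], failed))
            if step["blocking"]:
                break  # a blocking failure stops the parent job here
        finished.append(step["name"])
    return finished, failures


# Hypothetical job: the lambdas stand in for real push/restart/check calls.
steps = [
    {"name": "push config", "parallel": True, "blocking": True,
     "action": lambda host: True},
    {"name": "warm cache", "parallel": False, "blocking": False,
     "action": lambda host: host != "web-2"},
    {"name": "restart", "parallel": False, "blocking": True,
     "action": lambda host: host != "web-2"},
    {"name": "health check", "parallel": True, "blocking": True,
     "action": lambda host: True},
]
finished, failures = run_job(steps, ["web-1", "web-2"])
```

Here the non-blocking "warm cache" failure is recorded but the job continues, while the blocking "restart" failure stops the job before the health check runs.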

4. Scheduled Tasks

Scheduled Task
├── One-time execution  — run at a specific datetime
├── Recurring           — standard crontab expression
└── Canary + Scheduled  — combine with canary rollout
                          (e.g. roll out to 10% of servers
                           each night at 2 AM)

Ideal for: database backups, log rotation, off-hours deployments, periodic health checks.
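To show how a recurring schedule is evaluated, here is a deliberately small matcher for five-field crontab expressions. It is a simplified sketch, not a full cron implementation: it handles only `*`, `*/n`, single numbers, and comma lists, and it requires day-of-month and day-of-week to both match, whereas real cron treats them as alternatives when both are restricted. `field_matches` and `is_due` are names chosen for the example.

```python
from datetime import datetime


def field_matches(spec, value, low=0):
    """Match one cron field: '*', '*/n', a number, or a comma list of those."""
    for part in spec.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps count from the field's lowest legal value (0 or 1).
            if (value - low) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def is_due(expr, when):
    """True when `when` falls on a minute described by the expression."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = (when.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day, low=1)
        and field_matches(month, when.month, low=1)
        and field_matches(weekday, cron_weekday)
    )


nightly = "0 2 * * *"  # every day at 02:00
due_at_two = is_due(nightly, datetime(2024, 1, 1, 2, 0))
due_at_three = is_due(nightly, datetime(2024, 1, 1, 3, 0))
```

A one-time task needs no expression at all: the scheduler only has to compare the current time with the stored run time.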


5. Canary Rollout

Two rollout dimensions:

Dimension 1: Step-based canary
┌──────────┬──────────┬──────────┬──────────┐
│  Step 1  │  Step 2  │  Step 3  │  Step 4  │
│  verify  │  10 svrs │  50 svrs │  all svrs│
└──────────┴──────────┴──────────┴──────────┘

Dimension 2: Server-count-based canary
┌──────────┬──────────┬──────────┐
│   10%    │   30%    │  100%    │
│  servers │  servers │  servers │
└──────────┴──────────┴──────────┘

Canary rollout integrates with scheduled tasks for automated progressive deployment.
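The percentage-based rollout can be sketched as a function that turns cumulative percentages into successive batches, so that each wave contains only the servers not yet covered. The function name `canary_batches` and the rounding-up rule are assumptions for illustration.

```python
def canary_batches(servers, percentages):
    """Split servers into successive batches by cumulative percentage."""
    batches, done = [], 0
    total = len(servers)
    for pct in percentages:
        # Integer ceiling of total * pct / 100, capped at the fleet size.
        upto = min(total, (total * pct + 99) // 100)
        if upto > done:
            batches.append(servers[done:upto])
            done = upto
    return batches


fleet = [f"srv-{i:02d}" for i in range(20)]
waves = canary_batches(fleet, [10, 30, 100])
```

With 20 servers the waves contain 2, 4, and 14 servers. A scheduled task could release one wave per night and stop if the previous wave's verification failed.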


6. Configuration Management

Two Implementation Philosophies

┌──────────────────────────────────────────────────────────────┐
│  Process-based (Active / Push)                               │
│                                                              │
│  Ops triggers change ──▶ Center pushes to servers            │
│  Tools: Ansible, Fabric, Salt, custom daemon                 │
│  Characteristic: explicit, auditable, immediate              │
├──────────────────────────────────────────────────────────────┤
│  Result-based (Passive / Pull)                               │
│                                                              │
│  Ops defines desired state ──▶ Agent polls & self-corrects   │
│  Tools: Puppet, Chef, custom agent                           │
│  Characteristic: declarative, self-healing, eventual         │
└──────────────────────────────────────────────────────────────┘
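The result-based philosophy comes down to a reconcile step: compare the desired state with what is actually on the host and derive the corrective actions. The sketch below illustrates that idea with flat dictionaries. The action names and the choice to remove keys that are no longer desired are assumptions, not the behaviour of Puppet or Chef.

```python
def reconcile(desired, actual):
    """Return the actions a pull-style agent would take to reach `desired`."""
    actions = []
    for key, want in sorted(desired.items()):
        if key not in actual:
            actions.append(("create", key, want))
        elif actual[key] != want:
            actions.append(("update", key, want))
    for key in sorted(set(actual) - set(desired)):
        actions.append(("remove", key, None))
    return actions


# Hypothetical settings on one host.
desired = {"nginx.worker_processes": "4", "app.port": "8080"}
actual = {"nginx.worker_processes": "2", "legacy.flag": "on"}
plan = reconcile(desired, actual)
```

Running the same function against a host that already matches returns an empty list, which is what makes repeated polling safe.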

Configuration Management Sub-Systems

| Sub-System | Description |
| --- | --- |
| Account Management | root and business user accounts for job execution |
| File Management | Code versions, scripts, software packages |
| Script Management | Base scripts + user-uploaded custom scripts |
| Group Management | Static groups (manual IP selection) and dynamic groups (auto-match by resource attributes) |
| Variable Management | Multi-level variables used with templates and groups |
| Template Management | Jinja2 templates + variable binding |
| Software Management | rpm/yum/npm/pip packages; versioned tarballs with MD5 |
| Service Management | Agent-reported process state; integrates with job system for start/stop/restart |

Variable Priority Hierarchy

Variables follow a strict priority order (higher overrides lower):

┌─────────────────────────────────────────────┐
│  Priority (high → low)                      │
│                                             │
│  4. Task Variables    ← runtime input only  │
│  3. Host Variables    ← per-host dynamic    │
│  2. Custom Group      ← user-defined groups │
│     Set Group                               │
│     IDC Group                               │
│  1. Global Group      ← lowest among groups │
│  0. Global Variables  ← lowest; affects all │
└─────────────────────────────────────────────┘

Variable types:

| Type | Scope | Priority | Notes |
| --- | --- | --- | --- |
| Global Variables | All templates everywhere | Lowest | One value globally; changes affect all consumers |
| Group Variables | Specific server group | Medium | Different groups may have conflicting values; group priority resolves conflicts |
| Host Variables | Single host | High | Dynamic; sourced from resource management system |
| Task Variables | Single task execution | Highest | Runtime input only; not persisted |
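The priority order amounts to a layered dictionary merge in which each higher layer overwrites the one below it. A minimal sketch follows. The function name `resolve_variables` and the sample values are assumptions, and group layers are expected to arrive already sorted from lowest to highest priority.

```python
def resolve_variables(global_vars, group_layers, host_vars, task_vars):
    """Merge layers from lowest to highest priority; later layers win."""
    merged = dict(global_vars)
    for layer in group_layers:  # ordered lowest-priority group first
        merged.update(layer)
    merged.update(host_vars)
    merged.update(task_vars)
    return merged


resolved = resolve_variables(
    global_vars={"port": 80, "environment": "staging", "log_level": "info"},
    group_layers=[
        {"port": 8080},                   # global group
        {"environment": "production"},    # custom group, overrides the above
    ],
    host_vars={"db_host": "10.0.1.5"},
    task_vars={"log_level": "debug"},     # runtime input, highest priority
)
```

Copying `global_vars` before updating keeps the stored layers untouched, so the same inputs can be merged again for another host or task.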

Template Rendering Flow

Template file (Jinja2)
┌────────────────────────────────┐
│  server_port: {{ port }}       │
│  db_host: {{ db.host }}        │
│  env: {{ environment }}        │
└────────────────────────────────┘
     │
     ▼
Variable resolution (priority merge)
Global → Group → Host → Task
     │
     ▼
Rendered config file
┌────────────────────────────────┐
│  server_port: 8080             │
│  db_host: 10.0.1.5             │
│  env: production               │
└────────────────────────────────┘
     │
     ▼
Distribute to target servers

Template engine: Jinja2 (same as Ansible)
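To make the rendering step concrete without requiring Jinja2 to be installed, the sketch below uses a tiny regex-based stand-in that understands only `{{ name }}` and dotted names such as `{{ db.host }}`. It is not Jinja2 and supports none of its filters, loops, or conditionals. With Jinja2 available, `jinja2.Template(template).render(**variables)` would fill the same role. `lookup` and `render` are names invented for the example.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(variables, dotted_name):
    """Resolve 'db.host' by walking nested dictionaries."""
    value = variables
    for part in dotted_name.split("."):
        value = value[part]  # a KeyError here means an undefined variable
    return value


def render(template, variables):
    """Replace every {{ name }} placeholder with its resolved value."""
    return PLACEHOLDER.sub(lambda m: str(lookup(variables, m.group(1))), template)


template = (
    "server_port: {{ port }}\n"
    "db_host: {{ db.host }}\n"
    "env: {{ environment }}\n"
)
rendered = render(
    template,
    {"port": 8080, "db": {"host": "10.0.1.5"}, "environment": "production"},
)
```

Failing loudly on an undefined variable is a deliberate choice in this sketch, since a silently empty value in a config file is harder to notice after distribution.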

Service Management

Each server Agent
     │ periodic heartbeat report
     ▼
Config Management System
(maintains live service state per host)
     │
     ▼
Ops views service status dashboard
     │ triggers job
     ▼
Job System executes: start / stop / restart
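The state-tracking side of this flow can be sketched as a registry that stores the last report per host and service and treats old reports as unknown. The class name `ServiceRegistry`, the timeout rule, and the use of plain numeric timestamps are assumptions made for the example.

```python
class ServiceRegistry:
    """Keeps the last reported state of each (host, service) pair."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.reports = {}  # (host, service) -> (state, reported_at)

    def heartbeat(self, host, service, state, now):
        """Record a report sent by the agent on `host`."""
        self.reports[(host, service)] = (state, now)

    def status(self, host, service, now):
        """Return the reported state, or 'unknown' if the report is stale."""
        state, reported_at = self.reports.get((host, service), ("unknown", now))
        return state if now - reported_at <= self.timeout else "unknown"


registry = ServiceRegistry(timeout_seconds=30)
registry.heartbeat("web-1", "nginx", "running", now=100)
fresh = registry.status("web-1", "nginx", now=120)
stale = registry.status("web-1", "nginx", now=200)
never_seen = registry.status("web-9", "nginx", now=200)
```

Passing `now` explicitly keeps the registry easy to test. A real deployment would likely read a clock and persist the reports.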

System Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                    Job Control Platform                      │
│                                                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐ │
│  │  Script    │  │   File     │  │   Custom Job           │ │
│  │ Execution  │  │Distribution│  │   Orchestration        │ │
│  └────────────┘  └────────────┘  └────────────────────────┘ │
│                                                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────────┐ │
│  │ Scheduled  │  │  Canary    │  │  Batch Execution       │ │
│  │   Tasks    │  │  Rollout   │  │  (1000+ concurrent)    │ │
│  └────────────┘  └────────────┘  └────────────────────────┘ │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Configuration Management                │   │
│  │  Accounts │ Files │ Scripts │ Groups │ Variables     │   │
│  │  Templates │ Software │ Services                     │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Task Result Tracking                    │   │
│  │  Execution logs │ Re-run control │ Security audit    │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                        │ SSH or Agent
            ┌───────────┼───────────┐
            ▼           ▼           ▼
        [IDC-A]      [IDC-B]     [IDC-C]
       Agent×N      Agent×N     Agent×N
