What Is a "Job"?
A job is an abstract definition of a series of operations. Any ops task — whether a release, a change, or an incident response — can be decomposed into ordered steps and target objects.
Ops Work
├── Command — a single independent operation (start/stop service, run script)
├── File Transfer — distribute a file to a target path on target machines
└── Job — an ordered combination of commands and file transfers
with defined execution targets
The power of this abstraction: once ops work is decomposed into reusable, extensible, independently executable units, automated platform-level scheduling becomes possible.
Core Feature Matrix
| Feature | Capabilities |
|---|---|
| Script Execution | Shell & Python support; fast batch execution; task result management |
| File Distribution | Basic file push; P2P large-file transfer; MD5 consistency verification |
| Scheduled Tasks | Physical host crontab management; supports one-time and recurring schedules; ideal for off-hours execution |
| Canary Rollout | Step-based canary; server-count-based canary |
| Custom Jobs | Compose complex tasks from atomic units; control sub-task impact on parent; support sequential/parallel ordering |
| Batch Execution | Execute across a server group; customizable server grouping; supports 1000+ concurrent tasks |
1. Script Execution
Execution Modes
┌─────────────────────────────────────────────────────────────┐
│ Mode 1: SSH Remote Execution │
│ │
│ Center Node ──SSH──▶ Remote Server (execute script) │
│ │
│ Pros: Simple, no agent needed │
│ Cons: Each command spawns a process on center node; │
│ high center node pressure; slower execution │
├─────────────────────────────────────────────────────────────┤
│ Mode 2: Agent Execution │
│ │
│ Push model: │
│ Center Node ──task──▶ Agent ──execute──▶ result ──▶ Center │
│ │
│ Pull model: │
│ Agent ──subscribe──▶ task queue ──execute──▶ push result │
│ │
│ Pros: High efficiency; async non-blocking; 1000+ concurrent│
│ Cons: Requires agent development and maintenance │
└─────────────────────────────────────────────────────────────┘
Implementation options for SSH mode:
- Shell: the `ssh` command
- Python: `paramiko`, `ansible`, `fabric`
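A minimal sketch of SSH-mode batch execution with bounded concurrency. To stay self-contained it runs the script locally with `sh -c` as a stand-in for the real `ssh host script` call (or a `paramiko` session); host names and the 50-worker cap reflect the ~30–50 concurrency limit noted below, not a fixed platform value.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# SSH mode: the center node spawns one process per target, so concurrency
# must stay bounded (~30-50) to limit pressure on the center node.
MAX_CONCURRENCY = 50

def run_on_host(host: str, script: str):
    """Run `script` on `host`. A real implementation would invoke
    ["ssh", host, script] (or use paramiko); here the command runs
    locally so the sketch is self-contained."""
    cmd = ["sh", "-c", script]  # stand-in for: ["ssh", host, script]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return host, proc.returncode, proc.stdout.strip()

def batch_execute(hosts, script):
    """Fan the script out across hosts with a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        results = pool.map(lambda h: run_on_host(h, script), hosts)
    return {host: (rc, out) for host, rc, out in results}
```

Usage: `batch_execute(["10.0.0.1", "10.0.0.2"], "uptime")` returns a per-host map of exit code and output.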
Agent requirements:
- Async non-blocking execution model
- Self-registration and health reporting
- Extensible for config reporting and metrics collection
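The pull-model agent described above can be sketched as an async consume-execute-report loop. This is an illustrative skeleton under stated assumptions (an in-process `asyncio.Queue` standing in for a real message queue, and a fake task in place of script execution), not the platform's actual agent.

```python
import asyncio

async def agent_loop(task_queue: asyncio.Queue, result_queue: asyncio.Queue):
    """Pull model: the agent subscribes to a task queue, executes each task
    asynchronously (non-blocking), and pushes the result back."""
    while True:
        task = await task_queue.get()
        if task is None:                 # sentinel: shut the agent down
            break
        task_id, coro_fn = task
        result = await coro_fn()         # async, non-blocking execution
        await result_queue.put((task_id, result))

async def demo():
    tasks, results = asyncio.Queue(), asyncio.Queue()

    async def fake_script():
        # A real agent would run the script via
        # asyncio.create_subprocess_exec and capture its output.
        await asyncio.sleep(0)
        return "ok"

    await tasks.put(("t-1", fake_script))
    await tasks.put(None)
    await agent_loop(tasks, results)
    return await results.get()
```

Because the loop only awaits, one agent process can interleave many tasks without blocking, which is what makes 1000+ concurrent tasks feasible.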
Supporting Sub-Systems
| Sub-System | Purpose |
|---|---|
| Script Management | Manage user-uploaded scripts and common base scripts |
| Account Management | Manage execution users (root / business accounts) |
| Server Grouping | Organize servers into static or dynamic groups |
Batch Execution Capacity
| Execution Mode | Max Concurrency |
|---|---|
| SSH | ~30–50 tasks simultaneously |
| Agent (async) | 1000+ tasks simultaneously |
Distributed Architecture for Large Fleets
┌─────────────────────────────────────────────────────┐
│ Central Dispatch Node │
└──────────────────────┬──────────────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ IDC-A │ │ IDC-B │ │ IDC-C │
│ Relay │ │ Relay │ │ Relay │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│Agent×N │ │Agent×N │ │Agent×N │
└─────────┘ └─────────┘ └─────────┘
One central dispatch node → one relay per datacenter → one agent per server.
2. File Distribution
File Types
| Type | Examples | Special Handling |
|---|---|---|
| Regular files | Code packages, images, software packages | MD5 check before transfer |
| Template files | Config files (nginx.conf, app.yml) | Variable substitution before transfer |
Template file flow:
Template file (with variables)
│
▼
Load variables from Config Management
│
▼
Render → temporary plain file
│
▼
Distribute to target servers
Distribution Methods
Method 1: Management node → target servers
Management Node
│
├── MD5 check (skip if identical)
│
▼
rsync / ansible copy / salt file module
│
▼
Target Servers
MD5 pre-check avoids unnecessary transfers — significant time savings for large files.
Recommended tools:
- `ansible` `copy` module
- `salt` file module
- Custom `rsync` wrapper (supports MD5 check + resume-on-break)
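The MD5 pre-check can be sketched as follows: hash the local file in streamed chunks (so large files don't fill memory) and skip the transfer when the remote copy already matches. Function names here are illustrative.

```python
import hashlib
from typing import Optional

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MB chunks to keep memory flat on large files."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_transfer(local_path: str, remote_md5: Optional[str]) -> bool:
    """Skip the transfer when the remote copy already matches."""
    return remote_md5 is None or md5sum(local_path) != remote_md5
```

In practice the management node asks each target for its current MD5 (or a cached value), then distributes only to the hosts where `needs_transfer` is true.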
Method 2: P2P transfer between servers (large files)
Source Server A (e.g. 500GB image)
│
├──rsync──▶ Server B
├──rsync──▶ Server C
└──rsync──▶ Server D ...
- Agent wraps `rsync` for inter-server transfers
- Config Management auto-selects the transfer source
- True P2P: no manual source selection required
- `rsync` handles large files, MD5 verification, and resume-on-break natively
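An agent-side `rsync` wrapper might look like the sketch below. The flags are real rsync options (`--partial` for resume-on-break, `--checksum` to compare file contents instead of the default size+mtime check); the function names and the bandwidth-limit parameter are illustrative choices, not the platform's API.

```python
import subprocess
from typing import List, Optional

def rsync_cmd(src: str, dest_host: str, dest_path: str,
              bwlimit_kbps: Optional[int] = None) -> List[str]:
    """Build the rsync argv for an inter-server push.
    --partial keeps partially transferred files so a broken transfer
    resumes; --checksum decides what to send by content, not size+mtime."""
    cmd = ["rsync", "-az", "--partial", "--checksum"]
    if bwlimit_kbps:
        cmd.append(f"--bwlimit={bwlimit_kbps}")
    cmd += [src, f"{dest_host}:{dest_path}"]
    return cmd

def push(src: str, dest_host: str, dest_path: str) -> bool:
    """Agent-side wrapper: shell out to rsync and report success."""
    return subprocess.run(rsync_cmd(src, dest_host, dest_path)).returncode == 0
```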
3. Custom Jobs (Task Orchestration)
Custom jobs compose atomic operations into complex workflows:
Custom Job
├── Step 1: File Distribution (parallel)
│ ├── push config.yml to Group A
│ └── push config.yml to Group B
├── Step 2: Command Execution (sequential)
│ └── restart service on all targets
└── Step 3: Verification Command
└── check service health
Key capabilities:
- Mix file distribution and command steps freely
- Control whether a sub-task failure blocks the parent job
- Reorder steps at any time
- Sequential and parallel execution modes
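The orchestration model above can be sketched as a list of steps, each carrying its sub-tasks and a flag controlling whether failure blocks the parent job. This is a minimal illustrative data model; a real engine would run `parallel` steps on a thread pool instead of sequentially.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    tasks: List[Callable[[], bool]]  # each atomic task reports success/failure
    parallel: bool = False           # run sub-tasks concurrently or in order
    blocking: bool = True            # does a sub-task failure block the job?

def run_job(steps: List[Step]) -> List[str]:
    """Execute steps in order; honor each step's blocking flag."""
    log = []
    for step in steps:
        # Sequential execution for the sketch; a real engine would dispatch
        # step.tasks to a pool when step.parallel is set.
        ok = all(t() for t in step.tasks)
        log.append(f"{step.name}: {'ok' if ok else 'failed'}")
        if not ok and step.blocking:
            log.append("job aborted")
            break
    return log
```

A non-blocking step (like an optional verification) sets `blocking=False`, so its failure is recorded but the job continues.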
4. Scheduled Tasks
Scheduled Task
├── One-time execution — run at a specific datetime
├── Recurring — standard crontab expression
└── Canary + Scheduled — combine with canary rollout
(e.g. roll out to 10% of servers
each night at 2 AM)
Ideal for: database backups, log rotation, off-hours deployments, periodic health checks.
5. Canary Rollout
Two rollout dimensions:
Dimension 1: Step-based canary
┌──────────┬──────────┬──────────┬──────────┐
│ Step 1 │ Step 2 │ Step 3 │ Step 4 │
│ verify │ 10 svrs │ 50 svrs │ all svrs│
└──────────┴──────────┴──────────┴──────────┘
Dimension 2: Server-count-based canary
┌──────────┬──────────┬──────────┐
│ 10% │ 30% │ 100% │
│ servers │ servers │ servers │
└──────────┴──────────┴──────────┘
Canary rollout integrates with scheduled tasks for automated progressive deployment.
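Server-count-based canary reduces to splitting the fleet into progressive batches. A minimal sketch, assuming the percentages are cumulative coverage targets (as in the 10% → 30% → 100% example above):

```python
from typing import List

def canary_batches(servers: List[str], percents: List[int]) -> List[List[str]]:
    """Split a fleet into progressive rollout batches.
    `percents` are cumulative coverage targets, e.g. [10, 30, 100]:
    batch 1 brings coverage to 10%, batch 2 to 30%, batch 3 to 100%."""
    batches, done = [], 0
    for pct in percents:
        # Each batch advances coverage by at least one server,
        # and never past the end of the fleet.
        target = max(done + 1, len(servers) * pct // 100)
        target = min(target, len(servers))
        batches.append(servers[done:target])
        done = target
    return batches
```

Between batches, the platform waits for verification (or a schedule window) before releasing the next one; combined with scheduled tasks this gives the automated progressive deployment described above.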
6. Configuration Management
Two Implementation Philosophies
┌──────────────────────────────────────────────────────────────┐
│ Process-based (Active / Push) │
│ │
│ Ops triggers change ──▶ Center pushes to servers │
│ Tools: Ansible, Fabric, Salt, custom daemon │
│ Characteristic: explicit, auditable, immediate │
├──────────────────────────────────────────────────────────────┤
│ Result-based (Passive / Pull) │
│ │
│ Ops defines desired state ──▶ Agent polls & self-corrects │
│ Tools: Puppet, Chef, custom agent │
│ Characteristic: declarative, self-healing, eventual │
└──────────────────────────────────────────────────────────────┘
Configuration Management Sub-Systems
| Sub-System | Description |
|---|---|
| Account Management | root and business user accounts for job execution |
| File Management | Code versions, scripts, software packages |
| Script Management | Base scripts + user-uploaded custom scripts |
| Group Management | Static groups (manual IP selection) and dynamic groups (auto-match by resource attributes) |
| Variable Management | Multi-level variables used with templates and groups |
| Template Management | Jinja2 templates + variable binding |
| Software Management | rpm/yum/npm/pip packages; versioned tarballs with MD5 |
| Service Management | Agent-reported process state; integrates with job system for start/stop/restart |
Variable Priority Hierarchy
Variables follow a strict priority order (higher overrides lower):
┌─────────────────────────────────────────────┐
│ Priority (high → low) │
│ │
│ 4. Task Variables ← runtime input only │
│ 3. Host Variables ← per-host dynamic │
│ 2. Custom Group ← user-defined groups │
│ Set Group │
│ IDC Group │
│ 1. Global Group ← lowest priority │
│ 0. Global Variables ← affects everything │
└─────────────────────────────────────────────┘
Variable types:
| Type | Scope | Priority | Notes |
|---|---|---|---|
| Global Variables | All templates everywhere | Lowest | One value globally; changes affect all consumers |
| Group Variables | Specific server group | Medium | Different groups may have conflicting values; group priority resolves conflicts |
| Host Variables | Single host | High | Dynamic; sourced from resource management system |
| Task Variables | Single task execution | Highest | Runtime input only; not persisted |
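The priority merge maps directly onto Python's `collections.ChainMap`, which returns the first match when maps are listed from highest priority to lowest. The variable names and values below are illustrative.

```python
from collections import ChainMap

def resolve_variables(task: dict, host: dict, group: dict, global_vars: dict):
    """Merge the four variable levels; a higher level overrides a lower one.
    ChainMap looks maps up left-to-right, so they are listed from the
    highest priority (task) down to the lowest (global)."""
    return ChainMap(task, host, group, global_vars)

# Illustrative values:
global_vars = {"port": 80, "env": "staging"}
group_vars  = {"env": "production"}          # group overrides global
host_vars   = {"db_host": "10.0.1.5"}        # per-host dynamic value
task_vars   = {"port": 8080}                 # runtime input, highest priority
merged = resolve_variables(task_vars, host_vars, group_vars, global_vars)
```

Here `merged["port"]` is 8080 (task wins over global) and `merged["env"]` is "production" (group wins over global), matching the hierarchy above.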
Template Rendering Flow
Template file (Jinja2)
┌────────────────────────────────┐
│ server_port: {{ port }} │
│ db_host: {{ db.host }} │
│ env: {{ environment }} │
└────────────────────────────────┘
│
▼
Variable resolution (priority merge)
Global → Group → Host → Task
│
▼
Rendered config file
┌────────────────────────────────┐
│ server_port: 8080 │
│ db_host: 10.0.1.5 │
│ env: production │
└────────────────────────────────┘
│
▼
Distribute to target servers
Template engine: Jinja2 (same as Ansible)
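The rendering step can be sketched with a tiny `{{ name }}` / `{{ a.b }}` substituter. This regex-based renderer is a dependency-free stand-in for Jinja2 (which the platform actually uses); it supports only plain variables and dotted lookups, none of Jinja2's control flow.

```python
import re

def render(template: str, variables: dict) -> str:
    """Minimal {{ name }} / {{ a.b }} substitution; a stand-in for Jinja2."""
    def lookup(match: "re.Match") -> str:
        value = variables
        for part in match.group(1).split("."):
            value = value[part]          # dotted path walks nested dicts
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)
```

Applied to the template above with the merged variables, it produces the rendered config that gets distributed to the target servers.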
Service Management
Each server Agent
│ periodic heartbeat report
▼
Config Management System
(maintains live service state per host)
│
▼
Ops views service status dashboard
│ triggers job
▼
Job System executes: start / stop / restart
System Architecture Overview
┌──────────────────────────────────────────────────────────────┐
│ Job Control Platform │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────────────────┐ │
│ │ Script │ │ File │ │ Custom Job │ │
│ │ Execution │ │Distribution│ │ Orchestration │ │
│ └────────────┘ └────────────┘ └────────────────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────────────────┐ │
│ │ Scheduled │ │ Canary │ │ Batch Execution │ │
│ │ Tasks │ │ Rollout │ │ (1000+ concurrent) │ │
│ └────────────┘ └────────────┘ └────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Configuration Management │ │
│ │ Accounts │ Files │ Scripts │ Groups │ Variables │ │
│ │ Templates │ Software │ Services │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Task Result Tracking │ │
│ │ Execution logs │ Re-run control │ Security audit │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│ SSH or Agent
┌───────────┼───────────┐
▼ ▼ ▼
[IDC-A] [IDC-B] [IDC-C]
Agent×N Agent×N Agent×N