Chen Debra

Posted on May 18

DolphinScheduler Agent Is Now Open-Source! Bringing Self-Healing Automation to DataOps

#ai #agents #apachedolphinscheduler #github

At the technical session of the 2026 Apache DolphinScheduler Meetup, the DolphinScheduler Agent solution presented by Liu Xiaodong quickly became one of the most discussed topics in the community. The end-to-end system connects “group alert → intelligent diagnosis → automatic recovery → reporting loop,” and addresses the fragmentation, manual overhead, and constant context switching of traditional operations workflows, moving big data incident handling from “manual firefighting” toward “intelligent autonomous operations.”

The project’s core supporting tool, dolphinscheduler-cli (dsctl), has now officially been open-sourced on GitHub and is freely available for all developers!

Watch the Replay: https://youtu.be/mnGC-XOf8xU

The Pain of Traditional Operations: Slow Recovery Isn’t About Commands — It’s About Fragmented Context

When using Apache DolphinScheduler in daily production, handling failed tasks has always been a major burden for operations teams.

The workflow is all too familiar:

A Feishu alert pops up → open the DS UI to check instance status → inspect logs to locate the failure → compare with the Runbook → manually decide what to do → return to the group chat and reply with the result...

What truly slows recovery is not running a command, but the constant loss of context across multiple systems. Facts, evidence, and risks are scattered across different tools, so operators spend most of their time “searching for information, stitching logic together, and rebuilding context.” Collaboration breaks down, troubleshooting costs climb, and incident recovery takes far longer than it should.

With DolphinScheduler Agent, all of this changes.

A Major Upgrade: From Fragmented Human Coordination to an Intelligent End-to-End Closed Loop

To solve these operational gaps, the goal of the DolphinScheduler Agent solution is crystal clear:

Transform every failure alert into a continuous, traceable, and reusable handling workflow.

The old model treated alerts, UI pages, logs, group chats, and postmortems as isolated systems heavily dependent on human coordination.

The new model starts from a Feishu alert and flows through Channel conversations, intelligent orchestration, execution control, verification, and automated reporting, forming a seamless end-to-end process from trigger to resolution — without requiring engineers to jump repeatedly between systems.

Simply put:

Once an alert is triggered, the Agent automatically takes over.
Once handling is complete, it automatically replies in the group and generates a full incident report.

Operations engineers only need to review the conclusion instead of “running around everywhere.”

Five-Layer Core Architecture: Not Just Scripts, but a Safe and Controllable Intelligent Control Chain

Many people mistakenly think automated operations simply mean “bots + scripts.”

However, DolphinScheduler Agent takes a much more robust and engineering-oriented approach: a five-layer decoupled control chain. Each layer has clear responsibilities, ensuring both execution capability and strict safety boundaries.

1. L1 Event & Collaboration

Alerts directly enter Feishu threads, allowing human intervention and questioning at any time. The workflowInstanceId serves as the unique incident anchor, ensuring information is never lost or fragmented.

2. L2 Session Integration

Feishu events synchronize into local sessions, maintaining full conversational context and eliminating interruptions caused by switching systems.

3. L3 Intelligent Orchestration

Claude Code handles information organization and invocation orchestration, while Skills encapsulate DolphinScheduler domain expertise for more accurate decision-making.

4. L4 Execution Control

dsctl centrally handles the core actions of evidence collection, fault recovery, and result verification, providing standardized, reusable, and stable command execution.

5. L5 Governance & Reporting

The system automatically generates Feishu replies, incident reports, and audit logs, balancing real-time collaboration with long-term governance and postmortem analysis.

This architecture directly addresses real operational requirements:
Only through decoupled architecture can capabilities scale reliably; only through clear boundaries can automation safely enter production environments.
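To make the "incident anchor" idea concrete, here is a minimal sketch of how events from different layers could be grouped under one workflowInstanceId. The `Incident` and `IncidentStore` names, the layer labels, and the note strings are all hypothetical and only for illustration; the talk does not describe the Agent's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Everything known about one failed workflow instance (hypothetical shape)."""
    workflow_instance_id: int
    events: list = field(default_factory=list)  # (layer, note) tuples in arrival order


class IncidentStore:
    """Groups events from every layer under the same workflowInstanceId."""

    def __init__(self):
        self._incidents = {}

    def record(self, workflow_instance_id, layer, note):
        # Create the incident on first sight, then append to its timeline.
        incident = self._incidents.setdefault(
            workflow_instance_id, Incident(workflow_instance_id)
        )
        incident.events.append((layer, note))
        return incident

    def timeline(self, workflow_instance_id):
        # Return a copy so callers cannot mutate the stored history.
        return list(self._incidents[workflow_instance_id].events)


store = IncidentStore()
store.record(1001, "L1", "alert received in Feishu thread")
store.record(1001, "L4", "evidence collected")
store.record(2002, "L1", "alert received in Feishu thread")
store.record(1001, "L5", "report generated")
```

Because every layer writes against the same key, the alert, the evidence, and the report for instance 1001 stay together even when another incident arrives in between.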

Four Core Modules: Making Self-Healing Truly Production-Ready

Built on top of the five-layer architecture, four tightly integrated modules make the system practical, scalable, and trustworthy.

📌 Channel: Native Feishu Entry Point for Unified Collaboration

Feishu groups become the alert entrance, collaboration interface, and result feedback page all in one.

Agents, humans, and on-call workflows collaborate within the same thread. Group chats display concise conclusions, while detailed evidence is preserved in reports for future reference.

📌 Runtime: Intelligent Orchestration Engine with Decoupled Rules and Execution

Claude Code manages conversation orchestration logic, while Skills encapsulate operational expertise such as fault response, workflow design, and data quality governance.

By separating orchestration, rules, and execution into independent layers, the system becomes highly extensible and continuously evolvable.

📌 Control Plane: dsctl as the Unified Execution Foundation

dsctl is the core execution engine powering the entire Agent system.

It provides standardized CLI capabilities that can be safely invoked by automation systems:

Evidence collection: doctor / digest / log
Fault repair: recover-failed / edit --dry-run
Result verification: watch / digest
Unified outputs: fully observable, traceable, and auditable

With dsctl, manual commands become stable automation primitives.
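One way to picture these primitives is as a lookup from phase to subcommand. The sketch below only reuses the subcommand names listed above; it deliberately does not show how they are invoked, since arguments, flags, and output formats are not covered in this post and should be taken from the dsctl documentation. The `PHASE_COMMANDS`, `phases_for`, and `plan` names are hypothetical.

```python
# Subcommand names are copied from the list above. Their real arguments and
# output formats are not described in the post, so nothing is executed here.
PHASE_COMMANDS = {
    "evidence": ("doctor", "digest", "log"),
    "repair": ("recover-failed", "edit --dry-run"),
    "verification": ("watch", "digest"),
}


def phases_for(command):
    """Return the phases in which a given subcommand appears."""
    return [phase for phase, cmds in PHASE_COMMANDS.items() if command in cmds]


def plan(phases=("evidence", "repair", "verification")):
    """Flatten the chosen phases into an ordered list of (phase, command) steps."""
    return [(phase, cmd) for phase in phases for cmd in PHASE_COMMANDS[phase]]
```

For example, `phases_for("digest")` shows that digest serves both evidence collection and verification, which matches the list above.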

A Seven-Step Standard Closed Loop: Dual-Path Protection for Production Safety

From alert triggering to incident reporting, the Agent strictly follows a seven-step state machine:

Alert Parsing → Diagnosis → Decision → Execution → Verification → Response → Reporting

Two execution paths guarantee safety:

Happy Path: for low-risk scenarios with sufficient evidence.
collect evidence → generate execution plan → recover failed tasks → verify → reply in the group → generate report

Escalation Path: for insufficient evidence, high-risk situations, or failed verification.
escalate to human operators while preserving complete context, and never falsely report success.

Everything is traceable, auditable, and reviewable, enabling safe and stable production deployment.
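The seven steps and the two paths can be sketched as a small state machine. This is an illustration under stated assumptions, not the Agent's real implementation: the `Step` enum, the `ESCALATED` marker, and the three boolean guards are hypothetical, and the sketch simply stops at escalation because the post does not say which later steps still run after a hand-off.

```python
from enum import Enum


class Step(Enum):
    ALERT_PARSING = 1
    DIAGNOSIS = 2
    DECISION = 3
    EXECUTION = 4
    VERIFICATION = 5
    RESPONSE = 6
    REPORTING = 7
    ESCALATED = 99  # not one of the seven steps; marks the hand-off to a human


# Enum members iterate in definition order, so this is the seven-step happy path.
HAPPY_PATH = [s for s in Step if s is not Step.ESCALATED]


def run(evidence_sufficient, low_risk, verification_passed):
    """Walk the seven steps, escalating instead of continuing when a guard fails."""
    trace = []
    for step in HAPPY_PATH:
        trace.append(step)
        # Guard 1: never execute on weak evidence or a high-risk action.
        if step is Step.DECISION and not (evidence_sufficient and low_risk):
            trace.append(Step.ESCALATED)
            break
        # Guard 2: a failed verification is escalated, never reported as success.
        if step is Step.VERIFICATION and not verification_passed:
            trace.append(Step.ESCALATED)
            break
    return trace  # the trace itself is the preserved context
```

Calling `run(True, True, True)` walks all seven steps, while any failed guard ends the trace with `Step.ESCALATED` and keeps the steps taken so far for the human operator.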

📌 Safety: Four-Level Risk Governance — Safety Comes First

In production automation, safety always matters more than speed.

The system classifies operations into four risk levels:

Automatically Allowed: read-only queries and log viewing
Automatic + Protection: low-risk recovery operations like recover-failed
Human Approval Required: high-risk modifications
Forbidden: dangerous operations such as force-success are directly blocked

This defines the system’s core philosophy:

The true strength of an Agent is not “daring to execute,” but knowing “when not to execute.”
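As a rough sketch, the four levels could be expressed as a policy table plus a single decision function. The `Risk` enum, the `POLICY` table, and `decide` are hypothetical names. Placing `digest` under the read-only level and `edit` under approval is an assumption made for the example (the post only names log viewing, recover-failed, and force-success explicitly), and so is the default of requiring approval for unknown operations.

```python
from enum import Enum


class Risk(Enum):
    AUTO = "automatically allowed"
    AUTO_PROTECTED = "automatic + protection"
    APPROVAL = "human approval required"
    FORBIDDEN = "forbidden"


# Illustrative mapping; "digest" and "edit" placements are assumptions.
POLICY = {
    "log": Risk.AUTO,
    "digest": Risk.AUTO,
    "recover-failed": Risk.AUTO_PROTECTED,
    "edit": Risk.APPROVAL,
    "force-success": Risk.FORBIDDEN,
}


def decide(operation, approved=False):
    """Return True only when the operation may run right now."""
    # Assumption: anything not listed is treated as needing human approval.
    level = POLICY.get(operation, Risk.APPROVAL)
    if level is Risk.FORBIDDEN:
        return False  # blocked even if someone approved it
    if level is Risk.APPROVAL:
        return approved
    return True  # AUTO and AUTO_PROTECTED run without a human in the loop
```

Note that `decide("force-success", approved=True)` still returns `False`: in this sketch a forbidden operation cannot be unlocked by approval, which is the "knowing when not to execute" idea in code form.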

A Pragmatic Roadmap: Gradual Delegation Toward Autonomous Operations

To ensure safe production adoption, the Agent follows a gradual empowerment strategy:

MVP Stage: read-only diagnosis + automated short replies
V1 Stage: enable low-risk automatic recovery via recover-failed
V2 Stage: integrate approval mechanisms for broader controllable operations
V3 Stage: accumulate Runbooks and Skills for community collaboration
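The staged rollout above can be read as a cumulative capability gate. The sketch below assumes each stage keeps everything the previous one enabled, which the post implies but does not state outright; the capability labels and the `capabilities` helper are hypothetical.

```python
# Capabilities each stage adds, in rollout order (labels are illustrative).
STAGES = [
    ("MVP", {"read_only_diagnosis", "short_reply"}),
    ("V1", {"recover_failed"}),
    ("V2", {"approved_operations"}),
    ("V3", {"shared_runbooks_and_skills"}),
]


def capabilities(stage):
    """Return everything enabled at a stage, assuming stages are cumulative."""
    enabled = set()
    for name, added in STAGES:
        enabled |= added
        if name == stage:
            return enabled
    raise ValueError(f"unknown stage: {stage}")
```

With this gate, `"recover_failed" in capabilities("MVP")` is `False`, so an MVP deployment can diagnose and reply but cannot trigger recovery.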

The true value of this solution is not a single prompt, but an entire engineering framework built around:

Channel + Skill + CLI + Report + Safety

A reusable and portable operational architecture.

Demo

To help the audience better understand DolphinScheduler Agent’s capabilities, Liu Xiaodong also gave a live demo during the session.

Please refer to the video starting from 57:10 for the full demonstration.

🎉 Official Open Source Release: dsctl Is Now Available on GitHub

The great news is that the core project powering DolphinScheduler Agent, dolphinscheduler-cli (dsctl), has officially been open-sourced!

GitHub Repository:
dolphinscheduler-cli GitHub Repository

The project provides a complete CLI toolkit supporting:

DolphinScheduler configuration and environment management
Workflow authoring, lint checking, and dry-run simulation
Runtime monitoring, instance inspection, and log retrieval
Failure recovery, rerun handling, and batch operations
Standardized outputs fully compatible with automation and Agent integration

The project is released under the Apache-2.0 license, supports one-line installation via pip, and is compatible with mainstream DolphinScheduler versions including 3.3.2, 3.4.0, and 3.4.1.

Final Thoughts

DolphinScheduler Agent is redefining the operational paradigm for big data systems:

Free people from repetitive tasks, fragmented workflows, and endless context switching — let systems handle incidents, while humans focus on decision-making and governance.

From alert triggering to automatic recovery, automated replies, and report generation, the entire process becomes a seamless one-click closed loop.

If everything runs smoothly, operations teams really can “lie back and let the system do the work.”

Developers, operators, and big data engineers are all welcome to explore dsctl on GitHub and join the community in building a simpler, smarter, and more efficient future for operations.
