DEV Community

Apache SeaTunnel


Can You Turn “What I Want to Do” into a Runnable SeaTunnel Config with AI?


Some thoughts around Apache SeaTunnel Discussion #10651: When AI writes configurations, the hard part has never been “writing them,” but whether what’s written can actually be used.

Over the past two years, almost every data tool has been asked one question:

Can configurations stop being handwritten?

When applied to SeaTunnel, this question becomes more specific:

Can a single sentence like “what I want to do” directly become a configuration?

Taking it one step further, can this configuration be not just “roughly correct,” but actually runnable, reviewable, and modifiable?

Writing SeaTunnel configurations manually is something many people are already familiar with. What is truly troublesome is often not “writing the configuration,” but the following:

  • After writing it, will it actually run?

  • When errors occur, is it easy to troubleshoot?

  • If someone else takes over, will they understand it?

  • When requirements change, can it be modified at low cost?

AI can certainly help. But if the goal is only to “generate a piece of HOCON,” the value is limited, because the real difficulty has never been typing things out; it is making sure that what you write doesn’t trap you, or the next person who takes over.

So what is more worth doing is not simply “AI helps me write configurations,” but to stably translate the natural language “what I want to do” into a SeaTunnel configuration that is runnable, reviewable, and iterative.

This article mainly discusses three things:

  • Why this is worth doing;

  • What a relatively stable implementation path looks like;

  • How far the recent community discussions and prototypes have progressed.

1. Where the Real Demand Lies for AI Writing Configurations

1.1 Why Manual Configuration Becomes a Bottleneck

SeaTunnel task configuration is essentially a DSL (commonly HOCON, also supporting JSON/SQL), composed of env / source / transform / sink to form an executable data pipeline. Its expressive power is strong, but precisely because of that, configuration writing naturally comes with an “engineering threshold.” When team size, types of data sources, and the number of tasks all grow together, manual configuration will almost inevitably produce four types of cost:

  • Dense syntax details: nested levels, array/object structures, field types, quotation marks and escaping—any small mistake will explode at runtime.

  • Error-prone and difficult to troubleshoot: errors often manifest as “task startup failure” or “runtime failure.” When locating issues, you need to simultaneously understand engine-side constraints, connector parameter semantics, variable substitution rules, and default conventions.

  • High learning cost: newcomers need to learn HOCON syntax, SeaTunnel conventions (such as plugin_output/plugin_input), connector capability boundaries, and engine differences.

  • Slow adaptation to heterogeneous multi-source scenarios: once evolving from “single-table sync” to “multi-source join / lake ingestion / CDC / multi-table sync,” configuration complexity grows non-linearly, and templates quickly become invalid.
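For readers who have not written one, the env / source / transform / sink structure described above looks roughly like this bare skeleton (plugin and parameter names elided):

```hocon
env {
  job.mode = "BATCH"  # or "STREAMING"
}

source {
  # one or more source plugins, e.g. Jdbc { ... }
}

transform {
  # optional transform plugins, e.g. Sql { ... }
}

sink {
  # one or more sink plugins, e.g. Doris { ... }
}
```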

SeaTunnel official documentation on configuration file structure and variable substitution:

https://seatunnel.apache.org/docs/2.3.8/concept/config/

1.2 What Discussion #10651 Is Really Asking

The problem mentioned in Discussion #10651, in my view, is essentially this type of engineering requirement:

I don’t want to start writing DSL from scratch; I want to input “what I want to do + what data sources I have + what constraints I have,” and the system can generate a SeaTunnel configuration that is runnable, reviewable, and iterative, and provide actionable fix suggestions when failures occur.

Discussion entry:

https://github.com/apache/seatunnel/discussions/10651

1.3 Let Me State the Conclusion First

I don’t particularly care whether “AI can directly write a piece of HOCON.” A demo of that is not hard to produce; the difficulty is whether the generated result can enter daily use. My judgment is that this needs a more engineering-oriented path: first transform natural language into a structured IR, then render it into SeaTunnel HOCON, and finally supplement it with a machine-checkable validation report. Doing so brings at least three direct benefits:

  • Runnable: the generated result satisfies SeaTunnel configuration structure, connector required parameters, and engine constraints.

  • Reviewable: sensitive information is parameterized, key decisions enter IR, and default values and items to be confirmed are clearly visible.

  • Iterative: when validation fails, you can go back to the IR or patch layer for minimal fixes, rather than regenerating the entire configuration.

With this judgment, the next question becomes clear: how should this pipeline be built.

2. If We Really Want to Do This, What Should the Pipeline Look Like

2.1 Don’t Rush to Let the Model Directly Output HOCON

Directly letting the model output a piece of HOCON often produces good demo results, but it is not sufficient for engineering. A more stable approach is to break configuration generation into several clear stages, each of which can be checked. A minimal closed loop roughly looks like this:

  1. Intent Parsing: extract task type, source/target, mode (batch/stream), SLA, and fault tolerance requirements from natural language.

  2. Metadata Awareness: obtain source schema, primary keys/incremental positions, and target constraints (field types, partitions, write modes).

  3. Connector Resolution: select connector combinations based on “intent + engine + environment constraints,” and confirm version compatibility.

  4. Parameter Auto Fill: fill required parameters and reasonable default values; uncertain items are output as a “to-confirm list,” rather than guessing.

  5. Syntax and Semantic Validation: HOCON syntax, connector parameter schema, variable substitution, and sensitive information compliance; when failures occur, generate executable fix patches.

The model is responsible for proposing solutions; the system is responsible for fallback and validation.
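To make that division of labor concrete, here is a minimal sketch (all type and function names are my own, not an existing SeaTunnel API): the model’s only job is to fill a typed plan; rendering the plan into HOCON is plain deterministic code the model never touches.

```python
from dataclasses import dataclass, field

@dataclass
class JobPlanIR:
    """Stage 4 output: a strongly typed plan, reviewed before any HOCON exists."""
    job_mode: str
    source: dict
    sink: dict
    todo_items: list = field(default_factory=list)  # uncertain items, never guessed

def render_hocon(plan):
    """Deterministic rendering step: no model involvement, fully reproducible."""
    return (
        "env {\n"
        f'  job.mode = "{plan.job_mode}"\n'
        "}\n"
        "source {\n"
        f"  {plan.source['plugin_name']} {{ url = {plan.source['url']} }}\n"
        "}\n"
        "sink {\n"
        f"  {plan.sink['plugin_name']} {{ fenodes = {plan.sink['fenodes']} }}\n"
        "}\n"
    )

# The model proposes this plan; everything after it is checkable system code.
plan = JobPlanIR(
    job_mode="BATCH",
    source={"plugin_name": "Jdbc", "url": "${MYSQL_JDBC_URL}"},
    sink={"plugin_name": "Doris", "fenodes": "${DORIS_FENODES}"},
    todo_items=["confirm external scheduler for the daily trigger"],
)
conf = render_hocon(plan)
```

Because the renderer is dumb and deterministic, any wrong output can be traced back to a wrong field in the IR, which is exactly where a reviewer or a patch should operate.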

2.2 Structurally, This Solution Is Actually Two Pipelines

From a structural perspective, this solution can be divided into two pipelines: a control chain (intent → plan) and an artifact chain (plan → configuration → execution). Splitting it this way makes both understanding and implementation clearer.

2.2.1 Module Breakdown

  • Intent Parser: natural language → IntentSpec (structured JSON)

  • Metadata Provider: fetch schema and constraints from JDBC/Catalog/information schema

  • Connector Resolver: connector capability matrix matching (engine compatibility, CDC support, Exactly-Once support, etc.)

  • Plan Builder: generate JobPlanIR (strongly typed IR, similar to AST)

  • Config Renderer: JobPlanIR → HOCON/JSON (HOCON by default)

  • Config Linter: syntax + parameter validation + security policy checks

  • Submitter (optional): submit jobs, query status, stop jobs, rollback

2.2.2 Execution Flow (Text Sequence)

  1. User inputs natural language + environment constraints

  2. Intent Parser outputs IntentSpec

  3. Metadata Provider fetches schema/primary keys/incremental positions/target constraints

  4. Connector Resolver selects Source/Sink/Transform combinations

  5. Plan Builder outputs JobPlanIR

  6. Config Renderer generates seatunnel.conf

  7. Config Linter outputs validation_report (pass/fail + fix suggestions)

  8. If passed, Submitter submits; if failed, enter a “fix → revalidate” loop based on report
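Step 8’s “fix → revalidate” loop can be sketched like this (the linter and patcher are toy versions, names hypothetical). The point is that fixes are deterministic patches per error class, and the loop is bounded rather than open-ended regeneration:

```python
def lint(config):
    """Toy linter: flag plaintext secrets and one missing required key."""
    errors = []
    pw = config.get("password", "")
    if pw and not pw.startswith("${"):
        errors.append("plaintext password; use ${VAR}")
    if "url" not in config:
        errors.append("missing required parameter: url")
    return errors

def apply_patch(config, error):
    """Toy patcher: one deterministic fix per known error class."""
    if "plaintext password" in error:
        return {**config, "password": "${MYSQL_PASSWORD}"}
    if "missing required parameter: url" in error:
        return {**config, "url": "${MYSQL_JDBC_URL}"}
    return config  # unknown error: leave for a human

def fix_loop(config, max_rounds=3):
    """Bounded fix -> revalidate loop; returns config and remaining errors."""
    for _ in range(max_rounds):
        errors = lint(config)
        if not errors:
            return config, []
        for e in errors:
            config = apply_patch(config, e)
    return config, lint(config)

fixed, remaining = fix_loop({"password": "hunter2"})
```

Anything the patcher cannot fix survives in `remaining` and goes back to the user as part of the validation report, instead of being silently regenerated.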

The execution side does not need to start from scratch. The SeaTunnel MCP server has already demonstrated how an LLM can submit and manage SeaTunnel tasks via tools, and it can be referenced directly when building an MVP:

https://github.com/apache/seatunnel-tools

3. If Building an MVP, What Should the First Version Look Like

3.1 Input and Output Format: Define the Protocol First

The biggest risk for an MVP is inconsistent outputs. The simplest way is to define the I/O protocol first.

3.1.1 Input: IntentSpec (JSON)

```json
{
  "intent": "Sync mysql.shop.orders fully to Doris ods.orders, run daily",
  "engine": "zeta",
  "mode": "BATCH",
  "source": {
    "type": "mysql",
    "jdbc_url": "${MYSQL_URL}",
    "username": "${MYSQL_USERNAME}",
    "password": "${MYSQL_PASSWORD}",
    "database": "shop",
    "table": "orders"
  },
  "sink": {
    "type": "doris",
    "fenodes": "${DORIS_FENODES}",
    "username": "${DORIS_USERNAME}",
    "password": "${DORIS_PASSWORD}",
    "database": "ods",
    "table": "orders"
  },
  "constraints": {
    "parallelism": 4,
    "no_plaintext_secret": true,
    "target_ddl_policy": "validate_only"
  }
}
```

3.1.2 Output: Configuration + Validation Report

  • seatunnel.conf: HOCON (default). Sensitive information must be parameterized using ${...}

  • validation_report.json: errors / warnings / to-be-confirmed parameter list / fix suggestions (can generate patch)

3.2 Prompts Are Not the Main Character, Boundaries Are

There is no need to overcomplicate prompt design. The key point is only one: confine uncertainty within a verifiable range. For MVP, a “three-stage Prompt” is sufficient:

3.2.1 Prompt A: Intent → Plan (Only Output IR, Not Configuration)

Goal: Output JobPlanIR (JSON), with fixed fields and fixed enums, and prohibit natural language explanations.

Key constraints:

  • Explicitly define job.mode, engine, and plugin_name for source/sink

  • Determine plugin_output/plugin_input reference relationships; the legacy result_table_name/source_table_name are accepted only for compatibility on input

  • Plaintext secrets are not allowed

  • Uncertain items must be placed in todo_items[]

3.2.2 Prompt B: Plan → HOCON Rendering

Goal: Output only HOCON, and strictly limit sections to env/source/transform/sink.

Key constraints:

  • All sensitive fields must be written as ${VAR} or ${VAR:default}

  • Do not output nonexistent parameter names (parameter names must come from the rule set)

3.2.3 Prompt C: Self-check (Lint + Semantic)

Goal: Output structured validation_report.json:

```json
{
  "errors": [],
  "warnings": [],
  "todo_items": [],
  "patch_suggestion": ""
}
```

3.3 How to Choose Models: Local Open Source or Cloud LLM

| Dimension | Local Open-source Models | Cloud LLMs |
| --- | --- | --- |
| Generation Quality | Requires fine-tuning / retrieval fallback | Usually stronger, more stable for complex reasoning |
| Data Compliance | Data stays within domain, strong advantage | Requires desensitization, auditing, contracts, compliance evaluation |
| Cost | Fixed cost, controllable | Grows with usage |
| Latency | Can be low or high (depends on inference stack) | More affected by network fluctuations |
| Operations | Requires GPU / inference services | Depends on vendor stability |

In the MVP stage, it is generally better to first use cloud models to run through the full chain of “generation → validation → submission → rollback,” and then move toward local or hybrid deployment based on enterprise compliance and cost considerations.

3.4 Which Compatibility Rules Should Be Fixed from the Beginning

If compatibility rules are not clearly defined upfront, things will become chaotic later. The following are better treated as hard constraints:

  • Default output is HOCON; JSON/SQL must be explicitly declared and follow extension constraints (e.g., .json)

Reference: https://seatunnel.apache.org/docs/2.3.8/concept/config/

  • Fixed section order: env → source → transform → sink

  • plugin_output/plugin_input are written explicitly only when referencing across sections, using multiple sources/sinks, or chaining transforms; in single-chain scenarios, reduce noise as much as possible

  • Variable substitution uses ${var} and ${var:default}, uniformly injected at runtime (do not hardcode environment differences)

  • Plaintext passwords / AK / SK are prohibited; must use variables or external secret management systems
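The ${var} / ${var:default} convention above is easy to emulate when dry-running generated configs locally. A minimal substitution sketch (my own helper, not SeaTunnel’s actual implementation, which resolves variables at submit time); note that a missing variable without a default is deliberately left in place so a linter can flag it:

```python
import re

# Matches ${VAR} and ${VAR:default}
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")

def substitute(text, env):
    """Replace placeholders with values from env, falling back to defaults."""
    def repl(m):
        name, default = m.group(1), m.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        return m.group(0)  # unresolved, no default: keep the placeholder visible
    return _VAR_PATTERN.sub(repl, text)

line = 'data_save_mode = "${DORIS_DATA_SAVE_MODE:APPEND_DATA}"'
with_default = substitute(line, {})
overridden = substitute(line, {"DORIS_DATA_SAVE_MODE": "DROP_DATA"})
unresolved = substitute("url = ${MYSQL_JDBC_URL}", {})
```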

Once these boundaries are defined, the next practical question is: where do connector rules come from?

3.5 The Rule System Does Not Have to Be Fully Handwritten

There is one point in PR #10789 that I find very practical: it does not rely entirely on manually maintained connector rules. Instead, it scans SeaTunnel Java source files such as *Factory.java and *Options.java to automatically generate a connector catalog, and then processes the option inheritance chain. This is very helpful for rule system design.

A more practical approach is not to rely entirely on handwritten rules, but to divide into two layers:

  1. Auto-generated layer: extract connector names, OptionRule, default values, required parameters, and parameter aliases from source code

  2. Human-enhanced layer: supplement knowledge that is difficult to express in static code, such as CDC capabilities, recommended engines, typical combinations, common misconfigurations, and enterprise security policies

If the running SeaTunnel cluster can expose interfaces such as /option-rules, then the knowledge acquisition chain can be further upgraded to:

  1. Runtime interface first: obtain the most accurate connector rules for the current version

  2. Auto-generated catalog fallback: avoid complete failure in offline or no-cluster scenarios

  3. Keyword/example routing supplement: improve the hit rate from natural language to connectors

Therefore, rules/connectors.yaml here is more like a manually corrected layer on top of automatically generated rules, rather than a fully handwritten “parameter encyclopedia.”
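The three-tier acquisition chain can be expressed as a simple chain of providers (all names and rule contents here are hypothetical): try the runtime endpoint first, fall back to the auto-generated catalog, and let the human-maintained overlay win on conflicts:

```python
def load_runtime_rules(connector):
    """Would query a running cluster (e.g. an /option-rules endpoint).
    Simulated as offline here, so it returns None and triggers the fallback."""
    return None

# Layer 1 fallback: extracted from *Factory.java / *Options.java (illustrative)
GENERATED_CATALOG = {
    "Jdbc": {"required": ["url", "driver"], "defaults": {"fetch_size": 1024}},
}

# Layer 2: knowledge that static scanning cannot see (illustrative)
HUMAN_OVERLAY = {
    "Jdbc": {"notes": ["needs primary key or partition_column for parallel read"]},
}

def resolve_rules(connector):
    """Runtime rules first, generated catalog as fallback, overlay always applied."""
    base = load_runtime_rules(connector) or GENERATED_CATALOG.get(connector, {})
    merged = dict(base)
    merged.update(HUMAN_OVERLAY.get(connector, {}))  # overlay wins on conflicts
    return merged

rules = resolve_rules("Jdbc")
```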

At this point, the abstract parts are almost covered. Next, let’s look directly at a complete example.

4. A Complete Example: From “What I Want to Do” to a Runnable Configuration

Let’s look at a full example that connects “natural language → IR → HOCON → validation report.”

Fully sync mysql.shop.orders to Doris ods.orders, run daily, use zeta engine, parallelism 4.

The generator should not only output a piece of HOCON, but also output JobPlanIR, seatunnel.conf, and validation_report. IR is used to review intent, HOCON is used for execution, and the validation report is used to expose risks and items requiring confirmation.

One detail here is easy to misread: in the example, the business type of the source is written as mysql, but the rendered plugin_name is Jdbc. This is not an error. The example describes a “full table read from MySQL,” which maps to the JDBC Source usage scenario in SeaTunnel. If the goal were MySQL CDC, the resulting source plugin would typically be MySQL-CDC instead.
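That is, the plugin is chosen from the business type plus the sync mode, not from the business type alone. A toy resolver (mapping is illustrative, taken from this article’s example only):

```python
def resolve_source_plugin(source_type, sync_mode):
    """Pick a SeaTunnel source plugin from business type + sync mode."""
    if source_type == "mysql":
        # Full/incremental table reads go through the JDBC source;
        # change-data-capture goes through the dedicated CDC connector.
        return "MySQL-CDC" if sync_mode == "cdc" else "Jdbc"
    raise ValueError(f"no rule for source type: {source_type}")
```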

4.1 First Look at JobPlanIR: It Fixes the Intent

You can think of JobPlanIR as an intermediate representation similar to an AST. It is not directly executed, but is mainly used for connector matching, parameter checking, and subsequent rendering.

```json
{
  "job_mode": "BATCH",
  "engine": "zeta",
  "source": {
    "type": "mysql",
    "plugin_name": "Jdbc",
    "sync_mode": "full",
    "jdbc_url": "${MYSQL_JDBC_URL}",
    "driver": "com.mysql.cj.jdbc.Driver",
    "username": "${MYSQL_USERNAME}",
    "password": "${MYSQL_PASSWORD}",
    "database": "shop",
    "table": "orders",
    "table_path": "shop.orders"
  },
  "sink": {
    "type": "doris",
    "plugin_name": "Doris",
    "fenodes": "${DORIS_FENODES}",
    "username": "${DORIS_USERNAME}",
    "password": "${DORIS_PASSWORD}",
    "database": "ods",
    "table": "orders",
    "data_save_mode": "${DORIS_DATA_SAVE_MODE:APPEND_DATA}",
    "schema_save_mode": "${DORIS_SCHEMA_SAVE_MODE:CREATE_SCHEMA_WHEN_NOT_EXIST}",
    "sink_label_prefix": "${DORIS_LABEL_PREFIX:orders_full_sync}",
    "doris_config": {
      "format": "json",
      "read_json_by_line": "true"
    }
  },
  "transform": [],
  "constraints": {
    "parallelism": 4,
    "schedule": "daily_external",
    "no_plaintext_secret": true,
    "engine_compatibility": "Jdbc source + Doris sink are supported on SeaTunnel Zeta",
    "secret_placeholders": [
      "MYSQL_JDBC_URL",
      "MYSQL_USERNAME",
      "MYSQL_PASSWORD",
      "DORIS_FENODES",
      "DORIS_USERNAME",
      "DORIS_PASSWORD"
    ]
  },
  "todo_items": [
    "Confirm daily scheduling method; SeaTunnel HOCON does not natively support cron, requires external scheduler to trigger daily",
    "Confirm Doris write semantics; current default APPEND_DATA ensures runnability, change to DROP_DATA if overwrite full sync is required",
    "Confirm mysql.shop.orders has primary key or splittable column; otherwise Jdbc Source may degrade to single-thread reading"
  ]
}
```

4.2 Then Look at seatunnel.conf: It Executes the Job

This layer should stay concise, containing only the necessary runtime parameters. Connection info and passwords are parameterized. Since this is a single-chain job, there is no need for plugin_output/plugin_input; the empty transform {} is kept only to preserve the typical structure.

```hocon
env {
  parallelism = 4
  job.mode = "BATCH"
}

source {
  Jdbc {
    url = ${MYSQL_JDBC_URL}
    driver = "com.mysql.cj.jdbc.Driver"
    username = ${MYSQL_USERNAME}
    password = ${MYSQL_PASSWORD}
    table_path = "shop.orders"
  }
}

transform {
}

sink {
  Doris {
    fenodes = ${DORIS_FENODES}
    username = ${DORIS_USERNAME}
    password = ${DORIS_PASSWORD}
    database = "ods"
    table = "orders"
    sink.label-prefix = "${DORIS_LABEL_PREFIX:orders_full_sync}"
    schema_save_mode = "${DORIS_SCHEMA_SAVE_MODE:CREATE_SCHEMA_WHEN_NOT_EXIST}"
    data_save_mode = "${DORIS_DATA_SAVE_MODE:APPEND_DATA}"
    doris.config {
      format = "json"
      read_json_by_line = "true"
    }
  }
}
```

4.3 Finally Look at validation_report: It Explains the Issues Clearly

The validation report is not decoration. It answers two questions: what is runnable, and what still needs confirmation.

```json
{
  "errors": [],
  "warnings": [
    "Generated based on intent: full sync mysql.shop.orders to Doris ods.orders, run daily, zeta engine, parallelism 4",
    "Default Doris data_save_mode set to APPEND_DATA for runnability; change to DROP_DATA if overwrite full sync is required",
    "Scheduling is not encoded in SeaTunnel config; requires external scheduler for daily trigger",
    "Jdbc partitioning not explicitly set; if no primary key or unique index exists, parallelism may be lower than env.parallelism=4"
  ],
  "todo_items": [
    "Add external scheduler configuration (e.g., cron, Airflow, DolphinScheduler)",
    "Confirm DORIS_DATA_SAVE_MODE should be DROP_DATA",
    "Confirm primary key / unique key or partition_column for orders table"
  ],
  "patch_suggestion": ""
}
```

In this example, the three points I most want to emphasize are: sensitive information is not stored in plaintext, connector parameters have clear sources, and uncertain items are not guessed blindly.

At this point, the solution, protocol, and example have all been covered. The final question returns to something more practical: is this approach actually worth it?

5. What Do We Ultimately Save by Doing This

5.1 Three Typical Scenarios

5.1.1 Database Synchronization (MySQL → Doris)

  • Manual: a large number of connector parameters and table mapping details

  • AI-generated: input intent + connection information → output runnable HOCON + to-confirm items

5.1.2 Lakehouse Ingestion (Hive → Iceberg)

  • Manual: complex combinations of catalog / warehouse / partition / commit parameters

  • AI-generated: automatically fills required parameters based on rule system and lists uncertain items as to-confirm items

5.1.3 Log Collection (S3/Local → Elasticsearch)

  • Manual: format parsing, field mapping, index naming, retry strategies are easy to miss

  • AI-generated: first produces a “minimum runnable version,” then iteratively enhances based on validation and runtime feedback

5.2 Comparison Dimensions (Intuitive, Non-Academic)

The following numbers are more like experience-based estimates, mainly to give a sense of scale rather than strict experimental data. Actual benefits depend on the team’s familiarity with SeaTunnel, metadata integration, and connector complexity.

| Dimension | Manual Configuration | AI-generated Configuration (with validation) |
| --- | --- | --- |
| Time to first completion | 30–120 minutes | 3–15 minutes |
| Lines of configuration | 80–200 lines | 40–120 lines (more parameterized) |
| Syntax error rate | High (common) | Low (lint + rule system fallback) |
| Learning difficulty | High | Medium (mainly learning the input protocol and confirmation list) |

6. How This Can Be Further Advanced

6.1 If We Want to Push This Forward in the Community, How Can We Collaborate

  • Add to Discussion #10651: input/output protocol, MVP milestones, reproducible examples

  • Continue discussions around PR #10789: whether to evolve seatunnel-cli/ as a standalone tool, or settle into a two-layer architecture of “generation core + CLI/API frontend”

  • Contribution directions:

    • Enhance connector catalog auto-generation (source extraction, inheritance chain parsing, version diffing)
    • Improve connector rule system (required parameters, default values, engine compatibility)
    • Improve validator (more readable error messages and fix suggestions)
    • Strengthen secret handling (session memory desensitization, placeholder injection, external secret manager integration)
    • Add more examples (cover JDBC / CDC / file / lakehouse scenarios)

6.2 If We Really Want to Implement This, What Pitfalls Must Be Considered First

  • The most common issue is still the model “seems to understand but actually doesn’t.” So a more stable approach is not to let it freely generate, but to constrain outputs within verifiable boundaries using IR, rule systems, and lint. When uncertain, it should explicitly list items in the to-confirm list.

  • Metadata should not be taken for granted. Schema, table structure, and field information can indeed help reduce trial and error, but only if desensitization is the default, data access is controlled, and sensitive values are not included in prompts.

  • If session memory is supported later, the risk is not only “remembering context,” but also “accidentally remembering connection information.” A better approach is to store only aliases, references, or secret locations—not plaintext credentials.

  • Another layer is enterprise compliance. Audit logs, permission isolation, whether local models can be used, whether configuration release requires approval and rollback—these are often overlooked, but unavoidable in production environments.

7. Final Questions to Continue the Discussion

At this point, the core concern remains unchanged: whether AI can write configurations is not the hardest part. The harder part is how to stabilize the entire chain of “generation → validation → repair → execution.”

If this is only for occasional demos, being able to generate is enough; but if we truly want it to enter daily team workflows, the fallback, review, and repair mechanisms must also be completed.

If you are also interested in this direction, feel free to continue discussing the following questions.

7.1 Q&A (Leave Your Thoughts)

  1. What is the biggest pain point for your team when writing SeaTunnel configurations: syntax, parameters, or troubleshooting?

  2. Would you prefer AI to first solve “configuration generation” or “automatic repair after failure”?

  3. What interaction style do you prefer: Chat (conversational) or Form (structured form)?

7.2 Quick Poll (Reply with the Option Number)

  • A: I need one-click “intent → configuration” generation

  • B: I need “configuration → validation → fix suggestions”

  • C: I need a full loop of “generation + submission + self-healing on failure”

  • D: I only want “connector parameter auto-fill + template library”

References

  • Discussion #10651: AI-generated SeaTunnel job configuration

https://github.com/apache/seatunnel/discussions/10651

  • PR #10789: Introduces seatunnel-cli prototype for natural language configuration generation

https://github.com/apache/seatunnel/pull/10789

  • SeaTunnel configuration structure and variable substitution (HOCON/JSON/SQL)

https://seatunnel.apache.org/docs/2.3.8/concept/config/

  • SeaTunnel Tools repository (including MCP-related content)

https://github.com/apache/seatunnel-tools
