What a Big JSON Incident Taught Us About Apache SeaTunnel Tuning

The Trigger: “I Thought It Was Just Copying a Config”

The original idea was embarrassingly simple.

We already had a stable SeaTunnel CDC → Doris pipeline for the amzn_order table.
So I thought: why not just “copy-paste” that configuration, change the table name to amzn_api_logs, and call it a day?

What actually happened:

  • Memory usage on the production machine skyrocketed to tens of gigabytes
  • The Java process started hitting OOMs repeatedly
  • Doris and Trino—running on the same host—were dragged into the chaos

What hurt the most was this realization:

This wasn’t a SeaTunnel bug.
It was my shallow understanding of data chunking, streaming writes, and JVM memory behavior.

This post is my retrospective—from “streaming means it won’t pile up in memory” to the hard truth:

What you think is a “stream” is actually multiple layers of buffers and batches stacked together.

Ground Zero: How I Almost Drained a 60 GB Machine

At the peak of the incident, top looked roughly like this:

MiB Mem : 63005.9 total,  2010.6 free,  53676.2 used,  8097.3 buff/cache
MiB Swap:     0.0 total,     0.0 free,      0.0 used
...
PID      VIRT     RES  %MEM  COMMAND
2366021  22.5g   16.9g  27%  java ... seatunnel-2.3.11 ...
1873099  14.3g    7.1g  11%  trino
1895794  49.5g    1.7g   2%  doris_be

SeaTunnel alone was holding 16–17 GB of resident memory (RES).
Free memory on the machine dropped below 2 GB, swap was disabled, and the OOM Killer was basically hovering over the kill switch.

And yet, a thought still lingered in my head:

“But isn’t this streaming? Why is it eating all the memory?”

Table Schema and Configuration: Everything Looked Fine—Until It Didn’t

Table: amzn_api_logs

CREATE TABLE `amzn_api_logs` (
  `id` bigint NOT NULL,
  `business_type` varchar(100) NOT NULL,
  `req_params` json DEFAULT NULL,
  `resp` json DEFAULT NULL,
  `seller_id` varchar(32) NOT NULL,
  `market_place_id` varchar(32) NOT NULL,
  `create_time` datetime NOT NULL,
  `update_time` datetime DEFAULT NULL,
  `remark` varchar(255) DEFAULT NULL,
  `is_delete` bigint NOT NULL DEFAULT '0',
  `version` bigint NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_create_time` (`create_time`) USING BTREE
) ENGINE=InnoDB;

Two JSON columns: req_params and resp.

If you’ve worked with log-style JSON before, you already know how large these fields can get.

Initial SeaTunnel Configuration (Core Parts)

env {
  job.name = "amzn_api_logs"
  execution.parallelism = 10

  job.mode = "STREAMING"
  checkpoint.interval = 60000
}

source {
  MySQL-CDC {
    parallelism = 6
    incremental.parallelism = 4
    snapshot.parallelism = 4

    table-names = ["amzn_data_prd.amzn_api_logs"]
    snapshot.split.size = 50000
    snapshot.fetch.size = 10000
    chunk-key-column = "id"

    exactly_once = true
    startup.mode = "initial"
  }
}

sink {
  doris {
    sink.model = "UNIQUE_KEYS"
    sink.primary_key = "id"
    sink.label-prefix = "amzn_api_logs_cdc_doris"
    sink.enable-2pc = "true"

    doris.batch.size = 50000
    ...
    doris.config {
      format = "json"
      read_json_by_line = "true"
    }
  }
}

My mental model at the time:

“CDC + STREAMING + Doris means rows flow through one by one.
Maybe a bit of buffering—but nothing explosive.”

In hindsight, this configuration was practically a perfect storm for
large JSON + high parallelism + initial snapshot.

Why This Configuration Was a Disaster

1. Huge JSON Fields

  • JSON is stored compactly in MySQL
  • Once inside the JVM, it becomes Strings, Maps, objects
  • Memory expansion of 3–5× is completely normal

2. doris.batch.size = 50000

Even 5,000 rows of JSON logs can reach hundreds of megabytes.
50,000 rows? You don’t need a calculator to know where this ends.

3. Parallelism Multiplies Everything

With:

  • execution.parallelism = 10
  • multiple snapshot.*.parallelism

You’re not holding one batch—you’re holding many batches at the same time.

4. exactly_once = true + enable-2pc = true

To guarantee exactly-once semantics:

  • Data must be retained during checkpoints
  • Memory peaks grow dramatically during checkpoint windows
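
Put together, those four factors compound multiplicatively, not additively. A rough way to think about the peak working set, using this job's own numbers and the 3–5× expansion estimate above (the average row size is the only unknown):

peak working set ≈
    doris.batch.size (50,000 rows per batch)
  × average row size in MySQL
  × JVM expansion factor (≈3–5× for JSON)
  × number of batches and splits in flight (execution.parallelism = 10, snapshot.*.parallelism = 4)

Even with a modest per-row size, that product lands in the tens of gigabytes, which is exactly what top showed.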

Why Linux “Available Memory” Is a Lie (For JVMs)

At one point, I fixated on this:

“Free memory is only 2 GB, but available is still 9 GB. We should be fine, right?”

Wrong.

available ≈ free + reclaimable page cache.

From the kernel’s perspective, cache can be dropped anytime.
But for Java processes (SeaTunnel, Trino, agents):

  • GC needs extra memory to move objects
  • Large JSON allocations often require contiguous heap space
  • When allocation fails → Java heap space → cascading failures

With no swap and large heaps,
“2 GB free + 9 GB available” is not safety—it’s an illusion.

The OOM: Debezium and Snapshot Splits in Distress

Typical error snippet:

Caused by: java.lang.OutOfMemoryError: Java heap space

Caused by: org.apache.seatunnel.common.utils.SeaTunnelException:
  Read split SnapshotSplit(tableId=amzn_data_prd.amzn_api_logs,
  splitKeyType=ROW<id BIGINT>,
  splitStart=[125020918847214509],
  splitEnd=[125027189499467705]) error
  due to java.lang.IllegalArgumentException: Self-suppression not permitted

Up the stack trace:

MySqlSnapshotSplitReadTask.doExecute(...)
MySqlSnapshotSplitReadTask.createDataEventsForTable(...)
OutOfMemoryError: Java heap space

What was happening:

  • Debezium was processing snapshot splits
  • Each split contained 50,000 rows
  • Each row carried large JSON
  • Objects flooded the heap before the sink could consume them

The Self-suppression not permitted messages were just collateral damage—
the real issue was heap exhaustion.

The Big Realization: “Streaming” Has Dams

My original mental model:

MySQL → SeaTunnel → Doris
Read a bit, write a bit, memory stays flat.

Reality: there are multiple dams in the pipeline.

1. Source (Debezium Snapshot)

  • snapshot.split.size
  • snapshot.fetch.size
  • snapshot.parallelism

2. Mid-pipeline Buffers

  • Channels between Source and Sink
  • execution.parallelism × queue sizes

3. Sink (Doris Stream Load)

  • doris.batch.size
  • sink buffers
  • 2PC retention during checkpoints

Streaming doesn’t mean no memory.
It means data takes a round trip through memory instead of disk.

How big that round trip is depends entirely on your configuration.
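
To make the dams concrete, here is where each knob lives in the job file. This is only the parameters already discussed in this post gathered into one sketch (values elided), not a recommended profile:

env {
  execution.parallelism = ...      # how many copies of every buffer exist at once
  checkpoint.interval   = 60000    # with 2PC, how long in-flight batches may be retained
}

source {
  MySQL-CDC {
    snapshot.split.size  = ...     # rows per snapshot split handed to a Debezium reader
    snapshot.fetch.size  = ...     # rows pulled from MySQL per fetch
    snapshot.parallelism = ...     # how many splits are read at the same time
  }
}

sink {
  doris {
    doris.batch.size = ...         # rows accumulated before each Stream Load
    sink.enable-2pc  = ...         # whether batches are held until the checkpoint commits
  }
}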

The Tuning Strategy: High Concurrency, Small Chunks

My first instinct was to reduce parallelism.
It worked—but felt like wasting a capable machine.

Eventually, the goal became clear:

Keep concurrency, shrink the unit of work.

Source Side: Split Smaller

From:

snapshot.split.size = 50000
snapshot.fetch.size = 10000
snapshot.parallelism = 4

To something like:

snapshot.split.size = 5000     # smaller splits
snapshot.fetch.size = 1000     # smaller fetches
snapshot.parallelism = 8       # keep / increase concurrency

Benefits:

  • Each Debezium thread handles smaller payloads
  • JSON object explosions are contained
  • OOM risk drops significantly

Sink Side: Batch Size Is a Hard Ceiling

Reducing doris.batch.size from 50,000 → 5,000 changed everything:

  1. Stream Load logs became frequent and steady
  2. SeaTunnel heap usage stabilized instead of climbing endlessly

A representative Doris response:

"NumberTotalRows": 5000,
"LoadBytes": 134564375,
"LoadTimeMs": 1727

5,000 rows = 134 MB raw data.
With JSON encoding and JVM object overhead, hundreds of MB per batch is normal.

Running with 50,000-row batches was basically inviting OOM.
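
The measured response above also lets you sanity-check the original setting with nothing but multiplication (the in-flight figure assumes the original execution.parallelism = 10):

#  5,000 rows  ≈ 134 MB of raw load data (measured above)
# 50,000 rows  ≈ 1.3 GB of raw data per batch, before JVM object overhead
# × 10 parallel writers ≈ 13+ GB of batch data potentially in flight at once
doris.batch.size = 5000   # keeps each Stream Load in the ~100 MB range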

2PC: Disable It for Initial Loads

Yes, enable-2pc = true gives you exactly-once.
But in my case:

  1. This was a 50 GB initial snapshot
  2. The target table used UNIQUE KEY(id)—naturally idempotent

If it failed, I could simply rerun it.

So I changed:

sink.enable-2pc = "false"
exactly_once = false

Immediate effects:

  • Writes became smoother instead of “burst per checkpoint”
  • Memory peaks dropped noticeably
  • GC pressure eased

(For incremental sync, 2PC can—and should—be turned back on.)
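
A minimal sketch of that switch back for the incremental phase, flipping only the two flags discussed above and leaving everything else in the job file unchanged:

source {
  MySQL-CDC {
    exactly_once = true        # re-enable exactly-once for the ongoing CDC phase
  }
}

sink {
  doris {
    sink.enable-2pc = "true"   # batches commit atomically at each checkpoint again
  }
}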

Monitoring: Don’t Just Ask “Is It Running?”

What helped most during tuning:

Doris Stream Load Logs

  • NumberTotalRows
  • LoadBytes
  • LoadTimeMs

You can feel when a batch is too large.

top: RES and wa

  • Stable RES is healthy
  • High wa means I/O is saturated—more threads won’t help

SeaTunnel HealthMonitor Logs

  • heap.memory.used / max
  • minor.gc.count, major.gc.count

They tell you whether you’re approaching the cliff.

Final Lessons

Configuration Reuse Is Dangerous

amzn_order vs amzn_api_logs differed by just two JSON columns,
yet memory usage was on an entirely different scale.

Row count means nothing—byte size is king.

Streaming Pipelines Need Carefully Designed “Dams”

  • Source: split size, fetch size, parallelism
  • Sink: batch size, buffers, 2PC
  • Middle: checkpoints, exactly-once semantics

Any oversized layer can take down the JVM when JSON is involved.

Concurrency Isn’t the Enemy—Granularity Is the Key

What truly matters is:

Concurrency × Size per Task

Not the parallelism number alone.
