What a Big JSON Incident Taught Us About Apache SeaTunnel Tuning

The Trigger: “I Thought It Was Just Copying a Config”

The original idea was embarrassingly simple.

We already had a stable SeaTunnel CDC → Doris pipeline for the amzn_order table.
So I thought: why not just “copy-paste” that configuration, change the table name to amzn_api_logs, and call it a day?

What actually happened:

  • Memory usage on the production machine skyrocketed to tens of gigabytes
  • The Java process started hitting OOMs repeatedly
  • Doris and Trino—running on the same host—were dragged into the chaos

What hurt the most was this realization:

This wasn’t a SeaTunnel bug.
It was my shallow understanding of data chunking, streaming writes, and JVM memory behavior.

This post is my retrospective—from “streaming means it won’t pile up in memory” to the hard truth:

What you think is a “stream” is actually multiple layers of buffers and batches stacked together.

Ground Zero: How I Almost Drained a 60 GB Machine

At the peak of the incident, top looked roughly like this:

MiB Mem : 63005.9 total,  2010.6 free,  53676.2 used,  8097.3 buff/cache
MiB Swap:     0.0 total,     0.0 free,      0.0 used
...
PID      VIRT     RES  %MEM  COMMAND
2366021  22.5g   16.9g  27%  java ... seatunnel-2.3.11 ...
1873099  14.3g    7.1g  11%  trino
1895794  49.5g    1.7g   2%  doris_be

SeaTunnel alone was holding 16–17 GB of resident memory (RES).
Free memory on the machine dropped below 2 GB, swap was disabled, and the OOM Killer was basically hovering over the kill switch.

And yet, a thought still lingered in my head:

“But isn’t this streaming? Why is it eating all the memory?”

Table Schema and Configuration: Everything Looked Fine—Until It Didn’t

Table: amzn_api_logs

CREATE TABLE `amzn_api_logs` (
  `id` bigint NOT NULL,
  `business_type` varchar(100) NOT NULL,
  `req_params` json DEFAULT NULL,
  `resp` json DEFAULT NULL,
  `seller_id` varchar(32) NOT NULL,
  `market_place_id` varchar(32) NOT NULL,
  `create_time` datetime NOT NULL,
  `update_time` datetime DEFAULT NULL,
  `remark` varchar(255) DEFAULT NULL,
  `is_delete` bigint NOT NULL DEFAULT '0',
  `version` bigint NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_create_time` (`create_time`) USING BTREE
) ENGINE=InnoDB;

Two JSON columns: req_params and resp.

If you’ve worked with log-style JSON before, you already know how large these fields can get.

Initial SeaTunnel Configuration (Core Parts)

env {
  job.name = "amzn_api_logs"
  execution.parallelism = 10

  job.mode = "STREAMING"
  checkpoint.interval = 60000
}

source {
  MySQL-CDC {
    parallelism = 6
    incremental.parallelism = 4
    snapshot.parallelism = 4

    table-names = ["amzn_data_prd.amzn_api_logs"]
    snapshot.split.size = 50000
    snapshot.fetch.size = 10000
    chunk-key-column = "id"

    exactly_once = true
    startup.mode = "initial"
  }
}

sink {
  doris {
    sink.model = "UNIQUE_KEYS"
    sink.primary_key = "id"
    sink.label-prefix = "amzn_api_logs_cdc_doris"
    sink.enable-2pc = "true"

    doris.batch.size = 50000
    ...
    doris.config {
      format = "json"
      read_json_by_line = "true"
    }
  }
}

My mental model at the time:

“CDC + STREAMING + Doris means rows flow through one by one.
Maybe a bit of buffering—but nothing explosive.”

In hindsight, this configuration was practically a perfect storm for
large JSON + high parallelism + initial snapshot.

Why This Configuration Was a Disaster

1. Huge JSON Fields

  • JSON is stored compactly in MySQL
  • Once inside the JVM, it becomes Strings, Maps, objects
  • Memory expansion of 3–5× is completely normal

2. doris.batch.size = 50000

Even 5,000 rows of JSON logs can reach hundreds of megabytes.
50,000 rows? You don’t need a calculator to know where this ends.

3. Parallelism Multiplies Everything

With:

  • execution.parallelism = 10
  • multiple snapshot.*.parallelism

You’re not holding one batch—you’re holding many batches at the same time.

4. exactly_once = true + enable-2pc = true

To guarantee exactly-once semantics:

  • Data must be retained during checkpoints
  • Memory peaks grow dramatically during checkpoint windows
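
Put together, those four factors compound multiplicatively, not additively. A rough way to think about the peak working set, using this job's own numbers and the 3–5× expansion estimate above (the average row size is the only unknown):

peak working set ≈
    doris.batch.size (50,000 rows per batch)
  × average row size in MySQL
  × JVM expansion factor (≈3–5× for JSON)
  × number of batches and splits in flight (execution.parallelism = 10, snapshot.*.parallelism = 4)

Even with a modest per-row size, that product lands in the tens of gigabytes, which is exactly what top showed.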

Why Linux “Available Memory” Is a Lie (For JVMs)

At one point, I fixated on this:

“Free memory is only 2 GB, but available is still 9 GB. We should be fine, right?”

Wrong.

available ≈ free + reclaimable page cache.

From the kernel’s perspective, cache can be dropped anytime.
But for Java processes (SeaTunnel, Trino, agents):

  • GC needs extra memory to move objects
  • Large JSON allocations often require contiguous heap space
  • When allocation fails → Java heap space → cascading failures

With no swap and large heaps,
“2 GB free + 9 GB available” is not safety—it’s an illusion.

The OOM: Debezium and Snapshot Splits in Distress

Typical error snippet:

Caused by: java.lang.OutOfMemoryError: Java heap space

Caused by: org.apache.seatunnel.common.utils.SeaTunnelException:
  Read split SnapshotSplit(tableId=amzn_data_prd.amzn_api_logs,
  splitKeyType=ROW<id BIGINT>,
  splitStart=[125020918847214509],
  splitEnd=[125027189499467705]) error
  due to java.lang.IllegalArgumentException: Self-suppression not permitted

Up the stack trace:

MySqlSnapshotSplitReadTask.doExecute(...)
MySqlSnapshotSplitReadTask.createDataEventsForTable(...)
OutOfMemoryError: Java heap space

What was happening:

  • Debezium was processing snapshot splits
  • Each split contained 50,000 rows
  • Each row carried large JSON
  • Objects flooded the heap before the sink could consume them

The Self-suppression not permitted messages were just collateral damage—
the real issue was heap exhaustion.

The Big Realization: “Streaming” Has Dams

My original mental model:

MySQL → SeaTunnel → Doris
Read a bit, write a bit, memory stays flat.

Reality: there are multiple dams in the pipeline.

1. Source (Debezium Snapshot)

  • snapshot.split.size
  • snapshot.fetch.size
  • snapshot.parallelism

2. Mid-pipeline Buffers

  • Channels between Source and Sink
  • execution.parallelism × queue sizes

3. Sink (Doris Stream Load)

  • doris.batch.size
  • sink buffers
  • 2PC retention during checkpoints

Streaming doesn’t mean no memory.
It means data takes a round trip through memory instead of disk.

How big that round trip is depends entirely on your configuration.
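
To make the dams concrete, here is where each knob lives in the job file. This is only the parameters already discussed in this post gathered into one sketch (values elided), not a recommended profile:

env {
  execution.parallelism = ...      # how many copies of every buffer exist at once
  checkpoint.interval   = 60000    # with 2PC, how long in-flight batches may be retained
}

source {
  MySQL-CDC {
    snapshot.split.size  = ...     # rows per snapshot split handed to a Debezium reader
    snapshot.fetch.size  = ...     # rows pulled from MySQL per fetch
    snapshot.parallelism = ...     # how many splits are read at the same time
  }
}

sink {
  doris {
    doris.batch.size = ...         # rows accumulated before each Stream Load
    sink.enable-2pc  = ...         # whether batches are held until the checkpoint commits
  }
}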

The Tuning Strategy: High Concurrency, Small Chunks

My first instinct was to reduce parallelism.
It worked—but felt like wasting a capable machine.

Eventually, the goal became clear:

Keep concurrency, shrink the unit of work.

Source Side: Split Smaller

From:

snapshot.split.size = 50000
snapshot.fetch.size = 10000
snapshot.parallelism = 4

To something like:

snapshot.split.size = 5000     # smaller splits
snapshot.fetch.size = 1000     # smaller fetches
snapshot.parallelism = 8       # keep / increase concurrency

Benefits:

  • Each Debezium thread handles smaller payloads
  • JSON object explosions are contained
  • OOM risk drops significantly

Sink Side: Batch Size Is a Hard Ceiling

Reducing doris.batch.size from 50,000 → 5,000 changed everything:

  1. Stream Load logs became frequent and steady
  2. SeaTunnel heap usage stabilized instead of climbing endlessly

A representative Doris response:

"NumberTotalRows": 5000,
"LoadBytes": 134564375,
"LoadTimeMs": 1727

5,000 rows = 134 MB raw data.
With JSON encoding and JVM object overhead, hundreds of MB per batch is normal.

Running with 50,000-row batches was basically inviting OOM.
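
The measured response above also lets you sanity-check the original setting with nothing but multiplication (the in-flight figure assumes the original execution.parallelism = 10):

#  5,000 rows  ≈ 134 MB of raw load data (measured above)
# 50,000 rows  ≈ 1.3 GB of raw data per batch, before JVM object overhead
# × 10 parallel writers ≈ 13+ GB of batch data potentially in flight at once
doris.batch.size = 5000   # keeps each Stream Load in the ~100 MB range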

2PC: Disable It for Initial Loads

Yes, enable-2pc = true gives you exactly-once.
But in my case:

  1. This was a 50 GB initial snapshot
  2. The target table used UNIQUE KEY(id)—naturally idempotent

If it failed, I could simply rerun it.

So I changed:

sink.enable-2pc = "false"
exactly_once = false

Immediate effects:

  • Writes became smoother instead of “burst per checkpoint”
  • Memory peaks dropped noticeably
  • GC pressure eased

(For incremental sync, 2PC can—and should—be turned back on.)
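
A minimal sketch of that switch back for the incremental phase, flipping only the two flags discussed above and leaving everything else in the job file unchanged:

source {
  MySQL-CDC {
    exactly_once = true        # re-enable exactly-once for the ongoing CDC phase
  }
}

sink {
  doris {
    sink.enable-2pc = "true"   # batches commit atomically at each checkpoint again
  }
}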

Monitoring: Don’t Just Ask “Is It Running?”

What helped most during tuning:

Doris Stream Load Logs

  • NumberTotalRows
  • LoadBytes
  • LoadTimeMs

You can feel when a batch is too large.

top: RES and wa

  • Stable RES is healthy
  • High wa means I/O is saturated—more threads won’t help

SeaTunnel HealthMonitor Logs

  • heap.memory.used / max
  • minor.gc.count, major.gc.count

They tell you whether you’re approaching the cliff.

Final Lessons

Configuration Reuse Is Dangerous

amzn_order vs amzn_api_logs differed by just two JSON columns,
yet memory usage was on an entirely different scale.

Row count means nothing—byte size is king.

Streaming Pipelines Need Carefully Designed “Dams”

  • Source: split size, fetch size, parallelism
  • Sink: batch size, buffers, 2PC
  • Middle: checkpoints, exactly-once semantics

Any oversized layer can take down the JVM when JSON is involved.

Concurrency Isn’t the Enemy—Granularity Is the Key

What truly matters is:

Concurrency × Size per Task

Not the parallelism number alone.
