Anatoliy Samsonov

Lessons Learned: Choosing a Flink Distribution for Kubernetes (Bitnami vs Official)

When we were choosing a Flink distribution for Kubernetes N months ago, we went with the Bitnami Helm chart and Docker images instead of the official Flink Helm chart and images.

At the time, this looked like a reasonable and “enterprise-ready” choice. In practice, it introduced several non-obvious constraints that only surfaced during real production usage.

This post is a short summary of what we learned the hard way.

**1. Bitnami Flink is not “just Flink”**

Bitnami does not provide a thin wrapper around upstream Flink.

Instead, it ships:

  • a custom entrypoint.sh
  • a whole chain of bash scripts
  • internal bash utilities
  • custom logic for resource-based configuration

The Flink JVM process is not started directly.
Its configuration is calculated, mutated, and guarded by shell logic first.

That has consequences.

**2. Some Flink configs are effectively locked**

A concrete example: taskmanager.numberOfTaskSlots

This parameter is ignored.

Why?

Because Bitnami’s startup logic:

  • inspects container CPU/memory
  • applies its own heuristics
  • derives the number of slots automatically

Even if you explicitly set taskmanager.numberOfTaskSlots, the final value is overridden by Bitnami’s bash utilities.
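
For illustration, this is the kind of setting we expected to take effect. The snippet below is a minimal flink-conf.yaml sketch, not our actual configuration; the slot count and memory size are placeholder values.

```yaml
# flink-conf.yaml (illustrative values, not a recommendation)
taskmanager.numberOfTaskSlots: 4          # explicitly requested slot count...
taskmanager.memory.process.size: 4096m    # ...sized for the TaskManager container

# With the Bitnami image, the effective slot count is re-derived at startup
# from the container's CPU/memory by the entrypoint scripts, so the value
# above may not be what the TaskManager actually reports.
```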

**Can this be overridden?**
In theory — yes.

In practice — it means:

  • diving into entrypoint.sh
  • following multiple sourced bash scripts
  • modifying internal bash libraries responsible for slot calculation

At this point, you are no longer “configuring Flink”.
You are forking Bitnami’s runtime logic.

The problem scales over time

So far, we have encountered these limitations through a concrete example: the taskmanager.numberOfTaskSlots parameter. However, it is important to understand that this is not an isolated issue or a one-off bug, but rather a consequence of architectural choices.

Bitnami’s approach assumes that certain Flink parameters should be calculated automatically based on container resources and internal heuristics. Today this manifests itself in slot management, but the same behavior may later affect other critical areas: memory tuning, network buffers, high-availability settings, checkpointing configuration, or job recovery behavior.

Any Flink parameter that Bitnami decides to “handle intelligently” may start behaving differently from what is described in the official Apache Flink documentation. In some cases, configuration values may be partially ignored; in others, they may require additional, non-obvious settings that are not documented upstream. Over time, this significantly increases debugging complexity, raises the cost of changes, and amplifies risks during upgrades and scaling.
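
As a sanity check, these are the kinds of upstream keys in those areas that one would normally expect to control directly; with a wrapped image, each is worth verifying against the actual startup scripts. Key names follow the Apache Flink documentation for recent 1.x releases, and all values are placeholders.

```yaml
# flink-conf.yaml — upstream keys in the areas mentioned above (placeholder values)
taskmanager.memory.process.size: 4096m                   # memory tuning
taskmanager.memory.network.fraction: 0.1                 # network buffers
taskmanager.memory.managed.fraction: 0.4                 # managed memory for state backends
execution.checkpointing.interval: 60s                    # checkpointing
high-availability.storageDir: s3://my-bucket/flink-ha/   # HA metadata (bucket name is hypothetical)
```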

Job recovery after restart: another symptom

This issue became particularly visible when we tried to ensure recovery of running jobs after a Flink cluster restart.

From the perspective of the official Flink documentation, this scenario looks fairly straightforward: a small set of configuration parameters needs to be added, and the system should then behave in a predictable way. In practice, however, we found that the Bitnami image changes the semantics of these settings.
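
For reference, the small set of parameters described in the upstream documentation looks roughly like the sketch below. Exact key names depend on the Flink version (for example, newer releases use high-availability.type and restart-strategy.type), and the bucket and cluster-id values are hypothetical.

```yaml
# flink-conf.yaml — Kubernetes HA + checkpointing, as described in the Apache Flink docs
high-availability: kubernetes                        # 'high-availability.type' in newer Flink versions
high-availability.storageDir: s3://my-bucket/flink-ha/
kubernetes.cluster-id: my-flink-cluster              # hypothetical cluster id
state.checkpoints.dir: s3://my-bucket/checkpoints/
execution.checkpointing.interval: 60s
restart-strategy: fixed-delay                        # plus the restart-strategy.fixed-delay.* settings
```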

Some parameters have different names, others require additional dependent configuration, and some simply behave differently than expected. As a result, even though we relied on the official Apache Flink documentation, the actual behavior of the cluster diverged from what we anticipated. This once again highlighted that the Bitnami image does not fully correspond to upstream Flink behavior and introduces its own layer of configuration and startup semantics.

The official image: a possible step, but not a silver bullet

A natural next step is to consider migrating to the official Apache Flink Docker image and Helm chart. This could potentially bring the runtime closer to upstream behavior and reduce the amount of hidden logic.

At the same time, such a migration also requires caution. It is not yet clear what constraints or trade-offs the official Helm chart introduces, how flexible it is in real production scenarios, and which operational assumptions it makes. This means we either need a proper internal research effort or a consultation with a team that has been running Flink on Kubernetes in production for a long time.

The key takeaway we arrived at is simple: when choosing a Flink distribution, it is critical to understand how close it is to upstream and which decisions it makes on your behalf. The more logic a distribution hides internally, the higher the operational risk becomes over time.

One additional aspect

One aspect worth highlighting separately is network connectivity to S3-compatible object storage.

In many Flink deployments, S3 is used not just as a passive storage layer, but as a critical part of the runtime:

  • checkpoints
  • savepoints
  • state backends
  • Iceberg metadata and data files

In such setups, Flink continuously interacts with object storage and relies on it for both correctness and stability.
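
As a concrete illustration, a typical setup points checkpoints and savepoints at an S3-compatible endpoint directly in flink-conf.yaml. The endpoint and bucket names below are hypothetical, and credentials are usually injected via secrets rather than written into the file.

```yaml
# flink-conf.yaml — S3-compatible object storage on the runtime path (illustrative values)
s3.endpoint: https://minio.internal.example:9000    # hypothetical S3-compatible endpoint
s3.path.style.access: true                          # common for MinIO-style deployments
state.checkpoints.dir: s3://flink-state/checkpoints/
state.savepoints.dir: s3://flink-state/savepoints/
```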

As a distributed engine, Flink generates a large number of concurrent requests to S3. During checkpointing, every TaskManager uploads state in parallel, often producing bursts of network traffic and a high volume of small and medium-sized requests. Latency, bandwidth, and request concurrency all become first-class concerns.

If network connectivity to S3 is slow, unstable, or rate-limited, the impact is rarely limited to “slower checkpoints”. In practice, it can lead to cascading effects: checkpoints taking too long, checkpoint timeouts, backpressure propagating through the pipeline, increased memory pressure, and eventually job failures or restarts. These symptoms are often misattributed to Flink configuration issues, while the real bottleneck lies in the network path to object storage.
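
A few checkpointing knobs help keep this kind of degradation from cascading. The values below are a sketch of starting points to tune against observed checkpoint durations, not recommendations from our setup.

```yaml
# flink-conf.yaml — containing the blast radius of slow S3 I/O (illustrative values)
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min                    # fail a slow checkpoint instead of letting it hang
execution.checkpointing.min-pause: 30s                    # breathing room between consecutive checkpoints
execution.checkpointing.max-concurrent-checkpoints: 1
execution.checkpointing.tolerable-failed-checkpoints: 3   # tolerate transient S3 hiccups before failing the job
```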

This is especially easy to underestimate when Flink runs in Kubernetes and S3 is “somewhere else”: in another availability zone, behind additional network layers, or shared across multiple workloads. What looks acceptable for batch-style access patterns can become a serious bottleneck for a stateful streaming engine.

The takeaway here is simple: when designing and operating Flink clusters, network connectivity to S3 should be treated as a critical dependency. It needs to be evaluated, monitored, and capacity-planned just like CPU and memory. Ignoring this aspect can easily undermine even a well-tuned Flink setup.
