DEV Community

Rost

Posted on • Originally published at glukhov.org

Apache Flink on K8s and Kafka: PyFlink, Go, ops, and managed pricing

Apache Flink is a framework for stateful computations over unbounded and bounded data streams.

Teams adopt it for correct, low-latency streaming with event-time semantics (watermarks), fault tolerance (checkpoints), controlled upgrades (savepoints), and operational surfaces (metrics and REST).

This guide targets DevOps and Go/Python developers. It compares deployment models (self-managed vs managed), explains core architecture, covers Kubernetes (Helm and Operator) and standalone setups, contrasts Flink with Spark, Kafka Streams, Beam, and streaming databases, and shows PyFlink plus Go integration patterns including LLM and AI-oriented pipelines.

For broader context on data infrastructure patterns including object storage, databases, and messaging, see Data Infrastructure for AI Systems: Object Storage, Databases, Search & AI Data Architecture.

What is Apache Flink and why teams use it for real-time processing

Apache Flink is explicitly positioned as a stateful stream processing engine: you model your logic as a pipeline of operators and Flink runs it as a distributed dataflow with managed state and time semantics. In modern Flink documentation, the project describes itself as a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

From a practical DevOps/software engineering perspective, Flink is a good fit when you need at least one of these properties:

  • Event-time correctness at low latency: join/aggregate/enrich guarantees come from Flink’s event-time processing, where “time” is when the event happened (not when it arrived), and watermarks communicate event-time progress through the pipeline.
  • Stateful computation at scale (rolling counters, sessions, fraud rules, feature engineering): Flink treats state as a first-class part of the programming model and makes it fault tolerant via checkpointing.
  • Operationally robust streaming (failures, rolling upgrades, restarts): Flink checkpoints state and stream positions so the job can recover and continue with the same semantics “as a failure-free execution”.

Typical use cases for DevOps, Go, Python, and AI teams

Flink is widely used for “data pipelines & ETL”, “streaming analytics”, and “event-driven applications” (the categories used by the Flink docs).

For a DevOps + Go/Python stack, typical patterns look like this:

A Go service produces events to Kafka; Flink consumes those events, performs stateful processing (e.g., dedupe, windowed aggregation, enrichment), then writes derived facts back to Kafka or a database. Flink’s operator and checkpointing mechanisms exist to make these stateful pipelines production-safe.

For ML/LLM teams, PyFlink’s documentation explicitly calls out scenarios such as “machine learning prediction”, and its dependency-management guide covers loading machine learning models inside Python UDFs. That amounts to a direct endorsement of the “Flink job as online inference / feature-engineering runtime” pattern.

Apache Flink architecture and core features

Apache Flink cluster architecture for production deployments

Flink’s runtime consists of two process types: JobManager and TaskManagers. The docs emphasise that clients submit the dataflow to the JobManager; the client can then disconnect (detached mode) or stay connected (attached mode).

The JobManager coordinates distributed execution: scheduling, reacting to task completion/failure, coordinating checkpoints, and coordinating recovery. Internally, it includes: ResourceManager (slots/resources), Dispatcher (REST + Web UI + per-job JobMaster creation), and JobMaster (manages one job).

The TaskManagers execute the operators/tasks, and exchange/buffer data streams. The smallest scheduling unit is the task slot; multiple operators can execute in one slot (operator chaining and slot sharing affect this).

Operator chaining and task slots for performance and cost control

Flink chains operator subtasks into tasks, where each task is executed by a single thread. This is described as a performance optimisation that reduces thread handover and buffering overhead, increasing throughput and decreasing latency.

Slots matter operationally because they are the unit of resource scheduling/isolation. Flink notes that each TaskManager may have one or more task slots; slotting reserves managed memory per slot, but does not isolate CPU.

Event-time processing, watermarks, and late data

Flink supports multiple notions of time—event time, ingestion time, processing time—and uses watermarks to model progress in event time.

To work with event time, Flink needs timestamps assigned to events and watermarks generated; the official “Generating Watermarks” documentation explains timestamp assignment and watermark generation as the core building blocks, with WatermarkStrategy being the standard way to configure common strategies.

Fault tolerance: checkpoints versus savepoints in real systems

Checkpointing exists because “every function and operator in Flink can be stateful”; state must be checkpointed to become fault tolerant. Checkpoints enable recovery of both state and stream positions so execution can resume with failure-free semantics.

Flink is very explicit that savepoints are “a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism”, used to stop-and-resume, fork, or update jobs. Savepoints live on stable storage (e.g., HDFS, S3).

The official “Checkpoints vs Savepoints” page frames the difference like backups vs recovery logs: checkpoints are frequent, lightweight, managed by Flink for failure recovery; savepoints are user-managed and used for controlled operations like upgrades.

Apache Flink deployment options and pricing plans

Free/self-managed Apache Flink option

The open-source Flink runtime is “free” in the licensing sense, but in production you pay for infrastructure and operational effort.

Flink is designed to integrate with common resource managers (e.g., YARN and Kubernetes) and can also run as a standalone cluster or as a library.

Self-managed cost drivers for Apache Flink

Compute and memory costs are driven by JobManager and TaskManagers, and by your parallelism/slot layout. Flink’s configuration documentation explicitly calls out jobmanager.memory.process.size, taskmanager.memory.process.size, taskmanager.numberOfTaskSlots, and parallelism.default as core knobs for distributed setups.

Local disk is a frequent hidden cost for stateful jobs. Flink notes that io.tmp.dirs stores local data including RocksDB files, spilled intermediate results, and cached JARs; if this data is deleted, it can force “a heavyweight recovery operation”, so it should live on storage that is not periodically purged.

Durable object/file storage cost is driven by checkpoint/savepoint directories. In Flink 2.x config, checkpoints and savepoints are configured via execution.checkpointing.dir and execution.checkpointing.savepoint-dir and accept URIs like s3://… or hdfs://….

Managed Apache Flink plans and typical billing models

Managed services reduce operational cost but add platform fees and constraints. The specifics are provider-dependent.

Amazon Managed Service for Apache Flink bills by KPUs (1 vCPU + 4 GB memory per KPU) and charges by duration and number of KPUs in one-second increments. AWS also charges an additional “orchestration” KPU per application and separate storage/backups fees.

Confluent Cloud for Apache Flink is usage-based and serverless: you create a compute pool, and you’re billed for CFUs consumed per minute while statements are running. The billing page includes an example CFU price of $0.21 per CFU-hour (region-dependent) and emphasises that you can limit spend via compute pool maximums.

Aiven and Alibaba Cloud are notable managed Flink providers in the market, but their public pricing and billing details vary by plan/region and may require calculators or sales contact; treat exact costs as unspecified unless you quote a region+plan from their current docs.

Ververica offers both self-managed and managed deployment options around Flink; public pages emphasise deployment choices and managed service positioning, while exact pricing is typically handled via “contact/pricing details” flows (so specific numbers are often unspecified publicly).

Deployment options table for Apache Flink in production

| Deployment option | Best for | Operational complexity | Key benefits | Key risks / trade-offs |
|---|---|---|---|---|
| Standalone cluster (VMs/bare metal) | Small teams, fixed capacity | Medium–High | Full control; simplest mental model | HA, autoscaling, upgrades are DIY (more toil) |
| Kubernetes with Flink Kubernetes Operator | Most modern platform teams | Medium | Declarative deployments; lifecycle management via control loop; supports Application/Session/Job deployments | Kubernetes + operator expertise required |
| Native Kubernetes (without operator) | K8s teams wanting direct integration | Medium–High | Direct resource integration; dynamic TaskManager allocation/deallocation per the Flink-on-K8s docs | More bespoke automation than the operator |
| YARN | Hadoop-centric platforms | Medium | Integrates with YARN resource management | Hadoop stack complexity |
| AWS Managed Service for Apache Flink | AWS-native data stacks | Low–Medium | Managed orchestration + scaling options; predictable billing unit (KPU) | Platform coupling; extra orchestration KPU per app + storage fees |
| Confluent Cloud for Apache Flink | Kafka-first shops, SQL-first stream apps | Low | Serverless usage billing; CFU-minute accounting; compute pools to cap spend | CFU costs + Kafka networking costs; service-specific APIs |
| Ververica managed offerings | Enterprises needing Flink expert ops | Low–Medium | “Flink experts” managed service positioning | Pricing often not transparent (unspecified) |

Managed providers and costs table

Prices change by region and time; if you need exact numbers for your region, treat this as a starting point and verify against the provider’s current pricing pages (unquoted regions are unspecified).

| Provider | “Plan” shape | Billing unit | Example compute price | Notable additional cost drivers |
|---|---|---|---|---|
| Amazon Managed Service for Apache Flink | Managed runtime | KPU (1 vCPU + 4 GB) | Example shown: $0.11 per KPU-hour (US East, N. Virginia) | +1 orchestration KPU per app; running storage; optional durable backups |
| Confluent Cloud for Apache Flink | Serverless SQL/processing | CFU-minute / CFU-hour | Example shown: $0.21 per CFU-hour (region varies) | Kafka networking rates still apply; compute pool max to cap spend |
| Ververica (managed) | Managed “Unified Streaming Data Platform” | Unspecified (public pages) | Unspecified | Platform features/SLAs; pricing typically via sales (unspecified) |
| Aiven for Apache Flink | Managed service | Hourly usage billing (platform-wide) | Unspecified without plan/region | Plan tier + cloud region + add-ons (unspecified) |
| Alibaba Cloud Realtime Compute for Apache Flink | Managed/serverless | Hybrid billing (pay-as-you-go + subscription mix) | Unspecified without region/workspace details | CU-based limits and workspace model (details vary; unspecified here) |
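
To make the billing units concrete, here is a back-of-envelope monthly estimate using the example rates above. The workload sizes (4 processing KPUs, a pool averaging 2 CFUs) and the 730-hour month are invented assumptions; real rates vary by region.

```python
# Rough monthly compute cost under the example rates quoted above.
# Workload sizes are invented for illustration; rates vary by region.
HOURS_PER_MONTH = 730

# AWS Managed Service for Apache Flink: 4 processing KPUs + 1 orchestration KPU
kpu_rate = 0.11  # USD per KPU-hour (example US East rate)
kpu_monthly = round((4 + 1) * HOURS_PER_MONTH * kpu_rate, 2)

# Confluent Cloud: a compute pool averaging 2 CFUs while statements run
cfu_rate = 0.21  # USD per CFU-hour (example rate)
cfu_monthly = round(2 * HOURS_PER_MONTH * cfu_rate, 2)

print(f"KPU estimate: ${kpu_monthly}/month, CFU estimate: ${cfu_monthly}/month")
```

Storage, backups, and Kafka networking are billed separately on both platforms, so treat numbers like these as floors, not totals.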

Apache Flink vs competitors comparison

Flink sits in a busy ecosystem. The “best” choice depends on latency, statefulness, operational preferences, and authoring model.

Competitor comparison table: Flink vs Spark vs Kafka Streams vs Beam and newer options

| Tool | What it is | Streaming execution model | State & exactly-once story | Where it shines | Typical pain points |
|---|---|---|---|---|---|
| Apache Flink | Distributed stream processing engine for stateful computations | Continuous streaming + event time via watermarks | Checkpoint-based fault tolerance; savepoints for controlled upgrades | Low-latency stateful pipelines; complex event-time logic | Operating state, checkpoints, upgrades correctly takes discipline |
| Apache Spark Structured Streaming | Spark’s streaming engine built around DataFrames/Datasets | Default micro-batch model (with a continuous mode discussed separately) | Strong for analytical pipelines; state exists but often higher latency | Unified batch+stream APIs; Spark ecosystem | Micro-batch latency and “streaming as incremental batches” mental model |
| Kafka Streams | Library to build stream-processing apps on Kafka | Record-at-a-time processing | Supports exactly-once processing semantics (EOS) | Simple Kafka-native apps; embed in a JVM service | JVM-only; less flexible for large distributed compute patterns |
| Apache Beam | Unified programming model + SDKs; executed via runners (Flink, Spark, Dataflow, etc.) | Depends on runner; Beam pipelines translate to runner jobs | Semantics depend on runner capability matrix (runner-specific) | Portability, multi-language pipelines; avoids engine lock-in | Operational tuning still ends up being runner-specific |
| Materialize | “Live data layer” / streaming SQL DB; incrementally updates results as data arrives | Continuous incremental view maintenance | Strong consistency claims in product docs (details are product-specific) | Serving fresh derived views to apps/AI agents | Different operational model than Flink jobs; not a general operator-API runtime |
| RisingWave | Streaming database where stream processing is expressed as materialised views | Continuous materialised view maintenance | SQL-first; engine-specific semantics | SQL-centric streaming apps without building Flink jobs | Less flexible for arbitrary code-heavy pipelines |

A useful heuristic: if you want a runtime for complex stateful streaming jobs with deep control over event-time, operator logic, and deployments, Flink is a primary candidate. If you want SQL-first incremental views for serving, streaming databases may be alternatives. If you want a library embedded in a service, Kafka Streams is competitive. If you want one portable pipeline definition across engines, Beam is compelling.

For cloud-native event-driven architectures using AWS, Building Event-Driven Microservices with AWS Kinesis covers Kinesis Data Streams patterns for real-time processing and service decoupling.

How to use Apache Flink in custom-made systems

This section is intentionally practical: configuration, deployment, and how your Go/Python services typically interact with Flink.

Recommended architecture pattern: Go services + Kafka + Flink + serving layer

Flink is often the “stateful middle” that turns high-volume events into durable signals (counters, sessions, anomalies, enriched records). Checkpoints and state backends are what make that middle reliable in production.

Standalone configuration example for Apache Flink 2.x

Important version note: starting with Flink 2.0, the supported configuration file is conf/config.yaml; the previous flink-conf.yaml is “no longer supported”.

A minimal (illustrative) conf/config.yaml for a small self-managed cluster:

# conf/config.yaml (Flink 2.x style)
rest:
  address: flink-jobmanager.example.internal
  port: 8081

jobmanager:
  rpc:
    address: flink-jobmanager.example.internal
    port: 6123
  memory:
    process:
      size: 2048m

taskmanager:
  memory:
    process:
      size: 4096m
  numberOfTaskSlots: 2

parallelism:
  default: 2

# Checkpointing defaults (jobs can still override in code)
state:
  backend:
    type: rocksdb
execution:
  checkpointing:
    dir: s3://my-bucket/flink/checkpoints
    savepoint-dir: s3://my-bucket/flink/savepoints
    interval: 60 s

# Avoid tmp dirs that get purged (RocksDB files, cached jars, etc.)
io:
  tmp:
    dirs: ["/var/lib/flink/tmp"]

Why these keys: Flink’s configuration reference explicitly documents the rest.* and jobmanager.rpc.* discovery details, the process memory keys, the slot/parallelism keys, and the default checkpoint settings including state.backend.type, execution.checkpointing.dir, execution.checkpointing.savepoint-dir, and execution.checkpointing.interval.

The io.tmp.dirs choice is operationally important because Flink uses it for local RocksDB files and cached artefacts; deleting it can cause heavyweight recovery.

Legacy standalone config example for Flink 1.x

If you are on Flink 1.x (still common in some managed environments), you’ll see flink-conf.yaml in the wild. This is legacy for Flink 2.x users.

# conf/flink-conf.yaml (legacy 1.x style; NOT supported in Flink 2.x)
jobmanager.rpc.address: flink-jobmanager
rest.port: 8081
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2

# Legacy checkpoint keys differ by version; treat as illustrative.
state.backend.type: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.savepoints.dir: s3://my-bucket/flink/savepoints

If you’re migrating, Flink provides a migration script (bin/migrate-config-file.sh) to convert flink-conf.yaml to config.yaml.

Kubernetes/Helm deployment with the Flink Kubernetes Operator

The Flink Kubernetes Operator acts as a control plane for Flink application lifecycle management and is installed using Helm.

From the official operator Helm docs, you can install either from the source tree chart, or from the Apache-hosted chart repository:

# install from bundled chart in source tree
helm install flink-kubernetes-operator helm/flink-kubernetes-operator

# install from Apache downloads Helm repository (replace <OPERATOR-VERSION>)
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-<OPERATOR-VERSION>/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator

These exact commands are shown in the operator’s Helm installation documentation.

Example FlinkDeployment CR (illustrative)

This is a simplified example to show the integration points you’ll typically customise (image, resources, checkpoint locations, logging/metrics). The operator reconciles this desired state via its control loop.

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: realtime-sessions
  namespace: flink
spec:
  image: my-registry.example.com/flink/realtime-sessions:2026-03-06
  flinkVersion: v2_2
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.backend.type: "rocksdb"
    execution.checkpointing.dir: "s3://my-bucket/flink/checkpoints/realtime-sessions"
    execution.checkpointing.savepoint-dir: "s3://my-bucket/flink/savepoints/realtime-sessions"
    execution.checkpointing.interval: "60 s"
    rest.port: "8081"
  jobManager:
    resource:
      cpu: 1
      memory: "2048m"
  taskManager:
    resource:
      cpu: 2
      memory: "4096m"
  job:
    jarURI: local:///opt/flink/usrlib/realtime-sessions.jar
    parallelism: 4
    upgradeMode: savepoint
    state: running

The upgradeMode: savepoint pattern is common when you want safe stateful upgrades; savepoints are designed for stop/resume/fork/update workflows and point to stable storage locations.

PyFlink development: realistic Kafka streaming job with checkpoints and RocksDB state

PyFlink is the Python API for Apache Flink and is explicitly pitched for scalable batch/stream workloads including ML pipelines and ETL.

Dependency packaging for PyFlink Kafka jobs

When you use JVM connectors (Kafka, JDBC, etc.) from PyFlink, you must ensure the relevant JARs are available to the job. Flink’s Python “Dependency Management” docs show three standard mechanisms:

Setting pipeline.jars (Table API), calling add_jars() (DataStream API), or CLI --jarfile at submission time.

PyFlink Kafka job example (DataStream API + event time + state + checkpointing)

This example reads JSON events from Kafka, assigns event-time timestamps (with bounded out-of-orderness), maintains a per-user rolling count in keyed state, and writes an enriched event to an output topic.

Notes:

  • KafkaSource is built via KafkaSource.builder() and requires bootstrap servers, topics, and a deserialiser.
  • Exactly-once Kafka sink configuration in PyFlink requires setting delivery guarantee and a transactional ID prefix.
  • Checkpoint defaults can be configured in Flink config (execution.checkpointing.*) and/or in code; the config keys are documented in the Flink configuration reference.
import json
from typing import Any

from pyflink.common import Duration, Types
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner
from pyflink.common.configuration import Configuration

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

from pyflink.datastream.connectors import DeliveryGuarantee
from pyflink.datastream.connectors.kafka import (
    KafkaSource,
    KafkaOffsetsInitializer,
    KafkaSink,
    KafkaRecordSerializationSchema,
)

class EventTimeFromJson(TimestampAssigner):
    """
    Extract event_time_ms from the JSON payload.
    Expect: {"user_id":"u1","event_time_ms":1710000000000,"event":"click",...}
    """
    def extract_timestamp(self, value: str, record_timestamp: int) -> int:
        try:
            obj = json.loads(value)
            return int(obj["event_time_ms"])
        except Exception:
            # fallback: use the Kafka record timestamp if the payload is malformed
            return record_timestamp

class RollingCount(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        desc = ValueStateDescriptor("rolling_count", Types.LONG())
        self.count_state = runtime_context.get_state(desc)

    def process_element(self, value: str, ctx: 'KeyedProcessFunction.Context'):
        obj = json.loads(value)
        current = self.count_state.value()
        if current is None:
            current = 0
        current += 1
        self.count_state.update(current)

        # emit enriched event
        obj["rolling_count"] = current
        obj["event_time_ms"] = int(obj.get("event_time_ms", 0))
        yield json.dumps(obj)

def build_env() -> StreamExecutionEnvironment:
    # Cluster/job defaults (can also be set in config.yaml)
    cfg = Configuration()
    cfg.set_string("state.backend.type", "rocksdb")
    cfg.set_string("execution.checkpointing.dir", "s3://my-bucket/flink/checkpoints/realtime-sessions")
    cfg.set_string("execution.checkpointing.interval", "60 s")
    env = StreamExecutionEnvironment.get_execution_environment(cfg)

    # In PyFlink, connector jars must be available; use env.add_jars(...) if needed.
    # env.add_jars("file:///opt/flink/lib/flink-connector-kafka-<VERSION>.jar")

    # Enable checkpointing explicitly as well (jobs can override defaults)
    env.enable_checkpointing(60_000)
    env.set_parallelism(4)
    return env

def main():
    env = build_env()

    source = (
        KafkaSource.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_topics("events.raw")
        .set_group_id("realtime-sessions-v1")
        .set_value_only_deserializer(SimpleStringSchema())
        .set_starting_offsets(KafkaOffsetsInitializer.earliest())
        .build()
    )

    watermark_strategy = (
        WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(10))
        .with_timestamp_assigner(EventTimeFromJson())
    )

    stream = (
        env.from_source(source, watermark_strategy=watermark_strategy, source_name="kafka-events-raw")
        .key_by(lambda s: json.loads(s)["user_id"])
        .process(RollingCount(), output_type=Types.STRING())
    )

    record_serializer = (
        KafkaRecordSerializationSchema.builder()
        .set_topic("events.enriched")
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )

    sink = (
        KafkaSink.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_record_serializer(record_serializer)
        .set_delivery_guarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .set_transactional_id_prefix("realtime-sessions-txn")
        .build()
    )

    stream.sink_to(sink)

    env.execute("realtime-sessions-pyflink")

if __name__ == "__main__":
    main()

The API calls above line up with PyFlink’s KafkaSource builder usage pattern and required fields.

For delivery guarantees, PyFlink’s KafkaSinkBuilder documentation explicitly says that for DeliveryGuarantee.EXACTLY_ONCE you must set the transactional ID prefix.

For timestamping/watermarking, Flink’s watermark documentation explains timestamp assignment and watermark generation as the mechanism to process event time, and PyFlink provides a WatermarkStrategy API mirroring this model.

Go integration: Kafka producer/consumer + Flink REST job submission

Go does not have a native Flink job authoring API like Java/Python, so Go systems typically integrate with Flink through:

  • Kafka (or other brokers) as ingestion/egress.
  • The Flink REST API for operational actions (uploading JARs, starting jobs, querying job status, triggering savepoints, rescaling).

For Kafka setup and local development patterns, see Apache Kafka Quickstart - Install Kafka 4.2 with CLI and Local Examples.

Go Kafka producer/consumer example (kafka-go)

package main

import (
    "context"
    "log"
    "time"

    "github.com/segmentio/kafka-go"
)

func main() {
    ctx := context.Background()

    // Producer: write raw events
    writer := &kafka.Writer{
        Addr:         kafka.TCP("kafka:9092"),
        Topic:        "events.raw",
        RequiredAcks: kafka.RequireAll,
    }
    defer writer.Close()

    err := writer.WriteMessages(ctx, kafka.Message{
        Key:   []byte("user:u1"),
        Value: []byte(`{"user_id":"u1","event_time_ms":1710000000000,"event":"click"}`),
        Time:  time.Now(),
    })
    if err != nil {
        log.Fatal(err)
    }

    // Consumer: read enriched events
    reader := kafka.NewReader(kafka.ReaderConfig{
        Brokers:  []string{"kafka:9092"},
        Topic:    "events.enriched",
        GroupID:  "go-debug-consumer",
        MinBytes: 1e3,
        MaxBytes: 10e6,
    })
    defer reader.Close()

    msg, err := reader.ReadMessage(ctx)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("enriched key=%s value=%s\n", string(msg.Key), string(msg.Value))
}

This is “plumbing” code, but it’s the most common practical integration surface: Kafka topics are the boundary between Flink and custom services.

Flink REST API: upload and run jobs from Go

Flink’s REST API is part of the JobManager web server and listens on port 8081 by default (configurable via rest.port).

The official OpenAPI spec for the dispatcher includes /jars/upload and explicitly states:

  • JAR upload must be sent as multi-part data
  • ensure the Content-Type header is set to application/x-java-archive
  • provides a curl example using -F jarfile=@path/to/flink-job.jar

A practical Go snippet to upload a JAR:

package flink

import (
    "bytes"
    "context"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
)

func UploadJar(ctx context.Context, flinkBaseURL, jarPath string) (*http.Response, error) {
    f, err := os.Open(jarPath)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var body bytes.Buffer
    w := multipart.NewWriter(&body)

    part, err := w.CreateFormFile("jarfile", "job.jar")
    if err != nil {
        return nil, err
    }
    if _, err := io.Copy(part, f); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, flinkBaseURL+"/jars/upload", &body)
    if err != nil {
        return nil, err
    }

    // Important: multipart boundary
    req.Header.Set("Content-Type", w.FormDataContentType())

    // Unlike the curl example in the spec, Go's HTTP client does not send
    // "Expect: 100-continue" by default, so no extra header override is needed.

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    if resp.StatusCode >= 300 {
        return resp, fmt.Errorf("upload failed: %s", resp.Status)
    }
    return resp, nil
}

This code is guided by the REST API OpenAPI description for /jars/upload including its multipart requirement and curl reference.

To run a previously uploaded JAR, Flink exposes /jars/{jarid}/run and supports passing program args via query parameters (and/or JSON body).
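
As a sketch of that submission call, here is a minimal Python request builder (a Go service would do the equivalent with net/http). The base URL is a placeholder, and the body uses the programArgsList/parallelism fields of the jar-run request:

```python
import json
import urllib.request

FLINK_BASE = "http://flink-jobmanager.example.internal:8081"  # placeholder address

def run_jar_request(jar_id: str, parallelism: int, args: list) -> urllib.request.Request:
    """Build POST /jars/{jarid}/run with a JSON body instead of query parameters."""
    body = json.dumps({"programArgsList": args, "parallelism": parallelism}).encode()
    return urllib.request.Request(
        f"{FLINK_BASE}/jars/{jar_id}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# jar_id is the name returned by the /jars/upload response (hypothetical here)
req = run_jar_request("a1b2c3_job.jar", 4, ["--input-topic", "events.raw"])
```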

Operationally valuable endpoints you’ll likely automate:

  • /jobs and /jobs/{jobid} to list and inspect job state
  • /jobs/{jobid}/savepoints to trigger savepoints (async trigger + polling)
  • /jobs/{jobid}/rescaling to trigger rescaling

Code snippets comparison table: PyFlink vs Go in a Flink-based platform

| Concern | PyFlink (Python jobs) | Go (services around Flink) |
|---|---|---|
| Authoring Flink logic | Native authoring via DataStream/Table APIs; supports state + timers | No native Flink API; implement logic in Flink (Java/Python) and integrate externally |
| Connectors/dependencies | Must ship connector JARs via pipeline.jars, add_jars(), or --jarfile | Not applicable (you’re not running inside Flink), but you manage Kafka/DB clients |
| Ingestion/egress | KafkaSource/KafkaSink builders in PyFlink | Kafka producer/consumer libraries; standard microservice patterns |
| Ops automation | Can call Flink REST endpoints too | Often owns automation: upload JAR, deploy, rescale, trigger savepoint via REST |

DevOps guide: monitoring, scaling, backups, and CI/CD for Apache Flink

Monitoring Apache Flink in Kubernetes and on VMs

Flink supports exporting metrics by configuring metric reporters in the Flink configuration file; these reporters are instantiated on JobManager and TaskManagers.

For Prometheus, Flink exposes Prometheus-format metrics once the reporter is configured with metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory (available in supported Flink versions).
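
In Flink 2.x config.yaml style, that reporter setup might look like the fragment below. The factory class is the documented one; the port range is an illustrative assumption (the Prometheus reporter defaults to port 9249):

```yaml
# conf/config.yaml — Prometheus metric reporter (illustrative)
metrics:
  reporter:
    prom:
      factory:
        class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
      port: "9249-9260"   # assumption: a range so JM and TMs on one host don't clash
```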

You generally combine that with Kubernetes ServiceMonitors (Prometheus Operator) or with your managed monitoring stack.

Scaling: parallelism, slots, and operator-based autoscaling

Flink’s scheduling model defines execution resources via task slots, and each slot can run a pipeline of parallel tasks.

For manual scaling, the REST API provides a rescaling endpoint for a job (/jobs/{jobid}/rescaling) as an async operation.

If you’re on Kubernetes with the Flink Kubernetes Operator, the operator project advertises a “Flink Job Autoscaler” as part of its feature set, which is worth evaluating if your workloads vary substantially.

Backups and safe upgrades: checkpoints and savepoints

Checkpoints are for automated recovery and are managed by Flink; savepoints are for user-driven lifecycle operations (stop/resume/fork/upgrade).

From an SRE standpoint:

  • Use checkpoints for “keep the pipeline running through failures”.
  • Use savepoints for “deploy a new version without losing state”.

Flink’s REST API also supports triggering savepoints asynchronously, which is useful for GitOps-style “deploy → trigger savepoint → upgrade” workflows.
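
A minimal Python sketch of that trigger-then-poll flow, assuming a placeholder JobManager address (target-directory and cancel-job are fields of the savepoint trigger request body):

```python
import json
import urllib.request

FLINK_BASE = "http://flink-jobmanager.example.internal:8081"  # placeholder address

def trigger_savepoint_request(job_id: str, target_dir: str) -> urllib.request.Request:
    """Build POST /jobs/{jobid}/savepoints; the response carries a request-id to poll."""
    body = json.dumps({"target-directory": target_dir, "cancel-job": False}).encode()
    return urllib.request.Request(
        f"{FLINK_BASE}/jobs/{job_id}/savepoints",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def savepoint_status_url(job_id: str, trigger_id: str) -> str:
    """Poll GET on this URL until the async savepoint operation completes."""
    return f"{FLINK_BASE}/jobs/{job_id}/savepoints/{trigger_id}"
```

A CI/CD pipeline would POST the trigger, read the request-id from the response, poll the status URL until completion, then deploy the new job version from the resulting savepoint path.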

CI/CD: GitOps + Helm + REST job submission

For Kubernetes:

  • Keep the operator installation and your FlinkDeployment CRs in Git, deploy via Argo CD/Flux, and version container images per build. The operator Helm docs explicitly discuss “Working with Argo CD”.

For standalone/session clusters:

  • Use the Flink REST API JAR upload and run endpoints for immutable artefact deployments.

Also note a subtle but valuable security/ops toggle: web.submit.enable governs uploads via the Web UI, but the docs note that even when disabled, session clusters still accept job submissions through REST requests; this is relevant when hardening UI surfaces while retaining CI/CD automation.
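
As a config sketch (Flink 2.x config.yaml style, same conventions as the standalone example earlier):

```yaml
# conf/config.yaml — harden the Web UI while keeping REST-based CI/CD
web:
  submit:
    enable: false   # blocks jar uploads via the UI; session clusters still accept REST submissions
```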

LLM/AI integration patterns with Apache Flink for real-time pipelines

LLM systems are often only as good as their real-time context. Flink fits into LLM/AI stacks as the component that produces “always fresh” features, embeddings, and behavioural aggregates.

Real-time embeddings pipeline with Flink

A common pattern is:

  • ingest user actions/events,
  • aggregate sessions and preferences,
  • produce embedding-generation tasks,
  • write embeddings to a vector store and/or feature store.

PyFlink’s dependency management documentation explicitly calls out “machine learning prediction” and loading ML models inside Python UDFs (for remote cluster execution), which maps directly to “online inference inside Flink operators” approaches.
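
The lazy-load pattern that makes this practical inside a UDF can be sketched in plain Python; load_model and the 4-dimensional "embedding" below are hypothetical stand-ins for a real embedding library:

```python
# Sketch of per-process lazy model loading, the pattern used inside Python UDFs
# so the model is loaded once per worker rather than once per event.
# `load_model` is a hypothetical stand-in for a real embedding library.
_model = None

def load_model():
    def embed(text: str) -> list:
        # toy "embedding": hash tokens into a fixed-size vector
        vec = [0.0] * 4
        for i, tok in enumerate(text.split()):
            vec[i % 4] += (hash(tok) % 1000) / 1000.0
        return vec
    return embed

def embed_event(text: str) -> list:
    """Called per event; loads the model only on the first call in each process."""
    global _model
    if _model is None:
        _model = load_model()
    return _model(text)
```

In a real PyFlink job this body would live in a map function or UDF, with the model artefact shipped via the dependency-management mechanisms the docs describe.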

Online feature store updates for recommendation and ranking

Flink’s keyed state and checkpointing model is built to maintain operator state across events and recover it reliably. That’s a natural match for continuous feature computation (rolling rates, counts, time-decayed metrics) that downstream recommenders need.
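
As an illustration of the “time-decayed metrics” case, here is the half-life update that would live inside a keyed process function, with the previous (count, timestamp) pair held in keyed state; the 60-second half-life is an arbitrary assumption:

```python
# Time-decayed counter update: decay the previous value by elapsed time,
# then add the new event. Keyed state per user would hold (count, last_ts_ms).
def decayed_count(prev_count: float, prev_ts_ms: int, now_ms: int,
                  half_life_ms: float = 60_000.0) -> float:
    decay = 0.5 ** ((now_ms - prev_ts_ms) / half_life_ms)
    return prev_count * decay + 1.0
```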

Practical latency/consistency trade-offs for AI pipelines

If your architecture requires exactly-once semantics end-to-end (e.g., avoid duplicate feature updates or duplicate billing events), you’ll structure sinks and sources around checkpointing and transactional guarantees.

In Kafka-based stacks specifically:

  • Flink’s Kafka connector can deliver exactly-once guarantees when checkpointing is enabled and delivery guarantee options are configured.
  • Kafka Streams also supports exactly-once semantics (EOS), which is relevant if your “AI feature pipeline” is small enough to live inside application code rather than a Flink cluster.

Architecture view for “Flink as the real-time AI context builder”

This view is grounded in Flink’s core primitives: event-time processing (watermarks), state backends (state.backend.type and system-managed local state), and checkpoint/savepoint mechanisms for fault tolerance and operations.
