DEV Community

Rost

Posted on • Originally published at glukhov.org

Apache Flink on K8s and Kafka: PyFlink, Go, ops, and managed pricing

Apache Flink is a framework for stateful computations over unbounded and bounded data streams.

Teams adopt it for correct, low-latency streaming with event-time semantics (watermarks), fault tolerance (checkpoints), controlled upgrades (savepoints), and operational surfaces (metrics and REST).

This guide targets DevOps and Go/Python developers. It compares deployment models (self-managed vs managed), explains core architecture, covers Kubernetes (Helm and Operator) and standalone setups, contrasts Flink with Spark, Kafka Streams, Beam, and streaming databases, and shows PyFlink plus Go integration patterns including LLM and AI-oriented pipelines.

For broader context on data infrastructure patterns including object storage, databases, and messaging, see Data Infrastructure for AI Systems: Object Storage, Databases, Search & AI Data Architecture.

What is Apache Flink and why teams use it for real-time processing

Apache Flink is explicitly positioned as a stateful stream processing engine: you model your logic as a pipeline of operators and Flink runs it as a distributed dataflow with managed state and time semantics. In modern Flink documentation, the project describes itself as a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

From a practical DevOps/software engineering perspective, Flink is a good fit when you need at least one of these properties:

  • Event-time correctness at low latency: join/aggregate/enrich guarantees come from Flink’s event-time processing, where “time” is when the event happened (not when it arrived), and watermarks communicate event-time progress through the pipeline.
  • Stateful computation at scale (rolling counters, sessions, fraud rules, feature engineering): Flink treats state as a first-class part of the programming model and makes it fault tolerant via checkpointing.
  • Operationally robust streaming (failures, rolling upgrades, restarts): Flink checkpoints state and stream positions so the job can recover and continue with the same semantics “as a failure-free execution”.

Typical use cases for DevOps, Go, Python, and AI teams

Flink is widely used for “data pipelines & ETL”, “streaming analytics”, and “event-driven applications” (the categories used by the Flink docs).

For a DevOps + Go/Python stack, typical patterns look like this:

A Go service produces events to Kafka; Flink consumes those events, performs stateful processing (e.g., dedupe, windowed aggregation, enrichment), then writes derived facts back to Kafka or a database. Flink’s operator and checkpointing mechanisms exist to make these stateful pipelines production-safe.

For ML/LLM teams, PyFlink’s documentation explicitly calls out scenarios such as “machine learning prediction”, and its dependency-management guide covers loading machine learning models inside Python UDFs. That amounts to a direct endorsement of the “Flink job as online inference / feature-engineering runtime” pattern.

Apache Flink architecture and core features

Apache Flink cluster architecture for production deployments

Flink’s runtime consists of two process types: JobManager and TaskManagers. The docs emphasise that clients submit the dataflow to the JobManager; the client can then disconnect (detached mode) or stay connected (attached mode).

The JobManager coordinates distributed execution: scheduling, reacting to task completion/failure, coordinating checkpoints, and coordinating recovery. Internally, it includes: ResourceManager (slots/resources), Dispatcher (REST + Web UI + per-job JobMaster creation), and JobMaster (manages one job).

The TaskManagers execute the operators/tasks, and exchange/buffer data streams. The smallest scheduling unit is the task slot; multiple operators can execute in one slot (operator chaining and slot sharing affect this).

Operator chaining and task slots for performance and cost control

Flink chains operator subtasks into tasks, where each task is executed by a single thread. This is described as a performance optimisation that reduces thread handover and buffering overhead, increasing throughput and decreasing latency.

Slots matter operationally because they are the unit of resource scheduling/isolation. Flink notes that each TaskManager may have one or more task slots; slotting reserves managed memory per slot, but does not isolate CPU.

Event-time processing, watermarks, and late data

Flink supports multiple notions of time—event time, ingestion time, processing time—and uses watermarks to model progress in event time.

To work with event time, Flink needs timestamps assigned to events and watermarks generated; the official “Generating Watermarks” documentation explains timestamp assignment and watermark generation as the core building blocks, with WatermarkStrategy being the standard way to configure common strategies.

Fault tolerance: checkpoints versus savepoints in real systems

Checkpointing exists because “every function and operator in Flink can be stateful”; state must be checkpointed to become fault tolerant. Checkpoints enable recovery of both state and stream positions so execution can resume with failure-free semantics.

Flink is very explicit that savepoints are “a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism”, used to stop-and-resume, fork, or update jobs. Savepoints live on stable storage (e.g., HDFS, S3).

The official “Checkpoints vs Savepoints” page frames the difference like backups vs recovery logs: checkpoints are frequent, lightweight, managed by Flink for failure recovery; savepoints are user-managed and used for controlled operations like upgrades.

Apache Flink deployment options and pricing plans

Free/self-managed Apache Flink option

The open-source Flink runtime is “free” in the licensing sense, but in production you pay for infrastructure and operational effort.

Flink is designed to integrate with common resource managers (e.g., YARN and Kubernetes) and can also run as a standalone cluster or as a library.

Self-managed cost drivers for Apache Flink

Compute and memory costs are driven by JobManager and TaskManagers, and by your parallelism/slot layout. Flink’s configuration documentation explicitly calls out jobmanager.memory.process.size, taskmanager.memory.process.size, taskmanager.numberOfTaskSlots, and parallelism.default as core knobs for distributed setups.

Local disk is a frequent hidden cost for stateful jobs. Flink notes that io.tmp.dirs stores local data including RocksDB files, spilled intermediate results, and cached JARs; if this data is deleted, it can force “a heavyweight recovery operation”, so it should live on storage that is not periodically purged.

Durable object/file storage cost is driven by checkpoint/savepoint directories. In Flink 2.x config, checkpoints and savepoints are configured via execution.checkpointing.dir and execution.checkpointing.savepoint-dir and accept URIs like s3://… or hdfs://….

Managed Apache Flink plans and typical billing models

Managed services reduce operational cost but add platform fees and constraints. The specifics are provider-dependent.

Amazon Managed Service for Apache Flink bills by KPUs (1 vCPU + 4 GB memory per KPU) and charges by duration and number of KPUs in one-second increments. AWS also charges an additional “orchestration” KPU per application and separate storage/backups fees.

Confluent Cloud for Apache Flink is usage-based and serverless: you create a compute pool, and you’re billed for CFUs consumed per minute while statements are running. The billing page includes an example CFU price of $0.21 per CFU-hour (region-dependent) and emphasises that you can limit spend via compute pool maximums.

Aiven and Alibaba Cloud are notable managed Flink providers in the market, but their public pricing and billing details vary by plan/region and may require calculators or sales contact; treat exact costs as unspecified unless you quote a region+plan from their current docs.

Ververica offers both self-managed and managed deployment options around Flink; public pages emphasise deployment choices and managed service positioning, while exact pricing is typically handled via “contact/pricing details” flows (so specific numbers are often unspecified publicly).

Deployment options table for Apache Flink in production

| Deployment option | Best for | Operational complexity | Key benefits | Key risks / trade-offs |
|---|---|---|---|---|
| Standalone cluster (VMs/bare metal) | Small teams, fixed capacity | Medium–High | Full control; simplest mental model | HA, autoscaling, upgrades are DIY (more toil) |
| Kubernetes with Flink Kubernetes Operator | Most modern platform teams | Medium | Declarative deployments; lifecycle management via control loop; supports Application/Session/Job deployments | Kubernetes + operator expertise required |
| Native Kubernetes (without operator) | K8s teams wanting direct integration | Medium–High | Direct resource integration; dynamic TaskManager allocation/deallocation per the Flink-on-K8s docs | More bespoke automation than the operator |
| YARN | Hadoop-centric platforms | Medium | Integrates with YARN resource management | Hadoop stack complexity |
| AWS Managed Service for Apache Flink | AWS-native data stacks | Low–Medium | Managed orchestration + scaling options; predictable billing unit (KPU) | Platform coupling; extra orchestration KPU per app + storage fees |
| Confluent Cloud for Apache Flink | Kafka-first shops, SQL-first stream apps | Low | Serverless usage billing; CFU-minute accounting; compute pools to cap spend | CFU costs + Kafka networking costs; service-specific APIs |
| Ververica managed offerings | Enterprises needing Flink expert ops | Low–Medium | “Flink experts” managed service positioning | Pricing often not transparent (unspecified) |

Managed providers and costs table

Prices change by region and time; if you need exact numbers for your region, treat this as a starting point and verify against the provider’s current pricing pages (unquoted regions are unspecified).

| Provider | “Plan” shape | Billing unit | Example compute price | Notable additional cost drivers |
|---|---|---|---|---|
| Amazon Managed Service for Apache Flink | Managed runtime | KPU (1 vCPU + 4 GB) | Example shown: $0.11 per KPU-hour (US East, N. Virginia) | +1 orchestration KPU per app; running storage; optional durable backups |
| Confluent Cloud for Apache Flink | Serverless SQL/processing | CFU-minute / CFU-hour | Example shown: $0.21 per CFU-hour (region varies) | Kafka networking rates still apply; compute pool max to cap spend |
| Ververica (managed) | Managed “Unified Streaming Data Platform” | Unspecified (public pages) | Unspecified | Platform features/SLAs; pricing typically via sales (unspecified) |
| Aiven for Apache Flink | Managed service | Hourly usage billing (platform-wide) | Unspecified without plan/region | Plan tier + cloud region + add-ons (unspecified) |
| Alibaba Cloud Realtime Compute for Apache Flink | Managed/serverless | Hybrid billing (pay-as-you-go + subscription mix) | Unspecified without region/workspace details | CU-based limits and workspace model (details vary; unspecified here) |
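
To make the billing units concrete, here is a back-of-envelope monthly estimate using the example rates above. The workload sizes (4 processing KPUs, a pool averaging 2 CFUs) and the 730-hour month are invented assumptions; real rates vary by region.

```python
# Rough monthly compute cost under the example rates quoted above.
# Workload sizes are invented for illustration; rates vary by region.
HOURS_PER_MONTH = 730

# AWS Managed Service for Apache Flink: 4 processing KPUs + 1 orchestration KPU
kpu_rate = 0.11  # USD per KPU-hour (example US East rate)
kpu_monthly = round((4 + 1) * HOURS_PER_MONTH * kpu_rate, 2)

# Confluent Cloud: a compute pool averaging 2 CFUs while statements run
cfu_rate = 0.21  # USD per CFU-hour (example rate)
cfu_monthly = round(2 * HOURS_PER_MONTH * cfu_rate, 2)

print(f"KPU estimate: ${kpu_monthly}/month, CFU estimate: ${cfu_monthly}/month")
```

Storage, backups, and Kafka networking are billed separately on both platforms, so treat numbers like these as floors, not totals.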

Apache Flink vs competitors comparison

Flink sits in a busy ecosystem. The “best” choice depends on latency, statefulness, operational preferences, and authoring model.

Competitor comparison table: Flink vs Spark vs Kafka Streams vs Beam and newer options

| Tool | What it is | Streaming execution model | State & exactly-once story | Where it shines | Typical pain points |
|---|---|---|---|---|---|
| Apache Flink | Distributed stream processing engine for stateful computations | Continuous streaming + event time via watermarks | Checkpoint-based fault tolerance; savepoints for controlled upgrades | Low-latency stateful pipelines; complex event-time logic | Operating state, checkpoints, upgrades correctly takes discipline |
| Apache Spark Structured Streaming | Spark’s streaming engine built around DataFrames/Datasets | Default micro-batch model (with a continuous mode discussed separately) | Strong for analytical pipelines; state exists but often higher latency | Unified batch+stream APIs; Spark ecosystem | Micro-batch latency and “streaming as incremental batches” mental model |
| Kafka Streams | Library to build stream-processing apps on Kafka | Record-at-a-time processing | Supports exactly-once processing semantics (EOS) | Simple Kafka-native apps; embed in a JVM service | JVM-only; less flexible for large distributed compute patterns |
| Apache Beam | Unified programming model + SDKs; executed via runners (Flink, Spark, Dataflow, etc.) | Depends on runner; Beam pipelines translate to runner jobs | Semantics depend on runner capability matrix (runner-specific) | Portability, multi-language pipelines; avoids engine lock-in | Operational tuning still ends up being runner-specific |
| Materialize | “Live data layer” / streaming SQL DB; incrementally updates results as data arrives | Continuous incremental view maintenance | Strong consistency claims in product docs (details are product-specific) | Serving fresh derived views to apps/AI agents | Different operational model than Flink jobs; not a general operator-API runtime |
| RisingWave | Streaming database where stream processing is expressed as materialised views | Continuous materialised view maintenance | SQL-first; engine-specific semantics | SQL-centric streaming apps without building Flink jobs | Less flexible for arbitrary code-heavy pipelines |

A useful heuristic: if you want a runtime for complex stateful streaming jobs with deep control over event-time, operator logic, and deployments, Flink is a primary candidate. If you want SQL-first incremental views for serving, streaming databases may be alternatives. If you want a library embedded in a service, Kafka Streams is competitive. If you want one portable pipeline definition across engines, Beam is compelling.

For cloud-native event-driven architectures using AWS, Building Event-Driven Microservices with AWS Kinesis covers Kinesis Data Streams patterns for real-time processing and service decoupling.

How to use Apache Flink in custom-made systems

This section is intentionally practical: configuration, deployment, and how your Go/Python services typically interact with Flink.

Recommended architecture pattern: Go services + Kafka + Flink + serving layer

Flink is often the “stateful middle” that turns high-volume events into durable signals (counters, sessions, anomalies, enriched records). Checkpoints and state backends are what make that middle reliable in production.

Standalone configuration example for Apache Flink 2.x

Important version note: starting with Flink 2.0, the supported configuration file is conf/config.yaml; the previous flink-conf.yaml is “no longer supported”.

A minimal (illustrative) conf/config.yaml for a small self-managed cluster:

# conf/config.yaml (Flink 2.x style)
rest:
  address: flink-jobmanager.example.internal
  port: 8081

jobmanager:
  rpc:
    address: flink-jobmanager.example.internal
    port: 6123
  memory:
    process:
      size: 2048m

taskmanager:
  memory:
    process:
      size: 4096m
  numberOfTaskSlots: 2

parallelism:
  default: 2

# Checkpointing defaults (jobs can still override in code)
state:
  backend:
    type: rocksdb
execution:
  checkpointing:
    dir: s3://my-bucket/flink/checkpoints
    savepoint-dir: s3://my-bucket/flink/savepoints
    interval: 60 s

# Avoid tmp dirs that get purged (RocksDB files, cached jars, etc.)
io:
  tmp:
    dirs: ["/var/lib/flink/tmp"]

Why these keys: Flink’s configuration reference explicitly documents the rest.* and jobmanager.rpc.* discovery details, the process memory keys, the slot/parallelism keys, and the default checkpoint settings including state.backend.type, execution.checkpointing.dir, execution.checkpointing.savepoint-dir, and execution.checkpointing.interval.

The io.tmp.dirs choice is operationally important because Flink uses it for local RocksDB files and cached artefacts; deleting it can cause heavyweight recovery.

Legacy standalone config example for Flink 1.x

If you are on Flink 1.x (still common in some managed environments), you’ll see flink-conf.yaml in the wild. This is legacy for Flink 2.x users.

# conf/flink-conf.yaml (legacy 1.x style; NOT supported in Flink 2.x)
jobmanager.rpc.address: flink-jobmanager
rest.port: 8081
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2

# Legacy checkpoint keys differ by version; treat as illustrative.
state.backend.type: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.savepoints.dir: s3://my-bucket/flink/savepoints

If you’re migrating, Flink provides a migration script (bin/migrate-config-file.sh) to convert flink-conf.yaml to config.yaml.

Kubernetes/Helm deployment with the Flink Kubernetes Operator

The Flink Kubernetes Operator acts as a control plane for Flink application lifecycle management and is installed using Helm.

From the official operator Helm docs, you can install either from the source tree chart, or from the Apache-hosted chart repository:

# install from bundled chart in source tree
helm install flink-kubernetes-operator helm/flink-kubernetes-operator

# install from Apache downloads Helm repository (replace <OPERATOR-VERSION>)
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-<OPERATOR-VERSION>/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator

These exact commands are shown in the operator’s Helm installation documentation.

Example FlinkDeployment CR (illustrative)

This is a simplified example to show the integration points you’ll typically customise (image, resources, checkpoint locations, logging/metrics). The operator reconciles this desired state via its control loop.

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: realtime-sessions
  namespace: flink
spec:
  image: my-registry.example.com/flink/realtime-sessions:2026-03-06
  flinkVersion: v2_2
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.backend.type: "rocksdb"
    execution.checkpointing.dir: "s3://my-bucket/flink/checkpoints/realtime-sessions"
    execution.checkpointing.savepoint-dir: "s3://my-bucket/flink/savepoints/realtime-sessions"
    execution.checkpointing.interval: "60 s"
    rest.port: "8081"
  jobManager:
    resource:
      cpu: 1
      memory: "2048m"
  taskManager:
    resource:
      cpu: 2
      memory: "4096m"
  job:
    jarURI: local:///opt/flink/usrlib/realtime-sessions.jar
    parallelism: 4
    upgradeMode: savepoint
    state: running

The upgradeMode: savepoint pattern is common when you want safe stateful upgrades; savepoints are designed for stop/resume/fork/update workflows and point to stable storage locations.

PyFlink development: realistic Kafka streaming job with checkpoints and RocksDB state

PyFlink is the Python API for Apache Flink and is explicitly pitched for scalable batch/stream workloads including ML pipelines and ETL.

Dependency packaging for PyFlink Kafka jobs

When you use JVM connectors (Kafka, JDBC, etc.) from PyFlink, you must ensure the relevant JARs are available to the job. Flink’s Python “Dependency Management” docs show three standard mechanisms:

Setting pipeline.jars (Table API), calling add_jars() (DataStream API), or CLI --jarfile at submission time.

PyFlink Kafka job example (DataStream API + event time + state + checkpointing)

This example reads JSON events from Kafka, assigns event-time timestamps (with bounded out-of-orderness), maintains a per-user rolling count in keyed state, and writes an enriched event to an output topic.

Notes:

  • KafkaSource is built via KafkaSource.builder() and requires bootstrap servers, topics, and a deserialiser.
  • Exactly-once Kafka sink configuration in PyFlink requires setting delivery guarantee and a transactional ID prefix.
  • Checkpoint defaults can be configured in Flink config (execution.checkpointing.*) and/or in code; the config keys are documented in the Flink configuration reference.
import json
from typing import Any

from pyflink.common import Duration, Types
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner
from pyflink.common.configuration import Configuration

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

from pyflink.datastream.connectors import DeliveryGuarantee
from pyflink.datastream.connectors.kafka import (
    KafkaSource,
    KafkaOffsetsInitializer,
    KafkaSink,
    KafkaRecordSerializationSchema,
)

class EventTimeFromJson(TimestampAssigner):
    """
    Extract event_time_ms from the JSON payload.
    Expect: {"user_id":"u1","event_time_ms":1710000000000,"event":"click",...}
    """
    def extract_timestamp(self, value: str, record_timestamp: int) -> int:
        try:
            obj = json.loads(value)
            return int(obj["event_time_ms"])
        except Exception:
            # fallback: use the Kafka record timestamp if the payload is malformed
            return record_timestamp

class RollingCount(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        desc = ValueStateDescriptor("rolling_count", Types.LONG())
        self.count_state = runtime_context.get_state(desc)

    def process_element(self, value: str, ctx: 'KeyedProcessFunction.Context'):
        obj = json.loads(value)
        current = self.count_state.value()
        if current is None:
            current = 0
        current += 1
        self.count_state.update(current)

        # emit enriched event
        obj["rolling_count"] = current
        obj["event_time_ms"] = int(obj.get("event_time_ms", 0))
        yield json.dumps(obj)

def build_env() -> StreamExecutionEnvironment:
    # Cluster/job defaults (can also be set in config.yaml)
    cfg = Configuration()
    cfg.set_string("state.backend.type", "rocksdb")
    cfg.set_string("execution.checkpointing.dir", "s3://my-bucket/flink/checkpoints/realtime-sessions")
    cfg.set_string("execution.checkpointing.interval", "60 s")
    env = StreamExecutionEnvironment.get_execution_environment(cfg)

    # In PyFlink, connector jars must be available; use env.add_jars(...) if needed.
    # env.add_jars("file:///opt/flink/lib/flink-connector-kafka-<VERSION>.jar")

    # Enable checkpointing explicitly as well (jobs can override defaults)
    env.enable_checkpointing(60_000)
    env.set_parallelism(4)
    return env

def main():
    env = build_env()

    source = (
        KafkaSource.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_topics("events.raw")
        .set_group_id("realtime-sessions-v1")
        .set_value_only_deserializer(SimpleStringSchema())
        .set_starting_offsets(KafkaOffsetsInitializer.earliest())
        .build()
    )

    watermark_strategy = (
        WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(10))
        .with_timestamp_assigner(EventTimeFromJson())
    )

    stream = (
        env.from_source(source, watermark_strategy=watermark_strategy, source_name="kafka-events-raw")
        .key_by(lambda s: json.loads(s)["user_id"])
        .process(RollingCount(), output_type=Types.STRING())
    )

    record_serializer = (
        KafkaRecordSerializationSchema.builder()
        .set_topic("events.enriched")
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )

    sink = (
        KafkaSink.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_record_serializer(record_serializer)
        .set_delivery_guarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .set_transactional_id_prefix("realtime-sessions-txn")
        .build()
    )

    stream.sink_to(sink)

    env.execute("realtime-sessions-pyflink")

if __name__ == "__main__":
    main()

The API calls above line up with PyFlink’s KafkaSource builder usage pattern and required fields.

For delivery guarantees, PyFlink’s KafkaSinkBuilder documentation explicitly says that for DeliveryGuarantee.EXACTLY_ONCE you must set the transactional ID prefix.

For timestamping/watermarking, Flink’s watermark documentation explains timestamp assignment and watermark generation as the mechanism to process event time, and PyFlink provides a WatermarkStrategy API mirroring this model.

Go integration: Kafka producer/consumer + Flink REST job submission

Go does not have a native Flink job authoring API like Java/Python, so Go systems typically integrate with Flink through:

  • Kafka (or other brokers) as ingestion/egress.
  • The Flink REST API for operational actions (uploading JARs, starting jobs, querying job status, triggering savepoints, rescaling).

For Kafka setup and local development patterns, see Apache Kafka Quickstart - Install Kafka 4.2 with CLI and Local Examples.

Go Kafka producer/consumer example (kafka-go)

package main

import (
    "context"
    "log"
    "time"

    "github.com/segmentio/kafka-go"
)

func main() {
    ctx := context.Background()

    // Producer: write raw events
    writer := &kafka.Writer{
        Addr:         kafka.TCP("kafka:9092"),
        Topic:        "events.raw",
        RequiredAcks: kafka.RequireAll,
    }
    defer writer.Close()

    err := writer.WriteMessages(ctx, kafka.Message{
        Key:   []byte("user:u1"),
        Value: []byte(`{"user_id":"u1","event_time_ms":1710000000000,"event":"click"}`),
        Time:  time.Now(),
    })
    if err != nil {
        log.Fatal(err)
    }

    // Consumer: read enriched events
    reader := kafka.NewReader(kafka.ReaderConfig{
        Brokers:  []string{"kafka:9092"},
        Topic:    "events.enriched",
        GroupID:  "go-debug-consumer",
        MinBytes: 1e3,
        MaxBytes: 10e6,
    })
    defer reader.Close()

    msg, err := reader.ReadMessage(ctx)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("enriched key=%s value=%s\n", string(msg.Key), string(msg.Value))
}

This is “plumbing” code, but it’s the most common practical integration surface: Kafka topics are the boundary between Flink and custom services.

Flink REST API: upload and run jobs from Go

Flink’s REST API is part of the JobManager web server and listens on port 8081 by default (configurable via rest.port).

The official OpenAPI spec for the dispatcher includes /jars/upload and explicitly states:

  • JAR upload must be sent as multi-part data
  • ensure the Content-Type header is set to application/x-java-archive
  • provides a curl example using -F jarfile=@path/to/flink-job.jar

A practical Go snippet to upload a JAR:

package flink

import (
    "bytes"
    "context"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
)

func UploadJar(ctx context.Context, flinkBaseURL, jarPath string) (*http.Response, error) {
    f, err := os.Open(jarPath)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var body bytes.Buffer
    w := multipart.NewWriter(&body)

    part, err := w.CreateFormFile("jarfile", "job.jar")
    if err != nil {
        return nil, err
    }
    if _, err := io.Copy(part, f); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, flinkBaseURL+"/jars/upload", &body)
    if err != nil {
        return nil, err
    }

    // Important: multipart boundary
    req.Header.Set("Content-Type", w.FormDataContentType())

    // Unlike the curl example in the spec, Go's HTTP client does not send
    // "Expect: 100-continue" by default, so no extra header override is needed.

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    if resp.StatusCode >= 300 {
        return resp, fmt.Errorf("upload failed: %s", resp.Status)
    }
    return resp, nil
}

This code is guided by the REST API OpenAPI description for /jars/upload including its multipart requirement and curl reference.

To run a previously uploaded JAR, Flink exposes /jars/{jarid}/run and supports passing program args via query parameters (and/or JSON body).
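
As a sketch of that submission call, here is a minimal Python request builder (a Go service would do the equivalent with net/http). The base URL is a placeholder, and the body uses the programArgsList/parallelism fields of the jar-run request:

```python
import json
import urllib.request

FLINK_BASE = "http://flink-jobmanager.example.internal:8081"  # placeholder address

def run_jar_request(jar_id: str, parallelism: int, args: list) -> urllib.request.Request:
    """Build POST /jars/{jarid}/run with a JSON body instead of query parameters."""
    body = json.dumps({"programArgsList": args, "parallelism": parallelism}).encode()
    return urllib.request.Request(
        f"{FLINK_BASE}/jars/{jar_id}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# jar_id is the name returned by the /jars/upload response (hypothetical here)
req = run_jar_request("a1b2c3_job.jar", 4, ["--input-topic", "events.raw"])
```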

Operationally valuable endpoints you’ll likely automate:

  • /jobs and /jobs/{jobid} to list and inspect job state
  • /jobs/{jobid}/savepoints to trigger savepoints (async trigger + polling)
  • /jobs/{jobid}/rescaling to trigger rescaling

Code snippets comparison table: PyFlink vs Go in a Flink-based platform

| Concern | PyFlink (Python jobs) | Go (services around Flink) |
|---|---|---|
| Authoring Flink logic | Native authoring via DataStream/Table APIs; supports state + timers | No native Flink API; implement logic in Flink (Java/Python) and integrate externally |
| Connectors/dependencies | Must ship connector JARs via pipeline.jars, add_jars(), or --jarfile | Not applicable (you’re not running inside Flink), but you manage Kafka/DB clients |
| Ingestion/egress | KafkaSource/KafkaSink builders in PyFlink | Kafka producer/consumer libraries; standard microservice patterns |
| Ops automation | Can call Flink REST endpoints too | Often owns automation: upload JAR, deploy, rescale, trigger savepoint via REST |

DevOps guide: monitoring, scaling, backups, and CI/CD for Apache Flink

Monitoring Apache Flink in Kubernetes and on VMs

Flink supports exporting metrics by configuring metric reporters in the Flink configuration file; these reporters are instantiated on JobManager and TaskManagers.

For Prometheus, Flink exposes Prometheus-format metrics once the reporter is configured with metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory (available in supported Flink versions).
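
In Flink 2.x config.yaml style, that reporter setup might look like the fragment below. The factory class is the documented one; the port range is an illustrative assumption (the Prometheus reporter defaults to port 9249):

```yaml
# conf/config.yaml — Prometheus metric reporter (illustrative)
metrics:
  reporter:
    prom:
      factory:
        class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
      port: "9249-9260"   # assumption: a range so JM and TMs on one host don't clash
```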

You generally combine that with Kubernetes ServiceMonitors (Prometheus Operator) or with your managed monitoring stack.

Scaling: parallelism, slots, and operator-based autoscaling

Flink’s scheduling model defines execution resources via task slots, and each slot can run a pipeline of parallel tasks.

For manual scaling, the REST API provides a rescaling endpoint for a job (/jobs/{jobid}/rescaling) as an async operation.

If you’re on Kubernetes with the Flink Kubernetes Operator, the operator project advertises a “Flink Job Autoscaler” as part of its feature set, which is worth evaluating if your workloads vary substantially.

Backups and safe upgrades: checkpoints and savepoints

Checkpoints are for automated recovery and are managed by Flink; savepoints are for user-driven lifecycle operations (stop/resume/fork/upgrade).

From an SRE standpoint:

  • Use checkpoints for “keep the pipeline running through failures”.
  • Use savepoints for “deploy a new version without losing state”.

Flink’s REST API also supports triggering savepoints asynchronously, which is useful for GitOps-style “deploy → trigger savepoint → upgrade” workflows.
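
A minimal Python sketch of that trigger-then-poll flow, assuming a placeholder JobManager address (target-directory and cancel-job are fields of the savepoint trigger request body):

```python
import json
import urllib.request

FLINK_BASE = "http://flink-jobmanager.example.internal:8081"  # placeholder address

def trigger_savepoint_request(job_id: str, target_dir: str) -> urllib.request.Request:
    """Build POST /jobs/{jobid}/savepoints; the response carries a request-id to poll."""
    body = json.dumps({"target-directory": target_dir, "cancel-job": False}).encode()
    return urllib.request.Request(
        f"{FLINK_BASE}/jobs/{job_id}/savepoints",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def savepoint_status_url(job_id: str, trigger_id: str) -> str:
    """Poll GET on this URL until the async savepoint operation completes."""
    return f"{FLINK_BASE}/jobs/{job_id}/savepoints/{trigger_id}"
```

A CI/CD pipeline would POST the trigger, read the request-id from the response, poll the status URL until completion, then deploy the new job version from the resulting savepoint path.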

CI/CD: GitOps + Helm + REST job submission

For Kubernetes:

  • Keep the operator installation and your FlinkDeployment CRs in Git, deploy via Argo CD/Flux, and version container images per build. The operator Helm docs explicitly discuss “Working with Argo CD”.

For standalone/session clusters:

  • Use the Flink REST API JAR upload and run endpoints for immutable artefact deployments.

Also note a subtle but valuable security/ops toggle: web.submit.enable governs uploads via the Web UI, but the docs note that even when disabled, session clusters still accept job submissions through REST requests; this is relevant when hardening UI surfaces while retaining CI/CD automation.
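
As a config sketch (Flink 2.x config.yaml style, same conventions as the standalone example earlier):

```yaml
# conf/config.yaml — harden the Web UI while keeping REST-based CI/CD
web:
  submit:
    enable: false   # blocks jar uploads via the UI; session clusters still accept REST submissions
```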

LLM/AI integration patterns with Apache Flink for real-time pipelines

LLM systems are often only as good as their real-time context. Flink fits into LLM/AI stacks as the component that produces “always fresh” features, embeddings, and behavioural aggregates.

Real-time embeddings pipeline with Flink

A common pattern is:

  • ingest user actions/events,
  • aggregate sessions and preferences,
  • produce embedding-generation tasks,
  • write embeddings to a vector store and/or feature store.

PyFlink’s dependency management documentation explicitly calls out “machine learning prediction” and loading ML models inside Python UDFs (for remote cluster execution), which maps directly to “online inference inside Flink operators” approaches.
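
The lazy-load pattern that makes this practical inside a UDF can be sketched in plain Python; load_model and the 4-dimensional "embedding" below are hypothetical stand-ins for a real embedding library:

```python
# Sketch of per-process lazy model loading, the pattern used inside Python UDFs
# so the model is loaded once per worker rather than once per event.
# `load_model` is a hypothetical stand-in for a real embedding library.
_model = None

def load_model():
    def embed(text: str) -> list:
        # toy "embedding": hash tokens into a fixed-size vector
        vec = [0.0] * 4
        for i, tok in enumerate(text.split()):
            vec[i % 4] += (hash(tok) % 1000) / 1000.0
        return vec
    return embed

def embed_event(text: str) -> list:
    """Called per event; loads the model only on the first call in each process."""
    global _model
    if _model is None:
        _model = load_model()
    return _model(text)
```

In a real PyFlink job this body would live in a map function or UDF, with the model artefact shipped via the dependency-management mechanisms the docs describe.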

Online feature store updates for recommendation and ranking

Flink’s keyed state and checkpointing model is built to maintain operator state across events and recover it reliably. That’s a natural match for continuous feature computation (rolling rates, counts, time-decayed metrics) that downstream recommenders need.
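
As an illustration of the “time-decayed metrics” case, here is the half-life update that would live inside a keyed process function, with the previous (count, timestamp) pair held in keyed state; the 60-second half-life is an arbitrary assumption:

```python
# Time-decayed counter update: decay the previous value by elapsed time,
# then add the new event. Keyed state per user would hold (count, last_ts_ms).
def decayed_count(prev_count: float, prev_ts_ms: int, now_ms: int,
                  half_life_ms: float = 60_000.0) -> float:
    decay = 0.5 ** ((now_ms - prev_ts_ms) / half_life_ms)
    return prev_count * decay + 1.0
```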

Practical latency/consistency trade-offs for AI pipelines

If your architecture requires exactly-once semantics end-to-end (e.g., avoid duplicate feature updates or duplicate billing events), you’ll structure sinks and sources around checkpointing and transactional guarantees.

In Kafka-based stacks specifically:

  • Flink’s Kafka connector can deliver exactly-once guarantees when checkpointing is enabled and delivery guarantee options are configured.
  • Kafka Streams also supports exactly-once semantics (EOS), which is relevant if your “AI feature pipeline” is small enough to live inside application code rather than a Flink cluster.

Architecture view for “Flink as the real-time AI context builder”

This view is grounded in Flink’s core primitives: event-time processing (watermarks), state backends (state.backend.type and system-managed local state), and checkpoint/savepoint mechanisms for fault tolerance and operations.
