Apache Flink is a framework for stateful computations over unbounded and bounded data streams.
Teams adopt it for correct, low-latency streaming with event-time semantics (watermarks), fault tolerance (checkpoints), controlled upgrades (savepoints), and operational surfaces (metrics and REST).
This guide targets DevOps and Go/Python developers. It compares deployment models (self-managed vs managed), explains core architecture, covers Kubernetes (Helm and Operator) and standalone setups, contrasts Flink with Spark, Kafka Streams, Beam, and streaming databases, and shows PyFlink plus Go integration patterns including LLM and AI-oriented pipelines.
For broader context on data infrastructure patterns including object storage, databases, and messaging, see Data Infrastructure for AI Systems: Object Storage, Databases, Search & AI Data Architecture.
What is Apache Flink and why teams use it for real-time processing
Apache Flink is explicitly positioned as a stateful stream processing engine: you model your logic as a pipeline of operators and Flink runs it as a distributed dataflow with managed state and time semantics. In modern Flink documentation, the project describes itself as a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
From a practical DevOps/software engineering perspective, Flink is a good fit when you need at least one of these properties:
- If you need to join/aggregate/enrich at low latency with correctness guarantees, you typically use Flink’s event-time processing, where “time” is when the event happened (not when it arrived), and watermarks communicate event-time progress through the pipeline.
- If you need stateful computation at scale (rolling counters, sessions, fraud rules, feature engineering), Flink treats state as a first-class part of the programming model and makes it fault tolerant via checkpointing.
- If you need operationally robust streaming (failures, rolling upgrades, restarts), Flink checkpoints state and stream positions so the job can recover and continue with the same semantics “as a failure-free execution”.
Typical use cases for DevOps, Go, Python, and AI teams
Flink is widely used for “data pipelines & ETL”, “streaming analytics”, and “event-driven applications” (the categories used by the Flink docs).
For a DevOps + Go/Python stack, typical patterns look like this:
A Go service produces events to Kafka; Flink consumes those events, performs stateful processing (e.g., dedupe, windowed aggregation, enrichment), then writes derived facts back to Kafka or a database. Flink’s operator and checkpointing mechanisms exist to make these stateful pipelines production-safe.
For ML/LLM teams, PyFlink explicitly calls out scenarios like “machine learning prediction” and loading machine learning models inside Python UDFs as a dependency-management motivation, which is a direct endorsement of “Flink job as online inference / feature engineering runtime” patterns.
Apache Flink architecture and core features
Apache Flink cluster architecture for production deployments
Flink’s runtime consists of two process types: JobManager and TaskManagers. The docs emphasise that clients submit the dataflow to the JobManager; the client can then disconnect (detached mode) or stay connected (attached mode).
The JobManager coordinates distributed execution: scheduling, reacting to task completion/failure, coordinating checkpoints, and coordinating recovery. Internally, it includes: ResourceManager (slots/resources), Dispatcher (REST + Web UI + per-job JobMaster creation), and JobMaster (manages one job).
The TaskManagers execute the operators/tasks, and exchange/buffer data streams. The smallest scheduling unit is the task slot; multiple operators can execute in one slot (operator chaining and slot sharing affect this).
Operator chaining and task slots for performance and cost control
Flink chains operator subtasks into tasks, where each task is executed by a single thread. This is described as a performance optimisation that reduces thread handover and buffering overhead, increasing throughput and decreasing latency.
Slots matter operationally because they are the unit of resource scheduling/isolation. Flink notes that each TaskManager may have one or more task slots; slotting reserves managed memory per slot, but does not isolate CPU.
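A quick capacity sketch follows from this model: with slot sharing (the default), a job needs as many slots as its highest operator parallelism, so the TaskManager count falls out of simple division. The helper below is a hypothetical sizing aid, not a Flink API:

```python
import math

def required_task_managers(max_operator_parallelism, slots_per_task_manager):
    """With slot sharing (the default), a job needs as many slots as its
    highest operator parallelism; TaskManager count follows directly."""
    slots_needed = max_operator_parallelism
    return math.ceil(slots_needed / slots_per_task_manager)

# e.g. parallelism 8 on TaskManagers configured with
# taskmanager.numberOfTaskSlots: 2  → 4 TaskManagers
tms = required_task_managers(8, 2)
```

Remember the caveat from the docs: slots reserve managed memory but do not isolate CPU, so this arithmetic bounds scheduling capacity, not CPU contention.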
Event-time processing, watermarks, and late data
Flink supports multiple notions of time—event time, ingestion time, processing time—and uses watermarks to model progress in event time.
To work with event time, Flink needs timestamps assigned to events and watermarks generated; the official “Generating Watermarks” documentation explains timestamp assignment and watermark generation as the core building blocks, with WatermarkStrategy being the standard way to configure common strategies.
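The watermark rule itself is simple enough to sketch outside Flink. The pure-Python simulation below (no PyFlink required) mirrors the bounded-out-of-orderness strategy: the watermark trails the maximum event timestamp seen by the configured delay, and an event at or below the current watermark counts as late. All numbers are illustrative.

```python
def run_watermark_sim(events_ms, max_out_of_orderness_ms):
    """Return (watermarks, late_events) for a stream of event timestamps.

    Watermark rule (mirrors Flink's bounded out-of-orderness generator):
        watermark = max_event_time_seen - max_out_of_orderness - 1
    An event is "late" if its timestamp is at or below the current watermark.
    """
    max_ts = float("-inf")
    watermark = float("-inf")
    watermarks, late = [], []
    for ts in events_ms:
        if ts <= watermark:
            late.append(ts)  # would go to late-data handling in Flink
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness_ms - 1
        watermarks.append(watermark)
    return watermarks, late

# 2 s tolerance: 2500 and 3000 arrive after the watermark has passed them
wms, late = run_watermark_sim([1000, 5000, 2500, 12000, 3000], 2000)
```

The real generator emits watermarks periodically rather than per event, but the trailing-maximum rule is the same.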
Fault tolerance: checkpoints versus savepoints in real systems
Checkpointing exists because “every function and operator in Flink can be stateful”; state must be checkpointed to become fault tolerant. Checkpoints enable recovery of both state and stream positions so execution can resume with failure-free semantics.
Flink is very explicit that savepoints are “a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism”, used to stop-and-resume, fork, or update jobs. Savepoints live on stable storage (e.g., HDFS, S3).
The official “Checkpoints vs Savepoints” page frames the difference like backups vs recovery logs: checkpoints are frequent, lightweight, managed by Flink for failure recovery; savepoints are user-managed and used for controlled operations like upgrades.
Apache Flink deployment options and pricing plans
Free/self-managed Apache Flink option
The open-source Flink runtime is “free” in the licensing sense, but in production you pay for infrastructure and operational effort.
Flink is designed to integrate with common resource managers (e.g., YARN and Kubernetes) and can also run as a standalone cluster or as a library.
Self-managed cost drivers for Apache Flink
Compute and memory costs are driven by JobManager and TaskManagers, and by your parallelism/slot layout. Flink’s configuration documentation explicitly calls out `jobmanager.memory.process.size`, `taskmanager.memory.process.size`, `taskmanager.numberOfTaskSlots`, and `parallelism.default` as core knobs for distributed setups.
Local disk is a frequent hidden cost for stateful jobs. Flink notes that `io.tmp.dirs` stores local data including RocksDB files, spilled intermediate results, and cached JARs; if this data is deleted, it can force “a heavyweight recovery operation”, so it should live on storage that is not periodically purged.
Durable object/file storage cost is driven by checkpoint/savepoint directories. In Flink 2.x config, checkpoints and savepoints are configured via `execution.checkpointing.dir` and `execution.checkpointing.savepoint-dir` and accept URIs like `s3://…` or `hdfs://…`.
Managed Apache Flink plans and typical billing models
Managed services reduce operational cost but add platform fees and constraints. The specifics are provider-dependent.
Amazon Managed Service for Apache Flink bills by KPUs (1 vCPU + 4 GB memory per KPU) and charges by duration and number of KPUs in one-second increments. AWS also charges an additional “orchestration” KPU per application and separate storage/backups fees.
Confluent Cloud for Apache Flink is usage-based and serverless: you create a compute pool, and you’re billed for CFUs consumed per minute while statements are running. The billing page includes an example CFU price of $0.21 per CFU-hour (region-dependent) and emphasises that you can limit spend via compute pool maximums.
Aiven and Alibaba Cloud are notable managed Flink providers, but their public pricing varies by plan and region and may require calculators or sales contact; treat exact costs as unspecified unless you quote a region and plan from their current docs.
Ververica offers both self-managed and managed deployment options around Flink; public pages emphasise deployment choices and managed service positioning, while exact pricing is typically handled via “contact/pricing details” flows (so specific numbers are often unspecified publicly).
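To compare managed offerings concretely, a back-of-envelope sketch using the example prices quoted above ($0.11 per KPU-hour, $0.21 per CFU-hour) can help. The workload sizes are hypothetical, and storage, networking, and Kafka fees are excluded:

```python
HOURS_PER_MONTH = 730  # approximation used in most cloud pricing examples

def aws_mskaf_monthly(kpus, kpu_hour_price=0.11):
    """Amazon Managed Service for Apache Flink: billed per KPU-hour,
    plus one extra 'orchestration' KPU per application."""
    return (kpus + 1) * kpu_hour_price * HOURS_PER_MONTH

def confluent_flink_monthly(cfus, cfu_hour_price=0.21):
    """Confluent Cloud for Apache Flink: billed per CFU while statements
    run (modelled here as always-on; compute pool maximums cap this)."""
    return cfus * cfu_hour_price * HOURS_PER_MONTH

aws_example = aws_mskaf_monthly(4)            # 4 provisioned KPUs
confluent_example = confluent_flink_monthly(4)  # 4 CFUs, always running
```

The point is not the absolute numbers (verify region pricing) but the shape: AWS adds a fixed per-application overhead, while Confluent’s cost tracks actual statement runtime.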
Deployment options table for Apache Flink in production
| Deployment option | Best for | Operational complexity | Key benefits | Key risks / trade-offs |
|---|---|---|---|---|
| Standalone cluster (VMs/bare metal) | Small teams, fixed capacity | Medium–High | Full control; simplest mental model | HA, autoscaling, upgrades are DIY (more toil) |
| Kubernetes with Flink Kubernetes Operator | Most modern platform teams | Medium | Declarative deployments; lifecycle management via control loop; operator supports Application/Session/Job deployments | Kubernetes + operator expertise required |
| Native Kubernetes (without operator) | K8s teams wanting direct integration | Medium–High | Direct resource integration; dynamic TaskManager allocation/deallocation described in Flink-on-K8s docs | More bespoke automation than operator |
| YARN | Hadoop-centric platforms | Medium | Integrates with YARN resource mgmt | Hadoop stack complexity |
| AWS Managed Service for Apache Flink | AWS-native data stacks | Low–Medium | Managed orchestration + scaling options; predictable billing unit (KPU) | Platform coupling; extra per-app overhead KPU + storage fees |
| Confluent Cloud for Apache Flink | Kafka-first shops, SQL-first stream apps | Low | Serverless usage billing; CFU-minute accounting; compute pools to cap spend | CFU costs + Kafka networking costs; service-specific APIs |
| Ververica managed offerings | Enterprises needing Flink expert ops | Low–Medium | “Flink experts” managed service positioning | Pricing often not transparent (unspecified) |
Managed providers and costs table
Prices vary by region and change over time; treat this table as a starting point and verify against each provider’s current pricing pages (unquoted regions are unspecified).
| Provider | “Plan” shape | Billing unit | Example compute price | Notable additional cost drivers |
|---|---|---|---|---|
| Amazon Managed Service for Apache Flink | Managed runtime | KPU (1 vCPU + 4 GB) | Example shown: $0.11 per KPU-hour (US East N. Virginia) | +1 orchestration KPU per app; running storage; optional durable backups |
| Confluent Cloud for Apache Flink | Serverless SQL/processing | CFU-minute/CFU-hour | Example shown: $0.21 per CFU-hour (region varies) | Kafka networking rates still apply; compute pool max to cap spend |
| Ververica (managed) | Managed “Unified Streaming Data Platform” | Unspecified (public pages) | Unspecified | Platform features/SLAs; pricing typically via sales (unspecified) |
| Aiven for Apache Flink | Managed service | Hourly usage billing model (platform-wide) | Unspecified without plan/region | Plan tier + cloud region + add-ons (unspecified) |
| Alibaba Cloud Realtime Compute for Apache Flink | Managed/serverless | Hybrid billing (pay-as-you-go + subscription mix) | Unspecified without region/workspace details | CU-based limits and workspace model (details vary; unspecified here) |
Apache Flink vs competitors comparison
Flink sits in a busy ecosystem. The “best” choice depends on latency, statefulness, operational preferences, and authoring model.
Competitor comparison table: Flink vs Spark vs Kafka Streams vs Beam and newer options
| Tool | What it is | Streaming execution model | State & exactly-once story | Where it shines | Typical pain points |
|---|---|---|---|---|---|
| Apache Flink | Distributed stream processing engine for stateful computations | Continuous streaming + event time via watermarks | Checkpoint-based fault tolerance; savepoints for controlled upgrades | Low-latency stateful pipelines; complex event-time logic | Operating state, checkpoints, upgrades correctly takes discipline |
| Apache Spark Structured Streaming | Spark’s streaming engine built around DataFrames/Datasets | Default micro-batch model (with a continuous mode discussed separately) | Strong for analytical pipelines; state exists but often higher latency | Unified batch+stream APIs; Spark ecosystem | Micro-batch latency and “streaming as incremental batches” mental model |
| Kafka Streams | Library to build stream-processing apps on Kafka | Record-at-a-time processing | Supports exactly-once processing semantics (EOS) | Simple Kafka-native apps; embed in JVM service | JVM-only; less flexible for large distributed compute patterns |
| Apache Beam | Unified programming model + SDKs; executed via runners (Flink, Spark, Dataflow, etc.) | Depends on runner; Beam pipelines translate to runner jobs | Semantics depend on runner capability matrix (runner-specific) | Portability, multi-language pipelines; avoid engine lock-in | Operational tuning still ends up being runner-specific |
| Materialize | “Live data layer” / streaming SQL DB; incrementally updates results as data arrives | Continuous incremental view maintenance | Strong consistency claims in product docs (details are product-specific) | Serving fresh derived views to apps/AI agents | Different operational model than Flink jobs; not a general operator API runtime |
| RisingWave | Streaming database where stream processing is expressed as materialized views | Continuous materialised view maintenance | SQL-first; engine-specific semantics | SQL-centric streaming apps without building Flink jobs | Less flexible for arbitrary code-heavy pipelines |
A useful heuristic: if you want a runtime for complex stateful streaming jobs with deep control over event-time, operator logic, and deployments, Flink is a primary candidate. If you want SQL-first incremental views for serving, streaming databases may be alternatives. If you want a library embedded in a service, Kafka Streams is competitive. If you want one portable pipeline definition across engines, Beam is compelling.
For cloud-native event-driven architectures using AWS, Building Event-Driven Microservices with AWS Kinesis covers Kinesis Data Streams patterns for real-time processing and service decoupling.
How to use Apache Flink in custom-made systems
This section is intentionally practical: configuration, deployment, and how your Go/Python services typically interact with Flink.
Recommended architecture pattern: Go services + Kafka + Flink + serving layer
Flink is often the “stateful middle” that turns high-volume events into durable signals (counters, sessions, anomalies, enriched records). Checkpoints and state backends are what make that middle reliable in production.
Standalone configuration example for Apache Flink 2.x
Important version note: starting with Flink 2.0, the supported configuration file is conf/config.yaml; the previous flink-conf.yaml is “no longer supported”.
A minimal (illustrative) conf/config.yaml for a small self-managed cluster:
```yaml
# conf/config.yaml (Flink 2.x style)
rest:
  address: flink-jobmanager.example.internal
  port: 8081

jobmanager:
  rpc:
    address: flink-jobmanager.example.internal
    port: 6123
  memory:
    process:
      size: 2048m

taskmanager:
  memory:
    process:
      size: 4096m
  numberOfTaskSlots: 2

parallelism:
  default: 2

# Checkpointing defaults (jobs can still override in code)
state:
  backend:
    type: rocksdb

execution:
  checkpointing:
    dir: s3://my-bucket/flink/checkpoints
    savepoint-dir: s3://my-bucket/flink/savepoints
    interval: 60 s

# Avoid tmp dirs that get purged (RocksDB files, cached jars, etc.)
io:
  tmp:
    dirs: ["/var/lib/flink/tmp"]
```
Why these keys: Flink’s configuration reference explicitly documents the `rest.*` and `jobmanager.rpc.*` discovery details, the process memory keys, the slot/parallelism keys, and the default checkpoint settings including `state.backend.type`, `execution.checkpointing.dir`, `execution.checkpointing.savepoint-dir`, and `execution.checkpointing.interval`.
The `io.tmp.dirs` choice is operationally important because Flink uses it for local RocksDB files and cached artefacts; deleting it can cause heavyweight recovery.
Legacy standalone config example for Flink 1.x
If you are on Flink 1.x (still common in some managed environments), you’ll see `flink-conf.yaml` in the wild; treat it as a legacy format once you target Flink 2.x.
```yaml
# conf/flink-conf.yaml (legacy 1.x style; NOT supported in Flink 2.x)
jobmanager.rpc.address: flink-jobmanager
rest.port: 8081
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2

# Legacy checkpoint keys differ by version; treat as illustrative.
state.backend.type: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.savepoints.dir: s3://my-bucket/flink/savepoints
```
If you’re migrating, Flink provides a migration script (`bin/migrate-config-file.sh`) to convert `flink-conf.yaml` to `config.yaml`.
Kubernetes/Helm deployment with the Flink Kubernetes Operator
The Flink Kubernetes Operator acts as a control plane for Flink application lifecycle management and is installed using Helm.
From the official operator Helm docs, you can install either from the source tree chart, or from the Apache-hosted chart repository:
```shell
# install from bundled chart in source tree
helm install flink-kubernetes-operator helm/flink-kubernetes-operator

# install from Apache downloads Helm repository (replace <OPERATOR-VERSION>)
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-<OPERATOR-VERSION>/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator
```
These exact commands are shown in the operator’s Helm installation documentation.
Example FlinkDeployment CR (illustrative)
This is a simplified example to show the integration points you’ll typically customise (image, resources, checkpoint locations, logging/metrics). The operator reconciles this desired state via its control loop.
```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: realtime-sessions
  namespace: flink
spec:
  image: my-registry.example.com/flink/realtime-sessions:2026-03-06
  flinkVersion: v2_2
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.backend.type: "rocksdb"
    execution.checkpointing.dir: "s3://my-bucket/flink/checkpoints/realtime-sessions"
    execution.checkpointing.savepoint-dir: "s3://my-bucket/flink/savepoints/realtime-sessions"
    execution.checkpointing.interval: "60 s"
    rest.port: "8081"
  jobManager:
    resource:
      cpu: 1
      memory: "2048m"
  taskManager:
    resource:
      cpu: 2
      memory: "4096m"
  job:
    jarURI: local:///opt/flink/usrlib/realtime-sessions.jar
    parallelism: 4
    upgradeMode: savepoint
    state: running
```
The `upgradeMode: savepoint` pattern is common when you want safe stateful upgrades; savepoints are designed for stop/resume/fork/update workflows and point to stable storage locations.
PyFlink development: realistic Kafka streaming job with checkpoints and RocksDB state
PyFlink is the Python API for Apache Flink and is explicitly pitched for scalable batch/stream workloads including ML pipelines and ETL.
Dependency packaging for PyFlink Kafka jobs
When you use JVM connectors (Kafka, JDBC, etc.) from PyFlink, you must ensure the relevant JARs are available to the job. Flink’s Python “Dependency Management” docs show three standard mechanisms:
- setting `pipeline.jars` (Table API),
- calling `add_jars()` (DataStream API), or
- passing `--jarfile` on the CLI at submission time.
PyFlink Kafka job example (DataStream API + event time + state + checkpointing)
This example reads JSON events from Kafka, assigns event-time timestamps (with bounded out-of-orderness), maintains a per-user rolling count in keyed state, and writes an enriched event to an output topic.
Notes:

- `KafkaSource` is built via `KafkaSource.builder()` and requires bootstrap servers, topics, and a deserialiser.
- Exactly-once Kafka sink configuration in PyFlink requires setting the delivery guarantee and a transactional ID prefix.
- Checkpoint defaults can be configured in the Flink config (`execution.checkpointing.*`) and/or in code; the config keys are documented in the Flink configuration reference.
```python
import json

from pyflink.common import Duration, Types
from pyflink.common.configuration import Configuration
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import DeliveryGuarantee
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaRecordSerializationSchema,
    KafkaSink,
    KafkaSource,
)
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor


class EventTimeFromJson(TimestampAssigner):
    """Extract event_time_ms from the JSON payload.

    Expects: {"user_id":"u1","event_time_ms":1710000000000,"event":"click",...}
    """

    def extract_timestamp(self, value: str, record_timestamp: int) -> int:
        try:
            obj = json.loads(value)
            return int(obj["event_time_ms"])
        except Exception:
            # Fallback: use the record (ingestion) timestamp if malformed.
            return record_timestamp


class RollingCount(KeyedProcessFunction):

    def open(self, runtime_context: RuntimeContext):
        desc = ValueStateDescriptor("rolling_count", Types.LONG())
        self.count_state = runtime_context.get_state(desc)

    def process_element(self, value: str, ctx: 'KeyedProcessFunction.Context'):
        obj = json.loads(value)
        current = self.count_state.value()
        if current is None:
            current = 0
        current += 1
        self.count_state.update(current)
        # Emit the enriched event.
        obj["rolling_count"] = current
        obj["event_time_ms"] = int(obj.get("event_time_ms", 0))
        yield json.dumps(obj)


def build_env() -> StreamExecutionEnvironment:
    # Cluster/job defaults (can also be set in config.yaml).
    cfg = Configuration()
    cfg.set_string("state.backend.type", "rocksdb")
    cfg.set_string("execution.checkpointing.dir", "s3://my-bucket/flink/checkpoints/realtime-sessions")
    cfg.set_string("execution.checkpointing.interval", "60 s")
    env = StreamExecutionEnvironment.get_execution_environment(cfg)

    # In PyFlink, connector jars must be available; use env.add_jars(...) if needed.
    # env.add_jars("file:///opt/flink/lib/flink-connector-kafka-<VERSION>.jar")

    # Enable checkpointing explicitly as well (jobs can override defaults).
    env.enable_checkpointing(60_000)
    env.set_parallelism(4)
    return env


def main():
    env = build_env()

    source = (
        KafkaSource.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_topics("events.raw")
        .set_group_id("realtime-sessions-v1")
        .set_value_only_deserializer(SimpleStringSchema())
        .set_starting_offsets(KafkaOffsetsInitializer.earliest())
        .build()
    )

    watermark_strategy = (
        WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(10))
        .with_timestamp_assigner(EventTimeFromJson())
    )

    stream = (
        env.from_source(source, watermark_strategy=watermark_strategy, source_name="kafka-events-raw")
        .key_by(lambda s: json.loads(s)["user_id"])
        .process(RollingCount(), output_type=Types.STRING())
    )

    record_serializer = (
        KafkaRecordSerializationSchema.builder()
        .set_topic("events.enriched")
        .set_value_serialization_schema(SimpleStringSchema())
        .build()
    )

    sink = (
        KafkaSink.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_record_serializer(record_serializer)
        .set_delivery_guarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .set_transactional_id_prefix("realtime-sessions-txn")
        .build()
    )

    stream.sink_to(sink)
    env.execute("realtime-sessions-pyflink")


if __name__ == "__main__":
    main()
```
The API calls above line up with PyFlink’s KafkaSource builder usage pattern and required fields.
For delivery guarantees, PyFlink’s KafkaSinkBuilder documentation explicitly says that for DeliveryGuarantee.EXACTLY_ONCE you must set the transactional ID prefix.
For timestamping/watermarking, Flink’s watermark documentation explains timestamp assignment and watermark generation as the mechanism to process event time, and PyFlink provides a WatermarkStrategy API mirroring this model.
Go integration: Kafka producer/consumer + Flink REST job submission
Go does not have a native Flink job authoring API like Java/Python, so Go systems typically integrate with Flink through:
- Kafka (or other brokers) as ingestion/egress.
- The Flink REST API for operational actions (uploading JARs, starting jobs, querying job status, triggering savepoints, rescaling).
For Kafka setup and local development patterns, see Apache Kafka Quickstart - Install Kafka 4.2 with CLI and Local Examples.
Go Kafka producer/consumer example (kafka-go)
```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Producer: write raw events.
	writer := &kafka.Writer{
		Addr:         kafka.TCP("kafka:9092"),
		Topic:        "events.raw",
		RequiredAcks: kafka.RequireAll,
	}
	defer writer.Close()

	err := writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte("user:u1"),
		Value: []byte(`{"user_id":"u1","event_time_ms":1710000000000,"event":"click"}`),
		Time:  time.Now(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Consumer: read enriched events.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers:  []string{"kafka:9092"},
		Topic:    "events.enriched",
		GroupID:  "go-debug-consumer",
		MinBytes: 1e3,
		MaxBytes: 10e6,
	})
	defer reader.Close()

	msg, err := reader.ReadMessage(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("enriched key=%s value=%s\n", string(msg.Key), string(msg.Value))
}
```
This is “plumbing” code, but it’s the most common practical integration surface: Kafka topics are the boundary between Flink and custom services.
Flink REST API: upload and run jobs from Go
Flink’s REST API is part of the JobManager web server and listens on port 8081 by default (configurable via rest.port).
The official OpenAPI spec for the dispatcher includes `/jars/upload` and explicitly states that:

- the JAR upload must be sent as multipart data,
- the `Content-Type` header must be set to `application/x-java-archive`, and
- it provides a curl example using `-F jarfile=@path/to/flink-job.jar`.
A practical Go snippet to upload a JAR:
```go
package flink

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func UploadJar(ctx context.Context, flinkBaseURL, jarPath string) (*http.Response, error) {
	f, err := os.Open(jarPath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("jarfile", "job.jar")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(part, f); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, flinkBaseURL+"/jars/upload", &body)
	if err != nil {
		return nil, err
	}
	// Important: the multipart boundary lives in the Content-Type header.
	req.Header.Set("Content-Type", w.FormDataContentType())
	// Some clients also clear "Expect", mirroring the curl example in the spec.
	req.Header.Set("Expect", "")

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode >= 300 {
		return resp, fmt.Errorf("upload failed: %s", resp.Status)
	}
	return resp, nil
}
```
This code is guided by the REST API OpenAPI description for /jars/upload including its multipart requirement and curl reference.
To run a previously uploaded JAR, Flink exposes /jars/{jarid}/run and supports passing program args via query parameters (and/or JSON body).
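A small helper can assemble that run URL. This is a hypothetical convenience function; the `programArg` (repeatable) and `parallelism` query parameters follow the documented jar-run endpoint, and the same values can alternatively go in a JSON body:

```python
from urllib.parse import urlencode

def run_jar_url(base_url, jar_id, program_args=None, parallelism=None):
    """Build the POST URL for /jars/{jarid}/run.

    `programArg` is a repeatable query parameter; `parallelism` is optional.
    """
    params = {}
    if program_args:
        params["programArg"] = program_args  # repeated query parameter
    if parallelism is not None:
        params["parallelism"] = parallelism
    query = urlencode(params, doseq=True)
    return f"{base_url}/jars/{jar_id}/run" + (f"?{query}" if query else "")

url = run_jar_url("http://flink:8081", "abc.jar",
                  program_args=["--env", "prod"], parallelism=4)
```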
Operationally valuable endpoints you’ll likely automate:
- `/jobs` and `/jobs/{jobid}` to list and inspect job state
- `/jobs/{jobid}/savepoints` to trigger savepoints (async trigger + polling)
- `/jobs/{jobid}/rescaling` to trigger rescaling
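As a sketch of what that automation decodes, `GET /jobs` returns a JSON list of job id/status pairs; a minimal parser (sample payload shape based on the REST docs) looks like:

```python
import json

def running_job_ids(jobs_response_text):
    """Extract ids of RUNNING jobs from a GET /jobs response body."""
    payload = json.loads(jobs_response_text)
    return [j["id"] for j in payload.get("jobs", [])
            if j.get("status") == "RUNNING"]

sample = '{"jobs":[{"id":"a1","status":"RUNNING"},{"id":"b2","status":"FINISHED"}]}'
ids = running_job_ids(sample)  # → ["a1"]
```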
Code snippets comparison table: PyFlink vs Go in a Flink-based platform
| Concern | PyFlink (Python jobs) | Go (services around Flink) |
|---|---|---|
| Authoring Flink logic | Native authoring via DataStream/Table APIs; supports state + timers | No native Flink API; implement logic in Flink (Java/Python) and integrate externally |
| Connectors/dependencies | Must ship connector JARs via `pipeline.jars`, `add_jars()`, or `--jarfile` | Not applicable (you’re not running inside Flink), but you manage Kafka/DB clients |
| Ingestion/egress | KafkaSource/KafkaSink builders in PyFlink | Kafka producer/consumer libraries; standard microservice patterns |
| Ops automation | Can call Flink REST endpoints too | Often owns automation: upload JAR, deploy, rescale, trigger savepoint via REST |
DevOps guide: monitoring, scaling, backups, and CI/CD for Apache Flink
Monitoring Apache Flink in Kubernetes and on VMs
Flink supports exporting metrics by configuring metric reporters in the Flink configuration file; these reporters are instantiated on JobManager and TaskManagers.
For Prometheus, Flink exposes Prometheus-format metrics when configured with `metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory` in a supported Flink version environment.
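A minimal reporter fragment for the Flink configuration file might look like this (illustrative; the port range is an assumption, and each JobManager/TaskManager binds one port from it):

```yaml
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249-9260
```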
You generally combine that with Kubernetes ServiceMonitors (Prometheus Operator) or with your managed monitoring stack.
Scaling: parallelism, slots, and operator-based autoscaling
Flink’s scheduling model defines execution resources via task slots, and each slot can run a pipeline of parallel tasks.
For manual scaling, the REST API provides a rescaling endpoint for a job (/jobs/{jobid}/rescaling) as an async operation.
If you’re on Kubernetes with the Flink Kubernetes Operator, the operator project advertises a “Flink Job Autoscaler” as part of its feature set, which is worth evaluating if your workloads vary substantially.
Backups and safe upgrades: checkpoints and savepoints
Checkpoints are for automated recovery and are managed by Flink; savepoints are for user-driven lifecycle operations (stop/resume/fork/upgrade).
From an SRE standpoint:
- Use checkpoints for “keep the pipeline running through failures”.
- Use savepoints for “deploy a new version without losing state”.
Flink’s REST API also supports triggering savepoints asynchronously, which is useful for GitOps-style “deploy → trigger savepoint → upgrade” workflows.
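The async savepoint API returns a trigger id that you then poll. The sketch below separates polling logic from HTTP: the `fetch_status` callable stands in for a GET on `/jobs/{jobid}/savepoints/{triggerid}`, and the status values mirror the documented IN_PROGRESS/COMPLETED lifecycle. The field names in the stubbed responses are assumptions for illustration:

```python
import time

def wait_for_savepoint(fetch_status, trigger_id, timeout_s=300, poll_s=0.0):
    """Poll the async savepoint operation until it completes.

    `fetch_status` abstracts the HTTP GET on
    /jobs/{jobid}/savepoints/{triggerid}; it returns the decoded JSON dict.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status(trigger_id)
        if status["status"]["id"] == "COMPLETED":
            # On success the operation result carries the savepoint location.
            return status.get("operation", {})
        time.sleep(poll_s)
    raise TimeoutError(f"savepoint {trigger_id} did not complete in {timeout_s}s")

# Usage with a stubbed fetcher (real code would issue HTTP GETs):
responses = iter([
    {"status": {"id": "IN_PROGRESS"}},
    {"status": {"id": "COMPLETED"}, "operation": {"location": "s3://bucket/sp-1"}},
])
result = wait_for_savepoint(lambda _tid: next(responses), "trigger-123")
```

Injecting the fetcher keeps the retry/timeout logic testable without a live JobManager.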
CI/CD: GitOps + Helm + REST job submission
For Kubernetes:
- Keep the operator installation and your FlinkDeployment CRs in Git, deploy via Argo CD/Flux, and version container images per build. The operator Helm docs explicitly discuss “Working with Argo CD”.
For standalone/session clusters:
- Use the Flink REST API JAR upload and run endpoints for immutable artefact deployments.
Also note a subtle but valuable security/ops toggle: `web.submit.enable` governs uploads via the Web UI, but the docs note that even when disabled, session clusters still accept job submissions through REST requests; this is relevant when hardening UI surfaces while retaining CI/CD automation.
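In config terms, that hardening toggle is a one-liner (illustrative fragment):

```yaml
# Disable jar uploads via the Web UI; REST-based CI/CD submission still works
web.submit.enable: false
```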
LLM/AI integration patterns with Apache Flink for real-time pipelines
LLM systems are often only as good as their real-time context. Flink fits into LLM/AI stacks as the component that produces “always fresh” features, embeddings, and behavioural aggregates.
Real-time embeddings pipeline with Flink
A common pattern is:
- ingest user actions/events,
- aggregate sessions and preferences,
- produce embedding-generation tasks,
- write embeddings to a vector store and/or feature store.
PyFlink’s dependency management documentation explicitly calls out “machine learning prediction” and loading ML models inside Python UDFs (for remote cluster execution), which maps directly to “online inference inside Flink operators” approaches.
Online feature store updates for recommendation and ranking
Flink’s keyed state and checkpointing model is built to maintain operator state across events and recover it reliably. That’s a natural match for continuous feature computation (rolling rates, counts, time-decayed metrics) that downstream recommenders need.
Practical latency/consistency trade-offs for AI pipelines
If your architecture requires exactly-once semantics end-to-end (e.g., avoid duplicate feature updates or duplicate billing events), you’ll structure sinks and sources around checkpointing and transactional guarantees.
In Kafka-based stacks specifically:
- Flink’s Kafka connector can deliver exactly-once guarantees when checkpointing is enabled and delivery guarantee options are configured.
- Kafka Streams also supports exactly-once semantics (EOS), which is relevant if your “AI feature pipeline” is small enough to live inside application code rather than a Flink cluster.
Architecture view for “Flink as the real-time AI context builder”
This architecture view rests on Flink’s core primitives: event-time processing (watermarks), state backends (`state.backend.type` and system-managed local state), and checkpoint/savepoint mechanisms for fault tolerance and operations.