The landscape of real-time event streaming has undergone a substantial transformation, even in the short span leading up to early 2026. What was once a clear-cut choice between robust messaging and scalable streaming has evolved into a nuanced spectrum where platforms like Apache Kafka and Apache Pulsar are continuously integrating advanced capabilities. As someone who's spent the last year knee-deep in benchmarks, configuration files, and late-night debugging sessions, I can tell you that the "new" features aren't just incremental updates; they're architectural shifts demanding a fresh perspective on how we design, deploy, and manage our data pipelines.
This isn't about marketing fluff; it's about practical, solid engineering. We're seeing a persistent drive towards cost efficiency, enhanced operational simplicity, and greater elasticity, particularly in cloud-native environments. Let me walk you through some of the most impactful recent developments and how to leverage them.
Kafka's Game-Changing Tiered Storage: Beyond Local Disks (KIP-405)
For years, one of Kafka's core architectural tenets was its reliance on local, high-performance disks for all data retention. This provided stellar throughput and low latency but came with significant operational and cost burdens as retention requirements grew. Enter KIP-405: Tiered Storage, a feature that finally went production-ready with Kafka 3.9.0 in November 2024. This isn't just a band-aid; it's a fundamental re-evaluation of Kafka's storage model.
The core principle behind KIP-405 is to disaggregate compute and storage, allowing Kafka to persist only recent, "hot" data on local broker disks while asynchronously offloading historical, "cold" data to more cost-effective object storage systems like Amazon S3, Google Cloud Storage, or HDFS. This separation means you can scale storage independently from compute, leading to substantial cost reductions for long-term data retention, much like how dbt, Airflow, and Dagster are decoupling the data plane in modern stacks.
How Tiered Storage Works
Under the hood, KIP-405 introduces two tiers:
- Local (Hot) Tier: This is the traditional storage layer where new data is written to the broker's local disks. It's optimized for low-latency writes and reads of recent data.
- Remote (Cold) Tier: This tier stores older log segments in external object storage. When a log segment on the local disk becomes inactive (based on age or size, defined by retention policies), it's asynchronously uploaded to the configured remote storage.
Crucially, the consumer API remains unchanged. When a consumer requests data, the broker transparently fetches it from either the local or remote tier. This is coordinated by the RemoteLogMetadataManager, which tracks metadata about uploaded segments; the default topic-based implementation stores it in an internal __remote_log_metadata topic, keeping metadata consistent across brokers.
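Because the fetch path is unchanged, an ordinary consumer can read offsets that have already been tiered off to object storage without any client-side changes. A minimal sketch (the topic, group, and deserializers are illustrative):
// Plain KafkaConsumer; no tiered-storage-specific configuration is needed on the client
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "replay-analytics");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("my-tiered-topic", 0);
    consumer.assign(List.of(tp));
    // Seeking back to offsets whose segments were offloaded just works:
    // the broker pulls them from the remote tier transparently
    consumer.seekToBeginning(List.of(tp));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
    records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
}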
Configuration Deep Dive:
To enable tiered storage, you need to configure it at both the cluster and topic levels. First, at the broker level (e.g., in server.properties), you'll enable the remote log storage system and configure the remote storage manager:
# Enable remote log storage
remote.log.storage.system.enable=true
# Plug in a RemoteStorageManager implementation. Apache Kafka does not ship an S3 backend,
# so the class below is a placeholder for whichever plugin you deploy
# (for example, the Aiven tiered-storage plugin for S3).
remote.log.storage.manager.class.name=com.example.tieredstorage.s3.S3RemoteStorageManager
remote.log.storage.manager.impl.prefix=rsm.config.
# Plugin-specific settings (exact keys depend on the plugin you choose)
rsm.config.s3.bucket.name=your-kafka-tiered-storage-bucket
rsm.config.s3.region=us-east-1
# Prefer IAM roles or a credentials provider over hard-coding access keys here
Then, at the topic level, you define the retention policies for both local and remote storage.
# Create a new topic with tiered storage enabled.
# local.log.retention.ms=604800000 keeps 7 days on local broker disks;
# log.retention.ms=-1 retains data indefinitely once it reaches the remote tier.
kafka-topics --bootstrap-server localhost:9092 --create --topic my-tiered-topic \
  --partitions 3 --replication-factor 3 \
  --config remote.storage.enable=true \
  --config local.log.retention.ms=604800000 \
  --config log.retention.ms=-1 \
  --config local.log.retention.bytes=-1 \
  --config log.retention.bytes=-1
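If the topic already exists, the same settings can be applied through the Admin API instead of recreating it. A minimal Java sketch, assuming the brokers already have remote storage enabled (the topic name is illustrative; exception handling omitted):
// Flip tiered storage on for an existing topic via incrementalAlterConfigs
Properties conf = new Properties();
conf.put("bootstrap.servers", "localhost:9092");
try (Admin admin = Admin.create(conf)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-existing-topic");
    Collection<AlterConfigOp> ops = List.of(
        new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"), AlterConfigOp.OpType.SET),
        new AlterConfigOp(new ConfigEntry("local.log.retention.ms", "604800000"), AlterConfigOp.OpType.SET));
    admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
}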
The Pulsar Functions Paradigm: Serverless Stream Processing at the Edge
Apache Pulsar, with its disaggregated architecture of brokers and BookKeeper storage, naturally lends itself to flexible processing models. Pulsar Functions exemplify this by offering a lightweight, serverless framework for stream processing that lives natively within the Pulsar ecosystem. You can use this JSON Formatter to verify your event payloads before publishing them to Pulsar.
Stateful Functions and Deployment Models
One of the standout features is built-in state management. Pulsar Functions can maintain state across messages, persisting it durably in BookKeeper's table service. Inside a function you work with simple key-value APIs: putState()/getState() in the Java API, put_state()/get_state() in Python.
Code Example (Python Stateful Function):
# user_event_counter.py
from pulsar import Function
import json

class UserEventCounter(Function):
    def process(self, input, context):
        try:
            # The Python runtime passes the message payload (bytes/str) as `input`
            event = json.loads(input)
            user_id = event.get('user_id')
            if user_id:
                current_count_bytes = context.get_state(user_id)
                current_count = int.from_bytes(current_count_bytes, 'big') if current_count_bytes else 0
                new_count = current_count + 1
                context.put_state(user_id, new_count.to_bytes(8, 'big'))
                context.publish('persistent://public/default/user-counts',
                                json.dumps({'user_id': user_id, 'total_events': new_count}).encode('utf-8'))
        except Exception as e:
            context.get_logger().error(f"Failed to process event: {e}")
To deploy this function:
# The Python runtime is inferred from --py; state storage must be configured on the
# Functions Worker (stateStorageServiceUrl in functions_worker.yml) for put_state/get_state to work.
pulsar-admin functions create \
  --tenant public \
  --namespace default \
  --name user-event-counter-function \
  --py user_event_counter.py \
  --classname user_event_counter.UserEventCounter \
  --inputs persistent://public/default/user-events \
  --output persistent://public/default/user-counts
Mastering Cross-Cluster Resilience: Kafka MirrorMaker 2.0 vs. Pulsar Geo-Replication
In distributed systems, resilience and data locality across multiple data centers or cloud regions are paramount.
Kafka MirrorMaker 2.0 (MM2)
Kafka MirrorMaker 2.0 is built on Kafka Connect and offers a powerful framework for replicating topics between Kafka clusters. It uses source and sink connectors to replicate data, manage offsets, and even synchronize consumer group states.
# Basic MM2 setup (connect-mirror-maker.properties)
clusters = primary, secondary
primary.bootstrap.servers = kafka-primary:9092
secondary.bootstrap.servers = kafka-secondary:9092

# Replication configuration
primary->secondary.enabled = true
primary->secondary.topics = .*
primary->secondary.groups = .*
# Periodically sync committed consumer offsets so groups can fail over cleanly
primary->secondary.sync.group.offsets.enabled = true
# Keep topic names identical on the target cluster (no "primary." prefix)
primary->secondary.replication.policy.class = org.apache.kafka.connect.mirror.IdentityReplicationPolicy
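For consumer failover, MM2's checkpoint topics let you translate committed offsets into the secondary cluster's offset space rather than re-reading from the beginning. A sketch using the MirrorMaker client utilities (cluster aliases and the group name are illustrative and match the config above; error handling omitted):
// Translate the group's committed offsets from "primary" into secondary-cluster offsets
Map<String, Object> props = new HashMap<>();
props.put("bootstrap.servers", "kafka-secondary:9092");
Map<TopicPartition, OffsetAndMetadata> translated =
        RemoteClusterUtils.translateOffsets(props, "primary", "my-consumer-group", Duration.ofSeconds(30));

// Seed the group on the secondary cluster, then point consumers at it
Properties adminConf = new Properties();
adminConf.put("bootstrap.servers", "kafka-secondary:9092");
try (Admin admin = Admin.create(adminConf)) {
    admin.alterConsumerGroupOffsets("my-consumer-group", translated).all().get();
}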
Pulsar Geo-Replication
Pulsar's architecture makes geo-replication a first-class citizen. Geo-replication in Pulsar is managed at the namespace level, meaning you configure it for a group of topics within a namespace.
# Enable geo-replication for a namespace
pulsar-admin namespaces set-clusters \
--clusters cluster-us-east,cluster-us-west \
my-global-tenant/my-geo-namespace
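Geo-replication is namespace-wide by default, but producers can restrict individual messages to a subset of the configured clusters. A minimal Java sketch, assuming an existing PulsarClient named client (topic and cluster names are illustrative):
// This message stays in us-east even though the namespace replicates to both clusters
Producer<byte[]> producer = client.newProducer()
        .topic("persistent://my-global-tenant/my-geo-namespace/orders")
        .create();

producer.newMessage()
        .value("region-restricted-payload".getBytes(StandardCharsets.UTF_8))
        .replicationClusters(List.of("cluster-us-east"))
        .send();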
Achieving Clarity: Observability and Monitoring for Event Streams
Metrics with JMX and Prometheus
Kafka brokers expose a rich set of metrics via JMX, and exporting them to Prometheus with the JMX exporter Java agent is standard practice. Pulsar brokers and bookies, by contrast, expose Prometheus metrics natively over HTTP at /metrics, so no extra agent is needed on that side.
# kafka-broker-jmx-exporter.yaml snippet
lowercaseOutputLabelNames: true
lowercaseOutputName: true
rules:
  - pattern: "kafka.server<type=(KafkaServer|BrokerTopicMetrics), name=(.*)>"
    name: kafka_$1_$2
    labels: {}
    help: "Kafka JMX metric $1 $2"
    type: GAUGE
Start your Kafka broker with the agent:
export KAFKA_OPTS="-javaagent:/path/to/jmx_exporter.jar=9093:/path/to/kafka-broker-jmx-exporter.yaml"
bin/kafka-server-start.sh config/server.properties
Tracing with OpenTelemetry
Integrating OpenTelemetry with your Kafka or Pulsar clients allows you to trace messages from their origin through producers, brokers, and consumers.
// Java Producer with OTel Context Propagation
Span span = tracer.spanBuilder("send-message-to-kafka").startSpan();
try (Scope scope = span.makeCurrent()) {
ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "key", "value");
GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
.inject(Context.current(), record.headers(), (headers, key, value) -> headers.add(key, value.getBytes(StandardCharsets.UTF_8)));
producer.send(record, (metadata, exception) -> {
if (exception != null) span.recordException(exception);
span.end();
});
} catch (Exception e) {
span.recordException(e);
span.end();
}
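On the consumer side, the same propagator can extract the context from record headers so the processing span links back to the producer's trace. A sketch under the same assumptions (tracer and consumer already initialized, header values UTF-8 encoded):
// Java Consumer: rebuild the trace context from Kafka record headers
TextMapGetter<Headers> getter = new TextMapGetter<>() {
    @Override
    public Iterable<String> keys(Headers headers) {
        return StreamSupport.stream(headers.spliterator(), false)
                .map(Header::key).collect(Collectors.toList());
    }
    @Override
    public String get(Headers headers, String key) {
        Header header = headers.lastHeader(key);
        return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
    }
};

for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
    Context extracted = GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
            .extract(Context.current(), record.headers(), getter);
    Span span = tracer.spanBuilder("process-message-from-kafka")
            .setParent(extracted)
            .startSpan();
    try (Scope scope = span.makeCurrent()) {
        // ... business logic ...
    } finally {
        span.end();
    }
}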
Stream Processing Synergy: Kafka Connect & Pulsar IO, and Exactly-Once Semantics
Kafka Connect: Advanced Configuration for Throughput
For high-throughput scenarios, tuning connector-level settings (and the Connect worker's producer/consumer defaults behind them) is critical. A JDBC source example:
{
  "name": "my-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "5",
    "connection.url": "jdbc:postgresql://postgres:5432/mydb",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "postgres-",
    "batch.max.rows": "2000",
    "poll.interval.ms": "5000"
  }
}
Pulsar IO: Built-in Connectors
Pulsar IO provides a similar framework for connecting Pulsar to various data systems. These are essentially Pulsar Functions designed for specific input/output tasks, running as lightweight processes managed by the Pulsar Functions Worker.
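If none of the built-in connectors fit, a custom connector is just a small Java class implementing the Pulsar IO Sink (or Source) interface, packaged and submitted to the Functions Worker like any other function. A minimal, illustrative sink that only logs each record:
import java.util.Map;
import org.apache.pulsar.functions.api.Record;
import org.apache.pulsar.io.core.Sink;
import org.apache.pulsar.io.core.SinkContext;

public class LoggingSink implements Sink<byte[]> {
    @Override
    public void open(Map<String, Object> config, SinkContext context) {
        // Read connector configuration here (e.g., connection strings)
    }

    @Override
    public void write(Record<byte[]> record) {
        System.out.println("Received " + record.getValue().length + " bytes");
        record.ack(); // acknowledge so Pulsar does not redeliver the message
    }

    @Override
    public void close() {
        // Release any external resources
    }
}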
The Elusive "Exactly-Once Semantics"
Achieving exactly-once semantics (EOS) means each input record is processed precisely once, even across retries and failures. Kafka achieves this through its idempotent producer combined with the transactional API.
// Key producer configurations for EOS in Kafka
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("transactional.id", "my-app-transaction-id");
Hardening Your Event Infrastructure: Pragmatic Security Measures
Authentication and Authorization
- Kafka: Uses SASL/SCRAM or TLS for authentication and ACLs for authorization.
- Pulsar: Supports mTLS, JWT, and Kerberos, with role-based authorization at the tenant/namespace level.
Encryption
Both platforms support TLS/SSL for encryption in transit. Encryption at rest is typically handled below the platform, at the disk or volume level (for example LUKS or encrypted cloud volumes under Kafka brokers and BookKeeper nodes); Pulsar additionally offers client-side end-to-end message encryption if payloads must be encrypted before they ever reach the broker.
# Kafka server.properties for SASL/SCRAM over TLS
listeners=SASL_SSL://:9092
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
sasl.enabled.mechanisms=SCRAM-SHA-512
# KRaft-mode ACL authorizer (ZooKeeper-based clusters use kafka.security.authorizer.AclAuthorizer)
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
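With the authorizer enabled, ACLs can be managed with the kafka-acls CLI or programmatically through the Admin API. A minimal sketch granting an illustrative principal read access to one topic (the admin client would also need the SASL/TLS settings required to reach the cluster; error handling omitted):
// Allow User:analytics-svc to read my-tiered-topic from any host
AclBinding readAcl = new AclBinding(
        new ResourcePattern(ResourceType.TOPIC, "my-tiered-topic", PatternType.LITERAL),
        new AccessControlEntry("User:analytics-svc", "*", AclOperation.READ, AclPermissionType.ALLOW));

Properties adminConf = new Properties();
adminConf.put("bootstrap.servers", "localhost:9092");
try (Admin admin = Admin.create(adminConf)) {
    admin.createAcls(List.of(readAcl)).all().get();
}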
And on the Pulsar side, the broker enables mTLS in broker.conf:
# Pulsar broker.conf for mTLS
brokerServicePortTls=6651
tlsCertificateFilePath=/etc/pulsar/certs/broker.cert.pem
tlsKeyFilePath=/etc/pulsar/certs/broker.key-pk8.pem
tlsTrustCertsFilePath=/etc/pulsar/certs/ca.cert.pem
tlsRequireTrustedClientCertOnConnect=true
authenticationEnabled=true
authenticationProviders=org.apache.pulsar.broker.authentication.AuthenticationProviderTls
Performance Deep Dive: Tuning Producers and Consumers for Peak Efficiency
Kafka Client Tuning
// Throughput optimizations
props.put("batch.size", 131072);   // 128 KB batches
props.put("linger.ms", 50);        // wait up to 50 ms for a batch to fill
props.put("compression.type", "lz4");
// acks=1 trades durability for latency and disables idempotence;
// keep acks=all if the producer also needs exactly-once guarantees
props.put("acks", "1");
props.put("max.in.flight.requests.per.connection", 5);
Pulsar Client Tuning
Producer<byte[]> producer = client.newProducer()
    .topic("my-pulsar-topic")
    .compressionType(CompressionType.LZ4)
    .enableBatching(true)
    .batchingMaxMessages(1000)
    .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)
    .create();
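On the consuming side, the most important Pulsar knob is the receiver queue, which controls how many messages the client prefetches per consumer; batch receive can further amortize per-message overhead. A sketch, assuming the same client instance:
Consumer<byte[]> consumer = client.newConsumer()
    .topic("my-pulsar-topic")
    .subscriptionName("throughput-tuned-sub")
    .subscriptionType(SubscriptionType.Shared)
    .receiverQueueSize(5000)                  // prefetch more messages per consumer
    .batchReceivePolicy(BatchReceivePolicy.builder()
            .maxNumMessages(500)
            .timeout(10, TimeUnit.MILLISECONDS)
            .build())                         // enables consumer.batchReceive()
    .subscribe();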
Expert Insight: The Dawn of Truly Disaggregated Event Architectures
If you've been following the Kafka community closely, you've seen the discussions around Tiered Storage (KIP-405) evolve into something even more profound: the concept of Diskless Kafka. This isn't just about offloading cold data; it's about fundamentally rethinking the Kafka broker as a stateless compute layer, entirely decoupled from storage.
Several KIPs (e.g., KIP-1150, KIP-1176, KIP-1183) are actively exploring this vision, aiming to allow Kafka brokers to serve topics directly from object storage like S3. The implications are massive: true elasticity, simplified operations, and extreme cost efficiency. Organizations that embrace the diskless paradigm early will gain significant competitive advantages. The event streaming landscape is not just evolving; it's fundamentally restructuring.
This article was published by the **DataFormatHub Editorial Team**, a group of developers and data enthusiasts dedicated to making data transformation accessible and private. Our goal is to provide high-quality technical insights alongside our suite of privacy-first developer tools.
🛠️ Related Tools
Explore these DataFormatHub tools related to this topic:
- JSON Formatter - Format event payloads
- JSON to YAML - Convert Kafka configs
📚 You Might Also Like
- dbt, Airflow, and Dagster: Why the Data Plane Changes Everything in 2026
- Apache Iceberg & the Open Data Stack: Why the Lakehouse is Real in 2026
- GPT-5.x Deep Dive: Why the New OpenAI API Changes Everything in 2026
This article was originally published on DataFormatHub, your go-to resource for data format and developer tools insights.