Kafka vs Pulsar 2026: Why the Shift to Diskless Changes Everything

The landscape of real-time event streaming has undergone a substantial transformation, even in the short span leading up to early 2026. What was once a clear-cut choice between robust messaging and scalable streaming has evolved into a nuanced spectrum where platforms like Apache Kafka and Apache Pulsar are continuously integrating advanced capabilities. As someone who's spent the last year knee-deep in benchmarks, configuration files, and late-night debugging sessions, I can tell you that the "new" features aren't just incremental updates; they're architectural shifts demanding a fresh perspective on how we design, deploy, and manage our data pipelines.

This isn't about marketing fluff; it's about practical, production-grade engineering. We're seeing a persistent drive towards cost efficiency, enhanced operational simplicity, and greater elasticity, particularly in cloud-native environments. Let me walk you through some of the most impactful recent developments and how to leverage them.

Kafka's Game-Changing Tiered Storage: Beyond Local Disks (KIP-405)

For years, one of Kafka's core architectural tenets was its reliance on local, high-performance disks for all data retention. This provided stellar throughput and low latency but came with significant operational and cost burdens as retention requirements grew. Enter KIP-405: Tiered Storage, a feature that finally went production-ready with Kafka 3.9.0 in November 2024. This isn't just a band-aid; it's a fundamental re-evaluation of Kafka's storage model.

The core principle behind KIP-405 is to disaggregate compute and storage, allowing Kafka to persist only recent, "hot" data on local broker disks while asynchronously offloading historical, "cold" data to more cost-effective object storage systems like Amazon S3, Google Cloud Storage, or HDFS. This separation means you can scale storage independently from compute, leading to substantial cost reductions for long-term data retention and echoing the broader decoupling of compute from storage across the modern data stack.

How Tiered Storage Works

Under the hood, KIP-405 introduces two tiers:

  • Local (Hot) Tier: This is the traditional storage layer where new data is written to the broker's local disks. It's optimized for low-latency writes and reads of recent data.
  • Remote (Cold) Tier: This tier stores older log segments in external object storage. When a log segment on the local disk becomes inactive (based on age or size, defined by retention policies), it's asynchronously uploaded to the configured remote storage.


Crucially, the consumer API remains unchanged. When a consumer requests data, the broker transparently fetches it from either the local or remote tier. This is managed by the RemoteLogMetadataManager, which stores metadata about uploaded segments, often in an internal __remote_log_metadata topic, ensuring strong consistency for metadata operations.
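To see what that means in practice, here is a minimal Java consumer sketch: it reads a tiered topic from the earliest retained offset with a completely ordinary client. Whether each record is served from local disk or pulled back from object storage is invisible at this level; the topic name and addresses are placeholders.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Reading historical data from a tiered topic: no tiered-storage-specific API calls.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "history-reader");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest"); // start from the oldest retained offset

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("my-tiered-topic"));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
    records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
}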

Configuration Deep Dive:
To enable tiered storage, you need to configure it at both the cluster and topic levels. One important nuance: Apache Kafka itself ships only the RemoteStorageManager plugin interface, not a concrete S3 backend, so the manager class comes from whichever tiered-storage plugin you deploy on the broker classpath. At the broker level (e.g., in server.properties), you enable remote log storage and point Kafka at that implementation:

# Enable remote log storage
remote.log.storage.system.enable=true

# RemoteStorageManager implementation, supplied by your tiered-storage plugin.
# The class below is a placeholder; use the class name your plugin documents.
remote.log.storage.manager.class.name=com.example.kafka.tiered.S3RemoteStorageManager
remote.log.storage.manager.impl.prefix=s3.

# Plugin-specific S3 configuration (keys are illustrative)
s3.bucket.name=your-kafka-tiered-storage-bucket
s3.region=us-east-1
# Prefer IAM roles or a credentials provider over inline keys in production
s3.access.key=YOUR_ACCESS_KEY
s3.secret.key=YOUR_SECRET_KEY

Then, at the topic level, you define the retention policies for both local and remote storage.

# Create a new topic with tiered storage enabled:
# 7 days of local retention, infinite total retention, no size limits on either tier.
kafka-topics --bootstrap-server localhost:9092 --create --topic my-tiered-topic \
  --partitions 3 --replication-factor 3 \
  --config remote.storage.enable=true \
  --config local.log.retention.ms=604800000 \
  --config log.retention.ms=-1 \
  --config local.log.retention.bytes=-1 \
  --config log.retention.bytes=-1

The Pulsar Functions Paradigm: Serverless Stream Processing at the Edge

Apache Pulsar, with its disaggregated architecture of brokers and BookKeeper storage, naturally lends itself to flexible processing models. Pulsar Functions exemplify this by offering a lightweight, serverless framework for stream processing that lives natively within the Pulsar ecosystem.

Stateful Functions and Deployment Models

One of the standout features is built-in state management. Pulsar Functions allow computations to maintain state across messages, persisting it durably in BookKeeper. Within your function you use simple key-value APIs: putState()/getState() in the Java API, put_state()/get_state() in Python.

Code Example (Python Stateful Function):

# user_event_counter.py
import json

from pulsar import Function

class UserEventCounter(Function):
    def process(self, input_payload, context):
        # The Python runtime passes the (deserialized) message payload to
        # process(), not a Message object; metadata is available via context.
        try:
            event = json.loads(input_payload)
            user_id = event.get('user_id')

            if user_id:
                # Per-user counter, persisted in BookKeeper-backed state
                current_count_bytes = context.get_state(user_id)
                current_count = int.from_bytes(current_count_bytes, 'big') if current_count_bytes else 0
                new_count = current_count + 1
                context.put_state(user_id, new_count.to_bytes(8, 'big'))
                context.publish(
                    'persistent://public/default/user-counts',
                    json.dumps({'user_id': user_id, 'total_events': new_count}).encode('utf-8')
                )
        except Exception as e:
            context.get_logger().error(f"Error processing event: {e}")

To deploy this function (note that the state-storage service URL, e.g. bk://localhost:4181, is configured on the Functions Worker via stateStorageServiceUrl in functions_worker.yml rather than per function):

pulsar-admin functions create \
  --tenant public \
  --namespace default \
  --name user-event-counter-function \
  --py user_event_counter.py \
  --classname user_event_counter.UserEventCounter \
  --inputs persistent://public/default/user-events \
  --output persistent://public/default/user-counts

Mastering Cross-Cluster Resilience: Kafka MirrorMaker 2.0 vs. Pulsar Geo-Replication

In distributed systems, resilience and data locality across multiple data centers or cloud regions are paramount.

Kafka MirrorMaker 2.0 (MM2)

Kafka MirrorMaker 2.0 is built on Kafka Connect and offers a powerful framework for replicating topics between Kafka clusters. It uses source and sink connectors to replicate data, manage offsets, and synchronize consumer group state. One caveat: the IdentityReplicationPolicy used below keeps topic names identical across clusters, which simplifies failover but gives up the prefix-based cycle detection of the default policy, so take care with active/active topologies.

# Basic MM2 setup
clusters=primary,secondary
primary.bootstrap.servers=kafka-primary:9092
secondary.bootstrap.servers=kafka-secondary:9092

# Replication configuration
primary->secondary.enabled=true
primary->secondary.topics=.*
primary->secondary.groups=.*
primary->secondary.sync.group.offsets.enabled=true
primary->secondary.replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy
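MM2 runs as a dedicated driver process. Assuming the properties above are saved as mm2.properties, you can start it with the script that ships in the Kafka distribution:

bin/connect-mirror-maker.sh mm2.properties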

Pulsar Geo-Replication

Pulsar's architecture makes geo-replication a first-class citizen. Geo-replication in Pulsar is managed at the namespace level, meaning you configure it for a group of topics within a namespace.

# Enable geo-replication for a namespace
pulsar-admin namespaces set-clusters \
  --clusters cluster-us-east,cluster-us-west \
  my-global-tenant/my-geo-namespace
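Once the namespace is replicated, producers can still pin individual messages to a subset of clusters. Here's a hedged Java sketch using the client's per-message replication control; the topic, cluster names, and service URL are placeholders:

import java.util.List;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

Producer<byte[]> producer = client.newProducer()
        .topic("persistent://my-global-tenant/my-geo-namespace/orders")
        .create();

// This message replicates only to us-east; messages without the override
// follow the namespace-level replication configuration.
producer.newMessage()
        .value("region-local event".getBytes())
        .replicationClusters(List.of("cluster-us-east"))
        .send();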

Achieving Clarity: Observability and Monitoring for Event Streams

Metrics with JMX and Prometheus

Both Kafka and Pulsar expose a rich set of metrics via JMX. For modern monitoring stacks, exporting these JMX metrics to Prometheus is a standard practice.

# kafka-broker-jmx-exporter.yaml snippet
lowercaseOutputLabelNames: true
lowercaseOutputName: true
rules:
- pattern: 'kafka.server<type=(KafkaServer|BrokerTopicMetrics), name=(.*)><>(Count|Value)'
  name: kafka_$1_$2
  labels: {}
  help: "Kafka JMX metric $1 $2"
  type: GAUGE

Start your Kafka broker with the agent:

export KAFKA_OPTS="-javaagent:/path/to/jmx_exporter.jar=9093:/path/to/kafka-broker-jmx-exporter.yaml"
bin/kafka-server-start.sh config/server.properties
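With the exporter listening on port 9093 as configured above, a minimal Prometheus scrape job picks it up; the job name and target host below are illustrative:

# prometheus.yml snippet
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['kafka-broker-1:9093']  # jmx_exporter HTTP endpoint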

Tracing with OpenTelemetry

Integrating OpenTelemetry with your Kafka or Pulsar clients allows you to trace messages from their origin through producers, brokers, and consumers.

// Java Producer with OTel Context Propagation
Span span = tracer.spanBuilder("send-message-to-kafka").startSpan();
try (Scope scope = span.makeCurrent()) {
    ProducerRecord<String, String> record = new ProducerRecord<>("my-topic", "key", "value");
    // Inject the current trace context into the record headers
    GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
        .inject(Context.current(), record.headers(),
            (headers, key, value) -> headers.add(key, value.getBytes(StandardCharsets.UTF_8)));
    producer.send(record, (metadata, exception) -> {
        if (exception != null) span.recordException(exception);
        span.end(); // end the span once the broker acks (or fails) the send
    });
} catch (Exception e) {
    span.recordException(e);
    span.end();
}
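On the consuming side, you extract the propagated context from each record's headers and parent your processing span on it. A sketch under the same assumptions, where record is a ConsumerRecord obtained from poll():

// Java Consumer with OTel Context Extraction
TextMapGetter<Headers> getter = new TextMapGetter<Headers>() {
    @Override
    public Iterable<String> keys(Headers headers) {
        List<String> keys = new ArrayList<>();
        headers.forEach(h -> keys.add(h.key()));
        return keys;
    }
    @Override
    public String get(Headers headers, String key) {
        Header header = headers.lastHeader(key);
        return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
    }
};

Context extracted = GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
        .extract(Context.current(), record.headers(), getter);
Span processSpan = tracer.spanBuilder("process-message").setParent(extracted).startSpan();
try (Scope scope = processSpan.makeCurrent()) {
    // ... handle the record ...
} finally {
    processSpan.end();
}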

Stream Processing Synergy: Kafka Connect & Pulsar IO, and Exactly-Once Semantics

Kafka Connect: Advanced Configuration for Throughput

For high-throughput scenarios, tuning internal configurations is critical.

{
  "name": "my-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "5",
    "connection.url": "jdbc:postgresql://postgres:5432/mydb",
    "batch.max.rows": "20000",
    "poll.interval.ms": "5000"
  }
}

Pulsar IO: Built-in Connectors

Pulsar IO provides a similar framework for connecting Pulsar to various data systems. These are essentially Pulsar Functions designed for specific input/output tasks, running as lightweight processes managed by the Pulsar Functions Worker.
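For example, deploying one of the built-in sinks is a single pulsar-admin call. This sketch assumes a Cassandra sink with its connection details in a config file of our own naming:

pulsar-admin sinks create \
  --sink-type cassandra \
  --sink-config-file cassandra-sink.yml \
  --inputs persistent://public/default/user-events \
  --name cassandra-user-events-sink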

The Elusive "Exactly-Once Semantics"

Achieving exactly-once semantics (EOS) means each input record affects the output precisely once, even across retries and failures. Kafka achieves EOS by combining idempotent producers with its transactional API, paired with consumers reading at isolation.level=read_committed.

// Key producer configurations for EOS in Kafka
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("transactional.id", "my-app-transaction-id");
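Wired together, the transactional flow looks like the sketch below; topic names and keys are placeholders, and a consume-transform-produce pipeline would additionally call sendOffsetsToTransaction() inside the same transaction:

producer.initTransactions(); // registers transactional.id with the coordinator

try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("payments", "order-42", "captured"));
    producer.send(new ProducerRecord<>("audit-log", "order-42", "payment-captured"));
    producer.commitTransaction(); // both records become visible atomically
} catch (ProducerFencedException e) {
    // Fatal: another producer instance took over this transactional.id
    producer.close();
} catch (KafkaException e) {
    producer.abortTransaction(); // safe to retry the whole transaction
}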

Hardening Your Event Infrastructure: Pragmatic Security Measures

Authentication and Authorization

  • Kafka: Uses SASL/SCRAM or mutual TLS for authentication and ACLs for authorization (SCRAM credentials are provisioned with kafka-configs, as shown below).
  • Pulsar: Supports mTLS, JWT, and Kerberos, with role-based authorization at the tenant/namespace level.
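For instance, creating a SCRAM credential for a client principal is a one-line kafka-configs call; the user name and password here are placeholders:

kafka-configs --bootstrap-server localhost:9092 --alter \
  --add-config 'SCRAM-SHA-512=[iterations=8192,password=alice-secret]' \
  --entity-type users --entity-name alice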

Encryption

Both platforms support TLS/SSL for encryption in transit. Neither encrypts data at rest natively; that is typically delegated to the storage layer beneath the brokers (e.g., LUKS/dm-crypt or encrypted cloud volumes). Pulsar additionally offers client-side end-to-end message encryption when payloads must stay opaque to the brokers themselves.

# Kafka server.properties for SASL/SCRAM with TLS
listeners=SASL_SSL://:9092
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
sasl.enabled.mechanisms=SCRAM-SHA-512
# StandardAuthorizer is for KRaft clusters; ZooKeeper-based clusters use kafka.security.authorizer.AclAuthorizer
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer

# Pulsar broker.conf for mTLS
brokerServicePortTls=6651
tlsRequireTrustedClientCertOnConnect=true
authenticationEnabled=true
authenticationProviders=org.apache.pulsar.broker.authentication.AuthenticationProviderTls

Performance Deep Dive: Tuning Producers and Consumers for Peak Efficiency

Kafka Client Tuning

// Throughput optimizations
props.put("batch.size", 131072);      // 128 KB batches amortize per-request overhead
props.put("linger.ms", 50);           // wait up to 50 ms to fill a batch
props.put("compression.type", "lz4"); // cheap CPU-wise, solid ratio for logs/JSON
props.put("acks", "1");               // leader-only acks: lower latency, weaker durability
                                      // (note: incompatible with enable.idempotence=true)
props.put("max.in.flight.requests.per.connection", 5);
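The consumer side has its own throughput knobs; these values are illustrative starting points rather than recommendations:

// Consumer-side throughput tuning
props.put("fetch.min.bytes", 65536);              // let the broker accumulate 64 KB before responding
props.put("fetch.max.wait.ms", 500);              // ...but wait at most 500 ms
props.put("max.poll.records", 1000);              // bigger poll batches when per-record work is cheap
props.put("max.partition.fetch.bytes", 2097152);  // up to 2 MB per partition per fetch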

Pulsar Client Tuning

Producer<byte[]> producer = client.newProducer()
    .topic("my-pulsar-topic")
    .compressionType(CompressionType.LZ4)
    .enableBatching(true) // batching is on by default; shown here for explicitness
    .batchingMaxMessages(1000)
    .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)
    .create();
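On the consumer side, the receiver queue is the main lever; a short sketch with a placeholder subscription name:

Consumer<byte[]> consumer = client.newConsumer()
    .topic("my-pulsar-topic")
    .subscriptionName("throughput-sub")
    .receiverQueueSize(2000) // prefetch more messages per consumer (default is 1000)
    .subscribe();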

Expert Insight: The Dawn of Truly Disaggregated Event Architectures

If you've been following the Kafka community closely, you've seen the discussions around Tiered Storage (KIP-405) evolve into something even more profound: the concept of Diskless Kafka. This isn't just about offloading cold data; it's about fundamentally rethinking the Kafka broker as a stateless compute layer, entirely decoupled from storage.

Several KIPs (e.g., KIP-1150, KIP-1176, KIP-1183) are actively exploring this vision, aiming to allow Kafka brokers to serve topics directly from object storage like S3. The implications are massive: true elasticity, simplified operations, and extreme cost efficiency. Organizations that embrace the diskless paradigm early will gain significant competitive advantages. The event streaming landscape is not just evolving; it's fundamentally restructuring.


This article was published by the **DataFormatHub Editorial Team**, a group of developers and data enthusiasts dedicated to making data transformation accessible and private. Our goal is to provide high-quality technical insights alongside our suite of privacy-first developer tools.


This article was originally published on DataFormatHub, your go-to resource for data format and developer tools insights.
