Kafka cleanup.policy: A Deep Dive into Log Management for Production Systems
1. Introduction
Imagine a globally distributed financial trading platform. Every trade, order modification, and market data update is streamed through Kafka. We’re dealing with hundreds of millions of events per second, strict regulatory compliance requiring long-term audit trails, and the need for real-time risk analysis. A naive retention policy can quickly lead to disk exhaustion, performance degradation, and even data loss. The cleanup.policy configuration is the critical control point for managing Kafka’s log segments, impacting everything from storage costs to query latency. This post dives deep into cleanup.policy, covering its architecture, configuration, failure modes, and operational best practices for building robust, scalable, and reliable Kafka-based systems. We’ll focus on scenarios involving stream processing with Kafka Streams, CDC replication using Debezium, and event-driven microservices communicating via Kafka.
2. What is cleanup.policy in Kafka Systems?
cleanup.policy dictates how Kafka manages the lifecycle of log segments within a topic partition. It’s a topic-level configuration that determines whether segments are deleted based on time or size, or compacted to retain only the latest value for each key. Log compaction first shipped in Kafka 0.8.1, and the policy options have been refined in subsequent releases.
The core options are:
- `delete`: Segments are deleted based on `retention.ms` (milliseconds) or `retention.bytes` (bytes). This is the default.
- `compact`: Segments are compacted, retaining only the latest record for each key. Requires every record to have a non-null key.
- `compact,delete`: Combines both policies. The log is compacted, and segments are also deleted once they exceed the retention criteria (a CLI sketch for switching policies on an existing topic follows this list).
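As a minimal sketch (the topic name `my-topic` and broker address are illustrative), the policy can be changed on an existing topic with `kafka-configs.sh`; square brackets are the tool's syntax for list-valued configs that contain commas:

```bash
# Switch an existing topic to compaction.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config cleanup.policy=compact

# Combine both policies so compacted data still ages out via the retention settings.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config 'cleanup.policy=[compact,delete],retention.ms=604800000'
```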
cleanup.policy is configured per topic, with the broker-level log.cleanup.policy setting providing the cluster-wide default. Producers and consumers are largely unaware of the underlying cleanup process, relying on Kafka’s guarantees of message ordering within a partition. Cleanup itself runs locally on each broker: log retention threads handle deletion, and log cleaner threads handle compaction for the partitions that broker hosts.
3. Real-World Use Cases
- Change Data Capture (CDC): Debezium streams database changes to Kafka. Using `compact`, we can maintain a current snapshot of each database record, enabling efficient stateful stream processing (a topic-creation sketch follows this list).
- Event Sourcing: Storing all events for an entity in a Kafka topic. `compact` ensures we always have the latest state, while `delete` with sufficient retention provides an audit trail.
- Log Aggregation: Aggregating logs from microservices. `delete` with a time-based retention policy is suitable for short-term analysis, while longer retention might be needed for compliance.
- Session State Management: Tracking user sessions. `compact` allows quick retrieval of the latest session data based on the user ID as the key.
- Multi-Datacenter Replication (MirrorMaker 2): Maintaining consistent data across regions. `cleanup.policy` must be carefully aligned to prevent data divergence or unnecessary replication of stale data.
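For the CDC case, a compacted snapshot topic might be created as below. The topic name and settings are illustrative; `delete.retention.ms` controls how long tombstones (deletes) remain visible after compaction.

```bash
# Illustrative CDC snapshot topic: compaction keeps the newest row image per primary key.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic cdc.inventory.customers --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.3 \
  --config delete.retention.ms=86400000 \
  --config segment.ms=3600000
```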
4. Architecture & Internal Mechanics
graph LR
A[Producer] --> B(Kafka Broker 1);
A --> C(Kafka Broker 2);
B --> D{Topic Partition};
C --> D;
D --> E[Log Segments];
E --> F{Index};
G[Consumer] --> D;
H[Controller] -->|Cleanup Request| B;
H -->|Cleanup Request| C;
subgraph Kafka Cluster
B; C; D; E; F; H;
end
style D fill:#f9f,stroke:#333,stroke-width:2px
Kafka stores messages in an append-only log, divided into segments. The cleanup.policy operates on these segments. When delete is used, retention threads on each broker periodically check segment age and size against the configured retention parameters and remove whole segments that exceed them. When compact is used, log cleaner threads on each broker scan the log segments, identifying duplicate keys and retaining only the most recent record for each key. Each replica cleans its own copy of the log; followers stay consistent with the leader through normal replication, and the active (newest) segment is never cleaned.
With KRaft mode, cluster metadata is managed by the Raft-based controller quorum, improving scalability and fault tolerance; log cleanup itself still runs on each broker. Schema Registry integration helps with compact topics, since stable, well-defined key schemas keep the “latest value per key” semantics predictable. MirrorMaker 2 can replicate topic configurations, including cleanup.policy, helping keep behavior consistent across clusters.
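To make the segment/index structure above concrete, a partition directory on a broker’s data volume looks roughly like this; the data path is an assumption and varies by deployment:

```bash
# Each segment is a .log file plus offset and time indexes; the active segment is never cleaned.
ls /var/lib/kafka/data/my-topic-0
# 00000000000000000000.log  00000000000000000000.index  00000000000000000000.timeindex
# 00000000000052450871.log  00000000000052450871.index  00000000000052450871.timeindex
```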
5. Configuration & Deployment Details
server.properties (Broker Configuration - relevant for compaction):
# Compaction runs via the broker's log cleaner threads; deletion checks run on the retention interval.
log.cleaner.enable=true
log.cleaner.threads=2
log.retention.check.interval.ms=300000
kafka-topics.sh (Topic Configuration):
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my-topic --partitions 3 --replication-factor 2 --config cleanup.policy=compact --config retention.ms=604800000 # 7 days; only enforced if the policy also includes delete
consumer.properties (Consumer Configuration):
isolation.level=read_committed # Skip records from aborted transactions when producers write transactionally
producer.properties (Producer Configuration):
acks=all # Ensure data durability
enable.idempotence=true # Prevent duplicate messages
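After creating or altering the topic, it’s worth verifying which overrides actually took effect (same illustrative topic name as above):

```bash
# Show the topic-level overrides; --all (on newer CLI versions) also lists inherited broker defaults.
kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name my-topic
```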
6. Failure Modes & Recovery
- Broker Failure during Compaction: Compaction is idempotent. Upon broker recovery, compaction resumes from where it left off.
- ISR Shrinkage: If the number of in-sync replicas falls below the minimum, compaction may be paused to prevent data inconsistency.
- Message Loss: Incorrect `cleanup.policy` configuration can lead to unintended data loss. Always thoroughly test changes in a staging environment.
- Consumer Lag: Aggressive compaction can lead to consumers falling behind if they cannot keep up with the rate of updates.
Recovery strategies include: idempotent producers, transactional guarantees (using Kafka Transactions), proper offset tracking, and Dead Letter Queues (DLQs) for handling failed messages.
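One hedged recovery sketch: if a consumer group must reprocess a topic after a bad deploy, offsets can be rewound with the standard tooling (the group name `risk-engine` and topic are illustrative; the group must be inactive when the reset is applied):

```bash
# Dry run first: shows what the new offsets would be without committing them.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group risk-engine --topic my-topic \
  --reset-offsets --to-earliest --dry-run

# Apply the reset once the plan looks right.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group risk-engine --topic my-topic \
  --reset-offsets --to-earliest --execute
```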
7. Performance Tuning
- Throughput: Compaction can impact write throughput, especially with high update rates. Monitor broker CPU and disk I/O.
- Latency: Compaction runs in the background, but its disk and page-cache pressure can compete with request handling and increase read/write latency on busy brokers.
- Tail Log Pressure: Frequent updates to the same key can create tail log pressure, impacting performance.
Tuning configurations:
- `linger.ms`: Increase to batch more messages, reducing the number of requests.
- `batch.size`: Increase to send larger batches, improving throughput.
- `compression.type`: Use compression (e.g., `gzip`, `snappy`, `lz4`) to reduce storage costs and network bandwidth.
- `fetch.min.bytes`: Increase to reduce the number of fetch requests.
- `replica.fetch.max.bytes`: Increase to allow replicas to fetch larger batches.
Benchmark: A well-tuned Kafka cluster with compaction can achieve sustained throughput of >100 MB/s, depending on hardware and network conditions.
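A quick way to sanity-check these settings is the bundled perf tool; the record counts and property values below are placeholders, not a recommendation:

```bash
# Push 1M 1 KiB records through the tuned producer path and report throughput/latency.
kafka-producer-perf-test.sh --topic my-topic \
  --num-records 1000000 --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 \
    acks=all linger.ms=10 batch.size=65536 compression.type=lz4
```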
8. Observability & Monitoring
- Prometheus & JMX: Monitor Kafka JMX metrics using Prometheus.
- Grafana Dashboards: Create Grafana dashboards to visualize key metrics.
- Critical Metrics:
  - `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec`: Monitor message ingestion rate.
  - `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec`: Monitor data ingestion rate.
  - `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`: Monitor replication status.
  - Consumer lag: Monitor per consumer group, e.g. with `kafka-consumer-groups.sh --describe` or the consumer’s `records-lag-max` JMX metric.
- Alerting conditions: Alert on high consumer lag, low ISR count, or high broker CPU utilization (a CLI spot-check for replication health follows below).
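Alongside JMX scraping, the CLI can spot-check replication health during cleanup-policy changes (broker address is illustrative):

```bash
# Lists only partitions whose ISR is smaller than the replica set; empty output is the healthy case.
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
```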
9. Security and Access Control
cleanup.policy itself doesn't directly introduce security vulnerabilities, but improper configuration can expose data. Use:
- SASL/SSL: Encrypt communication between clients and brokers.
- SCRAM: Use SCRAM authentication for secure access.
- ACLs: Control access to topics and operations (see the sketch after this list).
- Kerberos: Integrate with Kerberos for strong authentication.
- Audit Logging: Enable audit logging to track access and modifications.
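A minimal ACL sketch, assuming SASL principals named `User:trade-ingest` and `User:risk-engine` (both illustrative):

```bash
# Allow the producer principal to write and the consumer principal to read the topic.
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:trade-ingest --operation Write --topic my-topic

kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:risk-engine --operation Read \
  --topic my-topic --group risk-engine
```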
10. Testing & CI/CD Integration
- Testcontainers: Use Testcontainers to spin up ephemeral Kafka clusters for integration testing.
- Embedded Kafka: Use embedded Kafka for unit testing.
- Consumer Mock Frameworks: Mock consumers to verify producer behavior.
- CI Pipeline:
- Schema compatibility checks.
- Throughput tests.
- End-to-end integration tests, run against a disposable broker (sketched below).
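For the integration stage, a throwaway single-node broker in Docker is often enough; the image and tag below are assumptions and should match the Kafka version you target:

```bash
# Start a disposable KRaft broker for integration tests, then tear it down after the suite.
docker run -d --name kafka-it -p 9092:9092 apache/kafka:3.7.0
# ... run the test suite against localhost:9092 ...
docker rm -f kafka-it
```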
11. Common Pitfalls & Misconceptions
- Incorrect Key Schema: Compaction relies on a well-defined, consistently applied record key. Missing or inconsistent keys lead to unexpected behavior.
- Insufficient Retention: Deleting data too quickly can lead to data loss.
- Overly Aggressive Compaction: Compaction can impact performance if it cannot keep up with the update rate.
- Ignoring Isolation Level: Using `read_uncommitted` while producers write transactionally can surface records from aborted transactions.
- Lack of Monitoring: Failing to monitor compaction metrics can lead to undetected performance issues.
Example: kafka-consumer-groups.sh showing consumer lag increasing after a cleanup.policy change.
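A hedged sketch of that check (group name is illustrative); the LAG column is the gap between the partition’s log end offset and the group’s committed offset:

```bash
# Watch the LAG column; a steadily growing value after the policy change is the warning sign.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group risk-engine
```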
12. Enterprise Patterns & Best Practices
- Shared vs. Dedicated Topics: Use dedicated topics for different applications to isolate data and simplify management.
- Multi-Tenant Cluster Design: Implement resource quotas and access control to isolate tenants (a quota sketch follows this list).
- Retention vs. Compaction: Choose the appropriate policy based on data requirements.
- Schema Evolution: Use a Schema Registry to manage schema changes and ensure compatibility.
- Streaming Microservice Boundaries: Design microservices to consume and produce data to well-defined Kafka topics.
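For the multi-tenant point above, per-client quotas can be applied with the same config tooling (the client id and limits are illustrative):

```bash
# Cap an individual tenant's produce and fetch bandwidth at roughly 10 MB/s each.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type clients --entity-name tenant-a-app \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=10485760'
```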
13. Conclusion
cleanup.policy is a fundamental configuration for managing Kafka’s log segments, directly impacting storage costs, performance, and data reliability. By understanding its architecture, configuration options, and potential failure modes, you can build robust, scalable, and efficient Kafka-based platforms. Next steps include implementing comprehensive observability, building internal tooling for managing cleanup.policy across a large cluster, and refactoring topic structures to optimize for specific use cases.