Today I learned that when you hear the word 'batch' in the context of Apache Kafka, it can mean one of two things:
A reference to batch-only data processing systems. Batch-only systems process data in a bounded way. That means that there's a start time and an end time. Whether the data is handled in large batches or micro-batches, each batch is processed all at once. That's in contrast to the continuous data streaming that Apache Kafka enables, in which data is processed in event-sized pieces.
Within the data streaming context, there's something called producer batching. It's a bit of a misnomer because it's not really related to the batch-only data processing systems. A Kafka producer, the client that publishes records to the Kafka cluster, groups records bound for the same partition into batches and sends each batch in a single request, which increases throughput. If compression is enabled, it is applied to the batch as a whole, but batching itself is about grouping, not compressing. The records inside a batch are still individual events in a continuous stream, so producer batching doesn't mean the same thing as batch-only data processing.
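To make producer batching a bit more concrete, here is a minimal sketch in Python. It is a toy model, not the real Kafka client: the class `ToyBatchingProducer` and its `sent_batches` list are hypothetical names I made up for illustration. The `batch_size_bytes` parameter plays the role of the real producer setting `batch.size`. The real producer also has a time-based setting, `linger.ms`, which this sketch leaves out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatchingProducer:
    """Toy model of producer batching (hypothetical, not the Kafka client)."""

    # Plays the role of Kafka's `batch.size`: max bytes per batch.
    batch_size_bytes: int = 16384
    # Stands in for network requests: one entry per batch "sent".
    sent_batches: list = field(default_factory=list)
    _pending: list = field(default_factory=list)
    _pending_bytes: int = 0

    def send(self, record: bytes) -> None:
        # If this record would overflow the current batch, ship the batch first.
        if self._pending and self._pending_bytes + len(record) > self.batch_size_bytes:
            self.flush()
        self._pending.append(record)
        self._pending_bytes += len(record)

    def flush(self) -> None:
        # Send whatever is buffered as a single batch (one simulated request).
        if self._pending:
            self.sent_batches.append(list(self._pending))
            self._pending.clear()
            self._pending_bytes = 0


producer = ToyBatchingProducer(batch_size_bytes=32)
for i in range(10):
    producer.send(f"event-{i:02d}".encode())  # each record is 8 bytes
producer.flush()

# Ten individual events went out in three requests.
print([len(batch) for batch in producer.sent_batches])  # prints [4, 4, 2]
```

The thing to notice is that the events never stop being separate events. The batch is only a transport-level grouping to cut down on the number of requests, and the stream itself has no start or end time.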
In conclusion, 'batching' means, in a very general way, 'grouping stuff together'. But 'producer batching' and 'batch-only data processing systems' do not share the term in any significant sense, because they are referring to the completely different functions I described above.