Study Notes 6.1-2: Introduction to Stream Processing

Part 1 (6.1)

1. Overview of Stream Processing

  • Definition: Continuous, real-time processing of data as it arrives.
  • Importance: Enables real-time analytics, event-driven applications, and prompt reactions to data (e.g., fraud detection, live monitoring).

2. Introduction to Kafka

  • Role in Stream Processing:
    • Acts as a high-throughput, distributed message broker.
    • Facilitates the ingestion, buffering, and distribution of streaming data.
  • Core Concepts:
    • Kafka Topics: Logical channels where messages are published.
    • Message Properties: Attributes of each record (key, value, timestamp, headers) that define how data is handled within streams; see the sketch below.
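
As a purely illustrative sketch, here is roughly what one Kafka record carries. The field names mirror Kafka's record format, not any particular client's API, and the values are made up for illustration:

```python
# Illustrative only: the main properties of a single Kafka record.
record = {
    "key": b"user-42",          # optional; drives partition assignment
    "value": b'{"event": "click", "page": "/home"}',  # the payload itself
    "timestamp": 1700000000000, # epoch millis, set by producer or broker
    "headers": [("source", b"web")],  # optional metadata pairs
    "topic": "page-events",     # logical channel the record is published to
    "partition": 0,             # assigned when the record is appended
    "offset": 123,              # position within that partition's log
}
```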

3. Key Kafka Configuration Parameters

  • Partitioning:
    • Divides data into segments (partitions) to enable parallel processing and load balancing.
  • Replication:
    • Ensures fault tolerance by copying data across multiple brokers.
  • Retention Time:
    • Determines how long messages are stored before being purged.
  • Other Settings:
    • Additional broker- and topic-level parameters (e.g., cleanup.policy, min.insync.replicas) that influence performance and reliability; the topic-creation sketch below shows several of these settings together.

4. Kafka Producers and Consumers

  • Kafka Producers:
    • Applications or scripts that send (produce) messages to Kafka topics.
    • Can be implemented programmatically using various languages.
  • Kafka Consumers:
    • Applications that subscribe to topics to receive (consume) and process messages.
    • Support programmatic consumption for integration with downstream systems; see the producer/consumer sketch below.
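
A minimal producer/consumer pair, again assuming the confluent-kafka Python client and a local broker (the notes also mention Java clients; topic and group names here are made up):

```python
from confluent_kafka import Producer, Consumer

# Producer: send one message to a topic, then block until delivered.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("page-events", key="user-42", value='{"event": "click"}')
producer.flush()

# Consumer: subscribe to the topic and poll for messages.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics",          # consumers in a group share partitions
    "auto.offset.reset": "earliest",  # start from the beginning if no offset
})
consumer.subscribe(["page-events"])

while True:
    msg = consumer.poll(1.0)  # wait up to 1 s for the next message
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    print(msg.key(), msg.value())
```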

5. Data Partitioning in Stream Processing

  • Purpose:
    • Enhances scalability by distributing data across partitions.
    • Improves parallel processing, leading to better performance.
  • Example:
    • Partitioning strategies (typically hashing the message key) determine which partition each record lands on and, in turn, which consumer in a group reads it; see the sketch below.
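
A simplified illustration of key-based partitioning. Kafka's default partitioner actually uses murmur2 hashing; Python's zlib.crc32 stands in here only to show the idea deterministically:

```python
import zlib

NUM_PARTITIONS = 3

def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Same key -> same hash -> same partition, so all events for one
    # entity stay ordered within a single partition.
    return zlib.crc32(key) % num_partitions

for user in [b"user-1", b"user-2", b"user-1", b"user-3"]:
    print(user, "-> partition", choose_partition(user))
```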

6. Practical Examples & Language-Specific Implementations

  • Java Examples:
    • Kafka Streams examples, demonstrating how to work with Kafka in a Java environment.
  • Python Examples:
    • Spark Streaming examples using Python for those who prefer Python over Java.
  • Key Takeaway:
    • The choice of language and framework can depend on team expertise and project requirements.

7. Schema and Its Role in Stream Processing

  • What is a Schema?
    • A defined structure for data (e.g., field types, format) that ensures consistency.
  • Importance:
    • Helps in managing data quality and enables smooth integration between systems.
    • Facilitates schema evolution when data structures change over time; the Avro sketch below shows schema enforcement at write time.
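
A small sketch of enforcing a schema at write time with Avro. The fastavro package and the record fields here are assumptions chosen for illustration:

```python
import io
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Ride",
    "fields": [
        {"name": "ride_id", "type": "string"},
        {"name": "distance_km", "type": "double"},
    ],
})

buffer = io.BytesIO()
# Writing a record that matches the schema succeeds...
schemaless_writer(buffer, schema, {"ride_id": "abc", "distance_km": 3.2})
# ...while a record with a missing field or wrong type would raise,
# catching bad data before it ever reaches the topic.
```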

8. Kafka Connect and Related Tools

  • Kafka Connect:
    • A framework for connecting Kafka with external systems (databases, file systems, etc.) without writing custom code.
  • Additional Tools:
    • Brief mention of other tools (e.g., ksqlDB, which lets you query Kafka topics with SQL) that integrate with Kafka, highlighting the ecosystem available for stream processing; a connector-registration sketch follows this list.
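
A sketch of registering a source connector through Kafka Connect's REST API. The worker URL, connector class, database, and table name are all assumptions; Connect then streams rows into Kafka without any custom code:

```python
import requests

connector = {
    "name": "postgres-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/shop",
        "table.whitelist": "orders",
        "topic.prefix": "db-",   # rows land on the topic "db-orders"
        "mode": "incrementing",
        "incrementing.column.name": "id",
    },
}

# POST the config to a Connect worker (assumed at localhost:8083).
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```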

Part 2 (6.2)

1. Introduction to Data Exchange

  • Definition: Data exchange is the process of transferring data from one source (producer) to another (consumer) using various communication channels.
  • Real-World Analogy – Postal Service:
    • Just like writing a letter and sending it through the postal service, data can be written and sent to a designated receiver.
    • This simple analogy emphasizes that data exchange involves a sender, a transport medium, and a receiver.

2. Data Exchange in Modern Computing

  • Computer Communication:
    • In today’s digital world, data exchange is often handled through APIs such as REST, GraphQL, and webhooks.
    • These methods ensure that data flows from one system to another reliably and efficiently.
  • Notice Board Analogy:
    • Producer: Imagine a person posting a flyer on a community notice board.
      • The flyer contains information (data) meant for a specific audience.
    • Consumers: Passersby (or subscribers) who read the flyer can act on it, ignore it, or pass it along.
    • Topic-based Distribution:
      • If a flyer is posted on a board dedicated to a specific subject (e.g., Kafka, Spark, stream processing, Big Data), only those interested in that subject (consumers subscribed to that topic) will take notice. The toy sketch below mirrors this analogy in code.
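
A toy, in-memory version of the notice board: producers post to a named topic, and only consumers subscribed to that topic see the message. This mirrors the analogy only; it is not how Kafka is implemented:

```python
from collections import defaultdict
from typing import Callable

class NoticeBoard:
    def __init__(self):
        # topic name -> list of subscriber callbacks
        self.subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self.subscribers[topic].append(handler)

    def post(self, topic: str, message: str) -> None:
        # Deliver the flyer only to readers of this board.
        for handler in self.subscribers[topic]:
            handler(message)

board = NoticeBoard()
board.subscribe("kafka", lambda m: print("kafka reader saw:", m))
board.subscribe("spark", lambda m: print("spark reader saw:", m))
board.post("kafka", "New Kafka meetup on Friday")  # only the kafka reader reacts
```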

3. Stream Processing Explained

  • Traditional (Batch) Data Exchange:
    • In many systems, data exchange happens in batches—data is collected over a period (minutes, hours, or even days) and then processed.
    • Examples include receiving emails or checking a physical notice board when passing by.
  • Stream Processing:
    • Real-Time Data Exchange:
      • In stream processing, data is exchanged almost immediately after it is produced.
      • A producer sends a message to a topic (e.g., a Kafka topic), and that message is instantly available to any consumer subscribed to that topic.
    • Key Benefit:
      • The reduced delay compared to batch processing means data is processed in near-real time, enabling faster decision making.

4. Understanding "Real-Time" in Stream Processing

  • Not Instantaneous:
    • Real-time processing does not mean zero latency or truly instantaneous delivery.
    • There is typically a few seconds of delay, which is significantly less than the delays common in batch processing.
  • Comparison to Batch Processing:
    • Batch Processing:
      • Data is collected and processed at fixed intervals: every minute, every hour, or even less frequently.
    • Stream Processing:
      • Data flows continuously, allowing for almost immediate processing.

5. Practical Examples with Kafka and Spark

  • Kafka Topics:
    • A producer writes data to a Kafka topic.
    • Consumers subscribed to that topic receive the data in near-real time.
  • Spark Structured Streaming:
    • Spark does not have topics of its own; in these examples, Spark consumes data from Kafka topics and processes the stream in near-real time.
  • Programming Aspect:
    • Both Kafka and Spark provide APIs and libraries for programmatically producing and consuming data, making it easier to build real-time applications; see the sketch below.
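
A sketch of consuming a Kafka topic from Spark with Structured Streaming. It assumes pyspark with the spark-sql-kafka connector package on the classpath, a broker at localhost:9092, and the illustrative topic name from earlier:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-stream-demo")
    .getOrCreate()
)

# Each Kafka record arrives as a row with binary key/value columns.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page-events")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

# Print each micro-batch to the console in near-real time.
query = stream.writeStream.format("console").start()
query.awaitTermination()
```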
