Pizofreude

Posted on Mar 18

Study Notes 6.15: Kafka & Flink Streaming

#dataengineering #dezoomcamp #kafka #pyflink

1. Introduction to Kafka Streaming with PyFlink

Streaming Data Processing:
- Involves the continuous ingestion, processing, and movement of data in real time.
- Critical for use cases where immediate reaction is needed (e.g., fraud detection, real-time analytics, surge pricing).
Key Technologies Covered:
- Kafka (or Kafka-compatible systems like Red Panda):
  - Acts as a high-throughput, low-latency messaging system.
  - Uses topics (like tables) to which data producers send events.
- Apache Flink:
  - Provides a distributed processing framework for both stream and batch (micro-batch) workloads.
  - Excels in stateful processing, windowing, and fault tolerance.
- PostgreSQL:
  - Serves as the sink (destination) to store processed data, enabling query and analysis.

2. Architecture and Environment Setup

Containerization with Docker Compose:
- Multiple containers (or “machines”) are spun up for:
  - Red Panda: Simulating Kafka for local development.
  - Flink Components:
    - Job Manager: Orchestrates job submission and scheduling.
    - Task Manager: Executes parallel tasks.
  - PostgreSQL: Acts as the landing zone for processed events.
- Local Setup Tips:
  - Use database tools like DataGrip, DBeaver, or PGAdmin to connect to PostgreSQL.
  - Verify each container’s status via the Docker CLI and dashboards (e.g., Apache Flink’s dashboard on localhost:8081).

3. Kafka Producers, Topics, and Data Serialization

Kafka Producer Role:
- A Python-based producer script sends test data into Kafka (simulated by Red Panda).
- Data is serialized into JSON for interoperability among different languages and systems.
- Alternatives:
  - Formats like Thrift or Protobuf can be used to reduce message size and improve efficiency.
Kafka Topics:
- Analogous to tables in a relational database.
- Producers write data to a topic, and consumers (like Flink jobs) subscribe to these topics.

4. Flink’s Role and Modes of Operation

Flink as a Stream Processor:
- Reads data from Kafka and writes processed results to PostgreSQL.
- Supports both continuous streaming (staying active to process new events) and batch-like (micro-batch) modes.
Job Lifecycle and Checkpointing:
- Checkpointing:
  - Periodically snapshots the job’s state (e.g., every 10 seconds).
  - Enables job recovery after failures without reprocessing all data from the beginning.
  - Must be carefully configured to balance resilience with processing overhead.
- Offset Management:
  - Earliest Offset: Reads all available data from Kafka.
  - Latest Offset: Reads only data that arrives after the job starts.
  - Custom Timestamp: Allows restarting processing from a specific point in time.

5. Windowing and Watermarking in Flink

Why Windowing?
- Helps group events into finite chunks for aggregation (e.g., counts, sums).
- Especially important when dealing with unbounded (continuous) data streams.
Types of Windows:
- Tumbling Windows:
  - Non-overlapping, fixed-size windows (e.g., one-minute windows).
  - Best for batch-like processing where events are grouped by time intervals.
- Sliding Windows:
  - Overlapping windows that slide over time.
  - Useful for capturing trends with more granularity (e.g., every 30 seconds, even though the window length is one minute).
- Session Windows:
  - Group events based on periods of activity separated by gaps (determined by a “session gap”).
  - Ideal for modeling user sessions or bursts of activity.
Watermarking:
- A mechanism to tolerate out-of-order events by setting a delay (e.g., 15 seconds).
- Allows late-arriving events to be incorporated into the correct window.
- Can be paired with “allowed lateness” settings to update results if events arrive beyond the typical delay.
- Side Outputs:
  - An option to divert extremely late data to a separate stream for later processing.

6. Fault Tolerance and Checkpointing Strategies

Resilience in Streaming Jobs:
- Checkpointing not only captures Kafka offsets but also the state of active windows (even if a job fails in the middle of a window).
- On job restart, Flink can resume processing from the last checkpoint to avoid duplicate work.
- Potential Pitfalls:
  - Restarting a job from scratch (using the “earliest” offset) can lead to duplicate data.
  - Best practice is to redeploy by restoring from the checkpoint rather than starting a new job entirely.

7. When to Use Streaming Versus Batch Processing

Streaming Use Cases:
- Real-time fraud detection.
- Dynamic pricing models (e.g., Uber surge pricing).
- Systems that require near-immediate responses (e.g., alert systems).
Batch (Micro-Batch) Use Cases:
- When a slight delay is acceptable (e.g., hourly data aggregation).
- Scenarios where processing overhead must be minimized.
- Many analytical workloads that do not require instantaneous reaction.
Key Consideration:
- The decision to use streaming over batch processing depends on the business need for real-time insights versus the complexity and maintenance overhead associated with streaming systems.

8. Best Practices and Additional Tips

Connector Libraries:
- Use Flink’s built-in connectors (e.g., Kafka and JDBC connectors) to simplify data ingestion and output.
Schema Management:
- Since Kafka topics do not enforce schemas, extra care is needed to manage data consistency (e.g., using schema registries or defining conventions for producers and consumers).
Scaling and Parallelism:
- Flink’s parallelism is determined by the keys used in processing (e.g., grouping by a particular column).
- Properly keying your streams can help balance workload across available Task Managers.
Managing Complexity:
- Recognize that streaming pipelines have more moving parts than batch jobs (offsets, state management, watermarking).
- It’s important for teams to understand the additional operational complexities and invest in monitoring, alerting, and clear documentation.

9. Spark Streaming vs. Flink Streaming

Spark Streaming:
- Operates on the micro-batch principle (processing data in small, time-based batches).
- Can introduce a slight delay due to batch intervals (e.g., 15–30 seconds).
Flink Streaming:
- Implements true continuous processing (push architecture), processing events as they arrive.
- Generally offers lower latency and more granular control over windowing and state management.
Choosing Between Them:
- For real-time, low-latency applications where every millisecond counts, Flink’s continuous processing is often preferred.
- For use cases where micro-batch latency is acceptable, Spark Streaming might be simpler to implement and maintain.

10. Q&A and Practical Insights

Job Recovery and Duplicate Handling:
- It is essential to correctly configure checkpointing and offset management to prevent duplicate records when a job is restarted.
- Some production environments handle duplicate records by using “upsert” semantics in the sink (e.g., PostgreSQL’s “ON CONFLICT UPDATE”).
Skill Set and Team Organization:
- Streaming data engineering requires specialized skills due to the operational and development complexities involved.
- In some organizations, roles are split between batch data engineers and streaming (or “real-time”) engineers to ensure expertise in each area.
Real-World Examples:
- Netflix Fraud Detection:
  - Streaming is used to identify anomalies and immediately trigger security measures.
- Uber Surge Pricing:
  - Real-time data is crucial to dynamically adjust pricing based on supply and demand fluctuations.

Supplementary Information

Additional Resources:
- Apache Flink Documentation – Detailed guides on state management, windowing, and fault tolerance.
- Kafka Documentation – In-depth information on Kafka’s architecture, producers, consumers, and best practices.
Best Practice Tips:
- Regularly monitor checkpoint intervals and state size to avoid performance bottlenecks.
- Test different watermark strategies to balance latency and completeness of results.
- Consider using a schema registry when working with evolving data schemas in Kafka topics.

Your AI Code Assistant

Implement features, document your code, or refactor your projects.
Built to handle large projects, Amazon Q Developer works alongside you from idea to production code.

Get started free in your IDE

DEV Community