<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Okibaba</title>
    <description>The latest articles on DEV Community by Okibaba (@okibaba).</description>
    <link>https://dev.to/okibaba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F563844%2Fe1a15d8b-dd6e-4661-a51e-e02f7f1ab4e6.jpeg</url>
      <title>DEV Community: Okibaba</title>
      <link>https://dev.to/okibaba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/okibaba"/>
    <language>en</language>
    <item>
      <title>zoomcamp data engineering project - part 5</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Fri, 19 Apr 2024 07:01:05 +0000</pubDate>
      <link>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-5-4m87</link>
      <guid>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-5-4m87</guid>
      <description>&lt;h1&gt;
  
  
  editing
&lt;/h1&gt;

</description>
    </item>
    <item>
      <title>zoomcamp data engineering project - part 4</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Fri, 19 Apr 2024 07:00:50 +0000</pubDate>
      <link>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-4-3h7h</link>
      <guid>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-4-3h7h</guid>
      <description>&lt;h1&gt;
  
  
  editing
&lt;/h1&gt;

</description>
    </item>
    <item>
      <title>zoomcamp data engineering project - part 3</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Fri, 19 Apr 2024 06:20:49 +0000</pubDate>
      <link>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-3-307c</link>
      <guid>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-3-307c</guid>
      <description>&lt;h1&gt;
  
  
  editing
&lt;/h1&gt;

</description>
    </item>
    <item>
      <title>zoomcamp data engineering project - part 2</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Fri, 19 Apr 2024 06:19:14 +0000</pubDate>
      <link>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-2-g24</link>
      <guid>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-2-g24</guid>
      <description>&lt;h1&gt;
  
  
  in edit
&lt;/h1&gt;

</description>
    </item>
    <item>
      <title>zoomcamp data engineering project - part 1</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Fri, 19 Apr 2024 06:18:35 +0000</pubDate>
      <link>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-1-17oi</link>
      <guid>https://dev.to/okibaba/zoomcamp-data-engineering-project-part-1-17oi</guid>
      <description>&lt;h1&gt;
  
  
  in edit
&lt;/h1&gt;

</description>
    </item>
    <item>
      <title>Data Engineering Zoomcamp Week 6 readings - pub/sub Vs Message queue</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Tue, 09 Apr 2024 06:58:26 +0000</pubDate>
      <link>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-readings-pubsub-vs-message-queue-2l58</link>
      <guid>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-readings-pubsub-vs-message-queue-2l58</guid>
      <description>&lt;p&gt;Pub/Sub and Message Queue are both messaging patterns that enable asynchronous communication, loose coupling, and scalability in distributed systems, but differ in their communication style, message consumption, and use cases.&lt;/p&gt;

&lt;p&gt;Here's a comparison between Pub/Sub and Message Queue:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Communication Pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pub/Sub: In the Pub/Sub pattern, publishers publish messages to a topic or channel, and subscribers who have subscribed to that topic receive the messages. Publishers and subscribers are decoupled and may not have knowledge of each other.&lt;/li&gt;
&lt;li&gt;Message Queue: In the Message Queue pattern, producers send messages to a queue, and consumers retrieve messages from the queue. Consumers typically process messages in a first-in, first-out (FIFO) order.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Message Consumption:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pub/Sub: In Pub/Sub, multiple subscribers can receive the same message published to a topic. Each subscriber receives its own copy of the message and processes it independently.&lt;/li&gt;
&lt;li&gt;Message Queue: In a Message Queue, each message is typically consumed by a single consumer. Once a consumer retrieves a message from the queue, it is removed from the queue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Decoupling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pub/Sub: Pub/Sub provides a higher level of decoupling between publishers and subscribers. Publishers and subscribers do not need to be aware of each other's existence or communicate directly.&lt;/li&gt;
&lt;li&gt;Message Queue: Message Queues also provide decoupling between producers and consumers, but the decoupling is typically less than in Pub/Sub. Consumers are often aware of the specific queue they are consuming from.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Scalability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pub/Sub: Pub/Sub allows for easy scalability as new subscribers can be added dynamically to a topic without impacting the publishers or existing subscribers.&lt;/li&gt;
&lt;li&gt;Message Queue: Message Queues can also scale by adding more consumers to process messages from the queue. However, the scalability may be limited by the order of message processing and the need for coordination among consumers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Persistence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pub/Sub: In Pub/Sub, messages are typically not persisted by default. If a subscriber is offline or disconnected, it may miss messages published during that time.&lt;/li&gt;
&lt;li&gt;Message Queue: Message Queues often provide message persistence, ensuring that messages are stored until they are successfully processed by a consumer. This allows for reliable message delivery even if consumers are temporarily unavailable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use Cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pub/Sub: Pub/Sub is commonly used in scenarios where real-time data dissemination is required, such as real-time updates, event-driven architectures, or broadcasting messages to multiple subscribers.&lt;/li&gt;
&lt;li&gt;Message Queue: Message Queues are suitable for scenarios where reliable message delivery and processing are important.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
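
&lt;p&gt;The fan-out versus single-consumer difference (points 1 and 2 above) can be sketched with two tiny in-memory classes. This is an illustrative toy, not a real broker:&lt;/p&gt;

```python
from collections import defaultdict, deque

class PubSub:
    """Every subscriber of a topic gets its own copy of each message."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # fan-out: every handler subscribed to the topic is invoked
        for handler in self.subscribers[topic]:
            handler(message)

class MessageQueue:
    """Each message is consumed by exactly one consumer, in FIFO order."""
    def __init__(self):
        self.queue = deque()

    def send(self, message):
        self.queue.append(message)

    def receive(self):
        # the message is removed once read, so only one consumer sees it
        return self.queue.popleft() if self.queue else None
```

&lt;p&gt;With PubSub, publishing one message invokes every subscriber's handler independently; with MessageQueue, each receive() removes the message from the queue.&lt;/p&gt;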

</description>
    </item>
    <item>
      <title>Data Engineering Zoomcamp Week 6 - considerations for ingesting streaming data</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Tue, 09 Apr 2024 06:54:43 +0000</pubDate>
      <link>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-considerations-for-ingesting-streaming-data-2dm0</link>
      <guid>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-considerations-for-ingesting-streaming-data-2dm0</guid>
      <description>&lt;p&gt;I went further and did some reading on messaging and streaming data. My main reading source was the book Fundamentals of Data Engineering by Joe Reis &amp;amp; Matt Housley.&lt;/p&gt;

&lt;p&gt;According to the authors, the key things to consider when ingesting event-driven data such as streaming data are:&lt;br&gt;
-Schema evolution&lt;br&gt;
-Late-arriving data&lt;br&gt;
-Ordering &amp;amp; multiple delivery&lt;br&gt;
-Time to live&lt;br&gt;
-Message size limitations&lt;br&gt;
-Location &amp;amp; redundancy goals&lt;/p&gt;
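
&lt;p&gt;To make the late-arrival point concrete, here is a minimal Python sketch of watermark-style handling. The five-minute lateness window and the arrival_time field name are assumptions for illustration, not anything prescribed by the book:&lt;/p&gt;

```python
from datetime import timedelta

# Illustrative sketch only; the 5-minute grace period and the
# "arrival_time" field name are assumptions for this example.
ALLOWED_LATENESS = timedelta(minutes=5)

def partition_events(events, window_end):
    """Split events into on-time vs too-late relative to a window's end."""
    watermark = window_end + ALLOWED_LATENESS
    on_time, too_late = [], []
    for ev in events:
        if ev["arrival_time"] > watermark:
            # arrived after the grace period: route to a backfill/dead-letter path
            too_late.append(ev)
        else:
            on_time.append(ev)
    return on_time, too_late
```

&lt;p&gt;Events arriving within the grace period are processed normally; anything later is set aside for separate handling rather than silently dropped.&lt;/p&gt;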

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br&gt;
Fundamentals of Data Engineering by Joe Reis &amp;amp; Matt Housley.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Engineering Zoomcamp Week 6 - using redpanda 1</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Tue, 09 Apr 2024 06:43:52 +0000</pubDate>
      <link>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-using-redpanda-1-3jbe</link>
      <guid>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-using-redpanda-1-3jbe</guid>
      <description>&lt;p&gt;This week we explored using Redpanda as a drop-in replacement for Kafka.&lt;br&gt;
Redpanda can be run from a Docker container.&lt;/p&gt;

&lt;p&gt;Steps:&lt;br&gt;
Copy the docker-compose.yml file below into your working directory (this is a replica of what we used for the course).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.7'
services:
  # Redpanda cluster
  redpanda-1:
    image: docker.redpanda.com/vectorized/redpanda:v22.3.5
    container_name: redpanda-1
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda-1:33145
    ports:
      # - 8081:8081
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and voila!&lt;/p&gt;
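
&lt;p&gt;Once the container is up, you can sanity-check the broker with rpk, Redpanda's bundled CLI (a quick check, assuming the container name redpanda-1 from the compose file above):&lt;/p&gt;

```shell
# query cluster metadata from inside the running container
docker exec -it redpanda-1 rpk cluster info

# create a test topic and list topics to confirm the broker is working
docker exec -it redpanda-1 rpk topic create test-topic
docker exec -it redpanda-1 rpk topic list
```

&lt;p&gt;These commands require the Docker container from the compose file to be running.&lt;/p&gt;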

&lt;p&gt;References:&lt;br&gt;
Data engineering zoomcamp week 6 course and homework notes:&lt;br&gt;
&lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming"&gt;https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Engineering Zoomcamp Week 6 - Streaming Vs Batch Pipelines</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Mon, 08 Apr 2024 06:36:06 +0000</pubDate>
      <link>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-streaming-vs-batch-pipelines-5aag</link>
      <guid>https://dev.to/okibaba/data-engineering-zoomcamp-week-6-streaming-vs-batch-pipelines-5aag</guid>
      <description>&lt;p&gt;Based on class notes and some readings here and there, here's a summary of the differences between streaming and batch data engineering pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcq3rjsnp4bphz2mzrtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcq3rjsnp4bphz2mzrtb.png" alt="Image description" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main"&gt;https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main&lt;/a&gt;&lt;br&gt;
Fundamentals of data engineering by Joe Reis &amp;amp; Matt Housley&lt;br&gt;
Designing Machine Learning Systems by Chip Huyen&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Engineering Zoomcamp Week 6 - Streaming using kafka</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Mon, 08 Apr 2024 02:08:15 +0000</pubDate>
      <link>https://dev.to/okibaba/zoomcamp-week-6-streaming-using-kafka-3f3b</link>
      <guid>https://dev.to/okibaba/zoomcamp-week-6-streaming-using-kafka-3f3b</guid>
      <description>&lt;p&gt;These past couple of weeks I spent some time learning about Kafka in week 6 of my data engineering zoomcamp.&lt;/p&gt;

&lt;p&gt;Apache Kafka is a distributed streaming platform that has gained immense popularity in recent years due to its ability to handle large-scale, real-time data feeds. It provides a reliable and scalable solution for building streaming data pipelines and applications.&lt;/p&gt;

&lt;p&gt;Grokking Kafka requires getting familiar with some of its key architectural abstractions.&lt;/p&gt;

&lt;p&gt;Kafka Architecture:&lt;br&gt;
Kafka follows a publish-subscribe (pub/sub) model, where producers send messages to topics and consumers read messages from those topics. The architecture consists of the following main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Producers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producers are responsible for publishing messages to Kafka topics.&lt;/li&gt;
&lt;li&gt;They can choose to send messages to specific partitions within a topic.&lt;/li&gt;
&lt;li&gt;Producers have the ability to control the partition assignment using keys.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consumers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumers are the subscribers who read messages from Kafka topics.&lt;/li&gt;
&lt;li&gt;They are organized into consumer groups, identified by a unique consumer group ID.&lt;/li&gt;
&lt;li&gt;Each consumer within a group reads from a specific partition of a topic.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topics are the fundamental unit of organization in Kafka.&lt;/li&gt;
&lt;li&gt;They are used to categorize and store streams of records.&lt;/li&gt;
&lt;li&gt;Topics are partitioned, allowing multiple consumers to read from different partitions simultaneously.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Partitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topics are divided into partitions, which are the smallest storage units in Kafka.&lt;/li&gt;
&lt;li&gt;Each partition is an ordered, immutable sequence of records.&lt;/li&gt;
&lt;li&gt;Partitions enable parallel processing and horizontal scalability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka runs as a cluster of one or more servers called brokers.&lt;/li&gt;
&lt;li&gt;The cluster is responsible for storing and managing the topics and their partitions.&lt;/li&gt;
&lt;li&gt;Kafka ensures fault tolerance and high availability through replication.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
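
&lt;p&gt;Key-based partition assignment (mentioned under Producers above) can be illustrated with a toy partitioner. Kafka's real default partitioner uses murmur2 hashing; this sketch just shows the principle that the same key always lands on the same partition:&lt;/p&gt;

```python
# Toy sketch of key-based partition assignment. Kafka's actual default
# partitioner uses murmur2 hashing, but the principle is identical:
# hash the key, take it modulo the partition count.
NUM_PARTITIONS = 3

def assign_partition(key: str) -> int:
    # Same key always maps to the same partition, preserving per-key ordering
    return sum(key.encode()) % NUM_PARTITIONS
```

&lt;p&gt;Preserving per-key ordering is why producers often key messages by an entity id, e.g. a user id, so all events for that entity land on one partition.&lt;/p&gt;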

&lt;p&gt;Kafka Configuration:&lt;br&gt;
Kafka provides various configuration options to control its behavior and performance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Replication Factor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The replication factor determines the number of copies of each partition across the Kafka cluster.&lt;/li&gt;
&lt;li&gt;It ensures fault tolerance and data durability.&lt;/li&gt;
&lt;li&gt;A higher replication factor provides better reliability but increases storage overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Retention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retention refers to how long Kafka retains messages within a topic.&lt;/li&gt;
&lt;li&gt;It can be configured based on time (e.g., retaining messages for a specific number of days) or size (e.g., retaining a certain amount of data).&lt;/li&gt;
&lt;li&gt;Retention policies help manage storage space and comply with data retention requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Offsets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offsets represent the position of a consumer within a partition.&lt;/li&gt;
&lt;li&gt;Consumers keep track of the offsets to know which messages they have already processed.&lt;/li&gt;
&lt;li&gt;Kafka provides different offset management strategies, such as automatic offset commits or manual offset control.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Auto Offset Reset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The auto offset reset configuration determines the behavior when a consumer starts reading from a topic without a committed offset.&lt;/li&gt;
&lt;li&gt;It can be set to "earliest" (start from the beginning) or "latest" (start from the most recent message).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Acknowledgment (ACK):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Acknowledgment settings control the reliability of message delivery.&lt;/li&gt;
&lt;li&gt;Producers can wait for acknowledgments from the Kafka brokers to ensure that messages are persisted.&lt;/li&gt;
&lt;li&gt;The "acks" configuration allows trade-offs between latency and durability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
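
&lt;p&gt;These settings map directly onto client configuration. Here is a minimal sketch using librdkafka-style keys (as used by the confluent-kafka client); the broker address and group id are placeholders, not values from the course:&lt;/p&gt;

```python
# Hypothetical configuration sketch; broker address and group id are
# placeholders. Keys follow the librdkafka / confluent-kafka convention.

producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "acks": "all",  # wait for all in-sync replicas: durable but higher latency
}

consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-consumer-group",   # consumers sharing this id split partitions
    "auto.offset.reset": "earliest",   # no committed offset: start from the beginning
    "enable.auto.commit": False,       # commit offsets manually after processing
}
```

&lt;p&gt;Setting acks to "all" trades latency for durability, and disabling auto-commit gives the consumer manual control over when an offset counts as processed.&lt;/p&gt;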

&lt;p&gt;Conclusion:&lt;br&gt;
Apache Kafka's distributed architecture, pub/sub model, and configurable options make it a powerful tool for building scalable and fault-tolerant streaming applications. Its ability to process and analyze real-time data streams efficiently explains why it is heavily used in real-time data engineering and machine learning workflows.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Zoom camp rising wave workshop</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Mon, 18 Mar 2024 14:56:22 +0000</pubDate>
      <link>https://dev.to/okibaba/zoom-camp-rising-wave-workshop-1h00</link>
      <guid>https://dev.to/okibaba/zoom-camp-rising-wave-workshop-1h00</guid>
      <description>&lt;p&gt;This month I attended the RisingWave workshop hosted by DataTalks.Club for the 2024 cohort of the zoomcamp data engineering course.&lt;/p&gt;

&lt;p&gt;RisingWave is an open-source, cloud-native streaming database system designed for real-time analytics and event-driven applications. With RisingWave you can handle streaming data using SQL-like queries, making it very powerful for scenarios like real-time monitoring, anomaly detection, event-driven architectures, and continuous data transformation.&lt;/p&gt;

&lt;p&gt;As a quick demo of how SQL-like RisingWave is, let's create a materialized view to compute the average, minimum, and maximum trip time between taxi zones, as well as find the top 10 maximum trip times per route pair, taking trip direction into consideration (i.e. a trip from A to B is different from a trip from B to A). The data we are working with is the openly available New York yellow taxi data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE MATERIALIZED VIEW mv_avg_min_max_trip_time AS 
SELECT 
    pu_zone.Zone AS pickup_zone, 
    do_zone.Zone AS dropoff_zone, 
    AVG(EXTRACT(EPOCH FROM (trip_data.tpep_dropoff_datetime - trip_data.tpep_pickup_datetime)) / 60) AS avg_trip_duration, 
    MIN(EXTRACT(EPOCH FROM (trip_data.tpep_dropoff_datetime - trip_data.tpep_pickup_datetime)) / 60) AS min_trip_duration, 
    MAX(EXTRACT(EPOCH FROM (trip_data.tpep_dropoff_datetime - trip_data.tpep_pickup_datetime)) / 60) AS max_trip_duration 
FROM 
    trip_data 
JOIN 
    taxi_zone pu_zone ON trip_data.PULocationID = pu_zone.location_id 
JOIN 
    taxi_zone do_zone ON trip_data.DOLocationID = do_zone.location_id 
GROUP BY 
    pu_zone.Zone, 
    do_zone.Zone; 

SELECT 
    pickup_zone, 
    dropoff_zone, 
    max_trip_duration
FROM 
    mv_avg_min_max_trip_time 
ORDER BY 
    max_trip_duration DESC 
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;br&gt;
datatalks risingwave workshop link&lt;br&gt;
&lt;a href="https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/workshop.md"&gt;https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/workshop.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data talks data engineering course repo (week 6)&lt;br&gt;
&lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp"&gt;https://github.com/DataTalksClub/data-engineering-zoomcamp&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Zoom camp week 5 - Batch Processing : SPARK</title>
      <dc:creator>Okibaba</dc:creator>
      <pubDate>Mon, 04 Mar 2024 17:34:21 +0000</pubDate>
      <link>https://dev.to/okibaba/zoom-camp-week-5-spark-4n68</link>
      <guid>https://dev.to/okibaba/zoom-camp-week-5-spark-4n68</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo8zk2n26yiuasrq6sfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo8zk2n26yiuasrq6sfz.png" alt="Image description" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineering zoom camp course week 5 was all about batch data processing using Apache Spark.&lt;/p&gt;

&lt;p&gt;Data engineering zoomcamp course link:&lt;br&gt;
&lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp"&gt;https://github.com/DataTalksClub/data-engineering-zoomcamp&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
