<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Paulius</title>
    <description>The latest articles on DEV Community by Paulius (@pdambrauskas).</description>
    <link>https://dev.to/pdambrauskas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F7303%2F2736175e-4a5e-4dd9-9e4c-ed9aaa021e83.jpg</url>
      <title>DEV Community: Paulius</title>
      <link>https://dev.to/pdambrauskas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pdambrauskas"/>
    <language>en</language>
    <item>
      <title>Lightweight HTTP API for Big Data on S3</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Wed, 15 Mar 2023 15:50:29 +0000</pubDate>
      <link>https://dev.to/exacaster/lightweight-http-api-for-big-data-on-s3-3fnb</link>
      <guid>https://dev.to/exacaster/lightweight-http-api-for-big-data-on-s3-3fnb</guid>
      <description>&lt;p&gt;We are happy to announce our third opensource project - &lt;a href="https://github.com/exacaster/delta-fetch"&gt;Delta Fetch&lt;/a&gt;.&lt;br&gt;
Delta Fetch is a configurable HTTP API service for accessing &lt;a href="https://delta.io/"&gt;Delta Lake&lt;/a&gt; tables. Service is highly configurable, with possibility to filter your Delta tables by selected columns.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;Delta Fetch heavily relies on Delta table metadata, which contains statistics about each Parquet file. The same metadata that is used for &lt;a href="https://docs.delta.io/latest/optimizations-oss.html#data-skipping"&gt;data skipping&lt;/a&gt; - in particular, the minimum and maximum values of each column in each file - is used here to read only the relevant files. The Delta table metadata is cached for better performance and can be refreshed by enabling auto cache update or by making API requests with the &lt;code&gt;...?exact=true&lt;/code&gt; query parameter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request handling flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user makes an API request to one of the configured API resources.&lt;/li&gt;
&lt;li&gt;Delta Fetch reads Delta table metadata from file storage and stores it in memory.&lt;/li&gt;
&lt;li&gt;Delta Fetch finds the relevant file paths in the stored metadata and starts reading them.&lt;/li&gt;
&lt;li&gt;Delta Fetch uses the Hadoop Parquet Reader implementation, which supports filter push down to avoid reading the entire file.&lt;/li&gt;
&lt;li&gt;Delta Fetch continues reading Parquet files one by one until the requested or configured limit is reached.&lt;/li&gt;
&lt;/ul&gt;
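
&lt;p&gt;To make the file-skipping step concrete, here is a minimal sketch of pruning files by their per-file min/max column statistics. This is an illustration only, not Delta Fetch's actual code; the field names are assumptions:&lt;/p&gt;

```python
# Illustrative sketch of metadata-based file skipping (not Delta Fetch's code).
# Each entry mimics the per-file column statistics kept in Delta metadata;
# the field names ("min_id", "max_id") are made up for this example.
# Values are same-length strings, so lexicographic comparison matches numeric.
file_stats = [
    {"path": "part-000.parquet", "min_id": "100", "max_id": "499"},
    {"path": "part-001.parquet", "min_id": "500", "max_id": "899"},
]

def relevant_files(stats, value):
    """Keep only files whose [min, max] range can contain the value."""
    # A file is skipped when the requested value lies outside its range.
    return [f["path"] for f in stats
            if not (f["min_id"] > value or value > f["max_id"])]

print(relevant_files(file_stats, "512"))  # ['part-001.parquet']
```

Only the files that survive this pruning step are actually opened and read.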
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Resources can be configured in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/data/{table}/{identifier}&lt;/span&gt;
      &lt;span class="na"&gt;schema-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/schemas/{table}/{identifier}&lt;/span&gt;
      &lt;span class="na"&gt;delta-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3a://bucket/delta/{table}/&lt;/span&gt;
      &lt;span class="na"&gt;response-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SINGLE&lt;/span&gt;
      &lt;span class="na"&gt;filter-variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
          &lt;span class="na"&gt;path-variable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;identifier&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;path&lt;/code&gt; defines the API path used to query your Delta tables. Path variables can be defined using curly braces, as shown in the example.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema-path&lt;/code&gt; (optional) defines the API path for retrieving the Delta table schema.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;delta-path&lt;/code&gt; defines the S3 path of your Delta table. Path variables in this path are filled in with the variables provided in the API path.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response-type&lt;/code&gt; (optional, default: &lt;code&gt;SINGLE&lt;/code&gt;) defines whether to search for multiple resources or a single one. Use the &lt;code&gt;LIST&lt;/code&gt; type for multiple resources.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max-results&lt;/code&gt; (optional, default: &lt;code&gt;100&lt;/code&gt;) is the maximum number of rows that can be returned when &lt;code&gt;response-type&lt;/code&gt; is &lt;code&gt;LIST&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;filter-variables&lt;/code&gt; (optional) defines additional filters applied to the Delta table.&lt;/li&gt;
&lt;/ul&gt;
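
&lt;p&gt;With the example configuration above, a request to &lt;code&gt;/api/data/users/42&lt;/code&gt; would be resolved against &lt;code&gt;s3a://bucket/delta/users/&lt;/code&gt;. A rough sketch of that path-variable substitution (an illustration, not Delta Fetch's actual resolution code):&lt;/p&gt;

```python
# Sketch of how API path variables could fill the delta-path template
# (illustrative only; not Delta Fetch's actual resolution code).
def resolve(path_template, delta_template, request_path):
    """Match request segments against the template and fill delta-path."""
    variables = {}
    for tmpl, actual in zip(path_template.strip("/").split("/"),
                            request_path.strip("/").split("/")):
        # segments wrapped in curly braces are path variables
        if tmpl.startswith("{") and tmpl.endswith("}"):
            variables[tmpl[1:-1]] = actual
    resolved = delta_template
    for name, value in variables.items():
        resolved = resolved.replace("{" + name + "}", value)
    return resolved, variables

print(resolve("/api/data/{table}/{identifier}",
              "s3a://bucket/delta/{table}/",
              "/api/data/users/42"))
# ('s3a://bucket/delta/users/', {'table': 'users', 'identifier': '42'})
```

The leftover variables (here &lt;code&gt;identifier&lt;/code&gt;) are then available to &lt;code&gt;filter-variables&lt;/code&gt; for row filtering.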

&lt;p&gt;You can also configure one of two security mechanisms - Basic Auth or OAuth2 - and some caching parameters for better performance. Refer to the &lt;a href="https://github.com/exacaster/delta-fetch"&gt;Delta Fetch&lt;/a&gt; GitHub repo for more information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;To be able to access the data in Parquet files quickly, you need to configure the block size to a smaller value than you normally would. We got acceptable results by setting &lt;code&gt;parquet.block.size&lt;/code&gt; to &lt;code&gt;1048576&lt;/code&gt; (1 MB).&lt;/p&gt;
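
&lt;p&gt;If your tables are written with Spark, one way to apply this (a sketch; adapt it to however you configure your jobs) is through the &lt;code&gt;spark.hadoop.*&lt;/code&gt; configuration passthrough, e.g. in &lt;code&gt;spark-defaults.conf&lt;/code&gt;:&lt;/p&gt;

```
spark.hadoop.parquet.block.size  1048576
```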

&lt;p&gt;We also highly recommend &lt;strong&gt;not&lt;/strong&gt; using &lt;code&gt;OPTIMIZE ... ZORDER ...&lt;/code&gt; on tables that are exposed through Delta Fetch, since this command usually writes data in roughly 1 GB chunks. Instead, we suggest relying on simple data ordering by the columns you plan to use as "keys" in the Delta Fetch API.&lt;/p&gt;

&lt;p&gt;More recommendations and considerations can be found on our &lt;a href="https://github.com/exacaster/delta-fetch/doces/recommendations.md"&gt;recommendations page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With those recommendations applied, we managed to get a ~1s response time when requesting a single row by a single column value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/872480210503_234678
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:5,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"872480210503_234678"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 1% cpu 0.982 total
&lt;span class="nt"&gt;---&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/579520210231_237911
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:5,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"579520210231_237911"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-06-24"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 0% cpu 1.250 total
&lt;span class="nt"&gt;---&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/875540210000_245810
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:2,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"875540210000_245810"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 1% cpu 0.870 total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We consider this API service experimental and hope to get feedback and contributions from the open-source (and dev.to :)) community. Let us know what you think about our new project.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>bigdata</category>
      <category>opensource</category>
      <category>s3</category>
    </item>
    <item>
      <title>Testing PySpark &amp; Pandas in style</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Thu, 10 Feb 2022 07:49:05 +0000</pubDate>
      <link>https://dev.to/exacaster/testing-pyspark-pandas-in-style-31cg</link>
      <guid>https://dev.to/exacaster/testing-pyspark-pandas-in-style-31cg</guid>
      <description>&lt;p&gt;Today we'd like to share a small utility package for testing Dataframes on PySpark and Pandas.&lt;/p&gt;

&lt;p&gt;If you are a fan of test-driven development and have had a chance to work on PySpark (or Pandas) projects, you've probably written tests similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;your_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_event_aggregation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"even_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"event_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"dt"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;expected_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'page_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2017-12-31"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'item_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;68471513&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2017-12-31"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;schema&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works OK for small applications, but when your project gets bigger, the data gets more complicated, and the number of tests starts to grow, you might want a less tedious way to define test data.&lt;/p&gt;

&lt;p&gt;Exacaster alumnus &lt;a href="https://www.linkedin.com/in/vaidasarmonas"&gt;Vaidas Armonas&lt;/a&gt; came up with the idea of representing Spark DataFrames as markdown tables. This idea materialized into the testing package &lt;a href="https://pypi.org/project/markdown-frames/"&gt;markdown-frames&lt;/a&gt;. With this package, the test shown above can be replaced with this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;markdown_frames.spark_dataframe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;your_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_event_aggregation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""" 
        |  user_id   |  even_type  | item_id  |    event_time       | country  |     dt      |
        |   bigint   |   string    |  bigint  |    timestamp        |  string  |   string    |
        | ---------- | ----------- | -------- | ------------------- | -------- | ----------- |
        |   123456   |  page_view  |   None   | 2017-12-31 23:50:50 |   uk     | 2017-12-31  |
        |   123456   |  item_view  | 68471513 | 2017-12-31 23:50:55 |   uk     | 2017-12-31  |
    """&lt;/span&gt;
    &lt;span class="n"&gt;expected_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It makes tests more readable and self-explanatory.&lt;/p&gt;

&lt;p&gt;Everything looks almost the same when you need to build a DataFrame for Pandas; you just need to use a different function:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from markdown_frames.pandas_dataframe import pandas_df&lt;/code&gt;&lt;/p&gt;
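
&lt;p&gt;The idea behind the package is straightforward: parse the markdown table into column names, types, and data rows. A minimal stdlib-only sketch of that parsing (not the actual markdown-frames implementation):&lt;/p&gt;

```python
def parse_markdown_table(table):
    """Parse a markdown-frames style table into (columns, types, rows)."""
    lines = [l.strip().strip("|") for l in table.strip().splitlines()]
    split = lambda line: [cell.strip() for cell in line.split("|")]
    columns, types = split(lines[0]), split(lines[1])
    # lines[2] is the "----" separator row; the rest are data rows
    rows = [split(line) for line in lines[3:]]
    return columns, types, rows

cols, types, rows = parse_markdown_table("""
    | user_id | country |
    |  bigint | string  |
    | ------- | ------- |
    |  123456 |   uk    |
""")
print(cols, rows)  # ['user_id', 'country'] [['123456', 'uk']]
```

The real package additionally casts each cell to the declared type before building the DataFrame.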

&lt;p&gt;Share in the comments if you know any other convenient tips &amp;amp; tricks for writing PySpark (and Pandas) unit tests.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>pandas</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Monitoring Apartment temperature &amp; humidity with Raspberry Pi, Prometheus &amp; Grafana</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Sat, 29 Feb 2020 19:29:05 +0000</pubDate>
      <link>https://dev.to/pdambrauskas/monitoring-apartment-temperature-humidity-with-raspberry-pi-prometheus-grafana-1i48</link>
      <guid>https://dev.to/pdambrauskas/monitoring-apartment-temperature-humidity-with-raspberry-pi-prometheus-grafana-1i48</guid>
      <description>&lt;p&gt;For quite some time, I had a spare Raspberry Pi lying around in my place. And one weekend I came up with idea to make my apartment "smarter". What I mean by saying "smarter" is tracking some metrics of my surroundings.&lt;/p&gt;

&lt;p&gt;I have some experience working with &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; and &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt;, so I decided to incorporate those tools into my solution. Yes, it does sound like overengineering a simple task, and you could probably get the same results in a much simpler way : ). But it was a fun weekend project for me.&lt;/p&gt;

&lt;p&gt;In this post I'll describe my setup for monitoring room temperature &amp;amp; humidity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware components
&lt;/h2&gt;

&lt;p&gt;These are all the components I used in my project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.raspberrypi.org/products/raspberry-pi-3-model-b/"&gt;Raspberry Pi 3 Model B&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;16 GB microSD card&lt;/li&gt;
&lt;li&gt;&lt;a href="https://components101.com/dht11-temperature-sensor"&gt;DHT11 Temperature And Humidity Sensor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Mobile phone charger, for powering Raspberry Pi&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Connecting Sensor to Raspberry Pi
&lt;/h2&gt;

&lt;p&gt;I connected the sensor's Ground pin to a ground pin on the Raspberry Pi, the Data pin to the GPIO 14 pin, and the Vcc pin to a 3.3V power supply pin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading sensor data
&lt;/h2&gt;

&lt;p&gt;For reading sensor data and feeding it to Prometheus, I chose the &lt;a href="https://github.com/szazo/DHT11_Python"&gt;DHT11_Python&lt;/a&gt; library. It is quite unstable and sometimes does not return valid results, so you might get some gaps in your graphs.&lt;/p&gt;

&lt;p&gt;I've also created a simple Flask API to serve the metrics to Prometheus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;dht11&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;RPi.GPIO&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;GPIO&lt;/span&gt;

&lt;span class="n"&gt;GPIO&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setwarnings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;GPIO&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setmode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GPIO&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BCM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dht11&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DHT11&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/metrics"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;dht11_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;dht11_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;f"""pihome_temperature &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
pihome_humidity &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;humidity&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;f"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dht11_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Content-Type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'text/plain; charset=utf-8'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"__main__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'0.0.0.0'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
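
&lt;p&gt;When scraped, the endpoint returns the two gauges in Prometheus' plain-text exposition format, something like this (the values here are made up):&lt;/p&gt;

```
pihome_temperature 21
pihome_humidity 40
```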



&lt;h2&gt;
  
  
  Prometheus configuration
&lt;/h2&gt;

&lt;p&gt;To scrape metrics from my Flask API, I've added configuration to &lt;code&gt;prometheus.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pihome'&lt;/span&gt;
      &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pihome&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;5000&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Grafana Configuration
&lt;/h2&gt;

&lt;p&gt;Then, in &lt;code&gt;/etc/grafana/provisioning&lt;/code&gt;, I've added datasource configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prometheus&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus:9090/&lt;/span&gt;
    &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
    &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It is also possible to add Grafana dashboards to the provisioning folder as JSON files, so that you don't need to create a new dashboard each time you re-deploy Grafana.&lt;/p&gt;
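
&lt;p&gt;For example, a minimal dashboard provider file (the file name and paths here are assumptions, e.g. &lt;code&gt;/etc/grafana/provisioning/dashboards/default.yaml&lt;/code&gt;) could look like this:&lt;/p&gt;

```yaml
apiVersion: 1
providers:
  - name: 'default'
    type: file
    options:
      # folder from which Grafana loads dashboard JSON files on startup
      path: /etc/grafana/provisioning/dashboards
```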

&lt;h2&gt;
  
  
  Connecting everything together
&lt;/h2&gt;

&lt;p&gt;To make everything portable and easy to install, I packed my Flask API into a Docker image and configured all the services in &lt;code&gt;docker-compose.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pihome&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pihome&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
    &lt;span class="na"&gt;devices&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/dev/mem:/dev/mem"&lt;/span&gt;
    &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5000:5000&lt;/span&gt;

  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus:v2.16.0&lt;/span&gt;
    &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;root&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./prometheus/:/etc/prometheus/&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/prometheus:/prometheus&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--config.file=/etc/prometheus/prometheus.yml'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--storage.tsdb.path=/prometheus'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--storage.tsdb.retention.time=30d'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--web.console.libraries=/usr/share/prometheus/console_libraries'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--web.console.templates=/usr/share/prometheus/consoles'&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;9090:9090&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pihome&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:6.6.2&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;80:3000&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/:/etc/grafana&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I left my stack running for a while to collect some historical data, and the dashboard looked like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Quc9QKnk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://tinklas.xz.lt/todo/dashboard.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Quc9QKnk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://tinklas.xz.lt/todo/dashboard.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Git project
&lt;/h2&gt;

&lt;p&gt;You can find my full configuration and code on Github: &lt;a href="https://github.com/pdambrauskas/pihome"&gt;https://github.com/pdambrauskas/pihome&lt;/a&gt; &lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>grafana</category>
      <category>diy</category>
      <category>smarthome</category>
    </item>
    <item>
      <title>Kafka Connect: How it let us down?</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Mon, 03 Feb 2020 06:18:55 +0000</pubDate>
      <link>https://dev.to/pdambrauskas/kafka-connect-how-it-let-us-down-2nnc</link>
      <guid>https://dev.to/pdambrauskas/kafka-connect-how-it-let-us-down-2nnc</guid>
      <description>&lt;p&gt;About a year ago me and &lt;a href="https://github.com/minutis"&gt;@minutis&lt;/a&gt; had a chance to try out &lt;a href="https://docs.confluent.io/3.0.0/connect/"&gt;Kafka Connect&lt;/a&gt;. We used it as the backbone of one of our &lt;a href="https://www.webopedia.com/TERM/E/ETL.html"&gt;ETL&lt;/a&gt; processes but eventually, we chose a different approach. In this post, I'll try to remember what problems we met, and why Kafka Connect didn't fit our needs.&lt;/p&gt;

&lt;p&gt;For those of you who do not know, Kafka Connect is a framework for connecting Apache Kafka to external systems such as databases, search indexes and file systems.&lt;br&gt;
Kafka Connect works in both directions: it can write data from an external source system to a Kafka topic, and export data from a Kafka topic to an external system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Main Kafka Connect concepts
&lt;/h2&gt;

&lt;p&gt;I'm not going to cover each and every Kafka Connect component in depth; there is plenty of information online on how Kafka Connect is designed and how it works. However, I'll describe the components briefly, to give you enough context for the rest of this post.&lt;/p&gt;

&lt;p&gt;The main Kafka Connect components are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Connector - a logical implementation of an integration with an external system. There are two types of connectors: Source Connectors, responsible for reading from external systems into Kafka, and Sink Connectors, responsible for writing data from Kafka to external systems. Confluent Inc., the main contributor to Kafka Connect, has quite detailed &lt;a href="https://docs.confluent.io/current/connect/devguide.html#"&gt;documentation&lt;/a&gt; on how to implement your own Source and Sink connectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Task - a unit of work. When you configure a connector for an external system, you can define the maximum number of tasks. This number defines how many processes should read from your external system (or write to it) in parallel, so the work done by a connector is parallelized across its tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Worker - the component responsible for task execution. Kafka Connect can work in two modes: standalone and distributed. In standalone mode, you have one worker process responsible for executing your connector tasks, configured in a properties file. In distributed mode, you can start many worker processes and distribute them across your cluster. Also, in distributed mode, all connector configuration is done through the &lt;a href="https://docs.confluent.io/current/connect/references/restapi.html"&gt;Kafka Connect REST API&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transform - a transformation applied to each message after the connector ingests the data, but before it is written to a Kafka topic. There are many &lt;a href="https://docs.confluent.io/current/connect/transforms/index.html"&gt;Transform implementations&lt;/a&gt;, and it is also very easy to implement custom transforms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
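&lt;p&gt;To make the distributed-mode workflow concrete: registering a connector boils down to POSTing a JSON document like the one below to a worker's &lt;code&gt;/connectors&lt;/code&gt; REST endpoint. This is a minimal sketch using the stock FileStreamSink connector that ships with Kafka; the connector name, topic and file path are made up for illustration:&lt;/p&gt;

```json
{
  "name": "demo-file-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "demo-topic",
    "file": "/tmp/demo-sink.txt"
  }
}
```

&lt;p&gt;The same API removes a connector again via &lt;code&gt;DELETE /connectors/demo-file-sink&lt;/code&gt;.&lt;/p&gt;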

&lt;h2&gt;
  
  
  Why it failed us?
&lt;/h2&gt;

&lt;p&gt;The first time I found out about Kafka Connect, I was excited. It looked like a really nice, well-thought-through solution for managing different ETL pipelines. It has a REST API for adding/removing connectors, starting, stopping and scaling tasks, and monitoring task statuses. Extensibility looked really promising too: you can add your own connector and transform implementations without forking the Kafka source code, and scale through as many worker processes as you need.&lt;br&gt;
Without further investigation, we decided to try it out. What we experienced wasn't as nice as we had hoped :).&lt;/p&gt;

&lt;h3&gt;
  
  
  Too early to use in production
&lt;/h3&gt;

&lt;p&gt;At the time we were experimenting with Kafka Connect, it wasn't stable enough. It had some bugs, the quality of the open-source connectors was quite poor, and there were a few architectural flaws which were deal-breakers for us.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bugs
&lt;/h4&gt;

&lt;p&gt;In our use case, we wanted to write data to &lt;a href="https://www.ibm.com/analytics/hadoop/hdfs"&gt;HDFS&lt;/a&gt;. For that we decided to use the open-source &lt;a href="https://github.com/confluentinc/kafka-connect-hdfs"&gt;kafka-connect-hdfs&lt;/a&gt; connector implementation. At the time we used it, it was pretty much unusable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We had corrupted files after Kafka rebalance &lt;a href="https://github.com/confluentinc/kafka-connect-hdfs/issues/268"&gt;#268(open)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We had limited datatypes available &lt;a href="https://github.com/confluentinc/kafka-connect-hdfs/issues/49"&gt;#49(open)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We were not able to partition data by multiple fields (we had implemented our own solution for this one) &lt;a href="https://github.com/confluentinc/kafka-connect-storage-common/issues/53"&gt;#commons-53(fixed)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We had tasks failing to resume after a pause &lt;a href="https://github.com/confluentinc/kafka-connect-hdfs/issues/53"&gt;#53(fixed)&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this experience we saw that the open-source connectors were not only buggy but also lacked the features we needed. We decided to write our own connector implementations and didn't stop believing in Kafka Connect. We encountered some Kafka bugs too, but most of them were fixed quickly enough (&lt;a href="https://issues.apache.org/jira/browse/KAFKA-6252"&gt;KAFKA-6252&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Some minor Kafka Connect bugs are unfixed even today. One worth mentioning is &lt;a href="https://issues.apache.org/jira/browse/KAFKA-4107"&gt;KAFKA-4107&lt;/a&gt;. During testing, we had some cases where we needed to delete and recreate some of the connectors. Kafka Connect provides a REST API endpoint for connector deletion; however, when you delete a connector through this API, the old task offsets remain undeleted, so you cannot safely create a connector with the same name. We found a workaround for this problem: we added connector versioning (appending a version number to the connector name) to avoid conflicts with offsets from deleted tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rebalance all the time
&lt;/h4&gt;

&lt;p&gt;This one was a Kafka Connect design flaw. Kafka Connect rebalanced &lt;em&gt;all&lt;/em&gt; of the tasks on its cluster every time the task set changed (a task or a connector was added or deleted, etc.). That meant all running tasks had to be stopped and restarted. The time needed to rebalance all your tasks grows significantly with each new connector and becomes unacceptable at around ~100 tasks. &lt;br&gt;
This was the biggest roadblock for us, since we had a dynamic environment where the task set was changing rapidly, so rebalancing was happening all the time.&lt;br&gt;
Today this &lt;strong&gt;is not a problem anymore&lt;/strong&gt;: the flaw was fixed in Kafka 2.3.0. You can read more about that &lt;a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We dropped the idea of using Kafka Connect (sometime in 2018) because it wasn't production-ready yet and didn't fully cover our use cases. Today, many of the problems we met are fixed (some are not, but you can find workarounds). I'm still somewhat skeptical about Kafka Connect; however, trying and experimenting with it was really fun. I'd say you should consider Kafka Connect only if you are willing to invest time in implementing your own connectors.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>kafkaconnect</category>
      <category>etl</category>
    </item>
    <item>
      <title>Event Sourcing with Redis</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Sat, 25 Jan 2020 11:34:12 +0000</pubDate>
      <link>https://dev.to/pdambrauskas/event-sourcing-with-redis-45ha</link>
      <guid>https://dev.to/pdambrauskas/event-sourcing-with-redis-45ha</guid>
      <description>&lt;p&gt;It is a common practice to use asynchronous messaging technologies to implement communication between multiple microservices. Kafka is a go-to solution when it comes to streaming pipelines and publish/subscribe systems for async communication. It has many advantages over traditional, synchronous HTTP communication, some of them are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can easily scale your microservice architecture by running multiple instances of a microservice in the same consumer group.&lt;/li&gt;
&lt;li&gt;You can add new consumers without modifying existing communication-related code.&lt;/li&gt;
&lt;li&gt;You can take advantage of the persistence of Kafka topics by re-consuming them any time you need.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have implemented asynchronous communication using Kafka several times, both at work and in my personal projects. But there are many alternatives. One of the latest is the newest Redis data type - Redis Streams, which came with &lt;a href="https://redislabs.com/blog/redis-5-0-is-here/"&gt;Redis 5.0&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://redis.io/topics/streams-intro"&gt;Redis Streams&lt;/a&gt; may look very similar to traditional Redis Pub/Sub concept, however it is quite different. It shares main conceptual ideas with Apache Kafka:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A stream can have multiple consumers, and every new entry on a stream is delivered to every consumer (unless the consumers belong to the same consumer group).&lt;/li&gt;
&lt;li&gt;Consumed messages do not disappear: Redis stores the streamed data and the last consumed &lt;em&gt;id&lt;/em&gt; for each consumer group,
so new consumer groups can consume the stream from the beginning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, as you see, in theory Redis Streams work very similarly to Kafka topics, which means they can be used for the same use cases. You can find many resources and examples on how to use Kafka for Event Sourcing (&lt;a href="https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/"&gt;example&lt;/a&gt;), but when you look at Redis Streams, the variety of examples is very limited. So I decided to contribute to filling in this resource gap :).&lt;br&gt;
In this post I will try to describe how to use Redis Streams for &lt;a href="https://martinfowler.com/eaaDev/EventSourcing.html"&gt;Event Sourcing&lt;/a&gt; &amp;amp; &lt;a href="https://www.martinfowler.com/bliki/CQRS.html"&gt;CQRS&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Redis Streams API theoretical example
&lt;/h2&gt;

&lt;p&gt;Let's say we have a REST API endpoint for GETting a user's social media profile, with personal information, liked pages and a list of the user's friends. For that we'll have three different Redis Streams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_updates&lt;/code&gt; - a stream of user profile updates. Every time the user changes their birth date, name or other personal information, an entry is published to this stream.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_activity&lt;/code&gt; - every time the user writes a comment, reacts to a post, or performs any other action (as on a typical social network), an entry is published to this stream.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_friends&lt;/code&gt; - every time the user gains or loses a friend, an entry is published to this stream.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Filling up event streams
&lt;/h3&gt;

&lt;p&gt;So, we would publish our events using the &lt;code&gt;XADD&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XADD user_updates * user_id 1 command register name Duffy surename Duck
XADD user_updates * user_id 2 command register name Bugs surename Bunny
XADD user_activity * user_id 1 command kick object Bugs
XADD user_activity * user_id 1 command dislike object Bugs face
XADD user_friends * user_id 1 command remove friend_id 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading messages from event streams
&lt;/h3&gt;

&lt;p&gt;Now we can read all three streams using the &lt;code&gt;XREAD&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XREAD STREAMS user_updates user_activity user_friends 0 0 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Your output should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1) 1) "user_updates"
   2) 1) 1) "1577650357114-0"
         2) 1) "user_id"
            2) "1"
            3) "action"
            4) "register"
            5) "name"
            6) "Duffy"
            7) "surename"
            8) "Duck"
      2) 1) "1577650371803-0"
         2) 1) "user_id"
            2) "2"
            3) "action"
            4) "register"
            5) "name"
            6) "Bugs"
            7) "surename"
            8) "Bunny"
2) 1) "user_activity"
   2) 1) 1) "1577650378926-0"
         2) 1) "user_id"
            2) "1"
            3) "action"
            4) "kick"
            5) "object"
            6) "Bugs"
      2) 1) "1577650384649-0"
         2) 1) "user_id"
            2) "1"
            3) "action"
            4) "dislike"
            5) "object"
            6) "Bugs face"
3) 1) "user_friends"
   2) 1) 1) "1577650389616-0"
         2) 1) "user_id"
            2) "1"
            3) "action"
            4) "remove"
            5) "friend_id"
            6) "2"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Building snapshots
&lt;/h3&gt;

&lt;p&gt;When building the profile representation, you can transform and reflect it in any structure you need. Notice the three zeroes at the end of the &lt;code&gt;XREAD&lt;/code&gt; command: those are starting entry IDs, one per stream (0 means "from the beginning"), and consumed IDs are likewise tracked as consumer offsets when you use the &lt;code&gt;XREADGROUP&lt;/code&gt; command (we'll use this command later). Entry IDs by default are millisecond timestamps followed by a dash and a sequence number, so it is easy to read a stream from any point in time you need. &lt;/p&gt;
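&lt;p&gt;As a quick illustration, the first ID from the output above decodes like this (a Python sketch; the ID format is &lt;code&gt;&amp;lt;milliseconds&amp;gt;-&amp;lt;sequence&amp;gt;&lt;/code&gt;):&lt;/p&gt;

```python
from datetime import datetime, timezone

# "1577650357114-0" is the first entry ID from the XREAD output above:
# a Unix timestamp in milliseconds, a dash, and a sequence number that
# disambiguates entries created within the same millisecond.
entry_id = "1577650357114-0"
ms, seq = entry_id.split("-")

ts = datetime.fromtimestamp(int(ms) // 1000, tz=timezone.utc)
print(ts.isoformat())  # 2019-12-29T20:12:37+00:00
print(seq)             # 0
```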

&lt;p&gt;In most cases, it is inefficient to consume the whole stream every time you need the data, so it is a common practice to save user profile snapshots in one way or another. For snapshotting you can use the Redis Hash data structure (or another Redis structure, if that makes more sense for you). A Hash can be created using the &lt;code&gt;HMSET&lt;/code&gt; command (you can set multiple hash fields for the aggregated data; most are skipped in this example); the suffix of the Hash key is the user id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HMSET user_snapshot_1 name Duffy ... [field value]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
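&lt;p&gt;Serving the API request is then a single lookup against the snapshot key (&lt;code&gt;HGETALL&lt;/code&gt; returns all fields of the hash):&lt;/p&gt;

```plaintext
HGETALL user_snapshot_1
```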



&lt;h3&gt;
  
  
  Continuous consumption
&lt;/h3&gt;

&lt;p&gt;But you don't want to keep track of the offsets you have already consumed yourself, right? Redis has a solution to this problem too: use &lt;code&gt;XREADGROUP&lt;/code&gt; along with &lt;code&gt;XACK&lt;/code&gt;. Pseudocode for the whole snapshotting process would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;XGROUP CREATE user_updates snapshotter 0
XGROUP CREATE user_activity snapshotter 0
XGROUP CREATE user_friends snapshotter 0
WHILE true
    entries = XREADGROUP GROUP snapshotter event_consumer BLOCK 2000 COUNT 10 STREAMS user_updates user_activity user_friends &amp;gt; &amp;gt; &amp;gt;
    if entries == nil
        puts "Timeout... try again"
        CONTINUE
    end

    FOREACH entries AS stream_entries
        FOREACH stream_entries as message
            process_message(message)
            XACK message.stream snapshotter message.id
        END
    END
END
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
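&lt;p&gt;Setting the Redis plumbing aside, the &lt;code&gt;process_message&lt;/code&gt; step in the pseudocode above is just a fold of events into a snapshot. Here is a minimal in-memory Python sketch of that fold (the event fields mirror the XADD examples earlier; a plain dict stands in for the Redis Hash snapshots, and the handler logic is my own illustration):&lt;/p&gt;

```python
# In-memory sketch of the snapshotting fold. In the real setup, events
# come from XREADGROUP and snapshots live in Redis Hashes
# (user_snapshot_<id>); plain dicts stand in for both here.
SNAPSHOTS = {}

def process_message(stream, event):
    """Fold a single event into the user's snapshot."""
    user_id = event["user_id"]
    snapshot = SNAPSHOTS.setdefault(user_id, {"friends": set(), "activity": []})
    if stream == "user_updates":
        # Profile updates overwrite previous values (last write wins).
        snapshot.update(
            {k: v for k, v in event.items() if k not in ("user_id", "action")}
        )
    elif stream == "user_activity":
        snapshot["activity"].append((event["action"], event.get("object")))
    elif stream == "user_friends":
        if event["action"] == "remove":
            snapshot["friends"].discard(event["friend_id"])
        else:
            snapshot["friends"].add(event["friend_id"])

events = [
    ("user_updates", {"user_id": "1", "action": "register",
                      "name": "Duffy", "surename": "Duck"}),
    ("user_activity", {"user_id": "1", "action": "kick", "object": "Bugs"}),
    ("user_friends", {"user_id": "1", "action": "remove", "friend_id": "2"}),
]
for stream, event in events:
    process_message(stream, event)

print(SNAPSHOTS["1"]["name"])      # Duffy
print(SNAPSHOTS["1"]["activity"])  # [('kick', 'Bugs')]
```

&lt;p&gt;In the real loop, each &lt;code&gt;process_message&lt;/code&gt; call would be followed by an &lt;code&gt;XACK&lt;/code&gt;, and the updated snapshot would be written back with &lt;code&gt;HMSET&lt;/code&gt;.&lt;/p&gt;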



&lt;h2&gt;
  
  
  Java example
&lt;/h2&gt;

&lt;p&gt;I've also implemented a Java app to illustrate how to use Redis Streams for Event Sourcing. I chose the &lt;a href="https://github.com/lettuce-io/lettuce-core"&gt;lettuce&lt;/a&gt; library for communication with the Redis server. The source code of my implementation can be found on &lt;a href="https://github.com/pdambrauskas/event-sourcing/tree/master/event-sourcing-redis"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are three main classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/pdambrauskas/event-sourcing/blob/master/event-sourcing-redis/src/main/java/com/github/pdambrauskas/eventsourcing/redis/EventStore.java"&gt;EventStore&lt;/a&gt; - this class can be used for event publishing, and subscribing to Redis Streams.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pdambrauskas/event-sourcing/blob/master/event-sourcing-redis/src/main/java/com/github/pdambrauskas/eventsourcing/redis/SnapshotStore.java"&gt;SnapshotStore&lt;/a&gt; - this class can be used to store and retrieve snapshot objects from Redis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pdambrauskas/event-sourcing/blob/master/event-sourcing-redis/src/main/java/com/github/pdambrauskas/eventsourcing/redis/processing/StreamProcessor.java"&gt;StreamProcessor&lt;/a&gt; - combines EventStore and Snapshot store. You can supply multiple event handlers, which are used for building snapshots.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these classes working together can be seen in the &lt;a href="https://github.com/pdambrauskas/event-sourcing/blob/master/event-sourcing-redis/src/test/java/com/github/pdambrauskas/eventsourcing/redis/RedisEventSourcingTest.java"&gt;RedisEventSourcingTest&lt;/a&gt; unit test class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While Redis Streams is a relatively new concept, it is heavily inspired by Apache Kafka and has many overlapping features. We also mustn't forget that Redis has many more data structures and features which can be used alongside Redis Streams (one of which we used for snapshotting in this post). If you already have Redis in your technology stack and are looking into streaming solutions, consider using Redis Streams. Not only is Redis feature-rich, it is also very easy to learn and use.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>streaming</category>
      <category>cqrs</category>
      <category>eventsourcing</category>
    </item>
  </channel>
</rss>
