<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Exacaster</title>
    <description>The latest articles on DEV Community by Exacaster (@exacaster).</description>
    <link>https://dev.to/exacaster</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4893%2Fee98febf-ca51-4453-ba24-b4c95b0641bb.png</url>
      <title>DEV Community: Exacaster</title>
      <link>https://dev.to/exacaster</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/exacaster"/>
    <language>en</language>
    <item>
      <title>Lightweight HTTP API for Big Data on S3</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Wed, 15 Mar 2023 15:50:29 +0000</pubDate>
      <link>https://dev.to/exacaster/lightweight-http-api-for-big-data-on-s3-3fnb</link>
      <guid>https://dev.to/exacaster/lightweight-http-api-for-big-data-on-s3-3fnb</guid>
      <description>&lt;p&gt;We are happy to announce our third open-source project - &lt;a href="https://github.com/exacaster/delta-fetch"&gt;Delta Fetch&lt;/a&gt;.&lt;br&gt;
Delta Fetch is a configurable HTTP API service for accessing &lt;a href="https://delta.io/"&gt;Delta Lake&lt;/a&gt; tables, with the ability to filter your Delta tables by selected columns.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Delta Fetch relies heavily on Delta table metadata, which contains statistics about each Parquet file. The same metadata that is used for &lt;a href="https://docs.delta.io/latest/optimizations-oss.html#data-skipping"&gt;data skipping&lt;/a&gt; is used to read only the relevant files - in particular, the minimum and maximum value of each column in each file. The Delta table metadata is cached for better performance and can be refreshed by enabling auto cache update or by making API requests with the &lt;code&gt;...?exact=true&lt;/code&gt; query parameter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request handling flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user makes an API request to one of the configured API resources.&lt;/li&gt;
&lt;li&gt;Delta Fetch reads Delta table metadata from file storage and stores it in memory.&lt;/li&gt;
&lt;li&gt;Delta Fetch finds the relevant file paths in the stored metadata and starts reading them.&lt;/li&gt;
&lt;li&gt;Delta Fetch uses the Hadoop Parquet Reader implementation, which supports filter push down to avoid reading the entire file.&lt;/li&gt;
&lt;li&gt;Delta Fetch continues reading Parquet files one by one until the requested or configured limit is reached.&lt;/li&gt;
&lt;/ul&gt;
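&lt;p&gt;The min/max statistics step above can be sketched in a few lines. This is an illustrative toy (our own code with made-up file stats, not Delta Fetch internals):&lt;/p&gt;

```python
# Toy illustration of metadata-based file skipping: keep only Parquet files
# whose per-column [min, max] statistics range can contain the requested value.
# The file list and statistics below are made-up examples.

def relevant_files(files, column, value):
    """Return files whose min/max stats for `column` can contain `value`."""
    return [
        f for f in files
        if f["stats"][column]["min"] <= value <= f["stats"][column]["max"]
    ]

files = [
    {"path": "part-0001.parquet", "stats": {"user_id": {"min": "100", "max": "499"}}},
    {"path": "part-0002.parquet", "stats": {"user_id": {"min": "500", "max": "999"}}},
]

# Only part-0002.parquet can contain user_id "742", so only it is read.
print([f["path"] for f in relevant_files(files, "user_id", "742")])  # → ['part-0002.parquet']
```

&lt;p&gt;The real service applies the same idea to the statistics stored in the Delta transaction log, and pushes the remaining filtering down into the Parquet reader.&lt;/p&gt;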
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Resources can be configured in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/data/{table}/{identifier}&lt;/span&gt;
      &lt;span class="na"&gt;schema-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/schemas/{table}/{identifier}&lt;/span&gt;
      &lt;span class="na"&gt;delta-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3a://bucket/delta/{table}/&lt;/span&gt;
      &lt;span class="na"&gt;response-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SINGLE&lt;/span&gt;
      &lt;span class="na"&gt;filter-variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
          &lt;span class="na"&gt;path-variable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;identifier&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;path&lt;/code&gt; property defines the API path that will be used to query your Delta tables. Path variables can be defined using curly braces, as shown in the example.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema-path&lt;/code&gt; (optional) property can be used to define an API path for the Delta table schema.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;delta-path&lt;/code&gt; property defines the S3 path of your Delta table. Path variables in this path will be filled in with the variables provided in the API path.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response-type&lt;/code&gt; (optional, default: &lt;code&gt;SINGLE&lt;/code&gt;) property defines whether to search for multiple resources or a single one. Use the &lt;code&gt;LIST&lt;/code&gt; type for multiple resources.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max-results&lt;/code&gt; (optional, default: &lt;code&gt;100&lt;/code&gt;) the maximum number of rows that can be returned when &lt;code&gt;response-type&lt;/code&gt; is &lt;code&gt;LIST&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;filter-variables&lt;/code&gt; (optional) additional filters applied to the Delta table.&lt;/li&gt;
&lt;/ul&gt;
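&lt;p&gt;To make the interplay between &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;delta-path&lt;/code&gt; and &lt;code&gt;filter-variables&lt;/code&gt; concrete, here is a hypothetical sketch of how such a resource definition could resolve an incoming request. The &lt;code&gt;resolve&lt;/code&gt; function is ours, for illustration only - it is not Delta Fetch code:&lt;/p&gt;

```python
# Hypothetical sketch: map a request path to a Delta table location and a
# column filter, mirroring the YAML resource configuration shown above.
import re

resource = {
    "path": "/api/data/{table}/{identifier}",
    "delta-path": "s3a://bucket/delta/{table}/",
    "filter-variables": [{"column": "id", "path-variable": "identifier"}],
}

def resolve(resource, request_path):
    # Turn "/api/data/{table}/{identifier}" into a named-group regex.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", resource["path"])
    variables = re.fullmatch(pattern, request_path).groupdict()
    # Fill the S3 path template with the captured path variables.
    delta_path = resource["delta-path"].format(**variables)
    # Map each configured filter column to its path-variable value.
    filters = {
        fv["column"]: variables[fv["path-variable"]]
        for fv in resource["filter-variables"]
    }
    return delta_path, filters

print(resolve(resource, "/api/data/users/42"))
# → ('s3a://bucket/delta/users/', {'id': '42'})
```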

&lt;p&gt;You can also configure one of two security mechanisms - Basic Auth or OAuth2 - and some caching parameters for better performance. Refer to the &lt;a href="https://github.com/exacaster/delta-fetch"&gt;Delta Fetch&lt;/a&gt; GitHub repo for more information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;To be able to access the data in Parquet files quickly, you need to configure the block size to a smaller value than you normally would. We got acceptable results by setting &lt;code&gt;parquet.block.size&lt;/code&gt; to &lt;code&gt;1048576&lt;/code&gt; (1 MB).&lt;/p&gt;

&lt;p&gt;We also highly recommend &lt;strong&gt;not&lt;/strong&gt; using &lt;code&gt;OPTIMIZE ... ZORDER ...&lt;/code&gt; on tables that are exposed through Delta Fetch, since this command usually stores data split into 1 GB chunks. We suggest relying on simple data ordering by the columns that you plan to use as "keys" in the Delta Fetch API.&lt;/p&gt;

&lt;p&gt;More recommendations and considerations can be found on our &lt;a href="https://github.com/exacaster/delta-fetch/docs/recommendations.md"&gt;recommendations page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With those recommendations applied, we managed to get ~1s response times when requesting a single row by a single column value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/872480210503_234678
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:5,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"872480210503_234678"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 1% cpu 0.982 total
&lt;span class="nt"&gt;---&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/579520210231_237911
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:5,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"579520210231_237911"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-06-24"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 0% cpu 1.250 total
&lt;span class="nt"&gt;---&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/875540210000_245810
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:2,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"875540210000_245810"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 1% cpu 0.870 total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We consider this API service experimental and hope to get some feedback and contributions from the open-source (and also dev.to :)) community. Let us know what you think about our new project.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>bigdata</category>
      <category>opensource</category>
      <category>s3</category>
    </item>
    <item>
      <title>Testing PySpark &amp; Pandas in style</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Thu, 10 Feb 2022 07:49:05 +0000</pubDate>
      <link>https://dev.to/exacaster/testing-pyspark-pandas-in-style-31cg</link>
      <guid>https://dev.to/exacaster/testing-pyspark-pandas-in-style-31cg</guid>
      <description>&lt;p&gt;Today we'd like to share a small utility package for testing DataFrames in PySpark and Pandas.&lt;/p&gt;

&lt;p&gt;If you are a fan of test-driven development and have had a chance to work on PySpark (or Pandas) projects, you've probably written tests similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;your_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_event_aggregation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"even_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"event_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"dt"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;expected_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'page_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2017-12-31"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'item_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;68471513&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2017-12-31"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;schema&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works OK for small applications, but as your project gets bigger, the data gets more complicated and the number of tests grows, you might want a less tedious way to define test data.&lt;/p&gt;

&lt;p&gt;Exacaster alumnus &lt;a href="https://www.linkedin.com/in/vaidasarmonas"&gt;Vaidas Armonas&lt;/a&gt; came up with the idea of representing Spark DataFrames as markdown tables. This idea materialized into the testing package &lt;a href="https://pypi.org/project/markdown-frames/"&gt;markdown-frames&lt;/a&gt;. With this package, the test shown before can be replaced with this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;markdown_frames.spark_dataframe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;your_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_event_aggregation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""" 
        |  user_id   | event_type  | item_id  |    event_time       | country  |     dt      |
        |   bigint   |   string    |  bigint  |    timestamp        |  string  |   string    |
        | ---------- | ----------- | -------- | ------------------- | -------- | ----------- |
        |   123456   |  page_view  |   None   | 2017-12-31 23:50:50 |   uk     | 2017-12-31  |
        |   123456   |  item_view  | 68471513 | 2017-12-31 23:50:55 |   uk     | 2017-12-31  |
    """&lt;/span&gt;
    &lt;span class="n"&gt;expected_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It makes tests more readable and self-explanatory.&lt;/p&gt;
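&lt;p&gt;The underlying idea is simple: parse the markdown table (header row, type row, separator, data rows) into rows. A toy sketch of that parsing step - our own simplified code, not the markdown-frames implementation, and it ignores the type row that the real package uses for casting values:&lt;/p&gt;

```python
# Toy parser for markdown-style test tables: header row, type row,
# separator row, then data rows. Returns plain dicts (all values as strings);
# markdown-frames additionally casts values using the type row.

def parse_markdown_table(table):
    lines = [l.strip().strip("|") for l in table.strip().splitlines()]
    rows = [[cell.strip() for cell in line.split("|")] for line in lines]
    header, types, data = rows[0], rows[1], rows[3:]  # rows[2] is the separator
    return [dict(zip(header, row)) for row in data]

table = """
    |  user_id  | country |
    |  bigint   | string  |
    | --------- | ------- |
    |  123456   |   uk    |
"""
print(parse_markdown_table(table))  # → [{'user_id': '123456', 'country': 'uk'}]
```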

&lt;p&gt;Everything looks almost the same when you need to build a DataFrame for Pandas; you just need to use a different function:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from markdown_frames.pandas_dataframe import pandas_df&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Share in the comments if you know any other convenient tips &amp;amp; tricks for writing PySpark (and Pandas) unit tests.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>pandas</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Spark is lit once again</title>
      <dc:creator>Mindaugas</dc:creator>
      <pubDate>Fri, 29 Oct 2021 13:30:23 +0000</pubDate>
      <link>https://dev.to/exacaster/spark-is-lit-once-again-41p7</link>
      <guid>https://dev.to/exacaster/spark-is-lit-once-again-41p7</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a class="mentioned-user" href="https://dev.to/pdambrauskas"&gt;@pdambrauskas&lt;/a&gt; and I are marking Hacktoberfest by releasing our little in-house project...&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lighter - Running Spark applications on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Here at &lt;a href="https://exacaster.com" rel="noopener noreferrer"&gt;Exacaster&lt;/a&gt;, &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt; applications have been used extensively for years. We started using them on our &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop&lt;/a&gt; clusters with &lt;a href="https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html" rel="noopener noreferrer"&gt;YARN&lt;/a&gt; as the application manager. However, with our recent product, we started moving towards a Cloud-based solution and decided to use Kubernetes for our infrastructure needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Livy
&lt;/h3&gt;

&lt;p&gt;When running Spark applications on YARN, you can submit jobs using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark client&lt;/li&gt;
&lt;li&gt;Apache &lt;a href="https://github.com/apache/incubator-livy/" rel="noopener noreferrer"&gt;Livy&lt;/a&gt; - an open-source REST API for interacting with Apache Spark from anywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latter was our go-to solution at the time, when we were only using Spark on YARN. Sadly, Apache Livy is not maintained anymore: it has no K8s support, and its Spark client gets more outdated with every passing day. For some time we used &lt;a href="https://github.com/jahstreet/incubator-livy/tree/kubernetes-support-initial" rel="noopener noreferrer"&gt;@jahstreet's fork&lt;/a&gt;, which added K8s support. But since the Livy project wasn't receiving any updates, we decided to implement our own solution - &lt;a href="https://github.com/exacaster/lighter" rel="noopener noreferrer"&gt;Exacaster Lighter&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lighter
&lt;/h3&gt;

&lt;p&gt;Exacaster Lighter is heavily inspired by Apache Livy. The idea is the same: hide the Spark application client behind a REST API. However, we focus on running those applications on a K8s cluster; YARN mode is also supported. We designed our application to be extensible with different execution backends.&lt;/p&gt;

&lt;p&gt;Lighter has a lightweight, React-based UI written in TypeScript and a back end written in Java, with minor Python integration points.&lt;/p&gt;

&lt;p&gt;Simplified illustration of the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                              ┌────────────────────────────────────────────────────────────────────────────┐
                                              │ Lighter                                                                    │
                                              │     ┌────────────────────────────────────────────────────────────────┐     │
                                              │     │                                                                │     │
                                              │     │                         Internal storage                       │     │
                                              │     │                                                                │     │
                                              │     │                                                                │     │
                                              │     └▲────────▲────────────────────┬─────────────────────────┬───────┘     │
                                              │      │        │                    │                         │             │
                                              │  store app    │                 get│new apps            sync status        │
                                              │      │     check status            │                         │             │
┌────────────────────┐                    ┌───┴──────┴──────────┐           ┌──────▼─────────┐      ┌────────▼────────┐    │
│                    │                    │                     │           │                │      │                 │    │
│                    │  Submit            │                     │           │                │      │                 │    │
│                    ├────────────────────►                     │           │                │      │                 │    │
│      Client        │                    │       REST api      │           │  App executor  │      │ Status tracker  │    │
│                    │  Check status      │                     │           │                │      │                 │    │
│                    ◄────────────────────┤                     │           │                │      │                 │    │
│                    │                    │                     │           │                │      │                 │    │
│                    │                    │                     │           │                │      │                 │    │
└────────────────────┘                    └───┬─────────────────┘           └────────┬───────┘      └────────┬────────┘    │
                                              │                                      │                       │             │
                                              │                                   execute               get status         │
                                              │                                      │                       │             │
                                              │                              ┌───────▼───────────────────────▼──────┐      │
                                              │                              │                                      │      │
                                              │                              │                                      │      │
                                              │                              │                Backend               │      │
                                              │                              │               (YARN/K8s)             │      │
                                              │                              │                                      │      │
                                              │                              │                                      │      │
                                              │                              └──────────────────────────────────────┘      │
                                              │                                                                            │
                                              └────────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More information can be found on our &lt;a href="https://github.com/exacaster/lighter/tree/master/docs/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; page.&lt;/p&gt;

&lt;h3&gt;
  
  
  UI
&lt;/h3&gt;

&lt;p&gt;This is the job list view:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l13dnf409vje3ych64k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l13dnf409vje3ych64k.png" alt="Job list"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;You can see the configuration of the submitted job inside:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnthv9qjcm3t7ego932o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnthv9qjcm3t7ego932o.png" alt="Job configurations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Driver logs are also available for each job:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojcy9jesy0il7bldj2jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojcy9jesy0il7bldj2jm.png" alt="Job logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Glad you asked. It is quite simple. Lighter uses the &lt;a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html" rel="noopener noreferrer"&gt;Spark Launcher&lt;/a&gt; to launch Spark applications on a Kubernetes cluster. The launcher takes care of creating all the Pods needed for the Spark application to run. When launching applications, we tag them with a unique identifier by setting the config property &lt;code&gt;spark.kubernetes.driver.label.spark-app-tag&lt;/code&gt;. We then use that identifier to check application status and retrieve application logs by calling the Pods API with the &lt;code&gt;labelSelector&lt;/code&gt; property.&lt;/p&gt;

&lt;p&gt;Things get a bit more complicated for interactive sessions. We've created a &lt;a href="https://github.com/jupyter-incubator/sparkmagic" rel="noopener noreferrer"&gt;Sparkmagic&lt;/a&gt;-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J gateway, and executed.&lt;/p&gt;
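&lt;p&gt;That session loop can be sketched in miniature. In this toy version everything runs in one Python process, whereas in Lighter the queue lives on the Java side and is polled over Py4J:&lt;/p&gt;

```python
# Simplified, in-process sketch of the interactive-session loop: the server
# side enqueues submitted statements, and the session application repeatedly
# fetches and executes them against shared session state.
from queue import Queue, Empty

commands = Queue()                 # stands in for the Java-side collection
commands.put("result = 2 + 2")     # statements a Sparkmagic user might submit
commands.put("result *= 10")

session_state = {}                 # variables shared across statements
while True:
    try:
        code = commands.get(timeout=0.1)  # the real loop polls forever
    except Empty:
        break                              # toy version stops when drained
    exec(code, session_state)              # submitted statements run here

print(session_state["result"])  # → 40
```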

&lt;h3&gt;
  
  
  Use cases
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Spark on K8s
&lt;/h4&gt;

&lt;p&gt;Since Apache Spark 2.4, applications can be executed on a &lt;a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="noopener noreferrer"&gt;K8s cluster&lt;/a&gt;. When you submit your Spark application, driver and executor pods are created for it and removed after the application completes. But if you want to track application status and report it to end users in a nice manner, things get complicated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Spark on YARN
&lt;/h4&gt;

&lt;p&gt;In the early days of the Big Data era, when K8s hadn't even been born yet, the common open-source go-to solution was the Hadoop stack. We had written several old-fashioned MapReduce jobs and scripts using &lt;a href="https://pig.apache.org/" rel="noopener noreferrer"&gt;Pig&lt;/a&gt; until we came across Spark. Since then, Spark has become one of the most popular data processing engines. It is very easy to start using Lighter on YARN deployments: just run the &lt;a href="https://github.com/exacaster/lighter/blob/master/docs/docker.md" rel="noopener noreferrer"&gt;docker&lt;/a&gt; image with the proper configuration and mount the necessary configurations in all the default paths.&lt;/p&gt;

&lt;h4&gt;
  
  
  Jupyterlab
&lt;/h4&gt;

&lt;p&gt;For ad-hoc data analysis, &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyterlab&lt;/a&gt; on top of Spark is an elegant solution. However, these two great tools cannot communicate with each other, so Lighter together with &lt;a href="https://github.com/jupyter-incubator/sparkmagic" rel="noopener noreferrer"&gt;SparkMagic&lt;/a&gt; acts as a bridge. You only need to provide the correct &lt;a href="https://github.com/exacaster/lighter/blob/master/docs/sparkmagic.md" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; to SparkMagic to get it working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;Lighter is a freshly baked tool, open-sourced for everyone to use. Since we developed it for the use cases familiar to us, feel free to contribute if you see any opportunities to make it better.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>hacktoberfest</category>
      <category>spark</category>
    </item>
  </channel>
</rss>
