Paulius for Exacaster

Lightweight HTTP API for Big Data on S3

We are happy to announce our third open-source project: Delta Fetch.
Delta Fetch is a configurable HTTP API service for accessing Delta Lake tables. The service is highly configurable and lets you filter your Delta tables by selected columns.

How it works

Delta Fetch relies heavily on Delta table metadata, which contains statistics about each Parquet file. The same metadata that is used for data skipping, in particular the minimum and maximum value of each column in each file, is used to read only the relevant files. The Delta table metadata is cached for better performance and can be refreshed by enabling auto cache update or by making API requests with the ...?exact=true query parameter.
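To make the idea of min/max-based file selection concrete, here is a minimal Python sketch. It is not Delta Fetch's actual code. The file entries are hypothetical and simplified, modeled on the per-file stats JSON (minValues, maxValues) that Delta stores in its transaction log; the paths, values, and the candidate_files helper are made up for illustration.

```python
import json

# Hypothetical, simplified "add" entries resembling those in a Delta
# transaction log: each file carries a JSON "stats" string with per-column
# minValues and maxValues. Paths and values are invented for this example.
add_actions = [
    {"path": "part-0000.parquet",
     "stats": json.dumps({"minValues": {"id": "a001"}, "maxValues": {"id": "a999"}})},
    {"path": "part-0001.parquet",
     "stats": json.dumps({"minValues": {"id": "b001"}, "maxValues": {"id": "b999"}})},
    {"path": "part-0002.parquet",
     "stats": json.dumps({"minValues": {"id": "c001"}, "maxValues": {"id": "c999"}})},
]


def candidate_files(actions, column, value):
    """Return paths of files whose [min, max] range for `column` may contain `value`."""
    paths = []
    for action in actions:
        stats = json.loads(action["stats"])
        low = stats["minValues"].get(column)
        high = stats["maxValues"].get(column)
        # A file without statistics for the column cannot be ruled out.
        if low is None or high is None or low <= value <= high:
            paths.append(action["path"])
    return paths


print(candidate_files(add_actions, "id", "b042"))
```

Only the file whose range covers the requested value is kept, so the other two files never need to be opened.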

Request handling flow:

  • The user makes an API request to one of the configured API resources.
  • Delta Fetch reads Delta table metadata from file storage and stores it in memory.
  • Delta Fetch finds the relevant file paths in the stored metadata and starts reading them.
  • Delta Fetch uses the Hadoop Parquet Reader implementation, which supports filter push down to avoid reading the entire file.
  • Delta Fetch continues reading Parquet files one by one until the requested or configured limit is reached.
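
The last step of the flow above, reading files one by one until the limit is reached, can be sketched in Python with lazy generators. This is an illustration under assumptions, not the service's implementation: read_matching_rows is a hypothetical stand-in for a Parquet reader with filter push down, and files is an in-memory mapping used instead of real storage.

```python
from itertools import islice

opened = []  # records which files were actually opened, for illustration


def read_matching_rows(path, column, value, files):
    """Hypothetical stand-in for a Parquet reader with filter push down.

    `files` maps a path to a list of row dicts. A real reader would skip
    row groups using statistics instead of scanning every row.
    """
    opened.append(path)
    for row in files[path]:
        if row.get(column) == value:
            yield row


def fetch(paths, column, value, limit, files):
    """Read candidate files in order and stop once `limit` rows are found."""
    def all_matches():
        for path in paths:
            yield from read_matching_rows(path, column, value, files)
    # islice stops pulling from the generator after `limit` rows,
    # so files later in the list are never opened.
    return list(islice(all_matches(), limit))


files = {
    "f1": [{"id": "x", "n": 1}, {"id": "y", "n": 2}],
    "f2": [{"id": "x", "n": 3}],
}

print(fetch(["f1", "f2"], "id", "x", 1, files))
print(opened)
```

With a limit of 1, the first match in f1 satisfies the request and f2 is never touched, which is the behavior a SINGLE response benefits from.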

Configuration

Resources can be configured in the following way:

app:
  resources:
    - path: /api/data/{table}/{identifier}
      schema-path: /api/schemas/{table}/{identifier}
      delta-path: s3a://bucket/delta/{table}/
      response-type: SINGLE
      filter-variables:
        - column: id
          path-variable: identifier
  • The path property defines the API path that will be used to query your Delta tables. Path variables can be defined using curly braces, as shown in the example.
  • The schema-path (optional) property can be used to define an API path for the Delta table schema.
  • The delta-path property defines the S3 path of your Delta table. Path variables in this path are filled in from the variables provided in the API path.
  • The response-type (optional, default: SINGLE) property defines whether to search for a single resource or for multiple resources. Use the LIST type for multiple resources.
  • The max-results (optional, default: 100) property sets the maximum number of rows that can be returned for the LIST response-type.
  • The filter-variables (optional) property defines additional filters applied to the Delta table.
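
As a rough illustration of how these properties fit together, the Python sketch below resolves a request path against the resource from the example configuration. It only mirrors the semantics described above and is not how Delta Fetch itself is implemented; the helper names template_to_regex and resolve, and the request path /api/data/users/42, are hypothetical.

```python
import re


def template_to_regex(template):
    """Turn '/api/data/{table}/{identifier}' into a regex with named groups."""
    # re.split with one capture group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$")


def resolve(resource, request_path):
    """Return (delta_path, filters) for a request, or None if the path does not match."""
    match = template_to_regex(resource["path"]).match(request_path)
    if match is None:
        return None
    variables = match.groupdict()
    # Fill the delta-path placeholders with variables taken from the API path.
    delta_path = resource["delta-path"].format(**variables)
    # Map each configured column to the value of its path variable.
    filters = {
        f["column"]: variables[f["path-variable"]]
        for f in resource.get("filter-variables", [])
    }
    return delta_path, filters


resource = {
    "path": "/api/data/{table}/{identifier}",
    "delta-path": "s3a://bucket/delta/{table}/",
    "filter-variables": [{"column": "id", "path-variable": "identifier"}],
}

print(resolve(resource, "/api/data/users/42"))
```

Here the {table} variable selects which Delta table to read, while {identifier} becomes a filter on the id column.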

You can also configure one of two security mechanisms, Basic Auth or OAuth2, and some caching parameters for better performance. Refer to the Delta Fetch GitHub repo for more information.

Recommendations

To access the data in Parquet files quickly, you need to set the block size to a smaller value than you normally would. We got acceptable results by setting parquet.block.size to 1048576 (1 MiB).

We also strongly recommend not using OPTIMIZE ... ZORDER ... on tables that are exposed through Delta Fetch, since this command usually stores data split into 1 GB chunks. We suggest relying instead on simple data ordering by the columns that you plan to use as "keys" in the Delta Fetch API.
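
The small Python simulation below illustrates why ordering by the key column matters for min/max statistics. It does not model OPTIMIZE or ZORDER; it only compares keys written in sorted order with keys written in random order. The file size of 1,000 rows and the helper names are arbitrary choices for this example.

```python
import random


def file_ranges(keys, rows_per_file):
    """Split keys into consecutive files and return each file's (min, max)."""
    chunks = (keys[i:i + rows_per_file] for i in range(0, len(keys), rows_per_file))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_read(ranges, key):
    """Count files whose min/max range cannot rule out `key`."""
    return sum(1 for low, high in ranges if low <= key <= high)


random.seed(0)
keys = list(range(10_000))
shuffled = random.sample(keys, len(keys))

sorted_ranges = file_ranges(keys, 1_000)
shuffled_ranges = file_ranges(shuffled, 1_000)

# Sorted data: ranges do not overlap, so exactly one file can contain the key.
print(files_to_read(sorted_ranges, 4_242))
# Shuffled data: nearly every file spans most of the key space.
print(files_to_read(shuffled_ranges, 4_242))
```

When the data is sorted by the key, each file covers a narrow, non-overlapping range, so a lookup can rule out every file but one from the statistics alone.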

More recommendations and considerations can be found on our recommendations page.

With those recommendations applied, we managed to get ~1s response times when requesting a single row by a single column value:

time curl http://localhost:8080/api/data/disable_optimize_ordered/872480210503_234678
{"version":5,"data":{"user_id":"872480210503_234678","sub_type":"PREPAID","activation_date":"2018-09-01","status":"ACTIVE","deactivation_date":"9999-01-01"}}
curl   0.00s user 0.01s system 1% cpu 0.982 total
---
time curl http://localhost:8080/api/data/disable_optimize_ordered/579520210231_237911
{"version":5,"data":{"user_id":"579520210231_237911","sub_type":"PREPAID","activation_date":"2018-06-24","status":"ACTIVE","deactivation_date":"9999-01-01"}}
curl   0.00s user 0.01s system 0% cpu 1.250 total
---
time curl http://localhost:8080/api/data/disable_optimize_ordered/875540210000_245810
{"version":2,"data":{"user_id":"875540210000_245810","sub_type":"PREPAID","activation_date":"2018-09-01","status":"ACTIVE","deactivation_date":"9999-01-01"}}
curl   0.00s user 0.01s system 1% cpu 0.870 total

We consider this API service experimental and hope to get feedback and contributions from the open-source (and also dev.to :)) community. Let us know what you think about our new project.