Anthony Ikeda

Posted on Jun 15, 2021

Open Source Data Ingestion Pipeline

#apachedrill #ingestion #data

In this post I'm going to set up the foundation for a data ingestion pipeline integrating RDBMS and BLOB Storage.

Install and Set up Minio
Setup and configure PostgreSQL
Install and Set up Apache Drill
Your First Data Transfer

Install and Set up Minio

On macOS run the following command:

$ brew install minio/stable/minio

You will now have a new service configured for Minio:

$ brew services list
Name          Status  User         Plist
minio         stopped

You can start Minio with the following command:

$ brew services start minio
==> Successfully started `minio` (label: homebrew.mxcl.minio)
$ brew services list
Name          Status  User         Plist
minio         started anthonyikeda /Users/user_a/Library/LaunchAgents/homebrew.mxcl.minio.plist

You can access the web console on http://localhost:9000 in your web browser with the default credentials minioadmin/minioadmin

Part of the Minio tool kit is the client mc. In order to use the client from the command line, Minio expects an alias to be configured to the URL.

You can set up a new alias to connect to with:

$ mc alias minioserv http://localhost:9000 minioadmin minioadmin
Added `minioserv` successfully.

You should now be able to execute commands against your Minio instance:

$ mc mb minioserve/apachedrill
Bucket created successfully `minioserv/apachedrill`.
$ mc ls minioserv
[2021-06-11 16:15:29 PDT]     0B apachedrill/

In order to avoid using our admin user and privileges, let's create a new user that external apps will use.

From the command line, execute the following command to create a new user apachedrill with password letmeintotheparty:

$ mc admin user add minioserv apachedrill letmeintotheparty
Added user `apachedrill` successfully.

Our user, apachedrill can now be used to access Minio, however, they have no privileges to do anything. Thankfully, Minio is S3 compliant so we can use AWS policies to enable our user to interact with Minio.

Create a new file minio-privileges.json with the following content:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
            "s3:ListBucket",
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
        ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::apachedrill/*",
        "arn:aws:s3:::apachedrill",
        "arn:aws:s3:::drilltmp/*",
        "arn:aws:s3:::drilltmp"
      ],
      "Sid": "BucketAccessForUser"
    }
  ]
}

Create the policy in Minio:

$ mc admin policy add minioserv drill-bucket-policy /path/to/minio-privileges.json

And allow our user to make use of the policy:

$ mc admin policy set minioserv drill-bucket-policy user=apachedrill
Policy `drill-bucket-policy` is set on user `apachedrill`

This allows your new user apachedrill to perform actions at a bucket level in Minio.

Set up and Configure PostgreSQL

If you haven't installed postgres before, we will follow a similar approach like we did with Minio:

$ brew install postgres
$ brew services start postgresql

You can access postgres with:

$ psql postgres
psql (13.3)
Type "help" for help.

postgres=#

For this article we will:

Create a database
Create a database user
Grant the user control over the database
Create some tables and data

Create the Database

Straight forward, we will create the database with:

postgres=# create database address_book_db;
CREATE DATABASE
postgres=#

Let's add a user and grant them control:

postgres=# create role drill_user with password 'letmein';
CREATE ROLE
postgres=# alter user drill_user with login;
postres=# grant all on database address_book_db to drill_user;
GRANT

For this article I have some tables and mock data here [https://github.com/anthonyikeda/contact-api/tree/main/src/main/resources/db/data]

Now for the heart of the article...

Install and Set up Apache Drill

Apache Drill integrates different datasources and uses a common syntax. Typically you'd run Drill in a clustered environment to handle the different streams of data processing; for this article we will stick with standalone.

Download the latest release from: https://drill.apache.org/download/

Move it to a local accessible folder, unzip it and open a command line to the drill home folder.

NOTE

You may need to install your JDBC Driver. You can easily download the JAR file from mvnrepository.org and copy the files to <DRILL_HOME>/jars/3rdparty/

Starting drill is easy:

$ bin/drill-embedded
Apache Drill 1.18.0
"Your Drill is the Drill that will pierce the heavens."
apache drill>

This starts the server and command line tool allowing you to browse the web console and execute commands from the terminal.

Open a web browser at: http://localhost:8047

Typically in a clustered environment you'd see more nodes, but since we are running in standalone mode you should see your single instance that is running.

Next we are going to connect to PostgreSQL.

Connect to PostgreSQL

Navigate to the Storage Tab and hit the Create button. This will bring up a rudimentary editor to paste our configuration.

Give the Storage option a name (address_book_db) and paste the following customizing for your environment:

{
  "type": "jdbc",
  "driver": "org.postgresql.Driver",
  "url": "jdbc:postgresql://localhost:5432/address_book_db",
  "username": "drill_user",
  "password": "letmein",
  "caseInsensitiveTableNames": false,
  "sourceParameters": {
    "minimumIdle": 5,
    "autoCommit": false,
    "connectionTestQuery": "select version() as postgresql_version",
    "dataSource.cachePrepStmts": true,
    "dataSource.prepStmtCacheSize": 250
  },
  "enabled": true
}

Hit Create and the new Storage should test the connection to the database and be ready to use.

If you return to the drill-cli you can now inspect your database connection:

apache drill> show databases;
+----------------------------------+
|           SCHEMA_NAME            |
+----------------------------------+
| addressbookdb.address_book_db    |
| addressbookdb.information_schema |
| addressbookdb.pg_catalog         |
| addressbookdb.public             |
| addressbookdb                    |
| sys                              |
+----------------------------------+

Now we will connect to the database and view the tables:

apache drill> use addressbookdb.address_book_db
+------+-----------------------------------------------------------+
|  ok  |                          summary                          |
+------+-----------------------------------------------------------+
| true | Default schema changed to [addressbookdb.address_book_db] |
+------+-----------------------------------------------------------+
1 row selected (0.13 seconds)
apache drill (addressbookdb.address_book_db)> show tables;
+-------------------------------+--------------------------------+
|      TABLE_SCHEMA             |            TABLE_NAME          |
+-------------------------------+--------------------------------+
| addressbookdb.address_book_db | flyway_schema_history_pk       |
| addressbookdb.address_book_db | flyway_schema_history_s_idx.   |
| addressbookdb.address_book_db | pk_address_id                  |
| addressbookdb.address_book_db | pk_contact_id                  |
| addressbookdb.address_book_db | address_address_id_seq         |
| addressbookdb.address_book_db | contacts_contact_id_seq        |
| addressbookdb.address_book_db | address                        |
| addressbookdb.address_book_db | contacts                       |
| addressbookdb.address_book_db | flyway_schema_history          |
+-------------------------------+--------------------------------+
9 rows selected (0.347 seconds)

apache drill (addressbookdb.address_book_db)>

We can query the tables using standard SQL syntax:

apache drill (addressbookdb.address_book_db)> select * from addressbookdb.contacts limit 10;
+------------+--------------------+-------------------+---------------------------+
| contact_id | contact_first_name | contact_last_name |   contact_email_address   |
+------------+--------------------+-------------------+---------------------------+
| 100        | Bert               | Gedling           | bgedling0@myspace.com     |
| 101        | Dacie              | Beddingham        | dbeddingham1@usda.gov     |
| 102        | Lina               | Wey               | lwey2@deviantart.com      |
| 103        | Savina             | Housbie           | shousbie3@google.com.au   |
| 104        | Delphinia          | Lente             | dlente4@delicious.com     |
| 105        | Zonda              | Blinder           | zblinder5@google.es       |
| 106        | Charmian           | Costley           | ccostley6@java.com        |
| 107        | Darcy              | Bulluck           | dbulluck7@flavors.me      |
| 108        | Tim                | Duesberry         | tduesberry8@google.com.br |
| 109        | Torr               | Cordero           | tcordero9@ftc.gov         |
+------------+--------------------+-------------------+---------------------------+
10 rows selected (0.228 seconds)

It's time to turn our attention back to Minio!

Wiring up Minio and pushing data

Open up the Storage tab in Drill again and create a new Datasource.

Give it a name (minio) and paste the following configuration:

{
  "type": "file",
  "connection": "s3a://apachedrill",
  "config": {
    "fs.s3a.secret.key": "letmeintotheparty",
    "fs.s3a.access.key": "apachedrill",
    "fs.s3a.connection.ssl.enabled": "false",
    "fs.s3a.endpoint": "http://127.0.0.1:9000",
    "fs.s3a.path.style.access": "false"
  },
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/data",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    },
    "avro": {
      "type": "avro",
      "extensions": [
        "avro"
      ]
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "pcap": {
      "type": "pcap",
      "extensions": [
        "pcap"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "pcapng": {
      "type": "pcapng",
      "extensions": [
        "pcapng"
      ]
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "fieldDelimiter": "|"
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "fieldDelimiter": "\t"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ]
    },
    "syslog": {
      "type": "syslog",
      "extensions": [
        "syslog"
      ],
      "maxErrors": 10
    },
    "ltsv": {
      "type": "ltsv",
      "extensions": [
        "ltsv"
      ]
    },
    "hdf5": {
      "type": "hdf5",
      "extensions": [
        "h5"
      ],
      "defaultPath": null
    },
    "spss": {
      "type": "spss",
      "extensions": [
        "sav"
      ]
    },
    "shp": {
      "type": "shp",
      "extensions": [
        "shp"
      ]
    },
    "excel": {
      "type": "excel",
      "extensions": [
        "xlsx"
      ],
      "lastRow": 1048576
    }
  },
  "enabled": true
}

Key things to note:

S3 Connection

 "connection": "s3a://apachedrill",

This is the bucket we are targeting when we connect to Minio. Any workspaces we set up are subfolders of this bucket. Earlier we created a policy that outlined what can be done with this bucket and applied the policy to our user apachedrill.

S3 Configuration

"config": {
    "fs.s3a.secret.key": "letmeintotheparty",
    "fs.s3a.access.key": "apachedrill",
    "fs.s3a.connection.ssl.enabled": "false",
    "fs.s3a.endpoint": "http://127.0.0.1:9000",
    "fs.s3a.path.style.access": "false"
  }

Here we are configuring the connection to the Minio server, from which endpoint to use (fs.s3a.endpoint) to the credentials used to access Minio (fs.s3a.access.key, fs.s3a.secret.key)

It's important to override the fs.s3a.endpoint otherwise Drill will attempt to connect to S3 on AWS which we don't want right now.

Workspace Directories

"workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/data",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  }

These are the directories in the bucket we have nominated that we can put data in. When we push data to Minio from Drill, we will reference these directories as writeable targets.

Push data to Minio

What we will do next is create a materialized view from the data in postgresql and write the materialized view into Minio.

In the Apache Drill console type:

apache drill> use minio.root;
+------+----------------------------------------+
|  ok  |                summary                 |
+------+----------------------------------------+
| true | Default schema changed to [minio.root] |
+------+----------------------------------------+
1 row selected (0.195 seconds)
apache drill (minio.root)>

Now, we will create a table in Minio using data from PostgreSQL:

apache drill (minio.root)> create table contact_cities (contact_id, contact_first_name, contact_last_name, city) AS
2...............semicolon> select a.contact_id, c.contact_first_name, c.contact_last_name, a.city
3...............semicolon> from addressbookdb.address a
4...............semicolon> join addressbookdb.contacts c 
5...............semicolon> on c.contact_id = a.contact_id;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 1001                      |
+----------+---------------------------+
1 row selected (1.357 seconds)

What we have just done is taken the data in the form of our query (select a.contact_id, c.contact_first_name, c.contact_last_name, a.city from addressbookdb.address a join addressbookdb.contacts c on c.contact_id = a.contact_id;) and created a new view in Minio stored in parquet format.

If open your Minio browser and navigate to the apachedrill/data folder you should see a new folder: /contact_cities

In there should be a file called 0_0_0.parquet;

Let's see if we got the results we were after:

apache drill (minio.root)> select * from contact_cities limit 10;
+------------+--------------------+-------------------+------------+
| contact_id | contact_first_name | contact_last_name |    city    |
+------------+--------------------+-------------------+------------+
| 278        | Jania              | Kilalea           | Rochester  |
| 1          | Jonah              | Kent              | Lalor Park |
| 100        | Bert               | Gedling           | Newark     |
| 101        | Dacie              | Beddingham        | Sioux City |
| 102        | Lina               | Wey               | Lexington  |
| 103        | Savina             | Housbie           | Clearwater |
| 104        | Delphinia          | Lente             | Spokane    |
| 105        | Zonda              | Blinder           | New Castle |
| 106        | Charmian           | Costley           | Orlando    |
| 107        | Darcy              | Bulluck           | Dayton     |
+------------+--------------------+-------------------+------------+
10 rows selected (0.774 seconds)
apache drill (minio.root)> select count(*) from contact_cities;
+--------+
| EXPR$0 |
+--------+
| 1001   |
+--------+
1 row selected (0.491 seconds)

Looks like we were successful!

In the next article we will set up Kafka and query records coming from Kafka in Apache Drill!

DEV Community