Akmal Chaudhri for SingleStore

Quick tip: Using Pentaho Data Integration (PDI) with SingleStoreDB

Abstract

Continuing our use and evaluation of open-source and commercial tools with SingleStoreDB, we'll focus on Pentaho Data Integration (PDI) in this short article. We'll use the Community Edition (CE) of PDI, and work through the PDI Tutorial to demonstrate ETL with SingleStoreDB as our database system.

The project files used in this article are available on GitHub.

Introduction

This article uses a virtual machine running Ubuntu 20.04.3 with OpenJDK 1.8.

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use PDI Demo Group as our Workspace Group Name and pdi-demo as our Workspace Name. We'll make a note of our password and host name. Finally, we'll create a new database using the SQL Editor:

CREATE DATABASE pdi_demo;

Download and unpack PDI

PDI can be downloaded from SourceForge. The version used when writing this article was pdi-ce-9.3.0.0-428. The zip file was saved in the Downloads folder and unpacked to a data-integration folder.
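As a sketch, assuming the zip file sits in ~/Downloads (the exact location is an assumption; adjust as needed), the unpacking step might look like this:

# unpack the archive; it extracts into a data-integration folder
cd ~/Downloads
unzip pdi-ce-9.3.0.0-428.zip
cd data-integration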

From the data-integration folder, we'll run:

./spoon.sh

Several warning and error messages may appear; we'll need to fix these before the software runs correctly.

1. no libwebkitgtk-1.0 detected

This can be fixed by editing the following file:

sudo vim /etc/apt/sources.list

and adding the following line at the end of the file, then saving it:

deb http://cz.archive.ubuntu.com/ubuntu bionic main universe

Next, we'll run the following commands:

sudo apt update

sudo apt install libwebkitgtk-1.0-0

2. Failed to load module "canberra-gtk-module"

This can be fixed by installing the following packages:

sudo apt install libcanberra-gtk-module libcanberra-gtk3-module

3. Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper

This can be fixed by editing the following file:

sudo vim /etc/java-8-openjdk/accessibility.properties

and commenting out the following line, then saving the file:

assistive_technologies=org.GNOME.Accessibility.AtkWrapper
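After the edit, the line should be prefixed with a #, which is how properties files mark comments:

#assistive_technologies=org.GNOME.Accessibility.AtkWrapper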

Download MySQL connector

PDI requires a MySQL JDBC driver, so we need to obtain mysql-connector-java-5.1.49 from the MySQL archives. Unpacking the downloaded archive, we'll find two jar files:

mysql-connector-java-5.1.49-bin.jar

mysql-connector-java-5.1.49.jar

We'll copy these two files to data-integration/lib.
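As a minimal sketch, assuming the connector archive was saved to ~/Downloads and PDI lives in ~/data-integration (the archive name and both paths are assumptions; adjust to your setup):

# unpack the connector archive and copy both jars into PDI's lib folder
cd ~/Downloads
tar xzf mysql-connector-java-5.1.49.tar.gz
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49*.jar ~/data-integration/lib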

Launch PDI

From the data-integration folder, we'll again run:

./spoon.sh

This time, the software should start correctly.

PDI Tutorial

The PDI tutorial works through an example that reads data from CSV files, creates data pipelines, performs data cleanup, and eventually writes the data to a database system. The project files can be downloaded from GitHub and opened in PDI to save time.

We'll need to adjust our project files with the full paths to the CSV files:

data-integration/samples/transformations/files/sales_data.csv

data-integration/samples/transformations/files/Zipssortedbycitystate.csv

and configure the database settings, as follows:

  • Connection name: SingleStoreDB
  • Connection type: MySQL
  • Access: Native (JDBC)
  • Host Name: <host>
  • Database Name: pdi_demo
  • Port Number: 3306
  • Username: admin
  • Password: <password>

We'll replace <host> and <password> with the values from our SingleStoreDB Cloud account.
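Before running the pipelines, it can be worth confirming connectivity outside PDI. Since SingleStoreDB speaks the MySQL wire protocol, a minimal check with the mysql client (assuming it is installed locally) might look like this:

# replace <host> with your workspace host; enter the password when prompted
mysql -u admin -h <host> -P 3306 -p pdi_demo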

We'll make any other file path adjustments, as required.

When ready, we'll run the Transformation pipeline using Action > Run. The result should be as shown in Figure 1.

Figure 1. Transformation.

Similarly, we'll run the Job pipeline using Action > Run. The result should be as shown in Figure 2.

Figure 2. Job.

We can check our SALES_DATA table in SingleStoreDB. It is quite a wide table with 2823 rows.
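For example, a quick sanity check from the SQL Editor:

USE pdi_demo;
SELECT COUNT(*) FROM SALES_DATA;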

Summary

In this short article, we have successfully configured and run PDI using SingleStoreDB Cloud as our backend database system. Our data pipelines have cleaned the CSV data and written the final output to SingleStoreDB, demonstrating a complete ETL example.
