Akmal Chaudhri for SingleStore

Quick tip: Using Pentaho Data Integration (PDI) with SingleStoreDB

Abstract

Continuing our use and evaluation of open-source and commercial tools with SingleStoreDB, we'll focus on Pentaho Data Integration (PDI) in this short article. We'll use the Community Edition (CE) of PDI, and work through the PDI Tutorial to demonstrate ETL with SingleStoreDB as our database system.

The project files used in this article are available on GitHub.

Introduction

This article uses a virtual machine running Ubuntu 20.04.3 with OpenJDK 1.8.

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use PDI Demo Group as our Workspace Group Name and pdi-demo as our Workspace Name. We'll make a note of our password and host name. Finally, we'll create a new database using the SQL Editor:

CREATE DATABASE pdi_demo;

Download and unpack PDI

PDI can be downloaded from SourceForge. The version used when writing this article was pdi-ce-9.3.0.0-428. The zip file was saved in the Downloads folder and unpacked to a data-integration folder.
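As a sketch, assuming the zip file sits in ~/Downloads (the exact location is an assumption; adjust as needed), the unpacking step might look like this:

# unpack the archive; it extracts into a data-integration folder
cd ~/Downloads
unzip pdi-ce-9.3.0.0-428.zip
cd data-integration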

From the data-integration folder, we'll run:

./spoon.sh

Several warning and error messages may appear; we'll need to fix these before the software runs correctly.

1. no libwebkitgtk-1.0 detected

This can be fixed by editing the following file:

sudo vim /etc/apt/sources.list

and adding the following line at the end of the file, then saving it:

deb http://cz.archive.ubuntu.com/ubuntu bionic main universe

Next, we'll run the following commands:

sudo apt update

sudo apt install libwebkitgtk-1.0-0

2. Failed to load module "canberra-gtk-module"

This can be fixed by installing the following packages:

sudo apt install libcanberra-gtk-module libcanberra-gtk3-module

3. Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper

This can be fixed by editing the following file:

sudo vim /etc/java-8-openjdk/accessibility.properties

and commenting out the following line, then saving the file:

assistive_technologies=org.GNOME.Accessibility.AtkWrapper
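After the edit, the line should be prefixed with a #, which is how properties files mark comments:

#assistive_technologies=org.GNOME.Accessibility.AtkWrapper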

Download MySQL connector

PDI requires a MySQL JDBC driver, so we need to obtain mysql-connector-java-5.1.49 from the MySQL archives. Unpacking the downloaded archive, we'll find two jar files:

mysql-connector-java-5.1.49-bin.jar

mysql-connector-java-5.1.49.jar

We'll copy these two files to data-integration/lib.
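As a minimal sketch, assuming the connector archive was saved to ~/Downloads and PDI lives in ~/data-integration (the archive name and both paths are assumptions; adjust to your setup):

# unpack the connector archive and copy both jars into PDI's lib folder
cd ~/Downloads
tar xzf mysql-connector-java-5.1.49.tar.gz
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49*.jar ~/data-integration/lib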

Launch PDI

From the data-integration folder, we'll again run:

./spoon.sh

This time, the software should start correctly.

PDI Tutorial

The PDI tutorial works through an example that reads data from CSV files, creates data pipelines, performs data cleanup, and eventually writes the data to a database system. The project files can be downloaded from GitHub and opened in PDI to save time.

We'll need to adjust our project files with the full paths to the CSV files:

data-integration/samples/transformations/files/sales_data.csv

data-integration/samples/transformations/files/Zipssortedbycitystate.csv

and configure the database settings, as follows:

  • Connection name: SingleStoreDB
  • Connection type: MySQL
  • Access: Native (JDBC)
  • Host Name: <host>
  • Database Name: pdi_demo
  • Port Number: 3306
  • Username: admin
  • Password: <password>

We'll replace <host> and <password> with the values from our SingleStoreDB Cloud account.
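Before running the pipelines, it can be worth confirming connectivity outside PDI. Since SingleStoreDB speaks the MySQL wire protocol, a minimal check with the mysql client (assuming it is installed locally) might look like this:

# replace <host> with your workspace host; enter the password when prompted
mysql -u admin -h <host> -P 3306 -p pdi_demo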

We'll make any other file path adjustments, as required.

When ready, we'll run the Transformation pipeline using Action > Run. The result should be as shown in Figure 1.

Figure 1. Transformation.

Similarly, we'll run the Job pipeline using Action > Run. The result should be as shown in Figure 2.

Figure 2. Job.

We can check our SALES_DATA table in SingleStoreDB. It is quite a wide table with 2823 rows.
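For example, a quick sanity check from the SQL Editor:

USE pdi_demo;
SELECT COUNT(*) FROM SALES_DATA;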

Summary

In this short article, we have successfully configured and run PDI using SingleStoreDB Cloud as our backend database system. Our data pipelines have cleaned the CSV data and written the final output to SingleStoreDB, demonstrating a complete ETL example.
