David Mwandairo

Posted on Jun 15 • Edited on Jul 5

Linux Fundamentals for Data Engineering

#linux #dataengineering #cli #learning

Introduction

Linux has become the go-to platform for data engineering workloads in the modern data landscape. Prominent data tools that run on it include:

Amazon Redshift, a cloud-based data warehouse that runs on a customized version of Linux.
Apache Spark, a big data framework that runs on Linux clusters.
Apache Airflow, an orchestration tool that is deployed on Linux servers.

Because these industry-standard tools run on Linux, proficiency with it is now a core competency. Data engineers need to know how to navigate the file system, manipulate files and automate data workflows from the command line interface. This article covers some of the basic commands that a Linux beginner should know.

Connecting to a Remote Server

To demonstrate the commands, I will first connect to a remote Linux server using the secure shell (ssh) command. Since I use the Fedora Linux distribution, I did not have to install any additional software to access the server; I can do it directly from the terminal. To connect on a specific port, I added -p 22 after the server IP address (22 is the default SSH port, so the flag is optional here) and entered the password to log in.
ssh root@159.65.222.96 -p 22

Once I logged in, I first refreshed the server's package index using the command sudo apt update. This only downloads the latest package lists; installing the newer versions requires sudo apt upgrade. Package management commands differ between Linux distributions. In this case, we use apt since the server runs Ubuntu.

After completing the update, I created my own user account on the server using the command sudo adduser DavidMw to isolate my workflow.

I got an error asking me to change the username because it does not follow adduser's default naming rules (it contains uppercase letters), but I intend to keep it as it is to make it distinct. To bypass the error, I use the command sudo adduser DavidMw --force-badname and proceed to create my user account.

To switch to my user, I run the command su DavidMw.

Since my account is now active, I can alternatively log in to the server directly without going through the root user. To do this, I run the command ssh DavidMw@159.65.222.96 -p 22.

Checking the PostgreSQL Version

To check the PostgreSQL version available, I run the command psql --version.

Next, I check the status of the PostgreSQL service using the command sudo systemctl status postgresql.

Now that I have confirmed the service is active, I switch to the postgres system user using the command sudo -i -u postgres. After that, I open the PostgreSQL interactive terminal using the command psql.

Creating a Database and Schema

Now that I am at the psql prompt, I can create my database using the SQL statement CREATE DATABASE DavidMw;. Because the name is not double-quoted, PostgreSQL folds it to lowercase and stores it as davidmw.

To confirm that it has been created, I run the command \l and find it in the list of databases. To exit the list, I press 'q'.

Next, I create a schema named 'staging' within my database, into which I will upload my data. To do that, I first connect to my database using the command \c followed by the database name, then I run the statement CREATE SCHEMA staging;.
Note: Always include a semicolon (;) at the end of your SQL statements to mark the end of the statement.

To view the schema, I run the command \dn.

Upload Data to a Schema

To upload data to a schema, we use DBeaver, which is a universal database management tool. We connect it to the PostgreSQL server following the steps below.
First, I create a new connection in DBeaver.

Next, I set up the connection using the details I created previously.

Next, I establish the connection and connect to my database.

To import the data, I right-click the staging schema and choose the "Import Data" option, after which I select the file I want to upload.

In this case, I select the 'salary_data.csv' file and continue.

After a few seconds, the data is uploaded.

Common Linux Commands

Linux's core commands are used to create, edit and navigate files. Because files are created and edited within directories, a data engineer should always know where they are in the file system. The command that shows your current location is pwd (print working directory).

To see the contents of a directory, we use the command ls (list). To create a new directory, we use the command mkdir (make directory); here I use it to create 'newfolder'.
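The three commands above can be tried together in a short session. This is a minimal sketch that assumes a throwaway directory created with mktemp, so that nothing in your home folder changes; the variable name workdir is only for illustration.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"

pwd                # print the absolute path of the current directory
mkdir newfolder    # create a directory named 'newfolder'
ls                 # list the contents; 'newfolder' now appears
```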

When working with remote servers, it is possible to upload files and directories directly using the scp (secure copy) command. To locate the files to be uploaded, I first navigated into the folder containing them using the cd (change directory) command. Uploading a folder differs from uploading a file only in that it requires the -r option after the scp command. In my case, I uploaded the 'hotel_data.csv' and 'salary_data.csv' files and the 'stocks' folder to my account on the server.
Note: Run the scp command from the command line of the origin device (the machine that holds the files), not from the server.

To create files in Linux, one can use either the echo or touch command. The echo command prints text, which can be redirected into a file with the > operator, while the touch command creates an empty file. I created the 'file1.txt' and 'file2.py' files using the two commands respectively.
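A minimal sketch of both approaches follows. The text written into file1.txt is a placeholder chosen for this example, and the temporary directory is an assumption to keep the demo isolated.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"

echo "Hello, Linux" > file1.txt   # '>' redirects echo's output into file1.txt
touch file2.py                    # creates file2.py with no content
ls                                # both files are now listed
```

Note that > overwrites an existing file, while >> appends to it.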

To insert data into the empty 'file2.py' file, I opened it in the vim editor by running the vi command followed by the file name. Press "i" to enter insert mode and type the content. Once done, press "Esc", then type ":wq" and press Enter to save the changes and quit.

To view the changes made, use the command cat followed by the file name.
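Since vim is interactive, it is hard to show in a copy-and-paste example. The sketch below swaps in a different technique, a shell here-document, to write the file without an editor, then uses cat to view it. The file content is hypothetical.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"

# Write two lines into file2.py without opening an editor.
cat > file2.py <<'EOF'
# hypothetical content for illustration
print("hello from file2")
EOF

cat file2.py    # print the file's contents to the terminal
```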

Files can be listed in different formats. On its own, the ls command lists the files in a directory in a basic format. For more detail, we use the following options:

ls -l   # lists the files in a directory in the long listing format.
ls -a   # lists all the files, including the hidden ones.
ls -la  # combines both: long listing format for all files, including the hidden ones.
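To see the difference between these options, it helps to have a hidden file in the directory. The sketch below creates one visible and one hidden file; both file names are hypothetical.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"
touch report.csv .hidden_config   # names starting with '.' are hidden

ls        # shows only report.csv
ls -a     # also shows '.', '..' and .hidden_config
ls -l     # long format: permissions, owner, size and date for report.csv
ls -la    # long format for every entry, hidden ones included
```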

Copying and moving files are also common tasks in Linux. They are performed with the cp and mv commands respectively. The cp command comes in handy when creating backup or duplicate files. The mv command moves files from one folder to another and can also be used to rename files. Directories can also be copied by including -r after the cp command.
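The following sketch walks through each case. It reuses the 'salary_data.csv' and 'stocks' names from earlier in the article; the 'archive' folder, the backup file names and the file contents are made up for illustration.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"
echo "id,salary" > salary_data.csv   # placeholder content
mkdir stocks archive
touch stocks/prices.csv              # hypothetical file inside 'stocks'

cp salary_data.csv salary_data_backup.csv                       # duplicate a file
mv salary_data_backup.csv archive/                              # move it into 'archive'
mv archive/salary_data_backup.csv archive/salary_data_old.csv   # rename it
cp -r stocks stocks_copy                                        # copy a directory and its contents
```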

To delete files, we use the rm command. Folders can also be deleted by adding -r to the rm command, which deletes their contents recursively. Files deleted with rm do not go to a trash folder, so check the path before running it. To move up a directory, we use the cd .. command.
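A short sketch of these commands, again inside a temporary directory so that nothing important is deleted. The nested folder and log file are hypothetical.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p newfolder/nested
touch file1.txt newfolder/nested/old.log

rm file1.txt       # delete a single file
rm -r newfolder    # delete a directory and everything inside it

mkdir stocks
cd stocks          # move into 'stocks'
cd ..              # move back up to the parent directory
pwd                # prints the temporary directory again
```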

File Permissions

Setting file permissions helps data engineers determine who has access to particular files. File permissions are displayed as a 10-character string. An example of a permission string is -rw-rw-r--, where the first character represents the file type: "-" signifies a regular file, "d" a directory and "l" a link. The next 3 characters represent the file owner's permissions, the middle 3 characters represent the group's permissions and the last 3 characters represent the permissions of all other users. Within each set, "r" means read, "w" means write and "x" means execute. There are two ways to assign file permissions: symbolic notation, which uses letters such as u (owner), g (group) and o (others), and numeric (octal) notation, in which read is 4, write is 2 and execute is 1.

The example below shows how file permissions are given on the command line using the chmod command.
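As a minimal sketch, the commands below set the permission string from the example above using numeric notation and then add execute permission for the owner using symbolic notation. The file name is reused from earlier in the article, and the temporary directory is an assumption.

```shell
# Work in a temporary directory (an assumption made for this sketch).
workdir="$(mktemp -d)"
cd "$workdir"
touch file2.py

chmod 664 file2.py    # numeric: owner rw- (4+2), group rw- (4+2), others r-- (4)
ls -l file2.py        # permission string starts with -rw-rw-r--

chmod u+x file2.py    # symbolic: add execute (x) for the owner (u)
ls -l file2.py        # permission string starts with -rwxrw-r--
```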

Conclusion

Mastering the Linux fundamentals transforms a data engineer from someone who just runs scripts into someone who can design, build and deploy robust data systems. The commands are important for tasks such as file navigation and handling, remote access and file permissions, which form the foundation of data engineering. Through frequent practice, the command line becomes a powerful tool for running data engineering tasks at scale.
