David Mwandairo

Linux Fundamentals for Data Engineering

Introduction

The Linux operating system has become the go-to platform for data engineering workloads in the modern data landscape. Prominent data-related software that it powers includes:

  • Amazon Redshift, a cloud-based data warehouse that runs on a customized version of Linux.
  • Apache Spark, a big data framework that runs on Linux clusters.
  • Apache Airflow, an orchestration tool that is deployed on Linux servers.

Because these industry-standard tools depend on Linux to run, proficiency in Linux is now a core competency. Data engineers therefore need to know how to navigate, manipulate and automate data workflows using the command line interface. This article covers some of the basic commands that a Linux beginner should know.

Connecting to a Remote Server

To demonstrate the commands, I first connect to a remote Linux server using the secure shell (ssh) command. Since I use the Fedora Linux distribution, I did not have to install any additional software to access the server; I can connect directly from the terminal. To connect through a specific port, I added -p 22 after the server IP address, then entered the password to log in. Port 22 is the SSH default, so the flag is optional here.
ssh root@159.65.222.96 -p 22

Logging in to the remote server

Once logged in, I first refreshed the server's package lists using the command sudo apt update, so that the latest package versions are available to install. Note that apt update only refreshes the lists; installing the newer versions requires sudo apt upgrade. Package management commands differ between Linux distributions (Fedora, for example, uses dnf). In this case, we use apt because the server runs Ubuntu.
Updating the server

After completing the update, I created my own user account on the server using the command sudo adduser DavidMw to isolate my workflow.

Adding the user

I got an error asking me to change my user name to fit the naming convention, most likely because it contains uppercase letters, but I wanted to keep it as it is to make it distinct. To bypass the error, I ran the command sudo adduser DavidMw --force-badname and proceeded to create my user account.

Force add my specific user name

To switch to my user, I run the command su DavidMw.

Switch to my user account

Since my account is now active, I can alternatively log in to the server directly, without going through the root user, as shown below. To do this, I run the command ssh DavidMw@159.65.222.96 -p 22.

Direct log in to my user account

Checking PostgreSQL Version

To check the installed PostgreSQL version, I run the command psql --version.

Check Postgresql version

Next, I check the status of the PostgreSQL service using the command sudo systemctl status postgresql.

Check psql server status

Now that I have confirmed the service is running, I switch to the postgres system user using the command sudo -i -u postgres. After that, I open the PostgreSQL interactive terminal using the command psql.

Accessing the Postgres user interface

Creating a Database and Schema

Now that I am in the psql interface, I can create my database using the SQL statement CREATE DATABASE DavidMw;.

Create database

To confirm that it has been created, I run the command \l and can see it in the list, as shown below. To exit the list of databases, I press 'q'.

List database command

Database list

Next, I create a schema named 'staging' within my database, into which I will upload my data. To do that, I first connect to my database using the command \c followed by the database name (\c davidmw, since PostgreSQL stores unquoted names in lowercase), then run the statement CREATE SCHEMA staging;.
Note: Always include a semicolon (;) at the end of each SQL statement to mark where it ends.

create schema

To view the schema, I run the command \dn.

View schema
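
The database and schema steps above can also be kept in a file and replayed, which is handy when the same setup has to be repeated on another server. The sketch below only writes and prints the file. The file name setup_staging.sql and the final psql invocation (left as a comment) are assumptions and not part of the walkthrough above.

```shell
# Write the statements used above into a file (hypothetical name: setup_staging.sql).
sql_dir=$(mktemp -d)
sql_file="$sql_dir/setup_staging.sql"

cat > "$sql_file" <<'SQL'
CREATE DATABASE DavidMw;  -- unquoted, so PostgreSQL stores the name as davidmw
\c davidmw
CREATE SCHEMA staging;
\dn
SQL

cat "$sql_file"   # review the statements before running them

# Assumed way to apply the file on the server, run as the postgres system user
# from a location that user can read:
#   psql -f setup_staging.sql
```

Keeping the statements in a file makes the setup repeatable and easy to review before it touches the server.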

Upload Data to a Schema

To upload data to a schema, we use DBeaver, a universal database management tool. First, we connect it to the PostgreSQL server by following the steps below.
I start by creating a new connection in DBeaver.
create new connection

Next, I set up the connection using the details I created previously.
test connection

Next, I establish my connection and connect to my database.
connection established

To import the data, I right-click the staging schema and choose the "Import Data" option, after which I select the file I want to upload.
import data

In this case, I select the 'salary_data.csv' file and continue.
select salarydata.csv

After a few seconds, the data is uploaded.
Data imported

Common Linux Commands

The core Linux commands are necessary for creating, editing and navigating through files. Files are created, edited and updated within directories, so a data engineer should always know where they are in the file system. The command that shows your current location is pwd (print working directory), as shown below:

pwd command

To see the contents of a directory, we use the command ls (list). To create a new directory, we use the command mkdir (make directory); here it creates 'newfolder', as shown below:

ls and mkdir
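
As a rough illustration of how pwd, ls, mkdir and cd fit together, here is a minimal sketch that runs in a throwaway directory. The use of mktemp is my assumption to keep the example from touching real files; 'newfolder' is the directory name from the text.

```shell
# Work in a temporary directory so nothing in the home folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

pwd                # print the absolute path of the current directory
mkdir newfolder    # create a directory named 'newfolder'
ls                 # list the contents: newfolder
cd newfolder       # move into the new directory
pwd                # the path now ends in /newfolder
```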

When working with remote servers, files and directories can be uploaded directly using the scp (secure copy) command. To locate the file to be uploaded, I first navigated into the folder containing it using the cd (change directory) command. Uploading a folder differs from uploading a file only in that '-r' (recursive) is added after the scp command. In my case, I uploaded the 'hotel_data.csv' and 'salary_data.csv' files and the 'stocks' folder to my account on the server using the process below.
Note: Run the scp command in the command line interface of the origin device, not on the server.

scp upload file and folder

list scp file and folder
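
The general shape of the upload commands can be sketched as follows. This is a dry run that only prints the commands so they can be reviewed first. The user, IP address and file names come from the text, while the destination (the remote home directory, written as an empty path after the colon) is an assumption. Note that scp takes the port with an uppercase -P, unlike ssh.

```shell
# Run on the local (origin) machine, not on the server.
remote="DavidMw@159.65.222.96"   # user and server address from the article

# An empty path after the colon means the remote user's home directory (assumed target).
upload_files="scp -P 22 hotel_data.csv salary_data.csv ${remote}:"
upload_dir="scp -P 22 -r stocks ${remote}:"   # -r copies the folder recursively

# Print the commands instead of running them.
echo "$upload_files"
echo "$upload_dir"
```

Once the printed commands look right, they can be pasted into the local terminal from the folder that contains the files.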

To create files in Linux, one can use either the echo or the touch command. The echo command, combined with output redirection (>), writes content directly into a file, while the touch command creates an empty file. Below is an example of how I created the 'file1.txt' and 'file2.py' files using the two commands respectively.

echo and touch commands
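
A minimal sketch of the two approaches, assuming a throwaway directory and a made-up line of text for 'file1.txt':

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo "Hello, Linux" > file1.txt   # create file1.txt containing one line of text
touch file2.py                    # create file2.py with no content

cat file1.txt                     # prints: Hello, Linux
wc -c < file2.py                  # byte count is 0 for the empty file
ls                                # both files are now listed
```

Keep in mind that > overwrites an existing file, whereas >> appends to it.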

To insert data into the empty 'file2.py' file, I opened it in the vim editor by running the vi command as shown below. Press "i" to enter insert mode and type the content. Once done, press "Esc", type ":wq" and press Enter to save the changes and quit.

vi command

Enter data and quit

To view the changes made, use the command cat followed by the file name.

cat command

Directory listings can be viewed in different formats. On its own, the ls command lists the files within a directory in the basic format. To view them in more detail, we use the commands:

ls -l   # long listing format: permissions, owner, size and modification time
ls -la  # long listing format for all files, including hidden ones
ls -a   # lists all files, including hidden ones

listing files
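
To make the effect of -a concrete, the sketch below creates one ordinary file and one hidden file (a name starting with a dot) in a temporary directory. Both file names are hypothetical.

```shell
list_dir=$(mktemp -d)
cd "$list_dir"
touch visible.txt .hidden.txt   # a leading dot makes a file hidden

ls        # shows visible.txt only
ls -a     # also shows ., .. and .hidden.txt
ls -l     # long format for visible.txt
ls -la    # long format for every entry, hidden ones included
```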

Copying and moving files are also important operations in Linux, carried out by the cp and mv commands respectively. The cp command comes in handy when creating backup or duplicate files. The mv command moves files from one folder to another and can also be used to rename files. Directories can be copied by including -r after the cp command, as shown below.

copy file

copy file into directory

mv rename

mv move to folder

mv to rename in different folder

cp directory
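
The copy, move and rename operations above can be sketched in one short sequence. The file and folder names here are made up for illustration and everything happens in a temporary directory.

```shell
copy_dir=$(mktemp -d)
cd "$copy_dir"
echo "id,amount" > data.csv      # hypothetical sample file
mkdir archive

cp data.csv data_backup.csv      # duplicate a file as a backup
cp data.csv archive/             # copy a file into a directory
mv data_backup.csv backup.csv    # rename a file in place
mv backup.csv archive/           # move a file into another directory
mv data.csv archive/raw.csv      # move and rename in one step
cp -r archive archive_copy       # copy a directory and everything in it

ls -R                            # show the resulting tree
```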

To delete files, we use the rm command. Folders can be deleted by adding "-r" to the rm command, which deletes their contents recursively. To move up one directory, we use the cd .. command.

delete files
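
A small sketch of deleting a file and a folder, again in a temporary directory with hypothetical names. Deletions made with rm are permanent, so it is worth checking the path with pwd and ls first.

```shell
rm_dir=$(mktemp -d)
cd "$rm_dir"
mkdir -p old_data/2023
touch scratch.txt old_data/2023/prices.csv

rm scratch.txt     # delete a single file
rm -r old_data     # delete a folder and everything inside it
ls                 # the directory is now empty
cd ..              # move up one directory
```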

File Permissions

Setting file permissions helps data engineers control who has access to particular files. Permissions are displayed as a 10-character string, for example -rw-rw-r--. The first character represents the file type: "-" signifies a regular file, "d" a directory and "l" a link. The next 3 characters represent the file owner's permissions, the middle 3 the group's permissions and the last 3 the permissions of all other users. Within each set, "r" means read, "w" means write, "x" means execute and "-" means the permission is not granted. There are two ways in which file permissions can be assigned, as shown in the tables below.

chown syntax 1

chown syntax 2
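
As an illustration of how the 10-character string breaks down, the sketch below splits the example string into its four parts and converts the three permission sets to numeric form using the standard weights r=4, w=2 and x=1. The helper name to_digit is made up for this example.

```shell
perm="-rw-rw-r--"   # example string from the text

# Split the string into its four parts by character position.
ftype=$(printf '%s' "$perm" | cut -c1)
owner=$(printf '%s' "$perm" | cut -c2-4)
group=$(printf '%s' "$perm" | cut -c5-7)
others=$(printf '%s' "$perm" | cut -c8-10)

# Convert one rwx triplet to its numeric value: r=4, w=2, x=1.
to_digit() {
    d=0
    case $1 in r??) d=$((d + 4)) ;; esac
    case $1 in ?w?) d=$((d + 2)) ;; esac
    case $1 in ??x) d=$((d + 1)) ;; esac
    printf '%s' "$d"
}

numeric="$(to_digit "$owner")$(to_digit "$group")$(to_digit "$others")"
echo "type=$ftype owner=$owner group=$group others=$others numeric=$numeric"
# prints: type=- owner=rw- group=rw- others=r-- numeric=664
```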

The example below shows how file permissions are given on the command line using the chmod command.

chmod in action
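
For readers who cannot see the screenshot, here is a minimal sketch of chmod in both symbolic and numeric form. The file name report.csv is hypothetical, and the ls output is trimmed to the 10-character permission string.

```shell
perm_dir=$(mktemp -d)
cd "$perm_dir"
touch report.csv                 # hypothetical file name

chmod u=rw,g=rw,o=r report.csv   # symbolic mode: set each class explicitly
ls -l report.csv | cut -c1-10    # -rw-rw-r--

chmod 640 report.csv             # numeric mode: owner rw (6), group r (4), others none (0)
ls -l report.csv | cut -c1-10    # -rw-r-----

chmod u+x report.csv             # symbolic mode can also add a single permission
ls -l report.csv | cut -c1-10    # -rwxr-----
```

Numeric mode sets all three classes at once, while symbolic mode is convenient for changing one permission without touching the rest.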

Conclusion

Mastering the Linux fundamentals transforms a data engineer from someone who just runs scripts into someone who can design, build and deploy robust data systems. The commands are important in managing tasks such as file navigation and handling, remote access, file permissions and text processing, which form the foundation of data engineering. Through frequent practice, the command line becomes a powerful tool for running data engineering tasks at scale.
