Shaquille Mburu

Linux Fundamentals for Data Engineering

What is Data Engineering?

Data refers to a collection of raw observations. Engineering broadly refers to the art of building, in this case with data as the raw material. Data engineering focuses on building systems that collect, move, clean and store data, making it available, reliable and usable.

Data engineering workflow showing four stages: Ingestion, Transformation, Storage, Orchestration

Why Linux?

Linux is a Unix-like operating system (OS), meaning it borrows heavily from the Unix philosophy of simplicity, modularity and code reusability. Its lean design keeps compute overhead low, which has made it widely adopted across servers, cloud infrastructure and big data technologies. This adoption means most engineering tools run natively on it, making Linux a suitable choice for data engineers.

Linux in Action...

Let's look at a case of accessing a server to set up a PostgreSQL database.

1. Server Access.

A remote server is accessed using Secure Shell (ssh), a cryptographic network protocol.
The syntax is ssh <username>@<server_ip_address> when logging in with a password, or ssh -i /path/to/private-key.pem <username>@<server_ip_address> when using a public/private key pair.

Remote server access using ssh
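
The two login styles above can be sketched as small shell functions. This is a minimal sketch, not the author's exact session: the user name, IP address (a documentation-only address) and key path are hypothetical placeholders, and nothing connects until one of the connect functions is called.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own values.
SERVER_USER="root"
SERVER_IP="203.0.113.10"               # documentation-only address, not a real server
KEY_PATH="$HOME/.ssh/private-key.pem"  # assumed location of the private key

# Build the user@host target that ssh expects.
ssh_target() {
  printf '%s@%s\n' "$SERVER_USER" "$SERVER_IP"
}

# Password login: ssh prompts for the password interactively.
connect_with_password() {
  ssh "$(ssh_target)"
}

# Key-pair login: ssh ignores private keys that other users can read,
# so restrict the file to its owner before connecting.
connect_with_key() {
  chmod 600 "$KEY_PATH"
  ssh -i "$KEY_PATH" "$(ssh_target)"
}
```

Calling connect_with_key (or connect_with_password) opens the session; keeping the target in one function avoids retyping the address.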

2. User Access.

  • After accessing the server, start by updating its package index using the command sudo apt update. This refreshes the list of available packages; sudo apt upgrade then installs the updates.

Update the server

  • Create a new user using the command adduser <username> (prefix it with sudo if you are not logged in as root).

Creating a linux user

  • Verify the user was created using id <username>

Verifying the user was created

  • To switch to the new user run the command su - <username>

Switching to new user created

  • Alternatively, log in as the new user directly, without going through root, by running the command ssh <new_user>@<remote_ip_address>.

Accessing new user directly

Accessing new user directly cont

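  • The new user can be granted administrative privileges by adding them to the superuser (sudo) group, using the command sudo usermod -aG <group_name> <user_name>, where <group_name> is sudo. The -a flag appends the group, so the user's existing group memberships are kept.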

Adding a user to sudo
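
The user-management steps above could be collected into one script along these lines. It is a sketch under assumptions: dataeng is a hypothetical user name, the helper names are illustrative, and the script only defines functions, so nothing changes on the server until create_admin_user is called.

```shell
#!/usr/bin/env bash
NEW_USER="dataeng"   # hypothetical user name

# True when the account exists (id exits non-zero for unknown users).
user_exists() {
  id "$1" >/dev/null 2>&1
}

# Create the account if needed, then add it to the sudo group.
create_admin_user() {
  sudo apt update
  if ! user_exists "$NEW_USER"; then
    sudo adduser "$NEW_USER"          # prompts for a password and user details
  fi
  sudo usermod -aG sudo "$NEW_USER"   # -a appends; without it, -G replaces the group list
  id "$NEW_USER"                      # the groups shown should now include sudo
}
```

After it runs, su - dataeng (or a fresh ssh login as that user) picks up the new group membership.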

3. Setting up PostgreSQL.

  • Start by checking if PostgreSQL is installed using the command psql --version. If it is not installed, run these commands:
sudo apt update
sudo apt install postgresql postgresql-contrib

In this case, version 16 is installed.

Checking PostgreSQL version

  • Log in as its superuser, postgres, by running the command sudo -i -u postgres. This command should be run from root or an account with administrative privileges.

  • To access the PostgreSQL shell run the command psql.

Accessing PostgreSQL shell
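
The check-then-install logic and the superuser login can be sketched as follows. The function names are illustrative, the -y flag (which skips the confirmation prompt) is an addition, and sudo -u postgres psql is used here as a one-step alternative to the sudo -i -u postgres followed by psql sequence described above.

```shell
#!/usr/bin/env bash
# True when a command is available on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print the installed version, or install PostgreSQL if psql is missing.
ensure_postgres() {
  if have_cmd psql; then
    psql --version
  else
    sudo apt update
    sudo apt install -y postgresql postgresql-contrib
  fi
}

# Open the PostgreSQL shell as the postgres superuser in one step.
open_psql() {
  sudo -u postgres psql
}
```

Running ensure_postgres twice is harmless: the second call only prints the version.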

  • Create a user using CREATE USER <username> WITH PASSWORD 'user_password'; and a database using CREATE DATABASE <database_name>;. Grant all access to the user by running the command GRANT ALL PRIVILEGES ON DATABASE <database_name> TO <username>;. Note: End every SQL statement with a semicolon (;). Without it, psql waits for more input instead of running the statement. Check the spelling too, since a misspelled keyword results in a syntax error.

Creating a User, Database and granting access

  • Confirm the creation of the database using \l.

Check the created database

  • To change the owner from postgres to the new user, run the command ALTER DATABASE <database_name> OWNER TO <user_name>;.

Change the database owner

  • Confirm the owner was changed to the new user using \l.

Confirming owner change

  • To create a schema named staging, first connect to the new database. From inside psql, use the command \c <database_name>. Alternatively, connect from the Linux shell as the new user with psql -U <username> -d <database_name> -h <host_name>. Once connected, run the command CREATE SCHEMA <schema_name>;. List the schemas in the database using the \dn command.

Creating a schema
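
The SQL steps above can also be run non-interactively by piping them into psql. This is a sketch with hypothetical names (etl_user, warehouse, staging) and a placeholder password. One deliberate addition, not in the steps above: the schema is created with AUTHORIZATION so that it belongs to the new user rather than to postgres.

```shell
#!/usr/bin/env bash
# Hypothetical names: replace with your own.
DB_USER="etl_user"
DB_PASSWORD="change-me"
DB_NAME="warehouse"
SCHEMA_NAME="staging"

# Print the SQL and psql meta-commands for the whole setup.
bootstrap_sql() {
  cat <<SQL
CREATE USER ${DB_USER} WITH PASSWORD '${DB_PASSWORD}';
CREATE DATABASE ${DB_NAME};
GRANT ALL PRIVILEGES ON DATABASE ${DB_NAME} TO ${DB_USER};
ALTER DATABASE ${DB_NAME} OWNER TO ${DB_USER};
\l
\c ${DB_NAME}
-- AUTHORIZATION is an addition: it makes the new user the schema owner.
CREATE SCHEMA ${SCHEMA_NAME} AUTHORIZATION ${DB_USER};
\dn
SQL
}

# Feed the statements to psql as the postgres superuser,
# stopping at the first error instead of carrying on.
apply_bootstrap() {
  bootstrap_sql | sudo -u postgres psql -v ON_ERROR_STOP=1
}
```

Running bootstrap_sql on its own prints the statements for review before apply_bootstrap sends them to the server.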

4. Configuring for external tools and traffic.

Data can be uploaded using external tools like DBeaver. Before switching to DBeaver, PostgreSQL needs to be configured to accept connections from external tools and remote traffic.

  • To allow external tools to connect, run the command sudo vim /etc/postgresql/16/main/postgresql.conf. This opens the main configuration file in the vim editor (the 16 in the path is the installed major version).

Allow external tools

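  • Set the listen_addresses parameter to '*' so the server listens on all network interfaces, and remove the leading # if the line is commented out. To save and exit the editor, press Esc, then type :wq! and press Enter.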

Vim config file
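
The same edit can be made without opening an editor. The sketch below is an alternative to the vim steps, not the method shown in the screenshots: it uses sed to rewrite the listen_addresses line whether or not it is commented out. The function names are illustrative, and the path assumes version 16 as above.

```shell
#!/usr/bin/env bash
PG_CONF="/etc/postgresql/16/main/postgresql.conf"

# sed rule: replace the whole listen_addresses line (commented or not).
LISTEN_RULE="s/^#?[[:space:]]*listen_addresses[[:space:]]*=.*/listen_addresses = '*'/"

# Apply the rule to standard input; handy for previewing the change.
listen_all_filter() {
  sed -E "$LISTEN_RULE"
}

# Edit the real file in place, keeping a backup with a .bak suffix.
enable_remote_listen() {
  sudo sed -i.bak -E "$LISTEN_RULE" "$PG_CONF"
}
```

For a dry run, grep listen_addresses "$PG_CONF" | listen_all_filter shows what the line would become before enable_remote_listen changes the file.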

  • To allow connections to the database from any address, run the command sudo vim /etc/postgresql/16/main/pg_hba.conf to open the file in the vim editor.

Configuring for global traffic

  • Add a configuration line at the end of the file that allows connections from any address and enforces password authentication using either md5 or scram-sha-256 (the stronger of the two). To save and exit the editor, press Esc, then type :wq! and press Enter.

Vim global traffic
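
A rule of the kind described above might look like the one this sketch generates. The values are assumptions: all databases, all users, any IPv4 address (0.0.0.0/0), with scram-sha-256 as the default method. Opening the server to every address is convenient for a tutorial; a narrower address range is safer in practice.

```shell
#!/usr/bin/env bash
PG_HBA="/etc/postgresql/16/main/pg_hba.conf"

# Print one pg_hba.conf rule. Columns: TYPE DATABASE USER ADDRESS METHOD.
# The optional first argument picks the method (scram-sha-256 by default).
hba_rule() {
  printf 'host    all    all    0.0.0.0/0    %s\n' "${1:-scram-sha-256}"
}

# Append the rule to the end of pg_hba.conf.
allow_remote_clients() {
  hba_rule "$@" | sudo tee -a "$PG_HBA" >/dev/null
}
```

Running hba_rule md5 prints the md5 variant; allow_remote_clients appends whichever rule is chosen.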

  • Restart the PostgreSQL service for the configuration changes to take effect.

Restart the postgresql service

  • Check the status of the postgresql service.

Checking the postgresql status
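
The restart and status check could be sketched as below, assuming a systemd-based server where the service is named postgresql and PostgreSQL uses its default port, 5432. The final ss check is an addition to confirm that the server is listening.

```shell
#!/usr/bin/env bash
PG_SERVICE="postgresql"   # assumed systemd service name
PG_PORT=5432              # PostgreSQL's default port

restart_and_check() {
  # Restart so the edits to postgresql.conf and pg_hba.conf are loaded.
  sudo systemctl restart "$PG_SERVICE"
  # Print the service status without opening a pager.
  sudo systemctl status "$PG_SERVICE" --no-pager
  # List listening TCP sockets and keep the PostgreSQL port.
  sudo ss -tln | grep ":$PG_PORT"
}
```

If listen_addresses was set to '*', the last command should show the port bound on all interfaces rather than only on 127.0.0.1.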

5. Uploading Data.

  • Switch to DBeaver and create a new connection to PostgreSQL.

Connection to PostgreSQL

  • To complete the connection, enter the credentials created on the Linux server.

Credentials input

  • Test the connection before establishing it. If the credentials match, the connection test passes.

Connection test

  • After the connection is established, move to the created database, then select the schema to import data.

Importing data

  • Browse to the file to be imported.

Import file

  • Select the table mapping.

Table mapping

  • Complete the import.

Importation complete

Conclusion

Linux forms the backbone of data engineering, enabling native interaction with servers, cloud infrastructure and big data technologies. Mastering it enables engineers to design and build robust, scalable systems that solve present-day challenges.
