Shaquille Mburu

Posted on Jun 25

Linux Fundamentals for Data Engineering

#linux #postgres #database #dataengineering

What is Data Engineering?

Data is a collection of raw observations, and engineering is broadly the practice of building things; here, the raw material is data. Data engineering focuses on building systems that collect, move, clean and store data so that it is available, reliable and usable.

Why Linux?

Linux is a Unix-like operating system (OS), meaning it borrows heavily from the Unix philosophy of simplicity, modularity and code reusability. Its lean design keeps resource usage low while performing well, which has made it widely adopted across servers, cloud infrastructure and big data technologies. Because so much data tooling runs natively on Linux, it is a natural choice for data engineers.

Linux in Action...

Let's look at a case of accessing a server to set up a PostgreSQL database.

1. Server Access.

A remote server is accessed using Secure Shell (ssh), a cryptographic network protocol.
The syntax is ssh <username>@<server_ip_address> when using a password, or ssh -i /path/to/private-key.pem <username>@<server_ip_address> when using a public/private key pair.
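
As a minimal sketch of the two login forms above, the helper below only prints the ssh command it would run, so it can be reviewed before connecting. The function name ssh_command, the user dataeng, the address 203.0.113.10 (a documentation-only address) and the key path are illustrative assumptions, not values from this walkthrough.

```shell
# Print the ssh command for a password login or a key-based login.
# Usage: ssh_command <username> <server_ip_address> [path/to/private-key.pem]
ssh_command() {
  if [ -n "${3:-}" ]; then
    # Key-based login: -i selects the private key file.
    printf 'ssh -i %s %s@%s\n' "$3" "$1" "$2"
  else
    # Password login: ssh prompts for the password interactively.
    printf 'ssh %s@%s\n' "$1" "$2"
  fi
}

# Hypothetical values; replace them with your own.
ssh_command dataeng 203.0.113.10
ssh_command dataeng 203.0.113.10 ~/.ssh/my-key.pem
```

If ssh rejects the key with an "unprotected private key file" warning, restricting the file with chmod 600 /path/to/private-key.pem usually resolves it.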

2. User Access.

After accessing the server, start by refreshing the package index using the command sudo apt update. This does not upgrade installed packages; sudo apt upgrade does that.

Create a new user using the command adduser <username> (prefix it with sudo if you are not logged in as root).

Verify the user was created using id <username>

To switch to the new user run the command su - <username>

Alternatively, log in as the new user directly, without going through root, by running the command ssh <username>@<server_ip_address>.

The new user can be granted administrative privileges by adding them to the sudo group, using the command sudo usermod -aG sudo <username>. The -a flag appends the group instead of replacing the user's existing supplementary groups.
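
To confirm that the group change worked, a small check like the one below can help. It is a sketch only: the function name in_group is made up for illustration, and it relies on id -nG printing a user's group names separated by spaces.

```shell
# Succeed if <username> is a member of <group>.
# Usage: in_group <username> <group>
in_group() {
  # id -nG prints the user's group names separated by spaces;
  # tr puts one name per line so grep can match the whole name exactly.
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Report whether the current user is a member of the sudo group.
if in_group "$(id -un)" sudo; then
  echo "$(id -un) is in the sudo group"
else
  echo "$(id -un) is not in the sudo group"
fi
```

A new group membership only applies to sessions started after the change, so log out and back in (or run su - <username>) before testing sudo.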

3. Setting up PostgreSQL.

Start by checking if PostgreSQL is installed using the command psql --version. If it is not installed run these commands:

sudo apt update
sudo apt install postgresql postgresql-contrib

In this case, version 16 is installed.
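
The check-then-install step above can be sketched as a small helper. The function name check_psql is an assumption for illustration; it only reports what it finds and prints the install command rather than running it.

```shell
# Report the installed PostgreSQL client version, or how to install it.
check_psql() {
  if command -v psql >/dev/null 2>&1; then
    # psql is on the PATH, so print its version string.
    psql --version
  else
    echo "psql not found; install with: sudo apt install postgresql postgresql-contrib"
  fi
}

check_psql
```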

Log in as the PostgreSQL superuser, postgres, by running the command sudo -i -u postgres. This command should be run from root or an account with administrative privileges.
To access the PostgreSQL shell run the command psql.

Create a user using CREATE USER <username> WITH PASSWORD 'user_password'; and a database using CREATE DATABASE <database_name>;. Grant all access to the user by running the command GRANT ALL PRIVILEGES ON DATABASE <database_name> TO <username>;. Note: Terminate every SQL statement with a semicolon (;) and check the spelling. Without the semicolon, psql waits for more input instead of running the statement, and a misspelled keyword results in an error.

Confirm the creation of the database using \l.

To change the owner from postgres to the new user, run the command ALTER DATABASE <database_name> OWNER TO <username>;.

Confirm the owner was changed to the new user using \l.

To create a schema named staging, first connect to the new database using the command \c <database_name> from the psql shell, or connect directly from the Linux shell using the command psql -U <username> -d <database_name> -h <host_name>. Inside the database, run the command CREATE SCHEMA <schema_name>; (here, CREATE SCHEMA staging;). List the schemas in the database using the \dn command.
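
The statements in this section can also be collected into one script instead of being typed one at a time. The sketch below only prints the SQL; the function name render_setup_sql and the sample names analytics, etl_user, staging and change-me are assumptions for illustration. It does no quoting or escaping, so it suits simple lowercase names and passwords without single quotes.

```shell
# Print the SQL for the user, database and schema set up in this section.
# Usage: render_setup_sql <database_name> <username> <schema_name> <password>
render_setup_sql() {
  # Unquoted here-document: $1..$4 are expanded, \c and \dn stay literal.
  cat <<SQL
CREATE USER $2 WITH PASSWORD '$4';
CREATE DATABASE $1;
GRANT ALL PRIVILEGES ON DATABASE $1 TO $2;
ALTER DATABASE $1 OWNER TO $2;
\c $1
CREATE SCHEMA $3;
\dn
SQL
}

# Hypothetical values; review the output before running it anywhere.
render_setup_sql analytics etl_user staging 'change-me'
```

Once the output looks right, it could be piped into psql as the postgres superuser, for example render_setup_sql analytics etl_user staging 'change-me' | sudo -u postgres psql -v ON_ERROR_STOP=1. Keep in mind that a password passed on the command line ends up in the shell history.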

4. Configuring for external tools and traffic.

Data can be uploaded using external tools like DBeaver. Before switching to DBeaver, PostgreSQL needs to be configured to allow external tools and traffic to connect to the databases.

To allow external tools, open the main configuration file in the vim editor using the command sudo vim /etc/postgresql/16/main/postgresql.conf.

Set the listen_addresses parameter to '*' so that PostgreSQL listens on all network interfaces. To save and exit the editor, press Esc, then type :wq! and press Enter.
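
The same edit can be made without opening an editor. This is a sketch under assumptions: the function name set_listen_addresses is made up, and the pattern expects the stock line format (optionally commented out with #, with a single-quoted value). If the file has no such line, nothing is changed.

```shell
# Set listen_addresses = '*' in a postgresql.conf file, keeping a .bak copy.
# Usage: set_listen_addresses <path/to/postgresql.conf>
set_listen_addresses() {
  # Matches the setting whether or not it is commented out with '#',
  # and leaves any trailing comment on the line untouched.
  sed -i.bak -E \
    "s/^#?[[:space:]]*listen_addresses[[:space:]]*=[[:space:]]*'[^']*'/listen_addresses = '*'/" \
    "$1"
}

# Example, from a root shell:
#   set_listen_addresses /etc/postgresql/16/main/postgresql.conf
```

The original file is kept next to the edited one with a .bak suffix, which makes it easy to compare or roll back.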

To allow connections from any address, run the command sudo vim /etc/postgresql/16/main/pg_hba.conf to open the client authentication file in the vim editor.

Add a configuration line at the end of the file that allows connections from any address and enforces password authentication using either md5 or scram-sha-256. scram-sha-256 is the stronger of the two. To save and exit the editor, press Esc, then type :wq! and press Enter.
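
One possible form of that line, and a way to add it only once, is sketched below. The function name allow_remote_password_logins is an assumption for illustration. The rule follows the pg_hba.conf field order of connection type, database, user, address and method.

```shell
# Append a rule letting any IPv4 client log in with a password,
# unless the exact same rule is already present.
# Usage: allow_remote_password_logins <path/to/pg_hba.conf>
allow_remote_password_logins() {
  rule="host    all    all    0.0.0.0/0    scram-sha-256"
  # grep -qxF looks for the rule as a whole line; append only if it is missing.
  grep -qxF -- "$rule" "$1" 2>/dev/null || printf '%s\n' "$rule" >> "$1"
}

# Example, from a root shell:
#   allow_remote_password_logins /etc/postgresql/16/main/pg_hba.conf
```

0.0.0.0/0 matches every IPv4 address. Where the client addresses are known, a narrower range in the address field is the safer choice.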

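Restart the PostgreSQL service for the configuration changes to take effect, using the command sudo systemctl restart postgresql.

Check the status of the PostgreSQL service using the command sudo systemctl status postgresql.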

5. Uploading Data.

Switch to DBeaver and create a new connection to PostgreSQL.

To complete the connection, input the credentials created on the Linux server.

Test the connection before establishing it. If the credentials match, the connection test will pass.

After the connection is established, move to the created database, then select the schema to import data into.

Browse to the file to be imported.

Select the table mapping.

Complete the importation.

Conclusion

Linux forms the backbone of data engineering, enabling native interaction with servers, cloud infrastructure and big data technologies. Mastering it enables engineers to design and build robust, scalable systems that solve present-day challenges.
