leslie angu

Linux Fundamentals for Data Engineering

What is Data Engineering?

Data engineering is the work of building and maintaining the data pipelines that allow organizations to collect, store, process and use data effectively.
Data engineers store and process data on cloud platforms such as DigitalOcean, Google Cloud Platform, Azure and AWS. Most servers on these platforms run Linux, so data engineers use the terminal to connect to them and to manage and monitor data flows.

Linux Essentials

Data engineers connect to virtual private servers (VPS) from different operating systems. The most common method is Secure Shell (SSH).
macOS and Linux include an SSH client out of the box. On Windows, SSH is available after installing WSL (Windows Subsystem for Linux); recent versions of Windows 10 and 11 also include an OpenSSH client.
Since I use Windows, I installed WSL and ran ssh <username>@<ip> with the username and IP address I was given. The VPS is protected by a password; after entering it, you get access to an Ubuntu operating system.
>ssh root@159.65.222.96
root@159.65.222.96's password:
Fig 1. Accessing the VPS with the login details I was given.

When company restrictions prevent data engineers from reaching a server directly, they connect through a jump server, an intermediate host that is allowed to reach the target server.
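
With OpenSSH, one common way to go through a jump server is the -J (ProxyJump) option. The sketch below only assembles and prints the command so it can be reviewed first; both host names are hypothetical placeholders, not values from this setup.

```shell
# Hypothetical hosts: replace with the jump server and target you were given.
JUMP_HOST="analyst@jump.example.com"
TARGET_HOST="root@10.0.0.5"

# -J tells ssh to connect to the jump server first, then hop to the target.
JUMP_CMD="ssh -J $JUMP_HOST $TARGET_HOST"

# Print the command for review; run it in the terminal to connect.
echo "$JUMP_CMD"
```
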
To find every file with a given name across all the folders, we use find . -name 'file_name' as shown below.

Fig 2. Output of searching the server for a particular file
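
To try find without touching real data, here is a minimal sketch that builds a throwaway folder tree and searches it. The folder and file names are made up for illustration.

```shell
# Build a small throwaway folder tree (hypothetical names) to search in.
DEMO_DIR="$(mktemp -d)"
mkdir -p "$DEMO_DIR/2023" "$DEMO_DIR/2024"
touch "$DEMO_DIR/2023/rental_property.csv" \
      "$DEMO_DIR/2024/rental_property.csv" \
      "$DEMO_DIR/2024/notes.txt"

# Search the whole tree for regular files with the given name.
find "$DEMO_DIR" -type f -name 'rental_property.csv'

# Count the matches (tr strips the padding some wc versions add).
MATCHES="$(find "$DEMO_DIR" -type f -name 'rental_property.csv' | wc -l | tr -d ' ')"
echo "matches: $MATCHES"
```

Adding -type f restricts the results to regular files, so a folder that happens to share the name is not listed.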

To move files between a local machine and the server we use secure copy protocol (SCP)
a) Moving files from local machine to the server:

scp C:/Users/Admin/Downloads/Rental_property root@159.65.222.96:/root/<folder>

Fig 3. Transfer of files from the local machine to the server

b) Moving files from server to local machine:
scp root@159.65.222.96:/root/<folder>/rental_property.csv C:/Users/Admin/Desktop

Fig 4. Moving data from the server to the local machine
To open a file on the server you can use the vim or nano editor. In vim, press i to enter insert mode and edit the file. When you are done, press Esc, type a colon followed by wq (write and quit), and press Enter to save the changes and exit. Adding ! to the command (:wq!) forces the write, for example when the file is marked read-only.
root@class:~# vim <filename>

A look at the vim editor
To remove a file from a folder use rm. To remove a folder together with its contents, add the -r (recursive) flag.
rm <filename>

Removing a file from the folder using rm
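
Because rm deletes permanently, it helps to practise on throwaway files first. This sketch, using made-up file names, shows plain rm for a file and rm -r for a folder.

```shell
# Create a throwaway folder with one file and one sub-folder (hypothetical names).
WORK_DIR="$(mktemp -d)"
touch "$WORK_DIR/old_export.csv"
mkdir -p "$WORK_DIR/old_reports"
touch "$WORK_DIR/old_reports/jan.csv"

# A single file needs no flag.
rm "$WORK_DIR/old_export.csv"

# A folder and everything inside it needs -r (recursive).
rm -r "$WORK_DIR/old_reports"

# List what is left; the folder should now be empty.
REMAINING="$(ls -A "$WORK_DIR")"
echo "remaining: [$REMAINING]"
```
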

To view the contents of a file you can use more or less.
root@ip:~# more <filename>

Using more to see the content inside the CSV file

To check the first lines of a file use head, and to check the last lines use tail. Both print 10 lines by default; pass -n 5 to print only 5 lines.
root@ip:~# head -n 5 <filename>
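
As a small self-contained sketch, the commands below build a 12-line sample CSV with invented values and print its first and last 5 lines.

```shell
# Build a 12-line sample CSV (hypothetical data) in a temporary file.
SAMPLE_CSV="$(mktemp)"
echo "id,rent" > "$SAMPLE_CSV"
i=1
while [ "$i" -le 11 ]; do
  echo "$i,$((i * 100))" >> "$SAMPLE_CSV"
  i=$((i + 1))
done

# First 5 lines: the header plus rows 1 to 4.
head -n 5 "$SAMPLE_CSV"

# Last 5 lines: rows 7 to 11.
tail -n 5 "$SAMPLE_CSV"
```
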
To update the packages on the Linux server running in the virtual private server, we use sudo apt update to refresh the package lists, followed by sudo apt upgrade to install the available updates.

Updating the Linux server

Working with databases in the server - PostgreSQL

Once you have access to the server, install PostgreSQL with: sudo apt install postgresql postgresql-contrib -y. Then check the status of the service with sudo systemctl status postgresql and the psql version with psql --version.

Installing postgresql, checking the status and psql version
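
If you want a script to confirm the install before it continues, one option is a guard like the sketch below. The PSQL_STATE variable is a name chosen here for illustration.

```shell
# command -v succeeds only when psql is on the PATH.
if command -v psql >/dev/null 2>&1; then
  PSQL_STATE="installed"
  psql --version
else
  PSQL_STATE="missing"
  echo "psql not found; install it with: sudo apt install postgresql postgresql-contrib -y"
fi
```
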
To add a user to the server use: sudo adduser <user>. If the server rejects the format of the username that you want to add, use: sudo adduser --force-badname <user>.

Adding a user to the server
To check if the user was added use more /etc/passwd. New users are appended to the end of the file, so check the last line of the output.

Checking if the user is added - check last line.
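
Instead of reading the whole file, you can search it for one username. A minimal sketch follows; user_exists is a helper name made up here, and root stands in for the user you added.

```shell
# Each line of /etc/passwd starts with "username:", so an anchored grep
# matches the exact name. The optional second argument points at another
# file and defaults to /etc/passwd.
user_exists() {
  grep -q "^$1:" "${2:-/etc/passwd}"
}

# "root" exists on practically every Linux server; swap in the user you added.
if user_exists root; then
  echo "user found"
else
  echo "user not found"
fi
```
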
To access the database through the psql shell, first switch to the postgres account on the server: sudo -i -u postgres. Then run the psql command to open the psql shell, where we can create a database.

Creating the database and schema
After you create a database, the original postgres superuser is the only role with read and write privileges on it. This makes it a challenge to connect to the new database as the new user that was added to manage the databases.

DBeaver error caused by the new user's lack of privileges on the postgres instance
The solution I opted for was to set a password for the postgres user, since I had not set one when PostgreSQL was installed.

Altering the postgres password
The command to alter the postgres password in the psql shell is: ALTER USER postgres WITH PASSWORD '<newpassword>';
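
A different approach, not the one used in this post, is to create a dedicated PostgreSQL role and make it the owner of the database. PostgreSQL roles are separate from Linux users created with adduser. In this sketch the role name, password and database name are hypothetical placeholders.

```shell
# Write the SQL to a file first so it can be reviewed before it runs.
# Role name, password and database name are hypothetical placeholders.
SQL_FILE="$(mktemp)"
cat > "$SQL_FILE" <<'SQL'
-- A login role separate from the postgres superuser.
CREATE ROLE analyst WITH LOGIN PASSWORD 'change_me';
-- The owner of a database can create and modify objects in it.
CREATE DATABASE rentals OWNER analyst;
SQL

cat "$SQL_FILE"

# To apply it on the server, feed the file to psql as the postgres user:
#   sudo -u postgres psql < "$SQL_FILE"
```
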

After setting up the postgres database and schema, you need to edit two PostgreSQL configuration files; on Ubuntu they are usually found in /etc/postgresql/<version>/main/. The first is postgresql.conf: use vim to uncomment the listen_addresses setting and set it to '*' so that the server accepts remote connections. The second is pg_hba.conf, which controls client authentication. Add the following line at the end: 'host all all 0.0.0.0/0 md5'.
Restart the server using: sudo systemctl restart postgresql. This ensures you don't get the same error that you saw in DBeaver.

The expected outcome from the database connection
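
The two configuration changes can be rehearsed on throwaway copies before touching the real files. The sketch below assumes the setting is spelled listen_addresses and that the pg_hba.conf address carries a /0 mask, which is the format PostgreSQL expects; the file contents are simplified stand-ins.

```shell
# Work on throwaway copies; on Ubuntu the real files usually live under
# /etc/postgresql/<version>/main/.
CONF_DIR="$(mktemp -d)"
echo "#listen_addresses = 'localhost'" > "$CONF_DIR/postgresql.conf"
: > "$CONF_DIR/pg_hba.conf"

# Uncomment listen_addresses and accept connections on all interfaces.
sed "s/^#*listen_addresses *=.*/listen_addresses = '*'/" \
  "$CONF_DIR/postgresql.conf" > "$CONF_DIR/postgresql.conf.new"
mv "$CONF_DIR/postgresql.conf.new" "$CONF_DIR/postgresql.conf"

# Allow password logins from any IPv4 address (narrow the range if you can).
echo "host all all 0.0.0.0/0 md5" >> "$CONF_DIR/pg_hba.conf"

# Show the result of both edits.
cat "$CONF_DIR/postgresql.conf" "$CONF_DIR/pg_hba.conf"
```
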

To check if the database has data, we access the psql shell and use the meta-commands below. Type \q to exit the shell.

Listing all databases
\l : List all databases

Changing databases in psql
\c database_name : Connect to a different database

Using \dt to check the tables available
\dt : List all tables in the current database

Using \du to check the users and their roles
\du : List all the database users and their roles

Connection information
\conninfo : Display current connection details
