What is Data Engineering?
Data is a collection of raw observations. Engineering, broadly, is the practice of building; in this case, data is the raw material. Data engineering focuses on building systems that collect, move, clean and store data, making it available, reliable and usable.
Why Linux?
Linux is a Unix-like operating system (OS), meaning it borrows heavily from the Unix philosophy of simplicity, modularity and code reusability. Its lightweight design uses few compute resources while performing well, which is why it is widely adopted across servers, cloud infrastructure and big data technologies. Because so many data tools run natively on Linux, it is a natural choice for data engineers.
Linux in Action...
Let's look at a case of accessing a server to set up a PostgreSQL database.
1. Server Access.
A remote server is accessed using Secure Shell (SSH), a cryptographic network protocol for secure remote login.
With password authentication, the syntax is:
ssh <username>@<server_ip_address>
With a public/private key pair, the syntax is:
ssh -i /path/to/private-key.pem <username>@<server_ip_address>
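As a rough illustration of the two forms of the ssh command, the sketch below assembles the command from its parts and prints it for review rather than opening a connection. The function name build_ssh_command, the user name analyst, the address 203.0.113.10 (a reserved documentation address) and the key path are all placeholders, not values from this walkthrough.

```shell
# Print the ssh command for a user, a host and an optional private key.
# Nothing is executed; the function only prints the command line.
# $1 = user name, $2 = server address, $3 = optional path to a private key
build_ssh_command() {
  if [ -n "${3:-}" ]; then
    # Key-pair authentication: -i points ssh at the private key file.
    printf 'ssh -i %s %s@%s\n' "$3" "$1" "$2"
  else
    # Password authentication: ssh prompts for the password.
    printf 'ssh %s@%s\n' "$1" "$2"
  fi
}

# Placeholder values for illustration only.
build_ssh_command analyst 203.0.113.10
build_ssh_command analyst 203.0.113.10 "$HOME/.ssh/private-key.pem"
```

When a key file is used, ssh refuses private keys that other users can read, so running chmod 600 on the key file is often needed first.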
2. User Access.
- After accessing the server, start by refreshing the package index using the command
sudo apt update
- Create a new user using the command (prefix it with sudo if you are not logged in as root)
adduser <username>
- Verify the user was created using
id <username>
- To switch to the new user run the command
su - <username>
- Alternatively, log in as the new user directly, without going through root, by running the command
ssh <username>@<server_ip_address>
- The new user can be granted administrative privileges by adding them to the superuser (sudo) group, using the command
sudo usermod -aG <group_name> <username>
Here <group_name> is sudo.
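The user-management steps can be collected into one script. The sketch below is a dry run by default: it prints each command instead of executing it. Setting the hypothetical DRY_RUN variable to 0 would run the commands for real, which requires root or sudo rights. The user name analyst and the helper names run and setup_user are illustrative assumptions, and sudo is added in front of adduser on the assumption that the script is not run as root.

```shell
# DRY_RUN=1 (the default) only prints commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"
NEW_USER="${NEW_USER:-analyst}"   # placeholder user name

# Print a command, and run it only when DRY_RUN is 0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# $1 = name of the user to create
setup_user() {
  run sudo apt update              # refresh the package index
  run sudo adduser "$1"            # create the user (prompts for a password)
  run id "$1"                      # verify that the account exists
  run sudo usermod -aG sudo "$1"   # add the user to the sudo group
}

setup_user "$NEW_USER"
```

Keeping the dry run as the default makes it easy to review the exact commands before anything on the server changes.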
3. Setting up PostgreSQL.
- Start by checking if PostgreSQL is installed using the command
psql --version
If it is not installed, run these commands:
sudo apt update
sudo apt install postgresql postgresql-contrib
In this case, version 16 is installed.
- Log in as the PostgreSQL superuser, postgres, by running the command
sudo -i -u postgres
This command should be run from root or an account with administrative privileges.
- To access the PostgreSQL shell, run the command
psql
- Create a user using
CREATE USER <username> WITH PASSWORD 'user_password';
- Create a database using
CREATE DATABASE <database_name>;
- Grant all access to the user by running the command
GRANT ALL PRIVILEGES ON DATABASE <database_name> TO <username>;
Note: Ensure every SQL statement is spelled correctly and terminated with a semicolon (;). Without the semicolon psql keeps waiting for more input, and a misspelled statement results in an error.
- Confirm the creation of the database using
\l
- To change the owner from postgres, run the command
ALTER DATABASE <database_name> OWNER TO <username>;
- Confirm the owner was changed to the new user using
\l
- To create a schema (for example, one named staging), first connect to the new database using the command
\c <database_name>
or connect to it directly from the Linux shell using the command
psql -U <username> -d <database_name> -h <host_name>
- From inside the database, run the command
CREATE SCHEMA <schema_name>;
- List the schemas in the database using the command
\dn
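The statements in this step can also be kept together as one SQL script. The sketch below only generates that script and prints it; it does not connect to PostgreSQL. The function name render_setup_sql and the names analytics, analyst and staging are placeholders, and the password change_me must be replaced before real use.

```shell
# Print the SQL for the database setup steps.
# $1 = database name, $2 = user name, $3 = password, $4 = schema name
render_setup_sql() {
  printf "CREATE USER %s WITH PASSWORD '%s';\n" "$2" "$3"
  printf 'CREATE DATABASE %s;\n' "$1"
  printf 'GRANT ALL PRIVILEGES ON DATABASE %s TO %s;\n' "$1" "$2"
  printf 'ALTER DATABASE %s OWNER TO %s;\n' "$1" "$2"
  printf '\\c %s\n' "$1"             # psql meta-command: connect to the new database
  printf 'CREATE SCHEMA %s;\n' "$4"
  printf '\\dn\n'                    # psql meta-command: list the schemas
}

# Placeholder values for illustration only.
render_setup_sql analytics analyst 'change_me' staging
```

One possible way to use the output is to redirect it to a file and run it from the postgres account with psql -f <file_name>. Every SQL statement ends in a semicolon, while the psql meta-commands (\c and \dn) do not need one.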
4. Configuring for external tools and traffic.
Data can be uploaded using external tools like DBeaver. Before switching to DBeaver, PostgreSQL needs to be configured to accept connections from external tools.
- To allow external tools to connect, open the main configuration file in the vim editor by running the command
sudo vim /etc/postgresql/16/main/postgresql.conf
- Set listen_addresses to '*' so that PostgreSQL listens on all network interfaces. To save and exit the editor, press Esc, then type :wq!
- To allow client connections from any address, open the client authentication file in the vim editor by running the command
sudo vim /etc/postgresql/16/main/pg_hba.conf
- Add a configuration line at the end of the file that allows traffic from all addresses and enforces password authentication using either md5 or scram-sha-256. To save and exit the editor, press Esc, then type :wq!
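- Restart the PostgreSQL service for the configuration changes to take effect, using the command
sudo systemctl restart postgresql
- Check the status of the PostgreSQL service using the command
sudo systemctl status postgresql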
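For reference, the two configuration edits in this step might look like the following. This sketch works on scratch copies in a temporary directory rather than the real files under /etc/postgresql/16/main, and the starting lines are simplified stand-ins for the real defaults. The 0.0.0.0/0 range matches any IPv4 address; a narrower range is safer where the client addresses are known.

```shell
# Scratch directory, so nothing on the server is modified.
workdir="$(mktemp -d)"

# Simplified stand-ins for the real files (assumed contents).
printf '%s\n' "#listen_addresses = 'localhost'" > "$workdir/postgresql.conf"
printf '%s\n' "local   all   all   peer" > "$workdir/pg_hba.conf"

# postgresql.conf: listen on all interfaces instead of localhost only.
sed "s/^#*listen_addresses[[:space:]]*=.*/listen_addresses = '*'/" \
  "$workdir/postgresql.conf" > "$workdir/postgresql.conf.new"

# pg_hba.conf: append a rule that allows password-authenticated connections
# from any IPv4 address, to all databases, for all users.
# Column order: TYPE  DATABASE  USER  ADDRESS  METHOD
printf '%s\n' "host    all    all    0.0.0.0/0    scram-sha-256" >> "$workdir/pg_hba.conf"

# Show the resulting lines.
cat "$workdir/postgresql.conf.new"
tail -n 1 "$workdir/pg_hba.conf"
```

On the real server the same two lines would be entered by hand in vim, and they only take effect once the PostgreSQL service has been restarted.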
5. Uploading Data.
- Switch to DBeaver and create a new connection to PostgreSQL.
- To complete the connection, input the credentials created on the Linux server.
- Test the connection before establishing it. If the credentials match, the connection test will pass.
- After the connection is established, move to the created database, then select the schema to import the data into.
- Browse the file to be imported.
- Select the table mapping.
- Complete the importation.
Conclusion
Linux forms the backbone of data engineering, enabling native interaction with servers, cloud infrastructure and big data technologies. Mastering it enables engineers to design and build robust, scalable systems that solve present-day challenges.