Tony Kamande

Linux Fundamentals for Data Engineering

Introduction

Data engineering is the backbone of modern data-driven organizations. Every day, businesses generate massive amounts of data that must be collected, stored, processed, and analyzed. Behind these processes are data engineers who build and maintain the systems that make data available for analytics and decision-making.

One of the most important skills for a data engineer is proficiency in Linux. Most production servers, cloud environments, databases, and big data platforms run on Linux. Whether managing databases, deploying applications, automating workflows, or troubleshooting infrastructure, Linux knowledge is essential.

This article explores the fundamental Linux concepts every data engineer should understand, supported by practical examples from a hands-on Linux and PostgreSQL administration project.


Why Linux Matters in Data Engineering

Linux is the preferred operating system for data engineering because it is:

  • Open source
  • Stable and reliable
  • Highly scalable
  • Secure
  • Efficient in resource utilization
  • Widely supported across cloud platforms

Many popular data engineering technologies run on Linux, including:

  • PostgreSQL
  • MySQL
  • Apache Airflow
  • Apache Spark
  • Hadoop
  • Kafka
  • Docker
  • Kubernetes

As a result, understanding Linux fundamentals allows data engineers to work effectively across different environments.


Connecting to Remote Servers with SSH

One of the first tasks data engineers perform is accessing remote servers.

SSH (Secure Shell) provides a secure way to connect to remote Linux systems.

Example:

ssh root@159.65.222.96

SSH provides:

  • Secure encrypted communication
  • Remote administration capabilities
  • Authentication mechanisms
  • Secure file transfers

During my assignment, SSH was used to connect to a remote Linux server where PostgreSQL administration tasks were performed.
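For repeated connections, an SSH client configuration entry can save typing and pin the key used for authentication. The following is a minimal sketch, not the setup used in the assignment: the alias de-server, the user tonym, and the key path ~/.ssh/id_ed25519 are assumptions, and the entry is written to a scratch file rather than the real ~/.ssh/config so it is safe to try.

```shell
# Write a sample SSH client config entry to a scratch file.
# "de-server" and the key path are hypothetical; adjust them for a real setup.
config_file="$(mktemp)"
cat > "$config_file" <<'EOF'
Host de-server
    HostName 159.65.222.96
    User tonym
    IdentityFile ~/.ssh/id_ed25519
EOF

# Show the entry that would be added to ~/.ssh/config
cat "$config_file"
```

With an entry like this in ~/.ssh/config, running ssh de-server would be equivalent to running ssh -i ~/.ssh/id_ed25519 tonym@159.65.222.96.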


Linux User Management

User management is critical for maintaining security and controlling access to resources.

Instead of allowing everyone to use the root account, Linux administrators create separate user accounts with specific permissions.

Creating a user:

adduser tonym

Verifying the user:

id tonym

Useful user management commands include:

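whoami     # print the current user name
id         # show the UID, GID, and group memberships
groups     # list the groups a user belongs to
passwd     # change a user's password
useradd    # low-level command to create a user
adduser    # interactive front end to useradd on Debian and Ubuntu
usermod    # modify an existing account (for example, add it to a group)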

These commands help administrators manage user identities and access rights.
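As a quick check that does not require root, the read-only commands above can be combined to summarize the account a session is running under. This is a small sketch using only id; the variable names are illustrative.

```shell
# Summarize the identity of the current session (no root required)
current_user="$(id -un 2>/dev/null)"   # user name, the same value whoami prints
user_id="$(id -u)"                     # numeric user ID
group_list="$(id -Gn 2>/dev/null)"     # names of all groups the user belongs to

echo "user:   $current_user"
echo "uid:    $user_id"
echo "groups: $group_list"

# UID 0 is root; routine work should normally run as a regular user
if [ "$user_id" -eq 0 ]; then
    echo "warning: running as root"
fi
```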


Understanding the Linux File System

Linux organizes files using a hierarchical directory structure.

Some important directories include:

Directory   Purpose
/home       User home directories
/etc        Configuration files
/var        Logs and application data
/tmp        Temporary files
/usr        Installed applications
/bin        Essential system binaries
/root       Root user's home directory

Understanding the Linux file system helps data engineers locate configuration files, logs, datasets, and scripts.


Essential Linux Navigation Commands

Navigation is one of the first Linux skills every engineer learns.

Display current directory:

pwd

List files:

ls

Detailed listing, including hidden files:

ls -la

Change directory:

cd /home/tonym

These commands allow users to move efficiently through the filesystem.


File and Directory Operations

Data engineers frequently work with files and datasets.

Create a File

touch dataset.csv

Create a Directory

mkdir datasets

Copy Files

cp source.csv backup.csv

Move or Rename Files

mv old.csv archive.csv

Remove Files

rm unwanted.csv

These commands are useful when organizing scripts, logs, and datasets.
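To see how these operations fit together, here is a short sketch that runs them in sequence inside a throwaway directory created with mktemp, so nothing outside it is touched. The file names reuse the ones above; the CSV contents are made up for illustration.

```shell
# Work in a scratch directory so the example cannot affect real files
workdir="$(mktemp -d)"
cd "$workdir"

mkdir datasets                              # create a directory
printf 'id,name\n1,alice\n' > source.csv    # create a small sample file
cp source.csv datasets/backup.csv           # copy: the original stays in place
mv source.csv datasets/archive.csv          # move: the original path disappears
touch unwanted.csv                          # create an empty file
rm unwanted.csv                             # and remove it again

# The directory should now contain backup.csv and archive.csv
ls -la datasets
```

Note that rm deletes files immediately with no recycle bin, so it is worth double-checking the path before running it on real data.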


Viewing and Searching Files

Inspecting files is a common task when troubleshooting data pipelines.

Display file contents:

cat file.txt

View beginning of file:

head file.txt

View end of file:

tail file.txt

Search text:

grep "ERROR" logfile.log

Find files:

find . -name "*.csv"

These tools make it easy to locate information within large systems.
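The commands above are often combined when triaging a failed pipeline run. The sketch below builds a tiny log file and two empty CSV files in a scratch directory, then counts errors, reads the first and last log lines, and counts the CSV files. The log lines and file names are invented sample data, not output from a real system.

```shell
workdir="$(mktemp -d)"

# Made-up log content for illustration
cat > "$workdir/pipeline.log" <<'EOF'
2024-01-01 10:00:00 INFO load started
2024-01-01 10:00:05 ERROR connection refused
2024-01-01 10:00:10 INFO retrying
2024-01-01 10:00:15 ERROR timeout
EOF
touch "$workdir/a.csv" "$workdir/b.csv"

# grep -c counts matching lines instead of printing them
error_count="$(grep -c "ERROR" "$workdir/pipeline.log")"

# head and tail accept -n to control how many lines are shown
first_line="$(head -n 1 "$workdir/pipeline.log")"
last_line="$(tail -n 1 "$workdir/pipeline.log")"

# find searches recursively; wc -l counts the paths it prints
csv_count="$(find "$workdir" -name "*.csv" | wc -l | tr -d ' ')"

echo "errors:     $error_count"
echo "first line: $first_line"
echo "last line:  $last_line"
echo "csv files:  $csv_count"
```

For a log that is still being written, tail -f keeps the file open and prints new lines as they arrive.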


Linux Permissions and Security

Linux uses a permission-based security model.

View permissions:

ls -l

Example output:

-rw-r--r-- 1 user user 1024 Jan 15 10:30 file.txt

Permission management commands:

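chmod    # change permission bits (read, write, execute)
chown    # change the owner, and optionally the group
chgrp    # change the group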

Example (the owner can read, write, and execute; group and others can read and execute):

chmod 755 script.sh

For data engineers, managing permissions is important for protecting datasets, scripts, and database resources.
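Octal modes map directly onto the permission string that ls -l prints: each digit is the sum of read (4), write (2), and execute (1) for the owner, group, and others in turn. The sketch below applies two common modes to a scratch file and prints the resulting strings; the file is created only for the demonstration.

```shell
workdir="$(mktemp -d)"
script="$workdir/script.sh"
printf '#!/bin/sh\necho "hello"\n' > "$script"

# 644 = rw- r-- r-- : the owner reads and writes, everyone else only reads
chmod 644 "$script"
mode_644="$(ls -l "$script" | cut -c1-10)"

# 755 = rwx r-x r-x : the owner has full access, others can read and execute
chmod 755 "$script"
mode_755="$(ls -l "$script" | cut -c1-10)"

echo "644 -> $mode_644"
echo "755 -> $mode_755"
```

The first character of the string is the file type (a dash for a regular file, d for a directory), followed by three groups of three permission characters.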


Monitoring System Resources

Data pipelines can consume significant system resources.

Linux provides tools for monitoring system performance.

Check disk usage:

df -h

Check memory usage:

free -m

View running processes:

ps aux

Real-time monitoring:

top

Monitoring helps identify performance bottlenecks and resource constraints.
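Checks like these are easy to script so they can run before a heavy job starts. As a minimal sketch, the snippet below reads the usage percentage of the root filesystem from df and compares it with a threshold. The 80 percent threshold is an arbitrary assumption, and the parsing assumes the filesystem name contains no spaces.

```shell
threshold=80   # hypothetical limit; tune it for the actual server

# df -P prints one line per filesystem in a stable, POSIX-defined layout;
# the fifth column is the used capacity, for example "45%"
usage="$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')"

if [ "$usage" -ge "$threshold" ]; then
    echo "WARNING: / is ${usage}% full"
else
    echo "OK: / is ${usage}% full"
fi
```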


PostgreSQL Administration on Linux

Databases are central to data engineering.

As part of a practical assignment, PostgreSQL was configured and managed on a Linux server.

Verify that the PostgreSQL client (psql) is installed:

psql --version

Check PostgreSQL service status:

systemctl status postgresql

Start PostgreSQL service:

systemctl start postgresql

This ensures that the database server is operational and available for connections.


Creating a Database

A database named after the Linux username was created.

CREATE DATABASE tonym;

Connecting to the database:

\c tonym

Naming a database after its user is a common administrative convention. It is also convenient, because psql connects to a database with the same name as the current user when no database is specified.


Using Schemas for Organization

Schemas provide logical organization inside a database.

A staging schema was created:

CREATE SCHEMA staging;

In data engineering, staging schemas are commonly used to store raw or intermediate data before transformation.

Benefits include:

  • Better organization
  • Easier maintenance
  • Clear separation of data layers
  • Improved governance

Creating Tables and Loading Data

A sample employee dataset was created inside the staging schema.

Create Table

CREATE TABLE staging.employees (
    employee_id SERIAL PRIMARY KEY,
    full_name VARCHAR(100),
    department VARCHAR(50),
    salary NUMERIC(10,2),
    hire_date DATE
);

Insert Sample Data

INSERT INTO staging.employees
(full_name, department, salary, hire_date)
VALUES
('John Doe', 'Engineering', 75000, '2023-01-15'),
('Mary Wanjiku', 'Finance', 68000, '2022-06-20'),
('Peter Mwangi', 'IT', 72000, '2023-03-10');

Verify Data

SELECT * FROM staging.employees;

This process mirrors real-world ETL workflows where data is first loaded into staging areas before transformation and analysis.
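In practice, staging tables are usually filled from files rather than hand-written INSERT statements. The following is a sketch of how that could look from the shell using psql's \copy command, assuming the tonym database and the staging.employees table above already exist. The CSV row is made-up sample data, and RUN_LOAD is a hypothetical switch added here so the script only touches a database when explicitly asked to.

```shell
workdir="$(mktemp -d)"

# Made-up sample data; the header matches the table's non-serial columns
cat > "$workdir/employees.csv" <<'EOF'
full_name,department,salary,hire_date
Jane Roe,Marketing,64000,2023-05-01
EOF

# \copy reads the file on the client side, so no server file access is needed.
# employee_id is omitted because SERIAL fills it in automatically.
cat > "$workdir/load.sql" <<EOF
\copy staging.employees (full_name, department, salary, hire_date) FROM '$workdir/employees.csv' WITH (FORMAT csv, HEADER true)
SELECT count(*) FROM staging.employees;
EOF

# Only attempt the load when psql is installed and RUN_LOAD=1 is set
if command -v psql >/dev/null 2>&1 && [ "${RUN_LOAD:-0}" = "1" ]; then
    psql -d tonym -v ON_ERROR_STOP=1 -f "$workdir/load.sql"
else
    echo "Skipping load; would run: psql -d tonym -f $workdir/load.sql"
fi
```

Setting ON_ERROR_STOP makes psql stop at the first failing statement, which is generally what a load script wants.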


Secure File Transfers with SCP

Data engineers often move datasets and scripts between systems.

SCP (Secure Copy Protocol) provides secure file transfers over SSH.

Upload a file to a server:

scp sample.csv tonym@159.65.222.96:/home/tonym/

Download a file from a server:

scp tonym@159.65.222.96:/home/tonym/sample.csv .

SCP is widely used for moving backups, configuration files, and datasets securely.
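SCP encrypts data in transit, and a checksum comparison is a common extra step to confirm that a file arrived intact. This was not part of the assignment; it is a sketch in which cp between two scratch directories stands in for the scp call, so it runs without a server. The directory names and CSV contents are assumptions.

```shell
workdir="$(mktemp -d)"
mkdir "$workdir/local" "$workdir/remote"
printf 'id,value\n1,10\n2,20\n' > "$workdir/local/sample.csv"

# Record the checksum next to the file before sending it
(cd "$workdir/local" && sha256sum sample.csv > sample.csv.sha256)

# Stand-in for:
#   scp sample.csv sample.csv.sha256 tonym@159.65.222.96:/home/tonym/
cp "$workdir/local/sample.csv" "$workdir/local/sample.csv.sha256" "$workdir/remote/"

# On the receiving side, -c recomputes the checksum and compares it
verify_result="$(cd "$workdir/remote" && sha256sum -c sample.csv.sha256)"
echo "$verify_result"
```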


Managing Services with systemctl

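Most modern Linux distributions use systemd to manage services, and systemctl is the command used to control it. Starting, stopping, or restarting a service typically requires root privileges (for example, via sudo).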

Check service status:

systemctl status postgresql

Start service:

systemctl start postgresql

Stop service:

systemctl stop postgresql

Restart service:

systemctl restart postgresql

Service management is an essential skill for maintaining databases and other infrastructure components.
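For scripts, systemctl is-active is often more convenient than status because it reports the state through its exit code. Here is a small sketch that checks a service before a job depends on it. It assumes the unit is named postgresql, as in the commands above, and it falls back to a placeholder state on machines without systemd.

```shell
service="postgresql"

if ! command -v systemctl >/dev/null 2>&1; then
    # No systemd tooling on this machine (for example, a minimal container)
    state="systemctl-not-found"
elif systemctl is-active --quiet "$service" 2>/dev/null; then
    state="active"
else
    state="inactive"
fi

echo "$service: $state"
```

A pipeline script could check this value and exit early with a clear message instead of failing later on a refused database connection.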


Best Practices for Linux in Data Engineering

To work effectively in Linux environments, data engineers should:

  1. Avoid using root unless necessary.
  2. Use SSH keys whenever possible.
  3. Regularly monitor system resources.
  4. Organize files and directories consistently.
  5. Automate repetitive tasks with scripts.
  6. Maintain proper permissions.
  7. Keep systems updated.
  8. Document processes thoroughly.
  9. Version-control important scripts.
  10. Back up critical data regularly.
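Several of these practices can be combined in one small script. As an illustration of points 5 and 10, the sketch below archives a directory into a timestamped tar.gz file. It runs entirely inside a scratch directory, and the datasets and backups directory names are assumptions chosen for the example.

```shell
workdir="$(mktemp -d)"
mkdir -p "$workdir/datasets" "$workdir/backups"
printf 'id,value\n1,10\n' > "$workdir/datasets/sample.csv"

# A timestamp in the file name keeps older backups from being overwritten
stamp="$(date +%Y%m%d_%H%M%S)"
archive="$workdir/backups/datasets_$stamp.tar.gz"

# -C switches into workdir first so the archive stores relative paths
tar -czf "$archive" -C "$workdir" datasets

# List the archive contents to confirm what was captured
tar -tzf "$archive"
```

A script like this could then be scheduled with cron and kept under version control alongside other operational scripts.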

Conclusion

Linux is one of the most important technologies in the data engineering ecosystem. From remote server management and user administration to database configuration and file transfers, Linux provides the tools necessary to build and maintain modern data platforms.

Through practical experience configuring PostgreSQL, creating databases and schemas, loading sample data, managing users, transferring files with SCP, and documenting the process using GitHub, I gained a deeper understanding of how Linux supports real-world data engineering workflows.

For aspiring data engineers, investing time in learning Linux is one of the most valuable career decisions they can make.
