Frederick M

How Linux is Used in Real-World Data Engineering

Linux is the backbone of modern data engineering. From running ETL pipelines on cloud servers to managing distributed systems like Hadoop and Spark, proficiency with the Linux command line is non‑negotiable. In this guide, we’ll walk through a realistic data‑engineering workflow on an Ubuntu server – the kind of tasks you’ll perform daily when managing data pipelines, securing sensitive files, and organising project assets.

We’ll cover:

  1. Secure login to a remote server
  2. Structuring a data project with version‑aware directories
  3. Creating and manipulating data files (CSV, logs, scripts)
  4. Copying, moving, renaming, and cleaning up files
  5. Setting correct permissions to protect sensitive data
  6. Navigating the file system and re‑using command history

1. Logging into a Linux Server

In the real world, data engineers rarely work on their local laptop. Most tasks happen on remote servers (on‑premises or in the cloud). The first step is to connect to the server securely over SSH, entering your password when prompted:

ssh root@143.110.224.135

After logging in, you should see the server's welcome message and a new shell prompt.

Once logged in, it’s good practice to confirm you are using the correct account. Data pipelines often run under dedicated service accounts, so knowing your user context matters.

whoami      # displays the current username

Next, verify your current working directory (the location where you will start creating folders and files) with the pwd command:

pwd          # prints the current working directory

2. Folder and File Creation

A well‑organised directory structure is vital for any data project. Let’s create a main folder named after ourselves and inside it create subfolders for raw data, processed data, logs, and scripts.
We will then confirm the folders were created using ls:

mkdir ~/fredrickMDataEngineering   # make a new folder
cd ~/fredrickMDataEngineering  # enter the folder
mkdir raw_data processed_data logs scripts # make multiple folders
ls   # check current working directory content
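As a convenience, bash brace expansion lets you create the whole tree in one command, and find confirms the layout (the project name matches the one used above):

```shell
# Create the project root and all four subfolders at once (bash brace expansion)
mkdir -p ~/fredrickMDataEngineering/{raw_data,processed_data,logs,scripts}

# List every directory under the project root to confirm the layout
find ~/fredrickMDataEngineering -type d
```

Because mkdir -p is idempotent, this is safe to re-run even if some folders already exist.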

Now we need to create files that simulate real data engineering assets. Create a placeholder file inside each folder using the touch command:

# Raw data
touch raw_data/sample_data.csv

# Processed data
touch processed_data/cleaned_data.csv

# Logs
touch logs/pipeline.log

# Scripts
touch scripts/etl_script.sh
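Note that touch only creates empty files. To make the raw file slightly more realistic, you can seed it with a header and a couple of rows (the column names and values here are invented purely for illustration):

```shell
mkdir -p raw_data                               # no-op if the folder already exists

# Write an illustrative header plus two sample rows into the raw CSV
printf 'id,name,signup_date\n'  >  raw_data/sample_data.csv
printf '1,Alice,2024-01-15\n'   >> raw_data/sample_data.csv
printf '2,Bob,2024-02-03\n'     >> raw_data/sample_data.csv

head raw_data/sample_data.csv                   # preview the contents
```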

3. File Operations: Copy, Move, Rename, Delete

Data engineering involves frequent file manipulation: backing up raw data, moving files between stages, versioning assets, and cleaning up obsolete files.

Let's start by backing up the raw CSV file as a precaution before processing, using the cp command:

cp raw_data/sample_data.csv raw_data/sample_data_backup.csv

Let's move a file to simulate data flowing between pipeline stages:

mv processed_data/cleaned_data.csv logs/

Rename the raw data file to indicate a version:

mv raw_data/sample_data.csv raw_data/sample_data_v1.csv

Delete the backup once it is no longer needed:

rm raw_data/sample_data_backup.csv
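Remember that there is no recycle bin on a server: rm is permanent. The -v flag echoes each file as it is removed, and -i prompts for confirmation first. A quick demonstration (the file below is created just for this example):

```shell
touch obsolete_backup.csv      # throwaway file for this demonstration
rm -v obsolete_backup.csv      # -v (verbose) prints the name of each file removed
```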
4. Managing Permissions for Security

In production, sensitive files (e.g., credentials, raw PII data) must have strict permissions. Here’s how we secure our project.

Let's make the main directory accessible only by its owner (no one else can read, write, or execute). The "-R" flag applies the change recursively to every file and folder inside the target directory:

chmod -R 700 ~/fredrickMDataEngineering

We will also tighten the permissions on a sensitive data file so only its owner can read or write it:

chmod 600 raw_data/sample_data_v1.csv

To confirm the permissions are set correctly, list the directory contents in long format from the project root:

ls -l  
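In the ls -l output, mode 600 appears as -rw------- and 700 as drwx------ on directories. If you prefer the numeric form, GNU stat can print it directly (secrets.csv below is a hypothetical file created just for this example):

```shell
touch secrets.csv               # hypothetical sensitive file for illustration
chmod 600 secrets.csv           # owner read/write only
stat -c '%a %n' secrets.csv     # prints the octal mode and file name (GNU coreutils stat)
```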

Our ETL script needs execute permission before it can run. Make it executable with:

chmod +x scripts/etl_script.sh
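To see the full cycle end to end, here is a self-contained sketch of what a minimal ETL script might look like (the script body is illustrative, not from the original project, and it recreates a small sample file so it can run on its own): it copies the raw CSV into the processed folder and prints a message.

```shell
mkdir -p raw_data processed_data scripts            # no-op if the folders already exist
printf 'id,name\n1,Alice\n' > raw_data/sample_data.csv

# Write a tiny illustrative ETL script using a heredoc
cat > scripts/etl_script.sh <<'EOF'
#!/bin/sh
# Minimal ETL step: "extract" the raw CSV and "load" it into processed_data
cp raw_data/sample_data.csv processed_data/cleaned_data.csv
echo "ETL step complete"
EOF

chmod +x scripts/etl_script.sh    # grant execute permission
./scripts/etl_script.sh           # run the script from the project root
```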

5. Navigation and Command History

Data engineers constantly move between directories. Use relative and absolute paths to navigate:

cd ~/fredrickMDataEngineering/scripts   # go to the scripts folder
cd ../logs                              # move across to the sibling logs folder
cd ~                                    # back to home
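One more shortcut worth knowing: cd - jumps back to the previous directory and prints its path, which is handy when alternating between two folders (the paths below match the project created earlier):

```shell
mkdir -p ~/fredrickMDataEngineering/scripts ~/fredrickMDataEngineering/logs  # ensure folders exist
cd ~/fredrickMDataEngineering/scripts
cd ~/fredrickMDataEngineering/logs
cd -        # returns to the scripts folder and prints its path
pwd         # confirm where we ended up
```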

Hidden files (e.g., .env for environment variables) are common in data projects. View them with:

ls -a

Your terminal history is a goldmine: it lets you reproduce exact commands and audit what was done in a session. View it with this command:

history

To re‑run a previous command (e.g., command number 2104), use:

!2104
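History is also searchable: piping it through grep finds every past use of a command, and in an interactive shell Ctrl+R starts a reverse incremental search. (One caveat: inside a non-interactive script, history recording must first be enabled with set -o history; at a normal login prompt it is on by default.)

```shell
set -o history        # only needed in scripts; interactive shells record history automatically
ls -a > /dev/null     # run a couple of commands...
pwd > /dev/null
history | grep 'pwd'  # ...then search the session history for one of them
```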

6. Why These Skills Matter

What you just practised is a miniature version of real‑world data engineering:

  1. Structured folders mirror how data lakes or data warehouses are organised.
  2. File operations (copy, move, rename) simulate the stages of an ETL pipeline, from ingestion to transformation to archiving.
  3. Permissions protect sensitive data and ensure only authorised users (or processes) can modify critical files.
  4. Scripts automate repetitive tasks, and command history allows you to audit or replay steps.
