Samuel Wachira

Introduction to Linux for Data Engineers

1. Why Linux is Important for Data Engineers

Most data engineering work happens on servers and cloud platforms, and almost all of them run Linux. That holds whether you are:

  • Deploying databases
  • Running ETL pipelines
  • Managing cloud virtual machines
  • Using tools like Hadoop, Spark, Airflow or Docker

You will interact with a Linux terminal.

Linux is important for data engineers because:

  • It is stable and efficient for servers
  • It gives powerful command-line tools for automation
  • Most data tools are built for Linux first

In short, if you work with data infrastructure, you will eventually work with Linux.

2. Understanding the Linux Terminal

The terminal is a text-based interface where you type commands to interact with the system.

When you open a terminal, you might see something like:

samwel@ubuntu-server:~$

This means:

  • samwel → your username
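  • ubuntu-server → hostname (the name of the machine you are logged in to)
  • ~ → current directory (~ is shorthand for your home directory)
  • $ → prompt for a regular user, ready for command input (the root user sees # instead)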
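
Each part of the prompt can also be printed with its own command. The snippet below is a minimal sketch; the variable names `current_user` and `host_name` are only for illustration, and the last line prints the full path where a real prompt would usually show `~`.

```shell
# Print the pieces that make up the prompt, one at a time.
current_user="$(id -un 2>/dev/null || echo unknown)"   # username
host_name="$(uname -n)"                                # hostname

echo "user: $current_user"
echo "host: $host_name"
echo "home: ${HOME:-unset}"    # the directory that ~ expands to
echo "cwd:  $(pwd)"            # the directory you are in right now

# Rebuild a prompt-like string from those pieces.
echo "${current_user}@${host_name}:$(pwd)\$"
```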

3. Basic Linux Commands

📁 Check current directory

pwd

Output:

/home/samwel

📂 List files

ls

Output:

data.csv  scripts  logs

📁 Create a new folder

mkdir projects

📄 Create an empty file

touch pipeline.py

📖 View file contents

cat data.csv
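
Since `cat` prints the whole file, it can flood the terminal when the file is large. As a rough sketch, `head`, `tail` and `wc` let you peek at a file instead. The sample CSV below is made up so the example runs on its own.

```shell
# Build a small hypothetical CSV in a temporary file.
sample="$(mktemp)"
printf 'id,city\n1,Nairobi\n2,Mombasa\n3,Kisumu\n' > "$sample"

head -n 2 "$sample"     # first 2 lines: the header plus one row
tail -n 1 "$sample"     # last line only

# Count the lines (tr strips the padding some systems add).
line_count="$(wc -l < "$sample" | tr -d ' ')"
echo "lines: $line_count"
```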

🗑️ Remove a file

rm old_data.csv

🌍 Download data from the web

wget https://example.com/data.csv

🔗 Connect to a remote server

ssh user@192.168.1.10

This is common in data engineering when working with cloud servers.
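
The file commands from this section can be tried together in one short session. The sketch below runs in a throwaway directory created with `mktemp -d`, so nothing real is touched. The file contents are invented for the example.

```shell
# Work in a scratch directory so no real files are affected.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir projects                            # create a folder
touch projects/pipeline.py                # create an empty file
printf 'id,name\n1,alice\n' > data.csv    # hypothetical sample data

pwd             # show the current directory
ls              # lists data.csv and projects
cat data.csv    # print the file contents
rm data.csv     # delete the file
ls              # now lists only projects
```

Note that `rm` deletes immediately and there is no recycle bin, so it is worth double-checking the file name before pressing ENTER.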

4. Editing Files in Linux: Nano vs Vi

Data engineers often edit:

  • Configuration files
  • Python scripts
  • SQL files

Two popular terminal editors are Nano and Vi.
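
Not every server has both editors installed, and minimal images sometimes ship without Nano. A quick way to check, sketched here with the standard `command -v` lookup (the variable names are only illustrative):

```shell
# Report which of the two editors are installed on this machine.
available=""
for editor in nano vi; do
    if editor_path="$(command -v "$editor")"; then
        echo "$editor is installed at $editor_path"
        available="$available $editor"
    else
        echo "$editor is not installed"
    fi
done
echo "found:${available:- none}"
```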

5. Using Nano

Open or create a file:

nano script.py

Terminal view:

 GNU nano 6.2              script.py

 print("Hello Data Engineering!")

^X Exit   ^O Save   ^K Cut   ^U Paste

Actions:

  • Press CTRL + O → Save
  • Press ENTER to confirm
  • Press CTRL + X → Exit

Nano is simple and perfect for beginners.

6. Using Vi

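Vi is available on almost every Linux server and is fast once you know it, but it is modal: the same key does different things depending on which mode you are in.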

Open a file:

vi script.py

Vi Modes:

Mode      Purpose
Normal    Navigation
Insert    Typing text
Command   Saving and quitting

➤ Enter Insert Mode

Press:

i

Now type:

print("Hello from Vi Editor")

➤ Save and Exit

Press:

ESC

Then type:

:wq

Press ENTER.

Meaning:

  • :w → write (save)
  • :q → quit

➤ Exit without saving

:q!

7. Why These Skills Matter for Data Engineers

You will use these skills when:

  • Editing configuration files on servers
  • Writing Python ETL scripts remotely
  • Managing cron jobs for scheduled pipelines
  • Fixing errors directly in cloud terminals
  • Deploying database services

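Being comfortable with terminal editors lets you fix problems directly on the server instead of copying files back and forth, which saves time and reduces mistakes.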
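
To make the cron use case concrete, here is a rough sketch of a wrapper script that a scheduled job could call. The script name, the crontab path, the `LOG_DIR` variable and the `run_pipeline` function are all hypothetical. The real work would replace the placeholder `echo`.

```shell
# run_pipeline.sh: a hypothetical wrapper that cron could call.
# Example crontab entry (added with `crontab -e`), running daily at 02:00:
#   0 2 * * * /home/samwel/scripts/run_pipeline.sh

# Assumption: logs go to LOG_DIR if set, otherwise to a scratch directory.
log_dir="${LOG_DIR:-$(mktemp -d)}"
log_file="$log_dir/pipeline.log"

run_pipeline() {
    # Placeholder for the real work, e.g. running a Python ETL script.
    echo "pipeline step would run here"
}

# Append a timestamped start line, the step output, and a done line.
{
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] start"
    run_pipeline
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] done"
} >> "$log_file"

tail -n 3 "$log_file"    # show the lines just written
```

Writing to a log file matters here because cron jobs run without a terminal, so the log is often the only place to see what happened.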

8. Summary

Concept          Key Point
Linux            Core operating system for data infrastructure
Terminal         Command-based system control
Basic Commands   Navigate, create, delete, download files
Nano             Easy editor for beginners
Vi               Advanced editor used on servers
Practical Use    Editing scripts and configs directly on servers

Final Thoughts

Linux may look scary at first, but once you practice basic commands and text editing, it becomes natural. For data engineers, Linux is a daily working environment.
