Abstract
In class, I learned that Linux is the dominant operating system in modern data engineering environments. Most data pipelines, databases, and cloud-based data platforms are deployed on Linux servers. This article introduces fundamental Linux concepts for data engineers, presents essential Linux commands in the order they are typically used, and demonstrates how the nano text editor can be applied to practical data engineering tasks.
1. Introduction
Data engineering involves building, maintaining, and optimizing data pipelines that operate in production environments. Linux plays a central role in this ecosystem due to its stability, performance, and extensive command-line tooling. As a result, data engineers are expected to interact with Linux systems directly through the terminal.
2. Why Data Engineers Use Linux
Linux is widely used in data engineering for several reasons:
- Most databases and distributed data platforms are optimized for Linux
- Powerful command-line tools enable efficient automation
- Remote access simplifies server and cloud infrastructure management
- Strong support for scripting and scheduling data pipelines
Because of these advantages, Linux proficiency is considered a foundational skill for data engineers.
3. Server Access Using SSH
In real-world environments, data engineers access servers remotely using SSH (Secure Shell).
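For example, a connection to a remote server might look like the following (the address below is a placeholder from the documentation IP range, not a real server):

```shell
# Connect to a remote server as root (placeholder address)
ssh root@203.0.113.10
```

The first connection prompts to confirm the server's host key; afterwards, authentication proceeds by password or SSH key.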
While logged in as root, create a new user:
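A sketch of the command, assuming a hypothetical username dataeng (some distributions use useradd instead of adduser):

```shell
# Create a new user account (run as root); "dataeng" is a placeholder name
adduser dataeng
```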
Switch from the root user to the newly created user:
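Continuing with the hypothetical dataeng account:

```shell
# Start a login shell as the new user
su - dataeng
```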
Confirm the active user and working directory:
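For example:

```shell
# Show the active user and the current working directory
whoami
pwd
```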
To run commands that require elevated permissions, I make my user account a sudo-enabled (super) user:
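A sketch, again assuming the hypothetical dataeng user (the administrative group is named sudo on Debian/Ubuntu systems and wheel on RHEL-based ones):

```shell
# Add the user to the sudo group (run as root)
usermod -aG sudo dataeng
```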
Verify the user is now sudo-enabled:
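One way to check, assuming the hypothetical dataeng user, is to list the account's group memberships, where sudo should now appear:

```shell
# Show the groups the user belongs to
groups dataeng
```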
4. Essential Linux Commands in a Typical Workflow
Linux commands are usually executed in a logical sequence during real-world data engineering tasks. The commands below are ordered according to a typical workflow.
Checking the Current Directory
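The current working directory is printed using:

```shell
# Print the absolute path of the current working directory
pwd
```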
Creating Directories
A directory for organizing project files is created using mkdir:
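For example, creating a directory named example (assuming it does not already exist):

```shell
# Create a directory named "example" for project files
mkdir example
```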
Listing Files and Directories
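The contents of the current directory are listed using ls; the -l flag adds details such as permissions, owner, and size:

```shell
# List files and directories in long format
ls -l
```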
Navigating into the Directory
The newly created directory is accessed using:
cd example
Creating Files
An empty file is created using:
touch etl.py
Downloading Data
A dataset is downloaded from the internet using:
wget https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv
Viewing File Contents
The contents of a file are displayed using:
cat iris.csv
Viewing Large Files
Large files are viewed page by page using:
more iris.csv
Previewing File Data
The first and last lines of a file are viewed using:
head iris.csv
tail iris.csv
Searching Within Files
Specific patterns within files are searched using:
grep petal_width iris.csv
Since the data rows of iris.csv contain only numeric values and species names, this pattern matches only the header line, which contains the column name petal_width.
Appending Output to a File
Text is appended to a file using output redirection:
echo "ETL job completed successfully" >> iris.csv
Note that this appends a plain status line to the dataset itself; in practice, such messages are usually redirected to a separate log file.
6. Using the Nano Editor
Nano is a lightweight and beginner-friendly text editor available on most Linux systems. It allows users to edit files directly from the terminal without requiring complex commands or modes.
Opening a File with Nano
Files such as Python scripts, SQL files, or configuration files are opened using:
nano etl.py
Saving and Exiting Nano
Changes are saved using Ctrl+O (Write Out), and the editor is exited with Ctrl+X after editing is complete. Nano displays all available shortcuts at the bottom of the screen, making it easy to use for beginners.
7. Conclusion
This assignment demonstrates foundational Linux skills required for data engineering tasks. By securely accessing a remote server using SSH, managing user accounts, organizing files with Linux commands, and editing scripts using the nano editor, data engineers gain practical experience working in real-world Linux environments. Mastery of these skills provides a strong foundation for working with production data systems.