Abstract
In class, I learned that Linux is the dominant operating system in modern data engineering environments. Most data pipelines, databases, and cloud-based data platforms are deployed on Linux servers. This article introduces fundamental Linux concepts for data engineers, presents essential Linux commands in the order they are typically used, and demonstrates how the nano text editor can be applied to practical data engineering tasks.
1. Introduction
Data engineering involves building, maintaining, and optimizing data pipelines that operate in production environments. Linux plays a central role in this ecosystem due to its stability, performance, and extensive command-line tooling. As a result, data engineers are expected to interact with Linux systems directly through the terminal.
2. Why Data Engineers Use Linux
Linux is widely used in data engineering for several reasons:
- Most databases and distributed data platforms are optimized for Linux
- Powerful command-line tools enable efficient automation
- Remote access simplifies server and cloud infrastructure management
- Strong support for scripting and scheduling data pipelines
Because of these advantages, Linux proficiency is considered a foundational skill for data engineers.
3. Server Access Using SSH
In real-world environments, data engineers access servers remotely using SSH (Secure Shell).
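For example, a connection to a remote server might look like the following (the address below is a placeholder from the documentation IP range, not a real server):

```shell
# Connect to a remote server as root (placeholder address)
ssh root@203.0.113.10
```

The first connection prompts to confirm the server's host key; afterwards, authentication proceeds by password or SSH key.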
While logged in as root, create a new user:
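A sketch of the command, assuming a hypothetical username dataeng (some distributions use useradd instead of adduser):

```shell
# Create a new user account (run as root); "dataeng" is a placeholder name
adduser dataeng
```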
Switch from the root user to the newly created user:
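Continuing with the hypothetical dataeng account:

```shell
# Start a login shell as the new user
su - dataeng
```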
Confirm the active user and working directory:
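For example:

```shell
# Show the active user and the current working directory
whoami
pwd
```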
To run commands that require elevated permissions, I make my user account a sudo-enabled (super) user:
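A sketch, again assuming the hypothetical dataeng user (the administrative group is named sudo on Debian/Ubuntu systems and wheel on RHEL-based ones):

```shell
# Add the user to the sudo group (run as root)
usermod -aG sudo dataeng
```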
Verify the user is now sudo-enabled:
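One way to check, assuming the hypothetical dataeng user, is to list the account's group memberships, where sudo should now appear:

```shell
# Show the groups the user belongs to
groups dataeng
```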
4. Essential Linux Commands in a Typical Workflow
Linux commands are usually executed in a logical sequence during real-world data engineering tasks. The commands below are ordered according to a typical workflow.
Checking the Current Directory
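The current working directory is printed using:

```shell
# Print the absolute path of the current working directory
pwd
```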
Creating Directories
A directory for organizing project files is created using mkdir:
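For example, creating a directory named example (assuming it does not already exist):

```shell
# Create a directory named "example" for project files
mkdir example
```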
Listing Files and Directories
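The contents of the current directory are listed using ls; the -l flag adds details such as permissions, owner, and size:

```shell
# List files and directories in long format
ls -l
```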
Navigating into the Directory
The newly created directory is accessed using:
cd example
Creating Files
An empty file is created using:
touch etl.py
Downloading Data
A dataset is downloaded from the internet using:
wget https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv
Viewing File Contents
The contents of a file are displayed using:
cat iris.csv
Viewing Large Files
Large files are viewed page by page using:
more iris.csv
Previewing File Data
The first and last lines of a file are viewed using:
head iris.csv
tail iris.csv
Searching Within Files
Specific patterns within files are searched using:
grep petal_width iris.csv
Since the data rows of iris.csv contain only numeric values and species names, this pattern matches only the header line, which contains the column name petal_width.
Appending Output to a File
Text is appended to a file using output redirection:
echo "ETL job completed successfully" >> iris.csv
Note that this appends a plain status line to the dataset itself; in practice, such messages are usually redirected to a separate log file.
6. Using the Nano Editor
Nano is a lightweight and beginner-friendly text editor available on most Linux systems. It allows users to edit files directly from the terminal without requiring complex commands or modes.
Opening a File with Nano
Files such as Python scripts, SQL files, or configuration files are opened using:
nano etl.py
Saving and Exiting Nano
Changes are saved using Ctrl+O (Write Out), and the editor is exited with Ctrl+X after editing is complete. Nano displays all available shortcuts at the bottom of the screen, making it easy to use for beginners.
7. Conclusion
This assignment demonstrates foundational Linux skills required for data engineering tasks. By securely accessing a remote server using SSH, managing user accounts, organizing files with Linux commands, and editing scripts using the nano editor, data engineers gain practical experience working in real-world Linux environments. Mastery of these skills provides a strong foundation for working with production data systems.