What is Data Engineering?
Data engineering is the practice of transforming raw data and preparing it for analysis or use by data analysts and data scientists. It ensures that both the infrastructure and the data itself are in the right form. Data engineers convert vast amounts of raw data into usable datasets.
Why Is Linux Used in Data Engineering?
- Most Cloud infrastructures such as AWS, Azure and GCP run on Linux. They use Linux for their virtual machines and data services.
- Tools such as Apache Kafka, Hadoop and Spark fit naturally into Linux's open-source ecosystem.
- Linux offers performance and stability for running large data pipelines without needing reboots.
- Automation and scripting: Linux offers a command-line interface (CLI) and tools such as cron, which enable automation of data tasks and Extract, Transform, Load (ETL) pipelines.
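The ETL idea above can be sketched as a tiny shell pipeline. This is a minimal illustration with made-up filenames and data, not a real pipeline:

```shell
# Extract: create some sample raw data (stand-in for a real source)
printf 'id,amount\n1,100\n2,250\n3,75\n' > raw_sales.csv

# Transform: keep the header plus rows with amount >= 100
awk -F',' 'NR == 1 || $2 >= 100' raw_sales.csv > clean_sales.csv

# Load: append to a destination file (stand-in for a database load)
cat clean_sales.csv >> sales_history.csv

cat clean_sales.csv
```

Each step here is one command, which is exactly why shell scripts plus cron are a common first automation layer on Linux.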
Linux Basics for Data Engineering
There are a few Linux basics that data engineers should be aware of.
1. The File System Structure
The Linux file system takes the structure of a tree, with the starting point as the root (/) directory.
Important directories under the root are:
/etc, /var, /bin, /tmp
/etc
/etc contains configuration files and folders. This folder controls how the entire system is configured: how the OS behaves and how users behave. For example, the passwd file contains details about user accounts.
/var
This folder contains variable data that changes continuously during system operations.
Examples:
- System logs
- Authorization logs
- Databases
- Runtime state files
It is of importance to a data engineer in various ways such as:
/var/log/ # Where Spark, Kafka and Apache logs live
/var/lib/postgresql/ # Actual PostgreSQL database storage
/var/spool/cron # Job queues for scheduled cron tasks
/bin
This contains essential command-line programs that are available to all users, even in single-user or recovery mode, such as cd and ls.
cd # Used to change directory
ls # Used to list files and directories
ls options:
ls -a # Include hidden files
ls -l # Show file permissions and details
Single User / Recovery Mode
Single user or recovery mode refers to a Linux special boot mode used for repair and maintenance.
In this mode:
- Only one user logs in (root user)
- No networks are started
- No GUI is started
- Used for system repair (e.g. password resets)
2. File Permissions
File permissions are relevant to a data engineer because the role involves:
- Moving data between systems
- Automatic running of scripts
- Handling sensitive credentials
- Maintaining data integrity
Permission Table
| Permission | Symbol | Numeric Value |
|---|---|---|
| Read | r | 4 |
| Write | w | 2 |
| Execute | x | 1 |
The command used:
chmod # change mode
Example
Create a directory:
mkdir Yourname
Create a file:
touch file.txt
Restrict permissions so only the owner has full access:
chmod 700 file.txt
Allow:
- Group → read & execute
- Others → read only
chmod 754 file.txt
To change ownership:
chown # Change the owner (and optionally group) of a file
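The numeric modes above can be verified end to end. This is a minimal sketch; `stat -c` is the GNU coreutils form (macOS/BSD use `stat -f`):

```shell
# Create a file to experiment on
touch file.txt

# Owner: rwx (4+2+1 = 7), group: r-x (4+1 = 5), others: r-- (4)
chmod 754 file.txt

# Print the octal permission bits to confirm
stat -c '%a' file.txt
```

The printed value is `754`, matching the three digits passed to chmod.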
3. Disk Usage
A data engineer should pay attention to disk usage because it directly affects:
- Performance
- Storage capacity
- Cost
- Pipeline failures (when disk fills)
Commands:
du -sh # Check the size of a file or directory (summary, human-readable)
df -h # Check disk space
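A quick way to try both commands on a throwaway directory (the names here are illustrative):

```shell
# Create a directory with a small file in it
mkdir -p data_dir
printf 'some sample data\n' > data_dir/sample.txt

# Summarize the directory's total size, human-readable
du -sh data_dir

# Show free space on the filesystem containing data_dir
df -h data_dir
```

In a pipeline context, checking `df -h` on the volume that holds `/var/lib` (databases) and `/var/log` (logs) is a common first step when jobs start failing.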
4. Searching
Data engineers handle large files, so searching is important.
Commands:
grep # Search within a file
find # Search for files and directories
locate # Fast search using index
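A small example of grep and find in action, using a made-up log file:

```shell
# Create a sample log to search
printf 'INFO job started\nERROR disk full\nINFO job ended\n' > pipeline.log

# grep: print lines matching a pattern within a file
grep 'ERROR' pipeline.log

# find: locate files by name under the current directory
find . -name 'pipeline.log'
```

`grep` searches inside files, while `find` searches the filesystem tree itself; the two are often combined, e.g. `find . -name '*.log' | xargs grep 'ERROR'`.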
5. Process Management
A process is a program that is running in memory.
Commands:
ps # View running processes
ps -u yourusername # View your processes
top # Live process monitor
htop # Modern version of top
kill # End a process
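The lifecycle of a process can be walked through with a harmless background job. A minimal sketch:

```shell
# Start a long-running process in the background
sleep 60 &
pid=$!

# ps -p: show just that process (pid and command name)
ps -p "$pid" -o pid=,comm=

# kill: send SIGTERM to end it, then reap it with wait
kill "$pid"
wait "$pid" 2>/dev/null || true
```

This is the same pattern used when a runaway pipeline job needs to be found with `ps` or `top` and then terminated with `kill`.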
Real World Data Engineering Workflow on Linux
Linux servers are used to manage the entire data pipeline.
A data pipeline refers to the entire process of:
Data Collection → Cleaning → Formatting → Storage → Analysis → Presentation
Data Pipelines
Batch Pipelines
- Process data in batches
- Scheduled (hourly, daily)
- Suitable for historical data
- Used for computationally expensive operations
- Tools: Apache Spark, Hadoop, Airflow
Realtime Processing Pipelines
- Data analyzed continuously as it flows
- Used in fraud detection and monitoring systems
- Requires realtime analysis
Stages of a Data Pipeline
1. Ingestion
This refers to bringing data from different sources into your system storage.
Sources include:
- Data lakes
- APIs
- IoT devices
Two methods of ingestion
- Batch ingestion
- Streaming ingestion
Ingestion Tool
Apache Kafka – sits between your data sources and destinations, streaming records between them.
2. Transformation
This involves:
- Cleaning
- Restructuring
- Enriching
- Standardization
- Aggregation
- Validation
- Filtering
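Several of these operations (standardization, validation, filtering) can be sketched in a few lines of awk. The data and filenames here are invented for illustration:

```shell
# Sample raw data: stray spaces and a row missing a value
printf 'name, age\nalice , 34\nbob,\ncarol, 29\n' > raw.csv

awk -F',' '
  { gsub(/ /, "") }              # standardization: strip stray spaces
  NR == 1 || $2 != "" { print }  # validation: keep header and rows with an age
' raw.csv > clean.csv

cat clean.csv
```

Real pipelines do this with tools like Spark, but the logic per record is the same: normalize, validate, keep or drop.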
3. Storage
Storage types:
- Relational databases (SQL)
- Data warehouses (Azure Synapse, Amazon Redshift)
4. Analysis
| Type | Description |
|---|---|
| Descriptive | What happened |
| Diagnostic | Why it happened |
| Predictive | What will happen |
| Prescriptive | What should be done |
Conclusion
Linux provides:
- Control
- Automation
- Scale
Automation includes scheduling jobs using CRON.
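As an illustration, a crontab entry (edited with `crontab -e`) that runs a hypothetical ETL script every day at 2 a.m. could look like this; the script path and log file are made up:

```
# minute hour day-of-month month day-of-week command
0 2 * * * /home/user/scripts/run_etl.sh >> /var/log/etl.log 2>&1
```

The five time fields come first, then the command; redirecting stdout and stderr to a log file under /var/log keeps a record of each run.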
Dealing With Large Files in Linux
Example: tail
tail -f /var/log/syslog
This:
- Shows last 10 lines
- Keeps the file open
- Prints new lines in realtime
- Stops when you press CTRL + C
On Linux, GNU tail uses inotify, a kernel mechanism that notifies tail when the file changes, falling back to periodic polling where inotify is unavailable.
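The non-following behavior is easy to try on a small sample file (the filename is illustrative):

```shell
# Create a sample log with four lines
printf 'line1\nline2\nline3\nline4\n' > app.log

# Show only the last 2 lines (default without -n is the last 10)
tail -n 2 app.log

# Follow new lines as they are appended; press CTRL + C to stop
# tail -f app.log
```

The `tail -f` line is left commented because it blocks until interrupted, which is exactly what makes it useful for watching live logs.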
Final Note
This article does not delve into every aspect of Linux involved in data engineering. I look forward to sharing more thoughts in future articles. There is a first time for everything!!