DEV Community

Lorraine Njagi

How Linux is Used in Real-World Data Engineering

What is Data Engineering?

Data engineering is the practice of transforming raw data and preparing it for analysis or use by data analysts and data scientists. Data engineers ensure that both the infrastructure and the data itself are in the right form, converting vast amounts of raw data into usable data sets.


Why is Linux Used in Data Engineering?

  • Most cloud infrastructures such as AWS, Azure and GCP run on Linux, using it for their virtual machines and data services.
  • Tools such as Kafka, Hadoop and Spark are built on, and best supported by, Linux's open-source ecosystem.
  • Linux offers the performance and stability needed to run large data pipelines without reboots.
  • Automation and scripting: the Linux command line (CLI) and tools such as cron enable automation of data tasks and Extract, Transform, Load (ETL) pipelines.
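As a small sketch of that last point, a cron entry like the following (script path and log path are hypothetical) would run a nightly ETL job at 2 a.m. and append its output to a log:

```shell
# Edit the current user's crontab with: crontab -e
# min hour day-of-month month day-of-week  command
0 2 * * * /opt/etl/run_pipeline.sh >> /var/log/etl/nightly.log 2>&1
```

The `2>&1` redirects errors into the same log file, so failures are captured alongside normal output.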

Linux Basics for Data Engineering

There are a few Linux basics that data engineers should be aware of.


1. The File System Structure

The Linux file system takes the structure of a tree, with the starting point as the root (/) directory.

Important directories under the root are:

  • /etc
  • /var
  • /bin
  • /tmp

/etc

/etc contains configuration files and folders. This directory controls the configuration of the entire system: how the OS behaves and how users are set up. For example, the passwd file contains details about user accounts.
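You can inspect /etc/passwd directly with standard tools; a quick sketch:

```shell
# Show the root user's entry in /etc/passwd
grep '^root:' /etc/passwd

# List just the usernames (the first colon-separated field)
cut -d: -f1 /etc/passwd | head -5
```

Each line in passwd is a colon-separated record, which is why `cut -d:` works on it.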


/var

This folder contains variable data that changes continuously during system operations.

Examples:

  • System logs
  • Authorization logs
  • Databases
  • Runtime state files

It matters to a data engineer in several ways:

/var/log/              # Where Spark, Kafka and web-server logs live
/var/lib/postgresql/   # Actual PostgreSQL database storage
/var/spool/cron/       # Job queues for scheduled cron tasks

/bin

This contains essential command-line programs that are available to all users, even in single-user or recovery mode, such as ls and cp.

cd   # Used to change directory
ls   # Used to list files and directories

ls options:

ls -a   # Include hidden files
ls -l   # Show file permissions and details

Single User / Recovery Mode

Single-user or recovery mode is a special Linux boot mode used for repair and maintenance.

In this mode:

  • Only one user logs in (root user)
  • No networks are started
  • No GUI is started
  • Used for system repair (e.g. password reset)

2. File Permissions

File permissions are relevant to a data engineer because the role involves:

  • Moving data between systems
  • Automatic running of scripts
  • Handling sensitive credentials
  • Maintaining data integrity

Permission Table

Permission   Symbol   Numerical Value
Read         r        4
Write        w        2
Execute      x        1

The command used:

chmod   # change mode

Example

Create a directory:

mkdir Yourname

Create a file:

touch file.txt

Restrict permissions so only the owner has full access:

chmod 700 file.txt

Allow:

  • Group → read & execute
  • Others → read only

chmod 754 file.txt   # owner: rwx (7), group: r-x (5), others: r-- (4)
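You can verify the result with ls -l or stat; a small sketch:

```shell
touch file.txt
chmod 754 file.txt

ls -l file.txt          # The mode column shows -rwxr-xr--
stat -c '%a' file.txt   # Prints the octal mode: 754
```

`stat -c '%a'` is handy in scripts because it prints just the numeric mode.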

To change ownership:

chown user:group file.txt   # Change the owner and group of a file

3. Disk Usage

A data engineer should pay attention to disk usage because it directly affects:

  • Performance
  • Storage capacity planning
  • Cost
  • Pipeline failures (when a disk fills up)

Commands:

du -sh   # Summarize the size of a file or directory, human-readable
df -h    # Show free disk space per filesystem, human-readable
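A common troubleshooting step when a disk fills is finding what is eating the space; a sketch combining du with sort:

```shell
# Free space on the root filesystem, human-readable
df -h /

# Largest items under /var, sorted by size (some paths may need sudo)
du -sh /var/* 2>/dev/null | sort -h | tail -5
```

`sort -h` understands human-readable suffixes (K, M, G), so the output ends with the biggest directories.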

4. Searching

Data engineers handle large files, so searching is important.

Commands:

grep    # Search within a file
find    # Search for files and directories
locate  # Fast search using index
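For example, pulling error lines out of a log file (the sample data here is created inline for illustration):

```shell
# Create a small sample log
printf 'INFO start\nERROR disk full\nINFO done\nERROR timeout\n' > app.log

grep 'ERROR' app.log      # Print matching lines
grep -c 'ERROR' app.log   # Count matches: prints 2
grep -ri 'error' .        # Recursive, case-insensitive search
find . -name '*.log'      # Find files by name pattern
```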

5. Process Management

A process is a program that is running in memory.

Commands:

ps                 # View running processes
ps -u yourusername # View your processes
top                # Live process monitor
htop               # Modern version of top
kill               # End a process
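A minimal sketch of finding and stopping a process, using a background sleep as a stand-in for a stuck job:

```shell
sleep 300 &                 # Start a long-running process in the background
pid=$!                      # Capture its process ID
ps -p "$pid" -o pid,comm    # Confirm it is running
kill "$pid"                 # Send SIGTERM to end it gracefully
```

`kill` sends SIGTERM by default; `kill -9` (SIGKILL) is the forceful last resort for processes that ignore it.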

Real World Data Engineering Workflow on Linux

Linux servers are used to manage the entire data pipeline.

A data pipeline refers to the entire process of:

Data Collection → Cleaning → Formatting → Storage → Analysis → Presentation
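At its smallest, those stages can be sketched with nothing but standard shell tools (file names and data are hypothetical, created inline for illustration):

```shell
# Collect: a raw CSV of sales
printf 'region,amount\neast,100\nwest,\neast,50\n' > raw.csv

# Clean & format: drop rows with a missing amount
awk -F, 'NR==1 || $2 != ""' raw.csv > clean.csv

# Store: append to a growing dataset
cat clean.csv >> warehouse.csv

# Analyze: total the amount column
awk -F, 'NR>1 {sum += $2} END {print sum}' clean.csv   # prints 150
```

Real pipelines replace each stage with dedicated tools (Kafka, Spark, a warehouse), but the shape of the flow is the same.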

Data Pipelines

Batch Pipelines

  • Process data in batches
  • Scheduled (hourly, daily)
  • Suitable for historical data
  • Used for computationally expensive operations
  • Tools: Apache Spark, Hadoop, Airflow

Realtime Processing Pipelines

  • Data analyzed continuously as it flows
  • Used in fraud detection and monitoring systems
  • Requires realtime analysis

Stages of a Data Pipeline

1. Ingestion

This refers to bringing data from different sources into your system storage.

Sources include:

  • Data lakes
  • APIs
  • IoT devices

Two methods of ingestion:

  • Batch ingestion
  • Streaming ingestion

Ingestion Tool

Apache Kafka – sits between your data sources and destinations, streaming data between them.


2. Transformation

This involves:

  • Cleaning
  • Restructuring
  • Enriching
  • Standardization
  • Aggregation
  • Validation
  • Filtering
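A small illustration of the standardization and deduplication steps using standard tools (sample data created inline):

```shell
# Raw input with inconsistent casing and a duplicate row
printf 'Alice,NY\nalice,ny\nBob,LA\nAlice,NY\n' > users.csv

# Standardize: lowercase everything
tr 'A-Z' 'a-z' < users.csv > lower.csv

# Deduplicate: sort and keep only unique rows
sort -u lower.csv > dedup.csv

cat dedup.csv   # Only alice,ny and bob,la remain
```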

3. Storage

Storage types:

  • Relational databases (SQL)
  • Data warehouses (Azure Synapse, Amazon Redshift)

4. Analysis

Type          Description
Descriptive   What happened
Diagnostic    Why it happened
Predictive    What will happen
Prescriptive  What should be done

Conclusion

Linux provides:

  • Control
  • Automation
  • Scale

Automation includes scheduling jobs using cron.


Dealing With Large Files in Linux

Example: tail

tail -f /var/log/syslog

This:

  • Shows the last 10 lines of the file by default
  • Keeps the file open
  • Prints new lines in realtime
  • Stops when you press CTRL + C

On modern Linux systems, GNU tail uses inotify, a kernel mechanism that notifies tail when new lines are written (falling back to periodic polling where inotify is unavailable).


Final Note

This article does not delve into every aspect of Linux used in data engineering. I look forward to sharing more thoughts in future articles. There is a first time for everything!
