What is Data Engineering?
Data engineering is the practice of transforming raw data and preparing it for analysis or use by data analysts and data scientists. It ensures that both the infrastructure and the data itself are in the right form. Data engineers convert vast amounts of raw data into usable datasets.
Why Is Linux Used in Data Engineering?
- Most Cloud infrastructures such as AWS, Azure and GCP run on Linux. They use Linux for their virtual machines and data services.
- Tools such as Apache Kafka, Hadoop and Spark fit naturally into Linux's open-source ecosystem.
- Linux offers performance and stability for running large data pipelines without needing reboots.
- Automation and scripting: Linux offers a command-line interface (CLI) and tools such as cron, which enable automation of data tasks and Extract, Transform, Load (ETL) pipelines.
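The ETL idea above can be sketched as a tiny shell pipeline. This is a minimal illustration with made-up filenames and data, not a real pipeline:

```shell
# Extract: create some sample raw data (stand-in for a real source)
printf 'id,amount\n1,100\n2,250\n3,75\n' > raw_sales.csv

# Transform: keep the header plus rows with amount >= 100
awk -F',' 'NR == 1 || $2 >= 100' raw_sales.csv > clean_sales.csv

# Load: append to a destination file (stand-in for a database load)
cat clean_sales.csv >> sales_history.csv

cat clean_sales.csv
```

Each step here is one command, which is exactly why shell scripts plus cron are a common first automation layer on Linux.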
Linux Basics for Data Engineering
There are a few Linux basics that data engineers should be aware of.
1. The File System Structure
The Linux file system takes the structure of a tree, with the starting point as the root (/) directory.
Important directories under the root are:
/etc, /var, /bin, /tmp
/etc
/etc contains configuration files and folders. This folder controls how the entire system is configured: how the OS behaves and how users behave. For example, the passwd file contains details about user accounts.
/var
This folder contains variable data that changes continuously during system operations.
Examples:
- System logs
- Authorization logs
- Databases
- Runtime state files
It is of importance to a data engineer in various ways such as:
/var/log/ # Where Spark, Kafka and Apache logs live
/var/lib/postgresql/ # Actual PostgreSQL database storage
/var/spool/cron # Job queues for scheduled cron tasks
/bin
This contains essential command-line programs that are available to all users, even in single-user or recovery mode, such as cd and ls.
cd # Used to change directory
ls # Used to list files and directories
ls options:
ls -a # Include hidden files
ls -l # Show file permissions and details
Single User / Recovery Mode
Single user or recovery mode refers to a Linux special boot mode used for repair and maintenance.
In this mode:
- Only one user logs in (root user)
- No networks are started
- No GUI is started
- Used for system repair (e.g. password resets)
2. File Permissions
File permissions are relevant to a data engineer because the role involves:
- Moving data between systems
- Automatic running of scripts
- Handling sensitive credentials
- Maintaining data integrity
Permission Table
| Permission | Symbol | Numeric Value |
|---|---|---|
| Read | r | 4 |
| Write | w | 2 |
| Execute | x | 1 |
The command used:
chmod # change mode
Example
Create a directory:
mkdir Yourname
Create a file:
touch file.txt
Restrict permissions so only the owner has full access:
chmod 700 file.txt
Allow:
- Group → read & execute
- Others → read only
chmod 754 file.txt
To change ownership:
chown # Change the owner (and optionally group) of a file
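The numeric modes above can be verified end to end. This is a minimal sketch; `stat -c` is the GNU coreutils form (macOS/BSD use `stat -f`):

```shell
# Create a file to experiment on
touch file.txt

# Owner: rwx (4+2+1 = 7), group: r-x (4+1 = 5), others: r-- (4)
chmod 754 file.txt

# Print the octal permission bits to confirm
stat -c '%a' file.txt
```

The printed value is `754`, matching the three digits passed to chmod.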
3. Disk Usage
A data engineer should pay attention to disk usage because it directly affects:
- Performance
- Storage capacity
- Cost
- Pipeline failures (when disk fills)
Commands:
du -sh # Check the size of a file or directory (summary, human-readable)
df -h # Check disk space
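A quick way to try both commands on a throwaway directory (the names here are illustrative):

```shell
# Create a directory with a small file in it
mkdir -p data_dir
printf 'some sample data\n' > data_dir/sample.txt

# Summarize the directory's total size, human-readable
du -sh data_dir

# Show free space on the filesystem containing data_dir
df -h data_dir
```

In a pipeline context, checking `df -h` on the volume that holds `/var/lib` (databases) and `/var/log` (logs) is a common first step when jobs start failing.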
4. Searching
Data engineers handle large files, so searching is important.
Commands:
grep # Search within a file
find # Search for files and directories
locate # Fast search using index
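A small example of grep and find in action, using a made-up log file:

```shell
# Create a sample log to search
printf 'INFO job started\nERROR disk full\nINFO job ended\n' > pipeline.log

# grep: print lines matching a pattern within a file
grep 'ERROR' pipeline.log

# find: locate files by name under the current directory
find . -name 'pipeline.log'
```

`grep` searches inside files, while `find` searches the filesystem tree itself; the two are often combined, e.g. `find . -name '*.log' | xargs grep 'ERROR'`.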
5. Process Management
A process is a program that is running in memory.
Commands:
ps # View running processes
ps -u yourusername # View your processes
top # Live process monitor
htop # Modern version of top
kill # End a process
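The lifecycle of a process can be walked through with a harmless background job. A minimal sketch:

```shell
# Start a long-running process in the background
sleep 60 &
pid=$!

# ps -p: show just that process (pid and command name)
ps -p "$pid" -o pid=,comm=

# kill: send SIGTERM to end it, then reap it with wait
kill "$pid"
wait "$pid" 2>/dev/null || true
```

This is the same pattern used when a runaway pipeline job needs to be found with `ps` or `top` and then terminated with `kill`.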
Real World Data Engineering Workflow on Linux
Linux servers are used to manage the entire data pipeline.
A data pipeline refers to the entire process of:
Data Collection → Cleaning → Formatting → Storage → Analysis → Presentation
Data Pipelines
Batch Pipelines
- Process data in batches
- Scheduled (hourly, daily)
- Suitable for historical data
- Used for computationally expensive operations
- Tools: Apache Spark, Hadoop, Airflow
Realtime Processing Pipelines
- Data analyzed continuously as it flows
- Used in fraud detection and monitoring systems
- Requires realtime analysis
Stages of a Data Pipeline
1. Ingestion
This refers to bringing data from different sources into your system storage.
Sources include:
- Data lakes
- APIs
- IoT devices
Two methods of ingestion
- Batch ingestion
- Streaming ingestion
Ingestion Tool
Apache Kafka – sits between your data sources and destinations, streaming records between them.
2. Transformation
This involves:
- Cleaning
- Restructuring
- Enriching
- Standardization
- Aggregation
- Validation
- Filtering
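Several of these operations (standardization, validation, filtering) can be sketched in a few lines of awk. The data and filenames here are invented for illustration:

```shell
# Sample raw data: stray spaces and a row missing a value
printf 'name, age\nalice , 34\nbob,\ncarol, 29\n' > raw.csv

awk -F',' '
  { gsub(/ /, "") }              # standardization: strip stray spaces
  NR == 1 || $2 != "" { print }  # validation: keep header and rows with an age
' raw.csv > clean.csv

cat clean.csv
```

Real pipelines do this with tools like Spark, but the logic per record is the same: normalize, validate, keep or drop.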
3. Storage
Storage types:
- Relational databases (SQL)
- Data warehouses (Azure Synapse, Amazon Redshift)
4. Analysis
| Type | Description |
|---|---|
| Descriptive | What happened |
| Diagnostic | Why it happened |
| Predictive | What will happen |
| Prescriptive | What should be done |
Conclusion
Linux provides:
- Control
- Automation
- Scale
Automation includes scheduling jobs using CRON.
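As an illustration, a crontab entry (edited with `crontab -e`) that runs a hypothetical ETL script every day at 2 a.m. could look like this; the script path and log file are made up:

```
# minute hour day-of-month month day-of-week command
0 2 * * * /home/user/scripts/run_etl.sh >> /var/log/etl.log 2>&1
```

The five time fields come first, then the command; redirecting stdout and stderr to a log file under /var/log keeps a record of each run.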
Dealing With Large Files in Linux
Example: tail
tail -f /var/log/syslog
This:
- Shows last 10 lines
- Keeps the file open
- Prints new lines in realtime
- Stops when you press CTRL + C
On Linux, GNU tail uses inotify, a kernel mechanism that notifies tail when the file changes, falling back to periodic polling where inotify is unavailable.
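The non-following behavior is easy to try on a small sample file (the filename is illustrative):

```shell
# Create a sample log with four lines
printf 'line1\nline2\nline3\nline4\n' > app.log

# Show only the last 2 lines (default without -n is the last 10)
tail -n 2 app.log

# Follow new lines as they are appended; press CTRL + C to stop
# tail -f app.log
```

The `tail -f` line is left commented because it blocks until interrupted, which is exactly what makes it useful for watching live logs.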
Final Note
This article does not delve into every aspect of Linux involved in data engineering. I look forward to sharing more thoughts in future articles. There is a first time for everything!!