Lameck Odhiambo

Posted on Jun 8

Linux Fundamentals for Data Engineers

#linux #database #dataengineering #techtalks

Introduction

Linux is a popular open-source operating system modeled on UNIX. Think of UNIX as the original blueprint, and Linux as a modern, completely independent reimplementation built from that same blueprint. At its core is the Linux kernel, the base code that manages communication between a computer's hardware and software.

Use cases of Linux beyond Data Engineering

You likely use Linux every day without realizing it:
Mobile Devices: The Android operating system is built on top of the Linux kernel.
Servers & Cloud: The vast majority of web servers and cloud services (like AWS and Google Cloud) run on Linux.
Smart Home & IoT: Smart TVs, routers, and embedded devices often use Linux.
Supercomputers: Every system on the TOP500 list of the world's fastest supercomputers has run Linux since November 2017.
Gaming: SteamOS, the Linux-based system used on handhelds such as the Steam Deck, runs many Windows games through the Proton compatibility layer.

Since our focus is Data Engineering, let's look at how Data Engineers use Linux.

Data engineers use Linux as the underlying foundation for modern data infrastructure, since nearly all cloud environments, container systems, and big data frameworks run natively on Linux servers.

Linux use cases for Data Engineers

Processing data before Python touches it
Building automation and ingestion scripts
Interacting with cloud systems and remote servers
Deploying containers and orchestration tools
Debugging and infrastructure monitoring
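To make the first use case concrete, here is a minimal sketch of profiling a CSV with standard tools before any Python is involved. The file `orders.csv`, its columns, and its values are made up for illustration.

```shell
# Hypothetical sample data; orders.csv and its columns are assumptions.
workdir=$(mktemp -d)
cat > "$workdir/orders.csv" <<'EOF'
order_id,country,amount
1,KE,120
2,UG,80
3,KE,45
4,TZ,300
5,KE,60
EOF

# Peek at the header and first rows without loading the whole file
head -n 3 "$workdir/orders.csv"

# Count data rows (tail -n +2 skips the header line)
row_count=$(tail -n +2 "$workdir/orders.csv" | wc -l | tr -d ' ')
echo "rows: $row_count"

# Frequency of each country (column 2), most common first
tail -n +2 "$workdir/orders.csv" | cut -d',' -f2 | sort | uniq -c | sort -rn

# Keep only the most common country
top_country=$(tail -n +2 "$workdir/orders.csv" | cut -d',' -f2 | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "top country: $top_country"

# Sum the amount column (column 3) with awk
total=$(awk -F',' 'NR > 1 { sum += $3 } END { print sum }' "$workdir/orders.csv")
echo "total amount: $total"
```

Checks like these are cheap to run on a large file and can catch an unexpected row count or value before a heavier job starts.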

Sample Linux Commands

File & Directory management

ls -la                     # List all files (including hidden) with details
ls -lh                     # List files with human-readable sizes
pwd                        # Print current working directory
cd /path/to/dir            # Change directory
cd ~                       # Go to home directory
cd -                       # Go back to previous directory

mkdir foldername           # Create directory
mkdir -p dir1/dir2/dir3    # Create nested directories
touch filename.txt         # Create empty file

cp file.txt /dest/         # Copy file
cp -r folder/ /dest/       # Copy folder recursively
mv oldname newname         # Rename or move file/folder
rm file.txt                # Remove file
rm -rf folder/             # Remove folder and contents (use with caution!)
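
The commands above can be combined into a short session. This is only a sketch: the directory names (`pipeline`, `raw`, `staging`) and file names are hypothetical, and everything happens inside a throwaway directory.

```shell
# Work in a throwaway directory so nothing real is touched
base=$(mktemp -d)
cd "$base"

# Create a nested layout in one step (directory names are hypothetical)
mkdir -p pipeline/data/raw pipeline/data/staging

# Create an empty placeholder file, then copy and rename it
touch pipeline/data/raw/events.csv
cp pipeline/data/raw/events.csv pipeline/data/staging/
mv pipeline/data/staging/events.csv pipeline/data/staging/events_copy.csv

# Inspect the result, including hidden entries and details
ls -la pipeline/data/staging

# Remove the staging directory; rm -rf is limited to a path we just created
rm -rf pipeline/data/staging

# Return to the directory we started from
cd - > /dev/null
```

Pointing `rm -rf` at a specific path you created yourself, rather than a variable that might be empty, is one way to reduce the risk the caution above refers to.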

System Information

uname -a                   # Show kernel and system info
lsb_release -a             # Show distribution info
cat /etc/os-release        # Show OS details
hostname                   # Show hostname
uptime                     # Show system uptime
free -h                    # Show memory usage (human readable)
df -h                      # Show disk space usage
du -sh /path               # Show size of directory
top                        # Live process viewer (press q to quit)
htop                       # Better interactive process viewer (if installed)
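
As one possible use of `df` in a script, the sketch below checks how full a filesystem is before a large load. The path `/` and the 90% threshold are assumptions chosen for the example.

```shell
# Path to check and threshold are assumptions for this sketch
target="/"
threshold=90

# df -P prints one line per filesystem in POSIX format; column 5 is the use percentage
used_pct=$(df -P "$target" | awk 'NR == 2 { gsub("%", "", $5); print $5 }')

if [ "$used_pct" -ge "$threshold" ]; then
    echo "WARNING: $target is ${used_pct}% full"
else
    echo "OK: $target is ${used_pct}% full"
fi
```

The `-P` flag is used here instead of `-h` because its fixed column layout is easier to parse reliably.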

Process Management

ps aux                     # List all running processes
ps aux | grep nginx        # Find specific process
kill 1234                  # Ask process to stop by PID (sends SIGTERM)
kill -9 1234               # Force kill process (sends SIGKILL; last resort)
pkill nginx                # Kill process by name
jobs                       # List background jobs
fg %1                      # Bring job to foreground
bg %1                      # Send job to background
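
Here is a small sketch of managing a background process from a script. `sleep 60` stands in for a real long-running job, and the variable names are illustrative.

```shell
# Start a long-running stand-in for a real job (sleep is a placeholder)
sleep 60 &
job_pid=$!
echo "started job with PID $job_pid"

# kill -0 sends no signal; it only checks that the process exists
if kill -0 "$job_pid" 2>/dev/null; then
    was_running=yes
else
    was_running=no
fi
echo "running: $was_running"

# Ask the process to stop (SIGTERM), then collect its exit status
kill "$job_pid"
exit_status=0
wait "$job_pid" || exit_status=$?
echo "job exited with status $exit_status"
```

A process ended by a signal reports an exit status of 128 plus the signal number, which is a quick way to tell a killed job apart from one that failed on its own.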

File searching & Content

find / -name "*.txt" 2>/dev/null   # Find files by name
locate filename                    # Fast search (needs updatedb)
grep "search text" file.txt        # Search inside file
grep -r "text" /path/              # Recursive search in directory
cat file.txt                       # Display file content
less file.txt                      # View file with scrolling
head -n 20 file.txt                # First 20 lines
tail -n 20 file.txt                # Last 20 lines
tail -f /var/log/syslog            # Follow log file in real-time
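
`find` and `grep` are often used together when debugging. The following sketch builds a small, made-up log directory and searches it; the file names and log lines are hypothetical.

```shell
# Build a small hypothetical log directory to search
logdir=$(mktemp -d)
mkdir -p "$logdir/app"
printf 'INFO start\nERROR connection refused\nINFO retry\n' > "$logdir/app/load.log"
printf 'INFO start\nINFO done\n' > "$logdir/app/export.log"
printf 'ERROR not a log\n' > "$logdir/app/notes.txt"

# Find only *.log files under the directory
find "$logdir" -type f -name "*.log"

# List the log files that contain at least one ERROR line (grep -l prints file names)
files_with_errors=$(find "$logdir" -type f -name "*.log" -exec grep -l "ERROR" {} +)
echo "$files_with_errors"

# Count ERROR lines in one file and show the most recent one
error_count=$(grep -c "ERROR" "$logdir/app/load.log")
last_error=$(grep "ERROR" "$logdir/app/load.log" | tail -n 1)
echo "errors: $error_count, last: $last_error"
```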

Networking

ip addr show               # Show network interfaces (modern)
ifconfig                   # Show interfaces (older)
ping google.com            # Test connectivity
curl -I https://example.com # Get HTTP headers
wget https://example.com/file.zip   # Download a file
ssh user@192.168.1.100     # SSH into remote server
scp file.txt user@host:/path/   # Copy file via SSH
netstat -tuln              # Show listening ports
ss -tuln                   # Modern alternative to netstat
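
Network commands such as `curl` and `scp` can fail for temporary reasons, so scripts often wrap them in a retry loop. The `retry` function below is a hypothetical helper written for this example, not a standard tool, and the usage lines are commented out because they need network access.

```shell
# retry MAX DELAY COMMAND...: run COMMAND until it succeeds or MAX attempts are used.
# This helper is hypothetical; it is not part of any standard package.
retry() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while true; do
        # Run the command exactly as given; stop as soon as it succeeds
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay_seconds}s" >&2
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
}

# Example usage (commented out because it needs network access):
# retry 3 5 curl -fsS -o file.zip https://example.com/file.zip
# retry 3 5 scp file.txt user@host:/path/
```

A fixed delay keeps the sketch simple; a real pipeline might prefer increasing the delay between attempts.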

Package Management

# Debian/Ubuntu
sudo apt update            # Refresh package lists
sudo apt upgrade           # Upgrade installed packages
sudo apt install htop      # Install a package
sudo apt remove htop       # Remove a package
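
Before installing anything, a script can check which tools are already available. This sketch uses `command -v` for that; the tool list is an assumption, and the last name is deliberately one that should not exist.

```shell
# Report which of the listed tools are available on PATH.
# The tool list is an assumption; adjust it to your own pipeline.
missing=""
for tool in ls grep tar definitely-not-installed-tool; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
done

# On Debian/Ubuntu, a missing package could then be installed with:
#   sudo apt install <package>
echo "missing tools:$missing"
```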

User & Permissions

whoami                     # Current user
sudo command               # Run as superuser
su - username              # Switch user
chmod 755 file.sh          # Change permissions (rwxr-xr-x)
chmod +x script.sh         # Make executable
chown user:group file.txt  # Change owner
id                         # Show user/group IDs
passwd                     # Change password
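
To see what the numeric modes mean in practice, the sketch below sets permissions on two made-up files and reads them back. The file names are hypothetical, and `stat -c` is the GNU form found on most Linux systems.

```shell
permdir=$(mktemp -d)

# A tiny hypothetical script that is not yet executable
printf '#!/bin/sh\necho "loading data"\n' > "$permdir/load.sh"

# 755 = rwxr-xr-x: owner can read/write/execute, everyone else can read/execute
chmod 755 "$permdir/load.sh"
script_mode=$(stat -c '%a' "$permdir/load.sh")

# 640 = rw-r-----: owner read/write, group read-only, others no access
touch "$permdir/credentials.env"
chmod 640 "$permdir/credentials.env"
secret_mode=$(stat -c '%a' "$permdir/credentials.env")

echo "load.sh: $script_mode, credentials.env: $secret_mode"

# [ -x FILE ] is true when the current user may execute the file
if [ -x "$permdir/load.sh" ]; then
    echo "load.sh is executable"
fi
```

Each digit is the sum of read (4), write (2), and execute (1) for owner, group, and others in that order.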

Compression & Archives

tar -czvf archive.tar.gz /folder/     # Create compressed tarball
tar -xzvf archive.tar.gz              # Extract
zip -r archive.zip folder/            # Create zip
unzip archive.zip                     # Extract zip
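
A quick round trip shows how the `tar` flags fit together. The directory and file names here are hypothetical, and the `v` (verbose) flag is left out to keep the output short.

```shell
arc=$(mktemp -d)
mkdir -p "$arc/export" "$arc/restore"
printf 'id,value\n1,10\n2,20\n' > "$arc/export/part-0001.csv"

# Create a gzip-compressed tarball; -C makes paths inside the archive relative
tar -czf "$arc/export.tar.gz" -C "$arc" export

# List contents without extracting (t = list)
tar -tzf "$arc/export.tar.gz"

# Extract into a different directory
tar -xzf "$arc/export.tar.gz" -C "$arc/restore"

# Confirm the restored file is byte-for-byte identical to the original
if cmp -s "$arc/export/part-0001.csv" "$arc/restore/export/part-0001.csv"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

Listing an archive with `-t` before extracting is a useful habit, since it shows where the files will land.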

Practical Example
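
One way the commands in this article might come together is a small ingestion step: check a file's header, separate complete rows from incomplete ones, and compress the original. This is an illustrative sketch, not the author's demo; all file names, columns, and values are hypothetical.

```shell
# All file names, columns, and values below are hypothetical.
run=$(mktemp -d)
mkdir -p "$run/landing" "$run/clean" "$run/rejected"

cat > "$run/landing/sales.csv" <<'EOF'
sale_id,region,amount
101,east,250
102,west,
103,east,75
104,,40
105,north,310
EOF

src="$run/landing/sales.csv"
expected_header="sale_id,region,amount"

# 1. Fail early if the header is not what downstream code expects
actual_header=$(head -n 1 "$src")
if [ "$actual_header" != "$expected_header" ]; then
    echo "unexpected header: $actual_header" >&2
    exit 1
fi

# 2. Keep rows where all three fields are non-empty; set aside the rest
echo "$expected_header" > "$run/clean/sales.csv"
awk -F',' -v good="$run/clean/sales.csv" -v bad="$run/rejected/sales.csv" '
    NR == 1 { next }
    NF == 3 && $1 != "" && $2 != "" && $3 != "" { print >> good; next }
    { print >> bad }
' "$src"

# 3. Summarise the run
clean_rows=$(tail -n +2 "$run/clean/sales.csv" | wc -l | tr -d ' ')
rejected_rows=$(wc -l < "$run/rejected/sales.csv" | tr -d ' ')
echo "clean: $clean_rows, rejected: $rejected_rows"

# 4. Compress the original so the landing area stays small
gzip "$src"
ls -lh "$run/landing"
```

Keeping rejected rows in their own file, rather than dropping them, makes it easier to investigate data quality problems later.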

Conclusion

Linux is the essential foundation for modern data engineering. Mastery of Linux command-line skills, shell scripting, text processing, process management, and server administration is critical for building, managing, and troubleshooting data pipelines effectively.

As data infrastructure grows more complex with cloud, containers, and tools like Spark, Kafka, Airflow, and Kubernetes, strong Linux knowledge provides a significant competitive edge. It enables faster automation, better problem-solving, and higher efficiency.

Key Takeaway: Investing in Linux fundamentals offers one of the best returns for any data engineer. The terminal is the primary language of data platforms; master it to unlock greater productivity and career growth.

Top comments (2)

leslie angu • Jun 10

The article was well written. The Linux commands were well structured with titles. I would like to see the Linux commands being used on the server that was provided or, if you have one, on a virtual machine running Ubuntu. The two examples were a good demonstration of what we learnt in class, along with some commands you researched that were new to me. I hope to see more of your work.

Lameck Odhiambo • Jun 19

Here's the link to the demo github.com/Lameck-22/step-by-step-...