<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emilio Ochieng</title>
    <description>The latest articles on DEV Community by Emilio Ochieng (@emilio_ochieng_632030149c).</description>
    <link>https://dev.to/emilio_ochieng_632030149c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3952242%2Fdc9ad1a1-cf7a-46eb-9abe-93c1a2468598.jpg</url>
      <title>DEV Community: Emilio Ochieng</title>
      <link>https://dev.to/emilio_ochieng_632030149c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emilio_ochieng_632030149c"/>
    <language>en</language>
    <item>
      <title>Linux Fundamentals for Data Engineers.</title>
      <dc:creator>Emilio Ochieng</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:02:38 +0000</pubDate>
      <link>https://dev.to/emilio_ochieng_632030149c/linux-fundamentals-for-data-engineers-84h</link>
      <guid>https://dev.to/emilio_ochieng_632030149c/linux-fundamentals-for-data-engineers-84h</guid>
      <description>&lt;h3&gt;
  
  
  The Essential Guide
&lt;/h3&gt;

&lt;p&gt;In the world of data engineering, Python, SQL, and Spark often steal the spotlight. Yet underneath these tools lies the operating system that powers most data platforms: Linux. Whether you're managing Airflow on an EC2 instance, troubleshooting a Kafka cluster, or building ETL pipelines in a Docker container, Linux proficiency directly affects your productivity and reliability as a data engineer. This guide covers the Linux fundamentals every data engineer should master.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Why Linux Matters in Data Engineering
&lt;/h2&gt;

&lt;p&gt;Most cloud data platforms (AWS, GCP, Azure) run on Linux. Self-hosted tools like Apache Airflow, dbt, Spark, Kafka, Flink, and PostgreSQL are designed for Linux environments. Data engineers who understand Linux can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debug infrastructure issues faster&lt;/li&gt;
&lt;li&gt;Write more efficient automation scripts&lt;/li&gt;
&lt;li&gt;Secure data pipelines properly&lt;/li&gt;
&lt;li&gt;Optimize resource usage&lt;/li&gt;
&lt;li&gt;Reduce dependency on DevOps teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering Linux turns you from a "SQL + Python" engineer into a true infrastructure-aware data professional.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Installation &amp;amp; User Management
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Choosing the Right Distribution
&lt;/h4&gt;

&lt;p&gt;For data engineering, Ubuntu LTS (22.04 or 24.04) is the most popular choice due to its stability and vast package ecosystem. CentOS/Rocky Linux/AlmaLinux are common in enterprise environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Creating a Dedicated User
&lt;/h4&gt;

&lt;p&gt;Never run data pipelines as root. Create a dedicated user:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bash&lt;/strong&gt;&lt;br&gt;
sudo adduser dataeng&lt;br&gt;
sudo usermod -aG sudo dataeng   # Optional: grant sudo access&lt;/p&gt;

&lt;h4&gt;
  
  
  SSH Key Authentication (Best Practice)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;bash&lt;/strong&gt;&lt;br&gt;
ssh-keygen -t ed25519 -C "dataeng@workstation"&lt;br&gt;
ssh-copy-id dataeng@your-server-ip&lt;/p&gt;

&lt;p&gt;Once key-based login works, disable password authentication by setting PasswordAuthentication no in /etc/ssh/sshd_config, then reload the SSH service.&lt;/p&gt;
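
&lt;p&gt;The sshd_config change can be scripted. The following is a minimal sketch rather than a definitive procedure: disable_password_auth is a hypothetical helper name, it assumes GNU sed and grep, and it edits whichever file path you pass in, so it can be tried on a copy before touching the real /etc/ssh/sshd_config.&lt;/p&gt;

```shell
# Hypothetical helper: force "PasswordAuthentication no" in an
# sshd_config-style file. Assumes GNU sed (for -i and -E).
disable_password_auth() {
    local config="$1"

    # Does the file already contain the directive, commented out or not?
    if grep -Eq '^[#[:space:]]*PasswordAuthentication[[:space:]]' "$config"; then
        # Rewrite every such line to the desired setting.
        sed -i -E 's/^[#[:space:]]*PasswordAuthentication[[:space:]].*/PasswordAuthentication no/' "$config"
    else
        # Directive is absent: append it.
        printf 'PasswordAuthentication no\n' >> "$config"
    fi
}
```

&lt;p&gt;After applying the change to the real file with sudo, running sudo sshd -t checks the configuration for syntax errors before you reload the SSH service. Keep an existing session open while testing, and note that on some distributions drop-in files under /etc/ssh/sshd_config.d/ may override this setting.&lt;/p&gt;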

&lt;h2&gt;
  
  
  3. File System &amp;amp; Permissions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Linux Filesystem Hierarchy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/home – User files&lt;/li&gt;
&lt;li&gt;/var/log – Application and system logs (critical for debugging)&lt;/li&gt;
&lt;li&gt;/etc – Configuration files&lt;/li&gt;
&lt;li&gt;/opt – Third-party software&lt;/li&gt;
&lt;li&gt;/tmp – Temporary files (typically cleaned on reboot)&lt;/li&gt;
&lt;/ul&gt;

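&lt;p&gt;&lt;strong&gt;Permissions Deep Dive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bash&lt;/strong&gt;&lt;br&gt;
ls -la                       # List files with permissions, owner, and group&lt;br&gt;
chmod 755 script.sh          # Owner: rwx, Group/Other: r-x&lt;br&gt;
chown dataeng:dataeng /opt/pipeline   # Set owner and group (no space after the colon)&lt;/p&gt;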

&lt;p&gt;&lt;strong&gt;Special Permissions for Data Work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use umask to control default file permissions and setfacl for complex shared directories in team environments.&lt;/p&gt;

&lt;p&gt;Practical Example:&lt;br&gt;
&lt;strong&gt;bash&lt;/strong&gt;&lt;br&gt;
# Create a shared data directory&lt;br&gt;
sudo mkdir -p /data/lakehouse&lt;br&gt;
sudo chown -R dataeng:dataeng /data&lt;br&gt;
sudo chmod -R 775 /data&lt;/p&gt;
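
&lt;p&gt;To illustrate how umask shapes default permissions, here is a small sketch that runs in a throwaway directory. The umask value 027 and the names extract.csv and staging are assumptions chosen for the example, and stat -c assumes GNU coreutils.&lt;/p&gt;

```shell
# Work in a temporary directory so nothing under /data is touched.
workdir="$(mktemp -d)"

(
    cd "$workdir" || exit 1

    # 027 removes write for the group and all access for others on new files.
    umask 027

    touch extract.csv    # regular files start from 666, so this becomes 640
    mkdir staging        # directories start from 777, so this becomes 750

    # Print the resulting octal mode next to each name (GNU stat).
    stat -c '%a %n' extract.csv staging
)
```

&lt;p&gt;The subshell keeps the umask change from leaking into the rest of the session. Setting umask in the wrapper script of a job applies it to every file that job creates.&lt;/p&gt;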

&lt;h2&gt;
  
  
  4. Process &amp;amp; Resource Management
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Essential commands&lt;/strong&gt;&lt;br&gt;
ps aux | grep spark          # Find processes&lt;br&gt;
top                          # Interactive monitoring (or htop, if installed)&lt;br&gt;
kill -9 PID                  # Force kill the process with that PID (use carefully)&lt;/p&gt;
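
&lt;p&gt;Because kill -9 gives a process no chance to clean up, a common pattern is to send SIGTERM first and escalate only if the process does not exit. A minimal sketch, assuming a shell with local support such as bash; stop_gracefully and its 10-second default timeout are hypothetical choices for illustration.&lt;/p&gt;

```shell
# Hypothetical helper: stop_gracefully PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited after SIGTERM (or was already gone),
# 1 if it had to be force-killed with SIGKILL.
stop_gracefully() {
    local pid="$1"
    local timeout="${2:-10}"
    local waited=0

    # Ask politely first: SIGTERM lets the process flush and clean up.
    kill -TERM "$pid" 2>/dev/null || return 0

    # kill -0 sends no signal; it only checks that the PID still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # Timeout reached: fall back to the force kill.
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

&lt;p&gt;For example, stop_gracefully 12345 30 would give a process with the placeholder PID 12345 up to 30 seconds to exit before forcing it.&lt;/p&gt;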

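&lt;p&gt;&lt;strong&gt;Systemd – The Modern Init System&lt;/strong&gt;&lt;br&gt;
Many self-hosted data tools run as systemd services (unit names such as airflow depend on how the service was installed):&lt;br&gt;
sudo systemctl status postgresql   # Check whether the service is running&lt;br&gt;
sudo systemctl restart airflow     # Restart the service&lt;br&gt;
sudo journalctl -u airflow -f      # Live logs&lt;/p&gt;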

</description>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
