<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karen Wangui</title>
    <description>The latest articles on DEV Community by Karen Wangui (@karen_wangui_).</description>
    <link>https://dev.to/karen_wangui_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3953047%2F4ee4102e-7ae1-475b-89a9-997c19cba3c5.png</url>
      <title>DEV Community: Karen Wangui</title>
      <link>https://dev.to/karen_wangui_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karen_wangui_"/>
    <language>en</language>
    <item>
      <title>LINUX FUNDAMENTALS FOR DATA ENGINEERING</title>
      <dc:creator>Karen Wangui</dc:creator>
      <pubDate>Mon, 08 Jun 2026 13:34:08 +0000</pubDate>
      <link>https://dev.to/karen_wangui_/linux-fundamentals-for-data-engineering-35c6</link>
      <guid>https://dev.to/karen_wangui_/linux-fundamentals-for-data-engineering-35c6</guid>
      <description>&lt;h1&gt;
  
  
  What is Linux?
&lt;/h1&gt;

&lt;p&gt;Linux is an open-source operating system (OS) that has been widely used in the tech industry for many years. At its center is the Linux kernel, which acts as the core of the system by managing hardware and system resources. Unlike closed-source systems such as Windows and macOS, Linux is built and supported by a worldwide community of developers. This collaborative development approach makes Linux highly flexible, secure, and efficient.&lt;br&gt;
This article explores the key Linux fundamentals every data engineer should understand and how they apply in real-world data systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  WHY DO DATA ENGINEERS PREFER LINUX?
&lt;/h2&gt;

&lt;p&gt;Data engineers tend to prefer Linux because it offers the control, flexibility, and reliability required for handling large-scale data systems. Here’s a clear breakdown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Built for servers and large-scale systems&lt;br&gt;
Most data platforms—such as Hadoop, Spark, Airflow, and Kafka—are designed to run on Linux servers. Production data pipelines almost always operate in Linux environments, not Windows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Powerful command-line tools&lt;br&gt;
Linux provides robust terminal utilities (bash, grep, awk, sed, cron) that make it easy to:&lt;br&gt;
· process files quickly&lt;br&gt;
· automate repetitive tasks&lt;br&gt;
· inspect logs&lt;br&gt;
· move and transform data efficiently&lt;br&gt;
These are essential tasks in data engineering workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Better performance and stability&lt;br&gt;
A typical Linux server install is lightweight compared to Windows because it can run without a graphical desktop, which means it:&lt;br&gt;
· uses fewer system resources&lt;br&gt;
· runs reliably for long periods without crashing&lt;br&gt;
· handles heavy workloads more effectively&lt;br&gt;
This is critical for pipelines that need to run 24/7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Straightforward automation and scripting&lt;br&gt;
With Linux, you can easily use:&lt;br&gt;
· shell scripts (Bash)&lt;br&gt;
· Python automation&lt;br&gt;
· cron jobs for scheduling&lt;br&gt;
This simplifies building and maintaining ETL pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud and DevOps compatibility&lt;br&gt;
Major cloud platforms—AWS, Google Cloud, and Azure—mostly run on Linux. As a result, deploying data pipelines almost always means working in Linux-based environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open-source ecosystem&lt;br&gt;
Linux is open source, like most data engineering tools. That brings:&lt;br&gt;
· better compatibility&lt;br&gt;
· broader community support&lt;br&gt;
· easier integration with tools like Spark, Docker, and Kubernetes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy remote server access&lt;br&gt;
Data engineers frequently work on remote machines. Linux makes this simple with SSH and remote terminal access.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  THE LINUX FILE SYSTEM
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;Linux file system&lt;/em&gt; is the way Linux organizes and stores files on a computer. Linux uses a single hierarchical tree structure that starts from one root directory, &lt;code&gt;/&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;/&lt;/td&gt;
&lt;td&gt;Root directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/home&lt;/td&gt;
&lt;td&gt;Users' home directories and personal files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/var/log&lt;/td&gt;
&lt;td&gt;Log files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/tmp&lt;/td&gt;
&lt;td&gt;Temporary files (often cleared on reboot, depending on the distribution)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;/mnt, /media&lt;/td&gt;
&lt;td&gt;Mount points for external storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
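&lt;p&gt;To make the single-tree idea concrete, here is a minimal Python sketch that breaks one of the paths from the table into its parts. It uses &lt;code&gt;PurePosixPath&lt;/code&gt;, which only manipulates path strings and never touches the disk. The file name &lt;code&gt;access.log&lt;/code&gt; is a hypothetical example.&lt;/p&gt;

```python
from pathlib import PurePosixPath

# Every absolute path is a walk down from the single root directory "/".
log_dir = PurePosixPath("/var/log/nginx")

print(log_dir.parts)          # ('/', 'var', 'log', 'nginx')
print(log_dir.parent)         # /var/log
print(log_dir.is_absolute())  # True

# Hypothetical file name, used only to show how paths are joined.
access_log = log_dir / "access.log"
print(access_log)             # /var/log/nginx/access.log
```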
&lt;h2&gt;
  
  
  ESSENTIAL COMMAND-LINE SKILLS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Navigating and Inspecting Files
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pwd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show the current working directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ls -lah&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all files, including hidden ones, with permissions and human-readable sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cd /var/log/nginx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Change directory to /var/log/nginx&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;du -sh *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show the total size of each file and directory in the current directory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Viewing and searching data
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Data Eng Use Case&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;head -n 20 access.log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show first 20 lines&lt;/td&gt;
&lt;td&gt;Peek at CSV/log structure without loading full file&lt;/td&gt;
&lt;td&gt;&lt;code&gt;head -n 1 data.csv&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tail -f access.log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Follow file live as it grows&lt;/td&gt;
&lt;td&gt;Watch Airflow/Spark/Nginx logs in real time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tail -f /var/log/spark/app.log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;less -S huge_file.csv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View file with horizontal scroll, no full load&lt;/td&gt;
&lt;td&gt;Browse 200+ column CSVs without wrapping&lt;/td&gt;
&lt;td&gt;&lt;code&gt;less -S +F huge.csv&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tail -n 100 -f access.log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Last 100 lines + keep following&lt;/td&gt;
&lt;td&gt;Start from recent logs then watch&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tail -n 100 -f app.log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;grep "ERROR" app.log&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Filter lines matching pattern&lt;/td&gt;
&lt;td&gt;Isolate errors from huge logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;grep "500" access.log&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sed -n '1000000,1000020p' file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Print lines 1,000,000 to 1,000,020&lt;/td&gt;
&lt;td&gt;Sample middle of huge file without loading all&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sed -n '1,5p' data.csv&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;awk -F',' '{print $1,$3}' file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Print column 1 and 3&lt;/td&gt;
&lt;td&gt;Quick column extraction before Spark&lt;/td&gt;
&lt;td&gt;&lt;code&gt;awk -F',' '{print $2}' data.csv&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sort file | uniq -c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Count unique values&lt;/td&gt;
&lt;td&gt;Fast frequency table on a column&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
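&lt;p&gt;The reason &lt;code&gt;head&lt;/code&gt; and &lt;code&gt;tail&lt;/code&gt; stay fast on huge files is that they never hold the whole file in memory. A rough Python sketch of the same idea follows. The helper names &lt;code&gt;head&lt;/code&gt; and &lt;code&gt;tail&lt;/code&gt; and the throwaway sample file are assumptions made for illustration, and unlike the real &lt;code&gt;tail&lt;/code&gt;, this version still reads the file from the start.&lt;/p&gt;

```python
from collections import deque
from itertools import islice
import os
import tempfile

def head(path, n=20):
    """Return the first n lines, reading no further than needed."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

def tail(path, n=20):
    """Return the last n lines while keeping only n lines in memory."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]

# Build a small throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join("line " + str(i) for i in range(1, 101)) + "\n")
    sample_path = tmp.name

first_three = head(sample_path, 3)
last_two = tail(sample_path, 2)
os.remove(sample_path)

print(first_three)  # ['line 1', 'line 2', 'line 3']
print(last_two)     # ['line 99', 'line 100']
```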
&lt;h3&gt;
  
  
  Text Processing Trio:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;sed&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;grep&lt;/code&gt; – pattern matching&lt;br&gt;
Extract HTTP 500 errors:&lt;br&gt;
&lt;code&gt;grep ' 500 ' access.log &amp;gt; server_errors.&lt;/code&gt;&lt;/p&gt;
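&lt;p&gt;One thing to watch: a plain text match on &lt;code&gt;' 500 '&lt;/code&gt; also hits lines where some other field, such as the response size, happens to be 500. The sketch below checks the status field specifically. It assumes the common "combined" access log layout, where the status code is the ninth whitespace-separated token. The sample lines and the function name &lt;code&gt;is_server_error&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
def is_server_error(line):
    """True when the status field (9th whitespace-separated token) is 500."""
    fields = line.split()
    try:
        return fields[8] == "500"
    except IndexError:
        return False  # malformed or truncated line

# Hypothetical log lines: the first has status 200 but a 500-byte response.
lines = [
    '10.0.0.1 - - [08/Jun/2026:13:00:00 +0000] "GET /a HTTP/1.1" 200 500 "-" "curl/8.0"',
    '10.0.0.2 - - [08/Jun/2026:13:00:01 +0000] "GET /b HTTP/1.1" 500 1024 "-" "curl/8.0"',
]

substring_hits = [" 500 " in line for line in lines]
status_hits = [is_server_error(line) for line in lines]
print(substring_hits)  # [True, True]
print(status_hits)     # [False, True]
```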

&lt;p&gt;&lt;code&gt;sed&lt;/code&gt; – stream editing&lt;br&gt;
Replace &lt;code&gt;,&lt;/code&gt; with &lt;code&gt;|&lt;/code&gt; as the delimiter:&lt;br&gt;
&lt;code&gt;sed 's/,/|/g' data.csv &amp;gt; data_pipe.txt&lt;/code&gt;&lt;/p&gt;
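&lt;p&gt;The &lt;code&gt;sed&lt;/code&gt; substitution replaces every comma, including commas inside quoted fields. When the data may contain quoted values, a CSV-aware tool is safer. Here is a minimal sketch using Python's standard &lt;code&gt;csv&lt;/code&gt; module; the sample rows are hypothetical.&lt;/p&gt;

```python
import csv
import io

# Hypothetical sample: the second field contains a comma inside quotes.
raw = 'id,city,amount\n1,"Nairobi, Kenya",250\n'

reader = csv.reader(io.StringIO(raw))
out = io.StringIO()
writer = csv.writer(out, delimiter="|", lineterminator="\n")
writer.writerows(reader)

print(out.getvalue())
# id|city|amount
# 1|Nairobi, Kenya|250
```

&lt;p&gt;The comma inside the city name survives because the reader treats the quoted value as one field before the writer applies the new delimiter.&lt;/p&gt;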

&lt;p&gt;&lt;code&gt;awk&lt;/code&gt; – column-based processing&lt;br&gt;
Calculate average order value from a CSV:&lt;br&gt;
&lt;code&gt;awk -F ',' '{sum+=$4} END {print "Average: " sum/NR}' orders.csv&lt;/code&gt;&lt;/p&gt;
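&lt;p&gt;A caveat with the &lt;code&gt;awk&lt;/code&gt; one-liner: if &lt;code&gt;orders.csv&lt;/code&gt; starts with a header row, &lt;code&gt;NR&lt;/code&gt; counts that row too and the header text is treated as 0, so the average comes out too low. The sketch below shows the header-aware calculation in Python. The four-column layout and the sample values are assumptions for illustration.&lt;/p&gt;

```python
import csv
import io

# Hypothetical orders.csv with a header row; the 4th column is the order value.
raw = "order_id,customer,date,amount\n1,a,2026-06-01,100\n2,b,2026-06-02,300\n"

rows = csv.reader(io.StringIO(raw))
next(rows)  # skip the header so it is not counted as an order
amounts = [float(row[3]) for row in rows]

# Guard against an empty file to avoid dividing by zero.
average = sum(amounts) / len(amounts) if amounts else 0.0
print("Average:", average)  # Average: 200.0
```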
&lt;h3&gt;
  
  
  Redirection and Pipes: Building Pipelines
&lt;/h3&gt;

&lt;p&gt;Pipes (&lt;code&gt;|&lt;/code&gt;) send the output of one command to the input of the next, which is the essence of ETL in the shell.&lt;br&gt;
&lt;code&gt;jq '.user_id' raw_events.json | sort | uniq -c | sort -nr &amp;gt; top_users.txt&lt;/code&gt;&lt;/p&gt;
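&lt;p&gt;Each stage of that pipeline has a clear job: extract a field, group identical values, count them, and rank by count. The same steps can be sketched in Python with &lt;code&gt;collections.Counter&lt;/code&gt;. The sketch assumes newline-delimited JSON with a &lt;code&gt;user_id&lt;/code&gt; key, and the sample events are made up.&lt;/p&gt;

```python
import json
from collections import Counter

# Hypothetical newline-delimited JSON events, one object per line.
raw_events = [
    '{"user_id": "u1", "event": "click"}',
    '{"user_id": "u2", "event": "view"}',
    '{"user_id": "u1", "event": "view"}',
]

# Extract the field (jq), count duplicates (sort + uniq -c), rank (sort -nr).
counts = Counter(json.loads(line)["user_id"] for line in raw_events)
top_users = counts.most_common()
print(top_users)  # [('u1', 2), ('u2', 1)]
```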
&lt;h4&gt;
  
  
  Redirect output
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# stdout only: normal output goes to output.log&lt;/span&gt;
python parse_logs.py &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; output.log

&lt;span class="c"&gt;# stderr only: error messages go to error.log&lt;/span&gt;
python parse_logs.py 2&amp;gt; error.log

&lt;span class="c"&gt;# both streams into one file (bash shorthand for "&amp;gt; file 2&amp;gt;&amp;amp;1")&lt;/span&gt;
python parse_logs.py &amp;amp;&amp;gt; all_output.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
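&lt;p&gt;The same separation of streams is available when one program launches another. Below is a minimal sketch with Python's &lt;code&gt;subprocess&lt;/code&gt; module. The inline child script is a stand-in for &lt;code&gt;parse_logs.py&lt;/code&gt;, and the messages it prints are invented for the example.&lt;/p&gt;

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for parse_logs.py: writes one line to stdout and one to stderr.
child_code = "import sys; print('parsed 10 rows'); print('bad row 7', file=sys.stderr)"

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "output.log"
    err_path = Path(tmp) / "error.log"
    with open(out_path, "w") as out, open(err_path, "w") as err:
        # stdout=out mirrors the stdout redirect; stderr=err mirrors the stderr redirect.
        subprocess.run([sys.executable, "-c", child_code], stdout=out, stderr=err, check=True)
    stdout_text = out_path.read_text().strip()
    stderr_text = err_path.read_text().strip()

print(stdout_text)  # parsed 10 rows
print(stderr_text)  # bad row 7
```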

&lt;h3&gt;
  
  
  Permissions and Ownership
&lt;/h3&gt;

&lt;p&gt;In shared data environments, correct permissions prevent accidental writes and data leaks.&lt;br&gt;
&lt;code&gt;chmod 640 data/file.parquet   # rw-r-----&lt;br&gt;
chown data_engineer:etl_group data/&lt;/code&gt;&lt;br&gt;
Symbolic permissions and their octal equivalents:&lt;br&gt;
&lt;code&gt;rwxr-xr--&lt;/code&gt; = 754&lt;br&gt;
&lt;code&gt;rw-------&lt;/code&gt; = 600&lt;br&gt;
&lt;code&gt;rw-rw-r--&lt;/code&gt; = 664&lt;/p&gt;
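&lt;p&gt;Each octal digit is the sum of read (4), write (2), and execute (1) for the owner, the group, and everyone else, in that order. A quick way to check a mapping is Python's &lt;code&gt;stat.filemode&lt;/code&gt;. The helper name &lt;code&gt;symbolic&lt;/code&gt; is an assumption for this sketch.&lt;/p&gt;

```python
import stat

def symbolic(octal_mode):
    """Turn an octal mode such as 0o640 into its rwx string (regular file)."""
    # filemode() prefixes the file type ("-" for a regular file); drop it.
    return stat.filemode(stat.S_IFREG | octal_mode)[1:]

for mode in (0o640, 0o754, 0o600, 0o664):
    print(oct(mode)[2:], symbolic(mode))
# 640 rw-r-----
# 754 rwxr-xr--
# 600 rw-------
# 664 rw-rw-r--
```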
&lt;h4&gt;
  
  
  Check current user and groups
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;whoami&lt;br&gt;
groups&lt;br&gt;
id&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Process Management for Long-Running Jobs
&lt;/h3&gt;

&lt;p&gt;Your ETL script may run for hours, so managing processes is key.&lt;br&gt;
Run a job in the background:&lt;br&gt;
&lt;code&gt;python transform.py &amp;gt; transform.log 2&amp;gt;&amp;amp;1 &amp;amp;&lt;/code&gt;&lt;br&gt;
View running processes:&lt;br&gt;
&lt;code&gt;ps aux | grep python&lt;br&gt;
htop   # interactive resource monitor&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Kill a stuck process
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;kill PID      # SIGTERM: ask the process to exit cleanly&lt;br&gt;
kill -9 PID   # SIGKILL: last resort if it ignores SIGTERM&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Survive terminal logout
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;nohup python heavy_etl.py &amp;amp;&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Session management
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;screen -S etl_job&lt;br&gt;
python run_pipeline.py&lt;br&gt;
# Ctrl+A, D to detach&lt;br&gt;
screen -r etl_job   # reattach&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Scheduling with Cron
&lt;/h3&gt;

&lt;p&gt;Orchestration doesn’t always require Airflow; cron is well suited to periodic data pulls.&lt;/p&gt;

&lt;p&gt;Edit the current user’s crontab:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;crontab -e&lt;/code&gt;&lt;br&gt;
Schedule examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# Every day at 2 AM
&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; * * * /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;de&lt;/span&gt;/&lt;span class="n"&gt;scripts&lt;/span&gt;/&lt;span class="n"&gt;ingest_daily&lt;/span&gt;.&lt;span class="n"&gt;sh&lt;/span&gt;

&lt;span class="c"&gt;# Every 15 minutes
&lt;/span&gt;*/&lt;span class="m"&gt;15&lt;/span&gt; * * * * /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;de&lt;/span&gt;/&lt;span class="n"&gt;scripts&lt;/span&gt;/&lt;span class="n"&gt;check_new_data&lt;/span&gt;.&lt;span class="n"&gt;sh&lt;/span&gt;

&lt;span class="c"&gt;# First day of month at 4 AM
&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; * * /&lt;span class="n"&gt;home&lt;/span&gt;/&lt;span class="n"&gt;de&lt;/span&gt;/&lt;span class="n"&gt;scripts&lt;/span&gt;/&lt;span class="n"&gt;aggregate_monthly&lt;/span&gt;.&lt;span class="n"&gt;sh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
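&lt;p&gt;The five fields are minute, hour, day of month, month, and day of week. To show how a schedule line is read, here is a deliberately small Python sketch that checks whether a given moment matches an expression. It is an illustration only, not how cron itself is implemented. It supports just &lt;code&gt;*&lt;/code&gt;, &lt;code&gt;*/n&lt;/code&gt;, and single numbers, and it ignores ranges, lists, month and day names, and cron's special handling when both day fields are restricted. The function names are assumptions.&lt;/p&gt;

```python
from datetime import datetime

def field_matches(field, value):
    """Match one cron field; supports only '*', '*/n', and a single number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)

def cron_matches(expression, moment):
    """Check a five-field cron expression against a datetime."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    values = (moment.minute, moment.hour, moment.day, moment.month, cron_weekday)
    fields = (minute, hour, day, month, weekday)
    return all(field_matches(f, v) for f, v in zip(fields, values))

print(cron_matches("0 2 * * *", datetime(2026, 6, 8, 2, 0)))       # True
print(cron_matches("*/15 * * * *", datetime(2026, 6, 8, 13, 45)))  # True
print(cron_matches("0 4 1 * *", datetime(2026, 6, 8, 4, 0)))       # False
```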



&lt;h3&gt;
  
  
  Environment Variables &amp;amp; Configuration
&lt;/h3&gt;

&lt;p&gt;Never hardcode credentials in scripts. Use environment variables instead:&lt;br&gt;
&lt;code&gt;export DB_HOST="localhost"&lt;br&gt;
export DB_PASS="s3cr3t"&lt;/code&gt;&lt;/p&gt;
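&lt;p&gt;On the script side, it helps to fail fast when an expected variable is missing instead of carrying on with an empty value. A minimal sketch follows. The helper name &lt;code&gt;require_env&lt;/code&gt; is hypothetical, and the variable is set inside the script only so the example runs on its own.&lt;/p&gt;

```python
import os

def require_env(name):
    """Return an environment variable or fail fast with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError("Missing required environment variable: " + name)
    return value

# Set here only so the sketch runs; in practice the shell exports it.
os.environ["DB_HOST"] = "localhost"
print(require_env("DB_HOST"))  # localhost
```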
&lt;h4&gt;
  
  
  Make persistent in &lt;code&gt;~/.bashrc&lt;/code&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;echo 'export DATA_LAKE="/mnt/data_lake"' &amp;gt;&amp;gt; ~/.bashrc&lt;br&gt;
source ~/.bashrc&lt;/code&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  Assignment example
&lt;/h5&gt;

&lt;p&gt;We stored API keys in a &lt;code&gt;.env&lt;/code&gt; file and loaded them in Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WEATHER_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Understanding Linux fundamentals is essential for any aspiring data engineer because it forms the backbone of modern data infrastructure. In this article, we explored how important Linux concepts such as file system navigation, text manipulation using awk and sed, process monitoring, and task automation with cron are applied in real-world ETL workflows. The example assignment involving clickstream log ingestion, bot traffic filtering, and hourly aggregation reflects the type of practical challenges data engineers regularly solve in production systems. Developing command-line skills allows engineers to work more efficiently, automate repetitive tasks, and troubleshoot systems with greater confidence and speed.&lt;/p&gt;

&lt;p&gt;As you advance in your data engineering career, view Linux skills as a valuable long-term asset rather than just another technical requirement. Begin by automating small repetitive tasks with shell scripts and challenge yourself to process large log files using command-line utilities before relying on graphical tools or Python libraries. Familiarize yourself with monitoring tools like htop and storage commands such as df -h to better understand system performance and resource usage. Mastering commands like grep, pipes, and cron will strengthen your ability to work across the entire data stack, including technologies such as Airflow, Spark, and Kubernetes. Since Linux powers much of today’s data infrastructure, becoming fluent in it will help you design pipelines that are efficient, scalable, and resilient.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
