Including Practical Use of Vi and Nano with Examples
Contents Guide
- Introduction
- Why Linux Matters for Data Engineers
- Getting Started with Linux
- Essential Linux Commands
- Working with Files and Directories
- Text Editing with Vi and Nano
- Advanced Concepts for Data Engineers
- Real-World Data Engineering Scenarios
- Best Practices and Tips
- Conclusion
Introduction
Welcome to the world of Linux! If you're a data engineer or aspiring to become one, understanding Linux is like learning to drive before becoming a race car driver. It's not just helpful—it's essential.
In this guide, we'll walk through everything you need to know about Linux, from the very basics to advanced concepts that will make your daily work as a data engineer smooth and efficient. Don't worry if you've never used a command line before; we'll start from the ground up and build your skills step by step.
Think of Linux as the operating system that powers most of the data infrastructure you'll work with. Whether you're processing petabytes of data on cloud servers, managing databases, or building data pipelines, Linux will be your constant companion.
Why Linux Matters for Data Engineers
The Foundation of Modern Data Infrastructure
Imagine you're building a house. You could use any materials, but there's a reason why professionals choose specific tools and materials—they're reliable, efficient, and proven. Linux is that foundation for data engineering.
Here's why Linux is indispensable:
1. It Powers the Cloud
Most cloud computing platforms (AWS, Google Cloud, Azure) run Linux servers. When you spin up a data processing cluster or deploy a machine learning model, chances are you're working with Linux. Understanding Linux means you can directly interact with these systems rather than relying on simplified interfaces that might limit your capabilities.
2. Performance and Efficiency
Linux is lightweight and efficient. When you're processing millions of records, every system resource counts. Linux doesn't waste computing power on unnecessary background processes or visual effects. That efficiency translates to faster data processing and lower costs.
3. Automation and Scripting
Data engineering is all about automation. You don't want to manually process data every day—you want scripts that run automatically. Linux provides powerful scripting capabilities through bash, Python, and other tools that integrate seamlessly with the operating system.
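To make that concrete, here's a minimal sketch of the kind of automation this enables (the script path and schedule below are placeholders, not a prescription):
# Run a cleanup script and append its output to a log (hypothetical paths)
bash /home/dataengineer/scripts/cleanup_tmp.sh >> /var/log/cleanup.log 2>&1
# The same command scheduled nightly at 3 AM via cron (covered later in this guide):
# 0 3 * * * bash /home/dataengineer/scripts/cleanup_tmp.sh >> /var/log/cleanup.log 2>&1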
4. Open Source Ecosystem
Most modern data tools are built for Linux first. Apache Spark, Hadoop, Kafka, Airflow, and countless other data engineering tools are designed to run optimally on Linux. By mastering Linux, you're unlocking the full potential of these tools.
5. Remote Server Management
Data engineers rarely work on local machines. You'll be connecting to remote servers to deploy pipelines, troubleshoot issues, and monitor systems. Linux's command-line interface is perfect for remote work—you can do everything through a terminal connection (SSH), even with limited bandwidth.
6. Cost-Effectiveness
Linux is free and open source. For companies processing data at scale, this means significant cost savings. Instead of paying for expensive operating system licenses for hundreds or thousands of servers, organizations use Linux and invest those savings into better hardware or more engineers.
Real-World Example
Let's say you're building a data pipeline that ingests customer transaction data, processes it, and loads it into a data warehouse. Here's what a typical workflow looks like:
- You write your data processing code in Python or Scala
- You deploy it to a Linux server or cluster (like AWS EMR)
- You schedule it using cron (a Linux job scheduler)
- You monitor logs using Linux command-line tools
- When something breaks at 2 AM, you SSH into the Linux server to troubleshoot
- You use Linux text editors to quickly fix configuration files
Every step of this process relies on Linux knowledge. Without it, you'd be stuck waiting for someone else to help or struggling with unfamiliar interfaces.
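As a rough sketch, that workflow might look like this from the terminal (the host name, paths, and schedule are illustrative only):
# Copy the pipeline code to the server (hypothetical host and paths)
scp etl_job.py dataengineer@emr-node:/opt/pipelines/
# Cron entry: run nightly at 1 AM and capture all output
# 0 1 * * * python3 /opt/pipelines/etl_job.py >> /var/log/etl_job.log 2>&1
# When the 2 AM alert fires, connect and inspect the logs
ssh dataengineer@emr-node
tail -n 100 /var/log/etl_job.log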
Getting Started with Linux
Understanding the Terminal
The terminal (also called command line or shell) is your primary interface with Linux. Unlike Windows or macOS where you click buttons and icons, in Linux you type commands. This might seem old-fashioned, but it's actually more powerful.
Think of it this way: clicking is like ordering from a fixed menu at a restaurant. The command line is like telling a chef exactly what you want and how you want it prepared. You have complete control.
When you open a terminal, you'll see something like this:
user@hostname:~$
Let's break this down:
- user: Your username
- hostname: The name of the computer you're on
- ~: The tilde represents your home directory (more on this later)
- $: The prompt symbol (it's # if you're the root/admin user)
Your First Commands
Let's start with the absolute basics. Type these commands and press Enter after each one:
# See where you are
pwd
Output:
/home/username
The pwd command stands for "print working directory." It tells you your current location in the file system. Think of it as checking your GPS location.
# List what's in the current directory
ls
Output:
Documents Downloads Pictures Music Videos
The ls command lists files and directories. It's like looking at the contents of a folder in a graphical file manager.
# See who you are
whoami
Output:
username
This simply tells you which user account you're currently using.
Essential Linux Commands
Let's explore the commands you'll use daily as a data engineer. I'll explain each one with practical examples you might encounter in real work.
Navigation Commands
Moving Around the File System
# Go to your home directory
cd ~
# Go to a specific directory
cd /var/log
# Go back to the previous directory
cd -
# Go up one level
cd ..
# Go up two levels
cd ../..
Example Scenario: You're troubleshooting a data pipeline that failed. The logs are in /var/log/airflow/. Here's how you'd navigate:
# Start from your home directory
pwd
# Output: /home/dataengineer
# Navigate to the logs
cd /var/log/airflow/
# Check where you are
pwd
# Output: /var/log/airflow
# List the log files
ls
# Output: scheduler.log webserver.log worker.log
File and Directory Operations
Creating Directories
# Create a single directory
mkdir data_projects
# Create nested directories (with -p flag)
mkdir -p projects/etl_pipeline/scripts
# Create multiple directories at once
mkdir raw_data processed_data output_data
Example: Setting up a new data project structure:
# Create a complete project structure
mkdir -p ~/data_projects/customer_analytics/{raw,processed,scripts,output}
# Navigate to it
cd ~/data_projects/customer_analytics
# Verify the structure
ls -R
Output:
.:
output  processed  raw  scripts

./output:

./processed:

./raw:

./scripts:
Creating and Viewing Files
# Create an empty file
touch data.csv
# Create multiple files
touch file1.txt file2.txt file3.txt
# View file contents
cat data.csv
# View large files page by page
less large_logfile.log
# View first 10 lines
head data.csv
# View last 10 lines
tail data.csv
# View last 20 lines
tail -n 20 data.csv
# Follow a log file in real-time (useful for monitoring)
tail -f /var/log/application.log
Real-World Example: Monitoring a data pipeline log:
# Watch the pipeline log as it runs
tail -f /var/log/data_pipeline/etl_job.log
This command will show new log entries as they're written, perfect for watching your data pipeline in action.
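If the log is noisy, you can filter the live stream as well; a small variation on the same idea:
# Show only error lines as they arrive (--line-buffered keeps grep from delaying output)
tail -f /var/log/data_pipeline/etl_job.log | grep --line-buffered -i "error"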
Copying and Moving Files
# Copy a file
cp source.csv destination.csv
# Copy a directory and its contents
cp -r source_folder destination_folder
# Move/rename a file
mv old_name.csv new_name.csv
# Move a file to a different directory
mv data.csv /home/user/projects/
# Move multiple files
mv file1.txt file2.txt file3.txt /destination/
Example: Organizing raw data files:
# You just received new data files
ls
# Output: transactions_2024.csv customers.csv products.csv
# Create organized structure
mkdir -p raw_data/{transactions,customers,products}
# Move files to appropriate directories
mv transactions_2024.csv raw_data/transactions/
mv customers.csv raw_data/customers/
mv products.csv raw_data/products/
# Verify
ls raw_data/transactions/
# Output: transactions_2024.csv
Deleting Files and Directories
# Delete a file (be careful!)
rm file.txt
# Delete multiple files
rm file1.txt file2.txt file3.txt
# Delete with confirmation prompt
rm -i important_file.csv
# Delete a directory and its contents
rm -r directory_name
# Force delete without confirmation (DANGEROUS!)
rm -rf directory_name
Warning: The rm command is permanent. Unlike Windows or Mac, there's no Recycle Bin. Deleted files are gone forever. Always double-check before using rm -rf.
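One cautious habit, if it suits your workflow, is to move files into a scratch "trash" directory instead of deleting them right away. A minimal sketch (the ~/.trash path is just a convention, not a standard location):
# Stage files for deletion instead of removing them outright
mkdir -p ~/.trash
mv old_export.csv ~/.trash/
# Review, then empty the trash once you're sure
ls ~/.trash
rm -rf ~/.trash/*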
Searching and Finding
Finding Files
# Find files by name
find /home/user -name "*.csv"
# Find files modified in the last 7 days
find /data -type f -mtime -7
# Find large files (bigger than 100MB)
find /data -type f -size +100M
# Find and delete old log files
find /var/log -name "*.log" -mtime +30 -delete
Searching Inside Files
# Search for text in a file
grep "error" logfile.log
# Search case-insensitively
grep -i "Error" logfile.log
# Search recursively in directories
grep -r "TODO" /home/user/projects/
# Count occurrences
grep -c "success" pipeline.log
# Show line numbers
grep -n "failed" etl.log
# Search with context (2 lines before and after)
grep -C 2 "exception" app.log
Real-World Example: Finding errors in a data pipeline:
# Search for all errors in today's logs
grep -i "error" /var/log/airflow/scheduler.log | grep "2024-01-23"
# Count how many times a specific error occurred
grep "ConnectionTimeout" pipeline.log | wc -l
# Find which data files contain null values
grep -l "NULL" *.csv
Output:
customers.csv
transactions.csv
File Permissions and Ownership
Linux has a robust permission system. Every file has an owner and permissions that control who can read, write, or execute it.
Understanding Permissions
When you run ls -l, you see something like:
ls -l script.py
Output:
-rwxr-xr-x 1 dataengineer users 2048 Jan 23 10:30 script.py
Let's decode this:
- -rwxr-xr-x: Permissions
  - First character: - (file) or d (directory)
  - Next three: rwx (owner can read, write, execute)
  - Next three: r-x (group can read and execute)
  - Last three: r-x (others can read and execute)
- 1: Number of hard links
- dataengineer: Owner
- users: Group
- 2048: File size in bytes
- Jan 23 10:30: Last modified date and time
- script.py: Filename
Changing Permissions
# Make a script executable
chmod +x script.sh
# Give full permissions to owner, read to others
chmod 744 data_script.py
# Remove write permission for everyone
chmod a-w important_config.txt
# Recursively change permissions
chmod -R 755 /data/scripts/
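If the numeric form looks cryptic: each digit is the sum of read (4), write (2), and execute (1), applied to owner, group, and others in that order. Two illustrative examples (the filenames are hypothetical):
# 7 = 4+2+1 = rwx, 5 = 4+1 = r-x, 4 = r--, 0 = ---
chmod 755 run_pipeline.sh     # owner: rwx, group: r-x, others: r-x
chmod 640 db_credentials.conf # owner: rw-, group: r--, others: ---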
Example: Preparing a Python ETL script for execution:
# Check current permissions
ls -l etl_pipeline.py
# Output: -rw-r--r-- 1 user users 5242 Jan 23 10:30 etl_pipeline.py
# Make it executable
chmod +x etl_pipeline.py
# Verify
ls -l etl_pipeline.py
# Output: -rwxr-xr-x 1 user users 5242 Jan 23 10:30 etl_pipeline.py
# Now you can run it directly
./etl_pipeline.py
Process Management
As a data engineer, you'll often run long-running processes and need to manage them.
# See running processes
ps aux
# See processes in a tree structure
ps auxf
# Filter processes
ps aux | grep python
# Interactive process viewer
top
# Better interactive viewer
htop
# Kill a process
kill 1234 # where 1234 is the process ID
# Force kill a stubborn process
kill -9 1234
# Kill by process name
pkill python
# See what's using a port
lsof -i :8080
Example: Managing a stuck data processing job:
# Find the process
ps aux | grep "spark-submit"
# Output:
# user 12345 15.2 8.5 4567890 123456 ? R 10:30 45:23 spark-submit data_job.py
# Kill it gracefully
kill 12345
# Wait a few seconds, then verify it's gone
ps aux | grep 12345
# If it's still running, force kill
kill -9 12345
System Information
# Disk usage
df -h
# Directory size
du -sh /data/warehouse
# Memory usage
free -h
# CPU information
lscpu
# System uptime
uptime
# Check OS version
cat /etc/os-release
# Kernel version
uname -r
Example Output:
df -h
Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 45G 50G 48% /
/dev/sdb1 1.0T 750G 250G 75% /data
This shows you have 250GB of free space on your data drive—important when processing large datasets!
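You can also turn that into a quick alert; a minimal sketch, assuming /data is the mount you care about and 80% is your threshold:
# Print a warning when /data usage exceeds 80%
df -h /data | awk 'NR==2 {gsub("%","",$5); if ($5+0 > 80) print "WARNING: /data is " $5 "% full"}'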
Working with Files and Directories
The Linux File System Structure
Linux organizes everything as files in a hierarchical tree structure. Understanding this is crucial for navigating and managing your data projects.
/ (root - top of the hierarchy)
├── home/ (user home directories)
│ └── dataengineer/ (your home directory)
│ ├── projects/
│ ├── data/
│ └── scripts/
├── var/ (variable data)
│ ├── log/ (log files)
│ └── lib/ (state information)
├── etc/ (configuration files)
├── usr/ (user programs)
│ ├── bin/ (executable programs)
│ └── local/ (locally installed software)
├── tmp/ (temporary files)
└── opt/ (optional software)
Important Paths for Data Engineers:
- /home/username: Your personal space
- /data: Often where large datasets are stored
- /var/log: System and application logs
- /etc: Configuration files for applications
- /tmp: Temporary files (often cleared on reboot)
Working with Paths
Absolute vs Relative Paths
# Absolute path (starts with /)
cd /home/dataengineer/projects/etl
# Relative path (from current location)
cd projects/etl
# Current directory
.
# Parent directory
..
# Home directory
~
Example Navigation:
# You're here
pwd
# Output: /home/dataengineer
# Using relative path
cd projects/etl
# Using absolute path
cd /home/dataengineer/projects/etl
# Go up two levels
cd ../..
# Where are we now?
pwd
# Output: /home/dataengineer
Wildcards and Pattern Matching
Wildcards help you work with multiple files at once.
# List all CSV files
ls *.csv
# List files starting with "data"
ls data*
# List files with exactly 5 characters
ls ?????
# List all Python files in immediate subdirectories
ls */*.py
# Copy all CSV files to another directory
cp *.csv /backup/
# Delete all log files
rm *.log
Advanced Pattern Matching:
# List files matching a pattern
ls transactions_[0-9][0-9][0-9][0-9].csv
# Matches: transactions_2023.csv, transactions_2024.csv
# List files NOT matching a pattern (requires extglob: shopt -s extglob)
ls !(*.tmp)
# Multiple extensions
ls *.{csv,txt,json}
Piping and Redirection
One of Linux's most powerful features is the ability to chain commands together.
Output Redirection:
# Save command output to a file (overwrites)
ls -l > file_list.txt
# Append to a file
echo "New log entry" >> application.log
# Redirect errors to a file
python script.py 2> errors.log
# Redirect both output and errors
python script.py > output.log 2>&1
Piping Commands:
# Chain commands with pipe (|)
cat large_file.csv | grep "error" | wc -l
# Sort and get unique values
cat data.txt | sort | uniq
# List the largest files by size (note: the first line of ls -l output is a "total" summary)
ls -lS | head -10
# Count how many processes are running
ps aux | wc -l
Real-World Example: Analyzing log files:
# Find the most common errors in a log file
cat application.log | grep "ERROR" | sort | uniq -c | sort -rn | head -10
This command:
- Reads the log file
- Filters for lines containing "ERROR"
- Sorts them
- Counts unique occurrences
- Sorts by count (descending)
- Shows top 10
Output:
145 ERROR: Database connection timeout
89 ERROR: Invalid data format
45 ERROR: Memory allocation failed
23 ERROR: Network unreachable
...
Text Processing Commands
cut, awk, and sed
These are powerful text processing tools essential for data engineers.
# Extract specific columns from CSV
cut -d',' -f1,3 data.csv
# Example: Extract first and third columns
# Input: name,age,city,salary
# John,30,NYC,75000
# Output: name,city
# John,NYC
# Using awk for more complex operations
awk -F',' '{print $1, $3}' data.csv
# Sum values in a column
awk -F',' '{sum += $4} END {print sum}' data.csv
# Search and replace with sed
sed 's/old_text/new_text/g' file.txt
# Delete lines matching a pattern
sed '/pattern/d' file.txt
Real-World Example: Processing a CSV file:
# You have a transactions CSV: date,product,quantity,price
# Calculate total revenue
awk -F',' 'NR>1 {revenue += $3 * $4} END {print "Total Revenue: $" revenue}' transactions.csv
Output:
Total Revenue: $125430.50
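Along the same lines, awk's associative arrays let you group while you sum; a short sketch assuming the same column layout:
# Revenue per product (column 2 = product, 3 = quantity, 4 = price)
awk -F',' 'NR>1 {rev[$2] += $3 * $4} END {for (p in rev) printf "%s: $%.2f\n", p, rev[p]}' transactions.csv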
Compression and Archives
Working with compressed files is daily business for data engineers.
# Create a tar archive
tar -cvf backup.tar /data/files
# Create a compressed tar.gz archive
tar -czvf backup.tar.gz /data/files
# Extract a tar.gz archive
tar -xzvf backup.tar.gz
# Compress a file with gzip
gzip large_file.csv
# Creates: large_file.csv.gz
# Decompress
gunzip large_file.csv.gz
# Compress with bzip2 (better compression)
bzip2 large_file.csv
# View compressed file without extracting
zcat file.gz
# View compressed log file
zcat logs.gz | grep "error"
Example: Backing up project data:
# Create a dated backup
tar -czvf backup_$(date +%Y%m%d).tar.gz ~/projects/data_pipeline/
# Verify contents without extracting
tar -tzvf backup_20240123.tar.gz
# Extract to a specific directory
tar -xzvf backup_20240123.tar.gz -C /restore/location/
Text Editing with Vi and Nano
As a data engineer, you'll constantly edit configuration files, scripts, and data files on remote servers. Mastering terminal-based text editors is essential. We'll cover two editors: Nano (beginner-friendly) and Vi (powerful and ubiquitous).
Why Learn Terminal Text Editors?
When you SSH into a remote server, you won't have access to graphical editors like VS Code or Sublime Text. You need to edit files directly in the terminal. Additionally, terminal editors are:
- Fast and lightweight
- Available on every Linux system
- Efficient for quick edits
- Scriptable and automatable
Nano: The Beginner-Friendly Editor
Nano is straightforward and displays its commands at the bottom of the screen, making it perfect for beginners.
Starting Nano
# Open a new file
nano myfile.txt
# Open an existing file
nano /etc/config/app.conf
# Open file at a specific line number
nano +25 script.py
Basic Nano Operations
When you open Nano, you'll see something like this:
GNU nano 5.4 myfile.txt Modified
This is my text file.
I can start typing immediately.
No special mode needed!
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Spell
The ^ symbol means Ctrl. So ^X means "Ctrl+X".
Essential Nano Commands:
# While in Nano:
Ctrl+O # Save file (WriteOut)
Ctrl+X # Exit
Ctrl+K # Cut current line
Ctrl+U # Paste (UnCut)
Ctrl+W # Search
Ctrl+\ # Search and replace
Ctrl+G # Show help
Ctrl+C # Show current line number
Alt+U # Undo
Alt+E # Redo
Ctrl+A # Go to beginning of line
Ctrl+E # Go to end of line
Practical Nano Example 1: Creating a Python Script
Let's create a simple data processing script:
# Open a new Python file
nano data_processor.py
Now type:
#!/usr/bin/env python3
# Data processor script
import pandas as pd
def process_data(input_file):
# Read CSV file
df = pd.read_csv(input_file)
# Basic cleaning
df = df.dropna()
# Save processed data
df.to_csv('processed_data.csv', index=False)
print(f"Processed {len(df)} records")
if __name__ == "__main__":
process_data('raw_data.csv')
Steps:
- Type or paste the code
- Press Ctrl+O to save
- Press Enter to confirm the filename
- Press Ctrl+X to exit
Making the script executable:
chmod +x data_processor.py
./data_processor.py
Practical Nano Example 2: Editing a Configuration File
Let's edit a database configuration file:
nano ~/config/database.conf
Original content:
[database]
host=localhost
port=5432
dbname=olddb
user=admin
password=oldpass
To search and replace:
- Press Ctrl+\ (search and replace)
- Enter the search term: olddb
- Press Enter
- Enter the replacement: newdb
- Press Enter
- Press A to replace all occurrences
- Press Ctrl+O to save
- Press Ctrl+X to exit
Nano Tips for Data Engineers
# Open multiple files in tabs
nano file1.txt file2.txt
# Use Alt+< and Alt+> to switch between files
# Enable line numbers
nano -l script.py
# Create a backup automatically
nano -B important_config.conf
# Creates: important_config.conf~
# Soft wrap long lines
nano -$ data.csv
Vi (Vim): The Power User's Editor
Vi (or its improved version, Vim) is more complex but incredibly powerful. It's on every Unix/Linux system, making it essential to know at least the basics.
Understanding Vi Modes
Vi has different modes, which is what confuses beginners. Think of it like a smartphone: sometimes you're typing a message (insert mode), sometimes you're navigating menus (normal mode).
Three Main Modes:
- Normal Mode (default): For navigation and commands
- Insert Mode: For typing text
- Command Mode: For saving, quitting, searching
Starting Vi
# Open a new file
vi myfile.txt
# Open an existing file
vi /etc/config/app.conf
# Open at a specific line
vi +25 script.py
# Open in read-only mode
view logfile.log
Basic Vi Operations
Entering Insert Mode (where you can type):
# In Normal mode, press:
i # Insert before cursor
a # Insert after cursor
I # Insert at beginning of line
A # Insert at end of line
o # Open new line below
O # Open new line above
# You'll see -- INSERT -- at the bottom when in Insert mode
Returning to Normal Mode:
Esc # Always gets you back to Normal mode
Saving and Quitting:
# In Normal mode, press : to enter Command mode
:w # Save (write)
:q # Quit
:wq # Save and quit
:q! # Quit without saving (force)
:x # Save and quit (shortcut)
ZZ # Save and quit (even shorter)
Navigation in Vi
In Normal mode:
# Basic movement
h # Left
j # Down
k # Up
l # Right
# Word movement
w # Next word
b # Previous word
e # End of word
# Line movement
0 # Beginning of line
^ # First non-blank character
$ # End of line
# Screen movement
gg # Top of file
G # Bottom of file
15G # Go to line 15
Ctrl+f # Page down
Ctrl+b # Page up
Editing in Vi
In Normal mode:
# Deleting
x # Delete character
dd # Delete line
d$ # Delete to end of line
d0 # Delete to beginning of line
5dd # Delete 5 lines
# Copying (yanking) and pasting
yy # Copy line
5yy # Copy 5 lines
p # Paste after cursor
P # Paste before cursor
# Undo and redo
u # Undo
Ctrl+r # Redo
# Change text
cw # Change word
c$ # Change to end of line
cc # Change entire line
Searching in Vi
# In Normal mode
/pattern # Search forward
?pattern # Search backward
n # Next match
N # Previous match
# Search and replace (in Command mode)
:s/old/new/ # Replace first occurrence in line
:s/old/new/g # Replace all in line
:%s/old/new/g # Replace all in file
:%s/old/new/gc # Replace with confirmation
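Vim also accepts a line range in front of the substitute command, which is handy when you only want to touch part of a file:
# Replace only on lines 10 through 20
:10,20s/old/new/g
# Replace from the current line to the end of the file
:.,$s/old/new/g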
Practical Vi Example 1: Quick Configuration Edit
Let's edit an Apache Airflow configuration:
vi ~/airflow/airflow.cfg
Scenario: You need to change the executor from Sequential to Local.
Steps:
- Open the file: vi ~/airflow/airflow.cfg
- Search for executor: /executor then press Enter
- Press n to jump to the next occurrence until you find the right one
- Move to the word you want to change: w (to move forward by word)
- Change the word: cw (change word)
- Type the new value: LocalExecutor
- Press Esc to return to Normal mode
- Save and quit: :wq
Before:
[core]
executor = SequentialExecutor
After:
[core]
executor = LocalExecutor
Practical Vi Example 2: Editing a Data Pipeline Script
vi etl_pipeline.py
Scenario: Add logging to a Python script.
Original script:
def process_data(file_path):
data = read_csv(file_path)
cleaned = clean_data(data)
load_to_db(cleaned)
Steps:
- Open the file: vi etl_pipeline.py
- Go to the function: /def process_data then Enter
- Open a line below: o
- Add logging: type print(f"Processing {file_path}...")
- Press Esc
- Move to the next line: j
- Repeat for the other lines
- Save and quit: :wq
Modified script:
def process_data(file_path):
print(f"Processing {file_path}...")
data = read_csv(file_path)
print(f"Cleaning data...")
cleaned = clean_data(data)
print(f"Loading to database...")
load_to_db(cleaned)
print(f"Complete!")
Practical Vi Example 3: Bulk Editing a CSV File
vi data.csv
Scenario: Your CSV uses the wrong delimiter (semicolons instead of commas).
Before:
name;age;city
John;30;NYC
Jane;25;LA
Bob;35;SF
Steps:
- Open the file: vi data.csv
- Enter Command mode: :
- Replace all semicolons: %s/;/,/g
- Press Enter
- Verify the changes: navigate with j and k
- Save and quit: :wq
After:
name,age,city
John,30,NYC
Jane,25,LA
Bob,35,SF
Advanced Vi Techniques
Visual Mode (for selecting text):
v # Start visual character selection
V # Start visual line selection
Ctrl+v # Start visual block selection
# Then use movement keys, then:
d # Delete selection
y # Copy selection
c # Change selection
Macros (record and replay actions):
qa # Start recording macro in register 'a'
# ... perform actions ...
q # Stop recording
@a # Replay macro
5@a # Replay macro 5 times
Example Macro Use: Add quotes around words in a CSV:
# Start on a word
qa # Start recording
i" # Insert opening quote
Esc # Normal mode
e # End of word
a" # Insert closing quote
Esc # Normal mode
q # Stop recording
# Move to the next word (e.g. with w), then replay
@a # Apply the macro to the word under the cursor
Vi Configuration
Create a .vimrc file in your home directory for custom settings:
vi ~/.vimrc
Add useful settings for data engineering:
" Enable line numbers
set number
" Enable syntax highlighting
syntax on
" Set tab width
set tabstop=4
set shiftwidth=4
set expandtab
" Show matching brackets
set showmatch
" Enable search highlighting
set hlsearch
" Incremental search
set incsearch
" Auto-indent
set autoindent
" Show cursor position
set ruler
" Enable mouse support
set mouse=a
Save this configuration and restart Vi. Your editing experience will be much improved!
Vi vs Nano: When to Use Each
Use Nano when:
- Making quick, simple edits
- You're a beginner
- You want immediate visual feedback
- Working with small configuration files
Use Vi when:
- Editing remotely on minimal systems
- You need power and speed
- Working with large files
- Performing complex text transformations
- You want to invest in long-term efficiency
Personal Recommendation: Learn the basics of both. Use Nano for daily quick edits and gradually incorporate Vi for more complex tasks. Many data engineers use Nano for simple configs and Vi/Vim for coding and complex edits.
Recovering from Vi Panic
Stuck in Vi and don't know how to exit? Here's the "escape ladder":
1. Press Esc (multiple times if needed)
2. Type :q! and press Enter (quit without saving)
If that doesn't work:
3. Press Esc again
4. Type :qa! and press Enter (quit all)
Still stuck?
5. Press Ctrl+Z (suspends Vi)
6. Type: kill %1 (kills the suspended process)
Real-World Text Editing Scenarios
Scenario 1: Emergency Log Analysis
Your data pipeline failed at 3 AM. You SSH into the server:
# Quick look with Nano
nano /var/log/airflow/scheduler.log
# Search for "ERROR": Ctrl+W, type "ERROR", Enter
# Navigate through errors: Ctrl+W repeatedly
# Found the issue, exit: Ctrl+X
Scenario 2: Batch Configuration Update
You need to update database connections across 10 config files:
# Use Vi for efficiency
for file in config/*.conf; do
vi -c '%s/old_db_host/new_db_host/g | wq' "$file"
done
This opens each file, replaces the text, saves, and quits automatically.
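The same batch edit is often done with sed instead, which avoids launching an editor at all; an equivalent sketch (-i.bak keeps a backup copy of each file):
# In-place replacement across all config files, keeping .bak backups
sed -i.bak 's/old_db_host/new_db_host/g' config/*.conf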
Scenario 3: Creating a Cron Job
# Edit crontab
crontab -e
# If prompted, choose your preferred editor (select Nano if unsure)
# Add a line to run your data pipeline daily at 2 AM:
0 2 * * * /home/dataengineer/scripts/daily_etl.sh >> /var/log/etl.log 2>&1
# Save and exit
Scenario 4: Editing a Large JSON Configuration
# Use Vi with syntax highlighting
vi config.json
# Navigate quickly
:250 # Jump to line 250
/timeout # Search for timeout setting
cw5000 # Change the word to 5000
:wq # Save and exit
Text Editing Best Practices
- Always backup important files before editing:
cp important.conf important.conf.backup
vi important.conf
- Use version control for scripts and configs:
git add config.yaml
git commit -m "Updated database connection"
- Test syntax after editing code:
python -m py_compile script.py
bash -n script.sh
- Keep editing sessions focused: Edit one file at a time to avoid confusion.
- Learn keyboard shortcuts: They save immense time over months and years.
Advanced Concepts for Data Engineers
Now that you've mastered the basics, let's dive into advanced concepts that will make you truly proficient in Linux as a data engineer.
Shell Scripting and Automation
Shell scripts allow you to automate repetitive tasks. Instead of typing the same commands every day, you write them once in a script.
Creating Your First Bash Script
# Create a new script
nano daily_backup.sh
Type this:
#!/bin/bash
# Daily backup script for data files
# Set variables
BACKUP_DIR="/backup/data"
SOURCE_DIR="/data/warehouse"
DATE=$(date +%Y%m%d)
BACKUP_FILE="backup_${DATE}.tar.gz"
# Create backup
echo "Starting backup at $(date)"
tar -czf "${BACKUP_DIR}/${BACKUP_FILE}" "${SOURCE_DIR}"
# Check if successful
if [ $? -eq 0 ]; then
echo "Backup completed successfully"
echo "File: ${BACKUP_FILE}"
echo "Size: $(du -h ${BACKUP_DIR}/${BACKUP_FILE} | cut -f1)"
else
echo "Backup failed!"
exit 1
fi
# Delete backups older than 30 days
find "${BACKUP_DIR}" -name "backup_*.tar.gz" -mtime +30 -delete
echo "Old backups cleaned up"
Make it executable and run:
chmod +x daily_backup.sh
./daily_backup.sh
Output:
Starting backup at Wed Jan 23 14:30:00 UTC 2024
Backup completed successfully
File: backup_20240123.tar.gz
Size: 2.3G
Old backups cleaned up
Script Variables and User Input
#!/bin/bash
# Data processing script with user input
# Read user input
echo "Enter the data file to process:"
read DATA_FILE
# Check if file exists
if [ ! -f "$DATA_FILE" ]; then
echo "Error: File $DATA_FILE not found!"
exit 1
fi
# Process the file
echo "Processing $DATA_FILE..."
LINES=$(wc -l < "$DATA_FILE")
echo "Total lines: $LINES"
# Calculate processing time
START=$(date +%s)
# Your processing command here
python process_data.py "$DATA_FILE"
END=$(date +%s)
DURATION=$((END - START))
echo "Processing completed in $DURATION seconds"
Conditional Logic in Scripts
#!/bin/bash
# ETL pipeline with error handling
LOG_FILE="/var/log/etl_pipeline.log"
ERROR_COUNT=0
# Function to log messages
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Extract data
log_message "Starting data extraction..."
if python extract_data.py; then
log_message "Extraction successful"
else
log_message "ERROR: Extraction failed"
((ERROR_COUNT++))
fi
# Transform data
log_message "Starting data transformation..."
if python transform_data.py; then
log_message "Transformation successful"
else
log_message "ERROR: Transformation failed"
((ERROR_COUNT++))
fi
# Load data
log_message "Starting data loading..."
if python load_data.py; then
log_message "Loading successful"
else
log_message "ERROR: Loading failed"
((ERROR_COUNT++))
fi
# Summary
if [ $ERROR_COUNT -eq 0 ]; then
log_message "Pipeline completed successfully"
exit 0
else
log_message "Pipeline completed with $ERROR_COUNT errors"
exit 1
fi
Loops in Shell Scripts
#!/bin/bash
# Process multiple CSV files
DATA_DIR="/data/raw"
OUTPUT_DIR="/data/processed"
# Loop through all CSV files
for file in "${DATA_DIR}"/*.csv; do
filename=$(basename "$file" .csv)
echo "Processing $filename..."
# Run Python processing script
python process_csv.py "$file" "${OUTPUT_DIR}/${filename}_processed.csv"
if [ $? -eq 0 ]; then
echo "✓ $filename completed"
else
echo "✗ $filename failed"
fi
done
echo "All files processed"
Functions in Shell Scripts
#!/bin/bash
# Modular data pipeline with functions
# Function to check if a file exists
check_file() {
if [ ! -f "$1" ]; then
echo "Error: File $1 not found"
return 1
fi
return 0
}
# Function to validate CSV format
validate_csv() {
local file=$1
local expected_columns=$2
actual_columns=$(head -1 "$file" | tr ',' '\n' | wc -l)
if [ "$actual_columns" -eq "$expected_columns" ]; then
echo "✓ CSV validation passed"
return 0
else
echo "✗ Expected $expected_columns columns, found $actual_columns"
return 1
fi
}
# Function to process data
process_data() {
local input=$1
local output=$2
echo "Processing: $input -> $output"
python data_processor.py "$input" "$output"
return $?
}
# Main execution
MAIN() {
INPUT_FILE="raw_data.csv"
OUTPUT_FILE="processed_data.csv"
# Check input file
check_file "$INPUT_FILE" || exit 1
# Validate format
validate_csv "$INPUT_FILE" 5 || exit 1
# Process data
process_data "$INPUT_FILE" "$OUTPUT_FILE" || exit 1
echo "Pipeline completed successfully"
}
# Run main function
MAIN
Environment Variables
Environment variables store system-wide or user-specific configuration.
# View all environment variables
env
# View specific variable
echo $HOME
echo $PATH
echo $USER
# Set a temporary variable (current session only)
export DATABASE_URL="postgresql://localhost:5432/mydb"
# Set permanently (add to ~/.bashrc)
echo 'export DATABASE_URL="postgresql://localhost:5432/mydb"' >> ~/.bashrc
source ~/.bashrc
# Use in scripts
#!/bin/bash
echo "Connecting to: $DATABASE_URL"
python connect_db.py
Common Environment Variables for Data Engineers
# Python path
export PYTHONPATH="/home/user/python_modules:$PYTHONPATH"
# Java home (for Spark)
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
# Spark configuration
export SPARK_HOME="/opt/spark"
export PATH="$SPARK_HOME/bin:$PATH"
# AWS credentials
export AWS_ACCESS_KEY_ID="your_key"
export AWS_SECRET_ACCESS_KEY="your_secret"
export AWS_DEFAULT_REGION="us-east-1"
# Airflow home
export AIRFLOW_HOME="/home/user/airflow"
Scheduling with Cron
Cron is the Linux job scheduler. It's essential for automating data pipelines.
Cron Syntax
* * * * * command
│ │ │ │ │
│ │ │ │ └─── Day of week (0-7, 0 and 7 are Sunday)
│ │ │ └───── Month (1-12)
│ │ └─────── Day of month (1-31)
│ └───────── Hour (0-23)
└─────────── Minute (0-59)
Common Cron Patterns
# Edit crontab
crontab -e
# Every day at 2 AM
0 2 * * * /home/user/scripts/daily_etl.sh
# Every hour
0 * * * * /home/user/scripts/hourly_sync.sh
# Every 15 minutes
*/15 * * * * /home/user/scripts/frequent_check.sh
# Monday to Friday at 9 AM
0 9 * * 1-5 /home/user/scripts/weekday_report.sh
# First day of every month at midnight
0 0 1 * * /home/user/scripts/monthly_summary.sh
# Every Sunday at 3 AM
0 3 * * 0 /home/user/scripts/weekly_backup.sh
Real-World Cron Examples
# Data pipeline that runs every 6 hours
0 */6 * * * cd /home/dataengineer/pipelines && python etl_pipeline.py >> /var/log/etl.log 2>&1
# Database backup every day at 1 AM
0 1 * * * /usr/local/bin/backup_database.sh
# Clean temporary files every day at midnight
0 0 * * * find /tmp/data_processing -type f -mtime +7 -delete
# Send daily report at 8 AM on weekdays
0 8 * * 1-5 python /home/user/scripts/daily_report.py && mail -s "Daily Report" team@company.com < report.txt
# Monitor disk space every 30 minutes
*/30 * * * * df -h | grep -E '9[0-9]%|100%' && echo "Disk space warning" | mail -s "Alert" admin@company.com
Cron Best Practices
# Always use absolute paths in cron jobs
# BAD: python script.py
# GOOD: /usr/bin/python3 /home/user/scripts/script.py
# Redirect output to log files
0 2 * * * /home/user/script.sh >> /var/log/script.log 2>&1
# Set environment in the script
#!/bin/bash
export PATH=/usr/local/bin:/usr/bin:/bin
export PYTHONPATH=/home/user/modules
# ... rest of script
# Test scripts before scheduling
/home/user/scripts/test_pipeline.sh
# Use meaningful names and comments in crontab
# Daily ETL pipeline for customer data
0 2 * * * /home/user/pipelines/customer_etl.sh
SSH and Remote Server Management
SSH (Secure Shell) lets you connect to remote servers securely.
Basic SSH Usage
# Connect to a server
ssh username@server_ip
# Example
ssh dataengineer@192.168.1.100
# Connect with a specific port
ssh -p 2222 username@server_ip
# Connect and execute a command
ssh user@server "ls -la /data"
# Copy files to remote server
scp local_file.csv user@server:/remote/path/
# Copy from remote server
scp user@server:/remote/file.csv /local/path/
# Copy entire directory
scp -r local_directory user@server:/remote/path/
# Copy with compression (faster for large files)
scp -C large_dataset.csv user@server:/data/
SSH Key Authentication
More secure and convenient than passwords:
# Generate SSH key pair
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
# Copy public key to server
ssh-copy-id user@server
# Now connect without password
ssh user@server
SSH Configuration
Create a config file for easier connections:
nano ~/.ssh/config
Add:
Host production
HostName 10.0.1.50
User dataengineer
Port 22
IdentityFile ~/.ssh/id_rsa
Host staging
HostName 10.0.1.51
User dataengineer
Port 22
IdentityFile ~/.ssh/id_rsa
Host datawarehouse
HostName warehouse.company.com
User admin
Port 2222
IdentityFile ~/.ssh/warehouse_key
Now connect simply:
ssh production
ssh staging
ssh datawarehouse
Remote Command Execution
# Check disk space on remote server
ssh production "df -h"
# Run a data pipeline remotely
ssh production "cd /data/pipelines && python daily_etl.py"
# Monitor logs in real-time
ssh production "tail -f /var/log/application.log"
# Execute multiple commands
ssh production << 'EOF'
cd /data/pipelines
source venv/bin/activate
python etl.py
deactivate
EOF
Working with Databases from Linux
As a data engineer, you'll frequently interact with databases from the command line.
PostgreSQL
# Connect to PostgreSQL
psql -h localhost -U username -d database_name
# Execute a query from command line
psql -h localhost -U user -d mydb -c "SELECT COUNT(*) FROM customers;"
# Execute SQL file
psql -h localhost -U user -d mydb -f query.sql
# Export query results to CSV
psql -h localhost -U user -d mydb -c "COPY (SELECT * FROM sales) TO STDOUT WITH CSV HEADER" > sales_export.csv
# Import CSV into table
psql -h localhost -U user -d mydb -c "\COPY customers FROM 'customers.csv' WITH CSV HEADER"
MySQL
# Connect to MySQL
mysql -h localhost -u username -p database_name
# Execute query
mysql -h localhost -u user -p -e "SELECT COUNT(*) FROM orders;" mydb
# Execute SQL file
mysql -h localhost -u user -p mydb < script.sql
# Export to CSV
mysql -h localhost -u user -p -e "SELECT * FROM products;" mydb > products.csv
# Backup database
mysqldump -h localhost -u user -p mydb > backup.sql
# Restore database
mysql -h localhost -u user -p mydb < backup.sql
MongoDB
# Connect to MongoDB
mongo
# Connect to specific database
mongo mydb
# Execute command
mongo mydb --eval "db.users.count()"
# Export collection to JSON
mongoexport --db=mydb --collection=users --out=users.json
# Import JSON
mongoimport --db=mydb --collection=users --file=users.json
# Backup
mongodump --db=mydb --out=/backup/
# Restore
mongorestore --db=mydb /backup/mydb/
System Monitoring and Performance
Understanding system performance is crucial when running data pipelines.
CPU and Memory Monitoring
# Real-time system monitoring
top
# Better alternative with more features
htop
# Memory usage
free -h
# Detailed memory info
cat /proc/meminfo
# CPU information
lscpu
# System load average
uptime
# Process-specific resource usage
ps aux | sort -nrk 3 | head -10 # Top 10 CPU users
ps aux | sort -nrk 4 | head -10 # Top 10 memory users
Disk Monitoring
# Disk usage by filesystem
df -h
# Disk usage by directory
du -sh /data/*
# Find largest directories
du -h /data | sort -rh | head -20
# Disk I/O statistics
iostat -x 2 # Update every 2 seconds
# Monitor disk usage in real-time
watch -n 5 df -h
Network Monitoring
# Network interface statistics
ifconfig
# Or on newer systems
ip addr show
# Network statistics
netstat -tuln
# Check open ports
ss -tuln
# Test network speed
iftop # (may need installation)
# Check if a port is open
telnet hostname port
nc -zv hostname port
# Trace route to destination
traceroute google.com
# DNS lookup
nslookup example.com
dig example.com
Log Management
Logs are vital for troubleshooting data pipelines.
System Logs
# View system logs
tail -f /var/log/syslog # Debian/Ubuntu
tail -f /var/log/messages # RedHat/CentOS
# Application logs
tail -f /var/log/apache2/error.log
tail -f /var/log/nginx/access.log
# Kernel logs
dmesg
# Authentication logs
tail -f /var/log/auth.log
journalctl (systemd logs)
# View all logs
journalctl
# Follow logs in real-time
journalctl -f
# Logs from specific service
journalctl -u apache2
# Logs from last boot
journalctl -b
# Logs from specific time period
journalctl --since "2024-01-23 00:00:00" --until "2024-01-23 23:59:59"
# Show only errors
journalctl -p err
# Logs for specific process
journalctl _PID=1234
Log Analysis Examples
# Count error occurrences
grep "ERROR" application.log | wc -l
# Find most common errors
grep "ERROR" application.log | sort | uniq -c | sort -rn | head -10
# Extract errors from a time period
awk '/2024-01-23 14:00:00/,/2024-01-23 15:00:00/' application.log | grep "ERROR"
# Monitor log file size
watch -n 10 'ls -lh /var/log/application.log'
# Rotate logs (manually)
mv application.log application.log.$(date +%Y%m%d)
touch application.log
# Find slow queries in database logs
awk '$NF > 5' slow_query.log # Queries taking more than 5 seconds
File Transfer and Synchronization
rsync - Efficient File Synchronization
# Basic syntax
rsync -av source/ destination/
# Sync to remote server
rsync -av /local/data/ user@server:/remote/data/
# Sync from remote server
rsync -av user@server:/remote/data/ /local/data/
# Sync with compression (good for slow connections)
rsync -avz source/ destination/
# Show progress
rsync -av --progress large_file.csv user@server:/data/
# Dry run (test without actually copying)
rsync -avn source/ destination/
# Delete files in destination that don't exist in source
rsync -av --delete source/ destination/
# Exclude certain files
rsync -av --exclude='*.tmp' --exclude='*.log' source/ destination/
# Bandwidth limit (in KB/s)
rsync -av --bwlimit=1000 large_dataset/ user@server:/data/
Real-World rsync Examples
# Backup data directory daily
rsync -av --delete /data/warehouse/ /backup/warehouse_$(date +%Y%m%d)/
# Sync processed data to production
rsync -avz --exclude='*.tmp' /data/processed/ prod_server:/data/production/
# Mirror data between servers with logging
rsync -av --log-file=/var/log/rsync.log server1:/data/ server2:/data/
# Resume interrupted transfer
rsync -av --partial --progress user@server:/large_file.csv ./
Package Management
Installing and managing software on Linux.
Debian/Ubuntu (apt)
# Update package list
sudo apt update
# Upgrade all packages
sudo apt upgrade
# Install a package
sudo apt install python3-pip
# Remove a package
sudo apt remove package_name
# Search for packages
apt search postgresql
# Show package info
apt show docker.io
# List installed packages
apt list --installed
# Install from .deb file
sudo dpkg -i package.deb
Red Hat/CentOS (yum/dnf)
# Update packages
sudo yum update
# Install package
sudo yum install python3
# Remove package
sudo yum remove package_name
# Search
yum search postgresql
# List installed
yum list installed
# On newer versions, use dnf
sudo dnf install package_name
Installing Data Engineering Tools
# Python and pip
sudo apt install python3 python3-pip
# PostgreSQL
sudo apt install postgresql postgresql-contrib
# Docker
sudo apt install docker.io
# Apache Spark (download and extract)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
# Python libraries
pip3 install pandas numpy apache-airflow pyspark
Real-World Data Engineering Scenarios
Let's put everything together with practical scenarios you'll encounter as a data engineer.
Scenario 1: Setting Up a Data Processing Environment
You've just been hired, and you need to set up your Linux environment for data engineering work.
# Step 1: Update the system
sudo apt update && sudo apt upgrade -y
# Step 2: Install essential tools
sudo apt install -y git vim htop curl wget tree
# Step 3: Install Python and dependencies
sudo apt install -y python3 python3-pip python3-venv
# Step 4: Create project structure
mkdir -p ~/projects/{data_pipelines,scripts,notebooks,data/{raw,processed,output}}
# Step 5: Set up Python virtual environment
cd ~/projects/data_pipelines
python3 -m venv venv
source venv/bin/activate
# Step 6: Install Python libraries
pip install pandas numpy apache-airflow pyspark sqlalchemy psycopg2-binary
# Step 7: Configure git
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
# Step 8: Create .bashrc aliases
cat >> ~/.bashrc << 'EOF'
# Data Engineering Aliases
alias ll='ls -lah'
alias projects='cd ~/projects'
alias activate='source venv/bin/activate'
alias pipes='cd ~/projects/data_pipelines'
# Environment variables
export DATA_HOME=~/projects/data
export PYTHONPATH=$PYTHONPATH:~/projects/scripts
EOF
# Reload bashrc
source ~/.bashrc
# Step 9: Create initial pipeline script
cat > ~/projects/data_pipelines/sample_etl.py << 'EOF'
#!/usr/bin/env python3
"""
Sample ETL Pipeline
"""
import pandas as pd
from datetime import datetime
def extract():
print(f"[{datetime.now()}] Extracting data...")
# Your extraction logic
return pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
def transform(data):
print(f"[{datetime.now()}] Transforming data...")
# Your transformation logic
data['value_doubled'] = data['value'] * 2
return data
def load(data):
print(f"[{datetime.now()}] Loading data...")
# Your loading logic
data.to_csv('output.csv', index=False)
print(f"[{datetime.now()}] Completed!")
if __name__ == "__main__":
df = extract()
df_transformed = transform(df)
load(df_transformed)
EOF
chmod +x ~/projects/data_pipelines/sample_etl.py
echo "Environment setup complete!"
Scenario 2: Troubleshooting a Failed Data Pipeline
It's 3 AM, and you get an alert that the nightly ETL pipeline failed. Here's how you troubleshoot:
# Step 1: SSH into the production server
ssh production
# Step 2: Check if the process is running
ps aux | grep etl_pipeline
# Step 3: Check the logs
tail -100 /var/log/etl/pipeline.log
# Step 4: Look for errors
grep -i "error\|exception\|failed" /var/log/etl/pipeline.log | tail -20
# Output shows: "ERROR: Database connection timeout"
# Step 5: Check database connectivity
nc -zv database-server 5432
# Output: Connection refused
# Step 6: Check if database service is running
ssh database-server "systemctl status postgresql"
# Step 7: Restart database (if you have permission)
ssh database-server "sudo systemctl restart postgresql"
# Step 8: Verify it's running
ssh database-server "systemctl status postgresql"
# Step 9: Test connection from app server
psql -h database-server -U etl_user -d warehouse -c "SELECT 1;"
# Success! Now rerun the pipeline
cd /opt/data_pipelines
./etl_pipeline.sh
# Step 10: Monitor in real-time
tail -f /var/log/etl/pipeline.log
# Step 11: Verify completion
ls -lh /data/output/ | grep $(date +%Y%m%d)
# Step 12: Document the incident
cat >> /var/log/incidents/$(date +%Y%m%d).txt << EOF
Time: $(date)
Issue: ETL pipeline failure due to database connection timeout
Cause: PostgreSQL service stopped unexpectedly
Resolution: Restarted PostgreSQL service
Status: Resolved
EOF
Scenario 3: Processing Large CSV Files
You need to process a 10GB CSV file, but your laptop doesn't have enough memory.
# Step 1: Check available memory
free -h
# Step 2: Check file size
ls -lh large_dataset.csv
# Output: -rw-r--r-- 1 user user 10G Jan 23 10:00 large_dataset.csv
# Step 3: Count lines without loading entire file
wc -l large_dataset.csv
# Output: 50000000 large_dataset.csv
# Step 4: View first few lines to understand structure
head -20 large_dataset.csv
# Step 5: Split into smaller chunks (1 million lines each)
# Note: only the first chunk keeps the CSV header row
split -l 1000000 large_dataset.csv chunk_ --additional-suffix=.csv
# Step 6: Process each chunk
for file in chunk_*.csv; do
echo "Processing $file..."
python process_chunk.py "$file" "processed_${file}"
done
# Step 7: Combine processed results
cat processed_chunk_*.csv > final_processed.csv
# Step 8: Verify
wc -l final_processed.csv
# Step 9: Clean up
rm chunk_*.csv processed_chunk_*.csv
# Alternative: Use streaming processing
awk -F',' 'NR > 1 && $3 > 100 {print $1","$2","$3}' large_dataset.csv > filtered.csv
# Or use sed for transformation
sed 's/,/ | /g' large_dataset.csv > pipe_delimited.csv
Scenario 4: Automating Daily Data Backups
Set up automated backups of your data warehouse.
# Step 1: Create backup script
cat > /home/dataengineer/scripts/backup_warehouse.sh << 'EOF'
#!/bin/bash
# Configuration
BACKUP_DIR="/backup/warehouse"
DB_NAME="data_warehouse"
DB_USER="backup_user"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/warehouse_${DATE}.sql.gz"
LOG_FILE="/var/log/backups/warehouse_backup.log"
# Function to log messages
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"
# Start backup
log "Starting backup of $DB_NAME"
# Perform backup with compression
if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"; then
BACKUP_SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
log "Backup completed successfully. Size: $BACKUP_SIZE"
else
log "ERROR: Backup failed!"
exit 1
fi
# Delete old backups
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "warehouse_*.sql.gz" -mtime +$RETENTION_DAYS -delete
# Count remaining backups
BACKUP_COUNT=$(find "$BACKUP_DIR" -name "warehouse_*.sql.gz" | wc -l)
log "Total backups retained: $BACKUP_COUNT"
# Optional: Upload to S3
if [ -n "$AWS_BACKUP_BUCKET" ]; then
log "Uploading to S3..."
aws s3 cp "$BACKUP_FILE" "s3://${AWS_BACKUP_BUCKET}/$(basename $BACKUP_FILE)"
log "Upload completed"
fi
log "Backup process completed"
EOF
# Step 2: Make script executable
chmod +x /home/dataengineer/scripts/backup_warehouse.sh
# Step 3: Test the script
/home/dataengineer/scripts/backup_warehouse.sh
# Step 4: Schedule with cron (daily at 2 AM)
crontab -e
# Add this line:
# 0 2 * * * /home/dataengineer/scripts/backup_warehouse.sh
# Step 5: Set up log rotation (use tee because the redirection itself needs root)
sudo tee /etc/logrotate.d/warehouse_backup > /dev/null << 'EOF'
/var/log/backups/warehouse_backup.log {
daily
rotate 30
compress
delaycompress
notifempty
create 0640 dataengineer dataengineer
}
EOF
Scenario 5: Monitoring Data Pipeline Performance
Create a comprehensive monitoring script.
cat > ~/scripts/monitor_pipeline.sh << 'EOF'
#!/bin/bash
# Configuration
PIPELINE_DIR="/opt/data_pipelines"
LOG_DIR="/var/log/pipelines"
ALERT_EMAIL="team@company.com"
DISK_THRESHOLD=80
MEMORY_THRESHOLD=85
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Check disk usage
check_disk() {
echo "=== Disk Usage Check ==="
df -h | grep -E '^/dev/' | while read line; do
usage=$(echo $line | awk '{print $5}' | sed 's/%//')
partition=$(echo $line | awk '{print $6}')
if [ "$usage" -ge "$DISK_THRESHOLD" ]; then
echo -e "${RED}WARNING${NC}: $partition is ${usage}% full"
else
echo -e "${GREEN}OK${NC}: $partition is ${usage}% full"
fi
done
echo
}
# Check memory usage
check_memory() {
echo "=== Memory Usage Check ==="
mem_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100}')
if [ "$mem_usage" -ge "$MEMORY_THRESHOLD" ]; then
echo -e "${RED}WARNING${NC}: Memory usage is ${mem_usage}%"
else
echo -e "${GREEN}OK${NC}: Memory usage is ${mem_usage}%"
fi
echo
}
# Check running pipelines
check_pipelines() {
echo "=== Pipeline Status ==="
for pipeline in etl_pipeline data_sync ml_training; do
if pgrep -f "$pipeline" > /dev/null; then
echo -e "${GREEN}RUNNING${NC}: $pipeline"
else
echo -e "${YELLOW}STOPPED${NC}: $pipeline"
fi
done
echo
}
# Check for recent errors in logs
check_errors() {
echo "=== Recent Errors (Last Hour) ==="
for log in ${LOG_DIR}/*.log; do
if [ -f "$log" ]; then
error_count=$(grep -c "ERROR" "$log" 2>/dev/null)
error_count=${error_count:-0}
if [ "$error_count" -gt 0 ]; then
echo -e "${RED}$error_count errors${NC} in $(basename $log)"
fi
fi
done
echo
}
# Check database connectivity
check_database() {
echo "=== Database Connectivity ==="
if psql -h localhost -U etl_user -d warehouse -c "SELECT 1;" &>/dev/null; then
echo -e "${GREEN}OK${NC}: Database connection successful"
else
echo -e "${RED}FAILED${NC}: Cannot connect to database"
fi
echo
}
# Generate report
generate_report() {
REPORT_FILE="/tmp/pipeline_status_$(date +%Y%m%d_%H%M%S).txt"
{
echo "Pipeline Monitoring Report"
echo "Generated: $(date)"
echo "================================"
echo
check_disk
check_memory
check_pipelines
check_errors
check_database
} | tee "$REPORT_FILE"
echo "Report saved to: $REPORT_FILE"
}
# Main execution
echo "Starting Pipeline Monitoring..."
echo "================================"
generate_report
EOF
chmod +x ~/scripts/monitor_pipeline.sh
# Run monitoring every 15 minutes
crontab -e
# Add: */15 * * * * /home/dataengineer/scripts/monitor_pipeline.sh >> /var/log/monitoring.log 2>&1
Scenario 6: Data Quality Checks
Implement automated data quality validation.
cat > ~/scripts/data_quality_check.sh << 'EOF'
#!/bin/bash
# Configuration
DATA_DIR="/data/processed"
REPORT_DIR="/data/quality_reports"
DATE=$(date +%Y%m%d)
mkdir -p "$REPORT_DIR"
# Function to check null values
check_nulls() {
local file=$1
local null_count=$(grep -c "NULL\|^,$\|,,\|,$" "$file")
if [ "$null_count" -gt 0 ]; then
echo "WARNING: Found $null_count potential null values in $(basename $file)"
return 1
else
echo "OK: No null values in $(basename $file)"
return 0
fi
}
# Function to check duplicates
check_duplicates() {
local file=$1
local total_lines=$(wc -l < "$file")
local unique_lines=$(sort -u "$file" | wc -l)
local duplicates=$((total_lines - unique_lines))
if [ "$duplicates" -gt 0 ]; then
echo "WARNING: Found $duplicates duplicate rows in $(basename $file)"
return 1
else
echo "OK: No duplicates in $(basename $file)"
return 0
fi
}
# Function to validate schema
validate_schema() {
local file=$1
local expected_columns=$2
local actual_columns=$(head -1 "$file" | tr ',' '\n' | wc -l)
if [ "$actual_columns" -eq "$expected_columns" ]; then
echo "OK: Schema validation passed for $(basename $file)"
return 0
else
echo "ERROR: Expected $expected_columns columns, found $actual_columns in $(basename $file)"
return 1
fi
}
# Main quality check
{
echo "Data Quality Report - $DATE"
echo "============================"
echo
for file in "${DATA_DIR}"/*.csv; do
if [ -f "$file" ]; then
echo "Checking: $(basename $file)"
echo "Size: $(du -h $file | cut -f1)"
echo "Rows: $(wc -l < $file)"
validate_schema "$file" 5
check_nulls "$file"
check_duplicates "$file"
echo "---"
fi
done
} | tee "${REPORT_DIR}/quality_report_${DATE}.txt"
echo "Quality check completed. Report saved to ${REPORT_DIR}/quality_report_${DATE}.txt"
EOF
chmod +x ~/scripts/data_quality_check.sh
Best Practices and Tips
Command Line Productivity
1. Use Tab Completion
# Start typing and press Tab
cd /var/lo[Tab] # Completes to /var/log/
python scr[Tab] # Completes to script.py
2. Command History
# View command history
history
# Search history
Ctrl+R # Then type to search
# Execute previous command
!!
# Execute command from history
!123 # Executes command number 123
# Execute last command starting with 'python'
!python
# Use previous command's arguments
ls /long/path/to/directory
cd !$ # Changes to /long/path/to/directory
3. Useful Aliases
Add these to your ~/.bashrc:
# Navigation
alias ..='cd ..'
alias ...='cd ../..'
alias home='cd ~'
# Listing
alias ll='ls -lah'
alias la='ls -A'
alias l='ls -CF'
# Safety
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
# Git shortcuts
alias gs='git status'
alias ga='git add'
alias gc='git commit'
alias gp='git push'
alias gl='git log --oneline'
# Data engineering
alias pipes='cd ~/projects/data_pipelines'
alias logs='cd /var/log'
alias dsize='du -sh'
alias ports='netstat -tuln'
# Docker shortcuts
alias dps='docker ps'
alias dim='docker images'
alias dex='docker exec -it'
# Python
alias py='python3'
alias pip='pip3'
alias venv='python3 -m venv'
4. Navigate Directories Efficiently
# Use directory stack
pushd /var/log # Go to /var/log and save current dir
pushd /etc # Go to /etc and save /var/log
popd # Return to /var/log
popd # Return to original directory
# Show directory stack
dirs
# Quick back and forth
cd - # Toggle between last two directories
Security Best Practices
1. File Permissions
# Scripts should be readable and executable
chmod 750 script.sh
# Configuration files should not be world-readable
chmod 600 config.ini
# Directories for collaboration
chmod 775 shared_directory/
# Never use 777 unless absolutely necessary (and understand why)
2. Secure Passwords and Credentials
# Never store passwords in scripts
# BAD:
password="mypassword123"
# GOOD: Use environment variables
export DB_PASSWORD=$(cat ~/.secrets/db_pass)
# Or use a secrets manager
export DB_PASSWORD=$(aws secretsmanager get-secret-value --secret-id db_password --query SecretString --output text)
# Ensure secrets files are protected
chmod 600 ~/.secrets/db_pass
3. Regular Updates
# Keep system updated
sudo apt update && sudo apt upgrade
# Upgrade only a specific package (e.g. for a security fix)
sudo apt-get install --only-upgrade package_name
# Check for security advisories
sudo apt-get changelog package_name
Performance Optimization
1. Efficient File Processing
# Use grep -F for fixed strings (faster)
grep -F "exact_string" large_file.log
# Use awk instead of multiple pipes
# Instead of: cat file | grep pattern | cut -d',' -f1
awk -F',' '/pattern/ {print $1}' file
# Process compressed files directly
zgrep "error" logfile.gz
zcat file.gz | awk '{print $1}'
2. Parallel Processing
# Use GNU parallel for concurrent processing
find . -name "*.csv" | parallel python process.py {}
# Process multiple files simultaneously
ls *.csv | xargs -P 4 -I {} python process.py {}
# -P 4 means use 4 parallel processes
3. Monitor Resource Usage
# Before running heavy processes, check resources
free -h
df -h
# Use nice to lower priority of heavy processes
nice -n 19 python heavy_computation.py
# Limit resources with ulimit
ulimit -v 4000000 # Limit virtual memory to ~4GB
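You can also cap a job's runtime with the coreutils timeout command, a simple guard for batch work:
# Kill the job if it runs longer than 2 hours
timeout 2h nice -n 19 python heavy_computation.py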
Error Handling and Logging
1. Proper Error Handling in Scripts
#!/bin/bash
set -e # Exit on any error
set -u # Exit on undefined variable
set -o pipefail # Catch errors in pipes
# Example
python extract.py || { echo "Extraction failed"; exit 1; }
python transform.py || { echo "Transformation failed"; exit 1; }
python load.py || { echo "Loading failed"; exit 1; }
2. Comprehensive Logging
#!/bin/bash
LOG_FILE="/var/log/my_pipeline.log"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "INFO: Starting pipeline"
log "INFO: Processing data..."
log "ERROR: Something went wrong"
3. Debugging Scripts
# Run script with debugging
bash -x script.sh
# Add debugging to specific sections
set -x # Enable debugging
# ... commands ...
set +x # Disable debugging
# Trace function calls
set -T
trap 'echo "Function: $FUNCNAME"' DEBUG
Documentation
1. Comment Your Scripts
#!/bin/bash
# Script: data_pipeline.sh
# Purpose: Daily ETL pipeline for customer data
# Author: Your Name
# Date: 2024-01-23
# Usage: ./data_pipeline.sh [date]
#
# This script:
# 1. Extracts customer data from source database
# 2. Transforms and cleanses the data
# 3. Loads into data warehouse
#
# Environment variables required:
# - DB_HOST: Database hostname
# - DB_USER: Database username
# - DB_PASS: Database password
# Configuration
SOURCE_DB="customers"
TARGET_TABLE="dim_customers"
2. Maintain README Files
# Create a README for your project
cat > README.md << 'EOF'
# Data Pipeline Project
## Overview
This project contains ETL pipelines for processing customer data.
## Requirements
- Python 3.8+
- PostgreSQL 12+
- pandas, sqlalchemy
## Installation
pip install -r requirements.txt
## Usage
# Run the daily pipeline
./scripts/daily_etl.sh
# Run for a specific date
./scripts/daily_etl.sh 2024-01-23
## Configuration
Edit `config/database.conf` with your database credentials.
## Troubleshooting
See `docs/troubleshooting.md`
## Contact
Data Engineering Team - data@company.com
EOF
Backup and Disaster Recovery
# Create backup strategy
cat > ~/scripts/comprehensive_backup.sh << 'EOF'
#!/bin/bash
# Backup locations
BACKUP_ROOT="/backup"
DATE=$(date +%Y%m%d)
# Create backup structure
mkdir -p ${BACKUP_ROOT}/{databases,scripts,configs,data}/${DATE}
# Backup databases
pg_dump -U user warehouse > ${BACKUP_ROOT}/databases/${DATE}/warehouse.sql
# Backup scripts and configurations
tar -czf ${BACKUP_ROOT}/scripts/${DATE}/scripts.tar.gz ~/scripts
tar -czf ${BACKUP_ROOT}/configs/${DATE}/configs.tar.gz ~/config
# Backup critical data
rsync -av /data/production/ ${BACKUP_ROOT}/data/${DATE}/
# Verify backups
for dir in databases scripts configs data; do
if [ "$(find ${BACKUP_ROOT}/${dir}/${DATE} -type f | wc -l)" -gt 0 ]; then
echo "✓ $dir backup successful"
else
echo "✗ $dir backup failed"
fi
done
# Clean old backups (keep 30 days; target only the dated subdirectories, not the category folders)
find ${BACKUP_ROOT} -mindepth 2 -maxdepth 2 -type d -mtime +30 -exec rm -rf {} +
echo "Backup completed: $(date)"
EOF
Conclusion
Congratulations! You've completed this comprehensive guide to Linux for data engineers. You've learned:
- Why Linux is fundamental to data engineering
- Essential commands for daily tasks
- How to use Vi and Nano for text editing
- Advanced concepts like shell scripting, cron, and SSH
- Real-world scenarios and best practices
Next Steps
1. Practice Daily
The best way to master Linux is to use it every day. Set it up as your primary development environment.
2. Build Projects
Create your own data pipelines, automation scripts, and tools. Real projects solidify your learning.
3. Explore Further
- Learn Docker for containerization
- Study Kubernetes for orchestration
- Explore Apache Airflow for workflow management
- Dive deeper into shell scripting with advanced bash techniques
4. Join Communities
- Stack Overflow for Q&A
- Reddit's r/linux and r/dataengineering
- Linux User Groups (LUGs) in your area
- Open source projects on GitHub
Resources for Continued Learning
Books:
- "The Linux Command Line" by William Shotts
- "Linux Pocket Guide" by Daniel J. Barrett
- "Shell Scripting: Expert Recipes for Linux, Bash and more" by Steve Parker
Online Resources:
- Linux Journey (linuxjourney.com)
- OverTheWire: Bandit (for practicing commands)
- The Linux Documentation Project (tldp.org)
Practice Platforms:
- DigitalOcean (for cheap Linux servers to practice)
- AWS Free Tier (for cloud Linux environments)
- VirtualBox (for local Linux VMs)
Final Thoughts
Linux mastery is a journey, not a destination. Every data engineer continues to learn new commands, techniques, and best practices throughout their career. The key is to stay curious, experiment safely (with backups!), and never stop learning.
Remember: every expert was once a beginner who didn't give up. The command line might seem intimidating at first, but with practice, it becomes second nature. Soon, you'll find yourself more productive and efficient than you ever thought possible.
Welcome to the world of Linux. Your journey as a data engineer just got a powerful boost!
Quick Reference Card
Most Used Commands
# Navigation
pwd, cd, ls, tree
# File operations
cp, mv, rm, mkdir, touch
# Viewing files
cat, less, head, tail, grep
# System info
df -h, free -h, top, htop
# Permissions
chmod, chown
# Processes
ps, kill, top, jobs
# Network
ssh, scp, rsync, wget, curl
# Text editors
nano, vi/vim
# Package management
apt install, apt update (Debian/Ubuntu)
yum install, yum update (RedHat/CentOS)
Vi Quick Reference
i - Insert mode
Esc - Normal mode
:w - Save
:q - Quit
:wq - Save and quit
dd - Delete line
yy - Copy line
p - Paste
/search - Search forward
:s/old/new/g - Replace in line
:%s/old/new/g - Replace in file
Nano Quick Reference
Ctrl+O - Save
Ctrl+X - Exit
Ctrl+K - Cut line
Ctrl+U - Paste
Ctrl+W - Search
Ctrl+\ - Replace
Ctrl+G - Help
Happy Linux Learning! 🐧
This guide was created for data engineers at all levels. Feel free to bookmark, share, and revisit as you grow in your Linux journey.