Macphalen Oduor

Linux Essentials for Data Engineers: A Beginner's Guide

Why Linux Matters

The Foundation of Data Engineering

Linux powers the modern data infrastructure. When you process petabytes of data on cloud servers, manage databases, or build data pipelines, you're almost certainly working with Linux.

Why it's essential:

  • Cloud Dominance: AWS, Google Cloud, and Azure run primarily on Linux servers
  • Performance: Lightweight OS that doesn't waste resources on unnecessary processes
  • Automation: Built for scripting - automate repetitive tasks instead of doing them manually
  • Tool Ecosystem: Spark, Hadoop, Kafka, and Airflow are optimized for Linux
  • Cost: Free and open source - companies save millions on licensing
  • Remote Work: Manage servers from anywhere via SSH

Real-World Impact:

Imagine building a data pipeline that processes customer transactions:

  • You write Python code
  • Deploy it to Linux servers (AWS EMR)
  • Schedule it with cron (Linux job scheduler)
  • Monitor logs using Linux commands
  • When it breaks at 2 AM, you SSH into the Linux server to fix it

Every step requires Linux knowledge.

Getting Started

Understanding the Terminal

The terminal is your control center in Linux. Unlike graphical interfaces where you click buttons, here you type commands. Think of it like this:

  • Graphical UI: Ordering from a fixed menu
  • Command Line: Telling the chef exactly what you want and how to prepare it

You get complete control and precision.

When you open a terminal, you'll see the prompt:

user@hostname:~$

Breaking it down:

  • user: Your username (who you are)
  • hostname: Computer name (which machine you're on)
  • ~: Tilde represents your home directory (/home/username)
  • $: The prompt symbol (it's # if you're the root/admin user)
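
The prompt also updates as you move around or switch users; on a typical Ubuntu/Debian setup it looks something like this (hostname and paths are illustrative):

user@hostname:~$ cd /var/log
user@hostname:/var/log$ sudo -i
root@hostname:~#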

Your First Commands

Let's start with orientation - knowing where you are and what's around you.

Where am I?

user@hostname:~$ pwd

Output:

/home/username

pwd means "print working directory" - like checking your GPS location in the file system.

What's here?

user@hostname:~$ ls

Output:

Documents  Downloads  Pictures  Music  Videos

ls lists files and directories - like looking at folder contents in File Explorer.

Show more details:

user@hostname:~$ ls -lah

Output:

drwxr-xr-x  2 user user 4096 Jan 23 10:30 Documents
-rw-r--r--  1 user user  245 Jan 23 09:15 notes.txt

The -lah flags mean:

  • l: Long format (detailed view)
  • a: All files (including hidden ones starting with .)
  • h: Human-readable sizes (KB, MB instead of bytes)
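
With -a in effect you also see the hidden entries (including . and ..), and -h turns 4096 into 4.0K. The output looks along these lines (entries are illustrative):

user@hostname:~$ ls -lah
total 24K
drwxr-xr-x  5 user user 4.0K Jan 23 10:30 .
drwxr-xr-x  3 root root 4.0K Jan 10 08:00 ..
-rw-r--r--  1 user user  220 Jan 10 08:00 .bash_logout
drwxr-xr-x  2 user user 4.0K Jan 23 10:30 Documents
-rw-r--r--  1 user user  245 Jan 23 09:15 notes.txt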

Who am I?

user@hostname:~$ whoami

Output:

username

Simple but useful - tells you which user account you're using.

Hands-On Exercise 1

Let's practice navigation:

# 1. Check where you are
user@hostname:~$ pwd

# 2. List what's in your current directory
user@hostname:~$ ls

# 3. Create a practice directory
user@hostname:~$ mkdir linux_practice

# 4. Move into it
user@hostname:~$ cd linux_practice

# 5. Confirm you're there
user@hostname:~$ pwd
# Should show: /home/username/linux_practice

# 6. Create some test files
user@hostname:~$ touch file1.txt file2.txt file3.txt

# 7. List them
user@hostname:~$ ls

# 8. Go back to parent directory
user@hostname:~$ cd ..

# 9. Verify where you are
user@hostname:~$ pwd

What you learned:

  • pwd - Know your location
  • ls - See what's around
  • mkdir - Create directories
  • cd - Move between directories
  • touch - Create empty files

Essential Commands

Navigation Commands

Linux organizes files in a tree structure. Understanding paths is crucial.

  • Absolute path: Complete path from root (/home/user/project/data.csv)
  • Relative path: Path from current location (project/data.csv)
  • Special directories:
    • . - Current directory
    • .. - Parent directory
    • ~ - Home directory
    • / - Root (top of file system)

Practice:

# Go to home directory (3 ways)
user@hostname:~$ cd
user@hostname:~$ cd ~
user@hostname:~$ cd /home/username

# Go to specific directory (absolute path)
user@hostname:~$ cd /var/log

# Go to subdirectory (relative path)
# If you're in /home/username:
user@hostname:~$ cd Documents/projects

# Go up one level
user@hostname:~$ cd ..

# Go up two levels
user@hostname:~$ cd ../..

# Toggle between last two directories
user@hostname:~$ cd /var/log
user@hostname:~$ cd /home/username
user@hostname:~$ cd -  # Back to /var/log
user@hostname:~$ cd -  # Back to /home/username

File Operations

Everything in Linux is a file - even devices and processes. Understanding file operations is fundamental.
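
You can check this yourself: device nodes live under /dev, and per-process information is exposed as files under /proc (exact device names vary by system):

user@hostname:~$ ls -l /dev/null
user@hostname:~$ head -5 /proc/cpuinfo
user@hostname:~$ ls /proc/$$/    # files describing your current shell process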

Creating and Organizing:

# Create a directory
user@hostname:~$ mkdir data_projects

# Create nested directories (parent directories too)
user@hostname:~$ mkdir -p projects/pipeline/scripts

# Create multiple directories
user@hostname:~$ mkdir raw_data processed_data output_data

# Create an empty file
user@hostname:~$ touch data.csv

# Create multiple files
user@hostname:~$ touch file1.txt file2.txt file3.txt

Viewing Files:

# Display entire file
user@hostname:~$ cat data.csv

# View large files page by page (q to quit)
user@hostname:~$ less application.log

# First 10 lines
user@hostname:~$ head data.csv

# Last 10 lines
user@hostname:~$ tail data.csv

# Custom line count
user@hostname:~$ head -n 20 data.csv
user@hostname:~$ tail -n 50 error.log

# Follow a file in real-time (Ctrl+C to stop)
user@hostname:~$ tail -f /var/log/pipeline.log

Copying and Moving:

# Copy file
user@hostname:~$ cp source.csv backup.csv

# Copy directory (recursive)
user@hostname:~$ cp -r source_folder/ backup_folder/

# Move/rename file
user@hostname:~$ mv old_name.txt new_name.txt

# Move to different directory
user@hostname:~$ mv data.csv /home/user/projects/

# Move multiple files
user@hostname:~$ mv *.txt /destination/directory/

Deleting (Careful!):

# Delete a file
user@hostname:~$ rm unwanted.txt

# Delete with confirmation
user@hostname:~$ rm -i important.txt

# Delete directory and contents
user@hostname:~$ rm -r old_project/

# Force delete (dangerous!)
user@hostname:~$ rm -rf directory/

⚠️ Warning: Linux has no Recycle Bin. Deleted = Gone forever!
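
A common safeguard is to move files into a scratch "trash" directory instead of deleting them immediately; a minimal sketch (the ~/.trash path is just an example):

user@hostname:~$ mkdir -p ~/.trash
user@hostname:~$ mv unwanted.txt ~/.trash/
# Empty it later, once you're sure nothing in it is needed
user@hostname:~$ rm -rf ~/.trash/*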

Hands-On Exercise 2: File Management

# Setup
user@hostname:~$ mkdir -p ~/practice/data/{raw,processed,archive}
user@hostname:~$ cd ~/practice/data

# Create sample files
user@hostname:~$ echo "id,name,age" > raw/users.csv
user@hostname:~$ echo "1,Alice,30" >> raw/users.csv
user@hostname:~$ echo "2,Bob,25" >> raw/users.csv

# View the file
user@hostname:~$ cat raw/users.csv

# Copy to processed
user@hostname:~$ cp raw/users.csv processed/users_cleaned.csv

# Add a timestamp to filename
user@hostname:~$ cp processed/users_cleaned.csv archive/users_$(date +%Y%m%d).csv

# Verify
user@hostname:~$ ls -R

# Clean up
user@hostname:~$ cd ~
user@hostname:~$ rm -r ~/practice

Searching and Finding

Finding files and text is a daily task. Linux provides powerful search tools.

Finding Files:

# Find by name
user@hostname:~$ find /home/user -name "*.csv"

# Find files modified in last 7 days
user@hostname:~$ find /data -type f -mtime -7

# Find large files (>100MB)
user@hostname:~$ find /data -type f -size +100M

# Find and delete old logs
user@hostname:~$ find /var/log -name "*.log" -mtime +30 -delete

# Find by permissions
user@hostname:~$ find /scripts -type f -perm 644

Searching Inside Files:

# Basic search
user@hostname:~$ grep "error" application.log

# Case-insensitive
user@hostname:~$ grep -i "Error" logfile.log

# Search recursively in directories
user@hostname:~$ grep -r "TODO" /home/user/projects/

# Count matches
user@hostname:~$ grep -c "success" pipeline.log

# Show line numbers
user@hostname:~$ grep -n "failed" etl.log

# Show context (2 lines before and after)
user@hostname:~$ grep -C 2 "exception" app.log

# Invert match (show lines NOT matching)
user@hostname:~$ grep -v "debug" app.log
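
find and grep also combine well: find selects the files and grep inspects their contents (paths are illustrative):

# List CSV files under /data whose contents mention ERROR
user@hostname:~$ find /data -name "*.csv" -exec grep -l "ERROR" {} +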

Hands-On Exercise 3: Search Practice

# Create test log file
user@hostname:~$ cat > test.log << EOF
2024-01-23 10:00:00 INFO Starting process
2024-01-23 10:01:00 INFO Processing record 1
2024-01-23 10:02:00 ERROR Database connection failed
2024-01-23 10:03:00 INFO Retrying connection
2024-01-23 10:04:00 INFO Processing record 2
2024-01-23 10:05:00 ERROR Timeout occurred
2024-01-23 10:06:00 INFO Process completed
EOF

# Find all errors
user@hostname:~$ grep "ERROR" test.log

# Count errors
user@hostname:~$ grep -c "ERROR" test.log

# Show errors with context
user@hostname:~$ grep -C 1 "ERROR" test.log

# Find successful records
user@hostname:~$ grep "Processing record" test.log

# Clean up
user@hostname:~$ rm test.log

Text Processing

Data engineers constantly manipulate text files. Master these tools for efficiency.

Piping and Redirection:

Pipes (|) connect commands - output of one becomes input of another.

# Save output to file (overwrites)
user@hostname:~$ ls -l > file_list.txt

# Append to file
user@hostname:~$ echo "New entry" >> logfile.txt

# Redirect errors
user@hostname:~$ python script.py 2> errors.log

# Redirect both output and errors
user@hostname:~$ python script.py > output.log 2>&1

# Chain commands with pipe
user@hostname:~$ cat large.csv | grep "error" | wc -l

# Complex pipeline
user@hostname:~$ cat data.txt | sort | uniq | wc -l
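
tee is handy when you want to watch output on screen and keep a copy in a file at the same time (run_pipeline.sh is just a placeholder):

# Write to the terminal and to file_list.txt simultaneously
user@hostname:~$ ls -l | tee file_list.txt

# Append instead of overwrite
user@hostname:~$ ./run_pipeline.sh | tee -a pipeline_run.log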

Powerful Text Tools:

# Extract columns from CSV
user@hostname:~$ cut -d',' -f1,3 data.csv

# Example:
user@hostname:~$ echo "name,age,city,salary" | cut -d',' -f1,3
# Output: name,city

# Process with awk
user@hostname:~$ awk -F',' '{print $1, $3}' data.csv

# Sum a column
user@hostname:~$ awk -F',' '{sum += $4} END {print sum}' data.csv

# Count unique values
user@hostname:~$ cut -d',' -f2 data.csv | sort | uniq -c

# Replace text
user@hostname:~$ sed 's/old/new/g' file.txt

# Delete lines matching pattern
user@hostname:~$ sed '/pattern/d' file.txt

Hands-On Exercise 4: Real Data Processing

# Create sample CSV
user@hostname:~$ cat > sales.csv << EOF
date,product,quantity,price
2024-01-20,Widget,10,25.00
2024-01-21,Gadget,5,50.00
2024-01-22,Widget,15,25.00
2024-01-23,Doohickey,8,30.00
EOF

# Extract product and quantity
user@hostname:~$ cut -d',' -f2,3 sales.csv

# Calculate total revenue (quantity * price)
user@hostname:~$ awk -F',' 'NR>1 {revenue += $3 * $4} END {print "Total: $" revenue}' sales.csv

# Count transactions per product
user@hostname:~$ cut -d',' -f2 sales.csv | tail -n +2 | sort | uniq -c

# Find high-value transactions (price > 30)
user@hostname:~$ awk -F',' 'NR>1 && $4 > 30 {print $0}' sales.csv

# Clean up
user@hostname:~$ rm sales.csv

Working with Compression

Data files are often compressed. Learn to work with them efficiently.

# Create compressed archive
user@hostname:~$ tar -czvf backup.tar.gz /data/files
# c: create, z: gzip, v: verbose, f: filename

# Extract archive
user@hostname:~$ tar -xzvf backup.tar.gz
# x: extract

# View archive contents without extracting
user@hostname:~$ tar -tzvf backup.tar.gz

# Compress single file
user@hostname:~$ gzip large_file.csv
# Creates: large_file.csv.gz

# Decompress
user@hostname:~$ gunzip large_file.csv.gz

# View compressed file without extracting
user@hostname:~$ zcat file.csv.gz | head -10

# Search in compressed file
user@hostname:~$ zgrep "error" logs.gz

Hands-On Exercise 5: Archive Practice

# Create test structure
user@hostname:~$ mkdir -p backup_test/data/{logs,reports}
user@hostname:~$ echo "Log entry 1" > backup_test/data/logs/app.log
user@hostname:~$ echo "Report 1" > backup_test/data/reports/monthly.txt

# Create compressed backup
user@hostname:~$ tar -czvf backup_$(date +%Y%m%d).tar.gz backup_test/

# List contents
user@hostname:~$ tar -tzvf backup_*.tar.gz

# Extract to different location
user@hostname:~$ mkdir restore
user@hostname:~$ tar -xzvf backup_*.tar.gz -C restore/

# Verify
user@hostname:~$ ls -R restore/

# Clean up
user@hostname:~$ rm -rf backup_test restore backup_*.tar.gz

Text Editors

Nano (Beginner-Friendly)

# Open file
user@hostname:~$ nano myfile.txt

# Essential shortcuts:
# Ctrl+O  - Save
# Ctrl+X  - Exit
# Ctrl+W  - Search
# Ctrl+K  - Cut line
# Ctrl+U  - Paste

Example: Create a Python script

user@hostname:~$ nano data_processor.py
# Type your code (start the file with a shebang line such as #!/usr/bin/env python3
# so it can be run directly), then press Ctrl+O, Enter, Ctrl+X to save and exit
user@hostname:~$ chmod +x data_processor.py
user@hostname:~$ ./data_processor.py

Vi/Vim (Power User)

Vi has different modes:

  • Normal Mode: Navigate and run commands
  • Insert Mode: Type text (press i)
  • Command Mode: Save/quit (press :)

# Basic Vi commands:
i          # Enter insert mode
Esc        # Return to normal mode
:w         # Save
:q         # Quit
:wq        # Save and quit
:q!        # Quit without saving

# Navigation (normal mode)
h j k l    # Left, down, up, right
gg         # Top of file
G          # Bottom of file

# Editing
dd         # Delete line
yy         # Copy line
p          # Paste
u          # Undo

# Search
/pattern   # Search forward
n          # Next match
:%s/old/new/g  # Replace all

Quick Vi escape: Press Esc, then type :q! and hit Enter

Automation

Shell Scripts

Create a backup script:

#!/bin/bash
# daily_backup.sh

BACKUP_DIR="/backup"
SOURCE="/data/warehouse"
DATE=$(date +%Y%m%d)

echo "Starting backup at $(date)"
tar -czf "${BACKUP_DIR}/backup_${DATE}.tar.gz" "${SOURCE}"

if [ $? -eq 0 ]; then
    echo "✓ Backup completed"
else
    echo "✗ Backup failed"
    exit 1
fi

# Delete backups older than 30 days
find "${BACKUP_DIR}" -name "backup_*.tar.gz" -mtime +30 -delete

Make it executable:

user@hostname:~$ chmod +x daily_backup.sh
user@hostname:~$ ./daily_backup.sh

Scheduling with Cron

# Edit crontab
user@hostname:~$ crontab -e

# Cron syntax: minute hour day month weekday command
# * * * * * command

# Examples:
0 2 * * * /home/user/daily_backup.sh        # Daily at 2 AM
*/15 * * * * /home/user/check_status.sh     # Every 15 minutes
0 9 * * 1-5 /home/user/weekday_report.sh    # Weekdays at 9 AM
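
A few cron habits worth adopting: list what's scheduled with crontab -l, and capture each job's output in a log so failures aren't silent (paths are illustrative):

# Show your current cron jobs
user@hostname:~$ crontab -l

# In the crontab entry itself, redirect stdout and stderr to a log file
0 2 * * * /home/user/daily_backup.sh >> /var/log/backup_cron.log 2>&1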

Environment Variables

# Set variable
user@hostname:~$ export DATABASE_URL="postgresql://localhost:5432/mydb"

# Make permanent (add to ~/.bashrc)
user@hostname:~$ echo 'export DATABASE_URL="postgresql://..."' >> ~/.bashrc
user@hostname:~$ source ~/.bashrc

# Use in scripts
echo "Connecting to: $DATABASE_URL"
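
Inside a script you can fall back to a default when a variable isn't set, which keeps the script usable for local testing (the URL below is just an example):

#!/bin/bash
# Use DATABASE_URL if it was exported, otherwise a local default
DB_URL="${DATABASE_URL:-postgresql://localhost:5432/dev_db}"
echo "Connecting to: $DB_URL"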

Real-World Scenarios

Scenario 1: Processing Large CSV Files

Problem: You have a 10GB CSV file that's too large for your laptop's memory.

Instead of loading the entire file, process it in chunks or use streaming commands.

Solution:

# Step 1: Examine the file without loading it
user@hostname:~$ ls -lh large_dataset.csv
# Output: -rw-r--r-- 1 user user 10G Jan 23 10:00 large_dataset.csv

# Step 2: Count lines
user@hostname:~$ wc -l large_dataset.csv
# Output: 50000000 large_dataset.csv

# Step 3: View structure (first 10 lines)
user@hostname:~$ head -10 large_dataset.csv

# Step 4: Split into manageable chunks (1M lines each)
user@hostname:~$ split -l 1000000 large_dataset.csv chunk_ --additional-suffix=.csv

# Step 5: Verify chunks
user@hostname:~$ ls -lh chunk_*.csv

# Step 6: Process each chunk
for file in chunk_*.csv; do
    echo "Processing $file..."
    python process_chunk.py "$file" "processed_${file}"

    if [ $? -eq 0 ]; then
        echo "✓ Completed $file"
    else
        echo "✗ Failed $file"
        exit 1
    fi
done

# Step 7: Combine processed results
user@hostname:~$ cat processed_chunk_*.csv > final_processed.csv

# Step 8: Verify output
user@hostname:~$ wc -l final_processed.csv
user@hostname:~$ du -h final_processed.csv

# Step 9: Clean up intermediate files
user@hostname:~$ rm chunk_*.csv processed_chunk_*.csv

echo "Processing complete!"

Alternative: Stream Processing (no temp files)

# Filter directly without loading entire file
user@hostname:~$ awk -F',' 'NR > 1 && $3 > 100 {print $1","$2","$3}' large_dataset.csv > filtered.csv

# Transform delimiter
user@hostname:~$ sed 's/,/|/g' large_dataset.csv > pipe_delimited.csv

# Count specific values
user@hostname:~$ cut -d',' -f2 large_dataset.csv | sort | uniq -c | sort -rn

Scenario 2: Troubleshooting Failed Pipeline at 3 AM

Problem: You get an alert that the nightly ETL pipeline failed. You need to diagnose and fix it quickly.

Solution:

# Step 1: SSH into production server
user@hostname:~$ ssh production-server

# Step 2: Check if the pipeline process is still running
user@hostname:~$ ps aux | grep etl_pipeline
# If running, get the PID

# Step 3: Check the logs (most recent entries)
user@hostname:~$ tail -100 /var/log/etl/pipeline.log

# Step 4: Search for errors
user@hostname:~$ grep -i "error\|exception\|failed" /var/log/etl/pipeline.log | tail -20

# Found: "ERROR: Database connection timeout at 02:15:33"

# Step 5: Check database connectivity
user@hostname:~$ nc -zv database-server 5432
# Output: Connection refused

# Step 6: SSH to database server
user@hostname:~$ ssh database-server

# Step 7: Check if PostgreSQL is running
user@hostname:~$ sudo systemctl status postgresql
# Output: inactive (dead)

# Step 8: Check why it stopped (system logs)
user@hostname:~$ sudo journalctl -u postgresql --since "02:00:00" --until "02:30:00"

# Step 9: Restart PostgreSQL
user@hostname:~$ sudo systemctl restart postgresql

# Step 10: Verify it's running
user@hostname:~$ sudo systemctl status postgresql
# Output: active (running)

# Step 11: Test connection from application server
user@hostname:~$ ssh production-server
user@hostname:~$ psql -h database-server -U etl_user -d warehouse -c "SELECT 1;"
# Output: Success!

# Step 12: Rerun the failed pipeline
user@hostname:~$ cd /opt/pipelines
user@hostname:~$ ./etl_pipeline.sh --date 2024-01-23

# Step 13: Monitor in real-time
user@hostname:~$ tail -f /var/log/etl/pipeline.log

# Step 14: Verify completion
user@hostname:~$ ls -lh /data/output/$(date +%Y%m%d)*

# Step 15: Document the incident
user@hostname:~$ cat >> /var/log/incidents.txt << EOF
Date: $(date)
Issue: ETL pipeline failure - database connection timeout
Root Cause: PostgreSQL service stopped unexpectedly
Resolution: Restarted PostgreSQL service, pipeline rerun successful
Action Items: Set up PostgreSQL monitoring alert
EOF

# Step 16: Set up monitoring to prevent recurrence
user@hostname:~$ cat > ~/monitor_postgres.sh << 'SCRIPT'
#!/bin/bash
if ! systemctl is-active --quiet postgresql; then
    echo "PostgreSQL down! Attempting restart..." | mail -s "DB Alert" ops@company.com
    sudo systemctl restart postgresql
fi
SCRIPT

chmod +x ~/monitor_postgres.sh
# Add to crontab: */5 * * * * ~/monitor_postgres.sh

Scenario 3: Data Quality Validation

Problem: Implement automated quality checks for incoming data files.

Solution:

#!/bin/bash
# data_quality_check.sh

DATA_FILE=$1
REPORT_FILE="quality_report_$(date +%Y%m%d_%H%M%S).txt"

# Validate input
if [ ! -f "$DATA_FILE" ]; then
    echo "Error: File not found: $DATA_FILE"
    exit 1
fi

# Start report
{
    echo "==================================="
    echo "Data Quality Report"
    echo "File: $DATA_FILE"
    echo "Date: $(date)"
    echo "==================================="
    echo

    # Basic stats
    echo "--- File Statistics ---"
    echo "Size: $(du -h "$DATA_FILE" | cut -f1)"
    TOTAL_ROWS=$(wc -l < "$DATA_FILE")
    echo "Total rows: $TOTAL_ROWS"
    echo

    # Check schema
    echo "--- Schema Validation ---"
    EXPECTED_COLS=5
    ACTUAL_COLS=$(head -1 "$DATA_FILE" | tr ',' '\n' | wc -l)

    if [ "$ACTUAL_COLS" -eq "$EXPECTED_COLS" ]; then
        echo "✓ Schema validation PASSED"
        echo "  Expected columns: $EXPECTED_COLS"
        echo "  Actual columns: $ACTUAL_COLS"
    else
        echo "✗ Schema validation FAILED"
        echo "  Expected columns: $EXPECTED_COLS"
        echo "  Actual columns: $ACTUAL_COLS"
    fi
    echo

    # Check for null values
    echo "--- Null Value Check ---"
    NULL_COUNT=$(grep -c "NULL\|^,$\|,,\|,$" "$DATA_FILE")

    if [ "$NULL_COUNT" -eq 0 ]; then
        echo "✓ No null values found"
    else
        echo "⚠ Found $NULL_COUNT potential null values"
        echo "  Percentage: $(awk "BEGIN {printf \"%.2f\", ($NULL_COUNT/$TOTAL_ROWS)*100}")%"
    fi
    echo

    # Check for duplicates
    echo "--- Duplicate Check ---"
    UNIQUE_ROWS=$(sort -u "$DATA_FILE" | wc -l)
    DUPLICATES=$((TOTAL_ROWS - UNIQUE_ROWS))

    if [ "$DUPLICATES" -eq 0 ]; then
        echo "✓ No duplicates found"
    else
        echo "⚠ Found $DUPLICATES duplicate rows"
        echo "  Percentage: $(awk "BEGIN {printf \"%.2f\", ($DUPLICATES/$TOTAL_ROWS)*100}")%"
    fi
    echo

    # Check date format (if applicable)
    echo "--- Date Format Check ---"
    # Assuming first column is date in YYYY-MM-DD format
    INVALID_DATES=$(awk -F',' 'NR>1 && $1 !~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/ {print $1}' "$DATA_FILE" | wc -l)

    if [ "$INVALID_DATES" -eq 0 ]; then
        echo "✓ All dates in correct format"
    else
        echo "⚠ Found $INVALID_DATES invalid date formats"
    fi
    echo

    # Summary
    echo "==================================="
    if [ "$ACTUAL_COLS" -eq "$EXPECTED_COLS" ] && [ "$NULL_COUNT" -eq 0 ] && [ "$DUPLICATES" -eq 0 ]; then
        echo "Status: PASSED - File ready for processing"
    else
        echo "Status: FAILED - Issues found, review required"
    fi
    echo "==================================="

} > "$REPORT_FILE"

# Show the report on screen as well (writing the block straight to the file,
# rather than piping it into tee, keeps the variables below available)
cat "$REPORT_FILE"

echo
echo "Report saved to: $REPORT_FILE"

# Return appropriate exit code
if [ "$ACTUAL_COLS" -ne "$EXPECTED_COLS" ] || [ "$NULL_COUNT" -gt 0 ]; then
    exit 1
fi

Usage:

user@hostname:~$ chmod +x data_quality_check.sh
user@hostname:~$ ./data_quality_check.sh incoming_data.csv
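
Because the script exits non-zero when checks fail, you can gate downstream steps on it (load_to_db.py is a placeholder for whatever comes next in your pipeline):

# Only load the file if the quality checks pass
user@hostname:~$ ./data_quality_check.sh incoming_data.csv && python load_to_db.py incoming_data.csv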

Scenario 4: Automated Backup System

Problem: Create a comprehensive backup system for your data warehouse.

Solution:

#!/bin/bash
# comprehensive_backup.sh

# Configuration
BACKUP_ROOT="/backup"
DB_NAME="data_warehouse"
DB_USER="backup_user"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/backups/backup.log"

# Ensure log directory exists
mkdir -p /var/log/backups

# Logging function
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Error handling
set -e
set -o pipefail   # don't let gzip mask a pg_dump failure
trap 'log "ERROR: Backup failed at line $LINENO"' ERR

# Start backup process
log "=== Starting Backup Process ==="

# Create backup structure
mkdir -p ${BACKUP_ROOT}/{database,files,config}/${DATE}

# 1. Database Backup
log "Backing up database: $DB_NAME"
START_TIME=$(date +%s)

if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > ${BACKUP_ROOT}/database/${DATE}/${DB_NAME}.sql.gz; then
    DB_SIZE=$(du -h ${BACKUP_ROOT}/database/${DATE}/${DB_NAME}.sql.gz | cut -f1)
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    log "✓ Database backup completed (Size: $DB_SIZE, Duration: ${DURATION}s)"
else
    log "✗ Database backup failed"
    exit 1
fi

# 2. Files Backup
log "Backing up data files"
START_TIME=$(date +%s)

if tar -czf ${BACKUP_ROOT}/files/${DATE}/data_files.tar.gz /data/production/ 2>/dev/null; then
    FILES_SIZE=$(du -h ${BACKUP_ROOT}/files/${DATE}/data_files.tar.gz | cut -f1)
    END_TIME=$(date +%s)
    DURATION=$((END_TIME - START_TIME))
    log "✓ Files backup completed (Size: $FILES_SIZE, Duration: ${DURATION}s)"
else
    log "⚠ Files backup completed with warnings"
fi

# 3. Configuration Backup
log "Backing up configurations"
tar -czf ${BACKUP_ROOT}/config/${DATE}/configs.tar.gz /etc/pipeline/ ~/config/ 2>/dev/null
log "✓ Configuration backup completed"

# 4. Create backup manifest
cat > ${BACKUP_ROOT}/manifest_${DATE}.txt << EOF
Backup Manifest
===============
Date: $(date)
Database: $DB_NAME ($DB_SIZE)
Files: /data/production/ ($FILES_SIZE)
Configurations: /etc/pipeline/, ~/config/

Backup Locations:
- Database: ${BACKUP_ROOT}/database/${DATE}/
- Files: ${BACKUP_ROOT}/files/${DATE}/
- Config: ${BACKUP_ROOT}/config/${DATE}/
EOF

log "✓ Manifest created"

# 5. Verify backups
log "Verifying backups..."
for dir in database files config; do
    if [ "$(find ${BACKUP_ROOT}/${dir}/${DATE} -type f | wc -l)" -gt 0 ]; then
        log "✓ $dir backup verified"
    else
        log "✗ $dir backup verification failed"
        exit 1
    fi
done

# 6. Clean old backups
log "Cleaning backups older than $RETENTION_DAYS days"
DELETED_COUNT=0
for dir in database files config; do
    # Count the dated subdirectories older than the retention window, then remove them
    DELETED=$(find ${BACKUP_ROOT}/${dir} -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION_DAYS | wc -l)
    find ${BACKUP_ROOT}/${dir} -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} +
    DELETED_COUNT=$((DELETED_COUNT + DELETED))
done
log "✓ Removed $DELETED_COUNT old backup directories"

# 7. Optional: Upload to cloud
if [ -n "$AWS_BACKUP_BUCKET" ]; then
    log "Uploading to S3: $AWS_BACKUP_BUCKET"
    aws s3 sync ${BACKUP_ROOT} s3://${AWS_BACKUP_BUCKET}/backups/ --exclude "*" --include "*/${DATE}/*"
    log "✓ Cloud upload completed"
fi

# Summary
TOTAL_SIZE=$(du -sch ${BACKUP_ROOT}/*/${DATE} | tail -1 | cut -f1)
BACKUP_COUNT=$(find ${BACKUP_ROOT} -type d -name "20*" | wc -l)

log "=== Backup Summary ==="
log "Total backup size: $TOTAL_SIZE"
log "Total backups retained: $BACKUP_COUNT"
log "Backup completed successfully!"

# Send notification
if command -v mail &> /dev/null; then
    echo "Backup completed at $(date)" | mail -s "Backup Success" admin@company.com
fi

Schedule it with cron:

# Daily at 2 AM
0 2 * * * /home/user/scripts/comprehensive_backup.sh

# Also create restore script
user@hostname:~$ cat > restore_backup.sh << 'EOF'
#!/bin/bash
BACKUP_DATE=$1

if [ -z "$BACKUP_DATE" ]; then
    echo "Usage: $0 <backup_date>"
    echo "Example: $0 20240123_020000"
    exit 1
fi

echo "Restoring from backup: $BACKUP_DATE"
gunzip -c /backup/database/${BACKUP_DATE}/data_warehouse.sql.gz | psql -U admin data_warehouse
tar -xzf /backup/files/${BACKUP_DATE}/data_files.tar.gz -C /
tar -xzf /backup/config/${BACKUP_DATE}/configs.tar.gz -C /

echo "Restore completed!"
EOF

user@hostname:~$ chmod +x restore_backup.sh

Scenario 5: System Monitoring Dashboard

Problem: Create a monitoring script to check system health.

Solution:

#!/bin/bash
# system_monitor.sh

# Configuration
DISK_THRESHOLD=80
MEMORY_THRESHOLD=85
CPU_THRESHOLD=80

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m'

clear
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}   System Monitoring Dashboard${NC}"
echo -e "${BLUE}   $(date '+%Y-%m-%d %H:%M:%S')${NC}"
echo -e "${BLUE}========================================${NC}"
echo

# 1. CPU Usage
echo -e "${BLUE}[CPU Usage]${NC}"
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
CPU_INT=${CPU_USAGE%.*}

if [ "$CPU_INT" -ge "$CPU_THRESHOLD" ]; then
    echo -e "${RED}⚠ HIGH${NC}: ${CPU_USAGE}%"
else
    echo -e "${GREEN}✓ OK${NC}: ${CPU_USAGE}%"
fi
echo

# 2. Memory Usage
echo -e "${BLUE}[Memory Usage]${NC}"
MEM_USAGE=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100}')

if [ "$MEM_USAGE" -ge "$MEMORY_THRESHOLD" ]; then
    echo -e "${RED}⚠ HIGH${NC}: ${MEM_USAGE}%"
else
    echo -e "${GREEN}✓ OK${NC}: ${MEM_USAGE}%"
fi
free -h | grep Mem
echo

# 3. Disk Usage
echo -e "${BLUE}[Disk Usage]${NC}"
df -h | grep '^/dev/' | while read line; do
    usage=$(echo $line | awk '{print $5}' | sed 's/%//')
    partition=$(echo $line | awk '{print $6}')

    if [ "$usage" -ge "$DISK_THRESHOLD" ]; then
        echo -e "${RED}$partition${NC}: ${usage}%"
    else
        echo -e "${GREEN}$partition${NC}: ${usage}%"
    fi
done
echo

# 4. Running Pipelines
echo -e "${BLUE}[Data Pipelines]${NC}"
for pipeline in etl_pipeline data_sync ml_training; do
    if pgrep -f "$pipeline" > /dev/null; then
        echo -e "${GREEN}● RUNNING${NC}: $pipeline"
    else
        echo -e "${YELLOW}○ STOPPED${NC}: $pipeline"
    fi
done
echo

# 5. Recent Errors
echo -e "${BLUE}[Recent Errors (Last Hour)]${NC}"
ERROR_COUNT=0
for log in /var/log/pipelines/*.log; do
    if [ -f "$log" ]; then
        recent_errors=$(find "$log" -mmin -60 -exec grep -c "ERROR" {} \; 2>/dev/null)
        recent_errors=${recent_errors:-0}
        if [ "$recent_errors" -gt 0 ]; then
            echo -e "${RED}$recent_errors errors${NC} in $(basename $log)"
            ((ERROR_COUNT += recent_errors))
        fi
    fi
done

if [ "$ERROR_COUNT" -eq 0 ]; then
    echo -e "${GREEN}✓ No recent errors${NC}"
fi
echo

# 6. Database Connectivity
echo -e "${BLUE}[Database Status]${NC}"
if psql -h localhost -U monitor -d warehouse -c "SELECT 1;" &>/dev/null; then
    echo -e "${GREEN}✓ Connected${NC}: PostgreSQL"
else
    echo -e "${RED}✗ Failed${NC}: PostgreSQL connection"
fi
echo

# 7. Network
echo -e "${BLUE}[Network]${NC}"
echo "Active connections: $(netstat -an | grep ESTABLISHED | wc -l)"
echo

echo -e "${BLUE}========================================${NC}"

Make it auto-refresh:

watch -n 5 -c ./system_monitor.sh

Best Practices

Productivity Tips

# Use Tab for auto-completion
cd /var/lo[Tab]  # Completes to /var/log/

# Search command history
Ctrl+R  # Then type to search

# Repeat last command
!!

# Useful aliases (add to ~/.bashrc)
alias ll='ls -lah'
alias ..='cd ..'
alias gs='git status'
alias pipes='cd ~/projects/pipelines'

Safety First

# Always backup before editing
cp important.conf important.conf.backup

# Use -i for confirmations
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Test scripts before running
bash -n script.sh  # Check syntax
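
If it's available on your system (install it with your package manager), shellcheck catches many scripting mistakes that a plain syntax check misses:

# Static analysis for shell scripts
user@hostname:~$ shellcheck daily_backup.sh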

Secure Your System

# Proper file permissions
chmod 600 config.ini          # Config files
chmod 700 script.sh           # Scripts
chmod 755 /shared/directory   # Shared directories

# Never store passwords in scripts
# Use environment variables or secrets managers
export DB_PASS=$(cat ~/.secrets/db_pass)
chmod 600 ~/.secrets/db_pass

Performance Optimization

# Process compressed files directly
zgrep "error" logfile.gz

# Use parallel processing
find . -name "*.csv" | parallel python process.py {}

# Monitor resources before heavy operations
free -h && df -h

Learning Resources

Books:

  • "The Linux Command Line" by William Shotts (free online)
  • "Linux Pocket Guide" by Daniel J. Barrett

Practice Platforms:

  • DigitalOcean ($5/month for Linux server)
  • AWS Free Tier
  • VirtualBox (free local VMs)

Quick Reference

Command Cheat Sheet

# Navigation
pwd, cd, ls, tree

# Files
cp, mv, rm, mkdir, touch, cat, less, head, tail

# Search
find, grep, locate

# System
df -h, free -h, top, htop, ps, kill

# Network
ssh, scp, rsync, wget, curl

# Compression
tar, gzip, gunzip, zip, unzip

# Text Processing
awk, sed, cut, sort, uniq, wc

Nano Reference

Ctrl+O - Save       Ctrl+K - Cut
Ctrl+X - Exit       Ctrl+U - Paste
Ctrl+W - Search     Ctrl+\ - Replace

Vi Reference

i - Insert          dd - Delete line
Esc - Normal mode   yy - Copy line
:w - Save           p - Paste
:q - Quit           u - Undo
:wq - Save & quit   /text - Search

Conclusion

You now have the foundation to use Linux effectively as a data engineer. Remember:

  • Practice daily - Use Linux as your primary development environment
  • Start small - Master basics before advanced topics
  • Build projects - Real work solidifies learning
  • Ask questions - Join communities and forums
  • Stay curious - Linux is a journey, not a destination

Every expert was once a beginner. The command line will soon feel like second nature. Welcome to the world of Linux! 🐧


Found this helpful? Share with fellow data engineers!

Contributing

Have suggestions or found errors? Feel free to:

  • Open an issue
  • Share your feedback

Author: [Oduor Macphalen Lowell]

Last Updated: January 2026

Repository: [https://github.com/mac0duor0fficial-1028]
