Including Practical Use of Vi and Nano with Examples
Contents Guide
- Introduction
- Why Linux Matters for Data Engineers
- Getting Started with Linux
- Essential Linux Commands
- Working with Files and Directories
- Text Editing with Vi and Nano
- Advanced Concepts for Data Engineers
- Real-World Data Engineering Scenarios
- Best Practices and Tips
- Conclusion
Introduction
Welcome to the world of Linux! If you're a data engineer or aspiring to become one, understanding Linux is like learning to drive before becoming a race car driver. It's not just helpful—it's essential.
In this guide, we'll walk through everything you need to know about Linux, from the very basics to advanced concepts that will make your daily work as a data engineer smooth and efficient. Don't worry if you've never used a command line before; we'll start from the ground up and build your skills step by step.
Think of Linux as the operating system that powers most of the data infrastructure you'll work with. Whether you're processing petabytes of data on cloud servers, managing databases, or building data pipelines, Linux will be your constant companion.
Why Linux Matters for Data Engineers
The Foundation of Modern Data Infrastructure
Imagine you're building a house. You could use any materials, but there's a reason why professionals choose specific tools and materials—they're reliable, efficient, and proven. Linux is that foundation for data engineering.
Here's why Linux is indispensable:
1. It Powers the Cloud
Most cloud computing platforms (AWS, Google Cloud, Azure) run Linux servers. When you spin up a data processing cluster or deploy a machine learning model, chances are you're working with Linux. Understanding Linux means you can directly interact with these systems rather than relying on simplified interfaces that might limit your capabilities.
2. Performance and Efficiency
Linux is lightweight and efficient. When you're processing millions of records, every system resource counts. Linux doesn't waste computing power on unnecessary background processes or visual effects. That efficiency translates to faster data processing and lower costs.
3. Automation and Scripting
Data engineering is all about automation. You don't want to manually process data every day—you want scripts that run automatically. Linux provides powerful scripting capabilities through bash, Python, and other tools that integrate seamlessly with the operating system.
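To make that concrete, here's a minimal sketch of the kind of automation this enables (the script path and schedule below are placeholders, not a prescription):
# Run a cleanup script and append its output to a log (hypothetical paths)
bash /home/dataengineer/scripts/cleanup_tmp.sh >> /var/log/cleanup.log 2>&1
# The same command scheduled nightly at 3 AM via cron (covered later in this guide):
# 0 3 * * * bash /home/dataengineer/scripts/cleanup_tmp.sh >> /var/log/cleanup.log 2>&1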
4. Open Source Ecosystem
Most modern data tools are built for Linux first. Apache Spark, Hadoop, Kafka, Airflow, and countless other data engineering tools are designed to run optimally on Linux. By mastering Linux, you're unlocking the full potential of these tools.
5. Remote Server Management
Data engineers rarely work on local machines. You'll be connecting to remote servers to deploy pipelines, troubleshoot issues, and monitor systems. Linux's command-line interface is perfect for remote work—you can do everything through a terminal connection (SSH), even with limited bandwidth.
6. Cost-Effectiveness
Linux is free and open source. For companies processing data at scale, this means significant cost savings. Instead of paying for expensive operating system licenses for hundreds or thousands of servers, organizations use Linux and invest those savings into better hardware or more engineers.
Real-World Example
Let's say you're building a data pipeline that ingests customer transaction data, processes it, and loads it into a data warehouse. Here's what a typical workflow looks like:
- You write your data processing code in Python or Scala
- You deploy it to a Linux server or cluster (like AWS EMR)
- You schedule it using cron (a Linux job scheduler)
- You monitor logs using Linux command-line tools
- When something breaks at 2 AM, you SSH into the Linux server to troubleshoot
- You use Linux text editors to quickly fix configuration files
Every step of this process relies on Linux knowledge. Without it, you'd be stuck waiting for someone else to help or struggling with unfamiliar interfaces.
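As a rough sketch, that workflow might look like this from the terminal (the host name, paths, and schedule are illustrative only):
# Copy the pipeline code to the server (hypothetical host and paths)
scp etl_job.py dataengineer@emr-node:/opt/pipelines/
# Cron entry: run nightly at 1 AM and capture all output
# 0 1 * * * python3 /opt/pipelines/etl_job.py >> /var/log/etl_job.log 2>&1
# When the 2 AM alert fires, connect and inspect the logs
ssh dataengineer@emr-node
tail -n 100 /var/log/etl_job.log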
Getting Started with Linux
Understanding the Terminal
The terminal (also called command line or shell) is your primary interface with Linux. Unlike Windows or macOS where you click buttons and icons, in Linux you type commands. This might seem old-fashioned, but it's actually more powerful.
Think of it this way: clicking is like ordering from a fixed menu at a restaurant. The command line is like telling a chef exactly what you want and how you want it prepared. You have complete control.
When you open a terminal, you'll see something like this:
user@hostname:~$
Let's break this down:
- user: Your username
- hostname: The name of the computer you're on
- ~: The tilde represents your home directory (more on this later)
- $: The prompt symbol (it's # if you're the root/admin user)
Your First Commands
Let's start with the absolute basics. Type these commands and press Enter after each one:
# See where you are
pwd
Output:
/home/username
The pwd command stands for "print working directory." It tells you your current location in the file system. Think of it as checking your GPS location.
# List what's in the current directory
ls
Output:
Documents Downloads Pictures Music Videos
The ls command lists files and directories. It's like looking at the contents of a folder in a graphical file manager.
# See who you are
whoami
Output:
username
This simply tells you which user account you're currently using.
Essential Linux Commands
Let's explore the commands you'll use daily as a data engineer. I'll explain each one with practical examples you might encounter in real work.
Navigation Commands
Moving Around the File System
# Go to your home directory
cd ~
# Go to a specific directory
cd /var/log
# Go back to the previous directory
cd -
# Go up one level
cd ..
# Go up two levels
cd ../..
Example Scenario: You're troubleshooting a data pipeline that failed. The logs are in /var/log/airflow/. Here's how you'd navigate:
# Start from your home directory
pwd
# Output: /home/dataengineer
# Navigate to the logs
cd /var/log/airflow/
# Check where you are
pwd
# Output: /var/log/airflow
# List the log files
ls
# Output: scheduler.log webserver.log worker.log
File and Directory Operations
Creating Directories
# Create a single directory
mkdir data_projects
# Create nested directories (with -p flag)
mkdir -p projects/etl_pipeline/scripts
# Create multiple directories at once
mkdir raw_data processed_data output_data
Example: Setting up a new data project structure:
# Create a complete project structure
mkdir -p ~/data_projects/customer_analytics/{raw,processed,scripts,output}
# Navigate to it
cd ~/data_projects/customer_analytics
# Verify the structure
ls -R
Output:
.:
output  processed  raw  scripts

./output:

./processed:

./raw:

./scripts:
Creating and Viewing Files
# Create an empty file
touch data.csv
# Create multiple files
touch file1.txt file2.txt file3.txt
# View file contents
cat data.csv
# View large files page by page
less large_logfile.log
# View first 10 lines
head data.csv
# View last 10 lines
tail data.csv
# View last 20 lines
tail -n 20 data.csv
# Follow a log file in real-time (useful for monitoring)
tail -f /var/log/application.log
Real-World Example: Monitoring a data pipeline log:
# Watch the pipeline log as it runs
tail -f /var/log/data_pipeline/etl_job.log
This command will show new log entries as they're written, perfect for watching your data pipeline in action.
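If the log is noisy, you can filter the live stream as well; a small variation on the same idea:
# Show only error lines as they arrive (--line-buffered keeps grep from delaying output)
tail -f /var/log/data_pipeline/etl_job.log | grep --line-buffered -i "error"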
Copying and Moving Files
# Copy a file
cp source.csv destination.csv
# Copy a directory and its contents
cp -r source_folder destination_folder
# Move/rename a file
mv old_name.csv new_name.csv
# Move a file to a different directory
mv data.csv /home/user/projects/
# Move multiple files
mv file1.txt file2.txt file3.txt /destination/
Example: Organizing raw data files:
# You just received new data files
ls
# Output: transactions_2024.csv customers.csv products.csv
# Create organized structure
mkdir -p raw_data/{transactions,customers,products}
# Move files to appropriate directories
mv transactions_2024.csv raw_data/transactions/
mv customers.csv raw_data/customers/
mv products.csv raw_data/products/
# Verify
ls raw_data/transactions/
# Output: transactions_2024.csv
Deleting Files and Directories
# Delete a file (be careful!)
rm file.txt
# Delete multiple files
rm file1.txt file2.txt file3.txt
# Delete with confirmation prompt
rm -i important_file.csv
# Delete a directory and its contents
rm -r directory_name
# Force delete without confirmation (DANGEROUS!)
rm -rf directory_name
Warning: The rm command is permanent. Unlike Windows or Mac, there's no Recycle Bin. Deleted files are gone forever. Always double-check before using rm -rf.
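One cautious habit, if it suits your workflow, is to move files into a scratch "trash" directory instead of deleting them right away. A minimal sketch (the ~/.trash path is just a convention, not a standard location):
# Stage files for deletion instead of removing them outright
mkdir -p ~/.trash
mv old_export.csv ~/.trash/
# Review, then empty the trash once you're sure
ls ~/.trash
rm -rf ~/.trash/*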
Searching and Finding
Finding Files
# Find files by name
find /home/user -name "*.csv"
# Find files modified in the last 7 days
find /data -type f -mtime -7
# Find large files (bigger than 100MB)
find /data -type f -size +100M
# Find and delete old log files
find /var/log -name "*.log" -mtime +30 -delete
Searching Inside Files
# Search for text in a file
grep "error" logfile.log
# Search case-insensitively
grep -i "Error" logfile.log
# Search recursively in directories
grep -r "TODO" /home/user/projects/
# Count occurrences
grep -c "success" pipeline.log
# Show line numbers
grep -n "failed" etl.log
# Search with context (2 lines before and after)
grep -C 2 "exception" app.log
Real-World Example: Finding errors in a data pipeline:
# Search for all errors in today's logs
grep -i "error" /var/log/airflow/scheduler.log | grep "2024-01-23"
# Count how many times a specific error occurred
grep "ConnectionTimeout" pipeline.log | wc -l
# Find which data files contain null values
grep -l "NULL" *.csv
Output:
customers.csv
transactions.csv
File Permissions and Ownership
Linux has a robust permission system. Every file has an owner and permissions that control who can read, write, or execute it.
Understanding Permissions
When you run ls -l, you see something like:
ls -l script.py
Output:
-rwxr-xr-x 1 dataengineer users 2048 Jan 23 10:30 script.py
Let's decode this:
- -rwxr-xr-x: Permissions
  - First character: - (file) or d (directory)
  - Next three: rwx (owner can read, write, execute)
  - Next three: r-x (group can read and execute)
  - Last three: r-x (others can read and execute)
- 1: Number of hard links
- dataengineer: Owner
- users: Group
- 2048: File size in bytes
- Jan 23 10:30: Last modified date and time
- script.py: Filename
Changing Permissions
# Make a script executable
chmod +x script.sh
# Give full permissions to owner, read to others
chmod 744 data_script.py
# Remove write permission for everyone
chmod a-w important_config.txt
# Recursively change permissions
chmod -R 755 /data/scripts/
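If the numeric form looks cryptic: each digit is the sum of read (4), write (2), and execute (1), applied to owner, group, and others in that order. Two illustrative examples (the filenames are hypothetical):
# 7 = 4+2+1 = rwx, 5 = 4+1 = r-x, 4 = r--, 0 = ---
chmod 755 run_pipeline.sh     # owner: rwx, group: r-x, others: r-x
chmod 640 db_credentials.conf # owner: rw-, group: r--, others: ---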
Example: Preparing a Python ETL script for execution:
# Check current permissions
ls -l etl_pipeline.py
# Output: -rw-r--r-- 1 user users 5242 Jan 23 10:30 etl_pipeline.py
# Make it executable
chmod +x etl_pipeline.py
# Verify
ls -l etl_pipeline.py
# Output: -rwxr-xr-x 1 user users 5242 Jan 23 10:30 etl_pipeline.py
# Now you can run it directly
./etl_pipeline.py
Process Management
As a data engineer, you'll often run long-running processes and need to manage them.
# See running processes
ps aux
# See processes in a tree structure
ps auxf
# Filter processes
ps aux | grep python
# Interactive process viewer
top
# Better interactive viewer
htop
# Kill a process
kill 1234 # where 1234 is the process ID
# Force kill a stubborn process
kill -9 1234
# Kill by process name
pkill python
# See what's using a port
lsof -i :8080
Example: Managing a stuck data processing job:
# Find the process
ps aux | grep "spark-submit"
# Output:
# user 12345 15.2 8.5 4567890 123456 ? R 10:30 45:23 spark-submit data_job.py
# Kill it gracefully
kill 12345
# Wait a few seconds, then verify it's gone
ps aux | grep 12345
# If it's still running, force kill
kill -9 12345
System Information
# Disk usage
df -h
# Directory size
du -sh /data/warehouse
# Memory usage
free -h
# CPU information
lscpu
# System uptime
uptime
# Check OS version
cat /etc/os-release
# Kernel version
uname -r
Example Output:
df -h
Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 45G 50G 48% /
/dev/sdb1 1.0T 750G 250G 75% /data
This shows you have 250GB of free space on your data drive—important when processing large datasets!
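You can also turn that into a quick alert; a minimal sketch, assuming /data is the mount you care about and 80% is your threshold:
# Print a warning when /data usage exceeds 80%
df -h /data | awk 'NR==2 {gsub("%","",$5); if ($5+0 > 80) print "WARNING: /data is " $5 "% full"}'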
Working with Files and Directories
The Linux File System Structure
Linux organizes everything as files in a hierarchical tree structure. Understanding this is crucial for navigating and managing your data projects.
/ (root - top of the hierarchy)
├── home/ (user home directories)
│ └── dataengineer/ (your home directory)
│ ├── projects/
│ ├── data/
│ └── scripts/
├── var/ (variable data)
│ ├── log/ (log files)
│ └── lib/ (state information)
├── etc/ (configuration files)
├── usr/ (user programs)
│ ├── bin/ (executable programs)
│ └── local/ (locally installed software)
├── tmp/ (temporary files)
└── opt/ (optional software)
Important Paths for Data Engineers:
- /home/username: Your personal space
- /data: Often where large datasets are stored
- /var/log: System and application logs
- /etc: Configuration files for applications
- /tmp: Temporary files (often cleared on reboot)
Working with Paths
Absolute vs Relative Paths
# Absolute path (starts with /)
cd /home/dataengineer/projects/etl
# Relative path (from current location)
cd projects/etl
# Current directory
.
# Parent directory
..
# Home directory
~
Example Navigation:
# You're here
pwd
# Output: /home/dataengineer
# Using relative path
cd projects/etl
# Using absolute path
cd /home/dataengineer/projects/etl
# Go up two levels
cd ../..
# Where are we now?
pwd
# Output: /home/dataengineer
Wildcards and Pattern Matching
Wildcards help you work with multiple files at once.
# List all CSV files
ls *.csv
# List files starting with "data"
ls data*
# List files with exactly 5 characters
ls ?????
# List all Python files in immediate subdirectories
ls */*.py
# Copy all CSV files to another directory
cp *.csv /backup/
# Delete all log files
rm *.log
Advanced Pattern Matching:
# List files matching a pattern
ls transactions_[0-9][0-9][0-9][0-9].csv
# Matches: transactions_2023.csv, transactions_2024.csv
# List files NOT matching a pattern (requires extglob: shopt -s extglob)
ls !(*.tmp)
# Multiple extensions
ls *.{csv,txt,json}
Piping and Redirection
One of Linux's most powerful features is the ability to chain commands together.
Output Redirection:
# Save command output to a file (overwrites)
ls -l > file_list.txt
# Append to a file
echo "New log entry" >> application.log
# Redirect errors to a file
python script.py 2> errors.log
# Redirect both output and errors
python script.py > output.log 2>&1
Piping Commands:
# Chain commands with pipe (|)
cat large_file.csv | grep "error" | wc -l
# Sort and get unique values
cat data.txt | sort | uniq
# List the largest files by size (note: the first line of ls -l output is a "total" summary)
ls -lS | head -10
# Count how many processes are running
ps aux | wc -l
Real-World Example: Analyzing log files:
# Find the most common errors in a log file
cat application.log | grep "ERROR" | sort | uniq -c | sort -rn | head -10
This command:
- Reads the log file
- Filters for lines containing "ERROR"
- Sorts them
- Counts unique occurrences
- Sorts by count (descending)
- Shows top 10
Output:
145 ERROR: Database connection timeout
89 ERROR: Invalid data format
45 ERROR: Memory allocation failed
23 ERROR: Network unreachable
...
Text Processing Commands
cut, awk, and sed
These are powerful text processing tools essential for data engineers.
# Extract specific columns from CSV
cut -d',' -f1,3 data.csv
# Example: Extract first and third columns
# Input: name,age,city,salary
# John,30,NYC,75000
# Output: name,city
# John,NYC
# Using awk for more complex operations
awk -F',' '{print $1, $3}' data.csv
# Sum values in a column
awk -F',' '{sum += $4} END {print sum}' data.csv
# Search and replace with sed
sed 's/old_text/new_text/g' file.txt
# Delete lines matching a pattern
sed '/pattern/d' file.txt
Real-World Example: Processing a CSV file:
# You have a transactions CSV: date,product,quantity,price
# Calculate total revenue
awk -F',' 'NR>1 {revenue += $3 * $4} END {print "Total Revenue: $" revenue}' transactions.csv
Output:
Total Revenue: $125430.50
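Along the same lines, awk's associative arrays let you group while you sum; a short sketch assuming the same column layout:
# Revenue per product (column 2 = product, 3 = quantity, 4 = price)
awk -F',' 'NR>1 {rev[$2] += $3 * $4} END {for (p in rev) printf "%s: $%.2f\n", p, rev[p]}' transactions.csv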
Compression and Archives
Working with compressed files is daily business for data engineers.
# Create a tar archive
tar -cvf backup.tar /data/files
# Create a compressed tar.gz archive
tar -czvf backup.tar.gz /data/files
# Extract a tar.gz archive
tar -xzvf backup.tar.gz
# Compress a file with gzip
gzip large_file.csv
# Creates: large_file.csv.gz
# Decompress
gunzip large_file.csv.gz
# Compress with bzip2 (better compression)
bzip2 large_file.csv
# View compressed file without extracting
zcat file.gz
# View compressed log file
zcat logs.gz | grep "error"
Example: Backing up project data:
# Create a dated backup
tar -czvf backup_$(date +%Y%m%d).tar.gz ~/projects/data_pipeline/
# Verify contents without extracting
tar -tzvf backup_20240123.tar.gz
# Extract to a specific directory
tar -xzvf backup_20240123.tar.gz -C /restore/location/
Text Editing with Vi and Nano
As a data engineer, you'll constantly edit configuration files, scripts, and data files on remote servers. Mastering terminal-based text editors is essential. We'll cover two editors: Nano (beginner-friendly) and Vi (powerful and ubiquitous).
Why Learn Terminal Text Editors?
When you SSH into a remote server, you won't have access to graphical editors like VS Code or Sublime Text. You need to edit files directly in the terminal. Additionally, terminal editors are:
- Fast and lightweight
- Available on every Linux system
- Efficient for quick edits
- Scriptable and automatable
Nano: The Beginner-Friendly Editor
Nano is straightforward and displays its commands at the bottom of the screen, making it perfect for beginners.
Starting Nano
# Open a new file
nano myfile.txt
# Open an existing file
nano /etc/config/app.conf
# Open file at a specific line number
nano +25 script.py
Basic Nano Operations
When you open Nano, you'll see something like this:
GNU nano 5.4 myfile.txt Modified
This is my text file.
I can start typing immediately.
No special mode needed!
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify
^X Exit ^R Read File ^\ Replace ^U Uncut Text ^T To Spell
The ^ symbol means Ctrl. So ^X means "Ctrl+X".
Essential Nano Commands:
# While in Nano:
Ctrl+O # Save file (WriteOut)
Ctrl+X # Exit
Ctrl+K # Cut current line
Ctrl+U # Paste (UnCut)
Ctrl+W # Search
Ctrl+\ # Search and replace
Ctrl+G # Show help
Ctrl+C # Show current line number
Alt+U # Undo
Alt+E # Redo
Ctrl+A # Go to beginning of line
Ctrl+E # Go to end of line
Practical Nano Example 1: Creating a Python Script
Let's create a simple data processing script:
# Open a new Python file
nano data_processor.py
Now type:
#!/usr/bin/env python3
# Data processor script
import pandas as pd
def process_data(input_file):
# Read CSV file
df = pd.read_csv(input_file)
# Basic cleaning
df = df.dropna()
# Save processed data
df.to_csv('processed_data.csv', index=False)
print(f"Processed {len(df)} records")
if __name__ == "__main__":
process_data('raw_data.csv')
Steps:
- Type or paste the code
- Press Ctrl+O to save
- Press Enter to confirm the filename
- Press Ctrl+X to exit
Making the script executable:
chmod +x data_processor.py
./data_processor.py
Practical Nano Example 2: Editing a Configuration File
Let's edit a database configuration file:
nano ~/config/database.conf
Original content:
[database]
host=localhost
port=5432
dbname=olddb
user=admin
password=oldpass
To search and replace:
- Press Ctrl+\ (search and replace)
- Enter the search term: olddb
- Press Enter
- Enter the replacement: newdb
- Press Enter
- Press A to replace all occurrences
- Press Ctrl+O to save
- Press Ctrl+X to exit
Nano Tips for Data Engineers
# Open multiple files in tabs
nano file1.txt file2.txt
# Use Alt+< and Alt+> to switch between files
# Enable line numbers
nano -l script.py
# Create a backup automatically
nano -B important_config.conf
# Creates: important_config.conf~
# Soft wrap long lines
nano -$ data.csv
Vi (Vim): The Power User's Editor
Vi (or its improved version, Vim) is more complex but incredibly powerful. It's on every Unix/Linux system, making it essential to know at least the basics.
Understanding Vi Modes
Vi has different modes, which is what confuses beginners. Think of it like a smartphone: sometimes you're typing a message (insert mode), sometimes you're navigating menus (normal mode).
Three Main Modes:
- Normal Mode (default): For navigation and commands
- Insert Mode: For typing text
- Command Mode: For saving, quitting, searching
Starting Vi
# Open a new file
vi myfile.txt
# Open an existing file
vi /etc/config/app.conf
# Open at a specific line
vi +25 script.py
# Open in read-only mode
view logfile.log
Basic Vi Operations
Entering Insert Mode (where you can type):
# In Normal mode, press:
i # Insert before cursor
a # Insert after cursor
I # Insert at beginning of line
A # Insert at end of line
o # Open new line below
O # Open new line above
# You'll see -- INSERT -- at the bottom when in Insert mode
Returning to Normal Mode:
Esc # Always gets you back to Normal mode
Saving and Quitting:
# In Normal mode, press : to enter Command mode
:w # Save (write)
:q # Quit
:wq # Save and quit
:q! # Quit without saving (force)
:x # Save and quit (shortcut)
ZZ # Save and quit (even shorter)
Navigation in Vi
In Normal mode:
# Basic movement
h # Left
j # Down
k # Up
l # Right
# Word movement
w # Next word
b # Previous word
e # End of word
# Line movement
0 # Beginning of line
^ # First non-blank character
$ # End of line
# Screen movement
gg # Top of file
G # Bottom of file
15G # Go to line 15
Ctrl+f # Page down
Ctrl+b # Page up
Editing in Vi
In Normal mode:
# Deleting
x # Delete character
dd # Delete line
d$ # Delete to end of line
d0 # Delete to beginning of line
5dd # Delete 5 lines
# Copying (yanking) and pasting
yy # Copy line
5yy # Copy 5 lines
p # Paste after cursor
P # Paste before cursor
# Undo and redo
u # Undo
Ctrl+r # Redo
# Change text
cw # Change word
c$ # Change to end of line
cc # Change entire line
Searching in Vi
# In Normal mode
/pattern # Search forward
?pattern # Search backward
n # Next match
N # Previous match
# Search and replace (in Command mode)
:s/old/new/ # Replace first occurrence in line
:s/old/new/g # Replace all in line
:%s/old/new/g # Replace all in file
:%s/old/new/gc # Replace with confirmation
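Vim also accepts a line range in front of the substitute command, which is handy when you only want to touch part of a file:
# Replace only on lines 10 through 20
:10,20s/old/new/g
# Replace from the current line to the end of the file
:.,$s/old/new/g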
Practical Vi Example 1: Quick Configuration Edit
Let's edit an Apache Airflow configuration:
vi ~/airflow/airflow.cfg
Scenario: You need to change the executor from Sequential to Local.
Steps:
- Open the file: vi ~/airflow/airflow.cfg
- Search for executor: /executor then press Enter
- Press n to jump to the next occurrence until you find the right one
- Move to the word you want to change: w (to move forward by word)
- Change the word: cw (change word)
- Type the new value: LocalExecutor
- Press Esc to return to Normal mode
- Save and quit: :wq
Before:
[core]
executor = SequentialExecutor
After:
[core]
executor = LocalExecutor
Practical Vi Example 2: Editing a Data Pipeline Script
vi etl_pipeline.py
Scenario: Add logging to a Python script.
Original script:
def process_data(file_path):
data = read_csv(file_path)
cleaned = clean_data(data)
load_to_db(cleaned)
Steps:
- Open the file: vi etl_pipeline.py
- Go to the function: /def process_data then Enter
- Open a line below: o
- Add logging: type print(f"Processing {file_path}...")
- Press Esc
- Move to the next line: j
- Repeat for the other lines
- Save and quit: :wq
Modified script:
def process_data(file_path):
print(f"Processing {file_path}...")
data = read_csv(file_path)
print(f"Cleaning data...")
cleaned = clean_data(data)
print(f"Loading to database...")
load_to_db(cleaned)
print(f"Complete!")
Practical Vi Example 3: Bulk Editing a CSV File
vi data.csv
Scenario: Your CSV uses the wrong delimiter (semicolons instead of commas).
Before:
name;age;city
John;30;NYC
Jane;25;LA
Bob;35;SF
Steps:
- Open the file: vi data.csv
- Enter Command mode: :
- Replace all semicolons: %s/;/,/g
- Press Enter
- Verify the changes: navigate with j and k
- Save and quit: :wq
After:
name,age,city
John,30,NYC
Jane,25,LA
Bob,35,SF
Advanced Vi Techniques
Visual Mode (for selecting text):
v # Start visual character selection
V # Start visual line selection
Ctrl+v # Start visual block selection
# Then use movement keys, then:
d # Delete selection
y # Copy selection
c # Change selection
Macros (record and replay actions):
qa # Start recording macro in register 'a'
# ... perform actions ...
q # Stop recording
@a # Replay macro
5@a # Replay macro 5 times
Example Macro Use: Add quotes around words in a CSV:
# Start on a word
qa # Start recording
i" # Insert opening quote
Esc # Normal mode
e # End of word
a" # Insert closing quote
Esc # Normal mode
q # Stop recording
# Move to the next word (e.g. with w), then replay
@a # Apply the macro to the word under the cursor
Vi Configuration
Create a .vimrc file in your home directory for custom settings:
vi ~/.vimrc
Add useful settings for data engineering:
" Enable line numbers
set number
" Enable syntax highlighting
syntax on
" Set tab width
set tabstop=4
set shiftwidth=4
set expandtab
" Show matching brackets
set showmatch
" Enable search highlighting
set hlsearch
" Incremental search
set incsearch
" Auto-indent
set autoindent
" Show cursor position
set ruler
" Enable mouse support
set mouse=a
Save this configuration and restart Vi. Your editing experience will be much improved!
Vi vs Nano: When to Use Each
Use Nano when:
- Making quick, simple edits
- You're a beginner
- You want immediate visual feedback
- Working with small configuration files
Use Vi when:
- Editing remotely on minimal systems
- You need power and speed
- Working with large files
- Performing complex text transformations
- You want to invest in long-term efficiency
Personal Recommendation: Learn the basics of both. Use Nano for daily quick edits and gradually incorporate Vi for more complex tasks. Many data engineers use Nano for simple configs and Vi/Vim for coding and complex edits.
Recovering from Vi Panic
Stuck in Vi and don't know how to exit? Here's the "escape ladder":
1. Press Esc (multiple times if needed)
2. Type :q! and press Enter (quit without saving)
If that doesn't work:
3. Press Esc again
4. Type :qa! and press Enter (quit all)
Still stuck?
5. Press Ctrl+Z (suspends Vi)
6. Type: kill %1 (kills the suspended process)
Real-World Text Editing Scenarios
Scenario 1: Emergency Log Analysis
Your data pipeline failed at 3 AM. You SSH into the server:
# Quick look with Nano
nano /var/log/airflow/scheduler.log
# Search for "ERROR": Ctrl+W, type "ERROR", Enter
# Navigate through errors: Ctrl+W repeatedly
# Found the issue, exit: Ctrl+X
Scenario 2: Batch Configuration Update
You need to update database connections across 10 config files:
# Use Vi for efficiency
for file in config/*.conf; do
vi -c '%s/old_db_host/new_db_host/g | wq' "$file"
done
This opens each file, replaces the text, saves, and quits automatically.
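The same batch edit is often done with sed instead, which avoids launching an editor at all; an equivalent sketch (-i.bak keeps a backup copy of each file):
# In-place replacement across all config files, keeping .bak backups
sed -i.bak 's/old_db_host/new_db_host/g' config/*.conf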
Scenario 3: Creating a Cron Job
# Edit crontab
crontab -e
# If prompted, choose your preferred editor (select Nano if unsure)
# Add a line to run your data pipeline daily at 2 AM:
0 2 * * * /home/dataengineer/scripts/daily_etl.sh >> /var/log/etl.log 2>&1
# Save and exit
Scenario 4: Editing a Large JSON Configuration
# Use Vi with syntax highlighting
vi config.json
# Navigate quickly
:250 # Jump to line 250
/timeout # Search for timeout setting
cw5000 # Change the word to 5000
:wq # Save and exit
Text Editing Best Practices
- Always backup important files before editing:
cp important.conf important.conf.backup
vi important.conf
- Use version control for scripts and configs:
git add config.yaml
git commit -m "Updated database connection"
- Test syntax after editing code:
python -m py_compile script.py
bash -n script.sh
- Keep editing sessions focused: Edit one file at a time to avoid confusion.
- Learn keyboard shortcuts: They save immense time over months and years.
Advanced Concepts for Data Engineers
Now that you've mastered the basics, let's dive into advanced concepts that will make you truly proficient in Linux as a data engineer.
Shell Scripting and Automation
Shell scripts allow you to automate repetitive tasks. Instead of typing the same commands every day, you write them once in a script.
Creating Your First Bash Script
# Create a new script
nano daily_backup.sh
Type this:
#!/bin/bash
# Daily backup script for data files
# Set variables
BACKUP_DIR="/backup/data"
SOURCE_DIR="/data/warehouse"
DATE=$(date +%Y%m%d)
BACKUP_FILE="backup_${DATE}.tar.gz"
# Create backup
echo "Starting backup at $(date)"
tar -czf "${BACKUP_DIR}/${BACKUP_FILE}" "${SOURCE_DIR}"
# Check if successful
if [ $? -eq 0 ]; then
echo "Backup completed successfully"
echo "File: ${BACKUP_FILE}"
echo "Size: $(du -h ${BACKUP_DIR}/${BACKUP_FILE} | cut -f1)"
else
echo "Backup failed!"
exit 1
fi
# Delete backups older than 30 days
find "${BACKUP_DIR}" -name "backup_*.tar.gz" -mtime +30 -delete
echo "Old backups cleaned up"
Make it executable and run:
chmod +x daily_backup.sh
./daily_backup.sh
Output:
Starting backup at Wed Jan 23 14:30:00 UTC 2024
Backup completed successfully
File: backup_20240123.tar.gz
Size: 2.3G
Old backups cleaned up
Script Variables and User Input
#!/bin/bash
# Data processing script with user input
# Read user input
echo "Enter the data file to process:"
read DATA_FILE
# Check if file exists
if [ ! -f "$DATA_FILE" ]; then
echo "Error: File $DATA_FILE not found!"
exit 1
fi
# Process the file
echo "Processing $DATA_FILE..."
LINES=$(wc -l < "$DATA_FILE")
echo "Total lines: $LINES"
# Calculate processing time
START=$(date +%s)
# Your processing command here
python process_data.py "$DATA_FILE"
END=$(date +%s)
DURATION=$((END - START))
echo "Processing completed in $DURATION seconds"
Conditional Logic in Scripts
#!/bin/bash
# ETL pipeline with error handling
LOG_FILE="/var/log/etl_pipeline.log"
ERROR_COUNT=0
# Function to log messages
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Extract data
log_message "Starting data extraction..."
if python extract_data.py; then
log_message "Extraction successful"
else
log_message "ERROR: Extraction failed"
((ERROR_COUNT++))
fi
# Transform data
log_message "Starting data transformation..."
if python transform_data.py; then
log_message "Transformation successful"
else
log_message "ERROR: Transformation failed"
((ERROR_COUNT++))
fi
# Load data
log_message "Starting data loading..."
if python load_data.py; then
log_message "Loading successful"
else
log_message "ERROR: Loading failed"
((ERROR_COUNT++))
fi
# Summary
if [ $ERROR_COUNT -eq 0 ]; then
log_message "Pipeline completed successfully"
exit 0
else
log_message "Pipeline completed with $ERROR_COUNT errors"
exit 1
fi
Loops in Shell Scripts
#!/bin/bash
# Process multiple CSV files
DATA_DIR="/data/raw"
OUTPUT_DIR="/data/processed"
# Loop through all CSV files
for file in "${DATA_DIR}"/*.csv; do
filename=$(basename "$file" .csv)
echo "Processing $filename..."
# Run Python processing script
python process_csv.py "$file" "${OUTPUT_DIR}/${filename}_processed.csv"
if [ $? -eq 0 ]; then
echo "✓ $filename completed"
else
echo "✗ $filename failed"
fi
done
echo "All files processed"
Functions in Shell Scripts
#!/bin/bash
# Modular data pipeline with functions
# Function to check if a file exists
check_file() {
if [ ! -f "$1" ]; then
echo "Error: File $1 not found"
return 1
fi
return 0
}
# Function to validate CSV format
validate_csv() {
local file=$1
local expected_columns=$2
actual_columns=$(head -1 "$file" | tr ',' '\n' | wc -l)
if [ "$actual_columns" -eq "$expected_columns" ]; then
echo "✓ CSV validation passed"
return 0
else
echo "✗ Expected $expected_columns columns, found $actual_columns"
return 1
fi
}
# Function to process data
process_data() {
local input=$1
local output=$2
echo "Processing: $input -> $output"
python data_processor.py "$input" "$output"
return $?
}
# Main execution
MAIN() {
INPUT_FILE="raw_data.csv"
OUTPUT_FILE="processed_data.csv"
# Check input file
check_file "$INPUT_FILE" || exit 1
# Validate format
validate_csv "$INPUT_FILE" 5 || exit 1
# Process data
process_data "$INPUT_FILE" "$OUTPUT_FILE" || exit 1
echo "Pipeline completed successfully"
}
# Run main function
MAIN
Environment Variables
Environment variables store system-wide or user-specific configuration.
# View all environment variables
env
# View specific variable
echo $HOME
echo $PATH
echo $USER
# Set a temporary variable (current session only)
export DATABASE_URL="postgresql://localhost:5432/mydb"
# Set permanently (add to ~/.bashrc)
echo 'export DATABASE_URL="postgresql://localhost:5432/mydb"' >> ~/.bashrc
source ~/.bashrc
# Use in scripts
#!/bin/bash
echo "Connecting to: $DATABASE_URL"
python connect_db.py
Common Environment Variables for Data Engineers
# Python path
export PYTHONPATH="/home/user/python_modules:$PYTHONPATH"
# Java home (for Spark)
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
# Spark configuration
export SPARK_HOME="/opt/spark"
export PATH="$SPARK_HOME/bin:$PATH"
# AWS credentials
export AWS_ACCESS_KEY_ID="your_key"
export AWS_SECRET_ACCESS_KEY="your_secret"
export AWS_DEFAULT_REGION="us-east-1"
# Airflow home
export AIRFLOW_HOME="/home/user/airflow"
Scheduling with Cron
Cron is the Linux job scheduler. It's essential for automating data pipelines.
Cron Syntax
* * * * * command
│ │ │ │ │
│ │ │ │ └─── Day of week (0-7, 0 and 7 are Sunday)
│ │ │ └───── Month (1-12)
│ │ └─────── Day of month (1-31)
│ └───────── Hour (0-23)
└─────────── Minute (0-59)
Common Cron Patterns
# Edit crontab
crontab -e
# Every day at 2 AM
0 2 * * * /home/user/scripts/daily_etl.sh
# Every hour
0 * * * * /home/user/scripts/hourly_sync.sh
# Every 15 minutes
*/15 * * * * /home/user/scripts/frequent_check.sh
# Monday to Friday at 9 AM
0 9 * * 1-5 /home/user/scripts/weekday_report.sh
# First day of every month at midnight
0 0 1 * * /home/user/scripts/monthly_summary.sh
# Every Sunday at 3 AM
0 3 * * 0 /home/user/scripts/weekly_backup.sh
Real-World Cron Examples
# Data pipeline that runs every 6 hours
0 */6 * * * cd /home/dataengineer/pipelines && python etl_pipeline.py >> /var/log/etl.log 2>&1
# Database backup every day at 1 AM
0 1 * * * /usr/local/bin/backup_database.sh
# Clean temporary files every day at midnight
0 0 * * * find /tmp/data_processing -type f -mtime +7 -delete
# Send daily report at 8 AM on weekdays
0 8 * * 1-5 python /home/user/scripts/daily_report.py && mail -s "Daily Report" team@company.com < report.txt
# Monitor disk space every 30 minutes
*/30 * * * * df -h | grep -E '9[0-9]%|100%' && echo "Disk space warning" | mail -s "Alert" admin@company.com
Cron Best Practices
# Always use absolute paths in cron jobs
# BAD: python script.py
# GOOD: /usr/bin/python3 /home/user/scripts/script.py
# Redirect output to log files
0 2 * * * /home/user/script.sh >> /var/log/script.log 2>&1
# Set environment in the script
#!/bin/bash
export PATH=/usr/local/bin:/usr/bin:/bin
export PYTHONPATH=/home/user/modules
# ... rest of script
# Test scripts before scheduling
/home/user/scripts/test_pipeline.sh
# Use meaningful names and comments in crontab
# Daily ETL pipeline for customer data
0 2 * * * /home/user/pipelines/customer_etl.sh
SSH and Remote Server Management
SSH (Secure Shell) lets you connect to remote servers securely.
Basic SSH Usage
# Connect to a server
ssh username@server_ip
# Example
ssh dataengineer@192.168.1.100
# Connect with a specific port
ssh -p 2222 username@server_ip
# Connect and execute a command
ssh user@server "ls -la /data"
# Copy files to remote server
scp local_file.csv user@server:/remote/path/
# Copy from remote server
scp user@server:/remote/file.csv /local/path/
# Copy entire directory
scp -r local_directory user@server:/remote/path/
# Copy with compression (faster for large files)
scp -C large_dataset.csv user@server:/data/
SSH Key Authentication
More secure and convenient than passwords:
# Generate SSH key pair
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
# Copy public key to server
ssh-copy-id user@server
# Now connect without password
ssh user@server
SSH Configuration
Create a config file for easier connections:
nano ~/.ssh/config
Add:
Host production
HostName 10.0.1.50
User dataengineer
Port 22
IdentityFile ~/.ssh/id_rsa
Host staging
HostName 10.0.1.51
User dataengineer
Port 22
IdentityFile ~/.ssh/id_rsa
Host datawarehouse
HostName warehouse.company.com
User admin
Port 2222
IdentityFile ~/.ssh/warehouse_key
Now connect simply:
ssh production
ssh staging
ssh datawarehouse
Remote Command Execution
# Check disk space on remote server
ssh production "df -h"
# Run a data pipeline remotely
ssh production "cd /data/pipelines && python daily_etl.py"
# Monitor logs in real-time
ssh production "tail -f /var/log/application.log"
# Execute multiple commands
ssh production << 'EOF'
cd /data/pipelines
source venv/bin/activate
python etl.py
deactivate
EOF
Working with Databases from Linux
As a data engineer, you'll frequently interact with databases from the command line.
PostgreSQL
# Connect to PostgreSQL
psql -h localhost -U username -d database_name
# Execute a query from command line
psql -h localhost -U user -d mydb -c "SELECT COUNT(*) FROM customers;"
# Execute SQL file
psql -h localhost -U user -d mydb -f query.sql
# Export query results to CSV
psql -h localhost -U user -d mydb -c "COPY (SELECT * FROM sales) TO STDOUT WITH CSV HEADER" > sales_export.csv
# Import CSV into table
psql -h localhost -U user -d mydb -c "\COPY customers FROM 'customers.csv' WITH CSV HEADER"
MySQL
# Connect to MySQL
mysql -h localhost -u username -p database_name
# Execute query
mysql -h localhost -u user -p -e "SELECT COUNT(*) FROM orders;" mydb
# Execute SQL file
mysql -h localhost -u user -p mydb < script.sql
# Export to CSV
mysql -h localhost -u user -p -e "SELECT * FROM products;" mydb > products.csv
# Backup database
mysqldump -h localhost -u user -p mydb > backup.sql
# Restore database
mysql -h localhost -u user -p mydb < backup.sql
MongoDB
# Connect to MongoDB
mongo
# Connect to specific database
mongo mydb
# Execute command
mongo mydb --eval "db.users.count()"
# Export collection to JSON
mongoexport --db=mydb --collection=users --out=users.json
# Import JSON
mongoimport --db=mydb --collection=users --file=users.json
# Backup
mongodump --db=mydb --out=/backup/
# Restore
mongorestore --db=mydb /backup/mydb/
System Monitoring and Performance
Understanding system performance is crucial when running data pipelines.
CPU and Memory Monitoring
# Real-time system monitoring
top
# Better alternative with more features
htop
# Memory usage
free -h
# Detailed memory info
cat /proc/meminfo
# CPU information
lscpu
# System load average
uptime
# Process-specific resource usage
ps aux | sort -nrk 3 | head -10 # Top 10 CPU users
ps aux | sort -nrk 4 | head -10 # Top 10 memory users
Disk Monitoring
# Disk usage by filesystem
df -h
# Disk usage by directory
du -sh /data/*
# Find largest directories
du -h /data | sort -rh | head -20
# Disk I/O statistics
iostat -x 2 # Update every 2 seconds
# Monitor disk usage in real-time
watch -n 5 df -h
Network Monitoring
# Network interface statistics
ifconfig
# Or on newer systems
ip addr show
# Network statistics
netstat -tuln
# Check open ports
ss -tuln
# Test network speed
iftop # (may need installation)
# Check if a port is open
telnet hostname port
nc -zv hostname port
# Trace route to destination
traceroute google.com
# DNS lookup
nslookup example.com
dig example.com
Log Management
Logs are vital for troubleshooting data pipelines.
System Logs
# View system logs
tail -f /var/log/syslog # Debian/Ubuntu
tail -f /var/log/messages # RedHat/CentOS
# Application logs
tail -f /var/log/apache2/error.log
tail -f /var/log/nginx/access.log
# Kernel logs
dmesg
# Authentication logs
tail -f /var/log/auth.log
journalctl (systemd logs)
# View all logs
journalctl
# Follow logs in real-time
journalctl -f
# Logs from specific service
journalctl -u apache2
# Logs from last boot
journalctl -b
# Logs from specific time period
journalctl --since "2024-01-23 00:00:00" --until "2024-01-23 23:59:59"
# Show only errors
journalctl -p err
# Logs for specific process
journalctl _PID=1234
Log Analysis Examples
# Count error occurrences
grep "ERROR" application.log | wc -l
# Find most common errors
grep "ERROR" application.log | sort | uniq -c | sort -rn | head -10
# Extract errors from a time period
awk '/2024-01-23 14:00:00/,/2024-01-23 15:00:00/' application.log | grep "ERROR"
# Monitor log file size
watch -n 10 'ls -lh /var/log/application.log'
# Rotate logs (manually)
mv application.log application.log.$(date +%Y%m%d)
touch application.log
# Find slow queries in database logs
awk '$NF > 5' slow_query.log # Queries taking more than 5 seconds
File Transfer and Synchronization
rsync - Efficient File Synchronization
# Basic syntax
rsync -av source/ destination/
# Sync to remote server
rsync -av /local/data/ user@server:/remote/data/
# Sync from remote server
rsync -av user@server:/remote/data/ /local/data/
# Sync with compression (good for slow connections)
rsync -avz source/ destination/
# Show progress
rsync -av --progress large_file.csv user@server:/data/
# Dry run (test without actually copying)
rsync -avn source/ destination/
# Delete files in destination that don't exist in source
rsync -av --delete source/ destination/
# Exclude certain files
rsync -av --exclude='*.tmp' --exclude='*.log' source/ destination/
# Bandwidth limit (in KB/s)
rsync -av --bwlimit=1000 large_dataset/ user@server:/data/
Real-World rsync Examples
# Backup data directory daily
rsync -av --delete /data/warehouse/ /backup/warehouse_$(date +%Y%m%d)/
# Sync processed data to production
rsync -avz --exclude='*.tmp' /data/processed/ prod_server:/data/production/
# Mirror data between servers with logging
rsync -av --log-file=/var/log/rsync.log server1:/data/ server2:/data/
# Resume interrupted transfer
rsync -av --partial --progress user@server:/large_file.csv ./
Package Management
Installing and managing software on Linux.
Debian/Ubuntu (apt)
# Update package list
sudo apt update
# Upgrade all packages
sudo apt upgrade
# Install a package
sudo apt install python3-pip
# Remove a package
sudo apt remove package_name
# Search for packages
apt search postgresql
# Show package info
apt show docker.io
# List installed packages
apt list --installed
# Install from .deb file
sudo dpkg -i package.deb
Red Hat/CentOS (yum/dnf)
# Update packages
sudo yum update
# Install package
sudo yum install python3
# Remove package
sudo yum remove package_name
# Search
yum search postgresql
# List installed
yum list installed
# On newer versions, use dnf
sudo dnf install package_name
Installing Data Engineering Tools
# Python and pip
sudo apt install python3 python3-pip
# PostgreSQL
sudo apt install postgresql postgresql-contrib
# Docker
sudo apt install docker.io
# Apache Spark (download and extract)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
# Python libraries
pip3 install pandas numpy apache-airflow pyspark
Real-World Data Engineering Scenarios
Let's put everything together with practical scenarios you'll encounter as a data engineer.
Scenario 1: Setting Up a Data Processing Environment
You've just been hired, and you need to set up your Linux environment for data engineering work.
# Step 1: Update the system
sudo apt update && sudo apt upgrade -y
# Step 2: Install essential tools
sudo apt install -y git vim htop curl wget tree
# Step 3: Install Python and dependencies
sudo apt install -y python3 python3-pip python3-venv
# Step 4: Create project structure
mkdir -p ~/projects/{data_pipelines,scripts,notebooks,data/{raw,processed,output}}
# Step 5: Set up Python virtual environment
cd ~/projects/data_pipelines
python3 -m venv venv
source venv/bin/activate
# Step 6: Install Python libraries
pip install pandas numpy apache-airflow pyspark sqlalchemy psycopg2-binary
# Step 7: Configure git
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
# Step 8: Create .bashrc aliases
cat >> ~/.bashrc << 'EOF'
# Data Engineering Aliases
alias ll='ls -lah'
alias projects='cd ~/projects'
alias activate='source venv/bin/activate'
alias pipes='cd ~/projects/data_pipelines'
# Environment variables
export DATA_HOME=~/projects/data
export PYTHONPATH=$PYTHONPATH:~/projects/scripts
EOF
# Reload bashrc
source ~/.bashrc
# Step 9: Create initial pipeline script
cat > ~/projects/data_pipelines/sample_etl.py << 'EOF'
#!/usr/bin/env python3
"""
Sample ETL Pipeline
"""
import pandas as pd
from datetime import datetime
def extract():
print(f"[{datetime.now()}] Extracting data...")
# Your extraction logic
return pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
def transform(data):
print(f"[{datetime.now()}] Transforming data...")
# Your transformation logic
data['value_doubled'] = data['value'] * 2
return data
def load(data):
print(f"[{datetime.now()}] Loading data...")
# Your loading logic
data.to_csv('output.csv', index=False)
print(f"[{datetime.now()}] Completed!")
if __name__ == "__main__":
df = extract()
df_transformed = transform(df)
load(df_transformed)
EOF
chmod +x ~/projects/data_pipelines/sample_etl.py
echo "Environment setup complete!"
Scenario 2: Troubleshooting a Failed Data Pipeline
It's 3 AM, and you get an alert that the nightly ETL pipeline failed. Here's how you troubleshoot:
# Step 1: SSH into the production server
ssh production
# Step 2: Check if the process is running
ps aux | grep etl_pipeline
# Step 3: Check the logs
tail -100 /var/log/etl/pipeline.log
# Step 4: Look for errors
grep -i "error\|exception\|failed" /var/log/etl/pipeline.log | tail -20
# Output shows: "ERROR: Database connection timeout"
# Step 5: Check database connectivity
nc -zv database-server 5432
# Output: Connection refused
# Step 6: Check if database service is running
ssh database-server "systemctl status postgresql"
# Step 7: Restart database (if you have permission)
ssh database-server "sudo systemctl restart postgresql"
# Step 8: Verify it's running
ssh database-server "systemctl status postgresql"
# Step 9: Test connection from app server
psql -h database-server -U etl_user -d warehouse -c "SELECT 1;"
# Success! Now rerun the pipeline
cd /opt/data_pipelines
./etl_pipeline.sh
# Step 10: Monitor in real-time
tail -f /var/log/etl/pipeline.log
# Step 11: Verify completion
ls -lh /data/output/ | grep $(date +%Y%m%d)
# Step 12: Document the incident
cat >> /var/log/incidents/$(date +%Y%m%d).txt << EOF
Time: $(date)
Issue: ETL pipeline failure due to database connection timeout
Cause: PostgreSQL service stopped unexpectedly
Resolution: Restarted PostgreSQL service
Status: Resolved
EOF
Scenario 3: Processing Large CSV Files
You need to process a 10GB CSV file, but your laptop doesn't have enough memory.
# Step 1: Check available memory
free -h
# Step 2: Check file size
ls -lh large_dataset.csv
# Output: -rw-r--r-- 1 user user 10G Jan 23 10:00 large_dataset.csv
# Step 3: Count lines without loading entire file
wc -l large_dataset.csv
# Output: 50000000 large_dataset.csv
# Step 4: View first few lines to understand structure
head -20 large_dataset.csv
# Step 5: Split into smaller chunks (1 million lines each)
# Note: only the first chunk keeps the CSV header row
split -l 1000000 large_dataset.csv chunk_ --additional-suffix=.csv
# Step 6: Process each chunk
for file in chunk_*.csv; do
echo "Processing $file..."
python process_chunk.py "$file" "processed_${file}"
done
# Step 7: Combine processed results
cat processed_chunk_*.csv > final_processed.csv
# Step 8: Verify
wc -l final_processed.csv
# Step 9: Clean up
rm chunk_*.csv processed_chunk_*.csv
# Alternative: Use streaming processing
awk -F',' 'NR > 1 && $3 > 100 {print $1","$2","$3}' large_dataset.csv > filtered.csv
# Or use sed for transformation
sed 's/,/ | /g' large_dataset.csv > pipe_delimited.csv
Scenario 4: Automating Daily Data Backups
Set up automated backups of your data warehouse.
# Step 1: Create backup script
cat > /home/dataengineer/scripts/backup_warehouse.sh << 'EOF'
#!/bin/bash
# Configuration
BACKUP_DIR="/backup/warehouse"
DB_NAME="data_warehouse"
DB_USER="backup_user"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/warehouse_${DATE}.sql.gz"
LOG_FILE="/var/log/backups/warehouse_backup.log"
# Function to log messages
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Create backup directory if it doesn't exist
mkdir -p "$BACKUP_DIR"
# Start backup
log "Starting backup of $DB_NAME"
# Perform backup with compression
if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"; then
BACKUP_SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
log "Backup completed successfully. Size: $BACKUP_SIZE"
else
log "ERROR: Backup failed!"
exit 1
fi
# Delete old backups
log "Cleaning up backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "warehouse_*.sql.gz" -mtime +$RETENTION_DAYS -delete
# Count remaining backups
BACKUP_COUNT=$(find "$BACKUP_DIR" -name "warehouse_*.sql.gz" | wc -l)
log "Total backups retained: $BACKUP_COUNT"
# Optional: Upload to S3
if [ -n "$AWS_BACKUP_BUCKET" ]; then
log "Uploading to S3..."
aws s3 cp "$BACKUP_FILE" "s3://${AWS_BACKUP_BUCKET}/$(basename $BACKUP_FILE)"
log "Upload completed"
fi
log "Backup process completed"
EOF
# Step 2: Make script executable
chmod +x /home/dataengineer/scripts/backup_warehouse.sh
# Step 3: Test the script
/home/dataengineer/scripts/backup_warehouse.sh
# Step 4: Schedule with cron (daily at 2 AM)
crontab -e
# Add this line:
# 0 2 * * * /home/dataengineer/scripts/backup_warehouse.sh
# Step 5: Set up log rotation (use tee because the redirection itself needs root)
sudo tee /etc/logrotate.d/warehouse_backup > /dev/null << 'EOF'
/var/log/backups/warehouse_backup.log {
daily
rotate 30
compress
delaycompress
notifempty
create 0640 dataengineer dataengineer
}
EOF
Scenario 5: Monitoring Data Pipeline Performance
Create a comprehensive monitoring script.
cat > ~/scripts/monitor_pipeline.sh << 'EOF'
#!/bin/bash
# Configuration
PIPELINE_DIR="/opt/data_pipelines"
LOG_DIR="/var/log/pipelines"
ALERT_EMAIL="team@company.com"
DISK_THRESHOLD=80
MEMORY_THRESHOLD=85
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Check disk usage
check_disk() {
echo "=== Disk Usage Check ==="
df -h | grep -E '^/dev/' | while read line; do
usage=$(echo $line | awk '{print $5}' | sed 's/%//')
partition=$(echo $line | awk '{print $6}')
if [ "$usage" -ge "$DISK_THRESHOLD" ]; then
echo -e "${RED}WARNING${NC}: $partition is ${usage}% full"
else
echo -e "${GREEN}OK${NC}: $partition is ${usage}% full"
fi
done
echo
}
# Check memory usage
check_memory() {
echo "=== Memory Usage Check ==="
mem_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100}')
if [ "$mem_usage" -ge "$MEMORY_THRESHOLD" ]; then
echo -e "${RED}WARNING${NC}: Memory usage is ${mem_usage}%"
else
echo -e "${GREEN}OK${NC}: Memory usage is ${mem_usage}%"
fi
echo
}
# Check running pipelines
check_pipelines() {
echo "=== Pipeline Status ==="
for pipeline in etl_pipeline data_sync ml_training; do
if pgrep -f "$pipeline" > /dev/null; then
echo -e "${GREEN}RUNNING${NC}: $pipeline"
else
echo -e "${YELLOW}STOPPED${NC}: $pipeline"
fi
done
echo
}
# Check for recent errors in logs
check_errors() {
echo "=== Recent Errors (Last Hour) ==="
for log in ${LOG_DIR}/*.log; do
if [ -f "$log" ]; then
error_count=$(grep -c "ERROR" "$log" 2>/dev/null)
error_count=${error_count:-0}
if [ "$error_count" -gt 0 ]; then
echo -e "${RED}$error_count errors${NC} in $(basename $log)"
fi
fi
done
echo
}
# Check database connectivity
check_database() {
echo "=== Database Connectivity ==="
if psql -h localhost -U etl_user -d warehouse -c "SELECT 1;" &>/dev/null; then
echo -e "${GREEN}OK${NC}: Database connection successful"
else
echo -e "${RED}FAILED${NC}: Cannot connect to database"
fi
echo
}
# Generate report
generate_report() {
REPORT_FILE="/tmp/pipeline_status_$(date +%Y%m%d_%H%M%S).txt"
{
echo "Pipeline Monitoring Report"
echo "Generated: $(date)"
echo "================================"
echo
check_disk
check_memory
check_pipelines
check_errors
check_database
} | tee "$REPORT_FILE"
echo "Report saved to: $REPORT_FILE"
}
# Main execution
echo "Starting Pipeline Monitoring..."
echo "================================"
generate_report
EOF
chmod +x ~/scripts/monitor_pipeline.sh
# Run monitoring every 15 minutes
crontab -e
# Add: */15 * * * * /home/dataengineer/scripts/monitor_pipeline.sh >> /var/log/monitoring.log 2>&1
Scenario 6: Data Quality Checks
Implement automated data quality validation.
cat > ~/scripts/data_quality_check.sh << 'EOF'
#!/bin/bash
# Configuration
DATA_DIR="/data/processed"
REPORT_DIR="/data/quality_reports"
DATE=$(date +%Y%m%d)
mkdir -p "$REPORT_DIR"
# Function to check null values
check_nulls() {
local file=$1
local null_count=$(grep -c "NULL\|^,$\|,,\|,$" "$file")
if [ "$null_count" -gt 0 ]; then
echo "WARNING: Found $null_count potential null values in $(basename $file)"
return 1
else
echo "OK: No null values in $(basename $file)"
return 0
fi
}
# Function to check duplicates
check_duplicates() {
local file=$1
local total_lines=$(wc -l < "$file")
local unique_lines=$(sort -u "$file" | wc -l)
local duplicates=$((total_lines - unique_lines))
if [ "$duplicates" -gt 0 ]; then
echo "WARNING: Found $duplicates duplicate rows in $(basename $file)"
return 1
else
echo "OK: No duplicates in $(basename $file)"
return 0
fi
}
# Function to validate schema
validate_schema() {
local file=$1
local expected_columns=$2
local actual_columns=$(head -1 "$file" | tr ',' '\n' | wc -l)
if [ "$actual_columns" -eq "$expected_columns" ]; then
echo "OK: Schema validation passed for $(basename $file)"
return 0
else
echo "ERROR: Expected $expected_columns columns, found $actual_columns in $(basename $file)"
return 1
fi
}
# Main quality check
{
echo "Data Quality Report - $DATE"
echo "============================"
echo
for file in "${DATA_DIR}"/*.csv; do
if [ -f "$file" ]; then
echo "Checking: $(basename $file)"
echo "Size: $(du -h $file | cut -f1)"
echo "Rows: $(wc -l < $file)"
validate_schema "$file" 5
check_nulls "$file"
check_duplicates "$file"
echo "---"
fi
done
} | tee "${REPORT_DIR}/quality_report_${DATE}.txt"
echo "Quality check completed. Report saved to ${REPORT_DIR}/quality_report_${DATE}.txt"
EOF
chmod +x ~/scripts/data_quality_check.sh
Best Practices and Tips
Command Line Productivity
1. Use Tab Completion
# Start typing and press Tab
cd /var/lo[Tab] # Completes to /var/log/
python scr[Tab] # Completes to script.py
2. Command History
# View command history
history
# Search history
Ctrl+R # Then type to search
# Execute previous command
!!
# Execute command from history
!123 # Executes command number 123
# Execute last command starting with 'python'
!python
# Use previous command's arguments
ls /long/path/to/directory
cd !$ # Changes to /long/path/to/directory
3. Useful Aliases
Add these to your ~/.bashrc:
# Navigation
alias ..='cd ..'
alias ...='cd ../..'
alias home='cd ~'
# Listing
alias ll='ls -lah'
alias la='ls -A'
alias l='ls -CF'
# Safety
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
# Git shortcuts
alias gs='git status'
alias ga='git add'
alias gc='git commit'
alias gp='git push'
alias gl='git log --oneline'
# Data engineering
alias pipes='cd ~/projects/data_pipelines'
alias logs='cd /var/log'
alias dsize='du -sh'
alias ports='netstat -tuln'
# Docker shortcuts
alias dps='docker ps'
alias dim='docker images'
alias dex='docker exec -it'
# Python
alias py='python3'
alias pip='pip3'
alias venv='python3 -m venv'
4. Navigate Directories Efficiently
# Use directory stack
pushd /var/log # Go to /var/log and save current dir
pushd /etc # Go to /etc and save /var/log
popd # Return to /var/log
popd # Return to original directory
# Show directory stack
dirs
# Quick back and forth
cd - # Toggle between last two directories
Security Best Practices
1. File Permissions
# Scripts should be readable and executable
chmod 750 script.sh
# Configuration files should not be world-readable
chmod 600 config.ini
# Directories for collaboration
chmod 775 shared_directory/
# Never use 777 unless absolutely necessary (and understand why)
2. Secure Passwords and Credentials
# Never store passwords in scripts
# BAD:
password="mypassword123"
# GOOD: Use environment variables
export DB_PASSWORD=$(cat ~/.secrets/db_pass)
# Or use a secrets manager
export DB_PASSWORD=$(aws secretsmanager get-secret-value --secret-id db_password --query SecretString --output text)
# Ensure secrets files are protected
chmod 600 ~/.secrets/db_pass
3. Regular Updates
# Keep system updated
sudo apt update && sudo apt upgrade
# Upgrade only a specific package (e.g. for a security fix)
sudo apt-get install --only-upgrade package_name
# Check for security advisories
sudo apt-get changelog package_name
Performance Optimization
1. Efficient File Processing
# Use grep -F for fixed strings (faster)
grep -F "exact_string" large_file.log
# Use awk instead of multiple pipes
# Instead of: cat file | grep pattern | cut -d',' -f1
awk -F',' '/pattern/ {print $1}' file
# Process compressed files directly
zgrep "error" logfile.gz
zcat file.gz | awk '{print $1}'
2. Parallel Processing
# Use GNU parallel for concurrent processing
find . -name "*.csv" | parallel python process.py {}
# Process multiple files simultaneously
ls *.csv | xargs -P 4 -I {} python process.py {}
# -P 4 means use 4 parallel processes
3. Monitor Resource Usage
# Before running heavy processes, check resources
free -h
df -h
# Use nice to lower priority of heavy processes
nice -n 19 python heavy_computation.py
# Limit resources with ulimit
ulimit -v 4000000 # Limit virtual memory to ~4GB
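You can also cap a job's runtime with the coreutils timeout command, a simple guard for batch work:
# Kill the job if it runs longer than 2 hours
timeout 2h nice -n 19 python heavy_computation.py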
Error Handling and Logging
1. Proper Error Handling in Scripts
#!/bin/bash
set -e # Exit on any error
set -u # Exit on undefined variable
set -o pipefail # Catch errors in pipes
# Example
python extract.py || { echo "Extraction failed"; exit 1; }
python transform.py || { echo "Transformation failed"; exit 1; }
python load.py || { echo "Loading failed"; exit 1; }
2. Comprehensive Logging
#!/bin/bash
LOG_FILE="/var/log/my_pipeline.log"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "INFO: Starting pipeline"
log "INFO: Processing data..."
log "ERROR: Something went wrong"
3. Debugging Scripts
# Run script with debugging
bash -x script.sh
# Add debugging to specific sections
set -x # Enable debugging
# ... commands ...
set +x # Disable debugging
# Trace function calls
set -T
trap 'echo "Function: $FUNCNAME"' DEBUG
Documentation
1. Comment Your Scripts
#!/bin/bash
# Script: data_pipeline.sh
# Purpose: Daily ETL pipeline for customer data
# Author: Your Name
# Date: 2024-01-23
# Usage: ./data_pipeline.sh [date]
#
# This script:
# 1. Extracts customer data from source database
# 2. Transforms and cleanses the data
# 3. Loads into data warehouse
#
# Environment variables required:
# - DB_HOST: Database hostname
# - DB_USER: Database username
# - DB_PASS: Database password
# Configuration
SOURCE_DB="customers"
TARGET_TABLE="dim_customers"
2. Maintain README Files
# Create a README for your project
cat > README.md << 'EOF'
# Data Pipeline Project
## Overview
This project contains ETL pipelines for processing customer data.
## Requirements
- Python 3.8+
- PostgreSQL 12+
- pandas, sqlalchemy
## Installation
pip install -r requirements.txt
## Usage
# Run the daily pipeline
./scripts/daily_etl.sh
# Run for a specific date
./scripts/daily_etl.sh 2024-01-23
## Configuration
Edit `config/database.conf` with your database credentials.
## Troubleshooting
See `docs/troubleshooting.md`
## Contact
Data Engineering Team - data@company.com
EOF
Backup and Disaster Recovery
# Create backup strategy
cat > ~/scripts/comprehensive_backup.sh << 'EOF'
#!/bin/bash
# Backup locations
BACKUP_ROOT="/backup"
DATE=$(date +%Y%m%d)
# Create backup structure
mkdir -p ${BACKUP_ROOT}/{databases,scripts,configs,data}/${DATE}
# Backup databases
pg_dump -U user warehouse > ${BACKUP_ROOT}/databases/${DATE}/warehouse.sql
# Backup scripts and configurations
tar -czf ${BACKUP_ROOT}/scripts/${DATE}/scripts.tar.gz ~/scripts
tar -czf ${BACKUP_ROOT}/configs/${DATE}/configs.tar.gz ~/config
# Backup critical data
rsync -av /data/production/ ${BACKUP_ROOT}/data/${DATE}/
# Verify backups
for dir in databases scripts configs data; do
if [ "$(find ${BACKUP_ROOT}/${dir}/${DATE} -type f | wc -l)" -gt 0 ]; then
echo "✓ $dir backup successful"
else
echo "✗ $dir backup failed"
fi
done
# Clean old backups (keep 30 days; target only the dated subdirectories, not the category folders)
find ${BACKUP_ROOT} -mindepth 2 -maxdepth 2 -type d -mtime +30 -exec rm -rf {} +
echo "Backup completed: $(date)"
EOF
Conclusion
Congratulations! You've completed this comprehensive guide to Linux for data engineers. You've learned:
- Why Linux is fundamental to data engineering
- Essential commands for daily tasks
- How to use Vi and Nano for text editing
- Advanced concepts like shell scripting, cron, and SSH
- Real-world scenarios and best practices
Next Steps
1. Practice Daily
The best way to master Linux is to use it every day. Set it up as your primary development environment.
2. Build Projects
Create your own data pipelines, automation scripts, and tools. Real projects solidify your learning.
3. Explore Further
- Learn Docker for containerization
- Study Kubernetes for orchestration
- Explore Apache Airflow for workflow management
- Dive deeper into shell scripting with advanced bash techniques
4. Join Communities
- Stack Overflow for Q&A
- Reddit's r/linux and r/dataengineering
- Linux User Groups (LUGs) in your area
- Open source projects on GitHub
Resources for Continued Learning
Books:
- "The Linux Command Line" by William Shotts
- "Linux Pocket Guide" by Daniel J. Barrett
- "Shell Scripting: Expert Recipes for Linux, Bash and more" by Steve Parker
Online Resources:
- Linux Journey (linuxjourney.com)
- OverTheWire: Bandit (for practicing commands)
- The Linux Documentation Project (tldp.org)
Practice Platforms:
- DigitalOcean (for cheap Linux servers to practice)
- AWS Free Tier (for cloud Linux environments)
- VirtualBox (for local Linux VMs)
Final Thoughts
Linux mastery is a journey, not a destination. Every data engineer continues to learn new commands, techniques, and best practices throughout their career. The key is to stay curious, experiment safely (with backups!), and never stop learning.
Remember: every expert was once a beginner who didn't give up. The command line might seem intimidating at first, but with practice, it becomes second nature. Soon, you'll find yourself more productive and efficient than you ever thought possible.
Welcome to the world of Linux. Your journey as a data engineer just got a powerful boost!
Quick Reference Card
Most Used Commands
# Navigation
pwd, cd, ls, tree
# File operations
cp, mv, rm, mkdir, touch
# Viewing files
cat, less, head, tail, grep
# System info
df -h, free -h, top, htop
# Permissions
chmod, chown
# Processes
ps, kill, top, jobs
# Network
ssh, scp, rsync, wget, curl
# Text editors
nano, vi/vim
# Package management
apt install, apt update (Debian/Ubuntu)
yum install, yum update (RedHat/CentOS)
Vi Quick Reference
i - Insert mode
Esc - Normal mode
:w - Save
:q - Quit
:wq - Save and quit
dd - Delete line
yy - Copy line
p - Paste
/search - Search forward
:s/old/new/g - Replace in line
:%s/old/new/g - Replace in file
Nano Quick Reference
Ctrl+O - Save
Ctrl+X - Exit
Ctrl+K - Cut line
Ctrl+U - Paste
Ctrl+W - Search
Ctrl+\ - Replace
Ctrl+G - Help
Happy Linux Learning! 🐧
This guide was created for data engineers at all levels. Feel free to bookmark, share, and revisit as you grow in your Linux journey.