Introduction
As a data engineer, most of your work will happen
on Linux servers. Whether you are managing databases,
running data pipelines, or processing large datasets,
Linux is the operating system that powers the majority
of servers worldwide. Understanding Linux fundamentals
is therefore not optional for data engineers; it is
a core skill that separates beginners from
professionals.
In this article, I will walk you through the
essential Linux skills every data engineer needs,
based on my hands-on experience setting up a remote
Ubuntu server, configuring PostgreSQL, and performing
file transfers using SCP. This is part of my
LuxDevHQ Data Engineering Cohort 8 assignment.
1. Setting Up Linux on Windows with WSL
Most data engineers start their journey on Windows.
The good news is that you do not need to install a
separate Linux machine: Windows Subsystem for Linux
(WSL) allows you to run a full Linux environment
directly inside Windows.
To install WSL, open Windows CMD as Administrator
and run:
wsl --install -d Ubuntu-22.04
After installation, restart your PC. You can now
launch Ubuntu directly from your Start menu or by
typing wsl in CMD.
One important lesson I learned during setup: WSL
comes in different flavors. If your prompt shows
-sh instead of bash, you are running a minimal
shell, not full Ubuntu. In that case, install Ubuntu
specifically using the command above.
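A quick way to confirm what you are actually running is a small check like the one below. This is only a sketch: missing_tools is a hypothetical helper name I made up, and the list of tools (apt, sudo, ssh) is simply the set used later in this article.

```shell
# Print each command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Show the distribution name if the standard os-release file is readable.
if [ -r /etc/os-release ]; then
    grep '^PRETTY_NAME=' /etc/os-release || echo "PRETTY_NAME not set"
fi

# List which of the tools used in this article are missing, if any.
missing_tools apt sudo ssh
```

If the last command prints nothing, all three tools are available. Any name it prints is a sign you may be in a minimal environment rather than full Ubuntu.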
2. Connecting to a Remote Server with SSH
SSH (Secure Shell) is the standard way to connect
to remote Linux servers. As a data engineer you will
use SSH daily to access cloud servers, manage
databases, and run data pipelines remotely.
The basic SSH command syntax is:
ssh username@server_ip -p port_number
For example, to connect to our assignment server:
ssh root@159.65.222.96 -p 22
Port 22 is the default SSH port. When connecting
for the first time, you will see:
Are you sure you want to continue connecting?
(yes/no)
Type yes and press Enter, but only when you are sure
the address is the server you meant to reach.
One important thing to understand about your
terminal prompt:
- # at the end means you are root (full admin)
- $ at the end means you are a normal user
Always run whoami to confirm which user you are
operating as. This saved me from many permission
errors during this assignment.
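The prompt rule comes down to the numeric user ID: root is always UID 0. Here is a minimal sketch of that check, assuming a default prompt configuration; prompt_char_for_uid is a hypothetical helper name, not a standard command.

```shell
# Map a numeric user ID to the character a default prompt ends with.
prompt_char_for_uid() {
    if [ "$1" -eq 0 ]; then
        printf '#\n'    # UID 0 is root
    else
        printf '$\n'    # any other UID is a normal user
    fi
}

# Report who we are and which prompt character to expect.
current_uid=$(id -u)
echo "user=$(whoami) uid=$current_uid prompt=$(prompt_char_for_uid "$current_uid")"
```

A customized prompt can hide or change the final character, which is one more reason to rely on whoami or id -u rather than the prompt alone.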
3. Linux User Management
On a shared server, every person should have their
own user account. This is important for security,
accountability, and proper file permissions.
To create a new user:
adduser briank
An important lesson I learned: on Ubuntu, adduser
only accepts lowercase usernames by default. When I
tried to create a user called BrianK, I got this error:
Please enter a username matching the regular
expression configured via the NAME_REGEX
configuration variable.
The fix was simple. Use lowercase:
adduser briank
To give the user sudo (admin) privileges:
usermod -aG sudo briank
To verify the user was created successfully:
id briank
Output:
uid=1088(briank) gid=1088(briank) groups=1088(briank),27(sudo),100(users)
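To catch a bad username before adduser complains, you can test the name yourself. The sketch below uses a simplified pattern that I am assuming approximates the default NAME_REGEX; the pattern actually in force on your server is whatever /etc/adduser.conf says. is_valid_username is a hypothetical helper name.

```shell
# Return success if the name fits a simplified lowercase-only pattern.
# ASSUMPTION: this approximates adduser's default NAME_REGEX; check
# /etc/adduser.conf for the pattern your server really uses.
is_valid_username() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][-a-z0-9_]*$'
}

for name in briank BrianK; do
    if is_valid_username "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected"
    fi
done
```

LC_ALL=C is set so that the range a-z matches only lowercase ASCII letters, whatever the locale of the server.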
4. Essential Linux Commands for Data Engineers
Here are the most important Linux commands every
data engineer should know, organized by category:
Navigation Commands
pwd # print current directory
ls # list files
ls -la # detailed list including hidden files
cd Documents # go into a folder
cd .. # go up one level
cd ~ # go to home directory
File Operations
touch notes.txt # create empty file
mkdir linux_assignment # create folder
cp notes.txt backup.txt # copy file
mv backup.txt old.txt # rename/move file
rm old.txt # delete file
cat notes.txt # view file contents
head -10 notes.txt # view first 10 lines
tail -10 notes.txt # view last 10 lines
grep "error" log.txt # search inside file
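The commands above are easiest to learn on a file you can safely break. This is a small practice sketch that works entirely inside a temporary directory; the log lines are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)

# Build a small sample log (contents are invented for this example).
printf '%s\n' \
    "info: pipeline started" \
    "error: connection refused" \
    "info: retrying" \
    "error: timeout" \
    "info: pipeline finished" > "$workdir/log.txt"

cp "$workdir/log.txt" "$workdir/backup.txt"      # copy before experimenting
first_line=$(head -n 1 "$workdir/log.txt")       # first line only
last_line=$(tail -n 1 "$workdir/log.txt")        # last line only
error_count=$(grep -c "error" "$workdir/log.txt") # lines containing "error"

echo "first:  $first_line"
echo "last:   $last_line"
echo "errors: $error_count"

rm -r "$workdir"                                 # clean up the practice files
```

Note that grep -c counts matching lines, not individual matches, which is usually what you want when scanning pipeline logs.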
System Information
whoami # current username
uname -a # system and kernel info
hostname # server name
uptime # how long server has been running
df -h # disk space usage
free -h # memory usage
top # running processes (q to quit)
ps aux # list all processes
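These commands become more useful once you script them. As one example, here is a minimal sketch of a disk space check for the root filesystem. The 80 percent threshold and the helper name is_over_threshold are my own assumptions, not values from the assignment.

```shell
# Succeed when a usage figure such as "91%" is above the given threshold.
is_over_threshold() {
    usage=${1%"%"}          # strip the trailing percent sign
    [ "$usage" -gt "$2" ]
}

# Read the capacity column for / using df's portable (-P) output format.
root_usage=$(df -P / | awk 'NR == 2 { print $5 }')

if is_over_threshold "$root_usage" 80; then
    echo "WARNING: / is at $root_usage"
else
    echo "OK: / is at $root_usage"
fi
```

I use df -P here instead of df -h because its one-line-per-filesystem output is easier to parse reliably.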
File Permissions
Linux file permissions control who can read,
write, and execute files. They are shown as:
-rwxr-xr-x
Breaking this down:
- r = read (4)
- w = write (2)
- x = execute (1)
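After the leading file-type character, the nine letters
form three groups: owner, group, others. Adding the
values within each group gives one digit, so
rwxr-xr-x becomes 755.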
To change permissions:
chmod 755 script.sh # owner: rwx, group and others: r-x
chmod 644 notes.txt # owner: rw-, group and others: r--
To change file ownership:
chown briank notes.txt
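To make the arithmetic behind numbers like 755 concrete, here is a small sketch that converts the nine permission letters into their octal form. symbolic_to_octal is a hypothetical helper written for this article, not a standard command.

```shell
# Convert a 9-character permission string (e.g. rwxr-xr-x) to octal.
symbolic_to_octal() {
    perms=$1
    result=""
    for start in 1 4 7; do
        # Take one rwx triplet: owner, then group, then others.
        triplet=$(printf '%s' "$perms" | cut -c"$start"-"$((start + 2))")
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac   # read
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write
        case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute
        result="$result$digit"
    done
    printf '%s\n' "$result"
}

symbolic_to_octal rwxr-xr-x   # prints 755
symbolic_to_octal rw-r--r--   # prints 644
```

Pass only the nine permission letters, without the leading file-type character that ls -la shows.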
Networking Commands
ip a # show network interfaces
ping google.com -c 4 # test connectivity
netstat -tulnp # show open ports
ss -tlnp # modern version of netstat
curl ifconfig.me # show your public IP
5. PostgreSQL Setup on Linux
PostgreSQL is one of the most popular open-source
databases in data engineering. Here is how to set it
up on Ubuntu:
Installation
apt update
apt install postgresql postgresql-contrib -y
Start and Enable the Service
systemctl start postgresql
systemctl enable postgresql
systemctl status postgresql
Log Into PostgreSQL
su -s /bin/bash postgres
psql
Create a Database and Schema
CREATE DATABASE briank;
\c briank
CREATE SCHEMA staging;
Create a Table and Insert Data
CREATE TABLE staging.farmers (
id SERIAL PRIMARY KEY,
farmer_name VARCHAR(100),
county VARCHAR(50),
subcounty VARCHAR(50),
acreage DECIMAL(5,2),
crop VARCHAR(50),
loan_amount DECIMAL(10,2),
loan_status VARCHAR(20),
season VARCHAR(20)
);
INSERT INTO staging.farmers
(farmer_name, county, subcounty,
acreage, crop, loan_amount, loan_status, season)
VALUES
('John Kipchumba', 'Uasin Gishu',
'Turbo', 2.5, 'Maize', 15000.00, 'Paid', '2023A'),
('Mary Jelimo', 'Uasin Gishu',
'Soy', 1.8, 'Maize', 12000.00, 'Defaulted', '2023A'),
('Peter Rotich', 'Uasin Gishu',
'Eldoret East', 3.2, 'Maize', 20000.00, 'Paid', '2023B');
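Once the data is loaded, you do not have to open an interactive psql session for every query. The sketch below wraps a summary query in a shell function. It assumes the briank database and staging.farmers table from above, and that you run it as a user psql can connect with, such as postgres. run_loan_summary is a hypothetical name.

```shell
# Summary query over the staging.farmers table created above.
summary_sql="SELECT loan_status,
       COUNT(*)         AS farmers,
       SUM(loan_amount) AS total_loaned
FROM staging.farmers
GROUP BY loan_status
ORDER BY loan_status;"

# Run the query non-interactively against the briank database.
# ASSUMPTION: the current OS user is allowed to connect with psql.
run_loan_summary() {
    psql -d briank -c "$summary_sql"
}

# Usage, once psql can connect:
#   run_loan_summary
```

With the three sample rows inserted above, the result should show one Defaulted loan totalling 12000.00 and two Paid loans totalling 35000.00.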
Useful psql Commands
\l -- list all databases
\c dbname -- connect to database
\dt -- list all tables
\du -- list all users
\q -- quit psql
Allow External Connections
To allow tools like DBeaver or pgAdmin to connect
remotely, two configuration files need editing:
In postgresql.conf, change listen_addresses:
listen_addresses = '*'
In pg_hba.conf, add this line at the bottom:
host all all 0.0.0.0/0 md5
Then restart PostgreSQL:
systemctl restart postgresql
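On a shared server it is easy to add the same rule twice. Here is a minimal sketch of an append that first checks whether the exact line already exists. It runs against a temporary stand-in file. On Ubuntu the real pg_hba.conf typically lives under /etc/postgresql/, but treat that path as an assumption and confirm it on your server. append_once is a hypothetical helper name.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line=$1
    file=$2
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstrate on a temporary stand-in, not the real pg_hba.conf.
demo_hba=$(mktemp)
rule="host all all 0.0.0.0/0 md5"
append_once "$rule" "$demo_hba"
append_once "$rule" "$demo_hba"   # second call changes nothing

tail -n 5 "$demo_hba"             # inspect the end of the file
rule_count=$(grep -cxF -- "$rule" "$demo_hba")
echo "matching lines: $rule_count"
```

The -x and -F flags make grep match the whole line as a literal string, so the dots in 0.0.0.0/0 are not treated as regex wildcards.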
6. File Transfers with SCP
SCP (Secure Copy Protocol) uses SSH to transfer
files between your local machine and a remote server.
This is essential for data engineers who need to
move datasets, scripts, and configuration files.
Upload from local PC to server
scp C:\Users\Brian\notes.txt root@159.65.222.96:/root/
Download from server to local PC
scp root@159.65.222.96:/root/notes.txt C:\Users\Brian\Downloads\
Copy an entire folder
scp -r myfolder/ root@159.65.222.96:/root/
Use SSH key instead of password
scp -i ~/.ssh/mykey.pem file.txt root@server:/path/
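After moving a dataset, it can be worth confirming that the copy is identical to the original. The sketch below compares SHA-256 checksums of two files. It assumes sha256sum is available (it ships with Ubuntu), and it uses a local cp as a stand-in for the scp transfer so the example runs anywhere. same_checksum is a hypothetical helper name.

```shell
# Succeed when two files have the same SHA-256 checksum.
same_checksum() {
    sum_a=$(sha256sum < "$1" | awk '{ print $1 }')
    sum_b=$(sha256sum < "$2" | awk '{ print $1 }')
    [ "$sum_a" = "$sum_b" ]
}

# Demo files; the CSV content is invented for this example.
demo_dir=$(mktemp -d)
printf 'id,county\n1,Uasin Gishu\n' > "$demo_dir/original.csv"
cp "$demo_dir/original.csv" "$demo_dir/copy.csv"   # stands in for scp

if same_checksum "$demo_dir/original.csv" "$demo_dir/copy.csv"; then
    echo "checksums match"
else
    echo "checksums differ"
fi
```

For a real transfer, you would run sha256sum on the server over SSH and compare its output with the checksum of your local file.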
7. Key Lessons Learned
During this assignment I encountered several
real-world challenges that taught me valuable lessons:
Lesson 1: Always check who you are
Running whoami before every session saved me
from making changes as the wrong user. I accidentally
switched to another student's account and spent time
wondering why permissions were denied.
Lesson 2: Usernames must be lowercase
Linux enforces strict naming rules. BrianK failed
but briank worked perfectly.
Lesson 3: The prompt tells you everything
In the shell, # means root and $ means a normal user.
In psql, =# means ready for a new command, while (#
means an unfinished command with an open parenthesis;
press Ctrl+C to cancel it.
Lesson 4: WSL is not always Ubuntu
Not all WSL installations are equal. A minimal
shell missing apt, sudo, and ssh taught me
to always verify my environment with
cat /etc/os-release.
Lesson 5: Shared servers have history
On a shared assignment server, previous students
had already made some configurations. Always check
before editing: use grep and tail to verify
what already exists.
8. Conclusion
Linux is the backbone of modern data engineering.
From managing remote servers with SSH, to setting
up PostgreSQL databases, to transferring files with
SCP, every skill covered in this article is used
daily by professional data engineers.
The best way to learn Linux is by doing. Set up WSL
on your Windows machine, spin up a cloud server,
and practice these commands every day. The more you
use them, the more natural they become.
As I continue my journey in data engineering at
LuxDevHQ Cohort 8, Linux will remain a foundation
skill that everything else builds on, from Python
data pipelines, to cloud infrastructure, to
geospatial data processing with PostGIS.
Brian Kiplangat - LuxDevHQ Data Engineering
Cohort 8 | Nairobi, Kenya
GitHub: https://github.com/kiplangatbrian85/linux-data-engineering