From Zero to Data Engineering: The First Mile on a Thousand-Mile Road That Refuses to Be Straight.
With just basic and level-one intermediate tech skills and from a non-tech background, when I got this assignment, I actually thought it would be as easy as class examples, but it was far from that. I encountered a series of errors, mistakes, and confusion, and I'm sure that's how the guy in class did it moments, each showing something new. I'm writing exactly how it all took place.
The Environment
- Server OS: Ubuntu 24.04.4 LTS (DigitalOcean cloud host)
- Database: PostgreSQL 16.14
- Local Machine: Windows (Powershell)
- Dataset: kenya_counties.csv - 1,000 records generated via Mockaroo
- Tools: SSH, SCP, psql, bash
Section 1: Getting into the server
Establishing a remote connection to the cloud host using SSH was the first step.
ssh root@159.65.222.96
Easy enough. Using root credentials, I connected to port 22, the default SSH port.
What is SSH? The Secure Shell(SSH) protocol is a method for securely sending commands to a computer over an unsecured network. Used for controlling servers remotely, for managing infrastructure, and transferring files.CLICK HERE To read more on
Section 2: Creating a User - Things start being real; my first error.
The instructions given clearly stated the format for creating a Linux user account was first name + surname initials. My name is Magichu Njoroge. So the intended username was MagichuN
I ran.
sudo adduser MagichuN
Then came the first wall:
Please enter a username matching the regular expression...
Root Cause: From my research, since this was a first, Linux enforces username rules. Usernames must all be lowercase. Capital letters are rejected at the system security layer to prevent directory pathing conflicts.
Fix: I simply changed to lowercase:
sudo adduser magichun
Worked like a charm. Got a home directory set up under /home/magichunthe user context registered.
With the user now recognized by the system, I promoted the user to the sudo group so it could run administrative commands:
sudo usermod -aG sudo magichun
What is a
sudoGroup? The sudo group is the de facto standard for granting elevated privileges in modern Linux distributions. Its purpose is tightly linked to the sudo command, which allows users to execute commands with the permissions of another user (typically the root user, the superuser with unrestricted access). Read more: CLICK HERE To read more onsudoandadmingroups in Linux
Then counterchecked the user existed and had correct group membership:
id magichun
cat /etc/passwd | grep magichun
Finally, I switched from root into my new user to confirm everything was working:
su magichun
whoami
pwd
Why not just hang onto root? Not ideal. Security best practice leans on minimal access rights. Root wields total control, one slip, whole machine at risk. Working as a named user with
sudoaccess is the professional standard.CLICK HERE To read more
Section 3: PostgreSQL - The Role Wall
Back as root, I checked what lived on the server already:
psql --version
sudo systemctl status postgresql
PostgreSQL 16.14 was already installed and running in an active state. No installation needed.
Now it was time to set up the database. Had to switch to the postgres admin shell.
sudo -i -u postgres
psql
After that, I attempted to log back as my new user magichun:
psql -U magichun -d magichun
Then trouble showed up for the second time:
psql:error:connection to server on socket...failed: FATAL: role"magichun" does not exist
It took me a minute to figure out what was going on; I had to seek help from a friend. This is what I understood from what he told me (I'm open to much simpler explanations): creating a Linux user does not automatically create a matching PostgreSQL role. One lives in the machine. The other only knows what you tell it. A user that exists in Linux is invisible to PostgreSQL until you explicitly register it there.
And that's exactly what I did
CREATE ROLE magichun WITH LOGIN PASSWORD 'StrongPassword123';
ALTER ROLE magichun CREATEDB;
CREATE DATABASE magichun OWNER magichun;
After that horrifying experience, I logged back as magichun and created the staging schema:
\c magichun
CREATE SCHEMA staging;
Confirmed with:
\dn
This experience made me realize I didn't really understand the database operations and needed to revisit and do much practice. All the corrections I was led on to were what to do, and I would be lying if I said I corrected them all by myself.
Section 4: Generating and Moving Real Data
Rather than inserting manual dummy records, I used Mockaroo to generate a realistic dataset.
Mockaroo: A free test data generator and API mocking tool. It lets you create custom CSV, JSON, SQL, and Excel datasets, perfect for testing database pipelines.
What is a data pipeline? A data pipeline is a set of tools and processes for collecting, processing, and delivering data from one or more sources to a destination where it can be analyzed and used. CLICK HERE To read more on data pipelines.
My dataset: kenya_counties.csv - 1,000 records with these columns:
| Column | Type |
|---|---|
| id | Integer (Primary Key) |
| first_name | VARCHAR |
| last_name | VARCHAR |
| VARCHAR | |
| gender | VARCHAR |
| ip_address | VARCHAR |
| county_name | VARCHAR |
| population | Integer |
| area_sq_km | Decimal |
| population_density | Decimal |
| average_age | Decimal |
The SCP Error (Running It From the Wrong Place)
What is SCP?: The SCP (Secure Copy Protocol) command is used to securely transfer files and directories between systems over a network. It works over SSH, which means all transferred data is encrypted. It is commonly used by system administrators to move backup files, fetch logs, copy configuration files, and transfer data between servers safely and quickly. CLICK HERE To read more on SCP
scp "D:\Datasets\kenya_counties.csv" magichun@159.65.222.96:/home/magichun/
Error:
ssh: could not resolve hostname d: Temporary failure in name resolution
Root Cause: I used the command on the remote server's terminal; it had no context. D:\ means - that's a Windows drive path on my local machine. For SCP to work, it needs to run from the machine that holds the source file.
Fix: Ran Windows PowerShell locally.
scp "D:\Datasets\kenya_counties.csv" magichun@159.65.222.96:/home/magichun/
Then verified on the server:
ls -lh /home/magichun/
head -5 kenya_counties.csv
File existed: 89K, headers confirmed.
Creating the Table
\c magichun
SET search_path TO staging;
CREATE TABLE staging.kenya _counties (
id INT,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR (150),
gender VARCHAR (20),
ip_address VARCHAR (20),
county_name VARCHAR (100),
population INT,
area_sq_km DECIMAL(10,2),
population_density DECIMAL (10,2),
average_age DECIMAL (5,2)
);
The COPY Permission Wall
COPY staging.kenya_counties
FROM '/home/magichun/kenya_counties.csv'
DELIMITER ',' CSV HEADER;
Error:
ERROR: could not open file "/home/magichun/kenya_counties.csv" for reading: Permission denied
Another error that had me in a chokehold: I didn't understand what was going on until I realized it was an issue with permissions.
Linux Permissions Linux file permissions form the foundation of the system's security model. They define who can read, write, or execute files and directories. CLICK HERE To read more on Linux Permissions
Fix - Two steps:
chmod 755 /home/magichun
chmod 644 /home/magichun/kenya_counties.csv
Then used the client-side \copywrapper instead, which streams the file through the active user session rather than the server process:
\copy staging.kenya_counties
FROM '/home/magichun/kenya_counties.csv'
DELIMITER ',' CSV HEADER;
Result: copy 1000
Verifying the Data
SELECT county_name, population_density
FROM staging.kenya_counties
ORDER BY population_density DESC
LIMIT 5;
Mombasa came out as the county with the highest density.
COPY vs \copy: Felt the need to touch on this as it confused me at first and it might help someone else confused about it.
1. COPY (SQL Command)
-Runs on the PostgreSQL server side.
-File path is relative to the server, not your local machine
-Requires the PostgreSQL server process to have read/write
permissions to that file path.COPY my_table FROM '/var/lib/postgresql/data/input.csv' CSV HEADER;
sql
-This will fail if the file is on your local machine but not on the server.2. \copy (psql Meta-command)
-Runs on the client side** (inside the psql terminal)
-Reads/writes files from your local machine where psql is running
-Uses the COPY command internally, but streams data between the client and server.\copy my_table FROM 'C:/Users/you/data.csv' CSV HEADER;-Works even if the PostgreSQL server is remote, because the file is used locally
Rule of thumb
-If your file is on the server, use COPY.
-If your file is on your local machine, use \copy in psql.
Section 5: The Reverse Pipeline - Exporting Back to windows.
The final step completed the loop, moving the file back to my local machine
\copy (SELECT county_name, population_density
FROM staging.kenya_counties
WHERE population_density > 500)
TO '/home/magichun/high_density_counties.csv'
WITH CSV HEADER;
Output: COPY 205 - 205 high-density records exported
scp -p 22 magichu@159.65.222.96:/home/magichun/high_density_counties.csv "D:\Datasets"
100% download confirmed
Here's a quick reference of the commands used throughout this assignment:
| # | Command | Purpose |
|---|---|---|
| 1 | ssh root@<IP> |
Remote server login |
| 2 | sudo adduser magichun |
Create a new system user |
| 3 | sudo usermod -aG sudo magichun |
Grant admin privileges |
| 4 | id magichun |
Verify user and group membership |
| 5 |
cat /etc/passwd \ grep magichun
|
Confirm user in system registry |
| 6 | su magichun |
Switch active user |
| 7 | whoami |
Check current identity |
| 8 | pwd |
Print working directory |
| 9 | psql --version |
Check PostgreSQL version |
| 10 | sudo systemctl status postgresql |
Check service status |
| 11 | sudo -i -u postgres |
Switch to postgres admin |
| 12 | ls -lh /home/magichun/ |
List files with sizes |
| 13 | head -5 kenya_counties.csv |
Preview file contents |
| 14 | chmod 755 /home/magichun |
Open directory permissions |
| 15 | chmod 644 /home/magichun/kenya_counties.csv |
Set file read permissions |
| 16 | scp "D:\file.csv" user@IP:/path/ |
Upload file to server |
| 17 | scp -P 22 user@IP:/path/file.csv "D:\dest" |
Download file from server |
Clearly, this assignment wasn't a walk in the park, including writing this article, but I pushed through, and that's where growth happens. Perhaps the biggest lesson was realizing that some of the skills I was most confident in were the very ones that exposed my blind spots.
Working on the errors and mistakes through research and external help really brought growth to my skills and confidence.
As someone still early in this journey, I'd appreciate comments, suggestions, and corrections, as they are my real learning curve and the most valuable part of my journey.
*This article documents a real setup task, configuring servers and pulling data, a project completed as part of the LuxDevHQ Data Engineering Program.























Top comments (0)