If you're scared of Linux as a beginner data engineer, you're not alone. Almost everyone feels this way at the start. This year, I decided to transition from being a data analyst to a data engineer with zero Linux experience.
Over the past two weeks, I’ve been learning practical Linux skills and how they apply to solving real-world data problems for businesses. Here’s a summary of what I’ve learned.
Firstly, every stage of the data engineering pipeline runs on Linux servers, usually in the cloud.
As a data engineer, here’s what I’ll actually use Linux for:
- Setting up and managing servers: Configuring the machines where your data tools run.
- Scheduling jobs: Using cron to trigger data pipelines automatically (see the sketch after this list).
- Debugging failures: Connecting via SSH to investigate logs when a pipeline breaks.
- Moving and managing files: Handling raw data before it lands in storage like S3.
- Installing tools: Setting up Python, Spark, Airflow, and other software on a server.
- Monitoring resources: Checking server memory, disk usage, and overall health.
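To make the scheduling part concrete, here is a minimal sketch of a cron entry. The schedule, script path, and log file are placeholders I chose for illustration; you edit your own schedule with crontab -e:
crontab -e    # Open your user's crontab in an editor
# Example entry: run a (hypothetical) ingestion script every day at 06:00
0 6 * * * /usr/bin/python3 /home/grace/Project/main.py >> /home/grace/cron.log 2>&1
The five fields before the command are minute, hour, day of month, month, and day of week.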
Secondly, in real life businesses pull data from APIs, databases, or external files every day, and that data usually has to be pulled automatically by a Linux server. To achieve this, you first have to learn how to:
- Connect to a virtual Linux server
- Manage files on the server
Below are simplified steps to achieve this.
Step 1: Connect to the Server via SSH
SSH (Secure Shell) opens an encrypted terminal session to a remote server. You need two things:
- The server's IP address
- Your username
On Windows, you can use PowerShell or Git Bash; I used PowerShell.
ssh root@118.173.249.268
- Type yes to accept the server key.
- Enter your password (it won't show as you type).
- Press Enter, and you're in!
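Optional but handy: key-based login skips the password prompt on every connection. A minimal sketch, run from your local machine in Git Bash or a Linux/macOS terminal (PowerShell's built-in OpenSSH client does not ship ssh-copy-id); repeat the copy step for the personal user you create in Step 3:
ssh-keygen -t ed25519               # Generate a key pair locally (accept the defaults)
ssh-copy-id root@118.173.249.268    # Copy your public key to the server
ssh root@118.173.249.268            # Later logins no longer ask for a password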
Step 2: Update the Server
Always update your server before doing anything else:
sudo apt update # Refresh the list of available updates
sudo apt upgrade # Install the updates
Then take a quick look around:
pwd # Show your current directory
ls # List files and folders
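While you are connected, a few commands give you a quick read on server health, which ties back to the "monitoring resources" point above. These are standard on Ubuntu/Debian servers:
free -h   # Memory usage in human-readable units
df -h     # Disk usage per filesystem
uptime    # Load average and how long the server has been running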
Step 3: Create Your Own User
Avoid using root regularly by creating a personal user right after setup:
sudo useradd -m grace # Create user with home folder
sudo passwd grace # Set password
logout # Log out from root
ssh grace@118.173.249.268
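One gap worth closing: a user created with useradd cannot run sudo commands yet. A minimal sketch, assuming a Debian/Ubuntu server where the admin group is called sudo; run it while still logged in as root, before logging out:
usermod -aG sudo grace    # Add grace to the sudo group
After that, grace can run administrative commands such as apt upgrade with sudo.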
Step 4: Create Folders and Files
Now that you are logged in as your own user, organize your workspace:
mkdir Project # Create a folder
cd Project # Enter the folder
touch main.py # Create a Python file
mkdir data # Create a sub-folder
ls # Verify folder and files
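The "moving and managing files" part of the job mostly comes down to a few more commands. A quick sketch with made-up file names:
cp main.py backup.py    # Copy a file
mv backup.py data/      # Move a file into the data folder
rm data/backup.py       # Delete a file (there is no recycle bin, so be careful)
mv main.py ingest.py    # mv also renames files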
Step 5: Edit Files
Use nano to write or paste your code into the file:
nano main.py
- Paste your text or code
- Press Ctrl + O, then Enter, to save
- Press Ctrl + X to exit
View file contents anytime with:
cat main.py # Print the whole file
less main.py # Scroll through the file (press q to quit)
more main.py # Page through the file one screen at a time
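These same viewing commands are how you debug pipelines later, because logs are just text files. A small sketch, assuming a Debian/Ubuntu server where the system log lives at /var/log/syslog (your pipeline's own log path will differ):
tail -n 50 /var/log/syslog      # Show the last 50 lines of the system log
tail -f /var/log/syslog         # Follow new lines as they arrive (Ctrl + C to stop)
grep -i error /var/log/syslog   # Search the log for lines containing "error"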
Step 6: Download Files from the Web and Manage Them
Now that the workspace is set up, you can bring in data files:
wget https://example.com/data.csv # Download a file
tar -xzf archive.tgz # Extract compressed files
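Once a file has landed, a quick sanity check saves debugging later (data.csv here is just the placeholder name from the wget example above):
head -n 5 data.csv    # Preview the first five rows
wc -l data.csv        # Count the rows
du -h data.csv        # Check the file size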
Step 7: Transfer Files Between Your Local PC and the Server
Copy files from your local machine to the server with SCP (Secure Copy Protocol). Run the command from your local terminal, and make sure the destination folder (from_local here) already exists on the server:
scp main.py grace@118.173.249.268:/home/grace/from_local/
On the server, navigate to the folder and run your script:
cd from_local
python3 main.py
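SCP works in both directions, so you can also pull results back down to your laptop. A sketch with a made-up output file, again run from your local terminal:
scp grace@118.173.249.268:/home/grace/Project/data/output.csv .    # Copy from the server into the current local folder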
Summary: Takeaways as a Beginner
- Every tool in a data engineering pipeline runs on a Linux server, so you need Linux to navigate, organize, and run tasks there.
- SSH is your bridge between your PC and the server.
- Always update your server and create a personal user before anything else.
- Start small: create folders, files, and scripts, then automate tasks.
- Everything you do here mirrors real-world data engineering work, like managing pipelines, logs, or datasets.
If you’re also learning Linux for data engineering, what’s been challenging for you so far? Drop a comment. I’d love to learn from your experience.
Also, stay tuned for my next progress update in two weeks.