DEV Community

Gracemunyi

Scared of Linux as a Beginner Data Engineer? Here’s How to Get Started

If you're scared of Linux as a beginner data engineer, you're not alone. Almost everyone feels this way at the start. This year, I decided to transition from being a data analyst to a data engineer with zero Linux experience.

Over the past two weeks, I’ve been learning practical Linux skills and how they apply to solving real-world data problems for businesses. Here’s a summary of what I’ve learned.

First, nearly every stage of the data engineering pipeline runs on Linux servers, usually in the cloud.

As a data engineer, here’s what I’ll actually use Linux for:

  • Setting up and managing servers: Configuring the machines where your data tools run.
  • Scheduling jobs: Using cron to trigger data pipelines automatically.
  • Debugging failures: Connecting via SSH to investigate logs when a pipeline breaks.
  • Moving and managing files: Handling raw data before it lands in storage like S3.
  • Installing tools: Setting up Python, Spark, Airflow, and other software on a server.
  • Monitoring resources: Checking server memory, disk usage, and overall health.
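
As a rough illustration of the monitoring item above, here is a minimal sketch of a disk check. The 80% threshold and the messages are my own assumptions for the example, not a standard:

```shell
# Show disk usage for the root filesystem in human-readable units
df -h /

# Pull out just the "Use%" number; -P forces one line per filesystem
USED_PCT=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')

# Hypothetical threshold: warn when the disk is more than 80% full
if [ "$USED_PCT" -gt 80 ]; then
    echo "Warning: root filesystem is ${USED_PCT}% full"
else
    echo "Disk usage is fine: ${USED_PCT}% used"
fi
```

For memory, free -h gives a similar human-readable summary.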

Second, businesses pull data from APIs, databases, or external files every day. In practice, that data is usually pulled automatically by jobs running on a Linux server.

To do this, you have to learn how to:

  • Connect to a virtual Linux server
  • Manage files on the server

Below are simplified steps to achieve this.


Step 1: Connect to the Server via SSH

SSH (Secure Shell) opens an encrypted terminal session to a remote server. You need two things:

  • The server's IP address
  • Your username

On Windows, you can use PowerShell or Git Bash (I used PowerShell). Replace the address below with your own server's IP; the one shown is only a placeholder and is not a valid address:

ssh root@118.173.249.268
  • Type yes to accept the server's host key (you are only asked the first time you connect).
  • Enter your password (nothing appears on screen as you type).
  • Press Enter, and you're in!

Step 2: Update the Server

Update the server before doing anything else, then check where you are:

sudo apt update      # Refresh the package lists
sudo apt upgrade     # Install available updates
pwd                  # Print your current directory
ls                   # List files and folders

Step 3: Create Your Own User

Avoid working as root day to day. Create a personal user right after setup:

sudo useradd -m -s /bin/bash grace   # Create user with a home folder and Bash as the login shell
sudo passwd grace                    # Set the user's password
logout                               # Log out of the root session
ssh grace@118.173.249.268
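
After reconnecting, it can help to confirm that the session is no longer running as root. A small sketch using id -u, which prints the numeric user ID (0 always means root); the messages are only illustrative:

```shell
# Numeric ID of the user running this shell; root is always 0
USER_ID=$(id -u)

if [ "$USER_ID" -eq 0 ]; then
    echo "You are still logged in as root"
else
    echo "You are logged in as a regular user (uid ${USER_ID})"
fi
```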

Step 4: Create Folders and Files

Now that you are logged in as your own user, organize your workspace:

mkdir Project        # Create a folder
cd Project           # Enter the folder
touch main.py        # Create a Python file
mkdir data           # Create a sub-folder
ls                   # Verify folder and files
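
The same layout can be built in one go with mkdir -p, which creates any missing parent folders and does not complain if a folder already exists. This is only a sketch: it works in a throwaway directory, and the raw and processed sub-folder names are my own assumption, not part of the steps above:

```shell
# Work in a throwaway directory so nothing in your home folder is touched
WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# -p creates parent folders as needed and is safe to re-run
mkdir -p Project/data/raw Project/data/processed
touch Project/main.py

# List everything under Project, one path per line
find Project | sort
```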

Step 5: Edit Files

Use nano to write or paste your code into the file:

nano main.py
  • Paste your text or code
  • Press Ctrl + O, then Enter, to save
  • Press Ctrl + X to exit

View file contents anytime with:

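cat main.py          # Print the whole file to the terminal
less main.py         # Scroll through the file (press q to quit)
more main.py         # Page through the file one screen at a time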
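
For data files that are too long to read comfortably with cat, head, tail, and wc give a quick look without printing everything. The sketch below builds a tiny sample CSV in a temporary file; the file contents are made up for the example:

```shell
# Create a small sample file to inspect
SAMPLE=$(mktemp)
printf 'id,name\n1,alice\n2,bob\n3,carol\n' > "$SAMPLE"

head -n 2 "$SAMPLE"              # First two lines (header plus one row)
tail -n 1 "$SAMPLE"              # Last line only
ROW_COUNT=$(wc -l < "$SAMPLE")   # Total number of lines, header included
echo "Lines in file: $ROW_COUNT"
```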

Step 6: Download Files from the Web and Manage Them

Now that the workspace is set up, you can bring in data files:

wget https://example.com/data.csv      # Download a file into the current folder
tar -xzf archive.tgz                   # Extract a gzip-compressed tar archive (.tgz or .tar.gz)
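
To see how the tar flags fit together without needing a real download, here is a self-contained sketch that creates a small archive, lists it, and extracts it again. The folder names and file contents are invented for the example:

```shell
# Work in a throwaway directory
WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# Build something to compress
mkdir source
printf 'id,value\n1,10\n' > source/data.csv

tar -czf archive.tgz source          # -c create, -z gzip, -f archive file name
tar -tzf archive.tgz                 # -t list the contents without extracting

mkdir extracted
tar -xzf archive.tgz -C extracted    # -x extract, -C choose the destination folder
cat extracted/source/data.csv
```

Listing with -t before extracting is a cheap way to check what an unfamiliar archive will unpack.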

Step 7: Transfer Files Between Your Local PC and Server

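Move files from your local machine to the server using SCP (Secure Copy Protocol). Run this from a terminal on your local machine, not inside the SSH session. The destination folder (from_local here) must already exist on the server, so create it with mkdir first if needed: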

scp main.py grace@118.173.249.268:/home/grace/from_local/

On the server, navigate to the folder and run your script:

cd from_local
python3 main.py
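
Once the script runs by hand, cron can run it on a schedule, as mentioned at the start. A minimal sketch, assuming a daily run at 02:00: the schedule, the cron.log file name, and the /usr/bin/python3 path are my assumptions, so adjust them for your server:

```shell
# Hypothetical schedule: minute 0, hour 2, every day of the month, every month, every weekday
CRON_LINE='0 2 * * * cd /home/grace/from_local && /usr/bin/python3 main.py >> cron.log 2>&1'

# Print the entry so it can be pasted into the editor that "crontab -e" opens
echo "$CRON_LINE"
```

Redirecting output to a log file gives you something to inspect when a scheduled run fails.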

Key Takeaways for Beginners

  • Most tools in a data engineering pipeline run on Linux servers, so you need to be able to navigate, organize files, and run tasks there.
  • SSH is your bridge between your PC and the server.
  • Always update your server and create a personal user before anything else.
  • Start small: create folders, files, and scripts, then automate tasks.
  • Everything you do here mirrors real-world data engineering work, like managing pipelines, logs, or datasets.

If you’re also learning Linux for data engineering, what’s been challenging for you so far? Drop a comment. I’d love to learn from your experience.

Also, stay tuned for my next progress update in two weeks.
