If you're scared of Linux as a beginner data engineer, you're not alone. Almost everyone feels this way at the start. This year, I decided to transition from being a data analyst to a data engineer with zero Linux experience.
Over the past two weeks, I’ve been learning practical Linux skills and how they apply to solving real-world data problems for businesses. Here’s a summary of what I’ve learned.
Firstly, every stage of the data engineering pipeline runs on Linux servers, usually in the cloud.
As a data engineer, here’s what I’ll actually use Linux for:
- Setting up and managing servers: Configuring the machines where your data tools run.
- Scheduling jobs: Using cron to trigger data pipelines automatically (see the sketch after this list).
- Debugging failures: Connecting via SSH to investigate logs when a pipeline breaks.
- Moving and managing files: Handling raw data before it lands in storage like S3.
- Installing tools: Setting up Python, Spark, Airflow, and other software on a server.
- Monitoring resources: Checking server memory, disk usage, and overall health.
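To make the scheduling part concrete, here is a minimal sketch of a cron entry. The schedule, script path, and log file are placeholders I chose for illustration; you edit your own schedule with crontab -e:
crontab -e    # Open your user's crontab in an editor
# Example entry: run a (hypothetical) ingestion script every day at 06:00
0 6 * * * /usr/bin/python3 /home/grace/Project/main.py >> /home/grace/cron.log 2>&1
The five fields before the command are minute, hour, day of month, month, and day of week.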
Secondly, in real life businesses pull data from APIs, databases, or external files every day, and that data usually has to be pulled automatically by a Linux server. To achieve this, you first have to learn how to:
- Connect to a virtual Linux server
- Manage files on the server
Below are simplified steps to achieve this.
Step 1: Connect to the Server via SSH
SSH (Secure Shell) opens an encrypted terminal session to a remote server. You need two things:
- The server's IP address
- Your username
On Windows, you can use PowerShell or Git Bash; I used PowerShell.
ssh root@118.173.249.268
- Type yes to accept the server key.
- Enter your password (it won't show as you type).
- Press Enter, and you're in!
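Optional but handy: key-based login skips the password prompt on every connection. A minimal sketch, run from your local machine in Git Bash or a Linux/macOS terminal (PowerShell's built-in OpenSSH client does not ship ssh-copy-id); repeat the copy step for the personal user you create in Step 3:
ssh-keygen -t ed25519               # Generate a key pair locally (accept the defaults)
ssh-copy-id root@118.173.249.268    # Copy your public key to the server
ssh root@118.173.249.268            # Later logins no longer ask for a password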
Step 2: Update the Server
Always update your server before doing anything else:
sudo apt update # Refresh the list of available updates
sudo apt upgrade # Install the updates
Then take a quick look around:
pwd # Show your current directory
ls # List files and folders
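While you are connected, a few commands give you a quick read on server health, which ties back to the "monitoring resources" point above. These are standard on Ubuntu/Debian servers:
free -h   # Memory usage in human-readable units
df -h     # Disk usage per filesystem
uptime    # Load average and how long the server has been running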
Step 3: Create Your Own User
Avoid using root regularly by creating a personal user right after setup:
sudo useradd -m grace # Create user with home folder
sudo passwd grace # Set password
logout # Log out from root
ssh grace@118.173.249.268
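One gap worth closing: a user created with useradd cannot run sudo commands yet. A minimal sketch, assuming a Debian/Ubuntu server where the admin group is called sudo; run it while still logged in as root, before logging out:
usermod -aG sudo grace    # Add grace to the sudo group
After that, grace can run administrative commands such as apt upgrade with sudo.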
Step 4: Create Folders and Files
Now that you are logged in as your own user, organize your workspace:
mkdir Project # Create a folder
cd Project # Enter the folder
touch main.py # Create a Python file
mkdir data # Create a sub-folder
ls # Verify folder and files
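The "moving and managing files" part of the job mostly comes down to a few more commands. A quick sketch with made-up file names:
cp main.py backup.py    # Copy a file
mv backup.py data/      # Move a file into the data folder
rm data/backup.py       # Delete a file (there is no recycle bin, so be careful)
mv main.py ingest.py    # mv also renames files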
Step 5: Edit Files
Use nano to write or paste your code into the file:
nano main.py
- Paste your text or code
- Press Ctrl + O, then Enter, to save
- Press Ctrl + X to exit
View file contents anytime with:
cat main.py # Print the whole file
less main.py # Scroll through the file (press q to quit)
more main.py # Page through the file one screen at a time
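These same viewing commands are how you debug pipelines later, because logs are just text files. A small sketch, assuming a Debian/Ubuntu server where the system log lives at /var/log/syslog (your pipeline's own log path will differ):
tail -n 50 /var/log/syslog      # Show the last 50 lines of the system log
tail -f /var/log/syslog         # Follow new lines as they arrive (Ctrl + C to stop)
grep -i error /var/log/syslog   # Search the log for lines containing "error"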
Step 6: Download Files from the Web and Manage Them
Now that the workspace is set up, you can bring in data files:
wget https://example.com/data.csv # Download a file
tar -xzf archive.tgz # Extract compressed files
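Once a file has landed, a quick sanity check saves debugging later (data.csv here is just the placeholder name from the wget example above):
head -n 5 data.csv    # Preview the first five rows
wc -l data.csv        # Count the rows
du -h data.csv        # Check the file size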
Step 7: Transfer Files Between Your Local PC and the Server
Copy files from your local machine to the server with SCP (Secure Copy Protocol). Run the command from your local terminal, and make sure the destination folder (from_local here) already exists on the server:
scp main.py grace@118.173.249.268:/home/grace/from_local/
On the server, navigate to the folder and run your script:
cd from_local
python3 main.py
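SCP works in both directions, so you can also pull results back down to your laptop. A sketch with a made-up output file, again run from your local terminal:
scp grace@118.173.249.268:/home/grace/Project/data/output.csv .    # Copy from the server into the current local folder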
Summary: Takeaways as a Beginner
- Every tool in a data engineering pipeline runs on a Linux server, so you need Linux to navigate, organize, and run tasks there.
- SSH is your bridge between your PC and the server.
- Always update your server and create a personal user before anything else.
- Start small: create folders, files, and scripts, then automate tasks.
- Everything you do here mirrors real-world data engineering work, like managing pipelines, logs, or datasets.
If you’re also learning Linux for data engineering, what’s been challenging for you so far? Drop a comment. I’d love to learn from your experience.
Also, stay tuned for my next progress update in two weeks.