Amo InvAnalysis

Posted on Feb 10 • Edited on Jun 17

Linux Fundamentals for Data Engineering

#cli #dataengineering #linux #tutorial

A common assumption is that once you’ve mastered Python and SQL, you have it all figured out as a data engineer. That’s only part of the picture. Fluency in the terminal is what really sets you apart as a competent data engineer.

Why, you might ask? It’s simple. Most, if not all, of the infrastructure in the data engineering ecosystem runs on Linux, which in turn leans heavily on the terminal. As a data engineer, you’re bound to face a failed pipeline in a Kubernetes pod or something similar.

In such circumstances, you usually won’t have a graphical user interface (GUI) tool to fall back on. You’ll have to get your hands dirty, figure out what went wrong, and even apply the fix from the terminal.

Data engineers don’t always have the luxury of swanky GUI tools the way data analysts or business intelligence analysts do. Few, if any, of the tools they use have a GUI component, so learning to use the terminal, and by extension terminal-based editors like Vi and Nano, is non-negotiable.

Dig in and learn basic navigation, essential file management in the terminal, and how to edit files on a server using Vi and Nano.

Why Data Engineers Live in the Terminal

GUIs, as easy, intuitive, or convenient as they may seem, are often impractical, slow, or simply unavailable in data engineering. However, this is not necessarily a bad thing. From a data engineer’s perspective, the terminal is just as good, or even better.

You can do more, and do it faster, with the terminal. Here are three reasons why the terminal is, or is about to become, your staple as a data engineer:

I. Cloud Dominance

As a discipline, data engineering involves systems and infrastructure that collect, store, and process massive amounts of raw data with the aim of transforming it into usable formats. This naturally calls for significant investment in infrastructure, which, while possible in some circumstances, is certainly not feasible in all cases.

This is where cloud providers such as Amazon Web Services (AWS) and Google Cloud Platform (GCP) come into play. Data engineers and organizations can “rent” these massive systems and infrastructure at a fraction of what it would cost to build them, and use them to collect, store, and process their data.

The upside is that organizations and data engineers reduce their operational overhead while shifting part of the ever-present security burden to the cloud vendor. There is also infrastructure de-risking: the risk and responsibility of building, maintaining, and scaling the necessary systems moves to a more capable provider.

The downside to this? Not so much, if terminal is second nature to you. So much if not.

To provide such a critical service, cloud providers run most of their infrastructure on Linux, which is typically lighter on resources: less RAM used at idle and fewer background processes eating into precious CPU power.

The price for this is more terminal usage and little to no GUI. Simply put, the terminal is a primary skill for a data engineer, and sharpening it will make you a better, more effective one.

II. Big Data Tools

Tools like Microsoft Excel or Google Sheets are impressive for simple data analysis, or even for extraction and light processing work. However, for a data engineer, they’re vastly underequipped for the job.

To put it into perspective, picture this: Microsoft Excel is akin to a spade that is only appropriate for shoveling a small pile of dust, whereas data engineering involves breaking down and processing a whole mountain. Hence the need for something better equipped for the task, like Spark, Kafka, and Hadoop, to name a few.

Most of these tools are Linux-native, meaning they are either built for Linux or run best on it, which again involves a lot of terminal usage. But there’s more to it than that.

When the main objective is to process whole “mountains” of data, there is no RAM or CPU power to spare for fancy GUI icons and animations.

III. The Automation Factor

Automation is the name of the game in data engineering. Imagine this for a second. Your job involves pulling a factory’s throughput data each day at exactly 2:30 AM, as it’s the only time the factory’s operational costs are low. After extracting the data, you’re supposed to clean it and store it in a specified data lake, six days a week.

The smart thing to do in this case is to automate everything, but you also have to ensure everything runs correctly, consistently, and reliably. You can’t risk your job by building the workflow on something that wasn’t designed for automation. This again brings you back to Linux, the industry standard when it comes to automation.

With Linux, such a job is as easy as writing a simple Bash script and setting up a cron job so that the workflow runs at precisely 2:30 AM, six days a week, without you lifting a finger.
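
To make that concrete, here is a minimal sketch of what such a job could look like. Everything in it is an assumption for illustration: the file names, the “cleaning” rule (dropping blank lines), the script location in the crontab entry, and the choice of Monday through Saturday as the six days. The demo writes to a temporary directory so it is safe to run anywhere.

```shell
#!/usr/bin/env bash
# Sketch of a nightly clean-and-store job. All paths are placeholders;
# the demo uses a temporary directory so it is safe to run anywhere.

work_dir="$(mktemp -d)"
raw_file="$work_dir/throughput_raw.csv"   # stand-in for the factory export
lake_dir="$work_dir/lake"                 # stand-in for the data lake path

# Fake "raw" data with blank lines that need cleaning.
printf 'machine,units\n\nA,120\n\nB,95\n' > "$raw_file"

# "Clean" step: drop blank lines, then store the result under today's date.
mkdir -p "$lake_dir"
clean_file="$lake_dir/throughput_$(date +%F).csv"
grep -v '^[[:space:]]*$' "$raw_file" > "$clean_file"

echo "stored $(wc -l < "$clean_file" | tr -d ' ') lines in $clean_file"

# Crontab entry (added with `crontab -e`) that would run a script like this
# at 2:30 AM, Monday through Saturday. The script and log paths are hypothetical:
# 30 2 * * 1-6 /home/etl/bin/clean_throughput.sh >> /home/etl/logs/clean.log 2>&1
```

The five crontab fields are minute, hour, day of month, month, and day of week, so `30 2 * * 1-6` reads as “minute 30 of hour 2, every day of every month, but only on weekdays 1 to 6.”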

Despite the fearmongering around the terminal, it is actually an amazing, convenient tool once you get the hang of it. The more you keep at it, the better you’ll get at working your way around it.

Basic Linux Commands – Essential Navigation & File Management

Now it’s time to slay the beast. Here are a few basic Linux commands to help you navigate the terminal and manage your files.

Finding Your Bearings

Not sure where you are once you’re in the terminal? Type in and run this command:

pwd

The Print Working Directory (pwd) command tells you what folder or directory you’re in, or as the definition says, the directory you’re working from.

Now you know what folder or directory you’re working from but can’t tell what else lurks within. Run this command to find out:

ls

The list (ls) command shines a spotlight on what else is in the folder or directory you’re working in. However, its output isn’t very descriptive. If you need more detail, run the following command:

ls -lh

This combination adds extra details about each file and folder, such as permissions, owner, and modification time, and prints file sizes in an easy-to-read form. Instead of 52428800 bytes, you get 50M, which is much easier to make sense of.
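
Here is a short sketch you can run to see the difference for yourself. It works in a throwaway directory, and the file name `events.csv` is made up for the demo.

```shell
# Demo in a throwaway directory; the file name is made up for illustration.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

pwd                                   # prints the directory you are in

# Create a file of exactly 52428800 bytes (50 * 1024 * 1024).
head -c 52428800 /dev/zero > events.csv

ls                                    # names only
ls -l                                 # long listing, size shown in bytes
ls -lh                                # long listing, size shown as 50M

# Clean up afterwards with: rm -r "$demo_dir"
```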

Navigating Around

Moving around is the easy part. Navigating in the terminal comes down to this simple command:

cd

The change directory (cd) command helps you move up or down the folder/directory structure. For example:

cd folder_name

This moves you into “folder_name”, which now becomes your current working directory. If you need to go back to the parent folder, or in Linux terms, move up a directory, run the following command:

cd ..

And that’s it for navigating around the terminal.
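
The sketch below strings these moves together in a throwaway directory. The folder names `projects` and `pipelines` are hypothetical, and `cd -` (jump back to the directory you were in last) is an extra that isn’t covered above.

```shell
base="$(mktemp -d)"                   # throwaway directory for the demo
mkdir -p "$base/projects/pipelines"   # hypothetical folder names

cd "$base/projects"                   # absolute path: works from anywhere
pwd                                   # ends in /projects
cd pipelines                          # relative path: down one level
pwd                                   # ends in /projects/pipelines
cd ..                                 # up one level
pwd                                   # ends in /projects
cd -                                  # back to the directory you were in last
```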

Viewing Files Safely

Viewing files is another easy part, but with a major caveat. Done the wrong way, you can easily flood or freeze your terminal, and you don’t want that. Several commands do roughly the same thing, but used in the wrong context they can cause problems.

A fine example of this is using the following command to view files:

cat filename.md

Although cat is commonly used to quickly view files and, with output redirection, to create or append to them, it dumps the entire contents of the file to the terminal at once. On a sufficiently large file, that can flood your screen with output and leave the session unresponsive.

Here’s how to view files safely in the terminal:

Run the following command, with “filename.md” being the name of the file you want to view:

less filename.md

This command opens the file in a pager, a bit like a book reader, that lets you scroll through the file’s contents one screen at a time without overwhelming your terminal session. Press q to quit.

Alternatively, you can safely view parts of a file using these two commands:

head filename.md

and

tail filename.md

These two show the first and the last 10 lines of a given file, respectively.
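
As a quick sketch, the commands below build a 100,000-line stand-in for a large log file and peek at both ends of it. The `-n` option, which sets how many lines to show, and the `wc -l` line count are additions not covered above.

```shell
log_file="$(mktemp)"                # temporary stand-in for a large log
seq 1 100000 > "$log_file"          # 100,000 numbered lines

wc -l "$log_file"                   # check the line count before opening
head "$log_file"                    # first 10 lines: 1 to 10
tail "$log_file"                    # last 10 lines: 99991 to 100000
head -n 3 "$log_file"               # first 3 lines only
tail -n 3 "$log_file"               # last 3 lines only

# less is interactive, so run it by hand. Quit with q.
# less "$log_file"
```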

Creating and Deleting Folders/Directories

Creating and deleting folders or directories in the terminal is just as simple. Run the following commands, with “folder_name” being the name of the directory:

mkdir folder_name

The above command creates a folder;

rmdir folder_name

whereas rmdir removes a folder/directory, provided it is empty.
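
The sketch below shows both commands in a throwaway location, including what happens when rmdir meets a directory that still has something in it. The directory names are hypothetical, and `touch`, `rm`, and the `-p` option are extras used only to set up the demo.

```shell
cd "$(mktemp -d)"                   # throwaway location for the demo

mkdir staging                       # create one directory
mkdir -p raw/2024/01                # -p creates missing parent folders too
touch staging/batch.csv             # put an empty file inside staging

# rmdir refuses to delete a directory that still has contents.
rmdir staging 2>/dev/null || echo "staging is not empty"

rm staging/batch.csv                # empty it first
rmdir staging                       # now the removal succeeds
ls                                  # only raw remains
```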

Nano vs. Vi/Vim Plus Practical Usage

Aside from knowing basic Linux commands, at some point you’ll have to edit files. While you can use any editor, it’s best practice to settle on one and master it. So which should you go for, Nano or Vi/Vim?

The answer depends on your personal preferences and the task at hand. Here’s a quick rundown of the differences between the editors that should hopefully inform your pick.

First off, Nano is a simple-looking editor, much like Notepad on Windows. Vi/Vim, on the other hand, is anything but simple, at least at first.

However, the tradeoff for Nano’s ease of use is less utility. Vi/Vim, despite its steep learning curve, is considerably more powerful.

When it comes down to it, Nano is perfect for simple, regular editing. With time and effort, Vi/Vim can be just as simple. Unfortunately, Nano isn’t always installed on servers, whereas Vi/Vim is available on most of them and is often the default editor.

Both are capable editors, but if your aim is to be able to work on as many servers as possible, mastering Vi/Vim is the better choice.
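
If you’re unsure which editors a given server actually has, a small check like this one can tell you. The function name `check_editors` is made up for this sketch. It relies on `command -v`, which prints the path of a command when it is installed.

```shell
# Report which of the given editors are installed on this machine.
check_editors() {
  for editor in "$@"; do
    if command -v "$editor" >/dev/null 2>&1; then
      echo "$editor: $(command -v "$editor")"
    else
      echo "$editor: not installed"
    fi
  done
}

check_editors nano vi vim
```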

Now let’s move on to the practical side of things. Here’s how to edit files using Nano and Vi/Vim:

Nano

First off, navigate to the directory you want to work from, then run the following command, with “myfirstedit.md” being the filename:

nano myfirstedit.md

This opens the file in Nano and, as you can see, it’s just like any other text editor you’ve worked with before. You can start typing right away, and the relevant shortcuts for navigating and using the editor are listed at the bottom.

After you’ve edited the file to your satisfaction, press Ctrl + O to save your changes, then press Enter to confirm the filename (or type one if the file doesn’t have a name yet).

(You may notice that all the shortcuts at the bottom of the terminal start with the symbol “^”, which represents the “Ctrl” key.)

Lastly, press Ctrl + X to close the editor.

Vi/Vim

After navigating to the appropriate directory, run the following command in the terminal:

vi myFirstEdit.md

This opens your file in the Vi editor. As you can see, the editor is stripped bare, with zero clues on how to use it. You also can’t edit the file right away. You first have to enter “insert mode” by pressing i (the letter “i”).

Once in insert mode, type away. When you’re done, press the “Esc” key to exit insert mode, type “:wq” (a colon followed by “wq”), then press “Enter” to save your changes and quit.

And that’s how easy it is to use Nano and Vi/Vim. The more you do it, the easier it becomes with time.

Wrapping Up

The terminal isn't your enemy, it's your advantage.

As a data engineer, the Linux terminal is where the real work happens. Python and SQL open doors, but terminal fluency keeps you effective when pipelines fail at 3 AM, when you're troubleshooting a Kubernetes pod, or when you're automating workflows that process terabytes of data daily.

GUI tools won't save you in production environments. Cloud infrastructure runs on Linux. Big data tools like Spark and Hadoop are Linux-native. Automation workflows depend on bash scripts and Cron jobs. The sooner you learn the terminal, the faster you'll move from competent to indispensable.

Start small. Practice basic navigation with pwd, ls, and cd until they become second nature. Get comfortable viewing files with less instead of risking a terminal freeze with cat. Pick your editor—Nano for simplicity, Vi/Vim for power and ubiquity—and commit to mastering it. These aren't just commands; they're the building blocks of everything you'll do as a data engineer.

The learning curve feels steep at first. Every data engineer you admire went through this same process. They struggled with Vi's modes, forgot to add -lh to their ls commands, and accidentally froze their terminals with massive files. The difference between them and beginners isn't talent, it's repetition.

The more you work in the terminal, the more natural it becomes. What feels awkward today will be muscle memory in weeks. Commands that seem cryptic now will become your preferred way of working because they're faster, more powerful, and more reliable than any GUI alternative.

Your job as a data engineer is to build systems that turn raw data into business value. The terminal is your workshop. Learn to use it well, and you'll build better systems, solve problems faster, and establish yourself as someone who can handle whatever production throws at you.

Get comfortable in the terminal. Your future self will thank you.

DEV Community