Bobby Iliev

Posted on • Originally published at devdojo.com

The top 10 Linux Commands that Every Data Engineer Should Know

Introduction

Whether you are a data analyst or a system administrator, when working with unstructured files in Linux there is a set of commands that will help you a lot in your daily tasks.

In this tutorial, you will learn the basics of the following commands and how to combine them using a pipe (|) to analyze some raw data files.

The commands that we will go over are:

  • cat
  • head and tail
  • pipe (|)
  • wc
  • grep
  • awk
  • sort
  • uniq
  • sed
  • tr

Prerequisites

Here is a very simple example file that I will be using to go over each of the commands:

Username    ID  Name    Orders
rachel  9012    Rachel  2
laura   2070    Laura   3
craig   4081    Craig   4
mary    9346    Mary    1
jamie   5079    Jamie   2
bobby   1456    Bobby   3
tony 5489 Tony  6
devdojo 9874 DevDojo    3
tom 2589 Tom    2
rachel  9012    Rachel  2
laura   2070    Laura   3
craig   4081    Craig   4
mary    9346    Mary    1
jamie   5079    Jamie   2
bobby   1456    Bobby   3
tony 5489 Tony  6
devdojo 9874 DevDojo    3
tom 2589 Tom    2

First, install nano if you do not have it already. On Debian or Ubuntu (prefix the command with sudo if you are not logged in as root):

apt install nano

Copy the text above and use your favorite text editor to create a new file:

nano demo.txt

Then paste the text and save the file.
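
If you would rather not paste by hand, the same file can also be generated from the shell. This is only a sketch: it assumes a POSIX shell, writes demo.txt into the current directory, and separates the columns with single tabs (the listing above mixes tabs and spaces, which awk and sort treat alike). The rows helper is a name made up for this example.

```shell
# Hypothetical helper: printf reuses its format string for every group of
# four arguments, so each group becomes one tab-separated line.
rows() {
  printf '%s\t%s\t%s\t%s\n' \
    rachel 9012 Rachel 2 \
    laura 2070 Laura 3 \
    craig 4081 Craig 4 \
    mary 9346 Mary 1 \
    jamie 5079 Jamie 2 \
    bobby 1456 Bobby 3 \
    tony 5489 Tony 6 \
    devdojo 9874 DevDojo 3 \
    tom 2589 Tom 2
}

# Header first, then the nine records twice, to reproduce the duplicates.
{
  printf '%s\t%s\t%s\t%s\n' Username ID Name Orders
  rows
  rows
} > demo.txt
```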

With the demo data in place, let's go ahead and learn some of the most important commands that you should know in order to analyze the data!

The cat command

The cat command is used to print out the content of the file directly on your screen.

The syntax is the following:

cat file_name_here

In our case, as the file name is demo.txt we would run the following command:

cat demo.txt

This is very useful when you do not know what's in the file and want to take a quick glance.

The head and tail commands

The cat command is great as you can get the content of a file without opening it with an actual editor.

However, in some cases, if the file is huge, it might take a while to print out the whole content on your screen. So let's see how we could limit that!

head

You would use the head command to get the first 10 lines of a particular file.

Syntax:

head demo.txt

This would only print the first 10 lines of the file and would be very handy if you were working with a huge file and did not need to see the whole content on your screen.

You can also add the -n flag followed by the number of lines that you want, in order to override the default of 10 lines. Example:

head -n 5 demo.txt

The above would only print out the first 5 lines of the file.

You could actually exclude the -n flag altogether and use a dash followed by the number of lines directly, for example:

head -12 demo.txt

The above is going to be the same as head -n 12 demo.txt.

tail

The tail command has the exact same syntax as the head command and is also used to limit the number of lines that you get on the screen. However, rather than the first lines of the file, tail gives you the last lines instead.

For example, if you wanted to get the last 2 lines of a specific file, you would use the following command:

tail -2 demo.txt

Just as with head, if you don't specify a number, tail defaults to 10 lines.

Another very useful flag for the tail command is the -f flag.

It is very handy, as it will print out the last 10 lines of the file and then wait for new content to be added to the file and print it out on your screen in real time. This is very useful when monitoring your server logs:

tail -f /var/log/nginx/access.log

The pipe (|)

The pipe (|) is probably one of the most powerful features of the shell. Strictly speaking, it is an operator rather than a command: it passes the standard output of one command to the standard input of the next one. This allows you to chain multiple commands together and manipulate or analyze the data until you get the exact output that you need.

For example, both head and tail can be combined with the cat command using a pipe, |:

cat demo.txt | head -2

In the example above, the output of the cat command is passed as input to the head -2 command.

We are going to use the pipe later on to combine most of the commands that we are going to review in this tutorial!
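
As a small sketch of chaining more than two commands, head and tail can be combined to cut a slice out of the middle of some input. The six sample words below are made up for illustration and are not taken from demo.txt.

```shell
# Six sample lines, one word per line (illustrative data only).
sample=$(printf '%s\n' one two three four five six)

# head keeps the first 4 lines, then tail keeps the last 2 of those,
# which leaves lines 3 and 4 of the original input.
middle=$(printf '%s\n' "$sample" | head -n 4 | tail -n 2)
echo "$middle"
```

The same pattern works on a file: replace the printf with cat and the file name.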

The wc command

The wc command stands for word count, and it does a bit more than the name suggests: it counts the lines, words, and bytes in a text file.

Syntax:

wc demo.txt

Output:

 19  76 362 demo.txt

Here is a quick rundown of the output:

  • 19: this indicates that there are 19 lines in the file
  • 76: this is the total number of words in the file
  • 362: and finally, this is the size of the file in bytes, which for plain ASCII text is the same as the number of characters

There are some handy flags that you could use in order to get only one of those three values:

  • -l: print only the number of lines
  • -w: print only the number of words
  • -c: print only the number of bytes (use -m if you need characters; the two are the same for plain ASCII text)
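
A minimal sketch of the three flags, using two made-up lines of input instead of demo.txt. When wc reads from standard input there is no file name to print, which makes the result easy to store in a variable.

```shell
# Count lines, words, and bytes of the same two-line sample input.
lines=$(printf 'alpha beta\ngamma\n' | wc -l)
words=$(printf 'alpha beta\ngamma\n' | wc -w)
bytes=$(printf 'alpha beta\ngamma\n' | wc -c)

# Some wc implementations pad the numbers with leading spaces.
echo "lines=$lines words=$words bytes=$bytes"
```

For this input the counts are 2 lines, 3 words, and 17 bytes (each newline counts as one byte).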

The grep command

The grep command is probably my favorite one. You can use it to search for a specific string in a file.


The syntax is the following:

grep some_string demo.txt

Let's say that you wanted to get all of the lines that include the string bobby in them. To do so you would use the following command:

grep bobby demo.txt

Output:

bobby   1456    Bobby   3
bobby   1456    Bobby   3

By default, the grep command is case sensitive, so if you were to search for the string bOBbY you would not get any results back:

grep bOBbY demo.txt

However, there is a handy flag that you could use in order to make the grep command case insensitive:

grep -i bOBbY demo.txt

Output:

bobby   1456    Bobby   3
bobby   1456    Bobby   3

By default, grep matches the search string anywhere in a line, including inside longer words:

grep -i to demo.txt

Output:

tony 5489 Tony  6
tom 2589 Tom    2
tony 5489 Tony  6
tom 2589 Tom    2

As you can see, we are getting a match for Tom and Tony, because both contain the string to. This might be a problem if you are looking for a specific word and not just a substring of a word.

To only match whole words, you can use the -w flag:

grep -w to demo.txt

The above would not match tony or tom. Since no line in the file contains to as a standalone word, the command prints nothing.
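
To make the difference between substring and whole-word matching concrete, here is a hedged sketch on three made-up lines. It also uses the -c flag, which is not covered above: it makes grep print the number of matching lines instead of the lines themselves.

```shell
# Three sample records (illustrative data only).
users=$(printf '%s\n' 'tom 2589' 'tony 5489' 'bobby 1456')

# Substring match: "to" appears inside both tom and tony.
substring_hits=$(printf '%s\n' "$users" | grep -c 'to')

# Whole-word match: only the line where tom is a complete word counts.
whole_word_hits=$(printf '%s\n' "$users" | grep -c -w 'tom')

echo "substring: $substring_hits, whole word: $whole_word_hits"
```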

The awk command

AWK is actually not just a command but a whole scripting language. It is used for text processing. For the scope of this post, we are only going to scratch the surface of awk.

The awk command lets you print out specific columns. By default, awk splits each line into fields on runs of whitespace (spaces or tabs).

In our case, the columns are Username, ID, Name, and Orders, and awk exposes them as $1, $2, $3, and $4.

So if we wanted to print out only the name column, we would use the following:

cat demo.txt | awk '{ print $3 }'

Output (first five lines):

Name
Rachel
Laura
Craig
Mary

You could print multiple columns as well:

cat demo.txt | awk '{ print $3 " " $1 }'

You could also specify a different delimiter based on your data file by using the -F flag.
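
For instance, a comma-separated version of the data could be handled as sketched below. The three CSV lines are an assumption made for illustration; demo.txt itself is not comma-separated.

```shell
# A small comma-separated sample (illustrative data only).
csv=$(printf '%s\n' 'Username,ID,Name,Orders' 'rachel,9012,Rachel,2' 'laura,2070,Laura,3')

# -F ',' tells awk to split fields on commas instead of whitespace.
names=$(printf '%s\n' "$csv" | awk -F ',' '{ print $3 }')
echo "$names"
```

This prints the third field of every line: Name, Rachel, and Laura.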

The sort command

The sort command lets you sort the lines of a file in a particular order. By default, sort compares whole lines as text, using the collation rules of your current locale, so the result is alphabetical.

To sort the users alphabetically, you could just use the following:

cat demo.txt | sort

Let's combine this with awk and sort the users by orders:

  • First, let's get the 4th and the 3rd columns, which are the orders and the name of each user:
cat demo.txt | awk '{print $4 " " $3 }'
  • Then, with an extra pipe, let's sort them. Plain sort compares the lines as text, which is enough here because every order count is a single digit; for larger numbers you would add the -n flag to compare them numerically:
cat demo.txt | awk '{print $4 " " $3 }' | sort

To sort the items in reverse order, you could add the -r flag:

cat demo.txt | awk '{print $4 " " $3 }' | sort -r

A very handy flag is the -h flag, which lets you sort human-readable sizes such as 2K, 500M, or 1G by their actual magnitude.
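
One thing to keep in mind is that a plain sort compares text, not numbers. The sketch below, on four made-up values, shows how the result differs once the numbers have more than one digit, and how the -n flag (numeric sort) fixes it.

```shell
# Four sample order counts (illustrative data only).
counts=$(printf '%s\n' 10 9 2 33)

# Text comparison: "10" sorts before "2" because "1" comes before "2".
text_order=$(printf '%s\n' "$counts" | sort)

# Numeric comparison with -n.
numeric_order=$(printf '%s\n' "$counts" | sort -n)

echo "text:    $(echo $text_order)"      # text:    10 2 33 9
echo "numeric: $(echo $numeric_order)"   # numeric: 2 9 10 33
```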

The uniq command

As the name suggests, the uniq command lets you filter out repeated lines and show only the unique records. In most cases, you would use uniq together with the sort command, because uniq only removes identical lines that are directly next to each other.

I intentionally created the file so that there are some duplicate lines. Let's filter them out by using the uniq command:

cat demo.txt | awk '{ print $1 }' | sort | uniq 

Output:

Username
bobby
craig
devdojo
jamie
laura
mary
rachel
tom
tony

By adding the -c flag, you would also get a count of how many times each line repeats in the file:

cat demo.txt | awk '{ print $1 }' | sort | uniq -c

Output:

      1 Username
      2 bobby
      2 craig
      2 devdojo
      2 jamie
      2 laura
      2 mary
      2 rachel
      2 tom
      2 tony
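
A common follow-up, sketched here on made-up data, is to sort the counted output again so that the most frequent value comes first. The second sort uses -r (reverse) together with -n (numeric), so it orders the lines by the count that uniq -c puts at the start of each line.

```shell
# Six sample usernames with repeats (illustrative data only).
visits=$(printf '%s\n' bobby tony bobby tom bobby tony)

# Group identical lines, count them, then rank by count, highest first.
ranking=$(printf '%s\n' "$visits" | sort | uniq -c | sort -rn)
echo "$ranking"
```

Here bobby ends up on the first line with a count of 3, followed by tony with 2 and tom with 1.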

The sed command

The sed command lets you search for and replace a specific string in a text stream or a file. The name sed stands for stream editor.

Let's go ahead and use the grep command to find all of the references for devdojo in the file:

cat demo.txt | grep devdojo

Output:

devdojo 9874 DevDojo    3
devdojo 9874 DevDojo    3

Then let's go and use sed to change the devdojo username to something else, like thedevdojo for example:

cat demo.txt | grep devdojo | sed 's/devdojo/thedevdojo/g'

Output:

thedevdojo  9874 DevDojo    3
thedevdojo  9874 DevDojo    3

Let's have another example and change every lowercase letter b to an uppercase B:

cat demo.txt | sed 's/b/B/g'

Let's have a quick rundown of the sed command:

  • s - this stands for substitute and tells sed to replace whatever matches the search string.
  • / - this is the delimiter. You can use another character, such as |, when the string that you are searching for contains the / character.
  • search_string - the string that you are searching for.
  • replace_string - the string that you want to replace the matches with.
  • g - stands for global and indicates that all matches on a line should be replaced, not just the first one.
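
The following sketch illustrates two of the points above: the effect of the g flag, and swapping the delimiter when the text contains slashes. The replacement path /srv/logs is a made-up value used only for this example.

```shell
# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' 'bobby' | sed 's/b/B/')

# With g, every match on the line is replaced.
every_match=$(printf '%s\n' 'bobby' | sed 's/b/B/g')

# Using | as the delimiter avoids escaping each / in a path.
new_path=$(printf '%s\n' '/var/log/nginx/access.log' | sed 's|/var/log|/srv/logs|')

echo "$first_only"    # Bobby
echo "$every_match"   # BoBBy
echo "$new_path"      # /srv/logs/nginx/access.log
```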

You could also use the sed command to apply the changes directly to the file by using the -i flag:

sed -i 's/devdojo/thedevdojo/g' demo.txt 

This would change the file and update all of the references for devdojo to thedevdojo. You would not get any output back.

Note that this process is irreversible, so if you are changing an important file, make sure to take a backup first!
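
One way to get that backup in the same step, sketched here under the assumption that your sed supports a suffix after -i (GNU sed and BSD sed both do), is to write -i.bak. sed then keeps the untouched original next to the edited file with .bak appended to its name. The example works on a temporary scratch file rather than on demo.txt.

```shell
# Create a scratch file with one sample record.
scratch=$(mktemp)
printf '%s\n' 'devdojo 9874 DevDojo 3' > "$scratch"

# Edit in place, keeping the original as "$scratch.bak".
sed -i.bak 's/devdojo/thedevdojo/g' "$scratch"

edited=$(cat "$scratch")
original=$(cat "$scratch.bak")
echo "edited:   $edited"
echo "original: $original"

# Clean up both files.
rm -f "$scratch" "$scratch.bak"
```

Note that DevDojo in the third column stays as it is, because the search is case sensitive.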

The tr command

The tr command is used to change or delete characters. For example, you could use tr to change all lower case characters to upper case:

cat demo.txt | tr "[:lower:]" "[:upper:]"

Output (first seven lines):

USERNAME    ID  NAME    ORDERS
RACHEL  9012    RACHEL  2
LAURA   2070    LAURA   3
CRAIG   4081    CRAIG   4
MARY    9346    MARY    1
JAMIE   5079    JAMIE   2
BOBBY   1456    BOBBY   3

Let's also change the tab character to a comma:

cat demo.txt | tr '\t' ','

This is quite handy when you want to change the formatting of a specific file.
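
Two more tr flags tend to help with messy input, shown here as a sketch on made-up lines: -s squeezes repeated characters into one, and -d deletes characters. Combined with a translation, -s turns any run of spaces or tabs into a single comma, and -d can strip the carriage returns found in files saved on Windows.

```shell
# Translate spaces and tabs to commas, then squeeze repeated commas into one.
csv_line=$(printf '%s\n' 'tony 5489 Tony  6' | tr -s ' \t' ',,')

# Delete the carriage return from a Windows-style line ending.
clean=$(printf 'mary\r\n' | tr -d '\r')

echo "$csv_line"   # tony,5489,Tony,6
echo "$clean"      # mary
```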

Materialize

The above shell commands are very handy for ad-hoc analysis of a specific file.
However, if you want to take this to the next level and be able to use SQL to actually analyze a dynamically changing file source, I would recommend taking a look at Materialize.

Materialize is a streaming database for real-time analytics.

It is not a substitute for your transactional database. Instead, it accepts input data from a variety of sources like:

  • Messages from streaming sources like Kafka
  • Archived data from object stores like S3
  • Change feeds from databases like PostgreSQL
  • Data in files: CSV, JSON, and even unstructured files like logs (what we'll be using today)

And it lets you write standard SQL queries (called materialized views) that are kept up-to-date instead of returning a static set of results from one point in time.

To see the full power of Materialize, make sure to check out this demo from their official documentation here:

Materialize Log Parsing Demo

Conclusion

As a next step I would recommend testing out the commands that you've just learned with different data files that you have in place!

I would also recommend taking a look at this script that I've created to parse Nginx/Apache access logs, which uses the majority of the commands that we've covered in this tutorial:

BASH Script to Summarize Your NGINX and Apache Access Logs

If you want to learn more about Bash scripting, make sure to check out this free ebook here:

Open-Source Introduction to Bash Scripting Ebook/Guide

If you have tried out Materialize, make sure to star the project on GitHub:

Materialize GitHub Repository

I hope that you've found this tutorial helpful!

Oldest comments (4)

Ludal 🚀

I love using Linux commands so much, especially when combined together, they are so powerful! 🚀

Bobby Iliev

Yes! Combining the commands into a Bash script is awesome!

peter279k

Firstly, thanks for your shared post about top-10 Linux commands!

And I also think that the man command can be part of the top-10 commands.

The reason that it can be good for developers to know specific command usage when someone doesn't know how to use command.

Bobby Iliev

This is a great idea! I will add it as a tip/bonus command!