DEV Community

Shamil

Processing Text with Linux Shell - Part 2

grep - searching the haystack made easy

When you need to find a piece of text in a very large source, the grep command is the answer. grep accepts plain strings and regular expressions, and it can produce output in various formats.

The most basic syntax to search a piece of text in a file is:

grep "pattern" file_name

If we need to search multiple files for a pattern, we can do the following:

grep "pattern" file1 file2 file3 ...

grep can also be used with standard input.

echo "linux is awesome" | grep linux

If no file is specified, grep reads its input from standard input (STDIN).
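To make the three forms above concrete, here is a small sketch. The file names notes1.txt and notes2.txt and their contents are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample files.
printf 'linux is awesome\nwindows is fine\n' > notes1.txt
printf 'I run linux at home\n' > notes2.txt

# Single file: prints only the matching line.
single=$(grep "linux" notes1.txt)
echo "$single"

# Multiple files: each match is prefixed with the name of its file.
multi=$(grep "linux" notes1.txt notes2.txt)
echo "$multi"

# No file given: grep reads standard input.
from_stdin=$(echo "linux is awesome" | grep linux)
echo "$from_stdin"
```

Note how the file name prefix appears only when more than one file is searched.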

# Highlight the matched text

One major feature of grep is its ability to highlight the matched string in its output. There are 3 color options: auto, always and never. To choose one, we pass an additional --color flag with the option we want.

grep "pattern" file_name --color=always

As we can already guess, with --color=always grep always colors the matched string in the output. With --color=auto, grep colors the output only when it is written to a terminal, not when it is piped to another command or redirected to a file. The last option, --color=never, turns the coloring off.
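The difference between auto and always is easiest to see when the output is captured instead of shown on a terminal. The sketch below assumes GNU grep; the variable names are only for illustration.

```shell
# Inside $(...) grep's output goes to a pipe, not to a terminal.
plain=$(echo "linux is awesome" | grep --color=never linux)
auto=$(echo "linux is awesome" | grep --color=auto linux)
always=$(echo "linux is awesome" | grep --color=always linux)

# auto behaves like never here, because the output is not a terminal.
if [ "$auto" = "$plain" ]; then
    echo "auto: no color codes in captured output"
fi

# always embeds escape sequences, so the captured text is no longer the plain line.
if [ "$always" != "$plain" ]; then
    echo "always: color codes are part of the captured output"
fi
```

This is why always can confuse commands further down a pipeline: they receive the escape sequences as part of the text.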

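If you want to turn on coloring for every grep operation, but feel too lazy to type the flag with each command (which would be me), you could define a shell alias, for example in your ~/.bashrc.

alias grep='grep --color=auto'

This will turn on coloring for all interactive grep commands. Older guides suggest export GREP_OPTIONS='--color=always' instead, but GNU grep declared GREP_OPTIONS obsolescent in version 2.21 and newer releases no longer honor it. Using always everywhere would also push color escape codes into pipes and redirected files.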

# Limit what to output

Normally, grep outputs only the lines that contain the matched pattern. But there are circumstances where we might want to see the contents of the entire file with the matched pattern highlighted. We can do the following for this:

grep --color 'pattern\|$' file_name

This will output all the lines containing the pattern as well as the lines that have an end (which is what $ matches), i.e. the complete contents of the file with only the matched part colored. Of course, we can pass multiple patterns to be matched.

grep --color 'pattern_one\|pattern_two\|$' file_name

Notice the \ before | . By default grep uses basic regular expressions, in which a bare | is an ordinary character; GNU grep treats \| as alternation. However, we can also write the command without \ .

grep -E --color 'pattern_one|pattern_two|$' file_name

The -E option instructs grep to use extended regular expressions, in which | means alternation on its own. You may also come across egrep, a historical shortcut that behaves like grep -E. Recent GNU grep releases mark egrep as obsolescent and print a warning, so grep -E is the safer spelling. For reference, the same command written the old way looks like this.

egrep --color 'pattern_one|pattern_two|$' file_name
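A common way to write this whole-file trick uses the end-of-line anchor $ as the last alternative, since every line has an end. The sketch below checks that idea by counting matching lines with -c, so it does not depend on colors. The sample file is hypothetical.

```shell
# Hypothetical three-line sample file.
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$sample"

# Only one line contains "beta".
matching_lines=$(grep -c 'beta' "$sample")

# With the extra $ alternative, every line matches, so the whole file is selected.
all_lines=$(grep -E -c 'beta|$' "$sample")

echo "lines with beta: $matching_lines, lines selected with beta|\$: $all_lines"

# On a terminal, this prints the whole file with only "beta" highlighted.
grep -E --color=auto 'beta|$' "$sample"
```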

But what if we need neither the complete matching lines nor the complete contents, but just the matching text itself?

For example, if we are searching for all the email addresses in a file, we do not need the lines that contain them; we just need the email addresses that appear in the file.

Well, we can use the -o option, which stands for only matching and limits the output to only the matched text.

grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" file_name

Neat, isn't it? 
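Here is a small sketch of that command at work. The file contents and addresses are invented, and the \b word-boundary escape assumes GNU grep.

```shell
# Hypothetical sample: two addresses on the first line, none on the second.
contacts=$(mktemp)
printf 'Contact alice@example.com or bob@example.org today\n' > "$contacts"
printf 'No address on this line\n' >> "$contacts"
printf 'Cc: carol@test.io\n' >> "$contacts"

# -o prints each match on its own line instead of the whole matching line.
emails=$(grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" "$contacts")
echo "$emails"
```

Each address lands on its own line, even when two of them share a line in the input.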

What if we need to see all the lines except the ones where the given pattern appears?

grep -v "pattern" file_name

The -v option inverts the results and thus the above command will output all the contents except the lines containing the matched pattern.
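A typical use of -v is stripping comment lines from a config file. This is only a sketch; the file contents are made up.

```shell
# Hypothetical config file with two comment lines.
config=$(mktemp)
printf '# server settings\nport=8080\n# where to bind\nhost=localhost\n' > "$config"

# Keep every line that does NOT start with "#".
settings=$(grep -v '^#' "$config")
echo "$settings"
```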

# Count the appearances

If we need to get a count of how many times a pattern appears in a file, we can use the -c flag. Continuing the email example above, if we need to see how many email addresses there are in a file, we could do the following.

grep -E -c "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" file_name

But wait, there is a gotcha.

The -c flag counts only the number of lines that contain the pattern, not the actual number of matches. So if there are multiple matches on a single line, that line is counted only once!

We can overcome this with a little tweak.

Remember how we printed only the matching portions with the -o flag above? If we pipe that output to wc -l,

grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" file_name | wc -l

it will give us the exact number of times the pattern appears in the file, because -o prints every match on its own line.
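The gotcha is easy to reproduce with a simpler pattern. In this sketch the sample text is invented; the tr call only strips the padding that some wc implementations add.

```shell
# Hypothetical sample: "the" appears twice on line 1 and once on line 3.
text=$(mktemp)
printf 'the cat and the dog\nno match\nthe end\n' > "$text"

# -c counts matching LINES.
line_count=$(grep -c "the" "$text")

# -o prints one match per line, so wc -l counts MATCHES.
match_count=$(grep -o "the" "$text" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"
```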

# Extract additional information

If we want to print the line numbers where the searched pattern appears, we can use the -n flag.

grep -n --color "pattern" file_name

When we search in multiple files, grep also prefixes each match with the name of the file, and with -n the line number follows that file name.

If we want to see the offset where the pattern starts, we can use the -b flag.

echo "linux is the answer" | grep -b -o "the"

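This prints 9:the, because the searched word "the" starts at byte offset 9, counting from 0 at the start of the input. The -b flag is most useful together with the -o flag; without -o, grep reports the offset at which each matching line begins.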
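The sketch below puts -n and -b side by side on a two-line input, which also shows that the byte offset is counted from the start of the input rather than from the start of the line. The sample text is made up.

```shell
# Hypothetical two-line input; the first line is 11 bytes including its newline.
input=$(mktemp)
printf 'first line\nlinux is the answer\n' > "$input"

# -n: line number of each matching line.
with_line_number=$(grep -n "the" "$input")

# -b -o: byte offset of the match itself (11 bytes for line 1, plus 9 for "linux is ").
match_offset=$(grep -b -o "the" "$input")

# -b without -o: byte offset of the start of the matching line.
line_offset=$(grep -b "the" "$input")

echo "$with_line_number"
echo "$match_offset"
echo "$line_offset"
```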

# Search within directories

To see which files contain a particular pattern, we can use the -l flag.

grep -l "pattern" file1 file2 file3 ...

This will output only the names of the files that contain the given pattern. If we want to see the names of the files that do not contain the pattern, we can use the -L flag (note that flags are case-sensitive). This basically inverts the results we get with the -l flag.
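As a quick illustration, the sketch below runs both flags over three hypothetical files, two of which contain the word being searched.

```shell
# Throwaway directory with three made-up files.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'linux rocks\n' > a.txt
printf 'nothing here\n' > b.txt
printf 'more linux\n' > c.txt

# -l: names of files with at least one match.
with_match=$(grep -l "linux" a.txt b.txt c.txt)

# -L: names of files with no match at all.
without_match=$(grep -L "linux" a.txt b.txt c.txt)

echo "contain the pattern: $with_match"
echo "do not contain it: $without_match"
```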

Well, so far we have only seen how to search in a single file or in a set of files specified in the command. However, we can also search all the files in a directory and its subdirectories recursively.

grep -r "pattern" .

Here . specifies the current directory. The -r flag implies that the search should be done recursively in all subdirectories. This search is case-sensitive.

If we want to match the pattern regardless of case, we can add the -i flag.

grep -ir "pattern" .

This will output all matching lines irrespective of case.
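The following sketch builds a tiny directory tree to show the effect of -r and -i. The file names are hypothetical, and the output is sorted only because the order in which grep walks a directory is not guaranteed.

```shell
# Throwaway tree: one file at the top, one in a subdirectory.
tree=$(mktemp -d)
cd "$tree" || exit 1
mkdir sub
printf 'Linux kernel\n' > top.txt
printf 'linux shell\n' > sub/deep.txt

# Case-sensitive recursive search: only the lowercase "linux" matches.
case_sensitive=$(grep -r "linux" . | LC_ALL=C sort)

# Adding -i also finds "Linux".
any_case=$(grep -ir "linux" . | LC_ALL=C sort)

echo "$case_sensitive"
echo "$any_case"
```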

# The pattern file

If we have too many patterns, strings etc. that we want to search for, joining them with | is very tedious. In such scenarios, we can use a pattern file that contains all the strings/patterns we want to search for, one per line. Then we can pass that pattern file to grep with the -f flag.

grep -ir -f pattern_file .

This will print every line that matches at least one of the patterns in the pattern file.
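Here is a minimal sketch of a pattern file in use. The names patterns.txt and app.log and the log lines are invented. It searches a single file rather than the current directory, because a recursive search of . would also scan the pattern file itself if it lives there.

```shell
# Throwaway directory with a made-up pattern file and log.
logdir=$(mktemp -d)
cd "$logdir" || exit 1

# One pattern per line. Avoid blank lines: an empty pattern matches every line.
printf 'error\nwarning\n' > patterns.txt
printf 'INFO start\nERROR disk full\nWarning: low memory\nINFO done\n' > app.log

# -f reads the patterns from the file; -i ignores case.
hits=$(grep -i -f patterns.txt app.log)
echo "$hits"
```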

ヽ(´▽`)/

(You can find Part 1 of this series here )

Top comments (3)

Patrik Kristian

Hey, i like this series! keep going. I am using grep often, but now i know i was using only fraction of stuff it can do. i was using constructions like:
cat file | grep pattern

silly me :)

Shamil • Edited

Hey, I'm glad that this is of help to you. I plan to write another 2 articles in this series :) Let's see.

Patrik Kristian

2 more? wow, i a looking forward! :)