I ain't afraid of no regex

Rafael Buzzi de Andrade — Fri, 14 Jun 2019 12:06:49 +0000

Sometimes knowing what you do not want is as important as knowing what you do want. This is true for regular expressions (regex), but it is also true for this article, which is not for someone who already uses regex on a daily basis.

But if you want some understandable examples so that you can use regex more often, I hope this series of articles helps you. (here is some history if you want)

All examples below are available at my GitHub repository, although if you want to test the expressions as you read, I recommend using a page that evaluates them in real time. I enjoy using https://rubular.com/, but feel free to choose your own. Now let us code:
Import these packages in your class:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Let us say you want to find all words ending in "thing" within a text. You could do it this way:

Pattern pattern = Pattern.compile("\\w+thing");
Matcher matcher = pattern.matcher("A thing I want is to find something, or anything. I do not really care, but I do no want go with nothing at hand.");
while (matcher.find()) {
    System.out.println("Found " + matcher.group());
}

If you wanted to achieve the exact same result without using regex, it would look something like this:

String text = "A thing I want is to find something, or anything. I do not really care, but I do no want go with nothing at hand.";
String sufix = "thing";
String[] words = text.split(" ");
for (int i = 0; i < words.length; i++) {
    String word = words[i];
    if (word != null && word.contains(sufix)) { // endsWith does not work because of the trailing "," and "."
        if (word.length() != sufix.length()) { // remember, you want words ending with "thing", but not the word "thing" itself
            System.out.println(word
                    .replace(".", "")
                    .replace(",", "")
            );
        }
    }
}

Note that this is a simple example. In a more complex scenario you would need to manually check many other things.

But let us continue, shall we?

What does "\\w+thing" mean? Well, "thing" is the suffix you want; I believe this is pretty obvious. So let us take a look at "\\w+".

"\\"

When you see two backslashes, it merely means one escape character escaping another: in a Java string literal, "\\" produces a single backslash. So read it as if there were only one backslash ("\w+thing").
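If you would like to see the double backslash collapse with your own eyes, here is a minimal sketch. The class name BackslashDemo and the constant REGEX are just names I made up for this illustration.

```java
class BackslashDemo {
    // Written with two backslashes in Java source, stored with only one.
    static final String REGEX = "\\w+thing";

    public static void main(String[] args) {
        // The regex engine receives exactly what is printed here.
        System.out.println(REGEX);                   // prints \w+thing
        // One backslash, 'w', '+' and the five letters of "thing".
        System.out.println(REGEX.length());          // prints 8
        System.out.println(REGEX.charAt(0) == '\\'); // prints true
    }
}
```

So the second backslash never reaches the regex engine. It only exists to get the first one past the Java compiler.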

\w

It means any word character: any letter from a to z (and A to Z), any digit, and "_". Could you write it in a different way? Yes, the regex "[a-zA-Z0-9_]+thing" has the exact same result (we will talk about the brackets in a moment). If you believe this more explicit variant will be easier to maintain, go for it. Regex, like most things, offers many ways to get the same result. So, the brackets...
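A quick way to convince yourself that the shorthand and the explicit class behave the same on our sentence is to run both. This is only a sketch: WordCharDemo, TEXT and the findAll helper are names invented for this example, not something from the repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCharDemo {
    static final String TEXT = "A thing I want is to find something, or anything. "
            + "I do not really care, but I do no want go with nothing at hand.";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Shorthand class.
        System.out.println(findAll("\\w+thing", TEXT));           // [something, anything, nothing]
        // The same set of characters spelled out explicitly.
        System.out.println(findAll("[a-zA-Z0-9_]+thing", TEXT));  // [something, anything, nothing]
    }
}
```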

[ ]

They list the characters you want to match. If you want only an "a" or a "b", you would write [ab]. If you want letters from lowercase a to z, you would write [a-z]; if you want only lowercase a to h, write [a-h], and so on. Oh... you will notice that the samples have way more brackets than needed, but the results are the same.
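Here is a small sketch of the brackets at work. The sample strings ("cabbage", "cabbage hill", "Regex 101") are made up just for this illustration, and so are the class name CharClassDemo and the findAll helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Only "a" or "b", one character per match.
        System.out.println(findAll("[ab]", "cabbage"));        // [a, b, b, a]
        // Runs of letters between lowercase a and h.
        System.out.println(findAll("[a-h]+", "cabbage hill")); // [cabbage, h]
        // Runs of lowercase letters: the capital R and the digits are left out.
        System.out.println(findAll("[a-z]+", "Regex 101"));    // [egex]
    }
}
```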

+

This is not an append operation. The "+" means that whatever is on its left must appear at least once (one or more times). If you replace it with a "*", which means optional (zero or more times), you will see that the word "thing" is now displayed as well.
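In case you prefer to compare both quantifiers side by side, a minimal sketch could look like this (QuantifierDemo, TEXT and findAll are names invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    static final String TEXT = "A thing I want is to find something, or anything. "
            + "I do not really care, but I do no want go with nothing at hand.";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "+" demands at least one word character before "thing".
        System.out.println(findAll("\\w+thing", TEXT)); // [something, anything, nothing]
        // "*" accepts zero of them, so the bare word "thing" matches too.
        System.out.println(findAll("\\w*thing", TEXT)); // [thing, something, anything, nothing]
    }
}
```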

Go on. Try it. I will wait here...

Now, if you take off this operator ("\wthing"), you will see a different result. It will return the following values:

ething
ything
othing

Because the matcher will understand you want any word character before "thing". But only one. Do you want two? Use "\w{2}thing" and you will get:

mething
nything
nothing

Do you want at least three before the suffix, but do not want to limit the size? Use "\w{3,}thing" and "nothing" will not be returned:

something
anything

Do you want at least one character but no more than three? Try "\w{1,3}thing":

omething
anything
nothing
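The four variations above can be run in one go. This is just a sketch to save you some typing; CountDemo, TEXT and findAll are names I picked for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountDemo {
    static final String TEXT = "A thing I want is to find something, or anything. "
            + "I do not really care, but I do no want go with nothing at hand.";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Exactly one word character before "thing".
        System.out.println(findAll("\\wthing", TEXT));      // [ething, ything, othing]
        // Exactly two.
        System.out.println(findAll("\\w{2}thing", TEXT));   // [mething, nything, nothing]
        // Three or more, no upper limit.
        System.out.println(findAll("\\w{3,}thing", TEXT));  // [something, anything]
        // At least one, at most three.
        System.out.println(findAll("\\w{1,3}thing", TEXT)); // [omething, anything, nothing]
    }
}
```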

And now you might be thinking that you do not want broken words in your results. Try "\W\w{1,3}thing". This "\W" means any non-word character, the exact opposite of "\w". The result (each match also includes the non-word character in front of it, here a space) will be:

anything
nothing

It could have been written as "\s\w{1,3}thing" as well. "\s" means any whitespace character (yes, "\S" means any non-whitespace character).
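One detail worth seeing in code: since "\W" and "\s" consume a character, that character is part of each match. The sketch below prints the raw matches in quotes and then a trimmed version. BoundaryDemo, TEXT and findAll are made-up names, and calling trim() is just one simple way to drop the leading space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    static final String TEXT = "A thing I want is to find something, or anything. "
            + "I do not really care, but I do no want go with nothing at hand.";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"\\W\\w{1,3}thing", "\\s\\w{1,3}thing"}) {
            for (String match : findAll(regex, TEXT)) {
                // The raw match starts with the space that "\W" or "\s" consumed.
                System.out.println(regex + " -> \"" + match + "\" -> \"" + match.trim() + "\"");
            }
        }
    }
}
```

Both expressions find " anything" and " nothing" here, leading space included.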

As a developer, you probably thought "What would happen if there was a target word at the beginning of the phrase?" ~~(In the repository there is a solution. It does not use purely regex to solve, but hey, we are not confined to one pure solution, right?)~~
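To see the problem, here is a sketch with a made-up sentence that starts with a target word. It also tries the word boundary "\b", which matches a position instead of a character. Please read that part as one possible approach I am assuming for illustration, not as the solution from the repository. StartOfTextDemo, SAMPLE and findAll are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartOfTextDemo {
    // Made-up sentence whose first word is a target word.
    static final String SAMPLE = "Nothing beats something or anything.";

    // Hypothetical helper: collects every match of the regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "\W" needs a real character before the word, so "Nothing" is missed.
        System.out.println(findAll("\\W\\w{1,3}thing", SAMPLE)); // [ anything]
        // "\b" only checks the position, so the start of the text counts as well.
        System.out.println(findAll("\\b\\w{1,3}thing", SAMPLE)); // [Nothing, anything]
    }
}
```

A side effect of "\b" is that the matches no longer carry the leading space.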

Try as many variations as you want.

See you soon with more complex situations involving regex.

DEV Community: Rafael Buzzi de Andrade
