Sundeep

Posted on Mar 27, 2019 • Edited on Jan 11, 2021 • Originally published at learnbyexample.github.io

Ruby Regexp Part 3 - Alternatives and Escaping

#ruby #regex #alternation #ebook

Alternation and Grouping

Many a times, you want to check if the input string matches multiple patterns. For example, whether an object's color is green or blue or red. In programming terms, you need to perform OR conditional. This chapter will show how to use alternation for such cases. These patterns can have some common elements between them, in which case grouping helps to form terser expressions. This chapter will also discuss the precedence rules used to determine which alternation wins.

OR conditional

A conditional expression combined with logical OR evaluates to true if any of the condition is satisfied. Similarly, in regular expressions, you can use | metacharacter to combine multiple patterns to indicate logical OR. The matching will succeed if any of the alternate pattern is found in the input string. These alternatives have the full power of a regular expression, for example they can have their own independent anchors. Here's some examples.

# match either 'cat' or 'dog'
>> 'I like cats'.match?(/cat|dog/)
=> true
>> 'I like dogs'.match?(/cat|dog/)
=> true
>> 'I like parrots'.match?(/cat|dog/)
=> false

# replace either 'cat' at start of string or 'cat' at end of word
>> 'catapults concatenate cat scat'.gsub(/\Acat|cat\b/, 'X')
=> "Xapults concatenate X sX"

# replace either 'cat' or 'dog' or 'fox' with 'mammal'
>> 'cat dog bee parrot fox'.gsub(/cat|dog|fox/, 'mammal')
=> "mammal mammal bee parrot mammal"

Regexp.union method

You might infer from above examples that there can be cases where lots of alternation is required. The Regexp.union method can be used to build the alternation list automatically. It accepts an array as argument or a list of comma separated arguments.

>> Regexp.union('car', 'jeep')
=> /car|jeep/

>> words = %w[cat dog fox]
>> pat = Regexp.union(words)
>> pat
=> /cat|dog|fox/
>> 'cat dog bee parrot fox'.gsub(pat, 'mammal')
=> "mammal mammal bee parrot mammal"

In the above examples, the elements do not contain any special regular expression characters. Strings having metacharacters will be discussed in Regexp.escape method section.

Grouping

Often, there are some common things among the regexp alternatives. It could be common characters or regexp qualifiers like the anchors. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.

# without grouping
>> 'red reform read arrest'.gsub(/reform|rest/, 'X')
=> "red X read arX"
# with grouping
>> 'red reform read arrest'.gsub(/re(form|st)/, 'X')
=> "red X read arX"

# without grouping
>> 'par spare part party'.gsub(/\bpar\b|\bpart\b/, 'X')
=> "X spare X party"
# taking out common anchors
>> 'par spare part party'.gsub(/\b(par|part)\b/, 'X')
=> "X spare X party"
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternate
>> 'par spare part party'.gsub(/\bpar(|t)\b/, 'X')
=> "X spare X party"

There's plenty more features to grouping than just forming terser regexp. It will be discussed as they become relevant in coming chapters.

Regexp.source method

The Regexp.source method helps to interpolate a regexp literal inside another regexp. For example, adding anchors to alternation list created using the Regexp.union method.

>> words = %w[cat par]
>> alt = Regexp.union(words)
>> alt
=> /cat|par/
>> alt_w = /\b(#{alt.source})\b/
>> alt_w
=> /\b(cat|par)\b/

>> 'cater cat concatenate par spare'.gsub(alt, 'X')
=> "Xer X conXenate X sXe"
>> 'cater cat concatenate par spare'.gsub(alt_w, 'X')
=> "cater X concatenate X spare"

The above example will work without Regexp.source method too, but you'll see that /\b(#{alt})\b/ gives /\b((?-mix:cat|par))\b/ instead of /\b(cat|par)\b/. Their meaning will be explained in Modifiers chapter.

Precedence rules

There's some tricky situations when using alternation. If it is used for testing a match to get true/false against a string input, there is no ambiguity. However, for other things like string replacement, it depends on a few factors. Say, you want to replace either are or spared — which one should get precedence? The bigger word spared or the substring are inside it or based on something else?

In Ruby, the regexp alternative which matches earliest in the input string gets precedence. Regexp operator =~ helps to illustrate this concept.

>> words = 'lion elephant are rope not'

>> words =~ /on/
=> 2
>> words =~ /ant/
=> 10

# starting index of 'on' < index of 'ant' for given string input
# so 'on' will be replaced irrespective of order of regexp
>> words.sub(/on|ant/, 'X')
=> "liX elephant are rope not"
>> words.sub(/ant|on/, 'X')
=> "liX elephant are rope not"

So, what happens if two or more alternatives match on same index? The precedence is then left to right in the order of declaration.

>> mood = 'best years'

>> mood =~ /year/
=> 5
>> mood =~ /years/
=> 5

# starting index for 'year' and 'years' will always be same
# so, which one gets replaced depends on the order of alternation
>> mood.sub(/year|years/, 'X')
=> "best Xs"
>> mood.sub(/years|year/, 'X')
=> "best X"

Another example with gsub to drive home the issue.

>> words = 'ear xerox at mare part learn eye'

# this is going to be same as: gsub(/ar/, 'X')
>> words.gsub(/ar|are|art/, 'X')
=> "eX xerox at mXe pXt leXn eye"

# this is going to be same as: gsub(/are|ar/, 'X')
>> words.gsub(/are|ar|art/, 'X')
=> "eX xerox at mX pXt leXn eye"

# phew, finally this one works as needed
>> words.gsub(/are|art|ar/, 'X')
=> "eX xerox at mX pX leXn eye"

If you do not want substrings to sabotage your replacements, a robust workaround is to sort the alternations based on length, longest first.

>> words = %w[hand handy handful]

>> alt = Regexp.union(words.sort_by { |w| -w.length })
>> alt
=> /handful|handy|hand/

>> 'hands handful handed handy'.gsub(alt, 'X')
=> "Xs X Xed X"

# without sorting, order will come into play
>> 'hands handful handed handy'.gsub(Regexp.union(words), 'X')
=> "Xs Xful Xed Xy"

Escaping metacharacters

This chapter will show how to match metacharacters literally, for manually as well as programmatically constructed patterns. You'll also learn about escape sequences supported by regexp and how they differ from strings.

Escaping with \

You have seen a few metacharacters and escape sequences that help to compose a regexp literal. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \ character. To indicate a literal \ character, use \\.

To spice up the examples a bit, block form has been used below to modify the matched portion of string with an expression. In later chapters, you'll see more ways to work directly with matched portions.

# even though ^ is not being used as anchor, it won't be matched literally
>> 'a^2 + b^2 - C*3'.match?(/b^2/)
=> false
# escaping will work
>> 'a^2 + b^2 - C*3'.gsub(/(a|b)\^2/) { |m| m.upcase }
=> "A^2 + B^2 - C*3"

# match ( or ) literally
>> '(a*b) + c'.gsub(/\(|\)/, '')
=> "a*b + c"

>> '\learn\by\example'.gsub(/\\/, '/')
=> "/learn/by/example"

As emphasized earlier, regular expression is just another tool to process text. Some examples and exercises presented in this book can be solved using normal string methods as well. For real world use cases, ask yourself first if regexp is needed at all?

>> eqn = 'f*(a^b) - 3*(a^b)'

# straightforward search and replace, no need regexp shenanigans
>> eqn.gsub('(a^b)', 'c')
=> "f*c - 3*c"

Regexp.escape method

How to escape all the metacharacters when a regexp is constructed dynamically? Relax, Regexp.escape method has got you covered. No need to manually take care of all the metacharacters or worry about changes in future versions.

>> eqn = 'f*(a^b) - 3*(a^b)'
>> expr = '(a^b)'

>> puts Regexp.escape(expr)
\(a\^b\)

# replace only at the end of string
>> eqn.sub(/#{Regexp.escape(expr)}\z/, 'c')
=> "f*(a^b) - 3*c"

The Regexp.union method automatically applies escaping for string arguments.

# array of strings, assume alternation precedence sorting isn't needed
>> terms = %w[a_42 (a^b) 2|3]

>> pat = Regexp.union(terms)
>> pat
=> /a_42|\(a\^b\)|2\|3/

>> 'ba_423 (a^b)c 2|3 a^b'.gsub(pat, 'X')
=> "bX3 Xc X a^b"

Regexp.union will also take care of mixing string and regexp patterns correctly. (?-mix: seen in the output below will be explained in the Modifiers chapter.

>> Regexp.union(/^cat|dog$/, 'a^b')
=> /(?-mix:^cat|dog$)|a\^b/

Escaping delimiter

Another character to keep track for escaping is the delimiter used to define the regexp literal. Or, you can use a different delimiter than / to define a regexp literal using %r to avoid escaping. Also, you need not worry about unescaped delimiter inside #{} interpolation.

>> path = '/abc/123/foo/baz/ip.txt'

# \/ is also known as 'leaning toothpick syndrome'
>> path.sub(/\A\/abc\/123\//, '~/')
=> "~/foo/baz/ip.txt"

# a different delimiter improves readability and reduces typos
>> path.sub(%r#\A/abc/123/#, '~/')
=> "~/foo/baz/ip.txt"

Escape sequences

In regexp literals, characters like tab and newline can be expressed using escape sequences as \t and \n respectively. These are similar to how they are treated in normal string literals (see ruby-doc: Strings for details). However, escapes like \b (word boundary) and \s (see Escape sequence character sets section) are different for regexps. And octal escapes \nnn have to be three digits to avoid conflict with Backreferences.

>> "a\tb\tc".gsub(/\t/, ':')
=> "a:b:c"

>> "1\n2\n3".gsub(/\n/, ' ')
=> "1 2 3"

If an escape sequence is not defined, it'll match the character that is escaped. For example, \% will match % and not \ followed by %.

>> 'h%x'.match?(/h\%x/)
=> true
>> 'h\%x'.match?(/h\%x/)
=> false

>> 'hello'.match?(/\l/)
=> true

If you represent a metacharacter using escapes, it will be treated literally instead of its metacharacter feature.

# \x20 is hexadecimal for space character
>> 'h e l l o'.gsub(/\x20/, '')
=> "hello"
# \053 is octal for + character
>> 'a+b'.match?(/a\053b/)
=> true

# \x7c is '|' character
>> '12|30'.gsub(/2\x7c3/, '5')
=> "150"
>> '12|30'.gsub(/2|3/, '5')
=> "15|50"

See ASCII code table for a handy cheatsheet with all the ASCII characters and their hexadecimal representation.

Codepoints and Unicode escapes section will discuss escapes for unicode characters.

Exercises

For practice problems, visit Exercises.md file from this book's repository on GitHub.

DEV Community