PWC 376 Doubled Words

#perl #perlweeklychallenge #pwc

There are two things that every Perl developer knows:

Parsing HTML with regular expressions is a bad idea.
My situation is unique and exceptional, such that parsing HTML with regular expressions is ingenious and optimal.

Let's see the task. I think you know where this is going.

Task 2: Double Words

You are given a string (which may contain embedded newlines) which is taken from a page on a website. The string will not contain brackets [].

Write a script that will find doubled words (such as “this this”) and highlight (wrap in brackets) each doubled word.

The script should:

Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
Find doubled words despite capitalization differences, such as with 'The the...', as well as allow differing amounts of whitespace (spaces, tabs, newlines, and the like) to lie between the words.
Find doubled words even when separated by HTML tags. For example, to make a word bold: '...it is very very important...'.
Only show lines containing doubled words.

Adapted from Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl

Example 1
- Input: $str = "you're given the job of checking the pages on a\nweb server for doubled words (such as 'this this'), a common problem\nwith documents subject to heavy editing."
- Output: "web server for doubled words (such as '[this] [this]'), a common problem"
Example 2
- Input: $str = "Find doubled words despite capitalization differences, such as with 'The\nthe...', as well as allow differing amounts of whitespace (spaces,\ntabs, newlines, and the like) to lie between the words."
- Output: "Find doubled words despite capitalization differences, such as with '[The]\n[the]...', as well as allow differing amounts of whitespace (spaces,"
Example 3
- Input: $str = "to make a word bold: '...it is very very important...'."
- Output: "to make a word bold: '...it is [very] [very] important...'."
Example 4
- Input: $str = "Perl officially stands for Practical Extraction and Report Language, except when it doesn't."
- Output: ""
Example 5
- Input: $str = "There's more than one one way to do it.\nEasy things should be easy and hard things should be possible."
- Output: "There's more than [one] [one] way to do it."

So it begins

The reference to Mastering Regular Expressions seems like a pretty strong hint. Back when I was doing HTML and XML processing, I always favored using a module to parse tagged text (XML::Twig was a particular favorite), but let's take the hint.

A sort of obvious thing to do here is to split the string into words, but it's going to be annoying because of the possibility of HTML tags. And once we break the string up, we'll lose the space and tags that we need to retain for the requested output.

"The string will not contain brackets" is an odd thing to say. But it turns out to be useful, because we only want to print output lines that do contain brackets, so this little clause is actually doing us a favor.

An outline

Let's assume that we can conjure up a regular expression to use with split to extract the words; let's call it $sepRE. We can then compare consecutive pairs of words to find duplication. That solves half the problem, but loses the context in the original string. We need to reproduce the original text, plus highlighting brackets, including all the stuff between words.

Because we have weird habits, we have (more than once) read through all the documentation of split, and so we know that there is a variation in which the function can return both the desired substrings and the separators. To wit, if we enclose the regular expression in parentheses, split will return all the pieces of the string, alternating between separators and substrings. The first piece we need looks like:

my @word = split( /($sepRE)/, $str);

Now the words will be in either the even or odd elements of @word, depending on whether the string started with a separator. We can check for duplicates by skipping along in pairs.

# Take every other word. Skip over a leading separator.
my ($first, $second) = ( $word[0] =~ m/$sepRE/ )
                     ? (1,3) : (0, 2);

while ( $second <= $#word)
{
    my (@w) = ( $word[$first], $word[$second]);

    if ( $w[0] eq $w[1] )
    {
        $word[$first]  =~ s/($w[0])/[$1]/;
        $word[$second] =~ s/($w[1])/[$1]/;
    }
    $first += 2; $second += 2;
}

That marks up the duplicate words, but only in the @word array. We need to face the problem of getting the desired output. To get there, put the @word array back into a single string, and then grep for the lines that contain our newly-inserted brackets.

# Reassemble the string, now with some words possibly bracketed
my $highlighted = join("", @word);

# Output includes only lines where we inserted brackets
return join "\n",
            grep /[\[\]]/, 
                 split(/\n/, $highlighted);

A few minor details

Case insensitivity. In our check for duplicates, we need to ignore case differences, so let's compare the lowercase versions of the words: if ( lc($w[0]) eq lc($w[1]) )
Punctuation. As shown in Example 2, punctuation surrounding the words should be ignored. Let's whip up a quick function to trim leading and trailing punctuation (the [:punct:] character class is handy for this), and use it before we compare

sub trimPunct($word)
{
    $word =~ s/^[[:punct:]]+//;
    $word =~ s/[[:punct:]]+$//;
    return $word
}
. . .
my (@w) = ( trimPunct($word[$first]),
            trimPunct($word[$second]));

About that regular expression

Time to take care of that regular expression that acts as the word separator. With some experimentation (the regex101 web site is good for that), we arrived at this:

 state $sepRE = 
qr/ \s*     # Possible white space followed by
  (?:       # ( a group of a tag and more white space
   <[^>]*>  #  ... an HTML tag
   \s*      #  ... optional space
  )+        # ) there can be multiple tags
  | \s+     # Or just white space
/x;

Not too bad. Didn't even require look-around assertions. Some notes:

state $sepRE = qr/.../ -- This is a constant regular expression, so we can use state to only evaluate it once.
qr//x -- The x flag allows us to add formatting and comments to the expression
(?:...) -- We want to group, but we don't need the overhead of capturing, so (?:) is a non-capturing group.
complicated | \s+ -- Why is the complicated part first? Because the alternatives are evaluated left to right. If \s+ eats up the white space, the expression matches and wouldn't proceed to trying tags.

Okay, here's the whole thing.

sub task($str)
{
    state $sepRE = qr/ \s*     # Possible white space followed by
                     (?:       # ( a group of a tag and more white space
                      <[^>]*>  #  ... an HTML tag
                      \s*      #  ... optional space
                     )+        # ) There can be multiple tags
                     | \s+     # Or just white space
                   /x;

    # Using () in RE includes separators in the array
    my @word = split( /($sepRE)/, $str);

    # Take every other word. Skip over a leading separator.
    my ($first, $second) = ( $word[0] =~ m/$sepRE/ ) ? (1,3) : (0, 2);

    while ( $second <= $#word)
    {
        my (@w) = ( trimPunct($word[$first]), trimPunct($word[$second]));

        if ( lc($w[0]) eq lc($w[1]) )
        {
            $word[$first]  =~ s/($w[0])/[$1]/;
            $word[$second] =~ s/($w[1])/[$1]/;
        }
        $first += 2; $second += 2;
    }

    # Reassemble the string, now with some words possibly bracketed
    my $highlighted = join("", @word);

    # Output includes only lines where we inserted brackets
    return join"\n", grep /[\[\]]/, split(/\n/, $highlighted);
}