Bob Lied

Posted on Nov 21, 2024

PWC 296 String Compression

#perl #pwc #perlweeklychallenge

Weekly Challenge 296, Task 1, asks us to implement a simple form of string compression. Run-length encoding as given here is easy to understand and implement, but it does have the short-coming that you can't encode text with numbers in it.

You are given a string of alphabetic characters, $chars.
Write a script to compress the string with run-length
encoding, as shown in the examples.

A compressed unit can be either a single character or a count followed by a character.

BONUS: Write a decompression function.

Example 1: Input: $chars = "abbc" Output: "a2bc"
Example 2: Input: $chars = "aaabccc" Output: "3ab3c"
Example 3: Input: $chars = "abcc" Output: "ab2c"

Thinking in C

I started my career as a C programmer, so when I see string problems, my first instinct is to iterate over characters. This problem seems to lend itself to that. We'll process left to right; if characters repeat we'll count and build up an output string.

sub rle_enc($str)
{
    my @s = split(//, $str);

    my $out = "";
    while ( defined(my $c = shift @s) )
    {
        my $n = 1;
        while ( @s && $s[0] eq $c ) { $n++; shift @s; }
        $out .= ( $n > 1 ? $n : "" ) . $c;
    }

    return $out;
}

Probably the only odd thing worth noting here is that I consume the input string by converting it to an array and using shift instead of using a for loop.

Perl easily moves between numeric and string types, so the expression ($n > 1 ? $n : "") handles the problem of omitting 1s easily, where a C compiler or other strongly-typed language would tell you that you can't mix types like that.

The bonus decoding is pretty easy, too. Whenever we see a number, remove it and replicate the following character.

sub rle_dec($str)
{
    my $out;
    while ( length($str) > 0 )
    {
        my $n = ( ( $str =~ s/^(\d+)// ) ? $1 : 1 );
        $out .= (substr($str, 0, 1, '') x $n);
    }
    return $out;
}

Useful Perl features in this code:

s/^(\d+)// -- Capturing the matching digit string while deleting it
... x $n -- The replication operator creates $n copies

Thinking in Perl

Another way of approaching this problem is that we're substituting one pattern for another. Anyplace we find a repeated character, we'll replace it with the number of occurrences and one instance of that character.

We've seen the repeated-character pattern several times in the last few weeks of challenges. It's any character; capture it, and find that character again immediately following -- /(.)\1*/. We'll also want to capture the entire group of matching characters, so add a set of parentheses around that, which makes the single character the second capture group ($2) instead of $1 -- /((.)\2*/.

So that's the pattern we want to match, now for the replacement. We need the length of the repeated string. That's the $1 capture group, so the straight-forward way is to build up an output string from repeated matches:

    my $out = "";
    while ( $str =~ s/^((.)\2*)//g )
    {
        my $n = length($1);
        $out .= ( $n > 1 ? $n : "" ) . $2;
    }
    return $out;

What I'd like to do is to avoid the loop and just make a substitution right in the s/// operation. We'll need to figure out the length of the $1 string match. Does Perl help us? Yes, it does.

As a side effect of regular-expression captures, Perl creates the special array variables @- and @+. These represent the offsets in the string where a capture group begins and ends. The $1 group starts at $-[1] and ends just before $+[1], so the length of $1 is $+[1]-$-[1]. If this sort of punctuation overload makes you twitchy, you can make it more readable by adding use English; and substituting Cobol-level verbosity: $LAST_MATCH_END[1]-$LAST_MATCH_START[1].

So what we want to do during the substitution is evaluate a little bit of code:

my $n = $LAST_MATCH_END[1] - $LAST_MATCH_START[1];
( $n > 1 : $n : "" ) . $2

We can do that in Perl. Adding the /e flag to the substitution operator lets us put code in the replacement.
s/.../my $n=$+[1]-$-[1];($n>1?$n:"").$2/e

Two more flags will reduce the problem to a one-liner. We want to do this globally, so we need /g, of course. And we want the result of the substitution as an output value. Normally the s/// operator returns the number of substitutions; to yield the modified string use the /r flag.

The final, dense code looks like:

sub rleRE($str)
{
    return ( $str =~ s/((.)\2*)/my $n=$+[1]-$-[1];($n>1?$n:"").$2/ger );
}

What about the decoding bonus? Even easier with regular expressions. We can exploit the /e flag again. Everywhere that we find the pattern of a number followed by a character, replace it with the replicated character.

sub rle_dec_RE($str)
{
    return $str =~ s/(\d+)(.)/$2x$1/ger;
}

The breakdown:

/(\d+)(.)/ -- Capture an integer ($1) and the character following it ($2).
/$2x$1/ -- This is code, not a string. The x is the Perl replication operator.
s///ger -- As before, we want to do this globally (g), using expression evaluation in the replacement (e), and returning the modified string (r).

But does it blend?

One final note: performance. I did a simple benchmark (included in the code on GitHub), which suggests that the regular expression encoding is about 25% slower than the character-at-a-time version. Pattern matching is concisely expressive, but there is a bit of machinery under the hood that isn't free. Of course, if performance really mattered, I'd write this in C. And if compression really mattered, I'd choose a different algorithm altogether.

Do your career a big favor. Join DEV. (The website you're on right now)

It takes one minute, it's free, and is worth it for your career.

Get started

Community matters

DEV Community

PWC 296 String Compression

Thinking in C

Thinking in Perl

But does it blend?

Top comments (0)

Build apps, not infrastructure.

Read next

📰 Cloudsmith Adds Portals to Streamline Delivery of Software Artifacts

🚀 Terminal Tricks: Essential Tips and Tricks for Developers

🚀 React Patterns: Essential Tips and Tricks for Developers

Unlocking Success: Innovative Strategies to Find Employees for Free in 2025

Okay