DEV Community

Bob Lied
Bob Lied

Posted on

PWC 296 String Compression

Weekly Challenge 296, Task 1, asks us to implement a simple form of string compression. Run-length encoding as given here is easy to understand and implement, but it does have the short-coming that you can't encode text with numbers in it.

You are given a string of alphabetic characters, $chars.
Write a script to compress the string with run-length
encoding, as shown in the examples.

A compressed unit can be either a single character or a count followed by a character.

BONUS: Write a decompression function.
Enter fullscreen mode Exit fullscreen mode
  • Example 1: Input: $chars = "abbc" Output: "a2bc"
  • Example 2: Input: $chars = "aaabccc" Output: "3ab3c"
  • Example 3: Input: $chars = "abcc" Output: "ab2c"

Thinking in C

I started my career as a C programmer, so when I see string problems, my first instinct is to iterate over characters. This problem seems to lend itself to that. We'll process left to right; if characters repeat we'll count and build up an output string.

sub rle_enc($str)
{
    my @s = split(//, $str);

    my $out = "";
    while ( defined(my $c = shift @s) )
    {
        my $n = 1;
        while ( @s && $s[0] eq $c ) { $n++; shift @s; }
        $out .= ( $n > 1 ? $n : "" ) . $c;
    }

    return $out;
}
Enter fullscreen mode Exit fullscreen mode

Probably the only odd thing worth noting here is that I consume the input string by converting it to an array and using shift instead of using a for loop.

Perl easily moves between numeric and string types, so the expression ($n > 1 ? $n : "") handles the problem of omitting 1s easily, where a C compiler or other strongly-typed language would tell you that you can't mix types like that.

The bonus decoding is pretty easy, too. Whenever we see a number, remove it and replicate the following character.

sub rle_dec($str)
{
    my $out;
    while ( length($str) > 0 )
    {
        my $n = ( ( $str =~ s/^(\d+)// ) ? $1 : 1 );
        $out .= (substr($str, 0, 1, '') x $n);
    }
    return $out;
}
Enter fullscreen mode Exit fullscreen mode

Useful Perl features in this code:

  • s/^(\d+)// -- Capturing the matching digit string while deleting it
  • ... x $n -- The replication operator creates $n copies

Thinking in Perl

Another way of approaching this problem is that we're substituting one pattern for another. Anyplace we find a repeated character, we'll replace it with the number of occurrences and one instance of that character.

We've seen the repeated-character pattern several times in the last few weeks of challenges. It's any character; capture it, and find that character again immediately following -- /(.)\1*/. We'll also want to capture the entire group of matching characters, so add a set of parentheses around that, which makes the single character the second capture group ($2) instead of $1 -- /((.)\2*/.

So that's the pattern we want to match, now for the replacement. We need the length of the repeated string. That's the $1 capture group, so the straight-forward way is to build up an output string from repeated matches:

    my $out = "";
    while ( $str =~ s/^((.)\2*)//g )
    {
        my $n = length($1);
        $out .= ( $n > 1 ? $n : "" ) . $2;
    }
    return $out;
Enter fullscreen mode Exit fullscreen mode

What I'd like to do is to avoid the loop and just make a substitution right in the s/// operation. We'll need to figure out the length of the $1 string match. Does Perl help us? Yes, it does.

As a side effect of regular-expression captures, Perl creates the special array variables @- and @+. These represent the offsets in the string where a capture group begins and ends. The $1 group starts at $-[1] and ends just before $+[1], so the length of $1 is $+[1]-$-[1]. If this sort of punctuation overload makes you twitchy, you can make it more readable by adding use English; and substituting Cobol-level verbosity: $LAST_MATCH_END[1]-$LAST_MATCH_START[1].

So what we want to do during the substitution is evaluate a little bit of code:

my $n = $LAST_MATCH_END[1] - $LAST_MATCH_START[1];
( $n > 1 : $n : "" ) . $2
Enter fullscreen mode Exit fullscreen mode

We can do that in Perl. Adding the /e flag to the substitution operator lets us put code in the replacement.
s/.../my $n=$+[1]-$-[1];($n>1?$n:"").$2/e

Two more flags will reduce the problem to a one-liner. We want to do this globally, so we need /g, of course. And we want the result of the substitution as an output value. Normally the s/// operator returns the number of substitutions; to yield the modified string use the /r flag.

The final, dense code looks like:

sub rleRE($str)
{
    return ( $str =~ s/((.)\2*)/my $n=$+[1]-$-[1];($n>1?$n:"").$2/ger );
}
Enter fullscreen mode Exit fullscreen mode

What about the decoding bonus? Even easier with regular expressions. We can exploit the /e flag again. Everywhere that we find the pattern of a number followed by a character, replace it with the replicated character.

sub rle_dec_RE($str)
{
    return $str =~ s/(\d+)(.)/$2x$1/ger;
}
Enter fullscreen mode Exit fullscreen mode

The breakdown:

  • /(\d+)(.)/ -- Capture an integer ($1) and the character following it ($2).
  • /$2x$1/ -- This is code, not a string. The x is the Perl replication operator.
  • s///ger -- As before, we want to do this globally (g), using expression evaluation in the replacement (e), and returning the modified string (r).

But does it blend?

One final note: performance. I did a simple benchmark (included in the code on GitHub), which suggests that the regular expression encoding is about 25% slower than the character-at-a-time version. Pattern matching is concisely expressive, but there is a bit of machinery under the hood that isn't free. Of course, if performance really mattered, I'd write this in C. And if compression really mattered, I'd choose a different algorithm altogether.

Do your career a big favor. Join DEV. (The website you're on right now)

It takes one minute, it's free, and is worth it for your career.

Get started

Community matters

Top comments (0)

Heroku

Build apps, not infrastructure.

Dealing with servers, hardware, and infrastructure can take up your valuable time. Discover the benefits of Heroku, the PaaS of choice for developers since 2007.

Visit Site

👋 Kindness is contagious

Engage with a sea of insights in this enlightening article, highly esteemed within the encouraging DEV Community. Programmers of every skill level are invited to participate and enrich our shared knowledge.

A simple "thank you" can uplift someone's spirits. Express your appreciation in the comments section!

On DEV, sharing knowledge smooths our journey and strengthens our community bonds. Found this useful? A brief thank you to the author can mean a lot.

Okay