Matt Gorle

Posted on Feb 17

I built `wc` in PHP

#codingchallenge #php

I've been aware of John Crickett's Coding Challenges for a little while now and decided this week that I would have a go and see how I got on.

The first challenge is to rebuild wc, the classic Unix word counting tool. I've been using Linux since 1997 and, to be honest, wc is not something that I've needed to use very often - I certainly haven't studied it in any detail. I mean, it's basically a word counter. How hard could it possibly be?

Well, seeing as this is just a simple challenge, I should be able to write it with just a text editor. I'll use vim, over ssh, using my tablet as a terminal, without a keyboard. PHP is my daily language, so I'll write it in that. Should be easy, right?

I got to a complete(ish), working(ish), procedural solution without using a physical keyboard and with vim in "no frills" mode. It was a terrible idea, full of compromises, and resulted in awful code - which I've preserved in the repo for posterity.

You can see my horrible first attempt in ccwc.php.

After adding a bluetooth keyboard and turning vim into more of an IDE by adding a few plugins, I was ready to have another go. This time, my implementation was something to be pleased with:

OOP
mostly TDD
handles arbitrarily large files
looks and behaves (mostly) like coreutils wc
decent(ish) performance

Here are my main takeaways after spending some time on this.

The devil is in the detail

How hard could it possibly be?
-- Me, before starting out

wc turned out to have a few surprises for me. Some easy to solve, others not so much...

1. Multibyte support

It's 2025. I should have known that a mature unix tool will have multibyte support, but for some reason, I was surprised by it. Still, we're in PHP, so mb_strlen should cover that. No trouble.
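
To make the byte/character distinction concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the input is valid UTF-8; the sample string is made up for illustration.

```php
<?php
declare(strict_types=1);

// Sample text with multibyte characters, written with escapes so the
// byte layout is unambiguous: "naïve café ☕".
$text = "na\u{EF}ve caf\u{E9} \u{2615}";

// strlen() counts bytes, which corresponds to `wc -c`.
$bytes = strlen($text);

// mb_strlen() counts characters (code points), which corresponds to
// `wc -m` in a UTF-8 locale. Assumption: the input is valid UTF-8.
$chars = mb_strlen($text, 'UTF-8');

echo "bytes: {$bytes}, characters: {$chars}\n";
```

The two accented letters take two bytes each and the cup symbol takes three, so the byte count comes out higher than the character count.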

2. Very large file support

I don't know what kind of filesize limit wc has, but it's likely to be effectively unlimited. That means that input (plus processing space) could be larger than the system RAM, so I need to handle the files in chunks. file_get_contents just isn't going to cut it.

There's an additional consideration - what happens if a word crosses a chunk boundary?
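
One way to handle both concerns is to read fixed-size chunks through a generator and carry a single flag from chunk to chunk that records whether the previous chunk ended inside a word. The sketch below is an illustration under assumptions, not the code from the repo: `readChunks`, `countWordsInChunks`, the `WHITESPACE` constant and the 64 KiB default chunk size are all hypothetical, and "whitespace" is limited to the six ASCII whitespace bytes.

```php
<?php
declare(strict_types=1);

// Assumption: only these ASCII bytes separate words.
const WHITESPACE = " \t\n\r\v\f";

/**
 * Yield a stream's contents in fixed-size chunks so memory use stays bounded.
 *
 * @param resource $handle
 */
function readChunks($handle, int $chunkSize = 65536): Generator
{
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        yield $chunk;
    }
}

/**
 * Count words across chunks without double-counting a word that
 * straddles a chunk boundary.
 */
function countWordsInChunks(iterable $chunks): int
{
    $words = 0;
    $inWord = false; // did the previous chunk end in the middle of a word?

    foreach ($chunks as $chunk) {
        if ($chunk === '') {
            continue;
        }

        // Every run of non-whitespace bytes in this chunk is a candidate word.
        $words += (int) preg_match_all('/[^ \t\n\r\x0B\f]+/', $chunk);

        // If the previous chunk ended mid-word and this one starts mid-word,
        // the first run here continues a word that was already counted.
        if ($inWord && !str_contains(WHITESPACE, $chunk[0])) {
            $words--;
        }

        $inWord = !str_contains(WHITESPACE, $chunk[strlen($chunk) - 1]);
    }

    return $words;
}
```

Because the only state carried between chunks is one boolean, the chunk size affects speed but not the result.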

3. Input via STDIN

It needs to work with pipes and redirection. Well, everything's a stream I suppose...
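
Since PHP exposes standard input as the `php://stdin` stream, the same counting code can serve files and pipes alike. A minimal sketch, assuming a hypothetical `openInput` helper that falls back to STDIN when no filename (or `-`) is given, plus a hypothetical `countLines` that works on any readable stream:

```php
<?php
declare(strict_types=1);

/**
 * Open the named file, or standard input when no file (or "-") is given.
 * Hypothetical helper for illustration.
 *
 * @return resource
 */
function openInput(?string $path)
{
    $target = ($path === null || $path === '-') ? 'php://stdin' : $path;

    $handle = @fopen($target, 'rb');
    if ($handle === false) {
        throw new RuntimeException("cannot open {$target}");
    }

    return $handle;
}

/**
 * Count newline characters on any readable stream, 64 KiB at a time.
 *
 * @param resource $handle
 */
function countLines($handle): int
{
    $lines = 0;

    while (!feof($handle)) {
        $chunk = fread($handle, 65536);
        if ($chunk === false) {
            break;
        }
        $lines += substr_count($chunk, "\n");
    }

    return $lines;
}
```

With a small script that calls `countLines(openInput($argv[1] ?? null))`, both `php script.php file.txt` and `cat file.txt | php script.php` would go through the same code path.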

Dev environment and tooling are essential

I've been a PHPStormer for many years, but for this challenge I decided to go back to basics and use a text editor without any extra features. Not using the IDE was a good reminder of the automation it offers and how much I have come to depend on it. In particular, I missed intellisense and the refactoring tools.

My cognitive load increased significantly without any kind of intellisense, making programming feel awkward and clumsy. Suddenly I had to remember parameter order for PHP builtins, as well as the structure of my own code in other parts of the codebase. It slowed me down and not in a good way.

I found that the lack of refactoring tools discouraged the iterative nature of programming. Making structural changes to the codebase became tedious and I found that, rather than correcting them, I chose to live with my poor initial decisions.

Use an IDE. Or give your editor enough intelligence that it can behave like an IDE. For me, intellisense and some kind of code navigation/reference finder are the bare minimum.

Automated testing is king

With black box-style tests covering the behaviour of the application's components, it becomes possible to refactor with complete impunity. There's no worry of breaking anything unnoticed, because the tests show when it's working and when it isn't.

On my second go at the challenge, I decided to go for as much testing as I could - ideally TDD. This paid off so many times during development, allowing me to go faster and with more bravery, when things became more difficult.

Just a few examples of what having good test coverage allowed me to do without fear:

refine the Counter class's behaviour 3 times (see below)
completely switch out the CLI option parsing
experiment with using a generator for paged file loading

The Counter Class

The logic in this class went from this:

if ($countMode === CountMode::CHARACTER) {
    return strlen($contents);
}

if ($countMode === CountMode::MB_CHARACTER) {
    return mb_strlen($contents);
}

if ($countMode === CountMode::LINE) {
    return count(preg_split('(\r\n|\r|\n)', $contents)) - 1;
}

if ($countMode === CountMode::WORD) {
    return count(preg_split('/[^\s]+/', $contents)) - 1;
}

to this:

return match($countMode) {
    CountMode::CHARACTER => strlen($contents),
    CountMode::MB_CHARACTER => mb_strlen($contents),
    CountMode::LINE => count(preg_split('(\r\n|\r|\n)', $contents)) - 1,
    CountMode::WORD => static::countWords($contents)
};
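
The snippets above are fragments, so here is a self-contained sketch of how the `match` version might sit inside a class. The enum definition, the `count` method name and the body of `countWords` are assumptions, since the post doesn't show them. The line count also swaps in `substr_count` for the regex, which counts only `\n` (the character coreutils `wc -l` counts), and it needs PHP 8.1+ for enums.

```php
<?php
declare(strict_types=1);

// Hypothetical enum: the case names come from the post, the definition does not.
enum CountMode
{
    case CHARACTER;
    case MB_CHARACTER;
    case LINE;
    case WORD;
}

class Counter
{
    public static function count(string $contents, CountMode $countMode): int
    {
        // match is exhaustive: an unhandled case throws UnhandledMatchError
        // instead of silently falling through.
        return match ($countMode) {
            CountMode::CHARACTER => strlen($contents),
            CountMode::MB_CHARACTER => mb_strlen($contents, 'UTF-8'),
            CountMode::LINE => substr_count($contents, "\n"),
            CountMode::WORD => static::countWords($contents),
        };
    }

    // Assumed implementation: one word per run of non-whitespace characters.
    protected static function countWords(string $contents): int
    {
        return (int) preg_match_all('/\S+/', $contents);
    }
}

echo Counter::count("one two\nthree\n", CountMode::WORD), "\n";
```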

Most of an application is "other" stuff

You'd think that wc is about counting things. I certainly thought that. But actually, the counting-related part of the codebase is about 10% in terms of lines of code.

In my solution, the largest segments of the codebase, by lines of code, are:

30% for parsing CLI arguments
25% for display output

You can always learn things, even from an "easy" challenge

A few other things that I learned, reinforced, or remembered, along the way:

generators are powerful and easy to implement
page size is important for performance when loading files
enums are great
arrow functions capture parent-scope variables by value, so they can't modify scalar variables in their parent's scope
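
The arrow function point is easy to trip over, so here is a small sketch of it. The variable names and sample data are made up for illustration.

```php
<?php
declare(strict_types=1);

$lines = ['one two', 'three', 'four five six'];

// Arrow function: $total is captured by value, so the += below only
// changes a copy that is local to each call.
$total = 0;
array_map(fn (string $line) => $total += str_word_count($line), $lines);
$arrowTotal = $total; // the outer $total was never modified

// Traditional closure with a by-reference `use`: the outer $total is updated.
$total = 0;
array_walk($lines, function (string $line) use (&$total): void {
    $total += str_word_count($line);
});
$closureTotal = $total;

echo "arrow: {$arrowTotal}, closure: {$closureTotal}\n";
```

In practice, returning a value from the arrow function and combining the results, for example with `array_sum(array_map(...))`, avoids the need to mutate outer state at all.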

All in all, I had a great time doing this challenge and I'm really looking forward to tackling the next one.

My solution, if you're interested, is available on GitHub.

DEV Community