I've been aware of John Crickett's Coding Challenges for a little while now and decided this week that I would have a go and see how I got on.
The first challenge is to rebuild wc, the classic Unix word counting tool. I've been using Linux since 1997 and, to be honest, wc is not something that I've needed to use particularly much - I certainly haven't studied it in any detail. I mean, it's basically a word counter. How hard could it possibly be?
Well, seeing as this is just a simple challenge, I should be able to write it with just a text editor. I'll use vim, over ssh, using my tablet as a terminal, without a keyboard. PHP is my daily language, so I'll write it in that. Should be easy, right?
I got to a complete(ish), working(ish), procedural solution without using a physical keyboard and with vim in "no frills" mode. It was a terrible idea, full of compromises, and resulted in awful code - which I've preserved in the repo for posterity.
You can see my horrible first attempt in ccwc.php.
After adding a bluetooth keyboard and turning vim into more of an IDE by adding a few plugins, I was ready to have another go. This time, my implementation was something to be pleased with:
- OOP,
- mostly TDD,
- handles arbitrarily large files,
- looks and behaves (mostly) like coreutils wc,
- decent(ish) performance.
Here are my main takeaways after spending some time on this.
The devil is in the detail
How hard could it possibly be?
-- Me, before starting out
wc turned out to have a few surprises for me. Some were easy to solve, others not so much...
1. Multibyte support
It's 2025. I should have known that a mature Unix tool would have multibyte support, but for some reason, I was surprised by it. Still, we're in PHP, so mb_strlen should cover that. No trouble.
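To make the difference concrete, here is a minimal sketch comparing byte and character counts. It assumes the mbstring extension is loaded and that the input is UTF-8; the sample string is made up for illustration.

```php
<?php
declare(strict_types=1);

// "naïve café", written with escapes so the source file's encoding doesn't matter.
// ï and é each take two bytes in UTF-8.
$text = "na\u{EF}ve caf\u{E9}";

$bytes = strlen($text);              // counts bytes
$chars = mb_strlen($text, 'UTF-8');  // counts characters (code points)

echo "bytes={$bytes} chars={$chars}", PHP_EOL;
```

The two numbers only agree when the input is pure ASCII, which is why a byte count and a character count have to be treated as separate modes.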
2. Very large file support
I don't know what kind of file size limit wc has, but it's likely to be effectively unlimited. That means that input (plus processing space) could be larger than the system RAM, so I need to handle the files in chunks. file_get_contents just isn't going to cut it.
There's an additional consideration - what happens if a word crosses a chunk boundary?
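One way to handle the boundary problem is to remember whether the previous chunk ended in the middle of a word. The sketch below is not the code from the repo: readChunks() and countWordsInStream() are hypothetical names, and the 64 KiB default chunk size is an arbitrary assumption.

```php
<?php
declare(strict_types=1);

/**
 * Yield a stream's contents in fixed-size chunks (hypothetical helper).
 *
 * @param resource $stream
 */
function readChunks($stream, int $chunkSize = 65536): Generator
{
    while (!feof($stream)) {
        $chunk = fread($stream, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error or end of input
        }
        yield $chunk;
    }
}

/**
 * Count whitespace-delimited words without loading the whole stream.
 *
 * @param resource $stream
 */
function countWordsInStream($stream, int $chunkSize = 65536): int
{
    $words = 0;
    $inWord = false; // true when the previous chunk ended mid-word

    foreach (readChunks($stream, $chunkSize) as $chunk) {
        // Count runs of non-whitespace bytes in this chunk.
        $words += (int) preg_match_all('/\S+/', $chunk);

        // If the last chunk ended mid-word and this one starts with a
        // non-whitespace byte, the first run continues that word and
        // has already been counted once.
        if ($inWord && preg_match('/\A\S/', $chunk) === 1) {
            $words--;
        }

        // \z anchors at the very end, even if the chunk ends in "\n".
        $inWord = preg_match('/\S\z/', $chunk) === 1;
    }

    return $words;
}

// Demo with a deliberately tiny chunk size so words straddle boundaries.
$demo = fopen('php://memory', 'w+');
fwrite($demo, "hello wide world\nfoo  bar\n");
rewind($demo);
$demoCount = countWordsInStream($demo, 4);
fclose($demo);
echo $demoCount, PHP_EOL;
```

Only one boolean of state is carried between chunks, so memory use stays flat regardless of file size. Note that this sketch works on bytes: a multibyte character split across two chunks doesn't affect the word count, but it would need separate handling when counting characters.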
3. Input via STDIN
It needs to work with pipes and redirection. Well, everything's a stream I suppose...
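Because PHP exposes standard input as the php://stdin stream, the same counting code can serve both files and pipes. A minimal sketch, assuming a hypothetical openInput() helper and the common convention that a missing file name or "-" means standard input:

```php
<?php
declare(strict_types=1);

/**
 * Open the named file, or standard input when no file (or "-") is given.
 * openInput() is a hypothetical helper, not part of the original solution.
 *
 * @return resource
 */
function openInput(?string $path)
{
    $target = ($path === null || $path === '-') ? 'php://stdin' : $path;
    $stream = @fopen($target, 'rb');
    if ($stream === false) {
        throw new RuntimeException("Cannot open {$target}");
    }
    return $stream;
}

/**
 * Count newline characters in any readable stream, 8 KiB at a time.
 *
 * @param resource $stream
 */
function countNewlines($stream): int
{
    $count = 0;
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $count += substr_count($chunk, "\n");
    }
    return $count;
}

// Typical call site in a CLI script (left as a comment so that the
// sketch doesn't block waiting on a terminal):
//   echo countNewlines(openInput($argv[1] ?? null)), PHP_EOL;
```

The counting function never needs to know where the bytes come from; only the one-line choice of stream differs between `script.php file.txt` and `cat file.txt | script.php`.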
Dev environment and tooling are essential
I've been a PHPStormer for many years, but for this challenge I decided to go back to basics and use a text editor without any extra features. Not using the IDE was a good reminder of the automation it offers and how much I have come to depend on it. In particular, I missed intellisense and the refactoring tools.
My cognitive load increased significantly without any kind of intellisense, making programming feel awkward and clumsy. Suddenly I had to remember parameter order for PHP builtins, as well as the structure of my own code in other parts of the codebase. It slowed me down and not in a good way.
I found that the lack of refactoring tools discouraged the iterative nature of programming. Making structural changes to the codebase became tedious and I found that, rather than correcting them, I chose to live with my poor initial decisions.
Use an IDE. Or give your editor enough intelligence that it can behave like an IDE. For me, intellisense and some kind of code navigation/reference finder are the bare minimum.
Automated testing is king
With black box-style tests covering the behaviour of the application's components, it becomes possible to refactor with complete impunity. There's no worry of breaking anything because you can prove if it's working and when it isn't.
On my second go at the challenge, I decided to go for as much testing as I could - ideally TDD. This paid off so many times during development, allowing me to go faster and with more bravery, when things became more difficult.
Just a few examples of what having good test coverage allowed me to do without fear:
- refine the Counter class's behaviour 3 times (see below)
- completely switch out the CLI option parsing
- experiment with using a generator for paged file loading
The Counter Class
The logic in this class went from this:
if ($countMode === CountMode::CHARACTER) {
    return strlen($contents);
}
if ($countMode === CountMode::MB_CHARACTER) {
    return mb_strlen($contents);
}
if ($countMode === CountMode::LINE) {
    return count(preg_split('(\r\n|\r|\n)', $contents)) - 1;
}
if ($countMode === CountMode::WORD) {
    return count(preg_split('/[^\s]+/', $contents)) - 1;
}
to this:
return match($countMode) {
    CountMode::CHARACTER => strlen($contents),
    CountMode::MB_CHARACTER => mb_strlen($contents),
    CountMode::LINE => count(preg_split('(\r\n|\r|\n)', $contents)) - 1,
    CountMode::WORD => static::countWords($contents)
};
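For anyone who wants to run the match-based version, here is a self-contained sketch (PHP 8.1+, mbstring assumed). The enum declaration, the count() method name and the body of countWords() are assumptions made for illustration, since the excerpt above doesn't show them. The line count is also simplified to count only "\n" rather than using the regex above.

```php
<?php
declare(strict_types=1);

// Hypothetical enum mirroring the CountMode cases used in the excerpt.
enum CountMode
{
    case CHARACTER;
    case MB_CHARACTER;
    case LINE;
    case WORD;
}

final class Counter
{
    public static function count(string $contents, CountMode $countMode): int
    {
        return match ($countMode) {
            CountMode::CHARACTER => strlen($contents),
            CountMode::MB_CHARACTER => mb_strlen($contents, 'UTF-8'),
            // Simplification: count "\n" only.
            CountMode::LINE => substr_count($contents, "\n"),
            CountMode::WORD => self::countWords($contents),
        };
    }

    // Assumed implementation: count runs of non-whitespace characters.
    private static function countWords(string $contents): int
    {
        return (int) preg_match_all('/\S+/', $contents);
    }
}

echo Counter::count("one two\nthree\n", CountMode::WORD), PHP_EOL;
```

A nice side effect of match on an enum is that an unhandled case throws an UnhandledMatchError, so adding a new CountMode without updating the counter fails loudly instead of silently returning nothing.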
Most of an application is "other" stuff
You'd think that wc is about counting things. I certainly thought that. But actually, the counting-related part of the codebase is about 10% in terms of lines of code.
In my solution, the largest segments of the codebase, by lines of code, are:
- 30% for parsing CLI arguments
- 25% for display output
You can always learn things, even from an "easy" challenge
A few other things that I learned, reinforced, or remembered, along the way:
- generators are powerful and easy to implement
- page size is important for performance when loading files
- enums are great
- arrow functions can't modify scalar variables in their parent's scope
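The last point is easy to trip over, so here is a minimal sketch of it. Arrow functions capture parent-scope variables by value, whereas a classic closure can opt in to by-reference capture with use (&$var). The variable names are made up for the example.

```php
<?php
declare(strict_types=1);

$total = 0;

// The arrow function works on its own copy of $total,
// so the outer variable is left untouched.
$addWithArrow = fn (int $n) => $total += $n;
$addWithArrow(5);
$afterArrow = $total;

// A classic closure with by-reference capture does modify the outer variable.
$addWithClosure = function (int $n) use (&$total): void {
    $total += $n;
};
$addWithClosure(5);
$afterClosure = $total;

echo "after arrow: {$afterArrow}, after closure: {$afterClosure}", PHP_EOL;
```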
All in all, I had a great time doing this challenge and I'm really looking forward to tackling the next one.
My solution, if you're interested, is available on GitHub.